The Dawn of Open Access to Phylogenetic Data

Andrew F. Magee, Michael R. May, Brian R. Moore
(Submitted on 23 May 2014)

The scientific enterprise depends critically on the preservation of and open access to published data. This basic tenet applies acutely to phylogenies (estimates of evolutionary relationships among species). Phylogenies are increasingly estimated from large, genome-scale datasets using complex statistical methods that require growing levels of expertise and computational investment. Moreover, the resulting phylogenetic data provide an explicit historical perspective that critically informs research in a vast and growing number of scientific disciplines. One such use is the study of changes in rates of lineage diversification (speciation minus extinction) through time. As part of a meta-analysis in this area, we sought to collect phylogenetic data (comprising nucleotide sequence alignment and tree files) from 217 studies published in 46 journals over a 13-year period. We document our attempts to procure those data (from online archives and by direct request to corresponding authors), and report results of analyses (using Bayesian logistic regression) to assess the impact of various factors on the success of our efforts. Overall, complete phylogenetic data for ~60% of these studies are effectively lost to science. Our study indicates that phylogenetic data are more likely to be deposited in online archives and/or shared upon request when: (1) the publishing journal has a strong data-sharing policy; (2) the publishing journal has a higher impact factor; and (3) the data are requested from faculty rather than students. Although the situation appears dire, our analyses suggest that it is far from hopeless: recent initiatives by the scientific community, including policy changes by journals and funding agencies, are improving the state of affairs.

Genomic variation in a widespread Neotropical bird (Xenops minutus) reveals divergence, population expansion, and gene flow

Michael G. Harvey, Robb T. Brumfield
(Submitted on 26 May 2014)

Elucidating the demographic and phylogeographic histories of species provides insight into the processes responsible for generating biological diversity, and genomic datasets are now permitting the estimation of histories and demographic parameters with unprecedented accuracy. We used a genomic single nucleotide polymorphism (SNP) dataset generated using a RAD-Seq method to investigate the historical demography and phylogeography of a widespread lowland Neotropical bird (Xenops minutus). As expected, we found that prominent landscape features that act as dispersal barriers, such as Amazonian rivers and the Andes Mountains, are associated with the deepest phylogeographic breaks, and also that isolation by distance is limited in areas between these barriers. In addition, we inferred positive population growth for most populations and detected evidence of historical gene flow between populations that are now physically isolated. Even with genomic estimates of historical demographic parameters, we found the prominent diversification hypotheses to be untestable. We conclude that investigations into the multifarious processes shaping species histories, aided by genomic datasets, will provide greater resolution of diversification in the Neotropics, but that future efforts should focus on understanding the processes shaping the histories of lineages rather than trying to reconcile these histories with landscape and climatic events in Earth history.

Human genomic regions with exceptionally high or low levels of population differentiation identified from 911 whole-genome sequences

Vincenza Colonna, Qasim Ayub, Yuan Chen, Luca Pagani, Pierre Luisi, Marc Pybus, Erik Garrison, Yali Xue, Chris Tyler-Smith

Background: Population differentiation has proved to be effective for identifying loci under geographically-localized positive selection, and has the potential to identify loci subject to balancing selection. We have previously investigated the pattern of genetic differentiation among human populations at 36.8 million genomic variants to identify sites in the genome showing high frequency differences. Here, we extend this dataset to include additional variants, survey sites with low levels of differentiation, and evaluate the extent to which highly differentiated sites are likely to result from selective or other processes. Results: We demonstrate that while sites of low differentiation represent sampling effects rather than balancing selection, sites showing extremely high population differentiation are enriched for positive selection events and that one half may be the result of classic selective sweeps. Among these, we rediscover known examples, where we actually identify the established functional SNP, and discover novel examples including the genes ABCA12, CALD1 and ZNF804, which we speculate may be linked to adaptations in skin, calcium metabolism and defense, respectively. Conclusions: We have identified known and many novel candidate regions for geographically restricted positive selection, and suggest several directions for further research.
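For readers unfamiliar with population differentiation, the quantity involved can be illustrated with a minimal sketch. It computes Wright's F_ST for a single biallelic site from the allele frequencies in two populations. This is a textbook estimator chosen for illustration, not necessarily the statistic used in the study; the function name, the equal default weights, and the convention of returning 0.0 for monomorphic sites are all assumptions.

```python
def fst_two_pops(p1, p2, w1=0.5):
    """Wright's F_ST at one biallelic site.

    p1, p2: frequency of the same allele in population 1 and 2.
    w1: relative weight of population 1 (assumed 0.5, i.e. equal weights).
    """
    w2 = 1.0 - w1
    p_bar = w1 * p1 + w2 * p2
    # Expected heterozygosity if the two populations were pooled.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    # Weighted mean of the expected heterozygosity within each population.
    h_within = w1 * 2.0 * p1 * (1.0 - p1) + w2 * 2.0 * p2 * (1.0 - p2)
    if h_total == 0.0:
        return 0.0  # assumed convention: site is monomorphic overall
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies for three sites (not data from the study).
sites = {"site_a": (0.5, 0.5), "site_b": (0.9, 0.1), "site_c": (1.0, 0.0)}
ranked = sorted(sites, key=lambda s: fst_two_pops(*sites[s]), reverse=True)
print(ranked)
```

Sites at the top of such a ranking are the highly differentiated candidates, and sites at the bottom are the low-differentiation ones that the abstract attributes to sampling effects.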

Powerful tests for multi-marker association analysis using ensemble learning

Badri Padhukasahasram

Multi-marker approaches are currently attracting considerable interest in genome-wide association studies and can enhance power to detect new associations under certain conditions. Gene and pathway based association tests are increasingly being viewed as useful complements to the more widely used single-marker association analyses, which have successfully uncovered numerous disease variants. A major drawback of single-marker based methods is that they do not consider pairwise and higher-order interactions between variants. Here, we describe multi-variate methods for gene and pathway based association analyses using phenotype predictions based on machine learning algorithms. Instead of utilizing only a linear or logistic regression model, we propose the use of ensembles of diverse machine learning algorithms for testing multi-variate associations. As the true mathematical relationship between a phenotype and any group of genetic and clinical variables is unknown in advance and may be complex, such a strategy gives us a general and flexible framework to approximate this relationship across different sets of SNPs. We show how phenotype prediction based on our method can be used for constructing tests for SNP set association analysis. We first apply our method to simulated datasets to demonstrate its power and correctness. Then, we apply our method to previously studied asthma-related genes in 2 independent asthma cohorts to conduct association tests.

Sequence co-evolution gives 3D contacts and structures of protein complexes

Thomas A. Hopf, Charlotta P.I. Schärfe, João P.G.L.M. Rodrigues, Anna G. Green, Chris Sander, Alexandre M.J.J. Bonvin, Debora S. Marks

High-throughput experiments in bacteria and eukaryotic cells have identified tens of thousands of interactions between proteins. This genome-wide view of the protein interaction universe is coarse-grained, whilst fine-grained detail of macro-molecular interactions critically depends on lower throughput, labor-intensive experiments. Computational approaches using measures of residue co-evolution across proteins show promise, but have been limited to specific interactions. Here we present a new generalized method showing that patterns of evolutionary sequence changes across proteins reflect residues that are close in space, with sufficient accuracy to determine the three-dimensional structure of the protein complexes. We demonstrate that the inferred evolutionary coupling scores accurately predict inter-protein residue interactions and can distinguish between interacting and non-interacting proteins. To illustrate the utility of the method, we predict co-evolved contacts between 50 E. coli complexes (of unknown structure), including the unknown 3D interactions between subunits of ATP synthase and find results consistent with detailed experimental data. We expect that the method can be generalized to genome-wide interaction predictions at residue resolution.

A Simple Data-Adaptive Probabilistic Variant Calling Model

Steve Hoffmann, Peter F. Stadler, Korbinian Strimmer
(Submitted on 20 May 2014)

Background: Several sources of noise obfuscate the identification of single nucleotide variation in next-generation sequencing data. Not only errors introduced during library construction and sequencing steps but also the quality of the reference genome and the algorithms used for the alignment of the reads play an influential role. It is not trivial to estimate the influence of these factors for individual sequencing experiments.
Results: We introduce a simple data-adaptive model for variant calling. Several characteristics are sampled from sites with low mismatch rates and used to estimate empirical log-likelihoods. These likelihoods are then combined into a score that typically gives rise to a mixture distribution. From this distribution we determine a decision threshold to separate potentially variant sites from the noisy background.
Conclusions: In simulations we show that the proposed model is on par with frequently used SNV calling algorithms in terms of sensitivity and specificity. The application to next-generation sequencing data reveals stark differences in the score distributions, indicating a strong influence of data-specific sources of noise. The proposed model is specifically designed to adjust to these differences.
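The scoring idea described in the Results can be sketched roughly as follows. This is an illustrative simplification under stated assumptions, not the authors' model: the features are treated as discrete and independent, the feature names (`mismatch_bin`, `strand_bin`) and the pseudocount smoothing are hypothetical, and the mixture-based choice of a decision threshold is not reproduced.

```python
import math
from collections import Counter


def fit_background(values, pseudocount=1.0):
    """Estimate empirical log-probabilities of one discrete feature
    from background sites (pseudocount must be > 0)."""
    counts = Counter(values)
    # One extra smoothing slot reserves probability mass for unseen values.
    total = len(values) + pseudocount * (len(counts) + 1)
    log_probs = {v: math.log((c + pseudocount) / total) for v, c in counts.items()}
    unseen = math.log(pseudocount / total)
    return log_probs, unseen


def site_score(site, models):
    """Sum the background log-likelihoods of a site's feature values.
    Low scores mark sites that look unlike the noisy background."""
    score = 0.0
    for name, value in site.items():
        log_probs, unseen = models[name]
        score += log_probs.get(value, unseen)
    return score


# Hypothetical background sample drawn from sites with low mismatch rates.
background = {
    "mismatch_bin": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    "strand_bin": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
}
models = {name: fit_background(vals) for name, vals in background.items()}

typical = {"mismatch_bin": 0, "strand_bin": 1}
atypical = {"mismatch_bin": 3, "strand_bin": 1}
print(site_score(typical, models), site_score(atypical, models))
```

A site whose features are rare or unseen in the background receives a lower score, so a threshold on the score separates candidate variant sites from noise.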

Inferring human population size and separation history from multiple genome sequences

Stephan Schiffels, Richard Durbin

The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model their ancestral relationship under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20-30 thousand years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The Multiple Sequentially Markovian Coalescent (MSMC) analyses the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago, and give information about human population history as recently as 2,000 years ago, including the bottleneck in the peopling of the Americas, and separations within Africa, East Asia and Europe.

LIMIX: genetic analysis of multiple traits

Christoph Lippert, Francesco Paolo Casale, Barbara Rakitsch, Oliver Stegle

Multi-trait mixed models have emerged as a promising approach for joint analyses of multiple traits. In principle, the mixed model framework is remarkably general. However, current methods implement only a very specific range of tasks to optimize the necessary computations. Here, we present a multi-trait modeling framework that is versatile and fast: LIMIX enables flexible adaptation of mixed models to a broad range of applications with different observed and hidden covariates, and variable study designs. To highlight the novel modeling aspects of LIMIX we performed three vastly different genetic studies: joint GWAS of correlated blood lipid phenotypes, joint analysis of the expression levels of the multiple transcript-isoforms of a gene, and pathway-based modeling of molecular traits across environments. In these applications we show that LIMIX increases GWAS power and phenotype prediction accuracy, in particular when integrating stepwise multi-locus regression into multi-trait models, and when analyzing large numbers of traits. An open source implementation of LIMIX is freely available at:

The distribution of deleterious genetic variation in human populations

Kirk E Lohmueller

Population genetic studies suggest that most amino-acid changing mutations are deleterious. Such mutations are of tremendous interest in human population genetics as they are important for the evolutionary process and may contribute risk to common disease. Genomic studies over the past 5 years have documented differences across populations in the number of heterozygous deleterious genotypes, numbers of homozygous derived deleterious genotypes, number of deleterious segregating sites and proportion of sites that are potentially deleterious. These differences have been attributed to population history affecting the ability of natural selection to remove deleterious variants from the population. However, recent studies have suggested that the genetic load may not differ across populations, and that the efficacy of natural selection has not differed across human populations. Here I show that these observations are not incompatible with each other and that the apparent differences are due to examining different features of the genetic data and differing definitions of terms.
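The point that different summaries of the same data can point in different directions can be illustrated with a toy sketch. The genotypes below are invented for illustration and are not taken from any study; the coding (0, 1, or 2 copies of the derived deleterious allele per site) and the function name are assumptions.

```python
def summarize_genotypes(genotypes):
    """Summarize one individual's genotypes, coded as the number of
    derived (putatively deleterious) alleles carried at each site."""
    if any(g not in (0, 1, 2) for g in genotypes):
        raise ValueError("genotypes must be coded as 0, 1, or 2")
    return {
        "het_sites": sum(1 for g in genotypes if g == 1),
        "hom_derived_sites": sum(1 for g in genotypes if g == 2),
        "derived_alleles": sum(genotypes),
    }


# Two hypothetical individuals typed at the same six sites.
individual_a = [1, 1, 1, 1, 0, 0]
individual_b = [2, 2, 0, 0, 0, 0]

summary_a = summarize_genotypes(individual_a)
summary_b = summarize_genotypes(individual_b)
print(summary_a)
print(summary_b)
```

Here the two individuals carry the same total number of derived alleles, yet one has more heterozygous sites and the other more homozygous derived sites, so the comparison depends on which statistic is examined.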

Sperm should evolve to make female meiosis fair.

Yaniv Brandvain, Graham Coop

Genomic conflicts arise when an allele gains an evolutionary advantage at a cost to organismal fitness. Oogenesis is inherently susceptible to such conflicts because alleles compete to be the product of female meiosis transmitted to the egg. Alleles that distort meiosis in their favor (i.e. meiotic drivers) often decrease organismal fitness, and therefore indirectly favor the evolution of mechanisms to suppress meiotic drive. In this light, many facets of oogenesis and gametogenesis have been interpreted as mechanisms of protection against genomic outlaws. Why then is female meiosis often left uncompleted until after fertilization in many animals — potentially providing an opportunity for sperm alleles to meddle with its outcome and help like-alleles drive in heterozygous females? The population genetic theory presented herein suggests that sperm nearly always evolve to increase the fairness of female meiosis in the face of genomic conflicts. These results are consistent with current knowledge of sperm-dependent meiotic drivers (loci whose distortion of female meiosis depends on sperm genotype), and suggest that the requirement of fertilization for the completion of female meiosis potentially represents a mechanism employed by females to ensure a fair meiosis.