Circumstantial Evidence? Comparison of Statistical Learning Methods using Functional Annotations for Prioritizing Risk Variants
Sarah A Gagliano, Reena Ravji, Michael R Barnes, Michael E Weale, Jo Knight
Although technology has triumphed in facilitating routine genome re-sequencing, new challenges have been created for the data analyst. Genome scale surveys of human disease variation generate volumes of data that far exceed capabilities for laboratory characterization, and importantly also create a substantial burden of type I error. By incorporating a variety of functional annotations as predictors, such as regulatory and protein coding elements, statistical learning has been widely investigated as a mechanism for the prioritization of genetic variants that are more likely to be associated with complex disease. These methods offer the hope of identifying sufficiently large numbers of truly associated variants to make cost-effective the large-scale functional characterization necessary to progress genome scale experiments. We compared the results from three published prioritization procedures which use different statistical learning algorithms and different predictors with regard to the quantity, type and coding of the functional annotations. In this paper we also explore different combinations of algorithm and annotation set. We train the models on 60% of the data and reserve the remainder for testing the accuracy. As an application, we tested which methodology performed the best for prioritizing sub-genome-wide-significant variants using data from the first and second rounds of a large schizophrenia meta-analysis by the Psychiatric Genomics Consortium. Results suggest that all methods have considerable (and similar) predictive accuracies (AUCs 0.64-0.71). However, predictive accuracy results obtained from the test set do not always reflect results obtained from the application to the schizophrenia meta-analysis. In conclusion, a variety of algorithms and annotations seem to have a similar potential to effectively enrich true risk variants in genome scale datasets; however, none offers more than incremental improvement in prediction.
We discuss how methods might be evolved towards the step change in the risk variant prediction required to address the impending bottleneck of the new generation of genome re-sequencing studies.
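The evaluation scheme described in this abstract (fit on 60% of the data, measure AUC on the held-out remainder) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `split_60_40` and `auc` are hypothetical, and the AUC is computed by the plain pairwise definition rather than by any particular library.

```python
import random


def split_60_40(items, seed=0):
    """Shuffle items reproducibly and return (train, test) with 60% in train."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(0.6 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def auc(scores, labels):
    """Area under the ROC curve via its pairwise definition.

    Returns the fraction of (positive, negative) pairs in which the
    positive item has the higher score; ties count as half a win.
    Labels are assumed to be 1 (risk variant) or 0 (non-risk variant).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

In this reading, an AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported range of 0.64-0.71 sits closer to the former than the latter.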
Current data show no signal of Ebola virus adapting to humans
Stephanie J. Spielman, Austin G. Meyer, Claus O. Wilke
Gire et al. (Science 345:1369–1372, 2014) analyzed 81 complete genomes sampled from the 2014 Zaire ebolavirus (EBOV) outbreak and claimed that the virus is evolving far more rapidly in the current outbreak than it has been between previous outbreaks. This assertion has received widespread attention, and many have perceived the results of Gire et al. (2014) as implying rapid adaptation of EBOV to humans during the current outbreak. Here, we show that, on the contrary, sequence divergence in EBOV is rather limited, and that the currently available data contain no signal of rapid evolution or adaptation to humans. Gire et al.’s findings resulted from an incorrect application of a molecular-clock model to a population of sequences with minimal divergence and segregating polymorphisms. Our results highlight how indiscriminate use of off-the-shelf analysis techniques may result in highly publicized, misleading statements about an ongoing public health crisis.
How Well Can We Detect Shifts in Rates of Lineage Diversification? A Simulation Study of Sequential AIC Methods
Michael R May, Brian R Moore
Evolutionary biologists have long been fascinated by the extreme differences in species numbers across branches of the Tree of Life. This has motivated the development of statistical phylogenetic methods for detecting shifts in the rate of lineage diversification (speciation – extinction). One of the most frequently used methods, implemented in the program MEDUSA, explores a set of diversification-rate models, where each model uniquely assigns branches of the phylogeny to a set of one or more diversification-rate categories. Each candidate model is first fit to the data, and the Akaike Information Criterion (AIC) is then used to identify the optimal diversification model. Surprisingly, the statistical behavior of this popular method is completely unknown, which is a concern in light of the poor performance of the AIC as a means of choosing among models in other phylogenetic comparative contexts, and also because of the ad hoc algorithm used to visit models. Here, we perform an extensive simulation study demonstrating that, as implemented, MEDUSA (1) has an extremely high Type I error rate (on average, spurious diversification-rate shifts are identified 42% of the time), and (2) provides severely biased parameter estimates (on average, estimated net-diversification and relative-extinction rates are 183% and 20% of their true values, respectively). We performed simulation experiments to reveal the source(s) of these pathologies, which include (1) the use of incorrect critical thresholds for model selection, and (2) errors in the likelihood function.
Understanding the statistical behavior of MEDUSA is critical both to empirical researchers—in order to clarify whether these methods can reliably be applied to empirical datasets—and to theoretical biologists—in order to clarify whether new methods are required, and to reveal the specific problems that need to be solved in order to develop more reliable approaches for detecting shifts in the rate of lineage diversification.
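For readers unfamiliar with AIC-based model choice, the following minimal Python sketch shows the generic comparison that the abstract refers to, using the standard definition AIC = 2k − 2 ln L. It is not MEDUSA's implementation: the function names, the model names, and the log-likelihood values in the usage note are hypothetical and chosen only for illustration.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def select_by_aic(models):
    """Pick the model with the lowest AIC.

    models maps a model name to (maximized log-likelihood, number of
    free parameters). Returns the best name and each model's AIC
    difference from the best (delta AIC).
    """
    scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
    best = min(scores, key=scores.get)
    deltas = {name: score - scores[best] for name, score in scores.items()}
    return best, deltas
```

With made-up inputs such as `{"one_rate": (-100.0, 2), "two_rate": (-98.5, 5)}`, the single-rate model is preferred: the extra parameters of the two-rate model cost more than its gain in likelihood. How large a difference should count as support for an additional rate shift is exactly the kind of threshold the abstract identifies as a source of error.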
R/qtlcharts: interactive graphics for quantitative trait locus mapping
Karl W Broman
Every data visualization can be improved with some level of interactivity. Interactive graphics hold particular promise for the exploration of high-dimensional data. R/qtlcharts is an R package to create interactive graphics for experiments to map quantitative trait loci (QTL; genetic loci that influence quantitative traits). R/qtlcharts serves as a companion to the R/qtl package, providing interactive versions of R/qtl’s static graphs, as well as additional interactive graphs for the exploration of high-dimensional genotype and phenotype data.
A robust statistical framework for reconstructing genomes from metagenomic data
Dongwan Don Kang, Jeff Froula, Rob Egan, Zhong Wang
We present software that reconstructs genomes from shotgun metagenomic sequences using a reference-independent approach. This method permits the identification of OTUs in large complex communities where many species are unknown. Binning reduces the complexity of a metagenomic dataset, enabling many downstream analyses previously unavailable. In this study we developed MetaBAT, a robust statistical framework that integrates probabilistic distances of genome abundance with sequence composition for automatic binning. Applying MetaBAT to a human gut microbiome dataset identified 173 highly specific genome bins, including many representing previously unidentified species.
Bayesian analyses of Yemeni mitochondrial genomes suggest multiple migration events with Africa and Western Eurasia
Deven Nikunj Vyas, Andrew Kitchen, Aida Teresa Miró-Herrans, Laurel Nichole Pearson, Ali Al-Meeri, Connie Jo Mulligan
Anatomically modern humans (AMHs) left Africa ~60,000 years ago, marking the first of multiple dispersal events by AMH between Africa and the Arabian Peninsula. The southern dispersal route (SDR) out of Africa (OOA) posits that early AMHs crossed the Bab el-Mandeb strait from the Horn of Africa into what is now Yemen and followed the coast of the Indian Ocean into eastern Eurasia. If AMHs followed the SDR and left modern descendants in situ, Yemeni populations should retain old autochthonous mitogenome lineages. Alternatively, if AMHs did not follow the SDR or did not leave modern descendants in the region, only young autochthonous lineages will remain as evidence of more recent dispersals. We sequenced 113 whole mitogenomes from multiple Yemeni regions with a focus on haplogroups M, N, and L3(xM,N) as they are considered markers of the initial OOA migrations. We performed Bayesian evolutionary analyses to generate time-measured phylogenies calibrated by Neanderthal and Denisovan mitogenome sequences in order to determine the age of Yemeni-specific clades in our dataset. Our results indicate that the M1, N1, and L3(xM,N) sequences in Yemen are the product of recent migration from Africa and western Eurasia. Although these data suggest that modern Yemeni mitogenomes are not markers of the original OOA migrants, we hypothesize that recent population dynamics may obscure any genetic signature of an ancient SDR migration.
A general condition for adaptive genetic polymorphism in temporally and spatially heterogeneous environments
Hannes Svardal, Claus Rueffler, Joachim Hermisson
Comments: Accepted for publication in Theoretical Population Biology
Subjects: Populations and Evolution (q-bio.PE)
Both evolution and ecology have long been concerned with the impact of variable environmental conditions on observed levels of genetic diversity within and between species. We model the evolution of a quantitative trait under selection that fluctuates in space and time, and derive an analytical condition for when these fluctuations promote genetic diversification. As an ecological scenario we use a generalized island model with soft selection within patches in which we incorporate generation overlap. We allow for arbitrary fluctuations in the environment including spatio-temporal correlations and any functional form of selection on the trait. Using the concepts of invasion fitness and evolutionary branching, we derive a simple and transparent condition for the adaptive evolution and maintenance of genetic diversity. This condition relates the strength of selection within patches to expectations and variances in the environmental conditions across space and time. Our results unify, clarify, and extend a number of previous results on the evolution and maintenance of genetic variation under fluctuating selection. Individual-based simulations show that our results are independent of the details of the genetic architecture and of whether reproduction is clonal or sexual. The onset of increased genetic variance is also predicted accurately in small populations in which alleles can go extinct due to environmental stochasticity.