Triticeae resources in Ensembl Plants

Triticeae resources in Ensembl Plants

Dan M Bolser, Arnaud Kerhornou, Brandon Walts, Paul Kersey
doi: http://dx.doi.org/10.1101/011585

Recent developments in DNA sequencing have enabled the large and complex genomes of many crop species to be determined for the first time, even those previously intractable due to their polyploid nature. Indeed, over the course of the last two years, the genome sequences of several commercially important cereals, notably barley and bread wheat, have become available, as well as those of related wild species. While still incomplete, comparison to other, more completely assembled species suggests that coverage of genic regions is likely to be high. Ensembl Plants (http://plants.ensembl.org) is an integrative resource organising, analysing and visualising genome-scale information for important crop and model plants. Available data includes reference genome sequence, variant loci, gene models and functional annotation. For variant loci, individual and population genotypes, linkage information and, where available, phenotypic information, are shown. Comparative analyses are performed on DNA and protein sequence alignments. The resulting genome alignments and gene trees, representing the implied evolutionary history the gene family, are made available for visualisation and analysis. Driven by the use case of bread wheat, specific extensions to the analysis pipelines and web interface have recently been developed to support polyploid genomes. Data in Ensembl Plants is accessible through a genome browser incorporating various specialist interfaces for different data types, and through a variety of additional methods for programmatic access and data mining. These interfaces are consistent with those offered through the Ensembl interface for the genomes of non-plant species, including those of plant pathogens, pests and pollinators, facilitating the study of the plant in its environment.

Expansion load: recessive mutations and the role of standing genetic variation

Expansion load: recessive mutations and the role of standing genetic variation

Stephan Peischl, Laurent Excoffier
doi: http://dx.doi.org/10.1101/011593

Expanding populations incur a mutation burden – the so-called expansion load. Previous studies of expansion load have focused on co-dominant mutations. An important consequence of this assumption is that expansion load stems exclusively from the accumulation of new mutations occurring in individuals living at the wave front. Using individual-based simulations we study here the dynamics of standing genetic variation at the front of expansions, and its consequences on mean fitness if mutations are recessive. We find that deleterious genetic diversity is quickly lost at the front of the expansion, but the loss of deleterious mutations at some loci is compensated by an increase of their frequencies at other loci. The frequency of deleterious homozygotes therefore increases along the expansion axis whereas the average number of deleterious mutations per individual remains nearly constant across the species range. This reveals two important differences to co-dominant models: (i) mean fitness at the front of the expansion drops much faster if mutations are recessive, and (ii) mutation load can increase during the expansion even if the total number of deleterious mutations per individual remains constant. We use our model to make predictions about the shape of the site frequency spectrum at the front of range expansion, and about correlations between heterozygosity and fitness in different parts of the species range. Importantly, these predictions provide opportunities to empirically validate our theoretical results. We discuss our findings in the light of recent results on the distribution of deleterious genetic variation across human populations, and link them to empirical results on the correlation of heterozygosity and fitness found in many natural range expansions.

Circumstantial Evidence? Comparison of Statistical Learning Methods using Functional Annotations for Prioritizing Risk Variants

Circumstantial Evidence? Comparison of Statistical Learning Methods using Functional Annotations for Prioritizing Risk Variants

Sarah A Gagliano, Reena Ravji, Michael R Barnes, Michael E Weale, Jo Knight
doi: http://dx.doi.org/10.1101/011445

Although technology has triumphed in facilitating routine genome re-sequencing, new challenges have been created for the data analyst. Genome scale surveys of human disease variation generate volumes of data that far exceed capabilities for laboratory characterization, and importantly also create a substantial burden of type I error. By incorporating a variety of functional annotations as predictors, such as regulatory and protein coding elements, statistical learning has been widely investigated as a mechanism for the prioritization of genetic variants that are more likely to be associated with complex disease. These methods offer a hope of identification of sufficiently large numbers of truly associated variants, to make cost-effective the large-scale functional characterization necessary to progress genome scale experiments. We compared the results from three published prioritization procedures which use different statistical learning algorithms and different predictors with regard to the quantity, type and coding of the functional annotations. In this paper we also explore different combinations of algorithm and annotation set. We train the models in 60% of the data and reserve the remainder for testing the accuracy. As an application, we tested which methodology performed the best for prioritizing sub-genome-wide-significant variants using data from the first and second rounds of a large schizophrenia meta-analysis by the Psychiatric Genomics Consortium. Results suggest that all methods have considerable (and similar) predictive accuracies (AUCs 0.64-0.71). However, predictive accuracy results obtained from the test set do not always reflect results obtained from the application to the schizophrenia meta-analysis. In conclusion, a variety of algorithms and annotations seem to have a similar potential to effectively enrich true risk variants in genome scale datasets, however none offer more than incremental improvement in prediction. We discuss how methods might be evolved towards the step change in the risk variant prediction required to address the impending bottleneck of the new generation of genome re-sequencing studies.

Current data show no signal of Ebola virus adapting to humans

Current data show no signal of Ebola virus adapting to humans

Stephanie J. Spielman, Austin G. Meyer, Claus O. Wilke
doi: http://dx.doi.org/10.1101/011429

Gire et al. (Science 345:1369–1372, 2014) analyzed 81 complete genomes sampled from the 2014 Zaire ebolavirus (EBOV) outbreak and claimed that the virus is evolving far more rapidly in the current outbreak than it has been between previous outbreaks. This assertion has received widespread attention, and many have perceived Gire et al. (2014)’s results as implying rapid adaptation of EBOV to humans during the current outbreak. Here, we show that, on the contrary, sequence divergence in EBOV is rather limited, and that the currently available data contain no signal of rapid evolution or adaptation to humans. Gire et al.’s findings resulted from an incorrect application of a molecular-clock model to a population of sequences with minimal divergence and segregating polymorphisms. Our results highlight how indiscriminate use of off-the-shelf analysis techniques may result in highly-publicized, misleading statements about an ongoing public health crisis.

How Well Can We Detect Shifts in Rates of Lineage Diversification? A Simulation Study of Sequential AIC Methods

How Well Can We Detect Shifts in Rates of Lineage Diversification? A Simulation Study of Sequential AIC Methods

Michael R May, Brian R Moore
doi: http://dx.doi.org/10.1101/011452

Evolutionary biologists have long been fascinated by the extreme differences in species numbers across branches of the Tree of Life. This has motivated the development of statistical phy- logenetic methods for detecting shifts in the rate of lineage diversification (speciation – extinction). One of the most frequently used methods—implemented in the program MEDUSA—explores a set of diversification-rate models, where each model uniquely assigns branches of the phylogeny to a set of one or more diversification-rate categories. Each candidate model is first fit to the data, and the Akaike Information Criterion (AIC) is then used to identify the optimal diversification model. Surprisingly, the statistical behavior of this popular method is completely unknown, which is a concern in light of the poor performance of the AIC as a means of choosing among models in other phylogenetic comparative contexts, and also because of the ad hoc algorithm used to visit models. Here, we perform an extensive simulation study demonstrating that, as implemented, MEDUSA (1) has an extremely high Type I error rate (on average, spurious diversification-rate shifts are identi- fied 42% of the time), and (2) provides severely biased parameter estimates (on average, estimated net-diversification and relative-extinction rates are 183% and 20% of their true values, respectively). We performed simulation experiments to reveal the source(s) of these pathologies, which include (1) the use of incorrect critical thresholds for model selection, and (2) errors in the likelihood function. Understanding the statistical behavior of MEDUSA is critical both to empirical researchers—in order to clarify whether these methods can reliably be applied to empirical datasets—and to theoretical biologists—in order to clarify whether new methods are required, and to reveal the specific problems that need to be solved in order to develop more reliable approaches for detecting shifts in the rate of lineage diversification.

R/qtlcharts: interactive graphics for quantitative trait locus mapping

R/qtlcharts: interactive graphics for quantitative trait locus mapping

Karl W Broman
doi: http://dx.doi.org/10.1101/011437

Every data visualization can be improved with some level of interactivity. Interactive graphics hold particular promise for the exploration of high-dimensional data. R/qtlcharts is an R package to create interactive graphics for experiments to map quantitative trait loci (QTL; genetic loci that influence quantitative traits). R/qtlcharts serves as a companion to the R/qtl package, providing interactive versions of R/qtl’s static graphs, as well as additional interactive graphs for the exploration of high-dimensional genotype and phenotype data.

A robust statistical framework for reconstructing genomes from metagenomic data

A robust statistical framework for reconstructing genomes from metagenomic data

Dongwan Don Kang, Jeff Froula, Rob Egan, Zhong Wang
doi: http://dx.doi.org/10.1101/011460

We present software that reconstructs genomes from shotgun metagenomic sequences using a reference-independent approach. This method permits the identification of OTUs in large complex communities where many species are unknown. Binning reduces the complexity of a metagenomic dataset enabling many downstream analyses previously unavailable. In this study we developed MetaBAT, a robust statistical framework that integrates probabilistic distances of genome abundance with sequence composition for automatic binning. Applying MetaBAT to a human gut microbiome dataset identified 173 highly specific genomes bins including many representing previously unidentified species.

What to compare and how: comparative transcriptomics for Evo-Devo

What to compare and how: comparative transcriptomics for Evo-Devo

Julien Roux, Marta Rosikiewicz, Marc Robinson-Rechavi
doi: http://dx.doi.org/10.1101/011213

Evolutionary developmental biology has grown historically from the capacity to relate patterns of evolution in anatomy to patterns of evolution of expression of specific genes, whether between very distantly related species, or very closely related species or populations. Scaling up such studies by taking advantage of modern transcriptomics brings promising improvements, allowing us to estimate the overall impact and molecular mechanisms of convergence, constraint or innovation in anatomy and development. But it also presents major challenges, including the computational definitions of anatomical homology and of organ function, the criteria for the comparison of developmental stages, the annotation of transcriptomics data to proper anatomical and developmental terms, and the statistical methods to compare transcriptomic data between species to highlight significant conservation or changes. In this article, we review these challenges, and the ongoing efforts to address them, which are emerging from bioinformatics work on ontologies, evolutionary statistics, and data curation, with a focus on their implementation in the context of the development of our database Bgee (http://bgee.org).

Tools and Methods from the Anopheles 16 Genome Project

Tools and Methods from the Anopheles 16 Genome Project

Aaron Steele, Michael C. Fontaine, Andres Martin, Scott J Emrich
doi: http://dx.doi.org/10.1101/011205

The dramatic reduction in sequencing costs has resulted in many initiatives to sequence certain organisms and populations. These initiatives aim to not only sequence and assemble genomes but also to perform a more broader analysis of the population structure. As part of the Anopheline Genome Consortium, which has a vested interest in studying anpopheline mosquitoes, we developed novel methods and tools to further the communities goals. We provide a brief description of these methods and tools as well as assess the contributions that each offers to the broader study of comparative genomics.

Recombination and peak jumping

Recombination and peak jumping

Kristina Crona
(Submitted on 7 Nov 2014)

We find an advantage of recombination for a category of complex fitness landscapes. Recent studies of empirical fitness landscapes reveal complex gene interactions and multiple peaks, and recombination can be a powerful mechanism for escaping suboptimal peaks. However classical work on recombination largely ignores the effect of complex gene interactions. The advantage we find has no correspondence for 2-locus systems or for smooth landscapes. The effect is sometimes extreme, in the sense that shutting off recombination could result in that the organism fails to adapt. A standard question about recombination is if the mechanism tends to accelerate or decelerate adaptation. However, we argue that extreme effects may be more important than how the majority falls.