Inferring fitness landscapes by regression produces biased estimates of epistasis

Jakub Otwinowski, Joshua B. Plotkin
(Submitted on 3 Apr 2014)

The genotype-fitness map plays a fundamental role in shaping the dynamics of evolution. However, it is difficult to directly measure a fitness landscape in practice, because the number of possible genotypes is astronomical. One approach is to sample as many genotypes as possible, measure their fitnesses, and fit a statistical model of the landscape that includes additive and pairwise interactive effects between loci. Here we elucidate the pitfalls of using such regressions, by studying artificial but mathematically convenient fitness landscapes. We identify two sources of bias inherent in these regression procedures that each tends to under-estimate high fitnesses and over-estimate low fitnesses. We characterize these biases for random sampling of genotypes, as well as for samples drawn from a population under selection in the Wright-Fisher model of evolutionary dynamics. We show that common measures of epistasis, such as the number of monotonically increasing paths between ancestral and derived genotypes, the prevalence of sign epistasis, and the number of local fitness maxima, are distorted in the inferred landscape. As a result, the inferred landscape will provide systematically biased predictions for the dynamics of adaptation. We identify the same biases in a computational RNA-folding landscape, as well as in regulatory sequence binding data, treated with the same fitting procedure. Finally, we present a method that may ameliorate these biases in some cases.
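The fitting procedure described in this abstract (additive plus pairwise effects between loci) can be sketched as an ordinary least-squares regression. This is a minimal illustration, not the authors' code: the 0/1 genotype encoding, the function names `design_matrix` and `fit_pairwise`, and the toy coefficients are all assumptions made for the example.

```python
import itertools
import numpy as np

def design_matrix(genotypes):
    """Build regression columns: intercept, one per locus, one per pair of loci.

    Assumes genotypes are encoded as rows of 0/1 alleles (an illustrative choice).
    """
    g = np.asarray(genotypes, dtype=float)
    n_loci = g.shape[1]
    pairs = itertools.combinations(range(n_loci), 2)
    pair_cols = [g[:, i] * g[:, j] for i, j in pairs]
    return np.column_stack([np.ones(len(g)), g] + pair_cols)

def fit_pairwise(genotypes, fitnesses):
    """Least-squares estimate of the additive and pairwise coefficients."""
    X = design_matrix(genotypes)
    y = np.asarray(fitnesses, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Toy landscape on 3 loci (hypothetical values): additive effects on loci 0 and 1
# and one pairwise interaction between them, with no higher-order terms.
all_genotypes = np.array(list(itertools.product([0, 1], repeat=3)))
true_fitness = (1.0
                + 0.5 * all_genotypes[:, 0]
                - 0.2 * all_genotypes[:, 1]
                + 0.3 * all_genotypes[:, 0] * all_genotypes[:, 1])

# Coefficient order: intercept, loci 0..2, then pairs (0,1), (0,2), (1,2).
coef = fit_pairwise(all_genotypes, true_fitness)
```

When the true landscape contains only additive and pairwise terms and every genotype is measured without noise, as in this toy case, the regression recovers the coefficients. The biases the abstract describes concern the harder settings: landscapes with higher-order interactions and incomplete or selection-biased samples of genotypes.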

Stability and response of polygenic traits to stabilizing selection and mutation

Harold P. de Vladar, Nick Barton
(Submitted on 3 Apr 2014)

When polygenic traits are under stabilizing selection, many different combinations of alleles allow close adaptation to the optimum. If alleles have equal effects, all combinations that result in the same deviation from the optimum are equivalent. Furthermore, the genetic variance that is maintained by mutation-selection balance is 2μ/S per locus, where μ is the mutation rate and S the strength of stabilizing selection. In reality, alleles vary in their effects, making the fitness landscape asymmetric, and complicating analysis of the equilibria. We show that the resulting genetic variance depends on the fraction of alleles near fixation, which contribute by 2μ/S, and on the total mutational effects of alleles that are at intermediate frequency. The interplay between stabilizing selection and mutation leads to a sharp transition: alleles with effects smaller than a threshold value of 2√(μ/S) remain polymorphic, whereas those with larger effects are fixed. The genetic load in equilibrium is less than for traits of equal effects, and the fitness equilibria are more similar. We find that if the optimum is displaced, alleles with effects close to the threshold value sweep first, and their rate of increase is bounded by √(μS). Long term response leads in general to well-adapted traits, unlike the case of equal effects that often end up at a sub-optimal fitness peak. However, the particular peaks to which the populations converge are extremely sensitive to the initial states, and to the speed of the shift of the optimum trait value.
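To give a feel for the magnitudes involved, the three quantities named in this abstract can be evaluated for one set of parameters. This is a small sketch only: the values μ = 1e-5 and S = 0.1 are hypothetical, the function name is invented for the example, and the threshold and bound are read here as 2√(μ/S) and √(μS).

```python
import math

def mutation_selection_summary(mu, S):
    """Evaluate the abstract's three expressions for mutation rate mu and selection strength S."""
    return {
        # Genetic variance maintained per locus at mutation-selection balance: 2*mu/S
        "variance_per_locus": 2 * mu / S,
        # Allelic effect threshold separating polymorphic from fixed alleles: 2*sqrt(mu/S)
        "threshold_effect": 2 * math.sqrt(mu / S),
        # Upper bound on the rate of increase of sweeping alleles: sqrt(mu*S)
        "max_sweep_rate": math.sqrt(mu * S),
    }

# Hypothetical parameter values, chosen only for illustration.
summary = mutation_selection_summary(mu=1e-5, S=0.1)
```

With these assumed values, the per-locus variance is 2e-4, the threshold effect is 0.02, and the bound on the rate of increase is 1e-3.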

Taxator-tk: Fast and Precise Taxonomic Assignment of Metagenomes by Approximating Evolutionary Neighborhoods

J. Dröge, I. Gregor, A. C. McHardy
(Submitted on 3 Apr 2014)

Metagenomics characterizes microbial communities by random shotgun sequencing of DNA isolated directly from an environment of interest. An essential step in computational metagenome analysis is taxonomic sequence assignment, which allows us to identify the sequenced community members and to reconstruct taxonomic bins with sequence data for the individual taxa. We describe an algorithm and the accompanying software, taxator-tk, which performs taxonomic sequence assignments by fast approximate determination of evolutionary neighbors from sequence similarities. Taxator-tk was precise in its taxonomic assignment across all ranks and taxa for a range of evolutionary distances and for short sequences. In addition to the taxonomic binning of metagenomes, it is well suited for profiling microbial communities from metagenome samples because it identifies bacterial, archaeal and eukaryotic community members without being affected by varying primer binding strengths, as in marker gene amplification, or copy number variations of marker genes across different taxa. Taxator-tk has an efficient, parallelized implementation that allows the assignment of 6 Gb of sequence data per day on a standard multiprocessor system with ten CPU cores and microbial RefSeq as the genomic reference data.

An experimentally informed evolutionary model improves phylogenetic fit to divergent lactamase homologs

Jesse D Bloom

Phylogenetic analyses of molecular data require a quantitative model for how sequences evolve. Traditionally, the details of the site-specific selection that governs sequence evolution are unknown, and so most phylogenetic models treat this selection crudely with a variety of free parameters designed to represent general features of mutation and selection. However, recent advances in high-throughput experiments have made it possible to quantify the effects of all single mutations on gene function. I have previously shown that such high-throughput experiments can be combined with knowledge of underlying mutation rates to create a parameter-free evolutionary model that describes the phylogeny of influenza nucleoprotein far better than existing models. Here I extend this work by showing that published experimental data on TEM-1 beta-lactamase (Firnberg et al, 2014) can be combined with a few mutation rate parameters to create an evolutionary model that describes beta-lactamase phylogenies much better than existing models. This experimentally informed evolutionary model is superior even for homologs that are substantially diverged (about 35% divergence at the protein level) from the TEM-1 parent that was the subject of the experimental study. These results suggest that experimental measurements can inform phylogenetic evolutionary models that are applicable to homologs that span a substantial range of sequence divergence.

PGA: A Program for Genome Annotation by Comparative Analysis of Maximum Likelihood Phylogenies of Genes and Species

Paulo Bandiera-Paiva, Marcelo R.S. Briones
(Submitted on 2 Apr 2014)

The Phylogenetic Genome Annotator (PGA) is a computer program that enables real-time comparison of ‘gene trees’ versus ‘species trees’ obtained from predicted open reading frames of whole genome data. The gene phylogenies are inferred from the predicted proteins of each individual genome, whereas the species phylogenies are inferred from rDNA data. The correlated protein domains, defined by PFAM, are then displayed side-by-side with a phylogeny of the corresponding species. The statistical support of gene clusters (branches) is given by the quartet puzzling method. This analysis readily discriminates paralogs from orthologs, enabling the identification of proteins that originated by gene duplication and the prediction of possible functional divergence in groups of similar sequences.

Protected polymorphisms and evolutionary stability of patch-selection strategies in stochastic environments

Steve Evans, Alexandru Hening, Sebastian Schreiber

We consider a population living in a patchy environment that varies stochastically in space and time. The population is composed of two morphs (that is, individuals of the same species with different genotypes). In terms of survival and reproductive success, the associated phenotypes differ only in their habitat selection strategies. We compute invasion rates corresponding to the rates at which the abundance of an initially rare morph increases in the presence of the other morph established at equilibrium. If both morphs have positive invasion rates when rare, then there is an equilibrium distribution such that the two morphs coexist; that is, there is a protected polymorphism for habitat selection. Alternatively, if one morph has a negative invasion rate when rare, then it is asymptotically displaced by the other morph under all initial conditions where both morphs are present. We refine the characterization of an evolutionary stable strategy for habitat selection from [Schreiber, 2012] in a mathematically rigorous manner. We provide a necessary and sufficient condition for the existence of an ESS that uses all patches and determine when using a single patch is an ESS. We also provide an explicit formula for the ESS when there are two habitat types. We show that adding environmental stochasticity results in an ESS that, when compared to the ESS for the corresponding model without stochasticity, spends less time in patches with larger carrying capacities and possibly makes use of sink patches, thereby practicing a spatial form of bet hedging.

Towards Better Understanding of Artifacts in Variant Calling from High-Coverage Samples

Heng Li
(Submitted on 3 Apr 2014)

Motivation: Whole-genome high-coverage sequencing has been widely used for personal and cancer genomics as well as in various research areas. However, in the absence of an unbiased whole-genome truth set, the global error rate of variant calls and the leading causal artifacts remain unclear, despite the great efforts made in evaluating variant calling methods.
Results: We made ten SNP and INDEL call sets with two read mappers and five variant callers, both on a haploid human genome and a diploid genome at a similar coverage. By investigating false heterozygous calls in the haploid genome, we identified the erroneous realignment in low-complexity regions and the incomplete reference genome with respect to the sample as the two major sources of errors, which press for continued improvements in these two areas. We estimated that the error rate of raw genotype calls is as high as 1 in 10-15kb, but the error rate of post-filtered calls is reduced to 1 in 100-200kb without significant compromise on the sensitivity.

Comparison of the theoretical and real-world evolutionary potential of a genetic circuit

Manuel Razo-Mejia, James Boedicker, Daniel Jones, Alexander de Luna, Justin Block Kinney, Rob Phillips

With the development of next-generation sequencing technologies, many large scale experimental efforts aim to map genotypic variability among individuals. This natural variability in populations fuels many fundamental biological processes, ranging from evolutionary adaptation and speciation to the spread of genetic diseases and drug resistance. An interesting and important component of this variability is present within the regulatory regions of genes. As these regions evolve, accumulated mutations lead to modulation of gene expression, which may have consequences for the phenotype. A simple model system where the link between genetic variability, gene regulation and function can be studied in detail is missing. In this article we develop a model to explore how the sequence of the wild-type lac promoter dictates the fold change in gene expression. The model combines single-base pair resolution maps of transcription factor and RNA polymerase binding energies with a comprehensive thermodynamic model of gene regulation. The model was validated by predicting and then measuring the variability of lac operon regulation in a collection of natural isolates. We then implement the model to analyze the sensitivity of the promoter sequence to the regulatory output, and predict the potential for regulation to evolve due to point mutations in the promoter region.

Multilocus Species Trees Show the Recent Adaptive Radiation of the Mimetic Heliconius Butterflies

Krzysztof M Kozak, Niklas Wahlberg, Andrew Neild, Kanchon K Dasmahapatra, James Mallet, Chris D Jiggins

Müllerian mimicry among Neotropical Heliconiini butterflies is an excellent example of natural selection, and is associated with the diversification of a large continental-scale radiation. Some of the processes driving the evolution of mimicry rings are likely to generate incongruent phylogenetic signals across the assemblage, and thus pose a challenge for systematics. We use a dataset of 22 mitochondrial and nuclear markers from 92% of species in the tribe to re-examine the phylogeny of Heliconiini with both supermatrix and multi-species coalescent approaches, characterise the patterns of conflicting signal and compare the performance of various methodological approaches to reflect the heterogeneity across the data. Despite the large extent of reticulate signal and strong conflict between markers, nearly identical topologies are consistently recovered by most of the analyses, although the supermatrix approach fails to reflect the underlying variation in the history of individual loci. The first comprehensive, time-calibrated phylogeny of this group is used to test the hypotheses of a diversification rate increase driven by the dramatic environmental changes in the Amazonia over the past 23 million years, or changes caused by diversity-dependent effects on the rate of diversification. We find that the tribe Heliconiini had doubled its rate of speciation around 11 Ma and that the presently most speciose genus Heliconius started diversifying rapidly at 10 Ma, likely in response to the recent drastic changes in topography of the region. Our study provides comprehensive evidence for a rapid adaptive radiation among an important insect radiation in the most biodiverse region of the planet.

New whole genome de novo assemblies of three divergent strains of rice (O. sativa) documents novel gene space of aus and indica

Michael C Schatz, Lyza G Maron, Joshua C Stein, Alejandro Hernandez Wences, James Gurtowski, Eric Biggers, Hayan Lee, Melissa Kramer, Eric Antonio, Elena Ghiban, Mark H Wright, Jer-ming Chia, Doreen Ware, Susan R McCouch, William Richard McCombie

The use of high throughput genome-sequencing technologies has uncovered a large extent of structural variation in eukaryotic genomes that makes important contributions to genomic diversity and phenotypic variation. Currently, when the genomes of different strains of a given organism are compared, whole genome resequencing data are aligned to an established reference sequence. However, when the reference differs in significant structural ways from the individuals under study, the analysis is often incomplete or inaccurate. Here, we use rice as a model to explore the extent of structural variation among strains adapted to different ecologies and geographies, and show that this variation can be significant, often matching or exceeding the variation present in closely related human populations or other mammals. We demonstrate how improvements in sequencing and assembly technology allow rapid and inexpensive de novo assembly of next generation sequence data into high-quality assemblies that can be directly compared to provide an unbiased assessment. Using this approach, we are able to accurately assess the ‘pan-genome’ of three divergent rice varieties and document several megabases of each genome absent in the other two. Many of the genome-specific loci are annotated to contain genes, reflecting the potential for new biological properties that would be missed by standard resequencing approaches. We further provide a detailed analysis of several loci associated with agriculturally important traits, illustrating the utility of our approach for biological discovery. All of the data and software are openly available to support further breeding and functional studies of rice and other species.