Quantifying evolutionary dynamics of the basic genome of E. coli

Purushottam Dixit, Tin Yau Pang, F. William Studier, Sergei Maslov
(Submitted on 11 May 2014)

The ~4-Mbp basic genome shared by 32 independent isolates of E. coli representing considerable population diversity has been approximated by whole-genome multiple alignment and computational filtering designed to remove mobile elements and highly variable regions. Single nucleotide polymorphisms (SNPs) in the 496 basic-genome pairs are identified, and clonally inherited stretches are distinguished from those acquired by horizontal transfer (HT) by sharp discontinuities in SNP density. The six least diverged genome pairs each have only one or two HT stretches, each occupying 42-115 kbp of basic genome and containing at least one gene cluster known to confer selective advantage. At higher divergences, the typical mosaic pattern of interspersed clonal and HT stretches across the entire basic genome is observed, including likely fragmented integrations across a restriction barrier. A simple model suggests that individual HT events are of the order of 10 kbp and are the chief contributor to genome divergence, bringing in almost 12 times more SNPs than point mutations. As a result of continuing horizontal transfer of such large segments, 400 out of the 496 strain-pairs beyond genomic divergence of share virtually no genomic material with their common ancestor. We conclude that the active and continuing horizontal transfer of moderately large genomic fragments is likely to be mediated primarily by a co-evolving population of phages that distribute random genome fragments throughout the population by generalized transduction, allowing efficient adaptation to environmental changes.
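
The idea of separating clonal stretches from horizontally transferred ones by jumps in SNP density can be illustrated with a minimal Python sketch. It is not the authors' procedure: the fixed non-overlapping windows, the hard threshold, the toy SNP positions, and the function names are all assumptions chosen for illustration. The last line simply checks that 32 genomes give 496 pairs.

```python
from math import comb

def snp_density(snp_positions, genome_length, window):
    """SNPs per bp in consecutive non-overlapping windows.

    Positions are 0-based and must be smaller than genome_length.
    A shorter final window is still divided by the full window size
    (a simplification made for this sketch).
    """
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in snp_positions:
        counts[pos // window] += 1
    return [c / window for c in counts]

def label_windows(densities, threshold):
    """Label a window 'HT' if its SNP density exceeds the (assumed) threshold."""
    return ["HT" if d > threshold else "clonal" for d in densities]

# Toy data: one window with a sharply higher SNP density than its neighbours.
snps = [5, 1200, 2050, 2100, 2300, 2500, 2900, 3999]
densities = snp_density(snps, genome_length=4000, window=1000)
labels = label_windows(densities, threshold=0.003)

# Number of unordered pairs among 32 genomes, as quoted in the abstract.
n_pairs = comb(32, 2)
```

In this toy case only the third window is labelled as HT. A real analysis would need a statistically motivated way to detect the discontinuities rather than a single fixed threshold.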

Quantifying MCMC Exploration of Phylogenetic Tree Space

Christopher Whidden, Frederick A. Matsen IV
Comments: 30 pages, 10 figures
Subjects: Populations and Evolution (q-bio.PE)

In order to gain an understanding of the effectiveness of phylogenetic Markov chain Monte Carlo (MCMC), it is important to understand how quickly the empirical distribution of the MCMC converges to the true posterior distribution. In this paper we investigate this problem on phylogenetic tree topologies with a metric that is especially well suited to the task: the subtree-prune-regraft (SPR) metric. This metric directly corresponds to the minimum number of MCMC rearrangements required to move between trees in common phylogenetic MCMC implementations. We use a novel graph-based approach to analyze tree space and find that the SPR metric is much more informative than simpler metrics that are unrelated to MCMC moves. In doing so we show conclusively that topological peaks do occur in real Bayesian phylogenetic posteriors with standard MCMC moves, investigate the efficiency of Metropolis-coupled MCMC (MCMCMC) in traversing the valleys between peaks, and show that conditional clade probability (CCP) can have systematic problems when there are multiple peaks.

Background selection as baseline for nucleotide variation across the Drosophila genome

Josep M Comeron

The constant removal of deleterious mutations by natural selection causes a reduction in neutral diversity and efficacy of selection at genetically linked sites (a process called Background Selection, BGS). Population genetic studies, however, often ignore BGS effects when investigating demographic events or the presence of other types of selection. To obtain a more realistic evolutionary expectation that incorporates the unavoidable consequences of deleterious mutations, we generated high-resolution landscapes of variation across the Drosophila melanogaster genome under a BGS scenario independent of polymorphism data. We find that BGS plays a significant role in shaping levels of variation across the entire genome, including long introns and intergenic regions distant from annotated genes. We also find that a very large percentage of the observed variation in diversity across autosomes can be explained by BGS alone, up to 70% across individual chromosome arms, thus indicating that BGS predictions can be used as baseline to infer additional types of selection and demographic events. This approach allows detecting several outlier regions with signal of recent adaptive events and selective sweeps. The use of a BGS baseline, however, is particularly appropriate to investigate the presence of balancing selection and our study exposes numerous genomic regions with the predicted signature of higher polymorphism than expected when a BGS context is taken into account. Importantly, we show that these conclusions are robust to the mutation and selection parameters of the BGS model. Finally, analyses of protein evolution, together with previous comparisons of genetic maps between Drosophila species, suggest temporally variable recombination landscapes and thus local BGS effects that may differ between extant and past phases. Because genome-wide BGS and temporal changes in linkage effects can skew approaches to estimate demographic and selective events, future analyses should incorporate BGS predictions and capture local recombination variation across genomes and along lineages.

Genetic dissection of MAPK-mediated complex traits across S. cerevisiae

Sebastian Treusch, Frank W Albert, Joshua S Bloom, Iulia E Kotenko, Leonid Kruglyak

Signaling pathways enable cells to sense and respond to their environment. Many cellular signaling strategies are conserved from fungi to humans, yet their activity and phenotypic consequences can vary extensively among individuals within a species. A systematic assessment of the impact of naturally occurring genetic variation on signaling pathways remains to be conducted. In S. cerevisiae, both response and resistance to stressors that activate signaling pathways differ between diverse isolates. Here, we present a quantitative trait locus (QTL) mapping approach that enables us to identify genetic variants underlying such phenotypic differences across the genetic and phenotypic diversity of S. cerevisiae. Using a round-robin cross between twelve diverse strains, we determined the genetic architectures of phenotypes critically dependent on MAPK signaling cascades. The genetic variants identified fell within MAPK signaling networks themselves as well as other interconnected signaling pathways, illustrating how genetic variation can shape the phenotypic output of highly conserved signaling cascades.

A novel method for the estimation of diversity in viral populations from next generation sequencing data

Jean P. Zukurov, Sieberth N. Brito, Luiz M. R. Janini, Fernando Antoneli
Comments: 17 pages, 6 figures
Subjects: Quantitative Methods (q-bio.QM); Genomics (q-bio.GN)

In this paper we describe the structure and use of a computational tool for the analysis of viral genetic diversity on data generated by high-throughput sequencing. The main motivation for this work is to better understand the genetic diversity of viruses with high rates of nucleotide substitution, such as HIV-1 and influenza. This work focuses on two main fronts: the first is a novel alignment strategy that allows the recovery of the highest possible number of short reads; the second is the estimation of the population genetic diversity through a Bayesian approach based on Dirichlet distributions inspired by word count modeling. The software is available as an integrated platform capable of performing all operations described here; it is written in C# (Microsoft) and runs on Windows platforms. The executable, the documentation and the auxiliary files are freely available and may be obtained from: biocomp.epm.br/tanden.
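
The abstract does not spell out the Dirichlet-based estimator, so the following Python sketch only illustrates the general idea and should not be read as the tool's actual method. It assumes a symmetric Dirichlet prior over the four nucleotides at a single aligned site, multinomial read counts, and a plug-in Gini-Simpson index as the diversity summary. The counts, the prior strength `alpha`, and the function names are hypothetical.

```python
def posterior_mean_frequencies(counts, alpha=1.0):
    """Posterior mean of nucleotide frequencies under a symmetric
    Dirichlet(alpha) prior and multinomial counts (a standard conjugate result)."""
    total = sum(counts.values()) + alpha * len(counts)
    return {base: (n + alpha) / total for base, n in counts.items()}

def gini_simpson(freqs):
    """Probability that two random draws differ, computed from the
    posterior means (a plug-in estimate, not a full posterior expectation)."""
    return 1.0 - sum(p * p for p in freqs.values())

# Hypothetical read counts at one site: mostly A, with a minor C variant.
site_counts = {"A": 90, "C": 6, "G": 0, "T": 0}
freqs = posterior_mean_frequencies(site_counts, alpha=1.0)
diversity = gini_simpson(freqs)
```

The prior keeps unobserved nucleotides at a small non-zero frequency, in the same way that smoothing does in word count models.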

A statistical test for lineage-specific natural selection on quantitative traits based on multiple-line crosses

Nico Riedel, Bhavin S. Khatri, Michael Lässig, Johannes Berg
Comments: 21 pages, 11 figures
Subjects: Populations and Evolution (q-bio.PE)

Phenotypic differences between species may be attributable to natural selection. However, it is a difficult task to quantify the strength of evidence for selection acting on a particular trait. Here we develop a population-genetic test for selection acting on a quantitative trait, which is based on multiple-line crosses. We show that using multiple lines increases both the power and the scope of selection inference. First, a test based on three or more lines detects selection on a quantitative trait with strongly increased statistical significance, which is quantified by our analysis. Second, a multiple-line test makes it possible to distinguish selection from neutral evolution as well as lineage-specific selection from selection under uniform selection strength. This is in contrast to tests based on two lines, where only differences in selection coefficients can be inferred. Our analytical results are complemented by extensive numerical simulations. We apply the multiple-line test to QTL data on floral character traits in plant species of the Mimulus genus and on photoperiodic traits in different maize strains. In both cases, we find a signature of lineage-specific selection that is not seen in a two-line test. We also extend the multiple-line test to short divergence times.

Sequence co-evolution gives 3D contacts and structures of protein complexes

Thomas A. Hopf, Charlotta P.I. Schärfe, João P.G.L.M. Rodrigues, Anna G. Green, Chris Sander, Alexandre M.J.J. Bonvin, Debora S. Marks

High-throughput experiments in bacteria and eukaryotic cells have identified tens of thousands of possible interactions between proteins. This genome-wide view of the protein interaction universe is coarse-grained, whilst fine-grained detail of macromolecular interactions critically depends on lower throughput, labor-intensive experiments. Computational approaches using measures of residue co-evolution across proteins show promise, but have been limited to specific interactions. Here we present a new generalized method showing that patterns of evolutionary sequence changes across proteins reflect residues that are close in space, with sufficient accuracy to determine the three-dimensional structure of the protein complexes. We demonstrate that the inferred evolutionary coupling scores distinguish between interacting and non-interacting proteins and accurately predict residue interactions. To illustrate the utility of the method, we predict unknown 3D interactions between subunits of ATP synthase and find results consistent with detailed experimental data. We expect that the method can be generalized to genome-wide interaction predictions at residue resolution.

Tandem duplications and the limits of natural selection in Drosophila yakuba and Drosophila simulans

Rebekah L Rogers, Julie M Cridland, Ling Shao, Tina T Hu, Peter Andolfatto, Kevin R Thornton
Subjects: Populations and Evolution (q-bio.PE)

Tandem duplications are an essential source of genetic novelty, and their prevalence in natural populations is expected to influence the trajectory of adaptive walks. Here, we describe evolutionary impacts of recently derived, segregating tandem duplications in Drosophila yakuba and Drosophila simulans. We observe an excess of duplicated genes involved in defense against pathogens, chorion development, cuticular peptides, and lipases or endopeptidases associated with the accessory glands, as well as insecticide metabolism, suggesting that duplications function in Red Queen dynamics and rapid evolution. We observe evidence of widespread selection on the D. simulans X, suggesting adaptation through duplication is common on the X. Though we find many high frequency variants, duplicates display an excess of low frequency variants consistent with largely detrimental impacts, limiting the variation that can effectively facilitate adaptation. Although we observe hundreds of gene duplications, we show that segregating variation is insufficient to provide duplicate copies of the entire genome, and the number of duplications in the population spans 13.4% of major chromosome arms in D. yakuba and 9.7% in D. simulans. Whole gene duplication rates are low at $1.1 \times 10^{-9}$ in D. yakuba and $6.1 \times 10^{-9}$ in D. simulans, suggesting long wait times for new mutations. Hence, if adaptive processes are dependent on individual duplications, evolution will be severely limited by mutation. Consequently, parallel recruitment of the same duplicated gene in different species will be rare and standing variation will define evolutionary outcomes, in spite of convergence across rapidly evolving phenotypes.

Consistency of the Maximum Likelihood Estimator of Evolutionary Tree

Arindam RoyChoudhury
Subjects: Populations and Evolution (q-bio.PE)

Maximum likelihood estimation (MLE) methods are widely used to estimate evolutionary trees. As the evolutionary tree is not a smooth parameter, the consistency of its MLE has been a topic of debate. It has been noted without proof that the classical proof of consistency by Wald holds for the MLE of the evolutionary tree. Other proofs of consistency under various models have also been proposed. Here we discuss shortcomings in some of these proofs and comment on the applicability of Wald’s proof.

Author post: Spatial localization of recent ancestors for admixed individuals

A guest post by Bogdan Pasaniuc [@bpasaniuc] on his paper with coauthors: Spatial localization of recent ancestors for admixed individuals by Wen-Yun Yang, Alexander Platt, Charleston Wen-Kai Chiang, Eleazar Eskin, John Novembre, Bogdan Pasaniuc. The preprint is posted on bioRxiv.

Geographic localization based on genetic data has received much attention recently. Here we present a preprint that aims to address one of the drawbacks of existing approaches. As opposed to existing works that typically make a very strong assumption that all recent ancestors come from the same location on a map, we seek to infer multiple locations for a given individual corresponding to its ancestors. That is, our approach uses genetic data from a given individual to localize on the map its recent ancestors several generations ago (e.g. grandparents).

To accomplish this we approximate the admixture process (i.e. mixing of genetic variants from different sources) in a genetic-geographic continuum. We view the mixed ancestry genome as being generated from several locations on a map (corresponding to its recent ancestors) and model the mosaic structure of local ancestries across the genome through an admixture HMM. We link geography to the admixture process by allowing allele frequencies at every site in the genome to vary across geography according to a logistic gradient function (as in SPA[1]); the complete model is an admixture HMM for a genotype-specific pair of ancestral locations on the map.
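
To make the logistic gradient concrete, here is a minimal Python sketch of how an allele frequency could vary with map location, and how two ancestral locations could combine into genotype probabilities at a single site. The functional form 1 / (1 + exp(-(a·x + b))), the independence of the two inherited alleles, and all parameter values and function names are assumptions made for illustration. The sketch leaves out the HMM over local ancestry entirely.

```python
import math

def allele_frequency(location, slope, intercept):
    """Logistic gradient: allele frequency as a function of map coordinates.
    `slope` and `intercept` are site-specific parameters (hypothetical values here)."""
    x, y = location
    a1, a2 = slope
    return 1.0 / (1.0 + math.exp(-(a1 * x + a2 * y + intercept)))

def genotype_probabilities(loc_a, loc_b, slope, intercept):
    """P(genotype = 0, 1, 2 copies of the allele) when one allele is drawn
    from each of two ancestral locations, assumed independent."""
    fa = allele_frequency(loc_a, slope, intercept)
    fb = allele_frequency(loc_b, slope, intercept)
    return [
        (1 - fa) * (1 - fb),             # zero copies
        fa * (1 - fb) + fb * (1 - fa),   # one copy
        fa * fb,                         # two copies
    ]

# Two ancestral locations one unit apart along the direction of the gradient.
probs = genotype_probabilities((0.0, 0.0), (1.0, 0.0), slope=(2.0, 0.0), intercept=0.0)
```

Under this form the frequency changes monotonically along the direction of the slope vector, which is what allows a genotype to carry information about location.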

As the number of generations since admixture increases, the total number of ancestors to localize increases dramatically, making the inference infeasible (http://gcbias.org/2013/11/11/how-does-your-number-of-genetic-ancestors-grow-back-over-time/). To account for this, we limit the number of different “ancestry locations” that contribute to admixture to a small constant, each with a varying amount of contribution. We devise efficient algorithms to make inferences in our model and show that accuracy decreases with the number of locations to infer, with the number of generations in the admixture and with the geographic distance among ancestors. For example, SPAMIX can localize the grandparents of the POPRES[2] individuals with multiple sub-continental European ancestries within 470 km of their reported locations.
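
As a rough illustration of why the number of locations has to be capped, the count of genealogical ancestors doubles with every generation back. The short Python sketch below assumes no pedigree collapse, meaning no ancestor appears twice. The function name is hypothetical.

```python
def genealogical_ancestors(generations_back):
    """Number of ancestors in a single generation, assuming every ancestor
    is distinct (no pedigree collapse)."""
    return 2 ** generations_back

# Parents, grandparents, great-grandparents, and so on.
ancestors_by_generation = [genealogical_ancestors(g) for g in range(1, 6)]
```

Grandparents correspond to two generations back, so four locations at most. Five generations back there would already be 32 ancestors to place on the map.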

As with all methods, limitations do exist and we outline several here. We use logistic gradient functions to relate geography to genetics, and investigating more complex functions may prove fruitful. We developed an efficient algorithm for producing point estimates for location and locus-specific ancestry; in some cases a probabilistic output may be desired. Finally, our approach models admixture-LD and assumes no background LD; more involved procedures to model background LD (such as the one we proposed [3]) are an interesting area of research.

1. Yang, Wen-Yun, et al. “A model-based approach for analysis of spatial structure in genetic data.” Nature Genetics 44.6 (2012): 725-731.
2. Nelson, Matthew R., et al. “The population reference sample, POPRES: a resource for population, disease, and pharmacological genetics research.” The American Journal of Human Genetics 83.3 (2008): 347-358.
3. Baran, Yael, et al. “Enhanced localization of genetic samples through linkage-disequilibrium correction.” The American Journal of Human Genetics 92.6 (2013): 882-894.