Genome-wide scan of 29,141 African Americans finds no evidence of selection since admixture

Gaurav Bhatia, Arti Tandon, Melinda C. Aldrich, Christine B. Ambrosone, Christopher Amos, Elisa V. Bandera, Sonja I. Berndt, Leslie Bernstein, William J. Blot, Cathryn H. Bock, Neil Caporaso, Graham Casey, Sandra L. Deming, W. Ryan Diver, Susan M. Gapstur, Elizabeth M. Gillanders, Curtis C. Harris, Brian E. Henderson, Sue A. Ingles, William Isaacs, Esther M. John, Rick A. Kittles, Emma Larkin, Lorna H. McNeill, Robert C. Millikan, Adam Murphy, Christine Neslund-Dudas, Sarah Nyante, Michael F. Press, Jorge L. Rodriguez-Gil, Benjamin A. Rybicki, Ann G. Schwartz, Lisa B. Signorello, Margaret Spitz, Sara S. Strom, Margaret A. Tucker, John K. Wiencke, John S. Witte, Xifeng Wu, Yuko Yamamura, Krista A. Zanetti, Wei Zheng, Regina G. Ziegler, Stephen J. Chanock, Christopher A. Haiman, David Reich, Alkes L. Price
(Submitted on 10 Dec 2013)

We scanned through the genomes of 29,141 African Americans, searching for loci where the average proportion of African ancestry deviates significantly from the genome-wide average. We failed to find any genome-wide significant deviations, and conclude that any selection in African Americans since admixture is sufficiently weak that it falls below the threshold of our power to detect it using a large sample size. These results stand in contrast to the findings of a recent study of selection in African Americans. That study, which had 15 times fewer samples, reported six loci with significant deviations. We show that the discrepancy is likely due to insufficient correction for multiple hypothesis testing in the previous study. The same study reported 14 loci that showed greater population differentiation between African Americans and Nigerian Yoruba than would be expected in the absence of natural selection. Four such loci were previously shown to be genome-wide significant and likely to be affected by selection, but we show that most of the 10 additional loci are likely to be false positives. Additionally, the most parsimonious explanation for the loci that have significant evidence of unusual differentiation in frequency between Nigerians and African Americans is selection in Africa prior to their forced migration to the Americas.
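
The kind of scan described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name, the use of an empirical standard deviation across loci, and the Bonferroni-style threshold are all assumptions made here for illustration. It flags loci whose mean African ancestry deviates from the genome-wide average by more than a multiple-testing-corrected threshold allows.

```python
from statistics import NormalDist, mean, pstdev

def scan_ancestry_deviations(locus_means, n_independent_tests, alpha=0.05):
    """Return (index, z, p) for loci passing a Bonferroni-corrected threshold.

    locus_means: per-locus average African ancestry across all samples.
    n_independent_tests: assumed number of independent tests (hypothetical input).
    """
    mu = mean(locus_means)      # genome-wide average ancestry
    sd = pstdev(locus_means)    # empirical spread across loci (an assumption)
    if sd == 0:
        return []
    threshold = alpha / n_independent_tests
    hits = []
    for i, m in enumerate(locus_means):
        z = (m - mu) / sd
        p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided normal p-value
        if p < threshold:
            hits.append((i, z, p))
    return hits
```

Dividing alpha by the number of tests is what keeps the scan from reporting chance deviations as significant, which is the multiple-testing issue the abstract raises.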

Probabilistic Graphical Model Representation in Phylogenetics

Sebastian Höhna, Tracy A. Heath, Bastien Boussau, Michael J. Landis, Fredrik Ronquist, John P. Huelsenbeck
(Submitted on 9 Dec 2013)

Recent years have seen a rapid expansion of the model space explored in statistical phylogenetics, emphasizing the need for new approaches to statistical model representation and software development. Clear communication and representation of the chosen model is crucial for: (1) reproducibility of an analysis, (2) model development and (3) software design. Moreover, a unified, clear and understandable framework for model representation lowers the barrier for beginners and non-specialists to grasp complex phylogenetic models, including their assumptions and parameter/variable dependencies.
Graphical modeling is a unifying framework that has gained in popularity in the statistical literature in recent years. The core idea is to break complex models into conditionally independent distributions. The strength lies in the comprehensibility, flexibility, and adaptability of this formalism, and the large body of computational work based on it. Graphical models are well-suited for teaching statistical models, facilitating communication among phylogeneticists, and developing generic software for simulation and statistical inference.
Here, we provide an introduction to graphical models for phylogeneticists and extend the standard graphical model representation to the realm of phylogenetics. We introduce a new graphical model component, tree plates, to capture the changing structure of the subgraph corresponding to a phylogenetic tree. We describe a range of phylogenetic models using the graphical model framework and introduce modules to simplify the representation of standard components in large and complex models. Phylogenetic model graphs can be readily used in simulation, maximum likelihood inference, and Bayesian inference using, for example, Metropolis-Hastings or Gibbs sampling of the posterior distribution.

Probabilistic models of genetic variation in structured populations applied to global human studies

Wei Hao, Minsun Song, John D. Storey
(Submitted on 7 Dec 2013)

Modern population genetics studies typically involve genome-wide genotyping of individuals from a diverse network of ancestries. An important, unsolved problem is how to formulate and estimate probabilistic models of observed genotypes that allow for complex population structure. We formulate two general probabilistic models, and we propose computationally efficient algorithms to estimate them. First, we show how principal component analysis (PCA) can be utilized to estimate a general model that includes the well-known Pritchard-Stephens-Donnelly mixed-membership model as a special case. Noting some drawbacks of this approach, we introduce a new “logistic factor analysis” (LFA) framework that seeks to directly model the logit transformation of probabilities underlying observed genotypes in terms of latent variables that capture population structure. We demonstrate these advances on data from the human genome diversity panel and 1000 genomes project, where we are able to identify SNPs that are highly differentiated with respect to structure while making minimal modeling assumptions.
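
As a rough illustration of the idea behind a logit-scale factor model, the sketch below maps latent structure variables to an allele frequency through the logistic function. The function names, the linear combination of loadings and factors, and the Binomial(2, p) genotype assumption are illustrative choices made here, not details taken from the paper.

```python
import math

def allele_frequency(loadings, factors):
    """Allele frequency for one SNP in one individual.

    loadings: hypothetical SNP-specific coefficients.
    factors: hypothetical latent structure variables for the individual.
    The logit of the frequency is modeled as their inner product.
    """
    logit = sum(a * h for a, h in zip(loadings, factors))
    return 1.0 / (1.0 + math.exp(-logit))

def expected_genotype(loadings, factors):
    """Expected allele count, assuming genotype ~ Binomial(2, p)."""
    return 2.0 * allele_frequency(loadings, factors)
```

Working on the logit scale keeps every fitted frequency strictly between 0 and 1, which a direct linear model of frequencies does not guarantee.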

Human blood genotypes dynamics

Timur Sadykov
(Submitted on 9 Dec 2013)

We give a complete closed form description of the evolution of human blood genotypes frequencies (in the ABO and Rh classification) after any (finite or infinite) number of generations and for any initial distribution.
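
For orientation only, here is a minimal sketch of one generation of genotype-frequency change at a single multi-allelic locus such as ABO. It assumes the textbook setting of random mating with no selection, mutation, or migration; the abstract does not state the paper's mating model, so this should not be read as the paper's closed form. The function name is hypothetical.

```python
from itertools import combinations_with_replacement

def next_generation_genotypes(genotype_freqs):
    """Map genotype frequencies to the next generation under random mating.

    genotype_freqs: dict from an allele pair, e.g. ("A", "O"), to its frequency.
    """
    # Allele frequencies: each genotype contributes half its frequency per allele.
    allele = {}
    for (a, b), f in genotype_freqs.items():
        allele[a] = allele.get(a, 0.0) + f / 2
        allele[b] = allele.get(b, 0.0) + f / 2
    # Random union of gametes: homozygotes p*p, heterozygotes 2*p*q.
    out = {}
    for a, b in combinations_with_replacement(sorted(allele), 2):
        out[(a, b)] = allele[a] * allele[b] * (1 if a == b else 2)
    return out
```

Under these assumptions the frequencies stop changing after a single generation, which is the Hardy-Weinberg result.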

The time-dependent reconstructed evolutionary process with a key-role for mass-extinction events

Sebastian Höhna
(Submitted on 9 Dec 2013)

The homogeneous reconstructed evolutionary process is a birth-death process without observed extinct lineages. Each species evolves independently with the same diversification rates (speciation rate λ(t) and extinction rate μ(t)) that may change over time. The process is commonly applied to model species diversification where the data are reconstructed phylogenies, e.g., trees reconstructed from present-day molecular data, and used to infer diversification rates.
In the present paper I develop the general probability density of a reconstructed tree under any time-dependent birth-death process. I elaborate on how to adapt this probability density if conditioned on survival of one or two initial lineages, or on having sampled n species, and show how to transform between the probability density of a reconstructed tree and the probability density of the speciation times.
I demonstrate the use of the general time-dependent probability density functions by deriving the probability density of a reconstructed tree under a birth-death-shift model with explicit mass-extinction events. I enrich this compendium by providing and discussing several special cases, including: the pure birth process, the pure death process, the birth-death process and the critical branching process. Thus, I provide here most of the commonly used birth-death models in a unified framework (e.g., same condition and same data) with common notation.
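
To make the notion of conditioning on survival concrete, the sketch below computes the probability that a single lineage has at least one surviving descendant after time t. It covers only the constant-rate special case with complete sampling (a standard result), not the general time-dependent density developed in the paper; the function name is hypothetical.

```python
import math

def survival_probability(lam, mu, t):
    """P(one lineage alive at time 0 leaves descendants at time t).

    lam: constant speciation rate, mu: constant extinction rate.
    """
    if math.isclose(lam, mu):
        # Critical branching process (lam == mu).
        return 1.0 / (1.0 + lam * t)
    r = lam - mu  # net diversification rate
    return r / (lam - mu * math.exp(-r * t))
```

With mu = 0 (pure birth) the result is always 1, and for lam > mu it tends to 1 - mu/lam as t grows.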

Species Delimitation using Genome-Wide SNP Data

Adam Leache, Matthew Fujita, Vladimir Minin, Remco Bouckaert

The multi-species coalescent has provided important progress for evolutionary inferences, including increasing the statistical rigor and objectivity of comparisons among competing species delimitation models. However, Bayesian species delimitation methods typically require brute force integration over gene trees via Markov chain Monte Carlo (MCMC), which introduces a large computation burden and precludes their application to genomic-scale data. Here we combine a recently introduced dynamic programming algorithm for estimating species trees that bypasses MCMC integration over gene trees with sophisticated methods for estimating marginal likelihoods, needed for Bayesian model selection, to provide a rigorous and computationally tractable technique for genome-wide species delimitation. We provide a critical yet simple correction that brings the likelihoods of different species trees, and more importantly their corresponding marginal likelihoods, to the same common denominator, which enables direct and accurate comparisons of competing species delimitation models using Bayes factors. We test this approach, which we call Bayes factor delimitation (*with genomic data; BFD*), using common species delimitation scenarios with computer simulations. Varying the numbers of loci and the number of samples suggests that the approach can distinguish the true model even with few loci and limited samples per species. Misspecification of the prior for population size θ has little impact on support for the true model. We apply the approach to West African forest geckos (Hemidactylus fasciatus complex) using genome-wide SNP data. This new Bayesian method for species delimitation builds on a growing trend for objective species delimitation methods with explicit model assumptions that are easily tested.
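
Once log marginal likelihoods are in hand, comparing delimitation models by Bayes factor is simple arithmetic. The sketch below assumes the marginal likelihoods have already been estimated elsewhere; the model names and numbers in the usage note are made up, and the function name is hypothetical.

```python
def rank_models(log_marginal_likelihoods):
    """Rank models by 2*ln(Bayes factor) against the best-supported model.

    log_marginal_likelihoods: dict from model name to estimated ln marginal likelihood.
    Returns (name, 2*lnBF) pairs, best model first with a value of 0.0.
    """
    best = max(log_marginal_likelihoods.values())
    scored = [(name, 2.0 * (best - ml))
              for name, ml in log_marginal_likelihoods.items()]
    return sorted(scored, key=lambda pair: pair[1])
```

For example, `rank_models({"lump": -1050.0, "split": -1000.0})` puts "split" first and reports 2 ln BF = 100.0 against "lump".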

Formal properties of the probability of fixation: identities, inequalities and approximations

David M. McCandlish, Charles L. Epstein, Joshua B. Plotkin
(Submitted on 5 Dec 2013)

The formula for the probability of fixation of a new mutation is widely used in theoretical population genetics and molecular evolution. Here we derive a series of identities, inequalities and approximations for the exact probability of fixation of a new mutation under the Moran process (equivalent results hold for the approximate probability of fixation for the Wright-Fisher process after an appropriate change of variables). We show that the behavior of the logarithm of the probability of fixation is particularly simple when the selection coefficient is measured as a difference of Malthusian fitnesses, and we exploit this simplicity to derive several inequalities and approximations. We also present a comprehensive comparison of both existing and new approximations for the probability of fixation, highlighting in particular approximations that result in a reversible Markov chain when used to model the dynamics of evolution under weak mutation.
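
As a concrete reference point, the sketch below evaluates the standard exact fixation probability of a single new mutant under the Moran process, with the selection coefficient s taken as a difference of Malthusian fitnesses (so the fitness ratio is exp(s)). This is the textbook formula rather than any of the paper's new results, and the function name is hypothetical.

```python
import math

def moran_fixation(s, n):
    """Fixation probability of one mutant among n haploids under the Moran process.

    s: Malthusian selection coefficient (mutant/resident fitness ratio = exp(s)).
    Computes (1 - exp(-s)) / (1 - exp(-n*s)), using expm1 for accuracy near s = 0.
    Note: very large negative n*s would overflow; this sketch does not guard against it.
    """
    if s == 0:
        return 1.0 / n  # neutral case
    return math.expm1(-s) / math.expm1(-n * s)
```

In this parameterization the ratio of fixation probabilities for s and -s is exactly exp((n - 1) * s), which follows directly from the formula.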

Error-prone polymerase activity causes multinucleotide mutations in humans

Kelley Harris, Rasmus Nielsen
(Submitted on 5 Dec 2013)

About 2% of human genetic polymorphisms have been hypothesized to arise via multinucleotide mutations (MNMs), complex events that generate SNPs at multiple sites in a single generation. MNMs have the potential to accelerate the pace at which single genes evolve and to confound studies of demography and selection that assume all SNPs arise independently. In this paper, we examine clustered mutations that are segregating in a set of 1,092 human genomes, demonstrating that MNMs become enriched as large numbers of individuals are sampled. We leverage the size of the dataset to deduce new information about the allelic spectrum of MNMs, estimating the percentage of linked SNP pairs that were generated by simultaneous mutation as a function of the distance between the affected sites and showing that MNMs exhibit a high percentage of transversions relative to transitions. These findings are reproducible in data from multiple sequencing platforms. Among tandem mutations that occur simultaneously at adjacent sites, we find an especially skewed distribution of ancestral and derived dinucleotides, with GC→AA, GA→TT and their reverse complements making up 36% of the total. These same mutations dominate the spectrum of tandem mutations produced by the upregulation of low-fidelity Polymerase ζ in mutator strains of S. cerevisiae that have impaired DNA excision repair machinery. This suggests that low-fidelity DNA replication by Pol ζ is at least partly responsible for the MNMs that are segregating in the human population, and that useful information about the biochemistry of MNM can be extracted from ordinary population genomic data. We incorporate our findings into a mathematical model of the multinucleotide mutation process that can be used to correct phylogenetic and population genetic methods for the presence of MNMs.

Author post: Evolution at two levels of gene expression in yeast

This guest post is by Carlo Artieri and Hunter Fraser on their preprint Evolution at two levels of gene expression in yeast, arXived here

Taking studies of regulatory evolution to the next level: translation

Understanding the molecular basis of regulatory variation within and between species has become a major focus of modern genetics. For instance, the majority of identified human disease-risk alleles lie in non-coding regions of the genome, suggesting that they affect gene regulation (Epstein 2009). Furthermore, it has been argued that regulatory changes have played a dominant role in explaining uniquely human attributes (King and Wilson 1975). However, our knowledge of gene regulatory evolution is based almost entirely on studies of mRNA levels, despite both the greater functional importance of protein abundance, and evidence that post-transcriptional regulation is pervasive. The availability of high-throughput methods for measuring mRNA abundance coupled to the lack of comparable methods at the protein level have contributed to this focus; however, a new method known as ribosome profiling (Ingolia et al. 2009) has enabled us to study divergence in the regulation of translation.

‘Riboprofiling’ involves the construction of two RNA-seq libraries: one measuring mRNA abundance (the ‘mRNA’ fraction), and the second capturing the portion of the transcriptome that is actively being translated by ribosomes (the ‘Ribo’ fraction). We performed riboprofiling on interspecific hybrids of two closely related species of budding yeast, Saccharomyces cerevisiae and S. paradoxus (~5 million years diverged), as well as the parental strains. As both parental alleles at a locus share the same trans cellular environment in the hybrid, differences in the relative allelic abundance (termed allele-specific expression, or ASE) reveal cis-regulatory divergence. Consequently, interspecies differences not attributable to cis-effects indicate trans divergence. By measuring differences in the magnitudes of ASE between the two hybrid riboprofiling fractions, we identified independent cis and trans regulatory changes in both mRNA abundance and translational efficiency.
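
The cis/trans logic described above can be written down in a few lines. This is only a sketch of the reasoning for a single gene in a single fraction: the function name and the read counts are hypothetical, and a real analysis would need normalization, replicates, and statistical testing that are omitted here.

```python
import math

def cis_trans(parent_cer, parent_par, hybrid_cer, hybrid_par):
    """Split expression divergence into cis and trans components (log2 scale).

    parent_*: expression of the gene in each parental species.
    hybrid_*: expression of each species' allele within the hybrid.
    """
    total = math.log2(parent_cer / parent_par)  # divergence between the parents
    cis = math.log2(hybrid_cer / hybrid_par)    # ASE in the shared trans environment
    trans = total - cis                         # remainder not explained by cis
    return cis, trans
```

For instance, `cis_trans(400, 100, 200, 100)` attributes a four-fold parental difference equally to cis and trans effects, one log2 unit each.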

We found that both cis and trans regulatory divergence in translation are widespread, and of comparable magnitude to divergence at the mRNA level – indicating that we miss much regulatory evolution by focusing on mRNA in isolation. Moreover, we observed an overwhelming bias towards divergence in opposing parental directions, suggesting the action of stabilizing selection in order to maintain more similar protein levels between species than would be expected by comparing mRNA abundances alone. Interestingly, while we confirmed the results of previous studies indicating that both cis and trans regulatory divergence at the mRNA level are associated with the presence of TATA boxes and nucleosome free regions in promoters, no such relationship was found for translational divergence, indicating that these regulatory systems have different underlying architectures.

We also searched for evidence of polygenic selection in and between both regulatory levels by applying a recently developed modification of Orr’s sign test (Orr 1998; Fraser et al. 2010; Bullard et al. 2010). Under neutral divergence, no pattern is expected with regard to the parental direction of up- or down-regulating alleles among orthologs within a functional group (e.g., a pathway or multi-gene complex). However, a significant bias towards one parental lineage is evidence of lineage-specific selection. This analysis uncovered evidence of polygenic selection at both regulatory levels in a number of functional groups. In particular, genes involved in tolerance to heavy metals were enriched for reinforcing divergence in mRNA abundance and translation favoring S. cerevisiae. Increased tolerance to these metals has been observed in S. cerevisiae (Warringer et al. 2011), suggesting that domesticated yeasts have experienced a history of polygenic adaptation across regulatory levels allowing them to grow on metals such as copper. Finally, we also uncovered multiple instances of stop-codon readthrough that are conserved between species, highlighting yet another post-transcriptional mechanism leading to increased proteomic diversity.
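
The intuition behind a sign test on a functional group can be shown with a plain exact binomial test. This is deliberately simpler than the modified test cited above, whose details are not given in this post; the function name is hypothetical and the null of equal probability for each parental direction is an assumption of the sketch.

```python
from math import comb

def sign_test_p(n_favoring_one_parent, n_total):
    """Two-sided exact binomial p-value under a 50/50 null.

    n_favoring_one_parent: genes in the group whose up-regulating allele
    comes from one chosen parental lineage.
    """
    k = max(n_favoring_one_parent, n_total - n_favoring_one_parent)
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2 * tail)
```

A group of 10 genes all biased towards the same parent gives p = 2/1024, while a 5 versus 5 split gives p = 1.0.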

By applying a novel approach to a long-standing question, our analysis has revealed the underappreciated complexity of post-transcriptional regulatory divergence. We argue that partitioning the search for the locus of selection into the binary categories of ‘coding’ vs. ‘regulatory’ overlooks the many opportunities for selection to act at multiple regulatory levels along the path from genotype to phenotype.

References:

Bullard JH, Mostovoy Y, Dudoit S, Brem RB. 2010. Polygenic and directional regulatory evolution across pathways in Saccharomyces. Proc Natl Acad Sci USA 107: 5058-5063.

Epstein DJ. 2009. Cis-regulatory mutations in human disease. Brief Funct Genomic Proteomic 8: 310-316.

Fraser HB, Moses AM, Schadt EE. 2010. Evidence for widespread adaptive evolution of gene expression in budding yeast. Proc Natl Acad Sci USA 107: 2977-2982.

Ingolia NT, Ghaemmaghami S, Newman JR, Weissman JS. 2009. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324: 218-223.

King MC, Wilson AC. 1975. Evolution at two levels in humans and chimpanzees. Science 188: 107-116.

Orr HA. 1998. Testing natural selection vs. genetic drift in phenotypic evolution using quantitative trait locus data. Genetics 149: 2099-2104.

Warringer J, Zörgö E, Cubillos FA, Zia A, Gjuvsland A, Simpson JT, Forsmark A, Durbin R, Omholt SW, Louis EJ, Liti G, Moses A, Blomberg A. 2011. Trait variation in yeast is defined by population history. PLoS Genet 7: e1002111.

Evolution at two levels of gene expression in yeast

Carlo G. Artieri, Hunter B. Fraser
(Submitted on 27 Nov 2013)

Despite the greater functional importance of protein levels, our knowledge of gene expression evolution is based almost entirely on studies of mRNA levels. In contrast, our understanding of how translational regulation evolves has lagged far behind. Here we have applied ribosome profiling – which measures both global mRNA levels and their translation rates – to two species of Saccharomyces yeast and their interspecific hybrid in order to assess the relative contributions of changes in mRNA abundance and translation to regulatory evolution. We report that both cis and trans-acting regulatory divergence in translation are abundant, affecting at least 35% of genes. The majority of translational divergence acts to buffer changes in mRNA abundance, suggesting a widespread role for stabilizing selection acting across regulatory levels. Nevertheless, we observe evidence of lineage-specific selection acting on a number of yeast functional modules, including instances of reinforcing selection acting at both levels of regulation. Finally, we also uncover multiple instances of stop-codon readthrough that are conserved between species. Our analysis reveals the under-appreciated complexity of post-transcriptional regulatory divergence and indicates that partitioning the search for the locus of selection into the binary categories of ‘coding’ vs. ‘regulatory’ may overlook a significant source of selection, acting at multiple regulatory levels along the path from genotype to phenotype.