Too packed to change: site-specific substitution rates and side-chain packing in protein evolution

María Laura Marcos, Julian Echave
doi: http://dx.doi.org/10.1101/013359

In protein evolution, due to functional and biophysical constraints, the rates of amino acid substitution differ from site to site. Among the best predictors of site-specific rates is packing density. The packing density measure that best correlates with rates is the weighted contact number (WCN), the sum of inverse square distances between the site’s Cα and the other Cαs. According to a mechanistic stress model proposed recently, rates are determined by packing because mutating packed sites stresses and destabilizes the protein’s active conformation. While WCN is a measure of Cα packing, mutations replace side chains, which prompted us to consider whether a site’s evolutionary divergence is constrained by main-chain packing or side-chain packing. To address this issue, we extended the stress theory to model side chains explicitly. The theory predicts that rates should depend solely on side-chain packing. We tested this prediction on a data set of structurally and functionally diverse monomeric enzymes. We found that, on average, side-chain contact density (WCNρ) explains 39.1% of among-sites rate variation, more than main-chain contact density (WCNα), which explains 32.1%. More importantly, the independent contribution of WCNα is only 0.7%. Thus, as predicted by the stress theory, site-specific evolutionary rates are determined by side-chain packing.
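
The WCN definition in the abstract (the sum of inverse square distances from a site to every other site) can be sketched in a few lines of Python. This is only an illustration of the formula, not the authors' code: the function name `weighted_contact_numbers` and the three toy coordinates are assumptions made for the example. Passing Cα coordinates would give WCNα; passing side-chain reference points would presumably give the side-chain analogue.

```python
import math


def weighted_contact_numbers(coords):
    """Return one WCN per site: the sum over all other sites of 1 / d_ij**2.

    coords is a list of (x, y, z) tuples, one per site. Which atom or
    reference point represents a site is left to the caller.
    """
    wcn = []
    for i, a in enumerate(coords):
        total = 0.0
        for j, b in enumerate(coords):
            if i != j:
                total += 1.0 / math.dist(a, b) ** 2
        wcn.append(total)
    return wcn


# Hypothetical toy input: three sites on a line, 2 units apart.
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
wcn = weighted_contact_numbers(coords)
print(wcn)  # the middle site is the most densely packed
```

In this toy case the middle site has the highest WCN, which under the stress model would correspond to the most constrained, slowest-evolving site.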

High-resolution genomic surveillance of 2014 ebolavirus using shared subclonal variants

Kevin J Emmett, Albert K Lee, Hossein Khiabanian, Raul Rabadan
doi: http://dx.doi.org/10.1101/013318

Viral outbreaks, such as the 2014 ebolavirus, can spread rapidly and have complex evolutionary dynamics, including coinfection and bulk transmission of multiple viral populations. Genomic surveillance can be hindered when the spread of the outbreak exceeds the evolutionary rate, in which case consensus approaches will have limited resolution. Deep sequencing of infected patients can identify genomic variants present in intrahost populations at subclonal frequencies (i.e. <50%). Shared subclonal variants (SSVs) can provide additional phylogenetic resolution and inform about disease transmission patterns. Here, we use metrics from population genetics to analyze data from the 2014 ebolavirus outbreak in Sierra Leone and identify phylogenetic signal arising from SSVs. We use methods derived from information theory to measure a lower bound on transmission bottleneck size that is larger than one founder population, yet significantly smaller than the intrahost effective population. Our results demonstrate the important role of shared subclonal variants in genomic surveillance.
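
To make the idea of a shared subclonal variant concrete, here is a minimal sketch that flags variants present at subclonal frequency (<50%) in at least two patients. It is not the authors' pipeline: the function name, the variant labels, the toy frequencies, and the `min_freq` floor (a stand-in for a sequencing-error filter) are all assumptions made for illustration.

```python
from collections import defaultdict


def shared_subclonal_variants(samples, min_freq=0.01, max_freq=0.5):
    """Map each variant to the set of patients carrying it subclonally.

    samples: dict of patient id -> dict of variant label -> intrahost frequency.
    A variant counts as subclonal when min_freq <= frequency < max_freq.
    The 50% ceiling follows the abstract; min_freq is a hypothetical noise floor.
    Only variants subclonal in two or more patients are returned.
    """
    carriers = defaultdict(set)
    for patient, variants in samples.items():
        for variant, freq in variants.items():
            if min_freq <= freq < max_freq:
                carriers[variant].add(patient)
    return {v: p for v, p in carriers.items() if len(p) >= 2}


# Hypothetical toy data, not taken from the outbreak data set.
samples = {
    "p1": {"pos100:A>G": 0.12, "pos200:C>T": 0.95},
    "p2": {"pos100:A>G": 0.30, "pos200:C>T": 0.40},
    "p3": {"pos200:C>T": 0.05, "pos300:G>A": 0.005},
}
ssvs = shared_subclonal_variants(samples)
print(ssvs)
```

Here `pos200:C>T` is at consensus frequency in `p1` but subclonal in `p2` and `p3`, so it still links those two patients even though a consensus-only comparison would not single them out.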

Reconstructing gene content in the last common ancestor of cellular life: is it possible, should it be done, and are we making any progress?

Arcady Mushegian
doi: http://dx.doi.org/10.1101/013326

I review recent literature on the reconstruction of gene repertoire of the Last Universal Common Ancestor of cellular life (LUCA). The form of the phylogenetic record of cellular life on Earth is important to know in order to reconstruct any ancestral state; therefore I also discuss the emerging understanding that this record does not take the form of a tree. I argue that despite this, “tree-thinking” remains an essential component in evolutionary thinking and that “pattern pluralism” in evolutionary biology can be only epistemological, but not ontological.

The evolutionarily stable distribution of fitness effects

Daniel P. Rice, Benjamin H. Good, Michael M. Desai
doi: http://dx.doi.org/10.1101/013052

The distribution of fitness effects of new mutations (the DFE) is a key parameter in determining the course of evolution. This fact has motivated extensive efforts to measure the DFE or to predict it from first principles. However, just as the DFE determines the course of evolution, the evolutionary process itself constrains the DFE. Here, we analyze a simple model of genome evolution in a constant environment in which natural selection drives the population toward a dynamic steady state where beneficial and deleterious substitutions balance. The distribution of fitness effects at this steady state is stable under further evolution, and provides a natural null expectation for the DFE in a population that has evolved in a constant environment for a long time. We calculate how the shape of the evolutionarily stable DFE depends on the underlying population genetic parameters. We show that, in the absence of epistasis, the ratio of beneficial to deleterious mutations of a given fitness effect obeys a simple relationship independent of population genetic details. Finally, we analyze how the stable DFE changes in the presence of a simple form of diminishing returns epistasis.

DNA-guided establishment of canonical nucleosome patterns in a eukaryotic genome

Leslie Y Beh, Noam Kaplan, Manuel M Muller, Tom W Muir, Laura F Landweber
doi: http://dx.doi.org/10.1101/013250

A conserved hallmark of eukaryotic chromatin architecture is the distinctive array of well-positioned nucleosomes downstream of transcription start sites (TSS). Recent studies indicate that trans-acting factors establish this stereotypical array. Here, we present the first genome-wide in vitro and in vivo nucleosome maps for the ciliate Tetrahymena thermophila. In contrast with previous studies in yeast, we find that the stereotypical nucleosome array is preserved in the in vitro reconstituted map, which is governed only by the DNA sequence preferences of nucleosomes. Remarkably, this average in vitro pattern arises from the presence of subsets of nucleosomes, rather than the whole array, in individual Tetrahymena genes. Variation in GC content contributes to the positioning of these sequence-directed nucleosomes, and affects codon usage and amino acid composition in genes. We propose that these ‘seed’ nucleosomes may aid the AT-rich Tetrahymena genome – which is intrinsically unfavorable for nucleosome formation – in establishing nucleosome arrays in vivo in concert with trans-acting factors, while minimizing changes to the coding sequences they are embedded within.

Model Inadequacy and Mistaken Inferences of Trait-Dependent Speciation

Daniel L. Rabosky, Emma E. Goldberg
(Submitted on 22 Dec 2014)

Species richness varies widely across the tree of life, and there is great interest in identifying ecological, geographic, and other factors that affect rates of species proliferation. Recent methods for explicitly modeling the relationships among character states, speciation rates, and extinction rates on phylogenetic trees – BiSSE, QuaSSE, GeoSSE, and related models – have been widely used to test hypotheses about character state-dependent diversification rates. Here, we document the disconcerting ease with which neutral traits are inferred to have statistically significant associations with speciation rate. We first demonstrate this unfortunate effect for a known model assumption violation: shifts in speciation rate associated with a character not included in the model. We further show that for many empirical phylogenies, characters simulated in the absence of state-dependent diversification exhibit an even higher Type I error rate, indicating that the method is susceptible to additional, unknown model inadequacies. For traits that evolve slowly, the root cause appears to be a statistical framework that does not require replicated shifts in character state and diversification. However, spurious associations between character state and speciation rate arise even for traits that lack phylogenetic signal, suggesting that phylogenetic pseudoreplication alone cannot fully explain the problem. The surprising severity of this phenomenon suggests that many trait-diversification relationships reported in the literature may not be real. More generally, we highlight the need for diagnosing and understanding the consequences of model inadequacy in phylogenetic comparative methods.

Marker-based estimation of heritability in immortal populations

Willem Kruijer, Martin Boer, Marcos Malosetti, Padraic J. Flood, Bas Engel, Rik Kooke, Joost Keurentjes, Fred van Eeuwijk
(Submitted on 21 Dec 2014)

Heritability is a central parameter in quantitative genetics, both from an evolutionary and a breeding perspective. For plant traits, heritability is traditionally estimated by comparing within- and between-genotype variability. This approach estimates broad-sense heritability, and does not account for different genetic relatedness. With the availability of high-density markers there is growing interest in marker-based estimates of narrow-sense heritability, using mixed models in which genetic relatedness is estimated from genetic markers. Such estimates have received much attention in human genetics but are rarely reported for plant traits. A major obstacle is that current methodology and software assume a single phenotypic value per genotype, hence requiring genotypic means. An alternative that we propose here is to use mixed models at individual plant or plot level. Using statistical arguments, simulations and real data we investigate the feasibility of both approaches, and how these affect genomic prediction with G-BLUP and genome-wide association studies. Heritability estimates obtained from genotypic means had very large standard errors and were sometimes biologically unrealistic. Mixed models at individual plant or plot level produced more realistic estimates, and for simulated traits standard errors were up to 13 times smaller. Genomic prediction was also improved by using these mixed models, with up to a 49% increase in accuracy. For GWAS on simulated traits, the use of individual plant data gave almost no increase in power. The new methodology is applicable to any complex trait where multiple replicates of individual genotypes can be scored. This includes important agronomic crops, as well as bacteria and fungi.
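
For readers unfamiliar with the traditional approach mentioned at the start of the abstract, the sketch below estimates broad-sense heritability from a balanced one-way ANOVA on replicated genotypes. It illustrates only that classical estimator, not the marker-based mixed models the authors propose. The function name and the toy phenotype values are assumptions, and the estimate is expressed at the level of a single plant or plot.

```python
def broad_sense_heritability(replicates):
    """Estimate H^2 = var_G / (var_G + var_E) from a balanced one-way ANOVA.

    replicates: dict of genotype -> list of phenotype values, with the same
    number of replicates (at least two) for every genotype.
    """
    groups = list(replicates.values())
    g = len(groups)       # number of genotypes
    r = len(groups[0])    # replicates per genotype
    if g < 2 or r < 2 or any(len(v) != r for v in groups):
        raise ValueError("need a balanced design with >= 2 genotypes and >= 2 replicates")

    grand_mean = sum(sum(v) for v in groups) / (g * r)
    means = [sum(v) / r for v in groups]

    # Mean squares between and within genotypes.
    ms_between = r * sum((m - grand_mean) ** 2 for m in means) / (g - 1)
    ms_within = sum((x - m) ** 2 for v, m in zip(groups, means) for x in v) / (g * (r - 1))

    # Method-of-moments variance components; negative estimates are set to 0.
    var_g = max((ms_between - ms_within) / r, 0.0)
    var_e = ms_within
    if var_g + var_e == 0:
        raise ValueError("no phenotypic variance in the data")
    return var_g / (var_g + var_e)


# Hypothetical toy data: three genotypes, two replicates each.
data = {"A": [9.0, 11.0], "B": [19.0, 21.0], "C": [29.0, 31.0]}
h2 = broad_sense_heritability(data)
print(round(h2, 4))
```

Because this estimator treats genotypes as unrelated groups, it ignores genetic relatedness, which is the limitation the abstract raises.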

Bias in Estimators of Archaic Admixture

Alan R. Rogers, Ryan J. Bohlender
(Submitted on 20 Dec 2014)

This article evaluates bias in one class of methods used to estimate archaic admixture in modern humans. These methods study the pattern of allele sharing among modern and archaic genomes. They are sensitive to “ghost” admixture, which occurs when a population receives archaic DNA from sources not acknowledged by the statistical model. The effect of ghost admixture depends on two factors: branch-length bias and population-size bias. Branch-length bias occurs because a given amount of admixture has a larger effect if the two populations have been separated for a long time. Population-size bias occurs because differences in population size distort branch lengths in the gene genealogy. In the absence of ghost admixture, these effects are small. They become important, however, in the presence of ghost admixture. Estimators differ in the pattern of response. Increasing a given parameter may inflate one estimator but deflate another. For this reason, comparisons among estimators are informative. Using such comparisons, this article supports previous findings that the archaic population was small and that Europeans received little gene flow from archaic populations other than Neanderthals. It also identifies an inconsistency in estimates of archaic admixture into Melanesia.

Scaling probabilistic models of genetic variation to millions of humans

Prem Gopalan, Wei Hao, David M. Blei, John D. Storey
doi: http://dx.doi.org/10.1101/013227

A major goal of population genetics is to quantitatively understand variation of genetic polymorphisms among individuals. Researchers have developed sophisticated statistical methods to capture the complex population structure that underlies observed genotypes in humans. The number of humans that have been densely genotyped across the genome has grown significantly in recent years. In aggregate about 1M individuals have been densely genotyped to date, and if we could analyze this data then we would have a nearly complete picture of human genetic variation. Existing state-of-the-art methods, however, cannot scale to data of this size. To this end, we have developed TeraStructure. TeraStructure is a new algorithm to fit Bayesian models of genetic variation in human populations on tera-sample-sized data sets (10¹² observed genotypes, e.g., 1M individuals at 1M SNPs). It is a principled approach to approximate Bayesian inference that iterates between subsampling locations of the genome and updating an estimate of the latent population structure. On real and simulated data sets of up to 10K individuals, TeraStructure is twice as fast as existing methods and recovers the latent population structure with equal accuracy. On genomic data simulated at the tera-sample-size scales, TeraStructure continues to be accurate and is the only method that can complete its analysis.

A FISH-based chromosome map for the European corn borers yields insights into ancient chromosomal fusions in the silkworm.

Yuji Yasukochi, Mizuki Ohno, Fukashi Shibata, Akiya Jouraku, Ryo Nakano, Yukio Ishikawa, Ken Sahara
doi: http://dx.doi.org/10.1101/013284

A significant feature of the genomes of Lepidoptera, butterflies and moths, is the high conservation of chromosome organization. Recent remarkable progress in genome sequencing of Lepidoptera has revealed that syntenic gene order is extensively conserved across phylogenetically distant species. The ancestral karyotype of Lepidoptera is thought to be n = 31; however, that of the most well studied moth, Bombyx mori, is n = 28, suggesting that three chromosomal fusion events occurred in this lineage. To identify the boundaries between predicted ancient fusions involving B. mori chromosomes 11, 23 and 24, we constructed FISH-based chromosome maps of the European corn borer, Ostrinia nubilalis (n = 31). We first determined 511 Mb of genomic sequence of the Asian corn borer, Ostrinia furnacalis, a congener of O. nubilalis, and isolated BAC and fosmid clones that were expected to localize in candidate regions for the boundaries using these sequences. Combined with FISH and genetic analysis, we narrowed down the candidate regions to 40 kb – 1.5 Mb, in strong agreement with a previous estimate based on the genome of a butterfly, Melitaea cinxia. The significant difference in the lengths of the candidate regions where no functional genes were observed may reflect the evolutionary time after fusion events.