The epigenome of evolving Drosophila neo-sex chromosomes: dosage compensation and heterochromatin formation

The epigenome of evolving Drosophila neo-sex chromosomes: dosage compensation and heterochromatin formation
Qi Zhou, Christopher E. Ellison, Vera B. Kaiser, Artyom A. Alekseyenko, Andrey A. Gorchakov, Doris Bachtrog
(Submitted on 26 Sep 2013)

Drosophila Y chromosomes are composed entirely of silent heterochromatin, while male X chromosomes have highly accessible chromatin and are hypertranscribed due to dosage compensation. Here, we dissect the molecular mechanisms and functional pressures driving heterochromatin formation and dosage compensation of the recently formed neo-sex chromosomes of Drosophila miranda. We show that the onset of heterochromatin formation on the neo-Y is triggered by an accumulation of repetitive DNA. The neo-X has evolved partial dosage compensation and we find that diverse mutational paths have been utilized to establish several dozen novel binding consensus motifs for the dosage compensation complex on the neo-X, including simple point mutations at pre-binding sites, insertion and deletion mutations, microsatellite expansions, or tandem amplification of weak binding sites. Spreading of these silencing or activating chromatin modifications to adjacent regions results in massive mis-expression of neo-sex linked genes, and little correspondence between functionality of genes and their silencing on the neo-Y or dosage compensation on the neo-X. Intriguingly, the genomic regions being targeted by the dosage compensation complex on the neo-X and those becoming heterochromatic on the neo-Y show little overlap, possibly reflecting different propensities along the ancestral chromosome to adopt active or repressive chromatin configurations. Our findings have broad implications for current models of sex chromosome evolution, and demonstrate how mechanistic constraints can limit evolutionary adaptations. Our study also highlights how evolution can follow predictable genetic trajectories, by repeatedly acquiring the same 21-bp consensus motif for recruitment of the dosage compensation complex, yet utilizing a diverse array of random mutational changes to attain the same phenotypic outcome.

A computational model for histone mark propagation reproduces the distribution of heterochromatin in different human cell types

A computational model for histone mark propagation reproduces the distribution of heterochromatin in different human cell types
Veit Schwämmle, Ole Nørregaard Jensen
(Submitted on 27 Sep 2013)

Chromatin is a highly compact and dynamic nuclear structure that consists of DNA and associated proteins. The main organizational unit is the nucleosome, which consists of a histone octamer with DNA wrapped around it. Histone proteins are implicated in the regulation of eukaryote genes and they carry numerous reversible post-translational modifications that control DNA-protein interactions and the recruitment of chromatin binding proteins. Heterochromatin, the transcriptionally inactive part of the genome, is densely packed and contains histone H3 that is methylated at Lys 9 (H3K9me). The propagation of H3K9me in nucleosomes along the DNA in chromatin is antagonizing by methylation of H3 Lysine 4 (H3K4me) and acetylations of several lysines, which is related to euchromatin and active genes. We show that the related histone modifications form antagonized domains on a coarse scale. These histone marks are assumed to be initiated within distinct nucleation sites in the DNA and to propagate bi-directionally. We propose a simple computer model that simulates the distribution of heterochromatin in human chromosomes. The simulations are in agreement with previously reported experimental observations from two different human cell lines. We reproduced different types of barriers between heterochromatin and euchromatin providing a unified model for their function. The effect of changes in the nucleation site distribution and of propagation rates were studied. The former occurs mainly with the aim of (de-)activation of single genes or gene groups and the latter has the power of controlling the transcriptional programs of entire chromosomes. Generally, the regulatory program of gene transcription is controlled by the distribution of nucleation sites along the DNA string.

Reliable reconstruction of HIV-1 whole genome haplotypes reveals clonal interference and genetic hitchhiking among immune escape variants

Reliable reconstruction of HIV-1 whole genome haplotypes reveals clonal interference and genetic hitchhiking among immune escape variants
Aridaman Pandit, Rob J de Boer
(Submitted on 26 Sep 2013)

Following transmission, HIV-1 evolves into a diverse population, and next generation sequencing enables us to detect variants occurring at low frequencies. Studying viral evolution at the level of whole genomes was hitherto not possible because next generation sequencing delivers relatively short reads. We here provide a proof of principle that whole HIV-1 genomes can be reliably reconstructed from short reads, and use this to study the selection of immune escape mutations at the level of whole genome haplotypes. Using realistically simulated HIV-1 populations, we demonstrate that reconstruction of complete genome haplotypes is feasible with high fidelity. We do not reconstruct all genetically distinct genomes, but each reconstructed haplotype represents one or more of the quasispecies in the HIV-1 population. We then reconstruct 30 whole genome haplotypes from published short sequence reads sampled longitudinally from a single HIV-1 infected patient. We confirm the reliability of the reconstruction by validating our predicted haplotype genes with single genome amplification sequences, and by comparing haplotype frequencies with observed epitope escape frequencies. Phylogenetic analysis shows that the HIV-1 population undergoes selection driven evolution, with successive replacement of the viral population by novel dominant strains. We demonstrate that immune escape mutants evolve in a dependent manner with various mutations hitchhiking along with others. As a consequence of this clonal interference, selection coefficients have to be estimated for complete haplotypes and not for individual immune escapes.

Neutral genomic regions refine models of recent rapid human population growth

Neutral genomic regions refine models of recent rapid human population growth
Elodie Gazave, Li Ma, Diana Chang, Alex Coventry, Feng Gao, Donna Muzny, Eric Boerwinkle, Richard Gibbs, Charles F. Sing, Andrew G. Clark, Alon Keinan
(Submitted on 25 Sep 2013)

Human populations have experienced dramatic growth since the Neolithic revolution. Recent studies that sequenced a very large number of individuals observed an extreme excess of rare variants, and provided clear evidence of recent rapid growth in effective population size, though estimates have varied greatly among studies. As medical applications drove the datasets therein, all studies were based on protein-coding genes, in which variants are also impacted by natural selection. In this study, we introduce targeted sequencing data for studying recent human history with minimal confounding by natural selection. We sequenced putatively neutral loci that are very far from genes and that meet a wide array of additional criteria. As population structure also skews allele frequencies, we sequenced a sample of relatively homogeneous ancestry by first analyzing the population structure of 9,716 European Americans. We employed very high coverage sequencing to reliably call rare variants, and fit an extensive array of models of recent European demographic history to the site frequency spectrum. The best-fit model estimates ~3.4% growth per generation during the last ~140 generations, resulting in a population size increase of two orders of magnitude. This model fits the data very well, largely due to our observation that assumptions of more ancient demography can impact estimates of recent growth. This observation and results also shed light on the discrepancy in demographic estimates among recent studies.

Some background on Bhaskar and Song’s paper: “The identifiability of piecewise demographic models from the sample frequency spectrum”

This post is by Graham Coop [@Graham_Coop], and is an introduction to the background of Anand Bhaskar and Yun Song preprint: “The identifiability of piecewise demographic models from the sample frequency spectrum”. arXived here.

Anand and Yun’s preprint focuses on what we can learn about population demographic history from the site frequency spectrum. It takes as its starting point an article by Myers et al (Myers, S., Fefferman, C., and Patterson, N. (2008) Can one learn history from the allelic spectrum? Theoretical Population Biology 73, 342–348.). I wrote about Myers et al’s article back in 2008, when it came out (see here). I thought I’d post an edited version of that post by way of additional background to Anand and Yun’s preprint and guest post.

Edited version of original post:
The best way to learn about demography from population genetic data is to look at patterns of diversity across many unlinked regions. The distribution of frequencies in a populations of unlinked neutral alleles at SNPs (the site frequency spectrum) is potentially very informative about population history. For example an excess of low frequency mutations is consistent with recent population growth, as the increase in population size introduces new mutations but these mutations have not yet had time to drift to higher frequencies. Many authors have made use of the frequency spectrum of unlinked, putatively neutral SNPs to learn about demography.

Back in 2008 a technical but elegant article by Myers et al shows that while informative about demography, the site frequency spectrum at unlinked SNPs can not help you chose between certain demographic histories. This is not a question of imperfect knowledge of the site frequency spectrum (which more data would solve) but because for any particular demographic model, as Myers et al formally show, there are a large family of demographic histories that can give rise to the same site frequency spectrum. They explained: ‘Informally, changes in population size at some past time are canceled out by other changes in the opposite direction’. I think that this lack of information comes from the fact that each unlinked SNP only tells you about the placement of a single mutation on the genealogy of the population at that site, and over sites you learn about the expected amount of time in different parts of the genealogy. By fluctuating the population size in just the right way, i.e. speeding up and slowing down the rate of coalescence, we can get very different population histories to give us the same expected coalescent times.

For example Myers et al showed that the population size history below, gives the same population frequency spectrum as a a constant population
1-s2.0-S0040580908000038-gr1
Figure taken from Myers, S., Fefferman, C., and Patterson, N. (2008) Can one learn history from the allelic spectrum? Theoretical Population Biology 73, 342–348.

That’s somewhat worrying as it says that when we fit a model of population size changes over time there are actually a family of quite different looking population histories that would give us just as good a fit to the data. However, these alternative histories do look quite strange and may not be biologically reasonable.

Anand and Yun’s article takes this as their starting place. They show that if we write our population history as a series of piecewise functions that the parameters of these functions are identifable, and provide simple estimates of the sample size needed. You can read more about their results in their guest post here at Haldane’s sieve.

Chaos and Unpredictability in Evolution

Chaos and Unpredictability in Evolution
Iaroslav Ispolatov, Michael Doebeli
(Submitted on 24 Sep 2013)

The possibility of complicated dynamic behaviour driven by non-linear feedbacks in dynamical systems has revolutionized science in the latter part of the last century. Yet despite examples of complicated frequency dynamics, the possibility of long-term evolutionary chaos is rarely considered. The concept of “survival of the fittest” is central to much evolutionary thinking and embodies a perspective of evolution as a directional optimization process exhibiting simple, predictable dynamics. This perspective is adequate for simple scenarios, when frequency-independent selection acts on scalar phenotypes. However, in most organisms many phenotypic properties combine in complicated ways to determine ecological interactions, and hence frequency-dependent selection. Therefore, it is natural to consider models for the evolutionary dynamics generated by frequency-dependent selection acting simultaneously on many different phenotypes. Here we show that complicated, chaotic dynamics of long-term evolutionary trajectories in phenotype space is very common in a large class of such models when the dimension of phenotype space is large, and when there are epistatic interactions between the phenotypic components. Our results suggest that the perspective of evolution as a process with simple, predictable dynamics covers only a small fragment of long-term evolution. Our analysis may also be the first systematic study of the occurrence of chaos in multidimensional and generally dissipative systems as a function of the dimensionality of phase space.

Fast Inference of Admixture Coefficients Using Sparse Non-negative Matrix Factorization Algorithms

Fast Inference of Admixture Coefficients Using Sparse Non-negative Matrix Factorization Algorithms
Eric Frichot, François Mathieu, Théo Trouillon, Guillaume Bouchard, Olivier François
(Submitted on 24 Sep 2013)

Inference of individual admixture coefficients, which is important for population genetic and association studies, is commonly performed using compute-intensive likelihood algorithms. With the availability of large population genomic data sets, fast versions of likelihood algorithms have attracted considerable attention. Reducing the computational burden of estimation algorithms remains, however, a major challenge. Here, we present a fast and efficient method for estimating individual admixture coefficients based on sparse non-negative matrix factorization algorithms. We implemented our method in the computer program sNMF, and applied it to human and plant genomic data sets. The performances of sNMF were then compared to the likelihood algorithm implemented in the computer program ADMIXTURE. Without loss of accuracy, sNMF computed estimates of admixture coefficients within run-times approximately 10 to 30 times faster than those of ADMIXTURE.

Characterizing the infection-induced transcriptome of Nasonia vitripennis reveals a preponderance of taxonomically-restricted immune genes

Characterizing the infection-induced transcriptome of Nasonia vitripennis reveals a preponderance of taxonomically-restricted immune genes
Timothy B. Sackton, John H. Werren, Andrew G. Clark
(Submitted on 23 Sep 2013)

The innate immune system in insects consists of a conserved core signaling network and rapidly diversifying effector and recognition components, often containing a high proportion of taxonomically-restricted genes. In the absence of functional annotation, genes encoding immune system proteins can thus be difficult to identify, as homology-based approaches generally cannot detect lineage-specific genes. Here, we use RNA-seq to compare the uninfected and infection-induced transcriptome in the parasitoid wasp Nasonia vitripennis to identify genes regulated by infection. We identify 183 genes significantly up-regulated by infection and 61 genes significantly down-regulated by infection. We also produce a new homology-based immune catalog in N. vitripennis, and show that most infection-induced genes are not assigned an immune function from homology alone, suggesting the potential for substantial novel immune components in less-well-studied systems. Finally, we show that a high proportion of these novel induced genes are taxonomically-restricted, highlighting the rapid evolution of immune gene content. The combination of functional annotation using RNA-seq and homology-based annotation provides a robust method to characterize the innate immune response across a wide variety of insects, and reveals significant novel features of the Nasonia immune response.

Author post: The identifiability of piecewise demographic models from the sample frequency spectrum

This guest post is by Anand Bhaskar and Yun Song on their paper: “The identifiability of piecewise demographic models from the sample frequency spectrum”. arXived here.

With the advent of high-throughput sequencing technologies, it has been of great interest to use genomic data to understand human demographic history. For example, we now estimate that modern humans migrated out of Africa around 60K-120K years ago [1,2], and that Neandertals may have admixed with modern humans in Europe as recently as 47,000 years ago [3]. Apart from satisfying curiosity about our anthropological history, the inference of demography is important for several scientific reasons. Most importantly, demographic processes influence genetic variation, and understanding the interplay between natural selection, genetic drift, and demography is a key question in population genetics. Also, controlling for demography is important for practical applications. For example, the demography inferred from neutrally evolving genomic regions can serve as a null model when searching for regions under selection. Demographic models could also be used to circumvent the problem of spurious associations in case-control studies induced by population substructure.

A summary of whole haplotypes that is commonly used in population genetic analyses is the sample frequency spectrum (SFS). For a sample of n haplotypes from a panmictic (i.e. without substructure) population, the SFS is an (n-1)-dimensional vector where the i-th entry is the proportion of SNPs with i copies of the mutant allele in the sample. One can talk about a mutant/derived allele because most analyses assume that mutations are rare enough that the observed SNPs are dimorphic. The first few entries of the SFS capture the proportion of rare SNPs in the sample and are especially useful for inferring recent population history. Several recent large sample sequencing studies [4-6] have found that humans have many more putatively neutral rare SNPs compared to predictions from a constant population size model. Using the SFS from their data, these studies all infer demographic models with recent exponential population expansion.

However, until fairly recently, it was not known whether the SFS of a sample uniquely determines the underlying demographic model. Could it be possible that two different demographic models produce the exact same expected SFS for all sample sizes? In 2008, Simon Myers, Charles Fefferman, and Nick Patterson came up with an elegant mathematical argument [7] to show that there are infinitely many population size histories that generate the same expected SFS for all sample sizes. They even provided an explicit example of a population size history which produced the same expected SFS as a constant population size model. However, their example history had increasingly rapid oscillations in the population size in the recent past, something that we might not expect to find in real biological populations. After all, even though we commonly use continuous-time models of evolution like coalescent theory and diffusion processes, biological populations evolve in discrete events of birth and death.

Our research group has been working on demographic inference from the SFS and from full sequence data for the last several years, and so it was natural for us to ask whether the class of population size histories that are commonly inferred using statistical algorithms might also suffer from this non-identifiability problem. Most statistical methods infer piecewise population size histories, where the pieces come from some biologically-motivated family of functions. In particular, piecewise constant and piecewise exponential models commonly appear in the literature. And if one can indeed uniquely identify piecewise demographic models from the SFS, what sample sizes are needed to do so?

In our paper, we address this question by proving that if the underlying population size function is piecewise with at most K pieces, then the expected SFS of a random sample of size n uniquely determines the demography as long as n is larger than some function of K that depends on the type of pieces of the population size function. For example, if the underlying demographic model was piecewise constant with at most K pieces (i.e. described by at most 2K – 1 parameters), then the expected SFS of a sample of size 2K uniquely determines the demographic model. In other words, no two piecewise constant population size functions with at most K pieces can generate the same expected SFS for a sample of size 2K or larger. For piecewise exponential demographic models with at most K pieces, a sample size of 4K – 1 is sufficient to uniquely determine the demographic model. When one doesn’t know which allele is ancestral and which is derived (for example, if outgroup information is lacking at the relevant SNPs), demographic analysis can still be carried out using the SFS by “folding” it. The folded SFS has floor(n/2) entries, where the i-th entry is the proportion of SNPs with i copies of the minor allele (which might be an ancestral or derived allele). Since the folded SFS has only roughly half the dimension as the full SFS, one might expect to require twice as many samples to uniquely determine the demographic model from the folded SFS compared to the full SFS. We formally prove in our paper that this intuition is indeed correct.

It is important to stress that this identifiability result is statistical rather than algorithmic in that that one would need to have perfect information about the expected SFS of a random sample in order to uniquely determine the underlying piecewise demography. In practice, one can get good estimates of the expected SFS by considering a large number of SNPs in the inference procedure, and by considering SNPs that are farther apart along the chromosomes so that the coalescent trees for the sample at different SNPs will be roughly independent of each other. More work is certainly needed to understand how much genomic data (measured both in terms of the number of SNPs and the sample size) would be needed in practice to robustly infer realistic demographic models.

Works cited:

[1] Li, H. and Durbin, R. (2011) Inference of human population history from individual whole-genome sequences. Nature 475, 493–496.

[2] Scally, A. and Durbin, R. (2012). Revising the human mutation rate: implications for understanding human evolution. Nature Reviews Genetics, 13(10), 745-753.

[3] Sankararaman, S., Patterson, N., Li, H., Pääbo, S., and Reich, D. (2012) The date of interbreeding between Neandertals and modern humans. PLoS Genetics 8, e1002947.

[4] Nelson, Matthew R., et al. (2012) An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science 337, 100–104.

[5] Tennessen, Jacob A., et al. (2012) Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69.

[6] Fu, Wenqing, et al. (2012) Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493, 216–220.

[7] Myers, S., Fefferman, C., and Patterson, N. (2008) Can one learn history from the allelic spectrum? Theoretical Population Biology 73, 342–348.

mTim: Rapid and accurate transcript reconstruction from RNA-Seq data


mTim: Rapid and accurate transcript reconstruction from RNA-Seq data

Georg Zeller, Nico Goernitz, Andre Kahles, Jonas Behr, Pramod Mudrakarta, Soeren Sonnenburg, Gunnar Raetsch
(Submitted on 20 Sep 2013)

Recent advances in high-throughput cDNA sequencing (RNA-Seq) technology have revolutionized transcriptome studies. A major motivation for RNA-Seq is to map the structure of expressed transcripts at nucleotide resolution. With accurate computational tools for transcript reconstruction, this technology may also become useful for genome (re-)annotation, which has mostly relied on de novo gene finding where gene structures are primarily inferred from the genome sequence. We developed a machine-learning method, called mTim (margin-based transcript inference method) for transcript reconstruction from RNA-Seq read alignments that is based on discriminatively trained hidden Markov support vector machines. In addition to features derived from read alignments, it utilizes characteristic genomic sequences, e.g. around splice sites, to improve transcript predictions. mTim inferred transcripts that were highly accurate and relatively robust to alignment errors in comparison to those from Cufflinks, a widely used transcript assembly method.