Identifying adaptive and plastic gene expression levels using a unified model for expression variance between and within species

Rori Rohlfs, Rasmus Nielsen

Thanks to the reduced cost of RNA-Sequencing and other advanced methods for quantifying expression levels, accurate and expansive comparative expression data sets including data from multiple individuals per species are emerging. Comparative genomics has been greatly facilitated by the availability of statistical methods considering both between and within species variation for testing hypotheses regarding the evolution of DNA sequences. Similar methods are now needed to fully leverage comparative expression data. In this paper, we describe the β model which parameterizes the ratio of population to evolutionary expression variance, facilitating a wide variety of analyses, including a test for expression divergence or diversity for a single gene or a class of genes. The β model can also be used to test for lineage-specific shifts in expression level, amongst other applications. We use simulations to explore the functionality of these tests under a variety of circumstances. We then apply them to a mammalian phylogeny of 15 species typed in liver tissue. We identify genes with high expression divergence between species as candidates for expression level adaptation, and genes with high expression diversity within species as candidates for expression level conservation and plasticity. Using the test for lineage-specific expression shifts, we identify several candidate genes for expression level adaptation on the catarrhine and human lineages, including genes possibly related to dietary changes in humans. We compare these results to those reported previously using the species mean model which ignores population expression variance, uncovering important differences in performance.

Model adequacy and the macroevolution of angiosperm functional traits

Matthew Pennell, Richard G FitzJohn, William K Cornwell, Luke J Harmon

All models are wrong and sometimes even the best of a set of models is useless. Modern phylogenetic comparative methods (PCMs) are almost exclusively model–based and therefore making robust inferences from PCMs requires using a model of trait evolution that is a good explanation for the data. To date, researchers using PCMs have evaluated the explanatory power of a model only in terms of relative, not absolute, fit. Here we develop a general statistical framework for assessing the absolute fit, or adequacy, of phylogenetic models for the evolution of quantitative traits. We use our approach to test whether commonly used models are adequate descriptors of the macroevolutionary dynamics of real comparative data. We fit models of trait evolution to 337 comparative datasets covering three key Angiosperm functional traits and evaluated the absolute fit of the models to each dataset. Overall, the models we used are very inadequate for the evolution of these traits; this was true for many different groups and at many different scales. Furthermore, the relative support for a model had very little to do with its absolute adequacy. We argue that assessing model adequacy should be a key step in comparative analyses.

A chromatin structure based model accurately predicts DNA replication timing in human cells

Yevgeniy Gindin, Manuel S. Valenzuela, Mirit I. Aladjem, Paul S. Meltzer, Sven Bilke
Subjects: Subcellular Processes (q-bio.SC); Genomics (q-bio.GN)

The metazoan genome is replicated in precise cell lineage specific temporal order. However, the mechanism controlling this orchestrated process is poorly understood as no molecular mechanisms have been identified that actively regulate the firing sequence of genome replication. Here we develop a mechanistic model of genome replication capable of predicting, with accuracy rivaling experimental repeats, the empirically observed replication timing program in humans. In our model, replication is initiated in an uncoordinated (time-stochastic) manner at well-defined sites. The model contains, in addition to the choice of the genomic landmark that localizes initiation, only a single adjustable parameter of direct biological relevance: the number of replication forks. We find that DNase hypersensitive sites are optimal and independent determinants of DNA replication initiation. We demonstrate that the DNA replication timing program in human cells is a robust emergent phenomenon that, by its very nature, does not require a regulatory mechanism determining a proper replication initiation firing sequence.

Population genetics on islands connected by an arbitrary network: An analytic approach

George W A Constable, Alan J McKane
(Submitted on 11 Feb 2014)

We analyse a model consisting of a population of individuals which is subdivided into a finite set of demes, each of which has a fixed but differing number of individuals. The individuals can reproduce, die and migrate between the demes according to an arbitrary migration network. They are haploid, with two alleles present in the population; frequency independent selection is also incorporated, where the strength and direction of selection can vary from deme to deme. The system is formulated as an individual-based model, and the diffusion approximation systematically applied to express it as a set of nonlinear coupled stochastic differential equations. These can be made amenable to analysis through the elimination of fast-time variables. The resulting reduced model is analysed in a number of situations, including migration-selection balance leading to a polymorphic equilibrium of the two alleles, and an illustration of how the subdivision of the population can lead to non-trivial behaviour in the case where the network is a simple hub. The method we develop is systematic, may be applied to any network, and agrees well with the results of simulations in all cases studied and across a wide range of parameter values.

Discovering functional DNA elements using population genomic information: A proof of concept using human mtDNA

Daniel R. Schrider, Andrew D. Kern
Subjects: Populations and Evolution (q-bio.PE); Genomics (q-bio.GN)

Identifying the complete set of functional elements within the human genome would be a windfall for multiple areas of biological research including medicine, molecular biology, and evolution. Complete knowledge of function would aid in the prioritization of loci when searching for the genetic bases of disease or adaptive phenotypes. Because mutations that disrupt function are disfavored by natural selection, purifying selection leaves a detectable signature within functional elements; accordingly this signal has been exploited through the use of genomic comparisons of distantly related species. However, the functional complement of the genome changes extensively across time and between lineages; therefore, evidence of the current action of purifying selection is essential. Because the removal of deleterious mutations by natural selection also reduces within-species genetic diversity within functional loci, dense population genetic data have the potential to reveal genomic elements that are currently functional. Here we assess the potential of this approach using 16,411 human mitochondrial genomes. We show that the high density of polymorphism in this dataset precisely delineates regions experiencing purifying selection. Further, we show that the number of segregating alleles at a site is strongly correlated with its divergence across species after accounting for known mutational biases in human mtDNA. These two measures track one another at a remarkably fine scale across many loci, a correlation that is purely the result of natural selection. Our results demonstrate that genetic variation has the potential to reveal exactly which nucleotides in the genome are currently performing important functions and likely to have deleterious fitness effects when mutated. As more complete genomes are sequenced, similar power to reveal purifying selection may be achievable in the human nuclear genome.

Author post: Dynamic DNA Processing: A Microcode Model of Cell Differentiation

The following guest post is by Barry Jacobson on his preprint “Dynamic DNA Processing: A Microcode Model of Cell Differentiation”, arXived here.

The paper suggests that DNA should be viewed as a processor that operates by means of base-pairing with remote regions of the genome. If one sequence matches another (or is complementary to it) it will set up a structural loop, or other interaction. However, the paper postulates that at least one region of the genome of every cell will have a unique clock sequence that is shared by no other cell. Therefore, the clock of one cell may not match the same distant sequences as the clock of another. Thus, the pattern of loops that is formed, and the overall 3-D DNA structure, may differ from cell to cell. This will either assist or hinder binding of transcription factors in one type of cell, as compared to another, thus providing a mechanism of differential gene expression.

We discuss a method by which these differing clock sequences could be generated in cell division, so that the daughters each end up with a unique identifier. The identifier then unlocks certain conformations only for those cell types for which it is relevant. SNPs may function in a similar manner, by modifying 3-D configurations and thus altering TF activity.

We further postulate that if a clock or target is errantly mutated, so that it matches a target farther away than was intended, it may stretch the chromosome to the breaking point, and this is the cause of chromosomal breakage or translocations in cancer.

Finally, we allow for the possibility that a cell can modify its clock in response to the environment, such as when healing from trauma, or accepting a graft, in which case it needs to coordinate with neighboring cells. We suggest that perhaps chemical analogs of cell surface proteins may occasionally mistrigger such a clock modification, when none is necessary, and thereby cause incorrect matches and conformations in that cell, which can damage DNA, and lead to cancer, as before.

We realize this is all purely speculative, but we mention that we originally submitted this model to Nature without success 16 years ago. Since then, a number of its assumptions have been verified, as detailed in the recent submission to arXiv; we therefore believe it deserves a second look.

Author post: The evolution of sex differences in disease genetics

This guest post is by Ted Morrow, Jessica Abbott, and Will Gilks on their review paper Gilks et al. “The evolution of sex differences in disease genetics”

Our paper forms part of a research project (2Sexes_1Genome, 2012-16) devoted to investigating how sex-specific and sexually antagonistic selection influences the genome, and in particular whether genetic variants that are maintained as a result of these forms of selection could contribute to disease risk. We had three main aims with our paper, which we outline below together with a motivation for each.

Our first aim was to summarise evidence for sex-dependent genetic architecture in complex traits that were otherwise shared between the sexes. We focused particularly on disease phenotypes in humans, although a range of complex traits from diverse taxa were considered. The motivation for this was to establish a baseline for how widespread or rare sex-specific genetic architecture is. An important paper in this respect, published in Nature Reviews Genetics (Ober et al., 2008), specifically addressed the question of sex-specific genetic architecture in human diseases. It reviewed selected examples within the human disease genetics literature for sex-specific effects on a range of phenotypes. The authors concluded that studies where sex was ignored would miss some important variants that contribute to disease risk. While the Ober et al. (2008) paper makes a robust case for investigating sex as a factor in genetic analyses, several other genome-wide association studies in the primary literature have been published since, suggesting that an up-to-date review of these would be worthwhile. We did not intend to conduct a full-scale meta-analysis, although that would probably be a very informative exercise given potential problems in terms of reporting bias, non-independence of traits, and selection of traits with known sexual dimorphism. Nonetheless, a clear pattern emerges of widespread evidence of sex-specific genetic architecture based on heritability estimates (see Figure 1 in our paper), eQTLs, gene manipulations, expression studies, and SNPs with sex-by-genotype effects (see Table 1 in our paper). A recently published paper (not included in our review) even reports 10 out of 13 loci reaching genome-wide significance for recombination rate having sex-specific effects (Kong et al., 2013).

The second aim was to show how evolutionary theory could provide ultimate explanations for the origins of sex-specific genetic architecture. In this way, we propose that a deeper understanding of why genes cause disease, and why some common diseases show sexually dimorphic expression, may emerge. The evolutionary theory of why the sexes may differ phenotypically goes back to Darwin’s observations (1871) of how selection acts in males and females. He characterized males as active competitors, engaging in physical battles with rivals or investing in costly signals with which to woo potential mates. Females, on the other hand, were characterized as being coy and choosy. There is now good evidence that mate choice is not limited to females, and that sexual selection also operates well after copulation (i.e. sperm competition and cryptic female choice). The key point is that fundamental differences between the sexes occur in terms of investment in reproduction, and as a consequence the routes by which males and females may maximize their fitness are often different. In other words, both natural and sexual selection frequently take sex-specific forms in terms of strength and/or direction. The latter possibility, that selection acts antagonistically between the sexes, is well established in several laboratory and wild populations, including humans. From a human disease perspective, disease may occur as a result of an individual’s phenotypic difference (or departure) from an optimal phenotype (where a particular trait value has the greatest fitness). This difference could be the result of a genetic constraint imposed by an intersexual genetic correlation for that trait, or indirectly (i.e. pleiotropically) through genetic correlations with other traits. Sex-specific or sexually antagonistic selection could therefore maintain genetic variation within a population that is either less favourable or actually deleterious for one sex.
A recent model (Morrow & Connallon, 2013) shows how alleles with sex-specific or sexually antagonistic effects will contribute more to genetic variation for disease predisposition than alleles that are deleterious to both sexes in equal measure, and achieve higher allele frequencies. As a result, sexual dimorphism in the genetic architecture of complex polygenic diseases would emerge within the population. This evolutionary model clearly indicates that the search for loci contributing to disease risk in humans would benefit from exploring sex-specific genetic effects.

The final aim was to provide readers with an overview of the analytical options available for detecting sex-specific associations in genome-wide studies of complex diseases and phenotypes. As we show, more studies are investigating and discovering sex-dependent effects using GWAS data. Common strategies are to separate or stratify the samples within case and control groups by sex, or to model sex as a covariate. The first approach reduces the statistical power to detect sex-dependent effects, and thus only strong ones will be detected. The second simply controls for any sex-specific effects; it is not intended to identify them. We instead advocate the inclusion of a genotype-by-sex interaction term in statistical models, available as an option in some of the commonly used analytical platforms such as GenABEL and PLINK.
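To make the idea of a genotype-by-sex interaction term concrete, here is a minimal sketch in plain Python. It is an illustration only, not code from our paper and not how GenABEL or PLINK implement the option. It assumes a quantitative trait, additive genotype coding (0, 1 or 2 copies of an allele), sex coded as 0 or 1, and an ordinary least squares fit; the function names and the toy data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_genotype_by_sex(genotypes, sexes, phenotypes):
    """Least squares fit of phenotype ~ 1 + genotype + sex + genotype*sex.

    Returns [intercept, genotype effect, sex effect, interaction effect].
    """
    rows = [[1.0, g, s, g * s] for g, s in zip(genotypes, sexes)]
    k = 4
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, phenotypes)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy, noise-free data in which the allele has a larger effect in sex 1.
genotypes = [0, 1, 2, 0, 1, 2]
sexes = [0, 0, 0, 1, 1, 1]
phenotypes = [1.0 + 0.5 * g + 0.2 * s + 0.3 * g * s
              for g, s in zip(genotypes, sexes)]
coef = fit_genotype_by_sex(genotypes, sexes, phenotypes)
```

The last coefficient is the interaction effect: the difference in the per-allele effect between the sexes. Modelling sex only as a covariate corresponds to dropping the `g * s` column, which is why that strategy cannot reveal such a difference. A case-control analysis would use a logistic rather than a linear model, but the design matrix would have the same four columns.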

Overall, we hope our article raises the profile of sex-specific genetic effects, a topic that is already apparently receiving increasing interest judging by the recent crop of sex-specific associations appearing in the GWAS literature. This forms a more general theme within the field of human disease genetics, of exploring the impact of interaction effects, such as genotype-by-environment interactions. The identification of strong main effects has had successes but the debate over the ‘missing heritability’ of complex traits has prompted researchers to look beyond to more complex processes such as epistasis and environmental effects. We welcome any comments either here on Haldane’s Sieve or in the comments section of bioRxiv, where our article is currently posted.

2Sexes_1Genome. 2012-16. Edward H. Morrow. FP7 ERC Starting Grant – Evolutionary, population and environmental biology.
Darwin, C. 1871. The Descent of Man. Prometheus Books, New York.
Kong, A., Thorleifsson, G., Frigge, M.L., Masson, G., Gudbjartsson, D.F., Villemoes, R., et al. 2013. Common and low-frequency variants associated with genome-wide recombination rate. Nat. Genet. doi:10.1038/ng.2833.
Morrow, E.H. & Connallon, T. 2013. Implications of sex-specific selection for the genetic basis of disease. Evol. Appl. doi:10.1111/eva.12097.
Ober, C., Loisel, D.A. & Gilad, Y. 2008. Sex-specific genetic architecture of human disease. Nat Rev Genet 9: 911–922.

Author post: The identifiability of piecewise demographic models from the sample frequency spectrum

This guest post is by Anand Bhaskar and Yun Song on their paper: “The identifiability of piecewise demographic models from the sample frequency spectrum”. arXived here.

With the advent of high-throughput sequencing technologies, it has been of great interest to use genomic data to understand human demographic history. For example, we now estimate that modern humans migrated out of Africa around 60K-120K years ago [1,2], and that Neandertals may have admixed with modern humans in Europe as recently as 47,000 years ago [3]. Apart from satisfying curiosity about our anthropological history, the inference of demography is important for several scientific reasons. Most importantly, demographic processes influence genetic variation, and understanding the interplay between natural selection, genetic drift, and demography is a key question in population genetics. Also, controlling for demography is important for practical applications. For example, the demography inferred from neutrally evolving genomic regions can serve as a null model when searching for regions under selection. Demographic models could also be used to circumvent the problem of spurious associations in case-control studies induced by population substructure.

A summary of whole haplotypes that is commonly used in population genetic analyses is the sample frequency spectrum (SFS). For a sample of n haplotypes from a panmictic (i.e. without substructure) population, the SFS is an (n-1)-dimensional vector where the i-th entry is the proportion of SNPs with i copies of the mutant allele in the sample. One can talk about a mutant/derived allele because most analyses assume that mutations are rare enough that the observed SNPs are dimorphic. The first few entries of the SFS capture the proportion of rare SNPs in the sample and are especially useful for inferring recent population history. Several recent large sample sequencing studies [4-6] have found that humans have many more putatively neutral rare SNPs compared to predictions from a constant population size model. Using the SFS from their data, these studies all infer demographic models with recent exponential population expansion.
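As a small illustration of this definition, the sketch below computes the SFS from a toy set of haplotypes. The input encoding (one list of 0/1 values per haplotype, with 1 marking the derived allele) and the function name are assumptions made for this example, not something taken from our paper.

```python
from collections import Counter


def sample_frequency_spectrum(haplotypes):
    """Return the unfolded SFS as a list of n-1 proportions.

    haplotypes: list of n equal-length lists of 0/1, where 1 marks the
    derived allele. Entry i-1 of the result is the proportion of SNPs at
    which exactly i of the n haplotypes carry the derived allele.
    """
    n = len(haplotypes)
    counts = Counter()
    for site in zip(*haplotypes):  # iterate over sites (columns)
        derived = sum(site)
        if 0 < derived < n:  # keep only sites polymorphic in the sample
            counts[derived] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * (n - 1)
    return [counts[i] / total for i in range(1, n)]


# Four haplotypes, five sites; sites 3 and 5 are not polymorphic.
haps = [
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
sfs = sample_frequency_spectrum(haps)
```

Here two of the three SNPs are singletons and one has three derived copies, so the first and third entries of `sfs` carry all the weight.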

However, until fairly recently, it was not known whether the SFS of a sample uniquely determines the underlying demographic model. Could it be possible that two different demographic models produce the exact same expected SFS for all sample sizes? In 2008, Simon Myers, Charles Fefferman, and Nick Patterson came up with an elegant mathematical argument [7] to show that there are infinitely many population size histories that generate the same expected SFS for all sample sizes. They even provided an explicit example of a population size history which produced the same expected SFS as a constant population size model. However, their example history had increasingly rapid oscillations in the population size in the recent past, something that we might not expect to find in real biological populations. After all, even though we commonly use continuous-time models of evolution like coalescent theory and diffusion processes, biological populations evolve in discrete events of birth and death.

Our research group has been working on demographic inference from the SFS and from full sequence data for the last several years, and so it was natural for us to ask whether the class of population size histories that are commonly inferred using statistical algorithms might also suffer from this non-identifiability problem. Most statistical methods infer piecewise population size histories, where the pieces come from some biologically-motivated family of functions. In particular, piecewise constant and piecewise exponential models commonly appear in the literature. And if one can indeed uniquely identify piecewise demographic models from the SFS, what sample sizes are needed to do so?

In our paper, we address this question by proving that if the underlying population size function is piecewise with at most K pieces, then the expected SFS of a random sample of size n uniquely determines the demography as long as n is larger than some function of K that depends on the type of pieces of the population size function. For example, if the underlying demographic model is piecewise constant with at most K pieces (i.e. described by at most 2K – 1 parameters), then the expected SFS of a sample of size 2K uniquely determines the demographic model. In other words, no two piecewise constant population size functions with at most K pieces can generate the same expected SFS for a sample of size 2K or larger. For piecewise exponential demographic models with at most K pieces, a sample size of 4K – 1 is sufficient to uniquely determine the demographic model. When one doesn’t know which allele is ancestral and which is derived (for example, if outgroup information is lacking at the relevant SNPs), demographic analysis can still be carried out using the SFS by “folding” it. The folded SFS has floor(n/2) entries, where the i-th entry is the proportion of SNPs with i copies of the minor allele (which might be an ancestral or derived allele). Since the folded SFS has only roughly half the dimension of the full SFS, one might expect to require twice as many samples to uniquely determine the demographic model from the folded SFS compared to the full SFS. We formally prove in our paper that this intuition is indeed correct.
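The folding operation and the two sample sizes quoted above can be written down in a few lines. This is a sketch with hypothetical function names; it only restates the bounds given in this post for the unfolded SFS and does not attempt to encode the exact bounds for the folded case.

```python
def fold_sfs(sfs):
    """Fold an unfolded SFS of length n-1 into floor(n/2) minor-allele entries."""
    n = len(sfs) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        j = n - i
        # Sites with i derived copies and sites with n-i derived copies
        # both have minor allele count i; when i == n-i count them once.
        folded.append(sfs[i - 1] if i == j else sfs[i - 1] + sfs[j - 1])
    return folded


def sufficient_sample_size(k, pieces="constant"):
    """Sample size stated above as sufficient for at most k pieces (unfolded SFS)."""
    if pieces == "constant":
        return 2 * k
    if pieces == "exponential":
        return 4 * k - 1
    raise ValueError("pieces must be 'constant' or 'exponential'")


folded = fold_sfs([0.5, 0.25, 0.25])      # n = 4, so two folded entries
n_constant = sufficient_sample_size(3)    # piecewise constant, K = 3
n_exponential = sufficient_sample_size(3, "exponential")
```

For a sample of four haplotypes, the first folded entry pools SNPs with one or three derived copies, and the second holds SNPs with two.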

It is important to stress that this identifiability result is statistical rather than algorithmic in that one would need to have perfect information about the expected SFS of a random sample in order to uniquely determine the underlying piecewise demography. In practice, one can get good estimates of the expected SFS by considering a large number of SNPs in the inference procedure, and by considering SNPs that are farther apart along the chromosomes so that the coalescent trees for the sample at different SNPs will be roughly independent of each other. More work is certainly needed to understand how much genomic data (measured both in terms of the number of SNPs and the sample size) would be needed in practice to robustly infer realistic demographic models.

Works cited:

[1] Li, H. and Durbin, R. (2011) Inference of human population history from individual whole-genome sequences. Nature 475, 493–496.

[2] Scally, A. and Durbin, R. (2012). Revising the human mutation rate: implications for understanding human evolution. Nature Reviews Genetics, 13(10), 745-753.

[3] Sankararaman, S., Patterson, N., Li, H., Pääbo, S., and Reich, D. (2012) The date of interbreeding between Neandertals and modern humans. PLoS Genetics 8, e1002947.

[4] Nelson, Matthew R., et al. (2012) An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science 337, 100–104.

[5] Tennessen, Jacob A., et al. (2012) Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69.

[6] Fu, Wenqing, et al. (2012) Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493, 216–220.

[7] Myers, S., Fefferman, C., and Patterson, N. (2008) Can one learn history from the allelic spectrum? Theoretical Population Biology 73, 342–348.

Author Post: The Population Genetic Signature of Polygenic Local Adaptation

This guest post is by Jeremy Berg [@JeremyJBerg] and Graham Coop [@Graham_coop] on their paper The Population Genetic Signature of Polygenic Local Adaptation arXived here

The field of population genetics has devoted a lot of time to identifying signals of adaptation. These tests are usually predicated on the fact that local adaptation can drive large allele frequency changes between populations. However, we’ve known for almost a century that many traits are highly polygenic, so that adaptation can occur through subtle shifts in allele frequencies at many loci. Until now we’ve been unable to detect such signals, but genome-wide association studies (GWAS) now give us a way of potentially learning about selection on quantitative traits from population genetic data. In this paper we develop a set of approaches to do this in a robust population genetic framework.

GWAS usually assume a simple additive model, i.e. no epistasis/dominance, to test for and estimate effect sizes for a genome-wide set of loci. To test whether local adaptation has shaped the genetic basis of the trait, we do the perhaps boneheaded thing of taking the GWAS results at face value. For each population we simply sum up the product of the frequency at each GWAS SNP and the effect size of that SNP. This gives us an estimate of the mean additive genetic value for the phenotype in each population. This is not the mean phenotype of the population, as it ignores the fact that we don’t know all the variants affecting our trait, as well as environmental change across populations, gene-by-environment interactions, and changes in allele frequencies that have altered the dominance and epistatic relationships between alleles (i.e. all that good stuff that makes life interesting). However, these additive genetic values do have the very useful property that they are simple linear functions of the allele frequencies, which means that we can construct a simple and robust model of genetic drift causing these phenotypes to diverge across populations.
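The calculation described above is short enough to sketch directly. The population labels, frequencies and effect sizes below are made up for illustration, and the function name is hypothetical; any constant factor for ploidy is left out, following the description in this post.

```python
def mean_genetic_values(freqs_by_pop, effects):
    """Estimated mean additive genetic value for each population.

    freqs_by_pop: dict mapping population name to a list of allele
                  frequencies, one per GWAS SNP (same order as effects).
    effects:      list of GWAS effect sizes, one per SNP.
    """
    values = {}
    for pop, freqs in freqs_by_pop.items():
        if len(freqs) != len(effects):
            raise ValueError("one frequency per GWAS SNP is required")
        # Sum over SNPs of (allele frequency) x (effect size).
        values[pop] = sum(p * a for p, a in zip(freqs, effects))
    return values


effects = [0.5, -0.2, 0.1]  # made-up effect sizes for three SNPs
freqs_by_pop = {
    "pop_A": [0.2, 0.5, 0.9],
    "pop_B": [0.4, 0.5, 0.1],
}
genetic_values = mean_genetic_values(freqs_by_pop, effects)
```

Because each value is a linear function of the allele frequencies, a model for how frequencies drift across populations translates directly into a model for how these values drift.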


In Figure A we show our estimated genetic values using the human height GWAS of Lango Allen et al (2010). As you can see, populations show deviations around the global mean genetic value, and populations from the same geographic regions covary somewhat in the deviation they take, reflecting the fact that allele frequencies at each GWAS locus tend to covary in their shared genetic drift due to population history and migration. For example in Figure B we show allele frequencies at one of the GWAS height loci.


We can approximately model the allele frequencies at a single locus by assuming that they are multivariate normally distributed around the global mean. The covariance matrix of this distribution is given by a matrix closely related to the kinship matrix of our populations, which can be calculated from a genome-wide sample of putatively neutral loci. As our vector of phenotypic genetic values across populations is simply a weighted sum of the individual allele frequencies, our vector of genetic values also follows a multivariate normal distribution. Given that we are summing up a lot of loci, even if the multivariate normal model is a poor approximation to drift at one locus, the central limit theorem suggests that it should still be a good fit to the distribution of the genetic values.
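The step from single-locus allele frequencies to genetic values can be written out as follows. This is a sketch under stated assumptions rather than the exact formulation in our paper: the notation is introduced here for illustration, the loci are taken to drift independently, and the single-locus covariance is assumed to scale with the binomial variance term of the global mean frequency.

```latex
% Assumed notation: \vec{p}_\ell is the vector of population allele
% frequencies at GWAS locus \ell, \epsilon_\ell its global mean frequency,
% \alpha_\ell the effect size, and F the kinship-like covariance matrix.
\begin{align}
\vec{p}_\ell &\sim \mathrm{MVN}\!\left(\epsilon_\ell \vec{1},\;
                \epsilon_\ell (1-\epsilon_\ell)\, F\right), \\
\vec{Z} &= \sum_{\ell} \alpha_\ell\, \vec{p}_\ell, \\
\mathbb{E}[\vec{Z}] &= \Big(\sum_{\ell} \alpha_\ell \epsilon_\ell\Big) \vec{1},
\qquad
\mathrm{Cov}[\vec{Z}] = \Big(\sum_{\ell} \alpha_\ell^{2}\,
                \epsilon_\ell (1-\epsilon_\ell)\Big) F .
\end{align}
```

Under these assumptions the genetic values share the covariance structure F of the neutral allele frequencies, scaled by a single scalar that depends only on the effect sizes and the global mean frequencies.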

This simple neutral model framework, based on multivariate normal distributions, gives us a strong framework to develop tests of selection. Our most basic test is a test for the over-dispersion of the variance of genetic values (i.e. too great an among population variance, once population structure has been accounted for). We also develop a test for environmental correlations and a way to identify outlier populations and regions to further understand the signal of local adaptation.

We apply our tests to six different GWAS datasets using the HGDP as our set of populations. Our tests reveal widespread evidence of selection shaping polygenic traits across populations, although many of the signals are quite subtle. Somewhat surprisingly, we find little evidence for selection on the loci involved in Type 2 diabetes, somewhat of a poster-child for adaptation shaping the genetic basis of a disease thanks to the thrifty gene hypothesis.

We think our approach is a promising way forward to look for selection on the genetic basis of quantitative traits as viewed by GWAS. However, it also highlights some concerns. In developing our tests we found that we had developed a set of methods that already have equivalents in the quantitative trait community, in particular QST, a phenotypic analogue of FST (and its extensions by a number of authors). This raises the question of whether in systems where common garden experiments are possible there is a need to do GWAS if we are only interested in how local adaptation has shaped traits, or if QST style approaches are the best that one can do. We do think that there is much more that could be learnt by our style of approach, but it should also give researchers pause to consider why they want to “find the genes” for local adaptation.

We’ve already gotten some very helpful comments via Haldane’s sieve. We’d love more comments, particularly about points of confusion that could be clarified, other datasets that might be good to apply this to, or other applications we could develop.

Some preprint comment streams at Haldane’s sieve and related sites

Given our one year anniversary, I thought I’d collect together a few examples of preprint commenting at work. These have taken place in the comment boxes of Haldane’s sieve and/or across a range of other blogs.

These are somewhat isolated cases, as the majority of preprints pass without any comment. It would be great to see more of this level of commentary. Remember comments can be simple inquiries about methods/figures/references etc. and don’t have to be super involved. In general we’ve found authors to be very responsive to comments, perhaps in part because they can take place as a more informal conversation without the pressures of publication concerns.

Genome sequencing highlights genes under selection and the dynamic early history of dogs
Reconstructing the population genetic history of the Caribbean
The population genetic signal of polygenic adaptation
The geography of recent genetic ancestry across Europe
Loss and Recovery of Genetic Diversity in Adapting Populations of HIV
Sailfish RNA-seq quantification
Genome-wide inference of ancestral recombination graphs

The date of interbreeding between Neandertals and modern humans.

Ancient west Eurasian ancestry in southern and eastern Africa.

The identifiability of piecewise demographic models from the sample frequency spectrum