Towards a new history and geography of human genes informed by ancient DNA

Joseph Pickrell, David Reich

Genetic information contains a record of the history of our species, and technological advances have transformed our ability to access this record. Many studies have used genome-wide data from populations today to learn about the peopling of the globe and subsequent adaptation to local conditions. Implicit in this research is the assumption that the geographic locations of people today are informative about the geographic locations of their ancestors in the distant past. However, it is now clear that long-range migration, admixture and population replacement have been the rule rather than the exception in human history. In light of this, we argue that it is time to critically re-evaluate current views of the peopling of the globe and the importance of natural selection in determining the geographic distribution of phenotypes. We specifically highlight the transformative potential of ancient DNA. By accessing the genetic make-up of populations living at archaeologically-known times and places, ancient DNA makes it possible to directly track migrations and responses to natural selection.

Author post: Genetic influences on translation in yeast

This guest post is by Frank Albert and Leonid Kruglyak on their preprint (with co-authors) “Genetic influences on translation in yeast”.

This post is on a manuscript we recently posted to both bioRxiv and arXiv on how genetic differences between yeast strains influence protein translation. Here, we give some background for the project and what the results mean in our eyes. We also highlight a few interesting aspects of regression analysis that may be useful to other researchers in genomics, but aren’t currently widely appreciated (at least they weren’t obvious to us before we dove into these data). We appreciate any comments and thoughts!

Protein translation and the genetics of gene expression

Our study was motivated by earlier reports that regulatory variation among individuals in a species can have surprisingly different effects on mRNA vs. on protein levels. For those not fully up to speed on these questions, here’s a quick recap. Individuals differ from each other in many aspects (e.g. in appearance and disease susceptibility), and for many traits, these differences are at least in part due to genetic variation. Research in genetics (both in humans and in other species) is focused on identifying DNA sequence variants in the genome that contribute to phenotypic variation, and on understanding the molecular mechanisms through which these variants alter traits. One general class of such mechanisms is alteration of gene expression, and genetic loci that affect gene expression are known as “expression quantitative trait loci” (eQTL). DNA sequence variants can change expression levels of genes in a number of ways, and some of the expression differences in turn are thought to alter organismal traits. Large and growing eQTL catalogues exist for several species. However, to measure “gene expression”, virtually all these studies measure mRNA rather than protein levels. This isn’t because mRNA is the more relevant molecule to look at, but primarily because mRNA is much easier and cheaper to measure on a global basis than are proteins. There have been a few studies in model organisms and, more recently, in humans that examined genetic variants that influence protein levels. Their results were surprising: the loci that influenced proteins (protein QTL, pQTL) were apparently often different from those that influenced mRNA levels, and vice versa. These results were both troubling and exciting. They were troubling because if genetic effects on protein levels are not a faithful reflection of effects on mRNA levels, the relevance of mRNA-based eQTL maps is less obvious. They were exciting from a basic science perspective, because they suggest that many genetic variants that specifically influence posttranscriptional processes remain to be discovered.

Our current paper begins to explore these issues by measuring one such posttranscriptional process: protein translation, the process by which mRNA molecules are read by ribosomes and “translated” into peptide chains. Translation is interesting because there is literature that suggests that its regulation is a major determinant of protein levels, perhaps as important as the regulation of mRNA levels. Translation is also convenient because it is the last step along the gene expression cascade that can still be assayed using the high throughput sequencing technologies that underlie much of the current boom in genomics in general and eQTL detection in particular. The trick is to isolate only those bits of mRNA that sit inside of ribosomes as they march along the mRNAs during translation. These “ribosome footprints” can then be sequenced, counted, and compared to mRNA levels measured in parallel to get a quantitative readout of how much each gene is being translated.

For our current paper, we teamed up with Jonathan Weissman at UCSF. The Weissman lab pioneered measuring translation by sequencing, and Dale Muzzey (a postdoc with Jonathan) generated the data for our current experiment. We chose a very simple but powerful design: we compared translation in two strains of yeast that are genetically different from each other, and also measured allele-specific translation in the diploid hybrid between these two strains. Using the strain comparison, we can quantify the aggregate effect of all genetic differences between the two strains on translation. In the hybrid, we can specifically see the effects of those variants that act in cis, an important sub-group of eQTL.

We encourage you to read the detailed results in the paper, but in a nutshell, we found that genetic differences clearly do have an effect on translation—but not a terribly large one. Genes that differ in mRNA abundance typically also differ in footprint abundance, and the effect of translation was typically to subtly modulate mRNA differences, rather than erase them or create protein differences from scratch. While there are some exceptions, the take-home message is that on average, differences in protein synthesis are reasonably well approximated by differences in mRNA levels.

Thus, translation does not appear to create major discrepancies between eQTL and pQTL. So what explains such reported discrepancies? Without going into great detail, we think that at least a part of the explanation is that those discrepancies may have been overestimated. This fits with our recent paper in which we used extremely large yeast populations to map pQTL with higher statistical power, and recovered many pQTL at sites where earlier work had found an eQTL, but no pQTL. Are pQTL in most cases simply a reflection of eQTL? It is too early to tell, and we’ll need improved designs and datasets to answer this question with certainty. At the very least, our recent work suggests that genetic influences on protein levels more accurately reflect those on mRNA levels than previous reports had suggested.

“Spurious” correlations and adventures in line fitting

While we worked on the translation paper, we encountered a few technical aspects that are worth sharing in some detail. Specifically, a natural question to ask is how differences in translation compare to differences in mRNA levels. When a gene differs in mRNA abundance between strains, does translation typically lead to a stronger or weaker footprint difference? Does translation typically “reinforce” or “buffer” mRNA differences (or is there no preference either way)? An intuitively attractive analysis is to plot the mRNA differences versus differences in “translation efficiency”, or TE (i.e. the amount of ribosome footprints divided by the amount of mRNA for a given gene). This comparison is shown in Figure 1 for our hybrid data.


In the figure, differences are plotted as log2-transformed fold changes so that zero indicates “no change”, and the resulting distributions of differences are more or less normally distributed. We see a seductively strong negative correlation between mRNA differences and TE differences. Taken at face value, this seems to suggest that mRNA differences are typically accompanied by a difference in TE that buffers the mRNA difference: more mRNA in one strain is counteracted by lower TE, presumably resulting in a protein difference that is smaller than the mRNA difference would predict.
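
To make the plotted quantities concrete, here is a minimal worked sketch for a single gene. The counts are made-up numbers chosen for easy arithmetic, not values from the preprint, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical normalized abundances for one gene in the two strains (or alleles).
mrna = {"strain1": 200.0, "strain2": 100.0}
footprints = {"strain1": 300.0, "strain2": 200.0}

# log2 fold changes: zero means "no change" between the strains.
mrna_diff = np.log2(mrna["strain1"] / mrna["strain2"])            # log2(2)   = 1.0
fp_diff = np.log2(footprints["strain1"] / footprints["strain2"])  # log2(1.5)

# Translation efficiency (TE) = footprints / mRNA, computed per strain.
te = {s: footprints[s] / mrna[s] for s in mrna}   # 1.5 and 2.0
te_diff = np.log2(te["strain1"] / te["strain2"])  # log2(0.75), a negative value

print(mrna_diff, fp_diff, te_diff)
```

In this toy case the strain with twice the mRNA has only 1.5 times the footprints, so its TE comes out lower.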

However, this analysis is misleading. The key point is that the TE difference is not independent from the mRNA difference. In log2 space, the TE difference is simply the footprint difference minus the mRNA difference (in non-log space, the TE difference is the footprint difference divided by the mRNA difference). Therefore, the larger the mRNA difference becomes, the smaller the TE difference becomes simply by definition. The fact that the correlation between TE differences and mRNA differences is negative is by itself not informative about the relationship between differences in translation and mRNA levels. This can further be illustrated in Figure 2 (which is Supplementary Figure S4 in our preprint). Here, we simply draw two uncorrelated samples a and b from a normal distribution. The plot of b – a over a has a strong negative correlation, although there is no systematic relationship at all between these two quantities.
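
The effect is easy to reproduce. The sketch below is an illustration in the spirit of the two-sample demonstration, not the code behind the figure; the sample size, seed, and unit variances are assumptions. For independent a and b with equal variance, the correlation between b - a and a is expected to be -1/sqrt(2), or about -0.71.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n_genes = 20000                 # hypothetical number of genes

# Two independent draws standing in for log2 fold changes:
# a plays the role of the mRNA difference, b of the footprint difference.
a = rng.normal(size=n_genes)
b = rng.normal(size=n_genes)

# In log2 space, the "TE difference" is footprint difference minus mRNA difference.
te_like = b - a

r_ab = np.corrcoef(a, b)[0, 1]              # near 0: a and b are unrelated
r_spurious = np.corrcoef(a, te_like)[0, 1]  # strongly negative by construction

print(f"corr(a, b)     = {r_ab:.3f}")
print(f"corr(a, b - a) = {r_spurious:.3f}")
```

The negative correlation appears only because a enters b - a with a minus sign, so it says nothing about any relationship between a and b.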


It turns out that the problems with comparing ratios (such as the TE difference) to their components (e.g., the mRNA difference, which serves as the denominator in computing TE) were noted a long time ago (e.g., Karl Pearson termed them “spurious” correlations in 1897). They are worth remembering when carrying out functional genomic and systems biology data analyses in which multiple classes of molecules (mRNA, proteins, metabolites, etc.) are compared to each other and to ratios between them.

Because of this effect, we directly analyzed differences in the two quantities we measured: mRNA and ribosome footprint abundance. The slope of a linear regression between these two quantities was less than one (the blue line in Figure 3, which shows the data from the strain comparison). This might once again suggest a predominance of buffering interactions—for any given mRNA difference, we would predict a smaller footprint difference.


However, this inference is again misleading, because regression to the mean ensures that even if we measure the same quantity twice with some measurement noise (which is usually unavoidable), the slope of the regression line between these two replicates is expected to be less than one. This is because when the first measurement for a given gene is by chance larger than its true value, the second replicate is likely to be smaller than the first. We estimated the regression slope between footprint differences and mRNA differences that would be expected if there were no preference for reinforcing or buffering interactions by using a randomization test that you can read about in the manuscript. The upshot is that the observed regression slope was steeper than those in the randomized data. We therefore concluded that on average, translation more often reinforces than buffers mRNA differences.
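
Regression to the mean can be seen in a small simulation. This is a sketch under assumed values (standard-normal true differences, noise standard deviation 0.5, a fixed seed); it illustrates the attenuation only and is not the randomization test from the manuscript.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, for reproducibility only
n_genes = 20000                 # hypothetical number of genes
noise_sd = 0.5                  # assumed measurement noise (true differences have sd 1)

# One true value per gene, measured twice with independent noise.
true_diff = rng.normal(scale=1.0, size=n_genes)
rep1 = true_diff + rng.normal(scale=noise_sd, size=n_genes)
rep2 = true_diff + rng.normal(scale=noise_sd, size=n_genes)

# Ordinary least-squares slope of rep2 on rep1: cov(x, y) / var(x).
ols_slope = np.cov(rep1, rep2)[0, 1] / np.var(rep1, ddof=1)

# Expected attenuation: var(true) / (var(true) + var(noise)) = 1 / 1.25 = 0.8.
expected_slope = 1.0 / (1.0 + noise_sd**2)

print(f"observed OLS slope: {ols_slope:.3f}, expected: {expected_slope:.3f}")
```

Both replicates measure exactly the same quantity, yet the fitted slope sits well below one, which is why a slope below one cannot be read as buffering on its own.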

A related issue (pointed out to us by J.J. Emerson at UC Irvine) is that linear regression fits a line by minimizing only the error on the y-axis. This makes linear regression ideal for predicting y from x, but less than ideal for measuring the relationship (the slope) between x and y. For our purposes, it is better to use a different way of fitting lines to bivariate data: “Major Axis Estimation”. You can read all about MA in this great review, but briefly, the idea is to fit a line that minimizes the perpendicular distance of points from the line. The minimized error gets distributed between the y and the x axis, resulting in fitted lines that, unlike regression, provide unbiased estimates of the slope, and hence of the true linear relationship between x and y. Using MA on our data did not alter the conclusions we arrived at by using regression together with the randomization test, but it did provide more directly interpretable values for the various slopes. For example, the slopes in the randomized datasets were now centered on 1, whereas they had been degraded to be less than one using linear regression. The MA slope in our data was greater than one (the red line in Figure 3), supporting the inference of reinforcement.
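
As a rough sketch of the idea (not the implementation used for the preprint), the major axis slope can be computed from the leading eigenvector of the 2x2 covariance matrix of x and y. The function names and simulated data are assumptions for illustration; the simulation gives both replicates the same noise level.

```python
import numpy as np

def ols_slope(x, y):
    """Least-squares slope of y on x (minimizes vertical distances only)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def major_axis_slope(x, y):
    """Slope of the line minimizing perpendicular distances to the points.

    This line runs along the leading eigenvector of the covariance matrix.
    A vertical major axis (first component zero) is not handled in this sketch.
    """
    eigvals, eigvecs = np.linalg.eigh(np.cov(x, y))  # eigenvalues in ascending order
    v = eigvecs[:, -1]                               # eigenvector of the largest one
    return v[1] / v[0]

rng = np.random.default_rng(2)  # fixed seed, for reproducibility only
true_diff = rng.normal(size=20000)
x = true_diff + rng.normal(scale=0.5, size=true_diff.size)
y = true_diff + rng.normal(scale=0.5, size=true_diff.size)

ols = ols_slope(x, y)
ma = major_axis_slope(x, y)
print(f"OLS slope: {ols:.3f}, major axis slope: {ma:.3f}")
```

With equal noise on both axes, the major axis slope is centered on one while the least-squares slope is attenuated, mirroring the behavior described for the randomized datasets.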

These points on “spurious” correlations, regression to the mean and MA will be obvious to some (they were not to us going in), but we suspect and hope that they may prove useful for others working on genomic datasets.

Population genetics of identity by descent

Pier Francesco Palamara, Ph.D. thesis

Recent improvements in high-throughput genotyping and sequencing technologies have afforded the collection of massive, genome-wide datasets of DNA information from hundreds of thousands of individuals. These datasets, in turn, provide unprecedented opportunities to reconstruct the history of human populations and detect genotype-phenotype association. Recently developed computational methods can identify long-range chromosomal segments that are identical across samples, and have been transmitted from common ancestors that lived tens to hundreds of generations in the past. These segments reveal genealogical relationships that are typically unknown to the carrying individuals. In this work, we demonstrate that such identical-by-descent (IBD) segments are informative about a number of relevant population genetics features: they enable the inference of details about past population size fluctuations and migration events, and they carry the genomic signature of natural selection. We derive a mathematical model, based on coalescent theory, that allows for a quantitative description of IBD sharing across purportedly unrelated individuals, and develop inference procedures for the reconstruction of recent demographic events, where classical methodologies are statistically underpowered. We analyze IBD sharing in several contemporary human populations, including representative communities of the Jewish Diaspora, Kenyan Maasai samples, and individuals from several Dutch provinces, in all cases retrieving evidence of fine-scale demographic events from recent history. Finally, we expand the presented model to describe distributions for those sites in IBD shared segments that harbor mutation events, showing how these may be used for the inference of mutation rates in humans and other species.

Principal component gene set enrichment (PCGSE)

H. Robert Frost, Zhigang Li, Jason H. Moore

Motivation: Although principal component analysis (PCA) is widely used for the dimensional reduction of biomedical data, interpretation of PCA results remains daunting. Most existing methods attempt to explain each principal component (PC) in terms of a small number of variables by generating approximate PCs with few non-zero loadings. Although useful when just a few variables dominate the population PCs, these methods are often inadequate for characterizing the PCs of high-dimensional genomic data. For genomic data, reproducible and biologically meaningful PC interpretation requires methods based on the combined signal of functionally related sets of genes. While gene set testing methods have been widely used in supervised settings to quantify the association of groups of genes with clinical outcomes, these methods have seen only limited application for testing the enrichment of gene sets relative to sample PCs. Results: We describe a novel approach, principal component gene set enrichment (PCGSE), for computing the statistical association between gene sets and the PCs of genomic data. The PCGSE method performs a two-stage competitive gene set test using the correlation between each gene and each PC as the gene-level test statistic with flexible choice of both the gene set test statistic and the method used to compute the null distribution of the gene set statistic. Using simulated data with simulated gene sets and real gene expression data with curated gene sets, we demonstrate that biologically meaningful and computationally efficient results can be obtained from a simple parametric version of the PCGSE method that performs a correlation-adjusted two-sample t-test between the gene-level test statistics for gene set members and genes not in the set.

Phylogenetic Stochastic Mapping without Matrix Exponentiation

Jan Irvahn, Vladimir N. Minin

Phylogenetic stochastic mapping is a method for reconstructing the history of trait changes on a phylogenetic tree relating species/organisms carrying the trait. State-of-the-art methods assume that the trait evolves according to a continuous-time Markov chain (CTMC) and work well for small state spaces. The computations slow down considerably for larger state spaces (e.g. space of codons), because current methodology relies on exponentiating CTMC infinitesimal rate matrices — an operation whose computational complexity grows as the size of the CTMC state space cubed. In this work, we introduce a new approach, based on a CTMC technique called uniformization, that does not use matrix exponentiation for phylogenetic stochastic mapping. Our method is based on a new Markov chain Monte Carlo (MCMC) algorithm that targets the distribution of trait histories conditional on the trait data observed at the tips of the tree. The computational complexity of our MCMC method grows as the size of the CTMC state space squared. Moreover, in contrast to competing matrix exponentiation methods, if the rate matrix is sparse, we can leverage this sparsity and increase the computational efficiency of our algorithm further. Using simulated data, we illustrate advantages of our MCMC algorithm and investigate how large the state space needs to be for our method to outperform matrix exponentiation approaches. We show that even on the moderately large state space of codons our MCMC method can be significantly faster than currently used matrix exponentiation methods.
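
For readers unfamiliar with uniformization, the sketch below shows only the textbook identity behind it: transition probabilities of a CTMC written as a Poisson-weighted sum of powers of a discrete-time transition matrix, with no matrix exponential. It is not the authors' MCMC sampler, the function name and tolerance are assumptions, and the dense matrix products used here do not deliver the quadratic scaling described in the abstract.

```python
import numpy as np

def transition_probs_uniformization(Q, t, tol=1e-12, max_terms=10_000):
    """Approximate P(t) = exp(Q t) by uniformization (illustrative sketch).

    Q is a CTMC rate matrix (rows sum to zero, at least one nonzero exit rate).
    With omega >= max exit rate and B = I + Q / omega,
    P(t) = sum_k Poisson(k; omega * t) * B^k.
    """
    Q = np.asarray(Q, dtype=float)
    n = Q.shape[0]
    omega = np.max(-np.diag(Q))   # uniformization rate
    B = np.eye(n) + Q / omega     # transition matrix of the uniformized chain
    weight = np.exp(-omega * t)   # Poisson probability of k = 0 jumps
    power = np.eye(n)             # B^0
    P = weight * power
    total = weight                # accumulated Poisson mass
    for k in range(1, max_terms):
        if 1.0 - total < tol:     # stop once the remaining Poisson tail is negligible
            break
        power = power @ B
        weight *= omega * t / k
        P += weight * power
        total += weight
    return P

# Two-state example with rates 1 (state 0 to 1) and 2 (state 1 to 0).
Q = np.array([[-1.0, 1.0], [2.0, -2.0]])
P = transition_probs_uniformization(Q, t=0.5)
print(P)
```

For this two-state chain the result can be checked against the closed form P00(t) = 2/3 + exp(-3t)/3.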

Epigenetic Modifications are Associated with Inter-species Gene Expression Variation in Primates

Xiang Zhou, Carolyn Cain, Marsha Myrthil, Noah Lewellen, Katelyn Michelini, Emily Davenport, Matthew Stephens, Jonathan Pritchard, Yoav Gilad

Changes in gene regulation level have long been thought to play an important role in evolution and speciation, especially in primates. Over the past decade, comparative genomic studies have revealed extensive inter-species differences in gene expression levels yet we know much less about the extent to which regulatory mechanisms differ between species. To begin addressing this gap, we performed a comparative epigenetic study in primate lymphoblastoid cell lines (LCLs), to query the contribution of RNA polymerase II (Pol II) and four histone modifications (H3K4me1, H3K4me3, H3K27ac, and H3K27me3) to inter-species variation in gene expression levels. We found that inter-species differences in mark enrichment near transcription start sites are significantly more often associated with inter-species differences in the corresponding gene expression level than expected by chance alone. Interestingly, we also found that first-order interactions among the histone marks and Pol II do not markedly contribute to the degree of association between the marks and inter-species variation in gene expression levels, suggesting that the marginal effects of the five marks dominate this contribution.

The Role of Migration in the Evolution of Phenotypic Switching

Oana Carja, Robert E Furrow, Marc W Feldman

Stochastic switching is an example of phenotypic bet-hedging, where an individual can switch between different phenotypic states in a fluctuating environment. Although the evolution of stochastic switching has been studied when the environment varies temporally, there has been little theoretical work on the evolution of phenotypic switching under both spatially and temporally fluctuating selection pressures. Here we use a population genetic model to explore the interaction of temporal and spatial variation in the evolutionary dynamics of phenotypic switching. We find that spatial variation in selection is important; when selection pressures are similar across space, migration can decrease the rate of switching, but when selection pressures differ spatially, increasing migration between demes can facilitate the evolution of higher rates of switching. These results may help explain the diverse array of non-genetic contributions to phenotypic variability and phenotypic inheritance observed in both wild and experimental populations.