High burden of private mutations due to explosive human population growth and purifying selection

Feng Gao, Alon Keinan
(Submitted on 22 Mar 2014)

Recent studies have shown that human populations have experienced a complex demographic history, including a recent epoch of rapid population growth that led to an excess in the proportion of rare genetic variants in humans today. This excess can impact the burden of private mutations for each individual, defined here as the proportion of heterozygous variants in each newly sequenced individual that are novel compared to another large sample of sequenced individuals. We calculated the burden of private mutations predicted by different demographic models, and compared it with empirical estimates based on data from the NHLBI Exome Sequencing Project and data from the Neutral Regions (NR) dataset. We observed a significant excess in the proportion of private mutations in the empirical data compared with models of demographic history without a recent epoch of population growth. Incorporating recent growth into the model provides a much improved fit to empirical observations. This phenomenon becomes more marked for larger sample sizes. The proportion of private mutations is additionally increased by purifying selection, which differentially affects mutations of different functional annotations. These results have important implications for the design and analysis of sequencing-based association studies of complex human disease as they pertain to private and very rare variants.

Towards a new history and geography of human genes informed by ancient DNA

Joseph Pickrell, David Reich

Genetic information contains a record of the history of our species, and technological advances have transformed our ability to access this record. Many studies have used genome-wide data from populations today to learn about the peopling of the globe and subsequent adaptation to local conditions. Implicit in this research is the assumption that the geographic locations of people today are informative about the geographic locations of their ancestors in the distant past. However, it is now clear that long-range migration, admixture and population replacement have been the rule rather than the exception in human history. In light of this, we argue that it is time to critically re-evaluate current views of the peopling of the globe and the importance of natural selection in determining the geographic distribution of phenotypes. We specifically highlight the transformative potential of ancient DNA. By accessing the genetic make-up of populations living at archaeologically-known times and places, ancient DNA makes it possible to directly track migrations and responses to natural selection.

Author post: Genetic influences on translation in yeast

This guest post is by Frank Albert and Leonid Kruglyak on their preprint (with co-authors) “Genetic influences on translation in yeast”.

This post is on a manuscript we recently posted to both bioRxiv and arXiv on how genetic differences between yeast strains influence protein translation. Here, we give some background for the project and what the results mean in our eyes. We also highlight a few interesting aspects of regression analysis that may be useful to other researchers in genomics, but aren’t currently widely appreciated (at least they weren’t obvious to us before we dove into these data). We appreciate any comments and thoughts!

Protein translation and the genetics of gene expression

Our study was motivated by earlier reports that regulatory variation among individuals in a species can have surprisingly different effects on mRNA vs. on protein levels. For those not fully up to speed on these questions, here’s a quick recap. Individuals differ from each other in many aspects (e.g. in appearance and disease susceptibility), and for many traits, these differences are at least in part due to genetic variation. Research in genetics (both in humans and in other species) is focused on identifying DNA sequence variants in the genome that contribute to phenotypic variation, and on understanding the molecular mechanisms through which these variants alter traits. One general class of such mechanisms is alteration of gene expression, and genetic loci that affect gene expression are known as “expression quantitative trait loci” (eQTL). DNA sequence variants can change expression levels of genes in a number of ways, and some of the expression differences in turn are thought to alter organismal traits. Large and growing eQTL catalogues exist for several species. However, to measure “gene expression”, virtually all these studies measure mRNA rather than protein levels. This isn’t because mRNA is the more relevant molecule to look at, but primarily because mRNA is much easier and cheaper to measure on a global basis than are proteins. There have been a few studies in model organisms and, more recently, in humans that examined genetic variants that influence protein levels. Their results were surprising: the loci that influenced proteins (protein QTL, pQTL) were apparently often different from those that influenced mRNA levels, and vice versa. These results were both troubling and exciting. They were troubling because if genetic changes in protein levels are not a faithful reflection of effects on mRNA levels, the relevance of mRNA-based eQTL maps is less obvious. They were exciting from a basic science perspective, because they suggest that lots of genetic variants that specifically influence posttranscriptional processes remain to be discovered.

Our current paper begins to explore these issues by measuring one such posttranscriptional process: protein translation, the process by which mRNA molecules are read by ribosomes and “translated” into peptide chains. Translation is interesting because there is literature that suggests that its regulation is a major determinant of protein levels, perhaps as important as the regulation of mRNA levels. Translation is also convenient because it is the last step along the gene expression cascade that can still be assayed using the high throughput sequencing technologies that underlie much of the current boom in genomics in general and eQTL detection in particular. The trick is to isolate only those bits of mRNA that sit inside of ribosomes as they march along the mRNAs during translation. These “ribosome footprints” can then be sequenced, counted, and compared to mRNA levels measured in parallel to get a quantitative readout of how much each gene is being translated.

For our current paper, we teamed up with Jonathan Weissman at UCSF. The Weissman lab pioneered measuring translation by sequencing, and Dale Muzzey (a postdoc with Jonathan) generated the data for our current experiment. We chose a very simple but powerful design: we compared translation in two strains of yeast that are genetically different from each other, and also measured allele-specific translation in the diploid hybrid between these two strains. Using the strain comparison, we can quantify the aggregate effect of all genetic differences between the two strains on translation. In the hybrid, we can specifically see the effects of those variants that act in cis, an important sub-group of eQTL.

We encourage you to read the detailed results in the paper, but in a nutshell, we found that genetic differences clearly do have an effect on translation—but not a terribly large one. Genes that differ in mRNA abundance typically also differ in footprint abundance, and the effect of translation was typically to subtly modulate mRNA differences, rather than erase them or create protein differences from scratch. While there are some exceptions, the take-home message is that on average, differences in protein synthesis are reasonably well approximated by differences in mRNA levels.

Thus, translation does not appear to create major discrepancies between eQTL and pQTL. So what explains such reported discrepancies? Without going into great detail, we think that at least a part of the explanation is that those discrepancies may have been overestimated. This fits with our recent paper in which we used extremely large yeast populations to map pQTL with higher statistical power, and recovered many pQTL at sites where earlier work had found an eQTL, but no pQTL. Are pQTL in most cases simply a reflection of eQTL? It is too early to tell, and we’ll need improved designs and datasets to answer this question with certainty. At the very least, our recent work suggests that genetic influences on protein levels more accurately reflect those on mRNA levels than previous reports had suggested.

“Spurious” correlations and adventures in line fitting

While we worked on the translation paper, we encountered a few technical aspects that are worth sharing in some detail. Specifically, a natural question to ask is how differences in translation compare to differences in mRNA levels. When a gene differs in mRNA abundance between strains, does translation typically lead to a stronger or weaker footprint difference? Does translation typically “reinforce” or “buffer” mRNA differences (or is there no preference either way)? An intuitively attractive analysis is to plot the mRNA differences versus differences in “translation efficiency”, or TE (i.e. the amount of ribosome footprints divided by the amount of mRNA for a given gene). This comparison is shown in Figure 1 for our hybrid data.

[Figure 1: TE differences plotted against mRNA differences in the hybrid data]

In the figure, differences are plotted as log2-transformed fold changes so that zero indicates “no change”, and the resulting distributions of differences are more or less normally distributed. We see a seductively strong negative correlation between mRNA differences and TE differences. Taken at face value, this seems to suggest that mRNA differences are typically accompanied by a difference in TE that buffers the mRNA difference: more mRNA in one strain is counteracted by lower TE, presumably resulting in a protein difference that is smaller than the mRNA difference would predict.
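As a minimal sketch of how such log2 fold changes can be computed, the snippet below uses made-up read counts for three genes in two strains. The counts, variable names and the helper `log2_fold_change` are assumptions for illustration only; they are not taken from the preprint or its analysis code.

```python
import numpy as np

# Hypothetical counts for three genes in strains A and B (illustrative only).
mrna_a = np.array([100.0, 200.0, 50.0])
mrna_b = np.array([200.0, 200.0, 25.0])
fp_a = np.array([40.0, 80.0, 10.0])    # ribosome footprint counts, strain A
fp_b = np.array([120.0, 80.0, 5.0])    # ribosome footprint counts, strain B


def log2_fold_change(x, y):
    """log2 of x relative to y; zero means no change."""
    return np.log2(x / y)


# Translation efficiency: footprints divided by mRNA, per gene and strain.
te_a = fp_a / mrna_a
te_b = fp_b / mrna_b

mrna_diff = log2_fold_change(mrna_b, mrna_a)
fp_diff = log2_fold_change(fp_b, fp_a)
te_diff = log2_fold_change(te_b, te_a)

for name, values in [("mRNA", mrna_diff), ("footprint", fp_diff), ("TE", te_diff)]:
    print(name, np.round(values, 3))
```

In this toy example the second gene shows no change in any quantity, and the third gene has a twofold mRNA decrease that is matched exactly by the footprints, so its TE difference is zero.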

However, this analysis is misleading. The key point is that the TE difference is not independent of the mRNA difference. In log2 space, the TE difference is simply the footprint difference minus the mRNA difference (in non-log space, the TE difference is the footprint difference divided by the mRNA difference). Therefore, for a given footprint difference, the larger the mRNA difference becomes, the smaller the TE difference becomes simply by definition. The fact that the correlation between TE differences and mRNA differences is negative is by itself not informative about the relationship between differences in translation and mRNA levels. This is further illustrated in Figure 2 (which is Supplementary Figure S4 in our preprint). Here, we simply draw two uncorrelated samples a and b from a normal distribution. The plot of b – a over a has a strong negative correlation, although there is no systematic relationship at all between these two quantities.
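The two-sample illustration can be reproduced with a short simulation. This is only a sketch: the sample size, the seed and the choice of standard normal samples are assumptions, not the settings used for the supplementary figure. For independent a and b with equal variance, the covariance of b – a with a is minus the variance of a, and the variance of b – a is twice that, so the expected correlation is -1/sqrt(2), roughly -0.71.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
n = 100_000                      # assumed sample size, chosen for stable estimates

# Two independent samples from a standard normal distribution.
a = rng.normal(size=n)
b = rng.normal(size=n)

# a and b themselves are uncorrelated...
r_ab = np.corrcoef(a, b)[0, 1]

# ...but b - a is strongly negatively correlated with a, by construction.
r_spurious = np.corrcoef(a, b - a)[0, 1]

# Theoretical value for independent, equal-variance a and b.
expected = -1.0 / np.sqrt(2.0)

print(f"corr(a, b)     = {r_ab:.3f}")
print(f"corr(a, b - a) = {r_spurious:.3f} (theory: {expected:.3f})")
```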

[Figure 2 (Supplementary Figure S4 in the preprint): b – a plotted against a for two uncorrelated normal samples a and b]

It turns out that the problems with comparing ratios (such as the TE difference) to their components (e.g., the mRNA difference, which serves as the denominator in computing TE) were noted a long time ago (e.g., Karl Pearson termed them “spurious” correlations in 1897). They are worth remembering when carrying out functional genomic and systems biology data analyses in which multiple classes of molecules (mRNA, proteins, metabolites, etc.) are compared to each other and to ratios between them.

Because of this effect, we directly analyzed differences in the two quantities we measured: mRNA and ribosome footprint abundance. The slope of a linear regression between these two quantities was less than one (the blue line in Figure 3, which shows the data from the strain comparison). This might once again suggest a predominance of buffering interactions—for any given mRNA difference, we would predict a smaller footprint difference.

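[Figure 3: footprint differences plotted against mRNA differences in the strain comparison, with the linear regression line in blue and the MA line in red]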

However, this inference is again misleading, because regression to the mean ensures that even if we measure the same quantity twice with some measurement noise (which is usually unavoidable), the slope of the regression line between these two replicates is expected to be less than one. This is because when the first measure for a given gene is by chance larger than its true value, the measure is likely to be smaller in the second replicate. We estimated the regression slope between footprint differences and mRNA differences that would be expected if there were no preference for reinforcing or buffering interactions by using a randomization test that you can read about in the manuscript. The upshot is that the observed regression slope was steeper than those in the randomized data. We therefore concluded that on average, translation more often reinforces than buffers mRNA differences.
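The regression-to-the-mean effect can be seen in a small simulation. This is not the randomization test from the manuscript; it only sketches the underlying point. All numbers are assumptions: true values with variance 1 and independent measurement noise with standard deviation 0.5 in each replicate. Under these assumptions the expected regression slope is 1 / (1 + 0.5^2) = 0.8, even though both replicates measure the same quantity.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed so the sketch is repeatable
n = 100_000                      # assumed number of genes, large for stable estimates
noise_sd = 0.5                   # assumed measurement noise per replicate

# The same true per-gene value is measured twice with independent noise.
true_value = rng.normal(size=n)
rep1 = true_value + rng.normal(scale=noise_sd, size=n)
rep2 = true_value + rng.normal(scale=noise_sd, size=n)

# Ordinary least squares regression of replicate 2 on replicate 1.
slope, intercept = np.polyfit(rep1, rep2, deg=1)

# Expected slope: var(true) / (var(true) + var(noise)).
expected_slope = 1.0 / (1.0 + noise_sd**2)

print(f"regression slope = {slope:.3f} (expected about {expected_slope:.2f})")
```

The noisier the measurement on the x-axis, the further the slope falls below one, which is why a slope below one cannot by itself be read as buffering.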

A related issue (pointed out to us by J.J. Emerson at UC Irvine) is that linear regression fits a line by minimizing only the error on the y-axis. This makes linear regression ideal for predicting y from x, but less than ideal for measuring the relationship (the slope) between x and y. For our purposes, it is better to use a different way of fitting lines to bivariate data: “Major Axis Estimation”. You can read all about MA in this great review, but briefly, the idea is to fit a line that minimizes the perpendicular distance of points from the line. The minimized error gets distributed between the y and the x axis, resulting in fitted lines that, unlike regression, provide unbiased estimates of the slope, and hence of the true linear relationship between x and y. Using MA on our data did not alter the conclusions we arrived at by using regression together with the randomization test, but it did provide more directly interpretable values for the various slopes. For example, the slopes in the randomized datasets were now centered on 1, whereas they had been degraded to be less than one using linear regression. The MA slope in our data was greater than one (the red line in Figure 3), supporting the inference of reinforcement.
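To make the contrast concrete, here is a minimal sketch comparing the two fits on simulated replicate measurements. It assumes equal noise on both axes, which is the setting in which the major axis slope is centered on one; the function names `ols_slope` and `major_axis_slope` and all simulation settings are illustrative and not taken from the manuscript. The major axis is computed as the leading eigenvector of the sample covariance matrix, which is the direction that minimizes the summed squared perpendicular distances.

```python
import numpy as np


def ols_slope(x, y):
    """Slope of the least-squares regression of y on x."""
    cov = np.cov(x, y)
    return cov[0, 1] / cov[0, 0]


def major_axis_slope(x, y):
    """Slope of the major axis: leading eigenvector of the covariance matrix."""
    cov = np.cov(x, y)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    v = eigvecs[:, np.argmax(eigvals)]       # direction of greatest variance
    return v[1] / v[0]


rng = np.random.default_rng(2)   # fixed seed so the sketch is repeatable
n = 100_000
true_value = rng.normal(size=n)

# Both axes carry the same amount of independent measurement noise (assumed).
x = true_value + rng.normal(scale=0.5, size=n)
y = true_value + rng.normal(scale=0.5, size=n)

print(f"OLS slope        = {ols_slope(x, y):.3f}")
print(f"major axis slope = {major_axis_slope(x, y):.3f}")
```

With these settings the regression slope comes out near 0.8 while the major axis slope comes out near 1, mirroring the difference described for the randomized datasets.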

These points on “spurious” correlations, regression to the mean and MA will be obvious to some (they were not to us going in), but we suspect and hope that they may prove useful for others working on genomic datasets.

Epigenetic Modifications are Associated with Inter-species Gene Expression Variation in Primates

Xiang Zhou, Carolyn Cain, Marsha Myrthil, Noah Lewellen, Katelyn Michelini, Emily Davenport, Matthew Stephens, Jonathan Pritchard, Yoav Gilad

Changes in gene regulation level have long been thought to play an important role in evolution and speciation, especially in primates. Over the past decade, comparative genomic studies have revealed extensive inter-species differences in gene expression levels yet we know much less about the extent to which regulatory mechanisms differ between species. To begin addressing this gap, we performed a comparative epigenetic study in primate lymphoblastoid cell lines (LCLs), to query the contribution of RNA polymerase II (Pol II) and four histone modifications (H3K4me1, H3K4me3, H3K27ac, and H3K27me3) to inter-species variation in gene expression levels. We found that inter-species differences in mark enrichment near transcription start sites are significantly more often associated with inter-species differences in the corresponding gene expression level than expected by chance alone. Interestingly, we also found that first-order interactions among the histone marks and Pol II do not markedly contribute to the degree of association between the marks and inter-species variation in gene expression levels, suggesting that the marginal effects of the five marks dominate this contribution.

The Role of Migration in the Evolution of Phenotypic Switching

Oana Carja, Robert E Furrow, Marc W Feldman

Stochastic switching is an example of phenotypic bet-hedging, where an individual can switch between different phenotypic states in a fluctuating environment. Although the evolution of stochastic switching has been studied when the environment varies temporally, there has been little theoretical work on the evolution of phenotypic switching under both spatially and temporally fluctuating selection pressures. Here we use a population genetic model to explore the interaction of temporal and spatial variation in the evolutionary dynamics of phenotypic switching. We find that spatial variation in selection is important; when selection pressures are similar across space, migration can decrease the rate of switching, but when selection pressures differ spatially, increasing migration between demes can facilitate the evolution of higher rates of switching. These results may help explain the diverse array of non-genetic contributions to phenotypic variability and phenotypic inheritance observed in both wild and experimental populations.

Genetic influences on translation in yeast

Frank W. Albert, Dale Muzzey, Jonathan Weissman, Leonid Kruglyak
(Submitted on 13 Mar 2014)

Heritable differences in gene expression between individuals are an important source of phenotypic variation. The question of how closely the effects of genetic variation on protein levels mirror those on mRNA levels remains open. Here, we addressed this question by using ribosome profiling to examine how genetic differences between two strains of the yeast S. cerevisiae affect translation. Strain differences in translation were observed for hundreds of genes, more than half as many as showed genetic differences in mRNA levels. Similarly, allele specific measurements in the diploid hybrid between the two strains revealed roughly half as many cis-acting effects on translation as were observed for mRNA levels. In both the parents and the hybrid, strong effects on translation were rare, such that the direction of an mRNA difference was typically reflected in a concordant footprint difference. The relative importance of cis and trans acting variation on footprint levels was similar to that for mRNA levels. Across all expressed genes, there was a tendency for translation to more often reinforce than buffer mRNA differences, resulting in footprint differences with greater magnitudes than the mRNA differences. A reanalysis of two earlier studies which reported translational buffering between two yeast species showed that translational reinforcement is in fact more common between these species, consistent with our results. Finally, we catalogued instances of premature translation termination in the two yeast strains. Overall, genetic variation clearly influences translation, but primarily does so by subtly modulating differences in mRNA levels. Translation does not appear to create strong discrepancies between genetic influences on mRNA and protein levels.

Predicting discovery rates of genomic features

Simon Gravel, NHLBI GO Exome Sequencing Project
(Submitted on 13 Mar 2014)

Successful sequencing experiments require judicious sample selection. However, this selection must often be performed on the basis of limited preliminary data. Predicting the statistical properties of the final sample based on preliminary data can be challenging, because numerous uncertain model assumptions may be involved. Here, we ask whether we can predict “omics” variation across many samples by sequencing only a fraction of them. In the infinite-genome limit, we find that a pilot study sequencing 5% of a population is sufficient to predict the number of genetic variants in the entire population within 6% of the correct value, using an estimator agnostic to demography, selection, or population structure. To reach similar accuracy in a finite genome with millions of polymorphisms, the pilot study would require about 15% of the population. We present computationally efficient jackknife and linear programming methods that exhibit substantially less bias than the state of the art when applied to simulated data and sub-sampled 1000 Genomes Project data. Extrapolating based on the NHLBI Exome Sequencing Project data, we predict that 7.2% of sites in the capture region would be variable in a sample of 50,000 African-Americans, and 8.8% in a European sample of equal size. Finally, we show how the linear programming method can also predict discovery rates of various genomic features, such as the number of transcription factor binding sites across different cell types.

Increased genetic diversity improves crop yield stability under climate variability: a computational study on sunflower

Pierre Casadebaig (1), Ronan Trépos (2), Victor Picheny (2), Nicolas B. Langlade (3), Patrick Vincourt (3), Philippe Debaeke (1) ((1) INRA, UMR1248 AGIR, Castanet-Tolosan, France, (2) INRA, UR875 MIAT, Castanet-Tolosan, France, (3) INRA, UMR441 LIPM, Castanet-Tolosan, France)
(Submitted on 12 Mar 2014)

A crop can be represented as a biotechnical system in which components are either chosen (cultivar, management) or given (soil, climate) and whose combination generates highly variable stress patterns and yield responses. Here, we used modeling and simulation to predict the crop phenotypic plasticity resulting from the interaction of plant traits (G), climatic variability (E) and management actions (M). We designed two in silico experiments that compared existing and virtual sunflower cultivars (Helianthus annuus L.) in a target population of cropping environments by simulating a range of indicators of crop performance. Optimization methods were then used to search for GEM combinations that matched desired crop specifications. Computational experiments showed that the fit of particular cultivars in specific environments gradually increases with the knowledge of pedo-climatic conditions. At the regional scale, tuning the choice of cultivar impacted crop performance by the same magnitude as the effect of yearly genetic progress made by breeding. When considering virtual genetic material, designed by recombining plant traits, cultivar choice had a greater positive impact on crop performance and stability. Results suggested that breeding for key traits conferring plant plasticity improved cultivar global adaptation capacity, whereas increasing genetic diversity made it possible to choose cultivars with distinctive traits that were more adapted to specific conditions. Consequently, breeding genetic material that is both plastic and diverse may improve yield stability of agricultural systems exposed to climatic variability. We argue that process-based modeling could help enhance spatial management of cultivated genetic diversity and could be integrated in functional breeding approaches.

Substitution and site-specific selection driving B cell affinity maturation is consistent across individuals

Connor O. McCoy, Trevor Bedford, Vladimir N. Minin, Harlan Robins, Frederick A. Matsen IV
(Submitted on 12 Mar 2014)

The antibody repertoire of each individual is continuously updated by the evolutionary process of B cell receptor mutation and selection. It has recently become possible to gain detailed information concerning this process through high-throughput sequencing. Here, we develop modern statistical molecular evolution methods for the analysis of B cell sequence data, and then apply them to a very deep short-read data set of B cell receptors. We find that the substitution process is conserved across individuals but varies significantly across gene segments. We investigate selection on B cell receptors using a novel method that side-steps the difficulties encountered by previous work in differentiating between selection and motif-driven mutation; this is done through stochastic mapping and empirical Bayes estimators that compare the evolution of in-frame and out-of-frame rearrangements. We use this new method to derive a per-residue map of selection, which we find is dominated by purifying selection, though not uniformly so.

Mapping quantitative trait loci underlying function-valued phenotypes

Il-Youp Kwak, Candace R. Moore, Edgar P. Spalding, Karl W. Broman
(Submitted on 12 Mar 2014)

Most statistical methods for QTL mapping focus on a single phenotype. However, multiple phenotypes are commonly measured, and recent technological advances have greatly simplified the automated acquisition of numerous phenotypes, including function-valued phenotypes, such as growth measured over time. While there exist methods for QTL mapping with function-valued phenotypes, they are generally computationally intensive and focus on single-QTL models. We propose two simple, fast methods that maintain high power and precision and are amenable to extensions with multiple-QTL models using a penalized likelihood approach. After identifying multiple QTL by these approaches, we can view the function-valued QTL effects to provide a deeper understanding of the underlying processes. Our methods have been implemented as a package for R, funqtl.