Towards the Recapitulation of Ancient History in the Laboratory: Combining Synthetic Biology with Experimental Evolution

Betul Kacar, Eric Gaucher
(Submitted on 23 Sep 2012)

One way to understand the role history plays on evolutionary trajectories is by giving ancient life a second opportunity to evolve. Our ability to empirically perform such an experiment, however, is limited by current experimental designs. Combining ancestral sequence reconstruction with synthetic biology allows us to resurrect the past within a modern context and has expanded our understanding of protein functionality within a historical context. Experimental evolution, on the other hand, provides us with the ability to study evolution in action, under controlled conditions in the laboratory. Here we describe a novel experimental setup that integrates two disparate fields – ancestral sequence reconstruction and experimental evolution. This allows us to rewind and replay the evolutionary history of ancient biomolecules in the laboratory. We anticipate that our combination will provide a deeper understanding of the underlying roles that contingency and determinism play in shaping evolutionary processes.

Comparative Analysis of Tandem Repeats from Hundreds of Species Reveals Unique Insights into Centromere Evolution

Daniël P. Melters, Keith R. Bradnam, Hugh A. Young, Natalie Telis, Michael R. May, J. Graham Ruby, Robert Sebra, Paul Peluso, John Eid, David Rank, José Fernando Garcia, Joseph L. DeRisi, Timothy Smith, Christian Tobias, Jeffrey Ross-Ibarra, Ian F. Korf, Simon W.-L. Chan
(Submitted on 22 Sep 2012)

Centromeres are essential for chromosome segregation, yet their DNA sequences evolve rapidly. In most animals and plants that have been studied, centromeres contain megabase-scale arrays of tandem repeats. Despite their importance, very little is known about the degree to which centromere tandem repeats share common properties between different species across different phyla. We used bioinformatic methods to identify high-copy tandem repeats from 282 species using publicly available genomic sequence and our own data. The assumption that the most abundant tandem repeat is the centromere DNA was true for most species whose centromeres have been previously characterized, suggesting this is a general property of genomes. Our methods are compatible with all current sequencing technologies. Long Pacific Biosciences sequence reads allowed us to find tandem repeat monomers up to 1,419 bp. High-copy centromere tandem repeats were found in almost all animal and plant genomes, but repeat monomers were highly variable in sequence composition and in length. Furthermore, phylogenetic analysis of sequence homology showed little evidence of sequence conservation beyond ~50 million years of divergence. We find that despite an overall lack of sequence conservation, centromere tandem repeats from diverse species showed similar modes of evolution, including the appearance of higher order repeat structures in which several polymorphic monomers make up a larger repeating unit. While centromere position in most eukaryotes is epigenetically determined, our results indicate that tandem repeats are highly prevalent at centromeres of both animals and plants. This suggests a functional role for such repeats, perhaps in promoting concerted evolution of centromere DNA across chromosomes.

Maximum Likelihood Estimation of Frequencies of Known Haplotypes from Pooled Sequence Data

Darren Kessner, Tom Turner, John Novembre
(Submitted on 19 Sep 2012)

DNA samples are often pooled, either by experimental design, or because the sample itself is a mixture. For example, when population allele frequencies are of primary interest, individual samples may be pooled together to lower the cost of sequencing. Alternatively, the sample itself may be a mixture of multiple species or strains (e.g. bacterial species comprising a microbiome, or pathogen strains in a blood sample). We present an expectation-maximization (EM) algorithm for estimating haplotype frequencies in a pooled sample directly from mapped sequence reads, in the case where the possible haplotypes are known. This method is relevant to the analysis of pooled sequencing data from selection experiments, as well as the calculation of proportions of different strains within a metagenomics sample. Our method outperforms existing methods based on single-site allele frequencies, as well as simple approaches using sequence read data. We have implemented the method in a freely available open-source software tool.
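To make the idea of EM over known haplotypes concrete, here is a minimal sketch. It is not the authors' implementation. It assumes biallelic sites coded 0/1, reads represented as hypothetical `{site_index: allele}` dictionaries, and a single per-base error rate; `read_likelihood` and `em_haplotype_frequencies` are illustrative names.

```python
def read_likelihood(read, haplotype, error_rate):
    """P(read | haplotype): product of match/mismatch terms over covered sites."""
    p = 1.0
    for site, allele in read.items():
        p *= (1.0 - error_rate) if haplotype[site] == allele else error_rate
    return p


def em_haplotype_frequencies(reads, haplotypes, error_rate=0.01,
                             n_iter=200, tol=1e-10):
    """Estimate frequencies of known haplotypes from pooled reads by EM."""
    k = len(haplotypes)
    freqs = [1.0 / k] * k  # start from uniform frequencies
    # Read-by-haplotype likelihoods do not change between iterations.
    lik = [[read_likelihood(r, h, error_rate) for h in haplotypes] for r in reads]
    for _ in range(n_iter):
        totals = [0.0] * k
        for row in lik:
            # E-step: posterior probability that this read came from each haplotype.
            weights = [f * l for f, l in zip(freqs, row)]
            norm = sum(weights)
            for j in range(k):
                totals[j] += weights[j] / norm
        # M-step: new frequency is the average responsibility across reads.
        new_freqs = [t / len(reads) for t in totals]
        converged = max(abs(a - b) for a, b in zip(freqs, new_freqs)) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs


# Toy data (assumed, for illustration only): three known haplotypes over three sites.
haplotypes = [(0, 0, 0), (1, 1, 0), (0, 1, 1)]
reads = (
    [{0: 0, 1: 0, 2: 0}] * 4 + [{0: 0, 1: 0}] * 2  # drawn from haplotype 0
    + [{0: 1, 1: 1, 2: 0}] * 3                     # drawn from haplotype 1
    + [{0: 0, 1: 1, 2: 1}] * 1                     # drawn from haplotype 2
)
estimated = em_haplotype_frequencies(reads, haplotypes)
print([round(f, 3) for f in estimated])
```

Because each read is scored against whole haplotypes rather than one site at a time, reads spanning several variable sites carry linkage information that single-site allele frequencies discard.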

Our paper: Integrated analysis of variants and pathways in genome-wide association studies using polygenic models of disease

[This author post is by Peter Carbonetto on Integrated analysis of variants and pathways in genome-wide association studies using polygenic models of disease, available from the arXiv here.]

I expect that most readers of this blog appreciate the impact that genome-wide association studies have had on our understanding of many common diseases. Still, I think it is important to reiterate a major appeal of genome-wide association studies: the analysis is conceptually straightforward to understand, even for people who have never had to suffer through a course on statistics or epidemiology. To find links between genetic loci and disease, the analysis consists of systematically searching across the genome for variants that show statistically significant correlation with susceptibility to disease. These correlations signal the presence of nearby genes—or perhaps DNA elements that regulate other genes—that are risk factors for disease.

Many readers of this blog will also appreciate, due to the multifactorial nature of most common diseases, the difficulty of establishing compelling evidence for disease-variant correlations. Hence the search for more effective data-driven strategies for discovering genetic factors underlying common diseases.

One strategy is to assess evidence for the accumulation, or “enrichment,” of disease-conferring mutations within known biological pathways. The intuition is that identifying the accumulation of small genetic effects acting in a common pathway is easier than mapping the individual genes within the pathway that contribute to disease susceptibility.

We asked whether identifying these enriched pathways can also give us useful feedback about the individual gene variants associated with disease. To answer this question, we developed a statistical method that adjusts the support for disease-variant associations to reflect enrichment of associations in a pathway. Our approach was to introduce an enrichment parameter that quantifies the increase in the probability that each variant in the pathway is associated with disease risk.
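As a rough illustration of how an enrichment parameter can shift the support for a variant, the sketch below raises the prior log-odds of association for variants inside a pathway and then combines the prior with a Bayes factor. The log10-odds parametrization, the values of `theta0` and `theta`, and the Bayes factor of 1000 are assumptions chosen for illustration. The single-variant posterior is also a simplification, not the multi-variant polygenic model used in the paper.

```python
def prior_inclusion_probability(in_pathway, theta0, theta):
    """Prior probability of association; theta is the pathway enrichment (log10 odds)."""
    log10_odds = theta0 + (theta if in_pathway else 0.0)
    odds = 10.0 ** log10_odds
    return odds / (1.0 + odds)


def posterior_probability(prior_prob, bayes_factor):
    """Posterior odds = Bayes factor x prior odds, converted back to a probability."""
    odds = bayes_factor * prior_prob / (1.0 - prior_prob)
    return odds / (1.0 + odds)


theta0 = -4.0  # assumed genome-wide prior: odds of 1 in 10,000
theta = 2.0    # assumed enrichment: 100-fold higher prior odds inside the pathway

p_out = prior_inclusion_probability(False, theta0, theta)
p_in = prior_inclusion_probability(True, theta0, theta)

# The same (hypothetical) Bayes factor gives very different posteriors.
post_out = posterior_probability(p_out, 1000.0)
post_in = posterior_probability(p_in, 1000.0)
print(round(post_out, 3), round(post_in, 3))
```

Under these assumed numbers, a variant with moderate evidence that would be passed over genome-wide becomes a strong candidate once it sits in an enriched pathway, which is the kind of feedback described above.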

Is this a valid approach? To investigate, we applied our approach to data from the Wellcome Trust Crohn’s disease study from 2007. First, we identified a broad class of cytokine signaling genes that were enriched for genetic associations with Crohn’s disease. Next, by prioritizing variants in this pathway, we discovered candidates for association—including the STAT3 gene, the IBD5 locus, and the MHC class II genes—that were not identified in conventional analyses of the same data. These results help validate our approach, as these genetic associations have been independently confirmed in other studies and meta-analyses with much larger combined samples.

Several other important lessons emerged from our case study:

1. Interrogate as many pathways as possible. Because we collected over 3000 candidate pathways from several sources (Reactome, KEGG, BioCarta, BioCyc, etc.), many of the pathways highlighted in previous analyses of the same data were eclipsed by much stronger enrichment signals in our analysis.

2. Assess evidence for combinations of enriched pathways. Some pathways become interesting only after assessing enrichment of the pathway in combination with another pathway.

3. Account for the heterogeneity of effect sizes in Crohn’s disease. One of the assumptions we made in our analysis, mainly out of convenience, was that the additive effects on disease risk are normally distributed. While this assumption simplified this analysis, we suspect that a normal distribution does not adequately capture the smaller effect sizes in pathways, leading to a loss of power to detect enriched pathways.

At conferences, and around the lab, I’ve heard many complaints about pathway analysis (or gene set enrichment analysis) for genome-wide association studies. One complaint is that the results are difficult to interpret. Another common complaint is that the findings are sensitive to arbitrary significance thresholds. While we didn’t devote much space in the paper to a discussion of these issues, we believe that our approach offers a coherent solution to many of these problems.

Ultimately, we would like other researchers to use our methods to analyze data from their own genome-wide association studies. We tried to make our paper as accessible as possible, especially to biologists who are not well-acquainted with Bayesian approaches, by carefully explaining how to interpret the Bayes factors and posterior statistics used in the analysis. We are working on releasing the full source code (in R and MATLAB) for all our methods, along with accompanying documentation.

Peter Carbonetto

Our paper: The genetic prehistory of southern Africa

[This author post is by Joe Pickrell (@joe_pickrell), Nick Patterson, Mark Stoneking, David Reich, and Brigitte Pakendorf on The genetic prehistory of southern Africa, available from arXiv here]

The indigenous populations of southern Africa are phenotypically, linguistically, culturally, and genetically diverse. Although many groups speak Bantu languages (having arrived in the region during an expansion of Iron-Age agriculturalists), there are a number of populations who speak diverse non-Bantu languages with heavy use of click consonants. We refer to these populations as “Khoisan”. Most of the Khoisan populations are hunter-gatherers, but some are pastoralists; the extensive linguistic and cultural diversity of the Khoisan (who live in a relatively small region around the Kalahari semi-desert) is historically puzzling.

Two hunter-gatherer (or formerly hunter-gatherer) populations in East Africa, the Hadza and Sandawe, also speak languages that make use of click consonants. Linguists see little in common between the languages in southern Africa and Hadza, although Sandawe might be genealogically related to some of the Khoisan languages. Nevertheless, the shared use of click consonants and a foraging lifestyle led many to hypothesize that the southern African Khoisan populations are genetically related to the Hadza and Sandawe, which would imply that their ancestors were once considerably more widespread. This hypothesis has been controversial for decades.

Tree relating the Khoisan-like proportion of ancestry (shown in blue in the barplot) in Khoisan, Hadza, and Sandawe after accounting for non-Khoisan admixture.

In our study, we use genetic data to address the history of the diverse groups within southern Africa and their relationship to the Hadza and Sandawe. Specifically, we genotyped individuals from 16 Khoisan populations, 5 neighboring populations that speak Bantu languages, and the Hadza (the latter thanks to Brenna Henn, Joanna Mountain, and Carlos Bustamante) on a SNP array designed for studies of human history, in that the SNP ascertainment scheme is known and includes SNPs ascertained in the Khoisan. We then merged in Hadza and Sandawe samples from a recent paper by Joseph Lachance, Sarah Tishkoff and colleagues. The main conclusions are as follows:

  1. Within the southern African Khoisan, there are two genetic groups, which correspond roughly to populations in the northwest and southeast Kalahari semi-desert. Populations from these two groups have been labeled in the tree in this post (see also Figure 1B in the preprint). We estimate that these two groups diverged within the last 30,000 years. However, this date should be taken as an upper bound due to point #2 below.
  2. All southern African Khoisan groups are admixed with non-Khoisan populations. Even the most isolated Khoisan groups (i.e. the “San” from the HGDP, who are included in the “Ju|’hoan_North” group in our paper) show some evidence of admixture with agriculturalist and/or pastoralist groups. A subtle technical point is that this had not been previously noticed because methods that rely on correlations in allele frequencies are sometimes unable to detect admixture if all populations are admixed (this is related to Mr. Razib Khan’s post on why ADMIXTURE is not a test for admixture). To get around this, we developed new methods based on the decay of linkage disequilibrium.
  3. The Hadza and Sandawe trace part of their ancestry to admixture with a population related to the Khoisan. After accounting for admixture, we built a tree of “Khoisan-like” ancestry in the southern and eastern African populations (see the Figure above). The striking thing is that the Hadza and Sandawe fall with high confidence on the same branch as the Khoisan. This suggests that, prior to subsequent migrations of food-producing peoples over most of sub-Saharan Africa, populations related to the Khoisan were indeed spread continuously over a huge geographic range including Tanzania and southern Africa.
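Point 2 mentions methods based on the decay of linkage disequilibrium. The general principle can be sketched as follows, as a generic illustration rather than the method developed in the paper. Admixture creates LD between distant markers, and under a simple single-pulse model that LD is expected to decay roughly as A·exp(−t·d), where d is genetic distance in Morgans and t is the number of generations since admixture. The function name and the synthetic, noise-free data below are assumptions.

```python
import math


def fit_exponential_decay(distances_morgans, ld_values):
    """Least-squares fit of log(ld) = log(A) - t * d; returns (A, t)."""
    n = len(distances_morgans)
    logs = [math.log(v) for v in ld_values]
    mean_d = sum(distances_morgans) / n
    mean_l = sum(logs) / n
    sxx = sum((d - mean_d) ** 2 for d in distances_morgans)
    sxy = sum((d - mean_d) * (l - mean_l)
              for d, l in zip(distances_morgans, logs))
    slope = sxy / sxx
    amplitude = math.exp(mean_l - slope * mean_d)
    return amplitude, -slope  # decay rate t is the negated slope


# Synthetic, noise-free curve: admixture 40 generations ago (assumed value).
true_t, true_a = 40.0, 0.02
distances = [0.005 * i for i in range(1, 11)]  # 0.5 to 5 centimorgans
ld_curve = [true_a * math.exp(-true_t * d) for d in distances]

a_hat, t_hat = fit_exponential_decay(distances, ld_curve)
print(round(a_hat, 4), round(t_hat, 2))
```

A log-linear fit is used here only for brevity. Real LD estimates are noisy and can be negative at large distances, so a nonlinear fit with appropriate weighting would be a more robust choice in practice.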

We’re excited about these results for a number of reasons. First of all, we’re now on our way towards understanding the history of the diverse Khoisan populations; for years these populations have been treated as genetically equivalent, but it’s clear that each population has its own complex history. Secondly, with the new statistical methods we’ve developed, we were able not only to show the varying amounts of admixture that have occurred at different times in southern African populations, but also to peel away these layers of admixture to learn about the relationships among Khoisan populations that existed thousands of years ago. Finally, we think that these results have important implications for work using genetics to understand the geographic origin of modern humans within Africa. Though both southern and eastern Africa have been proposed as potential origins, we see no genetic evidence in the tree in this post favoring either; from our point of view this question remains open.

Joe Pickrell, Nick Patterson, Mark Stoneking, David Reich, and Brigitte Pakendorf

An age-of-allele test of neutrality for transposable element insertions not at equilibrium

Justin P. Blumenstiel, Miaomiao He, Casey M. Bergman
(Submitted on 16 Sep 2012)

How natural selection acts to limit the proliferation of transposable elements (TEs) in genomes has been of interest to evolutionary biologists for many years. To describe TE dynamics in populations, many previous studies have relied on the assumption of equilibrium between transposition and selection. However, since TE invasions are known to happen in bursts through time, this assumption may not be reasonable. Here we derive a test of neutrality for TE insertions that does not rely on the assumption of transpositional equilibrium. We consider the case of TE insertions that have been ascertained from a single haploid reference genome sequence and have had their allele frequency estimated in a population sample. By conditioning on age information provided within the sequence of a TE insertion in the form of the number of substitutions that have occurred within the fragment since insertion into a reference genome, we derive the probability distribution for the TE allele frequency in a population sample under neutrality. Taking models of population fluctuation into account, we then test the fit of predictions of our model to allele frequency data from 190 retrotransposon insertion loci in North American and African populations of Drosophila melanogaster. Using this non-equilibrium model, we are able to explain about 80% of the variance in TE insertion allele frequencies. Controlling for nonequilibrium dynamics of transposition and host demography, we demonstrate how one may detect negative selection acting against most TEs as well as evidence for a small subset of TEs being driven to high frequency by positive selection. Our work establishes a new framework for the analysis of the evolutionary forces governing large insertion mutations like TEs or gene duplications.

Robust identification of local adaptation from allele frequencies

Torsten Günther, Graham Coop
(Submitted on 13 Sep 2012)

Comparing allele frequencies among populations that differ in environment has long been a tool for detecting loci involved in local adaptation. However, such analyses are complicated by an imperfect knowledge of population allele frequencies and neutral correlations of allele frequencies among populations due to shared population history and gene flow. Here we develop a set of methods to robustly test for unusual allele frequency patterns, and correlations between environmental variables and allele frequencies while accounting for these complications based on a Bayesian model previously implemented in the software Bayenv. Using this model, we calculate a set of ‘standardized allele frequencies’ that allows investigators to apply tests of their choice to multiple populations, while accounting for sampling and covariance due to population history. We illustrate this first by showing that these standardized frequencies can be used to calculate powerful tests to detect non-parametric correlations with environmental variables, which are also less prone to spurious results due to outlier populations. We then demonstrate how these standardized allele frequencies can be used to construct a test to detect SNPs that deviate strongly from neutral population structure. This test is conceptually related to FST but should be more powerful as we account for population history. We also extend the model to next-generation sequencing of population pools, which is a cost-efficient way to estimate population allele frequencies, but it implies an additional level of sampling noise. The utility of these methods is demonstrated in simulations and by re-analyzing human SNP data from the HGDP populations. An implementation of our method will be available from this http URL.
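One way to picture "standardized allele frequencies" is as a whitening transform: center the population frequencies on an ancestral frequency, scale by the binomial standard deviation, and remove the covariance among populations using a Cholesky factor. The sketch below assumes exactly that form and uses a made-up covariance matrix. It is not the Bayenv implementation, which works with posterior draws of transformed frequencies rather than raw values.

```python
import math


def cholesky(matrix):
    """Lower-triangular L with L * L^T = matrix (matrix must be positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def forward_solve(lower, rhs):
    """Solve L * z = rhs for z by forward substitution."""
    out = [0.0] * len(rhs)
    for i in range(len(rhs)):
        s = sum(lower[i][k] * out[k] for k in range(i))
        out[i] = (rhs[i] - s) / lower[i][i]
    return out


def standardize(freqs, ancestral_freq, covariance):
    """Center, scale, and decorrelate allele frequencies across populations."""
    scale = math.sqrt(ancestral_freq * (1.0 - ancestral_freq))
    centered = [(x - ancestral_freq) / scale for x in freqs]
    return forward_solve(cholesky(covariance), centered)


# Hypothetical covariance: populations 0 and 1 share recent history, 2 does not.
covariance = [[0.10, 0.08, 0.00],
              [0.08, 0.10, 0.00],
              [0.00, 0.00, 0.10]]
freqs = [0.55, 0.56, 0.30]
standardized = standardize(freqs, 0.5, covariance)
print([round(z, 3) for z in standardized])
```

After this transform the entries are approximately independent under the neutral model, so an off-the-shelf test such as a rank correlation with a similarly transformed environmental variable can be applied to them.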

Genome-wide analysis points to roles for extracellular matrix remodeling, the visual cycle, and neuronal development in myopia

Amy K. Kiefer, Joyce Y. Tung, Chuong B. Do, David A. Hinds, Joanna L. Mountain, Uta Francke, Nicholas Eriksson
(Submitted on 10 Sep 2012)

Myopia, or nearsightedness, is the most common eye disorder, resulting primarily from excess elongation of the eye. The etiology of myopia, although known to be complex, is poorly understood. Here we report the largest ever genome-wide association study (43,360 participants) on myopia in Europeans. We performed a survival analysis on age of myopia onset and identified 19 significant associations (p < 5e-8), two of which are replications of earlier associations with refractive error. These 19 associations in total explain 2.7% of the variance in myopia age of onset, and point towards a number of different mechanisms behind the development of myopia. One association is in the gene PRSS56, which has previously been linked to abnormally small eyes; one is in a gene that forms part of the extracellular matrix (LAMA2); two are in or near genes involved in the regeneration of 11-cis-retinal (RGR and RDH5); two are near genes known to be involved in the growth and guidance of retinal ganglion cells (ZIC2, SFRP1); and five are in or near genes involved in neuronal signaling or development. These novel findings point towards multiple genetic factors involved in the development of myopia and suggest that complex interactions between extracellular matrix remodeling, neuronal development, and visual signals from the retina may underlie the development of myopia in humans.

A genetic variant near olfactory receptor genes influences cilantro preference

Nicholas Eriksson, Shirley Wu, Chuong B. Do, Amy K. Kiefer, Joyce Y. Tung, Joanna L. Mountain, David A. Hinds, Uta Francke
(Submitted on 10 Sep 2012)

The leaves of the Coriandrum sativum plant, known as cilantro or coriander, are widely used in many cuisines around the world. However, far from being a benign culinary herb, cilantro can be polarizing—many people love it while others claim that it tastes or smells foul, often like soap or dirt. This soapy or pungent aroma is largely attributed to several aldehydes present in cilantro. Cilantro preference is suspected to have a genetic component, yet to date nothing is known about specific mechanisms. Here we present the results of a genome-wide association study among 14,604 participants of European ancestry who reported whether cilantro tasted soapy, with replication in a distinct set of 11,851 participants who declared whether they liked cilantro. We find a single nucleotide polymorphism (SNP) significantly associated with soapy-taste detection that is confirmed in the cilantro preference group. This SNP, rs72921001, (p=6.4e-9, odds ratio 0.81 per A allele) lies within a cluster of olfactory receptor genes on chromosome 11. Among these olfactory receptor genes is OR6A2, which has a high binding specificity for several of the aldehydes that give cilantro its characteristic odor. We also estimate the heritability of cilantro soapy-taste detection in our cohort, showing that the heritability tagged by common SNPs is low, about 0.087. These results confirm that there is a genetic component to cilantro taste perception and suggest that cilantro dislike may stem from genetic variants in olfactory receptors. We propose that OR6A2 may be the olfactory receptor that contributes to the detection of a soapy smell from cilantro in European populations.

Our paper: A faster-X effect for gene expression in Drosophila embryos

[This author post is by Alex Kalinka and Pavel Tomancak on their paper, An excess of gene expression divergence on the X chromosome in Drosophila embryos: implications for the faster-X hypothesis, posted to the arXiv here.]

We have been working towards publishing our study of gene expression evolution on the X chromosome in Drosophila embryos since the beginning of March this year. Recently, Casey Bergman suggested that we upload our manuscript to the arXiv, and after we did so, we were kindly invited by Graham Coop to write a guest post about our work for Haldane's Sieve.

It makes sense to post here since the roots of our study go back to Haldane in 1924 [1]; he recognised that the unusual inheritance pattern of the X chromosome, in which a single copy is present in the heterogametic sex versus two copies in the homogametic sex, could in turn lead to unusual evolutionary patterns on the X relative to the autosomes. If, for example, a beneficial mutation is recessive, then it would be more exposed to natural selection in the heterogametic sex where, relative to an equivalent autosomal allele, it would spend less time being masked by the dominant, less beneficial allele [1]. The prediction that adaptive evolution might proceed more quickly on the X than on the autosomes has been dubbed the faster-X hypothesis. However, the X chromosome might also be expected to evolve more rapidly for non-adaptive reasons. In each mating pair there will be 3 copies of the X chromosome versus 4 copies of each autosome, which might in turn lead to a lower chromosomal effective population size for the X thereby increasing the strength of random genetic drift.
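The 3-versus-4 copy count can be turned into effective population sizes with the standard textbook formulas for a population of breeding males and females in which males are the heterogametic sex, as in Drosophila. The sketch below uses those formulas; the population sizes are arbitrary illustrative values.

```python
def autosomal_ne(n_males, n_females):
    """Effective size for autosomes with separate sexes."""
    return 4.0 * n_males * n_females / (n_males + n_females)


def x_linked_ne(n_males, n_females):
    """Effective size for the X when males carry one X and females carry two."""
    return 9.0 * n_males * n_females / (4.0 * n_males + 2.0 * n_females)


# Equal numbers of breeding males and females: the X has 3/4 the autosomal Ne.
ratio_equal = x_linked_ne(500, 500) / autosomal_ne(500, 500)

# Few breeding males (e.g. strong male-male competition): the ratio rises.
ratio_skewed = x_linked_ne(100, 900) / autosomal_ne(100, 900)

print(ratio_equal, round(ratio_skewed, 3))
```

The 3/4 figure therefore holds only for an even breeding sex ratio. When few males breed, the X-to-autosome ratio can approach or exceed 1, which is one reason the drift-based expectation for the X is not a fixed quantity.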

While some studies have reported evidence for a faster-X effect for adaptive protein evolution in Drosophila, other studies have reported that there is no difference between the X and the autosomes, and to date the evidence is somewhat inconclusive. As we focused our study on gene expression, we had an opportunity to relax the implicit assumption that the majority of adaptive evolution occurs in coding regions. To help disentangle adaptive and non-adaptive evolutionary signatures in our data, we used both between-species measures of gene expression divergence and within-species measures of gene expression variation using inbred strains of D. melanogaster generated by the Drosophila Genetic Reference Panel (DGRP).

We found an excess of gene expression divergence on the X chromosome between six Drosophila species (a mean increase of ~20%). In contrast, we found that the X exhibits a significantly lower level of gene expression variation between inbred strains of D. melanogaster (a mean decrease of ~10%). Taken together, these results suggest that the divergence that we find between species is not driven by a relaxation of selective constraint on the X chromosome. To further explore whether such a signature could be driven by the hemizygosity of the X, we analysed gene expression in mutation accumulation lines of D. melanogaster. If the single copy of the X in males is driving the excess of expression divergence that we found on the X, then we would expect to find an excess of expression variation between lines that have independently accumulated mutations. In fact, we found the opposite was the case – the X chromosome displayed a significantly lower rate of mutation accumulation than the autosomes suggesting that the hemizygosity of the X alone is not sufficient to drive a higher rate of fixation of gene expression mutations.

Overall, we argue that the excess divergence we find on the X is best understood within the framework of the faster-X hypothesis. In support of this interpretation, we find that there is an excess of gene expression divergence on Muller's D element along the branch leading to the obscura sub-group; Muller's D element segregates as a neo-X chromosome in the obscura sub-group, and hence provides a powerful, independent test of faster evolution of the X chromosome.

Several questions remain, however, and we hope that our findings will help to stimulate further research into the details underpinning the differences we find on the X. In particular, work needs to be done to discover the genetics underlying divergence on the X, such as the relative importance of cis versus trans-acting factors, and, crucially, we need to develop a better understanding of how variation in gene expression impacts organismal fitness. Research into the latter question is essential if we are to bridge the conspicuous gaps between sequence variation, gene expression variation, and organismal fitness.

Dave Gerrard initially found elevated gene expression divergence on the X in the course of analysing data for our developmental hourglass paper, and spoke about his findings at the 43rd population genetics conference; that was more than two years ago. Since then we have collected new data, and it took a while to put the paper together, although it is still not certain that it will be published in a traditional journal. The arXiv is a great way to let the scientific community know about your results before the academic process runs its course. We only regret we didn't make use of this excellent outlet back in March.

[1] Haldane JBS (1924) A mathematical theory of natural and artificial selection. Part I. Trans Camb Phil Soc 23: 19-41.