Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data

Anand Bhaskar, Y.X. Rachel Wang, Yun S. Song

With the recent increase in study sample sizes in human genetics, there has been growing interest in inferring historical population demography from genomic variation data. Here, we present an efficient inference method that can scale up to very large samples, with tens or hundreds of thousands of individuals. Specifically, by utilizing analytic results on the expected frequency spectrum under the coalescent and by leveraging the technique of automatic differentiation, which allows us to compute gradients exactly, we develop a very efficient algorithm to infer piecewise-exponential models of the historical effective population size from the distribution of sample allele frequencies. Our method is orders of magnitude faster than previous demographic inference methods based on the frequency spectrum. In addition to inferring demography, our method can also accurately estimate locus-specific mutation rates. We perform extensive validation of our method on simulated data and show that it can accurately infer multiple recent epochs of rapid exponential growth, a signal which is difficult to pick up with small sample sizes. Lastly, we apply our method to analyze data from recent sequencing studies, including a large-sample exome-sequencing dataset of tens of thousands of individuals assayed at a few hundred genic regions.
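As a rough illustration of a frequency-spectrum likelihood, the sketch below uses the simplest possible case: a constant population size, where the expected number of sites with i derived alleles in a sample of n sequences is θ/i. This is not the piecewise-exponential model of the paper and uses no automatic differentiation. The function names and the observed counts are invented for illustration.

```python
import math

def expected_sfs_constant(n, theta=1.0):
    """Expected count of sites with i derived alleles (i = 1..n-1) in a
    sample of n sequences under constant population size: theta / i."""
    return [theta / i for i in range(1, n)]

def multinomial_loglik(observed, expected):
    """Log-likelihood (up to an additive constant) of observed SFS counts,
    given the shape of an expected spectrum."""
    total = sum(expected)
    return sum(o * math.log(e / total)
               for o, e in zip(observed, expected) if o > 0)

observed = [12, 6, 4, 3]                 # made-up counts for n = 5
neutral = expected_sfs_constant(5, theta=2.0)
flat = [1.0, 1.0, 1.0, 1.0]              # arbitrary alternative shape
print(multinomial_loglik(observed, neutral) > multinomial_loglik(observed, flat))
```

A demographic inference method would replace `expected_sfs_constant` with the expected spectrum under a parametric size history and maximize the likelihood over those parameters.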

Multi-locus analysis of genomic time series data from experimental evolution


Jonathan Terhorst, Yun S. Song

Genomic time series data generated by evolve-and-resequence (E&R) experiments offer a powerful window into the mechanisms that drive evolution. However, standard population genetic inference procedures do not account for sampling serially over time, and new methods are needed to make full use of modern experimental evolution data. To address this problem, we develop a Gaussian process approximation to the multi-locus Wright-Fisher process with selection over a time course of tens of generations. The mean and covariance structure of the Gaussian process are obtained by computing the corresponding moments in discrete-time Wright-Fisher models conditioned on the presence of a linked selected site. This enables our method to account for the effects of linkage and selection, both along the genome and across sampled time points, in an approximate but principled manner. Using simulated data, we demonstrate the power of our method to correctly detect, locate and estimate the fitness of a selected allele from among several linked sites. We also study how this power changes for different values of selection strength, initial haplotypic diversity, population size, sampling frequency, experimental duration, number of replicates, and sequencing coverage depth. In addition to providing quantitative estimates of selection parameters from experimental evolution data, our model can be used by practitioners to design E&R experiments with requisite power. Finally, we explore how our likelihood-based approach can be used to infer other model parameters, including effective population size and recombination rate, and discuss extensions to more complex models.
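To make the idea of matching Wright-Fisher moments concrete, here is a minimal single-locus sketch. It assumes a haploid model with selection coefficient s and no linkage, so it is far simpler than the multi-locus setting of the paper. It computes the one-generation mean and variance of the allele frequency, the quantities a Gaussian approximation would match, and simulates a short trajectory. All names and parameter values are hypothetical.

```python
import random

def one_step_moments(p, s, pop_size):
    """Mean and variance of the next-generation allele frequency in a
    haploid, single-locus Wright-Fisher model with selection."""
    mean = p * (1.0 + s) / (1.0 + s * p)   # deterministic selection step
    var = mean * (1.0 - mean) / pop_size   # binomial resampling noise
    return mean, var

def simulate_trajectory(p0, s, pop_size, generations, seed=0):
    """Simulate allele frequencies over a number of generations."""
    rng = random.Random(seed)
    freqs = [p0]
    for _ in range(generations):
        mean, _ = one_step_moments(freqs[-1], s, pop_size)
        # Binomial draw: each of pop_size offspring carries the allele
        # with probability `mean`.
        copies = sum(rng.random() < mean for _ in range(pop_size))
        freqs.append(copies / pop_size)
    return freqs

traj = simulate_trajectory(p0=0.2, s=0.1, pop_size=500, generations=20)
print(len(traj), traj[0])
```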

The effective founder effect in a spatially expanding population

Benjamin Marco Peter, Montgomery Slatkin

The gradual loss of diversity associated with range expansions is a well-known pattern observed in many species, and can be explained with a serial founder model. We show that under a branching process approximation, this loss in diversity is due to the difference in offspring variance between individuals at and away from the expansion front, which allows us to measure the strength of the founder effect, dependent on an effective founder size. We demonstrate that the predictions from the branching process model fit very well with Wright-Fisher forward simulations and backwards simulations under a modified Kingman coalescent, and further show that estimates of the effective founder size are robust to possibly confounding factors such as migration between subpopulations. We apply our method to a data set of Arabidopsis thaliana, where we find that the founder effect is about three times stronger in the Americas than in Europe, which may be attributed to the more recent, faster expansion.

Identifying the Genetic Basis of Functional Protein Evolution Using Reconstructed Ancestors

Victor Hanson-Smith, Christopher Baker, Alexander Johnson
(Submitted on 11 Jun 2014)

A central challenge in the study of protein evolution is the identification of historic amino acid sequence changes responsible for creating novel functions observed in present-day proteins. To address this problem, we developed a new method to identify and rank amino acid mutations in ancestral protein sequences according to their function-shifting potential. Our approach scans the changes between two reconstructed ancestral sequences in order to find (1) sites with sequence changes that significantly deviate from our model-based probabilistic expectations, (2) sites that demonstrate extreme changes in mutual information, and (3) sites with extreme gains or losses of information content. By taking the overlaps of these statistical signals, the method accurately identifies cryptic evolutionary patterns that are often not obvious when examining only the conservation of modern-day protein sequences. We validated this method with a training set of previously-discovered function-shifting mutations in three essential protein families in animals and fungi, whose evolutionary histories were the prior subject of systematic molecular biological investigation. Our method identified the known function-shifting mutations in the training set with a very low rate of false positive discovery. Further, our approach significantly outperformed other methods that use variability in evolutionary rates to detect functional loci. The accuracy of our approach indicates it could be a useful tool for generating specific testable hypotheses regarding the acquisition of new functions across a wide range of protein families.
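One of the three signals, gain or loss of information content at a site, can be sketched as follows. The sketch assumes that information content means log2(20) minus the Shannon entropy of the amino acid probabilities at a reconstructed ancestral site. The paper's exact definition may differ, and the function names and example distributions are hypothetical.

```python
import math

MAX_BITS = math.log2(20)  # upper bound for a 20-letter amino acid alphabet

def information_content(site_probs):
    """Information content in bits of one site, where site_probs maps
    amino acid -> probability (assumed to sum to 1)."""
    entropy = -sum(p * math.log2(p) for p in site_probs.values() if p > 0)
    return MAX_BITS - entropy

def information_change(older_site, younger_site):
    """Gain (positive) or loss (negative) of information between two ancestors."""
    return information_content(younger_site) - information_content(older_site)

older = {"A": 0.5, "S": 0.5}   # hypothetical ambiguous ancestral state
younger = {"S": 1.0}           # hypothetical confidently reconstructed state
print(information_change(older, younger))
```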

Author post: Predicting evolution from the shape of genealogical trees

This guest post by Richard Neher discusses his preprint Predicting evolution from the shape of genealogical trees. Richard A. Neher, Colin A. Russell, Boris I. Shraiman. arXived here. This is cross-posted from the Neher lab website.

In this preprint, a collaboration with Colin Russell and Boris Shraiman, we show that it is possible to predict which individual from a population is most closely related to future populations. To this end, we have developed a method that uses the branching pattern of genealogical trees to estimate which part of the tree contains the “fittest” sequences, where fit means rapidly multiplying. Those that multiply rapidly are most likely to take over the population. We demonstrate the power of our method by predicting the evolution of seasonal influenza viruses.

How does it work?
Individuals adapt to a changing environment by accumulating beneficial mutations, while avoiding deleterious mutations. We model this process assuming that there are many such mutations which change fitness in small increments. Using this model, we calculate the probability that an individual that lived in the past at time t leaves n descendants in the present. This distribution depends critically on the fitness of the ancestral individual. We then extend this calculation to the probability of observing a certain branch in a genealogical tree reconstructed from a sample of sequences. A branch in a tree connects an individual A that lived at time tA and had fitness xA with an individual B that lived at a later time tB with fitness xB, as illustrated in the figure. B has descendants in the sample, otherwise the branch would not be part of the tree. Furthermore, all sampled descendants of A are also descendants of B, otherwise the connection between A and B would have branched between tA and tB. We call the mathematical object describing fitness evolution between A and B the “branch propagator” and denote it by g(xB,tB|xA,tA). The joint probability distribution of fitness values of all nodes of the tree is given by a product of branch propagators. We then calculate the expected fitness of each node and use it to rank the sampled sequences. The top ranked sequence is our prediction for the sequence of the progenitor of the future population.

Why do we care?
Being able to predict evolution could have immediate applications. The best example is the seasonal influenza vaccine, which needs to be updated frequently to keep up with the evolving virus. Vaccine strains are chosen among sampled virus strains, and the more closely the chosen strain matches the future influenza virus population, the better the vaccine is going to be. Hence, by predicting a likely progenitor of the future population, our method could help to improve influenza vaccines. One of our predictions is shown in the figure, with the top ranked sequence marked by a black arrow. Influenza is not the only possible application. Since the algorithm only requires a reconstructed tree as input, it can be applied to other rapidly evolving pathogens or cancer cell populations. In addition to being useful, the ability to predict also implies that the model captures an essential aspect of evolutionary dynamics: influenza evolution depends to a substantial degree, enough to enable prediction, on the accumulation of small-effect mutations.

Comparison to other approaches
Given the importance of good influenza vaccines, there have been a number of previous efforts to anticipate influenza virus evolution, typically based on patterns of molecular evolution in historical data. Along these lines, Luksza and Lässig have recently presented an explicit fitness model for influenza virus evolution that rewards mutations at positions known to convey antigenic novelty and penalizes likely deleterious mutations (plus a few other things). By using influenza-specific molecular signatures, this model is complementary to ours, which uses only the tree reconstructed from nucleotide sequences. Interestingly, the two models do more or less equally well, and combining different methods of prediction should result in more reliable results.

High performance computation of landscape genomic models integrating local indices of spatial association


Sylvie Stucki, Pablo Orozco-terWengel, Michael W. Bruford, Licia Colli, Charles Masembe, Riccardo Negrini, Pierre Taberlet, Stéphane Joost, the NEXTGEN Consortium
Comments: 1 figure in text, 1 figure in supplementary material
Subjects: Populations and Evolution (q-bio.PE)

Motivation: The increasing availability of high-throughput datasets requires powerful methods to support the detection of signatures of selection in landscape genomics. Results: We present an integrated approach to study signatures of local adaptation, providing rapid processing of whole genome data and enabling assessment of spatial association using molecular markers. Availability: Samβada is open source software written in C++ available at http://lasig.epfl.ch/sambada (under the license GNU GPL 3). Compiled versions are provided for Windows, Linux and MacOS X. Contact: stephane.joost@epfl.ch, sylvie.stucki@a3.epfl.ch. Supplementary material is available online.

Inferring human population size and separation history from multiple genome sequences

Stephan Schiffels, Richard Durbin

The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model their ancestral relationship under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20-30 thousand years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The Multiple Sequentially Markovian Coalescent (MSMC) analyses the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago, and give information about human population history as recently as 2,000 years ago, including the bottleneck in the peopling of the Americas, and separations within Africa, East Asia and Europe.
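One intuition for why focusing on the first coalescence among several haplotypes gives access to recent history: under the standard coalescent with constant population size, the first coalescence among k lineages occurs at rate k(k-1)/2 (in units of 2N generations), so its expected time shrinks quickly as k grows. The sketch below illustrates only this textbook fact and is not the MSMC model itself. The effective size and generation time are illustrative assumptions, not values from the paper.

```python
from math import comb

def expected_first_coalescence(k):
    """Expected time to the first coalescence among k lineages, in
    coalescent units of 2N generations, for a constant-size population."""
    return 1.0 / comb(k, 2)

def to_years(t_coalescent, n_e, generation_years):
    """Convert coalescent time units to years for a diploid population."""
    return t_coalescent * 2 * n_e * generation_years

# Assumed values for illustration only.
N_E = 10_000
GENERATION_YEARS = 30

for k in (2, 4, 8):
    years = to_years(expected_first_coalescence(k), N_E, GENERATION_YEARS)
    print(k, round(years))
```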

Cosi2: An efficient simulator of exact and approximate coalescent with selection

Ilya Shlyakhter, Pardis C. Sabeti, Stephen F. Schaffner

Motivation: Efficient simulation of population genetic samples under a given demographic model is a prerequisite for many analyses. Coalescent theory provides an efficient framework for such simulations, but simulating longer regions and higher recombination rates remains challenging. Simulators based on a Markovian approximation to the coalescent scale well, but do not support simulation of selection. Gene conversion is not supported by any published coalescent simulators that support selection. Results: We describe cosi2, an efficient simulator that supports both exact and approximate coalescent simulation with positive selection. cosi2 improves on the speed of existing exact simulators, and permits further speedup in approximate mode while retaining support for selection. cosi2 supports a wide range of demographic scenarios including recombination hot spots, gene conversion, population size changes, population structure and migration. cosi2 implements coalescent machinery efficiently by tracking only a small subset of the Ancestral Recombination Graph, sampling only relevant recombination events, and using augmented skip lists to represent tracked genetic segments. To preserve support for selection in approximate mode, the Markov approximation is implemented not by moving along the chromosome but by performing a standard backwards-in-time coalescent simulation while restricting coalescence to node pairs with overlapping or near-overlapping genetic material. We describe the algorithms used by cosi2 and present comparisons with existing selection simulators.

diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals

Paula Tataru, Jasmine A. Nirody, Yun S. Song

Summary: We present a tool, diCal-IBD, for detecting identity-by-descent (IBD) tracts between pairs of genomic sequences. Our method builds on a recent demographic inference method based on the coalescent with recombination, and is able to incorporate demographic information as a prior. A simulation study shows that diCal-IBD has significantly higher recall and precision than existing IBD detection methods, while retaining reasonable accuracy for IBD tracts as small as 0.1 cM. Availability: https://sourceforge.net/projects/dical-ibd/ Contact: yss@eecs.berkeley.edu

Author post: Spatial localization of recent ancestors for admixed individuals

A guest post by Bogdan Pasaniuc [@bpasaniuc] on his paper with coauthors: Spatial localization of recent ancestors for admixed individuals by Wen-Yun Yang, Alexander Platt, Charleston Wen-Kai Chiang, Eleazar Eskin, John Novembre, Bogdan Pasaniuc. bioRxived here.

Geographic localization based on genetic data has received much attention recently. Here we present a preprint that aims to address one of the drawbacks of existing approaches. As opposed to existing works that typically make a very strong assumption that all recent ancestors come from the same location on a map, we seek to infer multiple locations for a given individual corresponding to its ancestors. That is, our approach uses genetic data from a given individual to localize on the map its recent ancestors several generations ago (e.g. grandparents).

To accomplish this we approximate the admixture process (i.e. mixing of genetic variants from different sources) in a genetic-geographic continuum. We view the mixed ancestry genome as being generated from several locations on a map (corresponding to its recent ancestors) and model the mosaic structure of local ancestries across the genome through an admixture HMM. We link geography to the admixture process by allowing allele frequencies at every site in the genome to vary across geography according to a logistic gradient function (as in SPA[1]); the complete model is an admixture HMM for a genotype-specific pair of ancestral locations on the map.
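The logistic gradient idea can be sketched in a few lines. The code below assumes that the allele frequency at a site is a logistic function of a linear form in the map coordinates, with a per-site slope vector and intercept, and that an individual's expected genotype at a locus is the sum of the frequencies at the two ancestral locations contributing there. The names `allele_frequency` and `expected_genotype`, and all parameter values, are hypothetical. This is not the SPAMIX implementation.

```python
import math

def allele_frequency(location, slope, intercept):
    """Logistic gradient: 1 / (1 + exp(-(slope . location + intercept)))."""
    z = sum(a * x for a, x in zip(slope, location)) + intercept
    return 1.0 / (1.0 + math.exp(-z))

def expected_genotype(loc_a, loc_b, slope, intercept):
    """Expected allele count (between 0 and 2) at one site when one
    haplotype derives from each of two ancestral locations."""
    return (allele_frequency(loc_a, slope, intercept)
            + allele_frequency(loc_b, slope, intercept))

# Hypothetical site whose frequency increases from west to east.
slope, intercept = (1.0, 0.0), 0.0
west, east = (-2.0, 0.0), (2.0, 0.0)
print(allele_frequency(west, slope, intercept),
      allele_frequency(east, slope, intercept),
      expected_genotype(west, east, slope, intercept))
```

In the full model the pair of locations would vary along the genome according to the admixture HMM; here it is fixed for a single site.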

As the number of generations since admixture increases, the total number of ancestors to localize increases dramatically, making the inference infeasible (http://gcbias.org/2013/11/11/how-does-your-number-of-genetic-ancestors-grow-back-over-time/). To account for this, we limit the number of different “ancestry locations” that contribute to admixture to a small constant, each with a varying amount of contribution. We devise efficient algorithms to make inferences in our model and show that accuracy decreases with the number of locations to infer, with the number of generations since admixture, and with geographic distance among ancestors. For example, SPAMIX can localize the grandparents of the POPRES[2] individuals with multiple sub-continental European ancestries within 470 km of their reported locations.

As with all methods, limitations do exist and we outline several here. We use logistic gradient functions to relate geography to genetics, and investigating more complex functions may prove fruitful. We developed an efficient algorithm for producing point estimates for location and locus-specific ancestry; in some cases a probabilistic output may be desired. Finally, our approach models admixture-LD and assumes no background LD; more involved procedures to model background LD (such as the one we proposed [3]) are an interesting area of research.

1. Yang, Wen-Yun, et al. “A model-based approach for analysis of spatial structure in genetic data.” Nature Genetics 44.6 (2012): 725-731.
2. Nelson, Matthew R., et al. “The population reference sample, POPRES: a resource for population, disease, and pharmacological genetics research.” The American Journal of Human Genetics 83.3 (2008): 347-358.
3. Baran, Yael, et al. “Enhanced localization of genetic samples through linkage-disequilibrium correction.” The American Journal of Human Genetics 92.6 (2013): 882-894.