Sampling through time and phylodynamic inference with coalescent and birth-death models
Erik M. Volz, Simon DW Frost
(Submitted on 28 Aug 2014)
Many population genetic models have been developed for the purpose of inferring population size and growth rates from random samples of genetic data. We examine two popular approaches to this problem, the coalescent and the birth-death-sampling model, in the context of estimating population size and birth rates in a population growing exponentially according to the birth-death branching process. For sequences sampled at a single time, we find that the coalescent and the birth-death-sampling model give virtually indistinguishable results in terms of the growth rates and fraction of the population sampled, even when sampling from a small population. For sequences sampled at multiple time points, we find that the birth-death model estimators are subject to large bias if the sampling process is misspecified. Since birth-death-sampling models incorporate a model of the sampling process, we show how much of the statistical power of birth-death-sampling models arises from the sequence of sample times and not from the genealogical tree. This motivates the development of a new coalescent estimator, which is augmented with a model of the known sampling process and is potentially more precise than the coalescent that does not use sample time information.
A genomic map of the effects of linked selection in Drosophila
Eyal Elyashiv, Shmuel Sattath, Tina T. Hu, Alon Strustovsky, Graham McVicker, Peter Andolfatto, Graham Coop, Guy Sella
(Submitted on 23 Aug 2014)
Natural selection at one site shapes patterns of genetic variation at linked sites. Quantifying the effects of ‘linked selection’ on levels of genetic diversity is key to making reliable inference about demography, building a null model in scans for targets of adaptation, and learning about the dynamics of natural selection. Here, we introduce the first method that jointly infers parameters of distinct modes of linked selection, notably background selection and selective sweeps, from genome-wide diversity data, functional annotations and genetic maps. The central idea is to calculate the probability that a neutral site is polymorphic given local annotations, substitution patterns, and recombination rates. Information is then combined across sites and samples using composite likelihood in order to estimate genome-wide parameters of distinct modes of selection. In addition to parameter estimation, this approach yields a map of the expected neutral diversity levels along the genome. To illustrate the utility of our approach, we apply it to genome-wide resequencing data from 125 lines in Drosophila melanogaster and reliably predict diversity levels at the 1Mb scale. Our results corroborate estimates of a high fraction of beneficial substitutions in proteins and untranslated regions (UTRs). They allow us to distinguish between the contribution of sweeps and other modes of selection around amino acid substitutions and to uncover evidence for pervasive sweeps in UTRs. Our inference further suggests a substantial effect of linked selection from non-classic sweeps. More generally, we demonstrate that linked selection has had a larger effect in reducing diversity levels and increasing their variance in D. melanogaster than previously appreciated.
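The composite-likelihood step described in this abstract can be illustrated with a toy sketch. It is not the authors' model: the function `site_probability`, the parameter `theta`, and the per-site `reduction` factor are hypothetical stand-ins, and each site is simply treated as an independent Bernoulli outcome (polymorphic or not) whose log-probabilities are summed.

```python
import math

def composite_log_likelihood(polymorphic, probs):
    """Sum per-site Bernoulli log-likelihoods as if sites were independent."""
    total = 0.0
    for is_poly, p in zip(polymorphic, probs):
        total += math.log(p) if is_poly else math.log(1.0 - p)
    return total

def site_probability(theta, reduction):
    """Hypothetical toy model: baseline diversity `theta` scaled by a local
    reduction factor in (0, 1] standing in for the effect of linked selection."""
    return theta * reduction

def fit_theta(polymorphic, reductions, grid):
    """Pick the grid value of theta that maximizes the composite likelihood."""
    def score(theta):
        probs = [site_probability(theta, r) for r in reductions]
        return composite_log_likelihood(polymorphic, probs)
    return max(grid, key=score)

# Eight sites, two of them polymorphic, no local reduction (assumed toy data).
observed = [1, 0, 0, 1, 0, 0, 0, 0]
reductions = [1.0] * len(observed)
grid = [i / 100 for i in range(1, 100)]
theta_hat = fit_theta(observed, reductions, grid)
```

With no local reduction the estimate is just the fraction of polymorphic sites. A realistic application would replace `site_probability` with a model driven by annotations, substitutions and recombination rates, as the abstract describes.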
Robust Population Structure Inference and Correction in the Presence of Known or Cryptic Relatedness
Matthew P Conomos, Michael B Miller, Timothy A Thornton
Population structure inference with genetic data has been motivated by a variety of applications in population genetics and genetic association studies. Several approaches have been proposed for the identification of genetic ancestry differences in samples where study participants are assumed to be unrelated, including principal components analysis (PCA), multi-dimensional scaling (MDS), and model-based methods for proportional ancestry estimation. Many genetic studies, however, include individuals with some degree of relatedness, and existing methods for inferring genetic ancestry fail in related samples. We present a method, PC-AiR, for robust population structure inference in the presence of known or cryptic relatedness. PC-AiR utilizes genome-screen data and an efficient algorithm to identify a diverse subset of unrelated individuals that is representative of all ancestries in the sample. The PC-AiR method directly performs PCA on the identified ancestry representative subset and then predicts components of variation for all remaining individuals based on genetic similarities. In simulation studies and in applications to real data from Phase III of the HapMap Project, we demonstrate that PC-AiR provides a substantial improvement over existing approaches for population structure inference in related samples. We also demonstrate significant efficiency gains, where a single axis of variation from PC-AiR provides better prediction of ancestry in a variety of structure settings than using ten (or more) components of variation from widely used PCA and MDS approaches. Finally, we illustrate that PC-AiR can provide improved population stratification correction over existing methods in genetic association studies with population structure and relatedness.
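The two-stage idea in this abstract (PCA on an unrelated subset, then prediction for everyone else) can be sketched roughly as follows. This is an illustration under stated assumptions, not the PC-AiR implementation: the unrelated subset is assumed to be given as a list of indices, only the leading component is computed (by power iteration), and "prediction" is taken to mean projecting each individual's centered genotypes onto the SNP loadings learned from the subset. All function names are hypothetical.

```python
def column_means(rows):
    """Per-SNP mean genotype across the given individuals."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def center(rows, means):
    """Subtract per-SNP means from every individual's genotypes."""
    return [[x - m for x, m in zip(row, means)] for row in rows]

def top_component(centered, iters=200):
    """Leading SNP loading vector of X^T X, found by power iteration."""
    n_snps = len(centered[0])
    v = [1.0 + 0.01 * j for j in range(n_snps)]  # deterministic start vector
    for _ in range(iters):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        w = [sum(s * row[j] for s, row in zip(scores, centered))
             for j in range(n_snps)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def pc_scores(genotypes, unrelated_idx):
    """Fit the first PC on the unrelated subset, then project everyone."""
    subset = [genotypes[i] for i in unrelated_idx]
    means = column_means(subset)
    loadings = top_component(center(subset, means))
    return [sum(x * w for x, w in zip(row, loadings))
            for row in center(genotypes, means)]

# Toy genotypes (0/1/2 allele counts): individuals 0-2 from one population,
# 3-5 from another; individuals 2 and 5 are assumed relatives left out of the fit.
genotypes = [
    [0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 2, 1],
    [2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1],
]
scores = pc_scores(genotypes, unrelated_idx=[0, 1, 3, 4])
```

In this toy example the held-out relatives fall on the same side of the first axis as their own population, although they played no part in estimating the loadings. Choosing the unrelated subset from kinship estimates, which the abstract describes as part of the method, is not attempted here.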
The impact of macroscopic epistasis on long-term evolutionary dynamics
Benjamin H. Good, Michael M. Desai
(Submitted on 18 Aug 2014)
Genetic interactions can strongly influence the fitness effects of individual mutations, yet the impact of these epistatic interactions on evolutionary dynamics remains poorly understood. Here we investigate the evolutionary role of epistasis over 50,000 generations in a well-studied laboratory evolution experiment in E. coli. The extensive duration of this experiment provides a unique window into the effects of epistasis during long-term adaptation to a constant environment. Guided by analytical results in the weak-mutation limit, we develop a computational framework to assess the compatibility of a given epistatic model with the observed patterns of fitness gain and mutation accumulation through time. We find that the average fitness trajectory alone provides little power to distinguish between competing models, including those that lack any direct epistatic interactions between mutations. However, when combined with the mutation trajectory, these observables place strong constraints on the set of possible models of epistasis, ruling out most existing explanations of the data. Instead, we find the strongest support for a “two-epoch” model of adaptation, in which an initial burst of diminishing returns epistasis is followed by a steady accumulation of mutations under a constant distribution of fitness effects. Our results highlight the need for additional DNA sequencing of these populations, as well as for more sophisticated models of epistasis that are compatible with all of the experimental data.
Understanding Admixture Fractions
Mason Liang, Rasmus Nielsen
Estimation of admixture fractions has become one of the most commonly used computational tools in population genomics. However, there is remarkably little population genetic theory on their statistical properties. We develop theoretical results that can accurately predict means and variances of admixture proportions within a population using models with recombination and genetic drift. Based on established theory on measures of multilocus disequilibrium, we show that there is a set of recurrence relations that can be used to derive expectations for higher moments of the admixture fraction distribution. We obtain closed form solutions for some special cases. Using these results, we develop a method for estimating admixture parameters from estimated admixture proportions obtained from programs such as Structure or Admixture. We apply this method to HapMap data and find that the population history of African Americans, as expected, is not best explained by a single admixture event between people of European and African ancestry. A model of constant gene flow for the past 11 generations until 2 generations ago gives a better fit.
inPHAP: Interactive visualization of genotype and phased haplotype data
Günter Jäger, Alexander Peltzer, Kay Nieselt
Comments: BioVis 2014 conference
Subjects: Graphics (cs.GR); Genomics (q-bio.GN)
Background: To understand individual genomes it is necessary to look at the variations that lead to changes in phenotype and possibly to disease. However, genotype information alone is often not sufficient and additional knowledge regarding the phase of the variation is needed to make correct interpretations. Interactive visualizations that allow the user to explore the data in various ways can be of great assistance in the process of making well-informed decisions. Currently, however, there is a lack of visualizations that can handle phased haplotype data. Results: We present inPHAP, an interactive visualization tool for genotype and phased haplotype data. inPHAP features a variety of interaction possibilities such as zooming, sorting, filtering and aggregation of rows in order to explore patterns hidden in large genetic data sets. As a proof of concept, we apply inPHAP to the phased haplotype data set of Phase 1 of the 1000 Genomes Project. This demonstrates inPHAP’s ability to show genetic variation at both the population and the individual level for several disease-related loci. Conclusions: As of today, inPHAP is the only visual analytical tool that allows the user to explore unphased and phased haplotype data interactively. Due to its highly scalable design, inPHAP can be applied to large datasets with up to 100 GB of data, enabling users to visualize even large-scale input data. inPHAP closes the gap between common visualization tools for unphased genotype data and introduces several new features, such as the visualization of phased data.
Improved genome inference in the MHC using a population reference graph
Alexander Dilthey, Charles J Cox, Zamin Iqbal, Matthew R Nelson, Gil McVean
In humans and many other species, while much is known about the extent and structure of genetic variation, such information is typically not used in assembling novel genomes. Rather, a single reference is used against which to map reads, which can lead to poor characterisation of regions of high sequence or structural diversity. Here, we introduce a population reference graph, which combines multiple reference sequences as well as catalogues of SNPs and short indels. The genomes of novel samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and variants. By applying the method to the 4.5Mb extended MHC region on chromosome 6, combining eight assembled haplotypes, sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate, using simulations, SNP genotyping, short-read and long-read data, how the method improves the accuracy of genome inference. Moreover, the analysis reveals regions where the current set of reference sequences is substantially incomplete, particularly within the Class II region, indicating the need for continued development of reference-quality genome sequences.
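The idea of reconstructing a sample as a path that may recombine between reference haplotypes can be illustrated with a much simpler model than the one in the abstract. The sketch below is an assumption-laden simplification, not the authors' population reference graph: the references are equal-length aligned strings rather than a graph, the hidden state is which haplotype is being copied at each position, and the `switch` and `error` probabilities are hypothetical constants. The most probable path is found with the Viterbi algorithm.

```python
import math

def viterbi_path(sample, haplotypes, switch=0.01, error=0.01):
    """Most probable sequence of copied haplotypes (needs >= 2 haplotypes)."""
    k = len(haplotypes)
    stay_lp = math.log(1.0 - switch)        # keep copying the same haplotype
    switch_lp = math.log(switch / (k - 1))  # recombine onto another haplotype

    def emit(h, i):
        # Log-probability of the sample base given the copied haplotype's base.
        match = haplotypes[h][i] == sample[i]
        return math.log(1.0 - error) if match else math.log(error)

    score = [math.log(1.0 / k) + emit(h, 0) for h in range(k)]
    back = []
    for i in range(1, len(sample)):
        new_score, pointers = [], []
        for h in range(k):
            best_prev = max(
                range(k),
                key=lambda g: score[g] + (stay_lp if g == h else switch_lp),
            )
            trans = stay_lp if best_prev == h else switch_lp
            new_score.append(score[best_prev] + trans + emit(h, i))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(k), key=lambda h: score[h])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Toy sample that matches haplotype 0 for four bases, then haplotype 1.
path = viterbi_path("AAAACCCC", ["AAAAAAAA", "CCCCCCCC"])
```

Here the inferred path copies the first reference for four positions and then switches to the second, since a single recombination is far cheaper than four mismatches. A graph-based model would additionally let paths pass through catalogued SNPs and indels, which this sketch does not represent.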