Variational Inference of Population Structure in Large SNP Datasets

Anil Raj, Matthew Stephens, Jonathan K Pritchard

Tools for estimating population structure from genetic data are now used in a wide variety of applications in population genetics. However, inferring population structure in large modern data sets imposes severe computational challenges. Here, we develop efficient algorithms for approximate inference of the model underlying the STRUCTURE program using a variational Bayesian framework. Variational methods pose the problem of computing relevant posterior distributions as an optimization problem, allowing us to build on recent advances in optimization theory to develop fast inference tools. In addition, we propose useful heuristic scores to identify the number of populations represented in a dataset and a new hierarchical prior to detect weak population structure in the data. We test the variational algorithms on simulated data, and illustrate using genotype data from the CEPH-Human Genome Diversity Panel. The variational algorithms are almost two orders of magnitude faster than STRUCTURE and achieve accuracies comparable to those of ADMIXTURE. Furthermore, our results show that the heuristic scores for choosing model complexity provide a reasonable range of values for the number of populations represented in the data, with minimal bias towards detecting structure when it is very weak. Our algorithm, fastSTRUCTURE, is freely available online at http://pritchardlab.stanford.edu/structure.html.

Population genetics and substitution models of adaptive evolution

Mario dos Reis
(Submitted on 26 Nov 2013)

The ratio of non-synonymous to synonymous substitutions ω (= dN/dS) has been widely used as a measure of adaptive evolution in protein coding genes. Omega can be defined in terms of population genetics parameters as the fixation ratio of selected vs. neutral mutants. Here it is argued that approaches based on the infinite sites model are not appropriate to define ω for single codon locations. Simple models of amino acid substitution with reversible mutation and selection are analysed, and used to define ω under several evolutionary scenarios. In most practical cases ω < 1 […] ω > 1 can be sometimes expected for single locations at equilibrium. An example with influenza data is discussed.
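As a rough illustration of what "fixation ratio of selected vs. neutral mutants" means, the sketch below uses the standard weak-selection diffusion approximation, in which a new mutant with scaled selection coefficient S fixes S / (1 - exp(-S)) times as often as a neutral one. This is a textbook formula, not necessarily any of the models analysed in the paper; the function name and the scaling of S (for example S = 4Ns under genic selection in diploids) are assumptions made for the example.

```python
import math


def relative_fixation_rate(S):
    """Fixation probability of a new mutant relative to a neutral mutant.

    Assumes the weak-selection diffusion approximation with a scaled
    selection coefficient S (hypothetical scaling, e.g. S = 4*N*s).
    Values above 1 correspond to positively selected mutants,
    values below 1 to deleterious ones.
    """
    if abs(S) < 1e-8:
        # First-order series expansion around S = 0 avoids dividing 0 by 0.
        return 1.0 + S / 2.0
    if S > 0:
        # S / (1 - exp(-S)), written with expm1 for numerical accuracy.
        return S / (-math.expm1(-S))
    # Algebraically identical form that cannot overflow for very negative S.
    return S * math.exp(S) / math.expm1(S)


if __name__ == "__main__":
    for S in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"S = {S:+.1f}  relative fixation rate = {relative_fixation_rate(S):.4f}")
```

Under this approximation the ratio is 1 for neutral mutants, below 1 for deleterious ones, and above 1 for advantageous ones, which is the sense in which ω is read as a signal of adaptive evolution.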

Calibrated birth-death phylogenetic time-tree priors for Bayesian inference

Joseph Heled, Alexei J. Drummond
(Submitted on 19 Nov 2013)

Here we introduce a general class of multiple calibration birth-death tree priors for use in Bayesian phylogenetic inference. All tree priors in this class separate ancestral node heights into a set of “calibrated nodes” and “uncalibrated nodes” such that the marginal distribution of the calibrated nodes is user-specified whereas the density ratio of the birth-death prior is retained for trees with equal values for the calibrated nodes. We describe two formulations, one in which the calibration information informs the prior on ranked tree topologies, through the (conditional) prior, and the other which factorizes the prior on divergence times and ranked topologies, thus allowing uniform, or any arbitrary prior distribution on ranked topologies. While the first of these formulations has some attractive properties the algorithm we present for computing its prior density is computationally intensive. On the other hand, the second formulation is always computationally efficient. We demonstrate the utility of the new class of multiple-calibration tree priors using both small simulations and a real-world analysis and compare the results to existing schemes. The two new calibrated tree priors described in this paper offer greater flexibility and control of prior specification in calibrated time-tree inference and divergence time dating, and will remove the need for indirect approaches to the assessment of the combined effect of calibration densities and tree process priors in Bayesian phylogenetic inference.

Reducing pervasive false positive identical-by-descent segments detected by large-scale pedigree analysis

Eric Y. Durand, Nicholas Eriksson, Cory Y. McLean
(Submitted on 5 Nov 2013)

Analysis of genomic segments shared identical-by-descent (IBD) between individuals is fundamental to many genetic applications, but IBD detection accuracy in non-simulated data is largely unknown. Using 25,432 genotyped European individuals, and exploiting known familial relationships in 2,952 father-mother-child trios contained therein, we identify a false positive rate over 67% for short (2-4 centiMorgan) segments. We introduce a novel, computationally-efficient, haplotype-based metric that enables accurate IBD detection on population-scale datasets.

An HMM-based Comparative Genomic Framework for Detecting Introgression in Eukaryotes

Kevin J. Liu, Jingxuan Dai, Kathy Truong, Ying Song, Michael H. Kohn, Luay Nakhleh
(Submitted on 30 Oct 2013)

One outcome of interspecific hybridization and subsequent effects of evolutionary forces is introgression, which is the integration of genetic material from one species into the genome of an individual in another species. The evolution of several groups of eukaryotic species has involved hybridization, and cases of adaptation through introgression have been already established. In this work, we report on a new comparative genomic framework for detecting introgression in genomes, called PhyloNet-HMM, which combines phylogenetic networks, that capture reticulate evolutionary relationships among genomes, with hidden Markov models (HMMs), that capture dependencies within genomes. A novel aspect of our work is that it also accounts for incomplete lineage sorting and dependence across loci.
Application of our model to variation data from chromosome 7 in the mouse (Mus musculus domesticus) genome detects a recently reported adaptive introgression event involving the rodent poison resistance gene Vkorc1, in addition to other newly detected introgression regions. Based on our analysis, it is estimated that about 12% of all sites within chromosome 7 are of introgressive origin (these cover about 18 Mbp of chromosome 7, and over 300 genes). Further, our model detects no introgression in two negative control data sets. Our work provides a powerful framework for systematic analysis of introgression while simultaneously accounting for dependence across sites, point mutations, recombination, and ancestral polymorphism.

forqs: Forward-in-time Simulation of Recombination, Quantitative Traits, and Selection

Darren Kessner, John Novembre
(Submitted on 11 Oct 2013)

forqs is a forward-in-time simulation of recombination, quantitative traits, and selection. It was designed to investigate haplotype patterns resulting from scenarios where substantial evolutionary change has taken place in a small number of generations due to recombination and/or selection on polygenic quantitative traits. forqs is implemented as a command-line C++ program. Source code and binary executables for Linux, OSX, and Windows are freely available under a permissive BSD license.

A novel spectral method for inferring general selection from time series genetic data

Matthias Steinrücken, Anand Bhaskar, Yun S. Song
(Submitted on 3 Oct 2013)

Recently there has been growing interest in using time series genetic variation data, either from experimental evolution studies or ancient DNA samples, to make inference about evolutionary processes. While such temporal data can facilitate identifying genomic regions under selective pressure and estimating associated fitness parameters, it is a challenging problem to compute the likelihood of the underlying selection model given DNA samples obtained at several time points. Here, we develop an efficient algorithm to tackle this challenge. The key methodological advance in our work is the development of a novel spectral method to analytically and efficiently integrate over all trajectories of the population allele frequency between consecutive time points. This advance circumvents the limitations of existing methods which require fine-tuning the discretization of the allele frequency space to approximate certain integrals using numerical schemes. Furthermore, our method is flexible enough to handle general diploid models of selection where the heterozygote and homozygote fitness parameters can take any values, while previous methods focused on only a few restricted models of selection. We demonstrate the utility of our method on simulated data and apply the method to analyze time series ancient DNA data from genetic loci (ASIP and MC1R) associated with coat coloration in horses. In contrast to the conclusions of previous studies which considered only a few special selection schemes, our exploration of the full fitness parameter space reveals that balancing selection (in the form of heterozygote advantage) may have been acting on these loci.
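To make "general diploid models of selection" concrete, here is a minimal sketch of the textbook deterministic recursion for allele frequency under arbitrary genotype fitnesses. It ignores genetic drift and sampling entirely, so it is not the spectral method described in the abstract; the function names and the fitness values in the usage example are hypothetical and chosen only to show heterozygote advantage.

```python
def next_freq(p, w_AA, w_Aa, w_aa):
    """One generation of deterministic diploid selection on allele A.

    p is the current frequency of A; w_AA, w_Aa, w_aa are the relative
    fitnesses of the three genotypes (any positive values are allowed).
    """
    q = 1.0 - p
    # Mean fitness under Hardy-Weinberg genotype proportions.
    mean_w = p * p * w_AA + 2.0 * p * q * w_Aa + q * q * w_aa
    # Frequency of A after selection is its marginal fitness share.
    return p * (p * w_AA + q * w_Aa) / mean_w


def trajectory(p0, w_AA, w_Aa, w_aa, generations):
    """Allele frequencies from generation 0 to `generations`, inclusive."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_freq(freqs[-1], w_AA, w_Aa, w_aa))
    return freqs


if __name__ == "__main__":
    # Hypothetical fitnesses with the heterozygote fittest (balancing selection).
    freqs = trajectory(0.1, w_AA=0.9, w_Aa=1.0, w_aa=0.8, generations=500)
    print(f"start = {freqs[0]:.3f}, after 500 generations = {freqs[-1]:.4f}")
```

With the heterozygote fittest, the frequency settles at the interior equilibrium (w_Aa - w_aa) / (2 w_Aa - w_AA - w_aa) rather than going to fixation or loss, which is the signature of balancing selection in this simplified setting.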

Fast Inference of Admixture Coefficients Using Sparse Non-negative Matrix Factorization Algorithms

Eric Frichot, François Mathieu, Théo Trouillon, Guillaume Bouchard, Olivier François
(Submitted on 24 Sep 2013)

Inference of individual admixture coefficients, which is important for population genetic and association studies, is commonly performed using compute-intensive likelihood algorithms. With the availability of large population genomic data sets, fast versions of likelihood algorithms have attracted considerable attention. Reducing the computational burden of estimation algorithms remains, however, a major challenge. Here, we present a fast and efficient method for estimating individual admixture coefficients based on sparse non-negative matrix factorization algorithms. We implemented our method in the computer program sNMF, and applied it to human and plant genomic data sets. The performances of sNMF were then compared to the likelihood algorithm implemented in the computer program ADMIXTURE. Without loss of accuracy, sNMF computed estimates of admixture coefficients within run-times approximately 10 to 30 times faster than those of ADMIXTURE.

Watterson estimators for Next Generation Sequencing: from trios to autopolyploids

Luca Ferretti, Sebástian E. Ramos-Onsins
(Submitted on 17 Sep 2013)

Several variations of the Watterson estimator of variability for Next Generation Sequencing (NGS) data have been proposed in the literature. We present a unified framework for generalized Watterson estimators based on Maximum Composite Likelihood, which encompasses most of the existing estimators. We propose this class of unbiased estimators as generalized Watterson estimators for a large class of NGS data, including pools and trios. We also discuss the relation with the estimators that have been proposed in the literature and show that they admit two equivalent but seemingly different forms, deriving a set of combinatorial identities as a byproduct. Finally, we give a detailed treatment of Watterson estimators for single or multiple autopolyploid individuals.
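For reference, the classical Watterson estimator that these generalizations build on can be sketched in a few lines. The version below assumes n completely sequenced haploid sequences and a known count of segregating sites; it does not implement the NGS, pool, trio, or autopolyploid variants discussed in the paper, and the function names are chosen for the example.

```python
def harmonic_sum(n_sequences):
    """Return a_n = 1 + 1/2 + ... + 1/(n - 1) for a sample of n sequences."""
    return sum(1.0 / i for i in range(1, n_sequences))


def watterson_theta(segregating_sites, n_sequences):
    """Classical Watterson estimate of theta: S divided by a_n.

    segregating_sites is the number of polymorphic sites S observed in
    the sample; n_sequences is the number of sampled sequences (n >= 2).
    """
    if n_sequences < 2:
        raise ValueError("at least two sequences are needed")
    if segregating_sites < 0:
        raise ValueError("the number of segregating sites cannot be negative")
    return segregating_sites / harmonic_sum(n_sequences)


if __name__ == "__main__":
    # Hypothetical sample: 10 sequences with 25 segregating sites.
    print(f"theta_W = {watterson_theta(25, 10):.3f}")
```

Dividing the result by the number of sites surveyed gives a per-site estimate.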

Inferring selective constraint and recent gain and loss of function from population genomic data

Daniel R. Schrider, Andrew D. Kern
(Submitted on 10 Sep 2013)

The comparative genomics revolution of the past decade has enabled the discovery of functional elements in the human genome via sequence comparison. While that is so, an important class of elements, those specific to humans, is entirely missed by searching for sequence conservation across species. Here we present an analysis based on variation data among human genomes that utilizes a supervised machine learning approach for the identification of human specific function in the genome. Using only allele frequency information from the complete low coverage 1000 Genomes Project dataset in conjunction with a support vector machine trained from known functional and non-functional portions of the genome, we are able to identify functional portions of the genome with extremely high accuracy (~88%). Our method identifies previously known human-specific gains or losses of function and uncovers many novel candidates. Candidate targets for gain of function along the human lineage include a novel isoform of a killer cell immunoglobulin-like receptor, while loss of function candidates include many members of a gene cluster involved in shaping the complexity of synaptic connections in the brain. Finally, we show that the majority of the genome is unconstrained by natural selection currently, in agreement with what has been estimated from phylogenetic methods but in sharp contrast to estimates based on transcriptomics or other high-throughput functional methods.