Spatial localization of recent ancestors for admixed individuals

Wen-Yun Yang, Alexander Platt, Charleston Wen-Kai Chiang, Eleazar Eskin, John Novembre, Bogdan Pasaniuc

Ancestry analysis from genetic data plays a critical role in studies of human disease and evolution. Recent work has introduced explicit models for the geographic distribution of genetic variation and has shown that such explicit models yield superior accuracy in ancestry inference over non-model-based methods. Here we extend such work to introduce a method that models admixture between ancestors from multiple sources across a geographic continuum. We devise efficient algorithms based on hidden Markov models to localize on a map the recent ancestors (e.g. grandparents) of admixed individuals, jointly with assigning ancestry at each locus in the genome. We validate our methods using empirical data from individuals with mixed European ancestry from the POPRES study and show that our approach is able to localize their recent ancestors within an average of 470 km of the reported locations of their grandparents. Furthermore, simulations from real POPRES genotype data show that our method attains high accuracy in localizing recent ancestors of admixed individuals in Europe (an average of 550 km from their true location for localization of 2 ancestries in Europe, 4 generations ago). We explore the limits of ancestry localization under our approach and find that performance decreases as the number of distinct ancestries and the number of generations since admixture increase. Finally, we build a map of expected localization accuracy across admixed individuals according to the location of origin within Europe of their ancestors.

A novel Bayesian method for inferring and interpreting the dynamics of adaptive landscapes from phylogenetic comparative data

Josef C Uyeda, Luke J Harmon

Our understanding of macroevolutionary patterns of adaptive evolution has greatly increased with the advent of large-scale phylogenetic comparative methods. Widely used Ornstein-Uhlenbeck (OU) models can describe an adaptive process of divergence and selection. However, inference of the dynamics of adaptive landscapes from comparative data is complicated by interpretational difficulties, lack of identifiability among parameter values and the common requirement that adaptive hypotheses must be assigned a priori. Here we develop a reversible-jump Bayesian method of fitting multi-optima OU models to phylogenetic comparative data that estimates the placement and magnitude of adaptive shifts directly from the data. We show how biologically informed hypotheses can be tested against this inferred posterior of shift locations using Bayes factors to establish whether our a priori models adequately describe the dynamics of adaptive peak shifts. Furthermore, we show how the inclusion of informative priors can be used to restrict models to biologically realistic parameter space and test particular biological interpretations of evolutionary models. We argue that Bayesian model-fitting of OU models to comparative data provides a framework for integrating multiple sources of biological data, such as microevolutionary estimates of selection parameters and paleontological time series, allowing inference of adaptive landscape dynamics with explicit, process-based biological interpretations.
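As a rough illustration of what a multi-optima OU model implies for a single lineage, the sketch below computes the expected trait value after the lineage passes through a sequence of adaptive regimes. It uses the standard OU transition moments and is not the authors' reversible-jump sampler. The symbols `alpha` (strength of pull toward the optimum), `theta` (optimum) and `sigma` (diffusion rate), and both function names, are assumptions introduced here for the example.

```python
import math


def ou_moments(x0, theta, alpha, sigma, t):
    """Mean and variance of an OU process after time t, started at x0.

    Assumed notation: dX = alpha * (theta - X) dt + sigma dW, with alpha > 0.
    """
    decay = math.exp(-alpha * t)
    mean = theta + (x0 - theta) * decay
    var = sigma ** 2 / (2.0 * alpha) * (1.0 - decay ** 2)
    return mean, var


def expected_trait(x0, regimes, alpha):
    """Expected trait value after a list of (theta, duration) regimes.

    The OU mean is linear in the starting value, so the expectation can be
    propagated regime by regime without tracking the variance.
    """
    mean = x0
    for theta, duration in regimes:
        mean, _ = ou_moments(mean, theta, alpha, sigma=1.0, t=duration)
    return mean


# One shift in the optimum (from 1.0 to 3.0) partway along the lineage.
half_life = math.log(2.0)  # time to close half the gap when alpha = 1
example_mean = expected_trait(0.0, [(1.0, half_life), (3.0, half_life)], alpha=1.0)
```

With `alpha = 1`, each regime lasting one half-life closes half the distance to its optimum, so the lineage moves from 0 to 0.5 and then to 1.75. This lag behind the optimum is one reason shift magnitude and `alpha` can be hard to separate from tip data alone.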

Flexible methods for estimating genetic distances from nucleotide data

Simon Joly, David J Bryant, Peter J Lockhart

With the increasing use of massively parallel sequencing approaches in evolutionary biology, the need for fast and accurate methods suitable for investigating genetic structure and evolutionary history is greater than ever. We propose new distance measures for estimating genetic distances between individuals when allelic variation, gene dosage and recombination could compromise standard approaches. We present four distance measures based on single nucleotide polymorphisms (SNPs) and evaluate them against previously published measures using coalescent-based simulations. Simulations were used to test (i) whether the measures give unbiased and accurate distance estimates, (ii) whether they can accurately identify the genomic mixture of hybrid individuals and (iii) whether they give precise (low variance) estimates. The results showed that the SNP-based genpofad distance we propose appears to work well under the widest range of circumstances. It was the most accurate method for estimating genetic distances and is also relatively good at estimating the genomic mixture of hybrid individuals. Our simulations provide benchmarks to compare the performance of different distance measures in specific situations.

Selscan: an efficient multi-threaded program to perform EHH-based scans for positive selection

Zachary A Szpiech, Ryan D Hernandez
(Submitted on 26 Mar 2014)

Haplotype-based scans to detect natural selection are useful to identify recent or ongoing positive selection in genomes. As both real and simulated genomic datasets grow larger, spanning thousands of samples and millions of markers, there is a need for a fast and efficient implementation of these scans for general use. Here we present selscan, an efficient multi-threaded application that implements Extended Haplotype Homozygosity (EHH), Integrated Haplotype Score (iHS), and Cross-population Extended Haplotype Homozygosity (XPEHH). selscan performs extremely well on both simulated and real data and is over an order of magnitude faster than existing implementations. It calculates iHS on chromosome 22 (22,147 loci) across 204 CEU haplotypes in 502 s on one thread (77 s on 16 threads) and calculates XPEHH for the same data relative to 210 YRI haplotypes in 907 s on one thread (107 s on 16 threads). Source code and binaries (Windows, OSX and Linux) are available at this https URL.
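For readers unfamiliar with the statistic underlying these scans, here is a minimal sketch of EHH using its usual textbook definition: among haplotypes carrying a given core allele, the probability that two drawn at random are identical over the whole interval from the core to a flanking site. This is not selscan's code or interface. The function name `ehh`, the string encoding of haplotypes and the toy data are all assumptions made for the example.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, site, allele):
    """EHH from index `core` out to index `site` among carriers of `allele`.

    `haplotypes` is a list of equal-length strings, one character per marker.
    """
    lo, hi = min(core, site), max(core, site)
    # Keep only haplotypes with the core allele, cut down to the interval.
    carriers = [h[lo:hi + 1] for h in haplotypes if h[core] == allele]
    if len(carriers) < 2:
        raise ValueError("need at least two haplotypes carrying the core allele")
    # Count identical pairs within each distinct extended haplotype.
    classes = Counter(carriers)
    identical_pairs = sum(comb(n, 2) for n in classes.values())
    return identical_pairs / comb(len(carriers), 2)


# Toy data: five haplotypes, core marker at index 2, core allele "1".
toy_haps = ["00100", "00100", "00101", "10110", "00000"]
ehh_profile = [ehh(toy_haps, core=2, site=s, allele="1") for s in range(5)]
```

In the toy data, EHH is 1 at the core and decays with distance as carriers diverge. iHS and XPEHH are built by integrating such decay curves and comparing them between alleles or populations, which is why the cost grows quickly with sample size and marker count.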

Population genetics of identity by descent

Pier Francesco Palamara, Ph.D. thesis

Recent improvements in high-throughput genotyping and sequencing technologies have afforded the collection of massive, genome-wide datasets of DNA information from hundreds of thousands of individuals. These datasets, in turn, provide unprecedented opportunities to reconstruct the history of human populations and detect genotype-phenotype association. Recently developed computational methods can identify long-range chromosomal segments that are identical across samples, and have been transmitted from common ancestors that lived tens to hundreds of generations in the past. These segments reveal genealogical relationships that are typically unknown to the carrying individuals. In this work, we demonstrate that such identical-by-descent (IBD) segments are informative about a number of relevant population genetics features: they enable the inference of details about past population size fluctuations, migration events, and they carry the genomic signature of natural selection. We derive a mathematical model, based on coalescent theory, that allows for a quantitative description of IBD sharing across purportedly unrelated individuals, and develop inference procedures for the reconstruction of recent demographic events, where classical methodologies are statistically underpowered. We analyze IBD sharing in several contemporary human populations, including representative communities of the Jewish Diaspora, Kenyan Maasai samples, and individuals from several Dutch provinces, in all cases retrieving evidence of fine-scale demographic events from recent history. Finally, we expand the presented model to describe distributions for those sites in IBD shared segments that harbor mutation events, showing how these may be used for the inference of mutation rates in humans and other species.

Identifying recombination hotspots using population genetic data

Adam Auton, Simon Myers, Gil McVean
(Submitted on 17 Mar 2014)

Motivation: Recombination rates vary considerably at the fine scale within mammalian genomes, with the majority of recombination occurring within hotspots of ~2 kb in width. We present a method for inferring the location of recombination hotspots from patterns of linkage disequilibrium within samples of population genetic data. Results: Using simulations, we show that our method has hotspot detection power of approximately 50-60%, depending on the magnitude of the hotspot. The false positive rate is between 0.24 and 0.56 false positives per Mb for data typical of humans. Availability: this http URL

An improved sequence measure used to scan genomes for regions of recent gene flow

Anthony J. Geneva, Christina A. Muirhead, LeAnne M. Lovato, Sarah B. Kingan, Daniel Garrigan
(Submitted on 6 Mar 2014)

The study of complex speciation, or speciation with gene flow, requires the identification of genomic regions that are either unusually divergent or that have experienced recent gene flow. Furthermore, the rapid growth of population genomic datasets relevant to studying complex speciation requires that analytical tools be scalable to the level of whole-genome analysis. We present a simple sequence measure, Gmin, which is specifically designed to identify regions of diverging genomes as candidates for experiencing recent gene flow. Gmin is defined as the ratio of the minimum number of nucleotide differences between sequences from two different populations to the average number of between-population differences. We compare the sensitivity of Gmin to that of the widely used index of population differentiation, Fst. Extensive computer simulations demonstrate that Gmin has greater sensitivity and specificity to detect gene flow than Fst. Additionally, the sensitivity of Gmin to detect gene flow is robust with respect to both the population mutation and recombination rates, suggesting that it is flexible and can be applied to a variety of biological scenarios. Finally, a scan of Gmin across the X chromosome of Drosophila melanogaster identifies candidate regions of introgression between sub-Saharan African and cosmopolitan populations that were previously missed by other methods. These results demonstrate that Gmin is a biologically straightforward, yet powerful, alternative to Fst, as well as to more computationally intensive model-based methods for detecting gene flow.
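The definition of Gmin given in the abstract translates almost directly into code. The sketch below computes it for one window of aligned sequences. It is a plain reading of that one-sentence definition, not the authors' software. The function names, the string representation of sequences and the toy data are assumptions for illustration.

```python
from itertools import product
from typing import Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def gmin(pop1: Sequence[str], pop2: Sequence[str]) -> float:
    """Minimum between-population difference divided by the mean one."""
    # Every comparison pairs one sequence from each population.
    diffs = [hamming(a, b) for a, b in product(pop1, pop2)]
    mean_diff = sum(diffs) / len(diffs)
    if mean_diff == 0:
        raise ValueError("no between-population differences in this window")
    return min(diffs) / mean_diff


# Toy window: one pair of sequences is far more similar than the rest.
pop_a = ["AAAAAAAAAA", "AAAAAAAAAT"]
pop_b = ["AAAATTTTTT", "AAAAAAAATT"]
toy_gmin = gmin(pop_a, pop_b)
```

Here the four between-population comparisons differ at 6, 2, 5 and 1 sites, so Gmin is 1 / 3.5, roughly 0.29. Values well below 1 flag a window in which at least one pair of sequences is much closer than the population average, which is the pattern the abstract associates with recent gene flow.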

A renewal theory approach to IBD sharing

Shai Carmi, Itsik Pe’er
(Submitted on 6 Mar 2014)

Long genomic segments that are nearly identical between a pair of individuals and are inherited from a recent common ancestor without recombination are called identical-by-descent (IBD) segments. IBD sharing has numerous applications in genetics, from demographic inference to phasing, imputation, pedigree reconstruction, and disease mapping. Here, we provide a theoretical analysis of IBD sharing under Markovian approximations of the coalescent with recombination. We describe a general framework for the IBD process along the chromosome under the Markovian models (SMC/SMC’), as well as introduce and justify a new model, which we term the renewal approximation, under which lengths of successive segments are independent. Then, considering the infinite-chromosome limit of the IBD process, we recover previous results (for SMC) and derive new results (for SMC’) for the average fraction of the chromosome found in long shared segments and the average number of such segments. A number of new results for tree heights in SMC’ are proved as lemmas. We then use renewal theory to derive an expression (in Laplace space) for the distribution of the number of shared segments and demonstrate implications for demographic inference. We also use renewal theory to compute the distribution of the fraction of the chromosome shared. While the expression is again in Laplace space, we are able to invert the first two moments and compare a number of approximations. Finally, we generalize all results to populations with variable historical effective size.

Decoding coalescent hidden Markov models in linear time

Kelley Harris, Sara Sheehan, John A. Kamm, Yun S. Song
(Submitted on 4 Mar 2014)

In many areas of computational biology, hidden Markov models (HMMs) have been used to model local genomic features. In particular, coalescent HMMs have been used to infer ancient population sizes, migration rates, divergence times, and other parameters such as mutation and recombination rates. As more loci, sequences, and hidden states are added to the model, however, the runtime of coalescent HMMs can quickly become prohibitive. Here we present a new algorithm for reducing the runtime of coalescent HMMs from quadratic in the number of hidden time states to linear, without making any additional approximations. Our algorithm can be incorporated into various coalescent HMMs, including the popular method PSMC for inferring variable effective population sizes. Here we implement this algorithm to speed up our demographic inference method diCal, which is equivalent to PSMC when applied to a sample of two haplotypes. We demonstrate that the linear-time method can reconstruct a population size change history more accurately than the quadratic-time method, given similar computational resources. We also apply the method to data from the 1000 Genomes project, inferring a high-resolution history of size changes in the European population.
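To make the quadratic cost concrete, the sketch below implements the standard forward recursion for a generic HMM. This is the textbook baseline, not the authors' linear-time algorithm and not diCal or PSMC. The function name and the toy two-state parameters are assumptions for illustration, and the sketch omits the rescaling a real implementation would need to avoid underflow on long sequences.

```python
def forward_likelihood(init, trans, emit, obs):
    """Likelihood of `obs` under an HMM via the standard forward recursion.

    init[k]     : probability of starting in hidden state k
    trans[i][j] : probability of moving from state i to state j
    emit[k][o]  : probability that state k emits symbol o
    """
    n = len(init)
    alpha = [init[k] * emit[k][obs[0]] for k in range(n)]
    for o in obs[1:]:
        # For each of the n target states j, sum over all n source states i:
        # n * n work per locus, hence quadratic in the number of hidden states.
        alpha = [
            emit[j][o] * sum(alpha[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]
    return sum(alpha)


# Toy two-state model with two observable symbols (0 and 1).
toy_init = [0.5, 0.5]
toy_trans = [[0.9, 0.1], [0.2, 0.8]]
toy_emit = [[0.8, 0.2], [0.3, 0.7]]
toy_likelihood = forward_likelihood(toy_init, toy_trans, toy_emit, [0, 1])
```

The inner sum over source states is the step that a linear-time method has to avoid. The abstract does not describe how this is achieved, so the sketch stops at the baseline.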

Genome scans for detecting footprints of local adaptation using a Bayesian factor model


N. Duforet-Frebourg, E. Bazin, M.G.B. Blum
(Submitted on 21 Feb 2014)

A central part of population genomics consists of finding genomic regions implicated in local adaptation. Population genomic analyses are based on genotyping numerous molecular markers and looking for outlier loci in terms of patterns of genetic differentiation. One of the most common approaches to selection scans is based on statistics that measure population differentiation, such as FST. However, there are important caveats with approaches related to FST because they require grouping individuals into populations and they additionally assume a particular model of population structure. Here we implement a more flexible individual-based approach based on Bayesian factor models. Using hierarchical Bayesian modeling, we both infer population structure and identify outlier loci that are candidates for local adaptation. Factor models are strongly related to principal components analysis (PCA) and they model population structure with latent variables called factors. The hierarchical factor model considers that outlier loci are atypically explained by one of the factors. In a model of population divergence, we show that it can achieve a 2-fold or greater reduction in false discovery rate compared to the software BayeScan or to an FST approach. We show that our software can handle large SNP datasets by analyzing the HGDP SNP dataset. The Bayesian factor model is implemented in the command-line PCAdapt software.