Population Genetics of Rare Variants and Complex Diseases

M. Cyrus Maher, Lawrence H. Uricchio, Dara G. Torgerson, Ryan D. Hernandez
(Submitted on 12 Feb 2013)

Identifying drivers of complex traits from the noisy signals of genetic variation obtained from high-throughput genome sequencing technologies is a central challenge faced by human geneticists today. We hypothesize that the variants involved in complex diseases are likely to exhibit non-neutral evolutionary signatures. Uncovering the evolutionary history of all variants is therefore of intrinsic interest for complex disease research. However, doing so necessitates the simultaneous elucidation of the targets of natural selection and population-specific demographic history. Here we characterize the action of natural selection operating across complex disease categories, and use population genetic simulations to evaluate the expected patterns of genetic variation in large samples. We focus on populations that have experienced historical bottlenecks followed by explosive growth (consistent with most human populations), and describe the differences between evolutionarily deleterious mutations and those that are neutral. Genes associated with several complex disease categories exhibit stronger signatures of purifying selection than non-disease genes. In addition, loci identified through genome-wide association studies of complex traits also exhibit signatures consistent with being in regions recurrently targeted by purifying selection. Through simulations, we show that population bottlenecks and rapid growth enable deleterious rare variants to persist at low frequencies just as long as neutral variants, but deleterious low-frequency and common variants tend to be much younger than neutral variants. This has resulted in a large proportion of modern-day rare alleles that have a deleterious effect on function, and that potentially contribute to disease susceptibility.

Separation of the largest eigenvalues in eigenanalysis of genotype data from discrete subpopulations

Katarzyna Bryc, Wlodek Bryc, Jack W. Silverstein
(Submitted on 18 Jan 2013)

We present a mathematical model, and the corresponding mathematical analysis, that justifies and quantifies the use of principal component analysis of biallelic genetic marker data for a set of individuals to detect the number of subpopulations represented in the data. We indicate that the power of the technique relies more on the number of individuals genotyped than on the number of markers.

SLiM: Simulating Evolution with Selection and Linkage

Philipp W. Messer
(Submitted on 14 Jan 2013)

SLiM is an efficient forward population genetic simulation program designed for studying the effects of linkage and selection on a chromosome-wide scale. The program can incorporate complex scenarios of demography and population substructure, various models for selection and dominance of new mutations, arbitrary gene and chromosomal structure, and user-defined recombination maps.

Estimating heterozygosity from a low-coverage genome sequence, leveraging data from other individuals sequenced at the same sites

Katarzyna Bryc, Nick Patterson, David Reich
(Submitted on 17 Dec 2012)

High-throughput shotgun sequence data makes it possible in principle to accurately estimate population genetic parameters without confounding by SNP ascertainment bias. One such statistic of interest is the proportion of heterozygous sites within an individual’s genome, which is informative about inbreeding and effective population size. However, in many cases, the available sequence data of an individual is limited to low coverage, preventing the confident calling of genotypes necessary to directly count the proportion of heterozygous sites. Here, we present a method for estimating an individual’s genome-wide rate of heterozygosity from low-coverage sequence data, without an intermediate step calling genotypes. Our method jointly learns the shared allele distribution between the individual and a panel of other individuals, together with the sequencing error distributions and the reference bias. We show our method works well, first by its performance on simulated sequence data, and secondly on real sequence data where we obtain estimates using low coverage data consistent with those from higher coverage. We apply our method to obtain estimates of the rate of heterozygosity for 11 humans from diverse world-wide populations, and through this analysis reveal the complex dependency of local sequencing coverage on the true underlying heterozygosity, which complicates the estimation of heterozygosity from sequence data. We show filters can correct for the confounding by sequencing depth. We find in practice that ratios of heterozygosity are more interpretable than absolute estimates, and show that we obtain excellent conformity of ratios of heterozygosity with previous estimates from higher coverage data.
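
To see why directly counting heterozygous sites breaks down at low coverage, consider a deliberately simplified model. This is an illustration only, not the authors' method: each read is assumed to sample one of the two alleles at a heterozygous site with probability 1/2, with no sequencing error and no reference bias. The helper name `prob_both_alleles_seen` is hypothetical.

```python
def prob_both_alleles_seen(depth: int) -> float:
    """Probability that `depth` reads at a heterozygous site include both alleles.

    Assumes each read samples either allele with probability 1/2,
    with no sequencing error and no reference bias (illustrative only).
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if depth == 0:
        return 0.0
    # All reads carry the same allele with probability 2 * (1/2) ** depth.
    return 1.0 - 2.0 * 0.5 ** depth


# Chance of even seeing both alleles at a true heterozygous site.
for depth in (1, 2, 4, 8):
    print(depth, prob_both_alleles_seen(depth))
```

Under these assumptions, half of all heterozygous sites covered by exactly two reads would show only one allele, which is why a method that avoids hard genotype calls is attractive for low-coverage genomes.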

Efficient moment-based inference of admixture parameters and sources of gene flow

Mark Lipson, Po-Ru Loh, Alex Levin, David Reich, Nick Patterson, Bonnie Berger
(Submitted on 11 Dec 2012)

The recent explosion in available genetic data has led to significant advances in understanding the demographic histories of and relationships among human populations. It is still a challenge, however, to infer reliable parameter values for complicated models involving many populations. Here we present MixMapper, an efficient, interactive method for constructing phylogenetic trees including admixture events using single nucleotide polymorphism (SNP) genotype data. MixMapper implements a novel two-phase approach to admixture inference using moment statistics, first building an unadmixed scaffold tree and then adding admixed populations by solving systems of equations that express allele frequency divergences in terms of mixture parameters. Importantly, all features of the tree, including topology, sources of gene flow, branch lengths, and mixture proportions, are optimized automatically from the data and include estimates of statistical uncertainty. MixMapper also uses a new method to express branch lengths in easily interpretable drift units. We apply MixMapper to recently published data for HGDP individuals genotyped on a SNP array designed especially for use in population genetics studies, obtaining confident results for 30 populations, 20 of them admixed. Notably, we confirm a signal of ancient admixture in European populations—including previously undetected admixture in Sardinians and Basques—involving a proportion of 20-40% ancient northern Eurasian ancestry.

Fast Algorithms for Reconciliation under Hybridization and Incomplete Lineage Sorting

Yun Yu, Luay Nakhleh
(Submitted on 9 Dec 2012)

Reconciling a gene tree with a species tree is an important task that reveals much about the evolution of genes, genomes, and species, as well as about the molecular function of genes. A wide array of computational tools has been devised for this task under certain evolutionary events such as hybridization, gene duplication/loss, or incomplete lineage sorting. Work on reconciling gene trees with species phylogenies under two or more of these events has also begun to emerge. Our group recently devised both parsimony and probabilistic frameworks for reconciling a gene tree with a phylogenetic network, thus allowing for the detection of hybridization in the presence of incomplete lineage sorting. While the frameworks are general and can handle any topology, they are computationally intensive, rendering their application to large datasets infeasible. In this paper, we present two novel approaches, based on the concept of ancestral configurations, to address the computational challenges of the two frameworks. Our approaches still compute exact solutions while improving the computational time by up to five orders of magnitude. These substantial gains in speed extend the applicability of these unified reconciliation frameworks to much larger datasets. We discuss how the topological features of the gene tree and phylogenetic network may affect the performance of the new algorithms. We have implemented the algorithms in our PhyloNet software package, which is publicly available in open source.

Our paper: The McDonald-Kreitman Test and its Extensions under Frequent Adaptation: Problems and Solutions

For our next guest post Philipp Messer and Dmitri Petrov write about their paper
The McDonald-Kreitman Test and its Extensions under Frequent Adaptation: Problems and Solutions, arXived here

The McDonald-Kreitman (MK) test is the basis of most modern approaches to measure the rate of adaptation from population genomic data. This test was used to argue that in some organisms, such as Drosophila, the rate of adaptation is surprisingly high. However, the MK test, and in fact most of the current machinery of population genetics, relies on the assumption that adaptation is rare so that the effects of selective sweeps on linked variation can be neglected. We test this assumption using a powerful forward simulation and show that the MK test is severely biased even when the rate of adaptation is only moderate. The biases arise from the complex linkage effects between slightly deleterious and strongly advantageous mutations. In order to deal with these biases, we suggest a new robust approach based on a simple asymptotic extension of the MK test.

We further show that already under very moderate amounts of adaptation, linkage effects from recurrent selective sweeps can profoundly affect key population genetic parameters, such as the fixation probabilities of deleterious mutations and the frequency distributions of polymorphisms. In synonymous polymorphism data, these linkage effects leave signatures that can easily be mistaken for the signatures of recent, severe population expansion.

The bigger claim of our paper is that the effects of linked selection cannot be simply swept under the rug by introducing effective parameters, such as effective population size or effective strength of selection, and then using these effective parameters in formulae derived from the diffusion approximation under the assumption of free recombination. Given that most of our estimates of the key evolutionary parameters are still obtained from methods based on this paradigm, we argue that it is crucial to verify whether they are robust to linkage effects.

Philipp Messer and Dmitri Petrov

Inference of Admixture Parameters in Human Populations Using Weighted Linkage Disequilibrium

Po-Ru Loh, Mark Lipson, Nick Patterson, Priya Moorjani, Joseph K Pickrell, David Reich, Bonnie Berger
(Submitted on 1 Nov 2012)

Long-range migrations and the resulting admixture between populations have been an important force shaping human genetic diversity. Most existing methods for detecting and reconstructing historical admixture events are based on allele frequency divergences or patterns of ancestry segments in chromosomes of admixed individuals. An emerging new approach harnesses the exponential decay of admixture-induced linkage disequilibrium (LD) as a function of genetic distance. Here, we comprehensively develop LD-based inference into a versatile tool for investigating admixture. We present a new weighted LD statistic that can be used to infer mixture proportions as well as dates with fewer constraints on reference populations than previous methods. We define an LD-based three-population test for admixture and identify scenarios in which it can detect admixture that previous formal tests cannot. We further show that we can discover phylogenetic relationships between populations by comparing weighted LD curves obtained using a suite of references. Finally, we describe several improvements to the computation and fitting of weighted LD curves that greatly increase the robustness and speed of the computation. We implement all of these advances in a software package, ALDER, which we validate in simulations and apply to test for admixture among all populations from the Human Genome Diversity Project (HGDP), highlighting insights into the admixture history of Central African Pygmies, Sardinians, and Japanese.
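
The exponential decay mentioned above can be made concrete with a small sketch. It is not ALDER's fitting procedure: it assumes a noise-free curve of the form amplitude * exp(-n * d), with d in Morgans and n the number of generations since admixture, no constant offset, and strictly positive LD values, so that a log-linear least-squares fit suffices. The function name `fit_admixture_date` and the synthetic data are hypothetical.

```python
import math


def fit_admixture_date(distances_morgans, ld_values):
    """Estimate generations since admixture from an LD decay curve.

    Fits log(ld) = log(amplitude) - n * d by ordinary least squares,
    i.e. assumes ld(d) = amplitude * exp(-n * d) with no constant offset
    and strictly positive ld values (a simplification for illustration).
    Returns (n, amplitude).
    """
    if len(distances_morgans) != len(ld_values) or len(ld_values) < 2:
        raise ValueError("need at least two matched (distance, LD) points")
    logs = [math.log(v) for v in ld_values]
    count = len(logs)
    mean_d = sum(distances_morgans) / count
    mean_log = sum(logs) / count
    sxx = sum((d - mean_d) ** 2 for d in distances_morgans)
    sxy = sum((d - mean_d) * (y - mean_log)
              for d, y in zip(distances_morgans, logs))
    slope = sxy / sxx
    intercept = mean_log - slope * mean_d
    return -slope, math.exp(intercept)


# Synthetic, noise-free curve: 50 generations, distances from 0.5 to 20 cM.
true_n, true_amp = 50.0, 0.01
dists = [0.005 * k for k in range(1, 41)]
curve = [true_amp * math.exp(-true_n * d) for d in dists]
n_hat, amp_hat = fit_admixture_date(dists, curve)
print(n_hat, amp_hat)
```

Real weighted LD curves are noisy and may include an offset term, so a practical fit would need nonlinear least squares and careful choice of the minimum distance; the sketch only shows how the decay rate maps to an admixture date.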

The McDonald-Kreitman Test and its Extensions under Frequent Adaptation: Problems and Solutions

Philipp W. Messer, Dmitri A. Petrov
(Submitted on 1 Nov 2012)

Population genomic studies have shown that genetic draft and background selection can profoundly affect the genome-wide patterns of molecular variation. We performed forward simulations under realistic gene-structure and selection scenarios to investigate whether such linkage effects impinge on the ability of the McDonald-Kreitman (MK) test to infer the rate of positive selection (α) from polymorphism and divergence data. We find that in the presence of slightly deleterious mutations, MK estimates of α severely underestimate the true rate of adaptation even if all polymorphisms with population frequencies under 50% are excluded. Furthermore, already under intermediate rates of adaptation, genetic draft substantially distorts the site frequency spectra at neutral and functional sites from the expectations under mutation-selection-drift balance. MK-type approaches that first infer demography from synonymous sites and then use the inferred demography to correct the estimation of α obtain almost the correct α in our simulations. However, these approaches typically infer a severe past population expansion although there was no such expansion in the simulations, casting doubt on the accuracy of methods that infer demography from synonymous polymorphism data. We suggest a simple asymptotic extension of the MK test that should yield accurate estimates of α even in the presence of linkage effects.
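
For readers unfamiliar with how the rate of adaptation is read off polymorphism and divergence counts, here is a minimal sketch of the commonly used MK-based point estimate, 1 - (Ds * Pn) / (Dn * Ps). The abstract does not spell out this formula, so treat it as background rather than the authors' code; the function name `mk_alpha` and the counts are made up, and the asymptotic extension proposed in the paper is not implemented here.

```python
def mk_alpha(dn: int, ds: int, pn: int, ps: int) -> float:
    """Classic MK-based estimate of the fraction of adaptive substitutions.

    dn, ds: nonsynonymous and synonymous divergence counts.
    pn, ps: nonsynonymous and synonymous polymorphism counts.
    Returns 1 - (ds * pn) / (dn * ps).
    """
    if dn <= 0 or ps <= 0:
        raise ValueError("dn and ps must be positive")
    return 1.0 - (ds * pn) / (dn * ps)


# Made-up counts, for illustration only.
print(mk_alpha(dn=60, ds=100, pn=30, ps=100))  # prints 0.5
```

Slightly deleterious nonsynonymous polymorphisms inflate `pn` relative to what eventually fixes, which pushes this estimate downward; that is the bias the abstract describes.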

Using haplotype differentiation among hierarchically structured populations for the detection of selection signatures

María Inés Fariello, Simon Boitard, Hugo Naya, Magali SanCristobal, Bertrand Servin
(Submitted on 29 Oct 2012)

The detection of molecular signatures of selection is one of the major concerns of modern population genetics. A widely used strategy in this context is to compare samples from several populations, and to look for genomic regions with outstanding genetic differentiation between these populations. Genetic differentiation is generally based on allele frequency differences between populations, which are measured by Fst or related statistics. Here we introduce a new statistic, denoted hapFLK, which focuses instead on the differences of haplotype frequencies between populations. In contrast to most existing statistics, hapFLK accounts for the hierarchical structure of the sampled populations. Using computer simulations, we show that each of these two features – the use of haplotype information and of the hierarchical structure of populations – significantly improves the power to detect selected loci, and that combining them in the hapFLK statistic provides even greater power. We also show that hapFLK is robust with respect to bottlenecks and migration and improves over existing approaches in many situations. Finally, we apply hapFLK to a set of six sheep breeds from Northern Europe, and identify seven regions under selection, which include already reported regions but also several new ones. We propose a method to help identify the population(s) under selection in a detected region, which reveals that in many of these regions selection most likely occurred in more than one population. Furthermore, several of the detected regions correspond to incomplete sweeps, where the favourable haplotype is only at intermediate frequency in the population(s) under selection.
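
As a point of reference for the allele-frequency baseline mentioned above, the following sketch computes a simple heterozygosity-based Fst, (H_T - H_S) / H_T, for a single biallelic locus. It is not hapFLK: it ignores haplotypes, population hierarchy, and sampling error, and the function name `fst_biallelic` is hypothetical.

```python
def fst_biallelic(freqs, sizes=None):
    """Heterozygosity-based Fst = (H_T - H_S) / H_T for one biallelic locus.

    freqs: frequency of one allele in each population.
    sizes: optional relative weights per population (equal by default).
    """
    if not freqs:
        raise ValueError("need at least one population")
    if sizes is None:
        sizes = [1.0] * len(freqs)
    total = sum(sizes)
    weights = [s / total for s in sizes]
    # Expected heterozygosity of the pooled population.
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    # Weighted mean expected heterozygosity within populations.
    h_within = sum(w * 2.0 * p * (1.0 - p) for w, p in zip(weights, freqs))
    if h_total == 0.0:
        return 0.0  # locus is monomorphic overall
    return (h_total - h_within) / h_total


print(fst_biallelic([0.5, 0.5]))  # identical frequencies
print(fst_biallelic([0.0, 1.0]))  # fixed difference
print(fst_biallelic([0.2, 0.8]))  # intermediate differentiation
```

A scan based on this kind of statistic treats every population as equally related to every other, which is the limitation that accounting for hierarchical structure is meant to address.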