The variance of identity-by-descent sharing in the Wright-Fisher model


Shai Carmi, Pier Francesco Palamara, Vladimir Vacic, Todd Lencz, Ariel Darvasi, Itsik Pe’er
(Submitted on 21 Jun 2012)

Widespread sharing of long, identical-by-descent (IBD) genetic segments is a hallmark of populations that have experienced a recent bottleneck. The detection of these IBD segments is now feasible, enabling a wide range of applications from phasing and imputation to demographic inference. Here, we study the distribution of IBD sharing in the Wright-Fisher model. Using coalescent theory, we calculate the mean and variance of the total sharing between arbitrary pairs of individuals. We then study the cohort-averaged sharing: the average total sharing between one individual and the rest of the cohort. We find that for large cohorts, the cohort-averaged sharing is distributed approximately normally. Surprisingly, the variance of this distribution remains large even for large cohorts, implying the existence of “hyper-sharing” individuals. The presence of such individuals has important consequences for the design of sequencing studies, since, if they are selected for whole-genome sequencing, a larger fraction of the cohort can subsequently be imputed. We calculate the expected gain in power of imputation by IBD, and subsequently in power to detect an association, when the sequenced individuals are either selected at random or are specifically the hyper-sharing individuals. Finally, we study the distribution of pairwise sharing and cohort-averaged sharing in the Ashkenazi Jewish population.
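The coalescent quantity underlying these calculations is the pairwise coalescence time, which in a haploid Wright-Fisher population of constant size N is geometric with mean N. A minimal simulation sketch of this building block (not the paper's analytical derivation; the population size and replicate count below are arbitrary illustrative choices):

```python
import random
import statistics

def pairwise_tmrca(N):
    """Generations back until two lineages pick the same parent in a
    haploid Wright-Fisher population of constant size N."""
    t = 0
    while True:
        t += 1
        # Each lineage chooses its parent uniformly at random, so the
        # two lineages coalesce with probability 1/N each generation.
        if random.randrange(N) == random.randrange(N):
            return t

random.seed(1)
N = 100
times = [pairwise_tmrca(N) for _ in range(5000)]
mean_t = statistics.mean(times)  # theory: geometric with mean N
sd_t = statistics.pstdev(times)  # for a geometric, close to the mean
```

Coalescence times translate into IBD sharing because a pair coalescing t generations ago shares, around any fixed locus, a segment of expected length roughly 100/t cM; summing such segments across the genome gives the total sharing whose mean and variance the paper derives.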

Approximate Bayesian computation via empirical likelihood

K. L. Mengersen (QUT, Brisbane), P. Pudlo (Universite Montpellier 2), C. P. Robert (Universite Paris-Dauphine)
(Submitted on 25 May 2012)

Approximate Bayesian computation (ABC) has now become an essential tool for the analysis of complex stochastic models when the likelihood function is unavailable. The well-established statistical method of empirical likelihood, however, provides another route to such settings, one that bypasses simulations from the model and the choices of the ABC parameters (summary statistics, distance, tolerance), while being provably convergent in the number of observations. Furthermore, avoiding model simulations leads to significant time savings in complex models, such as those used in population genetics. The ABCel algorithm we develop in this paper also provides an evaluation of its own performance through an associated effective sample size. The method is illustrated using several examples, including the estimation of standard and quantile distributions, as well as time series and population genetics models.
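For contrast, the conventional rejection-ABC recipe that ABCel bypasses — choose a summary statistic, a distance, and a tolerance — fits in a few lines. The toy model, prior, and tolerance below are illustrative choices, not from the paper:

```python
import random
import statistics

random.seed(0)

# "Observed" data: 100 draws from Normal(mu = 2, sd = 1); mu is unknown.
obs = [random.gauss(2.0, 1.0) for _ in range(100)]
obs_mean = statistics.mean(obs)  # the summary statistic

def simulated_summary(mu, n=100):
    return statistics.mean(random.gauss(mu, 1.0) for _ in range(n))

# Rejection ABC: draw mu from a uniform prior, simulate, and accept
# when the simulated summary lands within a tolerance of the observed one.
tol = 0.1
accepted = []
for _ in range(10000):
    mu = random.uniform(-5.0, 5.0)
    if abs(simulated_summary(mu) - obs_mean) < tol:
        accepted.append(mu)

posterior_mean = statistics.mean(accepted)
```

Every one of these choices (which summary, which distance, how small a tolerance) affects the approximation, and the acceptance rate collapses as summaries are added — exactly the machinery the empirical-likelihood route avoids.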

Single-crossover recombination and ancestral recombination trees

Ellen Baake, Ute von Wangenheim

We consider the Wright-Fisher model for a population of $N$ individuals, each identified with a sequence of a finite number of sites, and single-crossover recombination between them. We trace back the ancestry of single individuals from the present population. In the $N \to \infty$ limit without rescaling of parameters or time, this ancestral process is described by a random tree, whose branching events correspond to the splitting of the sequence due to recombination. With the help of a decomposition of the trees into subtrees and an inclusion-exclusion principle, we find a closed-form expression for the probabilities of the topologies of the ancestral trees. At the same time, these probabilities lead to an explicit solution of the deterministic single-crossover equation. The latter is a discrete-time dynamical system that emerges from the Wright-Fisher model via a law of large numbers and has been waiting for a solution for many decades.

Generative Probabilistic Model for Detecting Selection on Dispersed Genomic Elements from Polymorphism and Divergence

Ilan Gronau, Leonardo Arbiza, Adam Siepel
(Submitted on 29 Sep 2011 (v1), last revised 13 Aug 2012 (this version, v3))

We present a new probabilistic method for measuring the influence of natural selection on a collection of short elements scattered across a genome, based on observed patterns of polymorphism and divergence. This is a challenging task for various reasons, including variation across loci in mutation rates and genealogical backgrounds, and the influence of demography on patterns of polymorphism. In addition, accounting for the combined effects of different modes of selection is known to be a serious challenge for tests of selection that use patterns of polymorphism and divergence. Our method addresses these challenges by contrasting patterns of polymorphism and divergence in the elements of interest with those in flanking neutral sites. While this general approach is common to several existing tests of selection, our method improves substantially on them by making use of a full generative probabilistic model, directly accommodating weak negative selection, allowing information from many short elements to be combined in a statistically rigorous manner, and integrating phylogenetic information from multiple outgroup species with genome-wide population genetic data. Our model is able to account for weak negative, strong negative, and strong positive selection by making a small set of simple assumptions about their separate effects on polymorphism and divergence. We implemented an expectation-maximization algorithm for inference under this model and applied it to simulated and real data. Using simulations, we show that our inference procedure effectively disentangles the different modes of selection and provides accurate estimates of the parameters of interest that are robust to demography. We demonstrate an application of our methods to real data by analyzing several collections of human transcription factor binding sites identified using recently generated genome-wide ChIP-seq data.

A semi-automatic method to guide the choice of ridge parameter in ridge regression

Erika Cule, Maria De Iorio
(Submitted on 3 May 2012)

We consider the application of a popular penalised regression method, Ridge Regression, to data with very high dimensions and many more covariates than observations. Our motivation is the problem of out-of-sample prediction and the setting is high-density genotype data from a genome-wide association or resequencing study. Ridge regression has previously been shown to offer improved performance for prediction when compared with other penalised regression methods. One problem with ridge regression is the choice of an appropriate parameter for controlling the amount of shrinkage of the coefficient estimates. Here we propose a method for choosing the ridge parameter based on controlling the variance of the predicted observations in the model.
Using simulated data, we demonstrate that our method outperforms subset selection based on univariate tests of association and another penalised regression method, HyperLasso regression, in terms of improved prediction error. We extend our approach to regression problems when the outcomes are binary (representing cases and controls, as is typically the setting for genome-wide association studies) and demonstrate the method on a real data example consisting of case-control and genotype data on Bipolar Disorder, taken from the Wellcome Trust Case Control Consortium and the Genetic Association Information Network.
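As a toy view of what the penalty does to predictions, the single-covariate case has a closed form: beta = ⟨x, y⟩ / (⟨x, x⟩ + lambda). The sketch below (illustrative data, not the authors' procedure) shows the variance of the fitted values falling monotonically as the ridge parameter grows — the quantity their criterion controls:

```python
import random
import statistics

random.seed(2)

# Toy data: one covariate, y = 1.5 * x + noise.
n = 200
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [1.5 * xi + random.gauss(0.0, 1.0) for xi in x]

def ridge_beta(x, y, lam):
    """Closed-form ridge estimate for a single covariate."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

def fitted_variance(x, y, lam):
    beta = ridge_beta(x, y, lam)
    return statistics.pvariance([beta * xi for xi in x])

# Larger lambda shrinks beta toward zero, so the variance of the
# predicted observations decreases monotonically.
lams = [0.0, 10.0, 100.0, 1000.0]
variances = [fitted_variance(x, y, lam) for lam in lams]
```

Choosing the ridge parameter then amounts to picking a point on this monotone curve; the paper's contribution is a principled, semi-automatic rule for where to stop.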

LMM-Lasso: A Lasso Multi-Marker Mixed Model for Association Mapping with Population Structure Correction

Barbara Rakitsch, Christoph Lippert, Oliver Stegle, Karsten Borgwardt
(Submitted on 30 May 2012)

Exploring the genetic basis of heritable traits remains one of the central challenges in biomedical research. In simple cases, single polymorphic loci explain a significant fraction of the phenotype variability. However, many traits of interest appear to be subject to multifactorial control by groups of genetic loci instead. Accurate detection of such multivariate associations is nontrivial and often hindered by limited power. At the same time, confounding influences such as population structure cause spurious association signals that result in false positive findings if they are not accounted for in the model. Here, we propose LMM-Lasso, a mixed model that allows for both multi-locus mapping and correction for confounding effects. Our approach is simple and free of tuning parameters, effectively controls for population structure and scales to genome-wide datasets. We show practical use in genome-wide association studies and linkage mapping through retrospective analyses. In data from Arabidopsis thaliana and mouse, our method is able to find a genetic cause for significantly greater fractions of phenotype variation in 91% of the phenotypes considered. At the same time, our model dissects this variability into components that result from individual SNP effects and population structure. In addition to this increase in explained genetic heritability, enrichment of known candidate genes suggests that the associations retrieved by LMM-Lasso are more likely to be genuine.
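The multi-locus half of such a model is an l1-penalised regression. A minimal coordinate-descent lasso sketch on toy genotype-like data — note that the population-structure random effect, the other half of LMM-Lasso, is omitted here, and all names and settings are illustrative:

```python
import random

def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, sweeps=200):
    """Coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    resid = y[:]  # residual y - X beta, maintained incrementally
    for _ in range(sweeps):
        for j in range(p):
            # correlation of column j with the partial residual
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta

random.seed(3)
n, p = 100, 5
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [2.0 * X[i][0] + random.gauss(0.0, 0.5) for i in range(n)]  # only SNP 0 matters

beta = lasso_cd(X, y, lam=25.0)  # SNP 0 kept (shrunken), the rest zeroed
```

The l1 penalty drives irrelevant coefficients exactly to zero, which is what allows a multi-marker model to select a sparse set of loci; the mixed-model component then absorbs the structured background this toy ignores.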

The Genomic Signature of Crop-Wild Introgression in Maize

Matthew B. Hufford, Pesach Lubinsky, Tanja Pyhäjärvi, Michael T. Devengenzo, Norman C. Ellstrand, Jeffrey Ross-Ibarra
(Submitted on 19 Aug 2012)

The evolutionary significance of hybridization and introgression has long been appreciated, but evaluation of the genome-wide effects of these phenomena has only recently become possible. Crop-wild study systems represent ideal opportunities to examine evolution through hybridization. For example, maize and the conspecific wild teosinte Zea mays ssp. mexicana are known to hybridize in the fields of highland Mexico. Despite widespread evidence of gene flow, maize and mexicana maintain distinct morphologies and have done so in sympatry for thousands of years. Neither the genomic extent nor the evolutionary importance of introgression between these taxa is understood. We assessed patterns of genome-wide introgression based on 39,029 single nucleotide polymorphisms genotyped in 189 individuals from nine sympatric maize-mexicana populations and reference allopatric populations. While portions of these genomes were particularly resistant to introgression (notably near known cross-incompatibility and domestication loci), we detected widespread evidence for introgression in both directions of gene flow. Through further characterization of these regions and a growth chamber experiment, we found evidence consistent with the incorporation of adaptive mexicana alleles into maize during its expansion to the highlands of central Mexico. In contrast, we found very little evidence that introgression from maize to mexicana has altered the niche of this wild taxon by increasing its capacity to persist as a commensal of agriculture. The methods we have applied here can be replicated widely across species, greatly informing our understanding of evolution through introgressive hybridization. Crop species, due to their exceptional genomic resources and frequent histories of diffusion into sympatry with relatives, should be particularly influential in these studies.

Kernel Approximate Bayesian Computation for Population Genetic Inferences

Shigeki Nakagome, Kenji Fukumizu, Shuhei Mano
(Submitted on 15 May 2012)

As genomic data accumulate, Bayesian inferences can be applied to estimate evolutionary parameters. However, the complexity of stochastic models used in population genetics makes it difficult to derive the likelihoods needed for Bayesian inferences. Approximate Bayesian Computation (ABC) is an alternative approach for obtaining Bayesian inferences without likelihoods. ABC is a rejection-based method that applies a tolerance of dissimilarity between sets of summary statistics from observed and simulated data. ABC gives an exact sampler from the posterior density in the limit of zero tolerance. However, the choices for summary statistics and metrics of dissimilarity are ambiguous, and acceptance rates decrease with an increasing number of summary statistics. Therefore, it is difficult to maintain estimator consistency using ABC. In this study, we apply the kernel Bayes’ rule proposed by Fukumizu et al. (2011) to ABC. We report that kernel ABC (i) avoids the need for tolerance, (ii) upholds the consistency of estimators, and (iii) is tractable for a large number of summary statistics. We demonstrate these advantages by comparing kernel ABC with conventional ABC for population genetic inferences.
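One way to see the first advantage: replace the hard accept/reject tolerance with a smooth kernel weight, so every simulation contributes, downweighted by its distance to the observed summary. The sketch below is this simpler kernel-smoothing variant with an illustrative model and bandwidth — it is not the RKHS-based kernel Bayes' rule of Fukumizu et al. that the paper actually uses:

```python
import math
import random
import statistics

random.seed(4)

# Observed summary: the mean of 100 draws from Normal(mu = 1, sd = 1).
obs = statistics.mean(random.gauss(1.0, 1.0) for _ in range(100))

# Draw parameters from a uniform prior and record each simulated summary.
draws = []
for _ in range(5000):
    mu = random.uniform(-5.0, 5.0)
    s = statistics.mean(random.gauss(mu, 1.0) for _ in range(100))
    draws.append((mu, s))

# Gaussian kernel weights in place of a hard tolerance.
h = 0.2  # bandwidth
weights = [math.exp(-0.5 * ((s - obs) / h) ** 2) for _, s in draws]
post_mean = sum(w * mu for (mu, _), w in zip(draws, weights)) / sum(weights)

# Effective sample size of the weighted draws.
ess = sum(weights) ** 2 / sum(w * w for w in weights)
```

No simulation is discarded outright, so nothing is wasted as the number of summaries grows; the kernel bandwidth replaces the tolerance as the smoothing knob.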

Structured Input-Output Lasso, with Application to eQTL Mapping, and a Thresholding Algorithm for Fast Estimation

Seunghak Lee, Eric P. Xing
(Submitted on 9 May 2012)

We consider the problem of learning a high-dimensional multi-task regression model, under sparsity constraints induced by the presence of grouping structures on the input covariates and on the output predictors. This problem is primarily motivated by expression quantitative trait locus (eQTL) mapping, whose goal is to discover genetic variations in the genome (inputs) that influence the expression levels of multiple co-expressed genes (outputs), either epistatically, pleiotropically, or both. A structured input-output lasso (SIOL) model based on an intricate l1/l2-norm penalty over the regression coefficient matrix is employed to enable discovery of complex sparse input/output relationships, and a highly efficient new optimization algorithm called hierarchical group thresholding (HiGT) is developed to solve the resulting non-differentiable, non-separable, and ultra-high-dimensional optimization problem. We show, both in simulation and on a yeast eQTL dataset, that our model leads to significantly better recovery of the structured sparse relationships between the inputs and the outputs, and that our algorithm significantly outperforms other optimization techniques under the same model. Additionally, we propose a novel approach for efficiently and effectively detecting input interactions by exploiting prior knowledge available from biological experiments.
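Thresholding algorithms for l1/l2 penalties are built on the group soft-thresholding (proximal) operator, which zeroes an entire group when its l2 norm falls below the penalty and otherwise shrinks the norm. A minimal sketch of that operator alone — the hierarchical scheduling that makes HiGT fast is not shown:

```python
import math

def group_soft_threshold(v, lam):
    """Proximal operator of lam * ||v||_2: zero the whole group if its
    l2 norm is at most lam, otherwise shrink the norm by lam."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= lam:
        return [0.0] * len(v)
    scale = 1.0 - lam / norm
    return [scale * x for x in v]

small = group_soft_threshold([0.3, -0.4], 1.0)  # norm 0.5 <= 1: zeroed
big = group_soft_threshold([3.0, 4.0], 1.0)     # norm 5: shrunk to norm 4
```

Applied group-wise inside a proximal-gradient loop, this yields group-sparse coefficient matrices — whole input or output groups drop out together, which is exactly the behavior the l1/l2 penalty is designed to induce.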

An efficient group test for genetic markers that handles confounding (arXiv:1205.0793v1 [q-bio.GN])

Jennifer Listgarten, Christoph Lippert, David Heckerman

Approaches for testing groups of variants for association with complex traits are becoming critical. Examples of groups typically include a set of rare or common variants within a gene, but could also be variants within a pathway or any other set. These tests aggregate weak signal within a group, allow interplay among variants to be captured, and reduce the problem of multiple hypothesis testing. Unfortunately, these approaches do not address confounding by, for example, family relatedness and population structure, a problem that is becoming more important as larger data sets are used to increase power. We introduce a new approach for group tests that can handle confounding, based on Bayesian linear regression, which is equivalent to the linear mixed model. The approach uses two sets of covariates (equivalently, two random effects), one to capture the group association signal and one to capture confounding. We also introduce a computational speedup for the two-random-effects model that makes this approach feasible even for extremely large cohorts, whereas it otherwise would not be. Application of our approach to richly structured GAW14 data, comprising over eight ethnicities and many related family members, demonstrates that our method successfully corrects for population structure, while application of our method to WTCCC Crohn’s disease and hypertension data demonstrates that our method recovers genes not recoverable by univariate analysis, while still correcting for confounding structure.
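Speedups of this kind usually rest on a dual identity: ridge-type (Gaussian-prior) regression can be computed either in the p x p covariate space or in the n x n kernel space XX', whichever is smaller. A sketch of that general identity with tiny illustrative matrices — this is the textbook equivalence, not the authors' specific two-random-effects algorithm:

```python
def mat_inv(a):
    """Gauss-Jordan inverse for a small square matrix (list of lists)."""
    n = len(a)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

X = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]  # n = 3 samples, p = 2 covariates
y = [[1.0], [0.0], [2.0]]
lam = 0.7

Xt = transpose(X)

# Primal (p x p): beta = (X'X + lam I)^-1 X'y, yhat = X beta
XtX = matmul(Xt, X)
A = [[XtX[i][j] + (lam if i == j else 0.0) for j in range(2)] for i in range(2)]
beta = matmul(mat_inv(A), matmul(Xt, y))
yhat_primal = matmul(X, beta)

# Dual (n x n): alpha = (XX' + lam I)^-1 y, yhat = XX' alpha
K = matmul(X, Xt)
B = [[K[i][j] + (lam if i == j else 0.0) for j in range(3)] for i in range(3)]
alpha = matmul(mat_inv(B), y)
yhat_dual = matmul(K, alpha)
```

The two routes give identical fitted values, so when the number of markers far exceeds the cohort size (or vice versa), one can always work in the cheaper space — the same spirit as the two-random-effects speedup described here.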