Approximate Bayesian computation via empirical likelihood

K. L. Mengersen (QUT, Brisbane), P. Pudlo (Universite Montpellier 2), C. P. Robert (Universite Paris-Dauphine)
(Submitted on 25 May 2012)

Approximate Bayesian computation (ABC) has now become an essential tool for the analysis of complex stochastic models when the likelihood function is unavailable. The well-established statistical method of empirical likelihood, however, provides another route to such settings that bypasses simulations from the model and the choices of the ABC parameters (summary statistics, distance, tolerance), while being provably convergent in the number of observations. Furthermore, avoiding model simulations leads to significant time savings in complex models, such as those used in population genetics. The ABCel algorithm we develop in this paper also provides an evaluation of its own performance through an associated effective sample size. The method is illustrated using several examples, including estimation of standard and quantile distributions, and time series and population genetics models.

Population genomics of the Wolbachia endosymbiont in Drosophila melanogaster

Mark F. Richardson, Lucy A. Weinert, John J. Welch, Raquel S. Linheiro, Michael M. Magwire, Francis M. Jiggins, Casey M. Bergman
(Submitted on 25 May 2012 (v1), last revised 2 Aug 2012 (this version, v2))

Wolbachia are maternally-inherited symbiotic bacteria commonly found in arthropods, which are able to manipulate the reproduction of their host in order to maximise their transmission. Here we use whole genome resequencing data from 290 lines of Drosophila melanogaster from North America, Europe and Africa to predict Wolbachia infection status, estimate cytoplasmic genome copy number, and reconstruct Wolbachia and mtDNA genome sequences. Complete Wolbachia and mitochondrial genomes show congruent phylogenies, consistent with strict vertical transmission through the maternal cytoplasm and recurrent loss of Wolbachia in multiple populations. Bayesian phylogenetic analysis reveals that the most recent common ancestor of all Wolbachia and mitochondrial genomes in D. melanogaster dates to around 8,000 years ago. We find evidence for a recent incomplete global replacement of ancestral Wolbachia and mtDNA lineages, which is likely to be one of several similar incomplete replacement events that have occurred since the out-of-Africa migration that allowed D. melanogaster to colonize worldwide habitats.

Single-crossover recombination and ancestral recombination trees

by Ellen Baake, Ute von Wangenheim

We consider the Wright-Fisher model for a population of $N$ individuals, each identified with a sequence of a finite number of sites, and single-crossover recombination between them. We trace back the ancestry of single individuals from the present population. In the $N \to \infty$ limit without rescaling of parameters or time, this ancestral process is described by a random tree, whose branching events correspond to the splitting of the sequence due to recombination. With the help of a decomposition of the trees into subtrees and an inclusion-exclusion principle, we find a closed-form expression for the probabilities of the topologies of the ancestral trees. At the same time, these probabilities lead to an explicit solution of the deterministic single-crossover equation. The latter is a discrete-time dynamical system that emerges from the Wright-Fisher model via a law of large numbers and has been waiting for a solution for many decades.
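
As a rough illustration of the deterministic single-crossover dynamics mentioned above, the following sketch iterates the simplest special case: two sites with two alleles each. The update rule used here (keep the current type with probability 1 - rho, otherwise pair the two site marginals independently) is the standard two-site recombination recursion and is an assumption of this example; the abstract does not state the general equation, and the names `recombine`, `disequilibrium` and `rho` are hypothetical.

```python
def recombine(p, rho):
    """One generation of two-site recombination (assumed special case).

    With probability 1 - rho a type is inherited unchanged; with
    probability rho its two sites come from independent parents.
    """
    left = {a: sum(v for (x, _), v in p.items() if x == a) for a in (0, 1)}
    right = {b: sum(v for (_, y), v in p.items() if y == b) for b in (0, 1)}
    return {(a, b): (1 - rho) * p[(a, b)] + rho * left[a] * right[b]
            for a in (0, 1) for b in (0, 1)}


def disequilibrium(p):
    """Linkage disequilibrium D = p(0,0) - p_left(0) * p_right(0)."""
    left0 = p[(0, 0)] + p[(0, 1)]
    right0 = p[(0, 0)] + p[(1, 0)]
    return p[(0, 0)] - left0 * right0


# Start from complete association between the two sites.
p = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}
rho = 0.1  # hypothetical crossover probability per generation
d0 = disequilibrium(p)
for _ in range(10):
    p = recombine(p, rho)
d10 = disequilibrium(p)
```

In this special case the disequilibrium shrinks by a factor of 1 - rho each generation, so an explicit solution is immediate; the difficulty addressed in the paper arises with more than two sites, where this sketch does not apply.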

Generative Probabilistic Model for Detecting Selection on Dispersed Genomic Elements from Polymorphism and Divergence

Ilan Gronau, Leonardo Arbiza, Adam Siepel
(Submitted on 29 Sep 2011 (v1), last revised 13 Aug 2012 (this version, v3))

We present a new probabilistic method for measuring the influence of natural selection on a collection of short elements scattered across a genome based on observed patterns of polymorphism and divergence. This is a challenging task for various reasons, including variation across loci in mutation rates and genealogical backgrounds, and the influence of demography on patterns of polymorphism. In addition, accounting for the combined effects of different modes of selection is known to be a serious challenge for tests of selection that use patterns of polymorphism and divergence. Our method addresses these challenges by contrasting patterns of polymorphism and divergence in the elements of interest with those in flanking neutral sites. While this general approach is common to several existing tests of selection, our method improves substantially on these methods by making use of a full generative probabilistic model, directly accommodating weak negative selection, allowing information from many short elements to be combined in a statistically rigorous manner, and integrating phylogenetic information from multiple outgroup species with genome-wide population genetic data. Our model is able to account for weak negative, strong negative, and strong positive selection, by making a small set of simple assumptions on their separate effects on polymorphism and divergence. We implemented an expectation maximization algorithm for inference under this model and applied it to simulated and real data. Using simulations, we show that our inference procedure effectively disentangles the different modes of selection and provides accurate estimates of the parameters of interest that are robust to demography. We demonstrate an application of our methods to real data by analyzing several collections of human transcription factor binding sites identified using recently generated genome-wide ChIP-seq data.

A semi-automatic method to guide the choice of ridge parameter in ridge regression

Erika Cule, Maria De Iorio
(Submitted on 3 May 2012)

We consider the application of a popular penalised regression method, Ridge Regression, to data with very high dimensions and many more covariates than observations. Our motivation is the problem of out-of-sample prediction and the setting is high-density genotype data from a genome-wide association or resequencing study. Ridge regression has previously been shown to offer improved performance for prediction when compared with other penalised regression methods. One problem with ridge regression is the choice of an appropriate parameter for controlling the amount of shrinkage of the coefficient estimates. Here we propose a method for choosing the ridge parameter based on controlling the variance of the predicted observations in the model.
Using simulated data, we demonstrate that our method outperforms subset selection based on univariate tests of association and another penalised regression method, HyperLasso regression, in terms of improved prediction error. We extend our approach to regression problems when the outcomes are binary (representing cases and controls, as is typically the setting for genome-wide association studies) and demonstrate the method on a real data example consisting of case-control and genotype data on Bipolar Disorder, taken from the Wellcome Trust Case Control Consortium and the Genetic Association Information Network.
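
To make the role of the ridge parameter concrete, here is a minimal sketch of ordinary ridge regression in the "more covariates than observations" setting. It solves the usual penalised normal equations (X^T X + lam I) beta = X^T y and shows the coefficients shrinking as lam grows. It does not implement the authors' semi-automatic rule for choosing the parameter, which the abstract does not specify; the toy data and the names `ridge_fit` and `lam` are assumptions of the example.

```python
import math


def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y by Gauss-Jordan elimination."""
    p = len(X[0])
    # Augmented system [X^T X + lam * I | X^T y].
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(p)] + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(p)]
    for col in range(p):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def norm(beta):
    """Euclidean length of the coefficient vector."""
    return math.sqrt(sum(b * b for b in beta))


# Toy data: 2 observations, 3 covariates (unpenalised least squares is not unique).
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
y = [1.0, 2.0]
beta_weak = ridge_fit(X, y, lam=0.1)     # little shrinkage
beta_strong = ridge_fit(X, y, lam=10.0)  # heavy shrinkage
norm_weak, norm_strong = norm(beta_weak), norm(beta_strong)
```

Any positive `lam` makes the system solvable even though there are fewer observations than covariates, which is why the choice of the parameter matters so much in genotype data.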

LMM-Lasso: A Lasso Multi-Marker Mixed Model for Association Mapping with Population Structure Correction

Barbara Rakitsch, Christoph Lippert, Oliver Stegle, Karsten Borgwardt
(Submitted on 30 May 2012)

Exploring the genetic basis of heritable traits remains one of the central challenges in biomedical research. In simple cases, single polymorphic loci explain a significant fraction of the phenotype variability. However, many traits of interest appear to be subject to multifactorial control by groups of genetic loci instead. Accurate detection of such multivariate associations is nontrivial and often hindered by limited power. At the same time, confounding influences such as population structure cause spurious association signals that result in false positive findings if they are not accounted for in the model. Here, we propose LMM-Lasso, a mixed model that allows for both multi-locus mapping and correction for confounding effects. Our approach is simple and free of tuning parameters, effectively controls for population structure and scales to genome-wide datasets. We show practical use in genome-wide association studies and linkage mapping through retrospective analyses. In data from Arabidopsis thaliana and mouse, our method is able to find a genetic cause for significantly greater fractions of phenotype variation in 91% of the phenotypes considered. At the same time, our model dissects this variability into components that result from individual SNP effects and population structure. In addition to this increase of genetic heritability, enrichment of known candidate genes suggests that the associations retrieved by LMM-Lasso are more likely to be genuine.

Finding the sources of missing heritability in a yeast cross

Joshua S. Bloom, Ian M. Ehrenreich, Wesley Loo, Thúy-Lan Võ Lite, Leonid Kruglyak
(Submitted on 14 Aug 2012)

For many traits, including susceptibility to common diseases in humans, causal loci uncovered by genetic mapping studies explain only a minority of the heritable contribution to trait variation. Multiple explanations for this “missing heritability” have been proposed. Here we use a large cross between two yeast strains to accurately estimate different sources of heritable variation for 46 quantitative traits and to detect underlying loci with high statistical power. We find that the detected loci explain nearly the entire additive contribution to heritable variation for the traits studied. We also show that the contribution to heritability of gene-gene interactions varies among traits, from near zero to 50%. Detected two-locus interactions explain only a minority of this contribution. These results substantially advance our understanding of the missing heritability problem and have important implications for future studies of complex and quantitative traits.

The Genomic Signature of Crop-Wild Introgression in Maize

Matthew B. Hufford, Pesach Lubinsky, Tanja Pyhäjärvi, Michael T. Devengenzo, Norman C. Ellstrand, Jeffrey Ross-Ibarra
(Submitted on 19 Aug 2012)

The evolutionary significance of hybridization and introgression has long been appreciated, but evaluation of the genome-wide effects of these phenomena has only recently become possible. Crop-wild study systems represent ideal opportunities to examine evolution through hybridization. For example, maize and the conspecific wild teosinte Zea mays ssp. mexicana are known to hybridize in the fields of highland Mexico. Despite widespread evidence of gene flow, maize and mexicana maintain distinct morphologies and have done so in sympatry for thousands of years. Neither the genomic extent nor the evolutionary importance of introgression between these taxa is understood. We assessed patterns of genome-wide introgression based on 39,029 single nucleotide polymorphisms genotyped in 189 individuals from nine sympatric maize-mexicana populations and reference allopatric populations. While portions of these genomes were particularly resistant to introgression (notably near known cross-incompatibility and domestication loci), we detected widespread evidence for introgression in both directions of gene flow. Through further characterization of these regions and a growth chamber experiment we found evidence consistent with the incorporation of adaptive mexicana alleles into maize during its expansion to the highlands of central Mexico. In contrast, very little evidence was found indicating introgression from maize to mexicana altered the niche of this wild taxon, increasing its capacity to persist commensal to agriculture. The methods we have applied here can be replicated widely across species, greatly informing our understanding of evolution through introgressive hybridization. Crop species, due to their exceptional genomic resources and frequent histories of diffusion into sympatry with relatives, should be particularly influential in these studies.

Kernel Approximate Bayesian Computation for Population Genetic Inferences

Shigeki Nakagome, Kenji Fukumizu, Shuhei Mano
(Submitted on 15 May 2012)

As genomic data accumulate, Bayesian inferences can be applied to estimate evolutionary parameters. However, the complexity of stochastic models used in population genetics makes it difficult to derive the likelihoods needed for Bayesian inferences. Approximate Bayesian Computation (ABC) is an alternative approach for obtaining Bayesian inferences without likelihoods. ABC is a rejection-based method that applies a tolerance of dissimilarity between sets of summary statistics from observed and simulated data. ABC gives an exact sampler from the posterior density in the limit of zero tolerance. However, the choices for summary statistics and metrics of dissimilarity are ambiguous, and acceptance rates decrease with an increasing number of summary statistics. Therefore, it is difficult to maintain estimator consistency using ABC. In this study, we apply the kernel Bayes’ rule proposed by Fukumizu et al. (2011) to ABC. We report that kernel ABC (i) avoids the need for tolerance, (ii) upholds the consistency of estimators, and (iii) is tractable for a large number of summary statistics. We demonstrate these advantages by comparing kernel ABC with conventional ABC for population genetic inferences.
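
The conventional rejection scheme described above can be sketched in a few lines. This is a toy illustration of rejection ABC only, not the kernel ABC method of the paper; the model (a unit-variance normal with unknown mean), the uniform prior, the sample mean as summary statistic and the tolerance value are all assumptions chosen for the example.

```python
import random
import statistics


def abc_rejection(observed, prior_sampler, simulator, summary,
                  tolerance, n_draws, rng):
    """Keep prior draws whose simulated summary is within `tolerance`
    of the observed summary (absolute difference as the distance)."""
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler(rng)
        simulated = simulator(theta, len(observed), rng)
        if abs(summary(simulated) - s_obs) <= tolerance:
            accepted.append(theta)
    return accepted


# Hypothetical toy problem: infer the mean of a unit-variance normal.
rng = random.Random(0)
observed = [rng.gauss(2.0, 1.0) for _ in range(50)]
n_draws = 5000
accepted = abc_rejection(
    observed,
    prior_sampler=lambda r: r.uniform(-5.0, 5.0),
    simulator=lambda theta, n, r: [r.gauss(theta, 1.0) for _ in range(n)],
    summary=statistics.fmean,
    tolerance=0.1,
    n_draws=n_draws,
    rng=rng,
)
acceptance_rate = len(accepted) / n_draws
posterior_mean = statistics.fmean(accepted)
```

Most draws are rejected even in this one-dimensional example, and shrinking `tolerance` or adding summary statistics lowers the acceptance rate further, which is the practical difficulty the abstract points to.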

Structured Input-Output Lasso, with Application to eQTL Mapping, and a Thresholding Algorithm for Fast Estimation

Seunghak Lee, Eric P. Xing
(Submitted on 9 May 2012)

We consider the problem of learning a high-dimensional multi-task regression model, under sparsity constraints induced by the presence of grouping structures on the input covariates and on the output predictors. This problem is primarily motivated by expression quantitative trait locus (eQTL) mapping, the goal of which is to discover genetic variations in the genome (inputs) that influence the expression levels of multiple co-expressed genes (outputs), either epistatically, or pleiotropically, or both. A structured input-output lasso (SIOL) model based on an intricate l1/l2-norm penalty over the regression coefficient matrix is employed to enable discovery of complex sparse input/output relationships; and a highly efficient new optimization algorithm called hierarchical group thresholding (HiGT) is developed to solve the resultant non-differentiable, non-separable, and ultra high-dimensional optimization problem. We show on both simulated data and a yeast eQTL dataset that our model leads to significantly better recovery of the structured sparse relationships between the inputs and the outputs, and that our algorithm significantly outperforms other optimization techniques under the same model. Additionally, we propose a novel approach for efficiently and effectively detecting input interactions by exploiting the prior knowledge available from biological experiments.
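
For readers unfamiliar with l1/l2-norm penalties, the sketch below computes the generic mixed norm: the sum, over groups, of the Euclidean norm of each group's coefficients. This is only the basic building block; the exact SIOL penalty, which combines input and output groupings over a coefficient matrix, is not given in the abstract, and the function name, coefficients and groups here are hypothetical.

```python
import math


def l1_l2_penalty(coefs, groups):
    """Sum over groups of the Euclidean norm of the group's coefficients."""
    return sum(math.sqrt(sum(coefs[i] ** 2 for i in group))
               for group in groups)


# Hypothetical coefficients split into three groups of indices.
coefs = [3.0, 4.0, 0.0, 0.0, 1.0]
groups = [[0, 1], [2, 3], [4]]
penalty = l1_l2_penalty(coefs, groups)  # group norms are 5, 0 and 1
```

Because the l2 norm is taken within a group and the l1 sum across groups, the penalty tends to set whole groups to zero together, which is what makes it suitable for structured sparsity.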