A semi-automatic method to guide the choice of ridge parameter in ridge regression

Erika Cule, Maria De Iorio
(Submitted on 3 May 2012)

We consider the application of a popular penalised regression method, Ridge Regression, to data with very high dimensions and many more covariates than observations. Our motivation is the problem of out-of-sample prediction and the setting is high-density genotype data from a genome-wide association or resequencing study. Ridge regression has previously been shown to offer improved performance for prediction when compared with other penalised regression methods. One problem with ridge regression is the choice of an appropriate parameter for controlling the amount of shrinkage of the coefficient estimates. Here we propose a method for choosing the ridge parameter based on controlling the variance of the predicted observations in the model.
Using simulated data, we demonstrate that our method outperforms subset selection based on univariate tests of association and another penalised regression method, HyperLasso regression, in terms of improved prediction error. We extend our approach to regression problems when the outcomes are binary (representing cases and controls, as is typically the setting for genome-wide association studies) and demonstrate the method on a real data example consisting of case-control and genotype data on Bipolar Disorder, taken from the Wellcome Trust Case Control Consortium and the Genetic Association Information Network.
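For readers less familiar with the ridge parameter, here is a minimal sketch of plain ridge regression in pure Python. It is not the authors' method for choosing the parameter; it only illustrates how the parameter (called `lam` here, a name we made up) controls shrinkage, and why the estimate exists even with more covariates than observations. The toy data and the helper `ridge_fit` are assumptions for illustration.

```python
def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y by Gaussian elimination."""
    n, p = len(X), len(X[0])
    # Build the p x p matrix A = X^T X + lam * I and the vector b = X^T y.
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) + (lam if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    b = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    # Forward elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        s = sum(A[r][c] * beta[c] for c in range(r + 1, p))
        beta[r] = (b[r] - s) / A[r][r]
    return beta


def norm(v):
    """Euclidean length of a coefficient vector."""
    return sum(t * t for t in v) ** 0.5


# Toy data (hypothetical): 2 observations, 3 covariates, so X^T X alone is singular.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [1.0, 2.0]

weak = ridge_fit(X, y, lam=0.1)     # little shrinkage
strong = ridge_fit(X, y, lam=10.0)  # heavy shrinkage
print(norm(weak), norm(strong))
```

A larger `lam` pulls the coefficients toward zero, trading bias for lower variance in the predictions. How to pick that value is the question the preprint addresses.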

LMM-Lasso: A Lasso Multi-Marker Mixed Model for Association Mapping with Population Structure Correction

Barbara Rakitsch, Christoph Lippert, Oliver Stegle, Karsten Borgwardt
(Submitted on 30 May 2012)

Exploring the genetic basis of heritable traits remains one of the central challenges in biomedical research. In simple cases, single polymorphic loci explain a significant fraction of the phenotype variability. However, many traits of interest appear to be subject to multifactorial control by groups of genetic loci instead. Accurate detection of such multivariate associations is nontrivial and often hindered by limited power. At the same time, confounding influences such as population structure cause spurious association signals that result in false positive findings if they are not accounted for in the model. Here, we propose LMM-Lasso, a mixed model that allows for both multi-locus mapping and correction for confounding effects. Our approach is simple and free of tuning parameters, effectively controls for population structure and scales to genome-wide datasets. We show practical use in genome-wide association studies and linkage mapping through retrospective analyses. In data from Arabidopsis thaliana and mouse, our method is able to find a genetic cause for significantly greater fractions of phenotype variation in 91% of the phenotypes considered. At the same time, our model dissects this variability into components that result from individual SNP effects and population structure. In addition to this increase of genetic heritability, enrichment of known candidate genes suggests that the associations retrieved by LMM-Lasso are more likely to be genuine.

Finding the sources of missing heritability in a yeast cross

Joshua S. Bloom, Ian M. Ehrenreich, Wesley Loo, Thúy-Lan Võ Lite, Leonid Kruglyak
(Submitted on 14 Aug 2012)

For many traits, including susceptibility to common diseases in humans, causal loci uncovered by genetic mapping studies explain only a minority of the heritable contribution to trait variation. Multiple explanations for this “missing heritability” have been proposed. Here we use a large cross between two yeast strains to accurately estimate different sources of heritable variation for 46 quantitative traits and to detect underlying loci with high statistical power. We find that the detected loci explain nearly the entire additive contribution to heritable variation for the traits studied. We also show that the contribution to heritability of gene-gene interactions varies among traits, from near zero to 50%. Detected two-locus interactions explain only a minority of this contribution. These results substantially advance our understanding of the missing heritability problem and have important implications for future studies of complex and quantitative traits.

The Genomic Signature of Crop-Wild Introgression in Maize

Matthew B. Hufford, Pesach Lubinsky, Tanja Pyhäjärvi, Michael T. Devengenzo, Norman C. Ellstrand, Jeffrey Ross-Ibarra
(Submitted on 19 Aug 2012)

The evolutionary significance of hybridization and introgression has long been appreciated, but evaluation of the genome-wide effects of these phenomena has only recently become possible. Crop-wild study systems represent ideal opportunities to examine evolution through hybridization. For example, maize and the conspecific wild teosinte Zea mays ssp. mexicana are known to hybridize in the fields of highland Mexico. Despite widespread evidence of gene flow, maize and mexicana maintain distinct morphologies and have done so in sympatry for thousands of years. Neither the genomic extent nor the evolutionary importance of introgression between these taxa is understood. We assessed patterns of genome-wide introgression based on 39,029 single nucleotide polymorphisms genotyped in 189 individuals from nine sympatric maize-mexicana populations and reference allopatric populations. While portions of these genomes were particularly resistant to introgression (notably near known cross-incompatibility and domestication loci), we detected widespread evidence for introgression in both directions of gene flow. Through further characterization of these regions and a growth chamber experiment we found evidence consistent with the incorporation of adaptive mexicana alleles into maize during its expansion to the highlands of central Mexico. In contrast, very little evidence was found indicating introgression from maize to mexicana altered the niche of this wild taxon, increasing its capacity to persist commensal to agriculture. The methods we have applied here can be replicated widely across species, greatly informing our understanding of evolution through introgressive hybridization. Crop species, due to their exceptional genomic resources and frequent histories of diffusion into sympatry with relatives, should be particularly influential in these studies.

Kernel Approximate Bayesian Computation for Population Genetic Inferences

Shigeki Nakagome, Kenji Fukumizu, Shuhei Mano
(Submitted on 15 May 2012)

As genomic data accumulate, Bayesian inferences can be applied to estimate evolutionary parameters. However, the complexity of stochastic models used in population genetics makes it difficult to derive the likelihoods needed for Bayesian inferences. Approximate Bayesian Computation (ABC) is an alternative approach for obtaining Bayesian inferences without likelihoods. ABC is a rejection-based method that applies a tolerance of dissimilarity between sets of summary statistics from observed and simulated data. ABC gives an exact sampler from the posterior density in the limit of zero tolerance. However, the choices for summary statistics and metrics of dissimilarity are ambiguous, and acceptance rates decrease with an increasing number of summary statistics. Therefore, it is difficult to maintain estimator consistency using ABC. In this study, we apply the kernel Bayes’ rule proposed by Fukumizu et al. (2011) to ABC. We report that kernel ABC (i) avoids the need for tolerance, (ii) upholds the consistency of estimators, and (iii) is tractable for a large number of summary statistics. We demonstrate these advantages by comparing kernel ABC with conventional ABC for population genetic inferences.
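To make the rejection step concrete, here is a minimal sketch of conventional rejection ABC (not the kernel ABC proposed in the preprint). The toy model is our own assumption: we estimate an allele frequency from a count of 15 copies in 50 sampled chromosomes, with a uniform prior and the count itself as the single summary statistic. The function name `abc_rejection` and all parameter values are hypothetical.

```python
import random


def abc_rejection(observed_count, n_samples, n_draws, tolerance, rng):
    """Keep prior draws whose simulated summary statistic is within tolerance."""
    accepted = []
    for _ in range(n_draws):
        # Draw a candidate allele frequency from the Uniform(0, 1) prior.
        freq = rng.random()
        # Simulate data under the model: count allele copies among n_samples.
        simulated_count = sum(rng.random() < freq for _ in range(n_samples))
        # Accept the draw if the summaries are close enough.
        if abs(simulated_count - observed_count) <= tolerance:
            accepted.append(freq)
    return accepted


rng = random.Random(1)  # fixed seed so the run is repeatable
posterior = abc_rejection(observed_count=15, n_samples=50,
                          n_draws=5000, tolerance=1, rng=rng)
posterior_mean = sum(posterior) / len(posterior)
print(len(posterior), round(posterior_mean, 3))
```

Shrinking `tolerance` toward zero brings the accepted draws closer to the exact posterior but lowers the acceptance rate, which is the trade-off the abstract describes.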

Structured Input-Output Lasso, with Application to eQTL Mapping, and a Thresholding Algorithm for Fast Estimation

Seunghak Lee, Eric P. Xing
(Submitted on 9 May 2012)

We consider the problem of learning a high-dimensional multi-task regression model, under sparsity constraints induced by the presence of grouping structures on the input covariates and on the output predictors. This problem is primarily motivated by expression quantitative trait locus (eQTL) mapping, whose goal is to discover genetic variations in the genome (inputs) that influence the expression levels of multiple co-expressed genes (outputs), either epistatically, or pleiotropically, or both. A structured input-output lasso (SIOL) model based on an intricate l1/l2-norm penalty over the regression coefficient matrix is employed to enable discovery of complex sparse input/output relationships; and a highly efficient new optimization algorithm called hierarchical group thresholding (HiGT) is developed to solve the resultant non-differentiable, non-separable, and ultra high-dimensional optimization problem. We show on both simulated data and a yeast eQTL dataset that our model leads to significantly better recovery of the structured sparse relationships between the inputs and the outputs, and our algorithm significantly outperforms other optimization techniques under the same model. Additionally, we propose a novel approach for efficiently and effectively detecting input interactions by exploiting the prior knowledge available from biological experiments.

An efficient group test for genetic markers that handles confounding.

arXiv:1205.0793v1 [q-bio.GN]
by Jennifer Listgarten, Christoph Lippert, David Heckerman

Approaches for testing groups of variants for association with complex traits are becoming critical. Examples of groups typically include a set of rare or common variants within a gene, but could also be variants within a pathway or any other set. These tests are critical for aggregation of weak signal within a group, allow interplay among variants to be captured, and also reduce the problem of multiple hypothesis testing. Unfortunately, these approaches do not address confounding by, for example, family relatedness and population structure, a problem that is becoming more important as larger data sets are used to increase power. We introduce a new approach for group tests that can handle confounding, based on Bayesian linear regression, which is equivalent to the linear mixed model. The approach uses two sets of covariates (equivalently, two random effects), one to capture the group association signal and one to capture confounding. We also introduce a computational speedup for the two-random-effects model that makes this approach feasible even for extremely large cohorts, whereas it otherwise would not be. Application of our approach to richly structured GAW14 data, comprising over eight ethnicities and many related family members, demonstrates that our method successfully corrects for population structure, while application of our method to WTCCC Crohn’s disease and hypertension data demonstrates that our method recovers genes not recoverable by univariate analysis, while still correcting for confounding structure.

Landscape genomic tests for associations between loci and environmental gradients

Eric Frichot (1), Sean Schoville (1), Guillaume Bouchard (2), Olivier François (1) ((1) UJF, CNRS, TIMC-IMAG, FRANCE, (2) Xerox Research Center Europe, France)
(Submitted on 15 May 2012)

Adaptation to local environments often occurs through natural selection acting on a large number of alleles, each having a weak phenotypic effect. One way to detect those alleles is by identifying genetic polymorphisms that exhibit high correlation with some environmental gradient or with the variables used as proxies for ecological pressures. Here we proposed an integrated framework based on population genetics, ecological modeling and machine learning techniques for screening genomes for signatures of local adaptation. We implemented fast algorithms using a hierarchical Bayesian mixed model based on a variant of principal component analysis in which residual population structure is introduced via unobserved or latent factors. Our algorithms can detect correlations between environmental and genetic variation at the same time as they infer the background levels of population structure. We provided evidence that latent factor models efficiently estimated random effects due to population history and isolation-by-distance mechanisms when computing gene-environment correlations, and that they decreased the number of false-positive associations in genome scans for selection. We applied these models to plant and human genetic data and we detected several genes with functions related to multicellular organ development exhibiting unusual correlations with climatic gradients.

Emergence of clones in sexual populations

Richard A. Neher, Marija Vucelja, Marc Mézard, Boris I. Shraiman
(Submitted on 9 May 2012 (v1), last revised 21 Jul 2012 (this version, v2))

In sexual populations, recombination reshuffles genetic variation and produces novel combinations of existing alleles, while selection amplifies the fittest genotypes in the population. If recombination is more rapid than selection, populations consist of a diverse mixture of many genotypes, as is observed in many populations. In the opposite regime, which is realized for example in facultatively sexual populations that outcross in only a fraction of reproductive cycles, selection can amplify individual genotypes into large clones. Such clones emerge when the fitness advantage of some of the genotypes is large enough that they grow to a significant fraction of the population despite being broken down by recombination. The occurrence of this “clonal condensation” depends, in addition to the outcrossing rate, on the heritability of fitness. Clonal condensation leads to a strong genetic heterogeneity of the population which is not adequately described by traditional population genetics measures, such as Linkage Disequilibrium. Here we point out the similarity between clonal condensation and the freezing transition in the Random Energy Model of spin glasses. Guided by this analogy we explicitly calculate the probability, Y, that two individuals are genetically identical as a function of the key parameters of the model. While Y is the analog of the spin-glass order parameter, it is also closely related to the rate of coalescence in population genetics: Two individuals that are part of the same clone have a recent common ancestor.

Welcome to Haldane’s sieve

The ease of communication facilitated by the Internet has dramatically affected the process of scientific communication in many fields. Most notably, many physics, math, and economics communities have adopted a system in which new research papers are immediately distributed throughout the world prior to formal evaluation in the form of peer review. This system allows for rapid distribution of “bleeding edge” results among all the experts in a field, allowing them to see and build upon the most recent advances.

This practice has historically been uncommon in biology, where instead results are generally made available to the community (including many people qualified to judge them) only after a delay of around six months to a year, during which a paper is reviewed, formatted, and published. We believe this is unfortunate. However, there is growing pressure in some parts of biology (in particular our fields of evolutionary and population genetics) to follow physics and math in posting papers to preprint servers ahead of formal publication.

Some authors have a variety of reasonable concerns about posting their papers to preprint servers. In particular, one worry is that, in a morass of online content, their work will not reach the relevant audience. Others see no benefit in posting their papers prior to review if they will not receive useful feedback. The goal of Haldane’s Sieve is to partially remedy these issues. We aim to provide a simple feed of preprints in the fields of evolutionary and population genetics (though we may later expand to other fields). Thus, instead of checking arXiv, PeerJ, or Figshare for relevant preprints, readers in these fields could simply check Haldane’s Sieve.

What to expect

As described above, most posts to Haldane’s Sieve will be basic descriptions of relevant preprints, with little to no commentary. All posts will have comment sections where discussion of the papers will be welcome. A second type of post will be detailed comments on a preprint of particular interest to a contributor. These posts could take the style of a journal review, or may simply be some brief comments. We hope they will provide useful feedback to the authors of the preprint. Finally, there will be posts by authors of preprints in which they describe their work and place it in broader context.

We ask the commenters to remember that by submitting articles to preprint servers the authors (often biologists) are taking a somewhat unusual step. Therefore, comments should be phrased in a constructive manner to aid the authors.

Authors: Our choice of what to post reflects our interests and knowledge, so we will only post a biased subset of evolutionary, population, and statistical genetics preprints that attract our interest. We will endeavor to be somewhat thorough, but we will doubtless miss some interesting preprints, especially if they are not in the quantitative biology arXiv subfield. If you want us to link to your preprint, please drop us a line; our emails can easily be found via our University sites. Alternatively, send a tweet to @Haldanessieve.

Why “Haldane’s Sieve”?

A brief description of the name of this site is perhaps in order. When a new beneficial allele arises in a population, the probability that it eventually reaches fixation is influenced by a number of factors. One of these is the dominance coefficient of the allele. The dominance coefficient matters because early in the life of the allele, while it is at low frequency, it is almost always present in the population in heterozygous form. Therefore, all else being equal, dominant beneficial alleles can increase in frequency due to selection faster than recessive alleles, increasing their probability of eventual fixation (or establishment in the population). This effect was noted by Haldane (Haldane 1924, 1927) and has become known as “Haldane’s sieve” (Turner 1981; Charlesworth 1992). Analogously, we seek to increase the exposure of interesting papers early in their lifespan, hopefully increasing the probability that they reach their target audience.
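For the numerically inclined, the effect can be sketched with the classical approximation that a new beneficial allele with heterozygote fitness advantage h*s fixes with probability of roughly 2hs. This assumes a large, randomly mating population, a small selection coefficient s, and a dominance coefficient h that is not too close to zero. The function name and the example values below are ours, chosen purely for illustration.

```python
def fixation_probability(h, s):
    """Approximate fixation probability of a new beneficial allele.

    Uses the classical approximation 2 * h * s, where s is the selection
    coefficient and h the dominance coefficient. It is only reasonable for
    small s in a large population and breaks down as h approaches zero.
    """
    return 2.0 * h * s


s = 0.02  # hypothetical selection coefficient
mostly_dominant = fixation_probability(h=0.9, s=s)
mostly_recessive = fixation_probability(h=0.1, s=s)
print(mostly_dominant, mostly_recessive, mostly_dominant / mostly_recessive)
```

With these made-up values the mostly dominant allele is about nine times more likely to fix than the mostly recessive one, which is the sieve at work.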

A nameless wit has pointed out to us that preprints would really count as standing variation in this analogy and might therefore not be subject to Haldane’s sieve (see Orr and Betancourt, Genetics 2001). We leave it to the reader to decide whether the analogy holds.

Image

The image of Haldane is from Wikipedia.
The image of the sieve is from fdctsevilla, who kindly uses a Creative Commons 2.0 license. It’s surprisingly difficult to find a usable picture of a sieve.

Graham Coop and Joe Pickrell