Most viewed on Haldane’s Sieve: July 2014

The most viewed posts on Haldane’s Sieve this month were:

Fast Genome-Wide QTL Association Mapping on Pedigree and Population Data

Hua Zhou, John Blangero, Thomas D Dyer, Kei-hang K Chan, Eric M Sobel, Kenneth Lange
(Submitted on 31 Jul 2014)

Since most analysis software for genome-wide association studies (GWAS) currently exploits only unrelated individuals, there is a need for efficient applications that can handle general pedigree data or mixtures of both population and pedigree data. Even data sets thought to consist of only unrelated individuals may include cryptic relationships that can lead to false positives if not discovered and controlled for. In addition, family designs possess compelling advantages. They are better equipped to detect rare variants, control for population stratification, and facilitate the study of parent-of-origin effects. Pedigrees selected for extreme trait values often segregate a single gene with strong effect. Finally, many pedigrees are available as an important legacy from the era of linkage analysis. Unfortunately, pedigree likelihoods are notoriously hard to compute. In this paper we re-examine the computational bottlenecks and implement ultra-fast pedigree-based GWAS analysis. Kinship coefficients can either be based on explicitly provided pedigrees or automatically estimated from dense markers. Our strategy (a) works for random sample data, pedigree data, or a mix of both; (b) entails no loss of power; (c) allows for any number of covariate adjustments, including correction for population stratification; (d) allows for testing SNPs under additive, dominant, and recessive models; and (e) accommodates both univariate and multivariate quantitative traits. On a typical personal computer (6 CPU cores at 2.67 GHz), analyzing a univariate HDL (high-density lipoprotein) trait from the San Antonio Family Heart Study (935,392 SNPs on 1357 individuals in 124 pedigrees) takes less than 2 minutes and 1.5 GB of memory. Complete multivariate QTL analysis of the three time-points of the longitudinal HDL multivariate trait takes less than 5 minutes and 1.5 GB of memory.
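
The abstract mentions that kinship coefficients can be estimated from dense markers. As a rough illustration of what that can look like, here is a minimal sketch of a common method-of-moments (genetic relationship matrix) estimator. It is not taken from the paper and may differ from what the authors implement; the function name `empirical_kinship` and the toy genotypes are hypothetical.

```python
def empirical_kinship(genotypes):
    """Estimate pairwise kinship from SNP genotypes (illustrative sketch).

    genotypes[i][k] is the count (0, 1, or 2) of a reference allele
    carried by person i at SNP k. Assumes no missing genotypes.
    Returns an n x n list of lists of kinship estimates.
    """
    n = len(genotypes)
    n_snps = len(genotypes[0])
    standardized = [[] for _ in range(n)]
    used = 0
    for k in range(n_snps):
        # Sample allele frequency at SNP k.
        p = sum(person[k] for person in genotypes) / (2 * n)
        if p <= 0.0 or p >= 1.0:
            continue  # monomorphic SNPs carry no information
        used += 1
        scale = (2 * p * (1 - p)) ** 0.5  # genotype s.d. under Hardy-Weinberg
        for i in range(n):
            standardized[i].append((genotypes[i][k] - 2 * p) / scale)
    if used == 0:
        raise ValueError("no polymorphic SNPs")
    kinship = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            dot = sum(a * b for a, b in zip(standardized[i], standardized[j]))
            # The average standardized product estimates 2 * kinship.
            kinship[i][j] = kinship[j][i] = dot / (2 * used)
    return kinship


# Toy data: persons 0 and 1 have identical genotypes.
toy = [[0, 1, 2, 1], [0, 1, 2, 1], [2, 1, 0, 1], [1, 2, 1, 0]]
K = empirical_kinship(toy)
```

With many SNPs and unrelated, non-inbred individuals, the diagonal of this estimate should hover around 1/2 and the off-diagonal entries around 0; four SNPs are far too few for that, so the toy data only demonstrates the mechanics.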

Fast Genome-Wide QTL Analysis Using Mendel

Hua Zhou, Jin Zhou, Tao Hu, Eric M Sobel, Kenneth Lange
(Submitted on 31 Jul 2014)

Pedigree GWAS (Option 29) in the current version of the Mendel software is an optimized subroutine for performing large scale genome-wide QTL analysis. This analysis (a) works for random sample data, pedigree data, or a mix of both, (b) is highly efficient in both run time and memory requirement, (c) accommodates both univariate and multivariate traits, (d) works for autosomal and X-linked loci, (e) correctly deals with missing data in traits, covariates, and genotypes, (f) allows for covariate adjustment and constraints among parameters, (g) uses either a theoretical or a SNP-based empirical kinship matrix for additive polygenic effects, (h) allows extra variance components such as dominant polygenic effects and household effects, (i) detects and reports outlier individuals and pedigrees, and (j) allows for robust estimation via the t-distribution. The current paper assesses these capabilities on the Genetic Analysis Workshop 19 (GAW19) sequencing data. We analyzed simulated and real phenotypes for both family and random sample data sets. For instance, when jointly testing the 8 longitudinally measured systolic blood pressure (SBP) and diastolic blood pressure (DBP) traits, it takes Mendel 78 minutes on a standard laptop computer to read, quality check, and analyze a data set with 849 individuals and 8.3 million SNPs. Genome-wide eQTL analysis of 20,643 expression traits on 641 individuals with 8.3 million SNPs takes 30 hours using 20 parallel runs on a cluster. Mendel is freely available at this http URL.
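
Item (g) above refers to a theoretical kinship matrix, that is, one derived from the pedigree structure alone. The textbook recursion for it is short enough to sketch. This is an illustration under stated assumptions, not Mendel's code: the function `pedigree_kinship`, the dictionary layout, and the toy family are all hypothetical, and founders are assumed unrelated and non-inbred.

```python
from functools import lru_cache


def pedigree_kinship(parents):
    """Return a function phi(a, b) giving theoretical kinship coefficients.

    parents maps each person to a (father, mother) tuple; founders map
    to (None, None). Founders are assumed unrelated and non-inbred.
    """

    @lru_cache(maxsize=None)
    def depth(person):
        # Generations below the founders; an ancestor always has a
        # strictly smaller depth than any of its descendants.
        known = [p for p in parents[person] if p is not None]
        return 0 if not known else 1 + max(depth(p) for p in known)

    @lru_cache(maxsize=None)
    def phi(a, b):
        if a is None or b is None:
            return 0.0  # unknown parent: treated as unrelated
        if a == b:
            father, mother = parents[a]
            return 0.5 * (1.0 + phi(father, mother))
        # Recurse through the deeper person, who cannot be an
        # ancestor of the other.
        if depth(a) < depth(b):
            a, b = b, a
        father, mother = parents[a]
        return 0.5 * (phi(father, b) + phi(mother, b))

    return phi


# Toy nuclear family with two full siblings.
family = {
    "dad": (None, None),
    "mom": (None, None),
    "kid1": ("dad", "mom"),
    "kid2": ("dad", "mom"),
}
phi = pedigree_kinship(family)
print(phi("dad", "kid1"), phi("kid1", "kid2"), phi("kid1", "kid1"))
# prints: 0.25 0.25 0.5
```

The recursion uses plain Python recursion with memoization, which is fine for small pedigrees; very deep pedigrees would need an iterative, generation-ordered version.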

Fast Bayesian Feature Selection for High Dimensional Linear Regression in Genomics via the Ising Approximation

Charles K. Fisher, Pankaj Mehta
(Submitted on 30 Jul 2014)

Feature selection, identifying a subset of variables that are relevant for predicting a response, is an important and challenging component of many methods in statistics and machine learning. Feature selection is especially difficult and computationally intensive when the number of variables approaches or exceeds the number of samples, as is often the case for many genomic datasets. Here, we introduce a new approach — the Bayesian Ising Approximation (BIA) — to rapidly calculate posterior probabilities for feature relevance in L2 penalized linear regression. In the regime where the regression problem is strongly regularized by the prior, we show that computing the marginal posterior probabilities for features is equivalent to computing the magnetizations of an Ising model. Using a mean field approximation, we show it is possible to rapidly compute the feature selection path described by the posterior probabilities as a function of the L2 penalty. We present simulations and analytical results illustrating the accuracy of the BIA on some simple regression problems. Finally, we demonstrate the applicability of the BIA to high dimensional regression by analyzing a gene expression dataset with nearly 30,000 features.
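
To make the mean field step in the abstract concrete, here is a minimal sketch of a generic mean field solver for Ising magnetizations. The abstract does not give the BIA's formulas for the fields and couplings, so `h` and `J` are treated here as hypothetical inputs, and the conversion from a magnetization to a probability, p = (1 + m) / 2 with spin +1 meaning "feature included", is an assumption of this sketch rather than something stated in the abstract.

```python
import math


def mean_field_magnetizations(h, J, damping=0.5, tol=1e-10, max_iter=10000):
    """Solve m_i = tanh(h_i + sum_j J_ij m_j) by damped fixed-point iteration.

    h is a list of external fields, J a symmetric matrix of couplings
    (the diagonal is ignored). Returns the list of magnetizations.
    """
    n = len(h)
    m = [0.0] * n
    for _ in range(max_iter):
        new_m = []
        max_change = 0.0
        for i in range(n):
            field = h[i] + sum(J[i][j] * m[j] for j in range(n) if j != i)
            # Damping mixes the old value with the new target to
            # stabilize the iteration.
            updated = (1.0 - damping) * m[i] + damping * math.tanh(field)
            max_change = max(max_change, abs(updated - m[i]))
            new_m.append(updated)
        m = new_m
        if max_change < tol:
            break
    return m


def inclusion_probabilities(m):
    """Map magnetizations in [-1, 1] to probabilities in [0, 1]."""
    return [(1.0 + mi) / 2.0 for mi in m]


# Two features with equal fields and a negative (competing) coupling.
m = mean_field_magnetizations([1.0, 1.0], [[0.0, -0.5], [-0.5, 0.0]])
p = inclusion_probabilities(m)
```

In this toy case the negative coupling pulls both magnetizations below tanh(1), the value each feature would have on its own, which matches the intuition that two competing features share the evidence between them.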