Improved haplotyping of rare variants using next-generation sequence data

Fouad Zakharia, Carlos Bustamante
(Submitted on 9 Nov 2012)

Accurate identification of haplotypes in sequenced human genomes can provide invaluable information about population demography and fine-scale correlations along the genome, thus empowering both population genomic and medical association studies. Yet phasing unrelated individuals remains a challenging problem. Incorporating available data from high-throughput sequencing into traditional statistical phasing approaches is a promising avenue to alleviate these issues. We present a novel statistical method that expands on an existing graphical haplotype reconstruction method (shapeIT) to incorporate phasing information from paired-end read data. The algorithm harnesses the haplotype graph information estimated by shapeIT from genotypes across the population and refines haplotype likelihoods for a given individual to be compatible with the sequencing data. Applying the method to HapMap individuals genotyped on the Affymetrix Axiom chip at 7,745,081 SNPs and on a trio sequenced by Complete Genomics, we found that the inclusion of paired-end read data significantly improved phasing, with reductions in switch error on the order of 4-15% against shapeIT across all panels. As expected, the improvements were found to be most significant at sites harboring rare variants; furthermore, we found that longer read sizes and higher throughput translated to greater decreases in switching error, as did higher variance in the size of the insert separating the two reads, suggesting that multi-platform next-generation sequencing may be exploited to yield particularly accurate haplotypes. Overall, the phasing improvements afforded by this new method highlight the power of integrating sequencing read information and population genotype data for reconstructing haplotypes in unrelated individuals.
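
For readers unfamiliar with the switch error metric mentioned in the abstract, here is a minimal sketch of one common way to compute it: count how often the inferred phase flips relative to the truth between consecutive heterozygous sites. The function name and the input layout (one allele pair per site) are assumptions made for illustration, not taken from the paper.

```python
def switch_error_rate(truth, inferred):
    """Fraction of consecutive heterozygous-site pairs where the inferred
    phase flips relative to the true phase.

    truth, inferred: sequences of (allele_on_hap1, allele_on_hap2), one per site.
    """
    if len(truth) != len(inferred):
        raise ValueError("truth and inferred must cover the same sites")

    # For each heterozygous site, record whether the inferred haplotype
    # order matches the true order (True) or is swapped (False).
    orientation = []
    for t, p in zip(truth, inferred):
        t, p = tuple(t), tuple(p)
        if t[0] == t[1]:
            continue  # homozygous sites carry no phase information
        if p == t:
            orientation.append(True)
        elif p == (t[1], t[0]):
            orientation.append(False)
        else:
            raise ValueError("inferred genotype differs from true genotype")

    if len(orientation) < 2:
        return 0.0  # fewer than two heterozygous sites: nothing to switch

    switches = sum(1 for a, b in zip(orientation, orientation[1:]) if a != b)
    return switches / (len(orientation) - 1)


if __name__ == "__main__":
    truth = [(0, 1), (1, 0), (0, 0), (0, 1), (1, 0)]
    inferred = [(0, 1), (1, 0), (0, 0), (1, 0), (0, 1)]
    print(switch_error_rate(truth, inferred))
```

Note that a haplotype pair that is swapped at every site counts as zero switches under this definition, since only the relative phase between neighboring sites matters.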

Modeling the Clonal Evolution of Cancer from Next Generation Sequencing Data

Wei Jiao, Shankar Vembu, Amit G. Deshwar, Lincoln Stein, Quaid Morris
(Submitted on 11 Oct 2012)

We consider the problem of inferring the clonal evolutionary structure of cancer cells from high-throughput next generation sequencing data. We address this problem using statistical machine learning to infer a relational clustering of objects, where the clusters are connected in the form of a rooted tree. We present a hierarchical Bayesian mixture model that uses a non-parametric prior over trees to automatically estimate the number of clones (clusters) and their clonal frequencies (cluster means) in the population, and to identify the phylogenetic relationship between these subclones. Experiments on three real data sets comprising 12 tumor samples from triple-negative breast cancer, acute myeloid leukemia and chronic lymphocytic leukemia patients demonstrate the efficacy of our method.

Integrative modeling of eQTLs and cis-regulatory elements suggest mechanisms underlying cell type specificity of eQTLs

Christopher D Brown, Lara M Mangravite, Barbara E Engelhardt
(Submitted on 11 Oct 2012)

Genetic variants in cis-regulatory elements or trans-acting regulators commonly influence the quantity and spatiotemporal distribution of gene transcription. Recent interest in expression quantitative trait locus (eQTL) mapping has paralleled the adoption of genome-wide association studies (GWAS) for the analysis of complex traits and disease in humans. Under the hypothesis that many GWAS associations tag non-coding SNPs with small effects, and that these SNPs exert phenotypic control by modifying gene expression, it has become common to interpret GWAS associations using eQTL data. To exploit the mechanistic interpretability of eQTL-GWAS comparisons, an improved understanding of the genetic architecture and cell type specificity of eQTLs is required. We address this need by performing an eQTL analysis in four parts: first, we identified eQTLs from eleven studies on seven cell types; next, we quantified cell type specific eQTLs across the studies; then, we integrated eQTL data with cis-regulatory element (CRE) data sets from the ENCODE project; finally, we built a classifier to predict cell type specific eQTLs. Consistent with prior studies, we demonstrate that allelic heterogeneity is pervasive at cis-eQTLs and that cis-eQTLs are often cell type specific. Within and between cell type eQTL replication is associated with eQTL SNP overlap with hundreds of cell type specific CRE classes, including enhancer, promoter, and repressive chromatin marks, regions of open chromatin, and many classes of DNA binding proteins. Using a random forest classifier including 526 CRE data sets as features, we successfully predict the cell type specificity of eQTL SNPs in the absence of gene expression data from the cell type of interest. We anticipate that such integrative, predictive modeling will improve our ability to understand the mechanistic basis of human complex phenotypic variation.

LUMPY: A probabilistic framework for structural variant discovery

Ryan M. Layer, Ira M. Hall, Aaron R. Quinlan
(Submitted on 8 Oct 2012)

Comprehensive discovery of structural variation (SV) in human genomes from DNA sequencing requires the integration of multiple alignment signals including read-pair, split-read and read-depth. However, owing to inherent technical challenges, most existing SV discovery approaches utilize only one signal and consequently suffer from reduced sensitivity, especially at low sequence coverage and for smaller SVs. We present a novel and extremely flexible probabilistic SV discovery framework that is capable of integrating any number of SV detection signals including those generated from read alignments or prior evidence. We demonstrate improved sensitivity over extant methods by combining paired-end and split-read alignments and emphasize the utility of our framework for comprehensive studies of structural variation in heterogeneous tumor genomes. We further discuss the broader utility of this approach for probabilistic integration of diverse genomic interval datasets.

Maximum Likelihood Estimation of Frequencies of Known Haplotypes from Pooled Sequence Data

Darren Kessner, Tom Turner, John Novembre
(Submitted on 19 Sep 2012)

DNA samples are often pooled, either by experimental design, or because the sample itself is a mixture. For example, when population allele frequencies are of primary interest, individual samples may be pooled together to lower the cost of sequencing. Alternatively, the sample itself may be a mixture of multiple species or strains (e.g. bacterial species comprising a microbiome, or pathogen strains in a blood sample). We present an expectation-maximization (EM) algorithm for estimating haplotype frequencies in a pooled sample directly from mapped sequence reads, in the case where the possible haplotypes are known. This method is relevant to the analysis of pooled sequencing data from selection experiments, as well as the calculation of proportions of different strains within a metagenomics sample. Our method outperforms existing methods based on single-site allele frequencies, as well as simple approaches using sequence read data. We have implemented the method in a freely available open-source software tool.
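
To make the idea concrete, the following is a minimal sketch of a generic EM for mixture proportions when the components (here, the known haplotypes) are fixed. It is not the authors' implementation: the read representation (a dict from site index to observed allele), the uniform per-base `error_rate`, and all function names are assumptions chosen for illustration.

```python
def read_likelihood(read, haplotype, error_rate):
    """P(read | haplotype) under an assumed independent per-base error model.

    read: dict mapping site index -> observed allele.
    haplotype: string (or sequence) of alleles indexed by site.
    """
    lik = 1.0
    for pos, allele in read.items():
        lik *= (1.0 - error_rate) if haplotype[pos] == allele else error_rate
    return lik


def em_haplotype_frequencies(reads, haplotypes, error_rate=0.01,
                             n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies in a pool by EM, with haplotypes known."""
    if not reads:
        raise ValueError("need at least one read")
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must lie strictly between 0 and 1")

    k = len(haplotypes)
    freqs = [1.0 / k] * k  # start from uniform frequencies

    # Read-by-haplotype likelihoods do not change across iterations.
    liks = [[read_likelihood(r, h, error_rate) for h in haplotypes]
            for r in reads]

    for _ in range(n_iter):
        expected_counts = [0.0] * k
        # E-step: posterior probability that each read came from each haplotype.
        for row in liks:
            weights = [f * l for f, l in zip(freqs, row)]
            total = sum(weights)
            for j, w in enumerate(weights):
                expected_counts[j] += w / total
        # M-step: new frequencies are the normalized expected read counts.
        new_freqs = [c / len(reads) for c in expected_counts]
        converged = max(abs(a - b) for a, b in zip(new_freqs, freqs)) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs


if __name__ == "__main__":
    haplotypes = ["AC", "GT"]
    reads = [{0: "A", 1: "C"}] * 3 + [{0: "G", 1: "T"}]
    print(em_haplotype_frequencies(reads, haplotypes))
```

Because reads spanning several sites are scored against whole haplotypes, this kind of estimator can use linkage information that single-site allele frequencies discard, which is the advantage the abstract points to.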

Our paper: Integrated analysis of variants and pathways in genome-wide association studies using polygenic models of disease

[This author post is by Peter Carbonetto on Integrated analysis of variants and pathways in genome-wide association studies using polygenic models of disease, available from the arXiv here.]

I expect that most readers of this blog appreciate the impact that genome-wide association studies have had on our understanding of many common diseases. Still, I think it is important to reiterate a major appeal of genome-wide association studies: the analysis is conceptually straightforward to understand, even for people who have never had to suffer through a course on statistics or epidemiology. To find links between genetic loci and disease, the analysis consists of systematically searching across the genome for variants that show statistically significant correlation with susceptibility to disease. These correlations signal the presence of nearby genes—or perhaps DNA elements that regulate other genes—that are risk factors for disease.

Many readers of this blog will also appreciate, due to the multifactorial nature of most common diseases, the difficulty of establishing compelling evidence for disease-variant correlations. Hence the search for more effective data-driven strategies for discovering genetic factors underlying common diseases.

One strategy is to assess evidence for the accumulation, or “enrichment,” of disease-conferring mutations within known biological pathways. The intuition is that identifying the accumulation of small genetic effects acting in a common pathway is easier than mapping the individual genes within the pathway that contribute to disease susceptibility.

We asked whether identifying these enriched pathways can also give us useful feedback about the individual gene variants associated with disease. To answer this question, we developed a statistical method that adjusts the support for disease-variant associations to reflect enrichment of associations in a pathway. Our approach was to introduce an enrichment parameter that quantifies the increase in the probability that each variant in the pathway is associated with disease risk.
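
As a rough illustration of how an enrichment parameter can shift the support for a variant, here is a minimal sketch. It assumes the prior log-odds of association (taken base 10 here) is a genome-wide baseline `theta0` plus an enrichment term `theta` for variants inside the pathway, and it combines that prior with a single-variant Bayes factor. The names, the base-10 convention, and the single-variant simplification are all assumptions for illustration; the paper itself works with a multiple regression model.

```python
def prior_inclusion_probability(theta0, theta, in_pathway):
    """Prior probability of association, with log10 prior odds equal to
    theta0 outside the pathway and theta0 + theta inside it."""
    log10_odds = theta0 + (theta if in_pathway else 0.0)
    odds = 10.0 ** log10_odds
    return odds / (1.0 + odds)


def posterior_inclusion_probability(prior, bayes_factor):
    """Posterior probability from posterior odds = Bayes factor * prior odds."""
    posterior_odds = bayes_factor * prior / (1.0 - prior)
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    theta0, theta = -4.0, 2.0  # hypothetical values, chosen only for the demo
    bayes_factor = 100.0       # same evidence from the data in both cases
    for in_pathway in (False, True):
        prior = prior_inclusion_probability(theta0, theta, in_pathway)
        post = posterior_inclusion_probability(prior, bayes_factor)
        print(in_pathway, round(prior, 6), round(post, 6))
```

With these hypothetical numbers, the same Bayes factor leaves a variant outside the pathway with a posterior probability near 1%, while a variant inside the enriched pathway reaches 50%, which is the sense in which enrichment "prioritizes" variants.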

Is this a valid approach? To investigate, we applied our approach to data from the Wellcome Trust Crohn’s disease study from 2007. First, we identified a broad class of cytokine signaling genes that were enriched for genetic associations with Crohn’s disease. Next, by prioritizing variants in this pathway, we discovered candidates for association—including the STAT3 gene, the IBD5 locus, and the MHC class II genes—that were not identified in conventional analyses of the same data. These results help validate our approach, as these genetic associations have been independently confirmed in other studies and meta-analyses with much larger combined samples.

Several other important lessons emerged from our case study:

1. Interrogate as many pathways as possible. Because we collected over 3000 candidate pathways from several sources (Reactome, KEGG, BioCarta, BioCyc, etc.), many of the pathways highlighted in previous analyses of the same data were eclipsed by much stronger enrichment signals in our analysis.

2. Assess evidence for combinations of enriched pathways. Some pathways become interesting only after assessing enrichment of the pathway in combination with another pathway.

3. Account for the heterogeneity of effect sizes in Crohn’s disease. One of the assumptions we made in our analysis, mainly out of convenience, was that the additive effects on disease risk are normally distributed. While this assumption simplified this analysis, we suspect that a normal distribution does not adequately capture the smaller effect sizes in pathways, leading to a loss of power to detect enriched pathways.

At conferences, and around the lab, I’ve heard many complaints about pathway analysis (or gene set enrichment analysis) for genome-wide association studies. One complaint is that the results are difficult to interpret. Another common complaint is that the findings are sensitive to arbitrary significance thresholds. While we didn’t devote much space in the paper to a discussion of these issues, we believe that our approach offers a coherent solution to many of these problems.

Ultimately, we would like other researchers to use our methods to analyze data from their own genome-wide association studies. We tried to make our paper as accessible as possible, especially to biologists who are not well-acquainted with Bayesian approaches, by carefully explaining how to interpret the Bayes factors and posterior statistics used in the analysis. We are working on releasing the full source code (in R and MATLAB) for all our methods, and accompanying documentation.

Peter Carbonetto

Polygenic Modeling with Bayesian Sparse Linear Mixed Models

Xiang Zhou, Peter Carbonetto, Matthew Stephens
(Submitted on 6 Sep 2012)

Both linear mixed models (LMMs) and sparse regression models are widely used in genetics applications, including, recently, polygenic modeling. These two approaches make very different assumptions, so are expected to perform well in different situations. However, in practice, for a given data set one typically does not know which assumptions will be more accurate. Motivated by this, we consider a hybrid of the two, which we refer to as a “Bayesian sparse linear mixed model” (BSLMM) that includes both these models as special cases. We address several key computational and statistical issues that arise when applying BSLMM, including appropriate prior specification for the hyper-parameters, and a novel Markov chain Monte Carlo algorithm for posterior inference. We apply BSLMM and compare it with other methods for two polygenic modeling applications: estimating the proportion of variance in phenotypes explained (PVE) by available genotypes, and phenotype (or breeding value) prediction. For estimating PVE, we demonstrate that BSLMM combines the advantages of both standard LMMs and sparse regression modeling. For phenotype prediction it considerably outperforms either of the other two methods, as well as several other large-scale regression methods previously suggested for this problem. Software implementing our method is freely available from this http URL
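
For orientation, a hybrid of this kind can be written roughly as follows. This is a sketch under assumed notation (the symbols and parameterization below are not given in the abstract): a phenotype vector $y$ for $n$ individuals, a genotype matrix $X$, sparse effects $\tilde\beta$, a polygenic random effect $u$ with relatedness matrix $K$, and residual precision $\tau$.

```latex
\begin{align}
  y &= 1_n \mu + X\tilde{\beta} + u + \epsilon, \\
  \tilde{\beta}_i &\sim \pi \, N(0, \sigma_a^2 \tau^{-1}) + (1 - \pi)\, \delta_0, \\
  u &\sim \mathrm{MVN}_n(0, \sigma_b^2 \tau^{-1} K), \\
  \epsilon &\sim \mathrm{MVN}_n(0, \tau^{-1} I_n).
\end{align}
```

Written this way, setting $\pi = 0$ removes the sparse effects and leaves a standard LMM, while setting $\sigma_b = 0$ removes the random effect and leaves a sparse regression model, which is what "includes both these models as special cases" would mean in this notation.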

Haplotype-based variant detection from short-read sequencing

Erik Garrison, Gabor Marth
(Submitted on 17 Jul 2012 (v1), last revised 20 Jul 2012 (this version, v2))

The direct detection of haplotypes from short-read DNA sequencing data requires changes to existing small-variant detection methods. Here, we develop a Bayesian statistical framework which is capable of modeling multiallelic loci in sets of individuals with non-uniform copy number. We then describe our implementation of this framework in a haplotype-based variant detector, FreeBayes.

Integrated analysis of variants and pathways in genome-wide association studies using polygenic models of disease

Peter Carbonetto, Matthew Stephens
(Submitted on 21 Aug 2012)

Many common diseases are highly polygenic, modulated by a large number of genetic factors with small effects on susceptibility to disease. These small effects are difficult to map reliably in genetic association studies. To address this problem, researchers have developed methods that aggregate information over sets of related genes, such as biological pathways, to identify gene sets that are enriched for genetic variants associated with disease. However, these methods fail to answer a key question: which genes and genetic variants are associated with disease risk? We develop a method based on sparse multiple regression that simultaneously identifies enriched pathways, and prioritizes the variants within these pathways, to locate additional variants associated with disease susceptibility. A central feature of our approach is an estimate of the strength of enrichment, which yields a coherent way to prioritize variants in enriched pathways. We illustrate the benefits of our approach in a genome-wide association study of Crohn’s disease with ~440,000 genetic variants genotyped for ~4700 study subjects. We obtain strong support for enrichment of IL-12, IL-23 and other cytokine signaling pathways. Furthermore, prioritizing variants in these enriched pathways yields support for additional disease-association variants, all of which have been independently reported in other case-control studies for Crohn’s disease.