Species Identification and Unbiased Profiling of Complex Microbial Communities Using Shotgun Illumina Sequencing of 16S rRNA Amplicon Sequences
Swee Hoe Ong, Vinutha Uppoor Kukkillaya, Andreas Wilm, Christophe Lay, Eliza Xin Pei Ho, Louie Low, Martin Lloyd Hibberd, Niranjan Nagarajan
(Submitted on 12 Oct 2012)
The high throughput and cost-effectiveness afforded by short-read sequencing technologies, in principle, enable researchers to perform 16S rRNA profiling of complex microbial communities at unprecedented depth and resolution. Existing Illumina sequencing protocols are, however, limited by the fraction of the 16S rRNA gene that is interrogated and therefore limit the resolution and quality of the profiling. To address this, we present the design of a novel protocol for shotgun Illumina sequencing of the bacterial 16S rRNA gene, optimized to capture more than 90% of sequences in the Greengenes database and with nearly twice the resolution of existing protocols. Using several in silico and experimental datasets, we demonstrate that despite the presence of multiple variable and conserved regions, the resulting shotgun sequences can be used to accurately quantify the diversity of complex microbial communities. The reconstruction of a significant fraction of the 16S rRNA gene also enabled high precision (>90%) in species-level identification thereby opening up potential application of this approach for clinical microbial characterization.
Modeling the Clonal Evolution of Cancer from Next Generation Sequencing Data
Wei Jiao, Shankar Vembu, Amit G. Deshwar, Lincoln Stein, Quaid Morris
(Submitted on 11 Oct 2012)
We consider the problem of inferring the clonal evolutionary structure of cancer cells from high-throughput next generation sequencing data. We address this problem using statistical machine learning to infer a relational clustering of objects, where the clusters are connected in the form of a rooted tree. We present a hierarchical Bayesian mixture model that uses a non-parametric prior over trees to automatically estimate the number of clones (clusters) and their clonal frequencies (cluster means) in the population, and to identify the phylogenetic relationship between these subclones. Experiments on three real data sets comprising 12 tumor samples from triple-negative breast cancer, acute myeloid leukemia and chronic lymphocytic leukemia patients demonstrate the efficacy of our method.
Integrative modeling of eQTLs and cis-regulatory elements suggest mechanisms underlying cell type specificity of eQTLs
Christopher D Brown, Lara M Mangravite, Barbara E Engelhardt
(Submitted on 11 Oct 2012)
Genetic variants in cis-regulatory elements or trans-acting regulators commonly influence the quantity and spatiotemporal distribution of gene transcription. Recent interest in expression quantitative trait locus (eQTL) mapping has paralleled the adoption of genome-wide association studies (GWAS) for the analysis of complex traits and disease in humans. Under the hypothesis that many GWAS associations tag non-coding SNPs with small effects, and that these SNPs exert phenotypic control by modifying gene expression, it has become common to interpret GWAS associations using eQTL data. To exploit the mechanistic interpretability of eQTL-GWAS comparisons, an improved understanding of the genetic architecture and cell type specificity of eQTLs is required. We address this need by performing an eQTL analysis in four parts: first we identified eQTLs from eleven studies on seven cell types; next we quantified cell type specific eQTLs across the studies; then we integrated eQTL data with cis-regulatory element (CRE) data sets from the ENCODE project; finally we built a classifier to predict cell type specific eQTLs. Consistent with prior studies, we demonstrate that allelic heterogeneity is pervasive at cis-eQTLs and that cis-eQTLs are often cell type specific. Within and between cell type eQTL replication is associated with eQTL SNP overlap with hundreds of cell type specific CRE element classes, including enhancer, promoter, and repressive chromatin marks, regions of open chromatin, and many classes of DNA binding proteins. Using a random forest classifier including 526 CRE data sets as features, we successfully predict the cell type specificity of eQTL SNPs in the absence of gene expression data from the cell type of interest. We anticipate that such integrative, predictive modeling will improve our ability to understand the mechanistic basis of human complex phenotypic variation.
Identifying and Mapping Cell-type Specific Chromatin Programming of Gene Expression
Troels T. Marstrand, John D. Storey
(Submitted on 11 Oct 2012)
A problem of substantial interest is to systematically map variation in chromatin structure to gene expression regulation across conditions, environments, or differentiated cell types. We developed and applied a quantitative framework for determining the existence, strength, and type of relationship between high-resolution chromatin structure in terms of DNaseI hypersensitivity (DHS) and genome-wide gene expression levels in 20 diverse human cell lines. We show that ~25% of genes show cell-type specific expression explained by alterations in chromatin structure. We find that distal regions of chromatin structure (e.g., +/- 200kb) capture more genes with this relationship than local regions (e.g., +/- 2.5kb), yet the local regions show a more pronounced effect. By exploiting variation across cell-types, we were capable of pinpointing the most likely hypersensitive sites related to cell-type specific expression, which we show have a range of contextual usages. This quantitative framework is likely applicable to other settings aimed at relating continuous genomic measurements to gene expression variation.
A mixed model approach for joint genetic analysis of alternatively spliced transcript isoforms using RNA-Seq data
Barbara Rakitsch, Christoph Lippert, Hande Topa, Karsten Borgwardt, Antti Honkela, Oliver Stegle
(Submitted on 10 Oct 2012)
RNA-Seq technology allows for studying the transcriptional state of the cell at an unprecedented level of detail. Beyond quantification of whole-gene expression, it is now possible to disentangle the abundance of individual alternatively spliced transcript isoforms of a gene. A central question is to understand the regulatory processes that lead to differences in relative abundance variation due to external and genetic factors. Here, we present a mixed model approach that allows for (i) joint analysis and genetic mapping of multiple transcript isoforms and (ii) mapping of isoform-specific effects. Central to our approach is to comprehensively model the causes of variation and correlation between transcript isoforms, including the genomic background and technical quantification uncertainty. As a result, our method allows to accurately test for shared as well as transcript-specific genetic regulation of transcript isoforms and achieves substantially improved calibration of these statistical tests. Experiments on genotype and RNA-Seq data from 126 human HapMap individuals demonstrate that our model can help to obtain a more fine-grained picture of the genetic basis of gene expression variation.
Inferring Species Trees from Incongruent Multi-Copy Gene Trees Using the Robinson-Foulds Distance
Ruchi Chaudhary, J. Gordon Burleigh, David Fernández-Baca
(Submitted on 9 Oct 2012)
We present a new method for inferring species trees from multi-copy gene trees. Our method is based on a generalization of the Robinson-Foulds (RF) distance to multi-labeled trees (mul-trees), i.e., gene trees in which multiple leaves can have the same label. Unlike most previous phylogenetic methods using gene trees, this method does not assume that gene tree incongruence is caused by a single, specific biological process, such as gene duplication and loss, deep coalescence, or lateral gene transfer. We prove that it is NP-hard to compute the RF distance between two mul-trees, but it is easy to calculate the generalized RF distance between a mul-tree and a singly-labeled tree. Motivated by this observation, we formulate the RF supertree problem for mul-trees (MulRF), which takes a collection of mul-trees and constructs a species tree that minimizes the total RF distance from the input mul-trees. We present a fast heuristic algorithm for the MulRF supertree problem. Simulation experiments demonstrate that the MulRF method produces more accurate species trees than gene tree parsimony methods when incongruence is caused by gene tree error, duplications and losses, and/or lateral gene transfer. Furthermore, the MulRF heuristic runs quickly on data sets containing hundreds of trees with up to a hundred taxa.
LUMPY: A probabilistic framework for structural variant discovery
Ryan M. Layer, Ira M. Hall, Aaron R. Quinlan
(Submitted on 8 Oct 2012)
Comprehensive discovery of structural variation (SV) in human genomes from DNA sequencing requires the integration of multiple alignment signals including read-pair, split-read and read-depth. However, owing to inherent technical challenges, most existing SV discovery approaches utilize only one signal and consequently suffer from reduced sensitivity, especially at low sequence coverage and for smaller SVs. We present a novel and extremely flexible probabilistic SV discovery framework that is capable of integrating any number of SV detection signals including those generated from read alignments or prior evidence. We demonstrate improved sensitivity over extant methods by combining paired-end and split-read alignments and emphasize the utility of our framework for comprehensive studies of structural variation in heterogeneous tumor genomes. We further discuss the broader utility of this approach for probabilistic integration of diverse genomic interval datasets.
LDx: estimation of linkage disequilibrium from high-throughput pooled resequencing data
Alison F. Feder, Dmitri A. Petrov, Alan O. Bergland
(Submitted on 8 Oct 2012)
High-throughput pooled resequencing offers significant potential for whole genome population sequencing. However, its main drawback is the loss of haplotype information. In order to regain some of this information, we present LDx, a computational tool for estimating linkage disequilibrium (LD) from pooled resequencing data. LDx uses an approximate maximum likelihood approach to estimate LD (r2) between pairs of SNPs that can be observed within and among single reads. LDx also reports r2 estimates derived solely from observed genotype counts. We demonstrate that the LDx estimates are highly correlated with r2 estimated from individually resequenced strains. We discuss the performance of LDx using more stringent quality conditions and infer via simulation the degree to which performance can improve based on read depth. Finally we demonstrate two possible uses of LDx with real and simulated pooled resequencing data. First, we use LDx to infer genomewide patterns of decay of LD with physical distance in D. melanogaster population resequencing data. Second, we demonstrate that r2 estimates from LDx are capable of distinguishing alternative demographic models representing plausible demographic histories of D. melanogaster.
Birth and death processes with neutral mutations
Nicolas Champagnat, Amaury Lambert, Mathieu Richard
(Submitted on 27 Sep 2012)
In this paper, we review recent results of ours concerning branching processes with general lifetimes and neutral mutations, under the infinitely many alleles model, where mutations can occur either at birth of individuals or at a constant rate during their lives.
In both models, we study the allelic partition of the population at time t. We give closed formulae for the expected frequency spectrum at t and prove pathwise convergence to an explicit limit, as t goes to infinity, of the relative numbers of types younger than some given age and carried by a given number of individuals (small families). We also provide convergences in distribution of the sizes or ages of the largest families and of the oldest families.
In the case of exponential lifetimes, population dynamics are given by linear birth and death processes, and we can most of the time provide general formulations of our results unifying both models.
Efficient Forward Simulation of Fisher-Wright Populations with Stochastic Population Size and Neutral Single Step Mutations in Haplotypes
Mikkel Meyer Andersen, Poul Svante Eriksen
(Submitted on 5 Oct 2012)
In both population genetics and forensic genetics it is important to know how haplotypes are distributed in a population. Simulation of population dynamics helps facilitating research on the distribution of haplotypes. In forensic genetics, the haplotypes can for example consist of lineage markers such as short tandem repeat loci on the Y chromosome (Y-STR). A dominating model for describing population dynamics is the simple, yet powerful, Fisher-Wright model. We describe an efficient algorithm for exact forward simulation of exact Fisher-Wright populations (and not approximative such as the coalescent model). The efficiency comes from convenient data structures by changing the traditional view from individuals to haplotypes. The algorithm is implemented in the open-source R package ‘fwsim’ and is able to simulate very large populations. We focus on a haploid model and assume stochastic population size with flexible growth specification, no selection, a neutral single step mutation process, and self-reproducing individuals. These assumptions make the algorithm ideal for studying lineage markers such as Y-STR.