Fluctuations in fitness distributions and the effects of weak linked selection on sequence evolution

Benjamin H. Good, Michael M. Desai
(Submitted on 15 Oct 2012)

Evolutionary dynamics and patterns of molecular evolution are strongly influenced by selection on linked regions of the genome, but our quantitative understanding of these effects remains incomplete. Recent work has focused on predicting the distribution of fitness within an evolving population, and this forms the basis for several methods that leverage the fitness distribution to predict the patterns of genetic diversity when selection is strong. However, in weakly selected populations random fluctuations due to genetic drift are more severe, and neither the distribution of fitness nor the sequence diversity within the population is well understood. Here, we briefly review the motivations behind the fitness-distribution picture, and summarize the general approaches that have been used to analyze this distribution in the strong-selection regime. We then extend these approaches to the case of weak selection, by outlining a perturbative treatment of selection at a large number of linked sites. This allows us to quantify the stochastic behavior of the fitness distribution and yields exact analytical predictions for the sequence diversity and substitution rate in the limit that selection is weak.

A 454 survey of the community composition and core microbiome of the common bed bug, Cimex lectularius, reveals significant microbial community structure across an urban landscape

Matthew Meriweather, Sara Matthews, Rita Rio, Regina S Baucom
(Submitted on 13 Oct 2012)

Elucidating the spatial dynamics and core constituents of the microbial communities found in association with arthropod hosts is of crucial importance for insects that may vector human or agricultural pathogens. The hematophagous Cimex lectularius, known as the common bed bug, has made a recent resurgence in North America, as well as worldwide, potentially owing to increased travel and resistance to insecticides. A comprehensive survey of the bed bug microbiome has not been performed to date, nor has an assessment of the spatial dynamics of its microbiome. Here we present a survey of bed bug microbial communities by amplifying the V4-V6 hypervariable region of the 16S rDNA gene region followed by 454 Titanium sequencing using 31 individuals from eight natural populations collected from residences in Cincinnati, OH. Across all samples, 97% of the microbial community is made up of two dominant OTUs identified as the α-proteobacterium Wolbachia and an unnamed γ-proteobacterium from the Enterobacteriaceae. Microbial communities varied among host populations for measures of community diversity and exhibited significant population structure. We also uncovered a strong negative correlation in the abundance of the two dominant OTUs, suggesting they may fulfill similar roles as nutritional mutualists. This broad survey represents the most comprehensive assessment, to date, of the microbes that associate with bed bugs, and uncovers evidence for potential antagonism between the two dominant members of the bed bug microbiome.

Species Identification and Unbiased Profiling of Complex Microbial Communities Using Shotgun Illumina Sequencing of 16S rRNA Amplicon Sequences

Swee Hoe Ong, Vinutha Uppoor Kukkillaya, Andreas Wilm, Christophe Lay, Eliza Xin Pei Ho, Louie Low, Martin Lloyd Hibberd, Niranjan Nagarajan
(Submitted on 12 Oct 2012)

The high throughput and cost-effectiveness afforded by short-read sequencing technologies, in principle, enable researchers to perform 16S rRNA profiling of complex microbial communities at unprecedented depth and resolution. Existing Illumina sequencing protocols are, however, limited by the fraction of the 16S rRNA gene that is interrogated and therefore limit the resolution and quality of the profiling. To address this, we present the design of a novel protocol for shotgun Illumina sequencing of the bacterial 16S rRNA gene, optimized to capture more than 90% of sequences in the Greengenes database and with nearly twice the resolution of existing protocols. Using several in silico and experimental datasets, we demonstrate that despite the presence of multiple variable and conserved regions, the resulting shotgun sequences can be used to accurately quantify the diversity of complex microbial communities. The reconstruction of a significant fraction of the 16S rRNA gene also enabled high precision (>90%) in species-level identification thereby opening up potential application of this approach for clinical microbial characterization.

Modeling the Clonal Evolution of Cancer from Next Generation Sequencing Data

Wei Jiao, Shankar Vembu, Amit G. Deshwar, Lincoln Stein, Quaid Morris
(Submitted on 11 Oct 2012)

We consider the problem of inferring the clonal evolutionary structure of cancer cells from high-throughput next generation sequencing data. We address this problem using statistical machine learning to infer a relational clustering of objects, where the clusters are connected in the form of a rooted tree. We present a hierarchical Bayesian mixture model that uses a non-parametric prior over trees to automatically estimate the number of clones (clusters) and their clonal frequencies (cluster means) in the population, and to identify the phylogenetic relationship between these subclones. Experiments on three real data sets comprising 12 tumor samples from triple-negative breast cancer, acute myeloid leukemia and chronic lymphocytic leukemia patients demonstrate the efficacy of our method.

Integrative modeling of eQTLs and cis-regulatory elements suggest mechanisms underlying cell type specificity of eQTLs

Christopher D Brown, Lara M Mangravite, Barbara E Engelhardt
(Submitted on 11 Oct 2012)

Genetic variants in cis-regulatory elements or trans-acting regulators commonly influence the quantity and spatiotemporal distribution of gene transcription. Recent interest in expression quantitative trait locus (eQTL) mapping has paralleled the adoption of genome-wide association studies (GWAS) for the analysis of complex traits and disease in humans. Under the hypothesis that many GWAS associations tag non-coding SNPs with small effects, and that these SNPs exert phenotypic control by modifying gene expression, it has become common to interpret GWAS associations using eQTL data. To exploit the mechanistic interpretability of eQTL-GWAS comparisons, an improved understanding of the genetic architecture and cell type specificity of eQTLs is required. We address this need by performing an eQTL analysis in four parts: first we identified eQTLs from eleven studies on seven cell types; next we quantified cell type specific eQTLs across the studies; then we integrated eQTL data with cis-regulatory element (CRE) data sets from the ENCODE project; finally we built a classifier to predict cell type specific eQTLs. Consistent with prior studies, we demonstrate that allelic heterogeneity is pervasive at cis-eQTLs and that cis-eQTLs are often cell type specific. Within and between cell type eQTL replication is associated with eQTL SNP overlap with hundreds of cell type specific CRE element classes, including enhancer, promoter, and repressive chromatin marks, regions of open chromatin, and many classes of DNA binding proteins. Using a random forest classifier including 526 CRE data sets as features, we successfully predict the cell type specificity of eQTL SNPs in the absence of gene expression data from the cell type of interest. We anticipate that such integrative, predictive modeling will improve our ability to understand the mechanistic basis of human complex phenotypic variation.

Identifying and Mapping Cell-type Specific Chromatin Programming of Gene Expression

Troels T. Marstrand, John D. Storey
(Submitted on 11 Oct 2012)

A problem of substantial interest is to systematically map variation in chromatin structure to gene expression regulation across conditions, environments, or differentiated cell types. We developed and applied a quantitative framework for determining the existence, strength, and type of relationship between high-resolution chromatin structure in terms of DNaseI hypersensitivity (DHS) and genome-wide gene expression levels in 20 diverse human cell lines. We show that ~25% of genes show cell-type specific expression explained by alterations in chromatin structure. We find that distal regions of chromatin structure (e.g., ±200 kb) capture more genes with this relationship than local regions (e.g., ±2.5 kb), yet the local regions show a more pronounced effect. By exploiting variation across cell types, we were able to pinpoint the most likely hypersensitive sites related to cell-type specific expression, which we show have a range of contextual usages. This quantitative framework is likely applicable to other settings aimed at relating continuous genomic measurements to gene expression variation.

A mixed model approach for joint genetic analysis of alternatively spliced transcript isoforms using RNA-Seq data

Barbara Rakitsch, Christoph Lippert, Hande Topa, Karsten Borgwardt, Antti Honkela, Oliver Stegle
(Submitted on 10 Oct 2012)

RNA-Seq technology allows for studying the transcriptional state of the cell at an unprecedented level of detail. Beyond quantification of whole-gene expression, it is now possible to disentangle the abundance of individual alternatively spliced transcript isoforms of a gene. A central question is to understand the regulatory processes that lead to differences in relative abundance variation due to external and genetic factors. Here, we present a mixed model approach that allows for (i) joint analysis and genetic mapping of multiple transcript isoforms and (ii) mapping of isoform-specific effects. Central to our approach is to comprehensively model the causes of variation and correlation between transcript isoforms, including the genomic background and technical quantification uncertainty. As a result, our method allows accurate testing for shared as well as transcript-specific genetic regulation of transcript isoforms and achieves substantially improved calibration of these statistical tests. Experiments on genotype and RNA-Seq data from 126 human HapMap individuals demonstrate that our model can help to obtain a more fine-grained picture of the genetic basis of gene expression variation.

Inferring Species Trees from Incongruent Multi-Copy Gene Trees Using the Robinson-Foulds Distance

Ruchi Chaudhary, J. Gordon Burleigh, David Fernández-Baca
(Submitted on 9 Oct 2012)

We present a new method for inferring species trees from multi-copy gene trees. Our method is based on a generalization of the Robinson-Foulds (RF) distance to multi-labeled trees (mul-trees), i.e., gene trees in which multiple leaves can have the same label. Unlike most previous phylogenetic methods using gene trees, this method does not assume that gene tree incongruence is caused by a single, specific biological process, such as gene duplication and loss, deep coalescence, or lateral gene transfer. We prove that it is NP-hard to compute the RF distance between two mul-trees, but it is easy to calculate the generalized RF distance between a mul-tree and a singly-labeled tree. Motivated by this observation, we formulate the RF supertree problem for mul-trees (MulRF), which takes a collection of mul-trees and constructs a species tree that minimizes the total RF distance from the input mul-trees. We present a fast heuristic algorithm for the MulRF supertree problem. Simulation experiments demonstrate that the MulRF method produces more accurate species trees than gene tree parsimony methods when incongruence is caused by gene tree error, duplications and losses, and/or lateral gene transfer. Furthermore, the MulRF heuristic runs quickly on data sets containing hundreds of trees with up to a hundred taxa.
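
For orientation, the classical Robinson-Foulds distance that the abstract generalizes can be sketched for ordinary singly-labeled rooted trees as the number of clades found in one tree but not the other. This is a minimal illustration only, not the MulRF method or its heuristic: the nested-tuple tree encoding and the names `clusters` and `rf_distance` are assumptions made here, and the rooted (clade-based) variant is used for simplicity.

```python
def clusters(tree):
    """Return the non-trivial clades of a rooted tree as frozensets of leaf labels.

    Assumed encoding (hypothetical, for illustration): a tree is either a
    leaf label (str) or a tuple of subtrees.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)  # record the clade rooted at this internal node
        return below

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by all trees on these leaves
    return found


def rf_distance(tree_a, tree_b):
    """Rooted RF distance: size of the symmetric difference of the clade sets."""
    return len(clusters(tree_a) ^ clusters(tree_b))


balanced = (("A", "B"), ("C", "D"))
caterpillar = ((("A", "B"), "C"), "D")
print(rf_distance(balanced, caterpillar))
```

Here the two trees share the clade {A, B} and differ in one clade each, so the distance counts those two unshared clades.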

LUMPY: A probabilistic framework for structural variant discovery

Ryan M. Layer, Ira M. Hall, Aaron R. Quinlan
(Submitted on 8 Oct 2012)

Comprehensive discovery of structural variation (SV) in human genomes from DNA sequencing requires the integration of multiple alignment signals including read-pair, split-read and read-depth. However, owing to inherent technical challenges, most existing SV discovery approaches utilize only one signal and consequently suffer from reduced sensitivity, especially at low sequence coverage and for smaller SVs. We present a novel and extremely flexible probabilistic SV discovery framework that is capable of integrating any number of SV detection signals including those generated from read alignments or prior evidence. We demonstrate improved sensitivity over extant methods by combining paired-end and split-read alignments and emphasize the utility of our framework for comprehensive studies of structural variation in heterogeneous tumor genomes. We further discuss the broader utility of this approach for probabilistic integration of diverse genomic interval datasets.

LDx: estimation of linkage disequilibrium from high-throughput pooled resequencing data

Alison F. Feder, Dmitri A. Petrov, Alan O. Bergland
(Submitted on 8 Oct 2012)

High-throughput pooled resequencing offers significant potential for whole genome population sequencing. However, its main drawback is the loss of haplotype information. In order to regain some of this information, we present LDx, a computational tool for estimating linkage disequilibrium (LD) from pooled resequencing data. LDx uses an approximate maximum likelihood approach to estimate LD (r²) between pairs of SNPs that can be observed within and among single reads. LDx also reports r² estimates derived solely from observed genotype counts. We demonstrate that the LDx estimates are highly correlated with r² estimated from individually resequenced strains. We discuss the performance of LDx using more stringent quality conditions and infer via simulation the degree to which performance can improve based on read depth. Finally we demonstrate two possible uses of LDx with real and simulated pooled resequencing data. First, we use LDx to infer genomewide patterns of decay of LD with physical distance in D. melanogaster population resequencing data. Second, we demonstrate that r² estimates from LDx are capable of distinguishing alternative demographic models representing plausible demographic histories of D. melanogaster.
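
As a rough illustration of the quantity being estimated, the textbook r² statistic for two biallelic SNPs can be computed directly from counts of the four two-site haplotypes, for example as tallied from reads that span both sites. This sketch is not LDx's approximate maximum likelihood estimator; the function name `r_squared` and its argument names are hypothetical.

```python
def r_squared(n_AB, n_Ab, n_aB, n_ab):
    """Direct-count r^2 between two biallelic SNPs.

    Arguments are counts of the four two-site haplotypes (hypothetical
    naming: upper case = one allele, lower case = the other).
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotype observations")

    p_A = (n_AB + n_Ab) / total   # allele frequency at the first SNP
    p_B = (n_AB + n_aB) / total   # allele frequency at the second SNP
    p_AB = n_AB / total           # frequency of the A-B haplotype

    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("r^2 is undefined when either site is monomorphic")

    d = p_AB - p_A * p_B          # coefficient of linkage disequilibrium, D
    return d * d / denom


print(r_squared(10, 0, 0, 10))  # perfectly associated sites
print(r_squared(5, 5, 5, 5))    # independent sites
```

With few spanning reads this direct estimate is noisy, which is one reason a likelihood-based approach such as the one described in the abstract may be preferable.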