A targeted subgenomic approach for phylogenomics based on microfluidic PCR and high throughput sequencing

Simon Uribe-Convers, Matthew L Settles, David C Tank
doi: http://dx.doi.org/10.1101/021246

Advances in high-throughput sequencing (HTS) have allowed researchers to obtain large amounts of biological sequence information at speeds and costs unimaginable only a decade ago. Phylogenetics, and the study of evolution in general, is quickly migrating towards using HTS to generate larger and more complex molecular datasets. In this paper, we present a method that utilizes microfluidic PCR and HTS to generate large amounts of sequence data suitable for phylogenetic analyses. The approach uses a Fluidigm microfluidic PCR array and two sets of PCR primers to simultaneously amplify 48 target regions across 48 samples, incorporating sample-specific barcodes and HTS adapters (2,304 unique amplicons per microfluidic array). The final product is a pooled set of amplicons ready to be sequenced, and thus there is no need to construct separate, costly genomic libraries for each sample. Further, we present a bioinformatics pipeline to process the raw HTS reads to either generate consensus sequences (with or without ambiguities) for every locus in every sample or, more importantly, recover the separate alleles from heterozygous target regions in each sample. This is important because it adds allelic information that is well suited for coalescent-based phylogenetic analyses that are becoming very common in conservation and evolutionary biology. To test our subgenomic method and bioinformatics pipeline, we sequenced 576 samples across 96 target regions belonging to the South American clade of the genus Bartsia L. in the plant family Orobanchaceae. After sequencing cleanup and alignment, the experiment resulted in ~25,300 bp across 486 samples for a set of 48 primer pairs targeting the plastome, and ~13,500 bp for 363 samples for a set of primers targeting regions in the nuclear genome. Finally, we constructed a combined concatenated matrix from all 96 primer combinations, resulting in a combined aligned length of ~40,500 bp for 349 samples.
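
To make the "consensus with or without ambiguities" idea concrete, here is a minimal Python sketch. It is not the authors' pipeline: the function name `consensus`, the `min_fraction` threshold, and the assumption that reads are already aligned and of equal length are all illustrative choices. With ambiguities enabled, a column where two or more bases each reach the threshold is written as the standard IUPAC code; otherwise the most frequent base is used.

```python
from collections import Counter

# Standard IUPAC codes for sets of two or more nucleotides.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def consensus(reads, min_fraction=0.25, use_ambiguities=True):
    """Column-wise consensus of aligned, equal-length, uppercase reads.

    min_fraction is a hypothetical threshold: a base is kept in a column
    when its share of the A/C/G/T calls is at least this value.
    """
    out = []
    for column in zip(*reads):
        # Ignore gaps and any non-ACGT symbols in this column.
        counts = Counter(base for base in column if base in "ACGT")
        total = sum(counts.values())
        if total == 0:
            out.append("N")  # no usable base calls at this position
            continue
        kept = frozenset(b for b, n in counts.items() if n / total >= min_fraction)
        if use_ambiguities and len(kept) > 1:
            out.append(IUPAC[kept])
        else:
            out.append(counts.most_common(1)[0][0])
    return "".join(out)


if __name__ == "__main__":
    toy_reads = ["ACGT", "ACGT", "ATGT", "ATGT"]  # toy data, heterozygous at position 2
    print(consensus(toy_reads))
```

Recovering the separate alleles, which the abstract highlights as the more important output, requires phasing reads rather than collapsing them and is beyond this sketch.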

A novel normalization approach unveils blind spots in gene expression profiling

Carlos P. Roca, Susana I. L. Gomes, Mónica J. B. Amorim, Janeck J. Scott-Fordsmand
doi: http://dx.doi.org/10.1101/021212

RNA-Seq and gene expression microarrays provide comprehensive profiles of gene activity, by measuring the concentration of tens of thousands of mRNA molecules in single assays. However, lack of accuracy and reproducibility has hindered the application of these high-throughput technologies. A key challenge in the data analysis is the normalization of gene expression levels, which is required to make them comparable between samples. This normalization is currently performed following approaches resting on an implicit assumption that most genes are not differentially expressed. Here we show that this assumption is unrealistic and likely results in failure to detect numerous gene expression changes. We have devised a mathematical approach to normalization that makes no assumption of this sort. We have found that variation in gene expression is much greater than currently believed, and that it can be measured with available technologies. Our results also explain, at least partially, the problems encountered in transcriptomics studies. We expect this improvement in detection to help efforts to realize the full potential of gene expression profiling, especially in analyses of cellular processes involving complex modulations of gene expression, such as cell differentiation, toxic responses and cancer.
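
The following toy Python sketch shows why the "most genes are unchanged" assumption can hide real changes. It implements a generic median-ratio scaling, chosen here only as a simple stand-in for conventional normalization; it is not the authors' new method, and the function name and the expression values are made up for illustration.

```python
from statistics import median


def median_ratio_normalize(reference, sample):
    """Scale `sample` so that its median ratio to `reference` becomes 1.

    This implicitly assumes that the typical (median) gene is not
    differentially expressed between the two samples.
    """
    factor = median(s / r for r, s in zip(reference, sample))
    return [s / factor for s in sample]


if __name__ == "__main__":
    reference = [10, 10, 10, 10, 10]  # toy expression levels
    treated = [20, 20, 20, 20, 10]    # toy case: four of five genes truly doubled
    print(median_ratio_normalize(reference, treated))
```

In this toy case the four genes that doubled come out looking unchanged, and the one gene that did not change appears halved.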

Inference under a Wright-Fisher model using an accurate beta approximation

Paula Tataru, Thomas Bataillon, Asger Hobolth
doi: http://dx.doi.org/10.1101/021261

The large amount and high quality of genomic data available today enable, in principle, accurate inference of the evolutionary history of observed populations. The Wright-Fisher model is one of the most widely used models for this purpose. It describes the stochastic behavior in time of allele frequencies and the influence of evolutionary pressures, such as mutation and selection. Despite its simple mathematical formulation, exact results for the distribution of allele frequency (DAF) as a function of time are not available in closed analytic form. Existing approximations build on the computationally intensive diffusion limit, or rely on matching moments of the DAF. One of the moment-based approximations relies on the beta distribution, which can accurately describe the DAF when the allele frequency is not close to the boundaries (zero and one). Nonetheless, under a Wright-Fisher model, the probability of being on the boundary can be positive, corresponding to the allele being either lost or fixed. Here, we introduce the beta with spikes, an extension of the beta approximation, which explicitly models the loss and fixation probabilities as two spikes at the boundaries. We show that the addition of spikes greatly improves the quality of the approximation. We additionally illustrate, using both simulated and real data, how the beta with spikes can be used for inference of divergence times between populations, with comparable performance to existing state-of-the-art methods.
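
The shape of the distribution described above can be sketched as a three-part mixture: a point mass at 0 (loss), a point mass at 1 (fixation), and a beta density on the interior. The Python below is only an illustration under that reading of the abstract. The parameter names (`p_loss`, `p_fix`, `alpha`, `beta`) are assumptions, and the sketch does not show how the paper derives these quantities over time.

```python
import random


def sample_beta_with_spikes(p_loss, p_fix, alpha, beta, rng=random):
    """Draw one allele frequency from a beta-with-spikes mixture.

    With probability p_loss the allele is lost (frequency 0), with
    probability p_fix it is fixed (frequency 1), and otherwise the
    frequency is drawn from Beta(alpha, beta).
    """
    if p_loss < 0 or p_fix < 0 or p_loss + p_fix > 1:
        raise ValueError("spike probabilities must be non-negative and sum to at most 1")
    u = rng.random()
    if u < p_loss:
        return 0.0
    if u < p_loss + p_fix:
        return 1.0
    return rng.betavariate(alpha, beta)


def mean_beta_with_spikes(p_loss, p_fix, alpha, beta):
    """Mean of the mixture: spike at 1 plus the weighted mean of the beta part."""
    return p_fix + (1.0 - p_loss - p_fix) * alpha / (alpha + beta)


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is repeatable
    draws = [sample_beta_with_spikes(0.2, 0.1, 2.0, 2.0, rng) for _ in range(10000)]
    print(sum(draws) / len(draws), mean_beta_with_spikes(0.2, 0.1, 2.0, 2.0))
```

A plain beta distribution assigns zero probability to exactly 0 or 1, which is why the two explicit spikes are needed to represent loss and fixation.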

TESS: Bayesian inference of lineage diversification rates from (incompletely sampled) molecular phylogenies in R

Sebastian Höhna, Michael R. May, Brian R. Moore
doi: http://dx.doi.org/10.1101/021238

Many fundamental questions in evolutionary biology entail estimating rates of lineage diversification (speciation–extinction). We develop a flexible Bayesian framework for specifying an effectively infinite array of diversification models (in which rates are constant, vary continuously, or change episodically through time) and implement numerical methods to estimate parameters of these models from molecular phylogenies, even when species sampling is incomplete. Additionally, we provide robust methods for comparing the relative and absolute fit of competing branching-process models to a given tree, thereby providing rigorous tests of biological hypotheses regarding patterns and processes of lineage diversification.

Exon capture optimization in large-genome amphibians

Evan McCartney-Melstad, Genevieve G. Mount, H. Bradley Shaffer
doi: http://dx.doi.org/10.1101/021253

Background: Gathering genomic-scale data efficiently is challenging for non-model species with large, complex genomes. Transcriptome sequencing is accessible for even large-genome organisms, and sequence capture probes can be designed from such mRNA sequences to enrich and sequence exonic regions. Maximizing enrichment efficiency is important to reduce sequencing costs, but relatively little data exist for exon capture experiments in large-genome non-model organisms. Here, we conducted a replicated factorial experiment to explore the effects of several modifications to standard protocols that might increase sequence capture efficiency for large-genome amphibians.
Methods: We enriched 53 genomic libraries from salamanders for a custom set of 8,706 exons under differing conditions. Libraries were prepared using pools of DNA from 3 different salamanders with approximately 30 gigabase genomes: California tiger salamander (Ambystoma californiense), barred tiger salamander (Ambystoma mavortium), and an F1 hybrid between the two. We enriched libraries using different amounts of c0t-1 blocker, individual input DNA, and total reaction DNA. Enriched libraries were sequenced with 150 bp paired-end reads on an Illumina HiSeq 2500, and the efficiency of target enrichment was quantified using unique read mapping rates and average depth across targets. The different enrichment treatments were evaluated to determine if c0t-1 and input DNA significantly impact enrichment efficiency in large-genome amphibians.
Results: Increasing the amounts of c0t-1 and individual input DNA both reduced the rates of PCR duplication. This reduction led to an increase in the percentage of unique reads mapping to target sequences, essentially doubling overall efficiency of the target capture from 10.4% to nearly 19.9%. We also found that post-enrichment DNA concentrations and qPCR enrichment verification were useful for predicting the success of enrichment.
Conclusions: Increasing the amount of individual sample input DNA and the amount of c0t-1 blocker both increased the efficiency of target capture in large-genome salamanders. By reducing PCR duplication rates, the number of unique reads mapping to targets increased, making target capture experiments more efficient and affordable. Our results indicate that target capture protocols can be modified to efficiently screen large-genome vertebrate taxa including amphibians.

PrediXcan: Trait Mapping Using Human Transcriptome Regulation

Eric R Gamazon, Heather E Wheeler, Kaanan Shah, Sahar V Mozaffari, Keston Aquino-Michaels, Robert J Carroll, Anne E Eyler, Joshua C Denny, Dan L Nicolae, Nancy J Cox, Hae Kyung Im, GTEx Consortium
doi: http://dx.doi.org/10.1101/020164

Genome-wide association studies (GWAS) have identified thousands of variants robustly associated with complex traits. However, the biological mechanisms underlying these associations are, in general, not well understood. We propose a gene-based association method called PrediXcan that directly tests the molecular mechanisms through which genetic variation affects phenotype. The approach estimates the component of gene expression determined by an individual’s genetic profile and correlates the “imputed” gene expression with the phenotype under investigation to identify genes involved in the etiology of the phenotype. The genetically regulated gene expression is estimated using whole-genome tissue-dependent prediction models trained with reference transcriptome datasets. PrediXcan enjoys the benefits of gene-based approaches such as reduced multiple testing burden, more comprehensive annotation of gene function compared to that derived from single variants, and a principled approach to the design of follow-up experiments while also integrating knowledge of regulatory function. Since no actual expression data are used in the analysis of GWAS data – only in silico expression – reverse causality problems are largely avoided. PrediXcan harnesses reference transcriptome data for disease mapping studies. Our results demonstrate that PrediXcan can detect known and novel genes associated with disease traits and provide insights into the mechanism of these associations.
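
The two steps described above, imputing expression from genotype and then correlating it with phenotype, can be sketched as follows. This is not the PrediXcan software or its API. It assumes, for illustration only, that the trained prediction model for one gene is a linear, additive set of per-variant weights applied to allele dosages; the variant IDs, weights, dosages, and phenotype values are all toy data.

```python
def predict_expression(dosages, weights):
    """Imputed expression for one gene in one individual.

    Assumes a linear model: the sum over variants of weight * allele dosage.
    Variants missing from `dosages` contribute nothing.
    """
    return sum(w * dosages.get(variant, 0.0) for variant, w in weights.items())


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


if __name__ == "__main__":
    # Hypothetical weights, as if learned from a reference transcriptome panel.
    weights = {"variant_1": 0.5, "variant_2": -0.2}
    # Toy GWAS cohort: allele dosages (0, 1 or 2) and a quantitative phenotype.
    cohort = [
        {"variant_1": 0, "variant_2": 2},
        {"variant_1": 1, "variant_2": 1},
        {"variant_1": 2, "variant_2": 0},
        {"variant_1": 1, "variant_2": 2},
    ]
    phenotype = [1.0, 2.1, 2.9, 1.8]
    imputed = [predict_expression(person, weights) for person in cohort]
    print(pearson(imputed, phenotype))
```

Only genotypes enter the association step, which is why the abstract can say that no measured expression data are used in the GWAS analysis itself.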

CARGO: Effective format-free compressed storage of genomic information

Łukasz Roguski, Paolo Ribeca
(Submitted on 17 Jun 2015)

The recent super-exponential growth in the amount of sequencing data generated worldwide has brought techniques for compressed storage into focus. Most available solutions, however, are strictly tied to specific bioinformatics formats, sometimes inheriting from them suboptimal design choices; this hinders flexible and effective data sharing. Here we present CARGO (Compressed ARchiving for GenOmics), a high-level framework to automatically generate software systems optimized for the compressed storage of arbitrary types of large genomic data collections. Straightforward applications of our approach to FASTQ and SAM archives require a few lines of code, produce solutions that match and sometimes outperform specialized format-tailored compressors, and scale well to multi-TB datasets.

Evolution and coexistence in response to a key innovation in a long-term evolution experiment with Escherichia coli

Caroline B. Turner, Zachary D. Blount, Daniel H. Mitchell, Richard E. Lenski
doi: http://dx.doi.org/10.1101/020958

Evolution of a novel function can greatly alter the effects of an organism on its environment. These environmental changes can, in turn, affect the further evolution of that organism and any coexisting organisms. We examine these effects and feedbacks following evolution of a novel function in the long-term evolution experiment (LTEE) with Escherichia coli. A characteristic feature of E. coli is its inability to consume citrate aerobically. However, that ability evolved in one of the LTEE populations. In this population, citrate-utilizing bacteria (Cit+) coexisted stably with another clade of bacteria that lacked the capacity to utilize citrate (Cit−). This coexistence was shaped by the evolution of a cross-feeding relationship in which Cit+ cells released the dicarboxylic acids succinate, fumarate, and malate into the medium, and Cit− cells evolved improved growth on these carbon sources, as did the Cit+ cells. Thus, the evolution of citrate consumption led to a flask-based ecosystem that went from a single limiting resource, glucose, to one with five resources either shared or partitioned between two coexisting clades. Our findings show how evolutionary novelties can change environmental conditions, thereby facilitating diversity and altering both the structure of an ecosystem and the evolutionary trajectories of coexisting organisms.

Dynamics of transcription factor binding site evolution

Murat Tuğrul, Tiago Paixão, Nicholas H. Barton, Gašper Tkačik
(Submitted on 16 Jun 2015)

Evolution of gene regulation is crucial for our understanding of the phenotypic differences between species, populations and individuals. Sequence-specific binding of transcription factors to the regulatory regions on the DNA is a key regulatory mechanism that determines gene expression and hence heritable phenotypic variation. We use a biophysical model for directional selection on gene expression to estimate the rates of gain and loss of transcription factor binding sites (TFBS) in finite populations under both point and insertion/deletion mutations. Our results show that these rates are typically slow for a single TFBS in an isolated DNA region, unless the selection is extremely strong. These rates decrease drastically with increasing TFBS length or increasingly specific protein-DNA interactions, making the evolution of sites longer than ~10 bp unlikely on typical eukaryotic speciation timescales. Similarly, evolution converges to the stationary distribution of binding sequences very slowly, making the equilibrium assumption questionable. The availability of longer regulatory sequences in which multiple binding sites can evolve simultaneously, the presence of “pre-sites” or partially decayed old sites in the initial sequence, and biophysical cooperativity between transcription factors, can all facilitate gain of TFBS and reconcile theoretical calculations with timescales inferred from comparative genetics.

On the Origins and Control of Community Types in the Human Microbiome

Travis E. Gibson, Amir Bashan, Hong-Tai Cao, Scott T. Weiss, Yang-Yu Liu
(Submitted on 17 Jun 2015)

Microbiome-based stratification of healthy individuals into compositional categories, referred to as “community types”, holds promise for drastically improving personalized medicine. Despite this potential, the existence of community types and the degree of their distinctness have been highly debated. Here we adopted a dynamic systems approach and found that heterogeneity in the interspecific interactions or the presence of strongly interacting species is sufficient to explain community types, independent of the topology of the underlying ecological network. By controlling the presence or absence of these strongly interacting species we can steer the microbial ecosystem to any desired community type. This open-loop control strategy still holds even when the community types are not distinct but appear as dense regions within a continuous gradient. This finding can be used to develop viable therapeutic strategies for shifting the microbial composition to a healthy configuration.