Exon capture optimization in large-genome amphibians

Evan McCartney-Melstad, Genevieve G. Mount, H. Bradley Shaffer
doi: http://dx.doi.org/10.1101/021253

Background: Gathering genomic-scale data efficiently is challenging for non-model species with large, complex genomes. Transcriptome sequencing is accessible for even large-genome organisms, and sequence capture probes can be designed from such mRNA sequences to enrich and sequence exonic regions. Maximizing enrichment efficiency is important to reduce sequencing costs, but relatively few data exist for exon capture experiments in large-genome non-model organisms. Here, we conducted a replicated factorial experiment to explore the effects of several modifications to standard protocols that might increase sequence capture efficiency for large-genome amphibians.
Methods: We enriched 53 genomic libraries from salamanders for a custom set of 8,706 exons under differing conditions. Libraries were prepared using pools of DNA from three different salamanders with approximately 30 gigabase genomes: California tiger salamander (Ambystoma californiense), barred tiger salamander (Ambystoma mavortium), and an F1 hybrid between the two. We enriched libraries using different amounts of c0t-1 blocker, individual input DNA, and total reaction DNA. Enriched libraries were sequenced with 150 bp paired-end reads on an Illumina HiSeq 2500, and the efficiency of target enrichment was quantified using unique read mapping rates and average depth across targets. The different enrichment treatments were evaluated to determine whether c0t-1 and input DNA significantly impact enrichment efficiency in large-genome amphibians.
Results: Increasing the amounts of c0t-1 and individual input DNA both reduced the rates of PCR duplication. This reduction led to an increase in the percentage of unique reads mapping to target sequences, essentially doubling the overall efficiency of target capture, from 10.4% to nearly 19.9%. We also found that post-enrichment DNA concentrations and qPCR enrichment verification were useful for predicting the success of enrichment.
Conclusions: Increasing the amount of individual sample input DNA and the amount of c0t-1 blocker both increased the efficiency of target capture in large-genome salamanders. By reducing PCR duplication rates, the number of unique reads mapping to targets increased, making target capture experiments more efficient and affordable. Our results indicate that target capture protocols can be modified to efficiently screen large-genome vertebrate taxa including amphibians.
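
As a rough illustration of the efficiency metric in this abstract, the Python sketch below computes the percentage of sequenced reads that are both unique (non-duplicate) and mapped to targets. The function name, the read counts, and the use of total reads as the denominator are assumptions made for illustration and are not taken from the authors' pipeline; only the 10.4% and 19.9% figures come from the abstract.

```python
def unique_on_target_pct(total_reads, unique_on_target_reads):
    """Percentage of all sequenced reads that are unique and map to a target.

    Assumption: efficiency is measured against total sequenced reads.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * unique_on_target_reads / total_reads


# Hypothetical read counts chosen only to reproduce the reported percentages.
baseline = unique_on_target_pct(1_000_000, 104_000)   # standard protocol
optimized = unique_on_target_pct(1_000_000, 199_000)  # more c0t-1 and input DNA

# Fold change in efficiency between the two conditions.
fold_change = optimized / baseline
print(f"{baseline:.1f}% -> {optimized:.1f}% ({fold_change:.2f}-fold)")
```

Under these assumed counts the fold change is a little over 1.9, which matches the abstract's description of efficiency "essentially doubling".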

PrediXcan: Trait Mapping Using Human Transcriptome Regulation

Eric R Gamazon, Heather E Wheeler, Kaanan Shah, Sahar V Mozaffari, Keston Aquino-Michaels, Robert J Carroll, Anne E Eyler, Joshua C Denny, Dan L Nicolae, Nancy J Cox, Hae Kyung Im, GTEx Consortium
doi: http://dx.doi.org/10.1101/020164

Genome-wide association studies (GWAS) have identified thousands of variants robustly associated with complex traits. However, the biological mechanisms underlying these associations are, in general, not well understood. We propose a gene-based association method called PrediXcan that directly tests the molecular mechanisms through which genetic variation affects phenotype. The approach estimates the component of gene expression determined by an individual’s genetic profile and correlates the “imputed” gene expression with the phenotype under investigation to identify genes involved in the etiology of the phenotype. The genetically regulated gene expression is estimated using whole-genome tissue-dependent prediction models trained with reference transcriptome datasets. PrediXcan enjoys the benefits of gene-based approaches such as reduced multiple testing burden, more comprehensive annotation of gene function compared to that derived from single variants, and a principled approach to the design of follow-up experiments while also integrating knowledge of regulatory function. Since no actual expression data are used in the analysis of GWAS data – only in silico expression – reverse causality problems are largely avoided. PrediXcan harnesses reference transcriptome data for disease mapping studies. Our results demonstrate that PrediXcan can detect known and novel genes associated with disease traits and provide insights into the mechanism of these associations.
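
The two steps described in this abstract (impute expression from genotype, then correlate it with the phenotype) can be sketched in a few lines of Python. This is a minimal illustration and not the PrediXcan software: the additive weighted-sum model, the SNP weights, the genotype dosages, and the phenotype values are all hypothetical, and a real analysis would use weights trained on reference transcriptome data and a proper regression test.

```python
from math import sqrt


def impute_expression(dosages, weights):
    """Predicted expression per individual: weighted sum of allele dosages."""
    return [sum(w * d for w, d in zip(weights, row)) for row in dosages]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: 4 individuals x 3 SNPs, dosages coded 0/1/2.
dosages = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2]]
weights = [0.5, -0.2, 0.1]          # hypothetical per-SNP effects on one gene
phenotype = [1.0, 0.8, 2.1, 0.3]    # hypothetical trait values

predicted = impute_expression(dosages, weights)  # in silico expression
r = pearson(predicted, phenotype)                # gene-level association signal
print(predicted, round(r, 3))
```

Because only the predicted expression enters the association step, no measured expression from the GWAS samples is needed, which is the property the abstract credits with largely avoiding reverse causality.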

CARGO: Effective format-free compressed storage of genomic information

Łukasz Roguski, Paolo Ribeca
(Submitted on 17 Jun 2015)

The recent super-exponential growth in the amount of sequencing data generated worldwide has brought techniques for compressed storage into focus. Most available solutions, however, are strictly tied to specific bioinformatics formats, sometimes inheriting from them suboptimal design choices; this hinders flexible and effective data sharing. Here we present CARGO (Compressed ARchiving for GenOmics), a high-level framework to automatically generate software systems optimized for the compressed storage of arbitrary types of large genomic data collections. Straightforward applications of our approach to FASTQ and SAM archives require a few lines of code, produce solutions that match and sometimes outperform specialized format-tailored compressors, and scale well to multi-TB datasets.

Evolution and coexistence in response to a key innovation in a long-term evolution experiment with Escherichia coli

Caroline B. Turner, Zachary D. Blount, Daniel H. Mitchell, Richard E. Lenski
doi: http://dx.doi.org/10.1101/020958

Evolution of a novel function can greatly alter the effects of an organism on its environment. These environmental changes can, in turn, affect the further evolution of that organism and any coexisting organisms. We examine these effects and feedbacks following evolution of a novel function in the long-term evolution experiment (LTEE) with Escherichia coli. A characteristic feature of E. coli is its inability to consume citrate aerobically. However, that ability evolved in one of the LTEE populations. In this population, citrate-utilizing bacteria (Cit+) coexisted stably with another clade of bacteria that lacked the capacity to utilize citrate (Cit−). This coexistence was shaped by the evolution of a cross-feeding relationship in which Cit+ cells released the dicarboxylic acids succinate, fumarate, and malate into the medium, and Cit− cells evolved improved growth on these carbon sources, as did the Cit+ cells. Thus, the evolution of citrate consumption led to a flask-based ecosystem that went from a single limiting resource, glucose, to one with five resources either shared or partitioned between two coexisting clades. Our findings show how evolutionary novelties can change environmental conditions, thereby facilitating diversity and altering both the structure of an ecosystem and the evolutionary trajectories of coexisting organisms.

Dynamics of transcription factor binding site evolution

Murat Tuğrul, Tiago Paixão, Nicholas H. Barton, Gašper Tkačik
(Submitted on 16 Jun 2015)

Evolution of gene regulation is crucial for our understanding of the phenotypic differences between species, populations and individuals. Sequence-specific binding of transcription factors to the regulatory regions on the DNA is a key regulatory mechanism that determines gene expression and hence heritable phenotypic variation. We use a biophysical model for directional selection on gene expression to estimate the rates of gain and loss of transcription factor binding sites (TFBS) in finite populations under both point and insertion/deletion mutations. Our results show that these rates are typically slow for a single TFBS in an isolated DNA region, unless the selection is extremely strong. These rates decrease drastically with increasing TFBS length or increasingly specific protein-DNA interactions, making the evolution of sites longer than ~10 bp unlikely on typical eukaryotic speciation timescales. Similarly, evolution converges to the stationary distribution of binding sequences very slowly, making the equilibrium assumption questionable. The availability of longer regulatory sequences in which multiple binding sites can evolve simultaneously, the presence of “pre-sites” or partially decayed old sites in the initial sequence, and biophysical cooperativity between transcription factors, can all facilitate gain of TFBS and reconcile theoretical calculations with timescales inferred from comparative genetics.

On the Origins and Control of Community Types in the Human Microbiome

Travis E. Gibson, Amir Bashan, Hong-Tai Cao, Scott T. Weiss, Yang-Yu Liu
(Submitted on 17 Jun 2015)

Microbiome-based stratification of healthy individuals into compositional categories, referred to as “community types”, holds promise for drastically improving personalized medicine. Despite this potential, the existence of community types and the degree of their distinctness have been highly debated. Here we adopted a dynamic systems approach and found that heterogeneity in the interspecific interactions or the presence of strongly interacting species is sufficient to explain community types, independent of the topology of the underlying ecological network. By controlling the presence or absence of these strongly interacting species we can steer the microbial ecosystem to any desired community type. This open-loop control strategy still holds even when the community types are not distinct but appear as dense regions within a continuous gradient. This finding can be used to develop viable therapeutic strategies for shifting the microbial composition to a healthy configuration.

Environmental fluctuations do not select for increased variation or population-based resistance in Escherichia coli

Shraddha Madhav Karve, Kanishka Tiwary, S Selveshwari, Sutirth Dey
doi: http://dx.doi.org/10.1101/021030

In nature, organisms often face unpredictably fluctuating environments. However, little is understood about the mechanisms that allow organisms to cope with such unpredictability. To address this issue, we used replicate populations of Escherichia coli selected under complex, randomly changing environments. We assayed growth at the level of single cells under four different novel stresses that had no known correlation with the selection environments. Under such conditions, the individuals of the selected populations had significantly lower lag and greater yield compared to the controls. More importantly, there were no outliers in terms of growth, thus ruling out the evolution of population-based resistance. We also assayed the standing phenotypic variation of the selected populations, in terms of their growth on 94 different substrates. Contrary to extant theoretical predictions, there was no increase in the standing variation of the selected populations, nor was there any significant divergence from the ancestors. This suggested that the greater fitness in novel environments is brought about by selection at the level of the individuals, which restricts the suite of traits that can potentially evolve through this mechanism. Given that day-to-day climatic variability of the world is rising, these results have potential public health implications. Our results also underline the need for a very different kind of theoretical approach to study the effects of fluctuating environments.

Leveraging distant relatedness to quantify human mutation and gene conversion rates

Pier Francesco Palamara, Laurent Francioli, Giulio Genovese, Peter Wilton, Alexander Gusev, Hilary Finucane, Sriram Sankararaman, Shamil Sunyaev, Paul Debakker, John Wakeley, Itsik Pe’er, Alkes L. Price, The Genome of the Netherlands Consortium
doi: http://dx.doi.org/10.1101/020776

The rate at which human genomes mutate is a central biological parameter that has many implications for our ability to understand demographic and evolutionary phenomena. We present a method for inferring mutation and gene conversion rates using the number of sequence differences observed in identical-by-descent (IBD) segments together with a reconstructed model of recent population size history. This approach is robust to, and can quantify, the presence of substantial genotyping error, as validated in coalescent simulations. We applied the method to 498 trio-phased Dutch individuals from the Genome of the Netherlands (GoNL) project, sequenced at an average depth of 13×. We infer a point mutation rate of 1.66 ± 0.04 × 10⁻⁸ per base per generation, and a rate of 1.26 ± 0.06 × 10⁻⁹ for <20 bp indels. Our estimated average genome-wide mutation rate is higher than most pedigree-based estimates reported thus far, but lower than estimates obtained using substitution rates across primates. By quantifying how estimates vary as a function of allele frequency, we infer the probability that a site is involved in non-crossover gene conversion as 5.99 ± 0.69 × 10⁻⁶, consistent with recent reports. We find that recombination does not have observable mutagenic effects after gene conversion is accounted for, and that local gene conversion rates reflect recombination rates. We detect a strong enrichment for recent deleterious variation among mismatching variants found within IBD regions, and observe summary statistics of local IBD sharing to closely match previously proposed metrics of background selection, but find no significant effects of selection on our estimates of mutation rate. We detect no evidence for strong variation of mutation rates in a number of genomic annotations obtained from several recent studies.

Signatures of Dobzhansky-Muller Incompatibilities in the Genomes of Recombinant Inbred Lines

Maria Colomé-Tatché, Frank Johannes
doi: http://dx.doi.org/10.1101/021006

In the construction of Recombinant Inbred Lines (RILs) from two divergent inbred parents, certain genotype (or epigenotype) combinations may be functionally “incompatible” when brought together in the genomes of the progeny, thus resulting in sterility or lower fertility. Natural selection against these epistatic combinations during inbreeding can change haplotype frequencies and distort linkage disequilibrium (LD) relations between loci within and across chromosomes. These LD distortions have received increased experimental attention, because they point to genomic regions that may drive Dobzhansky-Muller-type reproductive isolation and, ultimately, speciation in the wild. Here we study the selection signatures of two-locus epistatic incompatibility models and quantify their impact on the genetic composition of the genomes of 2-way RILs obtained by selfing. We also consider the biases introduced by breeders when trying to counteract the loss of lines by selectively propagating only viable seeds. Building on our theoretical results, we develop model-based maximum likelihood (ML) tests which can be employed in pairwise genome scans for incompatibility loci using multi-locus genotype data. We illustrate this ML approach in the context of two published A. thaliana RIL panels. Our work lays the theoretical foundation for studying more complex systems such as RILs obtained by sibling mating and/or from multi-parental crosses.

Linkage disequilibrium between single nucleotide polymorphisms and hypermutable loci

Sterling Sawaya, Matt Jones, Matt Keller
doi: http://dx.doi.org/10.1101/020909

Some diseases are caused by genetic loci with a high rate of change, and heritability in complex traits is likely to be partially caused by variation at these loci. These hypermutable elements, such as tandem repeats, change at rates that are orders of magnitude higher than the rates at which most single nucleotides mutate. However, single nucleotide polymorphisms, or SNPs, are currently the primary focus of genetic studies of human disease. Here we quantify the degree to which SNPs are correlated with hypermutable loci, examining a range of mutation rates that correspond to mutation rates at tandem repeat loci. We use established population genetics theory to relate mutation rates to recombination rates and compare the theoretical predictions to simulations. Both simulations and theory agree that, at the highest mutation rates, almost all correlation is lost between a hypermutable locus and surrounding SNPs. The theoretical predictions break down for middle to low mutation rates, differing widely from the simulated results. The simulation results suggest that some correlation remains between SNPs and hypermutable loci when mutation rates are on the lower end of the mutation spectrum. Consequently, in some cases SNPs can tag variation caused by tandem repeat loci. We also examine the linkage between SNPs and other SNPs and uncover ways in which the linkage disequilibrium of rare SNPs differs from that of hypermutable loci.
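
The "correlation" between a SNP and a nearby locus is commonly summarized by the linkage disequilibrium statistic r². The Python sketch below computes r² from a list of two-locus haplotypes using the standard definition D = p_AB − p_A·p_B. It assumes both loci are biallelic (coded 0/1), which is a simplification, since tandem repeats are typically multi-allelic; the function name and example haplotypes are hypothetical, and this is not the authors' simulation code.

```python
def r_squared(haplotypes):
    """LD statistic r^2 for two biallelic loci.

    haplotypes: list of (a, b) tuples, each allele coded 0 or 1.
    Returns 0.0 when either locus is monomorphic (r^2 is undefined there).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n            # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n            # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                               # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else d * d / denom


# Hypothetical samples: alleles always co-occur vs. all four haplotypes equally common.
perfect_ld = r_squared([(0, 0), (1, 1), (0, 0), (1, 1)])
no_ld = r_squared([(0, 0), (0, 1), (1, 0), (1, 1)])
print(perfect_ld, no_ld)
```

In this framing, the abstract's result corresponds to recurrent mutation at the hypermutable locus repeatedly breaking the allele pairing with neighbouring SNPs, pushing r² toward zero at the highest mutation rates.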