Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees
Brad Solomon , Carleton Kingsford
doi: http://dx.doi.org/10.1101/017087

Enormous databases of short-read RNA-seq sequencing experiments such as the NIH Sequence Read Archive (SRA) are now available. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. A natural question is which of these experiments contain sequences that indicate the expression of a particular sequence such as a gene isoform, lncRNA, or uORF. However, at present this is a computationally demanding question at the scale of these databases. We introduce an indexing scheme, the Sequence Bloom Tree (SBT), to support sequence-based querying of terabase-scale collections of thousands of short-read sequencing experiments. We apply SBT to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments contained in the NIH SRA for breast, blood, and brain tissues, comprising 5 terabytes of sequence. SBTs of this size can be queried for a 1000 nt sequence in 19 minutes using less than 300 MB of RAM, over 100 times faster than standard usage of SRA-BLAST and 119 times faster than STAR. SBTs allow for fast identification of experiments with expressed novel isoforms, even if these isoforms were unknown at the time the SBT was built. We also provide some theoretical guidance about appropriate parameter selection in SBT and propose a sampling-based scheme for potentially scaling SBT to even larger collections of files. While SBT can handle any set of reads, we demonstrate the effectiveness of SBT by searching a large collection of blood, brain, and breast RNA-seq files for all 214,293 known human transcripts to identify tissue-specific transcripts. The implementation used in the experiments below is in C++ and is available as open source at http://www.cs.cmu.edu/~ckingsf/software/bloomtree.
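
To make the indexing idea concrete, here is a minimal Python sketch of a Bloom-filter tree over k-mers. It is a toy under stated assumptions, not the authors' C++ implementation: the names `BloomFilter`, `Node`, `build_tree`, and `query`, the pairwise bottom-up tree construction, the filter size, and the threshold parameter `theta` (the fraction of query k-mers a node must contain before its subtree is searched) are all illustrative choices that the abstract does not specify.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class BloomFilter:
    """Toy Bloom filter; sizes and hash scheme are illustrative assumptions."""

    def __init__(self, n_bits=4096, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0  # a Python int used as a bit vector

    def _positions(self, item):
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other):
        merged = BloomFilter(self.n_bits, self.n_hashes)
        merged.bits = self.bits | other.bits
        return merged


class Node:
    def __init__(self, bloom, name=None, left=None, right=None):
        self.bloom, self.name, self.left, self.right = bloom, name, left, right


def build_tree(experiments, k):
    """Build a tree from a non-empty dict {experiment name: list of reads}.

    Leaves hold one filter per experiment; each internal node holds the
    union of its children. Pairing neighbours is a simplification.
    """
    nodes = []
    for name, reads in experiments.items():
        bloom = BloomFilter()
        for read in reads:
            for kmer in kmers(read, k):
                bloom.add(kmer)
        nodes.append(Node(bloom, name=name))
    while len(nodes) > 1:
        paired = [Node(a.bloom.union(b.bloom), left=a, right=b)
                  for a, b in zip(nodes[0::2], nodes[1::2])]
        if len(nodes) % 2:
            paired.append(nodes[-1])  # odd node carries up unchanged
        nodes = paired
    return nodes[0]


def query(node, sequence, k, theta=0.8):
    """Return names of experiments whose filter holds >= theta of the query k-mers."""
    query_kmers = kmers(sequence, k)
    if not query_kmers:
        return []
    hits = sum(1 for kmer in query_kmers if kmer in node.bloom)
    if hits < theta * len(query_kmers):
        return []  # prune: no experiment below this node can pass
    if node.name is not None:
        return [node.name]
    return query(node.left, sequence, k, theta) + query(node.right, sequence, k, theta)
```

Because a Bloom filter never reports a stored k-mer as absent, an experiment that truly contains the query is always returned; false positives are possible and depend on the filter size and number of hash functions.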

Repeatability of evolution on epistatic landscapes
Benedikt Bauer , Chaitanya S Gokhale
doi: http://dx.doi.org/10.1101/016782

Evolution is a dynamic process. The two classical forces of evolution are mutation and selection. Assuming small mutation rates, evolution can be predicted based solely on the fitness differences between phenotypes. Predicting an evolutionary process under varying mutation rates as well as varying fitness is still an open question. Experimental procedures, however, do include these complexities along with fluctuating population sizes and stochastic events such as extinctions. We investigate the mutational path probabilities of systems having epistatic effects on both fitness and mutation rates using a theoretical and computational framework. In contrast to previous models, we do not limit ourselves to the typical strong selection, weak mutation (SSWM) regime or to fixed population sizes. Rather, we allow epistatic interactions to also affect mutation rates. This can lead to qualitatively non-trivial dynamics. Pathways that are negligible in the SSWM regime can overcome fitness valleys and become accessible. This finding has the potential to extend the traditional predictions based on the SSWM foundation and bring us closer to what is observed in experimental systems.

SumVg: Total heritability explained by all variants in genome-wide association studies based on summary statistics with standard error estimates
Hon-Cheong SO , Pak C. SHAM
doi: http://dx.doi.org/10.1101/016857

Genome-wide association studies (GWAS) have become increasingly popular, and a key question is how much heritability can be explained by all variants in a GWAS. We have previously proposed an approach to answer this question, based on recovering the “true” z-statistics from a set of observed z-statistics. Only summary statistics are required. However, methods for standard error (SE) estimation are not yet available, thereby limiting the interpretation of the results. In this study we developed resampling-based approaches to estimate the SE. We found that delete-d-jackknife and parametric bootstrap approaches provide good estimates of the SE. Methods to compute the sum of heritability explained and the corresponding SE are implemented in the R package SumVg, available at https://sites.google.com/site/honcheongso/software/var-totalvg
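
As an illustration of the resampling idea, the following Python sketch computes a generic delete-d jackknife standard error. It is not the SumVg R code: the function name `delete_d_jackknife_se`, its parameters, and the use of a simple mean as the statistic are assumptions made for the example. The variance uses the standard delete-d form, (n - d) / (d * N) times the sum of squared deviations of the N subset estimates from their mean.

```python
import itertools
import math
import random
import statistics


def delete_d_jackknife_se(values, statistic, d=1, n_subsets=None, seed=0):
    """Delete-d jackknife standard error of statistic(values).

    values    : list of observations (here a hypothetical per-variant quantity)
    statistic : function mapping a list of observations to a number
    d         : number of observations deleted in each replicate
    n_subsets : if None, enumerate every size-d deletion set; otherwise
                draw this many random deletion sets (assumed approximation)
    """
    n = len(values)
    if not 1 <= d < n:
        raise ValueError("d must satisfy 1 <= d < len(values)")

    if n_subsets is None:
        deleted_sets = itertools.combinations(range(n), d)
    else:
        rng = random.Random(seed)
        deleted_sets = (rng.sample(range(n), d) for _ in range(n_subsets))

    estimates = []
    for deleted in deleted_sets:
        drop = set(deleted)
        kept = [v for i, v in enumerate(values) if i not in drop]
        estimates.append(statistic(kept))

    mean_estimate = sum(estimates) / len(estimates)
    sum_sq = sum((e - mean_estimate) ** 2 for e in estimates)
    variance = (n - d) / (d * len(estimates)) * sum_sq
    return math.sqrt(variance)


# Example with a stand-in statistic (the sample mean).
se = delete_d_jackknife_se([1, 2, 3, 4, 5], statistics.mean, d=1)
```

With d = 1 and the mean as the statistic, this reduces to the familiar standard error of the mean, which gives a quick sanity check before substituting a more complex estimator.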

The advent of genome-wide association studies for bacteria
Peter E Chen , B Jesse Shapiro
doi: http://dx.doi.org/10.1101/016873

Significant advances in sequencing technologies and genome-wide association studies (GWAS) have revealed substantial insight into the genetic architecture of human phenotypes. In recent years, the application of this approach in bacteria has begun to reveal the genetic basis of bacterial host preference, antibiotic resistance, and virulence. Here, we consider relevant differences between bacterial and human genome dynamics, apply GWAS to a global sample of Mycobacterium tuberculosis genomes to highlight the impacts of linkage disequilibrium, population stratification, and natural selection, and finally compare the traditional GWAS against phyC, a contrasting method of mapping genotype to phenotype based upon evolutionary convergence. We discuss strengths and weaknesses of both methods, and make suggestions for factors to be considered in future bacterial GWAS.

Cline coupling and uncoupling in a stickleback hybrid zone
Tim Vines , Anne Dalziel , Arianne Albert , Thor Veen , Patricia Schulte , Dolph Schluter
doi: http://dx.doi.org/10.1101/016832

Strong ecological selection on a genetic locus can maintain allele frequency differences between populations in different environments, even in the face of hybridization. When alleles at divergent loci come into tight linkage disequilibria, selection acts on them as a unit and can significantly reduce gene flow. For populations interbreeding across a hybrid zone, linkage disequilibria between loci can force clines to share the same slopes and centers. However, strong ecological selection can push clines away from the others, reducing linkage disequilibria and weakening the barrier to gene flow. We looked for this ‘cline uncoupling’ effect in a hybrid zone between stream resident and anadromous sticklebacks at two genes known to be under divergent natural selection (Eda and ATP1a1) and five morphological traits that repeatedly evolve in freshwater stickleback. We used 10 anonymous SNPs to characterize the shape of the zone. We found that the clines at Eda, ATP1a1, and four morphological traits were concordant and coincident, suggesting that direct selection on each is outweighed by the indirect selection generated by linkage disequilibria. Interestingly, the cline for pectoral fin length was much steeper and displaced 200 m downstream, and two anonymous SNPs also had steep clines.

Exploring functional variation affecting ceRNA regulation in humans
Mulin Jun Li , Jiexing Wu , Peng Jiang , Wei Li , Yun Zhu , Daniel Fernandez , Russell J. H. Ryan , Yiwen Chen , Junwen Wang , Jun S. Liu , X. Shirley Liu
doi: http://dx.doi.org/10.1101/016865

MicroRNA (miRNA) sponges have been shown to function as competing endogenous RNAs (ceRNAs) to regulate the expression of other miRNA targets in the network by sequestering available miRNAs. As the first systematic investigation of the genome-wide genetic effect on ceRNA regulation, we applied multivariate response regression and identified widespread genetic variations that are associated with ceRNA competition using 462 Geuvadis RNA-seq datasets from multiple human populations. We showed that SNPs in gene 3’UTRs at the miRNA seed binding regions can simultaneously regulate gene expression changes in both cis and trans by the ceRNA mechanism. We term these loci endogenous miRNA sponge expression quantitative trait loci, or “emsQTLs”, and found that a large number of them were unexplored in conventional eQTL mapping. We identified many emsQTLs that are undergoing recent positive selection in different human populations. Using GWAS results, we found that emsQTLs are significantly enriched in trait- and disease-associated loci. Functional prediction and prioritization extend our understanding of the causality of emsQTL alleles in disease pathways. We illustrated that an emsQTL can synchronously regulate the expression of a tumor suppressor and an oncogene through ceRNA competition in angiogenesis. Together these results provide a distinct catalog and characterization of functional noncoding regulatory variants that control ceRNA crosstalk.

Transcriptome Differences between Alternative Sex Determining Genotypes in the House Fly, Musca domestica
Richard P Meisel , Jeffrey G Scott , Andrew G Clark
doi: http://dx.doi.org/10.1101/016774

Sex determination evolves rapidly, often because of turnover of the genes at the top of the pathway. The house fly, Musca domestica, has a multifactorial sex determination system, allowing us to identify the selective forces responsible for the evolutionary turnover of sex determination in action. There is a male determining factor, M, on the Y chromosome (YM), which is probably the ancestral state. An M factor on the third chromosome (IIIM) has reached high frequencies in multiple populations across the world, but the evolutionary forces responsible for the invasion of IIIM are not resolved. To test if the IIIM chromosome invaded because of sex-specific selection pressures, we used mRNA sequencing to determine if isogenic males that differ only in the presence of the YM or IIIM chromosome have different gene expression profiles. We find that more genes are differentially expressed between YM and IIIM males in testis than head, and that genes with male-biased expression are most likely to be differentially expressed between YM and IIIM males. This suggests that male phenotypes, especially those related to male fertility, are more likely to be affected by the male-determining chromosome, supporting the hypothesis that sex-specific selection acts on alleles linked to the male-determining locus driving evolutionary turnover in the sex determination pathway. We additionally find that IIIM males have a “masculinized” gene expression profile, suggesting that the IIIM chromosome has accumulated an excess of male-beneficial alleles because of its male-limited transmission.

Breaking through evolutionary constraint by environmental fluctuations
Marjon GJ de Vos , Alexandre Dawid , Vanda Sunderlikova , Sander J Tans
doi: http://dx.doi.org/10.1101/016790

Epistatic interactions can frustrate and shape evolutionary change. Indeed, phenotypes may fail to evolve because essential mutations can only be selected positively if fixed simultaneously. How environmental variability affects such constraints is poorly understood. Here we studied genetic constraints in fixed and fluctuating environments, using the Escherichia coli lac operon as a model system for genotype-environment interactions. The data indicated an apparent paradox: in different fixed environments, mutational trajectories became trapped at sub-optima where no further improvements were possible, while repeated switching between these same environments allowed unconstrained adaptation by continuous improvements. Pervasive cross-environmental trade-offs transformed peaks into valleys upon environmental change, thus enabling escape from entrapment. This study shows that environmental variability can lift genetic constraint, and that trade-offs not only impede but can also facilitate adaptive evolution.

Calculating the Human Mutation Rate by Using a NUMT from the Early Oligocene
Ian Logan
doi: http://dx.doi.org/10.1101/016428

As the number of whole genomes available for study increases, so also does the opportunity to find unsuspected features hidden within our genetic code. One such feature allows for an estimate of the Human Mutation Rate in human chromosomes to be made. A NUMT is a small fragment of the mitochondrial DNA that enters the nucleus of a cell, gets captured by a chromosome and is thereafter passed on from generation to generation. Over the millions of years of evolution, this unexpected phenomenon has happened many times. But it is usually very difficult to say just when a NUMT might have been created. However, this paper presents evidence to show that for one particular NUMT the date of formation was around 29 million years ago, which places the event in the Early Oligocene, when our ancestors were small monkey-like creatures. So now all of us carry this NUMT in each of our cells, as do Old World Monkeys, the Great Apes and our nearest relations, the Chimpanzees. The estimate of the Human Mutation Rate obtained by the method outlined here gives a value which is higher than has been generally found; but this new value perhaps only applies to non-coding regions of the Human genome where there is little, if any, selection pressure against new mutations.

LINKS: Scaffolding genome assemblies with kilobase-long nanopore reads
Rene L Warren , Benjamin P Vandervalk , Steven JM Jones , Inanc Birol
doi: http://dx.doi.org/10.1101/016519

Owing to the complexity of the assembly problem, we do not yet have complete genome sequences. The difficulty in assembling reads into finished genomes is exacerbated by sequence repeats and the inability of short reads to capture sufficient genomic information to resolve those problematic regions. Established and emerging long read technologies show great promise in this regard, but their current associated higher error rates typically require computational base correction and/or additional bioinformatics pre-processing before they can be of value. We present LINKS, the Long Interval Nucleotide K-mer Scaffolder algorithm, a solution that makes use of the information in error-rich long reads, without the need for read alignment or base correction. We show how the contiguity of an ABySS E. coli K-12 genome assembly could be increased over five-fold by the use of beta-released Oxford Nanopore Ltd. (ONT) long reads and how LINKS leverages long-range information in S. cerevisiae W303 ONT reads to yield an assembly with less than half the errors of competing applications. We also present the re-scaffolding of the colossal white spruce assembly draft (PG29, 20 Gbp) and show how LINKS scales to larger genomes. We expect LINKS to have broad utility in harnessing the potential of long reads in connecting high-quality sequences of small and large genome assembly drafts.
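
The alignment-free idea of linking contigs through pairs of k-mers taken a fixed interval apart in a long read can be sketched in Python as follows. This is a simplified illustration under assumptions and not the LINKS implementation: the function names, the parameters `k`, `distance`, and `step`, and the omission of reverse complements, orientation, and gap estimation are all choices made for the example.

```python
from collections import Counter


def kmer_pairs(read, k, distance, step=1):
    """Yield (first, second) k-mers whose start positions are `distance` apart."""
    for i in range(0, len(read) - distance - k + 1, step):
        yield read[i:i + k], read[i + distance:i + distance + k]


def index_contigs(contigs, k):
    """Map each k-mer found in exactly one contig to that contig's name."""
    owner = {}
    ambiguous = set()
    for name, seq in contigs.items():
        for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
            if kmer in owner and owner[kmer] != name:
                ambiguous.add(kmer)  # shared between contigs: not informative
            owner[kmer] = name
    for kmer in ambiguous:
        owner.pop(kmer, None)
    return owner


def count_links(long_reads, contigs, k, distance, step=1):
    """Count k-mer pairs that connect two different contigs."""
    owner = index_contigs(contigs, k)
    links = Counter()
    for read in long_reads:
        for first, second in kmer_pairs(read, k, distance, step):
            a, b = owner.get(first), owner.get(second)
            if a is not None and b is not None and a != b:
                links[(a, b)] += 1
    return links


# Toy example: one long read spans contig c1, a short gap, then contig c2.
contigs = {"c1": "ACGTTGCA", "c2": "GGATCCTA"}
long_reads = ["ACGTTGCA" + "TTTT" + "GGATCCTA"]
links = count_links(long_reads, contigs, k=4, distance=12)
```

In a real scaffolder the pair counts would then be thresholded and combined with orientation and distance information to order contigs; the sketch stops at counting supporting pairs.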