Utilizing de Bruijn graph of metagenome assembly for metatranscriptome analysis

Yuzhen Ye, Haixu Tang
(Submitted on 6 Apr 2015)

Metagenomics research has accelerated the study of microbial organisms, providing insights into the composition and potential functionality of various microbial communities. Metatranscriptomics (studies of the transcripts from a mixture of microbial species) and other meta-omics approaches hold even greater promise for providing additional insights into functional and regulatory characteristics of the microbial communities. Current metatranscriptomics projects are often carried out without matched metagenomic datasets (of the same microbial communities). For the projects that produce both metatranscriptomic and metagenomic datasets, their analyses are often not integrated. Metagenome assemblies are far from perfect, which partially explains why they are not used for the analysis of metatranscriptomic datasets. Here we report a read mapping algorithm that maps short reads onto the de Bruijn graph of an assembly. A hash table of junction k-mers (k-mers spanning branching structures in the de Bruijn graph) is used to facilitate fast mapping of reads to the graph. We developed an application of this mapping algorithm: a reference-based approach to metatranscriptome assembly using graphs of metagenome assembly as the reference. Our results show that this new approach (called TAG) helps to assemble substantially more transcripts that otherwise would have been missed or truncated because of the fragmented nature of the reference metagenome. TAG was implemented in C++ and has been tested extensively on the Linux platform. It is available for download as open source at this http URL
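
The junction k-mer idea can be illustrated with a minimal Python sketch. This is not TAG's implementation (TAG is written in C++): the function names, the definition of a junction as a k-mer with more than one successor or predecessor among the assembly's k-mers, and the omission of reverse complements are all assumptions made for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield (offset, k-mer) for every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_junction_index(contigs, k):
    """Hash table: junction k-mer -> list of (contig id, offset).

    Assumption: a k-mer counts as a junction when more than one of its
    four possible successors (or predecessors) occurs in the assembly.
    Reverse complements are ignored to keep the sketch short.
    """
    positions = defaultdict(list)
    for cid, seq in enumerate(contigs):
        for i, km in kmers(seq, k):
            positions[km].append((cid, i))
    present = set(positions)
    index = {}
    for km in present:
        succ = sum((km[1:] + b) in present for b in "ACGT")
        pred = sum((b + km[:-1]) in present for b in "ACGT")
        if succ > 1 or pred > 1:
            index[km] = positions[km]
    return index


def seed_hits(read, index, k):
    """Return (read offset, graph positions) for each junction k-mer in the read."""
    return [(i, index[km]) for i, km in kmers(read, k) if km in index]


contigs = ["ACGTA", "CGTT"]          # two contigs that branch after CGT
index = build_junction_index(contigs, 3)
hits = seed_hits("ACGTT", index, 3)  # the read follows the second branch
```

Only the branching k-mers are stored, so the table stays small relative to the full k-mer set while still anchoring reads at the points where a path through the graph has to be chosen.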

Whole Genome Regulatory Variant Evaluation for Transcription Factor Binding

Haoyang Zeng , Tatsunori Hashimoto , Daniel D. Kang , David K. Gifford
doi: http://dx.doi.org/10.1101/017392

Contemporary approaches to predicting single nucleotide polymorphisms (SNPs) that alter transcription factor binding rely upon the sequence affinity of a transcription factor as represented by its canonical motif. WAVE (Whole-genome regulAtory Variants Evaluation) is a novel method for predicting more general regulatory variants that affect transcription factor binding, including those that fall outside of the canonical motif. WAVE learns a k-mer based generative model of transcription factor binding from ChIP-seq data and scores variants using its generative binding model. The k-mers learned by WAVE capture more sequence features in transcription factor binding than a motif-based approach alone, including both a transcription factor’s canonical motif and associated co-factor motifs. WAVE significantly outperforms motif-based methods in predicting SNPs associated with allele-specific binding.
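
To make k-mer based variant scoring concrete, here is a minimal Python sketch. It is not WAVE's generative model: it substitutes a simple additive score over hypothetical per-k-mer weights, and the function name, the weights, and the 0-based coordinates are assumptions for illustration only.

```python
def variant_score(seq, pos, alt, weights, k):
    """Score change caused by substituting `alt` at 0-based position `pos`.

    Assumption: binding is approximated by summing the weights of all
    k-mers that overlap the variant position; unknown k-mers score 0.
    """
    def window_score(s):
        start = max(0, pos - k + 1)
        end = min(len(s) - k, pos)
        return sum(weights.get(s[i:i + k], 0.0) for i in range(start, end + 1))

    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    return window_score(alt_seq) - window_score(seq)


# Hypothetical weights, as if learned from ChIP-seq data.
weights = {"ACG": 1.0, "AGG": 0.5}
delta = variant_score("AACGTT", 2, "G", weights, 3)  # C -> G at position 2
```

Because every k-mer overlapping the SNP contributes, a variant can change the score even when it lies outside the canonical motif, for example by disrupting a co-factor k-mer.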

XWAS: a software toolset for genetic data analysis and association studies of the X chromosome

Feng Gao , Diana Chang , Arjun Biddanda , Li Ma , Yingjie Guo , Zilu Zhou , Alon Keinan
doi: http://dx.doi.org/10.1101/009795

XWAS is a new software toolset for the analysis of the X chromosome in association studies and similar studies. The X chromosome plays an important role in human disease, especially in diseases with sexually dimorphic characteristics. Its analysis requires special attention because of its unique inheritance pattern, which leads to analytical complications. As a result, the majority of genome-wide association studies (GWAS) either do not consider X or mishandle it with GWAS toolsets that have been designed for non-sex chromosomes. Hence, XWAS fills the need for tools that are specially designed for analysis of X. Following extensive, stringent, and X-specific quality control, XWAS offers an array of statistical tests of association, including: (1) the standard test between a SNP (single nucleotide polymorphism) and disease risk, including after first stratifying individuals by sex, (2) a test for a differential effect of a SNP on disease between males and females, (3) motivated by X-inactivation, a test for higher variance of a trait in heterozygous females as compared to homozygous females, and (4) for all tests, a version that allows for combining evidence across all SNPs in a whole gene. We applied the toolset's analysis pipeline to 16 GWAS datasets of immune-related disorders and to 7 risk factors of coronary artery disease, and discovered several new X-linked genetic associations. XWAS will provide the tools and incentive for others to incorporate the X chromosome into GWAS, hence enabling discoveries of novel loci implicated in many diseases and in their sexual dimorphism.

Network analysis of genome-wide selective constraint reveals a gene network active in early fetal brain intolerant of mutation

Jinmyung Choi , Parisa Shooshtari , Kaitlin E Samocha , Mark J Daly , Chris Cotsapas
doi: http://dx.doi.org/10.1101/017277

Using robust, integrated analysis of multiple genomic datasets, we show that genes depleted for non-synonymous de novo mutations form a subnetwork of 72 members under strong selective constraint. We further show this subnetwork is preferentially expressed in the early development of the human hippocampus and is enriched for genes mutated in neurological, but not other, Mendelian disorders. We thus conclude that carefully orchestrated developmental processes are under strong constraint in early brain development, and perturbations caused by mutation have adverse outcomes subject to strong purifying selection. Our findings demonstrate that selective forces can act on groups of genes involved in the same process, supporting the notion that adaptation can act coordinately on multiple genes. Our approach provides a statistically robust, interpretable way to identify the tissues and developmental times where groups of disease genes are active. Our findings highlight the importance of considering the interactions between genes when analyzing genome-wide sequence data.

RiboDiff: Detecting Changes of Translation Efficiency from Ribosome Footprints

Yi Zhong , Theofanis Karaletsos , Philipp Drewe , Vipin Thankam T Sreedharan , Kamini Singh , Hans-Guido Wendel , Gunnar Rätsch
doi: http://dx.doi.org/10.1101/017111

Motivation: Deep sequencing based ribosome footprint profiling can provide novel insights into the regulatory mechanisms of protein translation. However, the observed ribosome profile is fundamentally confounded by transcriptional activity. In order to decipher principles of translation regulation, tools that can reliably detect changes in translation efficiency in case-control studies are needed. Results: We present a statistical framework and analysis tool, RiboDiff, to detect genes with changes in translation efficiency across experimental treatments. RiboDiff uses generalized linear models to estimate the over-dispersion of RNA-Seq and ribosome profiling measurements separately, and performs a statistical test for differential translation efficiency using both mRNA abundance and ribosome occupancy. Availability: Source code and documentation are available at http://github.com/ratschlab/ribodiff. Supplementary Material can be found at http://bioweb.me/ribo.
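
The quantity being tested can be shown with a small Python sketch. This is not RiboDiff's generalized linear model and it performs no statistical test. It only computes the point estimate of the change in translation efficiency (ribosome footprint count divided by mRNA count), assuming the counts are already normalized for library size. The function name and the pseudocount are assumptions for illustration.

```python
import math


def log2_te_change(ribo_ctrl, rna_ctrl, ribo_trt, rna_trt, pseudo=0.5):
    """log2 ratio of translation efficiency (treatment vs. control).

    Assumption: inputs are library-size-normalized counts for one gene.
    The pseudocount avoids division by zero for unexpressed genes.
    """
    te_ctrl = (ribo_ctrl + pseudo) / (rna_ctrl + pseudo)
    te_trt = (ribo_trt + pseudo) / (rna_trt + pseudo)
    return math.log2(te_trt / te_ctrl)


# Footprints double while mRNA is unchanged: translation efficiency doubles.
translational = log2_te_change(100, 100, 200, 100, pseudo=0)
# Footprints and mRNA both double: the change is transcriptional only.
transcriptional = log2_te_change(100, 100, 200, 200, pseudo=0)
```

The second case shows why the ribosome profile alone is confounded: footprints double in both examples, but only the first reflects a change in translation.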

MMR: A Tool for Read Multi-Mapper Resolution

Andre Kahles , Jonas Behr , Gunnar Rätsch
doi: http://dx.doi.org/10.1101/017103

Motivation: Mapping high-throughput sequencing data to a reference genome is an essential step for most analysis pipelines aiming at the computational analysis of genome and transcriptome sequencing data. Breaking ties between equally well mapping locations not only poses a severe problem during the alignment phase but also has a significant impact on the results of downstream analyses. We present the multimapper resolution (MMR) tool that infers optimal mapping locations from the coverage density of other mapped reads. Results: Filtering alignments with MMR can significantly improve the performance of downstream analyses like transcript quantitation and differential testing. We illustrate that the accuracy (Spearman correlation) of transcript quantification increases by 17% when using reads of length 51. In addition, MMR decreases the alignment file sizes by more than 50%, which reduces the running time of the quantification tool. Our efficient implementation of the MMR algorithm is easily applicable as a post-processing step to existing alignment files in BAM format. Its complexity scales linearly with the number of alignments and requires no further inputs. Supplementary Material: Source code and documentation are available for download at http://github.com/ratschlab/mmr. Supplementary text and figures, comprehensive testing results and further information can be found at http://bioweb.me/mmr.
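
The idea of using the coverage of other reads to break ties can be sketched in Python. This is a deliberately simplified greedy heuristic, not the MMR algorithm: it assigns each multi-mapped read to the candidate location with the highest existing coverage, on a single chromosome with integer start positions. All names are assumptions for illustration.

```python
from collections import Counter


def resolve_multimappers(unique_starts, multi_alns, read_len):
    """Pick one location per multi-mapped read.

    unique_starts: start positions of uniquely mapped reads.
    multi_alns: dict mapping read name -> list of candidate start positions.
    Assumption: the best candidate is the one whose span already has the
    highest summed coverage. Ties go to the first candidate listed.
    """
    coverage = Counter()
    for start in unique_starts:
        for p in range(start, start + read_len):
            coverage[p] += 1

    chosen = {}
    for read, candidates in multi_alns.items():
        best = max(
            candidates,
            key=lambda s: sum(coverage[p] for p in range(s, s + read_len)),
        )
        chosen[read] = best
        # The resolved read now contributes to coverage for later reads.
        for p in range(best, best + read_len):
            coverage[p] += 1
    return chosen


chosen = resolve_multimappers([100, 102, 105], {"r1": [101, 500]}, read_len=10)
```

Each alignment is visited a constant number of times in this sketch, which matches the linear scaling in the number of alignments that the abstract reports for MMR.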

Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees

Brad Solomon , Carleton Kingsford
doi: http://dx.doi.org/10.1101/017087

Enormous databases of short-read RNA-seq sequencing experiments such as the NIH Sequence Read Archive (SRA) are now available. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. A natural question is which of these experiments contain sequences that indicate the expression of a particular sequence such as a gene isoform, lncRNA, or uORF. At present, this is a computationally demanding question at the scale of these databases. We introduce an indexing scheme, the Sequence Bloom Tree (SBT), to support sequence-based querying of terabase-scale collections of thousands of short-read sequencing experiments. We apply SBT to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments contained in the NIH SRA for breast, blood, and brain tissues, comprising 5 terabytes of sequence. SBTs of this size can be queried for a 1000 nt sequence in 19 minutes using less than 300 MB of RAM, over 100 times faster than standard usage of SRA-BLAST and 119 times faster than STAR. SBTs allow for fast identification of experiments with expressed novel isoforms, even if these isoforms were unknown at the time the SBT was built. We also provide some theoretical guidance about appropriate parameter selection in SBT and propose a sampling-based scheme for potentially scaling SBT to even larger collections of files. While SBT can handle any set of reads, we demonstrate the effectiveness of SBT by searching a large collection of blood, brain, and breast RNA-seq files for all 214,293 known human transcripts to identify tissue-specific transcripts. The implementation used in the experiments below is in C++ and is available as open source at http://www.cs.cmu.edu/~ckingsf/software/bloomtree.
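
A toy Python sketch can show how a tree of Bloom filters supports this kind of query. It is not the authors' C++ implementation: the class and function names, the SHA-256 based hashing, the filter size, and the threshold `theta` (the fraction of query k-mers that must be present) are assumptions for illustration.

```python
import hashlib


class Bloom:
    """Small Bloom filter backed by a Python int used as a bit set."""

    def __init__(self, m=4096, h=3, bits=0):
        self.m, self.h, self.bits = m, h, bits

    def _indexes(self, item):
        for seed in range(self.h):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for i in self._indexes(item):
            self.bits |= 1 << i

    def __contains__(self, item):
        return all((self.bits >> i) & 1 for i in self._indexes(item))

    def union(self, other):
        return Bloom(self.m, self.h, self.bits | other.bits)


class Node:
    def __init__(self, bloom, name=None, children=()):
        self.bloom, self.name, self.children = bloom, name, children


def kmer_set(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def leaf(name, reads, k):
    """Leaf node: one experiment, holding the k-mers of its reads."""
    bloom = Bloom()
    for read in reads:
        for km in kmer_set(read, k):
            bloom.add(km)
    return Node(bloom, name=name)


def merge(left, right):
    """Internal node: bitwise OR (set union) of the children's filters."""
    return Node(left.bloom.union(right.bloom), children=(left, right))


def query(node, seq, k, theta=0.8):
    """Names of experiments holding at least theta of the query's k-mers."""
    q = kmer_set(seq, k)
    if not q:
        return []
    hits = sum(km in node.bloom for km in q)
    if hits < theta * len(q):
        return []  # prune: no leaf below can reach the threshold
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in query(child, seq, k, theta)]


root = merge(
    leaf("expA", ["ACGTACGTGACCTGA"], 5),
    leaf("expB", ["TTTTGGGGCCCCAAAA"], 5),
)
found = query(root, "ACGTACGTGACC", 5)
```

Bloom filters never produce false negatives, so an experiment that truly contains enough of the query's k-mers is always reported. False positives are possible and depend on the filter size and number of hash functions.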

LINKS: Scaffolding genome assemblies with kilobase-long nanopore reads

Rene L Warren , Benjamin P Vandervalk , Steven JM Jones , Inanc Birol
doi: http://dx.doi.org/10.1101/016519

Owing to the complexity of the assembly problem, we do not yet have complete genome sequences. The difficulty in assembling reads into finished genomes is exacerbated by sequence repeats and the inability of short reads to capture sufficient genomic information to resolve those problematic regions. Established and emerging long read technologies show great promise in this regard, but their current associated higher error rates typically require computational base correction and/or additional bioinformatics pre-processing before they can be of value. We present LINKS, the Long Interval Nucleotide K-mer Scaffolder algorithm, a solution that makes use of the information in error-rich long reads, without the need for read alignment or base correction. We show how the contiguity of an ABySS E. coli K-12 genome assembly could be increased over five-fold by the use of beta-released Oxford Nanopore Ltd. (ONT) long reads and how LINKS leverages long-range information in S. cerevisiae W303 ONT reads to yield an assembly with less than half the errors of competing applications. Re-scaffolding of the colossal white spruce assembly draft (PG29, 20 Gbp) and the scaling of LINKS to larger genomes are also presented. We expect LINKS to have broad utility in harnessing the potential of long reads in connecting high-quality sequences of small and large genome assembly drafts.
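
The alignment-free use of long reads can be sketched in Python as paired k-mers taken a fixed distance apart. This is not the LINKS implementation: here `d` is assumed to be the offset between the start positions of the two k-mers, sequence orientation and read errors are ignored, and all function names are illustrative.

```python
from collections import Counter


def unique_kmer_index(contigs, k):
    """Map each k-mer that occurs exactly once in the assembly to its contig id."""
    counts = Counter(
        seq[i:i + k] for seq in contigs for i in range(len(seq) - k + 1)
    )
    return {
        seq[i:i + k]: cid
        for cid, seq in enumerate(contigs)
        for i in range(len(seq) - k + 1)
        if counts[seq[i:i + k]] == 1
    }


def kmer_pairs(read, k, d, step=1):
    """Yield pairs of k-mers whose start positions are d bases apart."""
    for i in range(0, len(read) - d - k + 1, step):
        yield read[i:i + k], read[i + d:i + d + k]


def count_links(long_reads, contigs, k, d, step=1):
    """Count k-mer pairs that connect two different contigs."""
    index = unique_kmer_index(contigs, k)
    links = Counter()
    for read in long_reads:
        for first, second in kmer_pairs(read, k, d, step):
            a, b = index.get(first), index.get(second)
            if a is not None and b is not None and a != b:
                links[(a, b)] += 1
    return links


contigs = ["ACGTACCA", "GGATTCTG"]
# A long read spanning both contigs with a 4-base gap between them.
long_read = "ACGTACCA" + "TTTT" + "GGATTCTG"
links = count_links([long_read], contigs, k=4, d=12)
```

A pair only needs exact k-mer matches at its two ends, so a read with errors elsewhere can still support a link. Counting many pairs per contig pair gives the evidence a scaffolder would threshold on.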

ISMapper: Identifying insertion sequences in bacterial genomes from short read sequence data

Jane Hawkey , Mohammad Hamidian , Ryan R Wick , David J Edwards , Helen Billman-Jacobe , Ruth M Hall , Kathryn E Holt
doi: http://dx.doi.org/10.1101/016345

Background: Insertion sequences (IS) are small transposable elements, commonly found in bacterial genomes. Identifying the location of IS in bacterial genomes can be useful for a variety of purposes including epidemiological tracking and predicting antibiotic resistance. However, IS are commonly present in multiple copies in a single genome, which complicates genome assembly and the identification of IS insertion sites. Here we present ISMapper, a mapping-based tool for identification of the site and orientation of IS insertions in bacterial genomes, direct from paired-end short read data. Results: ISMapper was validated using three types of short read data: (i) simulated reads from a variety of species, (ii) Illumina reads from 5 isolates for which finished genome sequences were available for comparison, and (iii) Illumina reads from 7 Acinetobacter baumannii isolates for which predicted IS locations were tested using PCR. A total of 20 genomes, including 13 species and 32 distinct IS, were used for validation. ISMapper correctly identified 96% of known IS insertions in the analysis of simulated reads, and 98% in real Illumina reads. Subsampling of real Illumina reads to lower depths indicated ISMapper was reliable for average genome-wide read depths >20x. All ISAba1 insertions identified by ISMapper in the A. baumannii genomes were confirmed by PCR. In each A. baumannii genome, ISMapper successfully identified an IS insertion upstream of the ampC beta-lactamase that could explain phenotypic resistance to third-generation cephalosporins. The utility of ISMapper was further demonstrated by profiling genome-wide IS6110 insertions in 138 publicly available Mycobacterium tuberculosis genomes, revealing lineage-specific insertions and multiple insertion hotspots. Conclusions: ISMapper provides a rapid and robust method for identifying IS insertion sites direct from short read data, with a high degree of accuracy demonstrated across a wide range of bacteria.

Selection and explosive growth may hamper the performance of rare variant association tests

Lawrence H. Uricchio , John S. Witte , Ryan D. Hernandez
doi: http://dx.doi.org/10.1101/015917

Much recent debate has focused on the role of rare variants in complex phenotypes. However, it is well known that rare alleles can only contribute a substantial proportion of the phenotypic variance when they have much larger effect sizes than common variants, which is most easily explained by natural selection constraining trait-altering alleles to low frequency. It is also plausible that demographic events will influence the genetic architecture of complex traits. Unfortunately, most rare variant association tests do not explicitly model natural selection or non-equilibrium demography. Here, we develop a novel evolutionary model of complex traits. We perform numerical calculations and simulate phenotypes under this model using inferred human demographic and selection parameters. We show that rare variants only contribute substantially to complex traits under very strong assumptions about the relationship between effect size and selection strength. We then assess the performance of state-of-the-art rare variant tests using our simulations across a broad range of model parameters. Counterintuitively, we find that statistical power is lowest when rare variants make the greatest contribution to the additive variance, and that power is substantially lower under our model than previously studied models. While many empirical studies have attempted to identify causal loci using rare variant association methods, few have reported novel associations. Some authors have interpreted this to mean that rare variants contribute little to heritability, but our results show that an alternative explanation is that rare variant tests have less power than previously estimated.