Surveying the relative impact of mRNA features on local ribosome profiling read density in 28 datasets.

Patrick O’Connor, Dmitry Andreev, Pavel Baranov
doi: http://dx.doi.org/10.1101/018762

Ribosome profiling is a promising technology for exploring gene expression. However, ribosome profiling data are characterized by a substantial number of outliers due to technical and biological factors. Here we introduce a simple computational method, Ribo-seq Unit Step Transformation (RUST), for the characterization of ribosome profiling data. We show that RUST is robust and outperforms conventional normalization techniques in the presence of sporadic noise. We used RUST to analyse 28 publicly available ribosome profiling datasets obtained from mammalian cells and tissues and from yeast. This revealed substantial protocol-dependent variation in the composition of footprint libraries. We selected a high-quality dataset to explore the mRNA features that affect local decoding rates and found that the amino acid identity encoded by the codon in the A-site is the major contributing factor, followed by the identity of the codon itself and then the amino acid in the P-site. We also found that bulky amino acids slow down ribosome movement when they occur within the peptide tunnel, and that proline residues may decrease or increase ribosome velocities depending on the context in which they occur. Moreover, we show that a few parameters obtained with RUST are sufficient for predicting experimental densities with high accuracy. Due to its robustness and low computational demand, RUST could be used for quick routine characterization of ribosome profiling datasets to assess their quality, as well as for the analysis of the relative impact of mRNA sequence features on local decoding rates.
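The abstract does not spell out the transformation itself. As a rough illustration only, the sketch below assumes a unit step applied per codon position: a position scores 1 if its footprint count exceeds the transcript's mean count and 0 otherwise. The function names and the toy profile are hypothetical, and the comparison with mean normalisation is only meant to show why a binarised profile is less sensitive to a single outlier.

```python
from typing import List


def unit_step_transform(counts: List[float]) -> List[int]:
    """Binarise per-codon footprint counts against the transcript mean.

    Assumption (not stated in the abstract): the step threshold is the
    mean count over the transcript.
    """
    if not counts:
        return []
    mean = sum(counts) / len(counts)
    return [1 if c > mean else 0 for c in counts]


def mean_normalise(counts: List[float]) -> List[float]:
    """Conventional normalisation: divide each count by the transcript mean."""
    if not counts:
        return []
    mean = sum(counts) / len(counts)
    if mean == 0:
        return [0.0] * len(counts)
    return [c / mean for c in counts]


# Hypothetical profile with one sporadic outlier at position 3.
profile = [2, 3, 1, 500, 2, 4, 3, 1]

stepped = unit_step_transform(profile)
normalised = mean_normalise(profile)

# The outlier contributes at most 1 after the step transform, whereas its
# mean-normalised value stays far larger than every other position.
print(stepped)
print(max(normalised))
```

Under this assumption, averaging the binarised values over many transcripts bounds the influence of any one position, which is one plausible reading of the robustness claim above.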

Distinct nucleosome distribution patterns in two structurally and functionally differentiated nuclei of a unicellular eukaryote

Jie Xiong, Shan Gao, Wen Dui, Wentao Yang, Xiao Chen, Sean D Taverna, Ronald E. Pearlman, Wendy Ashlock, Wei Miao, Yifan Liu
doi: http://dx.doi.org/10.1101/018754

The ciliate protozoan Tetrahymena thermophila contains two types of structurally and functionally differentiated nuclei: the transcriptionally active somatic macronucleus (MAC) and the transcriptionally silent germ-line micronucleus (MIC). Here we demonstrate that MAC features well-positioned nucleosomes downstream of transcription start sites (TSS), likely connected with promoter-proximal pausing of RNA polymerase II, as well as in exonic regions flanking both the 5′ and 3′ splice sites. In contrast, nucleosomes in MIC are more delocalized. Nucleosome occupancy in MAC and MIC is nonetheless highly correlated between the two nuclei and with predictions based upon DNA sequence features. Arrays of well-positioned nucleosomes are often correlated with GC content oscillations, suggesting significant contributions from cis-determinants. We propose that cis- and trans-determinants may coordinately accommodate some well-positioned nucleosomes with important functions, driven by a process in which positioned nucleosomes shape the mutational landscape of associated DNA sequences, while the DNA sequences in turn reinforce nucleosome positioning.

Introgression Browser: High throughput whole-genome SNP visualization

Saulo Alves Aflitos, Gabino Sanchez-Perez, Dick de Ridder, Paul Fransz, Eric Schranz, Hans de Jong, Sander Peters
(Submitted on 21 Apr 2015)

Breeding by introgressive hybridization is a pivotal strategy to broaden the genetic basis of crops. Usually, the desired traits are monitored in consecutive crossing generations by marker-assisted selection, but their analyses fail in chromosome regions where crossover recombinants are rare or not viable. Here, we present the Introgression Browser (IBROWSER), a novel bioinformatics tool aimed at visualizing introgressions at nucleotide or SNP accuracy. The software selects homozygous SNPs from Variant Call Format (VCF) information and filters out heterozygous SNPs, Multi-Nucleotide Polymorphisms (MNPs) and insertion-deletions (InDels). For data analysis IBROWSER makes use of sliding windows, but if needed it can generate any desired fragmentation pattern through General Feature Format (GFF) information. In an example of tomato (Solanum lycopersicum) accessions we visualize SNP patterns and elucidate both position and boundaries of the introgressions. We also show that our tool is capable of identifying alien DNA in a panel of the closely related S. pimpinellifolium by examining phylogenetic relationships of the introgressed segments in tomato. In a third example, we demonstrate the power of the IBROWSER in a panel of 600 Arabidopsis accessions, detecting the boundaries of a SNP-free region around a polymorphic 1.17 Mbp inverted segment on the short arm of chromosome 4. The architecture and functionality of IBROWSER make the software appropriate for a broad set of analyses including SNP mining, genome structure analysis, and pedigree analysis. Its functionality, together with the capability to process large data sets and efficient visualization of sequence variation, makes IBROWSER a valuable breeding tool.
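The filtering step described above can be sketched in a few lines. This is not IBROWSER's code: the function names, the single-sample layout, and the toy records are hypothetical. It assumes that "homozygous SNP" means a single-base REF and ALT with a homozygous non-reference genotype, and it uses non-overlapping fixed-size windows as a simplification of the sliding windows mentioned in the abstract.

```python
from collections import Counter

BASES = set("ACGT")


def is_homozygous_snp(ref, alt, genotype):
    """True for a single-base substitution called homozygous non-reference."""
    # Rejects MNPs ("AT" -> "GC"), indels ("G" -> "GA") and multi-allelic ALTs.
    if ref not in BASES or alt not in BASES:
        return False
    alleles = genotype.replace("|", "/").split("/")
    return (
        len(alleles) == 2
        and alleles[0] == alleles[1]
        and alleles[0] not in {"0", "."}
    )


def homozygous_snps(vcf_text):
    """Return (chrom, pos) for every retained record of a one-sample VCF."""
    kept = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        genotype = fields[9].split(":")[0]  # assumes GT is the first FORMAT key
        if is_homozygous_snp(ref, alt, genotype):
            kept.append((chrom, pos))
    return kept


def count_per_window(snps, window_size):
    """Count retained SNPs per (chromosome, window index)."""
    return Counter((chrom, (pos - 1) // window_size) for chrom, pos in snps)


# Hypothetical records: one kept SNP, a heterozygous SNP, an MNP, an indel,
# and a second kept SNP with a phased genotype.
records = [
    ("chr1", "100", ".", "A", "G", "50", "PASS", ".", "GT", "1/1"),
    ("chr1", "250", ".", "C", "T", "50", "PASS", ".", "GT", "0/1"),
    ("chr1", "300", ".", "AT", "GC", "50", "PASS", ".", "GT", "1/1"),
    ("chr1", "900", ".", "G", "GA", "50", "PASS", ".", "GT", "1/1"),
    ("chr1", "1500", ".", "T", "C", "50", "PASS", ".", "GT:DP", "1|1:12"),
]
VCF_TEXT = "##fileformat=VCFv4.2\n" + "\n".join("\t".join(r) for r in records)

snps = homozygous_snps(VCF_TEXT)
windows = count_per_window(snps, window_size=1000)
print(snps)
print(dict(windows))
```

A GFF-driven fragmentation, as the abstract mentions, would replace the fixed window index with a lookup of the feature interval containing each position.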

The design and analysis of binary variable traits in common garden genetic experiments of highly fecund species to assess heritability

Sarah W Davies, Samuel Scarpino, Thanapat Pongwarin, James Scott, Mikhail V Matz
doi: http://dx.doi.org/10.1101/018044

Many biologically important traits are binomially distributed, with their key phenotypes being presence or absence. Despite their prevalence, estimating the heritability of binomial traits presents both experimental and statistical challenges. Here we develop both an empirical and computational methodology for estimating the narrow-sense heritability of binary traits for highly fecund species. Our experimental approach controls for undesirable culturing effects, while minimizing culture numbers, increasing feasibility in the field. Our statistical approach accounts for known issues with model-selection by using a permutation test to calculate significance values and includes both fitting and power calculation methods. We illustrate our methodology by estimating the narrow-sense heritability for larval settlement, a key life-history trait, in the reef-building coral Orbicella faveolata. The experimental, statistical and computational methods, along with all of the data from this study, were deployed in the R package multiDimBio.
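The abstract mentions a permutation test for significance but gives no detail. The sketch below shows the general idea only, under stated assumptions: the test statistic (variance of the per-family settlement proportion), the function names, and the toy data are hypothetical, and the code is not taken from multiDimBio. It tests for a family effect on a binary trait, which is a prerequisite for heritability rather than a heritability estimate.

```python
import random
from statistics import pvariance


def family_variance(outcomes, families):
    """Variance among families of the proportion of individuals scoring 1."""
    by_family = {}
    for outcome, family in zip(outcomes, families):
        by_family.setdefault(family, []).append(outcome)
    return pvariance([sum(v) / len(v) for v in by_family.values()])


def permutation_p_value(outcomes, families, n_perm=999, seed=0):
    """One-sided permutation p-value for the family-variance statistic.

    Outcomes are shuffled across individuals, which breaks any link
    between family membership and the binary trait.
    """
    rng = random.Random(seed)
    observed = family_variance(outcomes, families)
    shuffled = list(outcomes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if family_variance(shuffled, families) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: four families of ten larvae; 1 = settled, 0 = not.
families = [name for name in "ABCD" for _ in range(10)]
outcomes = [1] * 10 + [0] * 10 + [1] * 10 + [0] * 10

p_value = permutation_p_value(outcomes, families)
print(p_value)
```

Because the null distribution is built from the data themselves, this kind of test avoids relying on asymptotic approximations that can be unreliable for binary outcomes with few families.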

A pooling-based approach to mapping genetic variants associated with DNA methylation

Irene Miriam Kaplow, Julia L MacIsaac, Sarah M Mah, Lisa M McEwen, Michael S Kobor, Hunter B Fraser
doi: http://dx.doi.org/10.1101/013649

DNA methylation is an epigenetic modification that plays a key role in gene regulation. Previous studies have investigated its genetic basis by mapping genetic variants that are associated with DNA methylation at specific sites, but these have been limited to microarrays that cover less than 2% of the genome and cannot account for allele-specific methylation (ASM). Other studies have performed whole-genome bisulfite sequencing on a few individuals, but these lack statistical power to identify variants associated with DNA methylation. We present a novel approach in which bisulfite-treated DNA from many individuals is sequenced together in a single pool, resulting in a truly genome-wide map of DNA methylation. Compared to methods that do not account for ASM, our approach increases statistical power to detect associations while sharply reducing cost, effort, and experimental variability. As a proof of concept, we generated deep sequencing data from a pool of 60 human cell lines; we evaluated almost twice as many CpGs as the largest microarray studies and identified over 2,000 genetic variants associated with DNA methylation. We found that these variants are highly enriched for associations with chromatin accessibility and CTCF binding but are less likely to be associated with traits indirectly linked to DNA, such as gene expression and disease phenotypes. In summary, our approach allows genome-wide mapping of genetic variants associated with DNA methylation in any tissue of any species, without the need for individual-level genotype or methylation data.

A Statistical Framework to Predict Functional Non-Coding Regions in the Human Genome Through Integrated Analysis of Annotation Data

Qiongshi Lu, Yiming Hu, Jiehuan Sun, Yuwei Cheng, Kei-Hoi Cheung, Hongyu Zhao
doi: http://dx.doi.org/10.1101/018093

Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome either through computational predictions, such as genomic conservation, or high-throughput experiments, such as the ENCODE project. These efforts have resulted in a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations, thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. Its ability to predict functional regions, together with its generalizable statistical framework, makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu

Natural selection defines the cellular complexity

Han Chen, Xionglei He
doi: http://dx.doi.org/10.1101/018069

Current biology is perplexed by the lack of a theoretical framework for understanding the organization principles of the molecular system within a cell. Here we first studied growth rate, one of the seemingly most complex cellular traits, using functional data of yeast single-gene deletion mutants. We observed nearly one thousand expression informative genes (EIGs) whose expression levels are linearly correlated to the trait within an unprecedentedly large functional space. A simple model considering six EIG-formed protein modules revealed a variety of novel mechanistic insights, and also explained ~50% of the variance of cell growth rates measured by the Bar-seq technique for over 400 yeast mutants (Pearson’s R = 0.69), a performance comparable to the microarray-based (R = 0.77) or colony-size-based (R = 0.66) experimental approach. We then applied the same strategy to 501 morphological traits of the yeast and achieved successes in most fitness-coupled traits, each with hundreds of trait-specific EIGs. Surprisingly, no EIG was found for most fitness-uncoupled traits, indicating that they are controlled by super-complex epistases that allow no simple expression-trait correlation. Thus, EIGs are recruited exclusively by natural selection, which builds a rather simple functional architecture for fitness-coupled traits, and the endless complexity of a cell lies primarily in its fitness-uncoupled features.

Interrogating conserved elements of diseases using Boolean combinations of orthologous phenotypes

John O Woods, Matthew Z Tien, Edward M Marcotte
doi: http://dx.doi.org/10.1101/017947

Conserved genetic programs often predate the homologous structures and phenotypes to which they give rise; eyes, for example, have evolved several dozen times, but their development seems to involve a common set of conserved genes. Recently, the concept of orthologous phenotypes (or phenologs) offered a quantitative way to describe this property. Phenologs are phenotypes or diseases from separate species that share an unexpectedly large set of their associated gene orthologs. It has been shown that the phenotype pairs which make up a phenolog are mutually predictive in terms of the genes involved. Recently, we demonstrated the ranking of gene–phenotype association predictions using multiple phenologs from an array of species. In this work, we demonstrate a computational method which provides a more targeted view of the conserved pathways which give rise to diseases. Our approach involves the generation of synthetic pseudo-phenotypes made up of Boolean combinations (union, intersection, and difference) of the gene sets for phenotypes from our database. We search for diseases that overlap significantly with these Boolean phenotypes, and find a number of highly predictive combinations. While set unions produce less specific predictions (as expected), intersection and difference-based combinations appear to offer insights into extremely specific aspects of target diseases. For example, breast cancer is predicted by zebrafish methylmercury response minus metal ion response, with predictions MT-COI, JUN, SOD2, GADD45B, and BAX all involved in the pro-apoptotic response to reactive oxygen species, thought to be a key player in cancer. We also demonstrate predictions from Arabidopsis Boolean phenotypes for increased brown adipose tissue in mouse (salt stress response’s intersection with sucrose stimulus response); and for human myopathy (red light response minus water deprivation response).
We demonstrate the ranking of predictions for human holoprosencephaly from the set intersections between each pair of a variety of closely-related zebrafish phenotypes. Our results suggest that Boolean phenolog combinations may provide a more informed insight into the conserved pathways underlying diseases than either regular phenologs or the naïve Bayes approach.
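The Boolean pseudo-phenotypes described above map directly onto set operations. The sketch below is illustrative only: the ortholog labels, set sizes, and the universe of 100 orthologs are hypothetical, and the hypergeometric tail is an assumed choice of overlap test, since the abstract says only that overlaps are assessed for significance.

```python
from math import comb


def hypergeom_tail(overlap, size_a, size_b, universe):
    """P(X >= overlap) for two gene sets drawn from a shared ortholog universe.

    X is the overlap between a fixed set of size_a genes and a random
    draw of size_b genes from `universe` genes.
    """
    total = comb(universe, size_b)
    upper = min(size_a, size_b)
    favourable = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


# Hypothetical ortholog sets for two model-organism phenotypes and one disease.
phenotype_1 = {"orth1", "orth2", "orth3", "orth4", "orth5"}
phenotype_2 = {"orth4", "orth5", "orth6"}
disease = {"orth1", "orth2", "orth3", "orth7"}

# The three Boolean combinations named in the abstract.
union = phenotype_1 | phenotype_2
intersection = phenotype_1 & phenotype_2
difference = phenotype_1 - phenotype_2

# Score the difference-based pseudo-phenotype against the disease gene set.
shared = difference & disease
p_value = hypergeom_tail(len(shared), len(difference), len(disease), universe=100)

# Orthologs in the pseudo-phenotype not yet linked to the disease would be
# the candidate predictions; here the toy sets overlap completely.
candidates = difference - disease
print(sorted(shared), p_value, sorted(candidates))
```

In this toy case the difference removes the orthologs shared with the second phenotype, leaving a smaller set that overlaps the disease set entirely, which mirrors the abstract's point that intersections and differences yield more specific predictions than unions.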

Bayesian Modeling of Epigenetic Variation in Multiple Human Cell Types

Yu Zhang, Feng Yue, Ross C. Hardison
doi: http://dx.doi.org/10.1101/018028

With high-throughput sequencing data generated for multiple epigenetic features in many cell types, a chief challenge is to explain the dynamics in multiple epigenomes that lead to differential regulation and phenotypes. We introduce a Bayesian framework for jointly annotating multiple epigenomes and detecting differential regulation among multiple cell types. Our method, IDEAS (integrative and discriminative epigenome annotation system), achieves superior power by modeling both position and cell type specific epigenetic activities. Using ENCODE data sets in 6 cell types, we identified epigenetic variation strongly associated with differential gene expression. The detected regions are significantly enriched in disease genetic variants with much stronger enrichment scores than achievable by existing methods, and the enriched phenotypes are highly relevant to the corresponding cell types. IDEAS is a powerful tool for integrative epigenome annotation and detection of variation, which could be of important utility in elucidating the interplay between genetics, gene regulation and diseases.

Ultra-large alignments using Phylogeny-aware Profiles

Nam-phuong Nguyen, Siavash Mirarab, Keerthana Kumar, Tandy Warnow
(Submitted on 5 Apr 2015)

Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments (MSAs) and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. We present UPP, an MSA method that uses a new machine learning technique – the Ensemble of Hidden Markov Models – that we propose here. UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences. UPP is available at this https URL