A Statistical Framework to Predict Functional Non-Coding Regions in the Human Genome Through Integrated Analysis of Annotation Data

Qiongshi Lu , Yiming Hu , Jiehuan Sun , Yuwei Cheng , Kei-Hoi Cheung , Hongyu Zhao
doi: http://dx.doi.org/10.1101/018093

Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome, either through computational predictions, such as genomic conservation, or through high-throughput experiments, such as the ENCODE project. These efforts have resulted in a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations, thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. Its ability to predict functional regions, together with its generalizable statistical framework, makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu

Natural selection defines the cellular complexity

Han Chen , Xionglei He
doi: http://dx.doi.org/10.1101/018069

Current biology is perplexed by the lack of a theoretical framework for understanding the organizing principles of the molecular system within a cell. Here we first studied growth rate, one of the seemingly most complex cellular traits, using functional data of yeast single-gene deletion mutants. We observed nearly one thousand expression informative genes (EIGs) whose expression levels are linearly correlated with the trait within an unprecedentedly large functional space. A simple model considering six EIG-formed protein modules revealed a variety of novel mechanistic insights, and also explained ~50% of the variance of cell growth rates measured by the Bar-seq technique for over 400 yeast mutants (Pearson’s R = 0.69), a performance comparable to that of the microarray-based (R = 0.77) or colony-size-based (R = 0.66) experimental approaches. We then applied the same strategy to 501 morphological traits of yeast and succeeded for most fitness-coupled traits, each with hundreds of trait-specific EIGs. Surprisingly, no EIG was found for most fitness-uncoupled traits, indicating that they are controlled by super-complex epistases that allow no simple expression-trait correlation. Thus, EIGs are recruited exclusively by natural selection, which builds a rather simple functional architecture for fitness-coupled traits, and the endless complexity of a cell lies primarily in its fitness-uncoupled features.
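
The "~50% of the variance" figure follows from the reported correlation, since for a simple linear relationship the fraction of variance explained is the square of Pearson's R. A minimal sketch of that arithmetic, using only the three R values quoted in the abstract (the dictionary labels are illustrative names, not the authors' terminology):

```python
# Pearson correlations quoted in the abstract; the keys are illustrative labels.
reported_r = {
    "six-module EIG model": 0.69,
    "microarray-based": 0.77,
    "colony-size-based": 0.66,
}

# For a simple linear fit, variance explained equals R squared.
variance_explained = {name: r ** 2 for name, r in reported_r.items()}

for name, r2 in variance_explained.items():
    print(f"{name}: R^2 = {r2:.3f}")
```

On these numbers the six-module model sits just under one half, which is consistent with the abstract's "~50%".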

Capturing heterotachy through multi-gamma site models

Remco Bouckaert , Peter Lockhart
doi: http://dx.doi.org/10.1101/018101

Most methods for performing a phylogenetic analysis based on sequence alignments of gene data assume that the mechanism of evolution is constant through time. It is recognised that some sites do evolve somewhat faster than others, and this can be captured using a (gamma) rate heterogeneity model. Further, some species have shorter replication times than others, and this results in faster rates of substitution in some lineages. This feature of lineage-specific rate variation can be captured to some extent by using relaxed clock models. However, it is also clear that there are additional poorly characterised features of sequence data that can sometimes lead to extreme differences in lineage-specific rates. This variation is poorly captured by constant time-reversible substitution models. The significance of extreme lineage-specific rate differences is that they lead both to errors in reconstructing evolutionary relationships and to biased estimates for the age of ancestral nodes. We propose a new model that allows gamma rate heterogeneity to change on branches, thus offering a more realistic model of sequence evolution. It adds negligible computational cost to likelihood calculations. We illustrate its effectiveness with an example of green algae and land plants. For many real-world data sets, we find a much better fit with multi-gamma site models as well as substantial differences in ancestral node date estimates.

Phylogenetic analysis supports a link between DUF1220 domain number and primate brain expansion

Fabian Zimmer , Stephen H Montgomery
doi: http://dx.doi.org/10.1101/018077

The expansion of DUF1220 domain copy number during human evolution is a dramatic example of rapid and repeated domain duplication. However, the phenotypic relevance of DUF1220 dosage is unknown. Although patterns of expression, homology and disease associations suggest a role in cortical development, this hypothesis has not been robustly tested using phylogenetic methods. Here, we estimate DUF1220 domain counts across 12 primate genomes using a nucleotide Hidden Markov Model. We then test a series of hypotheses designed to examine the potential evolutionary significance of DUF1220 copy number expansion. Our results suggest a robust association with brain size, and more specifically with neocortex volume. In contradiction to previous hypotheses, we find a strong association with postnatal brain development, but not with prenatal brain development. Our results provide further evidence of a conserved association between specific loci and brain size across primates, suggesting human brain evolution occurred through a continuation of existing processes.

The African wolf is a missing link in the wolf-like canid phylogeny

Eli K. Rueness , Pål Trosvik , Anagaw Atickem , Claudio Sillero-Zubiri , Emiliano Trucchi
doi: http://dx.doi.org/10.1101/017996

Here we present the first genomic data for the African wolf (Canis aureus lupaster) and conclusively demonstrate that it is a unique taxon and not a hybrid between other canids. These animals are commonly misclassified as golden jackals (Canis aureus) and have never been included in any large-scale studies of canid diversity and biogeography, or in investigations of the early stages of dog domestication. Applying massive Restriction Site Associated DNA (RAD) sequencing, 110,481 polymorphic sites across the genomes of seven African wolf individuals were aligned and compared with other wolf-like canids (golden jackal, Holarctic grey wolf, Ethiopian wolf, side-striped jackal and domestic dog). Analyses of this extensive sequence dataset (ca. 8.5 Mb) show conclusively that the African wolves represent a distinct taxon more closely related to the Holarctic grey wolf than to the golden jackal. Our results strongly indicate that the distribution of the golden jackal needs to be re-evaluated and point towards alternative hypotheses for the evolution of the rare and endemic Ethiopian wolf (Canis simensis). Furthermore, the extension of the grey wolf phylogeny and distribution opens new possible scenarios for the timing and location of dog domestication.

Interrogating conserved elements of diseases using Boolean combinations of orthologous phenotypes

John O Woods , Matthew Z Tien , Edward M Marcotte
doi: http://dx.doi.org/10.1101/017947

Conserved genetic programs often predate the homologous structures and phenotypes to which they give rise; eyes, for example, have evolved several dozen times, but their development seems to involve a common set of conserved genes. Recently, the concept of orthologous phenotypes (or phenologs) offered a quantitative way to describe this property. Phenologs are phenotypes or diseases from separate species that share an unexpectedly large set of their associated gene orthologs. It has been shown that the phenotype pairs which make up a phenolog are mutually predictive in terms of the genes involved. Recently, we demonstrated the ranking of gene–phenotype association predictions using multiple phenologs from an array of species. In this work, we demonstrate a computational method which provides a more targeted view of the conserved pathways which give rise to diseases. Our approach involves the generation of synthetic pseudo-phenotypes made up of Boolean combinations (union, intersection, and difference) of the gene sets for phenotypes from our database. We search for diseases that overlap significantly with these Boolean phenotypes, and find a number of highly predictive combinations. While set unions produce less specific predictions (as expected), intersection and difference-based combinations appear to offer insights into extremely specific aspects of target diseases. For example, breast cancer is predicted by zebrafish methylmercury response minus metal ion response, with predictions MT-COI, JUN, SOD2, GADD45B, and BAX all involved in the pro-apoptotic response to reactive oxygen species, thought to be a key player in cancer. We also demonstrate predictions from Arabidopsis Boolean phenotypes for increased brown adipose tissue in mouse (salt stress response’s intersection with sucrose stimulus response); and for human myopathy (red light response minus water deprivation response).
We demonstrate the ranking of predictions for human holoprosencephaly from the set intersections between each pair of a variety of closely-related zebrafish phenotypes. Our results suggest that Boolean phenolog combinations may provide a more informed insight into the conserved pathways underlying diseases than either regular phenologs or the naïve Bayes approach.
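
The Boolean combinations described in this abstract map directly onto set operations. The sketch below is only an illustration of that idea: the gene identifiers, the universe size, and the use of a hypergeometric tail probability as the overlap test are all assumptions made here, since the abstract does not say which significance test the authors used.

```python
from math import comb


def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K belong to the disease."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


# Hypothetical ortholog-mapped gene sets (placeholder identifiers, not real data).
phenotype_a = {"g1", "g2", "g3", "g4", "g5"}
phenotype_b = {"g4", "g5", "g6"}
disease = {"g1", "g2", "g3", "g9"}
universe_size = 20  # assumed number of genes with orthologs in both species

# The three Boolean pseudo-phenotypes named in the abstract.
combos = {
    "union": phenotype_a | phenotype_b,
    "intersection": phenotype_a & phenotype_b,
    "difference": phenotype_a - phenotype_b,
}


def overlap_p_value(combo):
    """Score one pseudo-phenotype by its overlap with the disease gene set."""
    overlap = len(combo & disease)
    return hypergeom_tail(overlap, len(disease), len(combo), universe_size)


p_values = {name: overlap_p_value(genes) for name, genes in combos.items()}
for name, p in sorted(p_values.items(), key=lambda item: item[1]):
    print(f"{name}: p = {p:.4f}")
```

In this toy case the difference set overlaps the disease more significantly than the union does, which mirrors the abstract's remark that unions are less specific.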

Bayesian Modeling of Epigenetic Variation in Multiple Human Cell Types

Yu Zhang , Feng Yue , Ross C. Hardison
doi: http://dx.doi.org/10.1101/018028

With high-throughput sequencing data generated for multiple epigenetic features in many cell types, a chief challenge is to explain the dynamics in multiple epigenomes that lead to differential regulation and phenotypes. We introduce a Bayesian framework for jointly annotating multiple epigenomes and detecting differential regulation among multiple cell types. Our method, IDEAS (integrative and discriminative epigenome annotation system), achieves superior power by modeling both position-specific and cell-type-specific epigenetic activities. Using ENCODE data sets in six cell types, we identified epigenetic variation strongly associated with differential gene expression. The detected regions are significantly enriched in disease genetic variants, with much stronger enrichment scores than achievable by existing methods, and the enriched phenotypes are highly relevant to the corresponding cell types. IDEAS is a powerful tool for integrative epigenome annotation and detection of variation, which could be of considerable utility in elucidating the interplay between genetics, gene regulation and diseases.

Methods for distinguishing between protein-coding and long noncoding RNAs and the elusive biological purpose of translation of long noncoding RNAs

Gali Housman , Igor Ulitsky
doi: http://dx.doi.org/10.1101/017889

Long noncoding RNAs (lncRNAs) are a diverse class of RNAs with increasingly appreciated functions in vertebrates, yet much of their biology remains poorly understood. In particular, it is unclear to what extent the current catalog of over 10,000 distinct annotated lncRNAs is indeed devoid of genes coding for proteins. Here we review the available computational and experimental schemes for distinguishing between protein-coding and long noncoding RNAs, and their recent genome-wide applications. We conclude that the model most consistent with available data is that a large number of mammalian lncRNAs undergo translation, but only a very small minority of such translation events result in stable and functional peptides. The outcome of the majority of the translation events and their potential biological purposes remain an intriguing topic for future investigation.

Predicting Carriers of Ongoing Selective Sweeps Without Knowledge of the Favored Allele

Roy Ronen , Glenn Tesler , Ali Akbari , Shay Zakov , Noah A Rosenberg , Vineet Bafna

Methods for detecting the genomic signatures of natural selection have been heavily studied, and they have been successful in identifying many selective sweeps. For most of these sweeps, the favored allele remains unknown, making it difficult to distinguish carriers of the sweep from non-carriers. In an ongoing selective sweep, carriers of the favored allele are likely to contain a future most recent common ancestor. Therefore, identifying them may prove useful in predicting the evolutionary trajectory — for example, in contexts involving drug-resistant pathogen strains or cancer subclones. The main contribution of this paper is the development and analysis of a new statistic, the Haplotype Allele Frequency (HAF) score. The HAF score, assigned to individual haplotypes in a sample, naturally captures many of the properties shared by haplotypes carrying a favored allele. We provide a theoretical framework for computing expected HAF scores under different evolutionary scenarios, and we validate the theoretical predictions with simulations. As an application of HAF score computations, we develop an algorithm (PreCIOSS: Predicting Carriers of Ongoing Selective Sweeps) to identify carriers of the favored allele in selective sweeps, and we demonstrate its power on simulations of both hard and soft sweeps, as well as on data from well-known sweeps in human populations.
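
The abstract does not define the HAF score. The sketch below assumes one simple formulation: a haplotype's score is the sum, over the sites where it carries the derived allele, of the number of sampled haplotypes carrying the derived allele at that site. Both that definition and the toy 0/1 haplotype matrix are assumptions for illustration, not a reproduction of the authors' implementation or of PreCIOSS.

```python
def haf_scores(haplotypes):
    """Return one score per haplotype.

    haplotypes: list of equal-length lists of 0/1, where 1 marks a derived allele.
    Assumed definition: sum of derived-allele counts over the derived alleles carried.
    """
    n_sites = len(haplotypes[0])
    # Number of sampled haplotypes carrying the derived allele at each site.
    derived_counts = [sum(h[j] for h in haplotypes) for j in range(n_sites)]
    return [
        sum(count for allele, count in zip(h, derived_counts) if allele == 1)
        for h in haplotypes
    ]


# Toy sample: four haplotypes, four segregating sites (hypothetical data).
sample = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]
scores = haf_scores(sample)
print(scores)
```

Under this assumed definition, haplotypes that share many common derived alleles score highest, which is the pattern the abstract associates with carriers of a favored allele.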

Analysis of allele-specific expression reveals cis-regulatory changes associated with a recent mating system shift and floral adaptation in Capsella

Kim A Steige , Johan Reimegård , Daniel Koenig , Douglas G Scofield , Tanja Slotte
doi: http://dx.doi.org/10.1101/017749

Cis-regulatory changes have long been suggested to contribute to organismal adaptation. While cis-regulatory changes can now be identified on a transcriptome-wide scale, in most cases the adaptive significance and mechanistic basis of rapid cis-regulatory divergence remain unclear. Here, we have characterized cis-regulatory changes associated with recent adaptive floral evolution in the selfing plant Capsella rubella, which diverged from the outcrosser Capsella grandiflora less than 200 kya. We assessed allele-specific expression (ASE) in leaves and flower buds at a total of 18,452 genes in three interspecific F1 C. grandiflora x C. rubella hybrids. After accounting for technical variation and read-mapping biases using genomic reads, we estimate that an average of 44% of these genes show evidence of ASE; however, only 6% show strong allelic expression biases. Flower buds, but not leaves, show an enrichment of genes with ASE in genomic regions responsible for phenotypic divergence between C. rubella and C. grandiflora. We further detected an excess of heterozygous transposable element (TE) insertions in the vicinity of genes with ASE, and TE insertions targeted by uniquely mapping 24-nt small RNAs were associated with reduced allelic expression of nearby genes. Our results suggest that cis-regulatory changes have been important for recent adaptive floral evolution in Capsella and that differences in TE dynamics between selfing and outcrossing species could be an important mechanism underlying rapid regulatory divergence.