A pooling-based approach to mapping genetic variants associated with DNA methylation

Irene Miriam Kaplow, Julia L MacIsaac, Sarah M Mah, Lisa M McEwen, Michael S Kobor, Hunter B Fraser
doi: http://dx.doi.org/10.1101/013649

DNA methylation is an epigenetic modification that plays a key role in gene regulation. Previous studies have investigated its genetic basis by mapping genetic variants that are associated with DNA methylation at specific sites, but these have been limited to microarrays that cover less than 2% of the genome and cannot account for allele-specific methylation (ASM). Other studies have performed whole-genome bisulfite sequencing on a few individuals, but these lack statistical power to identify variants associated with DNA methylation. We present a novel approach in which bisulfite-treated DNA from many individuals is sequenced together in a single pool, resulting in a truly genome-wide map of DNA methylation. Compared to methods that do not account for ASM, our approach increases statistical power to detect associations while sharply reducing cost, effort, and experimental variability. As a proof of concept, we generated deep sequencing data from a pool of 60 human cell lines; we evaluated almost twice as many CpGs as the largest microarray studies and identified over 2,000 genetic variants associated with DNA methylation. We found that these variants are highly enriched for associations with chromatin accessibility and CTCF binding but are less likely to be associated with traits indirectly linked to DNA, such as gene expression and disease phenotypes. In summary, our approach allows genome-wide mapping of genetic variants associated with DNA methylation in any tissue of any species, without the need for individual-level genotype or methylation data.
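The abstract does not say which statistical test is applied to the pooled reads. As a minimal sketch, assume that each read covering both a heterozygous variant and a CpG is tallied by allele and methylation state, and that the resulting 2x2 table is tested with Fisher's exact test. The function name and the read counts are hypothetical, chosen only for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (illustrative, not taken from the paper):
    rows = allele on the read (reference, alternate),
    columns = CpG state on the same read (methylated, unmethylated).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Hypothetical pooled read counts at one variant/CpG pair.
p_value = fisher_exact_two_sided(18, 2, 5, 15)
print(f"p = {p_value:.2e}")
```

Because the test uses only reads that carry both the allele and the CpG, it needs no individual-level genotypes, which is the property the pooled design relies on.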

A Statistical Framework to Predict Functional Non-Coding Regions in the Human Genome Through Integrated Analysis of Annotation Data

Qiongshi Lu, Yiming Hu, Jiehuan Sun, Yuwei Cheng, Kei-Hoi Cheung, Hongyu Zhao
doi: http://dx.doi.org/10.1101/018093

Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome, either through computational predictions, such as genomic conservation, or through high-throughput experiments, such as the ENCODE project. These efforts have produced a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations, thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. Its ability to predict functional regions, together with its generalizable statistical framework, makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu

Natural selection defines the cellular complexity

Han Chen, Xionglei He
doi: http://dx.doi.org/10.1101/018069

Current biology is perplexed by the lack of a theoretical framework for understanding the organizing principles of the molecular system within a cell. Here we first studied growth rate, one of the seemingly most complex cellular traits, using functional data from yeast single-gene deletion mutants. We observed nearly one thousand expression informative genes (EIGs) whose expression levels are linearly correlated with the trait within an unprecedentedly large functional space. A simple model considering six EIG-formed protein modules revealed a variety of novel mechanistic insights, and also explained ~50% of the variance of cell growth rates measured by the Bar-seq technique for over 400 yeast mutants (Pearson’s R = 0.69), a performance comparable to that of the microarray-based (R = 0.77) or colony-size-based (R = 0.66) experimental approaches. We then applied the same strategy to 501 morphological traits of the yeast and succeeded for most fitness-coupled traits, each of which had hundreds of trait-specific EIGs. Surprisingly, no EIGs were found for most fitness-uncoupled traits, indicating that they are controlled by super-complex epistases that allow no simple expression-trait correlation. Thus, EIGs are recruited exclusively by natural selection, which builds a rather simple functional architecture for fitness-coupled traits, and the endless complexity of a cell lies primarily in its fitness-uncoupled features.
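The link between the reported correlations and the "~50% of the variance" figure can be checked with a short calculation. This sketch assumes the usual reading that, for a linear relationship between predicted and measured values, the fraction of variance explained is the square of Pearson's R. The function name and labels are illustrative.

```python
def variance_explained(pearson_r):
    """Fraction of variance explained under a simple linear relationship (R squared)."""
    return pearson_r ** 2


# Correlations quoted in the abstract above.
reported = {
    "EIG module model": 0.69,
    "microarray-based": 0.77,
    "colony-size-based": 0.66,
}

for label, r in reported.items():
    print(f"{label}: R = {r:.2f}, R^2 = {variance_explained(r):.3f}")
```

Under that assumption, R = 0.69 corresponds to roughly 48% of the variance, consistent with the approximate figure quoted.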

Capturing heterotachy through multi-gamma site models

Remco Bouckaert, Peter Lockhart
doi: http://dx.doi.org/10.1101/018101

Most methods for performing a phylogenetic analysis based on sequence alignments of gene data assume that the mechanism of evolution is constant through time. It is recognised that some sites evolve somewhat faster than others, and this can be captured using a (gamma) rate heterogeneity model. Further, some species have shorter replication times than others, and this results in faster rates of substitution in some lineages. This lineage-specific rate variation can be captured, to some extent, by using relaxed clock models. However, it is also clear that there are additional, poorly characterised features of sequence data that can sometimes lead to extreme differences in lineage-specific rates. This variation is poorly captured by constant time-reversible substitution models. Extreme lineage-specific rate differences matter because they lead both to errors in reconstructing evolutionary relationships and to biased estimates of the age of ancestral nodes. We propose a new model that allows gamma rate heterogeneity to change on branches, thus offering a more realistic model of sequence evolution. It adds negligible computational cost to likelihood calculations. We illustrate its effectiveness with an example of green algae and land plants. For many real-world data sets, we find a much better fit with multi-gamma site models as well as substantial differences in ancestral node date estimates.

Phylogenetic analysis supports a link between DUF1220 domain number and primate brain expansion

Fabian Zimmer, Stephen H Montgomery
doi: http://dx.doi.org/10.1101/018077

The expansion of DUF1220 domain copy number during human evolution is a dramatic example of rapid and repeated domain duplication. However, the phenotypic relevance of DUF1220 dosage is unknown. Although patterns of expression, homology and disease associations suggest a role in cortical development, this hypothesis has not been robustly tested using phylogenetic methods. Here, we estimate DUF1220 domain counts across 12 primate genomes using a nucleotide Hidden Markov Model. We then test a series of hypotheses designed to examine the potential evolutionary significance of DUF1220 copy number expansion. Our results suggest a robust association with brain size, and more specifically with neocortex volume. In contradiction to previous hypotheses, we find a strong association with postnatal brain development, but not with prenatal brain development. Our results provide further evidence of a conserved association between specific loci and brain size across primates, suggesting that human brain evolution occurred through a continuation of existing processes.

The African wolf is a missing link in the wolf-like canid phylogeny

Eli K. Rueness, Pål Trosvik, Anagaw Atickem, Claudio Sillero-Zubiri, Emiliano Trucchi
doi: http://dx.doi.org/10.1101/017996

Here we present the first genomic data for the African wolf (Canis aureus lupaster) and conclusively demonstrate that it is a unique taxon and not a hybrid between other canids. These animals are commonly misclassified as golden jackals (Canis aureus) and have never been included in any large-scale studies of canid diversity and biogeography, or in investigations of the early stages of dog domestication. Applying massive Restriction Site Associated DNA (RAD) sequencing, 110,481 polymorphic sites across the genomes of 7 African wolf individuals were aligned and compared with other wolf-like canids (golden jackal, Holarctic grey wolf, Ethiopian wolf, side-striped jackal and domestic dog). Analyses of this extensive sequence dataset (ca. 8.5 Mb) show conclusively that the African wolves represent a distinct taxon more closely related to the Holarctic grey wolf than to the golden jackal. Our results strongly indicate that the distribution of the golden jackal needs to be re-evaluated and point towards alternative hypotheses for the evolution of the rare and endemic Ethiopian wolf (Canis simensis). Furthermore, the extension of the grey wolf phylogeny and distribution opens new possible scenarios for the timing and location of dog domestication.

Interrogating conserved elements of diseases using Boolean combinations of orthologous phenotypes

John O Woods, Matthew Z Tien, Edward M Marcotte
doi: http://dx.doi.org/10.1101/017947

Conserved genetic programs often predate the homologous structures and phenotypes to which they give rise; eyes, for example, have evolved several dozen times, but their development seems to involve a common set of conserved genes. Recently, the concept of orthologous phenotypes (or phenologs) offered a quantitative way to describe this property. Phenologs are phenotypes or diseases from separate species that share an unexpectedly large set of their associated gene orthologs. It has been shown that the phenotype pairs that make up a phenolog are mutually predictive in terms of the genes involved. Recently, we demonstrated the ranking of gene–phenotype association predictions using multiple phenologs from an array of species. In this work, we demonstrate a computational method that provides a more targeted view of the conserved pathways that give rise to diseases. Our approach involves the generation of synthetic pseudo-phenotypes made up of Boolean combinations (union, intersection, and difference) of the gene sets for phenotypes from our database. We search for diseases that overlap significantly with these Boolean phenotypes, and find a number of highly predictive combinations. While set unions produce less specific predictions (as expected), intersection- and difference-based combinations appear to offer insights into extremely specific aspects of target diseases. For example, breast cancer is predicted by zebrafish methylmercury response minus metal ion response, with predictions MT-COI, JUN, SOD2, GADD45B, and BAX all involved in the pro-apoptotic response to reactive oxygen species, thought to be a key player in cancer. We also demonstrate predictions from Arabidopsis Boolean phenotypes for increased brown adipose tissue in mouse (salt stress response intersected with sucrose stimulus response), and for human myopathy (red light response minus water deprivation response). We demonstrate the ranking of predictions for human holoprosencephaly from the set intersections between each pair of a variety of closely related zebrafish phenotypes. Our results suggest that Boolean phenolog combinations may provide a more informed insight into the conserved pathways underlying diseases than either regular phenologs or the naïve Bayes approach.
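The Boolean pseudo-phenotype idea can be sketched with ordinary set operations. The abstract does not state how overlap significance is scored, so this example assumes a one-sided hypergeometric test. The gene identifiers, the universe size of 100 orthologs, and the function name are all hypothetical placeholders, not data from the paper.

```python
from math import comb


def overlap_p_value(k, universe, disease_size, combo_size):
    """P(overlap >= k) when combo_size genes are drawn at random from
    `universe` genes, of which disease_size are disease-associated."""
    upper = min(disease_size, combo_size)
    hits = sum(comb(disease_size, i) * comb(universe - disease_size, combo_size - i)
               for i in range(k, upper + 1))
    return hits / comb(universe, combo_size)


# Toy ortholog sets (placeholder identifiers, for illustration only).
phenotype_a = {"g1", "g2", "g3", "g4", "g5", "g6"}
phenotype_b = {"g5", "g6", "g7"}
disease = {"g1", "g2", "g3", "g9"}
UNIVERSE = 100  # assumed number of orthologs shared by the two species

# The three Boolean combinations named in the abstract.
combos = {
    "union": phenotype_a | phenotype_b,
    "intersection": phenotype_a & phenotype_b,
    "difference": phenotype_a - phenotype_b,
}

results = {}
for name, genes in combos.items():
    k = len(genes & disease)
    results[name] = overlap_p_value(k, UNIVERSE, len(disease), len(genes))
    print(f"{name}: {len(genes)} genes, overlap {k}, p = {results[name]:.2e}")
```

In this toy case the difference set is smaller than the union but retains the same disease overlap, so it scores as more significant, which mirrors the abstract's point that unions are less specific.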