Disentangling effects of colocalizing genomic annotations to functionally prioritize non-coding variants within complex trait loci

Gosia Trynka, Harm-Jan Westra, Kamil Slowikowski, Xinli Hu, Han Xu, Barbara E Stranger, Buhm Han, Soumya Raychaudhuri
doi: http://dx.doi.org/10.1101/009258

Identifying genomic annotations that differentiate causal from associated variants is critical to fine-map disease loci. While many studies have identified non-coding annotations overlapping disease variants, these annotations colocalize, complicating fine-mapping efforts. We demonstrate that conventional enrichment tests are inflated and cannot distinguish causal effects from colocalizing annotations. We developed a sensitive and specific statistical approach that is able to identify independent effects from colocalizing annotations. We first confirm that gene regulatory variants map to DNase-I hypersensitive sites (DHS) near transcription start sites. We then show that (1) 15-35% of causal variants within disease loci map to DHS independent of other annotations; (2) breast cancer and rheumatoid arthritis loci harbor potentially causal variants near the summits of histone marks rather than full peak bodies; and (3) variants associated with height are highly enriched for embryonic stem cell DHS sites. We highlight specific loci where we can most effectively prioritize causal variation.

Inference of Gorilla demographic and selective history from whole genome sequence data

Kimberly F. McManus, Joanna L. Kelley, Shiya Song, Krishna Veeramah, August E. Woerner, Laurie S. Stevison, Oliver A. Ryder, Jeffrey M. Kidd, Jeffrey D. Wall, Carlos D. Bustamante, Michael F. Hammer
doi: http://dx.doi.org/10.1101/009191

While population-level genomic sequence data have been gathered extensively for humans, similar data from our closest living relatives are just beginning to emerge. Examination of genomic variation within great apes offers many opportunities to increase our understanding of the forces that have differentially shaped the evolutionary history of hominid taxa. Here, we expand upon the work of the Great Ape Genome Project by analyzing medium to high coverage whole genome sequences from 14 western lowland gorillas (Gorilla gorilla gorilla), 2 eastern lowland gorillas (G. beringei graueri), and a single Cross River individual (G. gorilla diehli). We infer that the ancestors of western and eastern lowland gorillas diverged from a common ancestor ~261 thousand years ago (kya), and that the ancestors of the Cross River population diverged from the western lowland gorilla lineage ~68 kya. Using a diffusion approximation approach to model the genome-wide site frequency spectrum, we infer a history of western lowland gorillas that includes an ancestral population expansion of ~1.4-fold around ~970 kya and a recent ~5.6-fold contraction in population size ~23 kya. The latter may correspond to a major reduction in African equatorial forests around the Last Glacial Maximum. We also analyze patterns of variation among western lowland gorillas to identify several genomic regions with strong signatures of recent selective sweeps. We find that processes related to taste, pancreatic and saliva secretion, sodium ion transmembrane transport, and cardiac muscle function are overrepresented in genomic regions predicted to have experienced recent positive selection.

Author post: Century-scale methylome stability in a recently diverged Arabidopsis thaliana lineage

This guest post is by Claude Becker, Jörg Hagmann and Detlef Weigel on their preprint Century-scale methylome stability in a recently diverged Arabidopsis thaliana lineage, bioRxived here.

This paper is the result of a collaboration between experts in machine learning and statistical analysis (from the group of Karsten Borgwardt at the Max Planck Institute of Intelligent Systems), a lab that has spearheaded the assembly and SNP genotyping of a world-wide collection of Arabidopsis thaliana specimens (Joy Bergelson’s lab at the University of Chicago), a group specialized in large-scale phenotyping (the lab of Thomas Altmann at the Leibniz Institute of Plant Genetics and Crop Plant Research in Gatersleben) and our epigenomics group at the Max Planck Institute for Developmental Biology in Tübingen.

The epigenome of an organism, in a restricted definition, consists of the entirety of post-translational histone modifications (e.g. methylation, acetylation, etc.) and chemical modifications to the DNA, such as methylation of cytosines. Epigenetic marks can influence the transcriptional activity of genes and transposable elements by locally modulating the accessibility of the DNA. The local configuration of the epigenome can change (i) spontaneously, (ii) as a result of genetic rearrangements, or (iii) as a consequence of external signals. That the epigenome reacts to external signals such as stress and nutrient supply and that it can influence physiological processes – even behavior – has caused much recent excitement. Academic and popular scientific articles have raised the question of whether the epigenome has the potential to maintain environmental footprints across generations. The epigenome is thus presented as an entity that fuels acclimation to rapidly changing environmental conditions and that enables adaptation in subsequent generations. Studies investigating the epigenetic basis of the inheritance of acquired traits, however, often either lack the depth of analysis necessary for the identification of locus-specific epigenetic changes or investigate inheritance over a rather short time period of only one or two generations. Moreover, many study designs do not allow for easy distinction between genetic variation causing the observed epigenetic change and epigenetic differences independent of DNA sequence variation.

In our new study we aim to tackle the question of to what extent long exposure to varying and diverse environmental conditions can change the heritable DNA methylation landscape. We overcome several of the above-mentioned problems and limitations by studying variation of DNA methylation in a quasi-isogenic lineage of the model plant Arabidopsis thaliana. North America (NA) was only recently colonized by A. thaliana, and approximately half of the current population is made up of a single lineage that underwent a recent population bottleneck, having diverged from a common ancestor a century or two ago, resulting in minimal genetic diversity in the current population [1].

We sequenced the genome and DNA methylome of thirteen closely related NA accessions originating from different geographical locations in order to determine the spectrum, frequency and effect of epigenetic variants. We then compared the epigenetic variation in the NA lineage to that of a previously analyzed set of isogenic A. thaliana lines that had been propagated for 30 generations in the greenhouse [2,3].

Pairwise comparison of the NA accessions revealed that only 3% of genome-wide methylation was variable. By using the genetic mutations as a molecular clock, we found that – contrary to our expectation – epimutations did not accumulate at a higher rate under varying natural conditions compared to growth in a stable greenhouse environment. Even more surprisingly, changes in DNA methylation of single cytosines and of larger contiguous regions were often seen in both NA and greenhouse-grown accessions. In both datasets, accumulation of epimutations over time was non-linear, likely reflecting frequent reversions of methylation changes back to the initial configuration. Population structure inferred from methylation data reflected the genetic relatedness of the accessions and showed no signal of a genome-wide environmental footprint. This, together with the fact that most epigenetic variants were neutral and did not correlate with changes in gene expression, indicated that epigenetic variants accumulate to a large extent as a function of time and genetic diversification rather than as a consequence of local adaptation to environmental changes.

In summary, we have shown that long-term methylome variation of plants grown in varying and diverse natural sites is largely stable at the whole-genome level and in several aspects is intriguingly similar to that of lines raised in uniform conditions. This does not rule out a limited number of subtle adaptive DNA methylation changes that are linked to specific growth conditions, but it is in stark contrast to the published claims of broad, genome-wide epigenetic variation reflecting local adaptation. Heritable polymorphisms that arise in response to specific growth conditions certainly appear to be much less frequent than those that arise spontaneously or due to genetic variation.

In addition to the biological findings discussed above, an important part of our paper is an improved method for the detection of differentially methylated regions. Past studies have relied on clustering of differentially methylated positions or on fixed sliding windows, with the caveat of high rates of false negatives and false positives, respectively. We have adapted a Hidden Markov Model, initially developed for animal methylation data, to the more complex DNA methylation patterns in plants. Upon identification of methylated regions in each strain, these are then tested for differential methylation between strains. Our method results in increased specificity and higher accuracy and we believe it will be of broad interest to the epigenomics community.
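To make the segmentation idea concrete, here is a minimal sketch of how a two-state Hidden Markov Model can turn per-cytosine methylation levels into methylated regions. It is an illustration only, not the implementation used in the paper: the Gaussian emissions, the function names and every parameter value (p_stay, emit_means, emit_sd) are assumptions chosen for the example.

```python
import math

def viterbi_two_state(meth_levels, p_stay=0.9, emit_means=(0.1, 0.8), emit_sd=0.2):
    """Return the most likely state path: 0 = unmethylated, 1 = methylated.

    All parameter values are illustrative assumptions, not fitted estimates.
    """
    if not meth_levels:
        return []
    log_stay = math.log(p_stay)
    log_switch = math.log(1.0 - p_stay)

    def log_emit(state, x):
        # Gaussian log-density, dropping the constant shared by both states
        z = (x - emit_means[state]) / emit_sd
        return -0.5 * z * z

    # score[s] = best log-probability of any path that ends in state s
    score = [math.log(0.5) + log_emit(s, meth_levels[0]) for s in range(2)]
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for x in meth_levels[1:]:
        new_score, pointers = [], []
        for s in range(2):
            cands = [score[p] + (log_stay if p == s else log_switch) for p in range(2)]
            best_prev = max(range(2), key=lambda p: cands[p])
            pointers.append(best_prev)
            new_score.append(cands[best_prev] + log_emit(s, x))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state
    state = max(range(2), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def methylated_regions(path):
    """Collapse a state path into (start, end) index pairs, end exclusive."""
    regions, start = [], None
    for i, s in enumerate(path):
        if s == 1 and start is None:
            start = i
        elif s == 0 and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(path)))
    return regions

levels = [0.05, 0.1, 0.0, 0.9, 0.85, 0.8, 0.75, 0.1, 0.05]
path = viterbi_two_state(levels)
print(methylated_regions(path))
```

Regions called this way in each strain could then be compared between strains with a separate statistical test, which this sketch does not attempt.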


1. Platt A, Horton M, Huang YS, Li Y, Anastasio AE, et al. (2010) The scale of population structure in Arabidopsis thaliana. PLoS Genet 6: e1000843.

2. Becker C, Hagmann J, Müller J, Koenig D, Stegle O, et al. (2011) Spontaneous epigenetic variation in the Arabidopsis thaliana methylome. Nature 480: 245-249.

3. Schmitz RJ, Schultz MD, Lewsey MG, O’Malley RC, Urich MA, et al. (2011) Transgenerational epigenetic instability is a source of novel methylation variants. Science 334: 369-373.

Non-crossover gene conversions show strong GC bias and unexpected clustering in humans

Amy Williams, Giulio Genovese, Thomas Dyer, Katherine Truax, Goo Jun, Nick Patterson, Joanne E. Curran, Ravi Duggirala, John Blangero, David Reich, Molly Przeworski
doi: http://dx.doi.org/10.1101/009175

Although the past decade has seen tremendous progress in our understanding of fine-scale recombination, little is known about non-crossover (or “gene conversion”) resolutions. We report the first genome-wide study of non-crossover gene conversion events in humans. Using SNP array data from 94 meioses, we identified 107 sites affected by non-crossover events, of which 51/53 were confirmed in sequence data. Our results suggest that a site is involved in a non-crossover event at a rate of 6.7 × 10^-6/bp/generation, consistent with results from sperm-typing studies. Observed non-crossover events show strong allelic bias, with 70% (61–79%) of events transmitting GC alleles (P=7.9 × 10^-5), and have tract lengths that vary over more than an order of magnitude. Strikingly, in 4 of 15 regions with available resequencing data, multiple (~2–4) distinct non-crossover events cluster within ~20–30 kb. This pattern has not been reported previously in mammals and is inconsistent with canonical models of double strand break repair.

Century-scale methylome stability in a recently diverged Arabidopsis thaliana lineage

Joerg Hagmann, Claude Becker, Jonas Müller, Oliver Stegle, Rhonda C Meyer, Korbinian Schneeberger, Joffrey Fitz, Thomas Altmann, Joy Bergelson, Karsten Borgwardt, Detlef Weigel
doi: http://dx.doi.org/10.1101/009225

There has been much excitement about the possibility that exposure to specific environments can induce an ecological memory in the form of whole-sale, genome-wide epigenetic changes that are maintained over many generations. In the model plant Arabidopsis thaliana, numerous heritable DNA methylation differences have been identified in greenhouse-grown isogenic lines, but it remains unknown how natural, highly variable environments affect the rate and spectrum of such changes. Here we present detailed methylome analyses in a geographically dispersed A. thaliana population that constitutes a collection of near-isogenic lines, diverged for at least a century from a common ancestor. We observed little DNA methylation divergence whole-genome wide. Nonetheless, methylome variation largely reflected genetic distance, and was in many aspects similar to that of lines raised in uniform conditions. Thus, even when plants are grown in varying and diverse natural sites, genome-wide epigenetic variation accumulates in a clock-like manner, and epigenetic divergence thus parallels the pattern of genome-wide DNA sequence divergence.

Joint annotation of chromatin state and chromatin conformation reveals relationships among domain types and identifies domains of cell type-specific expression

Maxwell W Libbrecht, Ferhat Ay, Michael M Hoffman, David M Gilbert, Jeffrey A Bilmes, William Stafford Noble
doi: http://dx.doi.org/10.1101/009209

The genomic neighborhood of a gene influences its activity, a behavior that is attributable in part to domain-scale regulation, in which regions of hundreds or thousands of kilobases known as domains are regulated as a unit. Previous studies using genomics assays such as chromatin immunoprecipitation (ChIP)-seq and chromatin conformation capture (3C)-based assays have identified many types of regulatory domains. However, due to the difficulty of integrating genomics data sets, the relationships among these domain types are poorly understood. Semi-automated genome annotation (SAGA) algorithms facilitate human interpretation of heterogeneous collections of genomics data by simultaneously partitioning the human genome and assigning labels to the resulting genomic segments. However, existing SAGA methods can incorporate only data sets that can be expressed as a one-dimensional vector over the genome and therefore cannot integrate inherently pairwise chromatin conformation data. We developed a new computational method, called graph-based regularization (GBR), for expressing a pairwise prior that encourages certain pairs of genomic loci to receive the same label in a genome annotation. We used GBR to exploit chromatin conformation information during genome annotation by encouraging positions that are close in 3D to occupy the same type of domain. Using this approach, we produced a comprehensive model of chromatin domains in eight human cell types, thereby revealing the relationships among known domain types. Through this model, we identified clusters of tightly-regulated genes expressed in only a small number of cell types, which we term “specific expression domains.” We additionally found that a subset of domain boundaries marked by promoters and CTCF motifs are consistent between cell types even when domain activity changes. 
Finally, we showed that GBR can be used for the seemingly unrelated task of transferring information from well-studied cell types to less well characterized cell types during genome annotation, making it possible to produce high-quality annotations of the hundreds of cell types with limited available data.
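As a rough illustration of what a pairwise prior of this kind can look like, the sketch below scores a labeling by summing the weights of linked position pairs that received different labels, so that labelings which keep strongly linked positions together score lower. This is a generic penalty written for illustration, not the GBR formulation from the preprint; the function name, the label names and the contact weights are all hypothetical.

```python
def pairwise_penalty(labels, weighted_pairs):
    """Sum the weights of linked position pairs whose labels disagree.

    labels: one label per genomic position.
    weighted_pairs: (i, j, weight) tuples; a larger weight means a stronger
    preference for positions i and j to share a label (hypothetical values).
    """
    return sum(w for i, j, w in weighted_pairs if labels[i] != labels[j])

# Hypothetical contacts between four positions
contacts = [(0, 1, 2.0), (2, 3, 1.5), (1, 2, 0.25)]

labels_a = ["active", "active", "quiescent", "quiescent"]
labels_b = ["active", "quiescent", "active", "quiescent"]

print(pairwise_penalty(labels_a, contacts))
print(pairwise_penalty(labels_b, contacts))
```

An annotation method could subtract a term like this from its usual objective, trading off fit to the one-dimensional data against agreement between linked positions.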

Methods for Joint Imaging and RNA-seq Data Analysis

Junhai Jiang, Nan Lin, Shicheng Guo, Jinyun Chen, Momiao Xiong
(Submitted on 13 Sep 2014)

Emerging integrative analysis of genomic and anatomical imaging data, which has not been well developed, provides invaluable information for the holistic discovery of the genomic structure of disease and has the potential to open a new avenue for discovering novel disease susceptibility genes which cannot be identified if the two data types are analyzed separately. A key issue for the success of imaging and genomic data analysis is how to reduce their dimensions. Most previous methods for imaging information extraction and RNA-seq data reduction do not explore imaging spatial information and often ignore gene expression variation at the genomic positional level. To overcome these limitations, we extend functional principal component analysis from one dimension to two dimensions (2DFPCA) for representing imaging data and develop a multiple functional linear model (MFLM) in which functional principal scores of images are taken as multiple quantitative traits and the RNA-seq profile across a gene is taken as a functional predictor for assessing the association of gene expression with images. The developed method has been applied to image and RNA-seq data of ovarian cancer and KIRC studies. We identified 24 and 84 genes whose expressions were associated with imaging variations in ovarian cancer and KIRC studies, respectively. Our results showed that many genes significantly associated with images were not differentially expressed, but revealed their morphological and metabolic functions. The results also demonstrated that the peaks of the estimated regression coefficient function in the MFLM often allowed the discovery of splicing sites and multiple isoforms of gene expression.