Author post: Century-scale methylome stability in a recently diverged Arabidopsis thaliana lineage

This guest post is by Claude Becker, Jörg Hagmann and Detlef Weigel on their preprint Century-scale methylome stability in a recently diverged Arabidopsis thaliana lineage, bioRxived here.

This paper is the result of a collaboration between experts in machine learning and statistical analysis (from the group of Karsten Borgwardt at the Max Planck Institute of Intelligent Systems), a lab that has spearheaded the assembly and SNP genotyping of a world-wide collection of Arabidopsis thaliana specimens (Joy Bergelson’s lab at the University of Chicago), a group specialized in large-scale phenotyping (the lab of Thomas Altmann at the Leibniz Institute of Plant Genetics and Crop Plant Research in Gatersleben) and our epigenomics group at the Max Planck Institute for Developmental Biology in Tübingen.

The epigenome of an organism, in a restricted definition, consists of the entirety of post-translational histone modifications (e.g. methylation, acetylation, etc.) and chemical modifications to the DNA, such as methylation of cytosines. Epigenetic marks can influence the transcriptional activity of genes and transposable elements by locally modulating the accessibility of the DNA. The local configuration of the epigenome can change (i) spontaneously, (ii) as a result of genetic rearrangements, or (iii) as a consequence of external signals. That the epigenome reacts to external signals such as stress and nutrient supply and that it can influence physiological processes – even behavior – has caused much recent excitement. Academic and popular scientific articles have raised the question of whether the epigenome has the potential to maintain environmental footprints across generations. The epigenome is thus presented as an entity that fuels acclimation to rapidly changing environmental conditions and that enables adaptation in subsequent generations. Studies investigating the epigenetic basis of the inheritance of acquired traits, however, often either lack the depth of analysis necessary for the identification of locus-specific epigenetic changes or investigate inheritance over a rather short time period of only one or two generations. Moreover, many study designs do not allow for easy distinction between genetic variation causing the observed epigenetic change and epigenetic differences independent of DNA sequence variation.

In our new study we address the question of to what extent long exposure to varying and diverse environmental conditions can change the heritable DNA methylation landscape. We overcome several of the above-mentioned problems and limitations by studying variation of DNA methylation in a quasi-isogenic lineage of the model plant Arabidopsis thaliana. North America (NA) was only recently colonized by A. thaliana, and approximately half of the current population consists of a single lineage that underwent a recent population bottleneck, having diverged from a common ancestor a century or two ago, resulting in minimal genetic diversity in the current population [1].

We sequenced the genome and DNA methylome of thirteen closely related NA accessions originating from different geographical locations in order to determine the spectrum, frequency and effect of epigenetic variants. We then compared the epigenetic variation in the NA lineage to that of a previously analyzed set of isogenic A. thaliana lines that had been propagated for 30 generations in the greenhouse [2,3].

Pairwise comparison of the NA accessions revealed that only 3% of genome-wide methylation was variable. By using the genetic mutations as a molecular clock, we found that – contrary to our expectation – epimutations did not accumulate at a higher rate under varying natural conditions compared to growth in a stable greenhouse environment. Even more surprisingly, changes in DNA methylation of single cytosines and of larger contiguous regions were often seen in both NA and greenhouse-grown accessions. In both datasets, accumulation of epimutations over time was non-linear, likely reflecting frequent reversions of methylation changes back to the initial configuration. Population structure inferred from methylation data reflected the genetic relatedness of the accessions and showed no signal of a genome-wide environmental footprint. This, together with the fact that most epigenetic variants were neutral and did not correlate with changes in gene expression, indicated that epigenetic variants accumulate to a large extent as a function of time and genetic diversification rather than as a consequence of local adaptation to environmental changes.

In summary, we have shown that long-term methylome variation of plants grown in varying and diverse natural sites is largely stable at the whole-genome level and in several aspects is intriguingly similar to that of lines raised in uniform conditions. This does not rule out a limited number of subtle adaptive DNA methylation changes that are linked to specific growth conditions, but it is in stark contrast to the published claims of broad, genome-wide epigenetic variation reflecting local adaptation. Heritable polymorphisms that arise in response to specific growth conditions certainly appear to be much less frequent than those that arise spontaneously or due to genetic variation.

In addition to the biological findings discussed above, an important part of our paper is an improved method for the detection of differentially methylated regions. Past studies have relied on clustering of differentially methylated positions or on fixed sliding windows, with the caveat of high rates of false negatives and false positives, respectively. We have adapted a Hidden Markov Model, initially developed for animal methylation data, to the more complex DNA methylation patterns in plants. Once methylated regions have been identified in each strain, they are tested for differential methylation between strains. Our method results in increased specificity and higher accuracy and we believe it will be of broad interest to the epigenomics community.
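The segmentation step described above can be illustrated with a toy two-state Hidden Markov Model. This is only a minimal sketch and not the model from the preprint: the binary per-cytosine calls, the state names and every probability below are assumptions chosen for illustration.

```python
import math

STATES = ("U", "M")  # unmethylated / methylated region (hypothetical labels)
# All probabilities are illustrative assumptions, not fitted values.
START = {"U": 0.5, "M": 0.5}
TRANS = {"U": {"U": 0.9, "M": 0.1}, "M": {"U": 0.1, "M": 0.9}}
EMIT = {"U": {0: 0.9, 1: 0.1}, "M": {0: 0.2, 1: 0.8}}


def viterbi(calls):
    """Most likely state path for a list of binary methylation calls (0/1)."""
    if not calls:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][calls[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for obs in calls[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][obs]))
        back.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def methylated_regions(path):
    """Half-open (start, end) index ranges of consecutive 'M' states."""
    regions, start = [], None
    for i, s in enumerate(path):
        if s == "M" and start is None:
            start = i
        elif s != "M" and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(path)))
    return regions


calls = [0] * 5 + [1] * 5 + [0] * 5
print(methylated_regions(viterbi(calls)))
```

Because neighbouring positions share information through the transition probabilities, a single unmethylated call inside a methylated stretch does not split the region, which is the behaviour that fixed windows and position clustering struggle to provide. Regions called this way in each strain could then be compared between strains with a separate statistical test.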

References

1. Platt A, Horton M, Huang YS, Li Y, Anastasio AE, et al. (2010) The scale of population structure in Arabidopsis thaliana. PLoS Genet 6: e1000843.

2. Becker C, Hagmann J, Müller J, Koenig D, Stegle O, et al. (2011) Spontaneous epigenetic variation in the Arabidopsis thaliana methylome. Nature 480: 245-249.

3. Schmitz RJ, Schultz MD, Lewsey MG, O’Malley RC, Urich MA, et al. (2011) Transgenerational epigenetic instability is a source of novel methylation variants. Science 334: 369-373.

Non-crossover gene conversions show strong GC bias and unexpected clustering in humans

Amy Williams, Giulio Genovese, Thomas Dyer, Katherine Truax, Goo Jun, Nick Patterson, Joanne E. Curran, Ravi Duggirala, John Blangero, David Reich, Molly Przeworski
doi: http://dx.doi.org/10.1101/009175

Although the past decade has seen tremendous progress in our understanding of fine-scale recombination, little is known about non-crossover (or “gene conversion”) resolutions. We report the first genome-wide study of non-crossover gene conversion events in humans. Using SNP array data from 94 meioses, we identified 107 sites affected by non-crossover events, of which 51/53 were confirmed in sequence data. Our results suggest that a site is involved in a non-crossover event at a rate of 6.7 × 10⁻⁶/bp/generation, consistent with results from sperm-typing studies. Observed non-crossover events show strong allelic bias, with 70% (61–79%) of events transmitting GC alleles (P = 7.9 × 10⁻⁵), and have tract lengths that vary over more than an order of magnitude. Strikingly, in 4 of 15 regions with available resequencing data, multiple (~2–4) distinct non-crossover events cluster within ~20–30 kb. This pattern has not been reported previously in mammals and is inconsistent with canonical models of double strand break repair.
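A GC transmission bias of this kind is typically assessed against the null expectation that either allele is transmitted with probability 1/2. The sketch below shows one way to do that with an exact two-sided binomial test. The abstract does not say which test the authors used or give the underlying counts, so both the test and the counts in the usage line are assumptions.

```python
from math import comb


def gc_bias_p_value(gc_transmitted, total_events):
    """Exact two-sided binomial p-value against a 50:50 transmission null."""
    extreme = max(gc_transmitted, total_events - gc_transmitted)
    if 2 * extreme == total_events:
        return 1.0  # observed count sits exactly at the null expectation
    # The null distribution is symmetric, so the two tails are equal.
    tail = sum(comb(total_events, i) for i in range(extreme, total_events + 1))
    return min(1.0, 2 * tail / 2 ** total_events)


# Hypothetical counts, chosen only to match the reported 70% proportion.
print(gc_bias_p_value(70, 100))
```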

Century-scale methylome stability in a recently diverged Arabidopsis thaliana lineage

Joerg Hagmann, Claude Becker, Jonas Müller, Oliver Stegle, Rhonda C Meyer, Korbinian Schneeberger, Joffrey Fitz, Thomas Altmann, Joy Bergelson, Karsten Borgwardt, Detlef Weigel
doi: http://dx.doi.org/10.1101/009225

There has been much excitement about the possibility that exposure to specific environments can induce an ecological memory in the form of wholesale, genome-wide epigenetic changes that are maintained over many generations. In the model plant Arabidopsis thaliana, numerous heritable DNA methylation differences have been identified in greenhouse-grown isogenic lines, but it remains unknown how natural, highly variable environments affect the rate and spectrum of such changes. Here we present detailed methylome analyses in a geographically dispersed A. thaliana population that constitutes a collection of near-isogenic lines, diverged for at least a century from a common ancestor. We observed little DNA methylation divergence genome-wide. Nonetheless, methylome variation largely reflected genetic distance, and was in many aspects similar to that of lines raised in uniform conditions. Thus, even when plants are grown in varying and diverse natural sites, genome-wide epigenetic variation accumulates in a clock-like manner, and epigenetic divergence thus parallels the pattern of genome-wide DNA sequence divergence.

Joint annotation of chromatin state and chromatin conformation reveals relationships among domain types and identifies domains of cell type-specific expression

Maxwell W Libbrecht, Ferhat Ay, Michael M Hoffman, David M Gilbert, Jeffrey A Bilmes, William Stafford Noble
doi: http://dx.doi.org/10.1101/009209

The genomic neighborhood of a gene influences its activity, a behavior that is attributable in part to domain-scale regulation, in which regions of hundreds or thousands of kilobases known as domains are regulated as a unit. Previous studies using genomics assays such as chromatin immunoprecipitation (ChIP)-seq and chromatin conformation capture (3C)-based assays have identified many types of regulatory domains. However, due to the difficulty of integrating genomics data sets, the relationships among these domain types are poorly understood. Semi-automated genome annotation (SAGA) algorithms facilitate human interpretation of heterogeneous collections of genomics data by simultaneously partitioning the human genome and assigning labels to the resulting genomic segments. However, existing SAGA methods can incorporate only data sets that can be expressed as a one-dimensional vector over the genome and therefore cannot integrate inherently pairwise chromatin conformation data. We developed a new computational method, called graph-based regularization (GBR), for expressing a pairwise prior that encourages certain pairs of genomic loci to receive the same label in a genome annotation. We used GBR to exploit chromatin conformation information during genome annotation by encouraging positions that are close in 3D to occupy the same type of domain. Using this approach, we produced a comprehensive model of chromatin domains in eight human cell types, thereby revealing the relationships among known domain types. Through this model, we identified clusters of tightly-regulated genes expressed in only a small number of cell types, which we term “specific expression domains.” We additionally found that a subset of domain boundaries marked by promoters and CTCF motifs are consistent between cell types even when domain activity changes. 
Finally, we showed that GBR can be used for the seemingly unrelated task of transferring information from well-studied cell types to less well characterized cell types during genome annotation, making it possible to produce high-quality annotations of the hundreds of cell types with limited available data.

Author post: Segregation distorters are not a primary source of Dobzhansky-Muller incompatibilities in house mouse hybrids

This guest post is by Russ Corbett-Detig, Emily Jacobs-Palmer, and Hopi Hoekstra (@hopihoekstra) on their paper Corbett-Detig et al Segregation distorters are not a primary source of Dobzhansky-Muller incompatibilities in house mouse hybrids bioRxived here.

What are segregation distorters and how can they contribute to reproductive isolation?

Within an individual, somatic cells are typically genetic clones of one another; in contrast, haploid gametes are related to their compatriots at only half of all loci on average, opening doors to intra-individual competition and conflict. Eggs and sperm may express selfish genetic elements called segregation distorters (SDs) that disable or destroy competitor gametes carrying unrelated alleles. The resulting transmission advantage attained by SDs allows them to invade populations without improving the fitness of individuals that harbor them. Indeed, SDs often negatively impact carriers’ fitness because such hosts transmit fewer fit (or viable) gametes. Hence natural selection favors the evolution of alleles that suppress distortion and thereby restore fertility.

Coevolution of SDs and their suppressors can in turn contribute to the evolution of reproductive isolation between diverging lineages. How? If two populations become temporarily isolated from one another, SDs and later their accompanying suppressors may arise and eventually fix in one isolated population, possibly multiple times over. Should the two populations then encounter each other again, the sperm of hybrid males, for example, will contain one or more distorters without the appropriate suppressors, and these males will suffer decreased fertility. Over time, gene flow may be substantially and perhaps permanently hindered leading to the formation of two reproductively isolated species.

In some Drosophila species pairs, and in many crop plants, it is clear that the coevolution of SDs and their suppressors is a major, even primary, contributor to the evolution of reproductive isolation between diverging lineages. At present, however, the relative importance of SD-suppressor systems to reproductive isolation in broader taxonomic swathes of sexually reproducing organisms (e.g. mammals) is largely unexplored.

Our solution to the practical challenges of studying SDs

[Figure: Supplemental Figure S1]

The primary impediment to addressing this important question in evolutionary biology is practical, not conceptual. Conventionally, researchers detect SD-suppressor systems by crossing two strains to produce a large second-generation hybrid population; they then genotype these hybrids at a set of markers across the genome to identify loci that show substantive deviations from 50:50 Mendelian ratios (putative SDs). This traditional approach suffers from two major pitfalls. First, for many organisms it is not feasible to raise and genotype enough hybrids (hundreds to thousands) to have sufficient statistical power to detect SDs, especially those with weaker effects. Second, by genotyping these second-generation hybrids, rather than the gametes of their parents, one conflates SD with hybrid inviability, and it can be very difficult to disentangle these two factors.

How can these challenges be circumvented? In this work, we develop an alternative approach that avoids them. We first obtain high quality, motile sperm from first generation hybrid males (generated from two strains with available genome sequences), and then sequence these sperm in bulk as well as a somatic ‘control’ tissue. We then contrast the relative representation of the parental chromosomes in windows across the genome in both samples, searching for regions where the sperm show an excess of DNA copies of one parental haplotype, but the somatic tissue does not. Importantly, this approach is very general, and it can easily be applied to any number of interspecific or intraspecific crosses where it is possible to obtain large quantities of viable gametes.
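The window-by-window contrast between sperm and somatic tissue can be sketched as a simple 2×2 comparison of read counts per window. This is a minimal illustration under stated assumptions, not the statistical procedure used in the paper: the Pearson chi-square test, the Bonferroni correction, the tuple layout and the read counts are all hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square p-value (1 degree of freedom) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # degenerate table: no evidence either way
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of chi-square with 1 d.o.f. is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2))


def distorted_windows(windows, alpha=0.05):
    """Names of windows whose sperm haplotype ratio differs from the soma.

    Each window is (name, sperm_reads_A, sperm_reads_B, soma_reads_A,
    soma_reads_B), where A and B are the two parental haplotypes.
    """
    if not windows:
        return []
    threshold = alpha / len(windows)  # Bonferroni correction (an assumption)
    hits = []
    for name, sperm_a, sperm_b, soma_a, soma_b in windows:
        if chi_square_2x2(sperm_a, sperm_b, soma_a, soma_b) < threshold:
            hits.append(name)
    return hits


# Hypothetical read counts: w2 shows a sperm-specific excess of haplotype A.
windows = [("w1", 500, 500, 500, 500), ("w2", 650, 350, 500, 500)]
print(distorted_windows(windows))
```

Using the somatic sample as the reference, rather than an assumed 50:50 ratio, means that mapping biases shared by both tissues cancel out and only sperm-specific skews are flagged.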

Little evidence for SDs in house mouse hybrids

We apply this method to a nascent pair of Mus musculus subspecies, M. m. castaneus and M. m. domesticus. We chose these subspecies because hybrid males formed in this cross are known to be partially reproductively dysfunctional. Nonetheless, using our novel method we find no evidence supporting the presence of SDs (no genomic regions showing a statistical deviation from 50:50 compared to control tissue) despite strong statistical power to detect them. We conclude that SDs do not contribute appreciably to the evolution of reproductive isolation in this nascent species pair. Instead, reproductive isolation in these mammalian subspecies likely stems from other incompatibilities in spermatogenesis or ejaculate production unrelated to SD-suppressor coevolution.

So what’s next? Because this approach—bulk sequencing of sperm from hybrid males—can be used on almost any pair of interfertile taxa, we can begin to better understand the prevalence of SD and its role in speciation in a wide diversity of species.

Methods for Joint Imaging and RNA-seq Data Analysis

Junhai Jiang, Nan Lin, Shicheng Guo, Jinyun Chen, Momiao Xiong
(Submitted on 13 Sep 2014)

Integrative analysis of genomic and anatomical imaging data, an emerging field that is not yet well developed, provides invaluable information for the holistic discovery of the genomic structure of disease and has the potential to open a new avenue for discovering novel disease susceptibility genes that cannot be identified when the two data types are analyzed separately. A key issue for the success of imaging and genomic data analysis is how to reduce their dimensions. Most previous methods for imaging information extraction and RNA-seq data reduction do not exploit imaging spatial information and often ignore gene expression variation at the genomic positional level. To overcome these limitations, we extend functional principal component analysis from one dimension to two dimensions (2DFPCA) for representing imaging data and develop a multiple functional linear model (MFLM) in which functional principal scores of images are taken as multiple quantitative traits and the RNA-seq profile across a gene is taken as a functional predictor for assessing the association of gene expression with images. The developed method has been applied to image and RNA-seq data of ovarian cancer and KIRC studies. We identified 24 and 84 genes whose expressions were associated with imaging variations in ovarian cancer and KIRC studies, respectively. Our results showed that many genes significantly associated with images were not differentially expressed, but revealed their morphological and metabolic functions. The results also demonstrated that the peaks of the estimated regression coefficient function in the MFLM often allowed the discovery of splicing sites and multiple isoforms of gene expression.

Characterization of the transcriptome, nucleotide sequence polymorphism, and natural selection in the desert adapted mouse Peromyscus eremicus

Matthew D MacManes, Michael B Eisen
doi: http://dx.doi.org/10.1101/009134

As a direct result of intense heat and aridity, deserts are thought to be among the harshest of environments, particularly for their mammalian inhabitants. Given that osmoregulation can be challenging for these animals, with failure resulting in death, strong selection should be observed on genes related to the maintenance of water and solute balance. One such animal, Peromyscus eremicus, is native to the desert regions of the southwest United States and may live its entire life without oral fluid intake. As a first step toward understanding the genetics that underlie this phenotype, we present a characterization of the P. eremicus transcriptome. We assay four tissues (kidney, liver, brain, testes) from a single individual and supplement this with population-level renal transcriptome sequencing from 15 additional animals. We identified a set of transcripts undergoing both purifying and balancing selection based on estimates of Tajima’s D. In addition, we used the branch-site test to identify a transcript – Slc2a9, likely related to desert osmoregulation – undergoing enhanced selection in P. eremicus relative to a set of related non-desert rodents.
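For readers unfamiliar with the statistic, Tajima’s D compares two estimates of diversity: the mean number of pairwise differences and the number of segregating sites scaled by sample size. The sketch below implements the standard formula on aligned haplotypes. It is not the authors’ pipeline: the function name, the input format (equal-length strings with no missing data) and the toy sequences are assumptions.

```python
import math
from itertools import combinations


def tajimas_d(haplotypes):
    """Tajima's D for equal-length aligned haplotype strings.

    Returns None when D is undefined (fewer than 4 sequences or no
    segregating sites).
    """
    n = len(haplotypes)
    if n < 4:
        return None
    length = len(haplotypes[0])
    seg_sites = sum(
        1 for i in range(length) if len({h[i] for h in haplotypes}) > 1
    )
    if seg_sites == 0:
        return None
    # Mean number of pairwise differences (pi).
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)
    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Toy data: only rare variants (negative D) vs. two balanced haplotypes (positive D).
print(tajimas_d(["1000", "0100", "0010", "0001"]))
print(tajimas_d(["11", "11", "00", "00"]))
```

Strongly negative values indicate an excess of rare variants, as expected under purifying selection or population expansion, while strongly positive values indicate an excess of intermediate-frequency variants, as expected under balancing selection or population structure.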

Author post: Generation of a Panel of Induced Pluripotent Stem Cells From Chimpanzees: a Resource for Comparative Functional Genomics

This guest post is by Irene Gallego Romero (@ee_reh_neh) on her paper Gallego Romero et al “Generation of a Panel of Induced Pluripotent Stem Cells From Chimpanzees: a Resource for Comparative Functional Genomics” bioRxived here.

Genetic divergence in protein coding regions between humans and chimpanzees cannot explain phenotypic differences between the two species, or, more broadly, between other closely related groups. Although we have known this since the early days of genetic sequencing, it has been very hard to formally test the hypothesis that follows logically – that it may be changes in gene expression and regulation that underlie the divergence in phenotypes. This is especially true in the great apes, where there are plenty of ethical and practical impediments to experimentation. For instance, our ability to carry out functional studies and really decode cellular mechanisms is restricted to tissues that can be sampled non-invasively. To date, this has mostly meant fibroblasts and immortalised lymphoblastoid cell lines. The rest of comparative work in primates tends to be done in tissue samples collected post-mortem, where experimental manipulation is not a possibility.

Together, these limitations provided the impetus for us to develop a panel of high-quality induced pluripotent stem cell (iPSC) lines from chimpanzees. The promise of this panel lies, of course, not just in insights into the pluripotent state in chimpanzees (although that is certainly a worthy subject) but in how it opens the door to a tantalizing number of previously inaccessible questions, when we combine it with any of the many protocols available for differentiating iPSCs into particular somatic cell types that have remained out of reach until now.

The amount of work that went into developing an effective reprogramming protocol is not readily apparent in our preprint, but it was exhaustive – and exhausting! We began by using retroviral vectors to deliver the four factors that are commonly used to reprogram somatic cells to pluripotency, but soon encountered two fairly sizable problems with that approach. First, these viral vectors are integrated into the host genome during the course of reprogramming, and one never knows what they’re going to disrupt. This is an issue that everyone using retro- or lentiviral vectors has to contend with, and indeed, when we began working on the project three and a half years ago they were the most reliable and established reprogramming method around, so we were prepared to take our chances and scan the resulting lines to determine insertion sites. Regardless, the thought of random insertions of pluripotency genes set us somewhat on edge!

However, for reasons that we never fully understood, those chimpanzee lines had a lot of trouble silencing the retroviral vectors and maintaining pluripotency solely through endogenous mechanisms, as we show in one of our supplemental figures. At the time, we were making human iPSC lines in tandem using exactly the same vector stocks. While the human lines would lose most exogenous vector expression after 12 to 15 passages, in chimpanzee iPSCs of the same age we would generally find that expression of at least one, if not more, exogenous genes was as high as it had been on day one. This did not bode well for the lines, or for our ability to do interesting things with them! So we scrapped the integrating approach, and began optimizing protocols all over again. Fortunately for us, Shinya Yamanaka’s group had just published a very thorough protocol on reprogramming cells using non-integrating episomal vectors, which ended up laying the foundations of the one we present in our preprint.

The lines we have generated with it are of fantastic quality, and they have passed every test we have thrown at them with flying colours. Pluripotency is being endogenously maintained, they’re karyotypically normal, and they differentiate into all three germ layers both spontaneously (as embryoid bodies, and as teratomas when injected into mice) and when we use directed protocols to push them towards a particular fate.

We were very interested in quantifying how human and chimpanzee iPSC lines differ from each other. To this end, we collected RNA-sequencing and methylation data from the chimpanzee iPSCs and the fibroblast lines they were generated from, as well as from seven human iPSC lines from various ethnic and cellular origins and their precursors, and compared them to one another. We find large numbers of inter-species differences both before and after reprogramming, but crucially, most of them are not the same differences. Of all the genes with strong evidence for differential expression between species at the iPSC stage, only 38% are also differentially expressed before reprogramming, and the situation is quite similar with regard to methylation.

Another thing we have found very striking in the data is the very clear increase in homogeneity within (and possibly between, although our design makes that harder to effectively quantify) species at the iPSC level relative to the precursor cells, both in gene expression levels and in DNA methylation. This finding will be very interesting to keep in mind as we go forward and differentiate the iPSCs into a suite of somatic cell types and see how these measures fluctuate through differentiation.

Ultimately, however, the biggest significance of this work for us lies in the fact that the lines are not just for our own use. They’re available to other researchers, and this is something we have had in mind from the earliest stages of the work. There is no possible way for our lab to even begin to tackle all the questions that these lines can be used to answer. So if you want to work with our chimpanzee iPSC lines, get in touch.

Population genomic analysis uncovers African and European admixture in Drosophila melanogaster populations from the southeastern United States and Caribbean Islands

Joyce Y Kao, Asif Zubair, Matthew P Salomon, Sergey V Nuzhdin, Daniel Campo
doi: http://dx.doi.org/10.1101/009092

Genome sequences from North American Drosophila melanogaster populations have become available to the scientific community. Deciphering their underlying population structure is crucial to making the most of these population genomic resources. Accepted models of North American colonization generally hold that several hundred years ago, flies from Africa and Europe were transported to the east coast United States and the Caribbean Islands respectively and thus current east coast US and Caribbean populations are an admixture of African and European ancestry. These models have been constructed based on phenotypes and limited genetic data. In our study, we have sequenced individual whole genomes of flies from populations in the southeast US and Caribbean Islands and examined these populations in conjunction with population sequences from Winters, CA (USA); Raleigh, NC (USA); Cameroon (Africa); and Montpellier (France) to uncover the underlying population structure of North American populations. We find that west coast US populations are most like European populations, likely reflecting a rapid westward expansion upon first settlements into North America. We also find genomic evidence of African and European admixture in east coast US and Caribbean populations, with a clinal pattern of decreasing proportions of African ancestry with higher latitude, further supporting the proposed demographic model of Caribbean flies being established by African ancestors. Our genomic analysis of Caribbean flies is the first study that exposes the source of previously reported novel African alleles found in east coast US populations.

Secondary contact and local adaptation contribute to genome-wide patterns of clinal variation in Drosophila melanogaster

Alan O. Bergland, Ray Tobler, Josefa Gonzalez, Paul Schmidt, Dmitri Petrov
doi: http://dx.doi.org/10.1101/009084

Populations arrayed along broad latitudinal gradients often show patterns of clinal variation in phenotype and genotype. Such population differentiation can be generated and maintained by a combination of demographic events and adaptive evolutionary processes. Here, we investigate the evolutionary forces that generated and maintain clinal variation genome-wide among populations of Drosophila melanogaster sampled in North America and Australia. We contrast patterns of clinal variation in these continents with patterns of differentiation among ancestral European and African populations. We show that recently derived North American and Australian populations were likely founded by both European and African lineages and that this admixture event generated genome-wide patterns of parallel clinal variation. The pervasive effects of admixture meant that only a handful of loci could be attributed to the operation of spatially varying selection using an FST outlier approach. Our results provide novel insight into a well-studied system of clinal differentiation and provide a context for future studies seeking to identify loci contributing to local adaptation in D. melanogaster.
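As a rough illustration of what an FST outlier approach involves, the sketch below computes a per-locus FST for two populations from allele frequencies and keeps the loci in the upper tail of the empirical distribution. The abstract does not specify the estimator or cutoff the authors used, so the heterozygosity-based estimator, the equal population weights, the top-fraction cutoff and the locus names are all assumptions.

```python
def fst(p1, p2):
    """Heterozygosity-based FST for one biallelic locus in two populations.

    p1 and p2 are the frequencies of the same allele in each population;
    the two populations are weighted equally (an assumption).
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0  # locus is monomorphic overall
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


def fst_outliers(freqs, top_fraction=0.01):
    """Loci in the top fraction of the empirical FST distribution.

    freqs maps a locus name to its (p1, p2) allele frequencies.
    """
    scored = sorted(
        ((fst(p1, p2), locus) for locus, (p1, p2) in freqs.items()),
        reverse=True,
    )
    n_keep = max(1, round(len(scored) * top_fraction))
    return [locus for _, locus in scored[:n_keep]]


# Hypothetical allele frequencies for three loci.
freqs = {"locus_a": (0.5, 0.5), "locus_b": (0.9, 0.1), "locus_c": (0.4, 0.6)}
print(fst_outliers(freqs, top_fraction=0.34))
```

Because the cutoff is relative to the genome-wide distribution, admixture that raises differentiation everywhere leaves few loci standing out, which is consistent with the small number of candidates reported above.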