Non-crossover gene conversions show strong GC bias and unexpected clustering in humans

Amy Williams, Giulio Genovese, Thomas Dyer, Katherine Truax, Goo Jun, Nick Patterson, Joanne E. Curran, Ravi Duggirala, John Blangero, David Reich, Molly Przeworski

Although the past decade has seen tremendous progress in our understanding of fine-scale recombination, little is known about non-crossover (or “gene conversion”) resolutions. We report the first genome-wide study of non-crossover gene conversion events in humans. Using SNP array data from 94 meioses, we identified 107 sites affected by non-crossover events, of which 51/53 were confirmed in sequence data. Our results suggest that a site is involved in a non-crossover event at a rate of 6.7 × 10⁻⁶/bp/generation, consistent with results from sperm-typing studies. Observed non-crossover events show strong allelic bias, with 70% (61–79%) of events transmitting GC alleles (P = 7.9 × 10⁻⁵), and have tract lengths that vary over more than an order of magnitude. Strikingly, in 4 of 15 regions with available resequencing data, multiple (~2–4) distinct non-crossover events cluster within ~20–30 kb. This pattern has not been reported previously in mammals and is inconsistent with canonical models of double-strand break repair.

Century-scale methylome stability in a recently diverged Arabidopsis thaliana lineage

Joerg Hagmann, Claude Becker, Jonas Müller, Oliver Stegle, Rhonda C Meyer, Korbinian Schneeberger, Joffrey Fitz, Thomas Altmann, Joy Bergelson, Karsten Borgwardt, Detlef Weigel

There has been much excitement about the possibility that exposure to specific environments can induce an ecological memory in the form of wholesale, genome-wide epigenetic changes that are maintained over many generations. In the model plant Arabidopsis thaliana, numerous heritable DNA methylation differences have been identified in greenhouse-grown isogenic lines, but it remains unknown how natural, highly variable environments affect the rate and spectrum of such changes. Here we present detailed methylome analyses in a geographically dispersed A. thaliana population that constitutes a collection of near-isogenic lines, diverged for at least a century from a common ancestor. We observed little DNA methylation divergence across the whole genome. Nonetheless, methylome variation largely reflected genetic distance, and was in many respects similar to that of lines raised in uniform conditions. Thus, even when plants are grown in varying and diverse natural sites, genome-wide epigenetic variation accumulates in a clock-like manner, and epigenetic divergence parallels the pattern of genome-wide DNA sequence divergence.

Joint annotation of chromatin state and chromatin conformation reveals relationships among domain types and identifies domains of cell type-specific expression

Maxwell W Libbrecht, Ferhat Ay, Michael M Hoffman, David M Gilbert, Jeffrey A Bilmes, William Stafford Noble

The genomic neighborhood of a gene influences its activity, a behavior that is attributable in part to domain-scale regulation, in which regions of hundreds or thousands of kilobases known as domains are regulated as a unit. Previous studies using genomics assays such as chromatin immunoprecipitation (ChIP)-seq and chromatin conformation capture (3C)-based assays have identified many types of regulatory domains. However, due to the difficulty of integrating genomics data sets, the relationships among these domain types are poorly understood. Semi-automated genome annotation (SAGA) algorithms facilitate human interpretation of heterogeneous collections of genomics data by simultaneously partitioning the human genome and assigning labels to the resulting genomic segments. However, existing SAGA methods can incorporate only data sets that can be expressed as a one-dimensional vector over the genome and therefore cannot integrate inherently pairwise chromatin conformation data. We developed a new computational method, called graph-based regularization (GBR), for expressing a pairwise prior that encourages certain pairs of genomic loci to receive the same label in a genome annotation. We used GBR to exploit chromatin conformation information during genome annotation by encouraging positions that are close in 3D to occupy the same type of domain. Using this approach, we produced a comprehensive model of chromatin domains in eight human cell types, thereby revealing the relationships among known domain types. Through this model, we identified clusters of tightly regulated genes expressed in only a small number of cell types, which we term “specific expression domains.” We additionally found that a subset of domain boundaries marked by promoters and CTCF motifs are consistent between cell types even when domain activity changes.
Finally, we showed that GBR can be used for the seemingly unrelated task of transferring information from well-studied cell types to less well characterized cell types during genome annotation, making it possible to produce high-quality annotations of the hundreds of cell types with limited available data.
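
The pairwise-prior idea in this abstract can be illustrated with a toy cost function. The sketch below is not the authors' GBR implementation; it only assumes that each locus has a per-label data cost and that pairs of loci in 3D contact pay a penalty when their labels disagree. The names `pairwise_penalty`, `total_cost`, the `strength` parameter, and all numbers are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical edge type: (locus_i, locus_j, contact weight)
Edge = Tuple[int, int, float]


def pairwise_penalty(labels: Sequence[int], edges: List[Edge]) -> float:
    """Sum the weights of edges whose two loci carry different labels."""
    return sum(w for i, j, w in edges if labels[i] != labels[j])


def total_cost(
    labels: Sequence[int],
    unary_cost: Sequence[Sequence[float]],
    edges: List[Edge],
    strength: float = 1.0,
) -> float:
    """Per-locus data cost plus the weighted label-disagreement penalty."""
    data_term = sum(unary_cost[pos][lab] for pos, lab in enumerate(labels))
    return data_term + strength * pairwise_penalty(labels, edges)


# Made-up example: 4 loci, 2 labels; loci 0 and 3 are close in 3D.
unary = [[0.1, 1.0], [0.2, 1.0], [1.0, 0.1], [0.6, 0.5]]
contacts = [(0, 3, 2.0)]
labels_a = [0, 0, 1, 1]  # best fit to the per-locus data alone
labels_b = [0, 0, 1, 0]  # gives loci 0 and 3 the same label
```

With `strength=0.0` the data term alone prefers `labels_a`; with the default strength the contact between loci 0 and 3 tips the balance toward `labels_b`, which is the qualitative behavior a pairwise prior is meant to produce.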

Author post: Segregation distorters are not a primary source of Dobzhansky-Muller incompatibilities in house mouse hybrids

This guest post is by Russ Corbett-Detig, Emily Jacobs-Palmer, and Hopi Hoekstra (@hopihoekstra) on their paper Corbett-Detig et al., “Segregation distorters are not a primary source of Dobzhansky-Muller incompatibilities in house mouse hybrids”, bioRxived here.

What are segregation distorters and how can they contribute to reproductive isolation?

Within an individual, somatic cells are typically genetic clones of one another; in contrast, haploid gametes are related to their compatriots at only half of all loci on average, opening doors to intra-individual competition and conflict. Eggs and sperm may express selfish genetic elements called segregation distorters (SDs) that disable or destroy competitor gametes carrying unrelated alleles. The resulting transmission advantage attained by SDs allows them to invade populations without improving the fitness of individuals that harbor them. Indeed, SDs often negatively impact carriers’ fitness because such hosts transmit fewer fit (or viable) gametes. Hence natural selection favors the evolution of alleles that suppress distortion and thereby restore fertility.

Coevolution of SDs and their suppressors can in turn contribute to the evolution of reproductive isolation between diverging lineages. How? If two populations become temporarily isolated from one another, SDs and later their accompanying suppressors may arise and eventually fix in one isolated population, possibly multiple times over. Should the two populations then encounter each other again, the sperm of hybrid males, for example, will contain one or more distorters without the appropriate suppressors, and these males will suffer decreased fertility. Over time, gene flow may be substantially and perhaps permanently hindered, leading to the formation of two reproductively isolated species.

In some Drosophila species pairs, and in many crop plants, it is clear that the coevolution of SDs and their suppressors is a major, even primary, contributor to the evolution of reproductive isolation between diverging lineages. At present, however, the relative importance of SD-suppressor systems to reproductive isolation in broader taxonomic swathes of sexually reproducing organisms (e.g. mammals) is largely unexplored.

Our solution to the practical challenges of studying SDs

The primary impediment to addressing this important question in evolutionary biology is practical, not conceptual. Conventionally, researchers detect SD-suppressor systems by crossing two strains to produce a large second-generation hybrid population; they then genotype these hybrids at a set of markers across the genome to identify loci that show substantive deviations from 50:50 Mendelian ratios, which are putative SDs. This traditional approach suffers from two major pitfalls. First, for many organisms it is not feasible to raise and genotype enough hybrids (hundreds to thousands) to have sufficient statistical power to detect SDs, especially those with weaker effects. Second, by genotyping these second-generation hybrids, rather than the gametes of their parents, one conflates SD with hybrid inviability, and it can be very difficult to disentangle these two factors.

How can these challenges be circumvented? In this work, we develop an alternative approach. We first obtain high-quality, motile sperm from first-generation hybrid males (generated from two strains with available genome sequences), and then sequence these sperm in bulk, along with a somatic ‘control’ tissue. We then contrast the relative representation of the parental chromosomes in windows across the genome in both samples, searching for regions where the sperm sample shows an excess of DNA copies of one parental haplotype but the somatic sample does not. Importantly, this approach is very general, and it can easily be applied to any number of interspecific or intraspecific crosses where it is possible to obtain large quantities of viable gametes.
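
The window-by-window contrast between sperm and somatic samples can be sketched in a few lines. This is only an illustration, not the statistical procedure used in the preprint: it assumes that reads in each window have already been assigned to one parental haplotype or the other, applies a plain Pearson chi-square test to the resulting 2x2 table, and uses a Bonferroni threshold. The function names and the read counts are hypothetical.

```python
import math
from typing import List, Tuple

# Hypothetical per-window counts: (reads from parent A, reads from parent B)
Counts = Tuple[int, int]


def chi2_2x2(sperm: Counts, soma: Counts) -> Tuple[float, float]:
    """Pearson chi-square (1 d.o.f.) for a difference in parental read ratio."""
    a, b = sperm
    c, d = soma
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0  # window carries no information
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def scan_windows(sperm: List[Counts], soma: List[Counts], alpha: float = 0.05) -> List[int]:
    """Return indices of windows significant after Bonferroni correction."""
    if not sperm:
        return []
    threshold = alpha / len(sperm)
    return [
        k for k, (s, t) in enumerate(zip(sperm, soma))
        if chi2_2x2(s, t)[1] < threshold
    ]


# Made-up counts: only the middle window is skewed in sperm but not in soma.
sperm_counts = [(500, 500), (650, 350), (510, 490)]
soma_counts = [(505, 495), (500, 500), (495, 505)]
candidates = scan_windows(sperm_counts, soma_counts)
```

Comparing sperm against the somatic control, rather than against a fixed 50:50 expectation, means that mapping or copy-number biases shared by both samples cancel out instead of appearing as apparent distortion.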

Little evidence for SDs in house mouse hybrids

We apply this method to a nascent pair of Mus musculus subspecies, M. m. castaneus and M. m. domesticus. We chose these subspecies because hybrid males formed in this cross are known to be partially reproductively dysfunctional. Nonetheless, using our novel method we find no evidence supporting the presence of SDs (no genomic regions showing a statistical deviation from 50:50 compared to control tissue), despite strong statistical power to detect them. We conclude that SDs do not contribute appreciably to the evolution of reproductive isolation in this nascent species pair. Instead, reproductive isolation in these mammalian subspecies likely stems from other incompatibilities in spermatogenesis or ejaculate production unrelated to SD-suppressor coevolution.

So what’s next? Because this approach—bulk sequencing of sperm from hybrid males—can be used on almost any pair of interfertile taxa, we can begin to better understand the prevalence of SD and its role in speciation in a wide diversity of species.

Methods for Joint Imaging and RNA-seq Data Analysis

Junhai Jiang, Nan Lin, Shicheng Guo, Jinyun Chen, Momiao Xiong
(Submitted on 13 Sep 2014)

Emerging integrative analysis of genomic and anatomical imaging data, which has not been well developed, provides invaluable information for the holistic discovery of the genomic structure of disease and has the potential to open a new avenue for discovering novel disease susceptibility genes that cannot be identified if the two data types are analyzed separately. A key to the success of imaging and genomic data analysis is how to reduce their dimensions. Most previous methods for imaging information extraction and RNA-seq data reduction do not explore imaging spatial information and often ignore gene expression variation at the genomic positional level. To overcome these limitations, we extend functional principal component analysis from one dimension to two dimensions (2DFPCA) for representing imaging data and develop a multiple functional linear model (MFLM) in which functional principal scores of images are taken as multiple quantitative traits and the RNA-seq profile across a gene is taken as a function predictor for assessing the association of gene expression with images. The developed method has been applied to image and RNA-seq data of ovarian cancer and KIRC studies. We identified 24 and 84 genes whose expressions were associated with imaging variations in ovarian cancer and KIRC studies, respectively. Our results showed that many genes significantly associated with images were not differentially expressed, but revealed their morphological and metabolic functions. The results also demonstrated that the peaks of the estimated regression coefficient function in the MFLM often allowed the discovery of splicing sites and multiple isoforms of gene expression.

Characterization of the transcriptome, nucleotide sequence polymorphism, and natural selection in the desert adapted mouse Peromyscus eremicus

Matthew D MacManes, Michael B Eisen

As a direct result of intense heat and aridity, deserts are thought to be among the harshest of environments, particularly for their mammalian inhabitants. Given that osmoregulation can be challenging for these animals, with failure resulting in death, strong selection should be observed on genes related to the maintenance of water and solute balance. One such animal, Peromyscus eremicus, is native to the desert regions of the southwest United States and may live its entire life without oral fluid intake. As a first step toward understanding the genetics that underlie this phenotype, we present a characterization of the P. eremicus transcriptome. We assay four tissues (kidney, liver, brain, testes) from a single individual and supplement this with population-level renal transcriptome sequencing from 15 additional animals. We identified a set of transcripts undergoing both purifying and balancing selection based on estimates of Tajima’s D. In addition, we used the branch-site test to identify a transcript – Slc2a9, likely related to desert osmoregulation – undergoing enhanced selection in P. eremicus relative to a set of related non-desert rodents.
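
For readers unfamiliar with Tajima’s D, the statistic compares two estimates of diversity: the mean number of pairwise differences and the number of segregating sites scaled by a sample-size constant. The sketch below follows the standard textbook formula and is not taken from the preprint's pipeline; it assumes equal-length, gap-free aligned sequences, and the function name and toy alignments are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def tajimas_d(seqs: Sequence[str]) -> float:
    """Tajima's D for an alignment of equal-length sequences (needs n >= 4)."""
    n = len(seqs)
    if n < 4:
        raise ValueError("need at least 4 sequences for a non-zero variance")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must have equal length")

    seg_sites = 0      # S: number of segregating sites
    pair_diffs = 0.0   # total pairwise differences summed over sites
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            # Number of sequence pairs that differ at this site.
            pair_diffs += (n * n - sum(c * c for c in counts.values())) / 2
    if seg_sites == 0:
        return float("nan")  # D is undefined without variation

    pi = pair_diffs / (n * (n - 1) / 2)  # mean pairwise differences
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Toy alignments: rare variants only vs. intermediate-frequency variants.
rare = ["TAAA", "ATAA", "AATA", "AAAT"]
balanced = ["TTTT", "TTTT", "AAAA", "AAAA"]
```

An excess of rare variants (as in `rare`) pushes D negative, the pattern expected under purifying selection or population expansion, while an excess of intermediate-frequency variants (as in `balanced`) pushes D positive, the pattern expected under balancing selection.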

Author post: Generation of a Panel of Induced Pluripotent Stem Cells From Chimpanzees: a Resource for Comparative Functional Genomics

This guest post is by Irene Gallego Romero (@ee_reh_neh) on her paper Gallego Romero et al., “Generation of a Panel of Induced Pluripotent Stem Cells From Chimpanzees: a Resource for Comparative Functional Genomics”, bioRxived here.

Genetic divergence in protein coding regions between humans and chimpanzees cannot explain phenotypic differences between the two species, or, more broadly, between other closely related groups. Although we have known this since the early days of genetic sequencing, it has been very hard to formally test the hypothesis that follows logically – that it may be changes in gene expression and regulation that underlie the divergence in phenotypes. This is especially true in the great apes, where there are plenty of ethical and practical impediments to experimentation. For instance, our ability to carry out functional studies and really decode cellular mechanisms is restricted to tissues that can be sampled non-invasively. To date, this has mostly meant fibroblasts and immortalised lymphoblastoid cell lines. The rest of comparative work in primates tends to be done in tissue samples collected post-mortem, where experimental manipulation is not a possibility.

Together, these limitations provided the impetus for us to develop a panel of high-quality induced pluripotent stem cell (iPSC) lines from chimpanzees. The promise of this panel lies, of course, not just in insights into the pluripotent state in chimpanzees (although that is certainly a worthy subject) but in how it opens the door to a tantalizing number of previously inaccessible questions, when we combine it with any of the many protocols available for differentiating iPSCs into particular somatic cell types that have remained out of reach until now.

The amount of work that went into developing an effective reprogramming protocol is not readily apparent in our preprint, but it was exhaustive – and exhausting! We began by using retroviral vectors to deliver the four factors that are commonly used to reprogram somatic cells to pluripotency, but soon encountered two fairly sizable problems with that approach. First, these viral vectors are integrated into the host genome during the course of reprogramming, and one never knows what they’re going to disrupt. This is an issue that everyone using retro- or lentiviral vectors has to contend with, and indeed, when we began working on the project three and a half years ago they were the most reliable and established reprogramming method around, so we were prepared to take our chances and scan the resulting lines to determine insertion sites. Regardless, the thought of random insertions of pluripotency genes set us somewhat on edge!

However, for reasons that we never fully understood, those chimpanzee lines had a lot of trouble silencing the retroviral vectors and maintaining pluripotency solely through endogenous mechanisms, as we show in one of our supplemental figures. At the time, we were making human iPSC lines in tandem using exactly the same vector stocks. While the human lines would lose most exogenous vector expression after 12 to 15 passages, in chimpanzee iPSCs of the same age we would generally find that expression of at least one, if not more, exogenous genes was as high as it had been on day one. This did not bode well for the lines, or for our ability to do interesting things with them! So we scrapped the integrating approach, and began optimizing protocols all over again. Fortunately for us, Shinya Yamanaka’s group had just published a very thorough protocol on reprogramming cells using non-integrating episomal vectors, which ended up laying the foundations of the one we present in our preprint.

The lines we have generated with it are of fantastic quality, and they have passed every test we have thrown at them with flying colours. Pluripotency is being endogenously maintained, they’re karyotypically normal, and they differentiate into all three germ layers both spontaneously (as embryoid bodies, and as teratomas when injected into mice) and when we use directed protocols to push them towards a particular fate.

We were very interested in quantifying how human and chimpanzee iPSC lines differ from each other. To this end, we collected RNA-sequencing and methylation data from the chimpanzee iPSCs and the fibroblast lines they were generated from, as well as from seven human iPSC lines from various ethnic and cellular origins and their precursors, and compared them to one another. We find large numbers of inter-species differences both before and after reprogramming, but crucially, most of them are not the same differences. Of all the genes with strong evidence for differential expression between species at the iPSC stage, only 38% are also differentially expressed before reprogramming, and the situation is quite similar with regards to methylation.

Another thing we have found very striking in the data is the very clear increase in homogeneity within (and possibly between, although our design makes that harder to effectively quantify) species at the iPSC level relative to the precursor cells, both in gene expression levels and in DNA methylation. This finding will be very interesting to keep in mind as we go forward and differentiate the iPSCs into a suite of somatic cell types and see how these measures fluctuate through differentiation.

Ultimately, however, the biggest significance of this work for us lies in the fact that the lines are not just for our own use. They’re available to other researchers, and this is something we have had in mind from the earliest stages of the work. There is no possible way for our lab to even begin to tackle all the questions that these lines can be used to answer. So if you want to work with our chimpanzee iPSC lines, get in touch.