Our paper: Genetic recombination is targeted towards gene promoter regions in dogs

This guest post is by Adam Auton (@adamauton) on his paper (along with coauthors) Genetic recombination is targeted towards gene promoter regions in dogs arXived here.

In this paper, we investigate the age-old question of how meiotic recombination is distributed in the genome of dogs. Before you stop reading, I’d like to spend a couple of paragraphs explaining why this is an interesting topic.

Recombination in mammalian genomes tends to occur in highly localized regions known as recombination hotspots. There are probably about 30,000 or so recombination hotspots in the human genome, each of which is about 2 kb wide, with recombination rates that can be thousands of times that of the surrounding region. Until a few years ago, the mechanism by which recombination hotspots are localized was largely unknown. This all began to change with the discovery of PRDM9 as the gene responsible for localizing hotspots [1-3]. The role of PRDM9 is to recognize and bind to specific DNA motifs in the genome, which are subsequently epigenetically marked as preferred locations of recombination.

PRDM9 turns out to be quite a fascinating gene. There is extensive variation in PRDM9 both within and across species, which points to strong selective pressures. Importantly, variation in PRDM9 can alter the recognized DNA motifs, thereby altering the locations of recombination hotspots in the genome. The high level of variation in PRDM9 between species appears to explain why recombination hotspots tend not to be shared between even closely related species, such as humans and chimpanzees.

We’ve learnt much about the importance of PRDM9 from studies in mice. Knock-out of Prdm9 in mice results in infertility and, most interestingly of all, certain alleles of mouse Prdm9 appear to be incompatible with each other [4,5]. Specifically, Mus m. musculus / Mus m. domesticus hybrid male mice are infertile if they are heterozygous for specific Prdm9 alleles. As such, Prdm9 has been called a ‘speciation gene’, as it has the potential to restrict gene flow between nascent species, and is the only known such example in mammals.

Given this importance, it was surprising to note that dogs, uniquely amongst mammals, appear to carry a dysfunctional version of PRDM9 [6]. This raises the question of how recombination occurs in dogs, and provides the motivation for our paper.

Estimating recombination rates directly is challenging, as only a few dozen events occur during any given meiosis. Therefore, to characterize large numbers of recombination events on a genome-wide basis, large pedigrees need to be genotyped, which can be both laborious and costly in non-model organisms. Luckily, an experiment of this nature has previously been performed in dogs, which revealed a recombination landscape that was reasonably consistent with patterns observed in other mammals [7].

However, without enormous sample sizes, such methods can only investigate patterns at scales far greater than the scale of individual hotspots. In order to investigate fine-scale patterns on a genome-wide basis, one must turn to indirect statistical methods, and it is this approach that we have adopted in our study. First, we whole-genome sequenced a collection of 51 outbred dogs and used these data to call single nucleotide polymorphisms. Having done so, we used the statistical method LDhat, which infers historical recombination rates via analysis of patterns of linkage disequilibrium. This is similar to the approach adopted by Axelsson et al. [8], who used microarrays to gain strong insights into canine recombination, although our use of sequencing allows us to investigate patterns at a much finer scale.
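To give a flavour of the signal that linkage-disequilibrium-based methods exploit, here is a minimal Python sketch that computes the pairwise r² statistic between two biallelic SNPs from phased haplotypes. It is not LDhat, which fits a population-genetic model to such patterns; the function name and the toy haplotypes are assumptions made purely for illustration.

```python
def ld_r2(hap_a, hap_b):
    """r^2 between two biallelic SNPs, given 0/1 alleles on the same haplotypes."""
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("need two equal-length, non-empty allele lists")
    p_a = sum(hap_a) / n                      # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n                      # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")                   # at least one SNP is monomorphic
    return d * d / denom


# Toy haplotypes (hypothetical data): one allele per haplotype at each SNP.
snp1 = [0, 0, 1, 1, 0, 1, 0, 1]
snp2 = [0, 0, 1, 1, 0, 1, 0, 1]   # always co-inherited with snp1
snp3 = [0, 1, 0, 1, 1, 0, 0, 1]   # shuffled relative to snp1

r2_linked = ld_r2(snp1, snp2)
r2_unlinked = ld_r2(snp1, snp3)
```

Broadly speaking, pairs of SNPs separated by little historical recombination retain high r², while pairs separated by a hotspot tend to show low r² even when they are physically close.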

Our results agree nicely with the broad-scale experimental estimates, but reveal a quite unusual landscape at the fine scale. In particular, we find that canine recombination is strongly enriched in regions with high CpG content. As such, recombination rates are very high around the CpG-rich regions associated with gene promoters, which contrasts with other mammalian species in which recombination hotspots do not show any particularly strong affinity for gene promoter regions. However, it is also reminiscent of patterns seen in Prdm9 knock-out mice which, although infertile, still produce double-strand breaks that cluster in gene promoter regions [9].

Interestingly, the dog genome is known to have very high CpG content. It has previously been suggested that one potential mechanism by which this may have occurred is biased gene conversion, which can result in the preferential transmission of G-C alleles over A-T alleles in the vicinity of recombination events. To investigate this phenomenon, we also sequenced a related fox species, which allowed us to see if G-C alleles are being gained or lost around recombination hotspots. We see that dog recombination hotspots do indeed appear to be acquiring GC content. This could imply a runaway process, by which CpG-rich regions have become recombinogenic, and hence have started to acquire more GC content, and hence become more recombinogenic.

As such, our results show that recombination in the dog genome appears to have some quite interesting properties. However, questions remain. The loss of PRDM9 in dogs appears to have resulted in some qualitative features that are consistent with knock-out mice, and yet dogs somehow avoid the associated infertility. Perhaps canine meiosis manages to complete without a PRDM9 ortholog, or perhaps an as-yet-unknown gene in the dog genome has adopted the role of PRDM9. In either case, the investigation of recombination in dogs provides a valuable means for building our understanding of how recombination occurs and its importance in shaping the genome.

1. Baudat F, Buard J, Grey C, Fledel-Alon A, Ober C, et al. (2010) PRDM9 is a major determinant of meiotic recombination hotspots in humans and mice. Science 327: 836-840.
2. Myers S, Bowden R, Tumian A, Bontrop RE, Freeman C, et al. (2010) Drive against hotspot motifs in primates implicates the PRDM9 gene in meiotic recombination. Science 327: 876-879.
3. Parvanov ED, Petkov PM, Paigen K (2010) Prdm9 controls activation of mammalian recombination hotspots. Science 327: 835.
4. Flachs P, Mihola O, Simecek P, Gregorova S, Schimenti JC, et al. (2012) Interallelic and intergenic incompatibilities of the Prdm9 (Hst1) gene in mouse hybrid sterility. PLoS Genet 8: e1003044.
5. Mihola O, Trachtulec Z, Vlcek C, Schimenti JC, Forejt J (2009) A mouse speciation gene encodes a meiotic histone H3 methyltransferase. Science 323: 373-375.
6. Oliver PL, Goodstadt L, Bayes JJ, Birtle Z, Roach KC, et al. (2009) Accelerated evolution of the Prdm9 speciation gene across diverse metazoan taxa. PLoS Genet 5: e1000753.
7. Wong AK, Ruhe AL, Dumont BL, Robertson KR, Guerrero G, et al. (2010) A comprehensive linkage map of the dog genome. Genetics 184: 595-605.
8. Axelsson E, Webster MT, Ratnakumar A, Ponting CP, Lindblad-Toh K (2012) Death of PRDM9 coincides with stabilization of the recombination landscape in the dog genome. Genome Res 22: 51-63.
9. Brick K, Smagulova F, Khil P, Camerini-Otero RD, Petukhova GV (2012) Genetic recombination is directed away from functional genomic elements in mice. Nature 485: 642-645.

Our paper: The rise and fall of the Phytophthora infestans lineage that triggered the Irish potato famine

This guest post is by Detlef Weigel (@WeigelWorld) and Hernán A. Burbano on their arXived paper [with coauthors] Yoshida et al. The rise and fall of the Phytophthora infestans lineage that triggered the Irish potato famine. arXived here and in press at eLife [to appear here].

This paper is the result of a great collaboration between a lab that specializes in ancient DNA (that of Johannes Krause from the University of Tübingen), an expert in pathogen systematics (the group of Marco Thines from the Senckenberg Museum and Goethe University in Frankfurt), two pathogen genomics labs (those of Sophien Kamoun from the Sainsbury Laboratory in Norwich and Frank Martin from the USDA in California), and our evolutionary genomics group at the Max Planck Institute in Tübingen (Hernán A. Burbano and Detlef Weigel).

Phytophthora infestans made history when it destroyed large parts of the European potato crop, beginning in 1845. Potato has its origin in the Andes, in the southeast of modern Peru and the northwest of Bolivia, while the center of diversity of P. infestans is several thousand kilometers further north, in Mexico’s Toluca Valley. There, other Phytophthora species live on a broad range of host plants. At some point in its history, evolutionary events associated with repeat-driven genome expansion [1,2] endowed P. infestans with the genetic arsenal required to infect potato. The pathogen was introduced to Europe in 1845 via infected potato tubers from the United States, where potato blight had made its first appearance in 1843. In the ensuing European blight epidemic, Ireland was hit especially hard, because the virtual absence of independent farmers and a restrictive customs policy conspired with the disease caused by P. infestans, potato blight, to have disproportionately devastating effects. The Great Famine that struck Ireland was a decisive event in both European and American history. One million Irish died of starvation, and at least another million left the country, most of them for the USA.

This part of P. infestans history has been clear, but the relationship of the strain(s) that caused the nineteenth-century epidemic to modern strains has been controversial. Before a range of genetically quite distinct P. infestans strains made their debut throughout the world some 40 years ago, the global population outside Mexico was dominated by a single strain, called US-1. Because of its prevalence, US-1 was long thought to have been the cause of the fatal outbreak in the nineteenth century. From the analysis of a single SNP in the mitochondrial genome, it was, however, concluded in 2001 that the nineteenth-century strains were more closely related to the modern strains that prevail today [3].

In our new paper, we resolve this paradoxical view: while the historical pathogen strain, which we call HERB-1, indeed differs at this one position from US-1, which has a derived allele, HERB-1 is far more closely related to US-1 than to other modern strains. Molecular clock analyses show that both strains probably separated from each other only a few years before the major European outbreak. HERB-1 seems to have dominated the global population without many genetic changes, and only in the twentieth century, after new potato varieties were introduced, was HERB-1 replaced by US-1 as the most successful P. infestans strain. We do not know for sure why HERB-1 was replaced, but we noted that the modern strains tend to be polyploid, while HERB-1 was diploid. We speculate that the increased genetic diversity in polyploid lineages was important for the success of US-1 (and other modern strains).

Our conclusions are based on Illumina sequencing of 11 herbarium samples of infected potato and tomato leaves collected in Ireland, the UK, Continental Europe and North America and preserved in the herbaria of the Botanical State Collection Munich and Kew Gardens in London. Both herbaria placed a great deal of confidence in our abilities and were very generous in providing the dried plants. The degree of DNA preservation in the herbarium samples was impressive, much higher than in other examples of ancient DNA. The majority of recovered DNA was from the host plant, with some samples additionally containing over 20% pathogen DNA. In contrast to recent studies of historic human pathogens, no target DNA enrichment was required. We compared the historic samples with modern strains from Europe, Africa and North and South America as well as two closely related Phytophthora species. Due to the 150-year-long period over which the individual samples had been collected, we were able to estimate with great confidence when the various P. infestans strains had emerged during evolutionary time. Here, too, we found connections with historic events: the first contact between Europeans and Americans in Mexico falls exactly into the time window in which the genetic diversity of P. infestans experienced a remarkable increase. Presumably, the social upheaval following the arrival of the Europeans somehow led to a spread of the pathogen at the beginning of the sixteenth century, which in turn accelerated its evolution.

The historical HERB-1 type is so far not known from modern collections, but we now have many diagnostic markers with which we can type the hundreds of modern isolates to determine whether a reservoir of HERB-1 persists somewhere. In addition, our work highlights that herbaria constitute a rich, so far untapped source for investigating real-time evolution.

Detlef Weigel, weigel@weigelworld.org

Hernán A. Burbano, hernan.burbano@tuebingen.mpg.de

Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany

1. Haas BJ, Kamoun S, Zody MC, Jiang RH, Handsaker RE, et al. (2009) Genome sequence and analysis of the Irish potato famine pathogen Phytophthora infestans. Nature 461: 393-398.
2. Raffaele S, Farrer RA, Cano LM, Studholme DJ, MacLean D, et al. (2010) Genome evolution following host jumps in the Irish potato famine pathogen lineage. Science 330: 1540-1543.
3. Ristaino JB, Groves CT, Parra GR (2001) PCR amplification of the Irish potato famine pathogen from historic specimens. Nature 411: 695-697.

Our paper: Bayesian test for co-localisation between pairs of genetic association studies using summary statistics

This guest post is by Vincent Plagnol on his group’s paper Bayesian test for co-localisation between pairs of genetic association studies using summary statistics, arXived here. This has been cross-posted from the Plagnol Lab web site.

In this paper we want to answer the following question: given two genetic association studies both showing some association signal at a locus, how likely is it that the same variant is responsible for both associations?

We care about this because a shared causal variant is likely to imply an etiological link between the traits being considered. An obvious application consists of comparing a gene expression study and a disease trait. If one can show that the same variant is affecting both measurements, then it is very likely that the expression of this gene is affecting disease pathogenesis. It also provides information about the tissue type where the effect is mediated. This is key information for a drug design process.

Previous work that led to this manuscript

A while back, I started a discussion with my colleague (and co-author on this manuscript) Eric Schadt about the involvement of a gene named RPS26 in type 1 diabetes. We came up with tests of co-localisation, which were later improved by my colleague (and co-author as well) Chris Wallace, based in Cambridge. These tests are somewhat dated now. The earliest version considered situations with a very small number of SNPs, and was not well suited for densely typed regions, in particular those resulting from imputation procedures.

This SNP density problem can be overcome to some extent, and Chris Wallace discusses how to do this here. However, a more fundamental issue is the Bayesian/frequentist difference. These earlier tests were testing the null hypothesis of a shared causal variant. Failing to reject the null could be the result of either a lack of power, or a true shared causal variant. In this newer Bayesian framework, the probability of each scenario is computed, including the “lack of power” case. It then becomes easier to interpret the outcome of the test. The tests are about to be released in the latest version of the coloc package (which is maintained by Chris Wallace).

In this latest paper, the underlying model is closely linked to the one proposed by Matthew Stephens and colleagues in a recent PLoS Genetics paper. However, co-localisation was more a side story in this paper, whereas it is the central point of our work. In particular, we show that it is possible to use single SNP P-values to obtain a very good approximation of the correct answer. As discussed below, this has important practical applications.
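As a rough illustration of how per-SNP summary statistics can be turned into a probability for each scenario, here is a minimal Python sketch. It assumes a Wakefield-style approximate Bayes factor per SNP and at most one causal variant per trait in the region. The function names, the prior probabilities `p1`, `p2`, `p12`, the prior effect variance `w`, and the per-SNP variance `v` are all illustrative assumptions, not values taken from this post, and the code is not the coloc package itself.

```python
import math
from statistics import NormalDist


def z_from_p(p):
    """Absolute Z-score corresponding to a two-sided P-value (0 < p < 1)."""
    return -NormalDist().inv_cdf(p / 2.0)


def log_abf(z, v, w=0.15 ** 2):
    """Log approximate Bayes factor (association vs. null) for one SNP.

    z: Z-score, v: variance of the effect estimate,
    w: prior variance of the true effect (assumed value).
    """
    r = w / (w + v)
    return 0.5 * (math.log(1.0 - r) + r * z * z)


def log_sum_exp(xs):
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_diff_exp(a, b):
    """log(exp(a) - exp(b)); returns -inf when the difference is not positive."""
    if b >= a:
        return -math.inf
    return a + math.log1p(-math.exp(b - a))


def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of five scenarios for one region.

    Index 0: no association, 1: trait 1 only, 2: trait 2 only,
    3: both traits, distinct causal variants, 4: both traits, shared variant.
    """
    s1 = log_sum_exp(labf1)
    s2 = log_sum_exp(labf2)
    s12 = log_sum_exp([a + b for a, b in zip(labf1, labf2)])
    log_h = [
        0.0,
        math.log(p1) + s1,
        math.log(p2) + s2,
        math.log(p1) + math.log(p2) + log_diff_exp(s1 + s2, s12),
        math.log(p12) + s12,
    ]
    norm = log_sum_exp(log_h)
    return [math.exp(x - norm) for x in log_h]


# Hypothetical Z-scores at five SNPs; v is an assumed effect-estimate variance.
v = 0.01
shared = coloc_posteriors(
    [log_abf(z, v) for z in (0.5, 1.0, 6.0, 1.0, 0.5)],
    [log_abf(z, v) for z in (1.0, 0.5, 6.0, 0.5, 1.0)],
)
distinct = coloc_posteriors(
    [log_abf(z, v) for z in (6.0, 1.0, 0.5, 1.0, 0.5)],
    [log_abf(z, v) for z in (0.5, 1.0, 0.5, 1.0, 6.0)],
)
```

In this toy setup, `shared` puts most of its mass on the shared-variant scenario because both traits peak at the same SNP, whereas `distinct` favours two separate causal variants. In practice `v` would have to be supplied or approximated per SNP, for example from sample size and allele frequency.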

Another closely related work is the software Sherlock. Sherlock also uses P-values, and also tries to match a gene expression dataset with another GWAS. However, Sherlock does not really perform a co-localisation test but rather a general matching between a gene and a GWAS. In particular, in the Sherlock framework, only the variants significantly associated with gene expression contribute to the final test statistic. In contrast, a variant flat for the expression trait but strongly associated with disease provides strong support against co-localisation. Our work incorporates this information, by adding support to the “distinct association peaks” scenario.

A warning about the interpretation

As always in statistics, correlation does not imply causality. And what we quantify here are correlations. We can find very strong evidence that the same variant is affecting two traits, but what we cannot conclude without doubt is that the two traits (say, expression of a gene and disease outcome) are causally related. It may be likely, but we are not testing this.

An illustration of the complexity of this is the commonly observed case where a single variant (or haplotype) appears to affect the expression of a group of genes in the same chromosome region. Our test may, in such a situation, provide strong evidence of co-localisation for several of these genes with a disease GWAS. However, most of the time the expression of only one of these genes will actually causally affect the disease trait of interest. This does not mean that the test is wrong, but one has to understand what it is actually testing. Precisely, two traits affected by the same causal variant may suggest a causal link between both, but it does not have to be the case.

Two limitations of this approach

There are two additional limitations to mention. One is that the causal variant should be typed or imputed. We use simulations to show that if this is not the case, the behaviour of the test becomes very conservative.

A second issue is the presence of more than one association for the same trait at a locus. If both associations have approximately the same level of significance, the test can misbehave. In addition, identifying co-localisation with the secondary association requires conditional P-values. We give a nice example of this in the paper. However, if only P-values are available (which is key for what we want to do), this requires using approximate methods. Things are much easier if the genotype level data are available and a proper conditional regression can be implemented.

Why it is important to use summary statistics

Data sharing is always a contentious issue in human genetics. I am incredibly frustrated by the lack of willingness displayed by some groups to share data, even though the claim is that they do. It is a topic for another post. Eric Schadt has been extremely helpful by sharing the liver gene expression dataset with us, but this is rather uncommon behaviour. In most cases, data are hidden behind various “regulations” and “data access committees” that rarely meet and extensively delay the process of data sharing.

Given this frustration, being able to base tests on P-values makes it much easier to interact with other groups and share data. The success of large scale meta-analyses is an example of this. This is why we worked out the statistics so that P-values alone are sufficient to derive the probabilities for each scenario.

A practical implication is that it becomes possible to build a web-based server that will take P-values uploaded by users, compare these P-values with a set of GWAS datasets stored on the server (typically expression studies but perhaps other data types) and return statistics about the overlapping association signals.

We have initiated that process and the coloc server is now live (http://coloc.cs.ucl.ac.uk/coloc/), with a lot of help from the Computer Science department at UCL. We have only loaded the liver dataset that we used in this preprint as of now, but we are in the process of adding a brain gene expression study, led by my colleagues Mike Weale, John Hardy and Mina Ryten. We very much welcome collaborations, and if other datasets, for gene expression or any other relevant traits, are available, we would love to collaborate and incorporate these data into our server.

From genome-wide to “phenome-wide”

What we really want to do with this tool in the near future is mine dozens of GWAS using single-variant P-value summary data, and search for connections that have been missed by previous investigators. Perhaps there are lipid traits that can be linked to neurodegenerative conditions, like the well-known APOE result? Perhaps some T cell genes have an unexpected effect on a cardiovascular trait? Obviously these are not likely events, but the genome-wide analysis of many association studies is likely to show several results of this type. The idea is to not only work genome-wide but also “phenome-wide”, comparing as many pairs of traits as possible. Again, this is definitely collaborative work and we would be excited if we could bring in more datasets to make these comparisons more powerful. So don’t hesitate to get in touch.

Our paper: Integrating influenza antigenic dynamics with molecular evolution

This guest post is by Trevor Bedford (@trvrb) on his paper (along with coauthors): Bedford et al. Integrating influenza antigenic dynamics with molecular evolution arXived here.

The influenza virus shows a remarkable capacity to evolve to escape human immunity. Many other viruses, like measles, do not have this capacity. After infection with measles, a person gains life-long immunity to the virus, and hence measles has become constrained to be a childhood infection. Continual antigenic evolution in influenza necessitates frequent vaccine updates to provide sufficient protection against circulating strains.

Antigenic differences between strains are commonly quantified using the hemagglutination inhibition (HI) assay, which measures the ability of antibodies created against one strain to interfere with virus from another strain. The resulting HI data is represented as a sparse matrix of comparisons between viruses from strains A, B, C… and sera from strains X, Y, Z… Taken by itself, this matrix is difficult to work with. Experienced virologists can pick up the loss of reactivity between groups of viruses in the noisy HI data, but these patterns are not fully quantified.

In our new paper, available on the arXiv, we extend techniques of multidimensional scaling (MDS) pioneered by Derek Smith and colleagues for the analysis of influenza antigenic data. Here, we attempted to bring the MDS antigenic model into a fully Bayesian framework and refer to the revised technique as Bayesian MDS (BMDS). In this model, viruses and sera are represented as 2D coordinates on an antigenic map in which their pairwise distances yield expectations for the HI titers, with antigenically similar viruses lying close to one another and antigenically distant viruses lying far apart.
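To make the link between map distances and HI titers concrete, here is a minimal Python sketch. It assumes the usual antigenic-cartography convention that one unit of map distance corresponds to a two-fold drop in titer, so that the expected log2 titer is a serum-specific maximum minus the Euclidean distance between virus and serum. The function names, coordinates and titers are hypothetical, and the least-squares score is a stand-in for illustration; the Bayesian model in the paper uses a likelihood with priors rather than this simple error function.

```python
import math


def expected_log2_titer(virus_xy, serum_xy, serum_max_log2):
    """Expected log2 HI titer: the serum's maximum titer minus the map distance."""
    distance = math.hypot(virus_xy[0] - serum_xy[0], virus_xy[1] - serum_xy[1])
    return serum_max_log2 - distance


def squared_error(viruses, sera, serum_max_log2, observed):
    """Sum of squared differences over the measured (virus, serum) pairs only.

    observed is a sparse dict {(virus_name, serum_name): log2 titer}, mirroring
    the fact that most virus/serum combinations are never assayed.
    """
    total = 0.0
    for (virus, serum), log2_titer in observed.items():
        expected = expected_log2_titer(
            viruses[virus], sera[serum], serum_max_log2[serum]
        )
        total += (log2_titer - expected) ** 2
    return total


# Hypothetical map: two viruses, one serum raised against a strain at the origin.
viruses = {"A": (0.0, 0.0), "B": (3.0, 4.0)}
sera = {"X": (0.0, 0.0)}
serum_max_log2 = {"X": 10.0}
observed = {("A", "X"): 10.0, ("B", "X"): 6.0}

fit = squared_error(viruses, sera, serum_max_log2, observed)
```

Here virus B sits five units from serum X, so its expected log2 titer is 5, one unit below the observed value of 6.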

By placing antigenic cartography in a Bayesian context, we are able to integrate other data sources, most notably sequence data. In this case, genetic sequences provide an evolutionary tree relating virus strains and we assume that antigenic location evolves along this tree in a 2D diffusion process. This process imposes a prior on antigenic locations in which evolutionarily similar viruses have a prior expectation of lying close to one another on the map. In the paper, we use this BMDS / diffusion model to investigate patterns of antigenic evolution in 4 circulating lineages of influenza and show that antigenic drift determines to a large degree incidence patterns across time and across lineages.

The paper is also up on GitHub, which I’ll keep updating as it goes through the review process. The BMDS model is implemented in the software package BEAST and is available in the latest source code. I hope to provide tutorials on running the BMDS model in the not-too-distant future.

Our paper: Inferring non-neutral regulatory change in pathways from transcriptional profiling data

This post is by Josh Schraiber on his paper (along with coauthors): Schraiber et al. Inferring non-neutral regulatory change in pathways from transcriptional profiling data arXived here.

We’ve known for a long time now that gene sequence alone does not determine phenotype. From the trivial example of differentiated cell types (which all have the same DNA) to now-common examples where species adapt to their environment by changing something other than protein-coding sequence, it’s clear that the expression level of a gene plays just as important a role in phenotypic development as does its sequence. Despite this fact, for gene expression we still lack the kinds of tools that are widely available for detecting non-neutral evolution at the sequence level (in packages like PAML). Part of this problem lies in a fundamental lack of power. A single gene may have hundreds of sites, and the patterns that occur at all of those sites give us plenty of information to learn about accelerated substitution rates and the like. But a gene (in a given environment) has just one expression level, so the sample size is often small and power is reduced.

This same problem occurs, of course, in phylogenetic studies of quantitative characters at the organismal level. The difference is that in those cases, researchers typically have access to tens, if not hundreds, of species with good quality measurements. Unfortunately, transcriptome-wide gene expression data can be difficult and costly to collect, so large-scale studies are few and far between.

Instead of trying to leverage large collections of species, we sought to utilize one of the benefits of transcriptome-wide profiles: data from lots and lots of genes. A common practice in molecular evolution is to run tests for selection on a gene-by-gene basis and then look for functional groups that are overrepresented (e.g. Gene Ontology enrichment). We turned that around and instead started with a priori defined gene groups (in our case, from Gene Ontology), looking to detect signal for a history of lineage-specific gene expression evolution, by jointly analyzing all the genes in a group simultaneously.

Doing this would potentially run into a problem of overfitting: should we try to fit a separate rate of evolution for each gene in the group? Instead, we borrowed a page from Ziheng Yang’s book and assumed that the rate of evolution across genes was inverse-gamma distributed. We chose this distribution mostly for computational convenience, but it is important to note that it can cover a wide range of possibilities, from a model in which every gene evolves at the same rate to a distribution so fat-tailed that there is no average rate of evolution across the group! By fitting a distribution of rates across genes in a group, we are able to look for examples of lineage-specific evolution without being confounded by outlying genes.
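The range of behaviour that the inverse-gamma family covers can be seen in a small Python sketch. The function names and parameter values are assumptions for illustration and are not taken from the paper. The sketch relies on two standard facts: if X is gamma distributed with shape a and scale 1/b, then 1/X is inverse-gamma with shape a and scale b; and the inverse-gamma mean is b / (a - 1), which exists only when a > 1.

```python
import random


def inverse_gamma_mean(shape, scale):
    """Mean of an inverse-gamma distribution, or None when it does not exist."""
    if shape <= 1:
        return None
    return scale / (shape - 1)


def draw_gene_rates(n_genes, shape, scale, seed=0):
    """Draw one evolutionary rate per gene from InvGamma(shape, scale)."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale); inverting a Gamma(shape, 1/scale)
    # draw gives an inverse-gamma draw with the requested shape and scale.
    return [1.0 / rng.gammavariate(shape, 1.0 / scale) for _ in range(n_genes)]


# Very large shape: rates concentrate tightly around the mean (here 1.0),
# approaching the "every gene evolves at the same rate" extreme.
nearly_equal = draw_gene_rates(50, shape=1000.0, scale=999.0)

# Shape at or below 1: the tail is so heavy that no mean rate exists.
heavy_tailed = draw_gene_rates(50, shape=0.5, scale=1.0)
```

With the heavy-tailed setting, a handful of genes can have rates orders of magnitude above the rest, which is the kind of outlier a shared rate distribution is meant to absorb.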

We encourage you to check out our paper and let us know what you think of our approach. In addition, our method will soon be available as an R package (once I get around to doing all the documentation…) and we would love to see people using it. If you are interested in getting an early version of our package, please don’t hesitate to contact me: jgschraiber@berkeley.edu.

Our paper: Genomic and phenotypic characterisation of a wild Medaka population: Establishing an isogenic population genetic resource in fish

This guest post is by Ewan Birney on Genomic and phenotypic characterisation of a wild Medaka population: Establishing an isogenic population genetic resource in fish, arXived here.

Our lab is part of a collaboration spanning Japan, two groups in Germany and EMBL-EBI in the UK which put the paper “Genomic and phenotypic characterisation of a wild Medaka population: Establishing an isogenic population genetic resource in fish” on to the arXiv pre-print server. Here I’ve taken up a kind invitation to post about this paper.

[Image: woodblock by Ando Hiroshige depicting goldfish and medaka]

Before getting into the details of the paper, I should introduce Medaka, scientific name Oryzias latipes. Medaka is a small fish which lives in both fresh and brackish water across the majority of the Japanese archipelago (the exception is Hokkaido, the large northern island of Japan), the Korean peninsula and eastern China. It has been kept as a garden pet (and then in aquaria) in Japan for a long time, with colourful strains documented in Japanese art since the 17th century (the image is a woodblock by the celebrated artist Ando Hiroshige (安藤 広重) depicting goldfish and the much smaller medaka; the medaka are the small horizontal shoals, sadly not the colourful goldfish). It is widespread in the wild, in particular in rice paddies, hence its other common name, “the Japanese rice paddy fish”. After the rediscovery of Mendel’s work at the turn of the twentieth century, scientists in Japan started to use the established colourful strains to explore genetics. The most famous paper from this era is the first discovery in any species of crossover of the sex chromosomes, X-Y. Rather brilliantly this is an open access paper from Genetics (Aida T: On the Inheritance of Color in a Fresh-Water Fish, APLOCHEILUS LATIPES Temmick and Schlegel, with Special Reference to Sex-Linked Inheritance. Genetics 1921, 6:554-573. http://europepmc.org/articles/PMC1200522 Note: at the time, the systematics in this area of fish was different, hence the different genus name). Since this early genetics, Medaka has been used for research both inside and outside Japan throughout the 20th century, with a well-established linkage map, transgenic procedures, genome sequence, and considerable probing of different phenotypes.

In my experience of describing Medaka to European and American audiences, the next step is to answer the usual questions about similarities and contrasts with Zebrafish, the most commonly used laboratory fish in the West. The first important thing to realise is that Medaka is on a very different branch of the Teleost (bony fish) lineage from Zebrafish, separated by an estimated 250-300 million years of divergence, so the Medaka genome is only marginally closer to the Zebrafish genome than either is to mammals. One should expect quite different biological details in the two systems, and each system to be equally applicable to mammalian systems. Medaka is somewhat smaller than Zebrafish and will nearly always live in the same tank format as Zebrafish and the same water system (many labs co-culture both Medaka and Zebrafish). Generation time is similar (6 weeks to 3 months). Zebrafish lay around 1,000 eggs in a single mating, which is a distinct advantage over Medaka’s clutch of around 30 eggs in a single mating, which remain attached to the female. However, whereas zebrafish mate only once per week, requiring a ‘recovery phase’, medaka mate every day. Thus the difference in fecundity over time is small. Both zebrafish and medaka have transparent chorions (egg shells); the embryo itself is also completely transparent in both species, rendering them ideal model systems to study development. Generally all techniques that have been established for one species are also applicable to the other, such as transgenesis by simple injection of suitable DNA vectors or antisense morpholinos. Medaka genetics is cleaner, with many inbred laboratory lines maintained by single brother-sister mating, and thus very homozygous throughout. Finally, these medaka inbred lines are often made from wild-caught individuals, with an established breeding protocol to achieve homozygosity from wild individuals.

It is this last feature that we would like to leverage here. With an inbreeding protocol from the wild, one can set up a near-isogenic wild panel, similar to the panels that have already been developed in Arabidopsis and Drosophila. These panels are proving very informative for quantitative genetics in both of these fields. Once such a panel is established it is a powerful mapping resource and is one of the few ways to study gene-environment effects, as one can repeat phenotyping experiments over the same genotypes but in differing laboratory environments (say, high to low calorie diets, or different temperatures, or different small molecules added to the water). Being able to have such a panel with a vertebrate will be very powerful (Medaka fish has all the common cell types and tissues for a vertebrate – brain, heart, liver, muscle, gut, pancreas – both endocrine and exocrine, kidney). The main purpose of this paper is to find and characterise the source wild population for such a panel.

A good genetic panel ideally needs to be free of population structure. One also needs the right linkage disequilibrium (LD) properties. In previous decades there was a sweet spot for LD in the genome; the longer the LD the cheaper it was to genotype, but shorter LD gave better resolution of where a functional variant is. With the advent of cheap sequencing, this trade-off logic has changed to finding a population with short LD to have the best possible mapping resolution. Finally there are also practical aspects for choice of population – one would like to be able to resample from the same region easily (for example, to add to the panel in the future). After looking at a number of sites where we assessed population properties via mitochondrial typing, we chose a site, Kiyosu (https://maps.google.com/maps?q=34.78113,+137.347928333056(Kiyosu%20Sampling%20Site)&iwloc=A&hl=en), close to Nagoya where the NIBB Medaka resource group under Kiyoshi Naruse is housed. From this population we caught a number of individuals and set up 8 breeding pairs. For each of the 8 pairs, we sequenced the two parents and one child (a “trio” in the parlance of genetics). We chose this sequencing structure as it means we can phase the parental genotypes using the child’s genotype, and in effect sample 16 haplotypes from this population.
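To make the trio idea concrete, here is a minimal sketch of how a child's genotype can phase a parent at individual sites. It is an illustration only, not the pipeline used in the paper: the function names and the tuple-of-alleles genotype encoding are assumptions, and it handles each site independently, leaving sites where all three individuals are heterozygous unresolved.

```python
def transmitted_allele(parent, other_parent, child):
    """Return the allele `parent` passed to `child` at one site, or None if ambiguous.

    Genotypes are unordered pairs of alleles, e.g. ("A", "G").
    (Hypothetical encoding, chosen for illustration.)
    """
    candidates = set()
    for allele in set(parent):
        remaining = list(child)
        if allele in remaining:
            remaining.remove(allele)
            # The child's other allele must have come from the other parent.
            if remaining[0] in other_parent:
                candidates.add(allele)
    return candidates.pop() if len(candidates) == 1 else None


def phase_parent(parent_sites, other_sites, child_sites):
    """Split a parent's genotypes into (transmitted, untransmitted) haplotypes.

    Sites that cannot be resolved from the trio alone are left as None.
    """
    transmitted, untransmitted = [], []
    for parent, other, child in zip(parent_sites, other_sites, child_sites):
        allele = transmitted_allele(parent, other, child)
        if allele is None:
            transmitted.append(None)
            untransmitted.append(None)
            continue
        first, second = parent
        transmitted.append(allele)
        untransmitted.append(second if first == allele else first)
    return transmitted, untransmitted


# Toy trio with three sites; the last site is heterozygous in all three
# individuals, so it cannot be phased from the trio alone.
mother = [("A", "G"), ("C", "C"), ("T", "C")]
father = [("A", "A"), ("C", "T"), ("T", "C")]
child = [("A", "G"), ("C", "T"), ("T", "C")]
mother_haplotypes = phase_parent(mother, father, child)
father_haplotypes = phase_parent(father, mother, child)
```

In practice the unresolved sites would be filled in by a statistical phasing method; the sketch only shows why a single child is enough to resolve most heterozygous sites in the parents.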

From this we show we have a good population for an isogenic panel. There is no discernible population structure, judging both from a distance matrix across all individuals and from the lack of long LD in the population. As expected for a large teleost population, there are a relatively large number of variants, with a segregating SNP every 150bp on average. In this limited sample the LD is, as expected, quite tight, with the correlation between SNPs (expressed as r²) dropping to “baseline” levels between 5 and 10 kb. From even this limited panel we estimate that almost 40% of SNPs would be mappable to a single exon. We expect this will improve in a more complete panel.
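For readers less familiar with the r² measure of LD, the following sketch computes it for one pair of biallelic SNPs from phased haplotypes. The 0/1 allele coding and the function name are assumptions made for illustration; the paper's own LD analysis may differ in detail.

```python
def r_squared(site_a, site_b):
    """r^2 between two biallelic sites across the same set of haplotypes.

    Each argument is a list of 0/1 alleles, one entry per haplotype.
    Returns None if either site is monomorphic in the sample.
    """
    if len(site_a) != len(site_b) or not site_a:
        raise ValueError("sites must cover the same, non-empty set of haplotypes")
    n = len(site_a)
    p_a = sum(site_a) / n                      # frequency of allele 1 at site A
    p_b = sum(site_b) / n                      # frequency of allele 1 at site B
    p_ab = sum(a * b for a, b in zip(site_a, site_b)) / n  # both carry allele 1
    d = p_ab - p_a * p_b                       # classical LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return None if denom == 0 else d * d / denom


perfect_ld = r_squared([0, 0, 1, 1], [0, 0, 1, 1])   # alleles always co-occur
no_ld = r_squared([0, 0, 1, 1], [0, 1, 0, 1])        # alleles are independent
```

Averaging this quantity over SNP pairs binned by physical distance gives the kind of decay curve from which a "baseline" distance can be read off.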

To augment this population characterisation, we also performed some high throughput phenotyping, showing that the population has expected phenotypic diversity driven by genetics. To do this we took advantage of a small number of existing inbred southern line strains. As the numbers are low here, we cannot map any phenotypes (this will require the panel), but we can get an estimate of broad sense heritability, i.e., the proportion of variance of a measurement explained by the differences between inbred lines compared to the differences between individuals of the same line. This is the sort of calculation which is easy to do on inbred panels, and harder to deconvolute on family or outbred cases. For 6 out of 7 traits we chose to measure, we get reasonably high broad sense heritability measures.
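As a rough illustration of this calculation, the sketch below estimates broad sense heritability from a balanced set of inbred lines using a one-way ANOVA variance decomposition. This is a textbook method-of-moments estimator chosen for illustration, not necessarily the estimator used in the paper, and the function name and input layout are assumptions.

```python
from statistics import mean


def broad_sense_heritability(lines):
    """Estimate H^2 as between-line variance over total variance.

    `lines` maps a line name to a list of measurements; every line must
    have the same number (>= 2) of measured individuals (balanced design).
    """
    groups = [list(values) for values in lines.values()]
    k = len(groups)
    n = len(groups[0]) if groups else 0
    if k < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("need >= 2 lines with equal group sizes >= 2")
    grand_mean = mean([x for g in groups for x in g])
    # Mean squares from a one-way ANOVA with line as the only factor.
    ms_between = n * sum((mean(g) - grand_mean) ** 2 for g in groups) / (k - 1)
    ms_within = sum((x - mean(g)) ** 2 for g in groups for x in g) / (k * (n - 1))
    # Method-of-moments estimate of the between-line variance component,
    # truncated at zero because a variance cannot be negative.
    var_between = max((ms_between - ms_within) / n, 0.0)
    total = var_between + ms_within
    return None if total == 0 else var_between / total


# Toy data: two lines with no within-line spread, and two lines that do not differ.
fully_heritable = broad_sense_heritability({"line1": [1.0, 1.0], "line2": [3.0, 3.0]})
not_heritable = broad_sense_heritability({"line1": [1.0, 3.0], "line2": [1.0, 3.0]})
```

With only a handful of lines the estimate is noisy, which is one reason mapping has to wait for the full panel.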

We conclude that we have a good source wild population that is likely to lead to a successful near isogenic panel, and have started inbreeding. We are somewhere between the 3rd and 4th round of brother-sister inbreeding for 200 founder pairs, and all the lines look healthy. Traditionally one considers lines to be inbred after 8 brother-sister matings, so we’re almost half way through. Although in theory the generation time is 3 months, when one does this on 200 lines, the logistics mean it is closer to 4 to 5 months. Felix Loosli is overseeing the inbreeding at the KIT (Karlsruhe Institute for Technology), in Germany.
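To give a feel for what successive rounds of brother-sister mating achieve, here is a small sketch of the classical recursion for the inbreeding coefficient under full-sib mating, F(t) = (1 + 2F(t-1) + F(t-2)) / 4. It assumes the founders are unrelated and not themselves inbred, which is an idealisation for a wild catch; the function name is hypothetical.

```python
def sib_mating_inbreeding(generations):
    """Expected inbreeding coefficient after repeated full-sib matings.

    Uses the classical recursion F_t = (1 + 2*F_{t-1} + F_{t-2}) / 4 and
    assumes unrelated, non-inbred founders (an idealisation).
    """
    if generations < 0:
        raise ValueError("generations must be non-negative")
    previous, current = 0.0, 0.0
    for _ in range(generations):
        previous, current = current, (1 + 2 * current + previous) / 4
    return current


# Expected coefficient after each of the first 8 rounds of sib mating.
trajectory = [sib_mating_inbreeding(t) for t in range(1, 9)]
```

Under these assumptions the expected coefficient is about 0.59 after four rounds and about 0.83 after eight, so residual heterozygosity is still expected at the traditional cut-off.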

In addition to the main thrust of this paper, we also look at the population genetics of Medaka, as this is a large wild catch with complete genome wide coverage of SNPs. For me the most interesting thing is the relationship with the Northern strain of Medaka. Japan has a large mountain range roughly running down the middle of the main island of Japan (called the “Japanese Alps” in English – 日本アルプス Nihon Arupusu in Japanese). The Medaka fish found to the north of this are phenotypically different (more heavily pigmented, and prefer to live in shallower tanks), but do interbreed in the laboratory with Southern strains. An open question is whether there has been any partial interbreeding (called introgression) in the wild. Using sensitive tests for introgression between lineages, first developed to detect the Human/Neanderthal interbreeding, we do not see evidence for wild interbreeding between the Northern and Southern strains. This lends support to the view that these two “strains” are best thought of as separate species, and one might expect there to have been specific selection events for the phenotypic properties in both strains.
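The best-known test of this family is the ABBA-BABA (D statistic) test. The sketch below shows its core counting step on four aligned haploid sequences; it is a simplified illustration with assumed names and inputs, it omits the block-jackknife needed to judge significance, and it should not be read as the exact procedure used in the paper.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Patterson's D from four aligned haploid sequences.

    For the tree ((P1, P2), P3), Outgroup, with the outgroup allele taken as
    ancestral ("A") and the other allele as derived ("B"):
      ABBA sites: P2 and P3 share the derived allele
      BABA sites: P1 and P3 share the derived allele
    Without gene flow the two patterns are expected to be equally common.
    Returns None if there are no informative sites.
    """
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if a == o and b == c and b != o:
            abba += 1
        elif b == o and a == c and a != o:
            baba += 1
    if abba + baba == 0:
        return None
    return (abba - baba) / (abba + baba)


# Toy alignment: two ABBA sites, one BABA site, one uninformative site.
d = d_statistic("AGAA", "GAGA", "GGGA", "AAAA")
```

A value of D close to zero, relative to its standard error, is what one expects when there has been no introgression between P3 and either of the two sister lineages.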

We welcome comments on this paper, but also more generally on the use of the future Kiyosu near isogenic wild panel. We will be completely open about data collection and distribution of data for this panel, wherever possible using global archives to minimise the complications in getting access to the data.

If you are a teleost biologist, the fact that one can co-culture Medaka and Zebrafish means that many phenotypic assays set up for Zebrafish are probably quite easily transferable to Medaka; there is an existing (small) set of inbred lines which you could look to develop assays on. If you are a molecular quantitative geneticist, this panel should have better mapping properties than human or mouse, but a full range of vertebrate cell types. I am looking forward to using some of the statistical techniques developed on Arabidopsis and Drosophila, in particular the gene/environment partitioning components, and if you are interested in gene/environment interactions, this panel will offer some unique opportunities. Of course, this is not a panacea – it takes investment in husbandry details and logistics to bring another species into any system, and quantitative genetics is just one way of exploring genetic effects – forward and reverse genetics are very powerful techniques (which, incidentally, have both been used in Medaka).

If you are interested, do contact Ewan Birney (birney@ebi.ac.uk) or Ian Dunham (dunham@ebi.ac.uk) on aspects of the genomics/variation, and Felix Loosli (felix.loosli@kit.edu), Jochen Wittbrodt (Jochen.Wittbrodt@cos.uni-Heidelberg.de) or Kiyoshi Naruse (naruse@nibb.ac.jp) on Medaka husbandry and molecular biology.

Our paper: The influence of relatives on the efficiency and error rate of familial searching

This guest post is by Rori Rohlfs on her paper (along with coauthors): Rohlfs et al. The influence of relatives on the efficiency and error rate of familial searching. arXived here.

One of the ways we in the U.S. (and elsewhere) are likely to encounter genetic technologies in our lives is through forensic DNA identification.  Without knowing a specific quantity, clearly a huge number of us encounter forensic uses of DNA through court cases using genetic evidence (as survivors, defendants, jury members, etc.), DNA sample seizure during a stop or arrest (currently being considered by the U.S. Supreme Court), or by being genetically related to someone in an offender or arrestee DNA database (>11 million profiles in U.S. national database).  Despite the social relevance of forensic uses of DNA, it seems to me that forensic genetics isn’t much discussed by the population and evolutionary genetics crowd these days.

A while back, I became interested in a newer forensic technique known as familial searching, particularly in how some pop gen assumptions affect outcomes.  Familial searching is performed in cases where police have some DNA evidence from an unknown individual they want to identify but have no leads.  First, they’ll search offender/arrestee DNA database(s) for someone with a matching genetic profile (which is verrrry unlikely between unrelated individuals with complete profiles), who they’d then investigate.  In some jurisdictions (where familial searching is legal or practiced without explicit policy), if there’s no complete profile match, they’ll search the database again for a partially matching profile.  The idea is that the partial match may be due to a close genetic relationship.  (Of course, two unrelated individuals could reasonably have partially matching profiles by chance.  More on that later.)  Again, depending on the policies of the jurisdiction, the relatives of some number of partially matching individuals are investigated.  In the most high-profile case of familial searching in the U.S., the suspected genetic relative was subject to surreptitious DNA collection (i.e. being followed until leaving a DNA sample (in that case, a pizza crust)).  This sample was then tested directly against the original unknown sample and, since it showed a complete profile match, a suspect was identified.

Because familial searching effectively extends offender/arrestee databases to the genetic kin of people in the databases, it raises important questions like:

For a population geneticist, attempting to identify unknown genetic relatives of individuals in the database (rather than the known individuals in the database) introduces more uncertainty and some additional questions come up like:

  • With the genetic information in forensic profiles (typically 13-15 autosomal STRs, sometimes with 17 Y-chromosome STRs), what’s the chance that an unrelated individual coincidentally has a partially matching profile resembling a genetic relative?
  • What background allele and haplotype frequencies are considered in profile likelihood calculations?
  • What statistical methodology will be used to identify [specific?  non-specific?] genetic relatives?

All these questions are especially relevant when considering the intense multiple testing introduced by the relevant databases (>1.4 million profiles in the California offender database).  It can be challenging to get a handle on these questions because of widely varying policies and methodology between jurisdictions.  In New York City, it seems that an error-prone ‘allelic matching’ technique has been used to attempt to identify relatives in at least one case of robbery, leading to investigations of unrelated individuals.  In California, by contrast, familial searching is used specifically in cold cases of violent crimes with a continuing threat to public safety, and in 2011 Myers et al. published the likelihood ratio-based test statistic and procedure used in practice.

When I arrived at U.C. Berkeley for my postdoc, I met Monty Slatkin and Yun Song who, along with Erin Murphy, had attempted to estimate some error rates of familial searching, but were stymied by a lack of well-described methods currently used in practice.  When the statistical procedure used by California was published, we were excited to collaborate using practically relevant methodology.  Specifically, we estimated the false positive rate and power of familial searching using the California state procedure.  Generally, we found high power to detect a specified first-degree relationship (.79 to .99) and low (but still substantial in a multiple testing context) false positive rates of calling unrelated individuals as first-degree relatives (<5e-9 to 1e-5).  We got thinking about more distant Y-chromosome-sharing relatives (half-siblings, cousins, second cousins) who (barring mutation) share Y-haplotypes and some portion of their autosomal STRs IBD.  We estimated that these distant relatives could be mistaken for close relatives fairly often: in our simulations, 14-42% of half-sibs and 3-18% of first cousins were misidentified as siblings.

These rates are non-trivial, especially if you consider the size of databases and the fact that there are more distant relatives than near (so distant relatives are more likely to be present in databases).  Further, some of these genetic relationships are not known (even to the individuals themselves) so are not useful to investigation, but may still be interpreted as evidence of familial involvement, leading to investigation of uninvolved individuals.  Lucky for us, our collaborator Erin Murphy has a background in law and thoughtfully outlined some of the practical ramifications in the introduction and discussion of our paper.  Not the least of which is how extended families and communities in groups which are over-represented in databases (perhaps most obviously African Americans and Latinos) would be disproportionately impacted by misidentification of distant relatives as near relatives.
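A back-of-the-envelope sketch shows why small per-comparison error rates matter at database scale. It assumes, purely for illustration, that the quoted false positive rates apply independently to each comparison against an unrelated profile and that every profile in the database is unrelated to the source of the evidence; the function names are hypothetical.

```python
import math


def expected_false_positives(false_positive_rate, database_size):
    """Expected number of unrelated profiles flagged in one database search."""
    return false_positive_rate * database_size


def prob_at_least_one_false_positive(false_positive_rate, database_size):
    """Chance that a search flags one or more unrelated profiles.

    Computes 1 - (1 - rate)**size via log1p/expm1 so that very small
    rates are handled without losing precision.
    """
    return -math.expm1(database_size * math.log1p(-false_positive_rate))


california_db = 1_400_000  # ">1.4 million profiles", as quoted above

# Upper end of the reported range of false positive rates (1e-5).
high_rate_expected = expected_false_positives(1e-5, california_db)
high_rate_prob = prob_at_least_one_false_positive(1e-5, california_db)

# Lower end of the reported range (taking the "<5e-9" bound as the value).
low_rate_expected = expected_false_positives(5e-9, california_db)
low_rate_prob = prob_at_least_one_false_positive(5e-9, california_db)
```

Under these assumptions the upper end of the range corresponds to roughly 14 unrelated profiles flagged per search, while the lower end corresponds to well under one.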

We hope that this interdisciplinary manuscript broadens sorely needed technical and policy discussions of familial searching.

Our paper: Clusters of microRNAs emerge by new hairpins in existing transcripts

This guest post is by Antonio Marco (@antonio_marco_c) on his paper Marco et al. Clusters of microRNAs emerge by new hairpins in existing transcripts arXived here.


MicroRNAs are short regulatory sequences involved in virtually all biological processes. MicroRNAs are often organized in genomic clusters that produce polycistronic transcripts. It is well-known that protein-coding polycistronic transcripts are almost absent in animals (with a few exceptions in nematodes and ascidians). So where do these microRNA clusters come from, and why are they so prevalent? We tackle these questions in our paper “Clusters of microRNAs emerge by new hairpins in existing transcripts”, recently deposited in arXiv.

We envisioned several possible scenarios for the origin of polycistronic microRNAs: First, polycistronic microRNAs can emerge by genomic rearrangements that bring together pre-existing microRNAs. As in bacterial operons, the clustering of microRNAs with related functions can be advantageous, and the fusion of related microRNAs may be positively selected. We call this the ‘put together’ model. Alternatively, multiple microRNAs could become polycistronic as a by-product of genome reduction (this is analogous to Caenorhabditis elegans operons). This is the ‘left together’ model. A third model, called ‘tandem duplication’, implies that polycistronic microRNAs emerge by tandem duplication of single sequences. Lastly, new microRNAs can emerge de novo in already existing microRNA transcripts. We named this the ‘new hairpin’ model, since a novel microRNA first requires the formation of a hairpin-like structure in the transcript.

By reconstructing the evolutionary history of Drosophila melanogaster microRNAs we observed that the majority of microRNA clusters emerged by the formation of new microRNA precursors in existing transcribed microRNA genes (‘new hairpin’ model). We also find that gene duplication generated a minority of the clusters (‘tandem duplication’). However, we didn’t see any instance of fusion of pre-existing microRNA genes. Moreover, clusters rarely split or suffer rearrangements. Once a microRNA cluster is formed, it stays as a cluster or it is lost as a whole.

We propose a model for the origin and evolution of microRNA clusters. Polycistronic microRNAs are an extreme case of genetic linkage, in which a microRNA is typically a few nucleotides away from another microRNA. Once a cluster is formed, the linkage is so tight that recombination is dramatically reduced between these loci. We suggest that, because of strong selective interference between loci (Hill-Robertson effect), a microRNA under selective pressure strongly influences the evolutionary fate of any neighbouring microRNA. Even slightly deleterious microRNAs may be maintained in a population if selection in one microRNA of the cluster is strong enough. Currently, we are analysing polymorphism data to test the validity of our model in actual Drosophila populations.

In summary, we suggest that clusters of microRNAs emerge by non-adaptive mechanisms and they are maintained as a consequence of tight linkage.

Our paper: The causal meaning of Fisher’s average effect

This guest post is by James Lee on his paper with Carson Chow, “The causal meaning of Fisher’s average effect”, arXived here.

Early in graduate school, I took it upon myself to read Reinhard Bürger’s excellent treatise The Mathematical Theory of Selection, Recombination, and Mutation. Here I encountered the concepts of “average excess” and “average effect,” which were defined (rather unclearly to the casual reader) by Ronald Fisher in his presentation of the Fundamental Theorem of Natural Selection. Finding some of the distinctions made between these two concepts rather confusing, I directed some questions about them to the Yahoo quantitative genetics group. A respondent told me to consult Falconer (1985), which would “make things as clear as mud.”

My school did not have electronic access to Genetics Research at the time, so I did things the old-fashioned way and got my hands on a bound copy of the journal volume containing Falconer’s article. This masterpiece of exposition impressed me so much that I copied it down by hand; since the paper was at the end of the bound volume, the librarian was not able to scan it for me.

Falconer set out four distinct concepts that at various times have been put forth as definitions of the average excess, average effect, or both:

(A) Divide the population into two groups, one containing all A1A1 homozygotes and half of the heterozygotes, the other containing all A2A2 homozygotes and half of the heterozygotes. Take the difference between the conditional mean phenotypes of these two groups.

(B) Choose gametes bearing A1 and A2 at random. Measure the phenotypes of the mature organisms to which these gametes ultimately give rise. Take the difference between the conditional mean phenotypes of the A1 and A2 gametes.

(C) Regress the phenotype on the count (0, 1, or 2) of an arbitrarily chosen allele (A1 or A2). Take the regression coefficient of gene count.

(D) Take the average change in phenotype resulting from experimentally “zapping” one allele into the other, as if by mutation, in a zygote immediately after fertilization but before the onset of any developmental events.

Implicitly assuming that genotypes and environments are independent, Falconer then showed that all four concepts are equivalent under random mating. Now suppose that mating is not random. Then (A) and (B) are still equal and correspond to what Fisher called the average excess. The numerical value of this quantity is generally not equal to either (C) or (D), and in turn (C) and (D) are generally not equal to each other. Falconer concluded that (C) was what Fisher really meant by the average effect.
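A small numerical sketch can make the distinction between (A) and (C) tangible. It assumes a single biallelic locus with fixed genotypic values (so genotypes and environments are independent) and models non-random mating with an inbreeding coefficient F; the function names and the toy numbers are assumptions, not taken from Falconer or from our paper.

```python
def genotype_freqs(p, f):
    """Frequencies of (A1A1, A1A2, A2A2) for allele frequency p and inbreeding f."""
    q = 1 - p
    return (p * p + f * p * q, 2 * p * q * (1 - f), q * q + f * p * q)


def excess_difference(freqs, values):
    """Concept (A): difference between the two group means.

    Group 1 holds all A1A1 plus half the heterozygotes; group 2 holds
    all A2A2 plus the other half of the heterozygotes.
    """
    p11, p12, p22 = freqs
    g11, g12, g22 = values
    mean_a1 = (p11 * g11 + 0.5 * p12 * g12) / (p11 + 0.5 * p12)
    mean_a2 = (p22 * g22 + 0.5 * p12 * g12) / (p22 + 0.5 * p12)
    return mean_a1 - mean_a2


def regression_coefficient(freqs, values):
    """Concept (C): slope of genotypic value regressed on the count of A1."""
    counts = (2, 1, 0)
    mean_x = sum(f * x for f, x in zip(freqs, counts))
    mean_g = sum(f * g for f, g in zip(freqs, values))
    cov = sum(f * (x - mean_x) * (g - mean_g) for f, x, g in zip(freqs, counts, values))
    var = sum(f * (x - mean_x) ** 2 for f, x in zip(freqs, counts))
    return cov / var


values = (1.0, 0.5, -1.0)  # toy genotypic values for A1A1, A1A2, A2A2

random_mating = genotype_freqs(0.3, 0.0)
a_random = excess_difference(random_mating, values)
c_random = regression_coefficient(random_mating, values)

inbred = genotype_freqs(0.3, 0.5)
a_inbred = excess_difference(inbred, values)
c_inbred = regression_coefficient(inbred, values)
```

In this simple setting the two quantities coincide when F = 0, and with inbreeding (A) equals (1 + F) times (C), so they diverge as soon as mating is non-random. Concept (D) cannot be computed from these frequencies alone, since it depends on the outcome of a hypothetical intervention.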

This conclusion disturbed me a great deal. As any GWAS researcher knows, the (partial) regression of phenotype on gene count does not necessarily pick out any biologically meaningful quantity if genotypes and environments are dependent (“population stratification”). The fundamental issue here is that (C) is merely a statistical definition, appealing only to passive observations of a static population, whereas (D) is a causal definition turning on the result of a hypothetical experimental intervention. I no longer remember whether I had read Pearl (2009) by this point, but regardless my Spider Sense was unambiguously telling me that (D) was deeper and more meaningful than (C). Furthermore, if Fisher was not the one who coined the slogan “correlation is not causation,” he was certainly one of its first and most vocal promoters. How could Fisher, who invented randomization in experimental design, have preferred a correlational definition over a causal one when setting forth one of the key concepts in his evolutionary theory? Could it be because of the difficulty in translating (D) from words into mathematical symbols without something like Pearl’s do operator, which was not available in Fisher’s time?

This paradox continued to bother me over the next several years. Soon after my daughter was born, I indulged one of those wild impulses that strike the sleepless: I emailed my questions regarding this matter to Anthony W. F. Edwards, the last student of the great Fisher himself. Anthony very generously sent me some of his unpublished work and also his correspondence with Falconer about the very article that had spurred my thoughts. This correspondence spanned a period of more than 20 years, and it provided a very poignant portrait of Douglas Falconer as a scientist (Hill and Mackay, 2004). I did not immediately find the answers to my questions in the materials that Anthony sent to me, but they set me on the path toward finding the answers. These are presented in the paper, which will shortly appear in Genetics Research.

It turns out that Fisher’s average effect must be given a causal interpretation after all. For the detailed story of the reconciliation between (C) and (D), you will have to read the paper, written in collaboration with my supervisor Carson Chow. I am particularly pleased with our proof that the frequency-weighted mean of the (experimental) average effects at any locus is equal to zero. In most texts this relation is extrinsically applied to the multilocus case without any motivation except that it holds automatically for the (regression) average effects in the case of a single locus. The fact that this identity, which otherwise is an arbitrary constraint, can be derived from a definition positing the experimental replacement of a homologous gene is rather striking evidence for the importance of a causal interpretation.

Our investigation unexpectedly turned up many connections to other parts of population genetics. I like to think that in the pages of our paper one can hear many masters of population and quantitative genetics–Hardy, Fisher, Wright, Kimura, Falconer, Price, Ewens, Lessard–engaging in a deep conversation.

There are some issues raised in the paper that I am still contemplating. First, there is a complication when one considers randomly sampling a zygote and experimentally changing its genotype to the one whose value needs to be known; such an experiment inevitably changes the frequencies of the genotypes, and for theoretical reasons any ensuing frequency-dependent changes in the phenotypic means of the genotypes need to be excluded. I believe that one way to do this properly is by partitioning the effects of the experiment according to Wright’s path analysis–which would be rather ironic given the well-known antagonism between Wright and Fisher. Second, in the multilocus case it might be possible to mathematically describe special subsets of possible gene substitutions defining a given average effect that satisfy the property that all changes in Hardy-Weinberg and linkage disequilibria are “small.” We look forward to future work (by ourselves?) on these questions.

Note: The bibliography gives the name of the journal in which Falconer (1985) appears as Genetical Research. This is the same journal as Genetics Research; the name was changed about ten years ago.

Our paper: The effects of transcription factor competition on gene regulation

This guest post is by Radu Zabet on his papers “The effects of transcription factor competition on gene regulation” and “The influence of transcription factor competition on the relationship between occupancy and affinity”.

Transcription factors (TFs) find their genomic target sites by a combination of three-dimensional diffusion and one-dimensional translocation on the DNA. We previously developed the stochastic simulation framework GRiP (http://logic.sysbiol.cam.ac.uk/grip/) that allows the realistic representation of the target finding process. The following two papers show our application of GRiP to address a few interesting phenomena:

The effects of transcription factor competition on gene regulation
arXiv:1303.6793

The binding of site-specific TFs to their genomic target sites controls the transcription rate of the target genes. In this manuscript, we discuss the influence of TF abundance on the arrival time of TFs at their target sites as well as the time they stay bound to the DNA. We investigate the TF search process using stochastic simulations and find that molecular crowding on the DNA always leads to longer times required by TF molecules to locate their target sites as well as to lower occupancy. There is also an “emergent property” in cases where many molecules compete in a kind of molecular traffic jam on the DNA. This newly identified noise component may be a contributor to transcriptional noise, by affecting both the size of the fluctuations and the distribution of the arrival times (unimodal or bimodal).

The influence of transcription factor competition on the relationship between occupancy and affinity
arXiv:1303.6869

This manuscript deals with the discrepancy between “predicted occupancy” of a TF to a binding site on the basis of, say, a PWM, in contrast to a “measured occupancy” when we simulate the system with our GRiP framework. Again, we can show that absolute TF abundances play an important role in gene expression, and also provide a compelling case where selecting “the highest peaks” from a ChIP experiment may not necessarily identify the most affine binding sites. Our results showed that for medium and high affinity sites, TF competition does not play a significant role for genomic occupancy except in cases when the abundance of the TF is significantly increased, or when the PWM displays relatively low information content. Nevertheless, for medium and low affinity sites, an increase in TF abundance (for both cognate and non-cognate molecules) leads to an increase in occupancy at several sites.