Coalescence, genetic diversity and adaptation in sexual populations

Richard A. Neher, Taylor A. Kessinger, Boris I. Shraiman
(Submitted on 5 Jun 2013)

In diverse sexual populations, selection operates neither on the whole genome, which is repeatedly taken apart and reassembled by recombination, nor on individual alleles, which are tightly linked to the chromosomal neighborhood. Those tightly linked alleles affect each other's dynamics, which reduces the efficiency of selection and distorts patterns of genetic diversity. Inference of evolutionary history from diversity shaped by linked selection requires an understanding of these patterns. Here, we reexamine this problem in the light of recent progress in coalescent theory of rapidly adapting asexual populations. We present a simple but powerful scaling analysis identifying the unit of selection as the genomic “linkage block” with characteristic length \xi_b, which is determined in a self-consistent manner by the condition that the rate of recombination within the block is comparable to the fitness differences between different alleles of the block. We find that an asexual model with strength of selection tuned to that of the linkage block provides an excellent description of genetic diversity and the site frequency spectra when compared to computer simulations of population dynamics. This correspondence holds for the entire spectrum of strength of selection. When fitness differentials arise from the collective contribution of numerous weakly selected polymorphisms, the rate of adaptation increases as the square root of the recombination rate. The linkage block approximation thus provides a simple but powerful tool for understanding interference and collective behavior of dense weakly selected loci.

Interference limits resolution of selection pressures from linked neutral diversity

Benjamin H. Good, Aleksandra M. Walczak, Richard A. Neher, Michael M. Desai
(Submitted on 5 Jun 2013)

Pervasive natural selection can strongly influence observed patterns of genetic variation, but these effects remain poorly understood when multiple selected variants segregate in nearby regions of the genome. Classical population genetics fails to account for interference between linked mutations, which grows increasingly severe as the density of selected polymorphisms increases. Here, we describe a simple limit that emerges when interference is common, in which the fitness effects of individual mutations play a relatively minor role. Instead, molecular evolution is determined by the variance in fitness within the population, defined over an effectively asexual segment of the genome (a “linkage block”). We exploit this insensitivity in a new “coarse-grained” coalescent framework, which approximates the effects of many weakly selected mutations with a smaller number of strongly selected mutations with the same variance in fitness. This approximation generates accurate and efficient predictions for the genetic diversity that cannot be summarized by a simple reduction in effective population size. However, these results suggest a fundamental limit on our ability to resolve individual selection pressures from contemporary sequence data alone, since a wide range of parameters yield nearly identical patterns of sequence variability.

Reconstructing the Population Genetic History of the Caribbean

Andres Moreno-Estrada, Simon Gravel, Fouad Zakharia, Jacob L. McCauley, Jake K. Byrnes, Christopher R. Gignoux, Patricia A. Ortiz-Tello, Ricardo J. Martinez, Dale J. Hedges, Richard W. Morris, Celeste Eng, Karla Sandoval, Suehelay Acevedo-Acevedo, Juan Carlos Martinez-Cruzado, Paul J. Norman, Zulay Layrisse, Peter Parham, Esteban Gonzalez Burchard, Michael L. Cuccaro, Eden R. Martin, Carlos D. Bustamante
(Submitted on 3 Jun 2013)

The Caribbean basin is home to some of the most complex interactions in recent history among previously diverged human populations. Here, by making use of genome-wide SNP array data, we characterize ancestral components of Caribbean populations on a sub-continental level and unveil fine-scale patterns of population structure distinguishing insular from mainland Caribbean populations as well as from other Hispanic/Latino groups. We provide genetic evidence for an inland South American origin of the Native American component in island populations and for extensive pre-Columbian gene flow across the Caribbean basin. The Caribbean-derived European component shows significant differentiation from parental Iberian populations, presumably as a result of founder effects during the colonization of the New World. Based on demographic models, we reconstruct the complex population history of the Caribbean since the onset of continental admixture. We find that insular populations are best modeled as mixtures absorbing two pulses of African migrants, coinciding with early and maximum activity stages of the transatlantic slave trade. These two pulses appear to have originated in different regions within West Africa, imprinting two distinguishable signatures in present-day Afro-Caribbean genomes and shedding light on the genetic impact of the dynamics occurring during the slave trade in the Caribbean.

biobambam: tools for read pair collation based algorithms on BAM files

German Tischler, Steven Leonard
(Submitted on 4 Jun 2013)

Sequence alignment data is often ordered by coordinate (id of the reference sequence plus position on the sequence where the fragment was mapped) when stored in BAM files, as this simplifies the extraction of variants between the mapped data and the reference or of variants within the mapped data. In this order, paired reads are usually separated in the file, which complicates other applications such as duplicate marking or conversion to the FastQ format, which require access to the full information of the pairs. In this paper we introduce biobambam, an API for efficient BAM file reading that supports the efficient collation of alignments by read name without a complete re-sorting of the input file, together with some tools based on this API that perform tasks such as marking duplicate reads and conversion to the FastQ format. In comparison with previous approaches to problems involving the collation of alignments by read name, such as the BAM to FastQ or duplicate marking utilities in the Picard suite, the approach of biobambam can often perform an equivalent task more efficiently in terms of the required main memory and run-time.
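
The collation idea in the abstract can be illustrated with a minimal Python sketch. This is not biobambam's implementation or API: the records are plain (name, sequence) tuples standing in for BAM alignments, the function name collate_pairs is hypothetical, and each read name is assumed to occur exactly twice. A read is held in a dictionary only until its mate turns up, so the input is streamed once and never fully re-sorted.

```python
from typing import Dict, Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (read name, sequence); a stand-in for a BAM alignment


def collate_pairs(records: Iterable[Record]) -> Iterator[Tuple[Record, Record]]:
    """Yield mates together while streaming through coordinate-ordered records."""
    pending: Dict[str, Record] = {}  # reads still waiting for their mate
    for rec in records:
        name = rec[0]
        mate = pending.pop(name, None)
        if mate is None:
            pending[name] = rec  # first mate seen: hold it in memory
        else:
            yield mate, rec      # second mate seen: emit the complete pair
    # Reads left in `pending` never found a mate (orphans); ignored in this sketch.


# Hypothetical coordinate-ordered input in which mates are separated.
coordinate_sorted = [("r1", "ACGT"), ("r2", "GGCA"), ("r1", "TTAG"), ("r2", "CCAT")]
pairs = list(collate_pairs(coordinate_sorted))
```

Memory use in this sketch grows with the number of reads whose mate has not yet been seen, which is why the distance between mates in the file matters for any approach of this kind.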

Populations in statistical genetic modelling and inference

Daniel John Lawson
(Submitted on 4 Jun 2013)

What is a population? This review considers how a population may be defined in terms of understanding the structure of the underlying genetics of the individuals involved. The main approach is to consider statistically identifiable groups of randomly mating individuals, which is well defined in theory for any type of (sexual) organism. We discuss generative models using drift, admixture and spatial structure, and the ancestral recombination graph. These are contrasted with statistical models for inference, principal component analysis and other ‘non-parametric’ methods. The relationships between these approaches are explored with both simulated and real-data examples. The state-of-the-art practical software tools are discussed and contrasted. We conclude that populations are a useful theoretical construct that can be well defined in theory and often approximately exist in practice.

SOAPdenovo-Trans: De novo transcriptome assembly with short RNA-Seq reads

Yinlong Xie, Gengxiong Wu, Jingbo Tang, Ruibang Luo, Jordan Patterson, Shanlin Liu, Weihua Huang, Guangzhu He, Shengchang Gu, Shengkang Li, Xin Zhou, Tak-Wah Lam, Yingrui Li, Xun Xu, Gane Ka-Shu Wong, Jun Wang
(Submitted on 29 May 2013)

Motivation: Transcriptome sequencing has long been the favored method for quickly and inexpensively obtaining the sequences of many (but not all) of the genes from an organism with no reference genome. With the rapidly increasing throughputs and decreasing costs of next generation sequencing, RNA-Seq has gained in popularity; but given the short reads (e.g. 2 * 90 bp paired ends), de novo assembly to recover complete full length gene sequences remains an algorithmic challenge.
Results: We present SOAPdenovo-Trans, a de novo transcriptome assembler designed specifically for RNA-Seq. Its performance was evaluated on 2Gb and 5Gb of transcriptome data from mouse and rice. Using the known transcripts from these two well-annotated genomes as a benchmark, we assessed how SOAPdenovo-Trans and other competing software handle the practical issues of alternative splicing and variable expression levels. Compared with other de novo transcriptome assemblers, SOAPdenovo-Trans provides higher contiguity, lower redundancy, and faster execution.

Stochastic gene expression with delay

Martin Jansen, Peter Pfaffelhuber
(Submitted on 28 May 2013)

The expression of genes usually follows a two-step procedure. First, a gene (encoded in the genome) is transcribed, resulting in a strand of (messenger) RNA. Afterwards, the RNA is translated into protein. Classically, this gene expression is modeled using a Markov jump process including activation and deactivation of the gene, transcription and translation rates, together with degradation of RNA and protein. We extend this model by adding delays (with arbitrary distributions) to transcription and translation. Such delays can, for example, mean that RNA has to be transported to a different part of a cell before translation can be initiated. Already in the classical model, production of RNA and protein comes in bursts through activation and deactivation of the gene, resulting in a large variance of the number of RNA and proteins in equilibrium. We derive precise formulas for this second-order structure with the model including delay in equilibrium. As a general fact, the delay decreases the variance of the number of RNA and proteins.
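
For readers unfamiliar with the classical model mentioned above, here is a minimal Gillespie-style sketch in Python. It covers only the no-delay Markov jump process (gene activation and deactivation, transcription, translation, and degradation of RNA and protein); the delays studied in the paper are not included. The function name, the rate names and the rate values are all hypothetical choices made for illustration, not taken from the paper.

```python
import random


def simulate_gene_expression(t_end, rates, seed=0):
    """Simulate the classical (no-delay) model and return the state at t_end.

    The state is (gene_on, rna, protein), with gene_on equal to 0 or 1.
    """
    rng = random.Random(seed)
    t, gene_on, rna, protein = 0.0, 0, 0, 0
    while True:
        propensities = [
            rates["activate"] * (1 - gene_on),   # 0: gene switches on
            rates["deactivate"] * gene_on,       # 1: gene switches off
            rates["transcribe"] * gene_on,       # 2: RNA made while gene is on
            rates["degrade_rna"] * rna,          # 3: one RNA degraded
            rates["translate"] * rna,            # 4: one protein made
            rates["degrade_protein"] * protein,  # 5: one protein degraded
        ]
        total = sum(propensities)
        if total == 0:
            return gene_on, rna, protein         # nothing can happen any more
        t += rng.expovariate(total)              # waiting time to next event
        if t > t_end:
            return gene_on, rna, protein
        # Choose which event fires, with probability proportional to its rate.
        reaction = rng.choices(range(6), weights=propensities)[0]
        if reaction == 0:
            gene_on = 1
        elif reaction == 1:
            gene_on = 0
        elif reaction == 2:
            rna += 1
        elif reaction == 3:
            rna -= 1
        elif reaction == 4:
            protein += 1
        else:
            protein -= 1


# Hypothetical rates, in arbitrary units of inverse time.
example_rates = {
    "activate": 0.1, "deactivate": 0.5, "transcribe": 2.0,
    "degrade_rna": 1.0, "translate": 5.0, "degrade_protein": 0.1,
}
final_state = simulate_gene_expression(50.0, example_rates, seed=1)
```

Running the function over many seeds and recording the RNA and protein counts gives an empirical estimate of the equilibrium variance that the abstract refers to.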

Our paper: Genetic recombination is targeted towards gene promoter regions in dogs

This guest post is by Adam Auton (@adamauton) on his paper (along with coauthors) Genetic recombination is targeted towards gene promoter regions in dogs arXived here.

In this paper, we investigate the age-old question of how meiotic recombination is distributed in the genome of dogs. Before you stop reading, I’d like to spend a couple of paragraphs explaining why this is an interesting topic.

Recombination in mammalian genomes tends to occur in highly localized regions known as recombination hotspots. There are probably about 30,000 or so recombination hotspots in the human genome, each of which is about 2 kb wide, with recombination rates that can be thousands of times that of the surrounding region. Until a few years ago, the mechanism by which recombination hotspots are localized was largely unknown. This all began to change with the discovery of PRDM9 as the gene responsible for localizing hotspots [1-3]. The role of PRDM9 is to recognize and bind to specific DNA motifs in the genome, which are subsequently epigenetically marked as preferred locations of recombination.

PRDM9 turns out to be quite a fascinating gene. There is extensive variation in PRDM9 both within and across species, which points to strong selective pressures. Importantly, variation in PRDM9 can alter the recognized DNA motifs, thereby altering the locations of recombination hotspots in the genome. The high level of variation in PRDM9 between species appears to explain why recombination hotspots tend not to be shared between even closely related species, such as humans and chimpanzees.

We’ve learnt much about the importance of PRDM9 from studies in mice. Knock-out of Prdm9 in mice results in infertility and, most interestingly of all, certain alleles of mouse Prdm9 appear to be incompatible with each other [4,5]. Specifically, Mus m. musculus / Mus m. domesticus hybrid male mice are infertile if they are heterozygous for specific Prdm9 alleles. As such, Prdm9 has been called a ‘speciation gene’, as it has the potential to restrict gene flow between nascent species, and is the only known such example in mammals.

Given this importance, it was surprising to note that dogs, uniquely amongst mammals, appear to carry a dysfunctional version of PRDM9 [6]. This raises the question of how recombination occurs in dogs, and provides the motivation for our paper.

Estimating recombination rates directly is challenging, as only a few dozen events occur during any given meiosis. Therefore, to characterize large numbers of recombination events on a genome-wide basis, large pedigrees need to be genotyped, which can be both laborious and costly in non-model organisms. Luckily, an experiment of this nature has been previously performed in dogs, which revealed a recombination landscape that was reasonably consistent with patterns observed in other mammals [7].

However, without enormous sample sizes, such methods can only investigate patterns at scales far greater than the scale of individual hotspots. In order to investigate fine-scale patterns on a genome-wide basis, one must turn to indirect statistical methods, and it is this approach that we have adopted in our study. First, we whole-genome sequenced a collection of 51 outbred dogs and used this data to call single nucleotide polymorphisms. Having done so, we used the statistical method LDhat, which infers historical recombination rates via analysis of patterns of linkage disequilibrium. This is similar to the approach adopted by Axelsson et al. [8], who used microarrays to gain strong insights into canine recombination, although our use of sequencing allows us to investigate patterns at a much finer scale.
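
To make "patterns of linkage disequilibrium" concrete, here is a small Python sketch of one common summary of LD, the squared correlation r^2 between two biallelic sites. It is only an illustration of the quantity and is not how LDhat works internally; the function name r_squared and the toy haplotypes are made up for this example, and the input is assumed to be phased 0/1 alleles.

```python
from typing import Sequence


def r_squared(site_a: Sequence[int], site_b: Sequence[int]) -> float:
    """Squared correlation (r^2) between two biallelic sites.

    Each sequence holds 0/1 alleles, one entry per phased haplotype.
    """
    n = len(site_a)
    if n == 0 or n != len(site_b):
        raise ValueError("sites must cover the same non-empty set of haplotypes")
    p_a = sum(site_a) / n                                  # frequency of allele 1 at site A
    p_b = sum(site_b) / n                                  # frequency of allele 1 at site B
    p_ab = sum(a * b for a, b in zip(site_a, site_b)) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                                   # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denom


# Toy haplotypes: the first pair of sites is perfectly associated, the second is not.
linked = r_squared([0, 0, 1, 1], [0, 0, 1, 1])
unlinked = r_squared([0, 0, 1, 1], [0, 1, 0, 1])
```

Historical recombination between two sites breaks down their association, so low r^2 between nearby sites is the kind of signal that LD-based methods draw on.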

Our results agree nicely with the broad-scale experimental estimates, but reveal a quite unusual landscape at the fine scale. In particular, we find that canine recombination is strongly enriched in regions with high CpG content. As such, recombination rates are very high around the CpG-rich regions associated with gene promoters, which contrasts with other mammalian species, in which recombination hotspots do not show any particularly strong affinity for gene promoter regions. However, it is also reminiscent of patterns seen in Prdm9 knock-out mice which, although infertile, still produce double-strand breaks that cluster in gene promoter regions [9].

Interestingly, the dog genome is known to have very high CpG content. It has previously been suggested that one potential mechanism by which this may have occurred is biased gene conversion, which can result in the preferential transmission of G-C alleles over A-T alleles in the vicinity of recombination events. To investigate this phenomenon, we also sequenced a related fox species, which allowed us to see if G-C alleles are being gained or lost around recombination hotspots. We see that dog recombination hotspots do indeed appear to be acquiring GC content. This could imply a runaway process, by which CpG-rich regions have become recombinogenic, and hence have started to acquire more GC content, and hence become more recombinogenic.

As such, our results show that recombination in the dog genome appears to have some quite interesting properties. However, questions remain. The loss of PRDM9 in dogs appears to have resulted in some qualitative features that are consistent with knock-out mice, and yet dogs somehow avoid the associated infertility. Perhaps canine meiosis manages to complete without a PRDM9 ortholog, or perhaps an as-yet-unknown gene in the dog genome has adopted the role of PRDM9. In either case, the investigation of recombination in dogs provides a valuable means for building our understanding of how recombination occurs and its importance in shaping the genome.

1. Baudat F, Buard J, Grey C, Fledel-Alon A, Ober C, et al. (2010) PRDM9 is a major determinant of meiotic recombination hotspots in humans and mice. Science 327: 836-840.
2. Myers S, Bowden R, Tumian A, Bontrop RE, Freeman C, et al. (2010) Drive against hotspot motifs in primates implicates the PRDM9 gene in meiotic recombination. Science 327: 876-879.
3. Parvanov ED, Petkov PM, Paigen K (2010) Prdm9 controls activation of mammalian recombination hotspots. Science 327: 835.
4. Flachs P, Mihola O, Simecek P, Gregorova S, Schimenti JC, et al. (2012) Interallelic and intergenic incompatibilities of the Prdm9 (Hst1) gene in mouse hybrid sterility. PLoS Genet 8: e1003044.
5. Mihola O, Trachtulec Z, Vlcek C, Schimenti JC, Forejt J (2009) A mouse speciation gene encodes a meiotic histone H3 methyltransferase. Science 323: 373-375.
6. Oliver PL, Goodstadt L, Bayes JJ, Birtle Z, Roach KC, et al. (2009) Accelerated evolution of the Prdm9 speciation gene across diverse metazoan taxa. PLoS Genet 5: e1000753.
7. Wong AK, Ruhe AL, Dumont BL, Robertson KR, Guerrero G, et al. (2010) A comprehensive linkage map of the dog genome. Genetics 184: 595-605.
8. Axelsson E, Webster MT, Ratnakumar A, Ponting CP, Lindblad-Toh K (2012) Death of PRDM9 coincides with stabilization of the recombination landscape in the dog genome. Genome Res 22: 51-63.
9. Brick K, Smagulova F, Khil P, Camerini-Otero RD, Petukhova GV (2012) Genetic recombination is directed away from functional genomic elements in mice. Nature 485: 642-645.

Genetic recombination is targeted towards gene promoter regions in dogs

Adam Auton, Ying Rui Li, Jeffrey Kidd, Kyle Oliveira, Julie Nadel, J. Kim Holloway, Jessica J. Howard, Paula E. Cohen, John M. Greally, Jun Wang, Carlos D. Bustamante, Adam R. Boyko
(Submitted on 28 May 2013)

The identification of the H3K4 trimethylase, PRDM9, as the gene responsible for recombination hotspot localization has provided considerable insight into the mechanisms by which recombination is initiated in mammals. However, uniquely amongst mammals, canids appear to lack a functional version of PRDM9 and may therefore provide a model for understanding recombination that occurs in the absence of PRDM9, and thus how PRDM9 functions to shape the recombination landscape. We have constructed a fine-scale genetic map from patterns of linkage disequilibrium assessed using high-throughput sequence data from 51 free-ranging dogs, Canis lupus familiaris. Compared to genetic maps obtained in other mammalian species, the canine map is notably different at the fine scale. While broad-scale patterns exhibit typical properties, our fine-scale estimates indicate that recombination is more uniformly distributed than has been observed in other mammalian species. In addition, highly elevated recombination rates are observed in the vicinity of CpG-rich regions including gene promoter regions, but show little association with H3K4 trimethylation marks identified in spermatocytes. Finally, by comparison to genomic data from the Andean fox, Lycalopex culpaeus, we show that biased gene conversion is a plausible mechanism by which the high CpG content of the dog genome could have occurred.

Our paper: The rise and fall of the Phytophthora infestans lineage that triggered the Irish potato famine

This guest post is by Detlef Weigel (@WeigelWorld) and Hernán A. Burbano on their arXived paper [with coauthors] Yoshida et al. The rise and fall of the Phytophthora infestans lineage that triggered the Irish potato famine. arXived here and in press at eLife [to appear here].

This paper is the result of a great collaboration between a lab that specializes in ancient DNA (that of Johannes Krause from the University of Tübingen), an expert in pathogen systematics (the group of Marco Thines from the Senckenberg Museum and Goethe University in Frankfurt), two pathogen genomics labs (those of Sophien Kamoun from the Sainsbury Laboratory in Norwich and Frank Martin from the USDA in California), and our evolutionary genomics group at the Max Planck Institute in Tübingen (Hernán A. Burbano and Detlef Weigel).

Phytophthora infestans made history when it destroyed large parts of the European potato crop, beginning in 1845. Potato has its origin in the Andes, in the Southeast of modern Peru and Northwest of Bolivia, while the center of diversity of P. infestans is several thousand kilometers further north, in Mexico’s Toluca Valley. There, other Phytophthora species live on a broad range of host plants. At some point in its history, evolutionary events associated with repeat-driven genome expansion [1,2] endowed P. infestans with the genetic arsenal required to infect potato. The pathogen was introduced to Europe in 1845 via infected potato tubers from the United States, where potato blight had made its first appearance in 1843. In the ensuing European blight epidemic, Ireland was hit especially hard, because the virtual absence of independent farmers and a restrictive customs policy conspired with the disease caused by P. infestans, potato blight, to have disproportionately devastating effects. The Great Famine that struck Ireland was a decisive event in both European and American history. One million Irish died of starvation, and at least another million left the country, most of them for the USA.

This part of P. infestans history has been clear, but the relationship of the strain(s) that caused the nineteenth century epidemic to modern strains has been controversial. Before a range of genetically quite distinct P. infestans strains made their debut throughout the world some 40 years ago, the global population outside Mexico was dominated by a single strain, called US-1. Because of its prevalence, US-1 was long thought to have been the cause of the fatal outbreak in the nineteenth century. From the analysis of a single SNP in the mitochondrial genome, it was, however, concluded in 2001 that the nineteenth century strains were more closely related to the modern strains that prevail today [3].

In our new paper, we resolve this paradoxical view: while the historical pathogen strain, which we call HERB-1, indeed differs at this one position from US-1, which has a derived allele, HERB-1 is far more closely related to US-1 than to other modern strains. Molecular clock analyses show that the two strains probably separated from each other only a few years before the major European outbreak. HERB-1 seems to have dominated the global population without many genetic changes, and only in the twentieth century, after new potato varieties were introduced, was HERB-1 replaced by US-1 as the most successful P. infestans strain. We do not know for sure why HERB-1 was replaced, but we noted that the modern strains tend to be polyploid, while HERB-1 was diploid. We speculate that the increased genetic diversity in polyploid lineages was important for the success of US-1 (and other modern strains).

Our conclusions are based on Illumina sequencing of 11 herbarium samples of infected potato and tomato leaves collected in Ireland, the UK, Continental Europe and North America and preserved in the herbaria of the Botanical State Collection Munich and the Kew Gardens in London. Both herbaria placed a great deal of confidence in our abilities and were very generous in providing the dried plants. The degree of DNA preservation in the herbarium samples was impressive, much higher than in other examples of ancient DNA, and the majority of recovered DNA was from the host plant, with some samples having in addition over 20% pathogen DNA. In contrast to recent studies of historic human pathogens, no target DNA enrichment was required. We compared the historic samples with modern strains from Europe, Africa and North and South America as well as two closely related Phytophthora species. Due to the 150-year long period over which the individual samples had been collected, we were able to estimate with great confidence when the various P. infestans strains had emerged during evolutionary time. Here, too, we found connections with historic events: the first contact between Europeans and Americans in Mexico falls exactly into the time window in which the genetic diversity of P. infestans experienced a remarkable increase. Presumably, the social upheaval following the arrival of the Europeans somehow led to a spread of the pathogen at the beginning of the sixteenth century, which in turn accelerated its evolution.

The historical HERB-1 type is so far not known from modern collections, but we now have many diagnostic markers with which we can type the hundreds of modern isolates to determine whether a reservoir of HERB-1 persists somewhere. In addition, our work highlights that herbaria constitute a rich, so far untapped source for investigating real-time evolution.

Detlef Weigel, weigel@weigelworld.org

Hernán A. Burbano, hernan.burbano@tuebingen.mpg.de

Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany

1. Haas BJ, Kamoun S, Zody MC, Jiang RH, Handsaker RE, et al. (2009) Genome sequence and analysis of the Irish potato famine pathogen Phytophthora infestans. Nature 461: 393-398.
2. Raffaele S, Farrer RA, Cano LM, Studholme DJ, MacLean D, et al. (2010) Genome evolution following host jumps in the Irish potato famine pathogen lineage. Science 330: 1540-1543.
3. Ristaino JB, Groves CT, Parra GR (2001) PCR amplification of the Irish potato famine pathogen from historic specimens. Nature 411: 695-697.