This guest post is by Gavin Douglas (@gmdougla), Stephen Wright (@stepheniwright), and Tanja Slotte (@tanjaslotte) on their paper Douglas et al. Hybrid origins and the earliest stages of diploidization in the highly successful recent polyploid Capsella bursa-pastoris. bioRxived here.
photo credit: Tanja Slotte
In this preprint we investigate the mode of origin and evolutionary consequences of polyploidy in the highly successful tetraploid plant Capsella bursa-pastoris. We analyze high-coverage massively parallel genomic sequence data and first show that C. bursa-pastoris is a recent hybrid of two Capsella lineages leading to C. grandiflora and C. orientalis. This settles a long-standing uncertainty regarding the origins of C. bursa-pastoris. Second, we investigate patterns of nonfunctionalization and gene loss, and while we find little evidence for rapid, massive genome-wide fractionation, our analyses suggest that there is a decrease in the efficacy of selection in this recently formed tetraploid.
Allopolyploid origins of Capsella bursa-pastoris
Determining the evolutionary origin of C. bursa-pastoris has proven to be difficult and many contradictory hypotheses have been suggested, including that the tetraploid is an autopolyploid of a single Capsella species. Part of the complication has been the relatively low levels of sequence divergence between homeologous gene copies, and across the diploid Capsella lineages. Given population genomic sequences from all three Capsella species mentioned, we were able to address this question again with several different approaches.
C. bursa-pastoris undergoes disomic inheritance, meaning that genes duplicated as a result of polyploidy (homeologs) are independently inherited. Thus, one of the major tasks with our genomic data was to partition out the sequences from the two homeologous subgenomes. Because of the low levels of sequence divergence between homeologs (3% on average), this can be a challenging task. We took two approaches to generate phased genome sequence for inferring species origins: de novo assembly of short reads, and phasing of SNPs from reads mapped to the reference genome of the diploid Capsella rubella. Phylogenetic trees generated from de novo assemblies of these species overwhelmingly support one C. bursa-pastoris homeolog forming a clade with C. grandiflora and the other with C. orientalis. The distribution of SNPs and transposable elements shared between these species also strongly supports this hybridization model, and we estimate that the hybridization occurred within the last 100,000-300,000 years.
The hybrid origin of C. bursa-pastoris is exciting in part because its progenitor lineages have evolved divergently: C. orientalis and C. grandiflora differ in both mating system and geographical distribution. Given that C. bursa-pastoris is a highly successful weed found worldwide, it will be interesting in future work to assess whether this divergence between the C. orientalis and C. grandiflora lineages contributed to the tetraploid’s adaptability.
Decreased efficacy of selection in the recently arisen polyploid
Following genome duplication, the majority of redundant loci are expected to be lost over time through the process of diploidization. This model has been supported by several ancient polyploid events, including in Arabidopsis. Capsella bursa-pastoris presents an interesting model for studying the early phases of diploidization, and allows for an investigation of the rate of gene loss as well as the relative importance of relaxed selection vs. positive selection during early stages of gene inactivation. We searched for large deletions spanning genes using several approaches, based both on determining exact breakpoints and on cross-referencing low-coverage regions in C. bursa-pastoris with other Capsella species. Although we identified proportionately more large deletions segregating in C. bursa-pastoris than in the diploids, we did not find evidence for massive genomic changes in the tetraploid.
We were able to demonstrate relaxation of selection by analyzing the site frequency spectrum of SNPs segregating at 0-fold nonsynonymous sites in the three Capsella species. We also investigated SNPs with putatively deleterious effects, such as premature stop codons, segregating in the three Capsella species. Many of these SNPs are shared between the three species, although they segregate at low frequencies in C. grandiflora. Since this shared deleterious variation inherited from the progenitors seems to be responsible for a large proportion of the earliest stages of gene degeneration, these data support a model of genome fractionation that is given a “head start” from standing variation. A key message following from this result is that we should be giving more weight to purely historical explanations of gene loss when studying biased fractionation.
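For readers unfamiliar with the site frequency spectrum (SFS), here is a minimal sketch of how one can be tabulated from per-site derived allele counts. It is an illustration only, not the pipeline used in the preprint: the function names, the input format and the toy counts are all assumptions. Purifying selection tends to keep deleterious variants rare, so a smaller share of rare variants at nonsynonymous sites is one commonly used signal of relaxed selection.

```python
from collections import Counter


def site_frequency_spectrum(derived_counts, n_chromosomes):
    """Unfolded SFS: entry i-1 is the number of sites whose derived
    allele is seen exactly i times, for i = 1 .. n_chromosomes - 1.

    Sites with a count of 0 or n_chromosomes are not polymorphic in
    the sample and are ignored.
    """
    counts = Counter(derived_counts)
    return [counts.get(i, 0) for i in range(1, n_chromosomes)]


def proportion_rare(sfs, max_count=1):
    """Fraction of segregating sites whose derived allele count is
    at most max_count (singletons by default)."""
    total = sum(sfs)
    if total == 0:
        return 0.0
    return sum(sfs[:max_count]) / total


# Hypothetical derived allele counts at eight sites, 6 sampled chromosomes.
toy_counts = [1, 1, 2, 5, 1, 3, 0, 6]
toy_sfs = site_frequency_spectrum(toy_counts, 6)
print(toy_sfs, proportion_rare(toy_sfs))
```

Comparing `proportion_rare` between nonsynonymous and synonymous sites, and then between species, is one simple way to summarize the kind of contrast described above. Formal inference would fit an explicit model of selection and demography instead.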
Extraordinarily wide genomic impact of a selective sweep associated with the evolution of sex ratio distorter suppression
Emily A Hornett, Bruce Moran, Louise A Reynolds, Sylvain Charlat, Samuel Tazzyman, Nina Wedell, Chris D Jiggins, Gregory Hurst
Symbionts that distort their host's sex ratio by favouring the production and survival of females are common in arthropods. Their presence produces intense Fisherian selection to return the sex ratio to parity, typified by the rapid spread of host 'suppressor' loci that restore male survival/development. In this study, we investigated the genomic impact of a selective event of this kind in the butterfly Hypolimnas bolina. Through linkage mapping we first identified a genomic region that was necessary for males to survive Wolbachia-induced killing. We then investigated the genomic impact of the rapid spread of suppression that converted the Samoan population of this butterfly from a 100:1 female-biased sex ratio in 2001, to a 1:1 sex ratio by 2006. Models of this process revealed the potential for a chromosome-wide selective sweep. To measure the impact directly, the pattern of genetic variation before and after the episode of selection was compared. Significant changes in allele frequencies were observed over a 25cM region surrounding the suppressor locus, alongside generation of linkage disequilibrium. The presence of novel allelic variants in 2006 suggests that the suppressor was introduced via immigration rather than through de novo mutation. In addition, further sampling in 2010 indicated that many of the introduced variants were lost or had reduced in frequency since 2006. We hypothesise that this loss may have resulted from a period of purifying selection – removing deleterious material that introgressed during the initial sweep. Our observations of the impact of suppression of sex ratio distorting activity reveal an extraordinarily wide genomic imprint, reflecting its status as one of the strongest selective forces in nature.
Sequencing of the human IG light chain loci from a hydatidiform mole BAC library reveals locus-specific signatures of genetic diversity
Corey T Watson, Karyn Meltz Steinberg, Tina A Graves-Lindsay, Rene L Warren, Maika Malig, Jacqueline E Schein, Richard K Wilson, Rob Holt, Evan Eichler, Felix Breden
Germline variation at immunoglobulin gene (IG) loci is critical for pathogen-mediated immunity, but establishing complete reference sequences in these regions is problematic because of segmental duplications and somatically rearranged source DNA. We sequenced BAC clones from the essentially haploid hydatidiform mole, CHM1, across the light chain IG loci, kappa (IGK) and lambda (IGL), creating single haplotype representations of these regions. The IGL haplotype is 1.25Mb of contiguous sequence with four novel V gene and one novel C gene alleles and an 11.9kbp insertion. The IGK haplotype consists of two 644kbp proximal and 466kbp distal contigs separated by a gap also present in the reference genome sequence. Our effort added an additional 49kbp of unique sequence extending into this gap. The IGK haplotype contains six novel V gene and one novel J gene alleles and a 16.7kbp region with increased sequence identity between the two IGK contigs, exhibiting signatures of interlocus gene conversion. Our data facilitated the first comparison of nucleotide diversity between the light and IG heavy (IGH) chain haplotypes within a single genome, revealing a three to six fold enrichment in the IGH locus, supporting the theory that the heavy chain may be more important in determining antigenic specificity.
Hybrid origins and the earliest stages of diploidization in the highly successful recent polyploid Capsella bursa-pastoris
Gavin Douglas, Gesseca Gos, Kim Steige, Adriana Salcedo, Karl Holm, J. Arvid Ågren, Khaled Hazzouri, Wei Wang, Adrian E. Platts, Emily B. Josephs, Robert J. Williamson, Barbara Neuffer, Martin Lascoux, Tanja Slotte, Stephen Wright
Whole genome duplication events have occurred repeatedly during flowering plant evolution, and there is growing evidence for predictable patterns of gene retention and loss following polyploidization. Despite these important insights, the rate and processes governing the earliest stages of diploidization remain uncertain, and the relative importance of genetic drift vs. natural selection in the process of gene degeneration and loss is unclear. Here we conduct whole genome resequencing in Capsella bursa-pastoris, a recently formed tetraploid with one of the most widespread species distributions of any angiosperm. Whole genome data provide strong support for recent hybrid origins of the tetraploid species within the last 100-300,000 years from two diploid progenitors in the Capsella genus. Major-effect inactivating mutations are frequent, but many were inherited from the parental species and show no evidence of being fixed by positive selection. Despite a lack of large-scale gene loss, we observe a shift in the efficacy of natural selection genome-wide. Our results suggest that the earliest stages of diploidization are associated with quantitative genome-wide shifts in the strength and efficacy of selection rather than rapid gene loss, and that nonfunctionalization can receive a ‘head start’ through deleterious variants found in parental diploid populations.
Probabilities of Fitness Consequences for Point Mutations Across the Human Genome
Brad Gulko, Ilan Gronau, Melissa J Hubisz, Adam Siepel
The identification of noncoding functional elements based on high-throughput genomic data remains an important open problem. Here we describe a novel computational approach for estimating the probability that a point mutation at each nucleotide position in a genome will influence organismal fitness. These fitness consequence (fitCons) scores can be interpreted as an evolution-based measure of potential genomic function. We first partition the genome into clusters of positions having distinct functional genomic “fingerprints,” based on cell-type-specific DNase-seq, RNA-seq, and histone modification data. Then we estimate the probability of fitness consequences for each cluster from associated patterns of genetic polymorphism and divergence using a recently developed probabilistic method called INSIGHT. We have generated fitCons scores for three human cell types based on publicly available genomic data and made them available as UCSC Genome Browser tracks. Like conventional evolutionary conservation scores, fitCons scores are clearly elevated in known coding and noncoding functional elements, but they show considerably better sensitivity than conservation scores for many noncoding elements. In addition, they perform exceptionally well in distinguishing ChIP-seq-supported transcription factor binding sites, expression quantitative trait loci, and predicted enhancers from putatively nonfunctional sequences. The fitCons scores indicate that 4.2-7.5% of nucleotide positions in the human genome have influenced fitness since the human-chimpanzee divergence. In contrast to several recent studies, they suggest that recent evolutionary turnover has had a relatively modest impact on the functional content of the genome. Our approach provides a unique new measure of genomic function that complements measures based on evolutionary conservation or functional genomics alone and is particularly well suited for characterizing turnover and evolutionary novelty.
Posterior predictive checks to quantify lack-of-fit in admixture models of latent population structure
David Mimno, David M Blei, Barbara E Engelhardt
Subjects: Methodology (stat.ME); Genomics (q-bio.GN); Populations and Evolution (q-bio.PE); Applications (stat.AP)
Admixture models are a ubiquitous approach to capture latent population structure in genetic samples. Despite the widespread application of admixture models, little thought has been devoted to the quality of the model fit or the accuracy of the estimates of parameters of interest for a particular study. Here we develop methods for validating admixture models based on posterior predictive checks (PPCs), a Bayesian method for assessing the quality of a statistical model. We develop PPCs for five population-level statistics of interest: within-population genetic variation, background linkage disequilibrium, number of ancestral populations, between-population genetic variation, and the downstream use of admixture parameters to correct for population structure in association studies. Using PPCs, we evaluate the quality of the model estimates for four qualitatively different population genetic data sets: the POPRES European individuals, the HapMap phase 3 individuals, continental Indians, and African American individuals. We found that the same model fitted to different genomic studies resulted in highly study-specific results when evaluated using PPCs, illustrating the utility of PPCs for model-based analyses in large genomic studies.
Epidemic clones, oceanic gene pools and epigenotypes in the free living marine pathogen Vibrio parahaemolyticus
Yujun Cui, Xianwei Yang, Xavier Didelot, Chenyi Guo, Dongfang Li, Yanfeng Yan, Yiquan Zhang, Yanting Yuan, Huanming Yang, Jian Wang, Jun Wang, Yajun Song, Dongsheng Zhou, Daniel Falush, Ruifu Yang
Subjects: Populations and Evolution (q-bio.PE)
In outbreeding organisms, genetic variation is reassorted each generation, leading to geographic gene pools. By contrast, bacterial clones can spread and adapt independently, leading to a wide variety of possible genetic structures. Here we investigated global patterns of variation in 157 whole genome sequences of Vibrio parahaemolyticus, a free-living and seafood-associated marine bacterium. Pandemic clones, responsible for recent outbreaks of gastroenteritis in humans, have spread globally. However, there are oceanic gene pools, one located in the oceans surrounding Asia and another in the Mexican Gulf. Frequent recombination means that most isolates have acquired the genetic profile of their current location. Within oceanic gene pools, there is nevertheless the opportunity for substructure, for example due to niche partitioning by different clones. We investigated this structure by calculating the effective population size in two different ways. Under standard population genetic models, the two estimates should give similar answers, but we found a 30-fold difference. This discrepancy provides evidence for an ‘epigenotype’ model in which distinct ecotypes are maintained by selection on an otherwise homogeneous genetic background. To investigate the genetic factors involved, we used 54 unrelated isolates to conduct a genome-wide scan for epistatically interacting loci. We found a single example of strong epistasis between distant genome regions. One of the genes involved in this interaction has previously been implicated in biofilm formation, while the other is a hypothetical protein. Further work will allow a detailed understanding of how selection acts to structure the pattern of variation within natural bacterial populations.
Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data
Anand Bhaskar, Y.X. Rachel Wang, Yun S. Song
With the recent increase in study sample sizes in human genetics, there has been growing interest in inferring historical population demography from genomic variation data. Here, we present an efficient inference method that can scale up to very large samples, with tens or hundreds of thousands of individuals. Specifically, by utilizing analytic results on the expected frequency spectrum under the coalescent and by leveraging the technique of automatic differentiation, which allows us to compute gradients exactly, we develop a very efficient algorithm to infer piecewise-exponential models of the historical effective population size from the distribution of sample allele frequencies. Our method is orders of magnitude faster than previous demographic inference methods based on the frequency spectrum. In addition to inferring demography, our method can also accurately estimate locus-specific mutation rates. We perform extensive validation of our method on simulated data and show that it can accurately infer multiple recent epochs of rapid exponential growth, a signal which is difficult to pick up with small sample sizes. Lastly, we apply our method to analyze data from recent sequencing studies, including a large-sample exome-sequencing dataset of tens of thousands of individuals assayed at a few hundred genic regions.
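As background for the "expected frequency spectrum under the coalescent" mentioned in this abstract, the sketch below computes the textbook baseline for a constant-size population under the standard neutral model, where the expected number of sites with i derived copies in a sample of n is theta / i. This is not the authors' method, which handles piecewise-exponential size histories. The function names are illustrative assumptions.

```python
def expected_sfs_constant_size(n, theta):
    """Expected unfolded SFS for a sample of n chromosomes under the
    standard neutral coalescent with constant population size:
    E[xi_i] = theta / i for i = 1 .. n - 1, where theta is the
    population-scaled mutation rate for the region."""
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    return [theta / i for i in range(1, n)]


def normalized(sfs):
    """Rescale a spectrum so its entries sum to 1, which removes the
    dependence on theta and leaves only the shape."""
    total = sum(sfs)
    return [x / total for x in sfs]


baseline = expected_sfs_constant_size(10, theta=5.0)
print(normalized(baseline))
```

Recent rapid growth adds an excess of rare variants relative to this 1/i shape, and very rare variants are only observable in large samples, which is consistent with the abstract's point that such growth is difficult to detect with small sample sizes.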
Long-term balancing selection in LAD1 maintains a missense trans-species polymorphism in humans, chimpanzees and bonobos
João C. Teixeira, Cesare de Filippo, Antje Weihmann, Juan R. Meneu, Fernando Racimo, Michael Dannemann, Birgit Nickel, Anne Fischer, Michel Halbwax, Claudine Andre, Rebeca Atencia, Matthias Meyer, Genís Parra, Svante Pääbo, Aida M. Andrés
Balancing selection maintains advantageous genetic and phenotypic diversity in populations. When selection acts for long evolutionary periods selected polymorphisms may survive species splits and segregate in present-day populations of different species. Here, we investigated the role of long-term balancing selection in the evolution of protein-coding sequences in the Pan-Homo clade. We sequenced the exome of 20 humans, 20 chimpanzees and 20 bonobos and detected eight coding trans-species polymorphisms (trSNPs) that are shared among the three species and have segregated for approximately 14 million years of independent evolution. While the majority of these trSNPs were found in three genes of the MHC cluster, we also uncovered one coding trSNP (rs12088790) in the gene LAD1. All these trSNPs show clustering of sequences by allele rather than by species and also exhibit other signatures of long-term balancing selection, such as segregating at intermediate frequency and lying in a locus with high genetic diversity. Here we focus on the trSNP in LAD1, a gene that encodes for Ladinin-1, a collagenous anchoring filament protein of basement membrane that is responsible for maintaining cohesion at the dermal-epidermal junction; the gene is also an autoantigen responsible for linear IgA disease. This trSNP results in a missense change (Leucine257Proline) and, besides altering the protein sequence, is associated with changes in gene expression of LAD1.
Approximation to the distribution of fitness effects across functional categories in human segregating polymorphisms
Fernando Racimo, Joshua G Schraiber
Quantifying the proportion of polymorphic mutations that are deleterious or neutral is of fundamental importance to our understanding of evolution, disease genetics and the maintenance of variation genome-wide. Here, we develop an approximation to the distribution of fitness effects (DFE) of segregating single-nucleotide mutations in humans. Unlike previous methods, we do not assume that synonymous mutations are neutral or not strongly selected, and we do not rely on fitting the DFE of all new nonsynonymous mutations to a single probability distribution, which is poorly motivated on a biological level. We rely on a previously developed method that utilizes a variety of published annotations (including conservation scores, protein deleteriousness estimates and regulatory data) to score all mutations in the human genome based on how likely they are to be affected by negative selection, controlling for mutation rate. We map this score to a scale of fitness coefficients via maximum likelihood using diffusion theory and a Poisson random field model on SNP data. Our method serves to approximate the deleterious DFE of mutations that are segregating, regardless of their genomic consequence. We can then compare the proportion of mutations that are negatively selected or neutral across various categories, including different types of regulatory sites. We observe that the distribution of intergenic polymorphisms is highly peaked at neutrality, while the distribution of nonsynonymous polymorphisms is bimodal, with a neutral peak and a second peak at s ≈ −10^(−4). Other types of polymorphisms have shapes that fall roughly in between these two. We find that transcriptional start sites, strong CTCF-enriched elements and enhancers are the regulatory categories with the largest proportion of deleterious polymorphisms.