Hybrid origins and the earliest stages of diploidization in the highly successful recent polyploid Capsella bursa-pastoris

Gavin Douglas, Gesseca Gos, Kim Steige, Adriana Salcedo, Karl Holm, J. Arvid Ågren, Khaled Hazzouri, Wei Wang, Adrian E. Platts, Emily B. Josephs, Robert J. Williamson, Barbara Neuffer, Martin Lascoux, Tanja Slotte, Stephen Wright

Whole genome duplication events have occurred repeatedly during flowering plant evolution, and there is growing evidence for predictable patterns of gene retention and loss following polyploidization. Despite these important insights, the rate and processes governing the earliest stages of diploidization remain uncertain, and the relative importance of genetic drift vs. natural selection in the process of gene degeneration and loss is unclear. Here we conduct whole genome resequencing in Capsella bursa-pastoris, a recently formed tetraploid with one of the most widespread species distributions of any angiosperm. Whole genome data provide strong support for recent hybrid origins of the tetraploid species within the last 100,000-300,000 years from two diploid progenitors in the Capsella genus. Major-effect inactivating mutations are frequent, but many were inherited from the parental species and show no evidence of being fixed by positive selection. Despite a lack of large-scale gene loss, we observe a shift in the efficacy of natural selection genome-wide. Our results suggest that the earliest stages of diploidization are associated with quantitative genome-wide shifts in the strength and efficacy of selection rather than rapid gene loss, and that nonfunctionalization can receive a ‘head start’ through deleterious variants found in parental diploid populations.

Probabilities of Fitness Consequences for Point Mutations Across the Human Genome

Brad Gulko, Ilan Gronau, Melissa J Hubisz, Adam Siepel

The identification of noncoding functional elements based on high-throughput genomic data remains an important open problem. Here we describe a novel computational approach for estimating the probability that a point mutation at each nucleotide position in a genome will influence organismal fitness. These fitness consequence (fitCons) scores can be interpreted as an evolution-based measure of potential genomic function. We first partition the genome into clusters of positions having distinct functional genomic “fingerprints,” based on cell-type-specific DNase-seq, RNA-seq, and histone modification data. Then we estimate the probability of fitness consequences for each cluster from associated patterns of genetic polymorphism and divergence using a recently developed probabilistic method called INSIGHT. We have generated fitCons scores for three human cell types based on publicly available genomic data and made them available as UCSC Genome Browser tracks. Like conventional evolutionary conservation scores, fitCons scores are clearly elevated in known coding and noncoding functional elements, but they show considerably better sensitivity than conservation scores for many noncoding elements. In addition, they perform exceptionally well in distinguishing ChIP-seq-supported transcription factor binding sites, expression quantitative trait loci, and predicted enhancers from putatively nonfunctional sequences. The fitCons scores indicate that 4.2-7.5% of nucleotide positions in the human genome have influenced fitness since the human-chimpanzee divergence. In contrast to several recent studies, they suggest that recent evolutionary turnover has had a relatively modest impact on the functional content of the genome. Our approach provides a unique new measure of genomic function that complements measures based on evolutionary conservation or functional genomics alone and is particularly well suited for characterizing turnover and evolutionary novelty.

The dynamics of sperm cooperation in a competitive environment


H. S. Fisher, L. Giomi, H. E. Hoekstra, L. Mahadevan
(Submitted on 2 Jul 2014)

Sperm cooperation has evolved in a variety of taxa and is often considered a response to sperm competition, yet the benefit of this form of collective movement remains unclear. Here we use fine-scale imaging and a minimal mathematical model to study sperm aggregation in the rodent genus Peromyscus. We demonstrate that as the number of sperm cells in an aggregate increases, the group moves with more persistent linearity but without increasing speed; this benefit, however, is offset in larger aggregates as the geometry of the group forces sperm to swim against one another. The result is a non-monotonic relationship between aggregate size and average velocity with both a theoretically predicted and empirically observed optimum of 6-7 sperm/aggregate. To understand the role of sexual selection in driving these sperm group dynamics, we compared two sister species with divergent mating systems and found that sperm of P. maniculatus (highly promiscuous), which have evolved under intense competition, form optimal-sized aggregates more often than sperm of P. polionotus (strictly monogamous), which lack competition. Our combined mathematical and experimental study of coordinated sperm movement reveals the importance of geometry, motion and group size on sperm velocity and suggests how these physical variables interact with evolutionary selective pressures to regulate cooperation in competitive environments.

Pervasive variation of transcription factor orthologs contributes to regulatory network evolution

Shilpa Nadimpalli, Anton V. Persikov, Mona Singh
Comments: 29 pages, 5 figures, 5 supplemental figures, 3 supplemental tables
Subjects: Genomics (q-bio.GN)

Differences in transcriptional regulatory networks underlie much of the phenotypic variation observed across organisms. Changes to cis-regulatory elements are widely believed to be the predominant means by which regulatory networks evolve, yet examples of regulatory network divergence due to transcription factor (TF) variation have also been observed. To systematically ascertain the extent to which TFs contribute to regulatory divergence, we analyzed the evolution of the largest class of metazoan TFs, Cys2-His2 zinc finger (C2H2-ZF) TFs, across 12 Drosophila species spanning ~45 million years of evolution. Remarkably, we uncovered that a significant fraction of all C2H2-ZF 1-to-1 orthologs in flies exhibit variations that can affect their DNA-binding specificities. In addition to loss and recruitment of C2H2-ZF domains, we found diverging DNA-contacting residues in ~47% of domains shared between D. melanogaster and the other fly species. These diverging DNA-contacting residues, found in ~66% of the D. melanogaster C2H2-ZF genes in our analysis and corresponding to ~24% of all annotated D. melanogaster TFs, show evidence of functional constraint: they tend to be conserved across phylogenetic clades and evolve more slowly than other diverging residues. These same variations were rarely found as polymorphisms within a population of D. melanogaster flies, indicating their rapid fixation. The predicted specificities of these dynamic domains gradually change across phylogenetic distances, suggesting stepwise evolutionary trajectories for TF divergence. Further, whereas proteins with conserved C2H2-ZF domains are enriched in developmental functions, those with varying domains exhibit no functional enrichments. Our work suggests that a subset of highly dynamic and largely unstudied TFs is a likely source of regulatory variation in Drosophila and other metazoans.

Posterior predictive checks to quantify lack-of-fit in admixture models of latent population structure


David Mimno, David M Blei, Barbara E Engelhardt
Subjects: Methodology (stat.ME); Genomics (q-bio.GN); Populations and Evolution (q-bio.PE); Applications (stat.AP)

Admixture models are a ubiquitous approach to capture latent population structure in genetic samples. Despite the widespread application of admixture models, little thought has been devoted to the quality of the model fit or the accuracy of the estimates of parameters of interest for a particular study. Here we develop methods for validating admixture models based on posterior predictive checks (PPCs), a Bayesian method for assessing the quality of a statistical model. We develop PPCs for five population-level statistics of interest: within-population genetic variation, background linkage disequilibrium, number of ancestral populations, between-population genetic variation, and the downstream use of admixture parameters to correct for population structure in association studies. Using PPCs, we evaluate the quality of the model estimates for four qualitatively different population genetic data sets: the POPRES European individuals, the HapMap phase 3 individuals, continental Indians, and African American individuals. We found that the same model fitted to different genomic studies resulted in highly study-specific results when evaluated using PPCs, illustrating the utility of PPCs for model-based analyses in large genomic studies.
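As a rough illustration of what a posterior predictive check does, here is a minimal sketch on a deliberately simple model, not the admixture model or the five statistics from the abstract. It assumes a single biallelic locus, genotypes coded 0/1/2, a Beta(1, 1) prior on the allele frequency, and random mating within one population. The discrepancy statistic (number of heterozygotes) and the function names are hypothetical choices made for this example.

```python
import random


def heterozygote_count(genotypes):
    """Discrepancy statistic: number of heterozygous (coded 1) individuals."""
    return sum(1 for g in genotypes if g == 1)


def ppc_heterozygotes(genotypes, n_draws=2000, seed=0):
    """Posterior predictive p-value for the heterozygote count.

    Assumed model (illustration only): each genotype is Binomial(2, p)
    with a Beta(1, 1) prior on the allele frequency p.
    Returns the fraction of replicated data sets with no more
    heterozygotes than were observed.
    """
    rng = random.Random(seed)
    n = len(genotypes)
    alt_alleles = sum(genotypes)
    observed = heterozygote_count(genotypes)

    as_extreme = 0
    for _ in range(n_draws):
        # Draw p from its conjugate Beta posterior.
        p = rng.betavariate(1 + alt_alleles, 1 + 2 * n - alt_alleles)
        # Simulate a replicate sample of n diploid genotypes under the model.
        replicate = [(rng.random() < p) + (rng.random() < p) for _ in range(n)]
        if heterozygote_count(replicate) <= observed:
            as_extreme += 1
    return as_extreme / n_draws


# Two hidden subpopulations fixed for different alleles: no heterozygotes.
structured = [0] * 50 + [2] * 50
# A sample close to what a single randomly mating population would give.
well_mixed = [0] * 25 + [1] * 50 + [2] * 25

p_structured = ppc_heterozygotes(structured)
p_well_mixed = ppc_heterozygotes(well_mixed)
```

A value near 0 for the structured sample indicates that the single-population model almost never reproduces so few heterozygotes, which is the kind of lack-of-fit a PPC is meant to expose.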

Epidemic clones, oceanic gene pools and epigenotypes in the free living marine pathogen Vibrio parahaemolyticus

Yujun Cui, Xianwei Yang, Xavier Didelot, Chenyi Guo, Dongfang Li, Yanfeng Yan, Yiquan Zhang, Yanting Yuan, Huanming Yang, Jian Wang, Jun Wang, Yajun Song, Dongsheng Zhou, Daniel Falush, Ruifu Yang
Subjects: Populations and Evolution (q-bio.PE)

In outbreeding organisms, genetic variation is reassorted each generation, leading to geographic gene pools. By contrast, bacterial clones can spread and adapt independently, leading to a wide variety of possible genetic structures. Here we investigated global patterns of variation in 157 whole genome sequences of Vibrio parahaemolyticus, a free-living and seafood-associated marine bacterium. Pandemic clones, responsible for recent outbreaks of gastroenteritis in humans, have spread globally. However, there are oceanic gene pools, one located in the oceans surrounding Asia and another in the Mexican Gulf. Frequent recombination means that most isolates have acquired the genetic profile of their current location. Within oceanic gene pools, there is nevertheless the opportunity for substructure, for example due to niche partitioning by different clones. We investigated this structure by calculating the effective population size in two different ways. Under standard population genetic models, the two estimates should give similar answers, but we found a 30-fold difference. This discrepancy provides evidence for an ‘epigenotype’ model in which distinct ecotypes are maintained by selection on an otherwise homogeneous genetic background. To investigate the genetic factors involved, we used 54 unrelated isolates to conduct a genome-wide scan for epistatically interacting loci. We found a single example of strong epistasis between distant genome regions. One of the genes involved in this interaction has previously been implicated in biofilm formation, while the other is a hypothetical protein. Further work will allow a detailed understanding of how selection acts to structure the pattern of variation within natural bacterial populations.

Most viewed on Haldane’s Sieve: June 2014

The most viewed posts on Haldane’s Sieve in June 2014 were:

Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data

Anand Bhaskar, Y.X. Rachel Wang, Yun S. Song

With the recent increase in study sample sizes in human genetics, there has been growing interest in inferring historical population demography from genomic variation data. Here, we present an efficient inference method that can scale up to very large samples, with tens or hundreds of thousands of individuals. Specifically, by utilizing analytic results on the expected frequency spectrum under the coalescent and by leveraging the technique of automatic differentiation, which allows us to compute gradients exactly, we develop a very efficient algorithm to infer piecewise-exponential models of the historical effective population size from the distribution of sample allele frequencies. Our method is orders of magnitude faster than previous demographic inference methods based on the frequency spectrum. In addition to inferring demography, our method can also accurately estimate locus-specific mutation rates. We perform extensive validation of our method on simulated data and show that it can accurately infer multiple recent epochs of rapid exponential growth, a signal which is difficult to pick up with small sample sizes. Lastly, we apply our method to analyze data from recent sequencing studies, including a large-sample exome-sequencing dataset of tens of thousands of individuals assayed at a few hundred genic regions.
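For orientation, the simplest case of the "expected frequency spectrum under the coalescent" can be written down directly. Under the standard neutral coalescent with constant population size, the expected number of sites at which the derived allele appears i times in a sample of n sequences is θ/i. The sketch below covers only that constant-size baseline; the piecewise-exponential models and automatic differentiation described in the abstract are not reproduced, and the function names are hypothetical.

```python
def harmonic_number(m):
    """Return 1 + 1/2 + ... + 1/m."""
    return sum(1.0 / i for i in range(1, m + 1))


def expected_sfs(n, theta):
    """Expected unfolded site frequency spectrum for a sample of n sequences.

    Entry i-1 is the expected number of sites where the derived allele
    is carried by exactly i sequences (i = 1 .. n-1), assuming neutrality
    and a constant population size: E[xi_i] = theta / i.
    """
    return [theta / i for i in range(1, n)]


def watterson_theta(segregating_sites, n):
    """Watterson's estimator: theta = S / (1 + 1/2 + ... + 1/(n-1))."""
    return segregating_sites / harmonic_number(n - 1)


# Expected spectrum for 10 sequences with theta = 2.0.
spectrum = expected_sfs(10, 2.0)
# Feeding the expected number of segregating sites back in recovers theta.
theta_hat = watterson_theta(sum(spectrum), 10)
```

Departures of an observed spectrum from this 1/i shape, such as an excess of rare variants, are the kind of signal that frequency-spectrum methods use to infer population growth.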

PhyloPythiaS+: A self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes


I. Gregor, J. Dröge, M. Schirmer, C. Quince, A. C. McHardy
Subjects: Quantitative Methods (q-bio.QM)

Metagenomics is an approach for characterizing environmental microbial communities in situ; it allows their functional and taxonomic characterization and the recovery of sequences from uncultured taxa. For communities of up to medium diversity, e.g. excluding environments such as soil, this is often achieved by a combination of sequence assembly and binning, where sequences are grouped into ‘bins’ representing taxa of the underlying microbial community from which they originate. Assignment to low-ranking taxonomic bins is an important challenge for binning methods, as is scalability to Gb-sized datasets generated with deep sequencing techniques. One of the best available methods for the recovery of species bins from an individual metagenome sample is the expert-trained PhyloPythiaS package, where a human expert decides on the taxa to incorporate in a composition-based taxonomic metagenome classifier and identifies the ‘training’ sequences using marker genes directly from the sample. Due to the manual effort involved, this approach does not scale to multiple metagenome samples and requires substantial expertise, which researchers who are new to the area may not have. With these challenges in mind, we have developed PhyloPythiaS+, a successor to our previously described method PhyloPythia(S). The newly developed + component performs the work previously done by the human expert. PhyloPythiaS+ also includes a new k-mer counting algorithm, which accelerated k-mer counting 100-fold and reduced the overall execution time of the software by a factor of three. Our software makes it possible to analyze Gb-sized metagenomes with inexpensive hardware and to recover species- or genus-level bins with low error rates in a fully automated fashion.
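To make the term concrete, composition-based classifiers typically describe each sequence by how often every short substring of length k (a k-mer) occurs in it. The following is a naive sketch of that counting step only. It is not the accelerated algorithm in PhyloPythiaS+, whose details the abstract does not give; the function names and the choice to skip windows containing ambiguous bases are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a DNA sequence.

    Windows containing characters other than A, C, G, T (for example N)
    are skipped; this handling is an assumption of the sketch.
    """
    sequence = sequence.upper()
    valid = set("ACGT")
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if set(kmer) <= valid:
            counts[kmer] += 1
    return counts


def kmer_frequencies(sequence, k):
    """Normalise k-mer counts so they sum to 1 (a composition profile)."""
    counts = count_kmers(sequence, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


tetramers = count_kmers("ACGTACGT", 4)
profile = kmer_frequencies("ACGTACGT", 4)
```

This version does work proportional to sequence length times k, which is adequate for a demonstration but not for Gb-sized datasets.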

Conservation and losses of avian non-coding RNA loci

Paul P. Gardner, Mario Fasold, Sarah W. Burge, Maria Ninova, Jana Hertel, Stephanie Kehr, Tammy E. Steeves, Sam Griffiths-Jones, Peter F. Stadler
Comments: 17 pages, 1 figure
Subjects: Genomics (q-bio.GN)

Here we present the results of a large-scale bioinformatic annotation of non-coding RNA loci in 48 avian genomes. Our approach uses probabilistic models of hand-curated families from the Rfam database to infer conserved RNA families within each avian genome. We supplement these annotations with predictions from the tRNA annotation tool tRNAscan-SE and with microRNAs from miRBase. We show that a number of lncRNA-associated loci are conserved between birds and mammals, including several intriguing cases where the reported mammalian lncRNA function is not conserved in birds. We also demonstrate extensive conservation of classical ncRNAs (e.g., tRNAs) and more recently discovered ncRNAs (e.g., snoRNAs and miRNAs) in birds. Furthermore, we describe numerous “losses” of several RNA families, and attribute these to genuine loss, divergence or missing data. In particular, we show that many of these losses are due to the challenges associated with assembling avian microchromosomes. These combined results illustrate the utility of applying homology-based methods for annotating novel vertebrate genomes.