Phylogenetics and the human microbiome

Frederick A Matsen IV
Comments: to appear in Systematic Biology
Subjects: Populations and Evolution (q-bio.PE); Genomics (q-bio.GN)

The human microbiome is the ensemble of genes in the microbes that live inside and on the surface of humans. Because microbial sequencing information is now much easier to come by than phenotypic information, there has been an explosion of sequencing and genetic analysis of microbiome samples. Much of the analytical work for these sequences involves phylogenetics, at least indirectly, but methodology has developed in a somewhat different direction than for other applications of phylogenetics. In this paper I review the field and its methods from the perspective of a phylogeneticist, and describe current challenges for phylogenetics coming from this type of work.

Bayesian Coalescent Epidemic Inference: Comparison of Stochastic and Deterministic SIR Population Dynamics


Alex Popinga, Tim Vaughan, Tanja Stadler, Alexei Drummond
Comments: Submitted
Subjects: Populations and Evolution (q-bio.PE)

Estimation of epidemiological and population parameters from molecular sequence data has become central to the understanding of infectious disease dynamics. Various models have been proposed to infer details of the dynamics that describe epidemic progression. These include inference approaches derived from Kingman's coalescent as well as from birth-death branching processes. The development of alternative approaches merits investigation of their characteristics and differences. Here we use recently described coalescent theory for epidemic dynamics to develop stochastic and deterministic coalescent SIR tree priors. We implement these in a Bayesian phylogenetic inference framework to permit joint estimation of SIR epidemic parameters and the sample genealogy. We assess the models' performance and contrast results obtained with a recently published birth-death sampling model for epidemic inference. Comparisons are made by analyzing sets of genealogies simulated under precisely known epidemiological parameters. We also compare results of analyses using published HIV-1 sequence data obtained from known UK infection clusters. We show that the coalescent SIR model is effective at estimating epidemiological parameters from data with large fundamental reproductive number R0 and large population size S0. We find that the stochastic variant generally outperforms its deterministic counterpart. However, each of these Bayesian estimators is shown to have undesirable properties in certain circumstances, especially for epidemic outbreaks with R0 close to one or with small susceptible populations.
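
As a rough illustration of the two kinds of SIR population dynamics compared in this abstract, the sketch below integrates the deterministic SIR equations with a simple Euler scheme and simulates the stochastic counterpart with a Gillespie-style algorithm. It is not the authors' implementation, and it covers neither the coalescent tree prior nor the Bayesian inference. The function names, the parameterization (infection rate beta*S*I and recovery rate gamma*I, so that R0 = beta*S0/gamma), and the parameter values are all assumptions made for the example.

```python
import random


def deterministic_sir(beta, gamma, s0, i0, t_end, dt=0.001):
    """Euler integration of dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I.

    Returns the (S, I, R) compartment sizes at time t_end.
    """
    s, i, r = float(s0), float(i0), 0.0
    for _ in range(int(round(t_end / dt))):
        new_infections = beta * s * i * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
    return s, i, r


def stochastic_sir(beta, gamma, s0, i0, t_end, rng):
    """Gillespie simulation of the same model with integer counts.

    Stops when the epidemic dies out (I == 0) or time t_end is reached.
    """
    s, i, r, t = s0, i0, 0, 0.0
    while i > 0:
        infection_rate = beta * s * i
        recovery_rate = gamma * i
        total_rate = infection_rate + recovery_rate
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t > t_end:
            break
        if rng.random() * total_rate < infection_rate:
            s, i = s - 1, i + 1  # infection event
        else:
            i, r = i - 1, r + 1  # recovery event
    return s, i, r


if __name__ == "__main__":
    # Hypothetical values: R0 = beta * S0 / gamma is roughly 4 here.
    beta, gamma, s0, i0 = 0.002, 0.5, 999, 1
    print("deterministic:", deterministic_sir(beta, gamma, s0, i0, t_end=50))
    rng = random.Random(1)
    print("stochastic:   ", stochastic_sir(beta, gamma, s0, i0, 50, rng))
```

Repeated stochastic runs started from a single infected individual will sometimes die out early even when R0 is above one, which the deterministic trajectory cannot represent. This is one intuitive reason the two variants might behave differently for small populations or for R0 close to one.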

Hybrid origins and the earliest stages of diploidization in the highly successful recent polyploid Capsella bursa-pastoris

Gavin Douglas, Gesseca Gos, Kim Steige, Adriana Salcedo, Karl Holm, J. Arvid Ågren, Khaled Hazzouri, Wei Wang, Adrian E. Platts, Emily B. Josephs, Robert J. Williamson, Barbara Neuffer, Martin Lascoux, Tanja Slotte, Stephen Wright

Whole genome duplication events have occurred repeatedly during flowering plant evolution, and there is growing evidence for predictable patterns of gene retention and loss following polyploidization. Despite these important insights, the rate and processes governing the earliest stages of diploidization remain uncertain, and the relative importance of genetic drift vs. natural selection in the process of gene degeneration and loss is unclear. Here we conduct whole genome resequencing in Capsella bursa-pastoris, a recently formed tetraploid with one of the most widespread species distributions of any angiosperm. Whole genome data provide strong support for recent hybrid origins of the tetraploid species within the last 100,000-300,000 years from two diploid progenitors in the Capsella genus. Major-effect inactivating mutations are frequent, but many were inherited from the parental species and show no evidence of being fixed by positive selection. Despite a lack of large-scale gene loss, we observe a shift in the efficacy of natural selection genome-wide. Our results suggest that the earliest stages of diploidization are associated with quantitative genome-wide shifts in the strength and efficacy of selection rather than rapid gene loss, and that nonfunctionalization can receive a ‘head start’ through deleterious variants found in parental diploid populations.

Probabilities of Fitness Consequences for Point Mutations Across the Human Genome

Brad Gulko, Ilan Gronau, Melissa J Hubisz, Adam Siepel

The identification of noncoding functional elements based on high-throughput genomic data remains an important open problem. Here we describe a novel computational approach for estimating the probability that a point mutation at each nucleotide position in a genome will influence organismal fitness. These fitness consequence (fitCons) scores can be interpreted as an evolution-based measure of potential genomic function. We first partition the genome into clusters of positions having distinct functional genomic “fingerprints,” based on cell-type-specific DNase-seq, RNA-seq, and histone modification data. Then we estimate the probability of fitness consequences for each cluster from associated patterns of genetic polymorphism and divergence using a recently developed probabilistic method called INSIGHT. We have generated fitCons scores for three human cell types based on publicly available genomic data and made them available as UCSC Genome Browser tracks. Like conventional evolutionary conservation scores, fitCons scores are clearly elevated in known coding and noncoding functional elements, but they show considerably better sensitivity than conservation scores for many noncoding elements. In addition, they perform exceptionally well in distinguishing ChIP-seq-supported transcription factor binding sites, expression quantitative trait loci, and predicted enhancers from putatively nonfunctional sequences. The fitCons scores indicate that 4.2-7.5% of nucleotide positions in the human genome have influenced fitness since the human-chimpanzee divergence. In contrast to several recent studies, they suggest that recent evolutionary turnover has had a relatively modest impact on the functional content of the genome. Our approach provides a unique new measure of genomic function that complements measures based on evolutionary conservation or functional genomics alone and is particularly well suited for characterizing turnover and evolutionary novelty.

The dynamics of sperm cooperation in a competitive environment


H. S. Fisher, L. Giomi, H. E. Hoekstra, L. Mahadevan
(Submitted on 2 Jul 2014)

Sperm cooperation has evolved in a variety of taxa and is often considered a response to sperm competition, yet the benefit of this form of collective movement remains unclear. Here we use fine-scale imaging and a minimal mathematical model to study sperm aggregation in the rodent genus Peromyscus. We demonstrate that as the number of sperm cells in an aggregate increases, the group moves with more persistent linearity but without increasing speed; this benefit, however, is offset in larger aggregates as the geometry of the group forces sperm to swim against one another. The result is a non-monotonic relationship between aggregate size and average velocity with both a theoretically predicted and empirically observed optimum of 6-7 sperm/aggregate. To understand the role of sexual selection in driving these sperm group dynamics, we compared two sister species with divergent mating systems and found that sperm of P. maniculatus (highly promiscuous), which have evolved under intense competition, form optimal-sized aggregates more often than sperm of P. polionotus (strictly monogamous), which lack competition. Our combined mathematical and experimental study of coordinated sperm movement reveals the importance of geometry, motion and group size on sperm velocity and suggests how these physical variables interact with evolutionary selective pressures to regulate cooperation in competitive environments.

Pervasive variation of transcription factor orthologs contributes to regulatory network evolution

Shilpa Nadimpalli, Anton V. Persikov, Mona Singh
Comments: 29 pages, 5 figures, 5 supplemental figures, 3 supplemental tables
Subjects: Genomics (q-bio.GN)

Differences in transcriptional regulatory networks underlie much of the phenotypic variation observed across organisms. Changes to cis-regulatory elements are widely believed to be the predominant means by which regulatory networks evolve, yet examples of regulatory network divergence due to transcription factor (TF) variation have also been observed. To systematically ascertain the extent to which TFs contribute to regulatory divergence, we analyzed the evolution of the largest class of metazoan TFs, Cys2-His2 zinc finger (C2H2-ZF) TFs, across 12 Drosophila species spanning ~45 million years of evolution. Remarkably, we uncovered that a significant fraction of all C2H2-ZF 1-to-1 orthologs in flies exhibit variations that can affect their DNA-binding specificities. In addition to loss and recruitment of C2H2-ZF domains, we found diverging DNA-contacting residues in ~47% of domains shared between D. melanogaster and the other fly species. These diverging DNA-contacting residues, found in ~66% of the D. melanogaster C2H2-ZF genes in our analysis and corresponding to ~24% of all annotated D. melanogaster TFs, show evidence of functional constraint: they tend to be conserved across phylogenetic clades and evolve more slowly than other diverging residues. These same variations were rarely found as polymorphisms within a population of D. melanogaster flies, indicating their rapid fixation. The predicted specificities of these dynamic domains gradually change across phylogenetic distances, suggesting stepwise evolutionary trajectories for TF divergence. Further, whereas proteins with conserved C2H2-ZF domains are enriched in developmental functions, those with varying domains exhibit no functional enrichments. Our work suggests that a subset of highly dynamic and largely unstudied TFs is a likely source of regulatory variation in Drosophila and other metazoans.

Posterior predictive checks to quantify lack-of-fit in admixture models of latent population structure


David Mimno, David M Blei, Barbara E Engelhardt
Subjects: Methodology (stat.ME); Genomics (q-bio.GN); Populations and Evolution (q-bio.PE); Applications (stat.AP)

Admixture models are a ubiquitous approach to capture latent population structure in genetic samples. Despite the widespread application of admixture models, little thought has been devoted to the quality of the model fit or the accuracy of the estimates of parameters of interest for a particular study. Here we develop methods for validating admixture models based on posterior predictive checks (PPCs), a Bayesian method for assessing the quality of a statistical model. We develop PPCs for five population-level statistics of interest: within-population genetic variation, background linkage disequilibrium, number of ancestral populations, between-population genetic variation, and the downstream use of admixture parameters to correct for population structure in association studies. Using PPCs, we evaluate the quality of the model estimates for four qualitatively different population genetic data sets: the POPRES European individuals, the HapMap phase 3 individuals, continental Indians, and African American individuals. We found that the same model fitted to different genomic studies resulted in highly study-specific results when evaluated using PPCs, illustrating the utility of PPCs for model-based analyses in large genomic studies.
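
The general logic of a posterior predictive check can be sketched as follows: draw parameters from the posterior, simulate replicated data under each draw, and ask how often a chosen discrepancy statistic on the replicated data is at least as large as on the observed data. The example below is a toy, not the authors' admixture model or any of their five statistics. It assumes a single locus, diploid genotypes coded 0/1/2 under Hardy-Weinberg proportions, genotype variance as the discrepancy, and hypothetical function names throughout.

```python
import random


def ppc_p_value(observed, posterior_draws, simulate, statistic, rng):
    """Fraction of replicated data sets whose statistic is >= the observed one."""
    observed_stat = statistic(observed)
    exceed = 0
    for theta in posterior_draws:
        replicated = simulate(theta, len(observed), rng)
        if statistic(replicated) >= observed_stat:
            exceed += 1
    return exceed / len(posterior_draws)


def simulate_genotypes(freq, n, rng):
    """n diploid genotypes (0, 1 or 2 alleles) at allele frequency freq."""
    return [(rng.random() < freq) + (rng.random() < freq) for _ in range(n)]


def variance(values):
    """Population variance, used here as a toy discrepancy statistic."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for posterior samples of the allele frequency (assumed values).
    draws = [rng.betavariate(50, 50) for _ in range(500)]
    # Observed data with no heterozygotes: more variable than the model expects.
    observed = [0, 2] * 25
    print(ppc_p_value(observed, draws, simulate_genotypes, variance, rng))
```

A value near 0 or 1 signals that the model rarely reproduces the observed statistic, which is the sense in which such checks quantify lack of fit.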

Epidemic clones, oceanic gene pools and epigenotypes in the free living marine pathogen Vibrio parahaemolyticus

Yujun Cui, Xianwei Yang, Xavier Didelot, Chenyi Guo, Dongfang Li, Yanfeng Yan, Yiquan Zhang, Yanting Yuan, Huanming Yang, Jian Wang, Jun Wang, Yajun Song, Dongsheng Zhou, Daniel Falush, Ruifu Yang
Subjects: Populations and Evolution (q-bio.PE)

In outbreeding organisms, genetic variation is reassorted each generation, leading to geographic gene pools. By contrast, bacterial clones can spread and adapt independently, leading to a wide variety of possible genetic structures. Here we investigated global patterns of variation in 157 whole genome sequences of Vibrio parahaemolyticus, a free-living and seafood-associated marine bacterium. Pandemic clones, responsible for recent outbreaks of gastroenteritis in humans, have spread globally. However, there are oceanic gene pools, one located in the oceans surrounding Asia and another in the Mexican Gulf. Frequent recombination means that most isolates have acquired the genetic profile of their current location. Within oceanic gene pools, there is nevertheless the opportunity for substructure, for example due to niche partitioning by different clones. We investigated this structure by calculating the effective population size in two different ways. Under standard population genetic models, the two estimates should give similar answers, but we found a 30-fold difference. This discrepancy provides evidence for an ‘epigenotype’ model in which distinct ecotypes are maintained by selection on an otherwise homogeneous genetic background. To investigate the genetic factors involved, we used 54 unrelated isolates to conduct a genome-wide scan for epistatically interacting loci. We found a single example of strong epistasis between distant genome regions. One of the genes involved in this interaction has previously been implicated in biofilm formation, while the other is a hypothetical protein. Further work will allow a detailed understanding of how selection acts to structure the pattern of variation within natural bacterial populations.

PhyloPythiaS+: A self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes


I. Gregor, J. Dröge, M. Schirmer, C. Quince, A. C. McHardy
Subjects: Quantitative Methods (q-bio.QM)

Metagenomics is an approach for characterizing environmental microbial communities in situ; it allows their functional and taxonomic characterization and the recovery of sequences from uncultured taxa. For communities of up to medium diversity, e.g. excluding environments such as soil, this is often achieved by a combination of sequence assembly and binning, where sequences are grouped into ‘bins’ representing taxa of the underlying microbial community from which they originate. Assignment to low-ranking taxonomic bins is an important challenge for binning methods, as is scalability to Gb-sized datasets generated with deep sequencing techniques. One of the best available methods for the recovery of species bins from an individual metagenome sample is the expert-trained PhyloPythiaS package, where a human expert decides on the taxa to incorporate in a composition-based taxonomic metagenome classifier and identifies the ‘training’ sequences using marker genes directly from the sample. Due to the manual effort involved, this approach does not scale to multiple metagenome samples and requires substantial expertise, which researchers who are new to the area may not have. With these challenges in mind, we have developed PhyloPythiaS+, a successor to our previously described method PhyloPythia(S). The newly developed + component performs the work previously done by the human expert. PhyloPythiaS+ also includes a new k-mer counting algorithm, which accelerated k-mer counting 100-fold and reduced the overall execution time of the software by a factor of three. Our software allows the analysis of Gb-sized metagenomes with inexpensive hardware, and the recovery of species- or genus-level bins with low error rates in a fully automated fashion.
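
For readers unfamiliar with composition-based classification, the sketch below shows the basic idea of turning a sequence into a k-mer frequency profile, the kind of feature such classifiers work from. It is a naive illustration only: it is not the accelerated k-mer counting algorithm of PhyloPythiaS+, and the function name and the choice to skip k-mers containing ambiguous bases are assumptions.

```python
from collections import Counter


def kmer_profile(seq, k):
    """Relative frequencies of all overlapping k-mers in seq.

    k-mers containing characters other than A, C, G, T are skipped
    (an assumption made for this sketch).
    """
    seq = seq.upper()
    alphabet = set("ACGT")
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= alphabet
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    profile = kmer_profile("ACGTACGT", 4)
    for kmer in sorted(profile):
        print(kmer, profile[kmer])
```

Profiles like this could be compared across contigs so that sequences with similar composition are grouped into the same bin. A practical tool would need a far more efficient counting scheme for Gb-sized data.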

Conservation and losses of avian non-coding RNA loci

Paul P. Gardner, Mario Fasold, Sarah W. Burge, Maria Ninova, Jana Hertel, Stephanie Kehr, Tammy E. Steeves, Sam Griffiths-Jones, Peter F. Stadler
Comments: 17 pages, 1 figure
Subjects: Genomics (q-bio.GN)

Here we present the results of a large-scale bioinformatic annotation of non-coding RNA loci in 48 avian genomes. Our approach uses probabilistic models of hand-curated families from the Rfam database to infer conserved RNA families within each avian genome. We supplement these annotations with predictions from the tRNA annotation tool tRNAscan-SE and with microRNAs from miRBase. We show that a number of lncRNA-associated loci are conserved between birds and mammals, including several intriguing cases where the reported mammalian lncRNA function is not conserved in birds. We also demonstrate extensive conservation of classical ncRNAs (e.g., tRNAs) and more recently discovered ncRNAs (e.g., snoRNAs and miRNAs) in birds. Furthermore, we describe numerous “losses” of several RNA families, and attribute these to genuine loss, divergence or missing data. In particular, we show that many of these losses are due to the challenges associated with assembling avian microchromosomes. These combined results illustrate the utility of applying homology-based methods for annotating novel vertebrate genomes.