Population genetic analyses of metagenomes reveal extensive strain-level variation in prevalent human-associated bacteria

Population genetic analyses of metagenomes reveal extensive strain-level variation in prevalent human-associated bacteria

Stephen Nayfach, Katherine S Pollard
doi: http://dx.doi.org/10.1101/031757

Deep sequencing has the potential to shed light on the functional and phylogenetic heterogeneity of microbial populations in the environment. Here we present PhyloCNV, an integrated computational pipeline for quantifying species abundance and strain-level genomic variation from shotgun metagenomes. Our method leverages a comprehensive database of >30,000 reference genomes which we accurately clustered into species groups using a panel of universal-single-copy genes. Given a shotgun metagenome, PhyloCNV will rapidly and automatically identify gene copy number variants and single-nucleotide variants present in abundant bacterial species. We applied PhyloCNV to >500 faecal metagenomes from the United States, Europe, China, Peru, and Tanzania and present the first global analysis of strain-level variation and biogeography in the human gut microbiome. On average there is 8.5x more nucleotide diversity of strains between different individuals than within individuals, with elevated strain-level diversity in hosts from Peru and Tanzania that live rural lifestyles. For many, but not all common gut species, a significant proportion of inter-sample strain-level genetic diversity is explained by host geography. Eubacterium rectale, for example, has a highly structured population that tracks with host country, while strains of Bacteroides uniformis and other species are structured independently of their hosts. Finally, we discovered that the gene content of some bacterial strains diverges at short evolutionary timescales during which few nucleotide variants accumulate. These findings shed light onto the recent evolutionary history of microbes in the human gut and highlight the extensive differences in the gene content of closely related bacterial strains. PhyloCNV is freely available at: https://github.com/snayfach/PhyloCNV.

A note on the distribution of admixture segment lengths and ancestry proportions under pulse and two-wave admixture models

A note on the distribution of admixture segment lengths and ancestry proportions under pulse and two-wave admixture models

Shai Carmi, James Xue, Itsik Pe’er
(Submitted on 19 Sep 2015)

Admixed populations are formed by the merging of two or more ancestral populations, and the ancestry of each locus in an admixed genome derives from either source. Consider a simple “pulse” admixture model, where populations A and B merged t generations ago without subsequent gene flow. We derive the distribution of the proportion of an admixed chromosome that has A (or B) ancestry, as a function of the chromosome length L, t, and the initial contribution of the A source, m. We demonstrate that these results can be used for inference of the admixture parameters. For more complex admixture models, we derive an expression in Laplace space for the distribution of ancestry proportions that depends on having the distribution of the lengths of segments of each ancestry. We obtain explicit results for the special case of a “two-wave” admixture model, where population A contributed additional migrants in one of the generations between the present and the initial admixture event. Specifically, we derive formulas for the distribution of A and B segment lengths and numerical results for the distribution of ancestry proportions. We show that for recent admixture, data generated under a two-wave model can hardly be distinguished from that generated under a pulse model.

Author post: Immunosequencing reveals diagnostic signatures of chronic viral infection in T cell memory

This guest post is by William DeWitt on his preprint (with co-authors) “Immunosequencing reveals diagnostic signatures of chronic viral infection in T cell memory”.

This is a post about our preprint at biorXiv. Although it’s a paper on infectious disease and immunology, our colleague (and Haldane’s Sieve contributor) Bryan Howie suggested we might engage the community here, since we think there are interesting connections to standard GWAS methodology. I’ll start with a one-paragraph immunology primer, then summarize what we’ve been up to, and what’s next.

Cell-mediated adaptive immunity is effected by T cells, which recognize infection through interface of the T cell receptor (TCR) with foreign peptides presented on the surface of all nucleated cells by major histocompatibility complex (MHC). During development in the thymus, maturing T cells somatically generate genes encoding the TCR according to a random process of V(D)J recombination and are passed through selective barriers against weak MHC affinity (positive selection) and strong self-peptide affinity (negative selection). This results in a diverse repertoire of self-tolerant receptors from which to deploy specific responses to threats from a protean universe of evolving pathogens. Upon recognition of foreign antigen, a T cell proliferates, generating a subpopulation with identical-by-descent TCRs. This clonal selection mechanism of immunological memory implies that the TCR repertoire dynamically encodes an individual’s pathogen exposure history, and suggests that infection with a given pathogen could be recognized by identifying concomitant TCRs.


In this study, we identify signatures of viral infection in the TCR repertoire. With a cohort of 640 subjects, we performed high-throughput immunosequencing of rearranged TCR genes, and serostatus tests for cytomegalovirus (CMV) infection. We used an analysis approach similar to GWAS; among the ~85 million unique TCRs in these data, we tested for enrichment of specific TCRs among CMV seropositive subjects, identifying a set of CMV-associated TCRs. These were reduced to a single dimension by defining the pathogen memory burden as the fraction of an individual’s TCRs that are CMV-associated, revealing a powerful discriminator. A binary classifier trained against this quantity demonstrated high cross validation accuracy.


The binding of TCR to antigen is mediated by MHC, which is encoded by the highly polymorphic HLA loci. Thus, the affinity of a given TCR for a given antigen is modulated by HLA haplotype. HLA typing was performed for this cohort according to standard methods, and we investigated enrichment of specific HLA alleles among the subjects in which each CMV-associated TCR appeared. Most CMV-associated TCRs were found to have HLA restrictions, and none were associated with more than one allele in any locus.

There is substantial literature identifying TCRs that bind CMV antigen through low-throughput in vitro methods, including so-called public TCRs, which arise in many individuals. Most public CMV TCRs were present in our data, however most were not in our list of diagnostically useful CMV-associated TCRs. This is understood by considering that V(D)J recombination produces different TCRs with different probability. Public TCRs, having high recombination probability, will be repeatedly recombined in the repertoires of all subjects, regardless of CMV infection status; their presence is not diagnostic for CMV serostatus, even if they are CMV-avid. Conversely, CMV-avid TCRs with low recombination probability will be private to one subject (if present at all) in any cohort of reasonable size; their enrichment in CMV seropositive subjects is not detectable. CMV-avid TCRs with intermediate recombination probability recombine intermittently, residing transiently in the repertoires CMV naïve individuals and reliably proliferating upon CMV exposure; the presence of these TCRs in an immunosequencing sample is diagnostic for infection status.

It may be interesting to draw a comparison with GWAS, where selection drives disease-associated variants with high effect size to low population frequency, out of reach of the detection power of any study. In contrast, the V(D)J recombination machinery is constant across individuals, and CMV-avid TCRs appear to span a broad range of recombination probabilities from public to private. This includes plenty in an intermediate regime of what we might call diagnostic TCRs, which can be used to build powerful classifiers of disease status that aren’t blunted by suppression of the most relevant features.

We’ll be making the data from this study available online, constituting the largest publically accessible TCR immunosequencing data set. It’ll be fun to see what other groups do with it.

Some things we’re still working on:
• We did cross validation to assess diagnostic accuracy (yellow curve in the ROC figure), including recomputation of CMV-associated TCRs for each holdout. Results are encouraging, but a more convincing test will be to diagnose a separate cohort. We’re in the process of acquiring these data.
• MHC polymorphism suggests that thymic selection barriers censor different TCRs in individuals with different HLA haplotype. We’ve done preliminary work identifying TCRs that are associated with more common HLA alleles, indicating the possibility of HLA typing from immunosequencing data. Interestingly, this necessitates a two-tailed test due to modulation of both positive and negative selection.
• Our association analysis relies on the recurrence of TCRs with identical amino acid sequence across individuals, but we’d like to be able to define TCR motifs more loosely, so that we can detect enrichment without requiring identity. This necessitates a similarity metric in amino acid space that captures similarity in avidity. We have some ideas here, and are testing them out on some validation data. It’s definitely a tough one, but could substantially increase power in this sort of study.

Author post: Limits to adaptation in partially selfing species

This guest post is by Matthew Hartfield (@mathyhartfield) on his preprint (with Sylvain Glemin) “Limits to adaptation in partially selfing species”, available from bioRxiv here

Our paper “Limits to adaptation in partially selfing species” is now available from bioRxiv. This preprint is the result from a collaboration that has been sent back-and-forth across the Atlantic for well over a year, so we are pleased to see it online.

Haldane’s Sieve, after which this blog is named, is a theory pertaining to the role of dominance in adaptation, which was initially developed for outcrossing species and then shown to be absent in selfing species. When beneficial alleles initially appear in diploid individuals, they do so in heterozygote form (so only one of two alleles at the locus carry the advantageous type). Mathematically, these mutations have selective advantage 1 + hs where h is the degree of dominance, and s the selective advantage. Haldane’s Sieve states that recessive mutations (h 1/2), because selection is not efficient on heterozygotes if mutations are recessive. However, self-fertilising individuals are able to rapidly create homozygote forms of the mutant, increasing the efficacy of selection acting on them. Yet selfing also increases genetic drift, and hence the risk that these adaptations will go extinct by chance. Consequently, an extension of Haldane’s Sieve states that if the mutation is recessive (h 1/2).

This result holds for a single mutant in isolation. Yet mutants seldom act independently; they usually arise alongside other alleles in the genome, each of which has their own evolutionary outcomes. A known additional advantage of outcrossing is that, through recombining genomes from each parent, selected alleles can be moved from disadvantageous genomes to fitter backgrounds. For example, say an adaptive allele was present in a population, and a second adaptation arose at a nearby locus. If the second allele was not as strongly selected as the first, then it has to arise on the same genome as the initial adaptation. Otherwise it is likely to be lost as the less-fit genotype is replaced over time, a process known as selective interference. However, outcrossing can unite the two mutations into the same genome, so both can spread.

Despite these potential advantages of outcrossing, the effect of selective interference has not yet been investigated in the context of how facultative selfing influences the fixation of multiple beneficial alleles. Our model therefore aimed to determine how likely it is that secondary beneficial alleles can fix in the population, given an existing adaptation was already present, and reproduction involved a certain degree of self-fertilisation.

After working through the calculations, two subtle yet important twists on Haldane’s Sieve revealed themselves. First, due to the effects of selection interference, Haldane’s Sieve is likely to be reinforced in areas of low recombination. That is, recessive mutants are more likely to be lost in outcrossers (when compared to single-locus results), with similar losses for dominant mutations in self-fertilising organisms. Secondly, we also investigated a case where the second beneficial mutant could be reintroduced by recurrent mutation. In this case, selection interference can be very severe in selfers due to the lack of recombination. Hence some degree of outcrossing would be optimal to prevent these beneficial alleles from being repeatedly lost, even if they are recessive. In the most extreme case, complete outcrossing is best if secondary mutations only confer minor advantages.

In recent years, the role that selection interference plays in affecting mating system evolution is starting to become recognised. Our theoretical study is just one of many that elucidates how important outcrossing can be in augmenting the efficacy of selection. Our hope is that these studies will spur on further empirical work quantifying the rate of adaptation in species with different mating systems, to further unravel why species reproduce in vastly different ways.

Neighbourhoods of phylogenetic trees: exact and asymptotic counts

Neighbourhoods of phylogenetic trees: exact and asymptotic counts

Jamie V. De Jong, Jeanette C McLeod, Mike Steel
(Submitted on 15 Aug 2015)

A central theme in phylogenetics is the reconstruction and analysis of evolutionary trees from a given set of data. To determine the optimal search methods for reconstructing trees, it is crucial to understand the size and structure of the neighbourhoods of trees under tree rearrangement operations. The diameter and size of the immediate neighbourhood of a tree has been well-studied, however little is known about the number of trees at distance two, three or (more generally) k from a given tree. In this paper we provide a number of exact and asymptotic results concerning these quantities, and identify some key aspects of tree shape that play a role in determining these quantities. We obtain several new results for two of the main tree rearrangement operations – Nearest Neighbour Interchange and Subtree Prune and Regraft — as well as for the Robinson-Foulds metric on trees.

Sex-dependent dominance at a single locus maintains variation in age at maturity in Atlantic salmon

Sex-dependent dominance at a single locus maintains variation in age at maturity in Atlantic salmon

Nicola Barson, Tutku Aykanat, Kjetil Hindar, Matthew Baranski, Geir Bolstad, Peder Fiske, Celeste Jacq, Arne Jensen, Susan E Johnston, Sten Karlsoon, Matthew Kent, Eero Niemelä, Torfinn Nome, Tor Naesje, Panu Orell, Atso Romakkaniemi, Harald Saegrov, Kurt Urdal, Jaakko Erkinaro, Sigbjorn Lien, Craig Primmer
doi: http://dx.doi.org/10.1101/024695

Males and females share many traits that have a common genetic basis, however selection on these traits often differs between the sexes leading to sexual conflict. Under such sexual antagonism, theory predicts the evolution of genetic architectures that resolve this sexual conflict. Yet, despite intense theoretical and empirical interest, the specific genetic loci behind sexually antagonistic phenotypes have rarely been identified, limiting our understanding of how sexual conflict impacts genome evolution and the maintenance of genetic diversity. Here, we identify a large effect locus controlling age at maturity in 57 salmon populations, an important fitness trait in which selection favours earlier maturation in males than females, and show it is a clear example of sex dependent dominance reducing intralocus sexual conflict and maintaining adaptive variation in wild populations. Using high density SNP data and whole genome re-sequencing, we found that vestigial-like family member 3 (VGLL3) exhibits sex-dependent dominance in salmon, promoting earlier and later maturation in males and females, respectively. VGLL3, an adiposity regulator associated with size and age at maturity in humans, explained 39.4% of phenotypic variation, an unexpectedly high effect size for what is usually considered a highly polygenic trait. Such large effects are predicted under balancing selection from either sexually antagonistic or spatially varying selection. Our results provide the first empirical example of dominance reversal permitting greater optimisation of phenotypes within each sex, contributing to the resolution of sexual conflict in a major and widespread evolutionary trade-off between age and size at maturity. They also provide key empirical evidence for how variation in reproductive strategies can be maintained over large geographical scales. We further anticipate these findings will have a substantial impact on population management in a range of harvested species where trends towards earlier maturation have been observed

A genomic region containing RNF212 is associated with sexually-dimorphic recombination rate variation in wild Soay sheep (Ovis aries).

A genomic region containing RNF212 is associated with sexually-dimorphic recombination rate variation in wild Soay sheep (Ovis aries).

Susan E Johnston, Jon Slate, Josephine M Pemberton
doi: http://dx.doi.org/10.1101/024869

Meiotic recombination breaks down linkage disequilibrium and forms new haplotypes, meaning that it is an important driver of diversity in eukaryotic genomes. Understanding the causes of variation in recombination rate is not only important in interpreting and predicting evolutionary phenomena, but also for understanding the potential of a population to respond to selection. Yet, there remains little data on if, how and why recombination rate varies in natural populations. Here, we used extensive pedigree and high-density SNP information in a wild population of Soay sheep (Ovis aries) to determine individual crossovers in 3330 gametes from 813 individuals. Using these data, we investigated the recombination landscape and the genetic architecture of individual autosomal recombination rate. The population was strongly heterochiasmic (male to female linkage map ratio = 1.31), driven by significantly elevated levels of male recombination in sub-telomeric regions. Autosomal recombination rate was heritable in both sexes (h2 = 0.16 & 0.12 in females and males, respectively), but with different genetic architectures. In females, 46.7% of heritable variation was explained by a sub-telomeric region on chromosome 6; a genome-wide association study showed the strongest associations at RNF212, with further associations observed at a nearby ~374kb region of complete linkage disequilibrium containing three additional candidate loci, CPLX1, GAK and PCGF3. This region did not affect male recombination rate. A second region on chromosome 7 containing REC8 and RNF212B explained 26.2% of heritable variation in recombination rate in both sexes, with further single locus associations identified on chromosome 3. Our findings provide a key empirical example of the genetic architecture of recombination rate in a wild mammal population with male-biased crossover frequency.