Author post: Efficient coalescent simulation and genealogical analysis for large sample sizes

This guest post is by Jerome Kelleher, on the preprint by Kelleher, Etheridge, and McVean titled “Efficient coalescent simulation and genealogical analysis for large sample sizes” available here from biorXiv.

In this post we summarise the main results of our recent bioRxiv preprint. We’ve left out a lot of important details here, but hopefully this summary will be enough to convince you that it’s worth reading the paper!

Coalescent simulation is a fundamental tool in modern population genetics, and a large number of packages exist to simulate various aspects of the model. The basic algorithm to simulate the coalescent with recombination was defined by Hudson in 1983, who also published the classical ms simulation program in 2002. Programs such as ms based on Hudson’s algorithm perform poorly for longer sequences, making simulation of chromosome sized regions under the influence of recombination impossible. The Sequentially Markov Coalescent (SMC) approximates the coalescent with recombination by assuming that each marginal genealogy depends only on its predecessor, making simulation much more efficient. The SMC can be a poor approximation when long range linkage information is important, however, and current simulations do not scale well in terms of sample size. Population-scale sequencing projects currently under way mean there is an urgent need for accurate simulations of hundreds of thousands of genomes.

We present a new formulation of Hudson’s simulation algorithm that solves these issues, making chromosome-scale simulation of the exact coalescent with recombination for hundreds of thousands of samples possible for the first time. Our approach begins by defining the genealogies that we are constructing in terms of integer vectors of a specific form, which we refer to as `sparse trees’. We generate recombination and common ancestor events in the same manner as the classical methods, but our approach to constructing marginal genealogies is quite different. When a coalescence within a marginal tree occurs, we store a tuple consisting of the left and right coordinates of the overlapping region, the parent and child nodes (which are integers), and the time at which the event occurred. We refer to these tuples as `coalescence records’, and they provide sufficient information to recover all genealogies after the simulation has completed. We implemented these ideas in a simulator called msprime, which we compared with the state of the art. For a fixed sample size of 1000 and increasing sequence length (with human-like recombination parameters), we found that msprime is much faster than comparable exact simulators and, surprisingly, is competitive with approximate SMC simulators. Even more surprisingly, we found that for a fixed sequence length of 50 megabases and increasing sample size, msprime was much faster than any existing simulator for large samples.

Storing the output of the simulations as coalescence records has major advantages. Because parent-child relationships shared by adjacent trees are stored only once, the correlation structure of the sequence of genealogies is explicit and the storage requirements minimised. To illustrate this, we ran a simulation of a 100 megabase chromosome with a roughly human recombination rate for a sample of 100,000 individuals. This simulation ran in about 6 minutes on a single CPU thread and used around 850MB of RAM. The resulting coalescence records required 88MB using msprime’s native HDF5 based storage format. Storing the same genealogies in Newick format requires around 3.5TB.

Highly compressed representations of data usually come at the cost of increased access time. In contrast, we can retrieve complete genealogical information from coalescence records many times faster than is possible using existing Newick-based methods. We provide a detailed listing of an algorithm to sequentially recover the marginal genealogies from a set of coalescence records and show that this algorithm requires constant time to transition between adjacent trees. This algorithm has been implemented as part of msprime’s Python API, and required around 3 seconds to iterate over all 1.1 million trees generated by the simulation above. We compared this performance to several popular tree processing libraries, and found that the fastest would require an estimated 38 days to parse the same set of trees in Newick format. Thus, in this example, by using msprime’s storage format and API we can store the same set of trees using around forty thousand times less space and parse them around a million times more quickly than Newick based methods.

We can also store mutational information in a natural and efficient way. If we have an infinite sites mutation that occurs on the parent branch of a particular node at a particular position on the sequence, then we simply store this (node, position) pair. This leads to a very compact representation of the combined genealogical and mutational state of a sample. We simulated 1.2 million infinite sites mutations on top of the genealogies generated earlier, which resulted in a 102MB HDF5 file containing the coalescence records and mutations. In contrast, the corresponding text haplotypes consumed 113GB of storage space. Associating mutations directly with tree nodes also allows us to perform some important calculations efficiently. We describe an efficient algorithm to count the total number of leaf nodes from a particular set below each node in the tree as we iterate over the sequence. This algorithm allows us to (for example) calculate allele frequencies within specific subsets in constant time. Many other applications of these techniques are possible.

The availability of faster and more accurate simulations may lead to interesting new applications, and so we conclude by discussing some potential applications of our work. Of particular interest is the possibility of inferring a set of coalescence records from real biological data, obtaining a compressed representation that can be efficiently processed. This is a very interesting and promising direction for future research.

Population genetic analyses of metagenomes reveal extensive strain-level variation in prevalent human-associated bacteria

Population genetic analyses of metagenomes reveal extensive strain-level variation in prevalent human-associated bacteria

Stephen Nayfach, Katherine S Pollard

Deep sequencing has the potential to shed light on the functional and phylogenetic heterogeneity of microbial populations in the environment. Here we present PhyloCNV, an integrated computational pipeline for quantifying species abundance and strain-level genomic variation from shotgun metagenomes. Our method leverages a comprehensive database of >30,000 reference genomes which we accurately clustered into species groups using a panel of universal-single-copy genes. Given a shotgun metagenome, PhyloCNV will rapidly and automatically identify gene copy number variants and single-nucleotide variants present in abundant bacterial species. We applied PhyloCNV to >500 faecal metagenomes from the United States, Europe, China, Peru, and Tanzania and present the first global analysis of strain-level variation and biogeography in the human gut microbiome. On average there is 8.5x more nucleotide diversity of strains between different individuals than within individuals, with elevated strain-level diversity in hosts from Peru and Tanzania that live rural lifestyles. For many, but not all common gut species, a significant proportion of inter-sample strain-level genetic diversity is explained by host geography. Eubacterium rectale, for example, has a highly structured population that tracks with host country, while strains of Bacteroides uniformis and other species are structured independently of their hosts. Finally, we discovered that the gene content of some bacterial strains diverges at short evolutionary timescales during which few nucleotide variants accumulate. These findings shed light onto the recent evolutionary history of microbes in the human gut and highlight the extensive differences in the gene content of closely related bacterial strains. PhyloCNV is freely available at:

A note on the distribution of admixture segment lengths and ancestry proportions under pulse and two-wave admixture models

A note on the distribution of admixture segment lengths and ancestry proportions under pulse and two-wave admixture models

Shai Carmi, James Xue, Itsik Pe’er
(Submitted on 19 Sep 2015)

Admixed populations are formed by the merging of two or more ancestral populations, and the ancestry of each locus in an admixed genome derives from either source. Consider a simple “pulse” admixture model, where populations A and B merged t generations ago without subsequent gene flow. We derive the distribution of the proportion of an admixed chromosome that has A (or B) ancestry, as a function of the chromosome length L, t, and the initial contribution of the A source, m. We demonstrate that these results can be used for inference of the admixture parameters. For more complex admixture models, we derive an expression in Laplace space for the distribution of ancestry proportions that depends on having the distribution of the lengths of segments of each ancestry. We obtain explicit results for the special case of a “two-wave” admixture model, where population A contributed additional migrants in one of the generations between the present and the initial admixture event. Specifically, we derive formulas for the distribution of A and B segment lengths and numerical results for the distribution of ancestry proportions. We show that for recent admixture, data generated under a two-wave model can hardly be distinguished from that generated under a pulse model.

Author post: Immunosequencing reveals diagnostic signatures of chronic viral infection in T cell memory

This guest post is by William DeWitt on his preprint (with co-authors) “Immunosequencing reveals diagnostic signatures of chronic viral infection in T cell memory”.

This is a post about our preprint at biorXiv. Although it’s a paper on infectious disease and immunology, our colleague (and Haldane’s Sieve contributor) Bryan Howie suggested we might engage the community here, since we think there are interesting connections to standard GWAS methodology. I’ll start with a one-paragraph immunology primer, then summarize what we’ve been up to, and what’s next.

Cell-mediated adaptive immunity is effected by T cells, which recognize infection through interface of the T cell receptor (TCR) with foreign peptides presented on the surface of all nucleated cells by major histocompatibility complex (MHC). During development in the thymus, maturing T cells somatically generate genes encoding the TCR according to a random process of V(D)J recombination and are passed through selective barriers against weak MHC affinity (positive selection) and strong self-peptide affinity (negative selection). This results in a diverse repertoire of self-tolerant receptors from which to deploy specific responses to threats from a protean universe of evolving pathogens. Upon recognition of foreign antigen, a T cell proliferates, generating a subpopulation with identical-by-descent TCRs. This clonal selection mechanism of immunological memory implies that the TCR repertoire dynamically encodes an individual’s pathogen exposure history, and suggests that infection with a given pathogen could be recognized by identifying concomitant TCRs.


In this study, we identify signatures of viral infection in the TCR repertoire. With a cohort of 640 subjects, we performed high-throughput immunosequencing of rearranged TCR genes, and serostatus tests for cytomegalovirus (CMV) infection. We used an analysis approach similar to GWAS; among the ~85 million unique TCRs in these data, we tested for enrichment of specific TCRs among CMV seropositive subjects, identifying a set of CMV-associated TCRs. These were reduced to a single dimension by defining the pathogen memory burden as the fraction of an individual’s TCRs that are CMV-associated, revealing a powerful discriminator. A binary classifier trained against this quantity demonstrated high cross validation accuracy.


The binding of TCR to antigen is mediated by MHC, which is encoded by the highly polymorphic HLA loci. Thus, the affinity of a given TCR for a given antigen is modulated by HLA haplotype. HLA typing was performed for this cohort according to standard methods, and we investigated enrichment of specific HLA alleles among the subjects in which each CMV-associated TCR appeared. Most CMV-associated TCRs were found to have HLA restrictions, and none were associated with more than one allele in any locus.

There is substantial literature identifying TCRs that bind CMV antigen through low-throughput in vitro methods, including so-called public TCRs, which arise in many individuals. Most public CMV TCRs were present in our data, however most were not in our list of diagnostically useful CMV-associated TCRs. This is understood by considering that V(D)J recombination produces different TCRs with different probability. Public TCRs, having high recombination probability, will be repeatedly recombined in the repertoires of all subjects, regardless of CMV infection status; their presence is not diagnostic for CMV serostatus, even if they are CMV-avid. Conversely, CMV-avid TCRs with low recombination probability will be private to one subject (if present at all) in any cohort of reasonable size; their enrichment in CMV seropositive subjects is not detectable. CMV-avid TCRs with intermediate recombination probability recombine intermittently, residing transiently in the repertoires CMV naïve individuals and reliably proliferating upon CMV exposure; the presence of these TCRs in an immunosequencing sample is diagnostic for infection status.

It may be interesting to draw a comparison with GWAS, where selection drives disease-associated variants with high effect size to low population frequency, out of reach of the detection power of any study. In contrast, the V(D)J recombination machinery is constant across individuals, and CMV-avid TCRs appear to span a broad range of recombination probabilities from public to private. This includes plenty in an intermediate regime of what we might call diagnostic TCRs, which can be used to build powerful classifiers of disease status that aren’t blunted by suppression of the most relevant features.

We’ll be making the data from this study available online, constituting the largest publically accessible TCR immunosequencing data set. It’ll be fun to see what other groups do with it.

Some things we’re still working on:
• We did cross validation to assess diagnostic accuracy (yellow curve in the ROC figure), including recomputation of CMV-associated TCRs for each holdout. Results are encouraging, but a more convincing test will be to diagnose a separate cohort. We’re in the process of acquiring these data.
• MHC polymorphism suggests that thymic selection barriers censor different TCRs in individuals with different HLA haplotype. We’ve done preliminary work identifying TCRs that are associated with more common HLA alleles, indicating the possibility of HLA typing from immunosequencing data. Interestingly, this necessitates a two-tailed test due to modulation of both positive and negative selection.
• Our association analysis relies on the recurrence of TCRs with identical amino acid sequence across individuals, but we’d like to be able to define TCR motifs more loosely, so that we can detect enrichment without requiring identity. This necessitates a similarity metric in amino acid space that captures similarity in avidity. We have some ideas here, and are testing them out on some validation data. It’s definitely a tough one, but could substantially increase power in this sort of study.

Author post: Limits to adaptation in partially selfing species

This guest post is by Matthew Hartfield (@mathyhartfield) on his preprint (with Sylvain Glemin) “Limits to adaptation in partially selfing species”, available from bioRxiv here

Our paper “Limits to adaptation in partially selfing species” is now available from bioRxiv. This preprint is the result from a collaboration that has been sent back-and-forth across the Atlantic for well over a year, so we are pleased to see it online.

Haldane’s Sieve, after which this blog is named, is a theory pertaining to the role of dominance in adaptation, which was initially developed for outcrossing species and then shown to be absent in selfing species. When beneficial alleles initially appear in diploid individuals, they do so in heterozygote form (so only one of two alleles at the locus carry the advantageous type). Mathematically, these mutations have selective advantage 1 + hs where h is the degree of dominance, and s the selective advantage. Haldane’s Sieve states that recessive mutations (h 1/2), because selection is not efficient on heterozygotes if mutations are recessive. However, self-fertilising individuals are able to rapidly create homozygote forms of the mutant, increasing the efficacy of selection acting on them. Yet selfing also increases genetic drift, and hence the risk that these adaptations will go extinct by chance. Consequently, an extension of Haldane’s Sieve states that if the mutation is recessive (h 1/2).

This result holds for a single mutant in isolation. Yet mutants seldom act independently; they usually arise alongside other alleles in the genome, each of which has their own evolutionary outcomes. A known additional advantage of outcrossing is that, through recombining genomes from each parent, selected alleles can be moved from disadvantageous genomes to fitter backgrounds. For example, say an adaptive allele was present in a population, and a second adaptation arose at a nearby locus. If the second allele was not as strongly selected as the first, then it has to arise on the same genome as the initial adaptation. Otherwise it is likely to be lost as the less-fit genotype is replaced over time, a process known as selective interference. However, outcrossing can unite the two mutations into the same genome, so both can spread.

Despite these potential advantages of outcrossing, the effect of selective interference has not yet been investigated in the context of how facultative selfing influences the fixation of multiple beneficial alleles. Our model therefore aimed to determine how likely it is that secondary beneficial alleles can fix in the population, given an existing adaptation was already present, and reproduction involved a certain degree of self-fertilisation.

After working through the calculations, two subtle yet important twists on Haldane’s Sieve revealed themselves. First, due to the effects of selection interference, Haldane’s Sieve is likely to be reinforced in areas of low recombination. That is, recessive mutants are more likely to be lost in outcrossers (when compared to single-locus results), with similar losses for dominant mutations in self-fertilising organisms. Secondly, we also investigated a case where the second beneficial mutant could be reintroduced by recurrent mutation. In this case, selection interference can be very severe in selfers due to the lack of recombination. Hence some degree of outcrossing would be optimal to prevent these beneficial alleles from being repeatedly lost, even if they are recessive. In the most extreme case, complete outcrossing is best if secondary mutations only confer minor advantages.

In recent years, the role that selection interference plays in affecting mating system evolution is starting to become recognised. Our theoretical study is just one of many that elucidates how important outcrossing can be in augmenting the efficacy of selection. Our hope is that these studies will spur on further empirical work quantifying the rate of adaptation in species with different mating systems, to further unravel why species reproduce in vastly different ways.

Neighbourhoods of phylogenetic trees: exact and asymptotic counts

Neighbourhoods of phylogenetic trees: exact and asymptotic counts

Jamie V. De Jong, Jeanette C McLeod, Mike Steel
(Submitted on 15 Aug 2015)

A central theme in phylogenetics is the reconstruction and analysis of evolutionary trees from a given set of data. To determine the optimal search methods for reconstructing trees, it is crucial to understand the size and structure of the neighbourhoods of trees under tree rearrangement operations. The diameter and size of the immediate neighbourhood of a tree has been well-studied, however little is known about the number of trees at distance two, three or (more generally) k from a given tree. In this paper we provide a number of exact and asymptotic results concerning these quantities, and identify some key aspects of tree shape that play a role in determining these quantities. We obtain several new results for two of the main tree rearrangement operations – Nearest Neighbour Interchange and Subtree Prune and Regraft — as well as for the Robinson-Foulds metric on trees.

Sex-dependent dominance at a single locus maintains variation in age at maturity in Atlantic salmon

Sex-dependent dominance at a single locus maintains variation in age at maturity in Atlantic salmon

Nicola Barson, Tutku Aykanat, Kjetil Hindar, Matthew Baranski, Geir Bolstad, Peder Fiske, Celeste Jacq, Arne Jensen, Susan E Johnston, Sten Karlsoon, Matthew Kent, Eero Niemelä, Torfinn Nome, Tor Naesje, Panu Orell, Atso Romakkaniemi, Harald Saegrov, Kurt Urdal, Jaakko Erkinaro, Sigbjorn Lien, Craig Primmer

Males and females share many traits that have a common genetic basis, however selection on these traits often differs between the sexes leading to sexual conflict. Under such sexual antagonism, theory predicts the evolution of genetic architectures that resolve this sexual conflict. Yet, despite intense theoretical and empirical interest, the specific genetic loci behind sexually antagonistic phenotypes have rarely been identified, limiting our understanding of how sexual conflict impacts genome evolution and the maintenance of genetic diversity. Here, we identify a large effect locus controlling age at maturity in 57 salmon populations, an important fitness trait in which selection favours earlier maturation in males than females, and show it is a clear example of sex dependent dominance reducing intralocus sexual conflict and maintaining adaptive variation in wild populations. Using high density SNP data and whole genome re-sequencing, we found that vestigial-like family member 3 (VGLL3) exhibits sex-dependent dominance in salmon, promoting earlier and later maturation in males and females, respectively. VGLL3, an adiposity regulator associated with size and age at maturity in humans, explained 39.4% of phenotypic variation, an unexpectedly high effect size for what is usually considered a highly polygenic trait. Such large effects are predicted under balancing selection from either sexually antagonistic or spatially varying selection. Our results provide the first empirical example of dominance reversal permitting greater optimisation of phenotypes within each sex, contributing to the resolution of sexual conflict in a major and widespread evolutionary trade-off between age and size at maturity. They also provide key empirical evidence for how variation in reproductive strategies can be maintained over large geographical scales. We further anticipate these findings will have a substantial impact on population management in a range of harvested species where trends towards earlier maturation have been observed