Author post: Rapid antibiotic resistance predictions from genome sequence data for S. aureus and M. tuberculosis

This guest post is by Zamin Iqbal [@ZaminIqbal] and Phelim Bradley [@Phelimb]

Our paper “Rapid antibiotic resistance predictions from genome sequence data for S. aureus and M. tuberculosis” has just appeared on the Biorxiv. We’re excited about it for a number of reasons.

The idea of using a graph of genetic variation as a reference, instead of a linear genome, has been discussed for some while, and in fact a previous biorxiv preprint of ours applying them to the MHC has just come out in Nature Genetics:

In this paper we apply those ideas to bacteria, where we let go of the linear coordinate system in order to handle plasmid-mediated genes. Our idea is simple – we want to see if genomic data can be used to predict antibiotic resistance in bacteria, and we explicitly want to build a general framework that will extend to many species, and handle mixed infections.

The paper does not deal with the issue of discovering mechanisms/mutations/genes which drive drug resistance – we take a set of geno-pheno rules as prerequisite, and then use a graph of resistance mutations and genes on different genetic backgrounds to detect presence of alleles and compare statistical models – is the population clonal susceptible, minor resistant or major resistant? Although it is accepted that minor alleles can sweep to fixation, in general there is neither consensus nor quantitative data on the correlation between allele frequency and in vitro phenotypic resistance or patient outcome (the latter obviously being much harder). At a practical level,in some cases a clinician might avoid a drug if they knew there was a 5%-frequency resistance allele, and in others they might increase the dose. Resistance is of course a quantitative trait, often measured in terms of the minimum concentration of a drug required to stop growth of a fixed inoculum – but commonly a threshold is drawn and samples are classified in a binary fashion.

A paper last year from some of us ( showed that a simple panel of SNPs and genes was enough to predict resistance with high sensitivity and specificity for S. aureus (where SNPs, indels, chromosomal genes and plasmid-mediated genes can all cause resistance) – once you discard all samples with any mixed strains. (Standard process is to take a patient sample and culture “overnight” (12-24 hours), thus removing almost all diversity and samples which show any morphological signs of diversity after culture are discarded or subcultured). By contrast, for M. tuberculosis (which causes TB), known resistance mutations explain a relatively low proportion of phenotypic resistance (~85%) for first-line drugs, and even less for 2nd line (I explain below what 1st/2nd line are). The Mtb population within-host is highly structured and multiple genotypes can evolve in different loci within the body, so it’s important to be able to deal with mixtures. Typical phenotyping relies on several weeks of solid culture (Mtb is slow growing), but mixtures are more able to survive this type of culture than in the case of S. aureus.

We show with simulations that we can use the graph to detect low frequency mutations and genes (no surprise), and that for S. aureus we make no minor calls for our validation set of ~500 blood-cultured samples (no surprise). Each sample is phenotyped with 2 standard lab methods, and where they disagree a higher quality test is used to arbitrate. This consensus allows us to estimate error rates both for our method (called Mykrobe predictor) and for the phenotypic tests. As a result we’re able to show not only that we do comparably with FDA requirements for a diagnostic, but also that we match or beat the common phenotypic tests.

On the other hand for TB, the story is much more complex and interesting. We analyse ~3500 genomes in total, split into ~1900 training samples and ~1600 for validation. For M. tuberculosis, a sample is classed as resistant if after some weeks of culturing under drug pressure, the number of surviving colonies is >1% of the number of colonies from a control strain treated identically – the number 1% is of course arbitrary (set down by Canetti in the 1960s I think), though it has been shown that phenotypic resistance does correlate with worse patient outcome. Sequencing on the other hand is done before the drug pressure, so we are fundamentally testing a different population, and we can’t simply mirror that 1% allele frequency expectation. This is what we use the 1900 training samples for – determining what frequency to set for our minor-resistant model. We ended up using 20%, and also found that there
was an appreciable amount of lower frequency resistance, which did not survive the 6-week drug-pressure susceptibility test, but which might cause resistance in a patient.

Mtb infections can last a long time, and despite their slow growth, the sheer number of bacilli in a host result in a vast in-host diversity. As a result, mono therapies fail, as resistant strains sweep to fixation – standard treatment is therefore with 4 “first-line” drugs, reducing the chance that any strain has enough mutations to resist them all. If the first-line drugs fail, or if the strain is known to be resistant, then it is necessary to fall back to more toxic and less effective second-line drugs. We found, somewhat to our surprise, that

1. Overall, minor alleles contribute very little to phenotypic resistance in first-line drugs, but they do make a significant contribution to second-line drugs, improving predictive power by >15%. This matches previous reports that patient samples had mixed R and S alleles for 2nd line drugs. This could have major public health consequences, as resistance to these drugs needs to be detected to distinguish MDR-TB (resistant to isoniazid, rifampicin) from XDR-TB (isoniazid, rifampicin + second-line), a major concern for the WHO.

2. Interestingly, a noticeable number of rifampicin false-positive calls were due to SNPs which confer resistance but have been shown to slow growth. Since the phenotyping test is intrinsically a measure of relative growth, these strains may be misclassified as susceptible – i.e. these are probably false-susceptible calls due to an artefact of the nature of the test. This has been reported before by the way.

Anyway – please check out the paper for details. We think this large-scale analysis of whether minor alleles contribute to in vitro phenotype, and whether they should be used for prediction is new and interesting both scientifically and in terms of translation. The bigger question is what the consequences are for patient outcome, and how to deal with in-host diversity, and for that we of course need data collection and sharing. We’ve spent a lot of time in the Oxford John Radcliffe Hospital working with clinicians, and trying to determine what information they really need from this kind of predictive test, and we’ve produced both Windows/Mac apps with very simple user-interfaces (drag-the-fastq on, and let it run) for them to use; we’ve also produced an Illumina Basespace app, currently submitted to Illumina for approval, which should enable automated cloud-use.

Our paper also has a whole bunch of work I’ve not mentioned here, where we needed to identify species, and detect contaminants – most interesting when common contaminants can contain the same resistance gene as the species under test.

Our software is up on github here
including some desktop apps and example fastq files so you can test it.

Comments very welcome!

Zam and Phelim

PS By the way, the 4 first-line drugs have different effectiveness in different body compartments – see this interesting paper for the modelling of the consequences:


Epidemiological and evolutionary analysis of the 2014 Ebola virus outbreak

Epidemiological and evolutionary analysis of the 2014 Ebola virus outbreak
Marta Łuksza, Trevor Bedford, Michael Lässig
Subjects: Populations and Evolution (q-bio.PE)

The 2014 epidemic of the Ebola virus is governed by a genetically diverse viral population. In the early Sierra Leone outbreak, a recent study has identified new mutations that generate genetically distinct sequence clades. Here we find evidence that major Sierra Leone clades have systematic differences in growth rate and reproduction number. If this growth heterogeneity remains stable, it will generate major shifts in clade frequencies and influence the overall epidemic dynamics on time scales within the current outbreak. Our method is based on simple summary statistics of clade growth, which can be inferred from genealogical trees with an underlying clade-specific birth-death model of the infection dynamics. This method can be used to perform realtime tracking of an evolving epidemic and identify emerging clades of epidemiological or evolutionary significance.

Tackling drug resistant infection outbreaks of global pandemic Escherichia coli ST131 using evolutionary and epidemiological genomics

Tackling drug resistant infection outbreaks of global pandemic Escherichia coli ST131 using evolutionary and epidemiological genomics
Tim Downing
(Submitted on 4 Nov 2014)

High-throughput molecular approaches are required to investigate the origin and diffusion of antimicrobial resistance in rapidly radiating pathogen outbreaks. The most frequent cause of human infection is Escherichia coli, which is dominated by ST131, a single pandemic clone. This epidemic subtype possesses an extensive array of virulence elements and tolerates many drugs. Frequent global sweeps of new dominant ST131 varieties necessitate deep genomic scrutiny of their spread, evolution and lateral transfer of drug resistance genes. Phylogenetic methods that decipher past events can predict future patterns of virulence and transmission based on genetic signatures of adaptation and recombination. Antibiotic tolerance is controlled by natural variation in gene expression levels, which can initiate delayed cell growth. This dormancy allows survival despite drug exposure, and yet may only be present in part of the infecting cell population. Consequently, genomic epidemiology needs to explore the scale of phenotypic regulatory control acting on RNA. A multi-faceted approach can comprehensively assess antimicrobial resistance in E. coli ST131 in terms of within-host genetic heterogeneity, regulation of gene expression, and transmission dynamics between hosts to achieve a goal of pre-empting resistance before it emerges by optimising drug treatment protocols.

Extraordinarily wide genomic impact of a selective sweep associated with the evolution of sex ratio distorter suppression

Extraordinarily wide genomic impact of a selective sweep associated with the evolution of sex ratio distorter suppression
Emily A Hornett, Bruce Moran, Louise A Reynolds, Sylvain Charlat, Samuel Tazzyman, Nina Wedell, Chris D Jiggins, Gregory Hurst

Symbionts that distort their host?s sex ratio by favouring the production and survival of females are common in arthropods. Their presence produces intense Fisherian selection to return the sex ratio to parity, typified by the rapid spread of host ?suppressor? loci that restore male survival/development. In this study, we investigated the genomic impact of a selective event of this kind in the butterfly Hypolimnas bolina. Through linkage mapping we first identified a genomic region that was necessary for males to survive Wolbachia-induced killing. We then investigated the genomic impact of the rapid spread of suppression that converted the Samoan population of this butterfly from a 100:1 female-biased sex ratio in 2001, to a 1:1 sex ratio by 2006. Models of this process revealed the potential for a chromosome-wide selective sweep. To measure the impact directly, the pattern of genetic variation before and after the episode of selection was compared. Significant changes in allele frequencies were observed over a 25cM region surrounding the suppressor locus, alongside generation of linkage disequilibrium. The presence of novel allelic variants in 2006 suggests that the suppressor was introduced via immigration rather than through de novo mutation. In addition, further sampling in 2010 indicated that many of the introduced variants were lost or had reduced in frequency since 2006. We hypothesise that this loss may have resulted from a period of purifying selection – removing deleterious material that introgressed during the initial sweep. Our observations of the impact of suppression of sex ratio distorting activity reveal an extraordinarily wide genomic imprint, reflecting its status as one of the strongest selective forces in nature.

Bayesian Coalescent Epidemic Inference: Comparison of Stochastic and Deterministic SIR Population Dynamics

Bayesian Coalescent Epidemic Inference: Comparison of Stochastic and Deterministic SIR Population Dynamics

Alex Popinga, Tim Vaughan, Tanja Stadler, Alexei Drummond
Comments: Submitted
Subjects: Populations and Evolution (q-bio.PE)

Estimation of epidemiological and population parameters from molecular sequence data has become central to the understanding of infectious disease dynamics. Various models have been proposed to infer details of the dynamics that describe epidemic progression. These include inference approaches derived from Kingmans coalescent as well as from birth death branching processes. The development of alternative approaches merits investigation of their characteristics and differences. Here we use recently described coalescent theory for epidemic dynamics to develop stochastic and deterministic coalescent SIR tree priors. We implement these in a Bayesian phylogenetic inference framework to permit joint estimation of SIR epidemic parameters and the sample genealogy. We assess the models performance and contrast results obtained with a recently published birth death sampling model for epidemic inference. Comparisons are made by analyzing sets of genealogies simulated under precisely known epidemiological parameters. We also compare results of analyses using published HIV1 sequence data obtained from known UK infection clusters. We show that the coalescent SIR model is effective at estimating epidemiological parameters from data with large fundamental reproductive number R0 and large population size S0. We find that the stochastic variant generally outperforms its deterministic counterpart. However, each of these Bayesian estimators are shown to have undesirable properties in certain circumstances, especially for epidemic outbreaks with R0 close to one or with small susceptible populations.

Epidemic clones, oceanic gene pools and epigenotypes in the free living marine pathogen Vibrio parahaemolyticus

Epidemic clones, oceanic gene pools and epigenotypes in the free living marine pathogen Vibrio parahaemolyticus
Yujun Cui, Xianwei Yang, Xavier Didelot, Chenyi Guo, Dongfang Li, Yanfeng Yan, Yiquan Zhang, Yanting Yuan, Huanming Yang, Jian Wang, Jun Wang, Yajun Song, Dongsheng Zhou, Daniel Falush, Ruifu Yang
Subjects: Populations and Evolution (q-bio.PE)

In outbreeding organisms, genetic variation is reassorted each generation, leading to geographic gene pools. By contrast bacterial clones can spread and adapt independently leading to a wide variety of possible genetic structures. Here we investigated global patterns of variation in 157 whole genome sequences of Vibrio parahaemolyticus, a free living and seafood associated marine bacterium. Pandemic clones, responsible for recent outbreaks of gastroenteritis in humans have spread globally. However, there are oceanic gene pools, one located in the oceans surrounding Asia and another in the Mexican Gulf. Frequent recombination means that most isolates have acquired the genetic profile of their current location. Within oceanic gene pools, there is nevertheless the opportunity for substructure, for example due to niche partitioning by different clones. We investigated this structure by calculating the effective population size in two different ways. Under standard population genetic models, the two estimates should give similar answers but we found a 30 fold difference. This discrepancy provides evidence for an ‘epigenotype’ model in which distinct ecotypes are maintained by selection on an otherwise homogeneous genetic background. To investigate the genetic factors involved, we used 54 unrelated isolates to conduct a genome wide scan for epistatically interacting loci. We found a single example of strong epistasis between distant genome regions. One of the genes involved in this interaction has previously been implicated in biofilm formation, while the other is a hypothetical protein. Further work will allow a detailed understanding of how selection acts to structure the pattern of variation within natural bacterial populations.

Author post: Predicting evolution from the shape of genealogical trees

This guest post by Richard Neher discusses his preprint Predicting evolution from the shape of genealogical trees. Richard A. Neher, Colin A. Russell, Boris I. Shraiman. arXived here. This is cross-posted from the Neher lab website.

In this preprint — a collaboration with Colin Russell and Boris Shraiman — we show that it is possible to predict which individual from a population is most closely related to future populations. To this end, we have developed a method that uses the branching pattern of genealogical trees to estimate which part of the tree contains the “fittest” sequences, where fit means rapidly multiplying. Those that multiply rapidly, are most likely to take over the population. We demonstrate the power of our method by predicting the evolution of seasonal influenza viruses.

How does it work?
Individuals adapt to a changing environment by accumulating beneficial mutations, while avoiding deleterious mutations. We model this process assuming that there are many such mutations which change fitness in small increments. Using this model, we calculate the probability that an individual that lived in the past at time t leaves n descendants in the present. This distributions depends critically on the fitness of the ancestral individual. We then extend this calculation to the probability of observing a certain branch in a genealogical tree reconstructed from a sample of sequences. A branch in a tree connects an individual A that lived at time tA and had fitness xA and with an individual B that lived at a later time tB with fitness xB as illustrated in the figure. B has descendants in the sample, otherwise the branch would not be part of the tree. Furthermore, all sampled descendants of A are also descendants of B, otherwise the connection between A and B would have branched between tA and tB. We call the mathematical object describing fitness evolution between A and B “branch propagator” and propagatordenote it by g(xB,tB|xA,tA). The joint probability distribution of fitness values of all nodes of the tree is given by a product of branch propagators. We then calculate the expected fitness of each node and use it to rank the sampled sequences. The top ranked sequence is our prediction for the sequence of the progenitor of the future population.

Why do we care?
flu_tree Being able to predict evolution could have immediate applications. The best example is the seasonal influenza vaccine, that needs to be updated frequently to keep up with the evolving virus. Vaccine strains are chosen among sampled virus strains, and the more closely this strain matches the future influenza virus population, the better the vaccine is going to be. Hence by predicting a likely progenitor of the future, our method could help to improve influenza vaccines. One of our predictions is shown in the figure, with the top ranked sequence marked by a black arrow. Influenza is not the only possible application. Since the algorithm only requires a reconstructed tree as input, it can be applied to other rapidly evolving pathogens or cancer cell populations. In addition, to being useful, the ability to predict also implies that the model captures an essential aspect of evolutionary dynamics: influenza evolution is to a substantial degree — enough to enable prediction — dependent on the accumulation of small effect mutations.

Comparison to other approaches
Given the importance of good influenza vaccines, there has been a number of previous efforts to anticipate influenza virus evolution, typically based on using patterns of molecular evolution from historical data. Along these lines, Luksza and Lässig have recently presented an explicit fitness model for influenza virus evolution that rewards mutations at positions known to convey antigenic novelty and penalizes likely deleterious mutations (+a few other things). By using molecular influenza specific signatures, this model is complementary to ours that uses only the tree reconstructed from nucleotide sequences. Interestingly, the two models do more or less equally well and combining different methods of prediction should result in more reliable results.

Phylogenetic tree shapes resolve disease transmission patterns

Phylogenetic tree shapes resolve disease transmission patterns
Jennifer Gardy, Caroline Colijn

Whole genome sequencing is becoming popular as a tool for understanding outbreaks of communicable diseases, with phylogenetic trees being used to identify individual transmission events or to characterize outbreak-level overall transmission dynamics. Existing methods to infer transmission dynamics from sequence data rely on well-characterised infectious periods, epidemiological and clinical meta-data which may not always be available, and typically require computationally intensive analysis focussing on the branch lengths in phylogenetic trees. We sought to determine whether the topological structures of phylogenetic trees contain signatures of the overall transmission patterns underlying an outbreak. Here we use simulated outbreaks to train and then test computational classifiers. We test the method on data from two real-world outbreaks. We find that different transmission patterns result in quantitatively different phylogenetic tree shapes. We describe five topological features that summarize a phylogeny’s structure and find that computational classifiers based on these are capable of predicting an outbreak’s transmission dynamics. The method is robust to variations in the transmission parameters and network types, and recapitulates known epidemiology of previously characterized real-world outbreaks. We conclude that there are simple structural properties of phylogenetic trees which, when combined, can distinguish communicable disease outbreaks with a super-spreader, homogeneous transmission, and chains of transmission. This is possible using genome data alone, and can be done during an outbreak. We discuss the implications for management of outbreaks.

Reassortment between influenza B lineages and the emergence of a co-adapted PB1-PB2-HA gene complex

Reassortment between influenza B lineages and the emergence of a co-adapted PB1-PB2-HA gene complex
Gytis Dudas, Trevor Bedford, Samantha Lycett, Andrew Rambaut
Comments: 33 pages, 21 figures
Subjects: Populations and Evolution (q-bio.PE)

Influenza B viruses are increasingly being recognized as major contributors to morbidity attributed to seasonal influenza. Currently circulating influenza B isolates are known to belong to two antigenically distinct lineages referred to as B/Victoria and B/Yamagata. Frequent exchange of genomic segments of these two lineages has been noted in the past, but the observed patterns of reassortment have not been formalized in detail. We investigate inter-lineage reassortments by comparing phylogenetic trees across genomic segments. Our analyses indicate that of the 8 segments of influenza B viruses only PB1, PB2 and HA segments maintained separate Victoria and Yamagata lineages and that currently circulating strains possess PB1, PB2 and HA segments derived entirely from one or the other lineage; other segments have repeatedly reassorted between lineages thereby reducing genetic diversity. We argue that this difference between segments is due to selection against reassortant viruses with mixed lineage PB1, PB2 and HA segments. Given sufficient time and continued recruitment to the reassortment-isolated PB1-PB2-HA gene complex, we expect influenza B viruses to eventually undergo sympatric speciation.

High Genetic Diversity and Adaptive Potential of Two Simian Hemorrhagic Fever Viruses in a Wild Primate Population

High Genetic Diversity and Adaptive Potential of Two Simian Hemorrhagic Fever Viruses in a Wild Primate Population
Adam L. Bailey, Michael Lauck, Andrea Weiler, Samuel D. Sibley, Jorge M. Dinis, Zachary Bergman, Chase W. Nelson, Michael Correll, Michael Gleicher, David Hyeroba, Alex Tumukunde, Geoffrey Weny, Colin Chapman, Jens Kuhn, Austin Hughes, Thomas C. Friedrich, Tony L. Goldberg, David H. O’Connor

Key biological properties such as high genetic diversity and high evolutionary rate enhance the potential of certain RNA viruses to adapt and emerge. Identifying viruses with these properties in their natural hosts could dramatically improve disease forecasting and surveillance. Recently, we discovered two novel members of the viral family Arteriviridae: simian hemorrhagic fever virus (SHFV)-krc1 and SHFV-krc2, infecting a single wild red colobus (Procolobus rufomitratus tephrosceles) in Kibale National Park, Uganda. Nearly nothing is known about the biological properties of SHFVs in nature, although the SHFV type strain, SHFV-LVR, has caused devastating outbreaks of viral hemorrhagic fever in captive macaques. Here we detected SHFV-krc1 and SHFV-krc2 in 40% and 47% of 60 wild red colobus tested, respectively. We found viral loads in excess of 1×10^6-1×10^7 RNA copies per milliliter of blood plasma for each of these viruses. SHFV-krc1 and SHFV-krc2 also showed high genetic diversity at both the inter- and intra-host levels. Analyses of synonymous and non-synonymous nucleotide diversity across viral genomes revealed patterns suggestive of positive selection in SHFV open reading frames (ORF) 5 (SHFV-krc2 only) and 7 (SHFV-krc1 and SHFV-krc2). Thus, these viruses share several important properties with some of the most rapidly evolving, emergent RNA viruses.