Selection for Intermediate Genotypes Enables a Key Innovation in Phage Lambda

Alita Burmeister , Richard Lenski , Justin Meyer
doi: http://dx.doi.org/10.1101/018606

The evolution of qualitatively new functions is fundamental for shaping the diversity of life. Such innovations are rare because they require multiple coordinated changes. We sought to understand the evolutionary processes involved in a particular key innovation, whereby phage λ evolved the ability to exploit a novel receptor, OmpF, on the surface of Escherichia coli cells. Previous work has shown that this transition repeatedly evolves in the laboratory, despite requiring four mutations in specific regions of a single gene. Here we examine how this innovation evolved by studying six intermediate genotypes that arose during independent transitions to use OmpF. In particular, we tested whether these genotypes were favored by selection, and how a coevolved change in the hosts influenced the fitness of the phage genotypes. To do so, we measured the fitness of the intermediate types relative to the ancestral λ when competing for either ancestral or coevolved host cells. All six intermediates had improved fitness on at least one host, and four had higher fitness on the coevolved host than on the ancestral host. These results show that the evolution of the phage’s new ability to use OmpF was repeatable because the intermediate genotypes were adaptive and, in many cases, because coevolution of the host favored their emergence.

Rapid antibiotic resistance predictions from genome sequence data for S. aureus and M. tuberculosis.

Phelim Bradley , N Claire Gordon , Timothy M Walker , Laura Dunn , Simon Heys , Bill Huang , Sarah Earle , Louise J Pankhurst , Luke Anson , Mariateresa de Cesare , Paolo Piazza , Antonina A Votintseva , Tanya Golubchik , Daniel J Wilson , David H Wyllie , Roland Diel , Stefan Niemann , Silke Feuerriegel , Thomas A Kohl , Nazir Ismail , Shaheed V Omar , E Grace Smith , David Buck , Gil McVean , A Sarah Walker , Tim Peto , Derrick Crook , Zamin Iqbal
doi: http://dx.doi.org/10.1101/018564

Rapid and accurate detection of antibiotic resistance in pathogens is an urgent need, affecting both patient care and population-scale control. Microbial genome sequencing promises much, but many barriers exist to its routine deployment. Here, we address these challenges, using a de Bruijn graph comparison of clinical isolate and curated knowledge-base to identify species and predict resistance profile, including minor populations. This is implemented in a package, Mykrobe predictor, for S. aureus and M. tuberculosis, running in under three minutes on a laptop from raw data. For S. aureus, we train and validate in 495/471 samples respectively, finding error rates comparable to gold-standard phenotypic methods, with sensitivity/specificity of 99.3%/99.5% across 12 drugs. For M. tuberculosis, we identify species and predict resistance with specificity of 98.5% (training/validating on 1920/1609 samples). Sensitivity of 82.6% is limited by current understanding of genetic mechanisms. We also show that analysis of minor populations increases power to detect phenotypic resistance in second-line drugs without appreciable loss of specificity. Finally, we demonstrate feasibility of an emerging single-molecule sequencing technique.

Author post: Rapid antibiotic resistance predictions from genome sequence data for S. aureus and M. tuberculosis

This guest post is by Zamin Iqbal [@ZaminIqbal] and Phelim Bradley [@Phelimb]

Our paper “Rapid antibiotic resistance predictions from genome sequence data for S. aureus and M. tuberculosis” has just appeared on the Biorxiv. We’re excited about it for a number of reasons.

The idea of using a graph of genetic variation as a reference, instead of a linear genome, has been discussed for some while, and in fact a previous biorxiv preprint of ours applying them to the MHC has just come out in Nature Genetics:
http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.3257.html

In this paper we apply those ideas to bacteria, where we let go of the linear coordinate system in order to handle plasmid-mediated genes. Our idea is simple – we want to see if genomic data can be used to predict antibiotic resistance in bacteria, and we explicitly want to build a general framework that will extend to many species, and handle mixed infections.

The paper does not deal with the issue of discovering mechanisms/mutations/genes which drive drug resistance – we take a set of genotype-phenotype rules as a prerequisite, and then use a graph of resistance mutations and genes on different genetic backgrounds to detect the presence of alleles and compare statistical models – is the population clonal susceptible, minor resistant or major resistant? Although it is accepted that minor alleles can sweep to fixation, in general there is neither consensus nor quantitative data on the correlation between allele frequency and in vitro phenotypic resistance or patient outcome (the latter obviously being much harder). At a practical level, in some cases a clinician might avoid a drug if they knew there was a 5%-frequency resistance allele, and in others they might increase the dose. Resistance is of course a quantitative trait, often measured in terms of the minimum concentration of a drug required to stop growth of a fixed inoculum – but commonly a threshold is drawn and samples are classified in a binary fashion.

A paper last year from some of us (http://jcm.asm.org/content/52/4/1182.full) showed that a simple panel of SNPs and genes was enough to predict resistance with high sensitivity and specificity for S. aureus (where SNPs, indels, chromosomal genes and plasmid-mediated genes can all cause resistance) – once you discard all samples with any mixed strains. (Standard process is to take a patient sample and culture “overnight” (12-24 hours), thus removing almost all diversity; samples which show any morphological signs of diversity after culture are discarded or subcultured.) By contrast, for M. tuberculosis (which causes TB), known resistance mutations explain a relatively low proportion of phenotypic resistance (~85%) for first-line drugs, and even less for 2nd line (I explain below what 1st/2nd line are). The Mtb population within-host is highly structured and multiple genotypes can evolve in different loci within the body, so it’s important to be able to deal with mixtures. Typical phenotyping relies on several weeks of solid culture (Mtb is slow growing), but mixtures are more likely to survive this type of culture than they are for S. aureus.

We show with simulations that we can use the graph to detect low frequency mutations and genes (no surprise), and that for S. aureus we make no minor calls for our validation set of ~500 blood-cultured samples (no surprise). Each sample is phenotyped with 2 standard lab methods, and where they disagree a higher quality test is used to arbitrate. This consensus allows us to estimate error rates both for our method (called Mykrobe predictor) and for the phenotypic tests. As a result we’re able to show not only that we do comparably with FDA requirements for a diagnostic, but also that we match or beat the common phenotypic tests.
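To make the arbitration step concrete, here is a minimal sketch of how a consensus phenotype could be built from two routine tests plus an arbiter, and how error rates for any method could then be scored against it. The function names and the toy calls are made up for illustration; they are not taken from the paper or from Mykrobe predictor.

```python
from collections import Counter

def consensus_phenotype(method_a, method_b, arbiter=None):
    """Return 'R' or 'S': the shared call if both routine tests agree,
    otherwise the call from the higher-quality arbitration test."""
    if method_a == method_b:
        return method_a
    if arbiter is None:
        raise ValueError("tests disagree and no arbitration result was given")
    return arbiter

def sensitivity_specificity(predicted, truth):
    """Score 'R'/'S' calls against a reference; resistance is the positive class."""
    counts = Counter(zip(predicted, truth))
    tp, fn = counts[("R", "R")], counts[("S", "R")]
    tn, fp = counts[("S", "S")], counts[("R", "S")]
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec

# Hypothetical samples: (test A, test B, arbiter or None, genotype-based call).
samples = [
    ("R", "R", None, "R"),
    ("S", "S", None, "S"),
    ("R", "S", "R", "R"),
    ("S", "R", "S", "S"),
    ("S", "S", None, "R"),
    ("R", "R", None, "R"),
]

truth = [consensus_phenotype(a, b, arb) for a, b, arb, _ in samples]
genotype_scores = sensitivity_specificity([g for *_, g in samples], truth)
test_b_scores = sensitivity_specificity([b for _, b, _, _ in samples], truth)
print("genotype:", genotype_scores)
print("test B:  ", test_b_scores)
```

Because the same consensus is used as the reference for every method, the genotype-based prediction and each phenotypic test can be compared on equal terms.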

On the other hand for TB, the story is much more complex and interesting. We analyse ~3500 genomes in total, split into ~1900 training samples and ~1600 for validation. For M. tuberculosis, a sample is classed as resistant if after some weeks of culturing under drug pressure, the number of surviving colonies is >1% of the number of colonies from a control strain treated identically – the number 1% is of course arbitrary (set down by Canetti in the 1960s I think), though it has been shown that phenotypic resistance does correlate with worse patient outcome. Sequencing on the other hand is done before the drug pressure, so we are fundamentally testing a different population, and we can’t simply mirror that 1% allele frequency expectation. This is what we use the 1900 training samples for – determining what frequency to set for our minor-resistant model. We ended up using 20%, and also found that there was an appreciable amount of lower frequency resistance, which did not survive the 6-week drug-pressure susceptibility test, but which might cause resistance in a patient.
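As a rough illustration of what comparing the three models means, the sketch below scores a resistance allele seen on some number of reads under three expected allele frequencies and picks the most likely one. This is a simplified read-count analogue, not the method in the paper, whose models are defined on the graph. The binomial likelihood, the 1% sequencing-error rate and the 99% major frequency are assumptions made for this example; only the 20% minor-resistant frequency comes from the text above.

```python
import math

def log_binom_pmf(k, n, p):
    """Log probability of k successes in n trials with success probability p."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

# Expected resistant-allele frequency under each model.
MODELS = {
    "susceptible": 0.01,      # assumed: resistant reads arise only from errors
    "minor_resistant": 0.20,  # frequency chosen from the training samples
    "major_resistant": 0.99,  # assumed: nearly all reads carry the allele
}

def classify(resistant_reads, total_reads, models=MODELS):
    """Return the best-scoring model name and all log-likelihoods."""
    scores = {name: log_binom_pmf(resistant_reads, total_reads, freq)
              for name, freq in models.items()}
    return max(scores, key=scores.get), scores

for k in (0, 10, 49):
    best, _ = classify(k, 50)
    print(f"{k}/50 resistant reads -> {best}")
```

With 50 reads covering the site, 0 resistant reads is best explained by the susceptible model, 10 by the minor-resistant model and 49 by the major-resistant model. Changing the minor frequency shifts where the boundary between susceptible and minor resistant falls, which is why it has to be tuned on training data.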

Mtb infections can last a long time, and despite their slow growth, the sheer number of bacilli in a host results in vast in-host diversity. As a result, monotherapies fail, as resistant strains sweep to fixation – standard treatment is therefore with 4 “first-line” drugs, reducing the chance that any strain has enough mutations to resist them all. If the first-line drugs fail, or if the strain is known to be resistant, then it is necessary to fall back to more toxic and less effective second-line drugs. We found, somewhat to our surprise, that

1. Overall, minor alleles contribute very little to phenotypic resistance in first-line drugs, but they do make a significant contribution to second-line drugs, improving predictive power by >15%. This matches previous reports that patient samples had mixed R and S alleles for 2nd line drugs. This could have major public health consequences, as resistance to these drugs needs to be detected to distinguish MDR-TB (resistant to isoniazid, rifampicin) from XDR-TB (isoniazid, rifampicin + second-line), a major concern for the WHO.

2. Interestingly, a noticeable number of rifampicin false-positive calls were due to SNPs which confer resistance but have been shown to slow growth. Since the phenotyping test is intrinsically a measure of relative growth, these strains may be misclassified as susceptible – i.e. the error is probably a false-susceptible phenotype, an artefact of the nature of the test, rather than a wrong genotype-based prediction. This has been reported before, by the way.

Anyway – please check out the paper for details. We think this large-scale analysis of whether minor alleles contribute to in vitro phenotype, and whether they should be used for prediction is new and interesting both scientifically and in terms of translation. The bigger question is what the consequences are for patient outcome, and how to deal with in-host diversity, and for that we of course need data collection and sharing. We’ve spent a lot of time in the Oxford John Radcliffe Hospital working with clinicians, and trying to determine what information they really need from this kind of predictive test, and we’ve produced both Windows/Mac apps with very simple user-interfaces (drag-the-fastq on, and let it run) for them to use; we’ve also produced an Illumina Basespace app, currently submitted to Illumina for approval, which should enable automated cloud-use.

Our paper also has a whole bunch of work I’ve not mentioned here, where we needed to identify species, and detect contaminants – most interesting when common contaminants can contain the same resistance gene as the species under test.

Our software is up on github here
https://github.com/iqbal-lab/Mykrobe-predictor
including some desktop apps and example fastq files so you can test it.

Comments very welcome!

Zam and Phelim

PS By the way, the 4 first-line drugs have different effectiveness in different body compartments – see this interesting paper for the modelling of the consequences: http://biorxiv.org/content/early/2014/12/19/013003.

Twisted trees and inconsistency of tree estimation when gaps are treated as missing data — the impact of model mis-specification in distance corrections

Emily Jane McTavish, Mike Steel, Mark T. Holder
Comments: 29 pages, 3 figures
Subjects: Populations and Evolution (q-bio.PE)

Statistically consistent estimation of phylogenetic trees or gene trees is possible if pairwise sequence dissimilarities can be converted to a set of distances that are proportional to the true evolutionary distances. Susko et al. (2004) reported some strikingly broad results about the forms of inconsistency in tree estimation that can arise if corrected distances are not proportional to the true distances. They showed that if the corrected distance is a concave function of the true distance, then inconsistency due to long branch attraction will occur. If these functions are convex, then two “long branch repulsion” trees will be preferred over the true tree, though these two incorrect trees are expected to be tied as the preferred tree. Here we extend their results, and demonstrate the existence of a tree shape (which we refer to as a “twisted Farris-zone” tree) for which a single incorrect tree topology will be guaranteed to be preferred if the corrected distance function is convex. We also report that the standard practice of treating gaps in sequence alignments as missing data is sufficient to produce non-linear corrected distance functions if the substitution process is not independent of the insertion/deletion process. Taken together, these results imply inconsistent tree inference under mild conditions. For example, if some positions in a sequence are constrained to be free of substitutions and insertion/deletion events while the remaining sites evolve with independent substitutions and insertion/deletion events, then the distances obtained by treating gaps as missing data can support an incorrect tree topology even given an unlimited amount of data.
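The concave case can be illustrated with a small numerical sketch. It assumes a Jukes-Cantor substitution model, a four-taxon tree ((A,B),(C,D)) with long branches leading to A and C, and illustrative branch lengths; none of these choices come from the paper. Uncorrected dissimilarity under Jukes-Cantor is a concave function of the true distance, and the four-point criterion (pick the pairing with the smallest sum of distances) then groups the two long branches together.

```python
import math

def jc_uncorrected(d):
    """Expected proportion of differing sites under Jukes-Cantor
    at true distance d (a concave function of d)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

def true_distances(long_len, short_len):
    """Path lengths on ((A,B),(C,D)); A and C sit on long branches,
    while B, D and the internal edge are short."""
    L, s = long_len, short_len
    return {
        ("A", "B"): L + s,
        ("C", "D"): L + s,
        ("A", "C"): 2 * L + s,
        ("B", "D"): 3 * s,
        ("A", "D"): L + 2 * s,
        ("B", "C"): L + 2 * s,
    }

def four_point_choice(dist):
    """Return the taxon pairing with the smallest summed distance."""
    sums = {
        "AB|CD": dist[("A", "B")] + dist[("C", "D")],
        "AC|BD": dist[("A", "C")] + dist[("B", "D")],
        "AD|BC": dist[("A", "D")] + dist[("B", "C")],
    }
    return min(sums, key=sums.get)

true_d = true_distances(long_len=2.0, short_len=0.05)
observed = {pair: jc_uncorrected(d) for pair, d in true_d.items()}

print(four_point_choice(true_d))    # AB|CD, the true topology
print(four_point_choice(observed))  # AC|BD, long branches attracted
```

Because the distances are computed from expectations rather than sampled sequences, the wrong choice here does not depend on sample size, which is what inconsistency means in this setting.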

Proteins linked to autosomal dominant and autosomal recessive disorders harbor characteristic rare missense mutation distribution patterns

Tychele Turner , Christopher Douville , Dewey Kim , Peter D Stenson , David N Cooper , Aravinda Chakravarti , Rachel Karchin
doi: http://dx.doi.org/10.1101/018648

The role of rare missense variants in disease causation remains difficult to interpret. We explore whether the clustering pattern of rare missense variants (MAF<0.01) in a protein is associated with mode of inheritance. Mutations in genes associated with autosomal dominant (AD) conditions are known to result in either loss or gain of function, whereas mutations in genes associated with autosomal recessive (AR) conditions invariably result in loss of function. Loss-of-function mutations tend to be distributed uniformly along protein sequence, while gain-of-function mutations tend to localize to key regions. It has not previously been ascertained whether these patterns hold in general for rare missense mutations. We consider the extent to which rare missense variants are located within annotated protein domains and whether they form clusters, using a new unbiased method called CLUstering by Mutation Position (CLUMP). These approaches quantified a significant difference in clustering between AD and AR diseases. Proteins linked to AD diseases exhibited more clustering of rare missense mutations than those linked to AR diseases (Wilcoxon P=5.7×10^-4, permutation P=8.4×10^-4). Rare missense mutations in proteins linked to either AD or AR diseases were more clustered than controls (1000G) (Wilcoxon P=2.8×10^-15 for AD and P=4.5×10^-4 for AR, permutation P=3.1×10^-12 for AD and P=0.03 for AR). Differences in clustering patterns persisted even after removal of the most prominent genes. Testing for such non-random patterns may reveal novel aspects of disease etiology in large sample studies.

FermiKit: assembly-based variant calling for Illumina resequencing data

Heng Li
Subjects: Genomics (q-bio.GN)

Summary: FermiKit is a variant calling pipeline for Illumina data. It de novo assembles short reads and then maps the assembly against a reference genome to call SNPs, short insertions/deletions (INDELs) and structural variations (SVs). FermiKit takes about one day to assemble 30-fold human whole-genome data on a modern 16-core server with 85GB RAM at the peak, and calls variants in half an hour to an accuracy comparable to the current practice. FermiKit assembly is a reduced representation of raw data while retaining most of the original information.
Availability and implementation: https://github.com/lh3/fermikit
Contact: hengli@broadinstitute.org