Our paper: Integrating influenza antigenic dynamics with molecular evolution

This guest post is by Trevor Bedford (@trvrb) on his paper with coauthors: Bedford et al., "Integrating influenza antigenic dynamics with molecular evolution", arXived here.

The influenza virus shows a remarkable capacity to evolve to escape human immunity. Many other viruses, like measles, do not have this capacity. After infection with measles, a person gains life-long immunity to the virus, and hence measles has become constrained to be a childhood infection. Continual antigenic evolution in influenza necessitates frequent vaccine updates to maintain sufficient protection against circulating strains.

Antigenic differences between strains are commonly quantified using the hemagglutination inhibition (HI) assay, which measures the ability of antibodies created against one strain to interfere with virus from another strain. The resulting HI data is represented as a sparse matrix of comparisons between viruses from strains A, B, C… and sera from strains X, Y, Z… Taken by itself, this matrix is difficult to work with. Experienced virologists can pick up the loss of reactivity between groups of viruses in the noisy HI data, but these patterns are not fully quantified.

In our new paper, available on the arXiv, we extend techniques of multidimensional scaling (MDS) pioneered by Derek Smith and colleagues for the analysis of influenza antigenic data. Here, we bring the MDS antigenic model into a fully Bayesian framework and refer to the revised technique as Bayesian MDS (BMDS). In this model, viruses and sera are represented as 2D coordinates on an antigenic map in which their pairwise distances yield expectations for the HI titers, with antigenically similar viruses lying close to one another and antigenically distant viruses lying far apart.
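For intuition, here is a minimal sketch of the plain (non-Bayesian) MDS idea underlying the model: viruses and sera get 2D coordinates chosen so that map distances match observed titer drops. The distance targets, learning rate, and simple gradient-descent fit below are all invented for illustration; this is not the BMDS implementation in BEAST.

```python
import math, random

# Toy antigenic map: find 2D coordinates for two viruses and two sera
# whose pairwise distances match target "titer drop" values (made up).
# targets[(i, j)] = desired map distance between virus i and serum j.
targets = {(0, 0): 0.5, (0, 1): 3.0, (1, 0): 3.2, (1, 1): 0.4}

random.seed(1)
viruses = [[random.random(), random.random()] for _ in range(2)]
sera = [[random.random(), random.random()] for _ in range(2)]

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Minimise the squared error between map distances and targets.
lr = 0.01
for step in range(5000):
    for (i, j), d_obs in targets.items():
        v, s = viruses[i], sera[j]
        d = dist(v, s) + 1e-9
        g = 2 * (d - d_obs) / d        # gradient factor for (d - d_obs)^2
        for k in range(2):
            diff = v[k] - s[k]
            v[k] -= lr * g * diff      # move virus toward/away from serum
            s[k] += lr * g * diff      # and serum symmetrically

stress = sum((dist(viruses[i], sera[j]) - d) ** 2
             for (i, j), d in targets.items())
print(round(stress, 4))
```

After fitting, antigenically similar virus/serum pairs (small target distance) sit close on the map and dissimilar pairs sit far apart; the BMDS model replaces this point estimate with a posterior over coordinates.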

By placing antigenic cartography in a Bayesian context, we are able to integrate other data sources, most notably sequence data. In this case, genetic sequences provide an evolutionary tree relating virus strains and we assume that antigenic location evolves along this tree in a 2D diffusion process. This process imposes a prior on antigenic locations in which evolutionarily similar viruses have a prior expectation of lying close to one another on the map. In the paper, we use this BMDS / diffusion model to investigate patterns of antigenic evolution in 4 circulating lineages of influenza and show that antigenic drift largely determines incidence patterns across time and across lineages.
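The diffusion prior can be pictured with a toy simulation: each node's antigenic location is its parent's location plus 2D Gaussian noise whose variance scales with branch length. The tree topology, branch lengths, and diffusion variance below are all made up for illustration.

```python
import random

# Sketch of a 2D Brownian diffusion along a phylogeny: a child's position
# is its parent's position plus Gaussian noise, variance proportional to
# the branch length separating them.
random.seed(42)
sigma2 = 1.0  # diffusion variance per unit branch length (assumed)

# node -> (parent, branch_length); node 0 is the root, parents precede children.
tree = {0: (None, 0.0), 1: (0, 2.0), 2: (0, 3.0), 3: (1, 1.5), 4: (1, 0.5)}

location = {}
for node in sorted(tree):
    parent, t = tree[node]
    if parent is None:
        location[node] = (0.0, 0.0)
    else:
        px, py = location[parent]
        sd = (sigma2 * t) ** 0.5
        location[node] = (px + random.gauss(0, sd), py + random.gauss(0, sd))

# Sister nodes 3 and 4 share parent 1, so they tend to land near each other.
for node, (x, y) in location.items():
    print(node, round(x, 2), round(y, 2))
```

Under this prior, closely related strains are expected a priori to be antigenically similar, which is exactly the information the sequence data contribute to the map.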

The paper is also up on GitHub, and I'll keep it updated there as the paper goes through the review process. The BMDS model is implemented in the software package BEAST and is available in the latest source code. I hope to provide tutorials on running the BMDS model in the not-too-distant future.

Adaptive reference-free compression of sequence quality scores

Adaptive reference-free compression of sequence quality scores

Lilian Janin, Giovanna Rosone, Anthony J. Cox
(Submitted on 1 May 2013)

Motivation:
Rapid technological progress in DNA sequencing has stimulated interest in compressing the vast datasets that are now routinely produced. Relatively little attention has been paid to compressing the quality scores that are assigned to each sequence, even though these scores may be harder to compress than the sequences themselves. By aggregating a set of reads into a compressed index, we find that the majority of bases can be predicted from the sequence of bases that are adjacent to them and hence are likely to be less informative for variant calling or other applications. The quality scores for such bases are aggressively compressed, leaving a relatively small number at full resolution. Since our approach relies directly on redundancy present in the reads, it does not need a reference sequence and is therefore applicable to data from metagenomics and de novo experiments as well as to resequencing data.
Results:
We show that a conservative smoothing strategy affecting 75% of the quality scores above Q2 leads to an overall quality score compression of 1 bit per value with a negligible effect on variant calling. A compression of 0.68 bit per quality value is achieved using a more aggressive smoothing strategy, again with a very small effect on variant calling.
Availability:
Code to construct the BWT and LCP-array on large genomic data sets is part of the BEETL library, available as a GitHub repository at this http URL.
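The core idea of the abstract can be sketched in a few lines: quality scores at positions whose bases are predictable from nearby sequence context are collapsed to one representative value, while the rest keep full resolution. The predictability test below (has this 3-mer context been seen before in the read?) is a crude stand-in for the paper's BWT-based index over all reads, and the read, scores, and replacement value are invented.

```python
# Toy quality-score smoothing, not the BEETL algorithm itself.
read = "ACGTACGTACGT"
quals = [30, 12, 37, 8, 33, 35, 2, 31, 29, 36, 11, 34]

def predictable(i, seq):
    # Stand-in rule: a base is "predictable" if its preceding 3-mer
    # occurred earlier in the read (the paper uses a BWT over all reads).
    return i >= 3 and seq[i - 3:i] in seq[:i - 3]

# Smooth predictable positions to a single value (30), but never touch
# scores at or below Q2, mirroring the conservative strategy described.
smoothed = [q if q <= 2 or not predictable(i, read) else 30
            for i, q in enumerate(quals)]

kept = sum(1 for a, b in zip(quals, smoothed) if a == b)
print(smoothed)
print(kept, "of", len(quals), "values kept at full resolution")
```

The smoothed stream has far fewer distinct values and long runs, which is what makes it compress so much better than the raw scores.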

Distilled Single Cell Genome Sequencing and De Novo Assembly for Sparse Microbial Communities

Distilled Single Cell Genome Sequencing and De Novo Assembly for Sparse Microbial Communities

Zeinab Taghavi, Narjes S. Movahedi, Sorin Draghici, Hamidreza Chitsaz
(Submitted on 1 May 2013)

Identification of all species in a microbial sample is an important and challenging task with crucial applications. It is challenging because there are typically millions of cells in a microbial sample, the vast majority of which elude cultivation. The most accurate method to date is exhaustive single cell sequencing using multiple displacement amplification, which is simply intractable for a large number of cells. However, there is hope for breaking this barrier as the number of different species is usually much smaller than the number of cells. Here, we present a novel divide-and-conquer method to sequence and de novo assemble the genomes of all of the different species present in a microbial sample with a sequencing cost and computational complexity proportional to the number of species, not the number of cells. The method is implemented in a tool called Squeezambler. We evaluated Squeezambler on simulated data. The proposed divide-and-conquer method successfully reduces the cost of sequencing in comparison with the naive exhaustive approach.
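The divide-and-conquer idea can be caricatured as a group-testing scheme: "sequence" a pooled subsample, and split the pool only while its co-assembly still shows more than one genome. The cell counts, species labels, and the oracle that counts species in a pool are all invented; this is an illustration of the cost argument, not the Squeezambler algorithm.

```python
# Toy divide-and-conquer species discovery: the number of pooled
# "sequencing runs" grows with the number of species, not cells.
cells = ["A"] * 40 + ["B"] * 25 + ["C"] * 35   # 100 cells, 3 species

runs = 0
def distinct_species(pool):
    global runs
    runs += 1                 # one pooled sequencing run
    return set(pool)          # stand-in for co-assembly + genome counting

def distill(pool):
    # Stop splitting once a pool is homogeneous (one species) or trivial.
    if len(pool) <= 1 or len(distinct_species(pool)) <= 1:
        return set(pool)
    mid = len(pool) // 2
    return distill(pool[:mid]) | distill(pool[mid:])

species = distill(cells)
print(sorted(species), runs)
```

With 3 species among 100 cells, the recursion needs far fewer runs than the 100 required by exhaustive single-cell sequencing, which is the scaling claim of the abstract.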

Critical case stochastic phylogenetic tree model via the Laplace transform

Critical case stochastic phylogenetic tree model via the Laplace transform
Krzysztof Bartoszek, Michal Krzeminski
(Submitted on 30 Apr 2013)

Birth-and-death models are now a common mathematical tool to describe branching patterns observed in real-world phylogenetic trees. Liggett and Schinazi (2009) is one such example. The authors propose a simple birth-and-death model that is compatible with phylogenetic trees of both influenza and HIV, depending on the birth rate parameter. An interesting special case of this model is the critical case where the birth rate equals the death rate. This is a non-trivial situation and to study its asymptotic behaviour we employed the Laplace transform. With this we correct the proof of Liggett and Schinazi (2009) in the critical case.
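For context, the following textbook facts about a critical linear birth-and-death process (birth rate \(\lambda\) equal to death rate \(\mu\), started from one individual) show why the critical case is delicate: extinction is certain and the mean population size stays constant, so the survival probability must decay, and the process conditioned on survival grows linearly. These are standard branching-process results, not results specific to the Liggett and Schinazi model.

```latex
% Critical linear birth--death process, Z_0 = 1, birth rate \lambda = death rate \mu:
P(Z_t > 0) = \frac{1}{1 + \lambda t}, \qquad
E[Z_t] = 1, \qquad
E[Z_t \mid Z_t > 0] = 1 + \lambda t .
```

The slow, polynomial (rather than exponential) decay of the survival probability is what makes the asymptotics non-trivial and motivates tools like the Laplace transform.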

A general framework for meta-analyzing dependent studies with overlapping subjects in association mapping

A general framework for meta-analyzing dependent studies with overlapping subjects in association mapping
Buhm Han, Jae Hoon Sul, Eleazar Eskin, Paul I. W. de Bakker, Soumya Raychaudhuri
(Submitted on 30 Apr 2013)

Meta-analysis of genome-wide association studies is increasingly popular and many meta-analytic methods have been recently proposed. A majority of meta-analytic methods combine information from multiple studies by assuming that studies are independent since individuals collected in one study are unlikely to be collected again by another study. However, it has become increasingly common to utilize the same control individuals among multiple studies to reduce genotyping or sequencing cost. This causes those studies that share the same individuals to be dependent, and spurious associations may arise if overlapping subjects are not taken into account in a meta-analysis. In this paper, we propose a general framework for meta-analyzing dependent studies with overlapping subjects. Given dependent studies, our approach “decouples” the studies into independent studies such that meta-analysis methods assuming independent studies can be applied. This enables many meta-analysis methods, such as the random effects model, to account for overlapping subjects. Another advantage is that one can continue to use preferred software in the analysis pipeline which may not support overlapping subjects. Using simulations and the Wellcome Trust Case Control Consortium data, we show that our decoupling approach allows both the fixed and the random effects models to account for overlapping subjects while retaining desirable false positive rate and power.
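To see why overlap matters, here is a minimal fixed-effects illustration: combining two correlated effect estimates by generalized least squares, where shared controls induce a positive covariance between studies. Ignoring that covariance understates the variance of the combined estimate, inflating false positives. The effect sizes, variances, and covariance are made up, and this is not the paper's decoupling method, which instead transforms the studies into independent ones.

```python
# Fixed-effects meta-analysis of two studies via GLS, with and without
# the covariance induced by overlapping control individuals.
beta = [0.30, 0.20]        # per-study effect estimates (hypothetical)
var = [0.04, 0.05]         # their variances (hypothetical)
cov = 0.015                # covariance from shared controls (assumed)

def combine(v01):
    # GLS combined estimate: (1' S^-1 b) / (1' S^-1 1) for a 2x2 S.
    s = [[var[0], v01], [v01, var[1]]]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    w = [inv[0][0] + inv[0][1], inv[1][0] + inv[1][1]]   # S^-1 1
    denom = w[0] + w[1]
    est = (w[0] * beta[0] + w[1] * beta[1]) / denom
    return est, 1.0 / denom      # combined estimate and its variance

naive = combine(0.0)     # pretend the studies are independent
correct = combine(cov)   # account for the overlap-induced covariance
print(naive, correct)
```

The overlap-aware variance is larger than the naive one, so the naive analysis would report overconfident (spuriously significant) associations.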

Most viewed on Haldane’s Sieve: April 2013

Below are the most viewed posts on Haldane’s Sieve in April 2013. We’ve listed six instead of our usual five posts because the last two posts had identical numbers of views at the time of writing.

Slowing evolution is more effective than enhancing drug development for managing resistance

Slowing evolution is more effective than enhancing drug development for managing resistance
Nathan S. McClure, Troy Day
(Submitted on 29 Apr 2013)

Drug resistance is a serious public health problem that threatens to thwart our ability to treat many infectious diseases. Repeatedly, the introduction of new drugs has been followed by the evolution of resistance. In principle there are two ways to address this problem: (i) enhancing drug development, and (ii) slowing drug resistance. We present data and a modeling approach based on queueing theory that explores how interventions aimed at these two facets affect the ability of the entire drug supply system to provide service. Analytical and simulation-based results show that, all else equal, slowing the evolution of drug resistance is more effective at ensuring an adequate supply of effective drugs than is enhancing the rate at which new drugs are developed. This lends support to the idea that evolution management is not only a significant component of the solution to the problem of drug resistance, but may in fact be the most important component.

Positive selection drives faster-Z evolution in silkmoths

Positive selection drives faster-Z evolution in silkmoths
Timothy B. Sackton (1), Russell B. Corbett-Detig (1), Javaregowda Nagaraju (2), R. Lakshmi Vaishna (2), Kallare P. Arunkumar (2), Daniel L. Hartl (1) ((1) Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, USA, (2) Centre of Excellence for Genetics and Genomics of Silkmoths, Laboratory of Molecular Genetics, Centre for DNA Fingerprinting and Diagnostics, Hyderabad, India)
(Submitted on 29 Apr 2013)

Genes linked to X or Z chromosomes, which are hemizygous in the heterogametic sex, are predicted to evolve at different rates than those on autosomes. This faster-X effect can arise either as a consequence of hemizygosity, which leads to more efficient selection for recessive beneficial mutations in the heterogametic sex, or as a consequence of reduced effective population size on the hemizygous chromosome, which leads to increased fixation of weakly deleterious mutations due to random genetic drift. Empirical results to date have suggested that, while the overall pattern across taxa is complicated, in general systems with male-heterogamy show a faster-X effect primarily attributable to more efficient selection, whereas systems with female-heterogamy show a faster-Z effect primarily attributable to increased drift. However, to date only a single female-heterogametic taxon has been investigated. In order to test the generality of the faster-Z pattern seen in birds, we sequenced the genome of the Lepidopteran insect Bombyx huttoni, a close outgroup of the domesticated silkmoth Bombyx mori. We show that silkmoths experience faster-Z evolution, but unlike in birds, the faster-Z effect appears to be attributable to more efficient positive selection in females. These results suggest that female-heterogamy alone is unlikely to be sufficient to explain the reduced efficacy of selection on the bird Z chromosome. Instead, it is likely that a combination of patterns of dosage compensation and overall effective population size, among other factors, influence patterns of faster-Z evolution.

Remote Homology Detection in Proteins Using Graphical Models

Remote Homology Detection in Proteins Using Graphical Models
Noah M. Daniels
(Submitted on 24 Apr 2013)

Given the amino acid sequence of a protein, researchers often infer its structure and function by finding homologous, or evolutionarily-related, proteins of known structure and function. Since structure is typically more conserved than sequence over long evolutionary distances, recognizing remote protein homologs from their sequence poses a challenge.
We first consider all proteins of known three-dimensional structure, and explore how they cluster according to different levels of homology. An automatic computational method reasonably approximates a human-curated hierarchical organization of proteins according to their degree of homology.
Next, we return to homology prediction, based only on the one-dimensional amino acid sequence of a protein. Menke, Berger, and Cowen proposed a Markov random field model to predict remote homology for beta-structural proteins, but their formulation was computationally intractable on many beta-strand topologies.
We show two different approaches to approximate this random field, both of which make it computationally tractable, for the first time, on all protein folds. One method simplifies the random field itself, while the other retains the full random field, but approximates the solution through stochastic search. Both methods achieve improvements over the state of the art in remote homology detection for beta-structural protein folds.

Timing of ancient human Y lineage depends on the mutation rate: A comment on Mendez et al

Timing of ancient human Y lineage depends on the mutation rate: A comment on Mendez et al
Melissa A. Wilson Sayres
(Submitted on 22 Apr 2013)

Mendez et al. recently reported the identification of a Y chromosome lineage from an African American that is an outgroup to all other known Y haplotypes, along with a time to most recent common ancestor (TMRCA) for human Y lineages that is substantially longer than any previous estimate. The identification of a novel Y haplotype is always exciting, and this haplotype, in particular, is unique in its basal position on the Y haplotype tree. However, at 338 (237-581) thousand years ago (kya), the extremely ancient TMRCA reported by Mendez et al. is inconsistent with the known human fossil record (which estimates the age of anatomically modern humans at 195 +- 5 kya), with estimates from mtDNA (176.6 +- 11.3 kya, and 204.9 (116.8-295.7) kya), and with population genetic theory. The inflated TMRCA can quite easily be attributed to the extremely low Y chromosome mutation rate used by the authors.
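The sensitivity to the mutation rate is simple arithmetic: an inferred TMRCA is roughly the observed divergence divided by the assumed rate, so it is inversely proportional to that rate. The divergence and rate values below are illustrative only and are not taken from either paper.

```python
# Back-of-the-envelope TMRCA scaling: halving the assumed mutation rate
# doubles the inferred TMRCA, with everything else held fixed.
def tmrca_kyr(divergence_per_site, mu_per_site_per_year):
    # Divergence between two lineages accumulates at 2*mu per year;
    # divide by 1000 to report thousands of years (kyr).
    return divergence_per_site / (2.0 * mu_per_site_per_year) / 1000.0

d = 0.0004                         # hypothetical pairwise divergence
print(tmrca_kyr(d, 1.0e-9))       # low assumed rate -> old TMRCA
print(tmrca_kyr(d, 2.0e-9))       # doubling the rate halves the TMRCA
```

This is why a choice of a very low per-year Y chromosome rate, by itself, is enough to push the TMRCA far beyond other estimates.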