The dynamics of alternative pathways to compensatory substitution

Chris A. Nasrallah
(Submitted on 9 Aug 2013)

The role of epistatic interactions among loci is a central question in evolutionary biology and is increasingly relevant in the genomic age. While the population genetics of compensatory substitution have received considerable attention, most studies have focused on the case when natural selection is very strong against deleterious intermediates. In the biologically plausible scenario of weak to moderate selection there exist two alternate pathways for compensatory substitution. In one pathway, a deleterious mutation becomes fixed prior to occurrence of the compensatory mutation. In the other, the two loci are simultaneously polymorphic. The rates of compensatory substitution along these two pathways and their relative probabilities are functions of the population size, selection strength, mutation rate, and recombination rate. In this paper these rates and path probabilities are derived analytically and verified using population genetic simulations. The expected time durations of these two paths are similar when selection is moderate, but not when selection is weak. The effect of recombination on the dynamics of the substitution process is explored using simulation. Using the derived rates, a phylogenetic substitution model of the compensatory evolution process is presented that could be used for inference of population genetic parameters from interspecific data.

Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects

Binghang Liu, Yujian Shi, Jianying Yuan, Xuesong Hu, Hao Zhang, Nan Li, Zhenyu Li, Yanxiang Chen, Desheng Mu, Wei Fan
(Submitted on 9 Aug 2013)

Background: With the fast development of next generation sequencing technologies, increasing numbers of genomes are being de novo sequenced and assembled. However, most are in fragmented and incomplete draft status, and thus it is often difficult to know the accurate genome size and repeat content. Furthermore, many genomes are highly repetitive or heterozygous, posing problems to current assemblers utilizing short reads. Therefore, it is necessary to develop efficient assembly-independent methods for accurate estimation of these genomic characteristics. Results: Here we present a framework for modeling the distribution of k-mer frequency from sequencing data and estimating genomic characteristics such as genome size, repeat structure and heterozygous rate. By introducing novel techniques of k-mer individuals, float precision estimation, and proper treatment of sequencing error and coverage bias, the estimation accuracy of our method is significantly improved over existing methods. We also studied how the various genomic and sequencing characteristics affect the estimation accuracy using simulated sequencing data, and discussed the limitations on applying our method to real sequencing data. Conclusion: Based on this research, we show that the k-mer frequency analysis can be used as a general and assembly-independent method for estimating genomic characteristics, which can improve our understanding of a species' genome, help design the sequencing strategy of genome projects, and guide the development of assembly algorithms. The programs developed in this research are written using C/C++ and freely accessible at this ftp URL
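To make the idea of assembly-independent estimation concrete, here is a minimal sketch of the textbook k-mer approach: count k-mers in the reads, find the dominant depth in the k-mer frequency spectrum, and divide the total number of k-mers by that depth. This is not the authors' framework, which additionally models sequencing error, coverage bias, repeats and heterozygosity. The function names and the `min_depth` cutoff are assumptions made for illustration.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Return a Counter mapping depth -> number of distinct k-mers seen at that depth."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genome_size(spectrum, min_depth=2):
    """Naive estimate: (total k-mers) / (peak depth).

    Depths below min_depth are dropped as a crude stand-in for
    error filtering (an assumption, not the paper's error model).
    """
    kept = {depth: n for depth, n in spectrum.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    # Peak depth = the depth shared by the largest number of distinct k-mers.
    peak_depth = max(kept, key=lambda depth: kept[depth])
    total_kmers = sum(depth * n for depth, n in kept.items())
    return total_kmers / peak_depth
```

On real data the peak is rarely this clean, which is presumably why a fitted model of the whole distribution, as the abstract describes, does better than a single peak read-off.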

Predicting genome-wide DNA methylation using methylation marks, genomic position, and DNA regulatory elements

Weiwei Zhang, Tim D Spector, Panos Deloukas, Jordana T Bell, Barbara E Engelhardt
(Submitted on 9 Aug 2013)

Background: Recent assays for individual-specific genome-wide DNA methylation profiles have enabled epigenome-wide association studies to identify specific CpG sites associated with a phenotype. Computational prediction of CpG site-specific methylation levels is important, but current approaches tackle average methylation within a genomic locus and are often limited to specific genomic regions. Results: We characterize genome-wide DNA methylation patterns, and show that correlation among CpG sites decays rapidly, making predictions solely based on neighboring sites challenging. We built a random forest classifier to predict CpG site methylation levels using as features neighboring CpG site methylation levels and genomic distance, and co-localization with coding regions, CGIs, and regulatory elements from the ENCODE project, among others. Our approach achieves 91% to 94% prediction accuracy of genome-wide methylation levels at single CpG site precision. The accuracy increases to 98% when restricted to CpG sites within CGIs. Our classifier outperforms state-of-the-art methylation classifiers and identifies features that contribute to prediction accuracy: neighboring CpG site methylation status, CpG island status, co-localized DNase I hypersensitive sites, and specific transcription factor binding sites were found to be most predictive of methylation levels. Conclusions: Our observations of DNA methylation patterns led us to develop a classifier to predict site-specific methylation levels that achieves the best DNA methylation predictive accuracy to date. Furthermore, our method identified genomic features that interact with DNA methylation, elucidating mechanisms involved in DNA methylation modification and regulation, and linking different epigenetic processes.

A novel test for selection on cis-regulatory elements reveals positive and negative selection acting on mammalian transcriptional enhancers

Justin D. Smith, Kimberly F. McManus, Hunter B. Fraser
(Submitted on 7 Aug 2013)

Measuring natural selection on genomic elements involved in the cis-regulation of gene expression — such as transcriptional enhancers and promoters — is critical for understanding the evolution of genomes, yet it remains a major challenge. Many studies have attempted to detect positive or negative selection in these noncoding elements by searching for those with the fastest or slowest rates of evolution, but this can be problematic. Here we introduce a new approach to this issue, and demonstrate its utility on three mammalian transcriptional enhancers. Using results from saturation mutagenesis studies of these enhancers, we classified all possible point mutations as up-regulating, down-regulating, or silent, and determined which of these mutations have occurred on each branch of a phylogeny. Applying a framework analogous to Ka/Ks in protein-coding genes, we measured the strength of selection on up-regulating and down-regulating mutations, in specific branches as well as entire phylogenies. We discovered distinct modes of selection acting on different enhancers: while all three have experienced negative selection against down-regulating mutations, the selection pressures on up-regulating mutations vary. In one case we detected positive selection for up-regulation, while the other two had no detectable selection on up-regulating mutations. Our methodology is applicable to the growing number of saturation mutagenesis data sets, and provides a detailed picture of the mode and strength of natural selection acting on cis-regulatory elements.
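The Ka/Ks analogy can be sketched as a ratio of per-opportunity substitution rates: the number of observed mutations in a functional class divided by the number of possible mutations in that class, normalized by the same quantity for silent mutations. The code below is only an illustration of that idea under assumed inputs. The function name, the dictionary layout and the example counts are hypothetical and are not taken from the paper, whose actual test statistic may differ.

```python
def selection_ratio(observed, possible, category, neutral="silent"):
    """Ka/Ks-style ratio for one mutation class.

    observed[c] : substitutions of class c seen on a branch or phylogeny
    possible[c] : point mutations of class c that could occur in the element
    A ratio below 1 suggests negative selection on the class,
    above 1 suggests positive selection, near 1 suggests neutrality.
    """
    neutral_rate = observed[neutral] / possible[neutral]
    if neutral_rate == 0:
        raise ValueError("no neutral substitutions observed; ratio undefined")
    class_rate = observed[category] / possible[category]
    return class_rate / neutral_rate


# Hypothetical counts, for illustration only.
observed = {"up": 2, "down": 1, "silent": 10}
possible = {"up": 100, "down": 200, "silent": 250}
ratio_up = selection_ratio(observed, possible, "up")
ratio_down = selection_ratio(observed, possible, "down")
```

A point estimate like this says nothing about significance; with counts this small, any real analysis would need a statistical test on top of the ratio.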

The molecular mechanism of a cis-regulatory adaptation in yeast

Jessica Chang, Yiqi Zhou, Xiaoli Hu, Lucia Lam, Cameron Henry, Erin M. Green, Ryosuke Kita, Michael S. Kobor, Hunter B. Fraser
(Submitted on 7 Aug 2013)

Despite recent advances in our ability to detect adaptive evolution involving the cis-regulation of gene expression, our knowledge of the molecular mechanisms underlying these adaptations has lagged far behind. Across all model organisms the causal mutations have been discovered for only a handful of gene expression adaptations, and even for these, mechanistic details (e.g. the trans-regulatory factors involved) have not been determined. We previously reported a polygenic gene expression adaptation involving down-regulation of the ergosterol biosynthesis pathway in the budding yeast Saccharomyces cerevisiae. Here we investigate the molecular mechanism of a cis-acting mutation affecting a member of this pathway, ERG28. We show that the causal mutation is a two-base deletion in the promoter of ERG28 that strongly reduces the binding of two transcription factors, Sok2 and Mot3, thus abolishing their regulation of ERG28. This down-regulation increases resistance to a widely used antifungal drug targeting ergosterol, similar to mutations disrupting this pathway in clinical yeast isolates. The identification of the causal genetic variant revealed that the selection likely occurred after the deletion was already present at high frequency in the population, rather than when it was a new mutation. These results provide a detailed view of the molecular mechanism of a cis-regulatory adaptation, and underscore the importance of this view to our understanding of evolution at the molecular level.

Our paper: Inferring HIV escape rates from multi-locus genotype data

This guest post is by Richard Neher on his paper with Taylor Kessinger and Alan Perelson: Kessinger et al. Inferring HIV escape rates from multi-locus genotype data. arXived here.
This is cross posted from the Neher lab website.

We have a new preprint on the arXiv (here on Haldane’s sieve). This work is the result of a collaboration between us and Alan Perelson, LANL, and explores methods to estimate parameters of the HIV-immune system interaction from time-resolved sequence data. The focus of this paper is on early infection dominated by a few rapid substitutions that fix because they prevent or reduce recognition of infected cells by the immune system via cytotoxic T-lymphocytes (CTL). CTL escape is one of the fastest instances of evolution I have come across: 4-6 mutations spread within a few weeks. It happens in most HIV infections and is partly predictable based on the HLA genotype of the infected person. These substitutions are so rapid that clonal interference has to be modeled. Our method fits a reduced model of clonal interference to the typically very sparse data and thereby estimates the selection coefficients, aka escape rates.

Why do we want to know these numbers?
The number of viruses in the blood of an infected person peaks 2-3 weeks after infection and thereafter drops by 2-3 orders of magnitude. This drop is partly due to a response by the adaptive immune system. However, it has proved difficult to attribute this drop to specific parts of the immune response. The rates at which different mutations sweep through the population give us information about the pressure exerted by the T-cell clones that target the epitope containing this mutation.

How do we do it?
Early in infection, the viral population is large and selection is strong. In these conditions, recombination is of minor importance since most double/triple… mutants are more efficiently produced by recurrent mutation than recombination. This implies that mutations accumulate sequentially, always on a background on which all previous mutations are already present. The time at which a novel mutation happens is tightly constrained by the trajectory of the preceding genotype. These constraints regularize the fitting problem to some degree, and the multi-locus fitting is more robust than single-locus fitting.
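For comparison, the single-locus baseline can be sketched in a few lines. If one escape mutation sweeps alone, its frequency follows a logistic curve, so the log-odds of the frequency grow linearly in time with slope equal to the escape rate. The sketch below estimates that slope from two samples. It is not the multi-locus method of the paper, and the function names and example numbers are assumptions for illustration.

```python
import math


def logit(x):
    """Log-odds of a frequency strictly between 0 and 1."""
    return math.log(x / (1.0 - x))


def logistic_frequency(t, x0, rate):
    """Frequency at time t of a variant starting at x0 and growing logistically."""
    odds = x0 / (1.0 - x0) * math.exp(rate * t)
    return odds / (1.0 + odds)


def escape_rate(t1, x1, t2, x2):
    """Single-locus escape rate from two (time, frequency) samples."""
    if not (0.0 < x1 < 1.0 and 0.0 < x2 < 1.0):
        raise ValueError("frequencies must lie strictly between 0 and 1")
    if t1 == t2:
        raise ValueError("samples must be taken at different times")
    return (logit(x2) - logit(x1)) / (t2 - t1)
```

The weakness is visible in the guard clauses: with sparse sampling, a mutation is often observed at frequency 0 at one time point and 1 at the next, and the single-locus estimate is then undefined. Tying each mutation's trajectory to the genotype it arises on is what constrains the fit in that situation.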

What do we learn about evolution in general?
In addition to the intrinsic interest in the HIV/CTL interaction, CTL escape is an ideal setting to study rapidly evolving populations. This evolution happens in its “natural” habitat and the selective pressure as well as the functional consequences of the observed molecular changes can be quantified via immunological data, protein structure, and replication assays. In addition, we have ample cross-sectional data (HIV sequences from many different patients) that allow us to look at the prevalence of the escape mutations and potential compensatory mutations. None of this is done in this paper, but studying HIV/immune-system coevolution is a fascinating showcase of rapid evolution.

Inferring HIV escape rates from multi-locus genotype data

Taylor A. Kessinger, Alan S. Perelson, Richard A. Neher
(Submitted on 6 Aug 2013)

Cytotoxic T-lymphocytes (CTLs) recognize viral protein fragments displayed by major histocompatibility complex (MHC) molecules on the surface of virally infected cells and generate an anti-viral response that can kill the infected cells. Virus variants whose protein fragments are not efficiently presented on infected cells or whose fragments are presented but not recognized by CTLs therefore have a competitive advantage and spread rapidly through the population. We present a method that allows a more robust estimation of these escape rates from serially sampled sequence data. The proposed method accounts for competition between multiple escapes by explicitly modeling the accumulation of escape mutations and the stochastic effects of rare multiple mutants. Applying our method to serially sampled HIV sequence data, we estimate rates of HIV escape that are substantially larger than those previously reported. The method can be extended to complex escapes that require compensatory mutations. We expect our method to be applicable in other contexts such as cancer evolution where time series data is also available.

Macro-evolutionary models and coalescent point processes: The shape and probability of reconstructed phylogenies

Amaury Lambert, Tanja Stadler
(Submitted on 6 Aug 2013)

Forward-time models of diversification (i.e., speciation and extinction) produce phylogenetic trees that grow “vertically” as time goes by. Pruning the extinct lineages out of such trees leads to natural models for reconstructed trees (i.e., phylogenies of extant species). Alternatively, reconstructed trees can be modelled by coalescent point processes (CPP), where trees grow “horizontally” by the sequential addition of vertical edges. Each new edge starts at some random speciation time and ends at the present time; speciation times are drawn from the same distribution independently. CPP lead to extremely fast computation of tree likelihoods and simulation of reconstructed trees. Their topology always follows the uniform distribution on ranked tree shapes (URT). We characterize which forward-time models lead to URT reconstructed trees and among these, which lead to CPP reconstructed trees. We show that for any “asymmetric” diversification model in which speciation rates only depend on time and extinction rates only depend on time and on a non-heritable trait (e.g., age), the reconstructed tree is CPP, even if extant species are incompletely sampled. If rates additionally depend on the number of species, the reconstructed tree is (only) URT (but not CPP). We characterize the common distribution of speciation times in the CPP description, and discuss incomplete species sampling as well as three special model cases in detail: 1) extinction rate does not depend on a trait; 2) rates do not depend on time; 3) mass extinctions may happen additionally at certain points in the past.
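The “horizontal” construction is simple enough to sketch. For n tips, draw n - 1 node depths independently from a common distribution; the divergence time between tips i and j is then the largest depth lying between them. The code below assumes the caller supplies the depth sampler. The exponential sampler in the example is a placeholder, not the distribution characterized in the paper, and the function names are assumptions.

```python
import random


def simulate_cpp(n_tips, draw_depth, rng):
    """Draw the n_tips - 1 i.i.d. node depths of a coalescent point process.

    depths[i] is the depth of the node separating tip i from tip i + 1.
    """
    return [draw_depth(rng) for _ in range(n_tips - 1)]


def divergence_time(depths, i, j):
    """Time since tips i and j diverged: the deepest node between them."""
    if i == j:
        return 0.0
    lo, hi = sorted((i, j))
    return max(depths[lo:hi])


# Placeholder depth distribution (assumption): exponential with rate 1.
rng = random.Random(0)
depths = simulate_cpp(5, lambda r: r.expovariate(1.0), rng)
```

Because simulating a tree costs one random draw per tip, this illustrates why the abstract describes simulation of reconstructed trees under CPP as extremely fast.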

Bayesian genome assembly and assessment by Markov Chain Monte Carlo sampling

Mark Howison, Felipe Zapata, Erika J. Edwards, Casey W. Dunn
(Submitted on 6 Aug 2013)

Most genome assemblers provide point estimates of the true genome sequences, chosen from among many alternative hypotheses that are supported by the data. We present a Markov Chain Monte Carlo approach to sequence assembly that instead generates a distribution of assembly hypotheses with quantified probabilities. This statistically explicit Bayesian approach to assembly allows the investigator to evaluate alternative assembly hypotheses in a unified framework and propagate uncertainty about genome assembly to downstream analyses. We implement this approach in a prototype assembler and illustrate its application to the genome of the bacteriophage $\Phi$X174.

Proceedings of the 13th Workshop on Algorithms in Bioinformatics (WABI2013)

Aaron Darling, Jens Stoye
(Submitted on 6 Aug 2013)

These are the proceedings of the 13th Workshop on Algorithms in Bioinformatics, WABI2013, which was held September 2-4, 2013 in Sophia Antipolis, France. All manuscripts were peer reviewed by the WABI2013 program committee and external reviewers.