Inferring Heterogeneous Evolutionary Processes Through Time: from sequence substitution to phylogeography

Filip Bielejec, Philippe Lemey, Guy Baele, Andrew Rambaut, Marc A Suchard
(Submitted on 12 Sep 2013)

Molecular phylogenetic and phylogeographic reconstructions generally assume time-homogeneous substitution processes. Motivated by computational convenience, this assumption sacrifices biological realism and offers little opportunity to uncover the temporal dynamics in evolutionary histories. Here, we extend and generalize an evolutionary approach that relaxes the time-homogeneous process assumption by allowing the specification of different infinitesimal substitution rate matrices across different time intervals, called epochs, along the evolutionary history. We focus on an epoch model implementation in a Bayesian inference framework that offers great modeling flexibility in drawing inference about any discrete data type characterized as a continuous-time Markov chain, including phylogeographic traits. To alleviate the computational burden that the additional temporal heterogeneity imposes, we adopt a massively parallel approach that achieves both fine- and coarse-grain parallelization of the computations across branches that accommodate epoch transitions, making extensive use of graphics processing units. Through synthetic examples, we assess model performance in recovering evolutionary parameters from data generated according to different evolutionary scenarios that comprise different numbers of epochs for both nucleotide and codon substitution processes. We illustrate the usefulness of our inference framework in two different applications to empirical data sets: the selection dynamics on within-host HIV populations throughout infection and the seasonality of global influenza circulation. In both cases, our epoch model captures key features of temporal heterogeneity that remained difficult to test using ad hoc procedures.
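The core idea of an epoch model can be sketched with a standard property of continuous-time Markov chains: when a branch crosses an epoch boundary, its transition probability matrix is the product of the per-epoch matrix exponentials. The snippet below is a minimal pure-Python illustration of that property only. The function names, the two-state rate matrices, and the Taylor-series exponential are assumptions made for the example, not the authors' GPU-based implementation.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, terms=30):
    """Approximate exp(Q * t) with a truncated Taylor series.

    Adequate for small matrices with modest rates; real software would use
    an eigendecomposition or a scaling-and-squaring scheme instead.
    """
    n = len(q)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        step = [[q[i][j] * t / k for j in range(n)] for i in range(n)]
        term = mat_mul(term, step)  # term is now (Q t)^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


def branch_transition_matrix(segments):
    """Transition matrix for a branch split into epoch segments.

    `segments` is a list of (rate_matrix, duration) pairs, ordered from the
    parent end of the branch to the child end.
    """
    n = len(segments[0][0])
    p = [[float(i == j) for j in range(n)] for i in range(n)]
    for q, duration in segments:
        p = mat_mul(p, mat_exp(q, duration))
    return p


# Hypothetical two-state rate matrices for two epochs (rows sum to zero).
q_slow = [[-0.1, 0.1], [0.2, -0.2]]
q_fast = [[-1.0, 1.0], [2.0, -2.0]]

# A branch that spends 0.5 time units in the first epoch and 0.3 in the second.
p_branch = branch_transition_matrix([(q_slow, 0.5), (q_fast, 0.3)])
for row in p_branch:
    print(row)
```

When every segment shares the same rate matrix, the product collapses to a single exponential over the whole branch length, which recovers the usual time-homogeneous model.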

Fast and accurate imputation of summary statistics enhances evidence of functional enrichment

Bogdan Pasaniuc, Noah Zaitlen, Huwenbo Shi, Gaurav Bhatia, Alexander Gusev, Joseph Pickrell, Joel Hirschhorn, David P Strachan, Nick Patterson, Alkes L. Price
(Submitted on 12 Sep 2013)

Imputation using external reference panels is a widely used approach for increasing power in GWAS and meta-analysis. Existing HMM-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available. In simulations using 1000 Genomes (1000G) data, this method recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1-5%) variants (increasing to 87% (60%) when summary LD information is available from target samples) versus 89% (67%) for HMM-based imputation, which cannot be applied to summary statistics. Our approach accounts for the limited sample size of the reference panel, a crucial step to eliminate false-positive associations, and is computationally very fast. As an empirical demonstration, we apply our method to 7 case-control phenotypes from the WTCCC data and a study of height in the British 1958 birth cohort (1958BC). Gaussian imputation from summary statistics recovers 95% (105%) of the effective sample size (as quantified by the ratio of $\chi^2$ association statistics) compared to HMM-based imputation from individual-level genotypes at the 227 (176) published SNPs in the WTCCC (1958BC height) data. In addition, for publicly available summary statistics from large meta-analyses of 4 lipid traits, we publicly release imputed summary statistics at 1000G SNPs, which could not have been obtained using previously published methods, and demonstrate their accuracy by masking subsets of the data. We show that 1000G imputation using our approach increases the magnitude and statistical evidence of enrichment at genic vs. non-genic loci for these traits, as compared to an analysis without 1000G imputation. Thus, imputation of summary statistics will be a valuable tool in future functional enrichment analyses.
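One common way to impute an association statistic from summary data is to treat z-scores at nearby SNPs as jointly Gaussian, with the LD (correlation) matrix from a reference panel as their covariance, and to take the conditional mean at the untyped SNP. The abstract does not spell out the estimator, so the sketch below is an assumption-laden illustration: the function name, the restriction to two typed SNPs, and the `ridge` term (standing in for an adjustment for finite reference-panel size) are all hypothetical.

```python
def impute_z(z_typed, r_typed, r_cross, ridge=0.1):
    """Conditional-mean z-score at one untyped SNP given two typed SNPs.

    z_typed : (z1, z2), observed z-scores at the typed SNPs
    r_typed : reference-panel correlation between the two typed SNPs
    r_cross : (r_u1, r_u2), correlations between the untyped SNP and each
              typed SNP
    ridge   : value added to the diagonal of the typed-SNP LD matrix
              (an assumed regularizer, not a value taken from the paper)
    """
    a = 1.0 + ridge           # diagonal entry of the regularized LD matrix
    b = r_typed               # off-diagonal entry
    det = a * a - b * b       # determinant of [[a, b], [b, a]]
    # Weights are r_cross times the inverse of the regularized LD matrix.
    w1 = (r_cross[0] * a - r_cross[1] * b) / det
    w2 = (r_cross[1] * a - r_cross[0] * b) / det
    return w1 * z_typed[0] + w2 * z_typed[1]


# Hypothetical inputs: two typed SNPs in moderate LD with an untyped SNP.
print(impute_z(z_typed=(3.0, 1.0), r_typed=0.5, r_cross=(0.8, 0.6)))
```

With `ridge=0` and an untyped SNP in perfect LD with the first typed SNP, the function returns that SNP's z-score unchanged, which is a useful sanity check on the weights.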

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure

Qingpeng Zhang, Jason Pell, Rosangela Canino-Koning, Adina Chuang Howe, C. Titus Brown
(Submitted on 11 Sep 2013)

K-mer abundance analysis is widely used for many purposes in sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a CountMin Sketch. The CountMin Sketch permits online updating and retrieval of k-mer counts in memory which is necessary to support streaming k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a CountMin Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, and DSK. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer error rates. Khmer is implemented in C++ wrapped with a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.
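To make the trade-off concrete, here is a toy Count-Min Sketch for k-mer counting in pure Python. It is not khmer's API: the class name, the `kmers` helper, and the use of salted BLAKE2b hashes are assumptions chosen only to keep the example self-contained. It shows the two properties the abstract describes: only counters are stored (never the k-mers themselves), and collisions can inflate a count but never reduce it.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch: `depth` rows of `width` counters each."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth  # kept below 256 so the row index fits in one salt byte
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One independent-looking hash per row, obtained by salting BLAKE2b
        # with the row index (an illustrative choice of hash function).
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for row, col in self._slots(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        # Each row's counter is >= the true count, so the minimum over rows
        # is the tightest available estimate and never an undercount.
        return min(self.tables[row][col] for row, col in self._slots(kmer))


def kmers(seq, k):
    """Yield every overlapping k-mer in `seq`."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


# Online counting over a stream of (hypothetical) reads.
sketch = CountMinSketch(width=1024, depth=4)
for read in ["ACGTACGTAC", "CGTACGTTAC"]:
    for kmer in kmers(read, 4):
        sketch.add(kmer)

print(sketch.count("ACGT"), sketch.count("GGGG"))
```

Shrinking `width` saves memory at the cost of more collisions and therefore larger overcounts, which is the miscount rate the paper analyzes.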

A Survey on Migration-Selection Models in Population Genetics

Reinhard Bürger
(Submitted on 10 Sep 2013)

This survey focuses on the most important aspects of the mathematical theory of population genetic models of selection and migration between discrete niches. Such models are most appropriate if the dispersal distance is short compared to the scale at which the environment changes, or if the habitat is fragmented. The general goal of such models is to study the influence of population subdivision and gene flow among subpopulations on the amount and pattern of genetic variation maintained. Only deterministic models are treated. Because space is discrete, they are formulated in terms of systems of nonlinear difference or differential equations. A central topic is the exploration of the equilibrium and stability structure under various assumptions on the patterns of selection and migration. Another important, closely related topic concerns conditions (necessary or sufficient) for fully polymorphic (internal) equilibria. First, the theory of one-locus models with two or multiple alleles is laid out. Then, mostly very recent developments in multilocus models are presented. Finally, as an application, analysis and results of an explicit two-locus model emerging from speciation theory are highlighted.

A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity

M. Cyrus Maher, Ryan D. Hernandez
(Submitted on 9 Sep 2013)

Ortholog detection (OD) is a critical step for comparative genomic analysis of protein-coding sequences. There is a range of methods available for OD. However, relative performance varies by application, stymying attempts to identify a single best method. In this paper, we present a novel tool, MOSAIC, which is capable of integrating the entire swath of OD methods. We analyze the results of applying MOSAIC over four methodologically diverse OD methods. Relative to component and competing methods, we demonstrate large gains in the number of detected orthologs while simultaneously maintaining or improving functional-, phylogenetic-, and sequence identity-based measures of ortholog quality.

Inferring selective constraint and recent gain and loss of function from population genomic data

Daniel R. Schrider, Andrew D. Kern
(Submitted on 10 Sep 2013)

The comparative genomics revolution of the past decade has enabled the discovery of functional elements in the human genome via sequence comparison. While that is so, an important class of elements, those specific to humans, is entirely missed by searching for sequence conservation across species. Here we present an analysis based on variation data among human genomes that utilizes a supervised machine learning approach for the identification of human specific function in the genome. Using only allele frequency information from the complete low coverage 1000 Genomes Project dataset in conjunction with a support vector machine trained from known functional and non-functional portions of the genome, we are able to identify functional portions of the genome with extremely high accuracy (~88%). Our method identifies previously known human-specific gains or losses of function and uncovers many novel candidates. Candidate targets for gain of function along the human lineage include a novel isoform of a killer cell immunoglobulin-like receptor, while loss of function candidates include many members of a gene cluster involved in shaping the complexity of synaptic connections in the brain. Finally, we show that the majority of the genome is unconstrained by natural selection currently, in agreement with what has been estimated from phylogenetic methods but in sharp contrast to estimates based on transcriptomics or other high-throughput functional methods.

Our paper: The inevitability of unconditionally deleterious substitutions during adaptation

This author post is by Joshua B. Plotkin and David McCandlish on their preprint “The inevitability of unconditionally deleterious substitutions during adaptation”, arXived here.

The idea for this paper came to us while we were re-reading an earlier study by Sergey Kryazhimskiy and others, on the dynamics of adapting populations (Kryazhimskiy et al. 2009). Kryazhimskiy et al. studied the “fitness trajectory” — that is, the mean fitness across an ensemble of populations, as a function of time. The basic idea of their study was to infer the structure of an underlying fitness landscape by observing the fitness trajectory in experimental populations of evolving organisms (such as the ones from Lenski’s long-term experiments).

In re-reading Sergey’s paper, we noticed that the fitness trajectories were always monotonic, that is, the expected fitness would either always decrease or always increase. Indeed Kryazhimskiy et al. 2009 had presented a detailed analytical theory for how the fitness trajectory should behave (at least for a large class of models), and according to this theory the fitness trajectories should always be monotonic. However, when we looked more carefully at how this analytical theory was derived, we saw that the apparent impossibility of non-monotonic fitness trajectories was actually an unintended consequence of a seemingly innocuous technical assumption. The theory had been thoroughly tested against simulations for the examples explored in the paper, and it had performed quite well. But still, we wondered, could we construct fitness landscapes with non-monotonic fitness trajectories?

The answer was yes. In fact, we found conditions that produced non-monotonic fitness trajectories in one of the simplest and most widely used models of a fitness landscape: the house of cards model, where the fitness of each new mutation is drawn from some fixed probability distribution. We also noticed an interesting pattern. If the population starts at a very low fitness, then the fitness trajectory is monotonically increasing. But if the starting fitness of the population is closer to the equilibrium mean fitness (that is, the value that the fitness trajectory would eventually tend to), the fitness trajectory becomes non-monotonic: fitness will initially decrease, and then, eventually, increase to its asymptotic value.

After much coffee, we eventually proved that this basic pattern must occur for any house of cards model whose equilibrium fitness distribution has a finite mean (at least under a Moran process in the limit of weak mutation). That result was the germ that eventually developed into our paper, which includes further results on the house of cards model, and on Fisher’s geometric model.

Why are non-monotonic fitness trajectories interesting? On the one hand, this is a population-genetic curiosity in a vein similar to the observation by McVean and Charlesworth (1999) that increasing the strength of purifying selection can sometimes increase the nucleotide site diversity. It’s somewhat counter-intuitive that the expected selection coefficient of the first mutation to fix in an adapting population can be negative, even on a fitness landscape that contains no local maxima!

On the other hand, we think that this result has important implications for studying adaptive evolution. It is common in such studies to assume that deleterious mutations can never fix (e.g. by approximating the probability of fixation for a new mutation as 2s). Our results on the surprising prevalence of deleterious substitutions during adaptation should hopefully spur others to consider carefully the circumstances under which ignoring deleterious fixations is justified.
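To see what the 2s shortcut throws away, it helps to put it next to a fixation probability that stays defined for negative s. The sketch below uses Kimura's diffusion approximation for a single new mutant in a haploid population of size n. That choice is an assumption made for illustration (the paper itself works with a Moran process under weak mutation), and the function names are hypothetical.

```python
import math


def fixation_probability(s, n):
    """Kimura's diffusion approximation for one new mutant among n haploids.

    P_fix = (1 - exp(-2 s)) / (1 - exp(-2 n s)), with neutral limit 1 / n.
    """
    if abs(s) < 1e-12:
        return 1.0 / n
    # expm1(x) = exp(x) - 1; the two minus signs cancel in the ratio.
    return math.expm1(-2.0 * s) / math.expm1(-2.0 * n * s)


def two_s_approximation(s):
    """The common shortcut: 2s for beneficial mutations, zero otherwise."""
    return max(0.0, 2.0 * s)


n = 100
for s in (-0.02, -0.01, 0.0, 0.01, 0.02):
    print(s, fixation_probability(s, n), two_s_approximation(s))
```

For negative s the diffusion formula gives a probability that is small but strictly positive, whereas the shortcut returns exactly zero, so any model built on the shortcut cannot produce a deleterious substitution at all.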

Joshua B. Plotkin and David McCandlish

Works cited:

Kryazhimskiy, S., Tkacik, G., and Plotkin, J. B. (2009). The dynamics of adaptation on correlated fitness landscapes. PNAS, 106:18638-18643.

McVean, G. A., and Charlesworth, B. (1999). A population genetic model for the evolution of synonymous codon usage: patterns and predictions. Genetical Research, 74:145-158.

Our paper: An integrative genomic approach illuminates the causes and consequences of genetic background effects

This is a guest post by Dr. Chris Chandler on “An integrative genomic approach illuminates the causes and consequences of genetic background effects”, arXived here.

Biologists have long recognized that a mutation can have variable effects on an organism’s phenotype; even introductory genetics classes often make this observation by introducing the concepts of penetrance and expressivity. More mysterious, however, are the factors that influence the phenotypic expression of a mutation or allele. We know, for instance, that introducing the same mutation into two different but otherwise wild-type genetic backgrounds can result in vastly different phenotypes. But what specific differences between these two genetic backgrounds interact with the mutation, and how? And how does gene expression fit into this puzzle? Answering these questions has not been an easy task, which is not too surprising when you realize that penetrance and expressivity are, in reality, complex quantitative traits. We therefore adopted a multi-pronged genetic and genomic approach to tease apart the mechanisms mediating background dependence in a mutation affecting wing development in the fly Drosophila melanogaster.

The phenotypic patterns seen in our model trait have already been characterized: the scalloped[E3] (sd[E3]) mutation has strong effects in the Oregon-R (ORE) background, resulting in a tiny, underdeveloped wing, while its effects in the Samarkand (SAM) background are still obvious but much less extreme, resulting in a blade-like wing.

To try to find out what causes these differences, we generated and combined a variety of datasets: whole-genome re-sequencing of the parental strains and a panel of introgression lines to map the background modifiers of the sd[E3] phenotype; transcription profiling (using two microarray datasets and one RNA-seq-like dataset), including analyses of allele-specific expression in flies carrying a “hybrid” genetic background; predictions of binding sites for the SD protein, which is a transcription factor; and a screen for deletion alleles that enhance or suppress the sd[E3] phenotype in a background-dependent fashion.

Our results point to a complex genetic basis for this background dependence. We found evidence for a number of loci that are likely to modulate the effects of the sd[E3] allele. However, some unexpected inconsistencies provide a cautionary tale for those intending to take a similar mapping-by-introgression approach for their trait of interest: do multiple replicates, and introgress in both directions, or you may inadvertently end up mapping some other trait! Although the number of candidate genes we identified was generally large, by combining those results with data from our other datasets, we were able to narrow our focus to those showing a consistent signal, yielding a robust set of candidate genes for further study. Without getting into too much detail, we also used a novel approach to show that background-dependent modifier deletions of the sd[E3] phenotype (of which there are many) involve higher-order epistatic interactions between the sd[E3] mutation, the deletion, and the genetic background, rather than quantitative non-complementation (so more than two genes were involved).

Overall, we think that an integrative approach like this could be useful for others trying to understand complex traits, including genetic background-dependence of mutations. In addition, if you’re a Drosophila researcher working with the commonly used Samarkand or Oregon-R strains, our genome re-sequencing data (raw and assembled), including SNPs, will soon be available in public repositories for genetic data.

Bayesian Model Selection in Complex Linear Systems, as Illustrated in Genetic Association Studies

Xiaoquan Wen
(Submitted on 3 Sep 2013)

Motivated by examples from genetic association studies, this paper considers the model selection problem in a general complex linear model system and in a Bayesian framework. We discuss formulating model selection problems and incorporating context-dependent a priori information through different levels of prior specifications. We also derive analytic Bayes factors and their approximations to facilitate model selection and discuss their theoretical and computational properties. We demonstrate our Bayesian approach based on an implemented Markov Chain Monte Carlo (MCMC) algorithm in simulations and a real data application of mapping tissue-specific eQTLs. Our novel results on Bayes factors provide a general framework to perform efficient model comparisons in complex linear model systems.

The inevitability of unconditionally deleterious substitutions during adaptation

David M. McCandlish, Charles L. Epstein, Joshua B. Plotkin
(Submitted on 4 Sep 2013)

Studies on the genetics of adaptation typically neglect the possibility that a deleterious mutation might fix. Nonetheless, here we show that, in many regimes, the first substitution is most often deleterious, even when fitness is expected to increase in the long term. In particular, we prove that this phenomenon occurs under weak mutation for any house-of-cards model with an equilibrium distribution. We find that the same qualitative results hold under Fisher’s geometric model. We also provide a simple intuition for the surprising prevalence of unconditionally deleterious substitutions during early adaptation. Importantly, the phenomenon we describe occurs on fitness landscapes without any local maxima and is therefore distinct from “valley-crossing”. Our results imply that the common practice of ignoring deleterious substitutions leads to qualitatively incorrect predictions in many regimes. Our results also have implications for the substitution process at equilibrium and for the response to a sudden decrease in population size.