Universality and predictability in the evolution of molecular quantitative traits

Armita Nourmohammad, Torsten Held, Michael Lässig
(Submitted on 12 Sep 2013)

Molecular traits, such as gene expression levels or protein binding affinities, are increasingly accessible to quantitative measurement by modern high-throughput techniques. Such traits measure molecular functions and, from an evolutionary point of view, are important as targets of natural selection. Here we discuss recent developments in the evolutionary theory of quantitative traits that reach beyond classical quantitative genetics. We focus on universal evolutionary characteristics: these are largely independent of a trait’s genetic basis, which is often at least partially unknown. We show that universal measurements can be used to infer selection on a quantitative trait, which determines its evolutionary mode of conservation or adaptation. Furthermore, universality is closely linked to predictability of trait evolution across lineages. We argue that universal trait statistics extend over a range of cellular scales and open new avenues of quantitative evolutionary systems biology.

Inferring Heterogeneous Evolutionary Processes Through Time: from sequence substitution to phylogeography

Filip Bielejec, Philippe Lemey, Guy Baele, Andrew Rambaut, Marc A Suchard
(Submitted on 12 Sep 2013)

Molecular phylogenetic and phylogeographic reconstructions generally assume time-homogeneous substitution processes. Motivated by computational convenience, this assumption sacrifices biological realism and offers little opportunity to uncover the temporal dynamics in evolutionary histories. Here, we extend and generalize an evolutionary approach that relaxes the time-homogeneous process assumption by allowing the specification of different infinitesimal substitution rate matrices across different time intervals, called epochs, along the evolutionary history. We focus on an epoch model implementation in a Bayesian inference framework that offers great modeling flexibility in drawing inference about any discrete data type characterized as a continuous-time Markov chain, including phylogeographic traits. To alleviate the computational burden that the additional temporal heterogeneity imposes, we adopt a massively parallel approach that achieves both fine- and coarse-grain parallelization of the computations across branches that accommodate epoch transitions, making extensive use of graphics processing units. Through synthetic examples, we assess model performance in recovering evolutionary parameters from data generated according to different evolutionary scenarios that comprise different numbers of epochs for both nucleotide and codon substitution processes. We illustrate the usefulness of our inference framework in two different applications to empirical data sets: the selection dynamics on within-host HIV populations throughout infection and the seasonality of global influenza circulation. In both cases, our epoch model captures key features of temporal heterogeneity that remained difficult to test using ad hoc procedures.
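The core idea of an epoch model can be sketched in a few lines: when a branch crosses an epoch boundary, its transition probability matrix is the ordered product of the matrix exponentials for each epoch it passes through. The following is a minimal, pure-Python illustration under stated assumptions, not the authors' implementation: the function names (`mat_exp`, `epoch_transition_matrix`, `symmetric_q`), the forward-in-time axis, and the two-state example are all hypothetical, and none of the parallelization described in the abstract is attempted.

```python
import math


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_exp(q, t, squarings=10, terms=12):
    """Approximate exp(q * t) by scaling and squaring with a Taylor series."""
    n = len(q)
    scale = t / 2 ** squarings
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result, term = identity(n), identity(n)
    for k in range(1, terms + 1):
        # term becomes a^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[r + x for r, x in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def epoch_transition_matrix(start, end, boundaries, rate_matrices):
    """Transition matrix for a branch running forward in time from start to end.

    boundaries: sorted epoch boundary times (assumed layout).
    rate_matrices: one rate matrix per epoch, len(boundaries) + 1 in total.
    """
    if len(rate_matrices) != len(boundaries) + 1:
        raise ValueError("need exactly one rate matrix per epoch")
    p = identity(len(rate_matrices[0]))
    edges = [float("-inf")] + list(boundaries) + [float("inf")]
    for q, lo, hi in zip(rate_matrices, edges, edges[1:]):
        overlap = min(end, hi) - max(start, lo)  # time spent in this epoch
        if overlap > 0:
            p = mat_mul(p, mat_exp(q, overlap))
    return p


def symmetric_q(rate):
    """Two-state symmetric rate matrix, used only as a toy example."""
    return [[-rate, rate], [rate, -rate]]


# A branch from time 0 to 3 that crosses one epoch boundary at time 1.
p = epoch_transition_matrix(0.0, 3.0, [1.0],
                            [symmetric_q(0.1), symmetric_q(0.5)])
for row in p:
    print([round(x, 6) for x in row])
```

For the symmetric two-state case the rate matrices commute, so the result can be checked against the closed form 0.5 + 0.5 exp(-2 (r1 t1 + r2 t2)) for the probability of ending in the starting state.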

Fast and accurate imputation of summary statistics enhances evidence of functional enrichment

Bogdan Pasaniuc, Noah Zaitlen, Huwenbo Shi, Gaurav Bhatia, Alexander Gusev, Joseph Pickrell, Joel Hirschhorn, David P Strachan, Nick Patterson, Alkes L. Price
(Submitted on 12 Sep 2013)

Imputation using external reference panels is a widely used approach for increasing power in GWAS and meta-analysis. Existing HMM-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available. In simulations using 1000 Genomes (1000G) data, this method recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1-5%) variants (increasing to 87% (60%) when summary LD information is available from target samples) versus 89% (67%) for HMM-based imputation, which cannot be applied to summary statistics. Our approach accounts for the limited sample size of the reference panel, a crucial step to eliminate false-positive associations, and is computationally very fast. As an empirical demonstration, we apply our method to 7 case-control phenotypes from the WTCCC data and a study of height in the British 1958 birth cohort (1958BC). Gaussian imputation from summary statistics recovers 95% (105%) of the effective sample size (as quantified by the ratio of $\chi^2$ association statistics) compared to HMM-based imputation from individual-level genotypes at the 227 (176) published SNPs in the WTCCC (1958BC height) data. In addition, for publicly available summary statistics from large meta-analyses of 4 lipid traits, we publicly release imputed summary statistics at 1000G SNPs, which could not have been obtained using previously published methods, and demonstrate their accuracy by masking subsets of the data. We show that 1000G imputation using our approach increases the magnitude and statistical evidence of enrichment at genic vs. non-genic loci for these traits, as compared to an analysis without 1000G imputation. Thus, imputation of summary statistics will be a valuable tool in future functional enrichment analyses.
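The calculation behind Gaussian imputation can be sketched with the standard conditional distribution of a multivariate normal. As an assumption for illustration (the abstract does not give the formulas), suppose that under the null the z-scores at the typed SNPs $t$ and at an untyped SNP $i$ are jointly Gaussian with mean zero and covariance given by the LD (correlation) matrix $\Sigma$ estimated from the reference panel. Then

$$
\begin{pmatrix} z_i \\ z_t \end{pmatrix} \sim \mathcal{N}\!\left(0,\; \begin{pmatrix} 1 & \Sigma_{i,t} \\ \Sigma_{t,i} & \Sigma_{t,t} \end{pmatrix}\right)
\quad\Longrightarrow\quad
z_i \mid z_t \sim \mathcal{N}\!\left(\Sigma_{i,t}\,\Sigma_{t,t}^{-1}\, z_t,\; 1 - \Sigma_{i,t}\,\Sigma_{t,t}^{-1}\,\Sigma_{t,i}\right)
$$

so the imputed statistic is the conditional mean $\Sigma_{i,t}\Sigma_{t,t}^{-1} z_t$, and the conditional variance indicates how much information the typed SNPs carry about SNP $i$. Because $\Sigma$ is estimated from a finite reference panel, one natural adjustment is to replace $\Sigma_{t,t}$ with a regularized version such as $\Sigma_{t,t} + \lambda I$; the symbol $\lambda$ and this particular form are illustrative assumptions here, not details taken from the abstract.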

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure

Qingpeng Zhang, Jason Pell, Rosangela Canino-Koning, Adina Chuang Howe, C. Titus Brown
(Submitted on 11 Sep 2013)

K-mer abundance analysis is widely used for many purposes in sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a CountMin Sketch. The CountMin Sketch permits online updating and retrieval of k-mer counts in memory, which is necessary to support streaming k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a CountMin Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, and DSK. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer error rates. Khmer is implemented in C++ wrapped with a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.
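To make the data structure concrete, here is a minimal sketch of a CountMin Sketch used for online k-mer counting. It is an illustration of the general technique, not khmer's implementation or API: the class name, the table sizes, and the use of salted BLAKE2b hashes are assumptions chosen for brevity. It shows the two properties mentioned above: counts can be updated and queried as reads stream in, and a reported count can exceed the true count but never fall below it.

```python
import hashlib


class CountMinSketch:
    """Toy CountMin Sketch: `depth` hash tables, each with `width` counters."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _cells(self, kmer):
        # One independent-looking hash per table, obtained by salting BLAKE2b
        # with the table index (an assumption; any hash family would do).
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode("ascii"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, kmer):
        for row, col in self._cells(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        # Collisions only ever inflate counters, so the minimum over the
        # tables is the tightest available upper bound on the true count.
        return min(self.tables[row][col] for row, col in self._cells(kmer))


def kmers(sequence, k):
    """Yield every overlapping k-mer of a sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


sketch = CountMinSketch(width=1024, depth=4)
for read in ["ACGTACGTAC", "TTACGTAA"]:
    for kmer in kmers(read, 4):
        sketch.add(kmer)  # online update, one k-mer at a time

print(sketch.count("ACGT"))  # at least the true count of 3
print(sketch.count("GGGG"))  # usually 0; may be overcounted on collisions
```

Only the counters are stored, so the sketch cannot enumerate which k-mers it has seen; memory use is fixed by `width * depth` regardless of how many distinct k-mers arrive.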

Some preprint comment streams at Haldane’s Sieve and related sites

To mark our one-year anniversary, I thought I’d collect a few examples of preprint commenting at work. These have taken place in the comment boxes of Haldane’s Sieve and/or across a range of other blogs.

These are somewhat isolated cases, as the majority of preprints pass without any comment. It would be great to see more of this level of commentary. Remember that comments can be simple inquiries about methods, figures, references, etc., and don’t have to be super involved. In general we’ve found authors to be very responsive to comments, perhaps in part because they can take place as a more informal conversation without the pressures of publication concerns.

Genome sequencing highlights genes under selection and the dynamic early history of dogs
Reconstructing the population genetic history of the Caribbean
The population genetic signal of polygenic adaptation
The geography of recent genetic ancestry across Europe
Loss and Recovery of Genetic Diversity in Adapting Populations of HIV
Sailfish RNA-seq quantification
Genome-wide inference of ancestral recombination graphs

The date of interbreeding between Neandertals and modern humans
Ancient west Eurasian ancestry in southern and eastern Africa
The identifiability of piecewise demographic models from the sample frequency spectrum

One year at Haldane’s Sieve

We started Haldane’s Sieve back in August 2012, so we’ve just passed our one-year anniversary. You can read our first post on our motivations for starting the blog here. We are pretty happy about how well Haldane’s Sieve has done at promoting preprints, and a preprint culture more generally, in population and evolutionary genetics and genomics.

Overall we have published 430 posts, the majority of which have been abstracts of arXived papers. It’s been great to see so many people starting to experiment with preprinting their work.

We’ve also had 41 guest posts by authors blogging about their papers (see here). This has been a really nice side effect of Haldane’s Sieve; we have gotten more researchers blogging about their work. The main aim of these “our paper” posts has been to allow authors to write about their paper in a more informal setting than a paper, to reach out to other researchers for feedback and to start to publicize their papers to the population and evolutionary genetics and genomics communities.

Over the past year Haldane’s Sieve has had over 600 comments. The majority of preprints have passed without comment, which is fine by us. Not all preprints need commentary, and a reasonable fraction are likely to have little long-term impact (like many papers). However, all of the abstracts posted at Haldane’s Sieve have been visited multiple times (the top ones hundreds of times), and the majority have been tweeted on Twitter. Thus all of the preprints have received attention, and have likely had many more sets of eyes viewing them earlier than if they’d never been preprinted.

Some of the preprints get significant amounts of attention, comments, and feedback (both online and offline), which is really heartening to see. We think that many papers have been improved thanks to appearing on the arXiv and at Haldane’s Sieve. Thanks to everyone for their comments. It would be great to have more; remember that they do not have to be substantial and could be as simple as asking for clarification on a figure legend. We try to make sure that the authors of preprints get notified about comments, however minor. Every comment helps to improve preprints, encourages others to preprint their papers, and fosters a culture of preprint commenting more generally.

Encouragingly, during the past year Genetics, Genome Research, and MBE have all changed their preprint policies to allow the submission of previously preprinted articles (see here). It is great to see that preprints are starting to gain more acceptance in evolutionary genetics and genomics.

Here’s hoping for another good year. We are thinking about extending Haldane’s Sieve in a few different ways over the coming year.

TreeOTU: Operational Taxonomic Unit Classification Based on Phylogenetic Trees

Dongying Wu, Ladan Doroud, Jonathan A. Eisen
(Submitted on 28 Aug 2013)

Our current understanding of the taxonomic and phylogenetic diversity of cellular organisms, especially the bacteria and archaea, is mostly based upon studies of sequences of the small-subunit rRNAs (ssu-rRNAs). To address the limitations of ssu-rRNA as a phylogenetic marker, such as copy number variation among organisms and complications introduced by horizontal gene transfer, convergent evolution, or evolution rate variations, we have identified protein-coding gene families as alternative Phylogenetic and Phylogenetic Ecology markers (PhyEco). Current nucleotide sequence similarity based Operational Taxonomic Unit (OTU) classification methods are not readily applicable to amino acid sequences of PhyEco markers. We report here the development of TreeOTU, a phylogenetic tree structure based OTU classification method that takes into account differences in rates of evolution between taxa and between genes. OTU sets built by TreeOTU are more faithful to phylogenetic tree structures than sequence clustering (non-phylogenetic) methods for ssu-rRNAs. OTUs built from phylogenetic trees of protein-coding PhyEco markers are comparable to our current taxonomic classification at different levels. With the included OTU comparison tools, TreeOTU is robust in phylogenetic referencing with different phylogenetic markers and trees.

The role of mutation rate variation and genetic diversity in the architecture of human disease

Ying Chen Eyre-Walker, Adam Eyre-Walker
(Submitted on 29 Aug 2013)

We have investigated the role that the mutation rate and the structure of genetic variation at a locus play in determining whether a gene is involved in disease. We predict that the mutation rate and the genetic diversity should be higher in genes associated with disease, unless all genes that could cause disease have already been identified. Consistent with our predictions we find that genes associated with Mendelian and complex disease are substantially longer than non-disease genes. However, we find that both Mendelian and complex disease genes are found in regions of the genome with relatively low mutation rates, as inferred from intron divergence between humans and chimpanzees. Complex disease genes are predicted to have higher rates of non-synonymous mutation than non-disease genes, but the opposite pattern is found in Mendelian disease genes. Finally, we find that disease genes are in regions of significantly elevated genetic diversity, even when variation in the rate of mutation is controlled for. Nevertheless, the effect is small. Our results suggest that variation in the genic mutation rate and the genetic architecture of the locus play a minor role in determining whether a gene is associated with disease.

Exploration and retrieval of whole-metagenome sequencing samples

Sohan Seth, Niko Välimäki, Samuel Kaski, Antti Honkela
(Submitted on 28 Aug 2013)

In recent years, the field of whole-metagenome shotgun sequencing has witnessed significant growth due to next-generation sequencing technologies that allow genomic samples to be sequenced more cheaply, faster, and with better coverage than before. This technical advancement has initiated the trend of sequencing multiple samples in different conditions or environments to explore the similarities and dissimilarities of the microbial communities. Examples include the Human Microbiome Project and various studies of the human intestinal tract. With the availability of ever larger databases of such measurements, finding samples similar to a given query sample is becoming a central operation. In this paper, we develop a content-based retrieval method for whole-metagenome sequencing samples. We apply a distributed string mining framework to efficiently extract all informative sequence k-mers from a pool of metagenomic samples, and use them to measure the dissimilarity between two samples. We evaluate the performance of the proposed approach on two human gut metagenome data sets and observe significant enrichment for diseased samples in the results of queries with another diseased sample.
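The retrieval idea, ranking database samples by a k-mer based dissimilarity to a query sample, can be sketched as follows. This is a simplified stand-in under loud assumptions: it uses a single fixed k rather than the informative k-mers selected by the authors' string mining framework, and it uses Bray-Curtis dissimilarity on relative k-mer frequencies because the abstract does not name the measure the authors use. All function names and the toy reads are hypothetical.

```python
from collections import Counter


def kmer_profile(reads, k):
    """Relative frequencies of all overlapping k-mers in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two relative-frequency profiles.

    For profiles that each sum to 1 this equals 1 minus the shared mass:
    0 for identical profiles, 1 for profiles with no k-mer in common.
    """
    shared = sum(min(p[kmer], q[kmer]) for kmer in p.keys() & q.keys())
    return 1.0 - shared


def rank_samples(query_reads, database, k):
    """Return database sample names ordered from most to least similar."""
    query = kmer_profile(query_reads, k)
    scores = {name: bray_curtis(query, kmer_profile(reads, k))
              for name, reads in database.items()}
    return sorted(scores, key=scores.get)


# Toy data, invented purely for illustration.
database = {
    "sample_a": ["ACGTACGT"],
    "sample_b": ["ACGTACGA"],
    "sample_c": ["TTTTTTTT"],
}
print(rank_samples(["ACGTACGT"], database, k=3))
```

In a real setting the profiles would be precomputed once per database sample rather than rebuilt for every query, as this sketch does for brevity.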

Using Volcano Plots and Regularized-Chi Statistics in Genetic Association Studies

Wentian Li, Jan Freudenberg, Young Ju Suh, Yaning Yang
(Submitted on 28 Aug 2013)

Labor-intensive experiments are typically required to identify the causal disease variants from a list of disease-associated variants in the genome. For designing such experiments, candidate variants are ranked by their strength of genetic association with the disease. However, the two commonly used measures of genetic association, the odds-ratio (OR) and p-value, may rank variants in different orders. To integrate these two measures into a single analysis, here we transfer the volcano plot methodology from gene expression analysis to genetic association studies. In its original setting, volcano plots are scatter plots of fold-change and t-test statistic (or -log of the p-value), with the latter being more sensitive to sample size. In genetic association studies, the OR and Pearson’s chi-square statistic (or equivalently its square root, chi; or the standardized log(OR)) can be analogously used in a volcano plot, allowing for their visual inspection. Moreover, the geometric interpretation of these plots leads to an intuitive method for filtering results by a combination of both OR and chi-square statistic, which we term “regularized-chi”. This method selects associated markers by a smooth curve in the volcano plot instead of the right-angled lines which correspond to independent cutoffs for OR and chi-square statistic. The regularized-chi incorporates relatively more signals from variants with lower minor-allele-frequencies than the chi-square test statistic. As rare variants tend to have stronger functional effects, regularized-chi is better suited to the task of prioritization of candidate genes.
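As a small illustration of the two axes of such a volcano plot, the sketch below computes log(OR) and Pearson's chi-square statistic (without continuity correction) from a 2x2 table of allele counts. The function name and the counts are invented for the example, and the regularized-chi statistic itself is not implemented because its formula is not given in the abstract.

```python
import math


def volcano_coordinates(case_alt, case_ref, control_alt, control_ref):
    """Return (log odds-ratio, chi) for a 2x2 allele-count table.

    chi is the square root of Pearson's chi-square statistic, computed
    without continuity correction. All four counts must be positive.
    """
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi_square = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    return math.log(odds_ratio), math.sqrt(chi_square)


# Same allele frequencies, but the second study has twice the sample size.
small = volcano_coordinates(30, 70, 10, 90)
large = volcano_coordinates(60, 140, 20, 180)
print(small)
print(large)
```

The two tables share the same odds ratio, but chi-square doubles when every count is doubled, so chi grows by a factor of sqrt(2). This is the sense in which the test statistic is more sensitive to sample size than the effect size, and why the two measures can rank variants differently.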