MOSAIK: A hash-based algorithm for accurate next-generation sequencing read mapping

Wan-Ping Lee (1), Michael Stromberg (1 and 2), Alistair Ward (1), Chip Stewart (1 and 3), Erik Garrison (1), Gabor T. Marth (1) ((1) Department of Biology, Boston College, Chestnut Hill, MA, (2) Illumina, Inc., San Diego, CA, (3) Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA)
(Submitted on 4 Sep 2013)

This paper presents an accurate short-read mapper for next-generation sequencing data that is widely used in the 1000 Genomes Project, in human clinical studies, and in genome studies of other species.

Predicting the ancestral character changes in a tree is typically easier than predicting the root state

Olivier Gascuel, Mike Steel
(Submitted on 4 Sep 2013)

Predicting the ancestral sequences of a group of homologous sequences related by a phylogenetic tree has been the subject of many studies, and numerous methods have been proposed for this purpose. Theoretical results are available that show that when the mutation rate becomes too large, reconstructing the ancestral state at the tree root is no longer feasible. Here, we also study the reconstruction of the ancestral changes that occurred along the tree edges. We show that, depending on the tree and branch length distribution, reconstructing these changes (i.e. reconstructing the ancestral state of all internal nodes in the tree) may be easier or harder than reconstructing the ancestral root state. However, results from information theory indicate that for the standard Yule tree, the task of reconstructing internal node states remains feasible, even for very high substitution rates. Moreover, computer simulations demonstrate that for more complex trees and scenarios, this result still holds. For a large variety of counting, parsimony-based and likelihood-based methods, the predictive accuracy for a randomly selected internal node in the tree is indeed much higher than the accuracy of the same method when applied to the tree root. Moreover, parsimony- and likelihood-based methods appear to be remarkably robust to sampling bias and model mis-specification.

Evolutionary consequences of assortativeness in haploid genotypes

David M. Schneider, Ayana B. Martins, Eduardo do Carmo, Marcus A.M. de Aguiar
(Submitted on 3 Sep 2013)

We study the evolution of allele frequencies in a large population where random mating is violated in a particular way that is related to recent works on speciation. Specifically, we consider non-random encounters in haploid organisms described by biallelic genes at two loci and assume that individuals whose alleles differ at both loci are incompatible. We show that evolution under these conditions leads to the disappearance of one of the alleles and substantially reduces the diversity of the population. The allele that disappears, and the other allele frequencies at equilibrium, depend only on their initial values, and so does the time to equilibration. However, certain combinations of allele frequencies remain constant during the process, revealing the emergence of strong correlation between the two loci promoted by the epistatic mechanism of incompatibility. We determine the geometrical structure of the haplotype frequency space and solve the dynamical equations, obtaining a simple rule to determine the equilibrium solution from the initial conditions. We show that our results are equivalent to selection against double heterozygotes for a population of diploid individuals and discuss the relevance of our findings to speciation.

Biological Averaging in RNA-Seq

Surojit Biswas, Yash N. Agrawal, Tatiana S. Mucyn, Jeffery L. Dangl, Corbin D. Jones
(Submitted on 3 Sep 2013)

RNA-seq has become a de facto standard for measuring gene expression. Traditionally, RNA-seq experiments are mathematically averaged: they sequence the mRNA of individuals from different treatment groups, hoping to correlate phenotype with differences in arithmetic read count averages at shared loci of interest. Alternatively, the tissue from the same individuals may be pooled prior to sequencing in what we refer to as a biologically averaged design. As mathematical averaging sequences all individuals, it controls for both biological and technical variation; however, is the statistical resolution gained always worth the additional cost? To compare biological and mathematical averaging, we examined theoretical and empirical estimates of statistical efficiency and relative cost efficiency. Though less efficient at a fixed sample size, we found that biological averaging can be more cost efficient than mathematical averaging. With this motivation, we developed a differential expression classifier, ICRBC, to detect alternatively expressed genes between biologically averaged samples. In simulation studies, we found that biological averaging and subsequent analysis with our classifier performed comparably to existing methods, such as ASC, edgeR, and DESeq, especially when individuals were pooled evenly and less than 20% of the regulome was expected to be differentially regulated. In two technically distinct mouse datasets and one plant dataset, we found that our method was over 87% concordant with edgeR for the 100 most significant features. We therefore conclude that biological averaging may sufficiently control biological variation to a level at which differences in gene expression may be detectable. In such situations, ICRBC can enable reliable exploratory analysis at a fraction of the cost, especially when interest lies in the most differentially expressed loci.

An integrative genomic approach illuminates the causes and consequences of genetic background effects

Christopher H. Chandler, Sudarshan Chari, David Tack, Ian Dworkin
(Submitted on 2 Sep 2013)

(abridged) The phenotypic consequences of mutations are modulated by the wild type genetic background in which they occur, sometimes dramatically so. Although such background dependence is widely observed, we do not know whether general patterns across species and traits exist, nor do we know the mechanisms underlying it. We also lack knowledge on how mutations interact with the genetic background to influence gene expression patterns, and how gene expression may in turn mediate mutant phenotypes. Furthermore, how genetic background influences patterns of epistasis remains unclear. To investigate the genetic basis and genomic consequences of genetic background dependence of the scallopedE3 allele on the Drosophila melanogaster wing, we generated multiple novel genome-level datasets, from a mapping-by-introgression experiment, as well as a tagged RNA gene expression dataset. In addition, we used whole genome re-sequencing of the parental lines (two commonly used laboratory strains) to predict polymorphic transcription factor binding sites for SD. We integrated these data with previously published genomic datasets from expression microarrays and a modifier mutational screen. By searching for genes showing a congruent signal in multiple datasets, we identified candidate loci contributing to the background-dependent effects of mutations in sd. We also show that the majority of background-dependent modifiers are caused by higher-order epistasis, not quantitative non-complementation of alleles. Our results also suggest that cis-regulatory variation contributes little to the background dependence of this mutant phenotype. These findings provide a useful foundation for more detailed investigations of genetic background dependence in this system, and this approach is likely to prove useful in exploring the genetic basis of other traits as well.

Human Genome Variation and the concept of Genotype Networks

Giovanni Marco Dall’Olio (1), Jaume Bertranpetit (1), Andreas Wagner (2, 3, 4), Hafid Laayouni (1) ((1) Institut de Biologia Evolutiva, CSIC-Universitat Pompeu Fabra, Barcelona, Spain. (2) Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Switzerland. (3) The Swiss Institute of Bioinformatics, Lausanne, Switzerland. (4) The Santa Fe Institute, Santa Fe, USA.)
(Submitted on 3 Sep 2013)

In 1970, John Maynard-Smith introduced the concept of “Protein Space”, a representation of all the possible protein sequences, as a framework to describe how evolutionary processes take place. Since then, the concepts of protein and of networks of sequences have been applied to a variety of systems, from protein modeling to RNA evolution, and to metabolic systems. Here, we adapted these concepts to the analysis of human DNA sequence data. We focused on the variation that can be represented from Single Nucleotide Variants (SNV) data, and we used the 1000 Genomes dataset to determine how human populations have explored this genotype space.
Our results include a genome-wide survey of how the genotype networks of human populations vary along the genome, and a framework to calculate the properties of these networks from sequencing data. Moreover, we found that, in coding regions, these networks tend to be both more “extended” in the space, and also more connected, than in non-coding regions. The application of the concept of genotype networks can provide a new opportunity to understand the evolutionary processes that shaped our genome. If we learn how human populations have explored the genotype space, we can achieve a better understanding of how selective pressures such as pathogens and diseases have shaped the evolution of a region of the genome, and how different regions have evolved. Combined with the availability of larger datasets of sequencing data, genotype networks represent a new approach to the study of human genetic diversity.

Some preprint comment streams at Haldane’s sieve and related sites

Given our one year anniversary, I thought I’d collect together a few examples of preprint commenting at work. These have taken place in the comment boxes of Haldane’s sieve and/or across a range of other blogs.

These are somewhat isolated cases, as the majority of preprints pass without any comment. It would be great to see more of this level of commentary. Remember that comments can be simple inquiries about methods, figures, references, etc., and don’t have to be super involved. In general we’ve found authors to be very responsive to comments, perhaps in part because they can take place as a more informal conversation without the pressures of publication concerns.

Genome sequencing highlights genes under selection and the dynamic early history of dogs
Reconstructing the population genetic history of the Caribbean
The population genetic signal of polygenic adaptation
The geography of recent genetic ancestry across Europe
Loss and Recovery of Genetic Diversity in Adapting Populations of HIV
Sailfish RNA-seq quantification
Genome-wide inference of ancestral recombination graphs

The date of interbreeding between Neandertals and modern humans
Ancient west Eurasian ancestry in southern and eastern Africa
The identifiability of piecewise demographic models from the sample frequency spectrum

One year at Haldane’s Sieve

We started Haldane’s Sieve back in August 2012, so we’ve just passed our one year anniversary. You can read our first post on our motivations for starting the blog here. We are pretty happy about how well Haldane’s Sieve has done at promoting preprints and a preprint culture more generally in population and evolutionary genetics and genomics.

Overall we have published 430 posts, the majority of which have been abstracts of arXived papers. It’s been great to see so many people starting to experiment with preprinting their work.

We’ve also had 41 guest posts by authors blogging about their papers (see here). This has been a really nice side effect of Haldane’s Sieve; we have gotten more researchers blogging about their work. The main aim of these “our paper” posts has been to allow authors to write about their paper in a more informal setting than a paper, to reach out to other researchers for feedback and to start to publicize their papers to the population and evolutionary genetics and genomics communities.

Over the past year Haldane’s Sieve has had over 600 comments. The majority of preprints have passed without comment, which is fine by us. Not all preprints need commentary, and a reasonable fraction are likely to have little long-term impact (like many papers). However, all of the abstracts posted at Haldane’s Sieve have been visited multiple times (the top ones hundreds of times), and the majority have been tweeted on Twitter. Thus all of the preprints have received attention, and have likely had many more sets of eyes viewing them earlier than if they’d never been preprinted.

Some of the preprints get significant amounts of attention, comments, and feedback (both online and offline), which is really heartening to see. We think that many papers have been improved thanks to appearing on the arXiv and at Haldane’s Sieve. Thanks to everyone for their comments. It would be great to have more. Remember that they do not have to be substantial and could be as simple as asking for clarification on a figure legend. We try to make sure that the authors of preprints get notified about comments, however minor. Every comment helps to improve preprints, to encourage others to preprint their papers, and to build a culture of preprint comments more generally.

Encouragingly, during the past year Genetics, Genome Research, and MBE have all changed their preprint policies to allow the submission of previously preprinted articles (see here). It is great to see preprints are starting to gain more acceptance in evolutionary genetics and genomics.

Here’s hoping for another good year, and we are thinking about extending Haldane’s Sieve in a few different ways over the coming year.

An alternative to the breeder’s and Lande’s equations

Bahram Houchmandzadeh (LIPhy)
(Submitted on 2 Sep 2013)

The breeder’s equation is a cornerstone of quantitative genetics and is widely used in evolutionary modeling. The equation, which reads R=h^{2}S, relates the response to selection R (the mean phenotype of the progeny) to the selection differential S (the mean phenotype of the selected parents) through a simple proportionality relation. The validity of this relation, however, relies strongly on the normal (Gaussian) distribution of the parents’ genotype, which is an unobservable quantity and cannot be ascertained. In contrast, we show here that if the fitness (or selection) function is Gaussian, an alternative, exact linear equation of the form R’=j^{2}S’ can be derived, regardless of the parental genotype distribution. Here R’ and S’ stand for the mean phenotypic lag behind the mean of the fitness function in the offspring and selected populations. To demonstrate this relation, we derive the exact functional relation between the mean phenotype in the selected and the offspring population and deduce all cases that lead to a linear relation between these quantities. These computations, which are confirmed by individual-based numerical simulations, generalize naturally to the multivariate Lande’s equation \Delta\mathbf{\bar{z}}=GP^{-1}\mathbf{S}.
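
For readers who want to see the two classical equations named in the abstract in action, here is a minimal Python sketch. It is not taken from the preprint and does not implement the paper's alternative relation R’=j^{2}S’. It assumes the usual textbook convention that R and S are measured as deviations from the mean of the parental population, and it restricts Lande's equation to two traits so that the matrix inverse can be written out by hand. The function names and the numbers in the usage note are hypothetical.

```python
def breeders_response(h2, S):
    """Univariate breeder's equation: R = h^2 * S.

    h2 is the narrow-sense heritability (assumed to lie in [0, 1]);
    S is the selection differential. Both are illustrative inputs.
    """
    if not 0.0 <= h2 <= 1.0:
        raise ValueError("heritability must lie in [0, 1]")
    return h2 * S


def inverse_2x2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mat_vec(m, v):
    """Multiply a matrix (nested lists) by a vector (list)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def lande_response(G, P, S):
    """Two-trait Lande equation: delta z-bar = G P^{-1} S.

    G is the additive genetic covariance matrix, P the phenotypic
    covariance matrix, S the vector of selection differentials.
    """
    beta = mat_vec(inverse_2x2(P), S)  # selection gradient P^{-1} S
    return mat_vec(G, beta)
```

As a usage example, breeders_response(0.5, 2.0) evaluates h^{2}S for a heritability of 0.5 and a selection differential of 2.0. When G and P are both diagonal, lande_response treats each trait independently, so each component reduces to the univariate equation with h^{2} equal to the ratio of the corresponding diagonal entries of G and P.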

Most viewed on Haldane’s Sieve: August 2013

The most viewed preprints on Haldane’s Sieve this month were:

*“The world” is defined for these purposes as Haldane’s Sieve.