# A two-fold advantage of sex

Su-Chan Park, Joachim Krug
(Submitted on 27 Feb 2013)

The adaptation of large asexual populations is hampered by the competition between independently arising beneficial mutations in different individuals, which is known as clonal interference. In classic work, Fisher and Muller proposed that recombination provides an evolutionary advantage in large populations by alleviating this competition. Based on recent progress in quantifying the speed of adaptation in asexual populations undergoing clonal interference, we present a detailed analysis of the Fisher-Muller mechanism for a model genome consisting of two loci with an infinite number of beneficial alleles each and multiplicative (non-epistatic) fitness effects. We solve the deterministic, infinite population dynamics exactly and show that, for a particular, natural mutation scheme, the speed of adaptation in sexuals is twice as large as in asexuals. This result is argued to hold for any nonzero value of the rate of recombination. Guided by the infinite population result and by previous work on asexual adaptation, we postulate an expression for the speed of adaptation in finite sexual populations that agrees with numerical simulations over a wide range of population sizes and recombination rates. The ratio of the sexual to asexual adaptation speed is a function of population size that increases in the clonal interference regime and approaches 2 for extremely large populations. The simulations also show that recombination leads to a strong equalization of the number of fixed mutations in the two loci. The generalization of the model to an arbitrary number $L$ of loci is briefly discussed. For a particular communal recombination scheme, the ratio of the sexual to asexual adaptation speed is approximately equal to $L$ in large populations.

# SOAP3-dp: Fast, Accurate and Sensitive GPU-based Short Read Aligner

Ruibang Luo, Thomas Wong, Jianqiao Zhu, Chi-Man Liu, Edward Wu, Lap-Kei Lee, Haoxiang Lin, Wenjuan Zhu, David W. Cheung, Hing-Fung Ting, Siu-Ming Yiu, Chang Yu, Yingrui Li, Ruiqiang Li, Tak-Wah Lam
(Submitted on 22 Feb 2013)

To tackle the exponentially increasing throughput of Next-Generation Sequencing (NGS), most of the existing short-read aligners can be configured to favor speed at the expense of accuracy and sensitivity. SOAP3-dp, by leveraging the computational power of both CPU and GPU with optimized algorithms, delivers high speed and sensitivity simultaneously. Compared with widely adopted aligners including BWA, Bowtie2, SeqAlto, GEM and GPU-based aligners including BarraCUDA and CUSHAW, SOAP3-dp is two to tens of times faster, while maintaining the highest sensitivity and lowest false discovery rate (FDR) on Illumina reads with different lengths. Transcending its predecessor SOAP3, which does not allow gapped alignment, SOAP3-dp by default tolerates alignment similarity as low as 60 percent. Real data evaluation using the human genome demonstrates SOAP3-dp’s power to enable more authentic variants and longer indels to be discovered. Fosmid sequencing shows a 9.1 percent FDR on newly discovered deletions. SOAP3-dp natively supports the BAM file format and provides the same scoring scheme as BWA, which enables it to be integrated into existing analysis pipelines. SOAP3-dp has been deployed on Amazon-EC2, NIH-Biowulf and Tianhe-1A.

# Inferring ancestral states without assuming neutrality or gradualism using a stable model of continuous character evolution

Michael G. Elliot, Arne O. Mooers
(Submitted on 20 Feb 2013)

The value of a continuous character evolving on a phylogenetic tree is commonly modelled as the location of a particle moving under one-dimensional Brownian motion with constant rate. The Brownian motion model is best suited to characters evolving under neutral drift or tracking an optimum that drifts neutrally. We present a generalization of the Brownian motion model which relaxes assumptions of neutrality and gradualism by considering increments to evolving characters to be drawn from a heavy-tailed stable distribution (of which the normal distribution is a specialized form). We describe Markov chain Monte Carlo methods for fitting the model to biological data paying special attention to ancestral state reconstruction, and study the performance of the model in comparison with a selection of existing comparative methods, using both simulated data and a database of body mass in 1,679 mammalian species. We discuss hypothesis testing and model selection. The new model is well suited to a stochastic process with a volatile rate of change in which biological characters undergo a mixture of neutral drift and occasional evolutionary events of large magnitude.
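As a rough illustration of what "increments drawn from a heavy-tailed stable distribution" means, the sketch below draws symmetric stable increments with the Chambers-Mallows-Stuck method and uses them to evolve a character along one branch. The function names, the `scale` parameter and the restriction to symmetric increments are assumptions made for this example; it is not the authors' MCMC implementation. With `alpha = 2` the increments are normal (the Brownian motion case); smaller `alpha` gives occasional large jumps.

```python
import math
import random


def symmetric_stable(alpha, rng):
    """Draw one standard symmetric alpha-stable variate, 0 < alpha <= 2.

    Uses the Chambers-Mallows-Stuck construction. alpha = 2 gives a normal
    variate with variance 2; alpha = 1 gives a standard Cauchy variate.
    """
    u = rng.uniform(-math.pi / 2, math.pi / 2)
    w = rng.expovariate(1.0)
    while w == 0.0:  # guard against a (vanishingly rare) zero draw
        w = rng.expovariate(1.0)
    if alpha == 1.0:
        return math.tan(u)
    return (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
            * (math.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))


def evolve(start, branch_length, alpha, scale, rng):
    """Character value at the end of a branch (hypothetical helper).

    For a stable process the increment over time t has scale
    scale * t ** (1 / alpha).
    """
    step = scale * branch_length ** (1.0 / alpha)
    return start + step * symmetric_stable(alpha, rng)


if __name__ == "__main__":
    rng = random.Random(42)
    for alpha in (2.0, 1.5):
        values = [evolve(0.0, 1.0, alpha, 1.0, rng) for _ in range(10000)]
        largest = max(abs(v) for v in values)
        print(f"alpha={alpha}: largest absolute change = {largest:.2f}")
```

Comparing the largest change for the two values of `alpha` gives a feel for why a heavy-tailed model can accommodate rare evolutionary events of large magnitude that a constant-rate Brownian model treats as essentially impossible.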

# Our paper: Sequencing mRNA from cryo-sliced Drosophila embryos to determine genome-wide spatial patterns of gene expression.

Our next guest post is by Mike Eisen [@mbeisen] on his paper with Peter Combs [@rflrob]
Peter A. Combs and Michael B. Eisen (2013). Sequencing mRNA from cryo-sliced Drosophila embryos to determine genome-wide spatial patterns of gene expression. arXived here.

This is cross posted from Mike’s blog.

It’s no secret to people who read this blog that I hate the way scientific publishing works today. Most of my efforts in this domain have focused on removing barriers to the access and reuse of published papers. But there are other things that are broken with the way scientists communicate with each other, and chief amongst them is pre-publication peer review. I’ve written about this before, and won’t rehash the arguments here, save to say that I think we should publish first, and then review. But one could argue that I haven’t really practiced what I preach, as all of my lab’s papers have gone through peer review before they were published.

No more. From now on we are going to post all of our papers online when we feel they’re ready to share – before they go to a journal. We’ll then solicit comments from our colleagues and use them to improve the work prior to formal publication. Physicists and mathematicians have been doing this for decades, as have an increasing number of biologists. It’s time for this to become standard practice.

Some ground rules. I will not filter comments except to remove obvious spam. You are welcome to post comments under your name or under a pseudonym – I will not reveal anyone’s identity – but I urge you to use your real name as I think we should have fully open peer review in science.

OK. Now for the paper, which is posted on arXiv and can be linked to and cited there. We also have a copy here, in case you’re having trouble with figures on arXiv.

Peter A. Combs and Michael B. Eisen (2013). Sequencing mRNA from cryo-sliced Drosophila embryos to determine genome-wide spatial patterns of gene expression.

Several years ago a postdoc in my lab, Susan Lott (now at UC Davis), developed methods to sequence the RNAs from single Drosophila embryos. She was interested in looking at expression differences between males and females in early embryogenesis, and published a beautiful paper on that topic.

Although we were initially worried that we wouldn’t be able to get enough RNA from single embryos to get reliable sequencing results, it turns out we got more than enough. Each embryo yielded around 100 ng of total RNA, and we would end up loading only ~10% of the sample onto the sequencer. So it occurred to us that maybe we could work with material from pieces of individual embryos and thereby get spatial expression information on a genomic scale in a single quick experiment – an alternative to highly informative, but slow imaging-based methods.

I recruited a new biophysics student, Peter Combs, to work on slicing embryos with a microtome along the anterior-posterior axis and sequencing each of the sections to identify genes with patterned expression along the A-P axis. In typical PI fashion, I figured this would take a few weeks, but it ended up taking over a year to get right.

The major challenge was that, while a tenth of an embryo contains more than enough RNA to analyze by mRNA-seq, it turned out to be very difficult to shepherd that RNA successfully from a single cryosection to the sequencer. Peter was routinely failing to recover RNA and make libraries from these samples using methods that worked great for whole embryos. While there are various protocols out there claiming to analyze RNA from single cells, we were reluctant to use these amplification-based strategies.

The typical way people deal with loss of small quantities of nucleic acids during experimental manipulation is to add carrier RNA or DNA – something like tRNA or salmon sperm DNA. We didn’t want to do that, since we would just end up with tons of useless sequencing reads. So we came up with a different strategy – adding embryos from distantly related Drosophila species to each slice at an early stage in the process. This brought the total amount of RNA in each sample well above the threshold where our purification and library preparation worked robustly, and we could easily separate the D. melanogaster RNA we were interested in for this experiment from that of the “carrier” embryo. Better still, we could avoid wasting sequencing reads by turning the carrier RNAs into an experiment of their own – in this case looking at expression variation between species.

With this trick, the method now works great, and the paper is really just a description of the method and a demonstration that accurate expression patterns can be recovered from individual cryosectioned embryos. The resolution here is not that great – we used 6 slices of ~60 µm each per embryo. But we’ve started to make smaller sections, and a back-of-the-envelope calculation suggests we can, with available sample handling and sequencing techniques, make up to 100 slices per embryo. This would be more than enough to see stripes and other subtle patterns missed in the current dataset.

Our near-term goals are to do a developmental time course, compare patterns in male and female embryos, look at other species and examine embryos from strains carrying various patterning defects. For those of you going to the fly meeting in DC in April, Peter’s talk will, I hope, have some of this new data.

Anyway, we would love comments on either the method or the manuscript.

# Sequencing mRNA from cryo-sliced Drosophila embryos to determine genome-wide spatial patterns of gene expression

Peter A. Combs, Michael B. Eisen
(Submitted on 19 Feb 2013)

Complex spatial and temporal patterns of gene expression underlie embryo differentiation, yet methods do not yet exist for the efficient genome-wide determination of spatial patterns of gene expression. *In situ* imaging of transcripts and proteins is the gold standard, but is difficult and time consuming to apply to an entire genome, even when highly automated. Sequencing, in contrast, is fast and genome-wide, but generally applied to homogenized tissues, thereby discarding spatial information. At some point, these methods will converge, and we will be able to sequence RNAs *in situ*, simultaneously determining their identity and location. As a step along this path, we developed methods to cryosection individual blastoderm stage *Drosophila melanogaster* embryos along the anterior-posterior axis and sequence the mRNA isolated from each 60 µm slice. The spatial patterns of gene expression we infer closely match patterns determined by *in situ* hybridization and microscopy, where such data exist, and thus we conclude that we have generated the first genome-wide map of spatial patterns in the *Drosophila* embryo. We identify numerous genes with spatial patterns that have not yet been screened in the several ongoing systematic *in situ* based projects, the majority of which are localized to the posterior end of the embryo, likely in the pole cells. This simple experiment demonstrates the potential for combining careful anatomical dissection with high-throughput sequencing to obtain spatially resolved gene expression on a genome-wide scale.

# Mutation Rules and the Evolution of Sparseness and Modularity in Biological Systems

Tamar Friedlander, Avraham E. Mayo, Tsvi Tlusty, Uri Alon
(Submitted on 18 Feb 2013)

Biological systems show two structural features on many levels of organization: sparseness, in which only a small fraction of possible interactions between components actually occur; and modularity, the near decomposability of the system into modules with distinct functionality. Recent work suggests that modularity can evolve in a variety of circumstances, including goals that vary in time such that they share the same subgoals (modularly varying goals). Here, we study the origin of modularity and sparseness focusing on the nature of the mutation process, rather than variations in the goal. We use simulations of evolution with different mutation rules. We find that commonly used sum-rule mutations, in which interactions are mutated by adding random numbers, do not lead to modularity or sparseness except for special situations. In contrast, product-rule mutations, in which interactions are mutated by multiplying by random numbers, a better model for the effects of biological mutations, lead to sparseness naturally. When the goals of evolution are modular, in the sense that specific groups of inputs affect specific groups of outputs, product-rule mutations lead to modular structure; sum-rule mutations do not. Product-rule mutations generate sparseness and modularity because they keep small interaction terms small.
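The closing claim, that product-rule mutations "keep small interaction terms small", can be seen in a few lines. The sketch below applies each rule repeatedly to a single near-zero interaction strength, with no selection at all. The Gaussian form of the random numbers, the value of `sigma`, and all function names are assumptions for illustration only; this is not the authors' simulation.

```python
import random


def sum_rule(w, rng, sigma=0.1):
    """Sum-rule mutation: add a random number to the interaction strength."""
    return w + rng.gauss(0.0, sigma)


def product_rule(w, rng, sigma=0.1):
    """Product-rule mutation: multiply the interaction strength by a random number."""
    return w * rng.gauss(1.0, sigma)


def drift(w0, rule, steps, rng):
    """Apply the same mutation rule `steps` times, without selection."""
    w = w0
    for _ in range(steps):
        w = rule(w, rng)
    return w


def mean_abs_after(w0, rule, steps=100, lineages=200, seed=0):
    """Average |w| over independent lineages that all start from w0."""
    rng = random.Random(seed)
    total = sum(abs(drift(w0, rule, steps, rng)) for _ in range(lineages))
    return total / lineages


if __name__ == "__main__":
    start = 1e-6  # an almost-absent interaction
    print("sum rule:    ", mean_abs_after(start, sum_rule))
    print("product rule:", mean_abs_after(start, product_rule))
```

Under the sum rule the near-zero interaction performs a random walk away from zero, so absent interactions are constantly re-created; under the product rule the change is proportional to the current value, so a term that is close to zero stays close to zero.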

# Fitness distributions in spatial populations undergoing clonal interference

Jakub Otwinowski, Joachim Krug
(Submitted on 18 Feb 2013)

Competition between independently arising beneficial mutations is enhanced in spatial populations due to the linear rather than exponential growth of the clones. Recent theoretical studies have pointed out that the resulting fitness dynamics is analogous to a surface growth process, where new layers nucleate and spread stochastically, leading to the build up of scale-invariant roughness. This scenario differs qualitatively from the standard view of adaptation in that the speed of adaptation becomes independent of population size while the fitness variance does not, in apparent violation of Fisher’s fundamental theorem. Here we exploit recent progress in the understanding of surface growth processes to obtain precise predictions for the universal, non-Gaussian shape of the fitness distribution for one-dimensional habitats, which are verified by simulations.

# Beyond position weight matrices: nucleotide correlations in transcription factor binding sites and their description

Marc Santolini, Thierry Mora, Vincent Hakim
(Submitted on 18 Feb 2013)

The identification of transcription factor binding sites (TFBSs) on genomic DNA is of crucial importance for understanding and predicting regulatory elements in gene networks. TFBS motifs are commonly described by Position Weight Matrices (PWMs), in which each DNA base pair independently contributes to the transcription factor (TF) binding, despite mounting evidence of interdependence between base-pair positions. The recent availability of genome-wide data on TF-bound DNA regions offers the possibility to revisit this question in detail for TF binding *in vivo*. Here, we use available fly and mouse ChIP-seq data, and show that the independent model generally does not reproduce the observed statistics of TFBSs, generalizing previous observations. We further show that TFBS description and predictability can be systematically improved by taking into account pairwise correlations in the TFBS via the principle of maximum entropy. The resulting pairwise interaction model is formally equivalent to the disordered Potts models of statistical mechanics and it generalizes previous approaches to interdependent positions. Its structure allows for co-variation of two or more base pairs, as well as secondary motifs. Although models consisting of mixtures of PWMs also have this last feature, we show that pairwise interaction models outperform them. The significant pairwise interactions are found to be sparse and found dominantly between consecutive base pairs. Finally, the use of a pairwise interaction model for the identification of TFBSs is shown to give significantly different predictions than a model based on independent positions.
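To make the limitation of the independent model concrete, here is a deliberately tiny, made-up example: a set of two-base "binding sites" in which the two positions are perfectly correlated. The sites, function names and the absence of pseudocounts are all assumptions for illustration; real PWMs are built from far larger alignments and are usually expressed as log-odds scores.

```python
from collections import Counter


def position_frequencies(sites):
    """Per-position base frequencies, i.e. the independent (PWM-style) model."""
    length = len(sites[0])
    freqs = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        freqs.append({base: counts[base] / len(sites) for base in "ACGT"})
    return freqs


def independent_probability(seq, freqs):
    """Probability of a sequence when every position contributes independently."""
    p = 1.0
    for base, column in zip(seq, freqs):
        p *= column[base]
    return p


def observed_frequency(seq, sites):
    """Fraction of the observed sites that equal `seq` exactly."""
    return Counter(sites)[seq] / len(sites)


if __name__ == "__main__":
    # Toy data: the two positions always co-vary (AA or TT, never AT or TA).
    sites = ["AA", "TT", "AA", "TT"]
    freqs = position_frequencies(sites)
    for seq in ("AA", "AT"):
        print(seq,
              "independent model:", independent_probability(seq, freqs),
              "observed:", observed_frequency(seq, sites))
```

The independent model assigns "AT" the same probability as "AA", although "AT" never occurs in the data. Capturing that kind of co-variation is what the pairwise terms in the abstract are for.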

# Thoughts on “Integrating genealogical and dynamical modelling to infer escape and reversion rates in HIV epitopes”

Our next guest post is by Pleuni Pennings [@pleunipennings] with her thoughts on:
Integrating genealogical and dynamical modelling to infer escape and reversion rates in HIV epitopes, Duncan Palmer, John Frater, Rodney Philips, Angela McLean, Gil McVean, arXived here

[UPDATED]

Last week, a group of people from Oxford University posted an interesting paper on arXiv. The paper is about using genealogical data (from HIV sequences), in combination with cross-sectional data (on patient and HIV phenotypes) to infer rates of evolution in HIV.

My conclusion: the approach is very interesting, and it makes total sense to use genealogical data to improve the inference from cross-sectional data. In fact, it is quite surprising to me that inferring rates from cross-sectional data works at all. However, in a previous paper by (partly) the same people, they show that it is possible to infer rates using cross-sectional data only, and the estimates they get are very similar to the estimates from longitudinal data. The current paper provides a new and improved method, whose results are consistent with the previous papers.

The biological conclusion of the paper is that HIV adaptation is slower than many previous studies suggested. Case studies of fast evolution of the virus suffer from extreme publication bias and give the impression that evolution in HIV is always fast, whereas cross-sectional and longitudinal data show that evolution is often slow. Waiting times for CTL-escape and reversion are on the order of years.

## 1. What rates are they interested in?

The rates of interest here are the rate of escape from CTL pressure and the rate of reversion if there is no CTL pressure.

When someone is infected with HIV, the CTL response by the immune system of the patient can reduce the amount of virus in the patient. CTL stands for cytotoxic T lymphocytes. Which amino-acid sequences (epitopes) can be recognized by the host’s CTL response depends on the HLA genotype of the host.
Suppose I have a certain HLA genotype X, such that my CTLs can recognize virus with a specific sequence of about 9 amino acids, let’s call this sequence Y. To escape from the pressure of these CTLs, the virus can mutate sequence Y to sequence Y’. A virus with sequence Y’ is called an escape mutant. The host (patient) with HLA X is referred to as a “matched host” and hosts without HLA X are referred to as “unmatched.” The escape mutations are thought to be costly for the virus.
So, for each CTL epitope there are 4 possible combinations of host and virus:
1. matched host and wildtype virus (there is selection pressure on the virus to “escape”)
2. matched host and escape mutant virus
3. unmatched host and wildtype virus
4. unmatched host and escape mutant virus (there is selection pressure on the virus to revert)

The question is “how fast does the virus escape if it is in a matched host and how fast does it revert if it is in an unmatched host?”

## 2. Why do we want to know these rates?

First of all, out of curiosity: it is interesting to study how fast things evolve, and it is surprising how little we know about rates of adaptive evolution. Secondly, escape rates are relevant for the success of a potential HIV vaccine: if escape rates are high, then vaccines will probably not be very successful.

## 3. What are cross-sectional data and how can we infer rates from them?

Cross-sectional data are snapshots of the population, with information on hosts and their virus. Here, it is the number of matched and unmatched hosts with wildtype and escape virus at a given point in time.

So how do these data tell us what escape rates and reversion rates are? Intuitively, it is easy to see how very high or very low rates would shape the data. For example, if escape and reversion happened very fast, then the virus would always be perfectly adapted: we’d only find wildtype virus in unmatched hosts and only escape mutant virus in matched hosts. Conversely, if escape and reversion were extremely slow, then the fraction of escape mutant virus would not differ between matched and unmatched hosts. Everyone would be infected with a random virus and this would never change.
The real situation is somewhere in between: the fraction of escape mutant virus is higher in matched hosts than in unmatched hosts. With the help of a standard epidemiological SI-model (ODE-model) and an estimate of the age of the epidemic, the fraction of escape mutant virus in the two types of hosts translates into estimates of the rates of escape and reversion. In the earlier paper, this is exactly what the authors did, and the results make a lot of sense. Rates range from months to years, reversion is always slower than escape, and there are large differences between CTLs. The results also matched well with data from longitudinal studies. In a longitudinal study, the patients are followed over time and evolution of the virus can be more directly observed. This is much more costly, but a much better way to estimate rates.
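To see how fractions can translate into rates, here is a toy calculation. It is emphatically not the authors' SI model: it ignores transmission between hosts entirely and simply assumes that every host was infected the same number of years ago, that a fraction `initial_escape` of infections started with escape virus, that escape happens at a constant rate in matched hosts, and that reversion happens at a constant rate in unmatched hosts. All names and numbers are made up for illustration.

```python
import math


def escape_fraction_matched(escape_rate, years, initial_escape):
    """Fraction of matched hosts carrying escape virus after `years`.

    Hosts that started with wildtype virus escape at a constant rate.
    """
    still_wildtype = (1.0 - initial_escape) * math.exp(-escape_rate * years)
    return 1.0 - still_wildtype


def escape_fraction_unmatched(reversion_rate, years, initial_escape):
    """Fraction of unmatched hosts still carrying escape virus after `years`.

    Hosts that started with escape virus revert at a constant rate.
    """
    return initial_escape * math.exp(-reversion_rate * years)


def infer_rates(frac_matched, frac_unmatched, years, initial_escape):
    """Invert the two expressions above to recover the rates (per year)."""
    escape_rate = -math.log((1.0 - frac_matched) / (1.0 - initial_escape)) / years
    reversion_rate = -math.log(frac_unmatched / initial_escape) / years
    return escape_rate, reversion_rate


if __name__ == "__main__":
    years, p0 = 10.0, 0.3
    for label, rate in (("fast", 100.0), ("slow", 1e-9), ("intermediate", 0.1)):
        fm = escape_fraction_matched(rate, years, p0)
        fu = escape_fraction_unmatched(rate, years, p0)
        print(f"{label}: matched={fm:.3f} unmatched={fu:.3f}")
```

The fast and slow cases reproduce the two extremes described above (perfect adaptation versus no difference between host types), and `infer_rates` shows that in between, the two observed fractions pin down the two rates once the elapsed time is known. The real method has to do this inside an epidemic model, where viruses are passed between matched and unmatched hosts.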

## 4. Why are the estimates from cross-sectional data not good enough?

Unfortunately, the estimates from cross-sectional data are only point estimates, and maybe not very good ones. The problem is that the method (implicitly) assumes that each virus is independently derived from an ancestor at the beginning of the epidemic. For example, if there are a lot of escape mutant viruses in the dataset, then the estimated rate of escape will be high. However, the high number of escape mutant virus may be due to one or a few escape events early on in the epidemic that got transmitted to a lot of other patients. It is a classical case of non-independence of data. It could lead us to believe that we can have more confidence in the estimates than we should have.

## 5. Genealogical data to the rescue!

Fortunately, the authors have viral sequences that provide much more information than just whether or not the virus is an escape mutant. The sequences of the virus can inform us about the underlying genealogical tree and can tell us how non-independent the data really are (two escape mutants that are very close to each other in the tree are not very independent). The goal of the current paper is to use the genealogical data to get better estimates of the escape and reversion rates.

A large part of the paper deals with the nuts and bolts of how to combine all the data, but in essence, this is what they do: They first estimate the genealogical tree for the viruses of the patients for which they have data (while allowing for uncertainty in the estimated tree). Then they add information on the states of the tips (wildtype vs escape for the virus and matched vs unmatched for the patient), and use the tree with the tip-labels to estimate the rates. This seems to be a very useful new method that may give better estimates and a natural way to get credible intervals for the estimates.

The results they obtain with the new method are similar to the previous results for three CTL epitopes and show slower rates for one CTL epitope. The credible intervals are quite wide, which shows that the data (from 84 patients) really don’t contain a whole lot of information about the rates, possibly because the trees are rather star-shaped, due to the exponential growth of the epidemic. Interestingly, the fact that the tree is rather star-shaped could explain why the older approach (based only on cross-sectional data) worked quite well. However, this will not necessarily be the case for other datasets.

## Question for the authors

Do you use the information about the specific escape mutations in the data? Surely not all sequences that are considered “escape mutants” carry exactly the same nucleotide changes? Whenever they carry different mutations, you know they must be independent.

# Robust estimation of microbial diversity in theory and in practice

Bart Haegeman, Jérôme Hamelin, John Moriarty, Peter Neal, Jonathan Dushoff, Joshua S. Weitz
(Submitted on 15 Feb 2013)

Quantifying diversity is of central importance for the study of structure, function and evolution of microbial communities. The estimation of microbial diversity has received renewed attention with the advent of large-scale metagenomic studies. Here, we consider what the diversity observed in a sample tells us about the diversity of the community being sampled. First, we argue that one cannot reliably estimate the absolute and relative number of microbial species present in a community without making unsupported assumptions about species abundance distributions. The reason for this is that sample data do not contain information about the number of rare species in the tail of species abundance distributions. We illustrate the difficulty in comparing species richness estimates by applying Chao’s estimator of species richness to a set of in silico communities: they are ranked incorrectly in the presence of large numbers of rare species. Next, we extend our analysis to a general family of diversity metrics (“Hill diversities”), and construct lower and upper estimates of diversity values consistent with the sample data. The theory generalizes Chao’s estimator, which we retrieve as the lower estimate of species richness. We show that Shannon and Simpson diversity can be robustly estimated for the in silico communities. We analyze nine metagenomic data sets from a wide range of environments, and show that our findings are relevant for empirically sampled communities. Hence, we recommend the use of Shannon and Simpson diversity rather than species richness in efforts to quantify and compare microbial diversity.