# Loss and Recovery of Genetic Diversity in Adapting Populations of HIV

Pleuni Pennings, Sergey Kryazhimskiy, John Wakeley
(Submitted on 15 Mar 2013)

A population’s adaptive potential is the likelihood that it will adapt in response to an environmental challenge, e.g., develop resistance in response to drug treatment. The effective population size inferred from genetic diversity at neutral sites has traditionally been taken as a major predictor of adaptive potential. However, recent studies demonstrate that this effective population size vastly underestimates the population’s adaptive potential (Karasov 2010).
Here we use data from treated HIV-infected patients (Bacheler 2000) to estimate the effective size of HIV populations relevant for adaptation. Our estimate is based on the frequencies of soft and hard selective sweeps of the known resistance mutation K103N. We observe that 41% of HIV populations in this study acquire resistance via at least two functionally equivalent but distinct mutations which sweep to fixation without significantly reducing genetic diversity at neighboring sites (soft selective sweeps). We further estimate that 20% of populations acquire a resistant allele via a single mutation that sweeps to fixation and drastically reduces genetic diversity (hard selective sweeps). We infer that the effective population size that determines the adaptive potential of within-patient HIV populations is approximately 150,000. Our estimate is two orders of magnitude higher than a classical estimate based on diversity at synonymous sites.
Three reasons, not mutually exclusive, can explain this discrepancy:
(1) some synonymous mutations may be under selection;
(2) highly beneficial mutations may be less affected by ongoing linked selection than synonymous mutations; and
(3) synonymous diversity may not be at its expected equilibrium because it recovers slowly from sweeps and bottlenecks.
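To see how the hard/soft split translates into an effective size, here is a back-of-envelope sketch, not the paper's actual inference: under the Ewens sampling formula, two lineages sampled after a sweep descend from a single mutational origin (a hard sweep) with probability roughly 1/(1 + θ), where θ = 2·Ne·μ, so the observed split gives a crude θ and hence a crude Ne. The mutation rate below is an assumed, purely illustrative value.

```python
# Back-of-envelope sketch, not the paper's method.  Two post-sweep
# lineages share a single mutational origin (hard sweep) with probability
#     P(hard) = 1 / (1 + theta),    theta = 2 * Ne * mu,
# so inverting the observed hard/soft split yields theta, hence Ne.

mu = 1e-5                        # assumed, illustrative rate to the resistant allele
p_hard = 0.20 / (0.20 + 0.41)    # hard sweeps among the classified sweeps above

theta = 1.0 / p_hard - 1.0       # invert P(hard) = 1/(1 + theta)
ne = theta / (2.0 * mu)

print(f"theta ~ {theta:.2f}, Ne ~ {ne:.0f}")  # same order as the paper's 150,000
```

With these toy numbers the sketch lands at Ne on the order of 10^5, i.e., the same order of magnitude as the estimate in the abstract, and two orders above typical synonymous-diversity estimates.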

# Sensitive Long-Indel-Aware Alignment of Sequencing Reads

Tobias Marschall, Alexander Schönhuth
(Submitted on 14 Mar 2013)

The tremendous advances in high-throughput sequencing technologies have made population-scale sequencing, as performed in the 1000 Genomes project and the Genome of the Netherlands project, possible. Next-generation sequencing has allowed genome-wide discovery of variation beyond single-nucleotide polymorphisms (SNPs), in particular of structural variations (SVs) like deletions, insertions, duplications, translocations, inversions, and even more complex rearrangements. Here, we design a read aligner with special emphasis on the following properties: (1) high sensitivity, i.e., finding all (reasonable) alignments; (2) ability to find (long) indels; (3) statistically sound alignment scores; and (4) runtime fast enough to be applied to whole-genome data. We compare performance to BWA, bowtie2, and stampy, and find that our method is especially advantageous on reads containing larger indels.

# Our paper: RNA-Seq Mapping Errors When Using Incomplete Reference Transcriptomes of Vertebrates

This guest post is by Titus Brown on his group’s preprint, RNA-Seq Mapping Errors When Using Incomplete Reference Transcriptomes of Vertebrates, arXived here. This post is cross-posted from his blog.

I’m worried about our current mRNAseq analysis strategies.

I recently posted a draft paper of ours to arXiv entitled RNA-Seq Mapping Errors When Using Incomplete Reference Transcriptomes of Vertebrates; the link is to the Haldane’s Sieve discussion of the paper. Graham Coop said I should write something quick, and so I am (sort of — I meant to write this post a few days ago, but teaching. family. etc.)

I decided to post the paper — although we haven’t submitted it yet — because I wanted to solicit feedback and gauge where the disagreements or misunderstandings were likely to pop up. I suspect reviewers are going to have a hate-hate relationship with this paper, too, so I want to get rid of any obvious mistakes by crowdsourcing some early reviews ;).

Before I talk about the paper, let me mention a few related factoids that crossed my ‘net firehose and that tie into the paper; I’ll weave them into the paper discussion below.

1. Dan Zerbino (one of the Assembly Gods?) wrote a good e-mail on the oases-users mailing list about how Oases can’t really leverage coverage information when reconstructing transcripts.
2. I read and reviewed the SASeq paper on mRNAseq quantification, which talks about how many of our transcript models are almost certainly not expressed.
3. The lamprey genome paper came out, pointing out that the genome is probably missing ~20-30% of the germline content.
4. A paper on evaluating mRNAseq expression analysis software hit arXiv.

OK, so what did our paper say?

The paper’s genesis is this: for years I’d been explaining to collaborators that mRNAseq was unreliable on any organism without a pretty good reference transcriptome, and that constructing the reference transcriptome relied on either good mRNAseq assembly tools OR on having a good reference genome. So that was what my lab was working on doing with their data.

Since we tend to work on weird organisms like chick (which has an incomplete genome and a notoriously poor gene annotation), lamprey (see above — missing 20-30% of its genome), and Molgulid ascidians (for which we are constructing the genome), we needed to go the de novo mRNAseq assembly route.

However, we also quickly realized that there were a few problems with de novo mRNAseq assembly and expression evaluation: lots of mRNAseq needed to be crunched in order to get good reconstruction, which was computationally intensive (see: diginorm); splice variants would confuse mapping (see: this paper); we were uncertain of how well splice variants could be assembled with or without a reference (see: this paper); and we weren’t sure how to evaluate mapping tools.

When a new postdoc (Alexis Black Pyrkosz, the first author) joined the lab and wanted a fairly simple project to get started, I suggested that she evaluate the extent to which an incomplete reference transcriptome would screw things up, and perhaps try out a few different mappers. This took her about a year to work through, in addition to all of her other projects. She built a straightforward simulation system, tried it out on chick and mouse, and got some purely computational results that place (in my view) fundamental limits on what you can accomplish with certainty using current mRNAseq technology.

Incidentally, one piece of feedback she got at GLBio from a prof was (paraphrased) "I hope this isn’t your real project, because it’s not very interesting."

The results, in short, are:

1. What mapper you use doesn’t really matter, except to within a few percent; they all perform fine, and they all degrade fairly fast with sequencing errors.

2. Incomplete reference transcriptomes matter, a lot. There are two entirely obvious reasons: if you have a splice variant A that is in your reference but not present in your mRNAseq, and a splice variant B that is not in your reference but is actually transcribed and in your mRNAseq, the reads for B will get mapped to the wrong transcript; and (the even more obvious one) you can’t measure the expression of something that’s not in your reference via mapping.

The SASeq paper does a nice job of pointing out that there’s something rather seriously wrong with current mRNAseq references, even in mouse, and they provide a way to minimize misestimation in the context of the reference.

3. Direct splice variant reconstruction and measurement is, technically, impossible for about 30% of the transcripts in mouse. For standard paired-end sequencing, it turns out that you cannot claim that exon A and exon Z are present in the same isoform for about 30% of the isoforms.

4. The slightly surprising conclusion that we reached from this is that mRNAseq assembly is also, generally speaking, impossible: you cannot unambiguously construct a reasonably large proportion of observed isoforms via assembly, since the information to connect the exons is not there.
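The limit in point 3 is purely geometric: a read pair can only tie two exons to the same isoform if some fragment can touch both at once. A small sketch (a hypothetical helper, not the paper's simulation code) checks that feasibility in spliced transcript coordinates:

```python
def can_link(exon_lens, i, j, read_len=100, frag_len=300):
    """Can any single paired-end fragment put one read in exon i and its
    mate in exon j of the same isoform (i <= j)?  Exons are given by
    length in spliced (transcript) order.  Hypothetical helper, not the
    paper's code."""
    starts = [0]
    for length in exon_lens[:-1]:
        starts.append(starts[-1] + length)
    total = sum(exon_lens)
    si, ei = starts[i], starts[i] + exon_lens[i]
    sj, ej = starts[j], starts[j] + exon_lens[j]
    # Fragment start p: read 1 covers [p, p + read_len), read 2 covers
    # [p + frag_len - read_len, p + frag_len); both must overlap their
    # exon and the whole fragment must stay on the transcript.
    lo = max(si - read_len + 1, sj - frag_len + 1, 0)
    hi = min(ei - 1, ej - frag_len + read_len - 1, total - frag_len)
    return lo <= hi

# Short middle exon: first and last exons can be phased together...
print(can_link([200, 150, 200], 0, 2))                   # True
# ...but a 5 kb middle exon cannot be crossed by a 300 bp fragment:
print(can_link([200, 5000, 200], 0, 2))                  # False
# Only much longer fragments resolve the ambiguity:
print(can_link([200, 5000, 200], 0, 2, frag_len=5300))   # True
```

Whenever the spliced distance between two exons exceeds the fragment length, no read placement can certify they co-occur in one isoform, which is points 3 and 4 in miniature.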

Until recently, I had held out a forlorn hope that Clever Things were being done with coverage. Then I saw Dan Zerbino’s e-mail, item 1, above.

And yet, judging by the Oases and Trinity publications, assembly works! What’s up?

There’s something a bit weird going on. Tools like Oases and Trinity can clearly construct a fair proportion of previously observed transcripts, even though the information to do so from direct observation isn’t there and they can’t necessarily use coverage inference reliably. My guess (see item 4, above) is that this is because biology is mostly cooperating with us by giving us one dominant isoform in many or most circumstances; this matches what Joe Pickrell et al. observed in their truly excellent noisy splicing paper. But I’d be interested in hearing alternate theories.

At this point, my friend and colleague Erich Schwarz tends to get unhappy with me and say "what would you have me do with my mRNAseq, then? Ignore it until you come up with a solution, which you claim is impossible anyway?" Good question! My answer is (a) "explore the extent to which we can place error bars or uncertainty on isoform abundance calculations", (b) "figure out where interesting isoform misestimation is likely to lie in the data sets", and (c) "look at exon-exon junctions and exon presence/absence instead of whole isoform abundance." But the tools to do this are still rather immature, I think, and people mostly ignore the issue or hope it doesn’t bugger up their results. (Please correct me if I’m wrong – I’d love pointers!)

In my lab, we are starting to explore ways to determine what mis- or un-assembled isoforms there might be in a given transcriptome. We’re also looking at non-reference-based ways of doing mRNAseq quantification and differential expression (technically, graph-based methods for mRNAseq). We are also deeply skeptical of many of the normalization approaches being used, largely because every time we evaluate them in the context of our actual data, our data seems to violate a number of their assumptions… Watch This Space.

## Paper reactions

What’s the reaction to the paper been, so far?

Well, on Twitter, I’ve been getting a fair bit of "ahh, a good reference genome will solve your problems!" But I think points #3 and #4 above stand. Plus, invoking the solution of a harder, more expensive problem to solve what you think is a simpler one is an interesting approach :). And since we don’t have good references (and won’t for a while), it’s not really a solution for us.

I’ve also been getting "this is a well known problem and not worth publishing!" from a few people. Well, OK, fair enough. I’ve been skimming papers with an eye to this for a while, but it’s entirely possible I’ve missed this part of the literature. I’d love to read and cite such a paper in this one (and even rely on it to make our points, if it truly has significant overlap). Please post links in the comments, I’d really appreciate it!

It is clear that the paper needs some reshaping in light of some of the comments, and I’d like to especially thank Mick Watson for his open comments.

## Concluding thoughts

If our results are right, then our current approaches to mRNAseq have some potentially serious problems, especially in the area of isoform expression analysis. Worse, these problems aren’t readily addressable by doing qPCR confirmation or replicate sequencing. I’m not entirely sure what the ramifications are, but it seems like something worth pointing out.

–titus

p.s. Our simulations are fairly simple, BTW. We’ll put the code out there soon for you to play with.

p.p.s. Likit Preeyanon, a graduate student in my lab, was one of the first people in my lab to look really critically at mRNAseq. Watch for his paper soon.

# A Unifying Parsimony Model of Genome Evolution

Benedict Paten, Daniel R. Zerbino, Glenn Hickey, David Haussler
(Submitted on 9 Mar 2013)

The study of molecular evolution rests on the classical fields of population genetics and systematics, but the increasing availability of DNA sequence data has broadened the field in recent decades, leading to new theories and methodologies. These include parsimony and maximum likelihood methods of phylogenetic tree estimation, the theory of genome rearrangements, and the coalescent model with recombination. These all interact in the study of genome evolution, yet to date they have only been pursued in isolation. We present the first unified parsimony framework for the study of genome evolutionary histories that includes all of these aspects, proposing a graphical data structure called a history graph that is intended to form a practical basis for analysis. We define tractable upper and lower bound parsimony cost functions on history graphs that incorporate both substitutions and rearrangements. We demonstrate that these bounds become tight for a special unambiguous type of history graph called an ancestral variation graph (AVG), which captures in its combinatorial structure the operations required in an evolutionary history. For an input history graph G, we demonstrate that there exists a finite set of interpretations of G that contains all minimal (lacking extraneous elements) and most parsimonious AVG interpretations of G. We define a partial order over this set and an associated set of sampling moves that can be used to explore these DNA histories. These results generalise and conceptually simplify the problem so that we can sample evolutionary histories using parsimony cost functions that account for all substitutions and rearrangements in the presence of duplications.

# khmer: Working with Big Data in Bioinformatics

Eric McDonald, C. Titus Brown
(Submitted on 9 Mar 2013)

We introduce design and optimization considerations for the ‘khmer’ package.

Comments: Invited chapter for forthcoming book on Performance of Open Source Applications

# A Model-Based Analysis of GC-Biased Gene Conversion in the Human and Chimpanzee Genomes

John A. Capra, Melissa J. Hubisz, Dennis Kostka, Katherine S. Pollard, Adam Siepel
(Submitted on 9 Mar 2013)

GC-biased gene conversion (gBGC) is a recombination-associated process that favors the fixation of G/C alleles over A/T alleles. In mammals, gBGC is hypothesized to contribute to variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations, but its prevalence and general functional consequences remain poorly understood. gBGC is difficult to incorporate into models of molecular evolution and so far has primarily been studied using summary statistics from genomic comparisons. Here, we introduce a new probabilistic model that captures the joint effects of natural selection and gBGC on nucleotide substitution patterns, while allowing for correlations along the genome in these effects. We implemented our model in a computer program, called phastBias, that can accurately detect gBGC tracts ~1 kilobase or longer in simulated sequence alignments. When applied to real primate genome sequences, phastBias predicts gBGC tracts that cover roughly 0.3% of the human and chimpanzee genomes and account for 1.2% of human-chimpanzee nucleotide differences. These tracts fall in clusters, particularly in subtelomeric regions; they are enriched for recombination hotspots and fast-evolving sequences; and they display an ongoing fixation preference for G and C alleles. We also find some evidence that they contribute to the fixation of deleterious alleles, including an enrichment for disease-associated polymorphisms. These tracts provide a unique window into historical recombination processes along the human and chimpanzee lineages; they supply additional evidence of long-term conservation of megabase-scale recombination rates accompanied by rapid turnover of hotspots. Together, these findings shed new light on the evolutionary, functional, and disease implications of gBGC. The phastBias program and our predicted tracts are freely available.

# RNA-Seq Mapping Errors When Using Incomplete Reference Transcriptomes of Vertebrates

Alexis Black Pyrkosz, Hans Cheng, C. Titus Brown
(Submitted on 11 Mar 2013)

Whole transcriptome sequencing is increasingly being used as a functional genomics tool to study non-model organisms. However, when the reference transcriptome used to calculate differential expression is incomplete, significant error in the inferred expression levels can result. In this study, we use simulated reads generated from real transcriptomes to determine the accuracy of read mapping, and measure the error resulting from using an incomplete transcriptome. We show that the two primary sources of counting error are 1) alternative splice variants that share reads and 2) missing transcripts from the reference. Alternative splice variants increase the false positive rate of mapping while incomplete reference transcriptomes decrease the true positive rate, leading to inaccurate transcript expression levels. Grouping transcripts by gene or read sharing (similar to mapping to a reference genome) significantly decreases false positives, but only by improving the reference transcriptome itself can the missing transcript problem be addressed. We also demonstrate that employing different mapping software does not yield substantial increases in accuracy on simulated data. Finally, we show that read lengths or insert sizes must increase past 1kb to resolve mapping ambiguity.

# Our paper: Soft selective sweeps are the primary mode of recent adaptation in Drosophila melanogaster

This guest post is by Nandita R. Garud, Philipp W. Messer, Erkan O. Buzbas, and Dmitri A. Petrov, on their paper  Soft selective sweeps are the primary mode of recent adaptation in Drosophila melanogaster, arXived here

We typically think of adaptive events as arising from single de novo mutations that sweep through the population one at a time. In this scenario, one expects to observe the signatures of hard selective sweeps, where a single haplotype rises to very high frequencies, removing variation in linked genomic regions. It is also possible, however, that adaptation could lead to signatures of soft sweeps. Soft sweeps are generated by multiple adaptive haplotypes rising in frequency at the same time, either because (i) the adaptive mutation comes from standing variation and thus had time to recombine onto multiple haplotypes, or (ii) multiple de novo mutations arise virtually simultaneously. The second mode is likely in large populations or when the adaptive mutation rate per locus is high.

Soft sweeps have generally been considered a mere curiosity and most scans for adaptation focus on the hard sweep scenario. Despite this prevailing view, the three best-studied cases of adaptation in Drosophila at the loci Ace, CHKov1, and Cyp6g1 all show signatures of soft sweeps. In two cases (Ace and Cyp6g1), soft sweeps were generated by de novo mutations indicating that the population size in D. melanogaster relevant to adaptation is on the order of billions or larger. In one case (CHKov1), soft sweeps arose from standing variation. Surprisingly, we do not have very convincing cases of recent adaptation in Drosophila that generated hard sweeps.

Nevertheless, it remained an open question of whether these three cases were the exception or the norm. They are all related to pesticide or viral resistance and it is entirely possible that much adaptation unrelated to human disturbance or immunity proceeds differently and might generate hard sweeps.

In this paper, we developed two haplotype statistics that allowed us to systematically identify hard and soft sweeps with similar power and then to differentiate them from each other. We applied these statistics to the Drosophila polymorphism data of ~150 fully sequenced, inbred strains available through the Drosophila Genetic Reference Panel (DGRP).
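The two statistics are the haplotype homozygosity measures H12 and H2/H1. A minimal sketch of the standard definitions (an illustration, not the authors' analysis pipeline) shows how they separate hard from soft sweeps:

```python
from collections import Counter

def haplotype_stats(haplotypes):
    """Haplotype homozygosity statistics (standard definitions):
    H1  = sum_i p_i^2                     high after a hard sweep;
    H12 = (p1 + p2)^2 + sum_{i>2} p_i^2   high after hard AND soft sweeps;
    H2/H1, with H2 = H1 - p1^2            elevated after soft sweeps;
    where p_1 >= p_2 >= ... are haplotype frequencies in the sample."""
    n = len(haplotypes)
    p = sorted((c / n for c in Counter(haplotypes).values()), reverse=True)
    h1 = sum(x * x for x in p)
    h12 = h1 + 2 * p[0] * (p[1] if len(p) > 1 else 0.0)  # == (p1+p2)^2 + rest
    h2 = h1 - p[0] ** 2
    return h1, h12, (h2 / h1 if h1 else 0.0)

# Hard-sweep-like sample: one haplotype dominates.
print(haplotype_stats(["A"] * 9 + ["B"]))
# Soft-sweep-like sample: two common haplotypes.
print(haplotype_stats(["A"] * 5 + ["B"] * 4 + ["C"]))
```

Both samples give high H12 (so H12 detects either kind of sweep with similar power), but H2/H1 is near zero when one haplotype dominates and much larger when several haplotypes are common, which is what differentiates the two modes.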

We found abundant signatures of recent and strong sweeps in the Drosophila genome with haplotype structure often extending over tens or even hundreds of kb. However, to our surprise, when we looked at the top 50 peaks, all of them showed signatures of soft sweeps, while we could not convincingly demonstrate the existence of any hard sweeps.

Our results suggest that hard sweeps might be exceedingly rare in Drosophila. Instead, it appears that adaptation in Drosophila primarily proceeds via soft sweeps and thus often involves standing genetic variation or recurrent de novo mutations. There are two caveats, however: One is that we were only able to study strong and recent adaptation. Such strong adaptation should “feel” recent population sizes that are close to the census size, whereas it should be insensitive to bottlenecks that have occurred in the distant past. Weaker adaptation, on the other hand, might take longer and thus would be sensitive to ancient bottlenecks or interference from other sweeps. Whether weak adaptation thus proceeds via hard sweeps remains to be seen. The second caveat is that much of adaptation might involve sweeps that are so soft and move so many haplotypes up in frequency that we cannot detect them. Similarly, adaptation could often be polygenic involving very subtle shifts in allele frequency at many loci. These modes would hardly leave any signatures of sweeps at all. Whichever way it is, it is becoming increasingly clear that adaptation in Drosophila and many other organisms is likely to be much more complex, much more common, and in many ways a much more turbulent process than we usually tend to think.

# Minimal clade size in the Bolthausen-Sznitman coalescent

Fabian Freund, Arno Siri-Jégousse
(Submitted on 14 Jan 2013 (v1), last revised 6 Mar 2013 (this version, v2))

This article shows the asymptotics of the distribution and moments of the size $X_n$ of the minimal clade of a randomly chosen individual in a Bolthausen-Sznitman $n$-coalescent for $n\to\infty$. The Bolthausen-Sznitman $n$-coalescent is a Markov process taking states in the set of partitions of $\left\{1,\ldots,n\right\}$, where $1,\ldots,n$ are referred to as individuals. The minimal clade of an individual is the equivalence class the individual is in at the time of the first coalescence event this individual participates in.

The main tool used is the connection of the Bolthausen-Sznitman $n$-coalescent with random recursive trees introduced by Goldschmidt and Martin (see \cite{goldschmidtmartin}). This connection shows that $X_n-1$ is distributed as the number $M_n$ of all individuals not in the equivalence class of individual 1 shortly before the time of the last coalescence event. Both functionals are distributed like the size $RT_{n-1}$ of a uniformly chosen table in a standard Chinese restaurant process with $n-1$ customers. We give exact formulae for these distributions.

Using the asymptotics of $M_n$ shown by Goldschmidt and Martin in \cite{goldschmidtmartin}, we see that $(\log n)^{-1}\log X_n$ converges in distribution to the uniform distribution on $[0,1]$ for $n\to\infty$. We provide the complementary information that $\frac{\log n}{n^k}E(X_n^k)\to \frac{1}{k}$ for $n\to\infty$, which is also true for $M_n$ and $RT_n$.
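The Chinese restaurant connection makes the limit law easy to probe numerically. A quick simulation sketch (the usual $\theta = 1$ seating rule, with a loose numerical check; not code from the paper) illustrates that $\log(\text{table size})/\log n$ is spread roughly uniformly over $[0,1]$:

```python
import math
import random

def crp_table_sizes(n, rng):
    """Seat n customers by the standard (theta = 1) Chinese restaurant
    rule: customer k joins an occupied table of size b with probability
    b/k and opens a new table with probability 1/k."""
    sizes = [1]        # customer 1 opens the first table
    table_of = [0]     # table index of each seated customer
    for k in range(2, n + 1):
        r = rng.randrange(k)
        if r == k - 1:                 # probability 1/k: open a new table
            table_of.append(len(sizes))
            sizes.append(1)
        else:                          # join the table of seated customer r,
            t = table_of[r]            # i.e. a table of size b with prob b/k
            table_of.append(t)
            sizes[t] += 1
    return sizes

# Size of a uniformly chosen table, rescaled as log(size)/log(n); the
# limit law says this is approximately uniform on [0, 1], so its mean
# should sit near 1/2.
rng = random.Random(42)
n = 2000
vals = [math.log(rng.choice(crp_table_sizes(n, rng))) / math.log(n)
        for _ in range(300)]
mean = sum(vals) / len(vals)
print(f"mean of log(size)/log(n): {mean:.3f}")
```

The simulated mean hovering near 1/2 is consistent with the uniform limit for $(\log n)^{-1}\log X_n$ stated in the abstract, though of course it is only a sanity check, not a proof.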

# The consequences of gene flow for local adaptation and differentiation: A two-locus two-deme model

We consider a population subdivided into two demes connected by migration, in which selection acts in opposite directions. We explore the effects of recombination and migration on the maintenance of multilocus polymorphism, on local adaptation, and on differentiation by employing a deterministic model with genic selection on two linked diallelic loci (i.e., no dominance or epistasis). For the following cases, we characterize explicitly the possible equilibrium configurations: weak, strong, highly asymmetric, and super-symmetric migration; no or weak recombination; and independent or strongly recombining loci. For independent loci (linkage equilibrium) and for completely linked loci, we derive the possible bifurcation patterns as functions of the total migration rate, assuming all other parameters are fixed but arbitrary. For these and other cases, we determine analytically the maximum migration rate below which a stable fully polymorphic equilibrium exists. In this case, differentiation and local adaptation are maintained. Their degree is quantified by a new multilocus version of $F_{ST}$ and by the migration load, respectively. In addition, we investigate the invasion conditions of locally beneficial mutants and show that linkage to a locus that is already in migration-selection balance facilitates invasion. Hence, if linkage is sufficiently tight, loci of much smaller effect can invade than predicted by one-locus theory. We study how this minimum amount of linkage admitting invasion depends on the migration pattern. This suggests the emergence of clusters of locally beneficial mutations, which may form ‘genomic islands of divergence’. Finally, the influence of linkage and two-way migration on the effective migration rate at a linked neutral locus is explored. Numerical work complements our analytical results.
We consider a population subdivided into two demes connected by migration in which selection acts in opposite direction. We explore the effects of recombination and migration on the maintenance of multilocus polymorphism, on local adaptation, and on differentiation by employing a deterministic model with genic selection on two linked diallelic loci (i.e., no dominance or epistasis). For the following cases, we characterize explicitly the possible equilibrium configurations: weak, strong, highly asymmetric, and super-symmetric migration, no or weak recombination, and independent or strongly recombining loci. For independent loci (linkage equilibrium) and for completely linked loci, we derive the possible bifurcation patterns as functions of the total migration rate, assuming all other parameters are fixed but arbitrary. For these and other cases, we determine analytically the maximum migration rate below which a stable fully polymorphic equilibrium exists. In this case, differentiation and local adaptation are maintained. Their degree is quantified by a new multilocus version of $\Fst$ and by the migration load, respectively. In addition, we investigate the invasion conditions of locally beneficial mutants and show that linkage to a locus that is already in migration-selection balance facilitates invasion. Hence, loci of much smaller effect can invade than predicted by one-locus theory if linkage is sufficiently tight. We study how this minimum amount of linkage admitting invasion depends on the migration pattern. This suggests the emergence of clusters of locally beneficial mutations, which may form `genomic islands of divergence’. Finally, the influence of linkage and two-way migration on the effective migration rate at a linked neutral locus is explored. Numerical work complements our analytical results.