The budding yeast Saccharomyces cerevisiae is important for human food production and as a model organism for biological research. The genetic diversity contained in the global population of yeast strains represents a valuable resource for a number of fields, including genetics, bioengineering, and studies of evolution and population structure. Here, we apply a multiplexed, reduced genome sequencing strategy (known as RAD-seq) to genotype a large collection of S. cerevisiae strains, isolated from a wide range of geographical locations and environmental niches. The method permits the sequencing of the same 1% of all genomes, producing a multiple sequence alignment of 116,880 bases across 262 strains. We find diversity among these strains is principally organized by geography, with European, North American, Asian and African/S. E. Asian populations defining the major axes of genetic variation. At a finer scale, small groups of strains from cacao, olives and sake are defined by unique variants not present in other strains. One population, containing strains from a variety of fermentations, exhibits high levels of heterozygosity and mixtures of alleles from European and Asian populations, indicating an admixed origin for this group. In the context of this global diversity, we demonstrate that a collection of seven strains commonly used in the laboratory encompasses only one quarter of the genetic diversity present in the full collection of strains, underscoring the relatively limited genetic diversity captured by the current set of lab strains. We propose a model of geographic differentiation followed by human-associated admixture, primarily between European and Asian populations and more recently between European and North American populations. The large collection of genotyped yeast strains characterized here will provide a useful resource for the broad community of yeast researchers.
Summary: BWA-MEM is a new alignment algorithm for aligning sequence reads or long query sequences against a large reference genome such as human. It automatically chooses between local and end-to-end alignments, supports paired-end reads and performs split alignment. The algorithm is robust to sequencing errors and applicable to a wide range of sequence lengths from 70bp to a few megabases. For short-read mapping, BWA-MEM shows better performance than several state-of-art read aligners to date.
Availability and implementation: BWA-MEM is implemented as a component of BWA, which is available at this http URL
Background Despite its status as a model organism, the development of Caenorhabditis elegans is not necessarily archetypical for nematodes. The phylum Nematoda is divided into the Chromadorea (indcludes C. elegans) and the Enoplea. Compared to C. elegans, enoplean nematodes have very different patterns of cell division and determination. Embryogenesis of the enoplean Romanomermis culicivorax has been studied in great detail, but the genetic circuitry underpinning development in this species is unknown. Results We created a draft genome of R. culicivorax and compared its developmental gene content with that of two nematodes, C. elegans and Trichinella spiralis (another enoplean), and a representative arthropod Tribolium castaneum. This genome evidence shows that R. culicivorax retains components of the conserved metazoan developmental toolkit lost in C. elegans. T. spiralis has independently lost even more of the toolkit than has C. elegans. However, the C. elegans toolkit is not simply depauperate, as many genes essential for embryogenesis in C. elegans are unique to this lineage, or have only extremely divergent homologues in R. culicivorax and T. spiralis. These data imply fundamental differences in the genetic programmes for early cell specification, inductive interactions, vulva formation and sex determination. Conclusions Thus nematodes, despite their apparent phylum-wide morphological conservatism, have evolved major differences in the molecular logic of their development. R. culicivorax serves as a tractable, contrasting model to C. elegans for understanding how divergent genomic and thus regulatory backgrounds can generate a conserved phenotype. The availability of the draft genome will promote use of R. culicivorax as a research model.
A population’s adaptive potential is the likelihood that it will adapt in response to an environmental challenge, e.g., develop resistance in response to drug treatment. The effective population size inferred from genetic diversity at neutral sites has been traditionally taken as a major predictor of adaptive potential. However recent studies demonstrate that such effective population size vastly underestimates the population’s adaptive potential (Karasov 2010).
Here we use data from treated HIV-infected patients (Bacheler2000) to estimate the effective size of HIV populations relevant for adaptation. Our estimate is based on the frequencies of soft and hard selective sweeps of a known resistance mutation K103N. We observe that 41% of HIV populations in this study acquire resistance via at least two functionally equivalent but distinct mutations which sweep to fixation without significantly reducing genetic diversity at neighboring sites (soft selective sweeps). We further estimate that 20% of populations acquire a resistant allele via a single mutation that sweeps to fixation and drastically reduces genetic diversity (hard selective sweeps). We infer that the effective population size that determines the adaptive potential of within-patient HIV populations is approximately 150,000. Our estimate is two orders of magniture higher than a classical estimate based on diversity at synonymous sites.
Three not mutually exclusive reasons can explain this discrepancy:
(1) some synonymous mutations may be under selection;
(2) highly beneficial mutations may be less affected by ongoing linked selection than synonymous mutations; and
(3) synonymous diversity may not be at its expected equilibrium because it recovers slowly from sweeps and bottlenecks.
The tremdendous advances in high-throughput sequencing technologies have made population-scale sequencing as performed in the 1000 Genomes project and the Genome of the Netherlands project possible. Next-generation sequencing has allowed genom-wide discovery of variations beyond single-nucleotide polymorphisms (SNPs), in particular of structural variations (SVs) like deletions, insertions, duplications, translocations, inversions, and even more complex rearrangements. Here, we design a read aligner with special emphasis on the following properties: (1) high sensitivity, i.e. find all (reasonable) alignments; (2) ability to find (long) indels; (3) statistically sound alignment scores; and (4) runtime fast enough to be applied to whole genome data. We compare performance to BWA, bowtie2, stampy and find that our methods is especially advantageous on reads containing larger indels.
I decided to post the paper — although we haven’t submitted it yet — because I wanted to solicit feedback and gauge where the disagreements or misunderstandings were likely to pop up. I suspect reviewers are going to have a hate-hate relationship with this paper, too, so I want to get rid of any obvious mistakes by crowdsourcing some early reviews ;).
Before I talk about the paper, let me mention a few related factoids that crossed my ‘net firehose and that tie into the paper; I’ll weave them into the paper discussion below.
The paper’s genesis is this: for years I’d been explaining to collaborators that mRNAseq was unreliable on any organism without a pretty good reference transcriptome, and that constructing the reference transcriptome relied on either good mRNAseq assembly tools OR on having a good reference genome. So that was what my lab was working on doing with their data.
Since we tend to work on weird organisms like chick (which has an incomplete genome and a notoriously poor gene annotation), lamprey (see above — missing 20-30% of its genome), and Molgulid ascidians (for which we are constructing the genome), we needed to go the de novo mRNAseq assembly route.
However, we also quickly realized that there were a few problems with de novo mRNAseq assembly and expression evaluation: lots of mRNAseq needed to be crunched in order to get good reconstruction, which was computationally intensive (see: diginorm); splice variants would confuse mapping (see: this paper); we were uncertain of how well splice variants could be assembled with or without a reference (see: this paper); and we weren’t sure how to evaluate mapping tools.
When a new postdoc (Alexis Black Pyrkosz, the first author) joined the lab and wanted a fairly simple project to get started, I suggested that she evaluate the extent to which an incomplete reference transcriptome would screw things up, and perhaps try out a few different mappers. This took her about a year to work through, in addition to all of her other projects. She built a straightforward simulation system, tried it out on chick and mouse, and got some purely computational results that place (in my view) fundamental limits on what you can accomplish with certainty using current mRNAseq technology.
Incidentally, one piece of feedback she got at GLBio from a prof was (paraphrased) "I hope this isn’t your real project, because it’s not very interesting."
The results, in short, are:
What mapper you use doesn’t really matter, except to within a few percent; they all perform fine, and they all degrade fairly fast with sequencing errors.
Incomplete reference transcriptomes matter, a lot. There are two entirely obvious reasons: if you have a splice variant A that is in your reference but not present in your mRNAseq, and a splice variant B that is not in your reference but is actually transcribed and in your mRNAseq, the reads for B will get mapped to the wrong transcript; and (the even more obvious one) you can’t measure the expression of something that’s not in your reference via mapping.
The SASeq paper does a nice job of pointing out that there’s something rather seriously wrong with current mRNAseq references, even in mouse, and they provide a way to minimize misestimation in the context of the reference.
Direct splice variant reconstruction and measurement is, technically, impossible for about 30% of the transcripts in mouse. For standard paired-end sequencing, it turns out that you cannot claim that exon A and exon Z are present in the same isoform for about 30% of the isoforms.
The slightly surprising conclusion that we reached from this is that mRNAseq assembly is also, generally speaking, impossible: you cannot unambiguously construct a reasonably large proportion of observed isoforms via assembly, since the information to connect the exons is not there.
Until recently, I had held out a forlorn hope that Clever Things were being done with coverage. Then I saw Dan Zerbino’s e-mail, point A, above.
And yet, judging by the Oases and Trinity publications, assembly works! What’s up?
There’s something a bit weird going on. Tools like Oases and Trinity can clearly construct a fair proportion of previously observed transcripts, even though the information to do so from direct observation isn’t there and they can’t necessarily use coverage inference reliably. My guess (see paper D, above) is that this is because biology is mostly cooperating with us by giving us one dominant isoform in many or most circumstances; this matches what Joe Pickrell et al. observed in their truly excellentnoisy splicing paper. But I’d be interested in hearing alternate theories.
At this point, my friend and colleague Erich Schwarz tends to get unhappy with me and say "what would you have me do with my mRNAseq, then? Ignore it until you come up with a solution, which you claim is impossible anyway?" Good question! My answer is (a) "explore the extent to which we can place error bars or uncertainty on isoform abundance calculations", (b) "figure out where interesting isoform misestimation is likely to lie in the data sets", and (c) "look at exon-exon junctions and exon presence/absence instead of whole isoform abundance." But the tools to do this are still rather immature, I think, and people mostly ignore the issue or hope it doesn’t bugger up their results. (Please correct me if I’m wrong – I’d love pointers!)
In my lab, we are starting to explore ways to determine what mis- or un-assembled isoforms there might be in a given transcriptome. We’re also looking at non-reference-based ways of doing mRNAseq quantification and differential expression (technically, graph-based methods for mRNAseq). We are also deeply skeptical of many of the normalization approaches being used, largely because every time we evaluate them in the context of our actual data, our data seems to violate a number of their assumptions… Watch This Space.
What’s the reaction to the paper been, so far?
Well, on Twitter, I’ve been getting a fair bit of "ahh, a good reference genome will solve your problems!" But I think points #3 and #4 above stand. Plus, invoking solving a harder and more expensive problem to solve what you think is a simpler problem is an interesting approach :). And since we don’t have good references (and won’t for a while) it’s not really a solution for us.
I’ve also been getting "this is a well known problem and not worth publishing!" from a few people. Well, OK, fair enough. I’ve been skimming papers with an eye to this for a while, but it’s entirely possible I’ve missed this part of the literature. I’d love to read and cite such a paper in this one (and even rely on it to make our points, if it truly has significant overlap). Please post links in the comments, I’d really appreciate it!
It is clear that the paper needs some reshaping in light of some of the comments, and I’d like to especially thank Mick Watson for his open comments.
If our results are right, then our current approaches to mRNAseq have some potentially serious problems, especially in the area of isoform expression analysis. Worse, these problems aren’t readily addressible by doing qPCR confirmation or replicate sequencing. I’m not entirely sure what the ramifications are but it seems like a worthwhile thing that someone should point out.
p.s. Our simulations are fairly simple, BTW. We’ll put the code out there soon for you to play with.
p.p.s. Likit Preeyanon, a graduate student in my lab, was one of the first people in my lab to look really critically at mRNAseq. Watch for his paper soon.
The study of molecular evolution rests on the classical fields of population genetics and systematics, but the increasing availability of DNA sequence data has broadened the field in the last decades, leading to new theories and methodologies. This includes parsimony and maximum likelihood methods of phylogenetic tree estimation, the theory of genome rearrangements, and the coalescent model with recombination. These all interact in the study of genome evolution, yet to date they have only been pursued in isolation. We present the first unified parsimony framework for the study of genome evolutionary histories that includes all of these aspects, proposing a graphical data structure called a history graph that is intended to form a practical basis for analysis. We define tractable upper and lower bound parsimony cost functions on history graphs that incorporate both substitutions and rearrangements. We demonstrate that these bounds become tight for a special unambiguous type of history graph called an ancestral variation graph (AVG), which captures in its combinatorial structure the operations required in an evolutionary history. For an input history graph G, we demonstrate that there exists a finite set of interpretations of G that contains all minimal (lacking extraneous elements) and most parsimonious AVG interpretations of G. We define a partial order over this set and an associated set of sampling moves that can be used to explore these DNA histories. These results generalise and conceptually simplify the problem so that we can sample evolutionary histories using parsimony cost functions that account for all substitutions and rearrangements in the presence of duplications.