Genomic Sequence Diversity and Population Structure of Saccharomyces cerevisiae Assessed by RAD-seq

Gareth A. Cromie, Katie E. Hyma, Catherine L. Ludlow, Cecilia Garmendia-Torres, Teresa L. Gilbert, Patrick May, Angela A. Huang, Aimée M. Dudley, Justin C. Fay
(Submitted on 20 Mar 2013)

The budding yeast Saccharomyces cerevisiae is important for human food production and as a model organism for biological research. The genetic diversity contained in the global population of yeast strains represents a valuable resource for a number of fields, including genetics, bioengineering, and studies of evolution and population structure. Here, we apply a multiplexed, reduced genome sequencing strategy (known as RAD-seq) to genotype a large collection of S. cerevisiae strains, isolated from a wide range of geographical locations and environmental niches. The method permits the sequencing of the same 1% of all genomes, producing a multiple sequence alignment of 116,880 bases across 262 strains. We find diversity among these strains is principally organized by geography, with European, North American, Asian and African/S. E. Asian populations defining the major axes of genetic variation. At a finer scale, small groups of strains from cacao, olives and sake are defined by unique variants not present in other strains. One population, containing strains from a variety of fermentations, exhibits high levels of heterozygosity and mixtures of alleles from European and Asian populations, indicating an admixed origin for this group. In the context of this global diversity, we demonstrate that a collection of seven strains commonly used in the laboratory encompasses only one quarter of the genetic diversity present in the full collection of strains, underscoring the relatively limited genetic diversity captured by the current set of lab strains. We propose a model of geographic differentiation followed by human-associated admixture, primarily between European and Asian populations and more recently between European and North American populations. The large collection of genotyped yeast strains characterized here will provide a useful resource for the broad community of yeast researchers.

A Unifying Parsimony Model of Genome Evolution

Benedict Paten, Daniel R. Zerbino, Glenn Hickey, David Haussler
(Submitted on 9 Mar 2013)

The study of molecular evolution rests on the classical fields of population genetics and systematics, but the increasing availability of DNA sequence data has broadened the field in the last decades, leading to new theories and methodologies. This includes parsimony and maximum likelihood methods of phylogenetic tree estimation, the theory of genome rearrangements, and the coalescent model with recombination. These all interact in the study of genome evolution, yet to date they have only been pursued in isolation. We present the first unified parsimony framework for the study of genome evolutionary histories that includes all of these aspects, proposing a graphical data structure called a history graph that is intended to form a practical basis for analysis. We define tractable upper and lower bound parsimony cost functions on history graphs that incorporate both substitutions and rearrangements. We demonstrate that these bounds become tight for a special unambiguous type of history graph called an ancestral variation graph (AVG), which captures in its combinatorial structure the operations required in an evolutionary history. For an input history graph G, we demonstrate that there exists a finite set of interpretations of G that contains all minimal (lacking extraneous elements) and most parsimonious AVG interpretations of G. We define a partial order over this set and an associated set of sampling moves that can be used to explore these DNA histories. These results generalise and conceptually simplify the problem so that we can sample evolutionary histories using parsimony cost functions that account for all substitutions and rearrangements in the presence of duplications.

khmer: Working with Big Data in Bioinformatics

Eric McDonald, C. Titus Brown
(Submitted on 9 Mar 2013)

We introduce design and optimization considerations for the ‘khmer’ package.

Comments: Invited chapter for forthcoming book on Performance of Open Source Applications

A Model-Based Analysis of GC-Biased Gene Conversion in the Human and Chimpanzee Genomes

John A. Capra, Melissa J. Hubisz, Dennis Kostka, Katherine S. Pollard, Adam Siepel
(Submitted on 9 Mar 2013)

GC-biased gene conversion (gBGC) is a recombination-associated process that favors the fixation of G/C alleles over A/T alleles. In mammals, gBGC is hypothesized to contribute to variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations, but its prevalence and general functional consequences remain poorly understood. gBGC is difficult to incorporate into models of molecular evolution and so far has primarily been studied using summary statistics from genomic comparisons. Here, we introduce a new probabilistic model that captures the joint effects of natural selection and gBGC on nucleotide substitution patterns, while allowing for correlations along the genome in these effects. We implemented our model in a computer program, called phastBias, that can accurately detect gBGC tracts ~1 kilobase or longer in simulated sequence alignments. When applied to real primate genome sequences, phastBias predicts gBGC tracts that cover roughly 0.3% of the human and chimpanzee genomes and account for 1.2% of human-chimpanzee nucleotide differences. These tracts fall in clusters, particularly in subtelomeric regions; they are enriched for recombination hotspots and fast-evolving sequences; and they display an ongoing fixation preference for G and C alleles. We also find some evidence that they contribute to the fixation of deleterious alleles, including an enrichment for disease-associated polymorphisms. These tracts provide a unique window into historical recombination processes along the human and chimpanzee lineages; they supply additional evidence of long-term conservation of megabase-scale recombination rates accompanied by rapid turnover of hotspots. Together, these findings shed new light on the evolutionary, functional, and disease implications of gBGC. The phastBias program and our predicted tracts are freely available.

RNA-Seq Mapping Errors When Using Incomplete Reference Transcriptomes of Vertebrates

Alexis Black Pyrkosz, Hans Cheng, C. Titus Brown
(Submitted on 11 Mar 2013)

Whole transcriptome sequencing is increasingly being used as a functional genomics tool to study non-model organisms. However, when the reference transcriptome used to calculate differential expression is incomplete, significant error in the inferred expression levels can result. In this study, we use simulated reads generated from real transcriptomes to determine the accuracy of read mapping, and measure the error resulting from using an incomplete transcriptome. We show that the two primary sources of counting error are 1) alternative splice variants that share reads and 2) missing transcripts from the reference. Alternative splice variants increase the false positive rate of mapping while incomplete reference transcriptomes decrease the true positive rate, leading to inaccurate transcript expression levels. Grouping transcripts by gene or read sharing (similar to mapping to a reference genome) significantly decreases false positives, but only by improving the reference transcriptome itself can the missing transcript problem be addressed. We also demonstrate that employing different mapping software does not yield substantial increases in accuracy on simulated data. Finally, we show that read lengths or insert sizes must increase past 1 kb to resolve mapping ambiguity.
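The two error sources named in this abstract can be illustrated with a toy tally. The sketch below is not the authors' pipeline: the function name `tally_mapping`, the read and transcript labels, and the per-read categories (correct, mismapped, unmapped) are all assumptions made for illustration, and the paper's own definitions of true and false positives may differ.

```python
from collections import Counter

def tally_mapping(true_origin, mapped_to, group=lambda t: t):
    """Classify each simulated read by comparing its true transcript with
    the transcript the mapper reported (None means the read was unmapped).
    `group` collapses transcript names before comparison, e.g. to gene level."""
    counts = Counter()
    for read_id, truth in true_origin.items():
        hit = mapped_to.get(read_id)
        if hit is None:
            counts["unmapped"] += 1
        elif group(hit) == group(truth):
            counts["correct"] += 1
        else:
            counts["mismapped"] += 1
    return counts

# Hypothetical data: gene A has two splice variants (A.1, A.2) that share
# reads, and transcript B.1 is missing from the reference.
true_origin = {"r1": "A.1", "r2": "A.1", "r3": "A.2", "r4": "B.1"}
mapped_to = {"r1": "A.1", "r2": "A.2", "r3": "A.2", "r4": None}

transcript_level = tally_mapping(true_origin, mapped_to)
gene_level = tally_mapping(true_origin, mapped_to, group=lambda t: t.split(".")[0])
```

In this made-up example, grouping by gene turns the read mismapped between splice variants into a correct assignment, while the read from the missing transcript stays unmapped at either level. That mirrors the abstract's point that grouping helps with shared reads but not with an incomplete reference.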

Minimal clade size in the Bolthausen-Sznitman coalescent

Fabian Freund, Arno Siri-Jégousse
(Submitted on 14 Jan 2013 (v1), last revised 6 Mar 2013 (this version, v2))

This article shows the asymptotics of distribution and moments of the size $X_n$ of the minimal clade of a randomly chosen individual in a Bolthausen-Sznitman $n$-coalescent for $n\to\infty$. The Bolthausen-Sznitman $n$-coalescent is a Markov process taking states in the set of partitions of $\left\{1,\ldots,n\right\}$, where $1,\ldots,n$ are referred to as individuals. The minimal clade of an individual is the equivalence class the individual is in at the time of the first coalescence event this individual participates in. The main tool used is the connection of the Bolthausen-Sznitman $n$-coalescent with random recursive trees introduced by Goldschmidt and Martin. This connection shows that $X_n-1$ is distributed as the number $M_n$ of all individuals not in the equivalence class of individual 1 shortly before the time of the last coalescence event. Both functionals are distributed like the size $RT_{n-1}$ of a uniformly chosen table in a standard Chinese restaurant process with $n-1$ customers. We give exact formulae for these distributions. Using the asymptotics of $M_n$ shown by Goldschmidt and Martin, we see that $(\log n)^{-1}\log X_n$ converges in distribution to the uniform distribution on $[0,1]$ for $n\to\infty$. We provide the complementary information that $\frac{\log n}{n^k}E(X_n^k)\to \frac{1}{k}$ for $n\to\infty$, which is also true for $M_n$ and $RT_n$.
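For readers who want to get a feel for the table-size functional mentioned in this abstract, here is a minimal simulation sketch. It assumes that "standard" means the one-parameter Chinese restaurant process with parameter 1, where the next customer opens a new table with probability 1/(k+1) when k customers are seated and otherwise joins a table with probability proportional to its size. The function names are hypothetical and nothing here is taken from the paper itself.

```python
import random

def crp_table_sizes(num_customers, rng):
    """Seat customers one at a time and return the list of table sizes."""
    sizes = []
    for seated in range(num_customers):
        # Draw u uniformly from 0..seated. The value u == seated
        # (probability 1 / (seated + 1)) opens a new table; any other value
        # points at an already seated customer, whose table is joined, so a
        # table is chosen with probability proportional to its size.
        u = rng.randrange(seated + 1)
        if u == seated:
            sizes.append(1)
            continue
        cumulative = 0
        for i, size in enumerate(sizes):
            cumulative += size
            if u < cumulative:
                sizes[i] += 1
                break
    return sizes

def uniform_table_size(num_customers, rng):
    """Size of a table picked uniformly at random among the occupied tables."""
    return rng.choice(crp_table_sizes(num_customers, rng))

# Example: draw a few table sizes for 1000 customers with a fixed seed.
rng = random.Random(2013)
samples = [uniform_table_size(1000, rng) for _ in range(20)]
```

Plotting a histogram of `math.log(size) / math.log(n)` over many draws is one way to compare a simulation against the uniform limit stated in the abstract, although convergence on a logarithmic scale can be slow.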

The consequences of gene flow for local adaptation and differentiation: A two-locus two-deme model

Ada Akerman, Reinhard Bürger
(Submitted on 6 Mar 2013)

We consider a population subdivided into two demes connected by migration in which selection acts in opposite directions. We explore the effects of recombination and migration on the maintenance of multilocus polymorphism, on local adaptation, and on differentiation by employing a deterministic model with genic selection on two linked diallelic loci (i.e., no dominance or epistasis). For the following cases, we characterize explicitly the possible equilibrium configurations: weak, strong, highly asymmetric, and super-symmetric migration, no or weak recombination, and independent or strongly recombining loci. For independent loci (linkage equilibrium) and for completely linked loci, we derive the possible bifurcation patterns as functions of the total migration rate, assuming all other parameters are fixed but arbitrary. For these and other cases, we determine analytically the maximum migration rate below which a stable fully polymorphic equilibrium exists. In this case, differentiation and local adaptation are maintained. Their degree is quantified by a new multilocus version of $F_{ST}$ and by the migration load, respectively. In addition, we investigate the invasion conditions of locally beneficial mutants and show that linkage to a locus that is already in migration-selection balance facilitates invasion. Hence, loci of much smaller effect can invade than predicted by one-locus theory if linkage is sufficiently tight. We study how this minimum amount of linkage admitting invasion depends on the migration pattern. This suggests the emergence of clusters of locally beneficial mutations, which may form ‘genomic islands of divergence’. Finally, the influence of linkage and two-way migration on the effective migration rate at a linked neutral locus is explored. Numerical work complements our analytical results.

Gene expression in early Drosophila embryos is highly conserved despite extensive divergence of transcription factor binding

Mathilde Paris, Tommy Kaplan, Xiao Yong Li, Jacqueline E. Villalta, Susan E. Lott, Michael B. Eisen
(Submitted on 1 Mar 2013)

To better characterize how variation in regulatory sequences drives divergence in gene expression, we undertook a systematic study of transcription factor binding and gene expression in the blastoderm embryos of four species that sample much of the diversity in the 60-million-year-old genus Drosophila: D. melanogaster, D. yakuba, D. pseudoobscura and D. virilis. We compared gene expression, as measured by mRNA-seq, to the genome-wide binding of four transcription factors involved in early development, as measured by ChIP-seq (Bicoid, Giant, Hunchback and Krüppel). Surprisingly, we found that mRNA levels are much better conserved than individual binding events. We looked at binding characteristics that may explain such evolutionary disparity. As expected, we found that binding divergence increases with phylogenetic distance. Interestingly, binding events in non-coding regions that were bound strongly by single factors, or bound by multiple factors, were more likely to be conserved. As this class of sites is most likely to be involved in gene regulation, the divergence of other bound regions may simply reflect their lack of function. We used a model of quantitative trait evolution to compare the changes of gene expression with nearby regulatory TF binding. We found that changes in gene expression were poorly explained by changes in associated TF binding. These results suggest that some of the differences in sequence and binding have limited effect on gene expression or act in a compensatory manner to maintain the overall expression levels of regulated genes.

A two-fold advantage of sex

Su-Chan Park, Joachim Krug
(Submitted on 27 Feb 2013)

The adaptation of large asexual populations is hampered by the competition between independently arising beneficial mutations in different individuals, which is known as clonal interference. In classic work, Fisher and Muller proposed that recombination provides an evolutionary advantage in large populations by alleviating this competition. Based on recent progress in quantifying the speed of adaptation in asexual populations undergoing clonal interference, we present a detailed analysis of the Fisher-Muller mechanism for a model genome consisting of two loci with an infinite number of beneficial alleles each and multiplicative (non-epistatic) fitness effects. We solve the deterministic, infinite population dynamics exactly and show that, for a particular, natural mutation scheme, the speed of adaptation in sexuals is twice as large as in asexuals. This result is argued to hold for any nonzero value of the rate of recombination. Guided by the infinite population result and by previous work on asexual adaptation, we postulate an expression for the speed of adaptation in finite sexual populations that agrees with numerical simulations over a wide range of population sizes and recombination rates. The ratio of the sexual to asexual adaptation speed is a function of population size that increases in the clonal interference regime and approaches 2 for extremely large populations. The simulations also show that recombination leads to a strong equalization of the number of fixed mutations in the two loci. The generalization of the model to an arbitrary number $L$ of loci is briefly discussed. For a particular communal recombination scheme, the ratio of the sexual to asexual adaptation speed is approximately equal to $L$ in large populations.

Our paper: Sequencing mRNA from cryo-sliced Drosophila embryos to determine genome-wide spatial patterns of gene expression.

Our next guest post is by Mike Eisen [@mbeisen] on his paper with Peter Combs [@rflrob]
Peter A. Combs and Michael B. Eisen (2013). Sequencing mRNA from cryo-sliced Drosophila embryos to determine genome-wide spatial patterns of gene expression. arXived here.

This is cross posted from Mike’s blog.

It’s no secret to people who read this blog that I hate the way scientific publishing works today. Most of my efforts in this domain have focused on removing barriers to the access and reuse of published papers. But there are other things that are broken with the way scientists communicate with each other, and chief amongst them is pre-publication peer review. I’ve written about this before, and won’t rehash the arguments here, save to say that I think we should publish first, and then review. But one could argue that I haven’t really practiced what I preach, as all of my lab’s papers have gone through peer review before they were published.

No more. From now on we are going to post all of our papers online when we feel they’re ready to share – before they go to a journal. We’ll then solicit comments from our colleagues and use them to improve the work prior to formal publication. Physicists and mathematicians have been doing this for decades, as have an increasing number of biologists. It’s time for this to become standard practice.

Some ground rules. I will not filter comments except to remove obvious spam. You are welcome to post comments under your name or under a pseudonym – I will not reveal anyone’s identity – but I urge you to use your real name as I think we should have fully open peer review in science.

OK. Now for the paper, which is posted on arXiv, where it can be linked to and cited. We also have a copy here, in case you’re having trouble with figures on arXiv.

Peter A. Combs and Michael B. Eisen (2013). Sequencing mRNA from cryo-sliced Drosophila embryos to determine genome-wide spatial patterns of gene expression.

Several years ago a postdoc in my lab, Susan Lott (now at UC Davis), developed methods to sequence the RNAs from single Drosophila embryos. She was interested in looking at expression differences between males and females in early embryogenesis, and published a beautiful paper on that topic.

Although we were initially worried that we wouldn’t be able to get enough RNA from single embryos to get reliable sequencing results, it turns out we got more than enough. Each embryo yielded around 100 ng of total RNA, and we would end up loading only ~10% of the sample onto the sequencer. So it occurred to us that maybe we could work with material from pieces of individual embryos and thereby get spatial expression information on a genomic scale in a single quick experiment – an alternative to highly informative but slow imaging-based methods.

I recruited a new biophysics student, Peter Combs, to work on slicing embryos with a microtome along the anterior-posterior axis and sequencing each of the sections to identify genes with patterned expression along the A-P axis. In typical PI fashion, I figured this would take a few weeks, but it ended up taking over a year to get right.

The major challenge was that, while a tenth of an embryo contains more than enough RNA to analyze by mRNA-seq, it turned out to be very difficult to shepherd that RNA successfully from a single cryosection to the sequencer. Peter was routinely failing to recover RNA and make libraries from these samples using methods that worked great for whole embryos. While there are various protocols out there claiming to analyze RNA from single cells, we were reluctant to use these amplification-based strategies.

The typical way people deal with loss of small quantities of nucleic acids during experimental manipulation is to add carrier RNA or DNA – something like tRNA or salmon sperm DNA. We didn’t want to do that, since we would just end up with tons of useless sequencing reads. So we came up with a different strategy – adding embryos from distantly related Drosophila species to each slice at an early stage in the process. This brought the total amount of RNA in each sample well above the threshold where our purification and library preparation worked robustly, and we could easily separate the D. melanogaster RNA we were interested in for this experiment from that of the “carrier” embryo. And we could avoid wasting sequencing reads by turning the carrier RNAs into an experiment of their own – in this case looking at expression variation between species.
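To make the separation step concrete, here is a minimal sketch of one way reads from a mixed sample could be assigned to a species after alignment. The post does not describe the actual procedure, so everything in it is an assumption for illustration: the function name `assign_species`, the idea of comparing each read's best alignment score per species, the `min_margin` threshold, and the example scores are all hypothetical.

```python
from collections import Counter

def assign_species(score_by_species, min_margin=5):
    """Assign a read to the species with the best alignment score, provided
    it beats the runner-up by at least `min_margin`; otherwise the read is
    called ambiguous. An empty dict means the read aligned nowhere."""
    if not score_by_species:
        return "unmapped"
    ranked = sorted(score_by_species.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= min_margin:
        return ranked[0][0]
    return "ambiguous"

# Hypothetical best alignment scores per read against each species' reference.
reads = {
    "read1": {"D. melanogaster": 98, "carrier": 61},  # clearly melanogaster
    "read2": {"D. melanogaster": 55, "carrier": 97},  # clearly carrier
    "read3": {"D. melanogaster": 90, "carrier": 88},  # too close to call
    "read4": {},                                      # no alignment
}

summary = Counter(assign_species(scores) for scores in reads.values())
```

The margin is a design choice. A larger value discards more reads from highly conserved genes, while a smaller one risks crediting carrier reads to D. melanogaster. The more distantly related the carrier species, the fewer reads should land in the ambiguous bin.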

With this trick, the method now works great, and the paper is really just a description of the method and a demonstration that accurate expression patterns can be recovered from individual cryosectioned embryos. The resolution here is not that great – we used 6 slices of ~60 µm each per embryo. But we’ve started to make smaller sections, and a back-of-the-envelope calculation suggests we can, with available sample handling and sequencing techniques, make up to 100 slices per embryo. This would be more than enough to see stripes and other subtle patterns missed in the current dataset.
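For a rough sense of scale, the numbers quoted in this post can be plugged into a small calculation. This is only a sketch under stated assumptions: it takes the sectioned length to be 6 × 60 µm, takes the ~100 ng of total RNA per embryo mentioned earlier, and assumes RNA is spread evenly along the anterior-posterior axis, which is a simplification. The helper `slice_plan` is a hypothetical name.

```python
def slice_plan(total_rna_ng, sectioned_length_um, n_slices):
    """Return (slice thickness in µm, RNA per slice in ng), assuming the
    RNA is distributed evenly along the sectioned length."""
    thickness_um = sectioned_length_um / n_slices
    rna_per_slice_ng = total_rna_ng / n_slices
    return thickness_um, rna_per_slice_ng

EMBRYO_RNA_NG = 100.0           # approximate total RNA per embryo (from the post)
SECTIONED_LENGTH_UM = 6 * 60.0  # 6 slices of ~60 µm each (from the post)

current = slice_plan(EMBRYO_RNA_NG, SECTIONED_LENGTH_UM, 6)
proposed = slice_plan(EMBRYO_RNA_NG, SECTIONED_LENGTH_UM, 100)

print(f"6 slices:   {current[0]:.1f} µm thick, {current[1]:.1f} ng RNA each")
print(f"100 slices: {proposed[0]:.1f} µm thick, {proposed[1]:.1f} ng RNA each")
```

Under these assumptions, going from 6 to 100 slices means sections a few micrometres thick, each carrying on the order of a nanogram of the embryo's own RNA. That is the regime where the carrier-embryo trick matters most.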

Our immediate goals are to do a developmental time course, compare patterns in male and female embryos, look at other species and examine embryos from strains carrying various patterning defects. For those of you going to the fly meeting in DC in April, Peter’s talk will, I hope, have some of this new data.

Anyway, we would love comments on either the method or the manuscript.