RNA-Seq Mapping Errors When Using Incomplete Reference Transcriptomes of Vertebrates

RNA-Seq Mapping Errors When Using Incomplete Reference Transcriptomes of Vertebrates
Alexis Black Pyrkosz, Hans Cheng, C. Titus Brown
(Submitted on 11 Mar 2013)

Whole transcriptome sequencing is increasingly being used as a functional genomics tool to study non-model organisms. However, when the reference transcriptome used to calculate differential expression is incomplete, significant error in the inferred expression levels can result. In this study, we use simulated reads generated from real transcriptomes to determine the accuracy of read mapping, and measure the error resulting from using an incomplete transcriptome. We show that the two primary sources of counting error are 1) alternative splice variants that share reads and 2) missing transcripts from the reference. Alternative splice variants increase the false positive rate of mapping while incomplete reference transcriptomes decrease the true positive rate, leading to inaccurate transcript expression levels. Grouping transcripts by gene or read sharing (similar to mapping to a reference genome) significantly decreases false positives, but only by improving the reference transcriptome itself can the missing transcript problem be addressed. We also demonstrate that employing different mapping software does not yield substantial increases in accuracy on simulated data. Finally, we show that read lengths or insert sizes must increase past 1kb to resolve mapping ambiguity.
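
The effect of grouping transcripts by gene can be illustrated with a toy calculation. The sketch below is not the authors' pipeline: the transcript names, the gene annotation, and the read assignments are all hypothetical, and the "misassigned fraction" is a simple stand-in for the error measures used in the paper. It only shows why reads shared between splice variants stop counting as errors once variants are collapsed to their gene.

```python
# Hypothetical annotation: two splice variants of geneA and one transcript of geneB.
GENE_OF = {"tx1a": "geneA", "tx1b": "geneA", "tx2": "geneB"}

# Hypothetical simulated reads: (true source transcript, transcript reported by the mapper).
READS = [
    ("tx1a", "tx1a"),
    ("tx1a", "tx1b"),  # shared exon, assigned to the wrong splice variant
    ("tx1b", "tx1a"),  # shared exon, assigned to the wrong splice variant
    ("tx1b", "tx1b"),
    ("tx2", "tx2"),
    ("tx2", "tx1a"),   # assigned to a different gene altogether
]


def misassigned_fraction(reads, key=lambda name: name):
    """Fraction of reads whose mapped label differs from the true label.

    `key` maps a transcript name to the level at which reads are counted
    (the transcript itself by default, or its gene).
    """
    wrong = sum(1 for truth, mapped in reads if key(truth) != key(mapped))
    return wrong / len(reads)


transcript_level = misassigned_fraction(READS)
gene_level = misassigned_fraction(READS, key=GENE_OF.get)
print(f"transcript level: {transcript_level:.3f}, gene level: {gene_level:.3f}")
```

In this toy data set the two reads swapped between splice variants are no longer errors at the gene level, while the read mapped to the wrong gene still is. Grouping cannot help with reads whose true transcript is absent from the reference, which is the second error source the abstract describes.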

Our paper: Soft selective sweeps are the primary mode of recent adaptation in Drosophila melanogaster

This guest post is by Nandita R. Garud, Philipp W. Messer, Erkan O. Buzbas, and Dmitri A. Petrov, on their paper Soft selective sweeps are the primary mode of recent adaptation in Drosophila melanogaster, arXived here

We typically think of adaptive events as arising from single de novo mutations that sweep through the population one at a time. In this scenario, one expects to observe the signatures of hard selective sweeps, where a single haplotype rises to very high frequencies, removing variation in linked genomic regions. It is also possible, however, that adaptation could lead to signatures of soft sweeps. Soft sweeps are generated by multiple adaptive haplotypes rising in frequency at the same time, either because (i) the adaptive mutation comes from standing variation and thus had time to recombine onto multiple haplotypes, or (ii) because multiple de novo mutations arise virtually simultaneously. The second mode is likely in large populations or when the adaptive mutation rate per locus is high.

Soft sweeps have generally been considered a mere curiosity and most scans for adaptation focus on the hard sweep scenario. Despite this prevailing view, the three best-studied cases of adaptation in Drosophila at the loci Ace, CHKov1, and Cyp6g1 all show signatures of soft sweeps. In two cases (Ace and Cyp6g1), soft sweeps were generated by de novo mutations indicating that the population size in D. melanogaster relevant to adaptation is on the order of billions or larger. In one case (CHKov1), soft sweeps arose from standing variation. Surprisingly, we do not have very convincing cases of recent adaptation in Drosophila that generated hard sweeps.

Nevertheless, it remained an open question whether these three cases were the exception or the norm. They are all related to pesticide or viral resistance, and it is entirely possible that much adaptation unrelated to human disturbance or immunity proceeds differently and might generate hard sweeps.

In this paper, we developed two haplotype statistics that allowed us to systematically identify hard and soft sweeps with similar power and then to differentiate them from each other. We applied these statistics to the Drosophila polymorphism data of ~150 fully sequenced, inbred strains available through the Drosophila Genetic Reference Panel (DGRP).

We found abundant signatures of recent and strong sweeps in the Drosophila genome with haplotype structure often extending over tens or even hundreds of kb. However, to our surprise, when we looked at the top 50 peaks, all of them showed signatures of soft sweeps, while we could not convincingly demonstrate the existence of any hard sweeps.

Our results suggest that hard sweeps might be exceedingly rare in Drosophila. Instead, it appears that adaptation in Drosophila primarily proceeds via soft sweeps and thus often involves standing genetic variation or recurrent de novo mutations. There are two caveats, however: One is that we were only able to study strong and recent adaptation. Such strong adaptation should “feel” recent population sizes that are close to the census size, whereas it should be insensitive to bottlenecks that have occurred in the distant past. Weaker adaptation, on the other hand, might take longer and thus would be sensitive to ancient bottlenecks or interference from other sweeps. Whether weak adaptation thus proceeds via hard sweeps remains to be seen. The second caveat is that much of adaptation might involve sweeps that are so soft and move so many haplotypes up in frequency that we cannot detect them. Similarly, adaptation could often be polygenic involving very subtle shifts in allele frequency at many loci. These modes would hardly leave any signatures of sweeps at all. Whichever way it is, it is becoming increasingly clear that adaptation in Drosophila and many other organisms is likely to be much more complex, much more common, and in many ways a much more turbulent process than we usually tend to think.

Minimal clade size in the Bolthausen-Sznitman coalescent

Minimal clade size in the Bolthausen-Sznitman coalescent
Fabian Freund, Arno Siri-Jégousse
(Submitted on 14 Jan 2013 (v1), last revised 6 Mar 2013 (this version, v2))

This article shows the asymptotics of distribution and moments of the size $X_n$ of the minimal clade of a randomly chosen individual in a Bolthausen-Sznitman $n$-coalescent for $n\to\infty$. The Bolthausen-Sznitman $n$-coalescent is a Markov process taking states in the set of partitions of $\{1,\ldots,n\}$, where $1,\ldots,n$ are referred to as individuals. The minimal clade of an individual is the equivalence class the individual is in at the time of the first coalescence event this individual participates in. The main tool used is the connection of the Bolthausen-Sznitman $n$-coalescent with random recursive trees introduced by Goldschmidt and Martin. This connection shows that $X_n-1$ is distributed as the number $M_n$ of all individuals not in the equivalence class of individual 1 shortly before the time of the last coalescence event. Both functionals are distributed like the size $RT_{n-1}$ of a uniformly chosen table in a standard Chinese restaurant process with $n-1$ customers. We give exact formulae for these distributions. Using the asymptotics of $M_n$ shown by Goldschmidt and Martin, we see $(\log n)^{-1}\log X_n$ converges in distribution to the uniform distribution on $[0,1]$ for $n\to\infty$. We provide the complementary information that $\frac{\log n}{n^k}E(X_n^k)\to \frac{1}{k}$ for $n\to\infty$, which is also true for $M_n$ and $RT_n$.
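
For readers unfamiliar with the Chinese restaurant process mentioned above, here is a minimal simulation sketch. It assumes the standard one-parameter process with parameter 1 (a new customer opens a new table with probability 1/(k+1) when k customers are seated, and otherwise joins a table with probability proportional to its size) and reads "uniformly chosen table" as picking one of the occupied tables uniformly at random. The function names are ours, not the paper's, and the code does not reproduce the paper's exact formulae.

```python
import random


def chinese_restaurant_tables(n_customers, rng):
    """Seat customers one by one and return the list of table sizes."""
    tables = []
    for seated in range(n_customers):
        # Draw one of seated + 1 equally likely outcomes: the last one opens
        # a new table, the others point at an already seated customer.
        r = rng.randrange(seated + 1)
        if r == seated:
            tables.append(1)
            continue
        # Join the table of the r-th seated customer, so that a table is
        # chosen with probability proportional to its current size.
        cumulative = 0
        for index, size in enumerate(tables):
            cumulative += size
            if r < cumulative:
                tables[index] += 1
                break
    return tables


def size_of_uniform_table(n_customers, rng):
    """Size of a table picked uniformly among the occupied tables."""
    return rng.choice(chinese_restaurant_tables(n_customers, rng))


rng = random.Random(2013)
sizes = [size_of_uniform_table(200, rng) for _ in range(1000)]
print("mean size of a uniformly chosen table:", sum(sizes) / len(sizes))
```

Repeating the draw for growing numbers of customers gives an empirical view of the distribution whose asymptotics the abstract describes.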

The consequences of gene flow for local adaptation and differentiation: A two-locus two-deme model

The consequences of gene flow for local adaptation and differentiation: A two-locus two-deme model
Ada Akerman, Reinhard Bürger
(Submitted on 6 Mar 2013)

We consider a population subdivided into two demes connected by migration in which selection acts in opposite directions. We explore the effects of recombination and migration on the maintenance of multilocus polymorphism, on local adaptation, and on differentiation by employing a deterministic model with genic selection on two linked diallelic loci (i.e., no dominance or epistasis). For the following cases, we characterize explicitly the possible equilibrium configurations: weak, strong, highly asymmetric, and super-symmetric migration, no or weak recombination, and independent or strongly recombining loci. For independent loci (linkage equilibrium) and for completely linked loci, we derive the possible bifurcation patterns as functions of the total migration rate, assuming all other parameters are fixed but arbitrary. For these and other cases, we determine analytically the maximum migration rate below which a stable fully polymorphic equilibrium exists. In this case, differentiation and local adaptation are maintained. Their degree is quantified by a new multilocus version of $F_{ST}$ and by the migration load, respectively. In addition, we investigate the invasion conditions of locally beneficial mutants and show that linkage to a locus that is already in migration-selection balance facilitates invasion. Hence, loci of much smaller effect can invade than predicted by one-locus theory if linkage is sufficiently tight. We study how this minimum amount of linkage admitting invasion depends on the migration pattern. This suggests the emergence of clusters of locally beneficial mutations, which may form 'genomic islands of divergence'. Finally, the influence of linkage and two-way migration on the effective migration rate at a linked neutral locus is explored. Numerical work complements our analytical results.

From Many, One: Genetic Control of Prolificacy during Maize Domestication

From Many, One: Genetic Control of Prolificacy during Maize Domestication
David M. Wills, Clinton Whipple, Shohei Takuno, Lisa E. Kursel, Laura M. Shannon, Jeffrey Ross-Ibarra, John F. Doebley
(Submitted on 4 Mar 2013)

A reduction in number and an increase in size of inflorescences is a common aspect of plant domestication. When maize was domesticated from teosinte, the number and arrangement of ears changed dramatically. Teosinte has long lateral branches that bear multiple small ears at their nodes and tassels at their tips. Maize has much shorter lateral branches that are tipped by a single large ear with no additional ears at the branch nodes. To investigate the genetic basis of this difference in prolificacy (the number of ears on a plant), we performed a genome-wide QTL scan. A large effect QTL for prolificacy (prol1.1) was detected on the short arm of chromosome one in a location that has previously been shown to influence multiple domestication traits. We fine-mapped prol1.1 to a 2.7 kb interval or causative region upstream of the grassy tillers1 gene, which encodes a homeodomain leucine zipper transcription factor. Tissue in situ hybridizations reveal that the maize allele of prol1.1 is associated with up-regulation of gt1 expression in the nodal plexus. Given that maize does not initiate secondary ear buds, the expression of gt1 in the nodal plexus in maize may suppress their initiation. Population genetic analyses indicate positive selection on the maize allele of prol1.1, causing a partial sweep that fixed the maize allele throughout most of domesticated maize. This work shows how a subtle cis-regulatory change in tissue specific gene expression altered plant architecture in a way that improved the harvestability of maize.

Soft selective sweeps are the primary mode of recent adaptation in Drosophila melanogaster

Soft selective sweeps are the primary mode of recent adaptation in Drosophila melanogaster
Nandita R. Garud, Philipp W. Messer, Erkan O. Buzbas, Dmitri A. Petrov
(Submitted on 5 Mar 2013)

Adaptation is often thought to leave the signature of a hard selective sweep, in which a single haplotype bearing the beneficial allele reaches high population frequency. However, an alternative and often-overlooked scenario is that of a soft selective sweep, in which multiple adaptive haplotypes sweep through the population simultaneously. Soft selective sweeps are likely either when adaptation proceeds from standing genetic variation or in large populations where adaptation is not mutation-limited. Current statistical methods are not well designed to test for soft sweeps, and thus are likely to miss these possibly numerous adaptive events because they look for characteristic reductions in heterozygosity. Here, we developed a statistical test based on a haplotype statistic, H12, capable of detecting both hard and soft sweeps with similar power. We used H12 to identify multiple genomic regions that have undergone recent and strong adaptation using a population sample of fully sequenced Drosophila melanogaster strains (DGRP). We then developed a second statistical test based on a statistic H2/H1 | H12, to test whether a given selective sweep detected by H12 is hard or soft. Surprisingly, when applying the test based on H2/H1 | H12 to the top 50 most extreme H12 candidates in the DGRP data, we reject the hard sweep hypothesis in every case. In contrast, all 50 cases show strong support (Bayes Factor >10) for a soft sweep model. Our results suggest that recent adaptation in North American populations of D. melanogaster has led primarily to soft sweeps either because it utilized standing genetic variation or because the short-term effective population size in D. melanogaster is on the order of billions or larger.
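
The abstract does not spell out the statistics, so the sketch below rests on an assumption about their form: with haplotype frequencies sorted as p1 >= p2 >= ..., we take H1 as the sum of squared frequencies, H12 as H1 with the two most common haplotypes pooled into one, and H2 as H1 without the most common haplotype. The function name and the toy samples are hypothetical, and the Bayes factor test described in the abstract is not implemented here.

```python
from collections import Counter


def haplotype_stats(haplotypes):
    """Compute H1, H12, H2 and H2/H1 from a sample of haplotype labels."""
    if not haplotypes:
        raise ValueError("need at least one haplotype")
    n = len(haplotypes)
    # Sample frequencies of the distinct haplotypes, most common first.
    freqs = sorted((c / n for c in Counter(haplotypes).values()), reverse=True)
    p1 = freqs[0]
    p2 = freqs[1] if len(freqs) > 1 else 0.0
    h1 = sum(p * p for p in freqs)   # haplotype homozygosity
    h12 = h1 + 2 * p1 * p2           # equals (p1 + p2)^2 + sum of the other p_i^2
    h2 = h1 - p1 * p1                # homozygosity without the top haplotype
    return {"H1": h1, "H12": h12, "H2": h2, "H2/H1": h2 / h1}


# Toy windows: one dominant haplotype versus two common haplotypes.
hard_like = ["A"] * 9 + ["B"]
soft_like = ["A"] * 5 + ["B"] * 4 + ["C"]
print(haplotype_stats(hard_like))
print(haplotype_stats(soft_like))
```

Under these assumed definitions both toy windows have high H12, but H2/H1 is close to zero when a single haplotype dominates and much larger when two haplotypes are common, which is the contrast the second test is meant to exploit.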

Comprehensive Detection of Genes Causing a Phenotype using Phenotype Sequencing and Pathway Analysis

Comprehensive Detection of Genes Causing a Phenotype using Phenotype Sequencing and Pathway Analysis
Marc Harper, Luisa Gronenberg, James Liao, Christopher Lee
(Submitted on 3 Mar 2013)

Discovering all the genetic causes of a phenotype is an important goal in functional genomics. In this paper we combine an experimental design for multiple independent detections of the genetic causes of a phenotype, with a high-throughput sequencing analysis that maximizes sensitivity for comprehensively identifying them. Testing this approach on a set of 24 mutant strains generated for a metabolic phenotype with many known genetic causes, we show that this pathway-based phenotype sequencing analysis greatly improves sensitivity of detection compared with previous methods, and reveals a wide range of pathways that can cause this phenotype. We demonstrate our approach on a metabolic re-engineering phenotype, the PEP/OAA metabolic node in E. coli, which is crucial to a substantial number of metabolic pathways and under renewed interest for biofuel research. Out of 2157 mutations in these strains, pathway-phenoseq discriminated just five gene groups (12 genes) as statistically significant causes of the phenotype. Experimentally, these five gene groups, and the next two high-scoring pathway-phenoseq groups, either have a clear connection to the PEP metabolite level or offer an alternative path of producing oxaloacetate (OAA), and thus clearly explain the phenotype. These high-scoring gene groups also show strong evidence of positive selection pressure, compared with strictly neutral selection in the rest of the genome.

Hybrid-Lambda: simulation of multiple merger and Kingman gene genealogies in species networks and species trees

Hybrid-Lambda: simulation of multiple merger and Kingman gene genealogies in species networks and species trees
Sha Zhu, James H Degnan, Bjarki Eldon
(Submitted on 4 Mar 2013)

Hybrid-Lambda is a software package that simulates gene trees under Kingman or two Lambda-coalescent processes within species networks or species trees. It is written in C++, and released under GNU General Public License (GPL) version 3. Users can modify and make new distribution under the terms of this license. For details of this license, visit this http URL. Hybrid-Lambda is available at this https URL.

Deleterious synonymous mutations hitchhike to high frequency in HIV-1 env evolution

Deleterious synonymous mutations hitchhike to high frequency in HIV-1 env evolution
Fabio Zanini, Richard A. Neher
(Submitted on 4 Mar 2013)

Intrapatient HIV-1 evolution is dominated by selection on the protein level in the arms race with the adaptive immune system. When cytotoxic CD8+ T-cells or neutralizing antibodies target a new epitope, the virus often escapes via nonsynonymous mutations that impair recognition. Synonymous mutations do not affect this interplay and are often assumed to be neutral. We analyze longitudinal intrapatient data from the C2-V5 part of the envelope gene (env) and observe that synonymous derived alleles rarely fix even though they often reach high frequencies in the viral population. We find that synonymous mutations that disrupt base pairs in RNA stems flanking the variable loops of gp120 are more likely to be lost than other synonymous changes, hinting at a direct fitness effect of these stem-loop structures in the HIV-1 RNA. Computational modeling indicates that these synonymous mutations have a (Malthusian) selection coefficient of the order of -0.002 and that they are brought up to high frequency by hitchhiking on neighboring beneficial nonsynonymous alleles. The patterns of fixation of nonsynonymous mutations estimated from the longitudinal data and comparisons with computer models suggest that escape mutations in C2-V5 are only transiently beneficial, either because the immune system is catching up or because of competition between equivalent escapes.

Gene expression in early Drosophila embryos is highly conserved despite extensive divergence of transcription factor binding

Gene expression in early Drosophila embryos is highly conserved despite extensive divergence of transcription factor binding
Mathilde Paris, Tommy Kaplan, Xiao Yong Li, Jacqueline E. Villalta, Susan E. Lott, Michael B. Eisen
(Submitted on 1 Mar 2013)

To better characterize how variation in regulatory sequences drives divergence in gene expression, we undertook a systematic study of transcription factor binding and gene expression in the blastoderm embryos of four species that sample much of the diversity in the 60-million-year-old genus Drosophila: D. melanogaster, D. yakuba, D. pseudoobscura and D. virilis. We compared gene expression, as measured by mRNA-seq, to the genome-wide binding of four transcription factors involved in early development, as measured by ChIP-seq (Bicoid, Giant, Hunchback and Krüppel). Surprisingly, we found that mRNA levels are much better conserved than individual binding events. We looked at binding characteristics that may explain such evolutionary disparity. As expected, we found that binding divergence increases with phylogenetic distance. Interestingly, binding events in non-coding regions that were bound strongly by single factors, or bound by multiple factors, were more likely to be conserved. As this class of sites is most likely to be involved in gene regulation, the divergence of other bound regions may simply reflect their lack of function. We used a model of quantitative trait evolution to compare the changes of gene expression with nearby regulatory TF binding. We found that changes in gene expression were poorly explained by changes in associated TF binding. These results suggest that some of the differences in sequence and binding have limited effect on gene expression or act in a compensatory manner to maintain the overall expression levels of regulated genes.