Reducing INDEL errors in whole-genome and exome sequencing

Han Fang, Giuseppe Narzisi, Jason A. O’Rawe, Yiyang Wu, Julie Rosenbaum, Michael Ronemus, Ivan Iossifov, Michael C. Schatz, Gholson J. Lyon

Background: INDELs, especially those disrupting protein-coding regions of the genome, have been associated with human diseases. However, there are still many errors with INDEL variant calling, driven by library preparation, sequencing biases, and algorithm artifacts. We have recently developed a new INDEL-calling algorithm, Scalpel, with substantially improved accuracy. Results: We characterized whole genome sequencing (WGS), whole exome sequencing (WES), and PCR-free sequencing data from the same samples to investigate false-positive and false-negative INDEL errors. We developed a classification scheme utilizing validation data to define a class of low-quality INDELs with ~2.7-fold higher error rates than high-quality INDELs. The mean concordance of INDEL detection between WGS and WES data was ~52%, while WGS data uniquely identified ~10.8-fold more high-quality INDELs. Concordance of INDEL detection between standard and PCR-free sequencing data was ~71%, while PCR-free data uniquely yielded ~6.3-fold fewer low-quality INDELs. We demonstrate that these INDEL errors are significantly reduced with a PCR-free library protocol, implying that these errors are introduced with PCR amplification. We calculated that 60X WGS data from the HiSeq 2000 platform are needed to recover ~95% of INDELs, much higher than that for SNP detection. Accurate detection of heterozygous INDELs requires ~1.2-fold higher coverage than that for homozygous INDELs. Conclusions: Homopolymer A/T INDELs are a major source of low-quality and/or uncertain INDEL calls, and these are highly enriched in the WES data. We recommend WGS for human genomes at 60X mean coverage with PCR-free protocols, which can substantially improve the quality of personal genomes.
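To make the concordance figures above concrete, here is a minimal sketch of how concordance between two INDEL call sets might be computed. The abstract does not give the exact definition used, so this assumes exact matching on (chromosome, position, ref, alt) and reports shared calls as a fraction of the union; the function name and the toy calls are hypothetical.

```python
def concordance(calls_a, calls_b):
    """Return (shared / union, calls unique to a, calls unique to b).

    Assumption: two INDELs are "the same" only if chromosome, position,
    reference allele and alternate allele all match exactly.
    """
    a, b = set(calls_a), set(calls_b)
    union = a | b
    if not union:
        return 0.0, set(), set()
    return len(a & b) / len(union), a - b, b - a


# Toy call sets, invented purely for illustration.
wgs = {("chr1", 1000, "AT", "A"), ("chr1", 2000, "G", "GTT"), ("chr2", 500, "CAAA", "C")}
wes = {("chr1", 1000, "AT", "A"), ("chr3", 42, "T", "TA")}

rate, wgs_only, wes_only = concordance(wgs, wes)
print(f"concordance={rate:.2f}, WGS-only={len(wgs_only)}, WES-only={len(wes_only)}")
# prints: concordance=0.25, WGS-only=2, WES-only=1
```

A position-tolerant match (for example, allowing a few bases of slack around homopolymer runs) would give different numbers, so the matching rule should be stated alongside any concordance value.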

Natural selection helps explain the small range of genetic variation within species

Russell B. Corbett-Detig, Daniel L. Hartl, Timothy B. Sackton

The range of genetic diversity observed within natural populations is much more narrow than expected based on models of neutral molecular evolution. Although the increased efficacy of natural selection in larger populations has been invoked to explain this paradox, to date no tests of this hypothesis have been conducted. Here, we present an analysis of whole-genome polymorphism data and genetic maps from 39 species to estimate for each species the reduction in genetic variation attributable to the operation of natural selection on the genome. We find that species with larger population sizes do in fact show greater reductions in genetic variation. This finding provides the first experimental support for the hypothesis that natural selection contributes to the restricted range of within-species genetic diversity.

Recombination impacts damaging and disease mutations accumulation in human populations

Julie Hussin, Alan Hodgkinson, Youssef Idaghdour, Jean-Christophe Grenier, Jean-Philippe Goulet, Elias Gbeha, Elodie Hip-Ki, Philip Awadalla

Many decades of theory have demonstrated that in non-recombining systems, slightly deleterious mutations accumulate non-reversibly, potentially driving the extinction of many asexual species. Non-recombining chromosomes in sexual organisms are thought to have degenerated in a similar fashion; however, it is not clear to what extent these processes operate along recombining chromosomes with highly variable rates of crossing over. Using high coverage sequencing data from over 1400 individuals, we show that recombination rate modulates the genomic distribution of putatively deleterious variants across the entire human genome. We find that exons in regions of low recombination are significantly enriched for deleterious and disease variants, a signature that varies in strength across worldwide human populations with different demographic histories. As low recombining regions are enriched for highly conserved genes with essential cellular functions and show an excess of mutations with demonstrated effect on health, this phenomenon likely affects disease susceptibility in humans.

Transcriptomic analysis of the lesser spotted catshark (Scyliorhinus canicula) pancreas, liver and brain reveals molecular level conservation of vertebrate pancreas function

John F Mulley, Adam D Hargreaves, Matthew J Hegarty, R. Scott Heller, Martin T Swain

Background: Understanding the evolution of the vertebrate pancreas is key to understanding its functions. The chondrichthyes (cartilaginous fish such as sharks and rays) have been suggested to possess the most ancient example of a distinct pancreas with both hormonal (endocrine) and digestive (exocrine) roles, although the lack of genetic, genomic and transcriptomic data for cartilaginous fish has hindered a more thorough understanding of the molecular-level functions of the chondrichthyan pancreas, particularly with respect to their “unusual” energy metabolism (where ketone bodies and amino acids are the main oxidative fuel source) and their paradoxical ability to both maintain stable blood glucose levels and tolerate extensive periods of hypoglycemia. In order to shed light on some of these processes we have carried out the first large-scale comparative transcriptomic survey of multiple cartilaginous fish tissues: the pancreas, brain and liver of the lesser spotted catshark, Scyliorhinus canicula. Results: We generated a multi-tissue assembly comprising 86,006 contigs, of which 44,794 were assigned to a particular tissue or combination of tissues based on mapping of sequencing reads. We have characterised transcripts encoding genes involved in insulin regulation, glucose sensing, transcriptional regulation, signaling and digestion, as well as many peptide hormone precursors and their receptors for the first time. Comparisons to published mammalian pancreas transcriptomes reveal that mechanisms of glucose sensing and insulin regulation used to establish and maintain a stable internal environment are conserved across jawed vertebrates and likely pre-date the vertebrate radiation. Conservation of pancreatic hormones and genes encoding digestive proteins supports the single, early evolution of a distinct pancreatic gland with endocrine and exocrine functions in vertebrates, although the peptide diversity of the early vertebrate pancreas has been overestimated as a result of the use of cross-reacting antisera in earlier studies. A three hormone islet organ is therefore the basal vertebrate condition, later elaborated upon only in the tetrapod lineage. Conclusions: The cartilaginous fish are a great untapped resource for the reconstruction of patterns and processes of vertebrate evolution and new approaches such as those described in this paper will greatly facilitate their incorporation into the rank of “model organism”.

iRAP – an integrated RNA-seq Analysis Pipeline

Nuno A. Fonseca, Robert Petryszak, John Marioni, Alvis Brazma

RNA-sequencing (RNA-Seq) has become the technology of choice for whole-transcriptome profiling. However, processing the millions of sequence reads generated requires considerable bioinformatics skills and computational resources. At each step of the processing pipeline many tools are available, each with specific advantages and disadvantages. While using a specific combination of tools might be desirable, integrating the different tools can be time consuming, often due to specificities in the formats of input/output files required by the different programs. Here we present iRAP, an integrated RNA-seq analysis pipeline that allows the user to select and apply their preferred combination of existing tools for mapping reads, quantifying expression, and testing for differential expression. iRAP also includes multiple tools for gene set enrichment analysis and generates web browsable reports of the results obtained in the different stages of the pipeline. Depending upon the application, iRAP can be used to quantify expression at the gene, exon or transcript level. iRAP is aimed at a broad group of users with basic bioinformatics training and requires little experience with the command line. Despite this, it also provides more advanced users with the ability to customise the options used by their chosen tools.

Author post: Predicting evolution from the shape of genealogical trees

This guest post by Richard Neher discusses his preprint Predicting evolution from the shape of genealogical trees. Richard A. Neher, Colin A. Russell, Boris I. Shraiman. arXived here. This is cross-posted from the Neher lab website.

In this preprint, a collaboration with Colin Russell and Boris Shraiman, we show that it is possible to predict which individual from a population is most closely related to future populations. To this end, we have developed a method that uses the branching pattern of genealogical trees to estimate which part of the tree contains the “fittest” sequences, where fit means rapidly multiplying. Those that multiply rapidly are most likely to take over the population. We demonstrate the power of our method by predicting the evolution of seasonal influenza viruses.

How does it work?
Individuals adapt to a changing environment by accumulating beneficial mutations, while avoiding deleterious mutations. We model this process assuming that there are many such mutations which change fitness in small increments. Using this model, we calculate the probability that an individual that lived in the past at time t leaves n descendants in the present. This distribution depends critically on the fitness of the ancestral individual. We then extend this calculation to the probability of observing a certain branch in a genealogical tree reconstructed from a sample of sequences. A branch in a tree connects an individual A that lived at time t_A and had fitness x_A with an individual B that lived at a later time t_B with fitness x_B, as illustrated in the figure. B has descendants in the sample, otherwise the branch would not be part of the tree. Furthermore, all sampled descendants of A are also descendants of B, otherwise the connection between A and B would have branched between t_A and t_B. We call the mathematical object describing fitness evolution between A and B the “branch propagator” and denote it by g(x_B, t_B | x_A, t_A). The joint probability distribution of fitness values of all nodes of the tree is given by a product of branch propagators. We then calculate the expected fitness of each node and use it to rank the sampled sequences. The top ranked sequence is our prediction for the sequence of the progenitor of the future population.
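The product structure described above can be written compactly. This is only a sketch in partly assumed notation: x_i and t_i are the fitness and time of node i, a(i) is its parent node, and P_i is the marginal distribution of x_i. The normalization and any separate term for the root are left out because the post does not spell them out.

```latex
P(\{x_i\}) \;\propto\; \prod_{i \neq \mathrm{root}} g\!\left(x_i, t_i \mid x_{a(i)}, t_{a(i)}\right),
\qquad
\langle x_i \rangle = \int x_i \, P_i(x_i)\, \mathrm{d}x_i
```

Ranking the sampled sequences then amounts to sorting the nodes by their expected fitness.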

Why do we care?
Being able to predict evolution could have immediate applications. The best example is the seasonal influenza vaccine, which needs to be updated frequently to keep up with the evolving virus. Vaccine strains are chosen among sampled virus strains, and the more closely the chosen strain matches the future influenza virus population, the better the vaccine is going to be. Hence by predicting a likely progenitor of the future, our method could help to improve influenza vaccines. One of our predictions is shown in the figure, with the top ranked sequence marked by a black arrow. Influenza is not the only possible application. Since the algorithm only requires a reconstructed tree as input, it can be applied to other rapidly evolving pathogens or cancer cell populations. In addition to being useful, the ability to predict also implies that the model captures an essential aspect of evolutionary dynamics: influenza evolution is to a substantial degree (enough to enable prediction) dependent on the accumulation of small effect mutations.

Comparison to other approaches
Given the importance of good influenza vaccines, there have been a number of previous efforts to anticipate influenza virus evolution, typically based on patterns of molecular evolution in historical data. Along these lines, Luksza and Lässig have recently presented an explicit fitness model for influenza virus evolution that rewards mutations at positions known to convey antigenic novelty and penalizes likely deleterious mutations (plus a few other things). Because it uses influenza-specific molecular signatures, this model is complementary to ours, which uses only the tree reconstructed from nucleotide sequences. Interestingly, the two models do more or less equally well, and combining different methods of prediction should give more reliable results.

Polyester: simulating RNA-seq datasets with differential transcript expression

Alyssa C Frazee, Andrew E Jaffe, Ben Langmead, Jeffrey Leek

Statistical methods development for differential expression analysis of RNA sequencing (RNA-seq) requires software tools to assess accuracy and error rate control. Since true differential expression status is often unknown in experimental datasets, artificially-constructed datasets must be utilized, either by generating costly spike-in experiments or by simulating RNA-seq data. Polyester is an R package designed to simulate RNA-seq data, beginning with an experimental design and ending with collections of RNA-seq reads. The main advantage of Polyester is the ability to simulate isoform-level differential expression across biological replicates for a variety of experimental designs at the read level. Differential expression signal can be simulated with either built-in or user-defined statistical models. Polyester is available on GitHub at https://github.com/alyssafrazee/polyester.
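For readers unfamiliar with simulated differential expression, the following is a rough sketch of the general idea at the level of per-transcript counts. It is not Polyester's API (Polyester is an R package and produces reads, not just counts). The negative binomial model, drawn here as a gamma-Poisson mixture, and every name and parameter below are assumptions made for illustration.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson variate with Knuth's method (fine for modest means)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_counts(base_means, fold_changes, n_per_group, size, rng):
    """Return {transcript: (group1_counts, group2_counts)}.

    Group 2 means are the group 1 means times a per-transcript fold change
    (1.0 if the transcript is not listed). `size` is the negative binomial
    dispersion parameter: variance = mean + mean**2 / size.
    """
    counts = {}
    for tx, mean in base_means.items():
        groups = []
        for group_mean in (mean, mean * fold_changes.get(tx, 1.0)):
            groups.append([
                # Gamma-distributed rate, then Poisson sampling around it.
                poisson(rng.gammavariate(size, group_mean / size), rng)
                for _ in range(n_per_group)
            ])
        counts[tx] = tuple(groups)
    return counts


rng = random.Random(1)
sim = simulate_counts({"tx1": 20.0, "tx2": 50.0}, {"tx1": 4.0},
                      n_per_group=3, size=5.0, rng=rng)
for tx, (g1, g2) in sim.items():
    print(tx, g1, g2)
```

Because the true fold changes are known by construction, a differential expression method run on such data can be scored directly for sensitivity and false discovery rate.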

Phase transition on the convergence rate of parameter estimation under an Ornstein-Uhlenbeck diffusion on a tree

Cécile Ané, Lam Si Tung Ho, Sebastien Roch
(Submitted on 6 Jun 2014)

Diffusion processes on trees are commonly used in evolutionary biology to model the joint distribution of continuous traits, such as body mass, across species. Estimating the parameters of such processes from tip values presents challenges because of the intrinsic correlation between the observations produced by the shared evolutionary history, thus violating the standard independence assumption of large-sample theory. For instance, Ho and Ané [HoAne13] recently proved that the mean (also known in this context as selection optimum) of an Ornstein-Uhlenbeck process on a tree cannot be estimated consistently from an increasing number of tip observations if the tree height is bounded. Here, using a fruitful connection to the so-called reconstruction problem in probability theory, we study the convergence rate of parameter estimation in the unbounded height case. For the mean of the process, we provide a necessary and sufficient condition for the consistency of the maximum likelihood estimator (MLE) and establish a phase transition on its convergence rate in terms of the growth of the tree. In particular we show that a loss of √n-consistency (i.e., the variance of the MLE becomes Ω(n^{-1}), where n is the number of tips) occurs when the tree growth is larger than a threshold related to the phase transition of the reconstruction problem. For the covariance parameters, we give a novel, efficient estimation method which achieves √n-consistency under natural assumptions on the tree.
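To illustrate why tip observations are correlated, here is a small sketch that simulates an Ornstein-Uhlenbeck process on the simplest possible tree: two tips that share a branch and then split. It uses the standard Gaussian transition of the OU process; the parameter names (alpha for selection strength, sigma for the diffusion scale, mu for the optimum) and the tree shape are assumptions for illustration and do not come from the paper.

```python
import math
import random


def ou_transition(x0, t, alpha, sigma, mu):
    """Mean and variance of an OU process at time t, started at x0."""
    decay = math.exp(-alpha * t)
    mean = mu + (x0 - mu) * decay
    var = sigma ** 2 / (2 * alpha) * (1 - decay ** 2)
    return mean, var


def evolve(x0, t, alpha, sigma, mu, rng):
    """Sample the trait value at the end of a branch of length t."""
    mean, var = ou_transition(x0, t, alpha, sigma, mu)
    return rng.gauss(mean, math.sqrt(var))


def simulate_two_tips(root, shared, private, alpha, sigma, mu, rng):
    """Two tips that share a branch of length `shared`, then diverge."""
    ancestor = evolve(root, shared, alpha, sigma, mu, rng)
    return (evolve(ancestor, private, alpha, sigma, mu, rng),
            evolve(ancestor, private, alpha, sigma, mu, rng))


def correlation(pairs):
    """Pearson correlation of a list of (a, b) pairs."""
    n = len(pairs)
    ma = sum(a for a, _ in pairs) / n
    mb = sum(b for _, b in pairs) / n
    cov = sum((a - ma) * (b - mb) for a, b in pairs)
    va = sum((a - ma) ** 2 for a, _ in pairs)
    vb = sum((b - mb) ** 2 for _, b in pairs)
    return cov / math.sqrt(va * vb)


rng = random.Random(0)
pairs = [simulate_two_tips(0.0, 1.0, 0.2, alpha=0.5, sigma=1.0, mu=2.0, rng=rng)
         for _ in range(5000)]
tip_corr = correlation(pairs)
print(f"correlation between sister tips: {tip_corr:.2f}")
```

The long shared branch makes the two tips strongly correlated, so adding more closely related tips contributes far less information about mu than the same number of independent observations would.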

Testing the Toxicofera: comparative reptile transcriptomics casts doubt on the single, early evolution of the reptile venom system

Adam D Hargreaves, Martin T Swain, Darren W Logan, John F Mulley

Background: The identification of apparently conserved gene complements in the venom and salivary glands of a diverse set of reptiles led to the development of the Toxicofera hypothesis – the idea that there was a single, early evolution of the venom system in reptiles. However, this hypothesis is based largely on relatively small scale EST-based studies of only venom or salivary glands and toxic effects have been assigned to only some of these putative Toxicoferan toxins in some species. We set out to investigate the distribution of these putative venom toxin transcripts in order to investigate to what extent conservation of gene complements may reflect a bias in previous sampling efforts. Results: We have carried out the first large-scale test of the Toxicofera hypothesis and found it lacking in a number of regards. Our quantitative transcriptomic analyses of venom and salivary glands and other body tissues in five species of reptile, together with the use of available RNA-Seq datasets for additional species, show that the majority of genes used to support the establishment and expansion of the Toxicofera are in fact expressed in multiple body tissues and most likely represent general maintenance or “housekeeping” genes. The apparent conservation of gene complements across the Toxicofera therefore reflects an artefact of incomplete tissue sampling. In other cases, the identification of a non-toxic paralog of a gene encoding a true venom toxin has led to confusion about the phylogenetic distribution of that venom component. Conclusions: Venom has evolved multiple times in reptiles. In addition, the misunderstanding regarding what constitutes a toxic venom component, together with the misidentification of genes and the classification of identical or near-identical sequences as distinct genes, has led to an overestimation of the complexity of reptile venoms in general, and snake venom in particular, with implications for our understanding of (and development of treatments to counter) the molecules responsible for the physiological consequences of snakebite.

Restriction and recruitment – gene duplication and the origin and evolution of snake venom toxins

Adam D Hargreaves, Martin T Swain, Matthew J Hegarty, Darren W Logan, John F Mulley

The genetic and genomic mechanisms underlying evolutionary innovations are of fundamental importance to our understanding of animal evolution. Snake venom represents one such innovation and has been hypothesised to have originated and diversified via a process that involves duplication of genes encoding body proteins and subsequent recruitment of the copy to the venom gland where natural selection can act to develop or increase toxicity. However, gene duplication is known to be a rare event in vertebrate genomes and the recruitment of duplicated genes to a novel expression domain (neofunctionalisation) is an even rarer process that requires the evolution of novel combinations of transcription factor binding sites in upstream regulatory regions. This hypothesis concerning the evolution of snake venom is therefore very unlikely. Nonetheless, it is often assumed to be established fact and this has hampered research into the true origins of snake venom toxins. We have generated transcriptomic data for a diversity of body tissues and salivary and venom glands from venomous and non-venomous reptiles, which has allowed us to critically evaluate this hypothesis. Our comparative transcriptomic analysis of venom and salivary glands and body tissues in five species of reptile reveals that snake venom does not evolve via the hypothesised process of duplication and recruitment of body proteins. Indeed, our results show that many proposed venom toxins are in fact expressed in a wide variety of body tissues, including the salivary gland of non-venomous reptiles and have therefore been restricted to the venom gland following duplication, not recruited. Thus snake venom evolves via the duplication and subfunctionalisation of genes encoding existing salivary proteins. 
These results highlight the danger of the “just-so story” in evolutionary biology, where an elegant and intuitive idea is repeated so often that it assumes the mantle of established fact, to the detriment of the field as a whole.