READemption – A tool for the computational analysis of deep-sequencing-based transcriptome data

Konrad Ulrich Förstner, Jörg Vogel, Cynthia Mira Sharma

Summary: RNA-Seq has become a potent and widely used method to qualitatively and quantitatively study transcriptomes. In order to draw biological conclusions based on RNA-Seq data, several steps, some of which are computationally intensive, have to be taken. Our READemption pipeline takes care of these individual tasks and integrates them into an easy-to-use tool with a command line interface. To leverage the full power of modern computers, most subcommands of READemption offer parallel data processing. While READemption was mainly developed for the analysis of bacterial primary transcriptomes, we have successfully applied it to analyze RNA-Seq reads from other sample types, including whole transcriptomes and RNA immunoprecipitated with proteins, not only from bacteria, but also from eukaryotes and archaea. Availability and Implementation: READemption is implemented in Python and is published under the ISC open source license. The tool and documentation are hosted at http://pythonhosted.org/READemption (DOI:10.6084/m9.figshare.977849).

Model adequacy and the macroevolution of angiosperm functional traits

Matthew Pennell, Richard G FitzJohn, William K Cornwell, Luke J Harmon

All models are wrong and sometimes even the best of a set of models is useless. Modern phylogenetic comparative methods (PCMs) are almost exclusively model-based, and making robust inferences from PCMs therefore requires using a model of trait evolution that is a good explanation for the data. To date, researchers using PCMs have evaluated the explanatory power of a model only in terms of relative, not absolute, fit. Here we develop a general statistical framework for assessing the absolute fit, or adequacy, of phylogenetic models for the evolution of quantitative traits. We use our approach to test whether commonly used models are adequate descriptors of the macroevolutionary dynamics of real comparative data. We fit models of trait evolution to 337 comparative datasets covering three key angiosperm functional traits and evaluated the absolute fit of the models to each dataset. Overall, the models we used are very inadequate for the evolution of these traits; this was true for many different groups and at many different scales. Furthermore, the relative support for a model had very little to do with its absolute adequacy. We argue that assessing model adequacy should be a key step in comparative analyses.

Population genetics of identity by descent

Pier Francesco Palamara, Ph.D. thesis

Recent improvements in high-throughput genotyping and sequencing technologies have afforded the collection of massive, genome-wide datasets of DNA information from hundreds of thousands of individuals. These datasets, in turn, provide unprecedented opportunities to reconstruct the history of human populations and detect genotype-phenotype association. Recently developed computational methods can identify long-range chromosomal segments that are identical across samples, and have been transmitted from common ancestors that lived tens to hundreds of generations in the past. These segments reveal genealogical relationships that are typically unknown to the carrying individuals. In this work, we demonstrate that such identical-by-descent (IBD) segments are informative about a number of relevant population genetics features: they enable the inference of details about past population size fluctuations, migration events, and they carry the genomic signature of natural selection. We derive a mathematical model, based on coalescent theory, that allows for a quantitative description of IBD sharing across purportedly unrelated individuals, and develop inference procedures for the reconstruction of recent demographic events, where classical methodologies are statistically underpowered. We analyze IBD sharing in several contemporary human populations, including representative communities of the Jewish Diaspora, Kenyan Maasai samples, and individuals from several Dutch provinces, in all cases retrieving evidence of fine-scale demographic events from recent history. Finally, we expand the presented model to describe distributions for those sites in IBD shared segments that harbor mutation events, showing how these may be used for the inference of mutation rates in humans and other species.

Principal component gene set enrichment (PCGSE)

H. Robert Frost, Zhigang Li, Jason H. Moore

Motivation: Although principal component analysis (PCA) is widely used for the dimensional reduction of biomedical data, interpretation of PCA results remains daunting. Most existing methods attempt to explain each principal component (PC) in terms of a small number of variables by generating approximate PCs with few non-zero loadings. Although useful when just a few variables dominate the population PCs, these methods are often inadequate for characterizing the PCs of high-dimensional genomic data. For genomic data, reproducible and biologically meaningful PC interpretation requires methods based on the combined signal of functionally related sets of genes. While gene set testing methods have been widely used in supervised settings to quantify the association of groups of genes with clinical outcomes, these methods have seen only limited application for testing the enrichment of gene sets relative to sample PCs. Results: We describe a novel approach, principal component gene set enrichment (PCGSE), for computing the statistical association between gene sets and the PCs of genomic data. The PCGSE method performs a two-stage competitive gene set test using the correlation between each gene and each PC as the gene-level test statistic with flexible choice of both the gene set test statistic and the method used to compute the null distribution of the gene set statistic. Using simulated data with simulated gene sets and real gene expression data with curated gene sets, we demonstrate that biologically meaningful and computationally efficient results can be obtained from a simple parametric version of the PCGSE method that performs a correlation-adjusted two-sample t-test between the gene-level test statistics for gene set members and genes not in the set. Availability: this http URL Contact: rob.frost@dartmouth.edu or jason.h.moore@dartmouth.edu
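The final step described in the abstract, a two-sample t-test between the gene-level statistics of set members and non-members, can be sketched as follows. This is only an illustration under stated assumptions: the function name `two_sample_t` is hypothetical, the gene-level statistics are assumed to be precomputed (for example, gene-to-PC correlations), and the sketch uses the plain pooled-variance statistic, omitting the correlation adjustment that the PCGSE method applies.

```python
import math
import statistics


def two_sample_t(member_stats, other_stats):
    """Pooled-variance two-sample t statistic (hypothetical helper).

    member_stats: gene-level statistics for genes in the gene set.
    other_stats:  gene-level statistics for genes outside the set.
    Note: no correlation adjustment is applied here, unlike PCGSE.
    """
    n1, n2 = len(member_stats), len(other_stats)
    mean_diff = statistics.mean(member_stats) - statistics.mean(other_stats)
    # Pool the two sample variances, weighting by degrees of freedom.
    pooled_var = (
        (n1 - 1) * statistics.variance(member_stats)
        + (n2 - 1) * statistics.variance(other_stats)
    ) / (n1 + n2 - 2)
    return mean_diff / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))


# Toy gene-level statistics (made-up numbers, for illustration only).
t_stat = two_sample_t([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```

A large positive value suggests that set members have higher gene-level statistics than other genes; because genes in a set are typically correlated, the unadjusted statistic would tend to overstate significance.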

Phylogenetic Stochastic Mapping without Matrix Exponentiation

Jan Irvahn, Vladimir N. Minin

Phylogenetic stochastic mapping is a method for reconstructing the history of trait changes on a phylogenetic tree relating species/organisms carrying the trait. State-of-the-art methods assume that the trait evolves according to a continuous-time Markov chain (CTMC) and work well for small state spaces. The computations slow down considerably for larger state spaces (e.g. space of codons), because current methodology relies on exponentiating CTMC infinitesimal rate matrices — an operation whose computational complexity grows as the size of the CTMC state space cubed. In this work, we introduce a new approach, based on a CTMC technique called uniformization, that does not use matrix exponentiation for phylogenetic stochastic mapping. Our method is based on a new Markov chain Monte Carlo (MCMC) algorithm that targets the distribution of trait histories conditional on the trait data observed at the tips of the tree. The computational complexity of our MCMC method grows as the size of the CTMC state space squared. Moreover, in contrast to competing matrix exponentiation methods, if the rate matrix is sparse, we can leverage this sparsity and increase the computational efficiency of our algorithm further. Using simulated data, we illustrate advantages of our MCMC algorithm and investigate how large the state space needs to be for our method to outperform matrix exponentiation approaches. We show that even on the moderately large state space of codons our MCMC method can be significantly faster than currently used matrix exponentiation methods.
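To make the uniformization idea concrete, the sketch below computes one row of the CTMC transition matrix as a Poisson-weighted sum of powers of a discrete-time transition matrix, using only vector-matrix products (each costing on the order of the state-space size squared). It is a minimal illustration of the standard uniformization series, not the authors' MCMC sampler; the function name and the truncation parameters `tol` and `max_terms` are assumptions made for this example.

```python
import math


def uniformized_transition_row(Q, t, start, tol=1e-12, max_terms=100000):
    """Row `start` of P(t) for rate matrix Q, via uniformization.

    Uses P(t) = sum_n Poisson(n; mu*t) * B^n with B = I + Q/mu,
    where mu is at least the largest exit rate. No matrix exponential
    is formed; only vector-matrix products are used.
    """
    s = len(Q)
    mu = max(-Q[i][i] for i in range(s))  # uniformization rate
    if mu == 0.0:  # no transitions are possible
        return [1.0 if j == start else 0.0 for j in range(s)]
    B = [[(1.0 if i == j else 0.0) + Q[i][j] / mu for j in range(s)]
         for i in range(s)]
    v = [1.0 if j == start else 0.0 for j in range(s)]  # e_start * B^0
    weight = math.exp(-mu * t)  # Poisson probability of n = 0 jumps
    result = [weight * x for x in v]
    total = weight
    n = 0
    # Add terms until the remaining Poisson mass is negligible.
    while 1.0 - total > tol and n < max_terms:
        n += 1
        v = [sum(v[i] * B[i][j] for i in range(s)) for j in range(s)]
        weight *= mu * t / n
        result = [r + weight * x for r, x in zip(result, v)]
        total += weight
    return result


# Two-state chain with rates 1 (0 -> 1) and 2 (1 -> 0).
row = uniformized_transition_row([[-1.0, 1.0], [2.0, -2.0]], 0.5, 0)
```

For a sparse rate matrix, the vector-matrix product could be restricted to non-zero entries, which is the kind of saving the abstract alludes to. This simple version starts the series at `exp(-mu * t)`, so it would underflow for very large `mu * t`.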

Identifying recombination hotspots using population genetic data

Adam Auton, Simon Myers, Gil McVean
(Submitted on 17 Mar 2014)

Motivation: Recombination rates vary considerably at the fine scale within mammalian genomes, with the majority of recombination occurring within hotspots of ~2 kb in width. We present a method for inferring the location of recombination hotspots from patterns of linkage disequilibrium within samples of population genetic data. Results: Using simulations, we show that our method has hotspot detection power of approximately 50-60%, depending on the magnitude of the hotspot. The false positive rate is between 0.24 and 0.56 false positives per Mb for data typical of humans. Availability: this http URL

Horizontal Transfers and Gene Losses in the phospholipid pathway of Bartonella reveal clues about early ecological niches

Qiyun Zhu, Michael Kosoy, Kevin J Olival, Katharina Dittmar

Bartonellae are mammalian pathogens vectored by blood-feeding arthropods. Although of increasing medical importance, little is known about their ecological past, and host associations are underexplored. Previous studies suggest an influence of horizontal gene transfers in ecological niche colonization by acquisition of host pathogenicity genes. We here expand these analyses to metabolic pathways of 28 Bartonella genomes, and experimentally explore the distribution of bartonellae in 21 species of blood-feeding arthropods. Across genomes, repeated gene losses and horizontal gains in the phospholipid pathway were found. The evolutionary timing of these patterns suggests functional consequences likely leading to an early intracellular lifestyle for stem bartonellae. Comparative phylogenomic analyses reveal three independent lineage-specific reacquisitions of a core metabolic gene – NAD(P)H-dependent glycerol-3-phosphate dehydrogenase (gpsA) – from Gammaproteobacteria and Epsilonproteobacteria. Transferred genes are significantly closely related to invertebrate Arsenophonus- and Serratia-like endosymbionts, and mammalian Helicobacter-like pathogens, supporting a cellular association with arthropods and mammals at the base of extant bartonellae. Our studies suggest that the horizontal reacquisitions had a key impact on bartonellae lineage-specific ecological and functional evolution.

Analysis of stop-gain and frameshift variants in human innate immunity genes


Antonio Rausell, Pejman Mohammadi, Paul J McLaren, Ioannis Xenarios, Jacques Fellay, Amalio Telenti

Loss-of-function variants in innate immunity genes are associated with Mendelian disorders in the form of primary immunodeficiencies. Recent resequencing projects report that stop-gains and frameshifts are collectively prevalent in humans and could be responsible for some of the inter-individual variability in innate immune response. Current computational approaches evaluating loss-of-function in genes carrying these variants rely on gene-level characteristics such as evolutionary conservation and functional redundancy across the genome. However, innate immunity genes represent a particular case because they are more likely to be under positive selection and duplicated. To create a ranking of severity that would be applicable to the innate immunity genes, we first evaluated 17764 stop-gain and 13915 frameshift variants from the NHLBI Exome Sequencing Project and 1000 Genomes Project. Sequence-based features such as loss of functional domains, isoform-specific truncation and nonsense-mediated decay were found to correlate with variant allele frequency and were validated with gene expression data. We integrated these features in a Bayesian classification scheme and benchmarked its use in predicting pathogenic variants against OMIM disease stop-gains and frameshifts. The classification scheme was applied in the assessment of 335 stop-gains and 236 frameshifts affecting 227 interferon-stimulated genes. The sequence-based score ranks variants in innate immunity genes according to their potential to cause disease, and complements existing gene-based pathogenicity scores.

Markov mutation models on Yule trees: pairwise species comparisons

Willem H. Mulder, Forrest W. Crawford
Subjects: Populations and Evolution (q-bio.PE)

Efforts to reconstruct phylogenetic trees and understand evolutionary processes depend fundamentally on stochastic models of speciation and mutation. The simplest continuous-time model for speciation in phylogenetic trees is the Yule process, in which new species are “born” from existing lineages at a constant rate. Recent work has illuminated some of the structural properties of Yule trees, but it remains mostly unknown how these properties affect sequence and trait patterns observed at the tips of the phylogenetic tree. Understanding the interplay between speciation and mutation under simple models of evolution is essential for deriving valid phylogenetic inference methods and gives insight into the optimal design of phylogenetic studies. In this work, we derive the probability distribution of interspecies covariance under Brownian motion and Ornstein-Uhlenbeck processes on a Yule tree. We compute the probability distribution of the number of mutations shared between two randomly chosen taxa in a Yule tree under several mutation models. These results suggest summary measures of phylogenetic information content, illuminate the correlation between site patterns in sequences or traits of related organisms, and provide heuristics for experimental design and reconstruction of phylogenetic trees.
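The Yule process described above can be illustrated with a short simulation. The sketch below is not taken from the paper: the function name `simulate_yule_tip_count` is hypothetical, and it only tracks the number of lineages over time, not the tree shape or any mutation model. It relies on the standard fact that, with k lineages each splitting at rate λ, the waiting time to the next speciation is exponential with rate kλ.

```python
import random


def simulate_yule_tip_count(birth_rate, duration, rng):
    """Number of lineages alive after `duration`, starting from one.

    birth_rate: per-lineage speciation rate (must be positive).
    rng:        a random.Random instance, passed in for reproducibility.
    """
    lineages = 1
    elapsed = 0.0
    while True:
        # With k lineages, the next split arrives at total rate k * birth_rate.
        elapsed += rng.expovariate(lineages * birth_rate)
        if elapsed > duration:
            return lineages
        lineages += 1


rng = random.Random(1)
counts = [simulate_yule_tip_count(1.0, 1.0, rng) for _ in range(20000)]
mean_tips = sum(counts) / len(counts)
```

Under these assumptions the expected number of tips after time t is exp(λt), so `mean_tips` should be close to e for λ = 1 and t = 1.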

Gaussian process test for high-throughput sequencing time series: application to experimental evolution

Hande Topa, Ágnes Jónás, Robert Kofler, Carolin Kosiol, Antti Honkela
Comments: 26 pages, 13 figures
Subjects: Populations and Evolution (q-bio.PE); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM); Applications (stat.AP)

Motivation: Recent advances in high-throughput sequencing (HTS) have made it possible to monitor genomes in great detail. New experiments use HTS not only to measure genomic features at one time point but also to monitor them changing over time with the aim of identifying significant changes in their abundance. In population genetics, for example, allele frequencies are monitored over time to detect significant frequency changes that indicate selection pressures. Previous attempts at analysing data from HTS experiments have been limited as they could not simultaneously include data at intermediate time points, replicate experiments and sources of uncertainty specific to HTS such as sequencing depth.
Results: We present the beta-binomial Gaussian process (BBGP) model for ranking features with significant non-random variation in abundance over time. The features are assumed to represent proportions, such as the proportion of an alternative allele in a population. We use the beta-binomial model to capture the uncertainty arising from finite sequencing depth and combine it with a Gaussian process model over the time series. In simulations that mimic the features of experimental evolution data, the proposed method clearly outperforms classical testing in average precision of finding selected alleles. We also present results on real data from a Drosophila experimental evolution experiment on temperature adaptation.
Availability: R software implementing the test is available at https://github.com/handetopa/BBGP.
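As a rough illustration of how a beta-binomial model captures uncertainty from finite sequencing depth, the sketch below evaluates the beta-binomial log-probability of observing k reads carrying the alternative allele out of n total reads. It is written in Python purely for illustration and is not the BBGP R implementation; the function name and the (alpha, beta) parameterization are assumptions made for this example.

```python
import math


def beta_binomial_log_pmf(k, n, alpha, beta):
    """Log-probability of k successes in n trials under a beta-binomial.

    k, n:        alternative-allele read count and total read depth.
    alpha, beta: positive shape parameters of the Beta prior on the
                 underlying allele proportion (assumed parameterization).
    """
    def log_beta(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


# With a flat Beta(1, 1) prior every count from 0 to n is equally likely.
flat_prob = math.exp(beta_binomial_log_pmf(3, 10, 1.0, 1.0))
```

The same observed proportion at low depth (say 3 of 10 reads) leaves a much wider posterior on the allele frequency than at high depth (300 of 1000 reads), which is the kind of depth-dependent uncertainty the abstract refers to.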