Principal component gene set enrichment (PCGSE)

H. Robert Frost, Zhigang Li, Jason H. Moore

Motivation: Although principal component analysis (PCA) is widely used for the dimensionality reduction of biomedical data, interpretation of PCA results remains daunting. Most existing methods attempt to explain each principal component (PC) in terms of a small number of variables by generating approximate PCs with few non-zero loadings. Although useful when just a few variables dominate the population PCs, these methods are often inadequate for characterizing the PCs of high-dimensional genomic data. For genomic data, reproducible and biologically meaningful PC interpretation requires methods based on the combined signal of functionally related sets of genes. While gene set testing methods have been widely used in supervised settings to quantify the association of groups of genes with clinical outcomes, these methods have seen only limited application for testing the enrichment of gene sets relative to sample PCs.
Results: We describe a novel approach, principal component gene set enrichment (PCGSE), for computing the statistical association between gene sets and the PCs of genomic data. The PCGSE method performs a two-stage competitive gene set test using the correlation between each gene and each PC as the gene-level test statistic, with a flexible choice of both the gene set test statistic and the method used to compute the null distribution of the gene set statistic. Using simulated data with simulated gene sets and real gene expression data with curated gene sets, we demonstrate that biologically meaningful and computationally efficient results can be obtained from a simple parametric version of the PCGSE method that performs a correlation-adjusted two-sample t-test between the gene-level test statistics for gene set members and genes not in the set.
Availability: this http URL
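
The two-stage structure described in this abstract can be illustrated with a minimal sketch. It assumes the PC scores are already available, uses the raw Pearson correlation as the gene-level statistic, and compares set members with non-members using an ordinary pooled two-sample t-statistic. It deliberately omits the correlation adjustment the authors describe, so it is not their method; the function names and the data layout (a dict mapping gene name to expression values) are hypothetical.

```python
import math
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pooled_t(in_set, out_set):
    """Ordinary pooled-variance two-sample t-statistic (no correlation adjustment)."""
    n1, n2 = len(in_set), len(out_set)
    sp2 = ((n1 - 1) * variance(in_set) + (n2 - 1) * variance(out_set)) / (n1 + n2 - 2)
    return (mean(in_set) - mean(out_set)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))


def gene_set_statistic(expr, pc_scores, gene_set):
    """Stage 1: correlate each gene with the PC. Stage 2: compare set vs. rest."""
    stat = {gene: pearson(values, pc_scores) for gene, values in expr.items()}
    in_set = [s for gene, s in stat.items() if gene in gene_set]
    out_set = [s for gene, s in stat.items() if gene not in gene_set]
    return pooled_t(in_set, out_set)


# Toy data: five samples, six genes, three of which track the PC closely.
pc = [-2, -1, 0, 1, 2]
expr = {
    "g1": [-2, -1, 0, 1, 2],
    "g2": [-1, -1, 0, 1, 1],
    "g3": [-2, 0, 0, 0, 2],
    "g4": [1, -1, 0, -1, 1],
    "g5": [0, 1, -2, 1, 0],
    "g6": [2, 1, 0, -1, -2],
}
t_stat = gene_set_statistic(expr, pc, {"g1", "g2", "g3"})
```

Because gene-level statistics within a set are typically correlated, an unadjusted t-statistic like this one tends to overstate significance, which is the motivation for the correlation-adjusted version mentioned in the abstract.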


Phylogenetic Stochastic Mapping without Matrix Exponentiation

Jan Irvahn, Vladimir N. Minin

Phylogenetic stochastic mapping is a method for reconstructing the history of trait changes on a phylogenetic tree relating species/organisms carrying the trait. State-of-the-art methods assume that the trait evolves according to a continuous-time Markov chain (CTMC) and work well for small state spaces. The computations slow down considerably for larger state spaces (e.g. space of codons), because current methodology relies on exponentiating CTMC infinitesimal rate matrices — an operation whose computational complexity grows as the size of the CTMC state space cubed. In this work, we introduce a new approach, based on a CTMC technique called uniformization, that does not use matrix exponentiation for phylogenetic stochastic mapping. Our method is based on a new Markov chain Monte Carlo (MCMC) algorithm that targets the distribution of trait histories conditional on the trait data observed at the tips of the tree. The computational complexity of our MCMC method grows as the size of the CTMC state space squared. Moreover, in contrast to competing matrix exponentiation methods, if the rate matrix is sparse, we can leverage this sparsity and increase the computational efficiency of our algorithm further. Using simulated data, we illustrate advantages of our MCMC algorithm and investigate how large the state space needs to be for our method to outperform matrix exponentiation approaches. We show that even on the moderately large state space of codons our MCMC method can be significantly faster than currently used matrix exponentiation methods.
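
Uniformization itself can be sketched briefly. The example below is not the authors' MCMC sampler; it only shows the standard identity that lets CTMC transition probabilities be computed without a matrix exponential: with Λ at least as large as the fastest exit rate and B = I + Q/Λ, the transition matrix is a Poisson(Λt)-weighted sum of powers of B. The function name, the tolerance, and the term cap are illustrative choices. Each term costs one vector-matrix product.

```python
import math


def transition_row(Q, start, t, tol=1e-12, max_terms=100000):
    """Row `start` of P(t) for rate matrix Q, via uniformization.

    P(t) = sum_k Poisson(k; lam * t) * B^k, where B = I + Q / lam.
    Only vector-matrix products are used; no matrix exponential is formed.
    """
    n = len(Q)
    lam = max(-Q[i][i] for i in range(n))  # uniformization rate
    v = [1.0 if j == start else 0.0 for j in range(n)]
    if lam == 0.0:
        return v  # no transitions are possible
    B = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
         for i in range(n)]
    weight = math.exp(-lam * t)  # Poisson probability of zero jumps
    cumulative = 0.0
    result = [0.0] * n
    for k in range(max_terms):
        for j in range(n):
            result[j] += weight * v[j]
        cumulative += weight
        if 1.0 - cumulative < tol:  # remaining Poisson mass is negligible
            break
        # Advance to the next term: v <- v B, weight <- Poisson(k + 1).
        v = [sum(v[i] * B[i][j] for i in range(n)) for j in range(n)]
        weight *= lam * t / (k + 1)
    return result


# Two-state chain with rates 0 -> 1 of 1.0 and 1 -> 0 of 2.0.
Q = [[-1.0, 1.0], [2.0, -2.0]]
row = transition_row(Q, start=0, t=0.5)
```

For a two-state chain the result can be checked against the closed form P_00(t) = b/(a+b) + a/(a+b) exp(-(a+b)t), where a and b are the 0 -> 1 and 1 -> 0 rates. When Q is sparse, the inner product can skip zero entries, which is the kind of saving the abstract alludes to.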

Epigenetic Modifications are Associated with Inter-species Gene Expression Variation in Primates

Xiang Zhou, Carolyn Cain, Marsha Myrthil, Noah Lewellen, Katelyn Michelini, Emily Davenport, Matthew Stephens, Jonathan Pritchard, Yoav Gilad

Changes in gene regulation have long been thought to play an important role in evolution and speciation, especially in primates. Over the past decade, comparative genomic studies have revealed extensive inter-species differences in gene expression levels, yet we know much less about the extent to which regulatory mechanisms differ between species. To begin addressing this gap, we performed a comparative epigenetic study in primate lymphoblastoid cell lines (LCLs) to query the contribution of RNA polymerase II (Pol II) and four histone modifications (H3K4me1, H3K4me3, H3K27ac, and H3K27me3) to inter-species variation in gene expression levels. We found that inter-species differences in mark enrichment near transcription start sites are significantly more often associated with inter-species differences in the corresponding gene expression level than expected by chance alone. Interestingly, we also found that first-order interactions among the histone marks and Pol II do not markedly contribute to the degree of association between the marks and inter-species variation in gene expression levels, suggesting that the marginal effects of the five marks dominate this contribution.

The Role of Migration in the Evolution of Phenotypic Switching

Oana Carja, Robert E Furrow, Marc W Feldman

Stochastic switching is an example of phenotypic bet-hedging, where an individual can switch between different phenotypic states in a fluctuating environment. Although the evolution of stochastic switching has been studied when the environment varies temporally, there has been little theoretical work on the evolution of phenotypic switching under both spatially and temporally fluctuating selection pressures. Here we use a population genetic model to explore the interaction of temporal and spatial variation in the evolutionary dynamics of phenotypic switching. We find that spatial variation in selection is important; when selection pressures are similar across space, migration can decrease the rate of switching, but when selection pressures differ spatially, increasing migration between demes can facilitate the evolution of higher rates of switching. These results may help explain the diverse array of non-genetic contributions to phenotypic variability and phenotypic inheritance observed in both wild and experimental populations.

Identifying recombination hotspots using population genetic data

Adam Auton, Simon Myers, Gil McVean
(Submitted on 17 Mar 2014)

Motivation: Recombination rates vary considerably at the fine scale within mammalian genomes, with the majority of recombination occurring within hotspots of ~2 kb in width. We present a method for inferring the location of recombination hotspots from patterns of linkage disequilibrium within samples of population genetic data.
Results: Using simulations, we show that our method has hotspot detection power of approximately 50-60%, depending on the magnitude of the hotspot. The false positive rate is between 0.24 and 0.56 false positives per Mb for data typical of humans.
Availability: this http URL

Horizontal Transfers and Gene Losses in the phospholipid pathway of Bartonella reveal clues about early ecological niches

Qiyun Zhu, Michael Kosoy, Kevin J Olival, Katharina Dittmar

Bartonellae are mammalian pathogens vectored by blood-feeding arthropods. Although they are of increasing medical importance, little is known about their ecological past, and host associations are underexplored. Previous studies suggest that horizontal gene transfers influenced ecological niche colonization through the acquisition of host pathogenicity genes. Here we expand these analyses to the metabolic pathways of 28 Bartonella genomes, and experimentally explore the distribution of bartonellae in 21 species of blood-feeding arthropods. Across genomes, repeated gene losses and horizontal gains in the phospholipid pathway were found. The evolutionary timing of these patterns suggests functional consequences likely leading to an early intracellular lifestyle for stem bartonellae. Comparative phylogenomic analyses reveal three independent lineage-specific reacquisitions of a core metabolic gene, NAD(P)H-dependent glycerol-3-phosphate dehydrogenase (gpsA), from Gammaproteobacteria and Epsilonproteobacteria. Transferred genes are significantly closely related to invertebrate Arsenophonus- and Serratia-like endosymbionts, and mammalian Helicobacter-like pathogens, supporting a cellular association with arthropods and mammals at the base of extant bartonellae. Our studies suggest that the horizontal reacquisitions had a key impact on the lineage-specific ecological and functional evolution of bartonellae.

Analysis of stop-gain and frameshift variants in human innate immunity genes

Antonio Rausell, Pejman Mohammadi, Paul J McLaren, Ioannis Xenarios, Jacques Fellay, Amalio Telenti

Loss-of-function variants in innate immunity genes are associated with Mendelian disorders in the form of primary immunodeficiencies. Recent resequencing projects report that stop-gains and frameshifts are collectively prevalent in humans and could be responsible for some of the inter-individual variability in innate immune response. Current computational approaches evaluating loss-of-function in genes carrying these variants rely on gene-level characteristics such as evolutionary conservation and functional redundancy across the genome. However, innate immunity genes represent a particular case because they are more likely to be under positive selection and duplicated. To create a ranking of severity that would be applicable to the innate immunity genes, we first evaluated 17764 stop-gain and 13915 frameshift variants from the NHLBI Exome Sequencing Project and 1000 Genomes Project. Sequence-based features such as loss of functional domains, isoform-specific truncation and nonsense-mediated decay were found to correlate with variant allele frequency and were validated with gene expression data. We integrated these features in a Bayesian classification scheme and benchmarked its use in predicting pathogenic variants against OMIM disease stop-gains and frameshifts. The classification scheme was applied in the assessment of 335 stop-gains and 236 frameshifts affecting 227 interferon-stimulated genes. The sequence-based score ranks variants in innate immunity genes according to their potential to cause disease, and complements existing gene-based pathogenicity scores.

Markov mutation models on Yule trees: pairwise species comparisons

Willem H. Mulder, Forrest W. Crawford
Subjects: Populations and Evolution (q-bio.PE)

Efforts to reconstruct phylogenetic trees and understand evolutionary processes depend fundamentally on stochastic models of speciation and mutation. The simplest continuous-time model for speciation in phylogenetic trees is the Yule process, in which new species are “born” from existing lineages at a constant rate. Recent work has illuminated some of the structural properties of Yule trees, but it remains mostly unknown how these properties affect sequence and trait patterns observed at the tips of the phylogenetic tree. Understanding the interplay between speciation and mutation under simple models of evolution is essential for deriving valid phylogenetic inference methods and gives insight into the optimal design of phylogenetic studies. In this work, we derive the probability distribution of interspecies covariance under Brownian motion and Ornstein-Uhlenbeck processes on a Yule tree. We compute the probability distribution of the number of mutations shared between two randomly chosen taxa in a Yule tree under several mutation models. These results suggest summary measures of phylogenetic information content, illuminate the correlation between site patterns in sequences or traits of related organisms, and provide heuristics for experimental design and reconstruction of phylogenetic trees.
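
The Yule process mentioned in this abstract is simple enough to simulate directly. The sketch below tracks only the number of lineages, not the tree topology or any mutation model, so it does not reproduce the paper's results. It assumes a single starting lineage and relies on the fact that with n lineages each splitting at rate λ, the waiting time to the next split is exponential with rate nλ. The function names are hypothetical.

```python
import math
import random


def simulate_yule_lineage_count(birth_rate, t_end, rng):
    """Number of lineages at time t_end in a Yule process started from one lineage."""
    n = 1
    t = 0.0
    while True:
        # With n lineages, the next speciation arrives at total rate n * birth_rate.
        t += rng.expovariate(n * birth_rate)
        if t > t_end:
            return n
        n += 1


def expected_lineage_count(birth_rate, t_end):
    """Mean lineage count of a pure-birth process started from one lineage."""
    return math.exp(birth_rate * t_end)


rng = random.Random(1)  # fixed seed so the run is repeatable
counts = [simulate_yule_lineage_count(1.0, 1.0, rng) for _ in range(20000)]
sample_mean = sum(counts) / len(counts)
```

The sample mean should land close to `expected_lineage_count(1.0, 1.0)`, since the lineage count at time t has mean exp(λt).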

Gaussian process test for high-throughput sequencing time series: application to experimental evolution

Hande Topa, Ágnes Jónás, Robert Kofler, Carolin Kosiol, Antti Honkela
Comments: 26 pages, 13 figures
Subjects: Populations and Evolution (q-bio.PE); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM); Applications (stat.AP)

Motivation: Recent advances in high-throughput sequencing (HTS) have made it possible to monitor genomes in great detail. New experiments use HTS not only to measure genomic features at one time point but also to monitor how they change over time, with the aim of identifying significant changes in their abundance. In population genetics, for example, allele frequencies are monitored over time to detect significant frequency changes that indicate selection pressures. Previous attempts at analysing data from HTS experiments have been limited as they could not simultaneously include data at intermediate time points, replicate experiments and sources of uncertainty specific to HTS such as sequencing depth.
Results: We present the beta-binomial Gaussian process (BBGP) model for ranking features with significant non-random variation in abundance over time. The features are assumed to represent proportions, such as the proportion of an alternative allele in a population. We use the beta-binomial model to capture the uncertainty arising from finite sequencing depth and combine it with a Gaussian process model over the time series. In simulations that mimic the features of experimental evolution data, the proposed method clearly outperforms classical testing in average precision of finding selected alleles. We also present results on real data from a Drosophila experimental evolution experiment on temperature adaptation.
Availability: R software implementing the test is available at
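
To make the beta-binomial component concrete, here is a minimal sketch of its probability mass function: the probability of seeing k alternative-allele reads out of n total reads when the underlying allele proportion follows a Beta(alpha, beta) distribution. This is only the textbook distribution, not the BBGP model or the authors' R software; the function name and the parameter values are illustrative.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_logpmf(k, n, alpha, beta):
    """log P(k reads of the alternative allele | n reads, Beta(alpha, beta) proportion)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


# Probability of each possible read count at sequencing depth 10.
depth = 10
pmf = [math.exp(beta_binomial_logpmf(k, depth, 2.0, 5.0)) for k in range(depth + 1)]
```

Lower sequencing depth gives a coarser, more dispersed distribution over the observed proportion k/n, which is the depth-dependent uncertainty the abstract refers to.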

Genetic influences on translation in yeast

Frank W. Albert, Dale Muzzey, Jonathan Weissman, Leonid Kruglyak
(Submitted on 13 Mar 2014)

Heritable differences in gene expression between individuals are an important source of phenotypic variation. The question of how closely the effects of genetic variation on protein levels mirror those on mRNA levels remains open. Here, we addressed this question by using ribosome profiling to examine how genetic differences between two strains of the yeast S. cerevisiae affect translation. Strain differences in translation were observed for hundreds of genes, more than half as many as showed genetic differences in mRNA levels. Similarly, allele-specific measurements in the diploid hybrid between the two strains revealed roughly half as many cis-acting effects on translation as were observed for mRNA levels. In both the parents and the hybrid, strong effects on translation were rare, such that the direction of an mRNA difference was typically reflected in a concordant footprint difference. The relative importance of cis- and trans-acting variation on footprint levels was similar to that for mRNA levels. Across all expressed genes, there was a tendency for translation to reinforce mRNA differences more often than to buffer them, resulting in footprint differences with greater magnitudes than the mRNA differences. A reanalysis of two earlier studies that reported translational buffering between two yeast species showed that translational reinforcement is in fact more common between these species, consistent with our results. Finally, we catalogued instances of premature translation termination in the two yeast strains. Overall, genetic variation clearly influences translation, but primarily does so by subtly modulating differences in mRNA levels. Translation does not appear to create strong discrepancies between genetic influences on mRNA and protein levels.