Efficient Algorithms for de novo Assembly of Alternative Splicing Events from RNA-seq Data

Gustavo Sacomoto
(Submitted on 23 Jun 2014)

In this thesis, we address the problem of identifying and quantifying variants (alternative splicing and genomic polymorphism) in RNA-seq data when no reference genome is available, without assembling the full transcripts. Based on the fundamental idea that each variant corresponds to a recognizable pattern, a bubble, in a de Bruijn graph constructed from the RNA-seq reads, we propose a general model for all variants in such graphs. We then introduce an exact method, called KisSplice, to extract alternative splicing events. Finally, we show that it identifies more correct events than general-purpose transcriptome assemblers.
In order to deal with ever-increasing volumes of NGS data, we made an extra effort to make KisSplice as scalable as possible. First, to improve its running time, we propose a new polynomial delay algorithm to enumerate bubbles. We show that it is several orders of magnitude faster than previous approaches. Then, to reduce its memory consumption, we propose a new compact way to build and represent a de Bruijn graph. We show that our approach uses 30% to 40% less memory than the state of the art, with an insignificant impact on the construction time.
Additionally, we apply the techniques developed to list bubbles to two classical problems: cycle enumeration and the K-shortest paths problem. We give the first optimal algorithm to list cycles in undirected graphs, improving over Johnson’s algorithm. This is the first improvement to this problem in almost 40 years. We then consider a different parameterization of the classical K-shortest (simple) paths problem: instead of bounding the number of st-paths, we bound the weight of the st-paths. We present new algorithms with the same time complexities but using exponentially less memory than previous approaches.
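
To make the bubble idea concrete, here is a minimal Python sketch of a de Bruijn graph built from two toy reads that differ by one base. It is an illustration only, not KisSplice's data structure or enumeration algorithm: the function names `build_de_bruijn` and `branching_nodes` are hypothetical, reverse complements and k-mer counts are ignored, and the reads are invented.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: its prefix -> its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor: candidate opening points of a bubble."""
    return sorted(node for node, succs in graph.items() if len(succs) > 1)


# Two toy sequences that differ by a single base (a SNP-like variant).
reads = ["ACGTACCTGA", "ACGTAGCTGA"]
graph = build_de_bruijn(reads, k=4)
print(branching_nodes(graph))
```

In this toy graph the two sequences share a path, diverge at one node, and rejoin a few nodes later; that pair of parallel paths is the kind of pattern the abstract calls a bubble.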

Autosomal admixture levels are informative about sex bias in admixed populations

Amy Goldberg, Paul Verdu, Noah A Rosenberg

Sex-biased admixture has been observed in a wide variety of admixed populations. Genetic variation in sex chromosomes and ratios of quantities computed from sex chromosomes and autosomes have often been examined in order to infer patterns of sex-biased admixture, typically using statistical approaches that do not mechanistically model the complexity of a sex-specific history of admixture. Here, expanding on a model of Verdu & Rosenberg (2011) that did not include sex specificity, we develop a model that mechanistically examines sex-specific admixture histories. Under the model, multiple source populations contribute to an admixed population, potentially with their male and female contributions varying over time. In an admixed population descended from two source groups, we derive the moments of the distribution of the autosomal admixture fraction from a specific source population as a function of sex-specific introgression parameters and time. Considering admixture processes that are constant in time, we demonstrate that, surprisingly, although the mean autosomal admixture fraction from a specific source population does not reveal a sex bias in the admixture history, the variance of autosomal admixture is informative about sex bias. Specifically, the long-term variance decreases as the sex bias from a contributing source population increases. This result can be viewed as analogous to the reduction in effective population size for populations with an unequal number of breeding males and females. Our approach can contribute to methods for inference of the history of complex sex-biased admixture processes by enabling consideration of the effect of sex-biased admixture on autosomal DNA.
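
For readers unfamiliar with the analogy in the abstract, the textbook expression for effective population size with unequal numbers of breeding males and females is given below. It is the standard population-genetics result, not a formula taken from this paper; the symbols N_m and N_f (numbers of breeding males and females) are assumed notation.

```latex
N_e = \frac{4 N_m N_f}{N_m + N_f}
```

With N_m = N_f the expression equals the census size N_m + N_f, and any imbalance lowers it: for example, N_m = 10 and N_f = 90 give N_e = 36.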

Are phylogenetic patterns the same in anthropology and biology?

David Morrison

The use of phylogenetic methods in anthropological fields such as archaeology, linguistics and stemmatology (involving what are often called “culture data”) is based on an analogy between human cultural evolution and biological evolution. We need to understand this analogy thoroughly, including how well anthropology data fit the model of a phylogenetic tree, as used in biology. I provide a direct comparison of anthropology datasets with both phenotype and genotype datasets from biology. The anthropology datasets fit the tree model approximately as well as do the genotype data, which is detectably worse than the fit of the phenotype data. This is true for datasets with <500 parsimony-informative characters, as well as for larger datasets. This implies that cross-cultural (horizontal) processes have been important in the evolution of cultural artifacts, as well as branching historical (vertical) processes, and thus a phylogenetic network will be a more appropriate model than a phylogenetic tree.

Fixation in finite populations evolving in fluctuating environments

Peter Ashcroft, Philipp M Altrock, Tobias Galla
(Submitted on 21 Jun 2014)

The environment in which a population evolves can have a crucial impact on selection. We study evolutionary dynamics in finite populations of fixed size in a changing environment. The population dynamics are driven by birth and death events. The rates of these events may vary in time depending on the state of the environment, which follows an independent Markov process. We develop a general theory for the fixation probability of a mutant in a population of wild-types, and for unconditional and conditional mean fixation times. We apply our theory to evolutionary games for which the payoff structure varies in time. The mutant can exploit the environmental noise; a dynamic environment that switches between two states can lead to a probability of fixation that is higher than in any of the individual environmental states. We provide an intuitive interpretation of this surprising effect. We also investigate stationary distributions of the population when mutations are more frequent. In this regime, we find two approximations of the stationary measure. One works well for rapid switching, the other for slowly fluctuating environments.
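
As a point of reference for the abstract above, the following Python sketch computes the fixation probability of a birth-death process in a single, static environment using the standard closed-form result. It is not the authors' theory for switching environments; the names `fixation_probability` and `moran_rates` are hypothetical, and the Moran process with constant mutant fitness `r` is an assumed example.

```python
def fixation_probability(n, birth, death, i=1):
    """Probability that i mutants take over a population of fixed size n.

    birth(j) and death(j) are the rates of moving from j to j+1 and from
    j to j-1 mutants, for 0 < j < n, in a single static environment.
    """
    terms = [1.0]  # terms[k] = product of death(j) / birth(j) for j = 1..k
    product = 1.0
    for j in range(1, n):
        product *= death(j) / birth(j)
        terms.append(product)
    # Standard result: partial sum up to i-1 divided by the full sum.
    return sum(terms[:i]) / sum(terms)


def moran_rates(n, r):
    """Birth and death rates of a Moran process with mutant fitness r (assumed example)."""
    def birth(j):
        return (r * j) / (r * j + n - j) * (n - j) / n

    def death(j):
        return (n - j) / (r * j + n - j) * j / n

    return birth, death


neutral = fixation_probability(10, *moran_rates(10, 1.0))
advantaged = fixation_probability(10, *moran_rates(10, 2.0))
print(neutral, advantaged)
```

In the neutral case the result reduces to 1/n, which gives a baseline against which the effect of environmental switching described in the abstract can be compared.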

Hsp90 promotes kinase evolution

Jennifer Lachowiec, Tzitziki Lemus, Elhanan Borenstein, Christine Queitsch

Heat-shock protein 90 (Hsp90) promotes the maturation and stability of its client proteins, including many kinases. In doing so, Hsp90 may allow its clients to accumulate mutations as previously proposed by the capacitor hypothesis. If true, Hsp90 clients should show increased evolutionary rate compared to non-clients; however, other factors, such as gene expression and protein connectivity, may confound or obscure the chaperone’s putative contribution. Here, we compared the evolutionary rates of many Hsp90 clients and non-clients in the human protein kinase superfamily. We show that Hsp90 client status promotes evolutionary rate independently of, but in a similar magnitude to, gene expression and protein connectivity. Hsp90’s effect on kinase evolutionary rate was detected across mammals and increased with time of divergence. Hsp90 clients also showed increased nucleotide diversity and harbored more damaging variation than non-client kinases across humans. These results are consistent with the central argument of the capacitor hypothesis that interaction with the chaperone allows its clients to harbor genetic variation. Hsp90 client status is thought to be highly dynamic with as few as one amino acid change rendering a protein dependent on the chaperone. Contrary to this expectation, we found that across protein kinase phylogeny Hsp90 client status tends to be gained, maintained, and shared among closely related kinases. We also infer that the ancestral protein kinase was not an Hsp90 client. Taken together, our results suggest that Hsp90 played an important role in shaping the kinase superfamily.

Approximation to the distribution of fitness effects across functional categories in human segregating polymorphisms

Fernando Racimo, Joshua G Schraiber

Quantifying the proportion of polymorphic mutations that are deleterious or neutral is of fundamental importance to our understanding of evolution, disease genetics and the maintenance of variation genome-wide. Here, we develop an approximation to the distribution of fitness effects (DFE) of segregating single-nucleotide mutations in humans. Unlike previous methods, we do not assume that synonymous mutations are neutral or not strongly selected, and we do not rely on fitting the DFE of all new nonsynonymous mutations to a single probability distribution, which is poorly motivated on a biological level. We rely on a previously developed method that utilizes a variety of published annotations (including conservation scores, protein deleteriousness estimates and regulatory data) to score all mutations in the human genome based on how likely they are to be affected by negative selection, controlling for mutation rate. We map this score to a scale of fitness coefficients via maximum likelihood using diffusion theory and a Poisson random field model on SNP data. Our method serves to approximate the deleterious DFE of mutations that are segregating, regardless of their genomic consequence. We can then compare the proportion of mutations that are negatively selected or neutral across various categories, including different types of regulatory sites. We observe that the distribution of intergenic polymorphisms is highly peaked at neutrality, while the distribution of nonsynonymous polymorphisms is bimodal, with a neutral peak and a second peak at s ≈ −10^(−4). Other types of polymorphisms have shapes that fall roughly in between these two. We find that transcriptional start sites, strong CTCF-enriched elements and enhancers are the regulatory categories with the largest proportion of deleterious polymorphisms.

Assessing Technical Performance in Differential Gene Expression Experiments with External Spike-in RNA Control Ratio Mixtures

Sarah A. Munro, Steve P. Lund, P. Scott Pine, Hans Binder, Djork-Arné Clevert, Ana Conesa, Joaquin Dopazo, Mario Fasold, Sepp Hochreiter, Huixiao Hong, Nederah Jafari, David P. Kreil, Paweł P. Łabaj, Sheng Li, Yang Liao, Simon Lin, Joseph Meehan, Christopher E. Mason, Javier Santoyo, Robert A. Setterquist, Leming Shi, Wei Shi, Gordon K. Smyth, Nancy Stralis-Pavese, Zhenqiang Su, Weida Tong, Charles Wang, Jian Wang, Joshua Xu, Zhan Ye, Yong Yang, Ying Yu, Marc Salit
(Submitted on 18 Jun 2014)

There is a critical need for standard approaches to assess, report, and compare the technical performance of genome-scale differential gene expression experiments. We assess technical performance with a proposed “standard” dashboard of metrics derived from analysis of external spike-in RNA control ratio mixtures. These control ratio mixtures with defined abundance ratios enable assessment of diagnostic performance of differentially expressed transcript lists, limit of detection of ratio (LODR) estimates, and expression ratio variability and measurement bias. The performance metrics suite is applicable to analysis of a typical experiment, and here we also apply these metrics to evaluate technical performance among laboratories. An interlaboratory study using identical samples shared amongst 12 laboratories with three different measurement processes demonstrated generally consistent diagnostic power across 11 laboratories. Ratio measurement variability and bias were also comparable amongst laboratories for the same measurement process. Different biases were observed for measurement processes using different mRNA enrichment protocols.

Parametric Inference using Persistence Diagrams: A Case Study in Population Genetics

Kevin Emmett, Daniel Rosenbloom, Pablo Camara, Raul Rabadan
(Submitted on 18 Jun 2014)

Persistent homology computes topological invariants from point cloud data. Recent work has focused on developing statistical methods for data analysis in this framework. We show that, in certain models, parametric inference can be performed using statistics defined on the computed invariants. We develop this idea with a model from population genetics, the coalescent with recombination. We apply our model to an influenza dataset, identifying two scales of topological structure which have a distinct biological interpretation.

Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes

Julia Chifman, Laura Kubatko
(Submitted on 18 Jun 2014)

The inference of the evolutionary history of a collection of organisms is a problem of fundamental importance in evolutionary biology. The abundance of DNA sequence data arising from genome sequencing projects has led to significant challenges in the inference of these phylogenetic relationships. Among these challenges is the inference of the evolutionary history of a collection of species based on sequence information from several distinct genes sampled throughout the genome. It is widely accepted that each individual gene has its own phylogeny, which may not agree with the species tree. Many possible causes of this gene tree incongruence are known. The best studied is incomplete lineage sorting, which is commonly modeled by the coalescent process. Numerous methods based on the coalescent process have been proposed for estimation of the phylogenetic species tree given multi-locus DNA sequence data. However, use of these methods assumes that the phylogenetic species tree can be identified from DNA sequence data at the leaves of the tree, although this has not been formally established. We prove that the unrooted topology of the n-leaf phylogenetic species tree is generically identifiable given observed data at the leaves of the tree that are assumed to have arisen from the coalescent process with time-reversible substitution.

The overdue promise of short tandem repeat variation for heritability

Maximilian Press, Keisha D. Carlson, Christine Queitsch

Short tandem repeat (STR) variation has been proposed as a major explanatory factor in the heritability of complex traits in humans and model organisms. However, we still struggle to incorporate STR variation into genotype-phenotype maps. Here, we review the promise of STRs in contributing to complex trait heritability, and highlight the challenges that STRs pose due to their repetitive nature. We argue that STR variants are more likely than single nucleotide variants to have epistatic interactions, reiterate the need for targeted assays to accurately genotype STRs, and call for more appropriate statistical methods in detecting STR-phenotype associations. Lastly, somatic STR variation within individuals may serve as a read-out of disease susceptibility, and is thus potentially a valuable covariate for future association studies.