Methods to study splicing from high-throughput RNA Sequencing data

Gael P. Alamancos, Eneritz Agirre, Eduardo Eyras
(Submitted on 22 Apr 2013)

The development of novel high-throughput sequencing (HTS) methods for RNA (RNA-Seq) has provided a very powerful means to study splicing under multiple conditions at unprecedented depth. However, the complexity of the information to be analyzed has turned this into a challenging task. In the last few years, a plethora of tools have been developed, allowing researchers to process RNA-Seq data to study the expression of isoforms and splicing events, and their relative changes under different conditions. We provide an overview of the methods available to study splicing from short RNA-Seq data. We group the methods according to the different questions they address: 1) Assignment of the sequencing reads to their likely gene of origin. This is addressed by methods that map reads to the genome and/or to the available gene annotations. 2) Recovering the sequence of splicing events and isoforms. This is addressed by transcript reconstruction and de novo assembly methods. 3) Quantification of events and isoforms. Either after reconstructing transcripts or using an annotation, many methods estimate the expression level or the relative usage of isoforms and/or events. 4) Providing an isoform or event view of differential splicing or expression. These include methods that compare relative event/isoform abundance or isoform expression across two or more conditions. 5) Visualizing splicing regulation. Various tools facilitate the visualization of the RNA-Seq data in the context of alternative splicing. In this review, we do not describe the specific mathematical models behind each method. Our aim is rather to provide an overview that could serve as an entry point for users who need to decide on a suitable tool for a specific analysis. We also attempt to propose a classification of the tools according to the operations they perform, to facilitate the comparison and choice of methods.

The standard lateral gene transfer model is statistically consistent for pectinate four-taxon trees

Andreas Sand, Mike Steel
(Submitted on 22 Apr 2013)

Evolutionary events such as incomplete lineage sorting and lateral gene transfer constitute major problems for inferring species trees from gene trees, as they can sometimes lead to gene trees which conflict with the underlying species tree. One particularly simple and efficient way to infer species trees from gene trees under such conditions is to combine three-taxon analyses for several genes using a majority vote approach. For incomplete lineage sorting this method is known to be statistically consistent; in the case of lateral gene transfer, however, a zone of inconsistency is known to exist for a specific four-taxon tree topology. In this paper we analyze all remaining four-taxon topologies and show that no other inconsistencies exist.

Informed and Automated k-Mer Size Selection for Genome Assembly

Rayan Chikhi, Paul Medvedev
(Submitted on 20 Apr 2013)

Genome assembly tools based on the de Bruijn graph framework rely on a parameter k, which represents a trade-off between several competing effects that are difficult to quantify. There is currently a lack of tools that would automatically estimate the best k to use and/or quickly generate histograms of k-mer abundances that would allow the user to make an informed decision.
We develop a fast and accurate sampling method that constructs approximate abundance histograms with a several orders of magnitude performance improvement over traditional methods. We then present a fast heuristic that uses the generated abundance histograms for putative k values to estimate the best possible value of k. We test the effectiveness of our tool using diverse sequencing datasets and find that its choice of k leads to some of the best assemblies.
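The k-mer abundance histograms at the heart of this approach are easy to illustrate. Below is a minimal exact-counting sketch in Python; note that KmerGenie itself uses a sampling method to approximate the histogram far faster than exact counting like this.

```python
from collections import Counter

def kmer_abundance_histogram(reads, k):
    """Count k-mer occurrences across reads, then histogram the counts.

    Returns a dict mapping abundance -> number of distinct k-mers
    observed that many times.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return dict(Counter(counts.values()))

# Two toy reads: four distinct 4-mers, two seen 3 times and two seen twice.
print(kmer_abundance_histogram(["ACGTACGT", "CGTACGTA"], 4))  # {3: 2, 2: 2}
```

In practice one would compute such a histogram for each candidate k and inspect features such as the error peak at low abundance and the genomic peak at higher abundance.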
Our tool KmerGenie is freely available at: this http URL

Comparing DNA sequence collections by direct comparison of compressed text indexes

Anthony J. Cox, Tobias Jakobi, Giovanna Rosone, Ole B. Schulz-Trieglaff
(Submitted on 19 Apr 2013)

Popular sequence alignment tools such as BWA convert a reference genome to an indexing data structure based on the Burrows-Wheeler Transform (BWT), from which matches to individual query sequences can be rapidly determined. However, the utility of also indexing the query sequences themselves remains relatively unexplored.
Here we show that an all-against-all comparison of two sequence collections can be computed from the BWT of each collection with the BWTs held entirely in external memory, i.e. on disk and not in RAM. As an application of this technique, we show that BWTs of transcriptomic and genomic reads can be compared to obtain reference-free predictions of splice junctions that have high overlap with results from more standard reference-based methods.
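As a reminder of the underlying transform, here is a naive in-memory BWT construction via sorted cyclic rotations. This is purely illustrative: the approach described above instead builds and compares the BWTs of entire read collections in external memory, using far more sophisticated algorithms.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform via sorted cyclic rotations.

    Naive O(n^2 log n) construction for illustration only; production
    tools use linear-time, external-memory algorithms.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("GATTACA"))  # ACTGA$TA
```

The key property exploited by index-based comparison is that the BWT groups together the characters preceding equal suffixes, so shared substrings between two collections show up as alignable runs in their respective transforms.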
Code to construct and compare the BWT of large genomic data sets is available at this http URL as part of the BEETL library.

Our paper: Inferring non-neutral regulatory change in pathways from transcriptional profiling data

This post is by Josh Schraiber on his paper (along with coauthors): Schraiber et al. Inferring non-neutral regulatory change in pathways from transcriptional profiling data arXived here.

We’ve known for a long time now that gene sequence alone does not determine phenotype. From the trivial example of differentiated cell types (which all have the same DNA) to now-common examples where species adapt to their environment by changing something other than protein-coding sequence, it’s clear that the expression level of a gene plays just as important a role in phenotypic development as does its sequence. Despite this fact, we still lack tools for detecting non-neutral evolution at the level of gene expression of the kind widely available for sequence evolution (in packages like PAML). Part of this problem lies in a fundamental lack of power. A single gene may have hundreds of sites, and the patterns that occur at all of those sites give us plenty of information to learn about accelerated substitution rates and the like. But a gene (in a given environment) has just one expression level, so the sample size is often small and power is reduced.

This same problem occurs, of course, in phylogenetic studies of quantitative characters at the organismal level. The difference is that in those cases, researchers typically have access to tens, if not hundreds, of species with good quality measurements. Unfortunately, transcriptome-wide gene expression data can be difficult and costly to collect, so large-scale studies are few and far between.

Instead of trying to leverage large collections of species, we sought to utilize one of the benefits of transcriptome-wide profiles: data from lots and lots of genes. A common practice in molecular evolution is to run tests for selection on a gene-by-gene basis and then look for functional groups that are overrepresented (e.g. Gene Ontology enrichment). We turned that around and instead started with a priori defined gene groups (in our case, from Gene Ontology), looking to detect the signal of a history of lineage-specific gene expression evolution by analyzing all the genes in a group jointly.

Doing this would potentially run into a problem of overfitting: should we try to fit a separate rate of evolution for each gene in the group? Instead, we borrowed a page from Ziheng Yang’s book and assumed that the rate of evolution across genes was inverse-gamma distributed. We chose this distribution mostly for computational convenience, but it is important to note that it can cover a wide range of possibilities—from a model in which every gene evolves at the same rate to a distribution so fat-tailed that there is no average rate of evolution across the group! By fitting a distribution of rates across genes in a group, we are able to look for examples of lineage-specific evolution without being confounded by outlying genes.
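As an illustrative sketch only (not the authors' fitting procedure, and with arbitrary parameter values), per-gene rates can be drawn from an inverse-gamma distribution via the reciprocal of a gamma variate:

```python
import random

def inverse_gamma_rates(alpha, beta, n_genes, rng):
    """Per-gene evolutionary rates drawn from Inverse-Gamma(alpha, beta).

    If X ~ Gamma(shape=alpha, scale=1/beta), then 1/X ~ Inv-Gamma(alpha, beta).
    Small alpha gives a fat tail; for alpha <= 1 the mean rate is not finite,
    matching the extreme end of the range described above.
    """
    return [1.0 / rng.gammavariate(alpha, 1.0 / beta) for _ in range(n_genes)]

rng = random.Random(0)
# For alpha=3, beta=2 the mean rate is beta / (alpha - 1) = 1.
rates = inverse_gamma_rates(alpha=3.0, beta=2.0, n_genes=5, rng=rng)
```

The single pair (alpha, beta) is shared by all genes in the group, which is what avoids fitting a separate rate per gene.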

We encourage you to check out our paper and let us know what you think of our approach. In addition, our method will soon be available as an R package (once I get around to doing all the documentation…) and we would love to see people using it. If you are interested in getting an early version of our package, please don’t hesitate to contact me: jgschraiber@berkeley.edu.

Inferring non-neutral regulatory change in pathways from transcriptional profiling data

Joshua G. Schraiber, Yulia Mostovoy, Tiffany Y. Hsu, Rachel B. Brem
(Submitted on 19 Apr 2013)

An outstanding question in comparative genomics is the evolutionary importance of gene expression differences between species. Rigorous molecular-evolution methods to infer evidence for natural selection from transcriptional profiling data are at a premium in the field, and to date, phylogenetic approaches have not been well-suited to address the question in the small sets of taxa profiled in standard surveys of gene expression. To meet this challenge, we have developed a strategy to infer evolutionary histories from expression data by analyzing suites of genes of common function. In a manner conceptually similar to molecular-evolution models in which the evolutionary rates of DNA sequence at multiple loci follow a gamma distribution, we modeled expression of the genes of an a priori-defined pathway with rates drawn from an inverse-gamma distribution. We then developed a fitting strategy to infer the parameters of this distribution from expression measurements, and to identify gene groups whose expression patterns were consistent with evolutionary constraint or rapid evolution in particular species. Simulations confirmed the power and accuracy of our inference method. As an experimental testbed for our approach, we generated and analyzed transcriptional profiles of four Saccharomyces yeasts. The results revealed pathways with signatures of constrained and accelerated regulatory evolution in individual yeasts, and across the phylogeny, highlighting the prevalence of pathway-level expression change during the divergence of yeast species. We anticipate that our pathway-based phylogenetic approach will be of broad utility in the search to understand the evolutionary relevance of regulatory change.

Our paper: The influence of relatives on the efficiency and error rate of familial searching

This guest post is by Rori Rohlfs on her paper (along with coauthors): Rohlfs et al. The influence of relatives on the efficiency and error rate of familial searching. arXived here.

One of the ways we in the U.S. (and elsewhere) are likely to encounter genetic technologies in our lives is through forensic DNA identification.  Without knowing the specific numbers, it’s clear that a huge number of us encounter forensic uses of DNA through court cases using genetic evidence (as survivors, defendants, jury members, etc.), DNA sample seizure during a stop or arrest (currently being considered by the U.S. Supreme Court), or by being genetically related to someone in an offender or arrestee DNA database (>11 million profiles in the U.S. national database).  Despite the social relevance of forensic uses of DNA, it seems to me that forensic genetics isn’t much discussed by the population and evolutionary genetics crowd these days.

A while back, I became interested in a newer forensic technique known as familial searching, particularly in how some pop gen assumptions affect outcomes.  Familial searching is performed in cases where police have some DNA evidence from an unknown individual they want to identify but have no leads.  First, they’ll search offender/arrestee DNA database(s) for someone with a matching genetic profile (which is verrrry unlikely between unrelated individuals with complete profiles), who they’d then investigate.  In some jurisdictions (where familial searching is legal or practiced without explicit policy), if there’s no complete profile match, they’ll search the database again for a partially matching profile, the idea being that the partial match may be due to a close genetic relationship.  (Of course, two unrelated individuals could reasonably have partially matching profiles by chance.  More on that later.)  Again, depending on the policies of the jurisdiction, the relatives of some number of partially matching individuals are investigated.  In the most high-profile case of familial searching in the U.S., the suspected genetic relative was subject to surreptitious DNA collection (i.e. being followed until leaving a DNA sample (in that case, a pizza crust)).  Then this sample was tested directly against the original unknown sample; it showed a complete profile match, and a suspect was identified.

Because familial searching effectively extends offender/arrestee databases to the genetic kin of people in the databases, it raises important questions.

For a population geneticist, attempting to identify unknown genetic relatives of individuals in the database (rather than the known individuals in the database) introduces more uncertainty, and some additional questions arise:

  • With the genetic information in forensic profiles (typically 13-15 autosomal STRs, sometimes with 17 Y-chromosome STRs), what’s the chance that an unrelated individual coincidentally has a partially matching profile resembling a genetic relative?
  • What background allele and haplotype frequencies are considered in profile likelihood calculations?
  • What statistical methodology will be used to identify [specific?  non-specific?] genetic relatives?

All these questions are especially relevant when considering the intense multiple testing introduced by the relevant databases (>1.4 million profiles in the California offender database).  It can be challenging to get a handle on these questions because policies and methodology vary widely between jurisdictions.  In New York City, it seems that an error-prone ‘allelic matching’ technique has been used to attempt to identify relatives in at least one case of robbery, leading to investigations of unrelated individuals.  In California, by contrast, familial searching is used specifically in cold cases of violent crimes with a continuing threat to public safety, and in 2011 Myers et al. published the likelihood ratio-based test statistic and procedure used in practice.

When I arrived at U.C. Berkeley for my postdoc, I met Monty Slatkin and Yun Song who, along with Erin Murphy, had attempted to estimate some error rates of familial searching, but were stymied by the lack of well-described methods currently used in practice.  When the statistical procedure used by California was published, we were excited to collaborate using practically relevant methodology.  Specifically, we estimated the false positive rate and power of familial searching using the California state procedure.  Generally, we found high power to detect a specified first-degree relationship (0.79 to 0.99) and low (but still substantial in a multiple testing context) false positive rates of calling unrelated individuals as first-degree relatives (<5e-9 to 1e-5).  We then got to thinking about more distant Y-chromosome-sharing relatives (half-siblings, cousins, second cousins) who (barring mutation) share Y-haplotypes and some portion of their autosomal STRs IBD.  We estimated that these distant relatives could be mistaken for close relatives fairly often: in our simulations, 14-42% of half-siblings and 3-18% of first cousins were misidentified as siblings.
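To get intuition for why relatives inflate partial-match counts, here is a deliberately toy simulation (uniform allele frequencies and a share-at-least-one-allele criterion, not the Myers et al. likelihood-ratio procedure) comparing allele sharing for full siblings versus unrelated pairs:

```python
import random

def random_profile(n_loci, n_alleles, rng):
    """Toy STR profile: two alleles drawn uniformly at each locus."""
    return [(rng.randrange(n_alleles), rng.randrange(n_alleles))
            for _ in range(n_loci)]

def sib_pair(n_loci, n_alleles, rng):
    """Full-sib pair: draw two parents per locus and transmit one allele
    from each parent to each sib, independently."""
    sib1, sib2 = [], []
    for _ in range(n_loci):
        mom = (rng.randrange(n_alleles), rng.randrange(n_alleles))
        dad = (rng.randrange(n_alleles), rng.randrange(n_alleles))
        sib1.append((rng.choice(mom), rng.choice(dad)))
        sib2.append((rng.choice(mom), rng.choice(dad)))
    return sib1, sib2

def shared_loci(profile_a, profile_b):
    """Number of loci at which the two profiles share at least one allele."""
    return sum(bool(set(a) & set(b)) for a, b in zip(profile_a, profile_b))

rng = random.Random(0)
s1, s2 = sib_pair(15, 10, rng)
u1, u2 = random_profile(15, 10, rng), random_profile(15, 10, rng)
```

Real analyses must account for skewed allele frequencies, population structure, and the multiple-testing burden of database-scale searches, all of which shift these sharing distributions.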

These rates are non-trivial, especially if you consider the size of databases and the fact that there are more distant relatives than near (so distant relatives are more likely to be present in databases).  Further, some of these genetic relationships are not known (even to the individuals themselves) so are not useful to investigation, but may still be interpreted as evidence of familial involvement, leading to investigation of uninvolved individuals.  Lucky for us, our collaborator Erin Murphy has a background in law and thoughtfully outlined some of the practical ramifications in the introduction and discussion of our paper.  Not the least of which is how extended families and communities in groups which are over-represented in databases (perhaps most obviously African Americans and Latinos) would be disproportionately impacted by misidentification of distant relatives as near relatives.

We hope that this interdisciplinary manuscript broadens sorely needed technical and policy discussions of familial searching.

The Expected Linkage Disequilibrium in Finite Populations Revisited

Ulrike Ober, Alexander Malinowski, Martin Schlather, Henner Simianer
(Submitted on 17 Apr 2013)

The expected level of linkage disequilibrium (LD) in a finite ideal population at equilibrium is of relevance for many applications in population and quantitative genetics. Several recursion formulae have been proposed during the last decades, whose derivations mostly contain heuristic parts and therefore remain mathematically questionable. We propose a more justifiable approach, including an alternative recursion formula for the expected LD. Since the exact formula depends on the distribution of allele frequencies in a very complicated manner, we suggest an approximate solution and analyze its validity extensively in a simulation study. Compared to the widely used formula of Sved, the proposed formula performs better for all parameter constellations considered. We then analyze the expected LD at equilibrium using the theory on discrete-time Markov chains based on the linear recursion formula, with equilibrium being defined as the steady-state of the chain, which finally leads to a formula for the effective population size N_e. An additional analysis considers the effect of non-exactness of a recursion formula on the steady-state, demonstrating that the resulting error in expected LD can be substantial. In an application to the HapMap data of two human populations we illustrate the dependency of the N_e-estimate on the distribution of minor allele frequencies (MAFs), showing that the estimated N_e can vary by up to 30% when a uniform instead of a skewed distribution of MAFs is taken as a basis to select SNPs for the analyses. Our analyses provide new insights into the mathematical complexity of the problem studied.
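For reference, the classical Sved approximation that the proposed formula is compared against, along with the N_e estimator obtained by inverting it, can be written down in a few lines (parameter names are ours, for illustration):

```python
def sved_expected_ld(n_e, c):
    """Sved's (1971) approximation for expected r^2 at drift-recombination
    equilibrium: E[r^2] ~ 1 / (1 + 4 * N_e * c), with N_e the effective
    population size and c the recombination rate between the two loci."""
    return 1.0 / (1.0 + 4.0 * n_e * c)

def n_e_from_ld(r2, c):
    """The familiar N_e estimator obtained by inverting Sved's formula."""
    return (1.0 / r2 - 1.0) / (4.0 * c)

print(sved_expected_ld(1000, 0.01))  # 1/41
```

The abstract's point is that such recursion-based expectations, and hence N_e estimates derived from them, are sensitive both to the exactness of the recursion and to the allele-frequency distribution of the SNPs used.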

XORRO: Rapid Paired-End Read Overlapper

Russell J. Dickson, Gregory B. Gloor
(Submitted on 16 Apr 2013)

Background: Computational analysis of next-generation sequencing data is outpaced by data generation in many cases. In one such case, paired-end reads can be produced by the Illumina sequencing method faster than they can be overlapped by downstream analysis. The advantages in read length and accuracy provided by overlapping paired-end reads demonstrate the need for software that solves this problem efficiently.
Results: XORRO is an extremely efficient paired-end read overlapping program. XORRO can overlap millions of short paired-end reads in a few minutes. It uses 64-bit registers with a two-bit alphabet to represent sequences and performs comparisons using low-level logical operations such as XOR, AND, bit-shifting and popcount.
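The bit-level trick described above can be sketched in Python, using arbitrary-precision integers in place of fixed 64-bit registers; the particular encoding (A=00, C=01, G=10, T=11) is an assumption for illustration, not necessarily XORRO's:

```python
def encode_2bit(seq):
    """Pack a DNA string into an integer, two bits per base."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    value = 0
    for base in seq:
        value = (value << 2) | code[base]
    return value

def mismatches(seq_a, seq_b):
    """Count mismatching positions between two equal-length sequences
    using XOR plus popcount on the packed representation."""
    x = encode_2bit(seq_a) ^ encode_2bit(seq_b)
    # Collapse each 2-bit lane to one flag: a position differs if either
    # of its two bits differs; the mask keeps the low bit of each lane.
    diff = (x | (x >> 1)) & int("01" * len(seq_a), 2)
    return bin(diff).count("1")  # popcount

print(mismatches("ACGT", "ACGA"))  # 1
```

Comparing a candidate overlap this way costs a handful of word-level operations rather than a character-by-character loop, which is where the speed comes from.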
Conclusions: As of the writing of this manuscript, XORRO provides the fastest solution to the paired-end read overlap problem. XORRO is available for download at: sourceforge.net/projects/xorro-overlap/

High-speed and accurate color-space short-read alignment with CUSHAW2

Yongchao Liu, Bernt Popp, Bertil Schmidt
(Submitted on 17 Apr 2013)

Summary: We present an extension of CUSHAW2 for fast and accurate alignments of SOLiD color-space short-reads. Our extension introduces a double-seeding approach to improve mapping sensitivity, by combining maximal exact match seeds and variable-length seeds derived from local alignments. We have compared the performance of CUSHAW2 to SHRiMP2 and BFAST by aligning both simulated and real color-space mate-paired reads to the human genome. The results show that CUSHAW2 achieves comparable or better alignment quality compared to SHRiMP2 and BFAST at an order-of-magnitude faster speed and significantly smaller peak resident memory size.
Availability: CUSHAW2 and all simulated datasets are available at this http URL
Contact: liuy@uni-mainz.de; bertil.schmidt@uni-mainz.de