khmer: Working with Big Data in Bioinformatics
Eric McDonald, C. Titus Brown
(Submitted on 9 Mar 2013)

We introduce design and optimization considerations for the ‘khmer’ package.

Comments: Invited chapter for forthcoming book on Performance of Open Source Applications

Comprehensive Detection of Genes Causing a Phenotype using Phenotype Sequencing and Pathway Analysis
Marc Harper, Luisa Gronenberg, James Liao, Christopher Lee
(Submitted on 3 Mar 2013)

Discovering all the genetic causes of a phenotype is an important goal in functional genomics. In this paper we combine an experimental design for multiple independent detections of the genetic causes of a phenotype, with a high-throughput sequencing analysis that maximizes sensitivity for comprehensively identifying them. Testing this approach on a set of 24 mutant strains generated for a metabolic phenotype with many known genetic causes, we show that this pathway-based phenotype sequencing analysis greatly improves sensitivity of detection compared with previous methods, and reveals a wide range of pathways that can cause this phenotype. We demonstrate our approach on a metabolic re-engineering phenotype, the PEP/OAA metabolic node in E. coli, which is crucial to a substantial number of metabolic pathways and is of renewed interest for biofuel research. Out of 2157 mutations in these strains, pathway-phenoseq discriminated just five gene groups (12 genes) as statistically significant causes of the phenotype. Experimentally, these five gene groups, and the next two high-scoring pathway-phenoseq groups, either have a clear connection to the PEP metabolite level or offer an alternative path of producing oxaloacetate (OAA), and thus clearly explain the phenotype. These high-scoring gene groups also show strong evidence of positive selection pressure, compared with strictly neutral selection in the rest of the genome.
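
The core statistical idea here (does a gene group collect more mutations, pooled across independent mutant strains, than a uniform genome-wide mutation rate would predict?) can be sketched in a few lines. The code below is an illustrative toy, not the authors' pathway-phenoseq implementation; the function name and the example numbers, other than the 2157 genome-wide mutations quoted above and the approximate E. coli genome size, are hypothetical.

```python
# Illustrative toy, not the authors' pathway-phenoseq code: test whether a
# gene group harbors more mutations, pooled across independent mutant
# strains, than expected under a uniform genome-wide mutation rate.
from scipy.stats import poisson

def group_enrichment_pvalue(hits_in_group, group_bp, total_hits, genome_bp):
    """P(observing >= hits_in_group mutations in the group by chance)."""
    expected = total_hits * group_bp / genome_bp  # neutral expectation
    # poisson.sf(k - 1, mu) gives P(X >= k) under X ~ Poisson(mu).
    return poisson.sf(hits_in_group - 1, expected)

# Hypothetical example: a gene group spanning 6 kb hit by 9 of the 2157
# mutations, in a ~4.6 Mbp E. coli genome.
print(group_enrichment_pvalue(9, 6000, 2157, 4.6e6))
```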

Beyond position weight matrices: nucleotide correlations in transcription factor binding sites and their description
Marc Santolini, Thierry Mora, Vincent Hakim
(Submitted on 18 Feb 2013)

The identification of transcription factor binding sites (TFBSs) on genomic DNA is of crucial importance for understanding and predicting regulatory elements in gene networks. TFBS motifs are commonly described by Position Weight Matrices (PWMs), in which each DNA base pair independently contributes to the transcription factor (TF) binding, despite mounting evidence of interdependence between base pair positions. The recent availability of genome-wide data on TF-bound DNA regions offers the possibility to revisit this question in detail for TF binding in vivo. Here, we use available fly and mouse ChIP-seq data, and show that the independent model generally does not reproduce the observed statistics of TFBSs, generalizing previous observations. We further show that TFBS description and predictability can be systematically improved by taking into account pairwise correlations in the TFBS via the principle of maximum entropy. The resulting pairwise interaction model is formally equivalent to the disordered Potts models of statistical mechanics, and it generalizes previous approaches to interdependent positions. Its structure allows for co-variation of two or more base pairs, as well as secondary motifs. Although models consisting of mixtures of PWMs also have this last feature, we show that pairwise interaction models outperform them. The significant pairwise interactions are found to be sparse and to lie dominantly between consecutive base pairs. Finally, the use of a pairwise interaction model for the identification of TFBSs is shown to give significantly different predictions than a model based on independent positions.
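
To make the model class concrete, the sketch below shows how a candidate site would be scored under a pairwise-interaction (Potts-like) model: single-position fields h play the role of a PWM, and couplings J capture interdependence between positions. Fitting h and J by maximum entropy, which is the substance of the paper, is not shown; all parameters here are random toys.

```python
# Toy illustration of the *form* of a pairwise-interaction (Potts-like)
# TFBS model; the maximum-entropy fitting of h and J is not shown.
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def potts_energy(seq, h, J):
    """E(s) = -sum_i h[i, s_i] - sum_{i<j} J[i, j, s_i, s_j].

    h: (L, 4) single-position fields (the PWM-like term)
    J: (L, L, 4, 4) pairwise couplings (all zero recovers a pure PWM)
    """
    s = np.array([BASES[b] for b in seq])
    L = len(s)
    energy = -h[np.arange(L), s].sum()
    for i in range(L):
        for j in range(i + 1, L):
            energy -= J[i, j, s[i], s[j]]
    return energy

# Random toy parameters, just to show the call signature.
rng = np.random.default_rng(0)
L = 8
h = rng.normal(size=(L, 4))
J = 0.1 * rng.normal(size=(L, L, 4, 4))
print(potts_energy("ACGTACGT", h, J))
```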

Count-based differential expression analysis of RNA sequencing data using R and Bioconductor
Simon Anders, Davis J. McCarthy, Yunshen Chen, Michal Okoniewski, Gordon K. Smyth, Wolfgang Huber, Mark D. Robinson
(Submitted on 15 Feb 2013)

RNA sequencing (RNA-seq) has been rapidly adopted for the multilayered profiling of transcriptomes in many areas of biology, including studies into gene regulation, development and disease. Of particular interest is the discovery of differentially expressed genes across different conditions (e.g., tissues, perturbations), while optionally adjusting for other systematic factors that affect the data collection process. There are a number of subtle yet critical aspects of these analyses, such as read counting, appropriate treatment of biological variability, quality control checks and appropriate setup of statistical modeling. Several variations have been presented in the literature, so there is a need for guidance on current best practices. This protocol presents a “state-of-the-art” computational and statistical RNA-seq differential expression analysis workflow largely based on the free open-source R language and Bioconductor software, in particular the two widely used tools DESeq and edgeR. Hands-on time for typical small experiments (e.g., 4-10 samples) can be <1 hour, with computation time <1 day, even with modest resources.
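
As one concrete ingredient of such a workflow, here is a small numpy re-implementation of the median-of-ratios library-size normalization popularized by DESeq, included purely to illustrate the idea; a real analysis would call the R/Bioconductor packages the protocol describes, and the toy count matrix below is hypothetical.

```python
# Toy numpy version of DESeq's median-of-ratios normalization; a real
# analysis would use the R/Bioconductor packages themselves.
import numpy as np

def size_factors(counts):
    """counts: (genes, samples) array of raw read counts."""
    logc = np.log(counts.astype(float))
    # Use only genes with a nonzero count in every sample.
    finite = np.all(np.isfinite(logc), axis=1)
    log_geomean = logc[finite].mean(axis=1)  # per-gene geometric mean (log)
    # Each sample's factor is its median log-ratio to that reference.
    return np.exp(np.median(logc[finite] - log_geomean[:, None], axis=0))

# Hypothetical 4-gene x 3-sample count matrix.
k = np.array([[100, 200, 150], [30, 60, 45], [0, 5, 2], [500, 1000, 700]])
print(size_factors(k))  # sample 2 was sequenced ~2x deeper than sample 1
```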

Equitability, mutual information, and the maximal information coefficient
Justin B. Kinney, Gurinder S. Atwal
(Submitted on 31 Jan 2013)

Reshef et al. recently proposed a new statistical measure, the “maximal information coefficient” (MIC), for quantifying arbitrary dependencies between pairs of stochastic quantities. MIC is based on mutual information, a fundamental quantity in information theory that is widely understood to serve this need. MIC, however, is not an estimate of mutual information. Indeed, it was claimed that MIC possesses a desirable mathematical property called “equitability” that mutual information lacks. This was not proven; instead it was argued solely through the analysis of simulated data. Here we show that this claim, in fact, is incorrect. First we offer mathematical proof that no (non-trivial) dependence measure satisfies the definition of equitability proposed by Reshef et al. We then propose a self-consistent and more general definition of equitability that follows naturally from the Data Processing Inequality. Mutual information satisfies this new definition of equitability while MIC does not. Finally, we show that the simulation evidence offered by Reshef et al. was artifactual. We conclude that estimating mutual information is not only practical for many real-world applications, but also provides a natural solution to the problem of quantifying associations in large data sets.
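
For readers who want to experiment, a naive plug-in estimate of mutual information from binned data takes only a few lines. This fixed-width-binning estimator is deliberately simple and is not the estimator the authors advocate; it is included only to make the quantity tangible.

```python
# Naive plug-in mutual information (in bits) from a fixed-width 2-D
# histogram; simple on purpose, and biased for small samples.
import numpy as np

def mutual_information(x, y, bins=16):
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0  # avoid log(0) on empty cells
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(1)
x = rng.normal(size=5000)
print(mutual_information(x, x + 0.5 * rng.normal(size=5000)))  # dependent: > 0
print(mutual_information(x, rng.normal(size=5000)))            # independent: ~ 0
```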

Equitability Analysis of the Maximal Information Coefficient, with Comparisons
David Reshef, Yakir Reshef, Michael Mitzenmacher, Pardis Sabeti (the first two and the last two authors contributed equally)
(Submitted on 27 Jan 2013)

A measure of dependence is said to be equitable if it gives similar scores to equally noisy relationships of different types. Equitability is important in data exploration when the goal is to identify a relatively small set of strongest associations within a dataset as opposed to finding as many non-zero associations as possible, which often are too many to sift through. Thus an equitable statistic, such as the maximal information coefficient (MIC), can be useful for analyzing high-dimensional data sets. Here, we explore both equitability and the properties of MIC, and discuss several aspects of the theory and practice of MIC. We begin by presenting an intuition behind the equitability of MIC through the exploration of the maximization and normalization steps in its definition. We then examine the speed and optimality of the approximation algorithm used to compute MIC, and suggest some directions for improving both. Finally, we demonstrate in a range of noise models and sample sizes that MIC is more equitable than natural alternatives, such as mutual information estimation and distance correlation.
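
The maximization and normalization steps can be illustrated with a deliberately naive sketch: compute the mutual information of the data on a family of small grids, normalize each score by log2 of the smaller grid dimension, and take the maximum. The real MIC algorithm also optimizes grid placement with dynamic programming, so this equal-width-grid toy understates true MIC; the n^0.6 cell budget follows Reshef et al., and everything else is illustrative.

```python
# Deliberately naive MIC-like score over equal-width grids; the real
# algorithm also optimizes grid placement, so this understates true MIC.
import numpy as np

def grid_mi_bits(x, y, nx, ny):
    pxy, _, _ = np.histogram2d(x, y, bins=(nx, ny))
    pxy /= pxy.sum()
    px, py = pxy.sum(1, keepdims=True), pxy.sum(0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

def naive_mic(x, y):
    n = len(x)
    max_cells = int(n ** 0.6)  # B(n) = n^0.6 cell budget, as in Reshef et al.
    best = 0.0
    for nx in range(2, max_cells // 2 + 1):
        for ny in range(2, max_cells // nx + 1):  # keeps nx * ny <= max_cells
            best = max(best, grid_mi_bits(x, y, nx, ny) / np.log2(min(nx, ny)))
    return best

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 500)
print(naive_mic(x, x ** 2 + 0.1 * rng.normal(size=500)))  # noisy parabola
```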

Our paper: Assemblathon 2 and pizza

Our next guest post is by Keith Bradnam (@kbradnam) on the Assemblathon (@assemblathon) paper: Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. arXived here.

Making pizzas and genome assemblies

In Davis, California there are 18 different establishments that predominantly sell pizzas and I often muse on the important issue of ‘who makes the best pizza?’. It’s a question that is deceptive in its simplicity, but there are many subtleties that lie behind it, most notably: what do we mean by best? The best quality pizza probably. But does quality refer to the best ingredients, the best pizza chef, or the best overall flavor? There are many other pizza-related metrics that could be combined into an equation to decide who makes the best pizza. Such an equation has to factor in the price, size, choice of toppings, quality (however we decide to measure it), ease of ordering, average time to deliver, etc.

Even then, such an equation might have to assume that your needs reflect the typical needs of an average pizza consumer. But what if you have special needs (e.g., you can’t eat gluten) or you have certain constraints (you need a 131 foot pizza to go)? Hopefully, it is clear that the notion of a ‘best’ pizza is highly subjective and the best pizza for one person is almost certainly not going to be the best pizza for someone else.

What is true for ‘making pizzas’ is also largely true for ‘making genome assemblies’. There are probably as many genome assemblers out there as there are pizza establishments in Davis, and people clearly want to know which one is the best. But how do you measure the ‘best’ genome assembly? Many published genome sequences result from a single assembly of next-generation sequencing (NGS) data using a single piece of assembly software. Could you make a better assembly by using different software? Could you make a better assembly just from tweaking the settings of the same software? It is hard to know, and often costly — at least in terms of time and resources — to find out.

That’s where the Assemblathon comes in. The Assemblathon is a contest designed to put a wide range of genome assemblers through their paces; different teams are invited to attempt to assemble the same genome sequence, and we can hopefully point out the notable differences that can arise in the resulting assemblies. Assemblathon 1 used synthetic NGS data, with teams trying to reconstruct a small (~100 Mbp) synthetic genome, i.e. a genome for which we knew what the true answer should look like. For Assemblathon 2 — manuscript now available on arxiv.org — we upped the stakes and made NGS data available for three large vertebrate genomes (a bird, a fish, and a snake). Teams were invited to assemble any or all of the genomes. Teams were also free to use as much or as little of the NGS data as they liked. For the bird species (a budgerigar), the situation was further complicated by the fact that the NGS data comprised reads from three different platforms (Illumina, Roche 454, and Pacific Biosciences). In total we received 43 assemblies from 21 participating teams.

How did we try to make sense of all of these entries, especially when we would never know what the correct answer was for each genome? We were helped by having optical maps for each species, which could be compared to the scaffolds in each genome assembly. We also had some Fosmid sequences for the bird and snake which helped provide a small set of ‘trusted’ reference sequences. In addition to these experimental datasets, we tried employing various statistical methods to assess the quality of the assemblies (such as calculating metrics like the frequently used N50 measure). In the end, we filled a spreadsheet with over 100 different measures for each assembly (many of them related to each other).
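
For readers unfamiliar with N50: it is the length L such that scaffolds (or contigs) of length at least L together contain at least half of the total assembled bases. A minimal sketch, with made-up lengths:

```python
def n50(lengths):
    """N50 of a list of contig/scaffold lengths."""
    lengths = sorted(lengths, reverse=True)
    half, running = sum(lengths) / 2, 0
    for length in lengths:
        running += length
        if running >= half:
            return length

print(n50([100, 80, 60, 40, 20]))  # 80: the two longest cover 180 of 300 bases
```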

From this unwieldy dataset we chose ten key metrics, measures that largely reflected different aspects of an assembly’s quality. Analysis of these key metrics led to two main conclusions — which some may find disappointing:

  1. Assembly quality can vary a lot depending on which metrics you focus on; we found many assemblies that excelled in some of the key metrics, but fared poorly when judged by others.
  2. Assemblers that tended to score well in one species — when averaged across the 10 key metrics — did not always perform as well when assembling the genome of another species.

With respect to the second point, it is important to point out that the genomes of the three species differed with regard to size, repeat content, and heterozygosity. It is perhaps equally important to point out that the NGS data provided for each species differed in terms of insert sizes, read lengths, and abundance. Thus it is hard to ascertain whether inter-species differences in the quality of the assemblies were chiefly influenced by differences in the underlying genomes, the properties of the NGS data that were available, or by a combination of both factors. Further complicating the picture is that not all teams attempted to assemble all three genomes; so in terms of assessing the general usefulness of assembly software, we could only look at the smaller number of teams that submitted entries for two or more species.

In many ways, this manuscript represents some very early, and tentative steps into the world of comparative genome assembler assessments. Much more work needs to be done, and perhaps many more Assemblathons need to be run if we are to best understand what makes a genome assembly a good assembly. Assemblathons are not the only game in town, however, and other efforts like dnGASP and GAGE are important too. It is also good to see that others are leveraging the Assemblathon datasets (the first published analysis of Assemblathon 2 data was not by us!).

So while I can give an answer to the question ‘what is the best genome assembler?’, the answer is probably not going to be to your liking. With our current knowledge, we can say that the best genome assembler is the one that:

  1. you have the expertise to install and run
  2. you have suitable infrastructure (CPU & RAM) to run the assembler
  3. you have sufficient time to run the assembler
  4. is designed to work with the specific mix of NGS data that you have generated
  5. best addresses what you want to get out of a genome assembly (bigger overall assembly, more genes, most accuracy, longer scaffolds, most resolution of haplotypes, most tolerant of repeats, etc.)

Just as it might be hard to find somewhere that sells an inexpensive gluten-free, vegan pizza that’s made with fresh ingredients, has lots of toppings and can be quickly delivered to you at 4:00 am, it may be equally hard to find a genome assembler that ticks all of the boxes that you are interested in. For now at least, it seems that you can’t have your cake — or pizza — and eat it.

Our paper: Unbiased statistical testing of shared genetic control for potentially related traits

This guest post is by Chris Wallace on the preprint Unbiased statistical testing of shared genetic control for potentially related traits, available from the arXiv here. This is a cross-post from her group’s blog.

We have a new paper on arXiv detailing some work on colocalisation analysis, a method to determine whether two traits share a common causal variant. This is of interest in autoimmune disease genetics as the associated loci of so many autoimmune diseases overlap [1], but, for some genes, it appears the causal variants are distinct. It is also relevant for integrating disease association and eQTL data, to understand whether association of a disease to a particular locus is mediated by a variant’s effect on expression of a specific gene, possibly in a specific tissue.

However, determining whether traits share a common causal variant, as opposed to distinct causal variants probably in some LD, is not straightforward. It is well established that regression coefficients are asymptotically unbiased. However, when a SNP has been selected because it is the most associated in a region, its coefficient does tend to be biased away from the null, i.e. its effect is overestimated. Because SNPs need to be selected to describe the association in any region in order to do colocalisation analysis, and because the coefficient bias will differ between datasets, there could be a tendency to call truly colocalising traits distinct. In fact, application of a formal statistical test for colocalisation [2] in a naive manner could have a type 1 error rate around 10-20% for a nominal size of 5%. This of course suggests that our earlier analysis of type 1 diabetes and monocyte gene expression [3] needs to be revised, because it is likely we will have falsely rejected some genes which mediate the type 1 diabetes association in a region.
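
This selection bias (sometimes called the winner’s curse) is easy to reproduce in a toy simulation, with hypothetical numbers not taken from the paper: simulate effect estimates at many SNPs sharing a small true effect, keep the most significant one, and compare its estimated effect size with the truth.

```python
# Toy simulation of the bias from picking the most associated SNP;
# all numbers are hypothetical, and LD between SNPs is ignored.
import numpy as np

rng = np.random.default_rng(3)
n_sims, n_snps, true_beta, se = 2000, 50, 0.05, 0.05
top = np.empty(n_sims)
for i in range(n_sims):
    betahat = true_beta + se * rng.normal(size=n_snps)  # per-SNP estimates
    top[i] = np.abs(betahat).max()  # effect size of the region's "top hit"
print(top.mean(), "vs |true effect|", true_beta)  # markedly inflated
```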

In this paper, we demonstrate two methods to overcome the problem. One, possibly more attractive to frequentists, is to avoid the variable selection by performing the analysis on principal components which summarise the genetic variation in a region. There is an issue with how many components are required, and our simulations suggest enough components need to be selected to capture around 85% of the variation in a region. Obviously, this leads to a huge increase in degrees of freedom but, surprisingly, the power was not much worse compared to our favoured option of averaging p values over the variable selection using Bayesian Model Averaging. The idea of averaging p values is possibly anathema to Bayesians and frequentists alike, but these “posterior predictive p values” do have some history, having been introduced by Rubin in 1984 [4]. If you are prepared to mix Bayesian and frequentist theory sufficiently to average a p value over a posterior distribution (in this case, the posterior is of the SNPs which jointly summarise the association to both traits), it’s quite a nice idea. We used it before [3] as an alternative to taking a profile likelihood approach to dealing with a nuisance parameter, instead calculating p values conditional on the nuisance parameter and averaging over its posterior. In this paper, we show by simulation that it does a good job of maintaining type 1 error and tends to be more powerful than the principal components approach.
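
Schematically, the averaging step is just a posterior-weighted mean of per-model p values. The sketch below illustrates only that arithmetic and is not the paper’s implementation; the function name, the use of log model evidences for the weights, and the example numbers are all hypothetical.

```python
# Toy posterior predictive p value by model averaging; not the paper's code.
import numpy as np

def posterior_predictive_p(p_values, log_evidence):
    """Weight per-model p values by (approximate) posterior model weights."""
    w = np.exp(log_evidence - np.max(log_evidence))  # stable exponentiation
    w /= w.sum()
    return float(np.dot(w, p_values))

# Hypothetical: three candidate SNP selections with their colocalisation
# test p values and log model evidences.
print(posterior_predictive_p([0.04, 0.20, 0.65], [-10.0, -11.5, -14.0]))
```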

There are many questions regarding integration of data from different GWAS that this paper doesn’t address: how to do this on a genomewide basis, for multiple traits, or when samples are not independent (GWAS which share a common set of controls, for example). Thus, it is a small step, but a useful contribution, I think, demonstrating a statistically sound method of investigating potentially shared causal variants in individual loci in detail. And while detailed investigation of individual loci may currently be less fashionable than genomewide analyses, those detailed analyses are crucial for fine-resolution analysis.

Unbiased statistical testing of shared genetic control for potentially related traits
Chris Wallace
(Submitted on 23 Jan 2013)

Integration of data from genomewide single nucleotide polymorphism (SNP) association studies of different traits should allow researchers to disentangle the genetics of potentially related traits within individually associated regions. Methods have ranged from visual comparison of association p values for each trait to formal statistical colocalisation testing of individual regions, which requires selection of a set of SNPs summarizing the association in a region. We show that the SNP selection method greatly affects type 1 error rates, with all published studies to date having used SNP selection methods that result in substantially biased inference. The primary reasons are twofold: random variation in the presence of linkage disequilibrium means selected SNPs do not fully capture the association signal, and selecting SNPs on the basis of significance leads to biased effect size estimates.
We show that unbiased inference can be made either by avoiding variable selection and instead testing the most informative principal components or by integrating over variable selection using Bayesian model averaging. Application to data from Graves’ disease and Hashimoto’s thyroiditis reveals a common genetic signature across seven regions shared between the diseases, and indicates that for five out of six regions which have been significantly associated with one disease and not the other, the lack of evidence in one disease represents genuine absence of association rather than lack of power.

Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species
Keith R. Bradnam (1), Joseph N. Fass (1), Anton Alexandrov (36), Paul Baranay (2), Michael Bechner (39), İnanç Birol (33), Sébastien Boisvert (10,11), Jarrod A. Chapman (20), Guillaume Chapuis (7,9), Rayan Chikhi (7,9), Hamidreza Chitsaz (6), Wen-Chi Chou (14,16), Jacques Corbeil (10,13), Cristian Del Fabbro (17), T. Roderick Docking (33), Richard Durbin (34), Dent Earl (40), Scott Emrich (3), Pavel Fedotov (36), Nuno A. Fonseca (30,35), Ganeshkumar Ganapathy (38), Richard A. Gibbs (32), Sante Gnerre (22), Élénie Godzaridis (11), Steve Goldstein (39), Matthias Haimel (30), Giles Hall (22), David Haussler (40), Joseph B. Hiatt (41), Isaac Y. Ho (20), Jason Howard (38), Martin Hunt (34), Shaun D. Jackman (33), David B. Jaffe (22), Erich Jarvis (38), Huaiyang Jiang (32), et al. (55 additional authors not shown)
(Submitted on 23 Jan 2013)

Background – The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly. Results – In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and a snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies. Conclusions – Many current genome assemblers produced useful assemblies, containing a significant representation of their genes, regulatory sequences, and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.