Equitability Analysis of the Maximal Information Coefficient, with Comparisons

Equitability Analysis of the Maximal Information Coefficient, with Comparisons
David Reshef (1), Yakir Reshef (1), Michael Mitzenmacher (2), Pardis Sabeti (2) (1, 2 – contributed equally)
(Submitted on 27 Jan 2013)

A measure of dependence is said to be equitable if it gives similar scores to equally noisy relationships of different types. Equitability is important in data exploration when the goal is to identify a relatively small set of strongest associations within a dataset as opposed to finding as many non-zero associations as possible, which often are too many to sift through. Thus an equitable statistic, such as the maximal information coefficient (MIC), can be useful for analyzing high-dimensional data sets. Here, we explore both equitability and the properties of MIC, and discuss several aspects of the theory and practice of MIC. We begin by presenting an intuition behind the equitability of MIC through the exploration of the maximization and normalization steps in its definition. We then examine the speed and optimality of the approximation algorithm used to compute MIC, and suggest some directions for improving both. Finally, we demonstrate in a range of noise models and sample sizes that MIC is more equitable than natural alternatives, such as mutual information estimation and distance correlation.


The infinitely many genes model with horizontal gene transfer

The infinitely many genes model with horizontal gene transfer
Franz Baumdicker, Peter Pfaffelhuber
(Submitted on 28 Jan 2013)

The genome of bacterial species is much more flexible than that of eukaryotes. Moreover, the distributed genome hypothesis for bacteria states that the total number of genes present in a bacterial population is greater than the genome of every single individual. The pangenome, i.e. the set of all genes of a bacterial species (or a sample), comprises the core genes which are present in all living individuals, and accessory genes, which are carried only by some individuals. In order to use accessory genes for adaptation to environmental forces, genes can be transferred horizontally between individuals. Here, we extend the infinitely many genes model from Baumdicker, Hess and Pfaffelhuber (2010) for horizontal gene transfer. We take a genealogical view and give a construction — called the Ancestral Gene Transfer Graph — of the joint genealogy of all genes in the pangenome. As application, we compute moments of several statistics (e.g. the number of differences between two individuals and the gene frequency spectrum) under the infinitely many genes model with horizontal gene transfer.

Our paper: Assemblathon 2 and pizza

Our next guest post is by Keith Bradnam (‏@kbradnam) on the Assemblathon (@assemblathon) paper: Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. arXived here.

Making pizzas and genome assemblies

In Davis, California there are 18 different establishments that predominantly sell pizzas and I often muse on the important issue of ‘who makes the best pizza?’. It’s a question that is deceptive in its simplicity, but there are many subtleties that lie behind it, most notably: what do we mean by best? The best quality pizza probably. But does quality refer to the best ingredients, the best pizza chef, or the best overall flavor? There are many other pizza-related metrics that could be combined into an equation to decide who makes the best pizza. Such an equation has to factor in the price, size, choice of toppings, quality (however we decide to measure it), ease of ordering, average time to deliver etc.

Even then, such an equation might have to assume that your needs reflect the typical needs of an average pizza consumer. But what if you have special needs (e.g, you can’t eat gluten) or you have certain constraints (you need a 131 foot pizza to go)? Hopefully, it is clear that the notion of a ‘best’ pizza is highly subjective and the best pizza for one person is almost certainly not going to be the best pizza for someone else.

What is true for ‘making pizzas’ is also largely true for ‘making genome assemblies’. There are probably as many genome assemblers out there as there are pizza establishments in Davis, and people clearly want to know which one is the best. But how do you measure the ‘best’ genome assembly? Many published genome sequences result from a single assembly of next-generation sequencing (NGS) data using a single piece of assembly software. Could you make a better assembly by using different software? Could you make a better assembly just from tweaking the settings of the same software? It is hard to know, and often costly — at least in terms of time and resources — to find out.

That’s where the Assemblathon comes in. The Assemblathon is a contest designed to put a wide range of genome assemblers through their paces; different teams are invited to attempt to assemble the same genome sequence, and we can hopefully point out the notable differences that can arise in the resulting assemblies. Assemblathon 1 used synthetic NGS data with teams trying to reconstruct a small (~100 MB ) synthetic genome. I.e. a genome for which we knew what the true answer should look like. For Assemblathon 2 — manuscript now available on arxiv.org — we upped the stakes and made NGS data available for three large vertebrate genomes (a bird, a fish, and a snake). Teams were invited to assemble any or all of the genomes. Teams were also free to use as much or as little of the NGS data as they liked. For the bird species (a budgerigar), the situation was further complicated by the fact that the NGS data comprised reads from three different platforms (Illumina, Roche 454, and Pacific Biosciences). In total we received 43 assemblies from 21 participating teams.

How did we try to make sense of all of these entries, especially when we would never know what the correct answer was for each genome? We were helped by having optical maps for each species which could be compared to the scaffolds in each genome assembly. We also had some Fosmid sequences for the bird and snake which helped provide a small set of ‘trusted’ reference sequences. In addition to these experimental datasets we tried employing various statistical methods to assess the quality of the assemblies (such as calculating metrics like the frequently used N50 measure). In the end, we filled a spreadsheet with over 100 different measures for each assembly (many of them related to each other).

From this unwieldy dataset we chose ten key metrics, measures that largely reflected different aspects of an assembly’s quality. Analysis of these key metrics led to two main conclusions — which some may find disappointing:

  1. Assembly quality can vary a lot depending on which metrics you focus on; we found many assemblies that excelled in some of the key metrics, but fared poorly when judged by others.
  2. Assemblers that tended to score well — when averaged across the 10 key metrics — in one species, did not always perform as well when assembling the genome of another species.

With respect to the second point, it is important to point out that the genomes of three species differed with regard to size, repeat content, and heterozygosity. It is perhaps equally important to point out that the NGS data
provided for each species differed in terms of insert sizes, read lengths, and abundance. Thus it is hard to ascertain whether inter-species differences in the quality of the assemblies were chiefly influenced by differences in the underlying genomes, the properties of the NGS data that were available, or by a combination of both factors. Further complicating the picture is that not all teams attempted to assemble all three genomes; so in terms of assessing the general usefulness of assembly software, we could only look at the smaller number of teams that submitted entries for two or more species.

In many ways, this manuscript represents some very early, and tentative steps into the world of comparative genome assembler assessments. Much more work needs to be done, and perhaps many more Assemblathons need to be run if we are to best understand what make a genome assembly a good assembly. Assemblathons are not the only game in town however, and other efforts like dnGASP and GAGE are important too. It is also good to see that others are leveraging the Assemblathon datasets (the first published analysis of Assemblathon 2 data was not by us!).

So while I can give an answer to the question ‘what is the best genome assembler?’, the answer is probably not going to be to your liking. With our current knowledge, we can say that the best genome assembler is the one that:

  1. you have the expertise to install and run
  2. you have the suitable infrastructure (CPU & RAM) to run the assembler
  3. you have sufficient time to run the assembler
  4. is designed to work with the specific mix of NGS data that you have generated
  5. best addresses what you want to get out of a genome assembly (bigger overall assembly, more genes, most accuracy, longer scaffolds, most resolution of haplotypes, most tolerant of repeats, etc.)

Just as it might be hard to find somewhere that sells an inexpensive gluten-free, vegan pizza that’s made with fresh ingredients, has lots of toppings and can be quickly delivered to you at 4:00 am, it may be equally hard to find a genome assembler that ticks all of the boxes that you are interested in. For now at least, it seems that you can’t have your cake — or pizza — and eat it.

Our paper: An experimental test for genetic constraints in Drosophila melanogaster

Our next guest post is by Ian Dworkin (@IanDworkin) on his paper (along with coauthors) An experimental test for genetic constraints in Drosophila melanogaster.

We have recently posted a (heavily revised) manuscript to arXiv detailing how we used the fruit fly Drosophila melanogaster (you can read here about why these little flies are so wonderful) to test a particular hypothesis about a genetic constraint, and more generally how our knowledge of development may inform us about the structure of the genetic variance-covariance matrix, G. Also we developed a really cool set of statistical models that evaluated our explicit hypotheses (more on that right at the end of the post)!

As a quick reminder (or introduction), G summarizes both how much genetic variation particular traits have, as well as how much traits co-vary genetically. This covariation can be due to “pleiotropy” which is a fancy word for when a gene (or a mutation in that gene) influences more than one trait. ie. a mutation might influence both your eye and hair colour). These traits can also covary together when two or more alleles (each influencing different traits) are physically close to each other (linked) and recombination has not had enough time to break these combinations apart. I highly recommend Jeff Conner’s recent review in Evolution for a nice review of these (and other concepts related to some issues I discuss below).

Evolutionary biology, in particular evolutionary quantitative genetics thinks a lot about the G-matrix, and how it interacts with natural selection (or drift) to generate evolutionary change. This is summarized by the now famous equation linking change in trait means(Δ) as a function of both genetic variation (and covariation) and the strength of natural selection (usually measured as a so-called selection gradient, β). This is the multivariate (more than one trait) version of the breeders equation (made most famous by all of the seminal work by R. Lande).


Why do we care so much about this little equation? It encapsulates many pretty heady ideas.  First and foremost that you can not have evolutionary change without genetic variation. That’s right, natural selection by itself is not enough. You can have very strong selection for traits (such as running speed) to survive better with a predator around, but if there is no heritable variation for running speed, no (evolutionary) change will happen in the proceeding generations (and good luck with that tiger coming your way). However, once you have to consider multiple traits (running speed, endurance and hearing), we have to think about whether there is available genetic variations for combinations of traits, and whether these are “oriented” in a similar direction to natural selection. If not, it may be that evolutionary change with be slowed considerably (even if each traits seems to have lots of heritable variation). Of course if the genetic variation for all of these traits is pointing in the same direction as selection, then evolution may proceed very quickly indeed! The ideas get more interesting and complex from there, but they are not the for this discussion (the paper above by Jeff Conner, and this great review by Katrina McGuigan are definitely worth reading for more on this).

In any case, much thought has been given to how this G matrix can change both by natural selection and by other factors such as new mutation. Depending on how G changes, future evolutionary potential might change, which is pretty cool if you think about it! How might G change then? These are important ideas, because while we can estimate what G looks like, and how it might change (in particular due to natural selection), it is much harder to know what it will look like far in the future, making our ability to predict long term evolutionary change more difficult.
So what might help us predict G? One idea is that our knowledge of developmental biology will help us understand the effects of mutations, and thus G. If so, developmental biology could be a particularly powerful way of predicting the potential for evolutionary change, or lack there of (a so called developmental constraint).

To test this idea, I decided to use a homeotic mutation. Homeosis is the term used for when one structure (like an arm) is transformed (during development) to another (related) structure like a leg.  In fruitflies homeotic mutations are the stuff of legend (and nobel prizes), in particular for the wonderful cases of the poor critters growing with legs (instead of antenna) out of their heads, or four winged flies. You can see wonderful examples of mutations causing such homeotic changes in flies and other critters here.

In our case we used a much weaker and subtler homeotic mutation Ubx1, which causes slight, largely quantitative changes. For example with this mutation, the third set of legs on the fly would be expected to resemble (in terms of lengths of the different parts of the leg) the second set of legs (flies like all insects have 3 sets of legs as adults). We wanted to know whether when we changed the third legs to look like second legs, would the G for the transformed third leg look that of a normal third leg or a normal second leg? Thus we were trying to predict changes in G based on what we know (a priori) of development and genetics in the fruitfly.

So what did we find? The most important points are summarized in figure 2 and table 3 (if you want to check out the paper that is). The TL’DR version is this: Yes, the legs homeotically transformed like we expected, but G of the mutant legs did not really change very much from that of a normal third leg. In other words, our knowledge of development did not really help us much in understanding changes in G. There are a few reasons why (which we explain in the paper), but I think that it is an interesting punchline, and I will leave it up to you to decide what it means (and if our experiment, analysis and interpretation are reasonable and logically consistent).

I also really want to give a shout out to one of the co-authors (JH) who developed the particular statistical model that we ended up using. He developed a set of explicit models that really helped us test our specific hypotheses directly with the data and experimental design at hand. This is sadly rarely done with statistics, so it is worth reading just for that! I really think (hope?) that this combination of approaches can be very useful for evolutionary genetics. Let me know what you think!

Our paper: Unbiased statistical testing of shared genetic control for potentially related traits

This guest post is by Chris Wallace on the preprint Unbiased statistical testing of shared genetic control for potentially related traits, available from the arXiv here. This is a cross-post from her group’s blog.

We have a new paper on arXiv detailing some work on colocalisation analysis, a method to determine whether two traits share a common causal variant. This is of interest in autoimmune disease genetics as the associated loci of so many autoimmune diseases overlap 1, but, for some genes, it appears the causal variants are distinct. It is also relevant for integrating disease association and eQTL data, to understand whether association of a disease to a particular locus is mediated by a variant’s effect on expression of a specific gene, possibly in a specific tissue.

However, determining whether traits share a common causal variant as opposed to distinct causal variants, probably in some LD, is not straightforward. It is well established that regression coefficients are aymptotically unbiased. However, when a SNP has been selected because it is the most associated in a region, then coefficients do then tend to be biased away from the null, ie their effect is overestimated. Because SNPs need to be selected to describe the association in any region in order to do colocalisation analysis, and because the coefficient bias will differ between datasets, there could be a tendancy to call truly colocalising traits as distinct. In fact, application of a formal statistical test for colocalisation 2 in a naive manner could have a type 1 error rate around 10-20% for a nominal size of 5%. This of course suggests that our earlier analysis of type 1 diabetes and monocyte gene expression 3 needs to be revised because it is likely we will have falsely rejected some genes which mediate the type 1 diabetes association in a region.

In this paper, we demonstrate two methods to overcome the problem. One, possibly more attractive to frequentists, is to avoid the variable selection by performing the analysis on principle components which summarise the genetic variation in a region. There is an issue with how many components are required, and our simulations suggest enough components need to be selected to capture around 85% of variation in a region. Obviously, this leads to a huge increase in degrees of freedom but, surprisingly, the power was not much worse compared to our favoured option of averaging p values over the variable selection using Bayesian Model Averaging. The idea of averaging p values is possibly anathema to Bayesians and frequentists alike, but these “posterior predictive p values” do have some history, having been introduced by Rubin in 1984 4. If you are prepared to mix Bayesian and frequentist theory sufficiently to average a p value over a posterior distribution (in this case, the posterior is of the SNPs which jointly summarise the association to both traits), it’s quite a nice idea. We used it before 3 as an alternative to taking a profile likelihood approach to dealing with a nuisance parameter, instead calculating p values conditional on the nuisance parameter, and averaging over its posterior. In this paper, we show by simulation that it does a good job of maintaining type 1 error and tends to be more powerful than the principle components approach.

There are many questions regarding integration of data from different GWAS that this paper doesn’t address: how to do this on a genomewide basis, for multiple traits, or when samples are not independent (GWAS which share a common set of controls, for example). Thus, it is a small step, but a useful contribution, I think, demonstrating a statistically sound method of investigating potentially shared causal variants in individual loci in detail. And while detailed investigation of individual loci may be currently be less fashionable than genomewide analyses, those detailed analyses are crucial for fine resolution analysis.

Unbiased statistical testing of shared genetic control for potentially related traits

Unbiased statistical testing of shared genetic control for potentially related traits
Chris Wallace
(Submitted on 23 Jan 2013)

Integration of data from genomewide single nucleotide polymorphism (SNP) association studies of different traits should allow researchers to disentangle the genetics of potentially related traits within individually associated regions. Methods have ranged from visual comparison of association $p$ values for each trait to formal statistical colocalisation testing of individual regions, which requires selection of a set of SNPs summarizing the association in a region. We show that the SNP selection method greatly affects type 1 error rates, with all published studies to date having used SNP selection methods that result in substantially biased inference. The primary reasons are twofold: random variation in the prescence of linkage disequilibrium means selected SNPs do not fully capture the association signal, and selecting SNPs on the basis of significance leads to biased effect size estimates.
We show that unbiased inference can be made either by avoiding variable selection and instead testing the most informative principal components or by integrating over variable selection using Bayesian model averaging. Application to data from Graves’ disease and Hashimoto’s thyroiditis reveals a common genetic signature across seven regions shared between the diseases, and indicates that for five out of six regions which have been significantly associated with one disease and not the other, the lack of evidence in one disease represents genuine absence of association rather than lack of power.

Natural selection. VI. Partitioning the information in fitness and characters by path analysis

Natural selection. VI. Partitioning the information in fitness and characters by path analysis
Steven A. Frank
(Submitted on 22 Jan 2013)

Three steps aid in the analysis of selection. First, describe phenotypes by their component causes. Components include genes, maternal effects, symbionts, and any other predictors of phenotype that are of interest. Second, describe fitness by its component causes, such as an individual’s phenotype, its neighbors’ phenotypes, resource availability, and so on. Third, put the predictors of phenotype and fitness into an exact equation for evolutionary change, providing a complete expression of selection and other evolutionary processes. The complete expression separates the distinct causal roles of the various hypothesized components of phenotypes and fitness. Traditionally, those components are given by the covariance, variance, and regression terms of evolutionary models. I show how to interpret those statistical expressions with respect to information theory. The resulting interpretation allows one to read the fundamental equations of selection and evolution as sentences that express how various causes lead to the accumulation of information by selection and the decay of information by other evolutionary processes. The interpretation in terms of information leads to a deeper understanding of selection and heritability, and a clearer sense of how to formulate causal hypotheses about evolutionary process. Kin selection appears as a particular type of causal analysis that partitions social effects into meaningful components.

Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species

Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species
Keith R. Bradnam (1), Joseph N. Fass (1), Anton Alexandrov (36), Paul Baranay (2), Michael Bechner (39), İnanç Birol (33), Sébastien Boisvert10, (11), Jarrod A. Chapman (20), Guillaume Chapuis (7,9), Rayan Chikhi (7,9), Hamidreza Chitsaz (6), Wen-Chi Chou (14,16), Jacques Corbeil (10,13), Cristian Del Fabbro (17), T. Roderick Docking (33), Richard Durbin (34), Dent Earl (40), Scott Emrich (3), Pavel Fedotov (36), Nuno A. Fonseca (30,35), Ganeshkumar Ganapathy (38), Richard A. Gibbs (32), Sante Gnerre (22), Élénie Godzaridis (11), Steve Goldstein (39), Matthias Haimel (30), Giles Hall (22), David Haussler (40), Joseph B. Hiatt (41), Isaac Y. Ho (20), Jason Howard (38), Martin Hunt (34), Shaun D. Jackman (33), David B Jaffe (22), Erich Jarvis (38), Huaiyang Jiang (32), et al. (55 additional authors not shown)
(Submitted on 23 Jan 2013)

Background – The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly. Results – In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies. Conclusions – Many current genome assemblers produced useful assemblies, containing a significant representation of their genes, regulatory sequences, and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.

Thoughts on: Polygenic modeling with Bayesian sparse linear mixed models

[This post is a commentary by Alkes L. Price on “Polygenic modeling with Bayesian sparse linear mixed models” by Zhou, Carbonetto, and Stephens. The preprint is available on the arXiv here.]

Linear mixed models (LMM) are widely used by geneticists, both for estimating the heritability explained by genotyped markers (h2g) and for phenotypic prediction (Best Linear Unbiased Prediction, BLUP); their application for computing association statistics is outside the focus of the current paper. LMM assume that effects sizes are normally distributed, but this assumption may not hold in practice. Improved modeling of the distribution of effect sizes may lead to more precise estimates of h2g and more accurate phenotypic predictions.

Previous work (nicely summarized by the authors in Table 1) has used various mixture distributions to model effect sizes. In the current paper, the authors advocate a mixture of two normal distributions (with independently parametrized variances), and provide a prior distribution for the hyper-parameters of this mixture distribution. This approach has the advantage of generalizing LMM, so that the method produces results similar to LMM when the effect sizes roughly follow a normal distribution. Posterior estimates of the hyper-parameters and effect sizes are obtained via MCMC.

The authors show via simulations and application to real phenotypes (e.g. WTCCC) that the method performs as well or better than other methods, both for estimating h2g and for predicting phenotype, under a range of genetic architectures. For diseases with large-effect loci (e.g. autoimmune diseases), results superior to LMM are obtained. When effect sizes are close to normally distributed, results are similar to LMM — and superior to a previous Bayesian method developed by the authors based on a mixture of normally distributed and zero effect sizes, with priors specifying a small mixing weight for non-zero effects.

Have methods for estimating h2g and building phenotypic predictions reached a stage of perfection that obviates the need for further research? The authors report a running time of 77 hours to analyze data from 3,925 individuals, so computational tractability on the much larger data sets of the future is a key area for possible improvement. I wonder whether it might be possible for a simpler method to achieve similar performance.

Alkes Price

Comprehensive evaluation of differential expression analysis methods for RNA-seq data

Comprehensive evaluation of differential expression analysis methods for RNA-seq data
Franck Rapaport, Raya Khanin, Yupu Liang, Azra Krek, Paul Zumbo, Christopher E. Mason, Nicholas D. Socci, Doron Betel
(Submitted on 22 Jan 2013)

High-throughput sequencing of RNA transcripts (RNA-seq) has become the method of choice for detection of differential expression (DE). Concurrent with the growing popularity of this technology there has been a significant research effort devoted towards understanding the statistical properties of this data and the development of analysis methods. We report on a comprehensive evaluation of the commonly used DE methods using the SEQC benchmark data set. We evaluate a number of key features including: assessment of normalization, accuracy of DE detection, modeling of genes expressed in only one condition, and the impact of sequencing depth and number of replications on identifying DE genes. We find significant differences among the methods with no single method consistently outperforming the others. Furthermore, the performance of array-based approach is comparable to methods customized for RNA-seq data. Perhaps most importantly, our results demonstrate that increasing the number of replicate samples provides significantly more detection power than increased sequencing depth.