Our next guest post is by Keith Bradnam (@kbradnam) on the Assemblathon (@assemblathon) paper: Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. arXived here.
Making pizzas and genome assemblies
In Davis, California there are 18 different establishments that predominantly sell pizzas and I often muse on the important issue of ‘who makes the best pizza?’. It’s a question that is deceptive in its simplicity, but there are many subtleties that lie behind it, most notably: what do we mean by best? The best quality pizza probably. But does quality refer to the best ingredients, the best pizza chef, or the best overall flavor? There are many other pizza-related metrics that could be combined into an equation to decide who makes the best pizza. Such an equation has to factor in the price, size, choice of toppings, quality (however we decide to measure it), ease of ordering, average time to deliver etc.
Even then, such an equation might have to assume that your needs reflect the typical needs of an average pizza consumer. But what if you have special needs (e.g., you can’t eat gluten) or you have certain constraints (you need a 131 foot pizza to go)? Hopefully, it is clear that the notion of a ‘best’ pizza is highly subjective and the best pizza for one person is almost certainly not going to be the best pizza for someone else.
What is true for ‘making pizzas’ is also largely true for ‘making genome assemblies’. There are probably as many genome assemblers out there as there are pizza establishments in Davis, and people clearly want to know which one is the best. But how do you measure the ‘best’ genome assembly? Many published genome sequences result from a single assembly of next-generation sequencing (NGS) data using a single piece of assembly software. Could you make a better assembly by using different software? Could you make a better assembly just from tweaking the settings of the same software? It is hard to know, and often costly — at least in terms of time and resources — to find out.
That’s where the Assemblathon comes in. The Assemblathon is a contest designed to put a wide range of genome assemblers through their paces; different teams are invited to attempt to assemble the same genome sequence, and we can hopefully point out the notable differences that can arise in the resulting assemblies. Assemblathon 1 used synthetic NGS data, with teams trying to reconstruct a small (~100 Mb) synthetic genome, i.e. a genome for which we knew what the true answer should look like. For Assemblathon 2 (manuscript now available on arxiv.org) we upped the stakes and made NGS data available for three large vertebrate genomes (a bird, a fish, and a snake). Teams were invited to assemble any or all of the genomes, and were free to use as much or as little of the NGS data as they liked. For the bird species (a budgerigar), the situation was further complicated by the fact that the NGS data comprised reads from three different platforms (Illumina, Roche 454, and Pacific Biosciences). In total we received 43 assemblies from 21 participating teams.
How did we try to make sense of all of these entries, especially when we would never know what the correct answer was for each genome? We were helped by having optical maps for each species which could be compared to the scaffolds in each genome assembly. We also had some Fosmid sequences for the bird and snake which helped provide a small set of ‘trusted’ reference sequences. In addition to these experimental datasets we tried employing various statistical methods to assess the quality of the assemblies (such as calculating metrics like the frequently used N50 measure). In the end, we filled a spreadsheet with over 100 different measures for each assembly (many of them related to each other).
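For readers who have not met N50 before, here is a minimal Python sketch of the usual definition: sort the contig (or scaffold) lengths from longest to shortest and report the length at which the running total first reaches half of the total assembly size. The function name and the example lengths are made up for illustration; this is not the code used in the Assemblathon analysis.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    account for at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running to total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Hypothetical lengths, chosen only to make the arithmetic easy to follow:
# total = 290, half = 145; 80 + 70 = 150 crosses the halfway point.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))  # 70
```

Note that N50 only describes contiguity: an assembly can score a large N50 by joining sequences that do not belong together, which is one reason no single number was enough to judge the entries.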
From this unwieldy dataset we chose ten key metrics, measures that largely reflected different aspects of an assembly’s quality. Analysis of these key metrics led to two main conclusions — which some may find disappointing:
- Assembly quality can vary a lot depending on which metrics you focus on; we found many assemblies that excelled in some of the key metrics, but fared poorly when judged by others.
- Assemblers that tended to score well — when averaged across the 10 key metrics — in one species, did not always perform as well when assembling the genome of another species.
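To make the first point concrete, here is a toy Python sketch of how the ‘winner’ can change with the metric you pick. The assembly names, metric names, and scores are all invented for illustration and are not Assemblathon results, and averaging ranks is just one simple way to combine metrics, not necessarily the scheme used in the paper.

```python
# Hypothetical toy scores (higher is better). NOT Assemblathon results.
toy_scores = {
    "asm_A": {"contiguity": 0.9, "gene_content": 0.4, "accuracy": 0.5},
    "asm_B": {"contiguity": 0.5, "gene_content": 0.8, "accuracy": 0.6},
    "asm_C": {"contiguity": 0.6, "gene_content": 0.6, "accuracy": 0.9},
}


def ranking(scores, metric):
    """Return assembly names ordered from best to worst for one metric."""
    return sorted(scores, key=lambda name: scores[name][metric], reverse=True)


def best_per_metric(scores):
    """Map each metric to the assembly that scores highest on it."""
    metrics = next(iter(scores.values())).keys()
    return {metric: ranking(scores, metric)[0] for metric in metrics}


def mean_ranks(scores):
    """Average each assembly's rank (1 = best) across all metrics."""
    metrics = list(next(iter(scores.values())).keys())
    totals = {name: 0 for name in scores}
    for metric in metrics:
        for position, name in enumerate(ranking(scores, metric), start=1):
            totals[name] += position
    return {name: totals[name] / len(metrics) for name in scores}


print(best_per_metric(toy_scores))  # a different winner for each metric
print(mean_ranks(toy_scores))       # lower mean rank is better overall
```

In this toy example each assembly wins exactly one metric, so the answer to ‘which is best?’ depends entirely on which column you look at, or on how you choose to combine the columns.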
With respect to the second point, it is important to point out that the genomes of the three species differed with regard to size, repeat content, and heterozygosity. It is perhaps equally important to point out that the NGS data provided for each species differed in terms of insert sizes, read lengths, and abundance. Thus it is hard to ascertain whether inter-species differences in the quality of the assemblies were chiefly influenced by differences in the underlying genomes, the properties of the NGS data that were available, or by a combination of both factors. Further complicating the picture is that not all teams attempted to assemble all three genomes; so in terms of assessing the general usefulness of assembly software, we could only look at the smaller number of teams that submitted entries for two or more species.
In many ways, this manuscript represents some very early and tentative steps into the world of comparative genome assembler assessments. Much more work needs to be done, and perhaps many more Assemblathons need to be run, if we are to best understand what makes a genome assembly a good assembly. Assemblathons are not the only game in town, however, and other efforts like dnGASP and GAGE are important too. It is also good to see that others are leveraging the Assemblathon datasets (the first published analysis of Assemblathon 2 data was not by us!).
So while I can give an answer to the question ‘what is the best genome assembler?’, the answer is probably not going to be to your liking. With our current knowledge, we can say that the best genome assembler is the one that:
- you have the expertise to install and run
- you have the suitable infrastructure (CPU & RAM) to run the assembler
- you have sufficient time to run the assembler
- is designed to work with the specific mix of NGS data that you have generated
- best addresses what you want to get out of a genome assembly (bigger overall assembly, more genes, most accuracy, longer scaffolds, most resolution of haplotypes, most tolerant of repeats, etc.)
Just as it might be hard to find somewhere that sells an inexpensive gluten-free, vegan pizza that’s made with fresh ingredients, has lots of toppings and can be quickly delivered to you at 4:00 am, it may be equally hard to find a genome assembler that ticks all of the boxes that you are interested in. For now at least, it seems that you can’t have your cake — or pizza — and eat it.