Casey Bergman [@caseybergman and @bergmanlab] kindly wrote a post about his recently arXived paper:
Population genomics of the Wolbachia endosymbiont in Drosophila melanogaster
As part of the Drosophila 12 Genome Project, Steve Salzberg and colleagues’ published a pioneering paper in 2005 showing that complete genomes of the bacterial endosymbiont Wolbachia pipientis can be extracted from the whole-genome shotgun sequence assemblies of Drosophila species. This paper always left an impression on me as a very clever use of extracting new biology from existing genomic data, and when the era of resequencing multiple strains of D. melanogaster kicked off a few years ago, it seemed like a natural extension to ask if this approach could be adapted to a next-generation sequencing data to study the co-evolution of Wolbachia and Drosophila using whole genome data.
In the current work, we used short-read next generation sequencing data from two major resequencing efforts in D. melanogaster — the Drosophila Genetic Reference Panel (DGRP) and Drosophila Population Genomics Project (DPGP) — together with the reference Wolbachia genome published by Wu et al. (2005) and extracted over 175 complete Wolbachia genomes and nearly 300 complete mitochondrial genomes. Readers can find the main results in the paper, which is currently in review. I’d like to discuss here the social context of the project and some of the reasons we submitted to arXiv.
This project started out as summer project for a masters student, Mark Richardson, in 2010 who did an amazing job developing the initial pipeline made most of the initial discoveries in the paper. Mark and I started a collaboration with Frank Jiggins and Mike McGwire shortly after to verify that our in silico genotyping results were making sense, who suggested to bring in Lucy Weinart and John Welch to help with the more sophisticated Bayesian phylogenetic analysis. Another PhD student in my lab, Raquel Linheiro, adapted her transposable element detection pipeline to identify particular Wolbachia sublineages which was crucial to linking our data with previous results. This was a great collaboration, where everyone made significant contributions, and I would collaborate with everyone again (and I hope to!).
At the time (summer 2010), we only had access to the North American strains from the DGRP sample; knowing that North American D. melanogaster are derived populations, we were cautious about the impact that population structure had on our results. We planned in early 2011 to publish on only the DGRP dataset since Mark was going off to do a PhD in Australia and I didn’t have anyone else in the group working on this project. In the summer of 2011, the African DPGP data came online and I decided to take a peek and run the pipeline on the African strains as well. This led to a major overhaul of the project and set us back a year, since all the data had to be reanalyzed again together and the interpretation of the biogeography results was substantially altered. This was in some ways lucky because our initial interpretation of evidence for a selective sweep on one of the cytoplasmic lineages was probably wrong, and it saved us from having to back peddle on this misinterpretation in a later publication.
As we plugged away at trying to finish this project, we had inquiries about the status of the project from several other groups working in the Wolbachia field. Honestly this stressed me out quite a bit, since some of the inquiries were coming from post-docs in big labs. But instead of just sitting on the data, after we finalized the dataset we decided to release these data openly on our lab blog in April 2012. We decided on an open release as a way to help these teams (and others we didn’t know about), but also to get some priority in this area by providing the “gold standard” that other groups could use (and cite!). For the record, I will note that we asked two teams who contacted us about our project if they would reciprocate by sharing unpublished genomic data or in one case published genomic data that was not submitted to GenBank; both declined.
After making the decision to release the data pre-publication, it was a natural step to submit the manuscript to arXiv. I’m an open science advocate and used the Nature Preprint server occasionally in the past. I never really liked the Nature Preprint server, though, since I thought people posted there to give their manuscript the stink of being “Nature (in prep)” on their CV. And I never posted to arXiv in the past, since I always thought it was for more hardcore computational or mathematical biology. But recently, I was convinced by Rosie Redfield, Leonid Kruglyak and colleagues putting their Arsenic Life paper on arXiv that more empirical work in quantitative biology was arXiv-able. And just as with releasing our data early, it seemed like the best way to prevent being scooped was to get our results out as quickly as possible and letting people know about it.
So we went for it. And I have to say the experience has been thoroughly rewarding. Submitting was a piece of cake, easier than any journal I’ve ever submitted to. Having a URL to point to allowed me to tweet about it, which got some exposure to the paper and some new colleagues on twitter. It also allowed me to send a submitted manuscript around to colleagues for informal review, without cluttering up their inboxes with big attachments or providing a moral dilemma about who they can share the manuscript with. And somehow submitting to arXiv pushed the “it’s submitted” button in my brain, which made me a whole lot less stressed about the possibility of being scooped and I’ve been more relaxed throughout the formal submission process. Finally, I know that the pre-publication release of the data and posting of the manuscript has led to a group in Russia using these sequences into their work, and I’ve just gotten a manuscript to review from this group citing our arXiv manuscript and extending our results before our paper is even published! This is what research is all about, right: doing science, getting it out, and letting others build on it. I’ll definitely submit to arXiv for all my papers from my lab, and look forward to the Haldane’s Sieve readership giving us a hard time about our manuscripts while they evolve into formal publications.
Minor typo (probably auto-correct at work), “…some of the reasons we submitted to arrive.” –> arXiv?
The case of a big lab declining to share “published genomic data that was not submitted to GenBank” sounds odd. Was it the raw data supporting the published data you wanted?
Hi Peter – Sadly, it was the raw NGS data from a paper in 2011. To be fair, it is tough to get NGS data easily into SRA, but this is not the explanation for why they wouldn’t make it available on request. I was really bummed about this since it was key for making a finer point in our interpretation, and we had to do a bunch of other analysis to make this point in a less rigrous way. Since then, a collaborator has replicated the collection of this sequence data and we confirmed our result, but all in all it was not the best outcome in my view since we couldn’t included it in the paper. In some ways though, this event was a big reason for deciding to make our data and paper as open as possible, since I realized how other people holding back data impeded our work.
We also used the data too Casey! Kudos for sharing and advancing science faster.
This is great Seth! Let us know if there is any other data that would help to have, and if you spot any problems in the files.
Pingback: New paper on faster X effect in Drosophila – sort of…
Pingback: Most viewed on Haldane’s Sieve: August-September 2012 | Haldane's Sieve
Pingback: Preprint Servers – do you recommend ? – The Global Innovations
Pingback: Haldane’s Sieve sifts through 2012 | Haldane's Sieve