Friday, January 22, 2016

Thoughts on the NEJM editorial: what’s good for the (experimental) goose is good for the (computational) gander

Huge Twitter explosion about this editorial in the NEJM on “research parasites”. Basically, the authors say that computational people interested in working with someone else’s data should work together with the experimenters (which, incidentally, is how I would approach something like that in most cases). Things get a bit darker (and perhaps more revealing) when they also call out “research parasites” (aka “Mountain Dew chugging computational types”, to paraphrase what I’ve heard elsewhere), who to them are just people sitting around, umm, chugging Mountain Dew while banging on their computers, stealing papers from those who worked so hard to generate these datasets.

So this NEJM editorial is certainly wrong on many counts, and I think that most people have that covered. Not only that, but it is particularly tone-deaf: “… or even use the data to try to disprove what the original investigators had posited.” Seriously?!?

The response has been particularly strong from the computational genomics community, who are often reliant on other people’s data. Ewan Birney had a nice set of Tweets on the topic, first noting that “For me this is the start of clinical research transitioning from a data limited to an analysis limited world.”, noting further that “This is what mol. biology / genomics went through in the 90s/00s and it’s scary for the people who base their science on control of data.” True, perhaps.

He then goes on to say: “1. Publication means... publication, including the data. No ifs, no buts. Patient data via restricted access (bonafide researcher) terms.”

Agreed, who can argue with that! But let’s put this chain of reasoning together. If we are moving to an “analysis limited world”, then it is the analyses that are the precious resource. And all the arguments for sharing data are just as applicable to sharing analyses, no? Isn’t the progress of science impeded by people not sharing their analyses? This is not just an abstract argument: for example, we have been doing some ATAC-seq experiments in the lab, and we had a very hard time finding out exactly how to analyze that data, because there was no code out there for how to do it, even in published papers (for the record, Will Greenleaf has been very kind and helpful via personal communication, and this has been fine for us).
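
To make that concrete, here’s roughly the kind of thing I mean by “sharing the analysis”. This is a made-up minimal sketch of a bare-bones ATAC-seq run, and emphatically not code from any actual paper or from the Greenleaf lab; the genome index, FASTQ filenames, and output names are hypothetical placeholders, and it assumes bowtie2, samtools, and MACS2 are installed. Even something this minimal, posted alongside a paper, would have saved us a lot of guesswork:

    # Made-up sketch of a minimal ATAC-seq analysis, for illustration only.
    # Assumes bowtie2, samtools, and macs2 are on the PATH; "genome_index" and
    # the FASTQ names are placeholders, not files from any actual paper.
    import subprocess

    def run(cmd):
        # Run a shell command, stopping if it fails.
        print("+ " + cmd)
        subprocess.run(cmd, shell=True, check=True)

    # Align paired-end reads (-X 2000 allows the long fragments typical of
    # ATAC-seq), then sort and index the alignments.
    run("bowtie2 -X 2000 -x genome_index -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz"
        " | samtools sort -o sample.sorted.bam -")
    run("samtools index sample.sorted.bam")

    # A real pipeline would also filter mitochondrial reads and PCR duplicates
    # here; omitted to keep the sketch short.

    # Call accessible-chromatin peaks on the paired-end fragments.
    run("macs2 callpeak -t sample.sorted.bam -f BAMPE -g hs -n sample --outdir peaks")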

What does, say, Genome Research have to say about it? Well, here’s what they say about data:
Genome Research will not publish manuscripts where data used and/or reported in the paper is not freely available in either a public database or on the Genome Research website. There are no exceptions.
Uh, so that’s pretty explicit. And here’s what they say about code:
Authors submitting papers that describe or present a new computer program or algorithm or papers where in-house software is necessary to reproduce the work should be prepared to make a downloadable program freely available. We encourage authors to also make the source code available.
Okay, so only if there’s some novel analysis, and then only if you want to or if someone asks you. Probably via e-mail. To which someone may or may not respond. Hmm, kettle, the pot is calling…

So what happens in practice at Genome Research? I took a quick look at the first three papers from the current TOC (1, 2, 3).

The first paper has a “Supplemental PERL.zip” that contains some very poorly documented code in a few files and, as far as I can tell, is missing a file called “mcmctree_copy.ctl” that I’m guessing is pretty important to running the mcmctree algorithm.

The second paper has no code at all. They have a fairly detailed description of their analysis in the supplement, but again, no actual code I could run.

The third paper is perhaps the best, with a link to a software package that seems fairly well put together. But still, no link to the actual code used to make the actual figures in the paper, as far as I can see, just “DaPars analysis was performed as described in the original paper (Masamha et al. 2014) by using the code available at https://code.google.com/p/dapars with default settings.”

Aren’t these the same things we’ve been complaining about in experimental materials and methods forever? First paper: missing steps of a protocol? Second paper: just a description of how they did it, just like, you know, most “old fashioned” materials and methods from experimental biology papers? Third paper: vague prescription referencing a previous paper and a “kit”?

Look, trust me, I understand completely why this is the case in these papers, and I’m not trying to call these authors out. All I’m saying is that if you’re going to get on your high horse and say that data is part of the paper and must be distributed, no ifs, no buts, well, then distribute the analyses as well–and I don’t want to hear any ifs or buts. If we require authors to deposit their sequence data, then surely we can require that they upload their code. Where is the mandate for depositing code on the journal website?

Of course, in the real world, there are legitimate ifs and buts. Let me anticipate one: “Our analyses are so heterogeneous, and it’s so complicated for us to share the code in a usable way.” I’m actually very sympathetic to that. Indeed, we have lots of data that is very heterogeneous and hard to share reasonably–for anyone who really believes all data MUST be accessible, well, I’ve got around 12TB of images for our next paper submission that I would love for you to pay to host… and that probably nobody will ever use. Not all science is genomics, and what works in one place won’t necessarily make sense elsewhere. (As an aside, in computational applied math, many people keep their codes secret to avoid “research parasites”, so it’s not just data gatherers who feel threatened.)

Where, might you ask, is the moral indignation from our experimental colleagues complaining about how computational folks don’t make their code accessible? First, I think many of these folks are in fact annoyed (I am, for instance), but they are much less likely to be on Twitter and the like. Second, I think that many non-computational folks are browbeaten by p-value-toting computational people telling them they don’t even know how to analyze their own data, leading them to feel like they are somehow unable to contribute meaningfully in the first place.

So my point is, sure, data should be available, but let’s not all be so self-righteous about it. Anyway, there, I said it. Peace. :)

PS: Just in case you were wondering, we make all our software and processed data available, and our most recent paper has all the scripts to make all the figures–and we’ll keep doing that moving forward. I think it's good practice; my point is just that reasonable people could disagree.
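
In case you’re wondering what that looks like in practice, each figure has a script that goes from the deposited processed data to the final plot, with no manual steps in between. Here’s a made-up minimal sketch (the filenames and column names are hypothetical placeholders, not from our actual paper):

    # Made-up sketch of a figure-generation script; filenames and column names
    # are hypothetical placeholders.
    import pandas as pd
    import matplotlib.pyplot as plt

    # Load the processed data exactly as deposited with the paper.
    df = pd.read_csv("processed_data/figure1_counts.csv")

    # Regenerate the figure from that data, start to finish.
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.scatter(df["condition_a"], df["condition_b"], s=10, alpha=0.6)
    ax.set_xlabel("Expression, condition A")
    ax.set_ylabel("Expression, condition B")
    fig.tight_layout()
    fig.savefig("figure1.pdf")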

Update: Nice discussion with Casey Bergman in the comments.
Update (4/28/2016): Fixed links to Genome Research papers (thanks to Quaid Morris for pointing this out). Also, Quaid pointed out that I was being unreasonable and that 2 of the 3 papers actually did provide code. So I looked at the next 3 papers from that issue (4, 5, 6). Of these, none had any code provided. For what it's worth, I agree with Quaid that it is not necessarily reasonable to require code. My point is that we should be reasonable about data as well.

5 comments:

  1. I think you are making a false analogy here between experimental *data* and computational *methods*. We all (except NEJM editors) agree that primary experimental data (in processed form) should be made available. The fair comparison in computational terms is to require computationally generated *data* (e.g. gene predictions, expression estimates) to be made available. In general this practice is followed, since computationally generated data is just data.

    However, if we require reproducible code to be made available, then the fair equivalent would be to require all *experimental protocols* to be made available in explicit detail, such that they could lead to the exact results in the paper. Not sure experimentalists would be willing or able to do this.

    When all experimentalists are required to publish protocols with this level of detail, then I'll accept that all code must be made available. Till then, it is a double standard to require us computational types to make *explicitly reproducible methods and data* available while experimentalists can get away with only having to release a *narrative sketch of methods and data*. Insisting on this double standard reinforces the second class status of computational biology by burdening us with a higher bar that experimental biology can't ever realistically hope to achieve.

    1. Hi Casey,

      Thanks for writing!

      I see your point, and indeed had thought about that as well. I think my point is not actually that computational people should be held to this standard. Rather, what I find off-putting are the idealistic proclamations from computational types that “data must be free” and “it’s what the taxpayers are paying for” and “Open data is how science should be done; it will accelerate progress.” Virtually all these high-minded arguments apply just as well to making analyses open and available. If we’re going to make declarations of “no ifs, no buts” for data availability for these sorts of reasons, then the same goes for computational analyses, in my opinion. In our lab, we would have greatly benefitted from analyses being available in many instances (as recently as last week). The distinction between methods/protocols and data is relatively inconsequential with respect to the lofty ideals of “open science” that are usually trotted out as the justification for forcing data availability. Whether or not it presents a burden is typically not part of the discussion for making data available anymore, so why should it be for analyses? To be clear, in my opinion, we should instead try and be practical in both cases.

      Along these lines, I also take issue with the proclamation that all processed data *must* be made available as though it is an intrinsic, metaphysical good and is somehow *the* product of a paper. While it is certainly a good thing to make data available, focusing on data as the primary output of a paper is, I think, wrong. To me, the product of most scientific papers is knowledge. Knowledge comes from experiments/measurements -> data -> analyses -> knowledge. These are all parts of the paper, and the data itself has no special place in the chain.

      In the ideal world, papers would provide everything needed to reproduce this entire chain. In the real world, we approximate this to a greater or lesser extent: we describe experimental protocols with varying levels of detail and, apparently, the same goes for computational analyses. Historically, we have also not required data to be made available–and we generated the vast majority of molecular biology knowledge without such requirements. Now, we make data available mostly because it is increasingly possible to do so logistically and because of the real or perceived practical benefits of reanalyzing data, but again, data is not in a privileged place relative to experimental protocols or computational analyses. One could easily make the argument that it’s now relatively easy to make analyses available, and so we should do it for all the same high-minded reasons people trot out for making data available.

      Experimental protocols should also perhaps be spelled out in full, and indeed, this is on the books at virtually all journals–great! There are some logistical and practical matters that make protocols harder to share, and so perhaps it’s harder to do. But if we believe our lofty ideals apply to all parts of the chain of generating knowledge, then there is an equal moral imperative to make any and all parts of the chain available to others to whatever extent practical. Saying that if you can’t do it reasonably for one part of the chain (experimental protocols), then you can’t require it for another (computational analyses) seems rather arbitrary to me–indeed, why is data now in the privileged position of being required?

      To sum up, I guess what I’m saying is that I find it a bit jarring to get high-minded lectures from computational folks about making sure our data is in some GEO repository and then get tit-for-tat rationales for not posting code from published papers. :)

      Anyway, all this said, I understand why people don’t often provide code–we do it, and it’s a lot of work. Perhaps we agree on that. My overarching point is not so much that we should force analyses to be made available, but rather that I’m not going to stand on a soapbox and tell everyone else that they *have* to (like people do for data).

      Anyway, fun discussion, cheers!

    2. Also, I would point out that in my experience, this computationally generated data is seldom available. A quick look through most papers with genomics data will quickly show that most intermediates are not available, nor are the final data used to generate the figures. It has been a vexing problem for us when we try and use even just the *data* from most of these types of papers.

  2. I don't really feel the need to argue against this: analysis code should generally be available, and certainly a lot more of it should be available than is now.

    1. Good to hear! :)
      I've definitely gotten a fair number of arguments against, mostly along the lines of Casey Bergman's above, which I find understandable, but ultimately unconvincing… ;)
