Sunday, February 28, 2016

From reproducibility to over-reproducibility

[See also the follow-up post.]

It's no secret that biomedical research is requiring more and more computational analyses these days, and with that has come some welcome discussion of how to make those analyses reproducible. On some level, I guess it's a no-brainer: if it's not reproducible, it's not science, right? And on a practical level, I think there are a lot of good things about making your analysis reproducible, including the following (vaguely ranked starting with what I consider most important):
  1. Umm, that it’s reproducible.
  2. It makes you a bit more careful, so your code is more likely to be right, cleaner, and readable to others.
  3. This in turn makes it easier for others in the lab to access and play with the analyses and data in the future, including the PI.
  4. It could be useful for others outside the lab, although as I’ve said before, I think the uses for our data outside our lab are relatively limited beyond the scientific conclusions we have made. Still, whatever, it’s there if you want it. I also freely admit this might be more important for people who do work other people actually care about. :)
Balanced against these benefits, though, is a non-negligible negative:
  1. It takes a lot of time.
On balance, I think making things as reproducible as possible is time well spent. In particular, it's time that could be well spent by the large proportion of the biomedical research enterprise that currently doesn't think about this sort of thing at all, and I think it is imperative for those of us with a computational inclination to help train others to make their analyses reproducible.

My worry, however, is that the strategies for reproducibility that computational types are often promoting are off-target and not necessarily adapted for the needs and skills of the people they are trying to reach. There is a certain strain of hyper-reproducible zealotry that I think is discouraging others from adopting some basic practices that could greatly benefit their research, and at the same time is limiting the productivity of even its own practitioners. You know what I'm talking about: it's the idea of turning your entire paper into a program, so you just type "make paper" and out pops the fully formed and formatted manuscript. Fine in the abstract, but in a line of work (like many others) in which time is our most precious commodity, these compulsions represent a complete failure to correctly measure opportunity costs. In other words, instead of hard coding the adjustment of the figure spacing of your LaTeX preprint, spend that time writing another paper. I think it’s really important to remember that our job is science, not programming, and if we focus too heavily on the procedural aspects of making everything reproducible and fully documented, we risk turning those who are less comfortable with programming away from the very real benefits of making their analyses reproducible.

Here are the two biggest culprits in my view: version control and figure scripting.

Let's start with version control. I think we can all agree that the most important part of making a scientific analysis reproducible is to make sure the analysis is in a script and not just typed or clicked into a program somewhere, only for those commands to vanish into faded memory. A good, reproducible analysis script should start with raw data, go through all the computational manipulations required, and leave you with a number or graphical element that ends up in your paper somewhere. This makes the analysis reproducible, because someone else can now just run the code and see how your raw data turned into that p-value in subpanel Figure 4G. And remember, that someone else is most likely your future self :).
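To make that concrete, here's a minimal sketch (in R, with invented file names, column names, and a made-up model, so treat it purely as an illustration) of what one of these scripts might look like: raw(ish) data in, a graphical element and a p-value out.

```r
# Hypothetical example: everything needed to regenerate one subpanel and its p-value.
library(ggplot2)

counts <- read.csv("rawData/expressionCounts.csv")      # raw(ish) data

fit  <- lm(expression ~ dose, data = counts)             # the actual analysis
pval <- summary(fit)$coefficients["dose", "Pr(>|t|)"]    # the p-value quoted in the paper

p <- ggplot(counts, aes(x = dose, y = expression)) +
  geom_point()

ggsave("graphs/fig4G_scatter.pdf", p)                    # graphical element for subpanel 4G
writeLines(sprintf("Fig. 4G dose p-value: %.3g", pval),
           "graphs/fig4G_pvalue.txt")
```

Nothing fancy, but anyone (including future you) can rerun it and see exactly where that number came from.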

Okay, so we hopefully all agree on the need for scripts. Then, however, almost every discussion about computational reproducibility begins with a directive to adopt git or some other version control system, as though it’s the obvious next step. Hmm. I’m just going to come right out and say that for the majority of computational projects (at least in our lab), version control is a waste of time. Why? Well, what is the goal of making a reproducible analysis? I believe the goal is to have a documented set of scripts that take raw data and reliably turn it into a bit of knowledge of some kind. The goal of version control is to manage code, in particular emphasizing “reversibility, concurrency, and annotation [of changes to code]”. While one can imagine some overlap between these goals, I don’t necessarily see a natural connection between them. To make that more concrete, let’s try to answer the question that I’ve been asking (and been asked), which is “Why not just use Dropbox?”. After all, Dropbox will keep all your code and data around (including older versions), share them between people seamlessly, and will probably only go down if WWIII breaks out. And it's easy to use. Here are a few potential arguments I can imagine people might make in favor of version control:
  1. You can avoid having Fig_1.ai, Fig_1_2.ai, Fig_1_2_final_AR_PG_JK.ai, etc. Just make the change and commit! You have all the old versions!
  2. You can keep track of who changed what code and roll things back (and manage file conflicts).
Well, to point 1, I actually think that there’s nothing really wrong with having all these different copies of a file around. It makes it really easy to quickly see what changed between different versions, which is especially useful for binary files (like Illustrator files) that you can’t run a diff on. Sure, it’s maybe a bit cleaner to have just one Fig_1.ai, but in practice, I think it’s actually less useful. In our lab, we haven’t bothered doing that, and it’s all worked out just fine.

Which brings us then to point 2, about tracking code changes. In thinking about this, I think it’s useful to separate out code that is for general purpose tools in the lab and code that is specific for a particular project. For code for general purpose tools that multiple team members are contributing to, version control makes a lot of sense–that’s what it was really designed for, after all. It’s very helpful to see older versions of the codebase, see the exact changes that other members of the team have made, and so forth.

These rationales don’t really apply, though, to code that people will write for analyzing data for a particular project. In our lab, and I suspect most others, this code is typically written by one or two people, and if two, they’re typically working in very close contact. Moreover, the end goal is not to have a record of a shifting codebase, but rather to have a single, finalized set of analysis scripts that will reproduce the figures and numbers in the paper. For this reason, the ability to roll back to previous versions of the code and annotate changes is of little utility in practice. I asked around the lab, and I think there was maybe one time when we rolled back code. Otherwise, basically, for most analyses for papers, we just move forward and don’t worry about it. I suppose there is, theoretically, the possibility that some old analysis you could recover through version control might prove useful, but honestly, most of the time, that ends up in a separate folder anyway. (One might say that’s not clean, but I think that it’s actually just fine. If an analysis is different in kind, then replacing it via version control doesn’t really make sense–it’s not a replacement of previous code per se.)

Of course, one could say, well, even if version control isn’t strictly necessary for reproducible analyses, what does it hurt? In my opinion, the big negative is the amount of friction version control injects into virtually every aspect of the analysis process. This is the price you pay for versioning and annotation, and I think there’s no way to get around that. With Dropbox, I just stick a file in and it shows up everywhere, up to date, magically. No muss, no fuss. If you use version control, it’s constant committing, pushing, pulling, updating, and adding notes. Moreover, if you’re like me, you will screw up at some point, leading to some problem, potentially catastrophic, that you will spend hours trying to figure out. I’m clearly not alone:
“Abort: remote heads forked” anyone? :) At that point, we all just call over the one person in the lab who knows how to deal with all this crap and hope for the best. And look, I’m relatively computer savvy, so I can only imagine how intimidating all this is for people who are less computer savvy. The bottom line is that version control is cumbersome, arcane, and time-consuming, and most importantly, doesn’t actually contribute much to a reproducible computational analysis. If the point is to encourage people who are relatively new to computation to make scripts and organize their computational results, I think directing them to adopt version control is a very bad idea. Indeed, for a while I was making everyone in our lab use version control for their projects, and overall, it was a net negative in terms of time. We switched to Dropbox for a few recent projects and life is MUCH better–and just as reproducible.

Oh, and I think there are some people who use version control for the text of their papers (almost certainly a proper subset of those who are for some reason writing their papers in Markdown or LaTeX). Unless your paper has a lot of math in it, I have no idea why anyone would subject themselves to this form of torture. Let me be the one to tell you that you are no less smart or tough if you use Google Docs. In fact, some might say you’re smarter, because you don’t let command-line ethos/ideology get in the way of actually getting things done… :)

Which brings me to the example of figure scripting. Figure scripting is the process of making a figure completely from a script. Such a script will make all the subpanels, adjust all the font sizes, deal with all the colors, and so forth. In an ideal world with infinite time, this would be great–who wouldn't want to make all their figures magically appear by typing "make figures"? In practice, there are definitely some diminishing returns, and it's up to you where the line is between making it reproducible and getting it done. For me, the hard line is that all graphical elements representing data values should be coded. Like, if I make a scatterplot, then the locations of the points relative to the axes should come from the script. Beyond that, Illustrator time! Illustrator will let you set the font size, the line weight, the marker color, and virtually every other thing you can think of simply and relatively intuitively, with immediate feedback. If you can set your font sizes and so forth programmatically, more power to you. But it's worth keeping in mind that the time you spend programming these things is time you could be spending on something else. This time can be substantial: check out this lengthy bit of code written to avoid a trip to Illustrator. Also, the more complex the figure you're trying to make, the fewer packages there are to help you make it. For instance, consider this figure from one of Marshall's papers:

[Figure from one of Marshall's papers: gradient bars with various lines and annotations.]

Making gradient bars and all the lines and annotations would be a nightmare to do via script (and this isn't even very complicated). Yes, if you decide to make a change, you will have to redo some manual work in Illustrator, hence the common wisdom to make it all in scripts to "save time redoing things". But given the amount of effort it takes to figure out how to code that stuff, nine times out of ten you will spend less time overall just redoing the manual work. And in a time when nobody reads things carefully, adding all these visual elements to your paper to make it easier to explain your work quickly is a strong imperative–stronger than making sure it all comes from a script, in my view.
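For what it's worth, here's roughly where that line falls in practice, sketched in R with made-up data and file names: the script determines everything about where the data land relative to the axes, and the cosmetic polish happens in Illustrator afterwards.

```r
# Hypothetical example: code the data-driven elements, hand the rest to Illustrator.
library(ggplot2)

expr <- read.csv("rawData/geneExpression.csv")   # raw(ish) data

p <- ggplot(expr, aes(x = geneA, y = geneB)) +
  geom_point()                                   # point positions come from the data, via the script

# Export as vector graphics; fonts, colors, annotations, gradient bars, and the
# rest of the visual storytelling get adjusted by hand in Illustrator.
ggsave("graphs/geneA_vs_geneB.pdf", p, width = 4, height = 4)
```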

Anyway, all that said, what do we actually do in the lab? Having gone through a couple of iterations, we've basically settled on the following. We make a Dropbox folder for the paper, and within the folder, we have subfolders: one for raw(ish) data, one for scripts, one for graphs, and one for figures (perhaps with some elaborations depending on circumstances). In the scripts folder is a set of, uh, scripts that, when run, take the raw(ish) data and turn it into the graphical elements. We then assemble those graphical elements into figures, along with a readme file to document which files went into the figure. Those figures can contain heavily altered versions of the graphical elements, and we will typically adjust font sizes, ticks, colors, you name it, but if you want to figure out why some data point was where it was, the chain is fully accessible. Then, when we're done, we put all the files into Bitbucket for anyone to access.
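In case it helps, here's a hypothetical version of that layout (folder and file names invented purely for illustration):

```
paperName/                 Dropbox folder for the paper
  rawData/                 raw(ish) data files
  scripts/                 R/MATLAB scripts that turn rawData/ into graphs/
  graphs/                  graphical elements produced by the scripts
  figures/
    Fig_1.ai               assembled (and tweaked) in Illustrator
    Fig_1_readme.txt       notes which files from graphs/ went into Fig_1.ai
```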

Oh, and one other thing about permanence: our scripts use some combination of R and MATLAB, and they work for now. They may not work forever. That's fine. Life goes on, and most papers don't. Those that do do so because of their scientific conclusions, not their data or analysis per se. So I'm not worried about it.

Update, 3/1/2016: Pretty predictable pushback from a lot of people, especially about version control. First, just to reiterate, we use version control for our general purpose tools, which are edited and used by many people, thus making version control the right tool for the job. Still, I have yet to hear any truly compelling arguments for using version control that would outweigh the substantial associated complexity for the use case I am discussing here, which is making the analyses in a paper reproducible. There are a lot of bald assertions of the benefits of version control out there without any real evidence for their validity other than "well, I think this should be better", and little frank discussion of the hassles of version control. This strikes me as similar to the pushback against the LaTeX vs. Word paper. Evidence be damned! :)