RajLab: From reproducibility to over-reproducibility

Sunday, February 28, 2016


[See also follow up post.]

It's no secret that biomedical research is requiring more and more computational analyses these days, and with that has come some welcome discussion of how to make those analyses reproducible. On some level, I guess it's a no-brainer: if it's not reproducible, it's not science, right? And on a practical level, I think there are a lot of good things about making your analysis reproducible, including the following (vaguely ranked starting with what I consider most important):

1. Umm, that it’s reproducible.
2. It makes you a bit more careful about making your code more likely to be right, cleaner, and readable to others.
3. This in turn makes it easier for others in the lab to access and play with the analyses and data in the future, including the PI.
4. It could be useful for others outside the lab, although as I’ve said before, I think the uses for our data outside our lab are relatively limited beyond the scientific conclusions we have made. Still, whatever, it’s there if you want it. I also freely admit this might be more important for people who do work other people actually care about. :)

Balanced against these benefits, though, is a non-negligible negative:

It takes a lot of time.

On balance, I think making things as reproducible as possible is time well spent. In particular, it's time that could be well spent by the large proportion of the biomedical research enterprise that currently doesn't think about this sort of thing at all, and I think it is imperative for those of us with a computational inclination to help train others to make their analyses reproducible.

My worry, however, is that the strategies for reproducibility that computational types are often promoting are off-target and not necessarily adapted to the needs and skills of the people they are trying to reach. There is a certain strain of hyper-reproducible zealotry that I think is discouraging others from adopting some basic practices that could greatly benefit their research, and at the same time is limiting the productivity of even its own practitioners. You know what I'm talking about: it's the idea of turning your entire paper into a program, so you just type "make paper" and out pops the fully formed and formatted manuscript. Fine in the abstract, but in a line of work (like many others) in which time is our most precious commodity, these compulsions represent a complete failure to correctly measure opportunity costs. In other words, instead of hard coding the adjustment of the figure spacing of your LaTeX preprint, spend that time writing another paper. I think it’s really important to remember that our job is science, not programming, and if we focus too heavily on the procedural aspects of making everything reproducible and fully documented, we risk turning off those who are less comfortable with programming from the very real benefits of making their analysis reproducible.

Here are the two biggest culprits in my view: version control and figure scripting.

Let's start with version control. I think we can all agree that the most important part of making a scientific analysis reproducible is to make sure the analysis is in a script and not just typed or clicked into a program somewhere, only for those commands to vanish into faded memory. A good, reproducible analysis script should start with raw data, go through all the computational manipulations required, and leave you with a number or graphical element that ends up in your paper somewhere. This makes the analysis reproducible, because someone else can now just run the code and see how your raw data turned into that p-value in subpanel Figure 4G. And remember, that someone else is most likely your future self :).
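As a rough illustration of that raw-data-to-number chain, here is a minimal Python sketch. The file names, the column names, and the choice of a difference in group means are all made up for the example; the only point is that a single script goes from a raw file to the number that would end up in the paper.

```python
import csv
import statistics
import tempfile
from pathlib import Path


def load_measurements(csv_path):
    """Read the raw file into {group: [values]}; column names are hypothetical."""
    groups = {}
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            groups.setdefault(row["group"], []).append(float(row["value"]))
    return groups


def mean_difference(groups, a="control", b="treated"):
    """The 'number in the paper' for this sketch: mean of b minus mean of a."""
    return statistics.mean(groups[b]) - statistics.mean(groups[a])


def run_analysis(csv_path, out_path):
    """Raw data in, one documented number out (also written to out_path)."""
    diff = mean_difference(load_measurements(csv_path))
    Path(out_path).write_text(f"mean_difference\t{diff:.3f}\n")
    return diff


if __name__ == "__main__":
    # Stand-in raw data so the sketch runs on its own.
    workdir = Path(tempfile.mkdtemp())
    raw = workdir / "raw_counts.csv"
    raw.write_text("group,value\ncontrol,1.0\ncontrol,3.0\ntreated,4.0\ntreated,6.0\n")
    print(run_analysis(raw, workdir / "fig4g_stat.txt"))
```

Anyone with the raw file and this one script can regenerate the number, which is the whole requirement; nothing here depends on how the script itself is stored or shared.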

Okay, so we hopefully all agree on the need for scripts. Then, however, almost every discussion about computational reproducibility begins with a directive to adopt git or some other version control system, as though it’s the obvious next step. Hmm. I’m just going to come right out and say that for the majority of computational projects (at least in our lab), version control is a waste of time. Why? Well, what is the goal of making a reproducible analysis? I believe the goal is to have a documented set of scripts that take raw data and reliably turn it into a bit of knowledge of some kind. The goal of version control is to manage code, in particular emphasizing “reversibility, concurrency, and annotation [of changes to code]”. While one can imagine some overlap between these goals, I don’t necessarily see a natural connection between them. To make that more concrete, let’s try to answer the question that I’ve been asking (and been asked), which is “Why not just use Dropbox?”. After all, Dropbox will keep all your code and data around (including older versions), shared between people seamlessly, and probably will only go down if WWIII breaks out. And it's easy to use. Here are a few potential arguments I can imagine people might make in favor of version control:

1. You can avoid having Fig_1.ai, Fig_1_2.ai, Fig_1_2_final_AR_PG_JK.ai, etc. Just make the change and commit! You have all the old versions!
2. You can keep track of who changed what code and roll things back (and manage file conflicts).

Well, to point 1, I actually think that there’s nothing really wrong with having all these different copies of a file around. It makes it really easy to quickly see what changed between different versions, which is especially useful for binary files (like Illustrator files) that you can’t run a diff on. Sure, it’s maybe a bit cleaner to have just one Fig_1.ai, but in practice, I think it’s actually less useful. In our lab, we haven’t bothered doing that, and it’s all worked out just fine.

Which brings us then to point 2, about tracking code changes. In thinking about this, I think it’s useful to separate out code that is for general purpose tools in the lab and code that is specific for a particular project. For code for general purpose tools that multiple team members are contributing to, version control makes a lot of sense–that’s what it was really designed for, after all. It’s very helpful to see older versions of the codebase, see the exact changes that other members of the team have made, and so forth.

These rationales don’t really apply, though, to code that people will write for analyzing data for a particular project. In our lab, and I suspect most others, this code is typically written by one or two people, and if two, they’re typically working in very close contact. Moreover, the end goal is not to have a record of a shifting codebase, but rather to have a single, finalized set of analysis scripts that will reproduce the figures and numbers in the paper. For this reason, the ability to roll back to previous versions of the code and annotate changes is of little utility in practice. I asked around lab, and I think there was maybe one time when we rolled back code. Otherwise, basically, for most analyses for papers, we just move forward and don’t worry about it. I suppose there is theoretically the possibility that some old analysis could prove useful that you could recover through version control, but honestly, most of the time, that ends up in a separate folder anyway. (One might say that’s not clean, but I think that it’s actually just fine. If an analysis is different in kind, then replacing it via version control doesn’t really make sense–it’s not a replacement of previous code per se.)

Of course, one could say, well, even if version control isn’t strictly necessary for reproducible analyses, what does it hurt? In my opinion, the big negative is the amount of friction version control injects into virtually every aspect of the analysis process. This is the price you pay for versioning and annotation, and I think there’s no way to get around that. With Dropbox, I just stick a file in and it shows up everywhere, up to date, magically. No muss, no fuss. If you use version control, it’s constant committing, pushing, pulling, updating, and adding notes. Moreover, if you’re like me, you will screw up at some point, leading to some problem, potentially catastrophic, that you will spend hours trying to figure out. I’m clearly not alone:

“Abort: remote heads forked” anyone? :) At that point, we all just call over the one person in lab who knows how to deal with all this crap and hope for the best. And look, I’m relatively computer savvy, so I can only imagine how intimidating all this is for people who are less computer savvy. The bottom line is that version control is cumbersome, arcane and time-consuming, and most importantly, doesn’t actually contribute much to a reproducible computational analysis. If the point is to encourage people who are relatively new to computation to make scripts and organize their computational results, I think directing them to adopt version control is a very bad idea. Indeed, for a while I was making everyone in our lab use version control for their projects, and overall, it has been a net negative in terms of time. We switched to Dropbox for a few recent projects and life is MUCH better–and just as reproducible.

Oh, and I think there are some people who use version control for the text of their papers (almost certainly a proper subset of those who are for some reason writing their papers in Markdown or LaTeX). Unless your paper has a lot of math in it, I have no idea why anyone would subject themselves to this form of torture. Let me be the one to tell you that you are no less smart or tough if you use Google Docs. In fact, some might say you’re more smart, because you don’t let command-line ethos/ideology get in the way of actually getting things done… :)

Which brings me to the example of figure scripting. Figure scripting is the process of making a figure completely from a script. Such a script will make all the subpanels, adjust all the font sizes, deal with all the colors, and so forth. In an ideal world with infinite time, this would be great–who wouldn't want to make all their figures magically appear by typing make figures? In practice, there are definitely some diminishing returns, and it's up to you where the line is between making it reproducible and getting it done. For me, the hard line is that all graphical elements representing data values should be coded. Like, if I make a scatterplot, then the locations of the points relative to the axes should be hard coded. Beyond that, Illustrator time! Illustrator will let you set the font size, the line weighting, marker color, and virtually every other thing you can think of simply and relatively intuitively, with immediate feedback. If you can set your font sizes and so forth programmatically, more power to you. But it's worth keeping in mind that the time you spend programming these things is time you could be spending on something else. This time can be substantial: check out this lengthy bit of code written to avoid a trip to Illustrator. Also, as the complexity of what you're trying to do gets greater, the fewer packages there are to help you make your figure. For instance, consider this figure from one of Marshall's papers:

Making gradient bars and all the lines and annotations would be a nightmare to do via script (and this isn't even very complicated). Yes, if you decide to make a change, you will have to redo some manual work in Illustrator, hence the common wisdom to make it all in scripts to "save time redoing things". But given the amount of effort it takes to figure out how to code that stuff, nine times out of ten, the total amount of time spent just redoing it will be less. And in a time when nobody reads things carefully, adding all these visual elements to your paper to make it easier to explain your work quickly is a strong imperative–stronger than making sure it all comes from a script, in my view.
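To show where that line between "coded" and "done by hand" might sit, here is a small Python sketch that computes only the positions of the data points relative to the axes and writes them out as plain SVG, a vector format that editors such as Illustrator can open. The function names, canvas size, and marker radius are all invented for the illustration; fonts, colors, and line weights are deliberately left at defaults for later manual adjustment.

```python
def scale(value, lo, hi, out_lo, out_hi):
    """Linearly map value from the data range [lo, hi] onto [out_lo, out_hi]."""
    return out_lo + (value - lo) * (out_hi - out_lo) / (hi - lo)


def scatter_svg(xs, ys, x_range, y_range, size=200, margin=20):
    """Return an SVG string with two axes and data points placed by code.

    Only positions are computed here; styling is left plain on purpose.
    """
    left, right = margin, size - margin
    top, bottom = margin, size - margin
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">']
    # Axes: x along the bottom edge, y up the left edge.
    parts.append(f'<line x1="{left}" y1="{bottom}" x2="{right}" y2="{bottom}" stroke="black"/>')
    parts.append(f'<line x1="{left}" y1="{bottom}" x2="{left}" y2="{top}" stroke="black"/>')
    for x, y in zip(xs, ys):
        cx = scale(x, *x_range, left, right)
        # SVG y grows downward, so the data minimum maps to the bottom edge.
        cy = scale(y, *y_range, bottom, top)
        parts.append(f'<circle cx="{cx:.1f}" cy="{cy:.1f}" r="3"/>')
    parts.append("</svg>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(scatter_svg([0, 5, 10], [0, 2, 4], x_range=(0, 10), y_range=(0, 4)))
```

Everything that encodes a data value comes from the script; everything cosmetic can be changed by hand without breaking the link back to the data.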

Anyway, all that said, what do we actually do in the lab? Having gone through a couple iterations, we've basically settled on the following. We make a Dropbox folder for the paper, and within the folder, we have subfolders, one for raw(ish) data, one for scripts, one for graphs and one for figures (perhaps with some elaborations depending on circumstances). In the scripts folder is a set of, uh, scripts that, when run, take the raw(ish) data and turn it into the graphical elements. We then assemble those graphical elements into figures, along with a readme file to document which files went into the figure. Those figures can contain heavily altered versions of the graphical elements, and we will typically adjust font sizes, ticks, colors, you name it, but if you want to figure out why some data point was where it was, the chain is fully accessible. Then, when we're done, we put the files all into bitbucket for anyone to access.
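For anyone who wants to set up a similar layout, here is a minimal Python sketch that creates the subfolders and a readme for tracking which graphs went into which figure. The folder names, the readme format, and both function names are assumptions made for the example, not a description of our actual scripts.

```python
import tempfile
from pathlib import Path

# Subfolder names are illustrative; adjust to taste.
SUBFOLDERS = ("raw_data", "scripts", "graphs", "figures")


def make_paper_folder(root):
    """Create the paper folder with its subfolders and a figures readme."""
    root = Path(root)
    for name in SUBFOLDERS:
        (root / name).mkdir(parents=True, exist_ok=True)
    readme = root / "figures" / "readme.txt"
    if not readme.exists():
        readme.write_text("Figure\tGraph files used\tScript that made them\n")
    return root


def record_figure(root, figure, graph_files, script):
    """Append one readme line noting which graphs and script fed a figure."""
    readme = Path(root) / "figures" / "readme.txt"
    with open(readme, "a") as handle:
        handle.write(f"{figure}\t{', '.join(graph_files)}\t{script}\n")


if __name__ == "__main__":
    # Demo in a temporary directory; in practice root would be the shared folder.
    paper = make_paper_folder(Path(tempfile.mkdtemp()) / "my_paper")
    record_figure(paper, "Fig_1.ai", ["scatter.pdf", "histogram.pdf"], "make_fig1_graphs.R")
    print((paper / "figures" / "readme.txt").read_text())
```

The readme is the part that keeps the chain intact: even after a figure has been heavily reworked by hand, it records which script and which graph files to look at.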

Oh, and one other thing about permanence: our scripts use some combination of R and MATLAB, and they work for now. They may not work forever. That's fine. Life goes on, and most papers don't. Those that do do so because of their scientific conclusions, not their data or analysis per se. So I'm not worried about it.

Update, 3/1/2016: Pretty predictable pushback from a lot of people, especially about version control. First, just to reiterate, we use version control for our general purpose tools, which are edited and used by many people, thus making version control the right tool for the job. Still, I have yet to hear any truly compelling arguments for using version control that would offset the substantial associated complexity for the use case I am discussing here, which is making the analyses in a paper reproducible. There are a lot of bald assertions of the benefits of version control out there without any real evidence for their validity other than "well, I think this should be better", and little frank discussion of the hassles of version control. This strikes me as similar to the pushback against the LaTeX vs. Word paper. Evidence be damned! :)

30 comments:

Michael Kuhn, February 29, 2016 at 5:10 AM
Dropbox actually keeps versions only for 30 days, or one year if you're a paying customer. So I wouldn't rely on it for long-term storage of versions. I see two reasons to use version control:

1. Scripts evolve e.g. during reviews, and it may be necessary to go back to the state of the, say, first submission to check why a p-value has changed.

2. A lot of bioinformatic analyses are exploratory programming, and this inevitably leads to dead ends where I have a functional piece of code that turns out to be unnecessary. Having this dead end in version control (with a prominent comment: "final state of X analysis before removing it") is better than having it commented-out in the scripts. (I also have a section "Things that didn't work" in a readme file, referencing this commit.)

Also, I think Mercurial is much more user-friendly than git.

PS: I think it's funny that you link my ggplot2 hack -- I only started with the script after doing the same things in Illustrator one too many times. Neither extensive, fragile hacks nor rote work in Illustrator are great.
David Martin, February 29, 2016 at 10:40 AM
As a former wet lab biologist and (for most of my career) bioinformatician, it reads somewhat differently if you replace the computational activities with standard wet lab activities. Do so and most wet lab PIs would be shocked at poor practice. The part of computational science that most biologists (IME) struggle with is accurate record keeping, just as you would want in a lab book. How reproducible is the work you do from your lab journal? The computational work should be as reproducible or better.

Saying 'used script XYZ with these options' is fine if script XYZ (or SOP XYZ) has not changed, but at least in my experience, scripts get modified and improved, made more robust or new features added as they tend to be tools to achieve analyses that are repeated again and again as work progresses.

Version control is overkill until you need it. An inability to use it correctly (even with nice tools like SourceTree) is on the same level as your grad student or postdoc being unable to use a machine or lab technique properly and getting poor/variable results that are frustrating and waste time.

As for time use, yes it is great to use Illustrator for final tweaks and layout, but basic work should be done right first time or you are wasting time.
Lorena Pantano, February 29, 2016 at 2:40 PM
Actually very good points. The only flaw I can see is that this requires everyone in a project to work the same way, and sometimes that is not straightforward. I agree that not everyone needs Subversion, and that it is not essential to have a script that creates everything. I would vote for good documentation and good code style over anything else.

How Subversion/git helps me:

- It forces me to write 'good' code (meaning the best I can do) all the time, because I like to have commits that mean something, and it forces me to think about how things need to change rather than just testing and changing things.
- It really helps me find when something changed, but only if you really use the commit messages. If I didn't have git, I would need to change the script and duplicate it somewhere, or write down why/when/how, just in case in 6 months I forget why a function is the way it is (maybe just to avoid repeating the error).
- It shows how up to date the code is and how many times it has changed. In my experience, code that has changed only once, or that was published a year ago with nothing happening since, has a higher chance of having bugs. Here I am talking about something in between a proper tool and some scripts to do an analysis (proper tools need, in my opinion, to have version control).

So for sure, if you can do all this by yourself, then good. For me, it would be much more difficult to keep documentation and code in different folders or to name versions.

In my case, one time I couldn't reproduce my nice results and decided to move to git, and for sure it is complicated at first. But I think that with the proper knowledge (I don't think you need to be an expert) it is enough to become friends with git.

You mentioned that this is painful for people with no computer knowledge. For sure, I don't expect this of everybody. But if your paper requires running many tools and many lines/scripts in R/MATLAB, I would expect you to know enough to produce clean, understandable code, and I will trust it more if I can see the evolution rather than only the final product.

I understand all your points for papers, more specifically text and figures, and I am actually somewhere in the middle between what you describe and scripting. As a note, sometimes the simplest figure is the best, and many people try to make fancy figures when they may not be needed.

Good discussion anyway!
Max, February 29, 2016 at 4:50 PM
I agree. It takes a lot of time. So people who write that everything should be reproducible are typically (not that there aren't exceptions) not from the labs that publish in glamour journals like Nature and Science, and for a reason. But then, people who publish a lot typically don't blog or tweet a lot, also for a reason.

One note: plotting packages are getting better. You can see it happening, under our eyes. Look e.g. at things like ggplot2 which was already somewhat a step forward (and sometimes backwards) and also http://stanford.edu/~mwaskom/software/seaborn/. I am hoping (and you may say in vain) that going to Illustrator will be less common. Sure. Next year. It'll all be so easy.
emb3, February 29, 2016 at 10:48 PM
You neglected to amortize the initial time cost of, e.g., making figures. After making a few, it is a minimal time investment to make new, reproducible ones from scratch, because you've learned the syntax and you have base code in a file for when you need to make new ones.
Fabien Campagne, February 29, 2016 at 10:53 PM
I agree that reproducibility is hard with existing tools (e.g., git, docker), and probably too hard for biologists and many beginners in bioinformatics.
I also agree that there are diminishing returns and that automating enough that you can re-run analyses on new datasets is often sufficient. We also use Illustrator to take plots the extra length to get them to publication quality, rather than scripting every font. I have no issue with this as long as beautification does not change the message of the plot.
However, I don't think the solution you outline is general enough to work for most situations. I don't doubt that it works in your lab, but for instance, it would not work in mine where raw data is hundreds of GB of sequencing data.
I believe the future will come from making reproducibility easier, and that this will likely involve better, more seamless tools, as well as education. I think reproducibility will only win (i.e., be widely adopted) once it becomes easier and more convenient than the alternatives.
dimpase, March 1, 2016 at 5:40 PM
Version control is priceless for working collaboratively on a paper (needless to say, it's 1000 times more priceless if you collaborate on code). When your co-author (particularly if it is your PhD student!) does an update, you really do not want to read the whole text again; surely you can make diffs to see what got changed, but it gets very messy very quickly, as there are always more updates. The alternative is to have all this "version of blah/blah dated such and such by him and her" mess in your files.
John Peloquin, March 1, 2016 at 8:48 PM
Google Docs and similar systems (MS Word + OneDrive, Dropbox), I think, should count as a limited form of version control: one file per repo, linear history, no commit messages, and only one file format allowed. In return for these limitations, you get automatic, real-time commit/push/pull.

I don't see any particular reason for treating math-heavy papers differently than math-light papers. Microsoft's equation editor became rather nice back in 2007, and allows cross-referencing equation numbers. The procedure involves a lot of clicking through dialog boxes, but so do most things in Word.

I personally use version control and text-based formats when writing because I find it more cognitively comfortable than MS Word or Google Docs. (Although I learned these tools for reasons entirely unrelated to writing.) However, I almost never recommend version control, scripting, etc. to other people because being "the one person in lab who knows how to deal with all this crap", as Raj says, can be hazardous to one's time.
Ryan Pepper, March 2, 2016 at 3:27 AM
I think the reason that so many people are ardent enthusiasts for version control is simply that all of us have had some kind of problem that would have been solved had we used it. Using version control on its own is not enough to make your work reliable; it's just part of a series of things that you should do when writing software, such as writing tests and running them against your changes.

Writing software tests when you're analysing big data is always going to be pretty tricky, but you should be able to procedurally generate a minimal set of data, run your analysis against it, and make sure it gives you the answer that you expect.

It's really important, because if someone changes something, you get immediate feedback if something breaks "Your tests have failed!". I'm no huge fan of Git, but I do like GitHub, because of the integrations you have with testing frameworks like TravisCI and CircleCI which are free for open source projects.

This also makes your life 100x easier if you want to refactor something - you can check that you get exactly the same answer once you have cleaned up your code.

There is a really good talk by Mike Croucher from Sheffield University; the references here may be of interest.
http://mikecroucher.github.io/MLPM_talk/
Vincent Noel, August 13, 2016 at 7:23 PM
Thanks for that article, you have articulated my own unconscious thoughts on the subject. I did not have the guts to go against the very well-accepted idea that version control is always inherently the best practice, by principle.

I have also fudged scripted figures by hand, but when I do it I often end up doing the manual modifications enough times to regret not doing them by script. To each his own I guess. Still, I'm 100% with you that doing git all by yourself on paper code is a big waste of time.