Thursday, March 3, 2016

From over-reproducibility to a reproducibility wish-list

Well, it’s clear that that last blog post on over-reproducibility touched a bit of a nerve. ;)

Anyway, a lot of the feedback was rather predictable and not particularly convincing, but I was pointed to this discussion on the Software Carpentry website, which was actually really nice:

On 2016-03-02 1:51 PM, Steven Haddock wrote:
> It is interesting how this has morphed into a discussion of ways to convince / teach git to skeptics, but I must say I agreed with a lot of the points in the RajLab post.
> Taking a realistic and practical approach to use of computing tools is not something that needs to be shot down (people sound sensitive!). Even if you can’t type `make paper` to recapitulate your work, you can still be doing good science…
+1 (at least) to both points. What I've learned from this is that many scientists still see cliffs where they want on-ramps; better docs and lessons will help, but we really (really) need to put more effort into usability and interoperability. (Diff and merge for spreadsheets!)

So let me turn this around and ask Arjun: what would it take to convince you that it *was* worth using version control and makefiles and the like to manage your work? What would you, as a scientist, accept as compelling?


Dr Greg Wilson
Director of Instructor Training
Software Carpentry Foundation

First off, thanks to Greg for asking! I really appreciate the active attempt to engage.

Secondly, let me just say that as to the question of what it would take for us to use version control, the answer is nothing at all, because we already use it! More specifically, we use it in places where we think it’s most appropriate and efficient.

I think it may be helpful for me to explain what we do in the lab and how we got here. Our lab works primarily on single cell biology, and our methods are primarily single molecule/single cell imaging techniques and, more recently, various sequencing techniques (mostly RNA-seq, some ATAC-seq, some single cell RNA-seq). My lab has people with pretty extensive coding experience and people with essentially no coding experience, and many in between (I see it as part of my educational mission to try and get everyone to get better at coding during their time in the lab). My PhD is in applied math with a side of molecular biology, during which time we developed a lot of the single RNA molecule techniques that we are still using today. During my PhD, I was doing the computational parts of my science in an only vaguely reproducible way, and that scared me. Like “Hmm, that data point looks funny, where did that come from?”. Thus, in my postdoc, I started developing a little MATLAB "package" for documenting and performing image analysis. I think this is where our first efforts in computational reproducibility began.

When I started in the lab in 2010, my (totally awesome) first student Marshall and I took the opportunity to refactor our image analysis code, and we decided to adopt version control for these general image processing tools. After a bit of discussion, we settled on Mercurial because it was supposed to be easier to use than git. This has served us fairly well. Then, my brilliant former postdoc Gautham got way into software engineering and completely refactored our entire image processing pipeline, which is basically what we are using today, and is the version that we point others to use here. Since then, various people have contributed modules and so forth. For this sort of work, version control is absolutely essential: we have a team of people contributing to a large, complex codebase that is used by many people in the lab. No brainer.

In our work, we use these image processing tools to take raw data and turn it into numbers that we then use to hopefully do some science. This involves the use of various analysis scripts that will take this data, perform whatever statistical analysis and so forth on it, and then turn that into a graphical element. This is usually done by one or, more often, two people in the lab working closely together.

Right around the time Gautham left the lab, we had several discussions about software best practices in the lab. Gautham argued that every project should have a repository for these analysis scripts. He also argued that the commit history could serve as a computational lab notebook. At the time, I thought the idea of a repo for every project was a good one, and I cajoled people in the lab into doing it. I pretty quickly pushed back on the version-control-as-computational-lab-notebook claim, and I still feel that pretty strongly. I think it’s interesting to think about why. Version control is a tool that allows you to keep track of changes to code. It is not something that will naturally document what that code does. My feeling is that version control is in some ways a victim of its own success: it is such a useful tool for managing code that it is now widely used and promoted, and as a side-effect it is now being used for a lot of things for which it is not quite the right tool, a point I’ll come back to.

Fast forward a little bit. Using version control in the repo-for-every-project model was just not working for most people in the lab. To give a sense of what we’re doing, in most projects, there’s a range of analyses, sometimes just making a simple box-plot or bar graph, sometimes long-ish scripts that take, say, RNA counts per cell and fit to a model of RNA production, extracting model parameters with error bounds. Sometimes it might be something still more complicated. The issue with version control in this scenario is all the headache. Some remote heads would get forked. Somehow things weren't syncing right. Some other weird issue would come up. Plus, frankly all the commit/push/pull/update was causing some headaches, especially if someone forgot to push.

One student in the lab and I were just working on a large project together, and after bumping into these issues over and over, she just said “screw it, can we just use Dropbox?” I was actually reluctant at first, but then I thought about it a bit more. What were we really losing? As I mention in the blog post, our goal is a reproducible analysis. For this, versioning is at best a means towards this goal, and in practice for us, a relatively tangential means. Yes, you can go back and use earlier versions. Who cares? The number of times we’ve had to do that in this context is basically zero.

One case people have mentioned as a potential benefit for version control is performing alternative, exploratory analyses on a particular dataset, the idea being you can roll back and compare results. I would argue that version control is not the best way to perform or document this. Let’s say I have a script for “myCoolAnalysis”. What we do in lab is make “myAlternativeAnalysis” in which we code our new analysis. Now I can easily compare. Importantly, we have both versions around. The idea of keeping the alternative version in version control is I think a bad one: it’s not discoverable except by searching the commit log. Let’s say that you wanted to go back to that analysis in the future. How would you find it? I think it makes much more sense to have it present in the current version of the code than to dig through the commit history. One could argue that you could fork the repo, but then changes to other, unrelated parts of the repo would be hard to deal with. Overall, version control is just not the right tool for this, in my opinion.

Another, somewhat related point that people have raised is looking back to see why some particular output changed. Here, we’re basically talking about bugs/flawed analyses. There is some merit to this, and so I acknowledge there is a tradeoff, and that once you get to a certain scale, version control is very helpful. However, I think that for scientific programming at the scale I’m talking about, it’s usually fairly clear what caused something to change, and I’m less concerned about why something changed and much more worried about whether we’re actually getting the right answer, which is always a question about the code as it stands. For us, the vast majority of the time, we are moving forward. I think the emphasis here would be better on teaching people about how to test their code (which is a scientific problem more than a programming problem) than version control.

Which leads me to really answering the question: what would I love to have in the lab? On a very practical level, look, version control is still just too hard and annoying to use for a lot of people and injects a lot of friction into the process. I have some very smart people in my lab, and we all have struggled from time to time. I’m sure we can figure it out, but honestly, I see little impetus to do so for the use cases outlined above, and yes, our work is 100% reproducible without it. Moving (back) to Dropbox has been a net productivity win, allowing us to work quickly and efficiently together. Also, the hassle free nature of it was a real relief. On our latest project, while using version control, we were always asking “oh, did you push that?”, “hmm, what happened?”, “oh, I forgot to update”. (And yes, we know about and sometimes use SourceTree.) These little hassles all add up to a real cognitive burden, and I’m sorry, but it's just a plain fact that Dropbox is less work. Now it’s just “Oh, I updated those graphs”, “Looks great, nice!”. Anyway, what I would love is Dropbox with a little bit more version tracking. And Dropbox does have some rudimentary versioning, basically a way to recover from an "oh *#*$" moment–the thing I miss most is probably a quick diff. Until this magical system emerges, though, on balance, it is currently just more efficient for us not to use version control for this type of computational work. I posit that the majority of people who could benefit from some minimal computational reproducibility practices fall into this category as well.
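
The "quick diff" part of that wish, at least, can be approximated with Python's standard library. What follows is only a minimal sketch, assuming you still have the text of an older copy of a script (for example, recovered from Dropbox's version history); the function name `quick_diff` and the example file names and contents are hypothetical.

```python
import difflib


def quick_diff(old_text, new_text, old_name="old", new_name="new"):
    """Return a unified diff between two versions of a text as one string."""
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=old_name,
        tofile=new_name,
    )
    return "".join(diff)


if __name__ == "__main__":
    # Hypothetical before/after contents of an analysis script.
    before = "counts = load('rna.csv')\nplot(counts)\n"
    after = "counts = load('rna.csv')\ncounts = counts(counts > 0)\nplot(counts)\n"
    print(quick_diff(before, after, "analysis_old.m", "analysis.m"), end="")
```

Identical inputs produce an empty string, so the function can also serve as a cheap "did anything change?" check.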

Testing: I think getting people in the habit of testing would be a huge move in the right direction. And I think this means scientific code testing, not just “program doesn’t crash” testing. When I teach my class on molecular systems biology, one of my secret goals is to teach students a little bit about scientific programming. For those who have some programming experience, they often fall into the trap of thinking “well, the program ran, so it must have worked”, which is often fine for, say, a website or something, but it’s usually just the beginning of the story for scientific programming and simulations. Did you look for the order of convergence (or convergence at all)? Did you look for whether you’re getting the predicted distribution in a well-known degenerate case? Most people don’t think about programming that way. Note that none of this has anything to do with version control per se.
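
As a minimal sketch of what such a scientific test can look like, one can check that a numerical error shrinks at the expected rate when the step size is halved, and that a degenerate case gives the known answer. The decay equation, the forward Euler integrator, and the names `euler_decay` and `observed_order` are illustrative assumptions, not code from the lab.

```python
import math


def euler_decay(rate, x0, t_end, n_steps):
    """Integrate dx/dt = -rate * x with forward Euler (first-order accurate)."""
    dt = t_end / n_steps
    x = x0
    for _ in range(n_steps):
        x += dt * (-rate * x)
    return x


def observed_order(rate=1.0, x0=1.0, t_end=1.0, n_steps=200):
    """Estimate the order of convergence by halving the step size once."""
    exact = x0 * math.exp(-rate * t_end)
    err_coarse = abs(euler_decay(rate, x0, t_end, n_steps) - exact)
    err_fine = abs(euler_decay(rate, x0, t_end, 2 * n_steps) - exact)
    return math.log2(err_coarse / err_fine)


if __name__ == "__main__":
    order = observed_order()
    # Forward Euler should converge with order close to 1.
    assert abs(order - 1.0) < 0.1, f"unexpected order of convergence: {order:.3f}"
    # Degenerate case: with zero decay rate the value must not change at all.
    assert euler_decay(0.0, 2.0, 1.0, 10) == 2.0
    print(f"observed order of convergence: {order:.3f}")
```

A program that merely "ran" would pass neither check by construction; an integrator with a sign error or a wrong step size would fail one of them.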

On a bigger level, I think the big unmet need is that of a nice way to document an analysis as it currently stands. What would such documentation do? Ideally, it would document the analysis in a searchable and discoverable way. Gautham and I discussed this at length when he was in lab, but we never got around to implementing anything. Here’s one idea we were tossing around. Let’s say that you kept your work in a directory tree structure, with analyses organized by subfolder. Like, you could keep that analysis of H3K4me3 in “histoneModificationComparisons/H3K4me3/”, then H3K27me3 in “histoneModificationComparisons/H3K27me3/”. In each directory, you have the scripts associated with a particular analysis, and then running those scripts produces an output graph. That output graph could either be stored in the same folder or in a separate “graphs” subfolder. Now, the scripts and the graphs would have metadata (not sure what this would look like in practice), so you could have a script go through and quickly generate a table of contents with links to all these graphs for easy display and search. Perhaps this is similar to those IPython notebooks or whatever. Anyway, the main feature is that this would make all those analyses (including older ones that don't make it in the paper) discoverable (via tagging/table of contents) and searchable (search:“H3K27”). For me, this would be a really helpful way to document an analysis, and would be relatively lightweight and would fit into our current workflow. Which reminds me: we should do this.
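
The table-of-contents idea could be sketched roughly as follows. This is an illustration under assumptions that are not in the post: graphs are assumed to be `.png` or `.pdf` files, the folder names stand in for the still-undefined metadata, and `build_toc`, `search_toc` and `toc_as_text` are hypothetical helper names.

```python
from pathlib import Path

GRAPH_SUFFIXES = {".png", ".pdf"}  # assumed output formats


def build_toc(root):
    """Return sorted (analysis_folder, graph_path) pairs found under root."""
    root = Path(root)
    entries = []
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in GRAPH_SUFFIXES:
            parent = path.parent
            # Treat a "graphs" subfolder as belonging to the analysis above it.
            if parent.name == "graphs" and parent != root:
                parent = parent.parent
            analysis = parent.relative_to(root).as_posix()
            entries.append((analysis, path.relative_to(root).as_posix()))
    return sorted(entries)


def search_toc(entries, query):
    """Case-insensitive substring search over analysis names and graph paths."""
    q = query.lower()
    return [e for e in entries if q in e[0].lower() or q in e[1].lower()]


def toc_as_text(entries):
    """Render the table of contents as plain text, one graph per line."""
    return "\n".join(f"{analysis}: {graph}" for analysis, graph in entries)


if __name__ == "__main__":
    # Index the current directory and show only the H3K27 analyses.
    toc = build_toc(".")
    print(toc_as_text(search_toc(toc, "H3K27")))
```

Because the index is rebuilt from whatever is on disk, older analyses that never made it into the paper show up alongside current ones, which is the discoverability property the idea is after.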

I also think that a lot of this discussion is really sort of veering around the simple task of keeping a computational lab notebook. This is basically a narrative about what you tried, what worked, what didn’t work, and how you did it, why you did it, and what you learned. I believe there have been a lot of computational lab notebook attempts out there, from essentially keyloggers on up, and I don’t know of any that have really taken off. I think the main thing that needs to change there is simply the culture. Version control is not a notebook, keylogging is not a notebook, the only thing that is a notebook is you actually spending the time to write down what you did, carefully and clearly–just like in the lab. When I have cajoled people in the lab into doing this, the resulting documents have been highly useful to others as how-to guides and as references. There have been depressingly few such documents, though.

Also, seriously, let's not encourage people to use version control for maintaining their papers. This is just about the worst way to sell version control. Unless you're doing some heavy math with LaTeX or working with a very large document, Google Docs or some equivalent is the clear choice every time, and it will be impossible to convince me otherwise. Version control is a tool for maintaining code. It was never meant for managing a paper. Much better tools exist. For instance, Google Docs excels at easy sharing, collaboration, simultaneous editing, commenting and reply-to-commenting. Sure, one can approximate these using text-based systems and version control. The question is why anyone would like to do that. Not everything you do on a computer maps naturally to version control.

Anyway, that ended up being a pretty long response to what was a fairly short question, but I also just want to reiterate that I find it reassuring that people like Greg are willing to listen to these ramblings and hopefully find something positive from it. My lab is really committed to reproducible computational analyses, and I think I speak for many when I describe the challenges we and others face in making it happen. Hopefully this can stimulate some new discussion and ideas!


  1. I enthusiastically agree on the importance of (1) discoverable, instructive documentation and (2) testing/verification. Documentation is far more important than software tools for efficiently communicating knowledge. It's very frustrating and sad when work has to be re-done because the last person who knew the relevant details graduated years ago and left no paper trail.

    Regarding computational laboratory notebooks, I'm personally fond of a folder structure that goes project_name/journal/worker_name/YYYY-MM-DD_keywords_about_this_entry.{docx,org,txt,rtf,...}. Each file has a corresponding folder with the same name, just without the file extension. Supporting images, snippets, or tables that are specific to that entry (i.e., not going to be used repeatedly) go in that folder. Hyperlinks (file:relative/path/to/data) are used liberally in the journal entries to refer to data in the central data store, wherever that may be. This seems pretty flexible and unlikely to cause new problems while solving old ones. Each person can work in their own preferred format, or just add text files with URLs to online documentation. I often edit old entries to cross-reference relevant new entries; these edits should perhaps be confined to a specific section of the file dedicated to this purpose.

    I am considering cutting the binding off my completed paper notebooks and running them through an automatic feed scanner, especially when I move between jobs. The fact that a paper notebook only exists in one place creates issues. It's also a risk---a notebook may get damaged or lost.

    I don't think version control helps reproducibility, unless we're talking about reproducing (really, replaying) every action that led to the final product. It may reduce errors, though. Version control does reduce bug counts in software, but code review (which I believe you've blogged about several months ago) actually reduces bug counts more. I unfortunately can't recall the citation for this, but it was definitely based on empirical bug counts. So for scientific practice, adopting version control may be less useful than ensuring robust intra-lab peer review.

  2. Thanks for sharing your view, it is an interesting point of discussion.

    I understand your argument and agree that not everything needs to be under version control. I also use just Dropbox or GoogleDrive for small projects that only have a few simple scripts that I write mostly all at once and that I don't modify much later on. However, I find version control very useful for software that I keep modifying even if it's a solo work. (BTW, I'm very familiar with the git command line but I use SourceTree GUI 90% of the time).

    One advantage of VC is that it forces me to write a log of what I am doing. This helps when juggling between different projects. Going back to a project after one or a few weeks, I can easily see what I was working on and keep going. The same can be done with a README file, but I find it harder to keep this discipline.
    Another advantage is code review. At commit-time, I have a second look at what I changed. Many times I find typos and small errors in this phase and I can correct them before committing. Other times, I catch unintended modifications that were some random test/exploration I forgot to revert. I would probably find the same bugs later on, when analyzing the results but it would take more time to track them down.
    In some cases, I keep png figures in the repository. SourceTree shows figures changes side-by-side and I find it a quick way to track if some modification had unintended consequences.

    I was thinking that maybe you can use git inside the Dropbox folder. Everybody will receive the latest version via Dropbox, like you do now. In addition, one person is in charge of committing so you still have snapshots that you can diff (and the full repository is also available to everybody because the .git folder is synced by Dropbox).

    Regarding lab notebooks, I use Jupyter (formerly IPython) notebooks. A Jupyter notebook is very close to what you describe. It is a document (with sections, links, equations, etc.) that contains the code and the results/figures (you run the notebook to populate it with output and figures). I tend to have a notebook per data file (or per data folder). When, later on, I go back to the data I can open the notebook, see the narrative and the results. Even if you don't write much text, the notebook gives a lightweight narrative with a sequential order or logical connection to the series of figures. I find it better than a script and a folder with a bunch of figures.

    The database that you envision is still missing and is an orthogonal problem to using scripts vs notebooks. But I guess the big problem is finding proper field-specific (or lab-specific) tags to categorize all of your work.

    1. These are interesting points! I actually really like the idea of using version control together with Dropbox as you suggest. Perhaps we'll try that out in the lab.

      Still not convinced about version control as a notebook. I think that we need to change the culture of how we document computational analyses. If version control helps you enforce that, great, but I agree with you that it's not the only way to do it. I think that it's not the best tool for this, though.

      Jupyter notebooks: not a bad idea. Although I've heard online that some reproducibility mavens don't recommend it. Not sure why.

    2. Which tool do you think is better to take notes about the work you do on software? Documentation of the software is a different beast, I agree with you. You really need to write documentation to make the software reusable. But for keeping track of what you do daily I don't know better tools than VCS.

      For the other point, notebooks are an improvement over scripts for reproducibility because they embed documentation with code and results (they provide the logical "flow"). Critics of notebooks point out that notebooks don't automatically capture your whole environment (all the software installed, the platform and so forth), so they don't include all the information needed to reproduce your work. That's true, but to mitigate this I, for one, print the versions of all the libraries used at the beginning of the notebook so this information is stored with the notebook. But in reality this problem is orthogonal to notebooks. For projects I want to reproduce years later, I use a separate conda environment.

      Another "pain point" of using notebooks is that they are JSON-based and not as easily diff-able as plain scripts. That's also true, but there are workarounds. First, for notebooks embedded in a library I always commit the notebook without output (this makes it easier to visually inspect the diff before committing). There is also ipymd, an extension that stores notebooks in Markdown format (very easy to diff).

      People are aware of these problems and are trying to solve them. But the Jupyter notebook of today can already simplify organization of data analysis projects and make them more reproducible.

  3. Google Docs is definitely an easy and effective way to maintain and share a variety of documents. My only concern over the years is the level of privacy these documents have. I always feel anxious about using it for confidential or proprietary materials.
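
Two of the notebook workarounds described in the comments above, recording library versions at the top of a notebook and committing notebooks without their output, can be sketched in a few lines. This is an illustration only, not the commenter's actual setup: the package names in the example and the helper names `report_versions` and `strip_outputs` are assumptions, and the second function assumes the notebook has already been loaded from its `.ipynb` file as a plain dictionary in the version 4 notebook layout.

```python
import json
import sys
from importlib import metadata


def report_versions(packages):
    """Map each package name to its installed version (None if not installed)."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def strip_outputs(notebook):
    """Return a copy of a notebook dict without code-cell outputs."""
    stripped = json.loads(json.dumps(notebook))  # deep copy of plain JSON data
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


if __name__ == "__main__":
    # First cell of a notebook: record the environment next to the results.
    print(report_versions(["numpy", "matplotlib"]))
```

To apply `strip_outputs` to a file, one would `json.load` the notebook, pass the result through the function, and `json.dump` it back before committing; the original dictionary is left untouched.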