Sunday, February 28, 2016

From reproducibility to over-reproducibility

[See also the follow-up post.]

It's no secret that biomedical research is requiring more and more computational analyses these days, and with that has come some welcome discussion of how to make those analyses reproducible. On some level, I guess it's a no-brainer: if it's not reproducible, it's not science, right? And on a practical level, I think there are a lot of good things about making your analysis reproducible, including the following (vaguely ranked starting with what I consider most important):
  1. Umm, that it’s reproducible.
  2. It makes you a bit more careful, which makes your code more likely to be right, cleaner, and more readable to others.
  3. This in turn makes it easier for others in the lab to access and play with the analyses and data in the future, including the PI.
  4. It could be useful for others outside the lab, although as I’ve said before, I think the uses for our data outside our lab are relatively limited beyond the scientific conclusions we have made. Still, whatever, it’s there if you want it. I also freely admit this might be more important for people who do work other people actually care about. :)
Balanced against these benefits, though, is a non-negligible negative:
  1. It takes a lot of time.
On balance, I think making things as reproducible as possible is time well spent. In particular, it's time that could be well spent by the large proportion of the biomedical research enterprise that currently doesn't think about this sort of thing at all, and I think it is imperative for those of us with a computational inclination to help train others to make their analyses reproducible.

My worry, however, is that the strategies for reproducibility that computational types are often promoting are off-target and not necessarily adapted to the needs and skills of the people they are trying to reach. There is a certain strain of hyper-reproducible zealotry that I think is discouraging others from adopting some basic practices that could greatly benefit their research, and at the same time is limiting the productivity of even its own practitioners. You know what I'm talking about: it's the idea of turning your entire paper into a program, so you just type "make paper" and out pops the fully formed and formatted manuscript. Fine in the abstract, but in a line of work (like many others) in which time is our most precious commodity, these compulsions represent a complete failure to correctly measure opportunity costs. In other words, instead of hard coding the adjustment of the figure spacing in your LaTeX preprint, spend that time writing another paper. I think it's really important to remember that our job is science, not programming, and if we focus too heavily on the procedural aspects of making everything reproducible and fully documented, we risk turning those who are less comfortable with programming away from the very real benefits of making their analysis reproducible.

Here are the two biggest culprits in my view: version control and figure scripting.

Let's start with version control. I think we can all agree that the most important part of making a scientific analysis reproducible is to make sure the analysis is in a script and not just typed or clicked into a program somewhere, only for those commands to vanish into faded memory. A good, reproducible analysis script should start with raw data, go through all the computational manipulations required, and leave you with a number or graphical element that ends up in your paper somewhere. This makes the analysis reproducible, because someone else can now just run the code and see how your raw data turned into that p-value in Figure 4G. And remember, that someone else is most likely your future self :).
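To make that concrete, here's a minimal sketch of what such a script might look like. This is illustrative Python with invented file names and an invented statistic (in our lab it would be R or MATLAB); the shape is the point: raw data in, the number that lands in the paper out.

```python
# analyze_fig4G.py -- hypothetical example of a start-to-finish analysis
# script: reads raw measurements, computes the summary statistic that
# appears in the paper, and writes it out. File names and the statistic
# are invented for illustration.
import csv
import statistics

def load_measurements(path):
    """Read a two-column CSV of (condition, value) rows into groups."""
    groups = {}
    with open(path, newline="") as f:
        for condition, value in csv.reader(f):
            groups.setdefault(condition, []).append(float(value))
    return groups

def fold_change(groups):
    """The one number that ends up in the figure: treated mean over control mean."""
    return statistics.mean(groups["treated"]) / statistics.mean(groups["control"])

def main(raw_csv="rawdata/fig4G_measurements.csv",
         out_txt="graphs/fig4G_foldchange.txt"):
    groups = load_measurements(raw_csv)
    with open(out_txt, "w") as out:
        out.write(f"fold change: {fold_change(groups):.2f}\n")
```

Anyone (including future you) can rerun this and trace the number in the paper straight back to the raw file.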

Okay, so we hopefully all agree on the need for scripts. Then, however, almost every discussion about computational reproducibility begins with a directive to adopt git or some other version control system, as though it's the obvious next step. Hmm. I'm just going to come right out and say that for the majority of computational projects (at least in our lab), version control is a waste of time. Why? Well, what is the goal of making a reproducible analysis? I believe the goal is to have a documented set of scripts that take raw data and reliably turn it into a bit of knowledge of some kind. The goal of version control is to manage code, in particular emphasizing "reversibility, concurrency, and annotation [of changes to code]". While one can imagine some overlap between these goals, I don't necessarily see a natural connection between them. To make that more concrete, let's try to answer the question that I've been asking (and been asked), which is "Why not just use Dropbox?". After all, Dropbox will keep all your code and data around (including older versions), share it between people seamlessly, and probably only go down if WWIII breaks out. And it's easy to use. Here are a few potential arguments I can imagine people might make in favor of version control:
  1. You can avoid having Fig_1.ai, Fig_1_2.ai, Fig_1_2_final_AR_PG_JK.ai, etc. Just make the change and commit! You have all the old versions!
  2. You can keep track of who changed what code and roll things back (and manage file conflicts).
Well, to point 1, I actually think that there’s nothing really wrong with having all these different copies of a file around. It makes it really easy to quickly see what changed between different versions, which is especially useful for binary files (like Illustrator files) that you can’t run a diff on. Sure, it’s maybe a bit cleaner to have just one Fig_1.ai, but in practice, I think it’s actually less useful. In our lab, we haven’t bothered doing that, and it’s all worked out just fine.

Which brings us then to point 2, about tracking code changes. In thinking about this, I think it’s useful to separate out code that is for general purpose tools in the lab and code that is specific for a particular project. For code for general purpose tools that multiple team members are contributing to, version control makes a lot of sense–that’s what it was really designed for, after all. It’s very helpful to see older versions of the codebase, see the exact changes that other members of the team have made, and so forth.

These rationales don’t really apply, though, to code that people will write for analyzing data for a particular project. In our lab, and I suspect most others, this code is typically written by one or two people, and if two, they’re typically working in very close contact. Moreover, the end goal is not to have a record of a shifting codebase, but rather to have a single, finalized set of analysis scripts that will reproduce the figures and numbers in the paper. For this reason, the ability to roll back to previous versions of the code and annotate changes is of little utility in practice. I asked around lab, and I think there was maybe one time when we rolled back code. Otherwise, basically, for most analyses for papers, we just move forward and don’t worry about it. I suppose there is theoretically the possibility that some old analysis could prove useful that you could recover through version control, but honestly, most of the time, that ends up in a separate folder anyway. (One might say that’s not clean, but I think that it’s actually just fine. If an analysis is different in kind, then replacing it via version control doesn’t really make sense–it’s not a replacement of previous code per se.)

Of course, one could say, well, even if version control isn’t strictly necessary for reproducible analyses, what does it hurt? In my opinion, the big negative is the amount of friction version control injects into virtually every aspect of the analysis process. This is the price you pay for versioning and annotation, and I think there’s no way to get around that. With Dropbox, I just stick a file in and it shows up everywhere, up to date, magically. No muss, no fuss. If you use version control, it’s constant committing, pushing, pulling, updating, and adding notes. Moreover, if you’re like me, you will screw up at some point, leading to some problem, potentially catastrophic, that you will spend hours trying to figure out. I’m clearly not alone:
“Abort: remote heads forked” anyone? :) At that point, we all just call over the one person in lab who knows how to deal with all this crap and hope for the best. And look, I'm relatively computer savvy, so I can only imagine how intimidating all this is for people who are less computer savvy. The bottom line is that version control is cumbersome, arcane, and time-consuming, and most importantly, doesn't actually contribute much to a reproducible computational analysis. If the point is to encourage people who are relatively new to computation to make scripts and organize their computational results, I think directing them to adopt version control is a very bad idea. Indeed, for a while I was making everyone in our lab use version control for their projects, and overall, it has been a net negative in terms of time. We switched to Dropbox for a few recent projects and life is MUCH better–and just as reproducible.

Oh, and I think there are some people who use version control for the text of their papers (almost certainly a proper subset of those who are for some reason writing their papers in Markdown or LaTeX). Unless your paper has a lot of math in it, I have no idea why anyone would subject themselves to this form of torture. Let me be the one to tell you that you are no less smart or tough if you use Google Docs. In fact, some might say you’re more smart, because you don’t let command-line ethos/ideology get in the way of actually getting things done… :)

Which brings me to the example of figure scripting. Figure scripting is the process of making a figure completely from a script. Such a script will make all the subpanels, adjust all the font sizes, deal with all the colors, and so forth. In an ideal world with infinite time, this would be great–who wouldn't want to make all their figures magically appear by typing make figures? In practice, there are definitely some diminishing returns, and it's up to you where the line is between making it reproducible and getting it done. For me, the hard line is that all graphical elements representing data values should be coded. Like, if I make a scatterplot, then the locations of the points relative to the axes should come from the script. Beyond that, Illustrator time! Illustrator will let you set the font size, the line weighting, marker color, and virtually every other thing you can think of simply and relatively intuitively, with immediate feedback. If you can set your font sizes and so forth programmatically, more power to you. But it's worth keeping in mind that the time you spend programming these things is time you could be spending on something else. This time can be substantial: check out this lengthy bit of code written to avoid a trip to Illustrator. Also, the more complex the thing you're trying to do, the fewer packages there are to help you make your figure. For instance, consider this figure from one of Marshall's papers:

[Figure: gradient bars with lines and annotations, from Marshall's paper]

Making gradient bars and all the lines and annotations would be a nightmare to do via script (and this isn't even very complicated). Yes, if you decide to make a change, you will have to redo some manual work in Illustrator, hence the common wisdom to make it all in scripts to "save time redoing things". But given the amount of effort it takes to figure out how to code that stuff, nine times out of ten, the total amount of time spent just redoing it will be less. And in a time when nobody reads things carefully, adding all these visual elements to your paper to make it easier to explain your work quickly is a strong imperative–stronger than making sure it all comes from a script, in my view.
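For what it's worth, here's a toy sketch of the part I do think belongs in a script: mapping data values to point positions relative to the axes. It's dependency-free Python that writes a bare-bones SVG (invented purely for illustration; our real scripts are R or MATLAB), and everything cosmetic about the resulting file is then fair game for Illustrator.

```python
# scatter_svg.py -- hypothetical sketch of the "hard line": the script owns
# where each data point sits relative to the axes; fonts, colors, and line
# weights get polished by hand in Illustrator afterwards.

def scatter_svg(xs, ys, width=300, height=300, pad=20):
    """Map data values to pixel coordinates and emit a minimal SVG scatterplot."""
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)

    def sx(x):  # data x -> pixel x
        return pad + (x - x0) / (x1 - x0) * (width - 2 * pad)

    def sy(y):  # data y -> pixel y (SVG's y axis points down)
        return height - pad - (y - y0) / (y1 - y0) * (height - 2 * pad)

    circles = "\n".join(
        f'  <circle cx="{sx(x):.1f}" cy="{sy(y):.1f}" r="3"/>'
        for x, y in zip(xs, ys)
    )
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width}" height="{height}">\n{circles}\n</svg>\n')
```

Save the returned string as an .svg, open it in Illustrator, and restyle at will; if anyone asks why a point sits where it does, the answer is in the script.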

Anyway, all that said, what do we actually do in the lab? Having gone through a couple iterations, we've basically settled on the following. We make a Dropbox folder for the paper, and within the folder, we have subfolders, one for raw(ish) data, one for scripts, one for graphs and one for figures (perhaps with some elaborations depending on circumstances). In the scripts folder is a set of, uh, scripts that, when run, take the raw(ish) data and turn it into the graphical elements. We then assemble those graphical elements into figures, along with a readme file to document which files went into the figure. Those figures can contain heavily altered versions of the graphical elements, and we will typically adjust font sizes, ticks, colors, you name it, but if you want to figure out why some data point was where it was, the chain is fully accessible. Then, when we're done, we put the files all into bitbucket for anyone to access.
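For completeness, that folder skeleton is simple enough to set up programmatically; here's an illustrative Python sketch (the subfolder names follow the description above; the readme line is just a placeholder):

```python
# make_paper_folder.py -- hypothetical helper that creates the paper layout
# described above: one subfolder each for raw(ish) data, scripts, graphical
# elements, and assembled figures.
import os

SUBFOLDERS = ["rawdata", "scripts", "graphs", "figures"]

def make_paper_folder(root):
    """Create the paper folder and seed the figures readme."""
    for sub in SUBFOLDERS:
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    readme = os.path.join(root, "figures", "readme.txt")
    if not os.path.exists(readme):
        with open(readme, "w") as f:
            # Document which graphical elements went into each figure.
            f.write("Figure 1: list the files from graphs/ used here\n")
```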

Oh, and one other thing about permanence: our scripts use some combination of R and MATLAB, and they work for now. They may not work forever. That's fine. Life goes on, and most papers don't. Those that do do so because of their scientific conclusions, not their data or analysis per se. So I'm not worried about it.

Update, 3/1/2016: Pretty predictable pushback from a lot of people, especially about version control. First, just to reiterate, we use version control for our general purpose tools, which are edited and used by many people, thus making version control the right tool for the job. Still, I have yet to hear any truly compelling arguments for using version control that would outweigh the substantial associated complexity for the use case I am discussing here, which is making the analyses in a paper reproducible. There are a lot of bald assertions of the benefits of version control out there without any real evidence for their validity other than "well, I think this should be better", and with little frank discussion of the hassles of version control. This strikes me as similar to the pushback against the LaTeX vs. Word paper. Evidence be damned! :)

30 comments:

  1. Dropbox actually keeps versions only for 30 days, or one year if you're a paying customer. So I wouldn't rely on it for long-term storage of versions. I see two reasons to use version control:

    1. Scripts evolve e.g. during reviews, and it may be necessary to go back to the state of the, say, first submission to check why a p-value has changed.

    2. A lot of bioinformatic analyses are exploratory programming, and this inevitably leads to dead ends where I have a functional piece of code that turns out to be unnecessary. Having this dead end in version control (with a prominent comment: "final state of X analysis before removing it") is better than having it commented-out in the scripts. (I also have a section "Things that didn't work" in a readme file, referencing this commit.)

    Also, I think Mercurial is much more user-friendly than git.

    PS: I think it's funny that you link my ggplot2 hack -- I only started on the script after doing the same things in Illustrator one too many times. Neither extensive, fragile hacks nor rote work in Illustrator is great.

    1. Interesting points! Still not convinced about version control, though:
      1. In practice, this doesn't really happen very much, at least for us. It's rare that a change like this happens without some clue as to why. In the end, there is just one right answer anyway.
      2. No need to delete this exploratory programming! Just leave it in a subfolder called unusedAnalyses or whatever. If it's actually a functional piece of code, I think it's better documented that way in your final product than via a commit, which is less discoverable.

      For what it's worth, we use Mercurial in lab, and it's still a pain: https://bitbucket.org/arjunrajlaboratory/

      Haha, sorry for calling out your ggplot2 hack. Agree that Illustrator work is tedious. Overall, I think it's the lesser of two evils in terms of net overall time. :)

  2. As a former wet lab biologist and (for most of my career) bioinformatician, this reads somewhat differently if you replace the computational activities with standard wet lab activities. Do so and most wet lab PIs would be shocked at the poor practice. The part of computational science that most biologists (IME) struggle with is accurate record keeping, just as you would want in a lab book. How reproducible is the work you do from your lab journal? The computational work should be as reproducible or better.

    Saying 'used script XYZ with these options' is fine if script XYZ (or SOP XYZ) has not changed, but at least in my experience, scripts get modified and improved, made more robust or new features added as they tend to be tools to achieve analyses that are repeated again and again as work progresses.

    Version control is overkill until you need it. An inability to use it correctly (even with nice tools like SourceTree) is on the same level as your grad student or postdoc being unable to use a machine or lab technique properly and getting poor/variable results that are frustrating and waste time.

    As for time use, yes, it is great to use Illustrator for final tweaks and layout, but basic work should be done right the first time or you are wasting time.

    1. This.

      If keeping every script you ever run is equivalent to writing out the protocol in full in your lab book every time you do an experiment, using version control is the same as writing "Used standard stain protocol dated 02/03/2016" in your lab book on a numbered page, signed off by your supervisor every week.

    2. Interesting points. I think, though, that you guys are confusing reproducibility with process. Computational reproducibility is not at all equivalent to a lab notebook. *Reproducibility* is like the (ideal version of the) materials and methods section of your paper. It should be a fully spelled out way to run your analysis from your data. Then it's reproducible. Documenting the *process* of how you got there is a different matter, which, for experiments, is the function that a lab notebook serves. Version control can serve as some means of tracking what you do, but no less a VCS proponent than Titus Brown argues (rightly) that VCS is *not* a computational lab notebook: http://ivory.idyll.org/blog/is-version-control-an-electronic-lab-notebook.html

    3. I agree that version control is only one tool and that there are other ways to achieve the end goal. I think that this is the point that Titus is making in his post. It just happens that I find it a useful tool for achieving this goal.

      I'd also say that I find version control more useful during the process of a project than for packaging the final result (when just publishing a script is fine). Most of the time I'm more worried about being able to reproduce my own results from two years ago when I first got the data.

      One interesting point is that Titus suggests that repeating stuff is cheap in computation, but I'm not sure that is always true. The analytical pipeline for one of the projects my lab currently works on takes over three weeks to run on our institutional cluster. Nor is it as "hands-off" as Titus suggests it should be: in that three weeks, at least one of the nodes is bound to suffer a network outage, or have another job running on it grab all the memory or something and make everything fall over. All this means we've only run it in its entirety 3 times in the 2 years of the project, but we rerun sections of it all the time.

      So when a collaborator says, "That slide you sent me a year ago had a plot that looked like this, but now that plot looks like that: what's going on?", being able to check what has changed in the meantime can be very useful.

      I agree with you about touching up plots: I might draw the line between what is done programmatically and what is done manually in a different place, but I still do both.

  3. Actually, very good points. The only flaw I can see is that this requires everyone in a project to work the same way, and sometimes that is not straightforward. I agree that not everybody needs Subversion, and that it is not fundamental to have a script that creates everything. I would vote for good documentation and good code style over anything else.

    How Subversion/git helps me:

    - It forces me to write 'good' code (meaning the best I can do) all the time, because I like commits that mean something, and that forces me to think about how things need to change rather than just testing/changing things.
    - It really helps me find when something changed, though only if you really use the commit messages. Without git, I would need to duplicate scripts somewhere, or write down why/when/how, just in case in 6 months I forget why a function is the way it is (if only to avoid repeating an error).
    - It shows how up to date the code is and how often it changes. In my experience, code that changed only once, or that was published a year ago with nothing after, has a higher chance of having bugs. Here I am talking about something in between a proper tool and a few analysis scripts (proper tools, in my opinion, need version control).

    So, for sure, if you can do all this by yourself, then good. For me, it would be much more difficult to keep documentation and code in different folders or to name versions by hand.

    In my case, one time I couldn't reproduce my nice results and decided to move to git, and for sure it is complicated at first. But I think that with the proper knowledge (I don't think you need to be an expert) you can become friends with git.

    You mentioned that this is painful for people with no computer knowledge. For sure, I don't expect this of everybody. But if your paper requires running many tools and many lines/scripts in R/MATLAB, I would expect enough skill to produce clean/understandable code, and I will trust it more if I can see the evolution rather than just the final product.

    I understand all your points about papers, specifically the text and figures, and I am actually somewhere in the middle between what you explain and full scripting. As a note, sometimes the simplest figure is the best, and many people try to make fancy figures when they are not needed.

    good discussion anyways!

  4. This comment has been removed by the author.

  5. I agree. It takes a lot of time. So the people who write that everything should be reproducible are typically (not that there aren't exceptions) not from the labs that publish in glamour journals like Nature and Science, and for a reason. But then, people who publish a lot typically don't blog or tweet a lot, also for a reason.

    One note: plotting packages are getting better. You can see it happening, under our eyes. Look, e.g., at things like ggplot2, which was already somewhat of a step forward (and sometimes backwards), and also http://stanford.edu/~mwaskom/software/seaborn/. I am hoping (and you may say in vain) that going to Illustrator will be less common. Sure. Next year. It'll all be so easy.

  6. You neglected to amortize the initial time cost of, e.g., making figures. After making a few, it is a minimal time investment to make new, reproducible ones from scratch, because you've learned the syntax and you have base code in a file for when you need to make new ones.

    1. One might be able to argue about time spent for simple graphs, but if you want to do something complex, it's just not possible unless you use Illustrator or equivalent. Nor does scripting give you the immediate WYSIWYG feedback that Illustrator does for evaluating design choices. Why limit yourself?

    2. WYSIWYG - that's not true at all, since re-running figures takes a few secs at most. And it's just so nice to be able to fix mistakes, add data, or do figures for revised manuscripts at the push of a button.

      The opportunity cost is a real issue, but IME it pays off down the road. Plus there is another advantage, which is that other scientists can use your code for their own work (and vice versa). By hosting it on GitHub and giving the version used in the paper a DOI, it can be used, cited, and improved by others. Again, delayed benefits, but benefits nonetheless.

  7. I agree that reproducibility is hard with existing tools (e.g., git, docker), and probably too hard for biologists and many beginners in bioinformatics.
    I also agree that there are diminishing returns and that automating enough that you can re-run analyses on new datasets is often sufficient. We also use Illustrator to take plots the extra distance to publication quality, rather than scripting every font. I have no issue with this as long as beautification does not change the message of the plot.
    However, I don't think the solution you outline is general enough to work for most situations. I don't doubt that it works in your lab, but for instance, it would not work in mine where raw data is hundreds of GB of sequencing data.
    I believe the future will come from making reproducibility easier, and that this will likely involve better, more seamless tools, as well as education. I think reproducibility will only win (i.e., be widely adopted) once it becomes easier and more convenient than the alternatives.

    1. Well, try Dropbox or something like it; you might be surprised how little you miss and how much you gain. Oh, and sorry, data volume is not the issue, nor will it impress me :). Our papers usually involve a minimum of around 1TB of data; the latest we're working on has around 10TB, including probably around 500GB-1TB of sequencing data. And we're able to manage the paper analyses just fine in Dropbox. :)

    2. I was not trying to impress and I am sorry if you felt that way.

    3. I'm joking, no offense taken or intended! :) Sorry, I think that came across wrong!

    4. Also, I agree about new tools being required. I think it's important to consider the real goals of reproducibility and come up with a tool for that, rather than version control, which solves a different set of problems.

  8. Version control is priceless for working collaboratively on a paper (needless to say, it's 1000 times more priceless if you collaborate on code). When your co-author (particularly if it is your PhD student!) does an update, you really do not want to read the whole text again; surely you can make diffs to see what got changed, but it gets very messy very quickly, as there are always more updates. The alternative is to have all these "version of blah/blah dated such and such by him and her" messes in your files.

    1. Hmm, for my lab, the best solution has been Google Docs. Allows concurrent editing, easily tracks changes in a visual way, allows easy commenting and replies to comments… FAR better than version control for document preparation, in my opinion. My personal suspicion is that the only reason to use version control for writing a (non-math heavy) paper is ideology. :)

  9. Google Docs and similar systems (MS Word + OneDrive, Dropbox), I think, should count as a limited form of version control: one file per repo, linear history, no commit messages, and only one file format allowed. In return for these limitations, you get automatic, real-time commit/push/pull.

    I don't see any particular reason for treating math-heavy papers differently than math-light papers. Microsoft's equation editor became rather nice back in 2007, and allows cross-referencing equation numbers. The procedure involves a lot of clicking through dialog boxes, but so do most things in Word.

    I personally use version control and text-based formats when writing because I find it more cognitively comfortable than MS Word or Google Docs. (Although I learned these tools for reasons entirely unrelated to writing.) However, I almost never recommend version control, scripting, etc. to other people because being "the one person in lab who knows how to deal with all this crap", as Raj says, can be hazardous to one's time.

    1. Hey, if it works for you, go for it! For most writing, I've found that the tools in Google Docs are far more well-suited to writing collaboratively than version control is–it is, after all, explicitly designed for, umm, writing. In particular, the commenting and reply-to-commenting is something that is simply unnatural to do via version control, and is a killer feature.

    2. I have to disagree about math-heavy papers. My background is in physics, and there is no way I'd want to use Word to write my thesis. People use LaTeX because it gives you:

      i) Publication quality papers
      ii) Syntax that hasn't changed in 30 years - try opening a Word doc from 2000 or 2007 in 2015.
      iii) Your content is separate from the layout.
      iv) It's very highly customisable with the many many packages, and Word can't replicate that behaviour.
      v) You can convert a Paper into a chapter of a thesis or a presentation with very little work.

      For short equations, Word is fine. If you're not doing serious maths, it will do everything you need to do. However, it's just so much slower if you need to reproduce something like the equations on pages 9 to 11 of this thesis, which are not really that complicated as written physics equations go:

      http://www.fz-juelich.de/SharedDocs/Downloads/PGI/PGI-1/EN/Zimmermann.diplom_pdf.pdf?__blob=publicationFile

    3. I totally agree that for large, math-heavy docs (or even just large docs, like theses), LaTeX wins. If you're not doing math and you're not doing a big doc, though, LaTeX is a complete waste of time. And yes, I do know LaTeX fairly well, so this is not an uninformed opinion.

    4. In the spirit of learning something, I tried copying the equations from the thesis novelaweek linked. Word lacked the unpaired $\rangle$ and the under-bracket (which I'd actually never seen before). Nor could I find a way to insert $\mathcal{H}$ or $\hbar$ in Word using only the keyboard. So I am forced to change my mind and resume ruling out Word for math. I thought Word had caught up more than it actually had.

  10. I think the reason that so many people are ardent enthusiasts for version control is simply that all of us have had some kind of problem that would have been solved had we used it. Using version control on its own is not enough to make your work reliable; it's just part of a series of things that you should do when writing software, such as writing tests and running them against your changes.

    Writing software tests when you're analysing big data is always going to be pretty tricky, but you should be able to procedurally generate a minimal set of data, run your analysis against it, and make sure it gives you the answer that you expect.
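    A minimal sketch of that idea (the analysis function and the numbers are invented): procedurally generate a tiny dataset with a known answer and assert that the pipeline still recovers it after every change.

```python
# test_pipeline.py -- hypothetical smoke test: run the analysis on a tiny
# synthetic dataset whose answer is known in advance; any refactor that
# changes the result fails immediately.
import statistics

def analyze(values):
    """Stand-in for the real analysis: a trimmed mean that drops the extremes."""
    trimmed = sorted(values)[1:-1]  # discard the lowest and highest value
    return statistics.mean(trimmed)

def test_analyze_on_synthetic_data():
    data = [100.0, 1.0, 2.0, 3.0, -50.0]  # known input...
    assert analyze(data) == 2.0           # ...known expected output

test_analyze_on_synthetic_data()
```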

    It's really important, because if someone changes something, you get immediate feedback if something breaks "Your tests have failed!". I'm no huge fan of Git, but I do like GitHub, because of the integrations you have with testing frameworks like TravisCI and CircleCI which are free for open source projects.

    This also makes your life 100x easier if you want to refactor something - you can check that you get exactly the same answer once you have cleaned up your code.

    There is a really good talk by Mike Croucher from Sheffield University; the references here may be of interest.
    http://mikecroucher.github.io/MLPM_talk/

    1. I certainly agree that there are certain types of code for which version control is a good thing. It's just not necessarily the best thing for every project and for every type of person, and recommending it for everyone and for everything is (to me) a bad idea if we want to encourage reproducibility among a broader swath of scientists.

    2. BTW: Most "resources" I've seen on the matter have very little in the way of concrete situations in which version control is a good thing for, say, scripts for doing analysis for a paper. If there are some concrete examples, I'd love to see them!

    3. First: I agree that _if_ Dropbox kept old versions permanently, it would be adequate version control for most scientific purposes. Too bad it doesn't. And neither does Google Docs.

      Second: Examples. All of these I have experienced myself and I've frequently seen them victimize others.

      - Scripts that generate multiple outputs (say A and B). Close to deadline output B is discovered to be erroneous. You fix the script to fix output B. Later you discover that output A no longer matches what you recorded. Why? Did the input change? Did your fix for output B create an error for output A? Did your fix for output B also fix an undiscovered error in output A? Had you just misrecorded output A the first time? Without version control, good luck answering those questions by your deadline.

      - It's 11pm, my 100 line script is not working as expected, and we need the result by morning. Wait, someone changed it a little bit, I had no idea. What did they change, I can see 3 things are different but are there other ones too? Which of my labmates do I wake up to answer my questions?

      - We're making our presentation slides for the conference next week. The data conditioning script has a last-modified date of 7 days ago (!!). Was it a behavioral change or did someone add a new comment? Why was it changed and who did it?

      - You yourself are trying to replicate your own work, days or months later. You know Figure 3's data came from this script because you did it. But when you run the script now, no matter what you try you don't get data that looks like Figure 3. Why? Without version control you often can't answer why, because you have no certainty what the script was that you ran.

      I think the last one really gets to the heart of the matter. It happens far too often, and when it does happen you (or your students) have to choose between spending a lot of time figuring it out (and maybe still not succeeding), or just throwing up your hands and making a guess.

      I'm sympathetic to the idea that GitHub is too hard to use. Who wouldn't prefer something easier for individuals and small labs? Maybe a Dropbox app that actually stores all the versions, together with another one that scripts can call so that they can log their own version when they run.
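
The "scripts log their own version" idea could look something like the following sketch. There is no such Dropbox API that I know of; the file names and log format here are assumptions, and the script simply records a hash of its own source next to its outputs so you can later tell exactly which version produced a result.

```python
# Sketch: a script records a SHA-256 hash of its own source and a timestamp
# in an append-only log, so every output can be traced to an exact version.
# "run_log.jsonl" is a hypothetical file name.
import hashlib
import json
import time

def log_own_version(script_path=__file__, logfile="run_log.jsonl"):
    with open(script_path, "rb") as f:
        script_hash = hashlib.sha256(f.read()).hexdigest()
    entry = {
        "script": script_path,
        "sha256": script_hash,
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
    }
    with open(logfile, "a") as log:
        log.write(json.dumps(entry) + "\n")
    return entry

if __name__ == "__main__":
    entry = log_own_version()
    print("running version", entry["sha256"][:12])
```

This doesn't replace version control, but it does answer the "which script produced Figure 3?" question: diff the logged hash against your current file and you know immediately whether the script has changed.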

      I guess I don't even see it as reproducibility. Version control speeds up the process of verifying your own work before you send it out. And we're all doing that, right?

    4. I think the case can be made for version control if you're working with a large group, or even a small group (or one person) where you are working on very complicated code. That said, a lot of what you are talking about is really more a question of code verification than tracking. I would be much more worried about what the actual right answer is than about why something or other changed. Version control can sometimes be helpful for this, but ultimately, verification is a scientific problem, not a programming problem. To the last point: if you can't remember what script you ran and somehow can't get your result to come out the same, that suggests a bigger problem in how your code is structured for reproducibility. I think that just serves to highlight the point that version control is *not* the same thing as reproducibility.

      I guess what I'm saying is that people should spend the time to figure it out and should not be throwing up their hands and making a guess. Computational reproducibility (which I strongly support) is about getting the *right* answer, *every* time.

  11. Thanks for that article; you have articulated my own unconscious thoughts on the subject. I did not have the guts to go against the very well-accepted idea that version control is always, on principle, the best practice.

    I have also fudged scripted figures by hand, but when I do it I often end up doing the manual modifications enough times to regret not doing them by script. To each his own I guess. Still, I'm 100% with you that doing git all by yourself on paper code is a big waste of time.
