Comments on RajLab: From over-reproducibility to a reproducibility wish-list

Google Docs is definitely an easy and effective wa...

2016-03-30T13:50:16.491-04:00

Google Docs is definitely an easy and effective way to maintain and share a variety of documents. My only concern over the years has been the level of privacy these documents have. I always feel anxious about using it for confidential or proprietary materials.

Which tool do you think is better to take notes ab...

2016-03-06T17:16:22.322-05:00

Which tool do you think is better for taking notes about the work you do on software? Documentation of the software is a different beast, I agree with you. You really need to write documentation to make the software reusable. But for keeping track of what you do daily, I don't know of any better tool than a VCS.

For the other point, notebooks are an improvement over a script for reproducibility because they embed documentation with code and results (they provide the logical "flow"). Critics point out that notebooks don't automatically capture your whole environment (all the software installed, the platform and so forth), so they don't include all the information needed to reproduce your work. That's true, but to mitigate this I, for one, print the versions of all the libraries used at the beginning of the notebook so this information is stored with the notebook. In reality, though, this problem is orthogonal to notebooks. For projects I want to reproduce years later, I use a separate conda environment.
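As a rough illustration of printing library versions at the top of a notebook, a first cell could look something like the sketch below. The helper name `report_versions` and the package list are made up for the example; swap in whatever the notebook actually imports. It relies only on the standard library (`importlib.metadata`, Python 3.8+).

```python
import platform
from importlib import metadata


def report_versions(packages):
    """Return 'name==version' lines for the interpreter, platform and packages."""
    lines = [
        f"python=={platform.python_version()}",
        f"platform=={platform.platform()}",
    ]
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Record the gap instead of failing, so the cell always runs.
            lines.append(f"{name}==<not installed>")
    return lines


# Hypothetical package list, for illustration only.
for line in report_versions(["numpy", "pandas", "matplotlib"]):
    print(line)
```

Because the output is saved in the notebook, the versions travel with the results. This records the environment but does not recreate it, which is where a separate conda environment comes in.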

Another "pain point" of using notebooks is that they are JSON-based and not as easily diff-able as plain scripts. That's also true, but there are workarounds. First, for notebooks embedded in a library I always commit the notebook without output (this makes it easier to visually inspect the diff before committing). There is also ipymd, an extension that stores notebooks in Markdown format (very easy to diff).
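To make "commit the notebook without output" concrete, here is a minimal sketch of what clearing output means at the file level. It assumes the version 4 notebook JSON layout (a top-level `cells` list, with `outputs` and `execution_count` on code cells). The function names are invented for this example, and this is not necessarily how the commenter does it; dedicated tools exist for the same job.

```python
import json


def strip_outputs(notebook):
    """Return a copy of a v4 notebook dict with code-cell outputs cleared."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via JSON round trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def strip_file(path):
    """Rewrite the .ipynb file at `path` in place, without outputs."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(strip_outputs(notebook), f, indent=1, ensure_ascii=False)
        f.write("\n")


# Tiny in-memory notebook, for illustration only.
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Analysis"]},
        {
            "cell_type": "code",
            "source": ["1 + 1"],
            "execution_count": 3,
            "outputs": [{"output_type": "execute_result", "data": {"text/plain": ["2"]}}],
        },
    ],
    "nbformat": 4,
}
print(json.dumps(strip_outputs(example), indent=1))
```

With outputs gone, a diff shows only changes to code and text rather than embedded images and execution counters.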

People are aware of these problems and are trying to solve them. But the Jupyter notebook of today can already simplify organization of data analysis projects and make them more reproducible.

These are interesting points! I actually really li...

2016-03-06T06:41:43.900-05:00

These are interesting points! I actually really like the idea of using version control together with Dropbox as you suggest. Perhaps we'll try that out in the lab.

Still not convinced about version control as a notebook. I think that we need to change the culture of how we document computational analyses. If version control helps you enforce that, great, but I agree with you that it's not the only way to do it. I think that it's not the best tool for this, though.

Jupyter notebooks: not a bad idea. Although I've heard online that some reproducibility mavens don't recommend them. Not sure why.

Thanks for sharing your view, it is an interesting...

2016-03-04T02:48:48.698-05:00

Thanks for sharing your view, it is an interesting point of discussion.

I understand your argument and agree that not everything needs to be under version control. I also use just Dropbox or GoogleDrive for small projects that only have a few simple scripts that I write mostly all at once and that I don't modify much later on. However, I find version control very useful for software that I keep modifying, even if it's solo work. (BTW, I'm very familiar with the git command line but I use the SourceTree GUI 90% of the time.)

One advantage of VC is that it forces me to write a log of what I am doing. This helps when juggling different projects. Going back to a project after one or a few weeks, I can easily see what I was working on and keep going. The same can be done with a README file, but I find it harder to keep up this discipline.
Another advantage is code review. At commit time, I have a second look at what I changed. Many times I find typos and small errors in this phase and I can correct them before committing. Other times, I catch unintended modifications that were some random test/exploration I forgot to revert. I would probably find the same bugs later on, when analyzing the results, but it would take more time to track them down.
In some cases, I keep PNG figures in the repository. SourceTree shows figure changes side-by-side and I find it a quick way to check whether some modification had unintended consequences.

I was thinking that maybe you can use git inside the Dropbox folder. Everybody will receive the latest version via Dropbox, like you do now. In addition, one person is in charge of committing, so you still have snapshots that you can diff (and the full repository is also available to everybody because the .git folder is synced by Dropbox).

Regarding the lab notebooks, I use Jupyter (ex IPython) notebooks. A Jupyter notebook is very close to what you describe. It is a document (with sections, links, equations, etc.) that contains the code and the results/figures (you run the notebook to populate it with output and figures). I tend to have a notebook per data file (or per data folder). When, later on, I go back to the data, I can open the notebook and see the narrative and the results. Even if you don't write much text, the notebook gives a lightweight narrative with a sequential order or logical connection to the series of figures. I find it better than a script and a folder with a bunch of figures.

The database that you envision is still missing and is an orthogonal problem to using scripts vs notebooks. But I guess the big problem is finding proper field-specific (or lab-specific) tags to categorize all of your work.

I enthusiastically agree on the importance of (1) ...

2016-03-04T00:01:52.297-05:00

I enthusiastically agree on the importance of (1) discoverable, instructive documentation and (2) testing/verification. Documentation is far more important than software tools for efficiently communicating knowledge. It's very frustrating and sad when work has to be re-done because the last person who knew the relevant details graduated years ago and left no paper trail.

Regarding computational laboratory notebooks, I'm personally fond of a folder structure that goes project_name/journal/worker_name/YYYY-MM-DD_keywords_about_this_entry.{docx,org,txt,rtf,...}. Each file has a corresponding folder with the same name, just without the file extension. Supporting images, snippets, or tables that are specific to that entry (i.e., not going to be used repeatedly) go in that folder. Hyperlinks (file:relative/path/to/data) are used liberally in the journal entries to refer to data in the central data store, wherever that may be. This seems pretty flexible and unlikely to cause new problems while solving old ones. Each person can work in their own preferred format, or just add text files with URLs to online documentation. I often edit old entries to cross-reference relevant new entries; these edits should perhaps be confined to a specific section of the file dedicated to this purpose.
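For anyone who wants to script the layout described above, a small helper along these lines could create the dated entry file and its matching attachments folder in one go. This is only a sketch: the function name `new_entry`, its parameters, and the example project, worker and keywords are all made up for illustration.

```python
import tempfile
from datetime import date
from pathlib import Path


def new_entry(root, project, worker, keywords, ext="txt", day=None):
    """Create <project>/journal/<worker>/<YYYY-MM-DD>_<keywords>.<ext>
    plus a folder with the same name minus the extension."""
    day = day or date.today()
    stem = f"{day.isoformat()}_{'_'.join(keywords)}"
    journal = Path(root) / project / "journal" / worker
    folder = journal / stem            # holds entry-specific images, tables, snippets
    folder.mkdir(parents=True, exist_ok=True)
    entry = journal / f"{stem}.{ext}"
    entry.touch(exist_ok=True)         # leaves an existing entry untouched
    return entry, folder


# Demo in a throwaway directory; names below are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    entry, folder = new_entry(
        tmp, "project_name", "worker_name", ["keywords", "about", "this", "entry"]
    )
    print(entry.relative_to(tmp))
    print(folder.relative_to(tmp))
```

The ISO date prefix means entries sort chronologically in any file browser, and the relative layout keeps file: hyperlinks valid when the whole project folder is moved.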

I am considering cutting the binding off my completed paper notebooks and running them through an automatic feed scanner, especially when I move between jobs. The fact that a paper notebook only exists in one place creates issues. It's also a risk---a notebook may get damaged or lost.

I don't think version control helps reproducibility, unless we're talking about reproducing (really, replaying) every action that led to the final product. It may reduce errors, though. Version control does reduce bug counts in software, but code review (which I believe you blogged about several months ago) actually reduces bug counts more. I unfortunately can't recall the citation for this, but it was definitely based on empirical bug counts. So for scientific practice, adopting version control may be less useful than ensuring robust intra-lab peer review.