Comments on RajLab: From reproducibility to over-reproducibility

2016-08-13T19:23:20.727-04:00
Thanks for that article, you have articulated my own unconscious thoughts on the subject. I did not have the guts to go against the very well-accepted idea that version control is always inherently the best practice, by principle.

I have also fudged scripted figures by hand, but when I do it I often end up doing the manual modifications enough times to regret not doing them by script. To each his own I guess. Still, I'm 100% with you that doing git all by yourself on paper code is a big waste of time.
Vincent Noel

2016-03-12T07:46:12.756-05:00
I think the case can be made for version control if you're working with a large group, or even a small group (or one person) where you are working on very complicated code. That said, a lot of what you are talking about is really a question of code verification rather than tracking. I would be much more worried about what the actual right answer is than why something or other changed. Version control can sometimes be helpful for this, but ultimately, verification is a scientific problem, not a programming problem. To the last point: if you can't remember what script you ran and somehow can't get your result to come out the same, that suggests a bigger problem in how your code is structured for reproducibility.
I think that just serves to highlight the point that version control is *not* the same thing as reproducibility.

I guess what I'm saying is that people should spend the time to figure it out and should not be throwing up their hands and making a guess. Computational reproducibility (which I strongly support) is about getting the *right* answer, *every* time.
AR

2016-03-11T23:18:58.282-05:00
First: I agree that _if_ Dropbox kept old versions permanently, it would be adequate version control for most scientific purposes. Too bad it doesn't. And neither does Google Docs.

Second: Examples. All of these I have experienced myself and I've frequently seen them victimize others.

- Scripts that generate multiple outputs (say A and B). Close to deadline output B is discovered to be erroneous. You fix the script to fix output B. Later you discover that output A no longer matches what you recorded. Why? Did the input change? Did your fix for output B create an error for output A? Did your fix for output B also fix an undiscovered error in output A? Had you just misrecorded output A the first time? Without version control, good luck answering those questions by your deadline.

- It's 11pm, my 100 line script is not working as expected, and we need the result by morning. Wait, someone changed it a little bit, I had no idea. What did they change, I can see 3 things are different but are there other ones too? Which of my labmates do I wake up to answer my questions?

- We're making our presentation slides for the conference next week. The data conditioning script has a last-modified date of 7 days ago (!!). Was it a behavioral change or did someone add a new comment?
Why was it changed and who did it?

- You yourself are trying to replicate your own work, days or months later. You know Figure 3's data came from this script because you did it. But when you run the script now, no matter what you try you don't get data that looks like Figure 3. Why? Without version control you often can't answer why, because you have no certainty what the script was that you ran.

I think the last one really gets to the heart of the matter. It happens far too often, and when it does happen you (or your students) have to choose between spending a lot of time figuring it out (and maybe still not succeeding), or just throwing up your hands and making a guess.

I'm sympathetic to the idea that GitHub is too hard to use. Who wouldn't prefer something easier for individuals and small labs, maybe a Dropbox app that actually stores all the versions, together with another Dropbox app that scripts can call so that they can log their own version when they run.

I guess I don't even see it as reproducibility. Version control speeds up the process of verifying your own work before you send it out. And we're all doing that, right?
Daniel

2016-03-02T17:26:44.244-05:00
In the spirit of learning something, I tried copying the equations from the thesis novelaweek linked. Word lacked the unpaired $\rangle$ and the under-bracket (which I'd actually never seen before). Nor could I find a way to insert $\mathcal{H}$ or $\hbar$ in Word using only the keyboard. So I am forced to change my mind and resume ruling out Word for math.
I thought Word had caught up more than it actually had.
jpeloquin

2016-03-02T08:49:42.209-05:00
BTW: Most "resources" I've seen on the matter have very little in the way of concrete situations in which version control is a good thing for, say, scripts for doing analysis for a paper. If there are some concrete examples, I'd love to see them!
AR

2016-03-02T08:48:14.526-05:00
I certainly agree that there are certain types of code for which version control is a good thing. It's just not necessarily the best thing for every project and for every type of person, and recommending it for everyone and for everything is (to me) a bad idea if we want to encourage reproducibility among a broader swath of scientists.
AR

2016-03-02T07:44:36.279-05:00
I totally agree that for large, math-heavy docs (or even just large docs, like theses), LaTeX wins. If you're not doing math and you're not doing a big doc, though, LaTeX is a complete waste of time.
And yes, I do know LaTeX fairly well, so this is not an uninformed opinion.
AR

2016-03-02T03:27:09.256-05:00
I think the reason that so many people are ardent enthusiasts for version control is simply that all of us have had some kind of problem that would have been solved had we used it. Using version control on its own is not enough to make your work reliable; it's just part of a series of things that you should do when writing software, such as writing tests and running them against your changes.

Writing software tests when you're analysing big data is always going to be pretty tricky, but you should be able to procedurally generate a minimal set of data, run your analysis against it, and make sure it gives you the answer that you expect.

It's really important, because if someone changes something, you get immediate feedback if something breaks: "Your tests have failed!". I'm no huge fan of Git, but I do like GitHub, because of the integrations you have with testing frameworks like TravisCI and CircleCI, which are free for open source projects.

This also makes your life 100x easier if you want to refactor something: you can check that you get exactly the same answer once you have cleaned up your code.

There is a really good talk by Mike Croucher from Sheffield University; the references here may be of interest.
http://mikecroucher.github.io/MLPM_talk/
Ryan Pepper

2016-03-02T03:02:04.910-05:00
I have to disagree with math-heavy papers.
My background is in Physics, and there is no way I'd want to use Word to write my thesis. People use LaTeX because it gives you

i) Publication quality papers
ii) Syntax that hasn't changed in 30 years - try opening a Word doc from 2000 or 2007 in 2015.
iii) Your content is separate from the layout.
iv) It's very highly customisable with the many many packages, and Word can't replicate that behaviour.
v) You can convert a paper into a chapter of a thesis or a presentation with very little work.

For short equations, Word is fine. If you're not doing serious maths, it will do everything you need to do. However, it's just so much slower if you need to reproduce something like the equations on pages 9 to 11 of this thesis, which are not really that complicated as written Physics equations go:

http://www.fz-juelich.de/SharedDocs/Downloads/PGI/PGI-1/EN/Zimmermann.diplom_pdf.pdf?__blob=publicationFile
Ryan Pepper

2016-03-01T21:00:30.326-05:00
Hey, if it works for you, go for it! For most writing, I've found that the tools in Google Docs are far better suited to writing collaboratively than version control is. It is, after all, explicitly designed for, umm, writing.
In particular, the commenting and reply-to-commenting is something that is simply unnatural to do via version control, and is a killer feature.
AR

2016-03-01T20:48:04.858-05:00
Google Docs and similar systems (MS Word + OneDrive, Dropbox), I think, should count as a limited form of version control: one file per repo, linear history, no commit messages, and only one file format allowed. In return for these limitations, you get automatic, real-time commit/push/pull.

I don't see any particular reason for treating math-heavy papers differently than math-light papers. Microsoft's equation editor became rather nice back in 2007, and allows cross-referencing equation numbers. The procedure involves a lot of clicking through dialog boxes, but so do most things in Word.

I personally use version control and text-based formats when writing because I find it more cognitively comfortable than MS Word or Google Docs. (Although I learned these tools for reasons entirely unrelated to writing.) However, I almost never recommend version control, scripting, etc. to other people because being "the one person in lab who knows how to deal with all this crap", as Raj says, can be hazardous to one's time.
John Peloquin

2016-03-01T18:07:49.516-05:00
Hmm, for my lab, the best solution has been Google Docs. Allows concurrent editing, easily tracks changes in a visual way, allows easy commenting and replies to comments… FAR better than version control for document preparation, in my opinion.
My personal suspicion is that the only reason to use version control for writing a (non-math heavy) paper is ideology. :)
AR

2016-03-01T17:40:19.929-05:00
Version control is priceless for working collaboratively on, say, a paper (needless to say, it's 1000 times more priceless if you collaborate on code). When your co-author (particularly if it is your PhD student!) does an update, you really do not want to read the whole text again; surely you can make diffs to see what got changed, but it gets very messy very quickly, as there are always more updates. The alternative is to have all this "version of blah/blah dated such and such by him and her" mess in your files.
dimpase

2016-03-01T10:07:18.253-05:00
WYSIWYG - that's not true at all, since re-running figs takes a few secs at most. And it's just so nice to be able to fix mistakes, add data, or do figures for revised manuscripts at the push of a button.

The opportunity cost is a real issue, but IME it pays off down the road. Plus there is another advantage, which is that other scientists can use your code for their own work (and vice versa). By hosting it on GitHub and giving the version used in the paper a DOI, it can be used, cited, and improved by others. Again, delayed benefits, but benefits nonetheless.
emb3

2016-03-01T09:00:17.378-05:00
I agree that version control is only one tool and that there are other ways to achieve the end goal. I think that this is the point that Titus is making in his post. It just happens that I find it a useful tool for achieving this goal.

I'd also say that I find version control more useful during the process of a project than for packaging the final result (when just publishing a script is fine). Most of the time I'm more worried about being able to reproduce my own results from two years ago when I first got the data.

One interesting point is that Titus suggests that repeating stuff is cheap in computation, but I'm not sure that is always true. The analytical pipeline for one of the projects my lab currently works on takes over three weeks to run on our institutional cluster. Nor is it as "hands-off" as Titus suggests it should be: in that three weeks at least one of the nodes is bound to suffer a network outage, or have another job running on it grab all the memory or something and make everything fall over. All this means we've only run it in its entirety 3 times in the 2 years of the project, but we rerun sections of it all the time.

So when a collaborator says: "that slide you sent me a year ago had a plot that looked like this, but now that plot looks like that: what's going on?" being able to check what has changed in the meantime can be very useful.

I agree with you about touching up plots: I might draw the line between what is done programmatically and what manually in a different place, but I still do both.
IanSudbery

2016-03-01T07:52:17.764-05:00
Also, I agree about new tools being required. I think it's important to consider the real goals of reproducibility and come up with a tool for that, rather than version control, which solves a different set of problems.
AR

2016-03-01T07:47:59.492-05:00
I'm joking, no offenses taken or intended! :) Sorry, I think that came across wrong!
AR

2016-03-01T07:46:24.263-05:00
I was not trying to impress and I am sorry if you felt that way.
Fabien Campagne

2016-03-01T06:17:08.380-05:00
Interesting points. I think, though, that you guys are confusing reproducibility with process. Computational reproducibility is not at all equivalent to a lab notebook. *Reproducibility* is like the (ideal version of the) materials and methods section of your paper. It should be a fully spelled out way to run your analysis from your data. Then it's reproducible. Documenting the *process* of how you got there is a different matter, which, for experiments, is the function that a lab notebook serves.
Version control can serve as some means of tracking what you do, but no less a VCS proponent than Titus Brown argues (rightly) that VCS is *not* a computational lab notebook: http://ivory.idyll.org/blog/is-version-control-an-electronic-lab-notebook.html
AR

2016-03-01T05:55:47.981-05:00
One might be able to argue about time spent for simple graphs, but if you want to do something complex, it's just not possible unless you use Illustrator or equivalent. Nor does scripting give you the immediate WYSIWYG feedback that Illustrator does for evaluating design choices. Why limit yourself?
AR

2016-03-01T05:51:55.906-05:00
Well, try Dropbox or something like it, you might be surprised how little you miss and how much you gain. Oh, and sorry, data volume is not the issue, nor will it impress me :). All of our papers usually have a minimum of around 1TB of data; the latest that we're working on has around 10TB of data, including probably around 500-1K GB of sequencing data. And we're able to manage the paper analyses just fine in Dropbox. :)
AR

2016-03-01T05:00:10.966-05:00
This.
If keeping every script you ever run is equivalent to writing out the protocol in full in your lab book every time you do an experiment, using version control is the same as writing "Used standard stain protocol dated 02/03/2016" in your lab book on a numbered page, signed off by your supervisor every week.
IanSudbery

2016-02-29T22:53:13.962-05:00
I agree that reproducibility is hard with existing tools (e.g., git, docker), and probably too hard for biologists and many beginners in bioinformatics.

I also agree that there are diminishing returns and that automating enough that you can re-run analyses on new datasets is often sufficient. We also use Illustrator to take plots the extra length to get them to publication quality, rather than scripting every font. I have no issue with this as long as beautification does not change the message of the plot.

However, I don't think the solution you outline is general enough to work for most situations. I don't doubt that it works in your lab, but for instance, it would not work in mine, where raw data is hundreds of GB of sequencing data.

I believe the future will come from making reproducibility easier, and that this will likely involve better, more seamless tools, as well as education. I think reproducibility will only win (i.e., be widely adopted) once it becomes easier and more convenient than the alternatives.
Fabien Campagne

2016-02-29T22:48:21.265-05:00
You neglected to amortize the initial time costs of, e.g., making figures. After making a few, it is a minimal time investment to make new, reproducible ones from scratch because you've learned the syntax and you have base code in a file for when you need to make new ones.
emb3

2016-02-29T16:50:53.543-05:00
I agree. It takes a lot of time. So people who write that everything should be reproducible are typically (not that there aren't exceptions) not from the labs that publish in glamour journals like Nature and Science, and for a reason. But then, people that publish a lot typically don't blog or tweet a lot, also for a reason.

One note: plotting packages are getting better. You can see it happening, under our eyes. Look e.g. at things like ggplot2, which was already somewhat a step forward (and sometimes backwards), and also http://stanford.edu/~mwaskom/software/seaborn/. I am hoping (and you may say in vain) that going to Illustrator will be less common. Sure. Next year. It'll all be so easy.
Max