Saturday, June 11, 2016

Some thoughts on lab communication

I recently came across this nice post about tough love in science:
https://ambikamath.wordpress.com/2016/05/16/on-tough-love-in-science/
and this passage at the start really stuck out:
My very first task in the lab as an undergrad was to pull layers of fungus off dozens of cups of tomato juice. My second task was PCR, at which I initially excelled. Cock-sure after a week of smaller samples, I remember confidently attempting an 80-reaction PCR, with no positive control. Every single reaction failed… 
I vividly recall a flash of disappointment across the face of one of my PIs, probably mourning all that wasted Taq. That combination—“this happens to all of us, but it really would be best if it didn’t happen again”—was exactly what I needed to keep going and to be more careful.
Now, communication is easy when it's all like "Hey, I've got this awesome idea, what do you think?" "Oh yeah, that's the best idea ever!" "Boo-yah!" [secret handshake followed by football head-butt]. What I love about this quote is how it perfectly highlights how good communication can inspire and reassure, even in a tough situation—and how bad communication can lead to humiliation and disengagement.

I'm sure there are lots of theories and data out there about communication (or not :)), but when it comes down to putting things into practice, I've found that having simple rules or principles is often a lot easier to follow and to quantify. One that has been particularly effective for me is to avoid "you" language, which is the ultimate simple rule: just avoid saying "you"! Now that I've been following that rule for some time and thinking about why it's so effective at improving communication, I think there's a relatively simple principle beneath it that is helpful as well: if you're saying something for someone else's benefit, then good. If you're saying something for your own benefit, then bad. Do more of the former, less of the latter.

How does this work out in practice? Let's take the example from the quote above. As a (disappointed) human being, your instinct is going to be to think "Oh man, how could you have done that!?" A simple application of no-you-language will help you avoid saying this obviously bad thing. But there are counterproductive no-you-language ways to respond as well: "Well, that was disappointing!" "That was a big waste" "I would really double check things before doing that again". Perhaps the first two of these are obviously the wrong thing to say, but I think the last one is counterproductive as well. Let's dissect the real reasons you would say "I would really double check before doing that again". Now, of course the trainee is going to be feeling pretty awful—people generally know when they've screwed up, especially if they screwed up badly. Anyone with a brain knows that if you screw up big, you should probably double check and be more careful. So what's the real reasoning behind telling someone to double check? It's basically to say "I noticed you screwed up and you should be more careful." Ah, the hidden "you" language revealed! What this sentence is really about is giving yourself the opportunity to vent your frustration with the situation.

So what to say? I think the answer is to take a step back, think about the science and the person, and come up with something that is beneficial to the trainee. If they're new, maybe "Running a positive control every time is really a good idea." (unless they already realized that mistake).  Or "Whenever I scale up the reaction, I always check…" These bits of advice often work well when coupled with a personal story, like "I remember when I screwed up one of these big ones early on, and what I found helped me was…". I will sometimes use another mythic figure from the lab's recent past, since I'm old enough now that personal lab stories sound a little too "crazy old grandpa" to be very effective…

It is also possible that there is nothing to learn from this mistake and that it was just, well, a mistake. In which case, there is nothing you can say that is for anyone's benefit other than your own, and in those situations, it really is just better to say nothing. This can take a lot of discipline, because it's hard not to express those sorts of feelings right when they're hitting you the hardest. But it's worth it. If it's a repeated issue that's really affecting things, there are two options: 1. address it later during a performance review, or 2. don't. Often, with those sorts of issues, there's honestly not much difference in outcome between these options, so maybe it's just better to go with 2.

Another common category of negative communication comprises all the sundry versions of "I told you so". This is obviously something you say for your own benefit rather than the other person's, and indeed it is so clearly accusatory that most folks know not to say this specific phrase. But I think this is just one of a class of what I call "scorekeeping" statements, which are ones that serve only to remind people of who was right or wrong. Like "But I thought we agreed to…" or "Last time I was supposed to…" They're very tempting, because as scientists we are in the business of telling each other whether we're right or wrong, but when you're working with someone in the lab, scoring these types of points is corrosive in the long term. Just remember that the next time your PI asks you to change the figure back the other way around for the 4th time… :)

Along those lines, I think it's really important for trainees (not just PIs) to think about how to improve their communication skills as well. One thing I hear often is "Before I was a PI, I got all this training in science, and now I'm suddenly supposed to do all this stuff I wasn't trained for, like managing people". I actually disagree with this. To me, the concept of "managing people" is sort of a misnomer, because in the ideal case, you're not really "managing" anyone at all, but rather working with them as equals. That implies an equal stake in and commitment to productive communications on both ends, which also means that there are opportunities to learn and improve for all parties. I urge trainees to take advantage of those opportunities. Few of us are born with perfect interpersonal skills, especially in work situations, and extra especially in science, where things change and go wrong all the time, practically begging people to assign blame to each other. It's a lot of work, but a little practice and discipline in this area can go a long way.

Wednesday, June 8, 2016

What’s so bad about teeny tiny p-values?

Every so often, I’ll see someone make fun of a really small p-value, usually along with some line like “If your p-value is smaller than 1/(number of molecules in the universe), you must be doing something wrong”. At first, this sounds like a good burn, but thinking about it a bit more, I just don’t get this criticism.

First, the number itself. Is it somehow because the number of molecules in the universe is so large? Perhaps this conjures up some image of “well, this result is saying this effect could never happen anywhere ever in the whole universe by chance—that seems crazy!”, and makes it seem like there’s some flaw in the computation or deduction. Pretty easy to spot the flaw in that logic, of course: configurational space can be much larger than the raw number of constituent parts. For example, let’s say I mix some red dye into a cup of water and then pour half of the dyed water into another cup. Now there is some probability that, randomly, all the red dye stays in one cup and no dye goes in the other. That probability is 1/(2^numberOfDyeMolecules), which is clearly going to be a pretty teeny-tiny number.

Here’s another example that may hit a bit closer to home: during cell division, the nuclear envelope breaks down, and so many nuclear molecules (say, lincRNA) must get scattered throughout the cell (and yes, we have observed this to be the case for e.g. Xist and a few others). Then, once the nucleus reforms, those lincRNA seem to be right back in the nucleus. What is the probability that the lincRNA just happened to end up back in the nucleus by chance? Well, again, 1/(2^numberOfRNAMolecules) (assuming a 50/50 nucleus/cytoplasm split), which for many lincRNA is probably like 1/1024 or so, but for something like MALAT1, would be 1/(2^2000) or so. I think we can pretty safely reject the hypothesis that there is no active trafficking of MALAT1 back into the nucleus… :)
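(If you want to convince yourself of just how tiny these numbers get, here's a quick back-of-the-envelope in Python, working in log10 so nothing underflows. The RNA counts are the rough numbers from above; the dye count is just an assumed order of magnitude.)

# Probability that all n molecules independently end up on one particular
# side of a 50/50 split is (1/2)^n; compute it in log10 to avoid underflow.
import math

def log10_prob_all_one_side(n_molecules):
    return n_molecules * math.log10(0.5)

for label, n in [("low-abundance lincRNA (~10 molecules)", 10),
                 ("MALAT1 (~2000 molecules)", 2000),
                 ("dye in a cup (assume ~1e20 molecules)", 10**20)]:
    print(f"{label}: log10(p) = {log10_prob_all_one_side(n):.4g}")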

I think the more substantial concern people raise with these p-values is that when you get something so small, it probably means that you’re not taking into account some sort of systematic error; in other words, the null model isn’t right. For instance, let’s say I measured a slight difference in the concentration of dye molecules in the second cup above. Even a pretty small difference will come with an infinitesimal p-value, but the most likely scenario is that some systematic error is responsible (like dye getting absorbed by the material of the second glass or the glasses having slightly different transparencies or whatever). In genomics—or basically any study where you are doing a lot of comparisons—the same sort of thing can happen if the null/background model is slightly off for each of a large number of comparisons.
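Here's a toy simulation of that effect (made-up numbers, not any real dataset): give the test statistics just a bit more spread than the null model assumes, and combining many comparisons produces a vanishingly small aggregate p-value even though no individual comparison shows anything real.

# Toy example: 10,000 null comparisons whose true spread (sd = 1.1) is a bit
# wider than the assumed null (sd = 1). Each individual p-value looks
# unremarkable, but Fisher's combined test returns a tiny overall p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
z = rng.normal(0, 1.1, size=10000)           # "truth": null is slightly miscalibrated
p = 2 * stats.norm.sf(np.abs(z))             # p-values computed assuming sd = 1
fisher_stat = -2 * np.sum(np.log(p))         # ~ chi2 with 2*10000 df if the null were right
log10_p = stats.chi2.logsf(fisher_stat, df=2 * len(p)) / np.log(10)
print(f"combined log10 p-value: {log10_p:.0f}")   # hugely negative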

All that said, I still don’t see why people make fun of small p-values. If you have a really strong effect, then it’s entirely possible that you can get such a tiny p-value. In which case, the response is typically “well, if it’s that obvious, then why do any statistics?” Okay, fine, I’m totally down with that! But then we’re basically saying that there are no really strong effects out there: if you’re doing enough comparisons that you might get one of these tiny p-values, then any strong, real effect must generate one of these p-values, no? In fact, if you don’t get a tiny p-value for one of these multi-factorial comparisons, then you must be looking at something that is only a minor effect at best, like something that only explains a small amount of the variance. Whether that matters or not is a scientific question, not a statistical one, but one thing I can say is that I don’t know many examples (at least in our neck of the molecular biology woods) in which something that was statistically significant but explained only a small amount of the variance was really scientifically meaningful. Perhaps GWAS is a counterexample to my point? Dunno. Regardless, I just don’t see much justification in mocking the teeny tiny p-value.

Oh, and here’s a teeny tiny p-value from our work. Came from comparing some bars in a graph that were so obviously different that only a reviewer would have the good sense to ask for the p-value… ;)


Update, 6/9/2016:
There are a lot of examples that I think illustrate some of these points better. First, not all tiny p-values are necessarily the result of obvious differences or faulty null models in large numbers of comparisons. Take Uri Alon's network motifs work. As laid out in his book, he showed that in the transcriptional network of E. coli (424 nodes, 519 edges), there were 40 examples of autoregulation. Is this higher than, lower than, or equal to what you would expect? Well, maybe you have a good intuitive handle on random network theory, but for me, the fact that this is very far from the null expectation of around 1.2±1.1 autoregulatory motifs (p-value something like 10^-30) is not immediately obvious. One can (and people do) quibble about the particular type of random network model, but in the end, the p-values were always teeny tiny and I don't think that is either obvious or unimportant.
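For the curious, here is a rough version of that calculation under the simplest possible null (edges scattered at random, so the number of self-edges is approximately Poisson). The exact value depends on which random network model you pick, which is precisely what people quibble about, but it comes out teeny tiny regardless.

# Rough null model for autoregulation in the E. coli transcription network:
# scatter 519 directed edges at random among 424 nodes. Each edge is a
# self-edge with probability 1/424, so the expected number of self-edges is
# about 519/424 ~ 1.2 with sd ~ sqrt(1.2) ~ 1.1, roughly Poisson distributed.
import numpy as np
from scipy import stats

n_nodes, n_edges, observed = 424, 519, 40
lam = n_edges / n_nodes                          # expected number of self-edges
sd = np.sqrt(lam)
log10_p = stats.poisson.logsf(observed - 1, lam) / np.log(10)   # P(X >= 40)
print(f"null expectation: {lam:.1f} +/- {sd:.1f}; log10 P(X >= 40) = {log10_p:.0f}")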

Second, the fact that a large number of small effects can give a tiny p-value doesn't automatically discount their significance. My impression from genetics is that many phenotypes are composed of large numbers of small effects. Moreover, a perturbation such as a gene knockout can lead to a large number of small effects. Whether those effects are meaningful is an open scientific question (to my mind), but whether the p-value is small or not is not really relevant.

This is not to say that all tiny p-values mean there's some science worth looking into. Some of the worst offenders are the examples of "binning", where, e.g., the half-life of individual genes correlates with some DNA sequence element, R^2=0.21, p=10^-17 (totally made up example, no offense to anyone in this field!). No strong rule comes from this, so who knows if we actually learned something. I suppose an argument can be made either way, but the bottom line is that those are scientific questions, and the size of the p-value is irrelevant. If the p-value were bigger, would that really change anything?

Sunday, May 22, 2016

Spring cleaning, old notebooks, and a little linear algebra problem

Update 5/25/2016: Solution at the bottom

These days, I spend most of my time thinking about microscopes and gene regulation and so forth, which makes it all the more of a surprising coincidence that on the eve of what looks to be a great math-bio symposium here at Penn tomorrow, I was doing some spring cleaning in the attic and happened across a bunch of old notebooks from my undergraduate and graduate school days in math and physics (and a bunch of random personal stuff that I'll save for another day—which is to say, never). I was fully planning to throw all those notebooks away, since of course the last time I really looked at them was probably well over 10 years ago, and I did indeed throw away a couple from some of my less memorable classes. But I was surprised that I actually wanted to keep hold of most of them.

Why? I think partly that they serve as an (admittedly faint) reminder that I used to actually know how to do some math. It's actually pretty remarkable to me how much we all learn during our time in formal class training, and it is sort of sad how much we forget. I wonder to what degree it's all in there somewhere, and how long it would take me to get back up to speed if necessary. I may never know, but I can say that all that background has definitely shaped me and the way that I approach problems, and I think that's largely been for the best. I often joke in lab about how classes are a waste of time, but it's clear from looking these over that that's definitely not the case.

I also happened across a couple of notebooks that brought back some fond memories. One was Math 250(A?) at Berkeley, then taught by Robin Hartshorne. Now, Hartshorne was a genius. That much was clear on day one, when he looked around and precisely counted the number of students in the room (which was around 40 or so) in approximately 0.58 seconds. All the students looked at each other, wondering whether this was such a good idea after all. Those who stuck with it got exceptionally clear lectures on group theory, along with by far the hardest problem sets of any class I've taken (except for a differential geometry class I dropped, but that's another story). Of the ten problems assigned every week, I could do maybe one or two, after which I puzzled away, mostly in complete futility, until I went to his very well attended office hours, at which he would give hints to help solve the problems. I can't remember most of the details, but I remember that one of the hints was so incredibly arcane that I couldn't imagine how anyone, ever, could have come up with the answer. I think that Hartshorne knew just how hard all this was, because one time I came to his office hours after a midterm when a bunch of people were going over a particular problem, and I said "Oh yeah, I think I got that one!" and he looked at me with genuine incredulity, at which point I explained my solution. Hartshorne looked relieved, pointed out the flaw, and all went back to normal in the universe. :) Of course, there were a couple of kids in that class from whom Hartshorne wouldn't have been surprised to see a solution, but that wasn't me, for sure.

While rummaging around in that box of old notebooks, I also found some old lecture notes that I really wanted to keep. Many of these are from one of my PhD advisors, Charlie Peskin, who had some wonderful notes on mathematical physiology, neuroscience, and probability. His ability to explain ideas to students with widely varying backgrounds was truly incredible, and his notes are so clear and fresh. I also kept notes from a couple of my other undergrad classes that I really loved, notably Dan Roksar's quantum mechanics series, Hirosi Ooguri's statistical mechanics and thermodynamics, and Leo Harrington's set theory class (which was truly mind-bending).

It was also fun to look through a few of the problem sets and midterms that I had taken—particularly odd now to look at some old dusty blue books and imagine how much stress they had caused at the time. I don't remember many of the details, but I somehow still vaguely remembered two problems, one from undergrad and one from grad school, as being particularly interesting. The undergrad one was some sort of superconducting sphere problem in my electricity and magnetism course that I can't fully recall, but it had something to do with spherical harmonics. It was a fun problem.

The other was from a homework in a linear algebra class I took in grad school from Sylvia Serfaty, and I did manage to find it hiding in the back of one of my notebooks. A simple-seeming problem: given an n×n matrix A, formulate necessary and sufficient conditions for the 2n×2n matrix B defined as

B = |A A|
    |0 A|

to be diagonalizable. I'll give you a hint that is perhaps what one might guess from the n=1 case: the condition is that A = 0. In that case, sufficiency is trivial (B = 0 is definitely diagonalizable), but showing necessity—i.e., showing that if B is diagonalizable, then A=0—is not quite so straightforward. Or, well, there's a tricky way to get it, at least. Free beer to whoever figures it out first with a solution as tricky as (or trickier than) the one I'm thinking of! Will post an answer in a couple of days.

Update, 5/25/2016: Here's the solution!
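One way to see it (a sketch, and quite possibly not the trickiest route):

Write $B = D + N$ with
$$D = \begin{pmatrix} A & 0 \\ 0 & A \end{pmatrix}, \qquad N = \begin{pmatrix} 0 & A \\ 0 & 0 \end{pmatrix},$$
and note that $N^2 = 0$ and $DN = ND$. Then for any polynomial $p$,
$$p(B) = p(D) + p'(D)\,N = \begin{pmatrix} p(A) & p'(A)\,A \\ 0 & p(A) \end{pmatrix}.$$
If $B$ is diagonalizable, its minimal polynomial $p$ has distinct roots and $p(B) = 0$, which forces $p(A) = 0$ and $p'(A)\,A = 0$. Since $p$ has distinct roots, $\gcd(p, p') = 1$, so $up + vp' = 1$ for some polynomials $u, v$; plugging in $A$ and using $p(A) = 0$ gives $v(A)\,p'(A) = I$. So $p'(A)$ is invertible, and therefore $A = 0$. (Sufficiency is the easy direction: $A = 0$ means $B = 0$.)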


Sunday, May 1, 2016

The long tail of artificial narrow superintelligence

As readers of the blog have probably guessed, there is a distinct strain of futurism in the lab, mostly led by Paul, Ally, Ian, and me (everyone else mostly just rolls their eyes, but what do they know?). So it was against this backdrop that we had a heated discussion recently about the implications of AlphaGo.

It started with a discussion I had with someone who is an expert on machine learning and knows a bit of Go, and he said that AlphaGo was a huge PR stunt. He said this based on the fact that the way AlphaGo wins is basically by using deep learning to evaluate board positions really well, combined with a huge number of calculations to search through candidate moves. Is that really “thinking”? Here, opinions were split. Ally was strongly in the camp of this being thinking, and I think her argument was pretty valid. After all, how different is that necessarily from how humans play? They probably think up possible moves and then evaluate the resulting board positions. I was of the opinion that this is an entirely different type of thinking from human thinking.

Thinking about it some more, I think perhaps we’re both right. Using neural networks to read the board is indeed amazing, and a feat that most thought would not be possible for a while. It’s also clear that AlphaGo is doing far more “traditional” brute force computation of potential moves than Lee Sedol was. The question then becomes how close the neural network part of AlphaGo comes to Lee Sedol’s intuition, given that the brute force logic parts are probably tipped far in AlphaGo’s favor. This is sort of a hard question to answer, because it’s unclear how closely matched they were. I was, perhaps like many, sort of shocked that Lee Sedol managed to win game 4. Was that a sign that they were not so far apart from each other? Or just a weird flukey sucker punch from Sedol? Hard to say. I think the fact that AlphaGo was probably no match for Sedol a few months prior is a strong indication that AlphaGo is not radically stronger than Sedol. So my feeling is that Sedol’s intuition is still perhaps greater than AlphaGo’s, which allowed him to keep up despite such a huge disadvantage in traditional computation power.

Either way, given the trajectory, I’m guessing that within a few months, AlphaGo will be so far superior that no human will ever, ever be able to beat it. Maybe this is through improvements to the neural network or to traditional computation, but whatever the case, it will not be thinking the same way as humans. The point is that it doesn’t matter, as far as playing Go is concerned. We will have (already have?) created the strongest Go player ever.

And I think this is just the beginning. A lot of the discourse around artificial intelligence revolves around the potential for artificial general super-intelligence (like us, but smarter), like a paper-clip making app that will turn the universe into a gigantic stack of paper-clips. I think we will get there, but well before then, I wonder if we’ll be surrounded by so much narrow-sense artificial super-intelligence (like us, but smarter at one particular thing) that life as we know it will be completely altered.

Imagine a world in which there is super-human level performance at various “brain” tasks. What will be the remaining motivation to do those things? Will everything just be a sport or leisure activity (like running for fun)? Right now, we distinguish (perhaps artificially) between what’s deemed “important” and what’s just a game. But what if we had a computer for proving math theorems or coming up with algorithms, one vastly better than any human? Could you still have a career as a mathematician? Or would it just be one big math olympiad that we do for fun? I’m now thinking it’s possible that virtually everything humans think is important and do for “work” could be overtaken by “dumb” artificial narrow super-intelligence, well before the arrival of a conscious general super-intelligence. Hmm.

Anyway, for now, back in our neck of the woods, we've still got a ways to go in getting image segmentation to perform as well as humans. But we’re getting closer! After that, I guess we'll just do segmentation for fun, right? :)

Wednesday, April 6, 2016

The hierarchy of academic escapism

Work to escape from the chaos of home. 

Conference travel to escape from the chaos of work.

Laptop in hotel room to escape from the chaos of the poster session.

Email to escape the tedium of reviewing papers.

Netflix to escape the tedium of email.

Sleep to escape the tedium of Sherlock Season 3.

And then it was Tuesday.

Thursday, March 3, 2016

From over-reproducibility to a reproducibility wish-list

Well, it’s clear that that last blog post on over-reproducibility touched a bit of a nerve. ;)

Anyway, a lot of the feedback was rather predictable and not particularly convincing, but I was pointed to this discussion on the Software Carpentry website, which was actually really nice:

On 2016-03-02 1:51 PM, Steven Haddock wrote:
> It is interesting how this has morphed into a discussion of ways to convince / teach git to skeptics, but I must say I agreed with a lot of the points in the RajLab post.
>
> Taking a realistic and practical approach to use of computing tools is not something that needs to be shot down (people sound sensitive!). Even if you can’t type `make paper` to recapitulate your work, you can still be doing good science…
>
+1 (at least) to both points. What I've learned from this is that many scientists still see cliffs where they want on-ramps; better docs and lessons will help, but we really (really) [need] to put more effort into usability and interoperability. (Diff and merge for spreadsheets!)

So let me turn this around and ask Arjun: what would it take to convince you that it *was* worth using version control and makefiles and the like to manage your work? What would you, as a scientist, accept as compelling?

Thanks,
Greg

--
Dr Greg Wilson
Director of Instructor Training
Software Carpentry Foundation


First off, thanks to Greg for asking! I really appreciate the active attempt to engage.

Secondly, let me just say that as to the question of what it would take for us to use version control, the answer is nothing at all, because we already use it! More specifically, we use it in places where we think it’s most appropriate and efficient.

I think it may be helpful for me to explain what we do in the lab and how we got here. Our lab works primarily on single cell biology, and our methods are primarily single molecule/single cell imaging techniques and, more recently, various sequencing techniques (mostly RNA-seq, some ATAC-seq, some single cell RNA-seq). My lab has people with pretty extensive coding experience and people with essentially no coding experience, and many in between (I see it as part of my educational mission to try and get everyone to get better at coding during their time in the lab). My PhD is in applied math with a side of molecular biology, during which time we developed a lot of the single RNA molecule techniques that we are still using today. During my PhD, I was doing the computational parts of my science in an only vaguely reproducible way, and that scared me. Like “Hmm, that data point looks funny, where did that come from?”. Thus, in my postdoc, I started developing a little MATLAB "package" for documenting and performing image analysis. I think this is where our first efforts in computational reproducibility began.

When I started in the lab in 2010, my (totally awesome) first student Marshall and I took the opportunity to refactor our image analysis code, and we decided to adopt version control for these general image processing tools. After a bit of discussion, we settled on Mercurial and bitbucket.org because it was supposed to be easier to use than git. This has served us fairly well. Then, my brilliant former postdoc Gautham got way into software engineering and completely refactored our entire image processing pipeline, which is basically what we are using today, and is the version that we point others to use here. Since then, various people have contributed modules and so forth. For this sort of work, version control is absolutely essential: we have a team of people contributing to a large, complex codebase that is used by many people in the lab. No brainer.

In our work, we use these image processing tools to take raw data and turn it into numbers that we then use to hopefully do some science. This involves the use of various analysis scripts that will take this data, perform whatever statistical analysis and so forth on it, and then turn that into a graphical element. Typically, this is done by one or, more often, two people in the lab, usually working closely together.
Right around the time Gautham left the lab, we had several discussions about software best practices in the lab. Gautham argued that every project should have a repository for these analysis scripts. He also argued that the commit history could serve as a computational lab notebook. At the time, I thought the idea of a repo for every project was a good one, and I cajoled people in the lab into doing it. I pretty quickly pushed back on the version-control-as-computational-lab-notebook claim, though, and I still feel that way pretty strongly. I think it's interesting to think about why. Version control is a tool that allows you to keep track of changes to code. It is not something that will naturally document what that code does. My feeling is that version control is in some ways a victim of its own success: it is such a useful tool for managing code that it is now widely used and promoted, and as a side-effect it is now being used for a lot of things for which it is not quite the right tool, a point I'll come back to.

Fast forward a little bit. Using version control in the repo-for-every-project model was just not working for most people in the lab. To give a sense of what we're doing, in most projects, there's a range of analyses, sometimes just making a simple box-plot or bar graph, sometimes long-ish scripts that take, say, RNA counts per cell and fit them to a model of RNA production, extracting model parameters with error bounds. Sometimes it might be something still more complicated. The issue with version control in this scenario is all the headaches. Some remote heads would get forked. Somehow things weren't syncing right. Some other weird issue would come up. Plus, frankly, all the committing/pushing/pulling/updating was causing headaches, especially if someone forgot to push. One student in the lab and I were just working on a large project together, and after bumping into these issues over and over, she just said "screw it, can we just use Dropbox?" I was actually reluctant at first, but then I thought about it a bit more. What were we really losing? As I mentioned in the blog post, our goal is a reproducible analysis. For this, versioning is at best a means toward that goal, and in practice for us, a relatively tangential means. Yes, you can go back and use earlier versions. Who cares? The number of times we've had to do that in this context is basically zero. One case people have mentioned as a potential benefit for version control is performing alternative, exploratory analyses on a particular dataset, the idea being you can roll back and compare results. I would argue that version control is not the best way to perform or document this. Let's say I have a script for "myCoolAnalysis". What we do in the lab is make "myAlternativeAnalysis" in which we code our new analysis. Now I can easily compare. Importantly, we have both versions around. The idea of keeping the alternative version in version control is, I think, a bad one: it's not discoverable except by searching the commit log. Let's say that you wanted to go back to that analysis in the future. How would you find it? I think it makes much more sense to have it present in the current version of the code than to dig through the commit history. One could argue that you could fork the repo, but then changes to other, unrelated parts of the repo would be hard to deal with. Overall, version control is just not the right tool for this, in my opinion.

Another, somewhat related point that people have raised is looking back to see why some particular output changed. Here, we're basically talking about bugs/flawed analyses. There is some merit to this, and so I acknowledge there is a tradeoff, and that once you get to a certain scale, version control is very helpful. However, I think that for scientific programming at the scale I'm talking about, it's usually fairly clear what caused something to change, and I'm less concerned about why something changed and much more worried about whether we're actually getting the right answer, which is always a question about the code as it stands. For us, the vast majority of the time, we are moving forward. I think the emphasis here would be better placed on teaching people how to test their code (which is a scientific problem more than a programming problem) than on version control.

Which leads me to really answering the question: what would I love to have in the lab? On a very practical level, look, version control is still just too hard and annoying to use for a lot of people and injects a lot of friction into the process. I have some very smart people in my lab, and we all have struggled from time to time. I’m sure we can figure it out, but honestly, I see little impetus to do so for the use cases outlined above, and yes, our work is 100% reproducible without it. Moving (back) to Dropbox has been a net productivity win, allowing us to work quickly and efficiently together. Also, the hassle free nature of it was a real relief. On our latest project, while using version control, we were always asking “oh, did you push that?”, “hmm, what happened?”, “oh, I forgot to update”. (And yes, we know about and sometimes use SourceTree.) These little hassles all add up to a real cognitive burden, and I’m sorry, but it's just a plain fact that Dropbox is less work. Now it’s just “Oh, I updated those graphs”, “Looks great, nice!”. Anyway, what I would love is Dropbox with a little bit more version tracking. And Dropbox does have some rudimentary versioning, basically a way to recover from an "oh *#*$" moment–the thing I miss most is probably a quick diff. Until this magical system emerges, though, on balance, it is currently just more efficient for us not to use version control for this type of computational work. I posit that the majority of people who could benefit from some minimal computational reproducibility practices fall into this category as well.
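(In the meantime, the bare-bones version of that quick diff is only a few lines of standard-library Python; the script and file names here are just placeholders.)

# quickdiff.py: print a unified diff between two versions of a file, e.g. a
# copy restored from Dropbox's version history versus the current one.
# Usage (hypothetical file names): python quickdiff.py oldAnalysis.R analysis.R
import difflib
import sys

old_path, new_path = sys.argv[1], sys.argv[2]
with open(old_path) as f_old, open(new_path) as f_new:
    diff = difflib.unified_diff(f_old.readlines(), f_new.readlines(),
                                fromfile=old_path, tofile=new_path)
sys.stdout.writelines(diff)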

Testing: I think getting people in the habit of testing would be a huge move in the right direction. And I think this means scientific code testing, not just “program doesn’t crash” testing. When I teach my class on molecular systems biology, one of my secret goals is to teach students a little bit about scientific programming. Those who have some programming experience often fall into the trap of thinking “well, the program ran, so it must have worked”, which is often fine for, say, a website or something, but it's usually just the beginning of the story for scientific programming and simulations. Did you check the order of convergence (or check for convergence at all)? Did you check whether you're getting the predicted distribution in a well-known degenerate case? Most people don't think about programming that way. Note that none of this has anything to do with version control per se.
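To make that concrete, here's a sketch (in Python, with made-up parameters) of the kind of test I mean: a simulator of constitutive RNA production and degradation had better reproduce the known Poisson steady state, and that is something you can assert directly.

# Constitutive production at rate k and degradation at rate gamma per molecule
# has a Poisson steady state with mean k/gamma, so mean ~ variance is a known
# degenerate case to check the simulator against.
import numpy as np

def simulate_birth_death(k, gamma, t_end, rng):
    """Gillespie simulation of one cell; returns the RNA count at t_end."""
    t, n = 0.0, 0
    while True:
        birth, death = k, gamma * n
        total = birth + death
        t += rng.exponential(1.0 / total)
        if t > t_end:
            return n
        if rng.random() < birth / total:
            n += 1
        else:
            n -= 1

def test_steady_state_is_poisson():
    rng = np.random.default_rng(0)
    k, gamma = 10.0, 1.0              # expected steady-state mean = 10
    counts = np.array([simulate_birth_death(k, gamma, 20.0, rng)
                       for _ in range(2000)])
    mean, var = counts.mean(), counts.var()
    assert abs(mean - k / gamma) < 0.5       # loose tolerance for sampling noise
    assert abs(var / mean - 1.0) < 0.15      # Fano factor ~ 1 for a Poisson

test_steady_state_is_poisson()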

On a bigger level, I think the big unmet need is a nice way to document an analysis as it currently stands. Gautham and I had a lot of discussions about this when he was in the lab, but we didn't get around to implementing anything. What would such documentation do? Ideally, it would document the analysis in a searchable and discoverable way. Here's one idea we were tossing around. Let's say that you kept your work in a directory tree structure, with analyses organized by subfolder. Like, you could keep that analysis of H3K4me3 in "histoneModificationComparisons/H3K4me3/", then H3K27me3 in "histoneModificationComparisons/H3K27me3/". In each directory, you have the scripts associated with a particular analysis, and then running those scripts produces an output graph. That output graph could either be stored in the same folder or in a separate "graphs" subfolder. Now, the scripts and the graphs would have metadata (not sure what this would look like in practice), so you could have a script go through and quickly generate a table of contents with links to all these graphs for easy display and search. Perhaps this is similar to those IPython notebooks or whatever. Anyway, the main feature is that this would make all those analyses (including older ones that don't make it into the paper) discoverable (via tagging/table of contents) and searchable (search:“H3K27”). For me, this would be a really helpful way to document an analysis, and it would be relatively lightweight and would fit into our current workflow. Which reminds me: we should do this.
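A sketch of what that table-of-contents script might look like (the script name, graph file extensions, and the "tags.txt" convention are all hypothetical here):

# tocMaker.py (hypothetical): walk a tree of analysis folders and write a
# plain-text table of contents listing every graph, grouped by analysis
# folder, with optional tags read from a "tags.txt" file in each folder.
import os

def build_toc(root, out_file="analysisTOC.txt"):
    lines = []
    for dirpath, dirnames, filenames in sorted(os.walk(root)):
        graphs = sorted(f for f in filenames if f.endswith((".pdf", ".png", ".eps")))
        if not graphs:
            continue
        tags = ""
        tag_file = os.path.join(dirpath, "tags.txt")
        if os.path.exists(tag_file):
            with open(tag_file) as fh:
                tags = "  [tags: " + fh.read().strip() + "]"
        lines.append(os.path.relpath(dirpath, root) + tags)
        lines.extend("    " + os.path.join(dirpath, g) for g in graphs)
    with open(out_file, "w") as fh:
        fh.write("\n".join(lines) + "\n")

# e.g. build_toc("histoneModificationComparisons"); grepping the output for
# "H3K27" then turns up every graph filed or tagged under that analysis.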

I also think that a lot of this discussion is really sort of veering around the simple task of keeping a computational lab notebook. This is basically a narrative about what you tried, what worked, what didn’t work, and how you did it, why you did it, and what you learned. I believe there have been a lot of computational lab notebook attempts out there, from essentially keyloggers on up, and I don’t know of any that have really taken off. I think the main thing that needs to change there is simply the culture. Version control is not a notebook, keylogging is not a notebook, the only thing that is a notebook is you actually spending the time to write down what you did, carefully and clearly–just like in the lab. When I have cajoled people in the lab into doing this, the resulting documents have been highly useful to others as how-to guides and as references. There have been depressingly few such documents, though.

Also, seriously, let's not encourage people to use version control for maintaining their papers. This is just about the worst way to sell version control. Unless you're doing some heavy math with LaTeX or working with a very large document, Google Docs or some equivalent is the clear choice every time, and it will be impossible to convince me otherwise. Version control is a tool for maintaining code. It was never meant for managing a paper. Much better tools exist. For instance, Google Docs excels at easy sharing, collaboration, simultaneous editing, commenting and reply-to-commenting. Sure, one can approximate these using text-based systems and version control. The question is why anyone would want to do that. Not everything you do on a computer maps naturally to version control.

Anyway, that ended up being a pretty long response to what was a fairly short question, but I also just want to reiterate that I find it reassuring that people like Greg are willing to listen to these ramblings and hopefully find something positive in them. My lab is really committed to reproducible computational analyses, and I think I speak for many when I describe the challenges we and others face in making it happen. Hopefully this can stimulate some new discussion and ideas!

Sunday, February 28, 2016

From reproducibility to over-reproducibility

[See also follow up post.]

It's no secret that biomedical research is requiring more and more computational analyses these days, and with that has come some welcome discussion of how to make those analyses reproducible. On some level, I guess it's a no-brainer: if it's not reproducible, it's not science, right? And on a practical level, I think there are a lot of good things about making your analysis reproducible, including the following (vaguely ranked starting with what I consider most important):
  1. Umm, that it’s reproducible.
  2. It makes you a bit more careful, which makes your code more likely to be right, cleaner, and more readable to others.
  3. This in turn makes it easier for others in the lab to access and play with the analyses and data in the future, including the PI.
  4. It could be useful for others outside the lab, although as I’ve said before, I think the uses for our data outside our lab are relatively limited beyond the scientific conclusions we have made. Still, whatever, it’s there if you want it. I also freely admit this might be more important for people who do work other people actually care about. :)
Balanced against these benefits, though, is a non-negligible negative:
  1. It takes a lot of time.
On balance, I think making things as reproducible as possible is time well spent. In particular, it's time that could be well spent by the large proportion of the biomedical research enterprise that currently doesn't think about this sort of thing at all, and I think it is imperative for those of us with a computational inclination to help train others to make their analyses reproducible.

My worry, however, is that the strategies for reproducibility that computational types are often promoting are off-target and not necessarily adapted for the needs and skills of the people they are trying to reach. There is a certain strain of hyper-reproducibility zealotry that I think is discouraging others from adopting some basic practices that could greatly benefit their research, and at the same time is limiting the productivity of even its own practitioners. You know what I'm talking about: it's the idea of turning your entire paper into a program, so you just type "make paper" and out pops the fully formed and formatted manuscript. Fine in the abstract, but in a line of work (like many others) in which time is our most precious commodity, these compulsions represent a complete failure to correctly measure opportunity costs. In other words, instead of hard-coding the adjustment of the figure spacing in your LaTeX preprint, spend that time writing another paper. I think it’s really important to remember that our job is science, not programming, and if we focus too heavily on the procedural aspects of making everything reproducible and fully documented, we risk turning off those who are less comfortable with programming from the very real benefits of making their analysis reproducible.

Here are the two biggest culprits in my view: version control and figure scripting.

Let's start with version control. I think we can all agree that the most important part of making a scientific analysis reproducible is to make sure the analysis is in a script and not just typed or clicked into a program somewhere, only for those commands to vanish into faded memory. A good, reproducible analysis script should start with raw data, go through all the computational manipulations required, and leave you with a number or graphical element that ends up in your paper somewhere. This makes the analysis reproducible, because someone else can now just run the code and see how your raw data turned into that p-value in subpanel Figure 4G. And remember, that someone else is most likely your future self :).

Okay, so we hopefully all agree on the need for scripts. Then, however, almost every discussion about computational reproducibility begins with a directive to adopt git or some other version control system, as though it’s the obvious next step. Hmm. I’m just going to come right out and say that for the majority of computational projects (at least in our lab), version control is a waste of time. Why? Well, what is the goal of making a reproducible analysis? I believe the goal is to have a documented set of scripts that take raw data and reliably turn it into a bit of knowledge of some kind. The goal of version control is to manage code, in particular emphasizing “reversibility, concurrency, and annotation [of changes to code]”. While one can imagine some overlap between these goals, I don’t necessarily see a natural connection between them. To make that more concrete, let’s try to answer the question that I’ve been asking (and been asked), which is “Why not just use Dropbox?”. After all, Dropbox will keep all your code and data around (including older versions), shared between people seamlessly, and probably will only go down if WWIII breaks out. And it's easy to use. Here are a few potential arguments I can imagine people might make in favor of version control:
  1. You can avoid having Fig_1.ai, Fig_1_2.ai, Fig_1_2_final_AR_PG_JK.ai, etc. Just make the change and commit! You have all the old versions!
  2. You can keep track of who changed what code and roll things back (and manage file conflicts).
Well, to point 1, I actually think that there’s nothing really wrong with having all these different copies of a file around. It makes it really easy to quickly see what changed between different versions, which is especially useful for binary files (like Illustrator files) that you can’t run a diff on. Sure, it’s maybe a bit cleaner to have just one Fig_1.ai, but in practice, I think it’s actually less useful. In our lab, we haven’t bothered doing that, and it’s all worked out just fine.

Which brings us then to point 2, about tracking code changes. In thinking about this, I think it’s useful to separate out code that is for general purpose tools in the lab and code that is specific for a particular project. For code for general purpose tools that multiple team members are contributing to, version control makes a lot of sense–that’s what it was really designed for, after all. It’s very helpful to see older versions of the codebase, see the exact changes that other members of the team have made, and so forth.

These rationales don’t really apply, though, to code that people will write for analyzing data for a particular project. In our lab, and I suspect most others, this code is typically written by one or two people, and if two, they’re typically working in very close contact. Moreover, the end goal is not to have a record of a shifting codebase, but rather to have a single, finalized set of analysis scripts that will reproduce the figures and numbers in the paper. For this reason, the ability to roll back to previous versions of the code and annotate changes is of little utility in practice. I asked around the lab, and I think there was maybe one time when we rolled back code. Otherwise, basically, for most analyses for papers, we just move forward and don’t worry about it. I suppose there is theoretically the possibility that some old analysis that you could recover through version control could prove useful, but honestly, most of the time, that ends up in a separate folder anyway. (One might say that’s not clean, but I think that it’s actually just fine. If an analysis is different in kind, then replacing it via version control doesn’t really make sense–it’s not a replacement of previous code per se.)

Of course, one could say, well, even if version control isn’t strictly necessary for reproducible analyses, what does it hurt? In my opinion, the big negative is the amount of friction version control injects into virtually every aspect of the analysis process. This is the price you pay for versioning and annotation, and I think there’s no way to get around that. With Dropbox, I just stick a file in and it shows up everywhere, up to date, magically. No muss, no fuss. If you use version control, it’s constant committing, pushing, pulling, updating, and adding notes. Moreover, if you’re like me, you will screw up at some point, leading to some problem, potentially catastrophic, that you will spend hours trying to figure out. I’m clearly not alone:
“Abort: remote heads forked” anyone? :) At that point, we all just call over the one person in the lab who knows how to deal with all this crap and hope for the best. And look, I’m relatively computer savvy, so I can only imagine how intimidating all this is for people who are less computer savvy. The bottom line is that version control is cumbersome, arcane and time-consuming, and most importantly, doesn’t actually contribute much to a reproducible computational analysis. If the point is to encourage people who are relatively new to computation to make scripts and organize their computational results, I think directing them to adopt version control is a very bad idea. Indeed, for a while I was making everyone in our lab use version control for their projects, and overall, it was a net negative in terms of time. We switched to Dropbox for a few recent projects and life is MUCH better–and just as reproducible.

Oh, and I think there are some people who use version control for the text of their papers (almost certainly a proper subset of those who are for some reason writing their papers in Markdown or LaTeX). Unless your paper has a lot of math in it, I have no idea why anyone would subject themselves to this form of torture. Let me be the one to tell you that you are no less smart or tough if you use Google Docs. In fact, some might say you’re more smart, because you don’t let command-line ethos/ideology get in the way of actually getting things done… :)

Which brings me to the example of figure scripting. Figure scripting is the process of making a figure completely from a script. Such a script will make all the subpanels, adjust all the font sizes, deal with all the colors, and so forth. In an ideal world with infinite time, this would be great–who wouldn't want to make all their figures magically appear by typing make figures? In practice, there are definitely some diminishing returns, and it's up to you where the line is between making it reproducible and getting it done. For me, the hard line is that all graphical elements representing data values should be coded. Like, if I make a scatterplot, then the locations of the points relative to the axes should come from code. Beyond that, Illustrator time! Illustrator will let you set the font size, the line weighting, marker color, and virtually every other thing you can think of simply and relatively intuitively, with immediate feedback. If you can set your font sizes and so forth programmatically, more power to you. But it's worth keeping in mind that the time you spend programming these things is time you could be spending on something else. This time can be substantial: check out this lengthy bit of code written to avoid a trip to Illustrator. Also, the more complex the figure you're trying to make, the fewer packages there are to help you make it. For instance, consider this figure from one of Marshall's papers:


Making gradient bars and all the lines and annotations would be a nightmare to do via script (and this isn't even very complicated). Yes, if you decide to make a change, you will have to redo some manual work in Illustrator, hence the common wisdom to make it all in scripts to "save time redoing things". But given the amount of effort it takes to figure out how to code that stuff, nine times out of ten, the total amount of time spent just redoing it will be less. And in a time when nobody reads things carefully, adding all these visual elements to your paper to make it easier to explain your work quickly is a strong imperative–stronger than making sure it all comes from a script, in my view.

Anyway, all that said, what do we actually do in the lab? Having gone through a couple iterations, we've basically settled on the following. We make a Dropbox folder for the paper, and within the folder, we have subfolders, one for raw(ish) data, one for scripts, one for graphs and one for figures (perhaps with some elaborations depending on circumstances). In the scripts folder is a set of, uh, scripts that, when run, take the raw(ish) data and turn it into the graphical elements. We then assemble those graphical elements into figures, along with a readme file to document which files went into the figure. Those figures can contain heavily altered versions of the graphical elements, and we will typically adjust font sizes, ticks, colors, you name it, but if you want to figure out why some data point was where it was, the chain is fully accessible. Then, when we're done, we put the files all into bitbucket for anyone to access.
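For concreteness, one of those graph-producing scripts might be as minimal as the following sketch (Python purely for illustration, since our actual scripts are a mix of R and MATLAB, and all the file and column names are made up):

# Reads one raw(ish) data file and writes one graphical element into graphs/.
# Fonts, colors, and final layout get handled later in Illustrator; only the
# data-driven parts live in code. File and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

counts = pd.read_csv("extractedData/rnaCountsPerCell.csv")
summary = counts.groupby("condition")["rnaCount"].mean()

fig, ax = plt.subplots()
ax.bar(summary.index.astype(str), summary.values)
ax.set_ylabel("mean RNA count per cell")
fig.savefig("graphs/rnaCountsByCondition.pdf")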

Oh, and one other thing about permanence: our scripts use some combination of R and MATLAB, and they work for now. They may not work forever. That's fine. Life goes on, and most papers don't. Those that do do so because of their scientific conclusions, not their data or analysis per se. So I'm not worried about it.

Update, 3/1/2016: Pretty predictable pushback from a lot of people, especially about version control. First, just to reiterate, we use version control for our general purpose tools, which are edited and used by many people, thus making version control the right tool for the job. Still, I have yet to hear any truly compelling arguments for using version control that would outweigh the substantial associated complexity for the use case I am discussing here, which is making the analyses in a paper reproducible. There are a lot of bald assertions of the benefits of version control out there without any real evidence for their validity other than "well, I think this should be better", also with little frank discussion of the hassles of version control. This strikes me as similar to the pushback against the LaTeX vs. Word paper. Evidence be damned! :)