Tuesday, June 28, 2016

Reproducibility, reputation and playing the long game in science

Every so often these days, something will come up again about how most research findings are false. Lots of ink has already been spilled on the topic, so I won’t dwell on the reproducibility issue too long, but the whole thing has gotten me thinking more and more about the meaning and consequences of scientific reputation.

Why reputation? Reputation and reproducibility are somewhat related but clearly distinct concepts. In my field (I guess?) of molecular biology, I think that reputation and reproducibility are particularly strongly correlated because the nature of the field is such that perceived reproducibility is heavily tied to the large number of judgement calls you have to make in the course of your research. As such, perhaps reputation has evolved as the best way to measure reproducibility in this area.

I think that this stands in stark contrast with the more common diagnosis one sees these days for the problem of irreproducibility, which is that it's all down to statistical innumeracy. Every so often, I’ll see tweets like this (names removed unless claimed by owner):

The implication here is that the problem with all this “cell” biology is that the Ns are so low as to render the results statistically meaningless. The implicit solution to the problem is then “Isn’t data cheap now? Just get more data! It’s all in the analysis, all we need to do is make that reproducible!” Well, if you think that github accounts, pre-registered studies and iPython notebooks will magically solve the reproducibility problem, think again. Better statistical and analysis management practices are of course good, but the excessive focus on these solutions to me ignores the bigger point, which is that, especially in molecular and cellular biology, good judgement about your data and experiments trumps all. (I do find it worrying that statistics has somehow evolved to the point where we absolve ourselves of responsibility for the scientific inferences we make ("But look at the p-value!"). I think this statistical primacy is perhaps part of a bigger—and in my opinion, ill-considered—attempt to systematize and industrialize scientific reasoning, but that’s another discussion.)

Here’s a good example from the (infamous?) study claiming to show that aspartame induces cancer. (I looked this over a while ago given my recently acquired Coke Zero habit. Don’t judge.) Here’s a table summarizing their results:


The authors claim that this shows an effect of increased lymphomas and leukemias in the female rats through the entire dose range of aspartame. And while I haven’t done the stats myself, looking at the numbers, the claim seems statistically valid. But the whole thing really hinges on the one control datapoint for the female rats, which is (seemingly strangely) low compared to virtually everything else. If that number was, say, 17% instead of 8%, I’m guessing essentially all the statistical significance would go away. Is this junk science? Well, I think so, and the FDA agrees. But I would fully agree that this is a judgement call, and in a vacuum would require further study—in particular, to me, it looks like there is some overall increase in cancers in these rats at very high doses, and while it is not statistically significant in their particular statistical treatment, my feeling is that there is something there, although probably just a non-specific effect arising from the crazy high doses they used.

Hey, you might say, that’s not science! Discarding data points because they “seem off” and pulling out statistically weak “trends” for further analysis? Well, whatever, in my experience, that’s how a lot of real (and reproducible) science gets done.

Now, it would be perfectly reasonable of you to disagree with me. After all, in the absence of further data, my inklings are nothing more than an opinion. And in this case, at least we can argue about the data as it is presented. In most papers in molecular biology, you don’t even get to see the data from all experiments they didn’t report for whatever reason. The selective reporting of experiments sounds terrible, and is probably responsible for at least some amount of junky science, but here’s the thing: I think molecular biology would be uninterpretable without it. So many experiments fail or give weird results for so many different reasons, and reporting them all would leave an endless maze that would be impossible to navigate sensibly. (I think this is a consequence of studying complex systems with relatively imprecise—and largely uncalibrated—experimental tools.) Of course, such a system is ripe for abuse, because anyone can easily leave out a key control that doesn’t go their way under the guise of “the cells looked funny that day”, but then again, there are days where the cells really do look funny. So basically, in the end, you are stuck with trust: you have to trust that the person you’re listening to made the right decisions, that they checked all the boxes that you didn’t even know existed, and that they exhibited sound judgement. How do you know what work to follow up on? In a vacuum, hard to say, but that’s where reputation comes in. And when it comes to reputation, I think there’s value in playing the long game.

Reputation comes in a couple different forms. One is public reputation. This is the one you get from the talks you give and the papers you publish, and it can suffer from hype and sloppiness. People do still read papers and listen to talks (well, at least sometimes), and eventually they will notice if you cut corners and oversell your claims. Not much to say about this except that one way to get a good public reputation is to, well, do good science! Another important thing is to just be honest. Own up to the limitations of your work. It’s pretty easy to sniff out someone who’s being disingenuous (as the lawyerly answers from Elizabeth Holmes have shown), and I’ve found that people will actually respect you more if you just straight up say what you really think. Plus, it makes people think you’re smart if you show you’ve already thought about all the various problems.

Far more murky is the large gray zone of private reputation, which encompasses all the trust in the work that you don’t see publicly. This is going out to dinner with a colleague and hearing “Oh yeah, so-and-so is really solid”… or “That person did the same experiment 40 times in grad school to get that one result” or “Oh yeah, well, I don’t believe a single word out of that person’s mouth.” All of which I have heard, and don’t let me forget my personal favorite “Mr. Artifact bogus BS guy”. Are these just meaningless rumors? Sometimes, but mostly not. What has been surprising to me is how much signal there is in this reputational gossip relative to noise—when I hear about someone with a shady reputation, I will often hear very similar things independently from multiple sources.

I think this is (rightly) because most scientists know that spreading science gossip about people is generally something to be done with great care (if at all). Nevertheless, I think it serves a very important purpose, because there’s a lot of reputational information that is just hard to share publicly. There are many reasons for this, among them that the burden of proof for calling someone out publicly is very high, the potential for negative fallout is large, and you can easily develop your own now-very-public reputation for being a bitter, combative pain in the ass. A world in which all scientists called each other out publicly on everything would probably be non-functional.

Of course, this must all be balanced against the very significant negatives to scientific gossip. It is entirely possible that someone could be unfairly smeared in this way, although honestly, I’m not sure how many instances of this I’ve really seen. (I do know of one case in which one scientist supposedly started a whisper campaign against another scientist about their normalization method or something suitably petty, although I have to say the concerns seemed valid to me.)

So how much gossip should we spread? For me, that completely depends on the context. With close friends, well, that’s part of the fun! :) With other folks, I’m of course far more restrained, and I try to stick to what I know firsthand, although it’s impossible to give a straight up rule given the number of factors to weigh. Are they asking for an evaluation of a potential collaborator? Are we discussing a result that they are planning to follow up on in the lab, thus potentially harming a trainee? Will they even care what I say either way? An interesting special case is trainees in the lab. I think they actually stand to benefit greatly from this informal reputational chatter. Not only do they learn who to avoid, but even just knowing the fact that not everyone in science can be trusted is a valuable lesson.

Which leads to another important problem with private reputations: if they are private, what about all the other people who could benefit from that knowledge but don’t have access to it? This failure can manifest in a variety of ways. For people with less access to the scientific establishment (e.g., in smaller or poorer countries), you basically just have to take the literature at face value. The same can be true even within the scientific establishment; for example, in interdisciplinary work, you’ll often have one community that doesn’t know the gossip of another (there are lots of examples where I’ll meet someone who talks about a whole bogus subfield without realizing it’s bogus). And sometimes you just don’t get wind in time. The damage in terms of time wasted is real. I remember a time when our group was following up a cool-seeming result that ended up being bogus as far as we could tell, and I met a colleague at a conference, told her about it, and she said they saw the same thing. Now two people know, and perhaps the handful of other people that I’ve mentioned this to. That doesn’t seem right.

At this point, I often wonder about a related issue: do these private reputations even matter? I know plenty of scientists with widely-acknowledged bad reputations who are very successful. Why doesn’t it stick? Part of it is that our review systems for papers and grants just don’t accommodate this sort of information. How do you give a rational-sounding review that says “I just don’t believe this”? Some people do give those sorts of reviews, but come across as, again, bitter and combative, so most don’t. Not sure what to do about this problem. In the specific case of publishing papers, I often wonder why journal editors don’t get wind of these issues. Perhaps they’re just in the wrong circles? Or maybe there are unspoken union rules about ratting people out to editors? Or maybe it’s just really hard not to send a paper to review if it looks strong on the face of it, and at that point, it’s really hard for reviewers to do anything about it. Perhaps preprints and more public discussion could help with this? Of course, then people would actually have to read each other’s papers…

That said, while the downsides of a bad private reputation may not materialize as often as we feel they should, the good news is that I think the benefits to a good private reputation can be great. If people think you do good, solid work, I think that people will support you even if you’re not always publishing flashy papers and so forth. It’s a legitimate path to success in science, and don’t let the doom and gloomers and quit-lit types tell you otherwise. How to develop and maintain a good private reputation? Well, I think it’s largely the same as maintaining a good public one: do good science and don’t be a jerk. The main difference is that you have to do these things ALL THE TIME. There is no break. Your trainees and mentors will talk. Your colleagues will talk. It’s what you do on a daily basis that will ensure that they all have good things to say about you.

(Side point… I often hear that “Well, in industry, we are held to a different standard, we need things to actually work, unlike in academia.” Maybe. Another blog post on this soon, but I’m not convinced industry is any better than academia in this regard.)

Anyway, in the end, I think that molecular biology is the sort of field in which scientific reputation will remain an integral part of how we assess our science, for better or for worse. Perhaps we should develop a more public culture of calling people out like in physics, but I’m not sure that would necessarily work very well, and I think the hostile nature of discourse in that field contributes to a lack of diversity. Perhaps the ultimate analysis of whether to spread gossip or do something gossip-worthy is just based on what it takes for you to get a good night’s sleep.

Saturday, June 11, 2016

Some thoughts on lab communication

I recently came across this nice post about tough love in science:
https://ambikamath.wordpress.com/2016/05/16/on-tough-love-in-science/
and this passage at the start really stuck out:
My very first task in the lab as an undergrad was to pull layers of fungus off dozens of cups of tomato juice. My second task was PCR, at which I initially excelled. Cock-sure after a week of smaller samples, I remember confidently attempting an 80-reaction PCR, with no positive control. Every single reaction failed… 
I vividly recall a flash of disappointment across the face of one of my PIs, probably mourning all that wasted Taq. That combination—“this happens to all of us, but it really would be best if it didn’t happen again”—was exactly what I needed to keep going and to be more careful.
Now, communication is easy when it's all like "Hey, I've got this awesome idea, what do you think?" "Oh yeah, that's the best idea ever!" "Boo-yah!" [secret handshake followed by football head-butt]. What I love about this quote is how it perfectly highlights how good communication can inspire and reassure, even in a tough situation—and how bad communication can lead to humiliation and disengagement.

I'm sure there are lots of theories and data out there about communication (or not :)), but when it comes down to putting things into practice, I've found that having simple rules or principles is often a lot easier to follow and to quantify. One that has been particularly effective for me is to avoid "you" language, which is the ultimate simple rule: just avoid saying "you"! Now that I've been following that rule for some time and thinking about why it's so effective at improving communication, I think there's a relatively simple principle beneath it that is helpful as well: if you're saying something for someone else's benefit, then good. If you're saying something for your own benefit, then bad. Do more of the former, less of the latter.

How does this work out in practice? Let's take the example from the quote above. As a (disappointed) human being, your instinct is going to be to think "Oh man, how could you have done that!?" A simple application of no-you-language will help you avoid saying this obviously bad thing. But there are counterproductive no-you-language ways to respond as well: "Well, that was disappointing!" "That was a big waste" "I would really double check things before doing that again". Perhaps the first two of these are straightforwardly incorrect, but I think the last one is counterproductive as well. Let's dissect the real reasons you would say "I would really double check before doing that again". Now, of course the trainee is going to be feeling pretty awful—people generally know when they've screwed up, especially if they screwed up bad. Anyone with a brain knows that if you screw up big, you should probably double check and be more careful. So what's the real reasoning behind telling someone to double check? It's basically to say "I noticed you screwed up and you should be more careful." Ah, the hidden you language revealed! What this sentence is really about is giving yourself the opportunity to vent your frustration with the situation.

So what to say? I think the answer is to take a step back, think about the science and the person, and come up with something that is beneficial to the trainee. If they're new, maybe "Running a positive control every time is really a good idea." (unless they already realized that mistake).  Or "Whenever I scale up the reaction, I always check…" These bits of advice often work well when coupled with a personal story, like "I remember when I screwed up one of these big ones early on, and what I found helped me was…". I will sometimes use another mythic figure from the lab's recent past, since I'm old enough now that personal lab stories sound a little too "crazy old grandpa" to be very effective…

It is also possible that there is nothing to learn from this mistake and that it was just, well, a mistake. In which case, there is nothing you can say that is for anyone's benefit other than yourself, and in those situations, it really is just better to say nothing. This can take a lot of discipline, because it's hard not to express those sorts of feelings right when they're hitting you the hardest. But it's worth it. If it's a repeated issue that's really affecting things, there are two options: 1. address it later during a performance review, or 2. don't. Often, with those sorts of issues, there's honestly not much difference in outcome between these options, so maybe it's just better to go with 2.

Another common category of negative communication is all the sundry versions of "I told you so". This is obviously something you say for your own benefit rather than the other person's, and indeed it is so clearly accusatory that most folks know not to say this specific phrase. But I think this is just one of a class of what I call "scorekeeping" statements, which are ones that serve only to remind people of who was right or wrong. Like "But I thought we agreed to…" or "Last time I was supposed to…" They're very tempting, because as scientists we are in the business of telling each other that we're right and wrong, but when you're working with someone in the lab, scoring these types of points is corrosive in the long term. Just remember that the next time your PI asks you to change the figure back the other way around for the 4th time… :)

Along those lines, I think it's really important for trainees (not just PIs) to think about how to improve their communication skills as well. One thing I hear often is "Before I was a PI, I got all this training in science, and now I'm suddenly supposed to do all this stuff I wasn't trained for, like managing people". I actually disagree with this. To me, the concept of "managing people" is sort of a misnomer, because in the ideal case, you're not really "managing" anyone at all, but rather working with them as equals. That implies an equal stake in and commitment to productive communications on both ends, which also means that there are opportunities to learn and improve for all parties. I urge trainees to take advantage of those opportunities. Few of us are born with perfect interpersonal skills, especially in work situations, and extra especially in science, where things change and go wrong all the time, practically begging people to assign blame to each other. It's a lot of work, but a little practice and discipline in this area can go a long way.

Wednesday, June 8, 2016

What’s so bad about teeny tiny p-values?

Every so often, I’ll see someone make fun of a really small p-value, usually along with some line like “If your p-value is smaller than 1/(number of molecules in the universe), you must be doing something wrong”. At first, this sounds like a good burn, but thinking about it a bit more, I just don’t get this criticism.

First, the number itself. Is it somehow because the number of molecules in the universe is so large? Perhaps this conjures up some image of “well, this result is saying this effect could never happen anywhere ever in the whole universe by chance—that seems crazy!”, and makes it seem like there’s some flaw in the computation or deduction. Pretty easy to spot the flaw in that logic, of course: configurational space can of course be much larger than the raw number of constituent parts. For example, let’s say I mix some red dye into a cup of water and then pour half of the dyed water into another cup. Now there is some probability that, randomly, all the red dye stays in one cup and no dye goes in the other. That probability is 1/(2^numberOfDyeMolecules), which is clearly going to be a pretty teeny-tiny number.

Here’s another example that may hit a bit closer to home: during cell division, the nuclear envelope breaks down, and so many nuclear molecules (say, lincRNA) must get scattered throughout the cell (and yes, we have observed this to be the case for e.g. Xist and a few others). Then, once the nucleus reforms, those lincRNA seem to be right back in the nucleus. What is the chance that the lincRNA all ended up back in the nucleus purely by chance? Well, again, 1/(2^numberOfRNAMolecules) (assuming a 50/50 nucleus/cytoplasm split), which for many lincRNA is probably like 1/1024 or so, but for something like MALAT1, would be 1/(2^2000) or so. I think we can pretty safely reject the hypothesis that there is no active trafficking of MALAT1 back into the nucleus… :)
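Both the dye and lincRNA numbers come from the same 50/50 binomial null, and it's easy to sketch the back-of-the-envelope calculation in plain Python (the function name and the independent 50/50 split are my assumptions, not anything from a real analysis):

```python
from fractions import Fraction
import math

def p_all_nuclear(n_molecules):
    """Probability that all n molecules end up back in the nucleus
    by chance, assuming each makes an independent 50/50
    nucleus/cytoplasm choice."""
    return Fraction(1, 2 ** n_molecules)

# ~10 molecules of a typical lincRNA: small, but conceivable
print(p_all_nuclear(10))  # 1/1024

# ~2000 molecules of MALAT1: 2^-2000 underflows a float,
# so report the base-10 exponent instead
print(round(-2000 * math.log10(2)))  # -602, i.e., p ~ 10^-602
```

That 10^-602 is far smaller than 1/(number of molecules in the universe), and yet the calculation is perfectly sound: configurational space is just that big.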

I think the more substantial concern people raise with these p-values is that when you get something so small, it probably means that you’re not taking into account some sort of systematic error; in other words, the null model isn’t right. For instance, let’s say I measured a slight difference in the concentration of dye molecules in the second cup above. Even a pretty small change will have an infinitesimal p-value, but the most likely scenario is that some systematic error is responsible (like dye getting absorbed by the material on the second glass or the glasses having slightly different transparencies or whatever). In genomics—or basically any study where you are doing a lot of comparisons—the same sort of thing can happen if the null/background model is slightly off for each of a large number of comparisons.
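As a toy illustration of that point (all numbers invented), here is a simulation in the spirit of the dye example: the measurements carry a small systematic offset that the null model, which assumes a true mean of zero, knows nothing about, and with enough data points the p-value becomes infinitesimal even though nothing scientifically interesting is going on:

```python
import math
import random

random.seed(0)

# Measurements with a small systematic offset (+0.05 units) that the
# null model (true mean = 0) doesn't account for
n = 100_000
data = [random.gauss(0.05, 1.0) for _ in range(n)]

mean = sum(data) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
z = mean / (sd / math.sqrt(n))

# Two-sided p-value under the (wrong) zero-mean null
p = math.erfc(abs(z) / math.sqrt(2))
print(f"z = {z:.1f}, p = {p:.1e}")  # a tiny p, driven entirely by the offset
```

The statistics are computed correctly here; it's the null model that's broken, which no amount of analysis reproducibility will fix.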

All that said, I still don’t see why people make fun of small p-values. If you have a really strong effect, then it’s entirely possible that you can get such a tiny p-value. In which case, the response is typically “well, if it’s that obvious, then why do any statistics?” Okay, fine, I’m totally down with that! But then we’re basically saying that there are no really strong effects out there: if you’re doing enough comparisons that you might get one of these tiny p-values, then any strong, real effect must generate one of these p-values, no? In fact, if you don’t get a tiny p-value for one of these multi-factorial comparisons, then you must be looking at something that is only a minor effect at best, like something that only explains a small amount of the variance. Whether that matters or not is a scientific question, not a statistical one, but one thing I can say is that I don’t know many examples (at least in our neck of the molecular biology woods) in which something which was statistically significant but explained only a small amount of the variance was really scientifically meaningful. Perhaps GWAS is a counterexample to my point? Dunno. Regardless, I just don’t see much justification in mocking the teeny tiny p-value.

Oh, and here’s a teeny tiny p-value from our work. Came from comparing some bars in a graph that were so obviously different that only a reviewer would have the good sense to ask for the p-value… ;)


Update, 6/9/2016:
There are a lot of examples that I think illustrate some of these points better. First, not all tiny p-values are necessarily the result of obvious differences or faulty null models in large numbers of comparisons. Take Uri Alon's network motifs work. Following his book, he showed that in the transcriptional network of E. coli (424 nodes, 519 edges), there were 40 examples of autoregulation. Is this higher, lower or equal to what you would expect? Well, maybe you have a good intuitive handle on random network theory, but for me, the fact that this is very far from the null expectation of around 1.2±1.1 autoregulatory motifs (p-value something like 10^-30) is not immediately obvious. One can (and people do) quibble about the particular type of random network model, but in the end, the p-values were always teeny tiny and I don't think that is either obvious or unimportant.
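To make the null expectation concrete, here's a sketch under the crudest random-network model I could pick: each of the 519 directed edges lands on a uniformly random target node, so each is a self-loop with probability 1/424. (Different random-network nulls give different exact tails, which is exactly the quibble mentioned above, but they all come out teeny tiny.)

```python
import math

N_nodes, N_edges, observed = 424, 519, 40
p_loop = 1 / N_nodes  # chance that a random edge is a self-loop

expected = N_edges * p_loop  # ~1.2 self-loops expected by chance
sd = math.sqrt(N_edges * p_loop * (1 - p_loop))  # ~1.1

# Exact binomial tail: P(at least 40 self-loops by chance)
p_value = sum(
    math.comb(N_edges, k) * p_loop**k * (1 - p_loop) ** (N_edges - k)
    for k in range(observed, N_edges + 1)
)
print(f"expected {expected:.2f} +/- {sd:.2f}, P(X >= 40) = {p_value:.0e}")
```

Under this particular null the tail is even smaller than 10^-30, but the point stands either way: the observed 40 is wildly far from chance, and that was not obvious by eyeballing the network.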

Second, the fact that a large number of small effects can give a tiny p-value doesn't automatically discount their significance. My impression from genetics is that many phenotypes are composed of large numbers of small effects. Moreover, a perturbation such as a gene knockout can lead to a large number of small effects. Whether those are meaningful is (to my mind) an open scientific question, but whether the p-value is small or not is not really relevant.

This is not to say that all tiny p-values mean there's some science worth looking into. Some of the worst offenders are the examples of "binning", where, e.g., the half-life of individual genes correlates with some DNA sequence element, R^2=0.21, p=10^-17 (totally made up example, no offense to anyone in this field!). No strong rule comes from this, so who knows if we actually learned something. I suppose an argument can be made either way, but the bottom line is that those are scientific questions, and the size of the p-value is irrelevant. If the p-value were bigger, would that really change anything?
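Here's a toy version of that (again, made-up) binning scenario: simulate a genuinely weak relationship with R^2 around 0.2 across a couple thousand genes, and the correlation p-value comes out absurdly small regardless:

```python
import math
import random

random.seed(1)

# A weak but real relationship: x explains ~20% of the variance in y
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

# Pearson correlation coefficient, computed from scratch
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r = sxy / math.sqrt(sxx * syy)

# t-statistic for the correlation; the normal approximation to the
# t distribution is fine at n = 2000
t = r * math.sqrt((n - 2) / (1 - r * r))
p = math.erfc(abs(t) / math.sqrt(2))
print(f"R^2 = {r * r:.2f}, p = {p:.0e}")  # weak fit, teeny tiny p
```

The p-value only tells you the weak trend is unlikely to be pure noise; whether a rule that leaves 80% of the variance unexplained teaches you anything is the scientific question.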

Sunday, May 22, 2016

Spring cleaning, old notebooks, and a little linear algebra problem

Update 5/25/2016: Solution at the bottom

These days, I spend most of my time thinking about microscopes and gene regulation and so forth, which makes it all the more of a surprising coincidence that on the eve of what looks to be a great math-bio symposium here at Penn tomorrow, I was doing some spring cleaning in the attic and happened across a bunch of old notebooks from my undergraduate and graduate school days in math and physics (and a bunch of random personal stuff that I'll save for another day—which is to say, never). I was fully planning to throw all those notebooks away, since of course the last time I really looked at them was probably well over 10 years ago, and I did indeed throw away a couple from some of my less memorable classes. But I was surprised that I actually wanted to keep a hold of most of them.

Why? I think partly that they serve as an (admittedly faint) reminder that I used to actually know how to do some math. It's actually pretty remarkable to me how much we all learn during our time in formal class training, and it is sort of sad how much we forget. I wonder to what degree it's all in there somewhere, and how long it would take me to get back up to speed if necessary. I may never know, but I can say that all that background has definitely shaped me and the way that I approach problems, and I think that's largely been for the best. I often joke in lab about how classes are a waste of time, but it's clear from looking these over that that's definitely not the case.

I also happened across a couple of notebooks that brought back some fond memories. One was Math 250(A?) at Berkeley, then taught by Robin Hartshorne. Now, Hartshorne was a genius. That much was clear on day one, when he looked around the room and precisely counted the number of students in the room (which was around 40 or so) in approximately 0.58 seconds. All the students looked at each other, wondering whether this was such a good idea after all. Those who stuck with it got exceptionally clear lectures on group theory, along with by far the hardest problem sets of any class I've taken (except for a differential geometry class I dropped, but that's another story). Of the ten problems assigned every week, I could do maybe one or two, after which I puzzled away, mostly in complete futility, until I went to his very well attended office hours, at which he would give hints to help solve the problems. I can't remember most of the details, but I remember that one of the hints was so incredibly arcane that I couldn't imagine how anyone, ever, could have come up with the answer. I think that Hartshorne knew just how hard all this was, because one time I came to his office hours after a midterm when a bunch of people were going over a particular problem, and I said "Oh yeah, I think I got that one!" and he looked at me with genuine incredulity, at which point I explained my solution. Hartshorne looked relieved, pointed out the flaw, and all went back to normal in the universe. :) Of course, there were a couple kids in that class from whom Hartshorne wouldn't have been surprised to see a solution, but that wasn't me, for sure.

While rummaging around in that box of old notebooks, I also found some old lecture notes that I really wanted to keep. Many of these are from one of my PhD advisors, Charlie Peskin, who had some wonderful notes on mathematical physiology, neuroscience, and probability. His ability to explain ideas to students with widely-varying backgrounds was truly incredible, and his notes are so clear and fresh. I also kept notes from a couple of my other undergrad classes that I really loved, notably Dan Roksar's quantum mechanics series, Hirosi Ooguri's statistical mechanics and thermodynamics, and Leo Harrington's set theory class (which was truly mind-bending).

It was also fun to look through a few of the problem sets and midterms that I had taken—particularly odd now to look at some old dusty blue books and imagine how much stress they had caused at the time. I don't remember many of the details, but I somehow still vaguely remembered two problems, one in undergrad, one in grad school as being particularly interesting. The undergrad one was some sort of superconducting sphere problem in my electricity and magnetism course that I can't fully recall, but it had something to do with spherical harmonics. It was a fun problem.

The other was from a homework in a linear algebra class I took in grad school from Sylvia Serfaty, and I did manage to find it hiding in the back of one of my notebooks. A simple-seeming problem: given an n×n matrix A, formulate necessary and sufficient conditions for the 2n×2n matrix B defined as

B = |A A|
    |0 A|

to be diagonalizable. I'll give you a hint that is perhaps what one might guess from the n=1 case: the condition is that A = 0. In that case, sufficiency is trivial (B = 0 is definitely diagonalizable), but showing necessity—i.e., showing that if B is diagonalizable, then A = 0—is not quite so straightforward. Or, well, there's a tricky way to get it, at least. Free beer to whoever figures it out first with a solution as tricky as (or trickier than) the one I'm thinking of! Will post an answer in a couple days.
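While you wait (and to be clear, this is just a numerical sanity check, not the tricky solution), it's easy to verify in plain Python the block identity B^k = [[A^k, k*A^k], [0, A^k]]. That identity implies p(B) = [[p(A), A*p'(A)], [0, p(A)]] for any polynomial p, which is one place an argument about minimal polynomials can get traction:

```python
def matmul(X, Y):
    """Multiply two square matrices given as lists of lists."""
    n = len(X)
    return [[sum(X[i][t] * Y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def block(A):
    """Build the 2n x 2n matrix B = [[A, A], [0, A]]."""
    n = len(A)
    B = [[0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            B[i][j] = A[i][j]          # top-left block A
            B[i][j + n] = A[i][j]      # top-right block A
            B[i + n][j + n] = A[i][j]  # bottom-right block A
    return B

A = [[1, 2], [3, 4]]  # any test matrix works here
B = block(A)

B3 = matmul(matmul(B, B), B)  # B^3, computed directly
A3 = matmul(matmul(A, A), A)  # A^3

# Compare against the predicted block form [[A^3, 3*A^3], [0, A^3]]
assert all(B3[i][j] == A3[i][j] for i in range(2) for j in range(2))
assert all(B3[i][j + 2] == 3 * A3[i][j] for i in range(2) for j in range(2))
assert all(B3[i + 2][j] == 0 for i in range(2) for j in range(2))
print("B^3 matches [[A^3, 3*A^3], [0, A^3]]")
```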

Update, 5/25/2016: Here's the solution!


Sunday, May 1, 2016

The long tail of artificial narrow superintelligence

As readers of the blog have probably guessed, there is a distinct strain of futurism in the lab, mostly led by Paul, Ally, Ian and me (everyone else mostly just rolls their eyes, but what do they know?). So it was against this backdrop that we had a heated discussion recently about the implications of AlphaGo.

It started with a discussion I had with someone who is an expert on machine learning and knows a bit of Go, and he said that AlphaGo was a huge PR stunt. He said this based on the fact that the way AlphaGo wins is basically by using deep learning to evaluate board positions really well, while doing a huge number of calculations to search through candidate moves and evaluate the resulting positions. Is that really “thinking”? Here, opinions were split. Ally was strongly in the camp of this being thinking, and I think her argument was pretty valid. After all, how different is that necessarily from how humans play? They probably think up possible places to go and then evaluate the board position. I was of the opinion that this is an entirely different type of thinking from human thinking.

Thinking about it some more, I think perhaps we’re both right. Using neural networks to read the board is indeed amazing, and a feat that most thought would not be possible for a while. It’s also clear that AlphaGo was doing far more “traditional” brute-force computation of potential moves than Lee Sedol was. The question then becomes how close the neural network part of AlphaGo comes to Lee Sedol’s intuition, given that the brute-force calculation side is tipped far in AlphaGo’s favor. This is a hard question to answer, because it’s unclear how closely matched they were. I was, perhaps like many, sort of shocked that Lee Sedol managed to win game 4. Was that a sign that they were not so far apart from each other? Or just a weird, flukey sucker punch from Sedol? Hard to say. The fact that AlphaGo was probably no match for Sedol a few months prior is probably a strong indication that AlphaGo is not radically stronger than Sedol. So my feeling is that Sedol’s intuition is still perhaps greater than AlphaGo’s, which allowed him to keep up despite such a huge disadvantage in traditional computation power.

Either way, given the trajectory, I’m guessing that within a few months, AlphaGo will be so far superior that no human will ever, ever be able to beat it. Maybe this will come through improvements to the neural network or to traditional computation; whatever the case, it will not be thinking the same way humans do. The point is that it doesn’t matter, as far as playing Go is concerned. We will have (already have?) created the strongest Go player ever.

And I think this is just the beginning. A lot of the discourse around artificial intelligence revolves around the potential for artificial general super-intelligence (like us, but smarter), like a paper-clip making app that will turn the universe into a gigantic stack of paper-clips. I think we will get there, but well before then, I wonder if we’ll be surrounded by so much narrow-sense artificial super-intelligence (like us, but smarter at one particular thing) that life as we know it will be completely altered.

Imagine a world in which there is super-human-level performance at various “brain” tasks. What will be the remaining motivation to do those things? Will everything just be a sport or leisure activity (like running for fun)? Right now, we distinguish (perhaps artificially) between what’s deemed “important” and what’s just a game. But what if we had a computer for proving math theorems or coming up with algorithms, one vastly better than any human? Could you still have a career as a mathematician? Or would it just be one big math olympiad that we do for fun? I’m now thinking that virtually everything humans consider important and do for "work" could be overtaken by “dumb” artificial narrow super-intelligence, well before the arrival of a conscious general super-intelligence. Hmm.

Anyway, for now, back in our neck of the woods, we've still got a ways to go in getting image segmentation to perform as well as humans. But we’re getting closer! After that, I guess we'll just do segmentation for fun, right? :)

Wednesday, April 6, 2016

The hierarchy of academic escapism

Work to escape from the chaos of home. 

Conference travel to escape from the chaos of work.

Laptop in hotel room to escape from the chaos of the poster session.

Email to escape the tedium of reviewing papers.

Netflix to escape the tedium of email.

Sleep to escape the tedium of Sherlock Season 3.

And then it was Tuesday.

Thursday, March 3, 2016

From over-reproducibility to a reproducibility wish-list

Well, it’s clear that that last blog post on over-reproducibility touched a bit of a nerve. ;)

Anyway, a lot of the feedback was rather predictable and not particularly convincing, but I was pointed to this discussion on the Software Carpentry website, which was actually really nice:

On 2016-03-02 1:51 PM, Steven Haddock wrote:
> It is interesting how this has morphed into a discussion of ways to convince / teach git to skeptics, but I must say I agreed with a lot of the points in the RajLab post.
>
> Taking a realistic and practical approach to use of computing tools is not something that needs to be shot down (people sound sensitive!). Even if you can’t type `make paper` to recapitulate your work, you can still be doing good science…
>
+1 (at least) to both points. What I've learned from this is that many scientists still see cliffs where they want on-ramps; better docs and lessons will help, but we really (really) need to put more effort into usability and interoperability. (Diff and merge for spreadsheets!)

So let me turn this around and ask Arjun: what would it take to convince you that it *was* worth using version control and makefiles and the like to manage your work? What would you, as a scientist, accept as compelling?

Thanks,
Greg

--
Dr Greg Wilson
Director of Instructor Training
Software Carpentry Foundation


First off, thanks to Greg for asking! I really appreciate the active attempt to engage.

Secondly, let me just say that as to the question of what it would take for us to use version control, the answer is nothing at all, because we already use it! More specifically, we use it in places where we think it’s most appropriate and efficient.

I think it may be helpful for me to explain what we do in the lab and how we got here. Our lab works primarily on single cell biology, and our methods are primarily single molecule/single cell imaging techniques and, more recently, various sequencing techniques (mostly RNA-seq, some ATAC-seq, some single cell RNA-seq). My lab has people with pretty extensive coding experience, people with essentially no coding experience, and many in between (I see it as part of my educational mission to help everyone get better at coding during their time in the lab). My PhD is in applied math with a side of molecular biology, during which time we developed a lot of the single RNA molecule techniques that we are still using today. During my PhD, I was doing the computational parts of my science in an only vaguely reproducible way, and that scared me. Like “Hmm, that data point looks funny, where did that come from?”. Thus, in my postdoc, I started developing a little MATLAB "package" for documenting and performing image analysis. I think this is where our first efforts in computational reproducibility began.

When I started in the lab in 2010, my (totally awesome) first student Marshall and I took the opportunity to refactor our image analysis code, and we decided to adopt version control for these general image processing tools. After a bit of discussion, we settled on Mercurial and bitbucket.org because it was supposed to be easier to use than git. This has served us fairly well. Then, my brilliant former postdoc Gautham got way into software engineering and completely refactored our entire image processing pipeline, which is basically what we are using today, and is the version that we point others to use here. Since then, various people have contributed modules and so forth. For this sort of work, version control is absolutely essential: we have a team of people contributing to a large, complex codebase that is used by many people in the lab. No brainer.

In our work, we use these image processing tools to take raw data and turn it into numbers that we then use to hopefully do some science. This involves various analysis scripts that take this data, perform whatever statistical analysis on it, and then turn that into a graphical element. Typically, this is done by one, more often two, people in the lab, usually working closely together.

Right around the time Gautham left the lab, we had several discussions about software best practices. Gautham argued that every project should have a repository for these analysis scripts. He also argued that the commit history could serve as a computational lab notebook. At the time, I thought the idea of a repo for every project was a good one, and I cajoled people in the lab into doing it. I pretty quickly pushed back on the version-control-as-computational-lab-notebook claim, though, and I still feel pretty strongly about that. I think it’s interesting to think about why. Version control is a tool that allows you to keep track of changes to code. It is not something that will naturally document what that code does. My feeling is that version control is in some ways a victim of its own success: it is such a useful tool for managing code that it is now widely used and promoted, and as a side effect it is now being used for a lot of things for which it is not quite the right tool for the job, a point I’ll come back to.

Fast forward a little bit. Using version control in the repo-for-every-project model was just not working for most people in the lab. To give a sense of what we’re doing: in most projects, there’s a range of analyses, sometimes just making a simple box plot or bar graph, sometimes long-ish scripts that take, say, RNA counts per cell and fit them to a model of RNA production, extracting model parameters with error bounds. Sometimes it might be something still more complicated. The issue with version control in this scenario is all the headache. Some remote heads would get forked. Somehow things weren't syncing right. Some other weird issue would come up. Plus, frankly, all the commit/push/pull/update was causing some headaches, especially if someone forgot to push. One student in the lab and I were working on a large project together, and after bumping into these issues over and over, she just said “screw it, can we just use Dropbox?” I was actually reluctant at first, but then I thought about it a bit more. What were we really losing? As I mentioned in the blog post, our goal is a reproducible analysis. For this, versioning is at best a means towards that goal, and in practice for us, a relatively tangential one. Yes, you can go back and use earlier versions. Who cares? The number of times we’ve had to do that in this context is basically zero.

One case people have mentioned as a potential benefit of version control is performing alternative, exploratory analyses on a particular dataset, the idea being that you can roll back and compare results. I would argue that version control is not the best way to perform or document this. Let’s say I have a script for “myCoolAnalysis”. What we do in the lab is make “myAlternativeAnalysis”, in which we code up the new analysis. Now I can easily compare, and, importantly, we have both versions around. Keeping the alternative version only in version control is, I think, a bad idea: it’s not discoverable except by searching the commit log. Let’s say you wanted to go back to that analysis in the future. How would you find it? I think it makes much more sense to have it present in the current version of the code than to dig through the commit history. One could argue that you could fork the repo, but then changes to other, unrelated parts of the repo would be hard to deal with. Overall, version control is just not the right tool for this, in my opinion.

Another, somewhat related point that people have raised is looking back to see why some particular output changed. Here, we’re basically talking about bugs and flawed analyses. There is some merit to this, so I acknowledge there is a tradeoff, and that once you get to a certain scale, version control is very helpful. However, for scientific programming at the scale I’m talking about, it’s usually fairly clear what caused something to change, and I’m less concerned about why something changed and much more worried about whether we’re actually getting the right answer, which is always a question about the code as it stands. For us, the vast majority of the time, we are moving forward. I think the emphasis would be better placed on teaching people how to test their code (which is a scientific problem more than a programming problem) than on version control.

Which leads me to really answering the question: what would I love to have in the lab? On a very practical level, look, version control is still just too hard and annoying for a lot of people to use, and it injects a lot of friction into the process. I have some very smart people in my lab, and we have all struggled with it from time to time. I’m sure we could figure it out, but honestly, I see little impetus to do so for the use cases outlined above, and yes, our work is 100% reproducible without it. Moving (back) to Dropbox has been a net productivity win, allowing us to work quickly and efficiently together. Also, the hassle-free nature of it was a real relief. On our latest project, while using version control, we were always asking “oh, did you push that?”, “hmm, what happened?”, “oh, I forgot to update”. (And yes, we know about and sometimes use SourceTree.) These little hassles all add up to a real cognitive burden, and I’m sorry, but it's just a plain fact that Dropbox is less work. Now it’s just “Oh, I updated those graphs”, “Looks great, nice!”. Anyway, what I would love is Dropbox with a little bit more version tracking. Dropbox does have some rudimentary versioning, basically a way to recover from an "oh *#*$" moment; the thing I miss most is probably a quick diff. Until this magical system emerges, though, on balance it is currently just more efficient for us not to use version control for this type of computational work. I posit that the majority of people who could benefit from some minimal computational reproducibility practices fall into this category as well.
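Incidentally, the quick diff itself doesn't require a version control system. Here's a minimal sketch using the Python standard library, comparing the current version of a script against whatever older copy you kept around (the file names and contents are made up for illustration):

```python
# Quick diff between two versions of an analysis script, no repo required.
import difflib

old = ["normalize(counts)\n", "fit_model(counts)\n"]
new = ["normalize(counts)\n", "filter_outliers(counts)\n", "fit_model(counts)\n"]

text = "".join(difflib.unified_diff(
    old, new,
    fromfile="myCoolAnalysis (old)", tofile="myCoolAnalysis (new)"))
print(text)  # the filter_outliers call shows up as an added "+" line
```

In practice you'd read the two versions from files (say, the live copy and a dated backup) rather than hard-coding them.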

Testing: I think getting people in the habit of testing would be a huge move in the right direction. And I mean scientific code testing, not just “the program doesn’t crash” testing. When I teach my class on molecular systems biology, one of my secret goals is to teach students a little bit about scientific programming. Those who have some programming experience often fall into the trap of thinking “well, the program ran, so it must have worked”, which is often fine for, say, a website, but is usually just the beginning of the story for scientific programming and simulations. Did you check the order of convergence (or convergence at all)? Did you check whether you’re getting the predicted distribution in a well-known degenerate case? Most people don’t think about programming that way. Note that none of this has anything to do with version control per se.
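As a concrete example of what I mean by scientific testing (a sketch; the integrand and tolerance are arbitrary choices of mine): don't just check that a numerical routine runs, check that its error shrinks at the theoretically predicted order.

```python
# Composite trapezoid rule, whose error should shrink like h^2: halving the
# step size should cut the error by roughly 4x, i.e., observed order ~2.
import math

def trapezoid(f, a, b, n):
    """Composite trapezoid rule with n subintervals."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return h * s

exact = 1.0 - math.cos(1.0)  # integral of sin(x) on [0, 1]
errs = [abs(trapezoid(math.sin, 0.0, 1.0, n) - exact) for n in (16, 32, 64)]
orders = [math.log2(errs[i] / errs[i + 1]) for i in range(len(errs) - 1)]

# The scientific test: observed convergence order close to the theoretical 2.
assert all(abs(p - 2.0) < 0.1 for p in orders)
```

A program with an off-by-one bug in the quadrature weights would still "run fine" and even converge, but at the wrong order, and this kind of test catches exactly that.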

On a bigger level, I think the big unmet need is a nice way to document an analysis as it currently stands. Gautham and I had a lot of discussions about this when he was in lab. What would such documentation do? Ideally, it would document the analysis in a searchable and discoverable way. This was something Gautham and I discussed at length and didn’t get around to implementing. Here’s one idea we were tossing around. Let’s say you kept your work in a directory tree, with analyses organized by subfolder. For example, you could keep that analysis of H3K4me3 in “histoneModificationComparisons/H3K4me3/”, then H3K27me3 in “histoneModificationComparisons/H3K27me3/”. In each directory, you have the scripts associated with a particular analysis, and running those scripts produces an output graph. That output graph could either be stored in the same folder or in a separate “graphs” subfolder. Now, the scripts and the graphs would have metadata (not sure what this would look like in practice), so you could have a script go through and quickly generate a table of contents with links to all these graphs for easy display and search. Perhaps this is similar to those IPython notebooks or whatever. Anyway, the main feature is that this would make all those analyses (including older ones that don't make it into the paper) discoverable (via tagging/table of contents) and searchable (search: “H3K27”). For me, this would be a really helpful way to document an analysis, it would be relatively lightweight, and it would fit into our current workflow. Which reminds me: we should do this.

I also think that a lot of this discussion is really sort of veering around the simple task of keeping a computational lab notebook. This is basically a narrative about what you tried, what worked, what didn’t work, and how you did it, why you did it, and what you learned. I believe there have been a lot of computational lab notebook attempts out there, from essentially keyloggers on up, and I don’t know of any that have really taken off. I think the main thing that needs to change there is simply the culture. Version control is not a notebook, keylogging is not a notebook, the only thing that is a notebook is you actually spending the time to write down what you did, carefully and clearly–just like in the lab. When I have cajoled people in the lab into doing this, the resulting documents have been highly useful to others as how-to guides and as references. There have been depressingly few such documents, though.

Also, seriously, let's not encourage people to use version control for maintaining their papers. This is just about the worst way to sell version control. Unless you're doing some heavy math with LaTeX or working with a very large document, Google Docs or some equivalent is the clear choice every time, and it will be impossible to convince me otherwise. Version control is a tool for maintaining code. It was never meant for managing a paper, and much better tools exist. For instance, Google Docs excels at easy sharing, collaboration, simultaneous editing, commenting and replying to comments. Sure, one can approximate these with text-based systems and version control. The question is why anyone would want to do that. Not everything you do on a computer maps naturally to version control.

Anyway, that ended up being a pretty long response to what was a fairly short question, but I just want to reiterate that I find it reassuring that people like Greg are willing to listen to these ramblings and hopefully find something positive in them. My lab is really committed to reproducible computational analyses, and I think I speak for many when I describe the challenges we and others face in making them happen. Hopefully this can stimulate some new discussion and ideas!