Monday, November 3, 2014

Why don’t bioinformaticians learn how to run gels?

Just read an interesting post from Sean Eddy about genomics. Lots of points there about sequencing and big science and other stuff that seems well above my pay grade. But the post also brings up the notion that biologists should be able to do their own data analysis, in particular scripting with Perl/Python. I’ve heard this subject debated many times before, and I’m sure I’ll hear it again. But I don't think it's the right way to think about it.

First off, I want to say that I agree with the underlying premise in theory. Yes, it would be great for everyone to have some basic skills in quantitative analysis and programming. It would certainly be useful for biologists to be able to analyze their own data, and we do all our own analysis at the command line in the lab, typically using tools graciously and freely provided by others. But for others with different skills and interests, there is finite time in the day, and maybe they don’t have the time and inclination to learn this stuff. To require biologists to learn to do things at the command line is, I think, missing a huge opportunity, and is also a bit unfair.

Consider the following: how many bioinformaticians are required to learn and perform library prep to do their work? And what if we told them to “just figure it out by Googling around”? I’m not even talking about understanding all the various technical aspects of library prep, I mean even just doing the basic protocols. Probably not very many have been required to do this. I’m sure they could do it and figure it out, but why should they, you might ask? A reasonable question. Well, then why should biologists be subjected to the pain of shell/Perl scripting just to figure out if some genes’ expression went up or down? Why does this work in only one direction? Remember, scripting is NOT SCIENCE. It is just a tool. I see no reason why everyone should have to learn about all the details of every tool in order to do their science. This even applies just within the realm of computation: how many people who use the log function know anything about how to implement it? Going up the chain, I don’t need to know why MATLAB uses Householder transformations to compute a QR factorization instead of Gram-Schmidt or even that it does so at all–I can just call it and trust that MATLAB does the best thing by default. That is the nature of a mature tool.

Indeed, it is particularly ironic to hear these calls for DIY learning from genomic informaticians, when the experimental side of that same work is amongst the most commoditized and standardized bench work in existence (funnily enough, to a point where bioinformaticians might actually be able to do it with only minimal training!). Basically, add and remove liquids to/from each other for 1-2 days, squirt it in some sequencing chip and say go, then download the data. It’s pretty close to the big green “GO” button that everyone dreams about. And it comes from years of careful thought and consideration about the needs of the USER of the tool, not of the provider. Make no mistake, the technology underlying sequencing is very complicated and sophisticated. But the reason sequencing has taken off the way it has is that USING the (hardware/wetware) tool is very simple. Just like scripting/data processing, sequencing is not science, but a tool. It is, at this point, a much easier-to-use tool than analysis software, in my opinion.

I of course appreciate that part of the reason that sequencing itself is so well developed is that there are huge companies with tremendous resources backing the effort. Fair enough. Perhaps it will require a commercial effort to build an easy-to-use pipeline for analysis. Maybe not. Either way, though, I think the main thing to keep in mind if you are in the tool business is that if you want people to use your tool, you will get a lot further by LISTENING (and I mean actually listening) to your users and their needs than you will by simply telling them about all the things that they ought to do and ought to know. It’s hard work, and requires a lot of thought and attention, and I certainly understand the sentiment that it may not fall within the purview of academic work. But I think it needs to happen one way or another. In the same way that simplified mobile operating systems brought computation to many more people than before, so will easy-to-use bioinformatics pipelines bring sequencing tools to many more biologists, which is a good thing.

This is most certainly not to say that biologists shouldn't be getting some more quantitative training, especially in computers. There is no doubt that learning some principles of programming and quantitative/statistical analysis can be hugely beneficial, given the way science as a whole is headed. Again, that is not the same thing as learning scripting. In fact, being able to script is completely unrelated to quantitative thinking and only moderately related to any high level concepts in programming. It is busywork, plain and simple. In my lab, we do quantitative work, and writing these scripts is still basically what I would consider a big waste of time. We can do it, but it has nothing to do with science, quantitative or otherwise, and most of us would much rather not have to bother. Even worse for science is that the requirement of scripting leaves those who can’t do it because of limited time or whatever out in the cold.

Oh, and by the way, I think Galaxy is a great step in this direction. Bravo to the developers, and thank you for your hard work!

Update, 11/4: In case you're wondering if we practice what we preach, we have two versions of our image analysis software. One is open source, very powerful, completely extensible, fancy software engineering, etc. The other one is super limited, but designed for use by scientists, not programmers. Both are freely available, but guess which one gets used by orders of magnitude more people...


  1. Agree it's as unrealistic to expect biologists to be able to script complex analyses as it is to expect bioinformaticians to conduct perfect library preps (they still bring me out in cold sweats...)
    Coming as a medic, who has had to learn both lab and basic bioinformatics... I think it's really important in the discussion to differentiate between 'scripting as the equivalent of learning to use Windows and interact with data/use Excel to analyse' and 'scripting as in designing and writing an analysis tool'. The former is, I think, achievable and realistic to expect. I agree that to expect the beginner to learn the latter is nuts.
    We've set up an in-house Unix/Informatics for beginners course in our group, and massively benefited from having a (very very patient) person to guide us through. It reduces learning time by about a hundredfold from blind googling...

  2. Whenever this debate comes up, I wonder why we don't train biologists more like physicists. In physics, even experimentalists are trained in math and programming way beyond what biologists get. They don't use all of the fancy math that theorists use, but they know how to quantitatively analyze their data.

    It seems like in physics this is a solved problem. The comp bio/experimental bio split should be a lot more like the split in physics.

    1. Hehe, except that there are very few theorists in (molecular) biology, and the ones there are tend to be considered crackpots until some experiment comes along. Computational biology is not the same as theory, and so if we made a split like in physics, there wouldn't be much on that side of the split.

    2. There is a very simple explanation - nobody has a serious reason to do that, because the incentives are all stacked in the direction of "get the student pipetting at the bench ASAP" as that's how data is produced. If departments had to invest time and effort to actually teach their grad students about quantitative reasoning (and not just that, as a rule they don't do a good job teaching them biology either), this would be, first, time that those teaching the classes would rather spend writing grants and papers, and second, time that the students are not producing data. You don't need to know a lot about anything to move liquids from one tube to another, which is what forms the bulk of most biology PhDs. And so departments generally don't see a lot of reason to invest resources into teaching much more than that.

      Just take a look at the typical graduate curriculums in physics and math and compare them to those in biology - the rule is that the ratio of the number of classes taken is 3- or 4-to-1 (i.e. biologists take 4-6 classes, physicists take 16 or more), and those are some very serious classes in physics/math while they tend to be a real joke in biology (very light workload, no oversight over what has been and what has not been learned).

      I don't have direct observations, but I would imagine the situation is at least somewhat similar in chemistry too, where the culture is even more productivity-over-everything-else-focused.

    3. I largely disagree. First off, many departments are requiring more and more quantitative classes. The number of classes differs, but that is largely a function of the age and depth of the field. Suggesting that most biology PhDs consist merely of moving a bunch of liquids around mindlessly is both naive and insulting, not to mention wrong. Moreover, I think that quantitative sciences first have to show that quantitative approaches have something valuable to add to biology, and that is a challenge for us quantitative types, one that I feel we have a long way to go on.

    4. I don't see how what you're saying invalidates what I posted.

      1) Departments may be requiring more and more quantitative classes, but that is because they are trying to address a problem that has only come up fairly recently - with the explosion of data, they just can't find a sufficient number of people to work on it. So they are forced to act. But what about the people who have been trained prior to that, who are the ones the OP was mostly about?

      2) That they are requiring more and more such classes does not mean the outcome of those classes is satisfactory. I had to take one such class too - it was a complete joke.

      3) I used some hyperbole, but that does not change the overall validity of what I said - the best PhD students do a lot more than moving liquids around, that's correct. But I also know what I have seen around me, and I was/am at a fairly elite institution. What I have seen is not pretty - close to none of what you would call "training", lots of people who have regressed rather than grown intellectually relative to the time they entered the program, and a system that in no uncertain terms conveyed the message that I should be at the bench pipetting from the second week I set foot on campus. If you've had a better experience, that's great. But at this point I've seen some ugly situations at quite a few labs and institutions, and combined with some knowledge of the way the financial incentives in the overall system of biomedical research are set up, it gives me sufficient reason to draw general conclusions.

  3. A lot more bioinformaticians than you realize can run gels just fine if we want to -- many of us were originally *trained* as bench biologists, Eddy included. Not to mention that these days bench work is becoming easier and easier as pre-poured gels are available (in my day we had to pour them ourselves), plus there are all these new lab robots that automate so much, like picking colonies.

    1. I think we actually agree. I never meant to insinuate that bioinformaticians *don't* know how to run gels, but rather that they are not *required* to for their work. And I think that's probably a good thing. It's awesome that we have made bench biology easier and easier with pre-cast gels and robots, etc. (both luxuries I have never known!). I just wish the same was true for bioinformatics pipelines.

    2. Although I can see that the somewhat inflammatory title of my post could suggest a lack of knowledge. I think that the point I am trying to make is best stated as: bioinformaticians are not *required* to know much about bench work, so it seems unfair to require non-bioinformaticians to know much about bioinformatics work. As a matter of self-improvement and self-preservation, I think it's valuable to learn about bioinformatics, but I just think the learning curve is so steep that it currently excludes many people.

  4. This post really resonates with me, especially as someone with a purely experimental background trying to slog through this DIY programming learning curve in my second year of grad school. "Just googling" takes up a lot of time, and it can be really overwhelming (especially when you don't get many continuous chunks of time away from the bench). It would help if these tools were easier to use/modify/develop.