Just read an interesting post from Sean Eddy about genomics. Lots of points there about sequencing and big science and other stuff that seems well above my pay grade. But the post also brings up the notion that biologists should be able to do their own data analysis, in particular by scripting in Perl/Python. I’ve heard this subject debated many times before, and I’m sure I’ll hear it again. But I don't think it's the right way to think about it.
First off, I want to say that I agree with the underlying premise in theory. Yes, it would be great for everyone to have some basic skills in quantitative analysis and programming. It would certainly be useful for biologists to be able to analyze their own data, and in my lab we do all our own analysis at the command line, typically using tools graciously and freely provided by others. But for others with different skills and interests, there is finite time in the day, and maybe they don’t have the time or inclination to learn this stuff. To require biologists to learn to do things at the command line is, I think, missing a huge opportunity, and is also a bit unfair.
Consider the following: how many bioinformaticians are required to learn and perform library prep to do their work? And what if we told them to “just figure it out by Googling around”? I’m not even talking about understanding all the various technical aspects of library prep, I mean even just doing the basic protocols. Probably not very many have been required to do this. I’m sure they could do it and figure it out, but why should they, you might ask? A reasonable question. Well, then why should biologists be subjected to the pain of shell/Perl scripting just to figure out if some genes’ expression went up or down? Why does this work in only one direction? Remember, scripting is NOT SCIENCE. It is just a tool. I see no reason why everyone should have to learn about all the details of every tool in order to do their science. This even applies just within the realm of computation: how many people who use the log function know anything about how to implement it? Going up the chain, I don’t need to know why MATLAB uses Householder transformations to compute a QR factorization instead of Gram-Schmidt or even that it does so at all–I can just call it and trust that MATLAB does the best thing by default. That is the nature of a mature tool.
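For the curious, here is roughly what "the best thing by default" is saving you from. This is a minimal sketch in Python with NumPy rather than MATLAB (my substitution). The test matrix and the helper names `classical_gram_schmidt` and `orthogonality_error` are made up for illustration. It compares a textbook classical Gram-Schmidt QR against `np.linalg.qr`, which calls LAPACK's Householder-based routines, on a small matrix with nearly parallel columns.

```python
import numpy as np


def classical_gram_schmidt(A):
    """Textbook classical Gram-Schmidt QR (illustrative helper, not a library routine)."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            # Project the ORIGINAL column onto each earlier q_i (the "classical" variant).
            R[i, j] = Q[:, i] @ A[:, j]
            v -= R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R


def orthogonality_error(Q):
    """How far Q's columns are from orthonormal: ||Q^T Q - I||."""
    return np.linalg.norm(Q.T @ Q - np.eye(Q.shape[1]))


# Hypothetical test input: three nearly parallel columns (a Lauchli-style matrix).
eps = 1e-8
A = np.array([
    [1.0, 1.0, 1.0],
    [eps, 0.0, 0.0],
    [0.0, eps, 0.0],
    [0.0, 0.0, eps],
])

Q_gs, R_gs = classical_gram_schmidt(A)
Q_hh, R_hh = np.linalg.qr(A)  # LAPACK under the hood, Householder-based

print("Gram-Schmidt orthogonality error:", orthogonality_error(Q_gs))
print("Householder  orthogonality error:", orthogonality_error(Q_hh))
```

Both versions reproduce `A` when you multiply `Q` by `R`, but the Gram-Schmidt `Q` ends up far from orthogonal on this input, while the library's `Q` is orthogonal to about machine precision. That is exactly the kind of detail a user of a mature tool never has to think about.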
Indeed, it is particularly ironic to hear these calls for DIY learning from genomic informaticians, when the experimental side of that same work is amongst the most commoditized and standardized bench work in existence (funnily enough, to a point where bioinformaticians might actually be able to do it with only minimal training!). Basically, add and remove liquids to/from each other for 1-2 days, squirt it into some sequencing chip and say go, then download the data. It’s pretty close to the big green “GO” button that everyone dreams about. And it comes from years of careful thought and consideration about the needs of the USER of the tool, not of the provider. Make no mistake, the technology underlying sequencing is very complicated and sophisticated. But the reason sequencing has taken off the way it has is that USING the (hardware/wetware) tool is very simple. Just like scripting/data processing, sequencing is not science, but a tool. It is, at this point, a much easier-to-use tool than analysis software, in my opinion.
I of course appreciate that part of the reason that sequencing itself is so well developed is that there are huge companies with tremendous resources backing the effort. Fair enough. Perhaps it will require a commercial effort to build an easy-to-use pipeline for analysis. Maybe not. Either way, though, I think the main thing to keep in mind if you are in the tool business is that if you want people to use your tool, you will get a lot further by LISTENING (and I mean actually listening) to your users and their needs than you will by simply telling them about all the things that they ought to do and ought to know. It’s hard work, and requires a lot of thought and attention, and I certainly understand the sentiment that it may not fall within the purview of academic work. But I think it needs to happen one way or another. Just as simplified mobile operating systems brought computation to many more people than before, easy-to-use bioinformatics pipelines will bring sequencing tools to many more biologists, which is a good thing.
This is most certainly not to say that biologists shouldn't be getting some more quantitative training, especially in computing. There is no doubt that learning some principles of programming and quantitative/statistical analysis can be hugely beneficial, given the way science as a whole is headed. Again, that is not the same thing as learning scripting. In fact, being able to script is completely unrelated to quantitative thinking and only moderately related to any high-level concepts in programming. It is busywork, plain and simple. In my lab, we do quantitative work, and writing these scripts is still basically what I would consider a big waste of time. We can do it, but it has nothing to do with science, quantitative or otherwise, and most of us would much rather not have to bother. Even worse for science, the requirement of scripting leaves those who can’t do it, because of limited time or whatever, out in the cold.
Oh, and by the way, I think Galaxy is a great step in this direction. Bravo to the developers, and thank you for your hard work!
Update, 11/4: In case you're wondering if we practice what we preach, we have two versions of our image analysis software. One is open source, very powerful, completely extensible, fancy software engineering, etc. The other one is super limited, but designed for use by scientists, not programmers. Both are freely available, but guess which one gets used by orders of magnitude more people...