Saturday, July 11, 2015

How should we do script review to spot errors?

Sydney just thought up a great idea for the lab: she was wondering if someone could review all her analysis scripts to look for errors before we finalize them and submit a manuscript. Sort of like a code review, I guess. I think this is awesome, and it can definitely reduce the potential for getting some very serious egg on your face after publication. (Note: I'm not talking about infrastructure-type software, which I think has a very different set of problems and solutions. This is about analysis scripts for the science itself.)

We all briefly discussed at group meeting how this might work in practice, which took on a very practical significance because Chris was going over figures for the paper he's putting together. Here are some of the points of discussion, many revolving around the time it takes for someone to go over someone else's code.

  1. When should the review happen? In the ideal world, the reviewer would be involved each step of the way, spotting errors early on in the process. In practice, that's a pretty big burden on the reviewer, and there's the potential to spend time reviewing analyses that never see the light of day. So I think we all thought it's better done at the end. Of course, doing it at the bitter end could be, well, bitter. So we're thinking maybe doing it in chunks when specific pieces of the analysis are finalized?
  2. Who should do it? Someone well-versed in the project would obviously be able to go through it faster. Also, they may be better able to suggest "sanity checks" (additional analyses to demonstrate correctness) than someone naive to the project. Then again, might their familiarity blind them to certain errors? I'm just not sure at this stage how much work it is to go through this.
  3. Related: How actively should the code author be involved? On the one hand, looking at raw code without any guidance can be very intimidating and time-consuming. On the other hand, having someone lead you through the code might inadvertently steer the reviewer away from problem areas.
  4. Who should do it, part 2? Some folks in the lab are a bit more computationally savvy than others. I worry that the more computationally savvy folks might get overburdened. It could be a training exercise for others to learn, but the quality of the review itself might suffer somewhat.
  5. How should we assign credit? Acknowledgement on the paper? Co-authorship? I could see making a case either way; I guess it probably depends on the specifics.

Anyway, don't know if anyone out there has tried something like this, but if so, we'd love to hear your thoughts. I think it's increasingly important to think about these days.

1 comment:

  1. arjun, in our lab, we don't consider code any different from biological methods, except that repeating an expt requires us to spend money on reagents/consumables, while re-writing code only requires us to spend additional time (if you forget the "time equals money" part).

    if there is a bug, we usually catch it during the replication process. it's v difficult unless someone goes line by line and tries to reproduce the error. in our work, there is a lot of shell scripting and the scripts are all short, so they are not so difficult to check during replication. we usually don’t check the code at the end of the project, which tends to take years if not months, but at the end of a part of the project, for example, the end of variant calling from the exome data, where we have gone through read qc, alignment, base calling, base call filtering and annotation (this is a v simplified example of course).

    about involvement of people, the person who wrote the code must be involved, if possible. it's like a lab method. how can you replicate one’s method without feedback/active help from someone who has performed the expt in the first place? everyone has their own little tweaks and/or favorites. i know folks who are particular about how they hold the pipette inside the hood during cell culture. code writing is no different. debugging without the person who wrote the code, or without his/her active help, is difficult but unavoidable at times, particularly in a place like ours where a lot of short-term students/interns come and go.

    about credit, to me, the person who has debugged the code deserves a co-authorship, not an acknowledgement. biologists often don't consider bioinformatics real science, at least in india, which i can speak about, and think the analytical part of the science is a necessity but not central to the whole story. i beg to differ; i think both are equally imp. thank you. binay panda (