Monday, January 6, 2025

Documenting computational analyses by provenance vs. function

tl;dr: I think it’s time we rethink a lot of how we document computational work. Prompted by AI but also just general increasing complexity of software, we need to move from documenting how something came to be towards documenting what that something is. This more practical form of documentation will allow us to focus our efforts on what matters scientifically.

It has long been held as sacrosanct that proper scientific reporting requires documenting the provenance of any particular output. To translate: if you want to share something—an experimental result, whatever—you have to describe exactly how you did it, every step of the way.

This same sentiment has been applied to computational analyses. Given the potential (and I emphasize potential) to provide an exact record of what was done, it has been a long standing goal to provide code that provides an immutable record of the path from the data to the figures in the paper. But this paradigm has started to seem both less ideal and less practical in the modern software environment, even more so with the advent of large statistical models (“AI”).

The issue is that somewhere along the way, software became a lot more like a living organism than a static entity. Virtually all software depends on a maze of interdependent packages, and despite many attempts, like environments and docker containers and whatever, there’s really no way to avoid the fact that keeping software valid and runnable requires ongoing maintenance work. Machine learning models compound this problem. These models are largely inscrutable, and their black box outputs can vary due to from seemingly minor changes in the prompting or other input. What do we do?

I think the solution is to document based on function. What I mean is that we should focus more on documenting our software by verifying its output than worrying about every parameter that goes into it. For example: in image analysis, a key problem has always been segmentation, meaning how you identify (i.e., circle) cells for quantification. Everybody had their own algorithm and would pass around scripts to document the pipeline. The thing is… nobody really cared all that much about the algorithms, most of which were completely specific to the particular dataset. What we cared about a lot more about (or at least should have cared more about) was the quality of the output. How good was the segmentation? What were the false positives and negatives? What were the failure modes and how might that affect the downstream analysis? I think we would do a lot better trying to focus on that aspect of documenting our science. For instance, with machine learning tools, image analysis has undergone a major transformation, with these models having an uncanny ability to segment cells now and automate analyses that were previously unthinkable. Thing is, people retrain all their own local models, and minor parameters change, and at some point… who cares? It’s wasted effort to keep track of the details, and far more important to know whether the output is right. So let’s document that verification.

Same applies in genomic data analysis. Genomic analyses often depend on a large number of parameters that can vary from dataset to dataset. Documenting these is important, but honestly, I think it’s a bit beside the point. The main thing is not the precise thresholds and parameters that went into your peak-finding algorithm, but rather the plain fact of whether it actually found your peaks correctly.

This discussion may remind you of unit testing, in which you put your software through a suite of tests to make sure each part does the right thing. The whole idea is to verify what the code does and not how it does it. So not a new concept at all.

The use of LLMs is another example of how difficult and, ultimately, futile it is to insist on documentation by provenance. Let’s say I ask ChatGPT to help me figure out the pathway that corresponds to the activity of a list of gene names. Now, maybe I’ll get the same answer if I run it again next week, or maybe not. Does it matter? I don’t think so, as long as the answer is verified as being right.

By the way, experimental documentation often does the same thing wherever possible. Take, for instance, plasmids. Yes, I am old enough to remember reading through methods sections to learn some fun cloning tricks. But mostly… who cares? If I get the plasmid from AddGene, I don’t usually care one bit how the pieces were put together or what kind of prep kit you used. What I care about is the plasmids actual sequence—verification based on function rather than provenance. If you look around, you’ll see that whenever it is possible, people will use this mode of verification, with things like certificates of analysis and whatever. Experienced researchers also know that you can’t trust methods sections. For instance, if you read about a drug at a particular concentration, you typically have to do the dose curve in house. It’s not something shady, just the way it is. Verification by provenance is just what we do when we don’t have any other alternative.

So where does this leave us? A couple ideas:

Visualize and document intermediates. Human or computer verification of intermediate stages of the analysis pipeline. Show the reader that your spot detection algorithm is accurately finding spots, or that your RNA-seq analysis is accurately counting reads.

Journals should focus on software verification rather than just software availability. Lots of published software just plain doesn’t run. I don’t doubt that the software probably did run at some point. It’s just really hard to keep everything up to date. How can the journal verify in some way that the software actually did run and produces reasonable output? I’m not sure. Perhaps every paper must present some kind of battery of tests and the results of their algorithm’s performance in those tests?

Anyway, I don’t know the answers, but I do know that the problem of software validity is a growing problem, and one that is likely to get worse with the increasingly pervasive use of machine learning techniques for which completely documentation of provenance is far less valuable than documenting by function.