RajLab

Monday, January 6, 2025

Documenting computational analyses by provenance vs. function

tl;dr: I think it’s time we rethink a lot of how we document computational work. Prompted by AI but also just general increasing complexity of software, we need to move from documenting how something came to be towards documenting what that something is. This more practical form of documentation will allow us to focus our efforts on what matters scientifically.

It has long been held as sacrosanct that proper scientific reporting requires documenting the provenance of any particular output. To translate: if you want to share something—an experimental result, whatever—you have to describe exactly how you did it, every step of the way.

This same sentiment has been applied to computational analyses. Given the potential (and I emphasize potential) to provide an exact record of what was done, it has been a long-standing goal to publish code that serves as an immutable record of the path from the data to the figures in the paper. But this paradigm has started to seem both less ideal and less practical in the modern software environment, even more so with the advent of large statistical models (“AI”).

The issue is that somewhere along the way, software became a lot more like a living organism than a static entity. Virtually all software depends on a maze of interdependent packages, and despite many attempts, like environments and Docker containers and whatever, there’s really no way to avoid the fact that keeping software valid and runnable requires ongoing maintenance work. Machine learning models compound this problem. These models are largely inscrutable, and their black box outputs can vary due to seemingly minor changes in the prompting or other input. What do we do?

I think the solution is to document based on function. What I mean is that we should focus more on documenting our software by verifying its output than worrying about every parameter that goes into it. For example: in image analysis, a key problem has always been segmentation, meaning how you identify (i.e., circle) cells for quantification. Everybody had their own algorithm and would pass around scripts to document the pipeline. The thing is… nobody really cared all that much about the algorithms, most of which were completely specific to the particular dataset. What we cared a lot more about (or at least should have cared more about) was the quality of the output. How good was the segmentation? What were the false positives and negatives? What were the failure modes and how might that affect the downstream analysis? I think we would do a lot better trying to focus on that aspect of documenting our science. For instance, with machine learning tools, image analysis has undergone a major transformation, with these models having an uncanny ability to segment cells now and automate analyses that were previously unthinkable. Thing is, people retrain all their own local models, and minor parameters change, and at some point… who cares? It’s wasted effort to keep track of the details, and far more important to know whether the output is right. So let’s document that verification.
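To make "document the verification" a bit more concrete, here is a minimal sketch of what such a check could look like for segmentation. Everything in it is an assumption for illustration: the function name, the tiny masks, and the choice of pixel-level false positives, false negatives, and intersection-over-union as the reported numbers. The point is only that the report describes the output, and says nothing about how the mask was produced.

```python
def segmentation_report(predicted, reference):
    """Compare two binary masks (lists of rows of 0/1) pixel by pixel.

    `reference` stands in for a hand-annotated mask; `predicted` is the
    output of whatever segmentation method was used (hypothetical here).
    """
    tp = fp = fn = 0
    for pred_row, ref_row in zip(predicted, reference):
        for p, r in zip(pred_row, ref_row):
            if p and r:
                tp += 1          # pixel correctly called as cell
            elif p and not r:
                fp += 1          # called as cell, but annotated as background
            elif r and not p:
                fn += 1          # annotated as cell, but missed
    union = tp + fp + fn
    return {
        "false_positive_pixels": fp,
        "false_negative_pixels": fn,
        "iou": tp / union if union else 1.0,  # intersection over union
    }


# Toy example: one extra pixel called, one annotated pixel missed.
reference = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
predicted = [
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
report = segmentation_report(predicted, reference)
print(report)
```

A table of numbers like these across a handful of annotated images, plus a few pictures of the failure cases, is the kind of thing I would want to see in a methods section.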

Same applies in genomic data analysis. Genomic analyses often depend on a large number of parameters that can vary from dataset to dataset. Documenting these is important, but honestly, I think it’s a bit beside the point. The main thing is not the precise thresholds and parameters that went into your peak-finding algorithm, but rather the plain fact of whether it actually found your peaks correctly.

This discussion may remind you of unit testing, in which you put your software through a suite of tests to make sure each part does the right thing. The whole idea is to verify what the code does and not how it does it. So not a new concept at all.
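In that spirit, here is a rough sketch of what a function-style check might look like for the peak-finding case. The peak caller below is a deliberately simple toy, and the signal, threshold, and tolerance are made up; none of this is any real tool's API. What matters is the second function, which only asks whether the called peaks match a hand-verified list, whatever parameters were used to get them.

```python
def find_peaks(signal, threshold):
    """Toy peak caller (hypothetical): local maxima above a threshold."""
    peaks = []
    for i in range(1, len(signal) - 1):
        is_local_max = signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]
        if signal[i] > threshold and is_local_max:
            peaks.append(i)
    return peaks


def check_against_known_peaks(called, known, tolerance=1):
    """Return (missed, spurious) relative to hand-verified peak positions."""
    missed = [k for k in known
              if not any(abs(k - c) <= tolerance for c in called)]
    spurious = [c for c in called
                if not any(abs(k - c) <= tolerance for k in known)]
    return missed, spurious


# Made-up signal with two peaks that a person has verified by eye.
signal = [0, 1, 0, 2, 9, 3, 0, 1, 0, 4, 8, 7, 1, 0]
known_peaks = [4, 10]

called = find_peaks(signal, threshold=5)
missed, spurious = check_against_known_peaks(called, known_peaks)
print("missed:", missed, "spurious:", spurious)
```

If someone swaps in a different peak caller or changes the threshold, the check stays exactly the same, which is the whole idea.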

The use of LLMs is another example of how difficult and, ultimately, futile it is to insist on documentation by provenance. Let’s say I ask ChatGPT to help me figure out the pathway that corresponds to the activity of a list of gene names. Now, maybe I’ll get the same answer if I run it again next week, or maybe not. Does it matter? I don’t think so, as long as the answer is verified as being right.

By the way, experimental documentation often does the same thing wherever possible. Take, for instance, plasmids. Yes, I am old enough to remember reading through methods sections to learn some fun cloning tricks. But mostly… who cares? If I get the plasmid from Addgene, I don’t usually care one bit how the pieces were put together or what kind of prep kit you used. What I care about is the plasmid’s actual sequence: verification based on function rather than provenance. If you look around, you’ll see that whenever it is possible, people will use this mode of verification, with things like certificates of analysis and whatever. Experienced researchers also know that you can’t trust methods sections. For instance, if you read about a drug at a particular concentration, you typically have to do the dose curve in house. It’s not something shady, just the way it is. Verification by provenance is just what we do when we don’t have any other alternative.

So where does this leave us? A couple ideas:

Visualize and document intermediates. Human or computer verification of intermediate stages of the analysis pipeline. Show the reader that your spot detection algorithm is accurately finding spots, or that your RNA-seq analysis is accurately counting reads.

Journals should focus on software verification rather than just software availability. Lots of published software just plain doesn’t run. I don’t doubt that the software probably did run at some point. It’s just really hard to keep everything up to date. How can the journal verify in some way that the software actually did run and produces reasonable output? I’m not sure. Perhaps every paper must present some kind of battery of tests and the results of their algorithm’s performance in those tests?

Anyway, I don’t know the answers, but I do know that software validity is a growing problem, and one that is likely to get worse with the increasingly pervasive use of machine learning techniques, for which complete documentation of provenance is far less valuable than documenting by function.

Wednesday, June 5, 2024

Project choice: Lean into your strengths

TL;DR: Projects are not entirely good or bad on their own. They have to match the person doing them: you! Be honest with yourself about what your strengths and passions are. Choose a project that is fundamentally aligned with those strengths. Do NOT choose a project that relies heavily on things you are not intrinsically motivated to do. You may be tempted to pick a project to shore up on weaknesses, but don’t. Any project will have aspects that will require you to work on your weaknesses, but a project that is fundamentally aligned with your weaknesses is going to be an exercise in misery.

One of the most common questions I get from new students is how to choose a scientific project. Clearly a super important part of the scientific process, but one that has had a somewhat magical quality to it, as though there is some magic wand that one waves over a set of Eppendorf tubes to turn them into a preprint that everyone wants to read. Of course, many scientists have some introspection and insight into their thought processes, and while that has largely been passed on by word of mouth, there have been some wonderful recent efforts to describe project ideation (creativity), selection, and execution (see work from Itai Yanai/Martin Lercher, Uri Alon, Michael Fischbach, and probably several others I’m missing with apologies).

But I feel like a lot of this discussion has missed one critical feature: you. As in you, the one actually doing the project. Everyone has different strengths and weaknesses as a scientist, or more relevantly, passions and aversions. In my experience, which I’m sure many have shared, it’s the match between the project and the scientist that matters far more than the project on its own.

Why does it matter so much? Here’s my theory. Academic research is a highly unstructured work environment. It is hard to quantify, on a daily, weekly, or even monthly basis, exactly what constitutes “progress”. As such, it relies very strongly on intrinsic motivation. As in, you really have to want to do something in order to put in the sustained effort required to actually do it, because it is very difficult to quantify progress from the outside to help force you to do things you don’t want to do. It is possible to force yourself to do things you don’t want in the short term, but if you are not fundamentally excited to do something, it is very hard to keep yourself motivated to do it in the medium-to-long term.

What does this mean in practice? I think it’s easier to see how it plays out by looking at common failure modes in person-project matching. One common thing I’ve seen is sometimes people feel like they need to build experimental skills even though they are fundamentally more interested in computational work, so they want to work on a project that has a significant experimental component. Then what happens is some version of the following: “I could do this experiment today, but it’s Thursday at 4pm. I’ll do it tomorrow. Oh wait, it’s Friday, now I should probably wait until Monday” and next thing you know a month goes by and the experiment still hasn’t gotten done. Sometimes, if you take the same person and give them a dataset, they’re like “I just need this analysis to finish running by 4pm, then I can run the next step, oh wouldn’t it be cool if XYZ were true, hold on let me try this…”. It’s hard to ascribe these delays or accelerations to any one particular decision, but in aggregate, they have an enormous compounding effect. Same sort of thing the other way around.

By the way, this doesn’t mean that you shouldn’t try things, especially early on. I worked in a lab for a summer after my first year of math grad school basically as an exercise in getting some exposure to experimental work, even though I thought I’d never EVER do it for my actual thesis work. Turns out I had a true passion for experiments. Been trying to lean into that ever since! But you have to continuously evaluate and be brutally honest with yourself about whether you’re doing what you’re doing because you really like it or because you think you should like it. I’ve found graduate students often get caught in the trap of working on what they think they should like instead of what they actually like.

This same reasoning affects choice of advisor, both graduate and postdoctoral, especially the latter. Pick an advisor who can help you build on your strengths, and not someone who specializes in your weaknesses. This is not to say that you can’t have complementary skills—especially for postdocs, it is often very fruitful to combine your skills from your PhD with a set of techniques in the postdoc lab. But if you join a lab where the advisor is a skilled computationalist but you want to do some cutting edge experiments, it must be done with a lot of care. You want to be sure the rest of the environment is strong, because it will be difficult for your advisor to guide you to innovate at the edge of the field given their own strengths and weaknesses. Not to say it can’t be done, but just that it should be done very carefully.

Anyway, all that to say, when choosing a project, make sure it matches your intrinsic strengths and motivations. Research is already hard enough, work on things you like to do!

Monday, February 5, 2024

Pre-registration in molecular biology

A few years back, perhaps in pre-pandy times, I was on a faculty development panel in which I was one of two presenters. I was of course there to present on how to use Twitter to build your brand (sigh, I’m lame), and a more senior faculty member (I think a neuroscientist) was there to talk about pre-registration in lab work. He was very kind and wise-seeming, and explained how he had been pre-registering their results in the lab for a while, and how it transformed their work.

What is pre-registration? It’s probably most familiar to you in the form of clinical studies, where there was a notorious selection bias in which results would be reported. Like, does drinking coffee cause flatulence? One would have to do a randomized controlled trial to check. But if people did, say, 100 clinical trials and only reported the ones where there was a “positive” result, then you would see 5 clinical trials with p < 0.05 showing that coffee causes flatulence, and none of the contradictory results. So now you have to pre-register a trial, meaning that you have to say, I am going to do this trial with this power and what not, and then you are obligated to report the outcome, no matter what the outcome is. A great idea!
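The arithmetic here is easy to check with a quick simulation. This is just a sketch under stated assumptions: coffee has no real effect in any of the trials, each trial uses a well-behaved test so that its p-value is uniform between 0 and 1, and the variable names and fixed random seed are mine. The expected count is 5 out of 100, though any single simulated batch will bounce around that number.

```python
import random

ALPHA = 0.05      # significance threshold from the example above
N_TRIALS = 100    # number of trials run on a drug (coffee) with no real effect

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumption: under a true null hypothesis, p-values are uniform on [0, 1].
p_values = [rng.random() for _ in range(N_TRIALS)]

reported = [p for p in p_values if p < ALPHA]      # the "positive" results
unreported = [p for p in p_values if p >= ALPHA]   # the file drawer

expected_false_positives = N_TRIALS * ALPHA
print(f"expected 'positive' trials by chance: {expected_false_positives:.0f}")
print(f"simulated 'positive' trials: {len(reported)} of {N_TRIALS}")
print(f"trials nobody would hear about: {len(unreported)}")
```

A reader who only sees the reported list has no way to tell it apart from a real effect, which is exactly the problem pre-registration is meant to address.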

But here was someone advocating for pre-registration much closer to home, in our day to day lab work. I remember being vehemently and vocally opposed. Sure, clinical trials are one thing, with a clearly stated hypothesis and major resources devoted to a single experiment. But in my line of work, where we are constantly trying new experiments and checking out new avenues of work, where there are tons of false leads and new directions? How could that possibly work without gumming up the works in needless bureaucracy? I was vehemently and vocally opposed, to which the senior faculty member just patiently and calmly responded “Sure, I hear you, just think about it”.

Ever since, I keep coming back to that moment, and it has come to have a major effect on how I approach our science—and especially our reporting of it. The key take home point is: if you did an experiment to answer a question, and you don’t have any reason to exclude it based on the experiment itself, then you have to report the results. Repeat: unless there is an independent basis for the exclusion of a result, you have to report the results. Or, to put it another way: if you would have included the data if the result had come out the other way, you have to report it.

Selective reporting of data is a strange issue in molecular biology in that almost everyone agrees that it is wrong and yet the overall culture of the field leans towards selective reporting in so many ways. Here is an example from our own work. In a recent paper, we were trying to confirm the knockdown of a particular protein. We were able to show a convincing knockdown by RNA FISH, but also wanted to show that the protein levels went down. We did a bunch of westerns, but the results came out ambiguously: sometimes we saw an effect and sometimes not (there are reasons that that could be the case, but we didn't confirm those because they were very difficult). The standard thing to do here would be to not report the western results. But there was no reason to exclude the experiment other than being annoyed with the results. So, we reported it.

But again, the cultural standard in molecular biology is often not to report such ambiguous results. I saw this mindset a lot early in my career, back when RNA FISH was considered cool and people wanted our help to add some RNA FISH to their paper to spice it up. There were several times when people came to us with data in support of a, shall we say… “fanciful” hypothesis, and then we would do the RNA FISH, which would basically show the hypothesis was wrong. At which point, the would-be collaborator would beg out, saying that given the “ambiguous” nature of the RNA FISH results, “perhaps we should save the data for the next paper” (which of course never materialized). After enough of these moments, I started asking potential collaborators what stage of their paper they were at, and if they were close to the end, whether they really wanted us to do this experiment. At least one time, when faced with this choice, the person said, uhhh, let’s not!

There have also been many times when we’ve tried following up on work where we are pretty sure there has been a lot of selective reporting of positive results. Let’s just say that that is an unpleasant realization to make.

I want to emphasize that I don’t think that people are being malicious or fraudulent in their work. I think the vast majority of scientists are honest people and are not trying to do something wrong. I just think that science would benefit from having a more transparent reporting of results, because it is sometimes the data that doesn’t fit the narrative that leads to something new in the future. I also don’t necessarily think we need to formally pre-register our work, although it might be an interesting experiment to try. We should just try and shift our culture a bit towards transparent reporting. One potential challenge in doing science this way is that our stories are a lot less likely to be “perfect”. There will almost always be some bits of conflicting evidence, and given our adversarial peer review system, there is seemingly a lot of pressure to keep these conflicting results out. Or is there? We have been doing this for quite a while, and I would say that our experience has been largely fine in the sense that reviewers don’t mind as long as you are transparent about it. I say “largely” because there have definitely been cases in which reviewers point out some issue that we were transparent about and reject our paper because of it. So at least in my experience, I would say that adopting this more transparent reporting of results is not entirely without consequence. All I can say is that if we do decide to make this cultural shift, we also have to be more tolerant of imperfections in the “story” when we put our reviewer hats on.

By the way, I think a lot of people tend to think of selective reporting as a problem of experimental science. Not at all the case! Same goes for every analysis of e.g. some large scale dataset: if you checked for some signal in the data, you have to report the result, regardless of whether the result came out the way you wanted. It’s actually if anything even more of an issue in computational work in some ways, where many hypotheses can be tested with the same data in (relatively) rapid fashion.
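To put a rough number on why the computational case can be worse, here is a small sketch of how quickly the chance of at least one spurious "hit" grows as you test more hypotheses on the same dataset. It assumes the tests are independent and that none of the signals are real, which is a simplification, and the function name is just for illustration.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 10, 50):
    chance = chance_of_false_positive(n_tests)
    print(f"{n_tests:>3} hypotheses checked: "
          f"{chance:.0%} chance of at least one 'significant' result")
```

If only the one hit gets written up, the reader has no way of knowing how many checks it took to find it.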

There is also a bit of a gray area in terms of what to do about false leads. Sometimes, you have an idea that goes in a new direction that has nothing to do with the story of the paper. I don’t know what to do in this case. Certainly, science would be in some ways better for having these results out there, since there was probably (hopefully?) some basis for the experiment or analysis in the first place. But it may just serve to distract from the main thread of the paper, making it harder to follow. I don’t know how best to balance these competing and important principles, but I think it’s an important discussion for us to have.

I’m very curious how people will respond to this discussion. Ultimately, there is no form or checklist that can solve the issues we have in science. Pre-registration sounds like a bureaucratic solution, but in the end, it’s just a call for careful, honest thought about the work we do. I’m sure some people reading this will have a strongly negative reaction, much like I did at first. All I’m saying is “Sure, I hear you, just think about it.” 🙂

Tuesday, September 26, 2023

“Refusing the call” and presenting a scientific story

When scientists present in an informal setting where questions are expected, I always have an internal bet with myself as to how long until some daring person asks the first question, after which everyone else joins in and the questions rapidly start pouring out. This usually happens around the 10 minute mark. This phenomenon has gotten me wondering what this means for how best to structure a scientific talk.

I think this “dam breaking” phenomenon can be best thought of in terms of “refusal of the call”, which is a critical part of the classic hero’s journey in the theory of storytelling. The hero typically is leading some sort of hum-drum existence, until suddenly there is a “call to adventure”. Think Luke Skywalker in Star Wars (Episode IV, of course) when Obi Wan proposes that he go on an adventure to save the galaxy, only for Luke to say “Awww, I hate the empire, but what can I do about it?”. (Related point, Mark Hamill sucks.) Usually, shortly afterwards, the hero will “refuse the call” to adventure—usually from fear or lack of confidence or perhaps just from having common sense. This refusal involves some sort of rejection of the premise of the proposed adventure, which then needs to be overcome.

I think that’s exactly what’s going on in a scientific talk. As Nancy Duarte says, in a presentation, your audience is the hero. You are Obi Wan, presenting the call to adventure (an exciting new idea). And, almost immediately afterward, your audience (the hero) is going to refuse the call, meaning they are going to challenge your premise. In the context of a scientific talk, I think that’s where you have to present some sort of data. Like, I’ve presented you with this cool idea, here’s some preliminary result that gives it some credibility. Then the hero will follow the guide a little further on the adventure.

The mistake I sometimes see in scientific talks is that they let this tension go on for too long. They introduce an idea and then expound on the idea for a while, not providing the relief of a bit of data as the audience is refusing the call. The danger is that the longer the audience's mind runs with their internal criticism, the more it will forever dominate their destiny. Instead, spoon feed it to them slowly. Present an idea. Within a minute, say to the audience “You may be wondering about X. Well here is Y proof.” If you are pacing at their rate of questioning, perhaps a little faster, then they will feel very satisfied.

For instance:

“You may think drug resistance in cancer is caused by genetic mutations and selection. However, what if it is non-genetic in origin? We did sequencing and found no mutations…”

Friday, July 16, 2021

Confusion and credentials in presenting your work

Just listened to a great Planet Money episode in which Dr. Cecelia Conrad describes how she dealt with some horrible racist students in her class who were essentially questioning her credentials. She got the advice from a senior professor to be less clear in her intro class:

This snippet reminded me of some advice I got from my postdoc advisor about giving talks: "You don't want everything to be clear. You should have at least some part of it that is confusing." This advice has really stuck with me through the years, and I have continued to puzzle over it for a long time. Like, it should all be clear, no? I always felt like the measure of success for a presentation should on some level be a monotonically increasing function of its clarity.

But… for a while before the pandemic, I was doing this QR code thing to get feedback after my talks on both degree of clarity and degree of inspiration, and I have to say I feel like I noticed some slight anti-correlation: when I gave a super clear talk, it was seemingly less inspiring, but when I got lower marks for clarity, it was somehow more inspiring. Huh.

Nancy Duarte makes the point that in any presentation, the audience is the hero, and you as the presenter are more like Yoda, the sage who leads the audience on their heroic adventure. Perhaps it is not for nothing that Yoda speaks in wise-seeming syntactically mixed-up babble. Perhaps you have to assert credentials and intellectual dominance at some point in order to inspire your audience? Thoughts on how best to accomplish that goal?

Friday, July 31, 2020

Alternative hypotheses and the Gautham Transform

As I have mentioned several times, having Gautham in the lab really changed how I think about science. In particular, I learned a lot about how to take a more critical approach to science. I think this has made me a far better and more rigorous scientist, and I want to impart those lessons to all members of the lab.

The most important thing I learned from Gautham was to consider alternative hypotheses. I know this sounds like duh, that’s what I learn in my RCR meetings, “expected outcomes and potential pitfalls” sections of grants, and boring classes on how to do science, but I think that’s because we so rarely see how powerful it is in practice. I think it was one of Gautham’s favorite pastimes, and really exemplified his scientific aesthetic (indeed, he was very well known for demonstrating some alternative hypotheses for carrier multiplication, I believe). There were many, many times Gautham proposed alternative hypotheses in our lab, and it was always illuminating. Indeed, one of the main points of his second paper from the lab was about how one could explain “fluctuations between states” by simple population dynamics without any state switching—a whole paper’s worth of alternative hypothesis!

Why do we generally fail to consider alternative hypotheses? One reason is that it’s scary and not fun. Generally, the hypothesis you want to consider is the option that is the fun one. It is scary to contemplate the idea that something fun might turn out to be something boring. (Gautham and I used to joke that the “Gautham Transform” was taking something seemingly interesting and showing that it was actually boring.) The truth of it, though, is that most things are boring. Sure, in biology, there are a lot more surprises than in, say, physics, but there are still far fewer interesting things than are generally claimed. I think that we would all do better to come in with a stronger prior belief that most findings actually have a boring explanation, and a critical implementation of that belief is to propose alternative hypotheses. Keep in mind also that when we are trained, we typically are presented with a list of facts with no alternatives. This manner of pedagogy leaves most of us with very little appreciation for all the wrong turns that comprise science as it’s being made as opposed to the little diagrams in the textbooks.

The other reason we fail to consider alternatives is that it’s a lot of work. It’s always going to be harder to spend time actively thinking of ways to show that your pet theory is incorrect than to think of ways to support it, and so in my experience it takes real effort to come up with plausible alternative hypotheses. Usually, this difficulty manifests as a proclamation of “there’s just no other way it could be!” Thing is… there’s ALWAYS an alternative hypothesis. All models are wrong. You may get to a point where you just get tired, or the alternatives seem too outlandish, but there’s always another alternative to exclude. I remember as we were wrapping up our transcriptional-scaling-with-cell-size manuscript, we got this cool result suggesting that transcription was cut in half upon DNA replication (decrease in burst frequency). I was really into this idea, and Gautham was like, that’s really weird, there must be some other explanation. I was like, I can’t think of one, and I remember him saying “Well, it’s hard, but there has to be something, what you’re proposing is really weird”. So… I spent a couple days thinking about it, and then, voila, an alternative! (The alternative was a global decrease in transcription in S-phase, which Olivia eliminated with a clever experiment measuring transcription from a late-replicating gene.) Point is, it’s hard but necessary work.

(Note: I’m wondering about ways to actively encourage people to consider alternatives on a more regular basis. One suggestion was to stop, say, group meeting somewhere in the middle and just explicitly ask everyone to think of alternatives for a few minutes, then check in. Another option (HT Ben Emert) is to have a lab buddy whose job is to work with you to challenge hypotheses. Anybody have other thoughts?)

So when do you stop making alternatives? I think that’s largely a matter of taste. At some point, you have to stand by a model you propose, exclude as many plausible alternatives as you can, and then acknowledge that there are other possible explanations for what you see that you just didn’t think of. Progress continues, excluding one alternative at a time…

“Hipster” overlay journals

Been thinking a lot about overlay journals and their implications these days. For those who don’t know, an overlay journal is sort of like a “meta-journal” in that it doesn’t formally publish its own papers. Rather, it provides links to other preprints/papers that it thinks are interesting. On some level, the idea is that the true value of a journal is to serve as a filter for what someone thinks is science worth reading so that you don’t have to read every single paper. An overlay journal provides that filter function without the need for the rest of the (costly) trappings of a journal, like peer review and, uhh, color figures ;).

There is one very interesting aspect of an overlay journal that I don’t think has been discussed very much: in contrast with regular journals, they are fundamentally non-exclusive, meaning that ANY overlay journal can in principle “publish” ANY paper. What this non-exclusivity means is that there is no jockeying between journals to publish the “obviously important” papers, which have a perhaps slightly elevated chance of actually being important. You know, like “we sequenced 10x more single cells than the last paper in a fancy journal” kind of papers. If you run an overlay journal, you never have to gaze longingly at those “high impact” papers—if you want to publish it, just add it to your overlay!

What are the consequences of non-exclusivity? Primarily, I think it would serve to diminish the value of “obviously important” papers. Everyone can identify them based on authors and number of genomes sequenced or whatever, so there’s really not that much value in including them per se. It would be like saying “Here’s my playlist, it’s like a copy of the Billboard Top 40”. Nobody’s going to look to your overlay journal for that kind of stuff (which you can readily get from CNS or Twitter). Rather, the real value would be in making lists of papers that are awesome but might otherwise be overlooked, essentially a hipster playlist. As an editor, your cachet would be in your ability to identify these new, cool papers and make Michael Cera-esque mixtapes out of them. Can leave the Hot 100 to Casey Kasem/Spotify algorithms.

Measuring the importance of an overlay journal would also be interesting. Clearly, impact factor is not a useful metric, since anybody can make their impact factor as high as they want by including highly cited papers. I would guess a far more sensible metric would be number of followers of the journal (which makes more sense anyway).

Another interesting aspect of an overlay journal is that it can be retrospective. You could include old papers as well, highlighting old gems that may have been forgotten.

Of course, an interesting question is whether there is any difference between an overlay journal and someone’s Twitter feed. Not sure, actually…

Also, thoughts on existing journals that have hipster qualities to them? I vote Current Biology, my lab votes eLife.