
Monday, January 6, 2025

Documenting computational analyses by provenance vs. function

tl;dr: I think it's time we rethink a lot of how we document computational work. Prompted by AI, but also by the generally increasing complexity of software, we need to move from documenting how something came to be towards documenting what that something is. This more practical form of documentation will allow us to focus our efforts on what matters scientifically.

It has long been held as sacrosanct that proper scientific reporting requires documenting the provenance of any particular output. To translate: if you want to share something—an experimental result, whatever—you have to describe exactly how you did it, every step of the way.

This same sentiment has been applied to computational analyses. Given the potential (and I emphasize potential) to provide an exact record of what was done, it has been a long-standing goal to share code that serves as an immutable record of the path from the data to the figures in the paper. But this paradigm has started to seem both less ideal and less practical in the modern software environment, even more so with the advent of large statistical models ("AI").

The issue is that somewhere along the way, software became a lot more like a living organism than a static entity. Virtually all software depends on a maze of interdependent packages, and despite many attempts, like environments and docker containers and whatever, there's really no way to avoid the fact that keeping software valid and runnable requires ongoing maintenance work. Machine learning models compound this problem. These models are largely inscrutable, and their black-box outputs can vary due to seemingly minor changes in the prompt or other inputs. What do we do?

I think the solution is to document based on function. What I mean is that we should focus more on documenting our software by verifying its output than worrying about every parameter that goes into it. For example: in image analysis, a key problem has always been segmentation, meaning how you identify (i.e., circle) cells for quantification. Everybody had their own algorithm and would pass around scripts to document the pipeline. The thing is… nobody really cared all that much about the algorithms, most of which were completely specific to the particular dataset. What we cared a lot more about (or at least should have cared more about) was the quality of the output. How good was the segmentation? What were the false positives and negatives? What were the failure modes and how might they affect the downstream analysis? I think we would do a lot better trying to focus on that aspect of documenting our science. For instance, with machine learning tools, image analysis has undergone a major transformation, with these models now having an uncanny ability to segment cells and automate analyses that were previously unthinkable. Thing is, people retrain all their own local models, and minor parameters change, and at some point… who cares? It's wasted effort to keep track of the details, and far more important to know whether the output is right. So let's document that verification.
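
To make this concrete, here's a hedged sketch of the kind of verification I mean (the file names and the 0.5 IoU cutoff are made up for illustration, not a prescription): score the segmentation against a small set of hand-annotated cells and record those numbers alongside the paper, rather than recording every parameter of the segmenter itself.

# Sketch: verify segmentation output against a small hand-annotated ground truth.
# File names and the IoU cutoff are hypothetical.
import numpy as np

def match_objects(truth_labels, pred_labels, iou_cutoff=0.5):
    # Count ground-truth cells that overlap some predicted cell with IoU >= cutoff.
    true_ids = [i for i in np.unique(truth_labels) if i != 0]
    pred_ids = [i for i in np.unique(pred_labels) if i != 0]
    hits = 0
    for t in true_ids:
        t_mask = truth_labels == t
        best_iou = 0.0
        for p in pred_ids:
            p_mask = pred_labels == p
            inter = np.logical_and(t_mask, p_mask).sum()
            union = np.logical_or(t_mask, p_mask).sum()
            best_iou = max(best_iou, inter / union)
        if best_iou >= iou_cutoff:
            hits += 1
    return hits, len(true_ids), len(pred_ids)

truth = np.load("handAnnotatedCells.npy")  # hypothetical labeled mask, 0 = background
pred = np.load("modelSegmentation.npy")    # hypothetical output of whatever segmenter we used
hits, n_true, n_pred = match_objects(truth, pred)
print(f"recall: {hits / n_true:.2f} ({hits}/{n_true} annotated cells found)")
print(f"false positives: {n_pred - hits} of {n_pred} predicted cells")

The point is that this little report travels with the paper and stays meaningful even if the segmenter gets swapped out or retrained.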

Same applies in genomic data analysis. Genomic analyses often depend on a large number of parameters that can vary from dataset to dataset. Documenting these is important, but honestly, I think it’s a bit beside the point. The main thing is not the precise thresholds and parameters that went into your peak-finding algorithm, but rather the plain fact of whether it actually found your peaks correctly.
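
Again, a hedged sketch of what "did it actually find the peaks" could look like in practice (the control regions and file names here are hypothetical): check that the peak caller hits a few positive-control regions and stays out of a few negative-control regions, and report those numbers rather than just the parameter settings.

# Sketch: functional check of a peak-calling run.
# The BED-like files and control regions are hypothetical placeholders.
def load_bed(path):
    regions = []
    with open(path) as f:
        for line in f:
            chrom, start, end = line.split()[:3]
            regions.append((chrom, int(start), int(end)))
    return regions

def overlaps(region, peaks):
    chrom, start, end = region
    return any(c == chrom and s < end and e > start for c, s, e in peaks)

peaks = load_bed("calledPeaks.bed")                  # output of the peak caller
positives = load_bed("positiveControlRegions.bed")   # regions we expect to be called
negatives = load_bed("negativeControlRegions.bed")   # regions we expect to be empty

found = sum(overlaps(r, peaks) for r in positives)
spurious = sum(overlaps(r, peaks) for r in negatives)
print(f"positive controls recovered: {found}/{len(positives)}")
print(f"negative controls with peaks: {spurious}/{len(negatives)}")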

This discussion may remind you of unit testing, in which you put your software through a suite of tests to make sure each part does the right thing. The whole idea is to verify what the code does and not how it does it. So not a new concept at all.

The use of LLMs is another example of how difficult and, ultimately, futile it is to insist on documentation by provenance. Let’s say I ask ChatGPT to help me figure out the pathway that corresponds to the activity of a list of gene names. Now, maybe I’ll get the same answer if I run it again next week, or maybe not. Does it matter? I don’t think so, as long as the answer is verified as being right.

By the way, experimental documentation often does the same thing wherever possible. Take, for instance, plasmids. Yes, I am old enough to remember reading through methods sections to learn some fun cloning tricks. But mostly… who cares? If I get the plasmid from AddGene, I don't usually care one bit how the pieces were put together or what kind of prep kit you used. What I care about is the plasmid's actual sequence—verification based on function rather than provenance. If you look around, you'll see that whenever it is possible, people will use this mode of verification, with things like certificates of analysis and whatever. Experienced researchers also know that you can't trust methods sections. For instance, if you read about a drug at a particular concentration, you typically have to do the dose curve in house. It's not something shady, just the way it is. Verification by provenance is just what we do when we don't have any other alternative.

So where does this leave us? A couple ideas:

Visualize and document intermediates. Provide human or computer verification of intermediate stages of the analysis pipeline. Show the reader that your spot detection algorithm is accurately finding spots, or that your RNA-seq analysis is accurately counting reads.

Journals should focus on software verification rather than just software availability. Lots of published software just plain doesn't run. I don't doubt that the software probably did run at some point. It's just really hard to keep everything up to date. How can the journal verify in some way that the software actually runs and produces reasonable output? I'm not sure. Perhaps every paper must present some kind of battery of tests and the results of their algorithm's performance in those tests?

Anyway, I don't know the answers, but I do know that software validity is a growing problem, and one that is likely to get worse with the increasingly pervasive use of machine learning techniques, for which complete documentation of provenance is far less valuable than documentation by function.

Wednesday, August 2, 2017

Figure scripting and how we organize computational work in the lab

Saw a recent Twitter poll from Casey Brown on the topic of figure scripting vs. "Illustrator magic", the former of which is the practice of writing a program to completely generate the figure vs. putting figures into Illustrator to make things look the way you like. Some folks really like programming it all, while I've argued that I don't think this is very efficient, and so arguments go back and forth on Twitter about it. Thing is, I think ALL of us having this discussion here are already way in the right-hand tail in terms of trying to be tidy about our computational work, while many (most?) folks out there haven't ever really thought about this at all and could potentially benefit from a discussion of what an organized computational analysis would look like in practice. So anyway, here's what we do, along with some discussion of why and what the tradeoffs are (including talking about figure scripting).

First off, what is the goal? Here, I'm talking about how one might organize a computational analysis in finalized form for a paper (will touch on exploratory analysis later). In my mind, the goal is to have a well-organized, well-documented, readable and, most importantly, complete and consistent record of the computational analysis, from raw data to plots. This has a number of benefits: 1. it is more likely to be free of mistakes; 2. it is easier for others (including within the lab) to understand and reproduce the details of your analysis; 3. it is more likely to be free of mistakes. Did I mention more likely to be free of mistakes? Will talk about that more in a coming post, but that's been the driving force for me as the analyses that we do in the lab become more and more complex.

[If you want to skip the details and get more to the principles behind them, please skip down a bit.]

Okay, so what we've settled on in lab is to have a folder structured like this (version controlled or Dropboxed, whatever):

[Image: screenshot of the project folder structure described below]

I'll focus on the "paper" folder, which is ultimately what most people care about. The first thing is "extractionScripts". This contains scripts that pull out numbers from data and store them for further plot-making. Let me take this through the example of image data in the lab. We have a large software toolset called rajlabimagetools that we use for analyzing raw data (and that has its own whole set of design choices for reproducibility, but that's a story for another day). That stores, alongside the raw data, analysis files that contain things like spot counts and cell outlines and thresholds and so forth. The extraction scripts pull data from those analysis files and put it into .csv files, which are stored in extractedData. For an analogy with sequencing, this is like maybe taking some form of RNA-seq data and setting up a table of TPM values in a .csv file. Or whatever, you get the point.

plotScripts then contains all the actual plotting scripts. These load the .csv files and run whatever to make graphical elements (like a series of histograms or whatever) and store them in the graphs folder. finalFigures then contains the Illustrator files in which we compile the individual graphs into figures. Along with each figure (like Fig1.ai), we have a Fig1readme.txt that describes exactly what .eps or .pdf files from the graphs folder ended up in, say, Figure 1f (and, ideally, what script). Thus, everything is traceable back from the figure all the way to raw data.

Note: within the extractionScripts is a file called "extractAll.m" and in plotScripts "plotAll.R" or something like that. These master scripts basically pull all the data and make all the graphs, and we rerun these completely from scratch right before submission to make sure nothing changed. Incidentally, of course, each of the folders often has a massive number of subfolders and so forth, but you get the idea.
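
To spell that structure out as a plain tree (names as in the description above; real projects have many more subfolders, and the exact arrangement varies):

paper/
    extractionScripts/   (pulls numbers out of the analysis files; includes a master extractAll.m)
    extractedData/       (.csv files of extracted values)
    plotScripts/         (turns the .csv files into graphical elements; includes a master plotAll.R)
    graphs/              (.eps/.pdf graphical elements produced by the plot scripts)
    finalFigures/        (Illustrator files like Fig1.ai, plus Fig1readme.txt and friends)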

What are the tradeoffs that led us to this workflow? First off, why did we separate things out this way? Back when I was a postdoc (yes, I've been doing various forms of this since 2007 or so), I tried to just arrange things by having a folder per figure. This seemed logical at the time, and has the benefit that the output of the scripts is in close proximity to the script itself (and the figure), but the problem was that figures kept getting endlessly rearranged and remixed, leading to a lot of tedious (and error-prone) rescripting to regain consistency. So now we just pull in graphical elements as needed. This makes things a bit tricky, since for any particular graph it's not immediately obvious what made that graph, but it's usually not too hard to figure out with some simple searching for filenames (and some verbose naming conventions).

The other question is why we have the extraction scripts separated from the plotting scripts. Well, in practice, the raw data is just too huge to distribute easily this way, and if it was all mushed together with the code and intermediates, it would be hard to distribute. But, at least in our case, the more important fact is that most people don't really care about the raw data. They trust that we've probably done that part right, and what they're most interested in are the tables of extracted data. So this way, in the paper folder, we've documented how we pulled out the data while keeping the focus on what most people will be most interested in.

[End of nitty gritty here.]

And then, of course, figure scripting, the topic that brought this whole thing up in the first place. A few thoughts. I get that in principle, scripting is great, because it provides complete documentation, and also because it potentially cuts down on errors. In practice, I think it's hard to efficiently make great figures this way, so we've chosen perhaps a slightly more tedious and error prone but flexible way to make our figures. We use scripts to generate PDFs or EPSs of all relevant graphical elements, typically not spending time to optimize even things like font size and so forth (mostly because all of those have to change so many times in the end anyway). Yes, there is a cost here in terms of redoing things if you end up changing the analysis or plot. Claus Wilke argued that this discourages people from redoing plots, which I think has some truth to it. At the same time, I think that the big problem with figure scripting is that it discourages graphical innovation and encourages people to use lazy defaults that usually suffer from bad design principles—indeed, I would argue it's way too much work currently to make truly good graphics programmatically. Take this example:

[Image: an example figure that would be laborious to generate entirely by script]

Or imagine writing a script for this one:

[Image: another example of a heavily annotated, hand-assembled figure]

Maybe you like or don't like these types of figures, but either way, not only would it take FOREVER to write up a script for these (at least for me), but by the time you've done it, you would probably never build up the courage to remix these figures the dozen or so times we've reworked this one over the course of publication. It's just faster, easier, and more intuitive to do with a tool for, you know, playing with graphical elements, which I think encourages innovation. Also, many forms of labeling of graphs that reduce cognitive burden (like putting text descriptors directly next to the line or histogram that they label) are much easier in Illustrator and much harder to do programmatically, so again, this works best for us. It does, however, also introduce room for human error, and that has happened to us, although I should say that programmatic figures are a typo away from errors as well, and that's happened, too. There is also the option to link figures, and we have done that with images in the past, but in the end, relying on Illustrator to find and maintain links as files get copied around just ended up being too much of a headache.

Note that this is how we organize final figures, but what about exploratory data analysis? In our lab, that ends up being a bit more ad-hoc, although some of the same principles apply. Following the full strictures for everything can get tedious and inhibitory, but one of the main things we try and encourage in the lab is keeping a computational lab notebook. This is like an experimental lab notebook, but, uhh, for computation. Like "I did this, hoped to see this, here's the graph, didn't work." This has been, in practice, a huge win for us, because it's a lot easier to understand human descriptions of a workflow than try and read code, especially after a long time and double especially for newcomers to the lab. Note: I do not think version control and commit messages serve this purpose, because version control is trying to solve a fundamentally different problem than exploratory analysis. Anyway, talked about this computational lab notebook thing before, should write something more about it sometime.

One final point: like I said, one of the main benefits to these sorts of workflows is that they help minimize mistakes. That said, mistakes are going to happen. There is no system that is foolproof, and ultimately, the results will only be as trustworthy as the practitioner is careful. More on that in another post as well.

Anyway, very interested in what other people's workflows look like. Almost certainly many ways to skin the cat, and curious what the tradeoffs are.

Sunday, March 12, 2017

I love Apple, but here are a few problems

First off, I love Apple products. I’ve had only Apple computers for just about 2 decades, and have been really happy to see their products evolve in that time from bold, renegade items to the refined, powerful computers they are today. My lab is filled with Macs, and I view the few PCs that we have to use to run our microscopes with utter disdain. (I’m sort of okay with the Linux workstations we have for power applications, but they honestly don’t get very much use and they’re kind of a pain.)

That said, lately, I’ve noticed a couple problems, and these are not just things like “Apple doesn’t care about Mac software reliability” or “iTunes sucks” or whatever. These are fundamental bets Apple has made, one in hardware and one in software, that I think are showing signs of being misplaced. So I wrote these notes on the off chance that somehow, somewhere, they make their way back to Apple.

One big problem is that Apple’s hardware has lost its innovative edge, mostly because Apple seems disinclined to innovate for various reasons. This has become plainly obvious by watching the undergraduate population at Penn over the last several years. A few years ago, it used to be that a pretty fair chunk of the undergrads I met had MacBook Airs. Like, a huge chunk. It was essentially the standard computer for young people. And rightly so: it was powerful (enough), lightweight, not too expensive, and the OS was clean and let you do all the things you needed to do.

Nowadays, not so much. I'm seeing all these kids with the Surfaces and so forth that are real computers, but with a touch screen/tablet "mode" as well. And here's the thing: even I'm jealous. Now, I'm not too embarrassed to admit that I have read enough Apple commentary on various blogs to get Apple's reasons for not making such a computer. First off, Apple believes that most casual users, perhaps including students, should just be using iPads, and that iOS serves their needs while providing the touch/tablet interface. Secondly, they believe that the touch interface has no place, either ergonomically or in principle, on laptop and desktop Macs. And if you're one of the weird people who somehow needs a touch interface and full laptop capabilities, you should buy both a Mac and an iPad. I'm just realizing now that Apple is just plain wrong on this.

Why don’t I see students with iPads, or an iPad Pro instead of a computer? The reality is that, no matter how much Apple wants to believe it and Apple fans want to rationalize it (typically for “other people”), iOS is just not useful for doing a lot of real work. People want filesystems. People want to easily have multiple windows open, and use programs that just don’t exist on iOS (especially students who may need to install special software for class). The few people I know who have iPad Pros are those who have money to burn on having an iPad Pro as an extra computer, but not as a replacement. The ONLY person I know who would probably be able to work exclusively or even primarily with an iPad is my mom, and even she insists on using what she calls a “real” computer (MacBook Pro).

(Note about filesystems: Apple keeps trying to push this “post-filesystem” world on us, and it just isn’t taking. Philosophical debates aside, here’s a practical example: Apple tried to make people switch away from using “Save As…” to a more versioned system more compatible with the iOS post-filesystem mindset, with commands like “Revert” and “Duplicate”. I tried to buy in, I really did. I memorized all the weird new keyboard shortcuts and kept saying to myself “it’ll become natural any day now”. Never did. Our brains just don’t work that way. And it’s not just me: honestly, I’m the only one in my lab who even understands all this “Duplicate” “Revert” nonsense. The rest of them can’t be bothered—and mostly just use other software without this “functionality” and… Google Drive.)

So you know what would be nice? Having a laptop with a tablet mode/touch screen! Apple’s position is it’s an interface and ergonomic disaster. It’s hard to use interface elements with touch, and it’s hard to use a touch screen on a vertical laptop screen. There are merits to these arguments, but you know what? I see these kids writing notes freehand on their computer, and sketching drawings on their computer, and I really wish I could do that. And no, I don’t want to lug around an iPad to do that and synchronize with my Mac via their stupid janky iCloud. I want it all in one computer. The bottom line is that Surface is cool. Is it as well done as Apple would do it? No. But it does something that I can’t do on an Apple, and I wish I could. Apple is convinced that people don’t want to do those things, and that you shouldn’t be able to do those things. The reality seems to be that people do want to do those things and that it’s actually pretty useful for them. Apple’s mistake is thinking that the reason people bought Apples was for design purity. We bought Apples because they had design functionality. Sometimes these overlap, which has been part of Apple’s genius over the last 15 years, and so you can mistake one for the other. But in the end, a computer is a tool to do things I need.

Speaking of which, the other big problem that Apple has is its approach to cloud computing. I think it's pretty universally acknowledged that Apple's cloud computing efforts suck, and I won't document all that here. Mostly, I've been trying to understand exactly why, and I think that the fundamental problem is that Apple is thinking synchronize while everyone else is thinking synchronous. What does that mean? Apple is stuck in an "upload/download" (i.e., synchronize) mindset from ten years ago while everyone else has moved on to a far more seamless design in which the distinction between cloud and non-cloud is largely invisible. And whatever attempts Apple has made to move to the latter have been pretty poorly executed (although that at least gives hope that they are thinking about it).

Examples abound, and they largely manifest as irritations in using Apple's software. Take, for example, something as simple as the Podcast app on the iPhone, which I use every day when I bike to work (using Aftershokz bone conduction headphones, suhweet, try them!). If I don't pre-download the next podcast, half the time it craps out when it gets to the next episode in my playlist, even though I have cell service the whole way. Why? Because when it gets there, it waits to download the next one before playing, and sometimes gets mixed up during the download. So I end up trying to remember to pre-download them. And then I have to keep an eye on storage space, making sure the app removes old downloads. Why am I even thinking about this nowadays? Why can't it just look at my playlist and make them play seamlessly? Upload/download is an anachronism from the era of synchronize, while most things are moving to synchronous.

Same with AppleTV (sucks) compared to Netflix on my computer, or Amazon on my computer, or HBO, or whatever. They just work without me having to think about the pre-download of the whatever before the movie can start.

I suppose there was a time when this was important for when you were offline. Whatever, I’m writing this in a Google Doc on an airplane without WiFi. And when I get back online, it will all just merge up seamlessly. With careful thought, it can be done. (And yes, I am one of the 8 people alive who has actually used Pages on the web synchronized with Pages on the Mac—not quite there yet, sorry.)

To its credit, I think Apple does sort of get the problem, belatedly. Problem is that when they have tried synchronous, it's not well done. Take the example of iCloud Photos or whatever the hell they call it. One critical new feature that I was excited about was that it will sense if you're running out of space on your device and then delete local copies of old photos, storing just the thumbnails. All your photos accessible, but using up only a bit of space, sounds very synchronous! Problem is that as currently implemented, I have only around 150MB free on my phone and ~1+ GB of space used by Photos. Same on my wife's MacBook Pro: not a lot of HD space, but Photos starts doing this cloud sync only when things are already almost completely full. The problem is that Apple views this whole system as a backup measure to kick in only in emergencies, whereas if they bought into the mentality completely, Photos on my computer would take up only a small fraction of the space it does, freeing up the rest of the computer for everything else I need it to do (you know, with my filesystem). Not to mention that any synchronization and space freeing is completely opaque and happens seemingly at random, so I never trust it. Again, great idea, poor execution.

Anyway, I guess this was marginally more productive than doing the Sudoku in the back of United Magazine, but not particularly so, so I'll stop there. Apple, please get with it, we love you!

Sunday, May 1, 2016

The long tail of artificial narrow superintelligence

As readers of the blog have probably guessed, there is a distinct strain of futurism in the lab, mostly led by Paul, Ally, Ian and me (everyone else mostly just rolls their eyes, but what do they know?). So it was against this backdrop that we had a heated discussion recently about the implications of AlphaGo.

It started with a discussion I had with someone who is an expert on machine learning and knows a bit of Go, and he said that AlphaGo was a huge PR stunt. He said this based on the fact that the way AlphaGo wins is basically by using deep learning to evaluate board positions really well, while doing a huge number of calculations to determine what play to make based on those evaluations. Is that really "thinking"? Here, opinions were split. Ally was strongly in the camp of this being thinking, and I think her argument was pretty valid. After all, how different is that necessarily from how humans play? They probably think up possible places to go and then evaluate the board position. I was of the opinion that this is an entirely different type of thinking from human thinking.

Thinking about it some more, I think perhaps we're both right. Using neural networks to read the board is indeed amazing, and a feat that most thought would not be possible for a while. It's also clear that AlphaGo is doing far more "traditional" brute-force computations of potential moves than Lee Sedol was. The question then becomes how close the neural network part of AlphaGo is to Lee Sedol's intuition, given that the brute-force logic parts are probably tipped far in AlphaGo's favor. This is sort of a hard question to answer, because it's unclear how closely matched they were. I was, perhaps like many, sort of shocked that Lee Sedol managed to win game 4. Was that a sign that they were not so far apart from each other? Or just a weird flukey sucker punch from Sedol? Hard to say. I think the fact that AlphaGo was probably no match for Sedol a few months prior is probably a strong indication that AlphaGo is not radically stronger than Sedol. So my feeling is that Sedol's intuition is still perhaps greater than AlphaGo's, which allowed him to keep up despite such a huge disadvantage in traditional computation power.

Either way, given the trajectory, I’m guessing that within a few months, AlphaGo will be so far superior that no human will ever, ever be able to beat it. Maybe this is through improvements to the neural network or to traditional computation, but whatever the case, it will not be thinking the same way as humans. The point is that it doesn’t matter, as far as playing Go is concerned. We will have (already have?) created the strongest Go player ever.

And I think this is just the beginning. A lot of the discourse around artificial intelligence revolves around the potential for artificial general super-intelligence (like us, but smarter), like a paper-clip making app that will turn the universe into a gigantic stack of paper-clips. I think we will get there, but well before then, I wonder if we’ll be surrounded by so much narrow-sense artificial super-intelligence (like us, but smarter at one particular thing) that life as we know it will be completely altered.

Imagine a world in which there is super-human level performance at various "brain" tasks. What will be the remaining motivation to do those things? Will everything just be a sport or leisure activity (like running for fun)? Right now, we distinguish (perhaps artificially) between what's deemed "important" and what's just a game. But what if we had a computer for proving math theorems or coming up with algorithms, one vastly better than any human? Could you still have a career as a mathematician? Or would it just be one big math olympiad that we do for fun? I'm now thinking that virtually everything humans think is important and do for "work" could be overtaken by "dumb" artificial narrow super-intelligence, well before the arrival of a conscious general super-intelligence. Hmm.

Anyway, for now, back in our neck of the woods, we've still got a ways to go in getting image segmentation to perform as well as humans. But we’re getting closer! After that, I guess we'll just do segmentation for fun, right? :)

Thursday, March 3, 2016

From over-reproducibility to a reproducibility wish-list

Well, it’s clear that that last blog post on over-reproducibility touched a bit of a nerve. ;)

Anyway, a lot of the feedback was rather predictable and not particularly convincing, but I was pointed to this discussion on the software carpentry website, which was actually really nice:

On 2016-03-02 1:51 PM, Steven Haddock wrote:
> It is interesting how this has morphed into a discussion of ways to convince / teach git to skeptics, but I must say I agreed with a lot of the points in the RajLab post.
>
> Taking a realistic and practical approach to use of computing tools is not something that needs to be shot down (people sound sensitive!). Even if you can’t type `make paper` to recapitulate your work, you can still be doing good science…
>
+1 (at least) to both points. What I've learned from this is that many scientists still see cliffs where they want on-ramps; better docs and lessons will help, but we really (really) need to put more effort into usability and interoperability. (Diff and merge for spreadsheets!)

So let me turn this around and ask Arjun: what would it take to convince you that it *was* worth using version control and makefiles and the like to manage your work? What would you, as a scientist, accept as compelling?

Thanks,
Greg

--
Dr Greg Wilson
Director of Instructor Training
Software Carpentry Foundation


First off, thanks to Greg for asking! I really appreciate the active attempt to engage.

Secondly, let me just say that as to the question of what it would take for us to use version control, the answer is nothing at all, because we already use it! More specifically, we use it in places where we think it’s most appropriate and efficient.

I think it may be helpful for me to explain what we do in the lab and how we got here. Our lab works primarily on single cell biology, and our methods are primarily single molecule/single cell imaging techniques and, more recently, various sequencing techniques (mostly RNA-seq, some ATAC-seq, some single cell RNA-seq). My lab has people with pretty extensive coding experience and people with essentially no coding experience, and many in between (I see it as part of my educational mission to try and get everyone to get better at coding during their time in the lab). My PhD is in applied math with a side of molecular biology, during which time we developed a lot of the single RNA molecule techniques that we are still using today. During my PhD, I was doing the computational parts of my science in an only vaguely reproducible way, and that scared me. Like “Hmm, that data point looks funny, where did that come from?”. Thus, in my postdoc, I started developing a little MATLAB "package" for documenting and performing image analysis. I think this is where our first efforts in computational reproducibility began.

When I started in the lab in 2010, my (totally awesome) first student Marshall and I took the opportunity to refactor our image analysis code, and we decided to adopt version control for these general image processing tools. After a bit of discussion, we settled on Mercurial and bitbucket.org because it was supposed to be easier to use than git. This has served us fairly well. Then, my brilliant former postdoc Gautham got way into software engineering and completely refactored our entire image processing pipeline, which is basically what we are using today, and is the version that we point others to use here. Since then, various people have contributed modules and so forth. For this sort of work, version control is absolutely essential: we have a team of people contributing to a large, complex codebase that is used by many people in the lab. No brainer.

In our work, we use these image processing tools to take raw data and turn it into numbers that we then use to hopefully do some science. This involves the use of various analysis scripts that will take this data, perform whatever statistical analysis and so forth on it, and then turn that into a graphical element. Typically, this is done by one or two people in the lab, usually working closely together.

Right around the time Gautham left the lab, we had several discussions about software best practices in the lab. Gautham argued that every project should have a repository for these analysis scripts. He also argued that the commit history could serve as a computational lab notebook. At the time, I thought the idea of a repo for every project was a good one, and I cajoled people in the lab into doing it. I pretty quickly pushed back on the version-control-as-computational-lab-notebook claim, and I still feel that pretty strongly. I think it's interesting to think about why. Version control is a tool that allows you to keep track of changes to code. It is not something that will naturally document what that code does. My feeling is that version control is in some ways a victim of its own success: it is such a useful tool for managing code that it is now widely used and promoted, and as a side effect it is now being used for a lot of things for which it is not quite the right tool for the job, a point I'll come back to.

Fast forward a little bit. Using version control in the repo-for-every-project model was just not working for most people in the lab. To give a sense of what we're doing, in most projects, there's a range of analyses, sometimes just making a simple box-plot or bar graph, sometimes long-ish scripts that take, say, RNA counts per cell and fit to a model of RNA production, extracting model parameters with error bounds. Sometimes it might be something still more complicated.

The issue with version control in this scenario is all the headache. Some remote heads would get forked. Somehow things weren't syncing right. Some other weird issue would come up. Plus, frankly, all the commit/push/pull/update was causing some headaches, especially if someone forgot to push. One student in the lab and I were just working on a large project together, and after bumping into these issues over and over, she just said "screw it, can we just use Dropbox?" I was actually reluctant at first, but then I thought about it a bit more. What were we really losing? As I mention in the blog post, our goal is a reproducible analysis. For this, versioning is at best a means towards this goal, and in practice for us, a relatively tangential means. Yes, you can go back and use earlier versions. Who cares? The number of times we've had to do that in this context is basically zero.

One case people have mentioned as a potential benefit for version control is performing alternative, exploratory analyses on a particular dataset, the idea being you can roll back and compare results. I would argue that version control is not the best way to perform or document this. Let's say I have a script for "myCoolAnalysis". What we do in lab is make "myAlternativeAnalysis" in which we code our new analysis. Now I can easily compare. Importantly, we have both versions around. The idea of keeping the alternative version in version control is I think a bad one: it's not discoverable except by searching the commit log. Let's say I wanted to go back to that analysis in the future. How would I find it? I think it makes much more sense to have it present in the current version of the code than to dig through the commit history. One could argue that you could fork the repo, but then changes to other, unrelated parts of the repo would be hard to deal with. Overall, version control is just not the right tool for this, in my opinion.

Another, somewhat related point that people have raised is looking back to see why some particular output changed. Here, we're basically talking about bugs/flawed analyses. There is some merit to this, and so I acknowledge there is a tradeoff, and that once you get to a certain scale, version control is very helpful. However, I think that for scientific programming at the scale I'm talking about, it's usually fairly clear what caused something to change, and I'm less concerned about why something changed and much more worried about whether we're actually getting the right answer, which is always a question about the code as it stands. For us, the vast majority of the time, we are moving forward. I think the emphasis here would be better placed on teaching people how to test their code (which is a scientific problem more than a programming problem) than on version control.

Which leads me to really answering the question: what would I love to have in the lab? On a very practical level, look, version control is still just too hard and annoying to use for a lot of people and injects a lot of friction into the process. I have some very smart people in my lab, and we all have struggled from time to time. I’m sure we can figure it out, but honestly, I see little impetus to do so for the use cases outlined above, and yes, our work is 100% reproducible without it. Moving (back) to Dropbox has been a net productivity win, allowing us to work quickly and efficiently together. Also, the hassle free nature of it was a real relief. On our latest project, while using version control, we were always asking “oh, did you push that?”, “hmm, what happened?”, “oh, I forgot to update”. (And yes, we know about and sometimes use SourceTree.) These little hassles all add up to a real cognitive burden, and I’m sorry, but it's just a plain fact that Dropbox is less work. Now it’s just “Oh, I updated those graphs”, “Looks great, nice!”. Anyway, what I would love is Dropbox with a little bit more version tracking. And Dropbox does have some rudimentary versioning, basically a way to recover from an "oh *#*$" moment–the thing I miss most is probably a quick diff. Until this magical system emerges, though, on balance, it is currently just more efficient for us not to use version control for this type of computational work. I posit that the majority of people who could benefit from some minimal computational reproducibility practices fall into this category as well.

Testing: I think getting people in the habit of testing would be a huge move in the right direction. And I think this means scientific code testing, not just “program doesn’t crash” testing. When I teach my class on molecular systems biology, one of my secret goals is to teach students a little bit about scientific programming. For those who have some programming experience, they often fall into the trap of thinking “well, the program ran, so it must have worked”, which is often fine for, say, a website or something, but it’s usually just the beginning of the story for scientific programming and simulations. Did you look for the order of convergence (or convergence at all)? Did you look for whether you’re getting the predicted distribution in a well-known degenerate case? Most people don’t think about programming that way. Note that none of this has anything to do with version control per se.
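
To give a flavor of what I mean by scientific testing, here's a hedged little sketch (not from any real lab project): a simple Gillespie-style simulation of constitutive RNA production and degradation should give Poisson-distributed counts at steady state, so test for that known degenerate case rather than settling for "it ran without crashing."

# Sketch: a scientific test for a toy stochastic simulation of RNA production.
# Constitutive production (rate k) with first-order degradation (rate gamma) has a
# Poisson steady state with mean k/gamma -- a known case we can check against.
import random

def simulate_cell(k=10.0, gamma=1.0, t_end=50.0):
    # Gillespie simulation of one cell; returns the RNA count at t_end.
    t, n = 0.0, 0
    while True:
        birth, death = k, gamma * n
        total = birth + death
        t += random.expovariate(total)
        if t > t_end:
            return n
        if random.random() < birth / total:
            n += 1
        else:
            n -= 1

def test_poisson_steady_state():
    random.seed(0)
    counts = [simulate_cell() for _ in range(2000)]
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
    assert abs(mean - 10.0) < 0.5, f"mean {mean:.2f} is far from k/gamma = 10"
    assert abs(var / mean - 1.0) < 0.15, f"Fano factor {var / mean:.2f} is not ~1"

test_poisson_steady_state()
print("steady-state counts look Poisson, as they should")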

On a bigger level, I think the big unmet need is that of a nice way to document an analysis as it currently stands. Gautham and I had a lot of discussions about this when he was in lab. What would such documentation do? Ideally, it would document the analysis in a searchable and discoverable way. This was something Gautham and I discussed at length and didn't get around to implementing. Here's one idea we were tossing around. Let's say that you kept your work in a directory tree structure, with analyses organized by subfolder. Like, you could keep that analysis of H3K4me3 in "histoneModificationComparisons/H3K4me3/", then H3K27me3 in "histoneModificationComparisons/H3K27me3/". In each directory, you have the scripts associated with a particular analysis, and then running those scripts produces an output graph. That output graph could either be stored in the same folder or in a separate "graphs" subfolder. Now, the scripts and the graphs would have metadata (not sure what this would look like in practice), so you could have a script go through and quickly generate a table of contents with links to all these graphs for easy display and search. Perhaps this is similar to those IPython notebooks or whatever. Anyway, the main feature is that this would make all those analyses (including older ones that don't make it into the paper) discoverable (via tagging/table of contents) and searchable (search: "H3K27"). For me, this would be a really helpful way to document an analysis, and would be relatively lightweight and would fit into our current workflow. Which reminds me: we should do this.
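
We never built it, but a hedged sketch of the kind of script I have in mind could be as simple as this (the folder layout and the notes.txt metadata file are hypothetical): walk the analysis tree, grab every graph plus whatever notes sit next to it, and dump out a searchable HTML table of contents.

# Sketch: generate a table of contents for analyses and their graphs.
# Assumes analyses live in subfolders, each optionally containing a notes.txt
# with a one-line description or tags; all names here are hypothetical.
import os

def build_toc(root="analyses", out="tableOfContents.html"):
    rows = []
    for dirpath, dirnames, filenames in os.walk(root):
        graphs = [f for f in filenames if f.endswith((".pdf", ".eps", ".png"))]
        if not graphs:
            continue
        notes = ""
        notes_path = os.path.join(dirpath, "notes.txt")
        if os.path.exists(notes_path):
            with open(notes_path) as f:
                notes = f.read().strip()
        for g in graphs:
            link = os.path.join(dirpath, g)
            rows.append(f"<tr><td>{dirpath}</td><td><a href='{link}'>{g}</a></td><td>{notes}</td></tr>")
    with open(out, "w") as f:
        f.write("<table><tr><th>analysis</th><th>graph</th><th>notes/tags</th></tr>\n")
        f.write("\n".join(rows))
        f.write("\n</table>")

build_toc()

Searching for "H3K27" is then just a text search (or a browser find) over that one file.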

I also think that a lot of this discussion is really sort of veering around the simple task of keeping a computational lab notebook. This is basically a narrative about what you tried, what worked, what didn’t work, and how you did it, why you did it, and what you learned. I believe there have been a lot of computational lab notebook attempts out there, from essentially keyloggers on up, and I don’t know of any that have really taken off. I think the main thing that needs to change there is simply the culture. Version control is not a notebook, keylogging is not a notebook, the only thing that is a notebook is you actually spending the time to write down what you did, carefully and clearly–just like in the lab. When I have cajoled people in the lab into doing this, the resulting documents have been highly useful to others as how-to guides and as references. There have been depressingly few such documents, though.

Also, seriously, let's not encourage people to use version control for maintaining their papers. This is just about the worst way to sell version control. Unless you're doing some heavy math with LaTeX or working with a very large document, Google Docs or some equivalent is the clear choice every time, and it will be impossible to convince me otherwise. Version control is a tool for maintaining code. It was never meant for managing a paper. Much better tools exist. For instance, Google Docs excels at easy sharing, collaboration, simultaneous editing, commenting and reply-to-commenting. Sure, one can approximate these using text-based systems and version control. The question is why anyone would like to do that. Not everything you do on a computer maps naturally to version control.

Anyway, that ended up being a pretty long response to what was a fairly short question, but I also just want to reiterate that I find it reassuring that people like Greg are willing to listen to these ramblings and hopefully find something positive from it. My lab is really committed to reproducible computational analyses, and I think I speak for many when I describe the challenges we and others face in making it happen. Hopefully this can stimulate some new discussion and ideas!

Sunday, February 28, 2016

From reproducibility to over-reproducibility

[See also follow up post.]

It's no secret that biomedical research is requiring more and more computational analyses these days, and with that has come some welcome discussion of how to make those analyses reproducible. On some level, I guess it's a no-brainer: if it's not reproducible, it's not science, right? And on a practical level, I think there are a lot of good things about making your analysis reproducible, including the following (vaguely ranked starting with what I consider most important):
  1. Umm, that it’s reproducible.
  2. It makes you a bit more careful about making your code more likely to be right, cleaner, and readable to others.
  3. This in turn makes it easier for others in the lab to access and play with the analyses and data in the future, including the PI.
  4. It could be useful for others outside the lab, although as I’ve said before, I think the uses for our data outside our lab are relatively limited beyond the scientific conclusions we have made. Still, whatever, it’s there if you want it. I also freely admit this might be more important for people who do work other people actually care about. :)
Balanced against these benefits, though, is a non-negligible negative:
  1. It takes a lot of time.
On balance, I think making things as reproducible as possible is time well spent. In particular, it's time that could be well spent by the large proportion of the biomedical research enterprise that currently doesn't think about this sort of thing at all, and I think it is imperative for those of us with a computational inclination to help train others to make their analyses reproducible.

My worry, however, is that the strategies for reproducibility that computational types are often promoting are off-target and not necessarily adapted for the needs and skills of the people they are trying to reach. There is a certain strain of hyper-reproducible zealotry that I think is discouraging others from adopting some basic practices that could greatly benefit their research, and at the same time is limiting the productivity of even its own practitioners. You know what I'm talking about: it's the idea of turning your entire paper into a program, so you just type "make paper" and out pops the fully formed and formatted manuscript. Fine in the abstract, but in a line of work (like many others) in which time is our most precious commodity, these compulsions represent a complete failure to correctly measure opportunity costs. In other words, instead of hard coding the adjustment of the figure spacing of your LaTeX preprint, spend that time writing another paper. I think it's really important to remember that our job is science, not programming, and if we focus too heavily on the procedural aspects of making everything reproducible and fully documented, we risk turning off those who are less comfortable with programming from the very real benefits of making their analysis reproducible.

Here are the two biggest culprits in my view: version control and figure scripting.

Let's start with version control. I think we can all agree that the most important part of making a scientific analysis reproducible is to make sure the analysis is in a script and not just typed or clicked into a program somewhere, only for those commands to vanish into faded memory. A good, reproducible analysis script should start with raw data, go through all the computational manipulations required, and leave you with a number or graphical element that ends up in your paper somewhere. This makes the analysis reproducible, because someone else can now just run the code and see how your raw data turned into that p-value in subpanel Figure 4G. And remember, that someone else is most likely your future self :).
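
For what it's worth, such a script doesn't need to be fancy. A hedged sketch (the file name, columns, and statistical test are all made up for illustration) of the raw-data-to-paper-number chain:

# Sketch: a minimal "data in, paper number out" analysis script.
# The .csv file, its columns, and the choice of test are hypothetical.
import csv
from scipy import stats

control, treated = [], []
with open("expressionPerCell.csv") as f:
    for row in csv.DictReader(f):
        value = float(row["expression"])
        (control if row["condition"] == "control" else treated).append(value)

# This is the number that ends up in the figure legend.
result = stats.mannwhitneyu(control, treated)
print(f"n = {len(control)} control cells, {len(treated)} treated cells")
print(f"Mann-Whitney p = {result.pvalue:.3g}")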

Okay, so we hopefully all agree on the need for scripts. Then, however, almost every discussion about computational reproducibility begins with a directive to adopt git or some other version control system, as though it’s the obvious next step. Hmm. I’m just going to come right out and say that for the majority of computational projects (at least in our lab), version control is a waste of time. Why? Well, what is the goal of making a reproducible analysis? I believe the goal is to have a documented set of scripts that take raw data and reliably turn it into a bit of knowledge of some kind. The goal of version control is to manage code, in particular emphasizing “reversibility, concurrency, and annotation [of changes to code]”. While one can imagine some overlap between these goals, I don’t necessarily see a natural connection between them. To make that more concrete, let’s try to answer the question that I’ve been asking (and been asked), which is “Why not just use Dropbox?”. After all, Dropbox will keep all your code and data around (including older versions), shared between people seamlessly, and probably will only go down if WWIII breaks out. And it's easy to use. Here are a few potential arguments I can imagine people might make in favor of version control:
  1. You can avoid having Fig_1.ai, Fig_1_2.ai, Fig_1_2_final_AR_PG_JK.ai, etc. Just make the change and commit! You have all the old versions!
  2. You can keep track of who changed what code and roll things back (and manage file conflicts).
Well, to point 1, I actually think that there’s nothing really wrong with having all these different copies of a file around. It makes it really easy to quickly see what changed between different versions, which is especially useful for binary files (like Illustrator files) that you can’t run a diff on. Sure, it’s maybe a bit cleaner to have just one Fig_1.ai, but in practice, I think it’s actually less useful. In our lab, we haven’t bothered doing that, and it’s all worked out just fine.

Which brings us then to point 2, about tracking code changes. In thinking about this, I think it’s useful to separate out code that is for general purpose tools in the lab and code that is specific for a particular project. For code for general purpose tools that multiple team members are contributing to, version control makes a lot of sense–that’s what it was really designed for, after all. It’s very helpful to see older versions of the codebase, see the exact changes that other members of the team have made, and so forth.

These rationales don’t really apply, though, to code that people will write for analyzing data for a particular project. In our lab, and I suspect most others, this code is typically written by one or two people, and if two, they’re typically working in very close contact. Moreover, the end goal is not to have a record of a shifting codebase, but rather to have a single, finalized set of analysis scripts that will reproduce the figures and numbers in the paper. For this reason, the ability to roll back to previous versions of the code and annotate changes is of little utility in practice. I asked around lab, and I think there was maybe one time when we rolled back code. Otherwise, basically, for most analyses for papers, we just move forward and don’t worry about it. I suppose there is theoretically the possibility that some old analysis could prove useful that you could recover through version control, but honestly, most of the time, that ends up in a separate folder anyway. (One might say that’s not clean, but I think that it’s actually just fine. If an analysis is different in kind, then replacing it via version control doesn’t really make sense–it’s not a replacement of previous code per se.)

Of course, one could say, well, even if version control isn’t strictly necessary for reproducible analyses, what does it hurt? In my opinion, the big negative is the amount of friction version control injects into virtually every aspect of the analysis process. This is the price you pay for versioning and annotation, and I think there’s no way to get around that. With Dropbox, I just stick a file in and it shows up everywhere, up to date, magically. No muss, no fuss. If you use version control, it’s constant committing, pushing, pulling, updating, and adding notes. Moreover, if you’re like me, you will screw up at some point, leading to some problem, potentially catastrophic, that you will spend hours trying to figure out. I’m clearly not alone:
“Abort: remote heads forked” anyone? :) At that point, we all just call over the one person in lab who knows how to deal with all this crap and hope for the best. And look, I'm relatively computer savvy, so I can only imagine how intimidating all this is for people who are less computer savvy. The bottom line is that version control is cumbersome, arcane and time-consuming, and most importantly, doesn't actually contribute much to a reproducible computational analysis. If the point is to encourage people who are relatively new to computation to make scripts and organize their computational results, I think directing them to adopt version control is a very bad idea. Indeed, for a while I was making everyone in our lab use version control for their projects, and overall, it has been a net negative in terms of time. We switched to Dropbox for a few recent projects and life is MUCH better–and just as reproducible.

Oh, and I think there are some people who use version control for the text of their papers (almost certainly a proper subset of those who are for some reason writing their papers in Markdown or LaTeX). Unless your paper has a lot of math in it, I have no idea why anyone would subject themselves to this form of torture. Let me be the one to tell you that you are no less smart or tough if you use Google Docs. In fact, some might say you’re more smart, because you don’t let command-line ethos/ideology get in the way of actually getting things done… :)

Which brings me to the example of figure scripting. Figure scripting is the process of making a figure completely from a script. Such a script will make all the subpanels, adjust all the font sizes, deal with all the colors, and so forth. In an ideal world with infinite time, this would be great–who wouldn't want to make all their figures magically appear by typing make figures? In practice, there are definitely some diminishing returns, and it's up to you where the line is between making it reproducible and getting it done. For me, the hard line is that all graphical elements representing data values should be coded. Like, if I make a scatterplot, then the locations of the points relative to the axes should come from code. Beyond that, Illustrator time! Illustrator will let you set the font size, the line weighting, marker color, and virtually every other thing you can think of simply and relatively intuitively, with immediate feedback. If you can set your font sizes and so forth programmatically, more power to you. But it's worth keeping in mind that the time you spend programming these things is time you could be spending on something else. This time can be substantial: check out this lengthy bit of code written to avoid a trip to Illustrator. Also, the greater the complexity of what you're trying to do, the fewer packages there are to help you make your figure. For instance, consider this figure from one of Marshall's papers:

[Image: figure from one of Marshall's papers, with gradient bars, lines, and annotations]

Making gradient bars and all the lines and annotations would be a nightmare to do via script (and this isn't even very complicated). Yes, if you decide to make a change, you will have to redo some manual work in Illustrator, hence the common wisdom to make it all in scripts to "save time redoing things". But given the amount of effort it takes to figure out how to code that stuff, nine times out of ten, the total amount of time spent just redoing it will be less. And in a time when nobody reads things carefully, adding all these visual elements to your paper to make it easier to explain your work quickly is a strong imperative–stronger than making sure it all comes from a script, in my view.
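
To make the "hard line" from above concrete, here's a hedged sketch of what one of our plot scripts conceptually does (hypothetical file names, matplotlib): put the data-driven marks on the page, save a vector file, and stop. Fonts, labels, and layout all get dealt with later in Illustrator.

# Sketch: generate just the data-driven graphical element, then stop.
# Cosmetics (fonts, labels, annotations) are handled later in Illustrator.
import csv
import matplotlib.pyplot as plt

x, y = [], []
with open("spotCountsPerCell.csv") as f:  # hypothetical extracted data
    for row in csv.DictReader(f):
        x.append(float(row["geneA"]))
        y.append(float(row["geneB"]))

fig, ax = plt.subplots(figsize=(3, 3))
ax.scatter(x, y, s=8)  # point positions come from the data, nothing hand-placed
fig.savefig("geneAvsGeneB.pdf")  # vector output for assembly in Illustrator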

Anyway, all that said, what do we actually do in the lab? Having gone through a couple iterations, we've basically settled on the following. We make a Dropbox folder for the paper, and within the folder, we have subfolders, one for raw(ish) data, one for scripts, one for graphs and one for figures (perhaps with some elaborations depending on circumstances). In the scripts folder is a set of, uh, scripts that, when run, take the raw(ish) data and turn it into the graphical elements. We then assemble those graphical elements into figures, along with a readme file to document which files went into the figure. Those figures can contain heavily altered versions of the graphical elements, and we will typically adjust font sizes, ticks, colors, you name it, but if you want to figure out why some data point was where it was, the chain is fully accessible. Then, when we're done, we put the files all into bitbucket for anyone to access.

Oh, and one other thing about permanence: our scripts use some combination of R and MATLAB, and they work for now. They may not work forever. That's fine. Life goes on, and most papers don't. Those that do do so because of their scientific conclusions, not their data or analysis per se. So I'm not worried about it.

Update, 3/1/2016: Pretty predictable pushback from a lot of people, especially about version control. First, just to reiterate, we use version control for our general purpose tools, which are edited and used by many people, thus making version control the right tool for the job. Still, I have yet to hear any truly compelling arguments for version control that would outweigh the substantial associated complexity for the use case I am discussing here, which is making the analyses in a paper reproducible. There are a lot of bald assertions of the benefits of version control out there without any real evidence for their validity other than "well, I think this should be better", and with little frank discussion of the hassles of version control. This strikes me as similar to the pushback against the LaTeX vs. Word paper. Evidence be damned! :)

Friday, January 22, 2016

Thoughts on the NEJM editorial: what’s good for the (experimental) goose is good for the (computational) gander

Huge Twitter explosion about this editorial in the NEJM about “research parasites”. Basically, the authors say that computational people interested in working with someone else’s data should work together with the experimenters (which, incidentally, is how I would approach something like that in most cases). Things get a bit darker (and perhaps more revealing) when they also call out “research parasites”–aka “Mountain Dew chugging computational types”, to paraphrase what I’ve heard elsewhere–who to them are just people sitting around, umm, chugging Mountain Dew while banging on their computers, stealing papers from those who worked so hard to generate these datasets.

So this NEJM editorial is certainly wrong on many counts, and I think that most people have that covered. Not only that, but it is particularly tone-deaf: “… or even use the data to try to disprove what the original investigators had posited.” Seriously?!?

The response has been particularly strong from the computational genomics community, who are often reliant on other people’s data. Ewan Birney had a nice set of Tweets on the topic, first noting that “For me this is the start of clinical research transitioning from a data limited to an analysis limited world.”, noting further that “This is what mol. biology / genomics went through in the 90s/00s and it’s scary for the people who base their science on control of data.” True, perhaps.

He then goes on to say: “1. Publication means... publication, including the data. No ifs, no buts. Patient data via restricted access (bonafide researcher) terms.”

Agreed, who can argue with that! But let’s put this chain of reasoning together. If we are moving to an “analysis limited world”, then it is the analyses that are the precious resource. And all the arguments for sharing data are just as applicable to sharing analyses, no? Isn’t the progress of science impeded by people not sharing their analyses? This is not just an abstract argument: for example, we have been doing some ATAC-seq experiments in the lab, and we had a very hard time finding out exactly how to analyze that data, because there was no code out there for how to do it, even in published papers (for the record, Will Greenleaf has been very kind and helpful via personal communication, and this has been fine for us).

What does, say, Genome Research have to say about it? Well, here’s what they say about data:
Genome Research will not publish manuscripts where data used and/or reported in the paper is not freely available in either a public database or on the Genome Research website. There are no exceptions.
Uh, so that’s pretty explicit. And here’s what they say about code:
Authors submitting papers that describe or present a new computer program or algorithm or papers where in-house software is necessary to reproduce the work should be prepared to make a downloadable program freely available. We encourage authors to also make the source code available.
Okay, so only if there’s some novel analysis, and then only if you want to or if someone asks you. Probably via e-mail. To which someone may or may not respond. Hmm, kettle, the pot is calling…

So what happens in practice at Genome Research? I took a quick look at the first three papers from the current TOC (1, 2, 3).

The first paper has a “Supplemental PERL.zip” that contains some very poorly documented code in a few files and, as far as I can tell, is missing a file called “mcmctree_copy.ctl” that I’m guessing is pretty important to running the mcmctree algorithm.

The third paper is perhaps the best, with a link to a software package that seems fairly well put together. But still, no link to the actual code to make the actual figures in the paper, as far as I can see, just “DaPars analysis was performed as described in the original paper (Masamha et al. 2014) by using the code available at https://code.google.com/p/dapars with default settings.”

The second paper has no code at all. They have a fairly detailed description of their analysis in the supplement, but again, no actual code I could run.

Aren’t these the same things we’ve been complaining about in experimental materials and methods forever? First paper: missing steps of a protocol? Second paper: vague prescription referencing previous paper and a “kit”? Third paper: just a description of how they did it, just like, you know, most “old fashioned” materials and methods from experimental biology papers.

Look, trust me, I understand completely why this is the case in these papers, and I’m not trying to call these authors out. All I’m saying is that if you’re going to get on your high horse and say that data is part of the paper and must be distributed, no ifs, no buts, well, then distribute the analyses as well–and I don’t want to hear any ifs or buts. If we require authors to deposit their sequence data, then surely we can require that they upload their code. Where is the mandate for depositing code on the journal website?

Of course, in the real world, there are legitimate ifs and buts. Let me anticipate one: “Our analyses are so heterogeneous, and it’s so complicated for us to share the code in a usable way.” I’m actually very sympathetic to that. Indeed, we have lots of data that is very heterogeneous and hard to share reasonably–for anyone who really believes all data MUST be accessible, well, I’ve got around 12TB of images for our next paper submission that I would love for you to pay to host… and that probably nobody will ever use. Not all science is genomics, and what works in one place won’t necessarily make sense elsewhere. (As an aside, in computational applied math, many people keep their codes secret to avoid “research parasites”, so it’s not just data gatherers who feel threatened.)

Where, might you ask, is the moral indignation on the part of our experimental colleagues complaining about how computational folks don’t make their codes accessible? First off, I think many of these folks are in fact annoyed (I am, for instance), but are much less likely to be on Twitter and the like. Secondly, I think that many non-computational folks are brow-beaten by p-value toting computational people telling them they don’t even know how to analyze their own data, leading them to feel like they are somehow unable to contribute meaningfully in the first place.

So my point is, sure, data should be available, but let’s not all be so self-righteous about it. Anyway, there, I said it. Peace. :)

PS: Just in case you were wondering, we make all our software and processed data available, and our most recent paper has all the scripts to make all the figures–and we’ll keep doing that moving forward. I think it's good practice; my point is just that reasonable people could disagree.

Update: Nice discussion with Casey Bergman in the comments.
Update (4/28/2016): Fixed links to Genome Research papers (thanks to Quaid Morris for pointing this out). Also, Quaid pointed out that I was being unreasonable, and that 2/3 actually did provide code. So I looked at the next 3 papers from that issue (4, 5, 6). Of these, none of them had any code provided. For what it's worth, I agree with Quaid that it is not necessarily reasonable to require code. My point is that we should be reasonable about data as well.

Monday, December 28, 2015

Is all of Silicon Valley on a first name basis?

One very annoying software trend I've noticed in the last several years is the use of just first names in software. For instance, iOS shows first names only in messages. Google Inbox has tons of e-mail conversations involving me and someone named "John". Also, my new artificially intelligent scheduling assistant (which is generally awesome) will put appointments with "Jenn" on my calendar. Hmm. For me, those variables need a namespace.

I'm assuming this is all in some effort to make software more friendly and conversational, and it demos great for some Apple exec to say "Ask Tim if he wants to have lunch on Wednesday" into his phone and have it automatically know they meant Tim Cook. Great, but in my professional life (and, uh, I'm guessing maybe Tim Cook's also), I interact with a pretty large number of people, some only occasionally, making this first name only convention pretty annoying.

Which makes me wonder if the logical next step is just to refer to people by their e-mail or Twitter. I'm sure that would generate a lot of debate as to which is the identifier of choice, but I'm guessing that ORCID is probably not going to be it. :)

Sunday, December 20, 2015

Impressions from a couple weeks with my new robo-assistant, Amy Ingram

Like many, I both love the idea of artificial intelligence and hate spending time on logistics. For that reason, I was super excited to hear about X.ai, which is some startup in NYC that makes an artificially intelligent scheduler e-mail bot. It takes care of this problem (e-mail conversation):
“Hey Arjun, can we meet next week to talk about some cool project idea or another?”
“Sure, let’s try sometime late next week. How about Thursday 2pm?”
“Oh, actually, I’ve got class then, but I’m free at 3pm.”
“Hmm, sorry, I’ve got something else at 3pm. Maybe Friday 1pm?”
“Unfortunately I’m out of town on Friday, maybe the week after?”
“Sure, what about Tuesday?”
“Well, only problem is that…”
And so on. X.ai’s solution works like this:
“Hey Arjun, can we meet next week to talk about some cool project idea or another?”
“Sure, let’s try sometime late next week. I’m CCing my assistant Amy, who will find us a time.”
And that’s it! Amy will e-mail back and forth with whoever wrote to me and find a time to meet that fits us both, putting it straight on my calendar without me having to lift another finger. Awesome.

So how well does it work? Overall, really well. It took a bit of finagling at first to make sure that my calendar was appropriately set up (like making sure I’m set to “available” even if my calendar has an all day event) and that Amy knew my preferences, but overall, out of the several meetings attempted so far, only one of them got mixed up, and to be fair, it was a complicated one involving multiple parties and some screw-ups on my part due to it being the very first meeting I scheduled with Amy. Overall, Amy has done a great job removing scheduling headaches from my life–indeed, when I analyzed a week in my e-mail, I was surprised how much was spent on scheduling, and so this definitely reduces some overhead. Added benefit: Amy definitely does not drop the ball.

One of the strangest things about using this service so far has been my psychological responses to working with it (her?). Perhaps the most predictable one was that I don’t feel like a “have your people call my people” kind of person. I definitely feel a bit uncomfortable saying things like “I’m CCing my assistant who will find us a time”, like I’m some sort of Really Busy And Important Person instead of someone who teaches a class and jokes around with twenty-somethings all day. Perhaps this is just a bit of misplaced egalitarian/lefty anxiety, or imposter syndrome manifesting itself as a sense that I don’t deserve admin support, or the fact that I’m pretty sure I’m not actually busy enough to merit real human admin support. Anyway, whatever, I just went for it.

So then this is where it starts getting a bit weird. So far, I haven’t been explicitly mentioning that Amy is a robot in my e-mails (like “I’m CCing my robo-assistant Amy…”). That said, for the above reasons of feeling uncomfortably self-important, I actually am relieved when people figure out that it’s a robot, since it somehow seems a bit less “one-percenty”. So why didn't I just say she’s a robot right off the bat? To be perfectly honest, when I really think about it, it’s because I didn't want to hurt her feelings! It’s so strange. Other examples: for the first few meetings, Amy CCs you on the e-mail chain so you can see how she handles it. I felt a strong compulsion to write saying “Thank you!” at the end of the exchange. Same when I write to her to change preferences. Like
“Amy, I prefer my meetings in the afternoon.”
“Okay, I have updated your preferences as follows…”
… “Thank you?!?!?”
Should I bother with the formalities of my typical e-mails, with a formal heading and signature? I think I’ve been doing it, even though it obviously (probably?) doesn’t matter.

Taking it a bit further, should I be nice? Should I get angry if she messes something up? Will my approval or frustration even register? Probably not, I tell myself. But then again, what if it’s part of her neural network to detect feelings of frustration? Would her network change the algorithms somewhat in response? Is that what I would want to happen? I just don’t know. I have to say that I had no idea that this little experiment would have me worrying about the intricacies of human/AI relations.

In some sense, then, I was actually a bit relieved at the outcome of the following exchange. As a test, Sara worked with Amy to set up an appointment for us to get a coffee milkshake (inside joke). She then told Amy to tell me that I should wear camouflage to the appointment, a point that Amy dutifully relayed to me:
Hi Arjun,
I just wanted to pass along this message I received from Sara. It doesn’t look like it’s a message I can provide an answer to, so I suggest you follow up with Sara directly.
---
Thanks, Amy! 2 o'clock would be great. And please make sure he wears camouflage. Sara
---
Amy
To which I responded:
Hi Amy, 
Thanks for the note. Can you please tell Sara that I don’t own any camouflage?
Thanks,
Arjun
And then I heard this back:
Hi Arjun,
Right now I can't pass requests like this to your guests. I think your message would have a stronger impact if you sent it directly to Sara.
Amy
Ah, a distinctly and reassuringly non-human, form-like response. What a relief! Looks like we've still got a little way to go before we have to worry about her (its?) feelings. Still, the singularity is coming, one meeting at a time!

Saturday, December 19, 2015

Will reproducibility reduce the need for supplementary figures?

One constant refrain about the kids these days is that they use way too much supplementary material. All those important controls, buried in the supplement! All the alternative hypotheses that can’t be ruled out, buried in the supplement! All the “shady data” that doesn’t look so nice, buried in the supplement! Now papers are just reduced to ads for the real work, which is… buried in the supplement! The answer to the ultimate question of life, the universe and everything? Supplementary figure 42!

Whatever. Overall, I think the idea of supplementary figures makes sense. Papers have more data and analyses in them than before, and supplementary figures are a good way to keep important but potentially distracting details out of the way. To the extent that papers serve as narratives for our work as well as documentation of it, it’s important to keep that narrative as focused as possible. Typically, if you know the field well enough to know that a particular control is important, then you likely have sufficient interest to go to the trouble of digging it up in the supplement. If the purpose of the paper is to reach people outside of your niche–which most papers in journals with big supplements are attempting to do–then there’s no point in having all those details front and center.

(As an extended aside/supplementary discussion (haha!), the strategy we’ve mostly adopted (from Jeff Gore, who showed me this strategy when we were postdocs together) is to use supplementary figures like footnotes, like “We found that protein X bound to protein Y half the time. We found this was not due to the particular cross-linking method we used (Supp. Fig. 34)”. Then the supplementary figure legend can have an extended discussion of the point in question, no supplementary text required. This is possible because unlike regular figure legends, you can have interpretation in the legend itself, or at least the journal doesn’t care enough to look.)

I think the distinction between the narrative and documentary role of a paper is where things may start to change with the increased focus on reproducibility. Some supplementary figures are really important to the narrative, like a graph detailing an important control. But many supplementary figures are more like data dumps, like “here’s the same effect in the other 20 genes we analyzed”. Or showing the same analysis but on replicate data. Another type of supplementary figure has various analyses done on the data that may be interesting, but not relevant to the main points of the paper. If not just the data but also the analysis and figures are available in a repository associated with the paper, then is there any need for these sorts of supplementary figures?

Let’s make this more concrete. Let’s say you put up your paper in a repository on github or the equivalent. The way we’ve been doing this lately is to have all processed data (like spot counts or FPKM) in one folder, all scripts in another, and when you run the scripts, it takes the processed data, analyzes it, and puts all the outputted graphical elements into a third folder (with subfolders as appropriate). (We also have a “Figures” folder where we assemble the figures from the graphical elements in Illustrator; more in another post.) Let’s say that we have a side point about the relative spatial positions of transcriptional loci for all the different genes we examined in a couple different datasets; e.g., Supp Figs. 16 and 21 of this paper. As is, the supplementary figures are a bit hard to parse because there’s so much data, and the point is relatively peripheral. What if instead we just pointed to the appropriate set of analyses in the “graphs” folder? And in that folder, it could have a large number of other analyses that we did that didn’t even make the cut for the supplement. I think this is more useful than the supplement as normally presented and more useful than just the raw data, because it also contains additional analyses that may be of interest–and my guess is that these analyses are actually far more valuable than the raw data in many cases. For example, Supp Fig. 11 of that same paper shows an image with our cell-cycle determination procedure, but we had way more quantitative data that we just didn’t show because the supplement was already getting insane. Those analyses would be great candidates for a family of graphs in a repository. Of course, all of this requires these analyses being well-documented and browsable, but again, not sure that’s any worse than the way things are now.
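On the browsable part: one way to do it (a sketch only, with an assumed folder layout and file types) would be to auto-generate a simple index of everything in the graphs folder, so a reader can skim what analyses exist without having to dig through the scripts.

```python
# Hypothetical sketch: write a plain-text index of everything in graphs/ so
# that peripheral analyses can be pointed to rather than crammed into
# supplementary mega-figures. Folder layout and file types are assumptions.
from pathlib import Path

graphs_dir = Path("graphs")
lines = ["Index of graphical elements (auto-generated):", ""]
for f in sorted(graphs_dir.rglob("*")):
    if f.is_file() and f.suffix in {".svg", ".pdf", ".png"}:
        lines.append(f"- {f.relative_to(graphs_dir)}")

(graphs_dir / "INDEX.txt").write_text("\n".join(lines) + "\n")
```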

Now, I’m not saying that all supplementary figures would be unnecessary. Some contain important controls and specific points that you want to highlight, e.g., Supp. Fig. 7–just like an important footnote. But analyses of data dumps, replicates, side points and the such might be far more efficiently and usefully kept in a repository.

One potential issue with this scheme is hosting and versioning. Most supplementary information is currently hosted by journals. In this repository-based future, it’s up to Bitbucket or Github to stick around, and the authors are free to modify and remove the repository if they wish. Oh well, nothing’s permanent in this world anyway, so I’m not so worried about that personally. I suppose you could zip up the whole thing and upload it as a supplementary file, although most supplementary information has size restrictions. Not sure about the solution to that.

Part of the reason I’ve been thinking about this lately is because Cell Press has this very annoying policy that you can’t have more supplementary figures than main figures. This wreaked havoc with our “footnote” style we originally used in Olivia’s paper because now you have to basically agglomerate smaller, more focused supplementary figures into huge supplementary mega-figures that are basically a hot figure mess. I find this particularly ironic considering that Cell’s focus on “complete stories” is probably partially to blame for the proliferation of supplementary information in our field. I get that the idea is to reduce the amount of supplementary information, but I don’t think the policy accomplishes this goal and only serves to complicate things. Cell Press, please reconsider!

Saturday, July 11, 2015

How should we do script review to spot errors?

Sydney just thought up a great idea for the lab: she was wondering if someone could review all her analysis scripts to look for errors before we finalize them and submit a manuscript. Sort of like a code review, I guess. I think this is awesome, and can definitely reduce the potential for getting some very serious egg on your face after publication. (Note: I'm not talking about infrastructure-type software, which I think has a very different set of problems and solutions. This is about analysis scripts for the science itself.)

We all discussed briefly at group meeting how this might work in practice, which took on a very practical significance because Chris was going over figures for the paper he's putting together. Here were some of the points of discussion, much of it revolving around the time it takes for someone to go over someone else's code.

  1. When should the review happen? In the ideal world, the reviewer would be involved each step of the way, spotting errors early on in the process. In practice, that's a pretty big burden on the reviewer, and there's the potential to spend time reviewing analyses that never see the light of day. So I think we all thought it's better done at the end. Of course, doing it at the bitter end could be, well, bitter. So we're thinking maybe doing it in chunks when specific pieces of the analysis are finalized?
  2. Who should do it? Someone well-versed in the project would obviously be able to go through it faster. Also, they may be better able to suggest "sanity checks" (additional analyses to demonstrate correctness; see the sketch after this list) than someone naive to the project. Then again, might their familiarity blind them to certain errors? I'm just not sure at this stage how much work it is to go through this.
  3. Related: How actively should the code author be involved? On the one hand, looking at raw code without any guidance can be very intimidating and time-consuming. On the other hand, having someone lead you through the code might inadvertently steer the reviewer away from problem areas.
  4. Who should do it, part 2? Some folks in the lab are a bit more computationally savvy than others. I worry that the more computationally savvy folks might get overburdened. It could be a training exercise for others to learn, but the quality of the review itself might suffer somewhat.
  5. How should we assign credit? Acknowledgement on the paper? Co-authorship? I could see making a case either way, guess it probably depends on the specifics.
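For what it's worth, here's a sketch (purely hypothetical file and column names) of the kind of cheap sanity checks a reviewer might ask the author to add, regardless of who ends up doing the review:

```python
# Hypothetical sanity checks on an analysis output: cheap assertions that catch
# gross errors (missing values, duplicates, implausible magnitudes) before the
# figures get finalized. File and column names are made up.
import pandas as pd

cells = pd.read_csv("data/segmented_cells.csv")

assert not cells["spot_count"].isna().any(), "missing spot counts"
assert (cells["spot_count"] >= 0).all(), "negative spot counts?!"
assert cells["cell_id"].is_unique, "duplicate cell IDs"
# Rough plausibility bound; the right number is obviously dataset-specific.
assert cells["spot_count"].median() < 1000, "median spot count implausibly high"

print(f"{len(cells)} cells pass basic sanity checks")
```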


Anyway, don't know if anyone out there has tried something like this, but if so, we'd love to hear your thoughts. I think it's increasingly important to think about these days.

Sunday, August 17, 2014

Another approach to having data available, standardized and accessible: who cares?

I once went to a talk by someone who spent most of their seminar talking about a platform they had created for integrating and processing data of all different kinds (primarily microarray). After the talk, a Very Wise Colleague of mine and I were chatting with the speaker, and I said something to the effect of “Yeah, it’s so crazy how much effort it takes to deal with other people’s datasets”, and both the speaker and I nodded vigorously while Very Wise Colleague smiled a little. Then he said, “Well, you know, another approach to this problem is to just not care.” Now, Very Wise Colleague has forgotten more about this field than I’ve ever learned (times 10), so I have spent the last several years pondering this statement. And as time has gone on and I’ve become at least somewhat less unwise, I think I largely agree with Very Wise Colleague.

I realize this is a less than fashionable point of view these days, especially amongst the “open everything” crowd (heavy overlap with the genomics crowd). I think this largely stems from some very particular aspects of genomics data that are dangerous to generalize to the broader scientific community. So let’s start with a very important exception to my argument and then work from there: the human genome. I think our lab uses the human genome on pretty much a daily basis. Mouse genome as well. As such, it is super handy that the data is available and easily accessed and manipulated because we need the data as a resource of specific important information that does not change or (substantially) improve with time or technology.

I think this is only true of a very small subset of research, though, and leads to the following bigger question: when The Man is paying for research, what are they paying for? In the case of the genome, I think the idea is that they are paying for a valuable set of data that is reasonably finalized and important to the broader scientific endeavor. Same could be said for genomes of other species, or for measuring the melting points of various metals, crystal structures, motions of celestial bodies, etc.–basically anything in which the data yields a reasonably final value of interest. For most other research, though, The Man is paying us to generate scientific insight, not data. Think about virtually every important result in biomedical science from the past however long. Like how mutations to certain genes cause cells to proliferate uncontrollably (i.e., genes cause cancer). Do we really need the original data for any reason? At this point, no. Would anyone at the time have needed the original data for any reason? Maybe a few people who wanted to trade notes on a thing or two, but that’s about it. The main point of the work is the scientific insight one gains from it, which will hopefully stand the test of time. Standing the test of time, by the way, means independent verification of your conclusions (not data) in others labs in other systems. Whether or not you make your data standardized and easily accessible makes no real difference in this context.

I think it’s also really important before building any infrastructure to first think pretty carefully about the "reasonably final" part of reasonably final value of interest. The genome, minor caveats aside, passes this bar. I mean, once you have a person’s genome, you have their sequence, end of story. No better technology will give them a radically better version of the sequence. Such situations in biology are relatively rare, though. Most of the time, technology will march along so fast that by the time you build the infrastructure, the whole field has moved on to something new. I saw so many of those microarray transcriptome profile compendiums and databases that came out just before RNA-seq started to catch on–were those efforts really worthwhile? Given that experience, is it worth doing the same thing now with RNA-seq? Even right now, although I can look up the HeLa transcriptome in online repositories, do I really trust that it’s going to give me the same results that I would get on my HeLa cells growing in my incubator in my lab? Probably just sequence it myself as a control anyway. And by the time someone figures this whole mess out, will some new tech have come along making the whole effort seem hopelessly quaint?
Incidentally, I think the same sort of thinking is a pretty strong argument that if a line of research is not going to give a reasonably final value of interest for something, then you better try and get some scientific insight out of it, because purely as data, the whole thing will likely be obsolete in a few years.

Now, of course, making data available and easily shared with others via standards is certainly a laudable goal, and in the absence of any other factors, sure, why not, even for scientific insight-oriented studies. But there are other factors. Primary amongst them is that most researchers I know maintain all sorts of different types of data, often custom to the specific study, and to share means having to in effect write a standard for that type of data. That’s a lot of work, and likely useless as the standards will almost certainly change over time. In areas where the rationale for interoperable data is very strong, then researchers in the field will typically step up to the task with formats and databases, as is the case with genomes and protein structures, etc. For everything else, I feel like it’s probably more efficient to handle it the old fashioned way by just, you know, sending an e-mail–I think personal engagement on the data is more productive than just randomly downloading the data anyway. (Along those lines, I think DrugMonkey was right on with this post about PLOS’s new and completely inane data availability policy.) I think the question really is this: if someone for some reason wants to do a meta-analysis of my work, is the onus on me or them to wrangle with the data to make it comparable with other people’s studies? I think it’s far more efficient for the meta-analyzer to wrangle with the data from the studies they are interested in rather than make everyone go to a lot of trouble to prepare their data in pseudo-standard formats for meta-analyses that will likely never happen.

All this said, I do definitely personally think that making data generally available and accessible is a good thing, and it’s something that we’ve done for a bunch of our papers. We have even released a software package for image analysis that hopefully someone somewhere will find useful outside of the confines of our lab. Or not. I guess the point is that if someone else doesn’t want to use our software, well, that’s fine, too.

Wednesday, July 23, 2014

The hazards of commenting code

- Gautham

It is commonly thought that good code should be thoroughly commented. In fact, this is the opposite of good practice. A coding strategy that does not allow the programmer to use comments as a crutch is good. Programs should be legible on their own.

Here are the most common scenarios (a short sketch illustrating them follows the list):


  • Bad. The comment is understandable and it precedes an un-understandable piece of code. When the maintainer of the code goes through this, they still have to do a lot of work to figure out how to change the code, or to figure out where the bug might be.
  • Better. The comment is understandable, and the line of code is also understandable. Now you are making the reader read the same thing twice. This also dilutes the code into a sea of words.
  • Best. There is no comment. Only an understandable piece of code due to good naming, good abstractions, and a solid design. Good job!
  • Terrible. The comment is understandable. The code it describes does not do what the comment says. The bug hides in here. The maintainer has to read every piece of your un-understandable code because they have realized they can't trust your comments, which they shouldn't anyway. And so all your commenting effort was for nothing. This scenario is surprisingly common. 
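To make the contrast concrete, here is a contrived sketch (invented names, not from any real analysis) of the "bad" scenario versus the "best" one:

```python
# "Bad": the comment carries the meaning; the code itself is opaque.
# convert intensities to spot counts using threshold t and per-spot intensity s
def f(x, t, s):
    return [round(v / s) if v > t else 0 for v in x]


# "Best": no comment needed; the names and structure say the same thing.
def spots_per_cell(intensities, detection_threshold, intensity_per_spot):
    return [
        round(intensity / intensity_per_spot) if intensity > detection_threshold else 0
        for intensity in intensities
    ]
```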

When are comments acceptable?
  • Documentation. If you have a mature set of tools, you might have developed them to the point where the user can just read the manual rather than the code. This is intended for users, not maintainers, and usually takes the form of a large comment that automated documentation generation tools can interpret (see the sketch after this list).
  • Surprising/odd behavior of libraries you are using. Matlab has some weird things it does, and sometimes I like to notify the maintainer that this line of code looks this way for a reason (especially if the line of code is more complex than a naive implementation would appear to require because of subtleties of the programming language or the libraries/packages being used). It can be counter-argued that, rather than putting in a comment, you could write a set of unit tests that explore all the edge-case behavior and encapsulate the byzantine code into functions whose names describe the requirements the code is trying to meet.
  • When your program is best explained with pictures. Programs are strings. But sometimes they represent or manipulate essentially graphical entities. For example, a program that represents a balanced binary search tree involves tree rotation manipulations. These manipulations are very difficult to describe in prose, and so they are similarly difficult to describe in code. Some ASCII art can be a real life saver in this kind of situation, because code is a poor representation of diagrams. So think of it this way: don't let yourself write text in comments, but it's okay to draw figures in the comments.
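And here is a quick sketch of the first two acceptable cases (documentation and surprising library behavior), using invented names and NumPy purely for illustration:

```python
import numpy as np


def pairwise_distances(points):
    """Return the matrix of Euclidean distances between rows of `points`.

    A documentation-style comment: written for users, and the kind of thing
    automated documentation tools can pick up.
    """
    diffs = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    # Surprising-behavior comment: the axis argument is deliberate; summing
    # over the wrong axis silently returns a plausible-looking wrong answer.
    return np.sqrt((diffs ** 2).sum(axis=-1))
```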

For more on these ideas, please just get Robert Martin's book on Clean Code.