Friday, January 22, 2016

Thoughts on the NEJM editorial: what’s good for the (experimental) goose is good for the (computational) gander

Huge Twitter explosion about this editorial in the NEJM about “research parasites”. Basically, the authors say that computational people interested in working with someone else’s data should work together with the experimenters (which, incidentally, is how I would approach something like that in most cases). Things get a bit darker (and perhaps more revealing) when they also call out “research parasites”–aka “Mountain Dew chugging computational types”, to paraphrase what I’ve heard elsewhere–who to them are just people sitting around, umm, chugging Mountain Dew while banging on their computers, stealing papers from those who worked so hard to generate these datasets.

So this NEJM editorial is certainly wrong on many counts, and I think that most people have that covered. Not only that, but it is particularly tone-deaf: “… or even use the data to try to disprove what the original investigators had posited.” Seriously?!?

The response has been particularly strong from the computational genomics community, who are often reliant on other people’s data. Ewan Birney had a nice set of Tweets on the topic, first noting that “For me this is the start of clinical research transitioning from a data limited to an analysis limited world.”, noting further that “This is what mol. biology / genomics went through in the 90s/00s and it’s scary for the people who base their science on control of data.” True, perhaps.

He then goes on to say: “1. Publication means... publication, including the data. No ifs, no buts. Patient data via restricted access (bonafide researcher) terms.”

Agreed, who can argue with that! But let’s put this chain of reasoning together. If we are moving to an “analysis limited world”, then it is the analyses that are the precious resource. And all the arguments for sharing data are just as applicable to sharing analyses, no? Isn’t the progress of science impeded by people not sharing their analyses? This is not just an abstract argument: for example, we have been doing some ATAC-seq experiments in the lab, and we had a very hard time finding out exactly how to analyze that data, because there was no code out there for how to do it, even in published papers (for the record, Will Greenleaf has been very kind and helpful via personal communication, and this has been fine for us).

What does, say, Genome Research have to say about it? Well, here’s what they say about data:
Genome Research will not publish manuscripts where data used and/or reported in the paper is not freely available in either a public database or on the Genome Research website. There are no exceptions.
Uh, so that’s pretty explicit. And here’s what they say about code:
Authors submitting papers that describe or present a new computer program or algorithm or papers where in-house software is necessary to reproduce the work should be prepared to make a downloadable program freely available. We encourage authors to also make the source code available.
Okay, so only if there’s some novel analysis, and then only if you want to or if someone asks you. Probably via e-mail. To which someone may or may not respond. Hmm, kettle, the pot is calling…

So what happens in practice at Genome Research? I took a quick look at the first three papers from the current TOC (1, 2, 3).

The first paper has a “Supplemental” that contains some very poorly documented code in a few files and, as far as I can tell, is missing a file called “mcmctree_copy.ctl” that I’m guessing is pretty important to running the mcmctree algorithm.

The second paper is perhaps the best, with a link to a software package that seems fairly well put together. But still, no link to the actual code to make the actual figures in the paper, as far as I can see, just “DaPars analysis was performed as described in the original paper (Masamha et al. 2014) by using the code available at with default settings.”

The third paper has no code at all. They have a fairly detailed description of their analysis in the supplement, but again, no actual code I could run.

Aren’t these the same things we’ve been complaining about in experimental materials and methods forever? First paper: missing steps of a protocol? Second paper: vague prescription referencing previous paper and a “kit”? Third paper: just a description of how they did it, just like, you know, most “old fashioned” materials and methods from experimental biology papers.

Look, trust me, I understand completely why this is the case in these papers, and I’m not trying to call these authors out. All I’m saying is that if you’re going to get on your high horse and say that data is part of the paper and must be distributed, no ifs, no buts, well, then distribute the analyses as well–and I don’t want to hear any ifs or buts. If we require authors to deposit their sequence data, then surely we can require that they upload their code. Where is the mandate for depositing code on the journal website?

Of course, in the real world, there are legitimate ifs and buts. Let me anticipate one: “Our analyses are so heterogeneous, and it’s so complicated for us to share the code in a usable way.” I’m actually very sympathetic to that. Indeed, we have lots of data that is very heterogeneous and hard to share reasonably–for anyone who really believes all data MUST be accessible, well, I’ve got around 12TB of images for our next paper submission that I would love for you to pay to host… and that probably nobody will ever use. Not all science is genomics, and what works in one place won’t necessarily make sense elsewhere. (As an aside, in computational applied math, many people keep their codes secret to avoid “research parasites”, so it’s not just data gatherers who feel threatened.)

Where, might you ask, is the moral indignation on the part of our experimental colleagues complaining about how computational folks don’t make their codes accessible? First off, I think many of these folks are in fact annoyed (I am, for instance), but are much less likely to be on Twitter and the like. Secondly, I think that many non-computational folks are brow-beaten by p-value toting computational people telling them they don’t even know how to analyze their own data, leading them to feel like they are somehow unable to contribute meaningfully in the first place.

So my point is, sure, data should be available, but let’s not all be so self-righteous about it. Anyway, there, I said it. Peace. :)

PS: Just in case you were wondering, we make all our software and processed data available, and our most recent paper has all the scripts to make all the figures–and we’ll keep doing that moving forward. I think it's good practice; my point is just that reasonable people could disagree.

Update: Nice discussion with Casey Bergman in the comments.

Saturday, January 2, 2016

A proposal for how to label small multiples

I love the concept, invented/defined/popularized/whatever by Tufte, of small multiples. The general procedure is to break apart data into multiple small graphs, each of which contains some subset of the data. Importantly, small multiples often make it easier to compare data and spot trends because the cognitive load is split in a more natural way: understand the graph on a small set of data, then once you get the hang of it, see how that relationship changes across other subsets.

For instance, take this more conventionally over-plotted graph of city vs. highway miles per gallon, with different classes of cars labeled by color:

library(ggplot2)  # provides qplot and the mpg dataset
q2 <- qplot(cty, hwy, data = mpg, color = class) + theme_bw()
ggsave("color.pdf", q2, width = 8, height = 6)

Now there are a number of problems with this graph, but the most pertinent is the fact that there are a lot of colors corresponding to the different categories of car and so it takes a lot of effort to parse. The small multiple solution is to make a bunch of small graphs, one for each category, that allows you to see the differences between each. By the power of ggplot, behold!

q <- qplot(cty, hwy, data = mpg, facets = . ~ class) + theme_bw()
ggsave("horizontal_multiples.pdf", q, width = 8, height = 2)

Or vertically:

q <- qplot(cty, hwy, data = mpg, facets = class ~ .) + theme_bw()
ggsave("vertical_multiples.pdf", q, width = 2, height = 8)

Notice how much easier it is to see the differences between categories of car in these small multiples than the more conventional over-plotted version, especially the horizontal one.

Most small multiple plots look like these, and they're typically a huge improvement over heavily over-plotted graphs, but I think there’s room for improvement, especially in the labeling. The biggest problem with small multiple labeling is that most of the axis labels are very far away from the graphs themselves. This seems like a logical way to set things up, since the labels apply to all the multiples, but it forces a lot of mental gymnastics to figure out what the axes are for any one particular multiple.

Thus, my suggestion is actually based on the philosophy of the small multiple itself: explain a graph once, then rely on that knowledge to help the reader parse the rest of the graphs. Check out these before and after comparisons:

The horizontal small multiples also improve, in my opinion:

To me, labeling one of the small multiples directly makes it a lot easier to figure out what is in each graph, and thus makes the entire graphic easier to understand quickly. It also adheres to the principle that important information for interpretation should be close to the data. The more people’s eyes wander, the more opportunities they have to get confused. There is of course the issue that by labeling one multiple, you are calling attention to that one in particular, but I think the tradeoff is acceptable. Another issue is a loss of precision in the other multiples. One could include tick marks as more visible markers, but again, I think the tradeoff is acceptable.

Oh, and how did I perform this magical feat of alternative labeling of small multiples (as well as general cleanup of ggplot's nice-but-not-great output)? Well, I used this amazing software package called “Illustrator” that works with R or basically any software that spits out a PDF ;). I’m of the strong opinion that being able to drag around lines and manipulate graphical elements directly is far more efficient than trying to figure out how to do this stuff programmatically most of the time. But that’s a whole other blog post…
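(For those who’d rather not leave R: here’s a rough sketch of one way to approximate this kind of direct labeling programmatically. To be clear, this is not how I made the figures above–I used Illustrator–and it assumes you have the gridExtra package around. The idea is just to build each small multiple as its own plot, keep the axis machinery only on the first one, and line them up in a row.)

```r
# A sketch, not my actual workflow: build each small multiple separately,
# keep axis labels only on the first panel, and arrange with gridExtra.
library(ggplot2)
library(gridExtra)

classes <- sort(unique(mpg$class))
plots <- lapply(seq_along(classes), function(i) {
  p <- ggplot(subset(mpg, class == classes[i]), aes(cty, hwy)) +
    geom_point() +
    ggtitle(classes[i]) +
    # shared axis limits so the panels stay comparable
    coord_cartesian(xlim = range(mpg$cty), ylim = range(mpg$hwy)) +
    theme_bw()
  if (i > 1) {
    # blank out axis text and titles on everything but the first panel
    p <- p + theme(axis.text = element_blank(), axis.title = element_blank())
  }
  p
})
g <- arrangeGrob(grobs = plots, nrow = 1)
ggsave("labeled_multiples.pdf", g, width = 8, height = 2)
```

Same data as the horizontal multiples above, but only the leftmost panel carries the labels. Honestly, though, for the fine-tuning, dragging things around in Illustrator is still faster.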

Tuesday, December 29, 2015

Is the academic work ethic really toxic?

Every so often, I’ll read something or other about how the culture of work in academia is toxic, encouraging people to work 24/7/52 (why do people say 24/7/365?), ignoring all other aspects of their existence and in the process destroying their lives. As I’ve written before, I think this argument gets it backwards. I think most academics work hard because they want to and are immersed in what they are doing, not because of the “culture”. It is the conflation of hours and passion that leads to confusion.

Look, I know people who are more successful than I am and work less than I do. Good for them! That doesn’t mean I’m going to start working less hard. To me, if you’re thinking “I need to work X hours to get job Y/award Z”, well, then you’re in the wrong line of work. If you’re thinking “I really need to know about X because, uh, I just need to know” then academia might be for you. Sure, sometimes figuring out X requires a lot of work, and there is a fair amount of drudgery and discipline required to turn an idea into a finished paper. Most academics I know will make the choice to do that work. Some will do it at a pace I would find unmanageable. Some will do it at a pace I find lethargic. I don’t think it really matters. I read a little while ago that Feng Zhang goes back to work every day after dinner and works until 3am doing experiments himself in the lab (!). I couldn’t do that. But again, reading about Zhang, I think it’s pretty clear that he does it because he has a passion for his work. What’s wrong with that? If he wants to work that way, I don’t see any reason he should be criticized for it. Nor, conversely, lionized for it. I think we can praise his passion, though. Along those lines, I know many academics who are passionate about their work and thus very successful, all while working fairly regular hours (probably not 40/week, but definitely not 80/week), together with long vacations. Again, the only requirement for success in science is a desire to do it, along with the talent and dedication to finish what you start.

I think this conflation of hours and passion leads to some issues when working with trainees. I most enjoy working with people who have a passion for their work. Often, but not always, this means that they work long-ish hours. If someone is not motivated, then a symptom is sometimes working shorter hours–or, other times, working long hours but not getting as much done. If we’re to the point where I’m counting someone’s hours, though, then it’s already too late. For trainees, if your PI is explicitly counting hours, then that means either you should find a new PI or carefully consider why your PI is counting your hours. What’s important is that both parties should realize that hours are the symptom, not the underlying condition.

Monday, December 28, 2015

Is all of Silicon Valley on a first name basis?

One very annoying software trend I've noticed in the last several years is the use of just first names in software. For instance, iOS shows first names only in messages. Google Inbox has tons of e-mail conversations involving me and someone named "John". Also, my new artificially intelligent scheduling assistant (which is generally awesome) will put appointments with "Jenn" on my calendar. Hmm. For me, those variables need a namespace.

I'm assuming this is all in some effort to make software more friendly and conversational, and it demos great for some Apple exec to say "Ask Tim if he wants to have lunch on Wednesday" into his phone and have it automatically know he meant Tim Cook. Great, but in my professional life (and, uh, I'm guessing maybe Tim Cook's also), I interact with a pretty large number of people, some only occasionally, making this first-name-only convention pretty annoying.

Which makes me wonder if the logical next step is just to refer to people by their e-mail or Twitter. I'm sure that would generate a lot of debate as to which is the identifier of choice, but I'm guessing that ORCID is probably not going to be it. :)

Wednesday, December 23, 2015

Bragging about data volume is lame

I've noticed a trend in some papers these days of bragging about the volume of data you collect. Here's an example (slightly modified) from a paper I was just looking at: "We analyzed a total of 293,112 images." Oftentimes, these numbers serve no real purpose except to highlight that you took a lot of data, which I think is sort of lame.

Of course, numbers in general are good and are an important element in describing experiments. Like "We took pictures of at least 5000 cells in 592 conditions." That gives a sense of the scale of the experiment and is important for the interpretation. But if you just say "We imaged a total of 2,948,378 cells", then that provides very little useful information about why you imaged all those cells. Are they all the same? Is that across multiple conditions? What is the point of this number except to impress?

And before you leave a comment, yes, I know we did that in this paper. Oops. I feel icky.

Tuesday, December 22, 2015

Reviewing for eLife is... fun?

Most of the time, I find reviewing papers to be a task that, while fun-sounding in principle, often becomes a chore in practice, especially if the paper is really dense. Which is why I was sort of surprised that I actually had some fun reviewing for eLife just recently. I've previously written about how the post-review harmonization between reviewers is a blessing for authors because it's a lot harder to give one of those crummy, ill-considered reviews when your colleagues know it's you giving them. Funny thing is that it's also fun for reviewers! I really enjoy discussing a paper I just read with my colleagues. I feel like that's an increasingly rare occurrence, and I was happy to have the opportunity. Again, well done eLife!

Sunday, December 20, 2015

Impressions from a couple weeks with my new robo-assistant, Amy Ingram

Like many, I both love the idea of artificial intelligence and hate spending time on logistics. For that reason, I was super excited to hear about, some startup in NYC that makes an artificially intelligent scheduler e-mail bot. It takes care of this problem (e-mail conversation):
“Hey Arjun, can we meet next week to talk about some cool project idea or another?”
“Sure, let’s try sometime late next week. How about Thursday 2pm?”
“Oh, actually, I’ve got class then, but I’m free at 3pm.”
“Hmm, sorry, I’ve got something else at 3pm. Maybe Friday 1pm?”
“Unfortunately I’m out of town on Friday, maybe the week after?”
“Sure, what about Tuesday?”
“Well, only problem is that…”
And so on.’s solution works like this:
“Hey Arjun, can we meet next week to talk about some cool project idea or another?”
“Sure, let’s try sometime late next week. I’m CCing my assistant Amy, who will find us a time.”
And that’s it! Amy will e-mail back and forth with whoever wrote to me and find a time to meet that fits us both, putting it straight on my calendar without me having to lift another finger. Awesome.

So how well does it work? Overall, really well. It took a bit of finagling at first to make sure that my calendar was appropriately set up (like making sure I’m set to “available” even if my calendar has an all day event) and that Amy knew my preferences, but overall, out of the several meetings attempted so far, only one of them got mixed up, and to be fair, it was a complicated one involving multiple parties and some screw ups on my part due to it being the very first meeting I scheduled with Amy. Overall, Amy has done a great job removing scheduling headaches from my life–indeed, when I analyzed a week in my e-mail, I was surprised how much was spent on scheduling, and so this definitely reduces some overhead. Added benefit: Amy definitely does not drop the ball.

One of the strangest things about using this service so far has been my psychological responses to working with it (her?). Perhaps the most predictable one was that I don’t feel like a “have your people call my people” kind of person. I definitely feel a bit uncomfortable saying things like “I’m CCing my assistant who will find us a time”, like I’m some sort of Really Busy And Important Person instead of someone who teaches a class and jokes around with twenty-somethings all day. Perhaps this is just a bit of misplaced egalitarian/lefty anxiety, or imposter syndrome manifesting itself as a sense that I don’t deserve admin support, or the fact that I’m pretty sure I’m not actually busy enough to merit real human admin support. Anyway, whatever, I just went for it.

So then this is where it starts getting a bit weird. So far, I haven’t been explicitly mentioning that Amy is a robot in my e-mails (like “I’m CCing my robo-assistant Amy…”). That said, for the above reasons of feeling uncomfortably self-important, I actually am relieved when people figure out that it’s a robot, since it somehow seems a bit less “one-percenty”. So why didn't I just say she’s a robot right off the bat? To be perfectly honest, when I really think about it, it’s because I didn't want to hurt her feelings! It’s so strange. Other examples: for the first few meetings, Amy CCs you on the e-mail chain so you can see how she handles it. I felt a strong compulsion to write saying “Thank you!” at the end of the exchange. Same when I write to her to change preferences. Like
“Amy, I prefer my meetings in the afternoon.”
“Okay, I have updated your preferences as follows…”
… “Thank you?!?!?”
Should I bother with the formalities of my typical e-mails, with a formal heading and signature? I think I’ve been doing it, even though it obviously (probably?) doesn’t matter.

Taking it a bit further, should I be nice? Should I get angry if she messes something up? Will my approval or frustration even register? Probably not, I tell myself. But then again, what if it’s part of her neural network to detect feelings of frustration? Would her network change the algorithms somewhat in response? Is that what I would want to happen? I just don’t know. I have to say that I had no idea that this little experiment would have me worrying about the intricacies of human/AI relations.

In some sense, then, I was actually a bit relieved at the outcome of the following exchange. As a test, Sara worked with Amy to set up an appointment for us to get a coffee milkshake (inside joke). She then told Amy to tell me that I should wear camouflage to the appointment, a point that Amy dutifully relayed to me:
Hi Arjun,
I just wanted to pass along this message I received from Sara. It doesn’t look like it’s a message I can provide an answer to, so I suggest you follow up with Sara directly.
Thanks, Amy! 2 o'clock would be great. And please make sure he wears camouflage. Sara
To which I responded:
Hi Amy, 
Thanks for the note. Can you please tell Sara that I don’t own any camouflage?
And then I heard this back:
Hi Arjun,
Right now I can't pass requests like this to your guests. I think your message would have a stronger impact if you sent it directly to Sara.
Ah, a distinctly and reassuringly non-human, form-like response. What a relief! Looks like we've still got a little way to go before we have to worry about her (its?) feelings. Still, the singularity is coming, one meeting at a time!