Showing posts with label publishing.

Friday, July 31, 2020

“Hipster” overlay journals

Been thinking a lot about overlay journals and their implications these days. For those who don’t know, an overlay journal is sort of like a “meta-journal” in that it doesn’t formally publish its own papers. Rather, it provides links to other preprints/papers that it thinks are interesting. On some level, the idea is that the true value of a journal is to serve as a filter for what someone thinks is science worth reading so that you don’t have to read every single paper. An overlay journal provides that filter function without the need for the rest of the (costly) trappings of a journal, like peer review and, uhh, color figures ;). 

There is one very interesting aspect of an overlay journal that I don’t think has been discussed very much: in contrast with regular journals, they are fundamentally non-exclusive, meaning that ANY overlay journal can in principle “publish” ANY paper. What this non-exclusivity means is that there is no jockeying between journals to publish the “obviously important” papers, which have a perhaps slightly elevated chance of actually being important. You know, like “we sequenced 10x more single cells than the last paper in a fancy journal” kind of papers. If you run an overlay journal, you never have to gaze longingly at those “high impact” papers—if you want to publish it, just add it to your overlay!

What are the consequences of non-exclusivity? Primarily, I think it would serve to diminish the value of “obviously important” papers. Everyone can identify them based on authors and number of genomes sequenced or whatever, so there’s really not that much value in including them per se. It would be like saying “Here’s my playlist, it’s like a copy of the Billboard Top 40”. Nobody’s going to look to your overlay journal for that kind of stuff (which you can readily get from CNS or Twitter). Rather, the real value would be in making lists of papers that are awesome but might otherwise be overlooked—essentially a hipster playlist. As an editor, your cachet would be in your ability to identify these new, cool papers and make Michael Cera-esque mixtapes out of them. You can leave the Hot 100 to Casey Kasem/Spotify algorithms.

Measuring the importance of an overlay journal would also be interesting. Clearly, impact factor is not a useful metric, since anybody can make their impact factor as high as they want by including highly cited papers. I would guess a far more sensible metric would be the number of followers of the journal.

Another interesting aspect of an overlay journal is that it can be retrospective. You could include old papers as well, highlighting old gems that may have been forgotten.

Of course, an interesting question is whether there is any difference between an overlay journal and someone’s Twitter feed. Not sure, actually…

Also, thoughts on existing journals that have hipster qualities to them? I vote Current Biology, my lab votes eLife.

Monday, May 6, 2019

Wisdom of crowds and open, asynchronous peer review

I am very much in favor of preprints and open review, but something I listened to on Planet Money recently gave me some food for thought, along with a recent poll I tweeted about re-reviewing papers. The episode was about wisdom of the crowds, and how magically if you take a large number of non-expert guesses about, say, the weight of an ox, the average comes out pretty close to the actual value. Pretty cool effect!

But something in the podcast caught my ear. They talked about how, when they asked some kids, you had to watch out, because once one kid said, say, 300 pounds (wildly inaccurate), then if the other kids heard it, they would all start saying 300 pounds. Maybe with some minor variations, but the point is that they were strongly influenced by that initial guess, rather than picking something essentially at random. The thing was that if you had no point of reference, then even a guess provided that point of reference.

Okay, so what does this have to do with peer review? What got me thinking about it was the tweet about re-reviewing a paper you had already seen but for a different journal. I'm like nah not gonna do it because it's a waste of time, but some people said, well, you are now biased. So… in a world where we openly and asynchronously review papers (preprints, postpub, whatever), we would have the same problem that the kids guessing the weight of the cow did: whoever gives the first opinion would potentially strongly influence all subsequent opinions. With conventional peer review, everyone does it blind to the others, and so reviews could be considered more independent samplings (probably dramatically undersampled, but that's another blog post). But imagine someone comments on a preprint with some purported flaw. That narrative is very likely to color subsequent reviews and discussions. I think we've all seen this coloring: take eLife collaborative peer review, or even grant review. Everyone harmonizes their scores, and it's often not an averaging. One could argue that unlike randos on the internet guessing a cow's weight, peer reviewers are all experts. Maybe, but I am somehow not so sure that, once we are in the world of experts reviewing what is hopefully a reasonably decent paper, there's much signal beyond noise.

What could we do about this? Well, we could commission someone to hold all the open reviews in confidence and then publish them all at once… oh wait, I think we already have some annoying system for that. I dunno, not really sure, but anyway, was something I was wondering about recently, thoughts welcome.

Wednesday, August 2, 2017

Figure scripting and how we organize computational work in the lab

Saw a recent Twitter poll from Casey Brown on the topic of figure scripting vs. "Illustrator magic": the former is the practice of writing a program that completely generates the figure, while the latter means assembling and polishing figures in Illustrator to make things look the way you like. Some folks really like programming it all, while I've argued that I don't think this is very efficient, and so arguments go back and forth on Twitter about it. Thing is, I think ALL of us having this discussion here are already way out in the right-hand tail in terms of trying to be tidy about our computational work, while many (most?) folks out there haven't ever really thought about this at all and could potentially benefit from a discussion of what an organized computational analysis would look like in practice. So anyway, here's what we do, along with some discussion of why and what the tradeoffs are (including some thoughts on figure scripting).

First off, what is the goal? Here, I'm talking about how one might organize a computational analysis in finalized form for a paper (will touch on exploratory analysis later). In my mind, the goal is to have a well-organized, well-documented, readable and, most importantly, complete and consistent record of the computational analysis, from raw data to plots. This has a number of benefits: 1. it is more likely to be free of mistakes; 2. it is easier for others (including within the lab) to understand and reproduce the details of your analysis; 3. it is more likely to be free of mistakes. Did I mention more likely to be free of mistakes? Will talk about that more in a coming post, but that's been the driving force for me as the analyses that we do in the lab become more and more complex.

[If you want to skip the details and get more to the principles behind them, please skip down a bit.]

Okay, so what we've settled on in lab is to have a folder structured like this (version controlled or Dropboxed, whatever):

[Screenshot of the folder structure described below.]
I'll focus on the "paper" folder, which is ultimately what most people care about. The first thing is "extractionScripts". This contains scripts that pull out numbers from data and store them for further plot-making. Let me take this through the example of image data in the lab. We have a large software toolset called rajlabimagetools that we use for analyzing raw data (and that has its own whole set of design choices for reproducibility, but that's a story for another day). That stores, alongside the raw data, analysis files that contain things like spot counts and cell outlines and thresholds and so forth. The extraction scripts pull data from those analysis files and put it into .csv files, which are stored in extractedData. For an analogy with sequencing, this is like maybe taking some form of RNA-seq data and setting up a table of TPM values in a .csv file. Or whatever, you get the point.

plotScripts then contains all the actual plotting scripts. These load the .csv files and run whatever to make graphical elements (like a series of histograms or whatever) and store them in the graphs folder. finalFigures then contains the Illustrator files in which we compile the individual graphs into figures. Along with each figure (like Fig1.ai), we have a Fig1readme.txt that describes exactly what .eps or .pdf files from the graphs folder ended up in, say, Figure 1f (and, ideally, which script made them). Thus, everything is traceable back from the figure all the way to raw data.

Note: within extractionScripts is a file called "extractAll.m" and in plotScripts "plotAll.R" or something like that. These master scripts basically pull all the data and make all the graphs, and we rerun these completely from scratch right before submission to make sure nothing changed. Incidentally, of course, each of the folders often has a massive number of subfolders and so forth, but you get the idea.
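To make the master-script idea a bit more concrete, here is a minimal, hypothetical sketch of what a plotAll.R-style script could look like (folder and file names follow the description above; the real versions in the lab have a lot more going on):

# plotAll.R: hypothetical sketch of a master plotting script.
# Folder layout (as described above):
#   paper/extractionScripts/   pulls numbers out of the analysis files into extractedData/*.csv
#   paper/extractedData/
#   paper/plotScripts/         reads extractedData/*.csv, writes graphs/*.pdf or *.eps
#   paper/graphs/
#   paper/finalFigures/        Illustrator files plus FigNreadme.txt for traceability

scripts <- list.files("plotScripts", pattern = "\\.R$",
                      recursive = TRUE, full.names = TRUE)
scripts <- setdiff(scripts, "plotScripts/plotAll.R")  # don't re-source this file

for (script in scripts) {
  message("Running ", script)
  source(script, local = new.env())  # each plot script reads .csv files and writes graphical elements
}

An extractAll.m on the extraction side plays the analogous role, regenerating all the .csv files from the image analysis files.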

What are the tradeoffs that led us to this workflow? First off, why did we separate things out this way? Back when I was a postdoc (yes, I've been doing various forms of this since 2007 or so), I tried to just arrange things by having a folder per figure. This seemed logical at the time, and has the benefit that the outputs of the scripts are in close proximity to the script itself (and the figure), but the problem was that figures kept getting rearranged and remixed, leading to endless tedious (and error-prone) rescripting to regain consistency. So now we just pull in graphical elements as needed. This makes things a bit tricky, since for any particular graph it's not immediately obvious which script made it, but it's usually not too hard to figure out with some simple searching for filenames (and some verbose naming conventions).

The other thing is why have the extraction scripts separated from the plots? Well, in practice, the raw data is just too huge to distribute easily, and if it were all mushed together with the code and intermediates, the whole package would be hard to distribute. But, at least in our case, the more important fact is that most people don't really care about the raw data. They trust that we've probably done that part right, and what they're most interested in are the tables of extracted data. So this way, in the paper folder, we've documented how we pulled out the data while keeping the focus on what most people will be most interested in.

[End of nitty gritty here.]

And then, of course, figure scripting, the topic that brought this whole thing up in the first place. A few thoughts. I get that in principle, scripting is great, because it provides complete documentation, and also because it potentially cuts down on errors. In practice, I think it's hard to efficiently make great figures this way, so we've chosen perhaps a slightly more tedious and error prone but flexible way to make our figures. We use scripts to generate PDFs or EPSs of all relevant graphical elements, typically not spending time to optimize even things like font size and so forth (mostly because all of those have to change so many times in the end anyway). Yes, there is a cost here in terms of redoing things if you end up changing the analysis or plot. Claus Wilke argued that this discourages people from redoing plots, which I think has some truth to it. At the same time, I think that the big problem with figure scripting is that it discourages graphical innovation and encourages people to use lazy defaults that usually suffer from bad design principles—indeed, I would argue it's way too much work currently to make truly good graphics programmatically. Take this example:
[Example of a figure we assembled by hand in Illustrator.]
Or imagine writing a script for this one:
[Another example figure with many hand-placed graphical elements.]
Maybe you like or don't like this type of figure, but either way, not only would it take FOREVER to write up a script for these (at least for me), but by the time you've done it, you would probably never build up the courage to remix these figures the dozen or so times we've reworked this one over the course of publication. It's just faster, easier, and more intuitive to do with a tool for, you know, playing with graphical elements, which I think encourages innovation. Also, many forms of labeling of graphs that reduce cognitive burden (like putting text descriptors directly next to the line or histogram that they label) are much easier in Illustrator and much harder to do programmatically, so again, this works best for us. It does also, however, introduce a human source of error, and that has happened to us, although I should say that programmatic figures are a typo away from errors as well, and that's happened, too. There is also the option to link figures, and we have done that with images in the past, but in the end, relying on Illustrator to find and maintain links as files get copied around just ended up being too much of a headache.

Note that this is how we organize final figures, but what about exploratory data analysis? In our lab, that ends up being a bit more ad-hoc, although some of the same principles apply. Following the full strictures for everything can get tedious and inhibitory, but one of the main things we try and encourage in the lab is keeping a computational lab notebook. This is like an experimental lab notebook, but, uhh, for computation. Like "I did this, hoped to see this, here's the graph, didn't work." This has been, in practice, a huge win for us, because it's a lot easier to understand human descriptions of a workflow than try and read code, especially after a long time and double especially for newcomers to the lab. Note: I do not think version control and commit messages serve this purpose, because version control is trying to solve a fundamentally different problem than exploratory analysis. Anyway, talked about this computational lab notebook thing before, should write something more about it sometime.

One final point: like I said, one of the main benefits to these sorts of workflows is that they help minimize mistakes. That said, mistakes are going to happen. There is no system that is foolproof, and ultimately, the results will only be as trustworthy as the practitioner is careful. More on that in another post as well.

Anyway, very interested in what other people's workflows look like. Almost certainly many ways to skin the cat, and curious what the tradeoffs are.

Tuesday, July 4, 2017

A system for paid reviews?

Some discussion on the internet about how slow reviews have gotten and how few reviewers respond, etc. The suggestion floated was paid review, something on the order of $100 per review. I have always found this idea weird, but I have to say that I think review times have gotten bad enough that perhaps we have to do something, and some economists have some research showing that paid reviews speed up review.

In practice, lots of hurdles. Perhaps the most obvious way to do this would be to have journals pay for reviews. The problem would be that it would make publishing even more expensive. Let's say a paper gets 6-9 reviews before getting accepted. Then in order for the journal to be made whole, they'd either take a hit on their crazy profits (haha!), or they'd pass that along in publication charges.

How about this instead? When you submit your paper, you (optionally) pay up front for timely reviews. Like, $300 extra for the reviews, on the assumption that you get a decision within 2 weeks (if not, you get a refund). Journal maybe can even keep a small cut of this for payment overhead. Perhaps a smaller fee for re-review. Would I pay $300 for a decision within 2 weeks instead of 2 months? Often times, I think the answer would be yes.

I think this would have the added benefit of people submitting fewer papers. Perhaps people would think a bit harder before submitting their work and try a bit harder to clean things up before submission. Right now, submitting a paper incurs an overhead on the community to read, understand and provide critical feedback for your paper at essentially no cost to the author, which is perhaps at least part of the reason the system is straining so badly.

One could imagine doing this on BioRxiv, even. Have a service where authors pay and someone commissions paid reviews, THEN the paper gets shopped to journals, maybe after revisions. Something was out there like this (Axios Review), but I guess it closed recently, so maybe not such a hot idea after all.

Thoughts?

Sunday, February 19, 2017

Results from the Guess the Impact Factor Challenge


By Uschi Symmons and Arjun Raj

tl;dr: We wondered if people could guess the impact factor of the journal a paper was published in by its title. The short answer is not really. The longer answer is sometimes yes. The results suggest that talking about any sort of weird organism makes people think your work is boring, unless you’re talking about CRISPR. This raises the question of whether the people who took this quiz are cynical or just shallow. Much future research will be needed to make this determination.


Introduction:
[Arjun] This whole thing came out of a Tweet I saw:
[Screenshot of the tweet.]
It showed the title: “Superresolution imaging of nanoscale chromosome contacts”, and the beginning of the link: nature.com. Looking at the title, I thought, well, this sounds like it could plausibly be a paper in Nature, that most impacty of high impact journals (the article is actually in Scientific Reports, which is part of the Nature Publishing Group but is generally considered to be low impact). This got Uschi and me thinking: could you tell what journal a paper went into by its title alone? Would you be fooled?

[Switching to Uschi and Arjun] By the way, although this whole thing is sort of a joke, we think it does hold some lessons for our glorious preprint based future, in which the main thing you have to go on is the title and the authors. Without the filter/recommendation role that current journals provide, will visibility in such a world be dominated by who the authors are and increasingly bombastic and hype-filled titles? (Not that that’s not the case already, but…)

To see if people could guess the impact factor of the journal a paper was published in based solely on the title, we made up a little online questionnaire. More than 300 people filled out the questionnaire—and here are the results.

Methodology:
Our methodology was cooked up in an hour or two of discussion over Slack, and it has so many flaws it’s hard to enumerate them all. But we’ll try and hit the highlights in the discussion. Anyway, here’s what we did: we chose journals with a range of impact factors, three each in the high, medium, and low categories (>20, 8-20, <8, respectively). We tried to pick journals that would have papers with a flavor that most of our online audience would find familiar. We then chose two papers from each journal, picked from a random issue around December 2014/January 2015. The idea was to pick papers that have maybe receded from memory (and have also accumulated some citation statistics, reported as of Feb. 13, 2017), but are not so old that the titles would be misleading or seem anachronistic. We picked the paper titles pretty much at random: we picked an issue or did a search by date and basically just took the first paper from the list that was in this general area of biomedical science. The idea here was to avoid bias, so there was no attempt to pick “tricky” titles. There was one situation where we looked at an issue of Molecular Systems Biology and the first couple of titles had colons in them, which we felt were perhaps a giveaway that it was not high profile, so we picked another issue. Papers and journals are given in the results below.

The questionnaire itself presented the titles in random order and asked for each whether it was high, medium, or low impact, based on the cutoffs of 0-8, 8-20, 20+. Answering each question was optional, and we asked people to not answer for any papers that they already knew. At least a few people followed that instruction. We posted the questionnaire on Twitter (Twitter Inc.) and let Google (Alphabet) do its collection magic.

Google response analysis here, code and data here.

Results:
In total, we got 338 responses, mostly within the first day or two of posting. First question: how good were people at guessing the impact factor of the journal? Take a look:
[Graph of the distribution of respondents' scores.]
The main conclusion is that people are pretty bad at this game. The average score was around 42%, which was not much above random chance (33%). Also, the best anyone got was 78%. Despite this, it looks like the answers were spread pretty evenly between the three categories, which matches the actual distribution, so there wasn’t a bias towards a particular answer.

Now the question you’ve probably been itching for: how well were people able to guess the journal for specific titles? The answer is that they did well for some and not so well for others. To quantify how well people did, we calculated a “Perception score”, which is the average score given to a particular title, with low = 1, medium = 2, high = 3 (a minimal sketch of this calculation follows the table). Here is a table with the results:


Title | Journal | Impact factor | Perception score
Single-base resolution analysis of active DNA demethylation using methylase-assisted bisulfite sequencing | Nature Biotechnology | 43.113 | 2.34
The draft genome sequence of the ferret (Mustela putorius furo) facilitates study of human respiratory disease | Nature Biotechnology | 43.113 | 1.88
Dietary modulation of the microbiome affects autoinflammatory disease | Nature | 38.138 | 2.37
Cell differentiation and germ–soma separation in Ediacaran animal embryo-like fossils | Nature | 38.138 | 1.77
The human splicing code reveals new insights into the genetic determinants of disease | Science | 34.661 | 2.55
Opposite effects of anthelmintic treatment on microbial infection at individual versus population scales | Science | 34.661 | 1.44
Dynamic shifts in occupancy by TAL1 are guided by GATA factors and drive large-scale reprogramming of gene expression during hematopoiesis | Genome Research | 11.351 | 2.11
Population and single-cell genomics reveal the Aire dependency, relief from Polycomb silencing, and distribution of self-antigen expression in thymic epithelia | Genome Research | 11.351 | 1.81
A high‐throughput ChIP‐Seq for large‐scale chromatin studies | Molecular Systems Biology | 10.872 | 2.22
Genome‐wide study of mRNA degradation and transcript elongation in Escherichia coli | Molecular Systems Biology | 10.872 | 2.02
Browning of human adipocytes requires KLF11 and reprogramming of PPARγ superenhancers | Genes and Development | 10.042 | 2.15
Initiation and maintenance of pluripotency gene expression in the absence of cohesin | Genes and Development | 10.042 | 2.09
Non-targeted metabolomics and lipidomics LC–MS data from maternal plasma of 180 healthy pregnant women | GigaScience | 7.463 | 1.55
Reconstructing a comprehensive transcriptome assembly of a white-pupal translocated strain of the pest fruit fly Bactrocera cucurbitae | GigaScience | 7.463 | 1.25
Asymmetric parental genome engineering by Cas9 during mouse meiotic exit | Scientific Reports | 5.228 | 2.43
Dual sgRNA-directed gene knockout using CRISPR/Cas9 technology in Caenorhabditis elegans | Scientific Reports | 5.228 | 2.25
A hyper-dynamic nature of bivalent promoter states underlies coordinated developmental gene expression modules | BMC Genomics | 3.867 | 2.16
Transcriptomic and proteomic dynamics in the metabolism of a diazotrophic cyanobacterium, Cyanothece sp. PCC 7822 during a diurnal light–dark cycle | BMC Genomics | 3.867 | 1.25

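(For concreteness, here is a minimal, hypothetical sketch of the perception score calculation on toy data; the actual analysis is in the code linked above, and the column names here are made up for illustration.)

# Perception score: mean rating per title, with low/medium/high mapped to 1/2/3.
responses <- data.frame(
  title  = c("Title A", "Title A", "Title B", "Title B"),
  rating = c("low", "medium", "high", "medium"),
  stringsAsFactors = FALSE
)
ratingToScore <- c(low = 1, medium = 2, high = 3)
responses$score <- ratingToScore[responses$rating]
perceptionScore <- tapply(responses$score, responses$title, mean, na.rm = TRUE)
perceptionScore
# Title A Title B
#     1.5     2.5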

In graphical form:
[Plot of perception score vs. journal impact factor.]
One thing really leaps out, which is the “bowtie” shape of this plot: while people, averaged together, tend to get medium-impact papers right, there is high variability in aggregate perception for the low and high impact papers. For the middle-tier, one possibility is that there is a bias towards the middle in general (like an “uh, dunno, I guess I’ll just put it in the middle” effect), but we didn’t see much evidence for an excess of “middle” ratings, so maybe people are just better at guessing these ones. Definitely not the case for the high and low end, though. The two titles apiece from Nature and Science had both high and low perceived impact. Also, the two Scientific Reports papers had very high perceived impact, presumably due to the fact that they have CRISPR in the title.

So what, if anything, makes a paper seem high or low impact? Here’s the same table sorted by perception score. Notice what all the low-scoring titles have in common?


Title | Journal | Impact factor | Perception score
The human splicing code reveals new insights into the genetic determinants of disease | Science | 34.661 | 2.55
Asymmetric parental genome engineering by Cas9 during mouse meiotic exit | Scientific Reports | 5.228 | 2.43
Dietary modulation of the microbiome affects autoinflammatory disease | Nature | 38.138 | 2.37
Single-base resolution analysis of active DNA demethylation using methylase-assisted bisulfite sequencing | Nature Biotechnology | 43.113 | 2.34
Dual sgRNA-directed gene knockout using CRISPR/Cas9 technology in Caenorhabditis elegans | Scientific Reports | 5.228 | 2.25
A high‐throughput ChIP‐Seq for large‐scale chromatin studies | Molecular Systems Biology | 10.872 | 2.22
A hyper-dynamic nature of bivalent promoter states underlies coordinated developmental gene expression modules | BMC Genomics | 3.867 | 2.16
Browning of human adipocytes requires KLF11 and reprogramming of PPARγ superenhancers | Genes and Development | 10.042 | 2.15
Dynamic shifts in occupancy by TAL1 are guided by GATA factors and drive large-scale reprogramming of gene expression during hematopoiesis | Genome Research | 11.351 | 2.11
Initiation and maintenance of pluripotency gene expression in the absence of cohesin | Genes and Development | 10.042 | 2.09
Genome‐wide study of mRNA degradation and transcript elongation in Escherichia coli | Molecular Systems Biology | 10.872 | 2.02
The draft genome sequence of the ferret (Mustela putorius furo) facilitates study of human respiratory disease | Nature Biotechnology | 43.113 | 1.88
Population and single-cell genomics reveal the Aire dependency, relief from Polycomb silencing, and distribution of self-antigen expression in thymic epithelia | Genome Research | 11.351 | 1.81
Cell differentiation and germ–soma separation in Ediacaran animal embryo-like fossils | Nature | 38.138 | 1.77
Non-targeted metabolomics and lipidomics LC–MS data from maternal plasma of 180 healthy pregnant women | GigaScience | 7.463 | 1.55
Opposite effects of anthelmintic treatment on microbial infection at individual versus population scales | Science | 34.661 | 1.44
Reconstructing a comprehensive transcriptome assembly of a white-pupal translocated strain of the pest fruit fly Bactrocera cucurbitae | GigaScience | 7.463 | 1.25
Transcriptomic and proteomic dynamics in the metabolism of a diazotrophic cyanobacterium, Cyanothece sp. PCC 7822 during a diurnal light–dark cycle | BMC Genomics | 3.867 | 1.25

One thing is that the titles at the bottom seem to be longer, and that is borne out quantitatively, although the correlation is perhaps not spectacular:
[Scatter plot of title length vs. perception score.]
Any other features of the title? We looked at specificity (which was the sum of the times a species, gene name or tissue was mentioned), declarativeness (“RNA transcription requires RNA polymerase” vs. “On the nature of transcription”), and mention of a “weird organism”, which we basically defined as anything not human or mouse. Check it out:
[Plots of perception score by title specificity, declarativeness, and mention of a “weird organism”.]
Hard to say much about declarativeness (declariciousness?), not much data there. Specificity is similarly undersampled, but perhaps there is some tendency for medium impact titles to have more specific information than others? Weird organism, however, really showed an effect. Basically, if you want people to think you wrote a low impact paper, put axolotl or something in the title. Notably, each of the high impact journals had one title perceived as high impact and one perceived as low impact, and this “weird organism” metric explained that difference completely. The exception to this is, of course, CRISPR: indeed, the low impact paper with the highest perceived impact was the one about CRISPR in C. elegans. Note that we also counted E. coli as “weird”, although we probably should not have.

We then wondered: does this perception even matter? Does it have any bearing on citations? So many confounders here, but take a look:
[Plot of citations vs. perception score, split by journal tier.]
First off, where you publish is clearly strongly associated with citations, regardless of how your title is perceived. Beyond that, it was murky. Of the high impact titles, the ones with high perception scores definitely were cited more, but the n is small there, and the effect is not there for medium and low impact titles. So who knows.

Discussion:
Our conclusion seems to be that mid-tier journals publish things that sound like they should be in mid-tier journals, perhaps with titles with more specificity. Flashy and non-flashy papers (as judged by actual impact factor) both seem to be playing the same hype game, and some of them screw up by talking about a weird organism.

Anyway, before reading too much into any of this, like we said in the methods section, there are lots of problems with this whole thing. First off, we are vastly underpowered: the total of 18 titles is nowhere near enough to get any real picture of anything but the grossest of trends. It would have been better to have a large number of titles and have the questionnaire randomly select 18 of them, but if we didn’t get enough responses, then we would not have had very good sampling for any particular title. Also, it would have been interesting to have more titles per journal, but we instead opted for more journals just to give a bit more breadth in that respect. Oh well. Some folks also mentioned that 8 is a pretty aggressive cutoff for “low impact”, and that’s probably true. Perception of a journal’s importance and quality is not completely tied to its numerical impact factor, but we think the particular journals we chose would be pretty commonly associated with the tiers of high, medium and low. With all these caveats, should we have given our blog post the more accurate and specific title “Results from the Guess the Impact Factor Challenge in the genomicsy/methodsy subcategory of molecular biology from late 2014/early 2015”? Nah, too boring, who would read that? ;)

We think one very important thing to keep in mind is that what we measured is perceived impact factor. This is most certainly not the same thing as perceived importance. Indeed, we’re guessing that many of you played this game with your cynic hat on, rolling your eyes at obviously “high impact” papers that are probably overhyped, while in the back of your mind remembering key papers in low impact journals. That said, we think there’s probably at least some correspondence between a seemingly high profile title and whether people will click on it—let’s face it, we’re all a bit shallow sometimes. Both of these factors are probably at play in most of us, making it hard to decipher exactly how people made the judgements they did.

Question is what, if anything, should we do in light of this? A desire to “do” something implies that there is some form of systematic injustice that we could either try to fix or, conversely, try to profit from. To the former, one could argue that the current journal system (which we are most definitely not a fan of, to be clear), may provide some role here in “mixing things up”. Since papers in medium and high impact journals get more visibility than those in low impact journals, our results show that high impact journals can give exposure to poorly (or should we say specific or informatively?) titled papers, potentially giving them a citation boost and providing some opportunity for exposure that may not otherwise exist, however flawed the system may be. We think it’s possible that the move to preprints may eliminate that “mixing-things-up” factor and thus increase the incentive to pick the flashiest (and potentially least informative) title possible. After all, let’s say we lived in a fully preprint-based publishing world. Then how would you know what to look at? One obviously dominant factor is who the authors are, but let’s set that aside for now. Beyond that, one other possibility is to try and increase whatever we are measuring with perception score. So perhaps everyone will be writing like that one guy in our field with the crazy bombastic titles (you know who I mean) and nobody will be writing about how “Cas9–crRNA ribonucleoprotein complex mediates specific DNA cleavage for adaptive immunity in bacteria” any more. Hmm. Perhaps science Twitter will basically accomplish the same thing once it recovers from this whole Trump thing, who knows.

Perhaps one other lesson from all of this is that science is full of bright and talented people doing pretty amazing work, and not everybody will get the recognition they feel they deserve, though our results suggest that it is possible to manipulate at least the initial perception of our work somewhat. A different question is whether we should care about such manipulations. It is simplistic to say that we should all just do the work we love and not worry about getting recognition and other outward trappings of success. At the same time, it is overly cynical to say that it’s all just a rat race and that nobody cares about the joy of scientific discovery anymore. Maybe happiness is realizing that we are most accurately characterized by living somewhere in the middle… :)

Friday, January 22, 2016

Thoughts on the NEJM editorial: what’s good for the (experimental) goose is good for the (computational) gander

Huge Twitter explosion about this editorial in the NEJM about “research parasites”. Basically, the authors say that computational people interested in working with someone else’s data should work together with the experimenters (which, incidentally, is how I would approach something like that in most cases). Things get a bit darker (and perhaps more revealing) when they also call out “research parasites”–aka “Mountain Dew chugging computational types”, to paraphrase what I’ve heard elsewhere–who are to them just people sitting around, umm, chugging Mountain Dew while banging on their computers, stealing papers from those who worked so hard to generate these datasets.

So this NEJM editorial is certainly wrong on many counts, and I think that most people have that covered. Not only that, but it is particularly tone-deaf: “… or even use the data to try to disprove what the original investigators had posited.” Seriously?!?

The response has been particularly strong from the computational genomics community, who are often reliant on other people’s data. Ewan Birney had a nice set of Tweets on the topic, first noting that “For me this is the start of clinical research transitioning from a data limited to an analysis limited world.”, noting further that “This is what mol. biology / genomics went through in the 90s/00s and it’s scary for the people who base their science on control of data.” True, perhaps.

He then goes on to say: “1. Publication means... publication, including the data. No ifs, no buts. Patient data via restricted access (bonafide researcher) terms.”

Agreed, who can argue with that! But let’s put this chain of reasoning together. If we are moving to an “analysis limited world”, then it is the analyses that are the precious resource. And all the arguments for sharing data are just as applicable to sharing analyses, no? Isn’t the progress of science impeded by people not sharing their analyses? This is not just an abstract argument: for example, we have been doing some ATAC-seq experiments in the lab, and we had a very hard time finding out exactly how to analyze that data, because there was no code out there for how to do it, even in published papers (for the record, Will Greenleaf has been very kind and helpful via personal communication, and this has been fine for us).

What does, say, Genome Research have to say about it? Well, here’s what they say about data:
Genome Research will not publish manuscripts where data used and/or reported in the paper is not freely available in either a public database or on the Genome Research website. There are no exceptions.
Uh, so that’s pretty explicit. And here’s what they say about code:
Authors submitting papers that describe or present a new computer program or algorithm or papers where in-house software is necessary to reproduce the work should be prepared to make a downloadable program freely available. We encourage authors to also make the source code available.
Okay, so only if there’s some novel analysis, and then only if you want to or if someone asks you. Probably via e-mail. To which someone may or may not respond. Hmm, kettle, the pot is calling…

So what happens in practice at Genome Research? I took a quick look at the first three papers from the current TOC (1, 2, 3).

The first paper has a “Supplemental PERL.zip” that contains some very poorly documented code in a few files and, as far as I can tell, is missing a file called “mcmctree_copy.ctl” that I’m guessing is pretty important to running the mcmctree algorithm.

The third paper is perhaps the best, with a link to a software package that seems fairly well put together. But still, no link to the actual code to make the actual figures in the paper, as far as I can see, just “DaPars analysis was performed as described in the original paper (Masamha et al. 2014) by using the code available at https://code.google.com/p/dapars with default settings.”

The second paper has no code at all. They have a fairly detailed description of their analysis in the supplement, but again, no actual code I could run.

Aren’t these the same things we’ve been complaining about in experimental materials and methods forever? First paper: missing steps of a protocol? Third paper: vague prescription referencing a previous paper and a “kit”? Second paper: just a description of how they did it, just like, you know, most “old fashioned” materials and methods from experimental biology papers.

Look, trust me, I understand completely why this is the case in these papers, and I’m not trying to call these authors out. All I’m saying is that if you’re going to get on your high horse and say that data is part of the paper and must be distributed, no ifs, no buts, well, then distribute the analyses as well–and I don’t want to hear any ifs or buts. If we require authors to deposit their sequence data, then surely we can require that they upload their code. Where is the mandate for depositing code on the journal website?

Of course, in the real world, there are legitimate ifs and buts. Let me anticipate one: “Our analyses are so heterogeneous, and it’s so complicated for us to share the code in a usable way.” I’m actually very sympathetic to that. Indeed, we have lots of data that is very heterogeneous and hard to share reasonably–for anyone who really believes all data MUST be accessible, well, I’ve got around 12TB of images for our next paper submission that I would love for you to pay to host… and that probably nobody will ever use. Not all science is genomics, and what works in one place won’t necessarily make sense elsewhere. (As an aside, in computational applied math, many people keep their codes secret to avoid “research parasites”, so it’s not just data gatherers who feel threatened.)

Where, might you ask, is the moral indignation on the part of our experimental colleagues complaining about how computational folks don’t make their codes accessible? First off, I think many of these folks are in fact annoyed (I am, for instance), but are much less likely to be on Twitter and the like. Secondly, I think that many non-computational folks are brow-beaten by p-value toting computational people telling them they don’t even know how to analyze their own data, leading them to feel like they are somehow unable to contribute meaningfully in the first place.

So my point is, sure, data should be available, but let’s not all be so self-righteous about it. Anyway, there, I said it. Peace. :)

PS: Just in case you were wondering, we make all our software and processed data available, and our most recent paper has all the scripts to make all the figures–and we’ll keep doing that moving forward. I think it's good practice; my point is just that reasonable people could disagree.

Update: Nice discussion with Casey Bergman in the comments.
Update (4/28/2016): Fixed links to Genome Research papers (thanks to Quaid Morris for pointing this out). Also, Quaid pointed out that I was being unreasonable, and that 2/3 actually did provide code. So I looked at the next 3 papers from that issue (4, 5, 6). Of these, none of them had any code provided. For what it's worth, I agree with Quaid that it is not necessarily reasonable to require code. My point is that we should be reasonable about data as well.

Tuesday, December 22, 2015

Reviewing for eLife is... fun?

Most of the time, I find reviewing papers to be a task that, while fun-sounding in principle, often becomes a chore in practice, especially if the paper is really dense. Which is why I was sort of surprised that I actually had some fun reviewing for eLife just recently. I've previously written about how the post-review harmonization between reviewers is a blessing for authors because it's a lot harder to give one of those crummy, ill-considered reviews when your colleagues know it's you giving them. Funny thing is that it's also fun for reviewers! I really enjoy discussing a paper I just read with my colleagues. I feel like that's an increasingly rare occurrence, and I was happy to have the opportunity. Again, well done eLife!

Saturday, December 19, 2015

Will reproducibility reduce the need for supplementary figures?

One constant refrain about the kids these days is that they use way too much supplementary material. All those important controls, buried in the supplement! All the alternative hypotheses that can’t be ruled out, buried in the supplement! All the “shady data” that doesn’t look so nice, buried in the supplement! Now papers are just reduced to ads for the real work, which is… buried in the supplement! The answer to the ultimate question of life, the universe and everything? Supplementary figure 42!

Whatever. Overall, I think the idea of supplementary figures makes sense. Papers have more data and analyses in them than before, and supplementary figures are a good way to keep important but potentially distracting details out of the way. To the extent that papers serve as narratives for our work as well as documentation of it, it’s important to keep that narrative as focused as possible. Typically, if you know the field well enough to know that a particular control is important, then you likely have sufficient interest to go to the trouble of digging it up in the supplement. If the purpose of the paper is to reach people outside of your niche–which most papers in journals with big supplements are attempting to do–then there’s no point in having all those details front and center.

(As an extended aside/supplementary discussion (haha!), the strategy we’ve mostly adopted (from Jeff Gore, who showed me this strategy when we were postdocs together) is to use supplementary figures like footnotes, like “We found that protein X bound to protein Y half the time. We found this was not due to the particular cross-linking method we used (Supp. Fig. 34)”. Then the supplementary figure legend can have an extended discussion of the point in question, no supplementary text required. This is possible because unlike regular figure legends, you can have interpretation in the legend itself, or at least the journal doesn’t care enough to look.)

I think the distinction between the narrative and documentary role of a paper is where things may start to change with the increased focus on reproducibility. Some supplementary figures are really important to the narrative, like a graph detailing an important control. But many supplementary figures are more like data dumps, like “here’s the same effect in the other 20 genes we analyzed”. Or showing the same analysis but on replicate data. Another type of supplementary figure has various analyses done on the data that may be interesting, but not relevant to the main points of the paper. If not just the data but also the analysis and figures are available in a repository associated with the paper, then is there any need for these sorts of supplementary figures?

Let’s make this more concrete. Let’s say you put up your paper in a repository on github or the equivalent. The way we’ve been doing this lately is to have all processed data (like spot counts or FPKM) in one folder, all scripts in another, and when you run the scripts, they take the processed data, analyze it, and put all the outputted graphical elements into a third folder (with subfolders as appropriate). (We also have a “Figures” folder where we assemble the figures from the graphical elements in Illustrator; more in another post.) Let’s say that we have a side point about the relative spatial positions of transcriptional loci for all the different genes we examined in a couple different datasets; e.g., Supp Figs. 16 and 21 of this paper. As is, the supplementary figures are a bit hard to parse because there’s so much data, and the point is relatively peripheral. What if instead we just pointed to the appropriate set of analyses in the “graphs” folder? And that folder could have a large number of other analyses that we did that didn’t even make the cut for the supplement. I think this is more useful than the supplement as normally presented and more useful than just the raw data, because it also contains additional analyses that may be of interest–and my guess is that these analyses are actually far more valuable than the raw data in many cases. For example, Supp Fig. 11 of that same paper shows an image with our cell-cycle determination procedure, but we had way more quantitative data that we just didn’t show because the supplement was already getting insane. Those analyses would be great candidates for a family of graphs in a repository. Of course, all of this requires these analyses to be well documented and browsable, but again, not sure that’s any worse than the way things are now.
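To make that concrete, here is a hypothetical sketch of the kind of script that would populate such a graphs folder (the file, column, and folder names are invented for illustration):

# Read processed data and write one graph per gene into a subfolder of graphs/,
# which the paper can then point to instead of a supplementary figure.
dat <- read.csv("processedData/spotCounts.csv")

outDir <- "graphs/transcriptionalLociPositions"
dir.create(outDir, recursive = TRUE, showWarnings = FALSE)

for (gene in unique(dat$geneName)) {
  sub <- dat[dat$geneName == gene, ]
  pdf(file.path(outDir, paste0("positions_", gene, ".pdf")))
  hist(sub$distanceToNuclearEdge,
       main = gene, xlab = "Distance from locus to nuclear edge (um)")
  dev.off()
}

The point is that the repository then contains every analysis we ran, browsable by folder, not just the handful that made it into the supplement.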

Now, I’m not saying that all supplementary figures would be unnecessary. Some contain important controls and specific points that you want to highlight, e.g., Supp. Fig. 7–just like an important footnote. But analyses of data dumps, replicates, side points and the such might be far more efficiently and usefully kept in a repository.

One potential issue with this scheme is hosting and versioning. Most supplementary information is currently hosted by journals. In this repository-based future, it’s up to Bitbucket or Github to stick around, and the authors are free to modify and remove the repository if they wish. Oh well, nothing’s permanent in this world anyway, so I’m not so worried about that personally. I suppose you could zip up the whole thing and upload it as a supplementary file, although most supplementary information has size restrictions. Not sure about the solution to that.

Part of the reason I’ve been thinking about this lately is because Cell Press has this very annoying policy that you can’t have more supplementary figures than main figures. This wreaked havoc with the “footnote” style we originally used in Olivia’s paper, because now you have to agglomerate smaller, more focused supplementary figures into huge supplementary mega-figures that are basically a hot figure mess. I find this particularly ironic considering that Cell’s focus on “complete stories” is probably partially to blame for the proliferation of supplementary information in our field. I get that the idea is to reduce the amount of supplementary information, but I don’t think the policy accomplishes this goal and only serves to complicate things. Cell Press, please reconsider!

Monday, September 7, 2015

Another option for how to shop your paper around

I had a very interesting conversation with a journal editor recently. Normally, when your paper gets solid reviews but gets rejected for “impact” reasons or whatever, the journal will try to funnel you into one of their family journals (“just click the link… just click the link…”). Good deal for them: they get to keep a solid paper, boost a new journal, maybe collect revenue from their open access honeypot, all without much additional work. Good deal for you? Maybe yes, maybe no. But here’s the thing the editor told me: if you got good reviews from some other journal, just take those reviews and send it in with your paper to our journal! Often, if there are no technical flaws, they can accept right away, maybe send for one additional reviewer just to double check. Sort of a personally-managed transfer.

There probably are some thorny ethical or legal issues with doing this, and I have not done it myself. Then again, my feeling, which is completely a guess based on anecdotes, is that some journals are increasingly sending out papers to review that they have no intention of publishing themselves, but want to capture into their family journals. (One thing is that it’s probably easier for editors to get good reviewers that way.) So I'm not sure anyone's hands are clean. Publishing is so demoralizing these days that I think you just do what you have to do.

Anyway, just another option to pass the time until a future of pre-print awesomeness arrives. Maybe we can then just send community feedback to the journal and be done with it!

Tuesday, August 11, 2015

The impact-factor introduction

Last week, I went to the Penn MSTP retreat (for MD/PhD and VMD/PhD students), which was really cool. It truly is The Best MSTP Program in the Galaxy™, with tons of very talented students, including, I'm proud to say, four in our lab! There was lots of interesting and inspiring science in talks and posters throughout the day, and I also got to meet with a couple of cool incoming students, which is always a pleasure.

One thing I noticed several times, however, was the pernicious habit of mentioning what journals folks in the program were publishing in or somehow associated with, emphasizing, of course, the fancy ones like Nature, etc. I noticed this in particular in the introduction of the keynote speaker, Chris Vakoc (Penn alum from Gerd Blobel's lab), because the introduction only mentioned where his work was published and didn't say anything about what science he actually did! I feel it bears mentioning that Chris gave a magnificent talk about his work on chromatin and cancer, including finding an inhibitor that actually seems to have cured a patient of leukemia. That's real impact.

I've seen these "impact-factor introductions" outside of the MSTP retreat a few times as well, and it really rubs me the wrong way. Frankly, being praised for the journals you've published in is just about the worst praise one could hope for. In a way, it's like saying "I don't even care enough to learn about what you do, but it seems like some other people think it's good". Remember, "where" we publish is just something we invent to separate out the mostly uninteresting science from the perhaps-marginally-less-likely-to-be-uninteresting-but-still-mostly-uninteresting science. If you actually are lucky enough to do something really important, it won't really matter where it's published.

What was even more worrisome was that the introduction for the speaker came from a (very well-intentioned) trainee. I absolutely do not want to single out this trainee, and I am certain the trainee knows about Chris's work and holds it in high regard. Rather, I think the whole thing highlights a culture we have fostered in which trainees have come to value perceived "impact" more than science itself. As another example, I remember bumping into a (non-MSTP) student recently and mentioning that we had just published a paper, and rather than first asking what it was about, they only asked about where it was published! I think that's frightening, and shows that our trainees are picking up the worst form of scientific careerism from us. Not that I'm some sort of saint, either. I found it surprising to read BioRxiv recently and feel a bit disoriented without a journal name on the paper to help me know whether a paper was worth reading. Hmm. I'm clearly still in recovery.

Now, I'm not an idealist, nor particularly brave. I still want to publish papers in glossy journals for all the same reasons everyone else does, mostly because it will help ensure someone actually reads our work, and because (whether I like it or not) it's important for trainees and also for keeping the lab running. I also personally think that this journal hierarchy system has arisen for reasons that are not easy to fix, some of which are obvious and some less so. More ideas on that hopefully soon. But in the meantime, can we all at least agree not to introduce speakers by where they publish?

Incidentally, the best introduction I've ever gotten was when I gave a talk relatively recently and the introducer said something like "... and so I'm excited to hear Dr. Raj talk about his offbeat brand of science." Now that's an introduction I can live up to!