RajLab: January 2013

Sunday, January 27, 2013

Passive voice in scientific writing

I hate passive voice in scientific writing. I'm certainly not alone in this–many others feel the same way, and even the style guide for a journal we just submitted to explicitly says to avoid passive voice. The rationale for passive voice is pretty weak. Usually, it takes the form of some pompous arguments like "One must divorce oneself from the science" or even "it promotes precision and logic in writing, because it's more difficult." Think about that last argument for a bit! Also, politicians tend to use the passive voice to avoid placing (and taking) blame; e.g., the ubiquitous "Mistakes were made". If you can avoid being like a politician, that's always a good thing.

Anyway, no point in adding any more about the virtues of active over passive voice in scientific writing in general. A quick Google search will point you to many sources. What I thought might be interesting is to consider the use of the passive voice in specific situations that often come up in scientific writing, and some cases in which it might be okay. I think the main point to consider is that passive voice removes the subject from the sentence. Active voice can force you to think about who that subject is.

Case 1, The Invisible Hand: This is something that schools often teach students in lab classes. For example, "The tube was then placed in an ice cold water bath for 20 minutes." This one is easy. Just say "We placed the tube..." Easier to read and understand, period.

Case 2, Magical Mystery Knowledge: This one often shows up when people cite previous work. Examples include "It is well known that...", "It has been shown that...", etc. This is particularly silly if you have a case in which you are citing a single study that demonstrates the fact in question. In that case, just say "Jane Doe et al. showed that circus ponies are more likely to..." That way, you can directly state who said what, which is a nice thing to do. This strategy becomes more complex when you want to cite multiple studies, because listing out each individual study is a pain. I often opt for something like "Researchers have shown that circus ponies are more likely to..." This convention makes the writing sound a bit less pompous, which I think in and of itself increases clarity somewhat.

But I think considering what the subject of the sentence is can often lead to writing that actually increases the relevant information content of the sentence. For instance, maybe the subject of the sentence is not the researcher, but the method they employ: "A statistical analysis of circus livestock showed that circus ponies are more likely to..." Or "A random pony survey showed that circus ponies are more likely to..." Each of these sentences conveys more information (and, in fact, different information in each sentence) than the original or even the "Researchers have shown..." formulation, all without really increasing total word count. If there are multiple methods and you want to show the result is robust, you can either cite each method explicitly, or just say "Multiple methods have shown..." No matter what, putting the cited result in context increases the information content of the sentence and makes the writing sound better.

Case 3, The Unknown Mechanism: This is actually one of the trickier situations, because here the subject may literally be unknown. For example, from a recent project: "In the embryo, the processes of gene expression and cell proliferation are coordinated." Now, if you know what mechanism does the coordinating, just say it. "In the embryo, a tiny little nano-homunculus named Bob coordinates gene expression and cell proliferation." Done.

But what if you don't know the mechanism? In that case, the subject of the sentence (the mechanism) is actually unclear, and so one can make the case that the passive formulation is okay. I sort of agree, but there are some alternatives:

"An unknown mechanism coordinates gene expression and cell proliferation."

Sounds a little weird, like the sentence is backward. I think this is because most scientific sentences are more successful when you give a fact and then some interpretation. Ironically, I think a passive formulation may feel a bit more "forward", such as

"Gene expression and cell proliferation in the embryo are coordinated by an unknown mechanism."

I like this perhaps best of all, because at least you are pointing out that the mechanism is unknown.

There are also some problematic alternatives:

"Nobody knows what coordinates gene expression and cell proliferation in the embryo."

The issue here is that you have taken an explicit statement of fact (that the two processes are coordinated) and now made it implicit and assumed. Not so great.

"The embryo coordinates gene expression and cell proliferation."

Subtle case. The issue here is that the sentence says that it is the embryo itself that is doing the coordination. Now, on some level, that is true. But in saying it this way, we are sort of implying that the embryo itself somehow "wants" to do this coordination. While such a notion is plausible, we don't know this is the case–all we know is the fact that the processes are coordinated. Gautham and I had some arguments on this one (with me arguing this sentence is okay), but upon further reflection I think Gautham is right (often the case).

You'll notice that I often use the words "are coordinated" in discussing this. I guess this is a situation where passive voice is okay by me. Maybe the best thing to do is to just spend one's time figuring out the mechanism itself... :)

Case 4, The Random Flip: This is a particularly easy-to-avoid case in which the writer has a subject for the sentence, but doesn't use it as the subject, instead putting it in an "is done by" construction. In this case, the writer is just using passive voice for its own sake. An example is "RNA splicing is performed by the spliceosome." Just say "The spliceosome performs RNA splicing." I think this sometimes happens because people want a sentence to seem fancier than it is so that the sentence can stand on its own. The active version sounds sort of simplistic, even though the information content is literally the same. Clarity is a good thing, though! Also, the active version's simplicity allows one to easily build on it to include further information: "The spliceosome performs RNA splicing through the interaction of a large number of enzymes." Much better than "RNA splicing is performed by the spliceosome, which mediates the process through the interaction of a large number of enzymes."

Case 5, It Just Sounds Better: Sometimes, though, I have to admit that passive voice just sounds better than active voice. This is particularly the case when the sentence has a fairly complex subject. For example, from our recent paper:

"Researchers generally believe that the transcription of a gene's DNA into RNA is controlled by the interaction of regulatory proteins with DNA sequences proximal to the gene itself."

I suppose we could have said

"Researchers generally believe that the interaction of regulatory proteins with DNA sequences proximal to the gene itself controls the transcription of a gene's DNA into RNA."

But that somehow sounds very verbose, and I think the passive formulation is easier to read and understand.

Overall, though, I think that the vast majority of the time, eliminating passive voice from our writing will make it better. Of course, this is all just my opinion. Thoughts welcome!

Saturday, January 19, 2013

Things Marshall hates

Here are some things that Marshall hates (by no means an exhaustive list):

- Excel spreadsheets
- Annoying finance guys talking about Excel spreadsheets
- When the waitstaff at a restaurant ask "Have you dined with us before?"
- Hi-C
- Medical privacy
- Most papers
- The Beatles
- Tyson Bees
- Fusion cuisine
- NorCal
- MATLAB

Saturday, January 5, 2013

When everyone's sequencing

Lately in the lab, we've been doing some sequencing, which has got me thinking about, well, sequencing. One thing I'm wondering is how this changes the nature of the independent research group. As the technique itself becomes commoditized, where does a research group retain an edge? For instance, if you do high-end microscopy, then you have an "edge" in the sense that it's probably hard for most other groups to do it. If you do sequencing, basically anyone can send their RNA to a company to sequence and get the same sort of results. I think that's actually a good thing, overall, but I guess it just means that you really need to be careful about what you decide to sequence, especially in a small lab with limited resources. Ideally something important but somehow not obvious. And then you pray that nobody else is doing the exact same thing. Because they probably are.

Friday, January 4, 2013

Some R tips

Greetings, present or future UseRs! Last night I did a typical data processing task similar to the ones I used to have to do when I began working on the Nanog project. But this time it took a few hours, instead of a day and many more lines of bookkeeping code. A good part of this has to do with packages that extend the capabilities of base R, packages that seemed irrelevant when I first started learning how to do this stuff. Here they are:

String manipulation

library(stringr)

Allows for easy, sensible and consistent vectorized string operations, including anything you might want to do with regular expressions. Documentation: Look at the help pages for the stringr package. The function names transparently reveal capability.
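To give a flavor of what "vectorized" means here, a minimal sketch follows. The sample labels are made up for illustration and are not from any real dataset; the functions themselves (str_detect, str_match, str_split_fixed, str_replace) are part of stringr.

```r
library(stringr)

# Hypothetical sample labels, invented for this example
samples <- c("nanog_rep1_0h", "nanog_rep2_6h", "oct4_rep1_12h")

# Every call below works on the whole character vector at once

# Which labels start with "nanog"?
is_nanog <- str_detect(samples, "^nanog")

# Pull out the time point: column 2 of str_match holds the first capture group
hours <- as.numeric(str_match(samples, "_([0-9]+)h$")[, 2])

# Split each label into exactly three fields (returns a character matrix)
parts <- str_split_fixed(samples, "_", 3)

# Drop the replicate tag from each label
no_rep <- str_replace(samples, "_rep[0-9]+", "")
```

Because every function takes the string vector as its first argument and returns something of matching length, the calls chain together without any loops or index bookkeeping.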

Data wrangling

A lot of work goes into massaging data, your own, or someone else's, into different formats. This is especially relevant if you have a bunch of measurements that fall under mutually exclusive categories. For example, gene expression from a time series with and without a drug (two categories - time, drug). Or ChIP binding affinities for a bunch of transcription factors to every gene.

library(reshape2)

Fundamentally, this package gives you a sensible way to switch information between being encoded as row entries and being encoded as column names. I know, why go through the trouble, right? I thought so too. But after I learned it, I found a huge number of applications. And I actually store most of my data as 'melted' data now. The documentation and philosophy are in Chapter 2 of the creator's thesis. http://had.co.nz/thesis/practical-tools-hadley-wickham.pdf
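Here is a small sketch of the row-entries-versus-column-names idea, using a toy table I made up (the gene names and numbers are placeholders, not real measurements). melt() and dcast() are the two reshape2 functions doing the work.

```r
library(reshape2)

# Hypothetical "wide" table: one row per gene, one column per time point
wide <- data.frame(gene = c("Nanog", "Oct4"),
                   t0 = c(1.0, 2.0),
                   t6 = c(1.5, 1.8))

# melt(): the column names t0 and t6 become entries in a "time" column,
# giving one row per (gene, time) measurement
long <- melt(wide, id.vars = "gene",
             variable.name = "time", value.name = "expression")

# dcast(): the entries of "time" become column names again
wide_again <- dcast(long, gene ~ time, value.var = "expression")
```

The melted form (long) is the one that is convenient to store, since adding another category such as drug treatment just means adding another column rather than inventing new column names.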

library(plyr)

If you want to split up data by category, do something to it (possibly summarize it), and put it back together, converting to and from a variety of formats, this is your ticket. This package makes obsolete almost every for loop and a whole bunch of now-pointless bookkeeping code I had to write, plus it makes my intent very clear to myself. Documentation: (another one of Mr. Wickham's creations) http://www.jstatsoft.org/v40/i01/paper
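As a rough illustration of split, apply, and combine, here is a sketch with an invented melted data frame (the column names and values are assumptions for the example). ddply() takes a data frame in and gives a data frame back; summarise is the plyr helper that builds one summary row per group.

```r
library(plyr)

# Hypothetical melted data: expression by gene and drug treatment
expr <- data.frame(
  gene = rep(c("Nanog", "Oct4"), each = 4),
  drug = rep(c("none", "drugA"), times = 4),
  expression = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Split by gene and drug, summarize each piece, recombine into one data frame
means <- ddply(expr, .(gene, drug), summarise,
               mean_expression = mean(expression),
               n = length(expression))
```

The for-loop version of this would need to enumerate the gene and drug combinations, preallocate a result, and fill it in by index; here the grouping variables in .(gene, drug) state the intent directly.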

library(data.table)

If you have a huge number of categories, like if your categories are gene names, plyr can be slow. You can trade some flexibility for a speedup of sometimes 1000x by using data.table. Good for simple calculations like within-group averaging or finding maxima. Documentation: various vignettes and FAQs are available on the web.
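A minimal sketch of within-group averaging and maxima with data.table, again on made-up numbers. With only three genes there is obviously no speed difference to see; the point is just the syntax, which stays the same when the gene column has tens of thousands of distinct values.

```r
library(data.table)

# Hypothetical measurements, three per gene
dt <- data.table(
  gene = rep(c("Nanog", "Oct4", "Sox2"), each = 3),
  expression = c(1, 2, 3, 10, 20, 30, 5, 5, 8)
)

# One row per gene: the group mean and the group maximum
by_gene <- dt[, .(mean_expression = mean(expression),
                  max_expression = max(expression)),
              by = gene]
```

The general form is dt[i, j, by]: choose rows with i, compute j, and do it separately for each group in by.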

Plotting

library(ggplot2)

Hard to describe this package. It is a totally different way of specifying plots than I am used to in a command line environment. It is almost like a command line version of bringing up the plot wizard in Excel, except it has many more options, and the output with default settings is even more stunning and efficient than base R graphics. Another package that makes obsolete a lot of painstaking graphics adjustment, subplot accounting, and complications in changing what kind of plot I want. Documentation is in chapter 3 of the creator's thesis, once again: http://had.co.nz/thesis/practical-tools-hadley-wickham.pdf
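Since it is hard to describe, a small sketch may help. The data frame below is invented for illustration (a fake time series with and without a drug); the functions are standard ggplot2. You map columns to visual properties with aes() and then add layers with +.

```r
library(ggplot2)

# Hypothetical melted time series, with and without a drug
expr <- data.frame(
  time = rep(c(0, 6, 12), times = 2),
  drug = rep(c("none", "drugA"), each = 3),
  expression = c(1.0, 1.4, 2.1, 1.0, 0.8, 0.5)
)

# Map columns to axes and colour, then add a line layer and a point layer
p <- ggplot(expr, aes(x = time, y = expression, colour = drug)) +
  geom_line() +
  geom_point() +
  labs(x = "Time (h)", y = "Expression (a.u.)")

# Splitting into one subplot per drug is a one-line change
p_facets <- p + facet_wrap(~ drug)

# print(p) or print(p_facets) draws the plot on the current device
```

Note that the input is melted data of the kind reshape2 produces, which is a large part of why these packages work so well together.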