Friday, February 8, 2013

Single molecule RNA FISH FAQ

UPDATE: I've expanded this post to an entire website.  Check it out, including its extensive FAQ section!

So we often get questions about how we know single molecule RNA FISH is working the way we think it is.  A LOT of questions.  Seriously, a lot.  At this point, we've got a fairly extensive list of canned answers, and so I thought it might be useful to post them all in one place for people.

Q: How do you know that the spots you are detecting are single RNA molecules?  Couldn't they be conglomerates?

A: This is a good question, and one that has a variety of answers.  Many of the control experiments that Sanjay did are in his excellent Vargas et al. PNAS 2005 paper.  One (beautiful, in my mind) experiment that Sanjay did was the following.  He synthesized a bunch of target RNA in vitro and split it into two tubes.  In these tubes, he labeled the RNA with probes, with the RNA in each tube labeled with a different dye (say, red or green).  Then he combined the two tubes, so he had one tube with RNA that was labeled with either red probe or green probe, but not both.  He then injected these into the cell and observed.  If the RNA were forming conglomerates, then you would expect yellow blobs containing both red and green RNA.  If they were single molecules, though, you would expect the spots to be either red or green but never both.  The latter is what he observed.  You might question whether this holds for endogenous RNA, but he also expressed that RNA endogenously and compared spot intensities, and they were the same.  This means that the endogenous RNA was also in single particles.  Nice!  There are definitely caveats to this, and technically it applies only to this RNA, but whatever, I think this is pretty solid.

There are other things you can do.  One is to measure the fluorescence intensity of the spots and show that you get a unimodal distribution of intensities.  Pretty weak in my mind, because if you had some spots with two RNAs and some with one, the peaks would overlap so much that the distribution would probably look unimodal anyway.  But what do I know.

To me, some of the strongest evidence comes from new results from Eric Lubeck and Long Cai (Lubeck and Cai, Nat Meth 2012).  They use super-resolution microscopy to actually read out a barcode of different colors along a single RNA molecule.  Think about how cool that is for a minute!  Anyway, it's very hard to imagine that conglomerates of RNA would show anything like that.  I think Sanjay has some other similar experiments that corroborate this.

Q: How do you know you're getting all the RNA in the cell?

A: Honest answer: no idea.  What we have done to get at this is compare to qRT-PCR data, for whatever that's worth.  I think Singer first did this in Femino et al. Science 1998, and Vargas et al. PNAS 2005 has a nice demonstration as well.  In those cases, you can try to use absolute standard curves to get an actual average number of RNA molecules per cell via RT-PCR and compare to what you get by molecule counting via RNA FISH.  In Vargas et al. PNAS 2005, we got a pretty close correspondence, with the numbers coming within 30% of each other.  But given all the vagaries associated with RT-PCR (RT efficiency, PCR efficiency, measurement error, etc.), I'm sort of amazed this number came out so close.  I think others have shown the same thing with RT-PCR, and so I guess that's pretty good evidence.  Many have shown (e.g., Raj et al. Nat Meth 2008) that fold changes in RNA counts are similar when comparing RNA FISH to RT-PCR, but I'm not sure what that really tells you about detection efficiency except that it's the same (maybe good, maybe bad) in both conditions.

Some will tell you that you can detect the same transcript with two different probe sets and look for colocalization between the colors.  The idea is that if you detect with both colors, that means that your efficiency is high.  I don't think that actually makes sense–if you have an RNA that is inaccessible for whatever reason, this control tells you nothing, and if you have an RNA that is accessible, then a single color will probably detect it.  This two color colocalization approach is good for specificity, though...

Q: How do you know your probes are detecting the right RNA?

A: This is where the two color test comes in handy.  What you can do is label every other oligo with a different fluorophore (i.e., R,G,R,G,R,G,R,G...).  If the signals colocalize, that is pretty good evidence that you're detecting the right RNA, since it is very unlikely that a whole bunch of different oligos are all binding to the same incorrect target.  Usually, you don't need to do this, because if you get good signal in a single color, you are almost certainly detecting the right thing.  However, if you are seeing bright transcription sites, they could potentially be off-target because even a single oligo can light those up.  If you are doing analysis of those sites, you will probably want to check things out this way.  Also, lincRNAs are very prone to these sorts of issues and you should really check those out with this "odds and evens" approach (we'll have a paper on this soon).

Q: What is the hybridization efficiency of each oligo?

A: Lubeck and Cai estimated a hybridization efficiency of around 60-70%, and we have seen similar numbers.  Hard to know for sure why it's not 100%, but whatever, if you get enough oligos, you'll be fine.

Q: How do you know ribosomes are not preventing RNA detection?

A: In the Raj et al. Nat Meth 2008 paper, we simultaneously targeted both the open reading frame (ORF) and the 3' untranslated region (UTR) with differently colored probes and saw good colocalization.  Ribosomes should bind to the ORF but not the 3' UTR, so if the ribosomes were causing a problem, we would have noticed many more spots with the 3' UTR probes.

Q: How do you know secondary structure is not a problem?

A: In some of our early experiments (Raj et al. PLoS Bio 2006), we targeted oligos to the PP7 RNA hairpin, which is a very strong secondary structure, and saw great signal.  Same for targeting MS2 RNA hairpins.  So I'm not so worried about it.

Well, hope this helps someone somewhere.  If you're a Ph.D. student doing RNA FISH, you should definitely memorize the answers to these questions–could really help you out in your quals!

Tuesday, February 5, 2013

"Tidy data"

http://vita.had.co.nz/papers/tidy-data.pdf

The article linked above discusses a common but often undiagnosed source of unnecessary effort in data analysis (untidy data), explains what 'tidy data' looks like, and illustrates some tools that help you make the change.

Keeping data tidy saves a lot of effort. "Tidy data" is not a table format that is visually pleasing for a presentation. It is the format you'd most like data to be in for manipulations. In fact, storing data in formats that make for visually pleasing tables usually makes them especially difficult for other folks to use within programming-style analysis tools like R and Matlab. I was reminded of this when a coworker asked for help turning his manual Excel workflow into an automated Matlab workflow.

After trying to get all kinds of different types of data incorporated into an analysis related to my current project, often from the Supplementary Info in scientific papers, I've found that the less creative the authors are with their data presentation, the easier the job is.

Excel unintentionally encourages the basic problem. Since you constantly see the data, and there are all kinds of features to make borders, change fonts, join cells and pretty things up, it is hard to resist the temptation to make it into a pretty table. So your workflow looks like this:

data  -->   presentable table   ( usually stored in an Excel file and given as Supplementary Info.)
presentable table  -->   Analysis and Graphics

That last step is hard because most presentable data is not readily amenable to downstream analysis. If instead you program your data analysis (or make use of Excel's more advanced features like pivot tables), your workflow can look like this:

data  -->   presentable table
data  -->   Analysis and Graphics

As it turns out, liberating yourself from the need to have your data look presentable on its own lets you structure it in a way that makes for rapid and painless plotting and analysis.  Optimize the data format for manipulability, and save your time and others'.
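As a tiny illustration (a sketch in R, with made-up gene names and expression numbers), keeping the data in one tidy "one row per measurement" table lets you derive both the presentable table and the analysis with a one-liner each:

```r
# Tidy format: one row per measurement (gene names and values are made up)
d <- data.frame(gene = rep(c("Sox2", "Nanog"), each = 2),
                time = rep(c(0, 60), 2),
                expression = c(5.1, 6.0, 2.3, 1.8))

# data --> presentable table: genes as rows, time points as columns
xtabs(expression ~ gene + time, data = d)

# data --> analysis: mean expression per gene, from the same tidy table
aggregate(expression ~ gene, data = d, FUN = mean)
```

Both outputs come from the same underlying table, so nothing ever has to be un-prettified before analysis.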

- Gautham

Sunday, January 27, 2013

Passive voice in scientific writing


I hate passive voice in scientific writing.  I'm certainly not alone in this–many others feel the same way, and even the style guide for a journal we just submitted to explicitly says to avoid passive voice.  The rationale for passive voice is pretty weak.  Usually, it takes the form of some pompous arguments like "One must divorce oneself from the science" or even "it promotes precision and logic in writing, because it's more difficult."  Think about that last argument for a bit!  Also, politicians tend to use the passive voice to avoid placing (and taking) blame; e.g., the ubiquitous "Mistakes were made".  If you can avoid being like a politician, that's always a good thing.

Anyway, no point in adding any more about the virtues of active over passive voice in scientific writing in general.  A quick Google search will point you to many sources.  What I thought might be interesting is to consider the use of the passive voice in specific situations that often come up in scientific writing, and some cases in which it might be okay.  I think the main point to consider is that passive voice removes the subject from the sentence.  Active voice can force you to think about who that subject is.

Case 1, The Invisible Hand: This is something that schools often teach students in lab classes.  For example, "The tube was then placed in an ice cold water bath for 20 minutes."  This one is easy.  Just say "We placed the tube..."  Easier to read and understand, period.

Case 2, Magical Mystery Knowledge: This one often shows up when people cite previous work.  Examples include "It is well known that...", "It has been shown that...", etc.  This is particularly silly if you have a case in which you are citing a single study that demonstrates the fact in question.  In that case, just say "Jane Doe et al. showed that circus ponies are more likely to..."  That way, you can directly state who said what, which is a nice thing to do.  This strategy becomes more complex when you want to cite multiple studies, because listing out each individual study is a pain.  I often opt for something like "Researchers have shown that circus ponies are more likely to..."  This convention makes the writing sound a bit less pompous, which I think in and of itself increases clarity somewhat.

But I think considering what the subject of the sentence is can often lead to writing that actually increases the relevant information content of the sentence.  For instance, maybe the subject of the sentence is not the researcher, but the method they employ: "A statistical analysis of circus livestock showed that circus ponies are more likely to..."  Or "A random pony survey showed that circus ponies are more likely to..."  Each of these sentences conveys more information (and, in fact, different information in each sentence) than the original or even the "Researchers have shown..." formulation, all without really increasing total word count.  If there are multiple methods and you want to show the result is robust, you can either cite each method explicitly, or just say "Multiple methods have shown..."  No matter what, putting the cited result in context increases the information content of the sentence and makes the writing sound better.

Case 3, The Unknown Mechanism: This is actually one of the trickier situations, because here the subject may literally be unknown.  For example, from a recent project: "In the embryo, the processes of gene expression and cell proliferation are coordinated."  Now, if you know what mechanism does the coordinating, just say it.  "In the embryo, a tiny little nano-homunculus named Bob coordinates gene expression and cell proliferation."  Done.

But what if you don't know the mechanism?  In that case, the subject of the sentence (the mechanism) is actually unclear, and so one can make the case that the passive formulation is okay.  I sort of agree, but there are some alternatives:

"An unknown mechanism coordinates gene expression and cell proliferation."
Sounds a little weird, like the sentence is backward.  I think this is because most scientific sentences are more successful when you give a fact and then some interpretation.  Ironically, I think a passive formulation may feel a bit more "forward", such as
"Gene expression and cell proliferation in the embryo are coordinated by an unknown mechanism."
I like this perhaps best of all, because at least you are pointing out that the mechanism is unknown.

There are also some problematic alternatives:
"Nobody knows what coordinates gene expression and cell proliferation in the embryo."
The issue here is that you have taken an explicit statement of fact (that the two processes are coordinated) and now made it implicit and assumed.  Not so great.

"The embryo coordinates gene expression and cell proliferation."
Subtle case.  The issue here is that the sentence says that it is the embryo itself that is doing the coordination.  Now, on some level, that is true.  But in saying it this way, we are sort of implying that the embryo itself somehow "wants" to do this coordination.  While such a notion is plausible, we don't know this is the case–all we know is the fact that the processes are coordinated.  Gautham and I had some arguments on this one (with me arguing this sentence is okay), but upon further reflection I think Gautham is right (often the case).

You'll notice that I often use the words "are coordinated" in discussing this.  I guess this is a situation where passive voice is okay by me.  Maybe the best thing to do is to just spend one's time figuring out the mechanism itself... :)

Case 4, The Random Flip: This is a particularly easy case to avoid, one in which the writer has a subject for the sentence but doesn't use it as the subject, instead putting it in an "is done by" construction.  In this case, the writer is just using passive voice for its own sake.  An example is "RNA splicing is performed by the spliceosome."  Just say "The spliceosome performs RNA splicing."  I think this sometimes happens because people want a sentence to seem fancier than it is so that it can stand on its own.  The latter sentence sounds sort of simplistic, even though the information content is literally the same.  Clarity is a good thing, though!  Also, the latter sentence's simplicity allows one to easily build on it to include further information: "The spliceosome performs RNA splicing through the interaction of a large number of enzymes."  Much better than "RNA splicing is performed by the spliceosome, which mediates the process through the interaction of a large number of enzymes."


Case 5, It Just Sounds Better: Sometimes, though, I have to admit that passive voice just sounds better than active voice.  This is particularly the case when the sentence has a fairly complex subject.  For example, from our recent paper:

"Researchers generally believe that the transcription of a gene's DNA into RNA is controlled by the interaction of regulatory proteins with DNA sequences proximal to the gene itself."

I suppose we could have said

"Researchers generally believe that the interaction of regulatory proteins with DNA sequences proximal to the gene itself controls the transcription of a gene's DNA into RNA."

But that somehow sounds very verbose, and I think the passive formulation is easier to read and understand.

Overall, though, I think that the vast majority of the time, eliminating passive voice from our writing will make it better.  Of course, this is all just my opinion.  Thoughts welcome!

Saturday, January 19, 2013

Things Marshall hates

Here are some things that Marshall hates (by no means an exhaustive list):

- Excel spreadsheets
- Annoying finance guys talking about Excel spreadsheets
- When the waitstaff at a restaurant ask "Have you dined with us before?"
- Hi-C
- Medical privacy
- Most papers
- The Beatles
- Tyson Bees
- Fusion cuisine
- NorCal
- MATLAB

Saturday, January 5, 2013

When everyone's sequencing

Lately in the lab, we've been doing some sequencing, which has got me thinking about, well, sequencing.  One thing I'm wondering is how this changes the nature of the independent research group.  The thing is that as the technique itself is commoditized, then where does a research group retain an edge?  For instance, if you do high end microscopy, then you have an "edge" in the sense that it's probably hard for most other groups to do it.  If you do sequencing, basically anyone can send their RNA to a company to sequence and get the same sort of results.  I think that's actually a good thing, overall, but I guess it just means that you really need to be careful about what you decide to sequence, especially in a small lab with limited resources.  Ideally something important but somehow not obvious.  And then you pray that nobody else is doing the exact same thing.  Because they probably are.

Friday, January 4, 2013

Some R tips


Greetings, present or future useRs!  Last night I did a typical data processing task, similar to the ones I used to have to do when I began working on the Nanog project.  But it took a few hours instead of a day and many lines of bookkeeping code.  A good part of this has to do with packages that extend base R's capabilities, packages that seemed irrelevant when I first started learning how to do this stuff.  Here they are:

String manipulation


library(stringr)

Allows for easy, sensible, and consistent vectorized string operations, including anything you might want to do with regular expressions.  Documentation: look at the help pages for the stringr package; the function names transparently reveal what they do.
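A minimal sketch of the kind of thing it does, using made-up sample names (the gene names and replicate labels here are just for illustration):

```r
library(stringr)

# Hypothetical sample names: gene plus replicate number
samples <- c("Sox2_rep1", "Nanog_rep2", "Nanog_rep10")

str_detect(samples, "Nanog")          # FALSE TRUE TRUE
str_extract(samples, "rep\\d+")       # "rep1" "rep2" "rep10"
str_replace(samples, "_rep\\d+", "")  # strip replicates: "Sox2" "Nanog" "Nanog"
```

All three calls are vectorized over the whole character vector, so there's no looping over strings one at a time.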


Data wrangling

A lot of work goes into massaging data, your own, or someone else's, into different formats. This is especially relevant if you have a bunch of measurements that fall under mutually exclusive categories. For example, gene expression from a time series with and without a drug (two categories - time, drug).  Or ChIP binding affinities for a bunch of transcription factors to every gene. 

library(reshape2)

Fundamentally, this package gives you a sensible way to move information between row entries and column names.  I know, why go through the trouble, right? I thought so too. But after I learned it, I found a huge number of applications. And I actually store most of my data as 'melted' data now. The documentation and philosophy are in Chapter 2 of the creator's thesis: http://had.co.nz/thesis/practical-tools-hadley-wickham.pdf
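A quick sketch of the round trip (gene names and numbers are made up):

```r
library(reshape2)

# Wide format: one column per time point
wide <- data.frame(gene = c("Sox2", "Nanog"),
                   t0  = c(5.1, 2.3),
                   t60 = c(6.0, 1.8))

# melt: the time-point column names become row entries (the 'molten' form)
molten <- melt(wide, id.vars = "gene",
               variable.name = "time", value.name = "expression")

# dcast: spread 'time' back out into columns, recovering the wide table
dcast(molten, gene ~ time, value.var = "expression")
```

The molten form has one row per (gene, time) measurement, which is exactly the shape plyr and ggplot2 want.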

library(plyr)

If you want to split up data by category, do something to it (possibly summarize it), and put it back together, to and from a variety of formats, this is your ticket. This package makes obsolete almost every for loop and a whole bunch of now-pointless bookkeeping code I had to write, plus it makes my intent very clear to myself. Documentation (another one of Mr. Wickham's creations): http://www.jstatsoft.org/v40/i01/paper
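The split-apply-combine pattern in one call, on made-up per-cell RNA counts:

```r
library(plyr)

# Hypothetical per-cell RNA counts for two genes
counts <- data.frame(gene  = rep(c("Sox2", "Nanog"), each = 3),
                     count = c(12, 15, 9, 40, 38, 45))

# Split by gene, summarize each piece, reassemble into a data frame --
# no for loop, no bookkeeping
ddply(counts, .(gene), summarize,
      mean_count = mean(count),
      n_cells    = length(count))
```

The equivalent for loop would need you to preallocate a result, track indices, and rbind pieces together yourself.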

library(data.table)

If you have a huge number of categories, like if your categories are gene names, plyr can be slow. So you can sacrifice some flexibility for sometimes 1000x speed by using data.table. Good for simple calculations like within-group averaging or finding maxima. Documentation: various vignettes and FAQs are available on the web.
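The same within-group averaging as the plyr example above, sketched in data.table's syntax (same made-up counts):

```r
library(data.table)

dt <- data.table(gene  = rep(c("Sox2", "Nanog"), each = 3),
                 count = c(12, 15, 9, 40, 38, 45))

# Grouping is built into the indexing syntax via 'by'
dt[, list(mean_count = mean(count)), by = gene]
```

With two groups the speedup is invisible, but with tens of thousands of gene names as groups, this form is dramatically faster than ddply.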


Plotting

library(ggplot2)


Hard to describe this package. It is a totally different way of specifying plots than I am used to in a command-line environment. It is almost like a command-line version of bringing up the plot wizard in Excel, except it has many more options, and the output with default settings is more polished and efficient than base R graphics. Another package that makes obsolete a lot of painstaking graphics adjustment, subplot accounting, and complications in changing what kind of plot I want. Documentation is in Chapter 3 of the creator's thesis, once again: http://had.co.nz/thesis/practical-tools-hadley-wickham.pdf
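A minimal example of the grammar-of-graphics style, on a made-up time course already in 'melted' form:

```r
library(ggplot2)

# Hypothetical time course for two genes, one row per measurement
d <- data.frame(gene = rep(c("Sox2", "Nanog"), each = 3),
                time = rep(c(0, 30, 60), 2),
                expression = c(5, 6, 7, 3, 2, 1))

# Map data columns to aesthetics; ggplot handles colors, legend, and axes
ggplot(d, aes(x = time, y = expression, color = gene)) +
  geom_line() +
  geom_point()
```

Changing what kind of plot you want is just swapping or adding a geom layer (e.g., geom_boxplot() instead of geom_line()), which is exactly the "no subplot accounting" point above.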


Sunday, December 16, 2012

How to understand biology

Lately, I've been wondering whether creating a complete understanding of biology is hopeless given the complexity involved.  Maybe it's like predicting the weather or something--just too many variables and too many unknowns.  Often you hear the analogy that the way we try to understand biology today is like trying to understand how a clock works by throwing it at the wall and looking at the pieces.  I think there is a fundamental truth to that.  Lately, in my talks, I've been using the analogy of grinding up an iPhone into a pile of dust, and then trying to understand how the iPhone works by considering the elemental makeup of that pile of dust.  That's sort of like what we do now.  We grind up a bunch of cells and see what genes are up and down in comparing one "pile of dust" to another.  I think it can be hard to gain real mechanistic insight from that, or at least it seems that it will be hard to really understand how a cell works that way.  Hmm.  Is there a point when we'll actually get there?

Which leads me to a thought.  Maybe it is hopeless to actually figure out how an iPhone works.  But maybe we don't need to.  Take reprogramming of stem cells.  In that case, we don't need to know exactly what the cell does with those reprogramming factors, we just know that they can reprogram the cell.  It's sort of like saying, "Well, someone gave me this iPhone, and I have no idea how it works (having ground a couple of them up), but at the very least, it seems like if I push these particular buttons, then I can somehow get it to call my mom".  So we've gained some higher level understanding of how the system works, enough to bend it to our will occasionally.  That seems more feasible.  Maybe.  I hope.