Friday, June 5, 2015

Gene expression by the numbers, day 2: take me out to the ballgame

(Day 0, Day 1, Day 2, Day 3 (take Rorschach test at end of Day 3!))

First off, just want to thank a commenter for providing an interesting and thoughtful response to some of the topics we discussed in day 1. Highly recommended reading.

Day 2 started with Rob trying to stir the pot by placing three bets (the stakes are dinner in Paris at a fancy restaurant, yummy!). First bet was actually with me, or really a bet against pessimism. He claimed that he would be able to explain Hana’s complicated data on transcription in different conditions once we measured the relevant parameters, like, say, transcription factor concentration (wrote about this in the day 1 post). My response was, well, even if you could explain that with all the transcription factor concentrations, that’s not really the problem I have. My problem is that it is impossible to build a simple predictive model of transcription here. The input-output relationship depends on so many other factors that we end up with a mess–there are no well-defined modules. To which Rob rightfully responded by saying that that's moving the goalposts: I said he can't do X, he does X, I say now you have to do Y. Fair enough. I accept the original challenge: I claim that he will not be able to explain the differences in Hana's data using just transcription factor concentration.

Next bet was with Barak. In the day 1 post, I mention the statistical approach vs. the mechanistic approach. Rob and Barak still have to formulate the bet precisely (and I think they actually agree mostly), but basically, it is a bet against the statistical approach. Hmm. Personally, I don't know how I come down on this. I am definitely sympathetic to Rob's point of view, and don't like the overemphasis these days on statistics (my thoughts). But my thoughts are evolving. Rob asked "Would it really have been possible to derive gravitation with a bunch of star charts and machine learning?" To which I responded with something along the lines of "well, we are machines, and we learned it." Sort of silly, but sort of not.

Final bet was with Ido (something about universality of noise scaling laws). Ido also had a bet on this point, in this case offering up a bottle of Mezcal for a resolution. More on this some other time. I am going to try and get the bottle!

The talks were again great (I mean really great), if perhaps a bit more topically diffuse than yesterday. Started with evolution. Very cool, with beautiful graphs of clonal sweeps. An interesting point was that experimental evolution arrives at different answers than you expect initially. They are rational (or can be), but not what you expect early on–amazingly even in pathways as well worked out as the metabolic pathways. I'm wondering if we could leverage this to understand pathways better in some way?

On to the "tech development" section, which was only somewhat about tech development, somewhat not. Stirling gave a great talk about human NET-seq. What I really liked about it was that in the end, there was a simple answer to a simple question (is transcription different over exons when they're skipped? exons vs. introns?). I think it's awesome to see that genome-wide data can give such clear results.

So far, everything was about control of the mean levels of transcription. Both Ido and I talked about the variance around that mean, with Ido providing beautiful data on input-output functions. On the Mezcal bet, Ido showed that there is a strong relationship between the Fano factor (variance divided by mean) and the mean. I am wondering whether this is due to volume variation. Olivia's paper has some data on this. Probably the subject of another blog post at some point in the future.
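
To make the volume idea a bit more concrete, here is a toy sketch (mine, not Ido's data and not the analysis in Olivia's paper). It assumes mRNA counts are Poisson with a mean proportional to cell volume, and that volume is lognormally distributed across cells. The function names, the 30% volume CV and the cell numbers are all made up for illustration. Under those assumptions the Fano factor should come out as roughly 1 + mean × (volume CV)^2, so it grows with the mean even though nothing interesting is happening at the promoter.

```python
import numpy as np

def fano_factor(counts):
    """Fano factor: variance of the counts divided by their mean."""
    counts = np.asarray(counts, dtype=float)
    return counts.var() / counts.mean()

def simulate_counts(mean_per_unit_volume, n_cells=20000, volume_cv=0.3, seed=0):
    """Poisson counts whose mean scales with a lognormal cell volume (assumed model)."""
    rng = np.random.default_rng(seed)
    # Choose lognormal parameters so that volume has mean 1 and the requested CV.
    sigma2 = np.log(1.0 + volume_cv**2)
    volumes = rng.lognormal(mean=-sigma2 / 2.0, sigma=np.sqrt(sigma2), size=n_cells)
    return rng.poisson(mean_per_unit_volume * volumes)

if __name__ == "__main__":
    for mean in (1, 10, 100):
        counts = simulate_counts(mean)
        print(mean, round(fano_factor(counts), 2))
```

If the real data look like this, then dividing counts by volume (i.e., looking at concentration) should flatten most of the Fano-vs-mean trend. Whether it actually does is an empirical question.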

Theory: great discussion about Hill coefficients with Jeremy! How can you actually get thresholds in transcriptional regulation? Couple ideas. There's conventional cooperativity, and there could also be other mechanisms, like titration via dummy binding sites like in Nick Buchler's work. Surprising that we still have a lot of questions about mechanisms of thresholds after all this time.
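
For reference, here is a minimal sketch of the two threshold ideas. The first function is the standard Hill function, where a larger coefficient n gives a sharper switch. The second is a deliberately crude caricature of titration that I am assuming for illustration (infinitely tight binding to decoy sites, so no factor is free until the decoys are full); it is not a model taken from Nick Buchler's papers, and the names and numbers are hypothetical.

```python
def hill(concentration, K, n):
    """Activating Hill function: fraction of maximal output at a given concentration."""
    return concentration**n / (K**n + concentration**n)

def free_tf_with_decoys(total_tf, n_decoys):
    """Free transcription factor in the tight-binding limit (toy assumption):
    decoy sites soak up factor one-for-one until they are saturated."""
    return max(total_tf - n_decoys, 0.0)

if __name__ == "__main__":
    for c in (0.5, 1.0, 2.0):
        # Same half-maximal point K=1, but n=4 is much steeper around it than n=1.
        print(c, round(hill(c, 1.0, 1), 3), round(hill(c, 1.0, 4), 3))
    for total in (50.0, 100.0, 150.0):
        # With 100 decoys, output only starts responding once total exceeds 100.
        print(total, free_tf_with_decoys(total, 100.0))
```

The point of the second function is just that you can get a sharp threshold without any cooperativity at the promoter itself.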

Conversation with Jeremy and Harinder: how much do we know about whether sequence fully predicts binding? Thought for an experiment–if you sweep through transcription factor concentrations, what happens to binding as measured by e.g. ChIP-seq? Has anyone done this experiment?

Then, off to the Red Sox vs. the Twins. Biked over there on Hubway with Ron, which was perfect on a really lovely day in Cambridge. The game was super fun! Apparently there were some people playing baseball there, but that didn't distract me too much. Had a great time chatting with various folks, including two really awesome students from Angela's lab, Clarissa Scholes and Ben Vincent, who joined in the fun. Talked with them about the leaky pipeline, which is something I will never, ever discuss online for various reasons. Also crying in lab–someone at the conference told me that they've made everyone in their lab cry, which is so surprising if you know this person. Someone also told me that I'm weird. Like, they said "Arjun, you are weird." Which is true.

Oh, and the Twins won, which made me happy–not because I know the first thing about baseball, but I hate the Red Sox, mostly because of their very annoying fans. Oops, did I say that out loud?

Okay, fireworks are happening here on day 3. More soon!

Thursday, June 4, 2015

Gene expression by the numbers, verdict on day 1: awesome!

Yesterday was day 1 of Gene expression by the numbers, and it was everything I had hoped it would be! Lots of discussion about big ideas, little ideas, and everything in between. Jané Kondev said at some point that we should have a “controversy meter” based on the loudness of the discussion. Some of the discussions would definitely have rated highly, which is great! Here are some thoughts, very much from my own point of view:

We started the day with a lively discussion about how I am depressed (scientifically) :). I’m depressed because I’ve been thinking lately that maybe biology is just hopelessly complex, and we’ll never figure it out. At the very least, I’ve been thinking we need wholly different approaches. More concretely for this meeting, will we ever truly be able to have a predictive understanding of how transcription is regulated? Fortunately (?), only one other person in the room admitted to such feelings, and most people were very optimistic on this count. I have to say that at the end of the day, I’m not completely convinced, but the waters are muddier.

Who is an optimist? Rob Phillips is an optimist! And he made a very strong point. Basically, he’s been able to take decades of data on transcriptional regulation in E. coli and reduce it to a single, principled equation. Different conditions, different concentrations, whatever, it all falls on a single line. I have to say, this is pretty amazing. It’s one thing to be an optimist, another to be an optimist with data. Well played.

And then… over to eukaryotes. I don’t think anyone can say with a straight face that we can predict eukaryotic transcription. Lots of examples of a lot of effects that don’t resolve with simple models, and Angela DePace gave a great talk highlighting some of the standard assumptions that we make that may not actually hold. So what do we do? Just throw our hands in the air and say “Complexity, yipes!”?

Not so fast. First, what is the simple model? The simplest model is the thermodynamic model. Essentially, each transcription factor binds to the promoter independently of the others, and its effect on transcription is independent of the others as well. Um, duh, that can’t work, right? I was of the opinion that decades of conventional promoter bashing haven’t really provided much in the way of general rules, and more quantitative work along these lines hasn’t really done so either.
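
Here is about the simplest caricature of that independence assumption I can write down. To be clear, this is my own toy version, not the equation anyone presented: I am assuming each factor occupies its site with probability c / (c + K) and multiplies expression by a fixed fold-change when bound, with nothing depending on what the other factors are doing. All names and numbers are hypothetical.

```python
def occupancy(concentration, K):
    """Probability that a single site is bound, for simple one-site binding."""
    return concentration / (concentration + K)

def expression(basal, factors):
    """Expected expression when every factor acts independently.

    factors: iterable of (concentration, K, fold_change) tuples, where
    fold_change > 1 is an activator and fold_change < 1 is a repressor.
    """
    level = basal
    for concentration, K, fold_change in factors:
        p = occupancy(concentration, K)
        # Average over bound and unbound states of this one factor only.
        level *= (1.0 - p) + p * fold_change
    return level

if __name__ == "__main__":
    activator = (1.0, 1.0, 3.0)   # half-occupied, 3x when bound
    repressor = (1.0, 1.0, 0.5)   # half-occupied, 0.5x when bound
    print(expression(1.0, [activator]))             # 2.0
    print(expression(1.0, [activator, repressor]))  # 1.5
```

The "obviously wrong" part is the multiplication: the effect of the pair is just the product of the individual effects, with no cooperativity, no spacing dependence and no context.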

But Barak brought up an extremely good point, which is that a lot of these approaches to seeing how promoter changes affect transcription suffer from being very statistically underpowered. They also made the point (with data) that once you really start sampling, maybe things are not so bad–and amazingly enough, maybe some of the simplest and “obviously wrong” caricatures of transcriptional regulation are not all that far off. Maybe with sufficient sampling, we can start to see rules and exceptions, instead of a big set of exceptions. Somehow, this really resonated with me.

I’m also left a bit confused. So do we have a good understanding of regulation or not? I saw some stuff that left me hopeful that maybe simple models may be pretty darn good, and maybe we’re not all that far off from the point where if I wanted to dial up a promoter that expressed at a certain level, I just type in this piece of DNA and I’ll get close. I also saw a lot of other stuff that left me scratching my head and sent me back to wondering how we’ll ever figure it all out.

There was also an interesting difference in style here. Some approach things from a very statistical point of view (do a large number of different things and look for emergent patterns). Some approach things from a very mechanistic point of view (tweak particular parameters we think are important, like distances and individual bases, and see what happens). I usually think it’s very intellectually lazy to say things like “we need both approaches, they are complementary”, but in this case, I think it’s apt, though if I had to lean one way, personally, I think I favor the mechanistic approach. Deriving knowledge from the statistical approach is a tricky matter, but that’s a bigger question. How much variance do we need to explain? As yet unanswered, see later for some discussion about the elephant in the room.

Models: some cool talks about models. One great point: “No such thing as validating a model. We can only disprove models.” A point of discussion was how to deal with models that don’t fit all the data. Do we want to capture everything? How many exceptions to the rule can you tolerate before it’s no longer a rule?

Which brings us to a talk that was probably highest on the controversy meter. In this one, the conferee who shares my depression showed some results that struck me as very familiar. The idea was to build a quantitative model, then go do some experiments to measure transcriptional response, and the model fits nicely. Then you change something in the growth medium, and suddenly, the model is out the window. We’ve all seen this: day to day variability, batch variability, “weird stuff happened that day”, whatever. So does the model really reflect our understanding of the underlying system?

This prompted a great discussion about what our goals are as a community. Is the goal really to predict everything in every condition? Is that an unreasonable thing to expect from a model? This got down to understanding vs. predicting. Jané brought up the point that these are different: Google can predict traffic, but it doesn’t understand traffic. A nice analogy, but I’m not sure that it works the other way around. I think understanding means prediction, even if prediction doesn’t necessarily mean understanding. Perhaps this comes down to an aesthetic choice. Practically speaking, for the quantitative study of transcription, I think that the fact that the model failed to predict transcription in a different condition is a problem. One of my big issues with our field is that we have a bunch of little models that are very context specific, and the quantitative (and sometimes qualitative) details vary. How can we put our models together if the sands are shifting under our feet all the time? I think this is a strong argument against modularity. Rob made the solid counter that perhaps we’re just not measuring all the parameters–if we could measure transcription factor concentration directly, maybe that would explain things. Perhaps. I’m not convinced. But that’s just, like, my opinion, man.

So to me the big elephant in the room that was not discussed is what exactly matters about transcription? As quantitative scientists, we may care about whether there are 72 transcripts in this cell vs. 98 in the one next door, but does that have any consequences? I think this is an important question because I think it can shape what we measure. For instance, this might help us answer the question about whether explaining 54% of the variance is enough–maybe the cell only cares about on vs. off, in which case, all the quantitative stuff is irrelevant (I think there is evidence for and against this). Maybe then all we should be studying is how genes go from an inactive to an active state and not worry about how much they turn on. Dunno, all I’m saying is that without any knowledge of the functional consequences, we’re running the risk of heading down the wrong path.

Another benefit to discussing functional consequences is that I think it would allow us to come up with useful definitions that we can then use to shape our discussion. For instance, what is cross-talk? (Was the subject of a great talk.) We always talk about it like it’s a bad thing, but how do we know that? What is modularity? What is noise? I think these are functional concepts that must have functional definitions, and armed with those definitions, then maybe we will have a better sense of what we should be trying to understand and manipulate with regard to transcriptional output.

Anyway, looking forward to day 2!

Tuesday, June 2, 2015

Gene expression by the numbers, day 0: Big picture questions about transcription

So just about to get on a plane to go to Boston/Cambridge for a meeting on transcription–I think it's going to be a lot of fun! Bunch of folks with a quantitative bent getting together, including the organizers Al Sanchez, Hernan Garcia, Jané Kondev, Angela DePace and Rob Phillips (big thanks for all their hard work!). The big reason I'm excited is that this is not going to be a typical meeting: the goal is to dispense with the usual formalities of a meeting (like a bunch of boring talks that nobody pays attention to) and instead actually talk with each other about where we want the field to head and how we might get there. We even all made short videos beforehand as a sort of pre-conference introduction!

This is going to require changing our usual scientific behavior, which is to stamp out wild ideas as soon as we hear them. You know that crazy person who asks you some weird question at the end of your seminar about bees and the number 12? Well, that's going to be me, and I won't be satisfied with "talking about it later off-line". :)

Nor is it going to be completely off-line. I'm going to blog about the goings-on in the hope that others can participate as well in what is sadly (but perhaps necessarily) a rather small event. So drop me a line if you have any burning questions about transcription.

What are the sorts of questions we'll be discussing? Here's a few I’ve been thinking about after watching everyone’s videos:
  1. How close are we to a predictive understanding of the regulatory code? I.e., if I give you a cell type and a piece of DNA, can I predict how much transcription there will be?
  2. (Related bonus question) How do we deal with the complexity of metazoan transcriptional regulation? What new conceptual frameworks will we need to make further progress?
  3. What are some new methods that we could develop that would help us understand transcription? What are the quantities that we would like to measure?
  4. Development appears to be incredibly precise–how do developing organisms achieve this despite the sloppiness of chemical reactions? To what extent is this precision an intrinsic property of the cell and to what extent is it an emergent property of the interaction of different cells?
  5. What are the functional consequences of transcription? Which aspects of transcription “matter” and which ones are irrelevant? In chemistry, we talk about rate-limiting reactions. What are the biology-limiting reactions in transcription? What should we be measuring?
More soon!

Friday, May 22, 2015

RNA doesn't correlate with protein? Huh?

tl;dr: I don’t know why people say that RNA doesn’t correlate with protein. There are different contexts to this question, and some recent experiments may make the question a bit confusing, but overall, I’m pretty sure that most of the time, if you increase the amount of RNA for a given gene, you will end up with more of the protein encoded by that gene. I’m sure there are counter-examples, though–if you know of any, please fill me in.

In our group, when we present work on RNA abundances, we are often faced with the question: “Well, what about the protein?” (fair enough). This is usually followed by the statement “Because of course it is well known that RNA doesn’t correlate with protein.” Umm, what?

I have to say that I’m a bit puzzled by this bit of apparently obvious and self-evident truth. I thought that most people accept that the central dogma of DNA to RNA to protein is a pretty solid fact in most cases. So… if you have more RNA, that should lead to more protein, right? Shouldn’t that be the null hypothesis?

Apparently this notion has been around for a long time, though nowadays it is perhaps a bit more conceptually confusing due to a few recent results. Perhaps the biggest one was the Schwanhausser paper in which they compare RNA-seq to mass-spec and show that there is a distinct lack of correlation between mean RNA levels and mean protein levels across all genes (also the Weissman ribosome profiling paper). What this means, on the face of it, is that even if gene A produces more RNA than gene B, then it may be the case that there is more protein B than protein A. Fine. There are differences in protein translation rate and degradation rate, leading to these differences, no surprises there. Plus, Mark Biggin and Allan Drummond make the point that any measurement noise will lead to decorrelation even if things are very correlated, and their reanalyses seem to indicate that the correlation between RNA and protein may actually be considerably higher than initially reported.
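
The measurement-noise point is easy to see in a toy simulation. This is just an illustration of the general statistical effect, not a reproduction of either reanalysis: I am assuming true log RNA and log protein levels are bivariate normal with unit variance, and that each measurement adds independent Gaussian noise. The function name and all the numbers are made up.

```python
import numpy as np

def noisy_correlation(true_r=0.9, noise_sd=1.0, n_genes=5000, seed=0):
    """Return (correlation of true values, correlation of noisy measurements)."""
    rng = np.random.default_rng(seed)
    # Columns are "true" log RNA and log protein, one row per gene.
    cov = [[1.0, true_r], [true_r, 1.0]]
    true_vals = rng.multivariate_normal([0.0, 0.0], cov, size=n_genes)
    # Independent measurement noise on both RNA and protein.
    measured = true_vals + rng.normal(0.0, noise_sd, size=true_vals.shape)
    r_true = np.corrcoef(true_vals[:, 0], true_vals[:, 1])[0, 1]
    r_measured = np.corrcoef(measured[:, 0], measured[:, 1])[0, 1]
    return r_true, r_measured

if __name__ == "__main__":
    for noise_sd in (0.0, 0.5, 1.0):
        r_true, r_measured = noisy_correlation(noise_sd=noise_sd)
        print(noise_sd, round(r_true, 2), round(r_measured, 2))
```

With unit-variance signals, the expected measured correlation is the true one divided by (1 + noise variance), so noise as large as the biological spread cuts a 0.9 correlation roughly in half.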

The next example that’s a bit closer to home for me is whether RNA levels and protein levels correlate, even for the same gene, across single cells. Here, it gets a bit more complex, and one might expect a variety of behaviors depending on the burstiness of transcription, degradation rate of the RNA and the degradation rate of the protein. Experimentally, there are some cases in which the RNA and protein of a particular gene do not correlate in single cells (Taniguchi et al. Science 2010 is a particularly good example). This may be due to long protein half-life, which effectively smooths over RNA fluctuations. In our PLOS 2006 paper (Fig. 7), we showed that there can be a strong correlation between RNA and protein when the protein degrades fast, and that correlation goes down a lot when the protein degrades more slowly.
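
Here is a rough sketch of the smoothing effect. It is not the model from our paper: I am assuming geometric bursts of mRNA arriving at random, first-order mRNA decay, and a protein level that deterministically relaxes toward the current mRNA count at the protein decay rate. I also correlate RNA and protein over time in one simulated cell as a stand-in for a snapshot across many cells. Parameter names and values are hypothetical.

```python
import numpy as np

def rna_protein_correlation(protein_decay, rna_decay=1.0, burst_rate=0.5,
                            mean_burst=10, dt=0.05, n_steps=40000, seed=0):
    """Correlation between mRNA count and protein level along one trajectory."""
    rng = np.random.default_rng(seed)
    # Pre-draw burst times and geometric burst sizes.
    burst_now = rng.random(n_steps) < burst_rate * dt
    burst_size = rng.geometric(1.0 / mean_burst, size=n_steps)
    production = burst_now * burst_size

    rna = np.empty(n_steps)
    protein = np.empty(n_steps)
    m = int(burst_rate * mean_burst / rna_decay)  # start near the steady-state mean
    p = float(m)
    for t in range(n_steps):
        # Each mRNA molecule decays with probability rna_decay * dt per step.
        m += int(production[t]) - rng.binomial(m, rna_decay * dt)
        # Euler step: protein chases the mRNA level at rate protein_decay.
        p += dt * protein_decay * (m - p)
        rna[t] = m
        protein[t] = p
    return np.corrcoef(rna, protein)[0, 1]

if __name__ == "__main__":
    print("fast protein decay:", round(rna_protein_correlation(5.0), 2))
    print("slow protein decay:", round(rna_protein_correlation(0.05), 2))
```

For a linear model like this, the correlation should go roughly as sqrt(protein decay / (protein decay + RNA decay)), which is why a long-lived protein can look nearly uncorrelated with its own mRNA.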

And of course there’s the whole world of post-translational modifications, like during the cell cycle, etc., in which protein activity and potentially levels change independent of transcript abundance. Well, dunno what to say about that, I’m biased to just think about RNA. :)

Nevertheless, overall, I think it’s pretty safe to assume most of the time that if you increase RNA abundance for a particular gene, you will end up with more of the encoded protein. I think that should be the null hypothesis. If anyone knows of any counterexamples, please let me know.

Oh, and by the way, in case you’re wondering, transcription also correlates with RNA.

Sunday, May 10, 2015

Retraction in the age of computation

tl;dr: I’ve been wondering recently whether we need to reexamine our internal barometers for retraction now that computational analyses are a bigger part of our work in biomedical sciences, in which the line between data and interpretation is somewhat blurrier. I’m not sure what the answer is, but I would definitely lean toward not retracting because of the stigma of “shady data” associated with retractions.

I started thinking about this because I saw Yoav Gilad’s reanalysis of some previous expression profile data, which showed that the “interesting” finding went away after correcting for batch effects. Someone on Twitter asked whether the paper should be retracted. Should it?

I grew up with the maxim “Flawed data, retract; flawed interpretation, don’t retract”. I think that made a lot of sense. If the data themselves are not reproducible (fraudulent or otherwise), then that’s of course grounds for retraction. Flawed interpretations come in a couple varieties. Some are only visible in hindsight. For example “I thought that this band on the gel showed proof of XYZ effect, but actually it’s a secondary effect due to ABC that I didn’t realize at the time” is a flaw, yes, but at the time, the author would have been fine in believing that the interpretation was right. Not really retraction worthy, in my opinion. Especially because all theories and interpretations are wrong on some level or another–should we retract Newton’s gravitation because of Einstein?

Now, there’s another sort of interpretational flaw, which comes from a logical error. These can also come in a number of types. Some are just plain old interpretational flaws, like claiming something that your data doesn’t fully support. This can be subtle, like failing to consider a reasonable alternative explanation, which is a common problem. (Flawed experimental design also falls under this heading, I think.) Certainly overclaiming and so forth are rampant and generally considered relatively benign.

Where it gets more interesting is if there is a flaw in the analysis, an issue that is becoming more prevalent as complex computational analyses are more common (and where many authors have to essentially trust that someone did something right). Is data processing part of the analysis or part of the data? I think that puts us squarely in the grey zone. What makes it complex is the interplay between the biological interpretation and the nature of the technical flaw. Here are some examples:
  1. The one that got me thinking about this was when Yoav Gilad reanalyzed some existing expression profiles from human and mouse tissues. The conclusion of the original paper was that the profiles clustered by species rather than by tissue (surprise!), but upon removing batch effects, one finds that tissues cluster together more tightly than species (whoops!). Retraction? Is this an obvious flaw in methodology? Would it matter whether people figured out the importance of batch effects before or after it was published? If so, how long after? I would say this should not be retracted because these lines seem rather arbitrarily drawn.
  2. Furthermore, if we were to retract papers because the analysis method was not right, then we would go down a slippery slope. What if I analyze my RNA-seq using an older aligner that doesn’t do quite as good a job as the newer one? Is that grounds for retraction? I’m pretty sure most people would say no. But how is that really so different than the above? One could say that in this case, there is little change in the biological conclusion. But there are very few biological conclusions that stand the test of time, so I’m less swayed by that argument.
  3. Things may seem more complicated depending on where the error arises. Let’s take the case of RNA/DNA differences as reported by sequencing, which was a controversial paper that came out a few years back. Many people provided rebuttals with evidence that many of the differences were in fact sequencing artifacts. I’m no expert, but on the face of it, it seems as though the artifact people have a point. Should this paper be retracted? Here, the issue is allegedly a flaw in the early stages of the analysis. Does this count as data or interpretation? To many, it feels like it should be retracted, but where’s the real difference from the two previous examples?
  4. I know a very nice and influential paper in which there is a minor mathematical error in a formula in part of the analysis method (I am not associated with this paper). This changes literally all the results, but only by a small amount, and none of the main conclusions of the paper are affected. Here, the analysis is wrong, but the interpretation is right. I believe they were contacted by a theorist who pointed out the error and asked “when will you retract the paper?”. Should they retract? I would say no, as would most people in this case. Erratum? Maybe that’s the way to go? But I am somewhat sympathetic to the fact that a stated mathematical result is wrong, which is bad. And this is a case in which I’m saying that the biological conclusion should trump the analysis flaw.
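
To see how a batch effect can flip a clustering result like the one in the first example, here is a toy simulation. It is not Gilad's analysis and not the original paper's data. I am assuming that every sample is a tissue-specific profile plus a batch offset plus noise, that batch is perfectly confounded with species, and that "correction" means subtracting each batch's per-gene mean. Function names and numbers are invented for illustration.

```python
import numpy as np

def simulate(n_genes=500, n_tissues=4, batch_sd=2.0, noise_sd=0.3, seed=0):
    """Expression profiles for 2 species x n_tissues, one batch per species."""
    rng = np.random.default_rng(seed)
    tissue_effect = rng.normal(0.0, 1.0, size=(n_tissues, n_genes))
    batch_effect = rng.normal(0.0, batch_sd, size=(2, n_genes))
    data, labels = [], []
    for species in range(2):
        for tissue in range(n_tissues):
            noise = rng.normal(0.0, noise_sd, size=n_genes)
            data.append(tissue_effect[tissue] + batch_effect[species] + noise)
            labels.append((species, tissue))
    return np.array(data), labels

def remove_batch_means(data, labels):
    """Subtract the per-gene mean of each batch (here batch == species)."""
    corrected = data.copy()
    batches = np.array([species for species, _ in labels])
    for b in np.unique(batches):
        corrected[batches == b] -= corrected[batches == b].mean(axis=0)
    return corrected

def nearest_neighbor_shares(data, labels, index):
    """Fraction of samples whose most-correlated other sample shares a label.
    index 0 compares species, index 1 compares tissue."""
    corr = np.corrcoef(data)
    np.fill_diagonal(corr, -np.inf)
    nearest = corr.argmax(axis=1)
    return np.mean([labels[i][index] == labels[j][index]
                    for i, j in enumerate(nearest)])

if __name__ == "__main__":
    data, labels = simulate()
    print("raw, same species:", nearest_neighbor_shares(data, labels, 0))
    corrected = remove_batch_means(data, labels)
    print("corrected, same tissue:", nearest_neighbor_shares(corrected, labels, 1))
```

Note the catch, which is part of why this is murky: when batch and species are fully confounded, the same subtraction would also erase any real species effect, so the correction cannot tell the two apart.
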
Overall, I think the issue of how to deal with problematic papers in which errors involve sometimes murky computational and analytical methods is a difficult one, and I would say that it’s maybe worth figuring out what our standards are. I think the real question is whether computational processing of data is part of the data or part of the interpretation, and I think there are reasonable cases to be made either way. It’s tricky and slightly different than with experiments. If someone does a crappy experiment (like used the wrong buffer), then those data would be marked as irreproducible, and thus could be subject to retraction. If the computational pipeline is documented but has a bug, then technically it’s replicable, if not reproducible. So maybe one way forward is to say that bugs are retractable but methodological flaws are not?

I realize this is a pretty high bar for retraction. For me, that’s fine because, practically speaking, I think it’s far better to just leave flawed papers in the literature. Retractions in biomedical science come with the association of fraud, and I think that associating non-fraudulent but flawed papers with examples of fraud is very harmful. Also, perhaps the data is useful to someone else down the road. We wouldn’t want the data to be designated as “retracted” just because of some mistake in the analysis, right? But this also will depend on at what point the data is considered data. For instance, let’s say I used the wrong annotations to quantify transcript abundance per gene and report that data. So then the data is flawed. But probably the raw reads are fine. Hmm. Retract one and not the other?

Anyway, I think it’s something worth thinking about.

Update, 5/12/2015: Lots of interesting commentary around this, especially in the case of the Gilad reanalysis of the PNAS paper. Leonid Kruglyak had a nice point:



Sounds reasonable in this case, right? I still think there are many situations in which this distinction is pretty arbitrary, though. In this case, the issue was that they didn’t watch out for batch effects. Now, once people realized that batch effects were a thing, how long does it take before it’s considered standard procedure to correct for it? 1 year? 2 years? A consensus of 90% of the community? 95%? And what if it turns out 10 years from now that the batch effect thing is not actually a problem after all and the original conclusion was valid? These all sound less relevant in this instance, but I think the principle still applies.

Great point from Joe Pickrell:



I really like the idea of just marking papers as wrong in the comments, perhaps accompanied by making comments more visible. (A more involved version of this could be paper versioning.) In this case, the data were fine, and were the paper retracted, then nobody could do a reanalysis to show that the opposite conclusion actually holds (which is also useful information).

Saturday, May 9, 2015

Thoughts on taking my first class in over a decade

Some folks in our lab (including myself) have embarked on a little experiment: we’re semi-informally taking a machine learning class over the summer in a self-directed manner, including doing all the homeworks. The rules are that the people who don’t do the homework have to pay for the lunches of the people who do do the homework. So far, everyone’s paid for themselves. For now… :)

Anyway, this is the first class I’ve taken in well over 10 years (although I’ve taught a bunch since then), and I’m enjoying it immensely! It also feels very different than when I took classes in the past. Firstly, I’m definitely slower. I’m taking way longer to get through the problems. At least partly, I think this is because my brain is not quite as quick as it used to be, for sure. Not sure if that’s just from having a lot of distractions or lack of sleep or just the aging process, but it’s definitely the case. Lame.

Also, I’m slower because my approach to every question is very different than it used to be. When I was an undergraduate taking a bunch of classes, a lot of the time I was just trying to get the answer. Now, with a lot more experience (and a very different objective function), I’m far less concerned with just getting to an answer, and so I of course spend a lot more time trying to understand exactly how I arrived at the answer.

More interestingly, though, is the realization that beyond just trying to understand the answer, I’m also spending a lot more time trying to understand why the professor asked the question in the first place. For instance, I just worked through an example of a decision tree and entropy, and while I think my earlier self would have just applied the formulas to get the answer, now I really understand why the problem was set up the way it was and why it’s trying to teach me something. This is something I think I’ve come to appreciate a lot more now that I’ve taught a few courses and have designed homework and exam questions. When I write a question, I’m usually trying to illustrate a particular concept through an example (though I typically fail). As a student, I think I typically missed out on these messages a lot of the time both because I was more concerned with getting the answer and because I didn’t have the context in which to understand what the concept was in the first place. Now, I’m purposefully trying to understand why the question is there in the first place from the very get go.
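
For anyone who hasn't seen the decision tree and entropy setup, here is a minimal sketch of the formulas involved. This is not the actual homework problem from the class, just the standard Shannon entropy and information gain definitions applied to a made-up set of labels.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction from splitting `labels` into the lists in `groups`."""
    n = len(labels)
    remainder = sum(len(group) / n * entropy(group) for group in groups)
    return entropy(labels) - remainder

if __name__ == "__main__":
    labels = ["yes", "yes", "no", "no"]
    print(entropy(labels))                                            # 1.0
    print(information_gain(labels, [["yes", "yes"], ["no", "no"]]))   # 1.0
    print(information_gain(labels, [["yes", "no"], ["yes", "no"]]))   # 0.0
```

The two splits at the end are the whole concept in miniature: a split that separates the classes perfectly gains the full bit, and a split that leaves each branch as mixed as the parent gains nothing.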

(Note: it’s really hard to devise questions that reveal a concept to the student. Lots of reasons, but one of them is that I feel like concepts come across best through interaction. Problems for classes, though, typically have to be well defined with clear statements and solutions. In a way, that’s the worst way to get a concept across. Not sure exactly what the right way to do this is.)

Another thing I’ve noticed is that every mathematical operation I perform, from doing an integral to inverting an equation, seems far more meaningful than it used to. I think it’s because I feel like I have a much deeper understanding of why they come up and what they mean. That makes computations a bit slower but far more purposeful (and with less time spent on fruitless directions).

Which leads to another point, which is that I tend to make fewer mistakes than I used to, especially of the silly variety. I think this is because in our research, a mistake is a mistake, silly or not, and having the right answer is the only thing that matters. So I’ll take it slow and get it right more often than before, which is a somewhat amusing change from the past.

Anyway, overall, a really fun experience, and one that I highly recommend if you haven’t taken a class in a while.

Friday, May 1, 2015

Can I just normalize expression levels by GAPDH?

tl;dr: Depends on context. Probably yes in many instances, but there are definitely situations where you can’t. And beware of global changes in transcription–may just be volume effects.

Now that Olivia’s paper is out (slidecast, full text), thought I’d write a bit about the time-honored practice of normalizing gene expression by GAPDH. A bit of context: when people did RT-qPCR (remember that?) on bulk RNA isolated from, say, cells with and without drug, the question would arise as to how to normalize the measurement by number of cells, differences in RNA isolation efficiency, etc. The way people normally do this in a practical sense is by dividing by the expression of housekeeping genes like GAPDH, which we assume is roughly the same per cell in both conditions. This is of course an assumption, and one which is most definitely broken in some situations.
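To make the logic of the bulk normalization concrete, here is a minimal sketch with made-up numbers (none of these values come from real data). It assumes the measured signal for any gene is simply per-cell expression times number of cells times isolation efficiency, and that GAPDH per cell is the same in both conditions, which is exactly the assumption that can break.

```python
# Toy model of bulk RT-qPCR normalization. All numbers are hypothetical.

def measured_signal(per_cell_expression, n_cells, isolation_efficiency):
    """Bulk signal assumed proportional to per-cell level, cell number, and efficiency."""
    return per_cell_expression * n_cells * isolation_efficiency

# Two hypothetical samples that differ in cell number and RNA isolation efficiency.
control = {"n_cells": 1000, "efficiency": 0.5}
drug = {"n_cells": 400, "efficiency": 0.25}

GAPDH_PER_CELL = 200.0                        # assumed identical in both conditions
gene_per_cell = {"control": 10.0, "drug": 30.0}  # true per-cell change is 3x

gene_control = measured_signal(gene_per_cell["control"], control["n_cells"], control["efficiency"])
gene_drug = measured_signal(gene_per_cell["drug"], drug["n_cells"], drug["efficiency"])
gapdh_control = measured_signal(GAPDH_PER_CELL, control["n_cells"], control["efficiency"])
gapdh_drug = measured_signal(GAPDH_PER_CELL, drug["n_cells"], drug["efficiency"])

# The raw comparison is confounded by cell number and efficiency.
raw_fold_change = gene_drug / gene_control

# Dividing by GAPDH cancels both nuisance factors, as long as the assumption holds.
normalized_fold_change = (gene_drug / gapdh_drug) / (gene_control / gapdh_control)

print(f"raw fold change: {raw_fold_change:.2f}")
print(f"GAPDH-normalized fold change: {normalized_fold_change:.2f}")
```

In this toy case the raw signal actually goes down in the drug sample even though per-cell expression tripled, and the GAPDH ratio recovers the true change. If GAPDH per cell differed between the two conditions, the same division would introduce an error of exactly that factor.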

The plot thickened around 10 years ago, when people started making measurements showing that absolute transcript abundances can vary dramatically from cell to cell, even for housekeeping genes like GAPDH. So how should you normalize single cell data?

Olivia’s paper provides some answers, but also opens up more questions. One of the principal findings (also see this paper by Hermannus Kempe in Frank Bruggeman's group) is that transcript abundance roughly scales with volume. What this means is that bigger cells have more transcripts, and that while the number of, say, GAPDH mRNA can vary a lot from cell to cell, the concentration varies far less. This holds fairly globally. So what this means is that if you normalize by GAPDH, you are pretty much normalizing by the total (m)RNA content of the cell. In the case of single cell RNA-seq (will write up a comparison of that later), you are essentially also normalizing by total mRNA content. Thus, if you are interested in the concentration of your particular mRNA, this is a reasonable thing to do.
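Here is a small sketch of what that scaling implies, using invented numbers rather than anything from the paper. It assumes an idealized world in which every mRNA has a fixed concentration and the count is just concentration times volume.

```python
# Toy illustration: counts scale with volume, concentrations do not.
# Volumes and concentrations are hypothetical, chosen only for illustration.

relative_volumes = [1.0, 2.0, 4.0]   # three cells of different sizes

GAPDH_CONCENTRATION = 100.0          # mRNA per unit volume (assumed constant)
GENE_CONCENTRATION = 10.0            # mRNA per unit volume (assumed constant)

def counts_from_concentration(concentration, volumes):
    """Absolute mRNA count per cell under perfect volume scaling."""
    return [concentration * v for v in volumes]

gapdh_counts = counts_from_concentration(GAPDH_CONCENTRATION, relative_volumes)
gene_counts = counts_from_concentration(GENE_CONCENTRATION, relative_volumes)

# Normalizing by GAPDH acts as a stand-in for dividing by volume.
gene_over_gapdh = [g / h for g, h in zip(gene_counts, gapdh_counts)]

count_spread = max(gene_counts) / min(gene_counts)
ratio_spread = max(gene_over_gapdh) / min(gene_over_gapdh)

print("gene counts per cell:", gene_counts)
print("gene / GAPDH per cell:", gene_over_gapdh)
print(f"spread in counts: {count_spread:.1f}x, spread in ratio: {ratio_spread:.1f}x")
```

In this idealized case the absolute counts differ several-fold between the smallest and largest cell, while the GAPDH-normalized value is the same in every cell. Real cells deviate from perfect scaling, which is where the wrinkles come in.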

There are a couple of wrinkles here. First, one observation we made was that most of the mRNA we looked at had a higher concentration in smaller cells than in larger cells. It was not as wide as the volume variation, but it could go as high as 2x. We’re not sure of the origin of the effect, and it is possible that there’s some systematic error in our measurement that leads to this (although we really tried a lot of different things to discount such possibilities). In any case, it’s something to consider, especially if you want to be very quantitative.

Another wrinkle is that there are definitely situations we’ve encountered in which GAPDH mRNA concentration itself can change. This can happen either homogeneously across the entire population or from cell to cell: in one project we’ve been working on, we see some cells with very high GAPDH transcript abundance right next to cells with very low GAPDH transcript abundance. What to do? If you’re doing sequencing, I think that adding some spike-in controls to help normalize by the total number of molecules could help. Or just do some RNA FISH to get a baseline… :)
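The spike-in idea, roughly sketched: if you add a known number of control molecules to each sample, the reads they produce tell you how many reads one molecule is worth in that sample. The function name and all numbers below are hypothetical, and the sketch assumes reads are simply proportional to molecules, which real protocols only approximate.

```python
# Rough sketch of spike-in normalization. Names and numbers are hypothetical.

def estimate_molecules(gene_reads, spike_reads, spike_molecules_added):
    """Convert reads to an estimated molecule count using the spike-in as a ruler."""
    if spike_reads <= 0 or spike_molecules_added <= 0:
        raise ValueError("spike-in reads and molecules must be positive")
    reads_per_molecule = spike_reads / spike_molecules_added
    return gene_reads / reads_per_molecule

SPIKE_MOLECULES_ADDED = 1000  # same known amount added to every cell

# Two cells with identical GAPDH read counts but different spike-in read counts.
cell_a = {"gapdh_reads": 500, "spike_reads": 1000}
cell_b = {"gapdh_reads": 500, "spike_reads": 250}

gapdh_a = estimate_molecules(cell_a["gapdh_reads"], cell_a["spike_reads"], SPIKE_MOLECULES_ADDED)
gapdh_b = estimate_molecules(cell_b["gapdh_reads"], cell_b["spike_reads"], SPIKE_MOLECULES_ADDED)

print(f"cell A estimated GAPDH molecules: {gapdh_a:.0f}")
print(f"cell B estimated GAPDH molecules: {gapdh_b:.0f}")
```

Both cells have the same number of GAPDH reads, but the spike-in reveals that cell B's reads correspond to several times more molecules, which is the kind of difference that normalizing by GAPDH or by total reads would hide.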

Finally, I think it’s really important to carefully consider the directions of causality when making claims about global changes in transcription. Olivia’s heterokaryon experiments clearly show that increasing cell volume/cellular content can directly lead to increased transcription. What that means is that if you make a perturbation and then see a global change in gene expression, it may well be (in fact, very likely is) that the perturbation is somehow causing a cell volume change, which then results in a proportional global change in transcription. We have seen this very clearly in a number of cases.

Another point is that it really depends on context. We have a recent example in which absolute expression of a secreted protein remains constant, but the cell volume (and hence GAPDH expression) increases dramatically. So what matters, concentration? Absolute amount? The protein is secreted, and these cells are living in a primarily acellular environment, so the total amount of secreted protein presumably depends on the absolute number of molecules rather than the concentration. I think it's all a question of context. Which is of course a complete cop-out, I know... :)

Coming soon: description of a comparison of single cell RNA seq and RNA FISH.