
Wednesday, August 8, 2018

On mechanism and systems biology

(Latest in a slowly unfolding series of blog posts from the Paros conference.)

Related reading:




Mechanism. The word fills many of us with dread: “Not enough mechanism.” “Not particularly mechanistic.” “What’s the mechanism?” So what exactly do we mean by mechanism? I don’t think it’s an idle question—rather, I think it gets at the very essence of what we think science means. And I think there are practical consequences for everything from how we report results to the questions we choose to study (and consequently to how we evaluate science). So I’ll try to organize this post around a few concrete proposals.

To start: I think the definition I’ve settled on for mechanism is “a model for how something works”.

I think it’s interesting to think about how the term mechanism has evolved in our field from something that really was mechanism once upon a time into something that is really not mechanism. In the old days, mechanism meant figuring out, e.g., what an enzyme did and how it worked, perhaps in conjunction with other enzymes. Things like DNA polymerase and ATP synthase. The power of the hard mechanistic knowledge of this era is difficult to overstate.

What can we learn about the power of mechanism/models from this example?

As the author of this post argues, models/theories are “inference tickets” that allow you to make hard predictions in completely new situations without testing them. We are used to thinking of models as being written in math and making quantitative predictions, but this need not be the case. Here, the predictions of how these enzymes function have led to, amongst other things, our entire molecular biology toolkit: add this enzyme, it will phosphorylate your DNA; add this other enzyme, it will ligate that to another piece of DNA. That these enzymes perform certain functions is a “mechanism” that we used to predict what would happen if we put these molecules in a test tube together, and those predictions largely bore out, with huge practical implications.

Mechanisms necessarily come with a layer of abstraction. Perhaps we are more used to talking about these in models, where we have a name for them: “assumptions”. Essentially, there is a point at which we say, who knows, we’re just going to say that this is the way it is, and then build our model from there. In this case, it’s that the enzyme does what we say it will. We still have quite a limited ability to take an unknown sequence of amino acids and predict what it will do, and certainly very limited ability to take a desired function and just write out the sequence to accomplish said function. We just say, okay, assume these molecules do XYZ, and then our model is that they are important for e.g. transcription, or reverse transcription, or DNA replication, or whatever.

Fast forward to today, when a lot of us are studying biological regulation, and we have a very different notion of what constitutes “mechanism”. Now, it’s like: oh, I see a correlation between X and Y, the reviewer asks for “mechanism”, so you knock down X and see less Y, and that’s “mechanism”. Not to completely discount this—I mean, we’ve learned a fair amount by doing these sorts of experiments—but I think it’s pretty clear that this is not sufficient to say that we know how it works. Rather, this is a devolution to empiricism, which is something I think we need to fix in our field.

Perhaps the most salient question is what does it mean to know “how it works”? I posit that mechanism is an inference that connects one bit of empiricism to another. Let’s illustrate with a case where we do know the mechanism/model: a lever.






“How it works” in this context means that we need a layer of abstraction, and some degree of inference given that layer of abstraction. Here, the question may be “how hard do I have to push to lift the weight?”. Do we need to know that matter is composed of quarks to make this prediction, or how rigid the lever itself is? No. Do we need to know how the string works? No. We just assume the weight pulls down on the string, and whatever it’s made of is irrelevant because we know these things to be empirically the case. We are going to assume that the only things that matter are the locations of the weight, the fulcrum, and my finger, as well as the weight of the, uhh, weight and how hard I push. This is the layer of abstraction the model is based on. The model we use is that of force balance, and we can use it to predict exactly how hard to push given these distances and weights.
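For concreteness, here is a minimal sketch of that predictive model (an idealized rigid, massless lever balanced about its fulcrum; the function name and numbers are just for illustration):

```python
# Minimal sketch of the lever "mechanism": torque balance about the fulcrum.
# Assumes an ideal rigid, massless lever; names and numbers are illustrative only.

def force_to_lift(weight, d_weight, d_finger):
    """Force needed at the finger to just balance a weight on the other side of the fulcrum.

    Torque balance: F_finger * d_finger = weight * d_weight.
    """
    return weight * d_weight / d_finger

# Example: a 10 N weight 0.2 m from the fulcrum, finger pushing 1.0 m from the fulcrum.
print(force_to_lift(weight=10.0, d_weight=0.2, d_finger=1.0))  # -> 2.0 N
```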

How would a modern data scientist approach this problem? Probably take like 10,000 levers and discover Archimedes’ law of the lever by making a lot of plots in R. Who knows, maybe this is basically how Archimedes figured it out in the first place. It is often possible to figure out a relationship empirically, and even make some predictions. But that’s not what we (or at least I) consider a mechanism. I think there has to be something beyond pure empiricism, often linking very disparate scales or processes, sometimes in ways that are simply impossible to investigate empirically. In this case, we can use the concept of force to figure out how things might work with, say, multiple weights, or systems of weights on levers, or even things that don’t look like levers at all. Wow!

Okay, so back to regulatory biology. I think one issue we suffer from is that what we call mechanism has moved away from true “how it works” models and settled into what is really empiricism, sort of without us noticing it. Consider, for instance, development. People will say, oh, this transcription factor controls intestinal development. Why do they say that? Well, knock it out and there’s no intestine. Put it somewhere else and now you get extra intestine. Okay, but that’s not how it works. It’s empirical. How can you spot empiricism? One tell is an excessive obsession with statistics: effect sizes and p-values are often a sign that you didn’t really figure out how it works. Another is that we aren’t really able to apply what we learned outside of the original context. If I gave you a DNA typewriter and said, okay, make an intestine, you would have no idea how to do it, right? We can make more intestine in the original context, but the domain of applicability is pretty limited.

Personally, I think these difficulties arise partially because of our tools, but mostly because we are still focused on the wrong layers of abstraction. Probably the most common current layers of abstraction are those of genes/molecules, cells, and organisms. Our most powerful models/mechanisms to date are the ones where we could draw straight lines connecting these up. Like, mutate this gene, make these cells look funny, now this person has this disease. However, I think these straight lines are more the exception than the norm. Mostly, I think these mappings are highly convoluted in interwoven systems, making it very hard to make predictions based on empiricism alone (future blog post coming on the omnigenic model to discuss this further).

Which leads me to a proposal: let’s start thinking about other layers of abstraction. I think the successes of the genes/molecules -> cells paradigm have led to a certain ossification of thought, centered around treating genes and molecules and cells as the right layers of abstraction. But maybe genes and cells are not such fundamental units as we think they are. In the context of multicellular organisms, perhaps cells themselves are passive players, and rather it is communities of cells that are the fundamental unit. Organoids could be a good example of this, dunno. Also, it is becoming clear that genetics has some pretty serious limits in terms of determining mechanism in the sense I’ve defined. Is there some other layer involving perhaps groups of genes? Sorry, not a particularly inspired idea, but whatever, something like that maybe. Part of thinking this way also means that we have to reconsider how we evaluate science. As Rob pointed out, we have gotten so used to equating “mechanism” with “molecules and their effects on cells” that we have become closed-minded to other potential types of mechanism while also deceiving ourselves into allowing empiricism to pose as mechanism under the guise of statistics. We just have to be open to new abstractions and not hold everyone to the “What’s the molecule?” standard.

Of course, underlying this is an open question: do such layers of abstraction that allow mechanism in the true sense even exist? Complexity seems to be everywhere in biology, and my reaction so far has been to just throw my hands up and say “it’s complicated!”. But (and this is another lesson learned from Rob), that’s not an excuse—we have to at least try. And I do think we can find some mechanistic wormholes through the seemingly infinite space of empiricism that we are currently mired in.

Regardless of what layers of abstraction we choose, however, I think it is clear that a common feature of these future models will be that they are multifactorial, meaning that they will simultaneously incorporate the interactions of multiple molecules or cells or whatever units we choose. How do we deal with multiple interactions? I’m not alone in thinking that our models need to be quantitative, which, as noted in my first post, is an idea that’s been around for some time now. However, I think a fair charge is that in the early days of this field, our quantitative models were pretty much window dressing. I think (again a point that I’ve finally absorbed from Rob) that we have to start setting (and reporting) quantitative goals. We can’t pick and choose how our science is quantitative. If we have some pretty model for something, we’d better do the hard work to get the parameters we need, make hard quantitative predictions, and then stick to them. And if we don’t quantitatively get what we predict, we have to admit we were wrong. Not partly right, which is what we do now. Here’s the current playbook for a SysBio paper: quantitatively measure some phenomenon, make a nice model, predict that removal of factor X should send factor Y up by 4x, measure that it went up 2x, and put a bow on it and call it a day. I think we just have to admit that this is not good enough. This “pick and choose” mix of quantitative and qualitative analyses is hugely damaging because it makes it impossible to build upon these models. The problem is that qualitative reporting in, say, abstracts leads to people seeing “X affects Y” and “Y affects Z” and concluding “thus, X affects Z”, even though the effects of X on Y and Y on Z may be small enough to make this conclusion pretty tenuous.
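To make that last point concrete, here is a minimal simulated sketch (a toy linear chain X -> Y -> Z with independent noise; the coefficients are made up, not anyone’s real data) showing how two modest “X affects Y” and “Y affects Z” effects compound into a much weaker X-to-Z relationship:

```python
# Toy sketch: qualitative chaining of "X affects Y" and "Y affects Z" overstates "X affects Z".
# Assumes a simple linear chain with independent noise at each step; numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(size=n)   # X "affects" Y
z = 0.6 * y + rng.normal(size=n)   # Y "affects" Z

def r2(a, b):
    """Fraction of variance in b explained by a simple linear relationship with a."""
    return np.corrcoef(a, b)[0, 1] ** 2

print(f"X explains {r2(x, y):.0%} of Y, Y explains {r2(y, z):.0%} of Z, "
      f"but X explains only {r2(x, z):.0%} of Z")
# Roughly: ~26%, ~33%, and ~9%, respectively.
```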

So I have a couple proposals. One is that in abstracts, every statement should include some sort of measure of the percentage of effect explained by the putative mechanism. I.e., you can’t just say “X affects Y”. You have to say something like “X explains 40% of the change in Y”. I know, this is hard to do, and requires thought about exactly what “explains” means. But yeah, science is hard work. Until we are honest about this, we’re always going to be “quantitative” biologists instead of true quantitative biologists.

Also, as a related grand challenge, I think it would be cool to try to explain some regulatory process in biology out to 99.9%. As in, okay, we really now understand in some pretty solid way how something works. Like, we actually have mechanism in the true sense. You can argue that this number is arbitrary, and it is, but I think it could function well as an aspirational goal.

Any discussion of empiricism vs. theory will touch on the question of science vs. engineering. I would argue that—because we’re in an age of empiricism—most of what we’re doing in biology right now is probably best called engineering. Trying to make cells divide faster or turn into this cell or kill that other cell. And it’s true that, look, whatever, if I can fix your heart, who cares if I have a theory of heart? One of my favorite stories along these lines is how fracking was discovered, which was purely by accident (see the Planet Money podcast): a desperate gas engineer looking to cut costs kept cutting out an expensive chemical and seeing better yields, until he just went with pure water and, voila, more gas than ever. Why? Who cares! Then again, think about how many mechanistic models went into, e.g., the design of the drills, transportation, and everything else that goes into delivering energy. I think this highlights the fact that just as science and engineering are intertwined, so are mechanism and empiricism. Perhaps it’s time, though, to reconsider what we mean by mechanism to make it both more expansive and more rigorous.

Thursday, June 14, 2018

Notes from Frontiers in Biophysics conference in Paros, episode 1 (pilot): Where's the beef in biophysics?

Long blog post hiatus, which is a story for another time. For now, I’m reporting from what was a very small conference on the Frontiers of Biophysics from Paros, a Greek island in the Aegean, organized by Steve Quake and Rob Phillips. The goals of the conference were two-fold:
  1. Identify big picture goals and issues in biophysics, and
  2. Consider ways to alleviate suffering and further human health.
Regarding the latter, I should say at the outset that this conference was very generously supported by Steve through the foundation he has established in memory of his mother-in-law, Eleftheria Peiou, who by all accounts was a wonderful woman and who suffered through various discomforts in the medical system; that experience was the inspiration behind trying to reduce human suffering. I actually found this directive quite inspiring, and I’ve personally been wondering what I could do in that vein in my lab. I also wonder whether the time is right for a series of small Manhattan Projects on various topics so identified. But perhaps I’ll leave that for a later post.

Anyway, it was a VERY interesting meeting in general, and so I think I’m going to split this discussion up based on themes across a couple different blog posts, probably over the course of the next week or two. Here are some topics I’ll write about:

Exactly what is all this cell type stuff about

Exactly what do we mean by mechanism

I need a coach

What are some Manhattan Projects in biology/medicine

Maybe some others

So the conference started with everyone introducing themselves and their interests (research and otherwise) in a 5 minute lightning talk, time strictly enforced. First off, can I just say, what a thoughtful group of folks! It is clear that everyone came prepared to think outside their own narrow interests, which is very refreshing.

The next thing I noticed was a lot of hand-wringing about what exactly we mean by biophysics, which is what I’ll talk about for the rest of this blog post. (Please keep in mind that this is very much an opinionated take and does not necessarily reflect that of the conferees.) To me, biophysics as seemingly defined at this meeting basically needs a pretty fundamental rebranding as a whole. Raise your hand if biophysics means one of the following to you:
  1. Lipid rafts
  2. Ion channels
  3. A bunch of old dudes trying to convince each other how smart they are (sorry, cheap shot intended for all physicists) ;)
If you have not raised your hand yet, then perhaps you’re one of the lonely self-proclaimed “systems biologists” out there, a largely self-identified group that has become very scattered since around 2000. What is the history of this group of people? Here’s a brief (and probably offensive, sorry) view of molecular biology. Up until the 80s, maybe 90s, molecular biology had an amazing run, working out the genetic code, signaling, aspects of gene regulation, and countless other things I’m forgetting. This culminated in the “gene-jock” era in which researchers could relate a mutation to a phenotype in mechanistic detail (this is like the Cell golden era I blogged about earlier). Since that era, well… not so much progress, if you ask me—I’m still firmly of the opinion that there haven’t really been any big conceptual breakthroughs in 20-30 years, except Yamanaka, although one could argue whether that’s more engineering. I think this is basically the end of the one-gene-one-phenotype era. As it became clear that progress would require the consideration of multiple variables, it also became clear that a more quantitative approach would be good. For ease of storytelling, let’s put this date around 2000, when a fork in the road emerged. One path was the birth of genomics and a more model-free statistical approach to biology, one which has come to dominate a lot of the headlines now; more on that later. The other was “systems biology”, characterized by an influx of quantitative people (including many physicists) into molecular biology, with the aim of building a quantitative mechanistic model of the cell. I would say this field had its heyday from around 2000-2010 (“Hey look Ma, I put GFP on a reporter construct and put error bars on my graph and published it in Nature!”), after which folks from this group have scattered towards more genomics-type work or have moved towards more biological applications. I think that this version of "systems biology" most accurately describes most of the attendees at the meeting, many of whom came from single molecule biophysics.

I viewed this meeting as a good opportunity to take stock and see how well our community has done. I think Steve put it pretty concisely when he said “So, where’s the beef?” I.e., it's been a while, and so what does our little systems biology corner of the world have to show for itself in the world of biology more broadly? Steve posed the question at dinner: “What are the top 10 contributions from biophysics that have made it to textbook-level biology canon?” I think we came up with three: Hodgkin and Huxley’s model of action potentials, gene expression “noise”, and Luria and Delbrück’s fluctuation analysis of mutation (and maybe kinetic proofreading; other suggestions more than welcome!). Ouch. So one big goal of the meeting was to identify where biophysics might go to actually deliver on the promise and excitement of the early 2000s. Note: Rob had a long list of examples of cool contributions, but none of them has gotten a lot of traction with biologists.

I’ll report more on some specific ideas for the future later, but for now, here’s my personal take on part of the issue. With the influx of physicists came an influx of physics ideas. And I think this historical baggage mostly distracts from the problems we might try to solve (Stephan Grill made this point as well: we need fundamentally new ways of thinking about problems). This baggage from physics is, I think, a problem both strategically and tactically. At the most navel-gazy level, I feel like discussions of “Are we going to have Newton’s laws for biology” and “What is going to be the hydrogen atom of the cell” and “What level of description should we be looking at” never really went anywhere and feel utterly stale at this point. On a more practical level, one issue I see is trying to map quantitative problems that come up in biology back to solved problems in physics, like the renormalization group or Hamiltonian dynamics or what have you. Now, I’m definitely not qualified to get into the details of these constructs and their potential utility, but I can say that we’ve had physicists who are qualified for some time now, and I think I agree with Steve: where’s the beef?

I think I agree with Stephan that we as a community perhaps need to take stock of what it is that we value about the physics part of biophysics and then maybe jettison the rest. To me, the things I value about physics are quantitative rigor and the level of predictive power that goes with it (more on that in the blog post on mechanism). I love talking to folks who have a sense for the numbers and can spot when an argument doesn’t make quantitative sense. Steve also mentioned something that I think is a nice way to come up with fruitful problems, which is looking at existing data through a quantitative lens in order to find paradoxes in current qualitative thinking. To me, these are important ways in which we can contribute, and I believe they will have a broader impact in the biological community (and indeed already have through the work of a number of “former” systems biologists).

To me, all this raises a question that I tried to bring up at the meeting but that didn’t really gain much traction in our discussions, which is how do we define and build our community? So far, it’s been mostly defined by what it is not: well, we’re quantitative, but not genomics; we’re like regular biology, but not really; we’re… just not this and that. Personally, I think our community could benefit from a strong positive vision of what sort of science we represent. And I think we need to make this vision connect with biology. Rob made the point, which is certainly valid, that maybe we don’t need to care about what biologists think about our work. I think there’s room for that, but I feel like building a movement would require more than us just engaging in our own curiosities.

Which of course raises the question of why we would need a “movement” anyway. I think there are a few lessons to learn from our genomics colleagues, who I think have done a much better job of creating a movement. I think there are two main benefits. One is attracting talent to the field and building a “school of thought”. The other is attracting funding and so forth. Genomics has done both of these extremely well. There are dangers as well. Sometimes genomics folks sound more like advocates than scientists, and it’s important to keep science grounded in data. Still, overall, I think there are huge benefits. Currently, our field is a bunch of little fiefdoms, and like it or not, building things bigger than any one person involves a political dimension.

So how do we define this field? One theme of the conference that came up repeatedly was the idea of Hilbert Problems, which, for those who don’t know, are a list of open math problems set out by David Hilbert in 1900 that proved very influential. Can we perhaps build a field around a set of grand challenges? I find that idea very appealing. Although, given that I’ve increasingly come to think of biology as engineering rather than science, I wonder if phrasing these questions in engineering terms would be better, sort of like a bunch of biomedical Manhattan Projects. I’ll talk about some ideas we came up with in a later blog post.

Anyway, more in the coming days/weeks…

Wednesday, June 8, 2016

What’s so bad about teeny tiny p-values?

Every so often, I’ll see someone make fun of a really small p-value, usually along with some line like “If your p-value is smaller than 1/(number of molecules in the universe), you must be doing something wrong”. At first, this sounds like a good burn, but thinking about it a bit more, I just don’t get this criticism.

First, the number itself. Is it somehow because the number of molecules in the universe is so large? Perhaps this conjures up some image of “well, this result is saying this effect could never happen anywhere, ever, in the whole universe by chance—that seems crazy!”, and makes it seem like there’s some flaw in the computation or deduction. Pretty easy to spot the flaw in that logic, of course: configurational space can be much larger than the raw number of constituent parts. For example, let’s say I mix some red dye into a cup of water and then pour half of the dyed water into another cup. There is some probability that, randomly, all the red dye stays in one cup and no dye goes into the other. That probability is 1/(2^numberOfDyeMolecules), which is clearly going to be a pretty teeny-tiny number.

Here’s another example that may hit a bit closer to home: during cell division, the nuclear envelope breaks down, and so many nuclear molecules (say, lincRNAs) must get scattered throughout the cell (and yes, we have observed this to be the case for, e.g., Xist and a few others). Then, once the nucleus reforms, those lincRNAs seem to be right back in the nucleus. What is the chance that the lincRNAs ended up back in the nucleus purely by chance? Well, again, 1/(2^numberOfRNAMolecules) (assuming a 50/50 nucleus/cytoplasm split), which for many lincRNAs is probably like 1/1024 or so, but for something like MALAT1 would be 1/(2^2000) or so. I think we can pretty safely reject the hypothesis that there is no active trafficking of MALAT1 back into the nucleus… :)
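If you want to see just how teeny-tiny that gets, here is a quick back-of-the-envelope sketch (the molecule counts are just illustrative, and the 50/50 split is the same assumption as above):

```python
# Probability that all N molecules land in the nucleus under a 50/50 nucleus/cytoplasm null.
# Computed in log space to avoid underflow; molecule counts are illustrative only.
import math

for name, n_molecules in [("typical lincRNA", 10), ("MALAT1-like", 2000)]:
    log10_p = n_molecules * math.log10(0.5)   # p = (1/2)^N
    print(f"{name}: N = {n_molecules}, p ~ 10^{log10_p:.0f}")
# typical lincRNA: N = 10, p ~ 10^-3 (i.e., about 1/1024)
# MALAT1-like: N = 2000, p ~ 10^-602
```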

I think the more substantial concern people raise with these p-values is that when you get something so small, it probably means that you’re not taking into account some sort of systematic error; in other words, the null model isn’t right. For instance, let’s say I measured a slight difference in the concentration of dye molecules in the second cup above. Even a pretty small change will have an infinitesimal p-value, but the most likely scenario is that some systematic error is responsible (like dye getting absorbed by the material on the second glass or the glasses having slightly different transparencies or whatever). In genomics—or basically any study where you are doing a lot of comparisons—the same sort of thing can happen if the null/background model is slightly off for each of a large number of comparisons.
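Here is a small simulated sketch of that failure mode (a made-up example: a trivially small systematic offset plus a lot of data yields an absurdly small p-value against a null that assumes zero offset):

```python
# A slightly wrong null model plus lots of data gives a tiny p-value for a trivial effect.
# Numbers are made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=0.02, scale=1.0, size=1_000_000)  # true mean 0.02; the null assumes 0
t_stat, p_value = stats.ttest_1samp(data, popmean=0.0)
print(p_value)  # absurdly small (think 1e-80-ish) for a negligible effect size
```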

All that said, I still don’t see why people make fun of small p-values. If you have a really strong effect, then it’s entirely possible to get such a tiny p-value. In which case, the response is typically “well, if it’s that obvious, then why do any statistics?” Okay, fine, I’m totally down with that! But then we’re basically saying that there are no really strong effects out there: if you’re doing enough comparisons that you might get one of these tiny p-values, then any strong, real effect must generate one of these p-values, no? In fact, if you don’t get a tiny p-value for one of these multi-factorial comparisons, then you must be looking at something that is only a minor effect at best, like something that only explains a small amount of the variance. Whether that matters or not is a scientific question, not a statistical one, but one thing I can say is that I don’t know many examples (at least in our neck of the molecular biology woods) in which something that was statistically significant but explained only a small amount of the variance was really scientifically meaningful. Perhaps GWAS is a counterexample to my point? Dunno. Regardless, I just don’t see much justification in mocking the teeny-tiny p-value.

Oh, and here’s a teeny tiny p-value from our work. Came from comparing some bars in a graph that were so obviously different that only a reviewer would have the good sense to ask for the p-value… ;)


Update, 6/9/2016:
I think there are a lot of examples that illustrate some of these points better. First, not all tiny p-values are the result of obvious differences or faulty null models in large numbers of comparisons. Take Uri Alon's network motifs work. Following his book, he showed that in the transcriptional network of E. coli (424 nodes, 519 edges), there were 40 examples of autoregulation. Is this higher, lower, or equal to what you would expect? Well, maybe you have a good intuitive handle on random network theory, but for me, the fact that this is very far from the null expectation of around 1.2±1.1 autoregulatory motifs (p-value something like 10^-30) is not immediately obvious. One can (and people do) quibble about the particular type of random network model, but in the end, the p-values were always teeny tiny, and I don't think that is either obvious or unimportant.
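For anyone curious where that 1.2±1.1 comes from, here is a minimal sketch of one such null model (a random directed graph with the same number of nodes and edges, where each edge picks its target uniformly; this is a simplification of the randomizations in Alon's book):

```python
# Null expectation for the number of autoregulatory (self-loop) edges in a random
# directed network with the same node and edge counts as the E. coli network above.
import math

n_nodes, n_edges = 424, 519
p_self = 1.0 / n_nodes                              # chance a given edge is a self-loop
expected = n_edges * p_self                         # ~1.2 expected self-loops
std = math.sqrt(n_edges * p_self * (1.0 - p_self))  # ~1.1 (binomial, essentially Poisson)

observed = 40
print(f"expected {expected:.1f} +/- {std:.1f} self-loops; observed {observed}")
# The observed count sits dozens of standard deviations above the null expectation.
```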

Second, the fact that a large number of small effects can give a tiny p-value doesn't automatically discount their significance. My impression from genetics is that many phenotypes are composed of large numbers of small effects. Moreover, a perturbation such as a gene knockout can lead to a large number of small effects. Whether those are meaningful is (to my mind) an open scientific question, but whether the p-value is small or not is not really relevant.

This is not to say that all tiny p-values mean there's some science worth looking into. Some of the worst offenders are the examples of "binning", where, e.g. half-life of individual genes correlates with some DNA sequence element, R^2=0.21, p=10^-17 (totally made up example, no offense to anyone in this field!). No strong rule comes from this, so who knows if we actually learned something. I suppose an argument can be made either way, but the bottom line is that those are scientific questions, and the size of the p-value is irrelevant. If the p-value were bigger, would that really change anything?

Sunday, May 22, 2016

Spring cleaning, old notebooks, and a little linear algebra problem

Update 5/25/2016: Solution at the bottom

These days, I spend most of my time thinking about microscopes and gene regulation and so forth, which makes it all the more of a surprising coincidence that, on the eve of what looks to be a great math-bio symposium here at Penn tomorrow, I was doing some spring cleaning in the attic and happened across a bunch of old notebooks from my undergraduate and graduate school days in math and physics (and a bunch of random personal stuff that I'll save for another day—which is to say, never). I was fully planning to throw all those notebooks away, since of course the last time I really looked at them was probably well over 10 years ago, and I did indeed throw away a couple from some of my less memorable classes. But I was surprised that I actually wanted to keep hold of most of them.

Why? I think partly that they serve as an (admittedly faint) reminder that I used to actually know how to do some math. It's actually pretty remarkable to me how much we all learn during our time in formal class training, and it is sort of sad how much we forget. I wonder to what degree it's all in there somewhere, and how long it would take me to get back up to speed if necessary. I may never know, but I can say that all that background has definitely shaped me and the way that I approach problems, and I think that's largely been for the best. I often joke in lab about how classes are a waste of time, but it's clear from looking these over that that's definitely not the case.

I also happened across a couple of notebooks that brought back some fond memories. One was Math 250(A?) at Berkeley, then taught by Robin Hartshorne. Now, Hartshorne was a genius. That much was clear on day one, when he looked around the room and precisely counted the number of students in the room (which was around 40 or so) in approximately 0.58 seconds. All the students looked at each other, wondering whether this was such a good idea after all. Those who stuck with it got exceptionally clear lectures on group theory, along with by far the hardest problem sets of any class I've taken (except for a differential geometry class I dropped, but that's another story). Of the ten problems assigned every week, I could do maybe one or two, after which I puzzled away, mostly in complete futility, until I went to his very well attended office hours, at which he would give hints to help solve the problems. I can't remember most of the details, but I remember that one of the hints was so incredibly arcane that I couldn't imagine how anyone, ever, could have come up with the answer. I think Hartshorne knew just how hard all this was, because one time I came to his office hours after a midterm when a bunch of people were going over a particular problem, and I said "Oh yeah, I think I got that one!" and he looked at me with genuine incredulity, at which point I explained my solution. Hartshorne looked relieved, pointed out the flaw, and all went back to normal in the universe. :) Of course, there were a couple kids in that class from whom Hartshorne wouldn't have been surprised to see a solution, but that wasn't me, for sure.

While rummaging around in that box of old notebooks, I also found some old lecture notes that I really wanted to keep. Many of these are from one of my PhD advisors, Charlie Peskin, who had some wonderful notes on mathematical physiology, neuroscience, and probability. His ability to explain ideas to students with widely varying backgrounds was truly incredible, and his notes are so clear and fresh. I also kept notes from a couple of my other undergrad classes that I really loved, notably Dan Roksar's quantum mechanics series, Hirosi Ooguri's statistical mechanics and thermodynamics, and Leo Harrington's set theory class (which was truly mind-bending).

It was also fun to look through a few of the problem sets and midterms that I had taken—particularly odd now to look at some old dusty blue books and imagine how much stress they had caused at the time. I don't remember many of the details, but I somehow still vaguely remembered two problems, one in undergrad, one in grad school as being particularly interesting. The undergrad one was some sort of superconducting sphere problem in my electricity and magnetism course that I can't fully recall, but it had something to do with spherical harmonics. It was a fun problem.

The other was from a homework in a linear algebra class I took in grad school from Sylvia Serfaty, and I did manage to find it hiding in the back of one of my notebooks. A simple-seeming problem: given an n×n matrix A, formulate necessary and sufficient conditions for the 2n×2n matrix B defined as

B = |A A|
    |0 A|

to be diagonalizable. I'll give you a hint that is perhaps what one might guess from the n=1 case: the condition is that A = 0. In that case, sufficiency is trivial (B = 0 is definitely diagonalizable), but showing necessity—i.e., showing that if B is diagonalizable, then A = 0—is not quite so straightforward. Or, well, there's a tricky way to get it, at least. Free beer to whoever figures it out first with a solution as tricky as (or trickier than) the one I'm thinking of! Will post an answer in a couple days.
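If you want to poke at the claim numerically before trying to prove it (this is just a sanity check of the statement, not the proof), here is a quick symbolic sketch using sympy with a couple of arbitrary example matrices:

```python
# Sanity check of the claim: B = [[A, A], [0, A]] is diagonalizable only when A = 0.
# The example matrices are arbitrary; this checks instances, it does not prove anything.
import sympy as sp

def make_B(A):
    n = A.shape[0]
    top = A.row_join(A)                  # [A  A]
    bottom = sp.zeros(n, n).row_join(A)  # [0  A]
    return top.col_join(bottom)

A_nonzero = sp.Matrix([[1, 2], [0, 3]])
A_zero = sp.zeros(2, 2)

print(make_B(A_nonzero).is_diagonalizable())  # False for this nonzero A
print(make_B(A_zero).is_diagonalizable())     # True: B = 0 is trivially diagonalizable
```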

Update, 5/25/2016: Here's the solution!


Thursday, October 1, 2015

Fun new perspective paper with Ian Mellis

Wrote a perspective piece with Ian Mellis that just came out today:

http://genome.cshlp.org/content/25/10/1466.full

tl;dr: Where is systems biology headed? The singularity, of course... :)

(Warning: purposefully inflammatory.)

Monday, September 21, 2015

Some observations from Single Cell Genomics 2015

Just got back from a really great conference on Single Cell Genomics at Utrecht in the Netherlands. The lead organizer, Alexander van Oudenaarden (my postdoc mentor), was an absolutely terrific host, with excellent speakers, a superb venue, and a great dance party with live music in an old church (!) to cap it all off.

Here are some observations on the field from a relative outsider:

1. Single cell genomics is becoming much more democratic. As the tools have developed, the costs and complexity have gone way down in terms of preparing the libraries of cells, and it seems like the field has achieved some degree of consensus on barcoding and molecular identifiers. The droplet techniques are particularly remarkable in terms of the numbers of cells, and look relatively inexpensive and easy to set up (we are close to having it working in our lab, and we just started a little while ago).

2. Meanwhile, the quality of the data overall seems to have increased. Earlier on, I think there was a lot of talk about how much better one method for e.g. single cell RNA-seq was than the other, and the question on everyone's mind from the outside was which one to use. Nowadays, it doesn’t seem like any one method leads to radically different biological claims than anyone else’s. That’s not to say that there aren’t any differences, but rather that there are fewer practical differences between methods that I could see, especially compared to a few years ago. Then again, I'm completely naive to this area, so I could be way off base here.

3. tSNE was everywhere. Very cool method! It's important to remember, though, that it's just a newfangled projection method that tries to preserve local distances. You can't necessarily ascribe any biological significance to it–it's just a visualization method to help look at high-dimensional datasets. I think most folks at the conference realize that, but people from outside might not.

4. This was undoubtedly a technology meeting. That said, while the technology is still rapidly advancing, I feel that we have to start asking what the new conceptual insights one might get from single cell sequencing might be. I think this question is in the air, and I think some clever folks will start coming up with something, especially now that the methods are maturing. But it will require some deep thinking.

5. Along those lines, one thing that sort of bugs me is when people start their talk with a statement like "It is clear that ABC is important for the cell because of XYZ" as motivation for developing the method. Sometimes I would disagree with those statements. I think that it's important to really dig into the evidence for some phenomenon being important and present that fairly at the beginning of a talk.

6. At the same time, one amazing talk highlighted some actual, real, clinical personalized medicine using single cell sequencing. Now THAT is real-world impact. I don't think it's published yet, but when it is, I'm pretty sure you'll hear about it.

7. Imaging is making a comeback. For a while, I was sort of bummed that sequencing was the new hotness and imaging was old and busted. But Long Cai and Xiaowei Zhuang showed off some very nice recent results on multiplexing RNA FISH to get us closer to image-based transcriptomics. Still a ways to go, but it has a number of advantages, spatial information of course being the most obvious one, sensitivity being another. One big issue is cost reduction for oligonucleotides, though. That may take some creative thinking.

8. This field has a lot of young, energetic people! As Alexander remarked, the poster session was huge, and the quality was very high. Clearly a growth area. It is also clearly friendly but rather competitive. At this stage, though, I think the methods are all sort of blending together, and I get the sense that the big game have already been hunted in terms of flashy papers purely based on methods. So maybe the competitiveness will diminish a bit now, or at least transfer elsewhere.

9. Speaking of growth, Dutch people are indeed really tall. Like, really tall. I had to use the kids urinal in the bathroom at the airport when I got off the plane.

Next year this meeting will be at the Sanger Institute–should be fun!

Thursday, July 23, 2015

RNA integrity in tissues

Been thinking a lot about expression in tissue these days. Funny quote in a post from the always quotable Dan Graur: “They looked at the process of transcription in long-dead tissues? Isn’t that like studying the [...] circulation system in sushi?” He also points to this study from Yoav Gilad about RNA degradation, which is really great–wish there were more such studies.

We have been doing a fair amount of RNA FISH in tissues (big thanks to help from Shalev Itzkovitz and Long Cai), and while we haven't done a formal study, I can say that RNA degradation is a huge problem in tissue. We've seen RNA in some tissues disappear literally within minutes after tissue harvest. This seems somewhat correlated with the RNase content of the tissue, but it's still unclear. We've also worked with fresh frozen human samples, all collected ostensibly the same way, and found huge variability in RNA recovery, with some samples showing great signals and other, seemingly identical samples showing no RNA whatsoever. This is true even for GAPDH. No clue whether the variability is biological or not, but I'm inclined to think it's technical. The most likely culprit is ischemic time, over which we had no control in the human samples.

We’ve also found that we’ve been able to get decent signals in formaldehyde-fixed, paraffin-embedded samples, even though those are thought to be generally worse than fresh frozen. If I had to guess, I’d say it’s all about sample handling before freezing/fixing. I would be very hesitant to make any strong claims about gene expression without being absolutely certain about the sample quality. Problem is, I don't know what it means to be absolutely certain... :)

Anyway, so far, all we have is the sum of anecdotes, which I share here in case anyone’s interested. We really should do a more formal study of this at some point.

Wednesday, July 15, 2015

RNA-seq vs. RNA FISH, part 2: differential expression of 19 genes

On the heels of RNA FISH vs. RNA-seq (absolute abundance), here's a comparison of cells in two different conditions: differential expression of 19 genes, RNA FISH vs. RNA-seq:
A few are way off, but not bad, on the whole.

Sunday, June 14, 2015

RNA-seq vs. RNA FISH for 26 genes

Been meaning to post this for a while. Anyway, in case you're interested, here is a comparison of mean number of RNA per cell measured by RNA FISH to FPKM as measured by RNA-seq for 26 genes (bulk and also combined single cell RNA-seq). Experimental details in Olivia's paper. We used a standard RNA-seq library prep kit from NEB for the bulk, and used the Fluidigm C1 for the single cell RNA-seq. Cells are primary human foreskin fibroblasts.


Plots:
  1. Bulk RNA-seq vs. RNA FISH (avg. # molecules per cell)
  2. Bulk RNA-seq vs. RNA FISH (avg. # molecules per cell), linear scale
  3. Single cell RNA-seq vs. RNA FISH (avg. # molecules per cell)
  4. Single cell RNA-seq vs. RNA FISH (avg. # molecules per cell), linear scale


Probably could be better with UMIs and so forth, but anyway, for whatever it's worth.

Saturday, June 6, 2015

Gene expression by the numbers, day 3: the breakfast club

(Day 0, Day 1, Day 2, Day 3)

So day 3 was… pretty wild! And inspiring. A bit hard to describe. There was one big session. The session had some dancing. A chair was thrown. Someone got a butt in the face. I’m not kidding.

How did such nuttiness come to pass? Well, today the 15 of us all gave exit talks, where we each had the floor to discuss a point of our choosing. On the heels of the baseball game, we decided (okay, someone decided) that everyone should choose a walk-up song, and we'd play the song while the speaker made their way up for the exit talk. Later, I'll post the playlist and the conference attendees and set up a matching game. The playlist was so good!

(Note: below is a fairly long post about various things we talked about. Even if you don’t want to read it all, check out the scientific Rorschach test towards the end.)

I was somehow up first. (See if you can guess my song. People in my lab can probably guess my song.) The question I posed was “does transcription matter?” More specifically, if I changed the level of transcription of a gene from, say, 196 transcripts per cell to 248 transcripts per cell, does that change anything about the cell? I think the answer depends on the context. This led me to my main point, one I kind of mentioned in an earlier post: (I think) we need strong definitions based on functional outcomes in order to shape how we approach studying transcriptional regulation. I personally think this means that we really need to have much better measurements of phenotype so we can see what the consequences are of, say, a 25% increase in transcription. If there is no consequence, then should we bother studying why transcription is 25% higher in one situation vs. the other? Along these lines, Mo Khalil made the point that maybe we can turn to experimental evolution to help us figure out what matters, and that could help guide our search for the parts of regulation that matter.

Barak raised another great point about definitions. He started his talk by posing the question “Can someone please give me a good definition of an enhancer?” In the ensuing discussion, folks seemed to converge on the notion that in molecular biology, definitions of entities are often very vague and typically shaped much more by the experiments we can do. Example: is an enhancer a stretch of DNA that affects a gene independently of its position? At a distance? These notions often come from experiments in which people move the enhancer around and find that it still drives expression. Yet from the quantitative point of view, the tricky thing with experimentally based definitions is that these were often qualitative experiments. If moving the enhancer changes expression by 50%, then is that “location independent”?

Justin made an interesting point: can we come up with “fuzzy” definitions? Is there a sense in which we can build models that incorporate this fuzziness that seems to be pervasive in biology? I think this idea got everyone pretty excited: the idea of a new framework is tantalizing, although we still have no idea exactly what this would look like. I have to admit that personally, I’m not so sure that dispensing with the rigidity of definitions is a good thing–without rigid definitions, we run the risk of not saying anything useful and concrete at all. Perhaps having flexible definitions is actually similar to just saying that we can parametrize classes of models, with experiments eliminating some fraction of those model classes.

Jané brought in a great perspective from physics, saying that having a lot of arguments about definitions is actually a great thing. Maybe having a lot of competing definitions, with all of us trying to prove our own and contrast it with the others, will eventually lead us to the right answer; myopia in science can really lead to stagnation. I really like this thought. I feel like “big science” endeavors often fail to provide real progress because of exactly this problem.

The discussion of definitions also fed into a somewhat more meta discussion about interdisciplinary science and different approaches. Rob is strongly of the opinion that physicists should not need to get the permission of biologists to study biology, nor should they let biologists dictate what’s “biologically relevant”. I think this is right, and I also find myself often annoyed when people tell us what’s important or not.

Al made a great point about the role of theory in quantitative molecular biology. The point of theory is to say, “Hey, look at this, this doesn’t make sense. When you run the numbers, the picture we have doesn’t work–we need a new model.” Jané echoed this point, saying that at least with a model, we have something to argue about.

He also said that it would be great if we could formulate “no-go” models. Can we place constraints on the system in the abstract? Gasper put this really nicely: let’s say I’m a cell in a bicoid gradient trying to make a decision on what to do with my life. Let’s say I had the most powerful regulatory “computer” in the world in that cell. What’s the best that that computer could do with the information it is given? How precisely can it make its decision? How close do real cells get to this? I think this is a very powerful way to look at biology, actually.

Some of the discussions on theory and definitions brought up an important meta point relating to interdisciplinary work. I think it’s important that we learn to speak each other’s languages. I’ve very often heard physicists give a talk where they garble the name of a protein or something like that, and when a biologist complains, the response is sort of “well, whatever, it doesn’t matter”. Perhaps it doesn’t matter, but it can be grating to the ear, and the attitude can come across as somewhat disrespectful. I think that if a biologist were to give a talk and said “oh, this variable here called p… oh, yes, you call it h-bar, but whatever, doesn’t matter, I call it p”, it would not go over very well. I think we have to be respectful and aware of each other’s terminology and definitions and world view if we want to get each other to care about what we are both doing. And while I agree with Rob that physicists shouldn’t need permission to study biology, I also think it would be nice to have their blessings. Personally, I like to be very connected to biologists, and I feel like it has opened my mind up a lot. But I also think that’s a personal choice, perhaps informed by my training with Sanjay Tyagi, a biologist who I admire tremendously.

Another point about communicating across fields came up in discussing synthetic biology approaches to transcriptional regulation. If you take a synthetic approach to regulatory DNA, you will often encounter fierce resistance that you’re studying a “toy model” and not the real system. The counter, which I think is a reasonable argument, is that if you study just the existing DNA, you end up throwing your hands in the air and saying “complexity, who knew!”. (One conferee even said complexity is a waste of time: it’s not a feature but rather a reflection of our ignorance. I disagree.) So the synthetic approach may allow us to get at the underlying principles in a controlled and rigorous manner. I think that’s the essence of mechanistic molecular biology: make a controlled environment and then see if we can boil something down to its parts. Sort of like working in cell extracts. I think this is a sensible approach and one that deserves support in the biological community–as Angela said, it’s a “hearts and minds” problem.

That said, personally, I’m not so sure that it will be so easy to boil things down to their parts–partly because it's clearly very hard to find non-regulatory DNA to serve as the "blank slate" to work with for synthetic biology. I'm thinking lately that maybe a more data-first approach is the way to go, although I weirdly feel quite strongly against this view at the same time (much more on this in a perspective piece we are writing right now in lab). But that’s fundamentally scary, and for many scientists, this may not be a world they want to live in. Let me subject you to a scientific Rorschach test:

Image from here
What do you see here?
  1. A catalog of data points.
  2. A rule with an exception.
  3. A best fit line that explains, dunno, 60% of the variance, p = 0.002 (or whatever).
If you said #1, then you live in the world of truth and fact, which is admirable. You are also probably no fun at dinner parties.

Which leads us to #2 vs. #3. I posit that worldview #2 is science as we traditionally know it. A theory is a matter of belief, and doesn’t have a p-value. It can have exceptions, which point to places where we need some new theory, but in and of itself, it is a belief that is absolute. #3 is a different world, one in which we have abandoned understanding as we traditionally define it (and there is little right now to lead us to believe that #3 will give us understanding like #2, sorry omics people).

I would argue that the complexity of biological regulation may force us out of #2 and into #3. At this meeting, I saw some pretty strong evidence that a simple thermodynamic model can explain a fair amount of transcriptional regulation. So is that a theory, a simple explanation that most of us believe? And we just need some additional theory to explain the exceptions? Or, alternatively, can we just embrace the exceptions, come up with some effective theory based on regression, and then say we’ve solved it totally? The latter sounds “wrong” somehow, but really, what’s the difference between that and the thermodynamic model? I don’t think that any of us can honestly say that the thermodynamic model is anything other than an effective representation of molecular processes that we are not capturing fully. So then how different is that from an SVM telling us there are 90 features that explain most of the variance? How much variance do you explain before it’s a theory and not a statistical model? 90%? How many features before it’s no longer science but data science? 10? I think that where we place these bars is a matter of aesthetics, but it also defines in some ways who we are as scientists.

Personally, I feel like complexity is making things hopeless and we have to have a fundamental rethink transitioning from #2 to #3 in some way. And I say this with utmost fear and trepidation, not to mention distaste. And I’m not so sure I’m right. Rob holds very much the opposite view, and we had a conversation in which he said, well, this field is messy right now and it might take decades to figure it out. He could be right. He also said that if I’m right, then it’s essentially saying that his work on finding a single equation for transcription is not progress. Did I agree that that was not progress? I felt boxed in by my own arguments, and so I had to say “Yeah, I guess that’s not progress”. But I very much believe that it is progress, and it’s objectively hard to argue otherwise. I don’t know, I’m deeply ambivalent on this myself.

Whew. So as you can probably tell, this conference got pretty meta by the end. Ido said this meeting was not a success for him, because he hasn’t come away with any tangible, actionable items. I agree and disagree. This meeting was sort of like The Breakfast Club. It was a bunch of us from different points of view, getting together and arguing, and over time getting in touch with our innermost hopes and anxieties. Here’s a quote from Wikipedia on the ending of the movie:
Although they suspect that the relationships would end with the end of their detention, their mutual experiences would change the way they would look at their peers afterward.
I think that’s where I am. I actually learned a lot about regulatory DNA, about real question marks in the field, and got some serious challenges to how I’ve been thinking about science these days. It’s true that I didn’t come away with a burning experiment that I now have to do, but I would be surprised if my science were not affected by these discussions in the coming months and years (in fact, I am now resolved to work out a theory together with Ian in the lab by the end of the summer).

At the end, Angela put up Ann Friedman’s Disapproval Matrix:




She remarked, rightly, that even when we disagreed, we were all pretty much in the top half of the matrix. I think this speaks to the level of trust and respect everyone had for each other, which was the best part of this meeting. For my part, I just want to say that I feel lucky to have been a part of this conference and a part of this community.

Walk-up song match game coming soon, along with a playlist!

Friday, June 5, 2015

Gene expression by the numbers, day 2: take me out to the ballgame

(Day 0, Day 1, Day 2, Day 3 (take Rorschach test at end of Day 3!))

First off, just want to thank a commenter for providing an interesting and thoughtful response to some of the topics we discussed in day 1. Highly recommended reading.

Day 2 started with Rob trying to stir the pot by placing three bets (the stakes are dinner in Paris at a fancy restaurant, yummy!). First bet was actually with me, or really a bet against pessimism. He claimed that he would be able to explain Hana’s complicated data on transcription in different conditions once we measured the relevant parameters, like, say, transcription factor concentration (wrote about this in the day 1 post). My response was, well, even if you could explain that with all the transcription factor concentrations, that’s not really the problem I have. My problem is that it is impossible to build a simple predictive model of transcription here. The input-output relationship depends on so many other factors that we end up with a mess–there are no well-defined modules. To which Rob rightfully responded by saying that that's moving the goalposts: I said he can't do X, he does X, I say now you have to do Y. Fair enough. I accept the original challenge: I claim that he will not be able to explain the differences in Hana's data using just transcription factor concentration.

Next bet was with Barak. In the day 1 post, I mention the statistical approach vs. the mechanistic approach. Rob and Barak still have to formulate the bet precisely (and I think they actually agree mostly), but basically, it is a bet against the statistical approach. Hmm. Personally, I don't know how I come down on this. I am definitely sympathetic to Rob's point of view, and don't like the overemphasis these days on statistics (my thoughts). But my thoughts are evolving. Rob asked "Would it really have been possible to derive gravitation with a bunch of star charts and machine learning?" To which I responded with something along the lines of "well, we are machines, and we learned it." Sort of silly, but sort of not.

Final bet was with Ido (something about universality of noise scaling laws). Ido also put a bottle of Mezcal on the line for a resolution of this point. More on this some other time. I am going to try and get the bottle!

The talks were again great (I mean really great), if perhaps a bit more topically diffuse than yesterday. Started with evolution. Very cool, with beautiful graphs of clonal sweeps. An interesting point was that experimental evolution arrives at answers you wouldn't expect at the outset. They are rational (or can be) in hindsight, but not what you would have predicted early on–amazingly, even in pathways as well worked out as the metabolic pathways. I'm wondering if we could leverage this to understand pathways better in some way?

On to the "tech development" section, which was only somewhat about tech development, somewhat not. Stirling gave a great talk about human NET-seq. What I really liked about it was that in the end, there was a simple answer to a simple question (is transcription different over exons when they're skipped? exons vs. introns?). I think it's awesome to see that genome-wide data can give such clear results.

So far, everything was about control of the mean levels of transcription. Both Ido and I talked about the variance around that mean, with Ido providing beautiful data on input-output functions. On the Mezcal, Ido shows that there is a strong relationship between the Fano factor and the mean. I am wondering whether this is due to volume variation. Olivia's paper has some data on this. Probably the subject of another blog post at some point in the future.
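Just to make the volume idea concrete for myself, here's a minimal simulation sketch (not Ido's analysis, and all the numbers are made up): if transcript counts are Poissonian but the rate scales with a cell volume that varies from cell to cell, the Fano factor grows with the mean even though nothing about the gene's regulation has changed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 10000

# Hypothetical cell-to-cell volume variation (lognormal, ~30% spread)
volume = rng.lognormal(mean=0.0, sigma=0.3, size=n_cells)

for rate in [5, 20, 80, 320]:  # mean transcripts per unit volume (made up)
    # Poisson counts whose rate scales with cell volume
    counts = rng.poisson(rate * volume)
    fano = counts.var() / counts.mean()
    print(f"mean ~ {counts.mean():6.1f}   Fano factor ~ {fano:5.2f}")
```

For pure Poisson counts the Fano factor would sit at 1 no matter what the mean is; here it climbs with the mean purely because of the volume spread, which is why I'd want to rule this out before reading too much into the scaling.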

Theory: great discussion about Hill coefficients with Jeremy! How can you actually get thresholds in transcriptional regulation? Couple of ideas. There's conventional cooperativity, and there could also be other mechanisms, like titration via dummy binding sites, as in Nick Buchler's work. Surprising that we still have a lot of questions about mechanisms of thresholds after all this time.
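For my own notes, here's a minimal sketch of the two flavors of threshold we talked about, with entirely made-up parameter values: conventional cooperativity as a Hill function, and titration, where decoy binding sites soak up the transcription factor until they're saturated (roughly in the spirit of the Buchler-style models, not a reproduction of them).

```python
import numpy as np

def hill(tf, K=1.0, n=4):
    """Conventional cooperativity: a Hill function with coefficient n."""
    return tf**n / (K**n + tf**n)

def titration(tf_total, decoys=10.0, Kd_decoy=0.01, K_promoter=1.0):
    """Threshold via titration: decoy sites soak up TF until they are
    saturated; only the leftover free TF activates the promoter.
    Free TF comes from solving the binding equilibrium (a quadratic)."""
    b = decoys + Kd_decoy - tf_total
    free = 0.5 * (-b + np.sqrt(b**2 + 4 * Kd_decoy * tf_total))
    return free / (K_promoter + free)  # simple, non-cooperative activation

print(" TF   Hill(n=4)  titration")
for tf in np.linspace(0.01, 20, 6):
    print(f"{tf:5.2f}  {hill(tf):9.3f}  {titration(tf):9.3f}")
```

The titration curve stays near zero until total TF exceeds the number of decoy sites (10 here), then shoots up, so you get a sharp threshold even though the promoter itself binds non-cooperatively.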

Conversation with Jeremy and Harinder: how much do we know about whether sequence fully predicts binding? Thought for an experiment–if you sweep through transcription factor concentrations, what happens to binding as measured by e.g. ChIP-seq? Has anyone done this experiment?

Then, off to the Red Sox vs. the Twins. Biked over there on Hubway with Ron, which was perfect on a really lovely day in Cambridge. The game was super fun! Apparently there were some people playing baseball there, but that didn't distract me too much. Had a great time chatting with various folks, including two really awesome students from Angela's lab, Clarissa Scholes and Ben Vincent, who joined in the fun. Talked with them about the leaky pipeline, which is something I will never, ever discuss online for various reasons. Also crying in lab–someone at the conference told me that they've made everyone in their lab cry, which is so surprising if you know this person. Someone also told me that I'm weird. Like, they said "Arjun, you are weird." Which is true.

Oh, and the Twins won, which made me happy–not because I know the first thing about baseball, but I hate the Red Sox, mostly because of their very annoying fans. Oops, did I say that out loud?

Okay, fireworks are happening here on day 3. More soon!

Thursday, June 4, 2015

Gene expression by the numbers, verdict on day 1: awesome!

(Day 0 | Day 1 | Day 2 | Day 3 (take Rorschach test at end of Day 3!))

Yesterday was day 1 of Gene expression by the numbers, and it was everything I had hoped it would be! Lots of discussion about big ideas, little ideas, and everything in between. Jane Kondev said at some point that we should have a “controversy meter” based on the loudness of the discussion. Some of the discussions would definitely have rated highly, which is great! Here are some thoughts, very much from my own point of view:

We started the day with a lively discussion about how I am depressed (scientifically) :). I’m depressed because I’ve been thinking lately that maybe biology is just hopelessly complex, and we’ll never figure it out. At the very least, I’ve been thinking we need wholly different approaches. More concretely for this meeting, will we ever truly be able to have a predictive understanding of how transcription is regulated? Fortunately (?), only one other person in the room admitted to such feelings, and most people were very optimistic on this count. I have to say that at the end of the day, I’m not completely convinced, but the waters are muddier.

Who is an optimist? Rob Phillips is an optimist! And he made a very strong point. Basically, he’s been able to take decades of data on transcriptional regulation in E. coli and reduce it to a single, principled equation. Different conditions, different concentrations, whatever, it all falls on a single line. I have to say, this is pretty amazing. It’s one thing to be an optimist, another to be an optimist with data. Well played.

And then… over to eukaryotes. I don’t think anyone can say with a straight face that we can predict eukaryotic transcription. Lots of examples of effects that don’t resolve with simple models, and Angela DePace gave a great talk highlighting some of the standard assumptions we make that may not actually hold. So what do we do? Just throw our hands in the air and say “Complexity, yipes!”?

Not so fast. First, what is the simple model? The simplest model is the thermodynamic model. Essentially, each transcription factor binds to the promoter independently of the others, and their effects on transcription are independent of each other. Um, duh, that can’t work, right? I was of the opinion that decades of conventional promoter bashing haven’t really provided much in the way of general rules, and more quantitative work along these lines hasn’t really done so either.
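To remind myself what "the thermodynamic model" even means in its barest form, here's a minimal sketch for a single activator binding site, with invented numbers for the binding energy and copy numbers (this is not anyone's actual fit): write a partition function over the bound and unbound states and read off the occupancy.

```python
import numpy as np

def p_bound(tf_copies, eps=-8.0, n_ns=5e6, kBT=1.0):
    """Occupancy of a single activator site in the thermodynamic model.

    tf_copies : number of TF molecules in the cell (invented)
    eps       : binding energy of the specific site relative to the
                non-specific genomic background, in units of kBT (invented)
    n_ns      : number of non-specific background sites (~genome size)

    Standard two-state partition-function result: P(bound) = w / (1 + w),
    where w is the Boltzmann-weighted ratio of bound to unbound states.
    """
    w = (tf_copies / n_ns) * np.exp(-eps / kBT)
    return w / (1.0 + w)

for tf in [10, 100, 1000, 10000]:
    print(f"{tf:6d} TF copies -> P(bound) = {p_bound(tf):.3f}")
```

The real models dress this up with multiple sites, interactions with polymerase, and so on, but the basic move is always the same: enumerate states, weight them by their Boltzmann factors, and compute the probability of the transcriptionally active ones.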

But Barak brought up an extremely good point, which is that a lot of these approaches to seeing how promoter changes affect transcription suffer from being very statistically underpowered. They also made the point (with data) that once you really start sampling, maybe things are not so bad–and amazingly enough, maybe some of the simplest and “obviously wrong” caricatures of transcriptional regulation are not all that far off. Maybe with sufficient sampling, we can start to see rules and exceptions, instead of a big set of exceptions. Somehow, this really resonated with me.

I’m also left a bit confused. So do we have a good understanding of regulation or not? I saw some stuff that left me hopeful that maybe simple models may be pretty darn good, and maybe we’re not all that far off from the point where if I wanted to dial up a promoter that expressed at a certain level, I just type in this piece of DNA and I’ll get close. I also saw a lot of other stuff that left me scratching my head and sent me back to wondering how we’ll ever figure it all out.

There was also an interesting difference in style here. Some approach things from a very statistical point of view (do a large number of different things and look for emergent patterns). Some approach things from a very mechanistic point of view (tweak particular parameters we think are important, like distances and individual bases, and see what happens). I usually think it’s very intellectually lazy to say things like “we need both approaches, they are complementary”, but in this case, I think it’s apt, though if I had to lean one way, personally, I think I favor the statistical approach. Deriving knowledge from the statistical approach is a tricky matter, but that’s a bigger question. How much variance do we need to explain? As yet unanswered; see later for some discussion about the elephant in the room.

Models: some cool talks about models. One great point: “No such thing as validating a model. We can only disprove models.” A point of discussion was how to deal with models that don’t fit all the data. Do we want to capture everything? How many exceptions to the rule can you tolerate before it’s no longer a rule?

Which brings us to a talk that was probably highest on the controversy meter. In this one, the conferee who shares my depression showed some results that struck me as very familiar. The idea was to build a quantitative model, then do some experiments measuring the transcriptional response, and the model fits nicely. Then you change something in the growth medium, and suddenly, the model is out the window. We’ve all seen this: day to day variability, batch variability, “weird stuff happened that day”, whatever. So does the model really reflect our understanding of the underlying system?

This prompted a great discussion about what our goals are as a community. Is the goal really to predict everything in every condition? Is that an unreasonable thing to expect from a model? This got down to understanding vs. predicting. Jane brought up the point that these are different: Google can predict traffic, but it doesn’t understand traffic. A nice analogy, but I’m not sure that it works the other way around. I think understanding means prediction, even if prediction doesn’t necessarily mean understanding. Perhaps this comes down to an aesthetic choice. Practically speaking, for the quantitative study of transcription, I think that the fact that the model failed to predict transcription in a different condition is a problem. One of my big issues with our field is that we have a bunch of little models that are very context specific, and the quantitative (and sometimes qualitative) details vary. How can we put our models together if the sands are shifting under our feet all the time? I think this is a strong argument against modularity. Rob made the solid counter that perhaps we’re just not measuring all the parameters–if we could measure transcription factor concentration directly, maybe that would explain things. Perhaps. I’m not convinced. But that’s just, like, my opinion, man.

So to me, the big elephant in the room that was not discussed is: what exactly matters about transcription? As quantitative scientists, we may care about whether there are 72 transcripts in this cell vs. 98 in the one next door, but does that have any consequences? I think this is an important question because I think it can shape what we measure. For instance, this might help us answer the question about whether explaining 54% of the variance is enough–maybe the cell only cares about on vs. off, in which case all the quantitative stuff is irrelevant (I think there is evidence for and against this). Maybe then all we should be studying is how genes go from an inactive to an active state and not worry about how much they turn on. Dunno, all I’m saying is that without any knowledge of the functional consequences, we’re running the risk of heading down the wrong path.

Another benefit to discussing functional consequences is that I think it would allow us to come up with useful definitions that we can then use to shape our discussion. For instance, what is cross-talk? (Was the subject of a great talk.) We always talk about it like it’s a bad thing, but how do we know that? What is modularity? What is noise? I think these are functional concepts that must have functional definitions, and armed with those definitions, then maybe we will have a better sense of what we should be trying to understand and manipulate with regard to transcriptional output.

Anyway, looking forward to day 2!

Tuesday, June 2, 2015

Gene expression by the numbers, day 0: Big picture questions about transcription

(Day 0 | Day 1 | Day 2 | Day 3 (take Rorschach test at end of Day 3!))

So just about to get on a plane to go to Boston/Cambridge for a meeting on transcription–I think it's going to be a lot of fun! Bunch of folks with a quantitative bent getting together, including the organizers Al Sanchez, Hernan Garcia, Jané Kondev, Angela DePace and Rob Phillips (big thanks for all their hard work!). The big reason I'm excited is that this is not going to be a typical meeting: the goal is to dispense with the usual formalities of a meeting (like a bunch of boring talks that nobody pays attention to) and instead actually talk with each other about where we want the field to head and how we might get there. We even all made short videos beforehand as a sort of pre-conference introduction!

This is going to require changing our usual scientific behavior, which is to stamp out wild ideas as soon as we hear them. You know that crazy person who asks you some weird question at the end of your seminar about bees and the number 12? Well, that's going to be me, and I won't be satisfied with "talking about it later off-line". :)

Nor is it going to be completely off-line. I'm going to blog about the goings-on in the hope that others can participate as well in what is sadly (but perhaps necessarily) a rather small event. So drop me a line if you have any burning questions about transcription.

What are the sorts of questions we'll be discussing? Here's a few I’ve been thinking about after watching everyone’s videos:
  1. How close are we to a predictive understanding of the regulatory code? I.e., if I give you a cell type and a piece of DNA, can you predict how much transcription there will be?
  2. (Related bonus question) How do we deal with the complexity of metazoan transcriptional regulation? What new conceptual frameworks will we need to make further progress?
  3. What are some new methods that we could develop that would help us understand transcription? What are the quantities that we would like to measure?
  4. Development appears to be incredibly precise–how do developing organisms achieve this despite the sloppiness of chemical reactions? To what extent is this precision an intrinsic property of the cell and to what extent is it an emergent property of the interaction of different cells?
  5. What are the functional consequences of transcription? Which aspects of transcription “matter” and which ones are irrelevant? In chemistry, we talk about rate-limiting reactions. What are the biology-limiting reactions in transcription? What should we be measuring?
More soon!

Friday, May 22, 2015

RNA doesn't correlate with protein? Huh?

tl;dr: I don’t know why people say that RNA doesn’t correlate with protein. There are different contexts to this question, and some recent experiments may make the question a bit confusing, but overall, I’m pretty sure that most of the time, if you increase the amount of RNA for a given gene, you will end up with more of the protein encoded by that gene. I’m sure there are counter-examples, though–if you know of any, please fill me in.

In our group, when we present work on RNA abundances, we are often faced with the question: “Well, what about the protein?” (fair enough). This is usually followed by the statement “Because of course it is well known that RNA doesn’t correlate with protein.” Umm, what?

I have to say that I’m a bit puzzled by this bit of apparently obvious and self-evident truth. I thought that most people accept that the central dogma of DNA to RNA to protein is a pretty solid fact in most cases. So… if you have more RNA, that should lead to more protein, right? Shouldn’t that be the null hypothesis?

Apparently this notion has been around for a long time, though nowadays it is perhaps a bit more conceptually confusing due to a few recent results. Perhaps the biggest one was the Schwanhausser paper, in which they compare RNA-seq to mass-spec and show that there is a distinct lack of correlation between mean RNA levels and mean protein levels across all genes (also the Weissman ribosome profiling paper). What this means, on the face of it, is that even if gene A produces more RNA than gene B, it may be the case that there is more protein B than protein A. Fine. There are differences in protein translation rate and degradation rate leading to these differences, no surprises there. Plus, Mark Biggin and Allan Drummond make the point that any measurement noise will lead to decorrelation even if things are very correlated, and their reanalyses seem to indicate that the correlation between RNA and protein may actually be considerably higher than initially reported.
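Here's a minimal simulation of that reanalysis argument (with made-up noise levels, not their actual numbers): even if true RNA and protein levels are tightly correlated across genes, independent measurement noise in each assay drags the observed correlation down.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 5000

# "True" log10 RNA and protein levels, tightly correlated across genes
log_rna = rng.normal(1.0, 1.0, n_genes)
log_protein = log_rna + rng.normal(0.0, 0.3, n_genes)
print("underlying correlation:", round(np.corrcoef(log_rna, log_protein)[0, 1], 2))

# Add independent measurement noise to each assay (think RNA-seq, mass spec)
for noise_sd in [0.2, 0.5, 1.0]:
    obs_rna = log_rna + rng.normal(0.0, noise_sd, n_genes)
    obs_protein = log_protein + rng.normal(0.0, noise_sd, n_genes)
    r = np.corrcoef(obs_rna, obs_protein)[0, 1]
    print(f"measurement noise sd {noise_sd}: observed correlation ~ {r:.2f}")
```

So a mediocre observed correlation is perfectly consistent with a much tighter underlying relationship, which is basically the point of those reanalyses.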

The next example that’s a bit closer to home for me is whether RNA levels and protein levels correlate, even for the same gene, across single cells. Here, it gets a bit more complex, and one might expect a variety of behaviors depending on the burstiness of transcription, degradation rate of the RNA and the degradation rate of the protein. Experimentally, there are some cases in which the RNA and protein of a particular gene do not correlate in single cells (Taniguchi et al. Science 2010 is a particularly good example). This may be due to long protein half-life, which effectively smooths over RNA fluctuations. In our PLOS 2006 paper (Fig. 7), we showed that there can be a strong correlation between RNA and protein when the protein degrades fast, and that correlation goes down a lot when the protein degrades more slowly.
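To see the half-life effect in cartoon form, here's a minimal simulation (all rates invented, crude Euler-style updates, not our actual model): a bursty gene makes mRNA, the mRNA makes protein, and the only thing I change is the protein degradation rate. I'm using a long time trace as a stand-in for a population of cells, which is a bit of a cheat, but the qualitative point stands: fast-degrading protein tracks the mRNA fluctuations, slow-degrading protein averages over them and the correlation drops.

```python
import numpy as np

rng = np.random.default_rng(2)

def rna_protein_corr(protein_decay, n_steps=200_000, dt=0.01):
    # Two-state (bursty) promoter feeding translation; crude Euler updates.
    k_on, k_off = 0.1, 1.0       # promoter switching rates (invented)
    k_tx, k_deg_rna = 50.0, 1.0  # transcription and mRNA degradation rates
    k_tl = 5.0                   # translation rate per mRNA
    gene_on, rna, protein = 0, 0.0, 0.0
    rna_trace, protein_trace = [], []
    for _ in range(n_steps):
        if gene_on and rng.random() < k_off * dt:
            gene_on = 0
        elif not gene_on and rng.random() < k_on * dt:
            gene_on = 1
        rna += (k_tx * gene_on - k_deg_rna * rna) * dt
        protein += (k_tl * rna - protein_decay * protein) * dt
        rna_trace.append(rna)
        protein_trace.append(protein)
    half = n_steps // 2  # discard the first half as burn-in
    return np.corrcoef(rna_trace[half:], protein_trace[half:])[0, 1]

for decay in [5.0, 0.05]:  # fast vs. slow protein degradation
    print(f"protein decay rate {decay}: RNA-protein correlation ~ {rna_protein_corr(decay):.2f}")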

And of course there’s the whole world of post-translational modifications, like during the cell cycle, etc., in which protein activity and potentially levels change independent of transcript abundance. Well, dunno what to say about that, I’m biased to just think about RNA. :)

Nevertheless, overall, I think it’s pretty safe to assume most of the time that if you increase RNA abundance for a particular gene, you will end up with more of the encoded protein. I think that should be the null hypothesis. If anyone knows of any counterexamples, please let me know.

Oh, and by the way, in case you’re wondering, transcription also correlates with RNA.

Friday, May 1, 2015

Can I just normalize expression levels by GAPDH?

tl;dr: Depends on context. Probably yes in many instances, but there are definitely situations where you can’t. And beware of global changes in transcription–may just be volume effects.

Now that Olivia’s paper is out (slidecast, full text), thought I’d write a bit about the time-honored practice of normalizing gene expression by GAPDH. A bit of context: when people did RT-qPCR (remember that?) on bulk RNA isolated from, say, cells with and without drug, the question would arise as to how to normalize the measurement by number of cells, differences in RNA isolation efficiency, etc. The way people normally do this in a practical sense is by dividing by the expression of housekeeping genes like GAPDH, which we assume is roughly the same per cell in both conditions. This is of course an assumption, and one which is most definitely broken in some situations.

The plot thickened around 10 years ago, when people started making measurements showing that absolute transcript abundances can vary dramatically from cell to cell, even for housekeeping genes like GAPDH. So how should you normalize single cell data?

Olivia’s paper provides some answers, but also opens up more questions. One of the principal findings (also see this paper by Hermannus Kempe in Frank Bruggeman's group) is that transcript abundance roughly scales with volume. What this means is that bigger cells have more transcripts, and that while the number of, say, GAPDH mRNA can vary a lot from cell to cell, the concentration varies far less. This holds fairly globally. So what this means is that if you normalize by GAPDH, you are pretty much normalizing by the total (m)RNA content of the cell. In the case of single cell RNA-seq (will write up a comparison of that later), you are essentially also normalizing by total mRNA content. Thus, if you are interested in the concentration of your particular mRNA, this is a reasonable thing to do.
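Here's a toy version of that logic with made-up numbers: if both GAPDH and your gene of interest scale with volume, then raw counts mostly report cell size, while the GAPDH-normalized value is (approximately) the concentration and is flat across volumes.

```python
import numpy as np

rng = np.random.default_rng(3)
n_cells = 5000

# Assumed: cell volume varies a few fold; counts scale with volume
volume = rng.lognormal(0.0, 0.4, n_cells)
gapdh = rng.poisson(500 * volume)   # abundant housekeeping gene (made up)
target = rng.poisson(20 * volume)   # gene of interest, constant concentration

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print("raw target counts vs. volume:      ", round(corr(target, volume), 2))
print("GAPDH-normalized target vs. volume:", round(corr(target / gapdh, volume), 2))
```

The raw target counts correlate strongly with volume; the ratio doesn't, which is just another way of saying that dividing by GAPDH (or by total mRNA) gets you to concentration.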

There are a couple of wrinkles here. First, one observation we made was that most of the mRNAs we looked at had a higher concentration in smaller cells than in larger cells. The effect was not as large as the volume variation itself, but it could go as high as 2x. We’re not sure of the origin of the effect, and it is possible that there’s some systematic error in our measurement that leads to this (although we really tried a lot of different things to discount such possibilities). In any case, it’s something to consider, especially if you want to be very quantitative.

Another wrinkle is that there are definitely situations we’ve encountered in which GAPDH mRNA concentration itself can change. This can happen homogeneously across the entire population, or even from cell to cell–in one project we’ve been working on, we see some cells with very high GAPDH transcript abundance right next to cells with very low GAPDH transcript abundance. What to do? If you’re doing sequencing, I think that adding some spike-in controls to help normalize by the total number of molecules could help. Or just do some RNA FISH to get a baseline… :)

Finally, I think it’s really important to carefully consider the direction of causality when making claims about global changes in transcription. Olivia’s heterokaryon experiments clearly show that increasing cell volume/cellular content can directly lead to increased transcription. What that means is that if you make a perturbation and then see a global change in gene expression, it may be (in fact, very likely is) that the perturbation is somehow causing a cell volume change, which then results in a proportional global change in transcription. We have seen this very clearly in a number of cases.

Another point is that it really depends on context. We have a recent example in which the absolute expression of a secreted protein remains constant, but the cell volume (and hence GAPDH expression) increases dramatically. So what matters, concentration? Absolute amount? The protein is secreted, and these cells are living in a primarily acellular environment, so the total amount of secreted protein presumably depends on the absolute number of molecules rather than the concentration. I think it's all a question of context. Which is of course a complete cop-out, I know... :)

Coming soon: description of a comparison of single cell RNA seq and RNA FISH.

Friday, January 23, 2015

Some thoughts on Tomasetti and Vogelstein (and post-publication review)

Interesting paper from Tomasetti and Vogelstein entitled “Variation in cancer risk among tissues can be explained by the number of stem cell divisions” (screw the paywall). This paper has generated a lot of controversy on Twitter and blogs, which is in many ways a preview of what a post-publication review environment might look like. I worry that it’s been largely negative, so here are my (admittedly relatively uninformed) thoughts.

Here is the abstract:
Some tissue types give rise to human cancers millions of times more often than other tissue types. Although this has been recognized for more than a century, it has never been explained. Here, we show that the lifetime risk of cancers of many different types is strongly correlated (0.81) with the total number of divisions of the normal self-renewing cells maintaining that tissue’s homeostasis. These results suggest that only a third of the variation in cancer risk among tissues is attributable to environmental factors or inherited predispositions. The majority is due to “bad luck,” that is, random mutations arising during DNA replication in normal, noncancerous stem cells. This is important not only for understanding the disease but also for designing strategies to limit the mortality it causes.
Basically, the idea is that part of the reason some tissues are more prone to cancer is that they have a lot of stem cell divisions–an idea supported by the data they present. I think this is a really important point! In particular, it in some ways establishes what I consider an important null: in considering cancer incidence, it seems reasonable to expect that more proliferative tissues will be more prone to cancer just because of the increased number of cell divisions. Darryl Shibata (USC) has a series of really nice papers on this point, focusing on colorectal cancer. In particular, in this paper, he points out that such models would predict that taller (i.e., bigger) people would have more stem cells and thus should have a higher incidence of cancer. And that’s actually what they find! I saw Shibata give an (excellent) talk on this at a Physics of Cancer workshop, and afterwards, a cancer biologist criticized the height result, incredulously saying “Well, but there are so many other factors associated with being tall!” Fair enough. But I think that Darryl’s is an economical model that explains the data, and is what I would consider an important null against which deviations should be measured. I think this is a nice point that Tomasetti and Vogelstein make as well.

What are the consequences of such a null? Tomasetti and Vogelstein frame their discussion around stochastic, environmental and genetic influences on cancer incidence between tissues. Emphasis on between tissues. What exactly does this mean? Well, what they are saying is that if you compare lung cancer rates in smokers vs. non-smokers (environmental effect), the rate of getting cancer is around 10-20 times higher, but your chances of getting lung cancer even as a non-smoker are still much higher than of getting, say, head osteosarcoma, and a plausible reason for this is that there are way more stem cell divisions in lung than in the bones in your head. Similarly, colorectal cancer incidence rates are much higher in people with a genetic predisposition (APC mutation), but again, even without the genetic predisposition, the incidence is still many orders of magnitude higher than in other tissues with much lower rates of stem cell divisions. I think this is pretty interesting! Of course, as with Shibata’s height association, the association with stem cell divisions is not proof that the stem cell divisions per se are the cause, but one of the nice things about Shibata’s work is that he shows that a model of stem cell divisions and the number of genetic “hits” required for a particular cancer can match the actual cancer incidence data. So I think this is a plausible null model for a baseline of how much cancer certain tissues will get. Incidentally, this made me realize a perhaps obvious point about the genetic determinants of cancer: if you find an association of a gene with cancer incidence, it may be that the association exists because the gene is associated with, e.g., height, in which case, yes, there is technically a genetic underpinning for that variation, but it is hard to imagine designing any sort of drug based on this finding. Tomasetti and Vogelstein make this point in their paper.

The authors then go on to further analyze their data and separate cancers into ones in which the variance in incidence is dominated by “stochastic” effects vs. “deterministic” effects. I can’t say I’ve gone into the details of this analysis, but it seems interesting–and a natural question to ask with these data. Here are a few thoughts on the ideas this analysis explores. One question that has come up a lot is why this correlation is not stronger, especially on a linear scale. I think that one issue is that the division into stochastic, environmental and genetic is missing a big component, which is the tissue, cell and molecular biology of cancer. Some tissues may require more genetic “hits” than others, or a long series of epigenetic effects, or have structures that enable rapid removal of defective stem cells, and so even tissues with the same number of divisions, in the absence of any genetic or environmental factors, will have different rates of cancer. Another issue is that these data are imperfect, and so you will get some spread no matter what. Still, I think the association is real and interesting.

Anyway, I think this “null model” is pretty cool. I wonder if one of the reasons that we focus so much on environmental and genetic effects is that we can do “experiments” on them, whereas the causal links in the stem cell division hypothesis are hard to prove.

There was a very interesting critique from Yaniv Erlich that said that the authors’ analysis implicitly assumes that there is no interaction between the number of stem cell divisions and genetic and environmental factors. A good point, although I do think that Tomasetti and Vogelstein have thought about this–as I mentioned, they say explicitly:
The total number of stem cells in an organ and their proliferation rate may of course be influenced by genetic and environmental factors such as those that affect height or weight.
Their example about the mouse vs. human incidence of colon vs. small intestine cancer in the case of the APC mutation is, I think, a nice piece of evidence suggesting that the number of divisions is a very important factor in determining cancer incidence. Although again, there are many alternative explanations here.

I think some of the confusion out there about this paper can be summed up as follows:
“You are a smoker and I am not, so I have a lower rate of getting lung cancer.”
“Yeah, but you still have a much higher rate of getting lung cancer than bone cancer.”
“Uhh… okay… sure… don’t think I’m gonna take up smoking anytime soon, though.”
It’s just a weird comparison to make. That said, I don’t think the authors really make this comparison anywhere in their manuscript. What I think they are saying at the end is that for cancers that have strong determinants due to environmental factors, lifestyle changes and other such interventions could be useful (like quitting smoking), whereas for other cancers that arise more randomly, we should just focus on detection. I have to admit that perhaps I’m missing something, but this seems like a point one could make even without this analysis.

There has been a lot of discussion out there about how weak the correlation is and whether it’s appropriate to use log-log or linear scales and so forth. I think the basic point they are trying to make is that more highly proliferative tissues are more prone to cancer. I think the data they present are consistent with this conclusion. Whether the specific amount of variance they quote in the abstract is right or not is an important technical matter that I think other people are already talking about a lot, but I think the fundamental conclusion is sound.
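A quick illustration of why the choice of scale matters when the data span many orders of magnitude (completely made-up numbers, not their data): on a linear scale, the handful of tissues with the most divisions dominate the Pearson correlation, so the number you get can look quite different from the log-log value and bounces around from draw to draw.

```python
import numpy as np

n = 30
for seed in range(5):
    rng = np.random.default_rng(seed)
    # Hypothetical tissues: stem cell divisions span ~6 orders of magnitude,
    # lifetime risk proportional to divisions with lognormal scatter
    log_div = rng.uniform(5, 11, n)
    log_risk = log_div - 12 + rng.normal(0, 0.7, n)
    r_log = np.corrcoef(log_div, log_risk)[0, 1]
    r_lin = np.corrcoef(10**log_div, 10**log_risk)[0, 1]
    print(f"seed {seed}: r(log-log) = {r_log:.2f}   r(linear) = {r_lin:.2f}")
```

The log-log correlation is stable across draws; the linear-scale one is not, which is why arguing over the exact number on one scale or the other seems less interesting to me than the overall trend.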

A note about the reaction to this paper: in principle, I like the concept of moving from pre-publication anonymous peer review to a post-publication peer review world. I think that pre-publication anonymous peer review is slow, arbitrary, and (most importantly) demoralizing, especially for trainees. That said, now that I’ve seen a bit of post-publication peer review happen online, I think the sad thing I must report is that in many cases, the culture seems to be one of the hardcore takedown, often in a rather accusatorial tone. And I thought it was hard to get a positive review from a journal! Here are some nice thoughts from Kamoun, who recently responded (admirably) to an issue raised on Pubpeer.

My view is that in any paper with real-world data, there will be points that are solid and points that are weak. In post-publication peer review, we run the risk of reducing a paper to a negative soundbite that propagates very fast, and thus throwing out the baby with the bathwater, not to mention putting the author (often a trainee) under very intense public scrutiny that they might not be equipped to handle. I think we should be very careful in how we approach post-publication review because of its viral nature online. Anyway, those are my two cents.

PS: Apropos of discussions of log-log correlations vs. linear correlations, we have a fairly extensive comparison of RNA-seq data to RNA FISH data. More very soon.