Thursday, March 21, 2013
Sunday, March 17, 2013
Some thoughts on ENCODE and functionality
Gautham recently found a paper from some evolutionary biologists in which they discuss (or, more accurately, scathingly criticize) ENCODE, specifically its claims about how much of the genome is functional:
"ENCODE chose to bias its results by excessively favoring sensitivity over specificity. In fact, they could have saved millions of dollars and many thousands of research hours by ignoring selectivity altogether, and proclaiming a priori that 100% of the genome is functional. Not one functional element would have been missed by using this procedure."
It's an interesting read. A bit of a polemic, but a witty one, with lines like the one above and this one:
"For example, according to ENCODE, the putative function of the H4K20me1 modification is 'preference for 5’ end of genes.' This is akin to asserting that the function of the White House is to occupy the lot of land at the 1600 block of Pennsylvania Avenue in Washington, D.C."
I think the latter is really getting at the issue of functionality. The authors claim that ENCODE makes a logical error in assigning function. For instance, they say that ENCODE asserts something like the following: "1. Some DNA property (say, histone modification) serves in some situation to cause a functional change. 2. We observe that another piece of DNA has the same modification. 3. Therefore, that other piece of DNA must be functional as well." Fair point. Although I'm no expert, I believe transcription factor binding has to be one of the prime examples: many people have shown that there are tons of binding sites for transcription factors, identified both by sequence similarity to a motif and by actual ChIP experiments, that are not actually functional in the sense that they don't alter expression of some known gene. Such "interesting looking places" are perhaps a reasonable place to start looking for function, but the fact that a site looks interesting does not inherently prove that it is functional.
Of course, that raises the question of what it means to be functional...
The authors of this paper believe that evolutionary conservation is the only way to characterize something as functional. I think I disagree. I think, depending on your definition of functionality, you can have functional elements that are not conserved, and you can have conserved things that aren't functional.
Take a recent example from our lab. We (meaning Hedia) have characterized a long non-coding RNA that, when you knock it down in mouse ES cells, changes the expression of a nearby protein coding gene (Hoxa1). It seems that this non-coding RNA doesn't display any conservation. Is this non-coding RNA functional or not? Well, that depends. We have no mouse knockout data showing any phenotype for this lncRNA. Let's say we did have a mouse knockout and there was no overt mouse phenotype. Does that mean this lncRNA is not functional? Some would say it is not. But such a strong definition of functionality would of course eliminate most genes, including the majority of those displaying purifying selection! What do we then mean by "overt phenotype"? Perhaps functionality from the phenotype point of view could mean a fitness defect? Well, this would help you find genes that have obvious phenotypes, like embryonic lethality or heart development defects and so forth. But would it help you find, for instance, hair color? Are those genes functional by this definition? Maybe, maybe not. But then you're really evaluating just-so stories about whether hair color actually affects fitness, etc. And it may be that hair color in the end doesn't affect fitness at all. I think most people would still want to consider genes affecting hair color to be functional. Then, however, you've started going down the slope. At what point is the phenotype deemed "unimportant" in a functional sense? Isn't changing gene expression of Hoxa1 a phenotype just as much as changing hair color? And on that basis, isn't our lncRNA functional? I bet the authors of the paper would say so. ENCODE (I think) takes this one step further and says that the mere transcription of a lncRNA is "functional" in the sense that it changes the biochemistry of the cell. I'm not sure about that, but I can see the logic.
After writing most of this post, I came across this interesting post from one of the ENCODE folks that touches on the functionality issue.
So our lncRNA could be considered functional, but doesn't show any evolutionary conservation (at the level we have looked). One could counter that this might just be a consequence of weakness in our metrics of evolutionary constraint: perhaps a more sophisticated view of conservation would reveal that this lncRNA is actually conserved, along with many other "important" DNA elements. That would be great, and if that were the case, then ENCODE would be very useful as a dataset that one could use to evaluate such methods. There is also an alternative: that our lncRNA is something specific to mouse, and does something specific in the mouse, and so is not detectable via conservation methods (presumably there are many such things that make a mouse different from a rat, etc.). It is, I think, good to at least be open to this possibility.
I guess to sum up my ramblings up to this point, I would say that a strict definition of functionality based on strong phenotype would eliminate a bunch of conserved genes, and a looser definition of functionality (even not ENCODE-level loose) would include a bunch of non-conserved stuff. So I don't know, but I think conservation is a useful but imperfect measure of functionality. Of course, this sort of very abstract thinking is probably of little use when evaluating whether to study something in the lab. At that point, I think the definition of functionality largely comes down to a matter of taste.
Another issue is the value of ENCODE as a data set for others. Michael Eisen, who I don't know but I think I would like to meet one day, had this thoughtful post about how ENCODE data doesn't fit the bill for everyone's scientific question. I definitely understand this point. My advisor at Courant (Charlie Peskin) posed to me the question "But at some point, wouldn't we have just sequenced everything there is to sequence?". My answer was that there's always something else to sequence, because every new scientific question may require some new data, like adding some different drug to some different cell line at some different time point, etc. Eisen's main point seems to be that the ENCODE data cannot cover all the different needs of different researchers for their specific questions.
My response to that is, well, whatever. There's nothing I can do about whether or not it was a good idea to generate this or that dataset for ENCODE, and I honestly don't know whether any particular set of data was a good or bad idea to begin with. Nobody's listening to my opinion, and frankly, they probably shouldn't anyway. The fact is that the data is here. The way we're using it in the lab is as reference data, data that we can use as a basis for some of our scientific questions. It is undeniable, for instance, that it is very useful for our science to have RNA-seq and ChIP-seq data for a large number of cell lines, and it's also likely that we'll probably need to generate some additional genomic data sets of our own, which is probably in line with what the ENCODE people would have expected. For example, we're using the GM12878 cell line for some allelic expression stuff these days. It's not the ideal cell line for imaging (we wouldn't have picked it), but we can make it work, and it would be really hard, both cost and time-wise, for us to generate all the background data and analysis we need in another cell line. Another bonus is that the data we generate is not just "in some weird cell line" but in one of the ENCODE cell lines, which always helps when you try to argue "relevance" (whatever that means). Anyway, I think our job in the lab is to spend our time trying to think of creative ways to use the ENCODE data to do cool science. Speaking of which, I've spent a lot of time on this blog post...