Thursday, July 25, 2013

Fewer variables make for happy analysis in R, MATLAB, etc.

- Gautham

The more experience you have with R or MATLAB, the fewer variables you'll generally have in your workspace. There are good reasons for this, and the sometimes clever tricks we use as beginners to cope with a crowded workspace catch up with us eventually.

Suppose you measure the temperature and vapor pressure of water one day. On another day you add some sugar to the water and measure temperature and vapor pressure again. When I was just getting started with R or MATLAB, my analysis script might have looked like this:

temp_water <- read.table( ... )
vp_water   <- read.table( ... )
temp_sugar <- read.table( ... )
vp_sugar   <- read.table( ... )

As the number of experimental conditions increases, I'll end up making more and more variables (or maybe different variables in different scripts). Worse, if I try to plot something, I'll end up doing a bunch of copy-paste to plot it for every condition, changing the variables each time. By this kind of logic, and by refusing to write common procedures into functions (a topic for later), version 1 of the code I used to plot the figures of our initial worm paper submission turned into such a deep morass that I had to rewrite it nearly from scratch when the revisions came.

The "clever" beginner, myself and members of my old lab included, would try to get around the copy-pasting by exploiting functions in R and MATLAB that let you execute commands stored in strings. This is where the dreaded eval function and its close relative, assign, come in. Unless you are developing an R package to submit to CRAN and know your way around namespace hierarchies, you are probably headed down the wrong road. There are far less ugly ways to accomplish what you are trying to do.
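To make the contrast concrete, here is a minimal sketch in R. The condition names and numbers are made up for illustration; the point is only to compare building variable names out of strings with keeping everything in one named list.

```r
# Hypothetical condition names and made-up numbers, for illustration only.
conditions <- c("water", "sugar")

# The string-based approach: one workspace variable per condition.
for (cond in conditions) {
  assign(paste0("vp_", cond), data.frame(T = c(20, 30), vp = c(2.3, 4.2)))
}
# Every later access has to rebuild the variable name as a string too.
vp_water_via_get <- get("vp_water")

# The alternative: a single named list holds every condition.
vp <- lapply(conditions, function(cond) {
  data.frame(T = c(20, 30), vp = c(2.3, 4.2))
})
names(vp) <- conditions

# Looping over conditions now needs no string manipulation at all.
n_rows <- sapply(vp, nrow)
```

The list version leaves one object in the workspace no matter how many conditions you add, and vp$water or vp[["sugar"]] reads far better than a get() call wrapped around paste0().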

Instead, you can organize your data the way you would if you were making a database. Principle 3 of Dr. Wickham's paper on 'tidy data' suggests that:
3. Each table (or file) stores information about one observational type.

Go one further, and store *all* information of that one observational type (that one type of experiment) in a single table. In our fake vapor pressure example, we'd just have one variable:

> vp_temp = 
T        vp         sugar_frac
..       ..         0
..       ..         0
..       ..         ..
..       ..         0.1
..       ..         0.1

R is very good at helping you then take out the parts of this data that you are interested in for any particular plot or analysis. Tools like plyr and ggplot2 work like magic with data that looks like this. This form of data is also good for merging with other tables that contain data from other kinds of experiments (maybe heat capacity against temperature?).
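As a rough illustration, here is how pulling pieces out of such a table might look using only base R. The numbers are invented, and the column names simply follow the sketch of vp_temp above.

```r
# Made-up numbers standing in for real measurements.
vp_temp <- data.frame(
  T          = c(20, 30, 40, 20, 30, 40),
  vp         = c(2.3, 4.2, 7.4, 2.2, 4.0, 7.0),
  sugar_frac = c(0, 0, 0, 0.1, 0.1, 0.1)
)

# Pull out one condition for a particular plot or analysis.
water_only <- subset(vp_temp, sugar_frac == 0)

# Apply the same calculation to every condition in one call.
mean_vp <- aggregate(vp ~ sugar_frac, data = vp_temp, FUN = mean)
```

Adding a third sugar fraction means adding rows, not variables, and neither line of analysis code has to change. plyr and ggplot2 build on the same idea of splitting one table by a column.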

To make this kind of master data table, follow this two-step procedure:
1) Maintain a master index of all your vapor pressure - temperature experiments as a .csv file (you could even keep it on Google Docs). Every row is an experimental run. The table has columns for everything you think is relevant about the experiment, the so-called metadata, like the date and the experimental conditions. Most importantly, there are columns for the location and name of the file that contains that run's raw data.
2) Have a script read the master index file and use the table to read all your data files and add informative columns (like the sugar fraction). In R you can read each file into a list and then row-bind the list (for example with do.call(rbind, the list)) or melt the list to get the full data table. Then merge with the master index table to attach the metadata columns.

Since 1 is a good idea no matter how you analyze data, you may as well tack on 2 and get rid of that mess of variables in your workspace.
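The two steps can be sketched roughly as follows. Everything here is an assumption made for the sake of a self-contained example: the file names, the run_id key column, and the numbers are all hypothetical, and the script writes its own fake files to a temporary directory so it can run anywhere.

```r
# Everything below is a stand-in: file names, columns, and values are made up.
data_dir <- tempfile("vp_runs_")
dir.create(data_dir)

# Fake raw data files, one per experimental run.
write.csv(data.frame(T = c(20, 30), vp = c(2.3, 4.2)),
          file.path(data_dir, "run1.csv"), row.names = FALSE)
write.csv(data.frame(T = c(20, 30), vp = c(2.2, 4.0)),
          file.path(data_dir, "run2.csv"), row.names = FALSE)

# Step 1: the master index, one row per run, with metadata and file name.
index <- data.frame(
  run_id     = c(1, 2),
  date       = c("2013-07-01", "2013-07-02"),
  sugar_frac = c(0, 0.1),
  file       = c("run1.csv", "run2.csv"),
  stringsAsFactors = FALSE
)
index_path <- file.path(data_dir, "index.csv")
write.csv(index, index_path, row.names = FALSE)

# Step 2: read the index, then read every file it points to into a list.
index <- read.csv(index_path, stringsAsFactors = FALSE)
runs <- lapply(seq_len(nrow(index)), function(i) {
  run <- read.csv(file.path(data_dir, index$file[i]))
  run$run_id <- index$run_id[i]  # key used to attach the metadata later
  run
})

# Stack the list into one table, then merge in the metadata columns.
all_runs <- do.call(rbind, runs)
vp_temp  <- merge(all_runs, index, by = "run_id")
```

In real use the index would be a file you maintain by hand, and only the second half of the script would exist. Adding a new experiment then means adding one row to the index, not touching the analysis code.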

Sunday, July 21, 2013

Wikipedia makes you learned

Gautham and I were recently having a discussion about the phrase "The exception that proves the rule". I have steadfastly maintained that this phrase is inherently nonsensical, and I haven't had anybody offer up a good explanation. My thought was perhaps one could use it when you have a situation where you have something that looks like an exception to the rule, but if you look at it a bit more closely, it turns out that the exception arises from some mistaken assumption that proves the rule. Like:
Rule: "All rubber ducks are yellow."
Exception: "Look, this rubber duck is green!"
Exception that proves the rule: "That's actually a plastic duck."

Rule: "All scientists are dorks."
Exception: "What about Russell Crowe in A Beautiful Mind?"
Exception that proves the rule: "Russell Crowe is actually an actor.  He is also reportedly a complete jerk (based on his interactions with real mathematicians during the making of the film).  But he's not a dork."

But it usually isn't used that way. Mostly, it seems that it's used as a (feeble) retort to an exception to somebody's going theory. Usually because the exception is not really an exception or the rule is not really a rule:

Rule: "Scientists are dorks."
Exception: "Richard Feynman wasn't a dork."
Exception is not an exception: "That's an exception that proves the rule.  Richard Feynman is also a dork."
Exception guy: "?"

Rule: "All computer scientists have long hair in ponytails."
Exception: "What about this dude?"
My rule is wrong: "..." [couple seconds] "That's just the exception that proves the rule!"

Well, it turns out that Wikipedia has a pretty solid discussion of the topic. Turns out the idea behind this saying is really more like "The exception proves the existence of a rule." Wikipedia's example is a sign that says "Parking prohibited Sundays": the stated exception proves the existence of the rule that parking is generally allowed. Apparently this comes from ancient Roman law! Who would have known?

One thing I find amazing about this is that in the old days pre-Internet, probably only a few scholars who read books and stuff would know this fact. And those guys would seem so smart just by virtue of knowing some obscure fact. But now everyone can know those obscure facts. All you have to do is have the question, and someone out there probably has the answer for you. Cool. Now everyone can be learned. Which reminds me of this exchange from The Simpsons.

Friday, July 19, 2013

Passive (aggressive) review writing...

Quickly wanted to point out one of my pet peeves in review writing.  I hate it when reviewers criticize or ask you to do things using the passive voice.  Like:

"The ranges of these variables must be discussed."
"The graphs must be properly normalized."
"The quantitative aspects of the work must be approached more rigorously."

If you want to tell us to do something, just say "The authors should...".  It's perhaps a minor point, but I think it humanizes the discussion.  If you actually write having in mind that there's a living, breathing person on the other end rather than just some science robot, then I think the tone will overall be much more constructive:

"Can the authors discuss the ranges of these variables?"
"The authors should normalize their graphs appropriately."
"The authors should improve the rigor of the quantitative aspects of their work."

Also, it tends to sound less absolute, which I think is nice because it gives the authors a bit of the benefit of the doubt.  I've often misunderstood aspects of a paper as a reviewer, and it makes it easier for authors if they can engage a more ambiguous comment rather than have to somehow refute a very absolute sounding but ultimately nonsensical reviewer comment.  You know you've gotten some of those!  In fact, these days, it seems like those are the only ones we get...

I also remember once being on a thesis defense committee where one of the committee members kept badgering the student with annoying questions posed in this absolutist, passive voice way:

"The quantities in your graph must be discussed."
"The credit must be ascribed appropriately."

What a jerk!

Thursday, July 18, 2013

More on writing...

In the course of writing this grant, I was wondering about why it's so much harder to write about an idea than to talk about it.  I think it's because conversation is just so easy to manipulate by glossing over things and eluding direct questions.  I think back on issues that people bring up in conversation, and it's amazing how often one just completely forgets about them and moves on.  Sometimes it's as simple as just... not saying anything for a while until everyone just moves on somehow.  But when you write, there's nowhere to hide a lazy argument or a dangling piece of logic.  It's all out there in the open, and it (ideally) has to make sense.  No wonder it's so hard to write this stuff.

Wednesday, July 17, 2013


So I just submitted a grant for the NSF CAREER.  I'm tired of writing.  Seriously tired.

Why am I writing some more?  Not sure, but I can say that writing on a blog is much easier.  Scientific writing can seem so mechanical ("In this aim, I will perform XYZ.  To perform X, I will use A to examine B.  This will tell me about C...").  Ugh.  Makes me want to hire a robot to write it for me.  But on the other hand, the best scientific writing is still an artisanal product.  It's like it's carved from a stone with a chisel over many months, with not a single word misplaced or misused, somehow weaving together often multiple strands of logic into a cohesive narrative.  The very very best scientific writing, in fact, is the stuff that makes you feel smarter while you're reading it, like when you're reading about one experiment and it's written so you think "wouldn't it be cool if they did XXX", and then you turn the page and it says "To test for that possibility, we performed XXX."  Man, that's awesome!  I'll get there one day.

It also occurs to me that my scientific writing suffers from the "first pancake" phenomenon–you know, when the first pancake never comes out quite right.  I think I just have to write the thing twice, almost from start to finish, before it's any good.  The first time, I just have to get the ideas out, and the very act of writing them down makes my thinking evolve.  So the beginning never quite matches the end. Rewriting is not fun (and certainly wasn't when Olivia told me my first draft sucked (very nicely and constructively, though!)), but the writing is much better for it.

Anyway, whatever, I'm just happy to have submitted something mostly coherent.  The nice thing is that I feel like I really clarified my thoughts while writing this thing, and I'm really excited about the work and about the educational/broader impacts thing we proposed.  But that's a subject for another blog post...

Wednesday, July 3, 2013

An essay about Marshall's iceFISH paper

Lenny Teytelman, being the awesome guy that he is, has set up what he calls an "anti-journal club" on PubChase.  The idea is that these are essays by the authors that describe some more personal aspect of the story behind their paper.  (Most journal clubs are just exercises in destroying a paper, which is much less fun over time.)  I wrote one about Marshall's iceFISH paper in Nature Methods.  I really like the concept!