Wednesday, October 30, 2013

Better programming practices applied to a scientific plotting problem

- Gautham

It is easy to find "best programming practices" out on the web, but usually there aren't accompanied by an example. This is a real-life example of applying some good practices to turn a bewildering task into something actually pleasant.

One of the hardest aspects of making the worm paper was dealing with the complicated plots that had to be generated. The input data was of exactly two forms:

  • RNA-FISH data sets. A sample of worm embryos of some particular strain, grown under particular conditions, were fixed and imaged. For each embryo analyzed you have #nuclei, and the number of RNA-FISH spots in all fluorescent dye channels used in that experiment. Each dye channel corresponds to a different gene, and different data sets can involve different genes.
  • Embryo cell lineage data sets. Our collaborators Travis Walton and John Murray took movies of individual embryos from certain strains and conditions and determined the time of each cell division. We also incorporated published lineage data. The names of the cells in the embryo follow a standardized pattern. 
Since the C. elegans embryonic lineage is invariant, you can use the data set in 2 to effectively determine a time axis for RNA-FISH, and do things like determine whether expression of a gene starts before or after a certain cell's division (which was actually a key question asked by the paper). 

The goal is to make plot layouts like:
Output of the figure-1 producing code in MATLAB
You see here lineages versus time (bottom left), RNA versus time (bottom center) and versus nuclei (top right), and even a nuclei versus time plot (bottom right). RNA traces for different genes are shown, color-coded, and a specific lineage (the E lineage) is highlighted and the times of its divisions are highlighted in all the plots. Plus there's the shaded alternating grey bars.

My first crack at this thing was for our first submission. It involved very messy and long scripts in MATLAB. Producing plots required a lot of looking back and forth and copying and pasting. Changing which lineage data or which RNA-FISH data set was used to make the plot was a chore. When the reviews came back, and we had to consider many ways to present our data to make our point clear, we needed a more flexible way to make these plots. Here are some principles/tips that helped me:

Centralize access to raw data
This is related to a previous post on keeping together all data of the same type.

LINDATADIR = 'lineagedata/';

lindata = loadWormLineageData( LINDATADIR );
rnadata = loadAllWormRNAData( RNADATADIR );

>> rnadata
rnadata = 

         gut: {1x49 cell}
    othergut: {1x8 cell}
      muscle: {1x12 cell}

>> showIndexWormData(rnadata.gut)
1)   RT 101003
2)   15C 101011
3)   RT 101115

49)   N2 20C 110219 and 120309 gut

In Matlab, structs or arrays of structs are the container of choice for arbitrary data. Above I have a cell array of structs for each experiment, and each struct contains a string that has a descriptor of the experiment, which I can show with the showIndexWormData function. Similarly,
>> lindata
lindata = 

           n2: [1x1 struct]
         gg52: [1x3 struct]
         gg19: [1x3 struct]
         div1: [1x5 struct]
In this case lindata is itself a struct, where each field corresponds to a strain or source and contains a struct array of lineage data, which could represent data from different individual embryos.

Good code needs few comments
If you have to write many comments to explain what your code does, rewrite the code. Code that makes its intent clear is good code. The nightmare is not looking back at code and asking "how does this code work?" but rather "what was I trying to do here!?" 

Here is the top level script run to make the figure:
%% FIGURE 1 - Overview

% The end-3 and end-1 data is in 110219
% The elt-2 data is in 120216

rnadatacell = { ...
    getWormData( rnadata.gut, 'N2 20C 110219' ), ...
    getWormData( rnadata.gut, 'N2 20C 110219' ), ...
    getWormData( rnadata.gut, 'N2 120216 20C' )};

genenamecell = {'end3','end1','elt2'};
dyenamecell = {'alexa','tmr','cy'};
RNArangecell = {[-100 700],[-100 900],[-100 600]};

lineagedata = lindata.Bao_to20C(1);

plotfig1_0114( rnadatacell, genenamecell, ...
    dyenamecell, RNArangecell, lineagedata)

print('DRAFT_fig1_0114.eps', '-depsc')

If clarity of intent is key, we are prepared to sacrifice all manner of other things, including speed, storage and efficiency. As you can see, the script really packages the variables I might be interested in changing when producing Fig. 1 and sends them off to a function that does the work.

One other good reason to have such self-commented code is that your code and your comments will never diverge due to laziness. If you rely on comments to understand your code once, you will have to be diligent from then on to make sure they are totally sync'ed up any time you come back and modify it. 

Use accessor and constructor functions liberally
This is an offshoot of ideas related to the way you write object-oriented programs. Marshall Levesque turned me onto writing "get" and "make" functions.

getWormData( rnadata.gut, 'N2 20C 110219' )

Fetches data by descriptor, rather than by number, filename, or some other obtuse thing. 

Here is the portion of the code in plotfig1_0114 that plots the three plots in the bottom center of the figure:

for i=1:length(rnadatacell) 
    addtimeshading_t( getLinDatabycellname('ABa',lindata_adj) , ...
        10, brightcolor, darkcolor, 'y_t');
    plot_onewormseries_vs_t( rnadatacell{i}, lindata_adj,...
        getRNAslopes( genenamecell{i}),...
        getWormColors( genenamecell{i}), dyenamecell{i}, 'x_RNA__y_t') ;
    addLineageMarks_t( lindata_adj, Linsubsetfunc, 'y_t') ;

There are three 'get's here: getLinDatabycellname('ABa',lindata_adj) finds the time of cell division of cell ABa based on the lineage data and getWormColors( genenamecell{i}) returns a structure containing all the graphical properties I like to use when plotting that particular gene. This is how a consistent color is applied every time for a given gene (green for elt-2, for example). In my implementation, the table of graphical properties for all genes is hard coded into that function body and all it does is look up the entry for the desired gene. ( getRNAslopes is also a "get" involving a hard-coded table )

Incidentally, the low level functions accept things like the strings 'x_RNA__y_t' or 'y_t' to determine what is plotted on the x and y axes. This was important when we were considering all kinds of variations with this or that plot flipped. 

The lineage marks are added using the lineage data and a function that filters only the lineage I wanted to highlight (the E lineage including the EMS division):
addLineageMarks_t( lindata_adj, Linsubsetfunc, 'y_t')

Turns out that the filter itself was made with an accessor/constructor function:
Linsubsetfunc = makeLinsubsetter('EwEMS');

Here is a function that returns a function itself. MATLAB is a bit handicapped for this kind of thing since the only kind of function you can make interactively is an anonymous function one-liner, but it sufficed for my purpose:

function fh = makeLinsubsetter( lineagename )

% Return a function handle to an anonymous function that will filter for
% the specified lineage. 

switch lineagename
    case 'AB'
        fh = @(cellnames) find( strncmp('AB',cellnames,2) );

    case 'E'
        fh = @(cellnames) find( strncmp('E',cellnames,1) & ...
                                  ~ strcmp('EMS',cellnames) );


    case 'EwEMS'
        fh = @(cellnames) find( strncmp('E',cellnames,1));
    case 'notABorP0'
        fh = @(cellnames) find( ~ ( strncmp('AB',cellnames,2) | ...
            strncmp('P0',cellnames,2) | strncmp('NA',cellnames,2)));    
    case 'all'
        fh = @(cellnames) 1:length(cellnames);

Having a function like this was incredibly handy.

Avoid copy-paste code. Wrapper functions are nice.  

These two principles are closely related. For example, consider the plotting of RNA vs time (bottom center)  and the RNA vs nuclei (center left) in the figure. The function that plots an individual RNA vs time graph is:
function [ linehandles, plothandles ] = plot_onewormseries_vs_t( cdata, ...
        lindata, rnaslopes, graphicsspecs, colortoplot, plotmode)

embryoages = assignAgesFrom_lindata( numcells, ...
                                    numrna, rnaslopes, lindata) ; 
linehandles = plotGoodBadWormData( embryoages, numrna, ...
    has_clearthreshold, graphicsspecs, flip_rna_t) ;

On the other hand, the code that plots an individual RNA vs nuclei graph is:
function [linehandles, plothandles] = plot_onewormseries_vs_N( cdata, ...
                                graphicsspecs, colortoplot, plotmode)
linehandles = plotGoodBadWormData( numcells, numrna, ...
    has_clearthreshold, graphicsspecs, flip_rna_n) ;

Both functions ultimately call plotGoodBadWormData to actually draw the graphs, but the time version  passes in embryo ages, which it computes, rather than the number of cells (same as the number of nuclei). This way, plotGoodBadWormData does the work of plotting and applying all the graphics parameters and its code does not have to be duplicated. Meanwhile, the two plot_onewormseries... functions are really just wrappers. 

Ultimately, in all my worm code there are only three or four functions that do actual hard "work". 

More function, less scripting.

This is a corollary of avoiding code duplication. Think twice before you copy-paste! If you are using a language like R or python that allows you to write arbitrary functions interactively, there is no down-side to functionalizing. 

Hope some of these tips help you to not lose your mind, like I did on my first go at this stuff. We scientists are some pretty bad programmers because nobody else usually sees our code. But we have to see it and work with it. Good practices do make you more efficient, and also, happier. I trust you have enough good taste to avoid the extremes.

No comments:

Post a Comment