Wednesday, August 8, 2018

On mechanism and systems biology

(Latest in a slowly unfolding series of blog posts from the Paros conference.)

Mechanism. The word fills many of us with dread: “Not enough mechanism.” “Not particularly mechanistic.” "What's the mechanism?" So then what exactly do we mean by mechanism? I don’t think it’s an idle question—rather, I think it gets down to the very essence of what we think science means. And I think there are some practical consequences on everything from how we report results to the questions we may choose to study (and consequently to how we evaluate science). So I’ll try and organize this post around a few concrete proposals.

To start: I think the definition I’ve settled on for mechanism is “a model for how something works”.

I think it’s interesting to think about how the term mechanism has evolved in our field from something that really was mechanism once upon a time into something that is really not mechanism. In the old days, mechanism meant figuring out e.g. what an enzyme did and how it worked, perhaps in conjunction with other enzymes. Things like DNA polymerase and ATP synthase. The power of the hard mechanistic knowledge of this era is hard to overstate.

What can we learn about the power of mechanism/models from this example?

As the author of this post argues, models/theories are “inference tickets” that allow you to make hard predictions in completely new situations without testing them. We are used to thinking of models as being written in math and making quantitative predictions, but this need not be the case. Here, the predictions of how these enzymes function have led to, amongst other things, our entire molecular biology toolkit: add this enzyme, it will phosphorylate your DNA, add this other enzyme, it will ligate that to another piece of DNA. That these enzymes perform certain functions is a “mechanism” that we used to predict what would happen if we put these molecules in a test tube together, and those predictions were largely borne out, with huge practical implications.

Mechanisms necessarily come with a layer of abstraction. Perhaps we are more used to talking about these in models, where we have a name for them: “assumptions”. Essentially, there is a point at which we say, who knows, we’re just going to say that this is the way it is, and then build our model from there. In this case, it’s that the enzyme does what we say it will. We still have quite a limited ability to take an unknown sequence of amino acids and predict what it will do, and certainly very limited ability to take a desired function and just write out the sequence to accomplish said function. We just say, okay, assume these molecules do XYZ, and then our model is that they are important for e.g. transcription, or reverse transcription, or DNA replication, or whatever.

Fast forward to today, when a lot of us are studying biological regulation, and we have a very different notion of what constitutes “mechanism”. Now, it’s like oh, I see a correlation between X and Y, the reviewer asks for “mechanism”, so you knock down X and see less Y, and that’s “mechanism”. Not to completely discount this (I mean, we’ve learned a fair amount by doing these sorts of experiments), but I think it’s pretty clear that this is not sufficient to say that we know how it works. Rather, this is a devolution to empiricism, which is something I think we need to fix in our field.

Perhaps the most salient question is: what does it mean to know “how it works”? I posit that mechanism is an inference that connects one bit of empiricism to another. Let’s illustrate in the case of something where we do know the mechanism/model: a lever.

“How it works” in this context means that we need a layer of abstraction, and have some degree of inference given that layer of abstraction. Here, the question may be “how hard do I have to push to lift the weight?”. Do we need to know that the matter is composed of quarks to make this prediction, or how hard the lever itself is? No. Do we need to know how the string works? No. We just assume the weight pulls down on the string and whatever it’s made of is irrelevant because we know these to be empirically the case. We are going to assume that the only things that matter are the locations of the weight, the fulcrum, and my finger, as well as the weight of the, uhh, weight and how hard I push. This is the layer of abstraction the model is based on. The model we use is that of force balance, and we can use that to predict exactly how hard to push given these distances and weights.

How would a modern data scientist approach this problem? Probably take like 10,000 levers and discover Archimedes’ Law of the Lever by making a lot of plots in R. Who knows, maybe this is basically how Archimedes figured it out in the first place. It is perhaps often possible to figure out a relationship empirically, and even make some predictions. But that’s not what we (or at least I) consider a mechanism. I think there has to be something beyond pure empiricism, often linking very disparate scales or processes, sometimes in ways that are simply impossible to investigate empirically. In this case, we can use the concept of force to figure out how things might work with, say, multiple weights, or systems of weights on levers, or even things that don’t look like levers at all. Wow!
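To make the lever example concrete, here is a minimal sketch of the force-balance model in Python. It assumes an idealized lever (rigid, massless, frictionless fulcrum) with the weights on one side of the fulcrum and the finger on the other. The function name `required_push` and all the numbers are made up for illustration.

```python
def required_push(loads, push_distance):
    """Force needed at `push_distance` from the fulcrum to balance the loads.

    `loads` is a list of (weight, distance_from_fulcrum) pairs, all on the
    side of the fulcrum opposite the finger. Assumes an idealized lever:
    rigid, massless, and with a frictionless fulcrum.
    """
    if push_distance <= 0:
        raise ValueError("push_distance must be positive")
    # Torque balance: push * push_distance == sum(weight * distance)
    total_torque = sum(weight * distance for weight, distance in loads)
    return total_torque / push_distance


# One weight of 10 units, 2 units from the fulcrum; finger 4 units away.
single = required_push([(10.0, 2.0)], push_distance=4.0)

# Two weights on the same lever: the same model covers this case without
# anyone having to measure a new lever first.
double = required_push([(10.0, 2.0), (6.0, 1.0)], push_distance=4.0)

print(single, double)
```

The point of the second call is the "inference ticket": the prediction for two weights comes from the model, not from a new round of measurements.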

Okay, so back to regulatory biology. I think one issue that we suffer from is that what we call mechanism has moved away from true “how it works” models and settled into what is really empiricism, sort of without us noticing it. Consider, for instance, development. People will say, oh, this transcription factor controls intestinal development. Why do they say that? Well, knock it out and there’s no intestine. Put it somewhere else and now you get extra intestine. Okay, but that’s not how it works. It’s empirical. How can you spot empiricism? A good sign is excessive obsession with statistics: effect sizes and p-values are often a good sign that you didn’t really figure out how it works. Another sign is that we aren’t really able to apply what we learned outside of the original context. If I gave you a DNA typewriter and said, okay, make an intestine, you would have no idea how to do it, right? We can make more intestine in the original context, but the domain of applicability is pretty limited.

Personally, I think that these difficulties arise partially because of our tools, but mostly because I think we are still focused on the wrong layers of abstraction. Probably the most common current layers of abstraction are those of genes/molecules, cells, and organisms. Our most powerful models/mechanisms to date are the ones where we could draw straight lines connecting these up. Like, mutate this gene, make these cells look funny, now this person has this disease. However, I think these straight lines are more the exception than the norm. Mostly, I think these mappings are highly convoluted in interwoven systems, making it very hard to make predictions based on empiricism alone (future blog post coming on Omnigenic Model to discuss this further).

Which leads me to a proposal: let’s start thinking about other layers of abstraction. I think that the successes of the genes/molecules -> cells paradigm have led to a certain ossification of thought centered around thinking of genes and molecules and cells as being the right layers of abstraction. But maybe genes and cells are not such fundamental units as we think they are. In the context of multicellular organisms, perhaps cells themselves are passive players, and rather it is communities of cells that are the fundamental unit. Organoids could be a good example of this, dunno. Also, it is becoming clear that genetics has some pretty serious limits in terms of determining mechanism in the sense I’ve defined. Is there some other layer involving perhaps groups of genes? Sorry, not a particularly inspired idea, but whatever, something like that maybe. Part of thinking this way also means that we have to reconsider how we evaluate science. As Rob pointed out, we have gotten so used to equating “mechanism” to “molecules and their effects on cells” that we have become closed-minded to other potential types of mechanism while also deceiving ourselves into allowing empiricism to pose as mechanism under the guise of statistics. We just have to be open to new abstractions and not hold everyone to the "What's the molecule?" standard.

Of course, underlying this is an open question: do such layers of abstraction that allow mechanism in the true sense exist? Complexity seems to be everywhere in biology, and my reaction so far has been to just throw my hands up and say “it’s complicated!”. But (and this is another lesson learned from Rob), that’s not an excuse: we have to at least try. And I do think we can find some mechanistic wormholes through the seemingly infinite space of empiricism that we are currently mired in.

Regardless of what layers of abstraction we choose, however, I think that it is clear that a common feature of these future models will be that they are multifactorial, meaning that they will simultaneously incorporate the interactions of multiple molecules or cells or whatever the units we choose are. How do we deal with multiple interactions? I’m not alone in thinking that our models need to be quantitative, which as noted in my first post, is an idea that’s been around for some time now. However, I think that a fair charge is that in the early days of this field, our quantitative models were pretty much window dressing. I think (again a point that I’ve finally absorbed from Rob) that we have to start setting (and reporting) quantitative goals. We can’t pick and choose how our science is quantitative. If we have some pretty model for something, we better do the hard work to get the parameters we need, make hard quantitative predictions, and then stick to them. And if we don’t quantitatively get what we predict, we have to admit we were wrong. Not partly right, which is what we do now. Here’s the current playbook for a SysBio paper: quantitatively measure some phenomenon, make a nice model, predict that removal of factor X should send factor Y up by 4x, measure that it went up 2x, and put a bow on it and call it a day. I think we just have to admit that this is not good enough. This “pick and choose” mix of quantitative and qualitative analyses is hugely damaging because it makes it impossible to build upon these models. The problem is that qualitative reporting in, say, abstracts leads to people seeing “X affects Y” and “Y affects Z” and concluding “thus, X affects Z” even though the effects for X on Y and Y on Z may be small enough to make this conclusion pretty tenuous.
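To see how quickly such chains weaken, consider a toy calculation. It assumes a purely linear chain X → Y → Z in which each step adds independent noise; under that assumption the correlation between X and Z is the product of the two step correlations, so the fractions of variance explained multiply as well. The function name and the 40% figures are illustrative and do not come from any dataset.

```python
def chained_fraction_explained(*step_fractions):
    """Fraction of variance in the last variable explained by the first.

    Assumes a linear chain (X -> Y -> Z -> ...) with independent noise added
    at every step, in which case the R^2 values of the steps multiply.
    """
    result = 1.0
    for fraction in step_fractions:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("each fraction must be between 0 and 1")
        result *= fraction
    return result


# X explains 40% of Y, and Y explains 40% of Z.
x_to_z = chained_fraction_explained(0.4, 0.4)
print(f"X explains about {x_to_z:.0%} of Z")
```

Under these assumptions, two links that each sound respectable leave X accounting for only about 16% of Z, which is the kind of number a qualitative "X affects Z" hides.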

So I have a couple proposals. One is that in abstracts, every statement should include some sort of measure of the percentage of effect explained by the putative mechanism. I.e., you can’t just say “X affects Y”. You have to say something like “X explains 40% of the change in Y”. I know, this is hard to do, and requires thought about exactly what “explains” means. But yeah, science is hard work. Until we are honest about this, we’re always going to be “quantitative” biologists instead of true quantitative biologists.
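One possible reading of “explains”, offered here only as an illustration, is the fraction of the variance in Y captured by a straight-line fit on X (the squared Pearson correlation). The sketch below computes it with plain Python; the function name and the data are made up, and other definitions of “explains” would give different numbers.

```python
def fraction_explained(xs, ys):
    """Squared Pearson correlation: variance in ys captured by a linear fit on xs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    return sxy ** 2 / (sxx * syy)


# Hypothetical measurements of factor X and factor Y across four samples.
levels_x = [0.0, 1.0, 2.0, 3.0]
levels_y = [0.0, 2.0, 1.0, 3.0]
r2 = fraction_explained(levels_x, levels_y)
print(f"X explains {r2:.0%} of the variance in Y")
```

A number like this is only meaningful for the specific definition chosen, which is part of why the proposal requires careful thought about what “explains” means.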

Also, as a related grand challenge, I think it would be cool to try and be able to explain some regulatory process in biology out to 99.9%. As in, okay, we really now understand in some pretty solid way how something works. Like, we actually have mechanism in the true sense. You can argue that this number is arbitrary, and it is, but I think it could function well as an aspirational goal.

Any discussion of empiricism vs. theory will touch on the question of science vs. engineering. I would argue that—because we’re in an age of empiricism—most of what we’re doing in biology right now is probably best called engineering. Trying to make cells divide faster or turn into this cell or kill that other cell. And it’s true that look, whatever, if I can fix your heart, who cares if I have a theory of heart? One of my favorite stories along these lines is the story of how fracking was discovered, which was purely by accident (see Planet Money podcast): a desperate gas engineer looking to cut costs just kept cutting out an expensive chemical and seeing better yield until he just went with pure water and, voila, more gas than ever. Why? Who cares! Then again, think about how many mechanistic models went into, e.g., the design of the drills, transportation, everything else that goes into delivering energy. I think this highlights the fact that just like science and engineering are intertwined, so are mechanism and empiricism. Perhaps it’s time, though, to reconsider what we mean by mechanism to make it both more expansive and rigorous.

Monday, August 6, 2018

The biologist's arrow

Guest post by Caroline Bartman

How do we understand biology? “Mutant IDH2 <arrow> 2-hydroxyglutarate <arrow> hypermethylation <arrow> cell proliferation (?),” I scribbled at the top of a paper I read this week. My mind requires linear relationships, direct chains of cause and effect, to retain the findings of a paper I read.

Evidence suggests that this is not how biology in general operates. For example, Pritchard’s ‘omnigenic theory’ synthesizes many years of work to show that most polymorphisms contribute to the total phenotype in a significant but barely detectable way. Identifying each genetic variant that contributes to a phenotype requires many years of costly effort and will culminate with a long list of polymorphisms that incrementally contribute to a phenotype. (Exceptions to this rule, like PCSK9, are valuable but rare.) Not only are most contributions minuscule (median contribution of significant height SNPs is 0.00143 meters according to Pritchard), but many polymorphisms play a role in a wide range of traits, by influencing broadly expressed genes. Our search for cause <arrow> effect reveals a tangled thicket of partial causes and modest effects.

Human genetic studies are not the only realm where such complexity dominates. We perform RNA sequencing of wild-type and knockout cells, find a thousand differentially expressed genes, and then focus on a single target gene. We do a screen and follow up on a single hit. It boggles the mind to understand that all of the hits, probably even some below the significance threshold, contribute to that biological process every time it occurs. So we ignore this tangle in order to tell a story, to write a paper, to give a talk that other scientists will appreciate.

This struggle to understand continues as we try to finish a study. Many scientific projects reach an uncomfortable stage where we have a phenotype in hand, a dramatic finding with some relevance to an open biological question, but we require a bit of mechanism for the last figure. (We use the phrase ‘bit of mechanism’ with a half-ashamed laugh.) A bit of mechanism? A handle to give readers, to reassure them that biology is not random, there is a reason for our finding, there is ultimately something to understand? How many of these last figure gambits are quickly abandoned by the relevant subfield as future studies fail to support these ‘mechanisms,’ or change their interpretation beyond recognition?

How do we as humans with limited intelligence, limited bandwidth, limited attention span understand complex biological processes?

Does understanding biology even matter? Don’t we do biology to help patients, to solve problems, to cure disease? But one of the most attractive things about biology for me was that there is a truth outside oneself. Unlike consulting, or writing, or reporting, which are all ways humans can talk about humans, or operate in artificial systems constructed by humans, I believed that science was the way to escape from navel-gazing, the way out of the closed loop. It is not all about humans and feelings and opinions! There are truths outside our selves that we can understand! Just look at ribosomes, or whales, or frogs, or the lac operon and you see a truth that does not require humans as an origin but that humans could find a logic behind. But can we actually understand that logic?

This concern does not lend itself well to selecting and starting a new biological project. The papers that are most beautiful and elegant to me are the simplest. But they leave me with a disquieting feeling that they have achieved beauty by denying complexity.

Thursday, June 14, 2018

Notes from Frontiers in Biophysics conference in Paros, episode 1 (pilot): Where's the beef in biophysics?

Long blog post hiatus, which is a story for another time. For now, I’m reporting from what was a very small conference on the Frontiers of Biophysics from Paros, a Greek island in the Aegean, organized by Steve Quake and Rob Phillips. The goals of the conference were two-fold:
  1. Identify big picture goals and issues in biophysics, and
  2. Consider ways to alleviate suffering and further human health.
Regarding the latter, I should say at the outset that this conference was very generously supported by Steve through the foundation he has established in memory of his mother-in-law Eleftheria Peiou, who sounds like she was a wonderful woman, and suffered through various discomforts in the medical system, which was the inspiration behind trying to reduce human suffering. I actually found this directive quite inspiring, and I’ve personally been wondering what I could do in that vein in my lab. I also wonder whether the time is right for a series of small Manhattan Projects on various topics so identified. But perhaps I’ll leave that for a later post.

Anyway, it was a VERY interesting meeting in general, and so I think I’m going to split this discussion up based on themes across a couple different blog posts, probably over the course of the next week or two. Here are some topics I’ll write about:

Exactly what is all this cell type stuff about

Exactly what do we mean by mechanism

I need a coach

What are some Manhattan Projects in biology/medicine

Maybe some others

So the conference started with everyone introducing themselves and their interests (research and otherwise) in a 5 minute lightning talk, time strictly enforced. First off, can I just say, what a thoughtful group of folks! It is clear that everyone came prepared to think outside their own narrow interests, which is very refreshing.

The next thing I noticed was a lot of hand-wringing about what exactly we mean by biophysics, which is what I’ll talk about for the rest of this blog post. (Please keep in mind that this is very much an opinionated take and does not necessarily reflect that of the conferees.) To me, biophysics as a whole, at least as seemingly defined at this meeting, needs a pretty fundamental rebranding. Raise your hand if biophysics means one of the following to you:
  1. Lipid rafts
  2. Ion channels
  3. A bunch of old dudes trying to convince each other how smart they are (sorry, cheap shot intended for all physicists) ;)
If you have not raised your hand yet, then perhaps you’re one of the lonely self-proclaimed “systems biologists” out there, a largely self-identified group that has become very scattered since around 2000. What is the history of this group of people? Here’s a brief (and probably offensive, sorry) view of molecular biology. Up until the 80s, maybe 90s, molecular biology had an amazing run, working out the genetic code, signaling, aspects of gene regulation, and countless other things I’m forgetting. This culminated in the “gene-jock” era in which researchers could relate a mutation to a phenotype in mechanistic detail (this is like the Cell golden era I blogged about earlier). Since that era, well… not so much progress, if you ask me—I’m still firmly of the opinion that there haven’t really been any big conceptual breakthroughs in 20-30 years, except Yamanaka, although one could argue whether that’s more engineering. I think this is basically the end of the one-gene-one-phenotype era. As it became clear that progress would require the consideration of multiple variables, it also became clear that a more quantitative approach would be good. For ease of storytelling, let’s put this date around 2000, when a fork in the road emerged. One path was the birth of genomics and a more model-free statistical approach to biology, one which has come to dominate a lot of the headlines now; more on that later. The other was “systems biology”, characterized by an influx of quantitative people (including many physicists) into molecular biology, with the aim of building a quantitative mechanistic model of the cell. I would say this field had its heyday from around 2000-2010 (“Hey look Ma, I put GFP on a reporter construct and put error bars on my graph and published it in Nature!”), after which folks from this group have scattered towards more genomics-type work or have moved towards more biological applications. 
I think that this version of "systems biology" most accurately describes most of the attendees at the meeting, many of whom came from single molecule biophysics.

I viewed this meeting as a good opportunity to maybe take score and see how well our community has done. I think Steve put it pretty concisely when he said “So, where’s the beef?” I.e., it's been a while, and so what does our little systems biology corner of the world have to show for itself in the world of biology more broadly? Steve posed the question at dinner: “What are the top 10 contributions from biophysics that have made it to textbook-level biology canon?” I think we came up with three: Hodgkin and Huxley’s model of action potentials, gene expression “noise”, and Luria and Delbrück’s work on genetic heritability (and maybe kinetic proofreading; other suggestions more than welcome!). Ouch. So one big goal of the meeting was to identify where biophysics might go to actually deliver on the promise and excitement of the early 2000s. Note: Rob had a long list of examples of cool contributions, but none of them has gotten a lot of traction with biologists.

I’ll report more on some specific ideas for the future later, but for now, here’s my personal take on part of the issue. With the influx of physicists came an influx of physics ideas. And I think this historical baggage mostly distracts from the problems we might try to solve (Stephan Grill made this point as well, that we need fundamentally new ways of thinking about problems). This baggage from physics is I think a problem both strategically and tactically. At the most navel-gazy level, I feel like discussions of “Are we going to have Newton’s laws for biology” and “What is going to be the hydrogen atom of the cell” and “What level of description should we be looking at” never really went anywhere and feel utterly stale at this point. On a more practical level, one issue I see is trying to map quantitative problems that come up in biology back to solved problems in physics, like the renormalization group or Hamiltonian dynamics or what have you. Now, I’m definitely not qualified to get into the details of these constructs and their potential utility, but I can say that we’ve had physicists who are qualified for some time now, and I think I agree with Steve: where’s the beef?

I think I agree with Stephan that perhaps we as a community need to take stock of what it is that we value about the physics part of biophysics and then maybe jettison the rest. To me, the things I value about physics are quantitative rigor and the level of predictive power that goes with it (more on that in blog post on mechanism). I love talking to folks who have a sense for the numbers, and can spot when an argument doesn’t make quantitative sense. Steve also mentioned something that I think is a nice way to come up with fruitful problems, which is looking at existing data through a quantitative lens to be able to find paradoxes in current qualitative thinking. To me, these are important ways in which we can contribute, and I believe they will have a broader impact in the biological community (and indeed already have through the work of a number of “former” systems biologists).

To me, all this raises a question that I tried to bring up at the meeting but that didn’t really gain much traction in our discussions, which is how do we define and build our community? So far, it’s been mostly defined by what it is not: well, we’re quantitative, but not genomics; we’re like regular biology, but not really; we’re… just not this and that. Personally, I think our community could benefit from a strong positive vision of what sort of science we represent. And I think we need to make this vision connect with biology. Rob made the point, which is certainly valid, that maybe we don’t need to care about what biologists think about our work. I think there’s room for that, but I feel like building a movement would require more than us just engaging in our own curiosities.

Which of course raises the question of why we would need to have a “movement” anyway. I think there are a few lessons to learn from our genomics colleagues, who I think have done a much better job of creating a movement. I think there are two main benefits. One is attracting talent to the field and building a “school of thought”. The other is attracting funding and so forth. Genomics has done both of these extremely well. There are dangers as well. Sometimes genomics folks sound more like advocates than scientists, and it’s important to keep science grounded in data. Still, overall, I think there are huge benefits. Currently, our field is a bunch of little fiefdoms, and like it or not, building things bigger than any one person involves a political dimension.

So how do we define this field? One theme of the conference that came up repeatedly was the idea of Hilbert’s problems, which, for those who don’t know, are a list of open math problems set out in 1900 by David Hilbert that proved very influential. Can we perhaps build a field around a set of grand challenges? I find that idea very appealing. Although, given that I’ve increasingly come to think of biology as engineering instead of science, I wonder if phrasing these questions in engineering terms would be better, sort of like a bunch of biomedical Manhattan Projects. I’ll talk about some ideas we came up with in a later blog post.

Anyway, more in the coming days/weeks…

Wednesday, October 4, 2017

How to train a postdoc? - by Uschi Symmons

A couple of weeks ago I was roped into a twitter discussion about postdoc training, which seemed to rapidly develop into a stalemate between the parties: postdocs, who felt they weren't getting the support and training they wanted and needed, and PIs, who felt their often substantial efforts were being ignored. Many of the arguments sounded familiar: over the past two years I’ve been actively involved in our postdoc community, and have found that when it comes to postdocs, often every side feels misunderstood. This can lead to a real impasse for improvements, so in this blog post I’ve put together a couple of points summarizing problems and some efforts we've made to work around these to improve training and support.

First off, here are some of the problems we encountered:
1. postdocs are a difficult group to cater for, because they are a very diverse group in almost every aspect:
- work/lab experience and goals: ranging from college-into-grad-school-straight-into-postdoc to people who have multi-year work experience outside academia to scientists who might be on their second or third postdoc. This diversity typically also translates into future ambitions: many wish to continue in academic research, but industry/teaching/consulting/science communication are also part of the repertoire.
- training: Some postdocs come from colleges and grad schools with ample opportunity for soft-skill training. Others might never have had a formal course in even such basic things as paper writing or how to give a talk.
- postdoc duration: there is a fair amount of variation in how long postdocs stay, depending on both personality and field of research. In our department, for example, postdoc positions vary widely, ranging from 1-2 years (eg computational sciences, chemistry) to 5-7 years (biomedical sciences).
- nationality: I don’t know if postdocs are actually more internationally diverse than grad students, but the implications of that diversity are often greater. Some postdocs might be preparing for a career in the current country, others might want to return to their home country, which makes it difficult to offer them the same kind of support. Some postdocs may have stayed in the same country for a long time and know the funding system inside-out, others may have moved country repeatedly and have only a vague idea about grant opportunities.
- family status: when I was in grad school three people in my year (<5%) had kids. In our postdoc group that percentage is way higher (I don’t have numbers, but would put it around 30-40%), and many more are in serious long-term relationships, some of which require long commutes (think two-body problem). Thus, organising postdoc events means dealing with people on very diverse schedules.

2. In addition, postdocs are also often a smaller group than grad students. For example, at UPenn, we have as many postdocs in the School of Engineering as we have grad students in a single department of the school (Bioengineering). In fact, I have often heard disappointed faculty argue that postdocs “don’t make use of available resources”, because of low turnout at events. In my experience this is not the case: organising as a grad student and a postdoc I have found that turnout is typically around 30-40%; postdoc events simply seem less well attended because the base is so much smaller.

3. Finally, postdocs frequently have lower visibility: whereas grad students are typically seen by many faculty during the recruitment process or during classes, it is not unusual for postdocs to encounter only their immediate working group. And unlike grad students, postdocs do not come in as part of a cohort, but at different times during the year, which also makes it difficult to plan things like orientation meetings, where postdocs are introduced to the department in a timely manner.

Seeing all of the above, it is a no-brainer why training postdocs can be difficult. On one hand problems are conceptual: Do you try to cater to everyone’s needs or just the majority? Do you try to help the “weakest link” (the people with least prior training) or advance people who are already at the front of the field? On the other hand, there are also plenty of practical issues: Do you adjust events to the term calendar, even if postdocs arrive and leave at different times? Do you organise the same events annually or every couple of years? Is it OK to have evening/weekend events? But these are not unsolvable dilemmas. Based on our experiences during the past two years, here are some practical suggestions*:

  1. Pool resources/training opportunities with the grad school and/or other postdoc programmes close-by: for a single small postdoc program, it is impossible to cater to all needs. But more cross-talk between programs means more ground can be covered. Such cross-talk is most likely going to be a win-win situation, both because it bolsters participant numbers and because postdocs can contribute with their diverse experiences (eg in a “how to write a paper” seminar; even postdocs who want more formal training will have written at least one paper). Our postdoc programme certainly benefits from access to the events from UPenn’s Biomedical Programme, as well as a growing collaboration with GABE, our department’s graduate association.

  2. Have a well(!)-written, up-to-date wiki/resource page AND make sure you tell incoming postdocs about this. As a postdoc looking for information about pretty much anything (taxes, health insurance, funding opportunities) I often feel like Arthur in the Hitchhiker’s Guide to the Galaxy:

    Once you know where to look and what you’re looking for, it can be easy to find, but occasionally I am completely blindsided by things I should have known. This can be especially problematic for foreign postdocs (I’ve written more about that here), and so telling postdocs ahead of time about resources can avoid a lot of frustration. A good time for this could be when the offer letter is sent or when postdocs deal with their initial admin. Our department still doesn’t have a streamlined process for this, but I often get personal enquiries, and I typically refer postdocs to either the National Postdoc Association's Survival Guide for more general advice or the aforementioned Biomedical Postdoc Program for more UPenn-related information.

  3. Have an open dialogue with postdocs and listen to their needs: More often than not, I encounter PIs and admin who want to help postdocs. They provide training in areas they have identified as problematic, and given the diversity of the postdoc group most likely that training is genuinely needed by some. But often postdocs would like more: more diversity, other types of training, or maybe they even have completely different pressing issues. Yet, without open dialogue between departmental organisers and the postdoc community it’s hard to find out about these needs and wishes. Frustratingly, one tactic I encounter frequently is departmental organisers justifying the continuation or repetition of an event based on its success, without ever asking the people who did not attend, or wondering if a different event would be equally well received. To build a good postdoc program, universities and departments need to get better at gauging needs and interests, even if this might mean re-thinking some events, or how current events are integrated into a bigger framework.
    This can be difficult. As a case in point, Arjun, my PI, likes to point out that, when asked, the vast majority of postdocs request training in how to get a faculty position. So departments organise events about getting faculty positions. In fact, I am swamped with opportunities to attend panel discussions on “How to get a job in academia”: we have an annual one in our School, multiple other departments at the university host such discussions and it’s a much-favored trainee event at conferences. But after seeing two or three such panels, there’s little additional information to be gained. This does not mean that departments should do away with such panels, but coordinating with other departments (see point 1) or mixing it up with other events (eg by rotating events in two to three year cycles) would provide the opportunity to cater to the additional interests of postdocs.
    Frequent topics I’ve heard postdocs ask for are management skills, teaching skills, grant writing and external feedback/mentoring by faculty. For us, successful new programs included participation in a Junior Investigators Symposium on campus, which included two very positively received sessions about writing K/R awards and a “speed mentoring” session, where faculty provided career feedback in a 10-minute, one-on-one setting. Similarly, postdocs at our school who are interested in teaching can partake in training opportunities by UPenn’s Center for Teaching and Learning, and those interested in industry and the business side of science can make use of a paid internship program by Penn’s Center for Innovation to learn about IP and commercialization. While only a small number of postdocs make use of these opportunities per year, they provide a very valuable complement to the programs offered by the school/department.

  4. Make a little bit of money go a long way: Many fledgling postdoc programs, such as ours, operate on a shoestring. Obviously, in an ideal world neither PIs nor administrative bodies should shy away from spending money on postdoc training - after all, postdocs are hired as trainees. But in reality it is often difficult to get substantial monetary support: individual PIs might not want to pay for events that are not of interest for their own postdocs (and not every event will cater for every postdoc) and admin may not see the return on investment for activities not directly related to research. However, you may have noticed that many of the above suggestions involved little or no additional financial resources: faculty are often more than willing to donate their time to postdoc events, postdocs themselves can contribute to resources such as wikis, and collaborations with other programs on campus can help cover smaller costs. In addition, individual postdocs may have grants or fellowships with money earmarked for training. Encouraging them to use those resources can be of great value, especially if they are willing to share some of the knowledge they gained. My EMBO postdoctoral fellowship paid for an amazing 3-day lab management course, and I am currently discussing with our graduate association to implement some of the training exercises that we were taught.

As my final point I’d like to say that I personally very rarely encounter faculty who consider postdocs cheap labor. If anything, most PIs I talk to have their postdocs’ best interests at heart. Similarly, postdocs are often more than willing to organize events and mediate the needs of their fellows. However, in the long run the efforts of individual PIs and postdocs cannot replace a well-organized institutional program, which I think will likely require taking on board some of my above suggestions and building them into a more systematic training program.

*The National Postdoc Association has a much more elaborate toolkit for setting up and maintaining a postdoc association and there's also a great article about initiating and maintaining a postdoc organisation by Bruckman and Sebestyen. However, not all postdoc groups have the manpower or momentum to dive directly into such a program, so the tips listed here are more to get postdocs involved initially and create that sense of community and momentum to build an association.

Wednesday, August 2, 2017

Figure scripting and how we organize computational work in the lab

Saw a recent Twitter poll from Casey Brown on the topic of figure scripting vs. "Illustrator magic", the former of which is the practice of writing a program to completely generate the figure vs. putting figures into Illustrator to make things look the way you like. Some folks really like programming it all, while I've argued that I don't think this is very efficient, and so arguments go back and forth on Twitter about it. Thing is, I think ALL of us having this discussion here are already way in the right hand tail in terms of trying to be tidy about our computational work, while many (most?) folks out there haven't ever really thought about this at all and could potentially benefit from a discussion of what an organized computational analysis would look like in practice. So anyway, here's what we do, along with some discussion of why and what the tradeoffs are (including talking about figure scripting).

First off, what is the goal? Here, I'm talking about how one might organize a computational analysis in finalized form for a paper (will touch on exploratory analysis later). In my mind, the goal is to have a well-organized, well-documented, readable and, most importantly, complete and consistent record of the computational analysis, from raw data to plots. This has a number of benefits: 1. it is more likely to be free of mistakes; 2. it is easier for others (including within the lab) to understand and reproduce the details of your analysis; 3. it is more likely to be free of mistakes. Did I mention more likely to be free of mistakes? Will talk about that more in a coming post, but that's been the driving force for me as the analyses that we do in the lab become more and more complex.

[If you want to skip the details and get more to the principles behind them, please skip down a bit.]

Okay, so what we've settled on in lab is to have a folder structured like this (version controlled or Dropboxed, whatever):

I'll focus on the "paper" folder, which is ultimately what most people care about. The first thing is "extractionScripts". This contains scripts that pull out numbers from data and store them for further plot-making. Let me take this through the example of image data in the lab. We have a large software toolset called rajlabimagetools that we use for analyzing raw data (and that has its own whole set of design choices for reproducibility, but that's a story for another day). That stores, alongside the raw data, analysis files that contain things like spot counts and cell outlines and thresholds and so forth. The extraction scripts pull data from those analysis files and put it into .csv files, which are stored in extractedData. For an analogy with sequencing, this is like maybe taking some form of RNA-seq data and setting up a table of TPM values in a .csv file. Or whatever, you get the point. plotScripts then contains all the actual plotting scripts. These load the .csv files and run whatever to make graphical elements (like a series of histograms or whatever) and store them in the graphs folder. finalFigures then contains the Illustrator files in which we compile the individual graphs into figures. Along with each figure, we have a readme (like Fig1readme.txt) that describes exactly what .eps or .pdf files from the graphs folder ended up in, say, Figure 1f (and, ideally, what script made them). Thus, everything is traceable back from the figure all the way to raw data. Note: within the extractionScripts is a file called "extractAll.m" and in plotScripts "plotAll.R" or something like that. These master scripts basically pull all the data and make all the graphs, and we rerun these completely from scratch right before submission to make sure nothing changed. Incidentally, of course, each of the folders often has a massive number of subfolders and so forth, but you get the idea.
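
To make the "rerun from scratch and make sure nothing changed" step concrete, here is a minimal sketch in Python (our actual scripts are MATLAB and R, so this is an illustration and not what we run). The folder names come from the description above; the function names, the spotCounts.csv file, and the fake spot counts are all made up for the example.

```python
import csv
import hashlib
import tempfile
from pathlib import Path

# Folder names follow the post; everything else here is hypothetical.
FOLDERS = ["extractionScripts", "extractedData", "plotScripts", "graphs", "finalFigures"]


def make_skeleton(paper_dir: Path) -> None:
    """Create the empty 'paper' folder layout."""
    for name in FOLDERS:
        (paper_dir / name).mkdir(parents=True, exist_ok=True)


def extract_all(analysis: dict, paper_dir: Path) -> Path:
    """Stand-in for a master extraction script: write spot counts to a .csv table."""
    out_path = paper_dir / "extractedData" / "spotCounts.csv"
    with out_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["cellId", "spotCount"])
        for cell_id in sorted(analysis):  # sorted so reruns are byte-identical
            writer.writerow([cell_id, analysis[cell_id]])
    return out_path


def fingerprint(folder: Path) -> dict:
    """Map each file's relative path to a SHA-256 hash of its contents."""
    return {
        str(path.relative_to(folder)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def rerun_and_compare(analysis: dict, paper_dir: Path) -> list:
    """Rerun extraction from scratch; return the files whose contents changed."""
    data_dir = paper_dir / "extractedData"
    before = fingerprint(data_dir)
    for path in data_dir.glob("*.csv"):
        path.unlink()  # delete old outputs so nothing stale survives
    extract_all(analysis, paper_dir)
    after = fingerprint(data_dir)
    return sorted(name for name in before.keys() | after.keys()
                  if before.get(name) != after.get(name))


if __name__ == "__main__":
    fake_analysis = {"cell2": 31, "cell1": 12}  # made-up numbers
    with tempfile.TemporaryDirectory() as tmp:
        paper = Path(tmp) / "paper"
        make_skeleton(paper)
        extract_all(fake_analysis, paper)
        print(rerun_and_compare(fake_analysis, paper))  # empty list means nothing changed
```

Hashing the extracted tables is just one way to check that "nothing changed"; comparing the .csv files by eye or with a diff tool would serve the same purpose. The same idea is harder to apply to the graphs folder, since PDF output often embeds timestamps.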

What are the tradeoffs that led us to this workflow? First off, why did we separate things out this way? Back when I was a postdoc (yes, I've been doing various forms of this since 2007 or so), I tried to just arrange things by having a folder per figure. This seemed logical at the time, and has the benefit that the output of the scripts is in close proximity to the script itself (and the figure), but the problem was that figures kept getting endlessly rearranged and remixed, leading to endless tedious (and error-prone) rescripting to regain consistency. So now we just pull in graphical elements as needed. This makes things a bit tricky, since for any particular graph it's not immediately obvious what made that graph, but it's usually not too hard to figure out with some simple searching for filenames (and some verbose naming conventions).
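
The "simple searching for filenames" part can be as low-tech as a text search through the plotting scripts. Here is a rough Python sketch of that idea; the function name, the script names, and the graph filenames are all invented for illustration, and the only assumption is that each script mentions the file it saves by name.

```python
import tempfile
from pathlib import Path


def scripts_mentioning(graph_name: str, scripts_dir: Path) -> list:
    """Return the scripts whose source text contains the given graph filename."""
    hits = []
    for script in sorted(scripts_dir.rglob("*")):
        if script.is_file() and graph_name in script.read_text(errors="ignore"):
            hits.append(script.name)
    return hits


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        plot_scripts = Path(tmp) / "plotScripts"
        plot_scripts.mkdir()
        # Two made-up plotting scripts, each saving one verbosely named graph.
        (plot_scripts / "plotSpotCountHistograms.R").write_text(
            'ggsave("graphs/spotCountHistogram_cellLineA.pdf")\n')
        (plot_scripts / "plotCellAreas.R").write_text(
            'ggsave("graphs/cellAreaBoxplot.pdf")\n')
        print(scripts_mentioning("spotCountHistogram_cellLineA.pdf", plot_scripts))
```

This only works if graph filenames are distinctive and written out literally in the script, which is where the verbose naming conventions earn their keep.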

The other thing is why have the extraction scripts separated from the plots? Well, in practice, the raw data is just too huge to distribute easily, and if it was all mushed together with the code and intermediates, the whole thing would be hard to share. But, at least in our case, the more important fact is that most people don't really care about the raw data. They trust that we've probably done that part right, and what they're most interested in are the tables of extracted data. So this way, in the paper folder, we've documented how we pulled out the data while keeping the focus on what most people will be most interested in.

[End of nitty gritty here.]

And then, of course, figure scripting, the topic that brought this whole thing up in the first place. A few thoughts. I get that in principle, scripting is great, because it provides complete documentation, and also because it potentially cuts down on errors. In practice, I think it's hard to efficiently make great figures this way, so we've chosen perhaps a slightly more tedious and error prone but flexible way to make our figures. We use scripts to generate PDFs or EPSs of all relevant graphical elements, typically not spending time to optimize even things like font size and so forth (mostly because all of those have to change so many times in the end anyway). Yes, there is a cost here in terms of redoing things if you end up changing the analysis or plot. Claus Wilke argued that this discourages people from redoing plots, which I think has some truth to it. At the same time, I think that the big problem with figure scripting is that it discourages graphical innovation and encourages people to use lazy defaults that usually suffer from bad design principles—indeed, I would argue it's way too much work currently to make truly good graphics programmatically. Take this example:

Or imagine writing a script for this one:

Maybe you like or don't like these types of figures, but either way, not only would it take FOREVER to write up a script for these (at least for me), but by the time you've done it, you would probably never build up the courage to remix these figures the dozen or so times we've reworked this one over the course of publication. It's just faster, easier, and more intuitive to do with a tool for, you know, playing with graphical elements, which I think encourages innovation. Also, many forms of labeling of graphs that reduce cognitive burden (like putting text descriptors directly next to the line or histogram that they label) are much easier in Illustrator and much harder to do programmatically, so again, this works best for us. It does also, however, introduce a human element for error, and that has happened to us, although I should say that programmatic figures are a typo away from errors as well, and that's happened, too. There is also the option to link figures, and we have done that with images in the past, but in the end, relying on Illustrator to find and maintain links as files get copied around just ended up being too much of a headache.

Note that this is how we organize final figures, but what about exploratory data analysis? In our lab, that ends up being a bit more ad-hoc, although some of the same principles apply. Following the full strictures for everything can get tedious and inhibitory, but one of the main things we try and encourage in the lab is keeping a computational lab notebook. This is like an experimental lab notebook, but, uhh, for computation. Like "I did this, hoped to see this, here's the graph, didn't work." This has been, in practice, a huge win for us, because it's a lot easier to understand human descriptions of a workflow than try and read code, especially after a long time and double especially for newcomers to the lab. Note: I do not think version control and commit messages serve this purpose, because version control is trying to solve a fundamentally different problem than exploratory analysis. Anyway, talked about this computational lab notebook thing before, should write something more about it sometime.

One final point: like I said, one of the main benefits to these sorts of workflows is that they help minimize mistakes. That said, mistakes are going to happen. There is no system that is foolproof, and ultimately, the results will only be as trustworthy as the practitioner is careful. More on that in another post as well.

Anyway, very interested in what other people's workflows look like. Almost certainly many ways to skin the cat, and curious what the tradeoffs are.