For instance, take this more conventionally over-plotted graph of city vs. highway miles per gallon, with different classes of cars labeled by color:
library(ggplot2)  # provides qplot(), theme_bw(), ggsave() and the mpg dataset
q2 <- qplot(cty, hwy, data = mpg, color = class) + theme_bw()
ggsave("color.pdf", q2, width = 8, height = 6)
Now, there are a number of problems with this graph, but the most pertinent is that there are a lot of colors corresponding to the different categories of car, so it takes a lot of effort to parse. The small multiple solution is to make a bunch of small graphs, one for each category, which lets you see the differences between them. By the power of ggplot, behold!
# facets = . ~ class puts the classes in columns: a single row of panels
q <- qplot(cty, hwy, data = mpg, facets = . ~ class) + theme_bw()
ggsave("horizontal_multiples.pdf", q, width = 8, height = 2)
Or vertically:
# facets = class ~ . puts the classes in rows: a single column of panels
q <- qplot(cty, hwy, data = mpg, facets = class ~ .) + theme_bw()
ggsave("vertical_multiples.pdf", q, width = 2, height = 8)
Notice how much easier it is to see the differences between categories of car in these small multiples than in the more conventional over-plotted version, especially in the horizontal one.
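For readers on a recent ggplot2, where qplot() is deprecated (since version 3.4.0, as far as I know), here is a minimal sketch of the same two layouts written with ggplot() and facet_grid(). The object names base, horizontal and vertical are only illustrative and do not appear in the original code.

```r
library(ggplot2)

# Same data and aesthetics as the qplot() calls above, written with ggplot().
base <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  theme_bw()

# Classes in columns: one row of panels (the counterpart of facets = . ~ class).
horizontal <- base + facet_grid(cols = vars(class))

# Classes in rows: one column of panels (the counterpart of facets = class ~ .).
vertical <- base + facet_grid(rows = vars(class))
```

Either object can be handed to ggsave() with the same widths and heights as before.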
Most small multiple plots look like these, and they're typically a huge improvement over heavily over-plotted graphs, but I think there’s room for improvement, especially in the labeling. The biggest problem with small multiple labeling is that most of the axis labels are very far away from the graphs themselves. This is of course a seemingly logical way to set things up, because the labels apply to all the multiples, but it forces a lot of mental gymnastics to figure out what the axes are for any one particular multiple.
Thus, my suggestion is actually based on the philosophy of the small multiple itself: explain a graph once, then rely on that knowledge to help the reader parse the rest of the graphs. Check out these before and after comparisons:
The horizontal small multiples also improve, in my opinion:
To me, labeling one of the small multiples directly makes it a lot easier to figure out what is in each graph, and thus makes the entire graphic easier to understand quickly. It also adheres to the principle that important information for interpretation should be close to the data. The more people’s eyes wander, the more opportunities they have to get confused. There is of course the issue that by labeling one multiple, you are calling attention to that one in particular, but I think the tradeoff is acceptable. Another issue is a loss of precision in the other multiples. One could include tick marks as more visible markers, but again, I think the tradeoff is acceptable.
Oh, and how did I perform this magical feat of alternative labeling of small multiples (as well as general cleanup of ggplot's nice-but-not-great output)? Well, I used this amazing software package called “Illustrator” that works with R or basically any software that spits out a PDF ;). I’m of the strong opinion that being able to drag around lines and manipulate graphical elements directly is far more efficient than trying to figure out how to do this stuff programmatically most of the time. But that’s a whole other blog post…
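For readers who would rather stay in R, here is a rough sketch of one way to approximate the label-one-panel idea programmatically. This is not how the figures in this post were made (those were edited by hand in Illustrator), and the names first_class, x_label, y_label and labeled, the label wording and the label positions are all assumptions chosen for illustration.

```r
library(ggplot2)

# Assumption: the first class in sorted order is the one panel that gets labeled.
first_class <- sort(unique(mpg$class))[1]

# One-row data frames. Because they carry a class column, each label is drawn
# only in the panel for that class.
x_label <- data.frame(class = first_class, cty = max(mpg$cty),
                      hwy = min(mpg$hwy), label = "city mpg")
y_label <- data.frame(class = first_class, cty = min(mpg$cty),
                      hwy = max(mpg$hwy), label = "highway mpg")

labeled <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  # x-axis name anchored at the bottom-right corner of the labeled panel
  geom_text(data = x_label, aes(label = label), hjust = 1, vjust = 0, size = 3) +
  # y-axis name, rotated, anchored at the top-left corner of the labeled panel
  geom_text(data = y_label, aes(label = label), angle = 90,
            hjust = 1, vjust = 1, size = 3) +
  facet_grid(cols = vars(class)) +
  theme_bw() +
  # drop the shared axis titles that sit far away from most panels
  theme(axis.title = element_blank())
```

The positions will almost certainly need nudging by hand for any real figure, which is rather the point made above about direct manipulation.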
Hey Arjun,
That's a nice illustration of small multiples -- I will definitely keep this in mind for future plots.
I'm also looking forward to reading a bit more about your figure pipeline. I also start all my figures in some scripting language that talks to the data, and have gone back and forth between minimizing the amount of script writing necessary vs. minimizing the amount of Illustrator click-and-drag necessary. Despite colleagues repeatedly recommending the pure script as the best way to go, I find, like you I believe, that it just doesn't work so well for my way of visual reasoning.
Hi Alistair, I've been meaning to write a blog post on reproducibility and efficiency on this topic specifically. For now, suffice it to say that in my opinion, purity is the enemy of efficiency. :)