14 March 2017
A Pie Chart for Pi Day: The Data Scientist Pie Eating Challenge
Today is national pi day. The number, not the food. As mentioned in a previous post, I love pie charts. And, as luck would have it, I recently chanced upon some data in need of a pie chart. Surprisingly, the data set that I found in need of a pie chart was in a blog post by David Robinson attacking pies!
In his informative post, David points out that the visualization below, from the Wall Street Journal, is a poor one. Sure it looks pretty, but it takes quite a bit of work to figure out the answer to the question it poses at the top. Where does the time go?
David’s better plot
David creates small multiples of column charts. You can check out the code here.
This new plot is, to my mind, a clear improvement on the original, although it has a bit too much of the Tufte funeral style for my own liking.
But, it is a complex chart, not a visualization
However, I think this is still a complex chart rather than a visualization. To interpret it correctly requires a strong mind. Compare Presenting analysis with Machine learning/statistics. The highest column is for Presenting analysis (47%), but taking frequency properly into account, more time is actually spent on Machine learning/statistics. Sure, the reader can easily work out this correct interpretation, but they need to work it out, as their instinctual interpretation – a higher column is meaningful – is not correct.
A solution is to:
- Make educated guesses about the average frequency of each of the column categories. My guess is that < 1 a week is 0.25 times a week, 1-4 a week is 2.5 times a week, 1-3 a day is 14 times a week, and > 4 a day is 40 times a week. This is often referred to as midpoint recoding (even when not using midpoints).
- Multiply the percentages in each column by the guessed averages and sum them up.
- Convert them to percentages.
- Create a high-quality pie chart.
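The first three steps can be sketched in a few lines of Python. This is only a rough illustration, not the code behind the chart: the weights are the guesses listed above, but the function name `time_shares`, the activity names, and the percentages in `example` are made up for the demonstration and are not the survey's figures.

```python
# Guessed average occurrences per week for each frequency band (from the text).
WEIGHTS = {
    "< 1 a week": 0.25,
    "1-4 a week": 2.5,
    "1-3 a day": 14.0,
    "> 4 a day": 40.0,
}


def time_shares(survey):
    """Return each activity's share of total weekly occurrences, in percent.

    `survey` maps an activity name to a dict of {frequency band: percentage
    of respondents}. The name and the input layout are illustrative choices.
    """
    # Step 2: weight each column's percentage by its guessed average and sum.
    weekly = {
        activity: sum(WEIGHTS[band] * pct for band, pct in bands.items())
        for activity, bands in survey.items()
    }
    # Step 3: rescale so the activities add up to 100%.
    total = sum(weekly.values())
    return {activity: 100.0 * value / total for activity, value in weekly.items()}


# Hypothetical percentages for illustration only, NOT the survey's real data.
example = {
    "Activity A": {"< 1 a week": 10, "1-4 a week": 50, "1-3 a day": 30, "> 4 a day": 10},
    "Activity B": {"< 1 a week": 5, "1-4 a week": 30, "1-3 a day": 40, "> 4 a day": 25},
}

shares = time_shares(example)
for activity, share in shares.items():
    print(f"{activity}: {share:.1f}%")
```

In this made-up example, Activity A has the single tallest column (50%), yet Activity B ends up with the larger share once frequency is taken into account, which is the trap the column chart sets for the reader.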
Is the pie chart truly better?
The pie chart clearly shows less information than the earlier charts. And, it makes assumptions that are, at best, educated guesses. But, to my mind, it does a better job, in that it quickly communicates the main pattern in the data, which is that all the categories are reasonably large, with no one activity dominating the time of data scientists.
Author: Tim Bock
Tim Bock is the founder of Displayr. Tim is a data scientist, who has consulted, published academic papers, and won awards, for problems/techniques as diverse as neural networks, mixture models, data fusion, market segmentation, IPO pricing, small sample research, and data visualization. He has conducted data science projects for numerous companies, including Pfizer, Coca Cola, ACNielsen, KFC, Weight Watchers, Unilever, and Nestle. He is also the founder of Q (www.qresearchsoftware.com), a data science product designed for survey research, which is used by all the world’s seven largest market research consultancies. He studied econometrics, maths, and marketing, and has a University Medal and PhD from the University of New South Wales (Australia’s leading research university), where he was an adjunct member of staff for 15 years.