This webinar is focused on how to find stories in your survey data.
The first step in finding the story in your data is to create and implement an analysis plan. Typically, this involves the creation of hundreds or thousands of tables of data. That's the easy bit.
Then, the next stage is to reduce all the data, so that all that is left is the interesting stuff. This second stage, known as data reduction, is the main focus of this webinar.
Cast a wide-but-focused net
It is a natural thing when you have collected data to just jump in and play with the data. However, as a pretty general rule, when people do this they end up just getting lost and finding no story.
The smart way to proceed tends to be to have a written analysis plan. There's something about writing an analysis plan that focuses your mind.
We will start by looking at an analysis plan for a study investigating the appeal of this product, the iLock. I'll give you a moment to read the description.
An example analysis plan
This is an example of an analysis plan. Don't worry if some of the jargon is new to you. We will cover some of it in this webinar, and at the end I'll also share some links to other resources.
I will give you a chance to digest it.
The key bit is the planned tables and, if any, planned visualizations.
I covered the analysis plan in a previous webinar, which I will link to at the end of this webinar. But I will just emphasize some key aspects.
The most important thing is to spend the time carefully thinking through the goal of the study and working out the analyses that best achieve that goal. This will always be specific to your study. You need to work them out from scratch.
Second, there are some frameworks which I will touch on shortly.
Third, it's always a good idea to identify the key questions in your survey and then crosstab them with all the other data. Such exploratory analyses can uncover insights that you didn't anticipate.
Do you have a best friend at work?
"Do you have a best friend at work?" It's an unusual question, isn't it?
It has been created by legendary polling company Gallup, to measure the extent to which people feel connected to their colleagues.
At one company, 16% of employees strongly agreed. What does this mean? It's not hard to come to the conclusion that this is a very bad question.
What does having a best friend have to do with work?
Is 16% a good number or a bad number?
As I'll shortly explain, even though the question may be pretty ambiguous, with appropriate analysis it can still be useful.
The trick is to apply the delta principle. This principle is very important in applied research, because it turns out that most survey data is pretty ambiguous, so we need to apply the delta principle very often.
The delta principle
The delta principle is that when analyzing data that is inaccurate to some unknowable degree, you should create analyses that compare it to similarly inaccurate data. Or, less verbosely, focus on relativities, not magnitudes.
Let's do it.
One way is to compare sub-groups.
Let's say that 12% of engineers have a best friend at work, and 32% of marketers do.
The difference between these numbers, often referred to using the Greek symbol delta, is meaningful.
Whatever errors are in our 12% result are also in our 32% result, so when we subtract one from the other, a good deal of this error will just cancel out.
This allows us to conclude that even though we know the data is problematic, it suggests that the marketers at this company are more connected than the engineers. Go marketers!
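Here's a minimal sketch of why the subtraction helps. The "true" rates and the size of the error are made up for illustration, and the sketch assumes the error is the same for both groups, which is the best case. In practice only a good deal of it cancels, not all of it.

```python
import math

# Hypothetical "true" rates of connection; in practice these are unknowable.
true_rate = {"engineers": 0.30, "marketers": 0.50}

# Assumption: the ambiguous wording shifts every group's result by the same amount.
shared_error = -0.18

# What the survey would report for each group.
observed = {group: rate + shared_error for group, rate in true_rate.items()}

delta_true = true_rate["marketers"] - true_rate["engineers"]
delta_observed = observed["marketers"] - observed["engineers"]

print({group: round(value, 2) for group, value in observed.items()})
print("Deltas match:", math.isclose(delta_true, delta_observed))
```

The observed magnitudes are both wrong, but the gap between the two groups survives.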
We can also use other studies to provide benchmarks.
If our best friend score is worse than other companies, that's not good news.
And, we can track the results over time.
Lastly, we can be creative and create similar questions that we think will have the same error structure.
For example, we can ask employees if they had a best friend at their previous company, and then compare the difference.
Once you have your analysis plan, you go ahead and perform the analyses. I illustrated how to do this in our recent webinar on crosstabs.
You will typically end up with lots of crosstabs.
Lots of crosstabs means lots of data. And this is where the job of finding the story really starts. We need to perform data reduction. As the name suggests, the goal is to reduce the amount of data, chucking away all the noise, so all that's left is the story.
The pope of the day asked Michelangelo how he'd carved this most famous of all statues.
He said, "It's simple. I just remove everything that's not David."
And this is precisely how we find the story in our data.
Remove if not interesting
We remove everything that's not interesting to our stakeholders.
Or, returning to our fish motif, by reducing the quantity of data, we distill and concentrate it, so we end up with a much stronger flavor, much like fish sauce, for those of you that like Vietnamese and Thai food.
So how do we do it?
Data reduction techniques
There are eight key techniques to master. They can be done in any order. The first three of these we touched on in our earlier webinar on crosstabs, but I will cover them again.
We will start with the first of these, which is to delete uninteresting analyses.
Key planned analyses
If you have an awesome visual memory, you will recall that my analysis plan consisted of key planned analyses, and then some exploratory analyses.
Usually you want to share the key planned analyses with stakeholders, so our focus when we are doing data reduction is on the exploratory analyses.
My exploratory analyses consist of a whole heap of tables, with everything in the data, crossed by how likely people said they would be to buy the iLock, or, to use the jargon, purchase intent.
Let's look at the first table.
You'll recall our first data reduction goal is just to delete things that aren't interesting.
This table compares how long it took people to complete the questionnaire by their purchase intent.
It's hard to imagine a less interesting table.
Now, we could just delete it.
But, with exploratory analysis, we can entirely automate the process.
Statistics has invented a nice little tool that summarizes whether exploratory crosstabs are interesting. It's called stat testing. We can just delete all exploratory tables that are not statistically significant.
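To make the idea concrete, here's a rough sketch of filtering tables by significance. It is not how Displayr does it. It assumes each exploratory table has been collapsed to a 2x2 table of counts, uses a plain Pearson chi-square test, and the table names, counts, and the 0.05 cut-off are all hypothetical.

```python
import math

def chi_square_p_value_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical exploratory crosstabs: rows are sub-groups, columns are
# counts of "would buy" and "would not buy".
exploratory_tables = {
    "Purchase intent by age (under 35 vs 35+)": [[40, 60], [20, 80]],
    "Purchase intent by completion time (fast vs slow)": [[30, 70], [28, 72]],
}

ALPHA = 0.05  # assumed significance cut-off

# Keep only the tables where the test finds a significant relationship.
kept = [name for name, table in exploratory_tables.items()
        if chi_square_p_value_2x2(table) < ALPHA]
print(kept)
```

A stricter cut-off deletes more tables, which is the trade-off behind choosing how much to delete.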
As you can see there are quite a lot of tables here!
These levels refer to how much we want to delete. I'm lazy, so I want to delete a lot, as the more that is automatically deleted, the less I need to read, so I'll choose the first option.
It's telling me what has been deleted.
Quite a lot!
Now, I've only got about half as many tables to read!
So, how do we reduce the data on this one further?
The second of the principles is we want to remove visual clutter.
Let's see what we can do here.
Before I start, I just want you to note something. This table has 18 rows of numbers, and 8 columns, so it contains 144 numbers. Success is reducing that to a much smaller quantity of numbers.
This first column of Less than 18 years is the definition of clutter. The sample size is 0, so it contains no data. Let's get rid of it.
As discussed in our crosstabs webinar, we usually just want the column %. Let's take the row % and count off.
All that blue on the table is visual clutter as well. Let's choose a much simpler coloring.
Let's get rid of the percentage signs.
OK, it's getting better.
To use some jargon from the olden days, we improve things by reducing the amount of ink on the page.
What's the next data reduction technique?
It's to merge similar things.
There are two basic ways that we do this. One is to use automated techniques like CHAID, cluster analysis, and factor analysis. They are topics of other webinars. But, the most useful thing is to just merge columns and rows in tables.
We can see that our first two age categories are pretty similar, so we will merge them.
OK, we started with 144 numbers, we've now reduced it to 24 numbers, which is an 83% reduction and pretty impressive. Let's see what else we can do.
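If you are doing this outside of a tool like Displayr, the thing to remember is that you merge the counts and then recompute the percentages, rather than averaging the percentages. Here's a small sketch with made-up age bands and counts.

```python
def merge_columns(counts, columns_to_merge, new_name):
    """Merge crosstab columns by adding their counts, keeping column order."""
    merged = [sum(cells) for cells in zip(*(counts[c] for c in columns_to_merge))]
    result = {}
    for name, cells in counts.items():
        if name == columns_to_merge[0]:
            result[new_name] = merged  # merged column takes the first column's place
        elif name not in columns_to_merge:
            result[name] = cells
    return result

def column_percentages(counts):
    """Convert each column of counts to column percentages."""
    return {name: [100 * cell / sum(cells) for cell in cells]
            for name, cells in counts.items()}

# Hypothetical counts; the two rows are "would buy" and "would not buy".
counts = {"18 to 24": [20, 30], "25 to 34": [45, 55], "35 or more": [30, 120]}

merged_counts = merge_columns(counts, ["18 to 24", "25 to 34"], "Under 35")
merged_pct = column_percentages(merged_counts)
print(merged_counts)
print({name: [round(p) for p in cells] for name, cells in merged_pct.items()})

# The reduction quoted in the webinar: 18 rows x 8 columns down to 24 numbers.
reduction = 1 - 24 / (18 * 8)
print(f"{reduction:.0%}")
```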
Replace data with summary statistics
This next technique is to replace data with summary statistics.
You're already familiar with the most common examples of this, which are that rather than share the raw data with somebody, you compute averages and percentages, which are summary statistics.
But, there are many other statistics.
The first four of these are the ones that we use most often in survey analysis.
Let's work through some examples.
Using averages instead of percentages
Here's some data looking at how many episodes people have watched of particular TV shows.
I'll give you a moment to work out which TV shows are more popular.
Stranger Things and The Mandalorian are the winners. Nice work. But, it was hard work. You had to digest 70 numbers.
What we will do now is we will change the data from ordinal to numeric, and then Displayr will automatically compute the averages.
Cool. Now we only have 11 numbers. That's much quicker and easier to read. It's also easier to see that Stranger Things and The Mandalorian win.
There's a little trap for beginners in this. It's mentioned in the heading.
Our table had words describing how often people watched different programs. How can we take an average of words?
Let's look at what has happened in the background.
By default, data collection programs assign a 1 to the first category, a 2 to the second category, and so on.
These are the numbers that we have averaged. So, we need some better numbers.
Some people get a bit nervous when this is done. In particular, they look at the Started one score of 0.1, and point out that it's just a guess.
The good news is that people have been doing this for more than 70 years, and it turns out that unless you do something really dumb, it makes no meaningful difference.
You don't need to take my word for this. Obviously the earlier values of 1 to 6 were not so sensible. So, will the results change when we use these more sensible numbers?
Yes, the magnitude of the numbers changed. But, the relativities have not.
Remember from our earlier discussion of the delta principle. We aren't so interested in magnitudes. It's only relativities that we trust.
The Mandalorian and Stranger Things remain the winners.
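You can check this for yourself with a few lines of code. In this sketch the category labels, the counts, and most of the "sensible" values are hypothetical. Only the 0.1 for Started one comes from the example above.

```python
# Hypothetical counts of respondents per viewing category, for three shows.
categories = ["None", "Started one", "A few", "About half", "Most", "All"]
counts = {
    "Stranger Things": [5, 5, 10, 10, 20, 50],
    "Picard": [25, 15, 20, 15, 15, 10],
    "Fargo": [60, 10, 10, 10, 5, 5],
}

# The values a data collection program assigns by default.
default_codes = [1, 2, 3, 4, 5, 6]
# Assumed proportions of episodes watched for each category.
sensible_values = [0, 0.1, 0.25, 0.5, 0.75, 1.0]

def averages(values):
    """Average score per show, weighting each category value by its count."""
    return {show: sum(v * n for v, n in zip(values, ns)) / sum(ns)
            for show, ns in counts.items()}

def ranking(avgs):
    """Shows ordered from highest to lowest average."""
    return sorted(avgs, key=avgs.get, reverse=True)

default_avgs = averages(default_codes)
sensible_avgs = averages(sensible_values)
print(default_avgs)
print(sensible_avgs)
print("Same ranking:", ranking(default_avgs) == ranking(sensible_avgs))
```

The magnitudes differ a lot between the two sets of values, but the order of the shows is the same.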
Let's do a more challenging example.
As we have seen, we have data on the viewing of 10 different programs.
With data like this, we are keen to understand if people that watch one program are more likely to watch certain other programs. This is useful both in scheduling content, and also in placing advertising.
But, we have a problem. With 10 shows, there are 45 crosstabs showing the viewing patterns between them.
And, each of these crosstabs has 36 numbers. So, we have 1,620 numbers to look at. That's hard work.
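If you want to check the arithmetic, it's just the number of pairs of shows multiplied by the number of cells in each crosstab. The six categories per show come from the values of 1 to 6 we saw earlier.

```python
import math

shows = 10
categories_per_show = 6

# Each pair of different shows gives one crosstab.
crosstabs = math.comb(shows, 2)
# Each crosstab has one cell per combination of categories.
cells_per_crosstab = categories_per_show ** 2
total_numbers = crosstabs * cells_per_crosstab

print(crosstabs, cells_per_crosstab, total_numbers)
```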
Let's start by just looking at a subset of them.
What's the overlap?
The table on the top left is showing us the relationship between viewing Fargo and Dr Who. What can you see?
I can't see much of a pattern.
Now, look at the table to the right. In its top left, we can see that 91% of people that didn't watch Picard have also not watched Dr Who. But, reading along the row, note that the more they watch Picard, the less likely they are to say they have not watched Dr Who.
That is, it looks like there is a correlation between watching Dr Who and watching Picard.
When you have a crosstab where the categories have a natural ordering, as we have here, we can use correlations as a summary statistic. A correlation of 1 means that we can perfectly predict viewing of one show by viewing of the other. A correlation of 0 means there's no relationship.
A correlation of -1 means that the more people watch one show, the less they watch another.
As we can see here, the correlation statistic is doing a good job at summarizing the strength of the patterns in these tables. It clearly shows that Picard and Star Trek Discovery are much more strongly correlated with Dr Who than Fargo and Little Fires Everywhere.
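For those who want to see the mechanics, here's a sketch of turning an ordered crosstab into a single correlation. It assumes you have the counts in each cell and a numeric score for each category. Both example tables and the scores of 0, 1, and 2 are made up.

```python
import math

def crosstab_correlation(counts, row_scores, col_scores):
    """Pearson correlation between row and column scores, weighted by cell counts."""
    n = sum(sum(row) for row in counts)
    mean_r = sum(r * sum(row) for r, row in zip(row_scores, counts)) / n
    mean_c = sum(c * sum(col) for c, col in zip(col_scores, zip(*counts))) / n
    cov = var_r = var_c = 0.0
    for r, row in zip(row_scores, counts):
        for c, cell in zip(col_scores, row):
            cov += cell * (r - mean_r) * (c - mean_c)
            var_r += cell * (r - mean_r) ** 2
            var_c += cell * (c - mean_c) ** 2
    return cov / math.sqrt(var_r * var_c)

scores = [0, 1, 2]  # assumed scores for low, medium, and high viewing

# Hypothetical crosstab where heavy viewers of one show are heavy viewers of the other.
patterned = [[30, 5, 1], [5, 20, 5], [1, 5, 30]]
# Hypothetical crosstab with no relationship at all.
flat = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]

print(round(crosstab_correlation(patterned, scores, scores), 2))
print(round(crosstab_correlation(flat, scores, scores), 2))
```

Nine numbers go in and one number comes out for each table, which is the whole point.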
Let's look at all 10 shows together
OK, let's create a big crosstab.
Here's our summary grid. Let's crosstab it by itself.
This is a big table! Look at the scroll bars!
That's too hard to read.
We want to look at correlations instead.
Just as with my previous example of looking at means, the trick is to change the data structure from ordinal to numeric for both the row and column data. In this example, the same data is in the rows and columns, so both will change at the same time, giving us correlations.
Now we have a much smaller table. We've reduced the data to a much smaller size, from 1,620 numbers to 121. That's a huge saving.
What can you see?
That's right. The blue arrows are telling us that the big pattern is that in general, if I watch one show then I am more likely to watch all of the others. I'm sad to say this is true for me. The only one I haven't watched is You!
We will return to exploring this shortly.
Now we are onto my personal favorite technique. Changing the order of data to make patterns easier to see.
Main ways of ordering data
Let's start with the table to the right.
We will start by de-cluttering it, removing the SUM.
It's currently sorted by alphabetical order.
This is one of the worst ways of sorting most tables.
How can we improve it?
The easy win is to sort from highest to lowest.
That's much better.
Let's also add some share data.
Now for the cool bit.
When we have two dimensional tables, we can reorder the rows and columns so that all the small numbers are in one corner. This makes it easier to see patterns.
And, we will find two types of patterns, hierarchy, or segmentation.
Let me show you an example.
Diagonalizing viewing correlations
Here's the table from earlier. It's changed a bit, as it was automatically sorted from lowest to highest before.
I'm going to reorder the rows so that my bottom left corner has the small numbers.
Can you see how Dr Who is in the bottom left, but its numbers aren't so small?
I'll drag it up to Star Trek Discovery.
The Undoing is also in the wrong spot.
You can see that Fargo is in the wrong spot. It's not correlated with much. Let's drag it to the bottom.
I have now diagonalized the table. You can see that there is a big diagonal pattern running from the top left to the bottom right.
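I did that by hand, dragging rows around. If you wanted to automate it, one simple heuristic is to start with one show and keep adding whichever remaining show is most correlated with the last one you placed. This is only a sketch of that heuristic, not what Displayr does, and the correlations are invented.

```python
def greedy_order(names, corr):
    """Order items so that each one sits next to its most correlated unplaced neighbor."""
    order = [0]  # start with the first item
    remaining = set(range(1, len(names)))
    while remaining:
        last = order[-1]
        # Pick the unplaced item most correlated with the one just placed.
        nxt = max(sorted(remaining), key=lambda j: corr[last][j])
        order.append(nxt)
        remaining.remove(nxt)
    return [names[i] for i in order]

names = ["Picard", "Fargo", "Star Trek Discovery", "You", "Dr Who", "The Undoing"]

# Hypothetical symmetric correlation matrix, in the same order as names.
corr = [
    [1.00, 0.10, 0.70, 0.15, 0.55, 0.10],
    [0.10, 1.00, 0.10, 0.50, 0.10, 0.45],
    [0.70, 0.10, 1.00, 0.10, 0.60, 0.15],
    [0.15, 0.50, 0.10, 1.00, 0.20, 0.60],
    [0.55, 0.10, 0.60, 0.20, 1.00, 0.15],
    [0.10, 0.45, 0.15, 0.60, 0.15, 1.00],
]

ordered = greedy_order(names, corr)
print(ordered)
```

With these made-up numbers, the science fiction shows end up next to each other at the start of the list, and the rest follow.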
I'm now going to view it as a heat map to make it a bit more obvious.
Cool. We started with 1,620 numbers, and now we have reduced them to an image with no numbers. That's true data reduction.
Note also that we've got a pattern.
The lighter blue means weaker correlations. As we planned, we've moved all the weaker correlations to the corners. In this case the bottom left and top right.
This is a segmentation pattern. We've found two groups of programs.
The science fiction shows are all grouped together in the top left, and the other programs are in the bottom right.
Interestingly, The Handmaid's Tale, which is really a science fiction show, doesn't appear with the other science fiction shows.
And, the strongest correlations are between the most nerdy of the shows, with Star Trek Discovery and the Star Trek universe show Picard being most strongly correlated.
Change the scale
Now we move onto changing the scale of data, also known as recoding to many market researchers.
We already touched on this before when looking at the TV viewing data.
Changing the scale of the purchase intent data
Reminder. We've seen this before. This table shows us that younger people are more likely to say they will definitely buy the iLock.
Earlier I changed the data from categorical to numeric. But, this time I want to make it clearer what's going on, so I will instead show the average on the same table.
At first glance this is a bit odd. It's showing lower scores of purchase intent for the younger ages. What's going on?
Let's look at the values.
Ah. They have a score of 1 for definitely buy, and a score of 5 for definitely not buy, so the higher the score, the lower the purchase intent.
That's counterintuitive. So, we need to reverse the scale.
OK, now the average makes more sense.
But, 1 to 5 is pretty arbitrary. Can we improve on it?
I'm going to convert this to probabilities. If somebody said they would definitely buy it, then I'll give them a score of 100.
Probably is a score of 75 for me.
Not sure is maybe 20.
And the other two are 0.
OK, so now we have a prediction, which is that 24% of people will buy, and among the under 35s, this is 41%.
It works like this. Of course, keep the delta principle in mind. These absolute magnitudes can't be relied on.
There's a different scaling that a lot of market researchers like, called top 2 box.
We assign a score of 100 to the two most positive categories, and 0 to the others.
If you are good at math, you will realize that this is the same as merging the first two rows.
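Here's a sketch of the three rescalings we just walked through: reversing the scale, the probability scale, and top 2 box. The category labels and the percentages in each category are hypothetical. The scores of 100, 75, 20, 0, and 0 are the judgment calls from above.

```python
# Assumed labels for the five purchase intent categories.
labels = ["Definitely buy", "Probably buy", "Not sure",
          "Probably not buy", "Definitely not buy"]
# Hypothetical percentage of respondents in each category.
percent = [10, 15, 25, 30, 20]

original_codes = [1, 2, 3, 4, 5]                         # 1 means definitely buy
reversed_codes = [6 - code for code in original_codes]   # 5 now means definitely buy
probability_scale = [100, 75, 20, 0, 0]
top_2_box = [100, 100, 0, 0, 0]

def average(scores):
    """Average score, weighting each category's score by its share of respondents."""
    return sum(s * p for s, p in zip(scores, percent)) / sum(percent)

print(average(original_codes), average(reversed_codes))
print(average(probability_scale))
# Top 2 box is the same as adding up the first two rows of percentages.
print(average(top_2_box), percent[0] + percent[1])
```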
Now, let's remove some clutter.
So, now we have really reduced the amount of data.
Common ways of changing the scale
These are the most common ways of re-scaling when we do survey research.
The ones in blue are the ones that are very widespread.
Midpoint recoding of Income
Here's a more challenging bit of data.
Income has been measured in lots of categories.
What do I do?
Fortunately Displayr will automate this for me.
I click on income. This is creating a new variable for me.
And, as you can see, it's computed average income. How did it do it?
It has automatically worked out the midpoints of each of the categories.
It's even come up with a sensible answer for the unbounded category.
And, it has set the Don't know and refused categories to missing.
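If you had to do this by hand, it would look something like the sketch below. The income bands are made up, and I don't know the rule Displayr uses for the unbounded category, so the 1.5 times the lower bound used here is purely an assumption.

```python
# Hypothetical income bands: (label, lower bound, upper bound). None means no upper bound.
bands = [
    ("Less than $20,000", 0, 20_000),
    ("$20,000 to $39,999", 20_000, 40_000),
    ("$40,000 to $79,999", 40_000, 80_000),
    ("$80,000 or more", 80_000, None),
]
# Answers that carry no income information are treated as missing.
MISSING = {"Don't know", "Refused"}
# Assumption: score the open-ended top band at 1.5 times its lower bound.
OPEN_ENDED_MULTIPLIER = 1.5

def midpoint(lower, upper):
    """Midpoint of a band, or a multiple of the lower bound if the band is unbounded."""
    if upper is None:
        return lower * OPEN_ENDED_MULTIPLIER
    return (lower + upper) / 2

recode = {label: midpoint(lower, upper) for label, lower, upper in bands}

def average_income(responses):
    """Average of the recoded values, skipping missing answers."""
    values = [recode[r] for r in responses if r not in MISSING]
    return sum(values) / len(values)

responses = ["Less than $20,000", "$20,000 to $39,999", "$40,000 to $79,999",
             "$80,000 or more", "Don't know", "Refused"]
print(recode)
print(average_income(responses))
```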
I'm also a big fan of decomposing data
A decomposition is when you replace one number with multiple numbers.
At first glance, this may seem the opposite of the idea of data reduction, but stay with me.
Sometimes you can use logic to create a decomposition.
For example, in business we split profit into revenue and cost.
And, we can split revenue up in lots of different ways.
Returning to the data we have been looking at, we can decompose the data on average proportion of episodes watched. It splits into two things.
The percentage of people who watched one or more episodes.
And, the average proportion of episodes watched among people that watched at least one.
That is, the first table is obtained by multiplying together the next two tables.
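Here's a quick check that the multiplication works, using made-up data for eight respondents. It treats a viewer as anybody who watched at least one episode, which is what makes the two components multiply back to the total.

```python
# Hypothetical proportion of episodes watched by each of eight respondents.
proportion_watched = [0, 0, 0.5, 1.0, 0.25, 0.75, 0, 1.0]

# Viewers are the respondents who watched anything at all.
viewers = [p for p in proportion_watched if p > 0]

overall_average = sum(proportion_watched) / len(proportion_watched)
penetration = len(viewers) / len(proportion_watched)
average_among_viewers = sum(viewers) / len(viewers)

print(overall_average)
print(penetration, average_among_viewers, penetration * average_among_viewers)
```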
Why would we do this?
If you look carefully, we have found something interesting.
Patterns are often clearer in a component or by comparing components.
Looking at the first table, Picard is watched less than half as much as Stranger Things.
But, if we look at the middle table, we see that its penetration is not massively lower, at 77% relative to Stranger Things at 98%.
But, the real weakness for Picard is that among the people that do watch it, they only watch half as much as the people that do watch Stranger Things. So, the program's struggling in terms of getting its repeat viewing, rather than penetration.
Algorithmic seasonal decomposition of CO2 data
An alternative to using logic is to use algorithms to decompose data.
The most well known example of this is the seasonal decomposition, which decomposes the observed data into
Trend + Seasonal component + random noise
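To give a feel for what such an algorithm does, here's a sketch of a classical additive decomposition using a centered moving average. It is not necessarily the method behind the chart, and the series is synthetic: an invented upward trend plus an invented four-period seasonal pattern, with no noise added.

```python
def decompose_additive(series, period):
    """Classical additive decomposition into trend, seasonal, and residual parts."""
    n = len(series)
    half = period // 2
    trend = [None] * n  # the moving average is undefined at the two ends
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2 == 0:
            # Even period: the two end points of the window get half weight.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period
    # Average the detrended values at each position in the cycle.
    by_position = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            by_position[t % period].append(series[t] - trend[t])
    raw = [sum(values) / len(values) for values in by_position]
    mean_raw = sum(raw) / period
    seasonal = [value - mean_raw for value in raw]  # center the pattern on zero
    residual = [None if trend[t] is None
                else series[t] - trend[t] - seasonal[t % period]
                for t in range(n)]
    return trend, seasonal, residual

# Synthetic data: a made-up rising trend plus a made-up seasonal pattern.
true_seasonal = [2.0, 1.0, -2.0, -1.0]
series = [400 + 0.5 * t + true_seasonal[t % 4] for t in range(12)]

trend, seasonal, residual = decompose_additive(series, period=4)
print(seasonal)
print(trend)
```

Because the synthetic series has no noise, the recovered seasonal pattern matches the one that was put in, and the residuals are zero.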
The viewing correlations again
Here's our correlation matrix from earlier. It's got 100 numbers on it. It's a bit hard to digest.
We can use another algorithmic decomposition, known as correspondence analysis, to pull this apart.
This decomposition actually takes the 100 numbers and turns them into the 20 numbers that best explain the data, + another 80 numbers of noise. Then, we just plot the first 20 as coordinates on a scatter plot.
The way we read it is that the closer two programs are together, the more their viewing overlaps.
As you can see, this much more quickly reveals the patterns we identified before.
Common sense (consistent, no smell, no APEs)
The last way of reducing data is to use common sense. In practice this comes down to three things.
The first is consistency. If a result is not consistent with existing data and theories, it is probably wrong.
Second, smell the result. If a result smells fishy, it probably is.
There's a rule called Twyman's rule, which says that any result that looks interesting is probably wrong. This rule, sadly, is usually right.
The third secret to applying common sense is to look for APEs.
An APE is an alternative plausible explanation for the finding.
If you can find an APE, then it's appropriate to either get rid of the analysis, or, prove that the APE is wrong.
Let's do some examples.
Imagine a data scientist has built a predictive model that shows that the more firefighters that go to a fire, the worse the property damage, leading to the recommendation that fewer firefighters should be sent out.
The data is correct, but the correlation is spurious. As the ape says, the real story is that we see the pattern in the data because more firefighters are sent to worse fires.
Here's an example from our survey. The data is showing the relationship between purchase intent, and, attitude to technology. What can you see?
The obvious conclusion is that the more people like technology, the more they liked the iLock. But, can you see an APE?
That's right. It could be a response bias. This is perhaps the most common APE in surveys.
The idea of this alternative plausible explanation is that patterns in how people semi-randomly answer questions cause spurious patterns to appear in surveys.
Maybe there are some people that just like to choose the first option, and some the middle, and some the rightmost options. If true, this would explain this pattern, indicating that maybe true interest in the product isn't driven by attitude to technology.
How would we test this APE?
We need to go back to the delta principle. We need to compare this result to some other result that should be wrong in the same way if our APE is true.
If response biases were strong, with people just having strong preferences for particular answers throughout the questionnaire, this should also appear in our TV viewing data.
Testing the APE
I've got a crosstab here of purchase intent by how frequently people said they watched Fargo. Take a look.
There's no clear pattern here. So, we can rule out the response bias APE and conclude that purchase intent is correlated with attitude to technology.