Learn how to do a state-of-the-art driver analysis in 30 minutes.
Discover how to uncover which attributes in your data have the biggest impact on behavior.
I'm going to explain how to do driver analysis, which is also known as key driver analysis, importance analysis, and relative importance analysis.
This is one of the more technical topics in research. But, once you get past the jargon - and there is a lot - it's actually quite straightforward.
During the webinar, I will be illustrating the key principles, and showing you how to do it in Displayr.
10 Steps of Driver Analysis
There are 10 steps to perform, starting with checking that driver analysis is appropriate.
Perform driver analysis when
We perform driver analysis when we have a survey where we have measured overall performance, and we have measured dimensions or components of that overall performance.
This example is from a Hilton customer satisfaction survey.
The overall performance measure, known as the outcome variable, is Overall service delivery.
And the predictors are these five drivers, or aspects, of overall performance.
And we perform driver analysis to find out the relative importance of the drivers. That is, the predictors. We are wanting to know what is most important, so the end clients can work out where to focus their attentions.
The two main applications of driver analysis are when we are predicting service performance, typically via satisfaction or NPS, and when we are understanding how brand imagery relates to brand preference.
A lot of people confuse driver analysis with predictive modelling.
They share a lot of math, but they are distinct.
When not to use driver analysis
If we want to make predictions, such as sales, or wish to use data relating to behavior and demographics, then we are instead needing a predictive technique, like linear regression, deep learning, or a random forest.
Step 2: Make numeric
We will do a case study of the US cellular phone market.
Our outcome measure is the net promoter score.
In this study, our predictors were collected in two questions.
The first asked satisfaction with network coverage, internet speed, and value for money.
The second set of predictors measures the perceived effort it takes customers to achieve outcomes.
We want to see how these things predict NPS. Is value for money the strongest predictor? Internet speed? Or the effort to check your internet usage?
Standard driver analysis assumes that the predictor variables are numeric.
Now this step is very software dependent. In Displayr, variables have a setting called structure, which determines how they are to be analyzed. In Q, this is known as Question Type.
If I click on the variable set for Satisfaction, the structure is showing that these variables contain multiple nominal variables. So, I need to change this to numeric.
Change structure for satisfaction
Note that Displayr changed the table to show an average.
Change for customer effort.
Step 3 is to ensure that higher values are assigned to higher performance levels.
The original data for NPS asks people to rate how likely they are to recommend their cell phone company. When we compute NPS, this is equivalent to assigning a score of -100 to people that gave a rating of 6 or less,
0 to people that gave a rating of 7 or 8, and 100 to people that gave a rating of 9 or 10.
So, we have higher values for the better performance level, as we need.
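As a rough sketch outside of Displayr, the NPS recoding just described could look like this in Python. The function name and the example ratings are made up for illustration.

```python
def nps_score(rating):
    """Recode a 0-10 likelihood-to-recommend rating into an NPS value."""
    if rating <= 6:
        return -100  # detractor: rating of 6 or less
    if rating <= 8:
        return 0     # passive: rating of 7 or 8
    return 100       # promoter: rating of 9 or 10


ratings = [10, 9, 7, 3, 8, 6]  # invented example ratings
scores = [nps_score(r) for r in ratings]

# The NPS for the sample is simply the average of the recoded values.
nps = sum(scores) / len(scores)
print(scores, nps)
```

Because the recoded values run from -100 to 100, a higher number always means better performance, which is what the driver analysis needs.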
As you can see, if people said they were very dissatisfied they have a 1, and higher values are associated with higher satisfaction, so again everything is correct. Nothing to do.
Ah. We have a problem. Can you see it?
So, the worst level of performance, Very difficult, has a value of 5.
And the best has a value of 1, so this is the opposite of what we want.
And, we've got some don't knows as well. That is not what we want, so we will have to fix it.
So, to use the jargon, we need to reverse code these values, so that we have higher values for higher levels of performance.
Replace 5 to 1
Don't know is a bit trickier.
What does a don't know mean? Think about Cancel your subscription. If you have said don't know, it's probably because you never tried. This is important. We will return to it.
What value should we assign to don't know?
Is it a 2 maybe, a 3 or a 4?
It's a trick question. It does not belong, so we will set it as missing data. We will return to this issue.
Don't know: Exclude from analyses
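Here is a minimal Python sketch of the same two fixes: reverse coding a 1 to 5 scale and treating Don't know as missing. The code 6 for Don't know is an assumption for illustration, as are the function name and the data.

```python
DONT_KNOW = 6  # assumed code for "Don't know"; check your own data file


def reverse_code(value, low=1, high=5, missing_codes=(DONT_KNOW,)):
    """Flip a low..high scale so that higher means better; Don't know becomes missing."""
    if value is None or value in missing_codes:
        return None  # missing data, excluded from analyses
    return low + high - value  # 5 becomes 1, 4 becomes 2, and so on


raw = [5, 1, 3, 6, 2]  # invented effort ratings, where 5 is Very difficult
recoded = [reverse_code(v) for v in raw]
print(recoded)
```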
Step 4: Stack or use
Step 4 introduces some new jargon. Stacking.
What the data needs to look like
For driver analysis, we need a single outcome variable, and then one variable for each of our predictors.
In our cell phone study, we have only asked people to rate the performance of their current phone company. So, our data is in order.
If you have repeated...
Sometimes studies have repeated measures. In the example shown on the screen, we asked people to rate performance of different brands, so we have three outcome variables, 3 variables for the first predictor and so on.
This repeated measures data needs to be stacked.
We will return to this step in a second case study later in the webinar.
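To make the idea of stacking concrete, here is a small Python sketch. It assumes a wide layout where each column is named measure_brand; the column names, brands, and values are all invented.

```python
# One row per respondent, with one column per measure per brand (wide data).
wide = [
    {"id": 1, "pref_A": 5, "pref_B": 2, "fun_A": 1, "fun_B": 0},
    {"id": 2, "pref_A": 3, "pref_B": 4, "fun_A": 0, "fun_B": 1},
]


def stack(rows, brands, measures):
    """Turn wide repeated-measures rows into one row per respondent per brand."""
    stacked = []
    for row in rows:
        for brand in brands:
            record = {"id": row["id"], "brand": brand}
            for measure in measures:
                record[measure] = row[f"{measure}_{brand}"]
            stacked.append(record)
    return stacked


stacked = stack(wide, brands=["A", "B"], measures=["pref", "fun"])
for record in stacked:
    print(record)
```

After stacking there is a single outcome variable (pref) and a single variable per predictor (fun), with the rows for each brand placed on top of one another.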
Step 5: Regression type
OK, this sounds scary, doesn't it? Once you get past the fearsome jargon, it's very simple.
Choose the right regression type
We need to look at our outcome variable and work out which of these descriptions is correct.
Reminder: our outcome variable measures NPS.
Which of these is it?
We need to start by doing linear regression, which I'm sure you've heard of before.
As an aside, if we wanted to predict the 11 categories used to create the NPS, we would be instead using an Ordered Logit model.
And also use Shapley...
After we choose the regression type, we need to make a second choice. This one is even easier.
If we have a regression type of Linear and we don't have too many predictors, we want to choose Shapley regression.
Otherwise, we use Johnson's relative weights.
They give nearly identical results for linear data, so the choice doesn't really matter. But, Shapley is the one that is best known.
Why do we do this? I'll give you a chance to read. But, it's very technical!
Let me emphasize the last line. There are a whole lot of techniques with different names that do the same thing.
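For the curious, here is a minimal Python sketch of the textbook idea behind Shapley regression: fit a linear regression for every subset of predictors, then credit each predictor with its weighted average gain in R-squared. This is an illustration under assumptions, not Displayr's implementation, and the data and function names are invented.

```python
from itertools import combinations
from math import factorial


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(xs, y, subset):
    """R-squared of a linear regression of y on the predictors in subset."""
    if not subset:
        return 0.0
    n = len(y)
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    cols = []
    for j in subset:  # centering the columns takes care of the intercept
        mean = sum(xs[j]) / n
        cols.append([v - mean for v in xs[j]])
    xtx = [[sum(p * q for p, q in zip(a, b)) for b in cols] for a in cols]
    xty = [sum(p * q for p, q in zip(a, yc)) for a in cols]
    beta = solve(xtx, xty)
    explained = sum(b * c for b, c in zip(beta, xty))
    return explained / sum(v * v for v in yc)


def shapley_importance(xs, y):
    """Average gain in R-squared from adding each predictor, over all subsets."""
    k = len(xs)
    r2 = {s: r_squared(xs, y, s)
          for size in range(k + 1) for s in combinations(range(k), size)}
    scores = []
    for j in range(k):
        others = [i for i in range(k) if i != j]
        total = 0.0
        for size in range(k):
            weight = factorial(size) * factorial(k - size - 1) / factorial(k)
            for s in combinations(others, size):
                with_j = tuple(sorted(s + (j,)))
                total += weight * (r2[with_j] - r2[s])
        scores.append(total)
    return scores


# Invented data: three predictors (one list per predictor) and an outcome.
xs = [[1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5], [1, 0, 0, 1, 1, 0]]
y = [2, 3, 5, 4, 8, 7]
scores = shapley_importance(xs, y)
print(scores, sum(scores))
```

The scores add up to the R-squared of the model with all the predictors. The number of subsets doubles with each extra predictor, which is why this approach only suits cases where we don't have too many predictors.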
OK, let's do it. We need to do a driver analysis. Where's that in Displayr? Let's search for it.
By default it has linear as the Regression type, but this is where we change it if we need to.
Click on regression type
And we will set it to Shapley.
Output: Shapley Regression
We drag across our outcome variable.
OK, so we have actually just done a driver analysis.
It tells us that the most important driver is network coverage.
Followed by Help from customer or technical support.
Value for money is driver number 3.
Step 7: Choose your missing data strategy
Pay close attention here. This is the area where I commonly see the most experienced analysts get it completely wrong.
Remember, with our customer effort data we had don't knows, and we set them to missing, so we do have missing data.
We have a missing data problem we need to address.
Missing data treatments
There are a number of different ways that we can treat missing data.
I will give you a chance to read this, but it is a bit hard going.
The top 3 are the good options. The bottom 3 are rarely smart options.
Missing values by case
Let's try and understand our missing data.
Search: Missing data
Anything > Data > Missing data > Plot by case
Let's drag across our outcome and predictors
This is a heatmap, where we show blue for missing values.
Note though that there are clearly differences by variable. We've only got missing data for our customer effort scores. This second last variable has much more blue. Why is that?
If you squint, it's the predictor relating to the ease of cancelling your subscription.
Ah, that's interesting.
Think about it for a moment. That makes sense. Of the predictors, it's the one the fewest people will have experienced.
In our case, option 1 is clearly the correct option. People have missing data because they have no experience with cancellation and the other effort predictors. We need to use Dummy variable adjustment.
Now, remember, value for money is our third most important driver.
Let's set the missing data
Missing data: Dummy variable adjustment
Now value for money is our second most important driver. So, the missing data setting does make a difference.
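Here is a rough Python sketch of one common form of dummy variable adjustment: fill each missing value with the mean of the observed values, and add a 0/1 variable that flags where the value was missing. Whether Displayr does exactly this internally is an assumption, and the data and function name are invented.

```python
def dummy_variable_adjust(values):
    """Return (filled values, missing flags) for one predictor."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)  # mean of the observed values
    filled = [fill if v is None else v for v in values]
    is_missing = [1 if v is None else 0 for v in values]
    return filled, is_missing


ease_of_cancelling = [4, None, 2, None, 3]  # invented; None means Don't know
filled, flag = dummy_variable_adjust(ease_of_cancelling)
print(filled)
print(flag)
```

Both the filled predictor and its flag go into the regression, so people with no experience of cancelling stay in the analysis without being treated as if they had given a real rating.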
Displayr's got an expert system that reviews models and checks they are good.
Warnings are shown in the top right and also on the page to the left.
I will give you a chance to read this first warning
An assumption of driver analysis is that things have a positive effect. Displayr's giving us a warning that one of our predictors has a negative effect! That's potentially a problem, as it may suggest a data integrity problem.
Look at importance scores.
Ah, and look, it's cancel your subscription plan yet again. Think about this for a second, what this is telling us is that if we make it easier to cancel your subscription, then people are going to be less happy. That's what the negative sign means.
If you think for a moment, our issue here is really logic. The only way that most people know cancellation is if they don't like the company, so you can think about this as being an outcome of how people feel about their phone company. Not an appropriate predictor.
So, we will remove it.
The next warning is telling us that we may have some outliers that are stuffing up things.
We have two ways we can address this.
One is we can examine diagnostic plots,
Extension button: Cooks Distance versus leverage.
and inspect all the outlying observations
If you want to, all the key diagnostic plots are in Displayr and Q. But, it means reading through all the raw data and trying to figure out what makes this data weird. It's what the textbooks say you should do, but it rarely helps with survey data.
I'm going to do something a lot easier.
First, I will create a second version of the model, and automatically remove some of the outliers.
We will automatically delete the 10% of most outlying observations.
What we want to see is that the broad conclusions remain the same.
And they are!
The bar lengths are similar when we compare. They are not identical though.
Internet speed is the number four driver in the model on the left, and number 3 on the model on the right.
And, we still have problems with outliers.
What do we do?
We just keep this model on the right and keep in mind that there is noise in our data. We have learned that we need to focus on the broad themes, rather than getting stuck on small differences.
This is good advice with all research anyway.
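The automatic removal of outliers can be sketched in Python as follows. This assumes that "most outlying" means the largest absolute residuals from the first model, which may not be exactly how Displayr ranks them; the residuals and the function name are invented.

```python
def keep_after_trimming(residuals, fraction=0.10):
    """Return the row indices to keep after dropping the most outlying rows."""
    n_drop = int(len(residuals) * fraction)
    # Rank rows from the largest absolute residual to the smallest.
    ranked = sorted(range(len(residuals)),
                    key=lambda i: abs(residuals[i]), reverse=True)
    dropped = set(ranked[:n_drop])
    return [i for i in range(len(residuals)) if i not in dropped]


# Invented residuals from a first model fitted to 20 respondents.
residuals = [0.2, -0.1, 3.5, 0.4, -0.3, 0.1, -4.0, 0.0, 0.3, -0.2,
             0.5, 0.1, -0.6, 0.2, 0.0, -0.1, 0.3, 0.2, -0.4, 0.1]
kept = keep_after_trimming(residuals)
print(len(kept), kept)
```

The model is then refitted on the kept rows only and compared with the original, looking for the same broad conclusions.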
Now let us read the last warning
It sounds complicated, but we will follow the advice and the warning will disappear.
The next thing is to review statistical significance. Everything is highly significant, so all good here!
Step 10: Visualize
So, now let's make it look pretty.
Visualize > Bar Chart
I will hook this up to our driver analysis.
Personally, though, I actually prefer a donut chart. Can't really tell you why. I just do.
Chart type: Donut
Output: Driver analysis
The other classic visualization for a driver analysis is a quad map, showing performance by importance.
To make this easier, I am going to combine the performance and importance data.
And, for a bit of fun, I'm going to filter this by main brand of phone company.
So, now this is filtered for AT&T.
But, I can easily change it.
Now I just need a scatter plot
Visualization > Scatterplot
Let's hook it up to the data
X coordinates: model
Chart > APPEARANCE > Show labels >P chart
DATA LABELS: Font size: 10
Resize to full width
Chart > x axis > Axis title: Importance
Chart > Y axis > Axis title: Performance
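The logic of the quad map can also be sketched in a few lines of Python. The importance and performance numbers below are invented, the quadrant labels are my own wording, and splitting the quadrants at the averages is an assumption.

```python
# Invented importance scores (from a driver analysis) and performance averages.
importance = {"Network coverage": 40, "Value for money": 30,
              "Internet speed": 20, "Check usage": 10}
performance = {"Network coverage": 4.2, "Value for money": 2.8,
               "Internet speed": 3.9, "Check usage": 3.1}


def quadrant(imp, perf, imp_cut, perf_cut):
    """Label a driver by which side of the two cut lines it falls on."""
    side = "important" if imp >= imp_cut else "less important"
    level = "strong" if perf >= perf_cut else "weak"
    return f"{side}, {level}"


# Split the map at the average importance and the average performance.
imp_cut = sum(importance.values()) / len(importance)
perf_cut = sum(performance.values()) / len(performance)

quad_map = {driver: quadrant(importance[driver], performance[driver],
                             imp_cut, perf_cut)
            for driver in importance}
for driver, label in quad_map.items():
    print(driver, "->", label)
```

Drivers that land in "important, weak" are the ones a brand should concentrate on first.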
OK, so we can see that AT&T does great on the all important network coverage, but is terrible on value for money and help services, so that's where it needs to concentrate. And, because it's interactive, it's easy to see what the other brands need to do.
Change to Boost
For example, Boost needs to solve network coverage.
Case study 2
OK, let's do a second case study. This one is about the cola market. My outcome variable is brand preference.
My predictor is Q5, which measures brand personality. This is a standard application of driver analysis.
Step 2 is to make predictors numeric or binary. Let's look at them.
Click on Q5
As you can see, the structure is binary, so nothing to do.
Step 3 is to assign higher values to better performance levels.
A 1 for yes and a 0 for no, so we have better levels for our predictors. What about our outcome?
Ah, we have a problem. Don't know again.
Press Values
The values are OK for everything except Don't know.
Set Don't know to Exclude
Now, let's have a look at the raw data.
Select Q6 and then Q5 > View in data editor
We've got one variable for Coke as an outcome.
Another for Diet Coke.
So, we have repeated measures. We need to stack the data.
Case study 2
Anything > Advanced > Regression > Driver
I will choose the Stack data option.
Great. So, we've built our model.
Let's read the warnings
This first one is telling us something about the data. It's not a problem.
The second one is telling us that we may have the wrong type. It's recommending Ordered Logit. The table I showed you before said the same thing.
We've got negative signs again. Note that it's for Feminine, but the predictor is not significant, so we will just force it to be positive.
Great. We are done!