Webinar

DIY Driver Analysis

Whether this is your first driver analysis or you’re already a guru, this 15-minute webinar is for you.

In this webinar you will learn

Here’s a summary of the subjects we cover in this webinar.

This webinar will explain all the key steps needed for you to perform your own driver analyses. By the end of the webinar, you will know how to prepare your data, deal with outliers and multicollinearity, and how to create quad maps to show the results. Techniques covered include: generalized linear models, Shapley Regression, and Johnson’s Relative Weights.

Driver analysis is used to quantify the importance of the drivers of satisfaction, NPS, or brand preference. It’s an application of regression, where the outcome is an overall measure of the performance of one or more brands, and the predictors are measurements of the brands’ performance on specific attributes.

The webinar is presented by Dr Justin Wishart, who is in the data science team at Displayr and has built a number of the features that will be illustrated in the webinar (and he’s funny!).

Transcript

This webinar will explain how to take your data, analyze it and present the results from a driver analysis quickly and effectively. I will be illustrating the key principles and showing you how to do it in Displayr. We will be going through the stages of data preparation, choosing the right technique, dealing with missing data, model checking, and how to create a data visualization to communicate results to clients and colleagues.

Overview

I am going to walk you through two case studies. Apart from the presentation aspect, it can all be done in Q as well, and I'll explain how when we get to the demonstration.

 

Cell phone case study - data preparation

This first case study is from the US cell phone market. The outcome variable here is one that many are no doubt familiar with: the Net Promoter Score, coded numerically as you see here.

The predictors are a series of ratings of people's phone companies based on satisfaction and customer effort measures. These are recorded in the variable set called Performance.

There are three variables measuring satisfaction (network coverage, internet performance, and value for money) and six effort variables measuring how easy it is to understand their bill, change their plan, and so on.

So far, so good: we have the variables we wish to use in a driver analysis. Let's move on to deciding on the appropriate technique.

 

Choosing the right technique

This table lists the key choices we need to make when performing driver analysis. First, the type of outcome variable determines how we should model the data. If we have Net Promoter Score data, as we do here, we need to start with linear regression.

The next step is to consider how to deal with correlation between the predictors and with differences in their scales.

Traditional linear regression on its own is often not enough for driver analysis. It, along with all the other generalized linear models, can be difficult to interpret when there are high levels of correlation between the predictors.

An added complication is that the results are sensitive to the scale of the predictors. That is, if one predictor has a wider range than another, its coefficient will be smaller and it will appear less important, all else being equal. Modern driver analysis techniques, like Shapley Regression and Johnson's Relative Weights, address both of these issues.

In our case, because we are using linear regression, either Shapley Regression or Johnson's Relative Weights can be used. For other GLMs, Johnson's Relative Weights is the appropriate choice. Let's return to our case study and apply linear regression.

 

Cell phone case study - driver analysis

We run a driver analysis by using Insert > Regression > Driver Analysis. It's exactly the same in Q, except we use the Create menu.

 

In Displayr:

Insert > Regression > Driver Analysis

Outcome > Net Promoter Score

 

So, this tells us that the satisfaction drivers are important and the most important driver is value for money. However, the orange boxes here are alerting us to some potential problems, which we need to explore and reconcile.

The first issue is that we have some negative importance scores. This can occur when data is coded incorrectly. We inspected our data earlier, but only superficially, with summary tables. We need to check that our data has been coded properly for use in a driver analysis: increasing values in the outcome variable should denote higher levels of recommendation, and increasing values in the predictors should denote higher performance. We start by looking at the outcome variable.

What we want to see is that higher values indicate higher levels of recommendation. We do have this, with -100 for detractors, 0 for neutrals, and +100 for promoters, so all is good here.

We have two types of predictors. Three of the variables measure satisfaction.

These are also coded correctly. Higher levels of satisfaction, or higher performance, have higher numbers.

Now, we have some problems. First, we have a 6 assigned to missing data. We need to set this as missing.

We also have high values for low levels of performance. That is, “Very difficult” is coded with the highest value, when it should be the opposite: “Very difficult” corresponds to low performance and should have the lowest value. We need to reverse the coded values.
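Outside of Displayr, the same clean-up could be sketched in a few lines of pandas. The file and column names below are hypothetical placeholders, and the coding (1 to 5, with 6 marking missing data) is just an illustration of the recode described above:

```python
import numpy as np
import pandas as pd

# Hypothetical file and column names, for illustration only
df = pd.read_csv("cellphone_survey.csv")

# Effort items assumed coded 1 = Very easy ... 5 = Very difficult, with 6 marking missing data
effort_items = ["effort_bill", "effort_change_plan", "effort_cancel"]

for col in effort_items:
    df[col] = df[col].replace(6, np.nan)  # the 6 code becomes genuinely missing
    df[col] = 6 - df[col]                 # reverse-code: Very difficult (5) -> 1, Very easy (1) -> 5
```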

Now, we are getting a more sensible-looking model. We do still have a negative sign for one of the drivers, but it's an unimportant driver, so it likely just reflects some random noise. We can force it to be positive.

 

Outliers

We have also got a warning about unusual observations or outliers.

Outliers can be a rabbit hole that sucks you in, and they can be dealt with in different ways. A straightforward check is to see whether our results change when we remove the outliers.

What I'm going to do is duplicate the analysis, remove some outliers in one of the outputs, and see how much the results change. Here, I'm going to remove the 20% most outlying observations.

Yes, there are differences. But, the main conclusions are basically the same, so we can ignore the issue of outliers.
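Displayr automates this comparison; as a rough illustration of the idea, the sketch below fits a linear regression with statsmodels, drops the 20% of cases with the largest absolute residuals (a simple notion of "most outlying", not necessarily the exact rule Displayr uses), and refits. The file and column names are hypothetical:

```python
import pandas as pd
import statsmodels.api as sm

# The cleaned data from the steps above (file and column names are hypothetical)
df = pd.read_csv("cellphone_survey_clean.csv")
X = sm.add_constant(df[["coverage", "internet", "value_for_money"]])
y = df["nps"]

full_fit = sm.OLS(y, X).fit()

# Keep the 80% of cases with the smallest absolute residuals and refit
keep = full_fit.resid.abs() <= full_fit.resid.abs().quantile(0.80)
trimmed_fit = sm.OLS(y[keep], X[keep]).fit()

# If both sets of coefficients tell the same story, outliers aren't driving the conclusions
print(full_fit.params.round(2))
print(trimmed_fit.params.round(2))
```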

 

Non-constant variance

The last warning relates to a check for linear regression, which assumes constant variance in the residuals. The warning tells us that this assumption is not met here. Fortunately, as the warning says, we can fix this by simply checking a box. This updates the estimates of the standard errors to allow for non-constant variance.
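The usual way to do this outside Displayr is with heteroscedasticity-consistent ("robust") standard errors, which is a one-line change in statsmodels, for example. This is an illustrative sketch with hypothetical names, not Displayr's exact computation:

```python
import pandas as pd
import statsmodels.api as sm

# Same model as in the outlier sketch above (hypothetical names)
df = pd.read_csv("cellphone_survey_clean.csv")
X = sm.add_constant(df[["coverage", "internet", "value_for_money"]])
y = df["nps"]

# Heteroscedasticity-consistent standard errors: the coefficients are unchanged,
# only the standard errors and p-values are adjusted
robust_fit = sm.OLS(y, X).fit(cov_type="HC3")
print(robust_fit.summary())
```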

If you look at the footer, which is a bit small, it tells us that some cases have been excluded due to missing values. We can get Displayr and Q to handle this problem in a number of ways.

The first option is to produce an error if any data is missing, which is not ideal. The second is to exclude cases with missing data; that is the default we saw in the footer. Better, but also not ideal, as we should use all the available data if possible. That leaves the other three options, which we will explore.

 

 

Missing data treatments

There are three remaining options in this case. Which approach should we use?

Option 1, the dummy variable adjustment, is appropriate when people can't respond because the situation is not applicable to them, and hence have missing data. Option 2, using partial data, is appropriate when people are randomly given a subset of questions to answer. Option 3, imputation, should be used when it’s appropriate to estimate the missing data based on other responses. With that in mind, let’s return to the data and look at what was missing.

Let's look at the missing counts to see what was missing.

 

In Displayr:
Performance Table > Add missing counts (Cells, Missing counts).

 

We can see that the satisfaction variables had no missing data. However, the effort variables all had missing data. Consider, for example, people who have never changed plans: they have no experience to indicate how hard it was, so they are unable to respond and have missing data. Consequently, we need to go with option 1.
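Conceptually, the dummy variable adjustment replaces each missing value with a constant and adds an indicator variable flagging where the data was missing; Displayr does this for you when you pick option 1. A minimal pandas sketch, with hypothetical column names:

```python
import pandas as pd

def dummy_variable_adjustment(df, columns):
    """Replace missing values with 0 and add a 'was missing' indicator for each column."""
    out = df.copy()
    for col in columns:
        out[col + "_missing"] = out[col].isna().astype(int)  # 1 where the response was missing
        out[col] = out[col].fillna(0)  # any constant works here; the indicator absorbs it
    return out

adjusted = dummy_variable_adjustment(df, ["effort_bill", "effort_change_plan", "effort_cancel"])
```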

Notice the change in the importance scores: Cancel has become irrelevant (a score of zero) and Check usage has shrunk from 6 to 2.

 

Shapley

By default, Displayr uses a technique called Johnson's Relative Weights to perform the driver analysis. The advantage of this is that it automatically rescales the data so that you can directly compare the importance of the different predictors, and it also adjusts for correlation between the predictor variables.
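Displayr computes this for you. For the curious, here is a minimal NumPy sketch of the underlying idea, following Johnson's (2000) relative weights method, assuming complete numeric data; it is an illustration rather than Displayr's exact implementation:

```python
import numpy as np

def johnsons_relative_weights(X, y):
    """Rough sketch of Johnson's Relative Weights for an (n, p) predictor matrix X and outcome y.

    Returns one importance score per predictor; the scores sum to the model's R-squared.
    """
    # Work on the correlation scale so predictor ranges don't matter
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    n = len(yz)

    Rxx = np.corrcoef(Xz, rowvar=False)           # predictor correlations
    rxy = Xz.T @ yz / n                           # predictor-outcome correlations

    # Orthogonal approximation of the predictors via an eigendecomposition of Rxx
    vals, vecs = np.linalg.eigh(Rxx)
    lam = vecs @ np.diag(np.sqrt(vals)) @ vecs.T  # matrix square root of Rxx
    beta = np.linalg.solve(lam, rxy)              # coefficients on the orthogonal scores

    return (lam ** 2) @ (beta ** 2)               # raw relative weights (sum to R-squared)
```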

Many market researchers are more familiar with Shapley Regression for solving these problems.

If you want to do Shapley, all you need to do is change Output to Shapley Regression. When I do this, you'll notice that virtually nothing changes, as the techniques are almost identical.

There's essentially no difference.
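Shapley Regression defines each predictor's importance as its average contribution to R-squared across all possible orderings of the predictors. A brute-force sketch of that idea is below; it is fine for a handful of predictors, since the number of subsets grows exponentially, and again it illustrates the concept rather than reproducing Displayr's implementation:

```python
from itertools import combinations
from math import comb

import numpy as np

def shapley_importance(X, y):
    """Average marginal contribution of each predictor to R-squared, over all predictor subsets."""
    n, p = X.shape

    def r_squared(cols):
        if not cols:
            return 0.0
        Z = np.column_stack([np.ones(n), X[:, list(cols)]])
        resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
        return 1 - resid.var() / y.var()

    importance = np.zeros(p)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            weight = 1 / (p * comb(p - 1, size))  # Shapley weight for subsets of this size
            for subset in combinations(others, size):
                importance[j] += weight * (r_squared(subset + (j,)) - r_squared(subset))
    return importance                              # importances sum to the full-model R-squared
```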

 

Cell phone case study - quad map/performance-importance analysis

Let's create a plot showing the importance of the different predictors versus performance. This is sometimes called a Quad Map.

 

In Displayr:

Insert > Visualization > Scatterplot

Chart: Appearance > Show labels > One Chart
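Displayr's scatterplot builds this directly. In general terms, a quad map is just importance plotted against average performance, with reference lines splitting the chart into four quadrants. The matplotlib sketch below uses made-up, purely illustrative driver names and numbers, not the case study results:

```python
import matplotlib.pyplot as plt

# Hypothetical importance scores and mean performance ratings, for illustration only
drivers = ["Coverage", "Internet", "Value", "Bill", "Change plan", "Cancel"]
importance = [10, 12, 25, 6, 4, 2]
performance = [3.8, 3.5, 3.1, 3.9, 3.4, 3.6]

fig, ax = plt.subplots()
ax.scatter(performance, importance)
for label, x, y in zip(drivers, performance, importance):
    ax.annotate(label, (x, y))

# Reference lines at the means split the chart into four quadrants
ax.axvline(sum(performance) / len(performance), linestyle="--")
ax.axhline(sum(importance) / len(importance), linestyle="--")
ax.set_xlabel("Performance")
ax.set_ylabel("Importance")
plt.show()
```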

 

Cola case study

Now we will do a second case study, this time from the cola market. We will start by looking at the data.

Here we have ratings of six different cola brands, on a 5-point ordered scale from hate to love. There is a “Don't know” option that we need to remove. Other than that, the values are in the correct order, with increasing values indicating higher levels of preference, which is what we need.

Let's look at the predictors now. This is a big grid question, with attributes for each brand.

We could do an individual driver analysis for each brand separately, or we could stack the data to combine everything into a single model that represents the entire cola market against the attribute predictors. I'll demonstrate how to easily stack the data and do a single driver analysis across the whole market.
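Displayr's Stack data option handles this automatically. Conceptually, stacking reshapes one row per respondent into one row per respondent-brand pair, roughly like this pandas sketch (the file, stub, and column names are hypothetical):

```python
import pandas as pd

# Wide data: one row per respondent, with columns like preference_Coke, preference_Pepsi,
# fun_Coke, fun_Pepsi, healthy_Coke, ... (file and column names are hypothetical)
wide = pd.read_csv("cola_survey.csv")

long = pd.wide_to_long(
    wide,
    stubnames=["preference", "fun", "healthy"],  # outcome and attribute stubs
    i="respondent_id",
    j="brand",
    sep="_",
    suffix=r"\w+",
).reset_index()

# 'long' now has one row per respondent-brand pair, ready for a single driver analysis
```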

So, we do a driver analysis like before.

 

In Displayr:

Insert > Regression > Driver Analysis

 

As I've got data from multiple brands, I will choose the Stack data option to combine the data.

As in the earlier case study, I get lots of prompts about what to do in the orange box.

The first is telling us that there was a “None of these” category in the predictor data, but not in the outcome variables, and this has been excluded. That's fine. The next two warnings are telling us that our data is not numeric; remember, we have five ordered categories.

So, we should be using an Ordered logit regression. Displayr is also telling us to use ordered logit.
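Displayr fits this for you when you switch the regression type. For reference, an ordered logit for a 5-point outcome can also be fitted with statsmodels, for example; this is a sketch with hypothetical file and column names:

```python
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Stacked data: one row per respondent-brand pair (hypothetical file and column names)
long = pd.read_csv("cola_stacked.csv")

y = long["preference"]                        # 1 = hate ... 5 = love, "Don't know" already removed
X = long[["fun", "healthy", "traditional"]]   # brand attribute predictors

ordered_fit = OrderedModel(y, X, distr="logit").fit(method="bfgs")
print(ordered_fit.summary())
```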

We have the same issue as before, with coefficients near zero coming out negative. Again, we can choose the absolute importance scores.

This time we have no warnings about outliers or anything else, so we are done.

 

Overview

Using the two case studies, we have seen how quickly we can do a driver analysis, while still being rigorous. We have looked at data preparation, including stacking.

We looked at how to determine the right model, choosing linear regression for the first case study and ordered logit for the second.

We have dealt with scale and correlation between predictor variables using Shapley Regression and Johnson's Relative Weights.

We looked at how to deal with missing data, using the dummy variable adjustment method in the first case study.

We checked the models, dealing with outliers and heteroscedasticity. And, we visualized the results.

Want to learn more about how Displayr can help with your projects? Book a personalized demo with us.
