18 June 2017
The Problem with Using Multiple Linear Regression for Key Driver Analysis: a Case Study of the Cola Market
A key driver analysis investigates the relative importance of predictors against an outcome variable, such as brand preference. Many techniques have been developed for key driver analysis, to name but a few: Preference Regression, Shapley Regression, Relative Weights, and Jaccard Correlations.
For regular day-to-day key driver analysis, the best of these methods seems to be Johnson’s Relative Weights technique, yet the standard technique taught in introductory statistics classes is Multiple Linear Regression. In this post, I compare Johnson’s Relative Weights to Multiple Linear Regression, and I use a case study to illustrate why this introductory technique is best left in introductory classes.
Key driver analysis of the cola market
The data set I am using for this case study comes from a survey of the cola market. The brands considered are Coca-Cola, Diet Coke, Coke Zero, Pepsi, Pepsi Lite, and Pepsi Max. There were 327 respondents in the study. The 34 predictor variables contain information about the brand perceptions held by the consumers in the sample. I consider the relationship between these perceptions and how much the respondents like the brands (Hate … Love). The data has been stacked, and there are 1,893 cases with complete data for the analysis.
The labeled scatterplot below shows the coefficients from the Multiple Linear Regression on the x-axis versus the relative importance scores computed using Johnson’s Relative Weights on the y-axis. While the results are correlated, they are by no means strongly correlated. Remember that we are plotting the same data with the same basic type of analysis (i.e., predicting an outcome as a weighted sum of predictors).
The most interesting contrast is for perception of Unconventional. The traditional regression shows it to be the third most important variable. However, the Relative Weights method suggests it is the 14th most important of the variables. That is a staggeringly big difference in interpretation.
Which estimate is better?
The Relative Weights estimates are the better of the two. This can be seen by inspecting a few additional analyses.
The first analysis to check predicts brand preference using only Unconventional as the predictor. This model has an R2 of .009. By contrast, the model using only Reliable as a predictor has an R2 of .1883. This simple, easy-to-understand analysis suggests that Reliable is about 20 times as important as Unconventional, which is a lot more consistent with the conclusion from the Relative Weights than with the conclusion from the Multiple Linear Regression.
When a model is estimated using both Unconventional and Reliable as predictors, its R2 is .1903. Thus, adding Unconventional to the model that previously only predicted using Reliable increases the explanatory power by a paltry .0020. When done the other way around, adding Reliable to the model that only contains Unconventional adds .1813. Again, this suggests that Reliable is much more important than Unconventional.
The regression model with all 34 predictors has an R2 of .4008. If we remove Unconventional from this model, the R2 drops by .0071, compared to a drop of .0118 for Reliable. This suggests that Reliable is around 1.7 times as important as Unconventional.
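The arithmetic behind these three comparisons can be checked in a few lines. This is only a sketch that reuses the R2 values quoted above; the variable names are mine and are not taken from the original analysis.

```python
# R2 values exactly as reported in the text above.
r2_unconventional = 0.009   # model with Unconventional only
r2_reliable = 0.1883        # model with Reliable only
r2_both = 0.1903            # model with both predictors
drop_unconventional = 0.0071  # fall in R2 when removed from the 34-predictor model
drop_reliable = 0.0118        # fall in R2 when removed from the 34-predictor model

# Comparison 1: each predictor on its own.
ratio_alone = r2_reliable / r2_unconventional

# Comparison 2: what each predictor adds when it enters second.
gain_unconventional = r2_both - r2_reliable
gain_reliable = r2_both - r2_unconventional

# Comparison 3: what each predictor takes away when dropped from the full model.
ratio_full = drop_reliable / drop_unconventional

print(f"Reliable vs Unconventional, alone: {ratio_alone:.1f}x")          # 20.9x
print(f"Gain from adding Unconventional:   {gain_unconventional:.4f}")   # 0.0020
print(f"Gain from adding Reliable:         {gain_reliable:.4f}")         # 0.1813
print(f"Reliable vs Unconventional, full:  {ratio_full:.1f}x")           # 1.7x
```

The point of laying the numbers side by side is that the ratio depends heavily on which other predictors are in the model: roughly 21 to 1 when each predictor stands alone, but under 2 to 1 when all 34 are present.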
In theory, we could repeat this analysis for all possible models involving the 34 predictors. That is, see what impact Unconventional has with each possible combination of predictors, and repeat the analysis for Reliable. This is how Shapley Regression computes importance. But, as we have 34 predictors, this would involve computing 17,179,869,184 regressions, and I have better things to do. Fortunately, Johnson’s Relative Weights approximates the Shapley Regression scores. The estimates are that Unconventional will, on average, improve R2 by .01, whereas Reliable improves R2 by .044, suggesting that Reliable is around four times as important as Unconventional. This relativity is what is shown in the importance scores (i.e., vertical distances on the scatterplot above).
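To make the idea concrete, here is a minimal sketch of both calculations on a small synthetic data set. It is not the code used for the cola analysis: the data, the function names, and the choice of four predictors are all assumptions made for illustration, and the Relative Weights function follows the usual textbook formulation (square root of the predictor correlation matrix) rather than any particular package.

```python
from itertools import combinations
from math import factorial

import numpy as np


def r_squared(X, y, cols):
    """R2 of an ordinary least squares fit of y on the chosen columns of X."""
    cols = list(cols)
    if not cols:
        return 0.0  # an intercept-only model explains nothing
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def shapley_importance(X, y):
    """Weighted average gain in R2 from adding each predictor to every subset."""
    p = X.shape[1]
    scores = np.zeros(p)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            # Shapley weight for a subset of this size.
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                gain = r_squared(X, y, subset + (j,)) - r_squared(X, y, subset)
                scores[j] += weight * gain
    return scores


def relative_weights(X, y):
    """Relative weights via the symmetric square root of the correlation matrix."""
    R = np.corrcoef(X, rowvar=False)
    r_xy = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    eigvals, eigvecs = np.linalg.eigh(R)
    lam = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T  # lam @ lam == R
    beta = np.linalg.solve(lam, r_xy)  # coefficients on the orthogonalised predictors
    return (lam ** 2) @ (beta ** 2)


# Hypothetical data: four correlated predictors, one of which has no direct effect.
rng = np.random.default_rng(0)
n, p = 500, 4
shared = rng.normal(size=(n, 1))
X = 0.6 * shared + rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, 0.25, 0.0]) + rng.normal(size=n)

full_r2 = r_squared(X, y, range(p))
shapley = shapley_importance(X, y)
johnson = relative_weights(X, y)
models_for_34 = 2 ** 34  # number of predictor subsets with 34 predictors

print("Full-model R2:   ", round(full_r2, 4))
print("Shapley scores:  ", np.round(shapley, 4))
print("Relative weights:", np.round(johnson, 4))
print("Subsets for 34 predictors:", f"{models_for_34:,}")
```

Both sets of scores add up to the R2 of the full model, which is what makes them readable as shares of explained variance. The Shapley loop fits a regression for every subset, so its cost doubles with each extra predictor, whereas the Relative Weights calculation needs only one eigendecomposition.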
Why does the multiple linear regression get it so wrong?
If you have ever studied introductory statistics, there is a good chance you were shown a proof that multiple linear regression estimates are the best possible unbiased estimates. So, why is it getting it wrong here? The multiple linear regression result implies that Reliable is around 1.3 times as important as Unconventional. This ratio is smaller than that suggested by any of the other analyses that I have conducted, and is most similar to the analysis in which Reliable and Unconventional are each removed in turn from the model containing all of the variables. Why does the multiple linear regression get it so “wrong”?
The answer is that multiple regression makes a quite different assumption from the one implicit in my comparison. Multiple regression assumes that all the variables in the model are causally related to the outcome variable. So, its coefficient for Unconventional is the estimated effect of this attribute under the assumption that all the other 33 predictors in the model do in fact cause brand preference. The relative importance analysis instead implicitly assumes that we are not really sure which variables are true predictors, and the importance score is an estimate of the incremental effect of Unconventional across all possible models.
In the case of key driver analysis, I think it is pretty fair to say that we never really know which of the predictors are appropriate. The assumption of the Relative Weights method is much safer.
What about the whole issue of correlated predictors?
Usually, when people discuss Relative Weights and the closely related Shapley Regression, the discussion is about how these methods perform better when the predictor variables are correlated. This is because if predictor variables are correlated, the effect of a variable will inevitably change a lot depending on which other variables are included in the analysis. That is, if two highly correlated variables are both included in the analysis, their effects typically cancel out to an extent. Relative Weights and Shapley Regression essentially take the average effect across all the possible combinations of predictors. This means that they tend to be less sensitive to correlations between the predictors. With multiple regression, correlations between predictors can cause results to be unstable (i.e., to differ a lot from analysis to analysis). As the other methods essentially average across models, the instability cancels out.
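The sensitivity of a regression coefficient to the other variables in the model is easy to reproduce. The sketch below uses made-up data, not the cola survey: two predictors with a correlation of about 0.9, each with a true effect of 1 on the outcome. The names `x1`, `x2`, and `ols_coefficients` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Two predictors correlated at roughly 0.9, each with a true effect of 1.
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=n)
y = x1 + x2 + rng.normal(size=n)


def ols_coefficients(predictors, outcome):
    """Slope coefficients from an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(outcome))] + list(predictors))
    beta, *_ = np.linalg.lstsq(A, outcome, rcond=None)
    return beta[1:]  # drop the intercept


# The coefficient for x1 depends on whether x2 is in the model.
coef_alone = ols_coefficients([x1], y)[0]
coef_joint = ols_coefficients([x1, x2], y)[0]

print("x1 coefficient, x1 only:  ", round(coef_alone, 2))
print("x1 coefficient, x1 and x2:", round(coef_joint, 2))
```

By construction, the population value of the x1 coefficient is 1.9 when x2 is omitted (x1 absorbs most of x2's effect) and 1.0 when x2 is included, so the sample estimates should land near those figures. Nothing about x1 has changed between the two fits; only the company it keeps has.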
The conclusion is straightforward: if performing Key Driver Analysis, you are better off using Relative Weights or a similar method, rather than Multiple Linear Regression.
TRY IT OUT
If you want to see all the detailed results referred to in this post, or run similar analyses yourself, click here to log in to Displayr and see the document. You can see the R code by clicking on any of the results and selecting Properties > R CODE, on the right of the screen.
Author: Tim Bock
Tim Bock is the founder of Displayr. Tim is a data scientist who has consulted, published academic papers, and won awards for problems and techniques as diverse as neural networks, mixture models, data fusion, market segmentation, IPO pricing, small sample research, and data visualization. He has conducted data science projects for numerous companies, including Pfizer, Coca-Cola, ACNielsen, KFC, Weight Watchers, Unilever, and Nestle. He is also the founder of Q (www.qresearchsoftware.com), a data science product designed for survey research, which is used by all seven of the world’s largest market research consultancies. He studied econometrics, maths, and marketing, and has a University Medal and PhD from the University of New South Wales (Australia’s leading research university), where he was an adjunct member of staff for 15 years.