Driver Analysis in Displayr
Displayr's driver analysis makes it both easy and fast to perform driver analysis. This post gives an overview of the key features in Displayr designed for performing driver analysis (i.e., working out the relative importance of predictors of brand performance, customer satisfaction, and NPS). This post describes the various driver analysis methods available, stacking, options for missing data, in-built diagnostics for model checking and improvement, and how to create outputs from the driver analysis.
Choice of driver analysis method
All the widely used methods for driver analysis are available in Displayr. They are accessed via the same menu option, so you can toggle between them.
- Correlations: Insert > Regression > Driver analysis and set Output to Correlation. This method is appropriate when you are unconcerned about correlations between predictor variables.
- Jaccard coefficient/index: Insert > Regression > Driver analysis and set Output to Jaccard Coefficient (note that Jaccard Coefficient is only available when Type is set to Linear). This is similar to correlation, except it is only appropriate when both the predictor and outcome variables are binary.
- Generalized Linear Models (GLMs), such as linear regression and binary logit, and the related quasi-GLM methods (e.g., ordered logit): Insert > Regression > Linear, Binary Logit, Ordered Logit, etc. These address correlations between the predictor variables, and each of the different methods is designed for different distributions of the outcome variable (e.g., linear for numeric outcome, binary logit for two-category outcome, ordered logit for ordinal output).
- Shapley Regression: Insert > Regression > Driver analysis and set Output to Shapley Regression (note that Shapley Regression is only available when Type is set to Linear). This a regularized regression, designed for situations where linear regression results are unreliable due to high correlations between predictors.
- Johnson's relative weight: Insert > Regression > Driver analysis. Note that this appears as Output being set to Relative Importance Analysis. As with Shapley Regression, this is a regularized regression, but unlike Shapley it is applicable to all Type settings (e.g., ordered logit, binary logit).
Often driver analysis is performed using data for multiple brands at the same time. Traditionally this is addressed by creating a new data file that stacks the data from each brand on top of each other (see What is Data Stacking?). However, when performing driver analysis in Displayr, the data can be automatically stacked by:
- Checking the Stack data option.
- Selecting variable sets for Outcome and Predictors that contains multiple variables (for Predictors these need to be set as Binary - Grid or Number - Grid).
By default, all the driver analysis methods exclude all cases with missing data from their analysis (this occurs after any stacking has been performed). However, there are two additional Missing data options that can be relevant:
- If using Correlation, Jaccard Coefficient, or Linear Regression, you can select Use partial data (pairwise correlations), in which case the data is analyzed using all the available data. Even when not all the predictors have data, the partial information is used for each case.
- If using Shapley Regression, Johnson's Relative Weights (Relative Importance Analysis) or any of the GLMs and quasi-GLMs, Multiple imputation can be used. This is generally the best method for dealing with missing data, except for situations the Dummy variable adjustment is appropriate.
- If using Shapley Regression, Johnson's Relative Weights (Relative Importance Analysis) or any of the GLMs and quasi-GLMs, Dummy variable adjustment can be used. This method is appropriate when the data is missing because it cannot exist. For example, if the predictors are ratings of satisfaction with a bank's call centers, branches, and web site, if data is missing for people that have not attended any of these, then this setting is appropriate. By contrast, if the data is missing because the person didn't feel like providing an answer, multiple imputation is preferable.
Diagnostics for model checking and improvement
A key feature of Displayr's driver analysis is that it contains many tools for automatically checking the data to see if there are problems, including VIFs and G-VIFs if there are highly correlated predictors, a test of heteroscedasticity, tests for outliers, and checks that the Type setting has been chose correctly. Where Displayr identifies an issue that is serious it will show an error and provide no warnings. In other situations it will show a warning (in orange) and provide suggestions for resolving the issue.
One particular diagnostic that sometimes stumps new users is that by default Displayr sometimes shows negative importance scores for Shapley Regression and Johnson's Relative Weights. As both methods are defined under the assumption that importance scores must be positive, the appearance of negative scores can cause some confusion. What's going on is that Displayr also performs a traditional multiple regression and shows the signs from this on the relative importance outputs as a warning for the user that the assumption of positive importance may not be correct. This can be turned off by checking Absolute importance scores.
Standard output output from all but the GLMs is a table like the one below. The second column of numbers shows the selected importance metric, and the first column shows this scaled to be out of 100.
A key aspect of how driver analysis works in Displayr is that it can be hooked up directly to a scatterplot, thereby creating a quad map. See Creating Quad Maps in Displayr.
Crosstabs of importance scores
All the driver analysis methods have an option called Crosstab interaction, where a categorical variable can be selected, and the result is a crosstab that shows the importance scores by each unique value of the categorical variable, with bold showing significant differences and color-coding showing relativities.
Accessing the importance scores by code
The importance scores can also be accessed by code. For example, the raw importance scores are accessed using model.1$importance$raw.importance, contains the raw importance scores, where model.1 is the name of the main driver analysis output.
This can then be used in other reporting. For example, when inserted via Insert > R Output, table.Q14.7[order(model.1$importance$raw.importance, decreasing = TRUE), ] sorts a table called table.Q14.7 by the importance scores, and paste(names(sort(model.1$importance$raw.importance, decreasing = TRUE)), collapse = "\n") creates a textbox containing the attributes sorted from most to least important.