16 June 2017
Using Partial Least Squares to Conduct Relative Importance Analysis in Displayr
Partial Least Squares (PLS) is a popular method for relative importance analysis in fields where the data typically includes more predictors than observations. Relative importance analysis is a general term applied to any technique used for estimating the importance of predictor variables in a regression model. The output is a set of scores which enable the predictor variables to be ranked based upon how strongly each influences the outcome variable.
There are a number of different approaches to relative importance analysis, including Relative Weights and Shapley Regression, as described here and here. In this blog post I briefly describe an alternative method, Partial Least Squares. Because it effectively compresses the data before regression, PLS is particularly useful when the number of predictor variables exceeds the number of observations.
Partial Least Squares
PLS is a dimension reduction technique with some similarity to principal component analysis. The predictor variables are mapped to a smaller set of variables, and within that smaller space we perform a regression against the outcome variable. In contrast to principal component analysis, where the dimension reduction ignores the outcome variable, the PLS procedure aims to choose new mapped variables that maximally explain the outcome variable.
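To make the contrast with principal component analysis concrete, here is a minimal sketch on simulated data. It is not the brand data used in this post: the variables n, p, X, y and sim are made up for illustration. It fits a principal component regression and a PLS regression with the same number of components to a data set with more predictors than observations, then compares how much of the outcome each explains on the training data.

```r
library(pls)

set.seed(1)
n = 20   # observations
p = 50   # predictors (more than observations)
X = matrix(rnorm(n * p), nrow = n)
y = X[, 1] - X[, 2] + rnorm(n, sd = 0.1)   # outcome driven by two predictors
sim = data.frame(y = y, X = I(X))

# Both methods regress y on a handful of derived components
# rather than on all 50 predictors directly.
pcr.fit = pcr(y ~ X, data = sim, ncomp = 3)    # components ignore y
pls.fit = plsr(y ~ X, data = sim, ncomp = 3)   # components are chosen using y

# Percentage of variance in y explained on the training data
# by models with 1, 2 and 3 components
pcr.r2 = 100 * drop(R2(pcr.fit, estimate = "train", intercept = FALSE)$val)
pls.r2 = 100 * drop(R2(pls.fit, estimate = "train", intercept = FALSE)$val)
print(round(rbind(PCR = pcr.r2, PLS = pls.r2), 1))
```

Because the PLS components are built with the outcome in view, the PLS row should be at least as high as the PCR row for each number of components. This is a training-data comparison only; it says nothing about which model predicts new data better.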
Loading the example data
First I’ll add some data with Insert > Data Set > URL and paste in this link:
Dragging Brand preference onto the page from the Data tree on the left produces a table showing the breakdown of respondents by category. This includes a Don’t Know category that doesn’t fit in the ordered scale from Love to Hate. To remove Don’t Know, I click on Brand preference in the Data tree on the left and then click on Value Attributes. Changing Missing Values for the Don’t Know category to Exclude from analyses produces the table below.
Creating the PLS model
Partial least squares is easy to run with a few lines of code. Select Insert > R Output and enter the following snippet of code into the R CODE box:
dat = data.frame(pref, Q5r0, Q5r1, Q5r2, Q5r3, Q5r4, Q5r5, Q5r6, Q5r7, Q5r8,
                 Q5r9, Q5r10, Q5r11, Q5r12, Q5r13, Q5r14, Q5r15, Q5r16, Q5r17,
                 Q5r18, Q5r19, Q5r20, Q5r21, Q5r22, Q5r23, Q5r24, Q5r25, Q5r26,
                 Q5r27, Q5r29, Q5r28, Q5r30, Q5r31, Q5r32, Q5r33)
library(pls)
library(flipFormat)
library(flipTransformations)
dat = AsNumeric(ProcessQVariables(dat), binary = FALSE, remove.first = FALSE)
pls.model = plsr(pref ~ ., data = dat, validation = "CV")
The first line selects pref as the outcome variable (strength of preference for a brand) and then adds 34 predictor variables, each indicating whether the respondent perceives the brand to have a particular characteristic. These variables can be dragged across from the Data tree on the left.
Next, the three libraries containing useful functions are loaded. The pls package contains the function that estimates the PLS model, and our own publicly available packages, flipFormat and flipTransformations, provide functions that help transform and tidy the data. Since the R pls package requires numeric inputs, I convert the variables from categorical to numeric.
In the final line above, the plsr function does the work and creates pls.model. The argument validation = "CV" asks plsr to cross-validate the model, which is what makes it possible to choose the number of dimensions in the next step.
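For readers without the flip packages, the idea of converting categorical variables to numbers and then fitting the model can be sketched in base R plus pls. This is only a rough analogue under stated assumptions: the data are simulated, the names lvls, sim, attr1 to attr3 and fit are hypothetical, and as.numeric simply uses each factor's integer level codes, which may not match what AsNumeric does in every case.

```r
library(pls)

set.seed(7)
n = 60
lvls = c("Hate", "Dislike", "Neutral", "Like", "Love")   # hypothetical scale
sim = data.frame(
    pref  = factor(sample(lvls, n, replace = TRUE), levels = lvls),
    attr1 = factor(sample(c("No", "Yes"), n, replace = TRUE)),
    attr2 = factor(sample(c("No", "Yes"), n, replace = TRUE)),
    attr3 = factor(sample(c("No", "Yes"), n, replace = TRUE))
)

# Replace each factor with its integer level codes (1, 2, ...)
sim.num = as.data.frame(lapply(sim, as.numeric))

# Same call shape as in the post: all other columns predict pref
fit = plsr(pref ~ ., data = sim.num, validation = "CV")
summary(fit)
```

Since the simulated outcome is unrelated to the predictors, the fit itself is uninteresting; the point is only the shape of the data going into plsr.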
Automatically Selecting the Dimensions
The following few lines find the optimal number of dimensions and then re-estimate the model with that number:
# Find the number of dimensions with lowest cross validation error
cv = RMSEP(pls.model)
best.dims = which.min(cv$val[estimate = "adjCV", , ]) - 1
# Rerun the model
pls.model = plsr(pref ~ ., data = dat, ncomp = best.dims)
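The reason for subtracting 1 is easier to see on a small example. The sketch below uses simulated data (the names X, y, sim, fit and adj.cv are made up for illustration) and prints the cross-validated errors that which.min searches over.

```r
library(pls)

set.seed(42)
n = 100
X = matrix(rnorm(n * 10), nrow = n)
y = 2 * X[, 1] - X[, 2] + rnorm(n)
sim = data.frame(y = y, X = I(X))

fit = plsr(y ~ X, data = sim, validation = "CV")
cv = RMSEP(fit)

# cv$val is indexed by estimate ("CV", "adjCV"), response, and model size.
# The first model is intercept-only (0 components), so position k in the
# vector corresponds to k - 1 components, hence the "- 1" below.
adj.cv = cv$val["adjCV", 1, ]
print(round(adj.cv, 3))
best.dims = which.min(adj.cv) - 1
print(best.dims)
```

If the intercept-only model ever had the lowest error, best.dims would be 0, meaning no component improves on predicting the mean; that case would need separate handling before rerunning plsr.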
Producing the Output
Finally, we extract the useful information and format the output:
coefficients = coef(pls.model)
sum.coef = sum(sapply(coefficients, abs))
coefficients = coefficients * 100 / sum.coef
names(coefficients) = TidyLabels(Labels(dat)[-1])
coefficients = sort(coefficients, decreasing = TRUE)
The regression coefficients are normalized so that their absolute values sum to 100. The labels are then added and the result is sorted in decreasing order.
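As a small worked example of the normalization, take four coefficients with invented values (these numbers are hypothetical and are not output from the model in this post):

```r
# Hypothetical raw coefficients; the values are made up for illustration
raw = c(Reliable = 0.30, Fun = 0.10, Tough = 0.00, Sleepy = -0.10)

# Rescale so the absolute values sum to 100; signs are preserved
scores = raw * 100 / sum(abs(raw))
scores = sort(scores, decreasing = TRUE)
print(scores)
```

The absolute values of the raw coefficients sum to 0.5, so each is multiplied by 200. Negative coefficients stay negative, which is why the sorted output runs from the strongest positive predictor down to the strongest negative one.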
The results below show that Reliable and Fun are positive predictors of preference, Unconventional and Sleepy are negative predictors, and Tough has little relevance.
TRY IT OUT
You can perform this analysis for yourself in Displayr.
Author: Jake Hoare
After escaping from physics to a career in banking, then escaping from banking, I decided to go back to BASIC and study computing. This led me to rediscover artificial intelligence and data science. I now get to indulge myself at Displayr working in the Data Science team, often on machine learning.