16 June 2017 | by Jake Hoare
Using Partial Least Squares to Conduct Relative Importance Analysis in Displayr

Partial Least Squares (PLS) is a popular method for relative importance analysis in fields where the data typically includes more predictors than observations. Relative importance analysis is a general term applied to any technique used for estimating the importance of predictor variables in a regression model. The output is a set of scores which enable the predictor variables to be ranked based upon how strongly each influences the outcome variable.

There are a number of different approaches to relative importance analysis, including Relative Weights and Shapley Regression. In this blog post I briefly describe an alternative method: Partial Least Squares. Because it effectively compresses the data before regression, PLS is particularly useful when the number of predictor variables exceeds the number of observations.
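To see why having more predictors than observations is a problem for ordinary regression, here is a minimal sketch on simulated data (the data, sizes and variable names are made up for illustration and are not part of the cola data set). With 10 observations and 15 predictors, lm cannot estimate every coefficient and reports the surplus ones as NA.

```r
# Simulated data (hypothetical): 10 observations, 15 predictors
set.seed(1)
n = 10
p = 15
X = matrix(rnorm(n * p), nrow = n, ncol = p)
y = rnorm(n)
sim = data.frame(y = y, X)

# Ordinary least squares can identify at most n coefficients,
# so the remaining ones (out of p + 1, including the intercept) come back as NA
ols = lm(y ~ ., data = sim)
n.undefined = sum(is.na(coef(ols)))
print(n.undefined)
```

PLS sidesteps this by regressing on a small number of derived dimensions rather than on all of the original predictors.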

Partial Least Squares

PLS is a dimension reduction technique with some similarity to principal component analysis. The predictor variables are mapped to a smaller set of variables, and within that smaller space we perform a regression against the outcome variable. In contrast to principal component analysis, where the dimension reduction ignores the outcome variable, the PLS procedure aims to choose new mapped variables that maximally explain the outcome variable.
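The difference between the two methods can be sketched in base R using simulated data (all names and numbers here are assumptions for illustration). For a single outcome, the first PLS direction is proportional to X'y, the covariance between each centred predictor and the outcome, whereas the first principal component is the leading eigenvector of X'X and never looks at y.

```r
set.seed(42)
n = 500
# x1 has a large variance but is unrelated to y;
# x2 has a small variance but drives y
x1 = rnorm(n, sd = 3)
x2 = rnorm(n, sd = 1)
y = 2 * x2 + rnorm(n, sd = 0.5)

# Centre the predictors and the outcome
X = scale(cbind(x1, x2), center = TRUE, scale = FALSE)
yc = y - mean(y)

# PCA: first direction is the leading eigenvector of X'X (ignores y)
pca.dir = eigen(crossprod(X))$vectors[, 1]

# PLS with one outcome: first direction is proportional to X'y
pls.dir = as.vector(crossprod(X, yc))
pls.dir = pls.dir / sqrt(sum(pls.dir^2))

print(round(abs(pca.dir), 2))  # weight on x1, then x2
print(round(abs(pls.dir), 2))  # weight on x1, then x2
```

In this sketch the principal component loads mostly on the high-variance x1, while the PLS direction loads mostly on x2, the predictor that actually explains the outcome.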

Loading the example data

First I’ll add some data with Insert > Data Set > URL and paste in this link:

http://wiki.q-researchsoftware.com/images/6/69/Stacked_Cola_Brand_Associations.sav

Dragging Brand preference onto the page from the Data tree on the left produces a table showing the breakdown of the respondents by category. This includes a Don’t Know category that doesn’t fit in the ordered scale from Love to Hate. To remove Don’t Know, I click on Brand preference in the Data tree on the left and then click on Value Attributes. Changing Missing Values for the Don’t Know category to Exclude from analyses produces the table below.


Creating the PLS model

Partial least squares is easy to run with a few lines of code. Select Insert > R Output and enter the following snippet of code into the R CODE box:

dat = data.frame(pref, Q5r0, Q5r1, Q5r2, Q5r3, Q5r4, Q5r5, Q5r6, Q5r7, Q5r8, 
                  Q5r9, Q5r10, Q5r11, Q5r12, Q5r13, Q5r14, Q5r15, Q5r16, Q5r17,
                  Q5r18, Q5r19, Q5r20, Q5r21, Q5r22, Q5r23, Q5r24, Q5r25, Q5r26,
                  Q5r27, Q5r29, Q5r28, Q5r30, Q5r31, Q5r32, Q5r33)

library(pls)
library(flipFormat)
library(flipTransformations)

dat = AsNumeric(ProcessQVariables(dat), binary = FALSE, remove.first = FALSE)
pls.model = plsr(pref ~ ., data = dat, validation = "CV")

The first line selects pref as the outcome variable (strength of preference for a brand) and then adds 34 predictor variables, each indicating whether the respondent perceives the brand to have a particular characteristic. These variables can be dragged across from the Data tree on the left.

Next, the three libraries containing useful functions are loaded. The pls package contains the function that estimates the PLS model, and our own publicly available packages, flipFormat and flipTransformations, provide functions that help us transform and tidy the data. Since the R pls package requires numeric inputs, I convert the variables from categorical to numeric.

In the final line above the plsr function does the work and creates pls.model.

Automatically Selecting the Dimensions

The following few lines find the optimal number of dimensions and then recreate the model using that number:

# Find the number of dimensions with the lowest cross-validation error
cv = RMSEP(pls.model)
# The first entry is the intercept-only model (0 dimensions), hence the - 1
best.dims = which.min(cv$val["adjCV", , ]) - 1
# Rerun the model with the selected number of dimensions
pls.model = plsr(pref ~ ., data = dat, ncomp = best.dims)
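The reason for subtracting 1 is easiest to see on a small example. The sketch below uses a hypothetical vector of cross-validation errors whose names mimic the labels that RMSEP gives its models; the error values are invented.

```r
# Hypothetical cross-validation errors, one per candidate model.
# The first entry is the model with no PLS dimensions (intercept only).
cv.errors = c("(Intercept)" = 1.20, "1 comps" = 0.95, "2 comps" = 0.88,
              "3 comps" = 0.90, "4 comps" = 0.93)

# which.min returns the position of the smallest error (position 3 here)
best.position = which.min(cv.errors)

# Positions start at the intercept-only model, so subtract 1
# to convert the position into a number of dimensions
n.comps = unname(best.position) - 1
print(n.comps)
```

If the intercept-only model happened to have the lowest error, this calculation would return 0, so it is worth checking the result before refitting.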

Producing the Output

Finally, we extract the useful information and format the output:

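# Extract the regression coefficients from the final model
coefficients = coef(pls.model)
# Rescale so that the absolute values sum to 100
sum.coef = sum(abs(coefficients))
coefficients = coefficients * 100 / sum.coef
# Label the coefficients (dropping the outcome variable) and sort them
names(coefficients) = TidyLabels(Labels(dat)[-1])
coefficients = sort(coefficients, decreasing = TRUE)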

The regression coefficients are normalized so that their absolute values sum to 100. The labels are then added and the result is sorted in decreasing order.
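As a worked example of this normalization, the sketch below applies the same arithmetic to a handful of made-up coefficients. The attribute names are borrowed from the results for readability, but the values are hypothetical and are not the fitted coefficients.

```r
# Hypothetical raw coefficients (not the fitted values from the model)
raw = c(Sleepy = -0.04, Reliable = 0.30, Tough = 0.00,
        Unconventional = -0.06, Fun = 0.10)

# Rescale so that the absolute values sum to 100; signs are preserved
scaled = raw * 100 / sum(abs(raw))

# Sort from the most positive to the most negative predictor
scaled = sort(scaled, decreasing = TRUE)
print(round(scaled, 1))
```

Because the signs are kept, positive and negative predictors remain distinguishable, while the magnitudes can be read as shares of the total.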

The results below show that Reliable and Fun are positive predictors of preference, Unconventional and Sleepy are negative predictors and Tough has little relevance.



TRY IT OUT
You can perform this analysis for yourself in Displayr.


Author: Jake Hoare

After escaping from physics to a career in banking, then escaping from banking, I decided to go back to BASIC and study computing. This led me to rediscover artificial intelligence and data science. I now get to indulge myself at Displayr working in the Data Science team, often on machine learning.

