16 June 2017 | by Jake Hoare

Using Partial Least Squares to Conduct Relative Importance Analysis in Displayr

Partial Least Squares (PLS) is a popular method for relative importance analysis in fields where the data typically includes more predictors than observations. Relative importance analysis is a general term applied to any technique used for estimating the importance of predictor variables in a regression model. The output is a set of scores which enable the predictor variables to be ranked based upon how strongly each influences the outcome variable.

There are a number of different approaches to relative importance analysis, including Relative Weights and Shapley Regression, as described here and here. In this blog post I briefly describe an alternative method: Partial Least Squares. Because it effectively compresses the data before the regression, PLS is particularly useful when the number of predictor variables exceeds the number of observations.

Partial Least Squares

PLS is a dimension-reduction technique with some similarity to principal component analysis. The predictor variables are mapped to a smaller set of components, and the regression against the outcome variable is performed within that smaller space. In contrast to principal component analysis, where the dimension reduction ignores the outcome variable, PLS chooses the new components so that they maximally explain the outcome variable.
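
As a rough illustration of the difference (not part of the original analysis), the pls package ships with the yarn data set, which has 268 spectral predictors and only 28 observations. Principal component regression (pcr) compresses the predictors without reference to the outcome, whereas plsr chooses components that also explain it:

library(pls)
data(yarn)   # 28 observations, 268 NIR predictors, one outcome (density)

# Principal component regression: components chosen to explain the predictors only
pcr.model = pcr(density ~ NIR, ncomp = 5, data = yarn, validation = "CV")

# Partial least squares: components chosen to also explain the outcome
pls.example = plsr(density ~ NIR, ncomp = 5, data = yarn, validation = "CV")

# Compare the percentage of variance in density explained per component
summary(pcr.model)
summary(pls.example)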

Loading the example data

First I’ll add some data with Insert > Data Set > URL and paste in this link:

http://wiki.q-researchsoftware.com/images/6/69/Stacked_Cola_Brand_Associations.sav

Dragging Brand preference onto the page from the Data tree on the left produces a table showing the breakdown of the respondents by category. This includes a Don't Know category that doesn't fit in the ordered scale from Love to Hate. To remove Don't Know, I click on Brand preference in the Data tree and then click Value Attributes. Changing Missing Values for the Don't Know category to Exclude from analyses produces the table below.
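
For anyone following along in R outside Displayr, a rough equivalent of these steps is sketched below. It is only a sketch: it assumes the haven package, that the preference variable in the file is named pref, and that the category is labelled "Don't Know".

library(haven)

# Download the SPSS file and read it in
url = "http://wiki.q-researchsoftware.com/images/6/69/Stacked_Cola_Brand_Associations.sav"
sav = tempfile(fileext = ".sav")
download.file(url, sav, mode = "wb")
cola = read_sav(sav)

# Breakdown of respondents by preference category (assumed variable name: pref)
table(as_factor(cola$pref))

# Treat the Don't Know category as missing (assumed label)
cola$pref[as_factor(cola$pref) == "Don't Know"] = NA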


Creating the PLS model

Partial least squares is easy to run with a few lines of code. Select Insert > R Output and enter the following snippet of code into the R CODE box:

dat = data.frame(pref, Q5r0, Q5r1, Q5r2, Q5r3, Q5r4, Q5r5, Q5r6, Q5r7, Q5r8, 
                  Q5r9, Q5r10, Q5r11, Q5r12, Q5r13, Q5r14, Q5r15, Q5r16, Q5r17,
                  Q5r18, Q5r19, Q5r20, Q5r21, Q5r22, Q5r23, Q5r24, Q5r25, Q5r26,
                  Q5r27, Q5r29, Q5r28, Q5r30, Q5r31, Q5r32, Q5r33)

library(pls)
library(flipFormat)
library(flipTransformations)

dat = AsNumeric(ProcessQVariables(dat), binary = FALSE, remove.first = FALSE)
pls.model = plsr(pref ~ ., data = dat, validation = "CV")

The first line selects pref as the outcome variable (strength of preference for a brand) and then adds 34 predictor variables, each indicating whether the respondent perceives the brand to have a particular characteristic. These variables can be dragged across from the Data tree on the left.

Next, the three libraries containing useful functions are loaded. The package pls contains the function that estimates the PLS model, and our own publicly-available packages, flipFormat and flipTransformations, are included for functions that help us transform and tidy the data. Since the R pls package requires numeric inputs, I convert the variables from categorical to numeric.

In the final line above the plsr function does the work and creates pls.model.
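
As an aside, if the flip packages are unavailable, the categorical-to-numeric conversion can be approximated in base R. This is only a sketch, not the author's code: it scores each factor by its level order and does not preserve the variable labels used for tidying later.

# Convert every factor column to its underlying integer codes
dat = data.frame(lapply(dat, function(x) if (is.factor(x)) as.numeric(x) else x))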

Automatically Selecting the Dimensions

The following few lines recreate the model after finding the number of dimensions with the lowest cross-validation error (one is subtracted because the first entry of the error array corresponds to the model with zero components):

# Find the number of dimensions with lowest cross validation error
cv = RMSEP(pls.model)
best.dims = which.min(cv$val[estimate = "adjCV", , ]) - 1
# Rerun the model
pls.model = plsr(pref ~ ., data = dat, ncomp = best.dims)
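
As an optional check (not in the original post), the cross-validation results already stored in cv can be plotted to see how the prediction error changes as components are added:

# Plot cross-validated RMSEP against the number of components
plot(cv)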

Producing the Output

Finally, we extract the useful information and format the output:

coefficients = coef(pls.model)                        # coefficients for the model with the chosen number of components
sum.coef = sum(sapply(coefficients, abs))             # total absolute size of the coefficients
coefficients = coefficients * 100 / sum.coef          # normalize so the absolute values sum to 100
names(coefficients) = TidyLabels(Labels(dat)[-1])     # attach tidied predictor labels, dropping the outcome's label
coefficients = sort(coefficients, decreasing = TRUE)  # rank from most positive to most negative

The regression coefficients are normalized so their absolute sum is 100. The labels are added and the result is sorted.
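
For a quick visual check (not in the original post), the sorted scores can also be drawn as a horizontal bar chart in base R:

# Horizontal bar chart of the relative importance scores, largest at the top
par(mar = c(4, 12, 1, 1))   # widen the left margin so the predictor labels fit
barplot(rev(coefficients), horiz = TRUE, las = 1,
        xlab = "Relative importance (normalized coefficient)")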

The results below show that Reliable and Fun are positive predictors of preference, Unconventional and Sleepy are negative predictors and Tough has little relevance.



TRY IT OUT
You can perform this analysis for yourself in Displayr.


Author: Jake Hoare

After escaping from physics to a career in banking, then escaping from banking, I decided to go back to BASIC and study computing. This led me to rediscover artificial intelligence and data science. I now get to indulge myself at Displayr working in the Data Science team, often on machine learning.

2 Comments

  1. John Hamilton Bradford

    Hi I have a quick question re: line 2 of your last bit of code:

    > sum.coef = sum(sapply(coefficients, abs))

    When I used this in my own analysis it appears to be summing the absolute values of the coefficients for the last component only. A different method would be to sum the coefficients across all components:

    > sum(sapply(pls.model$coefficients, abs))

    These two values are typically not the same, so long as the number of components exceeds 1. Since the components account for different amounts of total variance in Y, I suppose the coefficients from the different components would need to be weighted, yes?

    Would there be a reason to sum the coefficients across all components as I show above? Or do the coefficients from the last component (with the # of components being determined by CV as you did) summarize or take into account information from all of the components already? I would think that if you have to pick one, the first component coefficients would be the most important since it usually accounts for the largest amount of the variance in Y, yet your code grabs the coefficients from the last component. Can you explain this? Am I mistaken about how the coefficients are interpreted? Thanks, John


    • Gaurav Jain

      Hi John

      The line “coefficients = coef(pls.model)” produces a 3-dimensional array with dimensions of nvar x 1 x 1,
      where nvar is the number of variables, which is 34 in this case.
      The last 2 dimensions are redundant, so this is effectively a vector of coefficients for the model with the first 3 components.
      So in terms of your question, this is already summed across those 3 components.

      Changing it to “coefficients = coef(pls.model, comp = 1:3)” produces an array with dimensions of nvar x 1 x 3.
      Each index of the final dimension refers to the coefficients of one component. With this it would make sense to
      sum over the components as you suggest. Looking at each component individually gives a breakdown of which coefficients are large for each
      component. As you say, the first component explains most variance.

      Changing it to “coefficients = coef(pls.model, ncomp = 1:3)” (note ncomp vs comp) also produces an array with dimensions of nvar x 1 x 3.
      Each index i of the final dimension refers to the sum of coefficients from 1 to i. So the last index of the final dimension is the same as
      “coefficients = coef(pls.model)”. This is also the same as “pls.model$coefficients”. It does not make sense to sum again over the components
      here because it is already a cumulative sum.
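
      A small sketch to check these relationships (assuming, as above, a model fitted with 3 components):

      a = coef(pls.model)                # nvar x 1 x 1: coefficients of the 3-component model
      b = coef(pls.model, comp = 1:3)    # nvar x 1 x 3: contribution of each individual component
      d = coef(pls.model, ncomp = 1:3)   # nvar x 1 x 3: cumulative coefficients for 1, 2 and 3 components
      all.equal(a[, 1, 1], rowSums(b[, 1, ]))   # TRUE: the individual contributions sum to the total
      all.equal(a[, 1, 1], d[, 1, 3])           # TRUE: the last cumulative entry equals the default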

      This documentation goes into a bit more detail on page 4, but it could be written more clearly in my opinion!
      https://cran.r-project.org/web/packages/pls/pls.pdf

      cheers

