Using Partial Least Squares to Conduct Relative Importance Analysis in R
Partial Least Squares (PLS) is a popular method for relative importance analysis in fields where the data typically includes more predictors than observations. Relative importance analysis is a general term applied to any technique used for estimating the importance of predictor variables in a regression model. The output is a set of scores which enable the predictor variables to be ranked based upon how strongly each influences the outcome variable.
There are a number of different approaches to relative importance analysis, including Relative Weights and Shapley Regression. In this blog post I briefly describe how to use an alternative method, Partial Least Squares, in R. Because it effectively compresses the data before regression, PLS is particularly useful when the number of predictor variables exceeds the number of observations.
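To make the "more predictors than observations" point concrete, here is a minimal sketch in base R. The data is simulated and every name in it (X, y, pls1.coef and so on) is made up for illustration. The hand-rolled one-component PLS is a simplification of what a full PLS routine does, not the code used later in this post.

```r
set.seed(42)
n <- 10   # observations
p <- 15   # predictors (more than observations)

# Simulated data: only x1 and x2 truly drive y
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- X[, 1] - X[, 2] + rnorm(n, sd = 0.1)

# Ordinary least squares cannot estimate 16 coefficients from 10 rows,
# so lm() reports NA for the ones it cannot identify
ols <- lm(y ~ ., data = data.frame(y = y, X))
n.undefined <- sum(is.na(coef(ols)))

# One-component PLS: compress X to a single score, then regress y on it
Xc <- scale(X, center = TRUE, scale = FALSE)
yc <- y - mean(y)
w <- drop(crossprod(Xc, yc))   # weights proportional to cov(x_j, y)
w <- w / sqrt(sum(w^2))        # normalize to unit length
score <- drop(Xc %*% w)        # the single compressed variable
pls1.coef <- w * sum(score * yc) / sum(score^2)

n.undefined        # how many lm() coefficients are NA
anyNA(pls1.coef)   # whether any PLS coefficient is missing
```

Because the regression happens on the compressed score rather than on all 15 columns, every predictor still receives a coefficient.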
What is Partial Least Squares?
Partial Least Squares, sometimes known as Partial Least Squares regression or PLS, is a dimension reduction technique with some similarity to principal component analysis. The predictor variables are mapped to a smaller set of variables, and within that smaller space we perform a regression against the outcome variable. In principal component analysis the dimension reduction procedure ignores the outcome variable. PLS, however, aims to choose new mapped variables that maximally explain the outcome variable.
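The difference between the two techniques can be sketched with simulated data in base R. Everything below is illustrative: x1, x2 and y are invented, and the PLS direction is computed by hand as the normalized cross-product of the centered predictors with the centered outcome, which is the first weight vector for a single outcome. It is not the full algorithm used by the pls package.

```r
set.seed(1)
n <- 500

# x1 has a large variance but is unrelated to y;
# x2 has a small variance and drives y
x1 <- rnorm(n, sd = 3)
x2 <- rnorm(n, sd = 1)
y  <- 2 * x2 + rnorm(n, sd = 0.5)

X  <- scale(cbind(x1, x2), center = TRUE, scale = FALSE)
yc <- y - mean(y)

# PCA: the first loading is the direction of maximum variance in X (ignores y)
pca.dir <- prcomp(X)$rotation[, 1]

# PLS: the first weight vector is proportional to X'y (uses y)
pls.dir <- drop(crossprod(X, yc))
pls.dir <- pls.dir / sqrt(sum(pls.dir^2))

# How well does each one-dimensional summary of X track y?
pca.cor <- abs(drop(cor(X %*% pca.dir, yc)))
pls.cor <- abs(drop(cor(X %*% pls.dir, yc)))

round(rbind(pca.dir, pls.dir), 2)
c(pca = pca.cor, pls = pls.cor)
```

In this setup the PCA direction loads mostly on the high-variance x1, while the PLS direction loads mostly on x2, the variable that actually explains y.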
To get started I'll import some data into R and examine it with the following few lines of code:
cola.url = "http://wiki.q-researchsoftware.com/images/d/db/Stacked_colas.csv"
colas = read.csv(cola.url)
str(colas)
The output of str() shows 37 variables. I am going to predict pref, the strength of a respondent's preference for a brand on a scale from 1 to 5. To do this I'll use the 34 binary predictor variables that indicate whether the person perceives the brand to have a particular characteristic.
Using Partial Least Squares in R
The next step is to remove unwanted variables and then build a model. Cross validation is used to find the optimal number of retained dimensions. Then the model is rebuilt with this optimal number of dimensions. This is all contained in the R code below.
# Drop the columns that are not used as predictors
colas = subset(colas, select = -c(URLID, brand))

library(pls)
pls.model = plsr(pref ~ ., data = colas, validation = "CV")

# Find the number of dimensions with lowest cross validation error.
# The first entry is the intercept-only model, hence the "- 1".
cv = RMSEP(pls.model)
best.dims = which.min(cv$val[estimate = "adjCV", , ]) - 1

# Rerun the model
pls.model = plsr(pref ~ ., data = colas, ncomp = best.dims)
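The index arithmetic in the selection step is easy to get wrong, so here is a toy illustration in base R. The error values and labels are made up purely to show the logic. The guard at the end is my own addition rather than part of the code above: as far as I know, plsr() needs at least one component, so a result of 0 would not be usable.

```r
# Made-up adjusted CV errors; the first entry is the model with no components
adj.cv <- c("0 comps" = 1.20, "1 comps" = 0.95, "2 comps" = 0.90, "3 comps" = 0.93)

# which.min() returns a position in the vector, and position 1 is the
# intercept-only model, so subtract 1 to get the number of components
best.dims <- unname(which.min(adj.cv)) - 1

# Optional guard (an assumption, not in the original code): fall back to
# one component if the intercept-only model has the lowest error
best.dims <- max(best.dims, 1)
best.dims
```

Here the lowest error sits in the third position, which corresponds to two components.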
Producing the Output
Finally, we extract the useful information and format the output.
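# Extract the regression coefficients from the fitted model
coefficients = coef(pls.model)

# Rescale so that the absolute values sum to 100
sum.coef = sum(sapply(coefficients, abs))
coefficients = coefficients * 100 / sum.coef

# Sort in ascending order and plot the five largest coefficients
coefficients = sort(coefficients[, 1 , 1])
barplot(tail(coefficients, 5))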
The regression coefficients are normalized so their absolute sum is 100 and the result is sorted.
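As a small worked example of that normalization, here is the same arithmetic applied to four made-up coefficients. The names A to D and their values are hypothetical and have nothing to do with the cola data.

```r
# Hypothetical raw coefficients, for illustration only
raw <- c(A = 0.30, B = 0.20, C = -0.10, D = -0.40)

# Divide by the sum of absolute values and scale to 100, then sort ascending
scores <- sort(raw * 100 / sum(abs(raw)))

scores            # most negative first, most positive last
sum(abs(scores))  # absolute values add up to 100 (up to rounding)
```

Signs are preserved, so the most negative predictor ends up first and the most positive last, which is why head() and tail() pick out the two ends of the scale.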
The resulting bar plot shows that Reliable and Fun are positive predictors of preference. You could run the code barplot(head(coefficients, 5)) to see that at the other end of the scale Unconventional and Sleepy are negative predictors.