When using linear discriminant analysis (LDA) in market research, a common need is to obtain what are known as discriminant functions. These allow the analysis to be applied to predict values for new respondents in a spreadsheet or other external tool. In this post I describe how to obtain these functions from the LDA output in Q, and how to use them in Q and Excel to make predictions for new respondents.
If you are new to using linear discriminant analysis, you can learn more about the theory here.
Performing Linear Discriminant Analysis (LDA)
You need to begin by setting up a linear discriminant analysis. In this example, I make a fairly boring prediction: preferred colas based on the ratings for different cola brands. To keep things simple, I removed the Preferred Cola categories of Don't Know and Dislike All Cola from the analysis. I set the Brand Attitude ratings as a Number - Multi question (as LDA requires numeric predictors).
The steps are as follows:
- Import some data. In this example I use the Colas.sav file, located in Q's Examples folder.
- Add the LDA model by selecting Create > Classifier > Linear Discriminant Analysis.
- Select your Outcome variable - the variable you want to predict. For this example I will choose Preferred cola.
- Select the variables that you want to use to build the prediction in the Numeric predictors box. For this example I select each of the variables from the Brand attitude question
- Click Calculate.
The resulting table of category means is below. In addition to showing the mean score for each predictor within each of the outcome categories, coloring indicates when a result is higher or lower than average for the row, and bold font indicates when this is statistically significant. The R-Squared numbers indicate the proportion of variance in each row which is explained by the outcome groups.
To obtain discriminant functions, select your LDA output and then click Create > Classifier > Diagnostic > Discriminant Functions.
This shows that the function for predicting that a respondent prefers Coca-Cola based on the brand attitude scores is:
-2.10 + (1.32 * Coca-Cola) + (-0.17 * Diet Coke) + (0.03 * Coke Zero) + (0.09 * Pepsi) + (-0.79 * Diet Pepsi) + (0.02 * Pepsi Max).
There are a few ways that you can make use of these functions, which we will cover next. It would be rather tedious to manually evaluate all 6 functions per data point. Next I will show you how to use this table by exporting it to Excel. In Excel, you can create a set of formulas to calculate the scores (see the example Excel sheet linked below). I will also show you how to repeat these calculations in R, and how to get the predicted values without any calculations at all.
Manual Calculations of Predictions in Excel
- Use File > Export > To Office > Excel (or click the Excel icon) to export the discriminant function coefficients table to Excel.
- Create a table showing the RAW DATA for the predictor variables. One easy way to do this is to use Create > Tables > Raw Data, select each of the predictor variables in the Variables box, and click Calculate. The RAW DATA table for this example is shown below.
- Export the raw data table into the same Excel document.
- Create a formula like the one above to calculate the scores for each respondent for each category of the outcome variable, to replicate the one above.
- Add a formula to compute the highest score for each respondent.
- Create a formula to assign each respondent to the category of their highest score.
For the last three steps, please download and review this example Excel sheet.
Calculation of Predictions using R
In Q, you can get predicted values for your existing data set without any manual calculations. To do so:
- Copy and paste your LDA output so that you have a second version to play around with.
- Make sure that the the Output option is set to Means.
- Select Create > Classifier > Save Variable(s) > Predicted Values.
This will save a new variable to your data set that contains the predicted value for each respondent.
To see how this works under the hood, you can set up the calculations manually with R. You will need the same RAW DATA table used in the previous section. To make the predictions, select Create > R Output and paste or type in the following few lines of code:
raw.data = cbind(rep(1, nrow(raw.data)), raw.data) raw.data = as.matrix(raw.data) # convert from data.frame to matrix scores = raw.data %*% lda.discriminant.functions predictions = colnames(scores)[apply(scores, 1, which.max)]
The first line adds a column of ones to the data to facilitate the "intercept" part of each calculation. The third line uses matrix multiplication to produce the scores for the respondents all at once. The final line chooses the category with the highest score from each row. This algorithm is exactly the same as that in the Excel example above, except that the power of R allows you to use a single formula.
To see these worked examples, you can download a QPack here.