Principal Component Analysis of Text Data
This post introduces our new Principal Component Analysis (PCA) tool for analyzing text data. It takes a single text variable as input and returns numeric variables that summarize the text data, along with tables of loadings to facilitate interpretation. The output can be used either as a summary of the text data or as an input to further analyses (e.g., as variables in a segmentation).
Worked example: Understanding attitude towards Tom Cruise
This post analyzes text data where people have listed their reasons for not liking Tom Cruise. The raw data is shown in the table below.
By default, Displayr creates a PCA with two components, but to explain the technique I'm going to start by looking at the result with a single component. With one component, the PCA of text data seeks to find a single numeric variable that best explains differences in text.
The table of loadings below shows the correlation of different words and phrases with the numeric variables that describe the text. The way to read it is as follows:
- The strongest correlation is for people who have used the word nothing (or a synonym) anywhere in their text.
- The slightly weaker correlation for Exclusive: nothing is for people who mentioned nothing but didn't mention it as part of a bigram (a pair of words that commonly appear together).
- Stem: not is the correlation of the word not and any words that commence with not (e.g., not, nothing) with the numeric variable.
- nchars is the number of characters in the text. Its negative correlation means that the more somebody typed, the lower their score on the variable that has been identified.
- The first component is negatively correlated with Negative sentiment (i.e., the higher the score, the lower the negative sentiment, so high scores on the variable correspond to positive sentiment).
Putting all the results together tells us that if we have to summarize the text data as a single numeric variable, that variable places people who said nothing (or something like it) at one end of the continuum and everybody else at the other.
The table below shows the numeric variable that has been computed. We can see, for example, that respondent 10 said nothing and has a relatively high score (2.3). Respondent 1's answer isn't purely nothing, which is why their score is closer to 0 (the average). By contrast, respondents who didn't write nothing have negative scores.
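To make the link between the loadings table and the scores concrete, here is a minimal Python sketch. The numbers are made-up illustrative data, not the actual survey responses: the loading reported for a term is simply the Pearson correlation between a 0/1 indicator of whether the term appears in a response and the component scores.

```python
import numpy as np

# Hypothetical data: first-component scores for 10 respondents, and a 0/1
# indicator of whether each respondent's text contained the word "nothing"
scores = np.array([0.8, -0.5, -0.7, 1.9, -0.3, -0.9, 2.1, -0.6, -0.4, 2.3])
said_nothing = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 1])

# The loading shown for "nothing" is just this Pearson correlation
loading = np.corrcoef(said_nothing, scores)[0, 1]
print(round(loading, 2))  # close to 1: saying "nothing" goes with high scores
```

A strongly negative value would instead mean that mentioning the term goes with low scores, which is how the nchars and Negative sentiment rows should be read.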
The table below shows the loadings from the two-component solution. The first component has essentially the same meaning as in the first analysis. But if you scroll down, you will see that the second component measures whether somebody didn't say tom cruise (note the negative correlation). At one end, this component captures mentioning Tom Cruise and like; at the other end, mentioning neither Tom Cruise nor like.
When we look at the four-component solution, we end up with four variables with the following interpretations:
- First component variable - whether the text said nothing or one of the variants described in the one-component analysis above.
- Second component variable - whether the text mentions like or actor.
- Third component variable - whether the text matches Stem: scientolog (i.e., scientology or scientologist and any misspellings beginning with scientolog). Words synonymous with faith are also positively correlated with this variable.
- Fourth component variable - whether the text does not mention crazy.
The table below shows the raw values of the four variables, sorted by the fourth variable (lowest to highest). We can easily see that the further a value falls below zero on the fourth variable, the more likely the respondent was to say that they regarded Tom Cruise as crazy.
This analysis is useful in its own right, as a summary of the key trends in the data. And, the variables can be used as inputs into other analyses, such as cluster analysis or latent class analysis (segmentation).
Selecting the number of components
How many components should you have? This is best determined by judgment: choose the number that leads to a result that makes sense.
An alternative is a scree plot. The basic idea is to imagine that the plot shows an arm, and to choose the number of components that occurs at around the "elbow". In this example we have a double-jointed elbow, so at best the plot tells us that 10 or fewer components are appropriate. As mentioned in the previous paragraph, my recommendation is to just use judgment.
One common heuristic for selecting the number of components is the Kaiser rule (eigenvalues > 1). Such rules aren't practical when using PCA for text data, because the PCA is performed on 512 dimensions, and pretty much any traditional heuristic will recommend too many components (e.g., with this example, the Kaiser rule suggests 81 components).
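Both the scree values and the Kaiser count come straight from the eigenvalues of the correlation matrix. The sketch below uses a random matrix as a stand-in for real 512-dimensional document embeddings (an assumption purely for illustration), which is enough to show why the eigenvalue > 1 rule overshoots in high dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for document embeddings: 200 documents x 512 dimensions
# (random data used purely for illustration)
X = rng.normal(size=(200, 512))

# Eigenvalues of the correlation matrix, largest first; they sum to 512
eigenvalues = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

# Scree values: proportion of variance per component
# (plot these against component number and look for an elbow)
scree = eigenvalues / eigenvalues.sum()

# Kaiser rule: keep every component with eigenvalue > 1
kaiser_count = int((eigenvalues > 1).sum())
print(kaiser_count)  # far more components than could sensibly be interpreted
```

Because the eigenvalues must sum to the number of dimensions (512 here), even pure noise produces a large number of eigenvalues above 1, which is why judgment beats the heuristic for this kind of data.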
Instructions for conducting principal component analysis of text data
- To conduct the analysis in:
- Displayr: Insert > Text Analysis > Advanced > Principal Components Analysis (Text)
- Q: Create > Text Analysis > Advanced > Principal Components Analysis (Text)
- Set the text variable in the Variable field.
- Specify the desired Number of components.
- Press ACTIONS > Save variables to save the variables to the data file.
How it works
- The text data is cleaned.
- If necessary, it is translated into English.
- It is converted into 512 numeric variables using Google's Universal Sentence Encoder for English.
- A PCA is performed on the 512 numeric variables and the scores are extracted.
- A term-document matrix is created from the cleaned text data, along with sentiment analysis, and some related variables.
- The loadings are computed as the cross-correlation matrix of the term-document matrix (rows) and the PCA scores (columns).
- A varimax-type rotation is applied to the loadings.
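The numeric steps above (PCA on the embeddings, cross-correlation loadings, varimax rotation) can be sketched in a few lines of numpy. Everything here is an illustrative stand-in: the random matrices play the role of the 512-dimensional embeddings and the term-document matrix, and the varimax implementation is a generic textbook version, not necessarily the exact rotation the tool applies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the real inputs (assumptions for illustration):
# 100 documents embedded into 512 dimensions, and a binary
# term-document matrix for 6 terms.
embeddings = rng.normal(size=(100, 512))
term_doc = (rng.random(size=(100, 6)) > 0.7).astype(float)

# PCA via SVD of the centered embedding matrix; keep 4 components
X = embeddings - embeddings.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
n_components = 4
scores = U[:, :n_components] * s[:n_components]  # one score per document

def corr_cross(A, B):
    """Cross-correlation matrix: columns of A vs columns of B."""
    A = (A - A.mean(0)) / A.std(0)
    B = (B - B.mean(0)) / B.std(0)
    return A.T @ B / len(A)

# Loadings: correlation of each term (rows) with each component (columns)
loadings = corr_cross(term_doc, scores)

def varimax(L, max_iter=100, tol=1e-6):
    """Generic varimax rotation (textbook algorithm)."""
    n, k = L.shape
    R = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        LR = L @ R
        U2, s2, Vt2 = np.linalg.svd(
            L.T @ (LR**3 - LR @ np.diag((LR**2).sum(0)) / n))
        R = U2 @ Vt2
        new_var = s2.sum()
        if new_var - var < tol:
            break
        var = new_var
    return L @ R

rotated = varimax(loadings)
print(rotated.shape)  # (6, 4): one row per term, one column per component
```

The rotation only mixes the components among themselves, so it changes which words load on which component (making each component easier to name) without changing how well the components jointly describe the text.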