Text Analysis: Hooking up Your Term Document Matrix to Custom R Code
I have previously written about some of the text analysis options that are available in Displayr: sentiment analysis, text cleaning, and the predictive tree. As text analysis is a growing field, you will likely want to use your own tools on top of those already built into Displayr. Before you can feed your text into a statistical algorithm, it must be converted into a form that is amenable to calculation. One approach is to use a term document matrix, the topic of this blog post. I'll explain what a term document matrix is, introduce a more compact version of it called a sparse matrix, and show how to use it in your own R code.
What is a term document matrix?
A term document matrix is a way of representing the words in the text as a table (or matrix) of numbers. The rows of the matrix represent the text responses to be analysed, and the columns of the matrix represent the words from the text that are to be used in the analysis. The most basic version is binary. A 1 represents the presence of a word and 0 its absence. Consider, as an example, the following, very basic, set of text responses:
The term document matrix for this would look something like the following:
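As a hypothetical stand-in (the three responses below are invented for illustration and are not the ones from the original example), a binary term document matrix can be built in base R along these lines. The names responses, tokens, terms and binary.tdm are my own and do not come from Displayr:

```r
# Hypothetical responses, invented for illustration
responses <- c("I love cats",
               "cats and dogs",
               "I love dogs")

# Split each response into lower-case words
tokens <- strsplit(tolower(responses), "\\s+")

# The columns are the distinct words across all responses
terms <- sort(unique(unlist(tokens)))

# One row per response, one column per word: 1 if present, 0 if absent
binary.tdm <- t(sapply(tokens, function(words) as.integer(terms %in% words)))
dimnames(binary.tdm) <- list(paste0("response", seq_along(responses)), terms)

print(binary.tdm)
```

Each row is one response and each column one word, matching the layout described above. A real analysis would also clean the text first (punctuation, spelling, stop words), which this sketch skips.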
The steps to creating your own term document matrix in Displayr are:
- Clean your text responses using Insert > More > Text Analysis > Setup Text Analysis. Options for cleaning the text with this item are discussed in How to Set Up Your Text Analysis in Displayr.
- Add your term document matrix using Insert > More > Text Analysis > Techniques > Create Term Document Matrix.
Like my other posts on text analysis, I will use the example of Donald Trump's tweets. In this example data set, which you can play with here in Displayr, there are 1,512 tweets from @realDonaldTrump. Using Displayr's tool to create the term document matrix, we instead start with an output that looks somewhat different from the one in our easy example:
This version of the matrix is called a sparse matrix (believe it or not!) and it is a more efficient representation of the information contained in the term document matrix. This representation becomes necessary whenever there are a large number of cases or words. The matrix tends to be mostly zeros, and in this case the output tells us that the proportion of entries that are zero (called the Sparsity) is 97%. Because the sparse representation does not store the zero entries, we save a lot of computer memory. The downside is that it doesn't display as nicely on the screen, and you'll need to convert it into a normal matrix when you want to use it in a calculation.
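To make the idea of sparsity concrete, here is a minimal base R sketch. It uses a small invented matrix and a simple list of (row, column, value) triplets; this is only an illustration of the principle, not the exact storage format used by Displayr or by tm:

```r
# Hypothetical binary matrix: 4 documents, 6 terms, mostly zeros
dense <- matrix(0L, nrow = 4, ncol = 6)
dense[cbind(c(1, 2, 4), c(2, 5, 1))] <- 1L

# Sparsity: the proportion of entries that are zero
sparsity <- mean(dense == 0)
print(sparsity)

# A sparse (triplet) form keeps only the positions of the non-zero entries
nonzero <- which(dense != 0, arr.ind = TRUE)
triplets <- data.frame(row = nonzero[, "row"],
                       col = nonzero[, "col"],
                       value = dense[nonzero])
print(triplets)

# Rebuilding the dense matrix from the triplets recovers the original
rebuilt <- matrix(0L, nrow = nrow(dense), ncol = ncol(dense))
rebuilt[cbind(triplets$row, triplets$col)] <- triplets$value
```

Only three triplets are needed to describe all 24 cells here. With thousands of tweets and words the saving is far larger, which is why the dense form is only practical for small data sets.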
If your data set contains only a few hundred text entries then you can use some R code to display the matrix:
- Click on Insert > R Output. This creates a new output which will display the output of any R code that you type.
- Click on Properties > R CODE on the right of the screen.
- Enter the following:
library(tm)
non.sparse.matrix <- as.matrix(term.document.matrix)
There are a couple of important things to note about this very simple snippet of code. Firstly, we have loaded the R package called tm (which stands for text mining). We did this because this package knows how to handle the sparse matrix format that we have used. It contains a version of the generic function as.matrix(), which converts the sparse matrix into a normal R matrix. Secondly, term.document.matrix is the name of our original sparse term document matrix. In Displayr you can use outputs in your document as inputs to other calculations by referring to them by name. To find the name of an output, first click on it, and then look in Properties > GENERAL > Name.
The result looks like this:
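As a rough sketch of the same conversion outside Displayr, the snippet below builds a small sparse matrix with tm and then converts it. It assumes the tm package is installed, the three tweets are invented, and sparse.dtm is a hypothetical name; the object that Displayr creates may differ in its details:

```r
# Assumes the tm package is installed; the tweets below are invented
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)

  tweets <- c("make america great again",
              "thank you america",
              "great rally tonight")
  corpus <- VCorpus(VectorSource(tweets))

  # Rows are documents and columns are terms;
  # weightBin records presence (1) or absence (0) rather than counts
  sparse.dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightBin))
  print(sparse.dtm)                  # summary, including the sparsity

  # Convert the sparse object into a normal R matrix
  non.sparse.matrix <- as.matrix(sparse.dtm)
  print(non.sparse.matrix)
}
```

Note that tm also offers TermDocumentMatrix(), which puts terms in the rows and documents in the columns. The layout used in this post (documents in rows) corresponds to DocumentTermMatrix().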
We are now ready to analyze the tweets with a statistical algorithm. To begin with, we will use a random forest model to see how the presence of particular words can be used to predict which device the tweet was sent from - iPhone or Android. Why do we care? The working hypothesis is that Trump himself tweets from an Android, whereas his media team tweet on his behalf from an iPhone (see a previous post on sentiment analysis). This results in differences in the language coming from those devices.
Displayr has a built-in option for running a random forest model. This type of model predicts an outcome variable from other variables in the data set. Use it by selecting Insert > More > Machine Learning > Random Forest. However, the term document matrix lives in an R output and is not saved as a set of variables in our data set. In fact, due to its size, it is undesirable to save a term document matrix into your data set. Instead, we can adapt the code behind the existing random forest option as follows:
- Click on Insert > R Output.
- Use the following code:
library(flipMultivariates) # Our package containing the Random Forest routine
library(tm) # The package needed to convert the sparse matrix
tdm <- as.matrix(term.document.matrix) # Convert the sparse matrix before use
colnames(tdm) <- make.names(colnames(tdm)) # Ensure the column names are appropriate for use in an R model
df <- data.frame(TweetSource = tweetSource, tdm) # Combine the outcome variable with the term document matrix
f <- formula(paste0("TweetSource ~ ", paste0(colnames(tdm), collapse = "+"))) # Create the R Formula which describes the relationship we are interrogating
rf <- RandomForest(f, df) # Run the random forest model
The code above first converts the term document matrix to a normal matrix and combines it with the dependent variable (tweetSource). It then builds an R formula relating the dependent variable to the columns of the term document matrix, and finally runs the random forest routine. The same process could be used for a regression model, or for any other R routine that takes a formula and a data frame as its inputs. This leads to a table showing how important each word is in improving the accuracy of predicting the source of each tweet:
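The same structure can be sketched with a regression model using only base R. This is a stand-in and not the random forest from this post: the data are simulated, the three terms and their probabilities are invented, and glm() replaces RandomForest() so that the example runs without Displayr's packages:

```r
set.seed(1)
n <- 200

# Hypothetical binary term document matrix: rows are tweets, columns are terms
tdm <- cbind("@realdonaldtrump" = rbinom(n, 1, 0.3),
             "#trump2016"       = rbinom(n, 1, 0.2),
             "great"            = rbinom(n, 1, 0.4))

# Hypothetical outcome, loosely related to the first term
p.android <- ifelse(tdm[, "@realdonaldtrump"] == 1, 0.8, 0.4)
tweetSource <- factor(ifelse(rbinom(n, 1, p.android) == 1, "Android", "iPhone"))

# Same preparation steps as in the random forest code:
# syntactic column names, a combined data frame, and a formula
colnames(tdm) <- make.names(colnames(tdm))
df <- data.frame(TweetSource = tweetSource, tdm)
f <- formula(paste0("TweetSource ~ ", paste0(colnames(tdm), collapse = "+")))

# Stand-in model: logistic regression instead of a random forest
fit <- glm(f, data = df, family = binomial)
print(coef(summary(fit)))
```

The make.names() step matters because terms such as @realdonaldtrump and #trump2016 are not valid R variable names, so they could not appear in a formula unchanged.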
The MeanDecreaseAccuracy figures provide a measure of how much each word improves the accuracy of the random forest model in predicting the source of the tweet. The first three columns show the importance for each possible source. In the second row, we see that the presence of @realDonaldTrump (which appears when the account re-tweets mentions of Trump) is by far the most important term. Looking at the relative frequencies of words used on the two devices, we conclude that a tweet containing such a mention almost always comes from the Android (theorized to be Trump's own device). The first row, on the other hand, shows that the presence of the tag #trump2016 was very good at predicting that a tweet did not come from the Android device but, as it was fairly infrequent overall, it was not a great predictor of a tweet being from an iPhone.
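The relative frequencies mentioned above can be checked directly from a term document matrix. The following is a minimal sketch on invented data (eight hypothetical tweets and two hypothetical terms, mention and hashtag), not the figures from the Trump data set:

```r
# Hypothetical data: 8 tweets, 2 terms, and the device each tweet came from
tdm <- cbind(mention = c(1, 1, 1, 0, 0, 0, 1, 0),
             hashtag = c(0, 0, 0, 0, 1, 1, 0, 0))
tweetSource <- factor(c("Android", "Android", "Android", "Android",
                        "iPhone", "iPhone", "iPhone", "iPhone"))

# Proportion of each device's tweets that contain each term
freq.by.device <- sapply(split(as.data.frame(tdm), tweetSource), colMeans)
print(freq.by.device)

# Among tweets containing the mention, the share coming from each device
source.given.mention <- prop.table(table(tweetSource[tdm[, "mention"] == 1]))
print(source.given.mention)
```

The first table asks how common a term is on each device, while the second asks where a tweet probably came from given that it contains the term. A term can be rare overall and still point strongly to one device, which is the pattern described for #trump2016.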
TRY IT OUT
Feel free to try out these examples in Displayr.
Many packages for doing text analysis have been written in the R language. We've made some of them available in Displayr already, including tm, tidytext, text2vec, stringr, hunspell, and SnowballC. If you come across one that you want to use, but which is unavailable in Displayr, you should contact us to let us know. We can, when needed, typically make new packages available within a few days.