Automatically Classify New Text Data Using an Existing Categorization
Fully automated text analysis can sometimes do a great job. However, the gold standard for automatic categorization is to first have a human being manually "tag" some of the data, then use machine learning to automatically categorize new data. This saves a lot of time, not only when you are working with preliminary data sets and trackers, but also when you simply don't have time to manually categorize thousands of responses. Our Semi-Automatic Categorization feature makes manually categorizing responses faster. This post covers the other side: how to make our Automatic Categorization of Unstructured Text analysis smarter by feeding in some manually categorized data. For a more in-depth discussion of this and other approaches to automatically categorizing text, please see our white paper, Using Machine Learning to Automate Text Coding.
Step 1. Create the categorization on a subset of the data
First, create a categorization using a subset of the data. This could be a random selection of the data or the first wave of a tracking study. For more information about how to do this, please read Manually Coding Multiple Response Text Data in Displayr and Semi-Automatic Coding of Text Data: A Cutting-Edge Technique. For this example, I used some responses about what people like about their cell phone provider. By simply searching for some keywords and categorizing the matches, I've managed to code a subset of 594 of the 895 responses.
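To make the idea of keyword-based coding concrete, here is a minimal Python sketch of what searching and categorizing by keyword amounts to. It is not how Displayr implements its coding tools; the category names, keywords, and the `code_response` helper are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical keyword rules (assumption): category -> keywords that signal it.
KEYWORD_RULES = {
    "Price": ["price", "cheap", "affordable"],
    "Coverage": ["coverage", "reception", "signal"],
    "Customer service": ["service", "support", "helpful"],
}

def code_response(text, rules=KEYWORD_RULES):
    """Return the sorted list of categories whose keywords appear in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(cat for cat, kws in rules.items() if words & set(kws))

# Made-up example responses, not taken from the data set in this post.
responses = [
    "Cheap plans and good coverage",
    "The support staff are helpful",
    "Nothing in particular",
]

# A response can receive several categories (multiple response coding).
coded = {r: code_response(r) for r in responses}

# Responses with no keyword match stay uncoded, which is why only a
# subset of the data ends up categorized at this stage.
n_coded = sum(1 for cats in coded.values() if cats)
```

Responses that match no keyword are simply left uncoded, which mirrors the situation above where only part of the data has been categorized by hand.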
Step 2. Hook up your existing categorization to our automated text analysis
Next, we can automatically classify new responses as follows:
- Create the text analysis output using Insert > Text Analysis > Automatic Categorization > Unstructured Text
- In the Object Inspector on the Inputs tab, select your original text variable for Text variable and your manually categorized variable set for Existing categorization. I've done this for my Likes per the screenshot below:
The analysis will now calculate automatically, and the output will look similar to the one below. You can see how many responses were automatically categorized by the model (the Predicted column) and how accurate those predictions are when compared with your originally categorized responses (the Accuracy column).
Note that the machine learning model may not be able to fit every response into one of the existing categories. When that happens, the response is left uncategorized.
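For readers curious about the general mechanics, the following Python sketch shows one way a model can learn from manually categorized responses, predict categories for new ones, and leave a response uncategorized when it is not confident enough. It is an illustration only and is not Displayr's algorithm: it uses a small naive Bayes classifier, assumes a single category per response for simplicity, and the `min_confidence` threshold and the example data are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train(labelled):
    """Fit a multinomial naive Bayes model from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, cat in labelled:
        class_counts[cat] += 1
        word_counts[cat].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict(model, text, min_confidence=0.6):
    """Return the most probable category, or None if confidence is too low."""
    word_counts, class_counts, vocab = model
    tokens = [t for t in tokenize(text) if t in vocab]
    if not tokens:
        return None  # no known words: leave the response uncategorized
    total = sum(class_counts.values())
    log_scores = {}
    for cat, n in class_counts.items():
        denom = sum(word_counts[cat].values()) + len(vocab)
        score = math.log(n / total)
        for t in tokens:
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((word_counts[cat][t] + 1) / denom)
        log_scores[cat] = score
    # Convert log scores into probabilities that sum to one.
    top = max(log_scores.values())
    probs = {c: math.exp(s - top) for c, s in log_scores.items()}
    best = max(probs, key=probs.get)
    confidence = probs[best] / sum(probs.values())
    return best if confidence >= min_confidence else None

# Made-up manually categorized responses (assumption, for illustration).
labelled = [
    ("cheap monthly price", "Price"),
    ("low price and cheap plans", "Price"),
    ("great coverage everywhere", "Coverage"),
    ("strong signal and coverage", "Coverage"),
]
model = train(labelled)

# Agreement with the manual codes, measured on the same responses the
# model was trained on, so it is an optimistic estimate.
accuracy = sum(predict(model, t) == c for t, c in labelled) / len(labelled)

new_predictions = {
    text: predict(model, text)
    for text in ["really cheap price", "cheap coverage", "friendly staff"]
}
```

In this sketch, "cheap coverage" contains evidence for both categories and "friendly staff" contains no words the model has seen, so both fall below the threshold and stay uncategorized.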
Step 3. Save your categories
Finally, you can save your categories into a variable set to use in tables and other outputs by selecting the automatic categorization output and clicking Insert > Text Analysis > Advanced > Save Variable(s) > Categories.
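If it helps to picture what a saved set of category variables looks like, the Python sketch below builds one 0/1 indicator column per category, which is a common layout for multiple response data. This is an assumption about the general shape of such data and not a description of Displayr's internal format; the `to_indicator_rows` helper, the `id` column, and the example assignments are hypothetical.

```python
import csv
import io

def to_indicator_rows(assignments, categories):
    """Turn {response_id: [categories]} into rows of 0/1 indicator columns."""
    rows = []
    for response_id, assigned in assignments.items():
        row = {"id": response_id}
        for cat in categories:
            row[cat] = 1 if cat in assigned else 0
        rows.append(row)
    return rows

# Made-up category assignments; response 3 was left uncategorized.
assignments = {1: ["Price"], 2: ["Coverage", "Price"], 3: []}
categories = ["Price", "Coverage"]
rows = to_indicator_rows(assignments, categories)

# Write the table to an in-memory CSV so it could be used in other tools.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id"] + categories)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Each row is one response and each category column can then be tabulated like any other binary variable.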