Automatic Categorization of Unstructured Text Data
Categorizing text data can be a time-consuming and expensive activity. In cases where time is short and budgets low, using automatic categorization of text data can save the day and give you a good idea of what's contained in your data.
In the following example, I have some text data collected in a survey about Tom Cruise. The question was "What don't you like about Tom Cruise?" and the responses are pretty varied. Categorizing these responses by hand would normally take a good couple of hours, if not longer. Here, instead, I'll run an automatic text categorization to see which main themes concern the survey respondents.
How to run Automatic Text Categorization in Displayr
I've imported my data as usual (see here for more on that) and I'm ready to begin my analysis.
- Go to Insert > Text Analysis (Analysis) > Automatic Categorization > Unstructured Text.
- In the object inspector (the section that opens on the right of the screen), under Inputs > Text variable, select the variable that holds the text you want to analyze.
- Change the Inputs > Number of categories to the number of categories you would like to classify the data into. I've chosen 15 for this example.
- The output will calculate automatically, and looks like this:
On the left of this output you can see the automatically generated categories. The center column shows the proportion and count of cases in the file that have been allocated to each category, and on the right are examples of the types of responses that have been allocated. Clicking the ▶ button will show you all the text that's been assigned to that category.
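To make the three columns concrete, here is a minimal Python sketch of how such a summary could be built from responses that already have a category. It is not Displayr's implementation: the responses, the category names, and the `summarize` helper are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical (response, category) pairs, invented for this sketch.
assigned = [
    ("He jumps on couches", "Behavior"),
    ("Too intense in interviews", "Behavior"),
    ("His movies are all the same", "Movies"),
    ("Nothing, I like him", "Nothing"),
    ("Scientology", "Religion"),
    ("His religion", "Religion"),
    ("The Scientology stuff", "Religion"),
]

def summarize(pairs, n_examples=2):
    """Return one row per category: (category, count, proportion, examples)."""
    by_category = defaultdict(list)
    for response, category in pairs:
        by_category[category].append(response)
    total = len(pairs)
    rows = [
        (category, len(responses), len(responses) / total, responses[:n_examples])
        for category, responses in by_category.items()
    ]
    # Largest categories first; ties are broken alphabetically for a stable order.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows

summary = summarize(assigned)
for category, count, proportion, examples in summary:
    print(f"{category:<10} {proportion:>4.0%} ({count})  e.g. {'; '.join(examples)}")
```

Each printed row mirrors the layout described above: the category, its share and count of cases, and a couple of example responses.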
How to Save the Categories to your Data Set
Saving the categories assigned to your data, so that you can use them in other analyses, is easily done. Make sure that the output above is selected on the Page and then go to Insert > Text Analysis > Advanced > Save Variable(s) > Categories. A new variable called "Categories from..." will be added to your Data Sets. This new variable stores the category that each case in your file has been assigned to, so you can combine the categorized data with other variables in your data set.
To create a simple example of a table that uses categorized data and another variable, I start by dragging a variable from the Data Sets pane onto a document Page. Next, I select a second variable, in this case the one containing the categorized data, and drag it onto the table I already created, taking care to drop it in the Columns field that appears when you hover over the table. The result is the table shown below, where I reduced the categories to five, and crossed the automatically generated categories with the education level of the respondents in my data set.
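If you want to check this kind of crosstab outside Displayr, the same idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the cases, the field names `category` and `education`, and the `crosstab` helper are hypothetical and do not come from the survey data.

```python
from collections import Counter

# Hypothetical cases: each has a saved category and an education level.
cases = [
    {"category": "Religion", "education": "College"},
    {"category": "Religion", "education": "High school"},
    {"category": "Behavior", "education": "College"},
    {"category": "Movies", "education": "High school"},
    {"category": "Religion", "education": "College"},
]

def crosstab(rows, row_key, col_key):
    """Count cases for every (row value, column value) combination."""
    counts = Counter((r[row_key], r[col_key]) for r in rows)
    row_labels = sorted({r[row_key] for r in rows})
    col_labels = sorted({r[col_key] for r in rows})
    # Counter returns 0 for combinations that never occur.
    table = {
        row: {col: counts[(row, col)] for col in col_labels}
        for row in row_labels
    }
    return table, col_labels

# Education level in the rows, saved categories in the columns.
table, columns = crosstab(cases, "education", "category")
print(f"{'':<12}" + "".join(f"{c:>10}" for c in columns))
for row_label, cells in table.items():
    print(f"{row_label:<12}" + "".join(f"{cells[c]:>10}" for c in columns))
```

The sketch reports raw counts only. Percentages can be derived by dividing each cell by its row or column total.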