I'm going to start by describing the goal of text analysis. I will then look at the concept of machine understandable text.
Then we will move into the main sections, which deal with how to quickly summarize text, accurately summarize text, and both quickly and accurately summarize text.
Everything I am going to show you today is available in both Q and Displayr.
The wrong way to think about text analysis
One of the biggest challenges with text analysis is that often people have the wrong goal in mind.
Maybe they have asked people what they like about their phone company.
"What do you like about your phone company?"
The mistake to make is to think that you want to summarize the data as a cool visualization. Something like this.
The wrong objective: creating a cool visualization
But, don't take my word for this being a bad idea.
Have a look. What does this visualization tell you about why people like their phone company?
I think it says that service is the biggest thing people like, and that it goes with great, good, and customer.
Is that what you are taking out of it?
Back to the future
It turns out that the correct approach was invented before the term text analysis even became popular.
Text analysis is the middle step
Text analysis is the middle step. The objective should be to summarize the text as one or more variables. Either categorical or numeric.
That is, in the market research jargon, the goal is to code the data.
Code, then use standard analysis tools
Then we analyze it just like any other data. What can you see here?
This is the same data as before. Do you remember how the earlier visualization implied service was key? This chart says it's price and reliability, and service is trivial.
The visualization was cool. But, entirely misleading. The reason that it emphasized Service is that service is the word that most commonly appears with other words. It is a well networked word. But, just not that important.
The real problem to solve is how to efficiently code the data.
If you know you are going to use text analysis, you should take the time to collect machine-understandable text. If you can.
A non-savvy person using a chat bot
Consider how you talk to a chat bot. Or Siri. Do you talk like this? No, you don't. You understand that it will cause problems.
A savvy person using a chat bot
You talk like this.
To use the jargon, we talk in Pidgin. You will get better results from text analysis when you try and collect data in pidgin.
Collect machine understandable text
In the days before we used computers to do text analysis, it was clever to ask genuinely open-ended questions, like question 2 here.
Collect machine understandable text (part 2)
But a computer will do a better job at summarizing answers to this second version of Question 2.
Collect machine understandable text (part 3)
And even better with this one. Am I saying we need to always ask questions like QUESTION 2 on the right? No. It's a trade-off. But the closer we get to questions like this, the easier it is to automate their analysis.
Quickly summarizing text data
Often we want to find a fast way of summarizing text data.
Case study 1
I will illustrate this with one of my favorite case studies. We asked people how they feel about Tom Cruise.
Warning: it turns out some people don't love him.
I've deleted the most offensive comments.
The laziest reporting option is to give all the text responses to the end-user of the research. So they can figure out what they mean.
A word cloud's often a step forward
Chart > Word Cloud
Our word clouds are pretty cool. You can drag things off and we can merge things. And also create phrases.
But, as we all know. Word clouds are pretty superficial.
An alternative is to create a network diagram. These look cool. Lines show the strongest relationships. But, in the 20 years since I first saw one of these, I have never, ever, seen one that provides any insight. Look at this one. It tells us that Tom and Cruise are linked. Wow!
A word map is a bit like a word cloud, but words that appear together in the raw text are placed closer together. I used to like this a bit. But the newer techniques, which I will get to, are so much better that I never use word maps anymore.
Tables of common words
Another quick way of summarizing data is to create tables that count up common words. A few things are done to make these better:
- Capitalization is ignored
- Spelling mistakes are fixed
- Uninteresting words, like “he”, “at” and “the” are ignored
- Synonyms found
- There's lots of options for customizing these.
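For anyone who wants to see the idea outside the software, here is a minimal Python sketch of a table of common words. The stop-word list, the synonym map, and the sample responses are all made up for illustration, and spelling correction is left out. It is not how Q or Displayr build their tables.

```python
import re
from collections import Counter

# Hypothetical stop-word list and synonym map, for illustration only.
STOP_WORDS = {"he", "at", "the", "is", "a", "and", "his", "i", "of", "too"}
SYNONYMS = {"religious": "religion"}

def common_words(responses):
    """Count how many responses mention each word at least once."""
    counts = Counter()
    for text in responses:
        words = re.findall(r"[a-z']+", text.lower())  # capitalization is ignored
        # Drop uninteresting words and map synonyms onto a single label.
        kept = {SYNONYMS.get(w, w) for w in words if w not in STOP_WORDS}
        counts.update(kept)  # a set, so each response counts once per word
    return counts

# Made-up responses, not the real survey data.
responses = [
    "He is arrogant",
    "The religion",
    "Arrogant and religious",
    "I like his movies",
]
table = common_words(responses)
for word, count in table.most_common():
    print(word, count)
```

Counting each word once per response means the table reads as "number of people who said this", which is usually what you want with survey data.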
An improvement on these tables is tables that contain n-grams. An n-gram means a sequence of words that commonly appear together. I'm going to look for sequences up to 5 words.
Maximum n for n-gram identification: 5
Now it's found tom cruise in the fourth row. Again, it's a common technique, but rarely interesting with market research data.
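The n-gram idea can be sketched in a few lines of Python. This is a simplified illustration with made-up responses: it counts every word sequence of up to `max_n` words, with no stop-word removal or spelling correction, so it is not the same as the table shown in the software.

```python
from collections import Counter

def ngram_counts(responses, max_n=5):
    """Count every sequence of 1 to max_n consecutive words."""
    counts = Counter()
    for text in responses:
        words = text.lower().split()
        for n in range(1, max_n + 1):
            for start in range(len(words) - n + 1):
                counts[" ".join(words[start:start + n])] += 1
    return counts

# Made-up responses, not the real survey data.
responses = ["Tom Cruise is arrogant", "I like Tom Cruise", "arrogant"]
counts = ngram_counts(responses, max_n=5)
print(counts["tom cruise"])  # the two-word sequence appears in two responses
```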
Everything so far is text analysis 1.0. Let's look at the modern stuff.
Now for something cool. I'm going to automatically cluster the text data.
Insert > Text Analysis > Automatic Categorization > List of Items
It will automatically group this into 10 segments. I tend to find this too many, so will set it to 5.
Number of categories: 5
It's got to do some pretty serious math which takes a few minutes, so let's look at one I've created before.
Automatic categorization: pre-baked
As you can see, it's automatically named the five segments. The first segment is people that have said Arrogant or the word not. It's 37% of people.
An illustrative quote is: “He's gone off the deep end”. Let's look at what they've said.
I think it's done a remarkable job. This category groups people who think he has a personality disorder, being either arrogant or crazy.
This next segment's Faith. Let's expand it out. This example's pretty cool. But, as you can see it's automatically worked out that religion, faith, scientology, and church, are all related ideas and grouped them together.
Automatic translation: to English
Now for something quite magical. On the left I've got hotel reviews in lots of languages. On the right, it's automatically translated them and categorized them, and described the segments in English.
Automatic translation: to Chinese
Here, the summaries are in Chinese. So, the days of not being able to work out anything from your multi-language surveys are over.
Case study 2: tweets
For this example, I'm going to use tweets. The same analyses can be used with open-ended questions, but the twitter data will make it easier to see what's going on. The type of automatic categorization that we just used on the Tom Cruise data just looked for the strongest patterns.
An alternative technique, called entity extraction, searches for recognizable things, or entities, in data. Here's an example. The first column shows some tweets. The next column automatically extracts names. This is great for CSAT data, as you can automatically extract the names of team members from the quotes. The next column extracts URLs. Then twitter handles. Then states.
You get the idea.
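To make the idea concrete, here is a rough Python sketch that pulls URLs, twitter handles, and states out of a tweet using regular expressions. The patterns, the short state list, and the sample tweet are assumptions for illustration. Extracting people's names needs a trained model, so it is left out, and this is not the method the software uses.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
HANDLE_PATTERN = re.compile(r"(?<!\w)@\w+")
STATES = ["California", "New York", "Texas"]  # hypothetical short list

def extract_entities(tweet):
    """Return the URLs, handles, and states found in one tweet."""
    urls = URL_PATTERN.findall(tweet)
    # Remove URLs first so an "@" inside a link is not mistaken for a handle.
    without_urls = URL_PATTERN.sub(" ", tweet)
    handles = HANDLE_PATTERN.findall(without_urls)
    states = [s for s in STATES
              if re.search(r"\b" + re.escape(s) + r"\b", tweet)]
    return {"urls": urls, "handles": handles, "states": states}

# Made-up tweet for illustration.
tweet = "Thanks @displayr for the demo in Texas! https://example.com/demo"
entities = extract_entities(tweet)
print(entities)
```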
A more well-known technique is sentiment analysis. It reads through open-ended text and sums up the number of positive and negative words. For example, “Enjoy” is a positive word, so the first row of text has a sentiment of 1.
Looking at the fifth respondent, this is a much more negative tweet, and it has a score of -3.
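The counting logic is simple enough to sketch in Python. The word lists and example sentences below are tiny and made up. Real sentiment dictionaries are far larger, and the software's own scoring may differ in the details.

```python
import re

# Hypothetical, very small sentiment dictionaries.
POSITIVE = {"enjoy", "love", "great", "good"}
NEGATIVE = {"hate", "awful", "terrible", "worst"}

def sentiment_score(text):
    """Number of positive words minus number of negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives

print(sentiment_score("Enjoy the show"))                    # one positive word
print(sentiment_score("Awful, terrible, the worst movie"))  # three negative words
```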
I've shown sentiment analysis because it's a standard technique. But, it's rarely very useful with open-ended questions. The technique's always less accurate than, say, asking for people to rate their satisfaction. I find it's more useful with social data, where you can't ask open-ended questions.
Principal Components Analysis of Text Data
This is the most technical part of today. If you are familiar with factor analysis or PCA, you will love it. If not, we'll be back to normal broadcasting in a few minutes.
Sentiment analysis is what's known in the world of measurement as a confirmatory technique. We assume that people differ in their sentiment and we seek to confirm and measure this from the data.
Exploratory techniques instead don't assume what's interesting, and instead just try and find the strongest patterns in the data. The exploratory equivalent of sentiment analysis is principal component analysis with a single component. This technique, like the automatic categorization I showed you before, is only available in our software.
Insert > Text analysis > Advanced > Principal Component Analysis
This will also take a minute to compute, and I won't bore you by making you wait!
Principal component analysis of text data: 1 component
This technique creates a single numeric variable that summarizes the data.
People with a high value are similar to others with a high value and very different to people with a low value.
The table of loadings, which we are looking at now, shows the correlation of different words and phrases with the numeric variable. The way to read it is as follows:
- The strongest correlation is for people that have used the word nothing (or a synonym) anywhere in their text.
- We will send you a blog post with more detail about how this works.
So, the numeric variable we have estimated describes people in terms of whether they said the word Nothing and related words, or not.
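For the technically minded, here is a bare-bones Python sketch of the general idea: turn each response into 0/1 indicators for a few terms, find the first principal component by power iteration, and report loadings as the correlation between each term and the component score. The terms and the data are made up, and flipping the sign so the strongest loading is positive is my own convention. The software's actual method handles synonyms and phrases and will differ.

```python
import math

def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def first_component(rows, iterations=200):
    """Return (scores, loadings) for the first principal component."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]

    weights = [1.0 + j for j in range(p)]  # arbitrary non-zero starting point
    for _ in range(iterations):
        # One power-iteration step: weights <- X'(X weights), then normalize.
        scores = [sum(x[j] * weights[j] for j in range(p)) for x in centered]
        weights = [sum(centered[i][j] * scores[i] for i in range(n))
                   for j in range(p)]
        norm = math.sqrt(sum(w * w for w in weights))
        weights = [w / norm for w in weights]

    scores = [sum(x[j] * weights[j] for j in range(p)) for x in centered]
    loadings = [correlation([row[j] for row in rows], scores) for j in range(p)]
    # The sign of a component is arbitrary; make the strongest loading positive.
    if max(loadings, key=abs) < 0:
        scores = [-s for s in scores]
        loadings = [-value for value in loadings]
    return scores, loadings

# Made-up data: one row per respondent, 1 if they used the term.
terms = ["nothing", "arrogant", "religion"]
rows = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 1], [0, 0, 1]]
scores, loadings = first_component(rows)
for term, loading in zip(terms, loadings):
    print(term, round(loading, 2))
```

In this toy data the component separates people who said "nothing" (positive scores) from everyone else (negative scores), which mirrors the way the loadings table is read above.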
Principal component analysis of text data: 5 components
Here's one where I have set it to five components. Why 5? Like with coding and categorization in general, it's a judgment call. You play around and find the number of components that you find easiest to interpret. Let's add these five variables to the data set.
ACTIONS > Save variables
When you save components, you then need to name them. Looking at component 1, this is like our earlier solution. This is measuring people's tendency to say nothing. We’ll call it “Nothing”.
Component 2 is whether or not people said Tom Cruise. It's a negative correlation, so we’ll name it “Not Tom Cruise”.
Component 3 is people that have said religion. So we’ll call it “Religion”.
Component 4 is about Craziness. But, there's a negative sign, so it’s about not being crazy. So “Not crazy”.
And component 5 is “Not a good actor”.
Principal Component Analysis of Text Data: scores
Let's look at the variables we've created to get a better idea what's happened.
Let's look at row 1. A score of 0 means average. The person said Nothing. They've got a high score for Nothing. And below average for everything else. If we sort the table by columns, we can quickly get a feeling for what it means. Let's sort the Not Tom Cruise dimension.
Big negative scores are, as we would expect, people that said Tom Cruise.
Looking at Religion, people with low scores didn't say anything. And the people with the high scores are saying religion a lot. So, we have converted these text quotes into five numeric variables which summarize the strong patterns in the data.
Back to less complicated techniques now!
Accurately summarizing text data
Everything we've looked at so far was designed to give us a quick read of data. But what if we want to get an accurate read? The traditional approach in market research has been to perform what is called coding.
The first thing I will show is back coding. It's pretty simple, but I'm always surprised by how many people aren't familiar with it. In this example, people had an “Other” option and 20 people selected it. Back coding is the process of combining the categories with the raw responses.
We click on the variable containing the other specify responses.
Click on Cell Phones > Race Other > Insert > Text Analysis > Manual > Overlapping
We then hook it up to the variables in the closed ended data.
Inputs and Back Coding
Corresponding…: Race, OK
Rename Category 1 as White
So, Eastern Europe belongs in the white category.
Click on White
You get the idea.
We will return to the Tom Cruise data.
Data Sets > Tom Cruise > What don't you like… > Text Analysis > Manual Categorization >
When coding text data we have to decide if each respondent is permitted in one and only one category. That is, single response coding, where the categories need to be mutually exclusive. Or, whether we permit multiple response coding with overlapping categories.
For this example, I will code people into only one category:
You will recall that we've already learned a bit about the data, so we know what the key categories are. If we didn't, we could just read through the text and get an idea.
Right-click on New Category > Import Category Names and type in:
- Nothing - I like him
This first response is clearly nothing. You can see that the count has gone up.
And we have a new response to categorize.
This one goes in religion.
As you can see 28 people have said exactly Nothing, so we can categorize them all with a mouse click.
What I am doing here is typically known in market research as coding. But we refer to it as manual coding. The problem with it is that it's very slow. There's a better way. We call this semi-automatic coding.
Click Sort by: Fuzzy match
We saw before that religion was a key issue in the Tom Cruise data. So, we are going to sort the data by similarity to the word religion.
Fuzzy sort on: Religion
This is going to take a moment. It's warming up. While it does this, let me show you the latest version of Q.
Here's the same data again.
Right-click: Insert variable > Code Text > New Code Frame > Manual Categorization > Mutually Exclusive
So, this is basically exactly what we saw in Displayr. You can now do what I'm showing you in Q as well.
Just to remind you, we asked Displayr to do a fuzzy sort based on religion, and it's now sorted all the text this way. If we look at the first few responses, we see that they contain the word religion, as we would expect.
This orange bar shows how confident we are in our sort. We are a bit less confident for this item. Why's that? It is because the person has said faiths, rather than religion. So, in the background Displayr has worked out that these probably mean the same thing.
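A rough feel for sorting by similarity can be had with Python's standard `difflib` module. This sketch scores each response by its closest word to the target and sorts on that score, which plays the role of the orange bar. It only measures spelling similarity, so it would catch "religious" but not a true synonym like "faiths". Displayr is clearly doing something smarter in the background, and the responses here are made up.

```python
from difflib import SequenceMatcher

def fuzzy_sort(responses, target):
    """Sort responses by how closely their best-matching word resembles target."""
    def best_match(text):
        words = text.lower().split()
        return max(
            (SequenceMatcher(None, word, target.lower()).ratio() for word in words),
            default=0.0,  # an empty response gets a score of zero
        )
    scored = [(best_match(response), response) for response in responses]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Made-up responses, not the real survey data.
responses = ["Crazy", "He is too religious", "Nothing", "His religion"]
ranked = fuzzy_sort(responses, "religion")
for score, response in ranked:
    print(round(score, 2), response)
```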
And, now we are starting to see a bit of magic. The program has been smart enough to work out that Christian science is a religion.
Select all observations with a similar length of orange bar
Code as Religion
So, quick as a flash we've coded 22 responses.
Let's sort based on Crazy.
Fuzzy sort on Crazy
Select all observations with a similar length of confidence.
OK, so we've quickly got 23 people coded in the Crazy bucket.
Let's sort on arrogant.
Fuzzy sort on arrogant
So, lots of people have said things that pertain to arrogant.
Click on first option
Scroll down and shift click on conceited
So far, we have just been categorizing based on similarity to something we've typed. We can ask Displayr to fit a model in the background trying to predict based on what we have already categorized. Let's start by creating a model predicting religion.
Sort by: Similarity to Religion
It's not just looking at similarity to the word now. Instead, it's building a predictive model identifying how people that we have classified as religion are different to those that we put in the other categories. And, making predictions for the rest of the data set. Now we've found a whole lot more.
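To give a sense of what "fitting a model on what we have already categorized" can mean, here is a small naive Bayes classifier in plain Python. I am using naive Bayes purely as a stand-in. The transcript does not say which model Displayr fits, and the coded examples are made up.

```python
import math
from collections import Counter, defaultdict

def train(coded):
    """coded: list of (text, category) pairs that were categorized by hand."""
    word_counts = defaultdict(Counter)
    category_counts = Counter()
    vocabulary = set()
    for text, category in coded:
        words = text.lower().split()
        word_counts[category].update(words)
        category_counts[category] += 1
        vocabulary.update(words)
    return word_counts, category_counts, vocabulary

def predict(model, text):
    """Return the most probable category for an uncoded response."""
    word_counts, category_counts, vocabulary = model
    total = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, count in category_counts.items():
        score = math.log(count / total)  # prior: share of coded responses
        denominator = sum(word_counts[category].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are skipped
                # Add-one smoothing so unseen word/category pairs are not zero.
                score += math.log((word_counts[category][word] + 1) / denominator)
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Made-up hand-coded responses.
coded = [
    ("his religion", "Religion"),
    ("scientology is a cult", "Religion"),
    ("he is crazy", "Crazy"),
    ("jumping on the couch was crazy", "Crazy"),
]
model = train(coded)
print(predict(model, "scientology cult"))
print(predict(model, "crazy couch"))
```

The point is the same as in the demo: the model learns from the responses already coded, so it can pick up related wording that a simple match on the word "religion" would miss.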
So, in a few minutes we have already coded a lot of the data. Lots of time has been saved.
I'll put the rest into Other.
Code as Other.
Now, we can use it just like any other variable.
And, we can quickly see that in the Midwest Religion was much more likely to be raised as an issue.
Quickly and accurately summarizing text data
Of course, often you want to be both quick and accurate.
Automatic categorization based on a manual or semi-automatic categorization
Insert > Text Analysis > Automatic > Unstructured Text
We used this tool before to automatically code the Tom Cruise data. Then, we used the manual and semi-automatic categorization to do something better.
What we can now do is pass that more accurate categorization into the automatic categorization as an input. This will take a while to compute, so here's a different example, where I've already hooked it up.
Here's a dashboard that shows me complaints data in July and August. This chart is showing coded responses in four categories. And, we can see the most recent verbatims to the right.
This is based on coded data.
I coded 895 cases using the semi-automatic technique I showed before, with data about reasons for liking cell phone providers. You can see the groups here. I then hooked it up to the automatic categorization, just as I showed you before. At the bottom of the table, it's showing predictive accuracy.
The writing's a bit small, so you have to squint. But the table below is showing accuracy, and it says that with a sample as small as 200, we can accurately predict the coding 97.4% of the time.
What this means is I can now import a revised data file, and it will automatically code this file with a high level of accuracy.
Click on data set
Choose update file
This will take a few minutes to update the text analysis. So, let's look at one I did before. As you can see, the chart is now updated all the way through to September, and we've accurately coded more than a thousand responses automatically. In this case study, the whole report updates automatically, from data checking and coding through to all the charts.
Case study 3
In market research we often ask people to type lists into text boxes. For example, which cell phone providers can you think of?
Cell phone spontaneous
This type of data is much more structured than the examples we've looked at before. This structure makes accurate automation much more straightforward.
Automatically code spontaneous awareness data
Insert > Text analysis > Automatic categorization > List of items
Drag across Mobiles > Spontaneous awareness
As you can see, it's automatically identified a list of phone carrier brands. Look at all the different variants of Verizon that it's found in the first line. While Displayr's been pretty clever, we've still got a bit of work to do. Note that:
- AT&T also appears as Att
- We've got Tmobile and T-Mobile
We can merge these categories
Scroll down object inspector
Press REQUIRED CATEGORIES > Add data
We can give Displayr more detailed instructions. There are many more options. It's always a tradeoff about how much time you want to spend optimizing. Now let's just save the first category selected, which is what's known as spontaneous awareness in the trade.
ACTIONS > Save first category
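The merging and first-category steps can be sketched in Python. The alias table below is hypothetical and hand-written. In the software the variants are found automatically and you only tidy up the leftovers.

```python
import re

# Hypothetical alias table: normalized spelling -> brand label.
ALIASES = {
    "att": "AT&T",
    "tmobile": "T-Mobile",
    "verizon": "Verizon",
    "verizonwireless": "Verizon",
}

def normalize(name):
    """Map a typed brand name onto its category, ignoring case and punctuation."""
    key = re.sub(r"[^a-z0-9]", "", name.lower())
    return ALIASES.get(key, name.strip())

def code_list(response):
    """Split a typed list into brands, in the order mentioned, without repeats."""
    coded = []
    for item in re.split(r"[,;\n]", response):
        if not item.strip():
            continue
        brand = normalize(item)
        if brand not in coded:
            coded.append(brand)
    return coded

# Made-up response for illustration.
brands = code_list("Att, T-Mobile, verizon wireless")
first_mention = brands[0]  # the first category, i.e. spontaneous awareness
print(brands, first_mention)
```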
Tidied word cloud
Drag across First categories
Chart > Word cloud
So, this is a better word cloud as we've used the accurate automated text analysis to tidy everything up.
Spontaneous awareness over time
Drag across variable
Crosstab by Interview date
So, we can see that Verizon's awareness was low in July.
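Once the text is coded, a crosstab like this is ordinary counting. Here is a small Python sketch that computes the percentage of first mentions for each brand by month. The records are invented for illustration and are not the figures on the chart.

```python
from collections import Counter, defaultdict

def awareness_by_month(records):
    """records: (interview_month, first_mentioned_brand) pairs.

    Returns {month: {brand: percentage of that month's respondents}}.
    """
    totals = Counter(month for month, _ in records)
    counts = Counter(records)
    table = defaultdict(dict)
    for (month, brand), n in counts.items():
        table[month][brand] = round(100 * n / totals[month], 1)
    return dict(table)

# Invented records, for illustration only.
records = [
    ("July", "AT&T"), ("July", "AT&T"), ("July", "Verizon"), ("July", "T-Mobile"),
    ("August", "Verizon"), ("August", "Verizon"), ("August", "AT&T"),
    ("August", "Verizon"),
]
crosstab = awareness_by_month(records)
print(crosstab)
```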