I'm going to show you, in about 20 minutes, how to do your own market segmentations. After a recent webinar, a Q user asked if we could instead show things in Q, rather than Displayr. The reason we don't is that Displayr is designed for both analysis and presenting, so it just works a lot better in webinars. But, as you will see, I've added detailed instructions for how to do things in Q.
1. Convert ratings and text to numeric
The first step is to convert ratings and text data to numeric variables. Half of this you will probably know. Half is, I guarantee, completely new and very exciting!
How to convert ratings to numeric values
As promised, there are detailed instructions.
Convert ratings to numeric
I will start with the bit you probably know.
I've got three satisfaction ratings that I want to use in my segmentation. Currently, these are ordered categories. I need to convert them to be numeric. How you do this depends a bit on what program you use. In our products, Displayr and Q, we need to explicitly change the structure of the data.
- Data Sets > Satisfaction
- Change Structure to Numeric - Multi
- Press Values
Higher values are associated with higher ratings, so that makes sense.
But, I need to set the Don't knows as missing.
- Don't knows: Exclude from analysis
- By: RAW DATA
So now we have numbers.
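If you're working outside Displayr or Q, the same recoding can be sketched in Python with pandas. The scale labels and column names below are made up for illustration; the key idea is that "Don't know" maps to missing (NaN) rather than to a number:

```python
import numpy as np
import pandas as pd

# Hypothetical ordered rating scale; "Don't know" becomes missing (NaN)
rating_values = {
    "Very dissatisfied": 1,
    "Dissatisfied": 2,
    "Neutral": 3,
    "Satisfied": 4,
    "Very satisfied": 5,
    "Don't know": np.nan,  # excluded from analysis
}

# Made-up satisfaction ratings for three respondents
satisfaction = pd.DataFrame({
    "overall": ["Satisfied", "Don't know", "Very satisfied"],
    "network": ["Neutral", "Dissatisfied", "Satisfied"],
})

# Convert every rating column to numeric values
numeric = satisfaction.apply(lambda col: col.map(rating_values))
print(numeric)
```

Higher numbers mean higher satisfaction, and the "Don't know" respondent simply drops out of any analysis that ignores missing values.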
Convert text to numeric
Now for something new! My colleague Justin #1 has been hard at work putting the finishing touches on this for this webinar.
The text is showing why people like their phone company. How do we convert this into numbers?
We do it using a special form of principal components analysis that we've invented for text data!
Insert > Text Analysis > Advanced > Principal Components Analysis (Text)
Drag Likes as Variable
This one takes a moment to compute, so I've pre-done it.
Convert text to numeric - pre-baked
I'm going to do this section quite quickly. If you have never done factor analysis or PCA before, the next few seconds will be a bit hard to follow. Don't worry, there is an easy-to-read blog post on it.
The analysis has represented the text data as two numeric variables.
These numeric variables are explaining about 21% of the information that we can quantify in the text data.
We could, if we wanted, add more components. This would give us a more granular understanding of the text data. And, if my goal today was to do text analysis I would do that. But, for the moment we will be happy with these two variables.
We can save these variables to the data set:
Click on PCA output
ACTIONS > Save variables
Note that a new set of variables has appeared. The first of these variables is strongly negatively correlated with service. So, the variable measures a dimension of service versus not service.
This second component loads negatively on people who said Nothing. That is, people who liked something score higher on it.
Let's have a look at the raw data. Looking at the text in the background, respondents 7 and 8 both said Nothing. Now look at the scores: respondents 7 and 8 have scores of 0 on the first dimension, that is, we can't say they are high in service nor low in service. And they have a big negative value on the second, telling us that they liked nothing. So, the numbers are accurately representing their text. Cool!
OK, so we now have two variables that summarize the text data. As mentioned before, if we want a more granular representation of the text, we could just create more components.
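Displayr's text PCA is its own method, but a rough off-the-shelf analogue is to turn the text into a TF-IDF term-document matrix and run truncated SVD (latent semantic analysis). The sample responses below are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Tiny made-up sample of "What do you like about your phone company?" responses
likes = [
    "great customer service",
    "nothing",
    "good coverage and service",
    "nothing at all",
    "cheap plans and good service",
]

# Represent the text as a TF-IDF term-document matrix
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(likes)

# Reduce to two numeric components, as in the webinar
svd = TruncatedSVD(n_components=2, random_state=0)
scores = svd.fit_transform(X)  # one row of two scores per respondent

print(scores.shape)                      # (5, 2)
print(svd.explained_variance_ratio_)     # share of information captured
```

Each respondent ends up with two numeric scores, ready to feed into a segmentation alongside the ratings; adding components gives a more granular representation, just as described above.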
How to conduct principal components analysis
Here are the instructions. As mentioned, we'll shortly also send you a blog post which gives a lot more detail.
Create 4 segments
Having prepared the data, we now need to move on and create some segments. The place to start is to create four segments using latent class analysis.
Why four? It is, by far, the most common number that people end up using, so it's a good place to start.
Latent class analysis? It's just a lot easier to use. Older techniques like K-Means involve a lot more steps and it's too easy to make mistakes.
The first real benefit of latent class analysis is that it can deal with any type of data. In our example, we use three sets of numeric variables, but you can also use categorical data, conjoint, and MaxDiff. With older techniques like k-means, you must first find a way to make such data numeric, which is both hard and leads to information loss.
The second reason to use latent class analysis is that it automatically rescales the data in the background. With traditional clustering methods, you need to manually do that.
The third reason for using latent class analysis is that it automatically takes care of missing values. Some clustering software does that as well, such as ours, but most doesn't.
Latent Class Analysis
Insert > Groups/Segments > Latent Class Analysis
I'm going to choose three sets of variables: the variables from the text analysis that we just created, the satisfaction variables, and some measures the data has on needs.
By default, the software will automatically choose the number of segments using the Bayesian Information Criterion. This may sound fancy, but algorithms for automatically selecting the number of segments are sadly fool's gold. We will change this to four.
Specify the number of groups: 4
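Displayr's latent class engine handles mixed data types and missing values natively. For purely numeric inputs, a close relative you can sketch with standard tools is a Gaussian mixture model, which is latent class analysis for continuous variables. The data here is simulated, not the webinar's survey:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulated numeric segmentation inputs: 200 respondents, 6 variables
X = np.vstack([
    rng.normal(0, 1, (100, 6)),
    rng.normal(2, 1, (100, 6)),
])

# Fix the number of groups at 4, as in the webinar
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
segments = gmm.predict(X)        # segment membership per respondent

print(np.bincount(segments))     # segment sizes
print(gmm.bic(X))                # BIC, if you did want to compare solutions
```

Note the model assigns each respondent a probability of belonging to each segment; `predict` just picks the most likely one.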
Create latent class analysis
One of my regrets in life is the way that I designed this output. We'll use something better that my colleague Carmen created.
Insert > Groups/Segments > Segment Comparison Table
I'm going to compare the segments by the variables we used to create them, and some demographics.
A variable containing the segments was automatically added to the data file, so we will use that as well. The nice thing about this output is that it only shows significant differences in black and uses colored boxes to emphasize them.
The first thing to note here is that there's very little color on this table. Segment 4 has higher mean values on the variables we created from the text. That is, we have found a segment of people who like the service they receive from their cell phone company. But everything else is grey. This is telling us that all the data we have thrown in is measuring really different things, so we are just getting a grey segmentation driven by the text variables.
To add color, we need to be a bit more selective about which variables to include.
Let's see what happens when we remove Text data.
Inputs > Modify > Remove text variables.
If you haven't seen our software before, note that it's automatically updated the table. We didn't need to redo anything manually.
OK, that's much better.
Let's try and figure out what it means.
Segment 1 is 18% of the market. Their highest score is for streaming speed, and they've got relatively high scores for the various data-related attributes and entertainment. Ah, that makes sense. Look at age: the data shows that this segment skews away from older people.
What about the second segment? They’re a little over a third of the market. Their highest score is for price. They’re very, very price-conscious, and they have relatively high scores for the quality of voice calls, coverage, and unlimited calls within the US. Segment two are Cheap Talkers, so we would expect them to be a little older, and if we look at the demographics, they do skew to being a bit older and they skew away from being Hispanic.
Let's look at segment 3. They also like voice quality and coverage. In lots of ways, they’re similar to segment two but, price is not an issue for them and they’re a lot more global in terms of who they wish to call. I'd call them Happy Talkers.
Let's look at segment 4. They have relatively average scores on everything. They care about everything. Often that's a bad sign in segmentation. However, in this case, if we look at the demographics, a different story emerges. They’re strongly skewed toward the 35 to 44 age group, they’re less educated than the other segments, and they’re much less likely to identify as White. We’re looking at what seems to be a lower socio-economic group, so the reason we’re seeing these average scores is probably that everything is somewhat important to them. They probably have the same streaming and data tendencies as Segment 1, but they have more economic constraints, and they probably have people in their social networks with fewer electronics, so they have to rely more on calls.
Here are the instructions on how to conduct a latent class analysis in Q or Displayr.
3. Create lots of alternative segmentations
We have already created and compared two segmentations, by changing the input variables.
A lot of people, when they are new to segmentation, are a bit surprised by the subjectivity of choosing which variables to include. But that's the whole point. Segmentation isn't really about statistics. It's about strategy. The goal is to use trial and error to find the best solution. The more time you spend on it, the better.
There are lots of other ways we can experiment. The next easiest thing is to see what happens with 3 segments, 5 segments and so on.
Then, there's some more exotic things you can do. You can find out about them in more detail in our ebook.
Compare and choose the best
If you do what I've described, you will end up coming up with a number of good segmentations. It's then a good idea to be quite systematic in comparing them, getting as many stakeholders involved as you can.
There are three or four key things we want to take into account when choosing:
- Is the segmentation related to the data not used in the segmentation, such as age and gender? The stronger the better.
- Was it easy to name the segments?
- When we look at it, does it inspire us to come up with strategies?
And, sometimes, we also need to ask whether we have a segmentation that is easy to predict with a small number of variables. If we do, it makes it easy to ask people a small number of golden questions to work out what segment they are in.
Finding the golden questions is a very simple problem for machine learning. Start with a random forest.
Insert > Machine Learning > Random forest
Drag latent class to Outcome
Drag satisfaction and needs to predictors
With all the variables used, we get a little over 90% predictive accuracy. We can see here that the key predictors are unlimited calls to Canada and Mexico, Price, Premium Entertainment, Quality of the Voice Calls and Coverage.
Click into Data Sets > Needs
Select all the variables mentioned above.
Let’s delete the others and drag across those we selected.
So, with only these 5 variables, we can still predict around 86% to 87%, which means we’ve only lost about 3% of predictive accuracy while getting rid of the majority of the variables. That suggests these 5 variables are good to use. But, almost always with segmentation, we can improve on this by using a Support Vector Machine. And look, its accuracy is 92% to 93%.
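The golden-questions workflow above can be sketched with standard tools: rank predictors with a random forest, keep the top five, and re-fit both a random forest and an SVM on the reduced set. The data below is simulated, so the accuracies won't match the webinar's numbers:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Simulated data standing in for the needs and satisfaction variables
X, y = make_classification(n_samples=300, n_features=15, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# Step 1: a random forest ranks the predictors of segment membership
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top5 = np.argsort(rf.feature_importances_)[-5:]  # five "golden questions"

# Step 2: re-fit using only those five variables, and try an SVM as well
rf_acc = cross_val_score(RandomForestClassifier(random_state=0), X[:, top5], y).mean()
svm_acc = cross_val_score(SVC(), X[:, top5], y).mean()
print(f"random forest: {rf_acc:.2f}, SVM: {svm_acc:.2f}")
```

Cross-validated accuracy is used here so the comparison between the two models isn't flattered by overfitting.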
As mentioned, there's a lot more technical information in our ebook, or you can book a demo.