Webinar

Correspondence analysis: The magical technique for quickly finding the story in your data

Correspondence analysis is the one advanced technique all quant researchers should know. It quickly finds the pattern in any table, large or small.

You can explore the document featured in the webinar here.

This webinar will cover

  • How to use normalization, scaling, and rotation methods to get the best model.
  • How to visualize your results, from Moon plots to Bubble charts to Heatmaps.
  • How the mechanics behind the analysis work.

Transcript

Correspondence analysis is my go-to technique when I've got a large table of data and I want to quickly understand, and communicate, what it means. If you've never done correspondence analysis, you should get enough out of this webinar to start doing it on your own. If you are an expert, I promise you will also learn some cool new things.

As always, I am presenting from within Displayr, as it's designed for interactive presentations. But everything I show can be done in Q as well.

Today I will show you when and how to use correspondence analysis, how to interpret the outputs, and we will then cover the more technical concepts of normalization, scaling, rotation, variants of correspondence analysis and different ways of visualizing the results.

Car example - correspondence analysis

In Displayr, correspondence analysis is one of those refreshing techniques that's entirely automated. You can just click the button and it works out of the box. You can find it in the dimension reduction menu in both Displayr, which we are now looking at, and our other product, Q.

We want to do a correspondence analysis of a table, so we will click this option. Now I just need to hook it up to the car table.

What can you see? From this, we can see that Volkswagen and Opel, in the top left, are popular. Prius, at the bottom, is Green. Audi and Mercedes are luxury.

I bet you saw that a lot faster than on the earlier table.

Brand personality
Here's a much bigger table. It's so big that we can only see about 10% of it. The whole table shows 42 brands by 15 attributes. How long would it take you to figure out what it means?

Brand personality - correspondence
Now, here's the correspondence analysis map. What can you see? Yes, it is very detailed. But, it's a lot simpler than the table. We can see that the jeans and shoe brands, in the top left corner, are Tough and Outdoorsy. In the bottom left we can see that Calvin Klein, Amex, and Lexus are upper class.

Interpretation
There is one thing you need to know. Don't worry. It's simple. But, it's not obvious. But you will know it soon!

Interpreting a simple example
Here we have a very simple crosstab. What can we see? Diet Pepsi is strongly associated with 50 or more. Diet Coke is very weak with the 18 to 24s. Pepsi is strong with the 25 to 49s. Coke is the most popular brand with all the age groups. Coke's stronger among the 18 to 24s.

Now, let's let correspondence analysis do it. Let's work through the map and see how it shows the conclusions.

The mistake
The rookie mistake is to look at how close things are together. Most people, when they are new to correspondence analysis, note that Coke and 18 to 24 are close together and interpret this as being the key pattern. But, it's actually one of the weakest patterns revealed. The strongest pattern is that Diet Pepsi is strongly correlated with being 50 or more, as we saw before when looking at the table. So, how do we read this plot correctly to see this?

Rules for interpretation
The closer row labels are together, the more similar they are. In this case the row labels of the table we are analyzing are the brands. So, Diet Coke and Coke Zero are shown to be similar in terms of their age profiles.

The closer the column labels, the age groups in this example, the more similar they are. So, the 50 or more are marginally more like the 25 to 49s than the 18 to 24s.

Now for the tricky bit. Before I said that the relationship between 50 or more and Diet Pepsi is shown to be strong. Why is that? We start by drawing lines from the two labels to the middle.

We then look at the angle. The smaller the angle, the stronger the correlation.

The angle is very small, so there's a strongish correlation. Then we look at the length of the lines. The longer the lines, the stronger the correlation. As the 50 or more is about average for Age, but Diet Pepsi is furthest from the middle, this tells us that this is the strongest pattern revealed by the data.

And, if you love math, there's a formula. But, using your eyes is always good enough and once you have done it a few times it becomes second nature and intuitive.

Now, if things are on opposite sides of the origin, that's also meaningful. We can see here that Diet Coke and 18 to 24s have a strong negative relationship, as the angle is very big and the lines are long.

There's a moderate relationship between Pepsi and 25 to 49s. And between 18 to 24 and Coke. Yes, the angle is very small. But the lines are short.
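For those who love the math, the formula mentioned above is usually given as the two line lengths multiplied by the cosine of the angle between them, which is the same thing as the dot product of the two points. I am assuming that is the formula meant here. The sketch below is a minimal Python illustration; the coordinates are invented for the example and are not read off the map in the webinar.

```python
import math

def association(row_xy, col_xy):
    """Return (angle in degrees, row line length, column line length, strength).

    Strength is length * length * cos(angle), i.e. the dot product of the
    two points measured from the origin (the middle of the map).
    """
    rx, ry = row_xy
    cx, cy = col_xy
    len_r = math.hypot(rx, ry)
    len_c = math.hypot(cx, cy)
    if len_r == 0 or len_c == 0:
        raise ValueError("a point at the origin has no direction")
    dot = rx * cx + ry * cy
    # Clamp to guard against tiny floating point overshoot before acos.
    cos_angle = max(-1.0, min(1.0, dot / (len_r * len_c)))
    return math.degrees(math.acos(cos_angle)), len_r, len_c, dot

# Hypothetical map coordinates, chosen only to mimic the patterns described.
diet_pepsi = (0.9, 0.1)
age_50_plus = (0.4, 0.05)
coke = (-0.1, 0.05)
diet_coke = (0.5, -0.4)
age_18_24 = (-0.4, 0.2)

for name, a, b in [
    ("Diet Pepsi vs 50 or more", diet_pepsi, age_50_plus),
    ("Coke vs 18 to 24", coke, age_18_24),
    ("Diet Coke vs 18 to 24", diet_coke, age_18_24),
]:
    angle, len_r, len_c, strength = association(a, b)
    print(f"{name}: angle={angle:.1f} deg, lengths={len_r:.2f} and {len_c:.2f}, strength={strength:.3f}")
```

With these made-up points, a small angle with long lines gives a large positive strength, a small angle with a short line gives a small positive strength, and an angle near 180 degrees gives a negative strength.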

Interpretation is all about relativities
Here are the results we just obtained from the map and below are the earlier results from the table. Fortunately, we've got the same conclusions. But, there's one exception. The map is only showing relativities. There's no way we can see from this that Coke is strong in all the age groups.

And, if you think about it you realize that this is inevitable. You can't both show that Coke is close to all the age groups and also show which one it's most closely correlated with, as that would put Coke at multiple points on the map. I'll return to how we can remedy this in the section on visualization.

The distance from the middle
As I explained, we understand relationships between the row and column labels by drawing lines to the center. If you think about this for a while you will see it has an important implication: the closer something is to the middle of the map, the less information is revealed about it. We will return to this.

Patterns are determined by point of view
This next point's profound. And, a bit hard on the brain. I'll start with an analogy. Correspondence analysis maps present a view of the data. This is just like a drawing of a face. Depending on the point of view, conclusions change.

Note how the distance between the left eye and the tip of the nose seems to change from view to view. And, some features are only revealed with some views. The mug shot, shown at the top left, doesn't reveal the fat rat’s tail at all.

Information is lost
One of the ways that we can address the issue of point of view is to rearrange things. For example, a three-dimensional globe is a poor map when viewed in two dimensions. We can't see Australia at all on the left. The map on the right is better for most purposes. But it is very misleading as well.

For example, what's the distance between Sydney Australia and London? This map implies you need to go via the US.

Different world map
If your goal is to understand the distance between London and Sydney, this is a better map. But, it's worse if you want to know the distance between Sydney and Los Angeles.

Dimension compression leads to loss of information
When we show the 3-dimensional world in 2 dimensions, information is lost. Returning to our earlier table of 42 brands by 15 attributes, that's data in 15 dimensions! All squashed into 2 dimensions. So, a lot of information will be lost.

57% explained
The map shows us how much of the information has been explained. The horizontal and vertical dimensions are explaining 34.4% and 22.2% respectively. In total, 57% of the variance in relativities that could be shown is shown.

We've compressed the data from 15 to 2 dimensions, and only lost 43% of the information. That's pretty good. And that's why correspondence analysis is so useful. But, it's not perfect. Some information will be lost. That's the nature of summarizing.
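The percentages on the axes come from the singular value decomposition at the heart of correspondence analysis. The following is a minimal sketch of the textbook calculation using NumPy, not Displayr's own code. The counts are made up, and names such as `table` and `share` are only for illustration.

```python
import numpy as np

# Made-up counts: 5 row categories by 4 column categories.
table = np.array([
    [120, 150, 90, 40],
    [30, 60, 70, 50],
    [20, 50, 40, 60],
    [60, 80, 50, 20],
    [10, 30, 80, 90],
], dtype=float)

P = table / table.sum()      # proportions of the grand total
r = P.sum(axis=1)            # row masses
c = P.sum(axis=0)            # column masses

# Standardized residuals: how far each cell is from what independence predicts.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# The squared singular values are the inertia ("variance") of each dimension.
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
inertia = sv ** 2
share = inertia / inertia.sum()

# A 2-D map keeps the first two dimensions; the rest is the information lost.
two_dim_share = share[:2].sum()
print("Share per dimension:", np.round(share, 3))
print("Shown on a 2-D map: ", round(float(two_dim_share), 3))
print("Lost:               ", round(float(1 - two_dim_share), 3))
```

The total inertia equals the table's chi-square statistic divided by the sample size, which is why the map is about departures from the average rather than raw sizes.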

Always check
And that means it's always appropriate to go back and check that any key conclusions can also be seen in the data table.

Normalization and scaling
If you use the default correspondence analysis you will get the best overall summary of the data. But there are some tweaks you can do that can make it better, in a given situation. Two of them relate to normalization and scaling…

Normalization and scaling options
All correspondence analyses make technical assumptions about how data is normalized and scaled. In Q and Displayr, this is, by default, set to something called Principal normalization.

What this means is that the map is designed to:

  • Be as accurate as possible in representing distances between rows, e.g., brands.
  • Be as accurate as possible in representing distances between columns, e.g., attributes.
  • Be a bit less accurate in terms of how it shows the relationships between rows and columns.

If we have a table where the rows show brands and the columns attributes, we are typically most interested in the brands.

Consequently, it's often better to use a different normalization called Row Principal. This still tries to represent the relationship between the rows as accurately as possible. It is more accurate in terms of showing the relationship between the rows and columns. But, it is less accurate in terms of showing the distance between the column labels. This has been traded off. And, it can be hard to read. We can improve the map further by scaling. Let me show you.

Row principal (scaled)
This is the default map. You can see here that the normalization is set to Principal. Here's the underlying data. Note that brands are shown in the rows. Thus, we are most interested in accurately showing the relationships among the brands, and between the brands and the attributes. The relationship between the attributes is less important. This means that the row principal normalization will be more accurate. But, it's a mess. A mess we can fix by scaling.

So, this is the most accurate map for our purposes. I would stress that the default map was really good as well. We are just polishing here. Check out our ebook if you want more information.

While I'm on this slide, note what a nice job the software's done in not letting any of the labels overlap. It's a really smart algorithm that Po and Kyle created. But, not as great as the human brain, so you can move things around if you want to improve.
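To make the trade-off concrete, here is a hedged NumPy sketch of the textbook definitions, which I am assuming match these options: Principal puts both rows and columns in principal coordinates, while Row Principal keeps rows in principal coordinates and puts columns in standard coordinates. The counts are made up, and the scaling constant at the end is only one plausible choice for illustration, not necessarily the one the software uses.

```python
import numpy as np

# Made-up counts: rows could be brands, columns attributes.
table = np.array([
    [120, 150, 90, 40],
    [30, 60, 70, 50],
    [20, 50, 40, 60],
    [60, 80, 50, 20],
    [10, 30, 80, 90],
], dtype=float)

P = table / table.sum()
r = P.sum(axis=1)
c = P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

row_principal = (U * sv) / np.sqrt(r)[:, None]     # rows, principal coordinates
col_principal = (Vt.T * sv) / np.sqrt(c)[:, None]  # columns, principal coordinates
col_standard = Vt.T / np.sqrt(c)[:, None]          # columns, standard coordinates

# Indexed residuals: each cell relative to what independence would predict.
indexed_residuals = P / np.outer(r, c) - 1

# Row Principal: row-by-column dot products reproduce the residuals.
print(np.allclose(row_principal @ col_standard.T, indexed_residuals))
# Principal: the same dot products do not reproduce them.
print(np.allclose(row_principal @ col_principal.T, indexed_residuals))

# Scaling (assumed form): shrink every column point by one constant so the
# columns sit in the same range as the rows on the first two dimensions.
scale = np.abs(row_principal[:, :2]).max() / np.abs(col_standard[:, :2]).max()
col_scaled = col_standard * scale
print(np.allclose(row_principal @ col_scaled.T, scale * indexed_residuals))
```

Because every column point is multiplied by the same constant, all angles and all relative line lengths are unchanged, so the row-to-column relativities survive the scaling.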

Rotation
As I mentioned before in the example of the face, the distance between the eye and the nose is a function of the view that we take. You can rotate a head to get different views. In much the same way, you can rotate a correspondence analysis to make it most accurately represent a particular aspect of the data.

Rotated to focus on mini cooper
What does this map reveal about Mini Cooper? Before today, some of you may have said that Mini Cooper has a similar positioning to BMW and Nissan Qashqai. But, now, hopefully, you will say Mini Cooper is near the middle of the map, so the map tells us little about Mini Cooper.

But what if you work for Mini? This means the map is useless. As useless as a mug shot for working out if somebody has a rat's tail. Fortunately, there's an easy fix. We need to rotate the correspondence analysis to focus on Mini Cooper.

We can now see that Mini Cooper is uniquely associated with City, and has few competitors in this space, with Fiat 500 being the closest competitor.

What's going on in the background? The default map shows the best two dimensions for representing all of the brands. Now, we are extracting the best two dimensions for explaining how Mini Cooper competes, which may have been hidden in dimensions 3 and 4, and invisible on the default map.

This is a technique that we've invented. But we've published it in the International Journal of Market Research, and there's more detail in the ebook.

Variants of correspondence analysis
So far, I've been showing you a technique called correspondence analysis. There are some other more exotic variants. They are sometimes useful. But, 99% of the time the standard technique is the best. So, if you are feeling overwhelmed, just ignore this section, and tune back in when I get to data visualization.

Square table
Sometimes you have a table that is square. That is, it has the same labels in the rows as in the columns. This one shows brand switching.

401 people purchased Cornflakes two times in a row.
194 first purchased Cornflakes, then Weetabix.

Correspondence analysis can be used to summarize this table by showing which brands operate as substitutes.

Cereal correspondence analysis
But, the standard correspondence analysis plots both the row and column labels. As these are the same, we end up with all the labels appearing twice.

Correspondence analysis of a square table
Fortunately, there's a special variant of correspondence analysis for this. So, that's much neater.

Multiple correspondence analysis
A lot of people hear of multiple correspondence analysis and think it must be multiple times more useful than correspondence analysis. This is true in other areas of statistics. For example, multiple regression is much more useful than simple regression.

But the reality is the reverse. The standard correspondence analysis is actually much more useful than multiple correspondence analysis. But I will show you what multiple correspondence analysis does anyway.

In Displayr go: Insert > Dimension Reduction > Multiple Correspondence Analysis > Table

Previously all our analyses have been based on tables. But multiple correspondence analysis allows us to select multiple variables.

In Displayr go: Select Data Sets > Age, Gender, Preferred Cola and Drag into box

So, now we are looking at the relationship between age, gender, and brand. We can see that Diet Pepsi is still for older people, and weakly correlated with females. Note that it is explaining 43.1% + 21.8%, which is about 65% of the variance.

Brand preference by age and gender
But, there's a smarter way of creating a plot showing age, gender, and brand. It's to create a table like this and use standard correspondence analysis.

Correspondence analysis
Note that it is still showing a strong correlation between Diet Pepsi and 50 or more. And, the correlation with female appears stronger than on the previous map.

Now, let's look at the quality. It's explaining 71.2% + 20.7%, which is 92% of the variance. Remember, the multiple correspondence analysis was only explaining 65%. So, this is much better. Why?

The big table…
In the background, multiple correspondence analysis is trying to summarize all the relationships between all the variables, from a table a bit like this.

That's a lot of info, and a lot gets lost.

But, if you look at the table, a lot of the information isn't that interesting. We are interested in the relationship between brand with age and gender. But this big table is also showing us the relationship between age and gender, which isn't our interest.
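As a rough illustration of the smarter table, the sketch below stacks a brand-by-age crosstab next to a brand-by-gender crosstab, so the table carries only the relationships of interest. It uses plain Python; the respondent records are invented, and putting brands in the rows is my assumption about the table's orientation.

```python
from collections import Counter

# Hypothetical respondent records: (age group, gender, preferred cola).
respondents = [
    ("18 to 24", "Male", "Coke"),
    ("18 to 24", "Female", "Coke"),
    ("25 to 49", "Male", "Pepsi"),
    ("25 to 49", "Female", "Coke Zero"),
    ("25 to 49", "Female", "Diet Coke"),
    ("50 or more", "Female", "Diet Pepsi"),
    ("50 or more", "Female", "Diet Pepsi"),
    ("50 or more", "Male", "Coke"),
]

brands = sorted({brand for _, _, brand in respondents})
ages = ["18 to 24", "25 to 49", "50 or more"]
genders = ["Male", "Female"]

# Two ordinary crosstabs: brand by age, and brand by gender.
by_age = Counter((brand, age) for age, _, brand in respondents)
by_gender = Counter((brand, gender) for _, gender, brand in respondents)

# Stack them side by side. There is no age-by-gender block here, which is
# the part of the big table that is not of interest.
columns = ages + genders
stacked = [
    [by_age[(brand, age)] for age in ages]
    + [by_gender[(brand, gender)] for gender in genders]
    for brand in brands
]

print("Brand".ljust(12), columns)
for brand, row in zip(brands, stacked):
    print(brand.ljust(12), row)
```

The resulting `stacked` table can then be fed to a standard correspondence analysis in place of a single crosstab.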

So, as I mentioned multiple correspondence analysis is usually not as good as traditional correspondence analysis.

But, having said that, in two weeks we will do a webinar on factor analysis, and I'll talk about a useful application of multiple correspondence analysis then.

Multiple tables
Another variant of correspondence analysis is for when you want to look at multiple tables. Here I have brand imagery for tech companies from 2012 and 2017. How have these changed?

Multiple tables correspondence
In Displayr: Insert > Dimension Reduction > Correspondence Analysis. I like to use arrows to show movement over time. Click trend lines.

Visualization
If your goal is just to interpret the data, then the visualizations I've shown you before do the job. But you can do things a bit cooler as well.

3D Correspondence analysis
Back in my days as a researcher I had a couple of clients who seemed to enjoy complexity for its own sake. Hi Ian and Toby! Here's something for them: 3D correspondence analysis.

Images
I love to replace brand names with icons. And, you can go a bit further.

In Displayr: Click on tab 3, interactive correspondence analysis
Here my clever colleague Claudia has put a vending machine in the background and hooked up some filters, so it's now a cool interactive market map.

Notice how for the younger crowd, Enjoy Life is up here next to the energy brands.

In Displayr: Select the older groups, and deselect the younger groups

But now we can see for the older people this attribute belongs to the established

And, here we've built on this more, by using bubbles to represent the attributes.

Each bubble in this case represents the importance of the attribute, based on a driver analysis.

Moonplot
But, another way to use bubbles is to represent the size of the brands, as in this example. Remember when we were talking about measuring the length of the lines and the angles? Too hard for your clients? There's an alternative visualization which is easier to interpret. People instinctively read this correctly.

In Displayr: Output: Moonplot
But some clients don't think it looks pretty. I love it myself. But I did invent it so maybe I'm not so objective.

eBook
Everything in this webinar is discussed in detail in the book. Just go to the resources section on our website or download it for free here.

Book a demo - See Displayr in action

So, there you have it - now you can do a state-of-the-art correspondence analysis on your own.

If you've done correspondence analysis in any other tool, you'll have seen how all the hard bits were automated in Displayr. Displayr's built to save researchers lots of time. If you’d like to cut your analysis times in half, book a demo with one of our experienced researchers today.
