08 August 2017
Normalization and the Scaling Problem in Correspondence Analysis
Correspondence analysis is a useful technique for compressing the information from a large table into a relatively-easy-to-read scatterplot. The resulting plot, as is the case with most simplifications, is often misleading. When the plot is made, the analyst chooses, or leaves to a default setting, the normalization. This setting governs how the resulting map should be interpreted.
Most correspondence analysis plots are misleading in at least three different ways, but the choice of normalization can increase this to five, so you want to get the choice of normalization right. This post provides an overview of the main normalization options, explains how to interpret the resulting maps, provides a technical explanation of the normalizations, and gives recommendations for the best approach to normalization for different situations.
Overview of normalization options in correspondence analysis
The table below lists the main normalizations, along with the key concepts and terminology used. Please take note of one really important issue: there is no commonly agreed-upon meaning of the word “symmetric(al)”. Different apps and authors use it to mean completely different things. For example, the most widely used program, SPSS, uses a meaning that is completely different from that of the most widely read author on the topic, Michael Greenacre. For this reason, I do not use this term.
|Normalization||Other names||Definition of row coordinates||Definition of column coordinates||How to interpret relationships between row coordinates||How to interpret relationships between column coordinates||How to interpret relationships between row and column categories|
|Standard||Symmetrical||Standard||Standard||The vertical distances are exaggerated||The vertical distances are exaggerated||No straightforward interpretation|
|Row principal||Row, Row asymmetric, Asymmetric map of the rows, Row-metric-preserving||Principal||Standard||Proximity||The vertical distances are exaggerated||Dot product|
|Row principal (scaled)||Same as Row principal||Principal||Standard * first singular value||Proximity||The vertical distances are exaggerated||Proportional dot product|
|Column principal (scaled)||Column, Column asymmetric, Asymmetric map of the columns, Column-metric-preserving||Standard * first singular value||Principal||The vertical distances are exaggerated||Proximity||Proportional dot product|
|Column principal||Same as Column principal (scaled)||Standard||Principal||The vertical distances are exaggerated||Proximity||Dot product|
|Principal||Symmetric map, French scaling, Benzécri scaling, Canonical, Configuration Plot||Principal||Principal||Proximity||Proximity||No straightforward interpretation|
|Symmetrical (1/2)||Symmetrical, Symmetric, Canonical scaling||Standard * sqrt(singular values)||Standard * sqrt(singular values)||The vertical distances are somewhat exaggerated||The vertical distances are somewhat exaggerated||Dot product|
Interpreting plots created with the different normalizations
The first requirement for correct interpretation of correspondence analysis is a scatterplot with an aspect ratio of 1, which is the technical way of saying that the physical distance on the plot between values on the x-axis and on the y-axis needs to be the same. If you look at the plot below, you will see that the distance between 0 and 1 on the x-axis is the same as on the y-axis, so this basic hurdle has been passed. But if you are viewing correspondence analysis in a general-purpose charting tool, such as Excel or ggplot, be careful, as they will not, by default, respect the aspect ratio, which will make the plots misleading.
As I mentioned in my introductory paragraph, most standard correspondence analysis plots are misleading in at least three ways.
The first way is that they only show relativities. For example, the plot above suggests that Pepsi and Coke (which were rows in the table) are both associated with Traditional or Older (columns). However, there is no way to conclude from this map which brand has the highest score on any attribute. In the case of maps using brand association data, it is quite common to have a leading brand with the highest score on all the attributes; the key when interpreting is to remember that the map only shows relativities.
The second general way that correspondence analysis maps mislead relates to the variance explained. If you add up the percentages in the x and y axis labels, you will see that they add up to 97.5%. So, 2.5% of the variance in the data is not explained. This is not much, but the unexplained percentage can be much higher, and the higher it is, the more misleading the plot. And, of course, it is possible that the two dimensions explain 100% of the variance, as is illustrated in Understanding the Math of Correspondence Analysis: A Tutorial Using Examples in R.
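The arithmetic behind these percentages can be sketched as follows. This is a minimal illustration that assumes the usual convention that each dimension's share of the variance is its squared singular value divided by the sum of all squared singular values. The first two singular values are the ones quoted later in this post; the third is hypothetical, added only so that some variance is left unexplained, and the function name is my own.

```python
def explained_shares(singular_values):
    """Return each dimension's share of the total variance (inertia)."""
    inertias = [sv ** 2 for sv in singular_values]
    total = sum(inertias)
    return [inertia / total for inertia in inertias]

# 0.669 and 0.391 are the singular values quoted in this post.
# 0.124 is a HYPOTHETICAL third singular value, used for illustration only.
singular_values = [0.669, 0.391, 0.124]

shares = explained_shares(singular_values)
explained_by_plot = shares[0] + shares[1]  # what a two-dimensional map shows
unexplained = 1 - explained_by_plot        # what the map leaves out

print(f"x-axis: {shares[0]:.1%}, y-axis: {shares[1]:.1%}, unexplained: {unexplained:.1%}")
```

The larger the trailing singular values are relative to the first two, the larger the unexplained share, and the less the two-dimensional map can be trusted.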
The map above is misleading in a third way. To the naked eye, it misrepresents the relationship between the columns. The plot shows that Weight-conscious is roughly the same distance from Older as it is from Rebellious. This is a misrepresentation of the data. To correctly interpret the relationship between the column coordinates, we need to remember that the vertical dimension explains only about a third of the variance, so vertical distances between the column coordinates on this plot are exaggerated. If you look at the plot below, it shows the relationship between the columns properly.
What is the difference between the two plots? The top one uses row principal normalization. This means it gets the rows right, but not the columns. The plot below uses principal normalization, which means it gets the rows and columns correct.
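The exaggeration of vertical distances can be sketched numerically. The snippet below is a minimal illustration, not the computation behind the plots: the three column points `a`, `b` and `c` are hypothetical standard coordinates invented for the example, the helper names are my own, and only the two singular values (.669 and .391) are taken from the output discussed later in this post.

```python
import math

# Singular values of the first two dimensions, as quoted in this post.
SINGULAR_VALUES = (0.669, 0.391)

def to_principal(standard_point, singular_values=SINGULAR_VALUES):
    """Scale each standard coordinate by its dimension's singular value."""
    return tuple(sv * coord for sv, coord in zip(singular_values, standard_point))

def distance(p, q):
    """Euclidean distance between two points on the map."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# HYPOTHETICAL standard coordinates for three column categories.
a = (0.0, 0.0)
b = (1.0, 0.0)  # differs from a only horizontally
c = (0.0, 1.0)  # differs from a only vertically

# Standard coordinates: what row principal normalization plots for the columns.
standard_ab = distance(a, b)
standard_ac = distance(a, c)

# Principal coordinates: what principal normalization plots for the columns.
principal_ab = distance(to_principal(a), to_principal(b))
principal_ac = distance(to_principal(a), to_principal(c))

print(f"standard:  a-b = {standard_ab:.3f}, a-c = {standard_ac:.3f}")
print(f"principal: a-b = {principal_ab:.3f}, a-c = {principal_ac:.3f}")
```

In standard coordinates the two distances look identical. Once each dimension is weighted by its singular value, the vertical gap shrinks relative to the horizontal one, which is the correction the principal normalization applies.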
At this stage, it no doubt seems the principal normalization is better. Who would want a map which misrepresented the relationship between the column categories? Unfortunately, the principal normalization comes with its own great limitation.
The principal normalization is great at showing the relationships within the row coordinates, and also within the column coordinates. However, it misrepresents the relationships between the row and the column categories. In the row principal normalization shown above, we can infer the relationship between row and column categories by looking at how far they are from the origin, and also the angle formed by the lines that connect them to the origin (if you are not familiar with how to interpret the relationship between the row and column categories, please see Understanding the Math of Correspondence Analysis: A Tutorial Using Examples in R for a technical discussion and How to Interpret Correspondence Analysis Plots (It Probably Isn’t the Way You Think) for examples and a more intuitive explanation).
The misrepresentation of the relationships between the row and column categories can best be described as being moderate. Yes, it is not possible to correctly work out all the relationships from the map, even if the map explains 100% of the variance. However, any strong relationships that appear on the map are likely to be correct. This makes the principal normalization a good default normalization. However, in situations where there is a clear focus on the rows, such as when using it to show brand positioning, as in these examples, the row principal normalization is generally superior.
It is also possible to use column principal normalization. If I have done a good job in explaining things, you can hopefully work out that this normalization correctly shows the relationships among the column categories, and between the row and column categories, but misrepresents the relationships among the row categories.
The next useful normalization is one that is referred to in Displayr and Q as symmetric (1/2) normalization. This normalization, shown below and defined in a bit more detail in the next section, correctly shows the relationship between the row and column coordinates. But it misrepresents the relationships among the row points, and also among the column points. So, of all the normalizations we have seen so far, it is the one that misrepresents the data in the most ways. However, it does have an advantage: its degree of misrepresentation is the smallest. That is, while the row principal normalization misrepresents the column coordinates by quite a large amount, the symmetric (1/2) normalization misrepresents them by a smaller amount. Similarly, while the column principal normalization misrepresents the row coordinates by a large amount, the plot below does so by a smaller amount.
The consequence of this is that in a situation where the main interest is in the relationships between the row and column coordinates, and there is no clear way of choosing between row and column principal normalization, this approach is the best one.
In my own work, I favor a variant of row principal normalization. In most of my work, I set up the tables so that the rows represent brands, as in this post. It is obvious to my clients that the brands are the focus, so they never get confused about the column coordinates issue, as they are not so interested in the relationships among the column categories. However, I have recently started using an improved variant of row principal normalization. Below I have repeated the row principal plot from the beginning of the post. A practical problem with this normalization is that the row categories tend to cluster in the middle of the map and the column categories at the periphery. Sometimes this can make it impossible to read the row categories, as they are all overlapping.
A straightforward improvement on the row principal normalization is to scale the column coordinates on the same scale as the x-axis of the row coordinates. This results in what Q and Displayr refer to as row principal (scaled) normalization. As I discuss in the next section, this is an improvement without cost.
A technical explanation of the different normalizations
Below are the core numerical outputs of a correspondence analysis of the data used in this post. The first row shows the singular values. The remaining rows show the standard coordinates for the rows (brands) and columns (attributes). Refer to Understanding the Math of Correspondence Analysis for a detailed explanation of what these are and how they are computed.
In the row principal normalization, you multiply the standard coordinates of each of the row categories from the original table (i.e., Coke through Pepsi Max) by the corresponding singular values. The first two dimensions are then plotted. For example, for Coke Zero, its coordinate on the x-axis is .669*-0.63 = -.42, and its position on the y-axis is .391*.99 = .39. As mentioned, if the two dimensions explain all the variance in the data, then the position of Coke Zero relative to all the other brands on the map is correct.
Expressing these calculations as formulas, we have:
x for a row = Singular value 1 * Standard Coordinate 1
y for a row = Singular value 2 * Standard Coordinate 2
For the column categories, we just plot the standard coordinates:
x for a column = Standard Coordinate 1
y for a column = Standard Coordinate 2
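As a quick check of the arithmetic, the row formulas can be written as a few lines of Python. This is only a sketch: the function and variable names are my own, and the numbers are the Coke Zero standard coordinates and singular values quoted above.

```python
# Singular values of the first two dimensions, as quoted in this post.
SINGULAR_VALUES = (0.669, 0.391)

def principal_coordinates(standard, singular_values=SINGULAR_VALUES):
    """Principal coordinate = singular value * standard coordinate, per dimension."""
    return tuple(sv * coord for sv, coord in zip(singular_values, standard))

# Standard coordinates of Coke Zero on dimensions 1 and 2, as quoted in this post.
coke_zero_standard = (-0.63, 0.99)

# Row principal normalization plots the rows in principal coordinates.
coke_zero_principal = principal_coordinates(coke_zero_standard)

print([round(value, 2) for value in coke_zero_principal])  # [-0.42, 0.39]
```

Under row principal normalization the column categories skip this step entirely and are plotted at their standard coordinates.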
These simpler formulas are not correct, in the sense that, by ignoring the singular values, the coordinates misrepresent the scale. However, the reason for this “mistake” is that the dot product of the row and column coordinates is then meaningful. As described in Understanding the Math of Correspondence Analysis, correspondence analysis allows us to understand the relationships between row and column categories, where this relationship is formally quantified as the indexed residuals, where:
Indexed residual for a row and a column = x for row * x for column + y for row * y for column
If you substitute in the earlier formulas this gives us:
Indexed residual for a row and a column = Singular value 1 * Row Standard Coordinate 1 * Column Standard Coordinate 1 + Singular value 2 * Row Standard Coordinate 2 * Column Standard Coordinate 2
When we use the principal normalization, we use the principal coordinates for both the row and column categories, which changes the formula to Singular value 1 ^ 2 * Row Standard Coordinate 1 * Column Standard Coordinate 1 + Singular value 2 ^ 2 * Row Standard Coordinate 2 * Column Standard Coordinate 2. As you can see, this puts the singular values in twice, and so no longer correctly computes the indexed residuals.
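The difference between the two dot products can be sketched in code. The row is Coke Zero, using the standard coordinates and singular values quoted in this post; the column standard coordinates are hypothetical, chosen only for illustration, and the helper names are my own.

```python
import math

SINGULAR_VALUES = (0.669, 0.391)   # quoted in this post
row_standard = (-0.63, 0.99)       # Coke Zero, quoted in this post
column_standard = (-1.10, 0.80)    # HYPOTHETICAL attribute, for illustration only

def scale(point, factors):
    """Multiply each coordinate by the factor for its dimension."""
    return tuple(f * coord for f, coord in zip(factors, point))

def dot(p, q):
    """Dot product of a row point and a column point."""
    return sum(a * b for a, b in zip(p, q))

row_principal = scale(row_standard, SINGULAR_VALUES)
column_principal = scale(column_standard, SINGULAR_VALUES)

# Row principal normalization: principal rows, standard columns.
# Each term contains the singular value once, matching the indexed residual formula.
dot_row_principal = dot(row_principal, column_standard)

# Principal normalization: principal rows AND principal columns.
# Each term now contains the singular value squared.
dot_principal = dot(row_principal, column_principal)

print(f"row principal dot product: {dot_row_principal:.4f}")
print(f"principal dot product:     {dot_principal:.4f}")
```

Because the two singular values differ, the second dot product is not a simple rescaling of the first: each dimension is shrunk by a different amount, which is why angles and distances from the origin cannot be read reliably under the principal normalization.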
The symmetric (1/2) normalization computes the coordinates for x and y for both row and column categories using Sqrt(Singular value) * Standard Coordinate. As the principal coordinates, which multiply by the singular values rather than their square roots, are correct, it follows that this normalization is correct neither for within-row comparisons nor for within-column comparisons. Nevertheless, its degree of error is lower than that of the standard coordinates. The indexed residuals are correctly computed because Sqrt(Singular value) * Sqrt(Singular value) = Singular value.
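A short sketch makes the square-root argument concrete. As before, the Coke Zero standard coordinates and the singular values come from this post, while the column standard coordinates and the helper names are hypothetical.

```python
import math

SINGULAR_VALUES = (0.669, 0.391)   # quoted in this post
row_standard = (-0.63, 0.99)       # Coke Zero, quoted in this post
column_standard = (-1.10, 0.80)    # HYPOTHETICAL attribute, for illustration only

def scale(point, factors):
    """Multiply each coordinate by the factor for its dimension."""
    return tuple(f * coord for f, coord in zip(factors, point))

def dot(p, q):
    """Dot product of a row point and a column point."""
    return sum(a * b for a, b in zip(p, q))

# Symmetric (1/2): rows and columns are both scaled by sqrt(singular value).
root_factors = tuple(math.sqrt(sv) for sv in SINGULAR_VALUES)
row_symmetric = scale(row_standard, root_factors)
column_symmetric = scale(column_standard, root_factors)

# Row principal: rows scaled by the singular value, columns left as standard.
row_principal = scale(row_standard, SINGULAR_VALUES)

dot_symmetric = dot(row_symmetric, column_symmetric)
dot_row_principal = dot(row_principal, column_standard)

# The two dot products agree, up to floating-point rounding.
print(math.isclose(dot_symmetric, dot_row_principal, rel_tol=1e-9))
```

The row and column points themselves sit in different places under the two normalizations; only their dot products coincide.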
The row principal (scaled) normalization uses the principal coordinates for the row categories and for the column categories uses:
x for a column = Singular value 1 * Standard Coordinate 1
y for a column = Singular value 1 * Standard Coordinate 2
That is, it uses the first singular value for each of the two coordinates. This has the effect of contracting the scatter of the column coordinates on the map, but makes no change to their relativities (i.e., they remain wrong, as they ignore the reality that the y dimension explains less variation). This normalization also changes the indexed residual, so that rather than the dot product being exactly equal to the indexed residual when the plot explains 100% of the variance, instead the dot product becomes proportional to the indexed residual. Changing from an equality to a proportionality has no practical implication of any kind, as relationships between the row and column categories are only ever interpreted from correspondence analysis as relativities. This is why the scaling of row principal is generally appropriate.
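The proportionality can be checked with a small sketch. The Coke Zero standard coordinates and the singular values are those quoted in this post; the three column points are hypothetical standard coordinates, and the helper names are my own.

```python
import math

SINGULAR_VALUES = (0.669, 0.391)   # quoted in this post
row_standard = (-0.63, 0.99)       # Coke Zero, quoted in this post

# HYPOTHETICAL standard coordinates for three column categories.
columns_standard = [(-1.10, 0.80), (0.90, -0.40), (0.30, 1.20)]

def scale(point, factors):
    """Multiply each coordinate by the factor for its dimension."""
    return tuple(f * coord for f, coord in zip(factors, point))

def dot(p, q):
    """Dot product of a row point and a column point."""
    return sum(a * b for a, b in zip(p, q))

row_principal = scale(row_standard, SINGULAR_VALUES)
first_sv = SINGULAR_VALUES[0]

ratios = []
for column in columns_standard:
    # Row principal: columns are plotted at their standard coordinates.
    plain = dot(row_principal, column)
    # Row principal (scaled): both column coordinates use the FIRST singular value.
    scaled = dot(row_principal, scale(column, (first_sv, first_sv)))
    ratios.append(scaled / plain)

# Every ratio equals the first singular value, so the ordering and the
# relative sizes of the dot products are unchanged by the scaling.
print([round(ratio, 3) for ratio in ratios])
```

Since every column is shrunk by the same factor, the columns keep their positions relative to each other and simply move closer to the rows on the map.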
Column principal (scaled) is the same as row principal (scaled), except that the focus is switched from the rows to the columns.
For the reasons outlined in this post, my view is that either the row principal (scaled) normalization or the column principal (scaled) normalization is typically best, although principal is an appropriate default in situations where the viewer is not actively involved in working out and communicating the most appropriate normalization.
EXPLORE THE DATA
I created all of the examples in this post with R. You can view and play with the examples, including using your own data, by clicking on this link (examples of normalization) and signing in to Displayr to see the document that I wrote when writing this post.
Author: Tim Bock