What is Correlation?
Correlation is usually defined as a measure of the linear relationship between two quantitative variables (e.g., height and weight). Often a slightly looser definition is used, whereby correlation simply means that there is some type of relationship between two variables. This post will define positive and negative correlation, provide some examples of correlation, explain how to measure correlation and discuss some pitfalls regarding correlation.
When the values of one variable increase as the values of the other increase, this is known as positive correlation (see the image below). When the values of one variable decrease as the values of the other increase, forming an inverse relationship, this is known as negative correlation.
An example of positive correlation may be that the more you exercise, the more calories you will burn.
Where it is possible to predict, with a reasonably high level of accuracy, the values of one variable based on the values of the other, the relationship between the two variables is described as a strong correlation. A weak correlation is one where on average the values of one variable are related to the other, but there are many exceptions.
Pearson’s Product-Moment Correlation
The most common measure of correlation is Pearson’s product-moment correlation, which is commonly referred to simply as the correlation, the correlation coefficient, or just the letter r (always written in italics):
- A correlation of 1 indicates a perfect positive correlation.
- A correlation of -1 indicates a perfect negative correlation.
- A correlation of 0 indicates that there is no linear relationship between the two variables.
- Values between -1 and 1 denote the strength of the correlation, as shown in the example below.
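To make the coefficient concrete, here is a minimal sketch that computes r directly from its definition: the sum of the products of deviations from each mean, divided by the square root of the product of the summed squared deviations. The function name `pearson_r` and the exercise data are made up for illustration; in practice a statistics library would normally be used instead.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation for two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Deviations of each observation from its variable's mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    co_deviation = sum(a * b for a, b in zip(dx, dy))
    ss_x = sum(a * a for a in dx)
    ss_y = sum(b * b for b in dy)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return co_deviation / math.sqrt(ss_x * ss_y)


# Hypothetical data: hours of exercise and calories burned.
hours = [1, 2, 3, 4, 5]
calories = [110, 190, 310, 405, 480]

print(pearson_r(hours, calories))                # strong positive correlation
print(pearson_r(hours, [-c for c in calories]))  # same strength, negative sign
```

Flipping the sign of one variable flips the sign of r but leaves its magnitude unchanged, which is why the magnitude is read as strength and the sign as direction.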
Just about all the common problems that can render statistical analysis meaningless can occur with correlations.
One common problem is that correlations computed from small samples can be unreliable. The smaller the sample size, the more likely we are to observe a correlation that is far from 0, even if the true correlation (the one we would obtain with data for the entire population) is 0. The standard way of quantifying this is to use p-values. In academic research, a common rule of thumb is that when p is greater than 0.05, the correlation should not be trusted.
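The small-sample effect can be illustrated with a short simulation. The sketch below draws two independent variables (so the true correlation is 0) and compares the average size of r for small and large samples. It also estimates a p-value with a permutation test, which is a stand-in chosen here because it needs only the standard library; it is not the t-based test that statistical packages usually report. All function names, sample sizes, and seeds are assumptions for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson's r; assumes equal lengths and non-constant inputs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)


def mean_abs_r(n, trials, rng):
    """Average |r| over many samples of size n from two independent variables."""
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        total += abs(pearson_r(xs, ys))
    return total / trials


def permutation_p_value(xs, ys, shuffles, rng):
    """Share of shuffles whose |r| is at least as large as the observed |r|."""
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)  # copy so the caller's data is left untouched
    hits = 0
    for _ in range(shuffles):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement itself, so p is never 0.
    return (hits + 1) / (shuffles + 1)


rng = random.Random(42)  # fixed seed so the run is repeatable
print("mean |r| with n=5:  ", mean_abs_r(5, 500, rng))
print("mean |r| with n=100:", mean_abs_r(100, 500, rng))

# One small sample of unrelated variables, and its permutation p-value.
xs = [rng.gauss(0, 1) for _ in range(8)]
ys = [rng.gauss(0, 1) for _ in range(8)]
print("r =", pearson_r(xs, ys), "p =", permutation_p_value(xs, ys, 999, rng))
```

Even though the variables are unrelated by construction, the small samples produce correlations that sit noticeably further from 0 on average than the large samples do.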
Another problem, illustrated in the top-left chart below, is that a single unusual observation (outlier) can make the computed correlation coefficient highly misleading. Yet another problem is that correlations show only the extent to which one variable can be predicted by another. They do not pick up situations where the difference in the predicted values is too small to be considered useful (to use the jargon, situations where the effect size is small), as shown in the top-right chart below.
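The outlier problem is easy to reproduce. In the sketch below, eight made-up observations are constructed to have no linear relationship, and then a single extreme point is appended. The data and the `pearson_r` helper are assumptions for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r; assumes equal lengths and non-constant inputs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)


# Eight hypothetical observations with no linear trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5, 3, 6, 4, 5, 3, 6, 4]
r_without_outlier = pearson_r(xs, ys)

# The same data plus one extreme observation at (50, 50).
r_with_outlier = pearson_r(xs + [50], ys + [50])

print("without outlier:", r_without_outlier)
print("with outlier:   ", r_with_outlier)
```

The first eight points give a correlation of essentially zero, while adding the single extreme point pushes r above 0.9, even though nothing about the original eight observations has changed.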
Yet another problem with correlation is that it summarizes the linear relationship, and if the true relationship is nonlinear, then this may be missed. One more problem is that very high correlations often reflect tautologies rather than findings of interest.
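A small sketch of the nonlinear case: below, y is completely determined by x (y is x squared), yet because the relationship is symmetric around zero rather than linear, Pearson's r comes out at zero. The data and the `pearson_r` helper are hypothetical and only meant to illustrate the point.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r; assumes equal lengths and non-constant inputs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)


# y depends perfectly on x, but through a U-shaped (quadratic) curve.
xs = list(range(-5, 6))
ys = [x ** 2 for x in xs]
r_quadratic = pearson_r(xs, ys)

# Restricting x to positive values makes the same curve look nearly linear.
r_positive_half = pearson_r(xs[6:], ys[6:])

print("full range:   ", r_quadratic)
print("positive half:", r_positive_half)
```

Over the full range the positive and negative halves cancel out, so the coefficient misses the relationship entirely. Plotting the data before relying on r is a simple way to catch this.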
Want to know more? Learn how to visualize correlation with a correlation matrix!