# What is Correlation?

**Correlation** is a term that is a measure of the strength of a linear relationship between two quantitative variables (e.g., height, weight). This post will define *positive* and *negative* correlations, illustrated with examples and explanations of how to measure correlation. Finally, some pitfalls regarding the use of correlation will be discussed.

*P**ositive correlation* is a relationship between two variables in which both variables move in the same direction. This is when one variable increases while the other increases and visa versa. For example, *positive *correlation may be that the more you exercise, the more calories you will burn. Whilst *negative correlation* is a relationship where one variable increases as the other decreases, and vice versa.

Where it is possible to predict, with a reasonably high level of accuracy, the values of one variable based on the values of the other, the relationship between the two variables is described as a *strong correlation.* A *weak correlation *is one where on average the values of one variable are related to the other, but there are many exceptions.

## Pearson’s Product-Moment Correlation

The most common measure of correlation is *Pearson’s product-moment correlation*, which is commonly referred to simply as the *correlation*, the* correlation coefficient, *or just the letter *r *(always written in italics). The *correlation* coefficient r measures the strength and direction of a linear relationship, for instance:

- 1 indicates a perfect positive correlation.
- -1 indicates a perfect negative correlation.
- 0 indicates that there is no relationship between the different variables.

Values between -1 and 1 denote the strength of the correlation, as shown in the example below.

## Create your own correlation matrix

## Misinterpreting correlations

Just about all the common problems that can render statistical analysis meaningless can occur with correlations.

One example of a common problem is that with small samples, correlations can be unreliable. The smaller the sample size, the more likely we are to observe a correlation that is further from 0, even if the true correlation (obtained if we had data for the entire population) was 0. The standard way of quantifying this is to use *p-values. *In academic research, a common rule of thumb is that when *p *is greater than 0.05, the correlation should not be trusted.

Another problem, illustrated in the top-left chart below, is that a single unusual observation (outlier) can make the computed correlation coefficient highly misleading. Correlations only show the extent to which one variable can be predicted by another. They do not pick up situations where the difference in the predictive values is too small to be considered useful. For instance, situations where the *effect size* may be too small, as shown in the top-right chart below.

Another problem with *correlation *is that it summarizes a linear relationship. If the true relationship is nonlinear, then this may be missed. One more problem is that very high correlations often reflect tautologies rather than findings of interest.

**Want to know more? Learn how to visualize correlation with a correlation matrix!**