There are four qualitatively distinct types of missing data: structurally missing, missing completely at random (MCAR), missing at random (MAR), and nonignorable (also known as missing not at random). Each type needs to be treated differently for an analysis to be meaningful.
Structurally missing data
Structurally missing data is data that is missing for a logical reason. In other words, it is data that is missing because it should not exist. In the table below, the first and third observations have missing values for Age of youngest child. This is because these people have no children. This situation is typically best addressed by excluding people with such missing data from any analysis of the variables with the structurally missing values.
The "How many colas did you drink in the past 24 hours?" column also contains structurally missing values, for the people who answered No to the preceding question. In this case, we can logically deduce that the correct value is 0, so this value should be used in place of the missing values in our analysis.
ID | Children | Age of youngest child | Did you drink Coca-Cola in the last 24 hours? | How many colas did you drink in the past 24 hours? |
1 | No | | No | |
2 | Yes | 18 | Yes | 2 |
3 | No | | No | |
4 | Yes | 13 | No | |
5 | Yes | 8 | Yes | 1 |
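The two treatments described above can be sketched in a few lines of Python. This is a minimal illustration using the five rows of the table; the field names (age_youngest, drank_cola, n_colas) are invented for the example, and None stands in for an empty cell.

```python
# Rows mirror the table above; None marks a missing cell.
# Field names are hypothetical, chosen only for this sketch.
rows = [
    {"id": 1, "children": "No",  "age_youngest": None, "drank_cola": "No",  "n_colas": None},
    {"id": 2, "children": "Yes", "age_youngest": 18,   "drank_cola": "Yes", "n_colas": 2},
    {"id": 3, "children": "No",  "age_youngest": None, "drank_cola": "No",  "n_colas": None},
    {"id": 4, "children": "Yes", "age_youngest": 13,   "drank_cola": "No",  "n_colas": None},
    {"id": 5, "children": "Yes", "age_youngest": 8,    "drank_cola": "Yes", "n_colas": 1},
]

# Age of youngest child: analyze only the people for whom a value can exist.
ages = [r["age_youngest"] for r in rows if r["children"] == "Yes"]
mean_age_youngest = sum(ages) / len(ages)

# Number of colas: a "No" on the filter question logically implies 0.
colas = [
    0 if r["n_colas"] is None and r["drank_cola"] == "No" else r["n_colas"]
    for r in rows
]
mean_colas = sum(colas) / len(colas)

print(mean_age_youngest)  # average over the three parents only
print(mean_colas)         # average over all five people, with zeros filled in
```

Note that the two variables end up with different bases: three people for the age of the youngest child, and all five for the number of colas.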
Missing completely at random (MCAR)
Looking at the table below, we need to ask ourselves: what is the likely income of the fourth observation? The simplest approach is to note that 50% of the other people have high incomes and 50% have low incomes. We could assume, therefore, that there is a 50% chance she has a high income and a 50% chance she has a low income. This is known as assuming that the missing value is missing completely at random (MCAR). When we make this assumption, we are assuming that whether or not the person has missing data is completely unrelated to the other information in the data.
ID | Gender | Age | Income |
1 | Male | Under 30 | Low |
2 | Female | Under 30 | Low |
3 | Female | 30 or more | High |
4 | Female | 30 or more | |
5 | Female | 30 or more | High |
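As a rough sketch of the reasoning above, under MCAR the best guess for the missing income is simply the distribution of income among the people who answered. The list below copies the Income column of the table, with None standing in for the empty cell.

```python
from collections import Counter

# Income column from the table; None marks the missing fourth observation.
incomes = ["Low", "Low", "High", None, "High"]

# Under MCAR, the observed values are a random subset of all values,
# so their distribution is our estimate for the missing one.
observed = [value for value in incomes if value is not None]
counts = Counter(observed)
probabilities = {level: n / len(observed) for level, n in counts.items()}

print(probabilities)
```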
It is relatively easy to check the assumption that data is missing completely at random. If you can predict which units have missing data (e.g., using common sense, regression, or some other method), then the data is not MCAR. A more formal way of testing is to use Little’s MCAR test.[i]
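One informal way to look for predictability is to compare the rate of missingness across groups. The sketch below does this for the table above; the helper missing_rate_by is a hypothetical name introduced for the example. It is not Little's test, and with only five rows the differences it shows are far too small to count as evidence. It only illustrates the idea.

```python
from collections import defaultdict

# (gender, age, income) for each row of the table; None marks a missing income.
rows = [
    ("Male",   "Under 30",   "Low"),
    ("Female", "Under 30",   "Low"),
    ("Female", "30 or more", "High"),
    ("Female", "30 or more", None),
    ("Female", "30 or more", "High"),
]

def missing_rate_by(rows, key_index):
    """Share of rows with a missing income, split by the column at key_index."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[key_index]
        totals[group] += 1
        if row[2] is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

by_gender = missing_rate_by(rows, 0)
by_age = missing_rate_by(rows, 1)

print(by_gender)
print(by_age)
```

If the rates differ markedly between groups in a realistically sized sample, missingness can be predicted from the other variables, which contradicts MCAR.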
When data is missing completely at random, it means that we can undertake analyses using only observations that have complete data (provided we have enough of such observations).
The MCAR assumption is rarely a good assumption. It is only likely to be true in situations where the data is missing due to some truly random phenomenon (e.g., if people were randomly asked 10 of 15 questions in a questionnaire).
Missing at random (MAR)
In the case of missing completely at random, the assumption was that there was no pattern to the missingness. An alternative assumption, known somewhat confusingly as missing at random (MAR),[ii] instead assumes that any pattern in the missingness can be explained by the other data we have observed, which means that we can use the other data to predict the value that is missing.
We use this assumption to return to the problem of trying to work out the value of the fourth observation on income. A simple predictive model is that income can be predicted based on gender and age. Looking at the table below, which is the same as the one above, we note that our missing value is for a Female aged 30 or more, and the other females aged 30 or more have a High income. As a result, we can predict that the missing value should be High. Note that the idea of prediction does not mean we can perfectly predict a relationship. All that is required is a probabilistic relationship (i.e., that we have a better than random probability of predicting the true value of the missing data).
ID | Gender | Age | Income |
1 | Male | Under 30 | Low |
2 | Female | Under 30 | Low |
3 | Female | 30 or more | High |
4 | Female | 30 or more | |
5 | Female | 30 or more | High |
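The prediction described above can be sketched as a lookup of the income distribution among observed people who share the same gender and age group. The function name predict_income is hypothetical, and this is only a toy stand-in for a real predictive model.

```python
from collections import Counter, defaultdict

# (gender, age, income) for each row of the table; None marks the missing income.
rows = [
    ("Male",   "Under 30",   "Low"),
    ("Female", "Under 30",   "Low"),
    ("Female", "30 or more", "High"),
    ("Female", "30 or more", None),
    ("Female", "30 or more", "High"),
]

# Count observed incomes within each (gender, age) group.
by_group = defaultdict(Counter)
for gender, age, income in rows:
    if income is not None:
        by_group[(gender, age)][income] += 1

def predict_income(gender, age):
    """Estimated income distribution for a group, or None if the group has no observed incomes."""
    counts = by_group.get((gender, age))
    if not counts:
        return None
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}

prediction = predict_income("Female", "30 or more")
print(prediction)
```

Here the prediction happens to be certain because both observed women aged 30 or more have a High income. In real data the result would be a set of probabilities rather than a single value.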
When data is missing at random, it means that we need to either use an advanced imputation method, such as multiple imputation, or an analysis method specifically designed for missing at random data.
Missing at random is always a safer assumption than missing completely at random. This is because any analysis that is valid with the assumption that the data is missing completely at random will also be valid under the assumption that the data is missing at random, but the opposite is not the case.
Missing not at random (nonignorable)
It may be the case that we cannot confidently make any conclusions about the likely value of missing data. For example, it is possible that people with very low incomes and very high incomes tend to refuse to answer. Or there could be some other reason we just do not know. This is known as missing not at random data and also as nonignorable missing data.
It is common to include structurally missing data as a special case of data that is missing not at random. However, this misses an important distinction. Structurally missing data is easy to analyze, whereas other forms of missing not at random data are highly problematic.
When data is missing not at random, it means that we cannot rely on any of the standard methods for dealing with missing data (e.g., imputation, or algorithms specifically designed for missing values). If the missing data is missing not at random, standard calculations will generally give biased answers.
Consider the following study of homelessness.[iii] Data was obtained from 31 women, of whom 14 were located six months later. Of these, three had exited from homelessness, so the estimated proportion to have exited homelessness is 3/14 = 21%. As there is no data for the 17 women who could not be contacted, it is possible that none, some, or all of these 17 may have exited from homelessness. This means that potentially the proportion to have exited from homelessness in the sample is between 3/31 = 10% and 20/31 = 65%. As a result, reporting 21% as being the correct result is misleading.
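The arithmetic behind these figures is short enough to write out. The sketch below computes the complete-case estimate along with the lower and upper bounds, using only the counts given above; the variable names are chosen for the example.

```python
n_total = 31      # women in the original sample
n_followed = 14   # women located six months later
n_exited = 3      # located women who had exited homelessness
n_missing = n_total - n_followed  # women who could not be contacted

# Estimate using only the women who were located.
complete_case = n_exited / n_followed

# Worst case: none of the uncontacted women exited homelessness.
lower_bound = n_exited / n_total

# Best case: all of the uncontacted women exited homelessness.
upper_bound = (n_exited + n_missing) / n_total

print(f"complete-case estimate: {complete_case:.0%}")
print(f"bounds: {lower_bound:.0%} to {upper_bound:.0%}")
```

The width of the interval between the two bounds reflects how much of the sample is missing: with 17 of 31 outcomes unknown, the data alone cannot narrow the answer any further.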
In this example, the missing data is nonignorable. Treating it as missing at random would also be inappropriate. This is especially true given that the inability to contact the women is likely to be causally related to whether or not they have exited from homelessness. Thus, strategies designed for data that is missing at random, such as imputation, will not work.
References
[i] Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83(404), 1198-1202.
[ii] Little, R. J. A., & Rubin, D. B. (1987). Statistical Analysis with Missing Data. Brisbane: John Wiley & Sons.
[iii] Manski, C. F. (1995). Identification Problems in the Social Sciences. Harvard University Press.