What's the deal with missing data?
Missing data, also known as missing values, occurs when some observations in a data set have no recorded value for one or more variables. In the example below, the second and fifth observations contain missing data. The second observation has a missing value for Employees, and the fifth for Understand.
|3||Cult. and Rec. Services||A||D||A||A||D||2660.15||20|
Why is missing data a problem?
Missing data is a problem because it adds ambiguity to your analysis. Consider the data in the table above. What if you want to find out the average number of employees? With the second observation missing a value, you cannot work it out exactly. You could work around this by computing the average from the available data, but the result rests on the assumption that the missing value resembles the observed ones. Furthermore, the fact that an observation is missing a value for a variable often suggests that it is atypical. Therefore, any analysis that assumes the missing value fits neatly into the rest of the data is unsound.
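To make the workaround concrete, here is a minimal sketch of an average computed from the available data only. The numbers are hypothetical and are not taken from the table above; `None` stands in for the blank Employees value in the second observation.

```python
from statistics import mean

# Hypothetical Employees column (not the article's data); None marks a missing value.
employees = [12, None, 20, 7, 45]

# Keep only the observations where Employees was recorded.
observed = [value for value in employees if value is not None]

available_mean = mean(observed)
n_missing = len(employees) - len(observed)

print(f"Mean of the {len(observed)} observed values: {available_mean}")
print(f"Observations ignored because the value is missing: {n_missing}")
```

The figure printed here describes only the four observed values. If the unrecorded value were unusually large or small, the true average would differ, and nothing in the available data can tell you by how much.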
Missing data and multivariate analyses
Multivariate analysis involves conducting analyses with more than one variable. The more variables in an analysis, the bigger the problem caused by missing values. For example, when we look at the Employees data, we have four observations with no missing values. But if we wanted to analyze how Employees is related to the Understand variable, we would have only three complete cases to work with. Thus, it becomes much harder to perform any rigorous analysis.
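The shrinking number of complete cases can be counted directly. The sketch below uses five hypothetical observations whose values are made up; only the pattern of missingness (Employees blank in the second observation, Understand blank in the fifth) follows the description above. The helper `complete_cases` is an illustrative name, not part of any library.

```python
# Hypothetical observations; only the missingness pattern mirrors the text.
rows = [
    {"Employees": 12, "Understand": "A"},
    {"Employees": None, "Understand": "D"},   # second observation: Employees missing
    {"Employees": 20, "Understand": "A"},
    {"Employees": 7, "Understand": "D"},
    {"Employees": 45, "Understand": None},    # fifth observation: Understand missing
]

def complete_cases(rows, variables):
    """Return the rows that have a recorded value for every listed variable."""
    return [row for row in rows if all(row[v] is not None for v in variables)]

n_employees = len(complete_cases(rows, ["Employees"]))
n_both = len(complete_cases(rows, ["Employees", "Understand"]))

print(f"Complete cases for Employees alone: {n_employees}")
print(f"Complete cases for Employees and Understand: {n_both}")
```

Each additional variable can only keep the count the same or reduce it, because a row must be complete on every variable in the analysis to survive.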
How to deal with missing data
You have three options when dealing with missing data. The most obvious, and by far the easiest, option is to simply ignore any observations that have missing values. This is often called complete case analysis or listwise deletion of missing values.
Another approach is to impute the missing values. This involves using statistical or machine learning models to make educated guesses about the values of the missing data. For example, you could create a model that predicts the number of employees based on the other variables and then use this model to fill in the number of employees for the observations where it is missing. A variant of this approach, known as multiple imputation, is usually considered best practice when building regression-type models (e.g., linear regression, logistic regression).
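As a rough illustration of model-based imputation, the sketch below fits a simple least-squares line on the complete cases and uses it to fill in a missing Employees value. It makes several assumptions: the data are hypothetical, the single predictor `revenue` is an invented variable, and a one-variable linear model stands in for whatever model you would actually choose. This is single imputation, not multiple imputation.

```python
from statistics import mean

# Hypothetical (revenue, employees) pairs; None marks the missing Employees value.
rows = [
    (1000.0, 10),
    (2000.0, None),
    (3000.0, 30),
    (4000.0, 40),
    (5000.0, 50),
]

# Fit the model on complete cases only.
complete = [(x, y) for x, y in rows if y is not None]
x_bar = mean(x for x, _ in complete)
y_bar = mean(y for _, y in complete)

# Ordinary least-squares slope and intercept for one predictor.
slope = (
    sum((x - x_bar) * (y - y_bar) for x, y in complete)
    / sum((x - x_bar) ** 2 for x, _ in complete)
)
intercept = y_bar - slope * x_bar

# Replace each missing value with the model's prediction; keep observed values as-is.
imputed = [(x, y if y is not None else intercept + slope * x) for x, y in rows]

for revenue, employees in imputed:
    print(revenue, employees)
```

A single predicted value like this understates the uncertainty about what the missing value really was. Multiple imputation addresses that by generating several plausible values for each gap, analyzing each completed data set, and combining the results.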
A third approach is to use analysis methods that are specifically designed to deal with missing values, such as latent class analysis.
Want to find out more?
Check out 5 ways to deal with missing data in cluster analysis here!