There is a great deal of overlap between the fields of statistics and data science, to the point where many definitions of one discipline could just as easily describe the other. In practice, however, the fields differ in a number of key ways. Statistics is a mathematically based field which seeks to collect and interpret quantitative data. In contrast, data science is a multidisciplinary field which uses scientific methods, processes, and systems to extract knowledge from data in a range of forms. Data scientists use methods from many disciplines, including statistics, but the two fields differ in their processes, the types of problems studied, and several other factors.
The process of creating and comparing models
Many data science problems are addressed with a modeling process which focuses on the predictive accuracy of the model. Data scientists do this by comparing the predictive accuracy of different machine learning methods, choosing the model which is most accurate.
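As a rough illustration of this comparison-driven workflow, the sketch below fits three candidate models to synthetic data and keeps whichever has the lowest error on held-out data. Everything here is an assumption made for illustration: the data-generating formula, the three candidate models (`fit_mean`, `fit_linear`, `fit_knn`), the 150/50 split, and mean squared error as the accuracy measure. In practice a machine learning library would usually handle these steps.

```python
import random

def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_linear(xs, ys):
    """Simple linear regression fitted by least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_knn(xs, ys, k=5):
    """k-nearest neighbours: average the k closest training outcomes."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (hypothetical): a noisy straight line.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

# Hold out the last 50 points to estimate predictive accuracy.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

candidates = {"mean": fit_mean, "linear": fit_linear, "knn": fit_knn}
scores = {
    name: mse(fit(train_x, train_y), test_x, test_y)
    for name, fit in candidates.items()
}

# Choose the model with the lowest held-out error.
best = min(scores, key=scores.get)
print(scores, "best:", best)
```

Note that nothing in this loop asks whether any model's assumptions hold; the held-out error is the only criterion.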
Statisticians take a different approach to building and testing their models. The starting point in statistics is usually a simple model (e.g., linear regression), and the data is checked to see if it is consistent with the assumptions of that model. The model is improved by addressing any assumptions that are violated. The modeling process is complete when all assumptions have been checked and none are violated.
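The check-and-improve loop described above might look something like the following sketch. It assumes synthetic data with a curved relationship, fits a straight line, and then tests the linearity assumption by asking whether the residuals still track a curvature term. The diagnostic used here (a correlation between residuals and squared, centered `x`) is just one simple, hypothetical choice; a statistician would typically also inspect residual plots and check other assumptions such as constant variance.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

# Synthetic data (hypothetical): the true relationship is quadratic.
rng = random.Random(2)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [x ** 2 + rng.gauss(0, 1) for x in xs]

mean_x = sum(xs) / len(xs)
curvature = [(x - mean_x) ** 2 for x in xs]

# Step 1: start with the simple model y ~ x and check its residuals.
b0, b1 = fit_line(xs, ys)
res_before = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
check_before = pearson(res_before, curvature)

# Step 2: the linearity assumption is violated, so revise the model
# to y ~ x^2 and run the same check again.
zs = [x ** 2 for x in xs]
c0, c1 = fit_line(zs, ys)
res_after = [y - (c0 + c1 * z) for z, y in zip(zs, ys)]
check_after = pearson(res_after, curvature)

print(f"curvature left in residuals: before={check_before:.2f}, "
      f"after={check_after:.2f}")
```

A value near 1 before the revision signals that the straight line misses a systematic pattern; a value near 0 afterwards suggests the revised model no longer violates that particular assumption.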
While data science focuses on comparing many methods to create the best machine learning model, statistics instead improves a single, simple model to best suit the data.
Statisticians focus much more on quantifying uncertainty than data scientists do. Part of the statistical model-building process is to quantify the precise relationship between each predictor and the outcome being predicted. Any uncertainty about this relationship is also quantified. This step is rarely part of a machine learning workflow.
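To make the idea concrete, here is a minimal sketch of quantifying both the relationship and its uncertainty for a single predictor: the slope of a simple linear regression, its standard error, and an approximate 95% confidence interval. The synthetic data and the sample size of 30 are assumptions for illustration, and the interval uses a normal approximation; for a sample this small, a quantile from the t-distribution would give a slightly wider and more accurate interval.

```python
import math
import random
from statistics import NormalDist

# Synthetic data (hypothetical): true slope is 2, with noisy outcomes.
rng = random.Random(1)
n = 30
xs = [rng.uniform(0, 10) for _ in range(n)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 2) for x in xs]

# Least-squares estimates of the intercept and slope.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual variance, using n - 2 degrees of freedom.
rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
sigma2 = rss / (n - 2)

# Standard error of the slope and a normal-approximation 95% interval.
se_slope = math.sqrt(sigma2 / sxx)
z = NormalDist().inv_cdf(0.975)  # roughly 1.96
ci_low = slope - z * se_slope
ci_high = slope + z * se_slope

print(f"slope = {slope:.3f}, SE = {se_slope:.3f}, "
      f"95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

The point estimate says how strongly the predictor relates to the outcome; the interval says how much that estimate could plausibly move with a different sample.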
The size of the data
Data scientists often deal with huge databases, so big that they cannot be stored on a single computer. While such data sometimes occurs in statistics, it is the exception rather than the norm. Historically, the focus of statistics has been much more on what can be learned from very small quantities of data.
This focus on small data explains why it is important to quantify uncertainty in statistics. When you only have a small amount of data, it is easy to mistake noise for signal. The sheer scale of the data often studied in data science is also why it is impractical for data scientists to check assumptions.
The types of problems that are studied
Data science problems often relate to making predictions and optimizing searches of large databases. In contrast, the problems studied by statistics are more often focused on drawing conclusions about the world at large. This involves working out how best to collect data and measure things, and how to quantify uncertainty about these measurements.
The end goal of statistical analysis is often to draw a conclusion about what causes what, based on the quantification of uncertainty. By contrast, the end goal of a data science analysis more often concerns a specific database or predictive model.
Backgrounds of the people working in the fields
Data scientists tend to come from engineering backgrounds. Statisticians are usually trained by math departments.
The language used
The following table describes some of the key differences in how each field uses language. This table draws heavily from this post.
|Statistics|Data science|
|---|---|
|Dummy variable/indicator coding|One-hot coding|
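The two terms in this row refer to closely related ways of turning a categorical variable into 0/1 columns. As a small sketch of how they are commonly distinguished: one-hot coding usually creates one column per category, while dummy coding in a regression setting often drops one category as a reference level. The helper names `one_hot` and `dummy_code` and the example colors are hypothetical, and usage of the two terms varies in practice.

```python
def one_hot(values):
    """One 0/1 column per distinct category (k columns for k levels)."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

def dummy_code(values, reference):
    """Like one-hot, but the reference level gets no column (k - 1 columns).

    A row of all zeros therefore means "the reference category".
    """
    levels = [level for level in sorted(set(values)) if level != reference]
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

colors = ["red", "green", "blue", "green"]

oh_levels, oh_rows = one_hot(colors)
dm_levels, dm_rows = dummy_code(colors, reference="blue")

print(oh_levels, oh_rows)
print(dm_levels, dm_rows)
```

Dropping a reference level matters mainly for models with an intercept, such as linear regression, where a full set of indicator columns would be perfectly collinear with the intercept.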
Conclusion: what is the difference?
So what exactly is the difference between data science and statistics? The fields differ in their modeling processes, the size of their data, the types of problems studied, the background of the people in the field, and the language used. However, the fields are closely related. Ultimately, both statistics and data science aim to extract knowledge from data.