What is Data Stacking?
Data stacking involves splitting a data set into smaller data files and stacking the values of each variable into a single column. It is a type of data wrangling, used when preparing data for further analysis. Common applications of stacking are to unloop data, to allow multiple outcome variables to be used in regression, and to simplify reporting.
Example of stacking
In the image below, each row shows the data for one of four respondents in a survey. The data file contains a looped structure, where three sets of information appear for three different brands. In total, there are four observations and 10 variables.
I've shown the same data below, in stacked form. I've reshaped the data to now contain 12 observations and five variables. The last three variables (columns) show data that has been stacked. The first column contains the ID variable, which has been stretched. The second column contains the unique variable names from the original data and is also stretched to line up with the other data.
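As a rough sketch of the reshaping described above, here is a plain-Python example that stacks a made-up data set with the same shape (four respondents, three brands, 10 variables). The column names (`aware_A`, `rating_A`, `recommend_A`, and so on), the values, and the `stack` helper are all assumptions for illustration; they are not taken from the data in the images.

```python
# Hypothetical wide data: 1 ID variable + 3 measures x 3 brands = 10 variables.
BRANDS = ["A", "B", "C"]
MEASURES = ["aware", "rating", "recommend"]

wide = [
    {"id": 1, "aware_A": 1, "rating_A": 7, "recommend_A": 8,
     "aware_B": 1, "rating_B": 4, "recommend_B": 5,
     "aware_C": 0, "rating_C": 9, "recommend_C": 9},
    {"id": 2, "aware_A": 1, "rating_A": 6, "recommend_A": 6,
     "aware_B": 0, "rating_B": 8, "recommend_B": 9,
     "aware_C": 1, "rating_C": 3, "recommend_C": 2},
    {"id": 3, "aware_A": 0, "rating_A": 5, "recommend_A": 4,
     "aware_B": 1, "rating_B": 7, "recommend_B": 7,
     "aware_C": 1, "rating_C": 6, "recommend_C": 5},
    {"id": 4, "aware_A": 1, "rating_A": 9, "recommend_A": 10,
     "aware_B": 1, "rating_B": 2, "recommend_B": 3,
     "aware_C": 0, "rating_C": 8, "recommend_C": 8},
]


def stack(rows, id_var, group_var, groups, measures):
    """Return one stacked row per (original row, group) pair."""
    stacked_rows = []
    for row in rows:
        for group in groups:
            # The ID and the group name are "stretched": repeated on every
            # stacked row so they line up with the stacked values.
            new_row = {id_var: row[id_var], group_var: group}
            # Each measure's per-group columns are stacked into one column.
            for measure in measures:
                new_row[measure] = row[f"{measure}_{group}"]
            stacked_rows.append(new_row)
    return stacked_rows


stacked = stack(wide, "id", "brand", BRANDS, MEASURES)
print(len(stacked), len(stacked[0]))  # prints: 12 5
```

Libraries such as pandas offer built-in reshaping functions (for example `pandas.melt` and `pandas.wide_to_long`) that do the same job on data frames; the loop above is only meant to make the mechanics visible.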
Stacking can occur multiple times
You can perform data stacking multiple times. For example, you could stack the data set on the right, to contain two variables, where the first variable contained all the values in the table, stacked on top of each other, and the second variable contained the variable names, stretched to line up with the appropriate values.
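To illustrate a second round of stacking, here is a small sketch that collapses an already stacked table into just two variables: the values and the variable names. The table contents and the `stack_all` helper are hypothetical and exist only for this example.

```python
# Hypothetical stacked table (a small slice, to keep the example short).
stacked = [
    {"id": 1, "brand": "A", "recommend": 8},
    {"id": 1, "brand": "B", "recommend": 5},
]


def stack_all(rows):
    """Stack every column into (value, variable) pairs, one column at a time."""
    variables = list(rows[0])  # column names, in their original order
    return [
        {"value": row[variable], "variable": variable}
        for variable in variables  # outer loop: columns are stacked in turn
        for row in rows            # inner loop: each column's values, top to bottom
    ]


two_columns = stack_all(stacked)
for row in two_columns:
    print(row["value"], row["variable"])
```

The result has one row per cell of the input table (here 2 rows x 3 variables = 6 rows), with the variable names stretched to line up with their values.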
Stacking to unloop data
Often, people create data files where each row reflects how the data has been collected, rather than how it should be analyzed. For example, surveys often have data on a whole household of people in a single row, but analysis may require each person in the household to be treated as a separate analysis unit (and thus to have their own row in the data file).
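A minimal sketch of unlooping a household file might look like the following. The layout (up to three people per household, with `age_1`, `employed_1`, and so on) and all of the values are assumptions made for this example. Unused person slots are stored as `None` and are skipped, so each real person ends up with their own row.

```python
MAX_PEOPLE = 3  # assumed number of person slots per household row

# Hypothetical household-level data: one row per household.
households = [
    {"household": 1,
     "age_1": 42, "employed_1": True,
     "age_2": 39, "employed_2": False,
     "age_3": None, "employed_3": None},
    {"household": 2,
     "age_1": 67, "employed_1": False,
     "age_2": None, "employed_2": None,
     "age_3": None, "employed_3": None},
]

people = []
for household in households:
    for slot in range(1, MAX_PEOPLE + 1):
        age = household[f"age_{slot}"]
        if age is None:
            continue  # empty slot: there is no person to analyze
        people.append({
            "household": household["household"],  # stretched ID
            "person": slot,
            "age": age,
            "employed": household[f"employed_{slot}"],
        })

print(len(people))  # prints: 3
```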
Stacking for regression analysis
Most software for regression assumes that there is a single outcome variable. However, this is commonly not the case. For example, in the data set above, there are three potential outcome variables (the three variables measuring likelihood to recommend). Stacking the data means you can analyze it using standard regression analysis.
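As a sketch of why stacking helps here, the example below turns three hypothetical "likelihood to recommend" variables into a single outcome column and then fits an ordinary least squares line with one predictor. The data, the `rating_*` predictor, and the helper name `fit_line` are all assumptions for illustration, and the closed-form fit stands in for whatever regression software you actually use.

```python
# Hypothetical wide data: one rating and one recommend score per brand.
wide = [
    {"id": 1, "rating_A": 7, "recommend_A": 8,
     "rating_B": 4, "recommend_B": 5,
     "rating_C": 9, "recommend_C": 9},
    {"id": 2, "rating_A": 6, "recommend_A": 6,
     "rating_B": 8, "recommend_B": 9,
     "rating_C": 3, "recommend_C": 2},
]

# Stack: three outcome variables become one outcome column.
ratings, recommends = [], []
for row in wide:
    for brand in ["A", "B", "C"]:
        ratings.append(row[f"rating_{brand}"])
        recommends.append(row[f"recommend_{brand}"])


def fit_line(x, y):
    """Simple least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ratings, recommends)
print(f"recommend = {intercept:.2f} + {slope:.2f} * rating")
```

With the data unstacked, the same question would need three separate models, one per brand.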
Stacking to simplify reporting
When you stack data, it becomes possible to update calculations by applying a filter. If the data is not stacked, you'll need to update your analysis by changing variables or recreating the analysis from scratch. This takes more time and increases the risk of error.
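The idea of updating a calculation by changing a filter can be sketched as follows. The stacked rows, the `brand` and `recommend` column names, and the `average_recommend` helper are hypothetical; the point is only that the calculation stays the same while the filter changes.

```python
from statistics import mean

# Hypothetical stacked data: one row per respondent and brand.
stacked = [
    {"id": 1, "brand": "A", "recommend": 8},
    {"id": 1, "brand": "B", "recommend": 5},
    {"id": 1, "brand": "C", "recommend": 9},
    {"id": 2, "brand": "A", "recommend": 6},
    {"id": 2, "brand": "B", "recommend": 9},
    {"id": 2, "brand": "C", "recommend": 2},
]


def average_recommend(rows, brand=None):
    """Average the recommend score, optionally filtered to a single brand."""
    selected = [
        row["recommend"]
        for row in rows
        if brand is None or row["brand"] == brand
    ]
    return mean(selected)


# The same calculation, updated only by changing the filter.
print(average_recommend(stacked))             # all brands
print(average_recommend(stacked, brand="A"))  # brand A only
print(average_recommend(stacked, brand="C"))  # brand C only
```

With unstacked data, each of these results would need a different variable (`recommend_A`, `recommend_C`, and so on) to be selected by hand.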
Data stacking by software