What is Data Merging?

What is Data Merging?

Data merging is the process of combining two or more data sets into a single data set. Most often, this process is necessary when you have raw data stored in multiple files, worksheets, or data tables, that you want to analyze all in one go.

There are two common examples in which a data analyst will need to merge new cases into a main, or principal, data file:

  1. They have collected data in a longitudinal study (tracker) – a projects in which an analyst collects data over a period of time and analyzes it as intervals.
  2. They have collected data in a before-and-after project – where the analyst collects data before an event, and then again after.

Similarly, some analysts collect data for a specific set of variables, but may at a later stage augment it with either data in different variables, or with data that comes from a different source altogether. Thus, there are three situations that may necessitate merging data into an existing file: you can add new cases, new variables (or both new cases and variables), or data based on one or more look-up values.

Merging in New Cases

Merging in new cases, sometimes known as appending data (or in SQL, “unions”) or adding data by rows (i.e. you’re adding new rows of data to each column), assumes that the variables in the two files you’re merging are the same in nature. For instance, var1 in the example below should be numeric in both questions, and not a string (text) variable in one file and numeric in the other. Most software matches up the data based on the variable name, and so the same names should be used across the two files. “var1” in one file should be “var1” in the other.

This also assumes that the IDs for each case are different.  If it should happen that you have a variable in one file that doesn’t have a match in the other, then missing data (blank values) may be inserted for those rows that do not have data.

Merging in New Variables

Contrary to when you merge new cases, merging in new variables requires the IDs for each case in the two files to be the same, but the variable names should be different.  In this scenario, which is sometimes referred to as augmenting your data (or in SQL, “joins”) or merging data by columns (i.e. you’re adding new columns of data to each row), you’re adding in new variables with information for each existing case in your data file. As with merging new cases where not all variables are present, the same thing applies if you merge in new variables where some cases are missing – these should simply be given blank values.

It could also happen that you have a new file with both new cases and new variables.  The approach here will depend on the software you’re using for your merge. If the software cannot handle merging both variables and cases at the same time, then consider first merging in only the new variables for the existing sample (i.e. augment first), and then append the new cases across all variables as a second step to your merge.

Merging in Data using Look-ups

The above is all good and well if you have complete data sets to combine.  You can, however, augment your data with information from other sources. Consider, for instance, a data file where you have collected the zip-codes (or postcodes) of your respondents, and you want to attach some demographic data to your survey data – maybe the average income in each zip-code.

You will have your survey data on the one hand (left in the diagram below), and a list of zip-codes with corresponding income values on the other (right in the diagram).  Here, the zip-code would be referred to as a look-up code and function as the ID value did in our previous examples.

In other words, we use the look-up code as the identifier and add in the income values into a new variable in our data file.  In the diagram, observe how the data is matched up for each case by looking up the zip-code and then augmenting the original data with the income data for each matching zip-code.  For those familiar with Excel, for instance, the formula to perform this type of augmentation is =VLOOKUP().

The look-up code should be unique in the file that contains the additional data (in our example, each zip-code should only appear once, with a single associated income), but the same value can appear multiple times in the file you’re wanting to augment.  Think of it like this:  lots of people can share a zip-code, but there’s only one average income for each of those locations.

Don’t Forget…

As the process of merging files and appending data can be complex, it’s always handy to have software that does most of the hard work for you.  There are, however, three things to remember that will help ensure that it’s a smooth process:

  1. If you’re appending data, then the IDs should be unique in both files and the variables should be exactly the same in set-up and structure.
  2. If you’re augmenting data, then the variables should be unique in both files, apart from the ID variable which should be exactly the same in both files.
  3. If you’re looking up values across files then the look-up values can be non-unique in the target file, but should be unique in the source file.

About Mattias Engdahl

Matt has spent the entirety of his career in the market research space, working principally with data collection and processing. He loves nothing more than working out just why it is that you're one case out in Q63 when it *should* have been seen by everyone, or coming up with new solutions and scripts to make sure you have the right chart or table for that final report.