We clean data to make the data more accurate, so that we can get better quality insights.
And to make it faster for us to find the story in the data.
The three tasks
Data checking involves looking for problems in the data.
We then clean the dat to remove these problems.
And, we can tidy the data, so that it's easier to work with.
As with domestic duties, it's something of a grey line to distinguish between cleaning versus tidying.
OK, so this is a bit busy. But, it summarizes everything we will cover. I've done it as a one page cheat sheet, so that you can just refer to this one page when you want to do some data cleaning.
One thing to draw your attention to is that there isn't a neat match between the checking and cleaning. For example, when we check a data the first thing we look at is that it contains the correct number of cases.
If we find that the file doesn't contain the correct number, there are multiple possible remedies.
Maybe we need a better data file.
Maybe we need to reshape the data file.
Maybe we need to delete some cases.
You will see asterisks or stars next to a few of the topics. These are more advanced techniques. Today will cover the basics. If you want the more advanced topics, come back in two weeks!
We start by checking the data file as a whole, and trying to get the best data file that we can find.
Then, we move onto checking, cleaning, and tidying all the variables in a document.
Checking, cleaning, and..
We do this by creating summary tables for each set of variables.
We examine the table.
Clean and tidy its data.
Then, we move onto the next table, until we are finished.
OK, let us dig into how you check a data file.
The difference between...
One of the key differences between good and bad data relates to the format of the data file.
For example, popular file formats like Excel and CSV files aren't designed for surveys.
Good file formats are those where there is a specific places for all key information, such as question wording and the order of responses. This forces people to create data that is in a good structure , making it easier to do your work.
Good vs bad data file formats
To use an analogy, the bad file formats are a bit like trying to make your car run on coal. You need to first convert it to car friendly fuel. So, when you have better files there is much less work to be done.
What impossible-to-analyze data looks like
As an example, sometimes Excel files look like this. You can do virtually nothing of use with this. If this is not a data file.
What marginally OK data looks like
Here's the top left corner of a marginally better Excel data file.
Each row represents a respondent in our survey.
Each column represents some property of the respondents, such as their answers to questions. The columns are called variables.
We've got a unique ID variable. That is, a special number or code assigned to each person.
We are storing answers as numbers, not text.
And, with multiple response questions, we have a separate variable for each possible response.
What terrible data looks like
Now, here's a terrible Excel file.
We've got variable names and labels stored in two rows. This is bad.
We don't have one row per person> Instead, we've got some bonus rows containing sub headings.
Much of the data shows text rather than numbers.
We have half completed responses.
And, we have multiple distinct answers stored in a single column.
Cleaning this type of data takes a long time. The trick is to avoid doing it.
Novices vs professionals
Now to all you novices out there, I want to draw your attention to the biggest difference I've noticed between novices and professionals when it comes to data cleaning.
If a professional gets a bad data file, they try and find a better one. They complain to whoever provided it. The explore all the different exporting options. They will even change which data collection platform they use if it will get them better data files.
Perhaps because of a lack of confidence, novices tend not to do this. The consequence is that novices end up having to figure out how to fix lots of problems themselves, and the data cleaning process tends to take a lot longer as a result of this.
So, let's put this into practice.
Case study 1: Inc 500
We will start with a very simple case study.
We have a data file showing some key facts about the 500 fastest growing businesses, as identified by Inc magazine.
I will load the data file.
The best data usually comes from integrations, where you suck the data straight from the data collection platform. For example, I've got options in Displayr for doing this with Qualtrics, Decipher, and Survey monkey.
Sadly, for this data set, the best I can get is an Excel file, so we will have to make do.
Load daa file> Displayr\Resources - Documents\Data\Data Cleaning\Inco 500 File 1.xlsx
As mentioned, this summarizes all the steps.
Step 1, is to check we have the correct number of cases.
Now, we are meant to have about 500 companies. So, we should have 500 cases, or rows, in the data file.
How many do we have?
Wave over number of cases?
We have 2,500, so there's a problem.
Make 1. Correct number of cases red.
Let's look at the raw data.
Most data analysis software has a view of the data called the data editor, where you can look at the underlying data.
Right-click > View in data editor.
As I mentioned, we are expecting 500 rows or cases of data, but we have 2,500. When we look at the data, we see the problem. Each business is represented by 5 row of data.
For example, we can see the business name in only row 2, and row7.
It is extremely hard to do valid data analysis with survey data in this format. We need to widen the data, so that we have only one row for each business.
We need to get the data reshaped. We could do this ourselves, but this is an advanced topic, and not for today. So, instead, we go back to whoever gave us the data and ask for a format where the data for a business is in just one row.
Let's see what we get
I will delete the old file.
And import a new file.
Now, this new file has got 166 cases. That's not what we want, we want 500. Let's look at the raw data
Select variables > View in data editor
Can you see what's gone wrong?
Yep, We now have multiple businesses in each row. We have Club Pilatoes, and in the same row Halo Top Creamery.
To fix this, we need to reshape the data file using a technique known as stacking.
But, rather than do that, I'll just tell them to give me a new file.
Ok, so now this has 501 cases, so we are at least close.
Let's look at the raw data.OK, so this dta file at least looks right. One row per company.
Mmm.. For some reason, rank 500 is in row 501. Why is that
So the shape looks right, but we do have a problem.
Let's move on to step 3. We want an ID variable without duplicates.
Display's got a nice tool we can use to find duplicates.
We just select the data we want to look at, and ask it to create a new variable that lists any duplicates
+ Insert ready made formulas > Duplicates
Let's add it to the data editor
And, we'll sort in descending order.
So, we can see that rank 215 is a duplicate.
Let's remove the sort.
So, for whatever reason, there are two companies ranked 215., so we seem to have the Inc 501
Time to talk to whoever created the file!
Case Study 2: Survey about mobile (cell) phones
So, this is a more traditional market research study.
I will import it
+ data set
Note that the file name is phone.sav. This ells us that it's one of the nice types of data files. An SPSS file. Which is good.
The bad news is that the data that was collected by students, and they created the data file, so it's a real mess. This makes it perfect for learning about cleaning and tidying.
Let's go through the process of checking and cleaning it.
As you can see, the data is meant to be about 700. The file is 725, so all is good.
Let's look at the raw data in the data editor.
Select IID .. Age
It looks like we have a row for each respondent, and a column for each variable.
So, it's got the correct shape.
Next we need to check the ID variable.
As you can see, it's got an ID variable. But, does it have duplicates?
As before, we will add a new varaible, and take a look.
Ah. We have quite a lot of duplicates. That's a problem.
And, we have some people with missing data for the ID variable.
Let's take a closer look at these duplicate IDs.
They've got the same id number.
But, there is different data for Q2.
This is good news. It tells us that we don't have the same respondents in the data multiple times, which would be a big problem.
But, instead, the ID variable is a problem.
This is actually a real problem. Let's say you wanted to go back and review the actual questionnaire for the person. If you have no unique ID, you have no way of knowing who did what questionnaire, so you can't check.
In a real world study, I would at this stage go and ask for a new data file with the problem fixed. But, sometimes that's not possible, so we will keep going with this bad data.
Next criteria is does it contain the right variables? For example, sometimes a study may be missing the answers to a question. Or, be filled with lots of irrelevant data. As we don't know enough to know the answer in this study, we will move on.
Now, the next step is to review all the variables. As mentioned, the way we do this is create lots of tables and review them all, one by one.
We have already examined the ID variale and the duplicates.
Does respondent have a mobile phone
Look at this.
Housten we have a problem.
When being asked the question do you have a phone, according to the data, one person said 3, and another 3 people said 5.
That makes no sense.
To clean this data properly we need to know how this could have happened.
There are a few different explanations.
One problem could be that the data file is missing some labels. Maybe 3 means "I have multiple phones".
If we do some digging around and work that out, then we just need to change the label.
But, it could be that we have an out of range error, whereby somebody has inadvertently entered the wrong number. This can happen if, for example, your data is from a paper questionnaire and has been manually entered, as happened here
If that's the case, we need to change the value
Sort Until 5s are at top
If we know that say, the 5 should be a yes, we can manually change it.
But, when we do this we get this weird error. I will give you a chance to read it.
Both Displayr and Q won't let you change the raw data without specifying an ID variable.
This is for a really good reason. Often people need to swap data files. For example, maybe they get given a new data file with some more responses, or some fixed problems.
If you manually change values, the software needs to remember which peoples data was changed. It does this by using an ID variable.
So, this is how we set the id variable.
We select the data file
Then we set the unique identifier
Unique identified: IID
Oh, that's right, remember we don't have a unique identifier.
There is a hack we can do, which is to just use the row number.
But, it's a dangerous hack, as if we get a new data file with data in a different order, we will have a problem.
But, at lease we can change the value from a 5 to a Yes.
Now, it looks like this isn't working, but Displayr's being a bit smart here and is re sorted the data.
But, what do we do if we have no way of knowing what the 5s and 3s should be?
What we should instead do is set them to missing values.
We can do this by deleting the raw data.
Before I do that, note that the sample size for this question is 725.
Note how the sample size has changed to 724. But, it's just for this table. Note how the number of cases is still 725.
We still have another problem with this data. The survey was meant to be with people that have cell phones.
But, we've got 6 people that don't have cell phones.
Their data is useless to us. So, we don't just want to delete their answer to this question. We need to delete them from the entire data file.
Now, note that our entire data file has only got 715 cases. Don't worry, you can undo this later if you need to in Q and Displayr.
OK. So now we've find and cleaned all the issues on this table, so we will look at the next one.
OK, what can you see wrong here?
The first problem is that the order doesn't make a lo of sense.
I think it's better of they are ordered to reflect life stage more.
Drag Returned to below Fulltime worker.
Drag Home Maker to above Retired
Look at the footer of the table. For some reasons, we've got 17 people with no data.
Now, in professional surveys, it's extremely rare to make questions optional. So, we need to investigate this.
We will again look at the raw data.
But, with data like this it's good to look at some other data that should be related.
let's also look at Q3, which is occupation
Where are the rest of the demographics?
Ah, here they are.
I will reorder them and drag them to the top so all my demographics are together.
And, I'll look at them in the data editor too.
Now, I'll again sort the data to look at Q2, to find the missing respondents.
What can you work out from this?
This respondent has lots of missing data. They are garbage. I will get rid of them from the entire data file.
Now, let's look at the remaining people with missing data.
What can you see?
That's right. They are all pretty young. My guess is that they are all students. Remember, this study was done by students.
So, what are our options.
What I would do here, is I will change them all to students.
I can manually change them one by one. But, there's a faster way.
But, before I show you, I want you to note that currently the table says 42% Fulltime workers.
Click on Work Status
Missing data = Include in analyses
Note now that the missing data is shown on the table itself, and the sample size has gone up to 716.
And, note that now we only have 41% fulltime workers.
I'll merge them with the students and rename them.
Merge and rename
Some of you may be horrified at this stage.
Did I just make up the data?
Kind of. If it helps, there is a formal name for this. I have just Imputed the missing values.
That sounds better doesn't it?
Some people will say it's wrong, and we should just exclude them from the analysis. But, that would not be correct.
It's likely they are students, if we just treat them as missing data, and ignore them, we will be over-stating the other numbers. For example, we will he showing 42% rather than 41% for Fulltime workers.
We will return to how to impute missing values in two weeks in our session on advanced data cleaning.
OK. Now let's look at the occupation data.
Note that we currently have a sample of 336.
Rather than manually fixing the values of -9 and 0 to missing values, I will use a shortcut and just delete the rows on the table, which automatically sets them to missing in the background.
Note that our sample size went down to 329.
We've got 387 missing values for this question.
Is that correct?
To answer this, we need to compare it with our earlier work status question.
The best way do do this is with a sanky digram.
Title: Sankey diagram of Work Status and Occupation
Let's start reading this from the bottom.
Everybody with a work status of Not working has no occupation. That makes sense.
Same with retired.
Same with Home Maker.
But, with part time worker and student, we've got a bit of a problem. Some people were asked their occupation. Some weren't'. So, we will clean that up.
Select q2 and q3, ands open the data editor
Let's sort by Q2.
Note that the Sankey Diagram is much neater.
So, now the sankey diagram shows us that work status and occupation are now consistent.
We have 5 full time workers who we don't know their occupation.
There's no easy solution to these guys.
I will return to them in two weeks in the advance data cleaning.
OK, we have lots and lots of problems with this data.
Problem 1, we've got one person under 15 years.
That's too small to be very interesting.
In some studies, it would make sense just to merge them into the 16 to 19 years group.
But, in some places, such as Australia, it was perhaps illegal to do the interview, so we will just delete the respondent from the data set.
Oh. Note here that we've got no age data. There not much good to me. II will delete them as well.
OK, so our table is looking better, but we've still got a big problem.
Look at the percentages.
We have lots of people aged 20 to 24.
But, only a few aged 30 to 34.
Lots fo people aged 45 to 54.
Few people older.
What's going one here?
That's right. Our survey is massively over-representing students, and the parents of students.
What can we do?
A serious option is to declare the survey garbage and move on. But, in the real world it's often impractical to do this. So what can we do?
We can acknowledge we have a limitation, and try and rectify it by weighting the data
Insert weight variable
Adjustment variable: Age
To weight data, we need to enter in what the numbers should be. We get these from other studies and published records, such as sales data or census data. Here's an example.
So, I make sure that my categories match, I copy
Now, there's a separate webinar on weighting, so I'm not going to focus on this, but the diagnostics below tell us that this weight isn't a great one
Having created a weight, we apply it to any analyses we wish to use.
Let me apply it to the age data.
Before I do, note that we've only got 2% of our data showing 65 or over.
But, in the weighed analysis, this is now the correct number of 20%.
So, we have fixed a problem, but it doesn't completely solve it, as while he percentages show we have 20% aged 65 and over, we still only have 13 people in our data set of this actual age, so we have limited good data about older people in the study.
Top of mind awareness
OK, hopefully you are getting good at spotting problems. What can you see here?
Yep, we have two others. That's weird. We will combine them.
We've got a weird category that seems a type called No previous contract. As there is nobody in it, we can just remove it.
We've got lots of categories, and some are pretty small, so let's combine them.
Select categories with less than 5%, Combine
Now, this next problem is a bit trickier. To illustrate it I will add some decimals.
Add 1 decimals.
Top of mind awareness is the first brand people can think of when asked. It's often called Share of Voice. Like all share figures, you want it to add up to 100%. But, the brand numbers don't add up to 100%.
Let me show you.
Create NET of all brands.
As you can see, they add up to 99.9%
Now, that's close enough to 100% that we probably don't care.
But, it's caused by the Don't know response. And, the more don't know options, the bigger the problem would be.
There is a standard fix for this, which is to re-base the table.
To understand this, let's look at how percentages are computed.
The Optus score is calculated as 260 divided by 706.
We want to exclude the Don't know from this, which means divide the 260 by a smaller base of 705, rather than 706.
This is done by setting the Don't know as Missing data, which is done in Displayr by right-clicking and pressing Delete.
As you can see the problem is solved.
I'l hide the NET above, as we don't need it anymore
Now we have an ugly problem, which is a bit hard to understand at first.
This table is a bit different to the ones we've looked at before.
Note that its both got row names, and, also, distinct columns names. It's what's known as a grid.
Let's look at the variables.
The grid is actually summarizing the data in 22 different variables
So, where it says Unaided for AAPT at 8%, this is actually a summary of just this first variable here.
Let's look at this in the data editor.
Right click > Data editor
So, this is the first variable, with is Unaided for AAPT.
This is the second variable which is Unaided for New Tel, and so on.
While it's possible to view this as a grid, typically we wouldn't want to do this.
Instead, we'd want to split it into two sets of variables.
I'll do this by selecting the Unaided data, and creating a new variable set.
Note that we've got an orange box warning us there's something to fix. Let's have a look.
Displayr and Q warn you when they have made a guess. In this case, it's guessed that we want to show the percentage of people who said Yes, which is correct, so we don't need to change anything.
Now let's look at the remaining data.
Lets fix up the name
Q -> Aided Awareness
We have an error.
The problem is that it's a grid
Point at structure.
So, while the data is binary. That is, each variable contains two categories, a Yes and a No, it's no longer a grid. So, we need to change this to a more appropriate structure. In this case, to a Binary Multi.
Structure: Binay Multi
Now, let's put the counts and sample sizes back onto these to get a better idea of the data.
Statistics - cells > Count, Sample size
Let's focus on the table on the right.
What's wrong with it?
Note that This data is measuring Aided Awareness as we saw before.
Now, if you think for a while, you will see we have a problem. Optus has an unaided awareness is 89%. That is, when we ask Which phone companies have you heard of, 89% of the Australian sample said Optus.
Now, Aided Awareness is the proportion of people that can recognize the name of a brand. This number for Optus is only 17%, but logically it should be bigger than the result in the earlier table.
There are a few other hints to problems.
The NET % is 100% on the table on the left, which makes sense, but ??.
The Sample size on the table on the right is only ?? people for Optus. What's going wrong?
Let's take a look at the raw data
Select two Optus variables. > View in data editor
The first variable shows whether they said mentioned Optus in unaided awareness. That is the table on the left.
The second variable shows the aided awareness. You can see that everybody who said Yes in the first question, has missing data in the second question.
Can you see what's happened? We have missing data for everybody that said they were aware of Optus.
This makes sense. If you know people are aware of Optus, there's no need to ask them again. But, the table is still wrong.
In the table on the right, we now know how to read it. Of people that didn't initial mention Optus, 17% weren't aware of the brand when asked.
So we once more need to rebase the data.
And, Displayr's given us a hint about how to do it.
Displayr has given us a hint there is in orange.
OK, so where it says Missing Data, it really means Sponteanously Aware
Missing Data -> Spontaneously Aware
Now, to rebase the data, we need to change the missing values setting first.
Rather than exclude the Spontaneously aware, we want to include them in the analysis.
Unchecking Missing values in row 1
And, we want the spontaneous awareness data to count up both the people who said yes in this question, and, all the people we already know have select the brands in the earlier question.
Check first Count this value.
Type: Aided Awareness
Now, look at the table. The percentages are now all bigger than in the first table, and we have consistent sample sizes for each brand, so the data makes sense.
Now, I will skip a few tables just to focus on he most interesting examples.
Reasons for change
Here's an interesting one. The variable asks about reasons for change.
There are two different problems we need to fix.
All our previous tables were showing percentages. But, this one's just showing the raw data.
But, the table seems just to be raw data. What's going on?
Currently, this data is stored as text, we need to instead tell Displayr that is categorical data.
As the numbers in his case represent unordered categories, we need to set this to Nominal.
OK, so now it's showing percenages.
But, there's a second problem. Commas.
What's this about?
A column in a data set should just contain one piece of information. But, whoever set this up has squeezed in multiple answers with a comma inbetween.
So, we need to split them up. In this situation, I'd always go back to whoever created the data file and tell them to do a better job.
But, it's also possible to do a bit of magic to split them up. But, it's an advanced topic, so not for today.
Allows to keep in touch
22 as a response clearly doesnt fit in, so I will set it to missing, rebasing the table.
We'll also get rid of the don't know.
Note that the next table uses the same rating scale, of strongly agree to strongly disagree.
The next page does as well.
Before, we saw that Displayr had over-grouped some data, and we split it up. Here, we are going to do the opposite. We are going to find all the variables that have the same structure, and combine them.
Displayr's pretty smart now and it gives us warnigns telling us that the data is not all the same, and this is why Displayr didn't automaically group he data.
But, we can see that the differences are just typographical errors, of people typing a 6 and a 22, so we will click Proceed regardless
Change all data
I will take the cont and sample size off this table to make it easier to see what's going on
OK, so we have 6 column, and I'll remove it as well.
How many SMS In a typical week
So far, we've been looking at categorical data. Here's some numeric data, showing the number of text messages sent per week. This is an old data set!
How do we check that?
Let's put the minimum and maximum on it
Statistics > Cells > Minimum, Maximum
OK, both the minimum and maximum look sensible.
But, what would we do if they didn't?
We would recode the data to fix it.
We do this by selecting the variable
So, if we didn't believe the 200 number, we could either
Change its value to a smaller number, which is known as capping.
200 -> 100
Or, Change missing data to Exclude form analysis
Analyzable > Piping
OK, there's one last problem that I want to bring to your attention.
It's got a wei name. Piping.
Imagine that we asked Question 5 and Question 6 here. I'll give you a chance to read them.
There are two ways that this can be set up in a file.
In the middle column it shows the right way. There is one variable for each brand. One showing which were consumed int he past week. One showing how many were bought.
The bad way of having this set up is shown on the right.
The problem occurs with the second set of variables.
Rather than there being one for brand, the data is mixed up for the brands, and the first of teh Q6 variables shoes the first brand the person mentioned, regardless of which brand.
If you have this it's a huge problem. Unless you are great at writing code, you can spend days fixing this type of problem. The solution is to get whoever provided you the data file to provide a better one.
So, this is the overview I presnted at the beginning of the webinar.
I've marked the stuff we did in green.