Faster (and Better) Survey Analysis

Confidently Analyze Your Own Survey Data
Get the foundations right—faster. If you work with survey data and want to be more self-sufficient, this free 30-minute session is for you. Led by Displayr Founder Tim Bock, it walks through a real-world case study to show how to confidently clean, structure, and analyze your data.

This isn’t a stats or AI deep-dive—it’s a practical guide to the core steps in most quant studies, and how to do them well.

Watch now and level up your survey analysis.

In this webinar you will learn how to:

Choose the right data file types (and avoid the painful ones)
Clean and tidy data using a simple, repeatable process
Set up basic filters, weights, and significance testing
Make sense of the data and structure it into a compelling report

Transcript

I’m going to walk you through the basics of how to analyze a survey.

We’ve put this together for survey analysis beginners. If you’ve never gone through all the steps in analyzing a survey on your own, this webinar is for you. And I promise some good time-saving tips for the more experienced.

Case study

We will do a case study, exploring interest in this concept.

Please take a moment to read it.

Overview

I will take you through eight stages of analyzing a survey.

Getting the right type of data file

The first stage is getting data in the right format. This is the first big mistake that people make when they analyze surveys.

Bad vs good…

They click the export button in their data collection software and get an Excel or CSV file. And they try and analyze this data.

But, Excel and CSV files weren’t invented for survey analysis. Yes, you can use them, but you will double the time it takes to do analysis and you will likely make lots of mistakes.

The good file formats have been specifically designed for surveys.

The most well known was invented many years ago by SPSS, which is now part of IBM. This file format, the SPSS .sav file format, is the industry standard. It is used by just about all survey analysis software. Even better is to use integrations that suck the data directly into Displayr and Q from the survey platforms.

The difference between bad versus good file formats

There’s much more about this in our webinar on data cleaning and preparation.

But, in a nutshell, the bad file formats allow users to dump whatever data they want where they want, meaning it’s hard to find things.

The good file formats enforce a much better structure on the data.

Of course, there’s more to it than just the file format. People can give you bad data no matter what the format.

I’m going to bring in the data for the iLock study. It’s an SPSS data file.

The data is on the bottom left. As it’s a good quality data file, it’s easy to see what data it has in it.

Cleaning and tidying by variable set

Now that we have data, our next step is to clean and tidy it.

Question-by-question workflow for cleaning and tidying

There’s a basic process for cleaning and tidying:

We create a summary table of each question, then check each summary table for dirty data. If we find dirty data, we clean and tidy the variables and delete any low-quality observations. Let’s kick this off by creating summary tables of all the data.

Response ID

In a survey, each person is typically given a unique number or code. This is what’s shown here. The check for this is that there are no duplicates.

In Displayr, much of the data checking and cleaning is done by creating new variables. Let’s see if we can figure out how to find duplicates.

I select Response ID, as that’s the variable we’re examining. Happily, it’s recommending that we should find duplicates, which is what I will do.

Now, we’ll create a summary table of this new variable.

If we had any duplicates, it should show us a row called **Yes**. Happily we do not.

Having checked this, I don’t really need to see the ID table or duplicate table again, so I will delete them. Just like when cooking, cleaning up as we go is the right way to go.
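
For readers working outside Displayr, the duplicate check can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `flag_duplicates` and the sample IDs are invented here, and it is not a description of how Displayr implements the feature.

```python
from collections import Counter

def flag_duplicates(response_ids):
    """Return 'Yes' for every ID that occurs more than once, otherwise 'No'."""
    counts = Counter(response_ids)
    return ["Yes" if counts[rid] > 1 else "No" for rid in response_ids]

# Hypothetical response IDs, invented for illustration only.
ids = ["R_001", "R_002", "R_003", "R_002"]

flags = flag_duplicates(ids)   # the new "is duplicate" variable
summary = Counter(flags)       # the summary table of that variable
print(dict(summary))
```

A clean file would produce a summary with no Yes entry at all. In this invented sample a Yes row appears because one ID occurs twice.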

Status

The next table shows us status. We’ve got 275 people that are incomplete.

Typically we want to delete them from the survey. This should leave us with 300 people.

Displayr and Q make us jump through a hoop before deleting data. We need to tell it to remember the response id of each of these people so that later, if we update the data, it can remember to re-delete the same people.

Let's look at the raw data

As you can see, all the analyses are updated to remove the deleted respondents.
I don’t need to look at this again, so I will delete it.
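
The idea of remembering which response IDs were deleted, so the same people can be removed again after a data update, might look like this in Python. It is a sketch under assumptions: the field names `response_id` and `status`, the helper `delete_by_id`, and the records themselves are all hypothetical.

```python
def delete_by_id(records, ids_to_delete):
    """Keep only the records whose response ID is not in the deletion list."""
    return [r for r in records if r["response_id"] not in ids_to_delete]

# Hypothetical records, invented for illustration only.
records = [
    {"response_id": "R_001", "status": "Complete"},
    {"response_id": "R_002", "status": "Incomplete"},
    {"response_id": "R_003", "status": "Complete"},
]

# Remember the IDs of the incomplete respondents, then delete them.
deleted_ids = {r["response_id"] for r in records if r["status"] == "Incomplete"}
cleaned = delete_by_id(records, deleted_ids)

# Later the data file is updated with a new respondent.
# Re-applying the remembered IDs removes the same people again.
updated = records + [{"response_id": "R_004", "status": "Complete"}]
cleaned_again = delete_by_id(updated, deleted_ids)
```

Storing the IDs rather than row positions is the design choice that matters here, since row positions can shift when a file is refreshed.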

Duration (in seconds)

This is how long the questionnaire took to complete, on average, in seconds.

Previously Displayr has shown us percentages. Now it’s showing an average. Why?
Because the data here has been stored as numeric. If it was instead stored as nominal data, Displayr would show percentages. Let’s look at the raw data, and sort from smallest to largest.

So, 227 seconds is 3 minutes and 47 seconds. That’s pretty fast. But, it’s plausible. If the numbers were implausible we would need to delete the data with the implausible values.

Having looked at this data, I don’t need to see it again, so I’ll delete the table.
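
The same plausibility check can be sketched in Python: sort the durations, read the fastest one in minutes and seconds, and flag anything under a cutoff. The 227-second value comes from the data above. The other durations and the `MIN_PLAUSIBLE_SECONDS` cutoff are assumptions made up for this example, as the right cutoff depends on the questionnaire.

```python
def format_duration(seconds):
    """Convert a duration in seconds to a 'minutes and seconds' string."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes} min {secs} s"

# 227 is the fastest completion time mentioned above; the rest are hypothetical.
durations = [612, 227, 1430, 845]

fastest = sorted(durations)[0]
print(format_duration(fastest))

# Assumed cutoff for an implausibly fast completion (not from the webinar).
MIN_PLAUSIBLE_SECONDS = 120
too_fast = [d for d in durations if d < MIN_PLAUSIBLE_SECONDS]
```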

User language

We’ve got data on user language. It’s showing us the raw data rather than a summary table.

This is because whoever created the data file set it up to show this data as if it was text rather than categories.

We can change this to instead show a table of percentages.

So, 100% of people doing the survey speak English. Not so interesting.

Panel

No idea what this is

Gender

We’ve got a 3rd gender category. As there’s no data for it, I’m just going to delete it. Note that Displayr’s showing percentages and the counts, which are the number of people who chose each option.

Both are useful, but most of the time it is the percentages that are most useful. A mistake that novices make is to report the counts instead. This isn’t so interesting, as who cares if 134 people are Male?

When we look at percentages, it’s more interesting. This says that 55% of adults in America are Female. If correct, that’s a useful thing to know. This is the goal of surveys. To estimate things about the world outside of the survey itself.

The correct value in the population for this data is 51% for females, but it’s not too badly skewed, so it’s not a problem.
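
To make the counts versus percentages point concrete, here is a small Python sketch. The count of 134 males is from the table above. The count of 166 females is an assumption, inferred from the 300 respondents left after removing incompletes and the absence of any third-category responses.

```python
def percentages(counts):
    """Convert a dictionary of counts into percentages of the total."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}

# 134 is quoted above; 166 is an inferred, assumed value (300 - 134).
counts = {"Male": 134, "Female": 166}

pct = percentages(counts)
for category, value in pct.items():
    print(f"{category}: {value:.0f}%")
```

Under that assumption the female share rounds to the 55% shown in the table, which can then be compared with the 51% population figure.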

Age

The survey was only given to adults. So, this first category isn’t interesting. Tidying in this case means removing it.

We’ve no huge difference between the ages, so that’s fine.

State

It’s usually better to look at this data as a map. We can see that the biggest states are California, Texas, Florida, and New York, so this looks basically right.

Population density

The bottom category’s pretty small. Only 9 people. That’s too small for useful analysis. We need to merge the bottom two categories.

For people new to Displayr and Q, note that while I’m merging rows in the table, it’s actually saving the changes in the underlying data, so these changes will be made whenever we have tables showing population density.
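
Merging categories in the underlying data, rather than in one table, amounts to recoding the variable once and building every table from the recoded values. A minimal Python sketch follows. The category labels, the counts, and the `MERGE` mapping are hypothetical, since the actual labels are not given here.

```python
from collections import Counter

# Hypothetical mapping: the two smallest categories are merged into one.
MERGE = {
    "Rural": "Rural or very rural",
    "Very rural": "Rural or very rural",
}

def merge_categories(values, mapping):
    """Recode each value through the mapping; unmapped values are unchanged."""
    return [mapping.get(v, v) for v in values]

# Hypothetical raw data, invented for illustration only.
values = ["Urban"] * 5 + ["Suburban"] * 4 + ["Rural"] * 3 + ["Very rural"] * 1

merged = merge_categories(values, MERGE)
table = Counter(merged)   # any later table is built from the merged variable
print(dict(table))
```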

Education

I’ll remove this first category as it’s empty.

In which of these groups…

OK, so this is data about income. But, the label’s very verbose. Rather than fix the label of the table, we’ll fix it in the underlying data, and the table will update automatically.

We’ve got a lot of income categories. One option is to merge them. But often a better option is to treat the data as numeric and calculate the average.

An average income of 17.7. That seems a bit low. We need to look at the data values to better understand.

So the way the data has been set up, an income of less than 1000 is a 1, 1000 to 2999 is 2, and so on.

What we can do is replace these values with midpoints. For example:
1. *1 -> 500*
2. *2 -> 2000*

This is called midpoint recoding. We can do this automatically.

That makes more sense.
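
Midpoint recoding is simple enough to sketch by hand in Python. The two bands below are the ones described above (under 1000, and 1000 to 2999). Treating each upper bound as exclusive, so that the midpoints come out at 500 and 2000, is an assumption of this sketch, and the respondent codes are invented.

```python
# Income bands as (lower bound, exclusive upper bound), keyed by stored code.
BANDS = {
    1: (0, 1000),     # "less than 1000"
    2: (1000, 3000),  # "1000 to 2999"
}

def midpoint(code):
    """Return the midpoint of the income band for a stored code."""
    lower, upper = BANDS[code]
    return (lower + upper) / 2

# Hypothetical stored codes for five respondents.
codes = [1, 2, 2, 1, 2]

recoded = [midpoint(c) for c in codes]
average_income = sum(recoded) / len(recoded)
print(average_income)
```

The average of the recoded values is in currency units, whereas an average of the raw codes is only an average category number.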

What, if anything, do you think you would particularly like about this product? Type “Nothing” if there is nothing that you like.

This next table is showing open ended data about what people like about the iLock.

And the next one asks about dislikes

If you look at the first person, they’ve given us garbage.

Who else has given us garbage? I’m going to use AI to answer this. This feature isn’t in all accounts yet, but should be released for everybody over the next few weeks. Fingers crossed.

I will check both the likes and dislikes at the same time.

I’ll sort it.

OK, so the AI has found the poor quality data. It’s a bit subjective as to what’s dirty, but these are the ones I regard as poor.

I will delete the cases.

OK, let’s automatically categorize it. In the interests of time, I’ll just do this for dislikes.

I’m going to go through this bit super fast, but we’ve got another webinar on text categorization with more details.

As you can see, 2/3 of people didn’t dislike anything.

Which phrase from those below best describes how likely you would be to buy the product for yourself?

Clean enough. We will return to this later.

Compared with similar products, how different do you think this product is?

We will return to this table later.

How well do you think the idea for this particular product fits what Apple means to you?

There are too many categories. We should merge some of these categories.

How likely would you be to buy this product for $199?

This is priced purchase intent. Nothing to do here.

Browser meta info - Browser

This tells us what type of browser they were using.

Weighting

Once we’ve cleaned and tidied our data, we move onto weighting, also known as sample balancing, calibration, raking, and post stratification.

The basic idea here is that, in a survey you will often end up under representing some groups in the population.

We already looked at the age and gender data, and it wasn’t hugely problematic. So we don’t need to weight this data.
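
Purely to illustrate the basic idea, here is how the simplest form of weighting works: each group gets a weight equal to its population share divided by its sample share. This Python sketch is not the method used by Displayr or Q, and it rests on assumptions: the sample counts of 134 men and 166 women are inferred from the earlier gender table, and the 49% male population share is inferred from the 51% female figure.

```python
def cell_weights(sample_counts, population_shares):
    """Weight for each group = population share / sample share."""
    n = sum(sample_counts.values())
    return {
        group: population_shares[group] / (count / n)
        for group, count in sample_counts.items()
    }

# Assumed sample counts (134 is quoted earlier; 166 is inferred as 300 - 134).
sample_counts = {"Female": 166, "Male": 134}
# 51% is quoted earlier; 49% is inferred, assuming two categories.
population_shares = {"Female": 0.51, "Male": 0.49}

weights = cell_weights(sample_counts, population_shares)

# After weighting, the female share of the sample matches the population.
weighted_total = sum(sample_counts[g] * weights[g] for g in sample_counts)
weighted_female_share = sample_counts["Female"] * weights["Female"] / weighted_total
print(weights, weighted_female_share)
```

Over-represented groups get weights below 1 and under-represented groups get weights above 1. Here both weights are close to 1, which is consistent with the decision not to weight.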

There’s a whole webinar on weighting if you want to learn more, or ask me at the end.

Filtering

Filtering is the process of running analyses on only a subset of the data.

Here’s the earlier categorization of why people dislike the iLock. I’m going to create a second copy and filter it to look at just the males.

As you can see, the data is a little different among the males to the total sample.

Often we may create much more complex filters involving age, gender, geography or any other data to drill into some sub-group.

While filtering is very useful, we can often get a more useful result by creating a crosstab, which is where we have one question in the rows, and another in the columns.

Here I will add Gender to the columns of the first table.

Note now that we’ve got the same filtered data for men in the first column, the women in the next, and the total or NET in the third. So, a crosstab is a quick way to apply and contrast multiple filters.
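
The relationship between a filter and a crosstab can be sketched in Python: each column of the crosstab is the same table computed on one filtered subset, plus a NET column for everyone. The rows, the `Price` category, and the helper `column_percentages` are all invented for illustration.

```python
from collections import Counter, defaultdict

def column_percentages(rows, row_var, col_var):
    """Return {column: {row category: column %}}, including a NET column."""
    counts = defaultdict(Counter)
    for r in rows:
        counts[r[col_var]][r[row_var]] += 1
        counts["NET"][r[row_var]] += 1
    return {
        col: {cat: 100 * n / sum(c.values()) for cat, n in c.items()}
        for col, c in counts.items()
    }

# Hypothetical respondents, invented for illustration only.
rows = (
    [{"gender": "Male", "dislike": "Nothing"}] * 2
    + [{"gender": "Male", "dislike": "Price"}] * 2
    + [{"gender": "Female", "dislike": "Nothing"}] * 4
    + [{"gender": "Female", "dislike": "Price"}] * 2
)

table = column_percentages(rows, "dislike", "gender")

# Filtering to males and tabulating gives the same numbers as the Male column.
males_only = [r for r in rows if r["gender"] == "Male"]
filtered = column_percentages(males_only, "dislike", "gender")["Male"]
```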

Planned analyses

This next topic is the thing that really separates out expert data people from the rest. Well before you look at your data, you need to very carefully identify the key things you need to work out.

What novices do instead is they write a questionnaire but don’t ever take the time to work through how they are going to analyze it, and this causes trouble when it comes time to do the analyses.

The specific plan that you will have depends entirely on what you are interested in. There’s no standard plan. There’s more about this topic in our webinar on finding the story in the data.

Analysis plan for the iLock

Here’s a simple analysis plan for this survey. I will work through it. The first thing is, is the concept viable?

People tend to exaggerate how likely they are to buy things, so you need to compare this data to benchmarks typically. The benchmark I’m using for this survey is 25%. So, we are a long way behind benchmark.

OK. What’s next? We need to compare our purchase intention priced versus unpriced. So, the score for definitely would buy when we don’t show the price is also 12%. So, our problem isn’t price.

Let’s look at the other bits of data that we planned to look at.

Most people are viewing it as somewhat different. So, the problem isn’t that it’s perceived as a “me too” product.

Only a few people are thinking it fits poorly with Apple. So that’s not the problem.

The most used tool in survey analysis is the crosstab, which I’ve explained before.

I’m now going to automatically generate lots of crosstabs.

Is there a difference in the purchase intent of men and women?

You bet. Lots of differences. For example, with the key definitely buy it group, it’s higher among men at 14% than women at 11%. But, is this difference reliable? Is it just a random bit of noise that doesn’t reflect the world beyond our survey?

Fortunately this is a topic that the whole discipline of statistics has focused on solving. The arrows are telling us whether the differences between the filter groups are reliable enough to tell other people about. There are no arrows in the first row, so we can’t conclude a difference between the I would definitely buy it scores of the men versus the women.

Yes, we do get what’s called a significant difference in the third row, but as that row’s not very interesting, this significant difference is immaterial.
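
One common way to check whether a gap like 14% versus 11% is reliable is a two-proportion z-test, sketched below in Python. This is a textbook test chosen for illustration and is not necessarily the test behind the arrows in Displayr. The group sizes and counts are hypothetical, picked only so that the proportions come out near 14% and 11%.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 19 of 134 men and 18 of 166 women "would definitely buy".
z, p_value = two_proportion_z_test(19, 134, 18, 166)
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

With these assumed counts the p-value is well above the conventional 0.05 cutoff, which matches the conclusion that the difference between men and women cannot be treated as reliable.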

When you do surveys, you tend to have to do lots of analyses like these. So, we can automate the process further. I’m going to automatically create crosstabs comparing purchase intent by all the demographics. If your client is not trusting, you can put this in an appendix.

That’s more interesting. Purchase intention is strongly related to age. Note here that I’m looking at the first number in each cell, which is the Column %. 25% for the 18 to 24s. All the way down to 0% for the 5 or older. I’m going to convert this from a table to a page so I can add some commentary.

So, in the very rural places, intent’s much lower. That makes sense.

In the all important I would definitely buy it, there’s no significant difference by education. But, notice there does seem to be a trend.

If we had access to a statistician, they could perhaps find a way of demonstrating that this is, with a bit of magic, statistically significant.

My solution would be to say something like “There’s weak evidence that purchase intent increases with education.”

OK there’s lots of interesting stuff on this. Let’s turn it into a page. Starting at the top left corner, even the people that said there was nothing that they disliked, only 18% said they would definitely buy it.

Stat testing / statistical significance

We’ve just done stat testing.

Finding the story

Now, we move onto finding the story.

We’ve already done a lot of this. But, I’ll explain the basic principle. As with everything else, we’ve got another webinar that goes into much more detail.

“It’s simple…”

The pope of the day asked Michelangelo how he’d carved this most famous of all statues. He said, “It’s simple. I just remove everything that’s not David.”

This is also the key principle of doing useful analysis and reporting. We just go and delete everything that’s not interesting.

We’ll put these summary tables as an appendix.

The four pages we authored before, let’s drag them to the top of the report.

After deleting everything that’s not relevant, the next key bit of getting the story right is a pyramid structure. And, we need to add some gloss.

Draw red box in top slot

So, we’re done. We can either create a dashboard or export it to PowerPoint.

Summary

This is the summary of what we did.
