Webinar

Clever ways to use statistical significance to save time

Learn when and how to test for statistical significance! Maybe it’s a long time since you studied stats at college. Or perhaps you never did. Either way, this webinar is for you. The webinar will focus on how to save time when analyzing survey data.

In this webinar you will learn

Here’s a little summary of some of the subjects we cover in this webinar

This video covers all the basics for testing for statistical significance for market research data, and how to save time when analyzing surveys.

• Why you should always stat test (even if you’ve got a small sample or read something about how it was invalid)
• Clever ways to use stat testing to save time (get your computer to do the hard work of reading through lots of tables!)
• When and how to use cell comparisons versus column comparisons (using arrows and font colors versus letters)
• When to use which significance/confidence level: 80%, 90%, 95%, 99%, and multiple comparison corrections
• Testing for changes in data over time
• How to present stat tests to non-technical audiences
• How weights impact stat testing

Transcript

Maybe it’s a long time since you studied stats at college. Or maybe you never did. Either way, this webinar is for you.

As always, the webinar will focus on how to save time when analyzing survey data.

Agenda

As I just mentioned, our agenda for today is to focus on explaining stat testing, and then moving on to look at how we can use it to save time when analyzing surveys.

A note on jargon. Stat testing is the same thing as significance testing, statistical testing, and statistical inference. Lots of names for the same thing.

Why we stat test

We use stat tests for two reasons. Sometimes as a safety belt. Other times as a way of searching for insight.

Dumb reasons for not stat testing

There's a lot of fake news out there about stat testing. Here are six things that I regularly hear, which are just wrong.

A lot of people say that you shouldn't do tests with samples of less than 100 or 30 in a group. Codswallop. Stat tests were invented precisely for small samples. It's the other way around. If you have a large sample it's often not necessary to do stat testing.

You can do valid stat testing with a sample size of 1.

I will take you through a case study with 8 observations.

There are a few more exotic myths here that I will just give you a chance to skim read.

Safety belt

One of the reasons we stat test is for safety. Like in a car it doesn't guarantee safety. But, it improves our odds.

Imagine a study finds that a new anti-aging drug increases life expectancy by 3 years.

The price of the drug is \$100,000

You wouldn't want to pay that if the drug didn't work.

So, before shelling out the cash, you would want to be confident the drug works.

Three ways of verifying are:

1. Checking that no fraud has occurred. That is, you're not being conned.
2. The research was done in a professional way. We've just had a few major COVID studies retracted due to shoddy research, so this is a serious matter.
3. The third thing you should check is that the result is not just a fluke. That is, we'd want to be confident that the study didn't, just by chance, administer the drug to people who had 3-year longer life expectancies. This third check, that the result is not a fluke, is the goal of stat testing.

Where's Wally

The second reason we stat test is to save time. To use an analogy, this is Wally.

Where's Wally. Can you find him?

The main reason we use stat testing in market research is to help us know where to look.

Where's Wally with stat testing

The stat testing tells us where we don't need to look. It's still quite hard to make out Wally in this image. But, he's a lot easier to find here.

In the early 1930s, perhaps the greatest statistician of all, Sir Ronald Fisher, conducted the most famous experiment in the history of statistics.

As he was English, it will of course come as no surprise that the experiment related to the drinking of tea.

A lady with whom Fisher was acquainted said she could tell whether a tea was made by first pouring the milk, or, first pouring the tea.

So, a blindfold was put on. Four cups of tea were made one way. Four the other.

She guessed accurately 100% of the time.

The question that interested Fisher, was whether this was a fluke. Could she have just guessed and been lucky? As the sample size is 8, that is 8 cups, it seems possible.

How stat tests work

To work out if it is a fluke, we need two things.

First, we need to compute the difference between two numbers, where one or both of the numbers is from a sample.

Then, we perform a calculation to check if the difference is likely a fluke or not.

Example of differences

Looking at the top-left, we have the key inputs to the tea drinking experiment.

The lady guessed right 100% of the time. This is the first number. This is also from a sample, with a sample size of 8.

If she had guessed, we would expect her to get things right 50% of the time. This is the second number.

The difference is then 100% - 50% = 50%.

Returning to our first example, our difference is 50%. How do we check this isn't a fluke?

Checking if the difference is a fluke…

In the first column of this table, you will see the words Tea four times and Milk four times. These represent the eight cups, four with tea poured first and four with milk poured first.

The column called Sim 1 shows what happens if we get a computer to randomly choose which cup is milk first, and which is tea first.

Each time I press calculate, it re-runs the simulation. We can then look to see if we ever get the difference of 50% by chance.

The second-to-bottom row of the table shows the percent correct.

We can see that it says insert number are correct. Let's check.

So, it is insert

I've subtracted 50% from this result, which you will recall is the random result.

So, the difference is insert .

Let's run the simulation again.

Now, let's change this to do 5 simulations

The table below shows how many of each size of difference we observe in our 5 simulations.

Let's change this to 10,000 now.

Go back to Examples of differences

You will recall that our Lady guessed correctly 100% of the time.

Random guessing would lead to 50% on average.

And, so, the difference in this experiment is 50%, which is the 100% she guessed less the 50% baseline.

Let's look at the table now. We did 10,000 simulations of the experiment.

We can see that in insert number of these we observe the difference of 50% that was observed in the experiment.

As a percentage, this is insert percentage.

That is, we observed in the experiment a difference of 50%. If we randomly generate the data, we only observe a difference as big as this in insert percentage of times.

That's a pretty small number. So, it suggests it is unlikely to be a fluke. She really could tell.

This number is a probability. It's the probability that we would have observed this result if it was a fluke.

Its technical name is the p-value.

By convention, if this number is less than 0.05, we say that it's not a fluke. I'll return to this.

But, in this case, we would say that it's pretty unlikely she fluked it. The lady knew her tea.
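To make the brute force approach concrete, here is a minimal sketch of the kind of simulation described above. It is not the tool used in the webinar. It assumes that each simulated guess labels exactly four cups as tea first and four as milk first, and the function name and seed are just illustrative choices.

```python
import random

def simulate_tea_tasting(n_sims=10_000, seed=1):
    """Estimate how often pure guessing gets all eight cups right."""
    rng = random.Random(seed)
    truth = ["Tea"] * 4 + ["Milk"] * 4   # what was really poured first
    perfect = 0
    for _ in range(n_sims):
        guess = truth[:]                 # assumption: a guess also uses 4 of each
        rng.shuffle(guess)               # random guessing
        correct = sum(g == t for g, t in zip(guess, truth))
        if correct == 8:                 # 100% correct, 50 points above the 50% baseline
            perfect += 1
    return perfect / n_sims

p_value = simulate_tea_tasting()
print(f"Share of simulations with 8 of 8 correct: {p_value:.4f}")
```

The printed share is a simulated p-value, so it will wobble a little from run to run if you change the seed.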

Formula

I got a computer to run some simulations. This is the brute force way. Fisher didn't have a computer, so he did the math.

You get basically the same answer, but with a bit of noise.

Fisher's Exact Test

The easiest way to use the formula is using software that implements Fisher's exact test. I've done this here. You can see that it computes the p-value of 0.01429. That is, a 1.4% probability.

If we ran our simulation a million times, we would get very close to this same result.

Fisher's test was developed just for experiments like this one.
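For this particular experiment the math can be sketched by simple counting. There are 70 ways to pick which four of the eight cups are milk first, and only one of them is completely right. The sketch below uses only the Python standard library; the function name is my own, and it computes the one-sided probability of doing at least as well as observed.

```python
from math import comb

def fisher_one_sided(correct_milk, cups_per_type=4):
    """P(identifying at least `correct_milk` milk-first cups) under pure guessing."""
    total = comb(2 * cups_per_type, cups_per_type)   # ways to pick the milk-first cups
    ways = sum(
        comb(cups_per_type, k) * comb(cups_per_type, cups_per_type - k)
        for k in range(correct_milk, cups_per_type + 1)
    )
    return ways / total

p = fisher_one_sided(4)
print(round(p, 5))  # 0.01429, that is 1 in 70
```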

Chi-square test

A more generally useful test is the chi-square test. It does get a different number. But, not very different.

There are lots of different formulas, and provided you are not doing something dumb, they give similar answers.
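As a rough illustration, here is a sketch of a chi-square test on the same 2 by 2 table of tea results. I don't know which variant was shown on screen, so the continuity correction is an optional switch here, and the p-value you get depends on it. The function name is hypothetical.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square test for the 2x2 table [[a, b], [c, d]] (1 degree of freedom)."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:                                # optional continuity correction
        diff = max(diff - n / 2, 0)
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(stat / 2))           # upper tail of chi-square with 1 df
    return stat, p_value

# Rows: what was poured first. Columns: what the lady said.
stat, p_value = chi_square_2x2(4, 0, 0, 4)
stat_yates, p_yates = chi_square_2x2(4, 0, 0, 4, yates=True)
print(stat, p_value)
print(stat_yates, p_yates)
```

Note that the chi-square test is two-sided, which is one of the reasons it does not reproduce the Fisher result exactly.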

Process for stat testing

We're now going to work our way through the 8 basic steps of performing a test. Don't worry. It's not eight things you need to learn. They become second nature pretty quickly.

The first bit, and the one that novices get wrong, relates to choosing the relevant difference.

You will remember that with the tea experiment, the difference was between her score of 100% accurate, and random guessing of 50%.

Let's look at a result from a survey.

This data shows preferred cola.

The arrows and fonts are telling us that results are statistically significant. But, what's the difference?

In this case, the difference is between these numbers and the average.

Such a test is useful in polling. But, a much more interesting difference is created in a crosstab.

Cell comparisons

Let's look at the result in the top-left corner of the table.

It tells us that 65% of people aged 18 to 24 in the survey preferred Coca-Cola.

In the total sample, this figure is 43%. So, the difference is 22%, as shown in the box in the middle.

The blue arrow and the blue font is the result of the stat test. It's telling us that the p-value is small.

How small? The length of the arrow is proportional to the smallness.

If we want to look at the p-value, we can.

As it happens, the formula that Displayr's using to calculate the p-value does not actually test this difference of 22%.

Instead, in the background it is comparing the 65% score for the 18 to 24s with the 40% score for everybody else.

At a logical level these two tests are the same thing. We use the one on the right because the math is a bit smarter, and it deals better with a lot of weird edge cases.
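To see what comparing 65% with 40% can look like, here is a sketch using a plain two-proportion z-test. This is not Displayr's formula, which, as I said, is smarter about edge cases. The counts are hypothetical: I've assumed a sample of 1,000 with 120 people aged 18 to 24, which happens to reproduce the 65%, 40% and 43% on the slide.

```python
from math import erfc, sqrt

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))     # two-sided normal tail probability
    return z, p_value

# Hypothetical counts: 78 of 120 (65%) for 18 to 24s, 352 of 880 (40%) for everybody else
z, p_value = two_proportion_z_test(78, 120, 352, 880)
print(f"z = {z:.2f}, p = {p_value:.2g}")
```

Because the 18 to 24s are part of the total, comparing them with everybody else avoids comparing a group with a number that already contains it.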

Column comparisons

A different way of looking at the crosstab is to compare differences between columns.

Appearance > Highlight Results > Compare columns

Here, each column is presented as a letter. For example, you can see that 18 to 24 is column A, 25 to 34 is B, and so on.

The letters tell us which column the result is higher than.

So, we can see that the Coca-Cola 65% score for 18 to 24, has a b. This means that the difference between this 65% and the 48% in column B, is significant.

Similarly, we can see that 18 to 24 is also bigger than column C, as a C is shown, and D, and E.

Note how some of the letters are in lowercase and some in capitals. Capitals mean a smaller p-value.

Lowercase means that the p-value is less than 0.05. And, uppercase means less than 0.001. This is shown in the footer, but expressed as confidence levels.
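The lettering logic can be sketched as follows. The counts and the plain z-test are assumptions of mine for illustration, and real software may also apply corrections that this sketch ignores. Only the 0.05 and 0.001 cutoffs come from the table footer.

```python
from math import erfc, sqrt

def p_two_proportions(x1, n1, x2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return erfc(abs(z) / sqrt(2))

def column_letters(counts, sizes, labels="ABCDE"):
    """For each column, the letters of the columns it is significantly higher than."""
    result = []
    for i in range(len(counts)):
        letters = ""
        for j in range(len(counts)):
            if i == j or counts[i] / sizes[i] <= counts[j] / sizes[j]:
                continue
            p = p_two_proportions(counts[i], sizes[i], counts[j], sizes[j])
            if p < 0.001:
                letters += labels[j]           # uppercase: p < 0.001
            elif p < 0.05:
                letters += labels[j].lower()   # lowercase: p < 0.05
        result.append(letters)
    return result

# Hypothetical counts giving 65%, 48% and 40% for columns A, B and C
print(column_letters([78, 96, 80], [120, 200, 200]))  # ['bC', '', '']
```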

Differences over time

Personally, I hate the column comparison letters and haven't used them for about 20 years.

They just take too long to digest.

Consider this table here. It's showing differences in average consumption over time.

It's really hard work.

I will switch it back to cell comparisons.

Click on table

Highlight results > Arrows and font color

Now, this table is still not great.

I explained how the differences were computed before. February, for example, is being compared to all the data that's not February.

That's not usually the difference we want. We instead would want to compare February with January.

Highlight results > Options > Advanced > Compare to previous period

Now it is comparing adjacent periods. We can see that Diet Pepsi went up in February

Coke Zero in July

And Coke went down in September.

Choose the formula

Sometimes our users ask us if they can change the formulas used to calculate stat tests.

There are two ways to do this. One is to write code.

But, most of the popular formulas are also available in the menus

Appearance > Highlight results > Options > Advanced

A note of caution here. The defaults in both Displayr and Q are sensible. Some of these other tests are worse in lots of situations. We offer these tests only because some users want to check answers against other packages. It's generally not a smart thing to play with these options.

Stat tests with sample weights

Sometimes surveys over or under-represent different groups and we need to weight the sample to fix this.

There is a textbook correct way of doing stat tests with weights. It means using formulas that do something called Taylor series linearization.

If you are using Q, Displayr, or Stata, they all do that for most of their standard analyses.

If you are using R, you need to use the survey package. You get the wrong results if you use any of the standard tests.

Similarly, if you are using SPSS, make sure you use the Complex Samples module.

If you are using a spreadsheet, you are doing it wrong and there's no way to do it right.

If you are using software where you have to modify your weight, such as setting it to an average of 1, this is a hack, and you are better off using one of the packages mentioned above.
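To give a feel for what Taylor series linearization does, here is a minimal sketch for the standard error of a weighted proportion. It assumes the simplest possible design: one stage, no strata or clusters, sampling treated as with replacement. It is not the code used by any of the packages mentioned, and the function name is hypothetical.

```python
from math import sqrt

def weighted_proportion_se(y, w):
    """Weighted proportion and its linearized standard error.

    y: 0/1 answers; w: sampling weights (any scale).
    """
    n = len(y)
    total_w = sum(w)
    p = sum(wi * yi for yi, wi in zip(y, w)) / total_w
    # Linearized contribution of each respondent to the ratio estimate
    z = [wi * (yi - p) / total_w for yi, wi in zip(y, w)]
    variance = n / (n - 1) * sum(zi ** 2 for zi in z)
    return p, sqrt(variance)

# Hypothetical answers and weights
answers = [1, 1, 0, 0, 1, 0, 1, 0]
weights = [0.5, 0.5, 2.0, 2.0, 1.0, 1.0, 0.5, 0.5]
p, se = weighted_proportion_se(answers, weights)
print(f"weighted proportion = {p:.3f}, standard error = {se:.3f}")
```

Two things are worth noticing. Rescaling the weights, for example so that they average 1, changes nothing. And with equal weights the result collapses to the usual unweighted standard error.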

How to do a stat test with sampling weights

In Q and Displayr, you just select the weight, and in the background stat tests are performed that use the Taylor Series Linearization.

Note that at the moment the difference between Coke Zero at 9% and the overall score of 17% is not flagged as being significant.

By default, Displayr performs testing at the 0.05 level of significance, which is the standard used by, to make up a number, 99.999% of stat tests. But, you can play with this option.

For example, to test at the 0.1 level of significance, which is also known as the 90% level of confidence, we do this

Appearance > Highlight Results > Advanced > Highlight many results

Note that the Coke Zero 18 to 24 result is now significant.

Is this dodgy? No. As long as we are cognisant that we have reduced the burden of proof in our check for flukiness in the results, we are fine.

And, we can go the other way as well.

The more tests we do, the bigger the chance of a fluky result.

Multiple comparison corrections are modifications of stat tests designed to deal with this. My favorite, the false discovery rate correction, is available in Displayr.

Appearance > Highlight Results > Advanced > Highlight few results

I usually do this, as the fewer things that are colored, the less I have to do, so I save more time.

In Q, this false discovery rate correction is on by default.
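One widely used false discovery rate procedure is Benjamini-Hochberg, sketched below. I'm assuming it is close in spirit to what the software does, but treat it as an illustration rather than the exact implementation. The p-values are made up.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return True for each p-value that stays significant after FDR control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices, smallest p first
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:    # threshold loosens as rank increases
            cutoff_rank = rank             # remember the largest rank that passes
    keep = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            keep[i] = True
    return keep

# Hypothetical p-values from eight tests on one table
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(sum(p < 0.05 for p in p_values))       # 5 flagged without a correction
print(sum(benjamini_hochberg(p_values)))     # 2 flagged with the correction
```

Fewer results get flagged, which is exactly the "highlight few results" effect.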

Process for stat testing

The next steps involve looking at the result and thinking. There are four questions you want to ask yourself:

Does the result make sense? This is the sniff test. If the result looks a bit odd, don't go "wow, look what the stat test found" as it usually means you have misunderstood the difference being tested.

The next step is to ask "So what"? Many results that are significant are not interesting.

In this sample, relatively few people have an income of \$200,000 or more. Hence the red arrow pointing down. Is this surprising? No. It's not interesting at all.

Step 6 is to look for corroborating evidence. Do you have other information that lines up with the stat test?

Step 7 is to look for alternative plausible explanations for the result. This is the flipside of looking for corroborating evidence. You are looking for reasons to not believe.

Make it digestible

A big mistake that novice researchers make is to talk about the stat testing to clients. It's very rare that this is a good idea. It goes down much better when you talk about the conclusion that you reached, rather than the stat test, which is just how you got there.

For example, it's better to say that young people like Coke more than to say that there is a statistical difference.

A second strategy is to focus on attracting the eye.

Attract the eye

Here's the example from before comparing over time. How can we visualize that better?

We can do a column chart, but that's too hard going.

Here, each column is showing the last month's result. We can see that Coca-Cola has dropped in the last month, as this is flagged by the color and arrow.

And, we can quickly see that Coke Zero has been growing over time.

Attract the eye - part 2

Here's a massive table that shows column comparisons. That's hard work.

We make it easier by using the arrows and colors

Highlight results > Arrows and font colors

And, even easier by using this chart type, designed to show stat tests on really large tables

Chart > Trend Plot.

Now, too much is shown here, so let's change our level of detail.

… > Highlight few results

Now we've got much less to look at.

This chart's also designed for looking at results over time

Now we are using a small multiple, and we can easily see that the key result is that Coke Zero grew in terms of being rebellious.

Lots of crosstabs

But, the best for last.

Let's say we want to understand the role of age in the cola market.

The old school way of doing it is to create lots of crosstabs, like this.

Insert > More > Tables > Lots of crosstabs

And I will cross them all by these other 18 questions

Interview date… weight conscious

I've now got 18 tables to read. In a real world study it's not unknown for people to have to read 100s or even thousands of tables.

But, the smart play is to automate it.

So, I'll delete all of these

I use the same menu again

Insert > More > Tables > Lots of crosstabs

And I will cross them all by these other 18 questions

Interview date… weight conscious

Scroll down and select Sort and delete tables not significant at 0.0001

By selecting this option, I will do two things. First, I will sort the tables based on significance. And then, I will delete the tables that are not significant at the 0.0001 level.

OK

Now I've only got 7 tables to look at, and they are sorted with the strongest results at the top.

This table shows us that age is related to living arrangements.

This fails the So What test. I'll delete it.

Now I've only got 6 tables to look at. So, we've saved a heap of time.
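The sort and delete idea is simple enough to sketch in a few lines. The table names and p-values below are hypothetical, and I'm assuming each crosstab has a single overall p-value; only the 0.0001 cutoff comes from the menu option.

```python
def sort_and_filter_tables(p_value_by_table, cutoff=0.0001):
    """Keep tables with p below the cutoff, strongest result first."""
    kept = [(name, p) for name, p in p_value_by_table.items() if p < cutoff]
    return sorted(kept, key=lambda item: item[1])

# Hypothetical overall p-values for crosstabs of age by other questions
tables = {
    "Age by Interview date": 0.62,
    "Age by Living arrangements": 1e-12,
    "Age by Preferred cola": 3e-7,
    "Age by Weight conscious": 0.004,
}
for name, p in sort_and_filter_tables(tables):
    print(f"{name}: p = {p:.1e}")
```

The computer can only do the sorting. Deciding that a table fails the So What test is still up to you.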