25 July 2017
Every now and then somebody sends me an experimental design and says, “can you please check it for me, I need to know if it is OK, it is really urgent!”. Usually they also send an accompanying spreadsheet containing big tables of numbers. I understand why people want somebody to check their designs. It is a scary thing to conduct an experiment. Small mistakes can ruin the whole thing! In this post I explain the basic process that I tend to follow when doing a rough-and-ready check of an experimental design. The last step, Checking on a small sample, is the gold standard. I’ve never heard a good excuse for not doing this.
Most of my experiments have involved marketing, economics, social research, and surveys. In particular, I spent a couple of decades doing choice modeling, conjoint, and MaxDiff experiments. So if you are looking for information to help you plan a clinical trial, a crop experiment, or the running of a factory, this post is going to be a waste of your time.
Checking where the experimental design came from
Just like with food and drugs, checking the provenance of an experimental design saves a lot of time. If somebody sends me a design created by software that I or one of my colleagues has written, then it is pretty quick for me to check it. If the design used SAS, JMP or Sawtooth, then again it is pretty easy. In these situations, all I really need to check is that the user has clicked the right buttons. But, where the design is of unknown provenance, life gets a lot more difficult. There is the possibility that it is completely wrong.
Checking experimental designs with heuristics
The most straightforward way to check a design is to use rules of thumb (i.e., heuristics). Different heuristics have been developed for pretty much every type of design, and you should definitely try and find ones applicable to whatever problem you are trying to solve (e.g., experimental designs for MaxDiff, drug trials, or for process modeling). However, the following heuristics pop up in many different fields. Please tell me if you know of some good ones that I have missed. But, before reading these please keep in mind that they are all heuristics. There are times when you can achieve better designs by ignoring some of them.
Does each experimental manipulation occur enough times to conduct a sensible statistical analysis? For example:
- In a completely randomized design with a single factor, where each subject is randomly allocated to a treatment (e.g., a new drug, existing drug, or placebo), it is possible to perform power calculations, working out the appropriate minimum sample size if something is known about the likely efficacy of different treatments. In fields where there is insufficient knowledge to work out the likely efficacy, different rules of thumb exist. For example, in a marketing experiment, non-academics are rarely impressed by sample sizes of less than 100. In sensory research, where the experimental manipulations are stronger, sample sizes of 40 can be considered OK.
- With a MaxDiff experiment, a common rule of thumb is that each alternative needs to be shown to each person at least three times if there is a need to estimate each person’s preferences (e.g., if conducting segmentation).
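Both checks above come down to simple arithmetic. The following is a minimal sketch, not a definitive method: it assumes a two-sided z-test comparing two proportions with the usual normal approximation, and the function names, the 10% versus 15% response rates, and the 12-question MaxDiff layout are all made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size to detect p1 vs p2 (two-sided z-test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)


def maxdiff_shows_per_alternative(n_questions, per_question, n_alternatives):
    """Average number of times each alternative is shown to one respondent."""
    return n_questions * per_question / n_alternatives


# Hypothetical inputs: a 10% vs 15% response rate, and a MaxDiff study with
# 20 alternatives shown 5 at a time over 12 questions.
n_per_group = sample_size_per_group(0.10, 0.15)
shows = maxdiff_shows_per_alternative(12, 5, 20)
print(n_per_group, shows)
```

If the second number comes out below three, the rule of thumb above suggests adding questions or removing alternatives.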
Does each manipulation occur (roughly) the same number of times? For example, in a completely randomized experimental design, is the sample size of each group the same? In a MaxDiff experiment, does each alternative get shown to each respondent the same number of times? In choice modeling, does each level of an attribute get shown the same number of times?
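A balance check of this kind is just a frequency count. Here is a rough sketch, assuming the design has been loaded as a list of rows keyed by factor name; the toy design, the factor names, and the helper functions are hypothetical.

```python
from collections import Counter


def level_counts(design, factor, levels):
    """Count how often each expected level of a factor appears in the design."""
    counts = Counter(row[factor] for row in design)
    # Include levels that never appear, so a missing level shows up as 0.
    return {level: counts.get(level, 0) for level in levels}


def is_balanced(counts, tolerance=0):
    """True if the most and least frequent levels differ by at most tolerance."""
    return max(counts.values()) - min(counts.values()) <= tolerance


# Hypothetical design: 2 colors crossed with 3 prices (a full factorial).
design = [
    {"color": color, "price": price}
    for color in ("red", "blue")
    for price in (10, 20, 30)
]

color_counts = level_counts(design, "color", ("red", "blue"))
price_counts = level_counts(design, "price", (10, 20, 30, 40))  # 40 never shown

print(color_counts, is_balanced(color_counts))
print(price_counts, is_balanced(price_counts))
```

Passing the expected levels explicitly is a deliberate choice: a level that was accidentally left out of the design then appears with a count of 0 rather than silently vanishing.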
In experiments where multiple manipulations occur (i.e., multi-factor experiments), it is usually a good idea to check that the manipulations are uncorrelated. For example, in an experiment manipulating, say, colors and prices, it is usually desirable to check that certain prices are not more or less likely to appear with specific colors. That is, it is usually desirable to have no correlation between the factors. In this context, the term orthogonal means that the factors have a correlation of 0. (Note that outside of experimental design, correlation and orthogonality have different meanings.)
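One simple way to look for this with categorical factors is to cross-tabulate them and compare each cell with the count you would expect if the factors were independent. The sketch below is one possible check under that assumption, not a standard routine; the function name and both toy designs are invented for illustration.

```python
from collections import Counter


def crosstab_deviation(design, factor_a, factor_b):
    """Largest gap between observed cell counts and independence-expected counts."""
    n = len(design)
    a_counts = Counter(row[factor_a] for row in design)
    b_counts = Counter(row[factor_b] for row in design)
    cell_counts = Counter((row[factor_a], row[factor_b]) for row in design)
    return max(
        abs(cell_counts.get((a, b), 0) - a_counts[a] * b_counts[b] / n)
        for a in a_counts
        for b in b_counts
    )


# Hypothetical orthogonal design: every color appears once with every price.
orthogonal = [
    {"color": c, "price": p} for c in ("red", "blue") for p in (10, 20, 30)
]

# Hypothetical confounded design: red tends to be cheap, blue tends to be dear.
confounded = [
    {"color": "red", "price": 10}, {"color": "red", "price": 10},
    {"color": "red", "price": 20}, {"color": "blue", "price": 20},
    {"color": "blue", "price": 30}, {"color": "blue", "price": 30},
]

orthogonal_gap = crosstab_deviation(orthogonal, "color", "price")
confounded_gap = crosstab_deviation(confounded, "color", "price")
print(orthogonal_gap, confounded_gap)
```

A gap of zero means the two factors are exactly balanced against each other. Note that the confounded design would still pass a one-factor-at-a-time balance check, which is why the cross-tab is worth doing separately.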
If you have studied some experimental design at college, you may have come across the idea that there should be no correlation of any kind between experimental factors. This is far from always true. For example:
- In choice modeling studies in many markets it is desirable to have correlations between variables. For example, in the car market, you would generally want the experimental design to have correlations between brand and price. Otherwise you will end up collecting silly data. There is no point showing people a Lamborghini for $10,000 or a Corolla for $1,000,000.
- In studies with prohibitions (e.g., certain combinations of drugs that should never be administered together), negative correlations are guaranteed.
- In studies where there are constraints regarding the number of manipulations shown at any time, there will be negative correlations (e.g., MaxDiff designs).
Checking the randomization mechanism
Experimental designs typically need to involve some form of randomization. For example, if you are allocating people to receive a drug or to receive a placebo, it is important to allocate them randomly. If you were to give the sicker-looking patients the placebo, this would likely exaggerate the efficacy of a drug.
In my experience, the single biggest cause of errors in experiments is people failing to randomize properly. “Properly” means using statistical software with a random number generator (e.g., R). To illustrate the types of things people get wrong, here are a few examples of mistakes that caused me a lot of pain:
- In a choice modeling study, I had about 1,000 respondents and 100 different choice sets (treatments). I wanted to have each person see 10 choice sets, with random allocation of people to treatments. Each choice set was supposed to be seen exactly 100 times. Unfortunately, the company that collected the data discovered that their software was only able to randomize 10 times. So, they came up with the ingenious solution of randomly allocating people to different treatments based on the time (e.g., in the first 36 seconds showing the first choice set, then the second in the next 36 seconds). Needless to say, this did not work out. Unfortunately, it took eight weeks of back-and-forth with the company before they owned up to the error. The study had to be redone. Everybody involved lost a lot of money. One guy’s health broke down and he left the industry due to the stress.
- In another study where I had 200 different treatments and wanted 25 people per treatment, the company collecting the data randomly assigned each person to one of the 200 treatments. Unfortunately, the way that randomization works means that one of the treatments was seen only 8 times, and another 38, with the rest in-between. More data had to be collected until each was seen 25 times, which cost much more money.
- In an AB test looking at the effectiveness of alternative wordings of different emails, everybody who was considered unimportant was assigned the “standard” wording, and some people considered to be important were given the “personalized” wording. This made it impossible to disentangle whether the wording or importance of the customer was driving response rates.
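The second example above is easy to reproduce by simulation. The following sketch assumes 5,000 respondents (200 treatments with 25 people each) and compares independent random assignment with shuffling a pre-built list of slots. The seed and variable names are arbitrary, and this is an illustration of the general idea rather than the procedure used in that study.

```python
import random
from collections import Counter

N_TREATMENTS = 200
PER_TREATMENT = 25
N_PEOPLE = N_TREATMENTS * PER_TREATMENT  # 5,000 respondents

rng = random.Random(2017)  # fixed seed so the sketch is repeatable

# Simple randomization: each person independently draws a treatment.
simple = Counter(rng.randrange(N_TREATMENTS) for _ in range(N_PEOPLE))
simple_counts = [simple.get(t, 0) for t in range(N_TREATMENTS)]

# Balanced randomization: build every slot up front, then shuffle the order
# in which the slots are handed out to people.
slots = [t for t in range(N_TREATMENTS) for _ in range(PER_TREATMENT)]
rng.shuffle(slots)
balanced = Counter(slots)
balanced_counts = [balanced.get(t, 0) for t in range(N_TREATMENTS)]

print("simple:  ", min(simple_counts), "to", max(simple_counts))
print("balanced:", min(balanced_counts), "to", max(balanced_counts))
```

With independent draws the per-treatment counts spread widely around 25, whereas the shuffled slots give exactly 25 each. In live fieldwork, where people drop out partway through, some form of quota or least-filled allocation is likely to be needed on top of this.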
How can you check if the randomization is working? If I don’t know the person or company well, I will usually ask to see the code they are using or ask a lot of questions. Whether or not I know them, I will always get them to do an initial pilot or soft send of the experiment, and check the frequency, balance, and correlations (see the earlier section on heuristics).
Whenever we conduct an experiment we estimate one or more parameters. For example, the coefficients of a choice model, or the proportion of people who prefer an alternative in a taste test. If we square the standard error of a parameter estimate we have its variance. To calculate the efficiency of a parameter estimate, you divide 1 by this variance. The D-efficiency is a single measure of the efficiency of all the parameters in a specific model.
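To make this concrete, here is a small sketch for the simplest case: a linear model with an effects-coded design matrix X with N rows and p columns, where a common definition of D-efficiency is 100 × det(X'X)^(1/p) / N. This is an assumption made for illustration. Choice models use a different information matrix that depends on the assumed parameter values, so software for those designs will report different numbers. The helper functions and both toy designs are hypothetical.

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # singular: some parameter cannot be estimated
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def d_efficiency(x):
    """100 * det(X'X)^(1/p) / N for a coded design matrix x (list of rows)."""
    n_runs, n_params = len(x), len(x[0])
    xtx = [
        [sum(row[i] * row[j] for row in x) for j in range(n_params)]
        for i in range(n_params)
    ]
    return 100 * max(determinant(xtx), 0.0) ** (1 / n_params) / n_runs


# Columns: intercept, factor A, factor B (two-level factors coded -1 / +1).
full_factorial = [[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]]
repeated_run = [[1, -1, -1], [1, -1, 1], [1, 1, 1], [1, 1, 1]]  # A and B correlated

eff_full = d_efficiency(full_factorial)
eff_repeated = d_efficiency(repeated_run)
print(round(eff_full, 2), round(eff_repeated, 2))
```

The balanced, orthogonal design scores essentially 100, and the design that repeats a run instead of covering every combination scores lower. That gap is the sort of comparison D-efficiency is useful for.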
One way to check any experimental design is to compute its D-efficiency. In practice, this is useful if using optimization software to generate an experimental design or if wanting to compare the statistical properties of two or more designs. This method is still inferior to Checking on a small sample, which I discuss next.
Checking on a small sample
On any study that I work on, I always do the following:
- Get a data file after about 10% of the data has been collected. Either this will be 10% of the final sample, or, just a pilot study.
- When doing something that I have not done many times before, get the fieldwork to stop at this point.
- Review the basic heuristics that are applicable to check that the randomization is working (see Checking the randomization mechanism).
- Estimate the models that I need to estimate, taking care to look at the standard errors of all the parameters.
- Form preliminary conclusions. That is, check that the model is telling me what I need to know for the project to be a success. Sure the standard errors will be relatively high, but key conclusions should still be making sense at this time.
- If everything makes sense, continue with the data collection.
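As a rough guide to how high the standard errors will be at the 10% mark, the sketch below uses the standard error of a simple proportion, which shrinks with the square root of the sample size. The final sample size of 1,000 and the 50% proportion are made-up inputs, and the standard errors of model coefficients will only approximately follow the same pattern.

```python
from math import sqrt


def standard_error_of_proportion(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    return sqrt(p * (1 - p) / n)


final_n = 1000           # hypothetical final sample size
pilot_n = final_n // 10  # data file taken after about 10% of fieldwork
p = 0.5                  # hypothetical proportion preferring an alternative

pilot_se = standard_error_of_proportion(p, pilot_n)
final_se = standard_error_of_proportion(p, final_n)
ratio = pilot_se / final_se  # equals sqrt(final_n / pilot_n)

print(round(pilot_se, 3), round(final_se, 3), round(ratio, 2))
```

So, at the pilot stage, expect standard errors roughly three times their final size, and judge whether key conclusions are making sense with that in mind.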
You can conduct this process along with all of the other approaches. Or, if you are brave, you can just do this step and skip the earlier approaches. But, skipping testing on a small sample is foolhardy, as it checks things much more thoroughly than the other approaches.
This approach is the only one that checks for clerical errors. That is, it’s possible you have a great design, but due to clerical errors it is not administered correctly. It also allows you to recover if you have made a mistake in the entire conception of the experiment. For instance, sometimes choice modeling studies inadvertently include a couple of factors (attributes) that are so important that everything else becomes insignificant. Where the hypotheses of interest relate to the insignificant factors, this is a big problem. It is best to identify this kind of problem before you have finished the fieldwork. Otherwise it cannot be fixed.
One last little comment. As a pretty general rule, people get more diligent at checking over time, as they learn from pain. If this is your first experiment, make sure you do everything listed in this post.
Author: Tim Bock
Tim Bock is the founder of Displayr. Tim is a data scientist, who has consulted, published academic papers, and won awards, for problems/techniques as diverse as neural networks, mixture models, data fusion, market segmentation, IPO pricing, small sample research, and data visualization. He has conducted data science projects for numerous companies, including Pfizer, Coca Cola, ACNielsen, KFC, Weight Watchers, Unilever, and Nestle. He is also the founder of Q www.qresearchsoftware.com, a data science product designed for survey research, which is used by all the world’s seven largest market research consultancies. He studied econometrics, maths, and marketing, and has a University Medal and PhD from the University of New South Wales (Australia’s leading research university), where he was an adjunct member of staff for 15 years.