Every now and then somebody writes to us and asks "Why are these MaxDiff results different from those that I get in Sawtooth?" The assumption people usually have is that the results should be the same regardless of which package they use. However, this is not the case. You should get similar results, but never identical results. Below, I list the main reasons you may come across differences.
Different statistical models
There are lots of different statistical models that you can use to compute MaxDiff scores. Some of these give different results from Sawtooth simply because they are wrong. If you are using counting analysis, an aggregate multinomial logit model, or an aggregate rank-ordered logit model, you will definitely get a different answer from Sawtooth. In the case of counting analysis, the difference arises because the technique itself is flawed. In the case of the other two models, it arises because they assume that everybody is the same, whereas the Sawtooth HB model assumes that all people are different. The Sawtooth assumption is the better one.
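To see why counting analysis cannot agree with a model-based approach, it helps to see how simple it is: it just tallies how often each alternative is picked as best and as worst. A minimal sketch, using made-up tasks and item names:

```python
# Illustrative MaxDiff data for one respondent (items and choices are made up).
# Each task shows a subset of items; the respondent picks a best and a worst.
items = ["A", "B", "C", "D", "E"]
tasks = [
    {"shown": ["A", "B", "C"], "best": "A", "worst": "C"},
    {"shown": ["B", "D", "E"], "best": "D", "worst": "E"},
    {"shown": ["A", "C", "D"], "best": "D", "worst": "C"},
]

# Counting analysis: score = (times best - times worst) / times shown.
scores = {}
for item in items:
    shown = sum(item in t["shown"] for t in tasks)
    best = sum(t["best"] == item for t in tasks)
    worst = sum(t["worst"] == item for t in tasks)
    scores[item] = (best - worst) / shown if shown else 0.0

print(scores)
```

Nothing in this calculation models respondent heterogeneity or choice probabilities, which is why its scores differ from those of a hierarchical Bayes model fitted to the same data.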
If you are using a latent class analysis model, such as the ones in Q, Displayr, and Latent Gold, you will get different answers because these models assume that there is a small number of distinct segments, whereas Sawtooth HB assumes that people lie on a continuum, and this difference can be important. As I discuss in "The Accuracy of Hierarchical Bayes When the Data Contains Segments", the HB model tends to be the safest one, but the smart thing to do, when time permits, is to run multiple models and compare them.
In the case of MaxDiff models, there is also a difference in terms of how the worst (least preferred) choices are modeled. There are two different approaches, Tricked Logit and Rank-Ordered Logit, and each gives slightly different results.
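A stylized sketch of the difference, using hypothetical utilities for a single task (the exact formulations vary by package): under Tricked Logit, the worst choice is a logit choice over the same set of items with the signs of the utilities flipped, whereas under Rank-Ordered Logit, the worst choice is made only from the items remaining after the best has been removed.

```python
import math

def softmax_prob(utils, chosen):
    """Multinomial-logit probability that `chosen` is picked from `utils`."""
    denom = sum(math.exp(u) for u in utils.values())
    return math.exp(utils[chosen]) / denom

# Hypothetical utilities for the items shown in one MaxDiff task.
shown = {"A": 1.2, "B": 0.3, "C": -0.8}
best, worst = "A", "C"

# Tricked Logit: worst is a logit choice over the full set with negated utilities.
p_tricked = softmax_prob(shown, best) * \
    softmax_prob({k: -v for k, v in shown.items()}, worst)

# Rank-Ordered Logit: worst is chosen from the items left after removing the best.
remaining = {k: -v for k, v in shown.items() if k != best}
p_ranked = softmax_prob(shown, best) * softmax_prob(remaining, worst)

print(p_tricked, p_ranked)
```

The two likelihoods for the same observed best-worst pair differ, so parameter estimates fitted under the two assumptions will differ too.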
Different respondent-level scores
Now, assuming you are comparing equivalent models (e.g., Displayr's HB with Sawtooth's HB), the next thing to check is that the scores have been computed in the same way for each of the respondents. These numbers can be scaled in many different ways, and any comparison is only meaningful if they have been scaled in the same way.
The main scalings are:
- Mean-centered utilities/coefficients/parameters. These will tend to be numbers between -10 and 10, with lots of decimal places. They will average 0. Coefficient and parameter are, in this context, synonyms. Utility is a more vaguely defined term, and can be the same thing as a coefficient/parameter, but may mean one of the other things in this list.
- 0-Based utilities/coefficients/parameters. These will have one alternative set to 0 for all respondents, with the other utilities relative to this one.
- Respondent-level Z-scores. These are mean centered utilities/coefficients/parameters that have been further standardized to have a standard deviation of 1 for each respondent.
- 0 to 1 scaled utilities/coefficients/parameters. These are utilities/coefficients/parameters scaled to have a minimum value of 0 and a maximum value of 1.
- 0 to 100 scaled utilities/coefficients/parameters. These are utilities/coefficients/parameters scaled to have a minimum value of 0 and a maximum value of 100.
- Preference shares/Probability %. These are scores that have a minimum of 0 and sum up to either 1 or 100, and are computed by a logit transformation of the coefficients/parameters.
- K-alternative preference shares/Probability %. These are scores that have a minimum of 0 and maximum of either 1 or 100, and are computed using a variant of the logit transformation developed by Sawtooth (see this wiki page for more information).
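The scalings above (other than Sawtooth's K-alternative variant, which uses a proprietary formula) can be sketched for a single respondent; the coefficients here are purely illustrative:

```python
import numpy as np

# Illustrative coefficients for one respondent across 5 alternatives.
coefs = np.array([1.5, 0.5, 0.0, -0.7, -1.3])
coefs = coefs - coefs.mean()               # mean-centered: average is 0

zero_based = coefs - coefs[0]              # first alternative fixed at 0
z_scores = coefs / coefs.std()             # respondent-level z-scores (sd = 1)
scaled_01 = (coefs - coefs.min()) / (coefs.max() - coefs.min())  # 0 to 1
scaled_0100 = 100 * scaled_01                                    # 0 to 100
shares = 100 * np.exp(coefs) / np.exp(coefs).sum()  # preference shares (%)

print(shares)
```

All of these are one-to-one transformations of the same underlying coefficients, so respondents are ranked identically; only the numbers reported to end users change.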
Each scaling means something different, and all are sensible in some contexts. If your differences are due to different scalings, the trick is to work out which scaling is most appropriate for your end users.
Number of classes
Latent class analysis in general, and HB in Q, Displayr, and bayesm, all permit the specification of multiple classes. If you compare results from models with different numbers of classes, you should expect differences. The trick is to choose the model with the best predictive validity.
Non-convergence (too few iterations)
All of the modern analysis techniques start by randomly guessing some initial values and then trying to refine them. Each refinement attempt is called an iteration. If the default number of iterations is too small, you should not rely on the results. You work this out by checking that the model has converged (for the theory, see Checking Convergence When Using Hierarchical Bayes for MaxDiff).
Most modern software will warn you if there is a convergence problem. However, Sawtooth does not provide any warning about this (although our experience is that its default settings are OK, so there is probably not a problem here). It is possible to compute the standard convergence diagnostics for a Sawtooth model by using the monitor function in the rstan R package (which is available in Q and Displayr).
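As a sketch of what such a diagnostic measures, here is the classic Gelman-Rubin potential scale reduction factor (R-hat) for a single parameter, with simulated draws standing in for real MCMC chains:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for one parameter.

    `chains` is an (m, n) array: m independent chains of n draws each.
    Values near 1 suggest convergence; values well above ~1.1 do not.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(0)
mixed = rng.normal(0.0, 1.0, size=(4, 1000))             # 4 chains, same target
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])   # 2 chains stuck elsewhere

print(gelman_rubin(mixed))  # close to 1: chains agree
print(gelman_rubin(stuck))  # well above 1: not converged
```

If the chains have all found the same region of the posterior, the between-chain and within-chain variances agree and R-hat is close to 1; chains that disagree inflate R-hat.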
You only want to use models that have converged. If one of the models has worse predictive accuracy, this can be a sign that it has not converged.
Local optima
As mentioned, the techniques start with an initial guess. Sometimes this initial guess is so poor that it is impossible to get a good result (a local optimum). You can diagnose this by inspecting predictive accuracy. However, a better approach is to run the models many times and compare the results.
Different implementations of the same algorithms
Even if two models sound the same, they will often still lead to different results because of decisions made when implementing the algorithms. Examples include:
- The default number of iterations.
- How they test for convergence. For example, do they stop when a model can only be improved by 0.0000001% or 0.0000002%?
- The estimation method. For example, when fitting latent class analysis, Latent Gold uses Bayesian posterior mode estimation whereas Q uses maximum likelihood estimation. When fitting hierarchical Bayes (HB), Q and Displayr use Hamiltonian Monte Carlo, whereas Sawtooth and bayesm use Gibbs sampling.
- Randomization. All the modern algorithms include some form of randomization. As they use different random numbers, they will usually get different answers.
Differences like these in how the algorithms are implemented guarantee small differences between results.
Fundamental ambiguity in the data
MaxDiff experiments usually do not collect a lot of data from each respondent. There is usually no way of determining, with certainty, a respondent's true preferences for alternatives that were neither loved nor loathed (click here for a demonstration of this problem). This ambiguity means that two different sets of results can both be adequate descriptions of the underlying data, much in the same way that people come up with different explanations for the same election results. Which is correct? The trick is to choose the one with the better predictive accuracy.
There are lots of reasons why different software packages can give different results. Ultimately, though, the comparison should focus on the empirical side of things rather than the theory:
- If you have two sets of different results, you should choose between them based on predictive validity. See Using Cross-Validation to measure MaxDiff Performance for more information about this.
- If the results are very similar, but not identical, this should not be a surprise, for the reasons listed above. If you get two broadly similar sets of results, you can be pretty confident that your results are probably not due to local optima or convergence issues, so that is good news!
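To make the first point concrete, predictive validity on holdout tasks can be summarized as a hit rate: the share of holdout best choices that each model's utilities predict correctly. A minimal sketch, with made-up utilities and holdout tasks:

```python
import numpy as np

def hit_rate(utilities, holdout_tasks):
    """Share of holdout best choices predicted correctly.

    `utilities` is an (n_respondents, n_items) array of estimated scores.
    `holdout_tasks` is a list of (respondent, shown_items, chosen_best).
    """
    hits = 0
    for resp, shown, best in holdout_tasks:
        # Predict that the shown item with the highest utility is chosen.
        predicted = shown[np.argmax(utilities[resp, shown])]
        hits += int(predicted == best)
    return hits / len(holdout_tasks)

# Made-up utilities from two competing models, scored on the same holdout tasks.
utils_model_a = np.array([[1.0, 0.2, -0.5, -0.7], [0.1, 0.9, -0.3, -0.7]])
utils_model_b = np.array([[0.2, 1.0, -0.5, -0.7], [0.1, 0.9, -0.3, -0.7]])
holdout = [(0, [0, 1, 2], 0), (1, [1, 2, 3], 1)]

print(hit_rate(utils_model_a, holdout))
print(hit_rate(utils_model_b, holdout))
```

Whichever model scores higher on tasks it was not fitted to is the one to prefer, regardless of which package produced it.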