Comparing Experimental Designs Based on Model Fit
There are many ways to generate an experimental design. Perhaps the simplest is to randomly assign levels to alternatives in a design. However, other methods of generating designs exist which result in more balanced designs and designs with a lower D-error. Rather than assess the quality of a design based on design metrics, in this post I take another approach. With the use of simulated response data, I compare various algorithms for generating choice model designs using metrics for model fit. In other words, I make an assumption about the response data and measure the performance of each design given the assumed response distribution.
I will be comparing the following algorithms used for generating experimental designs:
- Random
- Shortcut
- Complete enumeration
- Balanced overlap
- Efficient
Out of these, the Shortcut, Complete Enumeration, and Balanced Overlap algorithms optimize the balance of a design, whereas the Efficient algorithm is designed to minimize D-error. It will be interesting to see how these algorithms perform on metrics for model fit, given that D-error is more closely linked to model fit than design balance is.
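The Random baseline is the simplest of these: independently draw a level for every attribute of every alternative. A minimal sketch of that idea is below; the level counts and seed are illustrative placeholders, not the actual car attributes used in this post:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Sizes from the post: 2 versions, 10 questions, 3 alternatives.
# The level counts for the 4 attributes are hypothetical placeholders.
n_versions, n_questions, n_alts = 2, 10, 3
n_levels = np.array([4, 4, 3, 3])

# A random design independently draws a level for every attribute of every
# alternative; `high` is exclusive, so add 1 to include the top level.
design = rng.integers(1, n_levels + 1,
                      size=(n_versions, n_questions, n_alts, len(n_levels)))
print(design.shape)  # (2, 10, 3, 4)
```

Nothing constrains the draws, which is why such designs tend to be unbalanced and have a high D-error.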
In order to compare algorithms, I generate 100 designs per algorithm (using different random seeds) and average the metrics across the 100 designs, which reduces noise in the results. Each design has 2 versions with 10 questions each and 3 alternatives per question. The designs are based on the following 4 attributes for cars:
In order to fit a model with a design, I require response data. To obtain this, I simulate responses from 100 respondents by drawing 100 sets of parameters from normal distributions. I assume the following prior means and standard deviations for the respondent parameters:
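This simulation step can be sketched as follows. The prior means, standard deviations, and coded design matrix below are illustrative placeholders rather than the actual car priors; choices are drawn from a multinomial logit (a softmax over utilities):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

n_resp, n_questions, n_alts, n_params = 100, 10, 3, 5  # illustrative sizes

# Placeholder priors -- the post's actual car priors would go here
prior_mean = np.array([0.5, -0.2, 0.3, 0.0, -0.4])
prior_sd = np.full(n_params, 0.5)

# One parameter vector per respondent, drawn from independent normals
betas = rng.normal(prior_mean, prior_sd, size=(n_resp, n_params))

# Stand-in coded design matrix: one row of attribute codes per alternative
X = rng.normal(size=(n_questions, n_alts, n_params))

# Choice probabilities from a multinomial logit (softmax of utilities)
utilities = np.einsum("rp,qap->rqa", betas, X)
probs = np.exp(utilities)
probs /= probs.sum(axis=2, keepdims=True)

# Draw one choice per respondent per question by inverse-CDF sampling;
# the clip guards against floating-point rounding in the cumulative sum
cum = probs.cumsum(axis=2)
draws = rng.random((n_resp, n_questions, 1))
choices = np.minimum((draws > cum).sum(axis=2), n_alts - 1)
```

Each respondent answers every question according to their own drawn parameter vector, so the response data reflect the assumed preference heterogeneity.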
With the response data, I fit a two-class latent class analysis which produces parameter estimates and standard errors for these estimates. From these estimates, 3 metrics of model fit are computed: mean standard error, parameter deviation and in-sample prediction accuracy.
- Mean standard error is simply the mean of the standard errors of the parameter estimates. Smaller standard errors imply less uncertainty in parameter estimates.
- Parameter deviation is the mean absolute difference between the simulated parameters and the parameter estimates. A smaller deviation means that the parameter estimates are closer to the original parameters, i.e., the model is better able to capture reality.
- The in-sample prediction accuracy is the percentage of questions where responses were correctly predicted by the model. As model over-fitting is not a major concern with this latent class analysis (model complexity is low), in-sample predictions were chosen over out-of-sample predictions, which can be noisier (as there is less data available for fitting).
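Given the fitted estimates, the three metrics are straightforward to compute. The numbers below are made-up stand-ins for the simulation outputs, just to show the calculations:

```python
import numpy as np

# Made-up stand-ins for the simulation outputs
true_params = np.array([0.5, -0.2, 0.3])   # parameters used to simulate data
estimates = np.array([0.45, -0.25, 0.35])  # parameter estimates from the fit
std_errors = np.array([0.10, 0.12, 0.08])  # their standard errors
observed = np.array([0, 2, 0, 1, 0])       # observed (simulated) choices
predicted = np.array([0, 2, 1, 1, 0])      # model-predicted choices

mean_se = std_errors.mean()                         # mean standard error
deviation = np.abs(estimates - true_params).mean()  # parameter deviation
accuracy = (predicted == observed).mean()           # in-sample accuracy
print(round(mean_se, 3), round(deviation, 3), accuracy)  # 0.1 0.05 0.8
```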
D-error
Before using the simulated data, I first compare the average D-error of designs generated with the different algorithms (lower is better):
As expected, the Efficient algorithm produces designs with the lowest D-error (the quantity the algorithm is designed to minimize). The D-error from Complete Enumeration is nearly as low. The worst performer is the Random design, which has a much higher D-error than the others. Using parameter testing from linear regression, I have verified that the Efficient algorithm's D-error is significantly lower than that of the other algorithms.
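The D-error being minimized here can be sketched as follows, assuming a multinomial logit model: it is the determinant of the model's Fisher information matrix raised to the power -1/k, where k is the number of parameters. The coded design matrix below is a random stand-in for illustration:

```python
import numpy as np

def d_error(X, beta):
    # X: (n_questions, n_alts, k) coded design; beta: prior parameters.
    # Returns det(information)^(-1/k); lower means more precise estimates.
    n_questions, n_alts, k = X.shape
    info = np.zeros((k, k))
    for Xq in X:
        p = np.exp(Xq @ beta)
        p /= p.sum()
        # Fisher information contribution of one question under MNL:
        # Xq' (diag(p) - p p') Xq
        info += Xq.T @ (np.diag(p) - np.outer(p, p)) @ Xq
    return np.linalg.det(info) ** (-1 / k)

rng = np.random.default_rng(seed=2)
X = rng.normal(size=(10, 3, 4))  # hypothetical coded design
print(d_error(X, np.zeros(4)))   # zero prior gives the D0-error
```

With a nonzero prior (as in the simulation above), the same formula gives the Bayesian-style Db-error evaluated at the prior means.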
Mean standard error
The next chart compares the mean standard error of the parameter estimates from the latent class model (lower is better):
Again, the Efficient design is in first place and Random is last. This is not surprising since the formulation of D-error is based on standard errors, and hence related to the mean standard error. Only Shortcut and Random are significantly worse than Efficient, based on parameter testing in regression.
Parameter deviation
This chart compares the deviations between the actual and fitted parameters (lower is better):
This result suggests that the design algorithm does not make much of a difference to how accurately the parameter estimates recover the original parameters, which is surprising. Parameter testing confirms that the algorithms are not significantly different from one another.
In-sample prediction accuracy
The final chart compares in-sample prediction accuracy (higher is better):
Random is the worst, while the others have similar prediction accuracies, especially Shortcut, Complete Enumeration, and Efficient. According to parameter testing, Random and Balanced Overlap are significantly worse than Efficient.
Based on this analysis, the Efficient algorithm appears to be the best choice, although Shortcut and Complete Enumeration are equivalent to Efficient in terms of prediction accuracy. Note that these results apply to the car experiment with the specified simulated priors. Random designs are clearly the worst on 3 out of 4 metrics, so I wouldn't recommend using them over the other algorithms. If you would like to see how I produced this analysis, as well as the parameter test results mentioned in this blog post, please view this Displayr document.