After training a random forest, it is natural to ask which variables have the most predictive power. Variables with high importance are the main drivers of the outcome, and their values have a significant effect on predictions. By contrast, variables with low importance can often be omitted, making the model simpler and faster to fit and to predict with.
This post builds on our earlier description of random forests. We recommend reading that post first for context.
The example below shows the importance of eight variables when predicting an outcome with two options. In this instance, the outcome is whether a person has an income above or below $50,000.
Two measures of importance are given for each variable in the random forest. The first is based on how much the accuracy decreases when the variable is excluded, and is further broken down by outcome class. The second is based on the decrease in Gini impurity when a variable is chosen to split a node. See this article for more information on Gini.
Each tree has its own out-of-bag sample: data that was not used during its construction. This sample is used to calculate the importance of a specific variable. First, the prediction accuracy on the out-of-bag sample is measured. Then, the values of the variable in the out-of-bag sample are randomly shuffled, keeping all other variables the same. Finally, the decrease in prediction accuracy on the shuffled data is measured.
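The shuffling procedure can be sketched in a few lines. The code below is a minimal, self-contained illustration: the out-of-bag sample, the labels, and the stand-in "model" (a single threshold rule on age, not a real random forest) are all hypothetical, as are the function names.

```python
import random

# Hypothetical out-of-bag sample: each row is (age, hours_per_week);
# the label is 1 if income is over $50,000, else 0.
oob_X = [(25, 40), (52, 60), (47, 50), (23, 20), (61, 45), (35, 38), (58, 55), (29, 30)]
oob_y = [0, 1, 1, 0, 1, 0, 1, 0]

def model_predict(row):
    # Stand-in model: predicts high income when age >= 40.
    return 1 if row[0] >= 40 else 0

def accuracy(X, y):
    return sum(model_predict(r) == t for r, t in zip(X, y)) / len(y)

def permutation_importance(X, y, col, n_repeats=100, seed=0):
    """Mean decrease in accuracy when column `col` is shuffled."""
    rng = random.Random(seed)
    base = accuracy(X, y)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        # Rebuild rows with only column `col` permuted.
        X_perm = [row[:col] + (v,) + row[col + 1:] for row, v in zip(X, shuffled)]
        drops.append(base - accuracy(X_perm, y))
    return sum(drops) / len(drops)

print(permutation_importance(oob_X, oob_y, col=0))  # age: large drop in accuracy
print(permutation_importance(oob_X, oob_y, col=1))  # hours: 0.0, the model ignores it
```

Because the stand-in model ignores hours entirely, shuffling that column changes nothing and its importance is exactly zero, while shuffling age destroys the model's only source of information.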
The mean decrease in accuracy across all trees is reported. This importance measure is also broken down by outcome class. For example, age is important for predicting that a person earns over $50,000, but not for predicting that a person earns less.
Intuitively, the random shuffling destroys any relationship between the variable and the outcome, so on average the shuffled variable has no predictive power. The importance therefore measures how much accuracy is lost by removing a variable, or equivalently, how much accuracy is gained by including it.
Note that if a variable has very little predictive power, shuffling may lead to a slight increase in accuracy due to random noise. This can produce small negative importance scores, which can be treated as equivalent to zero importance.
When a tree is built, the variable to split on at each node is chosen by calculating the decrease in Gini impurity that each candidate split would produce.
For each variable, the Gini decrease is accumulated every time that variable is chosen to split a node, summed across every tree in the forest. The sum is divided by the number of trees to give an average. The scale is irrelevant: only the relative values matter. In the example above, occupation is over five times more important than country.
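The Gini calculation itself is straightforward. The sketch below (with hypothetical helper names and a binary split) shows the impurity of a node and the decrease achieved by a split; in a forest, these decreases would be accumulated per variable as described above.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_decrease(parent, left, right):
    """Decrease in impurity from splitting `parent` into `left` and `right`,
    weighting each child by its share of the parent's observations."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

# A node with 4 high earners and 4 low earners is maximally impure (0.5).
parent = [1, 1, 1, 1, 0, 0, 0, 0]
print(gini(parent))                                       # 0.5
# A split that separates the classes perfectly reduces impurity to zero,
# for a decrease of 0.5.
print(gini_decrease(parent, [1, 1, 1, 1], [0, 0, 0, 0]))  # 0.5
```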
The importances are roughly aligned between the two measures, although the numeric variables age and hrs_per_week rank relatively higher on the Gini scale. This may indicate a bias towards using numeric variables to split nodes, because they offer many potential split points.
Importance for numeric outcomes
The previous example used a categorical outcome. For a numeric outcome (as shown below) there are two similar measures:
- Percentage increase in mean squared error is analogous to accuracy-based importance, and is calculated by shuffling the values of the out-of-bag samples.
- Increase in node purity is analogous to Gini-based importance, and is calculated based on the reduction in sum of squared errors whenever a variable is chosen to split.
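For the second measure, node purity can be sketched as the reduction in sum of squared errors around the node mean. The data and function names below are illustrative assumptions, not output from a real forest.

```python
def sse(values):
    """Sum of squared errors around the node mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def sse_decrease(parent, left, right):
    """Increase in node purity: the reduction in SSE from a split."""
    return sse(parent) - sse(left) - sse(right)

# Hypothetical incomes (in $1000s) at a node, and a split that groups
# similar values together; grouping like with like gives a large decrease.
parent = [20, 25, 30, 80, 90, 100]
print(sse_decrease(parent, [20, 25, 30], [80, 90, 100]))  # 6337.5
```

As with the Gini decrease, these reductions are accumulated per variable over every split and every tree, and only the relative values matter.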
One advantage of the Gini-based importance is that the Gini calculations are already performed during training, so minimal extra computation is required. A disadvantage is that splits are biased towards variables with many categories or potential split points, which also biases the importance measure. Both methods may overstate the importance of correlated predictors.
This analysis was done in Displayr.