After training a random forest, it is natural to ask which variables have the most predictive power. Variables with high importance are drivers of the outcome: their values have a substantial impact on predictions. By contrast, variables with low importance might be omitted from a model, making it simpler and faster to fit and predict.
This post builds on our earlier description of random forests; we recommend reading that post first for context.
The example below shows the importance of eight variables when predicting an outcome with two options. In this instance, the outcome is whether a person has an income above or below $50,000.
There are two measures of importance given for each variable in the random forest. The first measure is based on how much the accuracy decreases when the variable is excluded. This is further broken down by outcome class. The second measure is based on the decrease of Gini impurity when a variable is chosen to split a node. See this article for more information on Gini.
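To make this concrete, here is a minimal sketch of fitting such a model in R with the randomForest package. The data frame income_data and its income column are hypothetical stand-ins for the example data; setting importance = TRUE is what enables the accuracy-based measure.

```r
# Minimal sketch: fit a classification random forest and record both
# importance measures. income_data and its income column are hypothetical.
library(randomForest)

set.seed(42)  # out-of-bag sampling is random, so fix the seed
rf <- randomForest(income ~ ., data = income_data,
                   ntree = 500,
                   importance = TRUE)  # needed for accuracy-based importance

# One row per variable: per-class accuracy decreases,
# MeanDecreaseAccuracy, and MeanDecreaseGini.
importance(rf)
```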
Accuracy-Based Variable Importance (Mean Decrease in Accuracy)
Each tree has its own out-of-bag sample: data that was not used during its construction. This sample is used to calculate the importance of a specific variable. First, the prediction accuracy on the out-of-bag sample is measured. Then, the values of the variable in the out-of-bag sample are randomly shuffled, keeping all other variables the same. Finally, the decrease in prediction accuracy on the shuffled data is measured.
The mean decrease in accuracy across all trees is reported. This importance measure is also broken down by outcome class. For example, age is important for predicting that a person earns over $50,000, but not important for predicting that a person earns less.
Intuitively, the random shuffling destroys whatever predictive power the variable had, so on average the shuffled variable predicts nothing. This importance therefore measures how much accuracy is lost when a variable's information is removed, or equivalently, how much accuracy is gained by including it.
Note that if a variable has very little predictive power, shuffling may lead to a slight increase in accuracy due to random noise. This in turn can give rise to small negative importance scores, which can essentially be regarded as zero importance.
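The permutation idea can be sketched by hand. The snippet below assumes the hypothetical rf model above and a hypothetical holdout data frame standing in for each tree's out-of-bag sample; randomForest itself repeats this per tree on the true out-of-bag data.

```r
# Hedged sketch of the permutation procedure for one variable, using a
# single holdout set rather than each tree's own out-of-bag sample.
# rf and holdout (with an income column) are assumed from context.
baseline_acc <- mean(predict(rf, newdata = holdout) == holdout$income)

shuffled <- holdout
shuffled$age <- sample(shuffled$age)  # permute one variable, keep the rest

shuffled_acc <- mean(predict(rf, newdata = shuffled) == holdout$income)

# Importance of age: accuracy lost when its values carry no information.
baseline_acc - shuffled_acc
```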
Gini-Based Variable Importance (Mean Decrease Gini)
When a tree is built, the decision about which variable to split on at each node is made by calculating the decrease in Gini impurity.
Each time a variable is chosen to split a node, the resulting decrease in Gini impurity is recorded. For each variable, these decreases are summed across every tree of the forest, and the sum is divided by the number of trees to give an average. The scale is irrelevant: only the relative values matter. In the example above, occupation is over five times more important than country.
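Continuing with the hypothetical rf model from above, extracting this measure requires no retraining, and only the relative values are meaningful. The variable names occupation and country are assumed from the example.

```r
# Gini-based importance is a byproduct of training; type = 2 selects it.
gini <- importance(rf, type = 2)[, "MeanDecreaseGini"]

sort(gini, decreasing = TRUE)         # rank the variables
gini["occupation"] / gini["country"]  # e.g. a relative comparison
```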
The importances are roughly aligned between the two measures, with the numeric variables age and hrs_per_week ranking higher on the Gini scale. This may indicate a bias towards using numeric variables to split nodes, because they offer many potential split points.
Comparing Accuracy-based and Gini-based Importance Measures
When assessing variable importance in a random forest, both accuracy-based and Gini-based measures offer valuable insights, though they differ in their calculation and implications.
- Accuracy-based importance (often expressed as % Increase in Mean Squared Error for regression, or decrease in accuracy for classification) quantifies how much model performance degrades when a variable's values are randomly permuted. It directly reflects a variable's impact on prediction accuracy, making it highly intuitive. However, its computation can be more intensive.
- Gini-based importance (or Mean Decrease Gini/IncNodePurity) measures the average decrease in Gini impurity (for classification) or variance (for regression) across all trees when a variable is used for a split. It's computationally faster as it's a byproduct of tree construction. A drawback is its potential bias towards variables with more unique values or categories.
While neither measure is perfect, viewing both together offers a more comprehensive understanding of a variable's true influence, helping to identify the most reliable predictors.
Variable Importance for Numeric Outcomes (Regression)
The previous example used a categorical outcome. For a numeric outcome (as shown below), there are two similar measures:
- Percentage increase in mean squared error (%IncMSE) is analogous to accuracy-based importance, and is calculated by shuffling the values of the out-of-bag samples (see the sketch after this list).
- Increase in node purity (IncNodePurity) is analogous to Gini-based importance, and is calculated based on the reduction in the sum of squared errors whenever a variable is chosen to split.
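A sketch of the regression case follows, assuming a hypothetical data frame housing_data with a numeric outcome price:

```r
# Regression random forest; the two reported measures become
# %IncMSE and IncNodePurity. housing_data and price are hypothetical.
library(randomForest)

set.seed(42)
rf_reg <- randomForest(price ~ ., data = housing_data,
                       ntree = 500,
                       importance = TRUE)  # needed for %IncMSE

importance(rf_reg)  # columns: %IncMSE and IncNodePurity
```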
Understanding Feature Importance in Random Forest Models
In the realm of machine learning, the terms "variable importance" and "feature importance" are frequently used interchangeably, particularly within the context of random forest models. Both refer to the critical process of identifying which input variables, or "features," contribute most significantly to the model's predictive power.
Understanding feature importance is paramount for several reasons:
- Model Interpretability: It sheds light on the underlying relationships within your data, helping you comprehend why your model makes certain predictions.
- Feature Selection: High-importance features are strong predictors, while low-importance ones might be redundant. This insight can guide you in simplifying your model, potentially improving efficiency and reducing overfitting by removing less impactful variables.
- Domain Insight: Identifying key features can reveal valuable insights about the problem you're trying to solve, pinpointing the most influential factors driving an outcome.
Summary: Choosing the Right Random Forest Importance Measure
One advantage of the Gini-based importance is that the Gini calculations are already performed during training, so minimal extra computation is required. A disadvantage is that splits are biased towards variables with many categories or possible split points, which also biases the importance measure. Both methods may overstate the importance of correlated predictors.
Neither measure is perfect, but viewing both together allows a comparison of the importance ranking of all variables across both measures. For further reading, see this paper and these slides.
FAQs About Random Forest Variable Importance
Why is variable importance crucial in a random forest?
Variable importance is crucial for several reasons: it enhances model interpretability by showing which features drive predictions, aids in feature selection by identifying redundant or less impactful variables, and provides valuable domain insights by highlighting the most influential factors in your data.
Can variable importance be negative?
Yes, for accuracy-based importance measures (like % Increase in Mean Squared Error), it's possible for a variable to have a small negative importance score. This typically occurs when randomly shuffling a variable’s values in the out-of-bag sample slightly improves the model’s accuracy due to random noise or if the variable has very little predictive power. These negative scores are generally considered equivalent to zero importance.
How does correlation between variables affect importance?
Correlation among predictor variables can indeed affect importance scores. Both Gini-based and accuracy-based measures may overstate the importance of correlated predictors. This is because if two variables are highly correlated, their individual contributions to the model might be shared, making it difficult to isolate the unique impact of each. It’s a known limitation to consider when interpreting results.
Which variable importance measure (Gini vs. Accuracy) should I use?
There isn’t a single “best” measure; each has its strengths. Accuracy-based importance is often more intuitive as it directly relates to prediction error, but it can be more computationally intensive. Gini-based importance is faster to compute as it’s a byproduct of tree construction. For a robust understanding, it’s often recommended to view both measures together, as they can provide complementary insights into a variable’s influence.
How is variable importance calculated for Random Forest in R?
In R, packages like randomForest and ranger are commonly used to build random forest models and calculate variable importance. The randomForest package provides functions that can compute both types of importance: importance(model, type=1) for accuracy-based (%IncMSE for regression, or mean decrease in accuracy for classification) and importance(model, type=2) for Gini-based (IncNodePurity). These functions make it straightforward to implement and interpret variable importance directly within your R workflow.
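For example, assuming a classification model rf fitted with importance = TRUE as above:

```r
# Accuracy-based importance (mean decrease in accuracy for classification,
# %IncMSE for regression):
importance(rf, type = 1)

# Gini-based importance (MeanDecreaseGini for classification,
# IncNodePurity for regression):
importance(rf, type = 2)

# Dot plots of both measures side by side:
varImpPlot(rf)
```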
Run your own Random Forest with Displayr
This analysis was done in Displayr. To see Displayr in action, get started below.