European Policy Analysis Volume 2, Number 1, Spring 2016 | Page 116
Decision Trees and Random Forests: Machine Learning Techniques to Classify Rare Events
Figure 11: Variable Importance Plot
Figure 12: Partial Dependence Plots
Both measures rank Year and
Congress among the most important
variables. For node purity (mean decrease
in Gini impurity), TopicCode is less
important, although it is the most
important variable for accuracy. In our
case, where we are interested in the
classification of rare events, the Gini-based
measure should be preferred.
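The two importance measures can be reproduced with any random-forest implementation. The following is a minimal sketch in Python with scikit-learn on synthetic data; the three features are hypothetical stand-ins for the paper's Year, Congress, and TopicCode, not the original data. The Gini (impurity-based) importance falls out of training, while the accuracy-based measure is computed by permuting each feature and measuring the drop in performance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
# Synthetic stand-ins for the paper's predictors (e.g. Year, Congress, TopicCode).
X = rng.normal(size=(500, 3))
# A rare positive class (~10% of cases), driven by the first two features.
y = (X[:, 0] + 0.5 * X[:, 1] > 1.5).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean decrease in Gini impurity: accumulated during tree construction,
# normalized so the values sum to 1.
gini_importance = rf.feature_importances_

# Mean decrease in accuracy: shuffle one feature at a time and record
# how much the model's score degrades.
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
accuracy_importance = perm.importances_mean
```

Comparing the two arrays side by side reproduces the kind of disagreement the variable importance plot shows: a feature can rank high on one measure and low on the other.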
But there is more to learn from
the random forest model. As described
earlier, decision trees can fit
truly nonlinear effects. In Figure 6, we
already saw that Year and sou sometimes
had a positive effect and sometimes a
negative one, depending on the critical
value of the split. We can now extract all
these split points from the ensemble of
bootstrapped trees to see how the effect
of the variables changes (Figure 12). This
visualization is called a partial dependence
plot.
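The idea behind a partial dependence plot can be sketched directly: fix one predictor at each value on a grid, average the forest's predicted probability over the remaining data, and trace the resulting curve. The sketch below uses synthetic data and a hypothetical helper, not the paper's model; feature 0 plays the role of a predictor such as Year.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins for the paper's predictors; feature 0 mimics Year.
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 1.5).astype(int)  # rare positive class

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

def partial_dependence_curve(model, X, feature, grid):
    """For each grid value, force `feature` to that value for every
    observation and average the model's predicted probability."""
    curve = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v
        curve.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(curve)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
curve = partial_dependence_curve(rf, X, 0, grid)
# Plotting `grid` against `curve` yields the partial dependence plot.
```

Because the curve averages over all the trees' split points, it reveals where the effect of a variable changes sign or levels off, which is exactly what Figure 12 visualizes.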