European Policy Analysis Volume 2, Number 1, Spring 2016 | Page 116
Decision Trees and Random Forests: Machine Learning Techniques to Classify Rare Events
Figure 11: Variable Importance Plot
Figure 12: Partial Dependence Plots
Both measures rank Year and
Congress among the most important
variables. For node purity (mean decrease
in Gini impurity), TopicCode is less
important, although it is the most
important variable for accuracy. In our
case, where we are interested in the
classification of rare events, the Gini-based
measure should be preferred.
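The two importance measures can be reproduced with any random-forest implementation. The following is a minimal sketch in Python with scikit-learn on synthetic data; the three features are hypothetical stand-ins for the paper's Year, Congress, and TopicCode, not the original data. The Gini (impurity-based) importance falls out of training, while the accuracy-based measure is computed by permuting each feature and measuring the drop in performance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
# Synthetic stand-ins for the paper's predictors (e.g. Year, Congress, TopicCode).
X = rng.normal(size=(500, 3))
# A rare positive class (~10% of cases), driven by the first two features.
y = (X[:, 0] + 0.5 * X[:, 1] > 1.5).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean decrease in Gini impurity: accumulated during tree construction,
# normalized so the values sum to 1.
gini_importance = rf.feature_importances_

# Mean decrease in accuracy: shuffle one feature at a time and record
# how much the model's score degrades.
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
accuracy_importance = perm.importances_mean
```

Comparing the two arrays side by side reproduces the kind of disagreement the variable importance plot shows: a feature can rank high on one measure and low on the other.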
But there is more to learn from
the random forest model. As described
earlier, decision trees can fit
truly nonlinear effects. In Figure 6, we
already saw that Year and sou sometimes
had a positive effect and sometimes a
negative one, depending on the critical
value of the split. We can now extract all
these split points from the ensemble of
bootstrapped trees to see how the effect
of the variables changes (Figure 12). This
visualization is called a partial dependence
plot.
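The idea behind a partial dependence plot can be sketched directly: fix one predictor at each value on a grid, average the forest's predicted probability over the remaining data, and trace the resulting curve. The sketch below uses synthetic data and a hypothetical helper, not the paper's model; feature 0 plays the role of a predictor such as Year.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins for the paper's predictors; feature 0 mimics Year.
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 1.5).astype(int)  # rare positive class

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

def partial_dependence_curve(model, X, feature, grid):
    """For each grid value, force `feature` to that value for every
    observation and average the model's predicted probability."""
    curve = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v
        curve.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(curve)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
curve = partial_dependence_curve(rf, X, 0, grid)
# Plotting `grid` against `curve` yields the partial dependence plot.
```

Because the curve averages over all the trees' split points, it reveals where the effect of a variable changes sign or levels off, which is exactly what Figure 12 visualizes.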