
3. Do the same with all other predictors.
4. Choose the predictor and the corresponding cutpoint that reduces the criterion most.
5. Split the data in two parts according to the selected cutpoint.
6. Repeat the procedure for both parts of the data until a stopping criterion is reached; for instance, until no region contains more than five observations (a code sketch of these steps is given at the end of this section).

Decision trees are very prone to overfitting. In an extreme case, we could divide the predictor space into as many regions as there are data points. The result would be a perfect prediction of the data (unless two cases with different classes share exactly the same position). But such an overcomplex tree would perform poorly on new data (i.e., it would not be robust). Conversely, keeping the regions as large as possible (few splits) increases the robustness of the decision tree, because these large regions will probably also be suitable for new data points. The tradeoff, however, is a higher classification error. The random forest is an extension of the decision tree method that overcomes this problem.

Random Forest

The problem with decision trees is that they suffer from high variance. This means that slightly different data might lead to very different decision trees. Calculating the mean is a common way to reduce the variance. In a set "of n independent observations Z₁, …, Zₙ, each with the variance σ², the variance of the mean Z̄ of the observations is given by σ²/n. In other words, averaging a set of observations reduces variance" (James et al. 2013, 306). So, if we ran the decision tree algorithm on multiple training sets, we could average the resulting models and arrive at one low-variance machine learning algorithm. The problem is, of course, that we (normally) do not have multiple training sets. Splitting our data into different sets does not help, because every model built on a subset would be strongly biased. The solution is bootstrapping.9 The procedure is quite simple. We can create multiple datasets from the original data by sampling with replacement (Mooney 1996; Shikano 2006). The dataset is treated like a bag from which every observation can be drawn and added to the bootstrapped dataset. Then this observation is returned to the bag, so that it can be drawn again.
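
To make the splitting procedure in steps 3–6 above concrete, the following minimal sketch in Python may help. It is illustrative only: the function names, the Gini impurity criterion, and the use of non-negative integer class labels are our assumptions, not necessarily the exact choices made in this article. The sketch greedily tries every predictor and every candidate cutpoint, keeps the split that reduces the criterion most, and recurses until no region contains more than five observations.

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of class labels (one common split criterion)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Steps 3-4: try every predictor and cutpoint, keep the pair that
    reduces the (weighted) impurity of the two resulting regions the most."""
    best, best_score = None, gini(y)
    for j in range(X.shape[1]):                  # every predictor
        for cut in np.unique(X[:, j]):           # every candidate cutpoint
            left = X[:, j] <= cut
            if left.all() or not left.any():     # skip splits that leave a region empty
                continue
            score = (left.mean() * gini(y[left]) +
                     (~left).mean() * gini(y[~left]))
            if score < best_score:               # largest reduction of the criterion
                best, best_score = (j, cut), score
    return best

def grow_tree(X, y, min_region=5):
    """Steps 5-6: split the data in two parts and repeat for both parts
    until no region contains more than `min_region` observations."""
    if len(y) <= min_region:
        return {"prediction": np.bincount(y).argmax()}   # leaf: majority class
    split = best_split(X, y)
    if split is None:                                    # no split improves the criterion
        return {"prediction": np.bincount(y).argmax()}
    j, cut = split
    left = X[:, j] <= cut
    return {"predictor": j, "cutpoint": cut,
            "left": grow_tree(X[left], y[left], min_region),
            "right": grow_tree(X[~left], y[~left], min_region)}
```

This is a didactic sketch, not a full implementation: real tree software additionally handles categorical predictors, ties, and alternative criteria, and in practice one would rely on a library routine.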
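
The bootstrap-and-average idea of this section can be sketched just as briefly. The snippet below is again only illustrative: it uses scikit-learn's DecisionTreeClassifier rather than the hand-rolled tree above, assumes a binary 0/1 outcome, and implements only the bagging step, not yet the additional random selection of predictors that distinguishes the full random forest.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bootstrap_sample(X, y, rng):
    """Draw n observations with replacement: the data are treated like a bag,
    and each drawn observation is put back so it can be drawn again."""
    idx = rng.integers(0, len(y), size=len(y))   # indices may repeat
    return X[idx], y[idx]

def bagged_trees(X, y, n_trees=100, seed=0):
    """Fit one (deliberately deep, high-variance) tree per bootstrapped dataset."""
    rng = np.random.default_rng(seed)
    return [DecisionTreeClassifier().fit(*bootstrap_sample(X, y, rng))
            for _ in range(n_trees)]

def predict_bagged(trees, X_new):
    """Average the individual trees' votes (majority vote for 0/1 classes);
    the averaged model has lower variance than any single tree."""
    votes = np.stack([tree.predict(X_new) for tree in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

In applied work one would of course use an existing implementation (for example, the randomForest package in R or RandomForestClassifier in scikit-learn), which adds the random choice of predictors at each split.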