European Policy Analysis Volume 2, Number 1, Spring 2016 | Page 111

The two-thirds approach is often seen as best practice because it uses many observations for training, which leads to high accuracy, but leaves sufficient observations for testing. In practice, however, any other proportion of training and test data is possible (e.g., two sets of equal size). Other cross-validation approaches are more computationally intensive. Leave-one-out cross-validation builds as many models as there are observations, each with one data point left out, and then predicts the value of each left-out observation. A different common approach is k-fold cross-validation. Here, data is divided into k randomly selected subsets, and each subset is used once as a validation set. In both approaches, averaging the results across models leads to more robust estimates of model performance (e.g., classification error).

To demonstrate random forests in action, the next section analyzes the whole budget data. A focus will be on visualizing the results.

Empirical Application: Predicting Punctuations in Budget Shifts

In the following example, the validation-set approach is used: the PAP data is divided into a training set containing two-thirds of the observations (randomly selected) and a test set with the remaining third. All seven predictor variables are used.

Comparison of Decision Trees, Bagging, and Random Forest

In this example, I will fit a decision tree on the training set, discuss the results, and then improve the performance with bagging and random forest. First, a single decision tree is grown (Figure 7). The statements printed in bold in the tree describe the points where the predictor space is split. In the first node, for example, data is divided into two parts: observations where Year is greater than or equal to 1952, and observations from before that year.
The left branch of each node contains the data that fulfills the node's condition (the “yes-branch”), while the right branch contains the data that does not. The leaves of the tree show the dominant class (“FALSE” for no punctuation, “TRUE” for punctuation) for the region defined by the nodes. In this visualization, the classification rate for each leaf has been added. This information is very helpful for understanding the results. We see, for example, in the leaf of the first node that 358 of the 384 budget shifts in this branch were incremental. The next split is based on the variable TopicCode. In the years before 1952, budget functions with the codes 150, 300, 500, or 800 were very likely to show extreme shifts (8 of 13 cases).

The importance of Year is a good example of a nonintuitive finding. Starting with a theory in mind, the year of the budget plan might seem less relevant than the attention that Congress or the President pays to a topic. But once the pattern is exposed, it is easy to think of possible interpretations. For example, it is possible that, shortly after World War II, the transition to a civilian economy also led to many shifts in the federal budget.
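The three resampling schemes described at the start of this section (validation set, k-fold, and leave-one-out) differ only in how the observation indices are partitioned. The following is a minimal sketch in Python using only the standard library. It is an illustration, not the code used for the analysis in this article; the function names, the seed, and the sample size of 30 are assumptions made for the example.

```python
import random


def validation_split(n, train_fraction=2 / 3, seed=0):
    """Validation-set approach: one random split into training and test indices."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_splits(n, k, seed=0):
    """k-fold cross-validation: each of k random subsets is the validation set once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


def leave_one_out_splits(n):
    """Leave-one-out: as many splits as observations, each holding out one point."""
    return [([j for j in range(n) if j != i], [i]) for i in range(n)]


if __name__ == "__main__":
    n = 30  # hypothetical number of observations
    train, test = validation_split(n)
    print(len(train), len(test))          # 20 training and 10 test indices
    print(len(k_fold_splits(n, k=5)))     # 5 (training, validation) pairs
    print(len(leave_one_out_splits(n)))   # 30 pairs, one per observation
```

With k-fold and leave-one-out, a model would be fitted on each training part and evaluated on the corresponding held-out part, and the resulting errors would then be averaged. The validation-set approach fits only one model, which is why it is the least computationally intensive of the three.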