European Policy Analysis Volume 2, Number 1, Spring 2016 | Page 111
The two-thirds approach is often
seen as best practice because it takes
many observations for training—
which leads to high accuracy—but
leaves sufficient observations for
testing. But, in practice, any other
proportion of test and training data
is possible (e.g., two sets of equal
size).
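The validation-set split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `validation_split`, the `train_fraction` parameter, the seed, and the placeholder observations are hypothetical and are not taken from the original analysis.

```python
import random


def validation_split(rows, train_fraction=2 / 3, seed=0):
    """Randomly split rows into a training set and a test set.

    train_fraction is an illustrative parameter; 2/3 mirrors the
    two-thirds approach, 0.5 would give two sets of equal size.
    """
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(rows)       # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder observations (NOT the budget data used in the article).
observations = list(range(30))
train, test = validation_split(observations)
half_a, half_b = validation_split(observations, train_fraction=0.5)
print(len(train), len(test))
```

Changing `train_fraction` is all that is needed to move from the two-thirds split to any other proportion of training and test data.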
Other cross-validation approaches
are more computationally intensive. Leave-one-out cross-validation builds as many
models as there are observations, each
with one data point missing and then
predicts the value for each missing
observation. A different common
approach is k-fold cross-validation. Here,
data is divided into k randomly selected
subsets, and each subset is then used once
as a validation set. In both approaches,
averaging the results across all fitted
models yields a more robust estimate of
model performance (e.g., the classification
error).
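As a minimal sketch of the two procedures, the following Python code generates k-fold splits and averages the classification error over the folds; setting k equal to the number of observations gives leave-one-out cross-validation. The helper names (`k_fold_indices`, `cross_validated_error`, `fit_majority`) and the toy data are assumptions made for illustration. The majority-class "model" merely stands in for a real classifier.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)                         # random assignment to folds
    folds = [indices[i::k] for i in range(k)]    # k roughly equal subsets
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]                 # each subset validates once


def cross_validated_error(data, labels, fit, k, seed=0):
    """Average the per-fold classification error over all k folds."""
    errors = []
    for train_idx, val_idx in k_fold_indices(len(data), k, seed):
        model = fit([data[i] for i in train_idx], [labels[i] for i in train_idx])
        wrong = sum(model(data[i]) != labels[i] for i in val_idx)
        errors.append(wrong / len(val_idx))
    return sum(errors) / len(errors)


def fit_majority(xs, ys):
    """Toy stand-in for a classifier: always predict the most common label."""
    majority = max(set(ys), key=ys.count)
    return lambda x: majority


# Toy data (hypothetical): 12 observations, 3 of them labelled True.
data = list(range(12))
labels = [False] * 9 + [True] * 3

four_fold_error = cross_validated_error(data, labels, fit_majority, k=4)
leave_one_out_error = cross_validated_error(data, labels, fit_majority, k=len(data))
print(four_fold_error, leave_one_out_error)
```

Leave-one-out fits as many models as there are observations, which is why it becomes expensive on large datasets, while k-fold keeps the number of fits fixed at k.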
To demonstrate random forests
in action, the next section analyzes the
full budget dataset, with a focus on
visualizing the results.
Empirical Application: Predicting
Punctuations in Budget Shifts
In the following example, the
validation-set approach is used by
dividing the PAP data into a training
set containing two-thirds of the dataset
(randomly selected) and a test set with
the remaining third of the observations.
All seven predictor variables have been
used.
Comparison of Decision Trees, Bagging,
and Random Forest
In this example, I will fit a decision
tree on the training set, discuss the results,
and then improve the performance with
bagging and random forest.
First, a single decision tree is
grown (Figure 7).
The statements printed in bold
in the tree describe the points where
the predictor space is split. In the first
node, for example, the data is divided into two
parts: observations where Year
is greater than or equal to 1952 and
observations from before that year.
The left branch of each node represents
data that fulfills the condition (the “yes-branch”), while the right branch represents
data that does not. The leaves of the tree
show the dominant class (“FALSE” (no
punctuation) or “TRUE” (punctuation))
for the region defined by the nodes. In this
visualization, the classification rate for
each leaf has been added. This information
helps in understanding the results.
For example, the leaf below the first
node shows that 358 of the 384 budget
shifts in this branch were incremental.
The next split is based on the variable
TopicCode. In the years before 1952, budget
functions with the codes 150, 300, 500, or
800 were very likely to show extreme shifts
(8 of 13 cases). The importance of Year is
a good example of a nonintuitive finding.
Starting with a theory in mind, the year of
the budget plan might seem less relevant
than the attention Congress or the President
is paying to a topic. But once the pattern is
exposed, it is easy to think of possible
interpretations. For example, shortly
after World War II, the shift to a
civilian economy may also have led to
many shifts in the federal budget.
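To make the reading of the tree concrete, the two splits described in the text can be written as plain rules, and the classification rate of a leaf can be computed from the counts reported above. This is a hedged sketch, not the fitted model: the function names are hypothetical, and only the branches described in the text are encoded. All other branches of Figure 7 are left undefined.

```python
# Topic codes named in the text for the pre-1952 branch.
PUNCTUATION_CODES = {150, 300, 500, 800}


def classify(year, topic_code):
    """Apply only the two splits described in the text.

    Returns False (no punctuation), True (punctuation), or None for
    regions of the tree that the text does not describe.
    """
    if year >= 1952:                    # first node, "yes" branch
        return False                    # dominant class: incremental shift
    if topic_code in PUNCTUATION_CODES:
        return True                     # dominant class: punctuation
    return None                         # remaining branches are not described


def leaf_rate(dominant_count, total):
    """Share of observations in a leaf that belong to its dominant class."""
    return dominant_count / total


rate_after_1952 = leaf_rate(358, 384)   # counts reported for the first leaf
rate_early_topics = leaf_rate(8, 13)    # counts reported for the topic-code leaf
print(round(rate_after_1952, 3), round(rate_early_topics, 3))
```

The two rates show why the first leaf is a much more reliable prediction than the second: roughly 93 percent of its observations share the dominant class, compared with about 62 percent in the pre-1952 leaf.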