European Policy Analysis Volume 2, Number 1, Spring 2016 | Page 98
Decision Trees and Random Forests: Machine Learning
Techniques to Classify Rare Events
Simon Hegelich (A)
The article introduces machine learning algorithms for political scientists. These
approaches should not be seen as a new method for old problems. Rather, it is
important to understand the different logic of the machine learning approach.
Here, data is analyzed without theoretical assumptions about possible
causalities. Models are optimized according to their accuracy and robustness.
While the computer can do this work more or less alone, it is the researcher’s
duty to make sense of these models afterward. Visualization of machine learning
results therefore becomes very important and is the focus of this paper.
The methods presented and compared are decision trees, bagging, and
random forests. The latter two are more advanced versions of decision trees,
relying on bootstrapping procedures. To demonstrate these methods, extreme shifts
in the US budget and their connection to the attention of political actors are
analyzed. The paper presents a comparison of the accuracy of different models
based on ROC curves and shows how to interpret random forest models with
the help of visualizations. The aim of the paper is to provide an example of how
these methods can be used in political science and to highlight possible pitfalls
as well as advantages of machine learning.
Keywords: Machine learning, methods, punctuated equilibrium, statistics for
the 21st century
Introduction
Machine learning, the use of computer algorithms that change their performance
with new data, is a new tool for political scientists that can be very useful,
especially in analyzing “unusual” settings such as extreme events, big data
problems, or classification of rare events. The main difference between machine
learning and classical statistics is the way problems are formulated.
Traditional approaches in political science start with the formulation of
hypotheses, proceed to the creation of formal models that represent the
underlying causalities, and then test these models on the available data.
Machine learning starts with data, tries to find hidden patterns, and then
comes up with formal models that can “explain” additional cases. So, both
approaches follow a quite different
(A) Technical University of Munich / Bavarian School of Public Policy, Munich, Germany
doi: 10.18278/epa.2.1.7