3. Especially in big data analytics, computation time can grow to the point where it becomes a real problem. State-of-the-art machine learning tries to overcome this by making the algorithms scalable, that is, by letting several machines (computers, or cores within one computer) work in parallel on the same task.
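As a minimal sketch of this idea in R, one could grow parts of a random forest on several cores and merge them; the data frame dat, the response y, and the core count are placeholders chosen for illustration, not part of the original analysis:

    library(randomForest)
    library(doParallel)

    cl <- makeCluster(4)                      # four worker cores
    registerDoParallel(cl)

    # grow four sub-forests of 125 trees each in parallel and
    # merge them into a single 500-tree forest
    rf <- foreach(ntree = rep(125, 4), .combine = randomForest::combine,
                  .packages = "randomForest") %dopar%
        randomForest(y ~ ., data = dat, ntree = ntree)

    stopCluster(cl)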
4. Sometimes the division into one test set and one training set is still strongly biased. More complex approaches to cross-validation are leave-one-out cross-validation and k-fold cross-validation (James et al. 2013, 178–184).
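A minimal sketch of k-fold cross-validation in R; dat and the helper fit_and_score() are placeholders standing in for the data and for fitting a model on the training folds and scoring it on the held-out fold:

    k <- 10
    set.seed(1)
    # assign every observation to one of k folds at random
    folds <- sample(rep(1:k, length.out = nrow(dat)))

    errors <- numeric(k)
    for (i in 1:k) {
        train <- dat[folds != i, ]
        test  <- dat[folds == i, ]
        errors[i] <- fit_and_score(train, test)
    }
    mean(errors)    # cross-validated error estimate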
5. The data used here were originally collected by Frank R. Baumgartner and Bryan D. Jones, with the support of National Science Foundation grant numbers SBR 9320922 and 0111611, and were distributed through the Department of Government at the University of Texas at Austin. Neither NSF nor the original collectors of the data bear any responsibility for the analysis reported here.
6. As can be seen, the predictor variables are strongly correlated (e.g., congress and eo). Unlike with conventional statistical approaches, this is not a problem for the machine learning methods used in this paper, because they fall into the class of nonparametric models, that is, they make no assumptions about underlying distributions.
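Such correlations are easy to inspect in R; assuming the predictors sit in a data frame dat (a placeholder name), the pairwise and full correlation matrices are:

    # correlation between the two predictors named above
    cor(dat[, c("congress", "eo")])

    # full correlation matrix of all numeric predictors, rounded
    round(cor(dat[sapply(dat, is.numeric)]), 2)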
7. The same procedure can be used for regression problems as well. Instead of counting the elements of different classes, one would take the mean or the mode of the response for each region.
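As an illustrative sketch with the rpart package (dat and the numeric response y are placeholders), the same recursive partitioning yields a regression tree whose leaves predict the mean of their region:

    library(rpart)

    # method = "anova" grows a regression tree; each terminal
    # node predicts the mean response of the observations it contains
    reg_tree <- rpart(y ~ ., data = dat, method = "anova")
    predict(reg_tree, newdata = dat)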
8. For regression trees, the residual sum of squares is used as a criterion.
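In the notation of James et al. (2013), a regression tree with regions R_1, ..., R_J is grown to minimize

    RSS = \sum_{j=1}^{J} \sum_{i \in R_j} ( y_i - \hat{y}_{R_j} )^2 ,

where \hat{y}_{R_j} is the mean response of the training observations falling into region R_j.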
9. In political science, bootstrapping is often used to calculate confidence intervals for unknown distributions (Jacoby and Armstrong 2014).
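A minimal sketch of a percentile bootstrap confidence interval in R; the sample x and the number of resamples are placeholders chosen for illustration:

    set.seed(1)
    x <- rnorm(100)    # placeholder sample from an unknown distribution

    # draw 2,000 bootstrap resamples and record the mean of each
    boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

    # 95 percent percentile confidence interval for the mean
    quantile(boot_means, c(0.025, 0.975))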
10. For a deeper discussion of the bias-variance problem, see Hastie et al. (2009, 219–225).
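In brief, assuming y = f(x) + \varepsilon with noise variance \sigma^2, the expected squared prediction error at a point x_0 decomposes as

    E[ ( y_0 - \hat{f}(x_0) )^2 ] = \mathrm{Bias}[ \hat{f}(x_0) ]^2 + \mathrm{Var}[ \hat{f}(x_0) ] + \sigma^2 ,

so flexible models such as deep trees trade low bias for high variance; averaging many trees, as random forests do, attacks the variance term.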
11. Because of the bootstrap sampling, not every observation appears in every tree's training set. The number of trees should therefore be large enough that every observation is represented somewhere in the ensemble.
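This follows because a bootstrap sample of size n drawn with replacement omits any given observation with probability (1 - 1/n)^n, which approaches e^{-1} ≈ 0.368 for large n, so roughly a third of the observations are missing from each tree. A quick check in R:

    n <- 1000
    (1 - 1/n)^n    # probability that one observation is left out: about 0.368
    exp(-1)        # the limiting value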
12. In addition, there are optimization algorithms for this parameter, such as the tuneRF() function in R.
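A typical call, with the predictor matrix x and response y as placeholders, searches over mtry, the number of variables tried at each split:

    library(randomForest)

    # double or halve mtry from its default as long as the
    # out-of-bag error improves by at least 1 percent
    tuneRF(x, y, ntreeTry = 500, stepFactor = 2, improve = 0.01)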
13. In the programming language R, this is done with the command "set.seed()."
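A minimal illustration of how fixing the seed makes random draws reproducible:

    set.seed(42)
    sample(1:10, 3)    # three random numbers

    set.seed(42)
    sample(1:10, 3)    # the same three numbers, because the seed is the same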
14. Too many trees are not really a problem for the model in this case. A very high number of trees might lead to overfitting, but random forests are in general quite robust, and the cross-validation would reveal such shortcomings. In real-world examples, computation time might be the biggest problem when growing deep random forests with a large number of trees.
15. None of the tested models reaches an AUC of 75 percent. Compared with clinical studies, the accuracy of even the random forest model would be too low to be accepted. This remark is meant to remind the reader that even the best available model might not be good enough for reliable predictions.
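For reference, the AUC can be computed in R with, for example, the pROC package; labels (observed classes) and probs (predicted probabilities) are placeholder names:

    library(pROC)

    roc_obj <- roc(labels, probs)    # build the ROC curve
    auc(roc_obj)                     # area under the curve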
16. The next step in data mining would be to enhance the accuracy of the model further by tuning the parameters of the algorithm. The AUC benchmark for clinical studies could easily be reached by changing parameters such as the number of predictors considered at each split or the majority rule.
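In R's randomForest(), for instance, these two parameters correspond to the mtry and cutoff arguments; the values below are illustrative placeholders, not tuned settings:

    library(randomForest)

    # mtry: number of predictors sampled at each split;
    # cutoff: class-wise voting thresholds that replace the simple
    # majority rule, here making it easier for the rare class to win
    rf <- randomForest(y ~ ., data = dat,
                       mtry = 4, cutoff = c(0.8, 0.2))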
17. "Some people believe that decision trees more closely mirror human decision-making than do [other] regression and classification approaches" (James et al. 2013, 315).