European Policy Analysis Volume 2, Number 1, Spring 2016 | Page 103

The budget data is then linked to the attention data. The construction of this combined dataset can be studied with the attached R-code (see supporting information) and is, therefore, only explained in general terms here. For each topic, the following variables are calculated from the PAP data:
• the annual number of congressional hearings on each topic (congress),
• the annual number of public laws passed by Congress on each topic (laws),
• the annual number of executive orders issued by the President on each topic (eo),
• the annual number of mentions of each topic in the President’s State of the Union speech (sou),
• and the annual percentage with which the topic was mentioned among Gallup’s most important problems (gallup).

These variables measure the attention paid to different topics within the policy process at a given time. For example, if the President refers to environmental issues six times in his annual State of the Union speech, the variable sou has the value 6 for the topic “Natural Resources and Environment” in that year. All variables cover the period from 1948 to 2014. In addition, there are four variables derived from the budget data:
• punctuation TRUE or FALSE (Punc),
• the year for which the budget was proposed (Year),
• and the budget function (TopicCode).

At the beginning of each year, the President reports the actual budget of the previous year (this is the data in the PAP dataset), the distribution of the budget in the current year, and his budget plans for the coming year (True 2009). To capture the effect of attention on budget decisions, it is therefore necessary to apply a time lag of two years: for all 610 data points, the budget is compared with the attention variables from two years earlier. As can be seen in Figure 2, punctuations are rare compared with incremental budget shifts. Figure 3 gives an overview of the variables at hand.
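The two-year lag described above can be illustrated with a minimal Python sketch. It is not the attached R-code. The toy records, the helper `link_with_lag`, and the assumption that budget functions and attention topics share one key are all hypothetical and serve only to show the linkage step.

```python
# Minimal sketch of linking budget rows to attention data lagged by two years.
# All values below are invented for illustration; they are not PAP data.

LAG_YEARS = 2  # time lag stated in the text

# Attention variables keyed by (topic, year); names follow the text.
attention = {
    ("Natural Resources and Environment", 2000):
        {"congress": 40, "laws": 5, "eo": 2, "sou": 6, "gallup": 1.5},
    ("Natural Resources and Environment", 2001):
        {"congress": 35, "laws": 4, "eo": 1, "sou": 3, "gallup": 2.0},
}

# Budget rows; assumes TopicCode uses the same labels as the attention topics.
budget = [
    {"TopicCode": "Natural Resources and Environment", "Year": 2002, "Punc": False},
    {"TopicCode": "Natural Resources and Environment", "Year": 2003, "Punc": True},
    {"TopicCode": "Natural Resources and Environment", "Year": 2004, "Punc": False},
]


def link_with_lag(budget_rows, attention_by_key, lag=LAG_YEARS):
    """Attach the attention variables from `lag` years earlier to each budget row."""
    linked = []
    for row in budget_rows:
        key = (row["TopicCode"], row["Year"] - lag)
        lagged = attention_by_key.get(key)
        if lagged is None:
            continue  # drop rows without attention data for the lagged year
        linked.append({**row, **lagged})
    return linked


combined = link_with_lag(budget, attention)
for row in combined:
    print(row["Year"], row["Punc"], row["sou"])
```

In this sketch, budget rows without attention data for the lagged year are dropped; whether the original analysis handles such rows the same way is not stated in the text.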
The diagonal panels show histograms of the variables with density plots. The other panels show pairwise comparisons of the variables. In the lower left panels, the variables are plotted against each other with a linear regression fit. In the upper right panels, the correlation coefficient of each pairwise comparison is reported. For example, there is a correlation of 0.66 between congress and eo.⁶ Figure 3 also conveys the complexity of the dataset. We find a combination of categorical and numerical variables, of which the latter do not seem to follow a normal distribution (see histograms). There are no strong correlations with the response variable Punc, but some of the predictor variables are highly correlated with each other. These features would make an analysis with conventional methods difficult. This data is now the starting point for the task of predicting punctuations in the annual budget with machine learning algorithms.
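The pairwise correlation check reported in the upper right panels of Figure 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column values are invented, and the helper `pearson` and the threshold `HIGH_CORR` are hypothetical and not part of the original analysis.

```python
import math

# Invented toy columns; names follow the text, values are not PAP data.
columns = {
    "congress": [10, 20, 30, 40, 50],
    "eo": [1, 3, 2, 5, 4],
    "sou": [6, 2, 4, 1, 3],
}

HIGH_CORR = 0.6  # hypothetical cut-off for flagging correlated predictors


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


names = list(columns)

# Full correlation matrix, stored as a dict keyed by variable pairs.
corr = {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Each unordered pair once, keeping only strongly correlated predictors.
high_pairs = []
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if abs(corr[(a, b)]) >= HIGH_CORR:
            high_pairs.append((a, b))

for a, b in high_pairs:
    print(f"{a} ~ {b}: r = {corr[(a, b)]:.2f}")
```

Flagging such pairs before modelling makes the collinearity among predictors explicit, which is one of the features the text names as a difficulty for conventional methods.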