Journal on Policy & Complex Systems Volume 3, Issue 2 | Page 135

tween the clause and the target class is due to chance; thus, lower values of this fitness function indicate a potential association between a clause and a target class. A conjunctive clause is considered probabilistically significant, and thus worthy of being archived, if its hypergeometric PMF is less than or equal to a threshold. In addition, the definitions of N_tot and X_tot help ensure that features with more missing data are not penalized for the missing data. Features with large percentages of missing data are less likely to form probabilistically significant multivariate conjunctive clauses; however, because N_tot and X_tot are defined in this manner, as totals over non-missing values, the hypergeometric PMF value will be lower and the clause thus more likely to be archived.
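The fitness calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the standard hypergeometric PMF with population size N_tot, X_tot target cases, n_match matched houses, and x_match matched target houses, and it assumes that a house is excluded from all four counts when it is missing a value for any feature used by the clause. The feature names (`wall`, `dogs`) and the toy records are hypothetical.

```python
from math import comb

def clause_counts(houses, clause, target="infested"):
    """Count matches for one conjunctive clause, skipping houses with missing data.

    houses: list of dicts; a feature value of None means "missing".
    clause: dict mapping feature name -> set of allowed values (all must hold).
    Returns (x_match, n_match, X_tot, N_tot).
    """
    N_tot = X_tot = n_match = x_match = 0
    for house in houses:
        # Assumption: a house missing any clause feature is left out entirely,
        # so N_tot and X_tot only count non-missing observations.
        if any(house.get(f) is None for f in clause):
            continue
        is_target = 1 if house[target] else 0
        N_tot += 1
        X_tot += is_target
        if all(house[f] in allowed for f, allowed in clause.items()):
            n_match += 1
            x_match += is_target
    return x_match, n_match, X_tot, N_tot

def hypergeom_pmf(x_match, n_match, X_tot, N_tot):
    """Probability of exactly x_match target houses among n_match matches by chance."""
    if x_match < 0 or n_match < x_match or N_tot < X_tot:
        return 0.0
    return (comb(X_tot, x_match)
            * comb(N_tot - X_tot, n_match - x_match)
            / comb(N_tot, n_match))

# Hypothetical toy data: the third house is missing a value for "dogs".
houses = [
    {"wall": "adobe", "dogs": 2, "infested": True},
    {"wall": "adobe", "dogs": 1, "infested": True},
    {"wall": "adobe", "dogs": None, "infested": True},
    {"wall": "block", "dogs": 0, "infested": False},
    {"wall": "block", "dogs": 2, "infested": False},
    {"wall": "adobe", "dogs": 0, "infested": False},
]
clause = {"wall": {"adobe"}, "dogs": {1, 2}}

counts = clause_counts(houses, clause)
fitness = hypergeom_pmf(*counts)  # lower values suggest a stronger association
```

In this toy example, the house with the missing value contributes to neither N_tot nor X_tot, so the clause is evaluated only against the five houses for which it can actually be tested.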
The CCEA can have a static threshold (i.e., the threshold will not heuristically decrease), or the threshold can deterministically evolve based on the number of archived conjunctive clauses for a given conjunctive clause order. In this work, we use a static threshold. Specifically, we archive conjunctive clauses that cover at least 10% of the houses infested with T. dimidiata by setting the hypergeometric fitness threshold to the fitness of a conjunctive clause that has 100% accuracy and 10% coverage of infested houses. Accuracy is defined as x_match / n_match and is analogous to the true positive rate of the conjunctive clause. Infested house coverage is the number of times a sampled conjunctive clause is associated with a target outcome over the total number of target outcomes in the dataset, x_match / X_tot. Note: While the accuracy and infested house coverage are related to the hypergeometric PMF, both are used as descriptive terms to show the expected true positive rate and generality of a conjunctive clause, respectively (this is akin to a more descriptive odds ratio that is often produced when performing logistic regression). If only a few conjunctive clauses are archived, we risk that the archived signals contain large amounts of noise and are subject to overfitting. As mentioned above, the CCEA used a static threshold to maintain a large population of archived conjunctive clauses. This is consistent with the concept in “Big Data” that more data can be used to find patterns of correlations with a desired output (true signal) (Mayer-Schönberger & Cukier, 2014).
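As a rough illustration of how such a static threshold and the two descriptive measures might be computed, consider the sketch below. It assumes the standard hypergeometric PMF and assumes that "10% coverage" is rounded up to a whole number of infested houses; neither the rounding rule nor the function names come from the original work, and the dataset sizes used in the example are made up.

```python
from math import comb

def hypergeom_pmf(x_match, n_match, X_tot, N_tot):
    """Probability of exactly x_match target houses among n_match matches by chance."""
    if x_match < 0 or n_match < x_match or N_tot < X_tot:
        return 0.0
    return (comb(X_tot, x_match)
            * comb(N_tot - X_tot, n_match - x_match)
            / comb(N_tot, n_match))

def accuracy(x_match, n_match):
    """Fraction of houses matched by the clause that are infested."""
    return x_match / n_match

def coverage(x_match, X_tot):
    """Fraction of all infested houses that the clause matches."""
    return x_match / X_tot

def static_threshold(X_tot, N_tot, coverage_pct=10):
    """Fitness of a reference clause with 100% accuracy and coverage_pct coverage.

    Assumption: the required number of covered houses is rounded up
    (integer ceiling of coverage_pct percent of X_tot).
    """
    x_min = (coverage_pct * X_tot + 99) // 100
    # 100% accuracy means every matched house is infested: n_match == x_match.
    return hypergeom_pmf(x_min, x_min, X_tot, N_tot)

def should_archive(x_match, n_match, X_tot, N_tot, threshold):
    """Archive a clause when its hypergeometric PMF is at or below the threshold."""
    return hypergeom_pmf(x_match, n_match, X_tot, N_tot) <= threshold

# Made-up dataset sizes: 100 houses with non-missing data, 20 of them infested.
threshold = static_threshold(X_tot=20, N_tot=100)
strong = should_archive(5, 5, 20, 100, threshold)  # 5 matches, all infested
weak = should_archive(1, 1, 20, 100, threshold)    # a single infested match
```

Because the threshold is computed once from the dataset totals and never updated, it stays fixed for the whole run, which is what distinguishes the static variant from one that evolves with the size of the archive.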
The CCEA was run for five repetitions with 200 generations for each repetition for the El Chaperno, El Carrizal, and the combined datasets. The repetitions are seeded with a random number generator and provide another safeguard against the algorithm becoming trapped in one population of optima. For each dataset, we calculated the accuracy and infested house coverage of every archived conjunctive clause.
In addition to calculating the accuracy and infested house coverage, the population of archived conjunctive clauses was mined for patterns. For each repetition, the archived conjunctive clauses were analyzed on a house-by-house basis. The number of times a feature was present in a conjunctive clause that matched an infested house was calculated for all the features. These sums were then normalized between 0 and 1 for all the features. Thus, for every repetition and every dataset, there is a heat map matrix of values in [0, 1] for every infested house and every feature. If an infested house is missing data for a given feature, then the corresponding cell in the matrix is not assigned a value. For each dataset, the maximum value across all five repetitions for each infested house and feature was then used to create new matrices (one per dataset) with values represented as heat maps (e.g., panels A–C of Figure 4). Finally, for each dataset, “important” features were defined as those for which the majority of infested houses had a value greater than 0.5. For heterogeneous