WHO RIDES THE BUS: Examining Transit Ridership in Marion County

In the final phase of the analysis, validation, we examined the model’s outputs to determine whether the results reflected what we expect to see in the real world. Based on that validation, we consolidated the model’s eight groups into five so that riders were categorized more meaningfully by the other characteristics they shared. For example, the model had separated riders with commute-to-work habits into different groups based on race. We recombined them into one commuter supercluster to better describe that group of riders based on their transit habits.

Input 4: Language spoken at home

Census trends [14] and public polling [15] suggest that the foreign-born population uses transit two to three times as often as the native-born population. Indianapolis is home to a diverse array of communities, so we used language spoken at home as a potential proxy for nativity. We again created a boolean field for this variable, with 1 indicating a home language of English and 0 indicating a home language other than English.

The raw data supplied by IndyGo also included a linked weight value, which we applied in our analysis to generalize to the overall ridership for the area. This value corrects for the oversampling of certain routes and is necessary to extrapolate meaningful results about total ridership from the sampled survey responses.

Cluster Methodology

With the above inputs for our 3,965 records, we performed k-means clustering in Stata 15, using the Gower coefficient as the similarity measure. K-means is a useful clustering algorithm to begin with because it is simple to run and can accommodate many iterations and experiments with minimal impact on computing and processing speed. As with most clustering algorithms, many iterations are needed before meaningful clusters begin to appear. The Gower coefficient was necessary to accommodate boolean variables, which would otherwise have been treated as continuous.
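To illustrate how a Gower-style coefficient handles boolean and continuous fields together, the following is a minimal sketch in Python rather than the Stata routine used for the analysis. It assumes simple matching for boolean fields and range-scaled absolute differences for continuous fields. The records and the trips_per_week field are hypothetical and are not drawn from the IndyGo survey.

```python
def gower_similarity(a, b, is_boolean, ranges):
    """Average per-field similarity between two records (1.0 = identical)."""
    scores = []
    for x, y, boolean, rng in zip(a, b, is_boolean, ranges):
        if boolean:
            # Boolean field: a match scores 1, a mismatch scores 0.
            scores.append(1.0 if x == y else 0.0)
        elif rng == 0:
            # Constant field: every record agrees on it.
            scores.append(1.0)
        else:
            # Continuous field: absolute difference scaled by the sample range.
            scores.append(1.0 - abs(x - y) / rng)
    return sum(scores) / len(scores)


# Hypothetical records: (speaks_english_at_home, trips_per_week)
riders = [(1, 10.0), (0, 10.0), (1, 4.0), (0, 2.0)]
is_boolean = [True, False]

# Range (max - min) of each field across the sample.
ranges = [max(col) - min(col) for col in zip(*riders)]

sim_01 = gower_similarity(riders[0], riders[1], is_boolean, ranges)
sim_02 = gower_similarity(riders[0], riders[2], is_boolean, ranges)
sim_03 = gower_similarity(riders[0], riders[3], is_boolean, ranges)
print(sim_01, sim_02, sim_03)
```

In this sketch, a mismatch on the boolean field costs the same as a difference spanning the full range of a continuous field, which is the behavior that keeps a 0/1 flag from being read as a point on a numeric scale.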
About the Authors

The analysis and report were completed by Kelly Davila, Senior Research Analyst; Matt Nowlin, Research Analyst; Unai Miguel Andres, GIS Technician; and Deb Hollon, GIS Analyst. The authors would like to extend their thanks to John Marron, AICP for methodological and report review.

Once we settled on the input and the range of reasonable clusters (between three and ten groupings), we applied a stopping rule based on the Calinski-Harabasz pseudo-F statistic. This statistic provides a measure by which to judge the optimal number of groups. Based on the above, we identified eight preliminary groups of riders.
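As a rough illustration of the stopping rule, the Calinski-Harabasz pseudo-F is commonly defined as the between-cluster sum of squares divided by (k - 1), over the within-cluster sum of squares divided by (n - k), where n is the number of records and k the number of clusters. Larger values indicate more distinct groupings. The sketch below is an assumption-laden Python example using squared Euclidean distances and hypothetical toy points. It is not the Stata procedure used in the analysis, and it requires at least two clusters, more records than clusters, and some within-cluster variation.

```python
def calinski_harabasz(points, labels):
    """Pseudo-F: (between SS / (k - 1)) / (within SS / (n - k))."""
    n = len(points)
    clusters = sorted(set(labels))
    k = len(clusters)
    dims = len(points[0])

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dims)]

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    overall = mean(points)
    between = 0.0
    within = 0.0
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = mean(members)
        # Spread of cluster centroids around the overall mean.
        between += len(members) * sq_dist(centroid, overall)
        # Spread of records around their own cluster centroid.
        within += sum(sq_dist(p, centroid) for p in members)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical toy data: two tight pairs of points far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = calinski_harabasz(points, [0, 0, 1, 1])  # groups the nearby points
poor = calinski_harabasz(points, [0, 1, 0, 1])  # groups the distant points
print(good, poor)
```

Under a rule of this kind, one would compute the statistic for each candidate number of clusters in the chosen range and favor the solution with the largest value.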