WHO RIDES THE BUS: Examining Transit Ridership in Marion County
In the final phase of the analysis – validation –
we analyzed the model’s outputs to determine
if the results reflected what we expect to see
in the real world. Based on that validation
analysis, we refined the model’s eight groups
into five. We did this to categorize riders
more meaningfully by the other characteristics they
shared. For example, the model separated
riders with commute-to-work habits based on
race. We recombined them into one commuter
supercluster to better describe that group of
riders based on their transit habits.
Input 4: Language spoken at home
Census trends [14] and public polling [15]
suggest that the foreign-born population
uses transit two to three times as often as the
native-born population. Indianapolis is home
to a diverse array of communities, so we used
language spoken at home as a potential proxy
for nativity.
We again created a boolean field for this
variable, with 1 indicating a home language of
English and 0 indicating a home language
other than English.
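
As a minimal sketch of the encoding described above, the snippet below maps a home-language response to a 1/0 flag. The function name `english_at_home` and the sample responses are hypothetical; the report does not list the survey's actual field names or values.

```python
def english_at_home(home_language: str) -> int:
    """Return 1 if the home language is English, 0 otherwise."""
    return 1 if home_language.strip().lower() == "english" else 0


# Hypothetical responses, for illustration only.
responses = ["English", "Spanish", "english ", "Burmese"]
flags = [english_at_home(r) for r in responses]
print(flags)  # [1, 0, 1, 0]
```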
The raw data supplied by IndyGo also included
a linked weight value, which we applied to
our analysis to create generalizations about
the overall ridership for the area. This value
controls for oversampling certain routes, and
is necessary to extrapolate meaningful results
about ridership in its totality from the sampled
responses to the survey.
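
To illustrate why the weight matters, here is a small sketch that compares an unweighted share with a weighted one. The helper `weighted_share` and all values are assumptions made for illustration; they are not IndyGo's data or its weighting scheme.

```python
def weighted_share(flags, weights):
    """Weighted proportion of records whose flag equals 1."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(f * w for f, w in zip(flags, weights)) / total


# Hypothetical sample: 1 = English spoken at home, 0 = other language.
flags = [1, 0, 1, 0]
# Hypothetical weights: a larger weight means the response stands in
# for more riders (for example, a route that was undersampled).
weights = [2.0, 1.0, 1.0, 4.0]

unweighted = sum(flags) / len(flags)      # 0.5
weighted = weighted_share(flags, weights)  # 3.0 / 8.0 = 0.375
print(unweighted, weighted)
```

In this made-up sample, half of the respondents speak English at home, but after weighting the estimated share among all riders drops to 37.5 percent.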
Cluster Methodology
With the above inputs for our 3,965 records,
we performed k-means clustering in Stata 15,
using the Gower coefficient as the similarity
measure. K-means is a useful clustering
algorithm to begin with because it is simple
to run and can accommodate many iterations
and experiments with minimal impact
on computing and processing speed. As with
most clustering algorithms, many iterations
are needed before meaningful clusters
begin to appear. The Gower coefficient was
necessary to accommodate boolean variables,
which would otherwise have been treated as
continuous.
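
The sketch below shows one common formulation of Gower's dissimilarity for a mix of boolean and continuous variables: boolean variables contribute 0 when two records match and 1 when they differ, while continuous variables contribute the absolute difference divided by the variable's range. It is written in Python for illustration only; it is not the Stata routine used in the analysis, and the rider records and variable choices are hypothetical.

```python
def gower_dissimilarity(a, b, is_binary, ranges):
    """Average per-variable dissimilarity between two records.

    is_binary[i] marks variable i as boolean (simple match/mismatch);
    ranges[i] is the observed max minus min for a continuous variable.
    """
    total = 0.0
    for x, y, binary, rng in zip(a, b, is_binary, ranges):
        if binary:
            # Boolean variables: 0 if the values match, 1 if they differ.
            total += 0.0 if x == y else 1.0
        else:
            # Continuous variables: absolute difference scaled to [0, 1].
            total += abs(x - y) / rng if rng else 0.0
    return total / len(a)


# Hypothetical records: (boolean flag, boolean flag, continuous value).
rider_a = (1, 0, 10.0)
rider_b = (0, 0, 4.0)
is_binary = (True, True, False)
ranges = (None, None, 12.0)  # a range is only needed for the continuous variable

d = gower_dissimilarity(rider_a, rider_b, is_binary, ranges)
print(d)  # (1 + 0 + 6/12) / 3 = 0.5
```

Because each variable contributes a value between 0 and 1, a boolean mismatch is counted as a full disagreement rather than as a numeric distance on a continuous scale.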
About the Authors
The analysis and report were completed by
Kelly Davila, Senior Research Analyst; Matt
Nowlin, Research Analyst; Unai Miguel Andres,
GIS Technician; and Deb Hollon, GIS Analyst.
The authors would like to extend their thanks
to John Marron, AICP for methodological and
report review.
Once we settled on the inputs and the range
of reasonable clusters (between three and ten
groupings), we applied a stopping rule based
on the Calinski-Harabasz pseudo-F statistic.
This statistic provides a measure by which to
judge the optimal number of groups. Based
on this statistic, we identified eight preliminary
groups of riders.
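
For readers unfamiliar with the pseudo-F statistic, the sketch below computes it in its textbook form: the between-cluster sum of squares divided by (k - 1), over the within-cluster sum of squares divided by (n - k), where k is the number of clusters and n the number of records. Larger values indicate tighter, better-separated groups. This version assumes squared Euclidean distances and uses made-up points; it is not a reproduction of the Stata computation used in the analysis.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz pseudo-F for points (tuples) and cluster labels."""
    n = len(points)
    clusters = sorted(set(labels))
    k = len(clusters)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 clusters and more points than clusters")
    dim = len(points[0])
    overall = [sum(p[d] for p in points) / n for d in range(dim)]

    between = 0.0  # between-cluster sum of squares
    within = 0.0   # within-cluster sum of squares
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between += len(members) * sum(
            (centroid[d] - overall[d]) ** 2 for d in range(dim)
        )
        within += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))

    if within == 0:
        return float("inf")  # perfectly compact clusters
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical one-dimensional data with two obvious groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = calinski_harabasz(points, [0, 0, 1, 1])  # natural grouping
bad = calinski_harabasz(points, [0, 1, 0, 1])   # groups mixed together
print(good, bad)
```

A stopping rule of this kind would compute the statistic for each candidate number of clusters in the chosen range and favor the solution with the largest value.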