Analytics Magazine, November/December 2014 - Page 30

Goal-Driven Analytics

organizing, extracting and even visualizing rapid streams of data is essentially a cost-center activity. Only when content of value is operationalized into active decisioning and measured for impact will big data’s liability be converted into an intelligence asset. Big data’s recovery up the Hype Cycle [1] “Slope of Enlightenment” will come in the form of actionable analytics for automated decision-making at the operational level and proactive recommendations at the strategic level.

Size and Success Don’t Correlate

Big data enthusiasts are finding that the more data they collect, the harder it becomes to understand just what the data is telling them. And most practitioners are surprised to learn how little data is required to build a highly effective goal-driven model. What matters is not having a lot of data, but having a valid sample of data that supports the target objective.

For advanced analytics, it is far more important for a database to be wide in attributes or variables than long in transactions. Thanks to big data innovations, more variables are being collected than ever before. In fact, data dictionaries are starting to be turned on their side to allow vertical scrolling through a growing number of attributes.

Only variables that have no relationship to the target objective should be excluded. A development model will automatically rank the limited set of variables that have predictive value toward the objective. The remainder can be eliminated from the final model and potentially from the analytic sandbox.

Developing the model requires only enough transactional data to adequately represent the solution space for the application at hand. Standard rules of thumb, based on the number of attributes (the dimensionality) of the final model, suggest how many records or transactions are needed to derive the train, test and validation data sets for model development.
Most often, this range is from 5,000 to 250,000 records, a mere quark in the vast universe of big data. But without a use plan for their data, companies feel at risk if they do not harvest all possible data. This digital hoarding overwhelms analysis and motivates strategies for deriving streamlined analytic sandboxes. The sandboxes draw targeted data for goal-driven model development from the vast stores of useless “dark data.”

One other consideration in limiting data for more streamlined analytics is to start with available structured data. In most organizations, structured data holds far more predictive value and requires far less preparation labor than open text.
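
The idea of a wide but short data set can be made concrete with a small sketch. It is an illustration under stated assumptions, not the method the article has in mind: the data is synthetic, absolute Pearson correlation stands in for whatever ranking a real modeling tool applies, and the names `rank_variables` and `split_records`, the 0.1 cutoff and the 60/20/20 split are all hypothetical choices. It builds 5,000 records with 20 variables, ranks the variables against the target, keeps the ones with a measurable relationship, and splits the records into train, test and validation sets.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


def rank_variables(rows, target, names):
    """Rank variables by absolute correlation with the target, strongest first."""
    scores = {
        name: abs(pearson([row[name] for row in rows], target))
        for name in names
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def split_records(rows, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle a copy of the records and split it into train, test and
    validation sets. The 60/20/20 ratio is an illustrative assumption."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_test = round(len(shuffled) * fractions[1])
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],
    )


# Synthetic "wide" data: 5,000 records and 20 variables,
# only two of which (var_3 and var_7) actually drive the target.
rng = random.Random(42)
names = [f"var_{i}" for i in range(20)]
rows = [{name: rng.gauss(0, 1) for name in names} for _ in range(5000)]
target = [2.0 * r["var_3"] - 1.5 * r["var_7"] + rng.gauss(0, 1) for r in rows]

ranking = rank_variables(rows, target, names)

# Keep only variables with a measurable relationship to the target.
# The 0.1 cutoff is an assumption chosen for this synthetic example.
kept = [name for name, score in ranking if score >= 0.1]

train, test, validation = split_records(rows)
print(kept)
print(len(train), len(test), len(validation))
```

In this toy setting the 18 unrelated variables score near zero and drop out, which mirrors the article's point that the final model rests on a limited set of variables and a modest number of records.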