
for use when it comes out of the cloud. The basic checks that we take for granted in an RDBMS environment are lacking. As data moves in and out of the cloud, data quality frequently deteriorates and loses its trustworthiness due to:

• Inability to validate (and alert) on any bad/missing data sources day over day
• Inability to enforce data constraints, such as allowing duplicates: primary key violations
• Multiple data sources sending data to the cloud that get out of sync over time
• Structural changes to data in upstream processes not expected by the cloud
• Presence of multiple IT platforms (Hadoop, DW, cloud)

Faulty processes, ad hoc data policies, poor discipline in capturing and storing data, and a lack of control over some data sources all contribute to data inconsistencies between the cloud and on-premises systems.

Data quality is based on a range of metrics. These metrics vary depending on the industry and the use for the data. The common best practice for ensuring data integrity is to check for the six core dimensions of data quality:

• Completeness - Are all data sets and data items recorded in their entirety?
• Uniqueness - Are there any duplicates?
• Timeliness - The degree to which data represents reality from the required point in time.
• Validity - Does the data match the rules? A data set is valid if it conforms to the syntax (format, type, range) of its definition.
• Accuracy/Reasonableness - Does the data reflect the real world? The degree to which data correctly describes the “real world” object or event being described.
• Consistency - The absence of difference when comparing two or more representations of the same data between different data sets. This can also include the relationships between different variables.

The traditional approach by developers to ensure data quality is to follow these steps linearly:

• Prognosticate points of failure (expected threats)
• Code to mitigate expected threats
• Test the code
• Detect new points of failure that were unexpected
• Put code in production
• New coding for unexpected failures… testing… production
• Maintain and update the rules for relevancy and accuracy

The data threats that are the most damaging are usually unexpected ones that are not mitigated through proactive programming. The biggest challenge of data validation is the ability to create and maintain thousands of data quality rules that constantly evolve over time. Organizations need to establish a Data Quality Validation framework that lends itself to the authorization of large and complex data flows across multiple platforms. Right now, that onerous undertaking is a tedious, labor-intensive process, prone to human error. As data flows from many different sources, interrelationships and validations become intricate and complex. As a result, unexpected errors increase exponentially.

The current data quality approaches are reasonably suitable for mitigating “expected threats.” However, they are not scalable or sustainable. They do not work when data hops across multiple platforms, and they are definitely not suitable for a big data/cloud initiative. Instead of retrofitting existing solutions to solve big data quality issues, organizations must choose an intelligent data validation solution, one that will continue to learn autonomously.

A New Paradigm of Data Quality

The next evolution in ensuring data quality for big data in the cloud must, at the minimum, satisfy these needs:

• Handle massive data volumes with a powerful