underlying engine. (Even the Big Data editions
of the major ETL solutions cannot do this yet.)
• Users must be able to set up data quality tests
with a minimum number of clicks.
• Tools must look for threats beyond those for which they have been programmed. They must autonomously learn the gamut of data quality rules, specific to the dataset, using self-learning algorithms.
• Results of quality indicators must be translatable into relevant metrics for different stakeholders, including executives, team leaders and data quality owners.
With cloud adoption accelerating to operationalize mission-critical functions and increase the use of third-party data, there is an urgent need for a systematic approach to ensuring the accuracy, completeness and quality of data used by cloud applications. Without the appropriate level of data quality checks, organizations can run into regulatory or operational issues that negate the potential benefits of the cloud platform. For clients on AWS specifically, we recommend the following types of checks:
• Input Data Quality Checks: When data lands in Amazon S3, whether from on-premises or third-party systems, autonomous data quality checks should flag duplicate records and files, anomalous records, incomplete records, and structural and semantic data drift.
• Data Completeness across On-Premises, S3, Amazon EMR and Redshift: Ensure that no record is lost as data moves from the on-premises system to the landing zone (S3), to the processing application (EMR), and finally to the warehousing system (Redshift).
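The two checks above can be sketched in a few lines of Python. The record structures and stage names here are illustrative assumptions, not a prescription; a production implementation would pull counts and hashes from S3, EMR and Redshift directly rather than from in-memory lists.

```python
import hashlib

def flag_duplicates(records):
    """Input data quality check: flag records whose content hash
    has already been seen among the landed data."""
    seen, dupes = set(), []
    for i, rec in enumerate(records):
        digest = hashlib.sha256(repr(rec).encode()).hexdigest()
        if digest in seen:
            dupes.append(i)      # index of the duplicate record
        else:
            seen.add(digest)
    return dupes

def reconcile_counts(stage_counts):
    """Completeness check: compare record counts between consecutive
    pipeline stages, e.g. on-prem -> S3 -> EMR -> Redshift.
    Returns (from_stage, to_stage, records_lost) for each drop."""
    losses = []
    for (a, ca), (b, cb) in zip(stage_counts, stage_counts[1:]):
        if cb < ca:
            losses.append((a, b, ca - cb))
    return losses
```

For example, `reconcile_counts([("on-prem", 100), ("s3", 100), ("emr", 98), ("redshift", 98)])` would report two records lost between S3 and EMR, pinpointing which hop in the pipeline needs investigation.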
Before any major analytical deployment in the cloud,
your organization should work to define the key stan-
dards for data quality to ensure your analysts are
effective with the data available to them. The follow-
ing tools can help to ensure your data maintains high
quality and improves over time:
• DataBuck from FirstEigen is an autonomous, self-learning, big data and cloud data quality validation and reconciliation tool. It validates the integrity of data and reconciles cloud data with the on-premises source, while enforcing constraints by filtering out bad data and sending alerts to the appropriate people.
• Informatica offers a full suite of Data as a Service (DaaS) products, including data management platforms, business rules defined once and applied across platforms, contact record verification and data enrichment services.
• IBM offers data quality solutions (Infosphere,
BigInsights - BigQuality) that enable users to
cleanse, assess and monitor data quality and
maintain consistent views of key entities.
• SAS combines decades of data quality experience to provide users with tools that make it easy to identify problems, preview data and set up repeatable processes in a single management view across multiple sources.
Data quality tools in the cloud must, first, be deployed to scale as data volumes scale and, second, support a variety of integration methods so that multiple analysis tools can work across the same data set.
The cloud provides new opportunities for increased efficiency and agility in big data storage and analysis, but to realize that promise your organization must ensure that data integrity is maintained throughout the process.
About the Authors
Seth Rao and Amit Dutta are the CEO and CTO of
FirstEigen, a data validation and analytics company
based in Chicago. Their focus is on leveraging machine
learning in data quality tools to make the process
autonomous with minimal configuration and human
involvement.
SPRING 2017 | THE DOPPLER | 33