underlying engine. (Even the Big Data editions
of the major ETL solutions cannot do this yet.)
• Users must be able to set up data quality tests
with a minimum number of clicks.
• Tools must look for threats beyond those for which they have been programmed. They must autonomously learn the gamut of data quality rules, specific to the dataset, using self-learning algorithms.
• Results of quality indicators must be translatable into relevant metrics for different stakeholders, including executives, team leaders and data quality owners.
With cloud adoption accelerating to operationalize mission-critical functions and increase the use of third-party data, there is an urgent need for a systematic approach to ensuring the accuracy, completeness and quality of data used by cloud applications. Without the appropriate level of data quality checks, organizations can run into regulatory or operational issues that negate the potential benefits of the cloud platform. For clients on AWS specifically, we recommend the following types of checks:
• Input Data Quality Checks: When data lands in Amazon S3, whether from on-premises or third-party systems, autonomous data quality checks should flag duplicate records and files, anomalous records, incomplete records, and structural and semantic data drift.
• Data Completeness across On-Premises, S3, Amazon EMR and Redshift: Ensure that no record is lost as data moves from the on-premises system to the landing zone (S3), to the processing application (EMR), and finally to the warehousing system (Redshift).
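The two checks above can be sketched in a few lines of Python. The record structures and stage names here are illustrative assumptions, not a prescription; a production implementation would pull counts and hashes from S3, EMR and Redshift directly rather than from in-memory lists.

```python
import hashlib

def flag_duplicates(records):
    """Input data quality check: flag records whose content hash
    has already been seen among the landed data."""
    seen, dupes = set(), []
    for i, rec in enumerate(records):
        digest = hashlib.sha256(repr(rec).encode()).hexdigest()
        if digest in seen:
            dupes.append(i)      # index of the duplicate record
        else:
            seen.add(digest)
    return dupes

def reconcile_counts(stage_counts):
    """Completeness check: compare record counts between consecutive
    pipeline stages, e.g. on-prem -> S3 -> EMR -> Redshift.
    Returns (from_stage, to_stage, records_lost) for each drop."""
    losses = []
    for (a, ca), (b, cb) in zip(stage_counts, stage_counts[1:]):
        if cb < ca:
            losses.append((a, b, ca - cb))
    return losses
```

For example, `reconcile_counts([("on-prem", 100), ("s3", 100), ("emr", 98), ("redshift", 98)])` would report two records lost between S3 and EMR, pinpointing which hop in the pipeline needs investigation.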
Before any major analytical deployment in the cloud,
your organization should work to define the key stan-
dards for data quality to ensure your analysts are
effective with the data available to them. The follow-
ing tools can help to ensure your data maintains high
quality and improves over time:
• DataBuck from FirstEigen is an autonomous, self-learning, big data and cloud data quality validation and reconciliation tool. It validates the integrity of data and reconciles cloud data with the on-premises source, while enforcing constraints by filtering out bad data and sending alerts to the appropriate people.
• Informatica offers a full suite of Data as a Service (DaaS) products, including data management platforms, business rules defined once and applied across platforms, contact record verification and data enrichment services.
• IBM offers data quality solutions (Infosphere,
BigInsights - BigQuality) that enable users to
cleanse, assess and monitor data quality and
maintain consistent views of key entities.
• SAS combines decades of data quality experience to provide users with tools that make it easy to identify problems, preview data and set up repeatable processes in a single management view across multiple sources.
Data quality tools in the cloud must, first, be deployed to scale as data volumes scale and, second, support a variety of integration methods so that multiple analysis tools can work across the same data set.
The cloud provides new opportunities for increased efficiency and agility in big data storage and analysis, but to realize that promise your organization must ensure that data integrity is maintained throughout the process.
About the Authors
Seth Rao and Amit Dutta are the CEO and CTO of
FirstEigen, a data validation and analytics company
based in Chicago. Their focus is on leveraging machine
learning in data quality tools to make the process
autonomous with minimal configuration and human
involvement.
SPRING 2017 | THE DOPPLER | 33