for use when it comes out of the cloud. The basic
checks that we take for granted in an RDBMS envi-
ronment are lacking. As data moves in and out of the
cloud, data quality frequently deteriorates and loses
its trustworthiness due to:
• Inability to validate (and alert) on bad or missing data sources day over day (a minimal sketch of such a check follows this list)
• Inability to enforce data constraints, such as allowing duplicates: a primary key violation!
• Multiple data sources sending data to the cloud that get out of sync over time
• Structural changes to data in upstream processes not expected by the cloud
• Presence of multiple IT platforms (Hadoop, DW, cloud)
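To make the first of these gaps concrete, below is a minimal sketch of a day-over-day volume check with alerting. The shape of the row-count data and the 50 percent drop threshold are illustrative assumptions, not features of any particular product.

```python
from datetime import date, timedelta

# Illustrative sketch only: alert when a feed sends no data for a day,
# or when its volume collapses versus the prior day. The `counts`
# structure (day -> row count) and the 0.5 threshold are assumptions.
def check_feed(counts: dict, today: date, drop_threshold: float = 0.5) -> list:
    alerts = []
    yesterday = today - timedelta(days=1)
    if counts.get(today, 0) == 0:
        alerts.append(f"{today}: no data received")
    elif yesterday in counts and counts[today] < counts[yesterday] * drop_threshold:
        alerts.append(f"{today}: volume fell from {counts[yesterday]} to {counts[today]}")
    return alerts

counts = {date(2017, 3, 1): 10_000, date(2017, 3, 2): 3_000}
print(check_feed(counts, date(2017, 3, 2)))  # flags the 70% volume drop
```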
Faulty processes, ad hoc data policies, poor discipline in capturing and storing data, and a lack of control over some data sources all contribute to data inconsistencies between the cloud and on-premises systems.
Data quality is based on a range of metrics, and those metrics vary depending on the industry and the intended use of the data. The common best practice for ensuring data integrity is to check for the six core dimensions of data quality (a minimal sketch of these checks in code follows the list):
• Completeness - Are all data sets and data items recorded in their entirety?
• Uniqueness - Are there any duplicates?
• Timeliness - The degree to which data repre-
sent reality from the required point in time.
• Validity - Does the data match the rules? A data
set is valid if it conforms to the syntax (format,
type, range) of its definition.
• Accuracy/Reasonableness - Does the data reflect reality? The degree to which data correctly describes the “real world” object or event being described.
• Consistency - The absence of difference when comparing two or more representations of the same data between different data sets. This can also include the relationships between different variables.
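As a concrete illustration, here is a minimal sketch of what the first few of these checks can look like in code, using Python and pandas. The table, column names and rules are invented for the example; accuracy, consistency and timeliness additionally need a reference point, such as a second system's copy of the same data or an expected arrival time, so they are noted rather than shown.

```python
import pandas as pd

# Invented example table; `order_id` is assumed to be the primary key.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount":   [9.99, -5.00, 12.50, None],
    "country":  ["US", "DE", "DE", "US"],
})

# Completeness: are all required fields populated?
missing = orders["amount"].isna().sum()

# Uniqueness: are there duplicate primary keys?
dupes = orders["order_id"].duplicated().sum()

# Validity: does each value conform to its format/type/range rules?
bad_amount = (orders["amount"] < 0).sum()
bad_country = (~orders["country"].isin(["US", "DE", "FR"])).sum()

# Accuracy, consistency and timeliness would compare these values
# against a reference system or an expected arrival schedule.
print(f"missing={missing} dupes={dupes} "
      f"bad_amount={bad_amount} bad_country={bad_country}")
# -> missing=1 dupes=1 bad_amount=1 bad_country=0
```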
The traditional approach by developers to ensure data quality is to follow these steps linearly (a sketch of the resulting code follows the list):
• Prognosticate points of failure (expected threats)
• Code to mitigate the expected threats
• Test the code
• Detect new points of failure that were unexpected
• Put the code in production
• New coding for the unexpected failures… testing… production
• Maintain and update the rules for relevancy and accuracy
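The brittleness of that cycle is easiest to see in code. A hand-coded check like the hypothetical one below guards only against the threats its author prognosticated; every unexpected failure sends the team back through the same code-test-deploy loop.

```python
# Hypothetical hand-coded rule produced by the "code to mitigate" step:
# it catches exactly the failures the developer anticipated, nothing else.
def validate_record(record: dict) -> list:
    errors = []
    if record.get("customer_id") is None:
        errors.append("missing customer_id")      # expected threat #1
    if not str(record.get("zip", "")).isdigit():
        errors.append("non-numeric zip")          # expected threat #2
    # An unexpected upstream change, say renaming `zip` to `postal_code`,
    # slips through untouched until someone notices bad data downstream,
    # and the whole code-test-deploy cycle starts again.
    return errors

print(validate_record({"customer_id": 42, "zip": "94105"}))  # [] -> passes
```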
The most damaging data threats are usually the unexpected ones that proactive programming has not mitigated. The biggest challenge of data validation is creating and maintaining the thousands of data quality rules that must constantly evolve over time. Organizations need to establish a Data Quality Validation framework that lends itself to the validation of large and complex data flows across multiple platforms. Today, that onerous undertaking is a tedious, labor-intensive process prone to human error. As data flows in from many different sources, the interrelationships and validations become intricate and complex, and unexpected errors increase exponentially.
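One way such a framework can keep thousands of rules maintainable is to express the rules as data rather than as code, so the engine applies them generically and adding a rule becomes a configuration change. The sketch below assumes pandas and invents its own rule schema; it is not any specific vendor's format.

```python
import pandas as pd

# Rules live as data rather than as hand-written code paths. The rule
# schema and the table/column names are invented for illustration.
RULES = [
    {"column": "order_id", "check": "unique"},
    {"column": "amount",   "check": "range", "min": 0, "max": 1_000_000},
    {"column": "country",  "check": "in_set", "values": {"US", "DE", "FR"}},
]

def apply_rule(df: pd.DataFrame, rule: dict) -> bool:
    """Return True if the DataFrame passes the declaratively defined rule."""
    col = df[rule["column"]]
    if rule["check"] == "unique":
        return not col.duplicated().any()
    if rule["check"] == "range":
        return col.between(rule["min"], rule["max"]).all()
    if rule["check"] == "in_set":
        return col.isin(rule["values"]).all()
    raise ValueError(f"unknown check type: {rule['check']}")

orders = pd.DataFrame({"order_id": [1, 2, 3],
                       "amount": [9.99, 12.50, 300.0],
                       "country": ["US", "DE", "FR"]})
failures = [r for r in RULES if not apply_rule(orders, r)]
print(failures)  # [] -> all rules pass
```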
Current data quality approaches are reasonably suitable for mitigating “expected threats.” However, they are neither scalable nor sustainable. They do not work when data hops across multiple platforms, and they are definitely not suitable for a big data/cloud initiative. Instead of retrofitting existing solutions to solve big data quality issues, organizations must choose an intelligent data validation solution, one that will continue to learn autonomously.
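In its simplest form, learning autonomously means a check derives its expectation from the data's own history rather than from a hand-written threshold. The toy example below flags a daily metric that falls outside three standard deviations of its recent past; real products use far richer models, but the principle is the same.

```python
import statistics

def is_anomalous(history: list, today: float, k: float = 3.0) -> bool:
    """Flag today's metric if it falls outside mean +/- k * stdev of history.

    The expectation is derived from past observations rather than
    hand-coded, so the bound adapts as the data evolves.
    """
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return abs(today - mean) > k * stdev

# e.g. daily row counts for a feed; today's collapse is flagged with no
# developer ever having written a volume threshold.
print(is_anomalous([10_200, 9_950, 10_480, 10_100, 9_870], 3_100))  # True
```

Because the bounds are recomputed from history, such a rule keeps pace with evolving data without anyone editing code.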
A New Paradigm of Data Quality
The next evolution in ensuring data quality for big data in the cloud must, at a minimum, satisfy these needs:
• Handle massive data volumes with a powerful