Managing slowly changing dimensions is one of the
most important processes in a data warehouse. The
canonical example of a slowly changing dimension is
a customer who has recently moved to a new address.
We have a record for the customer in the data ware-
house with the old address information and another
record for the same customer, with a new address,
has come in from the transactional system.
The two most popular approaches to handling such a
situation in analytical systems are called SCD (slowly
changing dimension) Type I and Type II.
A Type I strategy overwrites an existing dimensional
attribute with new information. In our example, we
will overwrite the old address with the new address
and won’t worry about keeping historical information
about the customer.
A Type II change writes a record with the new attri-
bute information and preserves a record of the old
dimensional data. So we insert a record with the new
address information, make the record the active cus-
tomer record and deactivate the previous customer
record. At any point in time, we have a complete his-
tory of the customer’s address changes.
Strategy to Implement SCD Type II in Hive
Hive and Hado op are optimized for write-once and
read-many patterns. Any design such as SCD II that
requires an update is typically not a great candidate
for such systems. But over time, features have been
added to support these scenarios.
In order to implement SCD II, we have to enable ACID
transactions in Hive. Currently, ORC is the only file
format that supports ACID transactions in Hive. As
Slowly Changing Dimensions in Hive
~200 MB
We recommended ORC as the starting point for the
most suitable file format for Apache Hive. ORC files
are divided into stripes that are independent of each
other. We can build indexes to determine which
stripes in a file need to be read for a particular query,
and the row indexes can narrow the search to a par-
ticular set of 10,000 rows for high-performance
reads. Within each stripe, the columns are separated
from each other so the reader can read only the col-
umns that are required.
Column 1
Index Data
Column 2
Column 3
Row Data
Column 4
Column 5
Stripe Footer
Column 6
Column 7
Index Data
Column 8
Column 1
Row Data
Column 2
Stripe Footer Column 3
Index Data Column 4
Column 5
Row Data
Column 6
Column 7
Stripe Footer
Column 8
File Footer
Postscript
Figure 3: ORC File Structure
ORC is a write-once file format, changes are imple-
mented using base files and delta files where insert,
update, and delete operations are recorded. When
the number of deltas exceeds a threshold, a minor
compaction will automatically run and merge a set of
changes into a single delta. When these compacted
deltas get large enough, a major compaction will
rewrite the base to incorporate these larger deltas.
Hive Security and Ranger
Apache Hive currently provides two methods of
authorization: storage-based authorization and SQL
standard authorization. SQL standard authorization
provides grant/revoke functionality at database and
table level, using commands that would be familiar to
a DBA. If more detailed control is needed, we can use
Apache Ranger, which provides more granular access
control.
FALL 2017 | THE DOPPLER | 27