The Doppler Quarterly Fall 2017

Managing slowly changing dimensions is one of the most important processes in a data warehouse. The canonical example of a slowly changing dimension is a customer who has recently moved to a new address. We have a record for the customer in the data ware- house with the old address information and another record for the same customer, with a new address, has come in from the transactional system. The two most popular approaches to handling such a situation in analytical systems are called SCD (slowly changing dimension) Type I and Type II. A Type I strategy overwrites an existing dimensional attribute with new information. In our example, we will overwrite the old address with the new address and won’t worry about keeping historical information about the customer. A Type II change writes a record with the new attri- bute information and preserves a record of the old dimensional data. So we insert a record with the new address information, make the record the active cus- tomer record and deactivate the previous customer record. At any point in time, we have a complete his- tory of the customer’s address changes. Strategy to Implement SCD Type II in Hive Hive and Hado op are optimized for write-once and read-many patterns. Any design such as SCD II that requires an update is typically not a great candidate for such systems. But over time, features have been added to support these scenarios. In order to implement SCD II, we have to enable ACID transactions in Hive. Currently, ORC is the only file format that supports ACID transactions in Hive. As Slowly Changing Dimensions in Hive ~200 MB We recommended ORC as the starting point for the most suitable file format for Apache Hive. ORC files are divided into stripes that are independent of each other. We can build indexes to determine which stripes in a file need to be read for a particular query, and the row indexes can narrow the search to a par- ticular set of 10,000 rows for high-performance reads. Within each stripe, the columns are separated from each other so the reader can read only the col- umns that are required. Column 1 Index Data Column 2 Column 3 Row Data Column 4 Column 5 Stripe Footer Column 6 Column 7 Index Data Column 8 Column 1 Row Data Column 2 Stripe Footer Column 3 Index Data Column 4 Column 5 Row Data Column 6 Column 7 Stripe Footer Column 8 File Footer Postscript Figure 3: ORC File Structure ORC is a write-once file format, changes are imple- mented using base files and delta files where insert, update, and delete operations are recorded. When the number of deltas exceeds a threshold, a minor compaction will automatically run and merge a set of changes into a single delta. When these compacted deltas get large enough, a major compaction will rewrite the base to incorporate these larger deltas. Hive Security and Ranger Apache Hive currently provides two methods of authorization: storage-based authorization and SQL standard authorization. SQL standard authorization provides grant/revoke functionality at database and table level, using commands that would be familiar to a DBA. If more detailed control is needed, we can use Apache Ranger, which provides more granular access control. FALL 2017 | THE DOPPLER | 27

The Doppler Quarterly Fall 2017 | Page 29