The Doppler Quarterly Fall 2017

Software Selection in EMR Cluster and Apache Tez

You can choose whether or not to install Apache Tez on an EMR cluster along with Apache Hive. Traditionally, Hive processing is done through the MapReduce execution engine, which has to keep writing data back to disk while traversing a computational graph. This incurs a performance penalty for disk I/O. In Apache Tez, intermediate data is passed directly to the next node in the computation graph, which avoids writing intermediate results to HDFS between stages. If you install Apache Tez along with Hive, Tez becomes the default execution engine. We recommend using Tez with Hive because Tez will generally improve Hive query performance. We will discuss Apache Tez more in the performance optimization section of this article.

Hive Data Storage Considerations

The recommended best practice for data storage in an Apache Hive implementation on AWS is S3, with Hive tables built on top of the S3 data files. This separation of compute and storage makes transient EMR clusters possible and allows the data stored in S3 to be used for other purposes. The two most important considerations for an AWS-based Apache Hive data storage design are: 1) the Hive storage structure and 2) the storage format of the files in the S3 buckets.

Hive Storage Structure

Under the top-level S3 bucket, we should organize the data files in a folder structure that allows a query engine to optimize data access by avoiding scans of large tables (files) and by optimizing joins of multiple tables. Two strategies that we typically employ to achieve this optimization by organizing the data on S3 are: 1) Hive-based partitioning and 2) bucketing. A partition is a directory in Hive: the partition key value is stored in the partition directory name, and the partition key is a virtual column in the table. In the case of bucketing, however, each bucket is a file that holds the actual data, which is divided among the buckets on the basis of a hash algorithm.
Bucketing does not add a virtual column to the table. An optimal partitioning strategy results in faster query response through partition elimination, and bucketing results in better response through join optimization.

Hive Storage Format

Items to be considered while choosing a file format for storage include:

• Support for columnar storage
• Splittability
• Compression
• Schema evolution
• Indexing capabilities

Figure 2: Hive Partition and Clusters (an S3 bucket holding a Transactions table that is partitioned by time, with each partition clustered by product ID)
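
To make the storage structure described above concrete, here is a minimal Python sketch of how rows could map onto partition directories and bucket files. It is an illustration under stated assumptions, not Hive's implementation: the bucket name `example-bucket`, the `txn_date` and `product_id` columns, and the `000003_0`-style file names are hypothetical, and the hash is simplified to "an integer key hashes to itself". The hash Hive actually applies depends on the Hive version and the column type.

```python
from collections import defaultdict


def partition_path(table_root, partition_key, value):
    """Build a Hive-style partition directory: <root>/<key>=<value>/"""
    return f"{table_root.rstrip('/')}/{partition_key}={value}/"


def bucket_for(product_id, num_buckets):
    """Pick a bucket for an integer key.

    Simplifying assumption: the hash of an integer is the integer itself,
    masked to stay non-negative, then taken modulo the bucket count.
    """
    return (product_id & 0x7FFFFFFF) % num_buckets


def layout(rows, table_root, num_buckets):
    """Group rows by the file they would land in (partition dir + bucket file)."""
    files = defaultdict(list)
    for row in rows:
        directory = partition_path(table_root, "txn_date", row["txn_date"])
        bucket = bucket_for(row["product_id"], num_buckets)
        # Illustrative file name only; real names are chosen by the engine.
        files[f"{directory}{bucket:06d}_0"].append(row)
    return dict(files)


# Hypothetical sample data for a Transactions table.
rows = [
    {"txn_date": "2017-09-01", "product_id": 7, "amount": 19.99},
    {"txn_date": "2017-09-01", "product_id": 11, "amount": 5.00},
    {"txn_date": "2017-09-02", "product_id": 8, "amount": 42.50},
]

for path, contents in sorted(layout(rows, "s3://example-bucket/transactions", 4).items()):
    print(path, len(contents))
```

Note that the partition key (`txn_date`) appears only in the directory name, which is why it behaves as a virtual column, while the bucketing key (`product_id`) remains an ordinary column stored inside the files.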

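The two benefits named above, partition elimination and join optimization, can be sketched in a few lines of Python. This is a simplified model under assumptions, not how Hive or Tez plan queries: the directory names, the `txn_date` and `product_id` keys, and the helper functions are all hypothetical. The sketch only shows why a filter on the partition key lets the engine skip whole directories, and why two tables bucketed the same way on the join key can be joined one bucket pair at a time.

```python
def parse_partition(directory):
    """Split the last path component 'key=value' into (key, value)."""
    leaf = directory.strip("/").rpartition("/")[2]
    key, _, value = leaf.partition("=")
    return key, value


def prune(directories, key, wanted_values):
    """Partition elimination: keep only directories matching the filter."""
    wanted = set(wanted_values)
    kept = []
    for directory in directories:
        k, v = parse_partition(directory)
        if k == key and v in wanted:
            kept.append(directory)
    return kept


def bucketed_join(left_buckets, right_buckets, key):
    """Join two tables bucket by bucket.

    Assumes both tables were bucketed on `key` with the same hash and the
    same bucket count, so matching keys always share a bucket index.
    """
    if len(left_buckets) != len(right_buckets):
        raise ValueError("bucket counts must match")
    joined = []
    for left_rows, right_rows in zip(left_buckets, right_buckets):
        # Index only this bucket of the right table, never the whole table.
        index = {}
        for r in right_rows:
            index.setdefault(r[key], []).append(r)
        for l in left_rows:
            for r in index.get(l[key], []):
                joined.append({**r, **l})
    return joined


# Hypothetical partition directories for a Transactions table.
directories = [
    "s3://example-bucket/transactions/txn_date=2017-09-01/",
    "s3://example-bucket/transactions/txn_date=2017-09-02/",
    "s3://example-bucket/transactions/txn_date=2017-09-03/",
]
print(prune(directories, "txn_date", ["2017-09-02"]))
```

With three daily partitions and a filter on a single day, only one directory is read. The same idea applies to the join: each bucket pair is small enough to index independently, so no step needs the full right-hand table in memory.
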