Software Selection in EMR Cluster and Apache Tez
You can choose whether to install Apache Tez on an EMR cluster alongside Apache Hive. Traditionally, Hive queries run on the MapReduce execution engine, which writes intermediate data back to disk after each stage as it traverses the computational graph. This incurs a disk I/O performance penalty. Apache Tez instead passes intermediate data directly to the next node in the computation graph, avoiding those intermediate disk writes. If you install Apache Tez along with Hive, Tez becomes the default execution engine. We recommend using Tez with Hive because it generally improves Hive query performance. We discuss Apache Tez further in the performance optimization section of this article.
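As a rough illustration of the software selection described above, the sketch below builds the application and configuration portion of an EMR cluster request in Python. The field names follow the shape of the arguments to boto3's `run_job_flow`, and `hive.execution.engine` is the Hive property that selects the engine. The helper name `build_cluster_request`, the cluster name, and the release label are placeholders chosen for this example, and the dictionary is deliberately incomplete (instance settings and IAM roles are omitted), so treat it as a fragment rather than a working request.

```python
def build_cluster_request(name, release_label, with_tez=True):
    """Return a partial request dict for an EMR cluster running Hive.

    Hypothetical helper: it only assembles the software selection and
    Hive settings, not instances, roles, or networking.
    """
    applications = [{"Name": "Hive"}]
    configurations = []
    if with_tez:
        applications.append({"Name": "Tez"})
        # Pin the engine explicitly instead of relying on the default.
        configurations.append({
            "Classification": "hive-site",
            "Properties": {"hive.execution.engine": "tez"},
        })
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": applications,
        "Configurations": configurations,
    }


# Placeholder cluster name and release label, for illustration only.
request = build_cluster_request("hive-demo", "emr-5.8.0")
print(request["Applications"])
print(request["Configurations"])
```

Setting the engine explicitly is optional when Tez is installed, but it makes the choice visible in the cluster definition.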
Hive Data Storage Considerations
The recommended best practice for data storage in an Apache Hive implementation on AWS is S3, with Hive tables built on top of the S3 data files. This separation of compute and storage makes transient EMR clusters possible and allows the data stored in S3 to be used for other purposes. The two most important considerations for an AWS-based Apache Hive data storage design are 1) the Hive storage structure and 2) the storage format of the files in the S3 buckets.
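To make the idea of Hive tables built on top of S3 files concrete, here is a minimal Python sketch that assembles a HiveQL `CREATE EXTERNAL TABLE` statement pointing at an S3 location. The helper `external_table_ddl`, the bucket name, and the column names are assumptions made for this example and do not come from a real deployment.

```python
def external_table_ddl(table, columns, s3_location):
    """Build a HiveQL statement for an external table over S3 files.

    columns is a list of (name, hive_type) pairs.
    """
    column_list = ",\n  ".join(
        f"{name} {hive_type}" for name, hive_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {column_list}\n"
        f")\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical table layout and bucket name.
ddl = external_table_ddl(
    "transactions",
    [("transaction_id", "BIGINT"), ("product_id", "INT"), ("amount", "DOUBLE")],
    "s3://example-bucket/transactions/",
)
print(ddl)
```

Because the table is external, dropping it removes only the Hive metadata and leaves the files in S3, which is what allows a cluster to be shut down and recreated without moving the data.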
Hive Storage Structure
Under the top-level S3 bucket, we should organize the data files in a folder structure that lets a query engine optimize data access by avoiding scans of large tables (files) and by optimizing joins across multiple tables. Two strategies we typically employ to achieve this when organizing data on S3 are 1) Hive-based partitioning and 2) bucketing. A partition is a directory in Hive: the partition key value is stored in the partition directory name, and the partition key appears as a virtual column in the table. In bucketing, by contrast, each bucket is a file that holds the actual data, with rows assigned to buckets by a hash algorithm. Bucketing does not add a virtual column to the table. An optimal partitioning strategy yields faster query response through partition elimination, and bucketing yields better response through join optimization.
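The partitioning and bucketing mechanics described above can be sketched in a few lines of Python. This is a simplified model, not Hive's implementation: the function names are invented for illustration, the bucket name and the `dt` partition key are assumptions, and CRC32 is used as a stand-in for Hive's own hash function.

```python
import zlib


def partition_dir(table_root, key, value):
    """Directory for one partition: the key value is part of the name."""
    return f"{table_root.rstrip('/')}/{key}={value}/"


def bucket_index(cluster_value, num_buckets):
    """Assign a row to a bucket file by hashing the clustering column.

    CRC32 is a stand-in here; Hive applies its own hash function.
    """
    return zlib.crc32(str(cluster_value).encode("utf-8")) % num_buckets


def prune_partitions(partition_dirs, key, wanted_values):
    """Partition elimination: keep only directories a filter can match."""
    wanted = {f"{key}={value}" for value in wanted_values}
    return [
        d for d in partition_dirs
        if d.rstrip("/").rsplit("/", 1)[-1] in wanted
    ]


# Hypothetical table root and a time-based partition key named dt.
root = "s3://example-bucket/transactions"
dirs = [
    partition_dir(root, "dt", day)
    for day in ("2017-09-01", "2017-09-02", "2017-09-03")
]

# A filter on dt touches one directory instead of the whole table.
print(prune_partitions(dirs, "dt", ["2017-09-02"]))

# The same product ID always maps to the same bucket number.
print(bucket_index(1001, 4) == bucket_index(1001, 4))
```

The second property is what makes bucketed joins cheaper: if two tables are bucketed on the join key with compatible bucket counts, matching rows sit in buckets with the same number, so the engine can join bucket against bucket instead of comparing entire tables.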
Hive Storage Format
Items to be considered while choosing a file format
for storage include:
• Support for columnar storage
• Splittability
• Compression
• Schema evolution
• Indexing capabilities
Figure 2: Hive Partition and Clusters (an S3 bucket holds the Transactions table, which is partitioned by time, with each partition clustered by product ID)
26 | THE DOPPLER | FALL 2017