
[Figure 2: Partitioned Object Storage with Hive Clustering — diagram labels: Raw Layer (data in nearly original form, no transformations); Derivations/Calculations Added; Metastore Added; Consumption]
Layering approaches such as these can enhance the access patterns depicted in Figure 2. Much more could be written about this one example; suffice it to say that many additional layering approaches can be implemented, depending on the desired consumption patterns.
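To make the idea concrete, the sketch below prints the kind of Hive-style partitioned key layout that Figure 2 depicts; the bucket name and partition fields are illustrative assumptions, not taken from the figure:

```python
# Hypothetical Hive-style partitioned layout for object storage, in the spirit
# of Figure 2. The bucket name and partition fields are illustrative assumptions.
base = "s3://example-lake/curated/events"
for year, month in [(2017, 5), (2017, 6)]:
    print(f"{base}/year={year}/month={month:02d}/part-00000.orc")
# Output:
# s3://example-lake/curated/events/year=2017/month=05/part-00000.orc
# s3://example-lake/curated/events/year=2017/month=06/part-00000.orc
```

Query engines such as Hive, Presto, or Spark can prune entire partitions by matching these key prefixes against predicates on year and month, so a query for one month never touches the others' objects.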
Choose File Format
Introduction
People coming from the traditional RDBMS world are often surprised at the extraordinary amount of control that we, as architects of data lakes, have over exactly how to store data. We, rather than an RDBMS storage engine, get to determine an array of elements such as file sizes, type of storage (row vs. columnar), degree of compression, indexing, schemas, and block sizes. These choices are exposed through the Hadoop-oriented ecosystem of tools commonly used for accessing data in a lake.
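Here is a minimal PySpark sketch of how those choices are expressed when writing to a lake; the bucket, paths, column name, and option values are illustrative assumptions rather than anything from this article:

```python
# Minimal sketch: the writer, not a storage engine, picks the file format,
# compression codec, file count/size, and partition layout.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-writer").getOrCreate()

df = spark.read.json("s3a://example-lake/raw/events/")  # hypothetical source

(df
 .repartition(64)                  # controls the number (and size) of output files
 .write
 .mode("overwrite")
 .format("orc")                    # row vs. columnar: orc, parquet, avro, csv...
 .option("compression", "zlib")    # degree of compression: none, snappy, zlib...
 .partitionBy("event_date")        # directory-level partitioning for pruning
 .save("s3a://example-lake/curated/events/"))
```

Every line of that chain is a decision an RDBMS would normally make for you behind its storage engine.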
File Size
A small file is one which is significantly smaller than the Hadoop Distributed File System (HDFS) default block size of 128 MB. If we store small files, given the large data volumes of a data lake, we will end up with a very large number of files. Every file is represented as an object in the cluster's name node's memory, and each object occupies roughly 150 bytes, as a rule of thumb. So 100 million files, each using a block, would consume two objects apiece (one for the file, one for its block) and use about 30 gigabytes of memory. The takeaway here is that Hadoop ecosystem tools are not optimized for efficiently accessing small files. They are primarily designed for large files, typically an even multiple of the block size.
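The arithmetic behind that estimate, as a quick sketch:

```python
# Back-of-the-envelope name node memory estimate using the ~150-bytes-per-object
# rule of thumb. A file stored in one block costs two name node objects:
# one for the file itself and one for its block.
BYTES_PER_OBJECT = 150
num_files = 100_000_000                        # 100 million small files
objects = num_files * 2                        # file object + block object
memory_gb = objects * BYTES_PER_OBJECT / 1e9
print(f"~{memory_gb:.0f} GB of name node memory")   # ~30 GB
```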
Apache ORC
ORC is a prominent columnar file format designed for Hadoop workloads. Columnar formatting makes it possible to read, decompress, and process only the values required by the current query. While multiple columnar formats are available, many large Hadoop users have adopted ORC. For instance, Facebook uses ORC to save tens of petabytes in their data warehouse and has demonstrated that ORC is significantly faster than RCFile or Parquet. Yahoo also uses ORC to store their production data, and has likewise released some of their benchmark results.
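To illustrate the columnar benefit, here is a hedged PySpark sketch (the path and column names are assumptions): because the data is stored in ORC, Spark reads only the referenced column streams and can push the filter down to ORC's per-column statistics, skipping data that cannot match.

```python
# Minimal sketch: reading ORC with column pruning and predicate pushdown.
# Only the 'event_date', 'user_id', and 'amount' columns are read from disk;
# the equality filter can be checked against ORC's built-in column statistics.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orc-reader").getOrCreate()

events = spark.read.orc("s3a://example-lake/curated/events/")  # hypothetical path

daily_totals = (events
                .where(F.col("event_date") == "2017-06-01")   # pushed down to ORC
                .groupBy("user_id")
                .agg(F.sum("amount").alias("total_amount")))  # prunes to 3 columns

daily_totals.show()
```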
Same Data , Multiple Formats
It is quite possible that one type of storage structure and file format is optimized for a particular workload but not quite suitable for another. In situations like these, given the low cost of storage, it is actually perfectly suitable to keep multiple copies of the same data in different formats, each tuned to a particular workload.