The Doppler Quarterly Summer 2017

Figure 1: Data Lake Storage Layers

and data on the cloud, while business takes responsibility for exploring and mining it.

Design Physical Storage

The foundation of any data lake design and implementation is physical storage. The core storage layer is used for the primary data assets. Typically it will contain raw and/or lightly processed data. The key considerations when evaluating technologies for cloud-based data lake storage are the following principles and requirements:

Exceptional scalability - Because an enterprise data lake is usually intended to be the centralized data store for an entire division or the company at large, it must be capable of significant scaling without running into fixed arbitrary capacity limits.

High durability - As a primary repository of critical enterprise data, a very high durability of the core storage layer allows for excellent data robustness without resorting to extreme high-availability designs.

Support for unstructured, semi-structured and structured data - One of the primary design considerations of a data lake is the capability to store data of all types in a single repository.

Independence from fixed schema - The ability to apply schema upon read, as needed for each consumption purpose, can only be accomplished if the underlying core storage layer does not dictate a fixed schema.

Separation from compute resources - The most significant philosophical and practical advantage of cloud-based data lakes as compared to “legacy” big data storage on Hadoop is the ability to decouple storage from compute, enabling independent scaling of each.

Given the requirements, object-based stores have become the de facto choice for core data lake storage. AWS, Google and Azure all offer object storage technologies.

The point of the core storage is to centralize data of all types, with little to no schema structure imposed upon it. However, a data lake will typically have additional “layers” on top of the core storage.
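To make the "independence from fixed schema" requirement concrete, the following is a minimal sketch of schema-on-read, not a reference implementation. It assumes the raw records are JSON lines; the sample records, the two schemas (`BILLING_SCHEMA`, `MARKETING_SCHEMA`) and the helper `read_with_schema` are all hypothetical names introduced for illustration. The same untouched raw data is read twice, each time with the schema a particular consumer needs.

```python
import json

# Raw records as they might sit in the core storage layer. No schema was
# enforced at write time, so fields vary from record to record.
# (Illustrative data, not from the article.)
RAW_LINES = [
    '{"id": "1", "amount": "19.99", "region": "EU"}',
    '{"id": "2", "amount": "5", "region": "US", "coupon": "SUMMER"}',
    '{"id": "3", "region": "US"}',
]

# Hypothetical read-time schemas: field name -> (converter, default value).
BILLING_SCHEMA = {"id": (int, None), "amount": (float, 0.0)}
MARKETING_SCHEMA = {"region": (str, "unknown"), "coupon": (str, None)}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project each record onto the given schema.

    Fields missing from a record fall back to the schema's default, and
    fields not named in the schema are ignored.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }


# Two consumers read the same raw data with different schemas.
billing_rows = list(read_with_schema(RAW_LINES, BILLING_SCHEMA))
marketing_rows = list(read_with_schema(RAW_LINES, MARKETING_SCHEMA))
```

Because the schema lives with the reader rather than with the storage, adding a third consumer only requires a new schema definition; the stored records stay unchanged.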
This layering allows the raw data to be retained as essentially immutable, while the additional layers will usually have some structure added to them to assist in effective data consumption, such as reporting and analysis. Figure 1 represents additional layers being added on top of the raw storage layer. A specific example of this would be the addition of a layer defined by a Hive metastore. In a layer such as this, the files in the object store are partitioned into “directories” and files clustered by Hive are arranged within to
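As a rough illustration of the partitioned "directories" mentioned above, the sketch below builds object-key prefixes using the `column=value` path convention that Hive uses for partitioned tables. The table name `sales`, the partition columns, the sample records and all three helper functions are hypothetical; no object store or metastore is contacted, the code only shows how keys could be laid out and read back.

```python
from collections import defaultdict


def partition_prefix(table, record, partition_columns):
    """Build a Hive-style 'directory' prefix, e.g. sales/year=2017/month=06/."""
    parts = [f"{col}={record[col]}" for col in partition_columns]
    return "/".join([table, *parts]) + "/"


def group_by_partition(table, records, partition_columns):
    """Group records under the key prefix of the partition they belong to."""
    layout = defaultdict(list)
    for record in records:
        prefix = partition_prefix(table, record, partition_columns)
        layout[prefix].append(record)
    return dict(layout)


def parse_partition(key):
    """Recover partition values from the column=value segments of a key."""
    return dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)


# Illustrative records; the partition columns are assumptions for this sketch.
records = [
    {"year": "2017", "month": "06", "order_id": 1},
    {"year": "2017", "month": "06", "order_id": 2},
    {"year": "2017", "month": "07", "order_id": 3},
]
layout = group_by_partition("sales", records, ["year", "month"])
```

An object store has no real directories, so the prefix is simply the leading part of each object key. A query that filters on `year` and `month` can then list only the matching prefix instead of scanning every object.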
