
engines, we can simplify the data structures in each individual engine. A very typical pattern is to write ingestion code for the data lake, and to use transient AWS EMR clusters or AWS Lambda functions to trigger automatic data updates to the other persistence engines (a sketch of this pattern appears below).

Figure 6: An Example of Automation in AWS Cloud

The diagram above, from the AWS big data documentation, illustrates a data load from on-premises sources into S3, through a transient EMR cluster for further transformation, and the eventual load back into S3.

File Formats and Performance

In the world of distributed computing on clusters, file format choice can be crucial. The core idea is to use splittable, compressible file formats, so that data can be divided up and processed across nodes and transferred over the network in compressed form (see the second sketch below). Avro, Parquet and ORC are now all familiar file format names, but not all file formats are equal. In our experience, Apache Hive performs much better when the data is stored in ORC format. Apache Impala, on the other hand, is partial to Parquet.

Opportunity in Complexity

The transformative power of cloud technologies brings enormous value to the advanced analytics solutions benefiting the modern enterprise. But with so much power comes responsibility: to carefully analyze the complexity of legacy data warehousing; to form a solid hybrid cloud strategy that evolves and modernizes the analytics infrastructure; to manage the change effectively; and, in the process, to save millions of dollars for your organization.

In subsequent articles in this series, we will go deeper into specific technologies and solutions.
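As a minimal sketch of the automation pattern described earlier, the following Lambda handler launches a transient EMR cluster whenever new raw data lands in S3. The bucket names, the job script location and the instance sizing are illustrative assumptions, not details from the article.

import boto3

emr = boto3.client("emr")

def handler(event, context):
    # Triggered by an S3 put event on a hypothetical landing bucket;
    # the Lambda launches a short-lived EMR cluster to transform the new data.
    response = emr.run_job_flow(
        Name="transient-etl-cluster",
        ReleaseLabel="emr-5.4.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            # Transient cluster: terminate as soon as the steps finish
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "transform-and-load",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "s3://example-code-bucket/jobs/transform.py",    # hypothetical job script
                    "--source", "s3://example-landing-bucket/raw/",  # hypothetical buckets
                    "--target", "s3://example-lake-bucket/curated/",
                ],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return response["JobFlowId"]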
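And as a brief illustration of the file format point, the PySpark snippet below writes the same dataset as compressed Parquet and as compressed ORC; the S3 paths and the CSV source are assumptions made for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-format-example").getOrCreate()

# Hypothetical raw data already landed in the lake as CSV
df = spark.read.csv("s3://example-lake-bucket/raw/orders/",
                    header=True, inferSchema=True)

# Columnar, splittable, compressed copies of the same data:
# Parquet for engines such as Impala, ORC for Hive.
df.write.mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("s3://example-lake-bucket/curated/orders_parquet/")

df.write.mode("overwrite") \
    .option("compression", "zlib") \
    .orc("s3://example-lake-bucket/curated/orders_orc/")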