engines, we can simplify the data structures in each individual engine. A very
typical pattern is to write ingestion code for the data lake, and use transient
AWS EMR clusters or AWS Lambda functions to trigger automatic data updates
to other persistence engines.
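The pattern above can be sketched as a Lambda handler that launches a transient EMR cluster to run one transformation and then shut down. This is a minimal sketch, not a production implementation: the bucket paths, script locations, instance types, and function names are all hypothetical, and in a real deployment the returned configuration would be passed to boto3's EMR client via `run_job_flow`.

```python
"""Sketch of the automation pattern: a Lambda handler that would launch
a transient EMR cluster to push data-lake updates into downstream
persistence engines. All names and paths here are hypothetical."""


def build_transform_job_flow(input_path: str, output_path: str) -> dict:
    """Build a run_job_flow-style config for a transient EMR cluster.

    The cluster auto-terminates once its steps finish
    (KeepJobFlowAliveWhenNoSteps=False), which is what makes it
    "transient": it exists only for the duration of one load.
    """
    return {
        "Name": "transient-lake-to-engine-load",  # hypothetical name
        "ReleaseLabel": "emr-5.5.0",              # an EMR release of the era
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m4.large",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m4.large",
                 "InstanceCount": 2},
            ],
            # Transient: shut the cluster down when the steps complete.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "transform-and-load",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    # Hypothetical Spark job reading from and writing to S3.
                    "Args": ["spark-submit",
                             "s3://my-bucket/jobs/transform.py",
                             input_path, output_path],
                },
            }
        ],
    }


def lambda_handler(event: dict, context: object) -> dict:
    """Entry point Lambda would invoke, e.g. on an S3 upload event."""
    config = build_transform_job_flow(
        input_path="s3://my-lake/raw/",       # hypothetical lake prefix
        output_path="s3://my-lake/curated/",  # hypothetical curated prefix
    )
    # In production: boto3.client("emr").run_job_flow(**config)
    return config
```

Because the cluster terminates itself after the load, you pay only for the minutes the transformation actually runs.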
Figure 6: An Example of Automation in AWS Cloud
The above diagram, from the AWS big data documentation, illustrates a data
load from on-premises to S3, through a transient EMR cluster for further
transformation, and the eventual load back into S3.
File Formats and Performance
In the world of distributed computing on clusters, file format choice can be
crucial. The core idea is to use file formats that are splittable, so blocks can
be processed in parallel across nodes, and compressible, so data is transferred
efficiently over the network. Avro, Parquet and ORC are now all familiar file
format names, but not all file formats are equal. In our experience, Apache
Hive performs much better when the data is stored in ORC format. On the
other hand, Apache Impala is partial to Parquet.
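The splittability idea can be illustrated with a toy example. Formats like ORC (stripes) and Parquet (row groups) compress data in independently readable blocks, so each cluster node can decompress and process its own block, whereas a single gzip stream over a whole file must be read end-to-end by one reader. The sketch below mimics that block layout by gzip-compressing fixed-size chunks independently; the chunking scheme is purely illustrative and is not the actual ORC or Parquet encoding.

```python
"""Toy illustration of splittable compression: compress fixed-size
blocks independently, so each block can be handed to a different
worker, roughly as ORC stripes and Parquet row groups allow."""
import gzip


def compress_in_blocks(data: bytes, block_size: int) -> list:
    """Compress each fixed-size chunk on its own, like a stripe/row group."""
    return [gzip.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def process_block(block: bytes) -> int:
    """A 'node' decompresses its block without seeing any other block."""
    return len(gzip.decompress(block))


# A small synthetic CSV payload standing in for a data file.
records = b"id,value\n" + b"".join(b"%d,x\n" % i for i in range(1000))
blocks = compress_in_blocks(records, block_size=1024)

# Each block is independently processable -- the essence of splittability.
total = sum(process_block(b) for b in blocks)
assert total == len(records)
```

In a real cluster the `process_block` calls would run on different nodes; the point is that no block depends on any other to be decoded.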
Opportunity in Complexity
The transformative power of cloud technologies brings enormous value to the
advanced analytics solutions that benefit the modern enterprise. But with so
much power comes responsibility: to carefully analyze the complexity in the
world of legacy data warehousing; to form a solid hybrid cloud strategy to
evolve and modernize the analytics infrastructure; to effectively manage the
change; and, in the process, to save millions of dollars for your organization.
In subsequent articles in this series, we will go deeper into specific technolo-
gies and solutions.
SPRING 2017 | THE DOPPLER | 29