Same Load, Many Clusters
In the same way that we need to start thinking about clusters that come alive to
do their intense computing and then go back to sleep, we should also break the
habit of considering only single clusters and start thinking about many clusters
supporting different workloads. Once you are used to the development, test and
deployment patterns associated with ephemeral clusters, the natural next step is
to run separate clusters for the various consumption patterns; for example, one
or more for data ingestion, one or more for fast queries and one for data science.
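To make this concrete, here is a minimal sketch, using boto3 (the AWS SDK for Python), of launching a separate ephemeral EMR cluster per consumption pattern. The workload names, instance types, sizes and application lists are illustrative assumptions, not recommendations.

```python
# Hypothetical sketch: one ephemeral EMR cluster per consumption pattern.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Assumed workload catalog; sizes and applications are placeholders.
WORKLOADS = {
    "ingestion":    {"apps": ["Spark"],          "count": 4},
    "fast-queries": {"apps": ["Hive", "Presto"], "count": 8},
    "data-science": {"apps": ["Spark"],          "count": 2},
}

def launch(workload, spec):
    """Start one cluster dedicated to a single consumption pattern."""
    return emr.run_job_flow(
        Name="datalake-" + workload,
        ReleaseLabel="emr-5.4.0",
        Applications=[{"Name": app} for app in spec["apps"]],
        Instances={
            "MasterInstanceType": "m4.xlarge",
            "SlaveInstanceType": "m4.xlarge",
            "InstanceCount": spec["count"],
            # Ephemeral: terminate when the submitted steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )

for workload, spec in WORKLOADS.items():
    print(workload, launch(workload, spec)["JobFlowId"])
```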
Tools Ecosystem
We’ve mentioned that HDFS is the favorite technology for building a data lake.
If you commit to HDFS, the natural choices are Apache Oozie for workflow
management, Apache Pig for scripting and Apache Hive for batch and some
interactive queries. Apache Spark has gained popularity in recent years and is a
serious contender for streaming analytics and machine learning workloads. We
are also seeing Amazon Redshift, Amazon DynamoDB and Elasticsearch clusters
coexist with the Hadoop ecosystem in Amazon Web Services deployments. All of
these tools come with limitations, so careful upfront analysis is required to
make sure must-have features are either supported today or on the roadmap for
the near term.
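To give a flavor of the batch-query side, the following PySpark sketch registers Parquet files from the lake as a table and runs a Hive-style aggregate over them. The HDFS path and column names are hypothetical.

```python
# Hypothetical sketch: Hive-style batch SQL over files in the lake.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lake-queries")
         .enableHiveSupport()   # use the Hive metastore for table definitions
         .getOrCreate())

# Register raw events stored in the lake as a queryable view.
events = spark.read.parquet("hdfs:///lake/raw/events/")  # placeholder path
events.createOrReplaceTempView("events")

# Batch-style aggregate, as Hive would run it.
spark.sql("""
    SELECT event_date, COUNT(*) AS event_count
    FROM events
    GROUP BY event_date
""").show()
```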
[Figure 5: Various Technologies Working Together in AWS. Amazon EMR connected to Amazon Redshift (copy from HDFS), Amazon DynamoDB (EMR-DynamoDB connector), Amazon RDS (JDBC data source with Spark SQL), Amazon Kinesis (streaming data connectors), Elasticsearch (Elasticsearch connector) and Amazon S3 (EMR File System, EMRFS)]
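As one example of the integrations in Figure 5, a Spark SQL job on EMR can pull a table from Amazon RDS over JDBC. The sketch below assumes a MySQL endpoint, database, table and credentials that are purely illustrative, and it requires the JDBC driver to be on the cluster's classpath.

```python
# Hypothetical sketch: Spark SQL on EMR reading an Amazon RDS table over JDBC.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rds-over-jdbc").getOrCreate()

customers = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://my-rds-endpoint:3306/sales")  # placeholder
             .option("dbtable", "customers")
             .option("user", "analyst")          # placeholder credentials
             .option("password", "********")
             .load())

# The RDS table can now be joined against data already in the lake.
customers.createOrReplaceTempView("customers")
```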
Automation of Data Ingestion
In our discussions, we repeatedly hear one concern about polyglot persistence:
the complexity of back-end data integration. Multiple processing engines do
require more ingestion code and the associated development, maintenance and
modification costs. But by splitting the workloads out into multiple clusters and
automating the ingestion pipelines, each integration becomes a smaller,
repeatable unit of work that is easier to build and maintain.
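One way to keep those costs in check is to make ingestion template driven. The boto3 sketch below submits the same parameterized Spark job as an EMR step once per target store, so supporting a new engine means adding a configuration entry rather than writing a new pipeline. The cluster ID, script location and arguments are hypothetical.

```python
# Hypothetical sketch: template-driven ingestion steps, one per target store.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Assumed sink catalog; each entry is just extra arguments to one shared job.
SINKS = {
    "redshift": ["--sink", "redshift", "--table", "events"],
    "dynamodb": ["--sink", "dynamodb", "--table", "events"],
    "elastic":  ["--sink", "elastic",  "--index", "events"],
}

steps = [{
    "Name": "ingest-to-" + name,
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://my-bucket/jobs/ingest.py",  # placeholder
                 "--source", "s3://my-bucket/raw/events/"] + args,
    },
} for name, args in SINKS.items()]

emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=steps)
```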