Same Load, Many Clusters
In the same way that we need to start thinking about clusters that come alive to
do their intense computing and then go back to sleep, we should also break the
habit of considering only single clusters and start thinking about many clusters
supporting different workloads. Once you are used to the development, test and
deployment patterns associated with ephemeral clusters, the natural next step is
to run separate clusters for the various consumption patterns; for example, one
or more for data ingestion, one or more for fast queries and one for data science.
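To make this concrete, here is a minimal sketch, using boto3 (the AWS SDK for Python), of launching a separate ephemeral EMR cluster per consumption pattern. The workload names, instance types, sizes and application lists are illustrative assumptions, not recommendations.

```python
# Hypothetical sketch: one ephemeral EMR cluster per consumption pattern.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Assumed workload catalog; sizes and applications are placeholders.
WORKLOADS = {
    "ingestion":    {"apps": ["Spark"],          "count": 4},
    "fast-queries": {"apps": ["Hive", "Presto"], "count": 8},
    "data-science": {"apps": ["Spark"],          "count": 2},
}

def launch(workload, spec):
    """Start one cluster dedicated to a single consumption pattern."""
    return emr.run_job_flow(
        Name="datalake-" + workload,
        ReleaseLabel="emr-5.4.0",
        Applications=[{"Name": app} for app in spec["apps"]],
        Instances={
            "MasterInstanceType": "m4.xlarge",
            "SlaveInstanceType": "m4.xlarge",
            "InstanceCount": spec["count"],
            # Ephemeral: terminate when the submitted steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )

for workload, spec in WORKLOADS.items():
    print(workload, launch(workload, spec)["JobFlowId"])
```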
Tools Ecosystem
We’ve mentioned that HDFS is the favorite technology for building a data lake.
If you commit to HDFS, the natural choices are Apache Oozie for workflow
management, Apache Pig for scripting and Apache Hive for batch and some
interactive queries. Apache Spark has gained popularity in recent years and is a
serious contender for streaming analytics and machine learning workloads. We
are also seeing Amazon Redshift, Amazon DynamoDB and Elasticsearch clusters
coexist with the Hadoop ecosystem in Amazon Web Services deployments. All of
these tools come with limitations, so careful upfront analysis is required to
make sure must-have features are either supported today or on the roadmap for
the near term.
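To give a flavor of the batch-query side, the following PySpark sketch registers Parquet files from the lake as a table and runs a Hive-style aggregate over them. The HDFS path and column names are hypothetical.

```python
# Hypothetical sketch: Hive-style batch SQL over files in the lake.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lake-queries")
         .enableHiveSupport()   # use the Hive metastore for table definitions
         .getOrCreate())

# Register raw events stored in the lake as a queryable view.
events = spark.read.parquet("hdfs:///lake/raw/events/")  # placeholder path
events.createOrReplaceTempView("events")

# Batch-style aggregate, as Hive would run it.
spark.sql("""
    SELECT event_date, COUNT(*) AS event_count
    FROM events
    GROUP BY event_date
""").show()
```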
[Figure 5: Various Technologies Working Together in AWS. Amazon EMR connected to Amazon Redshift (copy from HDFS), Amazon DynamoDB (EMR-DynamoDB connector), Amazon RDS (JDBC data source with Spark SQL), Amazon Kinesis (streaming data connectors), Elasticsearch (Elasticsearch connector) and Amazon S3 (EMR File System, EMRFS)]
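As one example of the integrations in Figure 5, a Spark SQL job on EMR can pull a table from Amazon RDS over JDBC. The sketch below assumes a MySQL endpoint, database, table and credentials that are purely illustrative, and it requires the JDBC driver to be on the cluster's classpath.

```python
# Hypothetical sketch: Spark SQL on EMR reading an Amazon RDS table over JDBC.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rds-over-jdbc").getOrCreate()

customers = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://my-rds-endpoint:3306/sales")  # placeholder
             .option("dbtable", "customers")
             .option("user", "analyst")          # placeholder credentials
             .option("password", "********")
             .load())

# The RDS table can now be joined against data already in the lake.
customers.createOrReplaceTempView("customers")
```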
Automation of Data Ingestion
In our discussions, we repeatedly hear one concern about polyglot persistence:
the complexity of back-end data integration. Multiple processing engines do
require more ingestion code and the associated development, maintenance and
modification costs. But by splitting the workloads out into multiple clusters and
automating the ingestion pipelines, each integration becomes a smaller,
repeatable unit of work that is easier to build and maintain.
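One way to keep those costs in check is to make ingestion template driven. The boto3 sketch below submits the same parameterized Spark job as an EMR step once per target store, so supporting a new engine means adding a configuration entry rather than writing a new pipeline. The cluster ID, script location and arguments are hypothetical.

```python
# Hypothetical sketch: template-driven ingestion steps, one per target store.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Assumed sink catalog; each entry is just extra arguments to one shared job.
SINKS = {
    "redshift": ["--sink", "redshift", "--table", "events"],
    "dynamodb": ["--sink", "dynamodb", "--table", "events"],
    "elastic":  ["--sink", "elastic",  "--index", "events"],
}

steps = [{
    "Name": "ingest-to-" + name,
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://my-bucket/jobs/ingest.py",  # placeholder
                 "--source", "s3://my-bucket/raw/events/"] + args,
    },
} for name, args in SINKS.items()]

emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=steps)
```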