The Doppler Quarterly Summer 2016 | Page 45

[Diagram: AWS Based Data Lake. Labels shown: Data Integration; Data Lake Data Processing; Metadata; Rules / Matching Engine; Governance Policies; Spark; ETL Engine; Data Lake Data Storage & Retrieval; S3; Glacier; Elastic MapReduce; Redshift; Elasticsearch; DynamoDB; Predictive Analytics (AWS Machine Learning); In-Memory Analytics; Data Consumers; Dashboards; Ecommerce; Data Science; QuickSight; Mobile Apps.]
Figure 7: AWS Hosted Data Lake Architecture
Key AWS data lake technologies and capabilities include:
Operational Aspects
• CloudFormation – AWS provides CloudFormation, an automated method for standing up services and configurations in a repeatable manner.
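To make the idea of repeatable provisioning concrete, here is a minimal sketch that builds a CloudFormation template in Python and serializes it to JSON. The logical ID `DataLakeBucket` and the choice of a single versioned S3 bucket are assumptions made for illustration, not part of the architecture described here. The sketch stops at producing the template text and does not call AWS.

```python
import json


def build_data_lake_template(bucket_logical_id="DataLakeBucket"):
    """Return a minimal CloudFormation template as a Python dict.

    The logical ID is a hypothetical name chosen for this sketch.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative data lake storage bucket (sketch only)",
        "Resources": {
            bucket_logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    # Keep prior object versions so raw data is not lost on overwrite.
                    "VersioningConfiguration": {"Status": "Enabled"},
                },
            }
        },
    }


# Sorting keys makes the serialized template identical on every run, so the
# same text can be checked into version control and resubmitted unchanged.
template_json = json.dumps(build_data_lake_template(), indent=2, sort_keys=True)
```

Because the template is plain data, the same file can be reused to stand up an identical stack in another account or region, which is where the repeatability comes from.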
Scalability & Performance
• IDM – AWS provides strong identity and access management (IAM) capabilities across its cloud portfolio, as well as the ability to integrate with existing LDAP or Active Directory infrastructures. This capability ensures consistent entitlements across the data access methods.
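One way to picture a consistent entitlement is a single IAM policy document that is generated once and attached wherever read access to the lake is needed. The sketch below builds such a document as plain JSON. The bucket name `example-data-lake`, the prefix `curated`, and the helper `read_only_policy` are hypothetical, and the sketch only constructs the policy text without calling AWS.

```python
import json


def read_only_policy(bucket, prefix):
    """Build an IAM policy document granting read-only access to one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing, but only for keys under the given prefix.
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Allow reading objects under the same prefix.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical bucket and prefix, used only to show the resulting document.
policy = read_only_policy("example-data-lake", "curated")
policy_json = json.dumps(policy, indent=2)
```

Generating the document from one function keeps the grant identical across every role or group that receives it, rather than relying on hand-edited copies.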
Data Access & Retrieval
• S3 – S3 is the object store platform for AWS; it provides a simple API for the storage and retrieval of data.
• Redshift – Redshift is the AWS enterprise data warehouse platform; it provides high-speed analytical access to large and complex data sets. Redshift is a PaaS capability, ensuring low operational overhead.
• EMR – Elastic MapReduce is an AWS implementation of MapReduce, allowing for highly scalable batch processing of data that is sent to other systems for query and analysis.
• DynamoDB – DynamoDB is a fully managed, low-latency NoSQL platform