The Doppler Quarterly Summer 2017

Figure 3: An AWS-Suggested Architecture for Data Lake Metadata Storage

function when a data object is created on S3, and which stores data attributes in a DynamoDB database. The resulting DynamoDB-based data catalog can be indexed by Elasticsearch, allowing business users to perform full-text searches.
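As a rough illustration of the pattern described above, the following minimal Python sketch assumes the function is written as an AWS Lambda-style handler that receives a standard S3 "object created" notification. The names `catalog_items_from_s3_event` and `handler`, the attribute names in each catalog record, and the injected `table` parameter are all hypothetical choices for this example. In a real deployment, `table` would typically be a DynamoDB table resource that exposes `put_item`.

```python
from urllib.parse import unquote_plus


def catalog_items_from_s3_event(event):
    """Turn an S3 'object created' notification into catalog records.

    The attribute names below are illustrative, not a prescribed schema.
    """
    items = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3["object"]["key"])
        items.append({
            "object_id": f"{bucket}/{key}",
            "bucket": bucket,
            "key": key,
            "size_bytes": s3["object"].get("size", 0),
            "etag": s3["object"].get("eTag", ""),
            "created_at": record.get("eventTime", ""),
        })
    return items


def handler(event, context, table=None):
    """Lambda-style entry point (hypothetical).

    `table` is any object with a put_item(Item=...) method, for example
    a DynamoDB table resource. It is passed in here so that the sketch
    can run without AWS credentials.
    """
    items = catalog_items_from_s3_event(event)
    if table is not None:
        for item in items:
            table.put_item(Item=item)
    return items
```

Keeping the attribute extraction in its own function separates the catalog logic from the storage call, which makes it easier to test locally or to point at a different metadata store.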

AWS Glue, a product soon to be released, provides a set of automated tools to support data source cataloging. AWS Glue can crawl data sources and construct a data catalog using pre-built classifiers for many popular source formats and data types, including JSON, CSV, Parquet, and more. As such, it shows promise for enterprise implementations.

We recommend that clients make data cataloging a central requirement for a data lake implementation .

Access and Mine the Lake

Schema on Read

‘Schema on write’ is the tried and tested pattern of cleansing, transforming and adding a logical schema to the data before it is stored in a ‘structured’ relational database. However, as noted previously, data lakes are built on a completely different pattern of ‘schema on read’ that prevents the primary data store from being locked into a predetermined schema. Data is stored in a raw or only mildly processed format, and each analysis tool can impose on the dataset a business meaning that is appropriate to the analysis context. A key benefit of this approach is that different tools can access the same data for different purposes.
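To make the idea concrete, here is a small Python sketch of schema on read. The raw records, the field names, and the two schemas (`FINANCE_SCHEMA`, `MARKETING_SCHEMA`) are invented for illustration. The raw lines are stored untouched, and each consumer applies its own field selection and type casting only when it reads them.

```python
import json

# Raw, unmodified records as they might sit in the lake (illustrative data).
RAW_EVENTS = [
    '{"user": "u1", "amount": "19.99", "ts": "2017-06-01T10:00:00", "country": "US"}',
    '{"user": "u2", "amount": "5.00", "ts": "2017-06-01T11:30:00"}',
]

# Each consumer declares the fields it cares about and how to interpret them.
FINANCE_SCHEMA = {"user": str, "amount": float}
MARKETING_SCHEMA = {"user": str, "country": str}


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time; fields missing from a record become None."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


finance_view = read_with_schema(RAW_EVENTS, FINANCE_SCHEMA)
marketing_view = read_with_schema(RAW_EVENTS, MARKETING_SCHEMA)
```

Both views are derived from the same stored lines. Adding a third consumer with a different interpretation would require no change to the raw data.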

Data Processing

Once you have the raw layer of immutable data in the lake, you will need to create multiple layers of processed data to enable various use cases in the organization. These are examples of the structured storage described earlier. Typical operations required to create these structured data stores will involve:

• Combining different datasets

• Denormalization

• Cleansing, deduplication, householding

• Deriving computed data fields

Apache Spark has become a leading tool for processing the raw data layer to create various value-added, structured data layers.
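The operations listed above can be sketched in a few lines. To stay self-contained, this example uses plain Python from the standard library rather than Spark, and the datasets, field names, and the `build_structured_layer` function are hypothetical. Householding is approximated here by treating a shared address as the household key, which is a simplifying assumption.

```python
# Illustrative raw datasets; in practice these would be read from the lake.
ORDERS = [
    {"order_id": 1, "customer_id": "c1", "quantity": 2, "unit_price": 10.0},
    {"order_id": 1, "customer_id": "c1", "quantity": 2, "unit_price": 10.0},  # duplicate
    {"order_id": 2, "customer_id": "c2", "quantity": 1, "unit_price": 99.5},
]
CUSTOMERS = {
    "c1": {"name": " Alice ", "address": "1 Main St"},
    "c2": {"name": "Bob", "address": "1 Main St"},
}


def build_structured_layer(orders, customers):
    """Produce a flat, cleaned table from raw orders and customers."""
    seen_order_ids = set()
    rows = []
    for order in orders:
        # Deduplication: keep only the first record for each order_id.
        if order["order_id"] in seen_order_ids:
            continue
        seen_order_ids.add(order["order_id"])

        # Combining datasets: look up the matching customer record.
        customer = customers.get(order["customer_id"], {})

        rows.append({
            **order,
            # Denormalization: copy customer attributes onto the order row.
            # Cleansing: strip stray whitespace from the name.
            "customer_name": customer.get("name", "").strip(),
            # Householding (simplified): the shared address is the household key.
            "household": customer.get("address", ""),
            # Derived field: computed from two raw fields.
            "total": order["quantity"] * order["unit_price"],
        })
    return rows


structured_rows = build_structured_layer(ORDERS, CUSTOMERS)
```

The raw inputs are left unchanged. The structured layer is a new output, which is consistent with keeping the raw layer immutable.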

Data Warehousing

For some specialized use cases (think high-performance data warehouses), you may need to run SQL queries on petabytes of data and return complex analytical results very quickly. In those cases, you may need to ingest a portion of your data from your

