The Doppler Quarterly Summer 2017 - Page 20

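Figure 3: An AWS Suggested Architecture for Data Lake Metadata Storage

function when a data object is created on S3, and which stores data attributes into a DynamoDB database. The resultant DynamoDB-based data catalog can be indexed by Elasticsearch, allowing a full-text search to be performed by business users.

AWS Glue, a product soon to be released, provides a set of automated tools to support data source cataloging capability. AWS Glue can crawl data sources and construct a data catalog using pre-built classifiers for many popular source formats and data types, including JSON, CSV, Parquet, and more. As such, it shows promise for enterprise implementations.

We recommend that clients make data cataloging a central requirement for a data lake implementation.

Access and Mine the Lake

Schema on Read

‘Schema on write’ is the tried and tested pattern of cleansing, transforming and adding a logical schema to the data before it is stored in a ‘structured’ relational database. However, as noted previously, data lakes are built on a completely different pattern of ‘schema on read’ that prevents the primary data store from being locked into a predetermined schema. Data is stored in a raw or only mildly processed format, and each analysis tool can impose on the dataset a business meaning that is appropriate to the analysis context. There are many benefits to this approach, including enabling various tools to access the data for various purposes.

Data Processing

Once you have the raw layer of immutable data in the lake, you will need to create multiple layers of processed data to enable various use cases in the organization. These are examples of the structured storage described earlier. Typical operations required to create these structured data stores will involve:

- Combining different datasets
- Denormalization
- Cleansing, deduplication, householding
- Deriving computed data fields

Apache Spark has become the leading tool of choice for processing the raw data layer to create various value-added, structured data layers.

Data Warehousing

For some specialized use cases (think high performance data warehouses), you may need to run SQL queries on petabytes of data and return complex analytical results very quickly. In these cases, you may need to ingest a portion of your data from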