The Doppler Quarterly Summer 2017 | Page 17

fectly suitable to create multiple copies of the same data set with different underlying storage structures (partitions, folders) and file formats (e.g. ORC vs Parquet).

Design Security

Like every cloud-based deployment, security for an enterprise data lake is a critical priority, and one that must be designed in from the beginning. Further, it can only be successful if the security for the data lake is deployed and managed within the framework of the enterprise’s overall security infrastructure and controls.

Broadly, there are three primary domains of security relevant to a data lake deployment:

• Encryption
• Network Level Security
• Access Control

Encryption - Virtually every enterprise-level organization requires encryption for stored data, if not universally, at least for most classifications of data other than that which is publicly available. All leading cloud providers support encryption on their primary object store technologies (such as AWS S3) either by default or as an option. Likewise, the technologies used for other storage layers such as derivative data stores for consumption typically offer encryption as well.

Encryption key management is also an important consideration, with requirements typically dictated by the enterprise’s overall security controls. Options include keys created and managed by the cloud provider, customer-generated keys managed by the cloud-provider, and keys fully created and managed by the customer on-premises.

The final related consideration is encryption in-transit. This covers data moving over the network between devices and services. In most situations, this is easily configured with either built-in options for each service, or by using standard TLS/SSL with associated certificates.

Network Level Security - Another important layer of security resides at the network level. Cloud-native constructs such as security groups, as well as traditional methods including network ACLs and CIDR block restrictions, all play a part in implementing a robust “defense-in-depth” strategy, by walling off large swaths of inappropriate access paths at the network level. This implementation should also be consistent with an enterprise’s overall security framework.

Access Control - This focuses on Authentication (who are you?) and Authorization (what are you allowed to do?). Virtually every enterprise will have standard authentication and user directory technologies already in place; Active Directory, for example. And every leading cloud provider supports methods for mapping the corporate identity infrastructure onto the permissions infrastructure of the cloud provider’s resources and services. While the plumbing involved can be complex, the roles associated with the access management infrastructure of the cloud-provider (such as IAM on AWS) are assumable by authenticated users, enabling fine-grained permissions control over authorized operations.

The same is usually true for third-party products that run in the cloud such as reporting and BI tools. LDAP and/or Active Directory are typically supported for authentication, and the tools’ internal authorization and roles can be correlated with and driven by the authenticated users’ identities.

Establish Governance

Typically, data governance refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise. It relies on both business policies and technical practices. Similar to other described aspects of any cloud deployment, data governance for an enterprise data lake needs to be driven by, and consistent with, overarching practices and policies for the organization at large.

In traditional data warehouse infrastructures, control over database contents is typically aligned with the business data, and separated into silos by business unit or system function. However, in order to derive the benefits of centralizing an organization’s data, it correspondingly requires a centralized view of data governance.

Even if the enterprise is not fully mature in its data governance practices, it is critically important that at least a minimum set of controls is enforced such that data cannot enter the lake without important metadata (“data about the data”) being defined and cap-