Securing the Hadoop Ecosystem

Presentation transcript:

Securing the Hadoop Ecosystem ATM (Cloudera) & Shreepadma (Cloudera) Strata/Hadoop World, Oct 2013

Agenda
- Hadoop Ecosystem Interactions
- Security Concepts
- Authentication
- Authorization Overview
- Confidentiality
- Auditing
- IT Infrastructure Integration
- Deployment Recommendations
- Advanced Authorization (Apache Sentry (Incubating))

Hadoop Ecosystem Interactions: Hadoop on its Own
[Diagram. Parties: end users; hdfs, httpfs & mapred service users. Clients shown: HDFS client, MR client, WebHDFS, HttpFS. Cluster components shown: NN, SNN, JT, and DN/TT nodes running map and reduce tasks. Protocols: RPC, data transfer, HTTP.]

Hadoop Ecosystem Interactions: Hadoop and Friends
[Diagram. Parties: end users; service users. Clients and services shown: HBase, Zookeeper, Oozie, WebHDFS, Hadoop, Pig, Hue (browser), Crunch, Flume, Cascading, MapReduce, Impala, Sqoop, Hive, Hive Metastore. Protocols: RPC, data transfer, HTTP, Thrift, Avro-RPC.]

Security Concepts: Authentication / Authorization
Authentication
- End users to services, as a user: user credentials
- Services to services, as a service: service credentials
- Services to services, on behalf of a user: service credentials + trusted service
- Job tasks to services, on behalf of a user: job delegation token
Authorization
- Data: HDFS, HBase, Hive Metastore, Zookeeper
- Jobs: who can submit, view or manage jobs (MR, Pig, Oozie, Hue, …)
- Queries: who can run queries (Impala, Hive)

Security Concepts: Confidentiality / Auditing
Confidentiality
- Data at rest (on disk)
- Data in transit (on the network)
Auditing
- Who accessed (read/write) data
- Who submitted, managed or viewed a job or a query

Authentication Details
- End users to services, as a user
  - CLI & libraries: Kerberos (kinit or keytab)
  - Web UIs: Kerberos SPNEGO & pluggable HTTP auth
- Services to services, as a service
  - Credentials: Kerberos (keytab)
- Services to services, on behalf of a user
  - Proxy-user (after Kerberos for service)
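The proxy-user case can be sketched as a small decision function. This is a simplified model, not Hadoop's implementation: in Hadoop the equivalent rules are set through the `hadoop.proxyuser.<service>.hosts` and `hadoop.proxyuser.<service>.groups` properties, while the `ProxyRule` class, the `may_impersonate` function, and the host and group names below are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ProxyRule:
    """Who a trusted service may impersonate, and from where (hypothetical model)."""
    hosts: set = field(default_factory=set)   # hosts the service may connect from
    groups: set = field(default_factory=set)  # groups whose members may be impersonated


def may_impersonate(rules, service, client_host, user_groups):
    """Return True if `service`, already authenticated, may act for the user."""
    rule = rules.get(service)
    if rule is None:
        return False  # the service is not a configured proxy user
    host_ok = "*" in rule.hosts or client_host in rule.hosts
    group_ok = "*" in rule.groups or bool(rule.groups & set(user_groups))
    return host_ok and group_ok


# Hypothetical rule set: oozie may impersonate members of 'analysts' from one host.
RULES = {"oozie": ProxyRule(hosts={"oozie1.example.com"}, groups={"analysts"})}
```

The check only makes sense after the service itself has authenticated (with Kerberos in the slide); the sketch starts from that point and does not model the authentication step.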

Authorization Details
- HDFS data
  - File system permissions (Unix-like user/group permissions)
- HBase data
  - Read/write Access Control Lists (ACLs) at table level
- Hive Server 2 and Impala
  - Fine-grained authorization through Apache Sentry (Incubating)
- Jobs (Hadoop, Oozie)
  - Job ACLs for Hadoop scheduler queues, manage & view jobs
- Zookeeper
  - ACLs at znodes, authenticated & read/write
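As an illustration of the Unix-like permission model mentioned for HDFS data, here is a minimal sketch of an owner/group/other check. The function name `can_access` and its parameters are hypothetical, and the sketch leaves out the superuser and anything beyond the basic mode bits.

```python
def can_access(mode, owner, group, user, user_groups, want):
    """Check a Unix-style permission triple.

    mode: octal permission bits, e.g. 0o750
    want: one of "r", "w", "x"
    """
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == owner:
        # Owner bits apply, even if the group or other bits are more permissive.
        return bool((mode >> 6) & bit)
    if group in user_groups:
        return bool((mode >> 3) & bit)
    return bool(mode & bit)
```

For example, with mode `0o750` on a file owned by `atm` with group `analysts`, the owner can write, members of `analysts` can read but not write, and everyone else is denied.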

Confidentiality Details
- Data in transit
  - RPC: using SASL
  - HDFS data: using SASL
  - HTTP: using SSL (web UIs, shuffle); requires SSL certs
  - Thrift: not available (Hive Metastore, Impala)
  - Avro-RPC: not available (Flume)
- Data at rest
  - Nothing out of the box
  - Doable by: custom 'compression' codec or local file system encryption

Auditing Details
- Who accessed (read/write) FS data
  - NN audit log contains all file opens, creates
  - NN audit log contains all metadata ops, e.g. rename, listdir
- Who submitted, managed, or viewed a job or a query
  - JT, RM, and Job History Server logs contain history of all jobs run on a cluster
- Who submitted, managed, or viewed a workflow
  - Oozie audit logs contain history of all user requests

Auditing Gaps
- Not all projects have explicit audit logs
  - Audit-like information can be extracted by processing logs
  - E.g., Impala query logs are distributed across all nodes
- It is difficult to correlate jobs & data access
  - E.g., Map-Reduce jobs launched by a Pig job
  - E.g., HDFS data accessed by a Map-Reduce job
  - Tools written on top of Hadoop can do this well, e.g. Cloudera Navigator

IT Integration: Kerberos
- Users don't want Yet Another Credential
- Corp IT doesn't want to provision thousands of service principals
- Solution: local KDC + one-way trust
  - Run a KDC (usually MIT Kerberos) in the cluster
  - Put all service principals here
  - Set up one-way trust of central corporate realm by local KDC
  - Normal user credentials can be used to access Hadoop

IT Integration: Groups
- Much of Hadoop authorization uses "groups"
  - User 'atm' might belong to groups 'analysts', 'eng', etc.
- Users' groups are not stored in Hadoop anywhere
  - Hadoop refers to an external system to determine group membership
  - NN/JT/Oozie/Hive servers all must perform group mapping
- Default plugins for user/group mapping:
  - ShellBasedUnixGroupsMapping: forks/runs `/bin/id`
  - JniBasedUnixGroupsMapping: makes a system call
  - LdapGroupsMapping: talks directly to an LDAP server
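The shell-based approach to group mapping can be sketched as follows. This is not Hadoop's code: the slide only says the plugin forks `/bin/id`, so the exact invocation used here (`id -Gn <user>`) is an assumption, and the function names are hypothetical.

```python
import subprocess


def parse_groups(id_output):
    """Split the whitespace-separated group names printed by `id -Gn`."""
    return id_output.split()


def run_id(user):
    """Fork `id -Gn <user>` (assumed invocation) and return its standard output."""
    result = subprocess.run(
        ["id", "-Gn", user], capture_output=True, text=True, check=True
    )
    return result.stdout


def groups_for_user(user, runner=run_id):
    """Resolve a user's groups; `runner` is injectable so tests avoid forking."""
    return parse_groups(runner(user))
```

Because every server that authorizes requests performs this lookup itself, each of those hosts needs to resolve the same groups for a given user, whichever plugin is chosen.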

IT Integration: Kerberos + LDAP
[Diagram. Hadoop cluster: NN, JT, and a local KDC holding service principals such as hdfs/host1@HADOOP.EXAMPLE.COM and yarn/host2@HADOOP.EXAMPLE.COM. Central Active Directory: user principals such as atm@EXAMPLE.COM. Links shown: cross-realm trust, LDAP group mapping.]

IT Integration: Web Interfaces
- Most web interfaces authenticate using SPNEGO
  - Standard HTTP authentication protocol
  - Used internally by services which communicate over HTTP
  - Most browsers support Kerberos SPNEGO authentication
- Hadoop components which use servlets for web interfaces can plug in a custom filter
  - Integrate with intranet SSO HTTP solution

Deployment Recommendations
- Security configuration is a PITA
  - Do only what you really need
- Enable cluster security (Kerberos) only if un-trusted groups of users are sharing the cluster
  - Otherwise use edge-security to keep outsiders out
- Only enable wire encryption if required
- Only enable web interface authentication if required

Deployment Recommendations
Secure Hadoop bring-up order
1. HDFS RPC (including SNN check-pointing)
2. JobTracker RPC
3. TaskTrackers RPC & LinuxTaskController
4. Hadoop web UI
5. Configure monitoring to work with security
6. Other services (HBase, Oozie, Hive Metastore, etc.)
7. Continue with authorization and network encryption if needed
Recommended: use an admin/management tool
- Several inter-related configuration knobs
- To manage principals/keytabs creation and distribution
- Automatically configures monitoring for security

Apache Sentry (Incubating)

Sentry: What is Authorization? Authorization Concepts
- Privilege
  - Right to perform a particular action, or an action on an object of a particular type
  - E.g., query table FOO
- Role
  - Collection of privileges
  - Benefit: ease of privilege administration
- Group
  - Collection of users
  - Benefit: ease of user administration

Sentry: Authorization Requirements
- Secure authorization
  - Reliably enforce privileges to control access to data and resources for authenticated users
- Fine-grained authorization
  - Ability to control access to a subset of data
  - E.g., specific rows and columns in a table
- Role-based authorization
  - Ability to group and administer privileges through roles
- Multi-tenant administration
  - Allow a global administrator to delegate management of security for subsets of data to other administrators
  - E.g., a global server admin may delegate management of security for individual databases to database admins

Sentry: State of Security
- Support for strong authentication
  - Kerberos
  - LDAP/AD
  - Custom authentication (Hive)
- Two sub-optimal choices for authorization
  - Coarse-grained HDFS file permissions (Hive)
    - Achieved through HS2 impersonation
    - Controls permissions at file level
    - Insufficient for controlling access to chunks of data in a file
    - No authorization for metadata
  - Insecure advisory authorization (Hive)
    - Self-service system that allows users to grant themselves privileges
    - Prevents accidental deletion but doesn't stop malicious use

Introducing Apache Sentry (Incubating)
- Authorization system for various components of the Hadoop ecosystem
  - Currently supports Hive and Impala
  - Support for Solr underway
- Secure, fine-grained, role-based and multi-tenant
- Open source
  - Currently undergoing incubation at ASF

Sentry Architecture
[Diagram; no text recovered in transcript.]

Sentry Policy File
- Contains sections for roles, groups, users
  - Users section maps users to groups
  - Roles section maps privileges to roles
  - Groups section maps roles to groups
- Global policy file can also contain a databases section to point to a DB-specific policy file
    [databases]
    customers = hdfs://ha-nn-uri/usr/config/sentry/customers.ini
- Policy file is protected by file permissions
- Policy file can be on localFS/HDFS

Sentry: Fine-Grained Authorization
- For Hive and Impala, ability to specify privileges on
  - SERVER
  - DATABASE
  - TABLE
  - VIEW (row/column level authorization)
  - URI
- Privilege granularity
  - SELECT
  - INSERT
  - ALL

Sentry: Role-Based Authorization
- Roles provide a mechanism to group privileges
- Used commonly by organizations to restrict access based on an employee's role
- Example: Manager role allows INSERT on table EMPLOYEE and SELECT on view DIRECT_REPORTS on table EMPLOYEE
    manager = server=server1->db=hr_db->table=employee->action=INSERT, \
              server=server1->db=hr_db->table=direct_reports->action=SELECT

Sentry: Multi-Tenant Administration
- Support for DB-specific policy file
  - Allows the global admin to delegate security administration of databases to database admins
  - DB policy file can specify privileges for a DB
  - Global policy file contains location of the DB policy file
  - Privileges in the global file supersede the privileges in the DB-specific policy file

Sentry: User Management
- Sentry doesn't perform user management
  - Reuses Kerberos/LDAP/AD users
- Groups provide a container for a set of users
  - Roles can be assigned to groups
  - Example: analyst = sales_reporting, audit_reports
- User to group mapping
  - Reuse Hadoop groups
  - Specify locally in policy file using the users section
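The chain from groups to roles to privileges can be sketched with two lookup tables. This is an illustrative model, not Sentry's API: the `analyst` group line comes from the slide, while the privilege strings attached to each role and the `privileges_for` function are hypothetical.

```python
# Hypothetical policy data, mirroring the groups and roles sections of a policy file.
GROUPS = {"analyst": ["sales_reporting", "audit_reports"]}
ROLES = {
    "sales_reporting": ["server=server1->db=sales->table=customer->action=SELECT"],
    "audit_reports": ["server=server1->db=audit->action=SELECT"],
}


def privileges_for(user_groups, groups=GROUPS, roles=ROLES):
    """Collect every privilege reachable from the user's groups via roles."""
    found = []
    for group in user_groups:
        for role in groups.get(group, []):
            for privilege in roles.get(role, []):
                if privilege not in found:
                    found.append(privilege)
    return found
```

Users never appear in either table; the caller supplies the user's groups, which matches the point that group membership comes from Hadoop's group mapping or the policy file's users section.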

Sentry: Granting/Revoking Privileges
- Specified in the policy file
  - Example: Grant INSERT on table CUSTOMERS in database SALES:
      server=server1->db=sales->table=customer->action=INSERT
- Privileges are represented by a hierarchy (mirrors the hierarchy in Hive's data model)
  - Privileges granted for an object and its containees
  - Example: ALL on DB implies SELECT, INSERT on all tables within the DB
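The containment rule (a grant on an object covers the objects inside it) can be sketched as a prefix comparison on the parsed privilege string. This is a simplified model rather than Sentry's implementation; the function names are hypothetical, and treating a privilege with no `action` part as ALL is an assumption of the sketch.

```python
def parse_privilege(text):
    """Turn 'server=s->db=d->table=t->action=A' into (scope, action)."""
    scope = []
    action = "ALL"  # assumption: a privilege with no action part means ALL
    for part in text.split("->"):
        key, _, value = part.partition("=")
        if key.strip().lower() == "action":
            action = value.strip().upper()
        else:
            scope.append((key.strip().lower(), value.strip().lower()))
    return scope, action


def implies(granted, requested):
    """True if `granted` covers `requested` under the containment hierarchy."""
    g_scope, g_action = parse_privilege(granted)
    r_scope, r_action = parse_privilege(requested)
    # The granted object must be the requested object or one of its containers.
    covers_object = r_scope[: len(g_scope)] == g_scope
    covers_action = g_action == "ALL" or g_action == r_action
    return covers_object and covers_action
```

With this model, ALL on database `sales` covers INSERT on any table in that database, while INSERT on one table does not cover SELECT on the same table.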

Sentry: Privilege Hierarchy
[Diagram; no text recovered in transcript.]

Configuring Sentry
- Old Hive CLI is not supported; HS2/Impala is required
- Warehouse directory must be owned by the user running HS2/Impala
- Secure the warehouse directory, including sub-directories, using 770 permissions
- In the case of Hive, the user HS2 runs as must be able to run MR jobs
- Turn off HS2 impersonation (strongly recommended)
- Configure sentry-site.xml and hive-site.xml appropriately

Q&A

Thanks
ATM (Cloudera) & Shreepadma (Cloudera), Strata/Hadoop World, Oct 2013

Security Capabilities Appendix Client Protocol Authentication Proxy User Authorization Confidentiality Auditing Hadoop HDFS RPC Kerberos Yes FS permissions SASL Data Transfer No Hadoop WebHDFS HTTP Kerberos SPNEGO plus pluggable N/A Hadoop MapReduce (Pig, Hive, Sqoop, Crunch, Cascading) Yes (requires job config work) Job & Queue ACLs Oozie Job & Queue ACLs and FS permissions SSL (HTTPS) Hbase RPC/Thrift/HTTP table ACLs HiveServer2 Kerberos/LDAP Sentry In the works Zookeeper znode ACLs Impala Thrift Hue pluggable HTTPS Flume Avro RPC