Requirements for Secure, Multi-Tenant Hadoop

Presentation transcript:

Requirements for Secure, Multi-Tenant Hadoop IT’S MUCH MORE THAN YARN Anant Chintamaneni

About Me Products at BlueData @AnantCman Former Head of Hadoop Products & Analytics at Pivotal Launched Hadoop-based analytics system at Merced Systems (now NICE Systems)

Talk Track Hadoop ecosystem & typical data architecture Road to multi-tenancy in the enterprise Deep dive on the requirements for multi-tenancy On-premises options and considerations Customer examples & scenarios Q&A

Hadoop Ecosystem
Diagram labels: BI/ETL/Analytics; Search, BI, ETL, M/L; Data Pipelines; Data Platforms; noSQL Databases; Analytics Databases; Spark; Hadoop MapReduce & HDFS; Data Ingestion; Virtualization & Cloud; BIG DATA INFRASTRUCTURE; Hardware & Storage
It is much more than Hadoop / YARN

Typical Enterprise Data Architecture Data Sources Ingestion Batch Service Speed Example: Kafka + Hadoop + Spark + noSQL + SQL

Road to Multi-Tenancy Multi-purpose platform + multi-functional use cases demand multi-tenant architecture

Road to Multi-Tenancy – Status Quo
Diagram labels: Marketing, R&D, Manufacturing, IT; 360 customer view, Log Analysis, Predictive maintenance; versions 2.2, 2.4, 2.2, 2.6, 2.5; (M/L node), (BI node Prod), (BI node test); Prod, Dev, Pre-Prod, Dev, Test; LOCAL ADMINS
Callouts: INFRASTRUCTURE & CLUSTER SPRAWLS; <30% INFRASTRUCTURE UTILIZATION; MANAGEMENT COMPLEXITY; DUPLICATION OF DATA
Here is a quick visual of what we’ve seen at one of our customers. A line of business started a 360 view of the customer in the Marketing group and built out a Hadoop environment. Independently, the product R&D team created its own Hadoop cluster for log analysis, the support team created its own cluster with a newer, different version of Hadoop, and so did the Manufacturing group, which managed to get to a pre-prod stage.
At this point, some sort of centralized group (IT) is usually pulled in to help rationalize the efforts and look at the big picture. Then all the issues become quite apparent: a significant amount of data duplication and loss of visibility; management complexity due to the different clusters, versions and expectations; not to mention the infrastructure sprawl and utilization issues. If you are starting your Hadoop journey, especially in a medium or large enterprise, this is what your environment could look like, if it isn’t already some variant of this.

Haven’t We Seen This Pattern Before?
But Hadoop is not a database... it’s a distributed platform & ecosystem (remember!)

Multi-Tenancy – The Story So Far
HADOOP VENDOR #1: Multi-tenancy in Hadoop refers to a set of features that enable multiple groups from within the same organisation to share the common set of resources in a cluster without negatively impacting service-levels, violating security constraints, or even revealing the existence of each other, all via policy rather than physical separation.
HADOOP VENDOR #2: Multi-tenancy is the ability of a single instance of software to serve multiple tenants. A tenant is a group of users that have the same view of the system. Hadoop, as an enterprise data hub, naturally demands multi-tenancy. Creating different instances of Hadoop for various users or functions is not acceptable as it makes it harder to share data across departments and creates silos.

Breaking it Down
REQUIREMENTS, each followed by the HADOOP VENDOR POINT OF VIEW
Shared, common resources: Implement single Hadoop cluster
Single instance of software: Deploy single version/distribution of Hadoop
Policy based security & isolation: Use Hadoop specific capabilities (e.g. YARN)
Sharing data without duplication: Use single storage system (i.e. HDFS on disks)

So the Recommendation Is...
Multiple Clusters to Single Cluster
Diagram labels: Mktg Prod, Mktg Test, R&D, Mfg; versions 2.2, 2.4, 2.6, 2.x
Multiple clusters: Common set of resources; Multiple instances; Different data sets with duplication
Single cluster: Common set of resources; Single instance of software (just pick one!); Single data repository

Game Over, Right?

Does This Work for You?
Does your organization allow development on production Hadoop cluster?
Do you test/evaluate new versions on your production Hadoop cluster?
Are your different lines of business OK with one release cycle?
Is your organization OK to have one and only one centralized storage system?

For Most Enterprises …

Enterprise Realities
Slide labels: Multiple Hadoop Clusters; Centralized Management; Operational Models (e.g. Dev/Test, Specialized workloads); Agility (evaluate new Hadoop ecosystem products); Data Center Infrastructure; Security & Regulations; Business flexibility & independence; Data integration & mgmt
And to spice up the mix, there are some critical real-world considerations that can’t be ignored as you plan your journey. First, multiple Hadoop clusters are a reality, but that doesn’t have to mean that each of these clusters needs dedicated hardware. Second, Big Data can no longer be implemented as islands separate from your core data center infrastructure. Then there are security and data governance considerations. So you end up in a place where business realities demand multiple Hadoop environments while at the same time requiring centralized management.

Comprehensive Multi-Tenancy
Multiple lines of business (i.e. tenants)
Multiple distribution/application versions concurrently (e.g. Dev/Test & Prod)
Multiple concurrent jobs across and/or within tenants
Multiple application workloads of different types (Hadoop, On-Hadoop, Non-Hadoop)
Security isolation between tenants (compute processing and data)
Multiple service level guarantees by tenant and/or application workload
On a shared, centrally managed infrastructure
But before we go deeper, let’s define the requirements for multi-tenancy. Tenancy is about supporting business requirements, not technical requirements: lines of business and groups of users who use Big Data to create business value. They care about a frictionless experience when working with big data. The ability to support multiple versions concurrently is critical: use cases and workloads demand features. You might think concurrent submission of jobs in a cluster and/or across clusters is a given. But there are environments out there where getting access to a cluster is like waiting for a bathroom in a gas station (in the words of an anonymous real Hadoop user). Security is a whole topic in itself. One of the core tenets of multi-tenancy is the ability to separate compute and data across tenants: strict isolation of compute across tenants as well as the ability to strictly isolate data across tenants. As such, these requirements demand the separation of Hadoop compute and storage.

Multi-Tenancy Architecture
Diagram labels: Multiple lines of business/user groups (Marketing, R&D, Manufacturing); Multiple use cases (360 customer view, Log Analysis, Predictive maintenance); Multiple ecosystem products (incl. non-Hadoop; BI/ETL tools); Compute isolation between tenants; Multiple environments per tenant (Prod, Dev/Test, POC, Prod, Dev/Test); Multiple versions and/or distributions (2.2, 2.6, 2.6, 2.5, 2.7); Shared, Centrally Managed Compute Infrastructure; Data isolation by tenant (incl. ability to physically isolate storage); Data/Storage
So here is the solution architecture that multi-tenancy demands. The starting point is a shared, centrally managed infrastructure: a collection of servers with the ability to carve out storage for each tenant. On top of that sits compute isolation, such that each tenant can have multiple Hadoop (or non-Hadoop) clusters and services that meet its business objectives.
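
The architecture on this slide can be pictured as a small data model: tenants own several clusters (each with its own environment and Hadoop version) plus a storage area that no other tenant shares. The sketch below is only an illustration under stated assumptions. The class names, the `storage_root` paths and the assignment of versions to tenants are hypothetical; the slide lists the versions but does not say which tenant runs which.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Cluster:
    name: str
    environment: str      # e.g. "Prod", "Dev/Test", "POC"
    hadoop_version: str   # e.g. "2.2", "2.6"


@dataclass
class Tenant:
    name: str             # line of business, e.g. "Marketing"
    storage_root: str     # hypothetical per-tenant storage location
    clusters: list[Cluster] = field(default_factory=list)


def shared_storage_roots(tenants: list[Tenant]) -> set[str]:
    """Return storage roots claimed by more than one tenant (data isolation violations)."""
    seen: set[str] = set()
    clashes: set[str] = set()
    for tenant in tenants:
        if tenant.storage_root in seen:
            clashes.add(tenant.storage_root)
        seen.add(tenant.storage_root)
    return clashes


def versions_in_use(tenants: list[Tenant]) -> set[str]:
    """Collect every Hadoop version running concurrently on the shared infrastructure."""
    return {c.hadoop_version for t in tenants for c in t.clusters}


# Illustrative assignment only: the version-to-tenant mapping is an assumption.
tenants = [
    Tenant("Marketing", "/tenants/marketing", [
        Cluster("mktg-prod", "Prod", "2.2"),
        Cluster("mktg-devtest", "Dev/Test", "2.6"),
    ]),
    Tenant("R&D", "/tenants/rnd", [
        Cluster("rnd-poc", "POC", "2.6"),
    ]),
    Tenant("Manufacturing", "/tenants/manufacturing", [
        Cluster("mfg-prod", "Prod", "2.5"),
        Cluster("mfg-devtest", "Dev/Test", "2.7"),
    ]),
]

print(sorted(versions_in_use(tenants)))
print(shared_storage_roots(tenants))
```

An empty result from `shared_storage_roots` means every tenant has its own storage area, which is the data isolation property the slide asks for; compute isolation would have to be enforced by the underlying platform and is not modeled here.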

Multi-Tenancy Benefits Better Infrastructure Utilization Quicker & Easier to Spin Up/Down environments for Each Tenant Separation of Compute & Storage for Scaling & Isolation Operational Efficiencies Utilize all available capacity via pooled resources Dynamically provision compute on demand Decoupling to ensure storage & compute match demand It’s all about agility & cost efficiency
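
The utilization argument can be made concrete with simple arithmetic. The sketch below compares dedicated clusters, each sized for its own peak, with a pooled infrastructure sized for the peak of the combined demand. The demand figures are invented purely for illustration and are not taken from the talk or from any customer.

```python
from __future__ import annotations


def utilization(demand_by_tenant: dict[str, list[int]]) -> tuple[float, float]:
    """Return (siloed, pooled) average utilization for per-period node demand."""
    series = list(demand_by_tenant.values())
    periods = len(series[0])
    total_demand = sum(sum(s) for s in series)

    # Siloed: every tenant owns enough nodes to cover its own peak.
    siloed_capacity = sum(max(s) for s in series)
    # Pooled: one shared pool sized for the peak of the combined demand.
    pooled_capacity = max(sum(s[p] for s in series) for p in range(periods))

    siloed = total_demand / (siloed_capacity * periods)
    pooled = total_demand / (pooled_capacity * periods)
    return siloed, pooled


# Hypothetical node demand per period; peaks fall at different times.
demand = {
    "marketing": [10, 40, 10, 10],
    "rnd": [10, 10, 40, 10],
    "manufacturing": [40, 10, 10, 10],
}

siloed, pooled = utilization(demand)
print(f"siloed: {siloed:.1%}, pooled: {pooled:.1%}")
```

The gain comes entirely from tenants peaking at different times. If all tenants peaked in the same period, the pooled capacity would equal the sum of the individual peaks and the two figures would match.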

On-Premises Options
YARN-based Hadoop cluster (physical or virtual)
Virtual machines on common, physical infrastructure
Docker containers on common, physical or virtual infrastructure
Mesos on physical or virtual to run multiple frameworks
There are several ways to accomplish this architecture. This list is by no means comprehensive; these are some of the on-premises options that are discussed.
A single YARN cluster translates to creating a multi-tenant Hadoop cluster. Considerations here include the ability to support multiple versions of Hadoop on the same cluster and the enterprise reality of requiring multiple Hadoop clusters.
Virtualized or containerized Hadoop is a clear candidate given how well virtualization, a proven technology, has served the data center for running a diverse range of enterprise workloads on a shared, common infrastructure. Done right, virtual machines and containers have been shown not only to perform better than bare-metal Hadoop for many workloads but also to offer a level of elasticity and server utilization not possible with traditional physical deployments. That said, the implementation of Big Data on VMs matters. Using VMs like physical machines and deploying Hadoop including HDFS
Mesos, an open source project that aims to be a data center operating system, is another candidate. It can run Hadoop, although this requires the resource scheduler to be modified, which does create a few considerations. You are pretty much on your own, since Hadoop distributions leverage YARN and don’t have support for Mesos. Lastly, there is a new project named Myriad that aims to run YARN-based Hadoop clusters on Mesos. This project is still incubating and will likely take several years to mature.

YARN
Highlights
Key Hadoop 2.0 enabler to run multiple compute services on a single cluster simultaneously
Complex, yet sophisticated scheduler schemes e.g. Capacity scheduler
YARN support varies by Hadoop platform provider
Considerations
Designed for one cluster, not across multiple clusters
Scheduler, not a resource manager for Hadoop only
Admin overhead with complex config’s for queues
SQL, noSQL and BI tool resources are not managed
Spark memory consumption poses issues for YARN
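
To give a feel for the queue configuration overhead mentioned above, here is a minimal sketch that derives Capacity Scheduler queue properties from a per-tenant percentage split. The property names follow the `yarn.scheduler.capacity.root...` convention used in `capacity-scheduler.xml`; the queue names and percentages are hypothetical, and a real setup would typically also need maximum capacities, ACLs and user limits for each queue.

```python
from __future__ import annotations


def capacity_scheduler_properties(shares: dict[str, int]) -> dict[str, str]:
    """Build Capacity Scheduler queue properties from a tenant -> percent mapping."""
    if sum(shares.values()) != 100:
        raise ValueError("queue capacities under root must sum to 100")

    # Declare the child queues of root, then give each its guaranteed share.
    props = {"yarn.scheduler.capacity.root.queues": ",".join(shares)}
    for queue, percent in shares.items():
        props[f"yarn.scheduler.capacity.root.{queue}.capacity"] = str(percent)
    return props


# Hypothetical tenants and shares, for illustration only.
shares = {"marketing": 40, "rnd": 35, "manufacturing": 25}
props = capacity_scheduler_properties(shares)

for name, value in props.items():
    print(f"{name} = {value}")
```

Note that this only divides one cluster among tenants. It does nothing for the other items on the slide: multiple clusters, multiple Hadoop versions, or resources consumed by SQL, noSQL and BI tools outside YARN.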

YARN
Single Hadoop Cluster = Multi-tenant infrastructure
Multiple lines of business (i.e. tenants)
✕ Multiple distribution/application versions concurrently (e.g. Dev/Test & Prod)
Multiple concurrent jobs across and/or within tenants
Multiple application workloads of different types (Hadoop and non-Hadoop)
Security isolation between tenants (compute processing and data)
Multiple service level guarantees by tenant and/or application workload

Virtual Machines
Highlights
Proven platform for running multiple workloads on common, physical infrastructure
Deploy Hadoop and non-Hadoop with ultimate flexibility e.g. multiple Hadoop clusters with elasticity
Enforce compute, memory & storage quotas by tenant and/or sub-tenants
Leverage YARN for individual cluster resource management
Considerations
HDFS implementation matters – HDFS on VMDK vs. not
Vanilla virtualization I/O subsystem not optimized for Big Data
Guest OS (VM) needs management (e.g. security patches)
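
Enforcing compute, memory and storage quotas by tenant amounts to an admission check before a new cluster is provisioned. The sketch below shows one way such a check could look. It is not the API of any virtualization product; the `Resources` type, the `can_provision` helper and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    vcpus: int
    memory_gb: int
    storage_tb: int

    def __add__(self, other: "Resources") -> "Resources":
        return Resources(
            self.vcpus + other.vcpus,
            self.memory_gb + other.memory_gb,
            self.storage_tb + other.storage_tb,
        )

    def fits_within(self, limit: "Resources") -> bool:
        # Every dimension must stay at or below the limit.
        return (
            self.vcpus <= limit.vcpus
            and self.memory_gb <= limit.memory_gb
            and self.storage_tb <= limit.storage_tb
        )


def can_provision(quota: Resources, in_use: Resources, request: Resources) -> bool:
    """Admit a request only if current usage plus the request stays within the tenant quota."""
    return (in_use + request).fits_within(quota)


# Hypothetical quota and usage for one tenant.
marketing_quota = Resources(vcpus=64, memory_gb=512, storage_tb=100)
marketing_in_use = Resources(vcpus=48, memory_gb=384, storage_tb=60)

small_cluster = Resources(vcpus=16, memory_gb=128, storage_tb=20)
large_cluster = Resources(vcpus=32, memory_gb=64, storage_tb=10)

print(can_provision(marketing_quota, marketing_in_use, small_cluster))
print(can_provision(marketing_quota, marketing_in_use, large_cluster))
```

Sub-tenant quotas could be handled the same way by running the check once against the sub-tenant limit and once against the parent tenant limit.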

Containers
Highlights
Hyper lightweight OS virtualization (no Guest OS)
Rapid adoption by enterprises for web/mobile app packaging
Deploy Hadoop/non-Hadoop (e.g. multiple Hadoop clusters) with elasticity
Packaging and deployment is elegant (e.g. Docker file)
Leverage YARN for cluster resource management
Considerations
Container management is emerging – focused on web apps
Big Data brings unique networking, storage and security requirements; Significant DIY even with open source tech
Still a developer toolkit, management layers still emerging

Mesos
Highlights
Positioned as Data Center OS – used by Twitter
Higher order resource mgmt framework compared to YARN
Supports Hadoop*, Spark* and noSQL platforms
Considerations
Distributed platforms will need to be modified (e.g. Hadoop, Spark) to use Mesos
Focus shifting to public cloud & container management
DIY platform with limited documentation
Demonstrated support is for Hadoop MRv1; Project Myriad (incubating) runs Hadoop YARN clusters on Mesos

VMs, Containers, & Mesos
Multiple lines of business (i.e. tenants)
Multiple distribution/application versions concurrently (e.g. Dev/Test & Prod)
Multiple concurrent jobs across and/or within tenants
Multiple application workloads of different types (Hadoop and non-Hadoop)
Security isolation between tenants (compute & data): let’s discuss; implementation matters
Multiple service level guarantees by tenant and/or application workload

Customer Data Points
No. 1, Predictive Ad Tech
From: Two 1000+ node clusters (open source Hadoop); Spark on YARN causes QoS issues
To: Multi-tenant infrastructure to manage QoS for different groups/jobs (work in progress)
No. 2, Fortune 100 Media & Telco
From: Started with single Hadoop YARN cluster; Post-production complexity (Dev/Test, external user groups)
To: Virtualized infrastructure to create multi-tenant Hadoop; Self-service Hadoop cluster(s) for each tenant incl. different versions
No. 3, Fortune 100 Retail
From: 100+ node cluster; Resource mgmt for bursty, seasonal ad-hoc workloads
To: Offloaded ad-hoc jobs to virtualized, multi-tenant infrastructure; User groups run I/O throttled jobs against production HDFS data

Multiple Tenants & Role Based Access Control

Common Infrastructure & Containers Shared pool of resources (e.g. servers) Containers as Hadoop/Spark nodes

Tenants with Multiple ‘Containerized’ Clusters Compute Isolation

Storage Isolation between Tenants

Multi-Tenant Hadoop Infrastructure
Takeaways
Hadoop ecosystem and complexity increasing
The adoption journey: from single to multi-workload
Versatility and complexity require a balancing act

Thank You
For more information: Visit the BlueData booth (#637)
www.bluedata.com
www.bluedata.com/free (Free Download)