Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.

Slides:



Advertisements
Similar presentations
MapReduce: Simplified Data Processing on Large Cluster Jeffrey Dean and Sanjay Ghemawat OSDI 2004 Presented by Long Kai and Philbert Lin.
Advertisements

Ali Ghodsi UC Berkeley & KTH & SICS
The Datacenter Needs an Operating System Matei Zaharia, Benjamin Hindman, Andy Konwinski, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica.
UC Berkeley Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur *, Joydeep Sen Sarma *, Scott Shenker, Ion Stoica 1 RAD Lab, * Facebook Inc.
THE DATACENTER NEEDS AN OPERATING SYSTEM MATEI ZAHARIA, BENJAMIN HINDMAN, ANDY KONWINSKI, ALI GHODSI, ANTHONY JOSEPH, RANDY KATZ, SCOTT SHENKER, ION STOICA.
UC Berkeley Spark Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.
Spark: Cluster Computing with Working Sets
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
Resource Management with YARN: YARN Past, Present and Future
Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.
Piccolo – Paper Discussion Big Data Reading Group 9/20/2010.
Why static is bad! Hadoop Pregel MPI Shared cluster Today: static partitioningWant dynamic sharing.
UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University.
Mesos A Platform for Fine-Grained Resource Sharing in Data Centers Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy.
Matei Zaharia, Dhruba Borthakur *, Joydeep Sen Sarma *, Khaled Elmeleegy +, Scott Shenker, Ion Stoica UC Berkeley, * Facebook Inc, + Yahoo! Research Delay.
Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.
Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.
UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University.
UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University.
April 30, 2014 Anthony D. Joseph CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing.
Cluster Scheduler Reference: Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center NSDI’2011 Multi-agent Cluster Scheduling for Scalability.
Outline | Motivation| Design | Results| Status| Future
Next Generation of Apache Hadoop MapReduce Arun C. Murthy - Hortonworks Founder and Architect Formerly Architect, MapReduce.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
A Platform for Fine-Grained Resource Sharing in the Data Center
CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing April 29, 2013 Anthony D. Joseph
Benjamin Hindman Apache Mesos Design Decisions
Tyson Condie.
1 CS : Project Suggestions Ion Stoica ( September 14, 2011.
UC Berkeley Spark A framework for iterative and interactive cluster computing Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.
Cloud Computing Ali Ghodsi UC Berkeley, AMPLab
CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing December 2, 2013 Anthony D. Joseph and John Canny
Eric Keller, Evan Green Princeton University PRESTO /22/08 Virtualizing the Data Plane Through Source Code Merging.
EXPOSE GOOGLE APP ENGINE AS TASKTRACKER NODES AND DATA NODES.
(Private) Cloud Computing with Mesos at Twitter Benjamin
The Limitation of MapReduce: A Probing Case and a Lightweight Solution Zhiqiang Ma Lin Gu Department of Computer Science and Engineering The Hong Kong.
임규찬. 1. Abstract 2. Introduction 3. Design Goals 4. Sample-Based Scheduling for Parallel Jobs 5. Implements.
Dominant Resource Fairness: Fair Allocation of Multiple Resource Types Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, Ion.
Matei Zaharia, Dhruba Borthakur *, Joydeep Sen Sarma *, Khaled Elmeleegy +, Scott Shenker, Ion Stoica UC Berkeley, * Facebook Inc, + Yahoo! Research Fair.
Sparrow Distributed Low-Latency Spark Scheduling Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica.
Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.
A Platform for Fine-Grained Resource Sharing in the Data Center
Experiments in Utility Computing: Hadoop and Condor Sameer Paranjpye Y! Web Search.
{ Tanya Chaturvedi MBA(ISM) Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Next Generation of Apache Hadoop MapReduce Owen
Presented by Qifan Pu With many slides from Ali’s NSDI talk Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, Ion Stoica.
Apache Mesos What is it ? Beyond Hadoop Resource Sharing Mesos Intentions Architecture Users
Dominant Resource Fairness: Fair Allocation of Multiple Resource Types Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, Ion.
Part III BigData Analysis Tools (YARN) Yuan Xue
BIG DATA/ Hadoop Interview Questions.
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center NSDI 11’ Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D.
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
COS 518: Advanced Computer Systems Lecture 13 Michael Freedman
Scalable containers with Apache Mesos and DC/OS
COS 418: Distributed Systems Lecture 23 Michael Freedman
Machine Learning Library for Apache Ignite
Introduction to Distributed Platforms
Running Multiple Schedulers in Kubernetes
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
PA an Coordinated Memory Caching for Parallel Jobs
Software Engineering Introduction to Apache Hadoop Map Reduce
Apache Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Aditya Waghaye October 3, 2016 CS848 – University.
EECS 582 Final Review Mosharaf Chowdhury EECS 582 – F16.
湖南大学-信息科学与工程学院-计算机与科学系
COS 518: Advanced Computer Systems Lecture 14 Michael Freedman
Introduction Apache Mesos is a type of open source software that is used to manage the computer clusters. This type of software has been developed by the.
Cloud Computing Large-scale Resource Management
COS 518: Distributed Systems Lecture 11 Mike Freedman
Presentation transcript:

Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica University of California, Berkeley

Summary Platform for sharing clusters across various computing frameworks Improves cluster utilization and avoids replication Introduces distributed two-level scheduling mechanism Results show near-optimal data locality

Pig Background Rapid innovation in cluster computing frameworks Dryad Pregel Percolator C IEL

Problem Rapid innovation in cluster computing frameworks No single framework optimal for all applications Want to run multiple frameworks in a single cluster »…to maximize utilization »…to share data between frameworks »…to reduce cost of computation

Where We Want to Go Hadoop Pregel MPI Shared cluster Today: static partitioningMesos: dynamic sharing

Solution Mesos is a common resource sharing layer over which diverse frameworks can run Mesos Node Hadoop Pregel … Node Hadoop Node Pregel …

Other Benefits of Mesos Run multiple instances of the same framework »Isolate production and experimental jobs »Run multiple versions of the framework concurrently Build specialized frameworks targeting particular problem domains »Better performance than general-purpose abstractions

Outline Mesos Goals and Architecture Implementation Results Related Work

Mesos Goals High utilization of resources Support diverse frameworks (current & future) Scalability to 10,000’s of nodes Reliability in face of failures Resulting design: Small microkernel-like core that pushes scheduling logic to frameworks

Design Elements Fine-grained sharing: »Allocation at the level of tasks within a job »Improves utilization, latency, and data locality Resource offers: »Simple, scalable application-controlled scheduling mechanism

Element 1: Fine-Grained Sharing Framework 1 Framework 2 Framework 3 Coarse-Grained Sharing (HPC):Fine-Grained Sharing (Mesos): + Improved utilization, responsiveness, data locality Storage System (e.g. HDFS) Fw. 1 Fw. 3 Fw. 2 Fw. 1 Fw. 3 Fw. 2 Fw. 3 Fw. 1 Fw. 2 Fw. 1 Fw. 3 Fw. 2

Element 2: Resource Offers Option: Global scheduler »Frameworks express needs in a specification language, global scheduler matches them to resources + Can make optimal decisions – Complex: language must support all framework needs – Difficult to scale and to make robust – Future frameworks may have unanticipated needs

Element 2: Resource Offers Mesos: Resource offers »Offer available resources to frameworks, let them pick which resources to use and which tasks to launch + Keeps Mesos simple, lets it support future frameworks - Decentralized decisions might not be optimal

Mesos Architecture MPI job MPI scheduler Hadoop job Hadoop scheduler Allocation module Mesos master Mesos slave MPI executor Mesos slave MPI executor task Resource offer Pick framework to offer resources to

Mesos Architecture MPI job MPI scheduler Hadoop job Hadoop scheduler Allocation module Mesos master Mesos slave MPI executor Mesos slave MPI executor task Pick framework to offer resources to Resource offer Resource offer = list of (node, availableResources) E.g. { (node1, ), (node2, ) } Resource offer = list of (node, availableResources) E.g. { (node1, ), (node2, ) }

Mesos Architecture MPI job MPI scheduler Hadoop job Hadoop scheduler Allocation module Mesos master Mesos slave MPI executor Hadoop executor Mesos slave MPI executor task Pick framework to offer resources to task Framework-specific scheduling Resource offer Launches and isolates executors

Architecture Overview Master process that maintains slave daemons Frameworks run tasks on these slaves Each Framework consists of two components  Scheduler that accepts resources  Executor that launches the framework’s tasks Does not require frameworks to specify resource requirements or constraints. How?

Optimization: Filters Let frameworks short-circuit rejection by providing a predicate on resources to be offered »E.g. “nodes from list L” or “nodes with > 8 GB RAM” »Could generalize to other hints as well Ability to reject still ensures correctness when needs cannot be expressed using filters

Resource Allocation Allocation decisions delegated to allocation module Two allocation modules implemented  Fair sharing based on generalization of max-min fairness  Strict priority based allocation Usually works for short jobs – fails for large Tasks then need to be revoked Causes problems in some frameworks – guaranteed allocation

Framework Isolation Mesos uses OS isolation mechanisms, such as Linux containers and Solaris projects Containers currently support CPU, memory, IO and network bandwidth isolation Not perfect, but much better than no isolation

Fault Tolerance Mesos master is critical Master is designed to be in a soft state Master reports executor failures to frameworks Mesos allows frameworks to register multiple schedulers for fault tolerance Result: fault detection and recovery in ~10 sec

Implementation

Implementation Stats 10,000 lines of C++ Master failover using ZooKeeper Frameworks ported: Hadoop, MPI, Torque New specialized framework: Spark, for iterative jobs (up to 10× faster than Hadoop) Open source in Apache Incubator

Users Twitter uses Mesos on > 100 nodes to run ~12 production services (mostly stream processing) Berkeley machine learning researchers are running several algorithms at scale on Spark Conviva is using Spark for data analytics UCSF medical researchers are using Mesos to run Hadoop and eventually non-Hadoop apps

Results »Utilization and performance vs static partitioning »Framework placement goals: data locality »Scalability »Fault recovery

Dynamic Resource Sharing

Mesos vs Static Partitioning Compared performance with statically partitioned cluster where each framework gets 25% of nodes FrameworkSpeedup on Mesos Facebook Hadoop Mix1.14× Large Hadoop Mix2.10× Spark1.26× Torque / MPI0.96×

Ran 16 instances of Hadoop on a shared HDFS cluster Used delay scheduling [EuroSys ’10] in Hadoop to get locality (wait a short time to acquire data-local nodes) Data Locality with Resource Offers 1.7×

Scalability Mesos only performs inter-framework scheduling (e.g. fair sharing), which is easier than intra-framework scheduling Result: Scaled to 50,000 emulated slaves, 200 frameworks, 100K tasks (30s len)

Conclusion Mesos shares clusters efficiently among diverse frameworks thanks to two design elements: »Fine-grained sharing at the level of tasks »Resource offers, a scalable mechanism for application-controlled scheduling Enables co-existence of current frameworks and development of new specialized ones In use at Twitter, UC Berkeley, Conviva and UCSF