MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, OSDI 2004. Presented by Long Kai and Philbert Lin.


Problem
Companies now have huge amounts of data. Conceptually straightforward problems become complicated when performed on massive amounts of data:
– Grep
– Sorting
How do we deal with this in a distributed setting? What could go wrong?

Solution
Restrict the programming model so that the framework can abstract away the details of distributed computing.
MapReduce:
– Two user-defined functions, map and reduce
– Provides automatic parallelization and distribution, fault tolerance, I/O scheduling, and status monitoring
– Improvements to the library help all of its users
The interface can have many implementations (a database, etc.).

Programming Model
The input is a set of key/value pairs; the output is also a set of key/value pairs.
Map:
– Takes an input pair and produces intermediate key/value pairs
– (k1, v1) → list(k2, v2)
Reduce:
– Takes one key and all of its associated intermediate values
– (k2, list(v2)) → list(v3)
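The model can be sketched as a small, single-process program, using word count as the example. This is an illustrative sketch, not the paper's actual API; the function and driver names are invented here.

```python
from collections import defaultdict

# A minimal, single-process sketch of the MapReduce model:
# map turns (k1, v1) into a list of (k2, v2), a shuffle step groups
# values by intermediate key, and reduce folds each group into v3.

def map_fn(key, value):
    """(k1, v1) -> list(k2, v2): emit (word, 1) for each word in a line."""
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    """(k2, list(v2)) -> list(v3): sum the counts for one word."""
    yield sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input pair.
    intermediate = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)  # shuffle: group values by key
    # Reduce phase: apply reduce_fn to each key's grouped values.
    return {k2: list(reduce_fn(k2, vs)) for k2, vs in intermediate.items()}

counts = run_mapreduce([("doc1", "the cat sat"), ("doc2", "the hat")],
                       map_fn, reduce_fn)
print(counts)  # {'the': [2], 'cat': [1], 'sat': [1], 'hat': [1]}
```

In the real system the shuffle is the distributed part: intermediate pairs are partitioned across machines rather than collected in one dictionary.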

MapReduce Examples
– Word Count
– Distributed Grep
– URL Access Frequencies
– Inverted Index
– Rendering Map Tiles
– PageRank

Word Count

Rendering Map Tiles

Discussion
What kinds of applications would be hard to express as a MapReduce job? Is it possible to modify the MapReduce model to make it more suitable for those applications?

Infrastructure Architecture
The interface is applicable to many implementations; the focus here is on Internet and data-center deployment.
The master controls the workers:
– A job often has 200,000 map tasks and 4,000 reduce tasks, run by 2,000 workers and only one master
– The master assigns idle workers a map or reduce task
– It coordinates information globally, such as where reducers should fetch data from
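The master's role can be sketched as a simple assignment loop. This is a hypothetical illustration of the slide's description, not Google's implementation; the class and method names are invented.

```python
from collections import deque

# Sketch of a master that hands map/reduce tasks to idle workers and
# records where each map task ran, so reducers know where to fetch
# intermediate output from.

class Master:
    def __init__(self, map_tasks, reduce_tasks):
        # Map tasks are queued ahead of reduce tasks.
        self.pending = deque(("map", t) for t in map_tasks)
        self.pending.extend(("reduce", t) for t in reduce_tasks)
        self.map_output_locations = {}  # map task -> worker that ran it

    def assign(self, worker_id):
        """Give an idle worker the next pending task, if any remain."""
        if not self.pending:
            return None
        kind, task = self.pending.popleft()
        if kind == "map":
            # Global coordination: remember where this map output lives.
            self.map_output_locations[task] = worker_id
        return (kind, task)

master = Master(map_tasks=["m0", "m1"], reduce_tasks=["r0"])
print(master.assign("worker-a"))  # ('map', 'm0')
print(master.assign("worker-b"))  # ('map', 'm1')
print(master.assign("worker-a"))  # ('reduce', 'r0')
```

The real master is more involved (it tracks task state, reschedules work from failed workers, and prefers workers that hold the input data locally), but the one-master/many-workers shape is the same.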

Execution Example

Parallel Execution

Task Granularity and Pipelining
Having many tasks means:
– Minimal time for fault recovery
– Better pipelining of the shuffle with map execution
– Better load balancing

Performance
Sorted 1 TB in 891 seconds with 1,800 nodes
– 1 TB in 68 seconds on 1,000 nodes (2008)
– 1 PB in 33 minutes on 8,000 nodes (2011)
Fault tolerance:
– With 200 machines killed, only a 5% increase in running time
– Once lost 1,600 machines, but was still able to finish the job

Discussion
What happens if the underlying cluster is not homogeneous? (Rajashekhar Arasanal)
Can we go further with locality? In an application where reduce tasks don't always read from all of the map tasks, could the reduce tasks be scheduled to save bandwidth? (Fred Douglas)

Bottlenecks
The reduce stage cannot start until the final map task is done.
Long startup latency.
Not the best tool for every job:
– Or do we just make everything a nail? This leads to Mesos.
Not designed for iterative algorithms, which leads to Spark:
– Unnecessary movement of intermediate data
Moves computation to the data:
– Not good for sorting, where the data must move anyway
– "If you have two big data sets and you want to join them, you have to move the data somehow." – Microsoft Research

Related Work
Parallel processing:
– MPI (1999)
– Bulk Synchronous Programming (1997)
Iterative:
– Spark (2011)
Stream:
– S4 (2010)
– Storm (2011)

Conclusions
A useful programming model and abstraction that has changed the way industry processes massive amounts of data.
Still heavily in use at Google today, and many companies use Hadoop MapReduce.
Shows the need for frameworks that deal with the intricacies of distributed computing.

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica

Diversified Computation Frameworks
No single framework is optimal for all applications.

Questions
Should we share a cluster between multiple computation jobs? More specifically, what kinds of resources do we want to share?
– If we have different frameworks for different applications, why would we expect to share data among them? (Fred)
If so, should we partition resources statically or dynamically?

Motivation

Mesos
Mesos is a common resource-sharing layer over which diverse frameworks can run.

Other Benefits
Run multiple instances of the same framework:
– Isolate production and experimental jobs
– Run multiple versions of the framework concurrently
Build specialized frameworks targeting particular problem domains:
– Better performance than general-purpose abstractions

Requirements
– High utilization of resources
– Support for diverse frameworks
– Scalability
– Reliability (failure tolerance)
What does it need to do? Scheduling of computation tasks.

Design Choices
Fine-grained sharing:
– Allocation at the level of tasks within a job
– Improves utilization, latency, and data locality
Resource offers:
– Push the scheduling logic to the frameworks
– A simple, scalable, application-controlled scheduling mechanism

Fine-Grained Sharing
Improves utilization and responsiveness.

Resource Offers
Mesos negotiates with frameworks to reach an agreement:
– Mesos only performs inter-framework scheduling (e.g. fair sharing), which is easier than intra-framework scheduling
– It offers available resources to frameworks and lets them pick which resources to use and which tasks to launch
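The split between inter-framework and intra-framework scheduling can be sketched as follows. This is a simplified illustration of the offer idea, not the Mesos API; all class names, parameters, and the round-robin offer order are assumptions made here.

```python
# Sketch of resource offers: the platform (inter-framework scheduling)
# decides which framework gets offered the free resources; each
# framework (intra-framework scheduling) decides which part of an
# offer to accept and which tasks to launch on it.

class Framework:
    def __init__(self, name, cpus_per_task, num_tasks):
        self.name = name
        self.cpus_per_task = cpus_per_task
        self.remaining = num_tasks

    def resource_offer(self, offered_cpus):
        """Framework-side logic: accept as much of the offer as this
        framework can use; return launched tasks and leftover CPUs."""
        launched = []
        while self.remaining > 0 and offered_cpus >= self.cpus_per_task:
            offered_cpus -= self.cpus_per_task
            self.remaining -= 1
            launched.append((self.name, f"task-{self.remaining}"))
        return launched, offered_cpus

def offer_round(total_cpus, frameworks):
    # Platform-side logic: offer the free resources to each framework
    # in turn; whatever one declines is offered to the next.
    launched = []
    free = total_cpus
    for fw in frameworks:
        tasks, free = fw.resource_offer(free)
        launched.extend(tasks)
    return launched

fws = [Framework("hadoop", cpus_per_task=2, num_tasks=2),
       Framework("mpi", cpus_per_task=4, num_tasks=2)]
print(offer_round(total_cpus=8, frameworks=fws))
# [('hadoop', 'task-1'), ('hadoop', 'task-0'), ('mpi', 'task-1')]
```

The key design point survives even in this toy version: Mesos never needs to understand each framework's scheduling policy, because the framework itself decides what to do with each offer.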

Resource Offers

Questions
Mesos separates inter-framework scheduling from intra-framework scheduling. Does this cause problems? Would it be better to make Mesos aware of the intra-framework scheduling policy and perform it as well? Can multiple frameworks coordinate with each other on scheduling without resorting to a centralized inter-framework scheduler?
– Rajashekhar Arasanal
– Steven Dalton

Reliability: Fault Tolerance
The Mesos master holds only soft state: the list of currently running frameworks and tasks.
This state is rebuilt when frameworks and slaves re-register with the new master after a failure.
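The soft-state idea can be illustrated with a small sketch: a fresh master starts empty and reconstructs its view entirely from re-registration messages. The class and method names here are invented for illustration, not taken from Mesos.

```python
# Sketch of soft-state recovery: because the master's state is just a
# cache of what frameworks and slaves already know, a new master can
# rebuild it from their re-registrations after a failover, with no
# persistent storage of its own.

class MesosMaster:
    def __init__(self):
        # Soft state only: nothing survives a master crash.
        self.frameworks = {}  # framework id -> set of running task ids

    def register_framework(self, fw_id, running_tasks):
        # On (re-)registration, the framework reports its running tasks.
        self.frameworks.setdefault(fw_id, set()).update(running_tasks)

old = MesosMaster()
old.register_framework("spark", {"t1", "t2"})

# The master fails; a fresh master rebuilds the same state purely
# from the re-registration messages of the surviving framework.
new = MesosMaster()
new.register_framework("spark", {"t1", "t2"})
assert new.frameworks == old.frameworks
```

The design choice this illustrates: keeping only soft state makes master failover cheap, at the cost of a recovery window while clients re-register.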

Evaluation

Mesos vs. Static Partitioning
Compared performance against a statically partitioned cluster where each framework gets 25% of the nodes.

Questions
Is Mesos a general solution for sharing a cluster among multiple computation frameworks?
– Matt Sinclair
– Holly Decker
– Steven Dalton

Conclusion
Mesos is a platform for sharing commodity clusters among multiple cluster-computing frameworks.
Fine-grained sharing and resource offers have been shown to achieve better utilization.