MapReduce Online Tyson Condie and Neil Conway UC Berkeley Joint work with Peter Alvaro, Rusty Sears, Khaled Elmeleegy (Yahoo! Research), and Joe Hellerstein.


MapReduce Programming Model
Programmers think in a data-centric fashion
– Apply transformations to data sets
The MR framework handles the Hard Stuff:
– Fault tolerance
– Distributed execution, scheduling, concurrency
– Coordination
– Network communication
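To make the data-centric style concrete, here is a minimal sketch of the programming model in Python (Hadoop itself is Java; this toy "framework" only illustrates how user map and reduce functions compose with the shuffle the framework provides):

```python
# A toy word count in the MapReduce style. The user writes map_fn and
# reduce_fn; run_job stands in for the framework's shuffle/group step.
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Emit a (word, 1) record for every word in an input record.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Sum the counts for one record group (all records with one key).
    return (key, sum(values))

def run_job(lines):
    # The "framework" part: sort by key (shuffle) and group.
    pairs = sorted(kv for line in lines for kv in map_fn(line))
    return [reduce_fn(k, [v for _, v in grp])
            for k, grp in groupby(pairs, key=itemgetter(0))]

counts = dict(run_job(["a rose is a rose"]))
```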

MapReduce System Model
Designed for batch-oriented computations over large data sets
– Each operator runs to completion before producing any output
– Operator output is written to stable storage
  – Map output to local disk, reduce output to HDFS
Simple, elegant fault tolerance model: operator restart
– Critical for large clusters

Life Beyond Batch Processing
Can we apply the MR programming model outside batch processing?
Two domains of interest:
1. Interactive data analysis
– Enabled by high-level MR query languages, e.g. Hive, Pig, Jaql
– Batch processing is a poor fit
2. Continuous analysis of data streams
– Batch processing adds massive latency
– Requires saving and reloading analysis state

MapReduce Online
Pipeline data between operators as it is produced
– Decouple the computation schedule (logical) from the data transfer schedule (physical)
Hadoop Online Prototype (HOP): Hadoop with pipelining support
– Preserves the Hadoop interfaces and APIs
– Challenge: retain the elegant fault tolerance model
Enables approximate answers and stream processing
– Can also reduce the response times of jobs

Outline
1. Hadoop Background
2. HOP Architecture
3. Online Aggregation
4. Stream Processing with MapReduce
5. Future Work and Conclusion

Hadoop Architecture
Hadoop MapReduce
– Single master node, many worker nodes
– Client submits a job to the master node
– Master splits each job into tasks (map/reduce), and assigns tasks to worker nodes
Hadoop Distributed File System (HDFS)
– Single name node, many data nodes
– Files stored as large, fixed-size (e.g. 64MB) blocks
– HDFS typically holds map input and reduce output

Job Scheduling
One map task for each block of the input file
– Applies the user-defined map function to each record in the block
– Record = key/value pair
User-defined number of reduce tasks
– Each reduce task is assigned a set of record groups
  – Record group = all records with the same key
– For each group, apply the user-defined reduce function to the record values in that group
Reduce tasks read from every map task
– Each read returns the record groups for that reduce task
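A short sketch of how record groups end up at reduce tasks: all records with the same key must hash to the same partition, one partition per reduce task (Hadoop's default partitioner is a hash of the key modulo the number of reduces; the helper below is an illustration of that idea, not Hadoop's code):

```python
# Assign records to reduce tasks so that a record group (one key)
# always lands at exactly one reduce task.
def partition(key, num_reduce_tasks):
    return hash(key) % num_reduce_tasks

records = [("apple", 1), ("banana", 1), ("apple", 2)]
num_reduces = 4

parts = {}
for k, v in records:
    parts.setdefault(partition(k, num_reduces), []).append((k, v))

# Both "apple" records fall in the same partition, so a single
# reduce task sees the whole record group.
apple_part = partition("apple", num_reduces)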

Dataflow in Hadoop
Map tasks write their output to local disk
– Output is available only after the map task has completed
Reduce tasks write their output to HDFS
– Once a job is finished, the next job’s map tasks can be scheduled, and will read their input from HDFS
Therefore, fault tolerance is simple: simply re-run tasks on failure
– No consumers see partial operator output

Dataflow in Hadoop (animation)
[Diagram series: the client submits a job and the master schedules map and reduce tasks; map tasks read input blocks from HDFS; each finished map writes its output to the local FS and reports the location to the master; reducers fetch that output via HTTP GET; finally, reduce writes the final answer to HDFS.]

Hadoop Online Prototype (HOP)

Hadoop Online Prototype
HOP supports pipelining within and between MapReduce jobs: push rather than pull
– Preserves the simple fault tolerance scheme
– Improved job completion time (better cluster utilization)
– Improved detection and handling of stragglers
MapReduce programming model unchanged
– Clients supply the same job parameters
Hadoop client interface backward compatible
– No changes required to existing clients, e.g. Pig, Hive, Sawzall, Jaql
– Extended to take a series of jobs

Pipelining Batch Size
Initial design: pipeline eagerly (for each row)
– Prevents use of the combiner
– Moves more sorting work to the mapper
– Map function can block on network I/O
Revised design: map writes into a buffer
– Spill thread: sort & combine the buffer, spill to disk
– Send thread: pipeline spill files to the reducers
– Simple adaptive algorithm
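The revised design can be sketched as follows. This is an illustrative toy, not HOP's implementation: `emit` is the map-side output call, the buffer threshold and combiner are stand-ins, and the real spill/send work happens on separate threads rather than inline:

```python
# Revised pipelining design: map output goes into a buffer; when the
# buffer fills, it is sorted and run through a combiner ("spill"),
# and each spill file would then be pipelined to the reducers.
from itertools import groupby
from operator import itemgetter

BUFFER_LIMIT = 4  # tiny, for illustration only

def combine(pairs):
    # Sort, then sum values per key (a word-count-style combiner).
    return [(k, sum(v for _, v in grp))
            for k, grp in groupby(sorted(pairs), key=itemgetter(0))]

buffer, spills = [], []

def emit(key, value):
    buffer.append((key, value))
    if len(buffer) >= BUFFER_LIMIT:
        spills.append(combine(buffer))  # spill: sort + combine
        buffer.clear()                  # spill would now be sent to reducers

for w in "a b a b a c".split():
    emit(w, 1)
```

Buffering restores the combiner (each spill is pre-aggregated) and keeps sorting out of the map function's critical path, which the eager per-row design could not do.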

Fault Tolerance
Fault tolerance in MR is simple and elegant
– Simply recompute on failure; no state recovery
Initial design for pipelining FT:
– Reduce treats in-progress map output as tentative
Revised design:
– Pipelining maps periodically checkpoint their output
– Reducers can consume output up to the checkpoint
– Bonus: improved speculative execution

Dataflow in HOP (animation)
[Diagram: the master schedules map and reduce tasks; reducers send pipeline requests, and map output locations are delivered to reducers as the data is produced.]

Online Aggregation
Traditional MR: poor UI for data analysis
Pipelining means that data is available at consumers “early”
– Can be used to compute and refine an approximate answer
– Often sufficient for interactive data analysis, developing new MapReduce jobs, ...
Within a single job: periodically invoke the reduce function at each reduce task on the available data
Between jobs: periodically send a “snapshot” to consumer jobs

Intra-Job Online Aggregation
Approximate answers published to HDFS by each reduce task
– Based on job progress: e.g. 10%, 20%, ...
Challenge: providing statistically meaningful approximations
– How close is an approximation to the final answer?
– How do you avoid biased samples?
Challenge: reduce functions are opaque
– Ideally, computing the 20% approximation should reuse the results of the 10% approximation
– Either use combiners, or HOP does redundant work
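The idea of snapshots at progress points can be sketched in a few lines. This is a deliberately simplified illustration (a mean over a list, with hypothetical 20%/50%/100% progress points); real HOP snapshots are produced by rerunning the user's reduce function over whatever map output has arrived:

```python
# Intra-job online aggregation, in miniature: at each progress
# checkpoint, run the aggregate over the data seen so far and
# publish it as an approximate answer; later snapshots refine it.
def snapshot_mean(values_seen):
    return sum(values_seen) / len(values_seen)

stream = [10, 12, 8, 10, 11, 9, 10, 10, 10, 10]

snapshots = {}
for pct in (20, 50, 100):           # hypothetical progress points
    n = len(stream) * pct // 100    # records available at this point
    snapshots[pct] = snapshot_mean(stream[:n])
```

Note how the early snapshots can be biased by whichever records happen to arrive first, which is exactly the "biased samples" challenge above.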

Online Aggregation in HOP
[Diagram: map tasks read input blocks from HDFS; reduce tasks periodically write snapshot answers to HDFS.]

Inter-Job Online Aggregation
[Diagram: Job 1 reducers pipeline their output to Job 2 mappers; Job 2 writes its answer to HDFS.]

Inter-Job Online Aggregation
Like intra-job OA, but approximate answers are pipelined to the map tasks of the next job
– Requires co-scheduling a sequence of jobs
Consumer job computes an approximation
– Can be used to feed an arbitrary chain of consumer jobs with approximate answers
Challenge: how to avoid redundant work
– Output of reduce for 10% progress vs. for 20%

Example Scenario
Top K most-frequent-words in a 5.5GB Wikipedia corpus (implemented as 2 MR jobs)
60 node EC2 cluster

Stream Processing
MapReduce is often applied to streams of data that arrive continuously
– Click streams, network traffic, web crawl data, ...
Traditional approach: buffer, batch process
1. Poor latency
2. Analysis state must be reloaded for each batch
Instead, run MR jobs continuously, and analyze data as it arrives

Why?
Why use MapReduce for stream processing?
1. Many existing MR use cases are a good fit
2. Ability to run user-defined code
– Machine learning, graph analysis, unstructured data
3. Massive scale + low-latency analysis
4. Use existing MapReduce tools and libraries

Stream Processing with HOP
Map and reduce tasks run continuously
Reduce function divides the stream into windows
– “Every 30 seconds, compute the 1, 5, and 15 minute average network utilization; trigger an alert if ...”
– Window management is done by the user (reduce)
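Since windowing lives in user code, a continuous reduce might bucket timestamped records itself. A minimal sketch of such user-level window management (the `window_averages` helper and 30-second fixed windows are illustrative, not a HOP API):

```python
# User-level windowing inside a continuous reduce: bucket timestamped
# records into fixed 30-second windows and average each window.
from collections import defaultdict

WINDOW = 30  # window length in seconds

def window_averages(records):
    # records: iterable of (timestamp_seconds, value)
    windows = defaultdict(list)
    for ts, v in records:
        windows[ts // WINDOW].append(v)   # bucket by window id
    # Key each result by the window's start time.
    return {wid * WINDOW: sum(vs) / len(vs) for wid, vs in windows.items()}

avgs = window_averages([(0, 4), (10, 6), (35, 8), (59, 2)])
```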

Stream Processing Challenges
1. How to store stream input?
– HDFS is not ideal
2. Fault tolerance for long-running tasks
– Operator restart becomes increasingly expensive
3. Elastic scale-up / scale-down during an MR job

#1: Storing Stream Input
Current approach: colocate the map task and the data producer
– Apply the map function, partition => reduce task
– Fault tolerance: fate sharing
– “Pushdown” of predicates and scalar transforms
– Total order = single reduce task
User-defined code at the data producer = bad?
– Fault-tolerant “buffer” (map task), coordination

#2: Fault Tolerance for Streams
Operator restart for long-running reduces: too expensive
Hence, window-oriented fault tolerance
– Reducers label windows with IDs
– Mappers use window IDs to garbage collect spills
Probably need a fault-tolerant Job Tracker and HDFS Name Node

#3: Intra-Job Elasticity
Peak load != average load
– Increasingly important as job duration grows
Solution: consistent hashing over the reduce key space
– Job Tracker manages the reduce key => task mapping
Useful for regular Hadoop as well
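Why consistent hashing helps here: when a reduce task is added or removed, only the keys adjacent to it on the hash ring move, so most reducer state stays put. A sketch under simplified assumptions (one ring point per task and no virtual nodes; the `Ring` class is illustrative, not HOP's):

```python
# Consistent hashing over the reduce key space: each reduce task owns
# a point on a ring; a key belongs to the first task clockwise from
# its hash. Adding a task only steals keys from one neighbor.
import bisect
import hashlib

def h(s):
    # Stable hash (Python's built-in hash() is randomized per process).
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, tasks):
        self.points = sorted((h(t), t) for t in tasks)

    def owner(self, key):
        hashes = [p for p, _ in self.points]
        i = bisect.bisect_right(hashes, h(key)) % len(self.points)
        return self.points[i][1]

keys = ("alpha", "beta", "gamma", "delta")
ring = Ring(["reduce-0", "reduce-1", "reduce-2"])
before = {k: ring.owner(k) for k in keys}

ring = Ring(["reduce-0", "reduce-1", "reduce-2", "reduce-3"])  # scale up
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

The property the test below checks is the whole point: any key that moved must have moved to the newly added task, never between pre-existing tasks.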

Other HOP Benefits
Shorter job completion time via improved cluster utilization: reduce work starts early
– Important for high-priority jobs, interactive jobs
Adaptive load management
– Better detection and handling of “straggler” tasks
– Elastic scale-up/scale-down: better pre-emption
– Decouples the unit of data transfer from the unit of scheduling
  – E.g. Yahoo! Petasort: 15GB/map task

Sort Performance: Blocking
60 node EC2 cluster, 5.5GB input file
40 map tasks, 59 reduce tasks

Sort Performance: Pipelining 927 seconds vs. 610 seconds

Future Work
1. Basic pipelining
– Performance analysis at scale (e.g. PetaSort)
– Job scheduling is much harder
2. Online Aggregation
– Statistically-robust estimation
– Better UI for approximate results
3. Stream Processing
– Develop into a full-fledged stream processing engine
– Stream support for high-level query languages
– Online machine learning

Thanks! Questions? Source code and technical report: Contact:

Map Task Execution
1. Map phase
– Read the assigned input split from HDFS
  – Split = file block by default
– Parse the input into records (key/value pairs)
– Apply the map function to each record
  – Returns zero or more new records
2. Commit phase
– Register the final output with the slave node
  – Stored in the local filesystem as a file
  – Sorted first by bucket number, then by key
– Inform the master node of its completion

Reduce Task Execution
1. Shuffle phase
– Fetch input data from all map tasks
  – The portion corresponding to the reduce task’s bucket
2. Sort phase
– Merge-sort *all* map outputs into a single run
3. Reduce phase
– Apply the user reduce function to the merged run
  – Arguments: key and the corresponding list of values
– Write output to a temp file in HDFS
  – Atomic rename when finished
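The sort and reduce phases above can be sketched compactly: because each map task's output arrives already sorted (by bucket, then key), the reducer only needs a k-way merge followed by grouping. A minimal Python illustration of that merge-then-group step:

```python
# Reduce-side sort phase: merge pre-sorted map outputs into a single
# run, then group by key so the reduce function sees (key, [values]).
import heapq
from itertools import groupby
from operator import itemgetter

map_outputs = [                 # each map task's output is pre-sorted
    [("a", 1), ("c", 3)],
    [("a", 2), ("b", 5)],
]

merged = list(heapq.merge(*map_outputs))   # single sorted run, O(n log k)
grouped = {k: [v for _, v in grp]
           for k, grp in groupby(merged, key=itemgetter(0))}
```

`heapq.merge` streams the inputs rather than materializing them, which mirrors why merge-sort (rather than a full in-memory sort) is the natural fit for map outputs that are too large for RAM.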

Design Implications
1. Fault Tolerance
– Tasks that fail are simply restarted
– No further steps required, since nothing left the task
2. “Straggler” handling
– Job response time is affected by slow tasks
– Slow tasks get executed redundantly
  – Take the result from the first to finish
  – Assumes the slowdown is due to physical components (e.g., network, host machine)
Pipelining can support both!

Fault Tolerance in HOP
Traditional fault tolerance algorithms for pipelined dataflow systems are complex
HOP approach: write to disk and pipeline
– Producers write data into an in-memory buffer
– The in-memory buffer is periodically spilled to disk
– Spills are sent to consumers
– Consumers treat pipelined data as “tentative” until the producer is known to have completed
– Fault tolerance via task restart; tentative output is discarded

Refinement: Checkpoints
Problem: treating output as tentative inhibits parallelism
Solution: producers periodically “checkpoint” with the Hadoop master node
– “Output split x corresponds to input offset y”
– Pipelined data <= split x is now non-tentative
– Also improves speculation for straggler tasks, and reduces redundant work on task failure