MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, Operating Systems Design and Implementation (OSDI) 2004. Presented by P76001027 謝光昱 and P76011284 陳志豪.

Outline: Introduction, Programming Model, Implementation, Refinements, Performance, Experience, Conclusions

Introduction. Motivation: most computations are conceptually straightforward, but the input data is usually large and the computation has to finish in a reasonable amount of time. Problem: the need to parallelize the computation, distribute the data, and handle failures obscures the original simple computation with large amounts of complex code. Solution: design a new abstraction, MapReduce.

Programming Model. The computation takes a set of input key/value pairs and produces a set of output key/value pairs. Map: takes an input pair and produces a set of intermediate key/value pairs; the library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function. Reduce: accepts an intermediate key I and a set of values for that key, and merges these values together to form a possibly smaller set of values.

Programming Model: Example – WordCount

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
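
The pseudocode above leaves the surrounding library implicit. Below is a minimal, single-process C++ sketch of the same word-count logic; the global intermediate map here is an illustrative stand-in for the library's grouping and shuffling, not the real MapReduce API.

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    // Stand-in for the library's grouping of intermediate pairs by key.
    std::map<std::string, std::vector<std::string>> intermediate;

    // Map: emit ("word", "1") for every word in the document contents.
    void MapFn(const std::string& key, const std::string& value) {
        std::istringstream words(value);
        std::string w;
        while (words >> w) intermediate[w].push_back("1");  // EmitIntermediate(w, "1")
    }

    // Reduce: sum the counts emitted for one word.
    void ReduceFn(const std::string& key, const std::vector<std::string>& values) {
        int result = 0;
        for (const auto& v : values) result += std::stoi(v);  // ParseInt(v)
        std::cout << key << "\t" << result << "\n";           // Emit(AsString(result))
    }

    int main() {
        MapFn("doc1", "the quick brown fox jumps over the lazy dog");
        MapFn("doc2", "the dog barks");
        for (const auto& kv : intermediate) ReduceFn(kv.first, kv.second);
    }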

Implementation: Execution Overview. (Figure from the paper: the user program forks a master and many workers; the master assigns map tasks and reduce tasks. The input files are divided into splits (Split 0 … Split 4). In the map phase, workers read their splits and write intermediate files to their local disks; in the reduce phase, workers remote-read that intermediate data and write the final output files (Output file 0, Output file 1).)
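
A hedged, single-process sketch of that flow (all names and data here are illustrative, not the library's interfaces): the input is split into M pieces, each map task partitions its intermediate pairs into R buckets, and each reduce task consumes one bucket and produces one output file.

    #include <functional>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    int main() {
        const int M = 3, R = 2;                                       // map tasks and reduce tasks
        std::vector<std::string> splits = {"a b a", "c b", "a c c"};  // stand-in input splits

        // Map phase: each of the M splits emits (word, "1") pairs, partitioned into
        // R buckets by hash(key) mod R (intermediate files on local disk in the real
        // system; in-memory buckets here).
        std::vector<std::map<std::string, std::vector<std::string>>> buckets(R);
        std::hash<std::string> hash_fn;
        for (int m = 0; m < M; ++m) {
            std::istringstream words(splits[m]);
            std::string w;
            while (words >> w)
                buckets[hash_fn(w) % R][w].push_back("1");
        }

        // Reduce phase: each of the R reduce tasks reads its bucket (a remote read
        // in the real system), sums the counts, and writes one output file.
        for (int r = 0; r < R; ++r) {
            std::cout << "output file " << r << ":\n";
            for (const auto& kv : buckets[r]) {
                int result = 0;
                for (const auto& v : kv.second) result += std::stoi(v);
                std::cout << "  " << kv.first << "\t" << result << "\n";
            }
        }
    }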

Implementation: Fault Tolerance (Worker Failure). The master pings every worker periodically; a worker that stops responding is marked as failed. Any map tasks completed by the failed worker are reset back to the idle state and rescheduled on other workers, because their output lives on the failed machine's local disk.
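
A sketch of the master-side bookkeeping this implies (the task table and state names are illustrative, not the actual data structures): when a worker stops answering pings, its in-progress tasks, and also the map tasks it already completed, go back to the idle state so they can be rescheduled.

    #include <map>
    #include <string>

    enum class TaskState { kIdle, kInProgress, kCompleted };

    struct Task {
        TaskState state = TaskState::kIdle;
        std::string worker;   // worker the task was assigned to
        bool is_map = false;  // map output lives on the worker's local disk
    };

    // Invoked when a worker has failed to answer the master's pings.
    void HandleWorkerFailure(std::map<int, Task>& tasks, const std::string& dead_worker) {
        for (auto& entry : tasks) {
            Task& task = entry.second;
            if (task.worker != dead_worker) continue;
            bool lost_in_progress = task.state == TaskState::kInProgress;
            // Completed map tasks are also redone: their output was stored on the dead
            // worker's local disk. Completed reduce output is in the global file system
            // and survives.
            bool lost_map_output = task.is_map && task.state == TaskState::kCompleted;
            if (lost_in_progress || lost_map_output) {
                task.state = TaskState::kIdle;   // eligible for rescheduling
                task.worker.clear();
            }
        }
    }

    int main() {
        std::map<int, Task> tasks;
        tasks[0].state = TaskState::kCompleted;  tasks[0].worker = "worker-7";  tasks[0].is_map = true;
        tasks[1].state = TaskState::kInProgress; tasks[1].worker = "worker-7";
        HandleWorkerFailure(tasks, "worker-7");  // both tasks reset to idle
    }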

Implementation: Fault Tolerance (Master Failure). The master writes periodic checkpoints of its data structures, so a new copy can be restarted from the last checkpointed state. Since there is only a single master, its failure is unlikely.

Implementation: Locality. To conserve network bandwidth, the input data is stored on the local disks of the cluster machines, in 64MB blocks with copies on several machines; the master tries to schedule each map task on a machine that holds a copy of the corresponding input data.

Implementation: Backup Tasks. The total time is lengthened by stragglers. Straggler: a machine that takes an unusually long time to complete one of the last tasks in the computation, e.g. a machine with a bad disk that has slow read performance. When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. The task is marked as completed whenever either the primary or the backup execution completes.
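
A minimal sketch of that completion rule (all names illustrative): the master tracks which executions of a task are running, and whichever copy finishes first marks the task completed; results from the other copy are discarded.

    #include <iostream>
    #include <set>
    #include <string>

    struct TaskAttempts {
        bool completed = false;
        std::set<std::string> running_on;   // primary execution plus any backups
    };

    // Called when either the primary or a backup execution reports success.
    void OnExecutionFinished(TaskAttempts& task, const std::string& worker) {
        if (task.completed) return;      // the other copy already won the race; drop this result
        task.completed = true;           // first copy to finish marks the task completed
        task.running_on.erase(worker);   // anything left in running_on is redundant work
    }

    int main() {
        TaskAttempts t;
        t.running_on = {"primary-worker", "backup-worker"};   // backup scheduled near job end
        OnExecutionFinished(t, "backup-worker");              // the backup finished first
        std::cout << (t.completed ? "task completed\n" : "task still in progress\n");
    }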

Refinements: Partitioning Function. Intermediate data gets partitioned across the R reduce tasks using a partitioning function on the intermediate key. The default partitioning function uses hashing, e.g. hash(key) mod R.
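
Users can also supply their own partitioning function when the default does not group data the way they want; the paper's example hashes only the hostname of a URL key so that all pages from the same host end up in the same output file. A small sketch (the hostname parsing below is a rough illustrative stand-in):

    #include <functional>
    #include <iostream>
    #include <string>

    // Default partitioner: hash(key) mod R.
    int DefaultPartition(const std::string& key, int R) {
        return std::hash<std::string>{}(key) % R;
    }

    // Custom partitioner: hash only the hostname of a URL key, so every URL from
    // the same host lands in the same reduce partition / output file.
    int HostPartition(const std::string& url_key, int R) {
        std::string rest = url_key.substr(url_key.find("://") + 3);
        std::string host = rest.substr(0, rest.find('/'));
        return std::hash<std::string>{}(host) % R;
    }

    int main() {
        const int R = 8;
        std::cout << DefaultPartition("http://a.com/x", R) << "\n";
        std::cout << HostPartition("http://a.com/x", R) << " "
                  << HostPartition("http://a.com/y", R) << "\n";  // same partition
    }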

Refinements: Combiner Function. There is often significant repetition in the intermediate keys produced by each map task (word frequencies tend to follow a Zipf distribution, so word count emits many identical ("the", 1) pairs). A combiner function partially merges this data on the map worker before it is sent over the network. The combiner is typically the same code as the reduce function; the only difference is what happens to the output: combiner output is written to an intermediate file, while reduce output is written to the final output file.
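
A sketch of what the combiner buys, mirroring the slide's a/b/c example (illustrative, not the library interface): the combiner runs on the map worker and partially merges repeated keys before anything crosses the network; the reducer later merges the partial sums from all map tasks.

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Combiner: partially sums counts on the map worker before the shuffle. Same
    // logic as the reducer, but its output goes to an intermediate file rather
    // than to the final output file.
    std::map<std::string, int> Combine(const std::vector<std::pair<std::string, int>>& emitted) {
        std::map<std::string, int> partial;
        for (const auto& kv : emitted) partial[kv.first] += kv.second;
        return partial;
    }

    int main() {
        // One map task emitted these pairs: a b b a c c b c.
        std::vector<std::pair<std::string, int>> emitted = {
            {"a", 1}, {"b", 1}, {"b", 1}, {"a", 1}, {"c", 1}, {"c", 1}, {"b", 1}, {"c", 1}};
        for (const auto& kv : Combine(emitted))
            std::cout << kv.first << "\t" << kv.second << "\n";   // a 2, b 3, c 3
        // Only three pairs now cross the network instead of eight.
    }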

Refinements: Skipping Bad Records. Some bugs in user code cause the Map or Reduce function to crash deterministically on certain records and prevent a MapReduce operation from completing. Sometimes fixing the bug is not feasible, e.g. when it is in source code that is unavailable, and sometimes it is acceptable to ignore a few records, e.g. in statistical analysis. Method: each worker process installs a signal handler to catch segmentation violations and bus errors and reports the sequence number of the offending record to the master. If there has been more than one failure on a particular record, the master indicates that it should be skipped when it issues the next re-execution.
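
A sketch of the master-side half of this mechanism only (names illustrative; the worker-side signal handler that reports the record's sequence number before the process dies is not shown): after more than one failure on the same record, the master marks it to be skipped on the next re-execution.

    #include <iostream>
    #include <map>
    #include <set>

    std::map<long, int> failure_count;   // record sequence number -> observed crashes
    std::set<long> records_to_skip;      // sent to workers with the next re-execution

    // Called when a crashing worker (via its signal handler) reported the sequence
    // number of the record it was processing.
    void OnRecordFailure(long record_seq) {
        if (++failure_count[record_seq] > 1)
            records_to_skip.insert(record_seq);   // more than one failure: skip it next time
    }

    int main() {
        OnRecordFailure(42);   // first crash on record 42: it is simply retried
        OnRecordFailure(42);   // second crash: record 42 will be skipped
        std::cout << "skipping " << records_to_skip.size() << " record(s)\n";
    }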

Refinements: Status Information. The master runs an internal HTTP server and exports a set of status pages for human consumption.

Refinements: Counters. The MapReduce library provides a counter facility to count occurrences of various events; e.g. user code may want to count the total number of words processed.
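
A sketch of how user code might use such a counter inside its map function (the Counters() helper is an illustrative stand-in for the library's counter objects; in the real system each worker's counts are periodically propagated to the master and aggregated there):

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    // Stand-in for the library's counter facility: per-worker counts that the
    // master aggregates across all map and reduce tasks.
    std::map<std::string, long>& Counters() {
        static std::map<std::string, long> counters;
        return counters;
    }

    void MapFn(const std::string& doc_name, const std::string& contents) {
        std::istringstream words(contents);
        std::string w;
        while (words >> w) {
            Counters()["words-processed"]++;   // user-defined counter
            // ... EmitIntermediate(w, "1") as in the word-count example ...
        }
    }

    int main() {
        MapFn("doc1", "one two three");
        std::cout << "words-processed = " << Counters()["words-processed"] << "\n";  // 3
    }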

Performance: Cluster Configuration. Approximately 1800 machines; each machine had two 2GHz Intel Xeon processors with Hyper-Threading enabled, 4GB of memory, two 160GB IDE disks, and a gigabit Ethernet link. The machines were arranged in a two-level tree-shaped switched network.

Performance: Grep. Scans through 10^10 100-byte records, searching for a relatively rare three-character pattern. The input is split into approximately 64MB pieces (M = 15000), and the entire output is placed in one file (R = 1). Startup overhead before the scan rate ramps up includes propagating the program to the workers and interacting with GFS to open the input files.

Performance: Sort. The program sorts 10^10 100-byte records. The input data is split into 64MB pieces (M = 15000), and the sorted output is partitioned into 4000 files (R = 4000).

Performance: Sort (normal execution). Observations from the data-transfer-rate graphs: the input rate is less than for grep because the sort map tasks also spend time writing intermediate output to local disk; shuffling starts as soon as the first map task completes, with a first batch of reduce tasks followed, after a delay, by the remaining reduce tasks; input rate > shuffle rate > output rate, since two copies of the sorted output are written for reliability and availability; the high input rate comes from the locality optimization.

Performance: Stragglers (no backup tasks). With backup tasks disabled, a few stragglers drag out the tail of the computation and total execution time increases by 44%.

Performance: Worker Death. With 200 worker processes intentionally killed during the run, the lost map work is re-executed and total execution time increases by only 5%.

Experience. Broad applications at Google: large-scale machine learning problems; clustering problems for the Google News and Froogle products; extraction of data used to produce reports of popular queries; extraction of properties of web pages for new experiments and products; large-scale graph computations.

Experience: Large-Scale Indexing. MapReduce was used to rewrite the production indexing system that produces the data structures used for the Google web search service. Benefits of using MapReduce: the indexing code is simpler, smaller, and easier to understand, e.g. one phase dropped from approximately 3800 lines of C++ to approximately 700 lines expressed as MapReduce; the performance of the MapReduce library is good enough that the indexing process can be changed easily; and it is easier to add new machines to the indexing cluster.

Conclusions. The MapReduce programming model has been used successfully for many different purposes: the model is easy to use, and a large variety of problems are easily expressible as MapReduce computations. The implementation scales to large clusters comprising thousands of machines, and redundant execution can be used to reduce the impact of slow machines and to handle machine failures and data loss.

Thank you!!