Take a Close Look at MapReduce Xuanhua Shi

Acknowledgement  Most of the slides are from Dr. Bing Chen  Some slides are from Shadi Ibrahim

What is MapReduce?  Originated at Google [OSDI’04]  A simple programming model  Functional model  For large-scale data processing  Exploits large sets of commodity computers  Executes processing in a distributed manner  Offers high availability

Motivation  Lots of demand for very large-scale data processing  Common themes across these demands  Lots of machines needed (scaling)  Two basic operations on the input  Map  Reduce

Distributed Grep  Very big data → split data → grep → matches → cat → all matches
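To make the pipeline concrete, here is a minimal sketch of grep as map/reduce, in the spirit of the paper's example; emit() and PATTERN are hypothetical helpers, not from the original slides:

// Map: emit every input line that matches the pattern being searched for.
void map(String filename, String line) {
    if (line.contains(PATTERN)) {
        emit(line, "");   // key = the matching line; the value is unused
    }
}

// Reduce: the identity function; it passes matches through (the "cat" step).
void reduce(String line, Iterator<String> values) {
    emit(line, "");
}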

Distributed Word Count  Very big data → split data → count → merge → merged count

Map+Reduce  Map:  Accepts an input key/value pair  Emits intermediate key/value pairs  Reduce:  Accepts an intermediate key and the list of all values for that key (key/value* pair)  Emits output key/value pairs  (Diagram: very big data → MAP → partitioning function → REDUCE → result)
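Written as type signatures, the model is just two functions. A sketch in Java generics follows (the Pair record is a hypothetical helper; the types themselves come from the MapReduce paper):

import java.util.Iterator;
import java.util.List;

// map:    (k1, v1)       -> list(k2, v2)
// reduce: (k2, list(v2)) -> list(v2)
record Pair<A, B>(A first, B second) {}

interface Mapper<K1, V1, K2, V2> {
    List<Pair<K2, V2>> map(K1 key, V1 value);
}

interface Reducer<K2, V2> {
    List<V2> reduce(K2 key, Iterator<V2> values);
}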

The design and how it works

Architecture overview  (Diagram: a user submits a job to the job tracker on the master node; task trackers on slave nodes 1…N run the workers)

GFS: underlying storage system  Goal  Global view  Make huge files available in the face of node failures  Master node (meta server)  Centralized; indexes all chunks on the data servers  Chunk server (data server)  Files are split into contiguous chunks, typically 16–64MB  Each chunk is replicated (usually 2x or 3x)  Replicas are kept in different racks when possible

GFS architecture  (Diagram: a client contacts the GFS master for metadata, then reads chunks from chunkservers 1…N; chunks C0–C5 are replicated across chunkservers)

Functions in the Model  Map  Processes a key/value pair to generate intermediate key/value pairs  Reduce  Merges all intermediate values associated with the same key  Partition  By default: hash(key) mod R  Keeps the reduce load well balanced
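A sketch of the default partition function, assuming String keys; this mirrors Hadoop's HashPartitioner rather than code from the original slides:

// hash(key) mod R decides which of the R reduce tasks receives a given key.
int partition(String key, int numReduceTasks) {
    // Mask the sign bit so a negative hashCode() cannot yield a negative index.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}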

Diagram (1)

Diagram (2)

A Simple Example  Counting words in a large set of documents

map(String key, String value)
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values)
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

How does it work?

Locality issue  Master scheduling policy  Asks GFS for the locations of the replicas of the input file blocks  Map task inputs are typically 64MB splits (== GFS block size)  Map tasks are scheduled so that a GFS replica of the input block is on the same machine or the same rack  Effect  Thousands of machines read input at local-disk speed  Without this, rack switches would limit the read rate

Fault Tolerance  Reactive way  Worker failure  Heartbeat: workers are periodically pinged by the master  No response = failed worker  If a worker fails, its tasks are reassigned to another worker  Master failure  The master writes periodic checkpoints  Another master can be started from the last checkpointed state  If the master eventually dies, the job is aborted

Fault Tolerance  Proactive way (redundant execution)  The problem of “stragglers” (slow workers)  Other jobs consuming resources on the machine  Bad disks with soft errors transfer data very slowly  Weird things: processor caches disabled (!!)  When the computation is almost done, reschedule in-progress tasks  Whenever either the primary or the backup execution finishes, mark the task as completed

Fault Tolerance  Input errors: bad records  Map/Reduce functions sometimes fail for particular inputs  The best solution is to debug & fix, but that is not always possible  On a segmentation fault  Send a UDP packet to the master from the signal handler  Include the sequence number of the record being processed  Skip bad records  If the master sees two failures for the same record, the next worker is told to skip that record

Status monitor

Refinements  Task granularity  Fine-grained tasks minimize the time for fault recovery and enable load balancing  Local execution for debugging/testing  Compression of intermediate data

Points to emphasize  No reduce can begin until map is complete  The master must communicate the locations of intermediate files  Tasks are scheduled based on the location of data  If a map worker fails any time before reduce finishes, its tasks must be completely rerun  The MapReduce library does most of the hard work for us!

Model is Widely Applicable  MapReduce programs in the Google source tree: distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation, ...  Examples follow

How to use it  User to-do list:  Indicate:  Input/output files  M: number of map tasks  R: number of reduce tasks  W: number of machines  Write the map and reduce functions  Submit the job

Detailed Example: Word Count (1)  Map
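The code on the original slide did not survive the transcript. As a stand-in, here is the classic word-count mapper in Hadoop's Java API (a sketch; the slide may have shown a different or older API):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the input line and emit (word, 1) for every token.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}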

Detailed Example: Word Count (2)  Reduce
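Again as a stand-in for the missing slide code, the matching reducer sums the counts emitted for each word (a Hadoop-API sketch):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all partial counts for this word and emit the total.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}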

Detailed Example: Word Count (3)  Main
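Finally, a driver that wires the mapper and reducer together and submits the job (a sketch using the current Hadoop API; the original slide almost certainly used an older one):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // Reusing the reducer as a combiner pre-aggregates counts on the
        // map side, shrinking the intermediate data shipped to reducers.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}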

Applications  String matching, such as grep  Inverted index construction  Counting URL access frequency  Lots of examples in data mining

MapReduce Implementations  Cluster:  1. Google MapReduce  2. Apache Hadoop  Multicore CPU: Phoenix (Stanford)  GPU: Mars

Hadoop  Open source  A Java-based implementation of MapReduce  Uses HDFS as its underlying file system

Hadoop  Google vs. Yahoo!/Apache counterparts:

Google      Yahoo! (Hadoop)
MapReduce   Hadoop MapReduce
GFS         HDFS
Bigtable    HBase
Chubby      (nothing yet… but planned)

Recent news about Hadoop  Apache Hadoop wins the terabyte sort benchmark  The sort used 1,800 maps and 1,800 reduces, and allocated enough buffer memory to hold the intermediate data in memory.

Phoenix  Best paper at HPCA’07  MapReduce for multiprocessor systems  A shared-memory implementation of MapReduce  SMP, multi-core  Features  Uses threads instead of cluster nodes for parallelism  Communicates through shared memory instead of network messages  Dynamic scheduling, locality management, fault recovery

Workflow

The Phoenix API  System-defined functions  User-defined functions

Mars: MapReduce on GPU  Published at PACT’08  Example GPU platforms: GeForce 8800 GTX, PS3, Xbox 360

Implementation of Mars  (Stack diagram, top to bottom: user applications → MapReduce → CUDA / system calls → operating system (Windows or Linux) → NVIDIA GPU (GeForce 8800 GTX) / CPU (Intel P4, four cores, 2.4GHz))

Implementation of Mars

Discussion  We have MPI and PVM; why do we need MapReduce?

               MPI, PVM                               MapReduce
Objective      General distributed programming model  Large-scale data processing
Availability   Weaker, harder                         Better
Data locality  MPI-IO                                 GFS
Usability      Difficult to learn                     Easier

Conclusions  MapReduce provides a general-purpose model that simplifies large-scale computation  It allows users to focus on the problem without worrying about the distributed-systems details

References  Original paper (…)  MapReduce on Wikipedia (…)  Hadoop – MapReduce in Java (…duce-tutorial.html)