MAPREDUCE: SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS
Presented by: Simarpreet Gill


Introduction
► MapReduce is a programming model and an associated implementation for processing and generating large datasets.
► Users specify two functions:
* Map – processes a key/value pair to generate a set of intermediate key/value pairs
* Reduce – merges all intermediate values associated with the same intermediate key

► Many real-world tasks are expressible in this model.
► Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.
► The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication.

Programming Model
► The user of the MapReduce library expresses the computation as two functions: Map and Reduce.
► Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs.
► The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values.

Example

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
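The pseudocode above is not directly executable; here is a minimal runnable sketch in Python (an illustration added for this writeup, not part of the paper) that mirrors it, including a tiny single-process driver that plays the role of the MapReduce library by shuffling intermediate pairs by key:

    from collections import defaultdict

    def map_fn(key, value):
        # key: document name (unused here); value: document contents
        for word in value.split():
            yield (word, "1")

    def reduce_fn(key, values):
        # key: a word; values: an iterable of counts encoded as strings
        return str(sum(int(v) for v in values))

    # Toy driver: run all maps, group intermediate pairs by key, run all reduces.
    documents = {"doc1": "the quick brown fox", "doc2": "the lazy dog and the fox"}
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)
    result = {k: reduce_fn(k, vs) for k, vs in sorted(intermediate.items())}
    print(result)  # e.g. {'and': '1', 'brown': '1', ..., 'the': '3'}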

Types
► The map and reduce functions supplied by the user have associated types:
* map (k1,v1) -> list(k2,v2)
* reduce (k2,list(v2)) -> list(v2)
i.e., the input keys and values are drawn from a different domain than the output keys and values. The intermediate keys and values, however, are from the same domain as the output keys and values.
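These signatures can be written down directly with Python type hints; a sketch (the type-variable names follow the slide, everything else is an assumption of this writeup):

    from typing import Callable, Iterable, List, Tuple, TypeVar

    K1 = TypeVar("K1")
    V1 = TypeVar("V1")
    K2 = TypeVar("K2")
    V2 = TypeVar("V2")

    # map: (k1, v1) -> list(k2, v2)
    MapFn = Callable[[K1, V1], List[Tuple[K2, V2]]]

    # reduce: (k2, list(v2)) -> list(v2)
    ReduceFn = Callable[[K2, Iterable[V2]], List[V2]]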

More Examples
► Distributed Grep
► Count of URL Access Frequency
► Reverse Web-Link Graph
► Term-Vector per Host
► Inverted Index (see the sketch after this list)
► Distributed Sort
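To make one of these concrete: for an inverted index, the map function emits (word, document ID) pairs and the reduce function sorts the document IDs for each word. A minimal Python sketch (the names are my own, not from the paper):

    def map_fn(doc_id, contents):
        # Emit each word paired with the ID of the document containing it.
        for word in set(contents.split()):
            yield (word, doc_id)

    def reduce_fn(word, doc_ids):
        # Emit the sorted, deduplicated posting list for this word.
        return sorted(set(doc_ids))

    print(reduce_fn("fox", ["doc2", "doc1", "doc2"]))  # ['doc1', 'doc2']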

Implementation
► Many different implementations of the MapReduce interface are possible. The right choice depends on the environment.
► The following slides describe an implementation targeted to the computing environment in wide use at Google: large clusters of commodity PCs connected together with switched Ethernet.

► Machines are typically dual-processor x86 processors running Linux, with 2-4 GB of memory per machine.
► Commodity networking hardware is used: typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.
► A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.

► Storage is provided by inexpensive IDE disks attached directly to individual machines. A distributed file system developed in-house (the Google File System) is used to manage the data stored on these disks. The file system uses replication to provide availability and reliability on top of unreliable hardware.
► Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.

Execution Overview
► The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits.
► The input splits can be processed in parallel by different machines.
► Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R).

► The MapReduce library in the user program first splits the input files into M pieces of typically 16 to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.
► One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

► A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
► Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs are passed back to the master, who forwards them to the reduce workers.
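The default partitioning function described in the paper is hash(key) mod R. A small self-contained Python sketch of how buffered pairs fall into R regions (the data values are made up):

    from collections import defaultdict

    R = 4  # number of reduce tasks (hypothetical value)

    def partition(key, R):
        # The paper's default: hash(key) mod R. Python's built-in hash()
        # stands in for whatever hash function the real system uses.
        return hash(key) % R

    buffered_pairs = [("apple", "1"), ("banana", "1"), ("apple", "1"), ("cherry", "1")]
    regions = defaultdict(list)
    for k, v in buffered_pairs:
        regions[partition(k, R)].append((k, v))
    print(dict(regions))  # all pairs sharing a key land in the same region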

► When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.
► The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
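The "group sorted pairs by key" step can be pictured with itertools.groupby; a toy sketch (the data is invented):

    import itertools

    # Sorted intermediate data for one reduce partition.
    sorted_pairs = [("apple", "1"), ("apple", "1"), ("banana", "1")]

    for key, group in itertools.groupby(sorted_pairs, key=lambda kv: kv[0]):
        values = [v for _, v in group]
        # Here the real worker would call the user's Reduce function and
        # append its output to the final output file for this partition.
        print(key, values)  # apple ['1', '1'] / banana ['1']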

► After successful completion, the output of the MapReduce execution is available in the R output files (one per reduce task, with file names as specified by the user).

Master Data Structures
► The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).

Fault Tolerance
► Worker failure: The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed.
► Master failure: It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpointed state.
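A toy sketch of the worker-failure check in Python (the timeout value and names are hypothetical; the paper does not give these details):

    import time

    PING_TIMEOUT = 60.0  # seconds without a reply before a worker is marked failed (assumed value)

    # Maps worker id -> timestamp of its last successful ping reply.
    last_reply = {"worker-1": time.time(), "worker-2": time.time() - 120.0}

    def failed_workers(now):
        return [w for w, t in last_reply.items() if now - t > PING_TIMEOUT]

    # Map tasks completed by a failed worker are reset to idle and re-executed,
    # because their output lives on that machine's local disk.
    print(failed_workers(time.time()))  # ['worker-2']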

Semantics in the Presence of Failures
► When the user-supplied map and reduce operators are deterministic functions of their input values, our distributed implementation produces the same output as would have been produced by a non-faulting sequential execution of the entire program.
► We rely on atomic commits of map and reduce task outputs to achieve this property.

Locality
► Network bandwidth is a relatively scarce resource in our computing environment. We conserve network bandwidth by taking advantage of the fact that the input data is stored on the local disks of the machines that make up our cluster: the master attempts to schedule a map task on a machine that contains a replica of the corresponding input data, or, failing that, near one (e.g., on a machine on the same network switch).

Task Granularity
► We subdivide the map phase into M pieces and the reduce phase into R pieces, as described above. Ideally, M and R should be much larger than the number of worker machines.
► Having each worker perform many different tasks improves dynamic load balancing, and also speeds up recovery when a worker fails: the many map tasks it has completed can be spread out across all the other worker machines. (The paper reports often using M = 200,000 and R = 5,000 with 2,000 worker machines.)

Backup Tasks
► One of the common causes that lengthens the total time taken for a MapReduce operation is a "straggler": a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation.
► We have a general mechanism to alleviate the problem of stragglers. When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks; the task is marked as completed whenever either the primary or the backup execution completes.
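The idea can be illustrated with a toy Python thread pool (purely illustrative; the real system schedules backups on other worker machines):

    import concurrent.futures
    import random
    import time

    def task(task_id, attempt):
        # Simulate a task whose running time varies by machine; a straggler is slow.
        time.sleep(random.uniform(0.1, 1.0))
        return (task_id, attempt)

    with concurrent.futures.ThreadPoolExecutor() as pool:
        # Near the end of the job, launch a backup copy of an in-progress task
        # and mark it done as soon as either attempt finishes.
        attempts = [pool.submit(task, "map-42", a) for a in ("primary", "backup")]
        done, _ = concurrent.futures.wait(
            attempts, return_when=concurrent.futures.FIRST_COMPLETED
        )
        print(next(iter(done)).result())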

Refinements
► Although the basic functionality provided by simply writing Map and Reduce functions is sufficient for most needs, a few extensions have been found useful:
* Partitioning Function
* Ordering Guarantees
* Combiner Function (see the sketch after this list)
* Input and Output Types
* Side-effects
* Skipping Bad Records

* Local Execution
* Status Information
* Counters
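Of these, the combiner function is the easiest to show in code: because word count's reduce is commutative and associative, the same summing logic can run on the map worker to collapse repeated (word, "1") pairs before they cross the network. A sketch reusing the earlier Python word-count names (my own illustration):

    def combiner_fn(key, values):
        # Runs on the map worker after the Map function: partially sums the
        # counts for one key. Unlike the Reduce function, its output is written
        # to an intermediate file destined for a reduce task, not to final output.
        return [str(sum(int(v) for v in values))]

    print(combiner_fn("the", ["1", "1", "1"]))  # ['3']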

Conclusions
► The MapReduce programming model has been successfully used at Google for many different purposes. This success has been attributed to several reasons:
* The model is easy to use, even for programmers without experience with parallel and distributed systems.
* A large variety of problems are easily expressible as MapReduce computations.
* An implementation of MapReduce has been developed that scales to large clusters comprising thousands of machines.