MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat

Agenda: Introduction, Implementation Overview, Google File System, Hadoop Implementation, Demo, Conclusion

MapReduce MapReduce originated at Google in 2003. It is a programming framework: programmers write map and reduce functions specific to their task, and the framework automatically parallelizes those functions across Google's clusters of commodity machines. This allows programmers with little experience in parallelization and clusters to quickly accomplish computationally intensive tasks over very large input sets.

What are Map and Reduce functions? A Map function processes an input key/value pair to generate an intermediate set of key/value pairs. Ex: Counting the occurrences of each word in a large number of documents. The input key/value pair is the document name and the document contents. The intermediate result set consists of (word, 1) pairs, one for each word occurrence.

What are Map and Reduce functions? A Reduce function merges the intermediate set of key/value pairs into a more concise key/value pair set in which each key is unique. Ex: For the word-count example above, the result set would consist of (word, total count) pairs.

MapReduce Example Counting words in a large set of documents:

map(String key, String value)
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values)
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Types: map (k1, v1) → list(k2, v2); reduce (k2, list(v2)) → list(v2)

Another Example Count of URL Access Frequency: – The map function processes logs of web page requests and outputs (URL, 1) for each request. – The reduce function adds together all the values for the same URL and outputs a (URL, total request count) pair.
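As an illustration only (the paper gives no code for this example), here is a minimal Java sketch of the two functions; the space-separated log format and the class/method names are assumptions made for this sketch:

import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map.Entry;

// Illustrative sketch: map over one request log line, reduce over the counts for one URL.
public class UrlFrequency {

    // map: one log line -> (URL, 1); assumes a space-separated format like "GET /index.html 200".
    static Entry<String, Integer> map(String logLine) {
        String url = logLine.split(" ")[1];
        return new SimpleEntry<>(url, 1);
    }

    // reduce: (URL, [1, 1, ...]) -> (URL, total number of requests for that URL).
    static Entry<String, Integer> reduce(String url, List<Integer> counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return new SimpleEntry<>(url, total);
    }
}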

Implementation Machines are dual-processor x86 systems running Linux, with 2-4 GB of memory per machine. Commodity networking hardware is used. A cluster consists of hundreds or even thousands of machines, so machine failure is common. Storage consists of inexpensive IDE disks attached directly to the individual machines. Each job is submitted to a scheduler and consists of a set of tasks.

Implementation The input data is automatically split into M splits, which can be processed in parallel by different machines. The size of a split is user-specified. The output of the map functions is partitioned into R pieces by a partitioning function. Both R and the partitioning function are specified by the user.
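The partitioning function the paper describes as the default is hash(key) mod R; a minimal Java sketch of that default (the String key type is just for illustration):

public class Partitioner {
    // hash(key) mod R: assigns an intermediate key to one of R reduce partitions.
    // Masking with Integer.MAX_VALUE keeps the hash value non-negative.
    public static int partition(String key, int r) {
        return (key.hashCode() & Integer.MAX_VALUE) % r;
    }
}

Users can substitute their own function when a different grouping is needed, for example hashing only the hostname of a URL key so that all pages from the same host end up in the same output file.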

Implementation When the user program calls the MapReduce function, the MapReduce library first splits the input. Then, it starts up many copies of the user program on many different machines in the cluster. One of these copies is special: the master, which assigns work to the rest of the copies, called workers. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
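A rough sketch of this assignment logic, assuming the master simply keeps queues of unassigned task ids (tracking of in-progress/completed state and of intermediate file locations is omitted):

import java.util.ArrayDeque;
import java.util.Queue;

// Simplified master bookkeeping: hand out unassigned map tasks first, then reduce tasks.
public class Master {
    private final Queue<Integer> unassignedMapTasks = new ArrayDeque<>();
    private final Queue<Integer> unassignedReduceTasks = new ArrayDeque<>();

    public Master(int m, int r) {
        for (int i = 0; i < m; i++) unassignedMapTasks.add(i);
        for (int i = 0; i < r; i++) unassignedReduceTasks.add(i);
    }

    // Called when a worker reports itself idle; returns a task id, or null if none remain.
    public synchronized Integer assignTask(String workerId) {
        if (!unassignedMapTasks.isEmpty()) {
            return unassignedMapTasks.poll();
        }
        return unassignedReduceTasks.poll();
    }
}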

Implementation A worker that is assigned a map task parses the corresponding input split and passes each key/value pair to the Map function. The intermediate pairs produced by the Map function are buffered in local memory. Periodically, the buffered pairs are partitioned into R regions and written to local disk. The locations of these results are forwarded to the master, which forwards them to the reduce workers.
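A simplified Java sketch of this map-side buffering; the file naming scheme and the tab-separated record format are assumptions made for illustration:

import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Intermediate (key, value) pairs go into one of R in-memory buckets chosen by the
// partitioning function, and are periodically spilled to R local files.
public class MapOutputBuffer {
    private final int r;
    private final List<List<String>> buckets = new ArrayList<>();

    public MapOutputBuffer(int r) {
        this.r = r;
        for (int i = 0; i < r; i++) buckets.add(new ArrayList<>());
    }

    public void emitIntermediate(String key, String value) {
        int p = (key.hashCode() & Integer.MAX_VALUE) % r;  // hash(key) mod R
        buckets.get(p).add(key + "\t" + value);
    }

    // Called periodically: append each bucket to its own local file. The file locations
    // would then be reported back to the master (reporting omitted in this sketch).
    public void spill(int mapTaskId) throws IOException {
        for (int p = 0; p < r; p++) {
            try (FileWriter out = new FileWriter("map-" + mapTaskId + "-part-" + p, true)) {
                for (String record : buckets.get(p)) out.write(record + "\n");
            }
            buckets.get(p).clear();
        }
    }
}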

Implementation When a reduce worker is notified about the locations of the intermediate results, it uses remote procedure calls to read the results from the local disks of the map workers. Once the reduce worker has read all the intermediate data, it sorts the data so that pairs with the same key are grouped together. If the amount of intermediate data is too large to fit in memory, an external sort is used.

Implementation The reduce worker iterates over the sorted intermediate data and, for every unique intermediate key, passes the key and its corresponding set of values to the Reduce function. The output of this function is appended to a final output file. There is a separate output file for every reduce partition.
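A small in-memory Java sketch of this grouping and reduce invocation (the real implementation streams over sorted files; here a TreeMap stands in for the sort, and the Reducer interface is an assumption of this sketch):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Group the fetched intermediate pairs by key, then call the user's reduce function
// once per unique key and append each result to this partition's output.
public class ReduceWorker {

    interface Reducer {
        String reduce(String key, List<String> values);
    }

    public static List<String> runReduce(List<String[]> intermediatePairs, Reducer reducer) {
        // Sort and group: a TreeMap keeps keys in sorted order (an external sort
        // would be used when the data does not fit in memory).
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : intermediatePairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }

        // One reduce call per unique key; results are appended in key order.
        List<String> outputFile = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            outputFile.add(e.getKey() + "\t" + reducer.reduce(e.getKey(), e.getValue()));
        }
        return outputFile;
    }
}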

Implementation Once all the map and reduce tasks are finished, the master wakes up the user program and the MapReduce call returns to user code.

Google File System Goal: a global view – make huge files available in the face of node failures. Master node (metadata server): centralized; indexes all chunks on the data servers. Chunk servers (data servers): each file is split into contiguous chunks, typically 16-64 MB, and each chunk is replicated (usually 2x or 3x), with replicas kept in different racks where possible.

GFS architecture (diagram): a client contacts the single GFS Master for metadata and the chunkservers for data; each chunk (C0, C1, C2, C3, C5) is replicated across several of the chunkservers (Chunkserver 1, Chunkserver 2, …, Chunkserver N).

Fault Tolerance The master pings workers periodically; any machine that does not respond is considered “dead”. For both map and reduce machines, any task in progress is reset and becomes eligible for re-scheduling. For map machines, completed map tasks are also reset, because their results are stored on the failed machine's local disk; the reduce machines are notified to fetch the data from the new machine assigned to the task.
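A minimal sketch of the master-side failure detection described above; the timeout value is illustrative, not taken from the paper:

import java.util.HashMap;
import java.util.Map;

// The master records the last time each worker answered a ping; workers silent for
// longer than the timeout are marked dead and their tasks become eligible for re-execution.
public class FailureDetector {
    private static final long TIMEOUT_MS = 60_000;  // illustrative value
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    public void onHeartbeat(String workerId, long nowMs) {
        lastHeartbeat.put(workerId, nowMs);
    }

    public boolean isDead(String workerId, long nowMs) {
        Long last = lastHeartbeat.get(workerId);
        return last == null || nowMs - last > TIMEOUT_MS;
    }
}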

Skipping Bad Records Bugs in user code (often triggered by unexpected data) can cause deterministic crashes. Ideally such bugs would be fixed and the job re-run, but that is not always possible (e.g. with third-party code). When a worker dies, it sends a “last gasp” UDP packet to the master describing the record it was processing. If more than one worker dies on the same record, the master issues yet another re-execute command and tells the new worker to skip the problem record.
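A small sketch of the master-side bookkeeping this implies, assuming records are identified by their offset within the input split (an assumption of this sketch):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Count how many crashes name a given record; records involved in more than one crash
// are skipped on the next re-execution.
public class BadRecordTracker {
    private final Map<Long, Integer> failuresByRecordOffset = new HashMap<>();

    // Called when a "last gasp" report names the record being processed at crash time.
    public void recordFailure(long recordOffset) {
        failuresByRecordOffset.merge(recordOffset, 1, Integer::sum);
    }

    // Records seen in more than one crash are passed to the next worker as records to skip.
    public Set<Long> recordsToSkip() {
        Set<Long> skip = new HashSet<>();
        failuresByRecordOffset.forEach((offset, count) -> {
            if (count > 1) skip.add(offset);
        });
        return skip;
    }
}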

Hadoop Demo Hadoop WordCount Implementation in Java
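For reference, a sketch along the lines of the standard Hadoop WordCount example, using the org.apache.hadoop.mapreduce API (exact class and method names may differ slightly between Hadoop versions):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: sum all the counts emitted for one word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // combiner runs the reduce logic map-side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job would typically be packaged as a jar and launched with something like "hadoop jar wordcount.jar WordCount <input dir> <output dir>" (the jar name and paths here are placeholders).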

Conclusion MapReduce provides a general-purpose model that simplifies large-scale computation, allowing users to focus on their problem without worrying about the details of parallelization, fault tolerance, and data distribution.