MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat (Google Inc.)
OSDI 2004 (Symposium on Operating Systems Design and Implementation)


Presented by Navam Gupta

Quick Example
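(The slide's worked example was a figure that did not survive the transcript. In its spirit, here is a minimal word-count sketch of the two user-written functions, with a hypothetical sequential driver standing in for the MapReduce library:)

```python
from collections import defaultdict

def map_fn(doc_name, text):
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum all the counts emitted for the same word.
    yield (word, sum(counts))

# Tiny sequential driver standing in for the library (illustrative only).
intermediate = defaultdict(list)
for key, value in map_fn("ex.com", "wow bow how bow"):
    intermediate[key].append(value)
for key in sorted(intermediate):                    # keys in increasing order
    print(next(reduce_fn(key, intermediate[key])))  # e.g. ('bow', 2)
```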

Refinements – Locality
The input data provided to the map tasks can be either:
- stored on a central disk server, or
- stored on the local disks of the machines that make up the cluster (i.e., the machines on which the map and reduce tasks actually run).

Centrally stored input data has two problems:
- the central server is a single point of failure, and
- data-access speed is limited by the speed of the switch connecting the central server to the cluster machines.

Refinements – Locality (continued)
Stored on local machines: blocks of the input data live on the local disks of the cluster machines, so data-access speed is no longer limited by the switch's transfer speed. But how can we be sure that the input assigned to a map task is available on its local disk?
- Store multiple copies of each block on multiple machines.
- The master receives a list containing the location of each block and uses that list to assign tasks to machines where the input is locally available.

Refinements – Locality (continued)
(Figure: blocks A, B, C, and F replicated across Machines 1-3; the master tells each map task which locally available block to use as its input.)
Ending note: solid-state drives are fast, right?
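(A minimal sketch of the scheduling idea, with hypothetical names: the master prefers an idle worker that already holds a replica of the task's input block, falling back to a remote read only when it must:)

```python
def assign_map_task(block_id, block_locations, idle_workers):
    """Pick a worker for a map task, preferring one that holds the input
    block on its local disk. block_locations maps block -> machine list."""
    replicas = set(block_locations.get(block_id, ()))
    local = [w for w in idle_workers if w in replicas]
    # Fall back to any idle worker (a network read) if no replica is idle.
    return local[0] if local else idle_workers[0]

locations = {"A": ["m1", "m2"], "B": ["m2", "m3"], "F": ["m1", "m3"]}
print(assign_map_task("B", locations, ["m1", "m3"]))  # m3 holds a replica of B
```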

Refinements – Partitioning Function
Number of reduce tasks = number of output files = R, provided by the user. To partition the intermediate keys (the keys produced as output of the map functions) into R partitions, we generally just use:

Hash(key) mod R

Now suppose the output keys are actually URLs and we would like all URLs of a single host/domain to end up in the same output file. Will that partitioning work? No, we need something more:

Hash(Hostname(key)) mod R

where Hostname is a function that returns the hostname corresponding to the URL, e.g. Hostname(www.example.com/everythingispossible) = example.com. A sketch of both functions follows below.
Ending note: Google's library allows the user to override the partitioning function.
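(A hedged sketch of the two partitioning functions; zlib.crc32 stands in for a machine-stable hash, since Python's built-in hash() is randomized per process:)

```python
from urllib.parse import urlparse
from zlib import crc32

R = 4  # number of reduce tasks / output files, chosen by the user

def default_partition(key):
    # Default: spread keys evenly across the R partitions.
    return crc32(key.encode()) % R

def hostname_partition(url):
    # Custom: all URLs from one host land in the same output file.
    return crc32(urlparse(url).hostname.encode()) % R

# Both URLs share a hostname, so they map to the same reduce partition.
print(hostname_partition("http://example.com/everythingispossible"))
print(hostname_partition("http://example.com/other"))
```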

Refinements – Ordering Guarantees
Within a partition (those created by the partitioning function), the intermediate key/value pairs are processed in increasing key order. This has become a common feature now, but in the earlier map/reduce model from Lisp/Haskell the idea was only to group similar keys, with no constraint on the order in which the keys are processed. The problem with unordered output: what if we intend to do frequent lookups in the output file? In such a scenario, ordered output is always better.
Ending note: this is another reason to perform costly sorts on the intermediate keys.
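(To see why ordered output matters for lookups, a minimal sketch with made-up pairs: because the reduce output is written in increasing key order, a lookup can use binary search instead of a linear scan:)

```python
import bisect

# A sorted output file, modeled as a list of (key, value) pairs.
output = [("apple", 3), ("bow", 2), ("how", 1), ("wow", 4)]
keys = [k for k, _ in output]

def lookup(key):
    # O(log n) binary search, possible only because the output is sorted.
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return output[i][1]
    return None

print(lookup("how"))  # 1
```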

Refinements – Combiner Function
At times there is a lot of repetition in the intermediate keys produced by a map task. Take the sentence "Piano sound is soothing. It's easy to play a piano. Pianos are awesome." A word-count map task will produce the same key three times. Afterwards this output is sent over the network for further processing. But should we really waste network bandwidth sending the same data over and over again? Why don't we just combine it into one result? That is exactly what a combiner function does!
A combiner function is basically a copy of the reduce task, except that:
- it is executed on the same machine as the map task, right after the map task completes its execution, and
- unlike the reduce task, its output is written to the intermediate data file rather than to an output file.

Refinements – Combiner Function (continued)
(Figure: a map task without a combiner sends every intermediate pair across the network individually; with a combiner, repeated pairs are merged in the local intermediate data file before crossing the network. A sketch follows.)
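(A minimal sketch of the combiner idea, with illustrative names: locally aggregate repeated keys on the map worker so only one pair per key crosses the network:)

```python
from collections import defaultdict

def combine(pairs):
    # Runs on the map worker right after the map task: locally sums the
    # counts for each repeated key before the data is sent over the network.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

map_output = [("piano", 1), ("piano", 1), ("piano", 1), ("sound", 1)]
print(combine(map_output))  # [('piano', 3), ('sound', 1)] -- one pair per key
```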

Refinements – Skipping Bad Records
Sometimes there are bugs in the code that cause the map or reduce functions to crash deterministically on certain records. For example, suppose a reduce function is written so that it only accepts alphabetic keys; given a record with a non-alphabetic key, that reduce function will always crash on that particular record, and the system will keep trying to re-execute it. In general we handle crashes by fixing the bug, but sometimes that is not feasible or possible:
- the crash may be caused by a third-party tool whose source code is inaccessible, or
- the highly distributed environment/data makes it very hard to find the error.
In such cases we:
- install a signal handler to catch the errors;
- when a particular record causes an error, the worker sends a "last gasp" UDP packet to the master;
- if the master receives more than a certain threshold of failures for a particular record, that record is no longer assigned to any task.
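(The paper's mechanism is a C-level signal handler; this hedged Python sketch approximates the same idea with exception handling, with a hypothetical master address and record-id scheme:)

```python
import socket

MASTER_ADDR = ("master.example", 9999)   # hypothetical master host/port
skip_list = set()                        # record ids the master told us to skip

def guarded_map(map_fn, record_id, record, sock):
    # Wrap the user's map function around one input record.
    if record_id in skip_list:
        return []                        # master saw repeated failures: skip it
    try:
        return list(map_fn(record))
    except Exception:
        # "Last gasp": tell the master which record we were processing
        # before dying, so it can count failures per record.
        sock.sendto(f"FAILED {record_id}".encode(), MASTER_ADDR)
        raise
```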

Local Execution and Status Information
Local execution: the MapReduce library is primarily meant for a highly distributed environment of thousands of machines, which may or may not be in the same geographical region. There was a need to be able to execute a program locally so that it could be tested, debugged, and profiled. The MapReduce library therefore supports sequential execution on the user's own machine (see the sketch below).
Status information: being able to track the progress of execution in such a large-scale distributed environment is important in order to fully utilize all the resources. The MapReduce library comes with an HTTP server that displays pages containing information such as:
- the number of tasks completed and in progress;
- bytes of input, intermediate, and output data;
- which workers failed and the kind of task they were performing.
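(A minimal sketch of what local execution means in practice, with illustrative names: the whole job runs sequentially in one process, so the user's functions can be stepped through in a debugger or profiled with standard tools:)

```python
import cProfile
from collections import defaultdict

def run_local(map_fn, reduce_fn, inputs):
    # Execute an entire MapReduce job sequentially in this process.
    intermediate = defaultdict(list)
    for name, data in inputs:
        for key, value in map_fn(name, data):
            intermediate[key].append(value)
    return [out for key in sorted(intermediate)     # same ordering guarantee
                for out in reduce_fn(key, intermediate[key])]

word_map = lambda name, text: ((w, 1) for w in text.split())
word_reduce = lambda word, counts: [(word, sum(counts))]

# Profile the whole job on this machine before submitting it to the cluster.
cProfile.run('run_local(word_map, word_reduce, [("ex.com", "wow bow how bow")])')
```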

Status Information

Google MapReduce vs. Hadoop
In 2008 Hadoop had just begun, and Doug Cutting, then a Yahoo employee, was one of its creators; hence the comparison benchmarks came from Yahoo. In 2011 it was rumored that Google had drastically improved its cluster hardware, causing the sudden improvement in performance.

If Time Permits – MapReduce Is Everywhere!
Pretty much any task that needs to process a large amount of data can be expressed as MapReduce. Consider finding pages that share content:
- Content of ex.com: "wow bow how"; content of ex1.com: "bow wow".
- Map input: (ex.com, "wow bow how"). Assume the hash of "wow" is "abc", of "how" is "def", and so on; the map output pairs each hashed word with its page.
- Reduce input: (ex.com, {"abc", "def", ...}); the reduce output tags each hash with the page and its word count (e.g. ex.com-3).
- The output of reduce 1 is sent to intermediate files for grouping and sorting.
- Reduce input: ("abc", {ex.com-3, ex1.com-2}); the reduce output groups, per hash, the pages that contain it.
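(The transcript lost the exact pairs, but the first map round can be sketched as follows, assuming, as the slide does, that each word is hashed to a short digest; md5 is an illustrative stand-in:)

```python
import hashlib

def shingle_map(url, text):
    # Round-1 map: emit (hash(word), url) so pages that share a word
    # (matching the slide's "hash of wow = abc" idea) meet at the same
    # reduce key downstream.
    for word in text.split():
        yield (hashlib.md5(word.encode()).hexdigest()[:6], url)

print(list(shingle_map("ex.com", "wow bow how")))
print(list(shingle_map("ex1.com", "bow wow")))  # shares two hashes with ex.com
```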

Everywhere! (continued)
The output of reduce 2 is sent to intermediate files for grouping and sorting. In the final round, the reduce input key is a URL and the values represent every other URL that shared any shingle with it, e.g. (ex.com-3, {ex1.com-2, ex1.com-2}); ex1.com appears twice because it shares both "bow" and "wow". The final reduce output thus identifies, for each page, the pages that share content with it.

My Conclusions
- MapReduce is here to stay.
- MapReduce is the reason why Google has billions of dollars. Yes, PageRank is novel and awesome too, but without MapReduce it would be infeasible.
- Get used to it. Play with hadoop.apache.org.