
By: Jeffrey Dean & Sanjay Ghemawat
Presented by: Warunika Ranaweera
Supervised by: Dr. Nalin Ranasinghe

MapReduce: Simplified Data Processing on Large Clusters
In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04)
Also appears in Communications of the ACM (2008)

 Ph.D. in Computer Science – University of Washington
 Google Fellow in the Systems and Infrastructure Group
 ACM Fellow
 Research areas: distributed systems and parallel computing

 Ph.D. in Computer Science – Massachusetts Institute of Technology
 Google Fellow
 Research areas: distributed systems and parallel computing

 Calculate 30*50. Easy?
 Calculate 30*…*60. A little bit harder?

 A simple computation, but a huge data set
 A real-world example of a large computation:
o 20+ billion web pages × 20 kB per page
o One computer reads 30–35 MB/sec from disk
o Nearly four months to read the web
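The slide's arithmetic can be sanity-checked directly. This is a back-of-the-envelope sketch using only the figures quoted above:

```python
# Figures from the slide: 20+ billion pages, ~20 kB per page,
# one disk reading 30-35 MB/sec.
pages = 20e9
bytes_per_page = 20e3
total_bytes = pages * bytes_per_page      # ~400 TB of web data

read_rate = 35e6                          # optimistic 35 MB/sec
seconds = total_bytes / read_rate
days = seconds / 86400
print(f"{total_bytes / 1e12:.0f} TB, about {days:.0f} days on one machine")
```

At the slower 30 MB/sec end the figure grows to roughly 154 days, which is where the months-to-read-the-web claim comes from.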

 Parallelize tasks in a distributed computing environment
 The web-page problem is solved in 3 hours with 1,000 machines

 Complexities in Distributed Computing:
o How to parallelize the computation?
o Coordinating with other nodes
o Handling failures
o Preserving bandwidth
o Load balancing

 A platform to hide the messy details of distributed computing, which are:
o Parallelization
o Fault tolerance
o Data distribution
o Load balancing
 A programming model
 An implementation

 Example: word count

Document:
  the quick brown fox
  the fox ate the mouse
Mapped:
  (the, 1) (quick, 1) (brown, 1) (fox, 1)
  (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)
Reduced:
  (the, 3) (quick, 1) (brown, 1) (fox, 2) (ate, 1) (mouse, 1)

 E.g.: word count using MapReduce

Input:
  the quick brown fox
  the fox ate the mouse
Map output:
  (the, 1) (quick, 1) (brown, 1) (fox, 1)
  (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)
Reduce output:
  (the, 3) (quick, 1) (brown, 1) (fox, 2) (ate, 1) (mouse, 1)

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

Input: a text file; output: intermediate key/value pairs, e.g. ("fox", "1")

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts (the output from Map)
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

Input: ("fox", {"1", "1"}); output: the accumulated count, ("fox", "2")
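The two pseudocode fragments translate almost line-for-line into ordinary Python. The sketch below runs the whole word-count pipeline in one process; the grouping step stands in for the distributed shuffle, so this illustrates the programming model only, not the paper's runtime:

```python
from collections import defaultdict

def map_fn(key, value):
    """key: document name, value: document contents."""
    for word in value.split():
        yield (word, "1")

def reduce_fn(key, values):
    """key: a word, values: a list of counts (as strings)."""
    return str(sum(int(v) for v in values))

def map_reduce(documents):
    # Shuffle stand-in: group all intermediate values by key.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            groups[k].append(v)
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

docs = {"d1": "the quick brown fox", "d2": "the fox ate the mouse"}
print(map_reduce(docs))
# {'the': '3', 'quick': '1', 'brown': '1', 'fox': '2', 'ate': '1', 'mouse': '1'}
```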

 Reverse Web-Link Graph
 Several source web pages (Source web page 1–5) all link to one target (my web page)

 Reverse Web-Link Graph

Map output – (target, source pointing to the target):
  ("My Web", "Source 1")
  ("Not My Web", "Source 2")
  ("My Web", "Source 3")
  ("My Web", "Source 4")
  ("My Web", "Source 5")
Reduce output – (target, list of source web pages):
  ("My Web", {"Source 1", "Source 3", …})
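The same single-process sketch handles the reverse web-link graph; the page names follow the slide, and the grouping loop again stands in for the distributed shuffle:

```python
from collections import defaultdict

def map_fn(source, targets):
    # Emit (target, source) for every outgoing link on the source page.
    for target in targets:
        yield (target, source)

def reduce_fn(target, sources):
    # Collect every page that points at `target`.
    return (target, sorted(sources))

links = {
    "Source 1": ["My Web"],
    "Source 2": ["Not My Web"],
    "Source 3": ["My Web"],
    "Source 4": ["My Web"],
    "Source 5": ["My Web"],
}
groups = defaultdict(list)
for src, tgts in links.items():
    for k, v in map_fn(src, tgts):
        groups[k].append(v)
result = dict(reduce_fn(k, vs) for k, vs in groups.items())
print(result["My Web"])
# ['Source 1', 'Source 3', 'Source 4', 'Source 5']
```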

Implementation: Execution Overview

(1) Fork: the user program forks the master and the workers
(2) Assign: the master assigns map tasks and reduce tasks to idle workers
(3) Read: each map worker reads its input split (Split 0–4)
(4) Local write: map output is written to intermediate files on local disk
(5) Remote read: reduce workers read the intermediate files over the network
(6) Write: each reduce worker writes one output file (O/P File 0, O/P File 1)
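The numbered steps can be imitated in a toy single-process simulation. Everything here is illustrative: `zlib.crc32` stands in for an unspecified partitioning hash, and in-memory lists stand in for the intermediate files:

```python
import zlib

M, R = 3, 2   # number of map tasks and reduce tasks (illustrative)
splits = ["the quick brown fox", "the fox ate", "the mouse"]

# Steps (3)+(4): each map task reads its split and writes R partitioned
# intermediate "files", one region per reduce task.
intermediate = [[[] for _ in range(R)] for _ in range(M)]
for m, split in enumerate(splits):
    for word in split.split():
        r = zlib.crc32(word.encode()) % R      # partitioning function
        intermediate[m][r].append((word, 1))

# Steps (5)+(6): each reduce task reads its region from every map task,
# sorts/groups by key, and produces one output "file".
outputs = []
for r in range(R):
    pairs = sorted(p for m in range(M) for p in intermediate[m][r])
    out = {}
    for word, count in pairs:
        out[word] = out.get(word, 0) + count
    outputs.append(out)

# Keys are disjoint across partitions, so the outputs merge cleanly.
merged = {k: v for out in outputs for k, v in out.items()}
print(merged)
```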

 Complexities in Distributed Computing, to be solved:
o How to parallelize the computation? → automatic parallelization using Map & Reduce
o Coordinating with other nodes
o Handling failures
o Preserving bandwidth
o Load balancing

 Restricted programming model
 User-specified Map & Reduce functions
 1000s of workers working on different data sets
 The same user-defined Map/Reduce instruction is distributed to every worker (Worker 1, Worker 2, Worker 3), each applying it to its own data

 Complexities in Distributed Computing, solving…
o Automatic parallelization using Map & Reduce (solved)
o Coordinating with other nodes → coordinate nodes using a master node
o Handling failures
o Preserving bandwidth
o Load balancing

 Master data structure
 Information (meta-data) is pushed through the master between map workers and reduce workers
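A minimal sketch of what such a master data structure might look like; the class and field names are invented for illustration and are not taken from the paper:

```python
# The master tracks per-task state and worker assignment, plus the
# locations of map workers' intermediate files, which it pushes on to
# the reduce workers.
class Master:
    def __init__(self, n_map, n_reduce):
        self.map_tasks = {i: {"state": "idle", "worker": None} for i in range(n_map)}
        self.reduce_tasks = {i: {"state": "idle", "worker": None} for i in range(n_reduce)}
        # (map task id, reduce partition) -> location of intermediate data
        self.intermediate = {}

    def assign(self, tasks, task_id, worker):
        tasks[task_id].update(state="in-progress", worker=worker)

    def map_done(self, task_id, locations):
        self.map_tasks[task_id]["state"] = "completed"
        for r, loc in locations.items():
            self.intermediate[(task_id, r)] = loc

m = Master(n_map=2, n_reduce=1)
m.assign(m.map_tasks, 0, "worker-A")
m.map_done(0, {0: "worker-A:/local/file-0-0"})
print(m.intermediate)  # {(0, 0): 'worker-A:/local/file-0-0'}
```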

 Complexities in Distributed Computing, solving…
o Automatic parallelization using Map & Reduce (solved)
o Coordinate nodes using a master node (solved)
o Handling failures → fault tolerance (re-execution) & backup tasks
o Preserving bandwidth
o Load balancing

 No response from a worker task?
o If an ongoing Map or Reduce task: re-execute
o If a completed Map task: re-execute (its output is on the failed machine's local disk)
o If a completed Reduce task: leave untouched (its output is already in the global file system)
 Master failure (unlikely): restart
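These rules can be condensed into one decision function; this is a hypothetical helper for illustration, not the paper's code:

```python
def on_worker_failure(task_kind, task_state):
    # A completed map task's output lives on the failed machine's
    # local disk, so map tasks are always re-executed.
    if task_kind == "map":
        return "re-execute"
    # An in-progress reduce task on the failed worker must be redone...
    if task_state != "completed":
        return "re-execute"
    # ...but a completed reduce task's output is already in the
    # global file system, so it is left untouched.
    return "leave untouched"

print(on_worker_failure("map", "completed"))     # re-execute
print(on_worker_failure("reduce", "completed"))  # leave untouched
```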

 “Straggler”: a machine that takes a long time to complete the last steps in the computation
 Solution: redundant execution
o Near the end of the phase, spawn backup copies of the remaining tasks
o The task copy that finishes first “wins”
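A toy numeric illustration of why backup tasks help: the phase ends when the slowest task's fastest copy finishes. The task times below are made up for illustration:

```python
task_times = [30, 32, 29, 310]            # the last task is a straggler
backup_times = [None, None, None, 45]     # backups spawned near phase end

# Each task completes when its first copy (primary or backup) finishes.
completion = [t if b is None else min(t, b)
              for t, b in zip(task_times, backup_times)]
phase_time = max(completion)
print(phase_time)  # 45 (instead of 310 without backup tasks)
```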

 Complexities in Distributed Computing, solving…
o Automatic parallelization using Map & Reduce (solved)
o Coordinate nodes using a master node (solved)
o Fault tolerance (re-execution) & backup tasks (solved)
o Preserving bandwidth → saves bandwidth through locality
o Load balancing

 The same data set is replicated on different machines
 If a task has its input data locally, there is no need to access other nodes

 Complexities in Distributed Computing, solved:
o Automatic parallelization using Map & Reduce
o Coordinate nodes using a master node
o Fault tolerance & backup tasks
o Saves bandwidth through locality
o Load balancing through granularity

 Fine-granularity tasks: many more map tasks than machines
 1 worker → several tasks
 Idle workers are quickly assigned new work
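A small sketch of why fine granularity balances load, assuming greedy assignment of each task to the next idle worker (the costs are illustrative):

```python
import heapq

def phase_length(task_costs, n_workers):
    busy_until = [0] * n_workers          # each worker's busy-until time
    heapq.heapify(busy_until)
    for cost in task_costs:
        free_at = heapq.heappop(busy_until)   # next idle worker
        heapq.heappush(busy_until, free_at + cost)
    return max(busy_until)

coarse = phase_length([60, 30, 30], 3)   # one big task per worker
fine = phase_length([10] * 12, 3)        # same total work in 12 small tasks
print(coarse, fine)  # 60 40
```

With coarse tasks the phase lasts as long as the biggest task; with many small tasks the work spreads evenly across workers.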

 Partitioning
 Combining
 Skipping bad records
 Debugging – local execution
 Counters
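Partitioning, the first of these refinements, can be sketched as follows. The default is hash(key) mod R; the paper's example of a user-supplied alternative groups URLs by hostname. `zlib.crc32` is an illustrative stand-in for the hash function:

```python
import zlib
from urllib.parse import urlparse

R = 4  # number of reduce tasks / output files

def default_partition(key):
    # Default partitioning function: hash(key) mod R.
    return zlib.crc32(key.encode()) % R

def hostname_partition(url):
    # User-supplied partitioning: all URLs from the same host land
    # in the same reduce partition, hence the same output file.
    return zlib.crc32(urlparse(url).hostname.encode()) % R

a = hostname_partition("http://example.com/page1")
b = hostname_partition("http://example.com/page2")
print(a == b)  # True: same host, same partition
```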

Normal execution: 891 s; no backup tasks: 1,283 s
 44% increase in completion time
 Very long tail: stragglers take >300 s to finish

Normal execution: 891 s; with 200 processes killed: 933 s
 5% increase in completion time
 Quick failure recovery

 Clustering for Google News and Google Product Search
 Google Maps
o Locating addresses
o Rendering map tiles
 Google PageRank
 Localized search

 Apache Hadoop MapReduce
 Hadoop Distributed File System (HDFS)
 Used in:
o Yahoo! Search
o Facebook
o Amazon
o Twitter
o Google

 Higher-level languages/systems based on Hadoop: Pig and Hive
 Amazon Elastic MapReduce
o Available to the general public
o Processes data in the cloud

 A large variety of problems can be expressed as Map & Reduce
 Restricted programming model
 Easy to hide the details of distributed computing
 Achieves scalability and programming efficiency

 GFS solution: shadow masters
 Only meta-data is passed through the master
 A new copy of the master can be started from the last checkpointed state

 Programmer’s burden?  “If we hadn’t had to deal with failures, if we had a perfectly reliable set of computers to run this on, we would probably never have implemented MapReduce” – Sanjay Ghemawat

Combiner
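The final slide title, Combiner, can be illustrated with a sketch: a combiner runs the reduce logic on each map worker's local output before the shuffle, shrinking what must cross the network. This is illustrative Python, not the paper's code:

```python
from collections import Counter

def map_with_combiner(document):
    # Instead of emitting ("the", 1) once per occurrence, combine
    # counts locally on the map worker, so only ("the", 3) is sent.
    combined = Counter()
    for word in document.split():
        combined[word] += 1
    return dict(combined)

print(map_with_combiner("the fox ate the mouse the"))
# {'the': 3, 'fox': 1, 'ate': 1, 'mouse': 1}
```

For word count the combiner is the same function as the reducer, which is why the paper highlights it as a cheap refinement.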