
Google MapReduce: Simplified Data Processing on Large Clusters
Jeff Dean and Sanjay Ghemawat, Google, Inc.
Presented by Conroy Whitney, 4th year CS – Web Development

Outline
- Motivation
- MapReduce concept: Map? Reduce?
- Example of a MapReduce problem: Reverse Web-Link Graph
- MapReduce cluster environment
- Lifecycle of a MapReduce operation
- Optimizations to the MapReduce process
- Conclusion: MapReduce in Googlicious Action

Motivation: Large-Scale Data Processing
- Many tasks consist of processing lots of data to produce lots of other data
- Want to use hundreds or thousands of CPUs... but this needs to be easy!
- MapReduce provides:
  - User-defined functions
  - Automatic parallelization and distribution
  - Fault tolerance
  - I/O scheduling
  - Status and monitoring

Programming Concept
- Map: perform a function on individual values in a data set to create a new list of values
  - Example: square x = x * x; map square [1,2,3,4,5] returns [1,4,9,16,25]
- Reduce: combine the values in a data set to create a single new value
  - Example: a sum that folds each element into a running total; reduce (+) [1,2,3,4,5] returns 15 (the sum of the elements)
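The same two primitives can be sketched in Python with the built-in map and functools.reduce; square and add are illustrative helpers, not part of any MapReduce API:

```python
from functools import reduce

def square(x):
    return x * x

def add(total, x):
    return total + x

values = [1, 2, 3, 4, 5]

# Map: apply square to each element independently.
squared = list(map(square, values))   # [1, 4, 9, 16, 25]

# Reduce: fold the whole list down to a single value.
total = reduce(add, values)           # 15

print(squared, total)
```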

Example: Reverse Web-Link Graph
- Goal: find all pages that link to a certain page
- Map function: outputs a (target, source) pair for each link to a target URL found in a source page
  - For each page, we know what pages it links to
- Reduce function: concatenates the list of all source URLs associated with a given target URL and emits the pair (target, list(source))
  - For a given web page, we now know what pages link to it
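A single-machine sketch of those two functions in Python; the toy driver below stands in for the real shuffle/grouping phase, and the page data is invented for illustration:

```python
from collections import defaultdict

def map_fn(source_url, outgoing_links):
    """For each link found in a source page, emit a (target, source) pair."""
    for target_url in outgoing_links:
        yield (target_url, source_url)

def reduce_fn(target_url, source_urls):
    """Concatenate all source URLs that point at one target."""
    return (target_url, list(source_urls))

# Toy driver that mimics the shuffle/grouping phase on one machine.
pages = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
}
intermediate = defaultdict(list)
for src, links in pages.items():
    for target, source in map_fn(src, links):
        intermediate[target].append(source)

for target, sources in intermediate.items():
    print(reduce_fn(target, sources))
# ('b.com', ['a.com']) and ('c.com', ['a.com', 'b.com'])
```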

Additional Examples
- Distributed grep
- Distributed sort
- Term-vector per host
- Web access log statistics
- Document clustering
- Machine learning
- Statistical machine translation

Performance Boasts
- Distributed grep
  - 10^10 100-byte records (~1 TB of data)
  - 3-character pattern, found in ~100k records
  - ~1800 workers
  - 150 seconds start to finish, including ~60 seconds of startup overhead
- Distributed sort
  - Same records/workers as above
  - ~50 lines of MapReduce code
  - 891 seconds, including overhead
  - Best reported result at the time: 1057 seconds for the TeraSort benchmark

Typical Cluster
- 100s/1000s of dual-processor x86 machines with 2-4 GB of memory
- Limited internal network bandwidth
- Temporary storage on inexpensive local IDE disks
- Google File System (GFS): a distributed file system for permanent/shared storage
- Job scheduling system: jobs are made up of tasks, and a master scheduler assigns tasks to worker machines

Execution: Initialization
- Split the input file into 64 MB sections (GFS), read in parallel by multiple machines
- Fork the program onto multiple machines; one machine becomes the master
- The master assigns idle machines to either map or reduce tasks
- The master coordinates data communication between map and reduce machines
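A toy sketch of the master's bookkeeping, assuming its state is just an in-memory task table (the real master also tracks worker identities and the locations and sizes of intermediate files):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    task_id: int
    state: str = "idle"            # "idle" | "in_progress" | "completed"
    worker: Optional[str] = None   # machine currently running the task

@dataclass
class Master:
    tasks: list = field(default_factory=list)

    def assign(self, worker):
        """Hand the next idle task to a newly idle worker."""
        for task in self.tasks:
            if task.state == "idle":
                task.state, task.worker = "in_progress", worker
                return task
        return None
```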

Map-Machine
- Reads the contents of its assigned portion of the input file
- Parses and prepares the data for input to the map function (e.g. reads text out of HTML)
- Passes the data into the map function and saves the result in memory (e.g. (target, source) pairs)
- Periodically writes completed work to local disk
- Notifies the master of this partially completed work (intermediate data)
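A simplified sketch of that loop; the spill threshold here is a pair count rather than a memory limit, and notify_master stands in for the real RPC, so both are assumptions:

```python
import json

BUFFER_LIMIT = 10_000  # assumed threshold; the real worker spills on memory size

def run_map_task(split_lines, map_fn, notify_master, spill_path="map-spill-{}.txt"):
    buffer, spills = [], 0
    for line in split_lines:                 # read assigned portion of the input
        for pair in map_fn(line):            # user-defined map function
            buffer.append(pair)              # buffer intermediate pairs in memory
            if len(buffer) >= BUFFER_LIMIT:  # periodically write to local disk
                path = spill_path.format(spills)
                with open(path, "w") as f:
                    json.dump(buffer, f)
                notify_master(path)          # tell master where intermediate data lives
                buffer, spills = [], spills + 1
    if buffer:                               # flush the final partial buffer
        path = spill_path.format(spills)
        with open(path, "w") as f:
            json.dump(buffer, f)
        notify_master(path)
```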

Reduce-Machine
- Receives notification from the master of partially completed work
- Retrieves the intermediate data from the map-machine via remote read
- Sorts the intermediate data by key (e.g. by target page)
- Iterates over the intermediate data; for each unique key, sends the corresponding set of values through the reduce function
- Appends the result of the reduce function to the final output file (GFS)
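The reduce side, again as a single-machine sketch; itertools.groupby plays the role of the sort-then-group step, and the output path is an assumption:

```python
from itertools import groupby
from operator import itemgetter

def run_reduce_task(intermediate_pairs, reduce_fn, output_path="part-00000.txt"):
    # Sort by key so equal keys are adjacent (the real worker sorts fetched spills).
    pairs = sorted(intermediate_pairs, key=itemgetter(0))
    with open(output_path, "a") as out:       # append to the final output file
        for key, group in groupby(pairs, key=itemgetter(0)):
            values = [v for _, v in group]    # all values for this unique key
            out.write(f"{reduce_fn(key, values)}\n")
```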

Worker Failure
- The master pings workers periodically; any machine that does not respond is considered "dead"
- Both map- and reduce-machines: any task in progress is reset and becomes eligible for re-scheduling
- Map-machines: completed tasks are also reset, because their results are stored on the failed machine's local disk
- Reduce-machines are notified to fetch the data from the new machine assigned to assume the task
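A minimal sketch of that rule, layered on the toy master above; the heartbeat timeout is an assumption (the paper does not give an interval):

```python
import time

PING_TIMEOUT = 10.0  # assumed heartbeat timeout in seconds

def check_workers(master, last_heartbeat):
    """Mark silent workers dead and make their tasks schedulable again."""
    now = time.time()
    for worker, seen in last_heartbeat.items():
        if now - seen <= PING_TIMEOUT:
            continue                          # worker responded recently
        for task in master.tasks:
            if task.worker != worker:
                continue
            # In-progress work is always lost; completed *map* output is too,
            # because it lives on the dead machine's local disk.
            if task.state == "in_progress" or (
                task.kind == "map" and task.state == "completed"
            ):
                task.state, task.worker = "idle", None
```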

Skipping Bad Records
- Bugs in user code (triggered by unexpected data) cause deterministic crashes
- Ideally, fix the bug and re-run; not possible with third-party code
- When a worker dies, it sends a "last gasp" UDP packet to the master describing the record it was processing
- If more than one worker dies on a specific record, the master issues another re-execute command and tells the new worker to skip the problem record
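A sketch of the worker side of this mechanism; catching a real segfault from pure Python is unreliable, and the master address and record numbering are assumptions, so this is purely illustrative:

```python
import json
import signal
import socket

MASTER_ADDR = ("master.internal", 9999)  # assumed master address and port
current_record = {"seq": None}           # sequence number of the record in flight

def last_gasp(signum, frame):
    """On a fatal signal, fire a UDP packet naming the offending record."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(json.dumps(current_record).encode(), MASTER_ADDR)
    raise SystemExit(1)

signal.signal(signal.SIGSEGV, last_gasp)

def process(records, map_fn, records_to_skip):
    for seq, record in enumerate(records):
        if seq in records_to_skip:   # master told us to skip known-bad records
            continue
        current_record["seq"] = seq  # remember what we were doing when we crash
        map_fn(record)
```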

Backup Tasks
- Some "stragglers" do not perform optimally:
  - Other processes demanding resources
  - Bad disks with correctable errors that slow I/O speeds from 30 MB/s to 1 MB/s
  - A CPU cache disabled (?!)
- Near the end of a phase, schedule redundant executions of the in-progress tasks
- The first copy to complete "wins"
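One way to sketch that policy on the toy master above; the completion-fraction trigger and the launch_copy RPC are assumptions:

```python
BACKUP_THRESHOLD = 0.95  # assumed: start backups when 95% of tasks are done

def maybe_schedule_backups(master, idle_workers):
    """Near the end of a phase, redundantly re-launch whatever is still running."""
    done = sum(t.state == "completed" for t in master.tasks)
    if done / len(master.tasks) < BACKUP_THRESHOLD:
        return
    stragglers = [t for t in master.tasks if t.state == "in_progress"]
    for task, worker in zip(stragglers, idle_workers):
        launch_copy(task, worker)  # hypothetical RPC; first copy to finish wins

def launch_copy(task, worker):
    print(f"backup of {task.kind} task {task.task_id} on {worker}")
```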

Locality
- Network bandwidth is scarce
- Google File System (GFS): ~64 MB chunk sizes, stored redundantly (usually on 3+ machines)
- Assign map-machines to work on portions of the input files that they already have on local disk
- The input is then read at local-disk speed; without this, read speed is limited by the network switch
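A sketch of that assignment preference, assuming the master knows which workers hold a replica of each input split:

```python
def pick_task(master, worker, replica_map):
    """Prefer an idle map task whose input split already sits on this worker's disk."""
    idle = [t for t in master.tasks if t.state == "idle" and t.kind == "map"]
    local = [t for t in idle if worker in replica_map.get(t.task_id, ())]
    choice = (local or idle or [None])[0]   # fall back to any idle map task
    if choice:
        choice.state, choice.worker = "in_progress", worker
    return choice
```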

Conclusion
- Complete rewrite of the production indexing system
  - 20+ TB of data; indexing takes 5-10 MapReduce operations
  - Indexing code is simpler, smaller, and easier to understand
- Fault tolerance, distribution, and parallelization are hidden within the MapReduce library
- Avoids extra passes over the data
- Easy to change the indexing system
- Indexing performance improves by simply adding new machines to the cluster