Introduction to MapReduce


Introduction to MapReduce
Most of the MapReduce content in this set of slides is borrowed from a presentation by Michael Kleber of Google, dated January 14, 2008.

Part 2 Reference Texts
- Tom White, Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale (4th Edition), O'Reilly Media, April 11, 2015, ISBN: 978-1491901632
- Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, Morgan and Claypool Publishers, April 30, 2010, ISBN: 978-1608453429
- Jason Swartz, Learning Scala, O'Reilly Media, December 8, 2014, ISBN: 978-1449368814
- Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, Learning Spark: Lightning-Fast Big Data Analysis, O'Reilly Media, February 27, 2015, ISBN: 978-1449358624
- Bill Chambers and Matei Zaharia, Spark: The Definitive Guide: Big Data Processing Made Simple, O'Reilly Media, March 8, 2018, ISBN: 978-1491912218

Why do we want to use MapReduce?
- MapReduce is a distributed computing paradigm
- Distributed computing is hard
- Do we really need distributed computing? Yes: some problems are too big for a single computer.
- Example: 20+ billion web pages × 20 KB = 400+ terabytes
  - One computer can read 30-35 MB/sec from disk: ~4 months just to read the web
  - ~400 hard drives / SSDs just to store the web
  - Even more time to do something with the data
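The back-of-the-envelope arithmetic above can be checked directly (numbers taken from the slide; a single sequential reader at roughly 32 MB/s is assumed):

```python
# Back-of-the-envelope check of the slide's numbers.
pages = 20e9             # 20+ billion web pages
page_size = 20 * 1024    # 20 KB per page, in bytes
total_bytes = pages * page_size           # ~400 terabytes

read_rate = 32 * 1024**2                  # ~32 MB/s single-disk sequential read
seconds = total_bytes / read_rate
months = seconds / (60 * 60 * 24 * 30)

print(f"total data: {total_bytes / 1024**4:.0f} TiB")
print(f"time to read on one machine: {months:.1f} months")
```

Running this confirms the slide's claim: on the order of 400 TB of data, and several months just to read it once on a single machine.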

Distributed computing is hard
Bad news I: programming work
- Communication and coordination
- Recovering from machine failure (happens all the time!)
- Status reporting
- Debugging
- Optimization
- Data locality
Bad news II: repeat for every problem you want to solve
How can we make this easier?

MapReduce
- A simple programming model that can be applied to many large-scale computing problems
- The messy details are hidden in the MapReduce runtime library:
  - Automatic parallelization
  - Load balancing
  - Network and disk transfer optimization
  - Automatic handling of machine failures
  - Robustness
- Improvements to the core library benefit all users of the library!

Typical flow of problem solving with MapReduce
1. Read a lot of data
2. Map: extract something you care about from each record
3. (Hidden) shuffle and sort
4. Reduce: aggregate, summarize, filter, or transform
5. Write the results
The outline above stays the same across different problems; only Map and Reduce change to fit the particular problem.
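The steps above can be sketched as a single-process simulation in Python (the function names here are illustrative, not any real framework's API):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-machine simulation of the MapReduce flow."""
    # Steps 1-2. Map: emit (key, value) pairs from each input record
    pairs = []
    for key, value in records:
        pairs.extend(map_fn(key, value))
    # Step 3. Shuffle and sort: group values by key (hidden from the programmer)
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    # Steps 4-5. Reduce: aggregate each group and collect the results
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}

# Example: word count, using a one-document input
records = [("doc1", "to be or not to be")]
out = run_mapreduce(records,
                    lambda k, v: [(w, 1) for w in v.split()],
                    lambda k, vs: sum(vs))
# out == {"be": 2, "not": 1, "or": 1, "to": 2}
```

Only the two lambdas change from problem to problem; the surrounding read/shuffle/write machinery stays fixed, which is exactly the point of the model.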

MapReduce paradigm
- Basic data type: the key-value pair (k, v)
  - For example: key = URL, value = HTML of the web page
- The programmer specifies two primary methods:
  - Map(k, v) → <(k1, v1), (k2, v2), (k3, v3), ..., (kn, vn)>
  - Reduce(k', <v'1, v'2, ..., v'n>) → <(k', v''1), (k', v''2), ..., (k', v''m)>
  - All values v' with the same key k' are reduced together (remember the invisible "Shuffle and Sort" step)
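One way to read the slide's notation is as Python type hints (a sketch; the alias names are my own, not from any framework):

```python
from typing import Callable, Iterable, TypeVar

K = TypeVar("K")    # input key, e.g. a URL
V = TypeVar("V")    # input value, e.g. page HTML
K2 = TypeVar("K2")  # intermediate key, e.g. a word
V2 = TypeVar("V2")  # intermediate/output value, e.g. a count

# Map: one input record -> many intermediate (key, value) pairs
MapFn = Callable[[K, V], Iterable[tuple[K2, V2]]]

# Reduce: one intermediate key plus all of its values -> output values
ReduceFn = Callable[[K2, Iterable[V2]], Iterable[V2]]
```

The shuffle/sort step is what turns the flat stream of Map's (k2, v2) pairs into the grouped (k', <v'1, ..., v'n>) input that Reduce receives.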

Example: word frequencies (word count) in web pages
- Considered the "Hello World!" example of cloud computing
- Input: files with one document per record
- Specify a map function that takes a key/value pair:
  - key = document URL
  - value = document contents
- The output of the map function is (potentially many) key/value pairs; in this case, output (word, "1") once per word in the document

Input: ("document 1", "to be or not to be")
Map output: ("to", "1"), ("be", "1"), ("or", "1"), ("not", "1"), ("to", "1"), ("be", "1")
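A minimal sketch of that map function in Python (standalone, not tied to any real framework's API):

```python
def word_count_map(key, value):
    """key: document URL (unused here); value: document contents."""
    # Emit ("word", "1") once per word occurrence in the document.
    for word in value.split():
        yield (word, "1")

pairs = list(word_count_map("document 1", "to be or not to be"))
# pairs == [("to", "1"), ("be", "1"), ("or", "1"),
#           ("not", "1"), ("to", "1"), ("be", "1")]
```

Note that the map function emits one pair per word *occurrence*, duplicates included; collapsing duplicates is the reducer's job.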

Example: word frequencies (word count) in web pages, continued
- The MapReduce library gathers together all pairs with the same key in the shuffle/sort step
- Specify a reduce function that combines the values for a key; in this case, compute the sum

Input to reduce                     Output of reduce
key = "be",  values = <"1", "1">    ("be", "2")
key = "not", values = <"1">         ("not", "1")
key = "or",  values = <"1">         ("or", "1")
key = "to",  values = <"1", "1">    ("to", "2")
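The corresponding reduce function, as a hedged Python sketch (the shuffle/sort grouping is done by the library; a plain dict stands in for it here):

```python
def word_count_reduce(key, values):
    """key: a word; values: all the "1" strings emitted for that word."""
    # Sum the counts for this key and emit a single total.
    yield str(sum(int(v) for v in values))

# Simulated output of the shuffle/sort step:
shuffled = {"be": ["1", "1"], "not": ["1"], "or": ["1"], "to": ["1", "1"]}
result = {k: next(word_count_reduce(k, vs)) for k, vs in shuffled.items()}
# result == {"be": "2", "not": "1", "or": "1", "to": "2"}
```

Because all values for a given key arrive at the same reduce call, the sum for each word is complete without any further coordination.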

Example: the overall process

Under the hood: scheduling
- One master, many workers
- The input data is split into M map tasks (typically 128 MB per split)
  - The input data decides how many map tasks are created
- The reduce phase is partitioned into R reduce tasks (= number of output files)
  - Each reduce task generates one output file
  - The programmer decides the number of reduce tasks
- Tasks are assigned to workers dynamically
- The master assigns each map task to a free worker
  - It considers locality of data to worker when assigning a task
  - The worker reads the task input (often from local disk!)
  - The worker produces R local files containing intermediate (k, v) pairs
- The master assigns each reduce task to a free worker
  - The worker reads the intermediate (k, v) pairs generated by the map workers
  - The worker applies the Reduce function to produce the output
- The user may specify a Partition function: which intermediate keys go to which reducers
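This is why each map worker produces exactly R local files: a partition function routes every intermediate key to one of the R reducers. A common default is hashing the key modulo R; a sketch (illustrative, not any framework's actual implementation):

```python
def default_partition(key, num_reduce_tasks):
    # Route each intermediate key to reduce task hash(key) mod R, so
    # all pairs sharing a key land on the same reducer, and keys are
    # spread roughly evenly across the R reduce tasks.
    return hash(key) % num_reduce_tasks

R = 4
# Deterministic within a run: the same key always goes to the same reducer.
assert default_partition("be", R) == default_partition("be", R)
assert 0 <= default_partition("to", R) < R
```

A custom partition function is useful when the output must be grouped in a particular way, for example sending all URLs from one host to the same output file.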

MapReduce: the flow in diagram

MapReduce: fault tolerance via re-execution
Worker failure:
- Detect failure via periodic heartbeats
- Re-execute completed and in-progress map tasks (completed map output sits on the failed worker's local disk, so it is lost with the machine)
- Re-execute in-progress reduce tasks only (completed reduce output is already in the global file system)
- Task completion is committed through the master
Master failure:
- State is checkpointed to a replicated file system
- A new master recovers from the checkpoint and continues
Very robust: Google once lost thousands of machines during a run, yet the job finished successfully.
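A toy sketch of the master's heartbeat-based failure detection (the timeout, class, and method names are illustrative assumptions, not details from the paper):

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before a worker is presumed dead (assumed value)

class Master:
    def __init__(self):
        self.last_heartbeat = {}  # worker id -> time of last heartbeat
        self.tasks = {}           # worker id -> list of assigned tasks

    def heartbeat(self, worker_id):
        # Workers ping the master periodically.
        self.last_heartbeat[worker_id] = time.monotonic()

    def failed_workers(self, now=None):
        # Any worker silent for longer than the timeout is presumed failed.
        now = time.monotonic() if now is None else now
        return [w for w, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT]

    def reschedule(self, worker_id):
        # Reclaim the failed worker's tasks for re-execution elsewhere;
        # its intermediate map output on local disk is lost with it.
        return self.tasks.pop(worker_id, [])
```

The key design point is that re-execution, rather than replication of intermediate data, is what makes recovery cheap: tasks are deterministic and restartable, so losing a worker only costs the work it had in flight (plus its completed map output).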