
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang¹, Ali Dasdan¹, Ruey-Lung Hsiao², D. Stott Parker² ¹Yahoo!, ²Computer Science Department, UCLA SIGMOD 2007, Beijing, China. Presented by Jongheum Yeon, August 13, 2009.

Outline Introduction Map-Reduce Map-Reduce-Merge Conclusions

Introduction New data-processing systems should consider alternatives to big, traditional databases Map-Reduce does a good job, in a limited context, with extraordinary simplicity Map-Reduce-Merge tries to extend that applicability without giving up too much of the simplicity

Introduction (cont’d) The current data-processing stacks, by layer (Language / Execution / Storage):
Parallel databases: SQL as the language, over a parallel database engine
Google: Sawzall (≈SQL) as the language, over Map-Reduce, on GFS and BigTable
Hadoop: Pig and Hive as the languages, over Hadoop, on HDFS and S3
Dryad: DryadLINQ and Scope (LINQ, SQL) as the languages, over Dryad, on Cosmos, Azure, and SQL Server

Map-Reduce : Motivation Many special-purpose tasks operate on and produce large amounts of data Inputs: crawled documents, web request logs, etc. Outputs: inverted indices, summaries, other kinds of derived data The computation must be distributed across a large number of machines to finish in a reasonable time Parallelize the computation Distribute the data These extra concerns obscure the original computation

Map-Reduce : Benefits Automatic parallelization and distribution User code complexity and size are reduced Transparent fault tolerance I/O scheduling Fine-grained partitioning of tasks, dynamically scheduled on available workers Status and monitoring

Map-Reduce : Programming Model Input & output: each a set of key/value pairs The programmer specifies two functions: map (in_key, in_value) -> list(out_key, intermediate_value) Processes an input key/value pair Produces a set of intermediate pairs reduce (out_key, list(intermediate_value)) -> list(out_value) Merges the intermediate values for a key Produces a set of output values (usually just one)
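The two signatures above can be illustrated with word count, the canonical example. This is a minimal Python sketch of the model, not the paper's actual C++ library; `run_mapreduce` stands in for the framework's shuffle and grouping.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    # in_key: document name; in_value: document text.
    # Emit an intermediate pair (word, 1) for every word.
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    # Merge the counts emitted for one word (usually one output value).
    return [sum(intermediate_values)]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Simulate the framework: apply map, group by key (shuffle), then reduce.
    groups = defaultdict(list)
    for k, v in inputs:
        for out_key, inter_value in map_fn(k, v):
            groups[out_key].append(inter_value)
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

counts = run_mapreduce([("doc1", "a b a"), ("doc2", "b c")], map_fn, reduce_fn)
# counts == {"a": [2], "b": [2], "c": [1]}
```

Only `map_fn` and `reduce_fn` are user code; everything else is what the framework provides automatically.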

Map-Reduce : Data Flow [diagram: Data → Map → Reduce]

Map-Reduce : Data Flow Map: generates new keys and their values Reduce: integrates the values of the same key [diagram: (Key1, Value1) mapped to (KeyA, ValueX), (KeyB, ValueY), (KeyB, ValueZ); reduced to A=X and B=Y,Z]

Map-Reduce : Architecture [diagram: a Master coordinating Map and Reduce workers; Map workers read input from GFS and Reduce workers write output back to GFS]

Map-Reduce : Architecture Master Assigns map/reduce tasks and maintains the state of each Propagates intermediate files to reduce tasks Worker Executes a Map or Reduce task at the Master's request

Map-Reduce : Distributed Processing
Input file: split into M pieces (Input 1 … Input M), one per Map task
Map: each Map task writes an intermediate file partitioned into R regions (1 … R)
Shuffle: each of the R Reduce tasks collects its region from every intermediate file
Output file: Output 1 … Output R, one per Reduce task

Map-Reduce : Example Inverted Index
Goal: a table of postings (wordID, docID, Location)
Documents (the original slides use Korean text; 연구실 = "lab", 의 = possessive "'s", 페이지 = "page"):
DocID=1: "IDS 연구실의 페이지" (the IDS lab's page)
DocID=2: "IDB 연구실의 페이지" (the IDB lab's page)
Word-to-wordID dictionary: 연구실 = 101, 의 = 201, 페이지 = 203, IDS = 301, IDB = 302

Map-Reduce : Example (cont’d)
Input data to Map (Key = docID, Value = text): 1 → "IDS 연구실의 페이지", 2 → "IDB 연구실의 페이지"
Output of Map (Key = wordID, Value = docID:Location):
From DocID 1: 301 → 1:0, 101 → 1:1, 201 → 1:2, 203 → 1:3
From DocID 2: 302 → 2:0, 101 → 2:1, 201 → 2:2, 203 → 2:3

Map-Reduce : Example (cont’d)
Shuffle: collects pairs with the same key and conveys them to Reduce
Reduce writes the final result (Key = wordID, Value = list of docID:Location):
101 → 1:1, 2:1
201 → 1:2, 2:2
203 → 1:3, 2:3
301 → 1:0
302 → 2:0
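The inverted-index example above can be sketched in a few lines of Python. English tokens stand in for the slides' Korean words, and words are used directly as keys instead of wordIDs; the harness simulating shuffle and reduce is illustrative, not part of the framework's API.

```python
from collections import defaultdict

def map_inverted(doc_id, text):
    # Map: emit (word, "docID:location") for each word position in the document.
    return [(word, f"{doc_id}:{loc}") for loc, word in enumerate(text.split())]

def shuffle_and_reduce(pairs):
    # Shuffle: group postings by word; Reduce: emit each word's sorted posting list.
    index = defaultdict(list)
    for word, posting in pairs:
        index[word].append(posting)
    return {word: sorted(postings) for word, postings in index.items()}

pairs = map_inverted(1, "IDS lab page") + map_inverted(2, "IDB lab page")
index = shuffle_and_reduce(pairs)
# index["lab"] == ["1:1", "2:1"]  -- "lab" appears at location 1 in both docs
```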

Map-Reduce : Example (cont’d) Other examples: Distributed grep Count URL access frequency: Map emits <URL, 1>; Reduce emits <URL, total count> Reverse web-link graph: Map emits <target, source>; Reduce emits <target, list(source)>
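The reverse web-link graph follows the same <target, source> → <target, list(source)> pattern. A minimal Python sketch (the URLs and the grouping harness are illustrative):

```python
from collections import defaultdict

def map_links(source_url, outgoing_links):
    # Map: for each outgoing link on a page, emit (target, source).
    return [(target, source_url) for target in outgoing_links]

def reduce_links(target, sources):
    # Reduce: collect every page that links to this target.
    return sorted(sources)

pairs = map_links("a.com", ["b.com", "c.com"]) + map_links("d.com", ["b.com"])
grouped = defaultdict(list)          # shuffle: group emitted pairs by target
for target, source in pairs:
    grouped[target].append(source)
graph = {t: reduce_links(t, s) for t, s in grouped.items()}
# graph["b.com"] == ["a.com", "d.com"]
```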

Map-Reduce-Merge Map-Reduce is an extremely simple model, but its context is limited Map-Reduce handles mainly homogeneous datasets Relational operators are hard to implement with Map-Reduce (especially join operations) Map-Reduce-Merge tries to keep the simplicity of Map-Reduce while extending it to be more complete

Map-Reduce-Merge Adds a merge phase to the Map-Reduce framework Allows processing of multiple heterogeneous datasets Like Map and Reduce, the Merge phase is implemented by the developer Example: Two datasets: department and employee Goal: compute each employee’s bonus based on individual rewards and a department bonus adjustment

Map-Reduce-Merge Example The merge phase matches records from the two tables on the dept_id key
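The bonus computation in this example might look as follows. The slides do not fix any schemas, so the field names (`dept_id`, the adjustment factor, the reward amount) and the sample values are purely illustrative; only the merge logic (matching on dept_id) is taken from the slides.

```python
# Hypothetical inputs, as the two Reduce outputs would deliver them:
departments = {"d1": 1.5, "d2": 0.5}                 # dept_id -> bonus adjustment
employees = [("e1", "d1", 100), ("e2", "d2", 200)]   # (emp_id, dept_id, reward)

def merge_bonus(departments, employees):
    # Merge phase: match employee and department records on dept_id
    # and compute each employee's bonus = reward * department adjustment.
    return {emp_id: reward * departments[dept_id]
            for emp_id, dept_id, reward in employees
            if dept_id in departments}

bonuses = merge_bonus(departments, employees)
# bonuses == {"e1": 150.0, "e2": 100.0}
```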

Map-Reduce-Merge: Extending Map-Reduce The change is to the reduce phase, plus a new merge phase:
1. Map: (k1, v1) → [(k2, v2)]
2. Reduce: (k2, [v2]) → [v3]
becomes:
2. Reduce: (k2, [v2]) → (k2, [v3])
3. Merge: ((k2, [v3]), (k3, [v4])) → (k4, [v5])

Map-Reduce-Merge Additional user-definable operations Merger: analogous to map and reduce; defines the logic of the merge operation Processor: processes data from one individual source Partition selector: selects which data should go to which merger Configurable iterator: defines how to step through each of the lists as they are merged

Map-Reduce-Merge [diagram: the overall Map-Reduce-Merge data flow]

Map-Reduce-Merge : Relational Data Processing Relational operators can be implemented using the Map-Reduce-Merge model. This includes: Projection Aggregation Generalized selection Joins Set union Set intersection Set difference Etc…

Map-Reduce-Merge : Example, Set Union The two Map-Reduces each emit a sorted list of unique elements The Merge merges the two lists by iterating as follows: Emit the smaller of the two current values and advance its iterator If the two values are equal, emit one of them and advance both iterators
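The union merge described above can be sketched directly; this is an illustrative Python rendering of the two-iterator walk, with the two sorted, duplicate-free lists standing in for the two Reduce outputs.

```python
def merge_union(a, b):
    # a and b: sorted lists of unique elements (the two Reduce outputs).
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i]); i += 1     # emit the smaller value, advance its iterator
        elif b[j] < a[i]:
            out.append(b[j]); j += 1
        else:
            out.append(a[i]); i += 1; j += 1   # equal: emit one copy, advance both
    return out + a[i:] + b[j:]           # append whatever remains of either list

# merge_union([1, 3, 5], [2, 3, 6]) == [1, 2, 3, 5, 6]
```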

Map-Reduce-Merge : Example, Set Difference We have two sets, A and B, and want to compute A − B The two Map-Reduces each emit a sorted list of unique elements The merge iterates simultaneously over the two lists: If A’s value is smaller, emit it and advance A’s iterator If B’s value is smaller, advance B’s iterator If the two are equal, advance both iterators (the element is dropped)
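The same two-iterator walk computes the difference; an illustrative Python sketch under the same assumptions (sorted, duplicate-free inputs from the two Reduces):

```python
def merge_difference(a, b):
    # Compute A - B; a and b are sorted lists of unique elements.
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i]); i += 1     # in A only: keep it
        elif b[j] < a[i]:
            j += 1                       # in B only: skip it
        else:
            i += 1; j += 1               # in both: drop it
    return out + a[i:]                   # the rest of A cannot be in B

# merge_difference([1, 2, 4], [2, 3]) == [1, 4]
```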

Map-Reduce-Merge : Example, Sort-Merge Join Map: partitions records into mutually exclusive buckets; each key range is assigned to a reducer Reduce: the data in each bucket are merged into a sorted set, i.e., the data are sorted Merge: the merger joins the sorted data for each key range
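The merge step of this join can be sketched as a classic two-iterator walk over the two sorted runs; a simplified Python illustration, not the paper's implementation (in particular, the partitioning and sorting of the Map and Reduce phases are assumed to have already happened).

```python
def sort_merge_join(left, right):
    # left, right: lists of (key, value) pairs, each already sorted by key,
    # as the Reduce phase of one key range would deliver them.
    i, j, out = 0, 0, []
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1                       # no match on the right: skip
        elif right[j][0] < left[i][0]:
            j += 1                       # no match on the left: skip
        else:
            key = left[i][0]
            # Gather the run of equal keys on each side, then cross-product them.
            lvals = [v for k, v in left if k == key]
            rvals = [v for k, v in right if k == key]
            out.extend((key, lv, rv) for lv in lvals for rv in rvals)
            while i < len(left) and left[i][0] == key:
                i += 1
            while j < len(right) and right[j][0] == key:
                j += 1
    return out

# sort_merge_join([("d1", "e1"), ("d2", "e2")], [("d1", 1.5)])
# == [("d1", "e1", 1.5)]
```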

Map-Reduce-Merge : Optimizations Map-Reduce already optimizes using locality and backup tasks Optimize the number of connections between the outputs of the reduce phase and the input of the merge phase (example: set intersection) Combine two phases into one (example: ReduceMerge)

Conclusions Map-Reduce-Merge allows us to work on heterogeneous datasets Map-Reduce-Merge supports joins, which Map-Reduce did not directly do Next step: develop an SQL-like interface and an optimizer to simplify the development of Map-Reduce-Merge workflows