MapReduce Algorithm Design


Contents
- Combiner and in-mapper combining
- Complex keys and values
- Relative frequency
- Secondary sorting

Combiner and in-mapper combining
Purpose:
- Carry out local aggregation before the shuffle-and-sort phase
- Reduce the communication volume between the map and reduce stages

Combiner example (word count)
- Use the reducer as the combiner
- This is safe here because integer addition is both associative and commutative: partial sums can be merged in any order without changing the result
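The slide images are not part of this transcript, so here is a minimal in-memory sketch of the idea in plain Java (not the Hadoop API; the class and method names are illustrative): word count where the same sum function serves as both combiner and reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountCombiner {
    // map: emit one (word, 1) pair per token
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.split("\\s+")) out.add(Map.entry(w, 1));
        return out;
    }

    // reduce, also reused as the combiner: sum the counts for one key
    public static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // group pairs by key, as the framework's shuffle-and-sort would
    public static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // one map task per split, with local combining before the shuffle
    public static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits)
            group(map(split)).forEach((k, vs) -> shuffled.add(Map.entry(k, reduce(vs))));
        Map<String, Integer> result = new TreeMap<>();
        group(shuffled).forEach((k, vs) -> result.put(k, reduce(vs)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b a"))); // {a=3, b=2}
    }
}
```

Because addition is associative and commutative, combining within each split changes only the volume of intermediate data sent over the network, never the final counts.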

MapReduce with Combiner

MapReduce with combiner
map: (k1, v1) → [(k2, v2)]
combine: (k2, [v2]) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]
The combiner's input and output key-value types must match the mapper's output key-value types.

A combiner is an optimization, not a requirement
- The combiner is optional: a given MapReduce implementation may choose to execute the combine method many times or not at all
- Calling the combine method zero, one, or many times must produce the same output from the reducer
- The correctness of a MapReduce program must never rely on the assumption that the combiner is actually run
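This invariant can be checked directly: for an associative and commutative reduce function such as integer addition, applying the combine step zero, one, or many times leaves the final result unchanged. A small plain-Java illustration (names are illustrative, not from the slides):

```java
import java.util.List;

public class CombinerInvariance {
    // the reduce function, reused as the combiner: a plain integer sum
    public static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> values = List.of(4, 7, 1, 9);

        // no combiner: the reducer sees all raw values
        int direct = reduce(values);

        // one combine pass over two partial groups, then the final reduce
        int once = reduce(List.of(reduce(values.subList(0, 2)),
                                  reduce(values.subList(2, 4))));

        // nested combine passes: the framework may combine "many times"
        int twice = reduce(List.of(reduce(List.of(reduce(values.subList(0, 2)))),
                                   reduce(values.subList(2, 4))));

        System.out.println(direct + " " + once + " " + twice); // 21 21 21
    }
}
```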

In-method combining

Use a Java Map to implement the associative array
- A Map is an object that maps keys to values; a map cannot contain duplicate keys, and each key maps to at most one value
- The Java platform provides three general-purpose Map implementations: HashMap, TreeMap, and LinkedHashMap
- Basic operations:
  - boolean containsKey(Object key)
  - V get(Object key)
  - V put(K key, V value)
  - Set<K> keySet(): returns a Set view of the keys contained in the map
  - Set<Map.Entry<K, V>> entrySet(): returns a Set view of the mappings contained in the map
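As a concrete illustration of these operations, a HashMap can serve as the word-count associative array (a hypothetical snippet, not from the slides):

```java
import java.util.HashMap;
import java.util.Map;

public class AssociativeArrayDemo {
    // count token frequencies with a HashMap used as the associative array
    public static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.split("\\s+"))
            counts.put(token, counts.getOrDefault(token, 0) + 1);
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countTokens("to be or not to be");
        System.out.println(counts.containsKey("be")); // true
        System.out.println(counts.get("to"));         // 2
        System.out.println(counts.keySet());          // the four distinct tokens
        for (Map.Entry<String, Integer> e : counts.entrySet())
            System.out.println(e.getKey() + " -> " + e.getValue());
    }
}
```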

In-mapper combining
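A sketch of the in-mapper combining pattern in plain Java (the class is hypothetical and only mimics the Hadoop Mapper life cycle: map() is called once per input record, cleanup() once at the end of the task):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InMapperCombiningSketch {
    private final Map<String, Integer> buffer = new HashMap<>();

    // called once per input record, like Hadoop's Mapper.map(): aggregate
    // into the buffer instead of emitting (word, 1) immediately
    public void map(String line) {
        for (String w : line.split("\\s+"))
            buffer.put(w, buffer.getOrDefault(w, 0) + 1);
    }

    // called once at the end of the task, like Mapper.cleanup(): only now
    // emit one aggregated pair per distinct word
    public List<Map.Entry<String, Integer>> cleanup() {
        return new ArrayList<>(buffer.entrySet());
    }

    public static void main(String[] args) {
        InMapperCombiningSketch mapper = new InMapperCombiningSketch();
        mapper.map("a b a");
        mapper.map("b a");
        // 5 raw tokens collapse into 2 intermediate pairs
        System.out.println(mapper.cleanup());
    }
}
```

Unlike a combiner, this aggregation is guaranteed to run, but the buffer must fit in the mapper's memory.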

Demo and some online resources
- Demo on in-method and in-mapper combining
- Online resources:
  - https://vangjee.wordpress.com/2012/03/07/the-in-mapper-combining-design-pattern-for-mapreduce-programming
  - codingjunkie.net, a blog by Bill Bejeck, who provides a lot of MapReduce source code for Hadoop

Contents
- Combiner and in-mapper combining
- Complex keys and values
- Relative frequency
- Secondary sorting

Complex keys and values
- Both keys and values can be complex data structures, e.g., pairs and stripes
- Complex structures require serialization and deserialization: after the map stage, structures must be serialized to be written to storage, and complex keys and values must be deserialized again on the reducer side

Motivation: an example that computes the mean of the values associated with each key

With combiner

Revised mapper

In-mapper combining
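The figures for the mean example are not in this transcript, but the underlying point can be sketched: the mean is not associative, so a combiner must not average partial averages; instead, mapper and combiner emit partial (sum, count) pairs, and only the reducer divides. A plain-Java illustration (names are illustrative):

```java
import java.util.List;

public class MeanExample {
    // partial result passed between combiner and reducer
    public record SumCount(long sum, long count) {}

    // combiner: merging partial sums and counts IS associative
    public static SumCount combine(List<SumCount> parts) {
        long s = 0, c = 0;
        for (SumCount p : parts) { s += p.sum(); c += p.count(); }
        return new SumCount(s, c);
    }

    // reducer: only here do we finally divide
    public static double reduce(List<SumCount> parts) {
        SumCount total = combine(parts);
        return (double) total.sum() / total.count();
    }

    public static void main(String[] args) {
        // two map tasks hold partial results for the same key:
        // task 1 saw the values 2 and 4, task 2 saw the value 9
        SumCount part1 = combine(List.of(new SumCount(2, 1), new SumCount(4, 1)));
        SumCount part2 = new SumCount(9, 1);

        // mean of {2, 4, 9} is 5.0; averaging the partial means
        // (3.0 and 9.0) would wrongly give 6.0
        System.out.println(reduce(List.of(part1, part2))); // 5.0
    }
}
```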

A running example
- Build word co-occurrence matrices for large corpora, e.g., the co-occurrence matrix for all the works of Shakespeare
- Co-occurrence is counted within a specific context: a sentence, a paragraph, a document, or a window of m words

Pairs
- In each map task, emit [(a, b), 1] for every co-occurring pair of words (a, b)
- Use a combiner or in-mapper combining to reduce the volume of intermediate pairs
- Reducers sum up the counts associated with each pair
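A minimal sketch of the pairs approach in plain Java (a string stands in for a custom pair writable, and the one-sided window logic is an assumption of this example, not taken from the slides):

```java
import java.util.Map;
import java.util.TreeMap;

public class PairsCooccurrence {
    // emit ((a, b), 1) for every co-occurrence within `window` words to the
    // right of each position, summing locally as a combiner would
    public static Map<String, Integer> count(String[] words, int window) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < words.length; i++)
            for (int j = i + 1; j < words.length && j <= i + window; j++) {
                String key = "(" + words[i] + "," + words[j] + ")"; // stand-in for a pair writable
                counts.merge(key, 1, Integer::sum);
            }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("a b a b".split(" "), 1)); // {(a,b)=2, (b,a)=1}
    }
}
```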

Pairs approach

Stripes
- Idea: group together the pairs that share a left word into an associative array
- Mappers emit [word, associative array]
- Reducers perform an element-wise sum of the associative arrays

The pairs
(a, b) → 1
(a, c) → 2
(a, d) → 5
(a, e) → 3
(a, f) → 2
become the single stripe
a → { b: 1, c: 2, d: 5, e: 3, f: 2 }

Element-wise sum of two stripes:
a → { b: 1, d: 5, e: 3 }
+ a → { b: 1, c: 2, d: 2, f: 2 }
= a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
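The element-wise sum of stripes can be sketched directly in plain Java (illustrative names; the data is the example from this slide):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StripesSum {
    // reducer side: element-wise sum of all the stripes emitted for one word
    public static Map<String, Integer> sum(List<Map<String, Integer>> stripes) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> stripe : stripes)
            stripe.forEach((neighbor, n) -> total.merge(neighbor, n, Integer::sum));
        return total;
    }

    public static void main(String[] args) {
        // the two stripes for "a" from the slide
        Map<String, Integer> s1 = Map.of("b", 1, "d", 5, "e", 3);
        Map<String, Integer> s2 = Map.of("b", 1, "c", 2, "d", 2, "f", 2);
        System.out.println(sum(List.of(s1, s2))); // {b=2, c=2, d=7, e=3, f=2}
    }
}
```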

Stripes approach

Contents
- Combiner and in-mapper combining
- Complex keys and values
- Relative frequency
- Secondary sorting

Relative frequency
- What proportion of the time does B appear in the context of A? That is, among all co-occurrences (A, *), for what percentage is * equal to B?
- The total count of co-occurrences (A, *) is called the marginal; the reducer holds this value in memory

(a, b1) → 3      (a, b1) → 3 / 32
(a, b2) → 12     (a, b2) → 12 / 32
(a, b3) → 7      (a, b3) → 7 / 32
(a, b4) → 1      (a, b4) → 1 / 32
(a, b5) → 4      (a, b5) → 4 / 32
(a, b6) → 5      (a, b6) → 5 / 32
marginal: (a, *) → 32
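The reducer-side division by the marginal can be sketched as follows (plain Java, illustrative names; the counts are the example from this slide):

```java
import java.util.Map;
import java.util.TreeMap;

public class RelativeFrequency {
    // given the co-occurrence counts (a, b) -> n for one word a, compute the
    // marginal (a, *) and divide every count by it
    public static Map<String, Double> relative(Map<String, Integer> counts) {
        int marginal = 0;
        for (int n : counts.values()) marginal += n;
        final int m = marginal; // effectively final copy for the lambda below
        Map<String, Double> freq = new TreeMap<>();
        counts.forEach((b, n) -> freq.put(b, (double) n / m));
        return freq;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            Map.of("b1", 3, "b2", 12, "b3", 7, "b4", 1, "b5", 4, "b6", 5);
        // marginal (a, *) = 32, matching the slide
        System.out.println(relative(counts).get("b2")); // 0.375 (= 12/32)
    }
}
```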

Relative frequency with stripes
- Easy: one pass to compute the marginal (A, *), then another pass to directly compute f(B|A)
- May have scalability issues for really large data, because the final associative array must hold all the neighbors of A and their co-occurrence counts

Relative frequency with pairs
- The mapper must emit an extra pair (A, *) for every co-occurring pair (A, B), where B can be any word
- A custom partitioner must ensure that all (A, *) and (A, B) pairs are sent to the same reducer
- A custom sort order must ensure that (A, *) arrives at the reducer before any (A, B)
- The reducer must hold state (the marginal) across different key-value pairs
- This pattern is called "order inversion"

Example input sentence for the mapper: "Alice go to the wonderland to have fun"

Example reducer input for computing relative frequencies:
((a, *), [12, 4, 3])
((a, b), [1, 9])
((a, c), [2, 3])
((a, d), [4])
((b, *), [7, 12, 6])
((b, b), [3, 6])
((b, c), [2])
((b, d), [13, 1])
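A compact simulation of order inversion (plain Java, illustrative names; strings stand in for pair writables): the sort order forces the special key (a, *) to the front, so the reducer can store the marginal before it sees the pair counts that follow.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class OrderInversion {
    // keys like "(a,*)" and "(a,b1)" stand in for pair writables; the special
    // key "(a,*)" must sort before all the real pairs for "a"
    public static List<String> relativeFrequencies(Map<String, Integer> counts) {
        List<String> keys = new ArrayList<>(counts.keySet());
        keys.sort(Comparator.comparing((String k) -> !k.endsWith(",*)"))
                            .thenComparing(k -> k)); // "*" keys first, then lexicographic

        List<String> out = new ArrayList<>();
        int marginal = 0; // state the reducer keeps across key-value pairs
        for (String k : keys) {
            if (k.endsWith(",*)"))
                marginal = counts.get(k);                             // remember the marginal first
            else
                out.add(k + " -> " + counts.get(k) + "/" + marginal); // then divide each pair count
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("(a,*)", 32, "(a,b1)", 3, "(a,b2)", 12);
        System.out.println(relativeFrequencies(counts));
        // [(a,b1) -> 3/32, (a,b2) -> 12/32]
    }
}
```

In Hadoop this sort order would be supplied by a custom key comparator, and a custom partitioner would keep all keys for "a" on one reducer.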

Contents
- Combiner and in-mapper combining
- Complex keys and values
- Relative frequency
- Secondary sorting

Secondary sorting
A motivating example: the readings of m sensors are recorded over time. For each sensor we would like its readings sorted by time,
(m1, t1, r80521)
(m1, t2, r21823)
...
(m2, t1, r14209)
(m2, t2, r66508)
(m3, t1, r76042)
(m3, t2, r98347)
but the raw data arrives ordered by timestamp across all sensors:
(t1, m1, r80521)
(t1, m2, r14209)
(t1, m3, r76042)
...
(t2, m1, r21823)
(t2, m2, r66508)
(t2, m3, r98347)

First approach
Map each record (t, m, r) to the pair (m, (t, r)):
(t1, m1, r80521) → (m1, (t1, r80521))
(t1, m2, r14209) → (m2, (t1, r14209))
(t1, m3, r76042) → (m3, (t1, r76042))
...
(t2, m1, r21823) → (m1, (t2, r21823))
(t2, m2, r66508) → (m2, (t2, r66508))
(t2, m3, r98347) → (m3, (t2, r98347))
However, Hadoop MapReduce sorts intermediate pairs by key only; the values of a key can arrive in arbitrary order, e.g., (m1, [(t100, r23456), (t2, r21823), ..., (t234, r34870)]).
So the reducer must buffer all the values in memory and then sort them. Is there an issue with this approach? Yes: a sensor with enough readings can exhaust the reducer's memory.

Second approach: value-to-key conversion
- Move part of the value into the intermediate key to form a composite key (m, t), so the intermediate pair becomes ((m, t), r)
- Let the execution framework do the sorting: first by the sensor id m (the left element of the key), then by the timestamp t (the right element)
- Implement a custom partitioner so that all pairs with the same sensor id are shuffled to the same reducer
((m1, t1), r80521)
((m1, t2), r21823)
((m1, t3), r149625)
...
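A sketch of value-to-key conversion in plain Java (illustrative names; the comparator plays the role of the framework's sort, and partition() the role of the custom partitioner):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class SecondarySort {
    // composite key after value-to-key conversion: (sensor, time)
    public record Key(String sensor, int time) {}

    // the framework's sort: by sensor id first, then by timestamp
    public static List<Map.Entry<Key, String>> sortByCompositeKey(List<Map.Entry<Key, String>> pairs) {
        List<Map.Entry<Key, String>> sorted = new ArrayList<>(pairs);
        sorted.sort(Comparator.comparing((Map.Entry<Key, String> e) -> e.getKey().sensor())
                              .thenComparingInt(e -> e.getKey().time()));
        return sorted;
    }

    // the custom partitioner: hash only the sensor id, so every (m, t) pair of
    // one sensor reaches the same reducer regardless of its timestamp
    public static int partition(Key key, int numReducers) {
        return (key.sensor().hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        List<Map.Entry<Key, String>> pairs = List.of(
            Map.entry(new Key("m1", 2), "r21823"),
            Map.entry(new Key("m2", 1), "r14209"),
            Map.entry(new Key("m1", 1), "r80521"));
        for (Map.Entry<Key, String> e : sortByCompositeKey(pairs))
            System.out.println(e.getKey() + " -> " + e.getValue());
        // m1's readings come out adjacent and time-ordered (r80521, r21823), then m2's r14209
    }
}
```

The reducer then receives each sensor's readings already in time order and never has to buffer them.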