Ch. 3 Lin and Dyer’s text Pages 43-73 (39-69)

Word count:
 Local aggregation (in-mapper combining), as opposed to an external combiner whose execution is NOT guaranteed by the Hadoop framework (sketch below)
◦ May not carry over unchanged to every problem: if we wanted the "mean" per word instead of the "count", the value types at the output of the map may have to be adjusted
Word co-occurrence (matrix)
◦ Very important, since many (many) problems are expressed and solved using matrices
◦ Pairs and stripes approaches
◦ And a comparison of these two methods, p. 60 (56)
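A minimal sketch of in-mapper combining for word count, using the standard Hadoop Java API; class and variable names are illustrative, not from the text:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining: aggregate counts in a map held across map() calls,
// then emit once in cleanup(). Unlike a separate Combiner, this aggregation
// is guaranteed to run.
public class InMapperWordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final Map<String, Integer> counts = new HashMap<>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        counts.merge(token, 1, Integer::sum);
      }
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}
```

The trade-off: the counts map must fit in the mapper's memory, which is fine for vocabulary-sized data but not in general.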

 First version: simple counts
 Then "relative frequency" instead of counts
◦ What is relative frequency? Instead of absolute counts, compute
◦ f(wj | wi) = N(wi, wj) / ∑w' N(wi, w')
◦ For example, if the word "dog" co-occurred with "food" 23 times, and "dog" co-occurred with all words 460 times, then the relative frequency is 23/460 = 1/20 = 0.05
◦ The 460 could come from many mappers, over many documents across the entire corpus
◦ The marginal contributions from every mapper are delivered to the "corresponding reducer" under a special key (wi, *)
◦ The sort order ensures this special key arrives at the reducer as the first pair
◦ The magic is in how the reducer processes them: it computes the marginal first, then divides each joint count by it (see the tables below)
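A minimal Java sketch of the mapper side, assuming the pair (wi, wj) is encoded as the Text key "wi,wj" and "wi,*" is the special marginal key; for brevity, "co-occurrence" here means appearing on the same line, whereas the text uses a neighbor window:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// For every co-occurring pair (wi, wj), emit both the joint-count key
// "wi,wj" and the special key "wi,*" that contributes to wi's marginal.
public class PairsRelativeFrequencyMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] words = value.toString().split("\\s+");
    for (int i = 0; i < words.length; i++) {
      for (int j = 0; j < words.length; j++) {
        if (i != j && !words[i].isEmpty() && !words[j].isEmpty()) {
          context.write(new Text(words[i] + "," + words[j]), ONE);
          context.write(new Text(words[i] + ",*"), ONE);
        }
      }
    }
  }
}
```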

At the reducer (Reducer 1 handles the "dog" keys; Reducer 2 handles the "tiger" keys):

Reducer 1:
  Key           Value            Reducer operation/compute                   Result
  (dog, *)      [200, 350, 650]  One per mapper (with combiner); marginal    ∑w' N(dog, w') = 1200
  (dog, bark)   60               Relative frequency                          60/1200 = 0.05
  (dog, cat)    12               Relative frequency                          12/1200 = 0.01
  (dog, food)   600              Relative frequency                          600/1200 = 0.50
  ...

Reducer 2:
  Key           Value            Reducer operation/compute                   Result
  (tiger, *)    [100, 300, 600]  Compute marginal                            ∑w' N(tiger, w') = 1000
  (tiger, cub)  10               Compute relative frequency                  10/1000 = 0.01
  (tiger, hunt) 100              Compute relative frequency                  100/1000 = 0.10
  (tiger, prey) 200              Compute relative frequency                  200/1000 = 0.20
  ...

(Slide figure: the same partitioning scheme shown with 4 different reducers.)

Putting it together, this "order inversion" pattern requires (a sketch of the last two points follows this list):
 Emitting a special key-value pair for each co-occurring word pair in the mapper, to capture its contribution to the marginal.
 Controlling the sort order of the intermediate key, so that the key-value pairs representing the marginal contributions are processed by the reducer before any of the pairs representing the joint word co-occurrence counts.
 Defining a custom partitioner to ensure that all pairs with the same left word are shuffled to the same reducer.
 Preserving state across multiple keys in the reducer, to first compute the marginal based on the special key-value pairs and then divide the joint counts by the marginal to arrive at the relative frequencies.
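A minimal Java sketch of the partitioner and the state-preserving reducer, continuing the "wi,wj" Text-key encoding from above. With that encoding no custom sort comparator is needed: "*" (ASCII 42) sorts before any letter, so "dog,*" precedes "dog,bark" under the default Text ordering. Class names are illustrative:

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

// Partition on the left word only, so (dog,*) and every (dog,x)
// reach the same reducer.
class LeftWordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String leftWord = key.toString().split(",")[0];
    return (leftWord.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

// Preserve state (the marginal) across keys: the special (wi,*) pairs
// sort first, so the marginal is complete before any joint count arrives.
class RelativeFrequencyReducer
    extends Reducer<Text, IntWritable, Text, DoubleWritable> {

  private long marginal = 0;

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values,
      Context context) throws IOException, InterruptedException {
    long sum = 0;
    for (IntWritable v : values) {
      sum += v.get();            // with a combiner: partial sums, e.g. [200, 350, 650]
    }
    if (key.toString().endsWith(",*")) {
      marginal = sum;            // e.g. (dog,*) -> 1200
    } else {
      context.write(key, new DoubleWritable((double) sum / marginal));
    }
  }
}
```

The partitioner is wired in with job.setPartitionerClass(LeftWordPartitioner.class).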

... all delivered to the same reducer. What can you do with this?
 The reducer can compute a function over the grouped values, e.g. middle(left's value, right's value)
 More generally, you can plug in any function you want: think of it as performing a "KEY.operation" on the "VALUE.data"
 Therein lies the power of MR.

 Text: word count
 Text: co-occurrence, via pairs and stripes
 Numerical data processing with most math functions
 How about sensor data? Consider m sensors sending out readings rx at various times ty, resulting in a large volume of data of the format:
(t1; m1; r80521)
(t1; m2; r14209)
(t1; m3; r76042)
...
(t2; m1; r21823)
(t2; m2; r66508)
(t2; m3; r98347)
 Suppose you wanted to know the readings grouped by sensor: how could you process the above to get that information? Use MR to do that... etc.
 But what if you wanted those readings sorted by the time t that is part of the value?

 Solution 1: let the reducer do the sorting in memory → memory bottleneck
 Solution 2: move the value to be sorted into the key ("value-to-key conversion"), and modify the sort and partitioning accordingly (sketch below)
 In the latter, the "secondary sorting" is left to the framework, and the framework excels at sorting anyway
 So solution 2 is the preferred approach
 Lesson: let the framework do what it is good at, and don't try to pull that work into your own code; in the latter case you would be regressing to the "usual" coding practices and their ensuing disadvantages
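A minimal Java sketch of value-to-key conversion for the sensor example, assuming the mapper emits a composite Text key of the form "sensorId:timestamp" (an illustrative encoding; a custom WritableComparable would be the more typical production choice):

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition on the sensor id alone, so all of one sensor's readings
// go to the same reducer even though timestamps are in the key.
class SensorPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    String sensorId = key.toString().split(":")[0];
    return (sensorId.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

// Group all composite keys with the same sensor id into one reduce()
// call; within the group, the framework's sort has already ordered
// the readings by timestamp.
class SensorGroupingComparator extends WritableComparator {
  protected SensorGroupingComparator() {
    super(Text.class, true);
  }
  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    String sensorA = a.toString().split(":")[0];
    String sensorB = b.toString().split(":")[0];
    return sensorA.compareTo(sensorB);
  }
}
```

These are wired in with job.setPartitionerClass(SensorPartitioner.class) and job.setGroupingComparatorClass(SensorGroupingComparator.class); the default lexicographic sort on the composite key then does the secondary sort by timestamp (assuming timestamps are zero-padded so string order matches numeric order).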

 Reduce-side join is intuitive but inefficient
 Map-side join requires a simple merge of the respective input files, with an appropriate sort done beforehand by the MR framework
 In-memory joins can be done when one of the data sets is small enough to fit in memory (sketch below)
 We will NOT discuss these in detail, since other solutions such as Hive and HBase are available for warehouse data. We will look into those later.
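A minimal sketch of an in-memory (hash) join, assuming the smaller data set has been shipped to each mapper (e.g. via Hadoop's distributed cache) as a local file named users.txt with lines of the form "userId,name"; the file name and record formats are illustrative:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-memory join: the small side is loaded into a hash map once per
// mapper in setup(); each record of the large side is then joined
// in map() with a simple lookup. No shuffle is needed for the join itself.
public class InMemoryJoinMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> userNames = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    try (BufferedReader in = new BufferedReader(new FileReader("users.txt"))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split(",", 2);   // userId,name
        userNames.put(parts[0], parts[1]);
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Large side: lines of the form "userId,orderDetails" (illustrative).
    String[] parts = value.toString().split(",", 2);
    String name = userNames.get(parts[0]);
    if (name != null) {
      context.write(new Text(parts[0]), new Text(name + "\t" + parts[1]));
    }
  }
}
```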