Published by Beatrice O'Hara. Modified over 3 years ago.

1
Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011

2
Automatic parallelization technique
- Map function
  - Reads input file in parallel
  - Outputs (key, value) pairs
- Reduce function
  - Input: all pairs with the same key
  - Output: results
- Information Week: Hadoop skills in demand
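As a rough illustration of the map and reduce roles listed on this slide, the following Python sketch simulates a single job in memory. It is not Hadoop code: `run_mapreduce`, `level_map`, `count_reduce`, and the sample `heroes` data are hypothetical names chosen for this example, and nothing here runs in parallel.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate one MapReduce job in memory (sequential, for illustration)."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record may emit any number of (key, value) pairs.
        for key, value in map_fn(record):
            groups[key].append(value)
    results = []
    # Reduce phase: each key is handled once, with all of its values.
    for key in sorted(groups):
        results.extend(reduce_fn(key, groups[key]))
    return results


def level_map(record):
    # record is a hypothetical (hero_id, level) tuple
    hero_id, level = record
    yield level, 1


def count_reduce(key, values):
    yield key, sum(values)


heroes = [("h1", 3), ("h2", 5), ("h3", 3)]
level_counts = run_mapreduce(heroes, level_map, count_reduce)
print(level_counts)
```

The grouping step in the middle plays the part of the shuffle: it is what guarantees that a reduce call sees all pairs with the same key.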

3
Theta-join
- Join on non-equality predicate
- Example: SELECT qid, hid FROM Heroes h, Quests q WHERE q.level <= h.level
Nested Block Loop
- For every block of r, read all of s
- Always applicable
- “Computes” cross-product
Hash Join
- Only examines tuples to join
- Cannot always be used (e.g., theta join)
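The nested block loop idea can be sketched in a few lines of Python. This is a minimal in-memory version under stated assumptions: `block_nested_loop_join`, the `block_size` parameter, and the sample `heroes` and `quests` rows are hypothetical, and a real implementation would read blocks from disk rather than slice lists.

```python
def block_nested_loop_join(r, s, predicate, block_size=2):
    """Join r and s by testing every (r_row, s_row) pair.

    block_size stands in for the number of r rows that fit in memory.
    """
    result = []
    for start in range(0, len(r), block_size):
        block = r[start:start + block_size]
        for s_row in s:  # one full pass over s per block of r
            for r_row in block:
                if predicate(r_row, s_row):
                    result.append((r_row, s_row))
    return result


# Hypothetical sample data: (id, level) tuples.
heroes = [("h1", 3), ("h2", 5)]
quests = [("q1", 2), ("q2", 4), ("q3", 6)]

# Theta predicate from the slide: q.level <= h.level
joined = block_nested_loop_join(heroes, quests, lambda h, q: q[1] <= h[1])
matches = [(q[0], h[0]) for h, q in joined]  # SELECT qid, hid
print(matches)
```

Because every pair is tested, the predicate can be any function of the two rows, which is why this approach is always applicable.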

4
MapReduce Algorithm
- “Computes” cross-product
- Goals:
  - Tuples matched at exactly one reducer
  - Minimal input to a reducer
  - Minimal output from each reducer
- “1-Bucket” means no statistics about the data distribution are used

5
Precompute regions of cross-product SxT
- Use size of S (|S|) and T (|T|)
- Regions are disjoint
- Union of regions covers cross-product
- Each region assigned to single reducer

6
11112222
11112222
11112222
11112222
33334444
33334444
33334444
33334444
|S|=8; |T|=8; #reducers=4
Rows are tuples in S; columns are tuples in T
Value is region for the pair
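The grid on this slide can be regenerated from the table sizes and a region side length. The sketch below assumes square regions numbered left to right, top to bottom, which matches this slide; `region_grid` and its parameters are hypothetical names, and the string rendering only works while region numbers stay single-digit.

```python
import math


def region_grid(size_s, size_t, side):
    """Return one string per S row; each character is the region of that cell."""
    regions_per_band = math.ceil(size_t / side)  # regions across one band of rows
    grid = []
    for row in range(size_s):
        band = row // side  # which horizontal band of regions this row falls in
        cells = [band * regions_per_band + col // side + 1 for col in range(size_t)]
        grid.append("".join(str(c) for c in cells))
    return grid


grid = region_grid(8, 8, 4)
for line in grid:
    print(line)
```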

7
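Each row in S
- Randomly assign value (x) from 1 to size(S)
- Output one pair for each region containing row x
- Example: Assume x=3. Output one pair for region 1 and one for region 2
Each row in T
- Same, except x is drawn from 1 to size(T) and selects a column
- Example: Assume x=3. Output one pair for region 1 and one for region 3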
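A minimal sketch of this map step, assuming the 8x8 grid with four 4x4 regions numbered left to right, top to bottom. The pair layout `(region, (origin, row))`, the function names, and the parameters are assumptions made for this example; the random index is passed in as `x` so the slide's x=3 case can be reproduced.

```python
import math
import random


def map_s_row(row, x, side, size_t):
    """Emit (region, ("S", row)) for every region that overlaps grid row x."""
    regions_per_band = math.ceil(size_t / side)
    band = (x - 1) // side  # x is 1-based, as on the slide
    for offset in range(regions_per_band):
        yield band * regions_per_band + offset + 1, ("S", row)


def map_t_row(row, x, side, size_s, size_t):
    """Emit (region, ("T", row)) for every region that overlaps grid column x."""
    regions_per_band = math.ceil(size_t / side)
    bands = math.ceil(size_s / side)
    column_block = (x - 1) // side
    for band in range(bands):
        yield band * regions_per_band + column_block + 1, ("T", row)


def random_index(size, rng=random):
    """Draw the random value x from 1 to size, inclusive."""
    return rng.randint(1, size)


# Reproduce the slide's example with x=3 for both tables.
s_pairs = list(map_s_row("s-row", 3, side=4, size_t=8))
t_pairs = list(map_t_row("t-row", 3, side=4, size_s=8, size_t=8))
print(s_pairs)
print(t_pairs)
```

In a real job `x` would come from `random_index(size)` for each input row rather than being fixed.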

8
- Each reducer joins all S rows it receives with all T rows it receives
- Can use any join algorithm appropriate for join value
- Output cross-product, theta join or equi-join
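The reducer side can be sketched as follows, assuming each incoming value is tagged with its table of origin as an `(origin, row)` tuple. That tagging scheme, `reduce_region`, and the sample values are assumptions for illustration; the local join here is a plain nested loop, though any suitable algorithm could be substituted.

```python
def reduce_region(region_id, tagged_rows, predicate):
    """Join the S rows and T rows that were routed to one region."""
    s_rows = [row for origin, row in tagged_rows if origin == "S"]
    t_rows = [row for origin, row in tagged_rows if origin == "T"]
    # Nested loop over the two groups; the predicate decides the join type.
    return [(s, t) for s in s_rows for t in t_rows if predicate(s, t)]


# Hypothetical input for region 1: rows are plain integers here.
tagged = [("S", 5), ("T", 3), ("S", 1), ("T", 7)]

theta_result = reduce_region(1, tagged, lambda s, t: s >= t)   # theta join
cross_result = reduce_region(1, tagged, lambda s, t: True)     # cross-product
equi_result = reduce_region(1, tagged, lambda s, t: s == t)    # equi-join
print(theta_result, cross_result, equi_result)
```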

9
- Random assignment of tuples
- Since actual row number unknown, any row number works
- Some reducer will compare tuple to any tuple in other table
- Therefore, every pair is compared (as in nested block loop join) at exactly one reducer
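The "exactly one reducer" claim can be checked by brute force on the 8x8 grid from slide 6. The sketch below is only a sanity check for that one layout, not a proof; `GRID`, `regions_for_s`, and `regions_for_t` are hypothetical names.

```python
# Region grid from slide 6: rows are S indices, columns are T indices.
GRID = ["11112222"] * 4 + ["33334444"] * 4


def regions_for_s(x):
    """Regions that receive an S tuple assigned to row x (1-based)."""
    return set(GRID[x - 1])


def regions_for_t(y):
    """Regions that receive a T tuple assigned to column y (1-based)."""
    return {row[y - 1] for row in GRID}


# For every possible (row, column) assignment, find the regions that see both tuples.
shared = {
    (x, y): regions_for_s(x) & regions_for_t(y)
    for x in range(1, 9)
    for y in range(1, 9)
}
all_single = all(len(regions) == 1 for regions in shared.values())
print(all_single)
```

Whatever row and column the random draw picks, the two tuples meet in the single region that covers that cell, so the pair is neither missed nor duplicated.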

10
Basis for minimal input and minimal output
- Let |S| be size of table S; r number of reducers
- Optimal output: |S||T|/r
- Optimal input: sqrt(|S||T|/r) from each table
- Special case: |S| = s*sqrt(|S||T|/r); |T| = t*sqrt(|S||T|/r)
- Optimal: s*t = r squares with side length sqrt(|S||T|/r)
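A short worked calculation of these quantities, using |S|=8, |T|=8, r=4 as sample inputs. The function name `optimal_region` and the variable names are hypothetical.

```python
import math


def optimal_region(size_s, size_t, reducers):
    """Return (cells per reducer, side length of a square region)."""
    cells = size_s * size_t / reducers  # optimal output: |S||T|/r
    side = math.sqrt(cells)             # optimal input from each table
    return cells, side


cells, side = optimal_region(8, 8, 4)
s_squares = 8 / side  # s in |S| = s * side
t_squares = 8 / side  # t in |T| = t * side
print(cells, side, s_squares, t_squares)
```

Here both `s_squares` and `t_squares` are whole numbers and their product equals the number of reducers, so this input falls into the special case where the cross-product tiles exactly into squares.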

11
11112222
11112222
11112222
11112222
33334444
33334444
33334444
33334444
|S|=8; |T|=8; r=4; sqrt(|S||T|/r)=4; s=t=2

12
Optimal case is rare
- General case: s=floor(|S|/sqrt(|S||T|/r)); t=floor(|T|/sqrt(|S||T|/r))
- Side length: floor((1+1/min(s,t)) * sqrt(|S||T|/r))
- Note: floor function omitted from paper
- Example: |S|=|T|=8; r=9
  - s=t=floor(8/sqrt(64/9))=3
  - Side length = floor((1+1/3)*sqrt(64/9))=3
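The general-case formulas can be evaluated without floating point, which matters for inputs like this slide's example where 8/sqrt(64/9) is exactly 3 and rounding could push the floor to 2. The sketch below rewrites each expression as the integer square root of a floored ratio, using the identity floor(sqrt(x)) = isqrt(floor(x)); `general_case` is a hypothetical helper and requires Python 3.8 or later for `math.isqrt`.

```python
import math


def general_case(size_s, size_t, reducers):
    """Return (s, t, side length) for the general case, using exact integers."""
    # floor(|S| / sqrt(|S||T|/r)) == floor(sqrt(|S| * r / |T|))
    s = math.isqrt(size_s * reducers // size_t)
    t = math.isqrt(size_t * reducers // size_s)
    m = min(s, t)
    if m == 0:
        raise ValueError("too few reducers for these table sizes")
    # floor((1 + 1/m) * sqrt(|S||T|/r)) == floor(sqrt((m+1)^2 |S||T| / (m^2 r)))
    side = math.isqrt((m + 1) ** 2 * size_s * size_t // (m ** 2 * reducers))
    return s, t, side


s, t, side = general_case(8, 8, 9)
print(s, t, side)
```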

13
11122255
11122255
11122255
33344466
33344466
33344466
77788899
77788899
Assumed partitioning
Note: 64/9=7.111...
Eight partitions with 7 cells and one with 8 would be better
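To see how far the assumed partitioning is from the balanced one, the cells per region can be counted directly from the grid on this slide. `GRID` and the other names are hypothetical; the grid literal is copied from the slide.

```python
from collections import Counter

# Region grid from this slide (|S|=|T|=8, r=9, side length 3).
GRID = ["11122255"] * 3 + ["33344466"] * 3 + ["77788899"] * 2

# Count how many cross-product cells each region (reducer) must produce.
sizes = Counter(cell for row in GRID for cell in row)
largest = max(sizes.values())
smallest = min(sizes.values())
ideal = 64 / 9  # average cells per reducer

for region in sorted(sizes):
    print(region, sizes[region])
print(largest, smallest, ideal)
```

The largest region holds more cells than the 7 or 8 an even split would give, and the smallest holds far fewer, which is the imbalance the slide's note points out.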

14
Map
- Each row in S: output a pair keyed by its join attribute value
- Each row in T: output a pair keyed by its join attribute value
Reducer
- Join all matching rows (same as 1-Bucket)
- Cannot be used for arbitrary theta joins
- Subject to skew
- Great for foreign key join w/uniform distribution
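A minimal in-memory sketch of a key-partitioned equi-join in this style, assuming the map step keys each row by its join attribute. The function `equi_join`, the key extractors, and the sample rows are hypothetical; the per-key row counts are included to show where skew would appear.

```python
from collections import defaultdict


def equi_join(s_rows, t_rows, key_s, key_t):
    """Group both tables by join key, then pair rows within each group."""
    groups = defaultdict(lambda: ([], []))
    for row in s_rows:  # map step for S: key by join attribute
        groups[key_s(row)][0].append(row)
    for row in t_rows:  # map step for T: key by join attribute
        groups[key_t(row)][1].append(row)
    result = []
    load = {}
    for key in sorted(groups):  # one reduce call per join key
        s_group, t_group = groups[key]
        load[key] = len(s_group) + len(t_group)
        result.extend((s, t) for s in s_group for t in t_group)
    return result, load


# Hypothetical foreign key join: heroes reference a guild id.
heroes = [("h1", "g1"), ("h2", "g1"), ("h3", "g2")]
guilds = [("g1", "Red"), ("g2", "Blue")]

joined, load = equi_join(heroes, guilds, lambda h: h[1], lambda g: g[0])
print(joined)
print(load)
```

Rows with different keys never meet, so a predicate such as `<=` cannot be evaluated this way, and a key that dominates the data sends most rows to a single reducer.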

15
Cloud data set
- Information about cloud cover
- 382 million records
- 28.8 GB
- Cloud-5-i is a 5 million record subset

SELECT S.date, S.longitude, S.latitude
FROM Cloud S, Cloud T
WHERE S.date = T.date AND S.longitude = T.longitude AND ABS(S.latitude - T.latitude) <= 10

SELECT S.latitude, T.latitude
FROM Cloud-5-1 S, Cloud-5-2 T
WHERE ABS(S.latitude - T.latitude) < 2

17
- MapReduce algorithm for arbitrary joins
- Always applicable
- Effective for large-scale data analysis
- Additional statistics provide better performance
