JOIN ALGORITHMS USING MAPREDUCE Haiping Wang

1 JOIN ALGORITHMS USING MAPREDUCE Haiping Wang ctqlwhp1022@163.com

2 OUTLINE
- MapReduce framework
- MapReduce implementation on Hadoop
- Join algorithms using MapReduce

3 MAPREDUCE: SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS. In OSDI, 2004

4 MAPREDUCE WORD COUNT DIAGRAM
[Diagram: the words in files 1 to 7 (ah, er, if, or, uh) are mapped to (word, 1) pairs, grouped by word (ah:1,1,1,1  er:1  if:1,1  or:1,1,1  uh:1) and reduced to the counts ah 4, er 1, if 2, or 3, uh 1.]
map(String inputkey, String inputvalue):
reduce(String outputkey, Iterator intermediate_values):
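The word count flow in the diagram can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not Hadoop code: `run_word_count` and the file contents are hypothetical, chosen only so that the per-word totals agree with the grouped values in the diagram.

```python
from collections import defaultdict

def map_fn(input_key, input_value):
    """Emit (word, 1) for every word in one input file."""
    for word in input_value.split():
        yield word, 1

def reduce_fn(output_key, intermediate_values):
    """Sum the counts collected for one word."""
    return output_key, sum(intermediate_values)

def run_word_count(files):
    """Hypothetical driver: runs map, groups by key (the shuffle), then reduces."""
    groups = defaultdict(list)
    for name, text in files.items():
        for word, one in map_fn(name, text):
            groups[word].append(one)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))

# Hypothetical file contents; only the resulting totals follow the diagram.
files = {
    "file 1": "ah ah er",
    "file 2": "ah if or",
    "file 3": "or uh",
    "file 4": "or ah if",
}
counts = run_word_count(files)
print(counts)
```

The grouping step stands in for the framework's shuffle and sort; in a real job the map and reduce calls would run on different machines.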

5 MAPREDUCE IMPLEMENTATION ON HADOOP
[Diagram components: JobTracker, TaskTracker, InputFormat, Record Reader, Mapper, Partitioner, Sorter, Copy, Reducer, Record Writer, OutputFormat]

6

7 HADOOP MAPREDUCE FRAMEWORK ARCHITECTURE

8 JOIN ALGORITHMS USING MAPREDUCE
- Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters (SIGMOD 07)
- Semi-join Computation on Distributed File Systems Using Map-Reduce-Merge Model (SAC 10)
- Optimizing Joins in a Map-Reduce Environment (VLDB 09, EDBT 2010)
- A Comparison of Join Algorithms for Log Processing in MapReduce (SIGMOD 10)

9 MAP-REDUCE-MERGE: SIMPLIFIED RELATIONAL DATA PROCESSING ON LARGE CLUSTERS (SIGMOD 07)

10 MAP-REDUCE-MERGE IMPLEMENTATIONS OF RELATIONAL JOIN ALGORITHMS
Sort-merge join
  Map:    range partitioner; ordered buckets, each bucket assigned to one reducer
  Reduce: read the designated buckets from all mappers and merge them into a sorted set
  Merge:  read sorted buckets from the two data sets and do a sort-merge join
Hash join
  Map:    hash partitioner; hashed buckets, each bucket assigned to one reducer
  Reduce: read the designated buckets from all mappers and use a hash table to group and aggregate the records (same hash function as the mapper); no sorter is needed
  Merge:  in-memory hash join
Block nested-loop join
  Map:    the same as the hash join
  Reduce: the same as the hash join
  Merge:  nested-loop join

11 EXAMPLE: HASH JOIN
[Diagram: for each of the two data sets, split -> mapper -> reducer; both sets of reducers feed the merger.]
- Mapper: uses a hash partitioner
- Reducer: reads from every mapper for one designated partition
- Merger: reads from two sets of reducer outputs that share the same hashing buckets; one is used as the build set and the other as the probe set
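The build and probe roles in the merger can be illustrated with a small in-memory sketch. It is a simplified, single-process stand-in for the Map-Reduce-Merge pipeline: the function names, the bucket count and the sample tuples are all hypothetical and not taken from the paper.

```python
from collections import defaultdict

def hash_partition(records, key_of, num_buckets):
    """Map/reduce side: put each record in the bucket chosen by hashing its join key."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[hash(key_of(rec)) % num_buckets].append(rec)
    return buckets

def merge_hash_join(build_bucket, probe_bucket, build_key, probe_key):
    """Merge side: in-memory hash join of two buckets that share a bucket id."""
    table = defaultdict(list)
    for rec in build_bucket:          # build phase
        table[build_key(rec)].append(rec)
    joined = []
    for rec in probe_bucket:          # probe phase
        for match in table.get(probe_key(rec), []):
            joined.append((match, rec))
    return joined

def hash_join(build_set, probe_set, build_key, probe_key, num_buckets=4):
    """Join bucket i of one data set only with bucket i of the other."""
    build_buckets = hash_partition(build_set, build_key, num_buckets)
    probe_buckets = hash_partition(probe_set, probe_key, num_buckets)
    result = []
    for bucket_id in range(num_buckets):
        result.extend(merge_hash_join(build_buckets.get(bucket_id, []),
                                      probe_buckets.get(bucket_id, []),
                                      build_key, probe_key))
    return result

# Hypothetical sample data: (movie_id, label) and (user_id, movie_id, rating).
movies = [(661, "movie-661"), (1193, "movie-1193"), (914, "movie-914")]
ratings = [(1, 661, 3), (1, 1193, 5), (1, 661, 4)]
pairs = hash_join(movies, ratings, lambda m: m[0], lambda r: r[1])
```

Because both sides use the same hash function and bucket count, matching keys always land in buckets with the same id, which is why each merger only needs one bucket from each side.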

12 ANALYSIS AND CONCLUSION
Connections
- Data sets A(ma, ra) and B(mb, rb), where m and r are the numbers of mappers and reducers for each data set, with r mergers; suppose ra = rb = r
- Map -> Reduce connections = ra*ma + rb*mb = r*(ma + mb)
- Reduce -> Merge connections in the one-to-one case = 2r
- Matcher: compares tuples to see if they should be merged or not
Conclusion
- Uses multiple map-reduce jobs
- The partitioner may cause a data skew problem
- The choice of ma, ra, mb, rb and r (is ra = rb?) determines the number of connections
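As a quick check of the connection counts, here is a worked instance with made-up sizes (ma = 4, mb = 6 and r = 3 are assumptions for illustration only, not figures from the paper):

```latex
\begin{align*}
\text{Map} \to \text{Reduce} &= r_a m_a + r_b m_b = r\,(m_a + m_b) = 3 \cdot (4 + 6) = 30 \\
\text{Reduce} \to \text{Merge} &= 2r = 2 \cdot 3 = 6
\end{align*}
```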

13 SEMI-JOIN COMPUTATION STEPS AND WORKFLOW
- Equi-join
- Reduces communication costs and disk I/O costs
- Insensitive to data skew?

14 A COMPARISON OF JOIN ALGORITHMS FOR LOG PROCESSING IN MAPREDUCE (SIGMOD 10)
- Equi-join between a log table L and a reference table R on a single column: L ⋈ (L.k = R.k) R, with |L| ≫ |R|
- L, R and the join result are stored in the DFS
- Scans are used to access L and R
- Each map or reduce task can optionally implement two additional functions, init() and close(); these functions can be called before or after each map or reduce task

15 REPARTITION JOIN (HIVE)
L: Ratings.dat, R: movies.dat. Map output pairs: (key, tagged record)

Input L (Ratings.dat):
  1::1193::5::978300760
  1::661::3::978302109
  1::661::3::978301968
  1::661::4::978300275
  1::1193::5::97882429
Input R (movies.dat):
  661::James and the Giant…
  914::My Fair Lady..
  1193::One Flew Over the…
  2355::Bug’s Life, A…
  3408::Erin Brockovich…
Map output:
  1193, L:1::1193::5::978300760
  661, L:1::661::3::978302109
  661, L:1::661::3::978301968
  661, L:1::661::4::978300275
  1193, L:1::1193::5::97882429
  661, R:661::James and the Gia…
  914, R:914::My Fair Lady..
  1193, R:1193::One Flew Over …
  2355, R:2355::Bug’s Life, A…
  3408, R:3408::Erin Brockovi…
Shuffle: (661, …) (1193, …) (661, …) (2355, …) (3408, …) (914, …) (1193, …)
Reduce input (grouped by join key):
  (661, [L:1::661::3::97…], [R:661::James…], [L:1::661::3::978…], [L:1::661::4::97…])
  (2355, [R:2355::B’…])
  (3408, [R:3408::Eri…])
Reduce: buffers records into two sets according to the table tag, then computes the cross-product:
  {(661::James…)} X {(1::661::3::97…), (1::661::4::97…)}
Reduce output: (1,Ja..,3, …) (1,Ja..,4, …)

Drawback: all records for a join key may have to be buffered, which can run out of memory when
- the key cardinality is small
- the data is highly skewed

Phase/Function          Improvement
Map function            Output key is changed to a composite of the join key and the table tag
Partitioning function   Hashcode is computed from just the join key part of the composite key
Grouping function       Records are grouped on just the join key
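The three improvements can be sketched as a single-process simulation. This is only an illustration of the idea: it is not Hive or Hadoop code, the function names and the numeric tags are hypothetical, and it assumes the composite key sorts reference records (R) ahead of log records (L), so the reducer buffers R only and streams L.

```python
from itertools import groupby

TAG_R, TAG_L = 0, 1  # assumption: R sorts before L within each join key

def map_fn(table_tag, record, key_index):
    """Emit a composite key (join key, table tag) with the raw record as value."""
    fields = record.split("::")
    return (int(fields[key_index]), table_tag), record

def partition(composite_key, num_reducers):
    """Hash only the join key part, so both tables' records for a key meet in one reducer."""
    join_key, _tag = composite_key
    return hash(join_key) % num_reducers

def reduce_fn(join_key, tagged_records):
    """Buffer only the R records and stream the L records past them."""
    r_buffer = []
    for tag, record in tagged_records:
        if tag == TAG_R:
            r_buffer.append(record)
        else:
            for r_record in r_buffer:
                yield join_key, r_record, record

def repartition_join(l_records, r_records, num_reducers=2):
    pairs = [map_fn(TAG_L, rec, 1) for rec in l_records]   # L join key: 2nd field
    pairs += [map_fn(TAG_R, rec, 0) for rec in r_records]  # R join key: 1st field
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    output = []
    for part in partitions:
        part.sort(key=lambda kv: kv[0])  # sort on the full composite key
        # Grouping function: group on the join key only.
        for join_key, group in groupby(part, key=lambda kv: kv[0][0]):
            tagged = ((key[1], value) for key, value in group)
            output.extend(reduce_fn(join_key, tagged))
    return output

ratings = ["1::1193::5::978300760", "1::661::3::978302109",
           "1::661::3::978301968", "1::661::4::978300275"]
movies = ["661::James and the...", "914::My Fair Lady..", "1193::One Flew Over the..."]
joined = repartition_join(ratings, movies)
```

With this ordering the reducer holds at most the R records for one key in memory, which is usually the smaller side since |L| ≫ |R|.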

16 THE COST MEASURE FOR MR ALGORITHMS
- The communication cost of a process is the size of the input to the process
- This paper does not count the output size for a process, because
  - the output must be input to at least one other process, or
  - the final output is much smaller than its input
- The total communication cost is the sum of the communication costs of all processes that constitute an algorithm
- The elapsed communication cost is defined on the acyclic graph of processes: consider a path through this graph and sum the communication costs of the processes along that path; the maximum sum over all paths is the elapsed communication cost
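Both measures are easy to compute once the process graph is written down. The sketch below is a minimal illustration; the process names, input sizes and edges are invented for the example and do not come from the paper.

```python
from functools import lru_cache

def total_cost(input_size):
    """Total communication cost: the sum of the input sizes of all processes."""
    return sum(input_size.values())

def elapsed_cost(input_size, successors):
    """Elapsed communication cost: the largest sum of input sizes along any path."""
    @lru_cache(maxsize=None)
    def longest_from(process):
        rest = max((longest_from(nxt) for nxt in successors.get(process, ())),
                   default=0)
        return input_size[process] + rest
    return max(longest_from(process) for process in input_size)

# Hypothetical acyclic graph: two map processes feeding two reduce processes.
input_size = {"map1": 100, "map2": 50, "reduce1": 80, "reduce2": 70}
successors = {"map1": ("reduce1", "reduce2"), "map2": ("reduce1", "reduce2")}

print(total_cost(input_size))                # sum over all four processes
print(elapsed_cost(input_size, successors))  # heaviest map-to-reduce path
```

Here the total cost counts every process, while the elapsed cost follows only the heaviest path (map1 followed by reduce1), which is what bounds the wall-clock time when processes run in parallel.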

17 2-WAY JOIN IN MAPREDUCE: R(A,B) ⋈ S(B,C)
Input R:         Input S:
  A   B            B   C
  a0  b0           b0  c0
  a1  b1           b0  c1
  a2  b2           b1  c2
  …   …            …   …
Reduce input (one reducer per group of keys):
  K   V              K   V
  b0  (a0, R)        b1  (a1, R)
  b0  (c0, S)        b1  (c2, S)
  b0  (c1, S)        …   …
  …   …
Final output:
  A   B   C
  a0  b0  c0
  a0  b0  c1
  a1  b1  c2
  …   …   …

Table   Tuple    Map           Partition & sort
R       (a, b)   b -> (a, R)   Hash(b) -> (a, R)
S       (b, c)   b -> (c, S)   Hash(b) -> (c, S)
Reduce: b -> (a, c)
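The map and reduce steps for the two-way join can be sketched directly from the tables on this slide. This is a single-process illustration with hypothetical function names; the dictionary of groups plays the role of the partition and sort step.

```python
from collections import defaultdict

def map_r(a, b):
    """R(a, b) -> key b, value (a, 'R')."""
    return b, (a, "R")

def map_s(b, c):
    """S(b, c) -> key b, value (c, 'S')."""
    return b, (c, "S")

def reduce_join(b, values):
    """Pair every R value with every S value that shares the join key b."""
    a_values = [v for v, tag in values if tag == "R"]
    c_values = [v for v, tag in values if tag == "S"]
    return [(a, b, c) for a in a_values for c in c_values]

def two_way_join(r_tuples, s_tuples):
    groups = defaultdict(list)  # stands in for partition and sort by key
    for a, b in r_tuples:
        key, value = map_r(a, b)
        groups[key].append(value)
    for b, c in s_tuples:
        key, value = map_s(b, c)
        groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_join(key, groups[key]))
    return output

# The sample tuples shown on the slide.
R = [("a0", "b0"), ("a1", "b1"), ("a2", "b2")]
S = [("b0", "c0"), ("b0", "c1"), ("b1", "c2")]
result = two_way_join(R, S)
```

For the slide's data, the tuple (a2, b2) produces no output because no S tuple carries the key b2.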

18 JOINING SEVERAL RELATIONS AT ONCE
R(A,B) ⋈ S(B,C) ⋈ T(C,D)
[Diagram: inputs R, S and T -> Map -> Reduce input -> Reduce -> Final output]

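19 JOINING SEVERAL RELATIONS AT ONCE
R(A,B) ⋈ S(B,C) ⋈ T(C,D)
Let h be a hash function with range 1, 2, …, m
- S(b, c) -> (h(b), h(c))
- R(a, b) -> (h(b), all)
- T(c, d) -> (all, h(c))
Each Reduce process computes the join of the tuples it receives
Example: m = 4, k = 16 (number of Reduce processes: 4^2 = 16)
[Diagram: a 4 x 4 grid of Reduce processes indexed by h(b) = 0, 1, 2, 3 and h(c) = 0, 1, 2, 3. An S tuple with h(S.b) = 2 and h(S.c) = 1 goes to the single cell (2, 1); an R tuple with h(R.b) = 2 goes to every cell with h(b) = 2; a T tuple with h(T.c) = 1 goes to every cell with h(c) = 1.]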
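The routing rules above can be simulated in one process to see that every joined tuple is produced exactly once. The sketch is illustrative only: it uses Python's built-in `hash` modulo m as a stand-in for h (so cells are numbered from 0), and the function names and sample tuples are hypothetical.

```python
from collections import defaultdict

def h(value, m):
    """Stand-in hash function with range 0 .. m-1 (an assumption for this sketch)."""
    return hash(value) % m

def map_phase(R, S, T, m):
    """Route tuples to the m x m grid of reducers, keyed by (h(b), h(c))."""
    cells = defaultdict(lambda: {"R": [], "S": [], "T": []})
    for b, c in S:                       # S goes to exactly one cell
        cells[(h(b, m), h(c, m))]["S"].append((b, c))
    for a, b in R:                       # R is replicated along the h(c) axis
        for j in range(m):
            cells[(h(b, m), j)]["R"].append((a, b))
    for c, d in T:                       # T is replicated along the h(b) axis
        for i in range(m):
            cells[(i, h(c, m))]["T"].append((c, d))
    return cells

def reduce_phase(cell):
    """Join the R, S and T tuples that arrived at one reducer."""
    out = []
    for a, b in cell["R"]:
        for b2, c in cell["S"]:
            if b != b2:
                continue
            for c2, d in cell["T"]:
                if c == c2:
                    out.append((a, b, c, d))
    return out

def three_way_join(R, S, T, m=4):
    result = []
    for cell in map_phase(R, S, T, m).values():
        result.extend(reduce_phase(cell))
    return result

# Hypothetical sample data with integer attribute values.
R = [(1, 2), (5, 6), (7, 2)]
S = [(2, 1), (6, 9), (2, 5)]
T = [(1, 10), (5, 11), (9, 12)]
joined = three_way_join(R, S, T)
```

Each S tuple lives in exactly one cell, and the R and T tuples that can join with it are replicated into that cell, so no result is missed or duplicated. The price is that every R and T tuple is sent to m reducers.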

20 PROBLEM SOLVING
Problem solving using the method of Lagrange multipliers
- Take derivatives with respect to the three variables a, b, c
- Multiply the three equations
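The equations themselves are not in the transcript. As a hedged illustration of the two steps, assume the three-way join R(A,B) ⋈ S(B,C) ⋈ T(A,C) with relation sizes r, s, t, share variables a, b, c for the attributes A, B, C, and k = abc Reduce processes. Under that assumption (which the slide does not confirm), each R tuple is replicated c times, each S tuple a times and each T tuple b times, and the derivation runs as follows:

```latex
\begin{align*}
\text{minimize}\quad & rc + sa + tb \quad \text{subject to} \quad abc = k \\
\mathcal{L} &= rc + sa + tb - \lambda\,(abc - k) \\
\frac{\partial \mathcal{L}}{\partial a} = 0 &\;\Rightarrow\; s = \lambda bc \;\Rightarrow\; sa = \lambda k \\
\frac{\partial \mathcal{L}}{\partial b} = 0 &\;\Rightarrow\; t = \lambda ac \;\Rightarrow\; tb = \lambda k \\
\frac{\partial \mathcal{L}}{\partial c} = 0 &\;\Rightarrow\; r = \lambda ab \;\Rightarrow\; rc = \lambda k \\
(sa)(tb)(rc) = \lambda^3 k^3 &\;\Rightarrow\; rst\,k = \lambda^3 k^3 \;\Rightarrow\; \lambda = \sqrt[3]{rst / k^2} \\
a = \sqrt[3]{krt / s^2}, \quad b &= \sqrt[3]{krs / t^2}, \quad c = \sqrt[3]{kst / r^2} \\
rc + sa + tb &= 3\lambda k = 3\sqrt[3]{krst}
\end{align*}
```

Multiplying the three equations is what eliminates a, b and c at once, since their product is fixed at k.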

21 SPECIAL CASES
- Star joins
- Chain joins: a chain join is a join of the form [formula not captured in the transcript]

22 CONCLUSION
- Only suitable for equi-joins
- Uses a single map-reduce job
- Does not consider I/O (for the intermediate pairs) or CPU time
- Main contribution: use of the “Lagrangean multipliers” method
