A Comparison of Join Algorithms for Log Processing in MapReduce Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, Yuanyuan Tian.

A Comparison of Join Algorithms for Log Processing in MapReduce Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, Yuanyuan Tian By Santosh Kumar Nukavarapu

Contents Introduction Requirement Log Processing and MapReduce Join Algorithms in Map Reduce i) Overview of Repartition Join Algorithm ii)Outlook of Broadcast Join, Semi-Join, Per-Split Semi-Join Experimental Evaluation Results Conclusion and Future Work

Introduction Map Reduce is very popular in analysis of large datasets. Positives Hide’s the parallelization, fault tolerance and load balancing details through it’s framework. Negatives ): Ignores many concepts of Parallel RDBMs. Lack of declarative language, solid schema and indexes.

Facebook,Yahoo,Google and many Web 2.0 companies are highly interested in Map Reduce. Why ? log processing is very important data analysis that is required by these companies. Map Reduce absolutely suit’s their Requirement. Requirement

Log Processing And Map Reduce What is Log Processing ? Log of events such as click-stream,phone call records or sequence of transactions are collected and are stored in flat files. Then these files are processed to compute various statistics to derive some business insights. Reasons to use Map Reduce for Log Processing : 1.Extremely large amount of Data involved. CompanyData stored per day China Mobile5-8 TB Facebook6TB

2. Log records do not always follow the same schema. 3. Third, all the log records within a time period are typically analyzed together, making simple scans preferable to index scans. 4. Important to keep the job analysis going even in the event of failures Parallel RDBMSMap Reduce Solid Schema not suitable for developers/analysts. ): Lack of a solid schema Formatting and Loading this huge amount of data is a challenge. ): Easily achieved here Log Records within a time period are analyzed together, so we need simple scans rather than indexes. Does not support indexing and simple scans can be easily achieved Failures and processing cannot take place in parallel. We achieve it here

Problem Specification Problem Required SolutionSolutions achieved here 1. Log needs to be joined with Reference data like information of users So we need to go for joinsyes 2. Map Reduce Framework is cumbersome for joins We need to have some join algorithms for Map Reduce yes 3. MapReduce programmers use inefficient algorithms Guidance about correct algorithm at correct time Yes, but also some work is left for future!

Assumptions made for our JOIN ALGORITHMS IN MAPREDUCE We consider an equi-join between a log table L and a reference table R on a single column. L,R and the Join Result is stored in DFS. Scans are used to access L and R. Each map or reduce task can optionally implement two additional functions: init() and close(). These functions can be called before or after each map or reduce task. L ⊲⊳ L.k=R.k R, with |L| ≫ |R|

Algorithm1 : Repartition Join Map Phase : 1.Each map task works on a split of either R or L. 2.Each map task tags the record with its originating table. 3.Outputs the extracted join key and the tagged record as a (key, value) pair. 4.The outputs are then partitioned, sorted and merged by the framework.

Reducer Phase : 1.All the records for each join key are grouped together and eventually fed to a reducer. 2.For each join key, the reduce function first separates and buffers the input records into two sets according to the table tag. 3.Performs a cross-product between records in the above sets. Problem with this version of Algorithm : All the records for a given join key from both L and R have to be buffered. So, we can be out of memory ):

Improvement to Re-partition join Phase /FunctionImprovement Map Functionoutput key is changed to a composite of the join key and the table tag. partitioning functionHashcode is computed from just the join key part of the composite key Grouping functionrecords are grouped on just the join key

AlgorithmProblem or CaseKey Method/Point for achieving solution Broadcast JoinIf the reference table R is much smaller than the log table L, i.e. |R| ≪ |L| Broadcast the smaller table R, as it avoids sorting on both tables and more importantly avoids the network overhead for moving the larger table L. Semi-JoinR is large, many records in R may not be actually referenced by any records in table L In the map function, a main-memory hash table is used to determine the set of unique join keys in a split of L. Per-Split Semi-JoinOne problem with semi- join is that not every record in the ﬁltered version of R will join with a particular split Li of L. Moves just the records in R that will join with each split of L.

Experimental Evaluation Hardware/software/procedureConfiguration /version Cluster100 node Each node2.4GHz Intel Core 2 Duo processor 4GB of DRAM two SATA disks Operating SystemRed Hat Enterprise Server 5.2 running Linux 2.6.18 2 Racksown gigabit Ethernet switch rack level bandwidth32Gb/s Hadoop version0.19.0 conﬁgured it to run up to two map and two reduce tasks concurrently per node block size128 Mb ReplicationEach HDFs block was replicated 3 times

Results Picture taken from : A Comparison of Join Algorithms for Log Processing in MapReduce by Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao,Eugene J. Shekita, Yuanyuan Tian.

Conclusion Joining log data with all kinds of reference data in MapReduce has emerged as an important part of analytic operations for : 1. Enterprise customers 2. Web 2.0 companies Evaluated the join methods on a 100-node system. Shown Unique tradeoﬀs of these join algorithms in the context of MapReduce. Study can help an optimizer select the appropriate algorithm based on data. Future Work Evaluating methods for multi-way joins. Exploring indexing methods to speedup join queries, Designing an optimization module that can automatically select the appropriate join algorithms.

References Google labs A Comparison of Join Algorithms for Log Processing in MapReduce by Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao,Eugene J. Shekita, Yuanyuan Tian. Wikipedia Ibm.com

A Comparison of Join Algorithms for Log Processing in MapReduce Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, Yuanyuan Tian.

Similar presentations

Presentation on theme: "A Comparison of Join Algorithms for Log Processing in MapReduce Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, Yuanyuan Tian."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

A Comparison of Join Algorithms for Log Processing in MapReduce Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, Yuanyuan Tian.

Similar presentations

Presentation on theme: "A Comparison of Join Algorithms for Log Processing in MapReduce Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, Yuanyuan Tian."— Presentation transcript:

Similar presentations

About project

Feedback