A Comparison of Join Algorithms for Log Processing in MapReduce Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, Yuanyuan Tian.

Presentation transcript:

A Comparison of Join Algorithms for Log Processing in MapReduce Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, Yuanyuan Tian By Santosh Kumar Nukavarapu

Contents
Introduction
Requirement
Log Processing and MapReduce
Join Algorithms in MapReduce
  i) Overview of the Repartition Join algorithm
  ii) Outlook on Broadcast Join, Semi-Join, and Per-Split Semi-Join
Experimental Evaluation
Results
Conclusion and Future Work

Introduction
MapReduce is very popular for the analysis of large datasets.
Positives: The framework hides the details of parallelization, fault tolerance, and load balancing.
Negatives: It ignores many concepts from parallel RDBMSs. It lacks a declarative language, a solid schema, and indexes.

Requirement
Facebook, Yahoo, Google, and many Web 2.0 companies are highly interested in MapReduce. Why? Log processing is an important form of data analysis for these companies, and MapReduce suits this requirement well.

Log Processing and MapReduce
What is log processing? Logs of events such as click-streams, phone call records, or sequences of transactions are collected and stored in flat files. These files are then processed to compute various statistics that yield business insights.
Reasons to use MapReduce for log processing:
1. An extremely large amount of data is involved.
   Company: data stored per day
   China Mobile: 5-8 TB
   Facebook: 6 TB

2. Log records do not always follow the same schema.
3. All the log records within a time period are typically analyzed together, making simple scans preferable to index scans.
4. It is important to keep the analysis job going even in the event of failures.

Parallel RDBMS vs. MapReduce
Schema. Parallel RDBMS: a solid schema is not suitable for developers/analysts. MapReduce: no solid schema is required.
Loading. Parallel RDBMS: formatting and loading this huge amount of data is a challenge. MapReduce: easily achieved.
Scans. Parallel RDBMS: log records within a time period are analyzed together, so simple scans are needed rather than indexes. MapReduce: does not support indexing, and simple scans are easily achieved.
Failures. Parallel RDBMS: processing cannot keep going while failures occur. MapReduce: this is achieved.

Problem Specification
1. Problem: Logs need to be joined with reference data, such as information about users. Required solution: we need joins. Achieved here: yes.
2. Problem: The MapReduce framework is cumbersome for joins. Required solution: we need join algorithms for MapReduce. Achieved here: yes.
3. Problem: MapReduce programmers use inefficient algorithms. Required solution: guidance about the correct algorithm at the correct time. Achieved here: yes, but some work is left for the future.

Assumptions made for our JOIN ALGORITHMS IN MAPREDUCE
We consider an equi-join between a log table L and a reference table R on a single column: L ⋈_{L.k=R.k} R, with |L| ≫ |R|.
L, R, and the join result are stored in the DFS.
Scans are used to access L and R.
Each map or reduce task can optionally implement two additional functions: init() and close(). These functions are called before and after each map or reduce task, respectively.
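The init()/close() hooks can be pictured with a small single-process sketch. Everything here is hypothetical and for illustration only: `CountingMapper`, `run_map_task`, and the record layout are not from the slides, and letting close() emit output is an assumption of this sketch rather than something the slides state.

```python
class CountingMapper:
    """Hypothetical mapper that counts the records in its split."""

    def init(self):
        # Runs once, before the first record of the split is mapped.
        self.seen = 0

    def map(self, record):
        # Runs once per record; emits a (join key, record) pair.
        self.seen += 1
        yield record["k"], record

    def close(self):
        # Runs once, after the last record (assumed able to emit output).
        yield "records_seen", self.seen


def run_map_task(mapper, split):
    """Stand-in for the framework driving one map task over one split."""
    output = []
    mapper.init()
    for record in split:
        output.extend(mapper.map(record))
    output.extend(mapper.close())
    return output


split = [{"k": 1, "event": "click"}, {"k": 3, "event": "view"}]
output = run_map_task(CountingMapper(), split)
```

Per-task state set up in init() is what lets a map task hold something like an in-memory hash table across all the records of its split.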

Algorithm 1: Repartition Join
Map Phase:
1. Each map task works on a split of either R or L.
2. Each map task tags each record with its originating table.
3. It outputs the extracted join key and the tagged record as a (key, value) pair.
4. The outputs are then partitioned, sorted, and merged by the framework.

Reduce Phase:
1. All the records for each join key are grouped together and eventually fed to a reducer.
2. For each join key, the reduce function first separates and buffers the input records into two sets according to the table tag.
3. It then performs a cross-product between the records in these two sets.
Problem with this version of the algorithm: all the records for a given join key from both L and R have to be buffered, so the reducer can run out of memory.
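The map and reduce steps above can be sketched as a single-process simulation. This is only an illustration under stated assumptions, not Hadoop code: `shuffle` stands in for the framework's partition/sort/merge step, and the function names, the `"k"` key field, and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def map_record(table_tag, record, key_field="k"):
    """Tag the record with its source table; emit (join key, tagged record)."""
    return record[key_field], (table_tag, record)


def shuffle(pairs):
    """Stand-in for the framework: group all values by join key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_join(key, tagged_records):
    """Buffer both sides by tag, then emit their cross-product."""
    r_buf = [rec for tag, rec in tagged_records if tag == "R"]
    l_buf = [rec for tag, rec in tagged_records if tag == "L"]
    return [(key, r, l) for r, l in product(r_buf, l_buf)]


def repartition_join(L, R):
    pairs = [map_record("L", rec) for rec in L]
    pairs += [map_record("R", rec) for rec in R]
    out = []
    for key, values in shuffle(pairs).items():
        out.extend(reduce_join(key, values))
    return out


R = [{"k": 1, "name": "alice"}, {"k": 2, "name": "bob"}]
L = [
    {"k": 1, "event": "click"},
    {"k": 1, "event": "view"},
    {"k": 3, "event": "click"},
]
result = repartition_join(L, R)
```

Note that `reduce_join` holds both `r_buf` and `l_buf` in memory for a key, which is exactly the buffering problem described above.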

Improvement to Repartition Join
Phase/Function: Improvement
Map function: the output key is changed to a composite of the join key and the table tag.
Partitioning function: the hashcode is computed from just the join-key part of the composite key.
Grouping function: records are grouped on just the join key.
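A rough single-process sketch of these three changes follows. It assumes, as in the Blanas et al. paper, that the tags are chosen so R records sort ahead of L records for each join key, which lets the reducer buffer only R and stream L. The function names, the integer tags, the `"k"` key field, and the sample data are hypothetical.

```python
from itertools import groupby

R_TAG, L_TAG = 0, 1  # assumed tag values: R sorts before L within a join key


def map_record(tag, record, key_field="k"):
    """Emit a composite key (join key, table tag) with the record."""
    return (record[key_field], tag), record


def partition(composite_key, num_reducers):
    """Hash only the join-key part so both tables meet at one reducer."""
    join_key, _tag = composite_key
    return hash(join_key) % num_reducers


def reduce_join(join_key, tagged_records):
    """R records arrive first: buffer only R, stream L past the buffer."""
    r_buf = []
    for tag, rec in tagged_records:
        if tag == R_TAG:
            r_buf.append(rec)
        else:
            for r in r_buf:
                yield join_key, r, rec


def improved_repartition_join(L, R, num_reducers=2):
    buckets = [[] for _ in range(num_reducers)]
    for tag, table in ((R_TAG, R), (L_TAG, L)):
        for rec in table:
            ckey, value = map_record(tag, rec)
            buckets[partition(ckey, num_reducers)].append((ckey, value))
    out = []
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])  # sort on the full composite key
        # Group on the join key alone, ignoring the tag.
        for join_key, group in groupby(bucket, key=lambda kv: kv[0][0]):
            out.extend(reduce_join(join_key, ((ck[1], v) for ck, v in group)))
    return out


R = [{"k": 1, "name": "alice"}, {"k": 2, "name": "bob"}]
L = [
    {"k": 1, "event": "click"},
    {"k": 1, "event": "view"},
    {"k": 3, "event": "click"},
]
result = improved_repartition_join(L, R)
```

Since |L| ≫ |R| is assumed, buffering only the R side per key keeps reducer memory small compared with buffering both sides.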

Algorithm: problem or case, and key method for achieving the solution

Broadcast Join
Problem or case: The reference table R is much smaller than the log table L, i.e. |R| ≪ |L|.
Key method: Broadcast the smaller table R, as it avoids sorting on both tables and, more importantly, avoids the network overhead of moving the larger table L.

Semi-Join
Problem or case: R is large, and many records in R may not actually be referenced by any record in table L.
Key method: In the map function, a main-memory hash table is used to determine the set of unique join keys in a split of L.

Per-Split Semi-Join
Problem or case: One problem with semi-join is that not every record in the filtered version of R will join with a particular split Li of L.
Key method: Move just the records in R that will join with each split of L.
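The broadcast join and semi-join ideas can be sketched in one process as follows. This is a simplified illustration, assuming R (or its filtered version) fits in each map task's memory; the class and function names, the `"k"` key field, and the sample data are hypothetical, and the per-split variant is not shown.

```python
class BroadcastJoinMapper:
    """Map-only join: every map task holds its own copy of R in memory."""

    def __init__(self, R, key_field="k"):
        self.R = R
        self.key_field = key_field
        self.table = {}

    def init(self):
        # Build an in-memory hash table on R, keyed by the join column.
        for rec in self.R:
            self.table.setdefault(rec[self.key_field], []).append(rec)

    def map(self, l_rec):
        # Probe the hash table with each record of the local split of L.
        key = l_rec[self.key_field]
        for r_rec in self.table.get(key, []):
            yield key, r_rec, l_rec


def broadcast_join(L_splits, R):
    out = []
    for split in L_splits:  # one map task per split of L; L is never moved
        task = BroadcastJoinMapper(R)
        task.init()
        for l_rec in split:
            out.extend(task.map(l_rec))
    return out


def semi_join(L_splits, R, key_field="k"):
    # Phase 1: collect the unique join keys of each split of L, then merge.
    unique_keys = set()
    for split in L_splits:
        unique_keys |= {rec[key_field] for rec in split}
    # Phase 2: keep only the R records referenced by some record in L.
    filtered_R = [rec for rec in R if rec[key_field] in unique_keys]
    # Phase 3: join the (smaller) filtered R with L by broadcasting it.
    return broadcast_join(L_splits, filtered_R)


R = [
    {"k": 1, "name": "alice"},
    {"k": 2, "name": "bob"},
    {"k": 4, "name": "carol"},
]
L_splits = [
    [{"k": 1, "event": "click"}, {"k": 3, "event": "view"}],
    [{"k": 1, "event": "view"}],
]
joined = broadcast_join(L_splits, R)
semi_joined = semi_join(L_splits, R)
```

Both calls produce the same join result; the semi-join only reduces how much of R has to be shipped to the map tasks, at the cost of the extra key-collection and filtering passes.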

Experimental Evaluation
Hardware/software/procedure: configuration/version
Cluster: 100 nodes
Each node: 2.4 GHz Intel Core 2 Duo processor, 4 GB of DRAM, two SATA disks
Operating system: Red Hat Enterprise Server 5.2 running Linux
Racks: each with its own gigabit Ethernet switch
Rack-level bandwidth: 32 Gb/s
Hadoop: configured to run up to two map and two reduce tasks concurrently per node
Block size: 128 MB
Replication: each HDFS block was replicated 3 times

Results
Picture taken from: A Comparison of Join Algorithms for Log Processing in MapReduce, by Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, Yuanyuan Tian.

Conclusion
Joining log data with all kinds of reference data in MapReduce has emerged as an important part of analytic operations for:
1. Enterprise customers
2. Web 2.0 companies
The join methods were evaluated on a 100-node system.
The study shows the unique tradeoffs of these join algorithms in the context of MapReduce.
The study can help an optimizer select the appropriate algorithm based on the data.

Future Work
Evaluating methods for multi-way joins.
Exploring indexing methods to speed up join queries.
Designing an optimization module that can automatically select the appropriate join algorithms.

References
Google labs
A Comparison of Join Algorithms for Log Processing in MapReduce, by Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, Yuanyuan Tian.
Wikipedia
Ibm.com