Efficient Parallel Set-Similarity Joins Using MapReduce. Rares Vernica, Michael J. Carey, Chen Li. Speaker: Razvan Belet.



Outline: Motivating Scenarios; Background Knowledge; Parallel Set-Similarity Join (Self Join, R-S Join); Evaluation; Conclusions; Strengths & Weaknesses

Scenario: Detecting Plagiarism Before publishing a Journal, editors have to make sure there is no plagiarized paper among the hundreds of papers to be included in the Journal

Scenario: Near-duplicate elimination The archive of a search engine can contain multiple copies of the same page Reasons: re-crawling, different hosts holding the same redundant copies of a page, etc.

Problem Statement: Given two collections of objects/items/records, a similarity metric sim(o1, o2) and a threshold λ, find the pairs of objects/items/records satisfying sim(o1, o2) > λ. Solution: Similarity Join

Motivation (2): Some of the collections are enormous: Google N-gram database: ~1 trillion records; GenBank: 416 GB of data; Facebook: 400 million active users. Try to process this data in a parallel, distributed way => MapReduce


Background Knowledge: Set-Similarity Join. Covered in order: Join, Similarity Join, Set-Similarity Join

Background Knowledge: Join. Logical operator heavily used in databases. Whenever records in 2 tables need to be associated => use a JOIN. A JOIN associates records in the 2 input tables based on a predicate (pred).
Table Employees (LastName, DepartmentID): (Rafferty, 31), (Jones, 33), (Steinberg, 33), (Robinson, 34), (Smith, 34), (John, NULL)
Table Departments (DepartmentID, DepartmentName): (31, Sales), (33, Engineering), (34, Clerical), (35, Marketing)
Consider this information need: for each employee, find the department he works in

Background Knowledge: Join. Example: for each employee, find the department he works in.
EMPLOYEES (LastName, DepID): (Rafferty, 31), (Jones, 33), (Steinberg, 33), (Robinson, 34), (Smith, 34), (John, NULL)
DEPARTMENTS (DepartmentID, DepartmentName): (31, Sales), (33, Engineering), (34, Clerical), (35, Marketing)
EMPLOYEES JOIN DEPARTMENTS with pred: EMPLOYEES.DepID = DEPARTMENTS.DepartmentID
JOIN RESULT (LastName, DepartmentName): (Rafferty, Sales), (Jones, Engineering), (Steinberg, Engineering), …

Background Knowledge: Similarity Join. Special type of join in which the predicate (pred) is a similarity metric/function: sim(obj1, obj2). Return the pair (obj1, obj2) if pred holds: sim(obj1, obj2) > threshold. Example: T1 (columns a, b, c) Similarity Join T2 (columns d, e, c) with pred: sim(T1.c, T2.c) > threshold; result columns: a, b, c, d, e

Background Knowledge: Similarity Join. Examples of sim(obj1, obj2) functions: sim(paper1, paper2) is computed from Si, the most common words in page i, and Tj, the most common words in page j
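One common set-based choice for sim, assumed here purely for illustration (the slide text does not name the function), is the Jaccard coefficient over the two word sets. The function name `jaccard` and the sample word sets are hypothetical.

```python
def jaccard(s: set, t: set) -> float:
    """Jaccard coefficient |S ∩ T| / |S ∪ T|; defined here as 0.0 for two empty sets."""
    union = s | t
    if not union:
        return 0.0
    return len(s & t) / len(union)

# Hypothetical "most common words" of two pages.
page_i = {"parallel", "join", "similarity", "mapreduce"}
page_j = {"parallel", "join", "hadoop"}
print(jaccard(page_i, page_j))  # 2 shared words out of 5 distinct words
```

A normalized measure like this keeps sim in [0, 1], which matches the fractional thresholds (0.3, 0.5) used on the later slides.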

Similarity Join: sim(obj1, obj2). obj1, obj2: documents, records in DB tables, user profiles, images, etc. Particular class of similarity joins: (string/text-)similarity join, where obj1, obj2 are strings/texts. Many real-world applications => of particular interest.
T1.Name: John W. Smith; Marat Safin; Rafael P. Nadal; …
T2.Name: Smith, John; Safin, Marat Michailowitsch; Nadal, Rafael Parera; …
SimilarityJoin with sim(T1.Name, T2.Name) = #common words; pred: sim(T1.Name, T2.Name) > 2

Set-Similarity Join (SSJoin). SSJoin: a powerful primitive for supporting (string-)similarity joins. Input: 2 collections of sets, S1 = {…}, S2 = {…}, …, Sn = {…} and T1 = {…}, T2 = {…}, …, Tn = {…}, where each set has the form {word1, word2, …, wordn}. Goal: identify all pairs of highly similar sets. Example pred: sim(Si, Tj) > 0.3

Set-Similarity Join: how can a (string-)similarity join be reduced to an SSJoin? Example: tokenize each string into a set of words, then run the SSJoin on the sets.
T1.Name: {John, W., Smith}; {Marat, Safin}; {Rafael, P., Nadal}; …
T2.Name: {Smith, John}; {Safin, Marat, Michailowitsch}; {Nadal, Rafael, Parera}; …
SSJoin pred: sim(T1.Name, T2.Name) > 0.5

Set-Similarity Join: most SSJoin algorithms are signature-based.
INPUT: set collections R and S and threshold λ
1. For each r ∈ R, generate signature-set Sign(r)
2. For each s ∈ S, generate signature-set Sign(s)
3. Generate all candidate pairs (r, s), r ∈ R, s ∈ S, satisfying Sign(r) ∩ Sign(s) ≠ ∅
4. Output any candidate pair (r, s) satisfying Sim(r, s) ≥ λ
Filtering phase: steps 1-3. Post-filtering phase: step 4.
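The four steps can be sketched in a few lines. This is a minimal single-machine illustration under stated assumptions: `signature_join`, `jaccard` and the sample collections are hypothetical names, Jaccard is an assumed choice of Sim, and the trivial signature used in the demo (the set itself) is only a placeholder for a real signature scheme.

```python
from itertools import product

def jaccard(r, s):
    """Assumed similarity function: |r ∩ s| / |r ∪ s| (0.0 if both sets are empty)."""
    union = r | s
    return len(r & s) / len(union) if union else 0.0

def signature_join(R, S, sign, sim, threshold):
    # Steps 1 and 2: one signature-set per input set.
    sig_r = {rid: sign(r) for rid, r in R.items()}
    sig_s = {sid: sign(s) for sid, s in S.items()}
    # Step 3: keep only pairs whose signatures overlap. The pairwise scan is for
    # clarity only; a practical implementation would use an index or grouping.
    candidates = [(rid, sid) for rid, sid in product(R, S) if sig_r[rid] & sig_s[sid]]
    # Step 4 (post-filtering): verify the real similarity on the candidates.
    return [(rid, sid) for rid, sid in candidates if sim(R[rid], S[sid]) >= threshold]

R = {1: {"a", "b", "c"}, 2: {"x", "y"}}
S = {10: {"a", "b", "d"}, 20: {"z"}}
# Trivial signature: the set itself (always correct, but filters very little).
print(signature_join(R, S, sign=frozenset, sim=jaccard, threshold=0.5))  # [(1, 10)]
```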

Set-Similarity Join: Signatures. Have a filtering effect: the SSJoin algorithm compares only candidates, not all pairs (in the post-filtering phase). Determine the efficiency of the SSJoin algorithm: the smaller the number of candidate pairs, the better. Ensure correctness: Sign(r) ∩ Sign(s) ≠ ∅ whenever Sim(r, s) ≥ λ

Set-Similarity Join: Signatures Example. One possible signature scheme: prefix-filtering.
Compute global ordering of tokens: Marat … W. … Safin … Rafael … Nadal … P. … Smith … John
Compute signature of each input set: take the prefix of length n (here n = 2)
Sign({John, W., Smith}) = [W., Smith]
Sign({Marat, Safin}) = [Marat, Safin]
Sign({Rafael, P., Nadal}) = [Rafael, Nadal]
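The prefix signatures on this slide can be reproduced with a short sketch. It assumes the global ordering contains only the tokens named on the slide, in the order shown (the elided tokens in between are ignored); `GLOBAL_ORDER`, `RANK` and `prefix_signature` are hypothetical names.

```python
# Assumed global ordering: only the tokens named on the slide, in slide order.
GLOBAL_ORDER = ["Marat", "W.", "Safin", "Rafael", "Nadal", "P.", "Smith", "John"]
RANK = {token: position for position, token in enumerate(GLOBAL_ORDER)}

def prefix_signature(tokens, n=2):
    """Sort the set by the global ordering and keep its first n tokens."""
    return sorted(tokens, key=RANK.__getitem__)[:n]

print(prefix_signature({"John", "W.", "Smith"}))    # ['W.', 'Smith']
print(prefix_signature({"Marat", "Safin"}))         # ['Marat', 'Safin']
print(prefix_signature({"Rafael", "P.", "Nadal"}))  # ['Rafael', 'Nadal']
```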

Set-Similarity Join: Filtering Phase. Before doing the actual SSJoin, cluster/group the candidates into cluster/bucket 1, cluster/bucket 2, …, cluster/bucket N. Run the SSJoin on each cluster => less workload. Example: {John, W., Smith} and {Smith, John} fall into the same bucket, as do {Marat, Safin} and {Safin, Marat, Michailowitsch}, and {Rafael, P., Nadal} and {Nadal, Rafael, Parera}


Parallel Set-Similarity Join: the method comprises 3 stages.
Stage I: Token Ordering. Compute data statistics for good signatures.
Stage II: RID-Pair Generation. Group candidates based on signature & compute SSJoin.
Stage III: Record Join. Generate actual pairs of joined records.

Explanation of input data: RID = row ID; a: join column; “A B C” is a string, e.g. Address: “14th Saarbruecker Strasse” or Name: “John W. Smith”

Stage I: Data Statistics (Token Ordering). Compute data statistics for good signatures. Two alternatives: Basic Token Ordering, One Phase Token Ordering

Token Ordering: creates a global ordering of the tokens in the join column, based on their frequency. Example input (columns RID, a, b, c): RID 1 has a = “A B D A A”; RID 2 has a = “B B D A E”. Global ordering (based on frequency): E, D, B, A (frequencies 1, 2, 3, 4)
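As a quick check of the example, the ordering can be computed on a single machine. This is only a sketch of the result the stage produces, not of its MapReduce implementation; `global_token_order` is a hypothetical name, and breaking frequency ties alphabetically is an assumption.

```python
from collections import Counter

# The two example records from the slide: RID -> value of join column a.
records = {1: "A B D A A", 2: "B B D A E"}

def global_token_order(records):
    """Order tokens by increasing frequency (ties broken alphabetically, an assumption)."""
    counts = Counter(token for value in records.values() for token in value.split())
    return sorted(counts, key=lambda token: (counts[token], token))

print(global_token_order(records))  # ['E', 'D', 'B', 'A']
```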

Basic Token Ordering (BTO): 2 MapReduce cycles. 1st: computing token frequencies. 2nd: ordering the tokens by their frequencies

Basic Token Ordering, 1st MapReduce cycle. map: tokenize the join value of each record; emit each token with number of occurrences 1. reduce: for each token, compute the total count (frequency)

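Basic Token Ordering, 2nd MapReduce cycle. map: interchange key with value, so that the frequency becomes the key. reduce (use only 1 reducer): emits the value, i.e. the tokens, which reach the single reducer sorted by frequency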
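The two BTO cycles can be imitated with a toy, single-process stand-in for MapReduce. This is a sketch under assumptions: `run_mapreduce` is a hypothetical helper that mimics map, shuffle-and-sort by key, and reduce in one process; it is not Hadoop code, and all function names are illustrative.

```python
from collections import defaultdict

def run_mapreduce(pairs, mapper, reducer):
    """Toy single-process MapReduce: map, group by key, reduce keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Cycle 1: token frequencies.
def count_map(rid, join_value):
    for token in join_value.split():
        yield token, 1

def count_reduce(token, ones):
    yield token, sum(ones)

# Cycle 2: swap key and value so the sort on keys orders tokens by frequency.
def swap_map(token, frequency):
    yield frequency, token

def emit_reduce(frequency, tokens):
    for token in sorted(tokens):  # alphabetical tie-break is an assumption
        yield token

records = [(1, "A B D A A"), (2, "B B D A E")]
frequencies = run_mapreduce(records, count_map, count_reduce)
ordering = run_mapreduce(frequencies, swap_map, emit_reduce)
print(frequencies)  # [('A', 4), ('B', 3), ('D', 2), ('E', 1)]
print(ordering)     # ['E', 'D', 'B', 'A']
```

Using a single reducer in the second cycle is what turns the per-key sort into one total order over all tokens.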

One Phase Token Ordering (OPTO): alternative to Basic Token Ordering (BTO). Uses only one MapReduce cycle (less I/O). In-memory token sorting, instead of using a reducer

OPTO, details. map: tokenize the join value of each record; emit each token with number of occurrences 1. reduce: for each token, compute the total count (frequency). Use the tear_down method to order the tokens in memory
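A minimal sketch of the one-phase idea, assuming a reducer object that keeps the counts in memory and a `tear_down` hook called once after the last `reduce` call. The class `OptoReducer` and the driver `opto` are hypothetical and only imitate the control flow in a single process.

```python
from collections import defaultdict

class OptoReducer:
    """Accumulates token counts in memory; sorts them once in tear_down."""

    def __init__(self):
        self.counts = {}

    def reduce(self, token, ones):
        self.counts[token] = sum(ones)

    def tear_down(self):
        # Called once after all reduce calls: in-memory sort by frequency
        # (alphabetical tie-break is an assumption).
        return sorted(self.counts, key=lambda token: (self.counts[token], token))

def opto(records):
    groups = defaultdict(list)
    for _rid, join_value in records:        # map: emit (token, 1)
        for token in join_value.split():
            groups[token].append(1)
    reducer = OptoReducer()
    for token, ones in groups.items():      # reduce: total count per token
        reducer.reduce(token, ones)
    return reducer.tear_down()

print(opto([(1, "A B D A A"), (2, "B B D A E")]))  # ['E', 'D', 'B', 'A']
```

The trade-off against BTO is that all token counts must fit in the memory of the one process that sorts them.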

Stage II: Group Candidates & Compute SSJoin (RID-Pair Generation). Group candidates based on signature. Grouping alternatives: Individual Tokens Grouping, Grouped Tokens Grouping. Kernel alternatives: Basic Kernel, PPJoin

RID-Pair Generation: scans the original input data (records); outputs the pairs of RIDs corresponding to records satisfying the join predicate (sim); consists of only one MapReduce cycle; uses the global ordering of tokens obtained in the previous stage

RID-Pair Generation: Map Phase. Scan the input records and for each record: project it on RID & join attribute; tokenize it; extract the prefix according to the global ordering of tokens obtained in the Token Ordering stage; route the tokens to the appropriate reducer

Grouping/Routing Strategies. Goal: distribute candidates to the right reducers to minimize the reducers’ workload. Like hashing (projected) records to the corresponding candidate-buckets. Each reducer handles one or more candidate-buckets. 2 routing strategies: Using Individual Tokens; Using Grouped Tokens

Routing: using individual tokens. Treats each token as a key. For each record, generates a (key, value) pair for each of its prefix tokens: key = token, value = (projected) record. Example: given the global ordering of tokens A, B, E, D, G, C, F: “A B C” => prefix of length 2: A, B => generate/emit 2 (key, value) pairs: (A, (1, A B C)), (B, (1, A B C))
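The map-side routing in this example can be sketched as follows. The prefix length is passed in as a given value (2, as on the slide); how it would be derived from the threshold is not covered here. `route_by_individual_tokens` is a hypothetical name.

```python
GLOBAL_ORDER = ["A", "B", "E", "D", "G", "C", "F"]
RANK = {token: position for position, token in enumerate(GLOBAL_ORDER)}

def route_by_individual_tokens(rid, join_value, rank, prefix_len):
    """Emit one (token, projected record) pair per prefix token of the record."""
    tokens = sorted(set(join_value.split()), key=rank.__getitem__)
    for token in tokens[:prefix_len]:
        yield token, (rid, join_value)

print(list(route_by_individual_tokens(1, "A B C", RANK, prefix_len=2)))
# [('A', (1, 'A B C')), ('B', (1, 'A B C'))]
```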

Grouping/Routing: using individual tokens. Advantage: high quality of grouping of candidates (pairs of records that have no chance of being similar are never routed to the same reducer). Disadvantage: high replication of data (the same records might be checked for similarity in multiple reducers, i.e. redundant work)

Routing: Using Grouped Tokens. Multiple tokens are mapped to one synthetic key (different tokens can be mapped to the same key). For each record, generates a (key, value) pair for each of the groups of its prefix tokens

Routing: Using Grouped Tokens. Example: given the global ordering of tokens A, B, E, D, G, C, F: “A B C” => prefix of length 2: A, B. Suppose A belongs to group X and B belongs to group Y => generate/emit 2 (key, value) pairs: (X, (1, A B C)), (Y, (1, A B C))

Grouping/Routing: Using Grouped Tokens. The groups of tokens (X, Y) are formed by assigning tokens to groups in a round-robin manner (example: tokens A, B, E, D, G, C, F spread over Group1, Group2, Group3). Groups will be balanced w.r.t. the sum of frequencies of the tokens belonging to each group
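A minimal sketch of grouped-token routing, assuming tokens are dealt to groups round-robin in the order of the global token ordering and that groups are identified by integers 0..k-1. Both helper names and the integer group ids are hypothetical.

```python
GLOBAL_ORDER = ["A", "B", "E", "D", "G", "C", "F"]
RANK = {token: position for position, token in enumerate(GLOBAL_ORDER)}

def assign_groups_round_robin(global_order, num_groups):
    """Deal tokens to groups 0..num_groups-1 in turn, following the global order."""
    return {token: i % num_groups for i, token in enumerate(global_order)}

def route_by_grouped_tokens(rid, join_value, rank, group_of, prefix_len):
    """Emit one (group, projected record) pair per distinct group among the prefix tokens."""
    tokens = sorted(set(join_value.split()), key=rank.__getitem__)
    groups = {group_of[token] for token in tokens[:prefix_len]}
    for group in sorted(groups):
        yield group, (rid, join_value)

GROUP_OF = assign_groups_round_robin(GLOBAL_ORDER, num_groups=3)
print(GROUP_OF)  # {'A': 0, 'B': 1, 'E': 2, 'D': 0, 'G': 1, 'C': 2, 'F': 0}
print(list(route_by_grouped_tokens(1, "A B C", RANK, GROUP_OF, prefix_len=2)))
# [(0, (1, 'A B C')), (1, (1, 'A B C'))]
print(list(route_by_grouped_tokens(2, "A D C", RANK, GROUP_OF, prefix_len=2)))
# [(0, (2, 'A D C'))]  both prefix tokens share a group, so the record is sent once
```

The second record shows where the reduced replication comes from: prefix tokens that share a group produce a single copy of the record.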

Grouping/Routing: Using Grouped Tokens Advantage: –Replication of data is not so pervasive Disadvantage: –Quality of grouping is not so high (records having no chance of being similar are sent to the same reducer which checks their similarity)

RID-Pair Generation: Reduce Phase. This is the core of the entire method. Each reducer processes one or more buckets of candidates. In each bucket, the reducer looks for pairs of join attribute values satisfying the join predicate. If the similarity of the 2 candidates >= threshold => output their IDs and also their similarity

RID-Pair Generation: Reduce Phase Computing similarity of the candidates in a bucket comes in 2 flavors: Basic Kernel : uses 2 nested loops to verify each pair of candidates in the bucket Indexed Kernel : uses a PPJoin+ index

RID-Pair Generation: Basic Kernel. Straightforward method for finding candidates satisfying the join predicate. Quadratic complexity: O(#candidates^2).
reduce:
  foreach candidate in bucket
    foreach cand in bucket \ {candidate}
      if sim(candidate, cand) >= threshold
        emit((candidateRID, candRID), sim)
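A runnable sketch of the nested-loop kernel, assuming Jaccard as sim and a bucket given as a list of (RID, join value) pairs. The names `basic_kernel` and `jaccard` are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Assumed similarity function: |a ∩ b| / |a ∪ b| (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def basic_kernel(bucket, threshold, sim=jaccard):
    """Verify every unordered pair of candidates in one bucket."""
    for (rid1, value1), (rid2, value2) in combinations(bucket, 2):
        score = sim(set(value1.split()), set(value2.split()))
        if score >= threshold:
            yield (rid1, rid2), score

bucket = [(1, "A B C"), (2, "A B D"), (3, "A E F")]
print(list(basic_kernel(bucket, threshold=0.5)))  # [((1, 2), 0.5)]
```

Unlike the pseudocode, which visits each pair in both orders, `combinations` visits each unordered pair once; the work is still quadratic in the bucket size.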

RID-Pair Generation: PPJoin+. Uses a special index data structure. Not so straightforward to implement. Much more efficient.
reduce:
  probe PPJoinIndex with join attr value of current_candidate => a list of RIDs satisfying the join predicate
  add the current_candidate to the PPJoinIndex
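The probe-then-add pattern of the reduce function can be illustrated with a deliberately simplified stand-in. This is not PPJoin+: `ToyJoinIndex` is a hypothetical plain inverted index over all tokens, with Jaccard assumed as sim, and it omits the filters that make the real index efficient.

```python
from collections import defaultdict

class ToyJoinIndex:
    """Simplified stand-in for a join index: token -> RIDs seen so far."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.postings = defaultdict(set)
        self.records = {}

    def probe(self, tokens):
        """Return (rid, sim) for indexed records sharing a token and passing the threshold."""
        candidate_rids = set()
        for token in tokens:
            candidate_rids |= self.postings.get(token, set())
        matches = []
        for rid in sorted(candidate_rids):
            other = self.records[rid]
            score = len(tokens & other) / len(tokens | other)
            if score >= self.threshold:
                matches.append((rid, score))
        return matches

    def add(self, rid, tokens):
        self.records[rid] = tokens
        for token in tokens:
            self.postings[token].add(rid)

def indexed_kernel(bucket, threshold):
    index = ToyJoinIndex(threshold)
    for rid, value in bucket:
        tokens = set(value.split())
        for other_rid, score in index.probe(tokens):  # probe first ...
            yield (other_rid, rid), score
        index.add(rid, tokens)                        # ... then add the current candidate

bucket = [(1, "A B C"), (2, "A B D"), (3, "A E F")]
print(list(indexed_kernel(bucket, threshold=0.5)))  # [((1, 2), 0.5)]
```

Probing before adding means each record is only compared against records that came earlier, so every pair is reported once.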

Stage III: Generate pairs of joined records (Record Join). Two alternatives: Basic Record Join, One Phase Record Join

Record Join: until now we have only pairs of RIDs, but we need the actual records. Use the RID pairs generated in the previous stage to join the actual records. Main idea: bring in the rest of each record (everything except the RID, which we already have). 2 approaches: Basic Record Join (BRJ); One-Phase Record Join (OPRJ)

Record Join: Basic Record Join. Uses 2 MapReduce cycles. 1st cycle: fills in the record information for each half of each pair. 2nd cycle: brings together the previously filled-in records
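The two cycles can be imitated in a single process. This is a sketch under assumptions: records are given as a dict from RID to record, RID pairs as (rid1, rid2, sim) triples, and the function `basic_record_join` is a hypothetical name; the grouping steps stand in for the MapReduce shuffle.

```python
from collections import defaultdict

def basic_record_join(records, rid_pairs):
    # Cycle 1, map + shuffle: key records and RID pairs by RID.
    by_rid = defaultdict(lambda: {"record": None, "pairs": []})
    for rid, record in records.items():
        by_rid[rid]["record"] = record
    for rid1, rid2, sim in rid_pairs:
        by_rid[rid1]["pairs"].append((rid1, rid2, sim))
        by_rid[rid2]["pairs"].append((rid1, rid2, sim))
    # Cycle 1, reduce: attach the record to every pair it takes part in (one half each).
    halves = defaultdict(list)
    for rid, group in by_rid.items():
        for pair in group["pairs"]:
            halves[pair].append((rid, group["record"]))
    # Cycle 2, reduce: bring the two filled-in halves of each pair together.
    joined = []
    for (rid1, rid2, sim), filled in sorted(halves.items()):
        record_of = dict(filled)
        joined.append((record_of[rid1], record_of[rid2], sim))
    return joined

records = {1: "A B C", 2: "A B D", 3: "A E F"}
print(basic_record_join(records, [(1, 2, 0.5)]))  # [('A B C', 'A B D', 0.5)]
```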

Record Join: One Phase Record Join Uses only one MapReduce cycle

R-S Join. Challenge: we now have 2 different record sources => 2 different input streams. MapReduce can work on only 1 input stream. The 2nd and 3rd stages are affected. Solution: extend the (key, value) pairs so that they include a relation tag for each record
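One way to picture the relation tag, as a sketch only: the tag is placed in the key next to the RID, and a reducer pairs only records whose tags differ. The key layout `(relation, rid)` and both function names are assumptions made for illustration; similarity verification is left out.

```python
def tag(relation, records):
    """Map side: extend each record's key with a relation tag ("R" or "S")."""
    return [((relation, rid), value) for rid, value in records]

def cross_relation_pairs(bucket):
    """Reduce side: form candidate pairs only between an R record and an S record."""
    r_side = [key for key, _value in bucket if key[0] == "R"]
    s_side = [key for key, _value in bucket if key[0] == "S"]
    return [(r_key, s_key) for r_key in r_side for s_key in s_side]

# Both relations are merged into a single tagged input stream.
bucket = tag("R", [(1, "A B C"), (2, "A B D")]) + tag("S", [(1, "A B E")])
print(cross_relation_pairs(bucket))
# [(('R', 1), ('S', 1)), (('R', 2), ('S', 1))]
```

The tag also keeps RIDs apart when the two relations reuse the same row numbers, as RID 1 does here.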


Evaluation. Cluster: 10-node IBM x3650, running Hadoop. Data sets: DBLP: 1.2M publications; CITESEERX: 1.3M publications. Only the header of each paper is considered (i.e. author, title, date of publication, etc.). Data size synthetically increased (by various factors). Measures: absolute running time; speedup; scaleup

Self-Join running time. Best algorithm: BTO-PK-OPRJ. Most expensive stage: the RID-pair generation

Self-Join Speedup. Fixed data size, vary the cluster size. Best time: BTO-PK-OPRJ

Self-Join Scaleup. Increase data size and cluster size together by the same factor. Best time: BTO-PK-OPRJ

R-S Join Performance: mostly the same behavior

R-S Join Performance


Conclusions. Efficient way of computing Set-Similarity Join. Useful in many data cleaning scenarios. SSJoin and MapReduce: one solution for huge datasets. Very efficient when based on prefix-filtering and PPJoin+. Scales up nicely

Strengths & Weaknesses.
Strengths: more efficient than single-node/local SSJoin; safer against failures than single-node SSJoin; uses powerful filtering methods (routing strategies); uses PPJoinIndex (data structure optimized for SSJoin).
Weaknesses: this implementation is applicable only to string-based input data; supposes the dictionary and RID-pairs list fit in main memory; repeated tokenization; evaluation based on synthetically increased data

Thank you! Questions