
1 Approximate String Joins in a Database (Almost) for Free
L. Gravano, P.G. Ipeirotis (Columbia Univ.)
H.V. Jagadish (Univ. of Michigan)
N. Koudas, S. Muthukrishnan, D. Srivastava (AT&T Labs)

2 Scenario of Application
- The work was done before 2001.
- The related research has since been put into practice.
1/19/2016 Columbia University Computer Science Dept.

3 The Need for String Joins
- Substantial amounts of data in existing DBMSs are strings
- Often, there is a need to correlate data stored in different tables
- Example: Find common customers across various customer databases
Service A: Jenny Stamatopoulou; John Paul McDougal; Aldridge Rodriguez; Panos Ipeirotis; John Smith; …
Service B: Panos Ipirotis; Jonh Smith; …; Jenny Stamatopulou; John P. McDougal; …; Al Dridge Rodriguez

4 Problems with Exact String Joins
Service A: Jenny Stamatopoulou; John Paul McDougal; Aldridge Rodriguez; Panos Ipeirotis; John Smith; …
Service B: Panos Ipirotis; Jonh Smith; …; Jenny Stamatopulou; John P. McDougal; …; Al Dridge Rodriguez
- Typing mistakes (e.g., John vs. Jonh)
- No standard way of recording string data
- Standard equijoins do not “forgive” such mistakes: Service A ⋈ Service B = ∅

5 Approximate String Joins
Service A: Jenny Stamatopoulou; John Paul McDougal; Aldridge Rodriguez; Panos Ipeirotis; John Smith; …
Service B: Panos Ipirotis; Jonh Smith; …; Jenny Stamatopulou; John P. McDougal; …; Al Dridge Rodriguez
- We want to join tuples with “similar” string fields
- Similarity measure: Edit Distance, the minimal number of single-character insertions, deletions, and replacements needed to turn one string into the other (each operation increases the distance by one)
- Edit distances shown for the matched pairs: K=1, K=2, K=1, K=3, K=1

6 Our Focus: Approximate String Joins over Relational DBMSs
- Join two tables on string attributes and keep all pairs of strings with Edit Distance ≤ K
- Solve the problem in a database-friendly way (if possible with an existing "vanilla" RDBMS)

7 Related Work
Data Cleaning
- Hernandez and Stolfo, DMKD Journal, 2 (1), 1998
- Monge and Elkan, SIGMOD DMKD Workshop, 1997
- …
Approximate String Matching
- Baeza-Yates and Gonnet, SPIRE 1999
- Sutinen and Tarhio, ESA’95, CPM’96
- Smith and Waterman, Journal of Molecular Biology 147, 1981
- Ukkonen, TCS 92(1), 1992
- Ullman, Computer Journal, 1977
- …

8 Current Approaches for Processing Approximate String Joins
No native support for approximate joins in RDBMSs
Two existing (straightforward) solutions:
- Join data outside of DBMS
- Join data via user-defined functions (UDFs) inside the DBMS

9 Approximate String Joins outside of a DBMS
1. Export data
2. Join outside of DBMS
3. Import the result
Main advantage: We can exploit any state-of-the-art string-matching algorithm, without restrictions from DBMS functionality
Disadvantages:
- Substantial amounts of data to be exported/imported
- Cannot be easily integrated with further processing steps in the DBMS

10 Approximate String Joins with UDFs
1. Write a UDF to check if two strings match within distance K
2. Write an SQL statement that applies the UDF to the string pairs
    SELECT R.stringAttr, S.stringAttr
    FROM R, S
    WHERE edit_distance(R.stringAttr, S.stringAttr, K)
Main advantage: Ease of implementation
Main disadvantage: UDF applied to entire cross-product of relations
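The UDF approach on this slide can be sketched as follows. This is a minimal illustration that uses Python's built-in sqlite3 module instead of the DBMS used in the talk (an assumption made only so the example runs anywhere). The table and column names follow the slide; the helper names `levenshtein`, `edit_distance_within`, and `naive_udf_join` and the sample rows are hypothetical.

```python
import sqlite3


def levenshtein(s, t):
    """Dynamic-programming edit distance, keeping only two matrix rows."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_distance_within(s, t, k):
    """UDF body: 1 if the strings are within edit distance k, else 0."""
    return 1 if levenshtein(s, t) <= k else 0


def naive_udf_join(r_rows, s_rows, k):
    """Apply the UDF to the whole cross-product of R and S."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("edit_distance", 3, edit_distance_within)
    conn.execute("CREATE TABLE R (stringAttr TEXT)")
    conn.execute("CREATE TABLE S (stringAttr TEXT)")
    conn.executemany("INSERT INTO R VALUES (?)", [(x,) for x in r_rows])
    conn.executemany("INSERT INTO S VALUES (?)", [(x,) for x in s_rows])
    rows = conn.execute(
        "SELECT R.stringAttr, S.stringAttr FROM R, S "
        "WHERE edit_distance(R.stringAttr, S.stringAttr, ?) "
        "ORDER BY R.stringAttr, S.stringAttr", (k,)).fetchall()
    conn.close()
    return rows


service_a = ["Panos Ipeirotis", "John Smith", "Jenny Stamatopoulou"]
service_b = ["Panos Ipirotis", "Jonh Smith", "Jenny Stamatopulou"]
matches = naive_udf_join(service_a, service_b, 2)
```

Every one of the |R| × |S| pairs triggers a full edit-distance computation here, which is the cost the slide names as the main disadvantage.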

11 Edit distance Algorithms
Applied in UDF method: WHERE edit_distance(R.stringAttr, S.stringAttr, K)
Input: two strings s and t with reasonable lengths m and n.
Output: their edit distance.
Dynamic Programming Algorithm: Levenshtein distance

12 Levenshtein distance Algorithm
    int LevenshteinDistance(char s[1..m], char t[1..n])
      declare int d[0..m, 0..n]   // (m+1)*(n+1) matrix
      // initialization of the matrix
      for i from 0 to m
        d[i, 0] := i
      for j from 0 to n
        d[0, j] := j

13 Matrix for Levenshtein distance (initialization)
          K  I  T  T  E  N
       0  1  2  3  4  5  6
    S  1
    I  2
    T  3
    T  4
    I  5
    N  6
    G  7

14 Levenshtein distance Algorithm (cont'd)
      // after the initialization
      for i from 1 to m {
        for j from 1 to n {
          if s[i] = t[j] then cost := 0
          else cost := 1
          d[i, j] := minimum(
            d[i-1, j] + 1,       // deletion
            d[i, j-1] + 1,       // insertion
            d[i-1, j-1] + cost   // substitution
          )
        }
      }
      return d[m, n]

15 Matrix for Levenshtein distance
          K  I  T  T  E  N
       0  1  2  3  4  5  6
    S  1  1  2  3  4  5  6
    I  2  2  1  2  3  4  5
    T  3  3  2  1  2  3  4
    T  4  4  3  2  1  2  3
    I  5  5  4  3  2  2  3
    N  6  6  5  4  3  3  2
    G  7  7  6  5  4  4  3
The “right-most and lowest” element d[m, n] is the edit distance (here 3).
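The pseudocode on the preceding slides can be written as runnable Python roughly as follows. This is a sketch that mirrors the slides' matrix formulation with 0-based Python strings; the function name `levenshtein_matrix` is an illustrative choice, not taken from the talk.

```python
def levenshtein_matrix(s, t):
    """Return the full (m+1) x (n+1) dynamic-programming matrix for s and t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]

    # Initialization: distance from a prefix to the empty string.
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j

    # Fill the remaining cells row by row.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d


matrix = levenshtein_matrix("SITTING", "KITTEN")
distance = matrix[-1][-1]  # bottom-right cell d[m][n]
```

Each row of `matrix` corresponds to one row of the table on the slide, with the bottom-right cell holding the edit distance.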

16 Levenshtein distance Algorithm
Correctness:
- Invariant: d[i, j] holds the edit distance between the prefixes s[1…i] and t[1…j]. Given the distances for the shorter prefixes, d[i, j] follows from d[i-1, j], d[i, j-1], and d[i-1, j-1].
- The initialization step matches the definition of edit distance (a prefix of length i is i deletions away from the empty string); the remaining cells follow by induction on the invariant.
Complexity: O(mn) time, which could be improved (not the heart of the matter).

17 Our Approach: Approximate String Joins over an Unmodified RDBMS
1. Preprocess data and generate auxiliary tables
2. Perform join exploiting standard RDBMS capabilities
Advantages:
- No modification of underlying RDBMS needed.
- Can leverage the RDBMS query optimizer.
- Much more efficient than the approach based on naive UDFs

18 Our Approach: Intuition and Roadmap
Intuition:
- Similar strings have many common substrings
- Use exact joins to perform approximate joins (current DBMSs are good for exact joins)
- A good candidate set can be verified for false positives [Ukkonen 1992, Sutinen and Tarhio 1996, Ullman 1977]
Roadmap:
- Break strings into substrings of length q (q-grams)
- Perform an exact join on the q-grams
- Find candidate string pairs based on the results
- Check only candidate pairs with a UDF to obtain final answer

19 What is a “Q-gram”?
- Q-gram: A sequence of q characters of the original string
- Example for q=3 (the string is padded with q-1 “#” characters in front and q-1 “$” characters at the end):
  vacations → {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
- String with length L → L + q - 1 q-grams
- Similar strings have many common q-grams
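A short sketch of q-gram extraction as described on this slide. The function name `qgrams` and its default arguments are illustrative assumptions; the padding characters follow the slide's example.

```python
def qgrams(s, q=3, prefix="#", suffix="$"):
    """Return the q-grams of s after padding it with q-1 prefix and suffix characters."""
    padded = prefix * (q - 1) + s + suffix * (q - 1)
    # Slide a window of width q over the padded string.
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


grams = qgrams("vacations", q=3)
```

For a string of length L the padded string has L + 2(q - 1) characters, so the sliding window produces L + q - 1 q-grams, matching the count on the slide.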

20 Q-grams and Edit Distance Operations
With no edits: L + q - 1 common q-grams
Replacement: (L + q - 1) - q common q-grams
  Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
  Vacalions: {##v, #va, vac, aca, cal, ali, lio, ion, ons, ns$, s$$}
Insertion: (L_max + q - 1) - q common q-grams
  Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
  Vacatlions: {##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns$, s$$}
Deletion: (L_max + q - 1) - q common q-grams
  Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
  Vacaions: {##v, #va, vac, aca, cai, aio, ion, ons, ns$, s$$}
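The three cases on this slide can be checked with a few lines of Python. This is a sketch; `qgrams` and `common_qgrams` are illustrative helper names, and q-grams are compared as multisets so that repeated q-grams are counted correctly.

```python
from collections import Counter


def qgrams(s, q):
    """Q-grams of s, padded with q-1 '#' in front and q-1 '$' at the end."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def common_qgrams(s1, s2, q):
    """Number of q-grams shared by s1 and s2 (multiset intersection)."""
    shared = Counter(qgrams(s1, q)) & Counter(qgrams(s2, q))
    return sum(shared.values())


replacement = common_qgrams("vacations", "vacalions", 3)   # (9 + 3 - 1) - 3
insertion = common_qgrams("vacations", "vacatlions", 3)    # (10 + 3 - 1) - 3
deletion = common_qgrams("vacations", "vacaions", 3)       # (9 + 3 - 1) - 3
```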

21 “Q-gram Distance” and Edit Distance
In the “q-gram space” the distance for a pair of strings (S1, S2) is defined as:
  |Maximum number of q-grams| - |Common q-grams|
Each Edit Distance operation affects a q-gram in one of the following ways:
- It destroys it, or
- It leaves it intact, or
- It shifts it by one position
Each Edit Distance operation destroys at most q q-grams
For Edit Distance = K, the “q-gram distance” is at most Kq

22 Number of Common Q-grams and Edit Distance
- For Edit Distance = K, there could be at most K replacements, insertions, deletions
- Two strings S1 and S2 with Edit Distance ≤ K have at least [max(S1.len, S2.len) + q - 1] - Kq q-grams in common
- Useful filter: eliminate all string pairs without "enough" common q-grams (no false dismissals)
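As a worked instance of the bound on this slide (a sketch; the notation $G_q(S)$ for the multiset of q-grams of a string $S$ is introduced here only for illustration and does not come from the slides):

```latex
\[
  \lvert G_q(S_1) \cap G_q(S_2) \rvert \;\ge\; \max(\lvert S_1 \rvert, \lvert S_2 \rvert) + q - 1 - Kq
\]
% Example: S_1 = vacations, S_2 = vacalions, q = 3, K = 1
\[
  \max(9, 9) + 3 - 1 - 1 \cdot 3 \;=\; 8
\]
```

The two example strings share exactly 8 of their 11 q-grams (the replacement case on slide 20), so the pair just meets the threshold and survives the filter.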

23 Using a DBMS for Q-gram Joins
- If we have the q-grams in the DBMS, we can perform this counting efficiently.
- Create auxiliary q-gram tables (with the columns sid, strlen, qgram, and later pos, that the queries on the following slides use) and join these tables
- A GROUP BY – HAVING COUNT clause can perform the counting / filtering

24 Eliminating Candidate Pairs: COUNT FILTERING
SQL for this filter (parts omitted for clarity):
    SELECT R.sid, S.sid
    FROM R, S
    WHERE R.qgram = S.qgram
    GROUP BY R.sid, S.sid
    HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q
The result is the set of string pairs with enough common q-grams; because every pair within Edit Distance K meets the threshold, there are no false negatives.
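The count filter can be tried end to end with Python's built-in sqlite3 module. This is a sketch under stated assumptions: SQLite stands in for the DBMS of the talk, the auxiliary tables use the columns (sid, strlen, qgram) referenced by the slide's query, the helper names `qgrams`, `load`, and `count_filter` and the sample strings are hypothetical, and strlen is added to the GROUP BY so the query is also valid in stricter SQL dialects.

```python
import sqlite3


def qgrams(s, q):
    """Q-grams of s, padded with q-1 '#' in front and q-1 '$' at the end."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def load(conn, table, strings, q):
    """Create one auxiliary q-gram table with one row per (string, q-gram)."""
    conn.execute(f"CREATE TABLE {table} (sid INTEGER, strlen INTEGER, qgram TEXT)")
    for sid, s in enumerate(strings, start=1):
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)",
                         [(sid, len(s), g) for g in qgrams(s, q)])


def count_filter(r_strings, s_strings, k, q):
    """Return (R.sid, S.sid) pairs that share enough q-grams."""
    conn = sqlite3.connect(":memory:")
    load(conn, "R", r_strings, q)
    load(conn, "S", s_strings, q)
    rows = conn.execute("""
        SELECT R.sid, S.sid
        FROM R, S
        WHERE R.qgram = S.qgram
        GROUP BY R.sid, S.sid, R.strlen, S.strlen
        HAVING COUNT(*) >= (max(R.strlen, S.strlen) + :q - 1) - :k * :q
        ORDER BY R.sid, S.sid
    """, {"k": k, "q": q}).fetchall()
    conn.close()
    return rows


candidates = count_filter(["vacations", "holidays"],
                          ["vacalions", "vocations"], k=1, q=3)
```

Only the pairs involving "vacations" reach the threshold of 8 shared q-grams; "holidays" shares a single q-gram with either string in S and is filtered out without any edit-distance computation.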

25 Eliminating Candidate Pairs Further: LENGTH FILTERING
Strings with length difference larger than K cannot be within Edit Distance K
    SELECT R.sid, S.sid
    FROM R, S
    WHERE R.qgram = S.qgram
      AND abs(R.strlen - S.strlen) <= K
    GROUP BY R.sid, S.sid
    HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q
We refer to this filter as LENGTH FILTERING

26 Exploiting Q-gram Positions for Filtering
Consider strings aabbzzaacczz and aacczzaabbzz
- Strings are at edit distance 4
- Strings have identical q-grams for q=3
Problem: Matching q-grams that are at different positions in both strings
- Either q-grams do not "originate" from same q-gram, or
- Too many edit operations "caused" spurious q-grams at various parts of strings to match

27 POSITION FILTERING - Filtering using positions
- Keep the position of the q-grams
- Do not match q-grams that are more than K positions away
    SELECT R.sid, S.sid
    FROM R, S
    WHERE R.qgram = S.qgram
      AND abs(R.strlen - S.strlen) <= K
      AND abs(R.pos - S.pos) <= K
    GROUP BY R.sid, S.sid
    HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q
We refer to this filter as POSITION FILTERING
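The effect of positions can be seen on the example pair from slide 26. The sketch below emulates in plain Python the number of rows the q-gram join would produce for one string pair, with and without the position condition; the helper names and the choice K=2 are assumptions made for illustration.

```python
def positional_qgrams(s, q):
    """(position, q-gram) pairs of s, with 1-based positions in the padded string."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [(i + 1, padded[i:i + q]) for i in range(len(padded) - q + 1)]


def join_count(s1, s2, q, max_shift=None):
    """Count matching q-gram pairs; max_shift=None disables the position filter."""
    count = 0
    for p1, g1 in positional_qgrams(s1, q):
        for p2, g2 in positional_qgrams(s2, q):
            if g1 == g2 and (max_shift is None or abs(p1 - p2) <= max_shift):
                count += 1
    return count


s1, s2, q, k = "aabbzzaacczz", "aacczzaabbzz", 3, 2
threshold = max(len(s1), len(s2)) + q - 1 - k * q    # 12 + 3 - 1 - 6 = 8
without_positions = join_count(s1, s2, q)            # all 14 q-grams match
with_positions = join_count(s1, s2, q, max_shift=k)  # only 6 matches remain
```

Without positions the pair passes the count filter (14 ≥ 8) even though its edit distance of 4 exceeds K=2; with the position condition only 6 matches remain, which is below the threshold, so the pair is pruned.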

28 The Actual, Complete SQL Statement
    SELECT R1.string, S1.string, R1.sid, S1.sid
    FROM R1, S1, R, S
    WHERE R1.sid = R.sid
      AND S1.sid = S.sid
      AND R.qgram = S.qgram
      AND abs(strlen(R1.string) - strlen(S1.string)) <= K
      AND abs(R.pos - S.pos) <= K
    GROUP BY R1.sid, S1.sid, R1.string, S1.string
    HAVING COUNT(*) >= (max(strlen(R1.string), strlen(S1.string)) + q - 1) - K*q
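The complete statement can be exercised with Python's built-in sqlite3 module. This is a sketch, not the setup used in the talk: SQLite's `length()` replaces `strlen()`, K and q are passed as named parameters, and the table layout (R1/S1 holding the strings, R/S holding sid, pos, qgram), helper names, and sample data are assumptions for illustration.

```python
import sqlite3


def positional_qgrams(s, q):
    """(position, q-gram) pairs of s, padded with q-1 '#' and q-1 '$'."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [(i + 1, padded[i:i + q]) for i in range(len(padded) - q + 1)]


def load(conn, base, aux, strings, q):
    """Create the base string table and its auxiliary positional q-gram table."""
    conn.execute(f"CREATE TABLE {base} (sid INTEGER PRIMARY KEY, string TEXT)")
    conn.execute(f"CREATE TABLE {aux} (sid INTEGER, pos INTEGER, qgram TEXT)")
    for sid, s in enumerate(strings, start=1):
        conn.execute(f"INSERT INTO {base} VALUES (?, ?)", (sid, s))
        conn.executemany(f"INSERT INTO {aux} VALUES (?, ?, ?)",
                         [(sid, pos, g) for pos, g in positional_qgrams(s, q)])


QUERY = """
SELECT R1.string, S1.string, R1.sid, S1.sid
FROM R1, S1, R, S
WHERE R1.sid = R.sid AND S1.sid = S.sid
  AND R.qgram = S.qgram
  AND abs(length(R1.string) - length(S1.string)) <= :k
  AND abs(R.pos - S.pos) <= :k
GROUP BY R1.sid, S1.sid, R1.string, S1.string
HAVING COUNT(*) >= (max(length(R1.string), length(S1.string)) + :q - 1) - :k * :q
ORDER BY R1.sid, S1.sid
"""


def candidate_pairs(r_strings, s_strings, k, q):
    """Run length, position, and count filtering in a single SQL query."""
    conn = sqlite3.connect(":memory:")
    load(conn, "R1", "R", r_strings, q)
    load(conn, "S1", "S", s_strings, q)
    rows = conn.execute(QUERY, {"k": k, "q": q}).fetchall()
    conn.close()
    return rows


service_a = ["Panos Ipeirotis", "John Smith", "aabbzzaacczz"]
service_b = ["Panos Ipirotis", "Jonh Smith", "aacczzaabbzz"]
pairs = candidate_pairs(service_a, service_b, k=2, q=2)
```

The returned pairs are only candidates; following the roadmap on slide 18, each would still be verified with an edit-distance UDF to remove false positives.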

29 Experimental Results: Data
Three customer data sets from AT&T Worldnet Service:
(a) set1 with about 40K strings
(b) set2 and (c) set3 with about 30K strings each

30 DBMS Setup
- Used Oracle 8i (supports UDFs), on Sun 20 Enterprise Server
- Materialized the auxiliary q-gram tables (less than 2 minutes per table)
- Tested configurations with and without indexes on the auxiliary q-gram tables (less than 5 minutes to generate each index)
- The generation time for the auxiliary q-gram tables and indexes is small: Even on-the-fly materialization is feasible

31 Query Plans Generated by RDBMS
- Naive approach with UDFs: nested-loops joins (prohibitively slow even for small data sets)
- Q-gram approach: usually sort-merge joins
- In our prototype implementation, the sort-merge join is the fastest as well

32 Naïve UDFs vs. Filtering
For a subset of set1, our technique was 20 to 30 times faster than the naïve use of UDFs

33 Effect of Filters (Candidate Set)
- LENGTH FILTERING: 40-70% reduction for set1 (small length deviation); 90-98% reduction for set2, set3 (big length deviation)
- +COUNT FILTERING: > 99% reduction
- POSITION FILTERING: ~50% reduction (additionally)

34 Effect of Filters (Candidate Set): Best Q
- For the given data sets q=2 worked best
- q=2 is close to the theoretical approximations as well
- q=2 is small enough to avoid, as much as possible, the space overhead for the auxiliary tables

35 Effect of Filters (Q-gram Join Size)
- COUNT FILTERING is applied last (in HAVING clause)
- For efficiency, it is important to have a small join size for the q-gram tables
- LENGTH FILTERING cuts the size by a factor of 2 to 10
- +POSITION FILTERING cuts the size by a factor of ~100

36 Extensions: Substring Joins
Substring approximate joins:
- Length filtering is not applicable
- Substring matches have fewer common q-grams (no “overflow” q-grams)
- Position filtering is not directly applicable (it needs tuning depending on the substring location)
We also propose a fast in-memory filter (see paper)
- The result is not trivial!
- Exploiting position of q-grams extensively to find possible alignments

37 Extensions: Block Moves
Match “AT&T Corp” with “Corp AT&T” as a unit cost operation.
We can allow block moves:
- Length filtering is still applicable
- The threshold for count filtering is different
- Position filtering is not applicable (the q-gram may move far away)

38 Conclusions
- We introduced a technique for mapping approximate string joins into a “vanilla” SQL expression
- Our technique does not require modifying the underlying RDBMS
- Our technique exploits the RDBMS's query optimizer
- Our technique significantly outperforms existing approaches
- Many opportunities for improvements and future work!

