Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints
  Chuan Xiao, Wei Wang, Xuemin Lin
  University of New South Wales & NICTA, Australia
  2018/11/15 CSE@UNSW
Motivation
  Data cleaning
    typo: ‘Stephen Spielburg’ vs ‘Steven Spielberg’
    multiple representation: ‘harbor’ vs ‘harbour’
  Bioinformatics
    DNA/protein sequence: AAAGTCTGAC… vs AAACTCTGAC…
More Applications
  Detect spam (spam email template)
    Sir/Madam, We happily announce to you the draw of the EURO MILLIONS SPANISH LOTTERY INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on the 27TH MARCH 2008 in SPAIN. Your company or your personal e-mail address attached to ticket number 653-908-321-675 with serial main number <NUMBER> drew lucky star winning numbers <NUMBER> which consequently won in the 2ND category, you have therefore been approved for a lump sum pay out of 960.000.00 Euros. (NINE HUNDRED AND SIXTY THOUSAND EUROS). CONGRATULATIONS!!! Sincerely yours, <NAME> <AFFILIATION>
  Identify plagiarism (two near-identical answers)
    Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a unique check disk has been eliminated. 2. Read requests have a higher level of parallelism. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a dedicated check disk the check disk never participates in read.
    Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a single check disk has been eliminated. 2. Read requests have a higher level of parallelism on RAID5. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a check disk the check disk never participates in read.
Outline
  Motivation
  Problem Definition
  Algorithms
  Experiments
  Conclusions
Edit Similarity Join
  Focus on similarity joins on strings with an edit distance threshold d
    edit distance ≤ d ⇒ the two strings are similar
  Problem definition
    Given two collections of strings S and T, the edit similarity join problem is to compute { <s, t> | s ∈ S, t ∈ T, ed(s, t) ≤ d }
    Only the self-join case is considered here
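A minimal sketch of the naïve approach to the problem definition above: compute ed(s, t) with the textbook dynamic program and test every pair of a self-join. The function names edit_distance and naive_self_join are illustrative assumptions, not names from the talk.

```python
def edit_distance(s: str, t: str) -> int:
    """Textbook dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def naive_self_join(strings, d):
    """Return all pairs (s, t) of the collection with ed(s, t) <= d."""
    result = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if edit_distance(strings[i], strings[j]) <= d:
                result.append((strings[i], strings[j]))
    return result
```

This baseline verifies every pair, which is the cost that the filtering techniques aim to avoid.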
q-gram Based Filtering [Gravano et al. VLDB01]
  Naïve algorithm
    compute edit distance: O(n²) time complexity
    do this for N²/2 pairs
  q-gram based filtering
    filter-and-refine
    length filtering: | len(s) − len(t) | ≤ d
    example (q=3): New_Zealand → New ew_ w_Z _Ze Zea eal ala lan and
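The q-gram extraction and the length filter above could be sketched as follows. The helper names and the 1-based locations are assumptions of this sketch, chosen to match the location numbering used on the later slides.

```python
def positional_qgrams(s: str, q: int):
    """Return the q-grams of s as (q-gram, location) pairs, locations starting at 1."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def length_filter(s: str, t: str, d: int) -> bool:
    """A pair can only have ed(s, t) <= d if the lengths differ by at most d."""
    return abs(len(s) - len(t)) <= d
```

For example, length_filter rejects ‘Austria’ against ‘Australia’ for d=1 without looking at any q-gram, because the lengths differ by two.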
Matching q-grams
  example (q=3): New_Zealand → New ew_ w_Z _Ze Zea eal ala lan and
  d edit operations destroy at most q*d q-grams, so similar strings share most of their q-grams (the matching q-grams)
  count filtering: at least LB(s,t) common q-grams, where LB(s,t) = max(|s|, |t|) − q + 1 − q*d
  position filtering: positions of common q-grams should be within d
  Implemented on RDBMS; best performance with small q, such as q=2,3
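A hedged sketch of count filtering combined with position filtering. It assumes positional q-grams with 1-based locations and pairs equal q-grams greedily when their locations differ by at most d. The names lower_bound, count_matching and passes_count_filter are hypothetical.

```python
from collections import defaultdict


def lower_bound(s: str, t: str, q: int, d: int) -> int:
    """LB(s,t) = max(|s|, |t|) - q + 1 - q*d."""
    return max(len(s), len(t)) - q + 1 - q * d


def count_matching(s: str, t: str, q: int, d: int) -> int:
    """Count q-grams with equal content whose locations differ by at most d."""
    loc_s, loc_t = defaultdict(list), defaultdict(list)
    for i in range(len(s) - q + 1):
        loc_s[s[i:i + q]].append(i + 1)
    for i in range(len(t) - q + 1):
        loc_t[t[i:i + q]].append(i + 1)
    total = 0
    for gram, ps in loc_s.items():
        pt = loc_t.get(gram, [])
        i = j = 0
        # Both location lists are sorted, so a two-pointer scan pairs them greedily.
        while i < len(ps) and j < len(pt):
            if abs(ps[i] - pt[j]) <= d:
                total += 1
                i += 1
                j += 1
            elif ps[i] < pt[j]:
                i += 1
            else:
                j += 1
    return total


def passes_count_filter(s: str, t: str, q: int, d: int) -> bool:
    """Keep the pair only if it has at least LB(s,t) matching q-grams."""
    return count_matching(s, t, q, d) >= lower_bound(s, t, q, d)
```

With q=2 and d=1, ‘New_Zealand’ and ‘New_Sealand’ have 8 matching q-grams against a bound of 8 and pass, while ‘Austria’ and ‘Australia’ have 4 against a bound of 6 and are filtered out.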
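Prefix Filter [Chaudhuri et al. ICDE06, Bayardo et al. WWW07]
  Bottleneck: generating candidate pairs which share at least LB(s,t) matching q-grams
  Prefix filter
    sort the q-grams of each string by a global ordering, such as idf
    split each sorted q-gram array (Qs, Qt) of length l into a prefix of q*d+1 q-grams and a suffix of l − q*d − 1 = LB(s,t) − 1 q-grams
    if the two prefixes share no q-gram, all matches would have to come from a suffix of only LB(s,t) − 1 q-grams, so a candidate pair must share at least one q-gram in the prefixes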
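The prefix filter can be illustrated with a small inverted index over prefix q-grams. This is a sketch under assumptions: the global ordering is ascending document frequency (rarest q-gram first, a stand-in for idf) with ties broken alphabetically, and prefix_filter_candidates is a hypothetical name.

```python
from collections import Counter, defaultdict


def prefix_filter_candidates(strings, q, d):
    """Return index pairs (i, j), i < j, whose (q*d+1)-prefixes share a q-gram."""
    grams = [[s[i:i + q] for i in range(len(s) - q + 1)] for s in strings]
    # Assumed global ordering: document frequency, then the q-gram itself.
    df = Counter(g for gs in grams for g in set(gs))
    prefix_len = q * d + 1
    index = defaultdict(list)  # prefix q-gram -> ids of strings seen so far
    candidates = set()
    for sid, gs in enumerate(grams):
        prefix = sorted(gs, key=lambda g: (df[g], g))[:prefix_len]
        for g in set(prefix):
            for other in index[g]:
                candidates.add((other, sid))
            index[g].append(sid)
    return candidates
```

Strings shorter than q produce no q-grams and would need separate handling; the sketch ignores that case.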
All-Pairs-Ed Algorithm [Bayardo et al. WWW07]
  Indexed record set → Cand-1 generation (prefix filter) → Cand-2 generation (count filter) → verification (edit distance) → result pairs
Example – All-Pairs-Ed
  d=1, q=2, prefix_len = q*d+1 = 3
  a=‘Austria’ b=‘Australia’ c=‘Australiana’ d=‘New_Zealand’ e=‘New_Sealand’
  Qa={ri, Au, us, …} Qb={ra, li, Au, …} Qc={na, ra, li, …} Qd={_Z, Ze, Ne, …} Qe={_S, Se, Ne, …}
  after prefix filter: <a,b> <b,c> <d,e>
  after count filter: <b,c> <d,e>
  after edit distance verification: <d,e>
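The three stages of this example (prefix filter, count filter, verification) can be traced with a compact sketch. It is not the authors' implementation: the global ordering (document frequency, ties broken alphabetically), the 1-based q-gram locations, and all function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def edit_distance(s, t):
    """Textbook dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


def count_matching(s, t, q, d):
    """Count equal q-grams whose 1-based locations differ by at most d."""
    loc_s, loc_t = defaultdict(list), defaultdict(list)
    for i in range(len(s) - q + 1):
        loc_s[s[i:i + q]].append(i + 1)
    for i in range(len(t) - q + 1):
        loc_t[t[i:i + q]].append(i + 1)
    total = 0
    for gram, ps in loc_s.items():
        pt, i, j = loc_t.get(gram, []), 0, 0
        while i < len(ps) and j < len(pt):
            if abs(ps[i] - pt[j]) <= d:
                total, i, j = total + 1, i + 1, j + 1
            elif ps[i] < pt[j]:
                i += 1
            else:
                j += 1
    return total


def all_pairs_ed(strings, q, d):
    """Return (cand1, cand2, result) as collections of index pairs (i, j), i < j."""
    grams = [[s[i:i + q] for i in range(len(s) - q + 1)] for s in strings]
    df = Counter(g for gs in grams for g in set(gs))  # assumed global ordering
    index, cand1 = defaultdict(list), set()
    for sid, gs in enumerate(grams):                  # Cand-1: prefix filter
        prefix = sorted(gs, key=lambda g: (df[g], g))[:q * d + 1]
        for g in set(prefix):
            cand1.update((other, sid) for other in index[g])
            index[g].append(sid)
    cand2 = set()
    for i, j in cand1:                                # Cand-2: count filter
        lb = max(len(strings[i]), len(strings[j])) - q + 1 - q * d
        if count_matching(strings[i], strings[j], q, d) >= lb:
            cand2.add((i, j))
    result = sorted(p for p in cand2                  # verification
                    if edit_distance(strings[p[0]], strings[p[1]]) <= d)
    return cand1, cand2, result
```

Run on the five strings with q=2 and d=1, the sketch leaves three pairs after the prefix filter, two after the count filter, and the single pair ‘New_Zealand’/‘New_Sealand’ after verification.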
Ed-Join
  Idea: mismatching q-grams provide useful information
    filters          edit operations
    location-based   non-clustered
    content-based    clustered
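Location-Based Filtering
  Idea: reduce prefix length
  Example, d=1, q=2, s=‘Austria’, t=‘Australia’
    Qs = ri (location 5), Au (location 1), us
    Qt = ra (location 5), li (location 7), Au
    with the prefixes reduced to two q-grams, Qs and Qt share no prefix q-gram and the pair is pruned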
Minimum Prefix Length
  Standard prefix length: q*d+1
  Minimum prefix length: the shortest prefix whose q-grams need at least d+1 edit operations to destroy them
  Found by sequential search; further optimization: binary search within [d+1, q*d+1]
  Example, d=2, q=2, string A C G A C G T A: min. prefix len. = 4
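A sketch of the minimum prefix length computation, assuming the q-grams are already sorted by the global ordering and carry 1-based locations. The greedy count of edit operations and the names min_edit_errors and min_prefix_length are assumptions of this sketch, not code from the talk.

```python
def min_edit_errors(grams, q):
    """Fewest edit operations that can destroy every positional q-gram in grams."""
    errors, covered_until = 0, 0
    for loc in sorted(loc for _, loc in grams):
        if loc > covered_until:          # not hit by the edits placed so far
            errors += 1                  # place one edit on its last character
            covered_until = loc + q - 1  # that edit also hits q-grams starting up to here
    return errors


def min_prefix_length(ordered_grams, q, d):
    """Smallest k such that the first k q-grams need at least d+1 edits to destroy."""
    hi = min(q * d + 1, len(ordered_grams))
    lo = min(d + 1, hi)
    # Binary search is valid because the error count never decreases as the prefix grows.
    while lo < hi:
        mid = (lo + hi) // 2
        if min_edit_errors(ordered_grams[:mid], q) >= d + 1:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For ‘Austria’ with d=1 and q=2, the prefix q-grams ri (location 5) and Au (location 1) are far enough apart that one edit cannot destroy both, so the prefix length drops from 3 to 2.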
Limit of Count/Loc.-Based Filter
  Clustered edit operations: s=‘…please submit by Aug…’ t=‘…please submit by Sep…’
    4 mismatching q-grams if q=2: retained (d=2)
  Non-clustered edit operations: s’=‘…please submit by Aug…’ t’=‘…pleese supmit bi Aug…’
    6 mismatching q-grams if q=2: pruned (d=2)
  Clustered edit operations destroy fewer q-grams, so count/location-based filtering is less effective
Content-Based Filtering
  Probing window (example strings: s = A C G T, t = A G C T)
  An edit operation increases the L1 distance within the probing window by at most two
  The L1 distance should be ≤ 2d if ed(s, t) ≤ d
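A minimal sketch of the content-based test. It assumes that the L1 distance is taken between the character-frequency vectors of the two substrings covered by the probing window, and that the window is a half-open index range [lo, hi) applied to both strings. Both points, and the function names, are assumptions of this sketch.

```python
from collections import Counter


def l1_distance(s: str, t: str, lo: int, hi: int) -> int:
    """L1 distance between the character counts of s[lo:hi] and t[lo:hi]."""
    cs, ct = Counter(s[lo:hi]), Counter(t[lo:hi])
    return sum(abs(cs[c] - ct[c]) for c in cs.keys() | ct.keys())


def passes_content_filter(s: str, t: str, lo: int, hi: int, d: int) -> bool:
    """Keep the pair only if the window's L1 distance is at most 2d."""
    return l1_distance(s, t, lo, hi) <= 2 * d
```

For ‘please submit by Aug’ and ‘please submit by Sep’, a window over the last three characters gives an L1 distance of 6, which exceeds 2d for d=2, so the pair can be discarded without computing the edit distance.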
Select Probing Window
  Example, d=3, q=3 (figure: probing windows over s and t)
    L1 = 2 for one window
    L1 = 8 > 2d for another window: pruned
Example – Ed-Join
  d=1, q=2
  a=‘Austria’ b=‘Australia’ c=‘Australiana’ d=‘New_Zealand’ e=‘New_Sealand’
  prefixes of length q*d+1: Qa={ri, Au, us, …} Qb={ra, li, Au, …} Qc={na, ra, li, …} Qd={_Z, Ze, Ne, …} Qe={_S, Se, Ne, …}
  prefixes after location-based filtering: Qa={ri, Au, …} Qb={ra, li, …} Qc={na, ra, …} Qd={_Z, Ze, Ne, …} Qe={_S, Se, Ne, …}
  after prefix filter: <b,c> <d,e>
  after count filter: <b,c> <d,e>
  after content-based filter: <d,e>
  after edit distance verification: <d,e>
Experiment Settings
  Environment
    Intel Xeon X3220 2.4GHz CPU, 4GB RAM
    Debian 4.1, GCC 4.1.2 with -O3
  Algorithm
    All-Pairs-Ed [Bayardo et al. WWW07]
    PartEnum [Arasu et al. VLDB06]
    Ed-Join / Ed-Join-l
  Dataset
    dataset                              # of strings   |Σ|   avg. len
    DBLP (author, title)                 900k           93    104.8
    TEXAS (name, address, licence no)    155k           37    112.1
    TREC (author, title, abstract)       271k                 1098.4
    UNIREF (protein)                     366k           25    465.1
Experiment – Large Threshold
  UNIREF, running time
Experiment – q
  TREC, running time
  q=8 achieves best performance for TREC
Experiment – with PartEnum
  panels: d=1, d=2, d=3
Conclusions
  Contributions
    an efficient algorithm for edit similarity join
    exploit mismatching q-grams
      location-based filtering – non-clustered edit ops.
      content-based filtering – clustered edit ops.
    longer q-grams perform best for stand-alone implementation
  Future work
    other similarity measures, e.g., used in DNA/protein alignment
Thank you!
  Additional materials available at http://www.cse.unsw.edu.au/~weiw/project/simjoin.html
Related Work
  q-gram Based Filtering
    L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001.
  Algorithms for Set Similarity Join
    Index-based approaches
      S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
      C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008.
    Prefix-based approaches
      S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
      R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007.
      C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008.
    PartEnum
      A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.
Related Work
  Edit Distance Computation
    R. A. Wagner and M. J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168–173, 1974.
    W. J. Masek and M. Paterson. A faster algorithm computing string edit distances. J. Comput. Syst. Sci., 20(1):18–31, 1980.
    G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM, 46(3):395–415, 1999.
    E. Ukkonen. On approximate string matching. In FCT, 1983.
Experiment – Pruning Power