Chuan Xiao, Wei Wang, Xuemin Lin


1 Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints
Chuan Xiao, Wei Wang, Xuemin Lin
University of New South Wales & NICTA, Australia
2018/11/15

2 Motivation
Data Cleaning
typo: ‘Stephen Spielburg’ vs ‘Steven Spielberg’
multiple representation: ‘harbor’ vs ‘harbour’
Bioinformatics
DNA/protein sequence: AAAGTCTGAC… vs AAACTCTGAC…

3 More Applications
detect spam
SPAM EMAIL TEMPLATE: Sir/Madam, We happily announce to you the draw of the EURO MILLIONS SPANISH LOTTERY INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on the 27TH MARCH 2008 in SPAIN. Your company or your personal address attached to ticket number with serial main number <NUMBER> drew lucky star winning numbers <NUMBER> which consequently won in the 2ND category, you have therefore been approved for a lump sum pay out of Euros. (NINE HUNDRED AND SIXTY THOUSAND EUROS). CONGRATULATIONS!!! Sincerely yours, <NAME> <AFFILIATION>
identify plagiarism
Version 1: Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a unique check disk has been eliminated. 2. Read requests have a higher level of parallelism. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a dedicated check disk the check disk never participates in read.
Version 2: Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a single check disk has been eliminated. 2. Read requests have a higher level of parallelism on RAID5. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a check disk the check disk never participates in read.

4 Outline
Motivation
Problem Definition
Algorithms
Experiments
Conclusions

5 Edit Similarity Join
Focus on similarity join on strings with an edit distance threshold (d): edit distance ≤ d ⇒ two strings are similar
Problem Definition: Given two collections of strings S and T, the edit similarity join problem is to compute { <s, t> | s ∈ S, t ∈ T, ed(s,t) ≤ d }
Consider the self-join case here
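
The problem definition translates directly into a brute-force self-join. The following is a minimal Python sketch, not the authors' implementation: the names edit_distance and naive_self_join are illustrative, and the distance is computed with the standard unit-cost dynamic program.

```python
from itertools import combinations


def edit_distance(s, t):
    """Unit-cost edit distance via the standard dynamic program (two rows)."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        cur = [i]  # deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,               # delete cs
                cur[j - 1] + 1,            # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def naive_self_join(strings, d):
    """Return every unordered pair whose edit distance is at most d."""
    return [(s, t) for s, t in combinations(strings, 2)
            if edit_distance(s, t) <= d]


names = ["Austria", "Australia", "Australiana", "New_Zealand", "New_Sealand"]
print(naive_self_join(names, d=1))
```

Every pair is verified here, so the cost grows quadratically with the number of strings; the filters in the rest of the deck exist to avoid most of these verifications.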

6 Outline
Motivation
Problem Definition
Algorithms
Experiments
Conclusions

7 q-gram Based Filtering [Gravano et al. VLDB01]
Naïve algorithm
compute edit distance: O(n²) time complexity
do this for N²/2 pairs
q-gram based filtering
filter-and-refine
length filtering: | len(s) - len(t) | ≤ d
Example (q=3): New_Zealand ⇒ New, ew_, w_Z, _Ze, Zea, eal, ala, lan, and
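
As a small illustration of the two ingredients on this slide, the sketch below extracts positional q-grams and applies the length filter. The function names are illustrative, and the 1-based locations are an assumption chosen to match the locations shown later in the deck.

```python
def positional_qgrams(s, q):
    """Return (q-gram, location) pairs; locations are 1-based (an assumption)."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def passes_length_filter(s, t, d):
    """Strings within edit distance d differ in length by at most d."""
    return abs(len(s) - len(t)) <= d


print([gram for gram, _ in positional_qgrams("New_Zealand", 3)])
print(passes_length_filter("Austria", "Australia", d=1))
```

The length filter is the cheapest check available, so it is a natural first step before any q-gram work is done.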

8 Matching q-grams
count filtering: at least LB(s,t) common q-grams, where LB(s,t) = max(|s|, |t|) - q + 1 - q*d
position filtering: positions of common q-grams should be within d
Implemented on RDBMS; best performance when q is small, such as q=2,3
d edit operations destroy at most q*d q-grams ⇒ similar strings share most q-grams (the matching q-grams)
[Figure: q-grams of New_Zealand (New, ew_, w_Z, _Ze, Zea, eal, ala, lan, and), with edit operations marked S]
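
The count filter can be sketched as follows. This is a simplified illustration rather than the RDBMS implementation the slide refers to: q-grams are compared as bags and the position constraint is ignored, so the check may keep more pairs than a position-aware version but never discards a pair that the bound would keep. The function names are illustrative.

```python
from collections import Counter


def qgram_bag(s, q):
    """Multiset of the q-grams of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def count_lower_bound(s, t, q, d):
    """LB(s,t) = max(|s|, |t|) - q + 1 - q*d."""
    return max(len(s), len(t)) - q + 1 - q * d


def passes_count_filter(s, t, q, d):
    """Keep the pair only if it shares at least LB(s,t) q-grams."""
    common = sum((qgram_bag(s, q) & qgram_bag(t, q)).values())
    return common >= count_lower_bound(s, t, q, d)


print(passes_count_filter("Austria", "Australia", q=2, d=1))
print(passes_count_filter("New_Zealand", "New_Sealand", q=2, d=1))
```

For ‘Austria’ and ‘Australia’ with q=2 and d=1 the bound is 6 but only 5 q-grams are shared, so the pair is dropped without computing its edit distance.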

9 Prefix Filter [Chaudhuri et al. ICDE06, Bayardo et al. WWW07]
Bottleneck: generating candidate pairs which share at least LB(s,t) matching q-grams
Prefix Filter: sort q-grams by a global ordering, such as idf
[Figure: sorted q-gram arrays Qs and Qt, each split into a prefix of q*d+1 q-grams and a remainder of l-q*d-1 = LB(s,t)-1 q-grams; sample q-grams qa, qb, qx, qy]
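
A minimal sketch of prefix filtering, under stated assumptions: the global ordering is ascending document frequency (rarest first, standing in for idf) with ties broken alphabetically, positions are ignored, and prefixes are compared pairwise instead of through an inverted index. None of these choices is taken from the slides, and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def prefix_filter_candidates(strings, q, d):
    """Pairs whose (q*d+1)-length prefixes share at least one q-gram."""
    # Document frequency: number of strings containing each q-gram.
    df = Counter(g for s in strings for g in set(qgrams(s, q)))
    prefix_len = q * d + 1
    prefixes = {}
    for s in strings:
        # Assumed global ordering: rarest q-gram first, ties alphabetical.
        ordered = sorted(qgrams(s, q), key=lambda g: (df[g], g))
        prefixes[s] = set(ordered[:prefix_len])
    return [(s, t) for s, t in combinations(strings, 2)
            if prefixes[s] & prefixes[t]]


names = ["Austria", "Australia", "Australiana", "New_Zealand", "New_Sealand"]
print(prefix_filter_candidates(names, q=2, d=1))
```

Putting rare q-grams first keeps the prefixes selective: frequent q-grams, which would match many strings, tend to fall outside the prefix.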

10 All-Pairs-Ed Algorithm [Bayardo et al. WWW07]
Indexed Record Set ⇒ Cand-1 Generation (Prefix Filter) ⇒ Cand-2 Generation (Count Filter) ⇒ Verification (Edit Distance) ⇒ Result Pairs

11 Example – All-Pairs-Ed
d=1, q=2, prefix_len = q*d+1 = 3
a=‘Austria’ b=‘Australia’ c=‘Australiana’ d=‘New_Zealand’ e=‘New_Sealand’
Qa={ri, Au, us, …} Qb={ra, li, Au, …} Qc={na, ra, li, …} Qd={_Z, Ze, Ne, …} Qe={_S, Se, Ne, …}
after prefix filter: <a,b> <b,c> <d,e>
after count filter: <b,c> <d,e>
after edit distance verification: <d,e>

12 Ed-Join Idea
mismatching q-grams provide useful information
location-based filter: targets non-clustered edit operations
content-based filter: targets clustered edit operations

13 Location-Based Filtering
Idea: reduce prefix length
Example, d=1, q=2, s=‘Austria’, t=‘Australia’
Qs = ri (location 5), Au (location 1), us (pruned)
Qt = ra (location 5), li (location 7), Au (pruned)

14 Minimum Prefix Length
Original prefix length of Qs = q*d+1
Minimum prefix: the shortest prefix whose q-grams need at least d+1 edit operations to destroy them
Found by sequential search
Further optimization: binary search within [d+1, q*d+1]
Example, d=2, q=2: min. prefix len. = 4
[Figure: string A C G A C G T A with q-gram locations 1 to 6]
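
The sequential search can be sketched as below. This is an illustration under assumptions rather than the authors' code: the minimum number of edit operations needed to destroy a set of q-grams is estimated with a greedy left-to-right scan (one edit at the last character of the leftmost untouched q-gram hits every q-gram overlapping that character), locations are 1-based, and the function names are illustrative.

```python
def min_edit_errors(locations, q):
    """Fewest edit operations that can destroy q-grams at these 1-based locations."""
    count, covered_upto = 0, 0
    for loc in sorted(locations):
        if loc > covered_upto:
            # A new edit is needed; place it at the q-gram's last character,
            # which also destroys every q-gram starting up to that position.
            count += 1
            covered_upto = loc + q - 1
    return count


def min_prefix_length(prefix_locations, q, d):
    """Shortest prefix (in global order) needing more than d edits to destroy."""
    longest = min(len(prefix_locations), q * d + 1)
    for length in range(d + 1, longest + 1):  # sequential search
        if min_edit_errors(prefix_locations[:length], q) > d:
            return length
    return longest


# 'Austria', d=1, q=2: prefix q-grams ri, Au, us sit at locations 5, 1, 2.
print(min_prefix_length([5, 1, 2], q=2, d=1))
```

For ‘Austria’ the first two prefix q-grams are already too far apart for a single edit to destroy both, so the third prefix q-gram (us) is not needed.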

15 Limit of Count/Loc.-Based Filter
Clustered edit operations: s=‘…please submit by Aug…’, t=‘…please submit by Sep…’; 4 mismatching q-grams if q=2 ⇒ retained (d=2)
Non-clustered edit operations: s’=‘…please submit by Aug…’, t’=‘…pleese supmit bi Aug…’; 6 mismatching q-grams if q=2 ⇒ pruned (d=2)
Clustered edit operations destroy fewer q-grams ⇒ count/location-based filtering less effective

16 Content-Based Filtering
Probing Window
An edit operation increases the L1 distance within the probing window by at most two
L1 distance should be ≤ 2d if ed(s, t) ≤ d
Example: s = A C G T, t = A G C T
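
A minimal sketch of the content-based check, assuming the L1 distance is taken between the character-frequency histograms of the two strings over the same positional window. The window is given as a 0-based half-open range [lo, hi), which is a choice made for this example, and the function names are illustrative.

```python
from collections import Counter


def window_l1(s, t, lo, hi):
    """L1 distance between character histograms of s[lo:hi] and t[lo:hi]."""
    hs, ht = Counter(s[lo:hi]), Counter(t[lo:hi])
    return sum(abs(hs[c] - ht[c]) for c in hs.keys() | ht.keys())


def content_filter_prunes(s, t, lo, hi, d):
    """A window whose L1 distance exceeds 2d rules out ed(s, t) <= d."""
    return window_l1(s, t, lo, hi) > 2 * d


s = "please submit by Aug"
t = "please submit by Sep"
print(window_l1(s, t, 17, 20))
print(content_filter_prunes(s, t, 17, 20, d=2))
```

A window over the last three characters compares ‘Aug’ with ‘Sep’: the L1 distance is 6, which exceeds 2d = 4, so the pair is pruned for d=2. Note that the histogram ignores character order, so a window over ‘ACGT’ and ‘AGCT’ has L1 distance 0 and prunes nothing.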

17 Select Probing Window
Example, d=3, q=3
[Figure: probing windows over s and t; one window has L1 = 2, another has L1 = 8 > 2d ⇒ pruned]

18 Example – Ed-Join
d=1, q=2
a=‘Austria’ b=‘Australia’ c=‘Australiana’ d=‘New_Zealand’ e=‘New_Sealand’
Prefixes before location-based filtering: Qa={ri, Au, us, …} Qb={ra, li, Au, …} Qc={na, ra, li, …} Qd={_Z, Ze, Ne, …} Qe={_S, Se, Ne, …}
Prefixes after location-based filtering: Qa={ri, Au, …} Qb={ra, li, …} Qc={na, ra, …} Qd={_Z, Ze, Ne, …} Qe={_S, Se, Ne, …}
after prefix filter: <b,c> <d,e>
after count filter: <b,c> <d,e>
after content-based filter: <d,e>
after edit distance verification: <d,e>

19 Outline
Motivation
Problem Definition
Algorithms
Experiments
Conclusions

20 Experiment Settings
Environment: Intel Xeon X GHz CPU, 4GB RAM; Debian 4.1, GCC with -O3
Algorithm: All-Pairs-Ed [Bayardo et al. WWW07]; PartEnum [Arasu et al. VLDB06]; Ed-Join / Ed-Join-l
Dataset:
dataset                              # of strings  |Σ|  avg. len
DBLP (author, title)                 900k          93   104.8
TEXAS (name, address, licence no)    155k          37   112.1
TREC (author, title, abstract)       271k               1098.4
UNIREF (protein)                     366k          25   465.1

21 Experiment – Large Threshold
UNIREF, Running Time

22 Experiment - q
TREC, Running Time
q=8 achieves best performance for TREC

23 Experiment - with PartEnum
[Figure panels: d=1, d=2, d=3]

24 Conclusions
Contributions
an efficient algorithm for edit similarity join
exploit mismatching q-grams
location-based filtering – non-clustered edit ops.
content-based filtering – clustered edit ops.
longer q-grams perform best for stand-alone implementation
Future work
other similarity measures, e.g., used in DNA/protein alignment

25 Additional Materials Available at
Thank you!

26 Related Work
q-gram Based Filtering
L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001.
Algorithms to Set Similarity Join
Index-based approaches
S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008.
Prefix-based approaches
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007.
C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008.
PartEnum
A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.

27 Related Work
Edit Distance Computation
R. A. Wagner and M. J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168–173, 1974.
W. J. Masek and M. Paterson. A faster algorithm computing string edit distances. J. Comput. Syst. Sci., 20(1):18–31, 1980.
G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM, 46(3):395–415, 1999.
E. Ukkonen. On approximate string matching. In FCT, 1983.

28 Experiment – Pruning Power