Chuan Xiao, Wei Wang, Xuemin Lin

Presentation transcript:

Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints
Chuan Xiao, Wei Wang, Xuemin Lin
University of New South Wales & NICTA, Australia
2018/11/15 CSE@UNSW

Motivation
Data Cleaning
typo: ‘Stephen Spielburg’ vs ‘Steven Spielberg’
multiple representation: ‘harbor’ vs ‘harbour’
Bioinformatics
DNA/protein sequence: AAAGTCTGAC… vs AAACTCTGAC…

More Applications
identify plagiarism
Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a unique check disk has been eliminated. 2. Read requests have a higher level of parallelism. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a dedicated check disk the check disk never participates in read.
Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a single check disk has been eliminated. 2. Read requests have a higher level of parallelism on RAID5. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a check disk the check disk never participates in read.
detect spam
SPAM EMAIL TEMPLATE Sir/Madam, We happily announce to you the draw of the EURO MILLIONS SPANISH LOTTERY INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on the 27TH MARCH 2008 in SPAIN. Your company or your personal e-mail address attached to ticket number 653-908-321-675 with serial main number <NUMBER> drew lucky star winning numbers <NUMBER> which consequently won in the 2ND category, you have therefore been approved for a lump sum pay out of 960.000.00 Euros. (NINE HUNDRED AND SIXTY THOUSAND EUROS). CONGRATULATIONS!!! Sincerely yours, <NAME> <AFFILIATION>

Outline: Motivation, Problem Definition, Algorithms, Experiments, Conclusions

Edit Similarity Join
Focus on similarity join on strings with edit distance threshold (d)
edit distance ≤ d ⇒ two strings are similar
Problem Definition
Given two collections of strings S and T, the edit similarity join problem is to compute { <s, t> | s ∈ S, t ∈ T, ed(s,t) ≤ d }
Consider the self-join case here
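
The problem definition can be made concrete with a brute-force baseline. The following is a minimal sketch, not the method of the talk: it computes ed(s,t) with the standard dynamic program and checks every pair of a self-join. The function names `edit_distance` and `naive_edit_self_join` are illustrative assumptions.

```python
from itertools import combinations


def edit_distance(s, t):
    """Standard dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


def naive_edit_self_join(strings, d):
    """Return every unordered pair <s, t> with ed(s, t) <= d."""
    return [(s, t) for s, t in combinations(strings, 2)
            if edit_distance(s, t) <= d]


records = ["Austria", "Australia", "Australiana", "New_Zealand", "New_Sealand"]
print(naive_edit_self_join(records, 1))
```

This baseline verifies all pairs, which is exactly the cost the filtering steps in the rest of the talk aim to avoid.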

Outline: Motivation, Problem Definition, Algorithms, Experiments, Conclusions

q-gram Based Filtering [Gravano et al. VLDB01]
Naïve algorithm
compute edit distance: O(n^2) time complexity
do this for N^2/2 pairs
q-gram based filtering
filter-and-refine
length filtering: | len(s)-len(t) | ≤ d
q-grams of ‘New_Zealand’: New ew_ w_Z _Ze Zea eal ala lan and
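
As a small illustration of the two building blocks on this slide, the sketch below extracts positional q-grams and applies the length filter. The helper names `qgrams` and `passes_length_filter` are assumptions made for this example, and locations are taken to be 1-based.

```python
def qgrams(s, q):
    """Return the q-grams of s as (gram, location) pairs, locations 1-based."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def passes_length_filter(s, t, d):
    """Strings whose lengths differ by more than d cannot be within edit distance d."""
    return abs(len(s) - len(t)) <= d


print([g for g, _ in qgrams("New_Zealand", 3)])
print(passes_length_filter("Austria", "Australia", 1))  # lengths 7 and 9
```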

Matching q-grams
count filtering: at least LB(s,t) common q-grams, where LB(s,t) = max(|s|, |t|) - q + 1 - q*d
d edit operations destroy at most q*d q-grams ⇒ s and t share most q-grams
position filtering: positions of common q-grams should be within d
Implemented on RDBMS; best performance when small q, such as q=2,3
[Figure: the q-grams of ‘New_Zealand’ (New ew_ w_Z _Ze Zea eal ala lan and) with the matching q-grams marked]
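
The count filter can be sketched as a bag intersection of q-grams compared against LB(s,t). This is a simplified illustration under the assumption that q-grams are compared as multisets; position filtering is left out, and the function names are hypothetical.

```python
from collections import Counter


def qgram_bag(s, q):
    """Multiset of the q-grams of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def common_qgram_count(s, t, q):
    """Size of the bag intersection of the two q-gram multisets."""
    return sum((qgram_bag(s, q) & qgram_bag(t, q)).values())


def count_lower_bound(s, t, q, d):
    """LB(s,t) = max(|s|, |t|) - q + 1 - q*d, as on the slide."""
    return max(len(s), len(t)) - q + 1 - q * d


def passes_count_filter(s, t, q, d):
    return common_qgram_count(s, t, q) >= count_lower_bound(s, t, q, d)


# 'Austria' and 'Australia' share 5 of the required 6 q-grams for q=2, d=1.
print(common_qgram_count("Austria", "Australia", 2),
      count_lower_bound("Austria", "Australia", 2, 1))
```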

Prefix Filter [Chaudhuri et al. ICDE06, Bayardo et al. WWW07]
Bottleneck: generating candidate pairs which share at least LB(s,t) matching q-grams
Prefix Filter: sort q-grams by a global ordering, such as idf
[Figure: sorted q-gram arrays Qs and Qt, each split into a prefix of q*d+1 q-grams and a suffix of l-q*d-1 = LB(s,t)-1 q-grams; labelled q-grams qa, qb, qx, qy]
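
A rough sketch of prefix filtering for a self-join follows. It assumes a global ordering by ascending document frequency (rarest q-grams first, ties broken alphabetically) as a stand-in for the idf ordering mentioned on the slide, and it uses an in-memory inverted index; `prefix_filter_candidates` is a hypothetical name, not an API from the talk.

```python
from collections import Counter, defaultdict


def qgram_list(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def prefix_filter_candidates(strings, q, d):
    """Return index pairs (j, i), j < i, whose (q*d+1)-prefixes share a q-gram."""
    grams = [qgram_list(s, q) for s in strings]
    # Document frequency of each q-gram defines the global ordering.
    df = Counter(g for gs in grams for g in set(gs))
    prefix_len = q * d + 1
    index = defaultdict(list)  # q-gram -> ids of earlier strings with it in their prefix
    candidates = set()
    for i, gs in enumerate(grams):
        prefix = sorted(gs, key=lambda g: (df[g], g))[:prefix_len]
        for g in set(prefix):
            for j in index[g]:
                candidates.add((j, i))
            index[g].append(i)
    return candidates


records = ["Austria", "Australia", "Australiana", "New_Zealand", "New_Sealand"]
print(sorted(prefix_filter_candidates(records, q=2, d=1)))
```

Because the ordering here is only an assumed one, the candidate set may differ from the one shown in the talk's example; what the prefix filter guarantees is that no pair within edit distance d is missed.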

All-Pairs-Ed Algorithm [Bayardo et al. WWW07]
[Figure: flowchart from the indexed Record Set to the Result Pairs, with stages Cand-1 Generation (Prefix Filter), Cand-2 Generation (Count Filter), and Verification (Edit Distance)]

Example – All-Pairs-Ed
d=1, q=2, prefix_len = q*d+1 = 3
a=‘Austria’ Qa={ri, Au, us, …}
b=‘Australia’ Qb={ra, li, Au, …}
c=‘Australiana’ Qc={na, ra, li, …}
d=‘New_Zealand’ Qd={_Z, Ze, Ne, …}
e=‘New_Sealand’ Qe={_S, Se, Ne, …}
after prefix filter: <a,b> <b,c> <d,e>
after count filter: <b,c> <d,e>
after edit distance verification: <d,e>

Ed-Join
Idea: mismatching q-grams provide useful information
filters          edit operations
location-based   non-clustered
content-based    clustered

Location-Based Filtering
Idea: reduce prefix length
Example, d=1, q=2
s=‘Austria’ Qs= ri (location 5), Au (location 1), us (pruned)
t=‘Australia’ Qt= ra (location 5), li (location 7), Au (pruned)

Minimum Prefix Length
default prefix length: q*d+1
sequential search: find the shortest prefix of Qs whose q-grams need at least d+1 edit operations to destroy them
Further optimization: binary search within [d+1, q*d+1]
[Figure: example with d=2, q=2 (cells numbered 1 to 6; symbols A C G A C G T A), where min. prefix len. = 4]
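
One way to sketch the minimum prefix length computation is shown below. It assumes 1-based q-gram start locations and uses a greedy count of the fewest edit operations needed to destroy a set of q-grams (one edit at character position loc+q-1 destroys every q-gram starting in loc..loc+q-1). The function names are hypothetical, and this is an illustration of the idea rather than the talk's exact procedure.

```python
def min_edit_errors(locations, q):
    """Fewest edit operations that can destroy q-grams starting at these locations."""
    count = 0
    covered_up_to = 0  # last start location already destroyed by a chosen edit
    for loc in sorted(locations):
        if loc > covered_up_to:
            count += 1
            covered_up_to = loc + q - 1
    return count


def min_prefix_length(prefix_locations, q, d):
    """Smallest k such that the first k prefix q-grams need more than d edits.

    prefix_locations lists q-gram start locations in global-ordering order.
    Binary search is valid because min_edit_errors never decreases as k grows.
    """
    hi = min(len(prefix_locations), q * d + 1)
    lo = min(d + 1, hi)
    if min_edit_errors(prefix_locations[:hi], q) <= d:
        return hi  # too few q-grams to shorten the prefix safely
    while lo < hi:
        mid = (lo + hi) // 2
        if min_edit_errors(prefix_locations[:mid], q) > d:
            hi = mid
        else:
            lo = mid + 1
    return lo


# 'Austria' with prefix ri (5), Au (1), us (2) and d=1, q=2:
print(min_prefix_length([5, 1, 2], q=2, d=1))  # 2, so 'us' is pruned
```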

Limit of Count/Loc.-Based Filter
Clustered edit operations: s=‘…please submit by Aug…’ t=‘…please submit by Sep…’
4 mismatching q-grams if q=2 ⇒ retained (d=2)
Non-clustered edit operations: s’=‘…please submit by Aug…’ t’=‘…pleese supmit bi Aug…’
6 mismatching q-grams if q=2 ⇒ pruned (d=2)
Clustered edit operations destroy fewer q-grams ⇒ count/location-based filtering less effective
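
The two counts on this slide can be reproduced with a short sketch. Since the slide elides the surrounding text with ‘…’, the example below assumes a hypothetical ‘#’ character on each side to stand in for that context; the helper names are also assumptions. For equal-length strings, surviving the count filter is equivalent to having at most q*d mismatching q-grams.

```python
from collections import Counter


def qgram_bag(s, q):
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def mismatching_qgrams(s, t, q):
    """Number of q-grams of s (as a bag) that have no partner in t."""
    return sum((qgram_bag(s, q) - qgram_bag(t, q)).values())


q, d = 2, 2
s = "#please submit by Aug#"            # '#' is a hypothetical context marker
clustered = "#please submit by Sep#"     # three adjacent substitutions
scattered = "#pleese supmit bi Aug#"     # three scattered substitutions

for t in (clustered, scattered):
    m = mismatching_qgrams(s, t, q)
    print(m, "retained" if m <= q * d else "pruned")
```

Both pairs are three substitutions apart, yet only the scattered one exceeds the q*d = 4 budget, which is the gap that content-based filtering targets.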

Content-Based Filtering
Probing Window
An edit operation increases L1 distance within the probing window by at most two
L1 distance should be ≤ 2d if ed(s, t) ≤ d
[Figure: s = A C G T, t = A G C T]
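
The L1 test can be sketched as a comparison of character-frequency histograms over the same window in both strings. This assumes the probing window is given as a half-open, 0-based index range and that the L1 distance is taken between character counts; how the window is selected is not covered here, and the function names are hypothetical.

```python
from collections import Counter


def window_l1_distance(s, t, start, end):
    """L1 distance between character histograms of s[start:end] and t[start:end]."""
    cs, ct = Counter(s[start:end]), Counter(t[start:end])
    return sum(abs(cs[c] - ct[c]) for c in cs.keys() | ct.keys())


def passes_content_filter(s, t, start, end, d):
    """A pair with ed(s, t) <= d must have window L1 distance <= 2*d."""
    return window_l1_distance(s, t, start, end) <= 2 * d


print(window_l1_distance("ACGT", "AGCT", 1, 2))       # 'C' vs 'G'
print(passes_content_filter("AAAA", "CCCC", 0, 4, 3))  # L1 = 8 > 2*3
```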

Select Probing Window
Example, d=3, q=3
[Figure: strings s and t with two probing windows; one window has L1 = 2, the other has L1 = 8 > 2d, so the pair is pruned]

Example – Ed-Join
d=1, q=2
a=‘Austria’ Qa={ri, Au, …} (All-Pairs-Ed prefix: {ri, Au, us, …})
b=‘Australia’ Qb={ra, li, …} (All-Pairs-Ed prefix: {ra, li, Au, …})
c=‘Australiana’ Qc={na, ra, …} (All-Pairs-Ed prefix: {na, ra, li, …})
d=‘New_Zealand’ Qd={_Z, Ze, Ne, …}
e=‘New_Sealand’ Qe={_S, Se, Ne, …}
after prefix filter: <b,c> <d,e>
after count filter: <b,c> <d,e>
after content-based filter: <d,e>
after edit distance verification: <d,e>

Outline: Motivation, Problem Definition, Algorithms, Experiments, Conclusions

Experiment Settings
Environment
Intel Xeon X3220 2.4GHz CPU, 4GB RAM
Debian 4.1, GCC 4.1.2 with -O3
Algorithm
All-Pairs-Ed [Bayardo et al. WWW07]
PartEnum [Arasu et al. VLDB06]
Ed-Join / Ed-Join-l
Dataset
dataset                               # of strings   |Σ|   avg. len
DBLP (author, title)                  900k           93    104.8
TEXAS (name, address, licence no)     155k           37    112.1
TREC (author, title, abstract)        271k                 1098.4
UNIREF (protein)                      366k           25    465.1

Experiment – Large Threshold
[Figure: UNIREF, running time]

Experiment - q
[Figure: TREC, running time]
q=8 achieves best performance for TREC

Experiment - with PartEnum
[Figure: panels for d=1, d=2, d=3]

Conclusions
Contributions
an efficient algorithm for edit similarity join
exploit mismatching q-grams
location-based filtering – non-clustered edit ops.
content-based filtering – clustered edit ops.
longer q-grams perform best for stand-alone implementation
Future work
other similarity measures, e.g., used in DNA/protein alignment

Thank you!
Additional Materials Available at http://www.cse.unsw.edu.au/~weiw/project/simjoin.html

Related Work
q-gram Based Filtering
L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001.
Algorithms to Set Similarity Join
Index-based approaches
S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008.
Prefix-based approaches
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007.
C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008.
PartEnum
A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.

Related Work
Edit Distance Computation
R. A. Wagner and M. J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168–173, 1974.
W. J. Masek and M. Paterson. A faster algorithm computing string edit distances. J. Comput. Syst. Sci., 20(1):18–31, 1980.
G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM, 46(3):395–415, 1999.
E. Ukkonen. On approximate string matching. In FCT, 1983.

Experiment – Pruning Power