Jiannan Wang (Tsinghua, China) Guoliang Li (Tsinghua, China) Jianhua Feng (Tsinghua, China)

Presentation transcript:


Outline: Introduction; Fuzzy-Token Similarity; String Similarity Join using Fuzzy-Token Similarity (Signature Scheme for Token Sets, Signature Scheme for Tokens); Experiment; Conclusion. ICDE 2011, 2011/4/13.

Background: String Similarity Join. Find similar string pairs between two string sets. An essential operation in many applications.
Example: one table has columns Card #, Name, Addr, Phn, with rows (1234****, Jeffery Ullman, CS Dept. Stanford, CA), (****, Marvin Minsky, CS Dept., MIT, MA), …; a second table has columns Card #, Name, Tel, with rows (1205****, David), (0101****, Jeffrey), ….
Perform a similarity join on the name attribute: "Jeffery Ullman" matches "Jeffrey Ullman".

Background: String Similarity Join (second example). A query log has columns User Id, Query, Timestamp, with queries such as "ICDE 2011 Hanover", "NBA All Stars", "ICDE Hannover", and "weather Hanover". Perform a self similarity join on the query attribute.

Motivation: Existing Similarity Metrics.
Token-based Similarity: Dice, Cosine, Jaccard, …
Character-based Similarity: Edit Distance, Edit Similarity, …
Hybrid Similarity: GED [SIGMOD 03]
Example: S1 = "nba mcgrady", S2 = "macgrady nba". Jaccard(S1, S2) = 1/3; ED(S1, S2) = 8; GED(S1, S2) = 0.


Token-based Similarity: Dice similarity, Cosine similarity, Jaccard similarity. Example: T1 = {nba, mcgrady}, T2 = {macgrady, nba}, |T1 ∩ T2| = 1. These measures count only exactly matched token pairs, i.e. T1 ∩ T2.
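
The formulas on this slide did not survive in the transcript. For reference, the standard textbook definitions of the three measures over token sets T1 and T2 are given below, evaluated on the slide's example, where |T1| = |T2| = 2 and |T1 ∩ T2| = 1. The notation is an assumption; the slide's own may differ.

```latex
\begin{align*}
\mathrm{Dice}(T_1, T_2)    &= \frac{2\,|T_1 \cap T_2|}{|T_1| + |T_2|}                 = \frac{2 \cdot 1}{2 + 2}      = \frac{1}{2} \\
\mathrm{Cosine}(T_1, T_2)  &= \frac{|T_1 \cap T_2|}{\sqrt{|T_1| \cdot |T_2|}}         = \frac{1}{\sqrt{2 \cdot 2}}   = \frac{1}{2} \\
\mathrm{Jaccard}(T_1, T_2) &= \frac{|T_1 \cap T_2|}{|T_1| + |T_2| - |T_1 \cap T_2|}   = \frac{1}{2 + 2 - 1}          = \frac{1}{3}
\end{align*}
```

The Jaccard value agrees with the 1/3 quoted on the Motivation slide.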

Weighted bipartite graph between T1 and T2 (tokens shown: mcgrady, nba, wnba, macgrady, nba), with edge weights that quantify token similarity. 3. Fuzzy Overlap: maximum weighted matching. Better than |T1 ∩ T2| = 1.
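
A minimal Python sketch of the fuzzy overlap as a maximum weighted matching may help here. It rests on assumptions that the transcript does not confirm: edge weights are taken to be normalized edit similarity (1 minus edit distance over the longer length), edges below a threshold `delta` are dropped, and the matching is found by brute force over permutations, which is only practical for small token sets. All function names are hypothetical.

```python
from itertools import permutations


def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Assumed token similarity: 1 - ED(a, b) / max(|a|, |b|)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def fuzzy_overlap(t1, t2, delta=0.8):
    """Maximum total weight of a matching between the two token lists.

    Edges with similarity below delta count as weight 0 (assumption).
    Brute force: try every assignment of the smaller list into the larger.
    """
    small, large = sorted((list(t1), list(t2)), key=len)
    best = 0.0
    for perm in permutations(large, len(small)):
        total = 0.0
        for a, b in zip(small, perm):
            sim = edit_similarity(a, b)
            if sim >= delta:
                total += sim
        best = max(best, total)
    return best


overlap = fuzzy_overlap(["nba", "mcgrady"], ["macgrady", "nba"])
print(overlap)
```

On the slide's example, "nba" matches "nba" with weight 1 and "mcgrady" matches "macgrady" with weight 1 - 1/8, so the sketch credits the near-match that the exact overlap |T1 ∩ T2| = 1 ignores. A real implementation would replace the brute-force search with a polynomial-time assignment algorithm.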

Fuzzy-Token Similarity: Fuzzy-Dice similarity, Fuzzy-Cosine similarity, Fuzzy-Jaccard similarity. Example: T1 = {nba, mcgrady}, T2 = {macgrady, nba}.
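
The slide's formulas are missing from the transcript. A plausible reading, stated here as an assumption, is that each fuzzy measure keeps the shape of its exact counterpart and swaps the exact overlap for the fuzzy overlap, written below as a tilde-capped intersection. The worked numbers further assume that token similarity is normalized edit similarity, so that "mcgrady" and "macgrady" (edit distance 1, longer length 8) match with weight 0.875 and the fuzzy overlap of the example is 1 + 0.875 = 1.875.

```latex
\begin{align*}
\text{Fuzzy-Dice}(T_1, T_2)    &= \frac{2\,|T_1 \,\widetilde{\cap}\, T_2|}{|T_1| + |T_2|}
                                = \frac{2 \cdot 1.875}{4} = 0.9375 \\
\text{Fuzzy-Cosine}(T_1, T_2)  &= \frac{|T_1 \,\widetilde{\cap}\, T_2|}{\sqrt{|T_1| \cdot |T_2|}}
                                = \frac{1.875}{2} = 0.9375 \\
\text{Fuzzy-Jaccard}(T_1, T_2) &= \frac{|T_1 \,\widetilde{\cap}\, T_2|}{|T_1| + |T_2| - |T_1 \,\widetilde{\cap}\, T_2|}
                                = \frac{1.875}{2.125} \approx 0.882
\end{align*}
```

Under these assumptions each fuzzy value is well above its exact counterpart (1/2, 1/2, and 1/3 respectively), which reflects the near-match between the two spellings.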

Comparison with Existing Similarities


String Similarity Join using Fuzzy-Token Similarity.
First string set: s1 = "kobe and trancy", s2 = "trcy macgrady mvp", …
Second string set: s'1 = "kobe bryant age", s'2 = "mvp tracy mcgrady", …
Tokenization: T1 = {kobe, and, trancy}, T2 = {trcy, macgrady, mvp}, …; T'1 = {kobe, bryant, age}, T'2 = {mvp, tracy, mcgrady}, …
Result: (s2, s'2), …
Naive solution: enumerating N^2 pairs is quite expensive.

Using Existing Methods

Our Signature Scheme. The superscript denotes which token generates the signature.


Prefix Filtering Signature Scheme. Sort the signatures in alphabetical order; remove the 2 largest signatures.
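
The slide's worked example is not preserved, so the following Python sketch illustrates only the generic prefix-filtering principle on exact tokens, not the authors' signature scheme for fuzzy tokens. The idea: if two sets must share at least `min_overlap` tokens, sort each set in a global (here alphabetical) order and drop the `min_overlap - 1` largest tokens; any qualifying pair must still share one of the remaining tokens. The function names and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def prefix_signatures(tokens, min_overlap):
    """Tokens in alphabetical order, minus the (min_overlap - 1) largest."""
    ordered = sorted(set(tokens))
    keep = len(ordered) - (min_overlap - 1)
    return ordered[:max(keep, 0)]


def candidate_pairs(token_sets, min_overlap):
    """Pairs of set indices (i, j), i < j, that share a prefix signature."""
    index = defaultdict(list)  # signature -> ids of the sets containing it
    for set_id, tokens in enumerate(token_sets):
        for sig in prefix_signatures(tokens, min_overlap):
            index[sig].append(set_id)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(ids, 2))
    return pairs


sets = [
    {"kobe", "bryant", "mvp"},
    {"kobe", "bryant", "age"},
    {"nba", "tracy", "mcgrady"},
]
print(candidate_pairs(sets, min_overlap=2))  # only sets 0 and 1 share a prefix token
```

With `min_overlap=3` the function removes the 2 largest tokens of each set, which corresponds to the count mentioned on the slide. Candidates still need verification, since sharing a signature is necessary but not sufficient.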

Token Sensitive Signature Scheme. Prefix Filtering: No! Token Sensitive: Yes!

Token Sensitive Signature Scheme (Cont'd). Alphabetical order. Delete the maximal number of largest signatures that contain 2 tokens. Candidates: {(T2, T4)}. Candidates: {(T1, T2), (T1, T3), (T1, T4), (T2, T4)}.


Partition-NED Signature Scheme

Partition t'

Partition t

Partition t (Cont'd)
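
The details of the Partition-NED scheme are not preserved in the transcript. The Python sketch below shows only the general pigeonhole idea behind partition-based signatures for edit distance, which is assumed here (not confirmed by the source) to be the starting point: if two tokens are within edit distance `tau`, then after splitting one of them into `tau + 1` segments, at least one segment must appear unchanged as a substring of the other. The function names and the even-split policy are hypothetical.

```python
def partition(token, tau):
    """Split token into tau + 1 contiguous segments of near-equal length."""
    parts = tau + 1
    base, extra = divmod(len(token), parts)
    segments, start = [], 0
    for i in range(parts):
        length = base + (1 if i < extra else 0)  # first `extra` segments get one more char
        segments.append(token[start:start + length])
        start += length
    return segments


def may_match(s, t, tau):
    """Filter: False means ED(s, t) > tau for certain; True means 'verify'.

    tau edits can touch at most tau of the tau + 1 segments of s, so one
    segment survives intact as a substring of t whenever ED(s, t) <= tau.
    An empty segment (very short s) matches trivially, which keeps the
    filter free of false negatives.
    """
    return any(segment in t for segment in partition(s, tau))


print(partition("macgrady", 1))
print(may_match("macgrady", "mcgrady", 1))
print(may_match("macgrady", "kobe", 1))
```

This naive check looks for each segment anywhere in the other token. Restricting the positions and lengths of the substrings that need to be probed is the kind of saving a pruning step would provide.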

Pruning Techniques: reduce the number of substrings from 21 to 8.

Comparison with Partition-ED (SIGMOD 09)


Experiment Setup.
Data sets: DBLP Author (author names from the DBLP dataset); AOL Query Log (queries from the AOL dataset).
Environment: C++, GCC 4.2.3, Ubuntu; Intel Core 2 Quad X GHz processor and 4 GB memory.

Result Quality

Evaluation on Different Signature Schemes for Tokens

Evaluation on Different Signature Schemes for Token Sets

Put Everything Together


Conclusion.
Fuzzy-token similarity: a hybrid similarity; subsumes many well-known similarities; high result quality.
String similarity join using fuzzy-token similarity: signature-based framework; token-sensitive signature scheme; Partition-NED signature scheme; achieves higher performance than the state-of-the-art methods both theoretically and experimentally.
