Improving Similarity Join Algorithms using Vertical Clustering Techniques. Lisa Tan, Department of Computer Science, Computing & Information Technology, Wayne State University.

Presentation transcript:

Improving Similarity Join Algorithms using Vertical Clustering Techniques Lisa Tan Department of Computer Science Computing & Information Technology Wayne State University Sept. 15, 2009

Why Use a Similarity Join? To correlate data from different data sources (e.g., for data integration). Data is often dirty (e.g., typing mistakes), with abbreviated, incomplete, or missing information, and with differences in formatting due to the lack of standard conventions (e.g., for addresses).

Example: Find records from different datasets that could be the same entity.

Table R
Name            Addr           Phone
Jack Lemmon     Maple St
Harrison Ford   Culver Blvd
Tom Hanks       Main St
…

Table S
Name            Addr           Phone
Ton Hanks       Main Street
Kevin Spacey    Frost Blvd
Jack Lemon      Maple Street
…

Experimental Results – Natural Join

Experimental Results – Similarity Join

Problem Statement for Similarity Join: Given a string S, called the source, and another string T, called the target, and allowing a defined number of errors in the join, the similarity join verifies whether or not the two strings represent the same real-world entity according to a chosen similarity method.

Sample Applications 1. Finding matching DNA subsequences even after mutations have occurred. 2. Signal recovery for transmissions over noisy lines. 3. Searching for spelling/typing errors and finding possible corrections. 4. Handwriting recognition, virus and intrusion detection.

General Approaches: Similarity join attracts different research communities: statistics, artificial intelligence, and databases. Statistics refers to similarity join as probabilistic record linkage, aimed at minimizing the probability of misclassification. Artificial intelligence uses supervised learning to learn the parameters of string edit distance metrics. The database community uses a knowledge-intensive approach, with edit distance as a general record-matching scheme.

General Algorithms in the Database Area: All the algorithms focus on edit distance: dynamic programming algorithms, automata algorithms, bit-parallelism algorithms, and filtering algorithms.

Comments on Existing Methods: All of the algorithms above are based on the generic edit distance function. Some improve the speed of the dynamic programming method. Some apply filtering techniques that avoid expensive comparisons in large parts of the queried sequence. Current similarity algorithms assume that the join conditions are known and do not consider related fields in their join conditions. Although there have been many efforts toward efficient string similarity joins, there is still room for improvement.

Outline: Motivation; Pre-experimental Results; Proposed Approach; Identify Clustered Join Attributes; Experimental Results; Conclusion.

Research Goal: Identify the same real-world entities across multiple heterogeneous databases.

Motivation for the Clustering Concept: Current similarity algorithms do not consider related fields. Clustering fits the notion of related fields well.

Pre-experimental Results

Proposed Approach: Our proposed approach takes clustered related attributes into account. Question: how do we identify clustered join attributes?

Clustering Algorithm: The rationale behind the clustering is to produce fragments, that is, groups of attribute columns that are closely related.

Identify Clustered Related Attributes: 1. Start from pre-knowledge of the applications that use the data (attribute usage information). 2. Calculate attribute affinities. 3. Calculate clustered affinities. 4. Use the Bond Energy Algorithm (BEA) to regroup the affinity values. 5. Apply the split approach to find clustered related attributes.

Clustered Approach (diagram): Logical accesses; computation of affinities; attribute affinity matrix; clustering; clustered attribute affinity matrix; split approach; group of clustered related attributes.

Clustered Approach (cont'd): Attribute usage: 1 if attribute $A_j$ is referenced by application $q_k$, 0 otherwise. Attribute affinity. Cluster affinity: a permutation that maximizes the global affinity measure, which results in the grouping of large affinity values with large affinity attributes and small affinity values with small affinity attributes.
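The attribute usage and attribute affinity steps can be illustrated with a short sketch. It assumes the common textbook formulation in which the affinity of two attributes is the summed access frequency of every application that references both; the slides do not show the exact formula, so this is an assumption. The attribute names, applications, and frequencies are hypothetical.

```python
from itertools import product


def usage_matrix(attributes, queries):
    """use[k][j] is 1 if attribute j is referenced by application k, else 0."""
    return [[1 if a in q["attrs"] else 0 for a in attributes] for q in queries]


def affinity_matrix(attributes, queries):
    """aff[i][j] sums the frequency of every application referencing both
    attribute i and attribute j (assumed formulation, not from the slides)."""
    use = usage_matrix(attributes, queries)
    n = len(attributes)
    aff = [[0] * n for _ in range(n)]
    for k, q in enumerate(queries):
        for i, j in product(range(n), repeat=2):
            if use[k][i] and use[k][j]:
                aff[i][j] += q["freq"]
    return aff


# Hypothetical illustration data.
attributes = ["Name", "Address", "Phone", "Birthday"]
queries = [
    {"attrs": {"Name", "Phone"}, "freq": 30},
    {"attrs": {"Address", "Birthday"}, "freq": 20},
    {"attrs": {"Name", "Address"}, "freq": 5},
]
aff = affinity_matrix(attributes, queries)
```

The resulting matrix is symmetric, and its diagonal holds the total access frequency of each attribute. A clustering step such as BEA would then permute its rows and columns.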

Clustered Approach - Example

Split Approach: Split based on the access model $SQ = af(VF_i) \cdot af(VF_j) - af(VF_i, VF_j)$, where $af(VF_i)$ stands for the access frequency for vertical fragment $VF_i$ and $af(VF_i, VF_j)$ stands for the access frequency for queries having at least one attribute in each of the vertical fragments $VF_i$ and $VF_j$.

Split Approach (cont'd): In Table 3, for the first possible split, {Address} and {Birthday, Name, Phone}, SQ = 25 * 35; for the second possible split, {Address, Birthday} and {Name, Phone}, SQ = -(30 + 35); for the third possible split, {Address, Birthday, Name} and {Phone}, SQ = -(35 + 35).

Existing Similarity Join Techniques: Edit Distance and Q-gram.

Similarity Join – Edit Distance: A widely used metric for string similarity. ED(s1, s2) = the minimum number of operations (insertion, deletion, substitution) needed to change s1 into s2. Example: s1 = surgery, s2 = survey, ED(s1, s2) = 2.
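A minimal sketch of how each candidate split point could be scored is given below. It assumes that the split quality is the product of the two fragments' access frequencies minus the frequency of queries that span both fragments, and that a fragment's access frequency counts only queries whose attributes all fall inside it. The query workload is hypothetical and does not reproduce the values of Table 3.

```python
def split_quality(left, right, queries):
    """Assumed score: af(left) * af(right) - af(left, right)."""
    left, right = set(left), set(right)
    # Frequency of queries that touch only one fragment (assumption).
    af_left = sum(f for attrs, f in queries if attrs <= left)
    af_right = sum(f for attrs, f in queries if attrs <= right)
    # Frequency of queries with at least one attribute in each fragment.
    af_both = sum(f for attrs, f in queries if attrs & left and attrs & right)
    return af_left * af_right - af_both


def best_split(ordered_attrs, queries):
    """Try every cut point of the clustered attribute order; keep the best."""
    candidates = []
    for cut in range(1, len(ordered_attrs)):
        left, right = ordered_attrs[:cut], ordered_attrs[cut:]
        candidates.append((split_quality(left, right, queries), left, right))
    return max(candidates, key=lambda c: c[0])


# Hypothetical workload: (attributes referenced, access frequency).
ordered = ["Address", "Birthday", "Name", "Phone"]
workload = [
    (frozenset({"Address", "Birthday"}), 20),
    (frozenset({"Name", "Phone"}), 30),
    (frozenset({"Birthday", "Name"}), 5),
]
score, left, right = best_split(ordered, workload)
```

With this workload the middle cut wins, because most queries stay inside one of the two fragments and only a small share of the accesses crosses the boundary.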

Dynamic Programming Algorithm: This is the oldest algorithm. It answers the question of how to compute $ed(x, y)$. Take a matrix $C_{0..|x|,\,0..|y|}$, where $C_{i,j}$ is the minimum number of operations needed to match $x_{1..i}$ to $y_{1..j}$. It is calculated as follows: $C_{i,0} = i$; $C_{0,j} = j$; if $x_i = y_j$ then $C_{i,j} = C_{i-1,j-1}$; otherwise $C_{i,j} = 1 + \min(C_{i-1,j},\, C_{i,j-1},\, C_{i-1,j-1})$. The complexity is $O(mn)$, where $m = |x|$ and $n = |y|$.
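The recurrence can be written directly as a short function. This is a plain sketch of the full-matrix version with unit costs for insertion, deletion, and substitution; the function name is chosen here for illustration.

```python
def edit_distance(x, y):
    """Edit distance between strings x and y via the O(mn) matrix C."""
    m, n = len(x), len(y)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i  # delete the first i characters of x
    for j in range(n + 1):
        C[0][j] = j  # insert the first j characters of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                C[i][j] = C[i - 1][j - 1]
            else:
                C[i][j] = 1 + min(C[i - 1][j], C[i][j - 1], C[i - 1][j - 1])
    return C[m][n]
```

For the strings used in the slides, `edit_distance("surgery", "survey")` evaluates to 2: one substitution (g to v) and one deletion (the second r).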

Matrix Example Edit Distance

Similarity Join – Q-gram Roadmap: 1. Break strings into substrings of length q. 2. Perform an exact join on the q-grams. 3. Find candidate string pairs based on the results. 4. Check only the candidate pairs with a UDF to obtain the final answer.

Similarity Join – Q-gram: A positional q-gram is a pair (position, substring) obtained as follows: extend the string s with the new characters # (as a prefix) and % (as a suffix), then slide a window of length q over the extended string. This generates |s| + q - 1 substrings.

Q-gram Technique (cont'd): Rationale: when two strings s and t are within a small edit distance of each other, they share a large number of q-grams. Advantage: it can be built on top of relational databases, with an augmented table created on the fly.

Similarity Join – Q-gram: Issue with Q-gram: it does not work well on large datasets. Resolution: 1. Clean the data by using an exact join. 2. Create a table to hold the non-matching data. 3. Apply Q-gram to the new temporary table.

Similarity Join – Q-gram (continued): For the string john smith, with q=3: {(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn ), (6,n s), (7, sm), (8,smi), (9,mit), (10,ith), (11,th%), (12,h%%)}. For the string john a smith, with q=3: {(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn ), (6,n a), (7, a ), (8,a s), (9, sm), (10,smi), (11,mit), (12,ith), (13,th%), (14,h%%)}.
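The positional q-gram sets for these two strings can be reproduced with a few lines of code. The sketch assumes q - 1 padding characters on each side (# before the string and % after it), so the final q-gram of john smith is h%%; the function name is hypothetical.

```python
def positional_qgrams(s, q=3, prefix="#", suffix="%"):
    """Return the |s| + q - 1 positional q-grams of s as (position, gram)."""
    padded = prefix * (q - 1) + s + suffix * (q - 1)
    return [(i + 1, padded[i:i + q]) for i in range(len(s) + q - 1)]


grams_short = positional_qgrams("john smith")
grams_long = positional_qgrams("john a smith")
```

Positions are 1-based to match the listing in the slides. The two strings share 11 of their q-grams, which is what makes the exact join on q-grams a useful first filter.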

Sample SQL Expression:
SELECT R1.A0, R2.A0, R1.Ai, R2.Aj
FROM R1, R1AiQ, R2, R2AjQ
WHERE R1.A0 = R1AiQ.A0
  AND R2.A0 = R2AjQ.A0
  AND R1AiQ.Qgram = R2AjQ.Qgram
  AND ABS(R1AiQ.Pos - R2AjQ.Pos) <= k
  AND ABS(strlen(R1.Ai) - strlen(R2.Aj)) <= k
GROUP BY R1.A0, R2.A0, R1.Ai, R2.Aj
HAVING COUNT(*) >= strlen(R1.Ai) - 1 - (k-1)*q
  AND COUNT(*) >= strlen(R2.Aj) - 1 - (k-1)*q
  AND Edit_distance(R1.Ai, R2.Aj, k)
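The same filter-then-verify logic can be sketched outside a database. The version below is an in-memory Python stand-in for the SQL expression, not the authors' implementation: an inverted index on q-grams plays the role of the exact join, followed by the position, length, and count filters and a final edit distance check. All function names are hypothetical, and the sample rows reuse the names from the Table R and Table S example.

```python
from collections import defaultdict


def positional_qgrams(s, q=3):
    padded = "#" * (q - 1) + s + "%" * (q - 1)
    return [(i + 1, padded[i:i + q]) for i in range(len(s) + q - 1)]


def edit_distance(x, y):
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            cost = prev[j - 1] if cx == cy else 1 + min(prev[j], cur[j - 1], prev[j - 1])
            cur.append(cost)
        prev = cur
    return prev[-1]


def similarity_join(r1, r2, k=1, q=3):
    """r1, r2: lists of (id, string). Return id pairs within edit distance k."""
    index = defaultdict(list)  # q-gram -> [(id, position)] for r2
    for id2, s2 in r2:
        for pos, gram in positional_qgrams(s2, q):
            index[gram].append((id2, pos))
    strings2 = dict(r2)
    result = []
    for id1, s1 in r1:
        counts = defaultdict(int)  # matching q-gram pairs per r2 record
        for pos1, gram in positional_qgrams(s1, q):
            for id2, pos2 in index.get(gram, ()):
                if abs(pos1 - pos2) <= k:  # position filter
                    counts[id2] += 1
        for id2, count in counts.items():
            s2 = strings2[id2]
            if abs(len(s1) - len(s2)) > k:  # length filter
                continue
            bound = max(len(s1), len(s2)) - 1 - (k - 1) * q
            if count < bound:  # count filter
                continue
            if edit_distance(s1, s2) <= k:  # verification step
                result.append((id1, id2))
    return result


table_r = [(1, "Jack Lemmon"), (2, "Harrison Ford"), (3, "Tom Hanks")]
table_s = [(1, "Ton Hanks"), (2, "Kevin Spacey"), (3, "Jack Lemon")]
matches = similarity_join(table_r, table_s, k=1)
```

The filters only discard pairs that cannot be within distance k, so the expensive edit distance computation runs on a small set of candidates rather than on every pair.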

Precision, Recall and F-measure: Precision is defined as the number of true positives divided by the sum of true positives and false positives: TP / (TP + FP). Recall is defined as the number of true positives divided by the sum of true positives and false negatives: TP / (TP + FN). F-measure is defined as the weighted harmonic mean of precision and recall: F = 2 * (precision * recall) / (precision + recall).
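These three measures translate into a small helper. The sketch below uses the balanced form of the F-measure given above and returns 0.0 where a denominator would be zero, which is a convention chosen here rather than something the slides specify; the counts in the example are made up.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and balanced F-measure from match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 3 correct matches, 1 wrong match, 3 missed matches.
p, r, f = precision_recall_f(tp=3, fp=1, fn=3)
```

In a similarity join evaluation, a true positive is a returned pair that really is the same entity, a false positive is a returned pair that is not, and a false negative is a true match that the join missed.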

Experimental Results Known join attributes vs clustered join attributes on Precision

Experimental Results Known join attributes vs clustered join attributes on Recall

Experimental Results ED vs. Qgram

Experimental Results ED vs Qgram on Recall

Experimental Results ED vs Qgram on F-measure

Conclusion: We proposed a pre-processing approach to improve existing similarity join techniques. Experimental results showed an improvement of about 5% for ED and about 15% for Q-gram.

Future Work: Potential further work: explore alternative clustering methods; increase the size of the datasets; add pre- and post-filtering abilities; …

Publications: Lisa Tan, Farshad Fotouhi and William Grosky, "Improving Similarity Join Algorithms using Vertical Clustering Techniques", ICADIWT 2009. "Improving Similarity Join Algorithm Using Fuzzy Clustering Techniques" has been accepted by the ICDM-09 Workshop on Mining Multiple Information Sources (MMIS).

Thank You! Lisa Tan. Co-authors: Dr. Farshad Fotouhi and Dr. William Grosky. Acknowledgement: Dr. Farshad Fotouhi, Dr. William Grosky, and Computing & Information Technology.

Question

Wayne State University Facts: 30th largest university in the nation; top 50 in NSF public rankings; over 33,300 students; over 350 undergraduate/graduate degree programs in 12 schools and colleges.
