
1 Searching and Integrating Information on the Web
Seminar 3: Data Cleansing
Professor Chen Li, UC Irvine

2 Paper readings
- Hernandez and Stolfo, "Efficient Merge and Purge", SIGMOD 1995
- Gravano et al., "Approximate String Joins in a Database (Almost) for Free", VLDB 2001
- Liang Jin, Chen Li, and Sharad Mehrotra, "Efficient Record Linkage in Large Data Sets", DASFAA 2003
- Sarawagi and Bhamidipaty, "Interactive Deduplication Using Active Learning", KDD 2002

3 Motivation
Correlate data from different data sources (e.g., data integration):
- Data is often dirty
- It needs to be cleansed before being used
Example:
- A hospital needs to merge patient records from different data sources
- The records have different formats, typos, and abbreviations

4 Example

Table R:
  Name           SSN           Addr
  Jack Lemmon    430-871-8294  Maple St
  Harrison Ford  292-918-2913  Culver Blvd
  Tom Hanks      234-762-1234  Main St
  ...

Table S:
  Name           SSN           Addr
  Ton Hanks      234-162-1234  Main Street
  Kevin Spacey   928-184-2813  Frost Blvd
  Jack Lemon     430-817-8294  Maple Street
  ...

Goal: find records from the two datasets that could refer to the same entity.

5 Another Example
The same paper cited in two different formats:
- P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40 (1981)
- Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981

6 Record linkage
Problem statement: given two relations, identify the potentially matched records, both efficiently and effectively.

7 Challenges
How to define good similarity functions?
- Many functions have been proposed (edit distance, cosine similarity, ...)
- Domain knowledge is critical
  - Names: "Wall Street Journal" and "LA Times"
  - Addresses: "Main Street" versus "Main St"
How to do the matching efficiently?
- Offline join version
- Online (interactive) search
  - Nearest-neighbor search
  - Range search

8 Outline
- Supporting string-similarity joins using an RDBMS
- Using mapping techniques
- Interactive deduplication

9 Edit Distance
A widely used metric for string similarity.
ed(s1, s2) = minimum number of operations (insertion, deletion, substitution) needed to transform s1 into s2.
Example: s1 = "Tom Hanks", s2 = "Ton Hank", ed(s1, s2) = 2.
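As an illustration (not from the slides), a minimal dynamic-programming implementation of edit distance in Python might look like the following; the function name and structure are my own:

def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to transform s1 into s2 (classic DP, O(len(s1)*len(s2)))."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))  # distances from s1[:0] to every prefix of s2
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n]

# The slide's example: substitute 'm' -> 'n', delete the trailing 's'.
assert edit_distance("Tom Hanks", "Ton Hank") == 2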

10 Approximate String Joins
We want to join tuples whose string fields are "similar".
Similarity measure: edit distance; each insertion, deletion, or replacement increases the distance by one.

  Service A             Service B             Distance
  Jenny Stamatopoulou   Jenny Stamatopulou    K=1
  John Paul McDougal    John P. McDougal      K=3
  Aldridge Rodriguez    Al Dridge Rodriguez   K=1
  Panos Ipeirotis       Panos Ipirotis        K=1
  John Smith            Jonh Smith            K=2
  ...                   ...

11 Focus: Approximate String Joins over Relational DBMSs
Join two tables on string attributes and keep all pairs of strings with edit distance <= K.
Solve the problem in a database-friendly way (if possible, with an existing "vanilla" RDBMS).

12 Current Approaches for Processing Approximate String Joins
No native support for approximate joins in RDBMSs.
Two existing (straightforward) solutions:
- Join data outside of the DBMS
- Join data via user-defined functions (UDFs) inside the DBMS

13 Approximate String Joins outside of a DBMS
1. Export the data
2. Join outside of the DBMS
3. Import the result
Main advantage: we can exploit any state-of-the-art string-matching algorithm, without restrictions from DBMS functionality.
Disadvantages:
- Substantial amounts of data must be exported and imported
- Cannot be easily integrated with further processing steps in the DBMS

14 Approximate String Joins with UDFs
1. Write a UDF to check whether two strings match within distance K
2. Write an SQL statement that applies the UDF to the string pairs

SELECT R.stringAttr, S.stringAttr
FROM R, S
WHERE edit_distance(R.stringAttr, S.stringAttr, K)

Main advantage: ease of implementation.
Main disadvantage: the UDF is applied to the entire cross-product of the relations.

15 Our Approach: Approximate String Joins over an Unmodified RDBMS
1. Preprocess the data and generate auxiliary tables
2. Perform the join exploiting standard RDBMS capabilities
Advantages:
- No modification of the underlying RDBMS needed
- Can leverage the RDBMS query optimizer
- Much more efficient than the naive UDF-based approach

16 Intuition and Roadmap
Intuition:
- Similar strings have many common substrings
- Use exact joins to perform approximate joins (current DBMSs are good at exact joins)
- A good candidate set can be verified for false positives [Ukkonen 1992, Sutinen and Tarhio 1996, Ullman 1977]
Roadmap:
- Break strings into substrings of length q (q-grams)
- Perform an exact join on the q-grams
- Find candidate string pairs based on the results
- Check only the candidate pairs with a UDF to obtain the final answer

17 What is a "Q-gram"?
Q-gram: a sequence of q consecutive characters of the (padded) original string.
Example for q=3, "vacations":
{##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
A string of length L has L + q - 1 q-grams.
Similar strings have many common q-grams.
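A small sketch (my own, in Python) of q-gram extraction with '#' and '$' padding, matching the example above:

def qgrams(s: str, q: int = 3) -> list[str]:
    """Return the q-grams of s, padded with q-1 '#' characters at the
    start and q-1 '$' at the end, so a length-L string yields L+q-1 q-grams."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

print(qgrams("vacations"))
# ['##v', '#va', 'vac', 'aca', 'cat', 'ati', 'tio', 'ion', 'ons', 'ns$', 's$$']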

18 Q-grams and Edit Distance Operations
With no edits: L + q - 1 common q-grams.
Replacement destroys at most q q-grams: at least (L + q - 1) - q common q-grams.
  Vacations:  {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
  Vacalions:  {##v, #va, vac, aca, cal, ali, lio, ion, ons, ns$, s$$}
Insertion: at least (Lmax + q - 1) - q common q-grams.
  Vacations:  {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
  Vacatlions: {##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns$, s$$}
Deletion: at least (Lmax + q - 1) - q common q-grams.
  Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
  Vacaions:  {##v, #va, vac, aca, cai, aio, ion, ons, ns$, s$$}

19 Number of Common Q-grams and Edit Distance
For edit distance = K, there can be at most K replacements, insertions, or deletions.
Two strings S1 and S2 with edit distance <= K have at least [max(S1.len, S2.len) + q - 1] - K*q q-grams in common.
Useful filter: eliminate all string pairs without "enough" common q-grams (no false dismissals).
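Combining the two sketches above, a hedged Python version of this count filter could be as follows (it reuses the qgrams and edit_distance sketches from earlier; names are mine):

from collections import Counter

def count_filter_passes(s1: str, s2: str, k: int, q: int = 3) -> bool:
    """Candidate test: strings within edit distance k must share at least
    max(len(s1), len(s2)) + q - 1 - k*q q-grams (counted with multiplicity).
    Passing the filter does not prove a match; failing it rules one out."""
    g1, g2 = Counter(qgrams(s1, q)), Counter(qgrams(s2, q))
    common = sum((g1 & g2).values())  # multiset intersection size
    return common >= max(len(s1), len(s2)) + q - 1 - k * q

pairs = [("vacations", "vacalions"), ("vacations", "relations")]
candidates = [(a, b) for a, b in pairs if count_filter_passes(a, b, k=1)]
# -> only ("vacations", "vacalions") survives the filter (8 common q-grams)
matches = [(a, b) for a, b in candidates if edit_distance(a, b) <= 1]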

20 Using a DBMS for Q-gram Joins
If we have the q-grams in the DBMS, we can perform this counting efficiently.
Create auxiliary tables with tuples of the form (sid, strlen, pos, qgram) and join these tables.
A GROUP BY / HAVING COUNT clause can perform the counting and filtering.

21 Eliminating Candidate Pairs: COUNT FILTERING
SQL for this filter (parts omitted for clarity):

SELECT R.sid, S.sid
FROM R, S
WHERE R.qgram = S.qgram
GROUP BY R.sid, S.sid
HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

The result is the set of string pairs with enough common q-grams to guarantee that we will not have false negatives.

22 Eliminating Candidate Pairs Further: LENGTH FILTERING
Strings with length difference larger than K cannot be within edit distance K.

SELECT R.sid, S.sid
FROM R, S
WHERE R.qgram = S.qgram
  AND abs(R.strlen - S.strlen) <= K
GROUP BY R.sid, S.sid
HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

We refer to this filter as LENGTH FILTERING.

23 Exploiting Q-gram Positions for Filtering
Consider the strings aabbzzaacczz and aacczzaabbzz:
- They are at edit distance 4
- They have identical q-gram sets for q=3
Problem: matching q-grams appear at very different positions in the two strings.
- Either the q-grams do not "originate" from the same q-gram, or
- Too many edit operations "caused" spurious q-grams in different parts of the strings to match

24 POSITION FILTERING: Filtering Using Positions
Keep the position of each q-gram.
Do not match q-grams that are more than K positions apart.

SELECT R.sid, S.sid
FROM R, S
WHERE R.qgram = S.qgram
  AND abs(R.strlen - S.strlen) <= K
  AND abs(R.pos - S.pos) <= K
GROUP BY R.sid, S.sid
HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

We refer to this filter as POSITION FILTERING.

25 The Actual, Complete SQL Statement

SELECT R1.string, S1.string, R1.sid, S1.sid
FROM R1, S1, R, S
WHERE R1.sid = R.sid
  AND S1.sid = S.sid
  AND R.qgram = S.qgram
  AND abs(strlen(R1.string) - strlen(S1.string)) <= K
  AND abs(R.pos - S.pos) <= K
GROUP BY R1.sid, S1.sid, R1.string, S1.string
HAVING COUNT(*) >= (max(strlen(R1.string), strlen(S1.string)) + q - 1) - K*q

26 Summary of the 1st Paper
- Introduced a technique for mapping approximate string joins into a "vanilla" SQL expression
- Our technique does not require modifying the underlying RDBMS

27 Outline
- Supporting string-similarity joins using an RDBMS
- Using mapping techniques
- Interactive deduplication

28 Single-attribute Case
Given:
- two sets of strings, R and S
- a similarity function f between strings (a metric):
  - Reflexive: f(s1, s2) = 0 iff s1 = s2
  - Symmetric: f(s1, s2) = f(s2, s1)
  - Triangle inequality: f(s1, s2) + f(s2, s3) >= f(s1, s3)
- a threshold k
Find: all pairs of strings (r, s) from R and S such that f(r, s) <= k.

29 Nested loop?
Not desirable for large data sets: 5 hours for 30K strings!

30 Our 2-step approach
Step 1: map strings (in a metric space) to objects in a Euclidean space.
Step 2: do a similarity join in the Euclidean space.

31 Advantages
Applicable to many metric similarity functions:
- We use edit distance as the running example
- Other similarity functions were also tried, e.g., q-gram-based similarity
Open to existing algorithms:
- Mapping techniques
- Join techniques

32 Step 1
Map strings into a high-dimensional Euclidean space (metric space -> Euclidean space).

33 Mapping: StringMap
Input: a list of strings.
Output: points in a high-dimensional Euclidean space that preserve the original distances well.
A variation of FastMap:
- Each step greedily picks two strings (pivots) to form an axis
- All axes are orthogonal
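To make the pivot idea concrete, here is a hedged sketch (my own, not the paper's code) of computing one FastMap-style coordinate from two pivot strings, using the standard FastMap projection formula; StringMap additionally keeps later axes orthogonal to earlier ones, which this sketch omits. It reuses the edit_distance sketch from earlier as the metric:

def axis_coordinates(strings, pivot_a, pivot_b, dist=edit_distance):
    """Project each string onto the axis defined by pivots a and b:
        x = (d(o,a)^2 + d(a,b)^2 - d(o,b)^2) / (2 * d(a,b))
    which follows from the law of cosines in the metric space."""
    d_ab = dist(pivot_a, pivot_b)
    if d_ab == 0:
        return [0.0] * len(strings)  # degenerate axis: identical pivots
    coords = []
    for s in strings:
        d_oa, d_ob = dist(s, pivot_a), dist(s, pivot_b)
        coords.append((d_oa ** 2 + d_ab ** 2 - d_ob ** 2) / (2.0 * d_ab))
    return coords

Repeating this for d pivot pairs (with distances adjusted for the already-extracted axes) yields the d-dimensional points joined in step 2.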

34 Can it preserve distances?
Data sources:
- IMDB star names: 54,000
- German names: 132,000
(Figure on the original slide: distribution of string lengths.)

35 Can it preserve distances?
Using data set 1 (54K names) as an example, with k=2 and d=20:
- A new threshold k'=5.2 differentiates similar from dissimilar pairs.
(Figure on the original slide.)

36 Choose Dimensionality d
Increase d?
Good:
- Better differentiation of similar pairs from dissimilar ones
Bad:
- Step 1: efficiency decreases
- Step 2: "curse of dimensionality"

37 Choose dimensionality d using sampling
Sample 1K x 1K strings and find their similar pairs (within distance k).
Calculate the maximum w of their new (Euclidean) distances.
Define the "cost" of finding a similar pair:

Cost = (# of pairs within distance w) / (# of similar pairs)

38 Choose Dimensionality d
(Figure on the original slide: cost versus dimensionality; d = 15 to 25 works well.)

39 Choose new threshold k'
Closely related to the mapping property.
Ideally, if ed(r, s) <= k, then the Euclidean distance between the two corresponding points is <= k'.
Choose k' using sampling:
- Sample 1K x 1K strings and find their similar pairs
- Calculate their maximum new distance as k'
- Repeat multiple times and choose the maximum

40 New threshold k' in step 2
(Figure on the original slide, for d=20.)

41 Step 2: Similarity Join
Input: two sets of points in a Euclidean space.
Output: pairs of points whose distance is less than the new threshold k'.
Many join algorithms can be used.

42 Example
We adopted an algorithm by Hjaltason and Samet:
- Build two R-trees
- Traverse the two trees, finding points whose distance is within k'
- Prune during the traversal (e.g., using MinDist)

43 Final processing
Among the pairs produced by the similarity-join step, check the edit distance.
Return those pairs satisfying the threshold k.

44 Running time
(Figure on the original slide: running-time results.)

45 Recall
Recall = (# of found similar pairs) / (# of all similar pairs)

46 Multi-attribute linkage
Example: title + name + year.
Different attributes have different similarity functions and thresholds.
Consider merge rules in disjunctive format, i.e., a disjunction of conjunctions of per-attribute conditions; an illustrative rule (not from the slide) might be:
(ed(title) <= 2 AND ed(name) <= 1) OR (ed(title) <= 1 AND year matches)

47 Evaluation strategies
There are many ways to evaluate the rules; finding an optimal one is NP-hard.
Heuristics:
- Treat different conjuncts independently; pick the "most efficient" attribute in each conjunct
- Choose the largest threshold for each attribute, then choose the "most efficient" attribute among these thresholds

48 Summary of the 2nd Paper
- A novel two-step approach to record linkage
- Many existing mapping and join algorithms can be adopted
- Applicable to many distance metrics
- Time and space efficient
- The multi-attribute case was also studied

49 Outline
- Supporting string-similarity joins using an RDBMS
- Using mapping techniques
- Interactive deduplication

50 Problems with Existing Deduplication Methods
Matching functions:
- Calculate similarity scores against thresholds
- Tedious coding
Learning-based methods:
- Require a large training set for accuracy (static training set)
- It is difficult to provide a covering and challenging training set that brings out the subtlety of the deduplication function

51 New Approach
Relegate the task of finding the deduplication function to a machine learning algorithm.
Design goals:
- Fewer training instances
- Interactive response
- Fast convergence
- High accuracy
We design an interactive deduplication system called ALIAS (Active Learning led Interactive Alias Suppression).

52 ALIAS System
- A learning-based method
- Exploits existing similarity functions
- Uses active learning: an active learner picks the unlabeled instances with the most information gain for the training set
- Produces a deduplication function that can identify duplicates

53 Overall Architecture
(Diagram on the original slide: similarity functions F, labeled pairs L, and the unlabeled database D feed a mapper producing Lp and Dp; the learner trains classifiers on training data T, selects n instances for labeling, and incorporates user feedback; similarity indices and a predicate for the uncertain region support the loop.)

54 Primary Inputs for ALIAS
A set of initial training pairs (L):
- Fewer than 10 labeled records
- Arranged in pairs of duplicates and non-duplicates
A set of similarity functions (F):
- E.g., word-match, qgram-match, ...
- Compute similarity scores between two records based on any subset of attributes
- The learner will find the right way of combining those scores into the final deduplication function
A database of unlabeled records (D).
The number of classifiers (< 5).

55 Mapped Labeled Instances
- Take a pair r1, r2 from the input L, where r1 = (a1, a2, a3) and r2 = (a1, a2, a3)
- Use similarity functions f1, f2, ..., fn to compute similarity scores between r1 and r2
- Form a new record r1&r2 = (s1, s2, ..., sn, y/n), where y = duplicate and n = non-duplicate
- Put the new record in Lp

56 Mapped Unlabeled Instances
- Take a pair r1, r2 from D x D
- Use similarity functions f1, f2, ..., fn to compute similarity scores between r1 and r2
- Form a new record r1&r2 = (s1, s2, ..., sn), with no y/n field
- Put the new record in Dp
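A minimal sketch of this mapping step in Python (function and field names are mine, not ALIAS's API):

from typing import Callable, Optional

def map_pair(r1: dict, r2: dict,
             sim_fns: list[Callable[[dict, dict], float]],
             label: Optional[bool] = None) -> tuple:
    """Turn a record pair into a similarity vector (s1, ..., sn), with an
    optional duplicate/non-duplicate label for training pairs (Lp vs Dp)."""
    scores = tuple(f(r1, r2) for f in sim_fns)
    return scores if label is None else scores + (label,)

# Example similarity function: Jaccard word overlap on a 'name' field.
def word_match(r1: dict, r2: dict) -> float:
    w1 = set(r1["name"].lower().split())
    w2 = set(r2["name"].lower().split())
    return len(w1 & w2) / max(len(w1 | w2), 1)

lp_record = map_pair({"name": "Tom Hanks"}, {"name": "Ton Hanks"},
                     [word_match], label=True)   # -> (0.333..., True)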

57 Active Learner
(Diagram on the original slide: the learner repeatedly trains classifiers on the training data T and selects a set S of n instances from Dp for the user to label.)

58 ALIAS Algorithm
1. Input: L, D, F
2. Create pairs Lp from the labeled data L and F
3. Create pairs Dp from the unlabeled data D and F
4. Initialize the training set T = Lp
5. Loop until user satisfaction:
   - Train a classifier C using T
   - Use C to select a set S of n instances from Dp for labeling
   - If S is empty, exit the loop
   - Collect user feedback on the labels of S
   - Add S to T and remove S from Dp
6. Output the classifier C
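As an illustration only, the loop might be sketched with scikit-learn as below; single-classifier uncertainty (predicted probability near 0.5) stands in for the committee disagreement described later, the paper also weighs representativeness, and all names here are mine:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def alias_loop(Lp_X, Lp_y, Dp_X, oracle, n=5, rounds=10):
    """Active-learning loop: train, pick the n most uncertain unlabeled
    pairs, ask the user (oracle) for labels, and retrain."""
    X, y = np.asarray(Lp_X, float), np.asarray(Lp_y, int)
    pool = np.asarray(Dp_X, float)
    clf = DecisionTreeClassifier()
    for _ in range(rounds):
        if len(pool) == 0:
            break
        clf.fit(X, y)
        # Uncertainty: duplicate probability closest to 0.5.
        if len(clf.classes_) > 1:
            proba = clf.predict_proba(pool)[:, 1]
        else:
            proba = np.zeros(len(pool))
        picks = np.argsort(np.abs(proba - 0.5))[:n]
        labels = [oracle(pool[i]) for i in picks]  # user feedback on S
        X = np.vstack([X, pool[picks]])
        y = np.concatenate([y, labels])
        pool = np.delete(pool, picks, axis=0)     # remove S from Dp
    return clf.fit(X, y)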

59 The Indexing Component
Purpose: avoid mapping all pairs of records in D x D.
Three methods:
- Grouping
- Sampling
- Indexing

60 The Indexing Component
Grouping:
- Example: group records in D according to the year-of-publication field
- Mapped pairs are formed only between records within a group
Sampling:
- Sample in units of a group instead of individual records

61 The Indexing Component
Indexing:
- Given a similarity function such as "fraction of common words between two text attributes >= 0.4",
- we can create an index on the words of the text attributes.

62 The Learning Component
- Contains a number of classifiers
- A classifier is a machine learning algorithm, such as a decision tree (D-tree), naive Bayes (NB), or Support Vector Machine (SVM), used to classify instances
- A classifier is trained using a training data set

63 Criteria for a classifier
- Accuracy
- Interpretability
- Indexability
- Efficient training

64 Accuracy of a Classifier
Measured by the harmonic mean F of recall r and precision p, F = 2rp / (r + p):
- r = fraction of duplicates correctly classified
- p = fraction correct among all instances actually labeled as duplicates

65 Accuracy (cont.)
Examples (1% of pairs are duplicates):
- A classifier that labels all pairs as non-duplicates: recall r = 0, so F = 0 (accuracy 0%)
- A classifier that identifies all duplicates correctly but misclassifies 1% of the non-duplicates: r = 1, p = 0.5, so F = 0.667 (accuracy 66.7%)
If we did not use r and p, both classifiers would have about 99% plain accuracy!

66 Criteria for a classifier (cont.)
Interpretability:
- The final deduplication rule is easy to understand and interpret
Indexability:
- The final deduplication rule has indexable predicates
Efficient training:
- Fast to train

67 Comparison of Different Classifiers
(Table on the original slide comparing the classifiers on the above criteria.)

68 Active Learning
The goal is to seek out from the unlabeled pool the instances which, when labeled, will strengthen the classifier at the fastest possible rate.

69 A Simple Example
- Assume all points on a line are unlabeled except a and b
- Suppose a (at coordinate 0) and b (at coordinate 1) have different labels
- Any unlabeled point x to the left of a or to the right of b has no effect in reducing the region of uncertainty
- Including the midpoint m (the most uncertain instance) in the training set reduces the size of the uncertain region by half

70 How to select an unlabeled instance
Uncertainty:
- The instance about which the learner is most unsure is also the instance for which the expected reduction in confusion is the largest
- Uncertainty score: the disagreement among the predictions the instance gets from a committee of N classifiers
- A sure duplicate or non-duplicate would get the same prediction from all members
Representativeness:
- An uncertain instance that represents a larger number of unlabeled instances has a greater impact on the classifier

71 Example
Three similarity functions: word match f1, qgram match f2, string edit distance f3.
Take mapped unlabeled instances from Dp, e.g., r1&r2 = (s1, s2, ..., sn) and r3&r4 = (s1, s2, ..., sn), where s1, s2, ..., sn are scores computed with f1, f2, f3.
Three classifiers: D-tree, Naive Bayes, SVM. Their predictions:

  Pair    D-tree     Naive Bayes    SVM
  r1&r2   duplicate  non-duplicate  duplicate
  r3&r4   duplicate

The pair with committee disagreement (r1&r2) is selected for labeling.
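A hedged sketch of committee-based uncertainty scoring (my own naming; vote entropy is one common disagreement measure, not necessarily the paper's exact formula):

import math

def uncertainty(votes: list[bool]) -> float:
    """Disagreement among a committee's duplicate/non-duplicate votes,
    measured as the entropy of the vote fractions: 0 when all members
    agree, 1 when the committee is evenly split."""
    p = sum(votes) / len(votes)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# r1&r2: duplicate / non-duplicate / duplicate -> high uncertainty, selected
print(uncertainty([True, False, True]))   # ~0.918
# a pair on which all members agree -> zero uncertainty, not selected
print(uncertainty([True, True, True]))    # 0.0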

72 How to Combine Uncertainty and Representativeness
Two proposed approaches.
1st approach: weighted sum
- Cluster the unlabeled instances
- Estimate the density of points around each instance
- Score instances using a weighted sum of density and uncertainty value
- Select the n highest-scoring instances
2nd approach: sampling

73 Conclusion
ALIAS:
- Makes deduplication much easier (fewer training instances)
- Provides interactive response to the user
- High accuracy

