Presentation on theme: "Text Joins for Data Cleansing and Integration in an RDBMS Luis Gravano Panagiotis G. Ipeirotis Nick Koudas."— Presentation transcript:
Text Joins for Data Cleansing and Integration in an RDBMS
Luis Gravano, Panagiotis G. Ipeirotis, Nick Koudas, Divesh Srivastava
http://www.cs.columbia.edu/~pirot/DataCleaning

Service A: EUROAFT CORP, HATRONIC INC, …
Service B: HATRONIC CORP, EUROAFT INC, EUROAFT CORP, …

Matching Text Attributes

Text matching is an important component of data cleaning systems, and relies on a good distance metric to capture entity matches. Cosine similarity gives high similarity to tuple pairs that share many infrequent tokens, and low similarity to pairs that share only a few, common tokens (WHIRL: [Cohen, SIGMOD98]).

    Similarity(t1, t2) = Σ over tokens of weight(token, t1) * weight(token, t2)

EUROAFT CORP vs. EUROAFT INC: the pair shares the infrequent token EUROAFT (high weight).
EUROAFT CORP vs. HATRONIC CORP: the pair shares only the common token CORP (low weight).

Different token choices capture different types of mismatches:
Words: Handle insertions and deletions of common words and variations of word order; cannot handle spelling errors.
Q-grams: Handle spelling errors in addition to insertions and deletions of common words and variations of word order.

Preprocessing Step

R1
    tid  Name
    1    EUROAFT CORP
    2    HATRONIC INC
    …

R1Weights
    tid  token     w
    1    EUROAFT   0.98
    1    CORP      0.02
    2    HATRONIC  0.98
    2    INC       0.01
    …

R2
    tid  Name
    1    HATRONIC CORP
    2    EUROAFT INC
    3    EUROAFT CORP
    …

R2Weights
    tid  token     w
    1    HATRONIC  0.98
    1    CORP      0.02
    2    EUROAFT   0.95
    2    INC       0.05
    3    EUROAFT   0.92
    3    CORP      0.07
    …

Text Joins: A Baseline

    SELECT   r1w.tid AS tid1, r2w.tid AS tid2
    FROM     R1Weights r1w, R2Weights r2w
    WHERE    r1w.token = r2w.token
    GROUP BY r1w.tid, r2w.tid
    HAVING   SUM(r1w.weight*r2w.weight) ≥ φ

This join calculates the similarity of all pairs of tuples and filters out all tuple pairs with similarity lower than a given threshold φ. We are interested in high values for threshold φ. Using the baseline, most of the candidate tuple pairs do not make it to the final result of the text join.

    R1.Name       R2.Name        Similarity
    EUROAFT CORP  EUROAFT INC    0.98
    EUROAFT CORP  EUROAFT CORP   1.00
    EUROAFT CORP  HATRONIC CORP  0.01
    HATRONIC INC  HATRONIC CORP  0.98
    HATRONIC INC  EUROAFT INC    0.02
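To make the baseline join concrete, here is a minimal sketch that runs the query above on the example weight tables using Python's built-in sqlite3 module. The choice of SQLite, the function name baseline_text_join, the extra similarity output column, and the threshold 0.5 are assumptions made for illustration; they are not part of the original slides. The scores follow from the rounded example weights and need not match the similarity column shown above.

```python
import sqlite3

# Token weights copied from the example tables: (tid, token, weight).
R1_WEIGHTS = [(1, "EUROAFT", 0.98), (1, "CORP", 0.02),
              (2, "HATRONIC", 0.98), (2, "INC", 0.01)]
R2_WEIGHTS = [(1, "HATRONIC", 0.98), (1, "CORP", 0.02),
              (2, "EUROAFT", 0.95), (2, "INC", 0.05),
              (3, "EUROAFT", 0.92), (3, "CORP", 0.07)]

# Same shape as the baseline query; the similarity column and ORDER BY
# are added here only so the result is easy to inspect.
BASELINE_JOIN = """
SELECT r1w.tid AS tid1, r2w.tid AS tid2,
       SUM(r1w.weight * r2w.weight) AS similarity
FROM R1Weights r1w, R2Weights r2w
WHERE r1w.token = r2w.token
GROUP BY r1w.tid, r2w.tid
HAVING SUM(r1w.weight * r2w.weight) >= :phi
ORDER BY tid1, tid2
"""


def baseline_text_join(phi):
    """Return (tid1, tid2, similarity) for pairs with similarity >= phi."""
    conn = sqlite3.connect(":memory:")
    for name, rows in (("R1Weights", R1_WEIGHTS), ("R2Weights", R2_WEIGHTS)):
        conn.execute(f"CREATE TABLE {name} (tid INTEGER, token TEXT, weight REAL)")
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)
    result = conn.execute(BASELINE_JOIN, {"phi": phi}).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for tid1, tid2, sim in baseline_text_join(phi=0.5):
        print(tid1, tid2, round(sim, 3))
```

With a threshold of 0, every pair that shares at least one token comes back, which illustrates why the baseline evaluates many candidate pairs that a high threshold later discards.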
Sampling for Text Joins

Similarity is a sum of products, and a product cannot be high when one of its weights is small, so low weights can (safely) be dropped from RiWeights. Weighted sampling [Cohen & Lewis, SODA97] gives good weight approximations and eliminates low-weight tokens.

RiWeights
    tid  token     w
    1    HATRONIC  0.98
    1    CORP      0.02
    2    EUROAFT   0.95
    2    INC       0.05
    3    EUROAFT   0.92
    3    CORP      0.07
    …

RiSample (after sampling 20 times)
    tid  token     c
    1    HATRONIC  20  (20/20 = 1.00)
    2    EUROAFT   19  (19/20 = 0.95)
    3    EUROAFT   18  (18/20 = 0.90)

    INSERT INTO RiSample(tid, token, c)
    SELECT rw.tid, rw.token, ROUND(S*rw.weight/rs.total, 0) AS c
    FROM   RiWeights rw, RiSum rs
    WHERE  rw.token = rs.token AND
           ROUND(S*rw.weight/rs.total, 0) > 0

    SELECT   r1w.tid AS tid1, r2s.tid AS tid2
    FROM     R1Weights r1w, R2Sample r2s, R2Sum r2sum
    WHERE    r1w.token = r2s.token AND
             r1w.token = r2sum.token
    GROUP BY r1w.tid, r2s.tid
    HAVING   SUM(r1w.weight*r2sum.total*r2s.c) ≥ S*φ

Example: R1Weights joined with R2Sample.

R1Weights
    tid  token     w
    1    EUROAFT   0.98
    1    CORP      0.02
    2    HATRONIC  0.98
    2    INC       0.01
    …

R2Sample (shown as approximate weights)
    tid  token     w
    1    HATRONIC  1.00
    1    CORP      0.00
    2    EUROAFT   0.95
    2    INC       0.00
    3    EUROAFT   0.90
    3    CORP      0.00

Result
    R1            R2             Similarity
    EUROAFT CORP  EUROAFT INC    0.98
    EUROAFT CORP  EUROAFT CORP   0.9
    HATRONIC INC  HATRONIC CORP  0.98

RiWeights stores the weights for the tokens. A tuple ⟨tid, token, w⟩ indicates that token has normalized weight w in the Ri tuple tid. RiWeights can be computed entirely in SQL.

Sampling-Based Text Joins

Consider far fewer tuple pairs, speeding up join execution.
Approximate the real tuple-pair similarities well.
Leverage the scalability of the RDBMS.
Do not require moving data in and out of the RDBMS.

Experimental evaluation and more details in the upcoming paper:
Text Joins in an RDBMS for Web Data Integration. L. Gravano, P. Ipeirotis, N. Koudas, and D. Srivastava. Proceedings of the 12th International World-Wide Web Conference (WWW2003), 2003.
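The two sampling queries above can be tried end to end with a small sketch, again using Python's built-in sqlite3 module on the example weights. Several things here are assumptions rather than material from the slides: the slides do not define RiSum, so this sketch takes it to hold the total weight of each token across all tuples of Ri; the function name sampled_text_join, the estimate output column, and the values S = 20 and φ = 0.5 are illustrative only. Under that RiSum assumption the counts c need not match the illustration in the RiSample table above.

```python
import sqlite3

# Token weights copied from the example tables: (tid, token, weight).
R1_WEIGHTS = [(1, "EUROAFT", 0.98), (1, "CORP", 0.02),
              (2, "HATRONIC", 0.98), (2, "INC", 0.01)]
R2_WEIGHTS = [(1, "HATRONIC", 0.98), (1, "CORP", 0.02),
              (2, "EUROAFT", 0.95), (2, "INC", 0.05),
              (3, "EUROAFT", 0.92), (3, "CORP", 0.07)]


def sampled_text_join(phi, S=20):
    """Return (R2Sample rows, joined pairs) for threshold phi and sample size S."""
    conn = sqlite3.connect(":memory:")
    for name, rows in (("R1Weights", R1_WEIGHTS), ("R2Weights", R2_WEIGHTS)):
        conn.execute(f"CREATE TABLE {name} (tid INTEGER, token TEXT, weight REAL)")
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)

    # ASSUMPTION: R2Sum(token, total) is the summed weight of each token in R2.
    conn.execute("""CREATE TABLE R2Sum AS
                    SELECT token, SUM(weight) AS total
                    FROM R2Weights GROUP BY token""")

    # Sample creation, following the INSERT statement from the slides.
    conn.execute("CREATE TABLE R2Sample (tid INTEGER, token TEXT, c INTEGER)")
    conn.execute("""INSERT INTO R2Sample(tid, token, c)
                    SELECT rw.tid, rw.token,
                           ROUND(:S * rw.weight / rs.total, 0) AS c
                    FROM R2Weights rw, R2Sum rs
                    WHERE rw.token = rs.token
                      AND ROUND(:S * rw.weight / rs.total, 0) > 0""", {"S": S})
    sample = conn.execute(
        "SELECT tid, token, c FROM R2Sample ORDER BY tid, token").fetchall()

    # Sampling-based join; 'estimate' is added only for inspection.
    pairs = conn.execute("""
        SELECT r1w.tid AS tid1, r2s.tid AS tid2,
               SUM(r1w.weight * r2sum.total * r2s.c) / :S AS estimate
        FROM R1Weights r1w, R2Sample r2s, R2Sum r2sum
        WHERE r1w.token = r2s.token AND r1w.token = r2sum.token
        GROUP BY r1w.tid, r2s.tid
        HAVING SUM(r1w.weight * r2sum.total * r2s.c) >= :S * :phi
        ORDER BY tid1, tid2""", {"S": S, "phi": phi}).fetchall()
    conn.close()
    return sample, pairs


if __name__ == "__main__":
    sample, pairs = sampled_text_join(phi=0.5)
    print(sample)
    print(pairs)
```

Because each count c is roughly S times the token's share of its total weight, the product total * c / S approximates the original weight, which is why the HAVING clause compares against S*φ. On data this small no count rounds to zero; a token is pruned only when its weight in a tuple is a small fraction of that token's total, which happens when the token occurs in many tuples.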