Modeling and Querying Possible Repairs in Duplicate Detection George Beskales Mohamed A. Soliman Ihab F. Ilyas Shai Ben-David
Data Cleaning Real-world data is dirty Examples: Syntactical Errors: e.g., Micrsoft Heterogeneous Formats: e.g., Phone number formats Missing Values (Incomplete data) Violation of Integrity Constraints: e.g., FDs/INDs Duplicate Records Data Cleaning is the process of improving data quality by removing errors, inconsistencies and anomalies of data VLDB 2009George Beskales2
Duplicate Elimination Process IDnameZIPIncome P1Green k P2Green k P3Peter k P4Peter k P5Gree k P6Chuck k VLDB 2009George Beskales3 IDnameZIPIncome C1Green k C2Peter k C3Chuck k Compute Pairwise Similarity P1P2 P3 P4 P5 P Cluster Records P1P2 P3P4 P5 P6Merge Clusters C1 C3 C2 Unclean Relation Clean Relation
Probabilistic Data Cleaning VLDB 2009George Beskales4 (a) One-Shot Cleaning Unclean Database Deterministic Database Deterministic Duplicate Elimination RDBMS QueriesResults Single Clean Instance Queries (b) Probabilistic Cleaning Unclean Database Uncertain Database Probabilistic Duplicate Elimination Uncertainty and Cleaning-aware RDBMS Probabilistic Results Single Clean Instance Single Clean Instance Multiple Possible Clean Instances
Motivation Probabilistic Data Cleaning: 1.Avoid deterministic resolution of conflicts during data cleaning 2.Enrich query results by considering all possible cleaning instances 3.Allows specifying query-time cleaning requirements VLDB 2009George Beskales5
VLDB 2009George Beskales6 Probabilistic Duplicate Elimination Probabilistic Duplicate Elimination Single Clean Instance Single Clean Instance Multiple Possible Clean Instances Uncertain Database Uncertain Database Single Clean Instance Single Clean Instance Multiple Possible Clean Instances Queries Unclean Database Uncertainty and Cleaning-aware RDBMS Probabilistic Results Queries Uncertainty and Cleaning-aware RDBMS Probabilistic Results Outline Generating and Storing the Possible Clean Instances Querying the Clean Instances Experimental Evaluation Generating and Storing the Possible Clean Instances Querying the Clean Instances Experimental Evaluation Generating and Storing the Possible Clean Instances Querying the Clean Instances Experimental Evaluation
Uncertainty in Duplicate Detection VLDB 2009George Beskales7 Person Uncertain Clustering X1X1 {P1} {P2} {P3,P4} {P5} {P6} X2X2 {P1,P2} {P3,P4} {P5} {P6} X3X3 {P1,P2,P5} {P3,P4} {P6} Possible Repairs Uncertain Duplicates: Determining what records should be clustered is uncertain due to noisy similarity measurements A possible repair of a relation is a clustering (partitioning) of the unclean relation IDNameZIPIncome P1Green k P2Green k P3Peter k P4Peter k P5Gree k P6Chuck k
Challenges 1.The space of all possible clusterings (repairs) is exponentially large 2.How to efficiently generate, store and query the possible repairs? VLDB 2009George Beskales8
The Space of Possible Repairs VLDB 2009George Beskales9 All possible repairs 1 Possible repairs given by any fixed parameterized clustering algorithm 2 Possible repairs given by any fixed hierarchical clustering algorithm Link-based algorithms Hierarchical cut clustering algorithm … 3
Hierarchical Clustering Algorithms Records are clustered in a form of a hierarchy: all singletons are at leaves, and one cluster is at root Example: Linkage-based Clustering Algorithm VLDB 2009George Beskales10 r1r1 r2r2 r4r4 r5r5 r3r3 Pair-wise Distance Distance Threshold ( ) {r 1,r 2 },{r 3,r 4,r 5 }
Uncertain Hierarchical Clustering Hierarchical clustering algorithms can be modified to accept uncertain parameters The number of generated possible repairs is linear VLDB 2009George Beskales11 r1r1 r2r2 r4r4 r5r5 r3r3 Possible Thresholds [ l, u ] {r 1 },{r 2 },{r 3 },{r 4 },{r 5 } {r 1 },{r 2 },{r 3 },{r 4,r 5 } {r 1,r 2 },{r 3 },{r 4,r 5 } {r 1,r 2 },{r 3,r 4,r 5 }
Probabilities of Repairs VLDB 2009George Beskales12 {P1,P2} {P3,P4} {P5} {P6} {P1,P2,P5} {P3,P4} {P6} {P1} {P2} {P3,P4} {P5} {P6} Clustering 1Clustering 2Clustering 3 1 3 Pr = 10 Pr = 1 Pr = 0.1 IDNameZIPIncome P1Green k P2Green k P3Peter k P4Peter k P5Gree k P6Chuck k ~U(0,10)
Representing the Possible Repairs VLDB 2009George Beskales13 ID…IncomeCP CP1…31k{P1,P2}[1,3) CP2…40k{P3,P4}[0,10) CP3…55k{P5}[0,3) CP4…30k{P6}[0,10) CP5…39k{P1,P2,P5}[3,10) CP6…30k{P1}[0,1) CP7…32k{P2}[0,1) U-clean Relation Person C {P1,P2} {P3,P4} {P5} {P6} {P1,P2,P5} {P3,P4} {P6} {P1} {P2} {P3,P4} {P5} {P6} Clustering 1Clustering 2Clustering 3 1 3 Pr = 10 Pr = 1 Pr = 0.1
VLDB 2009George Beskales14 Probabilistic Duplicate Elimination Single Clean Instance Single Clean Instance Multiple Possible Clean Instances Uncertain Database Queries Unclean Database Uncertainty and Cleaning-aware RDBMS Probabilistic Results Queries Uncertainty and Cleaning-aware RDBMS Probabilistic Results Outline Generating and Storing the Possible Clean Instances Querying the Clean Instances Experimental Evaluation Generating and Storing the Possible Clean Instances Querying the Clean Instances Experimental Evaluation Generating and Storing the Possible Clean Instances Querying the Clean Instances Experimental Evaluation
Queries over U-Clean Relations We adopt the possible worlds semantics to define queries on U-clean relations VLDB 2009George Beskales15 U-Clean Relation(s) R C Possible Clean Instances X 1,X 2,…,X n Possible Clean Instances Q(X 1 ), Q(X 2 ),…,Q(X n ) U-Clean Relation Q(R C ) Definition Implementation Instances Representation
Example: Selection Query VLDB 2009George Beskales16 SELECT ID, Income FROM Person c WHERE Income>35k IDIncomeCP CP240k{P3,P4}[0,10) CP355k{P5}[0,3) CP539k{P1,P2,P5}[3,10) ID… Income CP CP1…31k{P1,P2}[1,3) CP2…40k{P3,P4}[0,10) CP3…55k{P5}[0,3) CP4…30k{P6}[0,10) CP5…39k{P1,P2,P5}[3,10) CP6…30k{P1}[0,1) CP7…32k{P2}[0,1) Person C
Example: Projection Query VLDB 2009George Beskales17 IncomeCP 30k {P1} v {P6}[0,1) v [3,10) 31k {P1,P2}[1,3) 32k{P2}[0,1) 40k {P3,P4} v {P1,P2,P5} [0,1) v [3,10) 55k {P5}[0,3) SELECT DISTINCT Income FROM Person c ID… Income CP CP1…31k{P1,P2}[1,3) CP2…40k{P3,P4}[0,1) CP3…55k{P5}[0,3) CP4…30k{P6}[3,10) CP5…40k{P1,P2,P5}[3,10) CP6…30k{P1}[0,1) CP7…32k{P2}[0,1) Person C
Example: Join Query VLDB 2009George Beskales18 SELECT Income, Price FROM Person c, Vehicle c WHERE Income/10 >= Price IncomePriceCP 40k4k {P3,P4} ^ {V4} 1 : [0, 10) ^ 2 : [3,5)... ID…IncomeCP CP1…31k{P1,P2} 1 :[1,3) CP2…40k{P3,P4} 1 :[0,10) …………… IDPriceCP CV54k{V4} 2 : [3,5) CV66k{V3,V4} 2 : [5,10) ……. Person c Vehicle c
Aggregation Queries VLDB 2009George Beskales19 CP2 CP3 CP4 CP6 CP7 CP1 CP2 CP3 CP4 CP2 CP4 CP5 Pr = 0.1 Pr = 0.2 Pr = 0.7 X1X1 X2X2 X3X3 SELECT Sum(Income) FROM Person c Sum=187k Sum=156k Sum=109k ID… Income CP CP1…31k{P1,P2}[1,3) CP2…40k{P3,P4}[0,10) CP3…55k{P5}[0,3) CP4…30k{P6}[0,10) CP5…39k{P1,P2,P5}[3,10) CP6…30k{P1}[0,1) CP7…32k{P2}[0,1) Person c
Other Meta-Queries 1.Obtaining the most probable clean instance 2.Obtaining the -certain clusters 3.Obtaining a clean instance corresponding to a specific parameters of the clustering algorithms 4.Obtaining the probability of clustering a set of records together VLDB 2009George Beskales20
Outline VLDB 2009George Beskales21 Generating and Storing the Possible Clean Instances Querying the Clean Instances Experimental Evaluation
Experimental Evaluation A prototype as an extension of PostgreSQL Synthetic data generator provided by Febrl (freely extensible biomedical record linkage) Two hierarchical algorithms: Single-Linkage (S.L.) Nearest-neighbor based clustering algorithm (N.N.) [Chaudhuri et al., ICDE’05] VLDB 2009George Beskales22
Experimental Evaluation VLDB 2009George Beskales23
Experimental Evaluation VLDB 2009George Beskales24
Experimental Evaluation VLDB 2009George Beskales25
Conclusion We allow representing and querying multiple possible clean instances We modified hierarchical clustering algorithms to allow generating multiple repairs We compactly store the possible repairs by keeping the lineage information of clusters in special attributes New (probabilistic) query types can be issued against the population of possible repairs VLDB 2009George Beskales26