Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach. AnHai Doan, Pedro Domingos, Alon Halevy
Data Integration
Problem & Solution. Problem: Large-scale Data Integration Systems; Bottleneck: Semantic Mappings; 1-1 Mappings. Solution: Multi-strategy Learning; Integrity Constraints; XML Structure Learner
Learning Source Descriptions (LSD). Components: Base learners; Meta-learner; Prediction converter; Constraint handler. Operations: Training phase; Matching phase
Learners. Base Learners: Name Matcher (Whirl); Content Matcher (Whirl); Naïve Bayes Learner; County-Name Recognizer; XML Learner. Meta-Learner (Stacking)
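The role of the stacking meta-learner can be illustrated with a minimal Python sketch. It assumes that each base learner emits a confidence score per mediated-schema label and that the meta-learner combines them as a weighted sum with one weight per (learner, label) pair. The function names, learner names, scores, and weights below are hypothetical illustrations, not values from the slides.

```python
from typing import Dict

Scores = Dict[str, float]  # label -> confidence score


def combine_predictions(base_predictions: Dict[str, Scores],
                        weights: Dict[str, Dict[str, float]]) -> Scores:
    """Combine base-learner scores as a weighted sum per label.

    Assumption: weights[learner][label] was learned beforehand (for
    example by stacking); a missing weight is treated as 0.
    """
    combined: Scores = {}
    for learner, scores in base_predictions.items():
        for label, score in scores.items():
            weight = weights.get(learner, {}).get(label, 0.0)
            combined[label] = combined.get(label, 0.0) + weight * score
    total = sum(combined.values())
    if total > 0:
        # Normalize so the combined scores sum to 1.
        combined = {label: s / total for label, s in combined.items()}
    return combined


def best_label(scores: Scores) -> str:
    """Return the label with the highest combined score."""
    return max(scores, key=scores.get)
```

With equal weights this reduces to averaging the base learners; unequal weights let a learner that is reliable for a particular label dominate the prediction for that label.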
XML Learner
XML Learner (Cont.)
Constraint Handler Domain Constraints
Constraint Handler (Cont.) Search Heuristic Mapping Cost
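To make the idea of a mapping cost concrete, here is a small Python sketch. It assumes the cost of a candidate mapping adds a likelihood term (negative log of the combined learner scores) to a fixed penalty per violated domain constraint. The slide names a search heuristic; this sketch swaps in exhaustive enumeration for simplicity, which is only practical for a handful of tags. All names, the penalty value, and the example constraint are hypothetical.

```python
import math
from itertools import product
from typing import Callable, Dict, List

Mapping = Dict[str, str]                # source tag -> mediated label
Constraint = Callable[[Mapping], bool]  # True when the mapping satisfies it


def mapping_cost(mapping: Mapping,
                 scores: Dict[str, Dict[str, float]],
                 constraints: List[Constraint],
                 penalty: float = 10.0) -> float:
    """Negative log-likelihood of the mapping plus a penalty per violation."""
    likelihood_cost = -sum(
        math.log(max(scores[tag].get(label, 0.0), 1e-9))
        for tag, label in mapping.items()
    )
    violations = sum(1 for satisfied in constraints if not satisfied(mapping))
    return likelihood_cost + penalty * violations


def best_mapping(scores: Dict[str, Dict[str, float]],
                 labels: List[str],
                 constraints: List[Constraint]) -> Mapping:
    """Enumerate every assignment of labels to tags and keep the cheapest."""
    tags = list(scores)
    candidates = (dict(zip(tags, combo))
                  for combo in product(labels, repeat=len(tags)))
    return min(candidates, key=lambda m: mapping_cost(m, scores, constraints))


def at_most_one_address(mapping: Mapping) -> bool:
    """Hypothetical domain constraint: at most one tag maps to ADDRESS."""
    return sum(1 for label in mapping.values() if label == "ADDRESS") <= 1
```

Without the constraint both tags in a two-tag source could be assigned ADDRESS if that label scores highest for each; with it, the lower-confidence tag is pushed to its next-best label.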
Training Phase
Example 1 (Training Phase)
Example 1 (Cont.)
(“location”, ADDRESS), (“Miami, FL”, ADDRESS)
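The two training pairs above suggest how examples might be derived from a manually mapped source: one pair per tag name and one pair per data value. The Python sketch below shows that derivation under the assumption that the manual mapping is a dictionary from source tag to mediated-schema label and that each data instance is a dictionary from tag to value. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (input text, mediated-schema label)


def build_training_examples(
    mapping: Dict[str, str],
    instances: List[Dict[str, str]],
) -> Tuple[List[Example], List[Example]]:
    """Derive name-based and content-based training pairs.

    mapping:   manually specified source tag -> mediated label
    instances: extracted records, each a dict of source tag -> value
    """
    # One (tag name, label) pair per mapped tag, for a name-based learner.
    name_examples = [(tag, label) for tag, label in mapping.items()]
    # One (value, label) pair per data value, for content-based learners.
    content_examples = [
        (value, mapping[tag])
        for record in instances
        for tag, value in record.items()
        if tag in mapping  # skip tags with no manual mapping
    ]
    return name_examples, content_examples
```

Calling it with the mapping `{"location": "ADDRESS"}` and the single record `{"location": "Miami, FL"}` yields the two pairs shown on the slide.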
Matching Phase
Example 2 (Matching Phase)
Example 2 (Cont.)
Empirical Evaluation
Measures: Matching accuracy of a source; Average matching accuracy of a source; Average matching accuracy of a domain
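The three measures can be sketched in Python as follows. The definitions used here are assumptions, since the slide lists only the names: the accuracy of a source is taken as the fraction of its matchable tags that receive the correct label, the average accuracy of a source as the mean over repeated runs, and the average accuracy of a domain as the mean over that domain's sources. Function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def source_accuracy(predicted: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of matchable tags (those present in gold) labeled correctly."""
    if not gold:
        return 0.0
    correct = sum(1 for tag, label in gold.items()
                  if predicted.get(tag) == label)
    return correct / len(gold)


def average_source_accuracy(run_accuracies: List[float]) -> float:
    """Mean accuracy of one source over several experimental runs."""
    return mean(run_accuracies)


def domain_accuracy(per_source_averages: Dict[str, float]) -> float:
    """Mean of the per-source average accuracies within one domain."""
    return mean(list(per_source_averages.values()))
```

Tags that appear only in `predicted` are ignored, so a source is scored only on the tags that have a known correct match.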
Experiment Results
Experiment Results (Cont.): Contributions of base learners and the constraint handler
Experiment Results (Cont.): Contributions of schema information and data instances
Experiment Results (Cont.): Performance sensitivity to the amount of data instances
Limitations: Enough Training Data; Domain-Dependent Learners; Ambiguities in Sources; Efficiency; Overlapping of Schemas
Conclusion and Future Work: Improves over time; Extensible framework; Multiple types of knowledge; Non 1-1 mappings?