AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.


1 AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project

2 Data Integration
Find houses with four bathrooms priced under $500,000
[Figure labels: mediated schema; wrapper; sources homes.com, realestate.com, homeseekers.com; source schema 1, source schema 2, source schema 3]

3 Semantic Mappings between Schemas
Mediated & source schemas = XML DTDs
[Figure labels, in slide order: house, location, contact, house, address, name, phone, num-baths, full-baths, half-baths, contact-info, agent-name, agent-phone. Legend: 1-1 mapping, non 1-1 mapping]

4 Current State of Affairs
Finding semantic mappings is now the bottleneck!
– largely done by hand
– labor intensive & error prone
Will only be exacerbated
– data sharing & XML become pervasive
– proliferation of DTDs
– translation of legacy data
– reconciling ontologies on the semantic web
Need (semi-)automatic approaches to scale up!

5 The LSD (Learning Source Descriptions) Approach
Suppose user wants to integrate 100 data sources
1. User
– manually creates mappings for a few sources, say 3
– shows LSD these mappings
2. LSD learns from the mappings
3. LSD proposes mappings for remaining 97 sources

6 Example
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data
– location: Miami, FL; Boston, MA; ...
– listed-price: $250,000; $110,000; ...
– phone: (305) 729 0831; (617) 253 1429; ...
– comments: Fantastic house; Great location; ...
homes.com data
– price: $550,000; $320,000; ...
– contact-phone: (278) 345 7215; (617) 335 2315; ...
– extra-info: Beautiful yard; Great beach; ...
Learned hypotheses
– If “phone” occurs in the name => agent-phone
– If “fantastic” & “great” occur frequently in data values => description

7 Our Contributions
1. Use of multi-strategy learning
– well-suited to exploit multiple types of knowledge
– highly modular & extensible
2. Extend learning to incorporate constraints
– handle a wide range of domain & user-specified constraints
3. Develop XML learner
– exploit hierarchical nature of XML

8 Multi-Strategy Learning
Use a set of base learners
– each exploits well certain types of information
Match schema elements of a new source
– apply the base learners
– combine their predictions using a meta-learner
Meta-learner
– uses training sources to measure base learner accuracy
– weighs each learner based on its accuracy
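The combination step can be sketched as a weighted sum of confidence scores. This is a minimal illustration, not the LSD implementation: the function name `combine_predictions`, the per-learner weights, and the input scores are all assumptions chosen for the example, and a real meta-learner would estimate the weights from the training sources.

```python
from collections import defaultdict


def combine_predictions(predictions, weights):
    """Linearly combine base-learner confidence scores for one schema element.

    predictions: {learner_name: {label: confidence}}
    weights:     {learner_name: weight} (hypothetical values here; a real
                 meta-learner would estimate them from training sources)
    """
    combined = defaultdict(float)
    for learner, scores in predictions.items():
        for label, confidence in scores.items():
            combined[label] += weights[learner] * confidence
    total = sum(combined.values())
    # Normalize so the combined scores sum to 1.
    return {label: score / total for label, score in combined.items()}


# Illustrative inputs: two base learners scoring the same source element.
predictions = {
    "name_learner": {"address": 0.8, "description": 0.2},
    "naive_bayes": {"address": 0.6, "description": 0.4},
}
weights = {"name_learner": 0.5, "naive_bayes": 0.5}  # assumed equal weights

combined = combine_predictions(predictions, weights)
best_label = max(combined, key=combined.get)
print(best_label, combined)
```

A learner that was more accurate on the training sources would simply receive a larger weight, which pulls the combined score toward its predictions.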

9 Base Learners
Input
– schema information: name, proximity, structure, ...
– data information: value, format, ...
Output
– prediction weighted by confidence score
Examples
– Name learner
  – agent-name => (name,0.7), (phone,0.3)
– Naive Bayes learner
  – “Kent, WA” => (address,0.8), (name,0.2)
  – “Great location” => (description,0.9), (address,0.1)

10 Training the Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data listings
– Miami, FL; $250,000; (305) 729 0831; Fantastic house
– Boston, MA; $110,000; (617) 253 1429; Great location
Name Learner training examples
– (location, address) (listed-price, price) (phone, agent-phone) (comments, description) ...
Naive Bayes Learner training examples
– (“Miami, FL”, address) (“$ 250,000”, price) (“(305) 729 0831”, agent-phone) (“Fantastic house”, description) ...
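To make the (data value, label) training pairs concrete, here is a minimal sketch of a token-based Naive Bayes classifier trained on the realestate.com values shown on this slide. The class name, the tokenizer, and the use of add-one smoothing are assumptions made for illustration; the slides do not specify these details.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(value):
    # Assumed tokenizer: lowercased words, digit runs, and the dollar sign.
    return re.findall(r"[a-z]+|\d+|\$", value.lower())


class NaiveBayesLearner:
    def __init__(self):
        self.label_counts = Counter()             # training examples per label
        self.token_counts = defaultdict(Counter)  # token frequencies per label
        self.vocabulary = set()

    def train(self, examples):
        for value, label in examples:
            self.label_counts[label] += 1
            for token in tokenize(value):
                self.token_counts[label][token] += 1
                self.vocabulary.add(token)

    def predict(self, value):
        """Return {label: confidence}, normalized to sum to 1."""
        tokens = tokenize(value)
        n_examples = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary) + 1  # one extra slot for unseen tokens
        log_scores = {}
        for label, count in self.label_counts.items():
            n_tokens = sum(self.token_counts[label].values())
            score = math.log(count / n_examples)  # log prior
            for token in tokens:
                # Add-one (Laplace) smoothing keeps unseen tokens from zeroing the score.
                seen = self.token_counts[label][token]
                score += math.log((seen + 1) / (n_tokens + vocab_size))
            log_scores[label] = score
        top = max(log_scores.values())
        exp_scores = {lbl: math.exp(s - top) for lbl, s in log_scores.items()}
        total = sum(exp_scores.values())
        return {lbl: s / total for lbl, s in exp_scores.items()}


learner = NaiveBayesLearner()
learner.train([
    ("Miami, FL", "address"), ("Boston, MA", "address"),
    ("$ 250,000", "price"), ("$ 110,000", "price"),
    ("(305) 729 0831", "agent-phone"), ("(617) 253 1429", "agent-phone"),
    ("Fantastic house", "description"), ("Great location", "description"),
])

scores = learner.predict("Great beach")
print(max(scores, key=scores.get), scores)
```

With only eight training values the confidences are far from the ones on the slides; the point is the shape of the output, a confidence score per mediated-schema label.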

11 Applying the Learners
Mediated schema: address, price, agent-phone, description
Schema of homes.com: area, day-phone, extra-info
homes.com data values
– Seattle, WA; Kent, WA; Austin, TX
– (278) 345 7215; (617) 335 2315; (512) 427 1115
– Beautiful yard; Great beach; Close to Seattle
[Figure: Name Learner and Naive Bayes predictions are combined by the Meta-Learner. Predictions shown, in slide order: (address,0.8), (description,0.2); (address,0.6), (description,0.4); (address,0.7), (description,0.3); (address,0.6), (description,0.4); (address,0.7), (description,0.3); (agent-phone,0.9), (description,0.1)]

12 Domain Constraints
Impose semantic regularities on sources
– verified using schema or data
Examples
– a = address & b = address => a = b
– a = house-id => a is a key
– a = agent-info & b = agent-name => b is nested in a
Can be specified up front
– when creating mediated schema
– independent of any actual source schema

13 The Constraint Handler
Predictions from Meta-Learner
– area: (address,0.7), (description,0.3)
– contact-phone: (agent-phone,0.9), (description,0.1)
– extra-info: (address,0.6), (description,0.4)
Domain Constraints
– a = address & b = address => a = b
Candidate mappings, each scored by the product of its confidence scores
– (area: address, contact-phone: agent-phone, extra-info: address) 0.7 × 0.9 × 0.6 = 0.378
– (area: address, contact-phone: agent-phone, extra-info: description) 0.7 × 0.9 × 0.4 = 0.252
– (area: description, contact-phone: description, extra-info: description) 0.3 × 0.1 × 0.4 = 0.012
Can specify arbitrary constraints
User feedback = domain constraint
– ad-id = house-id
Extended to handle domain heuristics
– a = agent-phone & b = agent-name => a & b are usually close to each other
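The numbers on this slide can be reproduced with a small sketch: score every candidate mapping by the product of its confidence scores, discard candidates that violate a domain constraint, and keep the best survivor. Exhaustive enumeration is used here only because the example is tiny; it is a stand-in chosen for illustration, not a description of how LSD searches the space, and the function names are hypothetical.

```python
from itertools import product
from math import prod

# Meta-learner predictions copied from the slide.
meta_predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}


def at_most_one_address(mapping):
    # Constraint from the slide: a = address & b = address => a = b.
    return list(mapping.values()).count("address") <= 1


def best_mapping(predictions, constraints):
    """Return the highest-scoring mapping that satisfies every constraint."""
    elements = list(predictions)
    best, best_score = None, 0.0
    # Try every way of assigning one candidate label to each source element.
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue
        score = prod(predictions[e][label] for e, label in mapping.items())
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score


unconstrained, unconstrained_score = best_mapping(meta_predictions, [])
constrained, constrained_score = best_mapping(meta_predictions, [at_most_one_address])
print(unconstrained, unconstrained_score)
print(constrained, constrained_score)
```

Without the constraint the top candidate maps both area and extra-info to address (score 0.378); with it, the handler falls back to the 0.252 candidate that maps extra-info to description.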

14 Putting It All Together: the LSD System
[Figure: Training Phase and Matching Phase. Labels: Mediated schema; Source schemas; Data listings; Training data for base learners; base learners L1, L2, ..., Lk; Mapping Combination; Constraint Handler; Domain Constraints; User Feedback]
Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner
Meta-learner
– uses stacking [Ting&Witten99, Wolpert92]
– returns linear weighted combination of base learners’ predictions

15 Empirical Evaluation
Four domains
– Real Estate I & II, Course Offerings, Faculty Listings
For each domain
– create mediated DTD & domain constraints
– choose five sources
– extract & convert data listings into XML
– mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48
Ten runs for each experiment - in each run:
– manually provide 1-1 mappings for 3 sources
– ask LSD to propose mappings for remaining 2 sources
– accuracy = % of 1-1 mappings correctly identified
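The accuracy measure defined on this slide can be written out directly. The sketch below assumes mappings are represented as dictionaries from source element to mediated-schema element; the function name and the sample mappings are hypothetical and are not results from the evaluation.

```python
def matching_accuracy(proposed, correct):
    """Percentage of the correct 1-1 mappings that the proposal identifies."""
    if not correct:
        raise ValueError("no reference mappings to score against")
    hits = sum(
        1 for source_elem, label in correct.items()
        if proposed.get(source_elem) == label
    )
    return 100.0 * hits / len(correct)


# Hypothetical reference and proposed mappings for one source.
correct = {"area": "address", "contact-phone": "agent-phone", "extra-info": "description"}
proposed = {"area": "address", "contact-phone": "agent-phone", "extra-info": "address"}

accuracy = matching_accuracy(proposed, correct)
print(f"{accuracy:.1f}%")
```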

16 High Matching Accuracy
[Chart: average matching accuracy (%)]
LSD’s accuracy: 71 - 92%
Best single base learner: 42 - 72%
+ Meta-learner: + 5 - 22%
+ Constraint handler: + 7 - 13%
+ XML learner: + 0.8 - 6%

17 Performance Sensitivity
[Chart: average matching accuracy (%) vs. number of data listings per source]

18 Contribution of Schema vs. Data
[Chart: average matching accuracy (%)]
More experiments in the paper!

19 Related Work
Rule-based approaches
– TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98], CUPID [Madhavan et al. 01]
– utilize only schema information
Learner-based approaches
– SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95]
– employ a single learner, limited applicability
Others
– DELTA [Clifton et al. 97], CLIO [Miller et al. 00][Yan et al. 01]
Multi-strategy learning in other domains
– series of workshops [91,93,96,98,00]
– [Freitag98], Proverb [Keim et al. 99]

20 Summary
LSD project
– applies machine learning to schema matching
Main ideas & contributions
– use of multi-strategy learning
– extend learning to handle domain & user-specified constraints
– develop XML learner
System design: A contribution to generic schema-matching
– highly modular & extensible
– handle multiple types of knowledge
– continuously improve over time

21 Ongoing & Future Work
Improve accuracy
– address current system limitations
Extend LSD to more complex mappings
Apply LSD to other application contexts
– data translation
– data warehousing
– e-commerce
– information extraction
– semantic web
www.cs.washington.edu/homes/anhai/lsd.html

22 Contribution of Each Component
[Chart: average matching accuracy (%) for: without Name Learner, without Naive Bayes, without Whirl Learner, without Constraint Handler, the complete LSD system]

23 Exploiting Hierarchical Structure
Existing learners flatten out all structures
Developed XML learner
– similar to the Naive Bayes learner
– input instance = bag of tokens
– differs in one crucial aspect
– consider not only text tokens, but also structure tokens
Example: Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. Gail Murphy MAX Realtors
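The difference between a flattened bag of text tokens and a bag that also carries structure tokens can be sketched as follows. The element names `contact`, `name`, and `firm` are hypothetical (the markup of the slide's example did not survive in this transcript), and using raw tag names as structure tokens is a simplification chosen for illustration rather than the XML learner's actual token scheme.

```python
import re
import xml.etree.ElementTree as ET


def text_tokens(text):
    return re.findall(r"[a-z]+", (text or "").lower())


def flat_tokens(elem):
    """Bag of text tokens only: the hierarchy is flattened away."""
    return text_tokens(" ".join(elem.itertext()))


def structured_tokens(elem):
    """Bag of text tokens plus one structure token per element tag."""
    tokens = []
    for node in elem.iter():
        tokens.append(f"<{node.tag}>")          # structure token
        tokens.extend(text_tokens(node.text))   # text inside the element
        if node is not elem:
            tokens.extend(text_tokens(node.tail))  # text following a child
    return tokens


# Hypothetical markup for the contact part of the slide's example.
instance = ET.fromstring(
    "<contact><name>Gail Murphy</name><firm>MAX Realtors</firm></contact>"
)

flat = flat_tokens(instance)
structured = structured_tokens(instance)
print(flat)
print(structured)
```

With only the flat tokens, this instance is hard to tell apart from a free-text description that happens to mention the same agent; the structure tokens give a Naive Bayes style learner extra evidence to separate the two.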

