
1 Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai Doan

2 Administrivia: Midterm due Thursday; 5-10 pages (single-spaced, 10-12 pt).

3 Semantic Mappings between Schemas. Mediated & source schemas = XML DTDs. The slide's two schema trees contain the elements house, location, contact; house, address, name, phone; num-baths; full-baths, half-baths; contact-info, agent-name, agent-phone. It contrasts a 1-1 mapping (e.g., location to address) with a non 1-1 mapping (e.g., num-baths to full-baths + half-baths).

4 The LSD (Learning Source Descriptions) Approach. Suppose a user wants to integrate 100 data sources: 1. The user manually creates mappings for a few sources, say 3, and shows LSD these mappings. 2. LSD learns from the mappings; "multi-strategy" learning incorporates many types of info in a general way, and knowledge of constraints further helps. 3. LSD proposes mappings for the remaining 97 sources.

5 Example. Mediated schema: address, price, agent-phone, description. Schema of realestate.com: location, listed-price, phone, comments, with data values such as location = "Miami, FL", "Boston, MA"; listed-price = $250,000, $110,000; phone = (305) 729 0831, (617) 253 1429; comments = "Fantastic house", "Great location". Learned hypotheses: if "phone" occurs in the name => agent-phone; if "fantastic" & "great" occur frequently in data values => description. A second source, homes.com, has price = $550,000, $320,000; contact-phone = (278) 345 7215, (617) 335 2315; extra-info = "Beautiful yard", "Great beach".

6 LSD’s Multi-Strategy Learning. Use a set of base learners, each of which exploits certain types of information well. To match the schema elements of a new source: apply the base learners, then combine their predictions using a meta-learner. The meta-learner uses the training sources to measure each base learner’s accuracy and weighs each learner accordingly.
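The combination step described on this slide can be sketched as follows. This is a minimal illustration; the learner names, confidence distributions, and weights are invented for the example, not LSD's actual values.

```python
# Hypothetical sketch of LSD-style multi-strategy matching: each base
# learner returns a confidence distribution over mediated-schema labels,
# and a meta-learner combines them with accuracy-derived weights.

def combine_predictions(predictions, weights):
    """Weighted average of per-learner confidence distributions."""
    combined = {}
    total = sum(weights.values())
    for learner, dist in predictions.items():
        w = weights[learner] / total
        for label, conf in dist.items():
            combined[label] = combined.get(label, 0.0) + w * conf
    return combined

# Each learner's confidence scores for one source element, e.g. "area":
preds = {
    "name":        {"address": 0.6, "description": 0.4},
    "naive_bayes": {"address": 0.8, "description": 0.2},
}
# Weights learned from the training sources (illustrative: here the
# Naive Bayes learner was measured as more accurate on data values).
weights = {"name": 0.4, "naive_bayes": 0.6}

result = combine_predictions(preds, weights)
best = max(result, key=result.get)   # -> "address"
```

The weighted average keeps the output a valid confidence distribution, so the meta-learner's result can be thresholded or ranked just like a base learner's.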

7 Base Learners. Input: schema information (name, proximity, structure, ...) and data information (value, format, ...). Output: predictions weighted by confidence scores. Examples: the Name learner maps agent-name => (name, 0.7), (phone, 0.3); the Naive Bayes learner maps "Kent, WA" => (address, 0.8), (name, 0.2) and "Great location" => (description, 0.9), (address, 0.1).
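As a toy illustration of a name-based base learner (not LSD's actual matcher), one can score each mediated-schema label by token overlap with the source element's name and normalize the scores into confidences:

```python
# Toy "name learner": score mediated-schema labels by token overlap with
# the source element's name. Illustrative only; LSD's real name matcher
# uses richer similarity measures.

def name_learner(source_name, labels):
    src_tokens = set(source_name.lower().replace("-", " ").split())
    scores = {}
    for label in labels:
        lbl_tokens = set(label.lower().replace("-", " ").split())
        overlap = len(src_tokens & lbl_tokens)
        scores[label] = overlap + 0.1      # smoothing so no label gets zero
    total = sum(scores.values())
    return {lbl: s / total for lbl, s in scores.items()}

dist = name_learner("agent-name", ["name", "agent-phone", "description"])
# "agent-name" shares one token each with "name" and "agent-phone",
# so those two labels tie for the highest confidence.
```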

8 Training the Learners. Given realestate.com (schema: location, listed-price, phone, comments; data values such as "Miami, FL" / "Boston, MA", $250,000 / $110,000, (305) 729 0831 / (617) 253 1429, "Fantastic house" / "Great location") and the mediated schema (address, price, agent-phone, description): the Name Learner is trained on the pairs (location, address), (listed-price, price), (phone, agent-phone), (comments, description), ...; the Naive Bayes Learner is trained on the pairs ("Miami, FL", address), ("$250,000", price), ("(305) 729 0831", agent-phone), ("Fantastic house", description), ...

9 Applying the Learners. For homes.com (schema: area, day-phone, extra-info; data values such as "Seattle, WA", "Kent, WA", "Austin, TX"; "(278) 345 7215", "(617) 335 2315", "(512) 427 1115"; "Beautiful yard", "Great beach", "Close to Seattle"): the Name Learner and Naive Bayes learner each emit weighted predictions, e.g. (address, 0.8), (description, 0.2) and (address, 0.6), (description, 0.4), and the Meta-Learner combines them, e.g. into (address, 0.7), (description, 0.3) for area and (agent-phone, 0.9), (description, 0.1) for day-phone.

10 Domain Constraints. They impose semantic regularities on sources and are verified using the schema or the data. Examples: a = address & b = address => a = b; a = house-id => a is a key; a = agent-info & b = agent-name => b is nested in a. Constraints can be specified up front, when creating the mediated schema, independent of any actual source schema.
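Checking the first constraint above against a proposed mapping can be sketched like this. The constraint a = address & b = address => a = b simply says that at most one source element may map to the mediated label "address"; the mappings in the example are illustrative.

```python
# Sketch of verifying one LSD-style domain constraint against a
# proposed source-to-mediated mapping.

def violates_at_most_one(mapping, label):
    """True if two distinct source elements both map to `label`."""
    hits = [src for src, lbl in mapping.items() if lbl == label]
    return len(hits) > 1

ok  = {"area": "address", "extra-info": "description"}
bad = {"area": "address", "extra-info": "address"}

violates_at_most_one(ok, "address")   # False: constraint satisfied
violates_at_most_one(bad, "address")  # True: two elements claim address
```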

11 The Constraint Handler. Predictions from the Meta-Learner: area: (address, 0.7), (description, 0.3); contact-phone: (agent-phone, 0.9), (description, 0.1); extra-info: (address, 0.6), (description, 0.4). Domain constraint: a = address & b = address => a = b. Candidate mappings are scored by the product of their confidences: {area: address, contact-phone: agent-phone, extra-info: address} scores 0.7 x 0.9 x 0.6 = 0.378 but violates the constraint; {area: address, contact-phone: agent-phone, extra-info: description} scores 0.7 x 0.9 x 0.4 = 0.252; {area: description, contact-phone: description, extra-info: description} scores 0.3 x 0.1 x 0.4 = 0.012. The handler can accept arbitrary constraints; user feedback becomes a domain constraint (e.g., ad-id = house-id); and it extends to domain heuristics, e.g., a = agent-phone & b = agent-name => a & b are usually close to each other.
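The slide's search can be reconstructed as a small sketch: enumerate label combinations, score each by the product of meta-learner confidences, and keep the best combination that satisfies the constraint. The numbers follow the slide's example; the function names are invented.

```python
# Reconstruction of the slide's constraint-handler search. The
# 0.378-scoring combination maps two elements to address and is
# rejected; the best valid combination scores 0.7 * 0.9 * 0.4 = 0.252.
from itertools import product

predictions = {
    "area":          [("address", 0.7), ("description", 0.3)],
    "contact-phone": [("agent-phone", 0.9), ("description", 0.1)],
    "extra-info":    [("address", 0.6), ("description", 0.4)],
}

def at_most_one_address(assignment):
    return sum(1 for lbl in assignment.values() if lbl == "address") <= 1

def best_mapping(predictions, constraint):
    elements = list(predictions)
    best, best_score = None, -1.0
    for combo in product(*(predictions[e] for e in elements)):
        assignment = {e: lbl for e, (lbl, _) in zip(elements, combo)}
        score = 1.0
        for _, conf in combo:
            score *= conf
        if constraint(assignment) and score > best_score:
            best, best_score = assignment, score
    return best, best_score

mapping, score = best_mapping(predictions, at_most_one_address)
# mapping -> {"area": "address", "contact-phone": "agent-phone",
#             "extra-info": "description"}, score -> 0.252
```

Brute-force enumeration is exponential in the number of source elements; LSD uses this only as a conceptual model, with more efficient search in practice.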

12 Putting It All Together: the LSD System. In the training phase, the mediated schema, source schemas, and data listings are used to generate training data for the base learners L1, L2, ..., Lk. In the matching phase, the base learners' predictions go through mapping combination and the constraint handler, guided by domain constraints and user feedback. Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner. The meta-learner uses stacking [Ting & Witten 99; Wolpert 92] and returns a linear weighted combination of the base learners' predictions.

13 Empirical Evaluation. Four domains: Real Estate I & II, Course Offerings, Faculty Listings. For each domain: create a mediated DTD & domain constraints, choose five sources, and extract & convert data listings into XML. Mediated DTDs: 14-66 elements; source DTDs: 13-48. Ten runs for each experiment; in each run, manually provide 1-1 mappings for 3 sources and ask LSD to propose mappings for the remaining 2 sources. Accuracy = % of 1-1 mappings correctly identified.

14 LSD Matching Accuracy. Average matching accuracy (%): LSD's accuracy: 71-92%. Best single base learner: 42-72%. + Meta-learner: +5-22%. + Constraint handler: +7-13%. + XML learner: +0.8-6%.

15 LSD Summary. Applies machine learning to schema matching, through multi-strategy learning. Uses domain & user-specified constraints. Probably the most flexible means of doing schema matching today in a semi-automated way. A complementary project, CLIO (IBM Almaden), uses key and foreign-key constraints to help the user build mappings.

16 Since LSD... A lot more work on the following: alternative schemes for putting together info from base learners; hierarchical learners, which compare two trees (parent nodes are likely to be the same if child nodes are similar, and child nodes are likely to be the same if parent nodes are similar); and using mass collaboration, where humans do the work. There has also been a lot of work on entity resolution (record matching), which uses similar ideas to try to determine when two records refer to the same entity.
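In the same spirit, a toy entity-resolution check might combine field-level Jaccard similarities and threshold the average. The threshold, field names, and records below are invented for illustration; real systems use trained similarity models.

```python
# Toy record matching: decide whether two records refer to the same
# real-world entity by averaging per-field Jaccard token similarity.

def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def same_entity(rec1, rec2, threshold=0.5):
    """Average per-field similarity across the fields both records share."""
    fields = rec1.keys() & rec2.keys()
    score = sum(jaccard(rec1[f], rec2[f]) for f in fields) / len(fields)
    return score >= threshold

r1 = {"name": "Zachary G. Ives", "affil": "University of Pennsylvania"}
r2 = {"name": "Zachary Ives",    "affil": "Univ. of Pennsylvania"}

same_entity(r1, r2)   # about 0.58 average similarity -> True
```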

17 Jumping Up a Level. We've now seen how heterogeneous data makes a huge difference: in the need for relating different kinds of attributes (mapping languages, mapping tools, query reformulation) and in query processing (adaptive query processing). Next time we'll go even further and start to consider search, focusing on Google.

