Presentation on theme: "Learning to Map Between Schemas and Ontologies" — Presentation transcript:

1 Learning to Map Between Schemas and Ontologies
Alon Halevy, University of Washington. Joint work with AnHai Doan and Pedro Domingos.

2 Agenda
Ontology mapping is a key problem in many applications: data integration, the Semantic Web, knowledge management, e-commerce. LSD: a solution that uses multi-strategy learning. We started with schema matching (i.e., very simple ontologies) and are currently extending to more expressive ontologies. Experiments show the approach is very promising!

3 The Structure Mapping Problem
Types of structures: database schemas, XML DTDs, ontologies, … Input: two (or more) structures S1 and S2; data instances for S1 and S2; background knowledge. Output: a mapping between S1 and S2 that should enable translating between data instances. Semantics of the mapping?

4 Semantic Mappings between Schemas
Source schemas = XML DTDs. [Diagram: the mediated-schema element house(address, contact-info(agent-name, agent-phone), num-baths) is mapped to the source's house(location, contact(name, phone), full-baths, half-baths); address to location and contact-info to contact are 1-1 mappings, while num-baths to full-baths/half-baths is a non 1-1 mapping.] Speaker notes: 0) sources export data in XML; 1) mediated and source schemas are represented with XML DTDs; 2) describe the different types of mappings (1-1 mappings and more complex mappings); 3) we focus on 1-1 mappings. Note: when defining the schema-matching problem, say clearly that we are *given* the two schemas; all we need to do is find the mappings.

5 Motivation
Database schema integration: a problem as old as databases themselves (database merging, data warehouses, data migration). Data integration / information-gathering agents: on the WWW, in enterprises, in large science projects. Model management: model matching is a key operator in an algebra where models and mappings are first-class objects; see [Bernstein et al., 2000] for more. The Semantic Web: ontology mapping. System interoperability: e-services, application integration, B2B applications, …

6 Desiderata from Proposed Solutions
Accuracy, efficiency, ease of use. Realistic expectations: unlikely to be fully automated; need the user in the loop. Some notion of semantics for mappings. Extensibility: the solution should exploit additional background knowledge. "Memory" / knowledge reuse: the system should exploit previous manual or automatically generated matchings; this is the key idea behind LSD.

7 LSD Overview: L(earning) S(ource) D(escriptions)
Problem: generating semantic mappings between a mediated schema and a large set of data source schemas. Key idea: generate the first mappings manually, and learn from them to generate the rest. Technique: multi-strategy learning (extensible!). Step 1 [SIGMOD 2001]: 1-1 mappings between XML DTDs. Current focus: complex mappings and ontology mapping.

8 Outline
Overview of structure mapping. Data integration and source mappings. LSD architecture and details. Experimental results. Current work.

9 Data Integration
Example query: find houses with four bathrooms priced under $500,000. [Diagram: the query is posed against the mediated schema, goes through query reformulation and optimization, and is sent via wrappers to source schemas 1-3 (realestate.com, homeseekers.com, homes.com).] Applications: WWW, enterprises, science projects. Techniques: virtual data integration, warehousing, custom code.

10 Semantic Mappings between Schemas
(Repeat of slide 4.)

11 Semantics (preliminary)
The semantics of mappings has received no attention. Semantics of 1-1 mappings: given R(A1,…,An) and S(B1,…,Bm) and 1-1 mappings (Ai, Bj), we postulate the existence of a relation W such that:
π(C1,…,Ck)(W) = π(A1,…,Ak)(R),
π(C1,…,Ck)(W) = π(B1,…,Bk)(S),
and W also includes the unmatched attributes of R and S. In English: R and S are projections of some universal relation W, and the mappings specify the projection variables and correspondences.
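One simple consequence of this postulated semantics is testable: since both matched projections equal π(C1,…,Ck)(W), the data instances of R and S must agree on their matched attributes. A minimal sketch of that check (the function name and dict-based tuple encoding are illustrative, not from the talk):

```python
# Sketch of the universal-relation reading of 1-1 mappings: R and S are
# projections of some relation W, so their projections onto the matched
# attribute pairs (Ai, Bj) must coincide.

def consistent_with_mapping(R, S, pairs):
    """R, S: lists of dicts (tuples keyed by attribute name).
    pairs: list of (attr_in_R, attr_in_S) 1-1 correspondences.
    Returns True iff a witness W with the postulated projections could exist."""
    proj_R = {tuple(t[a] for a, _ in pairs) for t in R}
    proj_S = {tuple(t[b] for _, b in pairs) for t in S}
    return proj_R == proj_S

R = [{"address": "Kent, WA", "num-baths": 3}]
S = [{"location": "Kent, WA", "phone": "(206)"}]
print(consistent_with_mapping(R, S, [("address", "location")]))  # True
```

If the projections disagree on any matched attribute, no universal relation W can exist, so the candidate mapping is semantically inconsistent with the data.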

12 Why Matching is Difficult
The aim is to identify the same real-world entity using names, structures, types, data values, etc. Schemas represent the same entity differently: different names => same entity (area & address => location); same names => different entities (area => location or square-feet). Schema and data never fully capture semantics: not adequately documented, not sufficiently expressive. Intended semantics is typically subjective: IBM Almaden Lab = IBM? Matching cannot be fully automated, and is often hard for humans. Committees are required!

13 Current State of Affairs
Finding semantic mappings is now the bottleneck! It is largely done by hand: labor-intensive and error-prone; at GTE, 4 hours/element for 27,000 elements [Li & Clifton, 2000]. The problem will only be exacerbated as data sharing and XML become pervasive: proliferation of DTDs, translation of legacy data, reconciling ontologies on the Semantic Web. We need semi-automatic approaches to scale up! Speaker notes: what is the state of research in data integration today? These pieces are well understood; schema matching is now the bottleneck, so we need to automate it.

14 Outline
Overview of structure mapping. Data integration and source mappings. LSD architecture and details. Experimental results. Current work.

15 The LSD Approach
The user manually maps a few data sources to the mediated schema. LSD learns from the mappings, and proposes mappings for the rest of the sources. Several types of knowledge are used in learning: schema elements (e.g., attribute names); data elements (ranges, formats, word frequencies, value frequencies, length of texts); proximity of attributes; functional dependencies; number of attribute occurrences. One learner does not fit all: use multiple learners and combine them with a meta-learner.

16 Example
Mediated schema: address, price, agent-phone, description. Schema of realestate.com: location, listed-price, phone, comments. Learned hypotheses: if "phone" occurs in the element name => agent-phone; if "fantastic" and "great" occur frequently in data values => description. [Tables: sample realestate.com listings (location: Miami, FL / Boston, MA; listed-price: $250,000 / $110,000; phone: (305) / (617); comments: Fantastic house / Great location) and homes.com listings (price: $550,000 / $320,000; contact-phone: (278) / (617); extra-info: Beautiful yard / Great beach).] Speaker notes: 1) introduce our approach; 2) we do not manually map the schemas of all sources to the mediated schema; the goal is to manually mark up only a few sources, and learn from the marked-up sources to successfully propose mappings for subsequent sources; 3) once the markup is done, there are many different types of information to learn from.

17 Multi-Strategy Learning
Use a set of base learners: Name learner, Naive Bayes, Whirl, XML learner; and a set of recognizers: county name, zip code, phone numbers. Each base learner produces a prediction weighted by a confidence score. Combine the base learners with a meta-learner, using stacking.

18 Base Learners
Name Learner: trained on pairs such as (contact-info, office-address), (contact, agent-phone), (phone, agent-phone), (listed-price, price); given the unseen name contact-phone, it predicts (agent-phone, 0.7), (office-address, 0.3). Naive Bayes Learner [Domingos & Pazzani, 1997]: "Kent, WA" => (address, 0.8), (name, 0.2). Whirl Learner [Cohen & Hirsh, 1998]. XML Learner: exploits the hierarchical structure of XML data. Speaker notes: 1) describe the learners in general; 2) describe two example learners in detail; say clearly and briefly how each example learner works.
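To make the Naive Bayes base learner concrete, here is a minimal token-based sketch in the spirit of the slide's "Kent, WA" => (address, 0.8) example. This is not the paper's implementation; the class name, tokenization, and Laplace smoothing are assumptions for illustration:

```python
import math
from collections import Counter, defaultdict

# Minimal naive Bayes base learner: training pairs map data values to
# mediated-schema elements; prediction returns (label, confidence) pairs
# ranked by (smoothed) posterior probability.

class NaiveBayesLearner:
    def __init__(self):
        self.token_counts = defaultdict(Counter)
        self.label_counts = Counter()

    def _tokens(self, value):
        return value.lower().replace(",", " ").split()

    def train(self, examples):  # examples: [(value, label)]
        for value, label in examples:
            self.label_counts[label] += 1
            self.token_counts[label].update(self._tokens(value))

    def predict(self, value):
        tokens = self._tokens(value)
        total = sum(self.label_counts.values())
        vocab = {t for c in self.token_counts.values() for t in c}
        scores = {}
        for label in self.label_counts:
            logp = math.log(self.label_counts[label] / total)
            denom = sum(self.token_counts[label].values()) + len(vocab)
            for t in tokens:  # Laplace-smoothed token likelihoods
                logp += math.log((self.token_counts[label][t] + 1) / denom)
            scores[label] = logp
        m = max(scores.values())  # normalize log-scores into confidences
        exp = {l: math.exp(s - m) for l, s in scores.items()}
        z = sum(exp.values())
        return sorted(((l, e / z) for l, e in exp.items()), key=lambda x: -x[1])

nb = NaiveBayesLearner()
nb.train([("Miami, FL", "address"), ("Boston, MA", "address"),
          ("Seattle, WA", "address"), ("Gail Murphy", "name"),
          ("Richard Smith", "name")])
print(nb.predict("Kent, WA")[0][0])  # address
```

The state token "wa" seen in training pulls "Kent, WA" toward address even though "kent" itself is unseen, which is exactly the kind of data-driven evidence a name-only learner cannot use.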

19 Training the Base Learners
Mediated schema: address, price, agent-phone, description. Schema of realestate.com: location, listed-price, phone, comments. Extracted instances from realestate.com: <location> Miami, FL </>, <listed-price> $250,000 </>, <phone> (305) </>, <comments> Fantastic house </>; <location> Boston, MA </>, <listed-price> $110,000 </>, <phone> (617) </>, <comments> Great location </>. Name Learner training pairs: (location, address), (listed-price, price), (phone, agent-phone), … Naive Bayes Learner training pairs: ("Miami, FL", address), ("$250,000", price), ("(305)", agent-phone), … Speaker note: training = gleaning knowledge from the data.

20 Entity Recognizers
Use pre-programmed knowledge to identify specific types of entities: date, time, city, zip code, name, etc.; house-area (30 X 70, 500 sq. ft.); a county-name recognizer. Recognizers often have nice characteristics: easy to construct; many off-the-shelf research and commercial products; applicable across many domains; help with special cases that are hard to learn.
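A recognizer can be as simple as a pattern match that fires with high confidence. A hedged sketch (the patterns and the fixed 1.0 confidence are illustrative assumptions, not the talk's actual recognizers):

```python
import re

# Pre-programmed entity recognizers: each maps a data value to a
# (label, confidence) prediction when its pattern matches.

RECOGNIZERS = {
    "zip-code": re.compile(r"^\d{5}(-\d{4})?$"),
    "phone": re.compile(r"^\(\d{3}\)\s*\d{3}-\d{4}$"),
    "house-area": re.compile(r"^\d+\s*x\s*\d+$|^\d+\s*sq\.?\s*ft\.?$",
                             re.IGNORECASE),
}

def recognize(value):
    """Return the predictions of all recognizers that fire on the value."""
    return [(label, 1.0) for label, pat in RECOGNIZERS.items()
            if pat.match(value.strip())]

print(recognize("98195"))           # [('zip-code', 1.0)]
print(recognize("(206) 555-0100"))  # [('phone', 1.0)]
print(recognize("30 X 70"))         # [('house-area', 1.0)]
```

Because recognizers are pure pattern knowledge, they need no training data, which is why they help with special cases (zip codes, phone numbers) that are hard for the statistical learners.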

21 Meta-Learner: Stacking
Training of the meta-learner produces a weight for every pair of (base-learner, mediated-schema element): e.g., weight(Name-Learner, address) = 0.1, weight(Naive-Bayes, address) = 0.9. To combine predictions, the meta-learner computes a weighted sum of the base-learner confidence scores. Example: for <area>Seattle, WA</>, the Name Learner predicts (address, 0.6) and Naive Bayes predicts (address, 0.8); the meta-learner outputs (address, 0.6 * 0.1 + 0.8 * 0.9 = 0.78).
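The combination step above is a one-liner; a minimal sketch reproducing the slide's 0.78 (the function and dict shapes are illustrative assumptions):

```python
# Meta-learner combination: a weighted sum of base-learner confidence
# scores, with one learned weight per (base-learner, element) pair.

def combine(predictions, weights, element):
    """predictions: {learner: confidence for `element`};
    weights: {(learner, element): learned weight}."""
    return sum(weights[(learner, element)] * conf
               for learner, conf in predictions.items())

weights = {("name-learner", "address"): 0.1,
           ("naive-bayes", "address"): 0.9}
preds = {"name-learner": 0.6, "naive-bayes": 0.8}  # for <area>Seattle, WA</>
score = combine(preds, weights, "address")
print(round(score, 2))  # 0.78
```

The weights encode how much each base learner is to be trusted for each mediated-schema element, so a learner that is unreliable for address (here the Name Learner) contributes little to that element's score.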

22 Training the Meta-Learner
For the mediated-schema element address. Extracted XML instances: <location> Miami, FL </>, <listed-price> $250,000 </>, <area> Seattle, WA </>, <house-addr> Kent, WA </>, <num-baths> 3 </>, … Each instance is scored by the Name Learner and Naive Bayes, and the scores are compared against the true predictions; least-squares linear regression then produces weight(Name-Learner, address) = 0.1 and weight(Naive-Bayes, address) = 0.9.
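A toy sketch of that regression step, solving the two-learner normal equations directly (the data and the closed-form 2x2 solve are illustrative assumptions; the paper's exact setup may differ):

```python
# Meta-learner training: least-squares linear regression of the true 0/1
# match labels on the two base learners' confidence scores, via the
# normal equations for the 2-feature case.

def train_weights(confidences, labels):
    """confidences: list of (c1, c2) base-learner scores per instance;
    labels: 1.0 if the instance truly matches the element, else 0.0."""
    a11 = sum(c1 * c1 for c1, _ in confidences)
    a12 = sum(c1 * c2 for c1, c2 in confidences)
    a22 = sum(c2 * c2 for _, c2 in confidences)
    b1 = sum(c1 * y for (c1, _), y in zip(confidences, labels))
    b2 = sum(c2 * y for (_, c2), y in zip(confidences, labels))
    det = a11 * a22 - a12 * a12  # assumes the system is non-degenerate
    return ((a22 * b1 - a12 * b2) / det,
            (a11 * b2 - a12 * b1) / det)

# Instances scored by (Name Learner, Naive Bayes) for the element "address";
# Naive Bayes tracks the true labels much more closely here.
conf = [(0.9, 0.9), (0.2, 0.1), (0.3, 0.8), (0.8, 0.2)]
true = [1.0, 0.0, 1.0, 0.0]
w_name, w_nb = train_weights(conf, true)
print(w_nb > w_name)  # True: Naive Bayes earns the larger weight
```

Regression rewards the learner whose confidences correlate with the true labels, which is how the slide's 0.1 / 0.9 split between the Name Learner and Naive Bayes arises.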

23 Applying the Learners
Schema of homes.com: area, day-phone, extra-info. Mediated schema: address, price, agent-phone, description. For <area>Seattle, WA</>, <area>Kent, WA</>, <area>Austin, TX</>, the Name Learner and Naive Bayes produce per-instance predictions such as (address, 0.8), (description, 0.2); (address, 0.6), (description, 0.4); (address, 0.7), (description, 0.3), which the meta-learner combines into (address, 0.7), (description, 0.3). For <day-phone>(278) </>, <day-phone>(617) </>, <day-phone>(512) </>: (agent-phone, 0.9), (description, 0.1). For <extra-info>Beautiful yard</>, <extra-info>Great beach</>, <extra-info>Close to Seattle</>: (description, 0.8), (address, 0.2). Speaker notes: 1) explain how the example learners are applied to match "area" with "address"; 2) in general, an arbitrary number of learners can be plugged into the matching process; if we have a new trained learner, we can simply plug it in, which shows how easy it is to add a new learner in our approach.

24 The Constraint Handler
Extends learning to incorporate constraints. Hard constraints: a = address & b = address => a = b; a = house-id => a is a key; a = agent-info & b = agent-name => b is nested in a. Soft constraints: a = agent-phone & b = agent-name => a and b are usually close to each other. User feedback = hard or soft constraints. Details in [Doan et al., SIGMOD 2001].

25 The Current LSD System
[Architecture diagram: a training phase and a matching phase. Inputs: the mediated schema, source schemas, domain constraints, data listings, and user feedback. Base-Learner 1 through Base-Learner k feed the Meta-Learner, whose predictions pass through the Constraint Handler to produce the mappings.]

26 Outline
Overview of structure mapping. Data integration and source mappings. LSD architecture and details. Experimental results. Current work.

27 Empirical Evaluation
Four domains: Real Estate I & II, Course Offerings, Faculty Listings. For each domain: create the mediated DTD and domain constraints; choose five sources; extract and convert data listings into XML (faithful to the schema!); mediated DTDs: … elements, source DTDs: … Ten runs for each experiment; in each run: manually provide 1-1 mappings for 3 sources; ask LSD to propose mappings for the remaining 2 sources; accuracy = % of 1-1 mappings correctly identified.
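The evaluation metric on this slide is simple enough to state in a few lines. A minimal sketch (the dict representation of a mapping and the example data are illustrative assumptions):

```python
# Matching accuracy as defined on the slide: the percentage of 1-1
# mappings the system identified correctly, against a gold mapping.

def matching_accuracy(proposed, gold):
    """proposed, gold: {source-element: mediated-element} dicts."""
    correct = sum(1 for elem, target in gold.items()
                  if proposed.get(elem) == target)
    return 100.0 * correct / len(gold)

gold = {"area": "address", "day-phone": "agent-phone",
        "extra-info": "description"}
proposed = {"area": "address", "day-phone": "agent-phone",
            "extra-info": "address"}  # one mistake
print(round(matching_accuracy(proposed, gold), 1))  # 66.7
```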

28 Matching Accuracy
LSD's accuracy: 71-92%. [Chart: average matching accuracy (%) with bars for the best single base learner, + meta-learner, + constraint handler, + XML learner.]

29 Sensitivity to Amount of Available Data
[Chart: average matching accuracy (%) vs. number of data listings per source (Real Estate I).]

30 Contribution of Schema vs. Data
[Chart: average matching accuracy (%) for LSD with only schema info, LSD with only data info, and the complete LSD.] More experiments in the paper [Doan et al., 2001].

31 Reasons for Incorrect Matching
Unfamiliarity: e.g., suburb; solution: add a suburb-name recognizer. Insufficient information: the system correctly identified the general type but failed to pinpoint the exact type, as in <agent-name>Richard Smith</> <phone> (206) </>; solution: add a proximity learner. Subjectivity: house-style = description?

32 Outline
Overview of structure mapping. Data integration and source mappings. LSD architecture and details. Experimental results. Current work.

33 Moving Up the Expressiveness Ladder
Schemas are very simple ontologies. More expressive power = more domain constraints: mappings become more complex, but the constraints provide more to learn from. Non 1-1 mappings: F1(A1,…,Am) = F2(B1,…,Bm). Ontologies (of various flavors): class hierarchy (i.e., containment on unary relations), relationships between objects, constraints on relationships.

34 Finding Non 1-1 Mappings (current work)
Given two schemas, find: 1-many mappings, e.g. address = concat(city, state); many-1 mappings, e.g. half-baths + full-baths = num-baths; many-many mappings, e.g. concat(addr-line1, addr-line2) = concat(street, city, state). 1-many mappings are expressed as a query: a value correspondence expression, e.g. room-rate = rate * (1 + tax-rate), plus a relationship, e.g. the state of tax-rate = the state of the hotel that has rate. Special case: 1-many mappings between two relational tables, with flat schemas so that the set of operators under consideration can be simplified. [Diagram: mediated schema (address, description, num-baths) vs. source schema (city, state, comments, half-baths, full-baths).]
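A candidate 1-many mapping such as address = concat(city, state) can be validated directly against the data instances. A minimal sketch (the separator, scoring function, and sample values are illustrative assumptions):

```python
# Validate a candidate mapping address = concat(city, state) by composing
# the source columns and measuring how often the composed value actually
# appears in the target column's data.

def validate_concat(target_values, cols, sep=", "):
    """cols: list of source columns (parallel lists of strings).
    Returns the fraction of composed values found in the target column."""
    composed = [sep.join(parts) for parts in zip(*cols)]
    target = set(target_values)
    return sum(1 for v in composed if v in target) / len(composed)

city = ["Seattle", "Miami", "Boston"]
state = ["WA", "FL", "MA"]
address = ["Seattle, WA", "Boston, MA", "Miami, FL"]
print(validate_concat(address, [city, state]))  # 1.0
```

Reversing the argument order, concat(state, city), scores 0.0 on the same data, so the score discriminates between candidate expressions over the same columns.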

35 Brute-Force Solution
Define a set of operators: concat, +, -, *, /, etc. For each set of mediated-schema columns: enumerate all possible mappings, then evaluate and return the best mapping. Similarity between the mediated-schema columns and the candidate source-schema column expressions m1, m2, …, mk is computed using all base learners; the best candidate (m1) is returned.
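A tiny instance of that enumeration, restricted to concat over column pairs and a set-overlap similarity (the similarity measure and sample schema are illustrative assumptions; the real system scores candidates with all base learners):

```python
from itertools import permutations

# Brute-force mapping discovery: enumerate candidate mappings built from
# one operator (concat over ordered column pairs) and keep the best one
# under a simple Jaccard value-overlap similarity.

def overlap(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def best_concat_mapping(target, columns, sep=", "):
    """columns: {name: list of values}. Returns (expression, score)."""
    best = (None, 0.0)
    for (n1, c1), (n2, c2) in permutations(columns.items(), 2):
        composed = [sep.join(p) for p in zip(c1, c2)]
        score = overlap(target, composed)
        if score > best[1]:
            best = (f"concat({n1}, {n2})", score)
    return best

source = {"city": ["Seattle", "Miami"], "state": ["WA", "FL"],
          "zip": ["98101", "33101"]}
mapping, score = best_concat_mapping(["Seattle, WA", "Miami, FL"], source)
print(mapping, score)  # concat(city, state) 1.0
```

The combinatorics explode quickly as operators and column sets grow, which is exactly why the next slide replaces exhaustive enumeration with search.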

36 Search-Based Solution
States = columns; the goal state is a mediated-schema column, and the initial states are all source-schema columns (use 1-1 matching to reduce the set of initial states). Operators: concat, +, -, *, /, etc. Column similarity: use all base learners + recognizers.

37 Multi-Strategy Search
Use a set of expert modules: L1, L2, …, Ln. Each module applies only to certain types of mediated-schema column, searches a small subspace, and uses a cheap similarity measure to compare columns. Examples: L1: text; concat; TF/IDF. L2: numeric; +, -, *, /; [Ho et al., 2000]. L3: address; concat; Naive Bayes. Search techniques: beam search as the default; specialized modules do not have to materialize columns.

38 Multi-Strategy Search (cont'd)
Apply all applicable expert modules: L1 produces m11, m12, m13, …, m1x; L2 produces m21, m22, m23, …, m2y; L3 produces m31, m32, m33, …, m3z. Combine the modules' predictions and select the best one: compute similarity using all base learners over the shortlisted candidates m11, m12, m21, m22, m31, m32; output m11.

39 Related Work
[Diagram positioning systems by capabilities.] Schema-only 1-1 matching: TRANSCM [Milo & Zohar, 1998], ARTEMIS [Castano & Antonellis, 1999], [Palopoli et al., 1998], CUPID [Madhavan et al., 2001]. Single-learner matching: SEMINT [Li & Clifton, 1994], ILA [Perkowitz & Etzioni, 1995]. Hybrid matching: DELTA [Clifton et al., 1997]. LSD [Doan et al., 2000, 2001]: multi-strategy learning with learners + recognizers, schema + data, 1-1 + non 1-1 matching. CLIO [Miller et al., 2000; Yan et al., 2001]: schema + data, 1-1 + non 1-1 matching, sophisticated data-driven user interaction.

40 Summary
LSD uses multi-strategy learning to semi-automatically generate semantic mappings. LSD is extensible, and incorporates domain and user knowledge as well as previous techniques. Experimental results show the approach is very promising. Future work and issues to ponder: accommodating more expressive languages (ontologies); reuse of learned concepts from related domains; semantics? Data management is a fertile area for Machine Learning research!

41 Backup Slides

42 Mapping Maintenance
Mediated schema M is connected to source schema S by mappings m1, m2, m3. Ten months later, the schemas have evolved into M' and S': are the mappings still correct?

43 Information Extraction from Text
Extract data fragments from text documents: e.g., the date, location, and victim's name from a news article. There is intensive research on free-text documents, but many documents do have substantial structure: XML pages, name cards, tables, lists. Each such document = a data source: its structure forms a schema, but with only one data value per schema element, whereas a "real" data source has many data values per schema element. Ongoing research in the IE community.

44 Contribution of Each Component
[Chart: average matching accuracy (%) without the Name Learner, without Naive Bayes, without the Whirl Learner, without the Constraint Handler, and for the complete LSD system.]

45 Exploiting Hierarchical Structure
Existing learners flatten out all structures. We developed an XML learner similar to the Naive Bayes learner (input instance = bag of tokens), but differing in one crucial aspect: it considers not only text tokens, but also structure tokens. Example instances: <contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm> </contact> versus <description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </description>
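The "structure tokens" idea can be sketched in a few lines: tokenize an XML fragment so that tags survive as tokens of their own. This is an illustrative tokenizer, not the paper's XML learner; the regex and token format are assumptions:

```python
import re

# Tokenize an XML fragment into structure tokens (the tags themselves)
# plus text tokens, so that <name>Gail Murphy</name> and free text
# mentioning "Gail Murphy" produce different bags of tokens.

def xml_tokens(fragment):
    tokens = []
    for tag, text in re.findall(r"(</?[\w-]+>)|([^<>]+)", fragment):
        if tag:
            tokens.append(tag)           # structure token, e.g. '<name>'
        else:
            tokens.extend(text.split())  # plain text tokens
    return tokens

structured = "<contact><name>Gail Murphy</name><firm>MAX Realtors</firm></contact>"
flat = "contact Gail Murphy MAX Realtors"
print(xml_tokens(structured))
print(xml_tokens(flat))
```

Feeding these bags to a Naive Bayes-style learner lets the classifier learn that "Gail Murphy" inside a <name> element is evidence for a contact, while the same words in running text are not.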

46 Domain Constraints
Impose semantic regularities on sources; verified using the schema or the data. Examples: a = address & b = address => a = b; a = house-id => a is a key; a = agent-info & b = agent-name => b is nested in a. Constraints can be specified up front when creating the mediated schema, independent of any actual source schema.

47 The Constraint Handler
Predictions from the meta-learner: area: (address, 0.7), (description, 0.3); contact-phone: (agent-phone, 0.9), (description, 0.1); extra-info: (address, 0.6), (description, 0.4). Domain constraint: a = address & b = address => a = b. Candidate assignments and scores: {area: address, contact-phone: agent-phone, extra-info: address}: 0.7 * 0.9 * 0.6 = 0.378, but this violates the constraint; {area: address, contact-phone: agent-phone, extra-info: description}: 0.7 * 0.9 * 0.4 = 0.252, the chosen assignment; {area: description, contact-phone: description, extra-info: description}: 0.3 * 0.1 * 0.4 = 0.012. The handler can accept arbitrary constraints; user feedback is expressed as a domain constraint (e.g., ad-id = house-id). Extended to handle domain heuristics: a = agent-phone & b = agent-name => a and b are usually close to each other.
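The handler's core loop on this slide, enumerate assignments, score by the product of confidences, discard constraint violators, can be sketched directly from the slide's numbers (the product scoring and exhaustive enumeration are a simplification; the real handler is more sophisticated):

```python
from itertools import product

# Constraint handler sketch: enumerate full assignments from the
# meta-learner's per-element predictions, score each by the product of its
# confidences, drop assignments violating the hard constraint, keep the best.

predictions = {
    "area":          [("address", 0.7), ("description", 0.3)],
    "contact-phone": [("agent-phone", 0.9), ("description", 0.1)],
    "extra-info":    [("address", 0.6), ("description", 0.4)],
}

def violates(assignment):
    # Hard constraint from the slide: at most one source element may map
    # to the mediated element "address" (a = address & b = address => a = b).
    targets = [t for t, _ in assignment.values()]
    return targets.count("address") > 1

def best_assignment(preds):
    cols = list(preds)
    best, best_score = None, -1.0
    for choice in product(*preds.values()):
        assignment = dict(zip(cols, choice))
        if violates(assignment):
            continue
        score = 1.0
        for _, conf in choice:
            score *= conf
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score

best, score = best_assignment(predictions)
print(best["extra-info"][0], round(score, 3))  # description 0.252
```

The top-scoring raw assignment (0.378) maps both area and extra-info to address and is pruned, so the handler settles on the 0.252 assignment, exactly as in the slide.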

