AnHai Doan, Pedro Domingos, Alon Halevy University of Washington

The LSD Project
Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach
Good morning, everyone. My name is AnHai Doan. I’m here to talk about the LSD project at the University of Washington. In this project we apply machine learning techniques to address the schema-matching problem. This is joint work with my advisors, Pedro Domingos and Alon Halevy. We have just heard a great talk from Ling Ling on the Clio project, which addresses schema matching from a data translation perspective. Our project, however, began by looking at this problem from a data-integration perspective. So let me first motivate the problem by briefly reviewing the basic ideas underlying data integration.

Data Integration
User query: Find houses with four bathrooms priced under $500,000
Diagram: mediated schema; source schema 1, source schema 2, source schema 3; one wrapper per source; sources realestate.com, homeseekers.com, homes.com
In a nutshell, a data integration system provides a uniform interface to a set of data sources. For example, suppose we have a data integration system over three real-estate sources on the Web. This system should have a mediated schema, which describes the real-estate domain, and three source schemas which describe the contents of the sources. There are also three wrappers that help translate data between the sources and source schemas. Now, given a user query, such as this one [POINT], which is formulated in the mediated schema, the system translates it into queries in the source schemas, then executes these queries with the help of the wrappers. The system then takes the data returned by the sources, and combines them to produce final answers to the user query. To build such a system, ...

Semantic Mappings between Schemas
Mediated & source schemas = XML DTDs
Mediated schema: house, address, contact-info, num-baths, agent-name, agent-phone
Source schema: house, location, contact, full-baths, half-baths, name, phone
Mapping types shown: 1-1 mapping, non 1-1 mapping
Points:
0) Sources export data in XML
1) Mediated & source schemas are represented with XML DTDs
2) Describe different types of mappings: 1-1 mappings, more complex mappings
3) We focus on 1-1 mappings
Note: when defining the schema-matching problem, say clearly that we are *given* the two schemas. All we need to do is to find the mappings. Do not talk about handicap-equipped => amenities. Not enough time. Not important.

Current State of Affairs
Finding semantic mappings is now the bottleneck!
  largely done by hand
  labor intensive & error prone
Will only be exacerbated
  data sharing & XML become pervasive
  proliferation of DTDs
  translation of legacy data
  reconciling ontologies on the semantic web
Need (semi-)automatic approaches to scale up!
Points:
-- what’s the state of research in data integration today?
-- this and this are well-understood
-- schema matching is now the bottleneck => need to automate

The LSD (Learning Source Descriptions) Approach
Suppose user wants to integrate 100 data sources
1. User
  manually creates mappings for a few sources, say 3
  shows LSD these mappings
2. LSD learns from the mappings
3. LSD proposes mappings for remaining 97 sources
Points:
1) Introduce our approach.
2) We do not manually map the schemas of all sources to mediated schema. The goal is to manually mark up only a few sources, and be able to learn from the marked up sources to successfully propose mappings for subsequent sources.
3) Once the markup is done, there are many different types of information to learn from.

Example
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
Data listings from realestate.com:
  location     listed-price   phone   comments
  Miami, FL    $250,000       (305)   Fantastic house
  Boston, MA   $110,000       (617)   Great location
  ...          ...            ...     ...
Learned hypotheses:
  If “phone” occurs in the name => agent-phone
  If “fantastic” & “great” occur frequently in data values => description
Data listings from homes.com:
  price      contact-phone   extra-info
  $550,000   (278)           Beautiful yard
  $320,000   (617)           Great beach
  ...        ...             ...

Our Contributions
1. Use of multi-strategy learning
  well-suited to exploit multiple types of knowledge
  highly modular & extensible
2. Extend learning to incorporate constraints
  handle a wide range of domain & user-specified constraints
3. Develop XML learner
  exploit hierarchical nature of XML

Multi-Strategy Learning
Use a set of base learners
  each exploits well certain types of information
Match schema elements of a new source
  apply the base learners
  combine their predictions using a meta-learner
Meta-learner
  uses training sources to measure base learner accuracy
  weighs each learner based on its accuracy
Points:
1) Introduce the multi-strategy learning approach
Note: In the next several slides we are going to describe the multi-strategy learning approach in more detail.
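As a rough illustration of the combination step, the sketch below merges the confidence scores of two base learners using one fixed weight per learner. The function name `combine_predictions`, the weights 0.4 and 0.6, the input scores, and the final renormalization are all assumptions made for this example; the slide only says that the meta-learner weighs each learner based on its accuracy on the training sources.

```python
from collections import defaultdict

def combine_predictions(predictions, weights):
    """Weighted sum of per-learner confidence scores, renormalized to sum to 1.

    predictions: {learner_name: {label: confidence}}
    weights:     {learner_name: weight}  (hypothetical values, not from LSD)
    """
    combined = defaultdict(float)
    for learner, scores in predictions.items():
        for label, confidence in scores.items():
            combined[label] += weights[learner] * confidence
    total = sum(combined.values())
    return {label: score / total for label, score in combined.items()}

# Hypothetical base-learner outputs for one source element.
predictions = {
    "name_learner": {"address": 0.5, "description": 0.5},
    "naive_bayes": {"address": 0.8, "description": 0.2},
}
# Hypothetical weights; a real meta-learner would estimate them from training sources.
weights = {"name_learner": 0.4, "naive_bayes": 0.6}

combined = combine_predictions(predictions, weights)
best_label = max(combined, key=combined.get)
```

Here the undecided name learner is outweighed by the more confident Naive Bayes prediction, so `best_label` is "address".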

Base Learners
Input
  schema information: name, proximity, structure, ...
  data information: value, format, ...
Output
  prediction weighted by confidence score
Examples
  Name learner
    agent-name => (name,0.7), (phone,0.3)
  Naive Bayes learner
    “Kent, WA” => (address,0.8), (name,0.2)
    “Great location” => (description,0.9), (address,0.1)
Points:
1) Describe learners in general
2) Describe two example learners in detail
Note: must say clearly and briefly how each example learner works.
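To make the data-based learner more concrete, here is a minimal sketch of a Naive Bayes classifier over word tokens with add-one smoothing. It is a generic textbook formulation, not LSD's actual implementation: the class name `NaiveBayesLearner`, the tokenizer, and the four training values (borrowed from the real-estate examples in the slides) are assumptions for illustration.

```python
import math
import re
from collections import Counter, defaultdict

class NaiveBayesLearner:
    """Toy multinomial Naive Bayes over word tokens (illustrative only)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.label_counts = Counter()             # label -> number of examples
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def train(self, examples):
        for text, label in examples:
            tokens = self.tokenize(text)
            self.label_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def predict(self, text):
        """Return {label: confidence}, with confidences summing to 1."""
        tokens = self.tokenize(text)
        total_examples = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            label_total = sum(self.token_counts[label].values())
            score = math.log(count / total_examples)
            for token in tokens:
                # Add-one (Laplace) smoothing so unseen tokens do not zero out a label.
                score += math.log(
                    (self.token_counts[label][token] + 1)
                    / (label_total + len(self.vocabulary))
                )
            log_scores[label] = score
        # Turn log scores into normalized confidence scores.
        max_score = max(log_scores.values())
        exp_scores = {l: math.exp(s - max_score) for l, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {l: s / norm for l, s in exp_scores.items()}

learner = NaiveBayesLearner()
learner.train([
    ("Miami, FL", "address"),
    ("Boston, MA", "address"),
    ("Fantastic house", "description"),
    ("Great location", "description"),
])
prediction = learner.predict("Great beach")
```

With these four training values, `prediction` favors description for "Great beach", because the token "great" was seen only under that label.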

Training the Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
Data listings from realestate.com:
  <location> Miami, FL </> <listed-price> $250,000</> <phone> (305) </> <comments> Fantastic house </>
  <location> Boston, MA </> <listed-price> $110,000</> <phone> (617) </> <comments> Great location </>
Training examples for the Name Learner:
  (location, address)
  (listed-price, price)
  (phone, agent-phone)
  (comments, description)
  ...
Training examples for the Naive Bayes Learner:
  (“Miami, FL”, address)
  (“$ 250,000”, price)
  (“(305) ”, agent-phone)
  (“Fantastic house”, description)
  ...
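The way a manually mapped source turns into training examples can be sketched as follows. The function name `build_training_examples` and the dictionary-based representation of mappings and listings are assumptions for illustration; only the element names and data values come from the slide.

```python
def build_training_examples(mapping, listings):
    """Derive training examples from one manually mapped source.

    mapping:  {source_element: mediated_element}, supplied by the user
    listings: list of {source_element: data_value} dicts from that source
    Returns (name_examples, value_examples):
      name_examples  pair element names with labels (for a name-based learner)
      value_examples pair data values with labels (for a data-based learner)
    """
    name_examples = sorted(mapping.items())
    value_examples = [
        (value, mapping[element])
        for listing in listings
        for element, value in listing.items()
        if element in mapping
    ]
    return name_examples, value_examples

mapping = {
    "location": "address",
    "listed-price": "price",
    "phone": "agent-phone",
    "comments": "description",
}
listings = [
    {"location": "Miami, FL", "listed-price": "$250,000", "comments": "Fantastic house"},
    {"location": "Boston, MA", "listed-price": "$110,000", "comments": "Great location"},
]
name_examples, value_examples = build_training_examples(mapping, listings)
```

Each mapped element contributes one name example, while every data value in the listings contributes one value example, so the data-based learner sees far more training instances than the name-based one.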

Applying the Learners
Schema of homes.com: area, day-phone, extra-info
Mediated schema: address, price, agent-phone, description
Each data instance goes through the Name Learner and Naive Bayes, then the Meta-Learner.
area
  <area>Seattle, WA</>   (address,0.8), (description,0.2)
  <area>Kent, WA</>      (address,0.6), (description,0.4)
  <area>Austin, TX</>    (address,0.7), (description,0.3)
  Prediction for area: (address,0.7), (description,0.3)
day-phone
  <day-phone>(278) </> <day-phone>(617) </> <day-phone>(512) </>
  Prediction for day-phone: (agent-phone,0.9), (description,0.1)
extra-info
  <extra-info>Beautiful yard</> <extra-info>Great beach</> <extra-info>Close to Seattle</>
  Prediction for extra-info: (address,0.6), (description,0.4)
Points:
1) Explain how the example learners are applied to match “area” with “address”.
2) Note that in general an arbitrary number of learners can be plugged into the matching process. If we have a new learner that has been trained, we can simply plug it in. This shows how easy it is to add a new learner in our approach.
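The slide does not say how the three per-instance predictions for area become a single prediction for the element. Plain averaging is one assumption that happens to reproduce the slide's (address,0.7), (description,0.3), and the sketch below shows that arithmetic. The function name `average_predictions` is hypothetical.

```python
from collections import defaultdict

def average_predictions(instance_predictions):
    """Average per-instance {label: confidence} dicts into one element-level dict."""
    totals = defaultdict(float)
    for prediction in instance_predictions:
        for label, confidence in prediction.items():
            totals[label] += confidence
    count = len(instance_predictions)
    return {label: total / count for label, total in totals.items()}

# Per-instance predictions for the three <area> values shown on the slide.
area_predictions = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]
area_summary = average_predictions(area_predictions)
```

Up to floating-point rounding, `area_summary` gives address a score of 0.7 and description a score of 0.3.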

Domain Constraints
Impose semantic regularities on sources
  verified using schema or data
Examples
  a = address & b = address => a = b
  a = house-id => a is a key
  a = agent-info & b = agent-name => b is nested in a
Can be specified up front
  when creating mediated schema
  independent of any actual source schema

The Constraint Handler
Predictions from Meta-Learner
  area: (address,0.7), (description,0.3)
  contact-phone: (agent-phone,0.9), (description,0.1)
  extra-info: (address,0.6), (description,0.4)
Domain Constraints
  a = address & b = address => a = b
Mapping combinations and their scores
  0.3 x 0.1 x 0.4 = 0.012
  area: address, contact-phone: agent-phone, extra-info: address          0.7 x 0.9 x 0.6 = 0.378
  area: address, contact-phone: agent-phone, extra-info: description      0.7 x 0.9 x 0.4 = 0.252
Can specify arbitrary constraints
User feedback = domain constraint
  ad-id = house-id
Extended to handle domain heuristics
  a = agent-phone & b = agent-name => a & b are usually close to each other
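The scores on this slide are products of the per-element confidences, and the address constraint rules out any combination in which two elements both map to address. The sketch below works through that example by exhaustive enumeration. The enumeration strategy and the names `best_mapping` and `at_most_one_address` are assumptions for illustration; the slide does not describe how the handler searches the space of combinations.

```python
from itertools import product
from math import prod

def best_mapping(predictions, constraints):
    """Return the highest-scoring label assignment that satisfies every constraint.

    predictions: {source_element: {label: confidence}}
    constraints: list of functions taking a candidate {element: label} dict
                 and returning True when the candidate is acceptable
    """
    elements = list(predictions)
    best, best_score = None, 0.0
    # Try every way of picking one label per element.
    for labels in product(*(predictions[e] for e in elements)):
        candidate = dict(zip(elements, labels))
        if not all(check(candidate) for check in constraints):
            continue
        score = prod(predictions[e][candidate[e]] for e in elements)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

def at_most_one_address(candidate):
    """Encodes: a = address & b = address => a = b."""
    return list(candidate.values()).count("address") <= 1

predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}
mapping, score = best_mapping(predictions, [at_most_one_address])
```

Without the constraint, the top combination would be the one scoring 0.378, which maps both area and extra-info to address. With the constraint, the search settles on the 0.252 combination, where extra-info maps to description.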

Putting It All Together: the LSD System
Diagram phases: Training Phase, Matching Phase
Diagram labels: Mediated schema; Source schemas; Domain Constraints; Data listings; Training data for base learners; User Feedback; Constraint Handler; L1, L2, ..., Lk; Mapping Combination
Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner
Meta-learner
  uses stacking [Ting&Witten99, Wolpert92]
  returns linear weighted combination of base learners’ predictions

Empirical Evaluation
Four domains
  Real Estate I & II, Course Offerings, Faculty Listings
For each domain
  create mediated DTD & domain constraints
  choose five sources
  extract & convert data listings into XML
  mediated DTDs: elements, source DTDs:
Ten runs for each experiment - in each run:
  manually provide 1-1 mappings for 3 sources
  ask LSD to propose mappings for remaining 2 sources
  accuracy = % of 1-1 mappings correctly identified
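The accuracy measure defined on this slide can be written as a small function. The sketch below assumes mappings are stored as dictionaries from source elements to mediated-schema elements. The function name `matching_accuracy` and both example mappings are hypothetical and are not results from the experiments.

```python
def matching_accuracy(proposed, correct):
    """Percentage of source elements whose proposed 1-1 mapping matches the manual one."""
    if not correct:
        raise ValueError("no reference mappings to score against")
    hits = sum(
        1 for element, label in correct.items() if proposed.get(element) == label
    )
    return 100.0 * hits / len(correct)

# Hypothetical reference mappings and system output for one test source.
correct = {
    "area": "address",
    "day-phone": "agent-phone",
    "extra-info": "description",
    "price": "price",
}
proposed = {
    "area": "address",
    "day-phone": "agent-phone",
    "extra-info": "address",
    "price": "price",
}
accuracy = matching_accuracy(proposed, correct)
```

In this made-up case three of the four elements are matched correctly, so `accuracy` is 75.0.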

High Matching Accuracy
LSD’s accuracy: %
[Figure: Average Matching Accuracy (%) for: Best single base learner; + Meta-learner; + Constraint handler; + XML learner]

Performance Sensitivity
[Figure: Average matching accuracy (%) vs. Number of data listings per source]

Contribution of Schema vs. Data
[Figure: Average matching accuracy (%)]
More experiments in the paper!

Related Work
Rule-based approaches
  TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98], CUPID [Madhavan et al. 01]
  utilize only schema information
Learner-based approaches
  SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95]
  employ a single learner, limited applicability
Others
  DELTA [Clifton et al. 97], CLIO [Miller et al. 00][Yan et al. 01]
Multi-strategy learning in other domains
  series of workshops [91,93,96,98,00]
  [Freitag98], Proverb [Keim et al. 99]
Points:
1) compare and contrast related work with ours.
2) learners in the learner-based approach could be incorporated into ours.
3) multi-strategy learning has been applied to many domains, but not yet to schema matching.

Summary
LSD project
  applies machine learning to schema matching
Main ideas & contributions
  use of multi-strategy learning
  extend learning to handle domain & user-specified constraints
  develop XML learner
System design: A contribution to generic schema-matching
  highly modular & extensible
  handle multiple types of knowledge
  continuously improve over time
Points: See slide.

Ongoing & Future Work
Improve accuracy
  address current system limitations
Extend LSD to more complex mappings
Apply LSD to other application contexts
  data translation
  data warehousing
  e-commerce
  information extraction
  semantic web
Points:
1) Here’s a roadmap of the problem. And this is our first step.
2) We are currently addressing the other issues.
3) We talk more about them when we get to future work.

Contribution of Each Component
[Figure: Average Matching Accuracy (%) for: Without Name Learner; Without Naive Bayes; Without Whirl Learner; Without Constraint Handler; The complete LSD system]

Exploiting Hierarchical Structure
Existing learners flatten out all structures
Developed XML learner
  similar to the Naive Bayes learner
    input instance = bag of tokens
  differs in one crucial aspect
    considers not only text tokens, but also structure tokens
Example:
<contact>
  <name> Gail Murphy </name>
  <firm> MAX Realtors </firm>
</contact>
<description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </description>
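The difference between text tokens and structure tokens can be illustrated with the two instances from the slide. In the sketch below, a structure token is simply the tag name of each nested element. This encoding, along with the function names `text_tokens`, `structure_tokens`, and `xml_tokens`, is an assumption for illustration; the slide does not spell out the token scheme the XML learner uses.

```python
import re
import xml.etree.ElementTree as ET

def text_tokens(element):
    """Lower-cased word tokens from all text inside the element."""
    text = " ".join(element.itertext())
    return re.findall(r"[a-z0-9]+", text.lower())

def structure_tokens(element):
    """One token per nested sub-element, written as its tag (assumed encoding)."""
    return [f"<{child.tag}>" for child in element.iter() if child is not element]

def xml_tokens(xml_string):
    """Bag of tokens for one instance: structure tokens followed by text tokens."""
    element = ET.fromstring(xml_string)
    return structure_tokens(element) + text_tokens(element)

contact = "<contact><name>Gail Murphy</name><firm>MAX Realtors</firm></contact>"
description = (
    "<description>To see it, contact Gail Murphy at MAX Realtors.</description>"
)
contact_bag = xml_tokens(contact)
description_bag = xml_tokens(description)
```

Both bags share the text tokens "gail", "murphy", "max", and "realtors", so a learner that only sees text could confuse the two instances. Only `contact_bag` contains the structure tokens `<name>` and `<firm>`, which is the kind of evidence a structure-aware learner can use to tell them apart.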
