A SURVEY OF APPROACHES TO AUTOMATIC SCHEMA MATCHING
Sushant Vemparala, Gaurang Telang

Motivating Example
• Assume UTA needs to integrate 40 databases from its different schools, with a total of 27,000 elements.
• Done manually, the integration would take approximately 12 person-years.
• How would you reduce the manual burden?

Schema Matching
[Figure: Schema 1 and Schema 2]

Schema Matching Definition
Schema matching is defined as the task of finding the semantic correspondences between elements of two schemas.
[Figure: Match takes schemas S1 and S2, plus auxiliary information (user feedback, dictionaries, previous mappings), and produces a match result.]

Application Domains
• Schema integration: developing a global view over a set of independently developed schemas
• Comparing data schemas:
  - Items from different shopping sites
  - Merger between two corporations
  - Preparation of data for data warehousing and analysis processes
Any other examples?

High Level Architecture of Generic Match
http://db18.informatik.uni-leipzig.de:8080/WebEdition/

Classification of Schema Matching Approaches
1) Schema-level matching
   Granularity of schema-level matching:
   • Element level
   • Structure level
2) Instance-level matching
3) Hybrid and composite matching

Schema Level Matching
• Uses only schema-level information (no data content)
• Properties considered: name, description, data type, is-a/part-of relationships, constraints, and structure
• Match finds match candidates, each with a similarity value

Granularity: Element Level
• For each element of the source schema, determine the matching elements in the target schema
• Element level
  o atomic level (attributes in an XML schema)
  o higher level (columns in relational tables)
E.g.: Address = CustomerAddress

Granularity: Structure-Level
• Structure level: matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2.
• Full structure match vs. partial structure match

S1 elements             S2 elements
Address                 CustAddress
Street
City
State                   USState
Zip                     PostalCode

S1 elements             S2 elements
AccountOwner (Finance)  Customer (Sales)
Name                    Cname
Address                 CAddress
Birthdate               CPhone
TaxExempt

Granularity: Structure-Level (Contd)
• Equivalence patterns: structure matching can be enhanced by considering known equivalence patterns stored in a library.

6 October 2015
Matching Cardinality
• One or more S1 elements can match one or more S2 elements.
• 1:1, 1:n, n:1, m:n
[Figure: examples of 1:1, n:1, 1:n, and m:n matches]

Instance Level Matching
• Gives insight into the contents and meaning of schema elements
• Useful when schema information is limited and when semi-structured data is used
• Can correct an incorrect interpretation of schema-level information
E.g.: X is a match candidate for both CompanyName and Manufacturer

Techniques for Schema Level Matching
• Linguistic approaches
Name based (equality of names)
• equality of canonical names (Cust# = CustNo)
• equality of synonyms (make = brand)
• equality of hypernyms (book is-a publication and article is-a publication implies book = article)

Techniques for Schema Level Matching: Name Matching (Contd)
• Similarity based on pronunciation or Soundex (ship2 = ShipTo)
• User-provided name matches (issue = bug)
• Not limited to 1:1 matches (phone = {homePhone, officePhone})
• Context based: payroll application (salary = income) vs. tax reporting application (salary != income)
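The canonical-name and synonym checks listed above can be illustrated with a minimal Python sketch. The abbreviation table, the synonym table, and the 0.9 score for synonyms are hypothetical choices made for this example, and the function names are not taken from any tool in the survey.

```python
import re

# Hypothetical abbreviation expansions used to reach a canonical name.
ABBREVIATIONS = {"#": "no", "num": "no", "number": "no"}

# Hypothetical synonym pairs; a real matcher would consult a dictionary or thesaurus.
SYNONYMS = {frozenset({"make", "brand"}), frozenset({"issue", "bug"})}


def canonical(name: str) -> str:
    """Split a name into lower-case tokens and expand known abbreviations."""
    # Insert a space at camelCase boundaries, e.g. "CustNo" -> "Cust No".
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    tokens = re.findall(r"[a-z]+|[0-9]+|#", spaced.lower())
    return "".join(ABBREVIATIONS.get(token, token) for token in tokens)


def name_similarity(a: str, b: str) -> float:
    """Return 1.0 for equal canonical names, 0.9 for listed synonyms, else 0.0."""
    ca, cb = canonical(a), canonical(b)
    if ca == cb:
        return 1.0
    if frozenset({ca, cb}) in SYNONYMS:
        return 0.9  # arbitrary illustrative score for a synonym hit
    return 0.0


if __name__ == "__main__":
    print(name_similarity("Cust#", "CustNo"))
    print(name_similarity("make", "brand"))
    print(name_similarity("salary", "income"))
```

Context-dependent matches such as salary vs. income would need an extra input (for example, a per-application synonym table), which this sketch leaves out.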

Techniques for Schema Level Matching
• Description based
E.g.: comments in schema elements

Techniques for Schema Level Matching
• Constraint-based mapping
  - E.g.: data types and value ranges, optionality, relationship types, cardinalities, etc.
  - Combined with other matchers to limit match candidates

Techniques for Schema Level Matching
• Reusing schema and mapping information
  - Idea: schemas from the same domain are often very similar, e.g. address fields and name fields are repeated
  - Create a schema library that schema editors can access (analogy: XML namespaces)
S->S2 (known) Goal: S1->S? S1->S2? (easy to find)

Techniques for Instance Level
• IR techniques (measures such as the Jaccard coefficient)
• Constraint-based characterization (EmpNo range vs. DeptNo range)
• Auxiliary information
• Learning (e.g.: evaluate S1 contents -> Characterization 1, then evaluate S2 contents against Characterization 1)
Drawback of instance-based matching?
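As a small illustration of the IR-style measure mentioned above, the following Python sketch computes the Jaccard coefficient between the sets of values found in two columns. The column values are made up for the example, and returning 0.0 for two empty columns is a convention chosen here.

```python
def jaccard(values_a, values_b) -> float:
    """Jaccard coefficient: size of the intersection over size of the union."""
    set_a, set_b = set(values_a), set(values_b)
    if not set_a and not set_b:
        return 0.0  # convention chosen for this sketch: two empty columns score 0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    # Hypothetical instance data for two columns from different sources.
    location = ["Miami, FL", "Boston, MA"]
    area = ["Boston, MA", "Seattle, WA"]
    print(jaccard(location, area))  # one shared value out of three distinct values
```

Exact value overlap is only useful when the two sources describe overlapping real-world objects; columns with the same meaning but disjoint contents would score 0 under this measure.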

Combining Matcher: Hybrid Matcher
• Integrates multiple matching criteria
E.g.: a matcher with name matching and constraint-based matching
• Single pass
• Matching criteria are hard-wired

Combining Matcher: Composite Matcher
• Combines the results of several independently executed matchers
• Iterative (the match result of the 1st matcher is consumed by the 2nd matcher)
• Flexible ordering
Which is more efficient: hybrid or composite?
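One possible way to combine independently executed matchers is a weighted average of their similarity scores. The Python sketch below assumes that combination rule; the two toy matchers, the weights, and the threshold are all hypothetical and only serve to show the shape of a composite matcher.

```python
from typing import Callable, Dict, List, Tuple

Matcher = Callable[[str, str], float]


def exact_name(a: str, b: str) -> float:
    """Toy matcher: 1.0 when the names are equal ignoring case."""
    return 1.0 if a.lower() == b.lower() else 0.0


def contains_name(a: str, b: str) -> float:
    """Toy matcher: 1.0 when one name contains the other, ignoring case."""
    x, y = a.lower(), b.lower()
    return 1.0 if x in y or y in x else 0.0


def composite_match(
    s1: List[str],
    s2: List[str],
    matchers: List[Tuple[Matcher, float]],
    threshold: float = 0.5,
) -> Dict[Tuple[str, str], float]:
    """Run every matcher on every pair and keep pairs whose weighted average passes the threshold."""
    total_weight = sum(weight for _, weight in matchers)
    if total_weight <= 0:
        raise ValueError("matcher weights must sum to a positive number")
    result = {}
    for a in s1:
        for b in s2:
            score = sum(weight * matcher(a, b) for matcher, weight in matchers) / total_weight
            if score >= threshold:
                result[(a, b)] = score
    return result


if __name__ == "__main__":
    candidates = composite_match(
        ["Address", "Name"],
        ["CustomerAddress", "Cname"],
        [(exact_name, 1.0), (contains_name, 1.0)],
    )
    print(candidates)
```

Because the matchers are passed in as a list, they can be added, removed, or reweighted without touching the combination code, which is the flexibility the composite approach offers over hard-wired criteria.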

Summarization

How good is a Match?
• Assessing match quality is difficult
• Human verification and tuning of matching is often required
• A useful metric would be to measure the amount of human work required to reach the perfect match
Recall: how many of the good matches did we show?
Precision: how many of the matches we show are good?
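The two measures can be computed by comparing the proposed matches against a manually verified set of correct matches. The Python sketch below does this; the element pairs in the usage example are hypothetical, and returning 0.0 when a set is empty is a convention chosen here.

```python
def precision_recall(proposed, correct):
    """Return (precision, recall) of proposed matches against the correct matches."""
    proposed, correct = set(proposed), set(correct)
    true_positives = len(proposed & correct)
    # Precision: fraction of the proposed matches that are correct.
    precision = true_positives / len(proposed) if proposed else 0.0
    # Recall: fraction of the correct matches that were proposed.
    recall = true_positives / len(correct) if correct else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical matcher output and hand-checked reference matches.
    proposed = {("Address", "CustAddress"), ("Zip", "PostalCode"), ("Birthdate", "CPhone")}
    correct = {("Address", "CustAddress"), ("Zip", "PostalCode"),
               ("State", "USState"), ("Name", "Cname")}
    print(precision_recall(proposed, correct))
```

In this example two of the three proposed matches are correct (precision 2/3) and two of the four correct matches were found (recall 1/2).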

Current Work
• LSD
• SKAT
• Similarity Flooding

LSD (Learning Source Descriptions)
• Produces 1:1 instance-level mappings
Suppose a user wants to integrate 100 data sources:
1. User:
   • manually creates mappings for a few sources, say 3
   • shows LSD these mappings
2. LSD learns from the mappings
   • "Multi-strategy" learning incorporates many types of information in a general way
   • Knowledge of constraints further helps
3. LSD proposes mappings for the remaining 97 sources

LSD: Example
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data:
  location: Miami, FL; Boston, MA; ...
  listed-price: $250,000; $110,000; ...
  phone: (305) 729 0831; (617) 253 1429; ...
  comments: Fantastic house; Great location; ...
homes.com data:
  price: $550,000; $320,000; ...
  contact-phone: (278) 345 7215; (617) 335 2315; ...
  extra-info: Beautiful yard; Great beach; ...
Learned hypotheses:
  If "phone" occurs in the name => agent-phone
  If "fantastic" & "great" occur frequently in data values => description

LSD: Training the Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data:
  Miami, FL | $250,000 | (305) 729 0831 | Fantastic house
  Boston, MA | $110,000 | (617) 253 1429 | Great location
Name Learner: (location, address), (listed-price, price), (phone, agent-phone), (comments, description), ...
Naive Bayes Learner: ("Miami, FL", address), ("$ 250,000", price), ("(305) 729 0831", agent-phone), ("Fantastic house", description), ...

LSD: Applying the Learners
Mediated schema: address, price, agent-phone, description
Schema of homes.com: area, day-phone, extra-info
homes.com data:
  area: Seattle, WA; Kent, WA; Austin, TX
  day-phone: (278) 345 7215; (617) 335 2315; (512) 427 1115
  extra-info: Beautiful yard; Great beach; Close to Seattle
[Figure: the Name Learner and the Naive Bayes learner feed the Meta-Learner. Predictions shown: (address,0.8), (description,0.2); (address,0.6), (description,0.4); (address,0.7), (description,0.3); (address,0.6), (description,0.4); (address,0.7), (description,0.3); (agent-phone,0.9), (description,0.1)]
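The slide does not say how predictions for individual data values are turned into one prediction per source column. The Python sketch below assumes a plain average over the per-value predictions; that rule is an assumption made for illustration, not a statement of how LSD works, and the function name is hypothetical. The sample input reuses the first three predictions listed on the slide.

```python
from collections import defaultdict


def combine_predictions(instance_predictions):
    """Average per-value label scores into a single column-level prediction.

    Assumption for this sketch: every value's prediction counts equally.
    """
    if not instance_predictions:
        return {}
    totals = defaultdict(float)
    for prediction in instance_predictions:
        for label, score in prediction.items():
            totals[label] += score
    count = len(instance_predictions)
    return {label: total / count for label, total in totals.items()}


if __name__ == "__main__":
    per_value = [
        {"address": 0.8, "description": 0.2},
        {"address": 0.6, "description": 0.4},
        {"address": 0.7, "description": 0.3},
    ]
    combined = combine_predictions(per_value)
    print({label: round(score, 2) for label, score in combined.items()})
```

Under this assumption the three predictions average to roughly 0.7 for address and 0.3 for description.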

SKAT (Semantic Knowledge Articulation Tool)
• The expert supplies SKAT with a few initial rules
Ex: 1) Match US.president US.chancellor 2) Mismatch human.nail factory.nail
• SKAT articulates on the supplied matching rules
• The expert approves or rejects
• Creates correct rules and computes an updated articulation (knowledge gained from irrelevant and rejected rules is stored)

Similarity Flooding
• Intuition: whenever any two elements in the graphs G1 and G2 are similar, their neighbors tend to be similar.
• Transform schemas into directed labeled graphs
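The intuition can be sketched in Python as a fixpoint-style propagation over pairs of nodes. This is a deliberately simplified illustration, not the published algorithm: it assumes uniform propagation weights, propagates in both edge directions, normalizes by the maximum score, and runs a fixed number of iterations. The two toy graphs and the initial similarity are hypothetical.

```python
from itertools import product


def similarity_flooding(edges1, edges2, initial, iterations=10):
    """Propagate similarity between node pairs joined by equally labeled edges.

    edges1, edges2: lists of (source, label, target) triples for graphs G1 and G2.
    initial: dict mapping (node1, node2) pairs to a starting similarity.
    """
    # Pair (a, b) and pair (a2, b2) become neighbors when a -> a2 in G1 and
    # b -> b2 in G2 carry the same edge label.
    neighbors = {}
    for (a, label1, a2), (b, label2, b2) in product(edges1, edges2):
        if label1 != label2:
            continue
        neighbors.setdefault((a, b), set()).add((a2, b2))
        neighbors.setdefault((a2, b2), set()).add((a, b))

    sigma = {pair: initial.get(pair, 0.0) for pair in neighbors}
    for _ in range(iterations):
        # Each pair keeps its own score and receives its neighbors' scores.
        updated = {
            pair: sigma[pair] + sum(sigma[n] for n in neighbors[pair])
            for pair in sigma
        }
        top = max(updated.values(), default=0.0)
        if top == 0.0:
            break
        # Normalize so the best pair has similarity 1.0.
        sigma = {pair: value / top for pair, value in updated.items()}
    return sigma


if __name__ == "__main__":
    g1 = [("Address", "child", "Street"), ("Address", "child", "Zip")]
    g2 = [("CustAddress", "child", "Street"), ("CustAddress", "child", "PostalCode")]
    scores = similarity_flooding(g1, g2, {("Street", "Street"): 1.0})
    for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
        print(pair, round(score, 3))
```

Starting from a single known similarity (Street, Street), the parent pair (Address, CustAddress) picks up similarity from its children, which in turn raises the scores of the remaining child pairs.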

Similarity Flooding Example

Conclusion
• User feedback:
  - User interaction: minimize user input but maximize the impact of the feedback
  - If we require user acceptance for our matches, what happens when our matcher returns hundreds or thousands of matches?
  - The more configurable the matcher, the better
• Problems with schema representation and data
  - Dealing with inconsistent data values for a schema element
  - Independence of schema representation
• Mapping maintenance: what happens when you map between two schemas and then one changes?
• Sophisticated techniques are required for n:m matches [current work is based on 1:1]

Conclusion
• More attention is needed on:
  1) Re-use opportunities
  2) Learning from user feedback
Any other issues to address?

THANK YOU!
