CSE 636 Data Integration Schema Matching Cupid Fall 2006.

Presentation transcript:

1 CSE 636 Data Integration Schema Matching Cupid Fall 2006

2 Virtual Integration Architecture
[Figure labels: End User, Query, Result, Mediator, Global Schema, Wrapper, Local Schema (x2), Data Source (x2), XML, Web Services]
Design-Time
– Mediation Language
– Schema Matching
Run-Time
– Query Reformulation
– Optimization & Execution

3 Schema Heterogeneity
Independently created schemas…
… might be modeling similar information…
… in slightly different ways
[Figure: three schema trees, DB1, DB2 and DB3. Labels as extracted, DB1 and DB3: name, ugradID, ugrad *, enrollment *, courseID, ugradID, grade, type, courseID, course *, student *, studentID, name, type, letter, title ?. DB2: evaluation, studentID, student *, course *, courseID, title, name, type]

4 Schema Heterogeneity
[Figure: the DB1, DB2 and DB3 schema trees from slide 3]
Similar entities represented
Dissimilar structures (inverted nesting)
Different element names for similar data values
Similar element names for different data values

5 Schema Matching vs. Schema Mapping
GAV and LAV are schema mapping languages
Mappings:
– set of queries
– associations + semantics
Match:
– set of associations only
Schema Matching:
– Identifying associations
– First step towards constructing mappings

6 Schema Matching vs. Schema Mapping
[Figure: the DB1 and DB3 schema trees from slide 3, annotated "Associations" and "Semantics"]
LAV Mapping relating DB1 to a query Q(DB3):
for $s1 in DB3/student
where $s1/type = 'UGRAD'
return {$s1/studentID} {$s1/name}

7 The Problem of Schema Matching
Input
– Schemas S1 and S2
– Possibly data instances for S1 and S2
– Background knowledge: thesauri, validated matches, standard schemas, reference instances, ontologies, constraints (keys, data types, etc.)
Output
– Associations between S1 and S2
Goal
– Schema matching tools with significant automated support

8 Schema Matching
How is the match result expressed?
[Figure: the DB2 and DB3 schema trees from slide 3]
Pairs of paths
Lists of paths
Schema names

9 Schema Matching
What do we match?
Depends on the queries we want to ask
1. Elements in isolation (leaves in particular)
2. Substructures
3. Whole schemas

10 Motivation
Important component in many applications
– Data Integration
– Data Migration
– E-Commerce
Model Management [Bernstein, Halevy, Pottinger ’00]
– Algebra for manipulating models and mappings
– Match, Merge, Compose …

11 Problems
Minimize user involvement (semi-automatic)
Data model independent matching (generic)
Schema matching is a hard problem
– Naming and structural differences in schemas
– Similar, but non-identical concepts modeled
– Multiple data models: SQL DDL, XML, ODMG…

12 Schema Matching Approaches
How to match?
Individual matchers
– Schema-based
  – Per-Element
    – Linguistic: Names, Descriptions
    – Constraint-based: Types, Keys
  – Structural
    – Constraint-based: Graph matching
– Content-based
  – Per-Element
    – Linguistic: IR (word frequencies, key terms)
    – Constraint-based: Value pattern and ranges
Combined matchers
– Hybrid
– Composite
  – manual composition
  – automatic composition
Taxonomy-based survey: Rahm and Bernstein, VLDB J, 2001

13 Cupid
[Figure: the matcher taxonomy from slide 12 (individual matchers: schema-based and content-based, per-element and structural, linguistic and constraint-based; combined matchers: hybrid and composite), used to position Cupid]
Madhavan, Bernstein and Rahm, VLDB, 2001

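14 Cupid Example
[Figure: two purchase-order schema trees. First schema labels: PO, POLines, Item, Line, Qty, UoM, POShipTo, City, Street. Second schema labels: PurchaseOrder, Items, Item, ItemNumber, Quantity, UnitofMeasure, DeliverTo, Address, City, Street, Name]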

15 Cupid Architecture
[Figure labels: Schema 1, Schema 2, Thesaurus, Linguistic Matching, LSIM, Structure Matching, SSIM, WSIM, Generate Mapping, Output Mapping]

16 Linguistic Matching
Heuristic name matching
– Tokenization of names: POOrderNum → PO, Order, Num
– Expansion of short-forms, acronyms: PO → Purchase, Order; Num → Number
– Clustering of schema elements based on keywords and data-types: Street, City, POAddress → Address
– Thesaurus of synonyms, hypernyms, acronyms
– Linguistic Similarity coefficient (LSIM) ∈ [0,1]
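
The tokenization and expansion steps above can be sketched in a few lines of Python. This is only an illustration, not Cupid's implementation: the `ABBREVIATIONS` table is a hypothetical stand-in for the thesaurus, and the `lsim` function uses plain Jaccard overlap of expanded tokens as a simplified substitute for the real linguistic similarity coefficient.

```python
import re

# Hypothetical abbreviation table standing in for a real thesaurus.
ABBREVIATIONS = {"po": ["purchase", "order"], "num": ["number"], "qty": ["quantity"]}

def tokenize(name):
    """Split an element name into lowercase tokens at case changes and digits."""
    pattern = r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+"
    return [token.lower() for token in re.findall(pattern, name)]

def expand(tokens):
    """Replace known short forms and acronyms by their expansions."""
    expanded = []
    for token in tokens:
        expanded.extend(ABBREVIATIONS.get(token, [token]))
    return expanded

def lsim(name_a, name_b):
    """Jaccard overlap of expanded token sets, in [0, 1] (an assumed simplification)."""
    a = set(expand(tokenize(name_a)))
    b = set(expand(tokenize(name_b)))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

print(tokenize("POOrderNum"))
print(expand(tokenize("POOrderNum")))
print(lsim("PO", "PurchaseOrder"))
```

A fuller matcher would also consult synonyms and hypernyms and take data types into account, as the slide lists; the sketch covers only tokenization and expansion.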

17 Structure Matching
[Figure: the two purchase-order schema trees from slide 14 (PO and PurchaseOrder)]

18 Structure Matching: Mutually Reinforcing Similarity
[Figure: PO subtree (labels: POLines, Item, Line, Qty, UoM) next to PurchaseOrder subtree (labels: Items, Item, ItemNum, Quantity, UnitofMeasure)]
WSIM > th_high: SSIM++

19 Structure Matching: Context Dependent Disambiguation
[Figure: PO with POShipTo and POBillTo (labels: Street, City) next to PurchaseOrder with DeliverTo and InvoiceTo (labels: Address, Street, City); element pairs are annotated SSIM++ or SSIM--]

20 Intuition
Atomic elements are similar if
– They are linguistically and data-type similar
– Their ancestors are similar
Compound elements (non-leaf) are similar if
– They are linguistically similar
– The subtrees rooted at the elements are similar
Mutually recursive
– Leaves determine internal node similarity
– Similarity of internal nodes leads to an increase in leaf similarity

21 Structure Match Details
Subtrees are similar if
– Immediate children are similar
– Leaf sets are similar
Subtree Similarity (nodes s and t)
– Fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t, and vice-versa
– Less sensitive to variation in intermediate structure
Pruning the number of comparisons
– Elements must have a comparable number of leaves
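
The leaf-based subtree similarity and the pruning rule above can be sketched as follows. The tree encoding as `(name, children)` tuples, the `threshold` and `max_ratio` defaults, and the choice to return 0.0 for pruned pairs are all assumptions made for this illustration; the slides do not fix these values.

```python
def leaves(node):
    """Return the leaf names under a (name, children) schema node."""
    name, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaves(child)]

def comparable(s_leaves, t_leaves, max_ratio=2.0):
    """Pruning test: leaf counts must be within a factor of max_ratio (assumed value)."""
    small, large = sorted((len(s_leaves), len(t_leaves)))
    return small > 0 and large / small <= max_ratio

def subtree_similarity(s, t, leaf_sim, threshold=0.5):
    """Fraction of leaves in s and t that map to some leaf in the other subtree."""
    s_leaves, t_leaves = leaves(s), leaves(t)
    if not comparable(s_leaves, t_leaves):
        return 0.0  # pruned pair; this sketch simply scores it as dissimilar
    s_linked = sum(1 for x in s_leaves
                   if any(leaf_sim(x, y) > threshold for y in t_leaves))
    t_linked = sum(1 for y in t_leaves
                   if any(leaf_sim(x, y) > threshold for x in s_leaves))
    return (s_linked + t_linked) / (len(s_leaves) + len(t_leaves))

# Hypothetical leaf similarities for the Item subtrees of the running example.
SCORES = {("Qty", "Quantity"): 0.9, ("UoM", "UnitofMeasure"): 0.9,
          ("Line", "ItemNumber"): 0.3}

def example_leaf_sim(x, y):
    return SCORES.get((x, y), 0.0)

po_item = ("Item", [("Line", []), ("Qty", []), ("UoM", [])])
order_item = ("Item", [("ItemNumber", []), ("Quantity", []), ("UnitofMeasure", [])])
print(subtree_similarity(po_item, order_item, example_leaf_sim))
```

Because only leaves are compared, inserting or removing an intermediate node such as POLines does not change the score, which is the insensitivity to intermediate structure the slide refers to.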

22 Referential Integrity
[Figure: Schema A contains Purchase Order (Product Name, Order ID, Customer ID) and Customer (Customer ID, Name, Address), with the join node Order-Customer-fk; Schema B contains Customer-Purchase-Order]
Join nodes are added to the schema tree for each referential integrity constraint
Views can be used similarly

23 Cupid Architecture
[Figure: the Cupid architecture from slide 15, annotated with example similarity values]

Linguistic Similarity (LSIM)
Element 1        Element 2      LSIM
InvoiceTo        BillTo         0.7
UoM              UnitMeasure    0.9
City                            1.0

Structural (SSIM), Weighted (WSIM) Similarity
Element 1        Element 2      SSIM   WSIM
InvoiceTo        BillTo         0.8    0.7
UoM              UnitMeasure    0.7    0.8
InvoiceTo/City   BillTo/City    0.8    0.9
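
The slide does not state how WSIM is derived. The Cupid paper defines it as a weighted mean of structural and linguistic similarity, so the sketch below assumes that form. The weight `w_struct = 0.5` is an assumption, chosen because it reproduces the UoM and City rows of the example; for InvoiceTo/BillTo it gives 0.75, where the slide shows 0.7.

```python
def weighted_similarity(ssim, lsim, w_struct=0.5):
    """Weighted mean of structural and linguistic similarity (w_struct is assumed)."""
    return w_struct * ssim + (1.0 - w_struct) * lsim

# (element 1, element 2, SSIM, LSIM) taken from the slide's example values;
# the City LSIM of 1.0 is assumed to apply to InvoiceTo/City vs. BillTo/City.
rows = [
    ("InvoiceTo", "BillTo", 0.8, 0.7),
    ("UoM", "UnitMeasure", 0.7, 0.9),
    ("InvoiceTo/City", "BillTo/City", 0.8, 1.0),
]

for left, right, ssim, lsim in rows:
    print(left, right, round(weighted_similarity(ssim, lsim), 2))
```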

24 Mapping Generation
Individual mapping elements are computed from WSIM values:
– Consider only mapping pairs whose WSIM is greater than a threshold
– For each element of the target, find the most similar source element
– Mappings that are not accepted but have high similarity are also returned, to help the user modify the map
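
A minimal sketch of this selection step, assuming the WSIM values are available as a dictionary keyed by (source, target) pairs. The function name, the `threshold` default, and the example scores other than those shown on the slides are hypothetical.

```python
def generate_mapping(wsim, threshold=0.5):
    """Pick the best source per target above threshold; also return the runners-up."""
    best = {}
    for (source, target), score in wsim.items():
        if score > threshold and (target not in best or score > best[target][1]):
            best[target] = (source, score)
    accepted = {target: source for target, (source, _) in best.items()}
    # Candidates above the threshold that lost to a better source for the same
    # target are kept so a user can review them and adjust the map.
    alternatives = sorted(
        ((source, target, score) for (source, target), score in wsim.items()
         if score > threshold and accepted.get(target) != source),
        key=lambda row: -row[2],
    )
    return accepted, alternatives

example_wsim = {
    ("InvoiceTo", "BillTo"): 0.7,
    ("DeliverTo", "BillTo"): 0.6,   # hypothetical competing candidate
    ("UoM", "UnitMeasure"): 0.8,
    ("Qty", "UnitMeasure"): 0.2,    # hypothetical, below threshold
}
accepted, alternatives = generate_mapping(example_wsim)
print(accepted)
print(alternatives)
```

This produces a one-source-per-target mapping; ties keep the first candidate seen, which is an arbitrary choice of the sketch.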

25 Cupid Architecture
[Figure: the Cupid architecture from slide 15, with an additional "Input hint" label]

26 Work Needed
A more robust solution
– Auto-tuning parameters
– Thesaurus Generation and Evolution
Schema matching component architecture
– Easily extensible by adding multiple techniques
– Data Instances for matching
– Look at COMA & ProtoPlasm systems

27 References
1. J. Madhavan, P. A. Bernstein, E. Rahm: Generic Schema Matching with Cupid. VLDB, 2001
2. H. H. Do, E. Rahm: COMA - A System for Flexible Combination of Schema Matching Approaches. VLDB, 2002
3. P. A. Bernstein, S. Melnik, M. Petropoulos, C. Quix: Industrial-Strength Schema Matching. SIGMOD Record 33(4), 2004

