Schema Matching Algorithms Phil Bernstein CSE 590sw February 2003.

1 Schema Matching Algorithms Phil Bernstein CSE 590sw February 2003

2 Acknowledgments
Many of these slides are from presentations by
– Erhard Rahm, Hong Hai Do, and Sergey Melnik (Univ. of Leipzig)

3 Mapping Schemas
Given two schemas, return an expression that translates instances of one schema into instances of the other (i.e., performs data translation).
Applications
– Web site integration
– Catalog integration
– Schema evolution
– Data translation
– Reverse engineering
– Data warehouse loading
– XML message translation
– Ontology integration

4 Partitioning the Problem
Schema matching (aka mapping discovery)
– Given two schemas, return a set of correspondences that specify pairs of related terms
Semantic mapping (aka query discovery)
– Given correspondences between two schemas, return an expression that translates instances of one schema into instances of the other (i.e., performs data translation)
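The two-step split on slide 4 can be illustrated with a minimal Python sketch. It assumes the simplest possible case, where every correspondence is a 1:1 rename; the element names and the `translate` helper are hypothetical, and a real semantic mapping would generally be a query with joins, filters, and value conversions rather than a field rename.

```python
from typing import Dict, List, Tuple

# A correspondence pairs a source element with a related target element.
Correspondence = Tuple[str, str]


def translate(record: Dict[str, object],
              correspondences: List[Correspondence]) -> Dict[str, object]:
    """Apply 1:1 correspondences as field renames; unmatched fields are dropped."""
    return {target: record[source]
            for source, target in correspondences
            if source in record}


# Step 1, schema matching: the output is a set of correspondences
# (hypothetical element names, written by hand here).
correspondences = [("custName", "Name"), ("custCity", "City"), ("custZip", "Zip")]

# Step 2, semantic mapping: turn the correspondences into something
# that can be executed over instances of the source schema.
source_row = {"custNo": 7, "custName": "Ada", "custCity": "Seattle", "custZip": "98195"}
target_row = translate(source_row, correspondences)
print(target_row)
```

Keeping the correspondences as plain data makes it possible to review or correct them by hand before any translation expression is generated from them.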

5 The Schema Matching Problem
– Types of schemas: database schemas, XML schemas, ontologies, …
– Input:
  – Two (or more) schemas, S1 and S2
  – Possibly data instances for S1 and S2
  – Background knowledge: thesauri, validated matches, constraints (keys, data types), standard schemas, ontologies, NLP, etc.
– Output:
  – A mapping between S1 and S2
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik. Phil’s modifications are in underscored Times Roman.

6 Generic Match Implementation
Architecture diagram with these components:
– Tool 1 (portal schemas), Tool 2 (e-business schemas), Tool 3 (data warehousing schemas), Tool 4 (database design)
– Schema import/export
– Internal schema representation
– Generic match implementation
– Global libraries (dictionaries, schemas, …)
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

7 Match Example 1
Figure: two product category hierarchies to be matched, one from Yahoo! Shopping and one from Epinions.com.
Category labels in the figure (tree layout not preserved): Electronics and Photography Camcorders DV Computers and Software Digital Cameras Electronics Video Camcorders Remote Controls PDAs and Handhelds Home Computers & Internet Computer Hardware PDAs
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

8 Match Example 2
a) A relational schema and an XML schema
    CREATE TABLE PurchaseOrder.Customer (
        custNo     INT,
        custName   VARCHAR(50),
        custStreet VARCHAR(50),
        custCity   VARCHAR(50),
        custZip    VARCHAR(10),
        PRIMARY KEY (custNo)
    );
    CREATE TABLE PurchaseOrder.ShipTo (
        poNo         INT,
        custNo       INT REFERENCES PurchaseOrder.Customer,
        shipToStreet VARCHAR(50),
        shipToCity   VARCHAR(50),
        shipToZip    VARCHAR(10),
        PRIMARY KEY (poNo)
    );
b) Their corresponding graph representation
Legend: node, containment link
– Relational graph nodes: PurchaseOrder; Customer (custName, custStreet, custCity, custZip); ShipTo (shipToStreet, shipToCity, shipToZip)
– XML graph nodes: POrder, DeliverTo, BillTo, Address, Street, City, Zip
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
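The graph representation in part b) of slide 8 can be approximated with a small Python sketch. It is only an illustration under stated assumptions: the graph is an adjacency dictionary whose edges are containment links, the columns listed are the ones drawn in the figure, and `schema_to_graph` and `leaf_paths` are hypothetical helper names, not part of any tool mentioned in the slides.

```python
from typing import Dict, List, Tuple


def schema_to_graph(schema: str, tables: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Build a containment graph: schema -> tables -> columns."""
    graph: Dict[str, List[str]] = {schema: list(tables)}
    for table, columns in tables.items():
        graph[table] = list(columns)
        for column in columns:
            graph.setdefault(column, [])  # columns are leaves
    return graph


def leaf_paths(graph: Dict[str, List[str]], node: str,
               prefix: Tuple[str, ...] = ()) -> List[str]:
    """Return every containment path from `node` down to a leaf, dot-separated."""
    path = prefix + (node,)
    children = graph.get(node, [])
    if not children:
        return [".".join(path)]
    result: List[str] = []
    for child in children:
        result.extend(leaf_paths(graph, child, path))
    return result


# Columns as drawn in the figure (key columns are omitted there).
tables = {
    "Customer": ["custName", "custStreet", "custCity", "custZip"],
    "ShipTo": ["shipToStreet", "shipToCity", "shipToZip"],
}
graph = schema_to_graph("PurchaseOrder", tables)
paths = leaf_paths(graph, "PurchaseOrder")
for p in paths:
    print(p)
```

Once both the relational schema and the XML schema are reduced to the same node-and-link form, a matcher can compare them without caring which schema language each one came from.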

9 Tool Example (BizTalk Mapper)
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

10 Current Situation
– Finding mappings is now the bottleneck!
  – Largely done by hand
  – Labor intensive, tedious, and error prone
– Will only get worse
  – Data sharing and XML become pervasive
  – Proliferation of DTDs and XML schemas
  – Translation of legacy data
  – Reconciling ontologies on the semantic web
– Need semi-automatic approaches to scale up!
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

11 Why Matching is Difficult
– Aims to identify the same real-world entity
  – Using names, structures, types, data values, etc.
– Schemas represent the same entity differently
  – Different names => same entity (synonyms): client & user => customer
  – Same names => different entities (homonyms): bug => insect or software error
– Schema & data never fully capture semantics!
  – Not adequately documented, not sufficiently expressive
  – Data values suffer from synonyms and homonyms too
– Intended semantics is typically subjective!
– Cannot be fully automated. Often hard for humans.
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
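The synonym and homonym problem on slide 11 can be made concrete with a short Python sketch. It assumes a tiny hand-written thesaurus (`THESAURUS` is hypothetical) and uses `difflib.SequenceMatcher` from the standard library as a stand-in for whatever string measure a real matcher would use.

```python
from difflib import SequenceMatcher

# Hypothetical thesaurus: each term maps to a canonical concept.
THESAURUS = {"client": "customer", "user": "customer", "customer": "customer"}


def string_similarity(a: str, b: str) -> float:
    """Purely syntactic similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def name_similarity(a: str, b: str, thesaurus=THESAURUS) -> float:
    """Treat names as identical if the thesaurus maps them to the same concept."""
    concept_a = thesaurus.get(a.lower(), a.lower())
    concept_b = thesaurus.get(b.lower(), b.lower())
    if concept_a == concept_b:
        return 1.0
    return string_similarity(a, b)


# Synonyms: the strings look unrelated, the thesaurus says they are the same.
print(string_similarity("client", "customer"))
print(name_similarity("client", "customer"))

# Homonyms: both measures score 1.0 for "bug" vs "bug", even if one schema
# means an insect and the other a software error. Names alone cannot tell.
print(name_similarity("bug", "bug"))
```

Background knowledge handles the synonym case, but nothing in this sketch can catch the homonym case, which is one reason the slides argue that matching cannot be fully automated.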

12 Desiderata for Match Solution
– Low degree of manual work
  – Accuracy, efficiency, ease of use
– Extensibility
  – Exploit additional match techniques
  – Exploit additional background knowledge
– Support for reuse
  – Exploit previous manual or automatically generated matchings
– Generic approach
  – Different schema languages
  – Different application areas
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

13 Special Situations
Match schema S to an incremental modification of S
– Can ignore homonyms and possibly synonyms
– Little if any reshaping of the structure
– Instances probably don’t help
Lightweight integration for the semantic web vs. e-commerce or data warehouse loading
– The former can’t afford much human review
– The latter needs “perfect” mappings and hence human review

14 Automatic Match Approaches
– Individual approaches
  – Schema-based
    – Element: linguistic (names, descriptions); constraint-based (types, keys)
    – Structure: constraint-based (parents, children, leaves)
  – Instance-based
    – Linguistic: IR (word frequencies, key terms)
    – Constraint-based: value pattern and ranges
  – Reuse-oriented
    – Element: dictionaries, thesauri
    – Structure: previous match results
– Combining approaches: hybrid vs. composite
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
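The composite style of combination named on slide 14 can be sketched in Python as follows. The sketch assumes two independent element-level matchers, a linguistic one on names and a constraint-based one on data types, whose scores are merged by a weighted average; the weights, the threshold, and all function names are illustrative assumptions rather than values from any of the cited systems.

```python
from difflib import SequenceMatcher


def name_matcher(a: dict, b: dict) -> float:
    """Linguistic matcher: string similarity of element names."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def type_matcher(a: dict, b: dict) -> float:
    """Constraint-based matcher: 1.0 if the data types agree, else 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0


def composite_similarity(a: dict, b: dict, weighted_matchers) -> float:
    """Run each matcher independently, then combine by weighted average."""
    total_weight = sum(weight for _, weight in weighted_matchers)
    return sum(weight * matcher(a, b)
               for matcher, weight in weighted_matchers) / total_weight


def match(source, target, weighted_matchers, threshold=0.5) -> dict:
    """Suggest the best target element for each source element above a threshold."""
    suggestions = {}
    for s in source:
        best = max(target, key=lambda t: composite_similarity(s, t, weighted_matchers))
        if composite_similarity(s, best, weighted_matchers) >= threshold:
            suggestions[s["name"]] = best["name"]
    return suggestions


source = [{"name": "custCity", "type": "VARCHAR"}, {"name": "custNo", "type": "INT"}]
target = [{"name": "City", "type": "VARCHAR"}, {"name": "Zip", "type": "VARCHAR"}]
matchers = [(name_matcher, 0.7), (type_matcher, 0.3)]  # assumed weights
suggested = match(source, target, matchers)
print(suggested)
```

A hybrid matcher would instead fold several criteria into one algorithm. The composite form keeps each matcher separate, so a new technique can be added as one more entry in the `matchers` list.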

15 Match Quality Measures
– Comparison of automatically derived with manually derived (i.e., real) match correspondences
  – Diagram: overlapping sets of real matches and suggested matches
  – A: false negatives; B: true positives; C: false positives; D: true negatives
– Quality measures:
  – Overall (cf. Similarity Flooding [ICDE 02]): post-match effort to add missed matches (A) and to remove false matches (C); negative Overall => no gain
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
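The formulas for slide 15 did not survive in this transcript, so the Python sketch below rests on an assumption: it uses the definitions commonly given for these measures over the sets A, B, and C, namely Precision = |B| / (|B| + |C|), Recall = |B| / (|A| + |B|), and Overall = 1 - (|A| + |C|) / (|A| + |B|). The function name and the sample correspondences are hypothetical.

```python
def match_quality(real, suggested):
    """Return (precision, recall, overall) for suggested vs. real correspondences."""
    real, suggested = set(real), set(suggested)
    a = len(real - suggested)   # A: false negatives (missed matches)
    b = len(real & suggested)   # B: true positives
    c = len(suggested - real)   # C: false positives (wrong suggestions)
    precision = b / (b + c) if (b + c) else 0.0
    recall = b / (a + b) if (a + b) else 0.0
    # Overall charges one unit of manual work for every match to add or remove.
    overall = 1 - (a + c) / (a + b) if (a + b) else 0.0
    return precision, recall, overall


real = {("custName", "Name"), ("custCity", "City"),
        ("custZip", "Zip"), ("custStreet", "Street")}

# Three correct suggestions and one wrong one: A=1, B=3, C=1.
good = {("custName", "Name"), ("custCity", "City"),
        ("custZip", "Zip"), ("custNo", "Street")}
print(match_quality(real, good))

# One correct suggestion and three wrong ones: Overall goes negative.
poor = {("custName", "Name"), ("custNo", "City"),
        ("custNo", "Zip"), ("custNo", "Street")}
print(match_quality(real, poor))
```

Under these definitions Overall drops below zero once precision falls under 0.5, which matches the slide's remark that a negative Overall means the matcher saved no work.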

16 UW & MSR Contributions
LSD [SIGMOD ]
– Learning algorithm based on structures and instances
– AnHai Doan, Pedro Domingos, Alon Halevy
Cupid [VLDB 02]
– Structure matching
– Jayant Madhavan, Phil Bernstein
Glue [WWW 03]
– Taxonomy matching. Uses relaxation labeling.
– AnHai Doan, Jayant Madhavan, Pedro Domingos, Alon Halevy
Mapping Knowledge Base (MKB) [submitted]
– Reuses past mappings to help produce new mappings
– Jayant Madhavan, Phil Bernstein, Chuang Chen, Alon Halevy, Pradeep Shenoy

17 Where’s the Research Action?
There’s always room for new techniques
– Compare distribution of data values between elements of two schemas
– Create a global schema by clustering elements from different sample schemas
Re-using mappings
Combining techniques
Better user interfaces
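The first technique on slide 17, comparing the distribution of data values between elements, might look like the following Python sketch. It is one possible reading, not a method taken from the slides: each column is summarized as relative value frequencies, and two columns are scored by histogram intersection (the sum of the smaller frequency for each shared value). The column samples and function names are hypothetical.

```python
from collections import Counter


def value_distribution(values):
    """Relative frequency of each distinct value in a column sample."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def distribution_similarity(values_a, values_b) -> float:
    """Histogram intersection in [0, 1]: 1.0 for identical distributions."""
    if not values_a or not values_b:
        return 0.0
    dist_a = value_distribution(values_a)
    dist_b = value_distribution(values_b)
    shared = dist_a.keys() & dist_b.keys()
    return sum((min(dist_a[v], dist_b[v]) for v in shared), 0.0)


# Hypothetical instance samples from elements of two schemas.
ship_state = ["WA", "WA", "CA", "OR"]
region = ["WA", "CA", "CA", "OR"]
zip_code = ["98195", "94105", "97201"]

print(distribution_similarity(ship_state, region))    # overlapping value sets
print(distribution_similarity(ship_state, zip_code))  # no values in common
```

A measure like this needs no element names at all, so it could complement the name-based and type-based matchers when names are opaque or misleading.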

