Alon Halevy University of Washington Joint work with Anhai Doan and Pedro Domingos Learning to Map Between Schemas Ontologies.

Presentation transcript:

Alon Halevy University of Washington Joint work with Anhai Doan and Pedro Domingos Learning to Map Between Schemas Ontologies

2 Agenda Ontology mapping is a key problem in many applications: –Data integration –Semantic web –Knowledge management –E-commerce LSD: –Solution that uses multi-strategy learning. –We’ve started with schema matching (i.e., very simple ontologies) –Currently extending to more expressive ontologies. –Experiments show the approach is very promising!

3 The Structure Mapping Problem Types of structures: –Database schemas, XML DTDs, ontologies, … Input: –Two (or more) structures, S1 and S2 –Data instances for S1 and S2 –Background knowledge Output: –A mapping between S1 and S2 –Should enable translating between data instances. –Semantics of mapping?

4 Semantic Mappings between Schemas Source schemas = XML DTDs [Figure: two XML DTD trees rooted at house; element labels: location, contact, address, name, phone, num-baths, full-baths, half-baths, contact-info, agent-name, agent-phone; edges are marked as 1-1 mapping and non 1-1 mapping.]

5 Motivation Database schema integration –A problem as old as databases themselves. –database merging, data warehouses, data migration Data integration / information gathering agents –On the WWW, in enterprises, large science projects Model management: –Model matching: key operator in an algebra where models and mappings are first-class objects. –See [Bernstein et al., 2000] for more. The Semantic Web –Ontology mapping. System interoperability –E-services, application integration, B2B applications, …

6 Desiderata from Proposed Solutions Accuracy, efficiency, ease of use. Realistic expectations: –Unlikely to be fully automated. Need user in the loop. Some notion of semantics for mappings. Extensibility: –Solution should exploit additional background knowledge. “Memory”, knowledge reuse: –System should exploit previous manual or automatically generated matchings. –Key idea behind LSD.

7 LSD Overview L(earning) S(ource) D(escriptions) Problem: generating semantic mappings between mediated schema and a large set of data source schemas. Key idea: generate the first mappings manually, and learn from them to generate the rest. Technique: multi-strategy learning (extensible!) Step 1: – [SIGMOD, 2001]: 1-1 mappings between XML DTDs. Current focus: –Complex mappings –Ontology mapping.

8 Outline Overview of structure mapping Data integration and source mappings LSD architecture and details Experimental results Current work.

9 Data Integration [Figure: the query “Find houses with four bathrooms priced under $500,000” is posed against the mediated schema; query reformulation and optimization connects it, through wrappers, to source schema 1, source schema 2, and source schema 3 for homes.com, realestate.com, and homeseekers.com.] Applications: WWW, enterprises, science projects Techniques: virtual data integration, warehousing, custom code.

10 Semantic Mappings between Schemas Source schemas = XML DTDs [Figure: two XML DTD trees rooted at house; element labels: location, contact, address, name, phone, num-baths, full-baths, half-baths, contact-info, agent-name, agent-phone; edges are marked as 1-1 mapping and non 1-1 mapping.]

11 Semantics (preliminary) Semantics of mappings has received no attention. Semantics of 1-1 mappings – Given: –$R(A_1,\dots,A_n)$ and $S(B_1,\dots,B_m)$ –1-1 mappings $(A_i, B_j)$ Then, we postulate the existence of a relation $W$, s.t.: –$\pi_{C_1,\dots,C_k}(W) = \pi_{A_1,\dots,A_k}(R)$, –$\pi_{C_1,\dots,C_k}(W) = \pi_{B_1,\dots,B_k}(S)$, –$W$ also includes the unmatched attributes of $R$ and $S$. In English: R and S are projections on some universal relation W, and the mappings specify the projection variables and correspondences.

12 Why Matching is Difficult Aims to identify same real-world entity –using names, structures, types, data values, etc Schemas represent same entity differently –different names => same entity: –area & address => location –same names => different entities: –area => location or square-feet Schema & data never fully capture semantics! –not adequately documented, not sufficiently expressive Intended semantics is typically subjective! –IBM Almaden Lab = IBM? Cannot be fully automated. Often hard for humans. Committees are required!

13 Current State of Affairs Finding semantic mappings is now the bottleneck! –largely done by hand –labor intensive & error prone –GTE: 4 hours/element for 27,000 elements [Li&Clifton00] Will only be exacerbated –data sharing & XML become pervasive –proliferation of DTDs –translation of legacy data –reconciling ontologies on semantic web Need semi-automatic approaches to scale up!

14 Outline Overview of structure mapping Data integration and source mappings LSD architecture and details Experimental results Current work.

15 The LSD Approach User manually maps a few data sources to the mediated schema. LSD learns from the mappings, and proposes mappings for the rest of the sources. Several types of knowledge are used in learning: –Schema elements, e.g., attribute names –Data elements: ranges, formats, word frequencies, value frequencies, length of texts. –Proximity of attributes –Functional dependencies, number of attribute occurrences. One learner does not fit all. Use multiple learners and combine with meta-learner.

16 Example Mediated schema: address, price, agent-phone, description. Schema of realestate.com: location, listed-price, phone, comments. realestate.com data: location = Miami, FL; Boston, MA; ... listed-price = $250,000; $110,... phone = (305); (617) comments = Fantastic house; Great location; ... homes.com data: price = $550,000; $320,... contact-phone = (278); (617) extra-info = Beautiful yard; Great beach; ... Learned hypotheses: If “phone” occurs in the name => agent-phone. If “fantastic” & “great” occur frequently in data values => description.

17 Multi-Strategy Learning Use a set of base learners: –Name learner, Naïve Bayes, Whirl, XML learner And a set of recognizers: –County name, zip code, phone numbers. Each base learner produces a prediction weighted by confidence score. Combine base learners with a meta-learner, using stacking.

18 Base Learners Name Learner –training pairs: (contact,agent-phone) (contact-info,office-address) (phone,agent-phone) (listed-price,price) –(contact-phone, ?): contact-phone => (agent-phone,0.7), (office-address,0.3) Naive Bayes Learner [Domingos&Pazzani 97] –“Kent, WA” => (address,0.8), (name,0.2) Whirl Learner [Cohen&Hirsh 98] XML Learner –exploits hierarchical structure of XML data
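The name-learner idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the similarity measure (Jaccard overlap of hyphen-separated name tokens) and the function names `name_tokens`, `jaccard`, and `name_learner` are hypothetical, not the measure or API used in LSD, so the scores differ from the 0.7 / 0.3 shown on the slide.

```python
from collections import defaultdict


def name_tokens(name):
    """Split an element name such as 'contact-phone' into lowercase tokens."""
    return set(name.lower().split("-"))


def jaccard(a, b):
    """Token overlap between two element names (assumed similarity measure)."""
    ta, tb = name_tokens(a), name_tokens(b)
    return len(ta & tb) / len(ta | tb)


def name_learner(training_pairs, new_name):
    """Score each mediated-schema element by its most similar training name,
    then normalize the scores so they sum to 1."""
    best = defaultdict(float)
    for source_name, mediated_element in training_pairs:
        sim = jaccard(source_name, new_name)
        best[mediated_element] = max(best[mediated_element], sim)
    total = sum(best.values())
    if total == 0:
        return {}
    return {label: s / total for label, s in best.items() if s > 0}


# Training pairs taken from the slide: (source element name, mediated element).
pairs = [
    ("contact", "agent-phone"),
    ("contact-info", "office-address"),
    ("phone", "agent-phone"),
    ("listed-price", "price"),
]
prediction = name_learner(pairs, "contact-phone")
```

With these pairs, `agent-phone` receives the highest score for `contact-phone`, in line with the ranking on the slide.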

19 Training the Base Learners Mediated schema: address, price, agent-phone, description. Schema of realestate.com: location, listed-price, phone, comments. realestate.com listings: (Miami, FL; $250,000; (305); Fantastic house), (Boston, MA; $110,000; (617); Great location) Name Learner training examples: (location, address) (listed-price, price) (phone, agent-phone)... Naive Bayes Learner training examples: (“Miami, FL”, address) (“$ 250,000”, price) (“(305) ”, agent-phone)...
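As a rough illustration of how a Naive Bayes base learner could be trained on (data value, mediated element) pairs like those above, here is a minimal sketch. The tokenizer, the smoothing scheme, and the class name `NaiveBayes` are assumptions made for the example; they are not taken from the LSD implementation.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase words, digit runs, and single punctuation marks."""
    return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", text.lower())


class NaiveBayes:
    def fit(self, examples):
        """examples: iterable of (data value, mediated-schema element)."""
        self.token_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()
        for text, label in examples:
            toks = tokenize(text)
            self.token_counts[label].update(toks)
            self.label_counts[label] += 1
            self.vocab.update(toks)
        return self

    def predict(self, text):
        """Return normalized confidence scores per mediated-schema element."""
        toks = tokenize(text)
        n = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            total = sum(self.token_counts[label].values())
            lp = math.log(count / n)
            for t in toks:
                # Add-one smoothing; the extra +1 leaves room for unseen tokens.
                lp += math.log((self.token_counts[label][t] + 1)
                               / (total + len(self.vocab) + 1))
            log_scores[label] = lp
        top = max(log_scores.values())
        exp = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(exp.values())
        return {l: v / z for l, v in exp.items()}


examples = [
    ("Miami, FL", "address"), ("Boston, MA", "address"),
    ("$250,000", "price"), ("$110,000", "price"),
    ("Fantastic house", "description"), ("Great location", "description"),
]
model = NaiveBayes().fit(examples)
scores = model.predict("$320,000")
```

Each prediction is a set of (element, confidence) pairs, which is the form the meta-learner consumes.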

20 Entity Recognizers Use pre-programmed knowledge to identify specific types of entities –date, time, city, zip code, name, etc –house-area (30 X 70, 500 sq. ft.) –county-name recognizer Recognizers often have nice characteristics –easy to construct –many off-the-shelf research & commercial products –applicable across many domains –help with special cases that are hard to learn
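A recognizer of this kind can be as simple as a regular expression applied to a column of values. The sketch below is an assumption-laden illustration: the patterns (US-style zip codes and phone numbers), the `RECOGNIZERS` table, and `match_rate` are hypothetical names, and the sample values are made up.

```python
import re

# Hypothetical pre-programmed patterns; real recognizers may be far richer.
RECOGNIZERS = {
    "zip-code": re.compile(r"\d{5}(?:-\d{4})?"),
    "phone": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}"),
}


def match_rate(values, recognizer):
    """Fraction of data values that the named recognizer fully matches."""
    if not values:
        return 0.0
    pattern = RECOGNIZERS[recognizer]
    hits = sum(1 for v in values if pattern.fullmatch(v.strip()))
    return hits / len(values)


# Made-up column values for illustration only.
phone_column = ["(305) 555-0100", "(617) 555-0199"]
mixed_column = ["98105", "Kent, WA"]
phone_rate = match_rate(phone_column, "phone")
zip_rate = match_rate(mixed_column, "zip-code")
```

A high match rate for a column can then be turned into a confidence score for the corresponding mediated-schema element.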

21 Meta-Learner: Stacking Training of meta-learner produces a weight for every pair of: –(base-learner, mediated-schema element) –weight(Name-Learner,address) = 0.1 –weight(Naive-Bayes,address) = 0.9 Combining predictions of meta-learner: –computes weighted sum of base-learner confidence scores Example for “Seattle, WA”: Name Learner => (address,0.6), Naive Bayes => (address,0.8), Meta-Learner => (address, 0.6*0.1 + 0.8*0.9 = 0.78)
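The weighted-sum combination on this slide can be written out directly. The weights and confidence scores below are the ones from the slide; the dictionary layout and the function name `combine` are assumptions made for the sketch.

```python
# Weights per (base learner, mediated-schema element), as on the slide.
WEIGHTS = {
    ("name-learner", "address"): 0.1,
    ("naive-bayes", "address"): 0.9,
}


def combine(predictions, weights):
    """Weighted sum of base-learner confidence scores per element.

    predictions: {learner: {element: confidence}}
    Pairs without a learned weight contribute nothing (an assumption).
    """
    combined = {}
    for learner, scores in predictions.items():
        for element, confidence in scores.items():
            w = weights.get((learner, element), 0.0)
            combined[element] = combined.get(element, 0.0) + w * confidence
    return combined


base_predictions = {
    "name-learner": {"address": 0.6},
    "naive-bayes": {"address": 0.8},
}
result = combine(base_predictions, WEIGHTS)
# result["address"] is 0.6*0.1 + 0.8*0.9, i.e. about 0.78
```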

22 Training the Meta-Learner For address: the extracted XML instances (Miami, FL; $250,000; Seattle, WA; Kent, WA; ...) are paired with the Name Learner predictions, the Naive Bayes predictions, and the true predictions. Least-squares linear regression then gives Weight(Name-Learner,address) = 0.1 and Weight(Naive-Bayes,address) = 0.9
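For two base learners, the least-squares fit can be solved in closed form from the normal equations. The sketch below assumes a regression without an intercept and uses made-up training rows; `fit_two_weights` is a hypothetical helper, not part of LSD.

```python
def fit_two_weights(rows):
    """Least-squares weights (w1, w2) minimizing sum((w1*x1 + w2*x2 - y)**2).

    rows: (name_learner_score, naive_bayes_score, true_value) triples, where
    true_value is 1 if the instance really belongs to the element, else 0.
    """
    s11 = sum(x1 * x1 for x1, _, _ in rows)
    s12 = sum(x1 * x2 for x1, x2, _ in rows)
    s22 = sum(x2 * x2 for _, x2, _ in rows)
    b1 = sum(x1 * y for x1, _, y in rows)
    b2 = sum(x2 * y for _, x2, y in rows)
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("base-learner scores are collinear")
    w1 = (b1 * s22 - b2 * s12) / det
    w2 = (s11 * b2 - s12 * b1) / det
    return w1, w2


# Made-up scores for the element "address" (illustration only).
training_rows = [
    (0.5, 0.9, 1),  # e.g. an instance that is an address
    (0.4, 0.1, 0),  # e.g. an instance that is a price
    (0.6, 0.8, 1),
    (0.5, 0.2, 0),
]
w_name, w_bayes = fit_two_weights(training_rows)
```

A learner whose scores track the true labels more closely ends up with the larger weight, which is the behaviour the slide's 0.1 versus 0.9 example illustrates.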

23 Applying the Learners Mediated schema: address, price, agent-phone, description. Schema of homes.com: area, day-phone, extra-info. homes.com data: area = Seattle, WA; Kent, WA; Austin, TX day-phone = (278); (617); (512) extra-info = Beautiful yard; Great beach; Close to Seattle Name Learner, Naive Bayes, and Meta-Learner are applied to the instances. Predictions shown: (address,0.8), (description,0.2); (address,0.6), (description,0.4); (address,0.7), (description,0.3); (description,0.8), (address,0.2); (address,0.7), (description,0.3); (agent-phone,0.9), (description,0.1)

24 The Constraint Handler Extends learning to incorporate constraints –hard constraints –a = address & b = address => a = b –a = house-id => a is a key –a = agent-info & b = agent-name => b is nested in a –soft constraints –a = agent-phone & b = agent-name => a & b are usually close to each other –user feedback = hard or soft constraints Details in [Doan et al., SIGMOD 2001]

25 The Current LSD System [Figure: architecture diagram with a Training Phase and a Matching Phase. Inputs: Mediated schema, Source schemas, Data listings. Components: Base-Learner 1 ... Base-Learner k, Meta-Learner, Constraint Handler (with Domain Constraints and User Feedback). Output: Mappings.]

26 Outline Overview of structure mapping Data integration and source mappings LSD architecture and details Experimental results Current work.

27 Empirical Evaluation Four domains –Real Estate I & II, Course Offerings, Faculty Listings For each domain –create mediated DTD & domain constraints –choose five sources –extract & convert data listings into XML (faithful to schema!) –mediated DTDs: elements, source DTDs: Ten runs for each experiment - in each run: –manually provide 1-1 mappings for 3 sources –ask LSD to propose mappings for remaining 2 sources –accuracy = % of 1-1 mappings correctly identified

28 Matching Accuracy LSD’s accuracy: % Best single base learner: % + Meta-learner: % + Constraint handler: % + XML learner: % Average Matching Accuracy (%)

29 Sensitivity to Amount of Available Data Average matching accuracy (%) Number of data listings per source (Real Estate I)

30 Contribution of Schema vs. Data LSD with only schema info. LSD with only data info. Complete LSD Average matching accuracy (%) More experiments in the paper [Doan et al. 01]

31 Reasons for Incorrect Matching Unfamiliarity –suburb –solution: add a suburb-name recognizer Insufficient information –correctly identified general type, failed to pinpoint exact type – Richard Smith (206) –solution: add a proximity learner Subjectivity –house-style = description?

32 Outline Overview of structure mapping Data integration and source mappings LSD architecture and details Experimental results Current work.

33 Moving Up the Expressiveness Ladder Schemas are very simple ontologies. More expressive power = More domain constraints. Mappings become more complex, but constraints provide more to learn from. Non 1-1 mappings: –$F_1(A_1,\dots,A_m) = F_2(B_1,\dots,B_m)$ Ontologies (of various flavors): –Class hierarchy (i.e., containment on unary relations) –Relationships between objects –Constraints on relationships

34 Finding Non 1-1 Mappings (Current work) Given two schemas, find –1-many mappings: address = concat(city,state) –many-1: half-baths + full-baths = num-baths –many-many: concat(addr-line1,addr-line2) = concat(street,city,state) 1-many mappings –expressed as query –value correspondence expression: room-rate = rate * (1 + tax-rate) –relationship: state of tax-rate = state of hotel that has rate –special case: 1-many mappings between two relational tables [Figure: mediated schema (address, description, num-baths) and source schema (city, state, comments, half-baths, full-baths).]

35 Brute-Force Solution Define a set of operators –concat, +, -, *, /, etc For each set of mediated-schema columns –enumerate all possible mappings m1, m2, ..., mk –evaluate & return best mapping m1 –compute similarity using all base learners [Figure: source-schema columns and mediated-schema columns.]

36 Search-Based Solution States = columns –goal state: mediated-schema column –initial states: all source-schema columns –use 1-1 matching to reduce the set of initial states Operators: concat, +, -, *, /, etc Column-similarity: –use all base learners + recognizers
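The search formulation above can be illustrated with a small beam search that only uses the concat operator. Everything specific here is an assumption for the sake of a runnable sketch: the column data is made up, the ", " separator is arbitrary, and Jaccard overlap of column values stands in for the column similarity that the slide computes with base learners and recognizers.

```python
def jaccard(values_a, values_b):
    """Overlap of two columns' value sets (stand-in similarity measure)."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def beam_search(source, target_values, beam_width=2, max_columns=3):
    """Search for the concat of source columns most similar to the target.

    A state is (tuple of column names, list of derived values).
    Initial states are all source-schema columns, as on the slide.
    """
    def score(state):
        return jaccard(state[1], target_values)

    beam = [((name,), list(vals)) for name, vals in source.items()]
    best = max(beam, key=score)
    for _ in range(max_columns - 1):
        # Apply the concat operator to every state in the beam.
        candidates = [
            (cols + (name,),
             [f"{left}, {right}" for left, right in zip(vals, source[name])])
            for cols, vals in beam
            for name in source
            if name not in cols
        ]
        if not candidates:
            break
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        if score(beam[0]) > score(best):
            best = beam[0]
    return best[0], score(best)


# Made-up source columns and mediated-schema 'address' values.
source_columns = {
    "city": ["Seattle", "Kent"],
    "state": ["WA", "WA"],
    "comments": ["Great view", "Close to Seattle"],
}
address_values = ["Seattle, WA", "Kent, WA"]
columns, similarity = beam_search(source_columns, address_values)
```

For this toy data the search settles on the column tuple `("city", "state")`, i.e. address = concat(city,state).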

37 Multi-Strategy Search Use a set of expert modules: L1, L2, ..., Ln Each module –applies to only certain types of mediated-schema column –searches a small subspace –uses a cheap similarity measure to compare columns Example –L1: text; concat; TF/IDF –L2: numeric; +, -, *, /; [Ho et al. 2000] –L3: address; concat; Naive Bayes Search techniques –beam search as default –specialized, do not have to materialize columns

38 Multi-Strategy Search (cont’d) Apply all applicable expert modules –L1: m11, m12, m13, ..., m1x –L2: m21, m22, m23, ..., m2y –L3: m31, m32, m33, ..., m3z Combine modules’ predictions & select the best one –m11, m12, m21, m22, m31, m32 => m11 –compute similarity using all base learners

39 Related Work TRANSCM [Milo&Zohar98] ARTEMIS [Castano&Antonellis99] [Palopoli et al. 98] CUPID [Madhavan et al. 01] SEMINT [Li&Clifton94] ILA [Perkowitz&Etzioni95] DELTA [Clifton et al. 97] LSD [Doan et al. 2000, 2001] CLIO [Miller et al. 00],[Yan et al. 01] Single Learner Matching Hybrid Matching Schema + Data non 1-1 Matching Sophisticated Data-Driven User Interaction Recognizers + Schema Matching Multi-Strategy Learning Learners + Recognizers Schema + Data non 1-1 Matching ?

40 Summary LSD: –uses multi-strategy learning to semi-automatically generate semantic mappings. –LSD is extensible and incorporates domain and user knowledge, and previous techniques. –Experimental results show the approach is very promising. Future work and issues to ponder: –Accommodating more expressive languages: ontologies –Reuse of learned concepts from related domains. –Semantics? Data management is a fertile area for Machine Learning research!

41 Backup Slides

42 Mapping Maintenance [Figure: mappings m1, m2, m3 between source schema S and mediated schema M, and again between source schema S’ and mediated schema M’.] Ten months later... –are the mappings still correct?

43 Information Extraction from Text Extract data fragments from text documents –date, location, & victim’s name from a news article Intensive research on free-text documents Many documents do have substantial structure –XML pages, name card, tables, list Each such document = a data source –structure forms a schema –only one data value per schema element –“real” data source has many data values per schema element Ongoing research in the IE community

44 Contribution of Each Component Average Matching Accuracy (%) Without Name Learner Without Naive Bayes Without Whirl Learner Without Constraint Handler The complete LSD system

45 Exploiting Hierarchical Structure Existing learners flatten out all structures Developed XML learner –similar to the Naive Bayes learner –input instance = bag of tokens –differs in one crucial aspect –consider not only text tokens, but also structure tokens Example: the free-text description “Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors.” versus the structured contact information: Gail Murphy, MAX Realtors
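One way to picture "structure tokens" is to add a token for every XML tag to the usual bag of text tokens. The sketch below assumes exactly that simplification; the tag names `contact`, `name`, and `firm` are hypothetical (the original markup is not preserved in the transcript), and the actual XML learner may encode structure differently.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def structure_aware_tokens(xml_text):
    """Bag of tokens containing text tokens plus one token per XML tag."""
    root = ET.fromstring(xml_text)
    bag = Counter()

    def visit(node):
        bag[f"<{node.tag}>"] += 1           # structure token
        for word in (node.text or "").split():
            bag[word.lower()] += 1          # text token inside the element
        for child in node:
            visit(child)
            for word in (child.tail or "").split():
                bag[word.lower()] += 1      # text following a child element

    visit(root)
    return bag


# Hypothetical markup for the contact information on the slide.
instance = "<contact><name>Gail Murphy</name><firm>MAX Realtors</firm></contact>"
bag = structure_aware_tokens(instance)
```

A flat text learner would see only `gail murphy max realtors`; the extra tag tokens let a Naive Bayes style learner distinguish this instance from a free-text description that merely mentions the same words.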

46 Domain Constraints Impose semantic regularities on sources –verified using schema or data Examples –a = address & b = address => a = b –a = house-id => a is a key –a = agent-info & b = agent-name => b is nested in a Can be specified up front –when creating mediated schema –independent of any actual source schema

47 The Constraint Handler Predictions from Meta-Learner: area: (address,0.7), (description,0.3); contact-phone: (agent-phone,0.9), (description,0.1); extra-info: (address,0.6), (description,0.4) Domain Constraints: a = address & b = address => a = b The candidate mapping (area: address, contact-phone: agent-phone, extra-info: address) maps two elements to address and so violates the constraint; the mapping (area: address, contact-phone: agent-phone, extra-info: description) satisfies it. Can specify arbitrary constraints User feedback = domain constraint –ad-id = house-id Extended to handle domain heuristics –a = agent-phone & b = agent-name => a & b are usually close to each other
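The example on this slide can be reproduced with a small exhaustive search over candidate mappings. This is a sketch under assumptions: scoring a mapping by the product of its confidence scores and enumerating all combinations are choices made here for brevity, and `at_most_one_address` and `best_mapping` are hypothetical names, not the handler described in the paper.

```python
from itertools import product

# Meta-learner predictions from the slide.
predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}


def at_most_one_address(mapping):
    """Hard constraint: a = address & b = address => a = b."""
    return list(mapping.values()).count("address") <= 1


def best_mapping(predictions, constraints):
    """Highest-scoring assignment that satisfies every hard constraint."""
    elements = list(predictions)
    best, best_score = None, 0.0
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue
        score = 1.0
        for element, label in mapping.items():
            score *= predictions[element][label]
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score


mapping, score = best_mapping(predictions, [at_most_one_address])
# mapping: area -> address, contact-phone -> agent-phone,
#          extra-info -> description (score 0.7 * 0.9 * 0.4)
```

Without the constraint, the top-scoring assignment would map both area and extra-info to address; with it, extra-info falls back to description, as in the slide's example.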