AnHai Doan Database and Information System Group University of Illinois, Urbana-Champaign Spring 2004 Schema & Ontology Matching: Current Research Directions


2 Road Map Schema Matching –motivation & problem definition –representative current solutions: LSD, iMAP, Clio –broader picture Ontology Matching –motivation & problem definition –representative current solution: GLUE –broader picture Conclusions & Emerging Directions

3 Motivation: Data Integration New faculty member: find houses with 2 bedrooms priced under 200K, across homes.com, realestate.com, homeseekers.com

4 Architecture of Data Integration System [diagram: the query "Find houses with 2 bedrooms priced under 200K" is posed against a mediated schema, which is mapped to source schemas 1, 2, 3 of homes.com, realestate.com, homeseekers.com]

5 Semantic Matches between Schemas [example: mediated schema (price, agent-name, address) vs. homes.com schema (listed-price, contact-name, city, state); price = listed-price is a 1-1 match, address = concat(city, state) is a complex match; sample tuples: (320K, Jane Brown, Seattle, WA), (240K, Mike Smith, Miami, FL)]

6 Schema Matching is Ubiquitous! Fundamental problem in numerous applications Databases –data integration –data translation –schema/view integration –data warehousing –semantic query processing –model management –peer data management AI –knowledge bases, ontology merging, information gathering agents,... Web –e-commerce –marking up data using ontologies (e.g., on Semantic Web)

7 Why Schema Matching is Difficult Schema & data never fully capture semantics! –not adequately documented –schema creator has retired to Florida! Must rely on clues in schema & data –using names, structures, types, data values, etc. Such clues can be unreliable –same names => different entities: area => location or square-feet –different names => same entity: area & address => location Intended semantics can be subjective –house-style = house-description? –military applications require committees to decide! Cannot be fully automated, needs user feedback!

8 Current State of Affairs Finding semantic mappings is now a key bottleneck! –largely done by hand –labor intensive & error prone –data integration at GTE [Li&Clifton, 2000] –40 databases, 27,000 elements, estimated time: 12 years Will only be exacerbated –data sharing becomes pervasive –translation of legacy data Need semi-automatic approaches to scale up! Many research projects in the past few years –Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U of Leipzig, U Wisconsin, NCSU, UIUC, Washington,... –AI: Stanford, Karlsruhe University, NEC Japan,...

9 Road Map Schema Matching –motivation & problem definition –representative current solutions: LSD, iMAP, Clio –broader picture Ontology Matching –motivation & problem definition –representative current solution: GLUE –broader picture Conclusions & Emerging Directions

10 LSD Learning Source Descriptions Developed at Univ of Washington –with Pedro Domingos and Alon Halevy Designed for data integration settings –has been adapted to several other contexts Desirable characteristics –learns from previous matching activities –exploits multiple types of information in schema and data –incorporates domain integrity constraints –handles user feedback –achieves high matching accuracy on real-world data

11 Suppose user wants to integrate 100 data sources 1. User –manually creates matches for a few sources, say 3 –shows LSD these matches 2. LSD learns from the matches 3. LSD predicts matches for remaining 97 sources Schema Matching for Data Integration: the LSD Approach

12 price agent-name agent-phone office-phone description Learning from the Manual Matches listed-price contact-name contact-phone office comments Schema of realestate.com Mediated schema $250K James Smith (305) (305) Fantastic house $320K Mike Doan (617) (617) Great location listed-price contact-name contact-phone office comments realestate.com If “fantastic” & “great” occur frequently in data instances => description sold-at contact-agent extra-info $350K (206) Beautiful yard $230K (617) Close to Seattle homes.com If “office” occurs in name => office-phone

13 price agent-name agent-phone office-phone description Must Exploit Multiple Types of Information! listed-price contact-name contact-phone office comments Schema of realestate.com Mediated schema $250K James Smith (305) (305) Fantastic house $320K Mike Doan (617) (617) Great location listed-price contact-name contact-phone office comments realestate.com If “fantastic” & “great” occur frequently in data instances => description sold-at contact-agent extra-info $350K (206) Beautiful yard $230K (617) Close to Seattle homes.com If “office” occurs in name => office-phone

14 Multi-Strategy Learning Use a set of base learners –each exploits well certain types of information To match a schema element of a new source –apply base learners –combine their predictions using a meta-learner Meta-learner –uses training sources to measure base learner accuracy –weighs each learner based on its accuracy

15 Base Learners Name Learner –training: (“location”, address), (“contact name”, name) –matching: agent-name => (name, 0.7), (phone, 0.3) Naive Bayes Learner –training: (“Seattle, WA”, address), (“250K”, price) –matching: “Kent, WA” => (address, 0.8), (name, 0.2) [diagram: training examples (X1,C1), (X2,C2), ..., (Xm,Cm) are fed to a classification model (hypothesis), which maps an observed object X to labels weighted by confidence score]
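The Naive Bayes base learner above can be sketched in a few lines: classify a data value into a mediated-schema element from the word tokens it contains. The training pairs, tokenizer, and Laplace smoothing below are illustrative stand-ins, not LSD's actual implementation.

```python
# Toy Naive Bayes base learner: predict a mediated-schema element
# for a data value from word-token frequencies. Training data and
# smoothing are illustrative only.
from collections import Counter, defaultdict
import math

training = [
    ("Seattle, WA", "address"), ("Boston, MA", "address"),
    ("Miami, FL", "address"),
    ("250K", "price"), ("320K", "price"), ("180K", "price"),
]

def tokens(value):
    return value.replace(",", " ").lower().split()

counts = defaultdict(Counter)   # label -> token counts
label_totals = Counter()        # label -> number of training examples
for value, label in training:
    counts[label].update(tokens(value))
    label_totals[label] += 1

def predict(value):
    vocab = {t for c in counts.values() for t in c}
    scores = {}
    for label in counts:
        # log P(label) + sum of log P(token | label), Laplace-smoothed
        s = math.log(label_totals[label] / len(training))
        total = sum(counts[label].values())
        for t in tokens(value):
            s += math.log((counts[label][t] + 1) / (total + len(vocab)))
        scores[label] = s
    return max(scores, key=scores.get)

print(predict("Kent, WA"))  # the token "wa" is strong evidence for address
```

Even though "Kent" never occurs in training, the state abbreviation pushes the prediction toward address, which is exactly the kind of data-instance evidence this learner contributes.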

16 The LSD Architecture Matching PhaseTraining Phase Mediated schema Source schemas Base-Learner 1 Base-Learner k Meta-Learner Training data for base learners Hypothesis 1 Hypothesis k Weights for Base Learners Base-Learner Base-Learner k Meta-Learner Prediction Combiner Predictions for elements Predictions for instances Constraint Handler Mappings Domain constraints

17 Naive Bayes Learner (“Miami, FL”, address) (“$250K”, price) (“James Smith”, agent-name) (“(305) ”, agent-phone) (“(305) ”, office-phone) (“Fantastic house”, description) (“Boston,MA”, address) Training the Base Learners Miami, FL $250K James Smith (305) (305) Fantastic house Boston, MA $320K Mike Doan (617) (617) Great location location price contact-name contact-phone office comments realestate.com (“location”, address) (“price”, price) (“contact name”, agent-name) (“contact phone”, agent-phone) (“office”, office-phone) (“comments”, description) Name Learner address price agent-name agent-phone office-phone description Mediated schema

18 Meta-Learner: Stacking [Wolpert-92, Ting&Witten-99] Training –uses training data to learn weights –one for each (base-learner, mediated-schema element) pair –weight(Name-Learner, address) = 0.2 –weight(Naive-Bayes, address) = 0.8 Matching: combine predictions of base learners –computes weighted average of base-learner confidence scores [example: source element area, with instances “Seattle, WA”, “Kent, WA”, “Bend, OR”: Name Learner => (address, 0.4), Naive Bayes => (address, 0.9), Meta-Learner => (address, 0.4*0.2 + 0.9*0.8 = 0.8)]
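The stacking step itself is just a weighted average of confidence scores. A minimal sketch using the weights and scores from this slide (function and variable names are made up):

```python
# Stacking combiner sketch: the meta-learner weighs each base
# learner's confidence score by a per-(learner, element) weight
# learned from the training sources.

def combine(predictions, weights):
    """Weighted average of base-learner confidence scores."""
    return sum(weights[learner] * score
               for learner, score in predictions.items())

# Scores for mediated-schema element "address" on source element "area"
predictions = {"name_learner": 0.4, "naive_bayes": 0.9}
weights     = {"name_learner": 0.2, "naive_bayes": 0.8}

print(combine(predictions, weights))  # 0.4*0.2 + 0.9*0.8 ≈ 0.8
```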

19 The LSD Architecture Matching PhaseTraining Phase Mediated schema Source schemas Base-Learner 1 Base-Learner k Meta-Learner Training data for base learners Hypothesis 1 Hypothesis k Weights for Base Learners Base-Learner Base-Learner k Meta-Learner Prediction Combiner Predictions for elements Predictions for instances Constraint Handler Mappings Domain constraints

20 contact-agent Applying the Learners Name Learner Naive Bayes Prediction-Combiner (address,0.8), (description,0.2) (address,0.6), (description,0.4) (address,0.7), (description,0.3) (address,0.6), (description,0.4) Meta-Learner Name Learner Naive Bayes (address,0.7), (description,0.3) (price,0.9), (agent-phone,0.1) extra-info homes.com Seattle, WA Kent, WA Bend, OR area sold-at (agent-phone,0.9), (description,0.1) Meta-Learner area sold-at contact-agent extra-info homes.com schema

21 Domain Constraints Encode user knowledge about domain Specified only once, by examining mediated schema Examples –at most one source-schema element can match address –if a source-schema element matches house-id then it is a key –avg-value(price) > avg-value(num-baths) Given a mapping combination –can verify if it satisfies a given constraint area: address sold-at: price contact-agent: agent-phone extra-info: address

22 area: (address,0.7), (description,0.3) sold-at: (price,0.9), (agent-phone,0.1) contact-agent: (agent-phone,0.9), (description,0.1) extra-info: (address,0.6), (description,0.4) The Constraint Handler Searches space of mapping combinations efficiently Can handle arbitrary constraints Also used to incorporate user feedback –sold-at does not match price Domain Constraints At most one element matches address Predictions from Prediction Combiner area: address sold-at: price contact-agent: agent-phone extra-info: description area: address sold-at: price contact-agent: agent-phone extra-info: address
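A toy version of the constraint handler's search, using this slide's predictions and the at-most-one-address constraint. Exhaustive enumeration with a product-of-confidences score is a stand-in for the efficient search the real system performs:

```python
# Constraint handler sketch: enumerate combinations of per-element
# match candidates, discard combinations violating a domain
# constraint, and keep the highest-confidence survivor.
from itertools import product

candidates = {
    "area":          [("address", 0.7), ("description", 0.3)],
    "sold-at":       [("price", 0.9), ("agent-phone", 0.1)],
    "contact-agent": [("agent-phone", 0.9), ("description", 0.1)],
    "extra-info":    [("address", 0.6), ("description", 0.4)],
}

def at_most_one_address(mapping):
    return sum(1 for label, _ in mapping.values() if label == "address") <= 1

def best_mapping(candidates, constraint):
    elements = list(candidates)
    best, best_score = None, -1.0
    for choice in product(*candidates.values()):
        mapping = dict(zip(elements, choice))
        if not constraint(mapping):
            continue
        score = 1.0
        for _, conf in choice:
            score *= conf
        if score > best_score:
            best, best_score = mapping, score
    return best

result = best_mapping(candidates, at_most_one_address)
print({e: label for e, (label, _) in result.items()})
```

Without the constraint, both area and extra-info would grab address; with it, the search settles on area: address and extra-info: description, the outcome shown on the slide.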

23 The Current LSD System Can also handle data in XML format –matches XML DTDs Base learners –Naive Bayes [Duda&Hart-93, Domingos&Pazzani-97] –exploits frequencies of words & symbols –WHIRL Nearest-Neighbor Classifier [Cohen&Hirsh KDD-98] –employs information-retrieval similarity metric –Name Learner [SIGMOD-01] –matches elements based on their names –County-Name Recognizer [SIGMOD-01] –stores all U.S. county names –XML Learner [SIGMOD-01] –exploits hierarchical structure of XML data

24 Empirical Evaluation Four domains –Real Estate I & II, Course Offerings, Faculty Listings For each domain –created mediated schema & domain constraints –chose five sources –extracted & converted data into XML –mediated schemas: elements, source schemas: Ten runs for each domain, in each run: –manually provided 1-1 matches for 3 sources –asked LSD to propose matches for remaining 2 sources –accuracy = % of 1-1 matches correctly identified

25 High Matching Accuracy [chart: LSD's average matching accuracy (%): best single base learner, then with the meta-learner, constraint handler, and XML learner added in turn (the complete LSD system)]

26 Contribution of Schema vs. Data [chart: average matching accuracy (%) of LSD with only schema info, LSD with only data info, and the complete LSD system] More experiments in [Doan et al., SIGMOD-01]

27 LSD Summary LSD –learns from previous matching activities –exploits multiple types of information –by employing multi-strategy learning –incorporates domain constraints & user feedback –achieves high matching accuracy LSD focuses on 1-1 matches Next challenge: discover more complex matches! –iMAP (Illinois Mapping) system [SIGMOD-04] –developed at Washington and Illinois –with Robin Dhamankar, Yoonkyong Lee, Alon Halevy, Pedro Domingos

28 The iMAP Approach For each mediated-schema element –searches space of all matches –finds a small set of likely match candidates –uses LSD to evaluate them To search efficiently –employs a specialized searcher for each element type –Text Searcher, Numeric Searcher, Category Searcher,... [example: mediated schema (price, num-baths, address) vs. homes.com schema (listed-price, agent-id, full-baths, half-baths, city, zipcode)]

29 The iMAP Architecture [SIGMOD-04] Source schema + dataMediated schema Searcher k Searcher 2 Domain knowledge and data Searcher 1 User Base-Learner Base-Learner k 1-1 and complex matches Meta-Learner Similarity Matrix Match candidates Match selector Explanation module

30 An Example: Text Searcher Beam search in space of all concatenation matches Example: find match candidates for address –candidate concatenations over the homes.com columns include concat(agent-id, zipcode), concat(city, zipcode), concat(agent-id, city) Best match candidates for address –(agent-id, 0.7), (concat(agent-id, city), 0.75), (concat(city, zipcode), 0.9)
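Candidate generation plus scoring can be sketched as follows. The sample rows (including the Seattle zipcode 98105) and the token-overlap score are invented for illustration; the real Text Searcher evaluates candidates with LSD's learners rather than this crude overlap:

```python
# Text Searcher sketch: build concatenations of source columns and
# score each candidate column set against known values of the
# mediated element "address". Data and scoring are illustrative.
from itertools import permutations

rows = [
    {"agent-id": "532a", "city": "Seattle", "zipcode": "98105"},
    {"agent-id": "115c", "city": "Miami",   "zipcode": "23591"},
]
address_examples = ["Kent, WA 98032", "Miami, FL 23591"]

def column_values(cols):
    return [" ".join(r[c] for c in cols) for r in rows]

def overlap(values, examples):
    # token-level Jaccard overlap between candidate and example values
    v = {t.lower().strip(",") for x in values for t in x.split()}
    e = {t.lower().strip(",") for x in examples for t in x.split()}
    return len(v & e) / len(v | e)

cols = list(rows[0])
candidates = [(c,) for c in cols] + list(permutations(cols, 2))
scored = sorted(candidates,
                key=lambda c: overlap(column_values(c), address_examples),
                reverse=True)
print(scored[0])  # ('city', 'zipcode')
```

A real searcher would explore this space with beam search instead of exhaustively, but the shape of the computation is the same: generate composite candidates, score them, keep the best few.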

31 Empirical Evaluation Current iMAP system –12 searchers Four real-world domains –real estate, product inventory, cricket, financial wizard –target schema: elements, source schema: Accuracy: % Sample discovered matches –agent-name = concat(first-name,last-name) –area = building-area / –discount-cost = (unit-price * quantity) * (1 - discount) More detail in [Dhamankar et al., SIGMOD-04]

32 Observations Finding complex matches much harder than 1-1 matches! –require gluing together many components –e.g., num-rooms = bath-rooms + bed-rooms + dining-rooms + living-rooms –if missing one component => incorrect match However, even partial matches are already very useful! –so are top-k matches => need methods to handle partial/top-k matches Huge/infinite search spaces –domain knowledge plays a crucial role! Matches are fairly complex, hard to know if they are correct –must be able to explain matches Human must be fairly active in the loop –need strong user interaction facilities Break matching architecture into multiple "atomic" boxes!

33 Road Map Schema Matching –motivation & problem definition –representative current solutions: LSD, iMAP, Clio –broader picture Ontology Matching –motivation & problem definition –representative current solution: GLUE –broader picture Conclusions & Emerging Directions

34 Finding Matches is only Half of the Job! Mappings –area = SELECT location FROM HOUSES –agent-address = SELECT concat(city,state) FROM AGENTS –list-price = price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id Schema T Schema S location price ($) agent-id Atlanta, GA 360, Raleigh, NC 430, HOUSES area list-price agent-address agent-name Denver, CO 550,000 Boulder, CO Laura Smith Atlanta, GA 370,800 Athens, GA Mike Brown LISTINGS id name city state fee-rate 32 Mike Brown Athens GA Jean Laup Raleigh NC 0.04 AGENTS To translate data/queries, need mappings, not matches

35 Clio: Elaborating Matches into Mappings Developed at Univ of Toronto & IBM Almaden, –by Renee Miller, Laura Haas, Mauricio Hernandez, Lucian Popa, Howard Ho, Ling Yan, Ron Fagin Given a match –list-price = price * (1 + fee-rate) Refine it into a mapping –list-price = SELECT price * (1 + fee-rate) FROM HOUSES (FULL OUTER JOIN) AGENTS WHERE agent-id = id Need to discover –the correct join path among tables, e.g., agent-id = id –the correct join, e.g., full outer join? inner join? Use heuristics to decide –when in doubt, ask users –employ sophisticated user interaction methods [VLDB-00, SIGMOD-01]

36 Clio: Illustrating Examples Mappings –area = SELECT location FROM HOUSES –agent-address = SELECT concat(city,state) FROM AGENTS –list-price = price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id Schema T Schema S location price ($) agent-id Atlanta, GA 360, Raleigh, NC 430, HOUSES area list-price agent-address agent-name Denver, CO 550,000 Boulder, CO Laura Smith Atlanta, GA 370,800 Athens, GA Mike Brown LISTINGS id name city state fee-rate 32 Mike Brown Athens GA Jean Laup Raleigh NC 0.04 AGENTS

37 Road Map Schema Matching –motivation & problem definition –representative current solutions: LSD, iMAP, Clio –broader picture Ontology Matching –motivation & problem definition –representative current solution: GLUE –broader picture Conclusions & Emerging Directions

38 Broader Picture: Find Matches COMA by Erhard Rahm group David Embley group at BYU Jaewoo Kang group at NCSU Kevin Chang group at UIUC Clement Yu group at UIC SEMINT [Li&Clifton94] ILA [Perkowitz&Etzioni95] DELTA [Clifton et al. 97] AutoMatch, Autoplex [Berlin & Motro, 01-03] LSD [Doan et al., SIGMOD-01] iMAP [Dhamanka et. al., SIGMOD-04] Single learner Exploit data 1-1 matches Hand-crafted rules Exploit schema 1-1 matches Learners + rules, use multi-strategy learning Exploit schema + data complex matches Exploit domain constraints More about some of these works soon.... TRANSCM [Milo&Zohar98] ARTEMIS [Castano&Antonellis99] [Palopoli et al. 98] CUPID [Madhavan et al. 01] Other Important Works

39 Broader Picture: From Matches to Mappings iMAP [Dhamanka et al., SIGMOD-04] CLIO [Miller et. al., 00] [Yan et al. 01] Rules Exploit data Powerful user interaction Learners + rules Exploit schema + data complex matches Automate as much as possible ?

40 Road Map Schema Matching –motivation & problem definition –representative current solutions: LSD, iMAP, Clio –broader picture Ontology Matching –motivation & problem definition –representative current solution: GLUE –broader picture Conclusions & Emerging Directions

41 Ontology Matching Increasingly critical for –knowledge bases, Semantic Web An ontology –concepts organized into a taxonomy tree –each concept has a set of attributes and a set of instances –relations among concepts Matching –concepts –attributes –relations [example taxonomy, CS Dept. US: Entity => Undergrad Courses, Grad Courses, People; People => Staff, Faculty; Faculty => Assistant Professor, Associate Professor; sample instance: name: Mike Burns, degree: Ph.D.]

42 Matching Taxonomies of Concepts [CS Dept. Australia: Entity => Courses, Staff; Staff => Technical Staff, Academic Staff; Academic Staff => Lecturer, Senior Lecturer, Professor] [CS Dept. US: Entity => Undergrad Courses, Grad Courses, People; People => Staff, Faculty; Faculty => Assistant Professor, Associate Professor]

43 Glue Solution –Use data instances extensively –Learn classifiers using information within taxonomies –Use a rich constraint satisfaction scheme [Doan, Madhavan, Domingos, Halevy; WWW’2002]

44 Concept Similarity Joint Probability Distribution: P(A,S), P(¬A,S), P(A,¬S), P(¬A,¬S), defined over a hypothetical universe of all examples Sim(Concept A, Concept S) = P(A ∩ S) / P(A ∪ S) = P(A,S) / (P(A,S) + P(A,¬S) + P(¬A,S)) [Jaccard, 1908] Multiple similarity measures can be defined in terms of the JPD
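A sketch of this similarity: estimate the JPD from instance sets (here invented, and assumed drawn from a shared universe) and take the Jaccard coefficient P(A ∩ S) / P(A ∪ S):

```python
# Jaccard concept similarity sketch: estimate the joint probability
# distribution over membership in concepts A and S by counting
# instances, then compute P(A and S) / P(A or S).

def jaccard(instances_a, instances_s, universe):
    p = lambda s: len(s) / len(universe)
    a, s = set(instances_a), set(instances_s)
    p_as = p(a & s)            # P(A, S)
    p_a_not_s = p(a - s)       # P(A, not S)
    p_not_a_s = p(s - a)       # P(not A, S)
    return p_as / (p_as + p_a_not_s + p_not_a_s)

universe = set(range(10))      # hypothetical universe of all examples
A = {0, 1, 2, 3, 4}            # instances classified into concept A
S = {3, 4, 5, 6}               # instances classified into concept S
print(jaccard(A, S, universe))  # 2 shared / 7 in the union ≈ 0.2857
```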

45 Machine Learning for Computing Similarities JPD estimated by counting the sizes of the partitions [diagram: a classifier CL_S learned from taxonomy 2 classifies the instances of taxonomy 1, and a classifier CL_A learned from taxonomy 1 classifies the instances of taxonomy 2, partitioning all instances into A∧S, A∧¬S, ¬A∧S, ¬A∧¬S]

46 The Glue System Similarity Estimator Base Learner Meta Learner Relaxation Labeling Common Knowledge & Domain Constraints Similarity Function Joint Probability Distribution P(A,B), P(A’, B)… Similarity Matrix Taxonomy O1 (tree structure + data instances) Taxonomy O2 (tree structure + data instances) Distribution Estimator Matches for O 1, Matches for O 2

47 Constraints in Taxonomy Matching Domain-dependent –at most one node matches department-chair –a node that matches professor can not be a child of a node that matches assistant-professor Domain-independent –two nodes match if parents & children match –if all children of X matches Y, then X also matches Y –Variations have been exploited in many restricted settings [Melnik&Garcia-Molina,ICDE-02], [Milo&Zohar,VLDB-98], [Noy et al., IJCAI-01], [Madhavan et al., VLDB-01] Challenge: find a general & efficient approach

48 Solution: Relaxation Labeling Relaxation labeling [Hummel&Zucker, 83] –applied to graph labeling in vision, NLP, hypertext classification –finds best label assignment, given a set of constraints –starts with initial label assignment –iteratively improves labels, using constraints Standard relaxation labeling not applicable –extended it in many ways [Doan et al., WWW-02]
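A bare-bones relaxation-labeling loop over a two-node example. The taxonomy fragment, initial scores, and compatibility numbers are all invented to show the mechanics only:

```python
# Relaxation labeling sketch: start from initial per-node label
# scores and iteratively pull each node toward labels compatible
# with its neighbors' current best labels.

neighbors = {"staff": ["acad-staff"], "acad-staff": ["staff"]}
labels = ["people", "faculty"]

# initial label scores (e.g., from a similarity estimator) -- assumed
scores = {
    "staff":      {"people": 0.55, "faculty": 0.45},
    "acad-staff": {"people": 0.40, "faculty": 0.60},
}

# compat[(l1, l2)]: plausibility of label l1 for a node whose
# neighbor carries label l2 -- assumed numbers encoding "a People
# node tends to sit next to a Faculty node", etc.
compat = {
    ("people", "faculty"): 0.9, ("faculty", "people"): 0.9,
    ("people", "people"): 0.4, ("faculty", "faculty"): 0.4,
}

def step(scores):
    new = {}
    for node, dist in scores.items():
        support = {}
        for label in labels:
            s = dist[label]
            for nb in neighbors[node]:
                nb_label = max(scores[nb], key=scores[nb].get)
                s *= compat[(label, nb_label)]
            support[label] = s
        total = sum(support.values())
        new[node] = {l: v / total for l, v in support.items()}
    return new

for _ in range(5):
    scores = step(scores)
print({n: max(d, key=d.get) for n, d in scores.items()})
```

With these numbers the loop reinforces the mutually compatible assignment staff = People, acad-staff = Faculty, mirroring the slide's neighborhood example. GLUE's actual extension handles much larger neighborhoods and constraint sets.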

49 Real-World Experiments Taxonomies on the web –University organization (UW and Cornell) –colleges, departments, and sub-fields –Companies (Yahoo and The Standard) –industries and sectors For each taxonomy –extracted data instances: course descriptions, company profiles –trivial data cleaning –100-300 concepts per taxonomy –taxonomy depth: 3-4 –10-90 data instances per concept Evaluation against manual mappings as the gold standard

50 GLUE's Performance [chart: matching accuracy on University Depts 1, University Depts 2, and Company Profiles]

51 Broader Picture Ontology matching parallels the development of schema matching –rule-based & learning-based approaches –PROMPT family, OntoMorph, OntoMerge, Chimaera, Onion, OBSERVER, FCA-Merge,... –extensive work by Ed Hovy's group –ontology versioning (e.g., by Noy et al.) More powerful user interaction methods –e.g., iPROMPT, Chimaera Much more theoretical work in this area

52 Road Map Schema Matching –motivation & problem definition –representative current solutions: LSD, iMAP, Clio –broader picture Ontology Matching –motivation & problem definition –representative current solution: GLUE –broader picture Conclusions & Emerging Directions

53 Develop the Theoretical Foundation Not much is going on, however... –see works by Alon Halevy (AAAI-02) and Phil Bernstein (in model management contexts) –some preliminary work in AnHai Doan's Ph.D. dissertation –work by Stuart Russell and other AI people on identity uncertainty is potentially relevant Most likely foundation –probability framework

54 Need Much More Domain Knowledge Where to get it? –past matches (e.g., LSD, iMAP) –other schemas in the domain –holistic matching approach by Kevin Chang group [SIGMOD-02] –corpus-based matching by Alon Halevy group [IJCAI-03] –clustering to achieve bridging effects by Clement Yu group [SIGMOD-04] –external data (e.g., iMAP at SIGMOD-04) –mass of users (e.g., MOBS at WebDB-03) How to get it and how to use it? –no clear answer yet

55 Employ Multi-Module Architecture Many "black boxes", each good at doing a single thing Combine them and tailor them to each application Examples –LSD, iMAP, COMA, David Embley's systems Open issues –what are these black boxes? –how to build them? –how to combine them?

56 Powerful User Interaction Minimize user effort, maximize its impact Make it very easy for users to –supply domain knowledge –provide feedback on matches/mappings Develop powerful explanation facilities

57 Other Issues What to do with partial/top-k matches? Meaning negotiation Fortifying schemas for interoperability Very-large-scale matching scenarios (e.g., the Web) What can we do without the mappings? Interaction between schema matching and tuple matching? Benchmarks, tools?

58 Summary Schema/ontology matching: key to numerous data management problems –much attention in the database, AI, Semantic Web communities Simple problem definition, yet very difficult to do –no satisfactory solution yet –AI complete? We now understand the problems much better –still at the beginning of the journey –will need techniques from multiple fields

59 Backup Slides

60 Backup Slides

61 Training the Meta-Learner: Least-Squares Linear Regression [table: extracted XML instances (Miami, FL; $250,000; Seattle, WA; Kent, WA; ...) with Name Learner and Naive Bayes predictions against the true predictions; for address, regression yields weight(Name-Learner, address) = 0.1 and weight(Naive-Bayes, address) = 0.9]

62 Sensitivity to Amount of Available Data [chart: average matching accuracy (%) vs. number of data listings per source (Real Estate I)]

63 Contribution of Each Component [chart: average matching accuracy (%) without the Name Learner, without Naive Bayes, without the WHIRL learner, without the constraint handler, and for the complete LSD system]

64 Exploiting Hierarchical Structure Existing learners flatten out all structures Developed XML learner –similar to the Naive Bayes learner –input instance = bag of tokens –differs in one crucial aspect: considers not only text tokens, but also structure tokens [example listing: description “Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors.”, with Gail Murphy and MAX Realtors also appearing as separate structured elements]

65 Reasons for Incorrect Matchings Unfamiliarity –suburb –solution: add a suburb-name recognizer Insufficient information –correctly identified general type, failed to pinpoint exact type –e.g., agent-name vs. phone: “Richard Smith (206)” –solution: add a proximity learner Subjectivity –house-style = description? [example: house-style values “Victorian”, “Mexican” vs. description values “Beautiful neo-gothic house”, “Great location”]

66 Evaluate Mapping Candidates For address, Text Searcher returns –(agent-id, 0.7) –(concat(agent-id, city), 0.8) –(concat(city, zipcode), 0.75) Employ multi-strategy learning to evaluate mappings Example: (concat(agent-id, city), 0.8) –Naive Bayes Learner: 0.8 –Name Learner: “address” vs. “agent id city”: 0.3 –Meta-Learner: 0.8 * 0.7 + 0.3 * 0.3 = 0.65 Meta-Learner returns –(agent-id, 0.59) –(concat(agent-id, city), 0.65) –(concat(city, zipcode), 0.70)
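The multi-strategy evaluation of a candidate is again a weighted combination of learner scores. In this sketch the per-learner weights (0.7 for Naive Bayes, 0.3 for the Name Learner) are assumed values chosen to reproduce the 0.65 shown on the slide:

```python
# iMAP candidate evaluation sketch: each base learner scores a
# (possibly complex) match candidate, and the meta-learner combines
# the scores with assumed weights 0.7 / 0.3.
def score_candidate(learner_scores, weights):
    return sum(weights[l] * s for l, s in learner_scores.items())

candidate = "concat(agent-id, city)"
learner_scores = {"naive_bayes": 0.8, "name_learner": 0.3}
weights        = {"naive_bayes": 0.7, "name_learner": 0.3}

print(candidate, score_candidate(learner_scores, weights))  # ≈ 0.65
```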

67 Relaxation Labeling [diagram: matching Dept Australia (Courses, Staff, Tech. Staff, Acad. Staff) against Dept U.S. (Courses, People, Staff, Faculty) by assigning labels to nodes] Applied to similar problems in –vision, NLP, hypertext classification

68 Relaxation Labeling for Taxonomy Matching Must define –neighborhood of a node –k features of the neighborhood –how to combine the influence of the features Algorithm –init: for each (node, label) pair, compute an initial score –loop: for each pair, re-compute the score given the current labels of the neighborhood [example neighborhood configuration for Staff = People: Acad. Staff: Faculty, Tech. Staff: Staff]

69 Relaxation Labeling for Taxonomy Matching Huge number of neighborhood configurations! –typically neighborhood = immediate nodes –here neighborhood can be entire graph –100 nodes, 10 labels => 10^100 configurations Solution –label abstraction + dynamic programming –guarantees quadratic time for a broad range of domain constraints Empirical evaluation –GLUE system [Doan et al., WWW-02] –three real-world domains, 100-300 nodes per taxonomy –high accuracy, well above the best base learner –relaxation labeling very fast, finished in several seconds