
1 Learning to Map Between Schemas and Ontologies. Alon Halevy, University of Washington. Joint work with Anhai Doan and Pedro Domingos.

2 2 Agenda Ontology mapping is a key problem in many applications: –Data integration –Semantic web –Knowledge management –E-commerce LSD: –A solution that uses multi-strategy learning. –We started with schema matching (i.e., very simple ontologies). –Currently extending to more expressive ontologies. –Experiments show the approach is very promising!

3 3 The Structure Mapping Problem Types of structures: –Database schemas, XML DTDs, ontologies, … Input: –Two (or more) structures, S1 and S2 –Data instances for S1 and S2 –Background knowledge Output: –A mapping between S1 and S2 –Should enable translating between data instances. –Semantics of the mapping?

4 4 Semantic Mappings between Schemas Source schemas = XML DTDs. [Figure: two XML DTDs, one with elements house, location, contact, full-baths, half-baths, contact-info, agent-name, agent-phone, the other with house, address, name, phone, num-baths; arrows mark a 1-1 mapping (e.g., location to address) and a non 1-1 mapping (full-baths and half-baths to num-baths).]

5 5 Motivation Database schema integration –A problem as old as databases themselves. –database merging, data warehouses, data migration Data integration / information-gathering agents –On the WWW, in enterprises, large science projects Model management: –Model matching: a key operator in an algebra where models and mappings are first-class objects. –See [Bernstein et al., 2000] for more. The Semantic Web –Ontology mapping. System interoperability –E-services, application integration, B2B applications, …

6 6 Desiderata from Proposed Solutions Accuracy, efficiency, ease of use. Realistic expectations: –Unlikely to be fully automated. Need user in the loop. Some notion of semantics for mappings. Extensibility: –Solution should exploit additional background knowledge. “Memory”, knowledge reuse: –System should exploit previous manual or automatically generated matchings. –Key idea behind LSD.

7 7 LSD Overview L(earning) S(ource) D(escriptions) Problem: generating semantic mappings between a mediated schema and a large set of data source schemas. Key idea: generate the first mappings manually, and learn from them to generate the rest. Technique: multi-strategy learning (extensible!) First step [SIGMOD 2001]: 1-1 mappings between XML DTDs. Current focus: –Complex mappings –Ontology mapping.

8 8 Outline Overview of structure mapping Data integration and source mappings LSD architecture and details Experimental results Current work.

9 9 Data Integration [Figure: a query such as "Find houses with four bathrooms priced under $500,000" is posed against the mediated schema and reformulated over source schemas 1-3 (realestate.com, homes.com, homeseekers.com), which are accessed through wrappers.] Applications: WWW, enterprises, science projects. Techniques: virtual data integration, warehousing, custom code, wrappers, query reformulation and optimization.

10 10 Semantic Mappings between Schemas Source schemas = XML DTDs. [Figure, repeated from slide 4: two XML DTDs, one with elements house, location, contact, full-baths, half-baths, contact-info, agent-name, agent-phone, the other with house, address, name, phone, num-baths; arrows mark a 1-1 mapping (e.g., location to address) and a non 1-1 mapping (full-baths and half-baths to num-baths).]

11 11 Semantics (preliminary) Semantics of mappings has received no attention. Semantics of 1-1 mappings. Given: –R(A1,…,An) and S(B1,…,Bm) –1-1 mappings (Ai, Bj) Then we postulate the existence of a relation W such that: –π(C1,…,Ck)(W) = π(A1,…,Ak)(R) –π(C1,…,Ck)(W) = π(B1,…,Bk)(S) –W also includes the unmatched attributes of R and S. In English: R and S are projections of some universal relation W, and the mappings specify the projection variables and correspondences.

12 12 Why Matching is Difficult Aims to identify the same real-world entity –using names, structures, types, data values, etc. Schemas represent the same entity differently –different names => same entity: area & address => location –same names => different entities: area => location or square-feet Schema & data never fully capture semantics! –not adequately documented, not sufficiently expressive Intended semantics is typically subjective! –IBM Almaden Lab = IBM? Cannot be fully automated. Often hard for humans. Committees are required!

13 13 Current State of Affairs Finding semantic mappings is now the bottleneck! –largely done by hand –labor intensive & error prone –GTE: 4 hours/element for 27,000 elements [Li&Clifton00] Will only be exacerbated –data sharing & XML become pervasive –proliferation of DTDs –translation of legacy data –reconciling ontologies on semantic web Need semi-automatic approaches to scale up!

14 14 Outline Overview of structure mapping Data integration and source mappings LSD architecture and details Experimental results Current work.

15 15 The LSD Approach User manually maps a few data sources to the mediated schema. LSD learns from the mappings, and proposes mappings for the rest of the sources. Several types of knowledge are used in learning: –Schema elements, e.g., attribute names –Data elements: ranges, formats, word frequencies, value frequencies, length of texts. –Proximity of attributes –Functional dependencies, number of attribute occurrences. One learner does not fit all. Use multiple learners and combine with meta-learner.

16 16 Example [Figure: two source schemas matched against the mediated schema (address, price, agent-phone, description). realestate.com schema: location (Miami, FL; Boston, MA; …), listed-price ($250,000; $110,000; …), phone ((305) 729 0831; (617) 253 1429; …), comments (Fantastic house; Great location; …). homes.com schema: price ($550,000; $320,000; …), contact-phone ((278) 345 7215; (617) 335 2315; …), extra-info (Beautiful yard; Great beach; …).] Learned hypotheses: –If "fantastic" & "great" occur frequently in data values => description –If "phone" occurs in the name => agent-phone

17 17 Multi-Strategy Learning Use a set of base learners: –Name learner, Naïve Bayes, Whirl, XML learner And a set of recognizers: –County name, zip code, phone numbers. Each base learner produces a prediction weighted by a confidence score. Combine the base learners with a meta-learner, using stacking.
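
As a rough illustration (not LSD's actual code; the class and method names here are invented), the base learners and recognizers can be thought of as sharing one interface that maps a source-schema element plus its data values to weighted mediated-schema labels:

    # Illustrative interface: each learner returns {mediated element: confidence}.
    from abc import ABC, abstractmethod

    class BaseMatcher(ABC):
        @abstractmethod
        def predict(self, element_name, values):
            """Return {mediated-schema element: confidence in [0, 1]}."""

    class NameMatcher(BaseMatcher):
        """Matches on overlap between element names seen in training."""
        def __init__(self, training_pairs):   # e.g. [("phone", "agent-phone"), ...]
            self.training_pairs = training_pairs

        def predict(self, element_name, values):
            tokens = set(element_name.replace("-", " ").split())
            scores = {}
            for src_name, target in self.training_pairs:
                overlap = tokens & set(src_name.replace("-", " ").split())
                if overlap:
                    scores[target] = scores.get(target, 0.0) + len(overlap)
            total = sum(scores.values())
            return {t: s / total for t, s in scores.items()} if total else {}

    # A Naive Bayes, Whirl, or XML learner would implement the same interface;
    # the meta-learner then combines all of their predictions by stacking.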

18 18 Base Learners Name Learner –training examples: (contact, agent-phone), (contact-info, office-address), (phone, agent-phone), (listed-price, price) –prediction: contact-phone => (agent-phone, 0.7), (office-address, 0.3) Naive Bayes Learner [Domingos & Pazzani 97] –"Kent, WA" => (address, 0.8), (name, 0.2) Whirl Learner [Cohen & Hirsh 98] XML Learner –exploits hierarchical structure of XML data

19 19 Training the Base Learners [Figure: listings from realestate.com ("Miami, FL", "$250,000", "(305) 729 0831", "Fantastic house"; "Boston, MA", "$110,000", "(617) 253 1429", "Great location") and the manual mapping from its schema (location, listed-price, phone, comments) to the mediated schema (address, price, agent-phone, description) yield training examples. Name Learner: (location, address), (listed-price, price), (phone, agent-phone), … Naive Bayes Learner: ("Miami, FL", address), ("$ 250,000", price), ("(305) 729 0831", agent-phone), …]
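
A small sketch of this training step, assuming scikit-learn as a stand-in for the Naive Bayes base learner; the training pairs are the illustrative ones from the figure:

    # Train a Naive Bayes text classifier on (data value, mediated element) pairs.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    training_pairs = [
        ("Miami, FL", "address"),
        ("Boston, MA", "address"),
        ("$250,000", "price"),
        ("(305) 729 0831", "agent-phone"),
        ("Fantastic house", "description"),
    ]
    values, labels = zip(*training_pairs)

    vectorizer = CountVectorizer(token_pattern=r"[A-Za-z0-9$(),.]+")
    X = vectorizer.fit_transform(values)
    nb = MultinomialNB().fit(X, labels)

    # Predict weighted labels for an unseen value, e.g. "Kent, WA".
    probs = nb.predict_proba(vectorizer.transform(["Kent, WA"]))[0]
    print(dict(zip(nb.classes_, probs.round(2))))  # highest probability on 'address'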

20 20 Entity Recognizers Use pre-programmed knowledge to identify specific types of entities –date, time, city, zip code, name, etc. –house-area (30 X 70, 500 sq. ft.) –county-name recognizer Recognizers often have nice characteristics –easy to construct –many off-the-shelf research & commercial products –applicable across many domains –help with special cases that are hard to learn
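
A minimal sketch of a pre-programmed recognizer (the regex and class name are illustrative, not LSD's), exposing the same prediction interface as the base learners above:

    import re

    class PhoneRecognizer:
        """Pre-programmed knowledge: no training, just a pattern for US phone numbers."""
        PHONE = re.compile(r"\(\d{3}\)\s*\d{3}[ -]?\d{4}")

        def predict(self, element_name, values):
            if not values:
                return {}
            hits = sum(bool(self.PHONE.search(v)) for v in values)
            confidence = hits / len(values)
            return {"agent-phone": confidence} if confidence > 0 else {}

    # Most values look like phone numbers => high confidence for agent-phone.
    print(PhoneRecognizer().predict("contact", ["(305) 729 0831", "(617) 253 1429"]))
    # {'agent-phone': 1.0}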

21 21 Meta-Learner: Stacking Training of the meta-learner produces a weight for every pair of: –(base-learner, mediated-schema element) –weight(Name-Learner, address) = 0.1 –weight(Naive-Bayes, address) = 0.9 Combining predictions: –the meta-learner computes a weighted sum of base-learner confidence scores [Example: for "Seattle, WA", the Name Learner predicts (address, 0.6) and Naive Bayes predicts (address, 0.8); the Meta-Learner outputs (address, 0.6*0.1 + 0.8*0.9 = 0.78).]
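
The weighted-sum combination, as a tiny worked sketch using the slide's illustrative weights and scores:

    # Stacking at prediction time: weighted sum of base-learner confidences.
    weights = {                      # learned per (base-learner, mediated element)
        ("name-learner", "address"): 0.1,
        ("naive-bayes", "address"): 0.9,
    }
    base_predictions = {             # confidences for the value "Seattle, WA"
        "name-learner": {"address": 0.6},
        "naive-bayes": {"address": 0.8},
    }

    def combine(base_predictions, weights, label):
        return sum(weights[(learner, label)] * preds.get(label, 0.0)
                   for learner, preds in base_predictions.items())

    print(round(combine(base_predictions, weights, "address"), 2))  # 0.6*0.1 + 0.8*0.9 = 0.78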

22 22 Training the Meta-Learner [Figure: for the mediated-schema element address, extracted XML instances ("Miami, FL", "$250,000", "Seattle, WA", "Kent, WA 3", …) are scored by each base learner; least-squares linear regression over (Name Learner score, Naive Bayes score, true label) rows such as (0.5, 0.8, 1), (0.4, 0.3, 0), (0.3, 0.9, 1), (0.6, 0.8, 1), (0.3, 0.3, 0) yields Weight(Name-Learner, address) = 0.1 and Weight(Naive-Bayes, address) = 0.9.]
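
A sketch of that regression step, assuming NumPy; note that fitting only these five toy rows will not reproduce the slide's weights of 0.1 and 0.9, which come from the full training data; the point is the mechanism:

    import numpy as np

    # Columns: Name Learner score, Naive Bayes score; target: true 0/1 label for 'address'.
    scores = np.array([[0.5, 0.8],
                       [0.4, 0.3],
                       [0.3, 0.9],
                       [0.6, 0.8],
                       [0.3, 0.3]])
    truth = np.array([1, 0, 1, 1, 0])

    # Least-squares linear regression gives one weight per base learner.
    weights, *_ = np.linalg.lstsq(scores, truth, rcond=None)
    print(dict(zip(["name-learner", "naive-bayes"], weights.round(2))))
    # On the full training data LSD would obtain weights like 0.1 and 0.9.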

23 23 Applying the Learners [Figure: the homes.com schema (area, day-phone, extra-info, with values such as "Seattle, WA", "Kent, WA", "Austin, TX"; "(278) 345 7215", "(617) 335 2315", "(512) 427 1115"; "Beautiful yard", "Great beach", "Close to Seattle") is matched against the mediated schema (address, price, agent-phone, description). The Name Learner and Naive Bayes each emit weighted predictions, e.g. (address, 0.8), (description, 0.2) and (address, 0.6), (description, 0.4), and the Meta-Learner combines them, e.g. into (address, 0.7), (description, 0.3) for area and (agent-phone, 0.9), (description, 0.1) for day-phone.]

24 24 The Constraint Handler Extends learning to incorporate constraints –hard constraints –a = address & b = address ⇒ a = b –a = house-id ⇒ a is a key –a = agent-info & b = agent-name ⇒ b is nested in a –soft constraints –a = agent-phone & b = agent-name ⇒ a & b are usually close to each other –user feedback = hard or soft constraints Details in [Doan et al., SIGMOD 2001]
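
A minimal sketch (a hypothetical representation, not LSD's) of hard constraints as predicates that prune candidate assignments of source elements to mediated-schema elements:

    # A candidate assignment maps source elements to mediated-schema elements.
    def no_duplicate_address(assignment):
        """Hard constraint a = address & b = address => a = b: at most one address."""
        return sum(1 for target in assignment.values() if target == "address") <= 1

    HARD_CONSTRAINTS = [no_duplicate_address]

    def satisfies_all(assignment):
        return all(constraint(assignment) for constraint in HARD_CONSTRAINTS)

    candidates = [
        {"area": "address", "contact-phone": "agent-phone", "extra-info": "address"},
        {"area": "address", "contact-phone": "agent-phone", "extra-info": "description"},
    ]
    print([a for a in candidates if satisfies_all(a)])
    # Only the second assignment survives the constraint.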

25 25 The Current LSD System [Architecture diagram: in the training phase, the mediated schema, source schemas, and data listings are used to train Base-Learner 1 … Base-Learner k and the Meta-Learner; in the matching phase their predictions pass through the Constraint Handler, which applies domain constraints and user feedback to produce the mappings.]

26 26 Outline Overview of structure mapping Data integration and source mappings LSD architecture and details Experimental results Current work.

27 27 Empirical Evaluation Four domains –Real Estate I & II, Course Offerings, Faculty Listings For each domain –create mediated DTD & domain constraints –choose five sources –extract & convert data listings into XML (faithful to schema!) –mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48 Ten runs for each experiment - in each run: –manually provide 1-1 mappings for 3 sources –ask LSD to propose mappings for remaining 2 sources –accuracy = % of 1-1 mappings correctly identified

28 28 Matching Accuracy LSD's accuracy: 71 - 92% Best single base learner: 42 - 72% + Meta-learner: + 5 - 22% + Constraint handler: + 7 - 13% + XML learner: + 0.8 - 6% [Chart: average matching accuracy (%).]

29 29 Sensitivity to Amount of Available Data [Chart: average matching accuracy (%) vs. number of data listings per source (Real Estate I).]

30 30 Contribution of Schema vs. Data [Chart: average matching accuracy (%) for LSD with only schema info, LSD with only data info, and the complete LSD system.] More experiments in the paper [Doan et al. 01]

31 31 Reasons for Incorrect Matching Unfamiliarity –suburb –solution: add a suburb-name recognizer Insufficient information –correctly identified general type, failed to pinpoint exact type – Richard Smith (206) 234 5412 –solution: add a proximity learner Subjectivity –house-style = description?

32 32 Outline Overview of structure mapping Data integration and source mappings LSD architecture and details Experimental results Current work.

33 33 Moving Up the Expressiveness Ladder Schemas are very simple ontologies. More expressive power = more domain constraints. Mappings become more complex, but constraints provide more to learn from. Non 1-1 mappings: –F1(A1,…,Am) = F2(B1,…,Bm) Ontologies (of various flavors): –Class hierarchy (i.e., containment on unary relations) –Relationships between objects –Constraints on relationships

34 34 Finding Non 1-1 Mappings (current work) Given two schemas, find –1-many mappings: address = concat(city, state) –many-1: half-baths + full-baths = num-baths –many-many: concat(addr-line1, addr-line2) = concat(street, city, state) 1-many mappings –expressed as a query –value correspondence expression: room-rate = rate * (1 + tax-rate) –relationship: state of tax-rate = state of hotel that has rate –special case: 1-many mappings between two relational tables [Figure: source schema (city, state, comments, half-baths, full-baths) mapped to mediated schema (address, description, num-baths).]
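
A small sketch (a hypothetical representation) of such non 1-1 mappings written as value-correspondence expressions over a source row:

    # Non 1-1 mappings expressed as functions over a source row (a dict of values).
    non_1_1_mappings = {
        # 1-many (slide's example): address = concat(city, state)
        "address": lambda row: f'{row["city"]}, {row["state"]}',
        # many-1 (slide's example): half-baths + full-baths = num-baths
        "num-baths": lambda row: row["half-baths"] + row["full-baths"],
    }

    source_row = {"city": "Seattle", "state": "WA", "half-baths": 1, "full-baths": 2}
    mediated_row = {target: f(source_row) for target, f in non_1_1_mappings.items()}
    print(mediated_row)  # {'address': 'Seattle, WA', 'num-baths': 3}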

35 35 Brute-Force Solution Define a set of operators –concat, +, -, *, /, etc. For each set of mediated-schema columns –enumerate all possible mappings m1, m2, …, mk built from the source-schema columns –evaluate each candidate by computing similarity using all base learners –return the best mapping

36 36 Search-Based Solution States = columns –goal state: mediated-schema column –initial states: all source-schema columns –use 1-1 matching to reduce the set of initial states Operators: concat, +, -, *, /, etc. Column similarity: –use all base learners + recognizers
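
A rough sketch of this idea as a generic beam search; the column_similarity scorer and the toy data below are invented stand-ins for the base learners and recognizers:

    import itertools

    def beam_search_mapping(target_column, source_columns, column_similarity,
                            operators, beam_width=3, max_steps=2):
        """Each state is (expression, values); expand with operators, keep the best few."""
        beam = [(name, values) for name, values in source_columns.items()]
        best = max(beam, key=lambda s: column_similarity(target_column, s[1]))
        for _ in range(max_steps):
            candidates = []
            for (expr_a, vals_a), (expr_b, vals_b) in itertools.permutations(beam, 2):
                for op_name, op in operators.items():
                    expr = f"{op_name}({expr_a}, {expr_b})"
                    vals = [op(a, b) for a, b in zip(vals_a, vals_b)]
                    candidates.append((expr, vals))
            if not candidates:
                break
            beam = sorted(candidates,
                          key=lambda s: column_similarity(target_column, s[1]),
                          reverse=True)[:beam_width]
            best = max([best] + beam, key=lambda s: column_similarity(target_column, s[1]))
        return best[0]

    # Toy usage: similarity = fraction of values that match exactly.
    operators = {"concat": lambda a, b: f"{a}, {b}"}
    sources = {"city": ["Seattle", "Kent"], "state": ["WA", "WA"]}
    target = ["Seattle, WA", "Kent, WA"]
    similarity = lambda t, v: sum(x == y for x, y in zip(t, v)) / len(t)
    print(beam_search_mapping(target, sources, similarity, operators))  # concat(city, state)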

37 37 Multi-Strategy Search Use a set of expert modules: L1, L2, …, Ln Each module –applies only to certain types of mediated-schema column –searches a small subspace –uses a cheap similarity measure to compare columns Example –L1: text; concat; TF/IDF –L2: numeric; +, -, *, /; [Ho et al. 2000] –L3: address; concat; Naive Bayes Search techniques –beam search as default –specialized, so columns do not have to be materialized

38 38 Multi-Strategy Search (cont'd) Apply all applicable expert modules –L1 proposes m11, m12, m13, …, m1x –L2 proposes m21, m22, m23, …, m2y –L3 proposes m31, m32, m33, …, m3z Combine the modules' predictions & select the best one –e.g., the pooled candidates m11, m12, m21, m22, m31, m32 are compared by computing similarity using all base learners, and m11 is selected.

39 39 Related Work [Comparison chart of matching systems: TRANSCM [Milo & Zohar 98], ARTEMIS [Castano & Antonellis 99], [Palopoli et al. 98], CUPID [Madhavan et al. 01], SEMINT [Li & Clifton 94], ILA [Perkowitz & Etzioni 95], DELTA [Clifton et al. 97], CLIO [Miller et al. 00], [Yan et al. 01], and LSD [Doan et al. 2000, 2001], positioned along dimensions such as single learner vs. hybrid vs. multi-strategy learning, use of recognizers, schema only vs. schema + data, 1-1 vs. non 1-1 matching, and sophistication of data-driven user interaction.]

40 40 Summary LSD: –uses multi-strategy learning to semi-automatically generate semantic mappings. –LSD is extensible and incorporates domain and user knowledge, and previous techniques. –Experimental results show the approach is very promising. Future work and issues to ponder: –Accommodating more expressive languages: ontologies –Reuse of learned concepts from related domains. –Semantics? Data management is a fertile area for Machine Learning research!

41 41 Backup Slides

42 42 Mapping Maintenance [Figure: mappings m1, m2, m3 between source schema S and mediated schema M, and between their evolved versions S' and M'.] Ten months later... –are the mappings still correct?

43 43 Information Extraction from Text Extract data fragments from text documents –date, location, & victim's name from a news article Intensive research on free-text documents Many documents do have substantial structure –XML pages, name cards, tables, lists Each such document = a data source –structure forms a schema –only one data value per schema element –a "real" data source has many data values per schema element Ongoing research in the IE community

44 44 Contribution of Each Component [Chart: average matching accuracy (%) for the complete LSD system vs. LSD without the Name Learner, without Naive Bayes, without the Whirl Learner, and without the Constraint Handler.]

45 45 Exploiting Hierarchical Structure Existing learners flatten out all structures. Developed XML learner –similar to the Naive Bayes learner –input instance = bag of tokens –differs in one crucial aspect –considers not only text tokens, but also structure tokens [Example XML instance (markup stripped in this transcript): a listing whose description reads "Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors." with nested elements for the contact name (Gail Murphy) and office (MAX Realtors).]
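
A minimal sketch (Python standard library; the XML tag names are invented for illustration) of building a bag of both text tokens and structure tokens from an XML instance, which is the distinction the XML learner exploits:

    import xml.etree.ElementTree as ET

    # Hypothetical XML instance; the tag names are illustrative only.
    doc = """<listing>
      <description>Victorian house with a view. Name your price!</description>
      <contact><name>Gail Murphy</name><office>MAX Realtors</office></contact>
    </listing>"""

    def bag_of_tokens(xml_text):
        """Collect text tokens plus one structure token per element tag."""
        tokens = []
        for element in ET.fromstring(xml_text).iter():
            tokens.append(f"<{element.tag}>")                 # structure token
            if element.text and element.text.strip():
                tokens.extend(element.text.lower().split())   # text tokens
        return tokens

    print(bag_of_tokens(doc))
    # ['<listing>', '<description>', 'victorian', 'house', ..., '<name>', 'gail', 'murphy', ...]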

46 46 Domain Constraints Impose semantic regularities on sources –verified using schema or data Examples –a = address & b = address ⇒ a = b –a = house-id ⇒ a is a key –a = agent-info & b = agent-name ⇒ b is nested in a Can be specified up front –when creating the mediated schema –independent of any actual source schema

47 47 The Constraint Handler Can specify arbitrary constraints. User feedback = domain constraint –ad-id = house-id Extended to handle domain heuristics –a = agent-phone & b = agent-name ⇒ a & b are usually close to each other [Worked example: predictions from the Meta-Learner are area: (address, 0.7), (description, 0.3); contact-phone: (agent-phone, 0.9), (description, 0.1); extra-info: (address, 0.6), (description, 0.4). Domain constraint: a = address & b = address ⇒ a = b. Candidate assignments and their combined scores: area: address, contact-phone: agent-phone, extra-info: address scores 0.7 * 0.9 * 0.6 = 0.378 but violates the constraint (two elements mapped to address); area: address, contact-phone: agent-phone, extra-info: description scores 0.7 * 0.9 * 0.4 = 0.252 and is chosen; the all-description assignment scores 0.3 * 0.1 * 0.4 = 0.012.]

