
1 iMAP: Discovering Complex Semantic Matches between Database Schemas
Robin Dhamankar, Yoonkyong Lee, AnHai Doan (University of Illinois, Urbana-Champaign)
Alon Halevy, Pedro Domingos (University of Washington)

2 Semantic Heterogeneity
Semantics refers to the meaning of data, in contrast to syntax, which defines only the structure of schema items (e.g., classes and attributes).
Semantic heterogeneity is a general term for disagreement about the meaning, interpretation, or intended use of the same or related data.
It is one of the toughest problems in data integration.

3 Schema Matching
A match between two schemas specifies semantic correspondences between elements of the two schemas.
A complex match specifies that a combination of attributes in one schema corresponds to a combination in the other.

4 Aim
Examine an unbounded number of match candidates for complex matching between data sources.
Describe the iMAP system, which semi-automatically discovers both 1-1 and complex matches for relational data (database schemas in this paper).
The system uses the following steps:
  o Generating matches
  o Exploiting domain knowledge
  o Explaining match predictions

5 Contributions
An architecture for semi-automatically discovering complex matches that combines search through a set of candidate matches with methods for evaluating each match in isolation and a set of matches as a whole.
The use of new kinds of domain knowledge (overlap data and mined external data), applied as early as possible in the matching process.
A mechanism for explaining the decisions made by the matching system.
The iMAP system, which embodies all these innovations, and a set of experiments on real-world data.

6 Problem Definition
Consider the two relational schemas S and T in Figure 1. Both databases store house listings and are managed by two different real-estate companies. The schema of database T, for example, has one table, LISTINGS, whereas database S stores its data in two tables, HOUSES and AGENT. Suppose the two real-estate companies decide to merge. To cut costs, they eliminate database S by transferring all house listings from S to database T.

7 Semantic Mapping
Specifies relationships between schema elements of disparate data sources. Example:

Table S1
  agent-id | city    | state | price ($)
  32       | Atlanta | GA    | 360,000
  15       | Raleigh | NC    | 360,000

Table S2
  id | agent      | commission
  32 | Mike Brown | 0.03
  15 | Jean Laup  | 0.04

Table T
  location    | listed-price | agent-name
  Denver, CO  | 550,000      | Laura Smith
  Atlanta, GA | 370,800      | Mike Brown

8 Creating Semantic Mappings
Step 1: Generating matches
  location = concat(city, state)
  listed-price = price * (1 + commission)
  agent-name = agent
Step 2: Elaborating matches into mappings
  location = SELECT concat(city, state) FROM S1
  listed-price = SELECT price * (1 + commission) FROM S1, S2 WHERE agent-id = id
  agent-name = SELECT agent FROM S2
(Tables S1, S2, and target T are as on slide 7.)
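
The two steps above can be sketched in code. The following is a minimal illustration (not iMAP's implementation) that executes the three elaborated mappings over a toy in-memory version of the example tables; the dict-based table representation and the migrate function are assumptions made for this sketch.

```python
# Toy versions of the source tables (values taken from the running example).
S1 = [  # HOUSES: agent-id, city, state, price
    {"agent-id": 32, "city": "Atlanta", "state": "GA", "price": 360_000},
    {"agent-id": 15, "city": "Raleigh", "state": "NC", "price": 360_000},
]
S2 = [  # AGENT: id, agent, commission
    {"id": 32, "agent": "Mike Brown", "commission": 0.03},
    {"id": 15, "agent": "Jean Laup", "commission": 0.04},
]

def migrate(s1, s2):
    """Apply the three mappings, joining S1 and S2 on agent-id = id."""
    agents = {r["id"]: r for r in s2}
    target = []
    for h in s1:
        a = agents[h["agent-id"]]
        target.append({
            "location": f'{h["city"]}, {h["state"]}',            # concat(city, state)
            "listed-price": h["price"] * (1 + a["commission"]),  # price * (1 + commission)
            "agent-name": a["agent"],                            # agent
        })
    return target

T = migrate(S1, S2)
# First migrated tuple: location 'Atlanta, GA', listed-price ~370,800, agent 'Mike Brown'
```

Note how the complex match listed-price = price * (1 + commission) only becomes executable once the elaboration step supplies the join condition agent-id = id; the match alone does not say how to pair houses with agents.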

9 Generating Matches
1-1 matches:
  agent-name = agent
Complex matches:
  location = concat(city, state)
  listed-price = price * (1 + commission)
In the experiments in this paper, 30-50% of the matches are complex.

10 Illustrative Example
Source table S: city, state, price, agent, commission. Target table T: location, listed-price, agent-name.
Step 1: Generating candidate matches
  location = concat(city, state); location = city; ...
  listed-price = price * (1 + commission); listed-price = price; ...
  agent-name = agent; ...
Step 2: Estimating similarity
  sim(location, concat(city, state)) = 0.96; sim(location, city) = 0.76; ...
  sim(listed-price, price * (1 + commission)) = 0.97; sim(listed-price, price) = 0.86; ...
  sim(agent-name, agent) = 0.95; ...
Step 3: Finding the best global assignment, using constraints such as "location and listed-price are unrelated" and "city and price are unrelated":
  location = concat(city, state)
  listed-price = price * (1 + commission)
  agent-name = agent
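
A toy sketch of the final step (an assumption-laden simplification, not iMAP's match selector): given the similarity scores quoted on this slide, assign each target attribute its highest-scoring candidate. The nested-dict representation is invented for this sketch; the real selector also enforces inter-match constraints.

```python
# sim(target, candidate) values quoted on the slide.
similarities = {
    "location": {"concat(city,state)": 0.96, "city": 0.76},
    "listed-price": {"price*(1+commission)": 0.97, "price": 0.86},
    "agent-name": {"agent": 0.95},
}

def best_assignment(sims):
    """Greedily pick the highest-scoring candidate for each target attribute."""
    return {target: max(cands, key=cands.get) for target, cands in sims.items()}

print(best_assignment(similarities))
```

On this data, the greedy choice already yields the correct matches; the constraints on the slide matter when the top candidates of different targets conflict.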

11 iMAP Architecture
Inputs: target schema T, source schema S, and domain knowledge and data.
Pipeline: Match generator → match candidates → Similarity estimator → similarity matrix → Match selector → 1-1 and complex matches.
An explanation module lets the user inspect the decisions behind the produced matches.

12 Match Generator
The space of candidate matches is huge, e.g.:
  location = city; location = state; ...; location = concat(city, state); location = concat(city, agent-name); ...; location = price * (1 + commission); location = price * (1 - commission); ...
  listed-price = price; listed-price = commission; ...; listed-price = price * (1 + commission); listed-price = price * (1 - commission); ...; listed-price = concat(city, state); listed-price = concat(city, agent-name); ...
Solution: multi-searcher strategy
  - employ a set of special-purpose searchers
  - each searcher examines only a specific portion of the search space (e.g., text searcher, numeric searcher, ...)
iMAP is highly extensible: new searchers can be added as they are developed.

13 Default Search Strategy
Use beam search:
  o at each search level, keep the k best match candidates
Example: find all concatenation candidate matches for location (k = 3).
Level 1: from the source attributes city, state, price ($), commission, agent-name, keep the three best candidates:
  concat(city, state), concat(city, price), concat(city, agent-name), ...

14 Default Search Strategy
Use beam search:
  o at each search level, keep the k best match candidates
Example (continued): each kept candidate is expanded at the next level, e.g.
  concat(city, state) → concat(city, state, price), concat(city, state, agent-name), ...
and likewise for concat(city, price) and concat(city, agent-name); again only the k = 3 best candidates are kept.
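
The beam search described on these two slides can be sketched as follows. This is a simplified illustration, not iMAP's code: the toy sim() function scores a concatenation by how many target values it reproduces exactly, whereas the real searchers use the learned scoring functions of the next slide.

```python
def beam_search(attrs, score, k=3, max_len=3):
    """Grow concatenation candidates one attribute per level,
    keeping only the k best candidates at each level."""
    beam = sorted(((score((a,)), (a,)) for a in attrs), reverse=True)[:k]
    best = max(beam)
    for _ in range(max_len - 1):
        expanded = [(score(c + (a,)), c + (a,))
                    for _, c in beam for a in attrs if a not in c]
        if not expanded:
            break
        beam = sorted(expanded, reverse=True)[:k]
        best = max(best, beam[0])
    return best[1]

# Toy data and score: how well the concatenation reproduces `location`.
rows = [("Atlanta", "GA", 360_000), ("Raleigh", "NC", 360_000)]
cols = {"city": 0, "state": 1, "price": 2}
target = {"Atlanta, GA", "Raleigh, NC"}

def sim(cand):
    produced = {", ".join(str(r[cols[a]]) for a in cand) for r in rows}
    return len(produced & target) / len(target)

print(beam_search(list(cols), sim, k=3, max_len=3))  # expected: ('city', 'state')
```

With k smaller than the number of attributes, whole subtrees of candidates are never expanded, which is exactly why the explanation facility later needs to report the searcher's length and beam limits.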

15 Scoring Functions
Each searcher has its own scoring function:
  o it can use machine learning, statistics, heuristics, etc.
E.g., the current text searcher uses Naive Bayes classification. To compute the score between location and concat(city, state):
  o build a Naive Bayes classifier for location
  o use the classifier to compute how similar each instance of concat(city, state) is to location
  o the average similarity value is the desired score
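
The three bullets above can be sketched as a toy word-level Naive Bayes scorer. Everything here is an assumption for illustration (the tokenizer, the use of other target columns as negative examples, the smoothing); it is not the paper's classifier, only an instance of the same idea: average P(location | instance) over the candidate's instances.

```python
import math
from collections import Counter

def tokens(value):
    return value.lower().replace(",", " ").split()

class NaiveBayes:
    """Binary word-level Naive Bayes with add-one smoothing."""
    def __init__(self, pos_docs, neg_docs):
        self.counts = {True: Counter(), False: Counter()}
        self.totals = {True: 0, False: 0}
        self.priors = {True: len(pos_docs), False: len(neg_docs)}
        for label, docs in ((True, pos_docs), (False, neg_docs)):
            for doc in docs:
                for t in tokens(doc):
                    self.counts[label][t] += 1
                    self.totals[label] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])

    def prob_pos(self, doc):
        logp = {}
        for label in (True, False):
            lp = math.log(self.priors[label])
            for t in tokens(doc):
                lp += math.log((self.counts[label][t] + 1)
                               / (self.totals[label] + len(self.vocab)))
            logp[label] = lp
        m = max(logp.values())
        num = math.exp(logp[True] - m)
        return num / (num + math.exp(logp[False] - m))

def nb_score(nb, candidate_instances):
    """The searcher's score: average P(target | instance) over the candidate."""
    return sum(nb.prob_pos(v) for v in candidate_instances) / len(candidate_instances)

nb = NaiveBayes(
    pos_docs=["Denver, CO", "Atlanta, GA", "Raleigh, NC"],  # instances of location
    neg_docs=["Laura Smith", "Mike Brown", "Jean Laup"],    # instances of agent-name
)
good = nb_score(nb, ["Atlanta, GA", "Raleigh, NC"])  # instances of concat(city,state)
bad = nb_score(nb, ["Mike Brown", "Jean Laup"])      # instances of agent
# good should clearly exceed bad
```

The key point the slide makes survives in the sketch: the classifier is trained once per target attribute, then reused to score every candidate's instances.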

16 Termination Condition
Even with beam search, the search space can still be huge or infinite:
  o need an effective termination condition
Currently: terminate on diminishing returns. Let f be the searcher's scoring function; stop when the best candidates of consecutive iterations score nearly the same:
  |f(best candidate at the 3rd iteration) - f(best candidate at the 2nd iteration)| <= 0.03 (the threshold)
where the best candidate is, e.g., concat(city, state).
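
The diminishing-returns test can be written in a couple of lines; the helper below is a sketch (only the 0.03 threshold comes from the slide, the surrounding bookkeeping is assumed).

```python
def should_terminate(prev_best_score, curr_best_score, threshold=0.03):
    """Stop searching when the best score improves by at most `threshold`."""
    return abs(curr_best_score - prev_best_score) <= threshold

# e.g. best f at the 2nd iteration = 0.60, at the 3rd = 0.94: keep searching;
# best f at the 3rd iteration = 0.94, at the 4th = 0.96: stop.
assert not should_terminate(0.60, 0.94)
assert should_terminate(0.94, 0.96)
```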

17 Implemented Searchers

  Searcher        | Space of candidates                            | Example matches
  Text            | Text attributes of the source schema           | name = concat(first-name, last-name)
  Numeric         | User-supplied matches or past complex matches  | list-price = price * (1 + tax-rate)
  Category        | Attributes with fewer than t distinct values   | product-categories = product-types
  Schema mismatch | Source attribute containing target schema info | fireplace = 1 if house-desc has "fireplace"
  Unit conversion | Physical-quantity attributes                   | weight-kg = 2.2 * net-weight-pounds
  Date            | Columns recognized as ontology nodes           | birth-date = b-day / b-month / b-year
  Overlap numeric | Specified by a context-free grammar            | interest-earned = balance * interest-rate

18 Exploiting Domain Knowledge
Exploit new kinds of domain knowledge: overlap data and external data.
  o Overlap data: data from the two sources that represents the same entity, e.g., tuples from real-estate listings that describe the same house.
  o External data: data not in the two input schemas, e.g., "# of real estate agents <= 50" mined from real-estate listings.
Use domain knowledge as early as possible in the matching process.
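
A toy illustration of why overlap data is so useful (the data and helper below are assumptions for this sketch, not the paper's algorithm): if some tuples are known to describe the same house in both sources, a candidate arithmetic match can be verified directly against those shared tuples.

```python
# (source tuple, target tuple) pairs known to describe the same house.
overlap = [
    ({"price": 360_000, "commission": 0.03}, {"listed-price": 370_800}),
    ({"price": 200_000, "commission": 0.04}, {"listed-price": 208_000}),
]

def holds_on_overlap(candidate, overlap, tol=1e-6):
    """Check whether `candidate` reproduces the target value on every overlap pair."""
    return all(abs(candidate(src) - tgt["listed-price"]) <= tol
               for src, tgt in overlap)

good = lambda src: src["price"] * (1 + src["commission"])  # price * (1 + commission)
bad = lambda src: src["price"]                             # listed-price = price

assert holds_on_overlap(good, overlap)      # consistent on both shared tuples
assert not holds_on_overlap(bad, overlap)   # refuted by the overlap data
```

A single overlapping tuple can refute an arithmetic candidate outright, which is why iMAP applies such knowledge early, pruning the search rather than re-ranking its output.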

19 Generating Explanations
Motivating example 1:
  name = last-name: 0.434
  name = concat(first-name, last-name): 0.420
  ...
  Correct match: name = concat(first-name, last-name)
Motivating example 2:
  num-rooms = bath-rooms + bed-rooms + dining-rooms: 0.434
  num-rooms = bath-rooms + bed-rooms + living-rooms: 0.432
  ...
  Correct match: num-rooms = bath-rooms + bed-rooms + dining-rooms + living-rooms

20 Types of Questions Users Can Ask
Existence of a match
  o why is month-posted = monthly-fee-rate generated?
Non-existence of a match
  o why is num-rooms = bath-rooms + bed-rooms + dining-rooms + living-rooms not generated?
Ranking of a match
  o why is name = last-name ranked higher than name = concat(first-name, last-name)?

21 Generating Explanations Using a Dependency Graph
Nodes in the graph record assumptions, source and target columns, constraints, candidates, matches, and their scores; edges record what each decision depends on. For month-posted = monthly-fee-rate:
  o Preprocessor assumption: if the data in a column is in 1..12, it is a month.
  o Date searcher: month-posted is a month; monthly-fee-rate is a month → generates the candidate month-posted = monthly-fee-rate.
  o Evaluators score the candidate: name-based evaluator 0.55, Naive Bayes evaluator 0.79.
  o Combining module: combined score 0.67.
  o Match selector: weighs it against competing candidates such as list-price = price (score 0.63) and list-price = price * (1 + monthly-fee-rate) (score 0.76), using constraints such as "month-posted is unrelated to list-price".
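
The dependency graph on this slide can be sketched as a small data structure (the Node class and explain() walk are assumptions for illustration, not iMAP's implementation): each decision keeps pointers to the decisions it rests on, so producing an explanation is just a walk back through the graph.

```python
class Node:
    """One decision in the dependency graph and the decisions it depends on."""
    def __init__(self, kind, text, deps=()):
        self.kind, self.text, self.deps = kind, text, list(deps)

def explain(node, depth=0):
    """Render the subgraph rooted at `node` as indented explanation lines."""
    lines = ["  " * depth + f"[{node.kind}] {node.text}"]
    for dep in node.deps:
        lines.extend(explain(dep, depth + 1))
    return lines

# The chain of decisions from the slide, scores included.
assumption = Node("assumption", "if a column's values are in 1..12, it is a month")
candidate = Node("candidate", "month-posted = monthly-fee-rate", [assumption])
nb = Node("score", "Naive Bayes evaluator: 0.79", [candidate])
name = Node("score", "name-based evaluator: 0.55", [candidate])
combined = Node("score", "combined score: 0.67", [nb, name])

print("\n".join(explain(combined)))
```

Because every score node points back to the candidate and ultimately to the preprocessor assumption, the system can answer "why was this match generated?" by printing exactly this chain.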

22 Sample Explanation
USER: Why does num-rooms = bath-rooms + bed-rooms + dining-rooms + living-rooms not appear?
iMAP:
(1) The overlap numeric searcher CAN generate bath-rooms + bed-rooms + dining-rooms + living-rooms for num-rooms.
(2) The overlap numeric searcher DID NOT generate it.
(3) A reason: the match has a length of 4 terms; the overlap numeric searcher considered only candidates of length up to 3 terms.
(4) Characteristics of the search space for num-rooms:
  a. Number of considered numeric attributes: 7
  b. Considered numeric attributes: building-area, lot-dimension1, lot-dimension2, bath-rooms, bed-rooms, dining-rooms, living-rooms
  c. Used successor functions: addition, multiplication
  d. Max. number of terms: 3
  e. Max. number of elements in a term: 3

23 Empirical Evaluation
Domains & data sources:
  o four domains: Real Estate, Inventory, Cricket, and Financial Wizard
  o avg. # of matches: 30
  o data obtained from several sources: the Internet, Microsoft Access sample databases, and students in a large database class
Methodology and performance measure:
  o run experiments with several configurations on eight experiment sets
  o top-1 matching accuracy = % of target attributes whose top-1 matches are correct

24 Complex Matching Accuracy
1-1 matching accuracy: 77-100%
Complex matching accuracy: 50-86%
  Default system: 33-55%
  + domain constraints: 50-55%
  + overlap data: 33-71%
  + domain constraints + overlap data: 50-86%

25 Complex Matching Accuracy for Disjoint Domains
Complex matching accuracy: 27-58%
  Default system: 18-58%
  + domain constraints: 27-58%

26 Why Not 100%?
Hard to find the smaller components of a complex match:
  agent-address = concat(street-name, city, state), where apt-number is missed.
Hard to remove small noise components from a complex match:
  phone-number = concat(agent-id, area-code, number), where agent-id is noise (a single-digit number).
Many correct complex matches are in the top three, but not the top one.

27 Summary & Future Work Complex matches are pervasive o finding them is crucial for practical matching solutions Described the iMAP solution o discovers both complex and 1-1 matches o is highly extensible o exploits several kinds of domain knowledge o can explain produced matches Experiments show the promise of iMAP. Future work o apply iMAP to other data representations (e.g., XML) o combine iMAP-style techniques with Clio-style user interaction 27

28 QUESTIONS?


