Amit Shvarchenberg and Rafi Sayag. Based on a paper by: Robin Dhamankar, Yoonkyong Lee, AnHai Doan Department of Computer Science University of Illinois,

Amit Shvarchenberg and Rafi Sayag

Based on a paper by: Robin Dhamankar, Yoonkyong Lee, AnHai Doan (Department of Computer Science, University of Illinois, Urbana-Champaign, IL, USA); Alon Halevy, Pedro Domingos (Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA)

Introduction Today there are many databases around the world, and it is often necessary to combine two or more similar databases into a single one. In the past, many of these integrations were done manually. The iMAP system offers a semi-automatic method for matching information from different sources.

The Real-Estate-Agents Example
Schema S, table HOUSES:
location | price | agent-id
Raleigh, NC | 360,000 | 32
Atlanta, GA | 430,000 | 15
Schema S, table AGENTS:
id | name | city | state | fee-rate
32 | Mike Brown | Athens | GA |
 | Jean Laup | Raleigh | NC | 0.04
Schema T, table LISTING:
area | list-price | agent-address | agent-name
Denver, CO | 550,000 | Boulder, CO | Laura Smith
Atlanta, GA | 370,800 | Athens, GA | Mike Brown

The Big Merge

Making Tuples Using SQL
area = SELECT location FROM HOUSES
agent-address = SELECT concat(city, state) FROM AGENTS
list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
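The three mappings can be tried end to end. The following is a minimal sketch using Python's built-in sqlite3 module, not something taken from the slides: column names use underscores instead of hyphens, the address is built with SQLite's || operator and a comma separator, and the single row in each table (including the agent id 15) is illustrative data chosen for this example.

```python
import sqlite3

# In-memory database holding a tiny version of schema S (illustrative rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE houses (location TEXT, price REAL, agent_id INTEGER);
    CREATE TABLE agents (id INTEGER, name TEXT, city TEXT, state TEXT, fee_rate REAL);
    INSERT INTO houses VALUES ('Atlanta, GA', 430000, 15);
    INSERT INTO agents VALUES (15, 'Jean Laup', 'Raleigh', 'NC', 0.04);
""")

# Build LISTING-style tuples (schema T) by applying all mappings in one query.
listing = conn.execute("""
    SELECT h.location                AS area,
           h.price * (1 + a.fee_rate) AS list_price,
           a.city || ', ' || a.state  AS agent_address,
           a.name                     AS agent_name
    FROM houses AS h
    JOIN agents AS a ON h.agent_id = a.id
""").fetchall()

for row in listing:
    print(row)
```

The join condition plays the role of "WHERE agent-id = id": without it, every house would be combined with every agent.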

How Do We Match? The process of creating mappings typically proceeds in two steps. First step (schema matching): we find matches between elements of the two schemas. Second step: we elaborate the matches to create query expressions that enable automated data translation or exchange.

Schema Matches There are two kinds of schema matches: 1-1 matches and complex matches. A 1-1 match pairs one attribute with one attribute, for example location in HOUSES with area in LISTING.

Complex Matches Complex matches specify that some combination of attributes in one schema corresponds to a combination in the other, for example agent-address = concat(city, state) or list-price = price * (1 + fee-rate).

The Solution – The iMAP System We will describe the iMAP system, which semi-automatically discovers complex matches for relational data in a single table. In some cases iMAP is able to find matches that combine attributes from multiple tables.

The iMAP Architecture

Match Generator Input: target schema and source schema. Output: match candidates.

How Match Generator Works The match generator uses search to explore the space of possible match candidates. Its searchers use prior knowledge of possible match types and heuristic methods.

The Internals of a Searcher Applying search to candidate generation involves three major issues: search strategy, evaluation of candidate matches, and termination condition.

Search Strategy The search space can be very large or even unbounded, so we need to search it efficiently. iMAP addresses this problem using a search technique called beam search.

Beam Search Beam search uses a scoring function to evaluate each match candidate. At each level of the search tree, it keeps only the k highest-scoring match candidates. In this way the searcher can search efficiently in any type of search space.
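Beam search as described here can be sketched generically. The code below is an illustration under assumptions, not iMAP's implementation: the function names, the fixed max_levels stopping rule, and the toy scores for concatenations of AGENTS columns are all made up for the example.

```python
from typing import Callable, Iterable, List, TypeVar

C = TypeVar("C")  # type of a match candidate


def beam_search(
    initial: Iterable[C],
    expand: Callable[[C], Iterable[C]],
    score: Callable[[C], float],
    k: int,
    max_levels: int,
) -> List[C]:
    """Return the k best candidates seen while keeping k candidates per level."""
    beam = sorted(initial, key=score, reverse=True)[:k]
    best = list(beam)
    for _ in range(max_levels):
        # Expand only the candidates that survived the previous level.
        children = [child for cand in beam for child in expand(cand)]
        if not children:
            break
        beam = sorted(children, key=score, reverse=True)[:k]
        best = sorted(best + beam, key=score, reverse=True)[:k]
    return best


# Toy usage: candidates are ordered tuples of columns to concatenate.
columns = ["name", "city", "state"]
toy_scores = {("city",): 0.5, ("state",): 0.4, ("name",): 0.1, ("city", "state"): 0.9}


def expand(cand):
    # Append one column that the candidate does not use yet.
    return [cand + (c,) for c in columns if c not in cand]


def score(cand):
    # Hypothetical scores; unknown candidates score 0.
    return toy_scores.get(cand, 0.0)


top = beam_search([(c,) for c in columns], expand, score, k=2, max_levels=2)
print(top)
```

With k = 2, the single-column candidate name is dropped after the first level, so none of its extensions is ever scored. That pruning is what keeps the search cheap, at the price of possibly missing a good candidate whose prefix scored poorly.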

Implemented Searchers on iMAP

Example: Unit Conversion Searcher The unit conversion searcher can identify a conversion between two different units of measurement. It does so by looking at the names and data of the attributes (e.g., "hours", "kg", "$").

Example: Unit Conversion Searcher (cont.) The searcher finds the best conversion from a set of conversion functions between the units. In this case weight_pounds = 2.2 * weight_kg.
First table:
product | kg
apple | 10
Second table:
Fruits and vegetables | pounds
banana | 5
Second table after the converted row is added:
Fruits and vegetables | pounds
banana | 5
apple | 22
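Choosing the best conversion from a fixed set can be sketched as follows. This is an assumption-laden illustration rather than iMAP's scoring: the CONVERSIONS table, the function name, and the idea of comparing column means (the two columns are not row-aligned) are choices made for this example, and the sample values are invented.

```python
from statistics import mean

# Hypothetical set of known conversion factors (target = factor * source).
CONVERSIONS = {
    "identity": 1.0,
    "kg_to_pounds": 2.20462,
    "pounds_to_kg": 0.453592,
}


def best_conversion(source_values, target_values):
    """Pick the factor whose converted source mean is closest to the target mean."""
    target_mean = mean(target_values)

    def error(factor):
        converted = [v * factor for v in source_values]
        return abs(mean(converted) - target_mean)

    name = min(CONVERSIONS, key=lambda n: error(CONVERSIONS[n]))
    return name, CONVERSIONS[name]


# Illustrative data: weights in kg on one side, similar goods in pounds on the other.
source_kg = [10.0, 4.0, 1.5]
target_pounds = [21.9, 9.0, 3.2]
print(best_conversion(source_kg, target_pounds))
```

Comparing means is a deliberately crude stand-in for a real distribution comparison; it only works when both columns describe similar populations of items.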

Similarity Estimator Input: match candidates. Output: similarity matrix. The similarity matrix stores the similarity score of each (target attribute, match candidate) pair.

Similarity Estimator The similarity estimator gets the results from all the searchers. Then it gathers the data and calculates a final score for each match

Similarity Estimator (cont.) The similarity estimator uses two methods to score match pairs: a name-based evaluator and a Naive Bayes evaluator.
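A name-based evaluator can be approximated with plain string similarity. The sketch below uses difflib.SequenceMatcher from the Python standard library; the normalization rules and the function name are assumptions for illustration, and the Naive Bayes evaluator, which looks at data values, is not covered here.

```python
from difflib import SequenceMatcher


def name_similarity(target_name: str, candidate_name: str) -> float:
    """Score two attribute names in [0, 1] after light normalization."""

    def normalize(name: str) -> str:
        # Treat hyphens and underscores as spaces and ignore case.
        return name.lower().replace("-", " ").replace("_", " ").strip()

    return SequenceMatcher(None, normalize(target_name), normalize(candidate_name)).ratio()


print(name_similarity("list-price", "price"))
print(name_similarity("list-price", "city"))
```

Here price scores higher than city against list-price, which is the kind of ordering a name-based evaluator contributes to the similarity matrix.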

Match Selector Input: Similarity matrix. Output: 1-1 and complex matches.

Match Selector The match selector examines the similarity matrix and outputs the best matches under certain conditions.
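One simple reading of "best matches under certain conditions" is to take the top-scoring candidate per target attribute, subject to a score threshold and a constraint check. The slides do not spell out the actual conditions, so the threshold, the violates predicate, and the scores below are hypothetical.

```python
def select_matches(similarity, threshold=0.5, violates=lambda target, cand: False):
    """Pick the best allowed candidate for each target attribute.

    similarity maps target attribute -> {candidate expression: score}.
    """
    selected = {}
    for target, candidates in similarity.items():
        allowed = {
            cand: s
            for cand, s in candidates.items()
            if s >= threshold and not violates(target, cand)
        }
        if allowed:
            selected[target] = max(allowed, key=allowed.get)
    return selected


# Illustrative similarity matrix with made-up scores.
similarity = {
    "area": {"location": 0.9, "city": 0.6},
    "list-price": {"price": 0.7, "price * (1 + fee-rate)": 0.8},
    "agent-name": {"city": 0.2},
}
matches = select_matches(similarity)
print(matches)
```

Target attributes with no candidate above the threshold, such as agent-name here, are left unmatched rather than forced onto a poor candidate.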

Exploiting Domain Knowledge Exploiting domain knowledge has been shown to be beneficial for 1-1 matching. For complex matching it can be even more crucial, since it can save valuable processing through early detection of unlikely matches.

Domain Constraints Constraints are either present in the schema or provided by an expert or the user. iMAP considers three kinds of constraints: two attributes are unrelated; a constraint on a single attribute; multiple schema attributes are unrelated.

Sources For Domain Constraints Past complex matches, overlap data, and external data.

Past Complex Matches We often find that we map the same or similar schemas repeatedly. iMAP can extract a template expression from such past matches. Example: given the past match "price = pr * (1+0.6)", iMAP will extract "VAR * (1 + CONST)" and ask the numeric searcher to look for matches for that template.
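Template extraction of this kind can be sketched with a regular expression. The rule used below is an assumption inferred from the single example on the slide: attribute names become VAR, decimal literals become CONST, and integer literals such as 1 are kept as they are. How iMAP actually generalizes expressions is not described here.

```python
import re

# One pass over the expression: decimal literals first, then identifiers.
_TOKEN = re.compile(r"(?P<num>\d+\.\d+)|(?P<name>[A-Za-z_]\w*)")


def extract_template(expression: str) -> str:
    """Replace attribute names with VAR and decimal constants with CONST."""

    def replace(match):
        return "CONST" if match.group("num") else "VAR"

    return _TOKEN.sub(replace, expression)


print(extract_template("pr * (1+0.6)"))
```

Doing the substitution in a single pass matters: replacing constants first and names second would turn the freshly inserted CONST into VAR.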

Overlap Data In some cases, the source and the target share some of the same data. This can be used as information for the matching process. Searchers that exploit overlap data: the overlap text searcher, the overlap numeric searcher, and the overlap category and schema mismatch searcher.
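When the same entities appear on both sides, a candidate expression can be checked directly against the target values. The sketch below shows that idea only; it is not the algorithm of iMAP's overlap searchers, and the function name, the tolerance, and the rows are invented for illustration.

```python
import math


def agreement(candidate, source_rows, target_values, rel_tol=1e-6):
    """Fraction of shared entities for which the candidate reproduces the target value.

    source_rows[i] and target_values[i] are assumed to describe the same entity.
    """
    hits = sum(
        math.isclose(candidate(row), target, rel_tol=rel_tol)
        for row, target in zip(source_rows, target_values)
    )
    return hits / len(target_values)


# Illustrative overlapping data: the same two houses seen in both schemas.
source_rows = [
    {"price": 100000.0, "fee_rate": 0.03},
    {"price": 250000.0, "fee_rate": 0.04},
]
target_list_prices = [103000.0, 260000.0]

plain_price = agreement(lambda r: r["price"], source_rows, target_list_prices)
with_fee = agreement(lambda r: r["price"] * (1 + r["fee_rate"]), source_rows, target_list_prices)
print(plain_price, with_fee)
```

A candidate that agrees on every shared entity is strong evidence for a match, which is information that is unavailable when the two sources describe disjoint sets of entities.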

External Data External data is used as additional constraints on the attributes of a schema. It is usually provided by experts and can be very useful in schema matching.

Why do we need it?

Generating Explanations in iMAP iMAP's goal is to provide a design environment where a human user can quickly generate a mapping between a pair of schemas. For the user to know which match to choose, iMAP needs to supply an explanation for each of the matches.

User Questions iMAP considers three questions that a user might ask: Why does a match exist? Why does a match not exist? Why is one match better than another?

Explanation Generation iMAP keeps track of the decision-making process as a dependency graph. Each node is a schema attribute, an assumption, a candidate match, or a piece of domain knowledge. An edge between two nodes means that one node led to the other.
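A dependency graph of this kind can be sketched as a map from each node to the nodes that led to it, so that an explanation is a walk backwards from a match. The class, its method names, and the node labels below are hypothetical and only illustrate the data structure.

```python
from collections import defaultdict


class DependencyGraph:
    """Records which nodes (attributes, assumptions, candidates, knowledge) led to which."""

    def __init__(self):
        self._causes = defaultdict(list)  # effect -> list of direct causes

    def add_edge(self, cause, effect):
        self._causes[effect].append(cause)

    def explain(self, node):
        """Return every node that directly or indirectly led to the given node."""
        seen, stack, order = set(), [node], []
        while stack:
            current = stack.pop()
            for cause in self._causes.get(current, []):
                if cause not in seen:
                    seen.add(cause)
                    order.append(cause)
                    stack.append(cause)
        return order


graph = DependencyGraph()
candidate = "candidate: list-price = price * (1 + fee-rate)"
graph.add_edge("attribute: price", candidate)
graph.add_edge("attribute: fee-rate", candidate)
graph.add_edge("domain knowledge: template VAR * (1 + CONST)", candidate)
graph.add_edge(candidate, "selected match for list-price")

reasons = graph.explain("selected match for list-price")
for reason in reasons:
    print(reason)
```

Answering "why does this match exist?" then amounts to listing the causes of the selected match; the other two user questions would need the rejected candidates and their scores to be recorded as nodes as well.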

Explanation Generation Example