Data Integration Approaches


CSE 636 Data Integration: Data Integration Approaches (Fall 2006)

Virtual Integration Architecture
• Leave the data in the sources.
• When a query comes in:
– Determine the sources relevant to the query.
– Break the query down into sub-queries for the sources.
– Get the answers from the sources, filter them if needed, and combine them appropriately.
• Data is fresh.
• Otherwise known as On Demand Integration.
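The run-time steps above can be sketched as a toy mediator. This is only an illustration under stated assumptions: the `Source` class, its `covers` and `fetch` callables, the genre-only query format, and the sample rows are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Source:
    """A hypothetical wrapped data source."""
    name: str
    covers: Callable[[dict], bool]   # can this source contribute to the query?
    fetch: Callable[[dict], list]    # run the sub-query at the source


def answer(query: dict, sources: list) -> list:
    # 1. Determine the sources relevant to the query.
    relevant = [s for s in sources if s.covers(query)]
    # 2. Send a sub-query to each relevant source; the data stays at the source.
    partial = [row for s in relevant for row in s.fetch(query)]
    # 3. Filter what a source could not filter itself, then combine (set union).
    kept = [r for r in partial if r["genre"] == query["genre"]]
    unique = {tuple(sorted(r.items())) for r in kept}
    return [dict(t) for t in sorted(unique)]


# Invented sample contents for three sources.
S1_DATA = [{"title": "Comedy One", "genre": "Comedy"},
           {"title": "Horror One", "genre": "Horror"}]
S2_DATA = [{"title": "Comedy One", "genre": "Comedy"},
           {"title": "Comedy Two", "genre": "Comedy"}]
S3_DATA = [{"title": "Horror Two", "genre": "Horror"}]

sources = [
    Source("S1", lambda q: True, lambda q: S1_DATA),                     # cannot filter
    Source("S2", lambda q: q["genre"] == "Comedy", lambda q: S2_DATA),   # comedies only
    Source("S3", lambda q: q["genre"] == "Horror", lambda q: S3_DATA),   # horror only
]

result = answer({"genre": "Comedy"}, sources)
print([r["title"] for r in result])
```

Source S3 is never contacted for the comedy query, S1's unfiltered answer is filtered at the mediator, and the duplicate row reported by S1 and S2 is merged when the answers are combined.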

Virtual Integration Architecture (diagram, built up over six slides with numbered steps 1 to 6)
Diagram labels: Design-Time, Run-Time; Mapping Tool; End User, Query, Result; Mediator with Query Reformulation, Optimization & Execution, Mediation Language, and Global Schema; Web Services and XML; two Wrappers, each over a Data Source with its Local Schema.

Virtual Integration Approaches
Dimensions to consider:
• How many sources are we accessing?
• How autonomous are they?
• Is meta-data about the sources available?
• Is the data structured?
• Queries only, or also updates?
• Requirements: accuracy, completeness, performance, handling inconsistencies.
• Closed-world vs. open-world assumption?

Mediation Languages (diagram: logic relates the global schema to the source schemas)
Global Schema:
• CD(ASIN, Title, Genre, …)
• Artist(ASIN, Name, …)
Source schemas:
• CDs(Album, ASIN, Price, DiscountPrice, Studio)
• Books(Title, ISBN, Price, DiscountPrice, Edition)
• Authors(ISBN, FirstName, LastName)
• Artists(ASIN, ArtistName, GroupName)
• CDCategories(ASIN, Category)
• BookCategories(ISBN, Category)

Desiderata from Source Descriptions
• Expressive power: distinguish between sources with closely related data, and hence be able to prune access to irrelevant sources.
• Easy addition: make it easy to add new data sources.
• Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively.

Source Descriptions
Elements of source descriptions:
• Contents: source contains movies, directors, cast.
• Constraints: only movies produced after 1965.
• Completeness: contains all American movies.
• Capabilities:
– Negative: source requires movie title or director as input.
– Positive: source can perform selections, joins, …
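The elements listed above can be recorded in a small data structure and used to prune sources, as the desiderata ask. The following is a minimal sketch, assuming a hypothetical `SourceDescription` class and `is_relevant` helper; the field names and the year-range query format are illustrative choices, not part of the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceDescription:
    name: str
    contents: tuple                  # what the source contains
    min_year: Optional[int] = None   # constraint: only movies produced after this year
    complete_for: str = ""           # completeness statement, kept as free text here
    requires_one_of: tuple = ()      # negative capability: inputs the source needs


def is_relevant(src, bound, year_range):
    """bound: attributes the query supplies values for.
    year_range: (lo, hi) inclusive range of production years the query asks for."""
    lo, hi = year_range
    if src.min_year is not None and hi <= src.min_year:
        return False   # the query only asks for years the source cannot contain
    if src.requires_one_of and not set(src.requires_one_of) & set(bound):
        return False   # the source cannot be called without one of its inputs
    return True


src = SourceDescription(
    name="S",
    contents=("movies", "directors", "cast"),
    min_year=1965,
    complete_for="American movies",
    requires_one_of=("title", "director"),
)

print(is_relevant(src, {"title"}, (1970, 1980)))   # title supplied, years overlap
print(is_relevant(src, {"title"}, (1940, 1960)))   # years fall outside the constraint
print(is_relevant(src, {"year"}, (1970, 1980)))    # neither title nor director supplied
```

The constraint prunes the source for queries about older movies, and the negative capability prunes it when the query cannot supply a title or a director.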

Approaches to Specification of Source Descriptions
• Global-as-View (GAV): mediator relation defined as a view over source relations.
Ex: TSIMMIS (Stanford), HERMES (Maryland).
• Local-as-View (LAV): source relation defined as a view over mediator relations.
Ex: Information Manifold (AT&T), Tukwila (UW), InfoMaster (Stanford).
• GLAV: combines both (Friedman & Millstein 1999).

Approaches to Specification of Source Descriptions (diagram)
A query Q is posed to the Mediator over the Global (Mediated) Schema; GAV, LAV, or GLAV mappings connect it to the Local Schemas of five Sources, over which the query Q’ is expressed.

Global-as-View (GAV)
Global Schema:
Movie(title, dir, year, genre)
Schedule(cinema, title, time)
Integrating View of Movie:
SELECT * FROM S1                              [S1(title,dir,year,genre)]
union
SELECT * FROM S2                              [S2(title,dir,year,genre)]
union
SELECT S3.title, S3.dir, S4.year, S4.genre
FROM S3, S4                                   [S3(title,dir), S4(title,year,genre)]
WHERE S3.title = S4.title

Global-as-View: Example 2
Global Schema:
Movie(title, dir, year, genre)
Schedule(cinema, title, time)
Integrating View of Movie:
SELECT title, dir, year, NULL FROM S1         [S1(title,dir,year)]
union
SELECT title, dir, NULL, genre FROM S2        [S2(title,dir,genre)]
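The NULL-padded union in Example 2 can be tried out with SQLite. This is a small sketch using Python's standard `sqlite3` module; the sample rows are invented, and the column aliases (`NULL AS genre`, `NULL AS year`) are added here only so that the view has named columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Source relations; the rows are invented for illustration.
    CREATE TABLE S1(title, dir, year);
    CREATE TABLE S2(title, dir, genre);
    INSERT INTO S1 VALUES ('Film A', 'Director A', 1977);
    INSERT INTO S2 VALUES ('Film B', 'Director B', 'Horror');

    -- GAV: the global relation Movie is defined as a view over the sources,
    -- padding the attribute each source lacks with NULL.
    CREATE VIEW Movie AS
        SELECT title, dir, year, NULL AS genre FROM S1
        UNION
        SELECT title, dir, NULL AS year, genre FROM S2;
""")

# A query over the global schema is answered by expanding the view definition.
rows = conn.execute(
    "SELECT title, year, genre FROM Movie ORDER BY title"
).fetchall()
print(rows)
```

Each source contributes the attributes it has, and the missing ones surface as NULL in the global relation.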

Global-as-View: Example 3
Global Schema:
Movie(title, dir, year, genre)
Schedule(cinema, title, time)
Integrating Views:
Movie:    SELECT NULL, NULL, NULL, genre FROM S4      [S4(cinema, genre)]
Schedule: SELECT cinema, NULL, NULL FROM S4
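Example 3 can be replayed in SQLite to see what the NULL padding costs. This is a sketch under two assumptions: the second view, whose FROM clause is missing in the transcript, is taken to be defined over S4 as well, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE S4(cinema, genre);   -- invented sample rows
    INSERT INTO S4 VALUES ('Cinema X', 'Comedy'), ('Cinema Y', 'Horror');

    CREATE VIEW Movie AS
        SELECT NULL AS title, NULL AS dir, NULL AS year, genre FROM S4;
    CREATE VIEW Schedule AS
        SELECT cinema, NULL AS title, NULL AS time FROM S4;
""")

# "Which cinemas show comedies?" needs a join on title over the global schema,
# but title is NULL in both views and NULL = NULL is never true in SQL.
via_global = conn.execute("""
    SELECT S.cinema
    FROM Movie M JOIN Schedule S ON M.title = S.title
    WHERE M.genre = 'Comedy'
""").fetchall()

# The source itself holds the cinema/genre association directly.
via_source = conn.execute(
    "SELECT cinema FROM S4 WHERE genre = 'Comedy'"
).fetchall()

print(via_global, via_source)
```

The association between cinema and genre that S4 stores is not recoverable from the two global views, which is one concrete way a GAV mapping can lose information.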

Global-as-View (GAV): Example 4
Global Schema:
MovieActor(title, actor)
MovieReview(title, review)
Integrating Views:
MovieActor(title,actor) ← S1(id,title,actor,year)
union
MovieActor(title,actor) ← S2(title,director,actor,year)
MovieReview(title,review) ← S1(id,title,actor,year), S3(id,review)

Query Reformulation in GAV
Query reformulation = rule unfolding + simplification
Query: Find reviews for ‘DeNiro’ movies
q(title,review) :- MovieActor(title,‘DeNiro’), MovieReview(title,review)
1. q’(title,review) :- S1(id,title,‘DeNiro’,year), S3(id,review)
2. q’(title,review) :- S2(title,director,‘DeNiro’,year), S1(id,title,‘DeNiro’,year), S3(id,review)
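The unfolding half of this reformulation can be sketched in a few lines. The code below is an illustration, not the course's algorithm: the atom encoding (predicate name plus a tuple of terms), the quoting convention for constants, and the `_n` suffix for renamed variables are assumptions made here. It performs unfolding only; the simplification step that yields the compact rewritings on the slide is not implemented, so the output keeps one S1 atom per unfolded rule.

```python
from itertools import count, product


def is_const(term):
    # Convention for this sketch: a term starting with a quote is a constant.
    return term.startswith("'")


# GAV rules of Example 4: global relation -> list of (head variables, body over sources).
RULES = {
    "MovieActor": [
        (("title", "actor"), [("S1", ("id", "title", "actor", "year"))]),
        (("title", "actor"), [("S2", ("title", "director", "actor", "year"))]),
    ],
    "MovieReview": [
        (("title", "review"),
         [("S1", ("id", "title", "actor", "year")), ("S3", ("id", "review"))]),
    ],
}

_fresh = count(1)  # used to rename body-only variables apart


def unfold_atom(atom):
    """Return every way to replace one global-schema atom by source atoms."""
    pred, args = atom
    alternatives = []
    for head, body in RULES[pred]:
        n = next(_fresh)
        subst = dict(zip(head, args))  # head variable -> term used in the query
        alternatives.append([
            (p, tuple(t if is_const(t) else subst.get(t, f"{t}_{n}") for t in a))
            for p, a in body
        ])
    return alternatives


def unfold(query_body):
    """Unfold a conjunctive query into a union of conjunctive queries over the sources."""
    choices = [unfold_atom(a) for a in query_body]
    return [[atom for part in combo for atom in part] for combo in product(*choices)]


QUERY = [("MovieActor", ("title", "'DeNiro'")), ("MovieReview", ("title", "review"))]
plans = unfold(QUERY)
for plan in plans:
    print(plan)
```

Two plans come out, one per MovieActor rule, each joined with the single MovieReview rule. Variables that occur only in a rule body are renamed apart, so they do not clash across atoms.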

Global-as-View Summary
• Query reformulation boils down to view unfolding.
• Very easy conceptually.
• Can build hierarchies of global schemas.
• You sometimes lose information. Not always natural.
• Adding sources is hard: you need to consider all other sources that are available.

Local-as-View (LAV)
Mediator global (mediated) schema:
• Book(ISBN, Title, Genre, Year)
• Author(ISBN, Name)
Five sources (Source 1 to Source 5), each with its own local schema. Two source relations are described as views over the mediated schema:
Books before 1970: R1(ISBN, Title, Name)
Create View R1 AS
SELECT B.ISBN, B.Title, A.Name
FROM Book B, Author A
WHERE A.ISBN = B.ISBN AND B.Year < 1970
Humor books: R5(ISBN, Title)
Create View R5 AS
SELECT B.ISBN, B.Title
FROM Book B
WHERE B.Genre = ‘Humor’

Reformulation Problem
Given:
• A query Q posed over the global schema
• Descriptions of the data sources
Find:
• A query Q’ over the data source relations, such that:
– Q’ provides only correct answers to Q, and
– Q’ provides all possible answers to Q given the sources.

Query Reformulation
Query: Find authors of humor books
Plan: R1 Join R5
Mediated schema: Book(ISBN, Title, Genre, Year), Author(ISBN, Name). Sources: R1(ISBN, Title, Name) holds books before 1970; R5(ISBN, Title) holds humor books.

Query Reformulation
Query: Find authors of humor books after 1970
Plan: Can’t do it! R1, the only described source that carries author names, is restricted to books before 1970.
Mediated schema: Book(ISBN, Title, Genre, Year), Author(ISBN, Name). Sources: R1(ISBN, Title, Name) holds books before 1970; R5(ISBN, Title) holds humor books.
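Both reformulation outcomes can be reproduced with SQLite. In this sketch the mediated relations Book and Author are materialised only to generate the contents of R1 and R5 from their view definitions; in a real LAV system they are never stored. All rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Conceptual mediated-schema relations (invented rows).
    CREATE TABLE Book(ISBN, Title, Genre, Year);
    CREATE TABLE Author(ISBN, Name);
    INSERT INTO Book VALUES (1, 'Old Jokes', 'Humor', 1950),
                            (2, 'Old Ghosts', 'Horror', 1955),
                            (3, 'New Jokes', 'Humor', 1990);
    INSERT INTO Author VALUES (1, 'Author A'), (2, 'Author B'), (3, 'Author C');

    -- Source contents, as described by the LAV views R1 and R5.
    CREATE TABLE R1 AS
        SELECT B.ISBN, B.Title, A.Name FROM Book B, Author A
        WHERE A.ISBN = B.ISBN AND B.Year < 1970;
    CREATE TABLE R5 AS
        SELECT B.ISBN, B.Title FROM Book B WHERE B.Genre = 'Humor';
""")

# Plan for "find authors of humor books": join the two sources on ISBN.
plan = conn.execute(
    "SELECT R1.Name FROM R1 JOIN R5 ON R1.ISBN = R5.ISBN ORDER BY R1.Name"
).fetchall()

# What the query would return if the mediated relations were fully available.
ideal = conn.execute(
    "SELECT A.Name FROM Book B JOIN Author A ON A.ISBN = B.ISBN "
    "WHERE B.Genre = 'Humor' ORDER BY A.Name"
).fetchall()

print(plan, ideal)
```

The plan returns only correct answers, but the author of the 1990 humor book is out of reach: R5 knows the book and nothing about its author, and R1 stops at 1970. That is the situation behind "Can't do it!" for the second query.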

Local-as-View: Example 1
Global Schema:
Movie(title, dir, year, genre)
Schedule(cinema, title, time)
Source Views:
Create Source S1 AS                           [S1(title, dir, year, genre)]
SELECT * FROM Movie
Create Source S3 AS                           [S3(title, dir)]
SELECT title, dir FROM Movie
Create Source S5 AS                           [S5(title, dir, year)]
SELECT title, dir, year FROM Movie
WHERE year > 1960 AND genre = ‘Comedy’

Local-as-View: Example 2
Global Schema:
Movie(title, dir, year, genre)
Schedule(cinema, title, time)
Source Views:
Create Source S4 AS                           [S4(cinema, genre)]
SELECT cinema, genre
FROM Movie M, Schedule S
WHERE M.title = S.title

Local-as-View (LAV): Example 3
Global Schema:
Movie(title, year, director, genre)
American(director)
MovieReview(title, review)
Source Views:
S1(title, year, director) → Movie(title,year,director,genre), American(director), year ≥ 1960, genre = ‘Comedy’
S2(title, review) → Movie(title,year,director,genre), year ≥ 1990, MovieReview(title,review)

Query Reformulation in LAV
Query: Reviews for comedies produced after 1950
q(title,review) :- Movie(title,year,director,‘Comedy’), year ≥ 1950, MovieReview(title,review)
Reformulated query:
q’(title,review) :- S1(title,year,director), S2(title,review)
Source views:
S1(title, year, director) → Movie(title,year,director,genre), American(director), year ≥ 1960, genre = ‘Comedy’
S2(title, review) → Movie(title,year,director,genre), year ≥ 1990, MovieReview(title,review)
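One way to see why q’ returns only correct answers is to expand each source atom into its description, renaming the variables that are local to a view. The subscripted variables below are introduced here for that renaming and do not appear on the slide.

```latex
\begin{align*}
q'^{\mathrm{exp}}(\mathit{title},\mathit{review}) \;\text{:-}\;
  & \mathit{Movie}(\mathit{title}, y_1, d_1, g_1),\ \mathit{American}(d_1),\
    y_1 \ge 1960,\ g_1 = \text{`Comedy'}, \\
  & \mathit{Movie}(\mathit{title}, y_2, d_2, g_2),\ y_2 \ge 1990,\
    \mathit{MovieReview}(\mathit{title},\mathit{review})
\end{align*}
```

The first line supplies a Movie tuple for the title with genre ‘Comedy’ and a year of at least 1960, hence at least 1950, and the second line supplies the MovieReview tuple, so every answer of q’ satisfies q. The converse does not hold: a comedy from 1955, or one by a non-American director, satisfies q but cannot be obtained from S1.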

Local-as-View Summary
• Very flexible. You have the power of the entire query language to define the contents of the source.
• Hence, can easily distinguish between contents of closely related sources.
• Adding sources is easy: they’re independent of each other.
• Query reformulation: answering queries using views!

LAV vs. GAV
See [Ullman, ICDT-1997] for a detailed comparison.
• Local-as-View:
– Easier to add sources: specify the query expression.
– Easier to specify constraints on contents of the sources: they are part of the query expression describing them.
• Global-as-View:
– Easier query reformulation.
• GLAV combines both (Friedman & Millstein 1999).

The General Problem
• Given a set of views V1,…,Vn, and a query Q, can we answer Q using only the answers to V1,…,Vn?
• Many, many papers on this problem.
• The best performing algorithm: the MiniCon Algorithm (Pottinger & Halevy, VLDB 2000).
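A basic building block for this problem is testing whether a candidate rewriting over the views is sound: expand the view atoms into their definitions and look for a containment mapping from the query into the expansion. The sketch below is not the MiniCon algorithm. It is a plain backtracking check for conjunctive queries without comparisons, applied to the R1/R5 book example, and the encoding (quoted strings as constants, `_n` suffixes for fresh variables) is an assumption made here. Dropping the condition Year < 1970 from R1 only enlarges the expansion, so a positive result still implies soundness.

```python
from itertools import count


def is_var(term):
    # Convention for this sketch: quoted strings are constants, the rest variables.
    return not term.startswith("'")


def extend(mapping, src, dst):
    """Extend a mapping so that term src maps to term dst, or return None."""
    if not is_var(src):
        return mapping if src == dst else None
    if mapping.get(src, dst) != dst:
        return None
    return {**mapping, src: dst}


def find_mapping(src_atoms, dst_atoms, mapping):
    """Backtracking search for a containment mapping of src_atoms into dst_atoms."""
    if not src_atoms:
        return mapping
    (pred, args), rest = src_atoms[0], src_atoms[1:]
    for dpred, dargs in dst_atoms:
        if dpred != pred or len(dargs) != len(args):
            continue
        m = mapping
        for s, d in zip(args, dargs):
            m = extend(m, s, d)
            if m is None:
                break
        if m is not None:
            found = find_mapping(rest, dst_atoms, m)
            if found is not None:
                return found
    return None


def contained_in(q1, q2):
    """True if q1 is contained in q2; each query is a (head, body) pair."""
    (head1, body1), (head2, body2) = q1, q2
    if len(head1) != len(head2):
        return False
    m = {}
    for s, d in zip(head2, head1):
        m = extend(m, s, d)
        if m is None:
            return False
    return find_mapping(body2, body1, m) is not None


# LAV view definitions (comparison on Year omitted, see the note above).
VIEWS = {
    "R1": (("isbn", "title", "name"),
           [("Book", ("isbn", "title", "genre", "year")), ("Author", ("isbn", "name"))]),
    "R5": (("isbn", "title"), [("Book", ("isbn", "title", "'Humor'", "year"))]),
}


def expand(body):
    """Replace each view atom by its definition, renaming local variables apart."""
    fresh, out = count(1), []
    for pred, args in body:
        head, view_body = VIEWS[pred]
        n, subst = next(fresh), dict(zip(head, args))
        for p, a in view_body:
            out.append((p, tuple(t if not is_var(t) else subst.get(t, f"{t}_{n}")
                                 for t in a)))
    return out


Q = (("name",), [("Book", ("isbn", "title", "'Humor'", "year")),
                 ("Author", ("isbn", "name"))])
good = [("R1", ("isbn", "t1", "name")), ("R5", ("isbn", "t2"))]
bad = [("R1", ("isbn", "t1", "name"))]

sound = contained_in((("name",), expand(good)), Q)
unsound = contained_in((("name",), expand(bad)), Q)
print(sound, unsound)
```

The join of R1 and R5 passes the check, while R1 alone fails because nothing in its expansion forces the genre to be ‘Humor’.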

Local Completeness Information
• If sources are incomplete, we need to look at each one of them.
• Often, sources are locally complete.
• Example: Movie(title, director, year) is complete for years after 1960, or for American directors.
• Question: given a set of local completeness statements, is a query Q’ a complete answer to Q?

Example
Movie(title, director, year): complete after 1960
Show(title, theater, city, hour)
Query: find movies (and directors) playing in Seattle:
SELECT M.title, M.director
FROM Movie M, Show S
WHERE M.title = S.title AND city = ‘Seattle’
Complete or not?

Example #2
Movie(title, director, year), Oscar(title, year)
Query: find directors whose movies won Oscars after 1965:
SELECT M.director
FROM Movie M, Oscar O
WHERE M.title = O.title AND M.year = O.year AND O.year > 1965
Complete or not?
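The two examples can be explored on a tiny invented instance. The sketch below assumes that the Movie source is guaranteed complete only for years after 1960 (here it stores nothing older) and that Show and Oscar are fully available; all titles, names, and helper functions are hypothetical. A single instance can refute completeness but cannot prove it, so the second result is an illustration rather than a proof.

```python
# Hypothetical "real world" contents.
WORLD_MOVIES = [("Old Film", "Director A", 1950), ("New Film", "Director B", 1980)]
SHOWS = [("Old Film", "Cinema X", "Seattle", "19:00"),
         ("New Film", "Cinema Y", "Seattle", "21:00")]
OSCARS = [("Old Film", 1950), ("New Film", 1980)]

# The source is only guaranteed to hold movies made after 1960.
SOURCE_MOVIES = [m for m in WORLD_MOVIES if m[2] > 1960]


def playing_in_seattle(movies, shows):
    """Movies (and directors) playing in Seattle."""
    return sorted({(t, d) for (t, d, _y) in movies
                   for (st, _theater, city, _hour) in shows
                   if t == st and city == "Seattle"})


def oscar_directors_after_1965(movies, oscars):
    """Directors whose movies won Oscars after 1965."""
    return sorted({d for (t, d, y) in movies
                   for (ot, oy) in oscars
                   if t == ot and y == oy and oy > 1965})


q1_source = playing_in_seattle(SOURCE_MOVIES, SHOWS)
q1_world = playing_in_seattle(WORLD_MOVIES, SHOWS)
q2_source = oscar_directors_after_1965(SOURCE_MOVIES, OSCARS)
q2_world = oscar_directors_after_1965(WORLD_MOVIES, OSCARS)

print(q1_source == q1_world, q2_source == q2_world)
```

On this instance the Seattle query misses the 1950 movie, so its answer from the source is not complete. The Oscar query loses nothing, and the reason is visible in the query itself: M.year = O.year together with O.year > 1965 implies M.year > 1960, which is exactly the range where Movie is complete.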

References
• Information integration. Maurizio Lenzerini. Eighteenth International Joint Conference on Artificial Intelligence (IJCAI 2003), Invited Tutorial.
• Data Integration: a Status Report. Alon Halevy. German Database Conference (BTW), 2003, Invited Talk.