Describing data sources

Outline
- Overview
- Schema mapping languages



Source descriptions
- Which sources are available
- What data exists in each source
- How each source can be accessed

Components of a data integration system
- The user query is reformulated into a query over the data sources

Working example
Mediated schema:
  Movie(title, director, year, genre)
  Actors(title, name)
  Plays(movie, location, startTime)
  Reviews(title, rating, description)
Data sources:
  S1: Movie(MID, title)
      Actor(AID, firstName, lastName, nationality, yearOfBirth)
      ActorPlays(AID, MID)
      MovieDetail(MID, director, genre, year)
  S2: Cinemas(place, movie, start)
  S3: NYCCinemas(name, title, startTime)
  S4: Reviews(title, date, grade, review)
  S5: MovieGenres(title, genre)
  S6: MovieDirectors(title, dir)
  S7: MovieYears(title, year)

Components of source descriptions
- Schema mappings
  - What data exists in the sources
  - How to map terms used in the source schemata to terms used in the mediated schema
- Information used to optimize queries to the sources and to avoid illegal access patterns
  - Access pattern limitations, because data sources may differ in the access patterns they support
  - Source completeness

Schema mapping
- Main component of a source description
- Specifies:
  - What data exists in the source
  - How the terms used in the source schema relate to the terms used in the mediated schema
- Needs to handle semantic heterogeneity, that is, discrepancies between the source schemata and the mediated schema:
  - Relation and attribute names
  - Tabular organization
  - Domain coverage
  - Data-level variations

Query reformulation
Besides schema mappings, source descriptions specify information that lets the data integration system:
- Optimize queries posed to the sources
  - Knowing that a data source is complete saves work, because other sources with overlapping data need not be accessed
- Avoid illegal access patterns
  - Data sources may differ in which access patterns they support

Schema mapping languages
- Schema mapping: a set of expressions that describe a relationship between a set of schemata (typically two). In our case, these are the mediated schema and the schemata of the sources
  - Used to reformulate a query posed in terms of the mediated schema into appropriate queries on the sources
  - The result is called a logical query plan (a query expression that refers only to the relations in the data sources)
- It will not always be possible to generate a query plan that produces all the certain answers
- Two types of algorithms are involved:
  - Finding the best possible logical plan
  - Finding all the certain answers
- Schema mapping languages are based on query expressions

Semantics of schema mappings
A semantic mapping M defines a relation M_R over
  I(G) × I(S1) × ... × I(Sn)
where:
- I(G) denotes the possible instances of the mediated schema
- I(S1), ..., I(Sn) denote the possible instances of the source relations S1, ..., Sn, respectively
If (g, s1, ..., sn) ∈ M_R, then g is a possible instance of the mediated schema when the source relation instances are s1, ..., sn

Certain answers
Let M be a schema mapping between a mediated schema G and source schemata S1, ..., Sn that defines the relation M_R over I(G) × I(S1) × ... × I(Sn). Let Q be a query over G, and let s1, ..., sn be instances of the source relations.
We say that t is a certain answer of Q w.r.t. M and s1, ..., sn if t ∈ Q(g) for every instance g of G such that (g, s1, ..., sn) ∈ M_R
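The definition above can be illustrated with a small sketch. It assumes that the set of possible mediated-schema instances is given as a short, hand-picked list (in practice this set is usually infinite under the open-world assumption), and every name and tuple below is hypothetical.

```python
# Sketch: certain answers as the intersection of Q(g) over all possible
# mediated-schema instances g. The instances and the query are illustrative
# assumptions, not part of the original slides.

def certain_answers(query, possible_instances):
    """Return the tuples that appear in query(g) for every possible instance g."""
    results = [set(query(g)) for g in possible_instances]
    if not results:
        # Convention chosen for this sketch only: no instances, no answers.
        return set()
    return set.intersection(*results)

# A mediated-schema instance is modeled as a dict: relation name -> set of tuples.
g1 = {"Movie": {("Annie Hall", "Allen", 1977, "comedy")}}
g2 = {"Movie": {("Annie Hall", "Allen", 1977, "comedy"),
                ("Manhattan", "Allen", 1979, "comedy")}}

def comedies(g):
    """Q(title) :- Movie(title, director, year, "comedy")"""
    return {(title,) for (title, _director, _year, genre) in g["Movie"]
            if genre == "comedy"}

# "Manhattan" is an answer in g2 only, so it is not a certain answer.
answers = certain_answers(comedies, [g1, g2])
print(answers)
```

Only tuples returned on every possible instance survive the intersection, which is why an answer that depends on data some instance lacks is dropped.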

Properties of schema mapping languages
- Flexibility: the formalism should be able to express a wide variety of relationships between schemata
- Efficient reformulation: reformulation algorithms should have well-understood properties and be efficient
  - Trade-off: flexibility/expressiveness vs. efficiency
- Easy update: it must be easy to add and remove sources
Schema mapping languages:
- Global-As-View (GAV)
- Local-As-View (LAV)
- Global-Local-As-View (GLAV)

Two systems
(GAV) TSIMMIS [Garcia-Molina+97], Stanford
- Focus: semistructured data (OEM), OQL-based language (Lorel)
- Creates a mediated schema as a view over the sources
- Spawned a UCSD project called MIX, which led to a company now owned by BEA Systems
- Other important systems of this vein: Penn
(LAV) Information Manifold [Levy+96], AT&T Research
- Focus: local-as-view mappings, relational model
- Sources defined as views over the mediated schema
- Led to peer-to-peer integration approaches (Piazza, etc.)
Focus: Web-based queriable sources

Global-As-View (GAV)
- Defines the mediated schema as a set of views over the data sources
  - The mediated schema is also referred to as the global schema
- Let G be a mediated schema, and let S = {S1, ..., Sn} be the schemata of n data sources. A Global-As-View schema mapping M is a set of expressions of the form
    Gi(X) ⊇ Q(S) or Gi(X) = Q(S)
  where
  - Gi is a relation in G, and appears in at most one expression in M, and
  - Q(S) is a query over the relations in S


Example of a GAV schema mapping
  Movie(title, director, year, genre) ⊇ S1.Movie(MID, title), S1.MovieDetail(MID, director, genre, year)
  Movie(title, director, year, genre) ⊇ S5.MovieGenres(title, genre), S6.MovieDirectors(title, director), S7.MovieYears(title, year)
  Plays(movie, location, startTime) ⊇ S2.Cinemas(location, movie, startTime)
  Plays(movie, location, startTime) ⊇ S3.NYCCinemas(location, movie, startTime)

GAV semantics
Let M = M1, ..., Ml be a GAV schema mapping between G and S = {S1, ..., Sn}, where Mi is of the form Gi(X) ⊇ Qi(S) or Gi(X) = Qi(S). Let g be an instance of the mediated schema G, and let s = s1, ..., sn be instances of S1, ..., Sn, respectively.
The tuple of instances (g, s1, ..., sn) is in M_R if for every 1 <= i <= l the following holds:
- If Mi is an = expression, then the extension of Gi in g is equal to the result of evaluating Qi on s
- If Mi is a ⊇ expression, then the extension of Gi in g is a superset of the result of evaluating Qi on s
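As a rough illustration of these semantics, the following sketch checks whether a pair (g, s) satisfies a GAV mapping. The dictionary encoding of instances, the "superset" tag, the helper names, and the sample tuples are all assumptions made for the example.

```python
# Sketch: checking GAV semantics on tiny, made-up instances.
# Instances are dicts mapping a relation name to a set of tuples.

def satisfies_gav(mapping, g, s):
    """mapping: list of (mediated_relation, op, query), op in {"=", "superset"}.

    Each query takes the source instances s and returns a set of tuples.
    """
    for relation, op, query in mapping:
        result = set(query(s))
        extension = g.get(relation, set())
        if op == "=" and extension != result:
            return False
        if op == "superset" and not extension >= result:
            return False
    return True

def from_cinemas(s):
    # Plays(movie, location, startTime) ⊇ S2.Cinemas(location, movie, startTime)
    return {(movie, loc, start) for (loc, movie, start) in s["S2.Cinemas"]}

def from_nyc_cinemas(s):
    # Plays(movie, location, startTime) ⊇ S3.NYCCinemas(location, movie, startTime)
    return {(movie, loc, start) for (loc, movie, start) in s["S3.NYCCinemas"]}

GAV_MAPPING = [("Plays", "superset", from_cinemas),
               ("Plays", "superset", from_nyc_cinemas)]

sources = {"S2.Cinemas": {("Odeon", "Annie Hall", 21)},
           "S3.NYCCinemas": {("Film Forum", "Manhattan", 19)}}

# g_ok holds everything the sources imply plus one extra tuple, which a
# superset expression allows; g_bad misses the S3 tuple.
g_ok = {"Plays": {("Annie Hall", "Odeon", 21),
                  ("Manhattan", "Film Forum", 19),
                  ("Alien", "Roxy", 22)}}
g_bad = {"Plays": {("Annie Hall", "Odeon", 21)}}

print(satisfies_gav(GAV_MAPPING, g_ok, sources))
print(satisfies_gav(GAV_MAPPING, g_bad, sources))
```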

Reformulation in GAV
- To reformulate a query posed over the mediated schema, simply unfold the query with the view definitions
- The reformulation resulting from the unfolding is guaranteed to find all the certain answers

Example
The query Q, over the mediated schema, asks for comedies starting at or after 8pm:
  Q(title, location, st) :- Movie(title, director, year, "comedy"), Plays(title, location, st), st >= 8pm
Reformulating Q with the source descriptions yields the following four logical query plans:
  Q'(title, location, st) :- S1.Movie(MID, title), S1.MovieDetail(MID, director, "comedy", year), S2.Cinemas(location, title, st), st >= 8pm
  Q'(title, location, st) :- S1.Movie(MID, title), S1.MovieDetail(MID, director, "comedy", year), S3.NYCCinemas(location, title, st), st >= 8pm
  Q'(title, location, st) :- S5.MovieGenres(title, "comedy"), S6.MovieDirectors(title, director), S7.MovieYears(title, year), S2.Cinemas(location, title, st), st >= 8pm
  Q'(title, location, st) :- S5.MovieGenres(title, "comedy"), S6.MovieDirectors(title, director), S7.MovieYears(title, year), S3.NYCCinemas(location, title, st), st >= 8pm
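The unfolding step in this example can be sketched in a few lines of Python. The encoding of views as lists of atoms, the `VIEWS` table, and the helper names are assumptions made for illustration; the sketch also skips the renaming of existential variables such as MID that a full implementation would need, and it leaves the comparison st >= 8pm to be appended to each plan.

```python
from itertools import product

# GAV view definitions (illustrative encoding): each mediated relation maps to
# its head parameters and a list of alternative bodies. A body is a list of
# (source_relation, argument_names).
VIEWS = {
    "Movie": (("title", "director", "year", "genre"), [
        [("S1.Movie", ("MID", "title")),
         ("S1.MovieDetail", ("MID", "director", "genre", "year"))],
        [("S5.MovieGenres", ("title", "genre")),
         ("S6.MovieDirectors", ("title", "director")),
         ("S7.MovieYears", ("title", "year"))],
    ]),
    "Plays": (("movie", "location", "startTime"), [
        [("S2.Cinemas", ("location", "movie", "startTime"))],
        [("S3.NYCCinemas", ("location", "movie", "startTime"))],
    ]),
}

def unfold_atom(relation, args):
    """Replace one mediated-schema atom by each of its view bodies."""
    head_params, bodies = VIEWS[relation]
    binding = dict(zip(head_params, args))
    # Names not bound by the head (e.g. MID) are kept as they are.
    return [[(src, tuple(binding.get(a, a) for a in src_args))
             for src, src_args in body]
            for body in bodies]

def unfold(query_body):
    """Return one logical plan per combination of view alternatives."""
    alternatives = [unfold_atom(rel, args) for rel, args in query_body]
    return [[atom for part in combo for atom in part]
            for combo in product(*alternatives)]

QUERY = [("Movie", ("title", "director", "year", '"comedy"')),
         ("Plays", ("title", "location", "st"))]

plans = unfold(QUERY)
for plan in plans:
    atoms = ", ".join(f"{rel}({', '.join(args)})" for rel, args in plan)
    print(f"Q'(title, location, st) :- {atoms}, st >= 8pm")
```

Two alternatives for Movie times two alternatives for Plays gives the four plans listed in the example.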

Limitations
- The reformulation may not be the most efficient way to answer the query
- Some subgoals may be redundant
  - In the last two reformulations, the subgoals S6.MovieDirectors and S7.MovieYears are not needed, since all the query needs from the Movie relation is the genre of the movie
  - But there is no way of concluding this from GAV descriptions
- Adding and removing sources involves considerable work and knowledge of the sources, so the approach is potentially not scalable
  - Example: suppose we discover another source that includes only movie directors
  - To update the source descriptions, we need to specify exactly which sources it must be joined with in order to produce tuples of Movie

TSIMMIS [Garcia-Molina+97]
- One of the first systems to support semi-structured data according to the OEM data model, which predated XML by several years
- Mediator Specification Language (MSL): a logic-based OO language used as a view definition language, targeted to the OEM data model and to the integration of heterogeneous data sources
  - Based on Datalog, among others
- Wrappers accept queries expressed in MSL and compare them with the patterns (MSL templates) given in the wrapper specification file
- An instance of a GAV mediation system
  - We define our global schema as views over the sources

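XML vs. Object Exchange Model
XML:
  <book>
    <author>Bernstein</author>
    <author>Newcomer</author>
    <title>Principles of TP</title>
  </book>
  <book>
    <author>Chamberlin</author>
    <title>DB2 UDB</title>
  </book>
OEM:
  O1: book {
    O2: author { Bernstein }
    O3: author { Newcomer }
    O4: title { Principles of TP }
  }
  O5: book {
    O6: author { Chamberlin }
    O7: title { DB2 UDB }
  }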

User queries in TSIMMIS
- Specified in an OQL-style language called Lorel
  - OQL was an object-oriented query language that looks like SQL
  - Lorel is, in many ways, a predecessor to XQuery
- Based on path expressions over OEM structures:
    select book
    where book.title = "DB2 UDB" and book.author = "Chamberlin"
- This is basically like XQuery, which we'll use in place of Lorel and the MSL template language. The previous query restated:
    for $b in AllData()/book
    where $b/title/text() = "DB2 UDB" and $b/author/text() = "Chamberlin"
    return $b

Query Answering in TSIMMIS
- Basically, it's view unfolding, i.e., composing a query with a view
  - The query is the one being asked
  - The views are the MSL templates for the wrappers
  - Some of the views may actually require parameters, e.g., an author name, before they'll return answers
    - Common for web forms (see Amazon, Google, …)
    - XQuery functions (XQuery's version of views) support parameters as well, so we'll see these in action

Recall SQL View Unfolding/Expansion
A view consisting of branches and their customers:
    create view all_customer as
      (select branch_name, customer_name
       from depositor, account
       where depositor.account_number = account.account_number)
      union
      (select branch_name, customer_name
       from borrower, loan
       where borrower.loan_number = loan.loan_number)
Find all customers of the Perryridge branch:
    select customer_name
    from all_customer
    where branch_name = 'Perryridge'
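To make the unfolding concrete, here is a small sketch using Python's built-in sqlite3 module. The table contents are made up, and the parentheses around the two branches of the union are dropped because SQLite does not accept them; the query over the view and the hand-unfolded query should return the same rows.

```python
import sqlite3

# In-memory database with made-up rows for the four base relations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account(account_number TEXT, branch_name TEXT);
    CREATE TABLE depositor(customer_name TEXT, account_number TEXT);
    CREATE TABLE loan(loan_number TEXT, branch_name TEXT);
    CREATE TABLE borrower(customer_name TEXT, loan_number TEXT);

    INSERT INTO account VALUES ('A-101', 'Perryridge'), ('A-102', 'Brighton');
    INSERT INTO depositor VALUES ('Hayes', 'A-101'), ('Johnson', 'A-102');
    INSERT INTO loan VALUES ('L-15', 'Perryridge'), ('L-16', 'Downtown');
    INSERT INTO borrower VALUES ('Adams', 'L-15'), ('Hayes', 'L-15'),
                                ('Jones', 'L-16');

    CREATE VIEW all_customer AS
        SELECT branch_name, customer_name
        FROM depositor, account
        WHERE depositor.account_number = account.account_number
        UNION
        SELECT branch_name, customer_name
        FROM borrower, loan
        WHERE borrower.loan_number = loan.loan_number;
""")

# Query posed over the view.
via_view = sorted(conn.execute(
    "SELECT customer_name FROM all_customer WHERE branch_name = 'Perryridge'"
).fetchall())

# The same query after unfolding: the view body is substituted for the view
# name and the selection is pushed into each branch of the union.
unfolded = sorted(conn.execute("""
    SELECT customer_name
    FROM depositor, account
    WHERE depositor.account_number = account.account_number
      AND branch_name = 'Perryridge'
    UNION
    SELECT customer_name
    FROM borrower, loan
    WHERE borrower.loan_number = loan.loan_number
      AND branch_name = 'Perryridge'
""").fetchall())

print(via_view)
print(unfolded)
```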

A Wrapper Definition in MSL
Wrappers have templates and binding patterns ($X) in MSL:
    B :- B: <book {<author $X>}>
    // $$ = "select * from book where author=" $X //
- If the template is matched by the query issued to the mediator, an SQL query is issued over Book(author, year, title), which is the relation stored in the data source
In XQuery, this might look like:
    define function GetBook($x AS xsd:string) as book {
      for $b in sql("Amazon.DB", "select * from book where author='" + $x + "'")
      return <book>{$b/title}<author>$x</author></book>
    }
    …
The results of GetBook are unioned with others to form the view Mediator()

How to Answer the Query
Given our query:
    for $b in Mediator()/book
    where $b/title/text() = "DB2 UDB" and $b/author/text() = "Chamberlin"
    return $b
Find all wrapper definitions that:
- Contain enough "structure" to match the conditions of the query
- Or have already tested the conditions for us!

Query Composition with Views
We find all views that define book with author and title as output, and we compose the query with each:
    define function GetBook($x AS xsd:string) as book {
      for $b in sql("Amazon.DB", "select * from book where author='" + $x + "'")
      return <book>{$b/title}<author>{$x}</author></book>
    }

    for $b in Mediator()/book
    where $b/title/text() = "DB2 UDB" and $b/author/text() = "Chamberlin"
    return $b
(Figure: the output pattern, a book element with title and author children.)

Matching View Output to Our Query's Conditions
Determine that $b/author/text() corresponds to $x by matching the pattern on the function's output:
    define function GetBook($x AS xsd:string) as book {
      for $b in sql("Amazon.DB", "select * from book where author='" + $x + "'")
      return <book>{$b/title}<author>{$x}</author></book>
    }

    let $x := "Chamberlin"
    for $b in GetBook($x)/book
    where $b/title/text() = "DB2 UDB"
    return $b
(Figure: the output pattern, a book element with title and author children.)

The Final Step: Unfolding
    let $x := "Chamberlin"
    for $b in (
      for $b' in sql("Amazon.DB", "select * from book where author='" + $x + "'")
      return <book>{$b'/title}<author>{$x}</author></book>
    )/book
    where $b/title/text() = "DB2 UDB"
    return $b
This can be simplified into:
    for $b in sql("Amazon.DB", "select * from book where author='Chamberlin'")
    where $b/title/text() = "DB2 UDB"
    return $b

Virtues of TSIMMIS
- Early adopter of semistructured data, greatly predating XML
  - Can support data from many different kinds of sources
  - Obviously, doesn't fully solve the heterogeneity problem
- Presents a mediated schema that is the union of multiple views
  - Query answering based on view unfolding
- Easily composed in a hierarchy of mediators

Big limitation of TSIMMIS Mediated schema is basically the union of the various MSL templates – as they change, so may the mediated schema

Local-As-View
- The opposite approach to GAV
- Focuses on describing each data source as precisely as possible, independently of any other sources
  - Instead of specifying how to compute tuples of the mediated schema, LAV expressions describe data sources as queries over the mediated schema

The Local-as-View Model
The basic model is the following:
- Local sources are views over the mediated schema
- Sources have the data; the mediated schema is virtual
- Sources may not have all the data from the domain ("open-world assumption")
The system must use the sources (views) to answer queries over the mediated schema

LAV schema mappings
Let G be a mediated schema and let S = {S1, ..., Sn} be the schemata of n data sources. A Local-As-View schema mapping M is a set of expressions of the form
    Si(X) ⊆ Qi(G) or Si(X) = Qi(G)
where:
- Qi is a query over the mediated schema G, and
- Si is a source relation, and it appears in at most one expression in M


LAV example
In LAV, sources S5-S7 would be described as projection queries over the Movie relation in the mediated schema:
    S5.MovieGenres(title, genre) ⊆ Movie(title, director, year, genre)
    S6.MovieDirectors(title, dir) ⊆ Movie(title, dir, year, genre)
    S7.MovieYears(title, year) ⊆ Movie(title, director, year, genre)
In LAV, we can express constraints on the contents of data sources:
    S9(title, year, "comedy") ⊆ Movie(title, director, year, "comedy"), year >= 1970

LAV semantics
Let M = M1, ..., Ml be a LAV schema mapping between G and S = {S1, ..., Sn}, where Mi is of the form Si(X) ⊆ Qi(G) or Si(X) = Qi(G). Let g be an instance of the mediated schema G, and let s = s1, ..., sn be instances of S1, ..., Sn, respectively.
The tuple of instances (g, s1, ..., sn) is in M_R if for every 1 <= i <= l the following holds:
- If Mi is an = expression, then the result of evaluating Qi over g is equal to si
- If Mi is a ⊆ expression, then the result of evaluating Qi over g is a superset of si
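The following sketch checks these conditions on tiny, made-up instances. The dictionary encoding, the "subset" tag, and the helper names are assumptions for illustration; the two source descriptions mirror S5 and S9 from the LAV example.

```python
# Sketch: checking LAV semantics. Each source description pairs a source
# relation with a query over the mediated-schema instance g.

def satisfies_lav(mapping, g, s):
    """mapping: list of (source_relation, op, query), op in {"=", "subset"}.

    "subset" encodes Si(X) ⊆ Qi(G): the source may hold only part of Qi(g).
    """
    for source, op, query in mapping:
        expected = set(query(g))
        contents = s.get(source, set())
        if op == "=" and contents != expected:
            return False
        if op == "subset" and not contents <= expected:
            return False
    return True

def movie_genres(g):
    # S5.MovieGenres(title, genre) ⊆ Movie(title, director, year, genre)
    return {(title, genre) for (title, _d, _y, genre) in g["Movie"]}

def comedies_since_1970(g):
    # S9(title, year, "comedy") ⊆ Movie(title, director, year, "comedy"), year >= 1970
    return {(title, year, genre) for (title, _d, year, genre) in g["Movie"]
            if genre == "comedy" and year >= 1970}

LAV_MAPPING = [("S5.MovieGenres", "subset", movie_genres),
               ("S9", "subset", comedies_since_1970)]

g = {"Movie": {("Annie Hall", "Allen", 1977, "comedy"),
               ("Alien", "Scott", 1979, "horror")}}

# s_ok is incomplete (S5 lacks Annie Hall), which a subset expression allows.
s_ok = {"S5.MovieGenres": {("Alien", "horror")},
        "S9": {("Annie Hall", 1977, "comedy")}}
# s_bad claims a comedy that g does not contain.
s_bad = {"S5.MovieGenres": {("Alien", "horror")},
         "S9": {("Alien", 1979, "comedy")}}

print(satisfies_lav(LAV_MAPPING, g, s_ok))
print(satisfies_lav(LAV_MAPPING, g, s_bad))
```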

Reformulation in LAV
- Main advantages: flexibility, and the ability to express incomplete information
  - Data sources are described in isolation, so the system, and not the designer, finds ways of combining data from multiple sources
  - It is easier for a designer to add or remove sources
- Example:
    Q(title) :- Movie(title, director, year, "comedy"), year >= 1960
  Using sources S5-S7, we obtain the reformulation:
    Q'(title) :- S5.MovieGenres(title, "comedy"), S7.MovieYears(title, year), year >= 1960
  Using source S9, we obtain the reformulation:
    Q'(title) :- S9(title, year, "comedy")
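As a rough sketch, the two reformulations can be evaluated directly over source data and their results unioned. The source contents below are invented for illustration, and the function names are hypothetical.

```python
# Made-up contents for the sources used by the two reformulations.
S5_MOVIE_GENRES = {("Annie Hall", "comedy"), ("Alien", "horror"),
                   ("Duck Soup", "comedy")}
S7_MOVIE_YEARS = {("Annie Hall", 1977), ("Alien", 1979), ("Duck Soup", 1933)}
S9 = {("Tootsie", 1982, "comedy")}

def plan_s5_s7(genres, years):
    """Q'(title) :- S5.MovieGenres(title, "comedy"),
                    S7.MovieYears(title, year), year >= 1960"""
    comedy_titles = {title for (title, genre) in genres if genre == "comedy"}
    return {(title,) for (title, year) in years
            if title in comedy_titles and year >= 1960}

def plan_s9(s9):
    """Q'(title) :- S9(title, year, "comedy")

    The description of S9 already restricts it to years >= 1970, so the
    condition year >= 1960 needs no separate check.
    """
    return {(title,) for (title, _year, genre) in s9 if genre == "comedy"}

# The answer returned to the user is the union of both plans.
answers = plan_s5_s7(S5_MOVIE_GENRES, S7_MOVIE_YEARS) | plan_s9(S9)
print(sorted(answers))
```

"Duck Soup" is a comedy but fails year >= 1960, and "Alien" fails the genre test, so only the other two titles are returned.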

The Information Manifold [Levy+96]
- When you integrate something, you have some conceptual model of the integrated domain
  - Define that as a basic frame of reference, and everything else as a view over it
  - "Local as View"
- May have overlapping/incomplete sources
  - Define each source as the subset of a query over the mediated schema
  - We can use selection or join predicates to specify that a source contains a range of values:
      ComputerBooks(…) ⊆ Books(Title, …, Subj), Subj = "Computers"

Advantages and Shortcomings of LAV
- Enables expressing incomplete information
- More robust way of defining mediated schemas and sources
  - Mediated schema is clearly defined, less likely to change
  - Sources can be more accurately described
- Computationally more expensive!

References
- Chapter 4, draft of the book "Principles of Data Integration" by AnHai Doan, Alon Halevy, and Zachary Ives (in preparation).
- Sudarshan Chawathe, Hector Garcia-Molina, Joachim Hammer, Kelly Ireland, Yannis Papakonstantinou, Jeffrey Ullman, and Jennifer Widom. The TSIMMIS project: Integration of heterogeneous information sources. In Proceedings of IPSJ, Tokyo, Japan, October
- Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille. Querying Heterogeneous Information Sources Using Source Descriptions. In Proceedings of the International Conference on Very Large Databases (VLDB),
- Zach Ives, slides of the course "Database and Information Systems", Fall 2007, available at: