Data Integration Methods Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems February 16, 2004.

Presentation transcript:

Data Integration Methods Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems February 16, 2004

2 Administrivia
• Next reading assignment, on scalable query reformulation algorithms:
  • Pottinger and Halevy – MiniCon
• Write-up: summarize the main ideas of this paper

3 Today’s Trivia Question

4 A Problem
• We’ve seen that even with normalization and the same needs, different people will arrive at different schemas
• In fact, most people also have different needs!
• Often people build databases in isolation, then want to share their data
  • Different systems within an enterprise
  • Different information brokers on the Web
  • Scientific collaborators
  • Researchers who want to publish their data for others to use
• This is the goal of data integration: tie together different sources, controlled by many people, under a common schema

5 Example
• We want to build the UltimateMovieGuide™
• Given:
  • TV Guide schema: ShowingAt(time, title, year), Shows(title, year, genre, rating), Director(title, year, name), Starring(title, year, name)
  • GoodMovies.com: FourStarMovie(title, year, genre), DecentMovie(title, year)
  • OscarWinners: WinningDirector(director, title, year)
  • Documentaries.org: Documentary(title, year, director, producer)

6 Integrating Data
• Several steps:
  • Getting data out from a data source – it may have its own query/retrieval interface
    • Sometimes it will need a query before it returns answers, e.g., a Web form
  • Getting all of the data into the same data model and format
  • Translating the data into the same schema
  • Answering queries
• How might we handle this?

7 Data Warehouses – Offline Replication
• Get experts together, define a schema they think best captures all the info
• Define a database with this schema
• Define procedural mappings in an “ETL tool” to import the data
  • Perhaps perform “data cleaning”
• Periodically copy all of the data from the data sources
  • Note that the sources and the warehouse are basically independent at this point
[Figure: Remote, Autonomous Data Sources; Data Warehouse; Query; Results]

8 Pros and Cons of Data Warehouses
Cons:
• Need to spend time designing the physical database layout, as well as the logical one
  • This actually takes a lot of effort!
• Data is generally not up-to-date (lazy or offline refresh)
Pros:
• Queries over the warehouse don’t disrupt the data sources
• Can run very heavy-duty computations, including data mining and cleaning

9 An Alternative – Mediators or Virtual Integration Systems
• Get experts together, define a schema they think best captures all the info
• Define it as a virtual mediated schema
• Create declarative mappings specifying how to get data from each source into the mediated schema
• Evaluate queries over the mediated schema “on the fly” using the current data at the sources
[Figure: Data Integration System; Mediated Schema; Schema Mappings; Source Catalog; Remote, Autonomous Data Sources; Query; Results]

10 Core Question: How Do We Define and Use Mappings?
• Queries must be directly composed with mappings
  • This leads to the use of views as the means of specifying mappings
• … So in which direction do we specify views?
  1. Mediated relations as views over source relations
  2. Source relations as views over mediated relations
• TSIMMIS chooses option 1
• Information Manifold chooses option 2
• Neither is perfect or comprehensive, as we’ll see

11 The Job of Mappings
Between different data sources:
• May have different numbers of tables – different decompositions
• Attributes may be broken down differently (“rating” vs. “EbertThumb” and “RoeperThumb”)
• Metadata in one relation may be data in another
• Values may not exactly correspond (“shows” vs. “movies”)
• It may be unclear whether a value is the same (“COPPOLA” vs. “Francis Ford Coppola”)
• May have different but synonymous terms (ImdbID “123456” vs. SSN “ ”)
• Might have sub/superclass relationships

12 General Techniques
• Value-value correspondences accomplished using concordance tables
  • Join through a table mapping values to values
  • Imdb_Actor(ID, SAG_actor_name)
• Table-multitable correspondences accomplished using joins (in one direction), projections (in other direction)
  • Key question: what happens if a needed attribute is missing? (e.g., DecentMovie has no genre)
• Super/subclass relationships generally must be captured using selection (in one direction), union (in other direction)
• … And sometimes we just can’t specify the correspondence!
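
The concordance-table idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the `imdb_actor` dictionary, and the helper `cast_by_name` are hypothetical and stand in for the Imdb_Actor(ID, SAG_actor_name) table named on the slide.

```python
# Hypothetical source tables, loosely modeled on the slide's examples.
imdb_cast = [(1234, "Catwoman")]                # (imdb_id, title)
named_cast = [("Berry, H.", "Monster's Ball")]  # (actor_name, title)

# Concordance table mapping one source's key values to the other's (assumed contents).
imdb_actor = {1234: "Berry, H."}                # imdb_id -> actor name


def cast_by_name(imdb_rows, concordance):
    """Join ID-keyed rows through the concordance table to get name-keyed rows."""
    return [(concordance[i], title) for i, title in imdb_rows if i in concordance]


# Once both sources use the same key values, a plain union integrates them.
unified = sorted(set(cast_by_name(imdb_cast, imdb_actor)) | set(named_cast))
print(unified)
```

Rows whose ID has no entry in the concordance table are silently dropped here; a real system would have to decide whether to keep them with an unknown name instead.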

13 Some Examples of Mappings
• Show(ID, Title, Year, Lang, Genre) and EnglishMovie(Title, Year, Genre, Rating):
    PieceOfArt(I, T, Y, "English", G) :- EnglishMovie(T, Y, G, _), MovieIDFor(I, T, Y)
• Movie(ID, Title, Year, Genre, Director, Star1, Star2) and Docu(ID, Title, Year), Participant(ID, Name, Role):
    Movie(I, T, Y, "doc", D, S1, S2) :- Docu(I, T, Y), Participant(I, D, "Dir"), Participant(I, S1, "Cast1"), Participant(I, S2, "Cast2")
• T1 and T2: need a concordance table from ImdbIDs to actress names
    T1:  ImdbID | CastOf
         1234   | Catwoman
    T2:  Name      | CastOf
         Berry, H. | Monster’s Ball

14 TSIMMIS and Information Manifold
• Focus: Web-based queryable sources
  • CGI forms, online databases, maybe a few RDBMSs
  • Each needs to be mapped into the system – not as easy as web search – but the benefits are significant vs. query engines
• A few parenthetical notes:
  • Part of a slew of works on wrappers, source profiling, etc.
  • The creation of mappings can be partly automated – systems such as LSD, Cupid, Clio, … do this
  • Today most people look at integrating large enterprises (that’s where the $$$ is!) – Nimble, BEA, IBM

15 TSIMMIS
• “The Stanford-IBM Manager of Multiple Information Sources” … or, a Yiddish stew
• An instance of a “global-as-view” mediation system
• One of the first systems to support semi-structured data, which predated XML by several years

16 Semi-structured Data: OEM
• Observation: given a particular schema, its attributes may be unavailable from certain sources – inherent irregularity
• Proposal: Object Exchange Model, OEM (the leading number on each line is the OID):
    1: show {
      2: id { 15 },
      3: title { Catwoman },
      4: year { 2004 },
      5: lang { English },
      6: genre { fantasy },
      7: criticsrating {
        8: stars { 0.5 },
        9: source { Bob } } }

17 Queries in TSIMMIS
• Specified in an OQL-style language called Lorel
  • OQL was an object-oriented query language
  • Lorel is, in many ways, a predecessor to XQuery
• Based on path expressions over OEM structures:
    select show
    where show.title = "Star Wars" and show.genre = "sci-fi"
• This is basically like XQuery, which we’ll use in place of Lorel and the MSL template language. The previous query, restated:
    for $s in AllData()/show
    where $s/title/text() = "Star Wars" and $s/genre/text() = "sci-fi"
    return $s

18 Query Answering in TSIMMIS
• Basically, it’s view unfolding, i.e., composing a query with a view
  • The query is the one being asked
  • The views are the MSL templates for the wrappers
• Some of the views may actually require parameters, e.g., an author name, before they’ll return answers
  • Common for web forms (see Amazon, Google, …)
  • XQuery functions (XQuery’s version of views) support parameters as well, so we’ll see these in action

19 A Wrapper Definition in MSL
• Wrappers have templates and binding patterns ($T) in MSL:
    S :- S: }> // $$ = “select * from movie where title=“ $T //
• This reformats a SQL query over Movie(title, year, genre)
• In XQuery, this might look like:
    define function GetShow($t AS xsd:string) as show {
      for $s in sql("Amazon.DB", "select * from movie where title='" + $t + "'")
      return <show><title>{$t}</title>{$s/year, $s/genre}</show>
    }
[Figure: the movie table, with columns including year and genre]
• GetShow’s results are unioned with those of the other wrappers to form the view AllData()

20 How to Answer the Query
Given our query:
    for $s in AllData()/show
    where $s/title/text() = "Star Wars" and $s/genre/text() = "sci-fi"
    return $s
Find all wrapper definitions that:
• Produce output with enough “structure” to match the conditions of the query
• Or have already tested the conditions for us!
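
A rough Python analogue of this unfolding step is given below. It is a sketch only: the two wrapper functions and their contents are hypothetical stand-ins for MSL templates, and the dictionaries play the role of `show` elements.

```python
# Hypothetical wrappers standing in for MSL templates; each yields "show" records.
def get_show_amazon(title):
    """Parameterized source: needs a title binding before it returns anything."""
    catalog = {"Star Wars": {"year": 1977, "genre": "sci-fi"}}  # assumed contents
    row = catalog.get(title)
    return [{"title": title, **row}] if row else []


def get_shows_tvguide():
    """Unparameterized source: returns everything it has."""
    return [{"title": "Catwoman", "year": 2004, "genre": "fantasy"}]


def answer(title, genre):
    """Unfold AllData()/show into the union of wrapper calls, then filter.

    The title condition from the query is pushed into the wrapper that
    requires a binding; the remaining conditions are tested afterwards.
    """
    candidates = get_show_amazon(title) + get_shows_tvguide()
    return [s for s in candidates if s["title"] == title and s["genre"] == genre]


print(answer("Star Wars", "sci-fi"))
```

The point of the sketch is that the query never touches a materialized AllData(): it is rewritten into calls on the individual wrappers, and the query's own conditions supply the parameter a wrapper needs.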

21 Query Composition with Views
We find all views that define book with author and title, and we compose the query with each:
    define function GetBook($x AS xsd:string) as book {
      for $b in sql("Amazon.DB", "select * from book where author='" + $x + "'")
      return <book>{$b/title}<author>{$x}</author></book>
    }

    for $b in AllData()/book
    where $b/title/text() = "DB2 UDB" and $b/author/text() = "Chamberlin"
    return $b
[Figure: book, with title and author]

22 Example on Board

23 Virtues of TSIMMIS
• Early adopter of semistructured data, greatly predating XML
  • Can support data from many different kinds of sources
  • Obviously, doesn’t fully solve heterogeneity problem
• Presents a mediated schema that is the union of multiple views
  • Query answering based on view unfolding
• Easily composed in a hierarchy of mediators

24 Limitations of TSIMMIS’ Approach
Some data sources may contain data with certain ranges or properties
• “Books by Aho”, “Students at UPenn”, …
• If we ask a query for students at Columbia, we don’t want to bother querying students at Penn…
• How do we express these?
Mediated schema is basically the union of the various MSL templates – as they change, so may the mediated schema

25 An Alternate Approach: The Information Manifold (Levy et al.)
When you integrate something, you have some conceptual model of the integrated domain
• Define that as a basic frame of reference, everything else as a view over it
• “Local as View” using mappings that are conjunctive queries
May have overlapping/incomplete sources
• Define each source as the subset of a query over the mediated schema – the “open world assumption”
• We can use selection or join predicates to specify that a source contains a range of values:
    ComputerBooks(…) ⊆ Books(Title, …, Subj), Subj = “Computers”

26 The Local-as-View Model
The basic model is the following:
• “Local” sources are views over the mediated schema
• Sources have the data – mediated schema is virtual
• Sources may not have all the data from the domain – “open-world assumption”
The system must use the sources (views) to answer queries over the mediated schema

27 Answering Queries Using Views
Assumption: conjunctive queries, set semantics
• Suppose we have a mediated schema: show(ID, title, year, genre), rating(ID, stars, source)
• A conjunctive query might be:
    q(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997
Recall intuitions about this class of queries:
• Adding a conjunct to a query removes answers from the result but never adds any
• Any conjunctive query with at least the same constraints & conjuncts will give valid answers
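
The monotonicity intuition can be checked on a toy instance. In this Python sketch the tuples in `show` and `rating` are made-up sample data, and `q_base` is a hypothetical less-constrained query introduced only for comparison; set comprehensions give set semantics for free.

```python
# Made-up instance of the mediated schema show(ID, title, year, genre), rating(ID, stars, source).
show = {
    (1, "Movie A", 1997, "drama"),
    (2, "Movie B", 2004, "fantasy"),
    (3, "Movie C", 1997, "sci-fi"),
}
rating = {(1, 5, "TVGuide"), (2, 1, "Bob"), (3, 4, "TVGuide")}


def q_base():
    """q0(t) :- show(i, t, y, g), rating(i, r, s)   (no extra constraints)"""
    return {t for (i, t, y, g) in show for (i2, r, s) in rating if i == i2}


def q():
    """q(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997"""
    return {t for (i, t, y, g) in show for (i2, r, s) in rating
            if i == i2 and r == 5 and y == 1997}


print(sorted(q_base()))  # every rated show
print(sorted(q()))       # only five-star shows from 1997
```

Every extra condition in `q` can only filter tuples out of the join, so `q()` is always a subset of `q_base()` on any instance, not just this one.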

28 Query Answering
Suppose we have the query:
    q(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997
and sources:
    5star(i) ⊆ show(i, t, y, g), rating(i, 5, s)
    TVguide(t,y,g,r) ⊆ show(i, t, y, g), rating(i, r, “TVGuide”)
    movieInfo(i,t,y,g) ⊆ show(i, t, y, g)
    critics(i,r,s) ⊆ rating(i, r, s)
    goodMovies(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1997
We want to compose the query with the source mappings – but they’re in the wrong direction!

29 Inverse Rules
We can take every mapping and “invert” it, though sometimes we may have insufficient information:
If
    5star(i) ⊆ show(i, t, y, g), rating(i, 5, s)
then we can also infer that:
    show(i, ???, ???, ???) ⊇ 5star(i)
But how do we handle the missing attributes?
• We know that there must be AT LEAST one instance of ??? for each attribute for each show ID
• So we might simply insert a NULL and define that NULL means “unknown” (as opposed to “missing”)…

30 But NULLs Lose Information
Suppose we take these rules and ask for:
    q(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997
If we look at the rule:
    goodMovies(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1997
“By inspection,” q(t) ⊇ goodMovies(t,y)
But if we apply our inversion procedure, we get:
    show(i, t, y, g) ⊇ goodMovies(t,y), i = NULL, g = “drama”
    rating(i, r, s) ⊇ goodMovies(t,y), i = NULL, r = 5, s = NULL
We need “a special NULL” so we can figure out which IDs and ratings match up

31 The Solution: “Skolem Functions”
Skolem functions:
• Conceptual “perfect” hash functions
• Each function returns a unique, deterministic value for each combination of input values
• Every function returns a non-overlapping set of values (Skolem function F will never return a value that matches any of Skolem function G’s values)
Skolem values won’t ever be part of the answer set or the computation, since the functions don’t produce real values
• They’re just a way of logically generating “special NULLs”
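
One way to picture Skolem terms is as tagged tuples: the function name plus its arguments. The Python sketch below inverts the goodMovies mapping this way; the class `Skolem`, the function names `f_i` and `f_s`, and the sample titles are assumptions made for illustration, not part of the original slides.

```python
from typing import NamedTuple


class Skolem(NamedTuple):
    """A 'special NULL': identified by the function name and its inputs."""
    fn: str
    args: tuple


def invert_good_movies(good_movies):
    """Inverse rules for goodMovies(t,y) ⊆ show(i,t,y,"drama"), rating(i,5,s), y = 1997.

    The unknown ID and rating source become Skolem terms over (t, y), so the
    show tuple and the rating tuple derived from one source tuple share an ID.
    """
    show, rating = set(), set()
    for (t, y) in good_movies:
        i = Skolem("f_i", (t, y))
        s = Skolem("f_s", (t, y))
        show.add((i, t, y, "drama"))
        rating.add((i, 5, s))
    return show, rating


def q(show, rating):
    """q(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997"""
    return {t for (i, t, y, g) in show for (i2, r, s) in rating
            if i == i2 and r == 5 and y == 1997}


show, rating = invert_good_movies({("Movie A", 1997), ("Movie B", 1997)})
print(sorted(q(show, rating)))
```

Because each Skolem term is determined by its inputs, the join on `i` matches a show only with its own rating; and because the function name is part of the term, values produced by `f_i` never collide with those produced by `f_s`.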

32 Query Answering Using Inverse Rules
Invert all rules using the procedures described
Take the query and the possible rule expansions and execute them in a Datalog interpreter
• In the previous query, we expand with all combinations of expansions of book and of author – every possible way of combining and cross-correlating info from different sources
• Then we throw away all unsatisfiable rewritings (some expansions will be logically inconsistent)
More efficient, but equivalent, algorithms now exist:
• Bucket algorithm [Levy et al.], which we discuss next
• MiniCon [Pottinger & Halevy] (next time)
• Also related: “chase and backchase” [Popa, Tannen, Deutsch]

33 The Bucket Algorithm
• Given a query Q with relations and predicates
• Create a bucket for each subgoal in Q
• Iterate over each view (source mapping)
  • If the source includes the bucket’s subgoal:
    • Create a mapping between Q’s variables and the view’s variables at the same positions
    • If satisfiable with the substitutions, add the view to the bucket
• Take the cross-product of the buckets to form candidate rewritings (exponential time, but queries are probably relatively small)
• For each candidate, do a containment check to make sure the rewriting is contained within the query

34 Let’s Try a Bucket Example
Query:
    q(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997
Sources:
    5star(i) ⊆ show(i, t, y, g), rating(i, 5, s)
    TVguide(t,y,g,r) ⊆ show(i, t, y, g), rating(i, r, “TVGuide”)
    movieInfo(i,t,y,g) ⊆ show(i, t, y, g)
    critics(i,r,s) ⊆ rating(i, r, s)
    goodMovies(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1997
    good98(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1998
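
The bucket-building phase for this example can be sketched in Python. This is a simplified illustration, not the full algorithm: it omits the final containment check, folds each `y = …` comparison into the subgoal as a constant, and adopts an assumed encoding in which variables are strings starting with `?`. The function names are hypothetical.

```python
from itertools import product


def is_var(x):
    """Assumed convention: variables are strings starting with '?'."""
    return isinstance(x, str) and x.startswith("?")


def compatible(q_args, v_args):
    """Two subgoals can unify if each position has a variable or equal constants."""
    return len(q_args) == len(v_args) and all(
        is_var(a) or is_var(b) or a == b for a, b in zip(q_args, v_args))


def build_buckets(query_head, query_body, views):
    """One bucket per query subgoal, listing the views that can supply it."""
    buckets = []
    for pred, q_args in query_body:
        bucket = []
        for name, (v_head, v_body) in views.items():
            for v_pred, v_args in v_body:
                if v_pred != pred or not compatible(q_args, v_args):
                    continue
                # Variables in the query head must be exported by the view's head.
                if all(not is_var(b) or b in v_head
                       for a, b in zip(q_args, v_args) if a in query_head):
                    bucket.append(name)
                    break
        buckets.append(bucket)
    return buckets


SHOW = ("show", ("?i", "?t", "?y", "?g"))
views = {
    "5star": (("?i",), [SHOW, ("rating", ("?i", 5, "?s"))]),
    "TVguide": (("?t", "?y", "?g", "?r"), [SHOW, ("rating", ("?i", "?r", "TVGuide"))]),
    "movieInfo": (("?i", "?t", "?y", "?g"), [SHOW]),
    "critics": (("?i", "?r", "?s"), [("rating", ("?i", "?r", "?s"))]),
    "goodMovies": (("?t", "?y"), [("show", ("?i", "?t", 1997, "drama")),
                                   ("rating", ("?i", 5, "?s"))]),
    "good98": (("?t", "?y"), [("show", ("?i", "?t", 1998, "drama")),
                               ("rating", ("?i", 5, "?s"))]),
}
# q(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997, with y replaced by 1997.
query_head = ("?t",)
query_body = [("show", ("?i", "?t", 1997, "?g")), ("rating", ("?i", 5, "?s"))]

buckets = build_buckets(query_head, query_body, views)
candidates = list(product(*buckets))  # each still needs a containment check
print(buckets)
print(len(candidates))
```

Under these simplifications, 5star is kept out of the show bucket because it does not export the title, and good98 is kept out because 1998 conflicts with 1997. Many of the resulting candidates would still be rejected by the containment check, for instance those that cannot join show and rating on a shared ID.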

35 Example of Containment Testing
Suppose we have two queries:
    q1(S,C) :- Student(S, N), Takes(S, C), Course(C, X), inCSE(C), Course(C, “DB & Info Systems”)
    q2(S,C) :- Student(S, N), Takes(S, C), Course(C, X)
Intuitively, q1 must contain the same or fewer answers vs. q2:
• It has all of the same conditions, plus extra conjuncts (i.e., it’s more restricted)
• There’s no union or any other way it can add more data
We can say that q2 contains q1 because this holds for any instance of our DB {Student, Takes, Course}

36 Checking Containment via Canonical Databases
• To test for q1 ⊆ q2:
  • Create a “canonical DB” that contains a tuple for each subgoal in q1
  • Execute q2 over it
  • If q2 returns a tuple that matches the head of q1, then q1 ⊆ q2
  (Containment of conjunctive queries is NP-complete in the size of the query. For full first-order logic queries it is undecidable!)
• Let’s see this for our example…

37 Example Canonical DB
• q1(S,C) :- Student(S, N), Takes(S, C), Course(C, X), inCSE(C), Course(C, “DB & Info Systems”)
• q2(S,C) :- Student(S, N), Takes(S, C), Course(C, X)
Canonical DB for q1:
    Student:  (S, N)
    Takes:    (S, C)
    Course:   (C, X), (C, DB & Info Systems)
    inCSE:    (C)
Need to get the tuple (S, C) when executing q2 over this database
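
The canonical-database test can be written as a short Python sketch for conjunctive queries without comparison predicates. The encoding (variables as strings starting with `?`, queries as a head tuple plus a list of subgoals) and all function names are assumptions made for this illustration.

```python
def is_var(x):
    """Assumed convention: variables are strings starting with '?'."""
    return isinstance(x, str) and x.startswith("?")


def freeze(term):
    """Turn a variable into a distinct constant for the canonical database."""
    return ("frozen", term) if is_var(term) else term


def canonical_db(body):
    """One fact per subgoal, with every variable frozen."""
    return {(pred, tuple(freeze(a) for a in args)) for pred, args in body}


def match(args, fact_args, binding):
    """Extend binding so that args matches fact_args, or return None."""
    if len(args) != len(fact_args):
        return None
    new = dict(binding)
    for a, f in zip(args, fact_args):
        if is_var(a):
            if new.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return new


def evaluate(head, body, db):
    """All head tuples obtained by matching every subgoal of body against db."""
    results = set()

    def extend(k, binding):
        if k == len(body):
            results.add(tuple(binding.get(h, h) for h in head))
            return
        pred, args = body[k]
        for fact_pred, fact_args in db:
            if fact_pred == pred:
                new = match(args, fact_args, binding)
                if new is not None:
                    extend(k + 1, new)

    extend(0, {})
    return results


def contained_in(q1, q2):
    """True if q1 ⊆ q2: q2 over q1's canonical DB returns q1's frozen head."""
    (head1, body1), (head2, body2) = q1, q2
    frozen_head = tuple(freeze(h) for h in head1)
    return frozen_head in evaluate(head2, body2, canonical_db(body1))


common = [("Student", ("?S", "?N")), ("Takes", ("?S", "?C")), ("Course", ("?C", "?X"))]
q1 = (("?S", "?C"), common + [("inCSE", ("?C",)),
                               ("Course", ("?C", "DB & Info Systems"))])
q2 = (("?S", "?C"), common)

print(contained_in(q1, q2))  # q1 is contained in q2
print(contained_in(q2, q1))  # but not the other way around
```

The backtracking search in `evaluate` is exponential in the worst case, which is consistent with the NP-completeness noted on the previous slide; for queries of the size shown here that cost is negligible.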

38 Next Time
• We’ll look at the state-of-the-art in query reformulation, the MiniCon algorithm
  • Eliminates the need for the containment check
  • Eliminates many cross-product comparisons
• This – and the Chase&Backchase strategy of Tannen et al – are the two methods most used in virtual data integration today
• Please read the MiniCon paper (Pottinger & Halevy)