Bridging Different Data Representations Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems October 28, 2003 Some slide.

Bridging Different Data Representations Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems October 28, 2003 Some slide content may be courtesy of Susan Davidson, Dan Suciu, & Raghu Ramakrishnan

2 Administrivia HW4 and midterms returned today  You all did great!  Median on midterm was 75 of 80 (mean was 73.4) Remember to turn in your project plan on Thursday!  Should have a plan for how to break down the project tasks among your group  Should have some milestones that get you towards a completed project  I’ll ask for a status report in a couple of weeks

3 A Problem  We’ve seen that even with normalization and the same needs, different people will arrive at different schemas  In fact, most people also have different needs!  Often people build databases in isolation, then want to share their data  Different systems within an enterprise  Different information brokers on the Web  Scientific collaborators  Researchers who want to publish their data for others to use  This is the goal of data integration: tie together different sources, controlled by many people, under a common schema

4 Building a Data Integration System Create a middleware “mediator” or “data integration system” over the sources  Can be warehoused (a data warehouse) or virtual  Presents a uniform query interface and schema  Abstracts away multitude of sources; consults them for relevant data  Unifies different source data formats (and possibly schemas)  Sources are generally autonomous, not designed to be integrated  Sources may be local DBs or remote web sources/services  Sources may require certain input to return output (e.g., web forms): “binding patterns” describe these

5 Data Integration System / Mediator Typical Data Integration Components Mediated Schema Wrapper Source Relations Mappings in Catalog Source Catalog QueryResults

6 Typical Data Integration Architecture Reformulator Query Processor Source Catalog Wrapper Query Query over sources Source Descrs. Queries + bindings Data in mediated format Results

7 Challenges of Mapping Schemas In a perfect world, it would be easy to match up items from one schema with another  Every table would have a similar table in the other schema  Every attribute would have an identical attribute in the other schema  Every value would clearly map to a value in the other schema Real world: as with human languages, things don’t map clearly!  May have different numbers of tables – different decompositions  Metadata in one relation may be data in another  Values may not exactly correspond  It may be unclear whether a value is the same

8 A Few Simple Examples  Movie(Title, Year, Director, Editor, Star1, Star2)  PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)  MotionPicture(ID, Title, Year) Participant(ID, Name, Role) CustIDCustName 1234Ives, Z. PennIDEmpName 46732Zachary Ives

9 How Do We Relate Schemas? General approach is to use a view to define relations in one schema, given data in the other schema  This allows us to “restructure” or “recompose + decompose” our data in a new way We can also define mappings between values in a view  We use an intermediate table defining correspondences – a “concordance table”  It can be filled in using some type of code, and corrected by hand

10 Mapping Our Examples  Movie(Title, Year, Director, Editor, Star1, Star2)  PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)  MotionPicture(ID, Title, Year) Participant(ID, Name, Role) CustIDCustName 1234Ives, Z. PennIDEmpName 46732Zachary Ives PieceOfArt(I, A, S, T, “Movie”) :- Movie(T, Y, A, _, S1, S2), ID = T || Y, S = S1 || S2 Movie(T, Y, D, E, S1, S2) :- MotionPicture(I, T, Y), Participant(I, D, “Dir”), Participant(I, E, “Editor”), Participant(I, S1, “Star1”), Participant(I, S2, “Star2”) T1 T2 ???

11 Two Important Approaches  TSIMMIS [Garcia-Molina+97] – Stanford  Focus: semistructured data (OEM), OQL-based language (Lorel)  Creates a mediated schema as a view over the sources  Spawned a UCSD project called MIX, which led to a company now owned by BEA Systems  Information Manifold [Levy+96] – AT&T Research  Focus: local-as-view mappings, relational model  Sources defined as views over mediated schema  Spawned Tukwila at Washington, and eventually a company as well

12 TSIMMIS and Information Manifold  Focus: Web-based queryable sources  CGI forms, online databases, maybe a few RDBMSs  Each needs to be mapped into the system – not as easy as web search – but the benefits are significant vs. query engines  A few parenthetical notes:  Part of a slew of works on wrappers, source profiling, etc.  The creation of mappings can be partly automated – systems such as LSD, Cupid, Clio, … do this  Today most people look at integrating large enterprises (that’s where the $$$ is!) – Nimble, BEA Liquid Data, Enosys, IBM

13 TSIMMIS  “The Stanford-IBM Manager of Multiple Information Sources” … or, a Yiddish stew  An instance of a “global-as-view” mediation system  One of the first systems to support semi-structured data, which predated XML by several years

14 Semi-structured Data: OEM  Observation: given a particular schema, its attributes may be unavailable from certain sources – inherent irregularity  Proposal: Object Exchange Model, OEM OID:  … How does it relate to XML?  … What problems does OEM solve, and not solve, in a heterogeneous system?

15 OEM Example Show this XML fragment in OEM: Bernstein Newcomer Principles of TP Chamberlin DB2 UDB

16 Queries in TSIMMIS  Specified in OQL-style language called Lorel  OQL was an object-oriented query language  Lorel is, in many ways, a predecessor to XQuery  Based on path expressions over OEM structures: select book where book.author = “DB2 UDB” and book.title = “Chamberlin”  This is basically like XQuery, which we’ll use in place of Lorel and the MSL template language. Previous query restated = for $b in document(“my-source”)/book where $b/title/text = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b

17 Query Answering in TSIMMIS  Basically, it’s view unfolding, i.e., composing a query with a view  The query is the one being asked  The views are the MSL templates for the wrappers  Some of the views may actually require parameters, e.g., an author name, before they’ll return answers  Common for web forms (see Amazon, Google, …)  XQuery functions (XQuery’s version of views) support parameters as well, so we’ll see these in action

18 A Wrapper Definition in MSL  Wrappers have templates and binding patterns ($X) in MSL: B :- B: }> // $$ = “select * from book where author=“ $X //  This reformats a SQL query over Book(author, year, title)  In XQuery, this might look like: define function GetBook($X AS xsd:string) as book { for $x in sql(“select * from book where author=‘” + $x +”’”) return $x $x }

19 How to Answer the Query  Given our query: for $b in document(“my-source”)/book where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b  We want to find all wrapper definitions that:  Either contain output enough information that we can evaluate all of our conditions over the output  Or have already tested the conditions for us!

20 Query Composition with Views  We find all views that define book with author and title, and we compose the query with each: define function GetBook($x AS xsd:string) as book { for $b in sql(“select * from book where author=‘” + $x +”’”) return $b $x } for $b in document(“my-source”)/book where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b

21 Matching View Output to Our Query’s Conditions  Determine that $b/book/author/text()  $x by matching the pattern on the function’s output: define function GetBook($x AS xsd:string) as book { for $b in sql(“select * from book where author=‘” + $x +”’”) return $b $x } where $x = “Chamberlin” for $b in GetBook($x)/book where $b/title/text() = “DB2 UDB” return $b

22 The Final Step: Unfolding where $x = “Chamberlin” for $b in { for $b in sql(“select * from book where author=‘” + $x +”’”) return $b $x }/book where $b/title/text() = “DB2 UDB” return $b

23 What Is the Answer? Given schema book(author, year, title) and datalog rules defining an instance: book(“Chamberlin”, “1992”, “DB2 UDB”) book(“Chamberlin”, “1995”, “DB2/CS”)

24 TSIMMIS  Early adopter of semistructured data  Can support irregular structure and missing attributes  Can support data from many different sources  Doesn’t fully solve heterogeneity problem, though!  Simple algorithms for view unfolding  Easily can be composed in a hierarchy of mediators

25 Limitations of TSIMMIS’ Approach  Some data sources may contain data with certain ranges or properties  “Books by Aho”, “Students at UPenn”, …  How do we express these? (Important for performance!)  Mediated schema is basically the union of the various MSL templates – as they change, so may the mediated schema

26 The Information Manifold Defines the mediated schema independently of the sources!  “Local-as-view” instead of “global-as-view”  Guarantees soundness and completeness of answers  Allows us to specify information about data sources  Focuses on relations (with OO extensions), datalog

27 Observations of Levy et al.  When you integrate something, you have some conceptual model of the integrated domain  Define that as a basic frame of reference  May have overlapping/incomplete sources  Define each source as the subset of a query over the mediated schema  We can use selection or join predicates to specify that a source contains a range of values: ComputerBooks(…)  Books(Title, …, Subj), Subj = “Computers”

28 The Local-as-View Model  If we look at the Information Manifold model:  “Local” sources are views over the mediated schema  Sources have the data – mediated schema is virtual  Sources may not have all the data from the domain – “open-world assumption”  The system must use the sources (views) to answer queries over the mediated schema  Thursday we’ll see what “answering queries using views” is all about…

Bridging Different Data Representations Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems October 28, 2003 Some slide.

Similar presentations

Presentation on theme: "Bridging Different Data Representations Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems October 28, 2003 Some slide."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Bridging Different Data Representations Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems October 28, 2003 Some slide.

Similar presentations

Presentation on theme: "Bridging Different Data Representations Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems October 28, 2003 Some slide."— Presentation transcript:

Similar presentations

About project

Feedback