Bridging Different Data Representations Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems October 28, 2003 Some slide.


Bridging Different Data Representations Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems October 28, 2003 Some slide content may be courtesy of Susan Davidson, Dan Suciu, & Raghu Ramakrishnan

2 Administrivia HW4 and midterms returned today  You all did great!  Median on midterm was 75 of 80 (mean was 73.4) Remember to turn in your project plan on Thursday!  Should have a plan for how to break down the project tasks among your group  Should have some milestones that get you towards a completed project  I’ll ask for a status report in a couple of weeks

3 A Problem  We’ve seen that even with normalization and the same needs, different people will arrive at different schemas  In fact, most people also have different needs!  Often people build databases in isolation, then want to share their data  Different systems within an enterprise  Different information brokers on the Web  Scientific collaborators  Researchers who want to publish their data for others to use  This is the goal of data integration: tie together different sources, controlled by many people, under a common schema

4 Building a Data Integration System Create a middleware “mediator” or “data integration system” over the sources  Can be warehoused (a data warehouse) or virtual  Presents a uniform query interface and schema  Abstracts away multitude of sources; consults them for relevant data  Unifies different source data formats (and possibly schemas)  Sources are generally autonomous, not designed to be integrated  Sources may be local DBs or remote web sources/services  Sources may require certain input to return output (e.g., web forms): “binding patterns” describe these

5 Typical Data Integration Components [Figure. Labels: Query, Results, Data Integration System / Mediator, Mediated Schema, Mappings in Catalog, Source Catalog, Wrapper, Source Relations]

6 Typical Data Integration Architecture [Figure. Components: Reformulator, Query Processor, Source Catalog, Wrapper. Flows: Query, Source Descrs., Query over sources, Queries + bindings, Data in mediated format, Results]

7 Challenges of Mapping Schemas In a perfect world, it would be easy to match up items from one schema with another  Every table would have a similar table in the other schema  Every attribute would have an identical attribute in the other schema  Every value would clearly map to a value in the other schema Real world: as with human languages, things don’t map clearly!  May have different numbers of tables – different decompositions  Metadata in one relation may be data in another  Values may not exactly correspond  It may be unclear whether a value is the same

8 A Few Simple Examples  Movie(Title, Year, Director, Editor, Star1, Star2)  PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)  MotionPicture(ID, Title, Year) Participant(ID, Name, Role)
  CustID  CustName
  1234    Ives, Z.
  PennID  EmpName
  46732   Zachary Ives

9 How Do We Relate Schemas? General approach is to use a view to define relations in one schema, given data in the other schema  This allows us to “restructure” or “recompose + decompose” our data in a new way We can also define mappings between values in a view  We use an intermediate table defining correspondences – a “concordance table”  It can be filled in using some type of code, and corrected by hand
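As a rough illustration of filling in a concordance table "using some type of code", the sketch below matches the two name formats from the customer and employee examples. The helper names (`name_key`, `build_concordance`) and the matching rule (last name plus first initial) are assumptions made for this example, not something the slides prescribe, and the output would still need to be corrected by hand.

```python
# Sample rows from the two example tables.
customers = [(1234, "Ives, Z.")]        # (CustID, CustName)
employees = [(46732, "Zachary Ives")]   # (PennID, EmpName)


def name_key(name):
    """Reduce a name to (last name, first initial), lowercased.

    Hypothetical rule: handles only "Last, F." and "First Last".
    """
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
    else:
        first, last = name.rsplit(" ", 1)
    return last.lower(), first[:1].lower()


def build_concordance(customers, employees):
    """Propose (CustID, PennID) pairs whose names reduce to the same key."""
    by_key = {}
    for penn_id, emp_name in employees:
        by_key.setdefault(name_key(emp_name), []).append(penn_id)
    return [
        (cust_id, penn_id)
        for cust_id, cust_name in customers
        for penn_id in by_key.get(name_key(cust_name), [])
    ]


concordance = build_concordance(customers, employees)
print(concordance)
```

A rule this coarse will over-match (any "Z. Ives" pairs with any other), which is why the hand-correction step matters.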

10 Mapping Our Examples  Movie(Title, Year, Director, Editor, Star1, Star2)  PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)  MotionPicture(ID, Title, Year) Participant(ID, Name, Role)
  PieceOfArt(I, A, S, T, “Movie”) :- Movie(T, Y, A, _, S1, S2), I = T || Y, S = S1 || S2
  Movie(T, Y, D, E, S1, S2) :- MotionPicture(I, T, Y), Participant(I, D, “Dir”), Participant(I, E, “Editor”), Participant(I, S1, “Star1”), Participant(I, S2, “Star2”)
  CustID  CustName
  1234    Ives, Z.
  PennID  EmpName
  46732   Zachary Ives
  T1 T2 ???
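The two view definitions can be read as ordinary joins and projections. The sketch below evaluates them in Python over a made-up instance; the sample rows and the function names `movie_view` and `piece_of_art_view` are hypothetical, and `||` is taken to mean string concatenation.

```python
from itertools import product

# Hypothetical instance of MotionPicture(ID, Title, Year) and Participant(ID, Name, Role).
motion_picture = [(1, "Example Film", 2001)]
participant = [
    (1, "Dir Name", "Dir"),
    (1, "Editor Name", "Editor"),
    (1, "Star A", "Star1"),
    (1, "Star B", "Star2"),
]


def movie_view(motion_picture, participant):
    """Movie(T, Y, D, E, S1, S2): join MotionPicture with four Participant roles."""
    def names(movie_id, role):
        return [n for (i, n, r) in participant if i == movie_id and r == role]

    rows = []
    for movie_id, title, year in motion_picture:
        roles = [names(movie_id, r) for r in ("Dir", "Editor", "Star1", "Star2")]
        # A movie missing any role yields no row, as in the datalog join.
        for d, e, s1, s2 in product(*roles):
            rows.append((title, year, d, e, s1, s2))
    return rows


def piece_of_art_view(movie):
    """PieceOfArt(I, A, S, T, "Movie") with I = T || Y and S = S1 || S2."""
    return [
        (f"{t}{y}", a, f"{s1}{s2}", t, "Movie")
        for (t, y, a, _editor, s1, s2) in movie
    ]


movies = movie_view(motion_picture, participant)
art = piece_of_art_view(movies)
print(movies)
print(art)
```

Note that the second view is applied to the output of the first, which is the same composition of views that a hierarchy of mediators relies on.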

11 Two Important Approaches  TSIMMIS [Garcia-Molina+97] – Stanford  Focus: semistructured data (OEM), OQL-based language (Lorel)  Creates a mediated schema as a view over the sources  Spawned a UCSD project called MIX, which led to a company now owned by BEA Systems  Information Manifold [Levy+96] – AT&T Research  Focus: local-as-view mappings, relational model  Sources defined as views over mediated schema  Spawned Tukwila at Washington, and eventually a company as well

12 TSIMMIS and Information Manifold  Focus: Web-based queryable sources  CGI forms, online databases, maybe a few RDBMSs  Each needs to be mapped into the system – not as easy as web search – but the benefits are significant vs. query engines  A few parenthetical notes:  Part of a slew of works on wrappers, source profiling, etc.  The creation of mappings can be partly automated – systems such as LSD, Cupid, Clio, … do this  Today most people look at integrating large enterprises (that’s where the $$$ is!) – Nimble, BEA Liquid Data, Enosys, IBM

13 TSIMMIS  “The Stanford-IBM Manager of Multiple Information Sources” … or, a Yiddish stew  An instance of a “global-as-view” mediation system  One of the first systems to support semi-structured data, which predated XML by several years

14 Semi-structured Data: OEM  Observation: given a particular schema, its attributes may be unavailable from certain sources – inherent irregularity  Proposal: Object Exchange Model, OEM OID: <label, type, value>  … How does it relate to XML?  … What problems does OEM solve, and not solve, in a heterogeneous system?

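15 OEM Example Show this XML fragment in OEM:
  <book> <author>Bernstein</author> <author>Newcomer</author> <title>Principles of TP</title> </book>
  <book> <author>Chamberlin</author> <title>DB2 UDB</title> </book>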
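One way to work the exercise is to walk the XML tree and emit one OEM object per element. The sketch below assumes the fragment uses book, author, and title elements, wraps it in a hypothetical `<bib>` root so it parses as a single document, and uses made-up OIDs of the form `&n`; the (label, type, value) triple per OID follows the OEM proposal.

```python
import itertools
import xml.etree.ElementTree as ET

# Assumed tagging of the fragment; <bib> is a hypothetical wrapper element.
XML = """<bib>
  <book><author>Bernstein</author><author>Newcomer</author><title>Principles of TP</title></book>
  <book><author>Chamberlin</author><title>DB2 UDB</title></book>
</bib>"""


def to_oem(element, oids, objects):
    """Record `element` as an OEM object in `objects` and return its OID."""
    oid = f"&{next(oids)}"
    children = list(element)
    if children:
        # Complex object: the value is the collection of child OIDs.
        value = [to_oem(child, oids, objects) for child in children]
        objects[oid] = (element.tag, "set", value)
    else:
        # Atomic object: the value is the element's text.
        objects[oid] = (element.tag, "string", (element.text or "").strip())
    return oid


objects = {}
root_oid = to_oem(ET.fromstring(XML), itertools.count(1), objects)
for oid in sorted(objects, key=lambda o: int(o[1:])):
    print(oid, objects[oid])
```

The two book objects have different shapes (two authors versus one), and nothing in the encoding objects to that, which is the irregularity OEM is meant to tolerate.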

16 Queries in TSIMMIS  Specified in OQL-style language called Lorel  OQL was an object-oriented query language  Lorel is, in many ways, a predecessor to XQuery  Based on path expressions over OEM structures:
  select book where book.title = “DB2 UDB” and book.author = “Chamberlin”
 This is basically like XQuery, which we’ll use in place of Lorel and the MSL template language. Previous query restated:
  for $b in document(“my-source”)/book where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b
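To make the path-expression semantics concrete, here is a small Python sketch of the same selection. It assumes each book is a map from a label to a list of values, so that a label can repeat or be absent, and it treats a comparison as true when any value reached by the path matches. The names `path` and `select_books` are invented for the example.

```python
# Books as label -> list-of-values maps; labels may repeat or be missing.
books = [
    {"author": ["Bernstein", "Newcomer"], "title": ["Principles of TP"]},
    {"author": ["Chamberlin"], "title": ["DB2 UDB"]},
]


def path(obj, label):
    """Follow a one-step path; a missing label yields no values rather than an error."""
    return obj.get(label, [])


def select_books(books, title, author):
    """Keep books where some title value and some author value match."""
    return [
        b for b in books
        if title in path(b, "title") and author in path(b, "author")
    ]


print(select_books(books, title="DB2 UDB", author="Chamberlin"))
```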

17 Query Answering in TSIMMIS  Basically, it’s view unfolding, i.e., composing a query with a view  The query is the one being asked  The views are the MSL templates for the wrappers  Some of the views may actually require parameters, e.g., an author name, before they’ll return answers  Common for web forms (see Amazon, Google, …)  XQuery functions (XQuery’s version of views) support parameters as well, so we’ll see these in action

18 A Wrapper Definition in MSL  Wrappers have templates and binding patterns ($X) in MSL:
  B :- B: <book {<author $X>}> // $$ = “select * from book where author=“ $X //
 This reformats a SQL query over Book(author, year, title)  In XQuery, this might look like:
  define function GetBook($x AS xsd:string) as book { for $b in sql(“select * from book where author=‘” + $x +”’”) return <book>{$b}<author>{$x}</author></book> }
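The same wrapper idea can be sketched in Python over an in-memory SQLite table standing in for the source. Everything here is illustrative: `make_source` and `get_book` are hypothetical names, the third row is made up, and the query uses a bound parameter instead of string concatenation.

```python
import sqlite3


def make_source():
    """Build a stand-in source with relation book(author, year, title)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE book (author TEXT, year TEXT, title TEXT)")
    conn.executemany(
        "INSERT INTO book VALUES (?, ?, ?)",
        [
            ("Chamberlin", "1992", "DB2 UDB"),
            ("Chamberlin", "1995", "DB2/CS"),
            ("Someone Else", "2000", "Another Book"),  # made-up row
        ],
    )
    return conn


def get_book(conn, author):
    """Wrapper with a binding pattern: `author` must be bound before the call."""
    if author is None:
        raise ValueError("author must be bound before querying this source")
    rows = conn.execute(
        "SELECT author, year, title FROM book WHERE author = ? ORDER BY year",
        (author,),
    )
    # Reformat each SQL row into the book structure the mediator expects.
    return [{"author": a, "year": y, "title": t} for a, y, t in rows]


conn = make_source()
print(get_book(conn, "Chamberlin"))
```

The `ValueError` plays the role of the binding pattern: the source cannot be scanned without an author, just as a web form cannot be submitted empty.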

19 How to Answer the Query  Given our query:
  for $b in document(“my-source”)/book where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b
 We want to find all wrapper definitions that:  Either output enough information that we can evaluate all of our conditions over the output  Or have already tested the conditions for us!

20 Query Composition with Views  We find all views that define book with author and title, and we compose the query with each:
  define function GetBook($x AS xsd:string) as book { for $b in sql(“select * from book where author=‘” + $x +”’”) return <book>{$b}<author>{$x}</author></book> }
  for $b in document(“my-source”)/book where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin” return $b

21 Matching View Output to Our Query’s Conditions  Determine that $b/book/author/text() corresponds to $x by matching the pattern on the function’s output:
  define function GetBook($x AS xsd:string) as book { for $b in sql(“select * from book where author=‘” + $x +”’”) return <book>{$b}<author>{$x}</author></book> }
  let $x := “Chamberlin” for $b in GetBook($x)/book where $b/title/text() = “DB2 UDB” return $b

22 The Final Step: Unfolding
  let $x := “Chamberlin” for $b in { for $b in sql(“select * from book where author=‘” + $x +”’”) return <book>{$b}<author>{$x}</author></book> }/book where $b/title/text() = “DB2 UDB” return $b

23 What Is the Answer? Given schema book(author, year, title) and datalog rules defining an instance: book(“Chamberlin”, “1992”, “DB2 UDB”) book(“Chamberlin”, “1995”, “DB2/CS”)
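To check the answer by hand, the unfolded query can be simulated in a few lines of Python over exactly this instance. The sketch assumes the author condition has been pushed into the wrapper as its binding and the title condition is applied to the wrapper's output; `get_book` and `answer` are hypothetical stand-ins, not TSIMMIS code.

```python
# The instance of book(author, year, title) given on the slide.
BOOK = [
    ("Chamberlin", "1992", "DB2 UDB"),
    ("Chamberlin", "1995", "DB2/CS"),
]


def get_book(author):
    """Stand-in for the wrapper: returns only books by the bound author."""
    return [
        {"author": a, "year": y, "title": t}
        for a, y, t in BOOK
        if a == author
    ]


def answer(title, author):
    """Unfolded query: bind the author in the wrapper, then filter on title."""
    return [b for b in get_book(author) if b["title"] == title]


result = answer(title="DB2 UDB", author="Chamberlin")
print(result)
```

Under these assumptions the wrapper returns both Chamberlin books and the residual title filter keeps only the 1992 one.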

24 TSIMMIS  Early adopter of semistructured data  Can support irregular structure and missing attributes  Can support data from many different sources  Doesn’t fully solve the heterogeneity problem, though!  Simple algorithms for view unfolding  Can easily be composed in a hierarchy of mediators

25 Limitations of TSIMMIS’ Approach  Some data sources may contain data with certain ranges or properties  “Books by Aho”, “Students at UPenn”, …  How do we express these? (Important for performance!)  Mediated schema is basically the union of the various MSL templates – as they change, so may the mediated schema

26 The Information Manifold Defines the mediated schema independently of the sources!  “Local-as-view” instead of “global-as-view”  Guarantees soundness and completeness of answers  Allows us to specify information about data sources  Focuses on relations (with OO extensions), datalog

27 Observations of Levy et al.  When you integrate something, you have some conceptual model of the integrated domain  Define that as a basic frame of reference  May have overlapping/incomplete sources  Define each source as a subset of a query over the mediated schema  We can use selection or join predicates to specify that a source contains a range of values: ComputerBooks(…) ⊆ Books(Title, …, Subj), Subj = “Computers”
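A small Python sketch may help show what such a source description buys us. It simplifies the mediated relation to Books(Title, Subj), invents a tiny "real world" instance purely for illustration (the mediated schema itself holds no data), and uses hypothetical helper names throughout.

```python
# Hypothetical "true" contents of the mediated relation Books(Title, Subj).
world_books = {
    ("DB2 UDB", "Computers"),
    ("Principles of TP", "Computers"),
    ("Cooking Basics", "Food"),
}

# What the ComputerBooks source actually holds: possibly incomplete.
computer_books_source = {("DB2 UDB", "Computers")}

SOURCE_SUBJECT = "Computers"  # the selection predicate in the source description


def described_view(books):
    """The query over the mediated schema: Books(Title, Subj), Subj = 'Computers'."""
    return {row for row in books if row[1] == SOURCE_SUBJECT}


def is_consistent(source_rows, books):
    """Check the description: the source is a subset of the view, not equal to it."""
    return source_rows <= described_view(books)


def may_be_relevant(query_subject):
    """Skip the source when the query's subject cannot overlap its description."""
    return query_subject == SOURCE_SUBJECT


print(is_consistent(computer_books_source, world_books))
print(may_be_relevant("Food"))
```

The subset test, rather than equality, reflects that the source may be missing computer books, while `may_be_relevant` shows the performance benefit: a query about food never has to contact this source.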

28 The Local-as-View Model  If we look at the Information Manifold model:  “Local” sources are views over the mediated schema  Sources have the data – mediated schema is virtual  Sources may not have all the data from the domain – “open-world assumption”  The system must use the sources (views) to answer queries over the mediated schema  Thursday we’ll see what “answering queries using views” is all about…