Information Integration

Presentation transcript:

1 Information Integration

2 Information Resides on Heterogeneous Information Sources. The sources have different interfaces, different data representations, and redundant and conflicting information. [Figure: example sources: WWW, Excel, personal database, flat file.]

3 Modes of Information Integration. Federated databases: the sources are independent, but one source can call on others to supply information. Data warehouse: copies of data from several sources are stored in a single database, called a data warehouse. The data is processed in some way before it is stored at the warehouse; e.g., data may be filtered, and relations may be joined or aggregated. As the data is copied from the sources, it may need to be transformed so that all data conforms to the schema of the data warehouse.

4 Modes of Information Integration. Mediation: a mediator is a software component that supports a virtual database, which the user may query as if it were materialized (physically constructed, like a warehouse). The mediator stores no data of its own. Rather, it translates the user's query into one or more queries to its sources. The mediator then synthesizes the answer to the user's query from the responses of those sources and returns the answer to the user.

5 Problems of Information Integration. Example: The AAAI Automobile Co. has 1000 dealers, each of which maintains a database of its cars in stock. AAAI wants to create an integrated database containing the information of all 1000 sources. The integrated database will help dealers locate a particular model if they don't have one in stock. It can also be used by corporate analysts to predict the market and adjust production to provide the models most likely to sell.

6 Problems of Information Integration. The 1000 dealers do not all use the same database schema. One dealer may use Cars(serialNo, model, color, autoTrans, cdPlayer, ...), while another uses Autos(serialNo, model, color) and Options(serialNo, option).

7 Problems of Information Integration. Schema differences. Different equivalent names. Data type differences: numbers may be represented by character strings of varying length at one source and of fixed length at another. Value differences: the same concept may be represented by different constants at different sources (BLACK, BL, 100, etc.). Semantic differences: terms can be given different interpretations at different sources (Cars may or may not include trucks). Missing values: a source may not record information of a type that all of the other sources provide.

8 Goal: System Providing Integrated View of Heterogeneous Data. The integration system collects and combines information and provides an integrated view and a uniform user interface. [Figure: an integration system on top of the sources WWW, personal database, Excel, flat file.]

9 The Data Warehousing Approach to Integration. [Figure labels: client, mediator, stored integrated view, wrapper, Excel, flat file.]

10 The Data Warehousing Approach to Integration. Data from several sources is extracted and combined into a global schema. The data is stored at the warehouse, which looks like an ordinary database. There are three approaches to maintaining the data in the data warehouse: (1) off-line reconstruction of the whole data warehouse; (2) periodic updates of the data warehouse based on the changes made to the original data sources; (3) immediate updates of the data warehouse.

11 The Data Warehousing Approach to Integration. Example: Suppose that there are two dealers in the system. The first uses the schema Cars(serialNo, model, color, autoTrans, cdPlayer, ...); the second uses Autos(serialNo, model, color) and Options(serialNo, option). Assume a data warehouse with the schema AutoWhse(serialNo, model, color, autoTrans, dealer).

12 The Data Warehousing Approach to Integration. The software that extracts data from the dealers' databases and populates the global schema can be written as SQL queries. The query for the first dealer: insert into AutoWhse(serialNo, model, color, autoTrans, dealer) select serialNo, model, color, autoTrans, 'dealer1' from Cars; The code for the second dealer is more complex, since we have to decide whether or not a given car has an automatic transmission.

13 The Data Warehousing Approach to Integration. Cars with an automatic transmission: insert into AutoWhse(serialNo, model, color, autoTrans, dealer) select Autos.serialNo, model, color, 'yes', 'dealer2' from Autos, Options where Autos.serialNo = Options.serialNo and option = 'autoTrans'; Cars without one: insert into AutoWhse(serialNo, model, color, autoTrans, dealer) select serialNo, model, color, 'no', 'dealer2' from Autos where not exists (select * from Options where Options.serialNo = Autos.serialNo and option = 'autoTrans');
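
The three warehouse-loading statements from slides 12 and 13 can be tried end to end with a small sketch. It assumes Python's standard sqlite3 module as a stand-in for both dealers and the warehouse; the sample rows and the function name build_warehouse are hypothetical and not part of the slides. The column name option is quoted only to stay clear of reserved words.

```python
import sqlite3

def build_warehouse():
    """Load AutoWhse from both dealer schemas and return its rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE Cars(serialNo, model, color, autoTrans, cdPlayer);
    CREATE TABLE Autos(serialNo, model, color);
    CREATE TABLE Options(serialNo, "option");
    CREATE TABLE AutoWhse(serialNo, model, color, autoTrans, dealer);

    -- Hypothetical sample rows, not taken from the slides.
    INSERT INTO Cars VALUES (1, 'Gobi', 'red', 'yes', 'no');
    INSERT INTO Autos VALUES (2, 'Gobi', 'blue'), (3, 'Taiga', 'red');
    INSERT INTO Options VALUES (2, 'autoTrans'), (2, 'cdPlayer');

    -- Dealer 1: the schema already has an autoTrans column.
    INSERT INTO AutoWhse(serialNo, model, color, autoTrans, dealer)
    SELECT serialNo, model, color, autoTrans, 'dealer1' FROM Cars;

    -- Dealer 2, cars that list the autoTrans option.
    INSERT INTO AutoWhse(serialNo, model, color, autoTrans, dealer)
    SELECT Autos.serialNo, model, color, 'yes', 'dealer2'
    FROM Autos, Options
    WHERE Autos.serialNo = Options.serialNo AND "option" = 'autoTrans';

    -- Dealer 2, cars that do not list the autoTrans option.
    INSERT INTO AutoWhse(serialNo, model, color, autoTrans, dealer)
    SELECT serialNo, model, color, 'no', 'dealer2'
    FROM Autos
    WHERE NOT EXISTS (
        SELECT * FROM Options
        WHERE Options.serialNo = Autos.serialNo AND "option" = 'autoTrans');
    """)
    return conn.execute("SELECT * FROM AutoWhse ORDER BY serialNo").fetchall()

warehouse_rows = build_warehouse()
for row in warehouse_rows:
    print(row)
```

Each dealer-2 car lands in the warehouse exactly once, because the second and third statements partition Autos by whether an autoTrans option row exists.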

14 The Wrapper and Mediator Architecture. [Figure labels: client, mediator, common data model, wrapper, Excel, flat file; example data: business reports, portfolios for each company, stock market prices.]

15 The Wrapper and Mediator Architecture. A mediator supports a virtual view, or collection of views, that integrates several sources in much the same way that the materialized relation(s) in a data warehouse integrate sources. The mediator doesn't store any data. Example: Let us consider the same scenario. The mediator integrates the same two data sources into a view that is a single relation with the schema AutoMed(serialNo, model, color, autoTrans, dealer).

16 The Wrapper and Mediator Architecture. Assume the user asks the mediator about the red cars: select serialNo, model from AutoMed where color = 'red'; The mediator forwards a corresponding query to each of the two wrappers: (1) select serialNo, model from Cars where color = 'red'; (2) select serialNo, model from Autos where color = 'red'; The mediator can take the union of the answers and return the result to the user.
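
A minimal sketch of this query flow, assuming two in-memory SQLite databases stand in for the two dealers. The function names wrapper1, wrapper2, and mediator_cars_by_color and the sample rows are hypothetical; the mediator itself holds no rows and only unions what the wrappers return.

```python
import sqlite3

def make_source(script):
    """Create an in-memory database that plays the role of one dealer."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(script)
    return conn

# Hypothetical sample data for the two dealer schemas.
dealer1 = make_source("""
CREATE TABLE Cars(serialNo, model, color, autoTrans, cdPlayer);
INSERT INTO Cars VALUES (1, 'Gobi', 'red', 'yes', 'no'),
                        (2, 'Taiga', 'blue', 'no', 'no');
""")
dealer2 = make_source("""
CREATE TABLE Autos(serialNo, model, color);
INSERT INTO Autos VALUES (7, 'Gobi', 'red'), (8, 'Gobi', 'green');
""")

def wrapper1(color):
    # Query (1): the mediator query translated to dealer 1's schema.
    return dealer1.execute(
        "SELECT serialNo, model FROM Cars WHERE color = ?", (color,)).fetchall()

def wrapper2(color):
    # Query (2): the mediator query translated to dealer 2's schema.
    return dealer2.execute(
        "SELECT serialNo, model FROM Autos WHERE color = ?", (color,)).fetchall()

def mediator_cars_by_color(color):
    # The mediator stores nothing; it unions the wrappers' answers.
    return sorted(set(wrapper1(color)) | set(wrapper2(color)))

print(mediator_cars_by_color("red"))
```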

17 The Lazy Integration Approach. The mediator performs query decomposition, translation, and result fusion. [Figure labels: client, mediator, wrapper, Excel, flat file; example data: IBM portfolio, IBM price, IBM related reports, IBM related reports (in common model).]

18 Wrappers in Mediator-Based Systems. In a data warehouse system, the source extractors consist of: (1) one or more built-in queries that are executed at the source to produce data for the data warehouse; (2) communication mechanisms, so that the wrapper can pass ad-hoc queries to the source, receive responses from the source, and pass information to the warehouse. Mediator systems require more complex wrappers: the wrapper must be able to accept a variety of queries from the mediator and translate any of them to the terms of the source.

19 Wrappers in Mediator-Based Systems. A systematic way to design a wrapper that connects a mediator to a source is to classify the possible queries that the mediator can ask into templates, which are queries with parameters that represent constants. The mediator can provide the constants, and the wrapper executes the query with the given constants. The notation T => S means that the template T is turned into the source query S. Example: the source of dealer 1 is Cars(serialNo, model, color, autoTrans, cdPlayer, ...).

20 Wrappers in Mediator-Based Systems. Assume we use the mediator with schema AutoMed(serialNo, model, color, autoTrans, dealer). How could the mediator ask the wrapper for cars of a given color? The template: select * from AutoMed where color = '$c'; => select serialNo, model, color, autoTrans, 'dealer1' from Cars where color = '$c';

21 Wrappers in Mediator-Based Systems. The wrapper could have another template that specifies the parameter $m representing a model. In general, if each of N attributes may or may not be specified, there would be 2^N templates, so the number of templates could grow unreasonably large.
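
The 2^N count can be checked by enumerating one template per subset of attributes. This is only a counting sketch: the function name enumerate_templates and the '$attribute' parameter naming are assumptions made for illustration.

```python
from itertools import combinations

def enumerate_templates(attributes, table="AutoMed"):
    """Return one equality-selection template per subset of attributes."""
    templates = []
    for k in range(len(attributes) + 1):
        for subset in combinations(attributes, k):
            # Each chosen attribute gets a parameter named after it, e.g. '$color'.
            where = " AND ".join(f"{a} = '${a}'" for a in subset)
            query = f"SELECT * FROM {table}"
            if where:
                query += f" WHERE {where}"
            templates.append(query)
    return templates

templates = enumerate_templates(["model", "color", "autoTrans"])
print(len(templates))  # 2**3 subsets of three attributes
for t in templates:
    print(t)
```

With the five attributes of AutoMed the same enumeration already yields 32 templates, which is the growth the slide warns about.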

22 Wrapper Generators. The templates defining a wrapper must be turned into code for the wrapper itself; the software that creates the wrapper is called a wrapper generator. The wrapper generator creates a table that holds the various query patterns contained in the templates and the source query associated with each. A driver is used in each wrapper. The task of the driver is to: (1) accept a query from the mediator; (2) search the table for a template that matches the query; (3) send the associated source query to the source using a communication mechanism; (4) process the response, if necessary, and return it to the mediator.
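
The driver's four steps can be sketched as follows, assuming the template table is a list of (pattern, source query) pairs and that an in-memory SQLite database plays the source. The regular-expression patterns, the names TEMPLATE_TABLE and driver, and the sample rows are all hypothetical; a real wrapper generator would produce the table from template specifications.

```python
import re
import sqlite3

# Hypothetical table produced by a wrapper generator:
# mediator query pattern -> source query with named parameters.
TEMPLATE_TABLE = [
    (re.compile(r"^select \* from AutoMed where color\s*=\s*'(?P<c>[^']*)'\s*;?$",
                re.IGNORECASE),
     "SELECT serialNo, model, color, autoTrans, 'dealer1' FROM Cars WHERE color = :c"),
    (re.compile(r"^select \* from AutoMed where model\s*=\s*'(?P<m>[^']*)'\s*;?$",
                re.IGNORECASE),
     "SELECT serialNo, model, color, autoTrans, 'dealer1' FROM Cars WHERE model = :m"),
]

def driver(source, mediator_query):
    """Accept a mediator query, find a matching template, run its source query."""
    for pattern, source_query in TEMPLATE_TABLE:
        match = pattern.match(mediator_query.strip())
        if match:
            # The captured constants become the parameters of the source query.
            return source.execute(source_query, match.groupdict()).fetchall()
    raise LookupError("no template matches the query")

# Hypothetical source for dealer 1.
source = sqlite3.connect(":memory:")
source.executescript("""
CREATE TABLE Cars(serialNo, model, color, autoTrans, cdPlayer);
INSERT INTO Cars VALUES (1, 'Gobi', 'red', 'yes', 'no'),
                        (2, 'Taiga', 'blue', 'no', 'yes');
""")

print(driver(source, "select * from AutoMed where color = 'red';"))
```

A query that matches no pattern raises LookupError here; the filter approach on the later slides is one way to serve such queries anyway.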

23 Wrappers & Mediators from High-Level Specifications. [Figure labels: client, mediator, wrapper, source, mediator specification, mediator specification interpreter, wrapper specification, wrapper generator.]

24 Filters. A more complex template: select * from AutoMed where color = '$c' and model = '$m'; => select serialNo, model, color, autoTrans, 'dealer1' from Cars where color = '$c' and model = '$m'; Wrapper filter approach: if the wrapper has a template that returns a superset of what the query wants, then it is possible to filter the result at the wrapper. Deciding whether a mediator query asks for a subset of what the pattern of some wrapper template returns is a hard problem.

25 Filters. Example: Given the template select * from AutoMed where color = '$c'; the mediator needs to find blue cars of the Gobi model: select * from AutoMed where color = 'blue' and model = 'Gobi'; The wrapper can: (1) use the template with $c = blue to find all blue cars; (2) store the result in the temporary relation Temp; (3) select the Gobis from Temp and return the result.
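
The three filter steps can be sketched like this, assuming the only available template selects by color and that an in-memory SQLite database plays dealer 1. The names COLOR_TEMPLATE and cars_by_color_and_model and the sample rows are hypothetical, and the temporary relation Temp is modeled as a plain Python list.

```python
import sqlite3

# Hypothetical source for dealer 1.
source = sqlite3.connect(":memory:")
source.executescript("""
CREATE TABLE Cars(serialNo, model, color, autoTrans, cdPlayer);
INSERT INTO Cars VALUES
  (1, 'Gobi', 'blue', 'yes', 'no'),
  (2, 'Taiga', 'blue', 'no', 'no'),
  (3, 'Gobi', 'red', 'no', 'yes');
""")

# The single template the wrapper has: selection by color only.
COLOR_TEMPLATE = ("SELECT serialNo, model, color, autoTrans, 'dealer1' "
                  "FROM Cars WHERE color = ?")

def cars_by_color_and_model(color, model):
    # Steps 1 and 2: run the color template and keep the superset as Temp.
    temp = source.execute(COLOR_TEMPLATE, (color,)).fetchall()
    # Step 3: filter Temp at the wrapper; model is the second column.
    return [row for row in temp if row[1] == model]

print(cars_by_color_and_model("blue", "Gobi"))
```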

26 Other Wrapper Operations. It is possible to transform the data at the wrapper in different ways. Suppose the mediator is asked to find dealers and models such that the dealer has two red cars of the same model, one with and one without automatic transmission, and suppose we have only the one template as before. The query: select A1.model, A1.dealer from AutoMed A1, AutoMed A2 where A1.model = A2.model and A1.color = 'red' and A2.color = 'red' and A1.autoTrans = 'no' and A2.autoTrans = 'yes';

27 Other Wrapper Operations. It is possible to answer the query by first obtaining from dealer 1's source a relation with all the red cars (using the original template), the RedAutos relation, and then running: select distinct A1.model, A1.dealer from RedAutos A1, RedAutos A2 where A1.model = A2.model and A1.autoTrans = 'no' and A2.autoTrans = 'yes';
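
A sketch of this two-stage evaluation, assuming in-memory SQLite databases for both dealer 1's source and the wrapper's own workspace. The function name models_with_both_transmissions and the sample rows are hypothetical; the self-join on RedAutos is the query from the slide.

```python
import sqlite3

# Hypothetical source for dealer 1.
source = sqlite3.connect(":memory:")
source.executescript("""
CREATE TABLE Cars(serialNo, model, color, autoTrans, cdPlayer);
INSERT INTO Cars VALUES
  (1, 'Gobi', 'red', 'yes', 'no'),
  (2, 'Gobi', 'red', 'no', 'no'),
  (3, 'Taiga', 'red', 'yes', 'no'),
  (4, 'Gobi', 'blue', 'no', 'no');
""")

COLOR_TEMPLATE = ("SELECT serialNo, model, color, autoTrans, 'dealer1' "
                  "FROM Cars WHERE color = ?")

def models_with_both_transmissions():
    # Stage 1: the original color template fetches all red cars.
    red_rows = source.execute(COLOR_TEMPLATE, ("red",)).fetchall()

    # Stage 2: the wrapper materializes RedAutos locally and self-joins it.
    workspace = sqlite3.connect(":memory:")
    workspace.execute(
        "CREATE TABLE RedAutos(serialNo, model, color, autoTrans, dealer)")
    workspace.executemany("INSERT INTO RedAutos VALUES (?, ?, ?, ?, ?)", red_rows)
    return workspace.execute("""
        SELECT DISTINCT A1.model, A1.dealer
        FROM RedAutos A1, RedAutos A2
        WHERE A1.model = A2.model
          AND A1.autoTrans = 'no' AND A2.autoTrans = 'yes'
    """).fetchall()

print(models_with_both_transmissions())
```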

28 Challenge: Sources Without a Well-Structured Schema. Semistructured: irregular, deeply nested, cross-referenced. Incomplete schema knowledge: autonomous, dynamic. Examples: HTML pages, SGML documents, genome data, chemical structures, bibliographic information, results of the integration process.

29 Challenge: Different and Limited Source Capabilities. [Figure labels: client request "retrieve IBM data"; mediator (U = A + B); wrapper (A); wrapper (B).]

30 Mediator has to Adapt to Query Capabilities of Sources. Source (A) does not allow selection. [Figure labels: client; mediator (U = A + B); wrapper (A); wrapper (B); requests "retrieve everything" and "retrieve IBM data".]

31 Part B. Semistructured Data Representation; Mediator Generation; Wrapper Generation; Capabilities-Based Rewriting.

32 Representation of Semistructured Information using OEM. [Figure labels: object-id (semantic object-id or structural object-id), label, value (atomic value or set value).]

33 Graph Representation of OEM Data. [Figure: a faculty object with subobjects first_name "John", last_name "Doe", rank "professor".]

34 OEM Structures Represent Arbitrary Labeled Graphs. [Figure: a faculty object (first_name "John", last_name "Doe", rank "professor"); a second faculty object (name "Mary Smith", project "Air DB"); a paper object with two author subobjects (name "John Doe", name "Mary Smith") and title "Thin Air DB".]
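
As a rough illustration of the OEM idea, each object can be modeled as an object-id, a label, and a value that is either atomic or a set of subobjects. The class OEMObject, its children helper, and the object-ids below are assumptions made for this sketch and are not the TSIMMIS implementation; the labels and values come from the two faculty objects on the slide.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class OEMObject:
    oid: str                                    # object-id (hypothetical ids)
    label: str                                  # e.g. "faculty", "rank"
    value: Union[str, int, float, List["OEMObject"]]  # atomic or set value

    def children(self, label):
        """Return the subobjects with the given label (empty for atomic values)."""
        if isinstance(self.value, list):
            return [child for child in self.value if child.label == label]
        return []

# The two faculty objects have different structure, with no schema to violate.
john = OEMObject("&f1", "faculty", [
    OEMObject("&fn1", "first_name", "John"),
    OEMObject("&ln1", "last_name", "Doe"),
    OEMObject("&r1", "rank", "professor"),
])
mary = OEMObject("&f2", "faculty", [
    OEMObject("&n2", "name", "Mary Smith"),
    OEMObject("&p2", "project", "Air DB"),
])

print([c.value for c in john.children("rank")])
print([c.value for c in mary.children("rank")])
```

Asking mary for a rank simply returns nothing, which is how irregular data is tolerated without a fixed schema.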

35 Overview. Semistructured Data Representation. Mediator Generation: example of mediator specification; language expressiveness; implementation and performance. Wrapper Generation. Capabilities-Based Rewriting.

36 Merge Information Relating to a Faculty. [Figure: source s1 holds a faculty object (name "John Doe", rank "professor", papers ...); source s2 holds a person object (name "John Doe", birthday "April 1"); the merged view holds a faculty object (name "John Doe", rank "professor", birthday "April 1", papers ...).]

37 Mediator Specification Example. [Figure: the s1, s2, and merged view objects of slide 36, together with two MSL rules of the form head :- body; the rule text did not survive extraction.]

38 Mediator Specification Example: Semantics of Rule Bodies. [Figure: the same objects and rules; the rule text did not survive extraction.]

39 Mediator Specification Example: Semantics of Rule Heads. [Figure: the same objects and rules, with the view object identified by "John Doe"; the rule text did not survive extraction.]

40 Incrementally Add to Semantically Identified Object. [Figure: the same objects and rules, with the view object identified by "John Doe"; the rule text did not survive extraction.]

41 Irregularities & Incomplete Schema Knowledge. [Figure: s1 holds two faculty objects (name "John Doe", rank "professor", papers; name "Mary Smith", project "Air DB"); s2 holds a person object (name "John Doe", birthday "April 1"); the view holds the objects identified by "John Doe" (name, rank "professor", birthday "April 1", papers) and "Mary Smith" (name, project "Air DB"); the rule text did not survive extraction.]

42 Second Rule Attaches More Subobjects to View Objects. [Figure: the s1, s2, and view objects of slide 36, with the view object identified by "John Doe"; the rule text did not survive extraction.]

43 Language Expressiveness. Information fusion problems solved by MSL: irregularities; incomplete knowledge of source structure; transformation of cross-referenced structures; inconsistent and redundant data; use of arbitrary matching criteria. Theoretical analysis of expressiveness: consider the relational representation of OEM graphs; then MSL is equivalent to "SQL + special form of transitive closure".

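44 Inconsistent and Redundant Information. [Figure: s1 holds a faculty object (name "John Doe", rank "associate"); s2 holds a person object (name "John Doe", rank "assistant"); the view object identified by "John Doe" is shown with name "John Doe", rank "associate", and rank "assistant"; the MSL rules, one of which uses AND NOT, did not survive extraction.]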

45 Overview. Semistructured Data Representation. Mediator Generation: example of mediator specification; language expressiveness; implementation and performance. Wrapper Generation. Capabilities-Based Rewriting.

46 Mediator Specification Interpreter Architecture. [Figure labels: inputs: query, mediator specification; components: query rewriter, cost-based optimizer, datamerge engine; intermediate results: logical datamerge program, plan; outputs: queries to wrappers, results, result.]

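47 Query Rewriting When Known Origins of Information. [MSL rules did not survive extraction; the only surviving fragment is the condition "AND X>65000".]

48 Query Rewriter Pushes Conditions to Sources. [Figure: the rules of slide 47 and the resulting logical datamerge program; only the condition "AND X>65000" survived extraction.]

49 Passing Bindings & Local Join Plans. [Figure: two plans over sources s1 and s2, one labeled "Passing Bindings" and one labeled "Local Join"; the rule text, apart from the condition "AND X>65000", did not survive extraction.]

50 Query Decomposition When Unknown Origins of Information. [MSL rules did not survive extraction.]

51 Plan Considers All Possible Sources of birthday. [Figure: a plan over sources s1 and s2 with the labels name and birthday; the rule text did not survive extraction.]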

52 Overview. Semistructured-Data Representation; Mediator Generation; Wrapper Generation; Capabilities-Based Rewriting.

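53 Query Translation in Wrappers. [Figure: the wrapper contains a query translator and a result translator; an incoming query such as SELECT * FROM person WHERE name="Smith" is translated into source commands such as find -all and find -n Smith.]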

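54 Rapid Query Translation Using Templates and Actions. Each template is paired with an action that emits the source command: SELECT * FROM person {emit "find -all"}; SELECT * FROM person WHERE name=$N {emit "find -n $N"}. [Figure: a template interpreter and a result translator sit between the query SELECT * FROM person WHERE name="Smith" and the source, which receives find -all or find -n Smith.]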
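
The template-and-action pairing can be sketched as a list of patterns, each with a function that emits the source command. Regular expressions stand in for the template language here, and the names TEMPLATES and translate are hypothetical; the two templates and the find commands are the ones on the slide.

```python
import re

# Each entry pairs a query template with an action that emits a source command.
TEMPLATES = [
    (re.compile(r'^SELECT \* FROM person$'),
     lambda m: "find -all"),
    (re.compile(r'^SELECT \* FROM person WHERE name\s*=\s*"(?P<N>[^"]+)"$'),
     lambda m: f"find -n {m.group('N')}"),
]

def translate(query):
    """Return the source command for a supported query, or None otherwise."""
    for pattern, action in TEMPLATES:
        match = pattern.match(query.strip())
        if match:
            return action(match)
    return None

print(translate('SELECT * FROM person WHERE name="Smith"'))
print(translate('SELECT * FROM person'))
```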

55 Description of Infinite Sets of Supported Queries. The description uses recursive nonterminals. Example: the job description contains word w1 and word w2 and so on. SELECT subset(person) FROM person WHERE \CJob; with the rules \CJob : job LIKE $W AND \CJob and \CJob : TRUE.
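
A recursive nonterminal of this kind can be checked with a recursive function: one branch consumes a "job LIKE ... AND" prefix and recurses, the other accepts TRUE. The function name match_cjob, the single-quoted string syntax for $W, and the exact spacing are assumptions of this sketch.

```python
import re

# One step of the recursive rule: job LIKE $W AND <rest>.
_LIKE_STEP = re.compile(r"^job LIKE '([^']*)' AND (.*)$")

def match_cjob(condition):
    """Return the list of LIKE patterns if the condition derives from CJob, else None."""
    condition = condition.strip()
    if condition == "TRUE":          # base rule: CJob : TRUE
        return []
    step = _LIKE_STEP.match(condition)
    if step is None:
        return None
    rest = match_cjob(step.group(2))  # recursive rule: ... AND CJob
    if rest is None:
        return None
    return [step.group(1)] + rest

print(match_cjob("job LIKE '%db%' AND job LIKE '%xml%' AND TRUE"))
print(match_cjob("salary > 65000"))
```

Two rules describe conditions with any number of words, which is why the set of supported queries is infinite even though the description is finite.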

56 Overview. Semistructured-Data Representation; Mediator Generation; Wrapper Generation; Capabilities-Based Rewriting.

57 Capabilities-Based Rewriter in Mediator Architecture. [Figure labels: inputs: query, mediator specification, wrapper supported queries description; components: query rewriter, capabilities-based rewriter, cost-based optimizer, datamerge engine; intermediate results: logical datamerge program, supported plans, optimal plan.]

58 Capabilities-Based Rewriter Finds Supported Plans. [Figure: query SELECT * FROM A WHERE salary>65000; supported query: SELECT * FROM A.]

59 Capabilities-Based Rewriter Finds Most-Selective Supported Plans. [Figure: query SELECT * FROM B WHERE salary>65000; supported query: SELECT * FROM B WHERE salary>65000.]
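
One way to read slides 58 and 59 together: when a source cannot evaluate the selection, the mediator retrieves everything and filters locally; when it can, the selection is pushed to the source. The sketch below assumes exactly that reading. The SOURCES dictionary, the sample rows, and the function names are hypothetical, and the sources are simulated in memory rather than queried over a wrapper.

```python
# Hypothetical capability descriptions and data for two sources.
SOURCES = {
    "A": {"supports_selection": False,
          "rows": [{"name": "Ann", "salary": 70000},
                   {"name": "Bob", "salary": 50000}]},
    "B": {"supports_selection": True,
          "rows": [{"name": "Cy", "salary": 90000},
                   {"name": "Di", "salary": 40000}]},
}

def source_execute(name, min_salary=None):
    """Simulate a source: it rejects selections it does not support."""
    src = SOURCES[name]
    if min_salary is None:
        return list(src["rows"])
    if not src["supports_selection"]:
        raise ValueError(f"source {name} does not allow selection")
    return [r for r in src["rows"] if r["salary"] > min_salary]

def mediator_query(min_salary):
    """Answer salary > min_salary over all sources with supported plans only."""
    result = []
    for name, src in SOURCES.items():
        if src["supports_selection"]:
            # Most selective supported plan: push the condition to the source.
            result += source_execute(name, min_salary)
        else:
            # Fallback plan: retrieve everything, then filter at the mediator.
            result += [r for r in source_execute(name) if r["salary"] > min_salary]
    return result

print([r["name"] for r in mediator_query(65000)])
```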

60 Capabilities-Based Rewriter Architecture. [Figure labels: inputs: query, query capabilities description; stages: component subquery discovery, plan construction, plan refinement; results: component subqueries, plans (not fully optimized), algebraically optimal plans.]

61 What TSIMMIS Achieved. A system for the integration of heterogeneous sources. Challenges and solutions: (1) semistructured data and incomplete schema knowledge: an appropriate specification language and query processing algorithms; (2) limited and different query capabilities: a query translation algorithm and a capabilities-based query rewriting algorithm.

62 Overview. TSIMMIS' goals, technical challenges, and solutions. Insufficiencies of the TSIMMIS framework. Going forward.

63 Insufficiencies of the TSIMMIS framework. OEM was really unstructured data: some loose and partial schematic info may pay off tremendously. The user/mediator/source interaction was too "databasy".

64 Overview. TSIMMIS' goals, technical challenges, and solutions. Insufficiencies of the TSIMMIS framework. Going forward.

65 Web emerges as a Distributed DB and XML as its Data Model. Sources export XML view documents and also export: 1. schemas & metadata (XML-Data, RDF, ...); 2. a description of supported queries. [Figure labels: data source, native XML database, legacy source, wrapper, XML view document(s), XMAS query language.]

66 Definition of Integrated Views. [Figure labels: three data sources, each with XML view document(s); mediator; integrated XML view document(s); view definition in XMAS.]

67 Non-Materialized Views in the MIX mediator system. [Figure labels: Blended Browsing & Querying (BBQ) GUI, application, DOM for virtual XML documents, MIX mediator, query processor, DTD inference, view definition in XMAS, integrated view DTD, XML source, source DTD, XMAS query, XML document.]

68 [Figure labels: Blended Browsing & Querying (BBQ) GUI, application, DOM (VXD) client API, MIX mediator, XMAS mediator view definition, view DTD, DTD inference, resolution, simplification, translation to algebra, optimization, execution, unfolded query, XMAS query, XML document fragments, XML source 1, XML source 2, DTD, RDB, RDB2XML wrapper.]