Data Integration Rachel Pottinger and Liang Sun CSE 590ES January 24, 2000.

Presentation transcript:

Data Integration Rachel Pottinger and Liang Sun CSE 590ES January 24, 2000

What is Data Integration?
Providing uniform (sources transparent to user) access to (query, and eventually updates to) multiple autonomous (can’t affect behavior of sources), heterogeneous (different models and schemas) data sources.
Sounds like the devices in Portolano!

Outline Architecture of data integration system Source description & query reformulation Query optimization

Motivation Enterprise data integration; web-site construction. World-wide web: –comparison shopping (Netbot, Junglee) –portals integrating data from multiple sources –XML integration Science & culture –Medical genetics: integrating genomic data –Astrophysics: monitoring events in the sky –Environment: Puget Sound Regional Synthesis Model –Culture: uniform access to all the cultural databases produced by countries in Europe.

(some) Research Prototypes DISCO (INRIA) Garlic (IBM) HERMES (U. of Maryland) InfoMaster (Stanford) Information Manifold (AT&T) IRO-DB (Versailles) SIMS, ARIADNE (USC/ISI) The Internet Softbot/Occam/Razor/Tukwila (us) TSIMMIS (Stanford) XMAS (UCSD) WHIRL (AT&T)

Principal Dimensions of Data Integration Virtual vs. materialized architecture Mediated schema? Pros: –Can ask questions over different schemas Cons: –Requires query reformulation

Mediated schema example
Real database schemas:
imdb(title, actor, director, genre, country)
seattlesidewalk(title, theatre, time, price)
showtimes(city, title, theatre, time)
mrshowbiz(title, year, review)
siskelebert(title, review)
Mediated schemas:
movieInfo(ID, title, genre, country, year, director)
movieShowtime(ID, city, theatre, time)
movieActor(ID, actor)
movieReview(ID, review)
Query:
query(M, theatre, time) :- movieActor(M, “tom hanks”), movieShowtime(M, “seattle”, theatre, time).

Materialization Architecture [Diagram labels: Data Source (three), Wrapper, Data Extraction, Data Warehouse, Application]

Tukwila Architecture [Diagram labels: Data Source, Wrapper, Query Execution Engine, Query Optimization, Query Reformulation, Global Data Model, Local Data Model, catalog]

Translating between data models
Where is the wrapper? How intelligent is the wrapper?
[Diagram labels: Exported schema; Query in exported schema; Data in global data model; Native schema; Query in native schema; Data in local data model; Global DM; Local DM]

Describing Information Sources User queries refer to the mediated schema Sources store data in the local schemas Content descriptions provide the mappings between the mediated and local schemas Content Descriptions Mediated Schema Relations Information Source Relations

Data Source Catalogs Catalogs contain descriptions of: Logical source contents Source capabilities Source completeness Mirror sources Physical properties of the source and network Source reliability

Desiderata from source descriptions Distinguish between sources with closely related data: so we can prune access to irrelevant sources Enable easy addition of new information sources: because sources are dynamically being added and removed Be able to find sources relevant to a query: reformulate queries such that we obtain guarantees on which sources we access

Query Reformulation Problem
Problem: reformulate a user query that refers to the mediated schema into a query over the local schemas
Given a query Q in terms of the mediated-schema relations, and descriptions of the data sources
Find a query Q’ that uses only the data source relations, such that Q’ ⊆ Q (i.e., answers are correct) and Q’ provides all possible answers to Q using the sources

Approaches to Specification of Source Descriptions Mediated schema relations defined as views over the source relations Source relations defined as views over mediated-schema relations Sources described as concepts in a description logic

The Global As View Approach
Mediated-schema relations described in terms of source relations
Movies and their years can be obtained from either DB1 or DB2:
MovieYear(title,year) :- DB1(title,director,year)
MovieYear(title,year) :- DB2(title,director,year)
Movie reviews can be obtained by joining DB1 and DB3:
MovieRev(title,director,review) :- DB1(title,director,year) & DB3(title,review)

Query Reformulation in GAV
Query reformulation is done by rule unfolding
Query: find reviews for 1997 movies:
q(title,review) :- MovieYear(title,1997) & MovieRev(title,director,review)
Reformulated query on the sources:
q(title,review) :- DB1(title,director,year) & DB3(title,review)
q(title,review) :- DB1(title,director,year) & DB2(title,director,year) & DB3(title,review)
Containment check shows second rule is redundant
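Rule unfolding can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of any particular system: the rule table, the atom encoding (relation name plus a tuple of terms), and the helper names `expand` and `unfold` are all assumptions made for this example. Unlike the abbreviated rules on the slide, the sketch keeps the constant 1997 and gives body-only variables fresh names, so its output is more verbose.

```python
from itertools import count, product

# GAV rules from the slide: mediated relation -> list of (head variables, body atoms).
# An atom is (relation_name, tuple_of_terms); strings are variables. (Assumed encoding.)
GAV_RULES = {
    "MovieYear": [
        (("title", "year"), [("DB1", ("title", "director", "year"))]),
        (("title", "year"), [("DB2", ("title", "director", "year"))]),
    ],
    "MovieRev": [
        (("title", "director", "review"),
         [("DB1", ("title", "director", "year")), ("DB3", ("title", "review"))]),
    ],
}

_fresh = count(1)  # counter used to rename variables that occur only in a rule body

def expand(atom):
    """Return every source-level expansion of one mediated-schema atom."""
    name, args = atom
    expansions = []
    for head_vars, body in GAV_RULES[name]:
        binding = dict(zip(head_vars, args))  # head variable -> term used in the query
        suffix = next(_fresh)
        new_body = []
        for rel, terms in body:
            new_terms = tuple(binding.get(t, f"{t}_{suffix}") for t in terms)
            new_body.append((rel, new_terms))
        expansions.append(new_body)
    return expansions

def unfold(query_body):
    """Unfold a conjunctive query: one rewriting per combination of rule choices."""
    choices = [expand(atom) for atom in query_body]
    return [[a for part in combo for a in part] for combo in product(*choices)]

query = [("MovieYear", ("title", 1997)),
         ("MovieRev", ("title", "director", "review"))]
rewritings = unfold(query)
for body in rewritings:
    print("q(title,review) :- " + " & ".join(f"{rel}{terms}" for rel, terms in body))
```

The two rewritings differ only in whether the year comes from DB1 or DB2; a containment check between them would be a separate step that this sketch does not implement.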

The Local As View Approach
Every data source is described as a query expression (view!) over mediated-schema relations
S1: V1(title,year,director) ⇒ year ≥ 1960 & genre = ‘Comedy’ & Movie(title,year,director,genre)
S2: V2(title,review) ⇒ Review(title,review)

Query Reformulation
Find reviews for comedies produced after 1950:
q(title,review) :- Movie(title,year,director,‘Comedy’) & year ≥ 1950 & Review(title,review)
Source descriptions:
V1(title,year,director) ⇒ year ≥ 1960 & genre = ‘Comedy’ & Movie(title,year,director,genre)
V2(title,review) ⇒ Review(title,review)
The reformulated query on the sources:
q’(title,review) :- V1(title,year,director) & V2(title,review)
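A toy evaluation makes the two requirements on q’ concrete. The sketch below assumes the comparisons in the view and in the query are ≥, and every tuple in it is made up for illustration. It shows that q’ returns only correct answers, but can miss answers that no source covers (here, comedies from 1950 to 1959).

```python
# Hypothetical instance of the mediated schema (invented tuples, illustration only).
movie = {
    ("Film A", 1955, "Dir X", "Comedy"),
    ("Film B", 1965, "Dir Y", "Comedy"),
    ("Film C", 1970, "Dir Z", "Drama"),
}
review = {("Film A", "good"), ("Film B", "great"), ("Film C", "fine")}

# Source contents implied by the view definitions V1 and V2.
v1 = {(t, y, d) for (t, y, d, g) in movie if y >= 1960 and g == "Comedy"}
v2 = set(review)

# q: reviews for comedies with year >= 1950, evaluated on the mediated schema.
q = {(t, r)
     for (t, y, d, g) in movie if g == "Comedy" and y >= 1950
     for (t2, r) in review if t2 == t}

# q': the reformulated query, evaluated only on the sources.
q_prime = {(t, r) for (t, y, d) in v1 for (t2, r) in v2 if t2 == t}

print(sorted(q))        # answers over the (virtual) mediated database
print(sorted(q_prime))  # answers actually obtainable from the sources
print(q_prime <= q)     # containment holds on this instance
```

On this instance q’ is strictly smaller than q, which is why the reformulation problem asks for all answers obtainable using the sources rather than for an equivalent query.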

Comparison of the approaches Local as view approach: Easier to add sources: specify the query expression Easier to specify constraints on source contents Global as view: Query reformulation is straightforward

The Query Optimization Problem (currently divorced from reformulation problem) The goal of a query optimizer: Translate a declarative query into an equivalent imperative program of minimal cost The imperative program is a query execution plan: an operator tree in some algebra Basic notions in optimization: search space, search strategy, cost model
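The three basic notions can be shown with a deliberately tiny optimizer. Everything numeric here is an assumption for illustration: the cardinalities, the single uniform join selectivity, and the cost model (total intermediate tuples of a left-deep plan). The search space is all join orders, and the search strategy is exhaustive enumeration.

```python
from itertools import permutations

# Hypothetical statistics: relation cardinalities and one uniform join selectivity.
CARD = {"R": 1000, "S": 100, "T": 10}
SELECTIVITY = 0.01

def plan_cost(order):
    """Cost model: total number of intermediate tuples produced by a left-deep plan."""
    size, cost = CARD[order[0]], 0
    for rel in order[1:]:
        size = size * CARD[rel] * SELECTIVITY  # estimated size of the next join result
        cost += size
    return cost

# Search space: every join order. Search strategy: try them all, keep the cheapest.
best = min(permutations(CARD), key=plan_cost)
print(best, plan_cost(best))
```

Real optimizers prune this space (for example with dynamic programming) because the number of orders grows factorially with the number of relations.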

Similarities of Data Integration with Optimization in DDBMS A distributed database: Query execution distributed over multiple sites Communication costs significant Consequences for Query Optimization Optimizer needs to decide operation locality Plans should exploit independent parallelism Plans should reduce communication overhead Caching can become a significant factor

Differences from DDBMS
Capabilities of data sources:
May provide only limited access patterns to data
May have additional query processing capabilities
Information about sources and the network is missing:
cost of answering queries unknown
statistics harder to estimate
transfer rates unpredictable
In a DDBMS, data is distributed by precise rules

Modeling Source Capabilities Negative capabilities: A web-site may require certain inputs Need to consider only valid query execution plans Positive capabilities: A source may be an ODBC compliant database Need to decide the placement of operations according to capabilities Problem: how to describe and use source capabilities

Negative Capabilities
We model access limitations by binding patterns (b = argument must be bound, f = argument may be free)
Sources:
CitationDB^bf(X,Y) ⇒ Cites(X,Y)
CitingPapers^f(X) ⇒ Cites(X,Y)
Query: Q(X) :- Cites(X,a)
Need to consider only valid plans:
q(X) :- CitingPapers(X) & CitationDB(X,a)
Requires recursive rewritings to find all solutions
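The valid plan can be sketched as ordinary function calls, where a bound argument becomes a required parameter. The citation facts and the function names below are hypothetical; the point is only that CitationDB cannot be called until CitingPapers has supplied a value for its first argument.

```python
# Hypothetical citation facts: (citing paper, cited paper).
CITES = {("p1", "a"), ("p2", "a"), ("p2", "p1"), ("p3", "p2")}

def citing_papers():
    """CitingPapers with pattern f: needs no input, returns papers that cite something."""
    return {x for (x, _) in CITES}

def citation_db(x):
    """CitationDB with pattern bf: first argument must be bound; returns papers x cites."""
    if x is None:
        raise ValueError("CitationDB requires its first argument to be bound")
    return {y for (x2, y) in CITES if x2 == x}

def papers_citing(target):
    """Valid plan: bind X from CitingPapers, then probe CitationDB(X, target)."""
    return {x for x in citing_papers() if target in citation_db(x)}

print(sorted(papers_citing("a")))
```

Calling `citation_db(None)` raises an error, which mirrors a plan that is invalid because it violates the binding pattern.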

Optimization with positive capabilities
Schema dependent vs. schema independent: source able to perform joins, selections, or specifically R ⋈ S
Describing and using positive capabilities:
Positive-capabilities testing module (PCTM): is a plan valid?
Level of specification: declarative query vs. logical query execution plan
Interaction between optimizer and PCTM

Dealing with unexpected data transfer delays
Problem: even the best plan can be bad with data transfer delays
Query scrambling [Urhan et al, SIGMOD-98]: a set of runtime techniques to adapt to initial delays by:
Rescheduling the query execution plan: leave plan unchanged, but evaluate different operators
Operator synthesis: modify tree by removing or rearranging operators

Adaptive Query Processing (teaser for next week) Tukwila: a more general framework for query processing in data integration Due to lack of stats and network delays, interleaves query optimization and execution: Execute plan fragments; re-optimize Can decide to re-optimize even if next fragment is planned Adaptive operators for data integration Rule based mechanism for coordinating execution and optimization
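The interleaving of execution and optimization can be sketched as a control loop. This is not Tukwila's actual mechanism or API: the function names, the replanning rule (re-optimize when a fragment returns more than `factor` times its estimated rows), and the stand-in optimizer are all assumptions chosen to keep the example small.

```python
def run_interleaved(fragments, execute, reoptimize, factor=2.0):
    """Run plan fragments one at a time; replan the rest when an estimate is badly off.

    fragments: list of (name, estimated_rows). Returns (results, number_of_replans).
    """
    pending = list(fragments)
    results, replans = {}, 0
    while pending:
        name, estimated = pending.pop(0)
        rows = execute(name)
        results[name] = rows
        # Hypothetical rule: observed cardinality far above the estimate triggers replanning.
        if pending and len(rows) > factor * estimated:
            observed = {n: len(r) for n, r in results.items()}
            pending = reoptimize(pending, observed)
            replans += 1
    return results, replans

# Made-up source data and estimates; the estimate for A is deliberately far too low.
data = {"A": list(range(50)), "B": list(range(5)), "C": list(range(8))}
fragments = [("A", 10), ("C", 8), ("B", 5)]

def reoptimize(pending, observed):
    """Stand-in optimizer: run the fragments with the smallest estimates first."""
    return sorted(pending, key=lambda f: f[1])

results, replans = run_interleaved(fragments, data.__getitem__, reoptimize)
print(list(results), replans)
```

Here fragment A returns five times its estimate, so the remaining fragments are reordered once before execution continues.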

Conclusions
Data integration handles many problems that arise in embedded systems applications:
Many data sources
Easy addition and deletion of sources
Different source capabilities
Dealing with network delays
Easy for user