Integration of Biological Sources: Current Systems and Challenges Ahead (SIGMOD Record, Vol. 33, No. 3, September 2004) Thomas Hernandez & Subbarao Kambhampati.

Thomas Hernandez & Subbarao Kambhampati, Dept. of Computer Science and Engineering, Arizona State University

Introduction
Traditionally, biologists integrated biological data manually. However, the growing volume of data, the variety of formats, and the wide distribution of sources over the Internet make manual integration practically infeasible, so computer-supported integration is needed. This need is also justified by the characteristics of the biological sources:

Characteristics of Biological Sources
Variety of data. The data stored typically cover several biological and genomic research fields (e.g. gene expression and sequences, disease characteristics, molecular structures, microarray data). Not only can the quantity of data in a source be quite large, but each record can itself be extremely large (e.g. DNA sequences, 3D protein structures).
Heterogeneous representations. Several sources containing similar data can represent it very differently. The representational heterogeneity includes structural differences (i.e. schema), naming differences, semantic differences (i.e. the same semantic concept under different terms, or the same term for different concepts), and content differences (different data for the same semantic object).

Characteristics of Biological Sources
Autonomous operations. Sources are free to modify their design and/or schema, and to remove or modify data, without any prior public notification. Nearly all sources are web-based and therefore dependent on network traffic and overall availability. The data is dynamic.
Different interfaces and querying capabilities.

Integration Approaches in Existing Systems
Existing systems can be classified first in terms of their data model, that is, the design assumptions the integration system makes about the syntactic nature of the data exported by the sources.
1. Text data model. Sources are viewed as exporting mainly text, and integration means supporting keyword/text search across the sources.
2. Structured data model. Sources are viewed as exporting more structured data, and there are two broad types of integration: the data is either warehoused locally or accessed on demand from the sources.
3. Linked records model. Sources are viewed as exporting linked sets of browsable records, and integration means supporting effective navigation across sources.

Integration Approaches in Existing Systems
The majority of systems use the (semi-)structured or linked records models; these systems are discussed in more detail in the following slides. They follow three types of approach:
1. Warehouse integration. It materializes the data from multiple sources into a local warehouse and executes all queries on the data contained in the warehouse instead of on the actual sources. It emphasizes data translation, in contrast to the query translation of mediator-based integration.
Pros: less dependency on the network, more efficient query optimization, and the ability for users to filter, validate, modify, and annotate the data obtained from the sources.
Cons: outdated data and the need for frequent updates.
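The warehouse idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two in-memory "sources", their field names, and the `protein` table are hypothetical, and SQLite stands in for whatever store a real warehouse would use. The sketch shows data translation into one local schema, a local annotation, and a query that never touches the sources.

```python
import sqlite3

# Hypothetical records, as two remote sources might export them
# (field names differ between sources on purpose).
SOURCE_A = [{"acc": "P1", "name": "kinase A", "organism": "human"}]
SOURCE_B = [{"id": "P2", "protein_name": "kinase B", "species": "mouse"}]


def load_warehouse(conn):
    """Translate each source's records into a single local schema."""
    conn.execute(
        "CREATE TABLE protein ("
        "accession TEXT PRIMARY KEY, name TEXT, organism TEXT, note TEXT)"
    )
    for r in SOURCE_A:
        conn.execute(
            "INSERT INTO protein VALUES (?, ?, ?, NULL)",
            (r["acc"], r["name"], r["organism"]),
        )
    for r in SOURCE_B:
        conn.execute(
            "INSERT INTO protein VALUES (?, ?, ?, NULL)",
            (r["id"], r["protein_name"], r["species"]),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_warehouse(conn)

# Local annotation: possible because the data is stored in the warehouse.
conn.execute(
    "UPDATE protein SET note = ? WHERE accession = ?",
    ("checked by curator", "P1"),
)

# Queries run against the warehouse only, not against the sources.
rows = conn.execute(
    "SELECT accession, name FROM protein "
    "WHERE name LIKE 'kinase%' ORDER BY accession"
).fetchall()
```

The sketch also makes the drawback visible: once `load_warehouse` has run, any later change in the sources is not reflected until the warehouse is reloaded.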

Integration Approaches in Existing Systems
2. Mediator-based integration. It concentrates on query translation. A mediator is responsible for reformulating, at runtime, a query on a single mediated schema into queries on the local schemas of the underlying data sources. The mapping between the source descriptions and the mediated schema is crucial for this translation. There are two main approaches for establishing the mapping between each source schema and the global schema: global-as-view (GAV) and local-as-view (LAV). In GAV, the mediator relations are written directly in terms of the source relations. In LAV, every source relation is defined over the relations and the schema of the mediator. LAV is preferred for large-scale integration, whereas GAV is appropriate when the set of sources being integrated is known and stable.
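A minimal sketch of the GAV case, assuming two hypothetical source relations (`SRC_SEQ`, `SRC_FUNC`) and a hypothetical mediated relation `Protein(acc, seq, func)`. Because the mediated relation is defined directly as a view over the sources, answering a query reduces to unfolding that definition. LAV is not shown: it would require rewriting the query using source views, which is a harder problem.

```python
# Hypothetical source relations (illustrative data only).
SRC_SEQ = [("P1", "MKT"), ("P2", "MAA")]         # (accession, sequence)
SRC_FUNC = [("P1", "kinase"), ("P3", "ligase")]  # (accession, function)


def protein():
    """GAV mapping: Protein(acc, seq, func) is a join of the two sources."""
    func_by_acc = dict(SRC_FUNC)
    return [
        (acc, seq, func_by_acc[acc])
        for acc, seq in SRC_SEQ
        if acc in func_by_acc
    ]


def query_by_function(func):
    """A query on the mediated schema, answered by unfolding the view."""
    return [acc for acc, _seq, f in protein() if f == func]


kinases = query_by_function("kinase")
```

Adding a new source under GAV means editing `protein()` itself, which is one way to see why GAV suits a known and stable set of sources.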

Integration Approaches in Existing Systems
3. Navigation-based integration. It emerges from the fact that an increasing number of sources on the web require users to browse manually through several web pages and data sources in order to obtain the desired information. The specific paths essentially constitute workflows in which the output of one source is redirected to the input of the next source until the requested information is reached.
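The path-as-workflow idea can be sketched as a chain of lookups, where each step consumes the previous step's output. The two lookup tables and all identifiers below are hypothetical placeholders, not real database entries.

```python
# Hypothetical link tables standing in for two browsable sources.
GENE_TO_PROTEIN = {"GENE1": "PROT1"}
PROTEIN_TO_STRUCTURES = {"PROT1": ["STRUCT1", "STRUCT2"]}


def gene_to_protein(gene_ids):
    """Step 1: follow links from gene entries to protein entries."""
    return [GENE_TO_PROTEIN[g] for g in gene_ids if g in GENE_TO_PROTEIN]


def protein_to_structures(protein_ids):
    """Step 2: follow links from protein entries to structure entries."""
    return [s for p in protein_ids for s in PROTEIN_TO_STRUCTURES.get(p, [])]


def run_path(path, start):
    """Feed the output of each source into the next one along the path."""
    result = start
    for step in path:
        result = step(result)
    return result


structures = run_path([gene_to_protein, protein_to_structures], ["GENE1"])
```

Because a path is just a list of steps, it can be stored and replayed later, which matches the reusable execution paths mentioned for navigation systems.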

Integration Approaches in Existing Systems
There are also other classifications besides the data model classification:
1. Aim of integration – portal or query oriented;
2. Source model – complementary (horizontal) or vertical (overlap exists and requires aggregation);
3. User model – low expertise, high expertise in query languages, and interactive query formulation;
4. Level of transparency – users choose the sources, or the choice of sources is hard-wired.


Sequence Retrieval System (SRS)
SRS first parses flat files that contain structured text with field names. It then creates and stores an index for each field and uses these local indexes at query time to retrieve relevant entries. Although extensive indexes of the entries are kept locally for use by the query processor, SRS is not a warehouse system, as the actual data is neither modified nor stored locally.
The other main feature of SRS is that it keeps track of the cross-references between sources. During parsing and indexing, its own parsing component identifies links that exist between entries in different sources. These links are then used to suggest more results to a user after a query has been processed.
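A rough sketch of the per-field indexing and cross-reference tracking described above. This is not SRS's actual parser or file format: the `FIELD: value` layout, the `//` entry terminator, and the `ID`, `DE`, and `XR` field names are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical flat file: "FIELD: value" lines, entries terminated by "//".
FLAT_FILE = """\
ID: E1
DE: human insulin precursor
XR: OTHERDB:X9
//
ID: E2
DE: mouse insulin receptor
//
"""


def parse_entries(text):
    """Split the flat file into entries mapping field name -> list of values."""
    entries, current = [], {}
    for line in text.splitlines():
        if line.strip() == "//":
            if current:
                entries.append(current)
            current = {}
        elif ": " in line:
            field, value = line.split(": ", 1)
            current.setdefault(field, []).append(value)
    if current:
        entries.append(current)
    return entries


def build_indexes(entries):
    """Build one token index per field, plus cross-reference links per entry."""
    indexes = defaultdict(lambda: defaultdict(set))  # field -> token -> ids
    links = defaultdict(set)                         # entry id -> linked entries
    for entry in entries:
        entry_id = entry["ID"][0]
        for field, values in entry.items():
            for value in values:
                if field == "XR":
                    links[entry_id].add(value)
                else:
                    for token in value.lower().split():
                        indexes[field][token].add(entry_id)
    return indexes, links


indexes, links = build_indexes(parse_entries(FLAT_FILE))
hits = sorted(indexes["DE"]["insulin"])
```

Only entry identifiers are kept in the indexes; after a query, `links` can be consulted to suggest related entries in other sources.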

BioKleisli
BioKleisli is a mediator-based integration system. The mediator on top of the underlying sources relies mainly on a high-level query language (CPL, which is more expressive than SQL) to query across several sources. Queries are decomposed into sub-queries, and source-specific wrappers map the sub-queries to specific heterogeneous sources, which are accessed through predefined atomic query functions.
BioKleisli does not use any global molecular biology schema or ontology.
It is aimed at performing horizontal integration: a query attribute is usually bound to an attribute in a single predetermined source, and there is essentially no content overlap.

TAMBIS
TAMBIS is a mediator-based and ontology-driven integration system.
Query processing stages (from the slide diagram):
1. GUI (concepts defined in a global schema)
2. Source-independent GRAIL query
3. Query internal form
4. Source-dependent CPL query execution plan
5. BioKleisli's existing function library is used to access the sources

TAMBIS
The TAMBIS domain ontology mainly serves to ease the user's task of formulating a query, rather than to map schemas between sources.

DiscoveryLink
DiscoveryLink is also a mediator-based integration system. Applications typically connect to DiscoveryLink and submit an SQL query on the global schema, without necessarily being aware of the underlying sources. Underneath, a federated database query processor communicates with source-specific wrappers to determine the optimal plan for a given query.
The wrappers have two roles: they translate the source data models, and they provide source-specific information about query capabilities that helps the optimizer determine which parts of a query can be submitted to each source.
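The second wrapper role can be sketched as follows. This is not DiscoveryLink's wrapper interface: the `Wrapper` class, its `can_push` method, and the equality-only filters are hypothetical simplifications. The point is only that the planner asks the wrapper which predicates the source can evaluate and applies the remainder itself.

```python
class Wrapper:
    """Hypothetical wrapper that advertises which columns it can filter on."""

    def __init__(self, name, rows, pushable_columns):
        self.name = name
        self._rows = rows
        self._pushable = set(pushable_columns)

    def can_push(self, column):
        """Report whether the source can evaluate a filter on this column."""
        return column in self._pushable

    def fetch(self, filters):
        """Evaluate the pushed-down equality filters at the source."""
        return [
            r for r in self._rows
            if all(r[c] == v for c, v in filters.items())
        ]


def plan_and_execute(wrapper, filters):
    """Push down what the source supports; apply the rest in the mediator."""
    pushed = {c: v for c, v in filters.items() if wrapper.can_push(c)}
    residual = {c: v for c, v in filters.items() if c not in pushed}
    rows = wrapper.fetch(pushed)
    result = [r for r in rows if all(r[c] == v for c, v in residual.items())]
    return result, pushed, residual


rows = [
    {"gene": "G1", "organism": "human"},
    {"gene": "G2", "organism": "mouse"},
    {"gene": "G1", "organism": "mouse"},
]
source = Wrapper("flatfile_source", rows, pushable_columns=["gene"])
result, pushed, residual = plan_and_execute(
    source, {"gene": "G1", "organism": "mouse"}
)
```

A real optimizer would also use cost estimates from the wrappers to choose among alternative plans; that part is omitted here.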

Other Existing Systems
BACIIS is an end-user product developed following a mediator-based approach combined with extensive use of a knowledge base (KB). The KB contains a domain ontology that serves as a global schema, together with mappings from the database schemas to the domain ontology.
BioNavigator is a commercially available navigation-based integration system. Users can define their preferred execution path for a query and reuse it later.
GUS is a warehouse-based integration system.

Discussion
As mentioned earlier, warehouse-based approaches have two clear advantages. First, they simplify query optimization and processing by storing the data locally according to a single global schema. Second, because the data is stored locally, they enable users to add their own annotations to stored data and to specify filtering conditions to clean it.
However, it is still unclear how this user-friendly feature can be achieved efficiently and, more specifically, how the data could effectively be validated or modified without human intervention and extensive domain expertise. Furthermore, data warehousing faces the big problem of handling updates in the sources, and an even bigger challenge when the data has been modified and annotated locally and therefore differs from the data in the sources.

Discussion
Although GAV and LAV were introduced earlier for the mediator-based approach, no mediator-based integration system implements them so far. Wrapper-oriented approaches are still relatively new.
Much like TAMBIS and BioKleisli, most current systems address only horizontal integration and do not consider the potential overlap between sources. DiscoveryLink attempts to solve the problem of selecting between several potential sources by using the information provided by wrappers to estimate querying costs, but it does not consider overlap and coverage in optimization and source selection.

Reference
Thomas Hernandez & Subbarao Kambhampati. Integration of Biological Sources: Current Systems and Challenges Ahead. SIGMOD Record, Vol. 33, No. 3, September 2004.