October 2007 Data integration architectures and methodologies for the Life Sciences Alexandra Poulovassilis, Birkbeck, U. of London.

Presentation transcript:

October 2007 Data integration architectures and methodologies for the Life Sciences Alexandra Poulovassilis, Birkbeck, U. of London

October 2007 Outline of the talk
- The problem and challenges faced
- Historical background
- Main data integration approaches in the Life Sciences
- Our work: materialised and virtual DI
- Future directions
- ISPIDER Project
- Bioinformatics service reconciliation

October 2007 The Problem
- Given a set of biological data sources, data integration (DI) is the process of creating an integrated resource that combines data from each of the data sources in order to support new queries and analyses
- Biological data sources are characterised by a high degree of heterogeneity, in terms of: data model, query interfaces, query processing capabilities, database schema or data exchange format, data types used, and nomenclature adopted
- Coupled with the variety, complexity and large volumes of biological data, this poses several challenges, and has led to a range of methodologies, architectures and systems being developed

October 2007 Challenges faced
- Increasingly large volumes of complex, highly varying biological data are being made available
- Data sources are developed by different people, in differing research environments, for differing purposes
- Integrating them to meet the needs of new users and applications requires reconciliation of their heterogeneity w.r.t. content, data representation/exchange and querying
- Data sources may freely change their format and content without considering the impact on any derived integrated resources
- Integrated resources may themselves become data sources for higher-level integrations, resulting in a network of dependencies

October 2007 Biological data: Genes, Proteins, Biological Function
- Genome: DNA sequences of 4 bases (A, C, G, T); a gene's DNA is the permanent copy
- RNA: a temporary copy of a DNA sequence
- Protein: the product, a sequence of amino acids (20 kinds; each triple of RNA bases encodes an amino acid)
- Function: a protein's "job" within biological processes
[Diagram: a gene's DNA is copied to RNA, whose product is a protein that performs a biological function]
(This slide is adapted from Nigel Martin's lecture notes on Bioinformatics)

October 2007 Varieties of Biological Data
- genomic data
- gene expression data (DNA to proteins) and gene function data
- protein structure and function data
- regulatory pathway data: how gene expression is regulated by proteins
- cluster data: similarity-based clustering of genes or proteins
- proteomics data: from experiments on separating the proteins produced by organisms into peptides, and on protein identification
- phylogenetic data: evolution of genomic, protein and function data
- data on genomic variations in species
- semi-structured/unstructured data: e.g. medical abstracts

October 2007 Some Key Application Areas for DI
- Integrating, analysing and annotating genomic data
- Predicting the functional role of genes and integrating function-specific information
- Integrating organism-specific information
- Integrating protein structure and pathway data with gene expression data, to support functional genomics analysis
- Integrating, analysing and annotating proteomics data sources
- Integrating phylogenetic data sources for genealogy research
- Integrating data on genomic variations to analyse health impact
- Integrating genomic, proteomic and clinical data for personalised medicine

October 2007 Historical Background
- One possible approach would be to encode transformation/integration functionality in the application programs themselves
- However, this may be a complex and lengthy process, and may affect robustness, maintainability and extensibility
- This has motivated the development of generic architectures and methodologies for DI, which abstract this functionality out of application programs into generic DI software
- Much work has been done since the 1990s specifically on biological DI
- Many systems have been developed, e.g. DiscoveryLink, Kleisli, Tambis, BioMart, SRS, Entrez, that aim to address some of the challenges faced

October 2007 Main DI Approaches in the Life Sciences
Materialised:
- import data into a data warehouse (DW)
- transform & aggregate the imported data
- query the DW via the DBMS
Virtual:
- specify the integrated schema (IS)
- wrap the data sources, using wrapper software
- construct mappings between the data sources and the IS, using mediator software
- query the integrated schema: the mediator software coordinates query evaluation, using the mappings and wrappers

October 2007 Main DI Approaches in the Life Sciences
Link-based:
- no integrated schema
- users submit simple queries to the integration software, e.g. via a web-based user interface
- queries are formulated w.r.t. the data sources, as selected by the user
- the integration software provides additional capabilities for:
  - facilitating query formulation: e.g. cross-references are maintained between different data sources and used to augment query results with links to other related data
  - speeding up query evaluation: e.g. indexes are maintained, supporting efficient keyword-based search
(A sketch of both ideas follows.)
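The following minimal sketch (hypothetical data and names, not any particular system) illustrates the two capabilities just listed: a keyword index supporting efficient search, and a cross-reference table used to augment each result with links to related entries in other sources.

```python
# Link-based integration sketch: hypothetical data, illustrative only.

CROSS_REFS = {
    # (source, id) -> related entries in other data sources
    ("uniprot", "P12345"): [("pdb", "1ABC"), ("kegg", "hsa:1234")],
}

KEYWORD_INDEX = {
    # keyword -> matching entries, per source
    "kinase": [("uniprot", "P12345")],
}

def search(keyword):
    """Return keyword matches, each augmented with cross-reference links."""
    results = []
    for entry in KEYWORD_INDEX.get(keyword.lower(), []):
        results.append({"entry": entry, "links": CROSS_REFS.get(entry, [])})
    return results

print(search("kinase"))
# [{'entry': ('uniprot', 'P12345'), 'links': [('pdb', '1ABC'), ('kegg', 'hsa:1234')]}]
```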

October 2007 Comparing the Main Approaches
- Link-based integration is fine if its functionality meets users' needs
- Otherwise, materialised or virtual DI is indicated: both allow the integrated resource to be queried as though it were a single data source, and users/applications do not need to be aware of source schemas/formats/content
- Materialised DI is generally adopted for:
  - better query performance
  - greater ease of data cleaning and annotation
- Virtual DI is generally adopted for:
  - lower cost of storing and maintaining the integrated resource
  - greater currency of the integrated resource

October 2007 Our work: AutoMed
The AutoMed project at Birkbeck and Imperial:
- is developing tools for the semi-automatic integration of heterogeneous information sources
- can handle both structured and semi-structured data
- provides a unifying graph-based metamodel (the HDM) for specifying higher-level modelling languages
- provides a single framework for expressing data cleansing, transformation and integration logic
- the AutoMed toolkit is currently being used for biological data integration and P2P data integration

October 2007 AutoMed Architecture
[Architecture diagram: Global Query Processor/Optimiser; Schema Matching Tools; other tools, e.g. GUI, schema evolution, DLT; Schema Transformation and Integration Tools; Model Definition Tool; Schema and Transformation Repository; Model Definitions Repository; Wrappers]

October 2007 AutoMed Features
- Schema transformations are automatically reversible:
  - addT(c,q) / deleteT(c,q) are reversed by deleteT(c,q) / addT(c,q)
  - extendT(c, Range q1 q2) is reversed by contractT(c, Range q1 q2)
  - renameT(c, n, n') is reversed by renameT(c, n', n)
- Hence bi-directional transformation pathways (more generally, transformation networks) are defined between schemas (see the sketch below)
- The queries within transformations allow automatic data and query translation
- Schemas may be expressed in a variety of modelling languages
- Schemas may or may not have a data source associated with them
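A minimal sketch of this reversibility, using a generic step representation rather than the actual AutoMed API: a pathway from schema S1 to S2 is inverted by reversing each primitive step, in the opposite order.

```python
# Each step is (kind, args); this is an illustrative encoding, not AutoMed's.
REVERSE = {"add": "delete", "delete": "add",
           "extend": "contract", "contract": "extend",
           "rename": "rename"}

def reverse_step(step):
    kind, args = step
    if kind == "rename":             # rename(c, n, n') reverses to rename(c, n', n)
        c, old, new = args
        return ("rename", (c, new, old))
    return (REVERSE[kind], args)     # same construct and query, dual operation

def reverse_pathway(pathway):
    """Invert a transformation pathway S1 -> S2 into one from S2 -> S1."""
    return [reverse_step(s) for s in reversed(pathway)]

pathway = [("add", ("protein", "q1")), ("rename", ("gene", "name", "symbol"))]
print(reverse_pathway(pathway))
# [('rename', ('gene', 'symbol', 'name')), ('delete', ('protein', 'q1'))]
```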

October 2007 AutoMed vs Common Data Model approach

October 2007 Materialised DI

October 2007 Some characteristics of Biological DI
- prevalence of automated and manual annotation of data prior to, during and after its integration, e.g. the DAS distributed annotation service; the GUS data warehouse
- annotation of data origin and data derivation: importance of being able to trace the provenance of data
- wide variety of nomenclatures adopted: greatly increases the difficulty of data aggregation; has led to many standardised ontologies and taxonomies
- inconsistencies in the identification of biological entities: has led to standardisation efforts, e.g. LSID, but a legacy of non-standard identifiers is still present

October 2007 The BioMap Data Warehouse
A data warehouse integrating:
- gene expression data
- protein structure data, including data from the Macromolecular Structure Database (MSD) at the European Bioinformatics Institute (EBI)
- CATH structural classification data
- functional data, including the Gene Ontology and KEGG
- hierarchical clustering data, derived from the above
Aiming to support mining, analysis and visualisation of gene expression data

October 2007 BioMap integration approach

October 2007 BioMap architecture
[Architecture diagram: data sources (Structure Data: MSD, CATH, ...; Function Data: GO, KEGG, ...; Cluster Data; Microarray Data (ArrayExpress); MEditor) feed Data Marts (Structure, Function, Cluster, Microarray and Metadata Tables) and Search Tables (materialised views), which are accessed by Search, Analysis, Mining and Visualisation Tools]

October 2007 Using AutoMed in the BioMap Project
- Wrapping of the data sources and of the DW
- Automatic translation of source and global schemas into AutoMed's XML schema language (XMLDSS)
- Domain experts provide matchings between constructs in the source and global schemas: rename transformations
- Automatic schema restructuring and generation of transformation pathways
- Pathways could subsequently be used for maintenance and evolution of the DW, and also for data lineage tracing
- See the DILS'05 paper for details of the architecture and clustering approach
[Diagram: relational and XML data sources wrapped into AutoMed relational/XMLDSS schemas, connected by transformation pathways to the AutoMed integrated schema over the integrated database]

October 2007 Virtual DI
- The integrated schema may be defined in a standard data modelling language
- Or, more broadly, it may be a source-independent ontology defined in an ontology language, serving as a global schema for multiple potential data sources beyond the ones being integrated, e.g. as in TAMBIS
- The integrated schema may or may not encompass all of the data in the data sources: it may be sufficient to capture just the data needed for answering key user queries/analyses; this avoids the possibly complex and lengthy process of creating a complete integrated schema and set of mappings

October 2007 Virtual DI Architecture
[Architecture diagram: Global Query Processor and Global Query Optimiser; Schema Integration Tools; a Metadata Repository holding data source schemas, integrated schemas and mappings; Wrappers over the data sources]

October 2007 Degree of Data Source Overlap
- different systems make different assumptions about this
- some systems assume that each DS contributes a different part of the integrated virtual resource, e.g. K2/Kleisli
- some systems relax this but do not attempt any aggregation of duplicate or overlapping data from the DSs, e.g. TAMBIS
- some systems support aggregation at both the schema and data levels, e.g. DiscoveryLink, AutoMed
- the degree of data source overlap impacts on the complexity of the mappings and the design effort involved in specifying them
- the complexity of the mappings in turn impacts on the sophistication of the global query optimisation and evaluation mechanisms that will be needed

October 2007 Virtual DI methodologies
Top-down:
- the integrated schema IS is first constructed, or it may already exist from previous integration or standardisation efforts
- the set of mappings M is then defined between the IS and the DS schemas

October 2007 Virtual DI methodologies
Bottom-up (a sketch follows below):
- an initial version of IS and M is constructed, e.g. from one DS
- these are incrementally extended/refined by considering each of the other DSs in turn
- for each object O in each DS, M is modified to encompass the mapping between O and IS, if possible
- if not, IS is extended as necessary to encompass the information represented by O, and M is then modified accordingly
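A sketch of the bottom-up loop under simplified assumptions: schemas are sets of object names, mappings are reduced to an identity dictionary, and the hypothetical try_map stands in for the matching and domain knowledge needed in practice.

```python
def try_map(obj, integrated_schema):
    # Placeholder: deciding whether obj can be expressed against IS
    # really needs domain knowledge / schema matching tools.
    return obj in integrated_schema

def bottom_up(data_source_schemas):
    sources = list(data_source_schemas)
    integrated = set(sources[0])          # seed IS (and M) from the first DS
    mappings = {obj: obj for obj in integrated}
    for ds in sources[1:]:                # consider each other DS in turn
        for obj in ds:
            if not try_map(obj, integrated):
                integrated.add(obj)       # extend IS to cover obj
            mappings[obj] = obj           # then extend M accordingly
    return integrated, mappings

IS, M = bottom_up([{"gene", "protein"}, {"protein", "pathway"}])
print(sorted(IS))   # ['gene', 'pathway', 'protein']
```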

October 2007 Virtual DI methodologies
Mixed top-down and bottom-up:
- an initial IS may exist
- an initial set of mappings M is specified
- IS and M may need to be extended/refined by considering additional data from the DSs that IS needs to capture
- for each object O in each DS that IS needs to capture, M is modified to encompass the mapping between O and IS, if possible
- if not, IS is extended as necessary to encompass the information represented by O, and M is then modified accordingly

October 2007 Defining Mappings
Global-as-view (GAV):
- each schema object in the IS is defined by a view over the DSs
- simple global query reformulation, by query unfolding
- view evolution problems if the DSs change
Local-as-view (LAV):
- each schema object in a DS is defined by a view over the IS
- harder global query reformulation, using views
- evolution problems if the IS changes
Global-local-as-view (GLAV):
- views relate multiple schema objects in a DS with the IS
(An illustration follows.)
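As a worked illustration with hypothetical relations (not drawn from any of the systems above): under GAV an integrated-schema relation is defined as a view, here a join, over source relations; under LAV a source relation is defined as a view, here a projection, over the integrated schema. Query reformulation is then unfolding for GAV and answering-queries-using-views for LAV.

```latex
% Hypothetical relations: Prot in source DS1, Func in source DS2,
% Protein in the integrated schema IS.
\begin{align*}
\text{GAV:}\quad & \mathit{Protein}_{IS}(\mathit{id},\mathit{seq},\mathit{fn})
  \;=\; \mathit{Prot}_{DS_1}(\mathit{id},\mathit{seq}) \bowtie
        \mathit{Func}_{DS_2}(\mathit{id},\mathit{fn}) \\
\text{LAV:}\quad & \mathit{Prot}_{DS_1}(\mathit{id},\mathit{seq})
  \;=\; \pi_{\mathit{id},\mathit{seq}}\,
        \mathit{Protein}_{IS}(\mathit{id},\mathit{seq},\mathit{fn})
\end{align*}
```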

October 2007 Both-As-View: the approach supported by AutoMed
- not based on views between integrated and source schemas
- instead, provides a set of primitive schema transformations, each adding, deleting or renaming just one schema object
- relationships between source and integrated schema objects are thus represented by a pathway of primitive transformations
- add, extend, delete and contract transformations are accompanied by a query defining the new/deleted object in terms of the other schema objects
- from the pathways and their queries, it is possible to derive GAV, LAV and GLAV mappings
- currently AutoMed supports GAV, LAV and combined GAV+LAV query processing

October 2007 Typical BAV Integration Network
[Diagram: data source schemas DS1, DS2, ..., DSi, ..., DSn, each transformed into a union-compatible schema US1, US2, ..., USi, ..., USn; the union-compatible schemas are linked to each other by id transformations and to the global schema GS]

October 2007 Typical BAV Integration Network (contd)
On the previous slide:
- GS is a global schema
- DS1, ..., DSn are data source schemas
- US1, ..., USn are union-compatible schemas
- the transformation pathway between each pair DSi and USi may consist of add, delete, rename, extend and contract primitive transformations, operating on any modelling construct defined in the AutoMed Model Definitions Repository
- the transformation pathway between USi and GS is similar
- the transformation pathway between each pair of union-compatible schemas consists of id transformation steps

October 2007 Schema Evolution
- In biological DI, data sources may evolve their schemas to meet the needs of new experimental techniques or applications
- Global schemas may similarly need to evolve to encompass new requirements
- Supporting schema evolution in materialised DI is costly: it requires modifying the ETL and view materialisation processes, plus the processes maintaining any derived data marts
- With view-based virtual DI approaches, the sets of views that may be affected need to be examined and redefined

October 2007 Schema Evolution in BAV
- BAV supports the evolution of both data source and global schemas
- The evolution of any schema is specified by a transformation pathway from the old schema to the new one
- For example, the figure on the right shows transformation pathways T from an old global or data source schema S to a new schema S'
[Diagram: Global Schema S to New Global Schema S' via pathway T; Data Source Schema S to New Data Source Schema S' via pathway T]

October 2007 Global Schema Evolution
Each transformation step t in T: S to S' is considered in turn:
- if t is an add, delete or rename, then schema equivalence is preserved and there is nothing further to do (except perhaps optimise the extended transformation pathway, using an AutoMed tool that does this); the extended pathway can be used to regenerate the necessary GAV or LAV views
- if t is a contract, then there will be information present in S that is no longer available in S'; again there is nothing further to do
- if t is an extend, then domain knowledge is required to determine if, and how, the new construct in S' could be derived from existing constructs; if not, there is nothing further to do; if yes, the extend step is replaced by an add step
(A sketch of this case analysis follows.)
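A sketch of the case analysis above, under a simplified step representation (kind, construct, query); the hypothetical derive_query stands in for the domain knowledge required, and this is not the AutoMed implementation.

```python
def derive_query(construct, old_schema):
    # Placeholder: deriving a new construct from existing ones
    # requires domain knowledge; return None if not derivable.
    return None

def handle_global_evolution(pathway, old_schema):
    """Rewrite an evolution pathway: extend steps become add steps
    when the new construct is derivable from existing constructs."""
    new_pathway = []
    for kind, construct, query in pathway:
        if kind == "extend":
            q = derive_query(construct, old_schema)
            if q is not None:
                kind, query = "add", q   # derivable: extend is replaced by add
        # add/delete/rename preserve schema equivalence, and contract just
        # loses information: in both cases the step is kept unchanged
        new_pathway.append((kind, construct, query))
    return new_pathway

print(handle_global_evolution([("extend", "organism", None)], {"gene"}))
# [('extend', 'organism', None)] since the placeholder derives nothing
```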

October 2007 Local Schema Evolution
- This is a bit more complicated, as it may require changes to be propagated to the global schema(s) as well
- Again, each transformation step t in T: S to S' is considered in turn
- If t is an add, delete, rename or contract step, the evolution can be carried out automatically
- If it is an extend, then domain knowledge is required
- See our CAiSE'02, ICDE'03 and ER'04 papers for more details; the last of these discusses a materialised DI scenario where the old/new global/source schemas have an extent
- We are currently implementing this functionality within the AutoMed toolkit

October 2007 Some Future Directions in Biological DI
- Automatic or semi-automatic identification of correspondences between sources, or between sources and global schemas, e.g.:
  - name-based and structural comparisons of schema elements (a sketch follows below)
  - instance-based matching at the data level
  - annotation of data sources with terms from ontologies, to facilitate automated reasoning
- Capturing incomplete and uncertain information about the data sources within the integrated resource, e.g. using probabilistic or logic-based representations and reasoning
- Automating information extraction from textual sources using grammar- and rule-based approaches, and integrating this with other related structured or semi-structured data
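For instance, a minimal name-based matcher might propose candidate correspondences from string similarity alone; this is illustrative only, since real matchers combine name-based evidence with structural and instance-based evidence.

```python
from difflib import SequenceMatcher

def name_matches(schema_a, schema_b, threshold=0.7):
    """Propose element correspondences whose name similarity exceeds threshold."""
    pairs = []
    for a in schema_a:
        for b in schema_b:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return sorted(pairs, key=lambda p: -p[2])

# e.g. proposes a ('ProteinSeq', 'protein_sequence', ...) correspondence
print(name_matches({"ProteinSeq", "GeneName"}, {"protein_sequence", "gene"}))
```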

October 2007 Harnessing Grid Technologies: ISPIDER
- ISPIDER Project partners: Birkbeck, EBI, Manchester, UCL
- Motivation: large volumes of heterogeneous proteomics data; the need for interoperability; the need for efficient processing
- Aims: development of a proteomics Grid infrastructure; use of existing proteomics resources and development of new ones; development of new proteomics clients for querying, visualisation, workflow etc.

October 2007 Project Aims
[A sequence of diagram slides illustrating the project aims]

October 2007 myGrid / DQP / AutoMed
- myGrid: a collection of services/components allowing high-level integration, via workflows, of data and applications
- DQP: uses OGSA-DAI (Open Grid Services Architecture - Data Access and Integration) to access data sources; provides distributed query processing over OGSA-DAI-enabled resources
- Current research: AutoMed-DQP interoperation and AutoMed-myGrid workflow interoperation; see our DILS'06 and DILS'07 papers, respectively

October 2007 AutoMed-DQP Interoperability
- Data sources are wrapped with OGSA-DAI; AutoMed-DAI wrappers extract the data sources' metadata
- Semantic integration of the data sources, via AutoMed transformation pathways, into an integrated AutoMed schema
- IQL queries submitted to this integrated schema are:
  - reformulated into IQL queries on the data sources, using the AutoMed transformation pathways
  - submitted to DQP for evaluation, via the AutoMed-DQP wrapper

October 2007 Bioinformatics Service Reconciliation
- A plethora of bioinformatics services is being made available
- Semantically compatible services are often unable to interoperate automatically in workflows, due to:
  - different service technologies
  - differences in data model, data modelling and data types
- Hence the need for service reconciliation

October 2007 Previous Approaches
- Shims: myGrid uses shims, i.e. services that act as intermediaries between specific pairs of services and reconcile their inputs and outputs
- Bowers & Ludäscher (DILS'04) use 1-1 path correspondences to one or more ontologies for reconciling services; their sample implementation uses mappings to a single ontology and generates an XQuery query as the transformation program
- Thakkar et al. use a mediator system, like us, but for service integration, i.e. for providing services that integrate other services, not for reconciling semantically compatible services that need to form a pipeline within a workflow

October 2007 Our approach: XML as the common representation format
- We assume the availability of format converters to convert to/from XML, if the output/input of a service is not XML

October 2007 Our approach: XMLDSS as the schema type
- We use our XMLDSS schema type as the common schema type for XML
- An XMLDSS schema can be automatically derived from a DTD/XML Schema, if available
- Or it can be automatically extracted from an XML document itself (a sketch of the basic idea follows)
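The basic idea of extracting a schema from an XML document can be sketched as follows; this is illustrative only, since XMLDSS itself is richer (e.g. it is an AutoMed schema recording the tree structure of elements, attributes and their nesting).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def extract_schema(xml_text):
    """Collect, for each element name, its child element names and attributes."""
    schema = defaultdict(lambda: {"children": set(), "attributes": set()})
    stack = [ET.fromstring(xml_text)]
    while stack:
        node = stack.pop()
        schema[node.tag]["attributes"].update(node.attrib)
        for child in node:
            schema[node.tag]["children"].add(child.tag)
            stack.append(child)
    return dict(schema)

doc = "<proteins><protein id='P1'><seq>MKT</seq></protein></proteins>"
print(extract_schema(doc))
# {'proteins': {'children': {'protein'}, ...}, 'protein': {...}, 'seq': {...}}
```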

October 2007 Our approach: correspondences to an ontology
- A set of GLAV correspondences is defined between each XMLDSS schema and a typed ontology:
  - an element maps to a concept/path in the ontology
  - an attribute maps to a literal-valued property/path
- There may be multiple correspondences in the ontology for a given element/attribute
(An illustrative example follows.)
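Purely as an illustration, with hypothetical schema paths and ontology names (not the actual correspondence syntax), such correspondences might look like:

```python
correspondences = {
    # XMLDSS construct -> candidate ontology concepts/paths (hypothetical)
    "protein":      ["bio:Protein"],
    "protein/seq":  ["bio:Protein/bio:hasSequence"],
    # an attribute maps to a literal-valued property; two candidates shown
    "protein/@id":  ["bio:Protein/bio:accession",
                     "bio:Protein/bio:identifier"],
}
```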

October 2007 Our approach: schema and data transformation
To reconcile a service producing data with schema X1 with a service consuming data with schema X2, a pathway is generated to transform X1 to X2:
- the correspondences are used to create pathways X1 to X1' and X2 to X2'
- the XMLDSS restructuring algorithm creates the pathway X1' to X2'
- hence the overall pathway is X1 to X1' to X2' to X2

October 2007 Architecture
A workflow tool could use our approach either dynamically or statically:
- Mediation service (dynamic):
  - the workflow tool invokes service S1 and receives its output
  - the workflow tool submits the output of S1, the schema of S2 and the two sets of correspondences to an AutoMed service
  - the AutoMed service transforms the output of S1 into a suitable input for consumption by S2
- Shim generation (static):
  - AutoMed is used to generate a shim for services S1 and S2
  - the XMLDSS schema transformation algorithm is currently tightly coupled with AutoMed functionality, but its output can be exported as a single XQuery query able to materialise the input of S2 from the data output by S1

October 2007 Conclusions
- Integrating biological data sources is hard!
- The overarching motivation is the potential to make scientific discoveries that can improve quality of life
- The technical challenges faced can lead to new, more generally applicable DI techniques
- Thus, biological data integration continues to be a rich field for multi- and interdisciplinary research between clinicians, biologists, bioinformaticians and computer scientists