Information and Communications Univ. Bioinformatics & Software Systems Lab. Woo-Hyuk Jang Integration of Biological XML data Ph. D. Lecture.


1 Information and Communications Univ. Bioinformatics & Software Systems Lab. Woo-Hyuk Jang Integration of Biological XML data Ph. D. Lecture

2 ICE Web-Based Software Development, Summer
Where are we?
[Course roadmap diagram. Labels: Server-Side Info. Management; Client-Side Info. Management; Business related Issues; Web Services; Internationalization and Privacy; XML & XML Processing; HTML, JavaScript, Plug-in, Applet…; WWW Concepts & Web-based Info. Management; S.S. Info. Management Concept; CGI, Java Servlets; JDBC, MySQL; App. of Web-based tech.; Semantic Web; This Lecture?]

3 Can you remember?
Problems in Integrating Heterogeneous Information
-Heterogeneity of formats, data types, units, or semantics.
Information Mediation
Fig 1. Mediator in Lecture 7.

4 This Lecture Contains…
Information Integration in Bioinformatics
-Bioinformatics Overview
  -Is there any relationship between Web and Bioinformatics?
-Difficulties in handling Biological XML data
-What is it? Why?
  -Cultures: Schema-driven, Data-driven
  -Models: Federation, Warehousing, Mediation
Integration of XML format data
-Problems
-Issues
Summary

5 Bioinformatics
In a narrow sense
-The application of information technology to life science research
  -Modeling (abstraction)
  -Analysis and collection
  -Data integration and information retrieval
-Enables the discovery and analysis of biomolecules and their properties (structure, function, interactions)
In a wide sense
-The use of computers to collect, analyze, and interpret biological information at the molecular level

6 Web and Bioinformatics
[Diagram. Actors: Biologist; Computer Scientist. Resources: Biological Data; Bio Applications. Arrow labels: "Experiment, Publish"; "Make, Publish"; "Use" (twice)]

7 Difficulties in Handling Biological XML data
Lack of standards
-Different data models and schemas
-Different handling methods are needed
-Different formats
Enormous volume of data
-It is growing exponentially
-Data are updated very frequently
  -Newly introduced data, error-fixed data

8 Why Integration?
In the post-human genome sequencing era, many analyses on the genome scale are possible
The majority of human diseases are the product of multi-step pathophysiological processes
The biggest challenge in interpreting the results of these analyses lies in the data integration problem

9 Two Cultures of Integration
Database Integration
-Schema level view
-Focus on outside of data
Data Integration
-Data level view
-Focus on inside of data
[Diagram labels: Schema 1; Schema 2; Schema 3; Schema 4; Data 1; Data 2; Data 3; Data 4]

10 Two Cultures of Integration
Schema-driven (computer scientists)
-Schemas are much smaller than the data, with (hopefully) well-defined elements
-Resolve redundancy and heterogeneity at the schema level
-High degree of automation once the system is set up
-Focus on methods: you rarely publish a "data paper"
Data-driven (biologists)
-Value is in the data; abstraction is a result of analysis
-Don't bother with schemas
  -Abstraction is volatile and depends on experimental technique
-Manual integration at data level, constant high effort
-You rarely publish a (database) "method paper"

11 Models of Integration
Federation (Multi-database)
Warehousing (Materialized in house)
Mediation (Virtual integration)

12 Models of Integration
Federation (Multi-database)
-K2/BioKleisli, Entrez

13 Models of Integration
Warehousing (Materialized in house)
-GUS (Genome Unified Schema), SRS (Sequence Retrieval System)
[Diagram labels: Local Operational Warehouse; Decision Support & Mining; Network; Internet; Integration & Storage; R3; R2]

14 Models of Integration
Mediation (Virtual integration)
-TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources)
[Diagram labels: Mediator; Network; Internet; Query Translation]

15 Models of Integration
Federation represents a more "static" approach, using agreed couplings to allow view creation.
Warehousing and Mediation address integration in a more "dynamic" way, using extraction, transformation and integration processes.

16 Warehousing vs. Mediation
Warehouse
-Update-driven: source updates are loaded into the warehouse repository
-Heterogeneous data is integrated in advance and stored in-house for direct query and analysis.
Mediation
-Wrapper and Mediator layer on top of source DBs.
-Query-driven: a query against the mediated schema is translated into queries appropriate to the sources.
-Results are integrated into a global answer set.
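The contrast between the two models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SOURCE_A`, `SOURCE_B`, the wrapper functions, and the field names are all hypothetical and do not come from the lecture or from any real system.

```python
# Two hypothetical source "databases" with different field names.
SOURCE_A = [{"acc": "P1", "protein": "BAX"}]
SOURCE_B = [{"id": "P2", "name": "BCL2"}]

def wrap_a(row):
    """Wrapper: rename source A's fields to the shared (mediated) schema."""
    return {"id": row["acc"], "name": row["protein"]}

def wrap_b(row):
    """Wrapper: source B already uses the shared field names."""
    return {"id": row["id"], "name": row["name"]}

class Warehouse:
    """Update-driven: data is integrated in advance and queried locally."""
    def __init__(self):
        self.rows = []

    def load(self):
        # Copy and convert everything from the sources into the local store.
        self.rows = [wrap_a(r) for r in SOURCE_A] + [wrap_b(r) for r in SOURCE_B]

    def query(self, name):
        return [r for r in self.rows if r["name"] == name]

class Mediator:
    """Query-driven: each query is translated for the sources at query time."""
    def query(self, name):
        hits = [wrap_a(r) for r in SOURCE_A if r["protein"] == name]
        hits += [wrap_b(r) for r in SOURCE_B if r["name"] == name]
        return hits
```

In this sketch, a row added to a source after `Warehouse.load()` stays invisible to the warehouse until the next load, while the mediator sees it immediately but pays the translation cost on every query.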

17 Now let's study the…
Information Integration in Bioinformatics
-Bioinformatics Overview
  -Is there any relationship between Web and Bioinformatics?
-Difficulties in handling Biological XML data
-Why Integration?
  -Cultures: Schema-driven, Data-driven
  -Models: Federation, Warehousing, Mediation
Integration of XML format data
-Problems
-Issues
Discussion about Reading Question #6

18 Integration of XML format data
Why XML?
-Biology is a complex discipline
-Wide variety of data resources and repositories
  -No standard protocol exists to interrogate biological data stores
  -No standard data format exists to exchange biological data
  -No standard data model exists
-Difficulties in using and exchanging data
There exist various tools that can support XML handling

19 Integration of XML format data
Problems
-We focus on schema-driven integration
-The warehousing model is efficient
  -We have to analyze the data
  -Performance
  -Implementing a perfect mediation model is extremely difficult
-XML data should be converted into an RDB
-We want to make our own DB schema accommodating the data from XML files
-We need to design the DB schema with regard to efficiency and our own purpose
-Heterogeneity and large scale

20 Integration of XML format data
PreSPI (Prediction System for Protein Interaction)
[Architecture diagram. Labels: General XML Wrapper (SAX); Sequence; Structure; Function; Domain; XML; Integration Rule; Local DB1; Local DB2; Local DB3; Warehouse; Local; Web]
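The diagram names a general XML wrapper built on SAX. A minimal sketch of such a wrapper, using Python's standard `xml.sax` module, might look like the following. The class `RecordHandler`, the function `parse_records`, and the element names used in the usage note are assumptions made for illustration; they are not PreSPI's actual code.

```python
import xml.sax

class RecordHandler(xml.sax.ContentHandler):
    """Streams a document and collects selected child elements per record."""

    def __init__(self, record_tag, fields):
        super().__init__()
        self.record_tag = record_tag  # element that delimits one record
        self.fields = set(fields)     # child elements to keep as columns
        self.rows = []
        self._row = None              # record currently being built
        self._field = None            # field currently being read
        self._text = []

    def startElement(self, name, attrs):
        if name == self.record_tag:
            self._row = {}
        elif self._row is not None and name in self.fields:
            self._field = name
            self._text = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self._field is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == self._field:
            self._row[name] = "".join(self._text).strip()
            self._field = None
        elif name == self.record_tag:
            self.rows.append(self._row)
            self._row = None

def parse_records(xml_text, record_tag, fields):
    """Parse an XML string and return one dict per record element."""
    handler = RecordHandler(record_tag, fields)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.rows
```

For example, `parse_records(doc, "protein", ["name", "organism"])` would return one dictionary per hypothetical `<protein>` element. Because SAX is event-based, the whole document is never held in memory as a tree, which matters for very large files.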

21 Issues of Using XML Biological data
Structure
-Semi-structured: can be expressed as trees or graphs
-In theory, it is ideal to map them into a DB according to their structural features
Methods for storing XML
-File system
  -Has overhead for queries
  -Text file, inverted list, compressed file
-Specific storing method
  -Uses XML's own structure
-DB system
  -Mapping into an RDB in particular has been researched a lot
  -Has overhead for converting into the appropriate model

22 Issues of Using XML Biological data
Object view of the XML
-Use DOM
-A Class can be mapped into a Table; PCDATA or ATTRIBUTE can be a column
XML              Objects          Tables
=============    =============    =============
                                  Table A
<A>              object A {       B    C    D
  <B>bbb</B>       B = "bbb"      bbb  ccc  ddd
  <C>ccc</C>       C = "ccc"
  <D>ddd</D>       D = "ddd"
</A>             }
XML-view
CREATE XMLVIEW xview_1( id char(20), char (30) ) AS (‘select from “file:/home/user1/personal.xml”, p; ‘);
“A generic load/extract utility for data transfer between XML documents and relational databases”, Bourret, R.; Bornhovd, C.; Buchmann, A.; Advanced Issues of E-Commerce and Web-Based Information Systems, 2000.
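The class-to-table idea can be tried out with a short Python sketch, assuming a flat element whose attributes and text-only children become columns. The helper names `element_to_row` and `store` are hypothetical, SQLite stands in for the relational database, and this is not the utility described in the cited paper.

```python
import sqlite3
import xml.etree.ElementTree as ET

def element_to_row(elem):
    """Map attributes and text-only child elements to column/value pairs."""
    row = dict(elem.attrib)
    for child in elem:
        if len(child) == 0:  # leaf element, i.e. it only holds PCDATA
            row[child.tag] = (child.text or "").strip()
    return row

def store(conn, elem):
    """Create a table named after the element (if needed) and insert one row."""
    row = element_to_row(elem)
    cols = sorted(row)
    col_defs = ", ".join(f'"{c}" TEXT' for c in cols)
    col_list = ", ".join(f'"{c}"' for c in cols)
    marks = ", ".join("?" for _ in cols)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{elem.tag}" ({col_defs})')
    conn.execute(
        f'INSERT INTO "{elem.tag}" ({col_list}) VALUES ({marks})',
        [row[c] for c in cols],
    )

conn = sqlite3.connect(":memory:")
store(conn, ET.fromstring("<A><B>bbb</B><C>ccc</C><D>ddd</D></A>"))
```

Here element `A` becomes table `A` with columns `B`, `C`, and `D`. Note that `ElementTree` builds the whole tree in memory, like DOM, so this sketch suits small documents only.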

23 Issues of Using XML Biological data
Direct method
[Diagram labels: XML Document (input); Mapping Rule (input); XML Saver; Insert Statement (output & execute)]
“A direct method of data exchange between XML and relational database” Bei Jia; Cai Fei; Tao Lie-Jun; Pan Jin-Gui; Information Technology Interfaces, th International Conference on 2004 Page(s): Vol.1
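The flow in the diagram (an XML document and a mapping rule go in, INSERT statements come out and are executed) can be approximated as follows. The rule format in `RULE`, the function `build_inserts`, and the element names are invented for this sketch and should not be read as the algorithm of the cited paper.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical mapping rule: which element is a record, and where each
# column's value comes from ("@x" means attribute x, otherwise a child path).
RULE = {
    "table": "protein",
    "record": "./protein",
    "columns": {"id": "@id", "name": "name", "organism": "organism"},
}

def build_inserts(xml_text, rule):
    """Yield (sql, values) pairs, one parameterized INSERT per record."""
    root = ET.fromstring(xml_text)
    cols = list(rule["columns"])
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        rule["table"], ", ".join(cols), ", ".join("?" for _ in cols)
    )
    for rec in root.findall(rule["record"]):
        values = []
        for path in rule["columns"].values():
            if path.startswith("@"):
                values.append(rec.get(path[1:]))    # attribute value or None
            else:
                values.append(rec.findtext(path))   # element text or None
        yield sql, tuple(values)

def save(conn, xml_text, rule):
    """Execute every generated INSERT against an existing table."""
    for sql, values in build_inserts(xml_text, rule):
        conn.execute(sql, values)
```

Keeping the rule as data rather than code means a new XML format only needs a new rule, which fits the slide's point that the process should stay easy when sources change.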

24 Issues of Using XML Biological data
Direct Method (cont'd)

25 Issues of Using XML Biological data
Current methods force the DB to follow the XML schema
Complex structured XML
-Elements share the same name even though they should be different columns in the DB (DIP, InterPro…)
Large file size; we cannot use DOM
XML is updated frequently; the process should be easy
[XML example: … B C … ID_B …]
ID    NAME       B     C     D     E
ID_A  PROTEIN_A  ID_B  ID_C  ID_D  ID_E
Rather than
ID    DB  ID
ID_A  B   ID_B
ID_A  C   ID_C
ID_A  D   ID_D
ID_A  E   ID_E

26 Issues of Using XML Biological data
The Direct Method cannot cover the following XML type
Cannot integrate two or more files; needs constraints
[XML example. Values: SwissProt SWP:Q07812; PIR PIR:A47538; NCBI GI: ; bcl-2-associated protein x, alpha splice form; Homo sapiens]
We want
ID   NAME        DIP_ID    SWP_ID  PIR_ID  GI_ID
G:1  BAXA_HUMAN  DIP:232N  Q07812  A
But,
ID   DB   Ref_ID
G:1  DIP  DIP:232N
G:1  SWP  Q07812
G:1  PIR  A47538
G:1  gi
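Turning the one-row-per-reference layout into the one-row-per-protein layout is a pivot. A small Python sketch of that step, assuming the references have already been extracted as `(id, db, ref_id)` triples, is given below; the function name `pivot_refs` and the rule for deriving column names are assumptions of this sketch.

```python
def pivot_refs(rows, dbs):
    """Pivot (id, db, ref_id) triples into one dict per id.

    Each database listed in `dbs` becomes a "<DB>_ID" column; a column stays
    None when the record has no reference to that database.
    """
    wide = {}
    for rec_id, db, ref_id in rows:
        # Create the wide row on first sight of this id, with empty columns.
        entry = wide.setdefault(
            rec_id, {"ID": rec_id, **{f"{d}_ID": None for d in dbs}}
        )
        key = f"{db.upper()}_ID"
        if key in entry:  # ignore databases we were not asked to keep
            entry[key] = ref_id
    return list(wide.values())

long_rows = [
    ("G:1", "DIP", "DIP:232N"),
    ("G:1", "SWP", "Q07812"),
    ("G:1", "PIR", "A47538"),
]
wide_rows = pivot_refs(long_rows, ["DIP", "SWP", "PIR", "GI"])
```

The GI reference is left out of `long_rows` because its value is missing from the transcript, so the `GI_ID` column simply stays `None` here.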

27 Issues of Using XML Biological data
Make a data set for each tuple, ignoring sub-document tree nodes
Define an SQL-like syntax
-A WHERE condition on each column for constraints
-Multiple files can be populated into one table by manipulation
CREATE TABLE PROTEIN_IDs(ID_A CHAR(20), NAME CHAR(20), B CHAR(20), C CHAR(20), D CHAR(20), E CHAR(20) ) AS ( SELECT ( = B, = C, = D, = E, ) FROM “file/protein.xml” AS FILE, “file/file.xml” AS FILE_2);

28 Summary
Integration of biological data is a kind of Web-based information management
Integration is very important work in bioinformatics because a comprehensive view lets us find more valuable biological information
Biological XML data have properties that hinder integration, so the schema-driven approach and the warehousing model are usually used for integration
Thank you

