Presentation is loading. Please wait.

Presentation is loading. Please wait.

Integrated Web Service for Database Queries and Computations

Similar presentations


Presentation on theme: "Integrated Web Service for Database Queries and Computations"— Presentation transcript:

1 Integrated Web Service for Database Queries and Computations
PRAGMA 11 Workshop, Osaka Univ., on October 16, 2006 Integrated Web Service for Database Queries and Computations Grid Web Services for Analog Queries to Protein 3D Structures Haruki Nakamura Institute for Protein Research, Lab. Protein Informatics, Osaka University

2 BioGrid at Osaka http://www.biogrid.jp
Started from 2002 Goals:Grid Technology Development for: Biotechnology (drug discovery) and Medical Sciences. Leader: Shinji Shimojo (CMC, Osaka Univ.) Government Support (MEXT): 5years H. Nakamura K. Yamaguchi T. Takada H. Matsuda S. Shimojo S. Date Integration of Data Grid and Computing Grid T. Akiyama to access Integrated and Standardized Databases with High-performance Computers, from anywhere and with low cost. First, I introduce our activity of BioGrid at Osaka University, at early stage of Grid. This has started about 4 years ago for the purpose of the Integration of Data Grid and Computing Grid, leaded by my colleague Shinji Shimojo and Hideo Matsuda.

3 Databases in Life Sciences nearly 1,000 DBs
Organism Medical Scale Large OMIM Organ Tissue Drug Data MEDLINE Cell MDDR Literature Function Sugar Chain Organelle Molecular Interaction Network Gene Ontology Gene Expression DIP Mass Analysis GEO KEGG 3D-Cordinates InterPro, Pfam Molecule Chemical Structure Swiss-Prot DDBJ/EMBL/GenBank H-Inv, FANTOM In the Life science field, there are so many databases nearly 1000, which have very heterogeneous properties:From atom level to Organism level, with very many different conceptual levels. PDB PubChem, Ligand, ChEBI Sequence Fine Atom Conceptual Level Less Conceputal (Formal) Highly Conceptual

4 BioDataGrid by Hideo Matsuda at Osaka Univ.
My colleague, Prof. Hideo Matsuda, has developed this system, BioDataGrid, to virtually integrate many heterogeneous DBs. Namely, when one query is submitted about the Protein-small compound Interaction, this system will try to find any descriptions in different DBs. It can be realized effectively by the result of PDBML. We use the grid technology as one of the most promising technologies that enable us to efficiently integrate heterogeneous resources for lead identification of drug. OGSA-DAI is a set of grid services that enables us to make various data resources accessible as grid services. By using OGSA-DAI, the database will be integrated virtually using web mechanisms such as SOAP to enable database services to operate within the XML scheme. In our system, the grid services are integrated by using Globus Toolkit 3 with OGSA-DAI and Web Service. Such integrated databases were grouped according to the domain of their contents among disease information, gene-protein information, compound information and drug metabolism information. Several types of metadata were designed to bridge information among the databases, such as protein, compound and interaction metadata. Interaction metadata provides a link between information found in databases of different domains such as protein-compound interaction. Protein and compound metadata provide links between information found in databases of the same domains. Bio-related databases are classified into categories including according to the Metadata. They are seamlessly federated by our Data Grid system.

5 BioGrid at Osaka http://www.biogrid.jp
Started from 2002 Goals:Grid Technology Development for: Biotechnology (drug discovery) and Medical Sciences. Leader: Shinji Shimojo (CMC, Osaka Univ.) Government Support (MEXT): 5years H. Nakamura K. Yamaguchi T. Takada H. Matsuda S. Shimojo S. Date Integration of Data Grid and Computing Grid T. Akiyama to access Integrated and Standardized Databases with High-performance Computers, from anywhere and with low cost. In our BioGrid activity, I have leaded the Computing Grid Technology Team.

6 BioPfuga: Biosimulation Platform United on Grid Architecture
Nakamura et al. (2004) New Generation Computing, 22, BioPfuga: Biosimulation Platform United on Grid Architecture A platform where individual application programs at the different levels are united to execute a hybrid computation. In particular, BioPfuga is a platform for biosimulation.

7 Construction of BioPfuga
・Propose and Design Communication between program components by XML description   (BMSML: BioMolecular Simulation-ML) , and Creation of the APIs to handle it. ・Divide programs into a set of many components Different program modules are then implemented as the service programs based on the OGSA (Open Grid Service Architecture) mechanism. For the platform BioPfuga, we first popose and design the new communication system between program components by XML description. We currently call it as BioMolecular Simulation-ML, BMS-ML. Then, we divide programs into a set of many components, and individual programs are made as the service programs based on the OGSA mechanism.

8

9 Construction of BioPfuga
・Propose and Design Communication between program components by XML description   (BMSML: BioMolecular Simulation-ML) , and Creation of the APIs to handle it. ・Divide programs into a set of many components Different program modules are then implemented as the service programs based on the OGSA (Open Grid Service Architecture) mechanism. For the platform BioPfuga, we first popose and design the new communication system between program components by XML description. We currently call it as BioMolecular Simulation-ML, BMS-ML. Then, we divide programs into a set of many components, and individual programs are made as the service programs based on the OGSA mechanism.

10 Coupled simulation on Grids
Quantum mechanics (QM) and Molecular mechanics (MM) simulation coupled on BioPfuga Coupled simulation on Grids MM area Active site of an enzyme Hamiltonian of total system Htotal = HQM(x,x) + HQM/MM(x,y) + HMM(y,y) Calculate in QM area Calculate in MM area Grid computing architecture is more useful for the multi scale simulations, combineing several different-level simulations. The Hamiltonian, or the system energy, is described by the summation of the two energy factors and the Interaction term between QM and MM. The Hqm calculation is tightly coupled one, and can be done with one PC-cluster or any computer system. Simultaneously, Hmm calculation is also tightly coupled, but the Interaction between QM and MM is not very tightly coupled, rather loosely coupled one. Ligand QM area (Yonezawa et al. at SC2004)

11 Simulation of Protein-solvent structure
PKCd-C1B Simulation of Electronic structure (1) AMOSS Ab initio Molecular Orbital Simulation for Supercomputer 856 atoms, 8672 AO’s developed by NEC Quantum Chemistry Group Rapid computation for huge molecular systems (20,000 basis for RHF, 1,000 basis for CASSCF and MP2), with high parallel performance Simulation of Electronic structure (2) GSO-X generalized spin density function theory (DFT) developed by Shusuke Yamanaka & Kizashi Yamaguchi, at Osaka Univ. Simulation of Protein-solvent structure prestoX-basic Protein Engineering Simulator eXtended –basic versin- For the simulation of the electronic structure, two big original programs are used, AMOSS mainly for HF calculations, and GSO-X for density function theory. AMOSS is developed by NEC quantum chemistry group, leaded by Toshikazu Takada, and GSO-X is developed by Shusuke Yamanaka and Kizashi Yamaguchi, who are quantum chemists in Osaka Universitity. For protein-solvent system, our group in Osaka Univ. and JBIRC at the Tokyo Bay develop our own molecular mechanics program, which includes the multi-canonical MD and Tsallis dynamics simulations to calculate free energy landscapes for peptide and proteins. developed by Yoshifumi Fukunishi at JBIRC-AIST and Haruki Nakamura at Institute for Protein Research, Osaka Univ.

12 A hybrid-QM/MM with BioPfuga on the Grid Service@SC2004
Client controller Pittsburgh @IPR Osaka Site-A Viewer portal @TokyoIT MM(MD) Site-B prestoX Service GT3 AMOSS Service GT3 AMOSS(HF) QM(HF) MPI 30CPU @CMC Osaka Site-C BMSML BMSML QM(DFT) 8CPU GSO-X Service GT3 prestoX (MPI) 8-boards Thus, our hybrid-QM/MM system is like this. At Site-A computer, QM computation is performed with MPI, and at site-B computer, MM computation is done. The characteristic features of our system is that we also use the special purpose boards, MD-Grape2, for molecular dynamics, which has been developed by Toshikazu Ebisuzaki in Riken. Each board is 50 GFlops, and we can only add many boards to perform a parallel computation, at the moment we have 8 boards attached. By the insertion of the Adapter program, the implementation of this MD-grape2 board is also easily realized. MPI 50CPU MD-Grape2 BMSML GSO-X(DFT) (Special purpose computing board: max 50GFLOPS)

13 Configuration of a computing system on a Grid
environment with several different program modules. Portal/ Client Program User Communicate using SOAP Personal user Distributed computing System Service QM(HF) Service MM Service QM (DFT) In that system, different comutaionts were made in the corresponding different sites, which had the best suited system for the individual computations. SOAP SOAP

14 BioGrid at Osaka http://www.biogrid.jp
H. Nakamura K. Yamaguchi T. Takada H. Matsuda T. Akiyama S. Shimojo S. Date Integration of Data Grid and Computing Grid to access Integrated and Standardized Databases with High-performance Computers, from anywhere and with low cost. Started from 2002 Goals:Grid Technology Development for: Biotechnology (drug discovery) and Medical Sciences. Leader: Shinji Shimojo (CMC, Osaka Univ.) Government Support (MEXT): 5years Our final goal is Integration of Data Grid and Computing Grid. In our BioGrid activity, I have leaded the Computing Grid Technology Team.

15 Integration of computing systems and Database query systems on a Grid
Portal/ Client Program User Communicate using SOAP Personal user site-A site-B site-C Distributed Computing System GT Service A GT GT Service B Service C SOAP GRID computing services H. Nakamura, BioGrid2005 (March 9, 2005)

16 Integration of computing systems and Database query systems on a Grid
Portal/ Client Program User Communicate using SOAP Personal user site-A site-B site-C Distributed Computing and DB System GT Service A GT GT Service B Service C SOAP GRID computing services Database query services H. Nakamura, BioGrid2005 (March 9, 2005)

17 Proposed Portal for Multiple Databases for Protein Structures
Berman, H. et al. (2006) Structure, 14, Theoretical Model DB 3 Theoretical Model DB 1 Theoretical Model DB 2 Figure 1. A Schematic for the Proposed Portal: The portal is envisioned to be a modular web services architecture achieved by using an implementation of the Simple Object Access Protocol (SOAP that allows for seamless data exchange between the portal and all registered contributors. SOAP is a light weight protocol for exchange of information in a decentralized, distributed environment. It is an XML-based protocol, which defines a framework for representing remote procedure calls and responses. The XML wrappers basically map the individual contributing metadata format into an XML format that the data portal understands (Drinkwater et al., 2004). These services will be platform and language independent allowing other services (other portals or clients) to communicate with the data portal (see

18 PDB (Protein Data Bank): Protein Tertiary Structure Database
Atom species and coordinates, amino acid residues, any other experimental results with raw data, experimental methods, and the conditions. Protein Tertiary Structure X-ray crystallography, NMR, and Electron microscope experiments.

19 Kim Henrick, Helen Berman, Haruki Nakamura

20 International collaboration
in wwPDB Rutgers Univ. UCSD NIST PDBj EBI RCSB Curation, data processing, and registration are made by all the members, collaborating with each other. 2) We have a single data archive, which is looked after by one “archive keeper (RCSB)”. 3) Data format and new descriptions will be discussed among the members. 4) Members are encouraged to develop their own browsers, viewers, and other APIs and services. (Berman, Henrick & Nakamura (2003) Nat. Struct. Biol. 10, 980)

21 Protein Data Bank Japan
At Institute for Protein Research, Osaka Univ. since 2001 assisted from the Institute for Bioinformatics Research and Development, Japan Science and Technology Agency (BIRD-JST).

22 Processed data numbers at PDBj
year Yearly wwPDB registration number Total wwPDB registration number Yearly PDBj registration number Total registration number Yearly registration number Now, our contribution is about 20% in the world. 5501 (2004), wwPDB , 1586 in PDBj, 700 in Japan We process more than 30 % deposited data of the entire world, mainly from Asian and Oceania regions

23 Maintain Format Standards
PDB (conventional and flat) PDB Exchange (mmCIF) Mechanism for extension based on new demands PDBML Derived from mmCIF All entries converted to XML Automatic translation from mmCIF data files and dictionaries 3-styles of translation released (Westbrook, Ito, Nakamura, Henrick, Berman (2005) Bioinformatics, 21, )

24 PDBML: canonical XML description of PDB data, developed by the wwPDB.
(Westbrook et al. (2005) Bioinformatics, 21, ) ftp://ftp.pdbj.org/ → No validation errors for more than 39,000 PDB file description.

25 Examples for the atom coordinates
<PDBx:atom_siteCategory> <PDBx:atom_site id="1"> <PDBx:group_PDB>ATOM</PDBx:group_PDB> <PDBx:type_symbol>N</PDBx:type_symbol> <PDBx:label_atom_id>N</PDBx:label_atom_id> <PDBx:label_comp_id>THR</PDBx:label_comp_id> <PDBx:label_asym_id>A</PDBx:label_asym_id> <PDBx:label_entity_id>1</PDBx:label_entity_id> <PDBx:label_seq_id>1</PDBx:label_seq_id> <PDBx:Cartn_x>17.047</PDBx:Cartn_x> <PDBx:Cartn_y>14.099</PDBx:Cartn_y> <PDBx:Cartn_z>3.625</PDBx:Cartn_z> <PDBx:occupancy>1.00</PDBx:occupancy> <PDBx:B_iso_or_equiv>13.79</PDBx:B_iso_or_equiv> <PDBx:auth_seq_id>1</PDBx:auth_seq_id> <PDBx:auth_comp_id>THR</PDBx:auth_comp_id> <PDBx:auth_asym_id>A</PDBx:auth_asym_id> <PDBx:auth_atom_id>N</PDBx:auth_atom_id> <PDBx:pdbx_PDB_model_num>1</PDBx:pdbx_PDB_model_num> </PDBx:atom_site> Full-tag description For the atom coordinates, the full-tag description data becomes very huge, and so we have invented another separated coordinate file with this simplified description. Separated file for coordinates <atom_record id="1">ATOM 1 A A 1 1 ? . THR THR N N N </atom_record>

26 Applications of PDBML Browser at PDBj with the native XML DB
Database extension and annotation SOAP (Simple Object Access Protocol) services at PDBj Molecular graphics viewer (jV) , which directly parses PDBML, displaying the information written in XML (Kinoshita & Nakamura (2004) Bioinformatics, 20, , Westbrook et al (2005) Bioinformatics, 21, )

27

28 The functional information can be searched by this query, or by using SOAP service,

29 Wild-card "*" can be used: *2as,  12*,  *

30 "and" and wild-card are available.

31

32

33

34 The functional information can be searched by this query, or by using SOAP service,

35 Search proteins, which have one or more helices,
which are longer than 10. /datablock[struct_confCategory/struct_conf The functional information can be searched by this query, or by using SOAP service,

36

37 xPSSS new facility Both XQuery and XPath are available.

38 Search for all entries and helix of length with helix of length greater than or equal to 10 residues:

39

40

41 Queries by XQuery and XPath at our new xPSSS

42 PDBjViewer or jV http://www.pdbj.org/PDBjViewer/
Offers interactive molecular visualization with RasMol type commands (source code free). Can be used both as Stand- alone and as applet with Java. Any polygons defined by XML can be displayed and manipulated. Can parse PDBML files and display the molecules.

43 PIR/GenBank/KEGG/GDB/ for the primary citations
xPSSS Database System  Archive (RCSB-PDB /MSD-EBI /PDBj) Internet download (FTP) FTP server Web server downloader xPSSS PDBML AddInformation XSLT processor Native XML-DB PDBMLplus CATRES Data Function/ Source Information PDBMLplus PDBMLplusF Annotation Data PDBMLplus Filtering & Recostructing Loader PDBMLplusF Get/Input Tools DDBJ SwisProt/UniProt PIR/GenBank/KEGG/GDB/ ProTherm/EzCatDB EBI/CSA /CATRES PDF files for the primary citations Manual input from literatures

44 Development of Secondary Databases: data mining
Protein Molecular Surface Database, eF-site (Kinoshita & Nakamura) Protein Dynamics Database, ProMode (Wako & Endo) Alignment of Structural Homologues, ASH (Toh) Sequence Navigator & Structure Navigator (Standley) Encyclopedia of Protein Structures, eProtS (Ito & Nakamura)

45 Superfolds of Proteins

46 Identify Protein Function from ・Sequence similarity
・ Fold similarity In particular, the British group, Janet Thornton and Christine Orengo have so far pushed the method of using “Fold similarity” to identify or estimate the protein function. They have succeeded to some extent, but still there are many cases where there is no fold similarities.

47 Compare Protein folds/architectures

48 DNA-binding domain of Bovine Papilloma Virus-1 E2 (2BOP)
RNA-binding domain of the U1A spliceosomal protein (1URN) Distance Matrix

49 Map of the entire protein folds
498 SCOP domains Dali (distance matrix alignment) Hou et al. (2003) Proc. Natl. Acad. Sci. USA, 100,

50 Structure Navigator http://www.pdbj.org/strucnavi/
Rapidly generates structure neighbors for any PDB ID Both an HTTP and SOAP interface are available Alignments are displayed in a readable format 3D coordinates can be viewed or downloaded Structural similarity is defined by the Number of Equivalent Residues (NER1) Standley, D.M. et al. (2005) 1Standley D, Toh H, Nakamura H. Proteins 2004;57:

51 Structure Navigator Real-Time Server
Blue: Query Red: Temp1 Standley, D.M. et al. (2005)

52 Real-Time Query Strategy
Clustered Templates (5,097 representatives) Nearest Cluster Standley, D.M. (2006)

53 Structure Navigator-RT could reply the query quickly, but ...
When we open the Web service of Structure Navigator-RT and accept many requests, our computer resource may not be enough. We want to use large computing resource somewhere else. We need a simple and flexible tool for this purpose Grid Web Service

54 Grid Web Service using Opal
SDSC Technical Report TR , 2006. Opal is a toolkit for wrapping application programs as Web services, providing scheduling, Grid security, and data management.

55 Grid Web Service using Opal
CCGrid 2006

56 Opal Operation Provider by W. W. Li & K
Opal Operation Provider by W.W. Li & K. Ichikawa (under the PRIUS project) Opal is deployed as one of Operation Providers of Globus Toolkit 4 (GT4): Opal Operation Provider (Opal-OP) Opal OP Toolkit Application Service (WSRF) Opal OP Operation Providers for WSRF Operation Providers for Notification GetRPProvider SubscribeProvider SetRPProvider Opal NotificationConsumer Provider QueryRPProvider Globus Toolkit 4 (GT4)

57 System Architecture @IPR, Osaka Univ. (Suita) @CMC, Osaka Univ.
End user Request Query Portal Server Apache Tomcat (jsp -Input page/servlet -Opal-OP client program) Library of GT4 Library of Operation Provider Library of Operation Provider Service @IPR, Osaka Univ. (Suita) Computing Server (PC-cluster 16 nodes) Submit Job @CMC, Osaka Univ. (Toyonaka) Head Node (1 PC node): Tomcat Opal Operation Provider Service Pre-WS GRAM GT4 OpenPBS (launch pbs_server, pbs_sched, pbs_mon) Run the Application program Job scheduling 15 Worker nodes: openPBS (launch pbs_mon) Run the Application program Process is created by the local scheduler, openPBS. Launch Launch

58 Structure Navigator-RT with Opal-OP on GRID
Query: 3D fold (1crn)

59

60

61

62

63

64

65 Blue: Query Structure Red: Second nearest cluster to the Query Query structure

66 Conclusion We implemented the Opal Operation Provider to run Structure Navigator-RT as a Grid Web Service, using the local scheduler, openPBS, to run our application program on the PC cluster at the Cyber-Media Center, Osaka University, through the network inside Osaka University.

67 Identify Protein Function from ・Sequence similarity
・ Fold similarity ・ Surface similarity In particular, the British group, Janet Thornton and Christine Orengo have so far pushed the method of using “Fold similarity” to identify or estimate the protein function. They have succeeded to some extent, but still there are many cases where there is no fold similarities.

68 eF-site database for Ligand recognition sites
/eF-site/ Kinoshita, Furui, Nakamura (2002) J. Struct. Funct. Genomics 2, 9-22. Kinoshita, Nakamura (2004) Bioinformatics 20,

69 Similarity was found here.
Function identification of hypothetical proteins Kinoshita & Nakamura (2003) Protein Science 12, Kinoshita & Nakamura (2005) Protein Science, search result against eF-site/ActiveSite + eF-site/mono-nucleotide (1684 entries) query: MJ0226 free form Similarity was found here. As an example for the application to a hypothetial protein, MJ0226, whose 3D structure was determined by Sun-Hou Kim’s group, and the ligand, ATP, was found from the local fold similarity. So, we first made the molecular surface of the apo-conformation, and computed the electrostatic potential. Then without giving any information of what is the ligand or where is the active site, the query was just the entire molecular surface, against the eF-site database for the functional sites. This graph is the result of the query, and these two proteins were found having the high Z-score. Those two functional sites are the funstion of ATP hydrolysis. Folylpolyglutamate synthetase Pyruvate kinase

70 eF-seek: Query service for search of similar molecular surfaces (b-version)
Kinoshita, K. & Nakamura, H. (2006) ef-site.hgc.jp /eF-seek/

71 Application of Grid to Web Service of PDBj
Query for the analog structural data: Protein folds and molecular surfaces   eF-site Database Result Query for molecular surface Looking for the similar surfaces (eF-seek) Quick response by using Grid computing Result New Structure Those computations to search the similar molecular surfaces and search the similar folds require tough computing resources, because they are essentially analog data: Shape of the molecule or some physico-chemical properties of molecules. Therefore, GRID computation is a good solution to realize the quick response to many of such queries. The usage of the new Grid service system will be applied to our service at PDBj. The legacy scheduling system for the current Web service now uses Open JMS (Java Message Service). JMS(Java Message Service)はJavaプログラムにおいて非同期にメッセージを交換するためのインターフェース。 Blue: Query Red: Temp1 Query for fold Looking for the similar folds (Structure Navigator Real-Time server) Fold Database

72 Integration of computing systems and Database query systems at PDBj
Portal/ Client Program User Communicate using SOAP Personal user site-A site-B site-C Distributed Computing and DB System XMLDB GT Service A GT Service B Service C GRID computing services with Opal-OP Database query services with XQuery/XPath Analog Query Search Text Query Search

73 Acknowledgements ・Reiko Yamashita, Daron M. Standley (BIRD-JST / Inst. Protein Res., Osaka Univ.) ・Kohei Ichikawa, Susumu Date, Shinji Shimojo (Cyber Media Center, Osaka Univ.) ・Wilfred W. Li , Peter Arzberger (UCSD) ・Hiroyuki Toh (Medical Inst. Bioregulation, Kyushu Univ) ・Kengo Kinoshita (Inst. Med. Science, Univ. Tokyo) ・Other PDBj members (BIRD-JST / Inst. Protein Res., Osaka Univ.) ・PRIUS

74


Download ppt "Integrated Web Service for Database Queries and Computations"

Similar presentations


Ads by Google