Presentation on theme: "Integrated Web Service for Database Queries and Computations"— Presentation transcript:
1 Integrated Web Service for Database Queries and Computations: Grid Web Services for Analog Queries to Protein 3D Structures. PRAGMA 11 Workshop, Osaka Univ., October 16, 2006. Haruki Nakamura, Institute for Protein Research, Lab. Protein Informatics, Osaka University
2 BioGrid at Osaka http://www.biogrid.jp Started in 2002. Goals: Grid technology development for biotechnology (drug discovery) and medical sciences. Leader: Shinji Shimojo (CMC, Osaka Univ.). Government support (MEXT): 5 years. H. Nakamura, K. Yamaguchi, T. Takada, H. Matsuda, S. Shimojo, S. Date, T. Akiyama. Integration of Data Grid and Computing Grid to access integrated and standardized databases with high-performance computers, from anywhere and at low cost. First, I introduce our BioGrid activity at Osaka University, which began at an early stage of Grid computing. It started about 4 years ago with the aim of integrating Data Grid and Computing Grid, led by my colleagues Shinji Shimojo and Hideo Matsuda.
3 Databases in Life Sciences: nearly 1,000 DBs. [Figure: databases arranged by scale (large to fine: organism, organ, tissue, cell, organelle, molecule, atom) and by conceptual level (less conceptual/formal to highly conceptual). Labels include medical, drug data, literature, function, sugar chain, molecular interaction, network, gene expression, mass analysis, 3D coordinates, chemical structure, and sequence, with databases such as OMIM, MEDLINE, MDDR, Gene Ontology, DIP, GEO, KEGG, InterPro, Pfam, Swiss-Prot, DDBJ/EMBL/GenBank, H-Inv, FANTOM, PDB, PubChem, Ligand, and ChEBI.] In the life science field there are nearly 1,000 databases with very heterogeneous properties, from the atomic level to the organism level, spanning many different conceptual levels.
4 BioDataGrid by Hideo Matsuda at Osaka Univ. My colleague, Prof. Hideo Matsuda, has developed this system, BioDataGrid, to virtually integrate many heterogeneous DBs. Namely, when a query about protein-small compound interactions is submitted, the system tries to find relevant descriptions in the different DBs. This works effectively thanks to PDBML. We use grid technology as one of the most promising technologies for efficiently integrating heterogeneous resources for drug lead identification. OGSA-DAI is a set of grid services that makes various data resources accessible as grid services. With OGSA-DAI, the databases are integrated virtually using web mechanisms such as SOAP, so that database services operate within the XML scheme. In our system, the grid services are integrated using Globus Toolkit 3 with OGSA-DAI and Web Services. The integrated databases are grouped by the domain of their contents: disease information, gene-protein information, compound information, and drug metabolism information. Several types of metadata (protein, compound, and interaction metadata) were designed to bridge information among the databases. Interaction metadata links information found in databases of different domains, such as protein-compound interactions. Protein and compound metadata link information found in databases of the same domain. Bio-related databases are classified into categories according to the metadata and are seamlessly federated by our Data Grid system.
5 BioGrid at Osaka http://www.biogrid.jp In our BioGrid activity, I have led the Computing Grid Technology Team.
6 BioPfuga: Biosimulation Platform United on Grid Architecture (Nakamura et al. (2004) New Generation Computing, 22, ). A platform where individual application programs at different levels are united to execute a hybrid computation. In particular, BioPfuga is a platform for biosimulation.
7 Construction of BioPfuga ・Propose and design communication between program components by XML description (BMSML: BioMolecular Simulation-ML), and create the APIs to handle it. ・Divide programs into a set of many components. Different program modules are then implemented as service programs based on the OGSA (Open Grid Service Architecture) mechanism. For the BioPfuga platform, we first propose and design a new communication system between program components using an XML description. We currently call it BioMolecular Simulation-ML, BMS-ML. Then we divide programs into a set of many components, and the individual programs are built as service programs based on the OGSA mechanism.
10 Coupled simulation on Grids: quantum mechanics (QM) and molecular mechanics (MM) simulation coupled on BioPfuga. [Figure: active site of an enzyme with a ligand in the QM area, surrounded by the MM area; one calculation runs in the QM area and one in the MM area.] Hamiltonian of the total system: $H_{\mathrm{total}} = H_{\mathrm{QM}}(x,x) + H_{\mathrm{QM/MM}}(x,y) + H_{\mathrm{MM}}(y,y)$. Grid computing architecture is more useful for multi-scale simulations that combine several simulations at different levels. The Hamiltonian, or system energy, is the sum of the two energy terms and the interaction term between QM and MM. The $H_{\mathrm{QM}}$ calculation is tightly coupled and can be done on one PC cluster or any single computer system. The $H_{\mathrm{MM}}$ calculation, which runs simultaneously, is also tightly coupled, but the interaction between QM and MM is only loosely coupled. (Yonezawa et al. at SC2004)
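The decomposition of the total Hamiltonian into a QM term, an MM term, and a QM/MM interaction term can be illustrated with a toy calculation. The sketch below is only an analogy: it assumes a classical 1/r pair potential and made-up coordinates (real QM energies are not pairwise sums), and all function names are hypothetical. It shows that the two self terms depend on one region each, so they can be computed independently, while only the cross term needs data from both regions.

```python
import itertools
import math

def pair_energy(a, b):
    """Toy pair potential 1/r between two points (illustrative only)."""
    return 1.0 / math.dist(a, b)

def self_energy(points):
    """Pair terms inside one region; stands in for H_QM(x,x) or H_MM(y,y)."""
    return sum(pair_energy(a, b) for a, b in itertools.combinations(points, 2))

def cross_energy(xs, ys):
    """Pair terms between the two regions; stands in for H_QM/MM(x,y)."""
    return sum(pair_energy(a, b) for a in xs for b in ys)

def total_energy(xs, ys):
    # Each self term uses only its own region's coordinates, so the two
    # could run on separate machines; only the cross term needs both.
    return self_energy(xs) + cross_energy(xs, ys) + self_energy(ys)

# Made-up coordinates: x is the small "QM" region, y the "MM" region.
qm_atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
mm_atoms = [(0.0, 3.0, 0.0), (4.0, 0.0, 0.0), (0.0, 0.0, 5.0)]

h_total = total_energy(qm_atoms, mm_atoms)
# Reference value: the same toy potential summed over every pair at once.
h_direct = self_energy(qm_atoms + mm_atoms)
print(h_total, h_direct)
```

For this toy potential the three-term sum matches the all-pairs sum up to floating-point rounding, which is the bookkeeping that lets the two regions be handled at different sites.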
11 Simulation of Protein-solvent structure and of Electronic structure. [Figure: PKCd-C1B; 856 atoms, 8672 AOs.] Simulation of electronic structure (1): AMOSS (Ab initio Molecular Orbital Simulation for Supercomputer), developed by the NEC Quantum Chemistry Group. Rapid computation for huge molecular systems (20,000 basis for RHF; 1,000 basis for CASSCF and MP2), with high parallel performance. Simulation of electronic structure (2): GSO-X, generalized spin density functional theory (DFT), developed by Shusuke Yamanaka & Kizashi Yamaguchi at Osaka Univ. Simulation of protein-solvent structure: prestoX-basic (Protein Engineering Simulator eXtended, basic version), developed by Yoshifumi Fukunishi at JBIRC-AIST and Haruki Nakamura at the Institute for Protein Research, Osaka Univ. For the simulation of electronic structure, two large original programs are used: AMOSS, mainly for HF calculations, and GSO-X for density functional theory. AMOSS is developed by the NEC quantum chemistry group, led by Toshikazu Takada, and GSO-X is developed by Shusuke Yamanaka and Kizashi Yamaguchi, who are quantum chemists at Osaka University. For the protein-solvent system, our group at Osaka Univ. and JBIRC at Tokyo Bay develop our own molecular mechanics program, which includes multicanonical MD and Tsallis dynamics simulations to calculate free energy landscapes for peptides and proteins.
12 A hybrid-QM/MM with BioPfuga on the Grid Service @SC2004. [Diagram labels: Client controller (Pittsburgh); Viewer portal; sites @IPR Osaka, @TokyoIT, @CMC Osaka; Site-A, Site-B, Site-C; prestoX Service (GT3), AMOSS Service (GT3), GSO-X Service (GT3); MM (MD), QM (HF), QM (DFT); prestoX (MPI), AMOSS (HF), GSO-X (DFT); MPI 30 CPU, MPI 50 CPU, 8 CPU; BMSML exchanged between services; MD-Grape2, 8 boards (special-purpose computing board: max 50 GFLOPS).] Our hybrid-QM/MM system looks like this. On the Site-A computer the QM computation is performed with MPI, and on the Site-B computer the MM computation is done. A characteristic feature of our system is that we also use special-purpose boards for molecular dynamics, MD-Grape2, developed by Toshikazu Ebisuzaki at Riken. Each board delivers 50 GFLOPS, and we can simply add more boards to perform a parallel computation; at the moment we have 8 boards attached. By inserting an adapter program, the MD-Grape2 board is also easily integrated.
13 Configuration of a computing system on a Grid environment with several different program modules. [Diagram: a personal user's portal/client program communicates using SOAP with a distributed computing system made of a QM (HF) service, an MM service, and a QM (DFT) service.] In that system, the different computations were run at different sites, each of which had the system best suited to its computation.
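How a portal or client program might talk to such services over SOAP can be sketched as follows. This is a minimal illustration using only the Python standard library: the SOAP 1.1 envelope namespace is the standard one, but the service namespace `urn:example:biosim`, the operation name `runMM`, and its parameters are hypothetical and are not taken from BioPfuga. The sketch only builds and parses the message; it does not send anything over the network.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 namespace
SERVICE_NS = "urn:example:biosim"  # hypothetical service namespace

def build_request(operation, params):
    """Return a SOAP 1.1 envelope (as text) that calls `operation`."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")

def read_request(xml_text):
    """Parse an envelope back into (operation, params), as a service would."""
    envelope = ET.fromstring(xml_text)
    call = envelope.find(f"{{{SOAP_ENV}}}Body")[0]  # first element in Body
    operation = call.tag.split("}", 1)[1]           # strip the namespace
    params = {child.tag.split("}", 1)[1]: child.text for child in call}
    return operation, params

# Hypothetical call: ask an MM service to run 1000 steps on an input file.
request = build_request("runMM", {"steps": 1000, "input": "system.bmsml"})
operation, params = read_request(request)
print(operation, params)
```

Because the message is plain XML, the client and each service can be written in different languages and run at different sites, which is the property the configuration above relies on.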
14 BioGrid at Osaka http://www.biogrid.jp Our final goal is the integration of Data Grid and Computing Grid, to access integrated and standardized databases with high-performance computers, from anywhere and at low cost.
15 Integration of computing systems and Database query systems on a Grid. [Diagram: a personal user's portal/client program communicates using SOAP with a distributed computing system; GT services A, B, and C at site-A, site-B, and site-C provide Grid computing services.] H. Nakamura, BioGrid2005 (March 9, 2005)
16 Integration of computing systems and Database query systems on a Grid. [Diagram: a personal user's portal/client program communicates using SOAP with a distributed computing and DB system; GT services A, B, and C at site-A, site-B, and site-C provide Grid computing services and database query services.] H. Nakamura, BioGrid2005 (March 9, 2005)
17 Proposed Portal for Multiple Databases for Protein Structures (Berman, H. et al. (2006) Structure, 14, ). [Diagram labels: Theoretical Model DB 1, Theoretical Model DB 2, Theoretical Model DB 3.] Figure 1. A Schematic for the Proposed Portal: The portal is envisioned to be a modular web services architecture achieved by using an implementation of the Simple Object Access Protocol (SOAP) that allows for seamless data exchange between the portal and all registered contributors. SOAP is a light weight protocol for exchange of information in a decentralized, distributed environment. It is an XML-based protocol, which defines a framework for representing remote procedure calls and responses. The XML wrappers basically map the individual contributing metadata format into an XML format that the data portal understands (Drinkwater et al., 2004). These services will be platform and language independent allowing other services (other portals or clients) to communicate with the data portal (see
18 PDB (Protein Data Bank): Protein Tertiary Structure Database. Contents: atom species and coordinates, amino acid residues, other experimental results with raw data, experimental methods, and conditions. Protein tertiary structures come from X-ray crystallography, NMR, and electron microscopy experiments.
20 International collaboration in wwPDB. [Diagram labels: Rutgers Univ., UCSD, NIST, PDBj, EBI, RCSB.] 1) Curation, data processing, and registration are carried out by all the members, collaborating with each other. 2) We have a single data archive, which is looked after by one “archive keeper” (RCSB). 3) Data formats and new descriptions are discussed among the members. 4) Members are encouraged to develop their own browsers, viewers, and other APIs and services. (Berman, Henrick & Nakamura (2003) Nat. Struct. Biol. 10, 980)
21 Protein Data Bank Japan. At the Institute for Protein Research, Osaka Univ., since 2001, supported by the Institute for Bioinformatics Research and Development, Japan Science and Technology Agency (BIRD-JST).
22 Processed data numbers at PDBj. [Chart: yearly and total registration numbers by year, for wwPDB and for PDBj.] Now, our contribution is about 20% of the world total. Figures shown: 5501 (2004), wwPDB, 1586 in PDBj, 700 in Japan. We process more than 30% of the data deposited worldwide, mainly from the Asian and Oceania regions.
23 Maintain Format Standards. PDB (conventional and flat). PDB Exchange (mmCIF): mechanism for extension based on new demands. PDBML: derived from mmCIF; all entries converted to XML; automatic translation from mmCIF data files and dictionaries; 3 styles of translation released. (Westbrook, Ito, Nakamura, Henrick, Berman (2005) Bioinformatics, 21, )
24 PDBML: canonical XML description of PDB data, developed by the wwPDB. (Westbrook et al. (2005) Bioinformatics, 21, ) ftp://ftp.pdbj.org/ → No validation errors for more than 39,000 PDB file descriptions.
25 Examples for the atom coordinates. Full-tag description:
  <PDBx:atom_siteCategory>
    <PDBx:atom_site id="1">
      <PDBx:group_PDB>ATOM</PDBx:group_PDB>
      <PDBx:type_symbol>N</PDBx:type_symbol>
      <PDBx:label_atom_id>N</PDBx:label_atom_id>
      <PDBx:label_comp_id>THR</PDBx:label_comp_id>
      <PDBx:label_asym_id>A</PDBx:label_asym_id>
      <PDBx:label_entity_id>1</PDBx:label_entity_id>
      <PDBx:label_seq_id>1</PDBx:label_seq_id>
      <PDBx:Cartn_x>17.047</PDBx:Cartn_x>
      <PDBx:Cartn_y>14.099</PDBx:Cartn_y>
      <PDBx:Cartn_z>3.625</PDBx:Cartn_z>
      <PDBx:occupancy>1.00</PDBx:occupancy>
      <PDBx:B_iso_or_equiv>13.79</PDBx:B_iso_or_equiv>
      <PDBx:auth_seq_id>1</PDBx:auth_seq_id>
      <PDBx:auth_comp_id>THR</PDBx:auth_comp_id>
      <PDBx:auth_asym_id>A</PDBx:auth_asym_id>
      <PDBx:auth_atom_id>N</PDBx:auth_atom_id>
      <PDBx:pdbx_PDB_model_num>1</PDBx:pdbx_PDB_model_num>
    </PDBx:atom_site>
For the atom coordinates, the full-tag description becomes very large, so we have devised a separate coordinate file with a simplified description. Separated file for coordinates:
  <atom_record id="1">ATOM 1 A A 1 1 ? . THR THR N N N </atom_record>
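To show how the full-tag form can be consumed by a program, here is a minimal sketch that reads atom records with Python's standard XML parser. The sample is a shortened copy of the atom shown on the slide; the namespace URI `urn:example:pdbx` is a placeholder (real PDBML files declare their own schema namespace), and the function names and the output dictionary layout are illustrative choices, not part of PDBML.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's example. The namespace URI is a placeholder;
# real PDBML files declare their own.
SAMPLE = """<PDBx:atom_siteCategory xmlns:PDBx="urn:example:pdbx">
  <PDBx:atom_site id="1">
    <PDBx:type_symbol>N</PDBx:type_symbol>
    <PDBx:label_atom_id>N</PDBx:label_atom_id>
    <PDBx:label_comp_id>THR</PDBx:label_comp_id>
    <PDBx:Cartn_x>17.047</PDBx:Cartn_x>
    <PDBx:Cartn_y>14.099</PDBx:Cartn_y>
    <PDBx:Cartn_z>3.625</PDBx:Cartn_z>
  </PDBx:atom_site>
</PDBx:atom_siteCategory>"""

def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

def read_atoms(xml_text):
    """Collect id, residue, atom name, and coordinates of each atom_site."""
    root = ET.fromstring(xml_text)
    atoms = []
    for site in root.iter():
        if local(site.tag) != "atom_site":
            continue
        fields = {local(child.tag): child.text for child in site}
        atoms.append({
            "id": site.get("id"),
            "residue": fields["label_comp_id"],
            "atom": fields["label_atom_id"],
            "xyz": tuple(float(fields[k])
                         for k in ("Cartn_x", "Cartn_y", "Cartn_z")),
        })
    return atoms

atoms = read_atoms(SAMPLE)
print(atoms)
```

Matching on the local tag name keeps the sketch independent of the exact schema version, at the cost of ignoring the namespace that a stricter reader would check.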
26 Applications of PDBML: browser at PDBj with the native XML DB; database extension and annotation; SOAP (Simple Object Access Protocol) services at PDBj; molecular graphics viewer (jV), which directly parses PDBML and displays the information written in XML. (Kinoshita & Nakamura (2004) Bioinformatics, 20, ; Westbrook et al. (2005) Bioinformatics, 21, )
34 The functional information can be searched by this query, or by using the SOAP service.
35 Search for proteins that have one or more helices longer than 10. Query: /datablock[struct_confCategory/struct_conf The functional information can be searched by this query, or by using the SOAP service.
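The query on the slide is cut off, so the sketch below does not try to complete it. Instead it shows, under stated assumptions, the same kind of search done in Python over small hand-made documents: it assumes each helix is a `struct_conf` element carrying a `pdbx_PDB_helix_length` value, uses a placeholder namespace URI, and the entry names `ENTRY_A` and `ENTRY_B` and their helix lengths are invented for the illustration.

```python
import xml.etree.ElementTree as ET

NS = {"PDBx": "urn:example:pdbx"}  # placeholder namespace URI

def make_entry(entry_id, helix_lengths):
    """Build a tiny PDBML-like document with one struct_conf per helix."""
    items = "".join(
        f'<PDBx:struct_conf id="H{i}">'
        f"<PDBx:conf_type_id>HELX_P</PDBx:conf_type_id>"
        f"<PDBx:pdbx_PDB_helix_length>{n}</PDBx:pdbx_PDB_helix_length>"
        f"</PDBx:struct_conf>"
        for i, n in enumerate(helix_lengths, 1)
    )
    return (
        f'<PDBx:datablock xmlns:PDBx="urn:example:pdbx" '
        f'datablockName="{entry_id}">'
        f"<PDBx:struct_confCategory>{items}</PDBx:struct_confCategory>"
        f"</PDBx:datablock>"
    )

def has_long_helix(xml_text, min_length=10):
    """True if any helix in the entry is strictly longer than min_length."""
    root = ET.fromstring(xml_text)
    path = ("PDBx:struct_confCategory/PDBx:struct_conf/"
            "PDBx:pdbx_PDB_helix_length")
    return any(int(e.text) > min_length for e in root.findall(path, NS))

# Invented entries: ENTRY_A has a 12-residue helix, ENTRY_B stops at 10.
entries = {
    "ENTRY_A": make_entry("ENTRY_A", [4, 12]),
    "ENTRY_B": make_entry("ENTRY_B", [6, 10]),
}
hits = [name for name, text in entries.items() if has_long_helix(text)]
print(hits)
```

The numeric comparison is done in Python because the XPath subset in `xml.etree.ElementTree` has no numeric predicates; a native XML database would evaluate the whole condition inside the query.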
42 PDBjViewer or jV http://www.pdbj.org/PDBjViewer/ Offers interactive molecular visualization with RasMol-type commands (source code free). Can be used both stand-alone and as a Java applet. Any polygons defined by XML can be displayed and manipulated. Can parse PDBML files and display the molecules.
43 xPSSS Database System. [Diagram labels: Archive (RCSB-PDB/MSD-EBI/PDBj); Internet download (FTP); downloader; PDBML; add information; XSLT processor; PDBMLplus; PDBMLplusF; filtering & reconstructing; loader; native XML-DB; xPSSS; FTP server; Web server. Function/source information and annotation data come through get/input tools from DDBJ, Swiss-Prot/UniProt, PIR/GenBank/KEGG/GDB/ProTherm/EzCatDB, and EBI/CSA/CATRES data, from PDF files for the primary citations, and from manual input from the literature.]
44 Development of Secondary Databases: data mining. Protein Molecular Surface Database, eF-site (Kinoshita & Nakamura). Protein Dynamics Database, ProMode (Wako & Endo). Alignment of Structural Homologues, ASH (Toh). Sequence Navigator & Structure Navigator (Standley). Encyclopedia of Protein Structures, eProtS (Ito & Nakamura).
46 Identify Protein Function from ・Sequence similarity ・Fold similarity. In particular, the British groups of Janet Thornton and Christine Orengo have promoted the use of “fold similarity” to identify or estimate protein function. They have succeeded to some extent, but there are still many cases where no fold similarity exists.
48 DNA-binding domain of Bovine Papilloma Virus-1 E2 (2BOP); RNA-binding domain of the U1A spliceosomal protein (1URN). Distance Matrix
49 Map of the entire protein folds: 498 SCOP domains, Dali (distance matrix alignment). Hou et al. (2003) Proc. Natl. Acad. Sci. USA, 100,
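The distance matrix mentioned on these slides is the representation that distance-matrix-based fold comparison starts from. The sketch below computes such a matrix for a few made-up points standing in for C-alpha atoms and checks that it does not change when the whole structure is moved, which is why two folds can be compared without first superimposing them. It is not the Dali algorithm itself, only the matrix it operates on; the coordinates and function names are illustrative.

```python
import math

def distance_matrix(coords):
    """Return the N x N matrix of pairwise distances between points."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def translate(coords, shift):
    """Move every point by the same vector."""
    return [tuple(c + s for c, s in zip(p, shift)) for p in coords]

# Made-up coordinates standing in for consecutive C-alpha atoms.
ca_atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]

dm = distance_matrix(ca_atoms)
dm_moved = distance_matrix(translate(ca_atoms, (10.0, -5.0, 2.5)))

# The matrix depends only on internal geometry, not on position in space.
n = len(ca_atoms)
same = all(
    math.isclose(dm[i][j], dm_moved[i][j], abs_tol=1e-9)
    for i in range(n) for j in range(n)
)
print(same)
```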
50 Structure Navigator http://www.pdbj.org/strucnavi/ Rapidly generates structure neighbors for any PDB ID. Both HTTP and SOAP interfaces are available. Alignments are displayed in a readable format. 3D coordinates can be viewed or downloaded. Structural similarity is defined by the Number of Equivalent Residues (NER) [1]. Standley, D.M. et al. (2005). [1] Standley D, Toh H, Nakamura H. Proteins 2004;57:
51 Structure Navigator Real-Time Server. Blue: Query; Red: Temp1. Standley, D.M. et al. (2005)
53 Structure Navigator-RT can answer a query quickly, but ... When we open the Web service of Structure Navigator-RT and accept many requests, our computing resources may not be enough. We want to use large computing resources located elsewhere. We need a simple and flexible tool for this purpose: a Grid Web Service.
54 Grid Web Service using Opal. SDSC Technical Report TR , 2006. Opal is a toolkit for wrapping application programs as Web services, providing scheduling, Grid security, and data management.
56 Opal Operation Provider by W.W. Li & K. Ichikawa (under the PRIUS project). Opal is deployed as one of the Operation Providers of Globus Toolkit 4 (GT4): Opal Operation Provider (Opal-OP). [Diagram: an Application Service (WSRF) on Globus Toolkit 4 (GT4), built with the Opal OP Toolkit; Opal OP sits alongside the Operation Providers for WSRF (GetRPProvider, SetRPProvider, QueryRPProvider) and the Operation Providers for Notification (SubscribeProvider, NotificationConsumerProvider).]
57 System Architecture. End user: request query. Portal Server @IPR, Osaka Univ. (Suita): Apache Tomcat (JSP: input page; servlet: Opal-OP client program), library of GT4, library of Operation Provider, library of Operation Provider Service. Submit job to the Computing Server (PC cluster, 16 nodes) @CMC, Osaka Univ. (Toyonaka). Head node (1 PC node): Tomcat, Opal Operation Provider Service, Pre-WS GRAM, GT4, OpenPBS (launches pbs_server, pbs_sched, pbs_mom); job scheduling; runs the application program. 15 worker nodes: OpenPBS (launches pbs_mom); run the application program. Processes are created by the local scheduler, OpenPBS.
58 Structure Navigator-RT with Opal-OP on GRID Query: 3D fold (1crn)
65 Blue: query structure. Red: second-nearest cluster to the query. [Figure label: Query structure]
66 Conclusion. We implemented the Opal Operation Provider to run Structure Navigator-RT as a Grid Web Service, using the local scheduler, openPBS, to run our application program on the PC cluster at the Cyber-Media Center, Osaka University, through the network inside Osaka University.
67 Identify Protein Function from ・Sequence similarity ・Fold similarity ・Surface similarity
69 Function identification of hypothetical proteins. Kinoshita & Nakamura (2003) Protein Science 12, ; Kinoshita & Nakamura (2005) Protein Science, . [Figure: search result against eF-site/ActiveSite + eF-site/mono-nucleotide (1684 entries); query: MJ0226 free form; hits: folylpolyglutamate synthetase and pyruvate kinase; "Similarity was found here."] As an example of application to a hypothetical protein, consider MJ0226, whose 3D structure was determined by Sung-Hou Kim’s group and whose ligand, ATP, was found from local fold similarity. We first built the molecular surface of the apo conformation and computed the electrostatic potential. Then, without giving any information about what the ligand is or where the active site is, we used the entire molecular surface as the query against the eF-site database of functional sites. This graph is the result of the query: these two proteins were found with high Z-scores. The function of both of those functional sites is ATP hydrolysis.
70 eF-seek: Query service for search of similar molecular surfaces (beta version). Kinoshita, K. & Nakamura, H. (2006) ef-site.hgc.jp/eF-seek/
71 Application of Grid to Web Service of PDBj. Query for the analog structural data: protein folds and molecular surfaces. [Diagram: for a new structure, a query for molecular surface looks for similar surfaces in the eF-site Database (eF-seek), and a query for fold looks for similar folds in the Fold Database (Structure Navigator Real-Time server; Blue: Query, Red: Temp1); results are returned with a quick response by using Grid computing.] The computations that search for similar molecular surfaces and similar folds require heavy computing resources, because the data are essentially analog: the shape of the molecule or some of its physico-chemical properties. Therefore, Grid computation is a good way to achieve quick responses to many such queries. The new Grid service system will be applied to our service at PDBj. The legacy scheduling system for the current Web service uses Open JMS (Java Message Service). JMS (Java Message Service) is an interface for exchanging messages asynchronously in Java programs.
72 Integration of computing systems and Database query systems at PDBj. [Diagram: a personal user's portal/client program communicates using SOAP with a distributed computing and DB system at site-A, site-B, and site-C. GT services A, B, and C provide Grid computing services with Opal-OP (analog query search); an XML DB provides database query services with XQuery/XPath (text query search).]
73 Acknowledgements ・Reiko Yamashita, Daron M. Standley (BIRD-JST / Inst. Protein Res., Osaka Univ.) ・Kohei Ichikawa, Susumu Date, Shinji Shimojo (Cyber Media Center, Osaka Univ.) ・Wilfred W. Li, Peter Arzberger (UCSD) ・Hiroyuki Toh (Medical Inst. Bioregulation, Kyushu Univ.) ・Kengo Kinoshita (Inst. Med. Science, Univ. Tokyo) ・Other PDBj members (BIRD-JST / Inst. Protein Res., Osaka Univ.) ・PRIUS