Proteome data integration: characteristics and challenges

K. Belhajjame 1, R. Cote 4, S.M. Embury 1, H. Fan 2, C. Goble 1, H. Hermjakob, S.J. Hubbard 1, D. Jones 3, P. Jones 4, N. Martin 2, S. Oliver 1, C. Orengo 3, N.W. Paton 1, M. Pentony 3, A. Poulovassilis 2, J. Siepen, R.D. Stevens 1, C. Taylor 4, L. Zamboulis 2, and W. Zhu 4

1 University of Manchester
2 Birkbeck College
3 University College London
4 European Bioinformatics Institute

All Hands Meetings,

Outline
– Experimental proteomics
– ISPIDER architecture
– Example use cases
– Conclusion

Experimental proteomics
– The study of the set of proteins produced by an organism, with the aim of understanding their behaviour under varying conditions
– An essential component for elucidation of the biological functions of proteins
[Figure: protein identification pipeline. Stage labels: Separation, Protein digestion, Mass Spectrometry, Identification. Other labels: 2D gel electrophoresis, Enzymatic digestion, MALDI-TOF, Protein DB, Protein ID.]

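The digestion and identification stages named on this slide can be illustrated with a small in-silico sketch. It assumes trypsin as the enzyme (the slide only says "enzymatic digestion") and uses the common rule of cleaving after K or R unless the next residue is P. Real identification compares measured peptide masses with theoretical ones; here peptides are compared as strings to keep the example short. The function names and the toy database are hypothetical.

```python
import re


def tryptic_digest(sequence: str) -> list[str]:
    """Split a protein sequence after K or R, except when followed by P.

    Assumption: trypsin is the enzyme; the slide does not name one.
    """
    # Zero-width split point: preceded by K or R, not followed by P.
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def identify(observed_peptides: list[str], protein_db: dict[str, str]) -> str:
    """Return the database entry whose digest shares most peptides with the sample.

    Simplification: peptides are matched by sequence, not by mass.
    """
    observed = set(observed_peptides)
    scores = {
        protein_id: len(observed & set(tryptic_digest(seq)))
        for protein_id, seq in protein_db.items()
    }
    return max(scores, key=scores.get)


# Toy database with made-up identifiers and sequences.
toy_db = {"P1": "MKRPAAKG", "P2": "MAAKTTRLL"}
best_match = identify(["MAAK", "TTR"], toy_db)
```

Matching against a smaller database leaves fewer candidate entries per peptide, which is the point the genome-focused use case later in the talk builds on.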
Experimental proteomics
– Development of new technologies for:
  – protein separation (2D-SDS-PAGE, HPLC, Capillary Electrophoresis)
  – mass spectrometry (Multi-Dimensional protein identification)
– Availability of publicly accessible protein sequence databases
– Proteomics databases (PedroDB, gpmDB, PepSeeker, PRIDE, …)
– Building experiments that involve the orchestration of analysis services, data processing and data integration

Objectives of ISPIDER
A Grid dedicated to the creation of bioinformatics experiments for proteomics
– Develop Grid-enabled proteome databases and services, or make existing ones Grid-enabled
– Develop middleware support for developing and executing new proteome analyses, based on distributed query processing and workflow technologies
– Undertake proteomic studies that demonstrate the effectiveness of the resulting infrastructure

Outline
– Experimental proteomics
– ISPIDER architecture
– Example use cases
– Conclusion and future directions

ISPIDER
[Figure: ISPIDER architecture diagram. Recoverable labels, grouped by layer:]
– ISPIDER Proteomics Clients: Vanilla Query Client; 2D Gel Visualisation Client; + Aspergil. Extensions; + Phosph. Extensions; PPI Validation + Analysis Client; Protein ID Client
– ISPIDER Proteomics Grid Infrastructure: Proteome Request Handler; Instance Ident/Mapping Services; Proteomic Ontologies/Vocabularies; Source Selection Services; Data Cleaning Services
– Existing E-Science Infrastructure: myGrid Ontology Services; myGrid DQP; AutoMed; myGrid Workflows
– Public Proteomics Resources, exposed as Web services (Existing Resources and ISPIDER Resources): PS WS, PF WS, TR WS, GS WS, FA WS, PPI WS, PID WS, PRIDE WS, PEDRo WS, Phos WS
KEY: WS = Web services, GS = Genome sequence, TR = transcriptomic data, PS = protein structure, PF = protein family, FA = functional annotation, PPI = protein-protein interaction data, WP = Work Package

Value-added protein datasets
Motivation
– Protein identification experiments are usually used as input into further analysis processes:
  – Gathering evidence for a biological hypothesis
  – Suggesting new hypotheses
Objective
– Augment the identification results with additional information on the identified protein
Implementation
– Taverna workflow system

Value-added protein datasets
[Figure: workflow diagram. Labelled components: PepMapper Web Service, GO Services, Auxiliary Services.]

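The idea of augmenting identification results with annotation can be sketched as a simple lookup join. This is only an illustration of the data flow, not the Taverna workflow itself: the accessions are placeholders, the GO identifiers are used purely as example strings, and `add_go_terms` is a hypothetical helper standing in for calls to the GO services.

```python
def add_go_terms(identified, go_lookup):
    """Attach GO term identifiers to each identified protein accession.

    Proteins without annotation keep an empty list, so no result is dropped.
    """
    return [
        {"accession": acc, "go_terms": sorted(go_lookup.get(acc, []))}
        for acc in identified
    ]


# Placeholder accessions and annotations, for illustration only.
identified_proteins = ["PROT_A", "PROT_B", "PROT_C"]
go_lookup = {
    "PROT_A": {"GO:0005524", "GO:0005634"},
    "PROT_B": {"GO:0005634"},
}

value_added = add_go_terms(identified_proteins, go_lookup)
```

Keeping unannotated proteins in the output with an empty list makes missing annotation visible to later analysis steps rather than silently shrinking the dataset.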
Genome-focused protein identification
Motivation
– Currently, protein identification searches are performed over large data sets. This means fewer false negatives, but false positives are also more likely.
Objective
– More focused, and thus more efficient, protein identification
Implementation
– Taverna workflow system
– DQP, a service-based query processor

Genome-focused protein identification
[Figure: workflow diagram. Labelled components: DQP Web Service, IPI, PepMapper web service, GOA Web Service.]
Example query:
    select p.Name, p.Seq from p in db_proteinSequences where p.OS='HomoSapiens';

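For readers unfamiliar with OQL, the query on this slide selects the name and sequence of every protein whose organism field is 'HomoSapiens'. A minimal Python sketch of the same selection follows. The field names `Name`, `Seq` and `OS` come from the query; the record class, the sample rows and the function name are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProteinSequence:
    Name: str
    Seq: str
    OS: str  # organism, as in the slide's query


def select_by_organism(db_protein_sequences, organism):
    """Return (Name, Seq) pairs for proteins from one organism.

    Mirrors: select p.Name, p.Seq from p in db_proteinSequences
             where p.OS = organism
    """
    return [(p.Name, p.Seq) for p in db_protein_sequences if p.OS == organism]


# Made-up rows; a real run would draw on a protein sequence database.
sample_db = [
    ProteinSequence("protein_1", "MKRPAAKG", "HomoSapiens"),
    ProteinSequence("protein_2", "MAAKTTRLL", "MusMusculus"),
    ProteinSequence("protein_3", "MSSRLLK", "HomoSapiens"),
]

human_only = select_by_organism(sample_db, "HomoSapiens")
```

Restricting the search space to one organism before identification is what makes the search more focused: fewer candidate sequences mean fewer chances of a spurious match.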
Integrated access to proteome databases
Motivation
– The ability to analyse existing proteomics results en masse is limited, because of the heterogeneities between the schemas of the different databases
Objective
– Providing integrated access to proteome databases through a common schema
Implementation
– AutoMed, a framework for mapping heterogeneous schemata
– DQP, a service-based query processor

Integrated access to proteome databases
[Figure: integration architecture diagram. Labelled components: PRIDE, PedroDB, gpmDB; AutoMed Wrappers; AutoMed Repository; OGSA-DAI Activity (three instances); OGSA Distributed Query Processor; AutoMed Query Processor; AutoMed DQP Wrapper. Labelled data flows: User query, Result, OQL query, OQL result.]

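The role of the common schema can be shown with a toy example: each source keeps its own field names, a per-source mapping renames them into shared ones, and a single query then runs over all sources. This is a sketch of the general idea only. It does not use or imitate the AutoMed or DQP APIs, and every field name below is hypothetical rather than taken from the real PRIDE, PedroDB or gpmDB schemas.

```python
# Hypothetical local field name -> common field name, per source.
SOURCE_TO_COMMON = {
    "PRIDE": {"accession": "protein_id", "peptide_sequence": "peptide"},
    "PedroDB": {"db_accession": "protein_id", "pep_seq": "peptide"},
    "gpmDB": {"label": "protein_id", "seq": "peptide"},
}


def to_common(source, record):
    """Rename one source record's fields into the common schema."""
    mapping = SOURCE_TO_COMMON[source]
    common = {common_name: record[local] for local, common_name in mapping.items()}
    common["source"] = source  # keep provenance of each result
    return common


def query_common(sources, predicate):
    """Apply one predicate, written against the common schema, to all sources."""
    results = []
    for source, records in sources.items():
        for record in records:
            row = to_common(source, record)
            if predicate(row):
                results.append(row)
    return results


# Made-up records in each source's own (hypothetical) layout.
sources = {
    "PRIDE": [{"accession": "PROT_A", "peptide_sequence": "MAAK"}],
    "PedroDB": [{"db_accession": "PROT_B", "pep_seq": "TTR"}],
    "gpmDB": [{"label": "PROT_A", "seq": "MAAK"}],
}

hits = query_common(sources, lambda row: row["peptide"] == "MAAK")
```

The user writes the predicate once against `protein_id` and `peptide`; the per-source mappings absorb the schema differences.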
Conclusions
+ Available e-science technologies provide rapid prototyping facilities for bioinformatics analyses
+ Combining such technologies is possible and opens up more possibilities
  – Taverna + DQP
  – AutoMed + DQP
- Writing custom code is usually required
  – Processing service output to extract inputs for following services
  – Transforming results between data formats
  – Dealing with mismatches between identifiers
Future directions
– Developing a user-guided environment for the detection and resolution of mismatches
– Development of proteomics client applications (PepMapper, PepSeeker and PRIDE)
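As an example of the custom glue code mentioned under "Dealing with mismatches between identifiers", the sketch below normalises accessions before joining the outputs of two services. The specific mismatches handled (stray whitespace, letter case, a `db:` prefix, a trailing `.N` version suffix) are assumptions chosen for illustration, as are the function names and the sample identifiers.

```python
import re


def normalise_accession(raw: str) -> str:
    """Reduce an accession to a canonical form for comparison.

    Assumed mismatches: whitespace, case, 'db:' prefix, '.N' version suffix.
    """
    acc = raw.strip().upper()
    acc = acc.split(":", 1)[-1]          # drop a leading 'db:' prefix if present
    return re.sub(r"\.\d+$", "", acc)    # drop a trailing version such as '.2'


def link_by_accession(left, right):
    """Pair identifiers from two service outputs that agree once normalised."""
    index = {normalise_accession(r): r for r in right}
    return [
        (item, index[normalise_accession(item)])
        for item in left
        if normalise_accession(item) in index
    ]


# Illustrative identifiers only.
service_a_output = ["ipi:IPI00000001.2", "IPI00000009"]
service_b_output = ["IPI00000001", "IPI00000002.1"]

linked = link_by_accession(service_a_output, service_b_output)
```

Rules like these are easy to write but differ for every pair of services, which is why a user-guided environment for detecting and resolving such mismatches is a natural next step.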