SysMO-DB: Towards “just enough” data exchange for the SysMO Consortium Stuart Owen, University of Manchester
Pan European collaboration Eleven individual projects, 91 institutes Different research outcomes A cross-section of microorganisms, incl. bacteria, archaea and yeast Record and describe the dynamic molecular processes occurring in microorganisms in a comprehensive way Present these processes in the form of computerized mathematical models Pool research capacities and know-how Already running since April 2007 Runs for 3-5 years http://www.sysmo.net Systems Biology of Microorganisms
The Problem No one concept of experimentation or modelling No planned, shared infrastructure for pooling
Started July 2008, 3 years, 3 staff + 3 investigators, 3 teams over 3 sites Sensitively retrofit a data access, model handling and data integration platform. Support and manage the diversity of data, models and competencies. Web-based solution: exchange of data, models and processes (intra- and inter-consortia) search for data, models and processes across the initiative dissemination of results SysMO-DB
SysMO-DB Team University of Stellenbosch, South Africa University of Manchester, UK Jacky Snoep EML Research gGmbH, Germany Isabel Rojas University of Manchester, UK Olga Krebs Wolfgang Müller Sergejs Aleksejevs Carole Goble Stuart Owen Katy Wolstencroft
Own solutions Suspicion Data issues Resource issues Own data solutions and collaboration environments: wikis, e-Groupware, PHProjekt, BaseCamp, PLONE, Alfresco, bespoke commercial … files and spreadsheets. Suspicion and caution over sharing. Interesting interplay between modellers, experimentalists and bioinformaticians. Many do not follow the standards that exist or know who is doing what. No extra resources for the consortia. 91 institutes, 11 consortia, some overlapping.
Types of data Multiple omics genomics, transcriptomics proteomics, metabolomics Images Reaction Kinetics Models Relationships between data sets/experiments Procedures, experiments, data, results and models Analysis of data The same across many Systems Biology projects
Principles… A series of small victories Realistic Don't reinvent Sustainable and extensible Migrate to standards Provide instant gratification Address doubt and anxiety Incremental development
Social Approach PALS 21 Postdocs and PhD students Experimentalists, modellers and bioinformaticians Our design and technical collaboration team Very intense face to face and virtual collaboration UK and Continental PALS Chapters Audits and Sharing Methods, data, models, standards, software, schemas, spreadsheets, SOPs…..
Communication via PALS DB team PALS Projects Show what is there Suggest what is possible Ask for requirements Give requirements Tell priorities Rate outcomes Suggest improvements Double check Transmit Disseminate Collect answers
The Lowest Hanging Fruit A Catalogue of SysMO assets SysMO Yellow Pages The people and their expertise The institutions and their facilities Data – experimental data sets Data – analysed results Data – external reference data sets Models Processes – laboratory protocols and bioinformatics analyses The catalogue references assets held elsewhere
Data Models Processes SysMO DB Technical Approach SysMO-SEEK web interface Assets and Yellow Pages Catalogues JERM
Discovery SysMO-SEEK Single, web based, access point Access control & Versioning management Yellow pages (“who is who”) People, Expertise, Equipment Assets catalogue (“who has what”) SOPs, Spreadsheets, pre-published models Metadata about Data held by projects Access to other repositories Models (JWS Online), Workflows (myExperiment), Public web services (BioCatalogue) Call out to external resources e.g. PubMed Does not hold data and results Holds metadata on results and links to results A component for SysMO groups to incorporate in their own environments and applications
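The catalogue idea on this slide (hold metadata and links, not the data, with access control and versions) can be illustrated with a small sketch. This is only an illustration under stated assumptions: the class `CatalogueEntry`, its field names, the `can_view` rule and the example.org links are all hypothetical and are not taken from SysMO-SEEK.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    """Metadata about an asset; the asset itself stays at the project site."""
    title: str
    asset_type: str   # e.g. "SOP", "spreadsheet", "model"
    owner: str
    location: str     # link to where the owning project holds the asset
    version: int = 1
    visible_to: set = field(default_factory=set)  # project names; empty means owner only


def can_view(entry, person, project):
    """Assumed rule: the owner always sees the entry, others need their project listed."""
    return person == entry.owner or project in entry.visible_to


def search(catalogue, term, person, project):
    """Return titles of entries that match the term and that this user may see."""
    term = term.lower()
    return [e.title for e in catalogue
            if term in e.title.lower() and can_view(e, person, project)]


# Invented example entries (hypothetical people, projects and links).
catalogue = [
    CatalogueEntry("Glucose pulse SOP", "SOP", "alice",
                   "https://example.org/sops/12",
                   visible_to={"ProjectA", "ProjectB"}),
    CatalogueEntry("Glucose uptake model", "model", "bob",
                   "https://example.org/models/7"),
]

print(search(catalogue, "glucose", "carol", "ProjectB"))
```

Because each entry stores only a link and a visibility set, the owner keeps the asset and decides who can find it.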
Models Standardise their representation – SBML (what about non-SBML models?). Describing, annotating and curating the models so you can find them (Semantic SBML). Safely storing the models, including versions and pre-publication (JWS Online & BioModels). Validating and running the models through a simulation tool (JWS Online & Copasi). Linking models & data – both experimental data and simulated data (SBRML & Key Results).
Models SBML is the recommended format. Not all models are SBML. JWS Online allows storing and simulation of SBML models, but all models need to be shared. JWS Online doesn’t have version and access control. Models can be shared in SEEK instead of directly in JWS Online. SEEK can still connect to JWS Online and run simulations.
Models JWS Online – a database of curated models and a model simulator Web service enabled to run from workflows Used and accessed through SEEK…. Special instance of JWS Online for SysMO Store, validate and run models from SysMO-SEEK and publish later Access to other models resources BioModels, Copasi and Semantic SBML
Experimental Processes Protocol Title Authors Keywords Abstract Materials Reagents Reagent Set Up Equipment Time Taken Procedure Troubleshooting Critical Steps Anticipated Results References Protocols and SOPs Nature Protocols format recommendation You can upload Protocols in any format, but if you use this one, we will index it and make searching easier Encouraging standardisation
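To show why a protocol in the recommended layout is easier to index, here is a minimal sketch. The section names come from the slide; everything else is an assumption, in particular the text layout (each section starts on its own line as `Name: text`) and the helper names `index_protocol` and `missing_sections`.

```python
# Section names as listed on the slide.
RECOMMENDED_SECTIONS = [
    "Title", "Authors", "Keywords", "Abstract", "Materials", "Reagents",
    "Reagent Set Up", "Equipment", "Time Taken", "Procedure",
    "Troubleshooting", "Critical Steps", "Anticipated Results", "References",
]


def index_protocol(text):
    """Split a protocol into sections, assuming 'Name: text' starts each section."""
    lookup = {name.lower(): name for name in RECOMMENDED_SECTIONS}
    sections = {}
    current = None
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        name = lookup.get(head.strip().lower())
        if sep and name:
            # A recognised section heading starts a new section.
            current = name
            sections[current] = rest.strip()
        elif current and line.strip():
            # Any other non-empty line continues the current section.
            sections[current] = (sections[current] + " " + line.strip()).strip()
    return sections


def missing_sections(sections):
    """List the recommended sections that the protocol does not contain."""
    return [name for name in RECOMMENDED_SECTIONS if name not in sections]


# Invented example protocol text.
sample = "Title: Glucose pulse\nAuthors: A. Example\nProcedure: Grow cells.\nHarvest at OD 1.\n"
indexed = index_protocol(sample)
print(indexed["Procedure"])
print(missing_sections(indexed))
```

A protocol in any other format can still be uploaded; it would simply yield few or no indexed sections.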
Data Comparison and Exchange Public data sources model organism databases – (e.g. SGD) BRENDA …. Data produced by SysMO SABIO-RK, iChiP, MeMo …. Local databases & Files Excel Spreadsheets The most common form of experimental data format. Proteomics Metadata Metabolomics Microarray Proteomics Single Cell Data Variable descriptions of data Little adoption of community controlled vocabulary terms
SysMO LAB Spreadsheet Experiment measurement number Glucose Ethanol Acetate Lactate Formiate Succinate Pyruvate Acetoin 2,3 Butanediol mM 113,57016,6111,570003,060 210032,857,035,7300,564,210 Our Extra Work!!
JERM JERM “Just Enough Results Model” Minimum information to exchange data What type of data is it Microarray, growth curve, enzyme activity… What was measured Gene expression, OD, metabolite concentration…. What do the values in the datasets mean Units, time series, repeats…. Which experiment does it relate to How was the data created SOPs and protocols Harvesting standards, current practice and consortium schemas and spreadsheets Inspired by MCISB Key Results initiative and SBRML [Paton]
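The five questions on this slide can be read as the fields of a minimal record. The sketch below is one possible reading, not the JERM schema itself: the class name `JermRecord`, the field names and the completeness check are all hypothetical.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class JermRecord:
    """One possible 'just enough' description of a dataset (hypothetical fields)."""
    data_type: str        # what type of data: microarray, growth curve, ...
    measured: str         # what was measured: gene expression, OD, ...
    units: str            # what the values mean
    experiment: str       # which experiment the data relates to
    created_by: list = field(default_factory=list)  # SOP or protocol references
    time_series: bool = False
    repeats: int = 1


REQUIRED_TEXT_FIELDS = ("data_type", "measured", "units", "experiment")


def missing_fields(record):
    """Return the names of the minimum fields that are still empty."""
    values = asdict(record)
    missing = [name for name in REQUIRED_TEXT_FIELDS if not values[name].strip()]
    if not record.created_by:
        missing.append("created_by")
    return missing


# Invented example: units and the SOP reference have not been filled in yet.
record = JermRecord(data_type="growth curve", measured="OD", units="",
                    experiment="exp-1")
print(missing_fields(record))
```

A check like this fits incremental annotation: an incomplete record can be registered first and completed later.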
The Idea For each data type….. Transcriptomics Proteomics Metabolomics Single Cell Data Generate and apply…. JERM template JERM extractor for data host Subset registered in SEEK Access / export through JERM interface / template Define a JERM….. Top down analysis of standards Bottom up analysis of practice 1 2 3 ISA-TAB
JERM Source Extractor Generator New spreadsheets adopt JERM templates Legacy spreadsheet JERM mapper Databases have JERM mapper Spreadsheet Ontology Annotator Restrict the values that a range of fields can have Just Enough Results Model Tools Metadata SABIO-RK BRENDA myDB mySpreadSheet JERM Web Service Access Interface Access Control JERM Extractor and Access Wrapper Layer JERM Template Source Access and Harvester Source Extractor
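A legacy spreadsheet mapper and a value restriction could look roughly like the sketch below. It assumes a semicolon-separated export with decimal commas; the column mapping `COLUMN_MAP`, the unit list `ALLOWED_UNITS`, the function names and the sample values are invented for illustration and do not describe the actual JERM tools.

```python
import csv
import io

# Hypothetical mapping from one lab's column headers to shared names.
COLUMN_MAP = {"Nr": "measurement_number", "Gluc": "glucose", "EtOH": "ethanol"}

# Hypothetical controlled vocabulary for the units field.
ALLOWED_UNITS = {"mM", "uM"}


def parse_number(text):
    """Accept decimal commas as well as decimal points."""
    return float(text.strip().replace(",", "."))


def extract(sheet_text, units):
    """Map a legacy sheet onto shared column names, rejecting unknown units."""
    if units not in ALLOWED_UNITS:
        raise ValueError(f"unit {units!r} is not in the controlled vocabulary")
    reader = csv.DictReader(io.StringIO(sheet_text), delimiter=";")
    rows = []
    for raw in reader:
        row = {"units": units}
        for header, value in raw.items():
            name = COLUMN_MAP.get(header.strip()) if header else None
            if name is None or value is None:
                continue  # unmapped columns and missing cells are skipped
            row[name] = parse_number(value)
        rows.append(row)
    return rows


# Invented sample sheet with a lab-specific header row and a free-text column.
legacy = "Nr;Gluc;EtOH;Comment\n1;12,5;0;ok\n2;100;3,25;ok\n"
rows = extract(legacy, "mM")
print(rows[0])
```

The lab keeps its own spreadsheet layout; only the mapping has to be written once per source.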
Incremental Annotation Metadata can be added to assets at any time Extracted from JERM templates Added by the data owner through SEEK Added by another SysMO consortium member with editing permission
Workflow Management System Bioinformatics Processes: Workflows Automated and repetitive data preparation, annotation and analysis pipelines SBML model construction and population Linking together Data sets, Web Services, R scripts, BioMART, Java libraries, Grid Services Free and Open Source
Data integration: workflows for model parameterisation and validation. Building models using workflows Manipulation of SBML models in workflows LibSBML: data integration & constructing and annotating SBML models [Li et al]
Ramp up when more data resources become workflow accessible Libraries of SysMO workflows Spreadsheet Smart.
Microarray Analysis SBML Model manipulation Pathway Analysis Chemical structure analysis Protein structure analysis Kinetic data Excel Spreadsheet handling Controlled vocabulary look-ups http://myexperiment.org
Spreadsheet Repository Models Repository SOP Repository Workflow Repository Consortium Data Models Processes SOPs and Workflows What we have done.. SysMO-SEEK web interface JWS Online Assets Catalogue Yellow Pages Search SysMO DB JERM Public data SBML Nature Protocols Workflow Management System JERM
Experimental Data Metadata People Projects Assay Study Experimental conditions Factors studied Models SOPs Homogenised terminology and values in the datasets themselves Workflows Based on ISA-TAB Investigation SEEK + JERM
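The ISA-TAB-based structure named on this slide nests assays inside studies inside an investigation, with data, SOPs and models attached. The sketch below shows one way to represent that nesting; the class and field names are hypothetical, and placing the assets on the assay is an assumption rather than something stated on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Assay:
    title: str
    data_files: list = field(default_factory=list)
    sops: list = field(default_factory=list)
    models: list = field(default_factory=list)


@dataclass
class Study:
    title: str
    factors: list = field(default_factory=list)  # factors studied
    assays: list = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    project: str
    studies: list = field(default_factory=list)


def linked_assets(investigation):
    """List (study, assay, asset) triples for everything attached to the assays."""
    found = []
    for study in investigation.studies:
        for assay in study.assays:
            for asset in assay.data_files + assay.sops + assay.models:
                found.append((study.title, assay.title, asset))
    return found


# Invented example hierarchy.
assay = Assay("Metabolite time course", data_files=["pulse.xls"],
              sops=["Sampling SOP"], models=["uptake.xml"])
study = Study("Glucose pulse", factors=["glucose concentration"], assays=[assay])
investigation = Investigation("Central metabolism", "ProjectA", studies=[study])

for triple in linked_assets(investigation):
    print(triple)
```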
Reflections Keeping data at project sites has responsibilities Reliability - Sites available continuously and promptly Support - Must be proof against virus attacks, etc. Archiving - Beyond the lifetime of the project. What happens when a project is no longer part of the SysMO consortium? Success, up and running in 12 months. Increase in confidence and trust Rapid agile development, PALS partnership Beyond SysMO - across all systems biology, sensitive to legacy Publishing – 1 click publishing pushing out to other systems (MolMeth, SABIO-RK, BII - ISATAB)
Lessons Find a solution that fits in with current practices Start simple, show benefits, add more Engage with the people actually doing the work PhD students, Post-docs Let the scientists retain control over their data and who can see it Don’t reinvent. Use available vocabularies, minimal model standards Help prevent people duplicating work by linking the people as well as the resources
Acknowledgements SysMO-DB Team SysMO-PALS myGrid, EML and JWS Online teams OMII-UK, Uni Southampton EMBL-EBI, MCISB