Data-driven research with e-Laboratories Stuart Owen University of Manchester
Social collaboration environments for sharing, curating and cataloguing personal, group and community contributed scientific assets. BSD. Registered users, 56 countries; workflows, services.
Scientific workflow management system for accessing open, public data services, assembling data processing and analysis pipelines and recording provenance. LGPL. 361 organisations, 48 countries; 70,000+ binary downloads, ~4000 source.
Handy tools for data management tasks in bioinformatics. BSD.
Web-based Software & Sharing Services: “Mobilising the long tail of scientists for all our benefit”. Common Ruby on Rails platform. Common and exchanged codebases.
Scientific workflows, scripts and pipelines. Now also neuroscience, music and numerical analysis. Developed with Oxford and Southampton.
Systems Biology models, data and protocols. Adopted by 4 EU-wide consortia and 4 UK sites. Developed with HITS and Stellenbosch.
Crowd-sourced, curated Web services. Adopted by EdUnify and ELDA education projects. Developed with EBI and EMBRACE network.
Find experts, advice, scripts, variable sets. Towards interface for UK Data Archives. Developed with NIBHI.
SysMO-DB Project A data access, model handling and data integration platform for Systems Biology: To support and manage the diversity of –Data, Models and experimental protocols (SOPs) from a consortium Web based Standards compliant DB
Pan European collaboration 13 individual projects, >100 institutes –Different research outcomes –A cross-section of microorganisms, incl. bacteria, archaea and yeast Record and describe the dynamic molecular processes occurring in microorganisms in a comprehensive way Present these processes in the form of computerized mathematical models Pool research capacities and know-how Already running since April 2007 Runs for 3-5 years This year, 2 new projects joined and 6 left Systems Biology of Microorganisms
Data Driven Multiple omics –genomics, transcriptomics –proteomics, metabolomics –fluxomics, reactomics Images Molecular biology Reaction Kinetics Models –Metabolic, gene network, kinetic Relationships between data sets/experiments –Procedures, experiments, data, results and models Analysis of data
A Tree View of Assets: Investigation, Studies, Assay (Construction, Validation, SOP). ISA infrastructure provides a directory structure for experiments
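The Investigation, Study, Assay nesting can be illustrated with a minimal Python sketch. This is not SEEK's actual data model: the class names, the `assets` field and the `tree_lines` helper are hypothetical, and treating Construction and Validation as assays that each carry an SOP is an assumption read off the slide labels.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    title: str
    # Assets attached to the assay, e.g. SOPs, data files or models (assumed).
    assets: List[str] = field(default_factory=list)


@dataclass
class Study:
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)


def tree_lines(investigation: Investigation) -> List[str]:
    """Render the hierarchy as indented lines, like a directory listing."""
    lines = [investigation.title]
    for study in investigation.studies:
        lines.append("  " + study.title)
        for assay in study.assays:
            lines.append("    " + assay.title)
            for asset in assay.assets:
                lines.append("      " + asset)
    return lines


# Hypothetical example built from the labels on the slide.
example = Investigation(
    "Investigation",
    [Study("Study", [Assay("Construction", ["SOP"]), Assay("Validation", ["SOP"])])],
)
print("\n".join(tree_lines(example)))
```

Keeping assets at the assay level means every file is reachable through exactly one path in the tree, which is what makes the structure behave like a directory.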
Access Permissions Just Enough Sharing...we don’t talk about security
Attribution. Trust. Credit Reward and Provenance Reusing myExperiment
Just Enough Sharing: project data stores (COSMIC, SysMOLab, MOSES, Alfresco, Wiki, another data store); SOP; Fetch on Request or Direct Upload
RightField: Annotation by Stealth
SEEK, the e-Laboratory A dynamic resource for analysis as well as browsing Automatic comparison of data from inside files Understanding where and how data and models are linked Running simulations with new experimental data Running analyses and workflows over the data and models
Open Integration: JWS Simulator Web based easy to use interface: “runs in your browser”, integrated in SEEK Models can be accessed via browser, SEEK and web services. Data linked to models via file upload (e.g. Excel), or via database connection. Standard simulation functionality
Data Fuse
Available services Workflow diagram Workflow Explorer Taverna Workbench
The Taverna Open Suite of Tools Client User Interfaces GUI Workbench Workflow Repository Service Catalogue Third Party Tools Programming and APIs Web Portals Activity and Service Plug-in Manager Provenance Store Workflow Server Open Provenance Model Secure Service Access Workflow Engine Virtual Machine
Taverna and the ‘Cloud’: Analysing Next Generation Sequencing Data
Analysing African Cattle with Taverna ,000 years separation African Livestock adaptations: Hardier Better disease resistance Potential outcomes: Food security Understanding resistance Understanding environmental Conditions Drought Parasites Understanding diversity
The Analysis Pipeline (in Perl): MAP, FILTER, ANALYSIS. Input SNP data from sequencer. Map between genome builds (Liftover). Filter for SNPs in exons. SNP consequences. Identifying damaging SNPs (PolyPhen). Harry Noyes, University of Liverpool
Workflow and phases: Input SNP file. Populate DB with start SNPs and resource version numbers. Lift-over: maps between UMD3 and BTA4 cow assemblies. Exon positions from Ensembl. Find SNPs in exon regions. PolyPhen to mark “damaging” SNPs
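The "find SNPs in exon regions" phase can be sketched as an interval lookup. The original pipeline is written in Perl and run through Taverna; the Python below is only an illustration of the filtering idea, not that implementation. The function names, the `(chromosome, position)` SNP tuple, the 1-based inclusive exon coordinates and the assumption that exons on a chromosome do not overlap are all hypothetical choices made for the example.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), assumed 1-based and inclusive
Snp = Tuple[str, int]   # (chromosome, position)


def build_index(exons: Dict[str, List[Exon]]) -> Dict[str, List[Exon]]:
    """Sort each chromosome's exons by start so they can be binary-searched."""
    return {chrom: sorted(intervals) for chrom, intervals in exons.items()}


def in_exon(snp: Snp, index: Dict[str, List[Exon]]) -> bool:
    """True if the SNP lies inside an exon (assumes non-overlapping exons)."""
    chrom, pos = snp
    intervals = index.get(chrom, [])
    # Locate the last exon whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos <= intervals[i][1]


def filter_exonic(snps: List[Snp], exons: Dict[str, List[Exon]]) -> List[Snp]:
    """Keep only the SNPs that fall inside an exon."""
    index = build_index(exons)
    return [snp for snp in snps if in_exon(snp, index)]


# Made-up coordinates, purely for illustration.
exons = {"chr1": [(300, 400), (100, 200)], "chr2": [(300, 310)]}
snps = [("chr1", 150), ("chr1", 250), ("chr2", 305), ("chr3", 10)]
print(filter_exonic(snps, exons))
```

Sorting the exons once and binary-searching each SNP keeps the per-SNP cost logarithmic, which matters when the input runs to millions of SNPs.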
Accessing Taverna on the Cloud
Architecture overview
Jobs Status Input Provenance Experiment Metadata Input data summary Loading inputs
Summary of Workflow Output: Non-synonymous coding SNPs. PolyPhen predictions: probably damaging. 11 million SNPs for N'Dama. The result can be downloaded as a MySQL database or as a TSV / CSV file
Why use the Cloud? This is a highly repetitive task – And “embarrassingly parallel” But it also needs to be done on demand And within the financial reach of researchers – Who do not always have access to their own compute We have very fast network access – So we don’t need to do this in-house
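What "embarrassingly parallel" means here can be shown with a small sketch: the input is split into chunks that share no state, each chunk is processed independently, and the results are concatenated. The code below is an illustration under assumptions, not the project's cloud setup. `annotate_chunk` is a hypothetical stand-in for the real per-SNP analysis, and a local thread pool stands in for the cloud workers that would each take a chunk.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def chunked(items: Sequence[str], size: int) -> List[Sequence[str]]:
    """Split the input into independent chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def annotate_chunk(chunk: Sequence[str]) -> List[str]:
    """Hypothetical stand-in for the per-SNP analysis; needs no shared state."""
    return [f"{snp}:annotated" for snp in chunk]


def run_parallel(items: Sequence[str], chunk_size: int, workers: int) -> List[str]:
    """Process every chunk independently and concatenate the results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order the chunks were submitted.
        results = pool.map(annotate_chunk, chunked(items, chunk_size))
        return [row for chunk_result in results for row in chunk_result]


print(run_parallel(["snp1", "snp2", "snp3", "snp4", "snp5"], chunk_size=2, workers=3))
```

Because no chunk depends on another, adding workers shortens the run without changing the result, which is why renting compute on demand suits this kind of task.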
Timings
SEEK as a data analysis and meta analysis service SBML model construction and population Calibration workflow Data requirements Parameterised SBML model Experimental data Metabolite concentrations from key results database Calibration by COPASI web service Peter Li
Search and Analysis across data sets, models and stuff Analysis pool Analysis As A Cloud Service Analysis using Cloud Computing Services Run analysis tools and knowledge bases Li et al, BMC Bioinformatics 2010, 11:582, doi: / , highly accessed Hucka and Le Novère, BMC Biology 2010, 8:140, doi: / Automated Model Generation MCISB Centre (Li) Annotation pipeline SUMO SysMO project (Maleki-Dizaji) Workflow Management System Next Gen Seq annotation pipelines using Amazon Cloud Services (Noyes, Li )
SysMO-DB Dev Team University of Stellenbosch, South Africa University of Manchester, UK Jacky Snoep Heidelberg Institute for Theoretical Studies Germany University of Manchester, UK Olga Krebs Wolfgang Müller Sergejs Aleksejevs Carole Goble Stuart Owen Katy Wolstencroft Finn Bacall Franco du Preez Quyen Ngyen
Further Information myGrid – Taverna – myExperiment – BioCatalogue – SEEK – RightField – MethodBox –