Science, Workflows and Collections Professor Carole Goble The University of Manchester, UK

Presentation transcript:

Science, Workflows and Collections Professor Carole Goble The University of Manchester, UK

© 2 Roadmap: How bioinformaticians will work (and are working now); the myGrid project and workflows; using publications in workflows; workflow implications for serials

© 3 Williams-Beuren Syndrome: Contiguous sporadic gene deletion disorder. 1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis. Haploinsufficiency of the region results in the phenotype. [Figure: physical map. Labels: Chr 7, ~155 Mb, ~1.5 Mb, 7q, WBS, SVAS, patient deletions, CTA-315H11, CTB-51J22, 'Gap'. Credit: Hannah Tipney]

© 4 1. Identify new, overlapping sequence of interest. 2. Characterise the new sequence at nucleotide and amino acid level. Cutting and pasting between numerous web-based services, e.g. BLAST, InterProScan, etc. [Slide background: raw nucleotide sequence, beginning acatttctac caacagtgga tgaggttgtt ...]

© 5 In Life Sciences: Data, publication, it's all the same. It's just part of the experiment. No separation between data and publications. Publications are the context for data. Break the silo between published papers and published data. [Slide background: raw nucleotide sequence]

© 6 Aside: A heretic speaks. Life Scientists read journals. I'm a Computer Scientist. I don't. It's on the Web. It's in podcast talks or PowerPoint. Google is the Lord's work. What PhD students are for. Journal publications too outdated.

© 7 Bioinformatics pipelines on the web: Copy and paste from one web-based application to another. Annotate by hand. Disadvantages: time consuming, error prone, tacit procedure so difficult to share both protocol and results. [Diagram: RepeatMasker, BLASTn, Twinscan]

© 8 Workflows for Science [Slide background: raw nucleotide sequence]

© 9 “Workflow at its simplest is the movement of documents and/or tasks through a work process. More specifically, workflow is the operational aspect of a work procedure: how tasks are structured, who performs them, what their relative order is, how they are synchronized, how information flows to support the tasks and how tasks are being tracked”. Workflows for Science

© 10 [Diagram: sequence in, RepeatMasker web service, BLASTn web service, Twinscan web service, predicted genes out] A simple scripting language specifies how the steps of a pipeline link together. Hides all the fiddling about. Advantages: automation, quick to write, easier to explain, share, relocate, and record provenance of results in a standard way. Workflows for Science
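
The idea of linking pipeline steps and recording provenance as the data moves through them can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it is not Taverna's scripting language, the three functions are hypothetical placeholders that never contact RepeatMasker, BLASTn or Twinscan, and the linear ordering simply follows the slide.

```python
from dataclasses import dataclass, field
from typing import Callable, List


# Hypothetical stand-ins for remote web services. They only wrap the input
# in a label so the flow of data through the pipeline is visible.
def repeat_masker(seq: str) -> str:
    return f"masked({seq})"


def blastn(seq: str) -> str:
    return f"blast_hits({seq})"


def twinscan(hits: str) -> str:
    return f"predicted_genes({hits})"


@dataclass
class Workflow:
    """A linear chain of steps that logs what each step consumed and produced."""
    steps: List[Callable[[str], str]]
    provenance: List[dict] = field(default_factory=list)

    def run(self, data: str) -> str:
        for step in self.steps:
            result = step(data)
            # Record which step turned which input into which output.
            self.provenance.append(
                {"step": step.__name__, "input": data, "output": result}
            )
            data = result
        return data


wf = Workflow([repeat_masker, blastn, twinscan])
genes = wf.run("acatttctac")
print(genes)
for entry in wf.provenance:
    print(entry["step"], "->", entry["output"])
```

Because the pipeline is data (a list of steps) instead of a manual procedure, it can be shared, rerun on a new sequence, and audited from the provenance list.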

© 11 Workflows describe the scientist's in silico experiment. Link together and cross-reference data in different repositories. And that includes serials! Remote, third-party, external applications and services, accessible to the workflow machinery. And that includes serials! Results management: semantic metadata annotation of data; provenance tracking of results. Sharing and replicating know-how. Reuse of workflows. Workflows for Science

© 13 WBS: The first complete and accurate map of the region of chromosome 7 involved in Williams-Beuren Syndrome. Time to perform one WBS pipeline cut from 2 weeks to 2 hours. Faster, automated, systematic and shareable

© 14 Users downloads 150+ users in US, Singapore, UK, Europe, Australia Systems biology Proteomics Gene/protein annotation Microarray data analysis Now part of the UK’s Open Middleware Infrastructure Institute

© 15 Reuse: adapting and sharing best practice and know-how across a community by publishing workflows. Examples: trypanosomiasis in cattle, chicken genome, mouse genome, Graves' disease, Williams-Beuren Syndrome

© 16 Trypanosomiasis in cattle Identify the genetic difference responsible for resistance to trypanosomiasis and breed into productive cattle. Mice as a model. Gene expression and microarray analysis The literature Associations between upregulated genes Links between changed genes and genes in the Tir1 region

© 20 PubMed Text Mining results

© 22 Chilibot text mining in Taverna

© 23 Taverna output Chilibot web page

© 24 Trypanosomes need cholesterol – and have a scavenger receptor – specific for HDL Resistant mice reduce available HDL – slowing trypanosome growth New hypothesis: Resistance and susceptibility in mice is a function of cholesterol recycling pathway. Mice love lard. lipoprotein and cholesterol

© 25 Biological pathway, highlighted with RNA molecules (orange) and DNA QTL molecules (pink), discovered with the aid of Chilibot text mining over PubMed.

© 27 myGrid / Discovery Net: specialist term recognition software. Assigning Gene Ontology terms to papers in MEDLINE

© 28 Science: Knowledge-driven MEDLINE abstract; marked up by SciBorg HTML-CML version

© 29 “the development of online submission systems for scientific manuscripts provides a mechanism for including a mapping of the information in the manuscript to controlled terminologies as an integral part of the publishing process. It is not hard to envision that the indexing of a paper to controlled terms for anatomical, gene nomenclature, or functional terminologies would be a necessary requirement for acceptance of a paper for publication. This, then, would enable the rapid incorporation of the paper and its contents into bioinformatics systems.” Judith Blake, Bio-ontologies - fast and furious, Nature Biotechnology 22, (2004)

© 30 The scholarly knowledge cycle. [Diagram labels: Research & e-Science workflows; Learning & Teaching workflows; Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media; Data analysis, transformation, mining, modelling; Deposit / self-archiving; Repositories: institutional, e-prints, subject, data, learning objects; Harvesting metadata; Aggregator services: national, commercial; Presentation services: subject, media-specific, data, commercial portals; Institutional presentation services: portals, Learning Management Systems, u/g, p/g courses, modules; Resource discovery, linking, embedding; Searching, harvesting, embedding; Learning object creation, re-use; Peer-reviewed publications: journals, conference proceedings; Publication; Validation; Quality assurance bodies] Source: Liz Lyon, Ariadne, July. © Liz Lyon (UKOLN, University of Bath), 2005. This work is licensed under a Creative Commons Attribution-ShareAlike 2.0 License.

© 31 eBank UK Project: Aggregator service harvests metadata from institutional repository (e-crystals archive). eBank service embedded in PSIgate portal for third-party search. Service linking from data to derived research publication. Embedding eBank service in learning workflows. UKOLN (lead), University of Southampton, University of Manchester

© 32 Linking data to publications

© 33 Provenance: Log what, where, when, who. For data and for publications
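
As a rough sketch of what a "what, where, when, who" log entry could look like, the snippet below defines one record type that serves both data items and publications. The class name, field layout and example values are assumptions made for illustration; they are not the myGrid provenance model.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    what: str   # the artefact produced or used (a data item or a publication)
    where: str  # the host or service involved
    when: str   # ISO 8601 timestamp in UTC
    who: str    # the person or agent responsible


def log_event(what: str, where: str, who: str,
              when: Optional[datetime] = None) -> ProvenanceRecord:
    """Build a record, stamping it with the current UTC time unless one is given."""
    stamp = when if when is not None else datetime.now(timezone.utc)
    return ProvenanceRecord(what=what, where=where, when=stamp.isoformat(), who=who)


# Hypothetical example values, with a fixed timestamp so the output is stable.
record = log_event(
    what="BLASTn result for masked sequence",
    where="blast.example.org",
    who="alice",
    when=datetime(2006, 1, 1, tzinfo=timezone.utc),
)
print(asdict(record))
```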

© 34 Workflows Web services Text mining Bioinformatics Semantic mark-up

© 35 Workflows, Web services, Text mining, Bioinformatics, Semantic mark-up. Publications have to be computational services (web services). They will be read and processed by machines. Licensing that works! Authorisation, authentication and digital rights management (e.g. Shibboleth). Integration of data and publications. Workflows are linking results, whatever the source. Common ids and persistent ids for citation (DOI, LSID, InChI). No silos
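
For a workflow to link results "whatever the source", it first has to recognise which identifier scheme it has been handed. The sketch below guesses the scheme from the identifier's surface form. The regular expressions are simplified assumptions, not full validators for the DOI, LSID or InChI specifications, and the LSID in the example is a made-up value.

```python
import re

# Simplified surface patterns, one per identifier scheme named on the slide.
_PATTERNS = {
    "DOI": re.compile(r"^10\.\d{4,9}/\S+$"),
    "LSID": re.compile(
        r"^urn:lsid:[^:\s]+:[^:\s]+:[^:\s]+(:[^:\s]+)?$", re.IGNORECASE
    ),
    "InChI": re.compile(r"^InChI=1S?/\S+$"),
}


def identifier_scheme(identifier: str) -> str:
    """Return 'DOI', 'LSID', 'InChI', or 'unknown' for the given string."""
    text = identifier.strip()
    for name, pattern in _PATTERNS.items():
        if pattern.match(text):
            return name
    return "unknown"


examples = [
    "10.1000/182",                           # DOI form
    "urn:lsid:example.org:namespace:12345",  # hypothetical LSID
    "InChI=1S/CH4/h1H4",                     # InChI for methane
    "some free text",
]
for item in examples:
    print(item, "->", identifier_scheme(item))
```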

© 36 Workflows Web services Text mining Bioinformatics Semantic mark-up Semantic publishing at source In order to automate we need better ways of interpreting the publication content They will be read and processed by machines Integration of data and publications Common vocabularies Accessible full texts for text mining, Not just abstracts.

© 37 Workflows Bioinformatics Data Publications Semantic markup Provenance

© 38 Workflows, Bioinformatics, Data, Publications, Semantic markup, Provenance. Publish workflows with data with publications. Privacy? Intellectual property? Licensing models for services so we can reuse and share results and workflows.

© 39 Take home: Machines are reading your journals, not just people. And if the journals are not online then they go unread. Workflows are another form of outcome to publish alongside data, metadata and publications. Google rocks – I don’t use anything else!

© 40 Acknowledgements: The myGrid Team, esp. Tom Oinn Chris Wroe Antoon Goderis Andy Brass Paul Fisher Hannah Tipney May Tassabehji Rob Gaizauskas Ian Roberts Discovery Net / InforSense Vasa Curcin Moustafa M Ghanem BioBank / CombeChem David De Roure Liz Lyon Scientists Peter Murray-Rust Judith Blake Mike Ashburner

© 41 Digital Library workflows: Workflows for data capture, deposit, preservation, citation, discovery, mining, … Multiple workflows interacting together. Workflows may call on each other, in a defined order. Multiple workflows may use “common” services, e.g. Assign (identifier). Require sequential or parallel execution, have dependencies, be time-limited, repetitive. Have an owner (control). Include essential human interventions.
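
One way to picture "a defined order" with sequential and parallel execution is a dependency graph sorted into batches: steps in the same batch have no dependencies on each other and could run in parallel, while the batches themselves run in sequence. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The step names and their dependencies are hypothetical, chosen only to echo the activities listed on the slide.

```python
from graphlib import TopologicalSorter

# Hypothetical digital-library steps; each maps to the steps it depends on.
dependencies = {
    "capture": set(),
    "assign_identifier": {"capture"},
    "deposit": {"assign_identifier"},
    "preserve": {"deposit"},
    "index_for_discovery": {"deposit"},
    "text_mine": {"index_for_discovery"},
}


def execution_batches(deps):
    """Group steps into batches; a batch only starts once earlier ones finish."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a reproducible order
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = execution_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

A human intervention such as an approval could be modelled as just another node in the graph, so that downstream steps wait for it.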