Science, Workflows and Collections Professor Carole Goble The University of Manchester, UK


1 Science, Workflows and Collections. Professor Carole Goble, The University of Manchester, UK. carole.goble@manchester.ac.uk

2 © 2 Roadmap. How bioinformaticians will work (and are working now). The myGrid project: workflows. Using publications in workflows. Workflow implications for serials.

3 © 3 Williams-Beuren Syndrome. Contiguous sporadic gene deletion disorder. 1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis. Haploinsufficiency of the region results in the phenotype. [Figure: physical map. Labels: Chr 7, ~155 Mb, ~1.5 Mb, 7q11.23, WBS, SVAS, patient deletions, CTA-315H11, CTB-51J22, ‘Gap’.] Hannah Tipney

4 © 4 1. Identify new, overlapping sequence of interest. 2. Characterise the new sequence at nucleotide and amino acid level. Cutting and pasting between numerous web-based services, e.g. BLAST, InterProScan etc.
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

5 © 5 In Life Sciences: data, publication, it’s all the same. It’s just part of the experiment. No separation between data and publications. Publications are the context for data. Break the silo between published papers and published data.

6 © 6 Aside: A heretic speaks. Life Scientists read journals. I’m a Computer Scientist. I don’t. It’s on the Web. It’s in podcast talks or PowerPoint. Google is the Lord’s work. What PhD students are for. Journal publications too outdated.

7 © 7 Bioinformatics pipelines on the web. Copy and paste from one web-based application to another. Annotate by hand. Disadvantages: time consuming, error prone, tacit procedure so difficult to share both protocol and results. RepeatMasker → BLASTn → Twinscan

8 © 8 Workflows for Science

9 © 9 Workflows for Science. “Workflow at its simplest is the movement of documents and/or tasks through a work process. More specifically, workflow is the operational aspect of a work procedure: how tasks are structured, who performs them, what their relative order is, how they are synchronized, how information flows to support the tasks and how tasks are being tracked.”

10 © 10 Workflows for Science. Repeat Masker Web Service → BLASTn Web Service → Twinscan Web Service. Sequence in, predicted genes out. Simple scripting language specifies how steps of a pipeline link together. Hides all the fiddling about. Advantages: automation, quick to write, easier to explain, share, relocate, and record provenance of results in a standard way.
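
Sketch (not from the slides): one way to picture "a simple scripting language specifies how steps of a pipeline link together" is an ordered list of named steps, each feeding its output to the next. The three functions below are hypothetical local stand-ins, not calls to the real RepeatMasker, BLASTn or Twinscan web services, and `run_pipeline` is an assumed name rather than anything from Taverna.

```python
from typing import Any, Callable, List, Tuple

# A step is a (name, function) pair; the function takes the previous output.
Step = Tuple[str, Callable[[Any], Any]]


def mask_repeats(sequence: str) -> str:
    """Stand-in for a repeat-masking service: lowercase bases become 'N'."""
    return "".join("N" if base.islower() else base for base in sequence)


def similarity_search(sequence: str) -> dict:
    """Stand-in for a similarity-search service: returns an empty hit list."""
    return {"query": sequence, "hits": []}


def predict_genes(search_result: dict) -> List[str]:
    """Stand-in for a gene predictor: unmasked runs of 4+ bases are 'genes'."""
    return [seg for seg in search_result["query"].split("N") if len(seg) >= 4]


# The "workflow" itself: only the order and wiring of the steps.
PIPELINE: List[Step] = [
    ("RepeatMasker", mask_repeats),
    ("BLASTn", similarity_search),
    ("Twinscan", predict_genes),
]


def run_pipeline(steps: List[Step], data: Any) -> Tuple[Any, List[str]]:
    """Run each step in order and record which steps were executed."""
    trace: List[str] = []
    for name, step in steps:
        data = step(data)
        trace.append(name)
    return data, trace
```

Calling `run_pipeline(PIPELINE, "ACGTacgtACGT")` gives sequence in, predicted genes out, plus a trace of the steps run. The pipeline is plain data, so it can be shared or rerun without repeating the manual copy and paste.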

11 © 11 Workflows for Science. Workflows describe the scientist’s in silico experiment. Link together and cross-reference data in different repositories. And that includes serials! Remote, third-party, external applications and services. Accessible to the workflow machinery. And that includes serials! Results management. Semantic metadata annotation of data. Provenance tracking of results. Sharing and replicating know-how. Reuse of workflows.

12 © 12

13 © 13 WBS. The first complete and accurate map of the region of chromosome 7 involved in Williams-Beuren Syndrome. Time to perform one WBS pipeline cut from 2 weeks to 2 hours. Faster, automated, systematic and shareable.

14 © 14 Users. 14000+ downloads. 150+ users in US, Singapore, UK, Europe, Australia. Systems biology, proteomics, gene/protein annotation, microarray data analysis. Now part of the UK’s Open Middleware Infrastructure Institute. http://www.mygrid.org.uk

15 © 15 Reuse: adapting and sharing best practice and know-how across a community by publishing workflows. Trypanosomiasis in cattle; Chicken genome; Mouse genome; Graves’ Disease; Williams-Beuren Syndrome.

16 © 16 Trypanosomiasis in cattle. Identify the genetic difference responsible for resistance to trypanosomiasis and breed into productive cattle. Mice as a model. Gene expression and microarray analysis. The literature. Associations between upregulated genes. Links between changed genes and genes in the Tir1 region.

17 © 17

18 © 18

19

20 © 20 PubMed Text Mining results

21 © 21

22 © 22 Chilibot text mining in Taverna

23 © 23 Taverna output Chilibot web page

24 © 24 Lipoprotein and cholesterol. Trypanosomes need cholesterol and have a scavenger receptor specific for HDL. Resistant mice reduce available HDL, slowing trypanosome growth. New hypothesis: resistance and susceptibility in mice are a function of the cholesterol recycling pathway. Mice love lard.

25 © 25 Biological pathway, highlighted with RNA molecules (orange) and DNA QTL molecules (pink), discovered with the aid of Chilibot text mining over PubMed.
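
Sketch (not from the slides): the text-mining step can be pictured as counting how often pairs of terms appear in the same sentence of an abstract. This is a deliberately naive illustration of co-occurrence mining, not Chilibot's actual method. The function name and the example sentences are made up for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Tuple


def cooccurrence(sentences: Iterable[str], terms: Iterable[str]) -> Counter:
    """Count term pairs that occur together in the same sentence.

    Matching is a case-insensitive substring test, which is crude:
    a real system would tokenise and resolve gene/protein synonyms.
    """
    lowered_terms = [term.lower() for term in terms]
    counts: Counter = Counter()
    for sentence in sentences:
        text = sentence.lower()
        # Sort so each pair is always stored in the same order.
        present = sorted(term for term in lowered_terms if term in text)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Made-up sentences standing in for abstract text.
EXAMPLE_SENTENCES = [
    "Trypanosomes scavenge HDL cholesterol.",
    "HDL levels fall in resistant mice.",
    "Cholesterol and HDL are linked.",
]
EXAMPLE_TERMS = ["HDL", "cholesterol", "mice"]
```

Running `cooccurrence(EXAMPLE_SENTENCES, EXAMPLE_TERMS)` shows the cholesterol/HDL pair occurring most often. Repeated co-occurrence of this kind is the sort of signal that can suggest a link worth checking.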

26 © 26

27 © 27 myGrid / Discovery Net. Specialist term recognition software. Assigning Gene Ontology terms to papers in MEDLINE.

28 © 28 Science: Knowledge-driven. MEDLINE abstract, marked up by SciBorg. HTML-CML version.

29 © 29 “the development of online submission systems for scientific manuscripts provides a mechanism for including a mapping of the information in the manuscript to controlled terminologies as an integral part of the publishing process. It is not hard to envision that the indexing of a paper to controlled terms for anatomical, gene nomenclature, or functional terminologies would be a necessary requirement for acceptance of a paper for publication. This, then, would enable the rapid incorporation of the paper and its contents into bioinformatics systems.” Judith Blake, Bio-ontologies—fast and furious, Nature Biotechnology 22, 773-774 (2004)

30 © 30 The scholarly knowledge cycle. Liz Lyon, Ariadne, July 2003. [Diagram labels: Learning & Teaching workflows; Research & e-Science workflows; Aggregator services: national, commercial; Repositories: institutional, e-prints, subject, data, learning objects; Institutional presentation services: portals, Learning Management Systems, u/g, p/g courses, modules; Presentation services: subject, media-specific, data, commercial portals; Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media; Data analysis, transformation, mining, modelling; Peer-reviewed publications: journals, conference proceedings; Learning object creation, re-use; Quality assurance bodies; Harvesting metadata; Searching, harvesting, embedding; Resource discovery, linking, embedding; Deposit / self-archiving; Publication; Validation.] This work is licensed under a Creative Commons License Attribution-ShareAlike 2.0. © Liz Lyon (UKOLN, University of Bath), 2005

31 © 31 eBank UK Project. Aggregator service harvests metadata from institutional repository (e-crystals archive). eBank service embedded in PSIgate portal for 3rd-party search. Service linking from data to derived research publication. Embedding eBank service in learning workflows. UKOLN (lead), University of Southampton, University of Manchester. http://www.ukoln.ac.uk/projects/ebank-uk/

32 © 32 Linking data to publications

33 © 33 Provenance. Log what, where, when, who. For data and for publications.
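
Sketch (not from the slides): a provenance entry that logs "what, where, when, who" can be as small as a four-field record appended to a log each time a step runs. The class and function names, the example.org URL and the user name are all hypothetical. This is not the myGrid provenance model.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    what: str   # the operation performed or the item produced
    where: str  # the service or repository involved
    when: str   # ISO 8601 timestamp
    who: str    # the person or agent responsible


def log_step(
    log: List[ProvenanceRecord],
    what: str,
    where: str,
    who: str,
    when: Optional[datetime] = None,
) -> ProvenanceRecord:
    """Append one record to the log, stamping it with UTC now by default."""
    stamp = (when or datetime.now(timezone.utc)).isoformat()
    record = ProvenanceRecord(what=what, where=where, when=stamp, who=who)
    log.append(record)
    return record


def to_json(log: List[ProvenanceRecord]) -> str:
    """Serialise the log so it can travel with data or a publication."""
    return json.dumps([asdict(record) for record in log], indent=2)
```

The same record shape works whether the thing being tracked is a dataset or a paper, which is the point of "for data and for publications".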

34 © 34 Workflows, Web services, Text mining, Bioinformatics, Semantic mark-up

35 © 35 Workflows, Web services, Text mining, Bioinformatics, Semantic mark-up. Publications have to be computational services: web services. They will be read and processed by machines. Licensing that works! Authorisation, authentication and digital rights management (e.g. Shibboleth). Integration of data and publications. Workflows are linking results, whatever the source. Common ids and persistent ids for citation (DOI, LSID, InChI). No silos.
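
Sketch (not from the slides): for a workflow to link results "whatever the source", it first has to tell which identifier scheme it has been handed. The checks below rely only on the well-known surface syntax of each scheme (DOIs start with `10.`, LSIDs with `urn:lsid:`, InChI strings with `InChI=`) and are simplified. The function name and the example DOI and LSID are hypothetical.

```python
import re

# Simplified DOI shape: "10.", a numeric registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def identifier_scheme(identifier: str) -> str:
    """Return 'DOI', 'LSID', 'InChI' or 'unknown' for a persistent id."""
    value = identifier.strip()
    if value.lower().startswith("urn:lsid:"):
        # urn:lsid:<authority>:<namespace>:<object>[:<revision>]
        parts = value.split(":")
        if len(parts) in (5, 6) and all(parts[2:5]):
            return "LSID"
        return "unknown"
    if value.startswith("InChI="):
        return "InChI"
    if DOI_PATTERN.match(value):
        return "DOI"
    return "unknown"


EXAMPLES = [
    "10.1234/example.5678",                  # hypothetical DOI
    "urn:lsid:example.org:sequences:12345",  # hypothetical LSID
    "InChI=1S/CH4/h1H4",                     # InChI for methane
]
```

A workflow step could use such a check to route each id to the right resolver, so a paper and a dataset end up cited in the same result.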

36 © 36 Workflows, Web services, Text mining, Bioinformatics, Semantic mark-up. Semantic publishing at source. In order to automate we need better ways of interpreting the publication content. They will be read and processed by machines. Integration of data and publications. Common vocabularies. Accessible full texts for text mining, not just abstracts.

37 © 37 Workflows, Bioinformatics, Data, Publications, Semantic markup, Provenance

38 © 38 Workflows, Bioinformatics, Data, Publications, Semantic markup, Provenance. Publish workflows with data with publications. Privacy? Intellectual property? Licensing models for services so we can reuse and share results and workflows.

39 © 39 Take home. Machines are reading your journals, not just people. And if the journals are not online then they are unread. Workflows are another form of outcome to publish alongside data, metadata and publications. Google rocks – I don’t use anything else! http://www.mygrid.org.uk http://www.ukoln.ac.uk/projects/ebank-uk/ http://www.combechem.org

40 © 40 Acknowledgements. The myGrid Team, esp. Tom Oinn, Chris Wroe, Antoon Goderis, Andy Brass, Paul Fisher, Hannah Tipney, May Tassabehji, Rob Gaizauskas, Ian Roberts. Discovery Net / Inforsense: Vasa Curcin, Moustafa M Ghanem. BioBank / CombeChem: David De Roure, Liz Lyon. Scientists: Peter Murray-Rust, Judith Blake, Mike Ashburner.

41 © 41 Digital Library workflows. Workflows for data capture, deposit, preservation, citation, discovery, mining &&…. Multiple workflows interacting together. Workflows may call on each other, in a defined order. Multiple workflows may use “common” services, e.g. Assign (identifier). Require sequential or parallel execution, have dependencies, be time-limited, repetitive. Have an owner (control). Include essential human interventions. ? ? ?
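
Sketch (not from the slides): "sequential or parallel execution" with "dependencies" can be illustrated by grouping workflows into stages, where everything in a stage has its prerequisites met and could run in parallel. The dependency map below is an assumption made up for illustration, and `execution_stages` is a hypothetical name. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter
from typing import Dict, List

# Assumed dependencies: each workflow maps to the workflows it must follow.
DEPENDS_ON: Dict[str, List[str]] = {
    "capture": [],
    "assign_identifier": ["capture"],  # the shared "Assign (identifier)" service
    "deposit": ["assign_identifier"],
    "preserve": ["deposit"],
    "cite": ["deposit"],
    "discover": ["deposit"],
    "mine": ["discover"],
}


def execution_stages(depends_on: Dict[str, List[str]]) -> List[List[str]]:
    """Group workflows into stages; members of one stage can run in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if dependencies loop
    stages: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        stages.append(ready)
        sorter.done(*ready)
    return stages
```

With the assumed map, capture, identifier assignment and deposit run one after another. Citation, discovery and preservation can then proceed side by side, with mining following discovery.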

