Bioinformatics and Grid: Progress and Potential. Peter Rice, EBI. ISGC, April 2005.


1 Bioinformatics and Grid: Progress and Potential. Peter Rice, EBI. ISGC, April 2005

2 European Bioinformatics Institute Part of the European Molecular Biology Laboratory International organisation 18 member states Headquarters (EMBL) in Heidelberg, Germany 4 Specialist "Outstations" Hamburg, Germany - DESY - Structural biology Grenoble, France - ESRF - Structural biology Monterotondo, Italy - Mouse genetics Hinxton, United Kingdom - Bioinformatics

3 European Bioinformatics Institute On the Hinxton Genome Campus near Cambridge Public biological database provider EMBLbank - DNA sequence (human genome, etc.) UniProt - Protein sequence MSD - Protein structure InterPro - Protein function ArrayExpress - Gene expression data GO - Gene ontology Ensembl, Integr8 - Integrated data resources Scientific literature Bioinformatics research Bioinformatics services Data retrieval (SRS, etc.) Sequence searches (BLAST etc.) Open source analysis tools

4 EBI RFCGR (HGMP) Sanger European Bioinformatics Institute

5 eScience at EBI Tool and data integration Creating services and service standards Building services into workflows Semantic web/grid technologies Grid computing Currently web service implementations Databases produced by EBI, mirrored across Europe Data additions and modifications 1/sec Search services and analysis tools Simple for laboratory biologists Complex for expert bioinformaticians Managing local and mirrored data and services The bioinformatics view of the Grid is hard to define

6 Bioinformatics meets particle physics...

7 Data flow in particle physics: 40 million collisions/sec; 10 Gbytes data/sec; interest in 1-10,000 per billion events; 100 Kbytes per event; 10 Kbytes per event; data stored at 1.25 Gbytes/sec; raw data 2 Mbytes/event. Expecting: tape 20 Pbytes/year, disk 5 Pbytes/year.

8 A particle physics view of the Grid

9 Data flow in bioinformatics - data sources DNA sequence data Major sequencing centres and small laboratories Three major databases (EMBL, GenBank, DDBJ) Data volume doubles every year Associated databases: Gene expression Genetics... Protein data Sequence translated from DNA sequence Annotation automated and by experts Associated databases: Protein 3D structure Protein families and domains Protein function Protein expression...
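
The "data volume doubles every year" point on slide 9 implies exponential growth. A minimal sketch of that arithmetic follows; the function name and the starting volume of 1.0 are placeholders chosen for illustration, not figures from the talk.

```python
def projected_volume(initial: float, years: float, doubling_time_years: float = 1.0) -> float:
    """Return the volume after `years`, assuming it doubles every `doubling_time_years`."""
    return initial * 2 ** (years / doubling_time_years)


if __name__ == "__main__":
    # Relative growth from an arbitrary starting volume of 1.0 (placeholder unit).
    for year in (1, 5, 10):
        print(year, projected_volume(1.0, year))
```

Under this assumption a database grows roughly a thousandfold in ten years, which is why storage and mirroring recur as themes later in the talk.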

10 Data flow in bioinformatics - user communities Bioinformatics research - software/database developers Bioinformatics research - expert users Academic biological research Molecular biology Biochemistry Microbiology Clinical medicine Crop research Physiology (Systems biology) Industry Pharmaceutical industry Small biotechnology companies Bioinformatics software/database providers Integrated solutions providers General public Interest in human genome and other data

11 Filling a genomic gap in Silico acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa Services published on the web, many without programmatic interfaces

12 Filling a genomic gap in Silico acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa Services published on the web, many without programmatic interfaces Public and local databases and data sets Sequence alignment algorithms Stochastic models for clustering gene expression data Gene prediction algorithms Protein-protein interaction algorithms Protein folding simulations Visualisation tools Literature searches Ontology services

13 Filling a genomic gap in Silico acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

14 Analysis tools - EMBOSS European Molecular Biology Open Software Suite Analysis of DNA and protein sequence data Open source project started in 1996 With Rosalind Franklin Centre for Genomic Research Over 20,000 unique downloads Reads and writes many data formats Command line driven Over 50 alternative user interfaces Web GUI Integrated applications Web services (e.g. SoapLab) Command definition language (ACD)

15 SoapLab Web Services Web service wrappers for EMBOSS and legacy applications... and CGI web pages Implements OMG standard: Biomolecular Sequence Analysis Application is defined in the EMBOSS ACD style Definition converted into input and output port types SoapLab server provides stateful and stateless job control Inherits multiple data formats from EMBOSS

16 Taverna workflows and the myGrid Project: Taverna Workbench; Scufl language parser; Freefluo workflow enactor core; processors: plain web service, Soaplab, local app, enactor, BioMOBY, SeqHound, BioMart

17 Practical example: Williams-Beuren Syndrome Genetic disease 1/20,000 children affected Multiple phenotypes: Characteristic facial features Muscle, nervous system and circulation Mental retardation Mapped to human chromosome 7 Deletion (missing DNA) in all cases Also missing in the draft human genome sequence (2000)

18 Williams-Beuren Syndrome Microdeletion Chr 7 ~155 Mb FKBP6 FZD9 BAZ1B BCL7B TBL2 WBSCR14 STX1A CLDN4 CLDN3 ELN LIMK1 LAB EIF4H RFC2 CYCLN2 GTF2IRD1 GTF2I NCF1 GTF2IRD2 SVAS WBS 7q11.23 * * ~1.4 Mb gap Physical contig Patient deletions CTA-315H11 CTB-51J22

19 GenBank Accession No GenBank Entry Seqret Nucleotide seq (Fasta) GenScan Coding sequence ORFs prettyseq restrict cpgreport RepeatMasker ncbiBlastWrapper sixpack transeq 6 ORFs Restriction enzyme map CpG Island locations and % Repetitive elements Translation/sequence file. Good for records and publications Blastn Vs nr, est databases. Amino Acid translation epestfind pepcoil pepstats pscan Identifies PEST seq Identifies FingerPRINTS MW, length, charge, pI, etc Predicts Coiled-coil regions SignalP TargetP PSORTII InterPro Hydrophobic regions Predicts cellular location Identifies functional and structural domains/motifs Pepwindow? Octanol? BlastWrapper URL inc GB identifier tblastn Vs nr, est, est_mouse, est_human databases. Blastp Vs nr RepeatMasker Query nucleotide sequence BLASTwrapper Sort for appropriate Sequences only Pink: Outputs/inputs of a service Purple: Tailor-made services Green: EMBOSS Soaplab services Yellow: Manchester Soaplab services RepeatMasker TF binding Prediction Promoter Prediction Regulation Element Prediction Identify regulatory elements in genomic sequence
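
Several steps in this workflow ("6 ORFs", "Amino Acid translation") rest on translating a nucleotide sequence in all six reading frames. The sketch below is a pure-Python stand-in for that idea using the standard genetic code; it is not EMBOSS code and does not reproduce the output format of sixpack or transeq. All function names are illustrative.

```python
from itertools import product
from typing import Dict

# Standard genetic code, laid out in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frames(seq: str) -> Dict[str, str]:
    """Translate the three forward (+1..+3) and three reverse (-1..-3) frames."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    # First nine bases of the sequence shown on slides 11-13.
    for name, protein in six_frames("acatttcta").items():
        print(name, protein)
```

In the real workflow these steps are remote services, so the value of the workflow engine lies in chaining them, not in the translation itself.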


21 Williams-Beuren Workflows Characterisation of nucleotide sequence Identification of overlapping sequence Characterisation of protein sequence

22 Third-party tools Utopia Haystack LSID Launchpad myGrid information model Applications Core Services External Services Service & workflow discovery Feta semantic discovery GRIMOIRES federated UDDI+ registry Web portals Taverna e-Science workbench Workflow enactment Freefluo workflow engine Metadata Management KAVE metadata store KAVE provenance capture myGrid ontology Soaplab Gowlab AMBIT text extraction service Legacy applications Web Services OGSA-DAI databases Web Sites OGSA-DAI DQP service e-Science coordination e-Science mediator e-Science process patterns e-Science events LSID support Data Management mIR myGrid information repository Web Service (Grid Service) communication fabric Notification service Pedro semantic publication Java applications Executable codes with an IDL

23 Intermediate Results

24 Results Management The Taverna/Freefluo workflow enactment engine is agnostic about the data flowing through it. As objects progress through it, they are tagged with terms from ontologies, free-text descriptions and MIME types, and they may contain arbitrary collection structures. Using these metadata hints we can locate and launch pluggable view components. One WBS workflow can produce ~130 files, so managing and presenting (intermediate) results can be a major problem.
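
The idea of using metadata hints to pick a view component can be sketched as a small registry keyed on MIME type. This is an illustration of the pattern only, not the Taverna API: `TaggedResult`, `ViewerRegistry` and the MIME strings used below are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class TaggedResult:
    """A workflow result plus the metadata hints attached to it (hypothetical model)."""
    value: object
    mime_types: List[str] = field(default_factory=list)
    ontology_terms: List[str] = field(default_factory=list)
    description: str = ""


Viewer = Callable[[TaggedResult], str]


class ViewerRegistry:
    """Maps MIME types to pluggable view components."""

    def __init__(self) -> None:
        self._viewers: Dict[str, Viewer] = {}

    def register(self, mime_type: str, viewer: Viewer) -> None:
        self._viewers[mime_type] = viewer

    def find(self, result: TaggedResult) -> Optional[Viewer]:
        # The first MIME type with a registered viewer wins.
        for mime_type in result.mime_types:
            if mime_type in self._viewers:
                return self._viewers[mime_type]
        return None

    def render(self, result: TaggedResult) -> str:
        # Fall back to a plain representation when no viewer matches.
        viewer = self.find(result)
        return viewer(result) if viewer else repr(result.value)


if __name__ == "__main__":
    registry = ViewerRegistry()
    registry.register("text/plain", lambda r: f"TEXT: {r.value}")
    print(registry.render(TaggedResult("acatttctac", mime_types=["text/plain"])))
```

Because the engine itself stays agnostic about data, a new result type only needs a new viewer registration and no change to the enactor.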

25

26 Workflow environment Taverna API acts as an intermediate layer between user level applications and workflow enactors such as FreeFluo. Includes object models for both workflow definitions and data objects in a workflow Implicit iteration and data flow Data sets and nested flows Configurable failure handling Life Science ID resolution Plug-in framework Event notification Provenance and status reporting Permissive type management Graphical display Data entry wizard
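
"Implicit iteration" can be read as follows: when a processor that expects a single item receives a list (or a nested list), the engine applies it to each element and returns results with the same shape. The sketch below illustrates that reading under stated assumptions; `invoke` and `expected_depth` are hypothetical names and this is not how Taverna itself is implemented.

```python
from typing import Any, Callable


def nesting_depth(value: Any) -> int:
    """0 for a single item, 1 for a list of items, 2 for a list of lists, and so on."""
    if isinstance(value, list):
        return 1 + max((nesting_depth(item) for item in value), default=0)
    return 0


def invoke(processor: Callable[[Any], Any], value: Any, expected_depth: int = 0) -> Any:
    """Apply `processor`, iterating implicitly over any nesting deeper than it expects."""
    if nesting_depth(value) > expected_depth:
        return [invoke(processor, item, expected_depth) for item in value]
    return processor(value)


if __name__ == "__main__":
    # A processor written for one sequence is applied across a nested collection.
    print(invoke(str.upper, ["acgt", ["tt", "ga"]]))
    # A processor that expects a whole list (depth 1) receives the list unchanged.
    print(invoke(len, ["a", "bb"], expected_depth=1))
```

The workflow author therefore wires single-item services together and leaves collection handling to the engine.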

27 Bioinformatics standards: Life Science Identifiers OMG standard proposal – IBM, EBI, I3C Standard identifier for biological entities Uniform Resource Name (URN) format Example: URN:LSID:ebi.ac.uk:SWISS-PROT.accession:P34355:3 Authority: ebi.ac.uk (can be any string, e.g. emboss.org ) Namespace: SWISS-PROT.accession Object: P34355 (Optional) revision: 3 Also used for internal objects in Taverna
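
The LSID layout described on this slide (authority, namespace, object, optional revision) can be split with a few lines of code. This is a minimal sketch based only on the fields named above; it does not perform LSID resolution or validate against the full specification, and `parse_lsid` is an illustrative name.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'URN:LSID:authority:namespace:object[:revision]' into its parts."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated fields: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID URN: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty authority, namespace or object: {text!r}")
    # The revision field is optional; an empty trailing field counts as absent.
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)


if __name__ == "__main__":
    print(parse_lsid("URN:LSID:ebi.ac.uk:SWISS-PROT.accession:P34355:3"))
```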

28 Bioinformatics standards: Life Science Analysis Engine Reuses OMG Biomolecular Sequence Analysis components Describes SoapLab2... Already partly implemented Platform-independent model Platform-specific models for: Web services Java Defines metadata for analysis services Input and outputs: Syntactic type - data format Semantic type - data type

29 Bioinformatics standards: ACD command definitions Developed for the EMBOSS project Applied to general command-line controlled applications Written in a simple text format, convertible to XML etc. Validation tools provided by EMBOSS Easy to extend: Syntactic types (wrappers can choose one or more) Semantic type Relating output to inputs "... is an alignment of input1 and input2" "... is sequence feature positions in input1" Application metadata Hints for service wrappers and GUIs Potential split into multiple simpler services
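
To show how a text command definition could be turned into the input and output ports a service wrapper needs, here is a sketch that reads a simplified ACD-style definition. The sample text is a hypothetical application written from memory of the ACD style, the parser handles only this flat subset (no nested brackets, no variables or calculated values), and the `OUTPUT_TYPES` set is a partial, assumed list of output data types. EMBOSS's own validation tools remain the reference for real ACD files.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical definition in a simplified ACD-like style (not a real EMBOSS application).
ACD_TEXT = '''
application: demoalign [
  documentation: "Hypothetical pairwise alignment example"
]

sequence: asequence [
  parameter: "Y"
]

sequence: bsequence [
  parameter: "Y"
]

align: outfile [
  parameter: "Y"
]
'''

BLOCK = re.compile(r'(\w+):\s*(\w+)\s*\[(.*?)\]', re.DOTALL)
ATTRIBUTE = re.compile(r'(\w+):\s*"([^"]*)"')
STRUCTURAL = {"application", "section"}  # entries that are not parameters
OUTPUT_TYPES = {"seqout", "seqoutall", "seqoutset", "outfile", "report", "align", "featout"}  # assumed, partial


def parse_acd(text: str) -> List[Dict[str, object]]:
    """Return one entry per 'datatype: name [ ... ]' block."""
    return [
        {"datatype": datatype, "name": name, "attributes": dict(ATTRIBUTE.findall(body))}
        for datatype, name, body in BLOCK.findall(text)
    ]


def to_ports(entries: List[Dict[str, object]]) -> Tuple[List[str], List[str]]:
    """Split parameter names into (input ports, output ports) by data type."""
    inputs: List[str] = []
    outputs: List[str] = []
    for entry in entries:
        if entry["datatype"] in STRUCTURAL:
            continue
        target = outputs if entry["datatype"] in OUTPUT_TYPES else inputs
        target.append(str(entry["name"]))
    return inputs, outputs


if __name__ == "__main__":
    print(to_ports(parse_acd(ACD_TEXT)))
```

Here the `align` output would be the kind of entry that carries the relation "is an alignment of input1 and input2" mentioned on the slide.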

30 EMBRACE: Putting this into practice European Union "Network of Excellence" 5-year project Coordinated by Graham Cameron at EBI Test cases requiring integration of data content and tools Application interface standards for data content: DNA and protein sequence data Structure and image data Gene and protein expression Literature and text mining Analysis tools using data content standards Sequence analysis tools (EMBOSS etc.) Structure analysis tools... and tools for all the other data types Taverna as an example user interface.

31 Scientific Content The User doesn’t care

32 Harvest and delivery

33 Application User interface Layers

34 Application User interface Databases

35 Application User interface Application interface Interconnectivity

36 Application User interface Application interface Communicate objects and their identities

37 Application User interface Application interface Using standard protocols

38 ComparaGrid: What the biologist really needs UK BBSRC-funded project Integrating data across species Vertebrates Invertebrates Plants Fungi Micro-organisms Detailed knowledge is in the model organisms (genome projects etc.) Biologists need to use this knowledge to understand other species. This is difficult: Need to understand the data resources Need to understand the biology There is a strong overlap with the EMBRACE project EMBRACE: interface standards for data and tools ComparaGrid: how to explore the data using these standards

39 Web vs Grid services: strengths and weaknesses now Web services Grid services EMBRACEgrid Requires: Data management Data replication Service discovery Computing KO ?? KO OK ?? OK KO Lack of infrastructure providing low-level services Instability and lack of robustness Standards still evolving, and implementations lagging behind Information world Infrastructure world

40 Acknowledgements myGrid: Carole Goble, Chris Wroe, Hannah Tipney (Manchester), Anil Wipat (Newcastle)... and the rest of the myGrid team Tom Oinn, Martin Senger (EBI) Taverna: Tom Oinn (EBI) and his many collaborators SoapLab, LSID¹, LSAE²: Martin Senger (EBI), Sean Martin¹ (IBM), Mike Niemi¹ (I3C), Richard Scott² (deNovo) EMBOSS: Alan Bleasby, Jon Ison, Gary Williams, Claude Beezley, Hugh Morgan (RFCGR) Tim Carver (Sanger) Lisa Mullan (EBI) EMBRACE: Graham Cameron, Kerstin Nyberg (EBI) Alan Bleasby (RFCGR), Vincent Breton (CNRS France), Erik Bongcam-Rudloff (LCB Sweden), Gert Vriend (CMBI, Netherlands) COMPARAGRID: Andy Law (Roslin), Anil Wipat (Newcastle)... and the rest of the team

