Semantic Mediation in myGrid Chris Wroe Manchester University.


1 Semantic Mediation in myGrid Chris Wroe Manchester University

2 UK e-Science Pilot Project. Oct 2001 – April 2005. £3.4 million. £0.4 million studentships. Newcastle, Nottingham, Manchester, Southampton, Hinxton, Sheffield

3 Data-intensive bioinformatics
ID   MURA_BACSU STANDARD; PRT; 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE 116 116 BINDS PEP (BY SIMILARITY).
FT   CONFLICT 374 374 S -> A (IN REF. 3).
SQ   SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI

4 Web Service (Grid Service) communication fabric AMBIT Text Extraction Service Provenance Personalisation Event Notification Gateway Service and Workflow Discovery myGrid Information Repository Ontology Mgt Metadata Mgt Workbench Taverna Talisman Native Web Services SoapLab Web Portal Legacy apps Registries Ontologies FreeFluo Workflow Enactment Engine OGSA-DQP Distributed Query Processor Bioinformaticians Tool Providers Service Providers Applications Core services External services Service Stack Views Legacy apps GowLab

5 Workflow approach: Graves’ disease

6 Workflow approach II

7 Issues: Connecting web services together – Shim services. Connecting data to web services – Data provenance delivered by LSIDs. Connecting data to data – Distributed Query Processing.

8 Technology – Resource Description Framework (RDF): representing metadata about data and services – Web Ontology Language (OWL): representing concepts and classifications

9 myGrid & Bioinformatics world: Automating mainstream, well known tasks. Well known, mature data formats. Often no formal description of formats. Lots of code to manipulate formats already exists (BioPerl, BioJava …). Semantic mediation: work in progress.

10 Williams-Beuren Syndrome Workflow Main Bioinformatics Applications Explore gap regions within the W-B Critical Region Main Bioinformatics Services Main Bioinformatics Application SHIM Services

11 Williams Example (simple) GenBank retrieval service Genscan gene prediction service GenBank record has_part genomic sequence genomic sequence in GenBank record FASTA sequence Semantic level Syntactic level

12 Sample Genbank Record LOCUS AY214156 1065 bp mRNA linear VRT 07-MAY-2004 DEFINITION Oncorhynchus nerka RH1 opsin mRNA, complete cds. ACCESSION AY214156 VERSION AY214156.1 GI:37787241 KEYWORDS. SOURCE Oncorhynchus nerka (sockeye salmon) ORGANISM Oncorhynchus nerka Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Actinopterygii; Neopterygii; Teleostei; Euteleostei; Protacanthopterygii; Salmoniformes; Salmonidae; Oncorhynchus. REFERENCE 1 (bases 1 to 1065) AUTHORS Dann,S.G., Allison,W.T., Levin,D.B., Taylor,J.S. and Hawryshyn,C.W. TITLE Salmonid opsin sequences undergo positive selection and indicate an alternate evolutionary relationship in oncorhynchus JOURNAL J. Mol. Evol. 58 (4), 400-412 (2004) PUBMED 15114419 REFERENCE 2 (bases 1 to 1065) AUTHORS Dann,S.G., William,A.E., David,L.B. and Craig,H.W. TITLE Direct Submission JOURNAL Submitted (08-JAN-2003) Biology, University of Victoria, PO Box 3020 Stn CSC, Victoria, British Columbia V8W 3N5, Canada FEATURES Location/Qualifiers source 1..1065 /organism="Oncorhynchus nerka" /mol_type="mRNA" /db_xref="taxon:8023" CDS 1..1065 /codon_start=1 /product="RH1 opsin" /protein_id="AAP58347.1" /db_xref="GI:37787242" /translation="MNGTEGPDFYVPMSNATGIVRNPYEYPQYYLVSPAAYSLMAAYM FFLILTGFPINFLTLYVTIEHKKLRTALNYILLNLAVADLFMVIGGFTTTMYTSMHGY FVFGRTGCNIEGFCATHGGEIALWSLVVLAIERWLVVCKPISNFRFSETHAIIGVAFT WVMAAACSVPPLLGWSRYIPEGMQCSCGIDYYTRAPDINNESFVIHMFVVHFMIPLFI ISFCYGNLLCAVKAAAAAQQESETTQRAEREVTRMVIMMVVSFLVCWVPYASVAWYIF CNQGTEFGPVFMTIPAFFAKSSSLYNPLIYVLMNKQFRNCMITTLCCGKNPFEEEEGA STTASKTEASSVSSSSVAPA" ORIGIN 1 atgaacggca cagagggacc agatttctac gtccctatgt ccaatgctac tggcattgtt 61 aggaacccct atgaataccc ccagtactac cttgtcagcc cagcggcgta ctcactcatg 121 gctgcctaca tgttcttcct catcctcacc ggcttcccca tcaacttcct cacactctat 181 gtcaccatcg agcacaaaaa gctgaggacc gccctgaact acatcctgct gaacctggct 241 gtggccgatc tcttcatggt aatcggaggc ttcaccacta cgatgtacac ctccatgcat 301 ggctatttcg tctttggaag aacgggctgc aacatcgagg gattctgtgc tacccatggt 361 
ggtgagattg ccctatggtc cctggttgtc ctggctattg agaggtggtt ggtcgtctgc 421 aaacctatta gcaacttccg cttcagtgag acccatgcca tcataggcgt ggcctttacc 481 tgggtcatgg ctgctgcttg ctccgtcccc cctctgcttg ggtggtcccg ctatatcccc 541 gaaggcatgc agtgctcatg tggaattgac tactacacgc gcgcccctga catcaacaat 601 gagtcctttg tcatccacat gttcgttgtc cactttatga ttcccctgtt catcatctcc 661 ttctgctacg gcaacctgct ctgcgctgtc aaggcagctg ccgccgccca gcaggagtct 721 gagaccaccc agagggctga gagggaagtg acccgcatgg tcatcatgat ggtcgtctcc 781 ttcctagtgt gctgggtgcc ctacgccagc gtggcctggt atatcttctg caaccaggga 841 acagagttcg gccccgtctt catgacaatt ccggcattct ttgccaagag ttcgtccctg 901 tacaaccctc tcatctacgt gttgatgaac aagcagttcc gcaactgcat gatcaccacc 961 ctgtgctgtg ggaagaaccc cttcgaggag gaggagggag cctccaccac tgcctccaag 1021 accgaggcct cctccgtgtc ctccagctcc gtggctcctg cataa //

13 FASTA >gi|37787241|gb|AY214156.1| Oncorhynchus nerka RH1 opsin mRNA, complete cds ATGAACGGCACAGAGGGACCAGATTTCTACGTCCCTATGTCCAATGCTACTGGCATTGTTAGGAACCCCT ATGAATACCCCCAGTACTACCTTGTCAGCCCAGCGGCGTACTCACTCATGGCTGCCTACATGTTCTTCCT CATCCTCACCGGCTTCCCCATCAACTTCCTCACACTCTATGTCACCATCGAGCACAAAAAGCTGAGGACC GCCCTGAACTACATCCTGCTGAACCTGGCTGTGGCCGATCTCTTCATGGTAATCGGAGGCTTCACCACTA CGATGTACACCTCCATGCATGGCTATTTCGTCTTTGGAAGAACGGGCTGCAACATCGAGGGATTCTGTGC TACCCATGGTGGTGAGATTGCCCTATGGTCCCTGGTTGTCCTGGCTATTGAGAGGTGGTTGGTCGTCTGC AAACCTATTAGCAACTTCCGCTTCAGTGAGACCCATGCCATCATAGGCGTGGCCTTTACCTGGGTCATGG CTGCTGCTTGCTCCGTCCCCCCTCTGCTTGGGTGGTCCCGCTATATCCCCGAAGGCATGCAGTGCTCATG TGGAATTGACTACTACACGCGCGCCCCTGACATCAACAATGAGTCCTTTGTCATCCACATGTTCGTTGTC CACTTTATGATTCCCCTGTTCATCATCTCCTTCTGCTACGGCAACCTGCTCTGCGCTGTCAAGGCAGCTG CCGCCGCCCAGCAGGAGTCTGAGACCACCCAGAGGGCTGAGAGGGAAGTGACCCGCATGGTCATCATGAT GGTCGTCTCCTTCCTAGTGTGCTGGGTGCCCTACGCCAGCGTGGCCTGGTATATCTTCTGCAACCAGGGA ACAGAGTTCGGCCCCGTCTTCATGACAATTCCGGCATTCTTTGCCAAGAGTTCGTCCCTGTACAACCCTC TCATCTACGTGTTGATGAACAAGCAGTTCCGCAACTGCATGATCACCACCCTGTGCTGTGGGAAGAACCC CTTCGAGGAGGAGGAGGGAGCCTCCACCACTGCCTCCAAGACCGAGGCCTCCTCCGTGTCCTCCAGCTCC GTGGCTCCTGCATAA

14 Williams Example (simple) GenBank retrieval service Genscan gene prediction service GenBank record has_part genomic sequence genomic sequence in GenBank record FASTA sequence Semantic level Syntactic level EMBOSS seqret service GenBank service
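
The syntactic gap on this slide (a GenBank record on one side, a FASTA sequence on the other) is the kind of mismatch a shim closes. The following is a minimal sketch of such a conversion in plain Python, assuming a well-formed flat-file record with a single-line DEFINITION; it is not the EMBOSS seqret implementation, and the function name `genbank_to_fasta` and the truncated `SAMPLE` record are illustrative only.

```python
def genbank_to_fasta(record: str, width: int = 70) -> str:
    """Extract the ORIGIN sequence of a GenBank flat-file record as FASTA.

    Simplifying assumptions: one record, and only the first DEFINITION line is used.
    """
    accession = ""
    definition = ""
    seq_parts = []
    in_origin = False
    for line in record.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_origin:
            # Sequence lines look like "   61 aggaacccct atgaataccc ..."
            seq_parts.extend(tok for tok in line.split() if tok.isalpha())
        elif line.startswith("VERSION"):
            accession = line.split()[1]
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip().rstrip(".")
        elif line.startswith("ORIGIN"):
            in_origin = True
    sequence = "".join(seq_parts).upper()
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{accession} {definition}"] + wrapped)


# Truncated excerpt of the record on slide 12 (only the first 80 bases are kept).
SAMPLE = """\
LOCUS       AY214156   1065 bp   mRNA   linear   VRT 07-MAY-2004
DEFINITION  Oncorhynchus nerka RH1 opsin mRNA, complete cds.
VERSION     AY214156.1  GI:37787241
ORIGIN
        1 atgaacggca cagagggacc agatttctac gtccctatgt ccaatgctac tggcattgtt
       61 aggaacccct atgaataccc
//
"""

if __name__ == "__main__":
    print(genbank_to_fasta(SAMPLE))
```

The header produced here is simpler than the NCBI-style `>gi|...|gb|...|` header on slide 13; a real shim would need to agree with the downstream service on the header convention it expects.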

15 Graves’ disease ArrayExpress Gene clustering service Microarray expression data out Microarray expression data in Affymetrix CEL file Treeview format Semantic level Syntactic level

16 Example data
CEL format:
CellHeader=X Y MEAN STDV NPIXELS
0 0 112.0 24.4 25
1 0 10699.0 1340.6 20
2 0 147.0 42.4 25
3 0 10602.0 2126.2 25
4 0 100.8 29.9 20
5 0 96.0 11.9 25
6 0 9829.0 1983.4 25
7 0 133.3 21.6 20
8 0 9092.0 1470.7 25
Treeview format:
Probe_Id Sample1236
1000_at 147
1001_at 96
1002_at -59
Template (Cell header, Probe ID):
2 0 1000_at
5 0 1001_at
2 3 1002_at

17 Graves’ disease ArrayExpress Gene clustering service Microarray expression data out Microarray expression data in Affymetrix CEL file Treeview format Semantic level Syntactic level AffyR service Template file
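
The example data on slide 16 suggests how a template file bridges the two formats: it maps a cell position (X, Y) in the CEL file to a probe ID, whose value then appears in the Treeview table. The sketch below shows only that lookup in plain Python. It is an assumption about the core of the mapping, not the AffyR service itself, which presumably does more (the value -59 for 1002_at hints at further processing). The function names are hypothetical.

```python
def parse_template(text):
    """Map (x, y) cell coordinates to probe IDs, one 'x y probe_id' row per line."""
    mapping = {}
    for line in text.strip().splitlines():
        x, y, probe_id = line.split()
        mapping[(int(x), int(y))] = probe_id
    return mapping


def cel_to_treeview(cel_text, template, sample_name):
    """Join CEL intensities to probe IDs and emit a tab-separated Treeview-style table."""
    lines = cel_text.strip().splitlines()
    means = {}
    for line in lines[1:]:  # skip the "CellHeader=..." line
        x, y, mean, _stdv, _npixels = line.split()
        means[(int(x), int(y))] = float(mean)
    rows = [f"Probe_Id\t{sample_name}"]
    for cell, probe_id in template.items():
        if cell in means:  # cells missing from the CEL excerpt are skipped
            rows.append(f"{probe_id}\t{means[cell]:.0f}")
    return "\n".join(rows)


# Rows copied from slide 16; cell (2, 3) is not part of the CEL excerpt shown there.
CEL_EXCERPT = """\
CellHeader=X Y MEAN STDV NPIXELS
0 0 112.0 24.4 25
1 0 10699.0 1340.6 20
2 0 147.0 42.4 25
5 0 96.0 11.9 25
"""

TEMPLATE = """\
2 0 1000_at
5 0 1001_at
2 3 1002_at
"""

if __name__ == "__main__":
    print(cel_to_treeview(CEL_EXCERPT, parse_template(TEMPLATE), "Sample1236"))
```

With the slide's numbers, cell (2, 0) yields 147 for 1000_at and cell (5, 0) yields 96 for 1001_at, matching the Treeview excerpt.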

18 Classification of shims. Definition: an experimentally neutral service used to connect domain services that don’t quite fit. Shim service: FILTER; MAPPER; DEREFERENCER; TRANSLATOR; syntax (e.g. GenBank to EMBL); data (e.g. DNA to protein); TRANSFORMER; SIFTER (SQL SELECT-type operation); PARSER (SQL PROJECT-type operation), also known as SPLITTER or DECOMPOSER; COMPARER; SORTER.

19 Providing more assistance: 1. Register (Taverna workbench) 2. Annotate (Pedro) 3. Query (Taverna workbench)

20 myGrid’s model of services: operation (name, description, input, output, task, method, resource, application); workflow; bioMoby service; WSDL operation; Soaplab service; service (name, description, author); organisation; WSDL service; parameter (name, description, semantic type, format, transport type, collection type, collection format)

21 Service Description Flow Discovery Client Semantic Indexing Component Registry XML document describing service Extract service descriptions to reason over Pedro Jena RDF repository Instance Store FACT DL reasoner

22 http://genetics.man.ac.uk execute http://www.mygrid.org.uk/ontology#pairwise_local_aligning ….. Pedro XML

23 RDF. Queries possible within RDF repository: Find me an operation called “exec*”. Find me a service provided by groups working on Williams disease. Find me an operation which performs aligning. Graph nodes: a1234, a2, “execute”, a3, http://genetics.man.ac.uk, #service, #local_pairwise_aligning, #operation, #aligning. Graph edges: published_by, type, subclass, name, task, hasOperation.

24 RDF. Queries not possible: Find me an operation which performs aligning which is local. Where does this service fit into a classification? Graph nodes: a1234, a2, “execute”, a3, http://genetics.man.ac.uk, #service, #local_pairwise_aligning, #operation, #aligning. Graph edges: published_by, type, subclass, name, task, hasOperation.
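
The queries on slides 23 and 24 can be illustrated with a toy triple store. The sketch below uses plain Python tuples rather than Jena, and the way nodes are wired to edges is an assumed reading of the flattened diagram labels, not something the transcript states. It shows why "performs aligning" is answerable by following `subclass` edges, while "aligning which is local" is not: nothing in the graph states locality as a property, and `#local_pairwise_aligning` is just an opaque name.

```python
# Assumed reading of the slide's graph; the node-to-edge wiring is a guess.
TRIPLES = {
    ("a1234", "type", "#service"),
    ("a1234", "published_by", "http://genetics.man.ac.uk"),
    ("a1234", "hasOperation", "a2"),
    ("a2", "type", "#operation"),
    ("a2", "name", "execute"),
    ("a2", "task", "a3"),
    ("a3", "type", "#local_pairwise_aligning"),
    ("#local_pairwise_aligning", "subclass", "#aligning"),
}


def objects(subject, predicate):
    """Return every o such that (subject, predicate, o) is asserted."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def superclasses(cls):
    """Return cls plus everything reachable over 'subclass' edges."""
    seen, todo = set(), [cls]
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.add(current)
            todo.extend(objects(current, "subclass"))
    return seen


def operations_named(prefix):
    """'Find me an operation called exec*'."""
    return sorted(
        s for s, p, o in TRIPLES
        if p == "name" and o.startswith(prefix) and "#operation" in objects(s, "type")
    )


def operations_performing(task_class):
    """'Find me an operation which performs <task_class>', using the subclass hierarchy."""
    found = set()
    for s, p, task in TRIPLES:
        if p != "task":
            continue
        for task_type in objects(task, "type"):
            if task_class in superclasses(task_type):
                found.add(s)
    return sorted(found)


if __name__ == "__main__":
    print(operations_named("exec"))
    print(operations_performing("#aligning"))
    # No class "#local_aligning" exists in the graph, so this finds nothing.
    print(operations_performing("#local_aligning"))
```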

25 OWL classes: #service, #local_pairwise_aligning, #operation. OWL property restriction: hasOperation. OWL property restriction: performsTask. Most specific class expression extracted. Definition: Service which has an operation which performs the task local pairwise aligning.

26 OWL classes: service, aligning service, local aligning service, pairwise local aligning service. Each service class has its own property based OWL definition. Instance store indexes our service instance (a1234) in the appropriate place. Classification calculated by the FaCT reasoner using property based definitions.
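
The idea of classifying by property-based definitions can be mimicked, very loosely, with set inclusion. The toy sketch below is not OWL and not the FaCT reasoner: it assumes each class is defined purely as a conjunction of task facets ("aligning", "local", "pairwise"), which are hypothetical names chosen to mirror the four classes on this slide. Under that assumption, one class subsumes another when its required facets are a subset of the other's, and an instance is indexed under every class whose requirements it meets.

```python
# Toy stand-in for property-based class definitions (an assumption, not OWL).
CLASS_DEFINITIONS = {
    "service": frozenset(),
    "aligning service": frozenset({"aligning"}),
    "local aligning service": frozenset({"aligning", "local"}),
    "pairwise local aligning service": frozenset({"aligning", "local", "pairwise"}),
}

# Hypothetical description of the service instance from the slides.
SERVICES = {"a1234": {"aligning", "local", "pairwise"}}


def subsumes(general, specific):
    """True if every instance of `specific` must also be an instance of `general`."""
    return CLASS_DEFINITIONS[general] <= CLASS_DEFINITIONS[specific]


def classify(task_facets):
    """Return all matching classes, ordered from most general to most specific."""
    facets = frozenset(task_facets)
    matches = [name for name, required in CLASS_DEFINITIONS.items() if required <= facets]
    return sorted(matches, key=lambda name: len(CLASS_DEFINITIONS[name]))


def instances_of(class_name):
    """Answer queries such as 'services that perform aligning which is local'."""
    required = CLASS_DEFINITIONS[class_name]
    return sorted(name for name, facets in SERVICES.items() if required <= facets)


if __name__ == "__main__":
    print(classify(SERVICES["a1234"]))
    print(instances_of("local aligning service"))
```

Because locality is now an explicit property rather than part of an opaque class name, the query that slide 24 lists as impossible in plain RDF becomes a simple subset test here. A description logic reasoner handles far richer definitions than this conjunctive case.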

27 Query by navigation Service browser Service classified by task

28 Use of ontologies: Property based classification requires property based modelling. Advantages – Explicit, machine interpretable, easier to maintain large ontologies with polyhierarchies. Disadvantages – Complex definitions take time/skill to author, require expert domain knowledge – Difficult to present back to the user.

29 Property based classification on steroids RNA sequence data DNA sequence data nucleic acid sequence data Data

30 Property based classification on steroids RNA sequence DNA sequence nucleic acid sequence RNA sequence data DNA sequence data nucleic acid sequence data encodes Data Feature

31 Property based classification on steroids RNA DNA nucleic acid RNA sequence DNA sequence nucleic acid sequence RNA sequence data DNA sequence data nucleic acid sequence data encodes sequence_of Data Feature Biological Concept

32 Property based classification on steroids ribonucleotide deoxyribonucleotide nucleotide RNA DNA nucleic acid RNA sequence DNA sequence nucleic acid sequence RNA sequence data DNA sequence data nucleic acid sequence data encodes sequence_of polymer_of Data Feature Biological Concept

33 Property based classification on steroids ribonucleotide deoxyribonucleotide nucleotide RNA DNA nucleic acid RNA sequence DNA sequence nucleic acid sequence RNA sequence data DNA sequence data nucleic acid sequence data encodes sequence_of polymer_of Data Feature Biological Concept

34 Human readable ontologies GROWL parser OWL API Reasoner OWL API GROWL renderer

35 Only data to hand: Metadata associated with data items. Life Science Identifier (LSID) protocol used to retrieve metadata. Metadata model similar to service parameter. Data item: name, description, semantic type, format, collection type, collection format.
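
An LSID is a URN of the form `urn:lsid:authority:namespace:object`, optionally followed by `:revision`. Slide 37 shows one such identifier, `urn:lsid:taverna:datathing:15`. The sketch below only splits an LSID into its parts. It does not implement the LSID resolution protocol, and the names `LSID` and `parse_lsid` are illustrative.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(urn: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its components."""
    parts = urn.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {urn!r}")
    # The 'urn:lsid' prefix is matched case-insensitively.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {urn!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"LSID has an empty component: {urn!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)


if __name__ == "__main__":
    print(parse_lsid("urn:lsid:taverna:datathing:15"))
```

A client would then hand the parsed identifier to whatever resolver is responsible for the authority in order to fetch the data or its metadata; that step is outside this sketch.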

36 Provenance (1) Workflow run Workflow design Experiment design Project Person Organisation Process Service Event Data item data derivation e.g. output data derived from input data knowledge statements e.g. similar protein sequence to instanceOf partOf componentProcess e.g. web service invocation of BLAST @ NCBI componentEvent e.g. completion of a web service invocation at 12.04pm runBy e.g. BLAST @ NCBI run for Organisation level provenance Process level provenance Data/knowledge level provenance User can add templates to each workflow process to determine links between data items.

37 Provenance tracking
BLAST report hits (GI, accession, score, description):
19747251 AC005089.3 831 Homo sapiens BAC clone CTA-315H11 from 7, complete sequence
15145617 AC073846.6 815 Homo sapiens BAC clone RP11-622P13 from 7, complete sequence
15384807 AL365366.20 46.1 Human DNA sequence from clone RP11-553N16 on chromosome 1, complete sequence
7717376 AL163282.2 44.1 Homo sapiens chromosome 21 segment HS21C082
16304790 AL133523.5 44.1 Human chromosome 14 DNA sequence BAC R-775G15 of library RPCI-11 from chromosome 14 of Homo sapiens (Human), complete sequence
34367431 BX648272.1 44.1 Homo sapiens mRNA; cDNA DKFZp686G08119 (from clone DKFZp686G08119)
5629923 AC007298.17 44.1 Homo sapiens 12q22 BAC RPCI11-256L6 (Roswell Park Cancer Institute Human BAC Library) complete sequence
34533695 AK126986.1 44.1 Homo sapiens cDNA FLJ45040 fis, clone BRAWH3020486
20377057 AC069363.10 44.1 Homo sapiens chromosome 17, clone RP11-104J23, complete sequence
4191263 AL031674.1 44.1 Human DNA sequence from clone RP4-715N11 on chromosome 20q13.1-13.2 Contains two putative novel genes, ESTs, STSs and GSSs, complete sequence
17977487 AC093690.5 44.1 Homo sapiens BAC clone RP11-731I19 from 2, complete sequence
17048246 AC012568.7 44.1 Homo sapiens chromosome 15, clone RP11-342M21, complete sequence
14485328 AL355339.7 44.1 Human DNA sequence from clone RP11-461K13 on chromosome 10, complete sequence
5757554 AC007074.2 44.1 Homo sapiens PAC clone RP3-368G6 from X, complete sequence
4176355 AC005509.1 44.1 Homo sapiens chromosome 4 clone B200N5 map 4q25, complete sequence
2829108 AF042090.1 44.1 Homo sapiens chromosome 21q22.3 PAC 171F15, complete sequence
Nucleotide sequence:
>gi|19747251|gb|AC005089.3| Homo sapiens BAC clone CTA-315H11 from 7, complete sequence
AAGCTTTTCTGGCACTGTTTCCTTCTTCCTGATAACCAGAGAAGGAAAAGATCTCCATTTTACAGATGAG
GAAACAGGCTCAGAGAGGTCAAGGCTCTGGCTCAAGGTCACACAGCCTGGGAACGGCAAAGCTGATATTC
AAACCCAAGCATCTTGGCTCCAAAGCCCTGGTTTCTGTTCCCACTACTGTCAGTGACCTTGGCAAGCCCT
GTCCTCCTCCGGGCTTCACTCTGCACACCTGTAACCTGGGGTTAAATGGGCTCACCTGGACTGTTGAGCG
Provenance graph labels: urn:lsid:taverna:datathing:15..BLAST_Report rdf:type urn:lsid:taverna:datathing:13..similar_sequences_to.. nucleotide_sequence rdf:type service invocation..created_by workflow invocation workflow definition experiment definition project person group service description organisation..described_by..run_during..invocation_of..part_of..works_for..part_of..author..run_for AB..masked_sequence_of..filtered_version_of
Relationship BLAST report has with other items in the repository. Other classes of information related to BLAST report.

38 Using IBM’s Haystack GenBank record Portion of the Web of provenance Managing collection of sequences for review

39 Storage: LSID has no protocol for storage. Taverna/Freefluo implements its own data/metadata storage protocol. Diagram: Taverna/Freefluo, Metadata Store, Data store, Publish interface, data, metadata.

40 Retrieval: LSID protocol used to retrieve data and metadata. Query handled separately. Diagram: Metadata Store, Data store, LSID interface, LSID aware client, Query, RDF aware client.

41 Queries within Workflows: Grid Data Service, query, query result. Semantic content of result depends on query and data source schema. SELECT GO_ID FROM GO WHERE GO.term LIKE 'enzyme activity'; SELECT GO_Annotation_ID FROM GOA WHERE GO.term LIKE 'enzyme activity'; Result labels: Gene Ontology term ID, protein ID.

42 Distributed Query Processing: DQP linked with the OGSA-DAI activity. Built within myGrid project. Plans execution of a query over multiple Grid Data Services. Each Grid Data Service provides schema metadata. Currently no semantic mediation.

43 Example query: select p.proteinId, blast(p.sequence) from p in protein, t in proteinTerm where t.termId = 'GO:0008372' and p.proteinId = t.proteinId. “Select proteins and homologous proteins from SWISS-PROT which have been annotated with GO:0008372”. Gene ontology database; SWISS-PROT protein database. t.proteinId = p.proteinId: data encoding the identity of a protein in SWISS-PROT namespace. DQP Plan.
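
To make the example query concrete, the sketch below evaluates the same join in a single Python process. This is only an illustration of what the query asks for. DQP plans and distributes the execution across Grid Data Services, which this code does not attempt. The table contents, the protein IDs, and the `blast` stub are all hypothetical.

```python
# Hypothetical rows standing in for the two data sources.
PROTEINS = [
    {"proteinId": "P1", "sequence": "MEKLNIAGGD"},
    {"proteinId": "P2", "sequence": "MNGTEGPDFY"},
]
PROTEIN_TERMS = [
    {"proteinId": "P1", "termId": "GO:0008372"},
    {"proteinId": "P2", "termId": "GO:0000001"},
]


def blast(sequence):
    """Hypothetical stand-in for a BLAST service call; returns a placeholder hit list."""
    return [f"homologue-of-{sequence}"]


def run_query(proteins, protein_terms, term_id, blast_fn):
    """select p.proteinId, blast(p.sequence) ... where t.termId = term_id
    and p.proteinId = t.proteinId, evaluated as an in-memory semi-join."""
    annotated = {t["proteinId"] for t in protein_terms if t["termId"] == term_id}
    return [
        (p["proteinId"], blast_fn(p["sequence"]))
        for p in proteins
        if p["proteinId"] in annotated
    ]


if __name__ == "__main__":
    for protein_id, hits in run_query(PROTEINS, PROTEIN_TERMS, "GO:0008372", blast):
        print(protein_id, hits)
```

The join only works because `t.proteinId` and `p.proteinId` are known to identify proteins in the same namespace, which is the point the slide makes about the join condition.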

44 TAMBIS I Query 1: Select motifs for antigenic human proteins that participate in apoptosis and are homologous to the lymphocyte associated receptor of death (also known as lard). Translation: Select patterns in the proteins that invoke an immunological response and participate in programmed cell death that are similar in their sequence of amino acids to the protein that is associated with triggering cell death in the white cells of the immune system. (A) Ontology expression: Motif which <isComponentOf (Protein which <hasOrganismClassification Species functionsInProcess Apoptosis hasFunction Antigen isHomologousTo Protein which )>)> Species: Is instantiated by value “human”. ProteinName: Is instantiated by value “lard”.

45 TAMBIS II Informal query plan: Select proteins with protein name “lard” from SWISS-PROT. Execute a BLAST sequence alignment process against SWISS-PROT results. Check the entries for apoptosis process and antigen function. Pass the resultant sequences to PROSITE to scan for their motifs. CPL expression: set-unique {(#motif1:motif1) | \protein3 <- get-sp-entries-by-de("lard"), \protein2 <- do-blastp-by-sq-in-entry(protein3), check-sp-entries-by-kwd("apoptosis",protein2), check-sp-entries-by-de("antigen",protein2), check-sp-entry-for-species("human",protein2), \motif1 <- do-ps-scan-by-sq-in-entry(protein2)}

46 select p.proteinId, blast(p.sequence) from p in protein, t in proteinTerm where t.termId = 'GO:0008372' and p.proteinId = t.proteinId

47 How we did it in the past – Service type directory. How we currently plan to do it – Shims, GenBank, microarray. How we may want to do it in the future – DQP & TAMBIS.

48 Overview: We’re not attacking the same problem. When would your problem become our problem? Common descriptions of the core entities involved – Data items, Datasets, Services.
