
Migrating to the Semantic Web: Bioinformatics as a case study.


1 Migrating to the Semantic Web: Bioinformatics as a case study.
Phillip Lord, Dept of Computer Science, University of Manchester

2 What is the Semantic Web
[Stack diagram labels: OWL; RDF; XML; "We are here!"]
What is the semantic web? We have had some introductory material on this already. In this talk I will take a more pragmatic approach to the semantic web, and consider it to be a set of technologies. This is the current stack of the various technologies that will probably be needed to make a full semantic web. Not all of these layers are actually available yet.

3 The talk Three (and a half) example case studies
Two different technologies. Why we choose the different technologies.

4 RDF in a nutshell; Tim Berners-Lee’s original vision…
1989 The original vision of the web looked more like this. Essentially it's much the same thing, but with some additional information on the links describing their types. As in this case we are trying to support the biologist, this model seems fairly appropriate. It's nice and simple, and only a small change from the web, which they are already very used to.
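As a rough illustration of what "typed links" means, each link can be held as a (subject, link type, object) triple. This is only a sketch: the URIs and link type names below are invented for the example and do not come from the talk or from any real vocabulary.

```python
# A typed link is a (subject, link type, object) triple.
# All URIs and link types here are hypothetical, for illustration only.
PROTEIN = "http://example.org/protein/MURA_BACSU"

triples = [
    (PROTEIN, "encodedBy", "http://example.org/gene/murA"),
    (PROTEIN, "foundIn", "http://example.org/organism/Bacillus_subtilis"),
    ("http://example.org/gene/murA", "describedIn", "http://example.org/paper/1"),
]


def links_from(subject, link_type=None):
    """Return the targets linked from `subject`, optionally filtered by link type."""
    return [
        obj
        for subj, pred, obj in triples
        if subj == subject and (link_type is None or pred == link_type)
    ]


# On the plain web we could only ask "what does this page link to?".
all_targets = links_from(PROTEIN)
# With typed links we can also ask "which organism is it found in?".
organisms = links_from(PROTEIN, "foundIn")
```

The only change from an ordinary hyperlink is the middle element of the triple, which is why the model stays close to what web users already know.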

5 OWL in a nutshell

6 A vision of the semantic web came from this paper in Scientific American a couple of years ago. It essentially involves providing a large quantity of machine-processable information. This should enable agents to gather and extract information in an automated fashion.

7 The Motivation “At the doctor’s office, Lucy instructed her semantic web agent. It promptly retrieved information about her Mom’s prescribed treatment, looked up a list of several providers within 20 miles of home, with a good trust rating.” The paper also provides a motivating example. It is to do with booking appointments at the doctor's following a diagnosis. It is interesting that this is a medical example.

8 Beware of the Hype! Scientific American, May 2001:
The important point to remember is that this is total hype at the moment, particularly in the case of something as important as an ill mother. There is no way that we can do this, as it requires knowledge of too many things, including: medicines (which often have multiple names); chemists; geography (the location of people's homes and the shops); trust (based on whose pronouncement?); insurance plans; and so on.

9 The Motivating Example
[Diagram labels: Lucy; Doctor]
In the motivating example, the basic idea is to improve communication between the doctor (and chemist) and Lucy. Slide advance. However, the example also explicitly puts most of the task of interacting with the doctor onto a computational agent. So the primary consumer of the semantically described information is not human. In addition, we also have somebody who is publishing the semantically described data for the doctor, or providing them with software to do this.

10 myGrid UK e-Science Pilot Project. Oct 2001 – April 2005.
£3.4 million. £0.4 million studentships. It’s quite a large project. It’s one of the UK eScience pilot projects, and involves a number of different organisations from all over Britain, with a number of industrial collaborators. Newcastle Sheffield Manchester Nottingham Hinxton Southampton

11 Data(type)-intensive bioinformatics
It’s much more likely that the semantic web will take off in small, constrained areas to start with. Bioinformatics is a good area for this for a number of reasons. This slide shows a workflow created by the myGrid project (more of which later) investigating Williams-Beuren Syndrome.
Characteristics of bioinformatics: it involves accessing many distributed, decentralised resources and many different kinds of information, and integrating them all together. Many of the data sets are derived from each other. Opinion and trust are a cornerstone of bioinformatics. Most of the services are web based, and always have been (the web grew up at the same time as bioinformatics; some key web technologies were actually written by bioinformaticians). Most of the data is free text: knowledge rich, structure poor.
Bioinformatics analyses typically involve visiting many data resources and analytical tools. The resources are often highly heterogeneous, semi-structured or unstructured, and distributed, largely because bioinformatics has grown up as a “cottage industry” and is mostly web delivered. Integrating these resources is often difficult, both from a programmatic point of view and because of the heterogeneity. On the whole this has been done by screen scraping and explicit Perl programming. This is brittle, and often done by non-expert programmers (which makes it worse). The data is also heterogeneous. Even things like identifiers are non-standard.
ID MURA_BACSU STANDARD; PRT; AA.
DE PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE (EC ) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE ENOLPYRUVYL TRANSFERASE) (EPT).
GN MURA OR MURZ.
OS BACILLUS SUBTILIS.
OC BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC BACILLUS.
KW PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT ACT_SITE BINDS PEP (BY SIMILARITY).
FT CONFLICT S -> A (IN REF. 3).
SQ SEQUENCE AA; MW; C5C CRC32;
MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
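To give a feel for why this kind of record tends to be handled with hand-written text parsing, here is a minimal sketch that groups the lines of the entry above by their two-letter line code. The function name is invented for this example, the record is abridged, and the numeric fields that were lost from the slide are left missing rather than guessed.

```python
from collections import defaultdict

# Abridged copy of the entry shown on the slide. Numeric fields that are
# missing from the transcript are deliberately left missing here.
RECORD = """\
ID MURA_BACSU STANDARD; PRT; AA.
GN MURA OR MURZ.
OS BACILLUS SUBTILIS.
OC BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC BACILLUS.
KW PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
"""


def parse_flat_file(text):
    """Group the lines of a flat-file entry by their leading line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
    return dict(fields)


entry = parse_flat_file(RECORD)

# Even after parsing, the values are still free text that needs more ad hoc
# splitting, for example to get at the individual keywords.
keywords = [k.strip() for k in " ".join(entry["KW"]).rstrip(".").split(";")]
```

Every extra field needs another small, format-specific rule like the keyword split, which is where the brittleness comes from.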

12 Service Stack Bioinformaticians Tool Providers Service Providers
[Diagram labels: Work bench; Taverna; Web Portal; Talisman; Applications; Gateway; Personalisation; Registries; Service and Workflow Discovery; Provenance; Event Notification; Ontologies; Ontology Mgt; Views; Metadata Mgt; myGrid Information Repository; Core services; FreeFluo Workflow Enactment Engine; OGSA-DQP Distributed Query Processor; Web Service (Grid Service) communication fabric; External services; SoapLab; GowLab; Native Web Services; AMBIT Text Extraction Service; Legacy apps]
This is the myGrid services stack. It shows the key components that we have developed to try to help this situation. I am going to mention just three here. Firstly, we have developed SOAPLAB, which provides a quick and easy way to publish legacy applications as web services. This solves the problem of non-standard programmatic interfaces, and removes the difficulty associated with screen scraping. Secondly, we have a workflow enactment engine. This enables us to develop workflows, which are structurally simpler than a full programmatic environment but still let us string services together. Thirdly, to take advantage of this we need a pretty development environment, which should enable the biologists themselves to develop the workflows and pipelines they need.

13 Williams-Beuren Syndrome Microdeletion
[Figure labels: STAG3 PMS2L Block A FKBP6T POM121 NOLR1 Block C GTF2IP NCF1P GTF2IRD2P Block B; C-cen A-cen B-cen C-mid B-mid A-mid B-tel A-tel C-tel; WBSCR1/E1f4H WBSCR5/LAB POM121 NOLR1 WBSCR14 WBSCR18 WBSCR22 WBSCR21 GTF2IRD1 FKBP6 BAZ1B BCL7B GTF2I GTF2IRD2 FZD9 TBL2 STX1A CLDN3 CLDN4 LIMK1 CYLN2 NCF1 ELN RFC2; 7q11.23, ~1.5 Mb; Patient deletions: WBS, SVAS; Physical map: CTA-315H11, CTB-51J22, Gap; Chr 7, ~155 Mb]
Williams-Beuren Syndrome microdeletions reside on chromosome 7q. Patients with deletions fall into two categories: those with classic WBS (* indicates the common deletion) and those with SVAS but not WBS, caused by hemizygous deletion of the elastin gene. A physical map of the region composed of genomic clones is shown, with a gap in the critical region. The myGrid software was used to continue the contig and identify more genes at this locus.

14 WBS Workflows
Colour key: Pink: outputs/inputs of a service. Purple: tailor-made services. Green: EMBOSS soaplab services. Yellow: Manchester soaplab services. Grey: unknowns.
[Workflow diagram labels: Query nucleotide sequence; RepeatMasker; ncbiBlastWrapper; GenBank Accession No; URL inc GB identifier; GenBank Entry; Translation/sequence file, good for records and publications (prettyseq); Amino acid translation; Identifies PEST seq (epestfind); Sort for appropriate sequences only; 6 ORFs; Identifies FingerPRINTS (pscan); Seqret; MW, length, charge, pI, etc (pepstats); Nucleotide seq (Fasta); sixpack; ORFs; transeq; Predicts coiled-coil regions (pepcoil); tblastn vs nr, est, est_mouse, est_human databases; blastp vs nr; Coding sequence; GenScan; Restriction enzyme map (restrict); Predicts cellular location (SignalP, TargetP, PSORTII); CpG island locations and % (cpgreport); Identifies functional and structural domains/motifs (InterPro, PFAM, Prosite, Smart); Repetitive elements (RepeatMasker); Hydrophobic regions (Pepwindow? Octanol?); blastn vs nr, est databases]
This is a pipeline with a very characteristic shape in bioinformatics. From a single source of data, we query lots of different databases and lots of different resources. Effectively we are trying to find out as much as possible about the resource (in this case some DNA). Then we need to present all of these results back to the bioinformatician. The point is that different analyses work under different circumstances. You want to try as many as possible.
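The fan-out shape described here can be sketched in a few lines. This is not how Taverna or FreeFluo work internally. The three tiny local functions below are stand-ins, chosen for this example, for the remote analysis services in the diagram. The point is only the shape: one input, many independent analyses, one combined report, and a failure in one analysis not stopping the rest.

```python
def length(seq):
    """Number of bases in the sequence."""
    return len(seq)


def gc_content(seq):
    """Fraction of bases that are G or C (fails on an empty sequence)."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Hypothetical stand-ins for the many services a real workflow would call.
ANALYSES = {
    "length": length,
    "gc_content": gc_content,
    "reverse_complement": reverse_complement,
}


def run_fan_out(seq, analyses=ANALYSES):
    """Run every analysis on the same input and collect results by name."""
    results = {}
    for name, analysis in analyses.items():
        try:
            results[name] = analysis(seq)
        except Exception as err:  # one failing analysis should not stop the rest
            results[name] = f"failed: {err}"
    return results


report = run_fan_out("ATGCGC")
empty_report = run_fan_out("")
```

Because different analyses work under different circumstances, the report keeps a per-analysis failure message instead of aborting the whole run.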

15 And this is what the development environment looks like.
One of the key points here is that there are an awful lot of services down the one side. We currently have over three hundred and more are coming along.

16 Semantic discovery Query-ontology – discovering workflows and services described in the registry by building a query in Taverna. A common ontology is used to annotate and query. Look for all workflows that accept an input of semantic type nucleotide sequence. Aim to have semantic discovery over a public view on the Web. This still leaves us with some problems. Firstly, there are a large number of services out there, and we need to be able to select between these. In this case the user is directly involved in the environment, so we only need to narrow down the range, not pick exactly the right service. Close enough is good enough. To do this we have produced a service ontology, developed in DAML+OIL and OWL. However, we currently query over this using a simple “semantically materialised” RDF Jena backend. We don’t have rich query interfaces to enable generation of OWL queries anyway, and on the whole it’s not necessary. This is currently being re-developed by Pinar Alper.
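One way to read “semantically materialised” is that the consequences of the ontology are computed once, up front, so that a query is a plain lookup with no reasoner involved. The sketch below assumes that reading. It does not use Jena, and the concept names, service names and annotations are all hypothetical.

```python
# Hypothetical fragment of a service ontology: concept -> direct parents.
SUBCLASS_OF = {
    "DNA_sequence": ["nucleotide_sequence"],
    "RNA_sequence": ["nucleotide_sequence"],
    "nucleotide_sequence": ["biological_sequence"],
    "protein_sequence": ["biological_sequence"],
}

# Hypothetical registry: service or workflow name -> semantic type of its input.
REGISTRY = {
    "repeat_masking_workflow": "DNA_sequence",
    "translation_service": "nucleotide_sequence",
    "protein_statistics_service": "protein_sequence",
}


def ancestors(concept):
    """All concepts reachable from `concept` by following subclass links."""
    seen = set()
    stack = [concept]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Materialise once: store every type each entry's input can be described as.
MATERIALISED = {name: {t} | ancestors(t) for name, t in REGISTRY.items()}


def discover(input_type):
    """Names of registry entries whose input is of type `input_type`."""
    return sorted(n for n, types in MATERIALISED.items() if input_type in types)


nucleotide_services = discover("nucleotide_sequence")
```

The query narrows the list rather than picking a single service, which matches the "close enough is good enough" aim when a user makes the final choice.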

17 Service annotation A closer look at Pedro. Adding structured metadata to a workflow registration to enable others to discover and reuse it more effectively, e.g. what semantic type of input it accepts.

18 Semantic Discovery Pedro data capture tool. We’ve also reused a tool from another project, called Pedro. This enables us to annotate services based on our ontology. In most cases the WSDL file (or the service interface provided by SOAPLAB) is not enough for good service selection, because the types are generally just strings. In this we are following the same path as BioMOBY. Again, Pedro allows annotation only with complete concepts. So neither our descriptions nor our queries are rich enough to need full OWL support. Drag a workflow entry into the explorer pane and the workflow loads. Drag a service or workflow to the scavenger window for inclusion in the workflow. View annotations on a workflow.

19 Biologist Ontologist Service Providers
In summary, the expert ontologist uses OWL. The service providers use a fairly explicit interface through Pedro, with fill-in forms, using RDF to generate the descriptions. And the user gets either an explicit query interface or, better still, a hidden interface. From an implementation point of view this also means that we get Ontologist Service Providers

20 Problems when doing In Silico Experiments
Experiments being performed repeatedly, at different sites, at different times, by different users or groups; a large repository of records about experiments! Issues: verification of data; “recipes” for experiment designs; explanation for the impact of changes; ownership; performance of services; data quality.
This leaves us with a variety of problems. Which data was used to derive other data? This is a huge problem in bioinformatics, and one whose importance is almost impossible to judge. It’s almost certain that much derived data is based on out-of-date information, and some of the knowledge is circular. One example is from Karp et al., who reckon that for E. coli only 80% of Swiss-Prot sequences are directly discoverable in the original GenBank DNA sequence. What can we do, therefore, to support the biologist or bioinformatician user in understanding
[Diagram labels: Scientists; In silico experiments]

21 The Current State of the Art
Currently bioinformatics deals with this basically by hyperlinking. Mostly this happens through identifiers and accession numbers (which adds a layer of complexity, as these identifiers are not standardised, although within myGrid we have found LSIDs very useful). Most bioinformatics data is presented on the web, so these identifiers are transformed by bespoke software into web links, and navigation is possible.
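A standardised identifier makes the "identifier to web link" step much less bespoke. LSIDs have the general form urn:lsid:authority:namespace:object, with an optional trailing revision. The sketch below parses that form and fills in a link template. The example identifier and the URL template are made up for illustration, and real LSID resolution goes through a resolution service rather than a fixed template.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.split(":")
    if (
        len(parts) not in (5, 6)
        or parts[0].lower() != "urn"
        or parts[1].lower() != "lsid"
    ):
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


def to_link(lsid, template="http://{authority}/{namespace}/{object_id}"):
    """Turn a parsed LSID into a web link using a (hypothetical) URL template."""
    return template.format(**lsid._asdict())


# Hypothetical identifier, for illustration only.
example = parse_lsid("urn:lsid:example.org:proteins:MURA_BACSU")
link = to_link(example)
```

With non-standard accession numbers, each data source would need its own version of `parse_lsid`, which is the extra layer of complexity mentioned above.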

22 Tim Berners-Lee’s original vision…
1989 The original vision of the web looked more like this. Essentially it's much the same thing, but with some additional information on the links describing their types. As in this case we are trying to support the biologist, this model seems fairly appropriate. It's nice and simple, and only a small change from the web, which they are already very used to.

23 A Semantic Web of Provenance
[Diagram labels: XML HTML PDF; what; Literature relevant to provenance study or data in this workflow; Provenance record of a workflow run; how/which/when/where; DAML+OIL ontologies linking provenance documents; Interlinking graph of the workflow that generates the provenance logs; how; who; Web page of people who have related interests to the owner of the workflow; Experiment Notes; why]
This gives us a semantic web of provenance, enabling us to link between data of different types, from the workflows to the literature to the actual results.

24 Population Semantic Data
[Diagram labels: Web Services; Data Repository; FreeFluo; Taverna; Metadata Repository; LaunchPad; Haystack]
Because we control the workflow enactment environment, to some extent we can generate this information as we go along. Essentially, by describing the services, we can fit knowledge over the top, so that we know how the inputs and the products of a service relate to each other. At the current time these service descriptions are not actually the same as the service descriptions used for semantic service discovery, although clearly they should be.
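The idea of generating provenance "as we go along" can be sketched as an enactor that writes a couple of typed links every time it runs a step. This is an illustration only, not the FreeFluo implementation. The identifier scheme, the predicate names `derivedFrom` and `producedBy`, and the two toy services are all assumptions made for the example.

```python
import itertools

provenance = []  # list of (subject, predicate, object) triples
_ids = itertools.count(1)


def run_step(service_name, service, input_id, data_store):
    """Run one service on a stored input and record where the output came from."""
    output_id = f"data:{next(_ids)}"
    data_store[output_id] = service(data_store[input_id])
    provenance.append((output_id, "derivedFrom", input_id))
    provenance.append((output_id, "producedBy", service_name))
    return output_id


def lineage(data_id):
    """Follow derivedFrom links back from `data_id` to the original input."""
    chain = []
    current = data_id
    while True:
        parents = [o for s, p, o in provenance if s == current and p == "derivedFrom"]
        if not parents:
            return chain
        current = parents[0]
        chain.append(current)


# Two toy "services" chained together, standing in for real workflow steps.
store = {"data:input": "atgcgc"}
upper_id = run_step("to_upper", str.upper, "data:input", store)
reversed_id = run_step("reverse", lambda s: s[::-1], upper_id, store)
history = lineage(reversed_id)
```

Because the enactor is the only thing that calls services, the user never has to record these links by hand.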

25 Haystack from IBM We have a variety of tools for examining the end results of this provenance. In this case we are showing Haystack, which has been developed by IBM, and which provides a mechanism for navigating around the provenance. Many research issues remain. Can we present better views over the provenance? Can we delete “boring” provenance? Can we automatically regenerate out-of-date results?

26 Biologist Biologist Database Biologist
So, here, the key issue is communication from biologist to biologist, or to the same biologist at a later date. Hence we choose the simplest technology available. Database Biologist

27 Gene Ontology Next Generation Project (GONG)
Demonstrate the utility of finer-grained concept descriptions in DAML+OIL (OWL-DL). Develop methodologies and tools to support the process. I’m going to report on the progress of the Gene Ontology Next Generation project, which is demonstrating the utility of providing finer-grained formal concept descriptions in DAML+OIL, and also developing the methodologies and tools to support the process on an ongoing basis.

28 Translating theory into practice
Gene Ontology provides a service to the model organism database community. Description logic (DL) is a technology born out of computer science research. OWL is a standard ontology interchange language underpinned by DL. The project is a meeting of two communities. The GO provides a service to the model organism community. As such, they will only adopt an apparently complicated technology if it confers real benefits on the service they provide and is practical to implement. On the other hand, the description logic community is based firmly in the research arena, and as such it has concentrated much more on what is theoretically possible than on what is practically doable. However, ontology languages such as DAML+OIL are becoming standard for interchange, and as they are underpinned by description logic they are dragging it into the mainstream. So much so that the language has been adopted by the W3C and will soon become OWL, the Web Ontology Language.

29 GONG - proof of concept Maintaining an exhaustive is-a structure
So what’s really needed to start with is a proof of concept, to show that finer-grained concept definitions in a description-logic-based language are useful. The first area in which we have demonstrated this is the maintenance of an exhaustive is-a hierarchy for a section of GO. Each GO concept can have one, two or many is-a parents. In fact each concept is positioned in a directed acyclic graph of arbitrary complexity. As it gets more complicated, the task of ensuring that all possible is-a links are present becomes not only monotonous but difficult by hand. [Diagram labels: Parent; Is-a relationship; GO concept]

30 Example: heparin biosynthesis
[chemical] biosynthesis (GO: ) [i] carbohydrate biosynthesis (GO: ) Axis 1: Chemicals [i] aminoglycan biosynthesis (GO: ) [i] heparin biosynthesis (GO: ) If we take a concrete example. Metabolism concepts are actually phrases made of two aspects, the nature of the process and the class of the chemical target. These two aspects mean that each concept can be classified along two axes. In this first case we have classified it by the chemical class and as we ascend the hierarchy the class of chemical becomes more general.

31 Example: heparin biosynthesis
[chemical] biosynthesis (GO: ) [i] carbohydrate biosynthesis (GO: ) Axis 1: Chemicals [i] aminoglycan biosynthesis (GO: ) [i] heparin biosynthesis (GO: ) Axis 2: Process In the second classification the nature of the process becomes more general as we ascend. This structure existed as of jul But note there was no is-a link between heparin biosynthesis and aminoglycan biosynthesis. Heparin biosynthesis had only one parent. [i] heparin metabolism (GO: ) [i] heparin biosynthesis (GO: )

32 Example: heparin biosynthesis
[chemical] biosynthesis (GO: ) [i] carbohydrate biosynthesis (GO: ) Axis 1: Chemicals [i] aminoglycan biosynthesis (GO: ) [i] glycosaminoglycan biosynthesis (GO: ) [i] heparin biosynthesis (GO: ) Axis 2: Process In fact another concept could be placed in the upper hierarchy because heparin is a kind of glycosaminoglycan. This additional concept is needed to complete the classification structure [i] heparin metabolism (GO: ) [i] heparin biosynthesis (GO: )

33 Is this important? Missing is-a not noticed by users
BUT… improves fidelity of DB record retrieval. Asking for gene products involved in ‘glycosaminoglycan biosynthesis’ will lead to an additional result: O94923 SPTr ISS - D-glucuronyl C5-epimerase (Fragment). Is this important? Well, it had not been noticed by users. But adding this link improves database record retrieval. If you ask for gene products involved in glycosaminoglycan biosynthesis, you now get an additional result.
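The effect on retrieval can be shown with a small sketch that answers a query by following is-a links upwards from each annotated term. The hierarchy fragment follows the heparin example on the previous slides. The annotation of O94923 directly to heparin biosynthesis is an assumption made for this example, since the slide only says that this record appears as an additional result.

```python
# is-a links before the fix: heparin biosynthesis has a single parent.
IS_A_BEFORE = {
    "heparin biosynthesis": ["heparin metabolism"],
    "glycosaminoglycan biosynthesis": ["aminoglycan biosynthesis"],
    "aminoglycan biosynthesis": ["carbohydrate biosynthesis"],
}

# The same links plus the one that was missing.
IS_A_AFTER = dict(IS_A_BEFORE)
IS_A_AFTER["heparin biosynthesis"] = [
    "heparin metabolism",
    "glycosaminoglycan biosynthesis",
]

# Assumed annotation, for illustration: gene product -> GO term.
ANNOTATIONS = {"O94923": "heparin biosynthesis"}


def is_a_kind_of(term, ancestor, is_a):
    """True if `term` equals `ancestor` or reaches it by following is-a links."""
    stack, seen = [term], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(is_a.get(current, []))
    return False


def retrieve(query_term, is_a, annotations=ANNOTATIONS):
    """Gene products annotated to `query_term` or to any of its descendants."""
    return sorted(
        gp for gp, term in annotations.items() if is_a_kind_of(term, query_term, is_a)
    )


before = retrieve("glycosaminoglycan biosynthesis", IS_A_BEFORE)
after = retrieve("glycosaminoglycan biosynthesis", IS_A_AFTER)
```

The annotation itself never changes. Only the one added is-a link differs between the two runs, which is why the gap is easy to miss when browsing and only shows up in query results.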

34 Paraphrased reasoning process
heparin biosynthesis class heparin biosynthesis defined subClassOf biosynthesis restriction onProperty acts_on hasClass heparin glycosaminoglycan biosynthesis class glycosaminoglycan biosynthesis defined subClassOf biosynthesis restriction onProperty acts_on hasClass glycosaminoglycan Is-a The formal descriptions and the chemical ontology are merged and submitted to the FaCT reasoner which, in simple terms, detects that these two concepts differ only in the nature of the chemical. Because the chemical ontology specifies that heparin is a glycosaminoglycan, it can infer that heparin biosynthesis is a kind of glycosaminoglycan biosynthesis.
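The paraphrased reasoning step can be imitated with a very small structural check. This is not the FaCT reasoner and not a general description logic algorithm. It is a sketch that assumes every defined concept is just a (process, chemical acted on) pair, and it takes "heparin is a glycosaminoglycan" and "biosynthesis is a kind of metabolism" as the only background facts.

```python
# Background facts (child -> parent). The first comes from the chemical
# ontology as described in the talk; the second is assumed from the
# process axis of the heparin example.
CHEMICAL_IS_A = {"heparin": "glycosaminoglycan"}
PROCESS_IS_A = {"biosynthesis": "metabolism"}

# Each defined concept: (process, chemical it acts_on).
DEFINITIONS = {
    "heparin biosynthesis": ("biosynthesis", "heparin"),
    "glycosaminoglycan biosynthesis": ("biosynthesis", "glycosaminoglycan"),
    "heparin metabolism": ("metabolism", "heparin"),
}


def subsumed_by(specific, general, parent_of):
    """True if `specific` equals `general` or has it as an ancestor."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = parent_of.get(current)
    return False


def infer_is_a(definitions):
    """Infer (child, parent) links between defined concepts.

    A concept is a kind of another if its process is the same or more
    specific AND its chemical is the same or more specific.
    """
    links = []
    for child, (child_proc, child_chem) in definitions.items():
        for parent, (parent_proc, parent_chem) in definitions.items():
            if child == parent:
                continue
            if subsumed_by(child_proc, parent_proc, PROCESS_IS_A) and subsumed_by(
                child_chem, parent_chem, CHEMICAL_IS_A
            ):
                links.append((child, parent))
    return sorted(links)


inferred = infer_is_a(DEFINITIONS)
```

Under these assumptions the check recovers both parents of heparin biosynthesis, one along each axis, including the link to glycosaminoglycan biosynthesis that was missing from the hand-built hierarchy.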

35 Inferring a new is-a link
heparin biosynthesis class heparin biosynthesis defined subClassOf biosynthesis restriction onProperty acts_on hasClass heparin glycosaminoglycan biosynthesis class glycosaminoglycan biosynthesis defined subClassOf biosynthesis restriction onProperty acts_on hasClass glycosaminoglycan Is-a Is-a

36 Results
Carbohydrate metabolism, ~250 concepts: 22 additional is-a links, 17 of which are now in GO. Amino acid metabolism, ~250 concepts: a further 17 additional is-a links now in GO. The GO team will be reviewing results for metabolism as a whole once we have the tools to support the process. Useful results come from even partial coverage.

37 Build a practical environment
Tools needed for: Creating OWL definitions Tracking changes Reporting reasoning results Viewing definitions

38 Reporting tools We are expanding the amount of structured information within the gene ontology. Taken together with audit/ provenance information detailing where this information came from and reports of inferences made by the reasoner, this amounts to a significant amount of information to manage. Tools to help navigate this information are essential. The screenshots above show prototypes of these tools which overlay provenance and reasoning reports on the original GO classification.

39 OWL for GONG Biologist Ontologist
In this case we have an expert ontologist, so this is where we make the fullest use of OWL. This provides us with the advantages of automated reasoning. However, deployment is still in a formalism with much the same expressivity as RDF.

40 Conclusions Three problems, three different solutions, all making use of semantic web technologies. A little semantics can go a long way. The expressivity of the language has to be chosen at least in part based on the tasks to be performed, and the user base. Tools, tools, tools. I’ve talked about three different problems within bioinformatics, and presented partial or initial solutions to all of these. We’ve found semantic web technologies to be useful in this. A little bit of semantics can go a long way. There is little magic here, but something as simple as the addition of RDF can help a lot over the current web. In all cases, we have chosen a technology which is appropriate for the user base, as without this the users will not use it. This means controlling our urges toward expressivity for the sake of simplicity. However, with GONG we can see the advantage of automated approaches. Finally, we need more tools. As these are generated, they should help to square the circle of providing for the user

41

42 Acknowledgments Chris Wroe, Robert Stevens, Carole Goble University of Manchester, UK Michael Ashburner EBI, Hinxton, UK Jane Lomax and Midori Harris of the GO editorial team for help and advice and responding to the suggested changes UMLS and MeSH which provided valuable resources for chemical information Sean Bechhofer for development on OilEd Project funded as a subcontract of the DARPA DAML programme

43 Acknowledgements myGrid is an EPSRC funded UK eScience Program Pilot Project Particular thanks to the other members of the Taverna project,

44 myGrid People Core Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe. Users Simon Pearce and Claire Jennings, Institute of Human Genetics School of Clinical Medical Sciences, University of Newcastle, UK Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK Postgraduates Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire Industrial Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM) Robin McEntire (GSK) Collaborators Keith Decker

