The Semantic Web: New-style data-integration (and how it works for life-scientists too!) Frank van Harmelen AI Department Vrije Universiteit Amsterdam.


1 The Semantic Web: New-style data-integration (and how it works for life-scientists too!) Frank van Harmelen AI Department Vrije Universiteit Amsterdam

2 What’s the problem? (data-mess in bio-inf)

3 Source: PhRMA & FDA 2003 Pharmaceutical Productivity

4 The Industry’s Problem
Too much unintegrated data:
– from a variety of incompatible sources
– no standard naming convention
– each with a custom browsing and querying mechanism (no common interface)
– and poor interaction with other data sources
(Kenneth Griffiths and Richard Resnick, Tutorial at Intelligent Systems for Molecular Biology, 2003)

5 What are the Data Sources? Flat files, URLs, proprietary databases, public databases, data marts, spreadsheets, emails, …

6 Sample Problem: Hyperprolactinemia
Overproduction of prolactin
– prolactin stimulates mammary gland development and milk production
Hyperprolactinemia is characterized by:
– inappropriate milk production
– disruption of the menstrual cycle
– can lead to conception difficulty

7 Understanding transcription factors for prolactin production
“Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells, and are homologous to known transcription factors.”
This splits into three sub-queries, one per data type:
– LITERATURE: “Show me all genes in the public literature that are putatively related to hyperprolactinemia”
– EXPRESSION: “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells”
– SEQUENCE: “Show me all genes that are homologous to known transcription factors”
Combined result: (Q1 ∩ Q2 ∩ Q3)
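
The decomposition on this slide amounts to running three independent queries and intersecting their answers. A minimal sketch, assuming each source returns a set of gene identifiers; the identifiers and result sets below are invented placeholders, not real findings, and `integrate` is a hypothetical helper name:

```python
# Hypothetical result sets, one per data source.
# The gene identifiers are placeholders, not real findings.
q1_literature = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
q2_expression = {"GENE_B", "GENE_C", "GENE_E"}
q3_sequence = {"GENE_B", "GENE_C", "GENE_F"}


def integrate(*result_sets):
    """Return the genes that appear in every result set."""
    if not result_sets:
        return set()
    return set.intersection(*map(set, result_sets))


candidates = integrate(q1_literature, q2_expression, q3_sequence)
print(sorted(candidates))
```

The intersection itself is trivial; the hard part, which the rest of the talk addresses, is getting three sources to use the same identifier for the same gene.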

8 The Medical tower of Babel
MeSH
– Medical Subject Headings, National Library of Medicine
– 22,000 descriptions
EMTREE
– Commercial Elsevier, Drugs and diseases
– 45,000 terms, 190,000 synonyms
UMLS
– Integrates 100 different vocabularies
SNOMED
– 200,000 concepts, College of American Pathologists
Gene Ontology
– 15,000 terms in molecular biology
NCI Cancer Ontology:
– 17,000 classes (about 1M definitions)

9 Stitching this all together by hand? Source: Stephens et al. J Web Semantics 2006

10 Why would Semantic technology help?

11 machine accessible meaning (What it’s like to be a machine)
[Diagram labels: symptoms drug administration disease IS-A alleviates META-DATA]

12 What is meta-data?
– it's just data
– it's data describing other data
– it's meant for machine consumption
[Diagram labels: disease name symptoms drug administration]
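
One way to picture "data describing other data" is as a list of subject-relation-object statements. The sketch below is an illustration only: the page name `page1`, the drug name `drugX`, and the relation names are made-up placeholders, and `objects` is a hypothetical helper:

```python
from collections import namedtuple

# One statement of meta-data: subject, relation, object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical meta-data about one page and the things it mentions.
metadata = [
    Triple("page1", "describes", "hyperprolactinemia"),
    Triple("hyperprolactinemia", "is-a", "disease"),
    Triple("hyperprolactinemia", "has-symptom", "inappropriate milk production"),
    Triple("drugX", "alleviates", "hyperprolactinemia"),
]


def objects(triples, subject, predicate):
    """Return every object stated for the given subject and relation."""
    return [t.object for t in triples
            if t.subject == subject and t.predicate == predicate]


print(objects(metadata, "hyperprolactinemia", "is-a"))
```

A program needs no language understanding to answer "what kind of thing is hyperprolactinemia?" from such statements; it only matches strings.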

13 Required are:
1. one or more standard vocabularies
– so search engines, producers and consumers all speak the same language
2. a standard syntax
– so meta-data can be recognised as such
3. lots of resources with meta-data attached
mechanisms for attribution and trust
– is this page really about Pamela Anderson?

14 What are ontologies & what are they used for
no shared understanding
Conceptual and terminological confusion
Actors: both humans and machines
Agree on a conceptualization
Make it explicit in some language.
[Diagram labels: world, concept, language]

15 standard vocabularies (“Ontologies”)
Identify the key concepts in a domain
Identify a vocabulary for these concepts
Identify relations between these concepts
Make these precise enough so that they can be shared between
– humans and humans
– humans and machines
– machines and machines

16 Biomedical ontologies (a few..)
MeSH
– Medical Subject Headings, National Library of Medicine
– 22,000 descriptions
EMTREE
– Commercial Elsevier, Drugs and diseases
– 45,000 terms, 190,000 synonyms
UMLS
– Integrates 100 different vocabularies
SNOMED
– 200,000 concepts, College of American Pathologists
Gene Ontology
– 15,000 terms in molecular biology
NCI Cancer Ontology:
– 17,000 classes (about 1M definitions)

17 Remember “required are”:
✓ one or more standard vocabularies
– so search engines, producers and consumers all speak the same language
2. a standard syntax
– so meta-data can be recognised as such
3. lots of resources with meta-data attached

18 Stack of languages

19 XML:
– Surface syntax, no semantics
XML Schema:
– Describes structure of XML documents
RDF:
– Datamodel for “relations” between “things”
RDF Schema:
– RDF Vocabulary Definition Language
OWL:
– A more expressive Vocabulary Definition Language
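
To make the RDF and RDF Schema layers concrete, here is a small sketch in plain Python, not a real RDF library. It stores "relations between things" as triples and applies two standard RDFS entailments: `rdfs:subClassOf` is transitive, and an instance of a class is also an instance of its superclasses. The `ex:` names are invented for illustration, and `rdfs_closure` is a hypothetical helper:

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

# Hypothetical vocabulary (two subclass statements) plus one data statement.
triples = {
    ("ex:Hyperprolactinemia", SUBCLASS, "ex:EndocrineDisease"),
    ("ex:EndocrineDisease", SUBCLASS, "ex:Disease"),
    ("ex:case42", TYPE, "ex:Hyperprolactinemia"),
}


def rdfs_closure(triples):
    """Apply the two subclass rules until no new triple is derived."""
    inferred = set(triples)
    while True:
        new = set()
        for (a, p, b) in inferred:
            for (c, q, d) in inferred:
                if b != c or q != SUBCLASS:
                    continue
                if p == SUBCLASS:
                    # a subClassOf b, b subClassOf d  =>  a subClassOf d
                    new.add((a, SUBCLASS, d))
                elif p == TYPE:
                    # a type b, b subClassOf d  =>  a type d
                    new.add((a, TYPE, d))
        if new <= inferred:
            return inferred
        inferred |= new


closure = rdfs_closure(triples)
print(("ex:case42", TYPE, "ex:Disease") in closure)
```

The data only says that `ex:case42` is a case of hyperprolactinemia; the vocabulary lets a machine conclude that it is also a disease case. XML alone gives no such licence, which is the sense of "surface syntax, no semantics".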

20 Remember “required are”:
✓ one or more standard vocabularies
– so search engines, producers and consumers all speak the same language
✓ a standard syntax
– so meta-data can be recognised as such
3. lots of resources with meta-data attached

21 Question: who writes the ontologies?
Professional bodies, scientific communities, companies, publishers, …
See previous slide on Biomedical ontologies
– Same developments in many other fields
Good old fashioned Knowledge Engineering
Convert from DB-schema, UML, etc.

22 Question: Who writes the meta-data?
– Automated learning
– shallow natural language analysis
– Concept extraction
Example: Encyclopedia Britannica on “Amsterdam”
[Extracted concepts shown: amsterdam trade antwerp europe amsterdam merchant city town center netherlands merchant city town]
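
A very rough sketch of what shallow concept extraction could look like: tokenise the text, look each word up in a thesaurus, and count the hits. This is an assumed simplification, not the method behind the slide's example; the mini-thesaurus and the sample sentence are invented (the sentence is not taken from Encyclopedia Britannica), and `extract_concepts` is a hypothetical helper:

```python
import re
from collections import Counter

# Hypothetical mini-thesaurus: surface word -> concept label.
THESAURUS = {
    "amsterdam": "Amsterdam",
    "antwerp": "Antwerp",
    "merchant": "merchant",
    "merchants": "merchant",
    "trade": "trade",
    "city": "city",
    "town": "town",
}


def extract_concepts(text, thesaurus=THESAURUS):
    """Count how often each known concept is mentioned in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(thesaurus[w] for w in words if w in thesaurus)


# Invented sample text, used only to exercise the function.
sample = "Amsterdam is a city. Merchants in the city trade with Antwerp."
concepts = extract_concepts(sample)
print(concepts.most_common())
```

The resulting counts are meta-data about the text: a crude "fingerprint" of which concepts it is about, produced without manual annotation.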

23 Question: Who writes the meta-data?
exploit existing legacy-data
– Databases
– Lab equipment
– (Amazon) side-effect from user interaction
– email keyword extraction
NOT from manual effort

24 Remember “required are”
✓ one or more standard vocabularies
– so search engines, producers and consumers all speak the same language
✓ a standard syntax
– so meta-data can be recognised as such
lots of resources with meta-data attached

25 Some working examples? DOPE

26 DOPE: Background
Vertical Information Provision
– Buy a topic instead of a Journal!
– Web provides new opportunities
Business driver: drug development
– Rich, information-hungry market
– Good thesaurus (EMTREE)

27 The Data
Document repositories:
– ScienceDirect: approx. 500,000 fulltext articles
– MEDLINE: approx. 10,000,000 abstracts
Extracted Metadata
– The Collexis Metadata Server: concept-extraction ("semantic fingerprinting")
Thesauri and Ontologies
– EMTREE: 60,000 preferred terms, 200,000 synonyms
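
The split between preferred terms and synonyms is what lets a thesaurus normalise queries and documents to one term. The sketch below shows the idea with a three-entry lookup table; the entries are illustrative and are not claimed to be actual EMTREE content, and `normalise` and `expand` are hypothetical helpers:

```python
# Hypothetical thesaurus fragment: every known term -> its preferred term.
PREFERRED = {
    "hyperprolactinemia": "hyperprolactinemia",
    "hyperprolactinaemia": "hyperprolactinemia",
    "prolactin excess": "hyperprolactinemia",
}


def normalise(term):
    """Map a term to its preferred form, or None if it is unknown."""
    return PREFERRED.get(term.strip().lower())


def expand(term):
    """Return all terms sharing a preferred term with the given one."""
    preferred = normalise(term)
    if preferred is None:
        # Unknown term: fall back to searching for the term itself.
        return {term.strip().lower()}
    return {t for t, p in PREFERRED.items() if p == preferred}


print(normalise(" Hyperprolactinaemia "))
print(sorted(expand("prolactin excess")))
```

Normalising at indexing time and expanding at query time are two ways to use the same table; which one a system such as DOPE uses is not stated on the slides.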

28 Architecture:
[Diagram labels: RDF Schema, EMTREE, Query interface, RDF Datasource 1 … RDF Datasource n]

29

30 Ontology disambiguates query

31 Ontology groups results

32 Ontology clusters results

33 Ontology refines query

34 Some working examples? DOPE, HCLS (http://www.w3.org/2001/sw/hcls/)

35 Architecture:
[Diagram labels: RDF Schema, EMTREE, Query interface, RDF Datasource 1 … RDF Datasource n, RDF Schema, Gene Ontology, …]

36

37 Summarising…
Data integration on the Web:
– machine processable data besides human processable data
Syntax for meta-data
– Representation
– Inference
Vocabularies for meta-data
– Lots of them in bio-inf.
Actual meta-data:
– Lots in bio-inf.
Will enable:
– Better search engines (recall, precision, concepts)
– Combining information across pages (inference)
– …

38 Things to do for you
Practical: Use existing software to construct new use scenarios
Conceptual: Create an ontology for some area of bio-medical expertise
– from scratch
– as a refinement of an existing ontology
Technical: Transform an existing data-set into meta-data format, and provide a query interface (for humans and machines)
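
As a starting point for the "Technical" task, the sketch below turns a small table into subject-relation-object statements and answers pattern queries over them. It is a toy under stated assumptions: the CSV content is invented, the column names double as relation names, and `rows_to_triples` and `match` are hypothetical helpers. A real solution would more likely emit RDF and use a proper query language:

```python
import csv
import io

# Invented example table; the values are placeholders.
CSV_DATA = """gene,fold_change,tissue
GENE_A,3.5,pituitary
GENE_B,1.2,pituitary
"""


def rows_to_triples(csv_text, key_column):
    """Turn each non-key cell into a (subject, column, value) triple."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = row[key_column]
        for column, value in row.items():
            if column != key_column:
                triples.append((subject, column, value))
    return triples


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


triples = rows_to_triples(CSV_DATA, "gene")
print(match(triples, p="tissue", o="pituitary"))
```

The same `match` function serves a human typing a query and a program calling it, which is the point of providing one interface "for humans and machines".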

