Scientific RDF Databases Michael Mertens K.U.Leuven.

1 Scientific RDF Databases Michael Mertens K.U.Leuven

2 Outline Introduction to RDF RDF Databases Advantages for scientific R&D In practice Criticism 2

3 Outline Introduction to RDF RDF Databases Advantages for scientific R&D In practice Criticism 3

4 Introduction: RDF (Resource Description Framework) – Originally: a metadata data model – Now: a general method for the conceptual description of web resources (Semantic Web)

5 Introduction: Traditional Web in 2009 – Sharing documents – URL as retrieval mechanism – HTML as standard format – Hypertext links (Image taken from “The Emerging Web of Linked Data”, Chris Bizer)

6 Introduction: Data on the web – HTML describes documents and the links between them – Semantic Web: publish data in RDF, OWL, XML, ... – Describe arbitrary things: people, books, events, ... – Link between these concepts – Machine-readable, web-accessible databases

7 Introduction: Linked Data (Tim Berners-Lee) – Connected structured data – 3 simple principles: use URLs for conceptual things; looking up a URL returns useful data about that thing; relationships link to other URLs

8 Introduction: Example – Before: scientific data usually not shared – Pharmaceutical drug discovery: a lot of spread-out data (DrugBank, ClinicalTrials.gov, ...) – Health care and life science: genomics data, protein data, ... – A question nobody had examined before: “What proteins are involved in signal transduction AND are related to pyramidal neurons?” (Example taken from “Tim Berners-Lee on the next Web”)

9 Introduction: Example – The web: 223,000 hits, 0 results (Example taken from “Tim Berners-Lee on the next Web”)

10 Introduction: Example – Linked Data: 32 hits, 32 results (Example taken from “Tim Berners-Lee on the next Web”)
    DRD1, 1812 adenylate cyclase activation
    ADRB2, 154 adenylate cyclase activation
    ADRB2, 154 arrestin mediated desensitization of G-protein coupled …
    DRD1IP, dopamine receptor signaling pathway
    DRD1, 1812 dopamine receptor, adenylate cyclase activating pathway
    DRD2, 1813 dopamine receptor, adenylate cyclase inhibiting pathway
    GRM7, 2917 G-protein coupled receptor protein signaling pathway
    GNG3, 2785 G-protein coupled receptor protein signaling pathway
    GNG12, G-protein coupled receptor protein signaling pathway
    DRD2, 1813 G-protein coupled receptor protein signaling pathway
    ADRB2, 154 G-protein coupled receptor protein signaling pathway
    CALM3, 808 G-protein coupled receptor protein signaling pathway
    HTR2A, 3356 G-protein coupled receptor protein signaling pathway
    DRD1, 1812 G-protein signaling, coupled to cyclic nucleotide second…
    SSTR5, 6755 G-protein signaling, coupled to cyclic nucleotide second…
    MTNR1A, 4543 G-protein signaling, coupled to cyclic nucleotide …
    HTR6, 3362 G-protein signaling, coupled to cyclic nucleotide second …
    GRIK2, 2898 glutamate signaling pathway
    GRIN1, 2902 glutamate signaling pathway
    GRIN2A, 2903 glutamate signaling pathway
    GRIN2B, 2904 glutamate signaling pathway
    ADAM10, 102 integrin-mediated signaling pathway
    GRM7, 2917 negative regulation of adenylate cyclase activity
    LRP1, 4035 negative regulation of Wnt receptor signaling pathway
    ADAM10, 102 Notch receptor processing
    ASCL1, 429 Notch signaling pathway
    HTR2A, 3356 serotonin receptor signaling pathway
    ADRB2, 154 transmembrane receptor protein tyrosine kinase …
    PTPRG, 5793 transmembrane receptor protein tyrosine kinase …
    EPHA4, 2043 transmembrane receptor protein tyrosine kinase …
    NRTN, 4902 transmembrane receptor protein tyrosine kinase …
    CTNND1, 1500 Wnt receptor signaling pathway

11 Introduction: Example (Example taken from “Tim Berners-Lee on the next Web”)
    PREFIX go:
    PREFIX rdfs:
    PREFIX owl:
    PREFIX mesh:
    SELECT ?genename ?processname
    WHERE {
      graph { ?paper ?p mesh:D ?article sc:identified_by_pmid ?paper.
              ?gene sc:describes_gene_or_gene_product_mentioned_by ?article. }
      graph { ?protein rdfs:subClassOf ?res.
              ?res owl:onProperty ro:has_function.
              ?res owl:someValuesFrom ?res2.
              ?res2 owl:onProperty ro:realized_as.
              ?res2 owl:someValuesFrom ?process.
              graph { { ?process go:GO_ } union { ?process rdfs:subClassOf go:GO_ } }
              ?protein rdfs:subClassOf ?parent.
              ?parent owl:equivalentClass ?res3.
              ?res3 owl:hasValue ?gene. }
      graph { ?gene rdfs:label ?genename }
      graph { ?process rdfs:label ?processname }
    }
    Related to Pyramidal Neurons – Part of Signal Transduction – Used 4 sources

14 Introduction: What do we need? – Identifiers: URIs – Linking mechanism: HTTP – Vocabulary: Web Ontology Language (OWL) – Serialization: RDF/XML

15 Introduction: Identifiers (URIs) – Use of HTTP URLs – Link to “Resources” – Possibly many documents per resource – Shift to non-information resources: HTML: RDF: N3:

16 Introduction: Linking mechanism (HTTP) – Accessible through generic data browsers – Can be crawled by search engines – Connects different sources – In contrast, Web APIs each use a different interface

17 Introduction: Vocabulary (Web Ontology Language, OWL) – Knowledge representation language – Designed to be interpreted by computers – Describes data in terms of classes, individuals and property assertions (relationships)

18 Introduction: Vocabulary (Web Ontology Language, OWL) – Knowledge representation language – Designed to be interpreted by computers – Describes data in terms of classes, individuals and property assertions (relationships) – URIs about the same thing are linked with ‘owl:sameAs’
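As a rough illustration of how ‘owl:sameAs’ lets a consumer treat several URIs as one thing, the sketch below groups identifiers into equivalence classes with a small union-find. The `example.org` URIs and the function name `same_as_groups` are hypothetical and not taken from the slides; only the `owl:sameAs` property URI is the standard one.

```python
# Minimal sketch: group URIs connected by owl:sameAs into equivalence classes.
# All example.org URIs below are made-up placeholders for illustration.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

triples = [
    ("http://example.org/a/insulin", OWL_SAME_AS, "http://example.org/b/INS"),
    ("http://example.org/b/INS", OWL_SAME_AS, "http://example.org/c/insulin-record"),
    ("http://example.org/a/glucagon", OWL_SAME_AS, "http://example.org/b/GCG"),
]


def same_as_groups(triples):
    """Map each URI to the frozenset of URIs declared equivalent to it."""
    parent = {}

    def find(x):
        # Find the representative of x, shortening the chain as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == OWL_SAME_AS:
            parent[find(s)] = find(o)  # merge the two equivalence classes

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return {uri: frozenset(members) for members in groups.values() for uri in members}


groups = same_as_groups(triples)
```

Looking up any one of the three insulin URIs in `groups` returns the same set, so data attached to any of them can be combined.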

19 RDF (Resource Description Framework) – Based on triples: subject, predicate, object – Resources are identified by URIs – URIs allow RDF information to be looked up – RDF information links to other URIs
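A minimal sketch of the triple model, assuming a hypothetical `http://example.org/` namespace and treating any object that starts with `http://` as a URI (a simplification; real RDF distinguishes URIs from literals explicitly). It shows a graph as a set of (subject, predicate, object) tuples, a lookup of everything said about one URI, and the following of a link to another URI.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


EX = "http://example.org/"  # hypothetical namespace, for illustration only

graph = {
    Triple(EX + "TonyBenn", EX + "name", "Tony Benn"),
    Triple(EX + "TonyBenn", EX + "describedBy", EX + "page/TonyBenn"),
    Triple(EX + "page/TonyBenn", EX + "publisher", "Wikipedia"),
}


def describe(graph, uri):
    """Return all triples whose subject is `uri` (what a lookup of the URI would give)."""
    return sorted(t for t in graph if t.subject == uri)


def linked_uris(graph, uri):
    """Return the URIs that `uri` links to, i.e. objects that are themselves URIs."""
    return sorted(t.obj for t in graph if t.subject == uri and t.obj.startswith("http://"))


# Follow a link from one resource to the next and describe the target.
next_uri = linked_uris(graph, EX + "TonyBenn")[0]
publisher = describe(graph, next_uri)[0].obj
```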

22 RDF (Resource Description Framework) – This looks a lot like XML. Why don’t we just use XML?

23 RDF vs XML – RDF: XML: Name Page Name Page Name...

24 RDF: Serialization – RDF/XML: proposed by W3C – N3 or Turtle: human-readable – Tony Benn dc:. dc:title "Tony Benn"; dc:publisher "Wikipedia".
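To make the idea of serialization concrete, here is a small sketch that writes triples in N-Triples, a line-based subset of Turtle with one `<subject> <predicate> object .` statement per line. The subject URI is a hypothetical `example.org` placeholder, the `dc` namespace is the Dublin Core elements namespace, and the escaping covers only backslashes and double quotes, so it is not a complete serializer.

```python
DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core elements namespace

triples = [
    ("http://example.org/TonyBenn", DC + "title", "Tony Benn"),      # subject URI is hypothetical
    ("http://example.org/TonyBenn", DC + "publisher", "Wikipedia"),
]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) string tuples as N-Triples text.

    Assumption for this sketch: an object starting with http:// or https://
    is a URI, anything else is a plain string literal.
    """
    lines = []
    for s, p, o in sorted(triples):
        if o.startswith("http://") or o.startswith("https://"):
            obj = f"<{o}>"
        else:
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


text = to_ntriples(triples)
```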

25 Outline Introduction to RDF RDF Databases Advantages for scientific R&D In practice Criticism 25

26 RDF Databases – Also called “triple stores” – Data in the form of triples: subject, predicate, object – Dominant query language: SPARQL – PREFIX abc:. SELECT ?capital ?country WHERE { ?x abc:cityname ?capital ; abc:isCapitalOf ?y. ?y abc:countryname ?country ; abc:isInContinent abc:Africa. }
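The SPARQL query on this slide is a set of triple patterns that share variables. The sketch below shows, under simplifying assumptions, how such a basic graph pattern can be evaluated against an in-memory list of triples: terms starting with `?` are variables, everything else must match exactly. The data and the compact `abc:` names are illustrative; the function `match` is hypothetical and is not how a real triple store or SPARQL engine is implemented.

```python
# Tiny in-memory "triple store" using compact names instead of full URIs.
triples = [
    ("abc:Cairo", "abc:cityname", "Cairo"),
    ("abc:Cairo", "abc:isCapitalOf", "abc:Egypt"),
    ("abc:Egypt", "abc:countryname", "Egypt"),
    ("abc:Egypt", "abc:isInContinent", "abc:Africa"),
    ("abc:Nairobi", "abc:cityname", "Nairobi"),
    ("abc:Nairobi", "abc:isCapitalOf", "abc:Kenya"),
    ("abc:Kenya", "abc:countryname", "Kenya"),
    ("abc:Kenya", "abc:isInContinent", "abc:Africa"),
    ("abc:Paris", "abc:cityname", "Paris"),
    ("abc:Paris", "abc:isCapitalOf", "abc:France"),
    ("abc:France", "abc:countryname", "France"),
    ("abc:France", "abc:isInContinent", "abc:Europe"),
]

# The four triple patterns from the WHERE clause of the slide's query.
patterns = [
    ("?x", "abc:cityname", "?capital"),
    ("?x", "abc:isCapitalOf", "?y"),
    ("?y", "abc:countryname", "?country"),
    ("?y", "abc:isInContinent", "abc:Africa"),
]


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or reject if it is already bound differently.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(rest, triples, candidate)


results = sorted((b["?capital"], b["?country"]) for b in match(patterns, triples))
```

The shared variables `?x` and `?y` play the role that join conditions play in SQL: each pattern narrows the bindings produced by the previous ones.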

27 RDF Databases – Built on W3C’s “Linked Data” – Subset of “graph databases” – Nodes (entities), edges (relationships), properties – Directed, labeled graph structure (predicate URI as label)

28 Graph View 28 Image taken from w3.org

29 RDF Databases – The only standardised NoSQL database – In contrast to a normal RDBMS: a very flexible data model that does not require a fixed table schema; information stored as its most basic building blocks – Enables improvements on data-intensive operations – Examples: eBay, Facebook, Digg, ...

30 RDF Databases – Scalable: distributed design – Self-documenting data: vocabulary identified in OWL or RDFS definitions; allows multiple schemata – Open: discover new data sources at run-time – Often weak consistency guarantees: solved with additional middleware

31 RDF Databases: Limitations of relational databases – Not directly visible to web agents – Primary/foreign key relationships: meaning is implicit, semantics unspecified – No relationships across separate databases – Parent-child relationships are not natural: “self-joins” for each level in the hierarchy
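The point about hierarchies can be sketched as follows: when parent-child links are stored as triples with a single predicate, one traversal collects all ancestors regardless of depth, whereas the slide notes that a relational schema needs one self-join per level. The class names and the `ancestors` helper are hypothetical examples, not data from the talk.

```python
# Hypothetical class hierarchy stored as subClassOf triples.
triples = [
    ("ex:DopamineReceptor", "rdfs:subClassOf", "ex:GProteinCoupledReceptor"),
    ("ex:GProteinCoupledReceptor", "rdfs:subClassOf", "ex:Receptor"),
    ("ex:Receptor", "rdfs:subClassOf", "ex:Protein"),
    ("ex:Insulin", "rdfs:subClassOf", "ex:Protein"),
]


def ancestors(triples, node, predicate="rdfs:subClassOf"):
    """Return every node reachable from `node` by following `predicate` one or more times."""
    parents = {}
    for s, p, o in triples:
        if p == predicate:
            parents.setdefault(s, set()).add(o)

    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


receptor_ancestors = ancestors(triples, "ex:DopamineReceptor")
```

Adding another level to the hierarchy means adding one triple; the traversal code does not change.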

32 Outline Introduction to RDF RDF Databases Advantages for scientific R&D In practice Criticism 32

33 Advantages for Scientific R&D – Studies continue to show that research in all fields is increasingly collaborative – Example: genomic research, with complex data distributed over many datasets: Entrez Gene (EG), Gene Ontology (GO), Swiss-Prot, GenBank, ...

34 Advantages for Scientific R&D – Problem: lack of well-defined standards – Integration nightmare: data scattered, different formats, lacking information, synonyms, ambiguity – Changing models: maintenance not feasible – Understanding and reasoning: need for connecting ontologies – Challenge: syntactic and semantic heterogeneity

35 Integration of Databases: Challenges – Localization of resources: identify relevant web resources – Data formats: resources are represented in HTML, TXT, images, ... – Synonyms: researchers can name their own data differently

36 Integration of Databases: Challenges – Ambiguity: e.g. “insulin” can represent a drug, protein, gene, ... – Relations: one-to-one / one-to-many between identifiers – Granularity: can cause missing data, ...

37 Integration of Databases: Approaches – Data warehouse approach – Translate data into one local database – Eliminates unavailability and slow response – Allows data processing and optimization – Maintenance problem: evolution of content and structure – Examples: BioWarehouse, Biozon, DataFoundry

38 Integration of Databases: Approaches – Federated database approach – Translate queries for individual sources – Easier to maintain (e.g. adding a new source) – Poor performance – Examples: BioKleisli, DiscoveryLink, QIS

39 Integration of Databases: Approaches – Semantic Web approach – No need to map data models – Rely on standardized ontologies – Less work, better performance – But only if sources comply
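A sketch of why shared identifiers reduce mapping work: if two sources already describe the same things with the same URIs and vocabulary, integrating them is a set union of their triples, and a cross-source question becomes a query over the merged set. The gene names, predicates and the `subjects_with` helper are hypothetical placeholders, not real biological data.

```python
# Two hypothetical sources that use the same identifiers (ex:geneA, ex:geneB, ...).
source_a = {
    ("ex:geneA", "ex:involvedIn", "ex:SignalTransduction"),
    ("ex:geneB", "ex:involvedIn", "ex:SignalTransduction"),
    ("ex:geneC", "ex:involvedIn", "ex:CellCycle"),
}
source_b = {
    ("ex:geneA", "ex:relatedTo", "ex:PyramidalNeuron"),
    ("ex:geneC", "ex:relatedTo", "ex:PyramidalNeuron"),
}

# Integration step: no schema mapping, just the union of both triple sets.
merged = source_a | source_b


def subjects_with(graph, predicate, obj):
    """Return all subjects that have the given predicate and object."""
    return {s for s, p, o in graph if p == predicate and o == obj}


# A question neither source can answer alone.
answer = (
    subjects_with(merged, "ex:involvedIn", "ex:SignalTransduction")
    & subjects_with(merged, "ex:relatedTo", "ex:PyramidalNeuron")
)
```

If the sources used different identifiers for the same gene, the union would still succeed but the intersection would be empty, which is the “only if sources comply” caveat on the slide.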

40 Outline Introduction to RDF RDF Databases Advantages for scientific R&D In practice Criticism 40

41 In Practice Scientists need: – Access to data – Ability to utilize data – Handle uncertainty 41

42 In Practice Linked Open Data: – “We all need the same databases, for different decisions or applications” – Complements data in internal/licensed sources – Stimulates cross scientific sharing 42

43 Examples – Biological data: Human Genome Project – Increase in web-accessible databases: GenBank, Gene Ontology, UniProt, PhenoDB, ... – Integration is the key problem – Increase in RDF availability

44 Examples: YeastHub – Registration of web-accessible databases: metadata according to Dublin Core standards, using RSS 1.0 to describe an ontology – Data conversion: XML or RDB to RDF conversion (e.g. unique ID = RDF ID, the rest of the columns are properties) – Data integration: ad hoc RDF queries; form-based queries (supervised)
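The conversion rule on this slide (unique ID becomes the RDF ID, the remaining columns become properties) can be sketched in a few lines. This is an illustration of the rule only, not YeastHub’s actual code; the table contents, column names, base URI and the function `rows_to_triples` are all hypothetical.

```python
def rows_to_triples(rows, id_column, base_uri):
    """Convert table rows (dicts) to triples.

    The value in `id_column` becomes the subject URI; every other
    non-empty column becomes a property of that subject.
    """
    triples = []
    for row in rows:
        subject = base_uri + str(row[id_column])
        for column, value in row.items():
            if column != id_column and value is not None:
                triples.append((subject, base_uri + column, value))
    return triples


# Hypothetical rows from a relational table.
rows = [
    {"orf_id": "ORF001", "name": "geneA", "chromosome": "IV"},
    {"orf_id": "ORF002", "name": "geneB", "chromosome": None},
]

triples = rows_to_triples(rows, id_column="orf_id", base_uri="http://example.org/yeast/")
```

Missing values simply produce no triple, which is one way the flexible data model avoids the NULL-filled columns of a fixed table schema.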

45 Outline Introduction to RDF RDF Databases Advantages for scientific R&D In practice Criticism 45

46 Criticism – Feasibility – Human behavior and personal preferences – ‘Database hugging’: organizations tend to keep data for themselves – Censorship and privacy

47 Criticism – Is published data reusable in research? – Requires: provenance information, quality, attribution, consistency, ... – Out-of-context data fails to respect scientific research methodology

48 References
    Zhang Zhang, Kei-Hoi Cheung and Jeffrey P. Townsend (2008). Bringing Web 2.0 to bioinformatics.
    Kei-Hoi Cheung, Andrew K. Smith, Kevin Y.L. Yip, Christopher J.O. Baker and Mark B. Gerstein (2006). Semantic web approach to database integration in life sciences.
    Satya S. Sahoo, Olivier Bodenreider, Kelly Zeng and Amit Sheth (2007). Integrating large biomedical knowledge resources with RDF.
    Huajun Chen, Zhaohui Wu, Heng Wang and Yuxin Mao (2006). RDF/RDFS-based Relational Database Integration.

49 Discussion – Has anyone ever worked with linked (RDF) data before? What are your experiences? – Will the Semantic Web grow to become the Giant Global Graph? – Why haven’t RDF databases taken off like relational databases?

