© 2007 OpenLink Software, All rights reserved Virtuoso Sponger Extracting RDF Structured Data from Non-RDF Sources.


1 Virtuoso Sponger: Extracting RDF Structured Data from Non-RDF Sources

2 Growing the Semantic Web
- A classic chicken-and-egg problem has impeded the growth of the Semantic Web:
  - Development of applications for the Semantic Web will remain small-scale without a critical mass of RDF data.
  - A critical mass of RDF data won't be achieved without adequate Semantic Web applications and tools.
- A new class of tools is emerging in response to this need: RDFizers
  - Transform non-RDF data into RDF
  - Virtuoso Sponger is one such RDFizer

3 Virtuoso Sponger
- An RDFizer introduced in Virtuoso 5.0
- Provides built-in RDF middleware for transforming non-RDF data into RDF "on the fly"
  - You can use non-RDF data sources as Semantic Web data sources
- Inputs: a wide variety of non-RDF Web data sources, e.g.:
  - (X)HTML Web pages (including hosted microformats)
  - Web services (Google, Del.icio.us, Flickr etc.)
  - Binary files (MS Office, PDF, OpenDocument etc.)
- Output: RDF structured data

4 Inputs: Supported Data Sources
- RDF (inc. N3, Turtle): SIOC, SKOS, FOAF, AtomOWL, Annotea …
- (X)HTML pages
  - HTML header metadata: Dublin Core
  - Microformats: eRDF, RDFa, hCard, hCalendar, XFN, xFolk …
- Syndication formats: RSS 2.0, Atom, OPML, OCS, XBEL
- GRDDL
- Web service APIs: Google Base, Flickr, Del.icio.us, Ning …
- Files
  - Binary files: MS Office, OpenOffice, images, audio, video …
  - Data exchange formats: iCalendar, vCard
- 3rd-party metadata extractors: Aperture, Spotlight, SIMILE RDFizers
- or add your own!

5 Output: Structured Data
- In the context of the Semantic Data Web: data organized into semantic chunks or entities, with similar entities grouped together into relations or classes
  - Source: Michael Bergman (http://www.mkbergman.com), article "More Structure, More Terminology and (hopefully) More Clarity"

6 Sponger Benefits
- The majority of the world's data currently resides in non-RDF form
- Sponger provides a Swiss army knife for generating RDF structured data from non-RDF sources
- Extracting data from non-RDF Web sources and converting it to RDF:
  - helps bootstrap the Semantic Web
  - helps drive the transition of the traditional Document Web into the emerging Semantic Data Web
  - exposes the data in a canonical form for querying and inference

7 Sponger Inputs & Outputs

8 Sponger Architecture
- Sponger is composed of Sponger Cartridges
  - The default cartridge collection is bundled as a Virtuoso VAD
- Cartridge = Metadata Extractor + Ontology Mapper
  - Metadata extracted from non-RDF resources is mapped to a suitable ontology by the Ontology Mapper to produce Structured Data
- Sponger is highly customizable
  - Custom cartridges can be developed using any language (e.g. Virtuoso PL, C/C++, Java) supported by the Virtuoso Server Extensions API

9 Using the Sponger
Can be invoked in several ways, via:
- Virtuoso SPARQL query processor
- Virtuoso RDF Proxy Service (/proxy/rdf/%U)
  - E.g. ners-Lee/card
- OpenLink RDF client applications
- ODS-Briefcase (Virtuoso WebDAV)
- Directly through Virtuoso PL

10 Using the Sponger: SPARQL Query Processor
- Virtuoso extends SPARQL with IRI/URI dereferencing
  - The highly distributed nature of the Semantic Web makes it unlikely that all referenced resources/IRIs will be in the local quad store
- During query execution, remote RDF resources can be downloaded from a given IRI and parsed, and the resulting triples stored in the local quad store
- IRI dereferencing of FROM clauses
  - Downloads & stores triples from named graphs
- IRI dereferencing of SPARQL variables
  - Downloads & stores triples based on a proximity search from a starting IRI to a given depth (# of hops) via specified predicates

11 SPARQL Extensions: IRI Dereferencing of FROM Clauses
- Enabled through define get:… pragmas

    DEFINE get:method GET
    DEFINE get:soft soft
    SELECT ?id
    FROM NAMED
    FROM NAMED
    WHERE { GRAPH ?g { ?id a ?o } };

- get:soft – retrieval mode: soft / replace
- get:uri – IRI to retrieve if not equal to the IRI of the FROM clause
- get:method – HTTP GET or URIQA MGET
- get:refresh – max allowed age (seconds) of the cached resource; can reduce the expiry time specified in HTTP headers
- get:proxy – proxy server address if direct download is not possible
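As a rough illustration of how a client might attach these pragmas, the Python sketch below prefixes a SPARQL query string with get:… definitions. The helper name `with_get_pragmas`, the example graph IRI, and the choice to quote pragma values as string literals are assumptions made for this sketch, not something the slides confirm.

```python
def with_get_pragmas(query, soft="soft", method="GET", refresh=None, proxy=None):
    """Prefix a SPARQL query with get:* pragmas (hypothetical helper).

    Quoting each value as a string literal is an assumption of this sketch.
    """
    if soft not in ("soft", "replace"):
        raise ValueError("get:soft must be 'soft' or 'replace'")
    if method not in ("GET", "MGET"):
        raise ValueError("get:method must be 'GET' or 'MGET'")
    pragmas = [("get:soft", soft), ("get:method", method)]
    if refresh is not None:
        # Max allowed age of the cached resource, in seconds
        pragmas.append(("get:refresh", str(int(refresh))))
    if proxy is not None:
        pragmas.append(("get:proxy", proxy))
    lines = ['DEFINE {} "{}"'.format(name, value) for name, value in pragmas]
    return "\n".join(lines + [query])


# The graph IRI below is a placeholder for illustration only.
query = with_get_pragmas(
    "SELECT ?id FROM NAMED <http://example.org/data.ttl> "
    "WHERE { GRAPH ?g { ?id a ?o } }",
    refresh=3600,
)
print(query)
```

The resulting string would then be sent to a SPARQL endpoint in the usual way; the sketch stops short of any network call.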

12 SPARQL Extensions: IRI Dereferencing of Variables
- Enabled through define input:grab-… pragmas

    DEFINE input:grab-var ?more
    DEFINE input:grab-depth 10
    DEFINE input:grab-limit 100
    DEFINE input:grab-base
    SELECT ?id ?fullname ?
    WHERE { GRAPH ?g { ?id a ; ?fullname ; ? . OPTIONAL { ?id ?more } } } ;

- input:grab-var – SPARQL variable identifying IRIs to be downloaded
- input:grab-depth – max # of links (predicates) between nodes in graph
- input:grab-limit – max # of resources (subject/object nodes) to retrieve
- input:grab-base – base IRI for converting relative IRIs to absolute
- plus others (grab-seealso, grab-destination …): see Reference Manual
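To make the roles of grab-depth and grab-limit concrete, here is a minimal Python sketch of a depth- and size-bounded breadth-first walk. It is not Virtuoso's algorithm: the function `grab` and the in-memory `links` dictionary (standing in for the IRIs a query would bind to the grab variable) are hypothetical, and nothing is actually downloaded.

```python
from collections import deque


def grab(start, links, depth, limit):
    """Breadth-first walk from `start`: at most `depth` hops, `limit` resources."""
    seen = [start]
    queue = deque([(start, 0)])
    while queue and len(seen) < limit:
        iri, hops = queue.popleft()
        if hops == depth:
            continue  # do not follow links beyond the depth bound
        for nxt in links.get(iri, []):
            if nxt not in seen and len(seen) < limit:
                seen.append(nxt)
                queue.append((nxt, hops + 1))
    return seen


# Toy link structure; keys and values stand in for IRIs.
links = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}
within_two_hops = grab("a", links, depth=2, limit=100)
first_three = grab("a", links, depth=10, limit=3)
```

With the toy data, the depth bound stops the walk before "e" is reached, and the limit bound stops it after three resources regardless of depth.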

13 Using the Sponger: RDF Proxy Service
- Sponger functionality is also exposed by the Virtuoso /proxy/rdf endpoint
  - A built-in REST-style Web service
  - Takes a target URL & returns its content as is, or tries to transform it (by sponging) to RDF
- Provides a pipe for RDF browsers to browse non-RDF sources
- Caches to temporary Virtuoso storage
  - Cache invalidation is similar to that of a traditional Web browser, based on the HTTP Expires header
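The Expires-based invalidation described here can be sketched in a few lines of Python. This is only an illustration of the general idea, not the proxy's implementation; the function `is_fresh` and its optional `max_age_seconds` cap (which can only shorten the lifetime given by the header) are assumptions of the sketch.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def is_fresh(fetched_at, expires_header, now, max_age_seconds=None):
    """Return True while a cached copy may still be served."""
    expiry = parsedate_to_datetime(expires_header)
    if max_age_seconds is not None:
        # An explicit maximum age can reduce, never extend, the header's expiry
        expiry = min(expiry, fetched_at + timedelta(seconds=max_age_seconds))
    return now < expiry


fetched = datetime(2015, 10, 21, 7, 0, tzinfo=timezone.utc)
now = datetime(2015, 10, 21, 7, 10, tzinfo=timezone.utc)
header = "Wed, 21 Oct 2015 07:28:00 GMT"

fresh_by_header = is_fresh(fetched, header, now)
fresh_with_cap = is_fresh(fetched, header, now, max_age_seconds=300)
```

In the example the header alone keeps the copy fresh at 07:10, while a five-minute cap has already expired it.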

14 RDF Proxy Service
Parameters:
- url: the URL of the target
- force: if rdf is specified, will try to extract RDF data from the target and return it
- header: HTTP headers to be sent to the target
- output-format: output MIME type of the RDF data
  - rdf+xml (default) / n3 / turtle / ttl
  - if not specified, the proxy service uses content negotiation
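A client-side sketch of composing a proxy request from these parameters might look like the following. It assumes the parameters are passed as an ordinary query string; the endpoint address, the target URL, and the helper name `proxy_url` are placeholders rather than details taken from the slides.

```python
from urllib.parse import urlencode


def proxy_url(endpoint, target, force_rdf=False, output_format=None):
    """Build a proxy request URL (query-string passing is an assumption)."""
    params = [("url", target)]
    if force_rdf:
        params.append(("force", "rdf"))
    if output_format is not None:
        params.append(("output-format", output_format))
    return endpoint + "?" + urlencode(params)


# Host, port and target below are hypothetical.
request_url = proxy_url(
    "http://localhost:8890/proxy",
    "http://example.org/page.html",
    force_rdf=True,
    output_format="n3",
)
print(request_url)
```

Leaving `output_format` unset omits the parameter entirely, which matches the slide's note that the service then falls back to content negotiation.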

15 Using the Sponger: OpenLink RDF Client Applications
Bundled as part of the OpenLink AJAX Toolkit (OAT):
- RDF Browser
  - Uses /proxy service by default
- iSPARQL – interactive SPARQL query builder
  - Uses /sparql service & 5 sponging settings (translated to IRI dereferencing pragmas on the server):
    - Get Local Data Only
    - Get Remote Data When Missing Locally
    - Get All Remote Data
    - Get All Remote Data & Related Data
    - Get Everything

16 Using the Sponger: ODS-Briefcase (Virtuoso WebDAV)
- Briefcase = a component of OpenLink Data Spaces
- Includes a high-level interface to the Virtuoso WebDAV repository
  - Web browser based interaction
  - Web services support (direct use of WebDAV protocol)
  - SPARQL queryable (WebDAV location acts as RDF graph URI)
- Metadata automatically extracted at file upload time
  - Wide variety of file formats supported
- All WebDAV resources are exposed as SIOC instance data
- Extracted metadata available in two forms
  - Pure WebDAV
  - RDF (RDF/XML, N3, Turtle), optionally synchronized with the Quad Store
- Virtuoso Content Crawler / RDF_Sink folder help automate uploading

17 SIOC as a Data Space Glue Ontology
- ODS has its own built-in cartridges for mapping to SIOC
  - All ODS data containers (ODS-Briefcase, ODS-Weblog, ODS-Wiki, ODS-FeedManager etc.) expose their data as SIOC instance data
- SIOC
  - provides a generic data model of containers, items and associations between items
    - Classes include: User, UserGroup, Role, Site, Forum, Post
    - SIOC Types Module (sioc-t) defines further types. Classes include: AddressBook, BookmarkFolder, ImageGallery etc.
  - permits the use of other ontologies (e.g. FOAF) when describing attributes of SIOC entities
  - provides a generic wrapper (glue ontology) for describing RDF structured data derived from OpenLink Data Spaces
- All ODS-related SIOC data can be queried through SPARQL

18 Using the Sponger: Directly via Virtuoso PL
- Sponger cartridges are invoked through a cartridge hook
  - Provides a Virtuoso PL entry point to the packaged functionality
  - Can be called directly from your own Virtuoso PL procedures

19 Sponger Cartridges

20 Sponger Architecture
- Sponger is composed of cartridges
  - Cartridge = metadata extractor + ontology mapper
  - A cartridge is invoked through its cartridge hook (Virtuoso PL entry point)
- Metadata extractor
  - Performs initial data extraction
- Ontology mapper
  - Generates RDF instance data from extracted (non-RDF) metadata
  - Extracted metadata is mapped to an ontology associated (via an internal mapping table) with the data source type
  - Typically uses XSLT (GRDDL or in-built Virtuoso mapping scheme) or Virtuoso PL
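The extractor-plus-mapper split can be pictured with a small Python model. This is a conceptual sketch only: the `Cartridge` class, the toy "key: value" extractor, and the choice of Dublin Core predicates are illustrative assumptions, whereas real cartridges are written against the Virtuoso hook and typically map with XSLT or Virtuoso PL.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]
DC = "http://purl.org/dc/elements/1.1/"


def extract_key_values(content: str) -> Dict[str, str]:
    """Toy metadata extractor: reads 'Key: value' lines from raw content."""
    metadata = {}
    for line in content.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip().lower()] = value.strip()
    return metadata


@dataclass
class Cartridge:
    extract: Callable[[str], Dict[str, str]]  # metadata extractor
    mapping: Dict[str, str]                   # metadata field -> predicate IRI

    def sponge(self, subject_iri: str, content: str) -> List[Triple]:
        """Extract metadata, then map the known fields to triples."""
        metadata = self.extract(content)
        return [
            (subject_iri, self.mapping[field], value)
            for field, value in sorted(metadata.items())
            if field in self.mapping
        ]


cartridge = Cartridge(extract_key_values, {"title": DC + "title", "author": DC + "creator"})
triples = cartridge.sponge(
    "http://example.org/doc",  # placeholder subject IRI
    "Title: Sponger\nAuthor: OpenLink\nPages: 30",
)
```

Fields without an entry in the mapping table (here, "pages") are simply dropped, which mirrors the idea that the mapper decides what becomes RDF.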

21 Sponger Cartridge Invocation

22 Sponger Configuration using Conductor UI
- Virtuoso Conductor provides a browser-based graphical UI for most Virtuoso administration tasks, including managing Sponger Cartridges and VADs
- VAD = Virtuoso Application Distribution
  - Packaging & distribution system for Virtuoso extensions
- RDF Cartridges VAD
  - Bundles a variety of pre-built cartridges for popular Web resources and file types
  - Installed as part of the default Virtuoso installation

23 Sponger Configuration using Conductor UI: RDF Cartridges Pane

24 Sponger Configuration using Conductor UI: GRDDL Filters

25 Sponger Configuration using Conductor UI: XSLT Templates

26 Sponger Configuration using Conductor UI: Schema Files / Supported Ontologies

27 Custom Cartridges
- Sponger is extensible via a pluggable cartridge architecture
- Sponge new data formats by creating your own cartridges
  - Use Virtuoso PL or any language supported by the Virtuoso Server Extensions API (incl. C/C++, Java)
- Before use, register your cartridge in the Cartridge Registry (SYS_RDF_CARTRIDGES table) using Conductor or DML

28 Custom Cartridges
Cartridge Hook - Virtuoso PL Prototype
- in graph_iri varchar: IRI of graph being retrieved
- in new_origin_uri varchar: URI of the document being retrieved
- in destination varchar: destination graph IRI
- inout content any: the document content
- inout async_queue any: preallocated asynchronous queue used to call the configured ping service
- inout ping_service any: URL of the ping service, as assigned to the PingService parameter in the [SPARQL] section of the virtuoso.ini file. This argument could be used to notify the PingTheSemanticWeb RDF document repository & notification service
- inout api_key any: unique string providing cartridge-specific data taken from the RC_KEY column of the DB.DBA.SYS_RDF_CARTRIDGES table

29 Flickr Cartridge Extracts

    procedure DB.DBA.RDF_LOAD_FLICKR_IMG (
      in graph_iri varchar, in new_origin_uri varchar, in dest varchar,
      inout _ret_body any, inout aq any, inout ps any, inout _key any)
    {
      declare xd, xt, url, tmp, api_key, img_id, hdr, exif any;
      ...
      -- call the Flickr REST API for the photo's metadata
      url := sprintf ('http://api.flickr.com/services/rest/?method=flickr.photos.getInfo&photo_id=%s&api_key=%s', img_id, api_key);
      tmp := http_get (url, hdr);
      ...
      -- parse the XML response
      xd := xtree_doc (tmp);
      ...
      -- map the response to RDF/XML with the cartridge's XSLT stylesheet
      xt := xslt (registry_get ('_rdf_cartridges_path_') || 'xslt/flickr2rdf.xsl', xd,
                  vector ('baseUri', coalesce (dest, graph_iri), 'exif', exif));
      xd := serialize_to_UTF8_xml (xt);
      -- load the RDF/XML into the destination graph
      DB.DBA.RDF_LOAD_RDFXML (xd, new_origin_uri, coalesce (dest, graph_iri));
      return 1;
    }

30 Custom Resolvers
- Sponger supports pluggable Custom Resolver cartridges
- Support dereferencing of other forms of URIs besides HTTP URLs, e.g. URN schemes (LSIDs) and handle schemes (DOIs)
- Greatly extends the range of data sources which can be linked into the Semantic Web
  - urn:lsid:ubio.org:namebank:11815
  - &should-sponge=soft&query=SELECT+*+WHERE+{?s+?p+?o}&format=text/html
- Proxy service also recognizes URNs
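Choosing a resolver by identifier scheme could be sketched as below. The function `pick_resolver`, the resolver labels it returns, and the "doi:" prefix form are assumptions for illustration; the slides do not describe how Virtuoso actually selects a resolver cartridge.

```python
def pick_resolver(identifier):
    """Return a label for the resolver that would handle `identifier`, or None."""
    lowered = identifier.lower()
    if lowered.startswith(("http://", "https://")):
        return "http"   # ordinary HTTP dereferencing
    if lowered.startswith("urn:lsid:"):
        return "lsid"   # Life Science Identifier URNs
    if lowered.startswith("doi:"):
        return "doi"    # handle-style identifiers (prefix form assumed)
    return None


lsid_choice = pick_resolver("urn:lsid:ubio.org:namebank:11815")
```

An identifier that matches no known scheme returns None, leaving the caller to decide whether to reject it or fall back to a default.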

