SISAI STATISTICAL INFORMATION SYSTEMS ARCHITECTURE AND INTEGRATION


Presentation on theme: "SISAI STATISTICAL INFORMATION SYSTEMS ARCHITECTURE AND INTEGRATION"— Presentation transcript:

1 SISAI STATISTICAL INFORMATION SYSTEMS ARCHITECTURE AND INTEGRATION
WORKING GROUP, 3rd MEETING, MAY 2013, ITEM 2.4: Integrating Statistical Data with Semantic Web

2 From SDMX to RDF Data Cube Vocabulary: Integrating Statistical Data with Semantic Web
Monica Scannapieco, Italian National Institute of Statistics (Istat)
Joint work with: Raffaella M. Aracri, Andrea Pagano, Laura Tosco, Luca Valentino
Monica Scannapieco, 3rd SISAI Meeting, May 2013, Luxembourg

3 Open Data & Linked Open Data
Open: format that allows «usage, reuse and redistribution»
Linked Open: standard models and formats intended for data integration on the Web
Current situation:
  Overcoming the pattern «first open, then linked»
  Data expressed directly as Linked Open Data (LOD)
Example 1: US data.gov
  Open data portal of the public administration (PA); migration to LOD in progress
Example 2: DBpedia & Wikidata
  DBpedia: data extraction from Wikipedia infoboxes
  Wikidata: structured database, aiming to feed Wikipedia infoboxes

4 Linked Open Data and Semantic Interoperability
Semantic Web Stack (layers shown): Knowledge; Semantic; Format and syntax
Linked Open Data
  Data represented by means of RDF (Resource Description Framework) languages
  Interconnected => Semantic Interoperability

5 Background: RDF, RDFS, OWL
Resource Description Framework (RDF)
  Language for representing information about resources in the World Wide Web
  W3C Recommendation, 10 February 2004
RDF Schema (RDFS)
  Language for RDF vocabulary sharing
  RDFS became a W3C Recommendation on 10 February 2004 (as part of a wider revision of RDF)
OWL
  Language for publishing and sharing ontologies on the World Wide Web
  Developed as a vocabulary extension of RDF and derived from the DAML+OIL Web Ontology Language
  OWL: W3C Recommendation, 10 February 2004
  OWL 2: W3C Recommendation, 11 December 2012
  OWL 2 adds new functionality with respect to OWL 1. Some of the new features are syntactic sugar (e.g., disjoint union of classes), while others offer new expressivity, including: keys; property chains; richer datatypes and data ranges; qualified cardinality restrictions; asymmetric, reflexive, and disjoint properties; and enhanced annotation capabilities

6 Background: SPARQL and RDF Data Cube Vocabulary
SPARQL (SPARQL Protocol and RDF Query Language)
  A language with a syntax similar to SQL for querying RDF data, together with a communication protocol based on HTTP
  A SPARQL client sends queries over an RDF graph to a SPARQL endpoint
  SPARQL allows “graph pattern matching” on RDF data
  W3C Recommendation, 15 January 2008
RDF Data Cube (RDF QB)
  W3C Working Draft of 12 March 2013
  Built on the SDMX information model
  Focused only on the publication on the Web of multi-dimensional data
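As a rough illustration of the protocol side of SPARQL, the Python sketch below builds the HTTP GET request URL that a client could send to an endpoint, and then decodes it again as the endpoint would. The endpoint address and the query text are assumptions made up for this example and do not come from the presentation; the `query` request parameter and the `qb:` namespace are the ones defined by the W3C specifications. No network call is made.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint address, used only for illustration.
ENDPOINT = "http://example.org/sparql"

# Illustrative query: list a few observations and the data set they belong to.
QUERY = """
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT ?obs ?dataset
WHERE { ?obs a qb:Observation ; qb:dataSet ?dataset . }
LIMIT 10
"""


def build_query_url(endpoint: str, query: str) -> str:
    """Encode the query text into the `query` parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


def extract_query(url: str) -> str:
    """Decode the query text back out of the URL, as an endpoint would."""
    return parse_qs(urlsplit(url).query)["query"][0]


url = build_query_url(ENDPOINT, QUERY)
print(url)
```

A real client would then fetch this URL (for example with `urllib.request.urlopen`) and parse the result set returned by the endpoint.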

7 RDF Basics
Resource Description Framework (RDF)
  Allows data and metadata to be represented through assertions, called triples
  A triple: <subject> <property> <object>
  A resource is uniquely identified by a URI
  A property puts two resources in relationship
  A property can also relate a resource to a «literal», i.e., a purely symbolic expression such as a number or a string
  The set of triples forms the RDF graph
Example: < < <
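The triple model can be sketched in a few lines of Python. This is only an illustration of the idea, not the tooling used in the presentation: the `http://example.org/` URIs, the property names, and the `match` helper are all hypothetical. Each triple is a (subject, property, object) record, the graph is the collection of triples, and a simple pattern match leaves unspecified positions as wildcards.

```python
from typing import List, NamedTuple, Optional

# Hypothetical namespace; the URIs below are made up for illustration.
EX = "http://example.org/"


class Triple(NamedTuple):
    subject: str  # URI of the resource being described
    prop: str     # URI of the property
    obj: str      # URI of another resource, or a literal value


# A tiny RDF-like graph: two resource-to-resource links and one literal.
GRAPH: List[Triple] = [
    Triple(EX + "istat", EX + "locatedIn", EX + "rome"),
    Triple(EX + "rome", EX + "locatedIn", EX + "italy"),
    Triple(EX + "istat", EX + "name", "Italian National Institute of Statistics"),
]


def match(graph: List[Triple],
          subject: Optional[str] = None,
          prop: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples that agree with every argument that is not None."""
    return [
        t for t in graph
        if (subject is None or t.subject == subject)
        and (prop is None or t.prop == prop)
        and (obj is None or t.obj == obj)
    ]


about_istat = match(GRAPH, subject=EX + "istat")
located = match(GRAPH, prop=EX + "locatedIn")
print(len(about_istat), len(located))
```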

8 Problem
How can we "integrate" the Istat data with data on the Web already interconnected with each other?
Diagram labels (Istat Information System): Istat data on the Web; Reference Metadata; Structural Metadata; Metadata Management System; Building Blocks (BB1, BB2, ..., BBn); Enhanced SDMX; SDMX Web Service Provider; SEP; Web GUI; Excel plug-in; Tablets/smartphones

9 Solution
Translator
SDMX
<structure:Concept id="REF_AREA">
  <structure:Name xml:lang="en">Geographical reference area</structure:Name>
</structure:Concept>

10 Transformation from SDMX to RDF-QB
Diagram labels: Code list; Concept; dimension, attribute, measure; ?; DSD; SDMX; RDF QB; SDMX data file; RDF data set; transformation

11 Analysis of the Technological Environments
R environment
  Package RSDMX: not complete and not actively maintained
Java environment
  Apache Jena: framework for reading, processing and writing RDF data, including the processing of SPARQL queries
  SDMX input not covered
MIMAS Project ( )
  Transformation of data using XSLT

12 Technological Choice and Design
Usage of XSLT transformations
Execution engine of the transformation: Saxon (Home Edition)
  Supports XSLT 2.0, XQuery 1.0, XPath 2.0
  Available in both Java and .NET
(Principal) differences with MIMAS
  Level of generalization: our translator is generalized, while MIMAS provides ad hoc transformations for individual datasets
  Generation of separate files for SDMX data, SDMX DSDs, and code lists
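To make the design concrete, the sketch below assembles one Saxon command line per output file (data, DSD, code lists). It is a guess at how such a run could be scripted, not the authors' actual setup: the jar name and every file name are hypothetical. The `-s:`, `-xsl:` and `-o:` options are the standard Saxon command-line switches for source, stylesheet and output.

```python
from typing import List, Tuple

# Hypothetical jar location; adjust to the local Saxon HE installation.
SAXON_JAR = "saxon9he.jar"

# Hypothetical (source, stylesheet, output) triples, one per generated file.
JOBS: List[Tuple[str, str, str]] = [
    ("dataset.xml", "sdmx_data_to_qb.xsl", "dataset.rdf"),
    ("dsd.xml", "sdmx_dsd_to_qb.xsl", "dsd.rdf"),
    ("dsd.xml", "sdmx_codelists_to_qb.xsl", "codelists.rdf"),
]


def saxon_command(jar: str, source: str, stylesheet: str, output: str) -> List[str]:
    """Build the argument list for one XSLT run with Saxon."""
    return ["java", "-jar", jar, f"-s:{source}", f"-xsl:{stylesheet}", f"-o:{output}"]


commands = [saxon_command(SAXON_JAR, *job) for job in JOBS]
for command in commands:
    print(" ".join(command))
```

Each argument list can be handed to `subprocess.run(command, check=True)` once Java, the jar, and the stylesheets are actually in place.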

13 Example of Mapping
SDMX DSD
<structure:Concept id="REF_AREA">
  <structure:Name xml:lang="en">Geographical reference area</structure:Name>
</structure:Concept>
RDF QB
<rdf:Description rdf:nodeID="REF_AREA">
  <qb:dimension rdf:resource="
  <dc:language>en</dc:language>
  <rdf:type rdf:resource="
  <rdf:type rdf:resource="
  <sdmx:codeList rdf:resource="
  <rdfs:range rdf:resource="
  <rdfs:label xml:lang="en">Geographical reference area</rdfs:label>
</rdf:Description>
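The mapping above can be approximated with Python's standard library. This is a simplified sketch rather than the authors' XSLT translator: it carries over only the concept identifier and its label, and it leaves out every property whose target URI is not legible in the slide. The SDMX 2.0 structure namespace URI is an assumption, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed SDMX 2.0 structure namespace; the slide does not show it.
STRUCTURE_NS = "http://www.SDMX.org/resources/SDMXML/schemas/v2_0/structure"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"
XML_NS = "http://www.w3.org/XML/1998/namespace"

SDMX_CONCEPT = f"""
<structure:Concept xmlns:structure="{STRUCTURE_NS}" id="REF_AREA">
  <structure:Name xml:lang="en">Geographical reference area</structure:Name>
</structure:Concept>
"""


def concept_to_description(concept_xml: str) -> ET.Element:
    """Map an SDMX Concept to an rdf:Description carrying its labels."""
    concept = ET.fromstring(concept_xml.strip())
    description = ET.Element(f"{{{RDF_NS}}}Description")
    description.set(f"{{{RDF_NS}}}nodeID", concept.get("id"))
    # One rdfs:label per structure:Name, preserving the language tag.
    for name in concept.findall(f"{{{STRUCTURE_NS}}}Name"):
        label = ET.SubElement(description, f"{{{RDFS_NS}}}label")
        lang = name.get(f"{{{XML_NS}}}lang")
        if lang:
            label.set(f"{{{XML_NS}}}lang", lang)
        label.text = name.text
    return description


# Use readable prefixes when serializing.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("rdfs", RDFS_NS)

description = concept_to_description(SDMX_CONCEPT)
print(ET.tostring(description, encoding="unicode"))
```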

14 Example of XSL File: Definition of Transformation Rules
Reading from SDMX: Concept and Codelist
<xsl:for-each
<xsl:variable name="conceptRef"
<xsl:variable name="codeList"
<xsl:variable name="codeListName"
Writing in RDF QB: Concept and Codelist
<xsl:element name="rdf:Description">
  <xsl:attribute name="rdf:nodeID" select="$conceptRef"/>
  <xsl:element name="qb:dimension">
    <xsl:attribute name="rdf:resource" select="concat($IstatRoot,'/code/',$codeListName)"/>
  </xsl:element>
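The writing step of this rule can be mirrored in Python to show what it produces. The sketch below reproduces only the legible part of the stylesheet: an `rdf:Description` identified by the concept reference, with a `qb:dimension` whose resource is the root, `/code/`, and the code list name concatenated. The value of `ISTAT_ROOT` and the code list name `CL_AREA` are hypothetical stand-ins, since the slide shows neither.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
QB_NS = "http://purl.org/linked-data/cube#"

# Hypothetical stand-in for the $IstatRoot stylesheet parameter.
ISTAT_ROOT = "http://example.org/istat"


def dimension_description(concept_ref: str, code_list_name: str,
                          root: str = ISTAT_ROOT) -> ET.Element:
    """Build the rdf:Description that the XSL rule writes for one concept."""
    description = ET.Element(
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}nodeID": concept_ref},
    )
    dimension = ET.SubElement(description, f"{{{QB_NS}}}dimension")
    # Same string as concat($IstatRoot, '/code/', $codeListName) in the XSL.
    dimension.set(f"{{{RDF_NS}}}resource", f"{root}/code/{code_list_name}")
    return description


ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("qb", QB_NS)

# "CL_AREA" is a made-up code list name used only for this example.
example = dimension_description("REF_AREA", "CL_AREA")
print(ET.tostring(example, encoding="unicode"))
```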

15 Syntactic and Semantic Validation
Syntactic validation: the RDF files are validated syntactically by creating the RDF triples and their graph representation
  Used the free software validator ( )
  Also verified compliance with Turtle (the Eurostat format for DSDs), converting from RDF/XML to Turtle format (.ttl)
  Software: Any23 ( )
Semantic validation: the data model is a «valid» RDF QB model
  Used OpenLink Virtuoso ( )

16 Semantic Validation
SELECT obs_id, ref_area, obs_value, time_period, territoryLabel
WHERE typeofWaste=9 AND Time=

17 Test Case and Performance
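Data          Input Size (KB)   Execution Time
Base          1,338             6 min 7.9 s
Base          2,674             12 min 32.86 s
Base          4,009             18 min 42.248 s
Base          5,345             23 min 49.127 s
Optimized_1   1,338             2 min 17.71 s
Optimized_1   2,674             4 min 36.933 s
Optimized_1   4,009             7 min 28.945 s
Optimized_1   5,345             9 min 12.627 s
Optimized_2   1,338             5.6 s
Optimized_2   2,674             7.87 s
Optimized_2   4,009             11.003 s
Optimized_2   5,345             13.968 s
Optimized_1: minimize DSD access
Optimized_2: in-memory access to the DSD representation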
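The effect of the two optimizations is easier to read as speedup factors. The short Python calculation below uses only the timings reported in the table, entered as (minutes, seconds) pairs; it assumes that the Optimized_1 and Optimized_2 rows refer to the same four input sizes as the Base rows, listed in the same order. The helper names are hypothetical.

```python
from typing import List, Tuple

Timing = Tuple[int, float]  # (minutes, seconds)

# Timings from the table, in order of increasing input size.
BASE: List[Timing] = [(6, 7.9), (12, 32.86), (18, 42.248), (23, 49.127)]
OPTIMIZED_1: List[Timing] = [(2, 17.71), (4, 36.933), (7, 28.945), (9, 12.627)]
OPTIMIZED_2: List[Timing] = [(0, 5.6), (0, 7.87), (0, 11.003), (0, 13.968)]


def to_seconds(timing: Timing) -> float:
    """Convert a (minutes, seconds) pair to seconds."""
    minutes, seconds = timing
    return 60 * minutes + seconds


def speedups(base: List[Timing], optimized: List[Timing]) -> List[float]:
    """Ratio of base time to optimized time for each input size."""
    return [to_seconds(b) / to_seconds(o) for b, o in zip(base, optimized)]


speedup_1 = speedups(BASE, OPTIMIZED_1)
speedup_2 = speedups(BASE, OPTIMIZED_2)
for s1, s2 in zip(speedup_1, speedup_2):
    print(f"Optimized_1: {s1:.1f}x   Optimized_2: {s2:.1f}x")
```

On these figures, minimizing DSD access gives a speedup of roughly 2.5 to 2.7 times, while keeping the DSD representation in memory gives a speedup of roughly 66 to 102 times, growing with input size.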

18 Test Case and Performance
Chart: execution time in milliseconds (log scale) versus input size in KB

19 Conclusions and Future Developments
Prototype development
  SDMX data format: Compact
Possible extensions to other SDMX formats (e.g., Generic, Cross-Sectional)
Extension of the transformation rules of the constructs
Integration with the Istat Single Exit Point (SEP)

20 Discussion: RDF Publication Architectures
Options placed along two axes: Complexity Query (Low to High) and Data Model (Low to High)
Dedicated SPARQL EndPoint with triple
  (+) Linking with other sources
  (-) Redundancy wrt SDMX
  (-) Deployment
RDF in flat file
  (+) Easy to be published
  (-) Not queryable
RDF via SDMX
  (+) Recovery of SDMX investment (SEP)
  (-) Cross-format translation during the query phase
  (-) Linking with other sources

