Presentation is loading. Please wait.

Presentation is loading. Please wait.

Can RDB2RDF Tools Feasible Expose Large Science Archives for Data Integration?  Alasdair J G Gray (University of Glasgow now Manchester)  Norman Gray.

Similar presentations


Presentation on theme: "Can RDB2RDF Tools Feasible Expose Large Science Archives for Data Integration?  Alasdair J G Gray (University of Glasgow now Manchester)  Norman Gray."— Presentation transcript:

1 Can RDB2RDF Tools Feasible Expose Large Science Archives for Data Integration?  Alasdair J G Gray (University of Glasgow now Manchester)  Norman Gray (Universities of Leicester and Glasgow)  Iadh Ounis (University of Glasgow) ESWC 2009 – Crete 3 June 2009

2 Outline Motivation: The Virtual Observatory Can SPARQL be used to express scientific queries? Can existing archives be exposed with semantic tools? – Can RDB2RDF tools extract large volumes of data? 3 June 2009A.J.G. Gray - ESWC 20091

3 International Virtual Observatory Alliance “facilitate the international coordination and collaboration necessary for the development and deployment of the tools, systems and organizational structures necessary to enable the international utilization of astronomical archives as an integrated and interoperating virtual observatory.” 3 June 2009A.J.G. Gray - ESWC 20092

4 Searching for Brown Dwarfs Data sets: – Near Infrared, 2MASS/UK Infrared Deep Sky Survey – Optical, APMCAT/Sloan Digital Sky Survey Complex colour/motion selection criteria Similar problems – Halo White Dwarfs 3 June 20093A.J.G. Gray - ESWC 2009

5 Deep Field Surveys Observations in multiple wavelengths – Radio to X-Ray Searching for new objects – Galaxies, stars, etc Requires correlations across many catalogues – ISO – Hubble – SCUBA – etc 3 June 20094A.J.G. Gray - ESWC 2009

6 The Problem Locate and combine relevant data Heterogeneous publishers – Archive centres – Research labs Heterogeneous data – Relational – XML – Files Virtual Observatory 3 June 20095A.J.G. Gray - ESWC 2009

7 A Data Integration Approach Heterogeneous sources – Autonomous – Local schemas Homogeneous view – Mediated global schema Mapping – LAV: local-as-view – GAV: global-as-view 3 June 2009A.J.G. Gray - ESWC 20096 Global Schema Query 1 Query n DB 1 Wrapper 1 DB k Wrapper k DB i Wrapper i Mappings Relies on agreement of a common global schema

8 P2P Data Integration Approach Heterogeneous sources – Autonomous – Local schemas Heterogeneous views – Multiple schemas Mappings – From sources to common schema – Between pairs of schema Require common integration data model Can RDF do this? 3 June 2009A.J.G. Gray - ESWC 20097 Schema 1 DB 1 Wrapper 1 DB k Wrapper k DB i Wrapper i Schema j Query 1 Query n Mappings

9 Resource Description Framework W3C standard Designed as a metadata data model Contains semantic details Ideal for linking distributed data Queried through SPARQL 3 June 2009A.J.G. Gray - ESWC 20098 #foundIn #Sun The Sun #name #MilkyWay Milky Way #name The Galaxy #name IAU:Star rdf:type IAU:BarredSpiral rdf:type

10 SPARQL Declarative query language – Select returned data Graph or tuples Attributes to return – Describe structure of desired results – Filter data W3C standard Syntactically similar to SQL – Should be easy for scientists to learn! 3 June 20099A.J.G. Gray - ESWC 2009

11 Integrating Using RDF Data resources – Expose schema and data as RDF – Need a SPARQL endpoint Allows multiple – Access models – Storage models Easy to relate data from multiple sources Relational DB RDF / Relational Conversion XML DB RDF / XML Conversion Common Model (RDF) Mappings SPARQL query 3 June 200910A.J.G. Gray - ESWC 2009 We will focus on exposing relational data sources

12 RDB2RDF Extract-Transform-Load Data replicated as RDF – Data can become stale Native SPARQL query support – Limited optimisation mechanisms Existing RDF stores Jena Seasame Query-driven Conversion Data stored as relations Native SQL query support – Highly optimised access methods SPARQL queries must be translated Existing translation systems D2RQ SquirrelRDF 3 June 200911A.J.G. Gray - ESWC 2009

13 System Hypothesis Is it viable to perform query-driven conversions to facilitate data access from a data model that a scientist is familiar with? Can RDB2RDF tools feasibly expose large science archives for data integration? Relational DB RDB2RDF XML DB RDF / XML Conversion Common Model (RDF) Mappings SPARQL query 3 June 200912A.J.G. Gray - ESWC 2009 SPARQL query

14 Astronomical Test Data Set SuperCOSMOS Science Archive (SSA) – Data extracted from scans of Schmidt plates – Stored in a relational database – About 4TB of data, detailing 6.4 billion objects – Fairly typical of astronomical data archives Schema designed using 20 real queries Personal version contains – Data for a specific region of the sky – About 0.1% of the data – About 500MB 3 June 200913A.J.G. Gray - ESWC 2009

15 Analysis of Test Data Using personal version – About 500MB in size (similar size to related work) Organised in 14 Relations – Number of attributes: 2 – 152 4 relations with more than 20 attributes – Number of rows: 3 – 585,560 – Two views Complex selection criteria in views 3 June 200914A.J.G. Gray - ESWC 2009 Makes this different from business cases and previous work!

16 Is SPARQL expressive enough? Can the 20 sample queries be expressed in SPARQL? 3 June 2009A.J.G. Gray - ESWC 200915

17 Real Science Queries Query 5: Find the positions and (B,R,I) magnitudes of all star-like objects within delta mag of 0.2 of the colours of a quasar of redshift 2.5 < z < 3.5 SQL: SELECT ra, dec, sCorMagB, sCorMagR2, sCorMagI FROM ReliableStars WHERE (sCorMagB-sCorMagR2 BETWEEN 0.05 AND 0.80) AND (sCorMagR2-sCorMagI BETWEEN -0.17 AND 0.64) SPARQL: SELECT ?ra ?decl ?sCorMagB ?sCorMagR2 ?sCorMagI WHERE { FILTER (?sCorMagB – ?sCorMagR2 >= 0.05 && ?sCorMagB - ?sCorMagR2 <= 0.80) FILTER (?sCorMagR2 – ?sCorMagI >= -0.17 && ?sCorMagR2 - ?sCorMagI <= 0.64)} 3 June 2009A.J.G. Gray - ESWC 200916

18 Analysis of Test Queries Query FeatureQuery Numbers Arithmetic in body1-5, 7, 9, 12, 13, 15-20 Arithmetic in head7-9, 12, 13 Ordering1-8, 10-17, 19, 20 Joins (including self-joins)12-17, 19 Range functions (e.g. Between, ABS)2, 3, 5, 8, 12, 13, 15, 17-20 Aggregate functions (including Group By)7-9, 18 Math functions (e.g. power, log, root)4, 9, 16 Trigonometry functions8, 12 Negated sub-query18, 20 Type casting (e.g. Radians to degrees)7, 8, 12 Server defined functions10, 11 3 June 200917A.J.G. Gray - ESWC 2009

19 Expressivity of SPARQL Features Select-project-join Arithmetic in body Conjunction and disjunction Ordering String matching External function calls (extension mechanism) Limitations Range shorthands Arithmetic in head Math functions Trigonometry functions Sub queries Aggregate functions Casting 3 June 200918A.J.G. Gray - ESWC 2009

20 Analysis of Test Queries Query FeatureQuery Numbers Arithmetic in body1-5, 7, 9, 12, 13, 15-20 Arithmetic in head7-9, 12, 13 Ordering1-8, 10-17, 19, 20 Joins (including self-joins)12-17, 19 Range functions (e.g. Between, ABS)2, 3, 5, 8, 12, 13, 15, 17-20 Aggregate functions (including Group By)7-9, 18 Math functions (e.g. power, log, root)4, 9, 16 Trigonometry functions8, 12 Negated sub-query18, 20 Type casting (e.g. radians to degrees)7, 8, 12 Server defined functions10, 11 Expressible queries: 1, 2, 3, 5, 6, 14, 15, 17, 19 3 June 200919A.J.G. Gray - ESWC 2009

21 Can RDB2RDF tools feasibly expose large science archives for data integration? 3 June 2009A.J.G. Gray - ESWC 200920

22 Experiment Time query evaluation – 5 out of 20 queries used – No joins Systems compared: – Relational DB (Base line) MySQL v5.1.25 – RDB2RDF tools D2RQ v0.5.2 SquirrelRDF v0.1 – RDF Triple stores Jena v2.5.6 (SDB) Sesame v2.1.3 (Native) 3 June 200921A.J.G. Gray - ESWC 2009 Relational DB RDB2RDF SPARQL query Triple store SPARQL query Relational DB SQL query

23 Experimental Configuration 8 identical machines – 64 bit Intel Quad Core Xeon 2.4GHz – 4GB RAM – 100 GB Hard drive – Java 1.6 – Linux 10 repetitions 3 June 2009A.J.G. Gray - ESWC 200922

24 Performance Results 3 June 200923A.J.G. Gray - ESWC 2009 ms 3,4505,339 21,492 485,932 2,7337,2294,0901,307 17,793 7,468 19,984 372,561 1

25 Conclusions SPARQL not expressive enough for real astronomical queries RDBMS benefits from 30+ years research – Query optimisation – Indexes RDF stores are improving – Require existing data to be replicated RDB2RDF tools show promise – Need to exploit relational database 3 June 2009A.J.G. Gray - ESWC 200924

26 Can RDB2RDF Tools Feasible Expose Large Science Archives for Data Integration? Not currently! We need more work on query translation… 3 June 2009A.J.G. Gray - ESWC 200925


Download ppt "Can RDB2RDF Tools Feasible Expose Large Science Archives for Data Integration?  Alasdair J G Gray (University of Glasgow now Manchester)  Norman Gray."

Similar presentations


Ads by Google