
1 Towards a Top-K SPARQL Query Benchmark Generator
Shima Zahmatkesh 1, Emanuele Della Valle 1, Daniele Dell’Aglio 1, and Alessandro Bozzon 2
1 Politecnico di Milano, 2 TU Delft

2 Agenda
– Rankings, rankings everywhere
– What are top-k SPARQL queries
– Jim Gray's Benchmarking Principles
– The problem
– Some Definitions
– Research Hypothesis
– Background work: DBpedia SPARQL Benchmark
– Our proposal: Top-k DBPSB
– Preliminary Evaluation
– Conclusions

3 Rankings, rankings everywhere

6 A very intuitive and simplified example: the top 3 largest countries (by both area and population). Why do we need to optimize such queries?

7 The standard way: the materialize-then-sort scheme
– Compute the scoring function that accounts for area and population for all the countries
– Sort all the 242 countries
– Fetch the 3 best results
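To make the scheme concrete, here is a minimal Python sketch of materialize-then-sort on the country example; the record layout (dicts with pre-normalized fields) and the equal weights are illustrative assumptions, not taken from the slides.

def score(country, w_area=0.5, w_pop=0.5):
    # Monotone weighted sum of the two normalized criteria.
    return w_area * country["norm_area"] + w_pop * country["norm_population"]

def top_k_materialize_then_sort(countries, k=3):
    scored = [(score(c), c["name"]) for c in countries]  # 1. score every country (all 242 of them)
    scored.sort(reverse=True)                            # 2. sort the complete result set
    return scored[:k]                                    # 3. keep only the k best; the rest of the work is wasted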

8 Innovative optimization: the split-and-interleave scheme
– Sorted access to the countries, ordered by population
– Incrementally order the partial results by area
– Fetch the 3 best results
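The following hedged sketch, written in the spirit of threshold-style rank-aware algorithms, shows what the split-and-interleave idea buys: with sorted access on one criterion and a monotone scoring function, the scan can stop as soon as no unseen country can still beat the current top k. It reuses the same assumed record layout and weights as the previous snippet.

import heapq

def top_k_split_and_interleave(countries_by_population, k=3, w_area=0.5, w_pop=0.5):
    # countries_by_population: sorted access, i.e. countries ordered by
    # normalized population in descending order.
    best = []  # min-heap holding the best k (score, name) pairs seen so far
    for c in countries_by_population:
        s = w_area * c["norm_area"] + w_pop * c["norm_population"]
        if len(best) < k:
            heapq.heappush(best, (s, c["name"]))
        elif s > best[0][0]:
            heapq.heapreplace(best, (s, c["name"]))
        # Upper bound on the score of any country not read yet: its population
        # is at most the current one, its normalized area is at most 1.
        threshold = w_area * 1.0 + w_pop * c["norm_population"]
        if len(best) == k and best[0][0] >= threshold:
            break  # no remaining country can enter the top k
    return sorted(best, reverse=True)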

9 State-of-the-art database method
– Split the evaluation of the scoring function into single criteria
– Interleave them with the other operators
– Use partial orders to construct the final order incrementally
Standard assumptions:
– Monotone scoring function
– Each criterion is evaluated as a number in [0,1] (normalization)
Optimized for the case of fast sorted access to each criterion

10 Top-k SPARQL queries
E.g., the 10 most recent books written by the youngest authors:

SELECT ?book ?author (0.5*norm(?releaseDate) + 0.5*norm(?dateOfBirth) AS ?s)
WHERE {
  ?book dbp:isbn ?v .
  ?book dbp:author ?author .
  ?book dbp:releaseDate ?releaseDate .
  ?author dbp:dateOfBirth ?dateOfBirth .
}
ORDER BY DESC(?s)
LIMIT 10

The scoring function is given as a SELECT expression; ORDER BY and LIMIT order and slice the result. Normalization casts each value into [0..1]: norm(x) = (x - min_x) / (max_x - min_x).
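For reference, the normalization can be written out directly; this tiny Python helper (an illustration, not part of any engine) applies the formula above, with the per-variable min and max taken from the dataset (they are computed in step 2 of the generator presented later).

def norm(x, x_min, x_max):
    # Maps the value of a scoring variable into [0, 1].
    return (x - x_min) / (x_max - x_min)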

11 The Problem
Set up a benchmark for top-k SPARQL queries that
– resembles reality
– stresses the features of top-k queries
  – Syntax: SELECT expression + ORDER BY + LIMIT
  – Performance: hits the SPARQL engine where it hurts

12 Jim Gray on Benchmarking Principles
– Relevant: measures the performance and price/performance of systems when performing typical operations within the problem domain
– Portable: easy to implement on many different systems
– Scalable: applies to small and large computer systems
– Simple: understandable

13 Definitions, illustrated on the example (the 10 most recent books written by the youngest authors)
– Rankable variables: ?book, ?author
– Rankable data properties: releaseDate, dateOfBirth
– Scoring variables: ?releaseDate, ?birthDate
– Rankable triple patterns: the triple patterns that bind the scoring variables through the rankable data properties
– Scoring function: 0.5*norm(?releaseDate) + 0.5*norm(?birthDate)

14 Research Hypothesis
H.0: top-k SPARQL queries that resemble reality can be obtained by extending the DBpedia SPARQL Benchmark
– H.1: ++ rankable variables → ++ execution time
– H.2: ++ scoring variables → ++ execution time
– H.3: +/- LIMIT → = execution time

15 DBpedia SPARQL Benchmark (DBPSB)
– A method to generate a SPARQL benchmark from DBpedia and its query logs
– It can be applied to other datasets and other query logs
– Characteristics: it resembles reality and stresses SPARQL features
– Workflow (from the slide's diagram): query analysis and clustering of the query logs yields query templates, which are instantiated into query instances by means of auxiliary queries over the generated datasets

16 Proposed Solution: Top-k DBPSB
– An extension of DBPSB: auxiliary queries with top-k clauses, using the DBPSB datasets as a source of meaningful rankable variables
– It is also a method: it can be applied to other benchmarks obtained with the DBPSB method
– Workflow: find the rankable variables (via auxiliary queries), compute the max and min values, generate the scoring function, and generate the top-k queries

17 A DBPSB Auxiliary Query

SELECT DISTINCT ?v
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
}

18 Top-k DBPSB step 1a
To generate queries with 1 rankable variable:

SELECT ?p (COUNT(?p) AS ?n)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 ?p ?o .
  FILTER(isNumeric(?o) || datatype(?o) = xsd:dateTime)
}
GROUP BY ?p
ORDER BY DESC(?n)
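As an illustration of how this auxiliary query could be run in practice, here is a hedged Python sketch using the SPARQLWrapper library against the public DBpedia endpoint; the endpoint URL, the prefixes, and the result handling are assumptions for the example, not part of the original Top-k DBPSB tooling.

from SPARQLWrapper import SPARQLWrapper, JSON

AUX_QUERY_1A = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?p (COUNT(?p) AS ?n)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 ?p ?o .
  FILTER(isNumeric(?o) || datatype(?o) = xsd:dateTime)
}
GROUP BY ?p
ORDER BY DESC(?n)
"""

def candidate_rankable_properties(endpoint="https://dbpedia.org/sparql"):
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(AUX_QUERY_1A)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    # Each candidate property comes back with its frequency; as step 1b notes,
    # the final choice of properties that resemble reality is manual.
    return [(b["p"]["value"], int(b["n"]["value"])) for b in bindings]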

19 Top-k DBPSB step 1b
Results – not all sortable properties resemble reality:
Pages, ISBN, NumberOfPages, Year, Volume, wikiPageID, releaseDate, …
NOTE: this step requires manual selection

20 Top-k DBPSB step 1c
To generate queries with 2 rankable variables:

SELECT ?p ?p1 (COUNT(?p1) AS ?n)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 ?p ?o .
  ?o ?p1 ?o1 .
  FILTER(isNumeric(?o1) || datatype(?o1) = xsd:dateTime)
}
GROUP BY ?p ?p1
ORDER BY DESC(?n)

NOTE: in practice we loop through all the properties of ?v6 whose object is an IRI, in decreasing order of frequency

21 Top-k DBPSB step 1d
Results:
– author, wikiPageID
– author, wikiPageRevisionID
– …
– author, dateOfBirth
– …
– publisher, wikiPageID
– publisher, wikiPageRevisionID
– …
– publisher, founded
– …
NOTE: this step requires manual selection

22 Top-k DBPSB step 2

SELECT (MAX(?o) AS ?max) (MIN(?o) AS ?min)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 dbp:pages ?o .
  FILTER(isNumeric(?o) || datatype(?o) = xsd:dateTime)
}

NOTE: the FILTER clause should not be necessary, but DBpedia is very dirty …

23 Top-k DBPSB step 3
– Choose the number of ranking variables (at most three), e.g., books and authors
– Choose the number of scoring variables per ranking variable (at most three), e.g., releaseDate for books and dateOfBirth for authors
– Look up the min and the max of each scoring variable to normalise it
– Choose the weights; the weights must sum to 1
– Assemble the scoring function, e.g., 0.5*norm(?releaseDate) + 0.5*norm(?dateOfBirth)
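A minimal sketch (an assumption about how such a generator could be written, not the authors' actual implementation) of assembling the scoring function as a SPARQL SELECT expression, given the chosen scoring variables, their weights, and the min/max values from step 2. For simplicity it treats every variable as numeric; dateTime-valued variables would need an extra cast before the arithmetic.

def scoring_expression(weights, bounds):
    # weights: {"?releaseDate": 0.5, "?dateOfBirth": 0.5}  (they must sum to 1)
    # bounds:  {"?releaseDate": (min_value, max_value), ...} from step 2
    terms = []
    for var, w in weights.items():
        lo, hi = bounds[var]
        # Inline norm(x) = (x - min) / (max - min) as a SPARQL expression.
        terms.append(f"{w} * (({var} - {lo}) / ({hi} - {lo}))")
    return "(" + " + ".join(terms) + " AS ?s)"

For example, scoring_expression({"?o1": 0.5, "?o2": 0.5}, {"?o1": (1900, 2015), "?o2": (1850, 2000)}) yields "(0.5 * ((?o1 - 1900) / (2015 - 1900)) + 0.5 * ((?o2 - 1850) / (2000 - 1850)) AS ?s)".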

24 Top-k DBPSB step 4

SELECT ?v6 ?v3 (0.5*norm(?o1) + 0.5*norm(?o2) AS ?s)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 dbp:releaseDate ?o1 .
  ?v3 dbp:dateOfBirth ?o2 .
  FILTER(isNumeric(?o1) || datatype(?o1) = xsd:dateTime)
  FILTER(isNumeric(?o2) || datatype(?o2) = xsd:dateTime)
}
ORDER BY DESC(?s)
LIMIT 10
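Under the same assumptions as the step-3 snippet, step 4 can then wrap the DBPSB query template, the generated scoring expression, and the ORDER BY/LIMIT clauses into the final top-k query; the template body and helper name below are illustrative.

TEMPLATE_BODY = """
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 dbp:releaseDate ?o1 .
  ?v3 dbp:dateOfBirth ?o2 .
  FILTER(isNumeric(?o1) || datatype(?o1) = xsd:dateTime)
  FILTER(isNumeric(?o2) || datatype(?o2) = xsd:dateTime)
"""

def build_topk_query(projection, scoring_expr, body=TEMPLATE_BODY, k=10):
    # projection: the result variables, e.g. "?v6 ?v3";
    # scoring_expr: the SELECT expression produced by scoring_expression().
    return (f"SELECT {projection} {scoring_expr}\n"
            f"WHERE {{{body}}}\n"
            f"ORDER BY DESC(?s)\n"
            f"LIMIT {k}")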

25 Preliminary Results 1/2
We tested our hypotheses using
– Virtuoso Open-Source Edition version 6.1.6
– Jena TDB version 2.10.1
– DBpedia 10%
In this setting, Top-k DBPSB generates queries that are
– adequate to test H.2 (++ scoring variables → ++ execution time) and H.3 (+/- LIMIT → = execution time)
– only partially adequate to test H.1 (++ rankable variables → ++ execution time)
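For completeness, a hedged sketch of how the execution times behind H.1–H.3 could be measured: it reuses SPARQLWrapper against a local endpoint and averages a few repetitions. The endpoint URL, the number of runs, and the overall setup are assumptions, not the benchmarking harness actually used in the evaluation.

import time
from statistics import mean
from SPARQLWrapper import SPARQLWrapper, JSON

def average_execution_time(query, endpoint="http://localhost:8890/sparql", runs=5):
    sparql = SPARQLWrapper(endpoint)
    sparql.setReturnFormat(JSON)
    timings = []
    for _ in range(runs):
        sparql.setQuery(query)
        start = time.perf_counter()
        sparql.query().convert()  # execute the query and fetch the full result
        timings.append(time.perf_counter() - start)
    return mean(timings)

Comparing queries that differ only in the number of scoring variables (H.2) or only in the LIMIT value (H.3) then amounts to comparing these averages.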

26 Preliminary Results 2/2
H.1 (++ rankable variables → ++ execution time)
– confirmed in some cases
– not confirmed when aggregating by query across engines
– confirmed when aggregating by engine across queries
H.2 (++ scoring variables → ++ execution time)
– confirmed for Jena TDB
– confirmed in most of the cases for Virtuoso
H.3 (+/- LIMIT → = execution time)
– confirmed for Jena TDB
– confirmed in most of the cases for Virtuoso

27 Conclusions
Top-k DBPSB is a successful first attempt to automatically generate top-k SPARQL queries that
– resemble reality
– hit SPARQL engines where it hurts
More investigation is required to
– better understand the relationship between the number of rankable variables and the execution time, e.g., cardinalities, selectivity, and joins
– include other known features of top-k queries that impact execution time, e.g., the correlation among the orders induced on the result set by the different scoring variables of the scoring function, and the distribution of the values matched by the scoring variables

28 Thank you! Any questions?
Shima Zahmatkesh 1, Emanuele Della Valle 1, Daniele Dell’Aglio 1, and Alessandro Bozzon 2
1 Politecnico di Milano, 2 TU Delft

29 Preliminary Results - details

