ESWC 2009, June 2009 Ranking Approximate Answers to Semantic Web Queries Carlos Hurtado 1, Alex Poulovassilis 2, Peter Wood 2 1 University Adolfo Ibanez,


ESWC 2009, June 2009
Ranking Approximate Answers to Semantic Web Queries
Carlos Hurtado (1), Alex Poulovassilis (2), Peter Wood (2)
(1) University Adolfo Ibanez, Chile; (2) Birkbeck, University of London

Outline of the talk
1. Motivation
2. Overview of our approach
3. Single-conjunct queries – exact semantics
4. Approximate semantics
5. Multi-conjunct queries
6. Conclusions and future work

1. Motivation
- Growing volumes of semi-structured data are available on the web.
- In particular, the amount of RDF data is increasing, e.g. in the form of linked data.
- The volume and heterogeneity of such data call for approximate answering techniques to support users' querying:
  o users' queries do not have to match exactly the data structures being queried
  o answers to queries are returned in ranked order, by increasing distance from the original query

2. Overview of our approach
- We consider general semi-structured data, modelled as a graph structure; e.g. RDF linked data is one kind of data that can be represented this way.
- Our model is a directed graph $G = (V, E)$ where
  o each node in $V$ is labelled with a constant (so blank nodes cannot be represented)
  o each edge $e$ in $E$ is labelled with a label $l(e)$ from a finite alphabet $\Sigma$
- Our query language is that of conjunctive regular path queries:
  $(Z_1, \ldots, Z_m) \leftarrow (X_1, R_1, Y_1), \ldots, (X_n, R_n, Y_n)$
  where the $X_i$, $Y_i$ are variables or constants, the $R_i$ are regular expressions over $\Sigma$, and the $Z_i$ are drawn from the $X_i$ and $Y_i$.

Example 1 – RDF graph of a transport network

Find cities from which we can travel to city u5 using only airplanes, as well as to city u6 using only trains or buses:
?X ← (?X, (airplane)+, u5), (?X, (train|bus)+, u6)

Answer:
- The first conjunct generates bindings u1, u4 for ?X.
- The second conjunct generates bindings u1, u2, u4 for ?X.
- Hence the answer is u1, u4.

Approximate answers
- We are interested in using weighted regular transducers to capture query approximations, since results by Grahne and Thomo (2001) show that single-conjunct queries with a weighted regular transducer applied can be evaluated incrementally in polynomial time.
- Incremental evaluation allows answers to be returned to the user in ranked order.
- In this paper, we extend this approach to include symbol inversion as well, and we show that multi-conjunct queries can also be evaluated in polynomial time, using an algorithm from Ilyas, Aref and Elmagarmid (2004) for computing top-k join queries.

Weighted regular transducers
- A weighted regular transducer is a finite state automaton in which the transitions are labelled with triples rather than single symbols.
- A transition from state s to state t labelled (a, i, b) means that if the transducer is in state s then it can move to state t on input a with cost i while outputting b.
- In our context, such a transition is interpreted as stating that symbol a in a query can match label b of an edge in the graph with cost i.
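To make the (a, i, b) reading concrete, here is a minimal Python sketch that searches for the cheapest run of a weighted transducer that reads a query word while writing an edge-label word. The function name `transduction_cost`, the tuple layout and the one-state transducer `T` are illustrative assumptions, not taken from the paper; `name-` stands for the inverse of label `name`, and costs are assumed to be non-negative.

```python
import heapq

EPS = ""  # stands for the empty string (no symbol read or written)

def transduction_cost(transitions, start, finals, query, labels):
    """Cheapest run that reads `query` while writing `labels`.

    transitions: list of (state, a, cost, b, next_state); a or b may be EPS.
    Returns the minimum total cost, or None if no run exists.
    """
    heap = [(0, start, 0, 0)]  # (cost so far, state, symbols read, symbols written)
    settled = set()
    while heap:
        cost, s, i, j = heapq.heappop(heap)
        if (s, i, j) in settled:
            continue
        settled.add((s, i, j))
        if s in finals and i == len(query) and j == len(labels):
            return cost
        for src, a, w, b, dst in transitions:
            if src != s:
                continue
            if a != EPS and (i == len(query) or query[i] != a):
                continue  # transition needs an input symbol that is not next
            if b != EPS and (j == len(labels) or labels[j] != b):
                continue  # transition would write a label that is not next
            heapq.heappush(heap, (cost + w, dst, i + (a != EPS), j + (b != EPS)))
    return None

# Hypothetical transducer: airplane matches itself for free, while the labels
# name and name- may each be inserted at cost 1.
T = [
    ("q", "airplane", 0, "airplane", "q"),
    ("q", EPS, 1, "name", "q"),
    ("q", EPS, 1, "name-", "q"),
]
cost = transduction_cost(T, "q", {"q"}, ["airplane"], ["name-", "airplane", "name"])
```

Under these assumptions the query symbol sequence `airplane` matches the label sequence `name-`, `airplane`, `name` at cost 2, one unit per inserted label.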

Approximate regular expression matching
- In the paper, for simplicity, we mainly focus on approximate regular expression matching, which can be specified using weighted regular transducers (Grahne and Thomo 2001).
- The edit operations we allow are:
  o insertions, deletions and substitutions of symbols
  o inversion of symbols (i.e. edge reversal)
  o transposition of adjacent symbols
- We envisage the user being able to specify which edit operations should be undertaken by the system when answering a particular query, or in a particular application.
- The user could also specify the cost associated with applying each edit operation (in the paper we assume a cost of 1 for all of them).

Example 2 – transport network data

Find cities reachable from Santiago by non-stop flights, posed by a user who has little knowledge of the structure of the data:
?X ← (Santiago, airplane, ?X)

- The query as posed returns no answers: ?X ← (Santiago, airplane, ?X)
- However, the query can be relaxed, by an insertion of name, to: ?X ← (Santiago, airplane.name, ?X)
- It can be further relaxed, by an insertion of name⁻, to: ?X ← (Santiago, name⁻.airplane.name, ?X)
- This generates the bindings Temuco and Chillan for ?X.
- These answers can be regarded as having distance 2 from the original query: two insertions into the original query, each at an assumed cost of 1.

3. Single-conjunct queries
- A single-conjunct query, $Q$, is of the form $(Z_1, Z_2) \leftarrow (X, R, Y)$.
- A semipath $p$ in graph $G$ is a sequence of the form $v_1, l_1, v_2, l_2, \ldots, v_n, l_n, v_{n+1}$ where, for each pair $v_i, v_{i+1}$, there is an edge $v_i \to v_{i+1}$ labelled $l_i$ or an edge $v_{i+1} \to v_i$ labelled $l_i^-$ in $G$.
- Semipath $p$ conforms to regular expression $R$ if $l_1 \ldots l_n$ is in the language denoted by $R$.

Exact Semantics
- Given a single-conjunct query $Q$: $(Z_1, Z_2) \leftarrow (X, R, Y)$.
- Let $\theta$ be a matching from $\{X, Y\}$ to the nodes of graph $G$ that maps each constant to itself.
- The exact answer of $Q$ on $G$ is the set of tuples $\theta(Z_1, Z_2)$ such that there is a semipath from $\theta(X)$ to $\theta(Y)$ which conforms to $R$.
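The exact semantics can be sketched in Python as a breadth-first search over pairs (NFA state, graph node). The transport-network figure is not part of this transcript, so the edge list below is an invented toy graph, chosen only so that it reproduces the bindings stated for Example 1; the hand-built NFAs and the helper names `exact_answer` and `sources_reaching` are likewise assumptions.

```python
from collections import deque

def exact_answer(edges, delta, initial, finals, start):
    """Nodes m such that some semipath from `start` to m conforms to the NFA.

    edges: list of (u, label, v); delta: dict (state, symbol) -> set of states.
    A reversed edge is traversed with the label "label-".
    """
    adj = {}
    for u, label, v in edges:
        adj.setdefault(u, []).append((label, v))
        adj.setdefault(v, []).append((label + "-", u))
    seen = {(initial, start)}
    queue = deque(seen)
    answers = set()
    while queue:
        s, n = queue.popleft()
        if s in finals:
            answers.add(n)
        for label, m in adj.get(n, []):
            for t in delta.get((s, label), ()):
                if (t, m) not in seen:
                    seen.add((t, m))
                    queue.append((t, m))
    return answers

def sources_reaching(edges, delta, initial, finals, target):
    """Bindings for ?X in (?X, R, target), one search per candidate node."""
    nodes = {u for u, _, _ in edges} | {v for _, _, v in edges}
    return {n for n in nodes if target in exact_answer(edges, delta, initial, finals, n)}

# Invented toy data (the original figure is not available here).
edges = [("u1", "airplane", "u4"), ("u4", "airplane", "u5"),
         ("u1", "bus", "u2"), ("u2", "train", "u6"), ("u4", "train", "u6")]
plane_plus = {(0, "airplane"): {1}, (1, "airplane"): {1}}            # (airplane)+
ground_plus = {(s, a): {1} for s in (0, 1) for a in ("train", "bus")}  # (train|bus)+

first = sources_reaching(edges, plane_plus, 0, {1}, "u5")
second = sources_reaching(edges, ground_plus, 0, {1}, "u6")
answer = first & second
```

On this toy graph the two conjuncts bind ?X to {u1, u4} and {u1, u2, u4}, so the conjunction yields {u1, u4}.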

4. Approximate Semantics
- The edit distance from a semipath p to a semipath q is the minimum cost of any sequence of edit operations which transforms the sequence of edge labels of p to the sequence of edge labels of q.
- We recall that the edit operations we allow are insertions, deletions, substitutions and inversions of symbols, and transposition of adjacent symbols.
- We envisage the user being able to specify which edit operations should be applied by the system when answering a particular query, or in a particular application.
- The user could also specify the cost associated with applying each edit operation (in the paper we assume a cost of 1 for all of them).

Approximate Semantics
- The distance of a semipath $p$ to a regular expression $R$, $dist(p, R)$, is the minimum edit distance from $p$ to any semipath that conforms to $R$.
- Given graph $G$, query $Q$ and matching $\theta$, the tuple $\theta(Z_1, Z_2)$ has distance $dist(p, R)$ to $Q$, where $p$ is a semipath from $\theta(X)$ to $\theta(Y)$ which has the minimum distance to $R$ of any semipath from $\theta(X)$ to $\theta(Y)$ in $G$.
  o Note: if $p$ conforms to $R$, then $\theta(Z_1, Z_2)$ has distance 0 to $Q$.
- The approximate top-k answer of $Q$ on $G$ is a list containing the $k$ tuples $\theta(Z_1, Z_2)$ with minimum distance to $Q$, ranked in order of increasing distance to $Q$.
- The approximate answer of $Q$ on $G$ is a list containing all the tuples at any distance to $Q$, ranked in order of increasing distance to $Q$ (a maximum of $O(|E|^2)$ tuples).
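As an illustration of $dist(p, R)$, the following Python sketch runs a cheapest-first search over pairs (labels of $p$ consumed, NFA state). It is a simplification under stated assumptions: only insertion, deletion and substitution are modelled, each at cost 1, and the NFA for (airplane)+ is built by hand. The name `dist_to_regex` is hypothetical.

```python
import heapq

def dist_to_regex(path_labels, delta, initial, finals):
    """Minimum number of unit-cost edits turning path_labels into a word of L(R).

    delta: dict (state, symbol) -> set of next states (NFA states are ints here).
    Returns None if the NFA accepts no word at all.
    """
    heap = [(0, 0, initial)]  # (cost so far, labels consumed, NFA state)
    settled = set()
    while heap:
        cost, i, s = heapq.heappop(heap)
        if (i, s) in settled:
            continue
        settled.add((i, s))
        if i == len(path_labels) and s in finals:
            return cost
        if i < len(path_labels):
            # drop label i of the path without moving in the NFA
            heapq.heappush(heap, (cost + 1, i + 1, s))
        for (src, a), targets in delta.items():
            if src != s:
                continue
            for t in targets:
                # take an NFA step on a symbol the path does not contain
                heapq.heappush(heap, (cost + 1, i, t))
                if i < len(path_labels):
                    step = 0 if path_labels[i] == a else 1  # match or substitute
                    heapq.heappush(heap, (cost + step, i + 1, t))
    return None

plane_plus = {(0, "airplane"): {1}, (1, "airplane"): {1}}  # hand-built NFA for (airplane)+
d_exact = dist_to_regex(["airplane", "airplane"], plane_plus, 0, {1})
d_names = dist_to_regex(["name-", "airplane", "name"], plane_plus, 0, {1})
```

A conforming label sequence gets distance 0, while `name-`, `airplane`, `name` is at distance 2 from (airplane)+ because two labels have to be edited away.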

Evaluation – naive
1. Construct the approximate automaton $M$ at distance $d = |R| + |E|$, using a standard construction from approximate string matching.
  o Note: $|R| + |E|$ is the maximum distance required to obtain all tuples in the approximate answer (Lemma 1).
  o $M$ consists of $d$ copies of $M_R$, the NFA that recognises $L(R)$.
  o Each copy $M_R^j$, where $0 \le j \le d$, represents states at distance $j$ from $M_R$.
  o The only initial state in $M$ is the initial state of $M_R^0$.
  o The final state of each $M_R^j$ becomes a final state in $M$.
  o Each sub-automaton $M_R^j$ is connected to $M_R^{j+1}$ by transitions representing the selected edit operations and their costs (assumed to be 1 for simplicity in the paper).
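The layered construction can be sketched as follows. This is a hedged illustration rather than the paper's exact construction: it covers insertion, deletion and substitution only, uses levels 0 to d, and the function name `approximate_automaton` and the tuple layout are assumptions.

```python
EPS = ""  # empty input, used for deleting a query symbol

def approximate_automaton(states, transitions, alphabet, d):
    """Weighted transitions ((s, j), symbol, cost, (t, k)) over levels 0..d.

    states: set of NFA states; transitions: set of (s, a, t); alphabet: set.
    """
    weighted = set()
    for j in range(d + 1):
        for s, a, t in transitions:
            weighted.add(((s, j), a, 0, (t, j)))            # exact move inside level j
            if j < d:
                weighted.add(((s, j), EPS, 1, (t, j + 1)))  # deletion: skip symbol a
                for b in alphabet - {a}:
                    weighted.add(((s, j), b, 1, (t, j + 1)))  # substitution of a by b
        if j < d:
            for s in states:
                for b in alphabet:
                    weighted.add(((s, j), b, 1, (s, j + 1)))  # insertion of b
    return weighted

# Hand-built NFA for the single-symbol expression "airplane".
M = approximate_automaton({0, 1}, {(0, "airplane", 1)}, {"airplane", "name"}, d=2)
```

For this two-state NFA and a two-symbol alphabet, each level contributes one exact transition, and each pair of successive levels is linked by one deletion, one substitution and four insertion transitions.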

Evaluation – naive
2. Form the product automaton $H = M \times G$, viewing each node in the input graph $G = (V, E)$ as both an initial and a final state.
3a. If $Q$ is of the form $(n, Y) \leftarrow (n, R, Y)$ for some node $n$ of $G$, then perform a uniform cost traversal of graph $H$, starting from node $(s_0^0, n)$, where $s_0^0$ is the initial state of $M_R^0$.
  o We keep a list of visited nodes of $H$, so no node is visited twice.
  o Whenever a node $(s_f^j, m)$ is encountered (where $s_f^j$ is the final state of some $M_R^j$), we output $m$.
  o The distance of $m$ to $Q$ is given by the total cost of the path from $(s_0^0, n)$ to $(s_f^j, m)$ in the traversal tree.

Evaluation – naive
3b. If $Q$ is of the form $(X, Y) \leftarrow (X, R, Y)$, it can be evaluated by answering the query $(n, Y) \leftarrow (n, R, Y)$ for each node $n$ of $G$.
- Lemma 2 of the paper states that the time to compute the approximate answer is polynomial in $|V|$, $|E|$ and $|R|$.

Evaluation – incremental
- The edges of graph $H = M \times G$ can be computed incrementally, avoiding pre-computation and materialisation of the entire $H$.
- For any state $s_i$ and node $n$ of $G$, succ($s_i$, $n$) outputs the set of transitions which would be the successors of $(s_i, n)$ in $H$.
- succ calls nextStates($M_R$, $s$, $c$) to return the set of states in $M_R$ reachable from state $s$ on reading input $c$. This input is obtained:
  o from the edges in $G$ adjacent to $n$, for normal traversal, edge reversal and symbol insertion
  o from symbols in $\Sigma$, for symbol deletion
  o from edges in $G$ adjacent to $n$, plus a further hop of edge traversals in $G$, for transpositions

Evaluation – incremental
Incremental evaluation proceeds by:
- Constructing the NFA $M_R$ for $R$.
- Initialising to empty the set $visited_R$ of triples $(v, n, s)$, each stating that node $n$ in $G$ was visited in state $s$ starting from node $v$.
- Initialising a priority queue $Q_R$ with quadruples of the form $(v, v, s_0, 0)$ for each node $v$ in $G$ (unless $X = n$ in the query, in which case only $(n, n, s_0, 0)$ is enqueued):
  o the fourth argument is the current distance, $d$
  o initially, $d = 0$
  o subsequently, quadruples are added to $Q_R$ in order of increasing $d$
- Repeatedly calling the function getNext(X, R, Y) to return the next answer tuple for the conjunct (X, R, Y), in ranked order.

Evaluation – incremental
getNext(X, R, Y): while $Q_R$ is non-empty, this:
- de-queues a tuple $(v, n, s, d)$ from $Q_R$, where $d$ is the distance associated with visiting node $n$ in state $s$ of $M_R$ having started from node $v$
- adds $(v, n, s)$ to $visited_R$
- if $s$ is a final state, then getNext returns the triple $(v, n, d)$
- otherwise, succ($s$, $n$) is called, returning the set of transitions $(c, w)$ and states $(s', m)$ which are the successors of $(s, n)$ in $H$:
  o those states $(s', m)$ such that $(v, m, s')$ is already in $visited_R$ are ignored
  o for all other states, $(v, m, s', d + w)$ is added to $Q_R$

Example 4 – transport network data
Suppose that the only query edits allowable are insertion of name or name⁻, and inversion of airplane. Find cities reachable from Santiago by plane:
?Y ← (Santiago, (airplane)+, ?Y)

Query: ?Y ← (Santiago, (airplane)+, ?Y)
- Enqueue $(Santiago, Santiago, s_0, 0)$.
- This is de-queued, and succ($s_0$, Santiago) is called; this returns transition (name⁻, 1) and state $(s_0^1, u1)$.
- $(Santiago, u1, s_0^1, 1)$ is enqueued.
- $(Santiago, u1, s_0^1, 1)$ is de-queued, and succ($s_0^1$, u1) is called; this returns transition (airplane, 0) and state $(s_f^1, u4)$, and transition (airplane, 0) and state $(s_f^1, u7)$.
- $(Santiago, u4, s_f^1, 1)$ and $(Santiago, u7, s_f^1, 1)$ are enqueued.
- These are successively de-queued, resulting in (Santiago, u4, 1) and (Santiago, u7, 1) being successively returned by getNext.
- Computation continues in this way, until all answer tuples have been returned.
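The incremental procedure can be sketched as a Python generator that computes successors on demand and yields answers in order of non-decreasing distance. Several things here are assumptions rather than the authors' implementation: the start node is fixed (so the leading v of each tuple is dropped), the generator keeps expanding after yielding an answer, only insertion and inversion are supported, and the edge list is an invented toy graph in the spirit of the transport example, with `name-` denoting the inverse of `name`.

```python
import heapq

def inverse(label):
    """name <-> name-"""
    return label[:-1] if label.endswith("-") else label + "-"

def ranked_answers(edges, delta, initial, finals, start, insertable, invertible):
    """Yield (node, distance) in non-decreasing distance from `start`."""
    adj = {}
    for u, label, v in edges:
        adj.setdefault(u, []).append((label, v))
        adj.setdefault(v, []).append((inverse(label), u))

    def succ(s, n):
        # successors of (s, n) in the product graph, computed lazily
        for label, m in adj.get(n, []):
            for t in delta.get((s, label), ()):
                yield 0, t, m                  # normal traversal
            if label in insertable:
                yield 1, s, m                  # insertion of `label`
            for a in invertible:
                if label == inverse(a):
                    for t in delta.get((s, a), ()):
                        yield 1, t, m          # inversion of a

    queue = [(0, start, initial)]              # (distance, node, NFA state)
    visited, emitted = set(), set()
    while queue:
        d, n, s = heapq.heappop(queue)
        if (n, s) in visited:
            continue
        visited.add((n, s))
        if s in finals and n not in emitted:
            emitted.add(n)
            yield n, d
        for w, t, m in succ(s, n):
            if (m, t) not in visited:
                heapq.heappush(queue, (d + w, m, t))

# Invented toy data; the original figure is not part of this transcript.
edges = [("u1", "name", "Santiago"), ("u4", "name", "Temuco"),
         ("u7", "name", "Chillan"), ("u1", "airplane", "u4"),
         ("u1", "airplane", "u7")]
plane_plus = {(0, "airplane"): {1}, (1, "airplane"): {1}}  # (airplane)+
answers = list(ranked_answers(edges, plane_plus, 0, {1}, "Santiago",
                              insertable={"name", "name-"}, invertible={"airplane"}))
```

On this toy graph the first two answers are u4 and u7 at distance 1, and u1 follows at distance 2 through an inversion of airplane. Because the result is a generator, a caller that wants only the top k answers can stop after k items and no further part of the product graph is explored.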

5. Multi-conjunct queries
- For a general conjunctive regular path query $(Z_1, \ldots, Z_m) \leftarrow (X_1, R_1, Y_1), \ldots, (X_n, R_n, Y_n)$:
- Given a matching $\theta$ from variables to the nodes of graph $G$, the tuple $\theta(Z_1, \ldots, Z_m)$ has distance $dist(p_1, R_1) + \cdots + dist(p_n, R_n)$ to $Q$, where each $p_i$ is a semipath from $\theta(X_i)$ to $\theta(Y_i)$ which has the minimum distance to $R_i$ of any semipath from $\theta(X_i)$ to $\theta(Y_i)$.
- The approximate top-k answer of $Q$ on $G$ is a list containing the $k$ tuples $\theta(Z_1, \ldots, Z_m)$ with minimum distance to $Q$, ranked in order of increasing distance to $Q$.
- The approximate answer of $Q$ on $G$ is a list containing all the tuples at any distance to $Q$, ranked in order of increasing distance to $Q$.

Multi-conjunct queries
- To ensure polynomial time evaluation, we require that the conjuncts of $Q$ are acyclic.
- This implies the existence of a join tree induced by the conjuncts of $Q$.
- We use the hash ripple join algorithm of Ilyas, Aref and Elmagarmid (2004) to incrementally evaluate $Q$.
- For each conjunct $(X_i, R_i, Y_i)$ of $Q$, we use our incremental evaluation algorithm for single-conjunct queries to compute a relation $r_i$ containing triples $(n, m, d)$, where $d$ is the minimum distance to $R_i$ of any semipath from node $n$ to node $m$ in $G$.

Multi-conjunct query evaluation
- Construct the evaluation tree $E$ of $Q$.
- Initialise data structures by recursively calling the procedure open, starting at the root of $E$:
  o for each node of $E$ that is a join operator, hash tables are built for its left and right subtrees (LN and RN), its threshold value is set to 0, and an (initially empty) priority queue is allocated for the node
  o for each node of $E$ that is a conjunct (X, R, Y), the same initialisations as earlier are performed: construct the NFA $M_R$ for $R$, set $visited_R$ to empty and $d$ to 0, and initialise the priority queue $Q_R$

Multi-conjunct query evaluation
- Incremental evaluation proceeds by calling a function getNext with the root of $E$.
- If its argument is a conjunct, getNext is as discussed earlier for single-conjunct queries.
- If its argument is a join operator, getNext chooses (by some heuristic) one of the two join operands, $I$, from which to retrieve a tuple, by recursively invoking getNext.
- $I_{top}$ is set to the distance value of the first tuple retrieved from $I$, and $I_{bottom}$ is updated with the distance value of the most recently retrieved tuple from $I$.
- The threshold value of the current node is $\min(LN_{top} + RN_{bottom},\ RN_{top} + LN_{bottom})$, which is the lowest possible distance of any join tuple yet to be computed.

Multi-conjunct query evaluation – join operator
- The current tuple, $t$, retrieved from $I$ is inserted into $I$'s hash table, and the other hash table is probed with $t$ to find possible join combinations with $t$.
- For each such tuple $s$ and join tuple $u$, the distance of $u$ from $Q$ is set to the sum of the distances of $t$ and $s$ from $Q$, and $u$ is added to the node's priority queue.
- This process of generating and enqueueing join tuples repeats while the priority queue remains empty, or while the distance value of the first item on the priority queue is greater than the current threshold value of the node.
- Finally, getNext returns the first item on the priority queue.
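A minimal Python sketch of the threshold-driven join operator follows. It makes several simplifying assumptions that are not from the paper: each input is a stream of (binding, distance) pairs for a single shared variable, already sorted by non-decreasing distance; the "heuristic" for choosing an operand is plain alternation; and an exhausted input no longer contributes a bound, since it cannot supply unseen tuples. The name `rank_join` is hypothetical.

```python
import heapq
from itertools import count
from math import inf

def rank_join(left, right):
    """Yield (binding, summed distance) in non-decreasing distance."""
    inputs = [iter(left), iter(right)]
    tables = [{}, {}]            # hash tables: binding -> distances seen so far
    top = [0, 0]                 # distance of the first tuple read per input
    bottom = [0, 0]              # distance of the latest tuple read per input
    done = [False, False]
    results = []                 # priority queue of join tuples
    tie = count()                # tie-breaker so heap entries stay comparable
    turn = 0
    while True:
        # an unseen join tuple needs an unseen tuple from some live input a
        bounds = [bottom[a] + top[1 - a] for a in (0, 1) if not done[a]]
        threshold = min(bounds) if bounds else inf
        if results and results[0][0] <= threshold:
            dist, _, key = heapq.heappop(results)
            yield key, dist
            continue
        if all(done):
            return
        i = turn if not done[turn] else 1 - turn   # alternate between operands
        turn = 1 - i
        try:
            key, dist = next(inputs[i])
        except StopIteration:
            done[i] = True
            continue
        if not tables[i]:
            top[i] = dist
        bottom[i] = dist
        tables[i].setdefault(key, []).append(dist)
        for other in tables[1 - i].get(key, []):   # probe the other hash table
            heapq.heappush(results, (dist + other, next(tie), key))

# Hypothetical ranked bindings for ?X from two conjuncts.
left = [("u1", 0), ("u4", 0), ("u2", 1)]
right = [("u1", 0), ("u2", 0), ("u4", 1)]
joined = list(rank_join(left, right))
```

Here u1 is reported first at distance 0, before either input has been fully read, because no unseen combination can have a smaller summed distance than the threshold.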

6. Conclusions and future work
- The paper has explored the use of weighted regular transducers and conjunctive regular path queries in a framework for approximate querying of graph-structured data.
- For single-conjunct queries, we have shown how approximate answers can be computed in time polynomial in the size of the query and the graph.
- We have also shown how answers can be computed incrementally and returned in ranked order.
- We have generalised the treatment to multi-conjunct queries, showing that incremental computation can still be achieved in polynomial time provided the queries are acyclic.

Conclusions and future work
There are several directions of future work:
- Implementation of our algorithms (ongoing), determination of their practical utility and efficiency, and development and empirical evaluation of optimisations.
- Application in case studies, e.g. RDF linked data arising in a variety of domains.
- Design of end-user tools for approximate querying of semi-structured data, so that users can specify their query approximation requirements.
- Extending the expressiveness of our query language, to allow path variables and predicates on paths.

Acknowledgements
Many thanks go to Petra Selmer for her implementation of the incremental evaluation algorithm, and the screenshots.

Corrections
- Section 2.3 should state that there are $O(|R|)$ transitions between successive sub-automata for transpositions (because only adjacent symbols can be transposed).
- Lemma 1(i) should therefore state that $M$ has size $O(d\,(|R| + |\Sigma||R| + |R|))$.
- Examples 3 and 4 return one more answer at distance 2 than shown, namely $(s_f^2, u1)$, which is reachable from $(s_f^1, u4)$ by a transition (airplane⁻, 1) (and also from $(s_f^1, u7)$ by a similar transition).

Corrections (contd)
- There is also a mistake in our calculations in Lemma 2 of the paper; the correct expression is $O(|V|\,|E|^3\,|R|)$.
- If we assume that $\Sigma$ contains only labels appearing on edges in $G$, then the size of the approximate automaton $M$ of $R$ at distance $|R| + |E|$ is $O(|E|^2\,|R|)$, from Lemma 1.
- The size of $H = M \times G$ is $O(|E|^3\,|R|)$, since we can discard disconnected nodes from $H$.
- Computing the approximate answer in the worst case requires $|V|$ traversals of $H$, each at a cost equal to the size of $H$, i.e. a total cost of $O(|V|\,|E|^3\,|R|)$.
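The chain of bounds can be written out step by step. The inequality $|R| \le |E|$ used in the first line is an assumption added here to make the simplification explicit; the slides only state that $\Sigma$ is restricted to labels occurring in $G$, which gives $|\Sigma| \le |E|$.

```latex
\begin{align*}
d   &= |R| + |E| = O(|E|)
      && \text{(assuming } |R| \le |E| \text{)} \\
|M| &= O\bigl(d\,(|R| + |\Sigma||R| + |R|)\bigr)
     = O\bigl(|E| \cdot |E|\,|R|\bigr)
     = O(|E|^2\,|R|)
      && \text{(using } |\Sigma| \le |E| \text{)} \\
|H| &= |M \times G| = O\bigl(|E|^2\,|R| \cdot |E|\bigr) = O(|E|^3\,|R|) \\
\text{total cost} &= |V| \cdot |H| = O(|V|\,|E|^3\,|R|)
\end{align*}
```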