XML indexing – A(k) indices

- Ragini Rahalkar - Roshith Rajagopal 2/22/2019 CSE 636

Outline
- Introduction
- Motivation
- Labeled graph and index graph
- Bisimilarity and A(k) index
- Construction of A(k) index
- Query evaluation
- Approximate index handling
- Implementation and testing
- Summary

Introduction
- Structural summaries
- Evaluating path expressions
- A(k) index
  - Indexing scheme for large graph data such as XML
  - Not all structure is interesting, e.g. paths longer than k
  - Smaller and faster
  - Schemaless data
  - Competitive for arbitrary path expressions

Prior Schemes
- 1-index [Milo, Suciu 1999]
  - NFA rather than DFA (smaller)
  - Split graph nodes into equivalence classes based on incoming paths from the root
- Go for refinements (approximations)
  - Similarity
  - Bisimilarity

An alternative is the work of Tova Milo and Dan Suciu, which proposes using an NFA to avoid replication. As a result, the index graph is at most as big as the data graph. The idea is to partition the data graph into equivalence classes, where two nodes are equivalent if the set of paths into them from the root is the same. A refinement takes an equivalence partitioning and further splits its partitions into smaller pieces.

Limitations of Prior Work
- Size
  - Each and every path is indexed, which is not necessary (does not exploit local similarity)
  - The 1-index can be big too!
- Designed to answer queries involving arbitrarily complex paths, but such paths may never show up in queries

We actually saw this happen in one of the datasets we considered (the data was highly cyclic). Similar results are reported by Stanford too. We would like to capture two important notions. One is that a certain fuzzification of the index is useful: it might buy us compactness and speed in exchange for accuracy. The other is that there is similarity at several levels in the data graph, and current notions might ignore most of them, e.g. label similarity, small-piece similarity, etc.

Labeled Graph
- G = (V_G, E_G, root, Σ_G, label, oid, value)
- Node path and label path
- Path expression
- Regular language
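As a rough illustration of the tuple above, the following Python sketch stores a small labeled graph and derives the label path of a node path. The class name `LabeledGraph`, its method names, and the seven-node example graph are assumptions made for illustration; node ids play the role of oid.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Minimal stand-in for G = (V_G, E_G, root, Sigma_G, label, oid, value)."""
    root: int
    label: dict                                 # oid -> label drawn from Sigma_G
    edges: set                                  # set of (parent oid, child oid) pairs
    value: dict = field(default_factory=dict)   # oid -> value, where a node has one

    def is_node_path(self, nodes):
        """A node path is a sequence of nodes joined by consecutive edges."""
        return all((a, b) in self.edges for a, b in zip(nodes, nodes[1:]))

    def label_path(self, nodes):
        """The label path of a node path is the sequence of its node labels."""
        if not self.is_node_path(nodes):
            raise ValueError("not a node path in this graph")
        return [self.label[n] for n in nodes]


# Assumed example graph (labels and edges chosen for illustration).
g = LabeledGraph(
    root=1,
    label={1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"},
    edges={(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)},
)
print(g.label_path([1, 2, 4]))  # ['A', 'B', 'D']
```

A path expression such as A.B.D then matches node 4 because the node path 1, 2, 4 carries exactly that label path.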


Index Graph I(G)
- Extent of a node
- Regular expression execution with I(G)
- Safe extent mapping
- Containment of results of path expressions
- Precise index graph
- 1-index graph
  - Never bigger than the data graph
  - Can be computed in O(m log n)

Notion of Bisimilarity
- Symmetric binary relation
- For two nodes u and v, u ≈b v if
  - u and v have the same label
  - if u' is a parent of u, then there is a parent v' of v such that u' ≈b v', and vice versa
- Objects 8 and 9 are bisimilar
- Objects 21 and 23 are not bisimilar

The A(k) index
- Local similarity
- Using equivalence class partition
- Grouping according to labels
- Notion of false paths
- Classification by label: business and cultural
- Gives up absolute precision by grouping similar data; the index size changes with the value of k


K-bisimilarity
Defined inductively, for any two nodes u and v:
- u ≈0 v if u and v have the same label
- u ≈k v iff u ≈k-1 v and, for every parent u' of u, there is a parent v' of v such that u' ≈k-1 v', and vice versa
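The inductive definition can be checked directly on a small graph. The sketch below is a naive recursive test, not the partition-based algorithm used to build the index; the function name `k_bisimilar` and the example graph (labels and edges) are assumptions for illustration.

```python
from functools import lru_cache

# Assumed example graph: node id -> label, plus parent-to-child edges.
LABEL = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"}
EDGES = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]

# Parent lists derived from the edge list.
PARENTS = {n: [] for n in LABEL}
for src, dst in EDGES:
    PARENTS[dst].append(src)


@lru_cache(maxsize=None)
def k_bisimilar(u, v, k):
    """Return True if u and v are k-bisimilar under the inductive definition."""
    if LABEL[u] != LABEL[v]:
        return False
    if k == 0:
        return True
    if not k_bisimilar(u, v, k - 1):
        return False
    # Every parent of u needs a (k-1)-bisimilar parent of v, and vice versa.
    forth = all(any(k_bisimilar(pu, pv, k - 1) for pv in PARENTS[v])
                for pu in PARENTS[u])
    back = all(any(k_bisimilar(pv, pu, k - 1) for pu in PARENTS[u])
               for pv in PARENTS[v])
    return forth and back


print(k_bisimilar(4, 5, 0))  # True: both nodes are labeled D
print(k_bisimilar(4, 5, 1))  # False: their parents are labeled B and C
print(k_bisimilar(6, 7, 1))  # True: their parents are 0-bisimilar
print(k_bisimilar(6, 7, 2))  # False: their parents are not 1-bisimilar
```

In this assumed graph the two E nodes stay together up to k = 1 and separate only at k = 2, which is the kind of local similarity the A(k) index exploits.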

[Figure: a data graph G and its A(0), A(1) and A(2) index graphs; here A(2) = 1-index]

A(k) index properties
- If nodes u and v are k-bisimilar, then the set of label-paths of length k into them is the same.
- The set of label-paths of length k into an A(k)-index node is the set of label-paths of length k into any node in its extent.
- The A(k)-index is precise for any simple path expression of length less than or equal to k.
- The A(k)-index is safe, i.e., its result on a path expression always contains the graph result for that query.
- The (k + 1)-bisimulation is either equal to or is a refinement of the k-bisimulation.
- Let v, x, y be three nodes such that the shortest path to x from v or to y from v contains more than k edges. If an edge from a node u to v is added or deleted, this update does not affect the k-bisimilarity relationship between x and y.

A(k) index construction
- Partitioning: compute_k_bisim
- Notion of successor of a node
- Notion of stability
  - Given two sets of nodes A and B, partition A into A ∩ Succ(B) and A - Succ(B)
- Computation of the (k+1)-bisimulation from the k-bisimulation
  - A copy of the k-bisimulation is split into equivalence classes until they are stable with respect to the equivalence classes of the k-bisimulation
- Time: O(km); space: O(m), where m is the number of edges

Compute_k_bisim(G, k)
Begin
    Q and X are each a list of node-sets
    Q = partition V_G by label
    X = (a copy of) Q
    for i = 1 to k do
        foreach X1 in X do
            compute Succ(X1)
            foreach Q1 in Q do
                replace Q1 by Q1 ∩ Succ(X1) and Q1 - Succ(X1)
        if there was no split then break
End
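A minimal Python sketch of the partitioning step is given here for illustration. It assumes that X is refreshed to a copy of Q at the start of every round, so that round i splits by the classes of the (i-1)-bisimulation; that reading, the function signatures, and the example graph are assumptions. The nested loops are written for clarity and do not attain the O(km) bound.

```python
def succ(nodes, children):
    """Succ(X): every node that has at least one parent in X."""
    return {c for n in nodes for c in children.get(n, ())}


def compute_k_bisim(labels, edges, k):
    """Return the k-bisimulation partition as a list of node sets."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, set()).add(dst)

    # Q = partition of the nodes by label (the 0-bisimulation).
    by_label = {}
    for node, label in labels.items():
        by_label.setdefault(label, set()).add(node)
    q = list(by_label.values())

    for _ in range(k):
        x = [set(block) for block in q]   # copy of the current partition (assumption)
        split_happened = False
        for x1 in x:
            s = succ(x1, children)
            refined = []
            for q1 in q:
                inside, outside = q1 & s, q1 - s
                if inside and outside:    # q1 is not stable with respect to x1
                    split_happened = True
                    refined.extend([inside, outside])
                else:
                    refined.append(q1)
            q = refined
        if not split_happened:
            break
    return q


# Assumed example graph.
LABELS = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"}
EDGES = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]

for k in range(3):
    print(k, sorted(sorted(block) for block in compute_k_bisim(LABELS, EDGES, k)))
```

On this graph the partition has 5 classes for k = 0, 6 for k = 1 and 7 for k = 2, each one a refinement of the previous.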

Compute_A(k)_index(G, k)
Begin
    Compute_k_bisim(G, k)
    foreach equivalence class in the k-bisimulation do
        create an index node I
        ext[I] = data nodes in the equivalence class
    foreach edge from u to v in G do
        I[u] = index node containing u
        I[v] = index node containing v
        if there is no edge from I[u] to I[v] then
            add an edge from I[u] to I[v]
End
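The second half of the procedure, turning a partition into an index graph, might look as follows in Python. To stay self-contained the sketch takes an already computed partition as input; the function name `build_index_graph`, the integer index-node ids, and the example data are assumptions for illustration.

```python
def build_index_graph(partition, edges):
    """Create one index node per equivalence class and lift the data edges."""
    # ext[I] = data nodes in the equivalence class of index node I.
    extent = {i: frozenset(block) for i, block in enumerate(partition)}
    # I[u] = index node containing data node u.
    index_of = {node: i for i, block in extent.items() for node in block}

    index_edges = set()
    for u, v in edges:
        # A set adds the edge only if it is not already present.
        index_edges.add((index_of[u], index_of[v]))
    return extent, index_edges


# Assumed example: data edges and a 1-bisimulation partition of the nodes.
EDGES = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]
PARTITION_K1 = [{1}, {2}, {3}, {4}, {5}, {6, 7}]

extent, index_edges = build_index_graph(PARTITION_K1, EDGES)
print(extent[5])            # frozenset({6, 7})
print(sorted(index_edges))  # [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 5)]
```

Both D nodes point into the single index node whose extent is {6, 7}, so the index has 6 nodes for the 7 data nodes.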

Query Evaluation Schemes
- The index is queried using regular path expressions
- Path expressions are of the form P = Root.R
- Query evaluation techniques
  - Forward evaluation
  - Backward evaluation

Query Evaluation Techniques
- Forward evaluation strategy
  - Simulation of an NFA on the graph
  - The index graph is traversed breadth-first, making the corresponding transitions
- Backward evaluation strategy
  - Find nodes bearing the final labels in R
  - R is evaluated in reverse from these nodes
  - Intuition: the end of the expression is more selective than the earlier parts, so processing is cheaper
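The forward strategy can be sketched as a breadth-first walk over (index node, automaton state) pairs. In the Python sketch below the NFA is a plain transition dictionary that reads the label of each node along the path; that encoding, the function name `forward_eval`, and the A(0)-style example index are assumptions for illustration.

```python
from collections import deque


def forward_eval(root, labels, children, extent, delta, start, accepting):
    """Simulate an NFA on the index graph and collect extents of accepted nodes."""
    result, seen = set(), set()
    # The automaton first reads the label of the root node.
    queue = deque((root, s) for s in delta.get((start, labels[root]), ()))
    while queue:
        node, state = queue.popleft()
        if (node, state) in seen:
            continue
        seen.add((node, state))
        if state in accepting:
            result |= extent[node]      # add Ext[node], not the index node itself
        for child in children.get(node, ()):
            for nxt in delta.get((state, labels[child]), ()):
                queue.append((child, nxt))
    return result


# Assumed A(0)-style index: one index node per label.
INDEX_LABEL = {"a": "A", "b": "B", "c": "C", "d": "D", "e": "E"}
INDEX_CHILDREN = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
EXTENT = {"a": {1}, "b": {2}, "c": {3}, "d": {4, 5}, "e": {6, 7}}

# Linear NFA for the path expression A.B.D: states 0 -> 1 -> 2 -> 3.
DELTA = {(0, "A"): {1}, (1, "B"): {2}, (2, "D"): {3}}

print(forward_eval("a", INDEX_LABEL, INDEX_CHILDREN, EXTENT, DELTA, 0, {3}))
# {4, 5}
```

If the underlying data graph only has node 4 below the B node, the returned set {4, 5} is a superset of the exact answer, which is why results from paths longer than k need validation.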

Approximate Index Graphs:
While evaluating R on Index graph, we add nodes in the Ext[B] rather than B to the result set. A(k) index is safe Result set for R is superset of the target set in the data graph. 2/22/2019 CSE 636

Approximate Index Graphs
- When node B is accepted along a path of length <= k in the A(k) index graph, every node in Ext[B] must be in the target set of R
- When an index node is accepted along a longer path, its data nodes are initially added to a maybe set M instead of the result set
- Nodes in M are validated by reverse execution of the automaton on the data graph, beginning with each node in M
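The validation step can be sketched in Python for the special case of a simple rooted label path, where reverse execution reduces to walking parent edges while reading the labels backwards. The function names, the convention that path length is counted in edges, and the example data are assumptions for illustration; a general regular path expression would need the reversed automaton instead.

```python
def reaches_backward(node, path, labels, parents, root):
    """True if some node path from the root to `node` carries exactly `path`."""
    frontier = {node}
    for depth, label in enumerate(reversed(path)):
        frontier = {n for n in frontier if labels[n] == label}
        if not frontier:
            return False
        if depth == len(path) - 1:
            return root in frontier     # the first label must sit on the root
        frontier = {p for n in frontier for p in parents.get(n, ())}
    return False


def evaluate_with_validation(candidates, path, k, labels, parents, root):
    """Accept candidates directly for short paths, validate the maybe set otherwise."""
    if len(path) - 1 <= k:              # path length in edges (assumption)
        return set(candidates)
    maybe = set(candidates)             # the maybe set M
    return {n for n in maybe if reaches_backward(n, path, labels, parents, root)}


# Assumed data graph, and candidates returned by an A(0) index for A.B.D.
LABELS = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"}
PARENTS = {2: [1], 3: [1], 4: [2], 5: [3], 6: [4], 7: [5]}

print(evaluate_with_validation({4, 5}, ["A", "B", "D"], 0, LABELS, PARENTS, 1))
# {4}
```

Node 5 is dropped because its parent is labeled C, so the path A.B.D into it is a false path introduced by the index.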

Implementation – Data Structures
- Data graph representation
  - Element_HT: hashtable of (NodeID, Element) pairs
  - Attribute_HT: hashtable of (NodeID -1, IDREF Attribute) pairs
  - The key of this hashtable is the NodeID of the element that owns the attribute

Implementation
- Index tree
  - EqClass_HT: hashtable of (EqClassID, vector of NodeIDs in that EqClass) pairs
  - Generated from Compute_K_Bisim
- Link table
  - Linktable_HT: hashtable of (EqClassID, vector of child EqClassIDs) pairs
  - Generated from Compute_A(K)_index

Sample Results
- Size of index graph vs. k

Summary
- Generalization of the 1-index
- The value of k trades off the size of the index graph against accuracy
- Small values of k perform better than the 1-index
- Future scope
  - Use in schema extraction and query optimization

References
- Exploiting Local Similarity for Indexing Paths in Graph-Structured Data [Raghav Kaushik, Ehud Gudes et al.]
- Index Structures for Path Expressions [Milo, Suciu 1999]

THANK YOU
