XML indexing – A(k) indices


XML indexing – A(k) indices - Ragini Rahalkar - Roshith Rajagopal 2/22/2019 CSE 636

Outline
- Introduction
- Motivation
- Labeled graph and index graph
- Bisimilarity and A(k) index
- Construction of A(k) index
- Query evaluation
- Approximate index handling
- Implementation and testing
- Summary

Introduction
- Structural summaries
- Evaluating path expressions
- A(k) index
- Indexing scheme for large graph data like XML
- Not all structure is interesting
- Paths longer than k
- Smaller and faster
- Schemaless data
- Competitive for arbitrary path expressions

Prior Schemes
- 1-index [Milo, Suciu 1999]
- NFA rather than DFA (smaller)
- Split graph nodes into equivalence classes based on incoming paths from the root
- Go for refinements (approximations): similarity, bisimilarity
An alternative is the work of Tova Milo and Dan Suciu, which proposes using an NFA to avoid replication. As a result, the index graph is at most as big as the data graph. The idea is to partition the data graph into equivalence classes, where two nodes are equivalent if the set of paths into them from the root is the same. A refinement takes an equivalence partitioning and further splits its partitions into smaller pieces.

Limitations of Prior Work
- Size
- Each and every path is indexed, which is not necessary (does not exploit local similarity)
- The 1-index can be big too!
- Designed to answer queries involving arbitrarily complex paths, but such paths may never show up in queries
We actually saw this happen in one of the datasets we considered (the data was highly cyclic). Similar results are reported by Stanford too. We'd like to capture two important notions. One is that a certain fuzzification of the index is useful: it might buy us compactness and speed in exchange for accuracy. The other is that there is similarity at several levels in the data graph, and current notions might ignore most of them, e.g. label similarity, small-piece similarity, etc.

Labeled Graph
- G = (V_G, E_G, root, Σ_G, label, oid, value)
- Node path and label path
- Path expression
- Regular language
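
To make the node path / label path distinction concrete, here is a minimal sketch in Python. The seven-node graph, the names `LABEL`, `EDGES`, `ROOT`, and the helper `eval_label_path` are all assumptions made for illustration; they are not taken from the presentation's implementation. The function follows node paths from the root and keeps only those whose labels spell the given label path.

```python
from collections import defaultdict

# Hypothetical data graph: node id -> label, and directed (parent, child) edges.
LABEL = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"}
EDGES = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]
ROOT = 1


def eval_label_path(labels, edges, root, path):
    """Return the nodes reached from `root` by node paths whose labels spell `path`."""
    if not path:
        return set()
    kids = defaultdict(set)
    for u, v in edges:
        kids[u].add(v)
    # The first label of the path must match the root itself.
    frontier = {root} if labels[root] == path[0] else set()
    # Each further label advances the frontier by one edge.
    for lab in path[1:]:
        frontier = {c for n in frontier for c in kids[n] if labels[c] == lab}
    return frontier


if __name__ == "__main__":
    print(eval_label_path(LABEL, EDGES, ROOT, ["A", "B", "D", "E"]))  # {6}
    print(eval_label_path(LABEL, EDGES, ROOT, ["A", "C", "D"]))       # {5}
```

On this graph the label path A.B.D.E corresponds to the single node path 1, 2, 4, 6, so the target set is {6}.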


Index Graph I(G)
- Extent of a node
- Regular expression execution with I(G)
- Safe extent mapping
- Containment of results of path expressions
- Precise index graph
- 1-index graph: never bigger than the data graph
- Can be computed in O(m log n)

Notion of Bisimilarity
- A symmetric binary relation ≈b
- For two nodes u and v, u ≈b v if:
- u and v have the same label
- If u’ is a parent of u, then there is a parent v’ of v such that u’ ≈b v’, and vice versa
- Objects 8 and 9 are bisimilar
- Objects 21 and 23 are not bisimilar

The A(k) index
- Local similarity
- Using an equivalence class partition
- Grouping according to labels
- Notion of false paths
- Classification by label: business and cultural
- Absolute precision versus grouping similar data: the index size varies with the value of k



K-bisimilarity
Defined inductively:
- For any two nodes u and v, u ≈0 v iff u and v have the same label
- u ≈k v iff u ≈k-1 v and, for every parent u’ of u, there is a parent v’ of v such that u’ ≈k-1 v’, and vice versa
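
The inductive definition can be checked directly with a small recursive function. This is only a sketch under stated assumptions: the graph (`LABEL`, `EDGES`) is a hypothetical seven-node example and `k_bisimilar` is an illustrative name, not part of the presented implementation. The recursion always decreases k, so it also terminates on cyclic graphs.

```python
from collections import defaultdict
from functools import lru_cache

# Hypothetical data graph: node id -> label, and directed (parent, child) edges.
LABEL = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"}
EDGES = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]

PARENTS = defaultdict(set)
for _u, _v in EDGES:
    PARENTS[_v].add(_u)


@lru_cache(maxsize=None)
def k_bisimilar(u, v, k):
    """True iff u and v are k-bisimilar under the inductive definition."""
    if LABEL[u] != LABEL[v]:
        return False          # fails already at level 0
    if k == 0:
        return True
    if not k_bisimilar(u, v, k - 1):
        return False
    # Every parent of u needs a (k-1)-bisimilar parent of v ...
    forth = all(any(k_bisimilar(pu, pv, k - 1) for pv in PARENTS[v])
                for pu in PARENTS[u])
    # ... and vice versa.
    back = all(any(k_bisimilar(pu, pv, k - 1) for pu in PARENTS[u])
               for pv in PARENTS[v])
    return forth and back


if __name__ == "__main__":
    print(k_bisimilar(4, 5, 0))  # True: both are labeled D
    print(k_bisimilar(4, 5, 1))  # False: parents are labeled B and C
    print(k_bisimilar(6, 7, 1))  # True: parents 4 and 5 are 0-bisimilar
    print(k_bisimilar(6, 7, 2))  # False: parents 4 and 5 are not 1-bisimilar
```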

[Figure: a data graph G with nodes 1 (A), 2 (B), 3 (C), 4 and 5 (D), 6 and 7 (E), shown next to its A(0), A(1), and A(2) index graphs. A(0) merges nodes 4, 5 and nodes 6, 7; A(1) merges only nodes 6, 7; A(2) equals the 1-index.]

A(k) index properties
- If nodes u and v are k-bisimilar, then the set of label-paths of length k into them is the same.
- The set of label-paths of length k into an A(k)-index node is the set of label-paths of length k into any node in its extent.
- The A(k)-index is precise for any simple path expression of length less than or equal to k.
- The A(k)-index is safe, i.e., its result on a path expression always contains the graph result for that query.
- The (k + 1)-bisimulation is either equal to or is a refinement of the k-bisimulation.
- Let v, x, y be three nodes such that the shortest path to x from v or to y from v contains more than k edges. If an edge is added or deleted going from a node u to v, this update does not affect the k-bisimilarity relationship between x and y.

A(k) index construction
- Partitioning: compute_k_bisim
- Notion of successor of a node
- Notion of stability
- For two sets of nodes A and B, partition A into A ∩ Succ(B) and A - Succ(B)
- Computation of the (k+1)-bisimulation from the k-bisimulation
- A copy of the k-bisimulation is split into equivalence classes until they are stable with respect to the equivalence classes of the k-bisimulation
- Time: O(km); space: O(m), where m is the number of edges

Compute_k_bisim(G, k)
Begin
  Q and X are each a list of node-sets
  Q = partition VG by label
  for i = 1 to k do
    X = (a copy of) Q
    foreach X1 in X do
      compute Succ(X1)
      foreach Q1 in Q do
        replace Q1 by Q1 ∩ Succ(X1) and Q1 - Succ(X1)
    if there was no split then break
End
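
The refinement loop can be sketched in Python as follows. This is an illustrative version under stated assumptions, not the presenters' code: the function and argument names are hypothetical, the graph is a made-up seven-node example, and the snapshot X is taken at the start of every round so that round i refines against the (i-1)-bisimulation.

```python
from collections import defaultdict

# Hypothetical data graph: node id -> label, and directed (parent, child) edges.
LABEL = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"}
EDGES = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]


def compute_k_bisim(labels, edges, k):
    """Return the k-bisimulation classes of the graph as a list of frozensets."""
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)

    # Level 0: partition the nodes by label.
    by_label = defaultdict(set)
    for node, lab in labels.items():
        by_label[lab].add(node)
    q = [frozenset(s) for s in by_label.values()]

    for _ in range(k):
        x = list(q)                      # snapshot of the previous level
        split_happened = False
        for x1 in x:
            succ_x1 = set().union(*(succ[n] for n in x1))
            refined = []
            for q1 in q:
                inside, outside = q1 & succ_x1, q1 - succ_x1
                if inside and outside:   # q1 is not stable w.r.t. x1: split it
                    split_happened = True
                    refined += [inside, outside]
                else:
                    refined.append(q1)
            q = refined
        if not split_happened:           # fixpoint reached before k rounds
            break
    return q


if __name__ == "__main__":
    for k in range(3):
        print(k, sorted(sorted(c) for c in compute_k_bisim(LABEL, EDGES, k)))
```

On this example, k = 0 groups nodes 4, 5 and nodes 6, 7; k = 1 separates 4 from 5 but keeps 6 and 7 together; k = 2 puts every node in its own class.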

Compute_A(k)_index(G, k)
Begin
  Compute_k_bisim(G, k)
  foreach equivalence class in the k-bisimulation do
    create an index node I
    ext[I] = data nodes in the equivalence class
  foreach edge from u to v in G do
    I[u] = index node containing u
    I[v] = index node containing v
    if there is no edge from I[u] to I[v] then
      add an edge from I[u] to I[v]
End
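
Building the index graph from a given partition is a short step, sketched below. The names `build_index`, `EDGES`, and `CLASSES_A1` are assumptions for illustration; the partition is written out by hand as the 1-bisimulation of a hypothetical seven-node graph rather than computed, so that the block stays self-contained.

```python
# Hypothetical data graph edges (parent, child) and a hand-written partition
# that is assumed to be the 1-bisimulation of that graph.
EDGES = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]
CLASSES_A1 = [{1}, {2}, {3}, {4}, {5}, {6, 7}]


def build_index(edges, classes):
    """Create one index node per class and lift data edges to index edges.

    Returns (ext, index_edges): ext maps an index node id to its extent,
    index_edges is a set of (index parent, index child) pairs.
    """
    ext = {i: frozenset(c) for i, c in enumerate(classes)}
    node_to_index = {n: i for i, c in ext.items() for n in c}
    index_edges = set()
    for u, v in edges:
        # A set ignores duplicates, so an edge is added only if it is missing.
        index_edges.add((node_to_index[u], node_to_index[v]))
    return ext, index_edges


if __name__ == "__main__":
    ext, index_edges = build_index(EDGES, CLASSES_A1)
    print(ext[5])               # frozenset({6, 7})
    print(sorted(index_edges))  # [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 5)]
```

Both D nodes point at the shared E index node here, which is the source of the false paths that a longer query has to validate against the data graph.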

Query Evaluation Schemes
- The index is queried using regular path expressions
- Path expressions are of the form P = Root.R
- Query evaluation techniques:
- Forward evaluation
- Backward evaluation

Query Evaluation Techniques
- Forward evaluation strategy
- Simulation of an NFA on the graph
- The index graph is traversed breadth-first, making the corresponding transitions
- Backward evaluation strategy
- Find the nodes bearing the final labels in R
- R is evaluated in reverse from these nodes
- Intuition: the end of the expression is more selective than the earlier parts, so processing is cheaper

Approximate Index Graphs
- While evaluating R on the index graph, when an index node B is accepted we add the nodes in Ext[B], rather than B itself, to the result set
- The A(k) index is safe
- The result set for R is a superset of the target set in the data graph

Approximate Index Graphs
- When an index node B is accepted along a path of length ≤ k in the A(k) index graph, every node in Ext[B] must be in the target set of R
- When an index node is accepted along a longer path, its data nodes are initially added to a maybe set M instead of the result set
- Nodes in M are validated by reverse execution of the automaton on the data graph, beginning with each node in M
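
The maybe-set handling can be sketched for the simplest case, a root-anchored sequence of labels. Everything below is an assumption made for illustration: the graph, the hand-written A(1) index (`EXT`, `INDEX_EDGES`), the function names, the restriction to plain label sequences instead of general regular expressions, and the assumption that the root's label is unique. Path length is counted in edges.

```python
from collections import defaultdict

# Hypothetical data graph and a hand-written A(1) index for it.
LABEL = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"}
EDGES = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]
ROOT = 1
EXT = {0: {1}, 1: {2}, 2: {3}, 3: {4}, 4: {5}, 5: {6, 7}}
INDEX_EDGES = {(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 5)}


def validate(node, path, labels, parents, root):
    """Reverse execution on the data graph: does `path` lead from root to node?"""
    current = {node} if labels[node] == path[-1] else set()
    for lab in reversed(path[:-1]):
        current = {p for n in current for p in parents[n] if labels[p] == lab}
    return root in current


def evaluate_on_index(path, k, labels, edges, root, ext, index_edges):
    """Evaluate a non-empty label sequence on the index, validating if longer than k."""
    parents = defaultdict(set)
    for u, v in edges:
        parents[v].add(u)
    ikids = defaultdict(set)
    for a, b in index_edges:
        ikids[a].add(b)
    # All nodes in an extent share one label, so any member gives the label.
    ilabel = {i: labels[next(iter(nodes))] for i, nodes in ext.items()}
    iroot = next(i for i, nodes in ext.items() if root in nodes)

    # Forward traversal of the index graph, one label per step.
    frontier = {iroot} if ilabel[iroot] == path[0] else set()
    for lab in path[1:]:
        frontier = {c for i in frontier for c in ikids[i] if ilabel[c] == lab}
    candidates = set().union(*(ext[i] for i in frontier))

    if len(path) - 1 <= k:
        return candidates          # short path: the index result is precise
    # Longer path: candidates form the maybe set and must be validated.
    return {n for n in candidates if validate(n, path, labels, parents, root)}


if __name__ == "__main__":
    args = (LABEL, EDGES, ROOT, EXT, INDEX_EDGES)
    print(evaluate_on_index(["A", "B"], 1, *args))            # {2}
    print(evaluate_on_index(["A", "C", "D", "E"], 1, *args))  # {7}
```

For A.C.D.E the index alone yields the extent {6, 7}; validation discards node 6 because its only ancestor chain is labeled A.B.D.E.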

Implementation – Data Structures
Data Graph Representation
- Element_HT: Hashtable of (NodeID, Element) pairs
- Attribute_HT: Hashtable of (NodeID -1, IDREF Attribute) pairs
- The key of this hashtable is the NodeID of the element that owns the attribute

Implementation – Index Tree and Link Table
Index Tree
- EqClass_HT: Hashtable of (EqClassID, Vector of NodeIDs in that EqClass) pairs
- Generated from Compute_K_Bisim
Link Table
- Linktable_HT: Hashtable of (EqClassID, Vector of Child EqClassIDs) pairs
- Generated from Compute_A(K)_index

Sample Results
- Size of index graph vs. k

Summary
- Generalization of the 1-index
- The value of k controls the tradeoff between the size of the index graph and accuracy
- Small values of k perform better than the 1-index
- Future scope
- Use in schema extraction and query optimization

References
- Exploiting Local Similarity for Indexing Paths in Graph-Structured Data [Raghav Kaushik, Ehud Gudes et al.]
- Index Structures for Path Expressions [Milo, Suciu 1999]

THANK YOU