A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML. Presented by: Ai Mu. Based on the paper written by Ning Zhang, Varun.

Presentation transcript:

A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML
Presented by: Ai Mu
Based on the paper written by Ning Zhang, Varun Kacholia, M. Tamer Ozsu.

Outline
- Introduction
- Preliminaries
- NoK pattern matching at the logical level
- Physical storage
- XML path queries at the physical level
- Experimental evaluation
- Conclusion

Introduction
The increasingly wide use of XML creates two needs:
- storing large volumes of data encoded in XML;
- querying XML data more efficiently.
Path expressions are the most natural way to query tree-structured data such as an XML tree.
- Evaluating a path expression against an XML tree is a tree pattern matching (TPM) problem.
- A path expression corresponds to a pattern tree that specifies a set of constraints.
- The TPM problem is to find the nodes in the XML tree that satisfy all of these constraints.

Existing evaluation approaches
Navigational approach
- Traverse the tree structure.
- Test whether each tree node satisfies the constraints specified by the path expression.
Join-based approach
- For each pattern tree node, select the list of XML tree nodes that satisfy its node-associated constraints.
- Join the lists based on their structural relationships.
However, neither approach adapts well to streaming XML data.

A novel approach
Define a special pattern tree and a matching procedure for it:
- Next-of-Kin (NoK) pattern tree: a pattern tree in which nodes are connected only by parent-child and following/preceding-sibling relationships.
- Next-of-Kin pattern matching: speeds up the node selection step and reduces the join size in the subsequent join step.
Design a novel, succinct physical storage scheme that supports efficient NoK query evaluation.

Preliminaries
Consider the bibliography XML document, which describes four books:
- TCP/IP Illustrated; author Stevens, W.; publisher Addison-Wesley
- Advanced Programming in the Unix; author Stevens, W.; publisher Addison-Wesley
- Data on the Web; authors Abiteboul, Serge; Buneman, Peter; Suciu, Dan; publisher Morgan Kaufmann Publisher
- The Economics of Technology; editor Gerbarg, Darcy (CiTI); publisher Kluwer Academic Publisher

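Subject tree
The subject tree (or XML tree) of the bibliography, with the children of each node listed in document order:
a
  b: z, e, c(f, g), i, j
  b: z, e, c(f, g), i, j
  b: z, e, c(f, g), c(f, g), c(f, g), i, j
  b: z, e, d(f, g), i, j
Note: bib -> a, book -> b, author -> c, title -> e, publisher -> i, price -> j, first -> f, last -> g, editor -> d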

Pattern tree
Query: "find all books written by Stevens whose price is less than 100".
Path expression: //book[author/last="Stevens"][price<100]
Pattern tree: a graphical representation of the constraints specified in a path expression. For this query:
root
  //book
    /author
      /last="Stevens"
    /price<100

NoK pattern matching at the logical level
Next-of-Kin pattern tree: a pattern tree whose edge labels are all in {parent-child relationship, following-sibling relationship}.
Matching a NoK pattern tree against the subject tree takes two steps:
1. Locate the nodes in the subject tree at which to start pattern matching.
2. Perform NoK pattern matching from each starting node.

Locate the starting node
There are several options for locating the starting points:
- Naïve approach: traverse the whole subject tree in document order and try to match each node with the root of the NoK pattern tree.
- Index on tag names: given a B+ tree on tag names, an index lookup for the root of the NoK pattern tree yields all possible starting points.
- Index on data values: if the NoK pattern tree contains value constraints (such as last="Stevens") and there is a B+ tree on all values in the XML document, the value-based index can locate all nodes with that particular value, which then serve as the starting points.

Example
Consider the subject tree and the NoK pattern tree written with tag names: b[c/g="Stevens"][j<100]
Suppose the starting point snode is the first node b of the subject tree, whose children are z, e, c (with children f and g), i, and j:
- snode matches the pattern root proot and is appended to the result set R.
- The algorithm iterates over b's children to check whether they match any node in the set {c, j} of pattern children still to be matched.
- The third child of snode matches c, so a recursive call is invoked to match the NoK pattern tree c/g="Stevens" against the subject tree rooted at snode/c.
- The recursive call returns True. The other children are then checked, and eventually j is matched, which leaves the set empty.
- The result R contains the starting point b.
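The walk-through above can be sketched in Python. This is a simplified illustration, not the paper's algorithm: it handles parent-child edges only (no sibling-order constraints), and the names Node, PNode and nok_match, as well as the sample values, are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """Subject-tree node: tag, optional text value, ordered children."""
    tag: str
    value: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


@dataclass
class PNode:
    """Pattern-tree node: tag, optional value test, child patterns."""
    tag: str
    test: Optional[Callable[[str], bool]] = None
    children: List["PNode"] = field(default_factory=list)


def nok_match(snode: Node, pnode: PNode) -> bool:
    """Return True if the pattern rooted at pnode matches at snode.

    Simplification (assumption): only parent-child edges are checked.
    """
    if snode.tag != pnode.tag:
        return False
    if pnode.test is not None:
        if snode.value is None or not pnode.test(snode.value):
            return False
    pending = list(pnode.children)      # pattern children still unmatched
    for child in snode.children:        # single pass over snode's children
        if not pending:
            break
        # Drop every pending pattern that this child satisfies.
        pending = [p for p in pending if not nok_match(child, p)]
    return not pending                  # success when the set is empty


# Illustrative data for the first book b; the values are sample inputs.
book = Node("b", children=[
    Node("z"),
    Node("e", "TCP/IP Illustrated"),
    Node("c", children=[Node("f", "W."), Node("g", "Stevens")]),
    Node("i", "Addison-Wesley"),
    Node("j", "69.95"),
])

# Pattern b[c/g="Stevens"][j<100]
pattern = PNode("b", children=[
    PNode("c", children=[PNode("g", test=lambda v: v == "Stevens")]),
    PNode("j", test=lambda v: float(v) < 100),
])

result = [book] if nok_match(book, pattern) else []
```

Each child of snode is visited once, which mirrors the "iterate over the children" step of the slide.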

Physical storage
Desiderata for the design of the physical storage scheme:
- Structural information should be stored separately from value information.
- The subject tree should be "materialized" to fit the paged I/O model.
- The storage scheme should have enough auxiliary information (e.g. indexes on values and tag names) to speed up NoK pattern matching.
- The storage scheme should be adaptable to support updates.

Value information storage
Value information and structural information should be stored separately, based on two observations:
- An XML document is a mixture of structural information and value information.
- Any path query can be divided into two subqueries: pattern matching on the tree structure and selection based on values.
Example:
- Path expression: //book[author/last="Stevens"][price<100]
- Structural constraints: //book[author/last][price]
- Value constraints: last="Stevens" and price<100
Separating structural and value information separates the two concerns so that each can be addressed appropriately:
- a B+ tree on the value information;
- a path index or tag-name index on the structural information.

Value information storage (cont.)
Maintain the connection between structural and value information:
- Use the Dewey ID as the key of a tree node to reconnect the two, e.g. the Dewey ID of the root a is 0 and the Dewey ID of its second child b is 0.2.
- Given a Dewey ID, a second B+ tree locates the value of the node in the data file.
Lookup chain:
- B+ tree: hashed value -> Dewey ID
- B+ tree: Dewey ID -> pointer to the value in the data file
- Data file
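A minimal sketch of the Dewey ID numbering and the two lookups, assuming plain Python dictionaries as stand-ins for the two B+ trees and CRC32 as a stand-in hash function. The tuple layout and the function names are hypothetical, chosen only for illustration.

```python
import zlib
from typing import Dict, List, Optional, Tuple

# A node is (tag, value, children); value is None for nodes without text.
Tree = Tuple[str, Optional[str], list]


def assign_dewey(tree: Tree, dewey: str = "0") -> Dict[str, Tree]:
    """Map Dewey IDs to nodes in pre-order: root is 0, children count from 1."""
    ids = {dewey: tree}
    for i, child in enumerate(tree[2], start=1):
        ids.update(assign_dewey(child, f"{dewey}.{i}"))
    return ids


def hashed(value: str) -> int:
    """Stand-in hash (assumption); the slides do not name a hash function."""
    return zlib.crc32(value.encode("utf-8"))


def build_indexes(ids: Dict[str, Tree]):
    """Build dict stand-ins for the two B+ trees."""
    value_index: Dict[int, List[str]] = {}   # hashed value -> Dewey IDs
    dewey_index: Dict[str, str] = {}         # Dewey ID -> value (stands for a pointer)
    for dewey, (_tag, value, _children) in ids.items():
        if value is not None:
            value_index.setdefault(hashed(value), []).append(dewey)
            dewey_index[dewey] = value
    return value_index, dewey_index


# Illustrative two-book fragment of the bibliography.
tree: Tree = ("a", None, [
    ("b", None, [("e", "TCP/IP Illustrated", []), ("g", "Stevens", [])]),
    ("b", None, [("e", "Advanced Programming in the Unix", []), ("g", "Stevens", [])]),
])

ids = assign_dewey(tree)
value_index, dewey_index = build_indexes(ids)
stevens_nodes = value_index[hashed("Stevens")]   # candidate starting points
```

Because different values can share a hash, a real lookup would still compare the stored value with the search value after following the Dewey ID.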

Value information storage (cont.)
- In the data file, each element's content is represented by a pair (len, value), e.g. (4, "1994"), (7, "Stevens"), (5, "69.95").
- The Dewey ID B+ tree stores the positions of these records in the data file.
- If more than one node has the same value, only one copy is kept and all of these nodes point to the same position.
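The record layout can be illustrated with a short sketch. The 2-byte little-endian length prefix, the dictionary standing in for the Dewey ID B+ tree, and the Dewey IDs in the sample input are all assumptions; the slides only state that each record is a (len, value) pair and that duplicate values share one copy.

```python
import struct
from typing import Dict, Tuple


def build_data_file(values_by_dewey: Dict[str, str]) -> Tuple[bytes, Dict[str, int]]:
    """Pack values as (len, value) records and map Dewey IDs to record offsets."""
    data = bytearray()
    offset_of_value: Dict[str, int] = {}   # keeps one copy per distinct value
    dewey_to_offset: Dict[str, int] = {}   # stand-in for the Dewey ID B+ tree
    for dewey, value in values_by_dewey.items():
        if value not in offset_of_value:
            raw = value.encode("utf-8")
            offset_of_value[value] = len(data)
            # Assumed layout: 2-byte length, then the value bytes.
            data += struct.pack("<H", len(raw)) + raw
        dewey_to_offset[dewey] = offset_of_value[value]
    return bytes(data), dewey_to_offset


def read_value(data: bytes, offset: int) -> str:
    """Read the (len, value) record that starts at offset."""
    (length,) = struct.unpack_from("<H", data, offset)
    return data[offset + 2: offset + 2 + length].decode("utf-8")


# Hypothetical Dewey IDs; two nodes carry the same value "Stevens".
values = {
    "0.1.1": "1994",
    "0.1.3.2": "Stevens",
    "0.1.5": "69.95",
    "0.2.3.2": "Stevens",
}
data_file, dewey_to_offset = build_data_file(values)
```

Both "Stevens" nodes end up with the same offset, so the value is stored once.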

Structural information storage
Store the nodes in pre-order and preserve the tree structure by inserting pairs of parentheses.
- E.g. (a(b)(c)) represents the tree that has a root a with two children b and c.
- "(" indicates the beginning of a subtree; ")" indicates the end of the subtree.
- Each node implies an opening parenthesis, so the opening parentheses can be omitted: (a(b)(c)) is stored as a b) c)).
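A small sketch of this encoding, assuming a tree is given as nested (tag, children) tuples and that the string is handled as a list of tokens. The function names are illustrative only.

```python
from typing import List, Tuple

# A tree is (tag, children), with children in document order.
Tree = Tuple[str, list]


def to_tokens(tree: Tree) -> List[str]:
    """Pre-order encoding: emit the tag, then the children, then ')'."""
    tag, children = tree
    tokens = [tag]
    for child in children:
        tokens.extend(to_tokens(child))
    tokens.append(")")      # closes the subtree opened implicitly by the tag
    return tokens


def from_tokens(tokens: List[str]) -> Tree:
    """Rebuild the tree, showing that the encoding loses no structure."""
    stack: List[Tree] = [("", [])]          # sentinel parent for the root
    for tok in tokens:
        if tok == ")":
            node = stack.pop()
            stack[-1][1].append(node)       # attach the finished subtree
        else:
            stack.append((tok, []))         # a tag implies an opening parenthesis
    return stack[0][1][0]


tree: Tree = ("a", [("b", []), ("c", [])])
tokens = to_tokens(tree)
encoded = " ".join(tokens)                  # "a b ) c ) )"
```

The round trip through from_tokens is only a check on the encoding; the query operations described later work on the string directly.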

String representation
Figure: the string representation of an XML tree, annotated with the depth (level) of each node from the root.
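The level of every node can be read off the string in one left-to-right pass. The sketch below assumes the root is at level 1 and that the string is available as a token list; both are conventions chosen for this example.

```python
from typing import List, Tuple


def node_levels(tokens: List[str]) -> List[Tuple[str, int]]:
    """Return (tag, level) for each node, scanning left to right."""
    depth = 0
    levels = []
    for tok in tokens:
        if tok == ")":
            depth -= 1              # a subtree ends
        else:
            depth += 1              # a tag opens a subtree one level deeper
            levels.append((tok, depth))
    return levels


# String representation of a with children b and c; c has children f and g.
tokens = "a b ) c f ) g ) ) )".split()
levels = node_levels(tokens)
```

No tree is built: a single counter is enough to recover each node's depth.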

Structural information storage (cont.)
For each page, an extra tuple (st, lo, hi) is stored, where
- st is the level of the last node in the previous page;
- lo and hi are the minimum and maximum levels of all nodes in that page.
Page layout for structural information:
- Header: (st, lo, hi) and a pointer to the next page
- String representation, e.g. a b z) e) c f) g)) i) j)) b z) e) c f
- Space reserved for updates
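A sketch of how the (st, lo, hi) headers could be computed while splitting the string into pages. It rests on two assumptions not fixed by the slides: a page holds a fixed number of tokens, and a closing parenthesis counts as having the level reached after it closes a subtree, so that st is the running depth at the end of the previous page.

```python
from typing import List, NamedTuple, Tuple


class Page(NamedTuple):
    header: Tuple[int, int, int]    # (st, lo, hi)
    tokens: List[str]


def paginate(tokens: List[str], page_size: int) -> List[Page]:
    """Cut the token string into pages and compute each page's header."""
    pages = []
    depth = 0                                   # running depth across pages
    for start in range(0, len(tokens), page_size):
        chunk = tokens[start:start + page_size]
        st = depth                              # depth at the end of the previous page
        seen = []
        for tok in chunk:
            depth += -1 if tok == ")" else 1    # assumption: ')' lowers the level
            seen.append(depth)
        pages.append(Page((st, min(seen), max(seen)), chunk))
    return pages


# Two books under a; with 24 tokens per page, page 1 ends at the second f.
tokens = ("a b z ) e ) c f ) g ) ) i ) j ) ) "
          "b z ) e ) c f ) g ) ) i ) j ) ) )").split()
pages = paginate(tokens, page_size=24)
```

With these conventions, a page whose lo is at least L cannot contain the closing parenthesis of a node at level L, which is what would allow a sibling search to skip it.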

Advantages of the page layout
- The extra tuple (st, lo, hi) makes it possible to guess the page in which the following sibling or the parent is located.
- It is easy to insert nodes into the string representation of the tree.
E.g. to insert a b) c)) as a subtree of the first f node in page 1:
1. Allocate a new page with the content a b) c)).
2. Cut the content after f in page 1 and paste it at the end of the content of the new page.
3. Insert the new page between pages 1 and 2.
4. Update the (st, lo, hi) tuple for page 1.
With page 1 holding a b z) e) c f) g)), page 1 is left with a b z) e) c f and the new page becomes a b) c)) ) g)). Page 2, i) j)) b z) e) d f, is unchanged.
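The insertion steps can be sketched on pages held as token lists. Headers are left out to keep the example short; in the scheme described above the (st, lo, hi) tuples of the affected pages would be recomputed afterwards. The function name and the argument layout are assumptions.

```python
from typing import List


def insert_subtree(pages: List[List[str]], page_no: int, pos: int,
                   subtree: List[str]) -> None:
    """Insert subtree as the first child of the node at pages[page_no][pos]."""
    page = pages[page_no]
    # Steps 1 and 2: the new page holds the subtree, then the cut tail of the old page.
    new_page = list(subtree) + page[pos + 1:]
    pages[page_no] = page[:pos + 1]
    # Step 3: link the new page right after the page that was split.
    pages.insert(page_no + 1, new_page)


pages = [
    "a b z ) e ) c f ) g ) )".split(),      # page 1
    "i ) j ) ) b z ) e ) d f".split(),      # page 2
]
before = [tok for page in pages for tok in page]

first_f = pages[0].index("f")               # the first f node in page 1
insert_subtree(pages, 0, first_f, "a b ) c ) )".split())
```

Only the split page and the new page are touched, so the cost of an insertion does not depend on the size of the document.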

XML path queries at the physical level
In NoK pattern matching, the only operation on the subject tree is iteration over the children of a specific node. With the physical storage scheme, this operation is divided into two:
- find the first child of a specific node;
- find the following sibling of a node.
Because the string representation is in pre-order, both operations can be performed by scanning the node level information of each page from left to right, without reconstructing the tree structure.

Example
Find the first child of the node b in the first page.
- The first child of b must be the next character, provided that character is not ")".
- If b is at level L, its first child is at level L+1.
Answer: b's right neighbor z.

Example
Find b's following sibling.
- The following sibling must be located to the right of b in the string, and its level must be the same as b's.
Answer: the b in page 2.
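Both operations can be sketched as scans over the string representation. For brevity this example works on one flat token list rather than on pages, so the page-skipping use of (st, lo, hi) is not shown; the function names are assumptions.

```python
from typing import List, Optional


def first_child(tokens: List[str], i: int) -> Optional[int]:
    """Index of the first child of the node at index i, or None if it is a leaf."""
    nxt = i + 1
    if nxt < len(tokens) and tokens[nxt] != ")":
        return nxt                  # the right neighbor is a tag, one level deeper
    return None


def following_sibling(tokens: List[str], i: int) -> Optional[int]:
    """Index of the next node at the same level under the same parent, or None."""
    depth = 0                       # level relative to the node at index i
    for k in range(i + 1, len(tokens)):
        if tokens[k] == ")":
            depth -= 1
            if depth < -1:
                return None         # the parent closed before any sibling appeared
        else:
            if depth == -1:
                return k            # first tag seen after the node's own ')'
            depth += 1
    return None


# Two books under a; indices are positions in the token list.
tokens = ("a b z ) e ) c f ) g ) ) i ) j ) ) "
          "b z ) e ) c f ) g ) ) i ) j ) ) )").split()

first_b = tokens.index("b")
child = first_child(tokens, first_b)
sibling = following_sibling(tokens, first_b)
```

Iterating over all children of a node is then a call to first_child followed by repeated calls to following_sibling.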

Experimental setting
The queries are selected according to three properties of path expressions:
- Selectivity: a path expression that returns a small number of results should be evaluated faster than one that returns a large number.
- Topology: the shape of the pattern tree can be a single path or bushy.
- Value constraints: value constraints, together with an index on values, can be used to quickly locate the starting points for NoK pattern matching.

Performance

Conclusion
- Defined a special type of pattern tree, the NoK pattern tree.
- Proposed a novel approach for efficiently evaluating path expressions by NoK pattern matching.
- NoK pattern matching can be evaluated efficiently using the physical storage scheme.
- Performance evaluation shows that the system performs better than, or comparably to, existing systems.

Limitations
- The locating step of the NoK pattern matching process needs more optimization.
- A path index could be used instead of a tag-name index.
- It remains to consider how to employ concurrency control and how it affects the update process.

References
- Ning Zhang, Varun Kacholia, M. Tamer Ozsu. A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML.
- D. Chamberlin, P. Fankhauser, M. Marchiori, and J. Robie. XML Query Use Case. Available at
- E. Cohen, H. Kaplan, S. Padmanabhan, and R. Bordawekar. Labeling Your XML. Preliminary version presented at CASCON'02, October
- N. Zhang and M. T. Ozsu. Optimizing Correlated Path Expressions in XML Languages. Technical Report CS , University of Waterloo, November Available at

Thank you! Questions?