1 Efficient Processing of XPath Queries Using Indexes Yan Chen 1, Sanjay Madria 1, Kalpdrum Passi 2, Sourav Bhowmick 3 1 Department of Computer Science,

Presentation transcript:

1 Efficient Processing of XPath Queries Using Indexes Yan Chen 1, Sanjay Madria 1, Kalpdrum Passi 2, Sourav Bhowmick 3 1 Department of Computer Science, University of Missouri-Rolla, Rolla, MO 65409, USA 2 Dept. of Math. & Computer Science, Laurentian University, Sudbury ON P3E 2C6 Canada 3 School of Computer Engineering, Nanyang Technological University, Singapore

2 Querying Semistructured Data Query languages for semistructured data –XQuery, XML-QL, XML-GL, Lorel, and Quilt Semistructured data is represented as a graph Queries on such data are expressed as regular path expressions XPath is a language that defines the syntax for addressing parts of XML data through path expressions Indexes on XML data improve query performance on large XML files Indexing techniques used in relational and object-oriented databases do not suffice for semistructured data because of the irregular structure of the data

3 Indexing Semistructured Data Dataguides –record information on the existing paths in a database –do not provide any information about parent-child relationships between nodes in the database –as a result, they cannot be used for navigation from an arbitrary node T-indexes –specialized path indexes, which summarize only a limited class of paths –1-index and 2-index are special cases of T-indexes

4 Indexing Semistructured Data Lore –Uses four different types of index structures: value, text, link, and path indexes –The value index and text index are used to search for objects that have specific values –The link index and path index provide fast access to the parents of an object and to all objects reachable via a given labeled path –Lore uses OEM (Object Exchange Model) to store data and Lorel, an extension of OQL (Object Query Language), as its query language

5 Indexing Semistructured Data ToXin –has two different types of index structure: the value index and the path index –The path index has two parts, the index tree and the instance functions; these functions can be used to trace the parent-child relationship –ToXin's path index contains only parent and children information, whereas in our model we store the complete path from the root to each node –ToXin uses an index for a single level, while we use multiple indexes for different levels

6 A Sample XML File David Chris Chris Michael Jason Tomas

7 XML as DOM Tree

8 Indexing XML Data - Motivation Retrieve all the books with author’s name as “Chris” from the Benny-bookstore –We need to find all the nodes in the DOM tree with child nodes of BOOKSTORE as BOOK. –Then for each BOOK, we need to test the author’s name. –After about 100,000 comparisons we get a couple of books with author “Chris” as the output –By using index on AUTHOR, we do not need to test author of each BOOK node. –With the index of the key as “Chris”, we can find all author nodes faster –The nodes obtained can be checked if they satisfy the query condition. –This is a “bottom-up” query plan. –Such a plan is useful in the case when we have a relatively “small” result set at the bottom, which can be pre-selected

9 Indexing XML Data - Motivation Find all the books with the name beginning with “glory” and the author as “Chris” –The query plan could be to get all the books with the name “glory”, disregarding their authors. –If only a small number of books satisfy the constraint (e.g., four “glory” books), it might be useful to introduce another type of index, which is built on the values of some nodes. –Here, we need an index on strings. –On the basis of the nodes obtained in the first step, we can then test the other condition of the query. –Hence, we can build a set of nodes as the “entry set”, which will depend on the specific query and on the type of XML data

10 Types of Indexes Name-index (Nindex) –A name-index locates nodes with a given tag name –The Nindex for the incoming tag over the XML fragment in figure 2 will then be {&2, &3, &4, &13, &16, &19} Value-index (Vindex) –A value-index locates nodes with a given value –The Vindex for the word “Chris” is {&10, &12}; for the word “the” it is {&2, &4} Path-index (Pindex) –A path-index locates nodes by their path from the root node –The path-index is the information we attach to each node to record the path of its ancestors –In the DOM tree, the path information of node &11 is {&1, &4}; that of node &7 is {&1, &2} Descent Number (DN) –The Descent Number is the information we attach to every node to record the number of its descendants –In the DOM tree, the DN of node &11 is 0; the DN of node &3 is 2
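The four structures on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the tiny bookstore document, its tag names, the preorder node numbering, and the function name `build_indexes` are all hypothetical and are not the deck's sample file or its &-numbering. The value index here is keyed by individual words, following the slide's “Chris” and “the” examples.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document (not the deck's sample file).
XML = """<BOOKSTORE><NAME>Benny-bookstore</NAME>
<BOOK><TITLE>Glory Days</TITLE><AUTHOR>Chris</AUTHOR></BOOK>
<BOOK><TITLE>The Road</TITLE><AUTHOR>David</AUTHOR></BOOK>
</BOOKSTORE>"""


def build_indexes(root):
    """Return (nindex, vindex, pindex, dn) keyed by preorder node ids."""
    nindex = defaultdict(list)  # tag name -> node ids
    vindex = defaultdict(list)  # word in the node's text -> node ids
    pindex = {}                 # node id -> ancestor ids, root first
    dn = {}                     # node id -> number of descendants
    counter = 0

    def visit(elem, ancestors):
        nonlocal counter
        counter += 1
        node_id = counter
        nindex[elem.tag].append(node_id)
        text = (elem.text or "").strip()
        for word in set(text.split()):
            vindex[word].append(node_id)
        pindex[node_id] = list(ancestors)
        total = 0
        for child in elem:
            # each child counts itself plus its own descendants
            total += 1 + visit(child, ancestors + [node_id])
        dn[node_id] = total
        return total

    visit(root, [])
    return nindex, vindex, pindex, dn


nindex, vindex, pindex, dn = build_indexes(ET.fromstring(XML))
print(nindex["BOOK"], vindex["Chris"], pindex[5], dn[3])
```

With this numbering the root is node 1, the first BOOK is node 3, and its AUTHOR is node 5, so the path information of node 5 is its two ancestors and its DN is 0.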

11 Example for XPath Queries Addison-Wesley Serge Abiteboul Rick Hull Victor Vianu Foundations of Databases 1995 Freeman Jeffrey D. Ullman Principles of Database and Knowledge Base Systems 1998

12 Data Model for XPath Much like the XQuery data model [Figure: document tree showing the root, the root element bib, a book element with publisher (“Addison-Wesley”) and author (“Serge Abiteboul”) children, a processing instruction, and a comment]

13 XPath: Simple Expressions /bib/book/year Result: /bib/paper/year Result: empty (there were no papers)
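The two expressions on this slide can be tried with Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. This is a sketch, assuming a bibliography document rebuilt from the text of slide 11: the element names `bib`, `book`, and `year` come from the slide's path expressions, `publisher` and `author` from slide 12, and `title` is a guess. Because `findall` is evaluated relative to the root element `bib`, the absolute path `/bib/book/year` is written as `./book/year`.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the slide 11 data; the tag names are partly guesses.
BIB = """<bib>
<book><publisher>Addison-Wesley</publisher>
<author>Serge Abiteboul</author><author>Rick Hull</author><author>Victor Vianu</author>
<title>Foundations of Databases</title><year>1995</year></book>
<book><publisher>Freeman</publisher>
<author>Jeffrey D. Ullman</author>
<title>Principles of Database and Knowledge Base Systems</title><year>1998</year></book>
</bib>"""

bib = ET.fromstring(BIB)

# /bib/book/year, evaluated relative to the root element <bib>
book_years = [y.text for y in bib.findall("./book/year")]

# /bib/paper/year: there are no <paper> elements, so nothing matches
paper_years = [y.text for y in bib.findall("./paper/year")]

print(book_years, paper_years)
```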

14 Entry-point Technique We find an entry-point node among a set of middle level nodes in the XPath expression. Then we split the XPath expression at the entry-point and test for the path condition for the first part and eliminate nodes from DOM tree that do not satisfy the path condition. Then we test the remaining part of the XPath expression recursively eliminating nodes that do not satisfy the path condition. The algorithm can be implemented either using top-down approach or bottom-up approach

15 Entry-point Technique – An Example Select BOOKSTORE/BOOK where BOOK.name = “Glory days” and /AUTHOR.title = “Chris” and BOOKSTORE.name = “Benny-bookstore” The above query is transformed to the following XPath expression: /BOOKSTORE[name = “Benny-bookstore”]/child::BOOK[title = “Glory Days”]/child::AUTHOR/child::FIRSTNAME[name = “Chris”] Use the Nindex to get all BOOK nodes or all AUTHOR nodes

16 Entry-point Technique – An Example In the first strategy, get all books named “Glory Days” and then test for each one of them whether the author is “Chris”: /BOOKSTORE[name = “Benny-bookstore”]/child::BOOK[title = “Glory Days”] Then, we test each author child node, which is the latter part of the XPath expression: /child::AUTHOR/child::FIRSTNAME[name = “Chris”] In the second strategy, first get all authors named “Chris”, and then test whether the book name of the parent node is “Glory Days”
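The two strategies can be compared on a toy document. This is a minimal sketch, assuming a simplified bookstore in which AUTHOR holds the name directly (the slide's FIRSTNAME level and the bookstore name test are left out); the document and variable names are hypothetical. `ElementTree` has no parent pointers, so the bottom-up strategy builds a parent map first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified bookstore: AUTHOR carries the name directly.
STORE = """<BOOKSTORE>
<BOOK><TITLE>Glory Days</TITLE><AUTHOR>Chris</AUTHOR></BOOK>
<BOOK><TITLE>Glory Days</TITLE><AUTHOR>David</AUTHOR></BOOK>
<BOOK><TITLE>Other</TITLE><AUTHOR>Chris</AUTHOR></BOOK>
</BOOKSTORE>"""

root = ET.fromstring(STORE)

# child element -> parent element, needed to walk upward
parent = {child: p for p in root.iter() for child in p}

# Strategy 1 (top-down): select books by title, then test each author.
glory_books = [b for b in root.findall("BOOK")
               if b.findtext("TITLE") == "Glory Days"]
top_down = [b for b in glory_books if b.findtext("AUTHOR") == "Chris"]

# Strategy 2 (bottom-up): select authors named "Chris", then test the parent book.
chris_authors = [a for a in root.iter("AUTHOR") if a.text == "Chris"]
bottom_up = [parent[a] for a in chris_authors
             if parent[a].findtext("TITLE") == "Glory Days"]

print(len(top_down), len(bottom_up))
```

Both strategies return the same single BOOK element; they differ only in how many candidates each first step produces, which is what the entry-point choice tries to minimize.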

17 Entry-point Root-first Algorithm
INPUT: XPath expression root/X_1/X_2/…/X_i/…/X_m
STEP 1: FOR each X_i
  BEGIN
    IF X_i is indexed THEN
    BEGIN
      get every node x_i of type X_i
      get the DN n_i of each x_i
      Sum_i = Σ n_i
    END
  END
STEP 2: Get entry point X_n with minimum Sum; add all x_n to a node set S
  Consider the tree obtained after deleting all branches that do not have a node x_n in their path
  Split the XPath into root/X_1/X_2/…/X_{n-1} and /X_{n+1}/…/X_m at the entry point X_n
STEP 3: FOR each node x_n in S
  BEGIN
    IF the path from root to node x_n is not included in the path root/X_1/X_2/…/X_{n-1}/X_n
    THEN delete the sub tree that does not satisfy the path condition
  END
STEP 4: FOR each node x_n in S, consider all sub trees starting with x_n
  BEGIN
    IF X_{n+1}/…/X_m is the same as /X_m THEN return nodes X_m
    ELSE INPUT = X_n/X_{n+1}/…/X_m; GO TO STEP 1
  END
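The steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it handles child-axis steps only (no `//`), evaluates the remaining suffix by a direct downward walk instead of recursing into STEP 1, and the `Node` class, the helper names, and the toy tree are all hypothetical.

```python
class Node:
    """Minimal tree node with a parent pointer (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def add(self, tag):
        child = Node(tag, parent=self)
        self.children.append(child)
        return child


def descendants(node):
    """Descent Number: how many descendants the node has."""
    return sum(1 + descendants(c) for c in node.children)


def all_nodes(node):
    yield node
    for c in node.children:
        yield from all_nodes(c)


def tag_path(node):
    """Tag names from the root down to the node."""
    path = []
    while node is not None:
        path.append(node.tag)
        node = node.parent
    return path[::-1]


def entry_point_query(root, steps, indexed):
    """Evaluate the child-axis path `steps` (root tag first)."""
    nindex = {}
    for n in all_nodes(root):
        nindex.setdefault(n.tag, []).append(n)
    # STEP 1: sum the DNs of the nodes of every indexed step
    sums = {i: sum(descendants(n) for n in nindex.get(tag, []))
            for i, tag in enumerate(steps) if tag in indexed}
    # STEP 2: entry point = indexed step with minimum sum (fall back to the root)
    entry = min(sums, key=sums.get) if sums else 0
    candidates = nindex.get(steps[entry], [])
    # STEP 3: keep entry nodes whose root path matches the prefix
    prefix = steps[:entry + 1]
    survivors = [n for n in candidates if tag_path(n) == prefix]
    # STEP 4: evaluate the remaining steps below the surviving nodes
    for tag in steps[entry + 1:]:
        survivors = [c for n in survivors for c in n.children if c.tag == tag]
    return survivors


# Toy tree: A/B/C/E has two H children; A/D/E has one H child.
root = Node("A")
e1 = root.add("B").add("C").add("E")
h1, h2 = e1.add("H"), e1.add("H")
e2 = root.add("D").add("E")
h3 = e2.add("H")

result = entry_point_query(root, ["A", "B", "C", "E", "H"], indexed={"B", "E"})
print([tag_path(n) for n in result])
```

In the toy tree, E has the smaller DN sum (3 against 4 for B), so E becomes the entry point; the E node under D fails the prefix test and only the two H nodes under A/B/C/E are returned.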

18 Example – Entry-point Root-first Algorithm X-Path: A/B/C/E//H

19 Example – Entry-point Root-first Algorithm Step 1: calculate descent numbers (DN) of the nodes that have indexes DN of node B = 31 DN of node E = 18 Entry-point = node E (minimum DN)

20 Example – Entry-point Root-first Algorithm Step 2: Delete the branches that do not have E XPath – A/B/C/E and E//H

21 Example – Entry-point Root-first Algorithm Step 3: test A/B/C/E on each E node and discard the right most sub tree with node E Step 4: evaluate E//H on each E and finally we get the three H nodes Cost – O(N) where N is the number of nodes

22 Rest-tree Conception Performance deterioration in Entry-point algorithm –Find books written by “David” where the title of the book contains the word “book” –The XML file might have hundreds of books having the word “book” in the title and –further there might be a large number of books by author “David”, but only one of them has the word “book” in its title –The Entry-point algorithm first eliminates all the nodes that do not have the word “book” in its title. –Then it eliminates the nodes that do not have “David” as the author –Due to relatively large number of instances at the two levels, large number of eliminations is required

23 Rest-tree Conception The tree formed by the nodes that meet certain condition at its level, along with its descendant and ancestor nodes In the example, the Rest-tree of the node that satisfies the condition that the node has the word “glory” in its title, is as shown

24 Rest-tree Conception First employ the Entry-point algorithm to find all nodes that meet the condition statements at each level The final result will then be the intersection of the Rest-trees of these nodes In practice, we do not need to find the Rest-tree of every node satisfying the condition. Only a small set of nodes is left after applying the Entry-point algorithm, so we need to find the Rest-trees of a relatively small set of nodes within a small sub tree To get the intersection of the Rest-trees, note that the nodes that satisfy the query condition and have the minimum number of descendants are available from the Entry-point algorithm

25 Rest-tree Conception The minimum level is the anchor level of the Rest-tree algorithm. We just need to intersect the Rest-trees at this minimum level. For example, after the first step of the Entry-point algorithm, we know there are 2000 nodes at Level A that meet, say, condition A, 1000 nodes at Level B that meet condition B, 200 nodes at Level C, 3000 at Level D, and 400 at Level E. The minimum level is C and the order of the levels is C → E → B → A → D
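The level ordering in this example is just a sort of the levels by their candidate counts. A small worked sketch, using the counts from the slide (the variable names are illustrative):

```python
# Number of candidate nodes per level, taken from the example on the slide.
level_counts = {"A": 2000, "B": 1000, "C": 200, "D": 3000, "E": 400}

# Process levels from the fewest candidates to the most.
order = sorted(level_counts, key=level_counts.get)
anchor = order[0]  # the minimum level anchors the intersection

print(anchor, order)
```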

26 Rest-tree Conception Ancestor node information is available as path-index Filter some nodes at Level C by checking the grandparent node information of the 400 nodes at Level E Similarly, we can filter some other nodes at Level C by checking the parent node information of the nodes at Level B. The intersection at Level C will be complete by checking ancestor information at Level D nodes. The final step is to get all the nodes that satisfy the query requirement

27 Rest-tree Algorithm
INPUT: XPath expression root/X_1/X_2/…/X_i/…/X_m
STEP 1: FOR each X_i
  BEGIN
    IF X_i is indexed THEN
    BEGIN
      get every node x_i of type X_i
      get the DN n_i of each x_i
      Sum_i = Σ n_i
    END
  END
STEP 2: Get entry point X_j with minimum Sum; add all x_j to a node set S_j
  Get comparison point X_k with second minimum Sum; add all x_k to a node set S_k
STEP 3: IF level j > k
    FOR each node x_k in S_k
      IF its ancestor is not in S_j THEN delete x_k from S_k
  ELSE
    FOR each node x_j in S_j
      IF its ancestor is not in S_k THEN delete x_j from S_j
STEP 4: FOR each node x_j in S_j
  BEGIN
    IF the path from root to node x_j is not included in the path root/X_1/X_2/…/X_j
    THEN delete the sub tree that does not satisfy the path condition
  END
STEP 5: FOR each node x_j in S_j, consider all sub trees starting with x_j
  BEGIN
    IF X_{j+1}/…/X_m is the same as /X_m THEN return nodes X_m
    ELSE INPUT = X_j/X_{j+1}/…/X_m; GO TO STEP 1
  END
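STEP 3 is the part that distinguishes this algorithm from the Entry-point algorithm, and it can be sketched with the path-index alone. The sketch below is one reading of that step, assuming that the deeper of the two node sets is filtered by whether any ancestor belongs to the shallower set; the node ids, the `pindex` contents, and the function name are hypothetical.

```python
def filter_by_ancestor(deeper, shallower, pindex):
    """Keep the nodes of `deeper` that have at least one ancestor in `shallower`.

    pindex maps a node id to the list of its ancestor ids (the path-index).
    """
    shallow = set(shallower)
    return [n for n in deeper if shallow.intersection(pindex[n])]


# Hypothetical path-index: node 4 lies below node 3, node 6 does not.
pindex = {1: [], 2: [1], 3: [1, 2], 4: [1, 2, 3], 5: [1], 6: [1, 5]}

c_nodes = [3]      # comparison-point nodes (e.g. the C nodes)
e_nodes = [4, 6]   # entry-point nodes (e.g. the E nodes)

kept = filter_by_ancestor(e_nodes, c_nodes, pindex)
print(kept)
```

Only ancestor lists are consulted, so no tree traversal is needed for this pruning step; node 6 is dropped because none of its ancestors is a C node.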

28 Rest-tree Algorithm - Example XPath - A/B/C/E//H Step 1: Calculate DNs DOM Tree

29 Rest-tree Algorithm - Example Step 2: Minimum DN DN of node B = 32 DN of node C = 20 DN of node E = 18

30 Rest-tree Algorithm - Example Step 3: Delete “E” nodes whose ancestor does not have “C”

31 Rest-tree Algorithm - Example Step 4: Delete the subtree that does not satisfy the path A/B/C/E Step 5: Get all the nodes from E//H

32 Test Cases and Comparisons Size of DOM Tree –The Entry-point algorithm performs much better than the traditional algorithm, taking less than one third of the processing time of the traditional algorithm [Figure: Increasing Number of Nodes for XPath //A20//C30//A80]

33 Test Cases and Comparisons Result Nodes Set –The processing time of the Entry-point algorithm increases slightly with an increasing number of result nodes. –This is partly due to the recursive function call in the Entry-point algorithm code [Figure: Increasing Number of Result Nodes]

34 Test Cases and Comparisons Tree Height –The processing time of all three methods follows the same trend as the height of the tree increases [Figure: Tree Height Increasing]

35 Test Cases and Comparisons Without Index on result nodes –The traditional method performs very poorly, falling into the same category as the no-index method. –However, the Entry-point algorithm still performs well [Figure: Tree Height Increasing]

36 Conclusions We proposed three types of indexes on XML data to execute XPath queries efficiently. We proposed two algorithms that use these indexes to optimize the processing of XPath queries. We have also simulated both bottom-up and top-down approaches. Processing an XPath query using the Entry-point indexing technique performs much better than traditional algorithms with or without indexes