An Improved Prefix Labeling Scheme: A Binary String Approach for Dynamic Ordered XML. Changqing Li, Tok Wang Ling. Department of Computer Science, School of Computing, National University of Singapore.

Presentation transcript:

An Improved Prefix Labeling Scheme: A Binary String Approach for Dynamic Ordered XML
Changqing Li, Tok Wang Ling
Department of Computer Science, School of Computing, National University of Singapore

2 Overview
- Related work and motivation
- Labeling with our ImprovedBinary scheme
- Order-sensitive update and query
- Experimental results
- Conclusion

3 Related work – Containment scheme
Containment scheme [1]: node u is an ancestor of node v iff u.start < v.start < v.end < u.end, that is, the interval of v is contained in the interval of u.
[Figure: example XML tree with (start, end) labels. Root: 1,20. Children: 2,3; 4,5; 6,11; 12,13; 14,19. Grandchildren: 7,8 and 9,10 under 6,11; 15,16 and 17,18 under 14,19.]
[1] Rakesh Agrawal, Alexander Borgida, H. V. Jagadish: Efficient Management of Transitive Relationships in Large Data and Knowledge Bases. SIGMOD Conference 1989
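The containment test can be sketched in a few lines of Python. This is only an illustration: the `Interval` alias and the function name `is_ancestor` are hypothetical, and the labels follow the slide's example tree.

```python
from typing import Tuple

Interval = Tuple[int, int]  # hypothetical alias: (start, end) label of one node


def is_ancestor(u: Interval, v: Interval) -> bool:
    """True when u's interval strictly contains v's interval."""
    u_start, u_end = u
    v_start, v_end = v
    return u_start < v_start < v_end < u_end


# Labels following the slide's example tree.
root, second_child, third_child, grandchild = (1, 20), (4, 5), (6, 11), (7, 8)
print(is_ancestor(root, grandchild))          # the root's interval contains 7,8
print(is_ancestor(third_child, grandchild))   # 6,11 contains 7,8
print(is_ancestor(second_child, grandchild))  # 4,5 does not contain 7,8
```

The check needs only four integer comparisons, which is why containment labels are attractive for static documents.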

4 Related work – Containment scheme
Update is bad: inserting a node requires re-labeling all the ancestor nodes and all the nodes after the inserted node in document order.
[Figure: original labels 1,20; 2,3; 4,5; 6,11; 12,13; 14,19; 7,8; 9,10; 15,16; 17,18. After inserting a new node labeled 2,3, the existing labels become 1,22; 4,5; 6,7; 8,13; 14,15; 16,21; 9,10; 11,12; 17,18; 19,20.]
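A minimal sketch of why this update is expensive, assuming labels are assigned by a single depth-first numbering. The node names and the `(name, children)` tuple representation are hypothetical; only the tree shape follows the slide.

```python
def containment_labels(tree):
    """Label a (name, children) tree with (start, end) pairs by depth-first numbering."""
    labels, counter = {}, 0

    def visit(node):
        nonlocal counter
        name, children = node
        counter += 1
        start = counter
        for child in children:
            visit(child)
        counter += 1
        labels[name] = (start, counter)

    visit(tree)
    return labels


# Hypothetical node names; the shape matches the slide's example tree.
children = [("a", []), ("b", []), ("c", [("c1", []), ("c2", [])]),
            ("d", []), ("e", [("e1", []), ("e2", [])])]
before = containment_labels(("root", children))
after = containment_labels(("root", [("new", [])] + children))

# Existing nodes whose label differs once a new first child is inserted.
relabeled = [name for name in before if before[name] != after[name]]
print(len(relabeled), "of", len(before), "existing nodes changed label")
```

In this sketch the root goes from 1,20 to 1,22 and every other existing node shifts by two, matching the labels in the figure.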

5 Related work – Containment scheme
How to decrease the update cost:
- Increase the interval size. A small interval size easily leads to re-labeling, while a large interval size wastes many values and increases storage.
- Use real (floating-point) values [2]. A floating-point value is represented in a computer with a fixed number of bits, just like an integer. As a result, there are only a finite number of values between any two real values.
[2] Toshiyuki Amagasa, Masatoshi Yoshikawa, Shunsuke Uemura: QRS: A Robust Numbering Scheme for XML Documents. ICDE 2003

6 Related work – Prefix scheme
Prefix scheme: node u is an ancestor of node v iff label(u) is a prefix of label(v).
- DeweyID [6]: integer based
- Binary [4]: binary string based
[4] Edith Cohen, Haim Kaplan, Tova Milo: Labeling Dynamic XML Trees. PODS 2002
[6] Igor Tatarinov, Stratis Viglas, Kevin S. Beyer, Jayavel Shanmugasundaram, Eugene J. Shekita, Chun Zhang: Storing and querying ordered XML using a relational database system. SIGMOD Conference 2002
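A small sketch of the prefix test for integer-based labels. The dotted notation (for example "1.3.2"), the tuple representation and the function names are assumptions made for illustration, not taken from the slides. Comparing components rather than raw characters matters, because "1.1" is a character prefix of "1.10" even though the two nodes are siblings.

```python
from typing import Tuple

DeweyLabel = Tuple[int, ...]  # hypothetical representation, e.g. (1, 3, 2) for "1.3.2"


def parse(label: str) -> DeweyLabel:
    """Split a dotted label such as "1.10.2" into integer components."""
    return tuple(int(part) for part in label.split("."))


def is_ancestor(u: DeweyLabel, v: DeweyLabel) -> bool:
    """True when u is a proper prefix of v, compared component by component."""
    return len(u) < len(v) and v[:len(u)] == u


print(is_ancestor(parse("1.1"), parse("1.1.2")))  # parent and child
print(is_ancestor(parse("1.1"), parse("1.10")))   # siblings, not ancestor and descendant
```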

7 Related work – Prefix scheme
Update is better. DeweyID: need to re-label all the sibling nodes after the inserted node and all the descendants of these siblings.
[Figure: original labels, and labels after inserting a node.]
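The re-labeling cost can be illustrated with a rough sketch. It assumes children are numbered 1, 2, 3, ... under their parent and that the root carries the label (1,); the node names and tree shape are hypothetical.

```python
def dewey_labels(tree, prefix=(1,)):
    """Map each node name to a Dewey-style tuple; children are numbered 1, 2, 3, ..."""
    name, children = tree
    labels = {name: prefix}
    for position, child in enumerate(children, start=1):
        labels.update(dewey_labels(child, prefix + (position,)))
    return labels


def relabeled(old, updated):
    """Names of existing nodes whose label changed."""
    return sorted(name for name in old if old[name] != updated[name])


# Hypothetical tree: five children, two of which have two children each.
children = [("a", []), ("b", []), ("c", [("c1", []), ("c2", [])]),
            ("d", []), ("e", [("e1", []), ("e2", [])])]
new = ("new", [])
before = dewey_labels(("root", children))
front = dewey_labels(("root", [new] + children))
middle = dewey_labels(("root", children[:3] + [new] + children[3:]))

print(relabeled(before, front))   # every following sibling and its descendants
print(relabeled(before, middle))  # only d, e and e's children
```

Unlike the containment scheme, the ancestors keep their labels, so the earlier in the sibling list the insertion happens, the more nodes are affected.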

8 Related work – Prefix scheme
Update is better. Binary: same number of nodes to re-label as DeweyID.
[Figure: original labels, and labels after inserting a node.]

9 Related work – Prime scheme
Prime scheme [7]: node u is an ancestor of node v iff label(v) mod label(u) = 0.
[Figure: example tree. The number above each node is its document order, the number to its right is its label, and the numbers below each label are its parent_label and self_label (a prime number). Pairs shown: (11*23), (11*19), (5*17), (5*13) with document orders 9, 8, 5, 4; (1*7), (1*2), (1*3), (1*5), (1*11).]
[7] Xiaodong Wu, Mong-Li Lee, Wynne Hsu: A Prime Number Labeling Scheme for Dynamic Ordered XML Trees. ICDE 2004: 66-78
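A minimal sketch of the divisibility test, assuming every node has a distinct prime self_label and the root's label is 1. The function names are hypothetical; the numbers follow the (parent_label*self_label) pairs on the slide.

```python
def prime_label(parent_label: int, self_label: int) -> int:
    """Label of a node: its parent's label times its own prime self_label."""
    return parent_label * self_label


def is_ancestor(u_label: int, v_label: int) -> bool:
    """True when label(v) mod label(u) = 0 and the two nodes are different."""
    return u_label != v_label and v_label % u_label == 0


child_5 = prime_label(1, 5)         # a child of the root
child_11 = prime_label(1, 11)       # another child of the root
node_65 = prime_label(child_5, 13)  # 5*13
node_253 = prime_label(child_11, 23)  # 11*23

print(is_ancestor(child_5, node_65))   # 65 is divisible by 5
print(is_ancestor(child_5, node_253))  # 253 is not divisible by 5
```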

10 Related work – Prime scheme
Update is based on the Chinese Remainder Theorem (CRT) [3] and the Simultaneous Congruence (SC) value.
The SC value is such that SC mod 2 = 1 (2 is the self_label, 1 is the document order), SC mod 3 = 2, SC mod 5 = 3, SC mod 13 = 4, ···, and SC mod 23 = 9.
On insertion, the existing nodes need not be re-labeled, but the SC value must be re-calculated. More than one SC value is used to prevent a single SC value from becoming a very large number.
[Figure: original labels (11*23), (11*19), (5*17), (5*13), (1*7), (1*2), (1*3), (1*5), (1*11). After inserting a node (1*29), the existing labels, such as (11*23) = 253, are unchanged.]
After insertion, the new SC value is such that SC mod 29 = 1, SC mod 2 = 2, SC mod 3 = 3, SC mod 5 = 4, SC mod 13 = 5, ···, SC mod 23 = 10.
[3] James A. Anderson and James M. Bell, Number Theory with Application, Prentice-Hall, New Jersey
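As an illustration of what computing an SC value involves, here is a textbook CRT construction in Python (3.8 or later). It is a sketch, not the cited authors' implementation: `sc_value` is a hypothetical name, and it assumes the self_labels are distinct primes and each document order is smaller than its self_label, as in the slide's pre-insertion example.

```python
from math import prod  # available from Python 3.8


def sc_value(self_labels, orders):
    """Smallest non-negative x with x mod p == order for each (p, order) pair.

    Assumes the self_labels are pairwise coprime (distinct primes) and each
    order is smaller than its self_label.
    """
    modulus = prod(self_labels)
    total = 0
    for p, order in zip(self_labels, orders):
        partial = modulus // p
        # pow(partial, -1, p) is the modular inverse of partial modulo p.
        total += order * partial * pow(partial, -1, p)
    return total % modulus


# Self_labels listed in document order, following the slide's example tree.
self_labels = [2, 3, 5, 13, 17, 7, 11, 19, 23]
orders = list(range(1, len(self_labels) + 1))
sc = sc_value(self_labels, orders)
print(all(sc % p == k for p, k in zip(self_labels, orders)))
```

The modulus is the product of all the self_labels involved, which is why a single SC value grows quickly and why an insertion that shifts the document orders forces the value to be computed again.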

11 Motivation
- The containment scheme is very bad at updating.
- Prefix and prime schemes are better at updating and are called dynamic labeling schemes. We only compare the performance of dynamic labeling schemes in this paper.
- The prefix schemes DeweyID and Binary both need to re-label existing nodes.
- Prime need not re-label existing nodes, but needs to re-calculate the SC values. SC values are very large numbers, so the re-calculation is costly.
Our objective:
- Need not re-label existing nodes in updating
- Need not re-calculate any values in updating

12 Our ImprovedBinary scheme
How to label the XML tree based on our ImprovedBinary scheme:
- The label of the root is empty.
- The self_label of the first sibling is “01”.
- The self_label of the last sibling is “011”.
- Case (a): the left self_label size is less than or equal to the right self_label size. The middle self_label is the right self_label with its last character changed to “0” and one more “1” concatenated.
- Case (b): the left self_label size is larger than the right self_label size. The middle self_label is the left self_label with one more “1” concatenated.

13 Properties of our ImprovedBinary labels
Self_labels of siblings are lexically ordered: 01 < 01001 < 0101 < 01011 < 011
Labels are lexically ordered when compared component by component: 01 < … < 0101 < … < … < … < 011 < … < …
A component is the part of a label between two consecutive delimiters.
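A sketch of the component-by-component comparison in Python. The delimiter "." and the multi-component labels below are assumptions made for illustration; the slides do not show which delimiter is used. Python already compares strings of "0" and "1" in this lexical order (a proper prefix is smaller), and compares lists element by element, so splitting a label into components gives a usable sort key.

```python
DELIMITER = "."  # assumption: the delimiter character is not given in the slides


def order_key(label: str):
    """Split a label into components; an empty (root) label has no components."""
    return label.split(DELIMITER) if label else []


# Hypothetical labels: five siblings, two of which have two children each.
labels = ["011.011", "01", "0101.01", "011", "01011",
          "0101", "01001", "0101.011", "011.01"]
for label in sorted(labels, key=order_key):
    print(label)
```

Sorting with this key lists every node after its ancestors and before its following siblings, which is document order.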

14 The formal labeling algorithm for our ImprovedBinary
If the size of the left self_label is smaller than or equal to the size of the right self_label, the self_label of the middle node is the right self_label with its last character changed to “0” and one more “1” concatenated. Otherwise, the self_label of the middle node is the left self_label concatenated with “1”.
Algorithm 1: AssignMiddleSelfLabel
Input: left self label self_label_L, and right self label self_label_R
Output: middle self label self_label_M, such that self_label_L < self_label_M < self_label_R lexically
begin
1: calculate the size of self_label_L and the size of self_label_R
2: if size(self_label_L) <= size(self_label_R) then
     self_label_M = self_label_R with its last character changed to “0”, concatenated with one more “1”
3: else if size(self_label_L) > size(self_label_R) then
     self_label_M = self_label_L concatenated with “1”
end
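Algorithm 1 translates almost directly into Python. The sketch below uses a hypothetical function name and plain strings of "0" and "1" as self_labels; Python's built-in string comparison matches the lexical order used here.

```python
def assign_middle_self_label(left: str, right: str) -> str:
    """Return a self_label strictly between left and right in lexical order."""
    if len(left) <= len(right):
        # Case (a): change the last character of right to "0", then append "1".
        return right[:-1] + "0" + "1"
    # Case (b): left is longer than right, so append "1" to left.
    return left + "1"


middle = assign_middle_self_label("01", "011")
print(middle, "01" < middle < "011")  # prints: 0101 True
```

Only the two neighbouring self_labels are read, and neither is modified, which is the property the update cases rely on.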

15 The formal labeling algorithm for our ImprovedBinary
SubLabeling is a recursive function; self_label is an array.
Algorithm 2: Labeling
Input: XML document
Output: label of each node
begin
1: for all the sibling child nodes of each node of the XML document
2:   for the first sibling node, self_label[1] = “01” //self_label is an array
3:   if the number of sibling nodes SN > 1 then
       self_label[SN] = “011”
       self_label = SubLabeling(self_label, 1, SN)
4:   label = prefix_label concatenated with the delimiter and each element of the self_label array
end
SubLabeling
Input: self_label array, left element index L of the self_label array, and right element index R of the self_label array
Output: self_label array
begin
1: M = floor((L+R)/2) //M refers to the M-th element of the self_label array
2: if L+1 < R then
     self_label[M] = AssignMiddleSelfLabel(self_label[L], self_label[R])
     SubLabeling(self_label, L, M)
     SubLabeling(self_label, M, R)
end
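The sibling part of Algorithm 2 can be sketched as follows. The function names are hypothetical, the array is kept 1-based to mirror the pseudocode, and only self_labels are produced; prepending the parent's label and a delimiter is left out.

```python
def assign_middle_self_label(left: str, right: str) -> str:
    """AssignMiddleSelfLabel: a self_label strictly between left and right."""
    if len(left) <= len(right):
        return right[:-1] + "0" + "1"
    return left + "1"


def sub_labeling(self_label, left_index, right_index):
    """Recursively fill the positions strictly between left_index and right_index."""
    middle_index = (left_index + right_index) // 2
    if left_index + 1 < right_index:
        self_label[middle_index] = assign_middle_self_label(
            self_label[left_index], self_label[right_index])
        sub_labeling(self_label, left_index, middle_index)
        sub_labeling(self_label, middle_index, right_index)


def sibling_self_labels(sibling_count: int):
    """Self_labels for sibling_count siblings (at least 1), in document order."""
    self_label = [None] * (sibling_count + 1)  # index 0 unused, as in the pseudocode
    self_label[1] = "01"
    if sibling_count > 1:
        self_label[sibling_count] = "011"
        sub_labeling(self_label, 1, sibling_count)
    return self_label[1:]


print(sibling_self_labels(5))
```

For five siblings this yields 01, 01001, 0101, 01011, 011. Because the middle position is filled first, sizes grow with the recursion depth rather than with the number of siblings.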

16 Order-sensitive update for our ImprovedBinary
Case (1): Insert a node before the first sibling node. The self_label of the inserted node is the first self_label with its last character changed to “0” and one more “1” concatenated.
The label of node a will be 001. After insertion, the order is still kept: label(a) < 01 lexically.
[Figure: tree with the inserted nodes a, b, c and d.]

17 Order-sensitive update for our ImprovedBinary
Case (2): Insert a node at any position between the first and last sibling nodes. We use the AssignMiddleSelfLabel algorithm to assign the self_label of the newly inserted node.
The label of node b will be … and the label of node c will be …. After insertion, the order is still kept: … < label(b) < 0101 lexically and … < label(c) < … lexically.
[Figure: tree with the inserted nodes a, b, c and d.]

18 Order-sensitive update for our ImprovedBinary
Case (3): Insert a node after the last sibling node. The self_label of the inserted node is the last self_label concatenated with one more “1”.
The label of node d will be 0111. After insertion, the order is still kept: 011 < label(d) lexically.
[Figure: tree with the inserted nodes a, b, c and d.]

19 Order-sensitive update for our ImprovedBinary
For all three cases:
- We need not re-label the existing nodes.
- We need not re-calculate any values to keep the document order.
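The three insertion cases can be combined into one function. This is a sketch under assumptions: the function name is hypothetical, `None` stands for a missing neighbour, and the behaviour for a node with no siblings at all (returning "01") is an assumption rather than something stated in the slides.

```python
from typing import Optional


def self_label_for_insert(left: Optional[str], right: Optional[str]) -> str:
    """Self_label for a node inserted between left and right (None = no neighbour)."""
    if left is None and right is None:
        return "01"  # assumption: an only child gets the first-sibling self_label
    if left is None:
        # Case (1): before the first sibling.
        return right[:-1] + "0" + "1"
    if right is None:
        # Case (3): after the last sibling.
        return left + "1"
    # Case (2): between two siblings, same rule as AssignMiddleSelfLabel.
    if len(left) <= len(right):
        return right[:-1] + "0" + "1"
    return left + "1"


siblings = ["01", "0101", "011"]
siblings.insert(0, self_label_for_insert(None, siblings[0]))  # new first sibling
siblings.append(self_label_for_insert(siblings[-1], None))    # new last sibling
print(siblings, siblings == sorted(siblings))
```

Each call reads at most two neighbouring self_labels and writes only the new one, so the existing siblings and their descendants keep their labels.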

20 Order-sensitive query
1) position = i: selects the i-th node within a context node set.
2) preceding-sibling or following-sibling: selects all the preceding (following) sibling nodes of the context node.
3) preceding or following: selects all the nodes before (after) the context node in document order, excluding any ancestors (descendants).
When a node is inserted, DeweyID and Binary need to re-label the existing nodes, and Prime needs to re-calculate the SC values, before they can process order-sensitive queries.
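One way such axes could be answered from prefix labels alone is sketched below. It is an illustration under assumptions, not the evaluation strategy used in the experiments: the delimiter ".", the helper names and the example labels are hypothetical, and the functions scan a plain list instead of using an index.

```python
DELIMITER = "."  # assumption: the delimiter character is not given in the slides


def components(label):
    """Split a label into self_label components; the root's empty label gives []."""
    return label.split(DELIMITER) if label else []


def preceding_siblings(context, labels):
    """Nodes with the same parent as the context node that come before it."""
    ctx = components(context)
    return [l for l in sorted(labels, key=components)
            if len(components(l)) == len(ctx)
            and components(l)[:-1] == ctx[:-1]
            and components(l) < ctx]


def following(context, labels):
    """Nodes after the context node in document order, excluding its descendants."""
    ctx = components(context)
    return [l for l in sorted(labels, key=components)
            if components(l) > ctx and components(l)[:len(ctx)] != ctx]


def preceding(context, labels):
    """Nodes before the context node in document order, excluding its ancestors."""
    ctx = components(context)
    return [l for l in sorted(labels, key=components)
            if components(l) < ctx and ctx[:len(components(l))] != components(l)]


# Hypothetical document: root (empty label), five children, four grandchildren.
labels = ["", "01", "01001", "0101", "0101.01", "0101.011",
          "01011", "011", "011.01", "011.011"]
print(preceding_siblings("0101", labels))
print(following("0101", labels))
print(preceding("0101.011", labels))
```

All three functions depend only on comparing labels, so they keep working after an insertion as long as the labels of existing nodes stay valid.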

21 Experimental result -- datasets
Datasets are available at [8]. The depths of real XML documents are usually not too high [5].
For each dataset, the table lists the topic, the # of files, the max fan-out for a file, the max depth for a file, and the total # of nodes.
Datasets  Topics
D1        Bib
D2        Club
D3        Movie
D4        Sigmod Record
D5        Department
D6        Actor
D7        Company
D8        Shakespeare’s play
D9        NASA
[5] Laurent Mignet, Denilson Barbosa, Pierangelo Veltri: The XML web: a first study. WWW 2003
[8] The Niagara Project Experimental Data.

22 Experimental results -- storage
Storage requirement: ImprovedBinary has the smallest label size in each dataset.

23 Experimental results -- query
Query performance: ImprovedBinary works the fastest for most of the queries, both ordered and un-ordered.
Queries                                               # of nodes returned
Q1  /play/act[4]                                      370
Q2  /play/act[5]//preceding::scene                    6110
Q3  /play/act/scene/speech[2]                         7300
Q4  /play/*/*                                         19380
Q5  /play/act//speech[3]/preceding-sibling::*         30930
Q6  /play//act[2]/following::speaker
Q7  /play//scene/speech[6]/following-sibling::speech
Q8  /play/act/scene/speech
Q9  /play/*//line

24 Experimental results -- update
Update performance: here we study the update performance of the Hamlet XML file in D8. The update performance of the other XML files in D8 is similar. Acts are the child nodes of the root play in the Hamlet XML file. Hamlet has 5 acts, and we test the following six cases:
- inserting an act before act[1]
- inserting an act between act[1] and act[2]
- inserting an act between act[2] and act[3]
- inserting an act between act[3] and act[4]
- inserting an act between act[4] and act[5]
- inserting an act after act[5]

25 Experimental results -- update
Update performance: number of nodes to re-label.
For Prime, the number of SC values to re-calculate is counted, with one SC value for every three nodes. One SC value for 4 or more nodes cannot be calculated using 64 bits (the long type in Java).
Our ImprovedBinary need not re-label any existing nodes and need not re-calculate any values.

26 Experimental results -- update
Update performance: time to re-label or re-calculate.
The re-calculation time for Prime is more than 337 times larger than the re-labeling time for DeweyID and Binary. Our ImprovedBinary needs 0 milliseconds for the update.
The query time and re-labeling (re-calculation) time in this paper refer to the processing time only, not including the I/O time.

27 Conclusion
Our contribution: we proposed a node labeling scheme called ImprovedBinary to decrease the label update cost.
- This scheme need not re-label any existing nodes.
- It need not re-calculate any values.

28 Reference
[1] Rakesh Agrawal, Alexander Borgida, H. V. Jagadish: Efficient Management of Transitive Relationships in Large Data and Knowledge Bases. SIGMOD Conference 1989
[2] Toshiyuki Amagasa, Masatoshi Yoshikawa, Shunsuke Uemura: QRS: A Robust Numbering Scheme for XML Documents. ICDE 2003
[3] James A. Anderson and James M. Bell, Number Theory with Application, Prentice-Hall, New Jersey
[4] Edith Cohen, Haim Kaplan, Tova Milo: Labeling Dynamic XML Trees. PODS 2002
[5] Laurent Mignet, Denilson Barbosa, Pierangelo Veltri: The XML web: a first study. WWW 2003
[6] Igor Tatarinov, Stratis Viglas, Kevin S. Beyer, Jayavel Shanmugasundaram, Eugene J. Shekita, Chun Zhang: Storing and querying ordered XML using a relational database system. SIGMOD Conference 2002
[7] Xiaodong Wu, Mong-Li Lee, Wynne Hsu: A Prime Number Labeling Scheme for Dynamic Ordered XML Trees. ICDE 2004: 66-78
[8] The Niagara Project Experimental Data.