
Node Indexes Interval Labeling Schemes Prefix Labeling Schemes Konsolaki Konstantina (624) konsolak@csd.uoc.gr University of Crete Department of Computer Science Fafalios Pavlos (623) fafalios@csd.uoc.gr May 2010

2 Outline Introduction Interval Labeling Schemes Prefix Labeling Schemes Comparison

3 Node Indexing Schemes Hold values that reflect the nodes’ position within the structure of an XML tree. Can solve both simple path and twig path queries. Use two types of labeling schemes: Interval labeling Schemes Prefix labeling Schemes

4 Labeling Schemes The purpose of a labeling scheme is to provide unique labels for each node in the XML tree A good labeling scheme should have the following characteristics: The relationships between two nodes should be uniquely and quickly determined simply by examining their labels Updating XML files should not require the re-labeling of nodes in the XML trees The size of the label should be minimal in order to fit in the main memory The scheme should be used to support all kinds of XPath functions Should follow the order of the XML document

5 Node Indexes vs. Graph Indexes Graph indexes treat each path as a whole during query evaluation. Node indexes deal with each node in the path separately. In graph indexes, the number of joins is reduced during query processing and therefore query performance is improved. In node indexes, at each step of query processing, a structural join is performed between two nodes, starting from one end of the path and finishing at the other end.

6 Node Indexes vs. Sequence Indexes Sequence indexes transform XML documents and queries into encoded sequences. Node indexes label each node of the XML document. In sequence indexes, answering a query requires sequence matching between the encoded sequences of the data and the query; simple path and twig queries are evaluated efficiently, without any extra join operations. In node indexes, answering a query requires structural joins among the labeled nodes; query evaluation is less efficient because of the multiple structural joins.

7 XML Document for our examples [Figure: the example XML document and the corresponding XML tree, with element nodes Bib, book, paper and author and text values Tim and Sarah.]

9 Interval Labeling Schemes

10 Outline Interval Labeling Scheme Beg-End Labeling Scheme Order-Size Labeling Scheme Prime Number Labeling Scheme Nested Tree Structure Label Size Experimental results Conclusion

11 Interval Labeling Scheme Interval based labeling schemes (otherwise known as Containment based labeling schemes or Region encoded labeling schemes) exploit the properties of tree traversal to maintain document order and to determine various structural relationships between nodes Tree traversal is the process of visiting each node in a tree data structure. Such traversals are characterized by the order in which the nodes are visited.

13 Beg-End Labeling Scheme A pair of numbers is assigned to each node in an XML document according to its sequential traversal order. Starting from the root element, each node is given a “Beg” number. When the end of an attribute, an attribute value, or an ending tag is reached, the “End” number is assigned. The “End” number is equal to the next sequential number. If the node is a leaf value, its “Beg” number equals its “End” number.

14 Example Bib bookpaper author Tim Sarah author (1,14) (2,6) (3,5) (4,4) (7,8) (9,13) (10,12) (11,11)

15 Properties [1] A “Level” is added to the (Beg,End) label to form a node- triplet identification label (Beg,End,Level) for each node in the tree, where “Level” represents the depth of an element in the tree. Ancestor-descendant relationship : In a given data-tree, node “x” is an ancestor of node “y” iff x.Beg < y.Beg < x.End (preorder property). Bib bookpaper author Tim Sarah author (1,14) (2,6) (3,5) (4,4) (7,8) (9,13) (10,12) (11,11)

16 Properties [2] Parent-child relationship: In a given data-tree, node “x” is a parent of node “y” iff x.Beg < y.Beg < x.End and y.Level = x.Level + 1. There is no way to locate the siblings of a given node using only the knowledge of its index numbers. Bib bookpaper author Tim Sarah author (1,14) (2,6) (3,5) (4,4) (7,8) (9,13) (10,12) (11,11)
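The numbering and the two relationship tests above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the tree is given as nested (name, children) tuples with text values as plain strings, and that an element takes one number at its start tag and one at its end tag while a text value takes a single number. The function names and the element names in the sample tree are assumptions.

```python
from itertools import count

def label_beg_end(tree):
    """Label a tree in document order; returns (name, beg, end, level) tuples."""
    counter = count(1)
    labels = []

    def visit(node, level):
        if isinstance(node, str):
            # Text value (leaf): Beg and End are the same number.
            n = next(counter)
            labels.append([node, n, n, level])
            return
        name, children = node
        entry = [name, next(counter), None, level]  # Beg at the start tag
        labels.append(entry)
        for child in children:
            visit(child, level + 1)
        entry[2] = next(counter)                    # End at the end tag

    visit(tree, 0)
    return [tuple(entry) for entry in labels]

def is_ancestor(x, y):
    """x is an ancestor of y iff x.Beg < y.Beg < x.End."""
    _, x_beg, x_end, _ = x
    _, y_beg, _, _ = y
    return x_beg < y_beg < x_end

def is_parent(x, y):
    """x is the parent of y iff x is an ancestor of y one level above it."""
    return is_ancestor(x, y) and y[3] == x[3] + 1

# Sample tree (element names are assumed for illustration).
TREE = ("Bib", [
    ("book", [("author", ["Tim"])]),
    ("paper", []),
    ("paper", [("author", ["Sarah"])]),
])
LABELS = label_beg_end(TREE)
```

With this sample tree the (Beg, End) pairs come out as (1,14), (2,6), (3,5), (4,4), (7,8), (9,13), (10,12), (11,11), the same pairs as in the example slide.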

17 Are updates possible? Updating the Beg-End labeling (numbering) scheme is costly. When a new node is inserted into the tree, all the nodes in the tree, except those in the left sibling subtrees of the inserted node, have to be updated. On the other hand, when a node is deleted, no re-labeling is needed.

18 Update example Bib bookpaper author Tim Sarah author (1,14) (2,6) (3,5) (4,4) (7,8) (9,13) (10,12) (11,11) paper (9,10) (11,15) (12,14) (13,13) (1,16)

20 Order-Size Labeling Scheme This labeling scheme uses an extended preorder. Each node is associated with a pair of numbers (order, size) as follows: For a tree node y and its parent x: order(x) < order(y) and order(y) + size(y) <= order(x) + size(x). For two sibling nodes x and y, if x is the predecessor of y in preorder traversal: order(x) + size(x) < order(y).

21 Example Bib bookpaper author Tim Sarah author (1,100) (10,30) (11,20) (17,10) (41,10) (60, 30) (62,20) (65,10)

22 Properties Ancestor-descendant relationship: For two given nodes x and y of a tree T, x is an ancestor of y if and only if: order(x) < order(y) <= order(x) + size(x). There is no way to locate the siblings of a given node, using only the knowledge of its index numbers. Bib bookpaper author Tim Sarah author (1,100) (10,30) (11,20) (17,10) (41,10) (60, 30) (62,20) (65,10)

23 Are updates possible? For a tree node x, size(x) >= Σ_y size(y) over all y that are direct children of x. Size(x) can therefore be an arbitrary integer larger than the total number of the current descendants of x. This makes the labeling scheme more flexible, and it can deal with dynamic updates of XML data more efficiently than the Beg-End scheme, because additional space is reserved for future insertions. Disadvantage: it is hard to predict the actual space requirements, so after several insertions the inserted data may exceed the reserved space, and in the worst case the whole data tree has to be relabeled.

24 Insertion without Re-labeling (1,100) (10,30) (11,20) (17,10) (41,10) (60, 30) (62,20) (65,10) Bib bookpaper author Tim Sarah author paper (53,5) No re-labeling is needed since order(x)+size(x) < order(y) still holds between siblings and size(x) >= Σ_y size(y) still holds for the parent.
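As a rough sketch, the order and size inequalities given for this scheme can be checked mechanically to decide whether a proposed label fits without re-labeling. The `Label` type and the function names are assumptions made for illustration; only the inequalities come from the slides.

```python
from typing import NamedTuple, Optional

class Label(NamedTuple):
    order: int
    size: int

def is_ancestor(x: Label, y: Label) -> bool:
    """x is an ancestor of y iff order(x) < order(y) <= order(x) + size(x)."""
    return x.order < y.order <= x.order + x.size

def fits_without_relabeling(parent: Label, new: Label,
                            left: Optional[Label] = None,
                            right: Optional[Label] = None) -> bool:
    """Check the parent and sibling inequalities for a proposed new label."""
    inside_parent = (parent.order < new.order
                     and new.order + new.size <= parent.order + parent.size)
    after_left = left is None or left.order + left.size < new.order
    before_right = right is None or new.order + new.size < right.order
    return inside_parent and after_left and before_right

bib = Label(1, 100)
left_sibling = Label(41, 10)
right_sibling = Label(60, 30)
```

For the labels in the example, `fits_without_relabeling(bib, Label(53, 5), left_sibling, right_sibling)` holds, while a proposed label of (58, 30) overlaps the right sibling and therefore does not fit.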

25 Insertions with Re-labeling (1,100) (10,35) (11,20) (17,10) (46,10) (54, 35) (62,20) (65,10) Bib bookpaper author Tim Sarah author paper (58,30) Re-labeling is needed since the conditions order(x)+size(x) < order(y) and size(x) >= Σ_y size(y) can no longer both be satisfied with the existing labels. (90, 35) (95,20) (100,10) (1,200)

27 Prime Number Labeling Scheme Divisibility property: If an integer X has a prime factor* Z which is not a prime factor of another integer Y, then Y is not divisible by X. In XML trees, if a node A has a descendant C which is not a descendant of another node B, then A cannot be a descendant of node B. Therefore, if the leaf nodes in XML are labeled by prime numbers and each non-leaf node is labeled with the product of the labels of its child nodes, then we can easily determine the ancestor-descendant relationship by using the divisibility property. *Prime factor: a prime number that divides the integer exactly. Example: X=6, Z=3 (prime number), Y=10. [Figure: nodes A, B and C.]

28 Bottom-Up Starting from the leaf nodes prime numbers are assigned to each leaf node. For each subsequent level, the parents labels are assigned as the product of their children’s labels. Bib bookpaper author (1155) (15*77) (15) (3*5) (3) (77) (7*11) author (5) (7) (11)

29 Properties of Bottom-Up Ancestor-descendant relationship : For any nodes x and y in an XML tree, x is an ancestor of y if and only if: label(x) mod label(y) = 0. Bib bookpaper author (1155) (15) (3) (77) author (5) (7) (11) There is no way to locate the siblings of a given node, using only the knowledge of its index numbers.

30 Disadvantages of Bottom-Up Can quickly result in relatively large numbers being assigned to nodes at the top of the tree. Special handling is required for those nodes that have only one child.

31 Top-Down Each node is given a unique prime number as its self-label (the root in the example uses 1), and the label of each node is the product of its parent node's label and its own self-label. Thus each label is a product of two factors: the first factor, inherited from the label of its parent, is called the “parent-label”; the second factor, the value assigned to the node by the labeling scheme, is called the “self-label”. 1 (1*1) Bib bookpaper author Tim Sarah author 2 (1*2) 14 (2*7) 182 (14*13) 3 (1*3) 5 (1*5) 55 (5*11) 935 (55*17) (each label is shown as parent-label * self-label)

32 Properties of Top-Down Ancestor-descendant relationship : For any nodes x and y in an XML tree, x is an ancestor of y if and only if: label(y) mod label(x) = 0. 1 (1*1) Bib bookpaper author Tim Sarah author 2 (1*2) 14 (2*7) 182 (14*13) 3 (1*3) 5 (1*5) 55 (5*11) 935 (55*17) There is no way to locate the siblings of a given node, using only the knowledge of its index numbers.

33 Are updates possible ? The top-down prime number labeling scheme is good for dynamic updates. When a new node is inserted, it is easy to simply assign a prime number that has not been assigned before as the self-label for the newly inserted node. No re-labeling is required. 1 (1*1) Bib bookpaper author Tim Sarah author 2 (1*2) 14 (2*7) 182 (14*13) 3 (1*3) 5 (1*5) 55 (5*11) 935 (55*17) paper 19 (1*19)
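A small sketch of the top-down scheme, using the numbers from the example tree. The helper names are assumptions; the trial-division primality test is only meant for small illustrative inputs.

```python
def child_label(parent_label, self_label):
    """Top-down label: the parent's label times the node's own prime self-label."""
    return parent_label * self_label

def is_ancestor(x, y):
    """x is an ancestor of y iff label(y) mod label(x) = 0 (and x is not y)."""
    return x != y and y % x == 0

def is_prime(n):
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def next_unused_prime(used):
    """Smallest prime that has not yet been handed out as a self-label."""
    n = 2
    while n in used or not is_prime(n):
        n += 1
    return n

# Labels of the example tree (the root is labeled 1).
bib = 1
book = child_label(bib, 2)         # 2
author1 = child_label(book, 7)     # 14
tim = child_label(author1, 13)     # 182
paper = child_label(bib, 3)        # 3
third = child_label(bib, 5)        # 5
author2 = child_label(third, 11)   # 55
sarah = child_label(author2, 17)   # 935

# Inserting a new child of the root: take a prime that is still unused.
used_primes = {2, 3, 5, 7, 11, 13, 17}
new_paper = child_label(bib, next_unused_prime(used_primes))
```

Here `new_paper` is 19, and no existing label changes. The sketch also shows the cost noted for this scheme: every insertion consumes a larger prime, so labels grow with the number of updates.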

34 Top-Down Disadvantage In the prime number labeling scheme each prime number can only be used once. Hence, the self-label of a node that is subsequently inserted is always larger than self-labels of existing nodes. This implies that the size of the labels will increase when the smaller prime numbers are used up. Thus after a few insertions the space size for the node label will be huge.

36 Nested Tree Structure Definition: A Nested Tree is a subtree which has an interval- based number as a node of the containing tree and its own interval based numbering as a tree. Bib book paper author Tim (1,50) (7,20) (11,15) (13,13) (30,35) (23,27) paper Sarah author (29;1,29;12) (29;5,29;9) (29;7,29;7) Nested Tree

37 K-Nested Tree Bib book paper author Tim (1,50) (7,20) (11,15) (13,13) (30,35) (23,27) paper Sarah author (29;1,29;12) (29;5,29;9) (29;7,29;7) 2-Nested Tree 1-Nested Tree A 1-Nested Tree is a Nested Tree of the XML data tree which is not included in any other Nested Tree. A k-Nested Tree T_k is a Nested Tree that is included in a (k-1)-Nested Tree T_{k-1} such that no other Nested Tree includes T_k and is included in T_{k-1}.

38 StartList-EndList of a Node Bib book paper author Tim (1,50) (7,20) (11,15) (13,13) (30,35) (23,27) paper Sarah author (29;1,29;12) (29;5,29;9) (29;7,29;7) 2-Nested Tree 1-Nested Tree StartList=[(1,50),29;1] EndList=[(1,50),29;12] The startList of any tree node N is the list s_1, ..., s_n; s_{n+1}, where s_i is the label of the i-Nested Tree of the node N (i = 1, 2, ..., n) and s_{n+1} is the start position of N in the n-Nested Tree T. The endList of node N is defined in the same way, except that the start position is replaced by the end position of N.

39 Nested Tree’s Label The label of each node can be represented as the 4-tuple (DocID, sList, eList, Level), where: DocID is the identifier of the document; sList and eList are the startList and endList of the node, respectively; Level is the depth of the node in the data tree. Bib book paper author Tim (1,50) (7,20) (11,15) (13,13) (30,35) (23,27) paper Sarah author (29;1,29;12) (29;5,29;9) (29;7,29;7) 2-Nested Tree 1-Nested Tree For example, the red node's label is (1, [(1,50),29;1], [(1,50),29;12], 2), assuming that DocId = 1.

40 Ancestor-Descendant Relationship Bib book paper author Tim (1,50) (7,20) (11,15) (13,13) (30,35) (23,27) paper Sarah author (29;1,29;12) (29;5,29;9) (29;7,29;7) 2-Nested Tree 1-Nested Tree The red node's label is (1, 1, 50, 1). The blue node's label is (1, [(1,50),29;5], [(1,50),29;9], 3), assuming that DocId = 1. The red node is an ancestor of the blue node because they have the same DocId and 1 < 29 < 50. In general, node X is an ancestor of node Y if Beg(X) < NestedTreeLabel(Y) < End(X).

41 Parent-Child Relationship Bib book paper author Tim (1,50) (7,20) (11,15) (13,13) (30,35) (23,27) paper Sarah author (29;1,29;12) (29;5,29;9) (29;7,29;7) 2-Nested Tree 1-Nested Tree The red node's label is (1, 1, 50, 1). The blue node's label is (1, [(1,50),29;1], [(1,50),29;12], 2), assuming that DocId = 1. The red node is the parent of the blue node because they have the same DocId, 1 < 29 < 50, and Level_b = Level_r + 1. In general, node X is the parent of node Y if Beg(X) < NestedTreeLabel(Y) < End(X) and Level(Y) = Level(X) + 1.

42 Insertion of a Node The space is the range of integers that can be used as new labels for the inserted data, and the size of the space is the number of integers in that range. The size of the space is called SpaceSize and the size of the inserted data InsertSize. Bib book paper author Tim (1,50) (7,20) (11,15) (13,13) (30,35) (23,27) paper For example, the SpaceSize between the red and blue nodes is 2.

43 Insertion of a Node The insertion of a node can be divided into three cases: 1st case, SpaceSize > InsertSize: Use the integers in the range of the space as labels for the inserted subtree. 2nd case, 0 < SpaceSize <= InsertSize: Treat the inserted subtree as a new Nested Tree and label the Nested Tree with an integer in the range of the space. 3rd case, SpaceSize = 0: Combine the inserted subtree with the subtree rooted by the parent of the inserted subtree, treat the combined subtree as one Nested Tree and label the Nested Tree with an integer in the space.
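The case analysis above can be sketched as a small decision function. The function names are assumptions, and so is the gap formula: it simply counts the integers strictly between the End of the left neighbour and the Beg of the right neighbour, which gives 2 for the neighbours (23,27) and (30,35).

```python
def space_size(left_end, right_beg):
    """Number of unused integers strictly between two neighbouring intervals."""
    return right_beg - left_end - 1

def insertion_case(space, insert_size):
    """Return 1, 2 or 3 according to the three insertion cases."""
    if space > insert_size:
        return 1  # label the inserted subtree directly with integers from the space
    if space > 0:
        return 2  # wrap the inserted subtree in a new Nested Tree
    return 3      # no space left: extend the Nested Tree to the parent's subtree

# Gaps after a left neighbour ending at 27, for three different right neighbours.
gap_case_1 = space_size(27, 35)  # 7
gap_case_2 = space_size(27, 30)  # 2
gap_case_3 = space_size(27, 28)  # 0
```

For an inserted subtree that needs 5 integers, these three gaps fall into cases 1, 2 and 3 respectively.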

44 Insertion of a Node: 1st Case The first case does not need a new method to process data insertion, because the SpaceSize is enough to label the nodes of the newly inserted tree. Bib book paper author Tim (1,50) (7,20) (11,15) (13,13) (35,40) (23,27) paper Sarah author (28,32) (29,31) (30,30) Inserted Tree SpaceSize=7 InsertedData=5

45 Insertion of a Node: 2nd Case In the second case the size of the inserted subtree is larger than the size of the space. But if the newly inserted subtree is treated as one Nested Tree, only one integer is needed for the label of the new Nested Tree. Accordingly, if the size of the space is one or more, the nodes in the original data tree do not need to be relabeled for the new data insertion. Bib book paper author Tim (1,50) (7,20) (11,15) (13,13) (30,35) (23,27) paper Sarah author (29;1,29;5) (29;2,29;4) (29;3,29;3) Inserted Tree SpaceSize=2 InsertedData=5

46 Insertion of a Node: 3rd Case In the third case, the scope of the new Nested Tree is extended such that the Nested Tree includes the subtree rooted by the parent of the inserted subtree. In this case, it is required to relabel some nodes in the original data tree. infoBooks book paper author Tim (5,50) (7,20) (11,15) (13,13) (28,35) (23,27) paper Sarah author (5;9,5;13) (5;10,5;12) (5;11,5;11) Inserted Tree SpaceSize=0 InsertedData=5 (5;1,5;16) (5;2,5;6) (5;3,5;5) (5;4,5;4) (5;7,5;8) (5;14,5;15)

47 Deletion of a Node In the interval labeling scheme in case of deletion no processing is required. However, the more subtree insertions occur, the more Nested Trees are created. The more Nested Trees are created, the longer the lengths of the startList and endList of nodes are. The deletion is classified by two cases: Release the last Nested Tree in which the deleted subtree is included Release following-sibling or preceding-sibling Nested Trees of the deleted subtree

48 Deletion of a Node: 1st Case Bib book paper author Tim (1,50) (7,20) (11,15) (13,13) (31,35) (23,27) paper Sarah author (29;1,29;12) (29;5,29;9) (29;7,29;7) Nested Tree (28,29) PositionSize=3 RemainSize=2 PositionSize is the size of the space in which the Nested Tree is included. RemainSize is the size of the Nested Tree after delete processing.

49 Deletion of a Node: 2nd Case Bib book paper author Tim (1,50) (7,20) (11,15) (13,13) (30,35) (23,27) paper Sarah author (29;1,29;12) (29;5,29;9) (29;7,29;7) Nested Tree PositionSize=26 RemainSize=5 (14,18) (15,17) (16,16)

51 Label Size
Labeling Scheme | Label Size
Beg-End | O(log N)
Order-Size | O(log N)
Prime Numbers | D·log(θ_N)
Nested Tree | O(log N)
where: N is the number of nodes of an XML tree, D is the maximal depth, and θ_N is the maximal prime number that has been used to label the nodes.

53 Experimental Environment The experiments were carried out on an Intel Pentium, 1.7 GHz with 1 GB memory, running Windows XP. All procedures are implemented in Java. All experiments were repeated 10 times independently.

54 Experimental Data Three data sets are used: The XMark data set contains information about auctions. The Shakespeare data set represents Shakespeare’s plays in XML format. The Nasa data set contains astronomical data.
Data set | Size | Nodes | Depth
Shakespeare | 7.7 MB | 179,619 | 7
XMark | 115.7 MB | 1,666,315 | 12
Nasa | 25.2 MB | 476,646 | 8

55 Insertion Processing Measure the processing time of inserting 382 nodes as the size of the original data increases. [Figure: insertion time (ms, logarithmic scale from 1 to 100000) versus size of original data (10 to 50 MB) for Beg-End, Prime and Nested Tree.]

56 Conclusions from the diagram In the Beg-End labeling scheme, the relabeling of nodes in the original data tree is inevitable when new data is inserted, and the number of nodes to be relabeled increases as the size of the original data increases. In the Prime approach, the label of a node is determined by the product of the self-label and the label of the parent node, so the data insertion time exceeds that of the Nested approach. In the Nested approach, data insertion is processed by a simple integer assignment to each node, so its performance is the best.

58 Conclusion The Beg-End labeling scheme can’t be used for updates because the re-labeling of nodes is inevitable, when a new node is inserted. In the Order-Size labeling scheme, it is hard to predict the space requirements and thus in most cases the re-labeling is needed. In the Prime Number labeling scheme after a few insertions the space size for the node label will be huge. The Nested Tree Structure can handle efficiently the updates.

60 Prefix Labeling Schemes

61 Outline Introduction Prefix Labeling Schemes Dewey ORDPATH LSDX Persistent Evaluation

62 Prefix Labeling Schemes In a prefix labeling scheme, the label of a node in the XML tree often consists of: A prefix, which often represents the label of all the ancestors of the node. A delimiter, which in most cases is the fullstop “.” A positional identifier, which indicates the position of the node relative to its siblings.

63 Prefix vs. Interval Labeling Schemes Prefix labeling schemes: Handle updates more easily and more efficiently than interval labeling schemes. Support the sibling relationship. However: Extra space is required to store paths. Storage size increases quickly as the depth and the breadth of the tree increase. Inferring the ancestor/descendant relationship is slightly more costly.

65 Dewey - Structure Each node is assigned a label that represents the path from the document’s root to the node. Each component of the label represents the local order of an ancestor node. Nodes with the same number of delimiters (“.”) in their label are in the same level. Bib bookpaper author Tim Sarah author (0) (0.0) (0.0.0) (0.0.0.0) (0.1) (0.2) (0.2.0) (0.2.0.0) Tatarinov et al. - 2002

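66 Dewey – Supported Queries (1/3) Ancestors / Descendants Node “X” is an ancestor of node “Y” if the label of node “X” is a proper prefix of the label of node “Y”, compared component by component (for example, 0.2 is a prefix of 0.2.0.0, while 0.0 is not, although it occurs inside it as a substring). Bib bookpaper author Tim Sarah author (0) (0.0) (0.0.0) (0.0.0.0) (0.1) (0.2) (0.2.0) (0.2.0.0)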

67 Dewey – Supported Queries (2/3) Parent / Child Node “X” is the parent of node “Y” if: - The label of node “X” is a prefix of the label of node “Y” and - frags(X) = frags(Y) – 1, where frags(X) is the number of delimiters in the label of node X and frags(Y) is the number of delimiters in the label of node Y. Bib bookpaper author Tim Sarah author (0) (0.0) (0.0.0) (0.0.0.0) (0.1) (0.2) (0.2.0) (0.2.0.0)

68 Dewey – Supported Queries (3/3) Siblings Nodes “X” and “Y” are siblings if: - They have the same number of delimiters in their labels and - X.prefix = Y.prefix, where prefix is the label of the node without its positional identifier Bib bookpaper author Tim Sarah author (0) (0.0) (0.0.0) (0.0.0.0) (0.1) (0.2) (0.2.0) (0.2.0.0)
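The three Dewey tests can be sketched as follows. The function names are assumptions. Labels are split into integer components and compared by leading components, because a plain substring test on the label text would wrongly match 0.0 inside 0.2.0.0.

```python
def parse(label):
    """Split a Dewey label such as "0.2.0" into a tuple of integer components."""
    return tuple(int(part) for part in label.split("."))

def is_ancestor(x, y):
    """x is an ancestor of y iff x's components are a proper prefix of y's."""
    x, y = parse(x), parse(y)
    return len(x) < len(y) and y[:len(x)] == x

def is_parent(x, y):
    """x is the parent of y iff x is an ancestor exactly one component shorter."""
    return is_ancestor(x, y) and len(parse(y)) == len(parse(x)) + 1

def is_sibling(x, y):
    """x and y are siblings iff they differ only in the positional identifier."""
    x, y = parse(x), parse(y)
    return x != y and len(x) == len(y) and x[:-1] == y[:-1]
```

On the example labels, `is_ancestor("0.2", "0.2.0.0")` holds and `is_ancestor("0.0", "0.2.0.0")` does not, while `is_sibling("0.0", "0.2")` holds because both labels share the prefix 0.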

69 Dewey – Updates Insertion of new node The label of the nodes in the subtree rooted at the following sibling need to be updated O(n) nodes need relabeling, where n is the number of nodes of the XML file Bib bookpaper author Tim Sarah author (0) (0.0) (0.0.0) (0.0.0.0) (0.1) (0.2) (0.2.0) (0.2.0.0) paper (0.2) (0.3) (0.3.0) (0.3.0.0)

70 Dewey - Conclusion Not efficient for dynamic XML files with many updates: many nodes need to be re-labeled. As the depth of the tree increases: the label size of a node increases rapidly; storage size increases rapidly; it becomes more costly to evaluate the supported relationships between any two nodes (the string prefix matching becomes longer). Overflow problem: the fixed number of bits originally assigned to store the size of the label may not be enough.

72 ORDPATHs - Structure O’Neil et al. - 2004 Allows updates without re-labeling other nodes. Assigns only positive, odd integers during the initial labeling. Even and negative numbers are reserved for later insertions. Bib bookpaper author Tim Sarah author (1) (1.1) (1.1.1) (1.1.1.1) (1.3) (1.5) (1.5.1) (1.5.1.1)

73 ORDPATHs – Supported Queries Compute ancestors / descendants, parent / child and siblings relations in the same way as Dewey Bib bookpaper author Tim Sarah author (1) (1.1) (1.1.1) (1.1.1.1) (1.3) (1.5) (1.5.1) (1.5.1.1)

74 ORDPATHs – Updates (1/5) Case 1: New node to the right of all existing child nodes Take the label of the immediate previous sibling and add 2 to the positional identifier book (1.7) Bib bookpaper author Tim Sarah author (1) (1.1) (1.1.1) (1.1.1.1) (1.3) (1.5) (1.5.1) (1.5.1.1)

75 ORDPATHs – Updates (2/5) Case 2: New node to the left of all existing child nodes Take the label of the immediate next sibling and add -2 to the positional identifier book (1.1.-1) Bib bookpaper author Tim Sarah author (1) (1.1) (1.1.1) (1.1.1.1) (1.3) (1.5) (1.5.1) (1.5.1.1)

76 ORDPATHs – Updates (3/5) Case 3: New node between two consecutive nodes Assign to the new node the even-number that sits between the two odd positional identifiers of its neighbor siblings, and then concatenate a new component consisting of an odd number book (1.2.1) Bib bookpaper author Tim Sarah author (1) (1.1) (1.1.1) (1.1.1.1) (1.3) (1.5) (1.5.1) (1.5.1.1) paper (1.2.3)

77 ORDPATHs – Updates (4/5) How do we now find the parent? Node “X” is the parent of node “Y” if: - The label of node “X” is a prefix of the label of node “Y” and - frags(X) = frags(Y) – evenNum(Y) – 1, where frags(X) is the number of delimiters in the label of node “X”, frags(Y) is the number of delimiters in the label of node “Y” and evenNum(Y) is the number of even components in the label of node “Y”. book (1.2.1) Bib bookpaper author Tim Sarah author (1) (1.1) (1.1.1) (1.1.1.1) (1.3) (1.5) (1.5.1) (1.5.1.1) paper (1.2.3)
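The three insertion cases and the parent lookup can be sketched on labels stored as tuples of integers. The function names are assumptions, and `parent` applies the rule in one possible way: it drops the last component and then any trailing even components, since even components do not count as levels.

```python
def insert_rightmost(last_sibling):
    """Case 1: add 2 to the positional identifier of the last sibling."""
    return last_sibling[:-1] + (last_sibling[-1] + 2,)

def insert_leftmost(first_sibling):
    """Case 2: add -2 to the positional identifier of the first sibling."""
    return first_sibling[:-1] + (first_sibling[-1] - 2,)

def insert_between(left, right):
    """Case 3: use the even number between two consecutive odd siblings,
    then append a new odd component."""
    consecutive = left[:-1] == right[:-1] and right[-1] - left[-1] == 2
    if not consecutive or left[-1] % 2 == 0:
        raise ValueError("expected two consecutive odd siblings")
    return left[:-1] + (left[-1] + 1, 1)

def parent(label):
    """Label of the parent: drop the last component, then trailing even ones."""
    rest = label[:-1]
    while rest and rest[-1] % 2 == 0:
        rest = rest[:-1]
    return rest
```

For example, `insert_between((1, 1), (1, 3))` gives (1, 2, 1), whose parent is (1,), the root. A further node placed to the right of (1, 2, 1) gets (1, 2, 3) from `insert_rightmost`.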

78 ORDPATHs – Updates (5/5) How do we now find the siblings? Nodes “X” and “Y” are siblings if: When the labels of X and Y have the same length, the sibling conditions are the same as before. When the labels have different lengths, they are siblings if the longer label contains an even number in the same position as the positional identifier of the other node, and the prefix of the longer label up to its first even number is the same as the prefix of the other node. In that case frags(X) = frags(Y) – evenNum(Y), where Y is the node with the longer label. book (1.2.1) Bib bookpaper author Tim Sarah author (1) (1.1) (1.1.1) (1.1.1.1) (1.3) (1.5) (1.5.1) (1.5.1.1) paper (1.2.3)

79 ORDPATHs – Conclusion Unlike Dewey, it is efficient for dynamic XML files: no need to re-label nodes. Like Dewey, it is not suitable for very deep trees: a node's label size increases quickly. It is also not suitable for very wide trees: big label size for nodes with many siblings. Expensive label comparisons between sibling nodes with labels of varying length. Half of the numbers are wasted because only odd numbers are used initially. Overflow problem.

81 LSDX - Structure Duong et al. - 2005 Labeling Scheme for Dynamic XML data. Allows updates without re-labeling other nodes. Combines numbers and letters to label each node. For a node X, its label is: level(X)parent(X).positionalIdentifier(X), where parent(X) is the label of the parent of node X without its level and without its delimiter character. The first positional identifier is “b”, in order to save codes for any insert-before operation. Bib bookpaper author Tim Sarah author (0a) (1a.b) (2ab.b) (3abb.b) (1a.c) (1a.d) (2ad.b) (3adb.b)

82 LSDX – Supported Queries (1/3) Ancestors / Descendants Node “X” is an ancestor of node “Y” if the label of node “X” without the level number and without the delimiter character is a substring of the label of node “Y”. Bib bookpaper author Tim Sarah author (0a) (1a.b) (2ab.b) (3abb.b) (1a.c) (1a.d) (2ad.b) (3adb.b) ad 3adb.b substring

83 LSDX – Supported Queries (2/3) Parent / Child Node “X” is parent of node “Y” if node “X” is an ancestor of node “Y” and level(X)=level(Y)-1 Bib bookpaper author Tim Sarah author (0a) (1a.b) (2ab.b) (3abb.b) (1a.c) (1a.d) (2ad.b) (3adb.b)

84 LSDX – Supported Queries (3/3) Siblings Node “X” and “Y” are siblings if X.prefix = Y.prefix, where prefix is the substring before the delimiter of a node’s label. Bib bookpaper author Tim Sarah author (0a) (1a.b) (2ab.b) (3abb.b) (1a.c) (1a.d) (2ad.b) (3adb.b)

85 LSDX – Updates (1/3) Insertion of a new node: 1. If there is no node before the position where the new node is to be placed, take the label of the node that follows the new node and insert “a” after the delimiter. 2. Otherwise, keep counting from the preceding node so that the label of the new node is greater than the label of its previous sibling and less than the label of its next sibling (if any), in alphabetical order. If the previous label ends with “z”, attach “b” at the end. Bib book paper author Tim Sarah author (0a) (1a.b) (2ab.b) (3abb.b) (1a.c) (1a.d) (2ad.b) (3adb.b) book (1a.ab) paper (1a.e) paper (1a.z) ….. paper (1a.zb)

86 LSDX – Updates (2/3) Bib book paper author Tim Sarah author (0a) (1a.b) (2ab.b) (3abb.b) (1a.c) (1a.d) (2ad.b) (3adb.b) paper (1a.cb) paper (1a.cc) paper (1a.cab) Insertion of a new node: 1. If there is no node before the position where the new node is to be placed, take the label of the node that follows the new node and insert “a” after the delimiter. 2. Otherwise, keep counting from the preceding node so that the label of the new node is greater than the label of its previous sibling and less than the label of its next sibling (if any), in alphabetical order. If the previous label ends with “z”, attach “b” at the end.

87 LSDX – Conclusion Like ORDPATH, it is efficient for dynamic XML files: no need to re-label nodes. Like ORDPATH, it is not suitable for deep and wide trees with nodes that have hundreds of siblings: a node's label size increases quickly. Quick computation of the supported queries. Can show the level of each node quickly. Unlike ORDPATH, it finds siblings much more easily. Overflow problem, although it is more resistant than ORDPATH and Dewey.
XML Doc (MB) | No of Nodes | Total Size of labels (MB)
1,2 | 17 | 0,17
5,6 | 84 | 1,63
11,4 | 167 | 5,29

89 Persistent Labeling Scheme - Structure Gabillon et al. - 2005 Allows updates without re-labeling other nodes. The label of each node has the form (l, [n_p,d_p], [n,d]), where: l is the level of the node in the tree, [n_p,d_p] is the positional identifier of the parent node, and [n,d] is the positional identifier of the node (unique for each level). Given a level “l”, the positional identifier of a node is [i,1], where “i” is the position of the node at level “l”. Bib bookpaper author Tim Sarah author (0,[1,1]) (1,[1,1],[1,1]) (2,[1,1],[1,1]) (3,[1,1],[1,1]) (1,[1,1],[2,1]) (1,[1,1],[3,1]) (2,[3,1],[2,1]) (3,[2,1],[2,1])

90 Persistent Labeling Scheme – Supported Queries (1/4) Ancestors / Descendants We build an ancestor structural summary “s” of a source tree “t” Each node of “t” is represented in the summary tree “s” by the code of its parent Nodes having the same parent are represented in “s” by only one node which has their parent’s code The root of “s” represents the nodes of “t” having the root of “t” as parent The ancestor structural summary “s” is held in memory. Bib bookpaper author Tim Sarah author (0,[1,1]) (1,[1,1],[1,1]) (2,[1,1],[1,1]) (3,[1,1],[1,1]) (1,[1,1],[2,1]) (1,[1,1],[3,1]) (2,[3,1],[2,1]) (3,[2,1],[2,1]) source tree “t” ancestor structural summary “s” [1,1] [3,1] [1,1][2,1]

91 Persistent Labeling Scheme – Supported Queries (2/4) Ancestors / Descendants For a node “X” of code (l1, [n1_p,d1_p], [n1,d1]) and a node “Y” of code (l2, [n2_p,d2_p], [n2,d2]): Node “X” is represented in “s” as the node “u” of level l1-1 and of local code [n1_p,d1_p]. Node “Y” is represented in “s” as the node “v” of level l2-1 and of local code [n2_p,d2_p]. If node “X” is an ancestor of node “Y”, then l1 < l2 and we can reach node “u” starting from node “v” in “s” with l2-l1 parent steps; that is, node “X” is the (l2-l1)-ancestor of node “Y”. Bib bookpaper author Tim Sarah author (0,[1,1]) (1,[1,1],[1,1]) (2,[1,1],[1,1]) (3,[1,1],[1,1]) (1,[1,1],[2,1]) (1,[1,1],[3,1]) (2,[3,1],[2,1]) (3,[2,1],[2,1]) source tree “t” ancestor structural summary “s” [1,1] [3,1] [1,1][2,1]

92 Persistent Labeling Scheme – Supported Queries (3/4) Bib bookpaper author Tim Sarah author (0,[1,1]) (1,[1,1],[1,1]) (2,[1,1],[1,1]) (3,[1,1],[1,1]) (1,[1,1],[2,1]) (1,[1,1],[3,1]) (2,[3,1],[2,1]) (3,[2,1],[2,1]) Parent / Child Let node “X” have label (l1, [n1_p,d1_p], [n1,d1]) and node “Y” have label (l2, [n2_p,d2_p], [n2,d2]). Node “X” is the parent of node “Y” if l2 = l1+1 and [n2_p,d2_p] = [n1,d1].

93 Persistent Labeling Scheme – Supported Queries (4/4) Bib bookpaper author Tim Sarah author (0,[1,1]) (1,[1,1],[1,1]) (2,[1,1],[1,1]) (3,[1,1],[1,1]) (1,[1,1],[2,1]) (1,[1,1],[3,1]) (2,[3,1],[2,1]) (3,[2,1],[2,1]) Siblings Let node “X” have label (l1, [n1_p,d1_p], [n1,d1]) and node “Y” have label (l2, [n2_p,d2_p], [n2,d2]). Nodes “X” and “Y” are siblings if l1 = l2 and [n1_p,d1_p] = [n2_p,d2_p].
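The parent and sibling tests only compare label fields, which the following sketch illustrates. The `Label` type and its field names are assumptions; the root is given no parent identifier here, matching its two-part label (0,[1,1]).

```python
from typing import NamedTuple, Optional, Tuple

Pos = Tuple[int, int]

class Label(NamedTuple):
    level: int
    parent_pos: Optional[Pos]  # positional identifier of the parent (None for the root)
    pos: Pos                   # positional identifier of the node itself

def is_parent(x: Label, y: Label) -> bool:
    """x is the parent of y iff y is one level deeper and points at x's identifier."""
    return y.level == x.level + 1 and y.parent_pos == x.pos

def is_sibling(x: Label, y: Label) -> bool:
    """x and y are siblings iff they share the level and the parent identifier."""
    return x != y and x.level == y.level and x.parent_pos == y.parent_pos

# Labels taken from the example tree.
bib = Label(0, None, (1, 1))
book = Label(1, (1, 1), (1, 1))
paper = Label(1, (1, 1), (2, 1))
third = Label(1, (1, 1), (3, 1))
author1 = Label(2, (1, 1), (1, 1))
author2 = Label(2, (3, 1), (2, 1))
```

With these labels, `book` is the parent of `author1`, `third` is the parent of `author2`, and the two author nodes are not siblings because their parent identifiers differ.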

94 Persistent Labeling Scheme – Updates (1/2) Insertion of a new Node “X” at level “l” If “X” is the first node to be inserted at level “l” then its positional identifier is (1,1) If “X” is inserted immediately before the node of positional identifier (i, j) and if there is no other node before (i, j), then the positional identifier of “X” is (i-j, j) If “X” is inserted immediately after the node of positional identifier (i, j) and if there is no other node after (i, j), then the positional identifier of “X” is (i+j, j) Bib book paper author Tim Sarah author (0,[1,1]) (1,[1,1],[1,1]) (2,[1,1],[1,1]) (3,[1,1],[1,1]) (1,[1,1],[2,1]) (1,[1,1],[3,1]) (2,[3,1],[2,1]) (3,[2,1],[2,1]) book (1,[1,1],[0,1]) book (1,[1,1],[4,1])

95 Persistent Labeling Scheme – Updates (2/2) Insertion of a new node “X” at level “l”: If “X” is inserted immediately before the node of positional identifier (i, j) and immediately after the node of positional identifier (k, h), then the positional identifier of “X” is (a/d, b/d) with: a = i*h + k*j, b = 2*h*j, d = the highest common factor of a and b. Bib bookpaper author Tim Sarah author (0,[1,1]) (1,[1,1],[1,1]) (2,[1,1],[1,1]) (3,[1,1],[1,1]) (1,[1,1],[2,1]) (1,[1,1],[3,1]) (2,[3,1],[2,1]) (3,[2,1],[2,1]) paper (1,[1,1],[5,2]) Example: (i, j) = (3,1), (k, h) = (2,1), a = 3*1 + 2*1 = 5, b = 2*1*1 = 2, d = 1.
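The insertion rules can be sketched as follows. Reading an identifier (n, d) as the fraction n/d is an interpretation rather than something the slides state: under that reading, the in-between rule yields the midpoint of the two neighbours, which is why sibling order is preserved. The function names are assumptions.

```python
from fractions import Fraction
from math import gcd

def insert_before(first):
    """New identifier immediately before (i, j) when nothing precedes it."""
    i, j = first
    return (i - j, j)

def insert_after(last):
    """New identifier immediately after (i, j) when nothing follows it."""
    i, j = last
    return (i + j, j)

def insert_between(prev, nxt):
    """New identifier between prev = (k, h) and nxt = (i, j)."""
    k, h = prev
    i, j = nxt
    a = i * h + k * j
    b = 2 * h * j
    d = gcd(a, b)  # highest common factor of a and b
    return (a // d, b // d)

def value(pos):
    """Interpret an identifier (n, d) as the fraction n/d (an assumption)."""
    return Fraction(pos[0], pos[1])
```

For the slide's example, `insert_between((2, 1), (3, 1))` gives (5, 2), and inserting again between (2, 1) and (5, 2) gives (9, 4); in both cases the fraction value lies strictly between its neighbours.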

96 Persistent Labeling Scheme - Conclusion Like ORDPATH and LSDX, it is efficient for dynamic XML files: no need to re-label nodes. Quick computation of the supported queries, although the ancestor/descendant relationship is a bit more complex. No overflow problem. However, it needs large memory storage.

98 Evaluation of Prefix Labeling Schemes Experiment’s Data: Java 1.4.2 Sun Microsystems parser SAX Pentium IV 1.3G, 1024MB RAM Windows XP OS Impact of depth and breadth of the XML document on: Time for generating labels Space taken by these labels Sans et al. - 2008

Evaluation – Time Analysis (1/2) Breadth influence: time required to generate labels at constant depth. [Figure: time (s) versus number of nodes (50000 to 350000) for Persistent, LSDX, ORDPATH and Dewey.]

Evaluation – Time Analysis (2/2) Depth influence: time required to generate labels at constant breadth (50 nodes). [Figure: time (s) versus depth (5, 15, 30, 50) for Persistent, LSDX, ORDPATH and Dewey.]

Evaluation – Storage Analysis (1/2) Breadth influence: storage space required for labels at constant depth. [Figure: three panels (50, 500 and 5000 nodes) showing space in octets for Persistent, LSDX, ORDPATH and Dewey. An octet is a grouping of eight bits.]

Evaluation – Storage Analysis (2/2) Depth influence: storage space required for labels at constant breadth (50 nodes). [Figure: space (octets) versus depth (5, 15, 30, 50) for Persistent, LSDX, ORDPATH and Dewey.]

103 Evaluation – Conclusion For generating labels, DEWEY and ORDPATH are the quickest techniques for both deep and wide trees; LSDX and PERSISTENT follow. Since ORDPATH supports updates, it is preferable to DEWEY. For wide trees, DEWEY, ORDPATH and PERSISTENT require the least space. As the breadth of the tree grows, the PERSISTENT technique outperforms the other techniques and LSDX worsens. For trees that are not very wide, LSDX needs the least space. For deep trees, LSDX requires the least space; DEWEY and ORDPATH follow. For trees that are not deep, DEWEY and ORDPATH perform best.

104 Comparing with Interval Measure the processing time of inserting 382 nodes as the size of the original data increases. [Figure: insertion time (ms, logarithmic scale from 1 to 100000) versus size of original data (10 to 50 MB) for Beg-End, Prime, Nested Tree and ORDPATH.]

106 Comparing Structural Indexes (1/2)
Columns: Node | Graph | Sequence
Wrong initial answers: No | Yes (non-deterministic with forward and backward bisimilarity) | No
Missing initial correct answers: No Yes
Structural joins required (path, twig): Yes No Yes No
How to evaluate a twig query: Break the query into nodes, then join the nodes | Break the query into several paths, solve each path, then join the results | Process the twig query as a whole

107 Comparing Structural Indexes (2/2)
Columns: Node | Graph | Sequence
Hold values: No (values have to be indexed separately) | No (there are some attempts to integrate values into the index) | Yes (values are integrated efficiently into the index)
Main role in answering an XML query: Path joining | Path selection | Complete query evaluation
Update cost for inserting a node/subtree: O(N)* | O(N+M)* | O(b·logN)*
*Number of nodes that need to be touched during an update. N = number of nodes, M = number of edges, b = fan-out of the B+ tree.

108 Questions?
