A Summary of XISS and Index Fabric Ho Wai Shing. Contents Definition of Terms XISS (Li and Moon, VLDB2001) Numbering Scheme Indices Stored Join Algorithms.


1 A Summary of XISS and Index Fabric Ho Wai Shing

2 Contents: Definition of Terms; XISS (Li and Moon, VLDB 2001): Numbering Scheme, Indices Stored, Join Algorithms; Index Fabric (Cooper et al., VLDB 2001): Patricia, Balanced Trie, Raw Path Index

3 Definition of Terms: Absolute Path Expression (APE): a path that starts from the root, in which each step traverses the child axis or the attribute axis and no wildcards appear, e.g., /, /A/B, /A/@C

4 Definition of Terms: Regular Path Expression (RPE): may or may not start from the root; may traverse different axes (restricted here to child, descendant-or-self, and attribute, since these are the most commonly used); may contain wildcards, e.g., //, /A//C, /A/_/B, //A/B//C/D/@E

5 XISS: XISS = XML Indexing and Storage System, by Li and Moon, published in VLDB 2001 under the title “Indexing and Querying XML Data for Regular Path Expressions”. It decomposes XML documents and stores them in its indices, and it can answer regular path expressions.

6 XISS - General Idea: an RPE is solved by decomposing it into 5 basic kinds of subexpression: (1) element retrieval; (2) attribute retrieval; (3) a step involving an element and an attribute; (4) a step involving two elements; (5) a Kleene closure of another subexpression

7 XISS - General Idea: each kind of subexpression is solved by its own method: element index lookup; attribute index lookup; EA-join; EE-join; KC-join

8 XISS - General Idea: the result lists from the subexpressions are joined to produce the final result. To make this decomposition and join efficient, a fast way to determine ancestor-descendant relationships is needed. XISS uses an extended-preorder-based numbering scheme.

9 XISS - Numbering Scheme: every node is numbered with a tuple <order, size>. The order is assigned by an extended preorder traversal; the size can be thought of as the size of the subtree rooted at that node.

10 XISS - Numbering Scheme: the rules for number assignment: (1) if x precedes y in the preorder traversal, x.order < y.order (preorder); (2) if x and y are siblings, either x.order + x.size < y.order or y.order + y.size < x.order (siblings do not overlap); (3) if x is an ancestor of y, x.order < y.order <= x.order + x.size (an ancestor contains its descendants)
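Rule (3) above reduces an ancestor-descendant test to two integer comparisons. A minimal sketch follows; the `Node` class and the sample numbers are illustrative assumptions, not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical container for the <order, size> label of one node."""
    order: int
    size: int


def is_ancestor(x: Node, y: Node) -> bool:
    # Rule (3): x is an ancestor of y when y.order falls inside
    # the interval (x.order, x.order + x.size].
    return x.order < y.order <= x.order + x.size


# Made-up labels: root contains a and c; a contains b; a and c are siblings.
root = Node(order=1, size=100)
a = Node(order=10, size=30)
b = Node(order=17, size=5)
c = Node(order=45, size=5)

print(is_ancestor(root, b), is_ancestor(a, b), is_ancestor(a, c))
```

Because the test needs only the two labels, it can be evaluated on index records without touching the document tree.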

11 XISS - Numbering Scheme: the actual assignment uses heuristics to reserve some “space” between orders and reserves extra space in the sizes for future node insertions; attributes are placed before sibling elements
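One way to picture the reserved space is a fixed gap left before every child and at the end of every subtree. The sketch below is only an assumed heuristic for illustration (the slides do not give the actual one); the tree encoding, the `gap` parameter, and the function name `number` are all hypothetical, and attributes are not modelled.

```python
def number(node, order, gap, labels):
    """Assign <order, size> labels in preorder, leaving `gap` of slack.

    node   : (name, [children]) tuple; names are assumed unique here
    order  : order to give this node
    gap    : slack (>= 1) before each child and at the end of each subtree
    labels : dict filled with name -> (order, size)
    Returns order + size, the last number covered by this subtree.
    """
    name, children = node
    cursor = order
    for child in children:
        # Each child starts `gap` past the previous sibling's interval,
        # so sibling intervals never overlap.
        cursor = number(child, cursor + gap, gap, labels)
    size = (cursor - order) + gap  # trailing slack for future insertions
    labels[name] = (order, size)
    return order + size


tree = ("A", [("B", []), ("C", [("D", [])])])
labels = {}
number(tree, order=1, gap=10, labels=labels)
print(labels)
```

With `gap=10` the sketch yields A=(1, 70), B=(11, 10), C=(31, 30), D=(41, 10), which satisfies the three numbering rules while leaving unused numbers for later insertions.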

12 XISS - Index Organization: there are 5 indices: Name Index, Element Index, Attribute Index, Structure Index, Value Table

13 XISS - Name Index: maps an element or attribute name to a name identifier (nid). The nid represents that element or attribute in further query evaluation, which reduces the time spent on string comparison in later index lookups. Stored in a B+-tree.

14 XISS - Name Index (figure: a Name is looked up in the B+-tree to obtain its nid)

15 XISS - Value Table: stores all the string values of the XML document (figure: a table with columns vid and value)

16 XISS - Element Index: input: nid; output: list of element records. Implemented by a B+-tree whose leaves point to a list of document IDs (did); each entry of that list points to a list of all elements with the same name in the same document.

17 XISS - Element Index (figure: nid is looked up in the B+-tree to reach a did list; each did entry leads to an element list of element records, whose fields include Depth and ParentID)

18 XISS - Attribute Index: very similar to the Element Index, except that an attribute record always has a value identifier (vid)

19 XISS - Structure Index: input: did; output: an array containing all the elements and attributes in the document. Implemented by a B+-tree.

20 XISS - Structure Index (figure: did is looked up in the B+-tree to reach a record array; each record holds nid, parent order, child order, sibling order, and attribute order)

21 XISS - Indices: when to use which index? First use the Name Index to find the nid of the element or attribute being queried; then search the Element or Attribute Index for the records; if values are needed, look them up in the Value Table; use the Structure Index to rebuild or traverse the XML document tree.
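The lookup order described above can be sketched with plain dictionaries standing in for the B+-trees. Everything here (index contents, record layouts, function names) is an illustrative assumption rather than the XISS storage format.

```python
# Plain dicts stand in for the B+-trees; all data below is made up.
name_index = {"book": 1, "title": 2, "lang": 3}       # name -> nid
element_index = {                                      # nid -> did -> [(order, size)]
    1: {"d1": [(1, 40)]},
    2: {"d1": [(5, 3)], "d2": [(2, 3)]},
}
attribute_index = {3: {"d1": [(2, 0, 7)]}}            # nid -> did -> [(order, size, vid)]
value_table = {7: "en"}                                # vid -> string value


def lookup_elements(name):
    """Name Index first, then Element Index; results sorted by did."""
    nid = name_index.get(name)
    if nid is None:
        return []
    by_doc = element_index.get(nid, {})
    return [(did, rec) for did in sorted(by_doc) for rec in by_doc[did]]


def lookup_attribute_values(name):
    """Name Index, then Attribute Index, then Value Table for the string."""
    nid = name_index.get(name)
    if nid is None:
        return []
    by_doc = attribute_index.get(nid, {})
    return [(did, order, value_table[vid])
            for did in sorted(by_doc)
            for (order, _size, vid) in by_doc[did]]


print(lookup_elements("title"))
print(lookup_attribute_values("lang"))
```

Only the first step compares strings; every later step works on integer identifiers.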

22 XISS - Join Algorithms: after getting the record list for each subexpression, we need to find out which records answer the original query. E.g., to evaluate /A/B, we retrieve a record list of all A elements and another list of all B elements, and then have to find out which B’s satisfy A/B.

23 XISS - Join Algorithms: three join algorithms are proposed: EA-join merges an element record list and an attribute record list (solves A/@B); EE-join merges two element record lists (solves A/B or A//B); KC-join self-merges an element record list (solves (E)*)

24 XISS - EA-Join: solves E/@A. Input: an element record list and an attribute record list, both sorted by did and then by order. It finds the attribute records whose parents are in the element record list.

25 XISS - EA-join: a 2-stage sort-merge: group by did first, then merge using order. Output criterion: E is a parent of A. A single scan of both lists is enough.
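The single-scan merge can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the record layouts (`Elem`, `Attr`) are hypothetical, each attribute record is assumed to carry its parent's order, and attributes are assumed to be numbered directly after their parent element and before any other element.

```python
from typing import NamedTuple


class Elem(NamedTuple):
    did: int
    order: int
    size: int


class Attr(NamedTuple):
    did: int
    order: int
    parent_order: int  # assumption: the record stores the parent's order


def ea_join(elems, attrs):
    """Merge two lists sorted by (did, order); return (element, attribute) pairs."""
    result = []
    i = 0
    for a in attrs:
        # Advance to the last element that precedes this attribute.
        while (i + 1 < len(elems)
               and (elems[i + 1].did, elems[i + 1].order) < (a.did, a.order)):
            i += 1
        if i < len(elems):
            e = elems[i]
            # Output criterion: e is the parent of a.
            if e.did == a.did and e.order == a.parent_order:
                result.append((e, a))
    return result


elems = [Elem(1, 1, 20), Elem(1, 5, 4), Elem(2, 1, 10)]
attrs = [Attr(1, 2, 1), Attr(1, 6, 5), Attr(1, 12, 11), Attr(2, 2, 1)]
pairs = ea_join(elems, attrs)
print(len(pairs))
```

The element pointer `i` only moves forward, so each list is scanned once.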

26 XISS - EE-join: solves E/_*/E, e.g., E/E, E//E, E/_/E. Input: two element record lists, E and F. Output: pairs (e, f) where e is an ancestor of f. It also uses a 2-stage sort-merge, but it may need to scan the lists multiple times in special cases, e.g., when the document contains /A/A/B/B.
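A sketch of the ancestor-descendant merge, using the containment test from the numbering scheme. The `Rec` layout and sample numbers are assumptions for illustration; a parent-child query such as E/E would additionally need a depth or parent check, which is omitted here.

```python
from typing import NamedTuple


class Rec(NamedTuple):
    did: int
    order: int
    size: int


def ee_join(E, F):
    """Return (e, f) pairs with e an ancestor of f; inputs sorted by (did, order)."""
    result = []
    start = 0
    for e in E:
        # Records of F at or before e can never be descendants of e
        # or of any later element in E, so skip them for good.
        while start < len(F) and (F[start].did, F[start].order) <= (e.did, e.order):
            start += 1
        # Scan forward while f is still inside e's interval. For nested
        # ancestors this range of F is scanned again for the inner element.
        j = start
        while j < len(F) and F[j].did == e.did and F[j].order <= e.order + e.size:
            result.append((e, F[j]))
            j += 1
    return result


# Made-up labels for a document shaped like /A/A/B/B.
A_list = [Rec(1, 1, 40), Rec(1, 5, 30)]
B_list = [Rec(1, 10, 20), Rec(1, 15, 5)]
print(len(ee_join(A_list, B_list)))
```

In the nested case both A elements contain both B elements, so the inner scan over `B_list` runs twice, which is the repeated scanning the slide mentions.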

27 XISS - KC-join: solves the Kleene closure of a subexpression. Input: a list of element records that match the base case. It applies EE-join to the list recursively and stops when the result list no longer grows.
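The "repeat until the result stops growing" idea is a fixpoint iteration. The sketch below shows only that control structure on abstract (ancestor, descendant) pairs; it is not the paper's KC-join, and the pair representation and function name are assumptions.

```python
def closure_by_repeated_join(base_pairs):
    """Extend base (start, end) pairs by joining with the base until no growth."""
    base = set(base_pairs)
    closure = set(base)
    frontier = set(base)
    while frontier:
        # Join the newest results with the base pairs on the shared endpoint.
        new = {(a, d) for (a, b) in frontier for (c, d) in base if b == c}
        new -= closure
        closure |= new
        frontier = new  # empty frontier means the result list stopped growing
    return closure


# Made-up chain 1 -> 2 -> 3 -> 4 of matches of the base subexpression.
result = closure_by_repeated_join({(1, 2), (2, 3), (3, 4)})
print(sorted(result))
```

Each round corresponds to one more repetition of the base subexpression; the loop ends after a round that adds nothing new.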

28 Index Fabric: by Cooper et al., published in VLDB 2001 under the title “A fast index for semistructured data”. It has 2 subtypes, the raw path index and the refined path index, and uses the Patricia technique to compress the index.

29 Index Fabric - General Idea: it is a balanced, disk-based indexing structure built on Patricia. Each data node is associated with a key string, and this string is stored in the trie index for retrieval. The layered approach used to build the index ensures that the number of disk pages accessed per query is bounded.

30 Index Fabric - General Idea: the raw path index answers absolute path queries; the refined path index answers any predefined query. The difference between them is how the key is generated.

31 Patricia: Patricia = Practical Algorithm To Retrieve Information Coded in Alphanumeric, by Morrison, JACM 1968. A method to store and retrieve strings in a space-efficient way. It is binary, uses bit comparisons, and has a “skip” in each internal node.

32 Patricia: an example Patricia trie (figure not reproduced in this transcript)

33 Patricia: it is basically a trie with the single-child internal nodes removed. A search branches according to the value of the bit at each node’s skip position, retrieves the string at the leaf it reaches, and compares that string with the query string.
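The search procedure can be sketched on a small path-compressed binary trie. This is a simplified illustration, not Morrison's original structure: internal nodes store the absolute bit position to test (rather than a relative skip count), the trie is built in one pass from a fixed key set, and keys are assumed to be distinct bit strings of equal length.

```python
import os


def build(keys):
    """Build a compressed binary trie from distinct, equal-length bit strings."""
    if len(keys) == 1:
        return keys[0]  # leaf: the full key string
    # First position where the keys differ; all earlier bits are skipped.
    pos = len(os.path.commonprefix(keys))
    zeros = [k for k in keys if k[pos] == "0"]
    ones = [k for k in keys if k[pos] == "1"]
    return (pos, build(zeros), build(ones))


def search(node, query):
    """Branch on one bit per internal node, then verify the key at the leaf."""
    while isinstance(node, tuple):
        pos, zero, one = node
        if pos >= len(query):
            return False
        node = one if query[pos] == "1" else zero
    # Skipped bits were never inspected, so the final comparison is required.
    return node == query


trie = build(["0011", "0101", "0110", "1100"])
print(search(trie, "0110"), search(trie, "0111"), search(trie, "1000"))
```

The final comparison matters: "0111" follows the same branches as "0110" and is rejected only when the leaf string is checked.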

34 Index Fabric - Balanced Trie: the number of disk pages accessed per query is bounded by the number of layers in the layered index. The idea is similar to that of a B-tree: the Patricia trie is decomposed into blocks, and an upper-layer trie is used to traverse the blocks.

35 Index Fabric - Balanced Trie: e.g., the example Patricia trie split into blocks across Layer 0 and Layer 1 (figure not reproduced in this transcript)

36 Index Fabric - Balanced Trie: there are 3 types of links in the balanced trie: far link: across layers, a result of branching; near link: within the same block, a result of branching; direct link: across layers, where the root nodes are the same. Each query accesses one block per layer.

37 Index Fabric - Balanced Trie: traversals in the upper layers skip over nodes of the original trie, which speeds up the search; the number of pages accessed is bounded

38 Index Fabric - Raw Path: each data node is associated with a key, where key = path (encoded in designators) + value. Designators are special characters, each representing a name. APE queries are translated into key prefixes and submitted to the index trie.

39 Index Fabric - Raw Path: Example: the value HKU at path /invoice/buyer/name is translated to the key IBNHKU (I, B, and N are the designators, shown bold and underlined on the slide). The query /invoice/buyer/name[“HKU”] is translated to the query string IBNHKU.
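The translation in this example can be sketched as a small encoder plus a prefix search. A sorted list with binary search stands in for the trie; the designator table follows the slide's example, while the function names and the extra sample keys ("CUHK", "X") are made up for illustration.

```python
from bisect import bisect_left

# Designators from the slide's example: one character per element name.
DESIGNATORS = {"invoice": "I", "buyer": "B", "name": "N"}


def encode(path, value=""):
    """Encode /a/b/c plus an optional value as designators + value."""
    steps = [s for s in path.split("/") if s]
    return "".join(DESIGNATORS[s] for s in steps) + value


def prefix_lookup(sorted_keys, prefix):
    """Return all keys starting with `prefix` (stand-in for the trie search)."""
    i = bisect_left(sorted_keys, prefix)
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches


keys = sorted([
    encode("/invoice/buyer/name", "HKU"),
    encode("/invoice/buyer/name", "CUHK"),   # made-up extra data
    encode("/invoice/name", "X"),            # made-up extra data
])
print(prefix_lookup(keys, encode("/invoice/buyer/name", "HKU")))
print(prefix_lookup(keys, encode("/invoice/buyer/name")))
```

A query with a value becomes an exact key, and a query without one becomes a prefix, so both are answered by a single search over the key space. In a real index the designators would be characters that cannot be confused with value text; plain letters are used here only to match the slide.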

40 Index Fabric - Refined Path: special designators can be assigned to special queries (which can be regular). E.g., if we define P as the path //buyer/name, then the key PHKU means that the document contains a buyer/name with value HKU. This can answer any predefined RPE very quickly.

41 Comparison: XISS: can solve general RPEs; solves an APE by dividing it into steps. Index Fabric: solves an RPE by compile-time expansion of the RPE or by using a predefined Refined Path Index; solves an APE with a single index lookup.

