
1 XML Storage

2 Suppose that we are given some XML documents. How should they be stored? Why does it matter? –Storage determines which uses of the XML can be made efficient –Usage requirements determine which type of storage is needed

3 Three Basic Strategies: –Files –Relational Database –Native XML Database. What advantages do you think each approach has? What disadvantages do you think each approach has?

4 XML Files

5 Idea: Store XML “as is” in a file system –When querying, parse the document and traverse it to find the query answer. Obvious advantage: simple storage system. Obvious disadvantages: –Must parse the XML document every time it is queried –Does not take advantage of indexes to quickly get to “interesting” elements (in order to reach a given element, must traverse everything appearing beforehand in the document)

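6 Sample Document
<transaction>
  <account>89-344</account>
  <buy shares="100"> <ticker exch="NASDAQ">WEBM</ticker> </buy>
  <sell shares="30"> <ticker exch="NYSE">GE</ticker> </sell>
</transaction>
What must we read to be able to get information about the ticker element?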

7 How Is an XML Document Parsed? Two basic types of parsers: –DOM parser: creates a tree out of the document –SAX parser: does not create any data structures; notifies the program of every element seen. Both types of parsers have been standardized and have implementations in virtually every programming language

8 DOM Parser DOM = Document Object Model Parser creates a tree object out of the document User accesses data by traversing the tree The API allows for constructing, accessing and manipulating the structure and content of XML documents

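9 Document as Tree [Figure: transaction has children account (text 89-344), buy (attribute shares = 100; child ticker with attribute exch = NASDAQ and text WEBM) and sell (attribute shares = 30; child ticker with attribute exch = NYSE and text GE).] Methods like: getRoot, getChildren, getAttributes, etc.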

10 Advantages and Disadvantages How would you answer a query like: –/transaction/buy –//ticker Advantages: –Natural and relatively easy to use –Can repeatedly query tree without reparsing Disadvantages: –High memory requirements – the whole document is kept in memory –Must parse the whole document and construct many objects before use
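
The two queries on this slide can be tried with a DOM parser. The following is a minimal sketch using Python's standard xml.dom.minidom module; the document text is an assumption, rebuilt from the tree and event slides, and the variable names are illustrative only.

```python
from xml.dom.minidom import parseString

# Assumed sample document (reconstructed from the tree slide).
DOC = """<transaction>
  <account>89-344</account>
  <buy shares="100"><ticker exch="NASDAQ">WEBM</ticker></buy>
  <sell shares="30"><ticker exch="NYSE">GE</ticker></sell>
</transaction>"""

# The whole document is parsed into an in-memory tree before any query runs.
dom = parseString(DOC)
root = dom.documentElement

# /transaction/buy: keep only element children of the root named "buy".
buys = [n for n in root.childNodes
        if n.nodeType == n.ELEMENT_NODE and n.tagName == "buy"]

# //ticker: every ticker element anywhere in the tree.
tickers = dom.getElementsByTagName("ticker")

buy_shares = [b.getAttribute("shares") for b in buys]
ticker_names = [t.firstChild.data for t in tickers]
```

Both queries reuse the same tree, which illustrates the "no reparsing" advantage; the cost is that the full tree stays in memory.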

11 SAX Parser SAX = Simple API for XML. Parser creates “events” (i.e., notifications) while scanning the document. Goes through the document one time only

12 Document as Events [Figure: the sample document beside the beginning of its event stream.] Start tag: transaction; Start tag: account; Text: 89-344; End tag: account; Start tag: buy; Attribute: shares, Value: 100; …

13 Advantages and Disadvantages How would you answer a query like: –/transaction/buy –find accounts in which something is bought or sold from the NASDAQ Advantages: –Requires less memory –Fast Disadvantages: –Cannot read backwards
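
The second query on this slide can be answered in a single SAX pass, as long as the handler remembers the account number before it reaches the tickers. The sketch below uses Python's standard xml.sax module; the document text and the class name NasdaqAccounts are assumptions made for illustration.

```python
import xml.sax

# Assumed sample document (reconstructed from the tree slide).
DOC = b"""<transaction>
  <account>89-344</account>
  <buy shares="100"><ticker exch="NASDAQ">WEBM</ticker></buy>
  <sell shares="30"><ticker exch="NYSE">GE</ticker></sell>
</transaction>"""


class NasdaqAccounts(xml.sax.ContentHandler):
    """Collect accounts that buy or sell a ticker listed on NASDAQ."""

    def __init__(self):
        super().__init__()
        self.in_account = False
        self.account = ""
        self.matches = []

    def startElement(self, name, attrs):
        if name == "transaction":
            self.account = ""          # new transaction, forget the old account
        elif name == "account":
            self.in_account = True
        elif name == "ticker" and attrs.get("exch") == "NASDAQ":
            # Works only because account precedes ticker: SAX cannot go back.
            self.matches.append(self.account)

    def characters(self, content):
        if self.in_account:
            self.account += content    # text may arrive in several chunks

    def endElement(self, name):
        if name == "account":
            self.in_account = False


handler = NasdaqAccounts()
xml.sax.parseString(DOC, handler)
```

Only a few fields of state are kept, whatever the document size. If the account element came after the tickers, the handler would have to buffer candidate matches, which is where "cannot read backwards" starts to hurt.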

14 Storing XML in a Relational Database

15 Why? Relational databases have been developed for about 30 years There is extensive knowledge on how to use them efficiently Why not take advantage of this knowledge? Main Challenges: –get XML into database (inserting): translating XML into tables –get XML out of database (querying): translating XPath into SQL

16 Reminder Relational Database simply contains some tables Each table can have any number of columns (also called attributes) Data items in each column are atomic, i.e., single values A schema is a description of a set of tables, i.e., the table name and each table’s column names

17 Difficulties DTDs can be complex Modeling Mismatch –Conceptually, relational databases, i.e., tables, have 2 levels: tables and attributes –XML documents have arbitrary nesting XML documents can have set-valued attributes and recursion

18 Relational Databases: Option 1 The Schema-less Case

19 Option 1: Store Tree Structure [Figure: a person document beside its tree. The person node has children name (Bart Simpson), tel (02 – 444 7777), tel (051 – 011 022) and email (bart@tau.ac.il).]

20 Option 1: Store Tree Structure (cont.) 1. Assign each node a unique id 2. For each node, store type and value 3. For each node, store parent information [Figure: the same tree with its nodes numbered 1 to 9.]

21 Option 1: Store Tree Structure (cont.) [Figure: the numbered tree beside the table that stores it.]
Node  Type     Value         ParentID
1     element  person        null
6     text     Bart Simpson  2
…     …        …             …
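
A minimal sketch of this storage scheme, using Python's standard sqlite3 and xml.etree.ElementTree modules. The document text, the table name node and the function name shred are assumptions for illustration; nodes are numbered level by level so that the ids agree with the two rows shown in the table. Attributes and mixed content are ignored to keep the sketch short.

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import deque

# Assumed document matching the tree on the slide.
DOC = ("<person><name>Bart Simpson</name><tel>02 - 444 7777</tel>"
       "<tel>051 - 011 022</tel><email>bart@tau.ac.il</email></person>")


def shred(xml_text):
    """Return rows (node, type, value, parent_id), numbered level by level."""
    rows, next_id = [], 1
    queue = deque([(ET.fromstring(xml_text), None)])  # (element or text, parent id)
    while queue:
        item, parent = queue.popleft()
        node_id, next_id = next_id, next_id + 1
        if isinstance(item, str):                      # a text leaf
            rows.append((node_id, "text", item, parent))
            continue
        rows.append((node_id, "element", item.tag, parent))
        if item.text and item.text.strip():
            queue.append((item.text.strip(), node_id))
        for child in item:
            queue.append((child, node_id))
    return rows


rows = shred(DOC)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, type TEXT, "
             "value TEXT, parent_id INTEGER)")
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", rows)

# /person/tel/text(): one self-join per step of the path.
tels = [r[0] for r in conn.execute("""
    SELECT t.value
    FROM node p JOIN node c ON c.parent_id = p.id
                JOIN node t ON t.parent_id = c.id
    WHERE p.parent_id IS NULL AND p.value = 'person'
      AND c.type = 'element' AND c.value = 'tel' AND t.type = 'text'
    ORDER BY t.id""")]
```

The query already needs two joins for a two-step path, which previews the cost problem summarized a few slides later.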

22 How Good Is This? Simple schema, can work with any document Translation from XML to tables is easy What about the translation back? –is this transformation lossless?

23 Answering XPath Queries Can you answer an XPath query that: –Just uses the child axis, e.g., /a/b/c/d/e –Uses the descendant axis at the beginning of the query, e.g., //a/b –Uses the descendant axis in the middle of the query, e.g., /a/b//e –Uses the following, preceding or following-sibling axes?

24 Solving the Problem With the current modeling, many types of XPath steps cannot be evaluated. To solve this problem, we: –number the nodes by DFS ordering –store, for each node, the id of its last descendant

25 [Figure: the person tree with a new phones node grouping the two tel nodes; nodes are numbered 1 to 10 in DFS order.]
Node  Type     Value   ParentID  LastDesc
1     element  person  null      10
4     element  phones  1         8
…     …        …       …         …
Can you answer these queries now?
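
With DFS numbering, node d is a descendant of node a exactly when a.id < d.id <= a.LastDesc, so a descendant step becomes a single range join. The sketch below is one possible way to compute the numbering and run such a query; the document text, the table name node and the function name number_dfs are assumptions for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Assumed document matching the tree with the added phones node.
DOC = ("<person><name>Bart Simpson</name><phones><tel>02 - 444 7777</tel>"
       "<tel>051 - 011 022</tel></phones><email>bart@tau.ac.il</email></person>")


def number_dfs(xml_text):
    """Return rows (id, type, value, parent_id, last_desc) in DFS order."""
    rows = []
    counter = 0

    def visit(elem, parent):
        nonlocal counter
        counter += 1
        my_id = counter
        slot = len(rows)
        rows.append(None)                      # filled in once last_desc is known
        if elem.text and elem.text.strip():
            counter += 1
            rows.append((counter, "text", elem.text.strip(), my_id, counter))
        for child in elem:
            visit(child, my_id)
        rows[slot] = (my_id, "element", elem.tag, parent, counter)

    visit(ET.fromstring(xml_text), None)
    return rows


rows = number_dfs(DOC)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, type TEXT, value TEXT, "
             "parent_id INTEGER, last_desc INTEGER)")
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?, ?)", rows)

# /person//tel: one range join, however deep the tel elements are nested.
tel_ids = [r[0] for r in conn.execute("""
    SELECT d.id
    FROM node a JOIN node d ON d.id > a.id AND d.id <= a.last_desc
    WHERE a.type = 'element' AND a.value = 'person'
      AND d.type = 'element' AND d.value = 'tel'
    ORDER BY d.id""")]
```

The following axis can be expressed in the same style, with the condition d.id > a.last_desc.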

26 Summary: Main Problems No convenient method for creating XML as output. Each element in the path expression requires an additional join –Can become very expensive

27 Relational Databases: Option 2, Taking Advantage of DTDs Based On: Relational Databases for Querying XML Documents: Limitations and Opportunities By: Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton

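28 Framework [Figure: an XML Translation Layer sits on top of a Relational Database System. Going in: the DTD is translated into a relational schema (plus translation information), XML documents into tuples, and XML queries into SQL queries. Coming out: relational results are translated into XML results.]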

29 Example XML: The Selfish Gene, Richard Dawkins, Timbuktu, 99999. Wouldn’t it be nice to store this as a table with the columns: booktitle, author_id, firstname, lastname, city, zip?

30 Example XML (cont.): The Selfish Gene, Richard Dawkins, Timbuktu, 99999. We can do this only if all XML documents that we will be considering follow this format. Otherwise, what happens if, for example, there are 2 authors?

31 Considering the DTD If a DTD is given, then it defines what types of XML documents will be of interest. Challenge: Given a DTD, find a relational schema such that ANY document conforming to the DTD can be stored in the relations

32 Reducing the Complexity DTDs can be very complex. Before translating a DTD to a relational schema, simplify the DTD. Property of the simplification: If D2 is a simplification of D1, then every document that conforms to D1 also almost conforms to D2 –“almost” means that it conforms if the ordering of sub-elements is ignored

33 Simplification Rules
(e1, e2)* → e1*, e2*
(e1, e2)? → e1?, e2?
(e1 | e2) → e1?, e2?
e1** → e1*
e1*? → e1*
e1?* → e1*
e1?? → e1?
e1+ → e1*
…, a*, …, a*, … → a*, …
…, a*, …, a?, … → a*, …
…, a?, …, a*, … → a*, …
…, a?, …, a?, … → a*, …
…, a, …, a, … → a*, …
…, a?, …, a, … → a*, …
…, a, …, a?, … → a*, …
…, a*, …, a, … → a*, …
…, a, …, a*, … → a*, …

34 Simplification example (rules as on slide 33). Start with: (b|c|e)?,(e?|f+)

35 Applying (e1 | e2) → e1?, e2? to both choices: (b?,c?,e?)?,e??,f+?

36 Applying (e1, e2)? → e1?, e2?: b??,c??,e??,e??,f+?

37 Applying e1+ → e1*: b??,c??,e??,e??,f*?

38 Applying e1?? → e1? and e1*? → e1*: b?,c?,e?,e?,f*

39 Merging the two occurrences of e with …, a?, …, a?, … → a*, …: b?,c?,e*,f*
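
The rules can be applied mechanically. The sketch below is one possible encoding, not the procedure from the paper: a content model is a nested tuple (an assumed representation), quantifiers are pushed down to the element names, and repeated names are merged into a starred one.

```python
def combine(outer, inner):
    """Collapse two nested quantifiers ('', '?', '*') into one."""
    if "*" in (outer, inner):
        return "*"
    if "?" in (outer, inner):
        return "?"
    return ""


def flatten(expr, outer=""):
    """Yield (element name, quantifier) pairs in left-to-right order."""
    kind = expr[0]
    if kind == "elem":
        yield expr[1], outer
    elif kind == "seq":                       # (e1, e2)q -> e1q, e2q
        for part in expr[1:]:
            yield from flatten(part, outer)
    elif kind == "choice":                    # (e1 | e2) -> e1?, e2?
        for part in expr[1:]:
            yield from flatten(part, combine(outer, "?"))
    elif kind in ("?", "*", "+"):             # e+ is treated as e*
        quantifier = "*" if kind == "+" else kind
        yield from flatten(expr[1], combine(outer, quantifier))
    else:
        raise ValueError(f"unknown node kind: {kind}")


def simplify(expr):
    """Return the simplified content model as a string."""
    result = {}
    for name, quantifier in flatten(expr):
        # A name seen more than once becomes a*, whatever its quantifiers were.
        result[name] = "*" if name in result else quantifier
    return ", ".join(name + q for name, q in result.items())


def elem(name):
    return ("elem", name)


# (b|c|e)?,(e?|f+)
example = ("seq",
           ("?", ("choice", elem("b"), elem("c"), elem("e"))),
           ("choice", ("?", elem("e")), ("+", elem("f"))))
simplified = simplify(example)
```

For the example of slides 34 to 39, simplified is "b?, c?, e*, f*", the same result as the step-by-step derivation.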

40 You try it: Can you simplify the expression (b|c|e)?,(e?|(f?,(b,b)*))* using the rules on slide 33?

41 DTD Graphs In order to describe a technique for converting a DTD to a schema it is convenient to first describe DTDs (or rather simplified DTDs) as graphs Its nodes are elements, attributes and operators in the DTD Each element appears exactly once in the graph Attributes and operators appear as many times as they are in the DTD Cycles indicate recursion

42 DTD Example

43 Corresponding DTD Graph [Figure: the DTD graph; its legend marks attribute nodes.]

44 Creating the Schema: Shared Inline Technique When creating the schema for a DTD, we create a relation for: –each element with in-degree greater than 1 –each element with in-degree 0 –each element below a * –one element from each set of mutually recursive elements that all have in-degree 1. All other elements are “inlined” into their parent’s relation (i.e., stored as columns of the parent’s relation) –Note that the parent may itself be inlined
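
A rough sketch of the first three rules. The DTD graph below is hypothetical: the graph figure did not survive, so the edge list is an assumption chosen to be consistent with the relations listed on slide 47, and attributes are left out. The fourth rule (mutually recursive elements of in-degree 1) is deliberately not implemented; in this graph the only cycle, monograph and editor, is already broken by the * rule.

```python
from collections import Counter

# Hypothetical DTD graph: (parent, child, reached_through_star).
EDGES = [
    ("book", "booktitle", False), ("book", "author", False),
    ("article", "title", False), ("article", "author", True),
    ("article", "contactauthor", False),
    ("monograph", "title", False), ("monograph", "author", False),
    ("monograph", "editor", False),
    ("editor", "monograph", True),
    ("author", "name", False), ("author", "address", False),
    ("name", "firstname", False), ("name", "lastname", False),
]


def relations_shared(edges):
    """Elements that get their own relation under rules 1 to 3."""
    elements = {p for p, _, _ in edges} | {c for _, c, _ in edges}
    in_degree = Counter(c for _, c, _ in edges)       # missing keys count as 0
    below_star = {c for _, c, star in edges if star}
    return sorted(e for e in elements
                  if in_degree[e] == 0 or in_degree[e] > 1 or e in below_star)


relations = relations_shared(EDGES)
inlined = sorted({c for _, c, _ in EDGES} - set(relations))
```

Everything in inlined (for example name, firstname and lastname) ends up as columns of an ancestor's relation, such as author.name.firstname.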

45 In the Relations, Store: Id of node Text content of all leaf nodes that are inlined For all nodes with an incoming edge: –parentID –parentCODE

46 Relations for which elements? [Figure: the DTD graph; its legend marks attribute nodes.]

47 book (bookID: integer, book.booktitle: string)
article (articleID: integer, article.contactauthor.authorid: string)
monograph (monographID: integer, monograph.parentID: integer, monograph.parentCODE: integer, monograph.editor.name: string)
title (titleID: integer, title: string, title.parentID: integer, title.parentCODE: integer)
author (author.parentID: integer, author.parentCODE: integer, authorID: integer, author.authorid: string, author.address: string, author.name.firstname: string, author.name.lastname: string)
What are these for?

48 Advantages/Disadvantages Advantages: –Reduces number of joins for queries like “get the first and last names of an author” –Efficient for queries such as “list all authors with name Jack” Disadvantages: –Extra join needed for “Article with a given title name”

49 Notes Can/Should we use foreign keys to connect child tuples with their parents, e.g., titles with what they belong to? How can we answer queries, such as: –//title –//article/title –//article//name

50 Another Option: Hybrid Inlining Technique Same as Shared, except also inline elements with in-degree greater than one for the places in which they are not recursive or reached through a * node

51 What, in addition, will be inlined? [Figure: the DTD graph; its legend marks attribute nodes.]

52 book (bookID: integer, book.booktitle: string, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string)
article (articleID: integer, article.contactauthor.authorid: string, article.title: string)
monograph (monographID: integer, monograph.parentID: integer, monograph.parentCODE: integer, monograph.title: string, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string, monograph.editor.name: string)
author (authorID: integer, author.parentID: integer, author.parentCODE: integer, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string)
Why do we still have an author relation?

53 Advantages/Disadvantages Advantages: –Reduces joins through shared elements (that are not set or recursive elements) –Reduces joins for queries like “get first and last names of a book author” (like Shared) Disadvantages: –Requires more SQL sub-queries to retrieve all authors with first name Jack (i.e., unions) Tradeoff between reducing number of unions and reducing number of joins – Shared and Hybrid target union- and join-reduction, respectively

54 XML in Major Databases All major databases now have some level of support for XML Example: Oracle –XML data type (can have a column which contains XML documents) –XPath processing of XML values –Some indexing capabilities –XML is a second class citizen in the database (support consists of a bunch of tools – no coherent framework)

55 Homework (Part 1) Consider the DTD: <!DOCTYPE a [ ]>

56 Homework (Part 1) Simplify the DTD and draw the DTD graph that corresponds to the simplified DTD. Show the schema that would be created using the Shared-Inline Technique. Show the schema that would be created using the Hybrid-Inlining Technique. NOTE: This example is a bit tricky. Make sure that you follow the rules given in class and that documents can be reconstructed from (1) the data stored in the relations and (2) the knowledge of the DTD structure

57 Native Databases for XML

58 Basic Idea: Store XML as a tree. Main challenge: make querying efficient (recall the difficulties when storing XML as a file) –appropriate indexing –efficient query processing. Several native XML database systems have been developed: –TIMBER (University of Michigan) –ToX (University of Toronto) –etc.

59 Natix [Figure: a document tree (bib, book, title, author, …) split across storage blocks.] Subtrees are stored in blocks. When a block is full, another block is used, reached through a pointer to the block containing the child

60 Indexing In order to do efficient query processing, indexes are used. Reminder: An index is a structure that “points” directly to nodes satisfying a given constraint. More indexes usually allow query processing to be more efficient, but also take up more space (time/space tradeoff)

61 Indexing Strategy We will discuss different indexing strategies and query processing with these indexes –Element and value inverted lists –Rotated paths –Graph-based indexes

62 Element and Value Inverted Lists

63 Basic Indexes At minimum, the following indexes are usually stored: –Value indexes: for each value appearing in the tree, there is a list of nodes containing the value –Element indexes: for each element name appearing in the tree, there is a list of nodes with the corresponding element. Sometimes also structure indexes: for certain XPath expressions, there is a list of nodes that satisfy the expression
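
A minimal sketch of building both inverted lists in one pass over a document. The document text is an assumption (both tickers are given exch = NYSE, as in the index examples), and attribute names are indexed together with element names, since the examples list exch in the element index. Nodes are numbered in document order: element, then its attributes and their values, then its text, then its children.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed sample document for the index examples.
DOC = ('<transaction><account>89-344</account>'
       '<buy shares="100"><ticker exch="NYSE">WEBM</ticker></buy>'
       '<sell shares="30"><ticker exch="NYSE">GE</ticker></sell></transaction>')


def build_indexes(xml_text):
    """Return (element_index, value_index): name or value -> sorted node ids."""
    element_index = defaultdict(list)
    value_index = defaultdict(list)
    counter = 0

    def next_id():
        nonlocal counter
        counter += 1
        return counter

    def visit(elem):
        element_index[elem.tag].append(next_id())
        for name, value in elem.attrib.items():
            element_index[name].append(next_id())    # attribute name node
            value_index[value].append(next_id())     # attribute value node
        if elem.text and elem.text.strip():
            value_index[elem.text.strip()].append(next_id())
        for child in elem:
            visit(child)

    visit(ET.fromstring(xml_text))
    return dict(element_index), dict(value_index)


element_index, value_index = build_indexes(DOC)
```

Because ids are handed out in document order, every list comes out sorted, which the merge-based algorithms later in the deck rely on.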

64 Example: Value Indexes [Figure: the transaction tree with its nodes numbered 1 to 17.] WEBM → 10; NYSE → 16, 9

65 Example: Element Indexes [Figure: the same numbered tree.] buy → 4; exch → 15, 8

66 Example: Structure Indexes [Figure: the same numbered tree.] //buy//exch → 8

67 Query Processing Suppose that we only have value indexes and element indexes. How should we process the query //buy//exch? –Strategy 1: Find buy elements. Then traverse the subtree of these elements to look for exch elements –Strategy 2: Find exch elements. Then traverse the ancestors of these elements to look for buy elements. Which is the better strategy?

68 //buy//exch: Strategy 1 [Figure: the numbered transaction tree with the index entries buy → 4 and exch → 15, 8.]

69 //buy//exch: Strategy 2 [Figure: the numbered transaction tree with the index entries buy → 4 and exch → 15, 8.]

70 Both Strategies Are BAD! Both strategies require traversal of the tree Many disk reads Will be inefficient, if tree is large! GOAL: Answer queries using indices only, without traversing the XML tree

71 Improving the Execution Instead of storing a running id for each element, store a triple: (start, end, level). Then: –Find buy elements –Find exch elements –Merge these two lists by finding exch elements that are nested within buy elements. Level is used in case we are interested in finding children, not descendants

72 //buy//exch: Improved Index entries as (start, end, level): buy → (4,10,2); exch → (15,17,4), (8,9,4). Merge the 2 lists by finding descendant elements. What does this remind you of?
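
With (start, end, level) triples, the nesting test needs no tree traversal. A small sketch using the triples from this slide; the names Region, is_descendant and is_child are illustrative assumptions.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int
    end: int
    level: int


def is_descendant(anc: Region, desc: Region) -> bool:
    """desc lies strictly inside the interval of anc."""
    return anc.start < desc.start and desc.end <= anc.end


def is_child(parent: Region, child: Region) -> bool:
    """A child is a descendant exactly one level deeper."""
    return is_descendant(parent, child) and child.level == parent.level + 1


buy = Region(4, 10, 2)
exch_list = [Region(8, 9, 4), Region(15, 17, 4)]

# //buy//exch: keep the exch entries nested inside the buy entry.
nested = [e for e in exch_list if is_descendant(buy, e)]
```

Only the exch entry starting at 8 is nested inside buy, which agrees with the structure index entry //buy//exch → 8. It is a descendant but not a child, since its level is 4 and not 3.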

73 Merging Lists What is the complexity of merging the lists? Is it enough to go through each list once? –Assuming the lists are sorted by start? Example: Suppose we want to find all pairs of a and b such that b is a descendant of a [Figure: a tree with two a nodes and three b nodes.]

74 Merging Lists: Example Example: Suppose we want to find all pairs of a and b such that b is a descendant of a [Figure: a (1,7,1) contains b (2,2,2) and a (3,6,2); a (3,6,2) contains b (4,4,3) and b (5,5,3).] a list: (1,7,1), (3,6,2). b list: (2,2,2), (4,4,3), (5,5,3). Where should we go on the b list?

75 Merging Lists: Example Example: Suppose we want to find all pairs of a and b such that b is a descendant of a [Figure: the same tree and lists, next step of the merge.]

76 Merging Lists: Example We did extra work. Need a method to find the correct place to start in the b list [Figure: the same tree and lists.]

77 Minimizing the Work Several algorithms have been defined to minimize the amount of work required, by identifying exactly where to restart. See: –Shu-Yao Chien, Zografoula Vagena, Donghui Zhang, Vassilis J. Tsotras, Carlo Zaniolo, “Efficient Structural Joins on Indexed XML Documents”, Proc. of VLDB 2002 –Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jignesh M. Patel, Divesh Srivastava, Yuqing Wu, “Structural Joins: A Primitive for Efficient XML Query Pattern Matching”, ICDE 2002 –Nicolas Bruno, Nick Koudas, Divesh Srivastava, “Holistic Twig Joins: Optimal XML Pattern Matching”, ACM SIGMOD 2002

78 Goal Efficiently find all pairs of nodes n, m such that m is a descendant (or child) of n, and n and m have the user-specified labels –E.g., a//b, c//d, e/f. Recall: –For any label, we have a sorted list (i.e., an index) of nodes with that label –The sorted list of ids contains both the starting position of a node and its ending position

79 Stack-Tree Algorithms: Intuition A depth-first traversal of a tree can be performed in linear time, using a stack as large as the height of the tree. An ancestor-descendant structural relationship is manifested as the descendant appearing higher on the stack than the ancestor. Unfortunately, a depth-first traversal requires going over the whole tree. –DON’T GO OVER THE TREE!! ONLY THE INDEX

80 Stack-Tree Algorithms We will study the algorithm –Stack-Tree-Desc, which returns the result ordered by (desc-start, anc-start). The paper also discusses the algorithm –Stack-Tree-Anc, which returns the result ordered by (anc-start, desc-start). Why is the ordering of the result of interest?

81 Stack-Tree-Desc
  a = Alist->first node; d = Dlist->first node; OutputList = NULL;
  while (lists are not finished or stack is not empty) {
    if (a.startPos < d.startPos) then e = a; else e = d;
    // pop every stack entry that ended before e starts
    while (stack not empty and e.startPos > stack.Top().endPos)
      stack.Pop();
    if (e == a) {
      stack.Push(a);
      a = a->nextNode;
    } else {
      // every node left on the stack is an ancestor of d
      for each a’ in stack do
        append (a’, d) to OutputList;
      d = d->nextNode;
    }
  }
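
A runnable sketch of the pseudocode in Python. It assumes both lists hold (start, end, level) triples sorted by start, as in the merging example. One deliberate deviation, labeled here as an assumption: the loop stops as soon as Dlist is exhausted, since no further pairs can be produced after that point.

```python
from typing import List, NamedTuple, Tuple


class Node(NamedTuple):
    start: int
    end: int
    level: int


def stack_tree_desc(alist: List[Node], dlist: List[Node]) -> List[Tuple[Node, Node]]:
    """Return (ancestor, descendant) pairs ordered by (desc.start, anc.start)."""
    out: List[Tuple[Node, Node]] = []
    stack: List[Node] = []                 # nested ancestors, outermost at the bottom
    i = j = 0
    while j < len(dlist) and (i < len(alist) or stack):
        take_a = i < len(alist) and alist[i].start < dlist[j].start
        e = alist[i] if take_a else dlist[j]
        # Pop every stack entry that ended before e starts.
        while stack and e.start > stack[-1].end:
            stack.pop()
        if take_a:
            stack.append(e)
            i += 1
        else:
            # Every node still on the stack contains e.
            out.extend((a, e) for a in stack)
            j += 1
    return out


# Lists from the merging example.
a_list = [Node(1, 7, 1), Node(3, 6, 2)]
b_list = [Node(2, 2, 2), Node(4, 4, 3), Node(5, 5, 3)]
pairs = stack_tree_desc(a_list, b_list)
```

On the merging example this yields five pairs: b (2,2,2) pairs only with a (1,7,1), while b (4,4,3) and b (5,5,3) each pair with both a nodes. Each list is scanned once, with no restart on the b list.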

82 Stack-Tree-Desc: //section//paragraph [Figure: an article with nested section elements that contain paragraph elements.]

83 Stack-Tree-Desc: //section//paragraph [Figure: the same article; the section elements form Alist.]

84 Stack-Tree-Desc: //section//paragraph [Figure: the same article; the paragraph elements form Dlist.]

85 Stack-Tree-Desc: //section//paragraph [Figure: the same article with its sections labeled a1, a2, a3 and its paragraphs labeled d1 to d7.]

86 Stack-Tree-Desc: //section//paragraph [Figure: the labeled article beside the section list (a1, a2, a3) and the paragraph list (d1, d2, d3, …).] Note: These lists are not created at the beginning of the algorithm. They are already available!

87 Stack-Tree-Desc: Trace. Nodes in order of start position: a1, d1, a2, d2, a3, d3, d4, d5, d6, d7. Stack: a1; then a1, a2; then a1, a2, a3. Output: (a1,d1); (a1,d2), (a2,d2); (a1,d3), (a2,d3), (a3,d3); (a1,d4), (a2,d4), (a3,d4); (a1,d5), (a2,d5); (a1,d6). No pair is output for d7

88 Analysis of Stack-Tree-Desc O(|Alist| + |Dlist| + |OutputList|) for ancestor-descendant structural relationships. –Each Alist element is pushed once and popped once, so stack operations take O(|Alist|). –The inner “for loop” outputs a new pair each time, so its total time is O(|OutputList|).

89 Questions and Disadvantages Can a similar algorithm be used to compute other axes? –e.g., child, following. How can we use an algorithm for computing a single “step” to compute an entire XPath query? –E.g., //a//b[//c/d]//e

90 Tree Patterns Can Be Computed From Structural Relationships [Figure: a tree pattern over book, title (XML) and author (jane) with child and descendant edges, decomposed into its individual edges.] The algorithm presented computes only a single-edge query. Results can be combined to answer the entire query.

91 Homework (Part 2) The underlying assumption behind the algorithm Stack-Tree-Desc is that the XML forms a tree. Can the algorithm easily be adapted to return ancestor-descendant pairs (as it does now) if the XML forms a graph? –If so, how? If not, explain intuitively why this is difficult.

92 Graph-Based Indexes: DataGuides

93 Exploiting Regularity XML documents tend to have a very repetitive structure. Structure can be summarized in a (relatively) small graph, called a dataguide. Nodes in a dataguide point to their corresponding nodes in the XML document. Strategy: Evaluate the query over the graph. Then find the corresponding nodes in the document –Very efficient if the dataguide fits into main memory

94 Notes In this work, we will model documents as graphs with the labels on the edges. We will only consider path queries (no branching). Our XML documents can be arbitrary graphs. There are many different types of indexes that exploit the same idea –this was the first (1997)

95 An Example DataGuide: Intuition How would you evaluate the queries: //Name and /Restaurant/Owner?

96 DataGuides: Formally Given a data source (i.e., XML document) X, a graph D is a dataguide for X if: –every path of labels appearing in X appears exactly once in D (conciseness) –every path of labels appearing in D appears at least once in X (accuracy)

97 Example Revisited Observe that every path in X also appears in D. Observe that no path (from the root) appears twice in D [Figure: document X beside DataGuide D.]

98 Is this a DataGuide? [Figure: document X, a graph with edges labeled A, B, B, C, C, C, D, D, D, beside a candidate graph with edges labeled A, B, C, C, D, D.]

99 Is this a DataGuide? [Figure: document X beside a candidate graph with edges labeled A, B, B, C, C, C, D, D, D.]

100 Is this a DataGuide? [Figure: document X beside a candidate graph with edges labeled A, B, C, C, C, C, D, D, D.]

101 Is this a DataGuide? [Figure: document X beside a candidate graph with edges labeled A, B, C, D.]

102 Choosing a DataGuide [Figure: document X beside Option 1 (edges labeled A, B, C, C, D, D) and Option 2 (edges labeled A, B, C, D).] What does D point to?

103 Strong DataGuide: Formally Consider source X and dataguide D. Let p, p’ be two label paths. Let p(X) be the set of nodes reached in X by traversing path p. We define p ≡_X p’ if p(X) = p’(X) –That is, p and p’ are indistinguishable on X –D is a strong DataGuide for a database X if the equivalence relations ≡_D and ≡_X are the same

104 Strong DataGuides Is (b) a strong dataguide for (a)? Is (c) a strong dataguide for (a)?

105 Creating a Strong Dataguide Strong dataguides can be used as indexes since they are unambiguous. How big might a strong dataguide be? Can it be created efficiently? –In general, exponential time. Requires turning a nondeterministic automaton into a deterministic one –If the XML is a tree, can be created in linear time

106 MakeDataGuide(n) {
    dg = NewObject()
    targetHash.Insert({n}, dg)
    RecursiveMake({n}, dg)
  }
  RecursiveMake(t1, d1) {
    p = set of children pairs of each object in t1
    foreach (unique label l in p) {
      t2 = set of node-ids paired with l in p
      d2 = targetHash.Lookup(t2)
      if (d2 != nil) {
        add an edge from d1 to d2 with label l
      } else {
        d2 = NewObject()
        targetHash.Insert(t2, d2)
        add an edge from d1 to d2 with label l
        RecursiveMake(t2, d2)
      }
    }
  }
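
A Python sketch of the same construction. The source is modeled as a dictionary from node id to a list of (label, child) pairs; this representation, the small example graph and the helper evaluate are assumptions for illustration. Target sets are stored as frozensets so that they can serve as hash keys.

```python
def make_dataguide(graph, root):
    """Return (dg_edges, targets) for an edge-labeled source graph.

    dg_edges: DataGuide object -> {label: DataGuide object}
    targets:  DataGuide object -> set of source nodes it stands for
    """
    target_hash = {}                  # frozenset of source nodes -> DataGuide object
    dg_edges = {}

    def new_object(target):
        obj = len(target_hash)        # objects are numbered 0, 1, 2, ...
        target_hash[target] = obj
        dg_edges[obj] = {}
        return obj

    def recursive_make(t1, d1):
        by_label = {}                 # label -> source nodes reached from t1
        for node in t1:
            for label, child in graph.get(node, []):
                by_label.setdefault(label, set()).add(child)
        for label in sorted(by_label):
            t2 = frozenset(by_label[label])
            d2 = target_hash.get(t2)
            if d2 is None:            # target set not seen before
                d2 = new_object(t2)
                dg_edges[d1][label] = d2
                recursive_make(t2, d2)
            else:                     # reuse the object, which also ends cycles
                dg_edges[d1][label] = d2

    start = frozenset([root])
    recursive_make(start, new_object(start))
    targets = {obj: set(t) for t, obj in target_hash.items()}
    return dg_edges, targets


def evaluate(path, dg_edges, targets):
    """Follow a label path from the DataGuide root (object 0)."""
    obj = 0
    for label in path:
        obj = dg_edges[obj].get(label)
        if obj is None:
            return set()
    return targets[obj]


# Hypothetical source: node 1 has an A child and two B children.
SOURCE = {1: [("A", 2), ("B", 3), ("B", 4)],
          3: [("C", 5)],
          4: [("C", 6), ("D", 7)]}
dg_edges, targets = make_dataguide(SOURCE, 1)
```

In this example both B children collapse into one DataGuide object with target set {3, 4}, so the path B/C is followed once in the DataGuide and returns the source nodes {5, 6}.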

107 Can you create a Strong DataGuide? Intuition: If the sets of nodes which are reachable for simple paths are equal, then the simple paths are represented as a single node. Compute on blackboard [Figure: a source graph with nodes 1 to 6 and edges labeled A, B, C, beside its strong DataGuide, whose nodes are annotated with target sets such as 1; 2,4; 3,5; 6.]

108 Summary Advantages: –if the dataguide can fit in memory, evaluation can be performed efficiently for path queries. Disadvantages: –May be large (why is this worse here than for the rotated lexicon?) –Only good for simple queries. Which axes?

109 Homework (Part 3) Construct a strong dataguide for this document, using the algorithm shown. Show an example of a database, strong dataguide and XPath query such that evaluating the XPath query on the dataguide (and then finding the corresponding database nodes) yields a different answer than evaluating the query directly on the database.

