Presentation is loading. Please wait.

Presentation is loading. Please wait.

Dr. N. MamoulisAdvanced Database Technologies1 Topic 8: Semi-structured Data In various application domains, the data are semi-structured; the database.

Similar presentations


Presentation on theme: "Dr. N. MamoulisAdvanced Database Technologies1 Topic 8: Semi-structured Data In various application domains, the data are semi-structured; the database."— Presentation transcript:

1 Dr. N. MamoulisAdvanced Database Technologies1 Topic 8: Semi-structured Data In various application domains, the data are semi-structured; the database schema is loose-defined. Semi-structured data need specialized management methods. The most characteristic example is XML data where the elements (tags) define the semantics of the information. xxx yyy zzz 7

2 Dr. N. MamoulisAdvanced Database Technologies2 XML and HTML XML (like HTML) is a subset of SGML In HTML the tags serve the primary purpose of describing how to display a data item. On the other hand, XML tags describe the data itself. Tags are called elements in XML. This means that a program receiving an XML document can interpret it in multiple ways, can filter the document upon its content, can restructure it for a different application, etc.

3 Dr. N. MamoulisAdvanced Database Technologies3 Example of HTML document Course Details CSIS7101 Detailed (tentative) schedule 1 Introduction Introduction by the instructor [ slides in pdf ]* attribute=“value” text

4 Dr. N. MamoulisAdvanced Database Technologies4 Example of XML document The Selfish Gene Richard Dawkins Timbuktu 99999 attribute=“value” text

5 Dr. N. MamoulisAdvanced Database Technologies5 XML and Databases XML is becoming a standard for information exchange over the internet. Since actual data are stored in XML documents, it should be possible to query the data. Here comes the role of databases: How should we organize and query XML data?

6 Dr. N. MamoulisAdvanced Database Technologies6 XML and Databases (cont’d) Solution 1: Use specialized storage methods, query languages and query evaluation techniques for semi-structured data. Solution 2: Represent XML data in relational tables, transform queries to SQL, and use the mature relational DB technology.

7 Dr. N. MamoulisAdvanced Database Technologies7 More on XML – Document Type Descriptors DTDs (Document Type Descriptors) define (and control) the schema of XML documents for a specific application. Thus, now the structure of these documents is not free, but should conform to the DTD. The DTD can help define a relational schema for the class of documents that conform to it.

8 Dr. N. MamoulisAdvanced Database Technologies8 Example of a DTD 0 or more times 0 or 1 time reference to existing ID anything can be nested under address

9 Dr. N. MamoulisAdvanced Database Technologies9 More on XML – Queries Queries on XML data are described by the structural relationships of elements, attributes and values. Several XML Query Languages have been proposed. XML-QL, Lorel, UnQL, XQL, XPath, XQuery,...

10 Dr. N. MamoulisAdvanced Database Technologies10 More on XML – Query Example Find the last name of the author of book “the selfish gene” WHERE The Selfish Gene $l IN db.xml CONSTRUCT $l

11 Dr. N. MamoulisAdvanced Database Technologies11 XML data represented as graphs An XML document can be represented as a node-labeled graph. The labels of the graph are element tags, attribute names and values. Most documents can be represented by trees. The edges that transform a tree to a graph come from ID references.

12 Dr. N. MamoulisAdvanced Database Technologies12 Example of a tree representation book booktitle The Selfish Gene author id dawkins name firstname lastname RichardDawkins address cityzip Timbuktu99999

13 Dr. N. MamoulisAdvanced Database Technologies13 Example of a graph representation The Importance of Evergreen John Smith Smithsville Jones Jonesville

14 Dr. N. MamoulisAdvanced Database Technologies14 Example of a graph representation (cont’d) article title The Importance of Evergreen author id smith name firstname John Smith address lastname Smithsville author id jones name lastname Jones address Jonessville contactauthor authorid

15 Dr. N. MamoulisAdvanced Database Technologies15 XML Query types Queries with absolute path expressions. These queries retrieve paths where the first element is the ROOT of the document. Example: find all books written by author with lastname=“Smith” book/author/name/lastname/Smith

16 Dr. N. MamoulisAdvanced Database Technologies16 XML Query types Queries with simple path expressions. These queries retrieve paths where the first element can be any element of the document. Example: find all items written by author with lastname=“Smith” //author/name/lastname/Smith anything can be before author tag

17 Dr. N. MamoulisAdvanced Database Technologies17 XML Query types Queries with regular path expressions. These queries retrieve paths where the not all elements on the path are specified. Example: find all documents with an “author” element with a descendant “Smith” in the graph //author//Smith any path can be between author and Smith

18 Dr. N. MamoulisAdvanced Database Technologies18 XML Query types In general, other symbols may be used to denote the distance between the path elements. Example: find all documents with an “author” followed by one element, then one or none elements, and then by “Smith”. //author/_ /?/Smith exactly one element should be here one or no element could be here

19 Dr. N. MamoulisAdvanced Database Technologies19 XML Query types Queries that match multiple regular path expressions. The paths are joined in a root element and the whole query is represented by a twig (small tree). Example: find the book with title “XML” written by an author with a descendant “Smith” in the graph book[/title/XML][//author//Smith] book title XML author Smith

20 Dr. N. MamoulisAdvanced Database Technologies20 Indexing and XML Query Processing Several storage schemes and indexes have been proposed for the queries discussed above. Some of them index the paths or subgraphs of the XML structures. Some decompose the information and flatten it into relational DB tables.

21 Dr. N. MamoulisAdvanced Database Technologies21 Path indexes for XML data If many documents exist, they are connected into a large graph by adding a common root. Then a structural summary of the XML graph is created. All the paths in the data graph are preserved into the summary graph. If we keep pointers to the original graph into the summary graph, then this becomes an index.

22 Dr. N. MamoulisAdvanced Database Technologies22 Path index example alldocuments (root) title book author name title book author nameaddress title article author nameaddress author name 1 2 34 5 6 7 8 910 1112 13 1415 1617 18 19 A. a graph of documents

23 Dr. N. MamoulisAdvanced Database Technologies23 Path index example alldocuments (root) title book authortitle article author nameaddress 1 2,8 3,9 4,6,10 13 1415,18 16,1917 B. the 1-index name 5,7,1112 address The 1-index maintains information about all paths in the original graph

24 Dr. N. MamoulisAdvanced Database Technologies24 Path indexes for XML data The 1-index maintains information about all paths in the original graph. This makes the index very large (with size comparable to the data size). Therefore it is quite expensive to evaluate queries using this index. To address this problem an A(k)-index is proposed which indexes exactly only paths up to length k.

25 Dr. N. MamoulisAdvanced Database Technologies25 Bisimilarity Two nodes u, v are called bisimilar if: They have the same label. If u’ is the parent of u, then there is a parent v’ of v, such that u’,v’ are also bisimilar, and vice versa.

26 Dr. N. MamoulisAdvanced Database Technologies26 Bisimilarity Example Nodes 4 and 10 are bisimilar Nodes 10 and 15 are not bisimilar alldocuments (root) title book author name title book author nameaddress title article author nameaddress author name 1 2 34 5 6 7 8 910 1112 13 1415 1617 18 19

27 Dr. N. MamoulisAdvanced Database Technologies27 Bisimilarity defines the 1-index Bisimilar nodes are stored in the same node in the summary index. alldocuments (root) title book authortitle article author nameaddress 1 2,8 3,9 4,6,10 13 1415,18 16,1917 name 5,7,1112 address

28 Dr. N. MamoulisAdvanced Database Technologies28 Bisimilarity and the A(k) index In the A(k) index, the notion of k- bisimilarity is used: Two nodes u,v are 0-bisimilar, if they have the same label. Two nodes u,v are k-bisimilar, if they have the same label and for every parent u’ of u, there is a parent v’ of v, such that u’ and v’ are (k-1)-bisimilar, and vice versa.

29 Dr. N. MamoulisAdvanced Database Technologies29 k-bisimilarity Example Nodes 5 and 16 are 1-bisimilar Nodes 6 and 15 are not 1-bisimilar alldocuments (root) title book author name title book author nameaddress title article author nameaddress author name 1 2 34 5 6 7 8 910 1112 13 1415 1617 18 19

30 Dr. N. MamoulisAdvanced Database Technologies30 Bisimilarity and the A(k) index The A(k)-index stores exactly all paths of length k, or else: all k-bisimilar nodes in the data graph are stored in the same node in the index graph. This means that all incoming paths up to length k are encoded in the index.

31 Dr. N. MamoulisAdvanced Database Technologies31 A(k)-index example alldocuments (root) title book authortitle article author nameaddress 1 2,8 3,9 4,6,10 13 1415,18 16,1917 A(3) and A(2)-index name 5,7,1112 address

32 Dr. N. MamoulisAdvanced Database Technologies32 A(k)-index example alldocuments (root) title book authortitle article author 1 2,8 3,9 4,6,10 13 1415,18 A(1)-index name 5,7,11,16,19 12,17 address alldocuments (root) title book author article 1 2,8 3,9,14 4,6,10,15,18 A(0)-index name 5,7,11,16,1912,17 address 13

33 Dr. N. MamoulisAdvanced Database Technologies33 Using the A(k) index to search A Label Map is constructed together with the index, where each label points to its positions in the index. alldocuments (root) title book authortitle article author 1 2,8 3,9 4,6,10 13 1415,18 A(1)-index name 5,7,11,16,19 12,17 address name author title article book

34 Dr. N. MamoulisAdvanced Database Technologies34 Evaluation of path queries Assume that a path query q of length ≤k is applied. The A(k) index can answer the query as follows. First the last label of q is found and the label map is used to find its positions in the index. Then the index is traversed backwards to complete the answer.

35 Dr. N. MamoulisAdvanced Database Technologies35 Evaluation of path queries (example) Query book/title alldocuments (root) title book authortitle article author 1 2,8 3,9 4,6,10 13 1415,18 A(1)-index name 5,7,11,16,19 12,17 address name author title article book OK

36 Dr. N. MamoulisAdvanced Database Technologies36 Evaluation of path queries If the path of the query is longer than k, we may need to access the actual data. Thus, A(k)-index alone cannot be used to answer the query in this case. This is because if we traverse the index backwards we may find false positive paths that actually do not exist in the graph. Paths that share information are grouped to decrease the potential cost.

37 Dr. N. MamoulisAdvanced Database Technologies37 Evaluation of path queries (2 nd example) Query book/author/name path has length 2>1 alldocuments (root) title book authortitle article author 1 2,8 3,9 4,6,10 13 1415,18 A(1)-index name 5,7,11,16,19 12,17 address name author title article book

38 Dr. N. MamoulisAdvanced Database Technologies38 Problems of path indexes They are appropriate only for simple path queries up to a certain length. Therefore if a query has branches or regular path expressions the index cannot provide exact answers, but the actual data have to be accessed. Also these indexes have high storage and update cost.

39 Dr. N. MamoulisAdvanced Database Technologies39 Storing and indexing XML data in relational databases We can decompose the structural information into tables and use them to answer queries. This reduces the volume of data that need to be accessed for a single query and we can use off-the-shelf query processing and optimization techniques. On the other hand, we may need expensive joins during query processing

40 Dr. N. MamoulisAdvanced Database Technologies40 A decomposition model for XML data The storage model indexes the elements and text of the documents by their position in the graph. If the structures are trees, this representation can help to answer queries fast. On the other hand, for graphs the positions of the elements many times cannot help fast query evaluation because of recursion and other problems the incur.

41 Dr. N. MamoulisAdvanced Database Technologies41 Encoding elements, attributes and values based on their positions. The position of each element/attribute occurrence is represented as a 3-tuple (Document-id, StartPos:EndPos, LevelNum) Values (text) is encoded using (Document-id, StartPos, LevelNum): Document-id is the id of the document that contains the element StartPos is the number of words from the beginning of the document until the start of the element EndPos is the number of words from the beginning of the document until the end of the element LevelNum is the nesting depth of the element

42 Dr. N. MamoulisAdvanced Database Technologies42 Encoding example The Selfish Gene Richard Dawkins Timbuktu 99999

43 Dr. N. MamoulisAdvanced Database Technologies43 Encoding Example book booktitle The Selfish Gene author id dawkins name firstname lastname RichardDawkins address cityzip Timbuktu99999 (1,1:27,1) (1,2:6,2) (1,3,3) (1,7:26,2) (1,8:9,3) (1,9,4) (1,10:17,3) (1,11:13,4) (1,12,5) (1,15,5) (1,14:16,4)(1,19:21,4) (1,22:24,4) (1,18:25,3) (1,20,5)(1,23,5)

44 Dr. N. MamoulisAdvanced Database Technologies44 Using the encoding to determine a structural relationship We can use the encoding to find fast the relationship between two elements (or between an element and a value). Element e 1 is an ancestor of element e 2 in the same document iff: e 1.DocumentId = e 2.DocumentId e 1.StartPos> e 2.StartPos && e 1.EndPos< e 2.EndPos (interval coverage) If the above hold and, in addition, e 1.LevelNum+1 = e 2.LevelNum, then e 1 is the parent of e 2.

45 Dr. N. MamoulisAdvanced Database Technologies45 Answering queries using the encoding Assume that all documents have been flattened to tables and the encoding is used to index the positions of each element and value in the documents. We store all information in a table: (ElementId, Document-id, StartPos:EndPos, LevelNum) The table is clustered by ElementId and sorted by (Document-id, StartPos).

46 Dr. N. MamoulisAdvanced Database Technologies46 Answering queries using the encoding (cont’d) Example ElementIdDocument-idStartPosEndPosLevelNum book11271 booktitle1262 author17262...

47 Dr. N. MamoulisAdvanced Database Technologies47 Answering queries using the encoding (cont’d) The query is broken into binary parent-child or ancestor descendant relationships. Example: book[/title/XML][//author//Smith] Broken to: book/title title/XML book//author author//Smith book title XML author Smith

48 Dr. N. MamoulisAdvanced Database Technologies48 Answering queries using the encoding (cont’d) Each binary query is executed as a join, and their results are “stitched” together to formulate the results of the whole query. Example: book/author/address book/author: (2,4),(2,6),(8,10) author/address: (10,12),(15,17) book/author/title: (8,10,12) title book author name title book author nameaddress title article author nameaddress author name 2 34 5 6 7 8 910 1112 13 1415 1617 18 19

49 Dr. N. MamoulisAdvanced Database Technologies49 How to process the binary joins Thus the “heart” of XML query processing is the algorithm that joins the elements table to retrieve the results for each individual query component. One method to process the binary join is to apply a merge-join algorithm, since the table is already sorted by Element,DocId,StartPos. Assume that the query is an A//D, where A is the ancestor element and D is the descendant element

50 Dr. N. MamoulisAdvanced Database Technologies50 How to process the binary joins. The tree-merge join algorithm Example: 123342 125332 145562 215283 245573 AList 130324 168725 217184 220224 243602 DList

51 Dr. N. MamoulisAdvanced Database Technologies51 Worst case for the tree-merge join algorithm Example:... a1a1 a2a2 anan d1d1 d 2n d 2n-1 d2d2 d n+1 dndn a1a1 a2a2... d1d1 d2d2 dndn d n+1... d 2n-1 d 2n... anan

52 Dr. N. MamoulisAdvanced Database Technologies52 How to process the binary joins. The tree-merge join algorithm The tree merge join may perform many passes to the “inner” DList table, one for each element in AList that mathes the elements there. In order to avoid this a stack-tree join algorithm is proposed. OBSERVATION: We can get all the join results by a depth-first traversal of the XML tree.

53 Dr. N. MamoulisAdvanced Database Technologies53 The stack-tree join algorithm The lists are merged together as before, but a stack is maintained to keep nested AList elements which are in the same path as the current element from DList. When a qualifying element in DList is found, all elements of AList in the stack are output.

54 Dr. N. MamoulisAdvanced Database Technologies54 Stack-tree join example a1a1 a2a2 d1d1 a3a3 d2d2 d3d3 d4d4 a4a4 d5d5 d6d6 a1a1 a2a2 a3a3 a4a4 d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 AListDListStack a1a1 Output

55 Dr. N. MamoulisAdvanced Database Technologies55 Stack-tree join example a1a1 a2a2 d1d1 a3a3 d2d2 d3d3 d4d4 a4a4 d5d5 d6d6 a1a1 a2a2 a3a3 a4a4 d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 AListDListStack a1a1 a2a2 Output a 1,d 1 a 2,d 1

56 Dr. N. MamoulisAdvanced Database Technologies56 Stack-tree join example a1a1 a2a2 d1d1 a3a3 d2d2 d3d3 d4d4 a4a4 d5d5 d6d6 a1a1 a2a2 a3a3 a4a4 d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 AListDListStack a1a1 a3a3 Output a 1,d 1 a 2,d 1

57 Dr. N. MamoulisAdvanced Database Technologies57 Stack-tree join example a1a1 a2a2 d1d1 a3a3 d2d2 d3d3 d4d4 a4a4 d5d5 d6d6 a1a1 a2a2 a3a3 a4a4 d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 AListDListStack a1a1 a3a3 Output a 1,d 1 a 2,d 1 a 1,d 2 a 3,d 2

58 Dr. N. MamoulisAdvanced Database Technologies58 Stack-tree join example a1a1 a2a2 d1d1 a3a3 d2d2 d3d3 d4d4 a4a4 d5d5 d6d6 a1a1 a2a2 a3a3 a4a4 d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 AListDListStack a1a1 a3a3 Output a 1,d 1 a 2,d 1 a 1,d 2 a 3,d 2 a 1,d 3 a 3,d 3

59 Dr. N. MamoulisAdvanced Database Technologies59 Stack-tree join example a1a1 a2a2 d1d1 a3a3 d2d2 d3d3 d4d4 a4a4 d5d5 d6d6 a1a1 a2a2 a3a3 a4a4 d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 AListDListStack a1a1 Output a 1,d 1 a 2,d 1 a 1,d 2 a 3,d 2 a 1,d 3 a 3,d 3 a 1,d 4

60 Dr. N. MamoulisAdvanced Database Technologies60 Stack-tree join example a1a1 a2a2 d1d1 a3a3 d2d2 d3d3 d4d4 a4a4 d5d5 d6d6 a1a1 a2a2 a3a3 a4a4 d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 AListDListStack a1a1 Output a 1,d 1 a 2,d 1 a 1,d 2 a 3,d 2 a 1,d 3 a 3,d 3 a 1,d 4 a4a4 a 1,d 5 a 4,d 5 a 1,d 6 a 4,d 6

61 Dr. N. MamoulisAdvanced Database Technologies61 Comments on the stack-tree join algorithm The algorithm has better worst-case complexity than the tree-merge join algorithm. Both of them have two versions; one that outputs results sorted on AList elements and one that outputs results sorted on DList elements.

62 Dr. N. MamoulisAdvanced Database Technologies62 Limitation of the binary join algorithms They are used only for binary joins. If a query is complex and contains many binary relationships, many intermediate results have to be merged. Example: book[/title/XML][//author//Smith] Broken to: book/title title/XML book//author author//Smith book title XML author Smith

63 Dr. N. MamoulisAdvanced Database Technologies63 An extension of the stack-tree join algorithm The path-join and twig-join algorithms extend the basic stack-join algorithm for complex queries. The idea is the same, but multiple stacks are used to avoid merging the intermediate results. Path-join is appropriate for path queries only (e.g., book/author/name) Twig-join is appropriate for branching (tree) expressions. book title XML author Smith

64 Dr. N. MamoulisAdvanced Database Technologies64 Example of Path-Join Query a//b//c..at point c 3 a1a1 a2a2 b1b1 b2b2 c1c1 c2c2 b3b3 a3a3 b4b4 c5c5 c3c3 c4c4 StackA a1a1 a3a3 StackB b3b3 b4b4 Output c 3,b 4,a 3 c 3,b 4,a 1 c 3,b 3,a 1 c 1,b 2,a 1 c 2,b 2,a 1 c 3,b 3,a 3 is not an answer because b 3 points to a 1 in the next stack!

65 Dr. N. MamoulisAdvanced Database Technologies65 Path-join has optimal asymptotic cost for single-path queries, but...... if a query is a twig of multiple paths may produce many partial results which have then to be joined. The twig-join, joins these results at production time. Example: query a[//b/c][//d/e] two paths a/b/c two paths a/d/e only one twig a[/b/c][/d/e] a b c c e a d c d e a b c d e b

66 Dr. N. MamoulisAdvanced Database Technologies66 Twig Join The twig-join applies path-join at multiple paths at the same time. When at some node there are potential solutions for each path of the query, the algorithm waits for these results and waits for them to be computed. Then the results from each path are joined. a b c c e a d c d e a b c d e b partial result: a/b/c partial result: a/d/e merge result: a[/b/c][/d/e]

67 Dr. N. MamoulisAdvanced Database Technologies67 Limitations of the twig-join and stack-based methods It is useful for simple twigs only, but it is not trivial to extend it for arbitrary trees The encoding can be used for tree- structured XML data only. However, in many cases XML data are graphs. In this case the encoding (and also the stack-based algorithms) are not applicable. book title XML author SmithJones

68 Dr. N. MamoulisAdvanced Database Technologies68 Summary XML data are everywhere today, and efficient management and querying systems are needed. This is why today XML data management is one of the hottest research topics in DB. There are two streams for XML data management: Store XML data into native systems and use special indexing and querying methods. Trasform XML data into relational tables and use/adapt relational query algorithms.

69 Dr. N. MamoulisAdvanced Database Technologies69 References R. Kaushik et al., “Exploiting Local Similarity for Indexing Paths in Graph-Structured Data”, ICDE 2002. S. Al-Khalifa et al., “Structural Joins: A Primitive for Efficient XML Query Pattern Matching”, ICDE 2002. N. Bruno et al., “Holistic Twig Joins: Optimal XML Pattern Matching”, ACM SIGMOD 2002. J. Shanmugasundaram et al. “Relational Databases for Querying XML Documents: Limitations and Opportunities”, VLDB 1999.


Download ppt "Dr. N. MamoulisAdvanced Database Technologies1 Topic 8: Semi-structured Data In various application domains, the data are semi-structured; the database."

Similar presentations


Ads by Google