KAIST 2002 SIGDB Tutorial: Indexing Methods for Efficient XML Query Processing. Jun-Ki Min, KAIST

Presentation transcript:

Slide 1: Indexing Methods for Efficient XML Query Processing
Jun-Ki Min, KAIST

Slide 2: XML
XML (eXtensible Markup Language) is the de facto standard for data representation and exchange on the Web. XML data is an instance of semistructured data: it is self-describing and irregularly structured.

Slide 3: XML Data
XML data comprises hierarchically nested collections of elements. An element can contain an atomic data value, a sequence of subelements, and attributes composed of name-value pairs. Elements can also be linked by ID-IDREF relationships. XML data can be represented as a tree or as a graph.

Slide 4: XML Example
[Figure: graph representation of a libraryDB document with book and paper elements. Element labels include title, author, editor, chapter, and section; data values include title1, author1, title2, author2, and author3.]

Slide 5: XML Query
XML query languages such as XSLT, XML-QL, XPath, and XQuery use path expressions to traverse irregularly structured data, for example /libraryDB/book/title or //title. Evaluating a path expression by searching the whole XML data is inefficient. Structural summaries and path indexes avoid this by restricting the search to only the relevant portions of the XML data.
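The two example path expressions can be tried out with Python's standard library. This is only a sketch: the document below is a made-up miniature of the libraryDB example, and `xml.etree.ElementTree` supports just a subset of XPath. Without an index, the `//title` form has to visit every element.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the libraryDB example; the values are made up.
DOC = """
<libraryDB>
  <book><title>title1</title><author>author1</author></book>
  <paper><title>title2</title><author>author2</author><author>author3</author></paper>
</libraryDB>
"""

root = ET.fromstring(DOC)

# /libraryDB/book/title: root already is libraryDB, so the relative path is book/title.
book_titles = [e.text for e in root.findall("./book/title")]

# //title: every title element anywhere below the root (a full traversal).
all_titles = [e.text for e in root.findall(".//title")]

print(book_titles)
print(all_titles)
```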

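Slide 6: Schemas for XML
DTD and XML Schema specify the constraints of XML data, but they are not mandatory, so an external schema is often lacking.
A structural summary is a summary of the label paths in the data.
A path index is a structural summary plus extents, where an extent is the set of data nodes associated with a summary node.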

Slide 7: Schemas for XML
Applications: user interface, XML data design and editing, query formulation, query validation, query optimization, and path indexes.

Slide 8: Structural Summary
DTD extraction: XTRACT, based on element information.
Structural summary: Representative Objects, based on path information.

Slide 9: XTRACT [Garofalakis, Gionis, Rastogi, Seshadri, Shim: SIGMOD 00]
XTRACT infers a concise and accurate DTD by choosing one DTD from a set of candidate DTDs. For the subelement sequences (a b) and (b a), for example, the candidates include (a|b)* and (a b)|(b a).
The choice is based on the Minimum Description Length (MDL) principle, which ranks each candidate DTD by the number of bits required to describe the subelement sequences in terms of that DTD.
For (a|b)* the cost is 6 (for the DTD) + 3 + 3 = 12; for (a b)|(b a) it is 9 (for the DTD) + 1 + 1 = 11, so the second candidate is preferred.
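The MDL ranking on this slide can be replayed as a small calculation. The sketch below takes the per-DTD and per-sequence costs directly from the slide rather than deriving them; `description_length` is a hypothetical helper name, not part of XTRACT.

```python
def description_length(dtd_cost, sequence_costs):
    """Total cost = cost of the DTD itself + cost of encoding each sequence with it."""
    return dtd_cost + sum(sequence_costs)

# Costs as given on the slide for the two sequences (a b) and (b a).
candidates = {
    "(a|b)*": description_length(6, [3, 3]),
    "(a b)|(b a)": description_length(9, [1, 1]),
}

# MDL picks the candidate with the smallest total description length.
best = min(candidates, key=candidates.get)
print(candidates, best)
```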

Slide 10: Representative Objects (RO) [Nestorov, Ullman, Wiener, Chawathe: ICDE 97]
Representative objects provide a concise representation of the inherent schema of semistructured hierarchical data.
Full-RO: describes all simple paths.
k-RO: guarantees that its paths of length k+1 exist in the data.
1-RO: the simplest and most compact representation.

Slide 11: Representative Objects (RO)
[Figure: three graphs over the labels libraryDB, book, paper, title, author, editor, chapter, section, and name: the XML data, the graph representation of its 1-RO, and the graph representation of its 2-RO (= Full-RO).]
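A rough way to see what a 1-RO keeps and what it loses is to treat it as the set of parent-child label pairs observed in the data. That reading is an assumption made for this sketch, and the small data set and all function names below are hypothetical. Paths of length 2 accepted by the pair set really occur in the data; longer paths may not.

```python
def one_ro(labels, edges):
    """Label pairs (parent label, child label) observed in the data."""
    return {(labels[p], labels[c]) for p, c in edges}

def ro_accepts(pairs, path):
    """True if every consecutive label pair of `path` is in the 1-RO."""
    return all((a, b) in pairs for a, b in zip(path, path[1:]))

def data_has_path(labels, edges, path):
    """True if some chain of data nodes spells out `path`."""
    children = {}
    for p, c in edges:
        children.setdefault(p, []).append(c)
    frontier = [n for n in labels if labels[n] == path[0]]
    for label in path[1:]:
        frontier = [c for n in frontier for c in children.get(n, [])
                    if labels[c] == label]
    return bool(frontier)

# Hypothetical data: book -> chapter -> author, paper -> author -> name.
labels = {0: "libraryDB", 1: "book", 2: "paper", 3: "chapter",
          4: "author", 5: "author", 6: "name"}
edges = [(0, 1), (0, 2), (1, 3), (3, 4), (2, 5), (5, 6)]

pairs = one_ro(labels, edges)
short = ["chapter", "author"]           # length 2: exact
long_ = ["chapter", "author", "name"]   # length 3: accepted, but not in the data
print(ro_accepts(pairs, short), data_has_path(labels, edges, short))
print(ro_accepts(pairs, long_), data_has_path(labels, edges, long_))
```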

Slide 12: Path Index
Access Support Relations
Deterministic: Strong DataGuide, Index Fabric, ToXin, APEX
Non-deterministic: 1-Index, A(k)-Index, F&B Index

Slide 13: Access Support Relations [Kemper, Moerkotte: IS 92]
Access support relations originated in OODBMSs, for queries such as: select Name from Mercedes.Manufactures.Composition.Division
They support joins along arbitrary reference chains and generalize the join index [Valduriez 87].
They are based on the paths in the schema and materialize access paths of arbitrary length, but they support only predefined subsets of paths.

Slide 14: DataGuides [Goldman, Widom: VLDB 97]
A DataGuide is an implementation of the Full-RO: a summary of the label paths from the root (= simple paths).
Concise: it describes every unique simple path exactly once, regardless of the number of times the path appears.
Accurate: it does not contain label paths that do not appear in the data.
Convenient: it can be stored and accessed using techniques similar to those available for processing semistructured data.

Slide 15: DataGuides
The construction algorithm emulates the conversion of a non-deterministic finite automaton (NFA) into a deterministic finite automaton (DFA). Intuitively, a simple path is represented as a node in the DataGuide. One XML data instance may have multiple DataGuides.
[Figure: an XML data graph with labels A, B, and C, and several different DataGuides for it.]

Slide 16: Strong DataGuide
If the sets of nodes reachable by two simple paths are equal, the two simple paths are represented by a single node.
Construction takes linear time and space for tree-structured data, and exponential time and space in the worst case for graph-structured data.
[Figure: a source graph with nodes 1 to 6 and labels A, B, and C, and its strong DataGuide, whose nodes correspond to the node sets 1; 2,4; 3,5; and 6.]
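The NFA-to-DFA analogy can be made concrete with a subset construction in which each DataGuide node stands for the set of data nodes reached by a label path. The following is a minimal sketch, not the authors' implementation: labels are assumed to sit on nodes (as XML element names), the function name is hypothetical, and the small graph is only loosely modelled on the slide's figure.

```python
from collections import deque

def strong_dataguide(root, labels, edges):
    """Subset construction: each guide node is the set of data nodes
    reachable by one label path from the root (its target set)."""
    children = {}
    for p, c in edges:
        children.setdefault(p, []).append(c)
    start = frozenset([root])
    guide = {}                      # target set -> {label: target set}
    queue = deque([start])
    while queue:
        targets = queue.popleft()
        if targets in guide:
            continue
        by_label = {}
        for n in targets:
            for c in children.get(n, []):
                by_label.setdefault(labels[c], set()).add(c)
        guide[targets] = {l: frozenset(s) for l, s in by_label.items()}
        for t in guide[targets].values():
            if t not in guide:
                queue.append(t)
    return start, guide

# Hypothetical source graph: node 6 is shared by nodes 3 and 5.
labels = {1: "root", 2: "A", 4: "A", 3: "B", 5: "B", 6: "C"}
edges = [(1, 2), (1, 4), (2, 3), (4, 5), (3, 6), (5, 6)]

start, guide = strong_dataguide(1, labels, edges)
for targets, out in guide.items():
    print(sorted(targets), {l: sorted(t) for l, t in out.items()})
```

Because the guide nodes are sets of data nodes, a graph-shaped source can in the worst case produce exponentially many of them, which is the blow-up the slide mentions.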

Slide 17: 1/2/T-Index [Milo and Suciu: ICDT 99]
1-Index:
Summarizes all label paths starting from the root.
Supports queries of the form $q = P\,x$, where $P = /l_1/l_2/\ldots/l_n$.
Non-deterministic.
Based on backward bisimulation, which originated in graph verification.
Extents are disjoint.
More compact than strong DataGuides.

Slide 18: 1-Index
Equivalence relation ($\equiv$): $v \equiv u$ iff $L_v = L_u$, where $L_x = \{w \mid w \text{ is a simple path from the root to } x\}$. The index nodes are the equivalence classes of $\equiv$. Computing them has exponential construction cost.
Backward bisimulation ($\approx_b$):
1. If $x \approx_b y$ and $x$ is the root, then $y$ is the root.
2. Conversely, if $x \approx_b y$ and $y$ is the root, then $x$ is the root.
3. If $x \approx_b y$ and $(x' \xrightarrow{l} x)$ is an edge, then there exists an edge $(y' \xrightarrow{l} y)$ such that $x' \approx_b y'$.
4. Conversely, if $x \approx_b y$ and $(y' \xrightarrow{l} y)$ is an edge, then there exists an edge $(x' \xrightarrow{l} x)$ such that $x' \approx_b y'$.

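Slide 19: $\equiv$ vs $\approx_b$
$X \equiv Y$, since $L_X = L_Y = \{a.b.d,\ a.c.d\}$, but $X \not\approx_b Y$.
$v \approx_b u \Rightarrow v \equiv u$, but not conversely.
Backward bisimulation has $O(m \log m)$ construction cost [Paige and Tarjan 87].
[Figure: a graph with edge labels a, b, c, and d in which nodes X and Y have the same sets of incoming label paths but are not backward bisimilar.]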
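The difference between the two relations can be checked on a small graph. The sketch below computes backward bisimulation by naive iterated partition refinement, which is simpler but slower than the Paige and Tarjan algorithm cited on the slide. The edge-labeled graph and the function name are hypothetical; the graph is built so that X and Y share the incoming label paths a.b.d and a.c.d.

```python
def backward_bisimulation(root, nodes, edges):
    """Naive partition refinement. Two nodes stay in the same block only if
    they agree on being the root and on the set of (label, parent block)
    pairs over their incoming edges. Returns node -> block id."""
    incoming = {n: [] for n in nodes}
    for src, label, dst in edges:
        incoming[dst].append((label, src))
    block = {n: int(n == root) for n in nodes}
    while True:
        ids, new_block = {}, {}
        for n in nodes:
            sig = (n == root, frozenset((l, block[s]) for l, s in incoming[n]))
            new_block[n] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(block.values())):   # no block was split: stable
            return new_block
        block = new_block

nodes = ["r", "n1", "n2", "p1", "p2", "q", "X", "Y"]
edges = [
    ("r", "a", "n1"), ("r", "a", "n2"),
    ("n1", "b", "p1"), ("n2", "c", "p2"),   # X's parents: one via a.b, one via a.c
    ("p1", "d", "X"), ("p2", "d", "X"),
    ("n1", "b", "q"), ("n2", "c", "q"),     # Y's single parent is reached both ways
    ("q", "d", "Y"),
]

block = backward_bisimulation("r", nodes, edges)
print(block["n1"] == block["n2"])   # n1 and n2 are bisimilar
print(block["X"] == block["Y"])     # X and Y are not, although L_X = L_Y
```

The extent of each 1-Index node is then the set of data nodes sharing a block id, which is why the extents are disjoint.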

Slide 20: 1-Index vs Strong DataGuide
For tree-structured data, the strong DataGuide and the 1-Index coincide.
[Figure: the libraryDB XML data with numbered nodes, its strong DataGuide, and its 1-Index.]

Slide 21: 2/T-Index
2-Index:
Supports queries of the form $x_1\,P\,x_2$, for example //title.
Equivalence relation ($\equiv$): $(v, u) \equiv (v', u')$ iff $L_{(v,u)} = L_{(v',u')}$, where $L_{(x,y)} = \{w \mid w \text{ is a label path from } x \text{ to } y\}$.
Summarizes the path information between two arbitrary nodes.
T-Index:
Generalization of the 1-Index and 2-Index: $(v_1, \ldots, v_n) \equiv (u_1, \ldots, u_n)$ iff $L_{(v_1, \ldots, v_n)} = L_{(u_1, \ldots, u_n)}$.
Conceptually similar to access support relations.
Supports only predefined paths.

Slide 22: Index Fabric [Cooper, Sample, Franklin, Hjaltason, Shadmon: VLDB 01]
Designed for tree-structured data and conceptually similar to the strong DataGuide.
Layered structure.
Uses a Patricia trie to index a large number of search keys.
The simple path of each element that has a data value is encoded as a sequence of special characters.
The index keeps keys that combine this encoded sequence with the data value.

Slide 23: Index Fabric
Index Fabric keeps information only for elements that have data values.
The Patricia trie is a lossy compression of the keys.
[Figure: XML data whose labels are encoded as L, B, P, T, A, and C, and a Patricia trie over keys such as LBAauthor1 and LBTtitle1, with prefixes "L" and "LBC".]
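The key encoding can be illustrated without building a Patricia trie. The sketch below swaps in a sorted list with binary search as a stand-in for the trie, so it shows only how keys such as LBAauthor1 are formed and searched by prefix. The designator table and function names are assumptions for illustration; a real encoding would use designator characters that cannot collide with data values.

```python
import bisect

# Hypothetical one-character designators, following the letters in the figure.
DESIGNATORS = {"libraryDB": "L", "book": "B", "paper": "P",
               "title": "T", "author": "A"}

def encode(path, value):
    """Key = designator sequence of the root-to-element path + data value."""
    return "".join(DESIGNATORS[label] for label in path) + value

def prefix_search(keys, prefix):
    """All keys starting with `prefix`; `keys` must be sorted."""
    start = bisect.bisect_left(keys, prefix)
    hits = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break
        hits.append(key)
    return hits

keys = sorted([
    encode(("libraryDB", "book", "title"), "title1"),
    encode(("libraryDB", "book", "author"), "author1"),
    encode(("libraryDB", "paper", "title"), "title2"),
])

print(keys)
print(prefix_search(keys, "LBA"))   # authors of books
print(prefix_search(keys, "LB"))    # everything stored under libraryDB/book
```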

Slide 24: ToXin [Rizzolo, Mendelzon: WebDB 01]
Designed for tree-structured data and conceptually similar to the strong DataGuide (not the minimal DataGuide).
Supports both forward and backward navigation.
Path tree (= strong DataGuide): each node of the path tree has an index table or value tables.
Index table (IT): parent-child relationships.
Value table (VT): owner-value relationships.

Slide 25: ToXin
Since ToXin keeps parent-child relationships, it supports path expressions with value predicates, for example /libraryDB/book[author = author1].
[Figure: the XML data and its path tree, with nodes LibraryDB:IT, book:IT, paper:IT, title:VT, author:VT, chapter, and section.]
Index tables:
  LibraryDB:        parent = null, child = 1
  LibraryDB.book:   parent = 1, child = 2
  LibraryDB.paper:  parent = 1, child = 6
Value tables:
  LibraryDB.book.author:  parent = 2, value = author1
  …
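How the two kinds of tables cooperate on /libraryDB/book[author = author1] can be sketched with plain dictionaries keyed by label path. The rows below repeat the ones on the slide, plus one title row added as an assumption; `select` and `parent_of` are hypothetical helper names, not ToXin's API.

```python
# Index tables: path -> list of (parent node, child node).
index_tables = {
    ("LibraryDB",): [(None, 1)],
    ("LibraryDB", "book"): [(1, 2)],
    ("LibraryDB", "paper"): [(1, 6)],
}

# Value tables: path -> list of (owner node, value).
value_tables = {
    ("LibraryDB", "book", "author"): [(2, "author1")],
    ("LibraryDB", "book", "title"): [(2, "title1")],   # assumed row, not on the slide
}

def select(path, child_label, value):
    """Nodes at `path` that own a `child_label` element with the given value."""
    rows = value_tables.get(path + (child_label,), [])
    owners = {owner for owner, v in rows if v == value}
    return [child for _, child in index_tables.get(path, []) if child in owners]

def parent_of(path, node):
    """Backward navigation: the parent of `node` at `path`, via the index table."""
    for parent, child in index_tables.get(path, []):
        if child == node:
            return parent
    return None

matches = select(("LibraryDB", "book"), "author", "author1")
print(matches)
print(parent_of(("LibraryDB", "book"), matches[0]))
```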

Slide 26: A(k)-Index [Kaushik, Shenoy, Bohannon, Gudes: ICDE 02]
The strong DataGuide and the 1-Index record all simple paths, which increases the index size and therefore the search space.
The A(k)-Index is an approximation of the 1-Index. It is non-deterministic and uses local similarity (of degree k) to reduce the size of the index graph.

Slide 27: A(k)-Index
k-bisimulation ($\approx^k$):
For any two nodes $v$ and $u$, $v \approx^0 u$ iff $u$ and $v$ have the same label.
$v \approx^k u$ iff $v \approx^{k-1} u$ and, for every parent $v'$ of $v$, there is a parent $u'$ of $u$ such that $v' \approx^{k-1} u'$, and vice versa.
[Figure: XML data with labels A, B, C, D, and E, and its A(0)-Index, A(1)-Index, and A(2)-Index (= 1-Index).]
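The definition translates into k rounds of refinement, starting from a partition by label. The sketch below is a direct, unoptimized reading of that definition; the function name is hypothetical, and the seven-node tree is an assumed reconstruction of the slide's figure (A with children B and C, each with a D child that has an E child).

```python
def a_k_extents(labels, edges, k):
    """Partition nodes by k-bisimilarity: round 0 groups by label, and each
    later round splits blocks by the set of blocks of a node's parents."""
    parents = {n: [] for n in labels}
    for p, c in edges:
        parents[c].append(p)
    block = dict(labels)                       # k = 0: same label
    for _ in range(k):
        ids, new_block = {}, {}
        for n in labels:
            sig = (block[n], frozenset(block[p] for p in parents[n]))
            new_block[n] = ids.setdefault(sig, len(ids))
        block = new_block
    extents = {}
    for n, b in block.items():
        extents.setdefault(b, []).append(n)
    return sorted(sorted(e) for e in extents.values())

labels = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"}
edges = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]

sizes = [len(a_k_extents(labels, edges, k)) for k in (0, 1, 2)]
print(sizes)
print(a_k_extents(labels, edges, 1))
```

With k = 1 the two D nodes are separated because their parents carry different labels, while the two E nodes still share an index node; k = 2 separates them as well.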

Slide 28: A(k)-Index
Building cost is $O(km)$. In general, for the 1-Index, $k < \log m$.
Query processing:
A label path expression of length $\le k+1$ is answered precisely.
A label path expression of length $> k+1$ is answered safely: the result may include false results, so validation is needed, which requires scanning the data.

Slide 29: APEX: Adaptive Path indEx for XML Data [Chung, Min, Shim: SIGMOD 02]
The strong DataGuide and the 1-Index keep all simple paths.
Users often issue partial matching path queries such as //book/title.
Exhaustive navigation of the index structure for partial matching path queries may degrade performance.

Slide 30: APEX
Deterministic.
An approximation of DataGuides.
Processes partial matching path queries efficiently.
Workload-aware: follows self-tuning strategies [Chaudhuri et al. 00] and utilizes the query workload.
APEX is built from both the XML data and the frequently used paths, which are found by sequential pattern mining [Agrawal and Srikant 95].

Slide 31: APEX
Hash tree: keeps the frequently used paths and prevents exhaustive search.
Graph structure: structural summary + extents.
[Figure: the XML data and its APEX for the frequently used paths {book.title}. The hash tree maps labels to xnodes: xroot → &0, libraryDB → &1, book → &2, paper → &3, author → &4, chapter → &5, section → &6, editor → &7, and title → a subtable with book → &8 and remainder → &9. The graph structure has the nodes &0 to &9, each with its extent.]
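The hash tree in the figure can be pictured as nested dictionaries that are walked from the last label of a query backwards. The sketch below only illustrates that lookup, under the assumption that unmatched labels fall into the remainder entry; it is not the APEX query processor, and for paths outside the frequently used set the returned xnodes are candidates that would still need a join with other extents (not shown). The function names are hypothetical.

```python
# Hash tree from the figure: label -> xnode, or a subtable keyed by the
# preceding label, with "remainder" covering all other preceding labels.
HASH_TREE = {
    "xroot": "&0", "libraryDB": "&1", "book": "&2", "paper": "&3",
    "author": "&4", "chapter": "&5", "section": "&6", "editor": "&7",
    "title": {"book": "&8", "remainder": "&9"},
}

def leaves(entry):
    """All xnodes stored at or below a hash tree entry."""
    if isinstance(entry, dict):
        return [x for sub in entry.values() for x in leaves(sub)]
    return [entry]

def xnodes(query):
    """Walk the query's labels from last to first through the hash tree."""
    entry = HASH_TREE[query[-1]]
    for label in reversed(query[:-1]):
        if not isinstance(entry, dict):
            break      # query is longer than any indexed suffix
        entry = entry.get(label, entry["remainder"])
    return sorted(leaves(entry))

print(xnodes(["book", "title"]))   # //book/title hits one xnode directly
print(xnodes(["title"]))           # //title needs both title xnodes
print(xnodes(["author"]))
```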

Slide 32: F&B Index [Kaushik, Bohannon, Naughton, Korth: SIGMOD 02]
Supports twig path expressions such as /A/B[C].
Basic idea: for every edge $e$ labeled $l$ from $v$ to $u$, add an inverse edge $e^{-1}$ labeled $l^{-1}$ from $u$ to $v$, and then compute the 1-Index on this modified graph.
The resulting index can be very large, so heuristics are applied, for example exploiting local similarity (k-bisimulation).
[Figure: a path with edges A, B, and C and the added inverse edge $C^{-1}$.]
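The basic idea can be tried on a tiny graph for /A/B[C]: one B node has a C child and the other does not. The sketch below adds inverse edges and then runs a naive backward bisimulation refinement over the result; it is a simplified illustration with hypothetical names, not the algorithm from the paper.

```python
def with_inverse_edges(edges):
    """For every edge (v, l, u) add an inverse edge (u, l^-1, v)."""
    return edges + [(dst, label + "^-1", src) for src, label, dst in edges]

def backward_bisimulation(root, nodes, edges):
    """Naive partition refinement on incoming (label, parent block) pairs."""
    incoming = {n: [] for n in nodes}
    for src, label, dst in edges:
        incoming[dst].append((label, src))
    block = {n: int(n == root) for n in nodes}
    while True:
        ids, new_block = {}, {}
        for n in nodes:
            sig = (n == root, frozenset((l, block[s]) for l, s in incoming[n]))
            new_block[n] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(block.values())):   # stable partition
            return new_block
        block = new_block

nodes = ["r", "a", "b1", "b2", "c"]
edges = [("r", "A", "a"), ("a", "B", "b1"), ("a", "B", "b2"), ("b1", "C", "c")]

plain = backward_bisimulation("r", nodes, edges)                      # 1-Index
fb = backward_bisimulation("r", nodes, with_inverse_edges(edges))     # with inverses

print(plain["b1"] == plain["b2"])   # merged: the branch [C] cannot be answered
print(fb["b1"] == fb["b2"])         # separated: only b1 has a C child
```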

Slide 33: Discussion
Path indexes improve query performance by restricting the search space. They can also be applied in other areas, such as selectivity estimation and QBE (Query By Example).
Future work: support for twig queries, and query optimization with a cost formula for path indexes.

Slide 34: Thank you! Any questions?

Slide 35: References
C. Chung, J. Min and K. Shim, "APEX: An Adaptive Path Index for XML Data," SIGMOD 02
B. Cooper, N. Sample, M. Franklin, G. Hjaltason and M. Shadmon, "A Fast Index for Semistructured Data," VLDB 01
M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri and K. Shim, "XTRACT: A System for Extracting Document Type Descriptors from XML Documents," SIGMOD 00
R. Goldman and J. Widom, "DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases," VLDB 97
R. Kaushik, P. Bohannon, J. Naughton and H. Korth, "Covering Indexes for Branching Path Queries," SIGMOD 02
R. Kaushik, P. Shenoy, P. Bohannon and E. Gudes, "Exploiting Local Similarity for Indexing Paths in Graph-Structured Data," ICDE 02
A. Kemper and G. Moerkotte, "Access Support Relations: An Indexing Method for Object Bases," Information Systems 92
T. Milo and D. Suciu, "Index Structures for Path Expressions," ICDT 99
S. Nestorov, J. Ullman, J. Wiener and S. Chawathe, "Representative Objects: Concise Representations of Semistructured, Hierarchical Data," ICDE 97
F. Rizzolo and A. Mendelzon, "Indexing XML Data with ToXin," WebDB 01
R. Paige and R. Tarjan, "Three Partition Refinement Algorithms," SIAM Journal on Computing 87
P. Valduriez, "Join Indices," TODS 87