On Efficient Part-match Querying of XML Data DATESO 2004 Michal Krátký, firstname.lastname@example.org Marek Andrt, email@example.com Department of Computer Science VŠB–Technical University of Ostrava Czech Republic
Contents Introduction: XML, query languages, indexing XML data, part-match querying. Multi-dimensional approach to indexing XML data. Extension of the multi-dimensional approach for keyword-based querying. Index data structures. Preliminary experimental results.
Introduction Native XML database. A set of documents is a database; the DTD (XML Schema) is its database schema. XML query languages (XPath, XQL, XQuery, …). A common feature is the ability to formulate paths in the XML graph (regular path expressions, XPath axes and so on). Approaches are based on relational decomposition, tries, multi-dimensional indexing, signatures and so on.
Part-match querying of XML data Several approaches to keyword- or phrase-based searching have been published: XQuery-IR (WebDB'02), XKeyword (ICDE'03) and so on. Knowledge from information retrieval (IR) is applied. The query languages contain operators for matching term occurrences, for example contains() and ~=.
Multi-dimensional approach to indexing XML data A graph is a set of paths. An XML document is decomposed into paths and labelled paths. Labelled path: $lp \in X_{LP}$: $s_0, s_1, \ldots, s_{l_{PN}}$. Path: $p \in X_P$: $id_U(u_0), id_U(u_1), \ldots, id_U(u_{l_{LP}}), s$, where $id_U(u_i)$ is the unique number of a node $u_i$.
Indexes Term index: a storage of the strings $s_i$ of an XML document and their $id_T(s_i)$. Labelled path index: a storage of points representing labelled paths. Path index: a storage of points representing paths.
Example: labelled path index, path index Labelled paths: books,book,id; books,book,title and books,book,author. The points (0,1,2); (0,1,4) and (0,1,6) are created using $id_T$ of the element and attribute names; their $id_{LP}$ are 0, 1 and 2. For example, the path to the value The Two Towers belongs to the labelled path books,book,title with $id_{LP} = 1$. The vector (1,0,1,3,5) is created using $id_{LP}$, the unique numbers $id_U$ of the elements, and $id_T$ of the term.
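The example above can be reproduced with a short sketch. It assumes that ids are handed out in document order on first occurrence, which happens to yield the numbers on the slide; the two book titles other than The Two Towers are placeholders, and the dictionaries and lists stand in for the real index structures.

```python
import itertools
import xml.etree.ElementTree as ET

# Sample document; the titles of the 2nd and 3rd book are assumed placeholders.
XML = """<books>
<book id="003-04212"><title>The Two Towers</title><author>J.R.R. Tolkien</author></book>
<book id="001-00863"><title>Placeholder title A</title><author>J.R.R. Tolkien</author></book>
<book id="045-00012"><title>Placeholder title B</title><author>Joseph Heller</author></book>
</books>"""


def build_indexes(xml_text):
    term_index = {}   # string -> id_T
    lp_index = {}     # labelled path as a point of id_T values -> id_LP
    path_index = []   # points (id_LP, id_U(u_0), ..., id_U(u_l), id_T(value))
    next_id = itertools.count()  # source of unique node numbers id_U

    def term_id(s):
        # First occurrence gets the next free id_T.
        return term_index.setdefault(s, len(term_index))

    def add_path(names, nodes, value):
        lp_point = tuple(term_id(n) for n in names)
        id_lp = lp_index.setdefault(lp_point, len(lp_index))
        path_index.append((id_lp, *nodes, term_id(value)))

    def visit(elem, names, nodes):
        names = names + [elem.tag]
        nodes = nodes + [next(next_id)]
        term_id(elem.tag)
        # Attributes are numbered like child nodes of the element.
        for attr, value in elem.attrib.items():
            add_path(names + [attr], nodes + [next(next_id)], value)
        text = (elem.text or "").strip()
        if text:
            add_path(names, nodes, text)
        for child in elem:
            visit(child, names, nodes)

    visit(ET.fromstring(xml_text), [], [])
    return term_index, lp_index, path_index


term_index, lp_index, path_index = build_indexes(XML)
print(lp_index)       # labelled path points and their id_LP
print(path_index[1])  # point of the path to 'The Two Towers'
```

A real implementation would store the points in a paged multi-dimensional structure rather than in Python containers.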
Query for values of elements and attributes XPath query: books/book[author="Joseph Heller"] Query processing has 3 phases, finding: ● $id_T$ of the terms from the term index, ● $id_{LP} = 2$ of the labelled path books,book,author from the labelled path index: point query (0,1,6), ● points from the path index: range query (2,0,0,0,12) × (2,max,max,max,12).
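The three phases can be sketched as follows. The index contents are hard-coded from the books example, a linear scan stands in for the multi-dimensional range query, and the names TERM_INDEX, LP_INDEX, PATH_INDEX and query_value are illustrative assumptions.

```python
MAX = 2**31 - 1  # assumed upper bound of every coordinate

# Index contents taken from the books example (only what the query needs).
TERM_INDEX = {"books": 0, "book": 1, "title": 4, "author": 6,
              "J.R.R. Tolkien": 7, "Joseph Heller": 12}
LP_INDEX = {(0, 1, 2): 0, (0, 1, 4): 1, (0, 1, 6): 2}
PATH_INDEX = [(1, 0, 1, 3, 5), (2, 0, 1, 4, 7),
              (2, 0, 5, 8, 7), (2, 0, 9, 12, 12)]


def range_query(points, low, high):
    # Linear scan standing in for a UB-tree or R-tree range query.
    return [p for p in points
            if len(p) == len(low)
            and all(l <= x <= h for l, x, h in zip(low, p, high))]


def query_value(labelled_path, value):
    # Phase 1: id_T of all terms from the term index.
    lp_point = tuple(TERM_INDEX[name] for name in labelled_path)
    id_t = TERM_INDEX[value]
    # Phase 2: point query in the labelled path index.
    id_lp = LP_INDEX[lp_point]
    # Phase 3: range query in the path index; node numbers are left free.
    free = len(labelled_path)
    low = (id_lp,) + (0,) * free + (id_t,)
    high = (id_lp,) + (MAX,) * free + (id_t,)
    return range_query(PATH_INDEX, low, high)


result = query_value(["books", "book", "author"], "Joseph Heller")
print(result)
```

The returned point carries the node numbers of the matching book, so the result element can be located without a join.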
Enhanced querying XPath axes are processed by a range query or a sequence of range queries. For example, the descendant axis: $(0, id_U(u_0), \ldots, id_U(u_{l-1}), id_U(u), 0, \ldots, 0) : (max_D, id_U(u_0), \ldots, id_U(u_{l-1}), id_U(u), max_D, \ldots, max_D)$. Regular path expressions: for example, //title[name='Chaudhri'] is processed by a complex range query. The query can be processed in one run in the multi-dimensional data structure.
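A minimal sketch of building the query box for the descendant axis, assuming all points of the path index share one dimension and MAX_D is the largest value of any coordinate. The function name descendant_box and the sample points are illustrative.

```python
MAX_D = 2**31 - 1  # assumed maximum value of a coordinate


def descendant_box(node_ids, dim):
    """Query box for all paths running through the node reached by node_ids.

    node_ids: (id_U(u_0), ..., id_U(u)) from the root down to the context node.
    dim: dimension of the indexed space (id_LP first, then node numbers, then the term).
    """
    free = dim - 1 - len(node_ids)
    low = (0, *node_ids) + (0,) * free
    high = (MAX_D, *node_ids) + (MAX_D,) * free
    return low, high


def range_query(points, low, high):
    # Linear scan standing in for the multi-dimensional structure.
    return [p for p in points
            if all(l <= x <= h for l, x, h in zip(low, p, high))]


# Path points of the books example that belong to the 1st and 3rd book.
POINTS = [(0, 0, 1, 2, 3), (1, 0, 1, 3, 5), (2, 0, 1, 4, 7),
          (0, 0, 9, 10, 10), (2, 0, 9, 12, 12)]

low, high = descendant_box((0, 9), dim=5)
hits = range_query(POINTS, low, high)
print(low, high, hits)
```

Fixing the prefix of node numbers and leaving the remaining coordinates free is what makes the axis a single range query.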
Comparison of approaches Mainline approaches (XISS, XPath Accelerator) index single elements (attributes). For example, the query /e1[e2='dog'] is processed by joining the results for the single elements. Result formatting: for example, the result of the query //name is every matched subtree. The update and insert operations are straightforward.
Keyword-based searching Motivation: /PLAY[PERSONAE/PERSONA~=OTHELLO]/TITLE A Path-Labelled Path-Term (PLT) index is added. It indexes a 3-dimensional space: $(id_P, id_{LP}, id_T)$. $id_P$ is added to the point representing a path: $(id_P, id_{LP}, id_{U_0}, id_{U_1}, \ldots, id_{U_l}, s)$.
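A hedged sketch of how a PLT index could answer the keyword predicate. The tokenisation (upper-cased word characters), the sample strings and the function names are assumptions made for illustration, and a set scanned linearly stands in for the 3-dimensional index.

```python
import re

MAX = 2**31 - 1  # assumed upper bound of a coordinate


def build_plt(paths):
    """paths: iterable of (id_P, id_LP, text of the path's value)."""
    term_index = {}  # keyword -> id_T
    plt = set()      # points (id_P, id_LP, id_T)
    for id_p, id_lp, text in paths:
        for word in re.findall(r"\w+", text.upper()):
            id_t = term_index.setdefault(word, len(term_index))
            plt.add((id_p, id_lp, id_t))
    return term_index, plt


def paths_containing(plt, term_index, id_lp, keyword):
    # Range query (0, id_LP, id_T) : (max, id_LP, id_T) in the PLT space.
    id_t = term_index.get(keyword.upper())
    if id_t is None:
        return []
    low, high = (0, id_lp, id_t), (MAX, id_lp, id_t)
    return sorted(p[0] for p in plt
                  if all(l <= x <= h for l, x, h in zip(low, p, high)))


# Illustrative data: id_LP 0 = PLAY,TITLE and id_LP 1 = PLAY,PERSONAE,PERSONA.
SAMPLE = [(0, 0, "The Tragedy of Othello"),
          (1, 1, "OTHELLO, a noble Moor"),
          (2, 1, "IAGO, his ancient"),
          (3, 1, "DESDEMONA, wife to Othello")]

term_index, plt = build_plt(SAMPLE)
matches = paths_containing(plt, term_index, id_lp=1, keyword="OTHELLO")
print(matches)
```

The returned id_P values can then be used as the first coordinate of a query in the path index to fetch the node numbers of the matching paths.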
Index data structures Paged and balanced multi-dimensional data structures: (B)UB-trees, variants of R-trees. Problems: ● indexing points with different dimensions, ● narrow range query: a signature is applied for efficient processing (Signature R-tree). Efficient processing of the complex range query.
Efficient processing of the complex range query Complex range query = sequence of range queries: $qb_1, qb_2, \ldots, qb_n$. The query can be processed in one run in the multi-dimensional data structure.
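The difference between naive and one-run processing can be illustrated on a toy two-level tree. This is a sketch under stated assumptions, not the authors' algorithm: nodes are plain dictionaries, every visited node counts as one disk access, and the data are invented.

```python
def intersects(mbr, box):
    (mlo, mhi), (qlo, qhi) = mbr, box
    return all(a <= d and c <= b for a, b, c, d in zip(mlo, mhi, qlo, qhi))


def inside(point, box):
    lo, hi = box
    return all(l <= x <= h for l, x, h in zip(lo, point, hi))


def search(node, boxes, stats):
    """Visit node once for all query boxes that are still relevant to it."""
    stats["dac"] += 1  # one disk access per visited node
    if "points" in node:
        return [p for p in node["points"] if any(inside(p, b) for b in boxes)]
    found = []
    for child in node["children"]:
        live = [b for b in boxes if intersects(child["mbr"], b)]
        if live:
            found += search(child, live, stats)
    return found


# Toy tree: a root with three leaf pages (invented data).
ROOT = {"mbr": ((0, 0), (9, 9)), "children": [
    {"mbr": ((0, 0), (4, 4)), "points": [(1, 1), (3, 4)]},
    {"mbr": ((5, 0), (9, 4)), "points": [(5, 2), (8, 1)]},
    {"mbr": ((0, 5), (9, 9)), "points": [(2, 7), (7, 8)]},
]}
BOXES = [((0, 0), (3, 2)), ((3, 3), (6, 4)), ((6, 0), (9, 1))]

# Naive: one traversal per query box.
naive = {"dac": 0}
naive_result = []
for box in BOXES:
    naive_result += search(ROOT, [box], naive)

# One run: a single traversal carrying the whole sequence of boxes.
one_run = {"dac": 0}
one_run_result = search(ROOT, BOXES, one_run)

print(naive["dac"], one_run["dac"], sorted(one_run_result))
```

The saving comes from reading shared inner nodes and shared leaf pages only once, which is why it grows with the number of query boxes.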
Experimental results Protein Sequence Database XML document: ● document size: 683MB, ● number of elements: 21,305,818, ● number of attributes: 1,290,647, ● maximal length of a path: 7. BUB-forest, R*-forest, Signature BUB-tree and R*-tree. Index structures: trees indexing spaces of dimension n=7 and n=9.
Experimental results Regular path expression Query: //uid='89071748'; 5 labelled paths were matched. Naive processing of the complex range query: DAC: 368. Efficient processing of the complex range query: DAC: 139, Time: 0.03s, Improvement: 2.5x.
Preliminary experimental results Keyword-based searching othello.xml: ● document size: 250kB, ● maximal length of a path: 6, ● number of paths: 4,967, ● number of labelled paths: 13, ● number of terms: 8,744, ● PLT index: 27,127.
Preliminary experimental results Keyword-based searching Query: /PLAY[PERSONAE/PERSONA~=OTHELLO]/TITLE Labelled path index: result size: 1, DAC: 3. PLT index: result size: 1, DAC: 3. Path index: result size: 1, DAC: 13. Path index: result size: 1, DAC: 4.
Conclusion $\Theta(m \times \log n)$, $\Theta(c \times m \times \log n)$ vs. $\Theta(m_1 \times m_2)$, $m_1, m_2 \geq m$. Efficient processing of a query with an AND condition. A signature is applied. The multi-dimensional approach for term searching may be applied (e.g. *comp*). The update operation of XML documents. Comparison with other approaches on test collections (INEX, XMark, …). http://www.cs.vsb.cz/arg
Paths, labelled paths The paths 0,1,2,'003-04212'; 0,5,6,'001-00863' and 0,9,10,'045-00012' belong to the labelled path books,book,id,... The paths 0,1,4,'J.R.R. Tolkien'; 0,5,8,'J.R.R. Tolkien' and 0,9,12,'Joseph Heller' belong to the labelled path books,book,author.
Complex queries Query for values combined with XPath axis processing, e.g. books/book[author='Joseph Heller']/title ● A combination of the techniques described above: query for a value, XPath axis processing. Regular path expression queries, for example books//author ● A sequence of range queries processes this query in the path and labelled path indexes: books,author; books,*,author; …; books,*,…,*,author.
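The sequence of range queries for books//author can be sketched as below. It assumes the maximal path length is known, that each * becomes a free coordinate, and that a box only matches labelled path points of its own dimension; the names are illustrative and a linear scan replaces the index.

```python
MAX = 2**31 - 1  # assumed upper bound of an id_T

# Labelled path index of the books example: point of id_T values -> id_LP.
LP_INDEX = {(0, 1, 2): 0, (0, 1, 4): 1, (0, 1, 6): 2}


def descendant_boxes(first, last, max_len):
    """Boxes for first,last; first,*,last; ... up to max_len names."""
    for stars in range(max_len - 1):
        low = (first, *[0] * stars, last)
        high = (first, *[MAX] * stars, last)
        yield low, high


def matching_labelled_paths(first, last, max_len):
    found = []
    for low, high in descendant_boxes(first, last, max_len):
        for point, id_lp in LP_INDEX.items():
            if len(point) == len(low) and all(
                    l <= x <= h for l, x, h in zip(low, point, high)):
                found.append(id_lp)
    return found


# books has id_T 0 and author has id_T 6; the longest labelled path has 3 names.
boxes = list(descendant_boxes(0, 6, max_len=3))
ids = matching_labelled_paths(0, 6, max_len=3)
print(boxes, ids)
```

Each matched id_LP then seeds one range query in the path index, and together these form the complex range query.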
Narrow range query: signature multi-dimensional data structures Regions intersecting the query hyper box are searched, $O(N_I \times \log_c n)$. The ratio $c_R$ of relevant regions $N_R$ to intersected regions $N_I$ becomes $\ll 1$ as the dimension increases. Signatures are applied to better filter out irrelevant regions: signature multi-dimensional data structures.
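The filtering idea can be illustrated with generic superimposed coding. This is an assumed scheme, not necessarily the one used in the Signature R-tree: each (dimension, value) pair sets one bit, a region's signature is the OR over its points, and only coordinates fixed by the query contribute to the query signature.

```python
BITS = 64
MAX = 2**31 - 1  # assumed upper bound of a coordinate


def coord_bit(dim, value):
    # Simple deterministic hash of a (dimension, value) pair to one bit.
    return 1 << ((dim * 17 + value * 5) % BITS)


def region_signature(points):
    sig = 0
    for point in points:
        for dim, value in enumerate(point):
            sig |= coord_bit(dim, value)
    return sig


def query_signature(low, high):
    # Only coordinates fixed by the narrow query (low == high) set a bit.
    sig = 0
    for dim, (l, h) in enumerate(zip(low, high)):
        if l == h:
            sig |= coord_bit(dim, l)
    return sig


def may_contain(region_sig, query_sig):
    # False means the region certainly holds no matching point.
    return (region_sig & query_sig) == query_sig


def mbr_intersects(points, low, high):
    return all(min(p[d] for p in points) <= high[d]
               and low[d] <= max(p[d] for p in points)
               for d in range(len(low)))


# Two invented regions; both bounding boxes intersect the query box.
REGION_A = [(2, 0, 1, 4, 7), (2, 0, 13, 16, 13)]
REGION_B = [(2, 0, 9, 12, 12), (1, 0, 9, 11, 11)]
LOW, HIGH = (2, 0, 0, 0, 12), (2, MAX, MAX, MAX, 12)

q = query_signature(LOW, HIGH)
a_ok = may_contain(region_signature(REGION_A), q)
b_ok = may_contain(region_signature(REGION_B), q)
print(a_ok, b_ok)
```

Region A is intersected by the query box but holds no matching point, so the signature test saves reading it; a signature can still give false positives, never false negatives.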