On Efficient Part-match Querying of XML Data DATESO 2004 Michal Krátký, Marek Andrt, Department of Computer Science.


On Efficient Part-match Querying of XML Data DATESO 2004 Michal Krátký, Marek Andrt, Department of Computer Science VŠB–Technical University of Ostrava Czech Republic

Contents Introduction – XML, query languages, indexing XML data, part-match querying. Multi-dimensional approach to indexing XML data. Extension of the multi-dimensional approach for keyword-based querying. Index data structures. Preliminary experimental results. 2/21

Introduction Native XML databases: a set of documents is the database, and its DTD (or XML Schema) is the database schema. XML query languages (XPath, XQL, XQuery, …) share a common feature: the ability to formulate paths in the XML graph (regular path expressions, XPath axes, and so on). Indexing approaches are based on relational decomposition, tries, multi-dimensional structures, signatures, and so on. 3/21

Part-match querying of XML data Several approaches to keyword- or phrase-based searching have been published: XQuery-IR (WebDB'02), XKeyword (ICDE'03), and so on. Knowledge from information retrieval (IR) is applied. Query languages contain operators for matching term occurrences, for example contains() and ~=. 4/21

Multi-dimensional approach to indexing XML data A graph is a set of paths. An XML document is decomposed into paths and labelled paths. Labelled path: lp ∈ X_LP : s_0, s_1, ..., s_{l_PN}. Path: p ∈ X_P : id_U(u_0), id_U(u_1), ..., id_U(u_{l_LP}), s, where id_U(u_i) is the unique number of a node u_i. 5/21

Indexes Term index: stores the strings s_i of an XML document and their identifiers id_T(s_i). Labelled path index: stores points representing labelled paths. Path index: stores points representing paths. 6/21

Example: labelled path index, path index Labelled paths: books,book,id; books,book,title; and books,book,author. The points (0,1,2), (0,1,4), and (0,1,6) are created from the id_T of the element and attribute names; they receive id_LP = 0, 1, and 2. For example, the path to the value The Two Towers belongs to the labelled path books,book,title with id_LP = 1. The vector (1,0,1,3,5) is created from id_LP, the unique numbers id_U of the elements, and the id_T of the term. 7/21
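The decomposition in this example can be sketched in a few lines of Python. This is only an illustration under assumptions, not the authors' implementation: the function name build_indexes, the numbering order (identifiers handed out in document order, attributes before child elements) and the sample document are hypothetical. Only the strings The Two Towers, J.R.R. Tolkien and Joseph Heller come from the slides; the id values and the other two titles are made up.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical sample: only "The Two Towers", "J.R.R. Tolkien" and
# "Joseph Heller" appear in the slides; ids and other titles are made up.
SAMPLE = """<books>
  <book id="b1"><title>The Two Towers</title><author>J.R.R. Tolkien</author></book>
  <book id="b2"><title>Second title</title><author>J.R.R. Tolkien</author></book>
  <book id="b3"><title>Third title</title><author>Joseph Heller</author></book>
</books>"""


def build_indexes(xml_text):
    term_ids = {}   # term index: string -> id_T
    lp_ids = {}     # labelled path index: point of id_T values -> id_LP
    paths = []      # path index: (id_LP, id_U(u_0), ..., id_U(u_l), id_T(value))
    node_numbers = itertools.count()  # source of unique node numbers id_U

    def id_t(s):
        # Assign the next free id_T on first sight of a string.
        return term_ids.setdefault(s, len(term_ids))

    def emit(labels, nodes, value):
        # Register the labelled path (if new) and store the path vector.
        lp = lp_ids.setdefault(tuple(labels), len(lp_ids))
        paths.append((lp, *nodes, id_t(value)))

    def visit(elem, labels, nodes):
        labels = labels + [id_t(elem.tag)]
        nodes = nodes + [next(node_numbers)]
        # Assumption: attributes are numbered before child elements.
        for name, value in elem.attrib.items():
            attr_label = id_t(name)
            attr_node = next(node_numbers)
            emit(labels + [attr_label], nodes + [attr_node], value)
        text = (elem.text or "").strip()
        if text and len(elem) == 0:   # leaf element carrying a value
            emit(labels, nodes, text)
        for child in elem:
            visit(child, labels, nodes)

    visit(ET.fromstring(xml_text), [], [])
    return term_ids, lp_ids, paths


terms, lps, paths = build_indexes(SAMPLE)
print(lps)        # labelled path points and their id_LP
print(paths[1])   # path vector for the first title
```

With this sample and numbering order, the labelled path points and the vector for The Two Towers come out as in the slide, which suggests (but does not prove) that the slides number terms and nodes the same way.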

Query for values of elements and attributes XPath query: books/book[author='Joseph Heller']. Query processing has 3 phases, finding: ● the id_T of the terms in the term index, ● the id_LP = 2 of the labelled path books,book,author in the labelled path index, using the point query (0,1,6), ● the points in the path index, using the range query (2,0,0,0,12) × (2,max,max,max,12). 8/21
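The three phases can be traced with a small Python sketch. It is a sketch under assumptions: a linear scan stands in for the multi-dimensional index, the names in_box, range_query and query_author are hypothetical, and the id_T value 7 for J.R.R. Tolkien is assumed (the slides give only the value 12 for Joseph Heller). The three author paths are those listed on the backup slide "Paths, labelled paths".

```python
MAX = 2**32 - 1  # stand-in for "max"; the real domain size is not given

# Hand-copied fragments of the three indexes (7 for Tolkien is an assumption).
TERM_INDEX = {"books": 0, "book": 1, "author": 6, "Joseph Heller": 12}
LABELLED_PATH_INDEX = {(0, 1, 2): 0, (0, 1, 4): 1, (0, 1, 6): 2}
PATH_INDEX = [(2, 0, 1, 4, 7), (2, 0, 5, 8, 7), (2, 0, 9, 12, 12)]


def in_box(point, low, high):
    # A point matches when every coordinate lies inside the query box.
    return all(lo <= x <= hi for x, lo, hi in zip(point, low, high))


def range_query(points, low, high):
    # Linear scan; a UB-tree or R-tree would prune whole regions instead.
    return [p for p in points if in_box(p, low, high)]


def query_author(name):
    # Phase 1: id_T of the terms from the term index.
    id_t = TERM_INDEX[name]
    lp_point = tuple(TERM_INDEX[s] for s in ("books", "book", "author"))
    # Phase 2: point query (0,1,6) in the labelled path index.
    id_lp = LABELLED_PATH_INDEX[lp_point]
    # Phase 3: range query (id_LP,0,0,0,id_T) x (id_LP,max,max,max,id_T).
    return range_query(PATH_INDEX, (id_lp, 0, 0, 0, id_t), (id_lp, MAX, MAX, MAX, id_t))


print(query_author("Joseph Heller"))
```

Only the first and last coordinates are fixed, so the query box is narrow in two dimensions and unbounded in the node-number dimensions.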

Enhanced querying XPath axes are processed by a range query or a sequence of range queries. For example, the descendant axis: (0, id_U(u_0), …, id_U(u_{l-1}), id_U(u), 0, …, 0) : (max_D, id_U(u_0), …, id_U(u_{l-1}), id_U(u), max_D, …, max_D). Regular path expressions: for example, //title[name='Chaudhri'] is processed by a complex range query. The query can be processed in one run of the multi-dimensional data structure. 9/21
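The descendant-axis box can be built mechanically from the node numbers on the way to u. The helper below is a hypothetical sketch: the name descendant_box, the fixed dimension dim and the value chosen for max_D are assumptions, and the test points reuse the author paths from the books example.

```python
def descendant_box(node_path, dim, max_d):
    """Query box for the descendants of node u.

    node_path holds id_U(u_0), ..., id_U(u_{l-1}), id_U(u); the first
    coordinate (id_LP) and all trailing coordinates are left unbounded.
    """
    fixed = tuple(node_path)
    pad = dim - 1 - len(fixed)
    if pad < 0:
        raise ValueError("node path is longer than the indexed space allows")
    low = (0,) + fixed + (0,) * pad
    high = (max_d,) + fixed + (max_d,) * pad
    return low, high


def in_box(point, low, high):
    return all(lo <= x <= hi for x, lo, hi in zip(point, low, high))


MAX_D = 2**32 - 1  # assumed domain maximum
low, high = descendant_box([0, 5], dim=5, max_d=MAX_D)  # descendants of book node 5
print(low, high)
```

Any path vector whose node numbers start with 0,5 falls inside this box, whatever its labelled path or term.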

Comparison of approaches Mainstream approaches (XISS, XPath Accelerator) index single elements (attributes). For example, the query /e1[e2='dog'] is processed by joining the results for the single elements. Result formatting: for example, the result of the query //name consists of all matched subtrees. The Update and Insert operations are simple to support. 10/21

Keyword-based searching Motivation: /PLAY[PERSONAE/PERSONA~=OTHELLO]/TITLE. A Path-Labelled Path-Term (PLT) index is added. It indexes a 3-dimensional space: (id_P, id_LP, id_T). id_P is also added to the point representing a path: (id_P, id_LP, id_U(u_0), id_U(u_1), …, id_U(u_l), s). 11/21
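A minimal sketch of how a PLT index might answer the ~= operator is shown below. Everything in it is an assumption made for illustration: the tokenisation (upper-cased word characters), the function names build_plt and keyword_paths, the id_LP values 0 and 1, and the three sample values, which are not taken from othello.xml.

```python
import re

MAX_D = 2**32 - 1  # assumed domain maximum


def build_plt(path_values, term_ids):
    """One point (id_P, id_LP, id_T) per term occurring in a path's value."""
    plt = []
    for id_p, (id_lp, text) in enumerate(path_values):
        for word in re.findall(r"\w+", text.upper()):
            id_t = term_ids.setdefault(word, len(term_ids))
            plt.append((id_p, id_lp, id_t))
    return plt


def keyword_paths(plt, term_ids, id_lp, keyword):
    """id_P of every path under labelled path id_lp whose value contains keyword."""
    id_t = term_ids.get(keyword.upper())
    if id_t is None:
        return []
    low, high = (0, id_lp, id_t), (MAX_D, id_lp, id_t)
    hits = {p[0] for p in plt
            if all(lo <= x <= hi for x, lo, hi in zip(p, low, high))}
    return sorted(hits)


# Hypothetical data: id_LP 0 stands for PLAY,PERSONAE,PERSONA and 1 for PLAY,TITLE.
VALUES = [(0, "DUKE OF VENICE"), (0, "OTHELLO, a noble Moor"), (1, "The Tragedy of Othello")]
term_ids = {}
plt = build_plt(VALUES, term_ids)
print(keyword_paths(plt, term_ids, 0, "OTHELLO"))
```

The returned id_P values would then drive lookups in the path index, where id_P is the leading coordinate of each path point.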

Path-Labelled Path-Term index Example 12/21

Query processing plan Example 13/21

Index data structures Paged and balanced multi-dimensional data structures: (B)UB-trees, variants of R-trees. Problems: ● indexing points with different dimensions, ● narrow range queries: a signature is applied for efficient processing (Signature R-tree). Efficient processing of the complex range query. 14/21

Efficient processing of the complex range query A complex range query is a sequence of range queries: qb_1, qb_2, …, qb_n. The query can be processed in one run of the multi-dimensional data structure. 15/21
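The benefit of a single run can be illustrated with a toy model in Python. This is a sketch under strong assumptions: a flat list of pages with bounding boxes stands in for the leaves of a tree, the functions naive and one_run are hypothetical, and the page-access counts are those of the toy data, not the DAC figures reported in the slides.

```python
def mbr(page):
    # Minimum bounding rectangle of the points stored in one page.
    return tuple(map(min, zip(*page))), tuple(map(max, zip(*page)))


def intersects(region, box):
    (rlo, rhi), (blo, bhi) = region, box
    return all(a <= d and c <= b for a, b, c, d in zip(rlo, rhi, blo, bhi))


def in_box(point, box):
    low, high = box
    return all(lo <= x <= hi for x, lo, hi in zip(point, low, high))


def naive(pages, boxes):
    # One traversal per query box: a page is read once for every box it intersects.
    accesses, hits = 0, set()
    for box in boxes:
        for page in pages:
            if intersects(mbr(page), box):
                accesses += 1
                hits.update(p for p in page if in_box(p, box))
    return hits, accesses


def one_run(pages, boxes):
    # Single traversal: a page is read at most once and tested against all boxes.
    accesses, hits = 0, set()
    for page in pages:
        relevant = [b for b in boxes if intersects(mbr(page), b)]
        if relevant:
            accesses += 1
            hits.update(p for p in page if any(in_box(p, b) for b in relevant))
    return hits, accesses


PAGES = [[(0, 0), (1, 3)], [(4, 4), (6, 5)], [(8, 1), (9, 9)]]
BOXES = [((0, 0), (1, 1)), ((1, 2), (5, 4))]
print(naive(PAGES, BOXES), one_run(PAGES, BOXES))
```

Both strategies return the same points; the single run saves an access whenever one page intersects more than one query box.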

Experimental results Protein Sequence Database XML document: ● document size: 683 MB, ● number of elements: 21,305,818, ● number of attributes: 1,290,647, ● maximal length of a path: 7. Data structures: BUB-forest, R*-forest, Signature BUB-tree and R*-tree. Index structures: trees indexing spaces of dimension n=7 and n=9. 16/21

Experimental results Queries: ProteinDatabase/ProteinEntry/[reference/refinfo/authors/author='Smith, E.L.'] 17/21

Experimental results Regular path expression query: //uid=' '; 5 labelled paths were matched. Naive processing of the complex range query: DAC: 368. Efficient processing of the complex range query: DAC: 139. Time: 0.03s, improvement: 2.5x. 18/21

Preliminary experimental results Keyword-based searching othello.xml: ● document size is 250kB, ● maximal length of the path: 6 ● number of paths: 4,967 ● number of labelled paths: 13 ● number of terms: 8,744 ● PLT index: 27,127 19/21

Preliminary experimental results Keyword-based searching Query: /PLAY[PERSONAE/PERSONA~=OTHELLO]/TITLE. Labelled path index: result size: 1, DAC: 3; PLT index: result size: 1, DAC: 3; Path index: result size: 1, DAC: 13; Path index: result size: 1, DAC: 4. 20/21

Conclusion Θ(m × log n), Θ(c × m × log n) vs. Θ(m_1 × m_2), where m_1, m_2 ≥ m. Efficient processing of a query with an AND condition; a signature is applied. The multi-dimensional approach may also be applied to term searching (e.g. *comp*). The update operation on XML documents. Comparison with other approaches on test collections (INEX, XMark, …). 21/21


Paths, labelled paths The paths 0,1,2,' '; 0,5,6,' ' and 0,9,10,' ' belong to the labelled path books,book,id, ... The paths 0,1,4,'J.R.R. Tolkien'; 0,5,8,'J.R.R. Tolkien' and 0,9,12,'Joseph Heller' belong to the labelled path books,book,author.

Complex queries Queries for values combined with XPath axis processing, e.g. books/book[author='Joseph Heller']/title: ● a combination of the techniques described above (query for a value, XPath axis processing). Regular path expression queries, for example books//author: ● a sequence of range queries over the labelled path index and the path index processes this query: books,author - books,*,author - books,*,…,*,author.
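The expansion of books//author into a sequence of range queries over the labelled path index can be sketched as follows. The function name expand_descendant, the bound max_len on the labelled path length and the value of MAX_D are assumptions; the id_T values 0 (books) and 6 (author) are those of the books example.

```python
MAX_D = 2**32 - 1  # assumed domain maximum


def expand_descendant(first, last, max_len, max_d=MAX_D):
    """Query boxes for first//last: first,last - first,*,last - first,*,...,*,last.

    Each '*' becomes an unbounded coordinate [0, max_d]; one box is produced
    for every labelled path length from 2 up to max_len.
    """
    boxes = []
    for length in range(2, max_len + 1):
        wildcards = length - 2
        low = (first,) + (0,) * wildcards + (last,)
        high = (first,) + (max_d,) * wildcards + (last,)
        boxes.append((low, high))
    return boxes


boxes = expand_descendant(first=0, last=6, max_len=3)
for low, high in boxes:
    print(low, high)
```

The boxes have different dimensions, which is one reason the slides list indexing points with different dimensions among the open problems.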

(B)UB-tree, R-tree (figure labels: UB-tree, Z-address, B-tree)

Narrow range query – signature multi-dimensional data structures Regions intersecting the query hyper-box are searched, at a cost of O(N_I × log_c n). The ratio c_R = N_R / N_I of relevant regions N_R to intersecting regions N_I becomes ≪ 1 as the dimension increases. Signatures are applied to better filter out irrelevant regions (signature multi-dimensional structures).

Signature R-tree
