Processing XML Streams with Deterministic Automata Denis Mindolin Gaurav Chandalia.

Introduction
[Figure: an XML data stream enters an XML Stream Router, which serves Consumer 1, Consumer 2 and Consumer 3, registered with XPath query 1, XPath query 2 and XPath query 3 respectively.]

Related Work
- The problem was introduced in [Altinel and Franklin 2000] for the XFilter system.
- [Chan et al. 2002] describes a technique for solving the problem based on a trie (XTrie).
- [Diao et al. 2003] discusses a method based on optimized NFAs (YFilter).
- [Green et al. 2003] shows how to solve the problem using a lazy DFA.

DFA approach in general
1. Convert the set of XPath expressions into a set of NFAs
2. Combine the set of NFAs into a single NFA
3. Convert the single NFA into a DFA
4. Process the XML data stream with the DFA (using the SAX model)

DFA approach in general (cont)
Linear XPath expression:
P ::= /N | //N | PP
N ::= E | A | * | text() | text() = S
where
- E: element label
- A: attribute label
- /: child axis
- //: descendant axis
- *: wild card
- S: constant string
What about predicates? They are decomposed into linear XPath expressions.
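The grammar above can be turned into a small parser. The following is only a sketch: the function name `parse_linear_xpath`, the regular expression, and the `@name` spelling for attribute labels are assumptions made for illustration, not part of the original slides.

```python
import re

# One step of a linear XPath expression: an axis (/ or //) followed by a node test.
# Assumption: attribute labels are written @name and constant strings use double quotes.
STEP = re.compile(r'(//|/)(\*|text\(\)(?:\s*=\s*"[^"]*")?|@?[A-Za-z_][\w.-]*)')

def parse_linear_xpath(expr):
    """Split a linear XPath expression into a list of (axis, node_test) pairs."""
    steps = []
    pos = 0
    while pos < len(expr):
        match = STEP.match(expr, pos)
        if match is None:
            raise ValueError(f"not a linear XPath expression: {expr!r}")
        axis = "descendant" if match.group(1) == "//" else "child"
        steps.append((axis, match.group(2)))
        pos = match.end()
    if not steps:
        raise ValueError("empty expression")
    return steps
```

For example, `parse_linear_xpath("/a//*/b")` yields one child step on `a`, one descendant step on `*`, and one child step on `b`. Predicates in square brackets are rejected, since they must first be decomposed into linear expressions.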

DFA approach in general (cont)
Consider two XPath expressions:
/datasets/dataset[//tableHead//*/text()="Galaxy"]/title
/datasets/dataset[/history]/tableHead[/field]
Corresponding query tree:
$D IN $R/datasets/dataset
  $H IN $D/history
  $T IN $D/title (sax f = true)
  $TH IN $D/tableHead (sax f = true)
    $F IN $TH/field
  $N IN $D//tableHead//*
    $V IN $N/text()="Galaxy"

Conversion of XPath expressions into NFA and DFA
Query tree:
$X IN $R/a
  $Y IN $X//*/b
  $Z IN $X/b/*
    $U IN $Z/d
[Figure: three panels showing the query tree, the query NFA and the query DFA.]

Eager DFA vs. Lazy DFA
- A DFA is eager if it is obtained by the standard NFA-to-DFA conversion algorithm [Hopcroft and Ullman 1979].
- A DFA is lazy if it is constructed at run time, on demand. Initially it has a single state; whenever we attempt a transition into a missing state, we compute that state and add the transition.
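The on-demand construction can be sketched as follows. This is an illustrative sketch rather than the presenters' implementation: the class `LazyDFA`, the helper `run_queries`, the encoding of NFA states as (query index, steps matched) pairs, and the sample document are all assumptions. Queries are given as lists of (axis, node test) pairs, and only element labels and the wild card are handled.

```python
import xml.sax

class LazyDFA(xml.sax.ContentHandler):
    """Runs a set of linear queries over a SAX stream, building DFA states on demand."""

    def __init__(self, queries):
        super().__init__()
        self.queries = queries
        # A DFA state is a frozenset of NFA states (query index, steps matched so far).
        start = frozenset((q, 0) for q in range(len(queries)))
        self.delta = {}        # (dfa_state, label) -> dfa_state, filled lazily
        self.stack = [start]   # one DFA state per open element
        self.path = []
        self.matches = []      # (query index, path of the matching element)

    def _transition(self, state, label):
        key = (state, label)
        if key not in self.delta:  # missing transition: compute it now and remember it
            target = set()
            for q, i in state:
                steps = self.queries[q]
                if i == len(steps):
                    continue       # accepting NFA state, no outgoing transitions
                axis, test = steps[i]
                if axis == "descendant":
                    target.add((q, i))      # keep looking deeper in the tree
                if test == "*" or test == label:
                    target.add((q, i + 1))  # this step is matched
            self.delta[key] = frozenset(target)
        return self.delta[key]

    def startElement(self, name, attrs):
        state = self._transition(self.stack[-1], name)
        self.stack.append(state)
        self.path.append(name)
        for q, i in sorted(state):
            if i == len(self.queries[q]):
                self.matches.append((q, "/" + "/".join(self.path)))

    def endElement(self, name):
        self.stack.pop()
        self.path.pop()

def run_queries(xml_bytes, queries):
    handler = LazyDFA(queries)
    xml.sax.parseString(xml_bytes, handler)
    return handler

# The two queries /a//*/b and /a/b/*/d over a made-up sample document.
QUERIES = [
    [("child", "a"), ("descendant", "*"), ("child", "b")],
    [("child", "a"), ("child", "b"), ("child", "*"), ("child", "d")],
]
result = run_queries(b"<a><b><c><d/></c><b/></b></a>", QUERIES)
print(result.matches, len(result.delta))
```

Only transitions actually exercised by the input are ever materialized, which is why the size of the lazy DFA depends on the shape of the data rather than on every label combination the queries could in principle accept.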

Eager DFA
P = p_0 // p_1 // … // p_k, where each p_i = N_1 / N_2 / … / N_{n_i}
- k = number of //'s
- n_i = length of p_i, i = 0, …, k
- m = maximum number of *'s in each p_i
- n = length (or depth) of P, i.e. n = n_0 + n_1 + … + n_k
- s = alphabet size |Σ|
Theorem. Given a linear XPath expression P, define prefix(P) = n_0, and body(P) = [formula lost in the transcript] when k > 0, and body(P) = 1 when k = 0. Then the eager DFA for P has at most prefix(P) + body(P) states. In particular, if m = 0 and k ≤ 1, then the DFA has at most (n+1) states.

Lazy DFA. Example
Queries:
/a//*/b
/a/b/*/d
[Figure: sample XML document and the lazy DFA built while processing it; the diagram is not recoverable from the transcript.]

Lazy DFA
Graph schema (based on DTD)
- d: the maximum number of simple cycles that a simple path can intersect
- D: the total number of nonempty, simple paths starting at the root
[Figure: example graph schema, for which d = 2, D = 13.]

Lazy DFA (cont)
Theorem. Consider a graph schema with parameters d and D, and let Q be a set of XPath expressions of maximum depth n. Then on any XML input satisfying the schema, the lazy DFA has at most 1 + D(1+n)^d states.
Corollary. The number of states of the lazy DFA does not depend on the number of XPath expressions, only on their depth.
Example: n = 10 and the number of XPath expressions is 100,000.
- Eager DFA may have ≈ 2^100,000 states
- Lazy DFA will have ≤ 1574 states
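The 1574 figure follows from the bound 1 + D(1+n)^d if the schema parameters are taken to be d = 2 and D = 13, the values given for the example graph schema. A short check, where the function name `lazy_dfa_state_bound` is chosen here for illustration:

```python
def lazy_dfa_state_bound(D, d, n):
    """Upper bound 1 + D * (1 + n)**d on the number of lazy DFA states."""
    return 1 + D * (1 + n) ** d

# Assumed parameters: D = 13 and d = 2 from the example schema, query depth n = 10.
print(lazy_dfa_state_bound(D=13, d=2, n=10))  # 1 + 13 * 11**2 = 1574
```

The number of XPath expressions does not appear among the arguments, which is exactly the content of the corollary.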

Lazy DFA. Implementation
- The XML stream is processed using the SAX model
- Subset of XPath considered in the implementation:
  - No tests on text() or attribute values
  - Only child and descendant axes
  - All predicates of a query must fire before the target element

Restrictions of the implementation
XPath queries:
//courses[level]/section
//courses[days]/section
//courses[credits]/section
Sample XML document (element tags lost in the transcript; surviving text content): MEDIA WORKSHOP U Se 101 T 1:30pm 5:20pm 1-3
Possible cases:
1. All predicates fire before the target element
2. Predicates fire between the start and end tags of the target element
3. Predicates fire after the target element

Processing attributes
When processing a stream, all attributes are converted into elements.
Before:
<section name="Se 101" description=""/>
<hours start="1:30pm" end="5:20pm"/>
After (element tags lost in the transcript; surviving text content): Se 101 1:30pm 5:20pm
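One way such a conversion could be done in a SAX handler is sketched below. The slides do not show the exact output format, so the choice to emit each attribute as a child element named after the attribute is an assumption, as are the names `AttributesToElements` and `convert`.

```python
import xml.sax
from xml.sax.saxutils import escape

class AttributesToElements(xml.sax.ContentHandler):
    """Re-serializes a SAX stream with every attribute turned into a child element."""

    def __init__(self):
        super().__init__()
        self.out = []

    def startElement(self, name, attrs):
        self.out.append(f"<{name}>")
        # Assumption: each attribute becomes a child element carrying the value as text.
        for attr_name in attrs.getNames():
            value = escape(attrs.getValue(attr_name))
            self.out.append(f"<{attr_name}>{value}</{attr_name}>")

    def characters(self, content):
        self.out.append(escape(content))

    def endElement(self, name):
        self.out.append(f"</{name}>")

def convert(xml_bytes):
    handler = AttributesToElements()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.out)

print(convert(b'<hours start="1:30pm" end="5:20pm"/>'))
```

After this rewriting, attribute tests can be handled by the same automaton transitions as element tests, so the matching machinery needs no separate case for attributes.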

Testing
- Reference implementation: Galax
- Test XML stream: World geographic database
- Maximum XML depth of the stream: 6
- Number of queries: 14
- Depth of queries: 1 to 5
- Number of predicates: 0 to 3
- Depth of predicates: 1 to 4

Method used    Number of states used
NFA            22
Eager DFA      87
Lazy DFA       22

References
Green, T. J. et al. 2004. Processing XML streams with deterministic automata and stream indexes. ACM Transactions on Database Systems, 12/2004.
Altinel, M. and Franklin, M. 2000. Efficient filtering of XML documents for selective dissemination of information. In Proceedings of VLDB. Cairo, Egypt.
Chen, J. et al. 2000. NiagaraCQ: a scalable continuous query system for internet databases. In Proceedings of the ACM/SIGMOD Conference on Management of Data.
Diao, Y. and Franklin, M. 2003. Query processing for high-volume XML message brokering. In Proceedings of VLDB. Berlin, Germany.
Hopcroft, J. E. and Ullman, J. D. 1979. Introduction to automata theory, languages, and computation.