From Searching Text to Querying XML Streams
Dan Suciu, Univ. of Washington
www.cs.washington.edu/homes/suciu


About Me
Born 1957, Romania
BS: Bucharest; PhD: University of Pennsylvania
Now: University of Washington (Seattle)
My work is on semistructured data
Book: Data on the Web: From Relations to Semistructured Data and XML
Past/present projects:
– XML-QL = precursor of XQuery
– XMill = the XML compressor
– XML toolkit

Motivation
Text databases
– Studied over the past 15 years
– Traditional client/server model
– Struggled with lack of standard text syntax
Recently, new standard: XML
– Traditional client/server: in today’s DBMS
– New applications: stream processing
This talk: processing streaming XML data
– My motivation: work on the XML Toolkit project

Outline
Background
The XML stream processing problem
Basic XML processing with automata
Adapting automata to XML
Stream indexes
Conclusions

Background: Relational Databases
Structured, stored in tables
Schema separate from data
Queries: precise, refer to schema and data (SQL)
BOOKS:
ISBN   Title                      Year   Publisher
…      Foundations of Databases   1995   AW
…X     Data on the Web            1999   MK
AUTHOR:
AID   Name        Country
44    Abiteboul   FR
06    Buneman     UK
62    Hull        USA
12    Suciu       USA
29    Vianu       USA
WROTE:
ISBN   AID
(rows mostly lost in extraction; surviving fragments: X, X, X12)
Hard to publish, easy to query precisely

Background: Text Databases
Unstructured, stored in documents
No schema, only data
Queries: imprecise, refer to data only (keywords)
Example documents:
Foundations of Databases, Abiteboul (FR), Hull (USA), Vianu (USA), Addison Wesley, 1995
Data on the Web, Abiteboul (FR), Buneman (UK), Suciu (USA), Morgan Kaufmann, 1999
Easy to publish, hard to query precisely

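Background: XML Data
Semistructured
Schema and data are together: self-describing
Queries: precise, refer to schema and data (SQL)
Example (markup was stripped in the transcript; element names restored from the tree and XPath slides):
<bib>
  <book>
    <title>Foundations…</title>
    <author><name>Abiteboul</name><country>FR</country></author>
    <author><name>Hull</name><country>USA</country></author>
    <author><name>Vianu</name><country>USA</country></author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
  …
</bib>
XML: Easier to publish, easy to query precisely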

Background: XML Data
[Figure: XML tree with root bib; element nodes book, paper, title, author, publisher, journal, name, country; leaf values include “Data on the Web”, Abiteboul / FR, Buneman / UK, Addison Wesley.]
Data model = tree

Background: XML Data
Querying with XPath (and XQuery)
This talk: XPath queries restricted to: tag, /, //, *, [ ], path=“constant”

Background: XPath in One Slide
tag, /:   /bib/book/author/name
//, *:    /bib/book//name/*/zip
[ ]:      /bib/book[author/name=“Abiteboul”]
          /bib/book[year=“1995” and author[name=“Abiteboul” and country=“FR”]]
tag, /, //, *: navigate partially known structure. This is precisely the “region algebra”, e.g. use proximal nodes [Navarro&Baeza-Yates’97]
[ ]: conjunctive queries a la SQL
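To make the XPath fragment above concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The sample document is an assumption pieced together from the names and years used on the slides. Because ElementTree evaluates paths relative to the root element, `/bib/book/...` is written as `./book/...`, and nested-path predicates such as `[author/name=...]` are left out since that module does not support them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, modelled on the bibliography used in the slides.
DOC = """
<bib>
  <book>
    <title>Foundations of Databases</title>
    <author><name>Abiteboul</name><country>FR</country></author>
    <year>1995</year>
  </book>
  <book>
    <title>Data on the Web</title>
    <author><name>Suciu</name><country>USA</country></author>
    <year>1999</year>
  </book>
</bib>
"""

root = ET.fromstring(DOC)  # root is the <bib> element

# tag and / : child steps only (the slide's /bib/book/author/name)
names = [e.text for e in root.findall("./book/author/name")]

# // : descendant step, structure only partially known
names_anywhere = [e.text for e in root.findall(".//name")]

# [ ] : a branch with a constant comparison
titles_1995 = [e.text for e in root.findall("./book[year='1995']/title")]

print(names)           # ['Abiteboul', 'Suciu']
print(names_anywhere)  # ['Abiteboul', 'Suciu']
print(titles_1995)     # ['Foundations of Databases']
```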


Main Application: XML Packet Routing
Selective Dissemination of Information [Altinel&Franklin’00, Chan et al.’02]
XML content routing [Snoeren et al.’01]
SOAP message routing in application servers

XML Packet Routing
[Figure: XML packet routing diagram; only the repeated placeholder label “value” survives in the transcript.]

XPath expressions:
/bib/book/publisher=“MK”
/bib/book[category=“recent”]/title=“Web”
/bib/book//address//*/zip=“123”
/bib/book//address//*="Galaxy"
/bib/book/category=“recent”
/bib/book/address=“123”
/bib/book/address/field=“567”
[Figure labels: Input XML Stream, XPath expressions, Output XML Streams]

The XML Stream Processing Problem
Given:
– A set of XPath expressions
– An incoming stream of XML documents
Decide:
– For each document, which expressions it matches
Hard:
– Large number of XPath expressions, e.g. …
– Streaming XML data, high throughput, e.g. 5MB/s
Easy:
– Shallow XML data, e.g. depth = 20
– Short XPath expressions

The Approaches
Basic techniques
NFA plus optimizations:
– XFilter/YFilter [Altinel&Franklin’00]
– XTrie [Chan et al.’02]
DFA:
– XML Toolkit
Beyond the obvious
– Stream indexes (XML Toolkit)
– Stream views


From XPath to NFA
/catalog/product[category="tools"][*/price = 200]/quantity
//price
[Figure: NFA for the two expressions, with transitions labelled catalog, product, category, "tools", *, price, 200, quantity, and a * loop leading to price for //price.]
Extra processing needed to combine branches (not in this talk)

Basic NFA Evaluation
[Figure: a large set of XPath expressions (e.g. /bib/book/publisher=“MK”, /bib/book[category=“recent”]/title, /bib/book//address//*/zip=“123”, /bib/book//address//*="Galaxy", ...) is compiled into an NFA. SAX events drive the NFA, and a STACK holds the sets of current states, e.g. {3,66,102,4534,...}, {2,3,543,43,254}, {1,55,99,...}.]

Basic NFA Evaluation
Properties:
– Space = linear
– Throughput = decreases linearly
Systems: XFilter [Altinel&Franklin’00], YFilter, XTrie [Chan et al.’02]
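The stack-of-state-sets evaluation described above can be sketched as follows. This is an illustrative toy, not the XFilter, YFilter, or XTrie implementation: it handles only linear expressions built from tag, /, // and *, keeps one small NFA per query instead of a shared one, and uses `xml.etree.ElementTree.iterparse` as a stand-in for a SAX parser. The function names `parse_xpath`, `nfa_step`, and `match_stream` are made up for this sketch.

```python
import io
import re
import xml.etree.ElementTree as ET

def parse_xpath(expr):
    """Split a linear XPath such as '/bib//name/*' into (axis, tag) steps."""
    return re.findall(r"(//|/)([^/]+)", expr)

def nfa_step(steps, states, tag):
    """NFA state i means 'the first i steps have matched'."""
    nxt = set()
    for i in states:
        if i == len(steps):
            continue                      # accepting state: no outgoing transitions
        axis, name = steps[i]
        if axis == "//":
            nxt.add(i)                    # self-loop: // may skip any element
        if name == "*" or name == tag:
            nxt.add(i + 1)                # this element matches the next step
    return nxt

def match_stream(xml_bytes, queries):
    nfas = {q: parse_xpath(q) for q in queries}
    stack = [{q: {0} for q in queries}]   # one entry per open element
    matched = set()
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            # Work per event grows with the number of queries.
            current = {q: nfa_step(nfas[q], stack[-1][q], elem.tag) for q in queries}
            matched.update(q for q, s in current.items() if len(nfas[q]) in s)
            stack.append(current)
        else:
            stack.pop()                   # restore the parent's state sets
    return matched

doc = b"<bib><book><author><name>Abiteboul</name></author></book></bib>"
queries = ["/bib/book/author/name", "//name", "/bib/name", "/bib/*/author", "//book//title"]
print(sorted(match_stream(doc, queries)))
# ['//name', '/bib/*/author', '/bib/book/author/name']
```

Each start-element event touches every query's state set, which is one way to see why throughput falls as the number of expressions grows.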

Basic DFA Evaluation
[Figure: the same set of XPath expressions (e.g. /bib/book/publisher=“MK”, /bib/book[category=“recent”]/title, /bib/book//address//*/zip=“123”, ...) is compiled into DFAs. SAX events drive the DFA, and a STACK holds a single current state per open element.]

Basic DFA Evaluation
Properties:
– Throughput = constant!
– Space = GOOD QUESTION
System: XML Toolkit [University of Washington]


The Size of the DFA
Example: //a/b/a/a/b
[Figure: NFA with transitions a, b, a, a, b and a * self-loop on the initial state, shown next to the equivalent DFA, whose additional transitions are labelled a, b, and [other].]
DFA for //P has 1+|P| states [KMP]

The Size of the DFA
Example: //a/*/*/*/b
[Figure: NFA with transitions a, *, *, *, b and a * self-loop on the initial state, shown next to a fragment of the DFA (without back edges), with transitions labelled a, b, and [other].]
Size of DFA = exponential in *’s (not a real concern)
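The two size claims can be checked on the slides' own examples with a small eager subset construction. This is a sketch under assumptions: the alphabet is taken to be the tags a and b plus one symbol standing for every other tag, the accepting NFA state is given no outgoing transitions, and the helper names are hypothetical.

```python
import re

def parse_xpath(expr):
    """Split a linear XPath into (axis, tag) steps."""
    return re.findall(r"(//|/)([^/]+)", expr)

def nfa_step(steps, states, tag):
    nxt = set()
    for i in states:
        if i == len(steps):
            continue                  # accepting state: no outgoing transitions
        axis, name = steps[i]
        if axis == "//":
            nxt.add(i)                # self-loop for the descendant axis
        if name == "*" or name == tag:
            nxt.add(i + 1)
    return nxt

def eager_dfa_states(expr, alphabet):
    """Return all DFA states (sets of NFA states) reachable from the start."""
    steps = parse_xpath(expr)
    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        state = todo.pop()
        for tag in alphabet:
            nxt = frozenset(nfa_step(steps, state, tag))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

ALPHABET = ("a", "b", "other")       # "other" stands for every remaining tag

print(len(eager_dfa_states("//a/b/a/a/b", ALPHABET)))   # 6, i.e. 1 + |P| with |P| = 5
print(len(eager_dfa_states("//a/*/*/*/b", ALPHABET)))   # 24
```

With wildcards the DFA has to remember which of the last few tags were `a`, which is where the exponential growth in the number of `*` steps comes from.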

The Size of the DFA
Theorem [GMOS’02]: The number of states in the DFA for one linear XPath expression P is at most
$k + |P|\,k\,s^{m}$
where
k = number of //
s = size of the alphabet (number of tags)
m = max number of * between two consecutive //

Size of DFA: Multiple Expressions
//section/table/footnote
//table/footnote
//section/figure/footnote
.....
//abstract/footnote/table
DFA = Trie has linear number of states [Aho&Corasick]

Size of DFA: Multiple Expressions
//section//footnote
//table//footnote
//figure//footnote
.....
//abstract//footnote
100 expressions: … states!!
There is a theorem here too, but it’s not useful…

Solution: Compute the DFA Lazily
Also used in text searching
But will it work for 10^6 XPath expressions? YES!
For XPath it is provably effective, for two reasons:
– XML data is not very deep
– The nesting structure in XML data tends to be predictable
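A lazy DFA can be sketched as a transition table that starts empty and is filled only for the (state, tag) pairs the data actually exercises. The code below is a toy illustration under assumptions, not the XML Toolkit implementation: it supports only linear expressions, represents a DFA state as a set of (query index, NFA state) pairs, and uses `iterparse` in place of a SAX parser. `LazyDFA` and `run` are hypothetical names.

```python
import io
import re
import xml.etree.ElementTree as ET

def parse_xpath(expr):
    return re.findall(r"(//|/)([^/]+)", expr)

class LazyDFA:
    def __init__(self, queries):
        self.names = list(queries)
        self.queries = [parse_xpath(q) for q in queries]
        self.start = frozenset((qi, 0) for qi in range(len(self.queries)))
        self.table = {}                     # (state, tag) -> state, filled on demand

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.table:           # warm-up: compute the transition once
            nxt = set()
            for qi, i in state:
                steps = self.queries[qi]
                if i == len(steps):
                    continue                # accepting NFA state
                axis, name = steps[i]
                if axis == "//":
                    nxt.add((qi, i))
                if name in ("*", tag):
                    nxt.add((qi, i + 1))
            self.table[key] = frozenset(nxt)
        return self.table[key]              # afterwards: a single dictionary lookup

    def accepted(self, state):
        return {self.names[qi] for qi, i in state if i == len(self.queries[qi])}

    def num_states(self):
        return len({self.start} | set(self.table.values()))

def run(dfa, xml_bytes):
    stack, matched = [dfa.start], set()
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            stack.append(dfa.step(stack[-1], elem.tag))
            matched |= dfa.accepted(stack[-1])
        else:
            stack.pop()
    return matched

queries = ["//section//footnote", "//table//footnote",
           "//figure//footnote", "//abstract//footnote"]
doc = (b"<document><section><table><footnote/></table>"
       b"<figure><footnote/></figure></section></document>")

dfa = LazyDFA(queries)
print(sorted(run(dfa, doc)))   # the section, table and figure queries match
print(dfa.num_states())        # 6 states materialized for this document
```

Only the element nestings that occur in the data create states, so shallow and predictable documents keep the table small; a second pass over similar data adds no new transitions.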

Lazy DFA and “Simple” DTDs
Document Type Definition (DTD)
– Part of the XML standard
– Will be replaced by XML Schema
Example DTD: …
Definition: A DTD is simple if all cycles are loops

Lazy DFA and “Simple” DTDs
Simple DTD:
[Figure: DTD graph over the elements document, section, table, figure, footnote, abstract.]
XPath expressions:
//section//footnote
//table//footnote
//figure//footnote
//abstract//footnote
Eager DFA “remembers” 2^4 sets
Lazy DFA “remembers” only 4 sets

Lazy DFA and “Simple” DTDs
Theorem [GMOS’02]: If the XML data has a “simple” DTD, then the lazy DFA has at most
$1 + D(1+n)^{d}$
states, where
n = max depth of the XPath expressions
D = size of the “unfolded” DTD
d = max depth of self-loops in the DTD
Fact of life: “Data-like” XML has simple DTDs

Lazy DFA and Data Guides
“Non-simple” DTDs are useless for the lazy DFA: “Everything may contain everything”
Fact of life: “Text”-like XML has non-simple DTDs

Lazy DFA and Data Guides
Definition [Goldman&Widom’97]: The data guide for an XML data instance is the trie of all its root-to-leaf paths
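The definition translates almost directly into code. The sketch below builds the trie of root-to-leaf tag paths as nested dictionaries and counts its nodes; the sample document and the function names `data_guide` and `count_nodes` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def data_guide(root):
    """Return the data guide as nested dicts: tag -> guide of its children."""
    guide = {}

    def add(elem, node):
        child = node.setdefault(elem.tag, {})   # identical paths share one trie node
        for sub in elem:
            add(sub, child)

    add(root, guide)
    return guide

def count_nodes(guide):
    return sum(1 + count_nodes(sub) for sub in guide.values())

# Hypothetical document: two sections with overlapping structure.
doc = """
<document>
  <section><table/></section>
  <section><table/><figure/></section>
</document>
"""

guide = data_guide(ET.fromstring(doc))
print(guide)               # {'document': {'section': {'table': {}, 'figure': {}}}}
print(count_nodes(guide))  # 4
```

The document has six elements but only four distinct tag paths, so the guide is smaller than the data; repeated structure is what keeps real data guides small.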

Lazy DFA and Data Guides
[Figure: an XML data tree (document with several section elements containing table and figure elements) shown next to its data guide (document, section, table, figure).]
Fact of life: real XML data has “small” data guide [Liefke&S.’00]

Lazy DFA and Data Guides
Theorem [GMOS’02]: If the XML data has a data guide with G nodes, then the number of states in the lazy DFA is at most
$1 + G$
where G = number of nodes in the data guide

Number of Lazy DFA States - SYNTHETIC Data
[Figure: number of lazy DFA states for the data sets simple, prov, ebBPSS, protein, nasa, treebank, with one series each for 10^3, 10^4, and 10^5 XPath expressions. Annotation: 4000 states.]

Number of Lazy DFA States - REAL Data
[Figure: number of lazy DFA states for the data sets protein, nasa, treebank, with one series each for 10^3, 10^4, and 10^5 XPath expressions. Annotations: “95 states”, “… states”, “G = …”.]

Number of States in the lazy DFA
                     Real XML data                Synthetic XML data
Data-style DTD       Theorem: Lazy DFA is small   Theorem: Lazy DFA is small
Document-style DTD   Theorem: Lazy DFA is small   Fact: Lazy DFA is HUGE

Lazy DFA in the XML Toolkit
The XML toolkit uses a lazy DFA to process XML streams
“Warm-up” phase, followed by very high throughput

Throughput for 10^3, 10^4, 10^5, 10^6 XPath expressions [prob(*)=10%, prob(//)=10%]
[Figure: throughput (log scale, 0.001MB/s to 100MB/s) versus total input size (5MB to 25MB). Series: parser; lazyDFA with 10^3, 10^4, 10^5, 10^6 XPath expressions; xfilter with 10^3, 10^4, 10^5, 10^6 XPath expressions.]
Parser: 10MB/s
Lazy DFA: 5.4MB/s

Summary of Lazy DFA and XML
Linear XPath expressions:
– Process with one lazy DFA
XPath expressions with branches:
– Process with Deterministic Pushdown Automata (ongoing work at the University of Washington)


Stream IndeX (SIX)
Main observation: Parsing is the major bottleneck
Definition: The SIX of an XML document is a binary table of (begin, end) offsets
Idea: Use SIX to reduce the amount of parsing
Works well with (lazy) DFA
Implemented in the XML toolkit
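As a rough illustration of the definition, the sketch below computes a table of (tag, begin, end) byte offsets for each element and then uses one row to jump over a subtree without tokenizing it. It is a toy under stated assumptions: tags are found with a regular expression rather than a real XML parser (so comments, CDATA, and attribute values containing `>` are not handled), `end` is taken to be the offset just past the closing tag, and the real SIX layout in the XML Toolkit is binary and may differ. `build_six` is a hypothetical name.

```python
import re

# Toy tag tokenizer: optional '/', a tag name, anything up to '>', optional '/'.
TAG = re.compile(rb"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")

def build_six(data):
    """Return rows (tag, begin, end), one per element, in start-tag order."""
    six = []          # finished and unfinished rows
    open_rows = []    # indexes into six for elements that are still open
    for m in TAG.finditer(data):
        closing, name, empty = m.group(1), m.group(2).decode(), m.group(3)
        if closing:
            idx = open_rows.pop()
            tag, begin, _ = six[idx]
            six[idx] = (tag, begin, m.end())      # end = just past the end tag
        elif empty:
            six.append((name, m.start(), m.end()))
        else:
            six.append((name, m.start(), None))
            open_rows.append(len(six) - 1)
    return six

data = b"<bib><book><title>T</title></book></bib>"
six = build_six(data)
print(six)   # [('bib', 0, 40), ('book', 5, 34), ('title', 11, 27)]

# Skipping: if the automaton has no interest in <book>, continue at its end offset.
_, begin, end = six[1]
print(data[begin:end])   # b'<book><title>T</title></book>'
print(data[end:])        # b'</bib>'
```

The point of the index is the last step: a consumer that knows an element is irrelevant can seek to `end` instead of parsing everything in between.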

Stream IndeX (SIX)
[Figure: an XML bib document with two book entries (Addison-Wesley; Serge Abiteboul, Rick Hull, Victor Vianu; Foundations of Databases; 1995, and Freeman; Jeffrey D. Ullman; Principles of Database and Knowledge Base Systems; 1998), shown next to its SIX: a table with columns beginOffset and endOffset and one row per element (bib, book, publisher, author, author, ...). The offset values are garbled in the transcript; only the digits “12423” on the publisher row survive.]

Stream IndeX (SIX)
The SIX stream is about 6% of the data stream, and can be made MUCH smaller
[Figure labels: SIX (e.g. DIME), XML]


Stream Views
Idea: Given a workload of XPath expressions with branches, precompute some views for each document to speed up the entire workload
Views are carried in a header, which has to be small

Stream Views
Queries (at the servers):
/a[b=11][c=22][e=23]
/a[b=33][d=44][e=55]
/a[c=66][f=77]
/a[f=34][g=56]
/a[b=88][c=99]
/a[c=99][e=00]
3 Views:
/a/c
/a/e
/a/f
Short circuit evaluation!
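Every query above tests at least one of c, e, or f, which is why three views can cover the whole workload. The short-circuit idea can be sketched as follows, under assumptions: the header is modelled as a dictionary from view tag to the set of values found under /a, each query is modelled as a dictionary of its equality predicates, and the header is built here by parsing the document once (in the setting of the slides it would arrive precomputed with the document). All names are hypothetical.

```python
import xml.etree.ElementTree as ET

VIEW_TAGS = ("c", "e", "f")      # the three views /a/c, /a/e, /a/f

def make_header(xml_text):
    """Precompute the view results for one document whose root is <a>."""
    root = ET.fromstring(xml_text)
    return {tag: {child.text for child in root.findall(tag)} for tag in VIEW_TAGS}

def rejected_by_header(predicates, header):
    """True if some predicate on a viewed tag fails, so the body need not be parsed."""
    return any(tag in header and value not in header[tag]
               for tag, value in predicates.items())

# The slide's queries, written as {child tag: required value}.
QUERIES = {
    "/a[b=11][c=22][e=23]": {"b": "11", "c": "22", "e": "23"},
    "/a[f=34][g=56]":       {"f": "34", "g": "56"},
    "/a[b=88][c=99]":       {"b": "88", "c": "99"},
    "/a[c=99][e=00]":       {"c": "99", "e": "00"},
}

header = make_header("<a><b>88</b><c>99</c><e>00</e></a>")

for query, preds in QUERIES.items():
    verdict = "rejected from header" if rejected_by_header(preds, header) else "needs full evaluation"
    print(query, "->", verdict)
```

A header hit only proves a mismatch; a query that survives the header check, such as /a[b=88][c=99], still has to be evaluated against the document body because b is not covered by any view.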

Stream Views
[Figure: each XML document is preceded by a header holding the views as binary offsets.]
100x speedup on a hit
Choosing the views: Difficult problem


Summary
XML stream processing problem:
– Fixed XPath queries, transient XML data
– Large number of queries
– High data throughput
Relationship to text processing techniques:
– Still regular expressions
– Still automata and lazy DFAs
– Different scale
Techniques:
– Lazy DFAs work for reasons specific to XML
– Stream indexes and views: ongoing research

Future Work
Handle branches in XPath expressions
View selection for a given workload
Network configuration

Thank you!
Links: xmltk.sourceforge.net