Dan Suciu, Univ. of Washington: Querying XML Streams
From Searching Text to Querying XML Streams
About Me
  Born 1957, Romania
  BS: Bucharest, PhD: University of Pennsylvania
  Now: University of Washington (Seattle)
  My work is on semistructured data
  Book: Data on the Web: From Relations to Semistructured Data and XML
  Past/present projects:
    XML-QL = precursor of XQuery
    XMill = the XML compressor
    XML toolkit
Motivation
  Text databases
    –Studied over the past 15 years
    –Traditional client/server model
    –Struggled with the lack of a standard text syntax
  Recently, a new standard: XML
    –Traditional client/server: in today’s DBMS
    –New applications: stream processing
  This talk: processing streaming XML data
    –My motivation: work on the XML Toolkit project
Outline
  Background
  The XML stream processing problem
  Basic XML processing with automata
  Adapting automata to XML
  Stream indexes
  Conclusions
Background: Relational Databases
  Structured, stored in tables
  Schema separate from data
  Queries: precise, refer to schema and data (SQL)

  BOOKS:
    ISBN | Title                    | Year | Publisher
    …    | Foundations of Databases | 1995 | AW
    …    | Data on the Web          | 1999 | MK

  AUTHOR:
    AID | Name      | Country
    44  | Abiteboul | FR
    06  | Buneman   | UK
    62  | Hull      | USA
    12  | Suciu     | USA
    29  | Vianu     | USA

  WROTE (ISBN, AID): …

  Hard to publish, easy to query precisely
Background: Text Databases
  Unstructured, stored in documents
  No schema, only data
  Queries: imprecise, refer to data only (keywords)
  Example documents:
    Foundations of Databases, Abiteboul (FR), Hull (USA), Vianu (USA), Addison Wesley, 1995
    Data on the Web, Abiteboul (FR), Buneman (UK), Suciu (USA), Morgan Kaufmann, 1999
  Easy to publish, hard to query precisely
Background: XML Data
  Semistructured
  Schema and data are together: self-describing
  Queries: precise, refer to schema and data (XPath, XQuery)
  Example document content: Foundations… Abiteboul FR Hull USA Vianu USA Addison Wesley 1995 …
  XML: easier to publish, easy to query precisely
Background: XML Data
  [Figure: XML tree rooted at bib, with book and paper children; inner nodes title, author, publisher, journal, name, country; sample values Data on the Web, Abiteboul, FR, Buneman, UK, Addison Wesley]
  Data model = tree
Background: XML Data
  Querying with XPath (and XQuery)
  This talk: XPath queries restricted to: tag, /, //, *, [ ], path="constant"
Background: XPath in One Slide
  /bib/book[author/name="Abiteboul"]
  /bib/book[year="1995" and author[name="Abiteboul" and country="FR"]]
  /bib/book/author/name
  /bib/book//name/*/zip
  Constructs: tag, /, //, *, [ ]
  Navigate partially known structure
  Conjunctive queries a la SQL
  This is precisely the “region algebra”, e.g. use proximal nodes [Navarro&Baeza-Yates’97]
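To make the XPath fragment above concrete, here is a minimal sketch that evaluates a few of the expressions on a small in-memory document using Python's standard `xml.etree.ElementTree`. The sample document is hypothetical (modeled on the bibliography in the earlier slides), and ElementTree supports only a subset of XPath: paths are written relative to the `<bib>` root, and the nested predicate `author/name="Abiteboul"` is emulated with a Python filter.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, modeled on the bibliography used in the slides.
DOC = """
<bib>
  <book>
    <title>Foundations of Databases</title>
    <author><name>Abiteboul</name><country>FR</country></author>
    <author><name>Hull</name><country>USA</country></author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
  <book>
    <title>Data on the Web</title>
    <author><name>Suciu</name><country>USA</country></author>
    <publisher>Morgan Kaufmann</publisher>
    <year>1999</year>
  </book>
</bib>
"""

root = ET.fromstring(DOC)  # root is the <bib> element

# /bib/book/author/name, written relative to the <bib> root
names = [e.text for e in root.findall("book/author/name")]

# /bib/book//name, using the descendant axis
names_desc = [e.text for e in root.findall("book//name")]

# /bib/book[year="1995"], a branch with a value test on a child
books_1995 = [b.findtext("title") for b in root.findall("book[year='1995']")]

# /bib/book[author/name="Abiteboul"]: ElementTree has no path predicates,
# so the nested test is done with a Python filter instead.
by_abiteboul = [
    b.findtext("title")
    for b in root.findall("book")
    if any(n.text == "Abiteboul" for n in b.findall("author/name"))
]

print(names, books_1995, by_abiteboul)
```

This evaluates one expression at a time on a fully materialized tree, which is the opposite of the streaming setting the talk is about; it only illustrates what the expressions mean.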
Main Application: XML Packet Routing
  Selective Dissemination of Information [Altinel&Franklin’00, Chan et al.’02]
  XML content routing [Snoeren et al.’01]
  SOAP message routing in application servers
XML Packet Routing
  [Figure: routing diagram; only the repeated label “value” is recoverable]
[Figure: an input XML stream is filtered by a set of XPath expressions into output XML streams]
  XPath expressions:
    /bib/book/publisher="MK"
    /bib/book[category="recent"]/title="Web"
    /bib/book//address//*/zip="123"
    /bib/book//address//*="Galaxy"
    /bib/book/category="recent"
    /bib/book/address="123"
    /bib/book/address/field="567"
The XML Stream Processing Problem
  Given:
    A set of XPath expressions
    An incoming stream of XML documents
  Decide:
    For each document, which expressions it matches
  Hard:
    Large number of XPath expressions, e.g. …
    Streaming XML data, high throughput, e.g. 5MB/s
  Easy:
    Shallow XML data, e.g. depth=20
    Short XPath expressions
The Approaches
  Basic techniques
    NFA plus optimizations:
      –XFilter/YFilter [Altinel&Franklin’00]
      –XTrie [Chan et al.’02]
    DFA:
      –XML Toolkit
  Beyond the obvious
    Stream indexes (XML Toolkit)
    Stream views
From XPath to NFA
  /catalog/product[category="tools"][*/price = 200]/quantity //price
  [Figure: NFA with transitions labeled catalog, product, category, "tools", *, price, 200, quantity, *, price]
  Extra processing needed to combine branches (not in this talk)
Basic NFA Evaluation
  [Figure: a large workload of XPath expressions (such as /bib/book/publisher="MK", /bib/book//address//*/zip="123", /bib/book/address/field="567", ...) is compiled into a single NFA. SAX events drive the NFA; a STACK holds the sets of current states, e.g. {3,66,102,4534,...}, {2,3,543,43,254}, {1,55,99,...}]
Basic NFA Evaluation
  Properties:
    Space = linear in the total size of the XPath expressions
    Throughput = decreases linearly with the number of XPath expressions
  Systems:
    XFilter [Altinel&Franklin’00], YFilter
    XTrie [Chan et al.’02]
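The stack-based NFA evaluation can be sketched as follows. This is an illustrative toy, not the XFilter/YFilter or XTrie implementation: it handles only linear expressions built from tags, `*`, `/` and `//` (no branches or value tests), and the class name `NFAFilter` and helper `match` are made up for this example. An NFA state is a pair (expression index, number of steps matched); the stack holds one set of active states per open element.

```python
import re
import xml.sax


def parse_linear_xpath(expr):
    """Split a linear XPath such as '/a//b/*' into [(axis, tag), ...]."""
    return re.findall(r"(//|/)([^/]+)", expr)


class NFAFilter(xml.sax.ContentHandler):
    """Hypothetical toy filter: runs all expressions as one NFA over SAX events."""

    def __init__(self, exprs):
        super().__init__()
        self.exprs = exprs
        self.queries = [parse_linear_xpath(e) for e in exprs]
        self.matched = set()
        # Bottom of the stack: every expression is in its initial state.
        self.stack = [{(q, 0) for q in range(len(exprs))}]

    def startElement(self, name, attrs):
        nxt = set()
        for q, i in self.stack[-1]:
            steps = self.queries[q]
            if i == len(steps):          # accepting state: no outgoing edges
                continue
            axis, tag = steps[i]
            if axis == "//":             # descendant axis: self-loop on any tag
                nxt.add((q, i))
            if tag == "*" or tag == name:
                nxt.add((q, i + 1))
                if i + 1 == len(steps):
                    self.matched.add(self.exprs[q])
        self.stack.append(nxt)

    def endElement(self, name):
        self.stack.pop()                 # restore the parent's state set


def match(exprs, doc_bytes):
    handler = NFAFilter(exprs)
    xml.sax.parseString(doc_bytes, handler)
    return handler.matched


doc = (b"<bib><book><address><street><zip>123</zip></street></address>"
       b"<title>Web</title></book></bib>")
exprs = ["/bib/book/title", "/bib/book//address//*/zip", "//zip",
         "/bib/paper", "/book"]
print(sorted(match(exprs, doc)))
```

The work done per SAX event is proportional to the number of active states, which grows with the number of expressions; that is where the linear drop in throughput comes from.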
Basic DFA Evaluation
  [Figure: the same workload of XPath expressions (such as /bib/book/publisher="MK", /bib/book//address//*/zip="123", /bib/book/address/field="567", ...) is compiled into DFAs. SAX events drive the automaton; the STACK holds the current state]
Basic DFA Evaluation
  Properties:
    Throughput = constant (independent of the number of XPath expressions)!
    Space = GOOD QUESTION
  System:
    XML Toolkit [University of Washington]
The Size of the DFA
  //a/b/a/a/b
  [Figure: the NFA for this expression (a chain a, b, a, a, b with a * self-loop on the initial state) and the equivalent DFA, whose back edges are labeled a, b and [other]]
  DFA for //P has 1+|P| states [KMP]
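The 1+|P| count for //a/b/a/a/b can be checked with a small subset construction. The sketch below is an assumption-laden illustration rather than the XML Toolkit code: the function name `eager_dfa_states` is made up, the alphabet is taken to be {a, b, other}, and NFA state i means "the first i steps have been matched".

```python
def eager_dfa_states(steps, alphabet):
    """Subset construction for one linear XPath, given as [(axis, tag), ...]."""
    n = len(steps)

    def move(state, symbol):
        out = set()
        for i in state:
            if i == n:                 # accepting state: no outgoing edges
                continue
            axis, tag = steps[i]
            if axis == "//":           # descendant axis: self-loop on any tag
                out.add(i)
            if tag in ("*", symbol):   # the step's tag test succeeds
                out.add(i + 1)
        return frozenset(out)

    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        state = todo.pop()
        for symbol in alphabet:
            nxt = move(state, symbol)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


# //a/b/a/a/b over the alphabet {a, b, other}
P = [("//", "a"), ("/", "b"), ("/", "a"), ("/", "a"), ("/", "b")]
states = eager_dfa_states(P, ["a", "b", "other"])
print(len(states))  # 6, i.e. 1 + |P|
```

The six reachable subsets are {0}, {0,1}, {0,2}, {0,1,3}, {0,1,4} and {0,2,5}, which mirrors the failure-function structure of Knuth-Morris-Pratt string matching.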
The Size of the DFA
  //a/*/*/*/b
  [Figure: the NFA for this expression (a, *, *, *, b with a * self-loop on the initial state) and a fragment of the DFA, drawn without back edges, branching on a, b and [other]]
  Size of DFA = exponential in the number of *’s (not a real concern)
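The growth in the number of wildcards can be observed directly by building the DFA for //a/*/.../*/b with m wildcards. This is a sketch under the same assumptions as a plain subset construction over the alphabet {a, b, other}; the helper names `dfa_size` and `wildcard_expr` are hypothetical.

```python
def dfa_size(steps, alphabet):
    """Number of reachable DFA states for one linear XPath [(axis, tag), ...]."""
    n = len(steps)
    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        state = todo.pop()
        for symbol in alphabet:
            nxt = set()
            for i in state:
                if i == n:                         # accepting state
                    continue
                axis, tag = steps[i]
                if axis == "//":                   # self-loop on any tag
                    nxt.add(i)
                if tag in ("*", symbol):           # advance one step
                    nxt.add(i + 1)
            nxt = frozenset(nxt)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)


def wildcard_expr(m):
    """Steps of //a/*/.../*/b with m wildcards between a and b."""
    return [("//", "a")] + [("/", "*")] * m + [("/", "b")]


sizes = [dfa_size(wildcard_expr(m), ["a", "b", "other"]) for m in range(4)]
print(sizes)  # [3, 6, 12, 24]: the size doubles with every extra *
```

Intuitively, the DFA has to remember at which of the last m+1 positions an a occurred, so each additional wildcard doubles the number of subsets it must distinguish.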
The Size of the DFA
  Theorem [GMOS’02]: the number of states in the DFA for one linear XPath expression P is at most
    $k + |P|\,k\,s^{m}$
  where
    k = number of //
    s = size of the alphabet (number of tags)
    m = max number of * between two consecutive //
Size of DFA: Multiple Expressions
  //section/table/footnote
  //table/footnote
  //section/figure/footnote
  .....
  //abstract/footnote/table
  DFA = trie, has a linear number of states [Aho&Corasick]
Size of DFA: Multiple Expressions
  //section//footnote
  //table//footnote
  //figure//footnote
  .....
  //abstract//footnote
  100 expressions: … states !!
  There is a theorem here too, but it’s not useful…
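The blow-up for expressions of the form //x//footnote can be reproduced on a small scale. The sketch below builds the eager DFA for n such expressions by subset construction; the tag names `t0`, `t1`, ... and the helper names are made up for illustration, and the alphabet is assumed to be the n tags plus `footnote` and `other`.

```python
def eager_dfa_size(queries, alphabet):
    """Reachable DFA states for a set of linear XPaths (lists of (axis, tag))."""

    def move(state, symbol):
        out = set()
        for q, i in state:
            steps = queries[q]
            if i == len(steps):          # accepting state: no outgoing edges
                continue
            axis, tag = steps[i]
            if axis == "//":             # self-loop on any tag
                out.add((q, i))
            if tag in ("*", symbol):     # advance this expression by one step
                out.add((q, i + 1))
        return frozenset(out)

    start = frozenset((q, 0) for q in range(len(queries)))
    seen, todo = {start}, [start]
    while todo:
        state = todo.pop()
        for symbol in alphabet:
            nxt = move(state, symbol)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)


def footnote_workload(n):
    """n expressions //t0//footnote ... //t(n-1)//footnote and their alphabet."""
    tags = [f"t{j}" for j in range(n)]
    queries = [[("//", t), ("//", "footnote")] for t in tags]
    return queries, tags + ["footnote", "other"]


sizes = [eager_dfa_size(*footnote_workload(n)) for n in range(1, 7)]
print(sizes)  # [3, 7, 15, 31, 63, 127]
```

In this toy construction the eager DFA must remember which subset of the n tags has been seen on the current path, so the state count roughly doubles with every added expression.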
Solution: Compute the DFA Lazily
  Also used in text searching
  But will it work for 10^6 XPath expressions? YES!
  For XPath it is provably effective, for two reasons:
    –XML data is not very deep
    –The nesting structure in XML data tends to be predictable
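A lazy DFA can be sketched as an NFA whose subset construction is performed on demand: a DFA transition is computed the first time the input actually needs it and is cached afterwards. The code below is a minimal illustration for linear expressions only, not the XML Toolkit implementation; the class name `LazyDFA` and the sample document are assumptions made for this example.

```python
import re
import xml.sax


class LazyDFA(xml.sax.ContentHandler):
    """Hypothetical toy: DFA states are materialized only when the data reaches them."""

    def __init__(self, exprs):
        super().__init__()
        self.exprs = exprs
        self.queries = [re.findall(r"(//|/)([^/]+)", e) for e in exprs]
        start = frozenset((q, 0) for q in range(len(exprs)))
        self.accepts = {start: set()}   # DFA state -> expressions accepted there
        self.delta = {}                 # (DFA state, tag) -> DFA state, on demand
        self.stack = [start]
        self.matched = set()

    def _expand(self, state, tag):
        """One step of subset construction, run only on a cache miss."""
        out = set()
        for q, i in state:
            steps = self.queries[q]
            if i == len(steps):
                continue
            axis, t = steps[i]
            if axis == "//":
                out.add((q, i))
            if t in ("*", tag):
                out.add((q, i + 1))
        nxt = frozenset(out)
        if nxt not in self.accepts:
            self.accepts[nxt] = {self.exprs[q] for q, i in nxt
                                 if i == len(self.queries[q])}
        return nxt

    def startElement(self, name, attrs):
        key = (self.stack[-1], name)
        if key not in self.delta:
            self.delta[key] = self._expand(*key)
        nxt = self.delta[key]           # afterwards: a single table lookup
        self.stack.append(nxt)
        self.matched |= self.accepts[nxt]

    def endElement(self, name):
        self.stack.pop()


exprs = ["//section//footnote", "//table//footnote",
         "//figure//footnote", "//abstract//footnote"]
doc = (b"<document><section><table><footnote/></table>"
       b"<figure><footnote/></figure></section>"
       b"<section><table/></section></document>")
dfa = LazyDFA(exprs)
xml.sax.parseString(doc, dfa)
print(len(dfa.accepts), sorted(dfa.matched))
```

On this document only 6 DFA states are ever created, because the data never exhibits most tag combinations; the second `section` is processed entirely from the cache, which corresponds to the high-throughput phase after warm-up.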
Lazy DFA and “Simple” DTDs
  Document Type Definition (DTD)
    –Part of the XML standard
    –Will be replaced by XML Schema
  Example DTD: …
  Definition: a DTD is simple if all its cycles are self-loops
Lazy DFA and “Simple” DTDs
  [Figure: simple DTD graph over the elements document, section, table, figure, footnote, abstract]
  XPath expressions:
    //section//footnote
    //table//footnote
    //figure//footnote
    //abstract//footnote
  Eager DFA “remembers” 2^4 sets
  Lazy DFA “remembers” only 4 sets
Lazy DFA and “Simple” DTDs
  Theorem [GMOS’02]: if the XML data has a “simple” DTD, then the lazy DFA has at most
    $1 + D\,(1+n)^{d}$
  states, where
    n = max depth of the XPath expressions
    D = size of the “unfolded” DTD
    d = max depth of self-loops in the DTD
  Fact of life: “data-like” XML has simple DTDs
Lazy DFA and Data Guides
  “Non-simple” DTDs are useless for the lazy DFA
    “Everything may contain everything”
  Fact of life: “text”-like XML has non-simple DTDs
Lazy DFA and Data Guides
  Definition [Goldman&Widom’97]: the data guide for an XML data instance is the trie of all its root-to-leaf paths
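The definition translates almost directly into code. The following sketch builds the data guide of a small, hypothetical document as a nested dictionary (tag to sub-guide) and counts its nodes; the function names are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def add_paths(elem, node):
    """Insert the label path of `elem` and of its descendants into the trie `node`."""
    child_node = node.setdefault(elem.tag, {})
    for child in elem:
        add_paths(child, child_node)


def count_nodes(guide):
    """Number of trie nodes, i.e. distinct root-to-node label paths."""
    return sum(1 + count_nodes(sub) for sub in guide.values())


# Hypothetical instance: 9 elements, but only 7 distinct label paths.
doc = ("<document><section><table/>"
       "<section><table/><figure/></section></section>"
       "<section><figure/><table/></section></document>")
root = ET.fromstring(doc)

guide = {}
add_paths(root, guide)
n_elements = sum(1 for _ in root.iter())
print(n_elements, count_nodes(guide))  # 9 7
```

Repeated structure in the data collapses into shared trie nodes, which is why the data guide of regular, data-like XML tends to stay small even when the document itself is large.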
Lazy DFA and Data Guides
  [Figure: an XML data tree (document, section, table, figure elements) next to its data guide]
  Fact of life: real XML data has a “small” data guide [Liefke&S.’00]
Lazy DFA and Data Guides
  Theorem [GMOS’02]: if the XML data has a data guide with G nodes, then the number of states in the lazy DFA is at most
    $1 + G$
  where G = number of nodes in the data guide
Number of Lazy DFA States - SYNTHETIC Data
  [Chart: number of lazy DFA states for 10^3, 10^4 and 10^5 XPath expressions on the datasets simple, prov, ebBPSS, protein, nasa, treebank; annotation: 4000 states]
Number of Lazy DFA States - REAL Data
  [Chart: number of lazy DFA states for 10^3, 10^4 and 10^5 XPath expressions on the datasets protein, nasa, treebank; legible annotation: 95 states]
Number of States in the Lazy DFA
                       | Real XML data               | Synthetic XML data
    Data-style DTD     | Theorem: lazy DFA is small  | Theorem: lazy DFA is small
    Document-style DTD | Theorem: lazy DFA is small  | Fact: lazy DFA is HUGE
Lazy DFA in the XML Toolkit
  The XML toolkit uses a lazy DFA to process XML streams
  “Warm-up” phase, followed by very high throughput
Throughput for 10^3, 10^4, 10^5, 10^6 XPath expressions [prob(*)=10%, prob(//)=10%]
  [Chart: throughput in MB/s (log scale, 0.001MB/s to 100MB/s) against total input size (5MB to 25MB); series: parser, lazyDFA and xfilter, each for 10^3, 10^4, 10^5 and 10^6 XPath expressions]
  Parser: 10MB/s
  Lazy DFA: 5.4MB/s
Summary of Lazy DFA and XML
  Linear XPath expressions:
    –Process with one lazy DFA
  XPath expressions with branches:
    –Process with Deterministic Pushdown Automata (ongoing work at the University of Washington)
Stream IndeX (SIX)
  Main observation: parsing is the major bottleneck
  Definition: the SIX of an XML document is a binary table of (begin, end) offsets
  Idea: use the SIX to reduce the amount of parsing
  Works well with the (lazy) DFA
  Implemented in the XML toolkit
Stream IndeX (SIX)
  [Figure: an XML bibliography (Addison-Wesley; Serge Abiteboul, Rick Hull, Victor Vianu; Foundations of Databases; 1995; Freeman; Jeffrey D. Ullman; Principles of Database and Knowledge Base Systems; 1998) next to its SIX, a table with columns beginOffset and endOffset and one row per element (bib, book, publisher, author, author, ...)]
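A table of (begin, end) byte offsets like the SIX can be computed in one pass with an event-based parser. The sketch below uses Python's `xml.parsers.expat` and its `CurrentByteIndex` attribute; it is only an illustration of the idea, not the SIX format of the XML toolkit. The function name `build_six` is hypothetical, tag names are stored alongside the offsets for readability, and the search for the closing `>` assumes that empty-element tags carry no attribute values containing `>`.

```python
import xml.parsers.expat


def build_six(data: bytes):
    """Return [(tag, begin, end)] byte offsets, one entry per element, in document order."""
    parser = xml.parsers.expat.ParserCreate()
    six, open_elems = [], []

    def start(name, attrs):
        # CurrentByteIndex is the offset of the '<' that opened this start tag.
        open_elems.append((name, parser.CurrentByteIndex, len(six)))
        six.append(None)                      # reserve a slot to keep document order

    def end(name):
        tag, begin, slot = open_elems.pop()
        # CurrentByteIndex is the offset of the end tag; the element ends after its '>'.
        close = data.index(b">", parser.CurrentByteIndex) + 1
        six[slot] = (tag, begin, close)

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(data, True)
    return six


data = (b"<bib><book><publisher>AW</publisher>"
        b"<author>Hull</author></book></bib>")
six = build_six(data)
for tag, begin, end in six:
    print(tag, begin, end, data[begin:end])
```

With such a table, a consumer that knows no expression can match inside an element can jump straight to that element's end offset instead of parsing its content.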
Stream IndeX (SIX)
  The SIX stream is about 6% of the data stream, and can be made MUCH smaller
  [Figure: SIX and XML delivered together (e.g. DIME)]
Stream Views
  Idea:
    Given a workload of XPath expressions with branches
    Precompute some views for each document to speed up the entire workload
  The views header has to be small
Stream Views
  Queries (at the servers):
    /a[b=11][c=22][e=23]
    /a[b=33][d=44][e=55]
    /a[c=66][f=77]
    /a[f=34][g=56]
    /a[b=88][c=99]
    /a[c=99][e=00]
  3 views:
    /a/c
    /a/e
    /a/f
  Short-circuit evaluation!
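One way to read the short-circuit idea is sketched below. This is an interpretation, not the method described in the talk: it assumes that each document arrives with a header holding the precomputed values of the views /a/c, /a/e and /a/f, and that a query is represented as a mapping from predicate path to required constant. The names `may_match`, `QUERIES` and `header` are hypothetical.

```python
# Each query /a[p1=v1][p2=v2]... is written as {path: required value}.
QUERIES = {
    "/a[b=11][c=22][e=23]": {"/a/b": "11", "/a/c": "22", "/a/e": "23"},
    "/a[b=33][d=44][e=55]": {"/a/b": "33", "/a/d": "44", "/a/e": "55"},
    "/a[c=66][f=77]":       {"/a/c": "66", "/a/f": "77"},
    "/a[f=34][g=56]":       {"/a/f": "34", "/a/g": "56"},
    "/a[b=88][c=99]":       {"/a/b": "88", "/a/c": "99"},
    "/a[c=99][e=00]":       {"/a/c": "99", "/a/e": "00"},
}


def may_match(preds, header):
    """False: the query certainly fails. True: the document must still be examined."""
    for path, value in preds.items():
        # Only paths covered by a view can be decided from the header alone.
        if path in header and value not in header[path]:
            return False
    return True


# Assumed header of one document: the set of values found under each view.
header = {"/a/c": {"99"}, "/a/e": {"00"}, "/a/f": set()}

survivors = [q for q, preds in QUERIES.items() if may_match(preds, header)]
print(survivors)
```

In this reading, every query in the workload tests at least one of the three views, so most queries are rejected from the header alone and the document body is parsed only for the few that survive.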
Stream Views
  Views header (binary offsets), followed by the XML
  100x speedup on a hit
  Choosing the views: difficult problem
Summary
  XML stream processing problem:
    –Fixed XPath queries, transient XML data
    –Large number of queries
    –High data throughput
  Relationship to text processing techniques:
    –Still regular expressions
    –Still automata and lazy DFAs
    –Different scale
  Techniques:
    –Lazy DFAs work for reasons specific to XML
    –Stream indexes and views: ongoing research
Future Work
  Handle branches in XPath expressions
  View selection for a given workload
  Network configuration
Thank you!
  Links: xmltk.sourceforge.net