From Searching Text to Querying XML Streams
Dan Suciu, Univ. of Washington
www.cs.washington.edu/homes/suciu


1 From Searching Text to Querying XML Streams
Dan Suciu, Univ. of Washington
www.cs.washington.edu/homes/suciu

2 About Me
Born 1957, Romania
BS: Bucharest, PhD: University of Pennsylvania
Now: University of Washington (Seattle)
My work is on semistructured data
Book: Data on the Web: From Relations to Semistructured Data and XML
Past/present projects:
– XML-QL = precursor of XQuery
– XMill = the XML compressor
– XML Toolkit

3 Motivation
Text databases
– Studied over the past 15 years
– Traditional client/server model
– Struggled with lack of standard text syntax
Recently, new standard: XML
– Traditional client/server: in today’s DBMS
– New applications: stream processing
This talk: processing streaming XML data
– My motivation: work on the XML Toolkit project

4 Outline
Background
The XML stream processing problem
Basic XML processing with automata
Adapting automata to XML
Stream indexes
Conclusions

5 Background: Relational Databases
Structured, stored in tables
Schema separate from data
Queries: precise, refer to schema and data (SQL)

BOOKS:
ISBN        Title                     Year  Publisher
0201537710  Foundations of Databases  1995  AW
155860622X  Data on the Web           1999  MK

AUTHOR:
AID  Name       Country
44   Abiteboul  FR
06   Buneman    UK
62   Hull       USA
12   Suciu      USA
29   Vianu      USA

WROTE:
ISBN        AID
0201537710  44
0201537710  62
0201537710  29
155860622X  44
155860622X  06
155860622X  12

Hard to publish, easy to query precisely

6 Background: Text Databases
Unstructured, stored in documents
No schema, only data
Queries: imprecise, refer to data only (keywords)
Example documents:
– Foundations of Databases, Abiteboul (FR), Hull (USA), Vianu (USA), Addison Wesley, 1995
– Data on the Web, Abiteboul (FR), Buneman (UK), Suciu (USA), Morgan Kaufmann, 1999
Easy to publish, hard to query precisely

7 Background: XML Data
Semistructured
Schema and data are together: self-describing
Queries: precise, refer to schema and data (SQL)
[XML example, tags not preserved: Foundations… Abiteboul FR Hull USA Vianu USA Addison Wesley 1995 …]
XML: Easier to publish, easy to query precisely

8 Background: XML Data
[Figure: XML tree rooted at bib with book and paper children; node labels include title, author (name, country), publisher and journal; sample values include Data on the Web, Abiteboul FR, Buneman UK, Addison Wesley]
Data model = tree

9 Background: XML Data
Querying with XPath (and XQuery)
This talk: XPath queries restricted to: tag, /, //, *, [ ], path=“constant”

10 Background: XPath in One Slide
/bib/book[author/name=“Abiteboul”]
/bib/book[year=“1995” and author[name=“Abiteboul” and country=“FR”]]
/bib/book/author/name
/bib/book//name/*/zip
Constructs: tag, /, //, *, [ ]
Navigate partially known structure
Conjunctive queries a la SQL
This is precisely the “region algebra”. E.g. use proximal nodes [Navarro&Baeza-Yates’97]

11 Outline
Background
The XML stream processing problem
Basic XML processing with automata
Adapting automata to XML
Stream indexes
Conclusions

12 Main Application: XML Packet Routing
Selective Dissemination of Information [Altinel&Franklin’00, Chan et al.’02]
XML content routing [Snoeren et al.’01]
SOAP message routing in application servers

13 XML Packet Routing
[Figure: routing diagram; only repeated “value” labels survive]

14 [Figure: an input XML stream is matched against a set of XPath expressions and split into output XML streams]
XPath expressions:
/bib/book/publisher=“MK”
/bib/book[category=“recent”]/title=“Web”
/bib/book//address//*/zip=“123”
/bib/book//address//*=“Galaxy”
/bib/book/category=“recent”
/bib/book/address=“123”
/bib/book/address/field=“567”
......

15 The XML Stream Processing Problem
Given:
– A set of XPath expressions
– An incoming stream of XML documents
Decide:
– For each document, which expressions it matches
Hard:
– Large number of XPath expressions, e.g. $10^3$ to $10^6$
– Streaming XML data, high throughput, e.g. 5MB/s
Easy:
– Shallow XML data, e.g. depth=20
– Short XPath expressions

16 The Approaches
Basic techniques
NFA plus optimizations:
– XFilter/YFilter [Altinel&Franklin’00]
– XTrie [Chan et al.’02]
DFA:
– XML Toolkit
Beyond the obvious
Stream indexes (XML Toolkit)
Stream views

17 Outline
Background
The XML stream processing problem
Basic XML processing with automata
Adapting automata to XML
Stream indexes
Conclusions

18 From XPath to NFA
/catalog/product[category="tools"][*/price = 200]/quantity
//price
[Figure: NFA with transitions labeled catalog, product, category, "tools", *, price, 200, quantity, plus a branch for //price]
Extra processing needed to combine branches (not in this talk)

19 Basic NFA Evaluation
[Figure: a large set of XPath expressions is compiled into one NFA. SAX events drive the NFA, and a STACK holds the sets of current states, e.g. {3,66,102,4534,...}, {2,3,543,43,254}, {1,55,99,...}]
Sample XPath expressions from the figure:
/bib/book/publisher=“MK”
/bib/book[category=“recent”]/title
/bib/book//address//*/zip=“123”
/bib/book//address//*=“Galaxy”
/bib/book/category=“recent”
/bib/book/address=“123”
/bib/book/address/field=“567”
...

20 Basic NFA Evaluation
Properties:
– Space = linear
– Throughput = decreases linearly
Systems: XFilter [Altinel&Franklin’00], YFilter, XTrie [Chan et al.’02]
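
The stack-based NFA evaluation on the two slides above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the XFilter, YFilter or XTrie implementation: it handles only linear expressions built from tags, /, // and *, it ignores predicates and value tests, and the names parse_xpath, NFAFilter and match_document are made up for illustration.

```python
import re
import xml.sax


def parse_xpath(xpath):
    """Split a linear XPath such as /a//b/* into (axis, tag) steps."""
    return re.findall(r"(//|/)([^/]+)", xpath)


class NFAFilter(xml.sax.ContentHandler):
    """Runs all expressions as one NFA; a state is (query index, steps matched)."""

    def __init__(self, xpaths):
        super().__init__()
        self.queries = [parse_xpath(p) for p in xpaths]
        # Stack of active state sets, one entry per open element.
        self.stack = [{(q, 0) for q in range(len(xpaths))}]
        self.matched = set()

    def startElement(self, name, attrs):
        active = set()
        for q, i in self.stack[-1]:
            steps = self.queries[q]
            if i == len(steps):
                continue  # accepting state: no outgoing transitions here
            axis, tag = steps[i]
            if axis == "//":
                active.add((q, i))  # descendant axis: the state survives
            if tag in ("*", name):
                active.add((q, i + 1))
                if i + 1 == len(steps):
                    self.matched.add(q)
        self.stack.append(active)

    def endElement(self, name):
        self.stack.pop()  # back to the state set of the parent element


def match_document(xpaths, xml_bytes):
    """Return the expressions matched by at least one element of the document."""
    handler = NFAFilter(xpaths)
    xml.sax.parseString(xml_bytes, handler)
    return [xpaths[q] for q in sorted(handler.matched)]


doc = (b"<bib><book><publisher>MK</publisher>"
       b"<author><name>X</name></author></book></bib>")
queries = ["/bib/book/publisher", "//name", "/bib/paper",
           "/bib/*/author/name", "//book//zip"]
print(match_document(queries, doc))
```

Every start tag loops over all active states, so the cost per SAX event grows with the number of expressions. That is the "throughput decreases linearly" property listed above.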

21 Basic DFA Evaluation
[Figure: the same set of XPath expressions is compiled into DFAs. SAX events drive the automaton, and a STACK holds a single current state per level, e.g. 399, 552, 1]

22 Basic DFA Evaluation
Properties:
– Throughput = constant!
– Space = GOOD QUESTION
System: XML Toolkit [University of Washington] http://xmltk.sourceforge.net

23 Outline
Background
The XML stream processing problem
Basic XML processing with automata
Adapting automata to XML
Stream indexes
Conclusions

24 The Size of the DFA
//a/b/a/a/b
[Figure: NFA with states 0 to 5, a * self-loop on state 0 and transitions a, b, a, a, b; DFA with states 0, 01, 02, 013, 014, 025 and [other] edges]
DFA for //P has 1+|P| states [KMP]

25 The Size of the DFA
//a/*/*/*/b
[Figure: NFA with states 0 to 5; DFA fragment (without back edges) whose states are sets of NFA states, e.g. 0, 01, 012, 0123, 01234, 02, 023, 0234, 0345]
Size of DFA = exponential in *’s (not a real concern)

26 The Size of the DFA
Theorem [GMOS’02] The number of states in the DFA for one linear XPath expression P is at most $k + |P|\,k\,s^{m}$, where:
k = number of //
s = size of the alphabet (number of tags)
m = max number of * between two consecutive //

27 Size of DFA: Multiple Expressions
//section/table/footnote
//table/footnote
//section/figure/footnote
.....
//abstract/footnote/table
DFA = Trie has linear number of states [Aho&Corasick]

28 Size of DFA: Multiple Expressions
//section//footnote
//table//footnote
//figure//footnote
.....
//abstract//footnote
100 expressions: $2^{100}$ states!!
There is a theorem here too, but it’s not useful…
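
The growth in DFA size discussed on the slides above can be reproduced with an eager subset construction. The sketch below makes several assumptions: it supports only linear expressions over tags, /, // and *, accepting NFA states have no outgoing transitions, and the alphabet is passed in explicitly. The function name eager_dfa_size is made up for illustration, and exact counts depend on these modeling choices.

```python
import re


def parse_xpath(xpath):
    """Split a linear XPath such as /a//b/* into (axis, tag) steps."""
    return re.findall(r"(//|/)([^/]+)", xpath)


def eager_dfa_size(xpaths, alphabet):
    """Count the DFA states reachable by subset construction over all expressions."""
    queries = [parse_xpath(p) for p in xpaths]

    def move(state, tag):
        nxt = set()
        for q, i in state:
            steps = queries[q]
            if i == len(steps):
                continue  # accepting NFA state, nothing to follow
            axis, t = steps[i]
            if axis == "//":
                nxt.add((q, i))
            if t in ("*", tag):
                nxt.add((q, i + 1))
        return frozenset(nxt)

    start = frozenset((q, 0) for q in range(len(queries)))
    seen, todo = {start}, [start]
    while todo:
        state = todo.pop()
        for tag in alphabet:
            target = move(state, tag)
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return len(seen)


tags = ["a", "b", "other"]
print(eager_dfa_size(["//a/b/a/a/b"], tags))  # 6, i.e. 1 + |P|
print(eager_dfa_size(["//a/*/*/*/b"], tags))  # 24, doubling with each extra *

names = ["section", "table", "figure", "abstract"]
for n in range(1, 5):
    workload = [f"//{t}//footnote" for t in names[:n]]
    print(n, eager_dfa_size(workload, names + ["footnote"]))  # 3, 7, 15, 31
```

In this model a workload of n expressions of the form //t//footnote yields 2^(n+1) - 1 states, because the automaton has to remember which subset of the n tags has been seen so far. That is the exponential behavior the last slide warns about.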

29 Solution: Compute the DFA Lazily
Also used in text searching
But will it work for $10^6$ XPath expressions? YES!
For XPath it is provably effective, for two reasons:
– XML data is not very deep
– The nesting structure in XML data tends to be predictable

30 Lazy DFA and “Simple” DTDs
Document Type Definition (DTD)
– Part of the XML standard
– Will be replaced by XML Schema
Example DTD: ..........
Definition: A DTD is simple if all cycles are loops

31 Lazy DFA and “Simple” DTDs
Simple DTD: [Figure: DTD graph over document, section, table, figure, footnote, abstract]
XPath expressions:
//section//footnote
//table//footnote
//figure//footnote
//abstract//footnote
Eager DFA “remembers” $2^4$ sets
Lazy DFA “remembers” only 4 sets
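
A lazy DFA for the four expressions above can be sketched as follows: DFA states are sets of NFA states, and a transition is computed only the first time the data actually asks for it. This is an illustrative sketch, not the XML Toolkit code. The class name LazyDFA, the sample document and the restriction to linear expressions without predicates are all assumptions.

```python
import re
import xml.sax


def parse_xpath(xpath):
    """Split a linear XPath such as /a//b/* into (axis, tag) steps."""
    return re.findall(r"(//|/)([^/]+)", xpath)


class LazyDFA(xml.sax.ContentHandler):
    def __init__(self, xpaths):
        super().__init__()
        self.queries = [parse_xpath(p) for p in xpaths]
        start = frozenset((q, 0) for q in range(len(xpaths)))
        self.states = {start}  # DFA states materialized so far
        self.table = {}        # (state, tag) -> state, filled on demand
        self.stack = [start]
        self.matched = set()

    def _expand(self, state, tag):
        """Compute one DFA transition from the underlying NFA."""
        nxt = set()
        for q, i in state:
            steps = self.queries[q]
            if i == len(steps):
                continue
            axis, t = steps[i]
            if axis == "//":
                nxt.add((q, i))
            if t in ("*", tag):
                nxt.add((q, i + 1))
        return frozenset(nxt)

    def startElement(self, name, attrs):
        key = (self.stack[-1], name)
        if key not in self.table:  # warm-up: pay the NFA cost once
            self.table[key] = self._expand(*key)
            self.states.add(self.table[key])
        state = self.table[key]    # afterwards: a single lookup per event
        self.matched.update(q for q, i in state if i == len(self.queries[q]))
        self.stack.append(state)

    def endElement(self, name):
        self.stack.pop()


xpaths = ["//section//footnote", "//table//footnote",
          "//figure//footnote", "//abstract//footnote"]
doc = (b"<document><section><table><footnote/></table>"
       b"<figure><footnote/></figure></section>"
       b"<section><table><footnote/></table></section></document>")
dfa = LazyDFA(xpaths)
xml.sax.parseString(doc, dfa)
print(len(dfa.states), sorted(dfa.matched))
```

On this sample document the second section reuses the transitions built for the first one, so only the tag nestings that occur in the data ever become DFA states. Combinations such as table followed by section are never materialized.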

32 Lazy DFA and “Simple” DTDs
Theorem [GMOS’02] If the XML data has a “simple” DTD, then the lazy DFA has at most $1 + D(1+n)^{d}$ states, where:
n = max depth of XPath expressions
D = size of the “unfolded” DTD
d = max depth of self-loops in the DTD
Fact of life: “Data-like” XML has simple DTDs

33 Lazy DFA and Data Guides
“Non-simple” DTDs are useless for the lazy DFA
“Everything may contain everything”
Fact of life: “Text”-like XML has non-simple DTDs

34 Lazy DFA and Data Guides
Definition [Goldman&Widom’97] The data guide for an XML data instance is the trie of all its root-to-leaf paths

35 Lazy DFA and Data Guides
[Figure: an XML data tree over document, section, table and figure, shown next to its data guide]
Fact of life: real XML data has “small” data guide [Liefke&S.’00]

36 Lazy DFA and Data Guides
Theorem [GMOS’02] If the XML data has a data guide with G nodes, then the number of states in the lazy DFA is at most $1 + G$
G = number of nodes in the data guide
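
To make the quantity G concrete, the data guide of a small document can be computed as a trie of tag paths. The sketch below uses nested dictionaries as the trie. The function names data_guide and count_nodes and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def data_guide(xml_text):
    """Build the trie of all root-to-leaf tag paths as nested dicts."""
    guide = {}

    def add(elem, node):
        child = node.setdefault(elem.tag, {})  # one trie node per distinct path
        for sub in elem:
            add(sub, child)

    add(ET.fromstring(xml_text), guide)
    return guide


def count_nodes(guide):
    """G = number of nodes in the data guide."""
    return sum(1 + count_nodes(children) for children in guide.values())


doc = ("<document><section><table/><table/><figure/></section>"
       "<section><table/><section><figure/></section></section></document>")
guide = data_guide(doc)
print(guide)
print(count_nodes(guide))  # 6 trie nodes for a document with 9 elements
```

Repeated siblings and repeated sections collapse into a single trie node, which is why the data guide can stay small even when the document is large.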

37 Number of Lazy DFA States - SYNTHETIC Data
[Chart: number of states on a log scale from 1 to 100000; data sets simple, prov, ebBPSS, protein, nasa, treebank; series for $10^3$, $10^4$ and $10^5$ XPath expressions; annotation: 4000 states]

38 Number of Lazy DFA States - REAL Data
[Chart: number of states on a log scale from 1 to 100000; data sets protein, nasa, treebank; series for $10^3$, $10^4$ and $10^5$ XPath expressions; annotations: 95 states, 40000 states, G = 350000]

39 Number of States in the lazy DFA
                    Real XML data                Synthetic XML data
Data-style DTD      Theorem: Lazy DFA is small   Theorem: Lazy DFA is small
Document-style DTD  Theorem: Lazy DFA is small   Fact: Lazy DFA is HUGE

40 Lazy DFA in the XML Toolkit
The XML toolkit uses a lazy DFA to process XML streams
“Warm-up” phase, followed by very high throughput

41 Throughput for $10^3$, $10^4$, $10^5$, $10^6$ XPath expressions [ prob(*)=10%, prob(//)=10% ]
[Chart: throughput on a log scale from 0.0001MB/s to 100MB/s against total input size from 5MB to 25MB; series: parser, lazyDFA and xfilter, each of the latter two for $10^3$ to $10^6$ XPath expressions]
Parser: 10MB/s
Lazy DFA: 5.4MB/s

42 Summary of Lazy DFA and XML
Linear XPath expressions:
– Process with one lazy DFA
XPath expressions with branches:
– Process with Deterministic Pushdown Automata (ongoing work at the University of Washington)

43 Outline
Background
The XML stream processing problem
Basic XML processing with automata
Adapting automata to XML
Stream indexes
Conclusions

44 Stream IndeX (SIX)
Main observation: Parsing is major bottleneck
Definition: The SIX of an XML document is a binary table of (begin, end) offsets
Idea: Use SIX to reduce amount of parsing
Works well with (lazy) DFA
Implemented in the XML toolkit

45 Stream IndeX (SIX)
XML [tags not preserved]: Addison-Wesley, Serge Abiteboul, Rick Hull, Victor Vianu, Foundations of Databases, 1995; Freeman, Jeffrey D. Ullman, Principles of Database and Knowledge Base Systems, 1998
SIX [table with columns beginOffset and endOffset; the digits of the two columns run together in the extraction]:
bib 01490124
book 3409023
publisher 12423
author 426879
author 978...

46 Stream IndeX (SIX)
[Figure: an XML stream alongside a SIX stream (e.g. DIME) of offset pairs such as 0 205, 30 66, 72 188, 90 110, 95 98]
The SIX stream is about 6% of the data stream
And can be made MUCH smaller
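
A table of (begin, end) byte offsets like the SIX can be sketched with Python's expat bindings, whose CurrentByteIndex attribute reports where the current parse event starts. The sketch makes its own choices that the slides do not confirm: "end" is taken to be the offset at which the element's end tag begins, the table is kept as Python tuples rather than a binary encoding, and the name build_six is made up for illustration.

```python
import xml.parsers.expat


def build_six(data):
    """Return (tag, begin, end) rows in document order for an XML byte string."""
    parser = xml.parsers.expat.ParserCreate()
    table, open_rows = [], []

    def start(name, attrs):
        row = [name, parser.CurrentByteIndex, None]  # offset of the start tag
        table.append(row)
        open_rows.append(row)

    def end(name):
        open_rows.pop()[2] = parser.CurrentByteIndex  # offset of the end tag

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(data, True)
    return [tuple(row) for row in table]


doc = b"<a><b>x</b><c>y</c></a>"
six = build_six(doc)
print(six)  # [('a', 0, 19), ('b', 3, 7), ('c', 11, 15)]
```

A consumer that decides at offset 3 that no query can match inside b can jump straight to offset 7 and resume there, skipping the bytes in between without parsing them. This is one way to read "use SIX to reduce amount of parsing".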

48 Stream Views
Idea:
– Given a workload of XPath expressions with branches
– Precompute some views for each document to speed up the entire workload
Views -> header (has to be small)

49 Stream Views
Queries (at the servers):
/a[b=11][c=22][e=23]
/a[b=33][d=44][e=55]
/a[c=66][f=77]
/a[f=34][g=56]
/a[b=88][c=99]
/a[c=99][e=00]
3 Views:
/a/c
/a/e
/a/f
Short circuit evaluation!

50 Stream Views
Views -> header (binary offsets)
[Figure: XML packet preceded by a header of offsets such as 0, 30, 72]
100x speedup on a hit
Choosing the views: Difficult problem

51 Outline
Background
The XML stream processing problem
Basic XML processing with automata
Adapting automata to XML
Stream indexes
Conclusions

52 Summary
XML stream processing problem:
– Fixed XPath queries, transient XML data
– Large number of queries
– High data throughput
Relationship to text processing techniques:
– Still regular expressions
– Still automata and lazy DFAs
– Different scale
Techniques:
– Lazy DFAs work for reasons specific to XML
– Stream indexes and views: ongoing research

53 Future Work
Handle branches in XPath expressions
View selection for a given workload
Network configuration

54 Thank you!
Links:
www.cs.washington.edu/homes/suciu
www.cs.washington.edu/homes/suciu/XMLTK
xmltk.sourceforge.net

