
Dan Suciu, XML Toolkit. XMLTK: An XML Toolkit for Scalable XML Stream Processing. I. Avila-Campillo, T.J. Green, A. Gupta, M. Onizuka, D. Raven, D. Suciu.


1 XMLTK: An XML Toolkit for Scalable XML Stream Processing
I. Avila-Campillo, T.J. Green, A. Gupta, M. Onizuka, D. Raven, D. Suciu

2 Motivation
Lots of data sits in large text files
–ad hoc data formats
“Queried” with Unix command line tools
–grep, sort, tail, etc.
Would be nice to XML-ize it...
...but then the Unix command line tools won’t work any more.

3 Example
In the old Unix world…
Text file (columns: score, decision, paperID, title):
6 accept P054 “Theory of XML parsing”
3 reject P021 “Experience with an XML optimizer”
7 accept P069 “Towards a unified theory of data models”
...
Find the top ten rejected papers (in score order):
grep "reject" papers.txt | sort -n | tail -n 10

4 Example (cont’d)
In the new XML world…
[XML markup lost in extraction; the same records, now as XML elements:]
6 accept P054 Theory of XML parsing
3 reject P021 Experience with an XML optimizer
.....
… can’t use those tools anymore

5 Example (cont’d)
Doing it with the XML Toolkit:
xsort -c /submissions -e paper[decision/text()="reject"] -k score/text() papers.xml | xtail -c /submissions -e paper -n 10
Finds the top ten rejected <paper>s, in order

6 Goals of the XML Toolkit
Simple, scalable tools for XML processing
Provides a service: there are people who need this
Provides a research platform: for XML stream processing

7 Outline
The tools
The XPath processing engine
Conclusions

8 The Tools
Current tools: xsort, xagg, xnest, xflatten, xdelete, xpair, xhead, xtail, file2xml, xmill
Will talk only about xsort
May look like plenty, but actually still incomplete...

9 XSort: Definition
General form:
xsort (-c XPathExpr (-e XPathExpr (-k XPathExpr)*)*)*
-c = the context, i.e. where to sort
-e = the item, i.e. what to sort
-k = the key, i.e. what to sort on
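The general form nests keys inside items and items inside contexts. A minimal Python sketch of how a flat argument list could be grouped that way is given here; the class names and `parse_xsort_args` are hypothetical illustrations, not part of the toolkit.

```python
from dataclasses import dataclass, field


@dataclass
class Item:                      # one -e clause (hypothetical name)
    path: str
    keys: list = field(default_factory=list)


@dataclass
class Context:                   # one -c clause (hypothetical name)
    path: str
    items: list = field(default_factory=list)


def parse_xsort_args(argv):
    """Group flag/expression pairs into contexts -> items -> keys."""
    contexts = []
    it = iter(argv)
    for flag in it:
        try:
            expr = next(it)
        except StopIteration:
            raise ValueError(f"{flag} needs an XPath expression") from None
        if flag == "-c":
            contexts.append(Context(expr))
        elif flag == "-e":
            if not contexts:
                raise ValueError("-e must follow a -c")
            contexts[-1].items.append(Item(expr))
        elif flag == "-k":
            if not contexts or not contexts[-1].items:
                raise ValueError("-k must follow an -e")
            contexts[-1].items[-1].keys.append(expr)
        else:
            raise ValueError(f"unknown option {flag!r}")
    return contexts


spec = parse_xsort_args(
    "-c /bib -e paper -k title/text() -e book -k title/text()".split()
)
```

Each `-k` attaches to the most recent `-e`, and each `-e` to the most recent `-c`, which is exactly what the starred grouping in the general form expresses.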

10 XSort: Definition
[Figure: input tree with three contexts c holding items e1 to e9; XSort produces an output tree with the same three contexts and the items in the order e4, e1, e3, e2, e6, e7, e5, e9, e8]

11 XSort Examples
Examples illustrated on data like this (XML markup lost in extraction):
Elliotte Rusty Harold W. Scott Means XML in a Nutshell O'Reilly 2001 0-596-00058-8
Sylvain Devillers XML and XSLT Modeling for Multimedia Bitstream Manipulation. 2001 WWW Posters http://www10.org/cdrom/posters/1112.pdf db/conf/www/www2001p.html#Devillers01
.....

12 XSort: Examples
xsort -c /bib -e paper -k title/text()
Sorts the <paper>s, by <title>
The other elements (e.g. <book>s) are dropped from the output
Compare to…
xsort -c /bib -e * -k title/text()
xsort -c /bib -e paper -k title/text() -e book -k title/text()

13 XSort: Examples
xsort -c /bib -e paper/author -k lastName/text() -k firstName/text()
Sorts the <author>s, by <lastName> then <firstName>

14 XSort: Examples
xsort -c /bib -e paper -e article -e book -e *
<paper>s first, then <article>s, then <book>s, then all the rest
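The examples above suggest a simple ordering rule within one context: items are grouped by the `-e` clause they match, in clause order, and sorted by their `-k` values inside each group. The sketch below illustrates that reading on an in-memory tree with Python's `xml.etree.ElementTree`. It is not the toolkit's implementation: `xsort_children` is a hypothetical name, items are matched by tag name only, keys are child tag names rather than XPath expressions, and "the first matching clause wins" is an assumption.

```python
import xml.etree.ElementTree as ET


def xsort_children(context, clauses):
    """Reorder the children of `context` in place.

    `clauses` holds one (tag, key_tags) pair per -e option; the tag "*"
    matches any element. Children that match no clause are dropped.
    """
    ranked = []
    for pos, child in enumerate(context):
        for rank, (tag, key_tags) in enumerate(clauses):
            if tag == "*" or child.tag == tag:
                keys = tuple(child.findtext(k, default="") for k in key_tags)
                # Sort key: clause rank first, then the -k values; `pos`
                # keeps document order among items with equal keys.
                ranked.append(((rank, keys, pos), child))
                break                      # assumption: first match wins
    ranked.sort(key=lambda pair: pair[0])
    context[:] = [child for _, child in ranked]


bib = ET.fromstring(
    "<bib><book><title>B</title></book><paper><title>Z</title></paper>"
    "<note/><paper><title>A</title></paper></bib>"
)
# Roughly: xsort -c /bib -e paper -k title/text() -e book -k title/text()
xsort_children(bib, [("paper", ["title"]), ("book", ["title"])])
```

After the call, `bib` holds the two papers sorted by title, then the book; the `<note/>` element, which matches no clause, is gone.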

15 XSort: Examples
xsort -c /bib/* -e author -e title -e year -e *
Normalize all entries: <author>s first, then <title>s, then <year>s, then all the other elements
xsort -c /bib/paper -e author -e * -c /bib/book -e title -e *
In <paper>s list the <author>s first; in <book>s list the <title> first; leave other entries unchanged

16 XSort: Implementation
Sorts one context at a time, copies the rest
For each context:
–Create a “global key” for each item
–Sort items, with a two-pass, multiway merge sort
Quote from Databases 101 (news from the trenches):
–with disk blocks of 4KB and 128MB of main memory, one can sort files up to 4TB in two passes!
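A two-pass multiway merge sort can be sketched in a few lines of Python. This is a generic textbook version, not the xsort code: `two_pass_sort` and `two_pass_capacity` are hypothetical names, records are newline-terminated text lines, and memory is counted in records rather than bytes. The capacity function reproduces the quoted figure: 128MB of memory holds 32768 blocks of 4KB, so pass two can merge 32768 runs of 128MB each, which is 4TB.

```python
import heapq
import tempfile


def two_pass_sort(records, memory_limit):
    """Yield newline-terminated lines in sorted order, holding at most
    `memory_limit` lines in memory at a time."""
    runs = []
    chunk = []

    def flush():
        chunk.sort()
        run = tempfile.TemporaryFile("w+")
        run.writelines(chunk)
        run.seek(0)
        runs.append(run)
        chunk.clear()

    # Pass 1: cut the input into sorted runs that each fit in memory.
    for rec in records:
        chunk.append(rec)
        if len(chunk) >= memory_limit:
            flush()
    if chunk:
        flush()

    # Pass 2: a single multiway merge over all runs.
    try:
        yield from heapq.merge(*runs)
    finally:
        for run in runs:
            run.close()


def two_pass_capacity(memory_bytes, block_bytes):
    """Largest input sortable in two passes: one buffer block per run in
    the merge, and each run as large as main memory."""
    fan_in = memory_bytes // block_bytes
    return fan_in * memory_bytes


lines = [f"{n:03d}\n" for n in [5, 3, 9, 1, 7, 2, 8]]
result = list(two_pass_sort(lines, memory_limit=3))
capacity = two_pass_capacity(128 * 2**20, 4 * 2**10)   # 4 * 2**40 bytes
```

In xsort the sorted records would be the items of one context, ordered by their global keys; here plain strings stand in for them.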

17 XSort: Performance
xsort -c /dblp -e * -k title/text()
Size (KB)      Xalan (sec)    Xsort (sec)
0.41           0.08           0.00
4.91           0.09           0.00
76.22          0.27           0.02
991.79         2.52           0.26
9671.79        27.45          2.85
100964.43      -              43.97
1009643.71     -              461.36
Last row: 1GB! 8 minutes

18 Outline
The tools
The XPath processing engine
Conclusions

19 The XPath Processor
Common to all tools is the following problem:
Given:
–Set of correlated XPath expressions
–Stream of SAX events
Decide: when the expressions are true → variable events

20 Tree pattern: Example
xsort -c /bib -e paper -k publisher -e book -k title -e *
[Figure: tree pattern with root $r; $c = bib; below $c, $e1 = paper, $e2 = book, $e3 = *; $k1 = publisher below $e1, $k2 = title below $e2]
Data (XML markup lost in extraction):
Addison-Wesley Serge Abiteboul Rick Hull Victor Vianu Foundations of Databases 1995
Freeman Jeffrey D. Ullman Principles of Database and Knowledge Base Systems 1998
Variable events: $r, $c, $e2, $k2

21 The XPath Processor
How we did it:
All XPath expressions → Deterministic Finite Automaton
–Restriction: no predicates yet (current work...)
Does this scale to many, many XPath expressions?
–Yes, if we compute the DFA lazily (upcoming ICDT’2003 paper)
Evaluation time = parsing time
Can do even better with a Stream IndeX (next)
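To make the lazy construction concrete, here is a small Python sketch under stated assumptions. It is not the toolkit's engine: `LazyDFA` and `run` are hypothetical names, paths may use only `/`, `//`, element names and `*` (no predicates), and SAX events are simplified to `("start", tag)` / `("end", tag)` tuples. A DFA state is the set of positions reached in each expression, and a transition is computed only the first time the stream asks for it.

```python
import re


class LazyDFA:
    def __init__(self, paths):
        # "/bib//title" -> [("/", "bib"), ("//", "title")]
        self.steps = [re.findall(r"(//?)([^/]+)", p) for p in paths]
        self.start = frozenset((e, 0) for e in range(len(paths)))
        self.table = {}          # (state, tag) -> state, filled on demand

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.table:
            nxt = set()
            for e, p in state:
                if p == len(self.steps[e]):
                    continue     # fully matched above; nothing to extend
                axis, name = self.steps[e][p]
                if name in (tag, "*"):
                    nxt.add((e, p + 1))
                if axis == "//":
                    nxt.add((e, p))   # descendant axis may skip this level
            self.table[key] = frozenset(nxt)
        return self.table[key]

    def accepting(self, state):
        return sorted(e for e, p in state if p == len(self.steps[e]))


def run(dfa, events):
    """Return (expression index, tag) for every start event that matches."""
    stack = [dfa.start]
    hits = []
    for kind, tag in events:
        if kind == "start":
            stack.append(dfa.step(stack[-1], tag))
            hits.extend((e, tag) for e in dfa.accepting(stack[-1]))
        else:
            stack.pop()
    return hits


dfa = LazyDFA(["/bib/paper/title", "/bib/*", "//title"])
events = [
    ("start", "bib"), ("start", "book"), ("start", "title"), ("end", "title"),
    ("end", "book"), ("start", "paper"), ("start", "title"), ("end", "title"),
    ("end", "paper"), ("end", "bib"),
]
hits = run(dfa, events)
```

Each start event costs one dictionary lookup once its transition is cached, regardless of how many expressions are loaded, which is the sense in which evaluation time tracks parsing time. Only the states the data actually visits are ever built.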

22 Stream IndeX (SIX)
News: The parser is the main bottleneck in XPath stream processing!
Solution: “Index” the XML stream, parse only partially
Definition: The SIX = a table of (start, end) offsets

23 Stream IndeX (SIX): Construction
XML (markup lost in extraction):
Addison-Wesley Serge Abiteboul Rick Hull Victor Vianu Foundations of Databases 1995
Freeman Jeffrey D. Ullman Principles of Database and Knowledge Base Systems 1998
SIX:
[Table with columns start and end, one row per element (bib, book, publisher, author, author, ...); the offsets were fused in extraction: bib01490124 book3409023 publisher12423 author426879 author978...]
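One way to build such a table of (start, end) offsets is sketched below with Python's `xml.parsers.expat`, whose `CurrentByteIndex` attribute reports the byte offset of the event being handled. This illustrates the idea only and says nothing about the toolkit's SIX format: `build_six` is a hypothetical name, and the end offset is found by scanning to the next `>`, which assumes no `>` inside attribute values.

```python
import xml.parsers.expat


def build_six(data):
    """Return {start_offset: end_offset} for every element in `data` (bytes),
    so that data[start:end] is the element's full serialization."""
    six = {}
    open_starts = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(tag, attrs):
        # Offset of the "<" that opens this element.
        open_starts.append(parser.CurrentByteIndex)

    def on_end(tag):
        start = open_starts.pop()
        # CurrentByteIndex is the "<" of the closing tag; the element ends
        # just past the next ">".
        six[start] = data.index(b">", parser.CurrentByteIndex) + 1

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(data, True)
    return six


data = (b"<bib><book><title>T</title></book>"
        b"<paper><title>P</title></paper></bib>")
six = build_six(data)
book_start = data.index(b"<book>")
book = data[book_start:six[book_start]]
```

Building the index still needs one full parse; the payoff comes later, when consumers can jump from a start offset straight to the matching end offset.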

24 Stream IndeX (SIX): Skip Parsing
XPath: /bib/paper/title...
XML (markup lost in extraction):
Addison-Wesley Serge Abiteboul Rick Hull Victor Vianu Foundations of Databases 1995
Freeman Jeffrey D. Ullman Principles of Database and Knowledge Base Systems 1998
......
[Figure label: Skip Parsing]
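Skip parsing can be illustrated with a toy scanner that consults a table of start-to-end offsets whenever it meets an element that cannot lead to a match. Everything here is an assumption made for illustration: `select` is a hypothetical name, the path uses child steps with plain names only, the tag scanner is a regular expression that ignores comments and CDATA, and the offsets in `six` were computed by hand for this input.

```python
import re

TAG = re.compile(rb"<(/?)([^\s<>/]+)[^<>]*>")


def select(data, six, path):
    """Return (matching elements, bytes skipped) for a child-only path such
    as "/bib/paper/title"; `six` maps an element's start offset to its end."""
    steps = path.strip("/").split("/")
    results, skipped = [], 0
    depth = 0        # number of path steps matched by currently open elements
    pos = 0
    while True:
        m = TAG.search(data, pos)
        if m is None:
            break
        name = m.group(2).decode()
        on_path = name == steps[depth]
        if m.group(1):
            depth -= 1                   # closing tag of an element we entered
            pos = m.end()
        elif (on_path and depth + 1 < len(steps)
              and not m.group(0).endswith(b"/>")):
            depth += 1                   # on the path: keep parsing inside
            pos = m.end()
        else:
            end = six[m.start()]         # one lookup replaces parsing a subtree
            if on_path and depth + 1 == len(steps):
                results.append(data[m.start():end])
            else:
                skipped += end - m.start()
            pos = end
    return results, skipped


data = (b"<bib><book><title>T</title></book>"
        b"<paper><title>P</title></paper></bib>")
six = {0: 71, 5: 34, 11: 27, 34: 65, 41: 57}   # hand-computed offsets
matches, skipped = select(data, six, "/bib/paper/title")
```

On this input the whole `<book>` element (29 bytes) is jumped over without looking at its contents, and the single `<title>` under `<paper>` is returned.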

25 Stream IndeX (SIX) in XML Stream Processing
[Figure: an XML stream accompanied by a SIX stream of offset pairs (e.g. via DIME)]
The SIX stream is about 6% of the data stream
And can be made MUCH smaller


28 Outline
The tools
The XPath processing engine
Conclusions

29 Conclusions
The toolkit is already available:
–http://www.cs.washington.edu/homes/suciu/XMLTK
–http://xmltk.sourceforge.net
What it does so far it does very well:
–Sorting, aggregation, nest/unnest
But doesn’t do too much:
–Restricted selections, no projections, no restructurings yet
–Volunteers welcome!
Can one process XML data without parsing it completely?
–SIX

