9/25/08, IEEE ICWS 2008. High-Performance XML Parsing and Validation with Permutation Phrase Grammar Parsers. Wei Zhang & Robert van Engelen, Department of Computer Science, Florida State University.
Presentation Overview: (1) Schema-specific Parsers, (2) Related Work, (3) PTDX: Table-Driven XML Parser with Permutation Phrase Grammar, (4) Performance, (5) Conclusion.
Schema-specific Parsers. Compile-time vs. run-time parsers: compile-time parsing and validation approaches use specialized compilation techniques to generate customized parsers from schemas; run-time approaches use generic drivers (or engines) and a grammar-like representation of schemas. Blocking vs. non-blocking parsers: blocking parsers (e.g., recursion-based parsers) may suspend the entire program until sufficient XML content has been received; non-blocking parsers always return control to the program, and buffered data can be supplied incrementally. Time-efficient vs. space-efficient parsers: time-efficient parsers encode many states; space-efficient parsers rely on backtracking.
Related Work. [Van Engelen, 2001]: the earliest work on a schema-specific LL(1) recursive-descent parser with namespace support and validation. [Van Engelen, 2004]: a two-level DFA integrating parsing and validation. [Chiu et al., 2004]: nondeterministic generalized automata that merge all aspects of low-level parsing and validation. [Reuter, 2003]: a Cardinality-Constraint Automaton (CCA) that performs schema-aware validation.
Related Work (Cont'd). [Kostoulas et al., 2006]: an efficient parser generator that translates an XML schema into a parser in either C or Java. [Matsa, 2007]: a schema-directed interpretive XML parser using special-purpose bytecodes. [Zhang et al., 2006]: a table-driven approach that parses and validates in a single pass, with a generator that translates the schema in C.
PTDX: Table-Driven XML Parser with Permutation Phrase. Table-driven, grammar-based parser: an extended LL(1) grammar with permutation phrase support; the parsing table is constructed from the extended LL(1) permutation grammar. Run-time parser: a generic parsing engine (2-stack PDA). Both time and space efficient: predictive parsing; parsing and validation integrated into a single pass; no buffering; operates on tokens; the main stack grows with the depth of the XML data; the auxiliary stack grows with the number of elements in a permutation phrase. Non-blocking parser.
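To make the idea of a generic engine driven by a parsing table concrete, here is a minimal sketch of ordinary single-stack LL(1) predictive parsing over tokens. It is not the PTDX engine: it has no permutation phrase support and no auxiliary stack, and the grammar, the table contents, and the names `TABLE`, `NONTERMINALS`, and `parse` are assumptions made up for illustration.

```python
EPS = ()  # an empty right-hand side (epsilon production)

# Hypothetical LL(1) table for the fixed-order grammar
#   Doc -> bDoc A B eDoc ;  A -> bA CD eA | epsilon ;  B -> bB CD eB
# Keys are (nonterminal, lookahead token); values are right-hand sides.
TABLE = {
    ("Doc", "bDoc"): ("bDoc", "A", "B", "eDoc"),
    ("A", "bA"): ("bA", "CD", "eA"),
    ("A", "bB"): EPS,  # A is optional: predict epsilon when B starts
    ("B", "bB"): ("bB", "CD", "eB"),
}
NONTERMINALS = {"Doc", "A", "B"}


def parse(tokens, start="Doc"):
    """Return True if `tokens` can be derived from `start` using TABLE."""
    stack = [start]
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos] if pos < len(tokens) else None
        if top in NONTERMINALS:
            rhs = TABLE.get((top, look))
            if rhs is None:          # no table entry: reject
                return False
            stack.extend(reversed(rhs))  # leftmost symbol ends up on top
        else:
            if top != look:          # terminal must match the lookahead
                return False
            pos += 1
    return pos == len(tokens)
```

The engine itself never changes; swapping in a different `TABLE` changes the language it accepts, which is the property a run-time, table-driven parser relies on.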
Constructing PTDX Tables. XML Schemas → Mapping Rules → Extended LL(1) Permutation Phrase Grammar → LL(1) Parsing Table, Token Table, Action Table. Note: actions are generated from schemas to perform type checking, although some validation constraints are incorporated into the grammar productions.
Mapping Rules. Define the translation from schema components to LL(1) grammar productions. Preserve structural constraints. Map free-ordered schema components to the permutation phrase grammar.
Mapping Example. Schema: <element name="a" type="string" minOccurs="0"/> <element name="b" type="string"/> <element name="c" type="string"/>. Grammar: T → << A || B || C >>; A → bA CD eA; A → ε; B → bB CD eB; C → bC CD eC. Note: bA and eA are the tokens for the start tag and end tag of element "a", respectively; CD is the token for CDATA.
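The mapping in this example can be sketched as a small function. This is only an illustration under stated assumptions: it handles a single group of free-ordered, string-typed elements, it writes the permutation phrase as `<< A || B || C >>`, and the function name `map_free_ordered` and its input format are hypothetical, not the actual PTDX generator.

```python
def map_free_ordered(elements, parent="T"):
    """Map free-ordered element declarations to grammar productions.

    `elements` is a list of (name, min_occurs) pairs; every element is
    assumed to hold string content, so its body is the single token CD.
    Returns the productions as a list of strings.
    """
    nonterminals = [name.upper() for name, _ in elements]
    # The parent becomes one permutation phrase over all child nonterminals.
    productions = [f"{parent} → << " + " || ".join(nonterminals) + " >>"]
    for (name, min_occurs), nt in zip(elements, nonterminals):
        # Start-tag token, character data, end-tag token.
        productions.append(f"{nt} → b{nt} CD e{nt}")
        if min_occurs == 0:
            # An optional element may also derive the empty string.
            productions.append(f"{nt} → ε")
    return productions


for production in map_free_ordered([("a", 0), ("b", 1), ("c", 1)]):
    print(production)
```

Only the element with minOccurs="0" receives an ε-production, which is how optionality is carried from the schema into the grammar.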
Permutation Phrase. A permutation phrase is a grammatical phrase that specifies a syntactic construct as any permutation of a set of constituent elements. E.g., the permutation phrase << a || b || c >> recognizes the language { abc, acb, bac, bca, cab, cba }.
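As a quick check of the definition, the language of a permutation phrase over distinct single-character constituents can be enumerated directly. The helper name `permutation_language` is an assumption for illustration.

```python
from itertools import permutations
from math import factorial


def permutation_language(constituents):
    """Return every ordering of the given distinct constituents as a string."""
    return {"".join(order) for order in permutations(constituents)}


language = permutation_language("abc")
print(sorted(language))

# The language has n! sentences for n distinct constituents.
assert len(permutation_language("abcd")) == factorial(4)
```

Because the number of orderings grows factorially, spelling each ordering out as a separate ordinary production quickly becomes impractical, which is the motivation for supporting the phrase directly in the grammar and the parser.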
Two-stack PDA for Parsing Permutation Phrase << a || b || c >>. [Figure: steps 1 to 3, each showing the main stack, the auxiliary stack, and the input.]
Two-stack PDA for Parsing Permutation Phrase << a || b || c >> (Cont'd). [Figure: steps 4 to 6, each showing the main stack, the auxiliary stack, and the input.] Note: all optional constituent elements are left on the auxiliary stack once all non-empty elements have been parsed.
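The two-stack idea can be sketched roughly as follows. This is a simplified illustration, not the engine from the slides: the auxiliary stack is modeled as a plain Python list of constituents that have not been seen yet, each constituent is assumed to expand to a fixed token sequence, and the names `PHRASE` and `parse_permutation` are hypothetical.

```python
# Hypothetical permutation phrase: nonterminal -> (token sequence, optional?)
PHRASE = {
    "A": (("bA", "CD", "eA"), True),   # optional (minOccurs = 0)
    "B": (("bB", "CD", "eB"), False),
    "C": (("bC", "CD", "eC"), False),
}


def parse_permutation(tokens, phrase=PHRASE):
    """Return True if `tokens` is a valid ordering of the phrase's constituents."""
    aux = list(phrase)  # auxiliary stack: constituents still to be seen
    main = []           # main stack: symbols of the constituent being parsed
    pos = 0
    while True:
        look = tokens[pos] if pos < len(tokens) else None
        if main:
            # Ordinary predictive step: the top terminal must match the input.
            if main.pop() != look:
                return False
            pos += 1
            continue
        # Main stack is empty: select the pending constituent that the
        # lookahead token starts, and move it from aux to the main stack.
        chosen = next((nt for nt in aux if phrase[nt][0][0] == look), None)
        if chosen is None:
            break
        aux.remove(chosen)
        main.extend(reversed(phrase[chosen][0]))
    # Whatever is left on the auxiliary stack must be optional.
    return pos == len(tokens) and all(phrase[nt][1] for nt in aux)


print(parse_permutation(["bB", "CD", "eB", "bC", "CD", "eC", "bA", "CD", "eA"]))
print(parse_permutation(["bC", "CD", "eC", "bB", "CD", "eB"]))  # optional A omitted
```

A constituent is removed from the auxiliary list as soon as it is selected, so a repeated element finds no match and is rejected, and no backtracking or buffering of earlier tokens is needed.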
PTDX Architecture. [Figure: architecture diagram, with one part labeled "Hot-swappable".]
Schema-directed Scanner. Optimized by schema: e.g., scanning for a specific tag name is more efficient than scanning a generic string and then comparing it. Tokenizer: breaks the XML message into a token stream. Token: defined by element names, attribute names, and enumeration values; classified as start tags and end tags; namespace bindings are normalized.
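A rough sketch of a tokenizer that turns an XML fragment into start-tag, end-tag, and CDATA tokens is shown below. It makes several simplifying assumptions: a regular expression stands in for a schema-optimized scanner, attributes and namespaces are not handled, and `TOKEN_TABLE` and `tokenize` are hypothetical names with contents derived from a made-up schema.

```python
import re

# Hypothetical token table derived from a schema: element name -> token suffix.
TOKEN_TABLE = {"a": "A", "b": "B", "c": "C"}

# Either a start/end tag without attributes, or a run of character data.
_PIECE = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*>|([^<]+)")


def tokenize(xml):
    """Break `xml` into tokens such as bA, CD, eA; raise ValueError on bad input."""
    tokens = []
    pos = 0
    for match in _PIECE.finditer(xml):
        if match.start() != pos:  # some input was skipped: malformed markup
            raise ValueError(f"cannot scan input at offset {pos}")
        pos = match.end()
        closing, name, text = match.groups()
        if name is not None:
            if name not in TOKEN_TABLE:  # element not declared in the schema
                raise ValueError(f"unexpected element: {name}")
            tokens.append(("e" if closing else "b") + TOKEN_TABLE[name])
        elif text.strip():  # ignore whitespace-only character data
            tokens.append("CD")
    if pos != len(xml):
        raise ValueError(f"cannot scan input at offset {pos}")
    return tokens


print(tokenize("<b>x</b> <a>y</a>"))
```

Rejecting undeclared element names inside the scanner means the parser only ever sees tokens that the schema defines.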
Experiment Settings. Test environment: 3.0 GHz, 2 GB RAM, Linux, GCC with option -O2; memory-resident messages; randomly arranged free-ordered elements. Compared with: validating parsers (gSOAP 2.7, Xerces, PTDX, flex-based parser) and non-validating parsers (Expat, DFA-based parser).
Test Cases
Performance: comparison of validating and non-validating parsers. [Chart annotated "Better performance".]
Performance: effect of the number of free-ordered elements on the PTDX parser. [Chart annotated "Better performance".]
Performance: run-time and compile-time memory usage comparison (32 elements).
Conclusion. Free-ordered constraints can be parsed and validated efficiently using a 2-stack PDA. The table-driven permutation phrase grammar parsing technique is time and space optimal. The table-driven approach offers a flexible framework for dealing with schema evolution.