Wei Zhang Robert van Engelen

A Table-Driven Streaming XML Parsing Methodology for High-Performance Web Services
Wei Zhang, Robert van Engelen
9/20/2006, International Conference on Web Services 2006 (ICWS 2006)

Outline
  XML Performance
  Related Work: Schema-Specific XML Parsing (SSP)
  Table-Driven Streaming XML Parsing (TDX)
  Experiment Results
  Conclusion

XML Performance
  XML messaging is at the heart of Web Services
  XML is widely seen as underperforming
  Increasingly, XML is being used in processes that demand high performance
  Validation makes performance even worse
    Validation is typically applied during debugging and testing, and is often disabled in production systems

Why are traditional XML parsers slow?

Traditional parsers: performance issues (1)
Three stages of XML processing
  Well-formedness parsing
  Validation
  Application data handling
[Figure: layers, top to bottom: application, Validation, Parsing, XML]
The first stage is syntactic and addresses whether or not the document is well-formed XML. The second stage addresses whether or not the structure is a valid instance of a given schema. In the third stage, the application uses the data that the parser deserialized from the XML. Separating these three stages incurs overhead.

Traditional parsers: performance issues (2)
  Frequent access to schema
  Comparison done on strings (typically inefficient)
  Work duplicated between validator and deserializer
  Repeated data format validation and conversion (e.g. string/integer)
  Data copying

Related Work: Schema-Specific XML Parsing (1)
Idea:
  Constructing a parser that is hard-coded to process XML by exploiting schema information
  Merging well-formedness parsing and validation
[Figure: layers, top to bottom: application, Validation + Parsing (merged), XML]
SSP constructs an XML parser that is hand-coded to process XML information by exploiting schema information. For example, without schema information, element names must be buffered by the lexical analyzer for use by the application or during validation. If schema information is available, element names can be resolved directly to an application-provided, element-specific handler during lexical analysis.

Related Work: Schema-Specific XML Parsing (2)
Merging parsing and validation by
  Constructing PDA [Chiu '03]
    No namespace support
    Converting from NFA to DFA may result in exponentially growing space requirements
  Constructing DFA [van Engelen '04]
    Cannot process cyclic XML schemas
gSOAP toolkit [van Engelen '04]
  Based on recursive-descent parsing
  Not suitable for generic XML parsing without application data (de)serialization
Papers referenced on the slide:
  A compiler-based approach for schema-specific XML parsing
  Constructing a DFA for high-performance XML-based Web Services
Van Engelen [16] presented a method to integrate parsing and validation into a single stage, with parsing actions encoded by a deterministic finite state automaton (DFA) that is constructed directly from a schema based on a set of mapping rules. DFA parsing is fast and combines parsing and validity checks. However, because DFAs can only describe regular languages, the approach can only process a noncyclic subset of XML schemas.
Chiu et al. [3, 4] also suggest an approach that merges all aspects of low-level parsing and validation by constructing a single push-down automaton. However, their approach does not support XML namespaces, which is essential for SOAP compliance. Furthermore, the approach requires conversion from a non-deterministic automaton (NFA) to a DFA. This conversion may result in an exponentially growing space requirement caused by subset construction [1] or powerset construction [11].
In earlier work on the gSOAP toolkit [14, 17], a schema-specific parsing approach was implemented, and a compiler tool was developed to generate LL(1) recursive-descent parsers that efficiently parse XML documents with namespaces defined by schemas. This approach has the disadvantages of recursive-descent parsing, which include code size and function-call overhead.

Table-Driven Streaming XML Parsing Methodology (TDX)
An integrated approach to XML parsing, validation, deserialization, and even application-specific events for high-performance Web Services
[Figure: layers, top to bottom: application, Validation, Parsing, XML, integrated into a single stage]

Table-Driven Streaming XML Parsing Methodology (1)
  An LL(1) grammar can be generated from the schema
  XML well-formedness can be verified through grammar productions
  XML structure can be verified through grammar productions
    e.g. occurrence constraints, enumeration simpleType
  CDATA value validation can be accomplished by semantic actions
  Application-specific events can also be encoded as semantic actions

An Illustrating Example (1)
Schema (abbreviated syntax):
  <element name="example" type="example_type"/>
  <complexType name="example_type">
    <sequence>
      <element name="id" type="xsd:string"/>
      <element name="value" type="xsd:integer"/>
      <element name="state" type="state_type"/>
    </sequence>
  </complexType>
  <simpleType name="state_type">
    <restriction base="xsd:string">
      <enumeration value="ON"/>
      <enumeration value="OFF"/>
    </restriction>
  </simpleType>

Schema (abbreviated syntax):
  <element name="example" type="example_type"/>
  <complexType name="example_type">
    <sequence>
      <element name="id" type="xsd:string"/>
      <element name="value" type="xsd:integer"/>
      <element name="state" type="state_type"/>
    </sequence>
  </complexType>
  <simpleType name="state_type">
    <restriction base="xsd:string">
      <enumeration value="ON"/>
      <enumeration value="OFF"/>
    </restriction>
  </simpleType>
Grammar:
  (1) s -> '<example>' t '</example>'
  (2) t -> t1 t2 t3
  (3) t1 -> '<id>' CDATA '</id>' //isIdType(…)
  (4) t2 -> '<value>' CDATA '</value>' //isValueType(…)
  (5) t3 -> '<state>' v '</state>'
  (6) v -> 'ON' EVENT //doStateON()
  (7) v -> 'OFF'

Input:
  <example> <id> id_123 </id> <value> 456 </value> <state> ON </state> </example>
Top-down parsing tree:
  s
    '<example>'
    t
      t1
        '<id>' CDATA '</id>'            invoke isIdType(…)
      t2
        '<value>' CDATA '</value>'      invoke isValueType(…)
      t3
        '<state>'
        v
          'ON' EVENT                    invoke doStateOn()
        '</state>'
    '</example>'

TDX Architecture (1)

TDX Modularity
TDX parsing engine is schema-independent
Hot-swap modules for SSP
  LL(1) grammar productions
  LL(1) parse table
  Type-checking actions
  Tokens

TDX Construction Toolkit (1)
  Two code generators: WSDL2TDX and LL2Table
  Given a schema or WSDL specification, the toolkit automatically generates the tables for the parsing engine

TDX Construction Toolkit (2)
Why two generators?
  Application-specific events cannot be generated automatically
  Splitting generation into two steps allows application-specific events to be inserted between them

TDX Scanner/Tokenizer
TDX scanner is also the runtime tokenizer
Why tokenization?
  Comparison done on tokens (more efficient)
Tokens are defined by schema component tags
  Element names, attribute names
    Classified as starting tags, ending tags
  Enumeration values
  CDATA, EVENT
Normalized namespace binding
  <namespace, tag_name>
Tokenizing the XML improves overall performance, because matching a token once is more efficient than repeatedly comparing strings in the parser. Tokens are defined by schema component tags such as element names and attribute names. In our approach we namespace-normalize XML tags and then classify the tag names according to their use. All beginning element tags <NAME> are represented by bNAME, ending element tags </NAME> are represented by eNAME, and attribute names are represented by aNAME. Similarly, enumeration values value='V' are represented by cV. Enumeration values are defined as tokens mainly to improve performance.
Namespace bindings are supported by internal normalization of the token stream, which simplifies the construction of the LL(1) parse table. Namespace-qualified elements and attributes are translated into normalized tokens according to a namespace mapping table. Thus, identical tag names defined under two different namespaces are in fact separate tokens. Table 1 lists tokens for an example schema.

Scanner/Tokenizer example
  <book xmlns="x.org" xmlns:y="y.org">
    <title> XML Bible </title>
    <author>
      <name> Bob </name>
      <y:title> professor </y:title>
    </author>
  </book>
Part of tokens:
             <title>   </title>   <author>   </author>
  x.org      bTITLE    eTITLE     bAUTHOR    eAUTHOR
  y.org      bTITLE1   eTITLE1
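The namespace-normalized tokenization described above can be sketched in a few lines. This is an illustration only, not the TDX scanner: it borrows Python's standard `xml.etree.ElementTree.iterparse` for the low-level scanning, and the mapping table `TOKEN_STEMS` (including the stems `BOOK` and `NAME`, which the slide does not list) is a hypothetical stand-in for the namespace mapping table.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical namespace mapping table: (namespace URI, local name) -> token stem.
# TITLE, AUTHOR and TITLE1 follow the slide; BOOK and NAME are assumed.
TOKEN_STEMS = {
    ("x.org", "book"): "BOOK",
    ("x.org", "title"): "TITLE",
    ("x.org", "author"): "AUTHOR",
    ("x.org", "name"): "NAME",
    ("y.org", "title"): "TITLE1",
}

def split_qname(tag):
    """Split ElementTree's '{uri}local' tag form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag

def tokenize(xml_text):
    """Return the token stream: bNAME / eNAME per tag, CDATA for leaf text."""
    tokens = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        # An unknown (namespace, tag) pair raises KeyError in this sketch.
        stem = TOKEN_STEMS[split_qname(elem.tag)]
        if event == "start":
            tokens.append("b" + stem)
        else:
            # Leaf text is only guaranteed to be available at the end event.
            if len(elem) == 0 and elem.text and elem.text.strip():
                tokens.append("CDATA")
            tokens.append("e" + stem)
    return tokens

DOC = (
    '<book xmlns="x.org" xmlns:y="y.org">'
    "<title> XML Bible </title>"
    "<author><name> Bob </name><y:title> professor </y:title></author>"
    "</book>"
)
print(tokenize(DOC))
```

Because the lookup key is the pair (namespace, tag name), the two `title` elements map to different tokens (`bTITLE` and `bTITLE1`), so the parser never has to compare strings or resolve prefixes again.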

Mapping Rules
Define the mapping from XML schema to LL(1) grammars
  Preserves structural constraints
  Many types of validation constraints are incorporated in the resulting grammar productions
    e.g., occurrence constraints
  Some type-checking constraints are incorporated as grammar productions
    e.g., enumeration simpleType

Sample Mapping Rules

Mapping Example
Schema:
  <complexType name="example">
    <sequence>
      <element name="id" type="id_type" minOccurs="0"/>
      <element name="value" type="value_type" minOccurs="0" maxOccurs="unbounded"/>
    </sequence>
  </complexType>
Grammar:
  example -> s1 s2
  s1 -> bID id_type eID
  s1 -> ε
  s2 -> bVALUE value_type eVALUE s2
  s2 -> ε
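The occurrence patterns in this mapping example can be sketched as a small production generator. This is a minimal illustration covering only the three patterns visible here (exactly once, optional, and zero-or-more); the function names `map_element` and `map_sequence` and the tuple representation are assumptions, not the WSDL2TDX implementation.

```python
EPSILON = ()  # an empty right-hand side stands for the epsilon production

def map_element(nt, name, type_nt, min_occurs=1, max_occurs=1):
    """Return the productions (lhs, rhs) for one element declaration."""
    body = ("b" + name.upper(), type_nt, "e" + name.upper())
    if (min_occurs, max_occurs) == (1, 1):
        return [(nt, body)]                          # exactly once
    if (min_occurs, max_occurs) == (0, 1):
        return [(nt, body), (nt, EPSILON)]           # optional
    if (min_occurs, max_occurs) == (0, "unbounded"):
        return [(nt, body + (nt,)), (nt, EPSILON)]   # right-recursive repetition
    raise NotImplementedError("occurrence pattern not covered by this sketch")

def map_sequence(type_name, elements):
    """Map a <sequence> of (name, type, minOccurs, maxOccurs) to productions."""
    child_nts, productions = [], []
    for i, (name, type_nt, lo, hi) in enumerate(elements, start=1):
        nt = f"s{i}"
        child_nts.append(nt)
        productions.extend(map_element(nt, name, type_nt, lo, hi))
    return [(type_name, tuple(child_nts))] + productions

rules = map_sequence("example", [
    ("id", "id_type", 0, 1),
    ("value", "value_type", 0, "unbounded"),
])
for lhs, rhs in rules:
    print(lhs, "->", " ".join(rhs) if rhs else "ε")
```

The repetition case uses right recursion (`s2 -> ... s2`) rather than left recursion, which keeps the grammar LL(1).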

TDX Table Generation Example
Grammar:
  (1) s -> bE t eE
  (2) t -> t1 t2 t3
  (3) t1 -> bI CD eI //isIdType(…)
  (4) t2 -> bV CD eV //isValueType(…)
  (5) t3 -> bS v eS
  (6) v -> cON EV //doStateOn()
  (7) v -> cOFF

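TDX Table Generation Example (2)
LL(1) Parse Table
  Columns (lookahead tokens): bE eE bI bV eV bS eS cON cOFF $
  Non-empty entries:
    [s,  bE]    s -> bE t eE
    [t,  bI]    t -> t1 t2 t3
    [t1, bI]    t1 -> bI CD eI
    [t2, bV]    t2 -> bV CD eV
    [t3, bS]    t3 -> bS v eS
    [v,  cON]   v -> cON EV
    [v,  cOFF]  v -> cOFF
    [$,  $]     ACC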
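How such a table can be derived from the grammar is sketched below using the textbook FIRST/FOLLOW construction. This is an assumption about the general technique, not a description of the LL2Table tool; the function names and the choice to store production indices in the table are illustrative.

```python
GRAMMAR = [
    ("s",  ("bE", "t", "eE")),
    ("t",  ("t1", "t2", "t3")),
    ("t1", ("bI", "CD", "eI")),
    ("t2", ("bV", "CD", "eV")),
    ("t3", ("bS", "v", "eS")),
    ("v",  ("cON", "EV")),
    ("v",  ("cOFF",)),
]

def first_of_seq(seq, first, nonterminals):
    """FIRST set of a symbol sequence; None in the result stands for epsilon."""
    out = set()
    for sym in seq:
        if sym not in nonterminals:       # terminal: it starts the sequence
            out.add(sym)
            return out
        out |= first[sym] - {None}
        if None not in first[sym]:        # this nonterminal cannot vanish
            return out
    out.add(None)                         # every symbol can derive epsilon
    return out

def build_table(grammar, start, end="$"):
    """Return {(nonterminal, lookahead): production index} or raise on conflict."""
    nts = {lhs for lhs, _ in grammar}
    first = {nt: set() for nt in nts}
    follow = {nt: set() for nt in nts}
    follow[start].add(end)
    changed = True
    while changed:                        # iterate FIRST and FOLLOW to a fixpoint
        changed = False
        for lhs, rhs in grammar:
            f = first_of_seq(rhs, first, nts)
            if not f <= first[lhs]:
                first[lhs] |= f
                changed = True
            for i, sym in enumerate(rhs):
                if sym in nts:
                    tail = first_of_seq(rhs[i + 1:], first, nts)
                    add = tail - {None}
                    if None in tail:
                        add |= follow[lhs]
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True
    table = {}
    for idx, (lhs, rhs) in enumerate(grammar):
        f = first_of_seq(rhs, first, nts)
        lookaheads = f - {None}
        if None in f:
            lookaheads |= follow[lhs]
        for tok in lookaheads:
            if (lhs, tok) in table:
                raise ValueError(f"not LL(1): conflict at [{lhs}, {tok}]")
            table[(lhs, tok)] = idx
    return table

TABLE = build_table(GRAMMAR, "s")
for (nt, tok), idx in sorted(TABLE.items()):
    print(f"[{nt}, {tok}] -> production {idx + 1}")
```

Here the table stores 0-based production indices; the accept entry is left to the parsing engine, which accepts when both the stack and the input are reduced to `$`.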

TDX Parsing Engine Example
Parsing table: the LL(1) parse table from the table generation example
Input:
  <example> <id> id_123 </id> <value> 456 </value> <state> ON </state> </example>
Parsing steps (stack top at left):
  Tokens read      Stack                    Action
  bE               s $                      start; lookahead is bE
  bE               bE t eE $                expand s -> bE t eE
  bE bI            t eE $                   match bE; lookahead is bI
  bE bI            t1 t2 t3 eE $            expand t -> t1 t2 t3
  bE bI            bI CD eI t2 t3 eE $      expand t1 -> bI CD eI
  bE bI CD         CD eI t2 t3 eE $         match bI; lookahead is CD; invoke isIdType()
  bE bI CD … $     $                        remaining input consumed; accept
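The stack discipline shown in this walkthrough can be sketched as a generic table-driven LL(1) loop. This is a minimal illustration, not the TDX engine: the token representation as `(token, text)` pairs, the `VALIDATORS` map, and the bodies of `is_id_type` and `is_value_type` are all assumptions made for the example.

```python
import re

GRAMMAR = [
    ("s",  ("bE", "t", "eE")),
    ("t",  ("t1", "t2", "t3")),
    ("t1", ("bI", "CD", "eI")),
    ("t2", ("bV", "CD", "eV")),
    ("t3", ("bS", "v", "eS")),
    ("v",  ("cON", "EV")),
    ("v",  ("cOFF",)),
]
TABLE = {("s", "bE"): 0, ("t", "bI"): 1, ("t1", "bI"): 2, ("t2", "bV"): 3,
         ("t3", "bS"): 4, ("v", "cON"): 5, ("v", "cOFF"): 6}
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def is_id_type(text):
    """Hypothetical xsd:string check: any text is accepted."""
    return isinstance(text, str)

def is_value_type(text):
    """Hypothetical xsd:integer check on the lexical form."""
    return re.fullmatch(r"[+-]?[0-9]+", text.strip()) is not None

# Type-checking actions attached to productions (0-based indices).
VALIDATORS = {2: is_id_type, 3: is_value_type}

def parse(tokens):
    """tokens: list of (token, text) pairs ending with ('$', None).
    Returns the list of application events fired during the parse."""
    stack = ["$", "s"]                    # the top of the stack is the list end
    pos, events = 0, []
    while stack:
        top = stack.pop()
        tok, text = tokens[pos]
        check = None
        if isinstance(top, tuple):        # CD symbol paired with its validator
            top, check = top
        if top == "EV":                   # event pseudo-token: consumes no input
            events.append("doStateOn")
            continue
        if top in NONTERMINALS:           # expand using the parse table
            prod = TABLE.get((top, tok))
            if prod is None:
                raise SyntaxError(f"unexpected {tok} while expanding {top}")
            validator = VALIDATORS.get(prod)
            for sym in reversed(GRAMMAR[prod][1]):
                stack.append((sym, validator) if sym == "CD" and validator else sym)
            continue
        if top != tok:                    # terminal on the stack must match input
            raise SyntaxError(f"expected {top}, got {tok}")
        if check is not None and not check(text):
            raise ValueError(f"invalid value {text!r}")
        pos += 1
    return events

INPUT = [("bE", None), ("bI", None), ("CD", "id_123"), ("eI", None),
         ("bV", None), ("CD", "456"), ("eV", None),
         ("bS", None), ("cON", None), ("eS", None), ("eE", None), ("$", None)]
print(parse(INPUT))
```

The loop itself never refers to element names or schema types; swapping `GRAMMAR`, `TABLE`, and `VALIDATORS` is enough to retarget it to another schema, which is the modularity argument made earlier in the deck.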

Experiment Results
Test environment
  2.4 GHz P4, 512 MB RAM, Red Hat Linux, GNU Compiler g with option -O2
  Memory-resident XML message
  Measured elapsed real time using gettimeofday() over 100 runs
Compared with
  DFA-based parser
  gSOAP 2.7
  eXpat 1.2
  Xerces 2.7.0

Experiment Results (cont'd)
XML Schema for echoString (abbreviated syntax):
  <schema>
    <element name="echoString">
      <complexType>
        <sequence>
          <element name="input" type="xsd:string" maxOccurs="unbounded"/>
        </sequence>
      </complexType>
    </element>
  </schema>

Parsing Performance (1)
XML document size: 1024 B

Parsing Performance (2)

Conclusions
TDX is fast
  Integrated approach across layers
  Avoids schema access at runtime
  Comparison done on tokens
  Avoids data copying
  Avoids format conversions
  Minimizes function calls
  Optimization based on schema structure

Conclusions (cont'd)
XML can be parsed, validated, and deserialized efficiently for high-performance Web services using a table-driven methodology
  Can be up to several times faster than industry-strength high-performance validating XML parsers
The table-driven methodology can offer a high level of modularity
  Provides a mechanism for integrating application-specific events, such as SOAP deserializers

Thank You
