Wei Zhang Robert van Engelen

Presentation transcript:

A Table-Driven Streaming XML Parsing Methodology for High-Performance Web Services
Wei Zhang, Robert van Engelen
9/20/2006, International Conference on Web Services (ICWS) 2006

Outline
- XML Performance
- Related Work: Schema-Specific XML Parsing (SSP)
- Table-Driven Streaming XML Parsing (TDX)
- Experiment Results
- Conclusion
9/20/2006 ICWS 2006

XML Performance
- XML messaging is at the heart of Web Services
- XML is widely seen as underperforming
- Increasingly, XML is being used in processes that demand high performance
- Validation performs even worse: it is typically applied during debugging and testing, and is often disabled in production systems

Why are traditional XML parsers slow?

Traditional parsers: performance issues (1)
Three stages of XML processing:
- Well-formedness parsing: the first stage is syntactic and addresses whether or not the document is well-formed XML.
- Validation: the second stage addresses whether or not the structure is a valid instance of a given schema.
- Application data handling: in the third stage, the application actually uses the data that the parser deserialized from the XML.
Separating the three stages incurs overhead.
[Figure: processing layers XML, Parsing, Validation, application]

Traditional parsers: performance issues (2)
- Frequent access to the schema
- Comparisons are done on strings (typically inefficient)
- Work is duplicated between validator and deserializer
- Repeated data format validation and conversion (e.g., string/integer)
- Data copying

Related Work: Schema-Specific XML Parsing (1)
- Idea: construct a parser that is hard-coded to process XML by exploiting schema information
- Merges well-formedness parsing and validation
[Figure: processing layers XML, Parsing, Validation, application]
SSP constructs an XML parser that is hand-coded to process XML information by exploiting schema information. For example, without schema information, element names must be buffered by the lexical analyzer for use by the application or during validation. If schema information is available, element names can be resolved directly to an application-provided, element-specific handler during lexical analysis.

Related Work: Schema-Specific XML Parsing (2)
Merging parsing and validation by:
- Constructing a PDA [Chiu '03] ("A compiler-based approach for schema-specific XML parsing")
  - No namespace support
  - Converting from NFA to DFA may result in exponentially growing space requirements
- Constructing a DFA [van Engelen '04] ("Constructing a DFA for high-performance XML-based Web Services")
  - Cannot process cyclic XML schemas
- gSOAP toolkit [van Engelen '04]
  - Based on recursive-descent parsing
  - Not suitable for generic XML parsing without application data (de)serialization
Van Engelen [16] presented a method to integrate parsing and validation into a single stage, with parsing actions encoded by a deterministic finite state automaton (DFA) that is constructed directly from a schema using a set of mapping rules. DFA parsing is fast and combines parsing and validity checks. However, because DFAs can only describe regular languages, the approach can only process a noncyclic subset of XML schemas.
Chiu et al. [3, 4] also suggest an approach that merges all aspects of low-level parsing and validation by constructing a single push-down automaton. However, their approach does not support XML namespaces, which are essential for SOAP compliance. Furthermore, the approach requires conversion from a non-deterministic automaton (NFA) to a DFA. This conversion may result in exponentially growing space requirements caused by subset construction [1] or powerset construction [11].
In earlier work on the gSOAP toolkit [14, 17], a schema-specific parsing approach was implemented, and a compiler tool was developed to generate LL(1) recursive-descent parsers that efficiently parse XML documents with namespaces defined by schemas. This approach has the disadvantages of recursive-descent parsing, which include code size and function-call overhead.
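To make the space concern about NFA-to-DFA conversion concrete, the following is a minimal sketch of subset construction. The NFA used here is a standard textbook example (strings over {a, b} whose n-th symbol from the end is 'a'); it is an assumption chosen for illustration and is not taken from the schema-derived automata of Chiu et al. The function names are hypothetical.

```python
from itertools import chain

def nfa_for_nth_from_last(n):
    """NFA over {a, b} accepting strings whose n-th symbol from the end is 'a'.

    States are 0..n. State 0 loops on both symbols and guesses the 'a';
    states 1..n-1 count the remaining symbols; state n accepts.
    """
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for q in range(1, n):
        delta[(q, "a")] = {q + 1}
        delta[(q, "b")] = {q + 1}
    return delta, 0, {n}

def subset_construction(delta, start, alphabet=("a", "b")):
    """Return the reachable DFA states, each a frozenset of NFA states."""
    start_set = frozenset({start})
    seen = {start_set}
    worklist = [start_set]
    while worklist:
        current = worklist.pop()
        for symbol in alphabet:
            # The DFA target is the union of all NFA moves from the current set.
            target = frozenset(
                chain.from_iterable(delta.get((q, symbol), ()) for q in current)
            )
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    return seen

for n in range(1, 6):
    delta, start, _accepting = nfa_for_nth_from_last(n)
    dfa_states = subset_construction(delta, start)
    print(f"{n + 1} NFA states -> {len(dfa_states)} DFA states")
```

For this family the DFA has to remember the last n symbols, so an NFA with n + 1 states yields 2^n reachable DFA states. Real schemas will not necessarily hit this worst case, but the construction offers no guarantee against it.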

Table-Driven Streaming XML Parsing Methodology (TDX)
An integrated approach to XML parsing, validation, deserialization, and even application-specific events for high-performance Web services.
[Figure: processing layers XML, Parsing, Validation, application]

Table-Driven Streaming XML Parsing Methodology (1)
- An LL(1) grammar can be generated from the schema
- XML well-formedness can be verified through grammar productions
- XML structure can be verified through grammar productions, e.g., occurrence constraints and enumeration simpleTypes
- CDATA value validation can be accomplished by semantic actions
- Application-specific events can also be encoded as semantic actions

An Illustrating Example (1)
Schema (abbreviated syntax):
  <element name="example" type="example_type"/>
  <complexType name="example_type">
    <sequence>
      <element name="id" type="xsd:string"/>
      <element name="value" type="xsd:integer"/>
      <element name="state" type="state_type"/>
    </sequence>
  </complexType>
  <simpleType name="state_type">
    <restriction base="xsd:string">
      <enumeration value="ON"/>
      <enumeration value="OFF"/>
    </restriction>
  </simpleType>

An Illustrating Example (1)
Grammar generated from the schema (abbreviated syntax) on the previous slide. Production (1) comes from the top-level element, (2) to (5) from the complexType example_type, and (6) and (7) from the simpleType state_type:
  (1) s  -> '<example>' t '</example>'
  (2) t  -> t1 t2 t3
  (3) t1 -> '<id>' CDATA '</id>'          //isIdType(…)
  (4) t2 -> '<value>' CDATA '</value>'    //isValueType(…)
  (5) t3 -> '<state>' v '</state>'
  (6) v  -> 'ON' EVENT                    //doStateOn()
  (7) v  -> 'OFF'

An Illustrating Example (2)
Input:
  <example> <id> id_123 </id> <value> 456 </value> <state> ON </state> </example>
Top-down parse tree:
  s
    '<example>'
    t
      t1
        '<id>' CDATA '</id>'          invoke isIdType(…)
      t2
        '<value>' CDATA '</value>'    invoke isValueType(…)
      t3
        '<state>'
        v
          'ON' EVENT                  invoke doStateOn()
        '</state>'
    '</example>'

TDX Architecture (1)

TDX Architecture (2)

TDX Architecture (3)

TDX Architecture (4)

TDX Modularity
- The TDX parsing engine is schema-independent
- Hot-swappable modules for SSP:
  - LL(1) grammar productions
  - LL(1) parse table
  - Type-checking actions
  - Tokens

TDX Construction Toolkit (1)
- Two code generators: WSDL2TDX and LL2Table
- Given a schema or WSDL specification, the toolkit automatically generates the tables for the parsing engine

TDX Construction Toolkit (2)
Why two generators?
- Application-specific events cannot be generated automatically
- Using two generators allows insertion of application-specific events

TDX Scanner/Tokenizer
- The TDX scanner is also a runtime tokenizer
- Why tokenization? Comparisons are done on tokens (more efficient)
- Tokens are defined by schema component tags:
  - Element names, attribute names
  - Classified as starting tags and ending tags
  - Enumeration values
  - CDATA, EVENT
- Normalized namespace binding: <namespace, tag_name>
Tokenizing XML enhances overall performance, because matching a token once is more efficient than repeatedly comparing strings in the parser. Tokens are defined by schema component tags such as element names and attribute names. In our approach we namespace-normalize XML tags and then classify the XML tag names according to their use. All beginning element tags <NAME> are represented by bNAME, ending element tags </NAME> are represented by eNAME, and attribute names are represented by aNAME. Similarly, enumeration values value='V' are represented by cV. Enumeration values are defined as tokens mainly to improve performance.
Namespace bindings are supported by internal normalization of the token stream, which simplifies the construction of the LL(1) parse table. Namespace-qualified elements and attributes are translated into normalized tokens according to a namespace mapping table. Thus, identical tag names defined under two different namespaces are in fact separate tokens. Table 1 lists tokens for an example schema.

Scanner/Tokenizer example
  <book xmlns="x.org" xmlns:y="y.org">
    <title> XML Bible </title>
    <author>
      <name> Bob </name>
      <y:title> professor </y:title>
    </author>
  </book>
Part of the tokens:
  Tag         x.org      y.org
  <title>     bTITLE     bTITLE1
  </title>    eTITLE     eTITLE1
  <author>    bAUTHOR
  </author>   eAUTHOR
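The namespace normalization in this example can be sketched as follows. This is a simplified illustration, not the TDX scanner: the `NAMESPACE_SUFFIX` mapping table, the regular expressions, and the `tokenize` function are hypothetical, namespace declarations are treated as global rather than scoped, and attributes, self-closing tags, and enumeration tokens are not handled.

```python
import re

# Hypothetical namespace mapping table: namespace URI -> token suffix.
NAMESPACE_SUFFIX = {"x.org": "", "y.org": "1"}

TAG = re.compile(r"<(/?)([\w.-]+:)?([\w.-]+)([^<>]*)>")
XMLNS = re.compile(r'xmlns(?::([\w.-]+))?\s*=\s*"([^"]*)"')

def tokenize(xml):
    """Yield (token, text) pairs; text is set only for CDATA tokens."""
    prefixes = {}  # prefix -> namespace URI ('' is the default namespace)
    pos = 0
    for match in TAG.finditer(xml):
        text = xml[pos:match.start()].strip()
        if text:
            yield ("CDATA", text)
        pos = match.end()
        closing, prefix, name, attrs = match.groups()
        # Record namespace declarations (simplification: no scoping).
        for declared_prefix, uri in XMLNS.findall(attrs):
            prefixes[declared_prefix] = uri
        uri = prefixes[(prefix or "").rstrip(":")]
        kind = "e" if closing else "b"
        # Same local name under a different namespace gives a different token.
        yield (kind + name.upper() + NAMESPACE_SUFFIX[uri], None)

doc = """<book xmlns="x.org" xmlns:y="y.org">
<title> XML Bible </title>
<author> <name> Bob </name> <y:title> professor </y:title> </author>
</book>"""

print([token for token, _text in tokenize(doc)])
```

The point of the sketch is that `<title>` and `<y:title>` come out as the distinct tokens bTITLE and bTITLE1, so the parser downstream never needs to compare namespace strings.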

Mapping Rules
- Define a mapping from XML schema to LL(1) grammars
- Preserve structural constraints
- Many types of validation constraints are incorporated in the resulting grammar productions, e.g., occurrence constraints
- Some type-checking constraints are incorporated as grammar productions, e.g., enumeration simpleTypes

Sample Mapping Rules

Mapping Example
Schema:
  <complexType name="example">
    <sequence>
      <element name="id" type="id_type" minOccurs="0"/>
      <element name="value" type="value_type" minOccurs="0" maxOccurs="unbounded"/>
    </sequence>
  </complexType>
Resulting productions:
  example -> s1 s2
  s1 -> bID id_type eID
  s1 -> ε
  s2 -> bVALUE value_type eVALUE s2
  s2 -> ε
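The occurrence mapping in this example can be sketched in a few lines. The function below is a hypothetical illustration that covers only three occurrence cases (exactly one, optional, and zero-or-more); it is not the full rule set used by the toolkit, and the tuple-based input format is an assumption.

```python
def map_sequence(type_name, elements):
    """Map a <sequence> of elements to LL(1) productions.

    elements: list of (name, type, min_occurs, max_occurs), where max_occurs
    is an int or the string "unbounded". Returns (nonterminal, rhs) pairs;
    an empty rhs list stands for epsilon.
    """
    productions = []
    top_rhs = []
    for index, (name, typ, lo, hi) in enumerate(elements, start=1):
        nonterminal = f"s{index}"
        top_rhs.append(nonterminal)
        one = [f"b{name.upper()}", typ, f"e{name.upper()}"]
        if (lo, hi) == (1, 1):
            # Exactly one occurrence.
            productions.append((nonterminal, one))
        elif (lo, hi) == (0, 1):
            # Optional element: one occurrence or epsilon.
            productions += [(nonterminal, one), (nonterminal, [])]
        elif (lo, hi) == (0, "unbounded"):
            # Zero or more: right-recursive production plus epsilon.
            productions += [(nonterminal, one + [nonterminal]), (nonterminal, [])]
        else:
            raise NotImplementedError(f"occurrence ({lo}, {hi}) not covered by this sketch")
    return [(type_name, top_rhs)] + productions

rules = map_sequence("example", [
    ("id", "id_type", 0, 1),
    ("value", "value_type", 0, "unbounded"),
])
for lhs, rhs in rules:
    print(lhs, "->", " ".join(rhs) if rhs else "ε")
```

Right recursion is used for the unbounded case because left recursion would make the grammar unsuitable for LL(1) parsing.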

TDX Table Generation Example
Grammar:
  (1) s  -> bE t eE
  (2) t  -> t1 t2 t3
  (3) t1 -> bI CD eI    //isIdType(…)
  (4) t2 -> bV CD eV    //isValueType(…)
  (5) t3 -> bS v eS
  (6) v  -> cON EV      //doStateOn()
  (7) v  -> cOFF

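TDX Table Generation Example (2)
LL(1) parse table (rows: nonterminals s, t, t1, t2, t3, v; columns: tokens bE, eE, bI, bV, eV, bS, eS, cON, cOFF, $). Non-empty entries:
  [s, bE]     s -> bE t eE
  [t, bI]     t -> t1 t2 t3
  [t1, bI]    t1 -> bI CD eI
  [t2, bV]    t2 -> bV CD eV
  [t3, bS]    t3 -> bS v eS
  [v, cON]    v -> cON EV
  [v, cOFF]   v -> cOFF
  [$, $]      ACC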
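A table like this one can be derived mechanically from the grammar. The sketch below is a simplified stand-in for what a generator such as LL2Table might do, not its actual implementation: it assumes that no production is empty and that the grammar has no left recursion, so FIRST sets alone are sufficient (a general generator also needs FOLLOW sets for epsilon productions). All names are hypothetical.

```python
# Productions from the table generation example, in slide order (1)..(7).
GRAMMAR = [
    ("s",  ["bE", "t", "eE"]),
    ("t",  ["t1", "t2", "t3"]),
    ("t1", ["bI", "CD", "eI"]),
    ("t2", ["bV", "CD", "eV"]),
    ("t3", ["bS", "v", "eS"]),
    ("v",  ["cON", "EV"]),
    ("v",  ["cOFF"]),
]
NONTERMINALS = {lhs for lhs, _rhs in GRAMMAR}

def first(symbol):
    """FIRST set of a symbol; valid only because no production is empty."""
    if symbol not in NONTERMINALS:
        return {symbol}
    result = set()
    for lhs, rhs in GRAMMAR:
        if lhs == symbol:
            result |= first(rhs[0])
    return result

def build_table():
    """Map (nonterminal, lookahead token) to a production number."""
    table = {}
    for number, (lhs, rhs) in enumerate(GRAMMAR, start=1):
        for terminal in first(rhs[0]):
            if (lhs, terminal) in table:
                raise ValueError(f"not LL(1): conflict at ({lhs}, {terminal})")
            table[(lhs, terminal)] = number
    return table

for (nonterminal, token), number in sorted(build_table().items()):
    print(f"[{nonterminal}, {token}] -> production ({number})")
```

The conflict check is what makes the result an LL(1) table: two productions of the same nonterminal may never compete for the same lookahead token.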

TDX Parsing Engine Example
Parsing table: the LL(1) parse table from the previous slide.
Input: <example> <id> id_123 </id> <value> 456 </value> <state> ON </state> </example>
Tokens read: bE
Stack: s $

Parsing Example (cont'd)
Entry [s, bE] selects s -> bE t eE.
Tokens read: bE
Stack: bE t eE $

Parsing Example (cont'd)
bE on the stack matches the input token and is popped.
Tokens read: bE bI
Stack: t eE $

Parsing Example (cont'd)
Entry [t, bI] selects t -> t1 t2 t3.
Tokens read: bE bI
Stack: t1 t2 t3 eE $

Parsing Example (cont'd)
Entry [t1, bI] selects t1 -> bI CD eI.
Tokens read: bE bI
Stack: bI CD eI t2 t3 eE $

Parsing Example (cont'd)
bI is matched and popped; CD is matched next.
Tokens read: bE bI CD
Stack: CD eI t2 t3 eE $
Action: invoke isIdType()

Parsing Example (cont'd)
Parsing continues in the same way until the input is exhausted and the parser accepts.
Tokens read: bE bI CD … $
Stack: $
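The stack trace on these slides corresponds to a conventional table-driven LL(1) loop. The sketch below is a minimal illustration under stated assumptions rather than the TDX engine itself: semantic actions are encoded as hypothetical marker symbols starting with "@" inside the right-hand sides (with "@doStateOn" standing in for the EV symbol), and the three action functions are placeholder stand-ins for isIdType, isValueType, and doStateOn.

```python
import re

# Right-hand sides numbered as in the example grammar; "@name" marks an action.
PRODUCTIONS = {
    1: ["bE", "t", "eE"],
    2: ["t1", "t2", "t3"],
    3: ["bI", "CD", "@isIdType", "eI"],
    4: ["bV", "CD", "@isValueType", "eV"],
    5: ["bS", "v", "eS"],
    6: ["cON", "@doStateOn"],
    7: ["cOFF"],
}
TABLE = {("s", "bE"): 1, ("t", "bI"): 2, ("t1", "bI"): 3, ("t2", "bV"): 4,
         ("t3", "bS"): 5, ("v", "cON"): 6, ("v", "cOFF"): 7}
NONTERMINALS = {"s", "t", "t1", "t2", "t3", "v"}

def parse(tokens, actions):
    """tokens: list of (token, text) pairs ending with ("$", None)."""
    stack = ["$", "s"]
    pos = 0
    last_text = None
    while stack:
        top = stack.pop()
        token, text = tokens[pos]
        if top.startswith("@"):
            # Semantic action: consumes no input; returning False rejects.
            if actions[top[1:]](last_text) is False:
                return False
        elif top in NONTERMINALS:
            number = TABLE.get((top, token))
            if number is None:
                return False  # empty table entry: invalid document
            stack.extend(reversed(PRODUCTIONS[number]))
        elif top != token:
            return False      # terminal mismatch
        elif top == "$":
            return True       # stack and input both exhausted: accept
        else:
            last_text = text
            pos += 1
    return False

events = []
actions = {
    "isIdType": lambda text: bool(text),
    "isValueType": lambda text: re.fullmatch(r"[+-]?\d+", text) is not None,
    "doStateOn": lambda _text: events.append("state ON"),
}
tokens = [("bE", None), ("bI", None), ("CD", "id_123"), ("eI", None),
          ("bV", None), ("CD", "456"), ("eV", None),
          ("bS", None), ("cON", None), ("eS", None), ("eE", None), ("$", None)]
print(parse(tokens, actions), events)
```

Because the engine only consults `TABLE`, `PRODUCTIONS`, and `actions`, swapping in tables generated for a different schema requires no change to the loop, which is the modularity argument made earlier in the talk.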

Experiment Results
Test environment:
- 2.4 GHz P4, 512 MB RAM, Red Hat Linux 3.2.2-5, GNU compiler g++ 3.2.3 with option -O2
- Memory-resident XML message
- Measured as elapsed real time using gettimeofday() over 100 runs
Compared with:
- DFA-based parser
- gSOAP 2.7
- eXpat 1.2
- Xerces 2.7.0
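The measurement procedure (a memory-resident message, elapsed real time, averaged over 100 runs) can be sketched as below. This is not the authors' benchmark harness: it uses Python's built-in Expat binding only as a convenient non-validating parser, `build_message` and `time_parse` are hypothetical helpers, and the message content is made up, so the resulting numbers are not comparable to those in the talk.

```python
import time
import xml.parsers.expat

def build_message(target_bytes=1024):
    """Build an in-memory echoString-style message of at most target_bytes."""
    head, tail, item = "<echoString>", "</echoString>", "<input>hello</input>"
    count = (target_bytes - len(head) - len(tail)) // len(item)
    return (head + item * count + tail).encode("utf-8")

def time_parse(message, runs=100):
    """Average elapsed wall-clock seconds per parse over the given runs."""
    start = time.perf_counter()
    for _ in range(runs):
        # An Expat parser object cannot be reused after a final Parse call.
        parser = xml.parsers.expat.ParserCreate()
        parser.Parse(message, True)
    return (time.perf_counter() - start) / runs

message = build_message()
print(f"{len(message)} bytes, {time_parse(message) * 1e6:.1f} microseconds per parse")
```

Keeping the message in memory excludes I/O cost from the measurement, and averaging over many runs reduces the influence of timer resolution on such short parses.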

Experiment Results (cont'd)
XML Schema for echoString (abbreviated syntax):
  <schema>
    <element name="echoString">
      <complexType>
        <sequence>
          <element name="input" type="xsd:string" maxOccurs="unbounded"/>
        </sequence>
      </complexType>
    </element>
  </schema>

Parsing Performance (1)
XML document size: 1024B

Parsing Performance (2)

Conclusions
TDX is fast:
- Integrated approach across layers
- Avoids schema access at runtime
- Comparisons are done on tokens
- Avoids data copying
- Avoids format conversions
- Minimizes function calls
- Optimization based on schema structure

Conclusions (cont'd)
- XML can be parsed, validated, and deserialized efficiently for high-performance Web services using a table-driven methodology
- TDX can be up to several times faster than industry-strength, high-performance validating XML parsers
- The table-driven methodology offers a high level of modularity and provides a mechanism for integrating application-specific events, such as SOAP deserializers

Thank You