Parallel XML Parsing Using Meta-DFAs Yinfei Pan 1, Ying Zhang 1, Kenneth Chiu 1, Wei Lu 2 1 State University of New York (SUNY) Binghamton 2 Indiana University.

Motivation XML has gained wide prevalence as a data format for input and output. Multicore CPUs are becoming widespread. –Plans for 100 cores. If you have 100 cores and use only one of them to read your input and write your output, that could be a significant waste.

Parallel XML Parsing How can XML parsing be parallelized? –Task parallelism. –Pipeline parallelism. –Data parallelism.

Task parallelism. –Multiple independent processing steps. –The sauce for a dish can be made in parallel with the main part. [Figure: timeline of Step 1, Steps 2A and 2B, and Step 3 across Core 1 and Core 2.]

Pipeline parallelism. –Multiple stages, all performed simultaneously. –If you are making two cakes (but only have one oven), you can start mixing the batter for the second cake while the first one is in the oven. [Figure: timeline of a three-stage pipeline on Cores 1 to 3; at successive time steps, Stages 1, 2, and 3 process Data C, B, A, then Data D, C, B, then Data E, D, C.]

Data parallelism –Divide the data up, process multiple pieces in parallel. [Figure: Input Chunks 1 to 3 are processed on Cores 1 to 3; the output chunks are merged into the final output.]

But XML is Inherently Sequential How can a chunk be parsed without knowing what came before? The parser doesn’t know what state to start in. One could scan forwards and backwards in various ways, but that is ad hoc and tricky. –Special characters like < can appear in comments.

Previous work We used a fast, sequential preparse scan –It builds an outline of the document (the skeleton) –The skeleton guides the full parse by first decomposing the XML document into well-formed fragments at well-defined, unambiguous positions –The XML fragments are parsed separately on each core with Libxml2 APIs –The results are merged into the final DOM with Libxml2 APIs The preparse is sequential, however, so Amdahl’s law kicks in. We scale well to about 4 cores. So how can we parallelize the preparse?
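To make the Amdahl's law point concrete, here is a small sketch of the standard formula, speedup = 1 / (s + (1 - s) / n), where s is the serial fraction and n the number of cores. The 10% serial fraction used below is a hypothetical value chosen for illustration, not a measurement from this work.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Upper bound on speedup when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


if __name__ == "__main__":
    # Hypothetical: assume the sequential preparse takes 10% of total parse time.
    for cores in (1, 4, 30):
        print(cores, round(amdahl_speedup(0.10, cores), 2))
```

Under that assumption the bound flattens quickly as cores are added, which is why the sequential stage is the one worth parallelizing.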

DFA for XML Parsing First, we model XML parsers as a DFA with states, transitions, and actions on transitions. The transition function maps the current state and input character to a new state and an action. As the DFA makes transitions, it executes the sequence of actions it encounters. We believe this model is general enough that the large majority of parsers fit it.
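A minimal sketch of a DFA with actions on transitions follows. The four states, the "other" character class, and the names TRANSITIONS, step, and run_dfa are hypothetical and chosen for illustration; this toy is not the preparsing DFA from the slides and does not check well-formedness.

```python
# Hypothetical toy states: 0 = content, 1 = just saw '<', 2 = in start tag, 3 = in end tag.
# Each entry maps (state, character) to (next state, action or None).
TRANSITIONS = {
    (0, "<"): (1, None),
    (0, "other"): (0, None),
    (1, "/"): (3, None),
    (1, "other"): (2, "START"),
    (2, ">"): (0, None),
    (2, "other"): (2, None),
    (3, ">"): (0, "END"),
    (3, "other"): (3, None),
}


def step(state, ch):
    """Transition function: (state, character) -> (new state, action)."""
    key = (state, ch) if (state, ch) in TRANSITIONS else (state, "other")
    return TRANSITIONS[key]


def run_dfa(text, state=0):
    """Run the DFA over text, collecting (action, position) pairs as they fire."""
    actions = []
    for pos, ch in enumerate(text):
        state, action = step(state, ch)
        if action is not None:
            actions.append((action, pos))
    return state, actions


if __name__ == "__main__":
    print(run_dfa("<a>x</a>"))
```

The point of the model is that the parser's entire memory is the single current state, which is what makes it possible to ask "what if the chunk started in state q?" for every q.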

Example: The Preparsing DFA The preparsing DFA has two actions: START and END, which are used to build the skeleton during execution of the DFA.

Example of running preparsing DFA [Figure: state-by-state trace of the preparsing DFA on a sample input, showing where the START and END actions fire.] How can this be parallelized?

Meta-DFA Goal –Pursue simultaneously all possible states at the beginning of a chunk when a processor is about to parse it Achieved by: –Transforming the original DFA into a meta-DFA whose transition function runs multiple instances of the original DFA in parallel via sub-DFAs –For each state q of the original DFA, the meta-DFA includes a complete copy of the DFA as a sub-DFA that begins execution in state q at the beginning of the chunk –In actual execution, the meta-DFA transitions from one vector of states to another
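The vector-of-states idea can be sketched as follows. This simulates the meta-DFA by stepping every sub-DFA on each character rather than precomputing a meta-transition table; the toy transition table and the names step and run_meta are hypothetical and only for illustration.

```python
# Hypothetical toy DFA: 0 = content, 1 = just saw '<', 2 = in start tag, 3 = in end tag.
TRANSITIONS = {
    (0, "<"): (1, None), (0, "other"): (0, None),
    (1, "/"): (3, None), (1, "other"): (2, "START"),
    (2, ">"): (0, None), (2, "other"): (2, None),
    (3, ">"): (0, "END"), (3, "other"): (3, None),
}
STATES = (0, 1, 2, 3)


def step(state, ch):
    key = (state, ch) if (state, ch) in TRANSITIONS else (state, "other")
    return TRANSITIONS[key]


def run_meta(chunk):
    """Component i of the vector tracks the sub-DFA that started the chunk in state i."""
    vector = tuple(STATES)
    outputs = [[] for _ in STATES]  # one candidate output per possible start state
    for pos, ch in enumerate(chunk):
        next_vector = []
        for i, state in enumerate(vector):
            new_state, action = step(state, ch)
            if action is not None:
                outputs[i].append((action, pos))
            next_vector.append(new_state)
        vector = tuple(next_vector)
    return vector, outputs


if __name__ == "__main__":
    print(run_meta("x</a>"))
```

Each chunk yields one end state and one output per possible start state; only one of them turns out to be correct once the preceding chunk's true end state is known.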

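Constructing meta-DFA Steps on constructing meta-DFA [Figure: the meta-DFA built from the two-state example DFA. Its states are the vectors [0, 1], [1, _], [_, 0], [_, 1], [_, _], and [0, _], where _ stands for a symbol lost in extraction; edges are labeled <, >, and a.]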
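One way to sketch the construction is a breadth-first exploration of the state vectors reachable from the initial vector. The DELTA table below is an assumed reading of the two-state example (0 goes to 1 on <, 1 goes to 0 on >, and a loops on both states), and None is a stand-in chosen here for a sub-DFA that has no valid transition; neither is confirmed by the slide text.

```python
from collections import deque

ALPHABET = "<>a"
STATES = (0, 1)
# Assumed example DFA; a missing entry means "no transition" for that sub-DFA.
DELTA = {(0, "<"): 1, (0, "a"): 0, (1, ">"): 0, (1, "a"): 1}


def build_meta_dfa():
    """Return (start vector, table) where table[vector][char] is the next vector."""
    start = STATES  # component i is the sub-DFA that begins in state i
    table = {}
    queue = deque([start])
    while queue:
        vector = queue.popleft()
        if vector in table:
            continue  # already expanded
        table[vector] = {}
        for ch in ALPHABET:
            nxt = tuple(None if s is None else DELTA.get((s, ch)) for s in vector)
            table[vector][ch] = nxt
            queue.append(nxt)
    return start, table


if __name__ == "__main__":
    start, table = build_meta_dfa()
    for vector, row in table.items():
        print(vector, row)
```

Under this reading the exploration reaches six vectors, the same number of meta-states the figure lists, and the loop terminates because the set of possible vectors is finite.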

Output Merging Since the meta-DFA pursues multiple possibilities simultaneously, there are also multiple outputs when a chunk is finished. –One corresponding to each possible initial state. We know definitively the state at the end of the first chunk. –This is used to select which output of the second chunk is the correct one. –The definitive state at the end of the second chunk is now known. –Etc.
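The selection step can be sketched as a single left-to-right pass over the per-chunk results. The data layout (a pair of end_states and outputs indexed by start state) and the name merge are hypothetical, and the two-chunk input below is made-up illustration data.

```python
def merge(chunk_results, start_state=0):
    """Pick the correct output of each chunk using the known end state of the previous one.

    chunk_results: list of (end_states, outputs), where end_states[q] and outputs[q]
    are the end state and output of the chunk had it started in state q.
    """
    state = start_state  # the state at the start of the document is known
    merged = []
    for end_states, outputs in chunk_results:
        merged.extend(outputs[state])  # keep only the output for the true start state
        state = end_states[state]      # this is now the definitive state for the next chunk
    return state, merged


if __name__ == "__main__":
    # Made-up results for two chunks of a two-state DFA.
    chunk1 = ((1, 0), (["START"], []))
    chunk2 = ((0, 0), ([], ["END"]))
    print(merge([chunk1, chunk2]))
```

The pass is sequential, but it does only one lookup per chunk, so its cost depends on the number of chunks rather than the size of the document.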

Performance Evaluation Machine: –Sun E6500 with 30 US-II processors at 400 MHz –Operating System: Solaris 10 –Compiler: g++ 4.0 with the option -O3 –XML Standard Library: Libxml2 2.6.16 Tests: –We take the average of ten runs –The test file is taken from the well-known Protein Data Bank (PDB) project and sized to 34 MB –All speedups are measured against parsing with stand-alone Libxml2

The full parsing process is: –First do a parallel preparse using a meta-DFA. This generates an outline of the document known as the skeleton. –Then use techniques based on parallel depth-first tree search to parallelize the full parse. –Subtrees of the document are parsed using unmodified libxml2.

Preparser Speedup Parallel preparser relative to the non-parallel preparser

Speedup on parallel full parsing After applying our meta-DFA technique to parallelize the preparsing stage, the parallel full parsing is now scalable.

Analysis

Summary Data parallel XML parsing is challenging because the parser does not know in which state to begin a chunk. –One solution is to simply begin the parser in all states simultaneously. This can be achieved by modeling the parser as a DFA with actions, then transforming the DFA into a meta-DFA (product machine). The meta-DFA runs multiple instances of the original DFA, one instance for each state of the original DFA. The number of states in the meta-DFA is finite, so it is also a DFA and can be executed by a single core. –The parallelism of the meta-DFA is logical parallelism.

Questions

Constructing Meta-DFA from DFA We take this simple DFA as the example [Figure: a two-state DFA with states 0 and 1 and transitions labeled <, >, and a.]

Execution of meta-DFA

Content –Introduction and previous work –DFA for XML Parsing –Constructing meta-DFA from DFA for Parallel XML Parsing –Performance results –Conclusion

Applying Multi-core CPUs to XML The chip industry trend now is to design multiple cores instead of faster CPUs. –So, this calls for real parallel techniques XML is now very common in message-oriented interactions, but parsing it is slow Some solutions with multi-core parallelism –One core for parsing one XML message Problem: If the application must first fully process one large XML file, other cores will idle. –Parallelism available on single XML input This is our work: Parallel DOM-style XML parsing

Problem of Previous Work The preparse stage is sequential, so it only scales to about 4 cores In this paper, we –Parallelize this stage. We only need to know the size of the XML document, and we physically decompose the document into equal-sized chunks. –In doing so, we introduce an XML parsing model based on a DFA. Our preparsing can be expressed in this model. We then parallelize it by transforming the DFA into a meta-DFA.
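Because only the document size is needed, the physical decomposition can be sketched as plain byte-offset arithmetic. The function name chunk_bounds and the policy of giving leftover bytes to the first chunks are assumptions made for illustration.

```python
def chunk_bounds(size: int, n_chunks: int):
    """Split [0, size) into n_chunks contiguous (start, end) ranges of near-equal length."""
    base, extra = divmod(size, n_chunks)
    bounds = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional byte each.
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


if __name__ == "__main__":
    print(chunk_bounds(10, 3))
```

The boundaries ignore XML structure entirely, so a chunk may begin in the middle of a tag or comment; that is exactly the case the meta-DFA is meant to handle.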

Introduction Performance concerns of XML parsing –XML is self-descriptive, so verbose by design One way to improve the performance –Use multiple cores for parallel processing

Parse single XML input in parallel We do it by: –Decomposing the XML document into chunks –Parsing the chunks in parallel, one core per chunk –Merging the results of each chunk into the final, true result However, we met a problem: –A chunk cannot be parsed unambiguously on its own, since the true state of the parser at the start of a chunk is unknown until the preceding chunks are parsed. –Our previous work did something effective…
