1
Validating Streaming XML Documents Luc Segoufin & Victor Vianu Presented by Harel Paz
2
The Challenge XML becoming a standard for data exchange on the Web. Need: on-line processing of large amounts of data in XML format, using limited memory. Our focus: validating XML documents against given DTDs.
3
Validating Streaming XML Documents Restrictions on the validation: it must run in a single pass, using a fixed amount of memory that depends only on the DTD. [Figure: an input stream of tags is fed to an FSA, which runs from its start state and answers Yes/No (accept or reject).]
4
The Problem in 2 Flavors There are 2 flavors to the problem: Strong validation: validation that includes checking well-formedness. Validation: checking satisfaction of the DTD, under the assumption that the input is a well-formed XML document.
5
Tree Document XML documents are abstracted by "tree documents". A tree document over a finite alphabet Σ is a finite unranked tree with labels in Σ and an order on the children of each node. [Figure: an example tree document with root r and nodes labeled a, b, and c.]
6
String Representation An XML document is a string representation of a tree that uses opening and closing tags for each element. For each a ∈ Σ, a itself represents the opening tag <a> and ā represents the closing tag </a>. Notation: Σ̄ = {ā | a ∈ Σ}. [Figure: the example tree document with root r.]
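A minimal sketch of the string representation, not taken from the slides: the `Node` class and the convention of writing the closing tag ā as `/a` are assumptions made only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of a tree document: a label plus an ordered list of children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def to_tags(node: Node) -> List[str]:
    """Return the string representation as a list of tags ('a' opens, '/a' closes)."""
    out = [node.label]                # opening tag
    for child in node.children:       # children in document order
        out.extend(to_tags(child))
    out.append("/" + node.label)      # closing tag
    return out


tree = Node("r", [Node("a", [Node("b"), Node("c")])])
print(" ".join(to_tags(tree)))  # r a b /b c /c /a /r
```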
7
DTDs A DTD consists of an extended context-free grammar over the alphabet Σ. Example DTD d: r → a*, a → bc, b → c?, c → ε. A tree document over Σ satisfies a DTD if it is a derivation tree of the grammar. [Figure: the example tree document with root r, which satisfies d.]
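Satisfaction of a DTD can be checked node by node: the sequence of child labels of every node must match the rule of that node's label. The sketch below assumes a hypothetical encoding in which each label is a single lowercase character, each rule is a Python regular expression over labels, and a tree is a `(label, children)` pair.

```python
import re

# The example DTD: r -> a*, a -> bc, b -> c?, c -> epsilon (empty regex).
DTD = {"r": "a*", "a": "bc", "b": "c?", "c": ""}


def satisfies(tree, dtd):
    """True if the tree is a derivation tree of the grammar given by dtd."""
    label, children = tree
    word = "".join(child[0] for child in children)  # child labels, in order
    if label not in dtd or re.fullmatch(dtd[label], word) is None:
        return False
    return all(satisfies(child, dtd) for child in children)


leaf_c = ("c", [])
a_with_c_under_b = ("a", [("b", [leaf_c]), leaf_c])
a_plain = ("a", [("b", []), leaf_c])

print(satisfies(("r", [a_with_c_under_b, a_plain]), DTD))  # True
print(satisfies(("r", [("a", [leaf_c])]), DTD))            # False
```

This check works on a fully materialized tree, so it is only a reference point for the streaming setting discussed in the rest of the talk.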
8
DTDs (cont'd) Each DTD d has a unique rule a → R_a for each symbol a ∈ Σ; R_a denotes the regular expression on the right-hand side of the rule. L(d) is the language over Σ ∪ Σ̄ consisting of the string representations of all tree documents satisfying d.
9
Strong Validation of Streaming XML Documents The problem: validating an XML document with respect to a given DTD, including checking well-formedness. We need to characterize the DTDs d for which L(d) can be recognized by an FSA. Such DTDs are called strongly recognizable.
10
Strong Validation – Example 1 DTD d: r → a, a → a?. L(d) = {r aⁿ āⁿ r̄ | n ≥ 1} is not regular, so d cannot be strongly validated by an FSA: d is not strongly recognizable. [Figure: a tree with root r and a single chain of a nodes below it.]
11
Strong Validation – Example 2 DTD d: r → a*, a → b|c. L(d) is regular, so d is strongly recognizable. [Figure: a tree with root r whose children are a nodes, each with a single child b or c.]
12
More Definitions Let d be a DTD over Σ. The dependency graph of d, G_d, is the graph constructed as follows: its set of vertices is Σ, and for each rule a → R_a in d there is an edge from a to b for each b occurring in R_a.
13
More Definitions (cont'd) Two labels a and b are mutually recursive if they belong to some cycle of G_d. A label a is recursive if it is mutually recursive with itself. A DTD d is non-recursive iff G_d is acyclic. A DTD d is fully recursive if all labels from which recursive labels are reachable in G_d are mutually recursive.
14
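Dependency Graph – Examples DTD d: r → a, a → a?. G_d has an edge from r to a and a self-loop on a, so G_d is not acyclic: a is recursive, and d is not fully recursive (the recursive label a is reachable from r, but r and a are not mutually recursive). DTD d: r → a*, a → b|c. G_d has edges from r to a and from a to b and c, so d is non-recursive.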
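The definitions of the dependency graph, recursive labels, and fully recursive DTDs can be sketched as follows. The encoding (single-character lowercase labels, rules as regex strings, one rule per label) and the function names are assumptions for illustration, not part of the original presentation.

```python
import re


def dependency_graph(dtd):
    """Edges a -> b for every label b occurring in the rule of a."""
    return {a: set(re.findall(r"[a-z]", rule)) for a, rule in dtd.items()}


def reachable(graph, start):
    """Labels reachable from start by a path of length >= 1."""
    seen, todo = set(), list(graph[start])
    while todo:
        x = todo.pop()
        if x not in seen:
            seen.add(x)
            todo.extend(graph[x])
    return seen


def classify(dtd):
    graph = dependency_graph(dtd)
    reach = {a: reachable(graph, a) for a in graph}
    # A label is recursive if it lies on a cycle, i.e. it can reach itself.
    recursive = {a for a in graph if a in reach[a]}
    # Labels from which some recursive label is reachable.
    relevant = {a for a in graph if reach[a] & recursive}
    # Fully recursive: all such labels are pairwise mutually recursive.
    # Note: this holds vacuously when there are no recursive labels.
    fully = all(b in reach[a] and a in reach[b]
                for a in relevant for b in relevant)
    return {"recursive": bool(recursive), "fully_recursive": fully}


d1 = {"r": "a", "a": "a?"}                       # first example
d2 = {"r": "a*", "a": "b|c", "b": "", "c": ""}   # second example
print(classify(d1))
print(classify(d2))
```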
15
Characterization of Strongly Recognizable DTDs Theorem 3.1 (partial): A DTD is strongly recognizable iff it is non-recursive. Proof sketch: If d is a strongly recognizable DTD, there is an FSA recognizing exactly L(d). Suppose, towards a contradiction, that d is recursive; the pumping lemma then shows that this FSA also accepts strings that are not well-balanced. Conversely, if d is non-recursive, an algorithm that builds an FSA recognizing L(d) is given.
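For the non-recursive direction, one way to see that L(d) is regular is to inline every rule into its parent until a single regular expression over tags remains. The sketch below does this with Python regexes instead of an explicit FSA; it is not the algorithm from the paper, and the encoding (single-character labels, tags written `<a>` and `</a>`) is an assumption.

```python
import re

# Non-recursive example DTD: r -> a*, a -> b|c, b -> epsilon, c -> epsilon.
DTD = {"r": "a*", "a": "b|c", "b": "", "c": ""}


def expand(label, dtd):
    """Regex for the string representations of all valid subtrees rooted at label.

    Terminates only for non-recursive DTDs: a recursive label would be
    inlined into itself forever.
    """
    body = re.sub(
        r"[a-z]",
        lambda m: "(?:" + expand(m.group(0), dtd) + ")",
        dtd[label],
    )
    return "<%s>(?:%s)</%s>" % (label, body, label)


pattern = re.compile(expand("r", DTD))
print(bool(pattern.fullmatch("<r><a><b></b></a><a><c></c></a></r>")))  # True
print(bool(pattern.fullmatch("<r><a></a></r>")))                       # False
```

Because the resulting expression is an ordinary regular expression, it checks well-formedness and the DTD at the same time, which is what strong validation asks for.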
16
Validating Well-Formed XML Documents The problem: validating an XML document with respect to a given DTD, assuming the document is well-formed. DTDs for which this validation can be done by an FSA are called recognizable. The requirement that L(d) be regular is now too strong: the FSA only has to work correctly on well-balanced strings representing trees.
17
Validation - Example 1 DTD d: r → a, a → a?. d is not strongly recognizable, but it is recognizable: if the input is known to be well-balanced, the FSA just has to check that the string is of the form r a* ā* r̄ (more precisely, r a⁺ ā⁺ r̄).
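A sketch of such an FSA for the DTD r → a, a → a?, written as a transition table. The tag encoding (`a` opens, `/a` closes) and the state numbering are assumptions for illustration. The automaton only checks the shape r a⁺ ā⁺ r̄; it relies on the input being well-balanced and does not count tags.

```python
# (state, tag) -> next state; missing entries mean "reject".
TRANSITIONS = {
    (0, "r"): 1,    # opening root tag
    (1, "a"): 2,    # first opening a (at least one is required)
    (2, "a"): 2,    # further opening a tags
    (2, "/a"): 3,   # first closing a
    (3, "/a"): 3,   # further closing a tags
    (3, "/r"): 4,   # closing root tag
}
ACCEPTING = {4}


def accepts(tags):
    state = 0
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:
            return False
    return state in ACCEPTING


print(accepts("r a a /a /a /r".split()))  # True: well-balanced and valid
print(accepts("r /r".split()))            # False: r must have a child a
print(accepts("r a a /a /r".split()))     # True, although not well-balanced
```

The last call shows why this is validation rather than strong validation: on input that is not well-balanced, the answer of the FSA carries no meaning.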
18
Validation - Example 2 DTD d: a → (ab | ca | ε), b → ε, c → ε. d is not recognizable. An FSA cannot store enough information to recall, when it reads ā, whether the corresponding a node has a left sibling c (in which case b is not allowed to its right). [Figure: example trees built from the ab and ca productions.]
19
Characterizing Recognizable DTDs Which DTDs are recognizable? Non-recursive DTDs are. What about recursive DTDs? This is not a trivial question. Are there necessary conditions for a DTD to be recognizable? Are there subclasses of DTDs for which the necessary conditions are also sufficient?
20
Necessary Condition for a Recognizable DTD Lemma 4.2: Let d be a recognizable DTD. Then the following hold, where … are words over Σ while … (possibly subscripted) are individual symbols: Let … be a positive integer and let … be mutually recursive symbols of d (not necessarily distinct). If … and … for …, then … must be in ….
21
Fully Recursive DTDs The necessary condition for recognizability stated in Lemma 4.2 is also sufficient when the DTD is fully recursive. Next, we'll see how to construct, for a DTD d, an FSA that accepts all words in L(d) (and possibly more). For fully recursive DTDs satisfying the conditions of Lemma 4.2, the well-balanced words accepted by this FSA are precisely the words in L(d) (it may also accept words that are not well-balanced).
22
The Standard FSA Let d be a DTD over alphabet Σ. Consider the equivalence relation on Σ whose equivalence classes are the strongly connected components of G_d. Let ≤ be the partial order on these classes induced by G_d: A ≤ B whenever, for some a ∈ A and b ∈ B, there is an edge from a to b in G_d. The order ≤ may have several maximal classes, but only one minimum class.
23
Example DTD d: r → aa, a → a?. The classes of d are {r} and {a}, with {r} ≤ {a}. [Figure: G_d, with an edge from r to a and a self-loop on a.]
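The classes and their order can be computed directly from the dependency graph. This is a small sketch under assumed conventions (single-character labels, rules as regex strings); it uses plain reachability rather than an optimized strongly-connected-components algorithm, and it returns the strict order A < B as a set of pairs.

```python
import re


def classes_and_order(dtd):
    graph = {a: set(re.findall(r"[a-z]", rule)) for a, rule in dtd.items()}

    def reach(a):
        """Labels reachable from a, including a itself."""
        seen, todo = {a}, [a]
        while todo:
            for y in graph[todo.pop()]:
                if y not in seen:
                    seen.add(y)
                    todo.append(y)
        return seen

    r = {a: reach(a) for a in graph}
    # Two labels are in the same class iff each is reachable from the other.
    classes = {frozenset(b for b in graph if b in r[a] and a in r[b])
               for a in graph}
    # A < B iff some label of B is reachable from some label of A.
    order = {(A, B) for A in classes for B in classes
             if A != B and any(b in r[a] for a in A for b in B)}
    return classes, order


classes, order = classes_and_order({"r": "aa", "a": "a?"})
print(sorted(sorted(c) for c in classes))  # [['a'], ['r']]
```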
24
Example (cont'd) DTD d: r → aa, a → a?. [Figure: first, an FSA for the regular expression of a is constructed; then the FSA for the string representation of class {a} is constructed from it by adding transitions for each of its edges.]
25
Example (cont'd) DTD d: r → aa, a → a?. [Figure: the standard FSA constructed for d.]
26
Example (cont'd) DTD d: r → aa, a → a?. The above FSA accepts all well-balanced words in L(d), but it also accepts other well-balanced words. In fact, no FSA recognizes this DTD: d is not recognizable.
27
Recognizable Fully Recursive DTDs Theorem 4.1: The following are equivalent for each fully recursive DTD d: (i) d is recognizable. (ii) d satisfies the conditions of Lemma 4.2. (iii) The set of well-balanced strings accepted by the standard FSA for d is precisely L(d).
28
Recognizable DTDs Which DTDs are recognizable? Non-recursive DTDs. Fully recursive DTDs satisfying the conditions of Lemma 4.2. And others. A characterization in the general case remains an open question. Partial progress: necessary conditions for recognizability.
29
Alternative Validation Approaches Two alternative approaches for validating against DTDs that are not recognizable: relaxing the constant memory requirement, or refining the original DTD.
30
Validation with Bounded Stack Relax the constant memory requirement: use a stack whose depth is bounded by the depth of the XML document. Validation is still done in a single deterministic pass, which makes the approach appealing in practice. For each DTD, there exists a deterministic PDA that accepts precisely its language. Example: the DTD r → aa, a → a?.
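A sketch of the stack-based approach for the example DTD r → aa, a → a?. Each stack entry records an open element together with the current state of a DFA for that element's rule, so the stack depth never exceeds the depth of the document. The hand-written DFA tables, the tag encoding (`a` opens, `/a` closes), and the function name are assumptions for illustration.

```python
# Hand-written DFAs for the rules r -> aa and a -> a? (state 0 is initial).
DFAS = {
    "r": {"delta": {(0, "a"): 1, (1, "a"): 2}, "final": {2}},
    "a": {"delta": {(0, "a"): 1}, "final": {0, 1}},
}


def validate(tags, root="r"):
    """Single deterministic pass; also rejects input that is not well-formed."""
    stack = []    # one (label, DFA state) pair per currently open element
    done = False  # becomes True once the root element has been closed
    for tag in tags:
        if done:
            return False                  # nothing may follow the root
        if tag.startswith("/"):
            label = tag[1:]
            if not stack or stack[-1][0] != label:
                return False              # mismatched closing tag
            _, state = stack.pop()
            if state not in DFAS[label]["final"]:
                return False              # children do not match the rule
            done = not stack
        else:
            if tag not in DFAS:
                return False              # unknown label
            if stack:
                parent, state = stack[-1]
                nxt = DFAS[parent]["delta"].get((state, tag))
                if nxt is None:
                    return False          # child not allowed at this position
                stack[-1] = (parent, nxt)
            elif tag != root:
                return False
            stack.append((tag, 0))
    return done


print(validate("r a a /a /a a /a /r".split()))   # True
print(validate("r a /a a /a a /a /r".split()))   # False: r has three children
```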
31
Refining the DTD Refining a DTD means providing additional information in the tags that can be used for validation. For every DTD there exists an equivalent refined DTD, of size quadratic in the size of the original, that is recognizable, i.e., that can be validated by an FSA. [Example: a DTD and its refined version.]
32
Summary First step towards the formal investigation of processing streaming XML. Provided conditions under which validation can be done in a single pass and constant memory, using an FSA. Considered alternative approaches, when validation using an FSA is not possible.
33
Appendix The Standard FSA Construction
34
The Standard FSA The standard FSA is constructed inductively, starting from the maximal elements of ≤. Let A be a maximal element of ≤. For each regular expression R_a (a ∈ A), a non-deterministic FSA is built, with disjoint sets of states for different a's. Each of these FSAs has one initial state and a set of final states.
35
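The Standard FSA (cont'd) Build the FSA for the class A: its states are the union of the states of the FSAs for R_a, a ∈ A. Transitions: for each transition of the FSA for R_a from a state p to a state q on a symbol b, add a transition on the opening tag b from p to the initial state of the FSA for R_b, and a transition on the closing tag b̄ from each final state of the FSA for R_b to q. Here b must belong to A, since A is a maximal element of ≤.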
36
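The Standard FSA (cont'd) Build the FSA for a non-maximal element A of ≤ once the FSAs of all elements B with A < B have been constructed. Unlike in the maximal case, the FSAs for the regular expressions of A may have transitions on symbols b that are not in A (i.e., b belongs to some class B with A < B). For each such transition we add: a new disjoint copy of the FSA already constructed for B; a transition on the opening tag b to the initial state for b in that copy; and a transition on the closing tag b̄ from each final state for b in that copy.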
37
The Standard FSA (cont'd) The final FSA is obtained by adding to the FSA of the minimum class (the one containing the root label r): a new start state, with a transition on the opening tag r to the start state of the FSA for R_r; and a new final state, with a transition on the closing tag r̄ from each final state of the FSA for R_r.
38
The Standard FSA - Lemma Lemma 4.3: For each DTD d, consider the standard FSA described above. Then: (i) every word in L(d) is accepted by the standard FSA; (ii) the standard FSA can be constructed from d in exponential time. The complexity of the construction is expressed in terms of the maximum size of an FSA for a regular expression of d and the depth of the partial order ≤.