On Inferring K Optimum Transformations of XML Document from Update Script to DTD
Nobutaka Suzuki
Graduate School of Library, Information and Media Studies, University of Tsukuba
Outline
Motivation
Update operations to DTDs
Inferring K optimum transformations from update script
Nondeterministic transformation
NP-hardness
Algorithm for inferring K optimum transformations
Conclusion
Motivation
A DTD is continuously updated to reflect changes in the real world. Once a DTD is updated, XML documents may no longer be valid against the updated DTD. Thus, XML documents have to be transformed according to the DTD update.
[Figure: a DTD is updated into an updated DTD, and the XML documents valid against the original DTD are transformed accordingly.]
Motivation (cont’d)
Problem: for a DTD update, there may be more than one way to transform an XML document. Thus, we have to select one appropriate transformation among them.
[Figure: DTD D is updated to D’; XML document t can be transformed into t1, …, tn.]
Example
DTD before the update:
<!ELEMENT book (chapter)*>
<!ELEMENT chapter (section+,bib?)>
<!ELEMENT section #PCDATA>
<!ELEMENT bib #PCDATA>
DTD after the update:
<!ELEMENT book (section+,bib?)*>
<!ELEMENT section #PCDATA>
<!ELEMENT bib #PCDATA>
[Figure: an XML document with root book and chapter children containing section and bib elements, together with several candidate transformed documents.]
We have to select an appropriate transformation from these candidates.
Inferring K Optimum Transformations
According to an update script s, XML document t can be transformed into t1, …, tn. If t1, …, tn were listed “randomly”, it would be difficult to find an appropriate one; moreover, n may be infinite. It is therefore useful to infer the “K optimum” documents among t1, …, tn from s.
[Figure: update script s updates DTD D to D’; document t is transformed into t1, …, tn.]
Research Objective Determine the complexity of inferring K optimum transformations. Construct an efficient algorithm for solving this problem.
Tree Representation of Regular Expression
Each content model (regular expression) of a DTD is represented by a tree, and each node of the tree is identified by its position.
Example: regular expression r = (a|b|c)d*
Tree representation of r (position: node):
λ: . (concatenation, the root)
1: |
2: *
11: a
12: b
13: c
21: d
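The position scheme can be sketched in a few lines of Python. This is an illustrative encoding rather than the author's implementation: `Node` and `positions` are hypothetical names, the root position λ is represented by the empty string, and writing child indices as single digits assumes fewer than ten children per node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a content-model tree: an operator or an element label."""
    symbol: str
    children: list = field(default_factory=list)


def positions(node, prefix=""):
    """Map every position string to the symbol stored there ("" stands for λ)."""
    table = {prefix: node.symbol}
    for i, child in enumerate(node.children, start=1):
        table.update(positions(child, prefix + str(i)))
    return table


# r = (a|b|c)d*
r = Node(".", [
    Node("|", [Node("a"), Node("b"), Node("c")]),
    Node("*", [Node("d")]),
])

for pos, sym in sorted(positions(r).items()):
    print(pos or "λ", sym)
```

The table produced for r lists the same seven positions as the slide: λ, 1, 2, 11, 12, 13, and 21.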
Update Operations to DTDs
ins_elm(a,b,u): Inserting an element
del_elm(a,u): Deleting an element
agg_elm(a,b,u): Aggregating a subexpression
ext_elm(a,u): Extracting a label
ins_opr(a,opr,u,v): Inserting an operator
del_opr(a,u): Deleting an operator
change_opr(a,opr,u): Replacing the operator
def_cm(a,r): Defining a content model
undef_cm(a): Removing a content model
ins_elm(a,b,u): Inserting an Element
Inserts label b at position u in d(a), where d(a) is the content model of element a.
Example: ins_elm(a, e, 12) changes d(a) = (b|c)d* into d(a) = (b|e|c)d*. In the tree, the new leaf e takes position 12 and c moves to position 13.
del_elm(a,u): Deleting an Element
Replaces the label at u by ε if d(a) has a label at u; deletes the ε at u if d(a) has ε at u.
Example: del_elm(a, 13) changes d(a) = (b|c|e)d* into d(a) = (b|c|ε)d*.
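The two element-level operations can be tried out on a small tree encoding. The sketch below is only an illustration under stated assumptions: a tree is a nested list `[symbol, child1, child2, ...]`, positions are digit strings with the empty string for the root, ε denotes the empty string of the regular expression, and `node_at` and `render` are hypothetical helpers that are not part of the slides.

```python
import copy


def node_at(tree, u):
    """Return the node at position u; child i is stored at list index i."""
    for digit in u:
        tree = tree[int(digit)]
    return tree


def ins_elm(tree, b, u):
    """Insert label b so that it occupies position u; returns a new tree."""
    new = copy.deepcopy(tree)
    node_at(new, u[:-1]).insert(int(u[-1]), [b])
    return new


def del_elm(tree, u):
    """Replace the label at u by ε, or remove the node if it already is ε."""
    new = copy.deepcopy(tree)
    parent, i = node_at(new, u[:-1]), int(u[-1])
    if parent[i] == ["ε"]:
        del parent[i]
    else:
        parent[i] = ["ε"]
    return new


def render(node):
    """Turn a tree back into a regular expression string."""
    sym, kids = node[0], node[1:]
    if not kids:
        return sym
    if sym == "|":
        return "(" + "|".join(render(k) for k in kids) + ")"
    if sym == ".":
        return "".join(render(k) for k in kids)
    inner = render(kids[0])          # unary operator: *, + or ?
    if kids[0][0] == "." and len(kids[0]) > 1:
        inner = "(" + inner + ")"
    return inner + sym


before = [".", ["|", ["b"], ["c"]], ["*", ["d"]]]           # (b|c)d*
print(render(ins_elm(before, "e", "12")))

wide = [".", ["|", ["b"], ["c"], ["e"]], ["*", ["d"]]]      # (b|c|e)d*
print(render(del_elm(wide, "13")))
```

Both functions return a modified copy, so the original content model stays available for comparison.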
agg_elm(a,b,u): Aggregating a Subexpression
Aggregates the subexpression E at u by label b, i.e., replaces E by b and sets d(b) = E.
Example: agg_elm(a, f, 2) changes d(a) = (b|c|e)d* into d(a) = (b|c|e)f and sets d(f) = d*.
ext_elm(a,u): Extracting a Label
Extracts the label at u, i.e., replaces the label at u, say b, by d(b).
Example: with d(a) = (b|c|e)f and d(f) = d*, ext_elm(a, 2) changes d(a) into (b|c|e)d*.
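Aggregation and extraction are inverse to each other on this example, which a short sketch can make concrete. The encoding is an assumption made for illustration: a DTD is a dict from element name to a nested-list tree `[symbol, child1, ...]`, positions are non-empty digit strings, and `node_at` is a hypothetical helper. Whether ext_elm also removes d(b) is not stated on the slides, so the sketch leaves d(b) in place.

```python
import copy


def node_at(tree, u):
    """Return the node at position u; child i is stored at list index i."""
    for digit in u:
        tree = tree[int(digit)]
    return tree


def agg_elm(dtd, a, b, u):
    """Replace the subexpression E at u in d(a) by label b and set d(b) = E."""
    new = copy.deepcopy(dtd)
    parent, i = node_at(new[a], u[:-1]), int(u[-1])
    new[b] = parent[i]
    parent[i] = [b]
    return new


def ext_elm(dtd, a, u):
    """Replace the label at u in d(a), say b, by a copy of d(b)."""
    new = copy.deepcopy(dtd)
    parent, i = node_at(new[a], u[:-1]), int(u[-1])
    label = parent[i][0]
    parent[i] = copy.deepcopy(new[label])
    return new


# d(a) = (b|c|e)d*
dtd = {"a": [".", ["|", ["b"], ["c"], ["e"]], ["*", ["d"]]]}

aggregated = agg_elm(dtd, "a", "f", "2")     # d(a) = (b|c|e)f, d(f) = d*
extracted = ext_elm(aggregated, "a", "2")    # d(a) = (b|c|e)d* again
print(aggregated["a"], aggregated["f"])
print(extracted["a"])
```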
ins_opr(a,opr,u,v): Inserting an Operator
Inserts an operator opr (one of |, ., *, +, ?) as the parent of the nodes at u,…,v in d(a).
Example: ins_opr(a, *, 1, 1) changes d(a) = (b|c|e)d* into d(a) = (b|c|e)*d*. In the tree, the new * node takes position 1, the | node moves to position 11, and b, c, e move to positions 111, 112, 113.
del_opr(a,u): Deleting an Operator
Deletes the operator at u in d(a).
Example: del_opr(a, 1) changes d(a) = (b|c|e)*d* into d(a) = (b|c|e)d*.
change_opr(a,opr,u): Replacing an Operator
Replaces the operator at u in d(a) by opr.
Example: change_opr(a, ?, 1) changes d(a) = (b|c|e)*d* into d(a) = (b|c|e)?d*.
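The three operator-level operations amount to small list manipulations on the tree. The following sketch assumes a nested-list encoding `[symbol, child1, ...]` with digit-string positions; `node_at` is a hypothetical helper, and each function returns a modified copy rather than updating d(a) in place.

```python
import copy


def node_at(tree, u):
    """Return the node at position u; child i is stored at list index i."""
    for digit in u:
        tree = tree[int(digit)]
    return tree


def ins_opr(tree, opr, u, v):
    """Insert opr as the new parent of the sibling nodes at positions u..v."""
    new = copy.deepcopy(tree)
    parent, lo, hi = node_at(new, u[:-1]), int(u[-1]), int(v[-1])
    parent[lo:hi + 1] = [[opr] + parent[lo:hi + 1]]
    return new


def del_opr(tree, u):
    """Delete the operator at u; its children take its place under its parent."""
    new = copy.deepcopy(tree)
    parent, i = node_at(new, u[:-1]), int(u[-1])
    parent[i:i + 1] = parent[i][1:]
    return new


def change_opr(tree, opr, u):
    """Replace the operator stored at u by opr."""
    new = copy.deepcopy(tree)
    node_at(new, u)[0] = opr
    return new


plain = [".", ["|", ["b"], ["c"], ["e"]], ["*", ["d"]]]   # (b|c|e)d*
starred = ins_opr(plain, "*", "1", "1")                   # (b|c|e)*d*
optional = change_opr(starred, "?", "1")                  # (b|c|e)?d*
print(starred)
print(del_opr(starred, "1"))
print(optional)
```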
Nondeterministic Transformation Inferred from Update Operation
Let s = op1…opn be an update script (a sequence of update operations). XML document t is transformed according to opi for each i = 1,…,n.
DTD: D → D1 → … → Dn (by op1, op2, …, opn)
XML document: t → t1 → … → tn
We define how an XML document t is “nondeterministically” transformed according to an update operation.
Nondeterministic Transformation Inferred from Update Operation (cont’d)
This transformation is defined according to the kind of update operation op:
ins_elm(a,b,u)
del_elm(a,u)
agg_elm(a,b,u)
ext_elm(a,u)
ins_opr(a,opr,u,v)
del_opr(a,u)
change_opr(a,opr,u)
In the Case where op = ins_elm(a,b,u)
ins_elm(a,b,u) inserts label b at u in d(a).
The case where the parent of u is | : Do nothing, since the updated content model includes d(a).
The case where the parent of u is . : For each a-element Ea in XML document t do
(a) Find positions at which new b-elements should be inserted so that Ea becomes “valid” against the updated content model of a.
(b) For each position v found in step (a), create a valid b-element and insert it at v.
Example
DTD: ins_elm(a, e, 112) changes d(a) = (bc)*d into d(a) = (bec)*d (assuming that d(e) = fg).
XML document before: an a-element with children b c b c d.
XML document after: an a-element with children b e c b e c d, where each inserted e-element has children f g.
In the Case where op = agg_elm(a,b,u)
agg_elm(a,b,u) replaces the subexpression at u in d(a) by label b.
For each a-element Ea in XML document t do
(a) Find sequences of elements in Ea that match the subexpression at u in d(a).
(b) For each sequence found in step (a), insert a new b-node as the parent of the sequence.
Example
DTD: agg_elm(a, e, 11) changes d(a) = (bc)*d into d(a) = e*d and sets d(e) = bc.
XML document before: an a-element with children b c b c d.
XML document after: an a-element with children e e d, where each e-element has children b c.
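Steps (a) and (b) for agg_elm can be approximated with Python's `re` module when every label is a single character. This is a simplified sketch, not the transformation defined in the talk: `aggregate_children` is a hypothetical name, children are `(label, children)` pairs, and the subexpression is matched against the label sequence on its own, ignoring where it sits inside d(a). It also returns only one grouping (leftmost, non-overlapping), whereas the transformation is nondeterministic in general.

```python
import re


def aggregate_children(children, sub_regex, new_label):
    """Wrap every non-empty match of sub_regex in a new element new_label.

    children: list of (label, children) pairs with one-character labels.
    """
    labels = "".join(label for label, _ in children)
    out, pos = [], 0
    for m in re.finditer(sub_regex, labels):
        if m.start() == m.end():
            continue                      # ignore empty matches
        out.extend(children[pos:m.start()])
        out.append((new_label, children[m.start():m.end()]))
        pos = m.end()
    out.extend(children[pos:])
    return out


# Children of the a-element in the example: b c b c d
kids = [("b", []), ("c", []), ("b", []), ("c", []), ("d", [])]

# agg_elm(a, e, 11): the subexpression at position 11 of (bc)*d is bc
print(aggregate_children(kids, "bc", "e"))
```

On the example document this groups the children into two e-elements followed by d, matching the transformed document on the slide.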
Inferring K Optimum Transformations
According to an update script s, XML document t can be transformed into t1, …, tn. If t1, …, tn were listed “randomly”, it would be difficult to find an appropriate one; moreover, n may be infinite. We therefore consider inferring the K “optimum” documents among t1, …, tn from s.
[Figure: update script s updates DTD D to D’; document t is transformed into t1, …, tn.]
Cost of Transformation
In this paper, a transformation with the least cost (amount of change) is optimum. The cost of a transformation from t to t’ is the number of changes between t and t’, where a change is a node/subtree insertion or a node/subtree deletion.
NP-hardness Theorem: The following decision problem is NP-hard. Instance: A DTD D, an XML document t valid against D, an update script s to D, and a positive integer B Question: Is there a transformation of t inferred from s whose cost is no more than B? Proof: A reduction from SAT.
NP-hardness (cont’d)
Thus, it is unlikely that K optimum transformations can be inferred efficiently from an update script, even if K = 1. However, there is an efficient algorithm for finding K optimum transformations when the update script has length one.
Our Algorithm
Let D be a DTD, t be an XML document valid against D, and op be an update operation to D. The algorithm infers K optimum transformations of t from op in time polynomial in |D|, |t|, and K.
Outline of Algorithm
For simplicity, we show how to infer “K optimum costs” instead of “K optimum transformations”. The algorithm computes, for each node n in t, the K optimum costs to transform t/n (the subtree of t rooted at n) in a bottom-up manner.
[Figure: XML document t with nodes n1, …, n14, processed bottom-up.]
Outline of Algorithm (cont’d)
For each node n in t do (in a bottom-up order)
1. Let a be the label of n.
2. Construct a weighted graph G(N,E), which is obtained by “merging” (i) the children of n and (ii) the Glushkov automaton of d(a).
3. Find K shortest paths p1,…, pK on G(N,E).
4. Assign the K weights of p1,…, pK to n. These are the K optimum costs to transform t/n.
Return the K weights of the root of t.
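Step 2 relies on the Glushkov automaton of d(a). The sketch below shows the standard first/last/follow construction on a nested-list tree `[symbol, child1, ...]`; it is an illustration under assumptions, not the author's code. Each state is named by its label followed by its position, an ε leaf contributes no state, and the start state qI (not materialised here) has a transition to every state in `first`.

```python
def glushkov(tree):
    """Return (first, last, follow, nullable) for a content-model tree."""
    follow = {}

    def walk(node, pos):
        sym, kids = node[0], node[1:]
        if not kids:                               # leaf
            if sym == "ε":
                return True, set(), set()
            state = sym + pos
            follow[state] = set()
            return False, {state}, {state}
        parts = [walk(kid, pos + str(i)) for i, kid in enumerate(kids, start=1)]
        if sym == "|":
            return (any(n for n, _, _ in parts),
                    set().union(*(f for _, f, _ in parts)),
                    set().union(*(l for _, _, l in parts)))
        if sym == ".":
            nullable, first, last = True, set(), set()
            for n, f, l in parts:
                for state in last:                 # prefix ends, next part begins
                    follow[state] |= f
                if nullable:
                    first = first | f
                last = (last | l) if n else set(l)
                nullable = nullable and n
            return nullable, first, last
        n, f, l = parts[0]                         # unary operator: *, + or ?
        if sym in "*+":
            for state in l:                        # loop back to the beginning
                follow[state] |= f
        return (n or sym in "*?"), f, l

    nullable, first, last = walk(tree, "")
    return first, last, follow, nullable


# d(a) = g(c*b|cb*)
d_a = [".", ["g"],
       ["|", [".", ["*", ["c"]], ["b"]],
             [".", ["c"], ["*", ["b"]]]]]
first, last, follow, nullable = glushkov(d_a)
print(sorted(first), sorted(last))
print({state: sorted(nxt) for state, nxt in follow.items()})
```

For d(a) = g(c*b|cb*) the states come out as g1, c2111, b212, c221, and b2221.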
Outline of Step 2 (op = del_elm())
(a) The children of n: n1, n2, n3, labelled g, c, b.
(b) The Glushkov automaton of d(a) = g(c*b|cb*): start state qI and states g1, c2111, b212, c221, b2221. The c at position 221 is deleted by del_elm(a,221).
(c) Graph G(N,E), obtained by merging (a) and (b) (assuming that K = 2): nodes (qI, n0), (g1, n1), (c2111, n2), (b212, n3), (c221, n2), (b2221, n3); the edges between them are labelled k = 1, 2.
Graph G(N,E) (op = del_elm())
Source: (qI, n0). Destination: the nodes (*, n3).
The weight of each edge labelled k from (*, ni-1) to (*, ni) is set to the k-th cost to transform t/ni, except that the weight of each edge from (g1, n1) to (c221, n2) is set to 1 (since deleting a subtree costs one). The k-th cost to transform t/ni has already been obtained, since the algorithm works in a bottom-up order.
Each path from (qI, n0) to (*, n3) represents the sequence of children of n. The weight of the k-th shortest path from (qI, n0) to (*, n3) is the k-th cost to transform t/n.
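Step 3 can be illustrated on a graph shaped like G(N,E). The slides do not say which K-shortest-paths algorithm is used, so the sketch swaps in a simple Dijkstra-style search in which every node may be settled up to K times; it assumes non-negative weights. The node names follow the del_elm example, but all edge weights are hypothetical values chosen only to make the example run, and `k_shortest_weights` is not a name from the talk.

```python
import heapq


def k_shortest_weights(edges, source, targets, K):
    """Return up to K smallest path weights from source to any target.

    edges: dict mapping a node to a list of (weight, next_node) pairs;
    parallel edges are allowed and weights must be non-negative.
    """
    settled = {}                 # how many times each node has been popped
    found = []
    heap = [(0, source)]
    while heap and len(found) < K:
        dist, node = heapq.heappop(heap)
        if settled.get(node, 0) >= K:
            continue             # K best ways to reach this node are known
        settled[node] = settled.get(node, 0) + 1
        if node in targets:
            found.append(dist)
        for weight, nxt in edges.get(node, []):
            heapq.heappush(heap, (dist + weight, nxt))
    return found


# Hypothetical weights: two parallel edges (k = 1, 2) stand for the first and
# second cost to transform the child subtree; the edge into (c221, n2) costs 1.
G = {
    ("qI", "n0"): [(0, ("g1", "n1")), (1, ("g1", "n1"))],
    ("g1", "n1"): [(0, ("c2111", "n2")), (2, ("c2111", "n2")),
                   (1, ("c221", "n2"))],
    ("c2111", "n2"): [(0, ("b212", "n3")), (1, ("b212", "n3"))],
    ("c221", "n2"): [(0, ("b2221", "n3")), (1, ("b2221", "n3"))],
}
destinations = {("b212", "n3"), ("b2221", "n3")}

print(k_shortest_weights(G, ("qI", "n0"), destinations, K=2))
```

The returned list plays the role of the K weights assigned to n in Step 4.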
Outline of Step 2 (op = ins_elm())
(a) The children of n: n1, n2, n3, labelled g, c, b.
(b) The (modified) Glushkov automaton of d(a) = g(ec*b|cb*): start state qI and states g1, c2111, b212, c221, b2221. The e is inserted by ins_elm(a,e,21).
(c) Graph G(N,E), obtained by merging (a) and (b) (assuming that K = 2): nodes (qI, n0), (g1, n1), (c2111, n2), (b212, n3), (c221, n2), (b2221, n3); the edges between them are labelled k = 1, 2.
Graph G(N,E) (op = ins_elm())
Source: (qI, n0). Destination: the nodes (*, n3).
The weight of each edge labelled k from (*, ni-1) to (*, ni) is set to the k-th cost to transform t/ni; then the weight of each edge from (g1, n1) to (c2111, n2) is increased by 1 (since adding a subtree costs one). The k-th cost to transform t/ni has already been obtained, since the algorithm works in a bottom-up order.
The weight of the k-th shortest path from (qI, n0) to (*, n3) is the k-th cost to transform t/n.
Time Complexity of the Algorithm
For a DTD D, an XML document t valid against D, and a positive integer K, the algorithm runs in $O(|t| \cdot d(t)^3 \cdot |D|^2 \cdot K^2)$ time, where d(t) is the maximum outdegree of t.
Conclusion
Showed that the problem of inferring K optimum transformations from an update script is NP-hard, even if K = 1.
Constructed an algorithm for inferring K optimum transformations from an update operation, which runs in time polynomial in |D|, |t|, and K.
Future Work
Implement the algorithm and run experiments to examine (i) whether the current definition of the cost of a transformation is appropriate, devising another criterion if necessary, and (ii) the efficiency of the algorithm.
Support schema languages more powerful than DTD, e.g., XML Schema and regular tree grammars.
Thank you!