On Inferring K Optimum Transformations of XML Document from Update Script to DTD Nobutaka Suzuki Graduate School of Library, Information and Media Studies.

On Inferring K Optimum Transformations of XML Document from Update Script to DTD
Nobutaka Suzuki Graduate School of Library, Information and Media Studies University of Tsukuba

Outline
Motivation
Update operations to DTDs
Inferring K optimum transformations from update script
  Nondeterministic transformation
  NP-hardness
  Algorithm for inferring K optimum transformations
Conclusion

Motivation
A DTD is continuously updated to reflect changes in the real world. Once a DTD is updated, XML documents may no longer be valid against the updated DTD. Thus, XML documents have to be transformed according to a DTD update.
[Figure: a DTD is updated into an updated DTD, and each XML document is transformed accordingly.]

Motivation (cont'd)
Problem: For a DTD update from D to D', there may be more than one way to transform an XML document t (into t1, …, tn). Thus, we have to select one appropriate transformation among them.

Example
DTD (with chapter):
<!ELEMENT book (chapter)*>
<!ELEMENT chapter (section+,bib?)>
<!ELEMENT section (#PCDATA)>
<!ELEMENT bib (#PCDATA)>
DTD (without chapter):
<!ELEMENT book (section+,bib?)*>
<!ELEMENT section (#PCDATA)>
<!ELEMENT bib (#PCDATA)>
[Figure: an XML document with root book and children section, section, bib, together with candidate transformed documents in which chapter elements are inserted in different ways.]
We have to select an appropriate transformation from these.

Inferring K Optimum Transformations
According to update script s (which updates D into D'), XML document t can be transformed into t1,…,tn.
If t1,…,tn were listed “randomly”, it would be difficult to find an appropriate one. Moreover, n may be infinite.
It is useful to infer “K optimum” documents of t1,…,tn from s.

Research Objective Determine the complexity of inferring K optimum transformations. Construct an efficient algorithm for solving this problem.

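Tree Representation of Regular Expression
Each content model (regular expression) of a DTD is represented by a tree, and each node of the tree is identified by its position.
Example: regular expression r = (a|b|c)d*
Tree representation of r: the root (position λ) is the concatenation operator "."; its children are "|" at position 1 and "*" at position 2; labels a, b, c are at positions 11, 12, 13, and label d is at position 21.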
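The position scheme can be illustrated with a small sketch. The `Node` class and `positions` function are hypothetical names introduced here, not part of the presentation; the sketch also assumes every node has fewer than ten children, so that a position can be written as a string of digits.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # An operator ('.', '|', '*', '+', '?') or an element label.
    symbol: str
    children: list["Node"] = field(default_factory=list)

def positions(node, pos=""):
    """Yield (position, symbol) pairs; the root is reported as 'λ'."""
    yield (pos or "λ", node.symbol)
    for i, child in enumerate(node.children, start=1):
        # A child's position is its parent's position followed by its index.
        yield from positions(child, pos + str(i))

# r = (a|b|c)d*
r = Node(".", [Node("|", [Node("a"), Node("b"), Node("c")]),
               Node("*", [Node("d")])])
table = dict(positions(r))
print(table)
```

The resulting table maps each position to the symbol stored there, which is the information the update operations use to address a node.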

Update Operations to DTDs
ins_elm(a,b,u): Inserting an element
del_elm(a,u): Deleting an element
agg_elm(a,b,u): Aggregating a subexpression
ext_elm(a,u): Extracting a label
ins_opr(a,opr,u,v): Inserting an operator
del_opr(a,u): Deleting an operator
change_opr(a,opr,u): Replacing an operator
def_cm(a,r): Defining a content model
undef_cm(a): Removing a content model

ins_elm(a,b,u): Inserting an Element
Inserts label b at position u in d(a), where d(a) is the content model of element a.
Example: ins_elm(a, e, 12) changes d(a) = (b|c)d* into d(a) = (b|e|c)d*. In the tree, e becomes the node at position 12 and c moves from position 12 to position 13.
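As a rough illustration of ins_elm, the sketch below applies the operation to a content model stored as nested lists of the form [symbol, child, ...]. This encoding and the helper `to_regex` are assumptions made for the example; the function takes the content model directly instead of looking up d(a) in a DTD, and positions are again assumed to be digit strings.

```python
import copy

def ins_elm(tree, label, u):
    """Return a copy of `tree` with a new leaf `label` inserted at position u."""
    new_tree = copy.deepcopy(tree)
    parent = new_tree
    for digit in u[:-1]:              # walk down to the parent of position u
        parent = parent[int(digit)]   # index 0 holds the symbol, children start at 1
    parent.insert(int(u[-1]), [label])
    return new_tree

def to_regex(tree):
    """Print a nested-list content model as a regular expression string."""
    symbol, children = tree[0], tree[1:]
    if not children:
        return symbol
    parts = [to_regex(c) for c in children]
    if symbol == "|":
        return "(" + "|".join(parts) + ")"
    if symbol == ".":
        return "".join(parts)
    # Unary operator (*, +, ?): parenthesize a concatenation operand.
    inner = "(" + parts[0] + ")" if children[0][0] == "." else parts[0]
    return inner + symbol

d_a = [".", ["|", ["b"], ["c"]], ["*", ["d"]]]   # (b|c)d*
updated = ins_elm(d_a, "e", "12")
print(to_regex(d_a), "->", to_regex(updated))
```

The original tree is left untouched, which makes it easy to compare the content model before and after the update.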

del_elm(a,u): Deleting an Element
Replaces the label at u by ε if d(a) has a label at u; deletes the ε at u if d(a) has ε at u.
Example: del_elm(a, 13) changes d(a) = (b|c|e)d* into d(a) = (b|c|ε)d*.

agg_elm(a,b,u): Aggregating a Subexpression
Aggregates the subexpression E at u by label b, i.e., replaces E by b and sets d(b) = E.
Example: agg_elm(a, f, 2) changes d(a) = (b|c|e)d* into d(a) = (b|c|e)f and sets d(f) = d*.

ext_elm(a,u): Extracting a Label
Extracts the label at u, i.e., replaces the label at u, say b, by d(b).
Example: with d(a) = (b|c|e)f and d(f) = d*, ext_elm(a, 2) changes d(a) into (b|c|e)d*.

ins_opr(a,opr,u,v): Inserting an Operator
Inserts an operator opr (|, ., *, +, ?) as the parent of the nodes at u,…,v in d(a).
Example: ins_opr(a, *, 1, 1) changes d(a) = (b|c|e)d* into d(a) = (b|c|e)*d*. In the tree, * becomes the node at position 1, | moves to position 11, and b, c, e move to positions 111, 112, 113.

del_opr(a,u): Deleting an Operator
Deletes the operator at u in d(a).
Example: del_opr(a, 1) changes d(a) = (b|c|e)*d* into d(a) = (b|c|e)d*.

change_opr(a,opr,u): Replacing an Operator
Replaces the operator at u in d(a) by opr.
Example: change_opr(a, ?, 1) changes d(a) = (b|c|e)*d* into d(a) = (b|c|e)?d*.

Nondeterministic Transformation Inferred from Update Operation
Let s = op1…opn be an update script (a sequence of update operations). The script updates DTD D step by step into D1, …, Dn, and XML document t is transformed according to opi for each i = 1,…,n, yielding t1, …, tn.
We define how an XML document t is “nondeterministically” transformed according to an update operation.

Nondeterministic Transformation Inferred from Update Operation (cont'd)
This transformation is defined according to op, with one case for each of the following operations:
ins_elm(a,b,u)
del_elm(a,u)
agg_elm(a,b,u)
ext_elm(a,u)
ins_opr(a,opr,u,v)
del_opr(a,u)
change_opr(a,opr,u)

In the Case where op = ins_elm(a,b,u)
ins_elm(a,b,u) inserts label b at u in d(a).
The case where the parent of u is | : Do nothing, since the updated content model includes d(a).
The case where the parent of u is . : For each a-element Ea in XML document t do
  (a) Find positions at which new b-elements should be inserted so that Ea becomes “valid” against the updated content model of a.
  (b) For each position v found in step (a), create a valid b-element and insert it at v.

Example
DTD: ins_elm(a, e, 112) changes d(a) = (bc)*d into d(a) = (bec)*d (assuming that d(e) = fg).
XML document: an a-element with children b c b c d is transformed into an a-element with children b e c b e c d, where each inserted e-element has children f g.
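Step (a) of the ins_elm case can be mimicked by brute force on this example. The sketch below is not the method used in the presentation: it simply tries every set of gaps between children, inserts at most one new element per gap, and keeps the sequences that match the updated content model. It assumes single-character labels so that a child sequence can be checked with Python's `re` module, and it does not build the f g content of the inserted e-elements. The function name `insert_candidates` is hypothetical.

```python
import re
from itertools import combinations

def insert_candidates(children, new_label, updated_model):
    """Return every child sequence that results from inserting `new_label`
    into some gaps of `children` and matches `updated_model`."""
    slots = range(len(children) + 1)      # gaps before, between, and after children
    results = []
    for k in range(len(children) + 2):    # how many gaps receive a new element
        for chosen in combinations(slots, k):
            seq = []
            for i in slots:
                if i in chosen:
                    seq.append(new_label)
                if i < len(children):
                    seq.append(children[i])
            if re.fullmatch(updated_model, "".join(seq)):
                results.append(seq)
    return results

# Children b c b c d of an a-element, updated content model (bec)*d.
found = insert_candidates(list("bcbcd"), "e", r"(bec)*d")
print(found)
```

The search is exponential in the number of children, so it only serves to make the notion of "positions at which new elements should be inserted" concrete.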

In the Case where op = agg_elm(a,b,u)
agg_elm(a,b,u) replaces the subexpression at u in d(a) by label b.
For each a-element Ea in XML document t do
  (a) Find sequences of elements in Ea that match the subexpression at u in d(a).
  (b) For each sequence found in step (a), insert a new b-node as the parent of the sequence.

Example
DTD: agg_elm(a, e, 11) changes d(a) = (bc)*d into d(a) = e*d and sets d(e) = bc.
XML document: an a-element with children b c b c d is transformed into an a-element with children e e d, where each e-element has children b c.
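The two steps of the agg_elm case can be sketched on this example as follows. This is a simplification under stated assumptions: labels are single characters, the aggregated subexpression is given as a Python regular expression, only the leftmost non-overlapping matches are taken (the transformation in the presentation is nondeterministic), and the position of the subexpression inside d(a) is ignored. The function name `aggregate` is hypothetical.

```python
import re

def aggregate(children, sub_model, new_label):
    """Wrap each leftmost non-overlapping match of `sub_model` in a new node.

    A wrapped group is returned as a (new_label, [children]) tuple;
    unmatched children are kept as plain labels.
    """
    labels = "".join(children)
    result, last = [], 0
    for m in re.finditer(sub_model, labels):
        if m.end() == m.start():
            continue                      # skip empty matches
        result.extend(children[last:m.start()])
        result.append((new_label, children[m.start():m.end()]))
        last = m.end()
    result.extend(children[last:])
    return result

# Children b c b c d of an a-element; the subexpression bc is aggregated by e.
new_children = aggregate(list("bcbcd"), r"bc", "e")
print(new_children)
```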

Inferring K Optimum Transformations
According to update script s (which updates D into D'), XML document t can be transformed into t1,…,tn.
If t1,…,tn were listed “randomly”, it would be difficult to find an appropriate one. Moreover, n may be infinite.
We consider inferring K “optimum” documents of t1,…,tn from s.

Cost of Transformation
In this paper, a transformation with the least cost (amount of change) is optimum.
The cost of a transformation from t to t’ is the number of changes between t and t’, where a change is one of the following:
Node/subtree insertion
Node/subtree deletion

NP-hardness
Theorem: The following decision problem is NP-hard.
Instance: A DTD D, an XML document t valid against D, an update script s to D, and a positive integer B.
Question: Is there a transformation of t inferred from s whose cost is no more than B?
Proof: A reduction from SAT.

NP-hardness (cont'd)
Thus, it is unlikely that we can infer K optimum transformations from an update script efficiently, even if K = 1.
However, there is an efficient algorithm for finding K optimum transformations, assuming that the update script is of length one.

Our Algorithm
Let D be a DTD, t be an XML document valid against D, and op be an update operation to D. The algorithm infers K optimum transformations of t from op in time polynomial in |D|, |t|, and K.

Outline of Algorithm
For simplicity, we show how to infer “K optimum costs” instead of “K optimum transformations”.
The algorithm computes, for each node n in t, the K optimum costs to transform t/n (the subtree of t rooted at n) in a bottom-up manner.
[Figure: XML document t with nodes n1, …, n14, processed bottom-up.]

Outline of Algorithm (cont'd)
For each node n in t do (in a bottom-up order)
  1. Let a be the label of n.
  2. Construct a weighted graph G(N,E), which is obtained by “merging” (i) the children of n and (ii) the Glushkov automaton of d(a).
  3. Find K shortest paths p1,…, pK on G(N,E).
  4. Assign the K weights of p1,…, pK to n. These are the K optimum costs to transform t/n.
Return the K weights of the root of t.
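The "find K shortest paths" step can be illustrated on an acyclic graph with a standard dynamic program that keeps the K smallest path weights per node. This is a generic sketch, not necessarily the K-shortest-paths routine used by the author; the function name `k_best_weights`, the edge-list encoding, and the sample graph are assumptions made for the example. Parallel edges are allowed and count as distinct paths.

```python
import heapq
from collections import defaultdict

def k_best_weights(edges, order, source, targets, K):
    """Return the K smallest path weights from `source` to any node in `targets`.

    edges: list of (u, v, weight) in a directed acyclic graph.
    order: the nodes listed in a topological order.
    """
    out = defaultdict(list)
    for u, v, w in edges:
        out[u].append((v, w))
    best = defaultdict(list)   # node -> up to K smallest path weights, ascending
    best[source] = [0]
    for u in order:            # best[u] is final once all predecessors are done
        for v, w in out[u]:
            merged = best[v] + [c + w for c in best[u]]
            best[v] = heapq.nsmallest(K, merged)
    return heapq.nsmallest(K, [c for t in targets for c in best[t]])

# Made-up graph: two parallel edges s->a, two parallel edges a->t, one edge s->t.
edges = [("s", "a", 1), ("s", "a", 2), ("a", "t", 1), ("a", "t", 2), ("s", "t", 5)]
weights = k_best_weights(edges, ["s", "a", "t"], "s", ["t"], 3)
print(weights)
```

Keeping only K weights per node is enough because any of the K best paths to a target extends one of the K best paths to its predecessor.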

Outline of Step 2 (op = del_elm())
(a) The children of n: n1, n2, n3, labeled g, c, b.
(b) The Glushkov automaton of d(a) = g(c*b|cb*), with states qI (start), g1, c2111, b212, c221, b2221. The label c at position 221 is deleted by del_elm(a,221).
(c) Graph G(N,E), obtained by merging (a) and (b) (assuming that K = 2). Its nodes are (qI, n0), (g1, n1), (c2111, n2), (c221, n2), (b212, n3), (b2221, n3), and consecutive nodes are connected by edges weighted 1 and 2.

Graph G(N,E) (op = del_elm())
[Figure: graph G(N,E) with source (qI, n0), intermediate nodes (g1, n1), (c2111, n2), (c221, n2), and destinations (b212, n3), (b2221, n3); edges are weighted 1 and 2.]
The weight of the k-th edge from (*, ni-1) to (*, ni) is set to the k-th cost to transform t/ni, except that the weight of each edge from (g1,n1) to (c221,n2) is set to 1 (since deleting a subtree costs one). The k-th cost to transform t/ni has already been obtained, since the algorithm works in a bottom-up order.
Each path from (qI, n0) to (*, n3) represents the sequence of children of n.
The weight of the k-th shortest path from (qI, n0) to (*, n3) represents the k-th cost to transform t/n.
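One possible reading of the "merging" step for del_elm is a product of the child sequence with the automaton's transitions, sketched below. Everything here is an assumption made for illustration: the function name `merge_graph`, the transition table written out by hand for d(a) = g(c*b|cb*), the per-child cost lists, and the choice to add a single weight-1 edge into a deleted state. The presentation does not spell out these details.

```python
def merge_graph(children, costs, delta, start, deleted, K):
    """Build the edge list of a layered graph over (state, child index) nodes.

    children[i-1] is the label of child n_i; costs[i-1] lists its K costs.
    delta maps (state, label) to the list of successor states.
    """
    edges = []
    frontier = {start}
    for i, label in enumerate(children, start=1):
        reached = set()
        for q in frontier:
            for q2 in delta.get((q, label), ()):
                if q2 in deleted:
                    # The child matched to a deleted position is removed: cost 1.
                    edges.append(((q, i - 1), (q2, i), 1))
                else:
                    for k in range(K):
                        edges.append(((q, i - 1), (q2, i), costs[i - 1][k]))
                reached.add(q2)
        frontier = reached
    return edges

# Hand-written Glushkov transitions for g(c*b|cb*); state names follow the slide.
delta = {
    ("qI", "g"): ["g1"],
    ("g1", "c"): ["c2111", "c221"],
    ("g1", "b"): ["b212"],
    ("c2111", "c"): ["c2111"],
    ("c2111", "b"): ["b212"],
    ("c221", "b"): ["b2221"],
    ("b2221", "b"): ["b2221"],
}
# Made-up costs: each child has K = 2 costs, 1 and 2.
edges = merge_graph(["g", "c", "b"], [[1, 2]] * 3, delta, "qI", {"c221"}, K=2)
print(len(edges))
```

For children g, c, b this reaches exactly the six nodes named on the slide, and the resulting edge list could then be handed to any K-shortest-paths routine.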

Outline of Step 2 (op = ins_elm())
(a) The children of n: n1, n2, n3, labeled g, c, b.
(b) The (modified) Glushkov automaton of d(a) = g(ec*b|cb*), with states qI (start), g1, c2111, b212, c221, b2221. The label e is inserted by ins_elm(a,e,21).
(c) Graph G(N,E), obtained by merging (a) and (b) (assuming that K = 2). Its nodes are (qI, n0), (g1, n1), (c2111, n2), (c221, n2), (b212, n3), (b2221, n3), and consecutive nodes are connected by edges weighted 1 and 2.

Graph G(N,E) (op = ins_elm())
[Figure: graph G(N,E) with source (qI, n0), intermediate nodes (g1, n1), (c2111, n2), (c221, n2), and destinations (b212, n3), (b2221, n3); edges are weighted 1 and 2.]
The weight of the k-th edge from (*, ni-1) to (*, ni) is set to the k-th cost to transform t/ni; then the weight of each edge from (g1,n1) to (c2111,n2) is increased by 1 (since adding a subtree costs one). The k-th cost to transform t/ni has already been obtained, since the algorithm works in a bottom-up order.
The weight of the k-th shortest path from (qI, n0) to (*, n3) represents the k-th cost to transform t/n.

Time Complexity of the Algorithm
For a DTD D, an XML document t valid against D, and a positive integer K, the algorithm runs in O(|t| · d(t)^3 · |D|^2 · K^2) time, where d(t) is the maximum outdegree of t.

Conclusion
Showed that the problem of inferring K optimum transformations from an update script is NP-hard, even if K = 1.
Constructed an algorithm for inferring K optimum transformations from an update operation, which runs in time polynomial in |D|, |t|, and K.

Future Work
Implement the algorithm and run experiments to examine
  whether the current definition of the cost of a transformation is appropriate (and devise another criterion if necessary), and
  the efficiency of the algorithm.
Use schema languages more powerful than DTD, e.g., XML Schema and regular tree grammars.

Thank you!
