On Inferring K Optimum Transformations of XML Document from Update Script to DTD Nobutaka Suzuki Graduate School of Library, Information and Media Studies.

Presentation transcript:

On Inferring K Optimum Transformations of XML Document from Update Script to DTD Nobutaka Suzuki Graduate School of Library, Information and Media Studies University of Tsukuba

Outline
- Motivation
- Update operations to DTDs
- Inferring K optimum transformations from update script
- Nondeterministic transformation
- NP-hardness
- Algorithm for inferring K optimum transformations
- Conclusion

Motivation
A DTD is continuously updated to reflect changes in the real world. Once a DTD is updated, XML documents that were valid against the original DTD may no longer be valid against the updated DTD. Thus, XML documents have to be transformed according to the DTD update.
[Figure: a DTD is updated into an updated DTD, and each XML document is transformed accordingly.]

Motivation (cont’d)
Problem: for a DTD update from D to D’, there may be more than one way to transform an XML document t, giving candidate documents t1, …, tn. Thus, we have to select one appropriate transformation among them.

Example
DTD before the update:
  <!ELEMENT book (section+,bib?)*>
  <!ELEMENT section (#PCDATA)>
  <!ELEMENT bib (#PCDATA)>
DTD after the update (chapter elements are inserted):
  <!ELEMENT book (chapter)*>
  <!ELEMENT chapter (section+,bib?)>
  <!ELEMENT section (#PCDATA)>
  <!ELEMENT bib (#PCDATA)>
XML document: book(section, section, bib)
Possible transformations:
- book(chapter(section, section, bib))
- book(chapter(section), chapter(section, bib))
We have to select an appropriate transformation from these.

Inferring K Optimum Transformations
According to an update script s that updates DTD D into D’, XML document t can be transformed into t1,…,tn.
- If t1,…,tn were listed “randomly”, it would be difficult to find an appropriate one.
- n may be infinite.
It is useful to infer “K optimum” documents of t1,…,tn from s.

Research Objective
- Determine the complexity of inferring K optimum transformations.
- Construct an efficient algorithm for solving this problem.

Tree Representation of Regular Expression
Each content model (regular expression) of a DTD is represented by a tree, and each node is identified by its position.
Example: regular expression r = (a|b|c)d*
Tree representation of r (position: label):
  λ: .
  1: |
  11: a
  12: b
  13: c
  2: *
  21: d

Update Operations to DTDs
- ins_elm(a,b,u): Inserting an element
- del_elm(a,u): Deleting an element
- agg_elm(a,b,u): Aggregating a subexpression
- ext_elm(a,u): Extracting a label
- ins_opr(a,opr,u,v): Inserting an operator
- del_opr(a,u): Deleting an operator
- change_opr(a,opr,u): Replacing the operator
- def_cm(a,r): Defining a content model
- undef_cm(a): Removing a content model

ins_elm(a,b,u): Inserting an Element
Inserts label b at position u in d(a), where d(a) is the content model of element a.
Example: ins_elm(a, e, 12) changes d(a) = (b|c)d* into d(a) = (b|e|c)d*. The new label e takes position 12, and c moves from position 12 to position 13.
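As a rough illustration of ins_elm on the tree representation, here is a minimal sketch. The `(label, children)` encoding, the digit-string positions, and the helper `to_regex` are assumptions for this example, not part of the original slides; the root position λ is not handled.

```python
import copy

def ins_elm(tree, label, pos):
    """Return a copy of `tree` with a new leaf `label` inserted at position `pos`."""
    new_tree = copy.deepcopy(tree)
    *parent_path, index = [int(ch) for ch in pos]
    node = new_tree
    for step in parent_path:          # walk down to the parent of `pos`
        node = node[1][step - 1]
    node[1].insert(index - 1, (label, []))
    return new_tree

def to_regex(node):
    """Render a tree back into a regular expression string."""
    label, children = node
    if not children:
        return label
    parts = [to_regex(c) for c in children]
    if label == "|":
        return "(" + "|".join(parts) + ")"
    if label == ".":
        return "".join(parts)
    inner = parts[0]                  # unary operators: *, +, ?
    if children[0][0] == ".":
        inner = "(" + inner + ")"
    return inner + label

# d(a) = (b|c)d*
d_a = (".", [("|", [("b", []), ("c", [])]), ("*", [("d", [])])])
updated = ins_elm(d_a, "e", "12")
print(to_regex(d_a), "->", to_regex(updated))
```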

del_elm(a,u): Deleting an Element
If d(a) has a label at u, replaces that label by ε; if d(a) has ε at u, deletes the ε.
Example: del_elm(a, 13) changes d(a) = (b|c|e)d* into d(a) = (b|c|ε)d*.

agg_elm(a,b,u): Aggregating a Subexpression
Aggregates the subexpression E at u by label b, i.e., replaces E by b and sets d(b) = E.
Example: agg_elm(a, f, 2) changes d(a) = (b|c|e)d* into d(a) = (b|c|e)f with d(f) = d*.
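A minimal sketch of agg_elm, assuming a DTD is held as a dictionary from element names to content-model trees in a `(label, children)` encoding. Both the dictionary layout and the digit-string positions are illustrative assumptions, and the root position λ is not handled.

```python
import copy

def agg_elm(dtd, a, b, pos):
    """Return a new DTD where the subexpression at `pos` in d(a) is
    replaced by label `b`, and d(b) is set to that subexpression."""
    new_dtd = copy.deepcopy(dtd)
    *parent_path, index = [int(ch) for ch in pos]
    node = new_dtd[a]
    for step in parent_path:          # walk down to the parent of `pos`
        node = node[1][step - 1]
    new_dtd[b] = node[1][index - 1]   # d(b) = E
    node[1][index - 1] = (b, [])      # replace E by the label b
    return new_dtd

# d(a) = (b|c|e)d*
dtd = {"a": (".", [("|", [("b", []), ("c", []), ("e", [])]),
                   ("*", [("d", [])])])}
new_dtd = agg_elm(dtd, "a", "f", "2")
print(new_dtd["a"])   # content model of a after the update
print(new_dtd["f"])   # content model of the new element f
```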

ext_elm(a,u): Extracting a Label
Extracts the label at u, i.e., replaces the label at u, say b, by d(b).
Example: ext_elm(a, 2) changes d(a) = (b|c|e)f, where d(f) = d*, into d(a) = (b|c|e)d*.

ins_opr(a,opr,u,v): Inserting an Operator
Inserts an operator opr (|, ., *, +, ?) as the parent of the nodes at u,…,v in d(a).
Example: ins_opr(a, *, 1, 1) changes d(a) = (b|c|e)d* into d(a) = (b|c|e)*d*. The new * takes position 1, and the | node moves to position 11.

del_opr(a,u): Deleting an Operator
Deletes the operator at u in d(a).
Example: del_opr(a, 1) changes d(a) = (b|c|e)*d* into d(a) = (b|c|e)d*.

change_opr(a,opr,u): Replacing an Operator
Replaces the operator at u in d(a) by opr.
Example: change_opr(a, ?, 1) changes d(a) = (b|c|e)*d* into d(a) = (b|c|e)?d*.

Nondeterministic Transformation Inferred from Update Operation
Let s = op1…opn be an update script (sequence of update operations). XML document t is transformed according to opi for each i = 1,…,n:
  DTD:          D --op1--> D1 --op2--> ... --opn--> Dn
  XML document: t -------> t1 -------> ... -------> tn
We define how an XML document t is “nondeterministically” transformed according to an update operation.

Nondeterministic Transformation Inferred from Update Operation (cont’d)
This transformation is defined according to op:
- ins_elm(a,b,u)
- del_elm(a,u)
- agg_elm(a,b,u)
- ext_elm(a,u)
- ins_opr(a,opr,u,v)
- del_opr(a,u)
- change_opr(a,opr,u)

In the Case where op = ins_elm(a,b,u)
ins_elm(a,b,u) inserts label b at u in d(a).
- The case where the parent of u is | : Do nothing, since the updated content model includes d(a).
- The case where the parent of u is . : For each a-element Ea in XML document t do
  (a) Find positions at which new b-elements should be inserted so that Ea becomes “valid” against the updated content model of a.
  (b) For each position v found in step (a), create a valid b-element and insert it at v.

Example
Update operation: ins_elm(a, e, 112), assuming that d(e) = fg.
DTD: d(a) = (bc)*d becomes d(a) = (bec)*d.
XML document: a(b, c, b, c, d) becomes a(b, e(f, g), c, b, e(f, g), c, d).
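Step (a) of the ins_elm case can be illustrated with a brute-force sketch. This is not the method of the slides: it simply tries every set of gaps between children (at most one new element per gap) and keeps the sequences that match the updated content model. Single-letter labels and the use of Python's `re` module as a validity check are assumptions, and the search is exponential in the number of children.

```python
import re
from itertools import combinations

def valid_insertions(children, new_label, updated_model):
    """Return every child sequence obtained by inserting `new_label`
    into gaps of `children` that matches `updated_model`,
    listed with the fewest insertions first."""
    results = []
    gaps = range(len(children) + 1)
    for count in range(len(children) + 2):
        for chosen in combinations(gaps, count):
            seq = []
            for i in gaps:
                if i in chosen:
                    seq.append(new_label)
                if i < len(children):
                    seq.append(children[i])
            candidate = "".join(seq)
            if re.fullmatch(updated_model, candidate):
                results.append(candidate)
    return results

# Children of the a-element: b c b c d; updated model d(a) = (bec)*d
found = valid_insertions("bcbcd", "e", "(bec)*d")
print(found)
```

In this example only one placement of e makes the a-element valid, which matches the transformed document shown above; each inserted e would then be filled with a valid subtree for d(e) = fg.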

In the Case where op = agg_elm(a,b,u)
agg_elm(a,b,u) replaces the subexpression at u in d(a) by label b.
For each a-element Ea in XML document t do
  (a) Find sequences of elements in Ea that match the subexpression at u in d(a).
  (b) For each sequence found in step (a), insert a new b-node as the parent of the sequence.

Example
Update operation: agg_elm(a, e, 11).
DTD: d(a) = (bc)*d becomes d(a) = e*d, d(e) = bc.
XML document: a(b, c, b, c, d) becomes a(e(b, c), e(b, c), d).
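The search in step (a) of the agg_elm case, and the nondeterminism it can introduce, can be sketched as follows. The function names, the single-letter labels, and the use of `re.fullmatch` as the matching test are all assumptions for illustration. The input is only the run of children covered by the starred subexpression, not the whole child sequence.

```python
import re

def segmentations(children, subexpr):
    """Return every way to split `children` into consecutive groups,
    each of which matches `subexpr` as a whole."""
    if not children:
        return [[]]
    results = []
    for end in range(1, len(children) + 1):
        head = children[:end]
        if re.fullmatch(subexpr, head):
            for rest in segmentations(children[end:], subexpr):
                results.append([head] + rest)
    return results

def wrap(groups, new_label):
    """Put each group under a new parent node, written as label(children)."""
    return "".join(f"{new_label}({g})" for g in groups)

# agg_elm(a, e, 11) on children b c b c (followed by d): one possibility.
only = segmentations("bcbc", "bc")
print([wrap(g, "e") for g in only])

# Book example from the Motivation slides, with s = section, b = bib,
# and subexpression section+,bib? : two possibilities.
book = segmentations("ssb", "s+b?")
print(book)
```

The second call returns two segmentations, corresponding to the two candidate documents in the book example: two chapters, or a single chapter holding all three children.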

Inferring K Optimum Transformations
According to an update script s that updates DTD D into D’, XML document t can be transformed into t1,…,tn.
- If t1,…,tn were listed “randomly”, it would be difficult to find an appropriate one.
- n may be infinite.
We consider inferring K “optimum” documents of t1,…,tn from s.

Cost of Transformation
In this paper, a transformation with the least cost (amount of changes) is optimum.
The cost of a transformation from t to t’ is the number of changes between t and t’, where a change is either a node/subtree insertion or a node/subtree deletion.

NP-hardness
Theorem: The following decision problem is NP-hard.
  Instance: A DTD D, an XML document t valid against D, an update script s to D, and a positive integer B.
  Question: Is there a transformation of t inferred from s whose cost is no more than B?
Proof: A reduction from SAT.

NP-hardness (cont’d)
Thus, it is unlikely that we can infer K optimum transformations from an update script efficiently, even if K = 1.
However, there is an efficient algorithm for finding K optimum transformations, assuming that an update script is of length one.

Our Algorithm
Let D be a DTD, t be an XML document valid against D, and op be an update operation to D.
The algorithm infers K optimum transformations of t from op in time polynomial in |D|, |t|, and K.

Outline of Algorithm
For simplicity, we show how to infer “K optimum costs” instead of “K optimum transformations”.
The algorithm computes, for each node n in t, K optimum costs to transform t/n (the subtree of t rooted at n) in a bottom-up manner.
[Figure: XML document t with nodes n1, …, n14, processed bottom-up.]

Outline of Algorithm (cont’d)
For each node n in t do (in a bottom-up order)
1. Let a be the label of n.
2. Construct a weighted graph G(N,E), which is obtained by “merging” (i) the children of n and (ii) the Glushkov automaton of d(a).
3. Find K shortest paths p1,…, pK on G(N,E).
4. Assign the K weights of p1,…, pK to n (these are the K optimum costs to transform t/n).
Return the K weights of the root of t.
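The K-shortest-paths step can be sketched for an acyclic graph with parallel edges, which is the shape obtained when every edge leads from one child position to the next. The slides do not say which K-shortest-paths procedure is used, so the dynamic program below, the function name, and the toy graph with its weights are all hypothetical.

```python
import heapq
from collections import defaultdict

def k_shortest_costs(edges, order, source, targets, K):
    """edges: list of (u, v, weight), parallel edges allowed.
    order: all nodes in topological order.
    Returns the K smallest total weights over all paths from
    `source` to any node in `targets`."""
    out = defaultdict(list)
    for u, v, w in edges:
        out[u].append((v, w))
    best = defaultdict(list)      # node -> up to K smallest path costs
    best[source] = [0]
    for u in order:
        for v, w in out[u]:
            merged = best[v] + [c + w for c in best[u]]
            best[v] = heapq.nsmallest(K, merged)
    finals = [c for t in targets for c in best[t]]
    return heapq.nsmallest(K, finals)

# Toy graph with made-up weights; two parallel edges model a 1st and 2nd cost.
edges = [("s", "a", 0), ("s", "a", 2), ("a", "t", 1),
         ("s", "b", 1), ("b", "t", 1), ("b", "t", 3)]
costs = k_shortest_costs(edges, ["s", "a", "b", "t"], "s", ["t"], K=3)
print(costs)
```

Keeping only the K cheapest costs per node is enough here because any of the K cheapest paths into a node extends one of the K cheapest paths into its predecessor.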

Outline of Step 2 (op = del_elm())
(a) The children of n: n1, n2, n3.
(b) The Glushkov automaton of d(a) = g(c*b|cb*), with start state qI and states g1, c2111, b212, c221, b2221. The label c at position 221 is deleted by del_elm(a,221).
(c) Graph G(N,E), obtained by merging (a) and (b) (assuming that K = 2), with nodes (qI, n0), (g1, n1), (c2111, n2), (b212, n3), (c221, n2), (b2221, n3). Edges between consecutive nodes carry the labels k = 1 and k = 2.

Graph G(N,E) (op = del_elm())
Source: (qI, n0). Destinations: (b212, n3) and (b2221, n3).
- The weight of each edge labeled k from (*, ni-1) to (*, ni) is set to the k-th cost to transform t/ni, except that the weight of each edge from (g1,n1) to (c221,n2) is set to 1 (since deleting a subtree costs one). The k-th cost has already been obtained, since the algorithm works in a bottom-up order.
- Each path from (qI, n0) to (*, n3) represents the sequence of children of n.
- The k-th shortest path from (qI,n0) to (*, n3) represents the k-th cost to transform t/n.

Outline of Step 2 (op = ins_elm())
(a) The children of n: n1, n2, n3.
(b) The (modified) Glushkov automaton of d(a) = g(ec*b|cb*), with start state qI and states g1, c2111, b212, c221, b2221. The label e is inserted by ins_elm(a,e,21).
(c) Graph G(N,E), obtained by merging (a) and (b) (assuming that K = 2), with nodes (qI, n0), (g1, n1), (c2111, n2), (b212, n3), (c221, n2), (b2221, n3). Edges between consecutive nodes carry the labels k = 1 and k = 2.

Graph G(N,E) (op = ins_elm())
Source: (qI, n0). Destinations: (b212, n3) and (b2221, n3).
- The weight of each edge labeled k from (*, ni-1) to (*, ni) is set to the k-th cost to transform t/ni; then the weight of each edge from (g1,n1) to (c2111,n2) is increased by 1 (since adding a subtree costs one). The k-th cost has already been obtained, since the algorithm works in a bottom-up order.
- The weight of the k-th shortest path from (qI, n0) to (*, n3) represents the k-th cost to transform t/n.

Time Complexity of the Algorithm
For a DTD D, an XML document t valid against D, and a positive integer K, the algorithm runs in O(|t| · d(t)^3 · |D|^2 · K^2) time, where d(t) is the maximum outdegree of t.

Conclusion
- Showed that the problem of inferring K optimum transformations from an update script is NP-hard, even if K = 1.
- Constructed an algorithm for inferring K optimum transformations from an update operation, which runs in time polynomial in |D|, |t|, and K.

Future Work
- Implement the algorithm and conduct experiments to examine (i) whether the current definition of the cost of a transformation is appropriate, devising another criterion if necessary, and (ii) the efficiency of the algorithm.
- Use schema languages more powerful than DTD, e.g., XML Schema and regular tree grammars.

Thank you!