On Inferring K Optimum Transformations of XML Document from Update Script to DTD Nobutaka Suzuki Graduate School of Library, Information and Media Studies.

Presentation transcript:

On Inferring K Optimum Transformations of XML Document from Update Script to DTD
Nobutaka Suzuki
Graduate School of Library, Information and Media Studies, University of Tsukuba

Outline
  Motivation
  Update operations to DTDs
  Inferring K optimum transformations from update script
  Nondeterministic transformation
  NP-hardness
  Algorithm for inferring K optimum transformations
  Conclusion

Motivation
A DTD is continuously updated as the real world it models changes. Once a DTD is updated, existing XML documents may no longer be valid against the updated DTD. Thus, XML documents have to be transformed according to each DTD update.
[Figure: a DTD is updated to a new DTD, and each XML document valid against the old DTD is transformed accordingly.]

Motivation (cont’d)
Problem: for a DTD update from D to D’, there may be more than one way to transform an XML document t, yielding candidates t1,…,tn. Thus, we have to select one appropriate transformation among them.

Example
DTD before the update:
  <!ELEMENT book (chapter)*>
  <!ELEMENT chapter (section+,bib?)>
  <!ELEMENT section (#PCDATA)>
  <!ELEMENT bib (#PCDATA)>
DTD after the update:
  <!ELEMENT book (section+,bib?)*>
  <!ELEMENT section (#PCDATA)>
  <!ELEMENT bib (#PCDATA)>
[Figure: an XML document valid against the original DTD (a book element whose chapter elements contain section and bib elements) and several candidate transformed documents, in some of which nodes are marked as inserted.]
We have to select an appropriate transformation from these candidates.

Inferring K Optimum Transformations
According to an update script s that updates D to D’, an XML document t can be transformed into t1,…,tn.
If t1,…,tn were listed “randomly”, it would be difficult to find an appropriate one; moreover, n may be infinite.
It is therefore useful to infer the “K optimum” documents among t1,…,tn from s.

Research Objective
  Determine the complexity of inferring K optimum transformations.
  Construct an efficient algorithm for solving this problem.


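Tree Representation of Regular Expression
Each content model (regular expression) of a DTD is represented by a tree, and each node of the tree is identified by its position. For example, the regular expression r = (a|b|c)d* is represented by the following tree:
  position λ (root): . (concatenation)
  position 1: |    position 2: *
  positions 11, 12, 13: a, b, c    position 21: d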

Update Operations to DTDs
  ins_elm(a,b,u): Inserting an element
  del_elm(a,u): Deleting an element
  agg_elm(a,b,u): Aggregating a subexpression
  ext_elm(a,u): Extracting a label
  ins_opr(a,opr,u,v): Inserting an operator
  del_opr(a,u): Deleting an operator
  change_opr(a,opr,u): Replacing the operator
  def_cm(a,r): Defining a content model
  undef_cm(a): Removing a content model

ins_elm(a,b,u): Inserting an Element
Inserts label b at position u in d(a), where d(a) denotes the content model of element a.
Example: ins_elm(a, e, 12) changes d(a) = (b|c)d* into d(a) = (b|e|c)d*. In the tree, e becomes the node at position 12 and c moves from position 12 to position 13.

del_elm(a,u): Deleting an Element
Replaces the label at u by ε if d(a) has a label at u; deletes the ε at u if d(a) has ε at u.
Example: del_elm(a, 13) changes d(a) = (b|c|e)d* into d(a) = (b|c|ε)d*.
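The two element-level operations can be sketched on a position-addressed tree. This is a minimal illustration under assumptions not stated in the talk: content models are nested Python lists of the form [op, child, ...], u always addresses a leaf, and the helper node_at is hypothetical. The element name a is omitted because the sketch works directly on d(a).

```python
import copy

EPS = "ε"  # marker for the empty string

def node_at(tree, pos):
    """Follow a position string such as "12" down from the root."""
    for digit in pos:
        tree = tree[int(digit)]
    return tree

def ins_elm(tree, b, u):
    """Insert label b so that it ends up at position u (returns a new tree)."""
    tree = copy.deepcopy(tree)
    parent = node_at(tree, u[:-1])
    parent.insert(int(u[-1]), b)      # list index 0 holds the operator
    return tree

def del_elm(tree, u):
    """Replace the label at u by ε, or drop the node if it is already ε."""
    tree = copy.deepcopy(tree)
    parent, i = node_at(tree, u[:-1]), int(u[-1])
    if parent[i] == EPS:
        del parent[i]
    else:
        parent[i] = EPS
    return tree

inserted = ins_elm([".", ["|", "b", "c"], ["*", "d"]], "e", "12")      # (b|e|c)d*
deleted = del_elm([".", ["|", "b", "c", "e"], ["*", "d"]], "13")       # (b|c|ε)d*
```

Returning a fresh copy rather than mutating in place keeps the original content model available, which is convenient when several candidate updates are compared.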

agg_elm(a,b,u): Aggregating a Subexpression
Aggregates the subexpression E at u under label b, i.e., replaces E by b and sets d(b) = E.
Example: agg_elm(a, f, 2) changes d(a) = (b|c|e)d* into d(a) = (b|c|e)f with d(f) = d*.

ext_elm(a,u): Extracting a Label
Extracts the label at u, i.e., replaces the label at u, say b, by d(b).
Example: ext_elm(a, 2) changes d(a) = (b|c|e)f, where d(f) = d*, into d(a) = (b|c|e)d*.
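Aggregation and extraction undo each other on the content model of a, which a short sketch makes visible. The representation is assumed for illustration: a DTD is a dict from element names to nested-list content models, u is a non-root position, and node_at is a hypothetical helper. Whether d(f) should be dropped after extraction is not stated on the slides; this sketch keeps it.

```python
import copy

def node_at(tree, pos):
    """Follow a position string such as "12" down from the root."""
    for digit in pos:
        tree = tree[int(digit)]
    return tree

def agg_elm(dtd, a, b, u):
    """Replace the subexpression at u in d(a) by label b and set d(b) to it."""
    dtd = copy.deepcopy(dtd)
    parent, i = node_at(dtd[a], u[:-1]), int(u[-1])
    dtd[b] = parent[i]
    parent[i] = b
    return dtd

def ext_elm(dtd, a, u):
    """Replace the label at u in d(a) by that label's own content model."""
    dtd = copy.deepcopy(dtd)
    parent, i = node_at(dtd[a], u[:-1]), int(u[-1])
    parent[i] = copy.deepcopy(dtd[parent[i]])
    return dtd

d0 = {"a": [".", ["|", "b", "c", "e"], ["*", "d"]]}   # d(a) = (b|c|e)d*
d1 = agg_elm(d0, "a", "f", "2")                       # d(a) = (b|c|e)f, d(f) = d*
d2 = ext_elm(d1, "a", "2")                            # d(a) = (b|c|e)d* again
```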

ins_opr(a,opr,u,v): Inserting an Operator
Inserts an operator opr (one of |, ., *, +, ?) as the parent of the nodes at positions u,…,v in d(a).
Example: ins_opr(a, *, 1, 1) changes d(a) = (b|c|e)d* into d(a) = (b|c|e)*d*. The | node moves from position 1 to position 11, and b, c, e move to positions 111, 112, 113.

del_opr(a,u): Deleting an Operator
Deletes the operator at u in d(a).
Example: del_opr(a, 1) changes d(a) = (b|c|e)*d* into d(a) = (b|c|e)d*. The | node moves from position 11 to position 1, and b, c, e move to positions 11, 12, 13.

change_opr(a,opr,u): Replacing an Operator
Replaces the operator at u in d(a) by opr.
Example: change_opr(a, ?, 1) changes d(a) = (b|c|e)*d* into d(a) = (b|c|e)?d*. The positions of all nodes are unchanged.
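The three operator-level operations can be sketched in the same style. As before, this is only an illustration under assumed conventions: content models are nested lists [op, child, ...], u and v are sibling positions below the root, and node_at is a hypothetical helper. For del_opr, splicing all children of the deleted operator into its parent is an assumption; the slide only shows the unary case.

```python
import copy

def node_at(tree, pos):
    """Follow a position string such as "12" down from the root."""
    for digit in pos:
        tree = tree[int(digit)]
    return tree

def ins_opr(tree, opr, u, v):
    """Insert opr as the new parent of the sibling nodes at positions u..v."""
    tree = copy.deepcopy(tree)
    parent = node_at(tree, u[:-1])
    i, j = int(u[-1]), int(v[-1])
    parent[i:j + 1] = [[opr] + parent[i:j + 1]]
    return tree

def del_opr(tree, u):
    """Delete the operator at u, splicing its children into its parent."""
    tree = copy.deepcopy(tree)
    parent, i = node_at(tree, u[:-1]), int(u[-1])
    parent[i:i + 1] = parent[i][1:]
    return tree

def change_opr(tree, opr, u):
    """Replace the operator at u by opr."""
    tree = copy.deepcopy(tree)
    node_at(tree, u)[0] = opr
    return tree

base = [".", ["|", "b", "c", "e"], ["*", "d"]]   # (b|c|e)d*
starred = ins_opr(base, "*", "1", "1")           # (b|c|e)*d*
optional = change_opr(starred, "?", "1")         # (b|c|e)?d*
restored = del_opr(starred, "1")                 # (b|c|e)d*
```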


Nondeterministic Transformation Inferred from Update Operation
Let s = op1…opn be an update script (a sequence of update operations). Applying s updates the DTD step by step, from D to D1 and so on up to Dn, and the XML document t is transformed according to opi for each i = 1,…,n, giving t1,…,tn in turn.
We define how an XML document t is “nondeterministically” transformed according to a single update operation.

Nondeterministic Transformation Inferred from Update Operation (cont’d)
The transformation is defined separately for each kind of update operation op:
  ins_elm(a,b,u)
  del_elm(a,u)
  agg_elm(a,b,u)
  ext_elm(a,u)
  ins_opr(a,opr,u,v)
  del_opr(a,u)
  change_opr(a,opr,u)

In the Case where op = ins_elm(a,b,u)
The operation inserts label b at u in d(a).
Case where the parent of u is | : do nothing, since the updated content model includes d(a).
Case where the parent of u is . : for each a-element Ea in XML document t,
  (a) find the positions at which new b-elements should be inserted so that Ea becomes “valid” against the updated content model of a;
  (b) for each position v found in step (a), create a valid b-element and insert it at v.

Example
DTD: ins_elm(a, e, 112) changes d(a) = (bc)*d into d(a) = (bec)*d (assuming that d(e) = fg).
XML document: the a-element with children b c b c d is transformed into the a-element with children b e c b e c d, where each inserted e-element has children f g.

In the Case where op = agg_elm(a,b,u)
The operation replaces the subexpression at u in d(a) by label b.
For each a-element Ea in XML document t,
  (a) find the sequences of elements in Ea that match the subexpression at u in d(a);
  (b) for each sequence found in step (a), insert a new b-node as the parent of the sequence.

Example
DTD: agg_elm(a, e, 11) changes d(a) = (bc)*d into d(a) = e*d with d(e) = bc.
XML document: the a-element with children b c b c d is transformed into the a-element with children e e d, where each e-element has children b c.
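The agg_elm example can be reproduced with a deliberately simplified sketch. It assumes the aggregated subexpression is a fixed, non-empty sequence of labels (such as bc) and matches it greedily from left to right; the general definition matches an arbitrary subexpression and may be nondeterministic, which this sketch does not attempt. Documents are encoded as (label, children) pairs, and the name wrap_sequences is hypothetical.

```python
def wrap_sequences(doc, a, b, seq):
    """In every a-element, wrap each run of children labeled seq in a new b-node."""
    assert seq, "the aggregated label sequence must be non-empty"
    label, children = doc
    children = [wrap_sequences(c, a, b, seq) for c in children]
    if label != a:
        return (label, children)
    out, i, n = [], 0, len(seq)
    while i < len(children):
        window = children[i:i + n]
        if [c[0] for c in window] == seq:
            out.append((b, window))      # new b-node becomes the parent
            i += n
        else:
            out.append(children[i])
            i += 1
    return (label, out)

leaf = lambda name: (name, [])
t = ("a", [leaf("b"), leaf("c"), leaf("b"), leaf("c"), leaf("d")])
t_new = wrap_sequences(t, "a", "e", ["b", "c"])
```

On the slide's document, t_new is the a-element with children e e d, each e holding b c.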


Inferring K Optimum Transformations
According to an update script s that updates D to D’, an XML document t can be transformed into t1,…,tn.
If t1,…,tn were listed “randomly”, it would be difficult to find an appropriate one; moreover, n may be infinite.
We therefore consider inferring the K “optimum” documents among t1,…,tn from s.

Cost of Transformation
In this paper, a transformation with the least cost (amount of change) is optimum.
The cost of a transformation from t to t’ is the total number of changes between t and t’, where a change is a node/subtree insertion or a node/subtree deletion.

NP-hardness
Theorem: The following decision problem is NP-hard.
  Instance: A DTD D, an XML document t valid against D, an update script s to D, and a positive integer B.
  Question: Is there a transformation of t inferred from s whose cost is no more than B?
Proof: By a reduction from SAT.

NP-hardness (cont’d)
Thus, it is unlikely that we can efficiently infer K optimum transformations from an update script, even if K = 1.
However, there is an efficient algorithm for finding K optimum transformations when the update script has length one.


Our Algorithm
Let D be a DTD, t be an XML document valid against D, and op be an update operation to D.
The algorithm infers K optimum transformations of t from op in time polynomial in |D|, |t|, and K.

Outline of Algorithm
For simplicity, we show how to infer the “K optimum costs” instead of the “K optimum transformations”.
For each node n in t, the algorithm computes, in a bottom-up manner, the K optimum costs to transform t/n, where t/n denotes the subtree of t rooted at n.
[Figure: an XML document t with nodes n1,…,n14, processed bottom-up; the subtree rooted at a node n is marked as t/n.]

Outline of Algorithm (cont’d)
For each node n in t do (in a bottom-up order):
  1. Let a be the label of n.
  2. Construct a weighted graph G(N,E), which is obtained by “merging” (i) the children of n and (ii) the Glushkov automaton of d(a).
  3. Find K shortest paths p1,…,pK on G(N,E).
  4. Assign the K weights of p1,…,pK to n. These are the K optimum costs to transform t/n.
Return the K weights of the root of t.
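Step 3 asks for the K smallest path weights on a graph whose layers follow the children of n, so the graph is acyclic. One simple way to obtain those weights is a dynamic program over a topological order that keeps the K best weights per node. The sketch below is a generic technique under that assumption, not necessarily the K-shortest-paths routine used in the talk; the function name, the adjacency format, and the toy graph are all hypothetical.

```python
import heapq

def k_smallest_path_weights(edges, order, source, targets, k):
    """K smallest weights of source-to-target paths in a DAG.

    edges: dict mapping a node to a list of (next_node, weight) pairs;
           parallel edges are simply listed more than once.
    order: all nodes in topological order.
    """
    best = {v: [] for v in order}
    best[source] = [0]
    for v in order:                      # best[v] is final when v is reached
        for w, cost in edges.get(v, []):
            candidates = best[w] + [c + cost for c in best[v]]
            best[w] = heapq.nsmallest(k, candidates)
    return heapq.nsmallest(k, [c for t in targets for c in best[t]])

# Toy graph with parallel edges, loosely shaped like the layered graph G(N,E).
edges = {
    "s": [("x", 1), ("x", 2), ("y", 1)],
    "x": [("t", 1), ("t", 3)],
    "y": [("t", 2)],
}
costs = k_smallest_path_weights(edges, ["s", "x", "y", "t"], "s", ["t"], 3)
```

Keeping only K weights per node is sufficient because any of the K best paths to a node extends one of the K best paths to its predecessor.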

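Outline of Step 2 (op = del_elm())
(a) The children of n: the a-element n has children n1, n2, n3, labeled g, c, b.
(b) The Glushkov automaton of d(a) = g(c*b|cb*), with states qI (start), g1, c2111, b212, c221, b2221. The label c at position 221 is the one deleted by del_elm(a,221).
(c) Graph G(N,E), obtained by merging (a) and (b), assuming that K = 2. Its nodes are (qI, n0), (g1, n1), (c2111, n2), (b212, n3), (c221, n2), (b2221, n3), and consecutive nodes are connected by parallel edges labeled k = 1, 2.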
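The state names on this slide (g1, c2111, b212, c221, b2221) are the labels of d(a) tagged with their tree positions, which is how a Glushkov automaton is usually built: one state per label occurrence plus an initial state. The sketch below uses the standard nullable/first/last/follow construction on an assumed tuple encoding of the content model; it ignores ε leaves, and the function name glushkov is hypothetical.

```python
def glushkov(node, pos=""):
    """Return (nullable, first, last, follow) for the subexpression at pos."""
    if isinstance(node, str):
        state = node + pos                      # e.g. "c" at "2111" -> "c2111"
        return False, {state}, {state}, {}
    op, *children = node
    parts = [glushkov(c, pos + str(i)) for i, c in enumerate(children, start=1)]
    follow = {}
    for _, _, _, sub_follow in parts:           # collect follow sets of the operands
        for state, nxt in sub_follow.items():
            follow.setdefault(state, set()).update(nxt)
    if op == "|":
        nullable = any(p[0] for p in parts)
        first = set().union(*(p[1] for p in parts))
        last = set().union(*(p[2] for p in parts))
    elif op == ".":
        nullable, first, last = True, set(), set()
        for n, f, l, _ in parts:                # extend the prefix by one operand
            for state in last:
                follow.setdefault(state, set()).update(f)
            if nullable:
                first = first | f
            last = (last | l) if n else set(l)
            nullable = nullable and n
    else:                                       # unary operator: *, + or ?
        n, first, last, _ = parts[0]
        nullable = n or op in "*?"
        if op in "*+":
            for state in last:
                follow.setdefault(state, set()).update(first)
    return nullable, first, last, follow

# d(a) = g(c*b|cb*)
r = (".", "g", ("|", (".", ("*", "c"), "b"), (".", "c", ("*", "b"))))
nullable, first, last, follow = glushkov(r)
transitions = {"qI": first, **follow}           # qI is the initial state
```

In the result, g1 can be followed by c2111, b212, or c221, and the accepting states are b212, c221, and b2221.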

Graph G(N,E) (op = del_elm())
[Figure: G(N,E) with source (qI, n0), intermediate nodes (g1, n1), (c2111, n2), (c221, n2), and destinations (b212, n3), (b2221, n3); consecutive nodes are connected by parallel edges labeled k = 1, 2.]
The weight of each edge from (*, ni-1) to (*, ni) labeled k is set to the k-th cost to transform t/ni, except that the weight of each edge from (g1, n1) to (c221, n2) is set to 1 (since deleting a subtree costs one). The k-th costs have already been obtained, since the algorithm works in a bottom-up order.
Each path from (qI, n0) to (*, n3) represents the sequence of children of n. The weight of the k-th shortest path from (qI, n0) to (*, n3) represents the k-th cost to transform t/n.

Outline of Step 2 (op = ins_elm())
(a) The children of n: the a-element n has children n1, n2, n3.
(b) The (modified) Glushkov automaton of d(a) = g(ec*b|cb*), where e is the label inserted by ins_elm(a,e,21). Its states are qI (start), g1, c2111, b212, c221, b2221.
(c) Graph G(N,E), obtained by merging (a) and (b), assuming that K = 2. Its nodes are (qI, n0), (g1, n1), (c2111, n2), (b212, n3), (c221, n2), (b2221, n3), and consecutive nodes are connected by parallel edges labeled k = 1, 2.

Graph G(N,E) (op = ins_elm())
[Figure: G(N,E) with source (qI, n0), intermediate nodes (g1, n1), (c2111, n2), (c221, n2), and destinations (b212, n3), (b2221, n3); consecutive nodes are connected by parallel edges labeled k = 1, 2.]
The weight of each edge from (*, ni-1) to (*, ni) labeled k is set to the k-th cost to transform t/ni; then the weight of each edge from (g1, n1) to (c2111, n2) is increased by 1 (since adding a subtree costs one). The k-th costs have already been obtained, since the algorithm works in a bottom-up order.
The weight of the k-th shortest path from (qI, n0) to (*, n3) represents the k-th cost to transform t/n.

Time Complexity of the Algorithm
For a DTD D, an XML document t valid against D, and a positive integer K, the algorithm runs in O(|t| · d(t)^3 · |D|^2 · K^2) time, where d(t) is the maximum outdegree of t.
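To get a feel for how the bound scales, the expression inside the O() can be evaluated for sample sizes. The numbers below are made up for illustration, the constant factor hidden by the O-notation is ignored, and the function name step_bound is hypothetical.

```python
def step_bound(t_size, max_outdegree, dtd_size, k):
    """Evaluate |t| * d(t)^3 * |D|^2 * K^2, ignoring the constant hidden by O()."""
    return t_size * max_outdegree**3 * dtd_size**2 * k**2

# Hypothetical sizes: 1000 nodes, outdegree at most 10, |D| = 20, K = 5.
base = step_bound(1000, 10, 20, 5)
doubled_k = step_bound(1000, 10, 20, 10)
```

With these values the expression evaluates to 10^10, and doubling K multiplies it by four, while the dependence on |t| stays linear for a fixed maximum outdegree.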

Conclusion
Showed that the problem of inferring K optimum transformations from an update script is NP-hard, even if K = 1.
Constructed an algorithm for inferring K optimum transformations from a single update operation, which runs in time polynomial in |D|, |t|, and K.

Future Work
Implement the algorithm and conduct experiments to examine (i) whether the current definition of the cost of a transformation is appropriate, devising another criterion if necessary, and (ii) the efficiency of the algorithm.
Use schema languages more powerful than DTD, e.g., XML Schema and regular tree grammars.

Thank you!