Containment and Equivalence for an XPath Fragment
Authors: Gerome Miklau, Dan Suciu
Presented by: Shnaiderman Lila


Presentation Outline
- Introduction
- Final Destination
- Definitions and background
- Canonical models and Match Sets
- Exponential time containment algorithm (complete)
- Homomorphism
- Polynomial time containment algorithm (incomplete)
- co-NP hardness of containment
- Additional topics of interest
- Conclusion

Introduction
- XPath is a simple language for navigating XML documents and selecting a set of nodes.
- With XPath we can query XML, describe key constraints, express transformations and reference elements in remote documents.
- XPath's influence can be found in other XML query languages and features such as XQuery, XSLT, XML Schema, XLink, XPointer and more.

Introduction (continued)
- This article deals with a simple XPath fragment that consists of:
  - node tests
  - child axes (/)
  - descendant axes (//)
  - wildcards (*)
  - predicates ([…])
- This class of queries is called XP{[],*,//}.
- Example: a//*[b//d][c] (figure: the corresponding tree pattern, with nodes a, *, b, d and c)

Final Destination
- Show that the containment problem for XP{[],*,//} is co-NP complete (surprising!).
- Present an efficient, sound algorithm that is complete in some cases (this algorithm always runs in PTIME).
- Present a sound and complete algorithm that is efficient in some cases (the worst-case running time of this algorithm is exponential).

Definitions and background
- NP stands for "Nondeterministic Polynomial".
- P: the class of decision problems that can be solved in polynomial time.
- NP: the class of decision problems that can be solved in polynomial time by a nondeterministic algorithm (equivalently, whose "yes" answers can be verified in polynomial time). For the hardest problems in NP no polynomial-time algorithm has been found (yet), and the best known algorithms take exponential time.
- NP-hard problem: a problem to which every problem in NP can be reduced in polynomial time (at least as hard as any problem in NP).
- NP-complete problem: a problem that belongs to NP and is also NP-hard.
- coNP: the class of problems whose complement is in NP. Suppose L is a coNP problem. Then there exists a polynomial-time nondeterministic algorithm M such that:
  - if x ∈ L, then M(x) = "yes" on all computation paths;
  - if x ∉ L, then M(x) = "no" on some computation path.

Definitions and background (continued)
- Embedding: given a tree pattern p and a tree t, an embedding from p to t is a function e: NODES(p) → NODES(t) with the following properties:
  - Root-preserving: e(ROOT(p)) = ROOT(t)
  - Label-preserving: for each x ∈ NODES(p), LABEL(x) = * or LABEL(x) = LABEL(e(x))
  - Child-edge-preserving: for each (x,y) ∈ EDGES_/(p), (e(x), e(y)) ∈ EDGES(t)
  - Descendant-edge-preserving: for each (x,y) ∈ EDGES_//(p), (e(x), e(y)) ∈ EDGES+(t)
- EDGES+(t) is the transitive closure of EDGES(t): (u,v) ∈ EDGES+(t) means that v is reachable from u through at least one edge, i.e. v is a proper descendant of u.

Definitions and background (continued)
- Example (figure): a tree instance t and the tree pattern p for a[a]//*[b]//c.

Definitions and background (continued)
- From XPath to tree patterns: every XP{[],*,//} expression can be translated into a tree pattern of arity 1, and vice versa, while preserving semantics.
- From now on we consider tree patterns only: P{[],*,//} and its fragments.
- Boolean patterns: patterns with arity 0.
- Definition: if p is Boolean then p(t) = ∅ (false) or p(t) = {()} (true).
- Containment means implication: p ⊆ p' iff ∀t. p(t) ⇒ p'(t).
- Proposition 1: Let s_1, …, s_k be k labels that are not in Σ. There is a translation of k-ary patterns over the alphabet Σ to Boolean patterns over the alphabet Σ ∪ {s_1, …, s_k} such that, for any k-ary patterns p, p' and their translations p^0, p'^0, we have p ⊆ p' iff p^0 ⊆ p'^0.

Definitions and background (continued)
- Example (figure): a tree pattern of arity 3 with the distinguished nodes x_1, x_2, x_3, and its translation to a Boolean pattern p^0, as used in Proposition 1. p^0 has three extra nodes labeled s_1, s_2, s_3, one attached to each distinguished node.
- In the rest of this article, we assume all tree patterns to be Boolean, unless otherwise stated.

Definitions and background (continued)
- Mutual reducibility of containment and equivalence: the containment and equivalence problems are mutually reducible in polynomial time. Equivalence is simply two-way containment.
- We will only discuss containment in the remainder of this article.
- Tree pattern evaluation: there is an algorithm that decides, for any tree pattern p and input tree t, whether p(t) is true, and runs in time O(|p||t|).
  - |p| and |t| are the sizes of p and t, meaning the number of nodes in p and t.
  - "p(t) is true" means that there is an embedding from p to t.
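The evaluation step can be sketched in a few lines of Python. This is only an illustration under assumed conventions, not the algorithm from the paper: a pattern node is written as (label, [(edge, child), ...]) with edge "/" or "//", a tree node as (label, [child, ...]), and the function name evaluate is hypothetical. Memoising on (pattern node, tree node) pairs keeps the work proportional to |p||t|.

```python
def evaluate(pattern, tree):
    """Return True iff some embedding maps `pattern` into `tree`.

    Assumed encoding (not from the slides):
      pattern node: (label, [(edge, child_pattern), ...]), edge is "/" or "//"
      tree node:    (label, [child_tree, ...])
    """
    memo = {}

    def at(x, v):
        # Can pattern node x be mapped exactly onto tree node v?
        key = (id(x), id(v))
        if key not in memo:
            label_ok = x[0] == "*" or x[0] == v[0]
            memo[key] = label_ok and all(
                (any(at(y, c) for c in v[1]) if edge == "/"
                 else any(at_or_below(y, c) for c in v[1]))
                for edge, y in x[1]
            )
        return memo[key]

    def at_or_below(y, v):
        # Can pattern node y be mapped onto v or onto a descendant of v?
        key = ("below", id(y), id(v))
        if key not in memo:
            memo[key] = at(y, v) or any(at_or_below(y, c) for c in v[1])
        return memo[key]

    # Root-preserving: the pattern root must land on the tree root.
    return at(pattern, tree)


# The pattern a[a]//*[b]//c and two small, made-up trees.
p = ("a", [("/", ("a", [])),
           ("//", ("*", [("/", ("b", [])), ("//", ("c", []))]))])
t_yes = ("a", [("a", []), ("x", [("b", []), ("d", [("c", [])])])])
t_no = ("a", [("x", [("b", []), ("d", [("c", [])])])])
print(evaluate(p, t_yes), evaluate(p, t_no))
```

The second tree fails only because the root has no child labeled a, which the predicate [a] requires.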

Canonical models and Match Sets
- Model of a Boolean pattern p: a model of p is a tree t ∈ T_Σ on which p evaluates to true.
- Mod(p), the set of models: Mod(p) = {t ∈ T_Σ | p(t) is true}.
- p ⊆ p' iff Mod(p) ⊆ Mod(p').
- Witness: a tree t such that p(t) is true and p'(t) is false. A witness shows that p ⊈ p'.
- To find a witness we would have to search an infinite set of trees, so we need to restrict it.
- Canonical models:
  - First step: eliminate all descendant edges by replacing each edge // with a sequence of wildcards */*/…/*.
  - Second step: replace each wildcard with a symbol z.
- Formally (first step): let p have d descendant edges, EDGES_//(p) = {r_1, …, r_d}. Given d numbers û = (u_1, …, u_d) with u_1 ≥ 0, …, u_d ≥ 0, the extension p[û] is the pattern obtained by replacing each descendant edge r_i with a chain of u_i nodes labeled *, connected by child edges.
- Distance: if r_i = (x, y), then in p[û] the distance is d(x,y) = u_i + 1.

Canonical models and Match Sets (continued)
- Example (figure): a tree pattern p and its extension p[0,2]; the * nodes added in place of the descendant edges are the extension nodes.
- LEMMA: Let e: p → t be an embedding from the tree pattern p to the tree t. There exist a unique extension p[û] and a unique embedding e': p[û] → t such that ∀x ∈ NODES(p), e(x) = e'(x).
- Proof: For each i = 1, …, d, e maps the descendant edge r_i = (x_i, y_i) ∈ EDGES_//(p) to a pair of nodes (e(x_i), e(y_i)) ∈ EDGES+(t). Define u_i = d(e(x_i), e(y_i)) - 1 (d is the distance in t), and let û = (u_1, …, u_d). Extend e to e': p[û] → t by mapping the extension nodes between x_i and y_i to the nodes connecting e(x_i) to e(y_i).

Canonical models and Match Sets (continued)
- Formally (second step): replace the *'s with some symbol z. s_z(p) is the tree obtained from the pattern p by replacing each * in p with z.
- Set of canonical models: mod_z(p) = {s_z(p[û]) | û = (u_1, …, u_d), u_1 ≥ 0, …, u_d ≥ 0}.
  - This set is infinite whenever p has at least one descendant edge.
- Set of bounded canonical models for n ≥ 0: mod_z^n(p) = {s_z(p[û]) | û = (u_1, …, u_d), 0 ≤ u_1 ≤ n, …, 0 ≤ u_d ≤ n}.
  - This set is always finite.
- The star length w of a pattern q is the number of nodes in the longest chain of *-labeled nodes connected by child edges.
- Need to show: to search for a witness of p ⊈ p', it is enough to check the finite set mod_z^n(p), where z does not occur in p' and n depends only on p'.
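To make the two steps concrete, here is a small Python sketch that enumerates the bounded canonical models mod_z^n(p). The encoding is an assumption made for illustration: a pattern node is (label, [(edge, child), ...]) with edge "/" or "//", a tree node is (label, [child, ...]), and the function names are hypothetical.

```python
from itertools import product


def count_descendant_edges(p):
    """Number d of descendant (//) edges in pattern p."""
    return sum((edge == "//") + count_descendant_edges(child)
               for edge, child in p[1])


def expand(p, us, z="z"):
    """Build the tree s_z(p[u]).

    `us` is an iterator that supplies u_1, u_2, ... for the descendant
    edges of p in preorder. Each // edge becomes a chain of u_i nodes
    labeled z, and every * label is replaced by z as well.
    """
    label, edges = p
    children = []
    for edge, child in edges:
        u = next(us) if edge == "//" else 0
        subtree = expand(child, us, z)
        for _ in range(u):          # wrap the subtree in u extension nodes
            subtree = (z, [subtree])
        children.append(subtree)
    return (z if label == "*" else label, children)


def bounded_canonical_models(p, n, z="z"):
    """Yield every tree in mod_z^n(p): all choices 0 <= u_i <= n."""
    d = count_descendant_edges(p)
    for us in product(range(n + 1), repeat=d):
        yield expand(p, iter(us), z)


# p = a/b//c has one descendant edge, so mod_z^1(p) has two trees:
# a/b/c (u = 0) and a/b/z/c (u = 1).
p = ("a", [("/", ("b", [("//", ("c", []))]))])
for t in bounded_canonical_models(p, 1):
    print(t)
```

With d descendant edges the generator yields (n+1)^d trees, which is where the exponential factor in the naive containment test comes from.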

Canonical models and Match Sets (continued)
- Proposition: Let p and p' be two Boolean tree patterns, z ∈ Σ be a symbol that does not appear in p', and w be the star length of p'. Then the following are equivalent:
  (1) p ⊆ p'
  (2) mod_z(p) ⊆ Mod(p')
  (3) mod_z^n(p) ⊆ Mod(p'), where n = w + 1.
- Proof: (1) ⇒ (2) ⇒ (3) is immediate, since p ⊆ p' is equivalent to Mod(p) ⊆ Mod(p') and mod_z^n(p) ⊆ mod_z(p) ⊆ Mod(p). This leaves (3) ⇒ (1), which we prove by contraposition:
  - Suppose p ⊈ p', and let t be a witness (p(t) is true and p'(t) is false). Since p(t) is true, there exists an embedding e: p → t, and by the Lemma there exists e': p[û] → t which agrees with e on the nodes of p.
  - t_1 = s_z(p[û]) ∈ mod_z(p) is still a witness, i.e. p'(t_1) is false. To show this, suppose p'(t_1) were true, so there exists an embedding e_1: p' → t_1.
  - Define f: NODES(p') → NODES(t) by composing e_1: p' → t_1 with e': p[û] → t (possible because NODES(t_1) = NODES(p[û])). Since z does not occur in p', only *-labeled nodes of p' can be mapped to z-labeled nodes of t_1, so f preserves labels and is an embedding p' → t.
  - Then p'(t) is true, a contradiction. Hence p'(t_1) is false.
  - So t_1 = s_z(p[û]) ∈ mod_z(p) is a witness: p(t_1) is true while p'(t_1) is false. This shows (2) ⇒ (1); the bound n is handled by the next lemma.

Canonical models and Match Sets (continued)
- We now construct a canonical model t_2 ∈ mod_z^n(p) that is still a witness. This follows directly from the next lemma.
- Lemma: Let p and p' be two Boolean tree patterns, z ∈ Σ be a symbol that does not appear in p', and w' be the star length of p'. Let t_1 = s_z(p[û]) be a canonical model such that p'(t_1) is false. Define v = (v_1, …, v_d) by v_i = min(u_i, n) for i = 1, …, d, where n = w' + 1, and let t_2 = s_z(p[v]). Then p'(t_2) is false.
- Intuition: if p'(t_2) were true, then we could stretch the chains of extra nodes in t_2 to obtain t_1, and we would still have p'(t_1) true.
- Remark: the n from part (3) depends only on p': n = w' + 1 (w' is the star length of p').
- That concludes the proof: if p ⊈ p', then t_2 ∈ mod_z^n(p) is a witness, so mod_z^n(p) ⊆ Mod(p') ⇒ p ⊆ p'.

Canonical models and Match Sets (continued)
- Match sets: for a tree t (or a pattern p), each node and each edge defines a subtree.
- x ∈ NODES(t) defines t_x, which consists of the node x and its subtree (ROOT(t_x) = x; t_ROOT(t) = t).
- (x,y) ∈ EDGES(t) defines t_{x,y}, which consists of t_y, the node x and the edge (x,y).
- S(t): the set of all subtrees of nodes and edges.
- Example (figure): the pattern p' has root x (label a) with a single child y (label b), which has two children, z (label c) and u (label *). Its subpatterns are p'_x = p'_{x,y} = p', p'_y, p'_{y,z}, p'_{y,u}, p'_z and p'_u.

Canonical models and Match Sets (continued)
- q*: the pattern obtained by replacing the label of the root of q with *.
- The match set of a tree t collects the subpatterns of p' that match t:
  ms(t) = {p'_x | x ∈ NODES(p'), p'_x(t) = true}
        ∪ {p'_{x,y} | (x,y) ∈ EDGES_/(p'), p'_{x,y}(t) = true}
        ∪ {p'_{x,y} | (x,y) ∈ EDGES_//(p'), (p'_{x,y})*(t) = true}
- MS[p] = {ms(t) | t ∈ mod_z(p)}
- Example (figure, for the pattern p' with nodes x, y, z, u):
  - for t_1 = /a/b/c, ms(t_1) = {p'_x, p'_{x,y}, p'_{y,u}, p'_u};
  - for t_2 = /a/b/z/c, ms(t_2) = {p'_{y,u}, p'_u}, because p'_x = p'_{x,y} does not match t_2.

Exponential time containment algorithm
- Naive algorithm: to decide whether p ⊆ p', iterate over all t ∈ mod_z^(w'+1)(p) and check p'(t) (each check requires O(|t||p'|) steps).
- Total time: O(|p||p'|(w'+2)^(d+1)) (based on the size of s_z(p[û]) and the fact that d ≤ |p|).
- Problem: the naive algorithm is not practical, since much of the work in evaluating p'(t) is repeated for the various canonical models t.
- Main idea of the Match Set algorithm: p ⊈ p' iff there exists a canonical tree t ∈ mod_z(p) such that p'(t) is false. So it suffices to compute ms(t) for some t and to check whether p'_ROOT(p') ∉ ms(t).
- Problem: we don't know for which tree t to compute ms(t).
- Solution: compute the set of all match sets, MS[p]. Then it suffices to check the condition ∃ ms ∈ MS[p], p'_ROOT(p') ∉ ms to determine that p ⊈ p'.
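The naive algorithm can be sketched in Python as follows. This is an illustration under assumed conventions rather than the paper's implementation: patterns are (label, [(edge, child), ...]) with edge "/" or "//", trees are (label, [child, ...]), the fresh symbol z defaults to the string "z", and all function names are hypothetical.

```python
from itertools import product


def evaluate(p, t):
    """True iff pattern p embeds into tree t (memoised on node pairs)."""
    memo = {}

    def at(x, v):
        key = (id(x), id(v))
        if key not in memo:
            memo[key] = (x[0] == "*" or x[0] == v[0]) and all(
                (any(at(y, c) for c in v[1]) if e == "/"
                 else any(below(y, c) for c in v[1]))
                for e, y in x[1])
        return memo[key]

    def below(y, v):
        key = ("below", id(y), id(v))
        if key not in memo:
            memo[key] = at(y, v) or any(below(y, c) for c in v[1])
        return memo[key]

    return at(p, t)


def labels(p):
    return {p[0]}.union(*(labels(c) for _, c in p[1]))


def count_descendant_edges(p):
    return sum((e == "//") + count_descendant_edges(c) for e, c in p[1])


def star_length(p):
    """Longest chain of *-labeled nodes connected by child edges."""
    best = 0

    def chain(x):
        nonlocal best
        down = 0
        for e, y in x[1]:
            c = chain(y)            # always recurse so every node is visited
            if e == "/":
                down = max(down, c)
        here = 1 + down if x[0] == "*" else 0
        best = max(best, here)
        return here

    chain(p)
    return best


def expand(p, us, z):
    """Canonical model s_z(p[u]); `us` yields u_i per // edge in preorder."""
    children = []
    for e, child in p[1]:
        u = next(us) if e == "//" else 0
        sub = expand(child, us, z)
        for _ in range(u):
            sub = (z, [sub])
        children.append(sub)
    return (z if p[0] == "*" else p[0], children)


def contained(p, p2, z="z"):
    """Decide p ⊆ p2 by testing every tree in mod_z^(w'+1)(p).

    Returns (True, None) or (False, witness).
    """
    assert z not in labels(p2), "z must not occur in p'"
    n = star_length(p2) + 1
    d = count_descendant_edges(p)
    for us in product(range(n + 1), repeat=d):
        t = expand(p, iter(us), z)
        if not evaluate(p2, t):
            return False, t
    return True, None


# p = a/b//c against a pattern p' shaped like the match-set example:
# root a, child b, which has a child c and a descendant *.
p = ("a", [("/", ("b", [("//", ("c", []))]))])
p2 = ("a", [("/", ("b", [("/", ("c", [])), ("//", ("*", []))]))])
print(contained(p, p2))
```

Here the search stops at the canonical model a/b/z/c, which is a witness for p ⊈ p'. The loop visits up to (w'+2)^d trees, matching the exponential factor in the bound stated on the slide.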

Exponential time containment algorithm (continued)
- Remark: MS[p] has at most as many elements as there are canonical trees in mod_z^(w'+1)(p) (w' is the star length of p'). In many cases it is much smaller, because many canonical trees give the same match set (as in the example above). So the Match Set algorithm is better than the naive one.
- The full algorithm to check whether p ⊆ p' (complete):
  - Compute MS[p].
  - Check whether ∃ ms ∈ MS[p], p'_ROOT(p') ∉ ms.
  - If such a match set exists, return p ⊈ p'.
  - If it doesn't, return p ⊆ p'.
- Example (figure): tree pattern p (nodes a, b, c) and tree pattern p' (nodes x, y, z, u labeled a, b, c, *).
  MS[p] = { {p'_x, p'_{x,y}, p'_{y,u}, p'_u}, {p'_{y,u}, p'_u} }
  p ⊈ p', because ∃ ms ∈ MS[p], p'_ROOT(p') ∉ ms: p'_x ∉ {p'_{y,u}, p'_u}.

Exponential time containment algorithm (continued)
- Running time: O(|p||p'|(w'+2)^d) (based on the size of s_z(p[û]) and the fact that d ≤ |p|).
- This algorithm is sound and complete, and in some cases runs in exponential time.
- In the following example, one match set is {p'_x, p'_{x,y_1}, …, p'_{x,y_n}} and the other match sets are subsets of {p'_{x,y_1}, …, p'_{x,y_n}}, so the algorithm answers false (p ⊈ p'), but it takes exponential time to decide this, because there are 2^n match sets to check.
- Example (figure): tree pattern p with nodes a, b and c_1, c_2, …, c_n; tree pattern p' with root x (label a) and children y_1, y_2, y_3, … (each labeled b).

Homomorphism
- A homomorphism h: p' → p between two tree patterns p, p' is a function h: NODES(p') → NODES(p) that satisfies the embedding conditions (root-, label-, child-edge- and descendant-edge-preserving, with p in the role of t), with the following strengthening of the child-edge-preservation condition:
  (x,y) ∈ EDGES_/(p') ⇒ (h(x),h(y)) ∈ EDGES_/(p) (and not EDGES_//(p)).
- Example (figure): two tree patterns P and P' with a homomorphism from P' to P.
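A plain homomorphism test (without the adornment and shadowing refinements) can be sketched in Python as below. The encoding is an assumption made for illustration: a pattern node is (label, [(edge, child), ...]) with edge "/" or "//", and the function name is hypothetical. A child edge of p' must land on a child edge of p, while a descendant edge of p' may land on any downward path of length at least one in p.

```python
def homomorphism_exists(p2, p):
    """True iff there is a homomorphism h: p2 -> p between two patterns."""
    memo = {}

    def at(x, v):
        # Can node x of p2 be mapped exactly onto node v of p?
        key = (id(x), id(v))
        if key not in memo:
            label_ok = x[0] == "*" or x[0] == v[0]
            memo[key] = label_ok and all(
                # child edge of p2: target must be a child edge of p
                (any(at(y, c) for e, c in v[1] if e == "/") if edge == "/"
                 # descendant edge of p2: any path of length >= 1 in p
                 else any(at_or_below(y, c) for _, c in v[1]))
                for edge, y in x[1]
            )
        return memo[key]

    def at_or_below(y, v):
        key = ("below", id(y), id(v))
        if key not in memo:
            memo[key] = at(y, v) or any(at_or_below(y, c) for _, c in v[1])
        return memo[key]

    return at(p2, p)   # root-preserving: root of p2 goes to root of p


# a/b/c is contained in a//c, and a homomorphism a//c -> a/b/c exists.
print(homomorphism_exists(("a", [("//", ("c", []))]),
                          ("a", [("/", ("b", [("/", ("c", []))]))])))

# a//*/b and a/*//b are equivalent, yet there is no homomorphism
# from a/*//b to a//*/b: the child edge a/* has no child edge to map to.
print(homomorphism_exists(("a", [("/", ("*", [("//", ("b", []))]))]),
                          ("a", [("//", ("*", [("/", ("b", []))]))])))
```

The second pair is a standard example of why a bare homomorphism test is sound but incomplete once // and * are combined.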

Homomorphism (continued)
- Problem: the homomorphism test fails in the following case for P{//,*} (figure: patterns P and P' over the nodes a, *, b, for which containment holds but no homomorphism exists).
- Solution, adornment: combine // with * using the rewriting rules
  - // → //≥0
  - //≥m * / → //≥m+1
  - / * //≥n → //≥n+1
  - //≥m * //≥n → //≥m+n+1
- Only * nodes with a single child may be eliminated this way.
- In a homomorphism with adornment, d(h(x),h(y)) ≥ d(x,y), where d is the distance function.
- Example: p' = a//*/*/b/*/c//d is rewritten to p'' = a //≥2 b/*/c //≥0 d.
- (Figure: the patterns P and P'' after adornment, with the resulting homomorphism.)

Homomorphism (continued)
- Problem: in the following case there is no homomorphism (figure: tree pattern p with nodes b, *, c and tree pattern p' with nodes b, *). The * leaf of p' has no outgoing edge, so it can't be eliminated by adornment.
- Shadowing: for every leaf node in both p and p', add a shadow leaf with a label that does not occur in p or p', connected to the original leaf by a descendant edge.
- (Figure: the same patterns p and p' after shadow leaves have been added and p' has been adorned.)

Polynomial time containment algorithm
- The algorithm:
  - Add shadow leaf symbols to p and p'.
  - Apply the rewriting rules (adornment) to p' and get p''.
  - Find a homomorphism from p'' to p.
  - If one is found, return true.
  - Else return false.
- Properties of the algorithm:
  - This algorithm is sound.
  - Running time: polynomial, O(|p||p'|), dominated by the part that checks for the existence of a homomorphism.
  - This algorithm is not complete.
  - This algorithm is complete in the following 4 cases:
    - p ∈ P{[],*}
    - p' ∈ P{[],*}
    - p' ∈ P{[],//}
    - p' ∈ P{*,//}
  - The proof is given in the paper.

Polynomial time containment algorithm (continued)
- Example of an incomplete case (figure): tree pattern p and the adorned tree pattern p'. After the possible mappings have been tried there are no more options, so no homomorphism is found and the algorithm fails, although p ⊆ p' (this can be shown by case analysis).
- Reminder: in a homomorphism with adornment, d(h(x),h(y)) ≥ d(x,y).

co-NP hardness of containment
- First we show that the problem "given p, p' ∈ P{[],*,//}, decide whether p ⊆ p'" is in co-NP.
- Reminder: to show that p ⊈ p' we have to find t ∈ mod_z^n(p) and show that there is no embedding from p' to t.
- To prove that the problem is in co-NP, we present a nondeterministic algorithm that checks p ⊈ p': guess d numbers u_1, …, u_d, each u_i ≤ w'+1, where w' is the star length of p', construct the canonical model t = s_z(p[u_1, …, u_d]), then check in polynomial time that p'(t) is false. Hence non-containment is in NP, and the containment problem is in co-NP.
- Another definition of containment: containment of a Boolean pattern p in a union of patterns is defined as follows:
  p ⊆ p_1 ∪ … ∪ p_k holds if, for all trees t, p(t) ⇒ p_1(t) ∨ p_2(t) ∨ … ∨ p_k(t).
- Lemma: Given the patterns p, p_1, p_2, …, p_k in P{[],*,//}, there exist patterns q, q' in P{[],*,//} such that p ⊆ p_1 ∪ … ∪ p_k iff q ⊆ q'.
  - The sizes of q and q' are polynomial in the sizes of p, p_1, p_2, …, p_k.
  - q and q' have no more wildcards than those present in p, p_1, p_2, …, p_k.

co-NP hardness of containment (continued)
- Proof: to prove the lemma we use the following construction (figure):
  - Pattern q: a root r above a spine of nodes labeled c. The first k-1 spine nodes each carry a copy of V, the next one carries p, and the last k-1 each carry a copy of V.
  - Pattern q': a root r above a spine of k nodes labeled c, carrying p_1, p_2, …, p_k.
  - V has no * and no //, and V ⊆ p_j for every j. V is obtained by fusing the (common) roots of the p_i subtrees, replacing each * in p_i with some letter a and each // with /.
- The canonical models of q are completely determined by the choice of a canonical model for q's subtree p: for each t ∈ mod_z(q), t_p ∈ mod_z(p) is the subtree corresponding to p.

co-NP hardness of containment (continued)
- Returning to the lemma. p ⊆ p_1 ∪ … ∪ p_k ⇒ q ⊆ q' (i.e. for every t ∈ mod_z(q), q'(t) is true):
  - for t ∈ mod_z(q), p(t_p) is true,
  - so p_i(t_p) is true for some i ∈ {1, …, k},
  - so q'(t) is true by the following embedding e: q' → t: e maps the root of q' to the root of t, the subpattern p_i to t_p, and every other p_j to a corresponding copy of V (there are enough copies of V below and above p to do this).
- q ⊆ q' ⇒ p ⊆ p_1 ∪ … ∪ p_k (i.e. for every t_p ∈ mod_z(p), p_1(t_p) ∨ p_2(t_p) ∨ … ∨ p_k(t_p) is true):
  - Given t_p ∈ mod_z(p), let t ∈ mod_z(q) be the extension of t_p obtained by adding the spine and k-1 copies of V above and below t_p.
  - q(t) is true, so q'(t) is true, so there exists an embedding e: q' → t. This embedding must map the spine of q' to the spine of t.
  - Let x be the spine node in t that is right above t_p. At least one spine node of q' must be mapped to x, because there are only k-1 spine nodes above x and k-1 below it, while the spine of q' has k nodes and no descendant edges.
  - So some spine node y of q' is mapped to x, and the subpattern p_i attached to y is mapped into t_p. Hence p_i(t_p) is true, and p_1(t_p) ∨ p_2(t_p) ∨ … ∨ p_k(t_p) is true.
- (Figure: the patterns q and q' from the construction.)

co-NP hardness of containment (continued)

Now we are ready to prove co-NP hardness, by reduction from 3-CNF satisfiability.

Let ψ be a 3-CNF formula with n propositional variables y_1, y_2, ..., y_n and k clauses c_1, c_2, ..., c_k. We construct patterns A, C_1, ..., C_k such that ψ is not satisfiable iff A ⊆ C_1 ∪ … ∪ C_k. The tree pattern A is constructed so that its canonical models, mod_z(A), encode truth assignments to the n variables of ψ. Each tree pattern C_i is constructed so that the following property holds:

(*) For every t ∈ mod_z(A), C_i(t) is true iff the truth assignment encoded by t makes the clause c_i false.

Property (*) is sufficient to prove co-NP hardness because of the last Lemma and the following equivalences:

(A ⊆ C_1 ∪ … ∪ C_k)
⇔ (for every t ∈ mod_z(A) there exists i such that C_i(t) is true)
⇔ (for every truth assignment there exists i such that c_i is false under that assignment)
⇔ (ψ is not satisfiable).

Let us show how to construct A, C_1, ..., C_k so that property (*) is satisfied.
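The last two steps of the chain of equivalences can be checked by brute force on small formulas. The following is a minimal sketch, not part of the original lecture: the clause encoding (a list of non-zero integers, +i for y_i and -i for ¬y_i) and the function names are assumptions made for illustration.

```python
from itertools import product


def clause_is_false(clause, assignment):
    """True iff every literal of the clause is false under the assignment.

    clause: list of non-zero ints, +i meaning y_i and -i meaning NOT y_i
    (a hypothetical encoding chosen for this sketch).
    assignment: dict mapping variable index i to a bool.
    """
    return not any(assignment[abs(lit)] == (lit > 0) for lit in clause)


def is_unsatisfiable(clauses, n):
    """True iff every truth assignment to y_1..y_n falsifies some clause."""
    for values in product([False, True], repeat=n):
        assignment = {i + 1: value for i, value in enumerate(values)}
        if not any(clause_is_false(c, assignment) for c in clauses):
            return False  # this assignment satisfies every clause
    return True
```

The loop visits all 2^n assignments, so it only illustrates the statement being reduced; it says nothing about how containment itself is decided.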

co-NP hardness of containment (continued)

For t ∈ mod_z(Y_i): if t consists only of a_i followed by b, it corresponds to a truth assignment making y_i true. If t contains one or more added nodes between a_i and b, it corresponds to a truth assignment making y_i false.

We define a tree pattern C_i for each clause of ψ by an example. For c_i = (¬y_j ∨ y_k ∨ ¬y_l), the pattern C_i has root r with the subpatterns T(y_j), F(y_k) and T(y_l).

[Figure: the patterns Y_i, T(y_i) and F(y_i), built from the nodes a_i, b and *; tree pattern A, with root r above Y_1, Y_2, …; tree pattern C_i, with root r above T(y_j), F(y_k), T(y_l).]
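Property (*) can be checked mechanically for this encoding. The sketch below rests on assumptions that the slide text does not spell out: Y_i is written as a_i//b, T(y_i) as a_i/b, and F(y_i) as a_i/*//b, and a canonical model of A is summarised by the number of added nodes between each a_i and b. The clause encoding (+i for y_i, -i for ¬y_i) and all function names are hypothetical.

```python
def pattern_A(n):
    """Pattern A as an XPath-style string: root r with one branch Y_i per variable."""
    return "r" + "".join(f"[a{i}//b]" for i in range(1, n + 1))


def pattern_C(clause):
    """Pattern C_i for a clause given as ints (+i for y_i, -i for NOT y_i).

    A positive literal is false only when y_i is false, so it contributes F(y_i);
    a negative literal is false only when y_i is true, so it contributes T(y_i).
    """
    branches = []
    for lit in clause:
        i = abs(lit)
        branches.append(f"[a{i}/*//b]" if lit > 0 else f"[a{i}/b]")
    return "r" + "".join(branches)


def encoded_assignment(added):
    """added[i] = number of nodes inserted between a_i and b in the model.

    No added node encodes y_i = true; one or more encodes y_i = false.
    """
    return {i: count == 0 for i, count in added.items()}


def C_holds(clause, added):
    """Evaluate C_i on the model: T(y_i) needs 0 added nodes, F(y_i) at least 1."""
    return all(
        added[abs(lit)] >= 1 if lit > 0 else added[abs(lit)] == 0
        for lit in clause
    )
```

For the example clause (¬y_1 ∨ y_2 ∨ ¬y_3), `pattern_C([-1, 2, -3])` gives `r[a1/b][a2/*//b][a3/b]`, and `C_holds` is true exactly on the models that encode y_1 = true, y_2 = false, y_3 = true, which is the only assignment that makes the clause false.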

co-NP hardness of containment (continued)

Consider arbitrary bounds on the number of occurrences of //, *, or []:

For //: the containment problem p ⊆ p' is in PTIME if we bound the number of // edges by some d ≥ 0. We showed this at the beginning of the lecture, when we worked on bounded canonical models.

For *: the containment problem p ⊆ p' remains co-NP hard even if we allow at most two *. This will not be proved here.

For []: the containment problem p ⊆ p' remains co-NP hard even if we allow at most five [] in p and at most three [] in p'. This will not be proved here.
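The intuition behind the // bound is a counting argument: if each // edge may be expanded by between 0 and z_max added nodes, a pattern with d such edges has (z_max + 1)^d bounded canonical models, which is polynomial in z_max once d is fixed. The sketch below illustrates this for a linear pattern only. The representation of a pattern as a list of (axis, label) steps, the name `z_max`, and the filler label "z" are assumptions for illustration; the exact length bound used earlier in the lecture is not restated in this section.

```python
from itertools import product


def bounded_canonical_models(steps, z_max, filler="z"):
    """Enumerate bounded canonical models of a linear pattern.

    steps: list of (axis, label) pairs, axis being "/" or "//",
           e.g. [("/", "a"), ("//", "b")] for a//b.
    Each "//" step is expanded by 0..z_max filler nodes, and each "*"
    label is replaced by the filler label. Yields one label chain per model.
    """
    slots = [i for i, (axis, _) in enumerate(steps) if axis == "//"]
    for lengths in product(range(z_max + 1), repeat=len(slots)):
        extra = dict(zip(slots, lengths))
        chain = []
        for i, (_, label) in enumerate(steps):
            chain.extend([filler] * extra.get(i, 0))  # nodes added for "//"
            chain.append(filler if label == "*" else label)
        yield chain
```

For a//b/*//d with `z_max = 2` there are two // edges, so the generator yields (2 + 1)^2 = 9 chains, from `a, b, z, d` up to `a, z, z, b, z, z, z, d`.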

Additional topics of interest

Disjunction: Containment for P{[],|} patterns is already co-NP complete. It can be shown that containment for P{//,*,[],|} is also in co-NP. Given expressions p, p' ∈ XP{|}, deciding containment is co-NP hard, and so it is of course also co-NP hard for XP{//,*,[],|}.

Finite alphabet: This article's results do not hold for a finite alphabet whose size is not two. In another article (Neven & Schwentick) it is shown that, for a finite alphabet, containment is in PSPACE for P{//,*,[],|} and PSPACE-complete for P{[],|}.

Evaluation on graphs: All results in this article apply directly to an extension of Boolean patterns evaluated on graphs (the article itself deals with trees).

Application to CTL (computation tree logic): All co-NP completeness results in this article also apply to a fragment of CTL (ECTL).

Conclusion

We have studied the complexity of containment and equivalence for an important core fragment of XPath. Many XML applications benefit from a practical decision procedure for containment of such expressions. Our results provide intuition into the factors that contribute to its high complexity. Nevertheless, we show that in some significant special cases, containment can be decided efficiently, and we provide an algorithm which does so.

One direction for future work is to expand this fragment of XPath with additional features, although it is clear that it will be even more challenging to prove efficient special cases of the problem. Another direction is to study containment of XPath expressions over sets of documents conforming to constraints or schema restrictions. Preliminary work shows that sufficiently expressive constraints make this problem intractable for XPath fragments that otherwise have efficient containment problems.

THE END!
