The correction of XML data. Michel de Rougemont, Université Paris II & LRI. Japan, June 2003.

Presentation transcript:

The correction of XML data. Michel de Rougemont, Université Paris II & LRI.
1. Approximation and Edit Distance
2. Testers and Correctors
3. Correcting regular binary trees
4. Applications to XML: practical corrector
5. Relative value of documents

Approximation.
1. Relations: Dist(R,S) = # x : if Dist(R,S) <
2. Edit-distance
3. Trees: Tree-Edit-Distance. Min # deletions, insertions. (Figure labels: left-deletion, left-insertion.)
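As an illustration of an edit distance that counts only insertions and deletions, here is a minimal sketch on words. The function name `indel_distance` is hypothetical and the dynamic program is the textbook one, not code from the talk.

```python
def indel_distance(u: str, v: str) -> int:
    """Minimum number of single-symbol insertions and deletions turning u into v."""
    # prev[j] holds the distance between the current prefix of u and v[:j].
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        cur = [i]  # deleting all i symbols of u[:i] yields the empty word
        for j, b in enumerate(v, start=1):
            if a == b:
                cur.append(prev[j - 1])  # matching symbols cost nothing
            else:
                # either delete a from u, or insert b into u
                cur.append(1 + min(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


print(indel_distance("abcd", "abc"))
print(indel_distance("abc", "abd"))
```

Without substitutions, replacing one symbol costs two operations (a deletion and an insertion), which is why the second call gives a larger value than the first.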

Binary trees: classical Tree-Edit-Distance, and p-Distance, which allows permutation. In the example, Dist(T1,T2) = 2 and p-Dist(T1,T2) = 1. Dist(T, L) = Min Dist(T,T’). (Figure: example trees over nodes a, b, c, d, e, f, showing a deletion and an insertion.)

Approximate satisfiability.
1. Satisfiability: Tree |= F
2. Approximate satisfiability: Tree |= F. Image on a class K of trees.

Logic, testers, correctors. A Tester decides |= for a formula F. A Corrector takes a tree T close to a language L and finds T’ in L close to T. This is possible if F follows a simple logic.
Theorem. There is a linear-time corrector for regular binary trees and a constant distance: given a tree T, k-close to a regular language L, we find in linear time T’ in L, c.k-close to T.
General problem: given a language L defined in some logic, find a corrector.
Theorem (implicit in Alon et al., FOCS 2000). There is a linear time corrector for regular words and distance
Application to Model-Checking (LICS 2002).

Simple example. Tester for 0+ 1* 0+. Types of segments: probably accepted; rejected with high probability. Corrector for 0+ 1* *

Regular Trees: tree automata, logical definability on trees, tree grammar, regular expression. Examples: r(a,b(a,b(a,b(a,b(a,b(a,b)....) and r(a(a,b(a,b(a(a,b),b)....),b)

Tree automata. Transitions:
(q0,q0) -> q1
(q0,q1) -> q1
(q1,q1) -> q2
(q1,q0) -> q2
(q2,-) -> q2
(-,q2) -> q2
(Figure: a run on a binary tree, nodes labelled with states q0, q1, q2.)
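A bottom-up run of such an automaton can be sketched as follows. This is only an illustration under stated assumptions: every leaf is assumed to start in state q0, "-" is read as "any state", node labels are ignored because the transitions on the slide do not mention them, and the names `RULES` and `run` are hypothetical.

```python
# A tree is None (a leaf) or a pair (left, right) of subtrees.
# Assumption: leaves are in state q0; "-" in a rule matches any state.
RULES = {
    ("q0", "q0"): "q1",
    ("q0", "q1"): "q1",
    ("q1", "q1"): "q2",
    ("q1", "q0"): "q2",
}


def run(tree) -> str:
    """Return the state reached at the root by a bottom-up run."""
    if tree is None:
        return "q0"
    left, right = tree
    ql, qr = run(left), run(right)
    # Rules (q2,-) -> q2 and (-,q2) -> q2: q2 propagates upwards.
    if "q2" in (ql, qr):
        return "q2"
    return RULES[(ql, qr)]


leaf = None
print(run((leaf, leaf)))          # two leaves under one node
print(run(((leaf, leaf), leaf)))  # a q1 subtree on the left, a leaf on the right
```

With these six rules every pair of child states has a successor, so the run is always defined on binary trees.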

Feasible and infeasible subtrees. Definition: a subtree t is feasible for L if there are subtrees (for its leaves) which reach states (q1...ql) such that the state of the root q = t(q1...ql) can reach an accepting state (in the automaton for L). A subtree is infeasible if it is not feasible. (Figure: a feasible and an infeasible subtree.)

Infeasible subtrees. Fact. If then the number of infeasible subtrees of length a is O(n). Fact. If the distance is small, there are few infeasible subtrees. Intuition: make local corrections at the roots of the infeasible subtrees.

Structure of the corrector.
Phase 1 (bottom-up): marking of *-nodes, the roots of infeasible subtrees.
Phase 2 (top-down): recursive analysis of the *-subtrees to make the root accept.
Phase 3: local corrections.

Phase 1: bottom-up marking. Definitions:
1. A terminal *-node is the first sink node of a run.
2. The *-subtree of a node v is the subtree rooted at v that extends down to leaves or *-nodes.
3. A node v is a *-node if its state is a sink when all possible reachable states replace the *-nodes of its *-subtree.
4. Compute the size of the subtrees.
(Figure: runs with all possible reachable states (q,q’) reach a sink.)
O(n) procedure.
Lemma 1: If Dist(T,L) < k, there are at most k *-nodes.

Phase 2: top-down possible states. Let (q,q’) be a possible choice at the top *-subtree. Let q’’ be a possible state for the *-node of the left *-subtree. (Figure: children in states q1 and q2, with q’’ instead of *.) Correction needed.

Case analysis of the automaton. Case 1: one essentially-connected component. Case 2: general case, many components.

Case 1: one component. Lemma: if (q1,q2,q’’) are in the same connected component, there is a finite subtree t which can correct. Case a: there is a transition (q,q’) to q’’ with both q, q’ in C: there is a finite tree t1 from q1 to q and a finite tree t2 from q2 to q’. (Figure: the correction, built from t1 and t2.)

Case 1: b and c. Case b: there is a transition (q,q’) to q’’ with one of q or q’ being q0; suppose q = q0. The correction uses t2 and cuts the left branch. Case c: there is a transition (q0,q0) to q’’. The correction cuts both branches. (Figure: the two corrections.)

Correction rules (children in states q1, q2; q’’ instead of *):
q        q’        q’’    Action
q in C   q’ in C   q’’    Insert, Insert
q0       q’        q’’    Cut, Insert

Case 2: many components. Hypothesis: q1 in Ci, q2 in Cj, q’’ in Ck. Case a: P such that Ci < Ck and Cj < Ck. Find t1 and t2 as in case 1.a. (Figure: the correction, built from t1 and t2.)

Case 2: b and c. Case b,c: P such that Ci > Ck and Cj < Ck. Find t2 and let Cp = inf(Ci,Ck). Cut the left branch until Cp. Case d: P such that Ci > Ck and Cj > Ck. Let Cp = inf(Ci,Ck). Cut the left branch until Cp. Let Cq = inf(Cj,Ck). Cut the right branch until Cq.

Correction rules (many components; q’’ instead of *). Columns: q1 in C1, q2 in C2, q in C, q’ in C’, q’’ in C’’, Action. First row: C1 < C’’, C2 < C’’, C1 < C, C2 < C’; Action: Insert, Insert. ....

Analysis of the corrector. Fact 1: finitely many insertions. Fact 2: deletions less predictable. Lemma: If the cut is large, then the distance must be large.
General Corrector:
1. Do the inductive marking bottom-up.
2. Apply the recursive analysis of compatible states top-down.
3. For each transition (q,q’) -> q’’ apply the correction, compute the distance and select the rule with smallest distance.
4. Select the * states with minimum Dist.
The procedure is O(n), exponential in k and size(Q).

General result. Theorem: If Dist(T,L) < k, the general corrector finds T’ such that Dist(T,T’) < c.k. Proof: # *-nodes < k. Case 1: no *-node: no correction. Case 2: at least one *-node. Looking at all possible k-variations will correct the errors in the *-subtree and diminish the *-nodes.

Unranked trees: XML. Labelled trees of large degree. Structure given by a « grammar », or DTD. Generalization of automata: 1. Unranked tree automaton 2. Tree-walking automaton. Method: code an unranked labelled tree with a binary labelled tree. Advantage: the correction table is FINITE. Theorem: If Dist(T,L) < k, the general corrector finds T’ such that Dist(T,T’) < c.k.

Applications to XML. DTD Binary Normal Form:
l -> l1, a
l1 -> c1, t
c1 -> c, c1
c1 -> -
c -> t, p1
p1 -> p, p1
p1 -> -
a -> data
t -> data
p -> data

XML tree decomposition. XML file transformed into a binary labelled tree.
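One common way to turn an unranked XML tree into a binary labelled tree is the first-child / next-sibling encoding. The sketch below uses it as an assumption: the slides do not state which encoding the talk uses, and the names `encode` and `xml_to_binary` are hypothetical.

```python
import xml.etree.ElementTree as ET


def encode(siblings):
    """Encode a list of sibling elements as a binary tree.

    A node is (tag, left, right): left encodes the children of the first
    element, right encodes its following siblings. None is the empty tree.
    """
    if not siblings:
        return None
    first, rest = siblings[0], siblings[1:]
    return (first.tag, encode(list(first)), encode(rest))


def xml_to_binary(text: str):
    """Parse an XML document and return its binary encoding."""
    return encode([ET.fromstring(text)])


print(xml_to_binary("<l><c/><t/><a/></l>"))
```

In this encoding every node has at most two successors regardless of how many children the original element had, which is what makes a finite table of binary transitions possible.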

XML file with errors

Corrected XML file. No ambiguities on the possible states of q’’. Immediate correction!

XML Correction rules (q’’ instead of *). Columns: q1, q2, q, q’, q’’, Action. Rows (partially recovered): - p1 t c ... - - -. Actions: Insert, Link; Delete, Delete.

Java Implementation. Parser: Xerces. Tree structure: DOM.
Phase 1: look at the parent node of the *-node. Propose tags for * (c or f).
Phase 2: for each proposal, compute the distance. *=c: distance = 1, replacing c with b. *=f: distance = 2, replacing c with b and adding an a leaf. Choose the first solution.
DTD: d: (c,b,a) or (f,b,a); c: (a,b,b); f: (a,b,b,a).
(Figure: example tree over the nodes a, b, c, d and *.)
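The two phases can be illustrated with a small sketch (in Python, not the Java/Xerces implementation of the talk). It assumes that the children of the *-node are a, b, c, which is what the two stated distances suggest, and it treats each content model as a fixed sequence of tags. The names `edit_distance`, `CONTENT_MODELS` and `propose_tag` are hypothetical.

```python
def edit_distance(u, v):
    """Edit distance on sequences: insert, delete, substitute, each of cost 1."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        cur = [i]
        for j, b in enumerate(v, start=1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # keep or substitute
        prev = cur
    return prev[-1]


# Content models from the DTD on the slide, as plain tag sequences.
CONTENT_MODELS = {"c": ["a", "b", "b"], "f": ["a", "b", "b", "a"]}


def propose_tag(children, candidates):
    """Return (distance, tag) for the candidate whose content model is closest."""
    scored = [(edit_distance(children, CONTENT_MODELS[t]), t) for t in candidates]
    return min(scored)


# Assumed children of the *-node: a, b, c.
print(propose_tag(["a", "b", "c"], ["c", "f"]))
```

Under these assumptions the candidate c is at distance 1 and f at distance 2, matching the choice made on the slide.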

Relative value of documents. Given a DTD, mark the Web documents as follows:
–Infinity if they are far
–Dist(Document,DTD)=i
Provides a relative valued landscape. Works for boolean combinations. Generalize to
–Min{ Dist(D,DTD’) : }

Distance on words and trees. On words, how can one compute:
–Dist(w,w’), a P-problem
–Is it possible in less than O(n)? Yes, STOC 2003
–Dist(w,L) and Dist(L,L’)
Given two trees, how can one compute:
–Dist(T,T’): P on ordered trees and NP-complete on unordered trees
–p-Dist(T,T’): NP-complete.

Conclusion. Testers and Correctors:
–Testers for approximate verification
–Correctors
Trees:
–Regular trees are testable
–If T is at distance less than k, then we can correct it.
Theoretical algorithms. Practical algorithms.

Testers, Correctors and formal verification. Two different views of logical verification:
1. Formal verification. How can we check if a program satisfies a specification? Logical proof: theorem proving, model checking.
2. Design a tester for the specification (closer to practice: Windows 95 to XP!) (Blum & Kannan).
3. Combine the two approaches to approximately verify a specification (LICS 2002, Sylvain’s thesis).

Testers.
Self-testers and correctors for Linear Algebra: Blum & Kannan, 1985s.
Testers for graph properties: k-colorability, Goldreich et al., 1995s.
graph properties have testers: Alon et al.
Regular languages have testers: Alon et al., 2000s.
Testers for regular tree languages (MdR and Magniez).
Corrector for regular trees!

Blum’s Checker and Tester. Checker for f (Blum, Kannan, ~1990). A checker C is a probabilistic program with an oracle P such that for all x, k: if P = f, C(x,k) = Correct; if P(x) != f(x), Prob[C(x,k) = Buggy] > 1 - ½^k. (Figure: C takes x and y, queries P, and outputs Correct or Buggy.)

Self-testing. Distance: d(f,g) = |{x : f(x) != g(x)}| / |D|. A self-tester for f is a probabilistic program T(P, ε) such that:
–If d(P,f) = 0, then T(P, ε) = Correct
–If d(P,f) > ε, then T(P, ε) = Buggy
Corrector. Division(x,y): Majority { x.r / y.r : r random }.
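The distance and the majority-vote corrector for division can be sketched as follows. Everything concrete here is an assumption made for illustration: exact rationals via `Fraction`, an invented faulty program `buggy_divide` that fails on numerators ending in 3, a fixed number of trials, and a seeded random generator for reproducibility.

```python
import random
from collections import Counter
from fractions import Fraction


def distance(f, g, domain):
    """d(f,g) = |{x : f(x) != g(x)}| / |D| on a finite domain D."""
    domain = list(domain)
    return sum(1 for x in domain if f(x) != g(x)) / len(domain)


def buggy_divide(a, b):
    """Hypothetical faulty program P: wrong whenever a ends in the digit 3."""
    if a % 10 == 3:
        return Fraction(0)
    return Fraction(a, b)


def corrected_divide(program, x, y, trials=31, rng=None):
    """Majority { program(x.r, y.r) : r random }, as in the corrector above."""
    rng = rng or random.Random(0)  # seeded only to make the sketch reproducible
    votes = Counter()
    for _ in range(trials):
        r = rng.randint(1, 10**6)
        votes[program(x * r, y * r)] += 1
    return votes.most_common(1)[0][0]


print(buggy_divide(3, 7))                    # the faulty answer on this input
print(corrected_divide(buggy_divide, 3, 7))  # majority vote over random multiples
```

The idea is that x.r / y.r equals x / y for every r, while the scaled inputs are spread over the domain, so a program that is wrong on a small fraction of inputs is outvoted.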

Property testing on graphs. 2-Colorability (G bipartite, i.e. 2-colorable). H: random subgraph of G.
G bipartite ⇒ Prob[H is bipartite] = 1
G is ε-far from bipartite ⇒ Prob[H is non-bipartite] > 2/3
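A tester of this shape can be sketched by sampling vertices and checking the induced subgraph for 2-colorability. The sample size needed for the 2/3 guarantee is not given on the slide, so it is left as a parameter; the adjacency-dictionary representation and the names `is_bipartite` and `bipartite_tester` are assumptions.

```python
import random
from collections import deque


def is_bipartite(vertices, adj):
    """Check 2-colorability of the subgraph induced by `vertices` (BFS colouring)."""
    inside = set(vertices)
    colour = {}
    for start in inside:
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj.get(u, ()):
                if w not in inside:
                    continue  # edge leaves the sampled subgraph
                if w not in colour:
                    colour[w] = 1 - colour[u]
                    queue.append(w)
                elif colour[w] == colour[u]:
                    return False  # odd cycle found
    return True


def bipartite_tester(adj, sample_size, rng=None):
    """Accept iff a random induced subgraph H of G is bipartite."""
    rng = rng or random.Random()
    nodes = sorted(adj)
    sample = rng.sample(nodes, min(sample_size, len(nodes)))
    return is_bipartite(sample, adj)


cycle = {i: {(i - 1) % 8, (i + 1) % 8} for i in range(8)}  # even cycle, bipartite
print(bipartite_tester(cycle, 5))
```

Since every subgraph of a bipartite graph is bipartite, the sketch never rejects a bipartite G, which is the one-sided error stated in the first implication.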

Property testing on graphs. 3-Colorability. H: random subgraph of G.
G 3-colorable ⇒ Prob[H is 3-colorable] = 1
G is ε-far from 3-colorable ⇒ Prob[H is non 3-colorable] > 2/3
Generalization to k-colorability.

Property testing and descriptive complexity. Which graph (and matrix) properties have testers?
–Alon et al., STOC 99: Sigma 2 testers
Compression. ε-equivalent

Property testing on words. F: 0*1*. H: random subword of the word W.
W |= F ⇒ Prob[H |= F’] = 1
W is ε-far from F ⇒ Prob[H |= not F’] > 2/3

A testable regular property. How can we verify F: 0*1*? distance(w,w’) = Hamming distance. H: random subword of the word W, checked against F’.
W |= F ⇒ Prob[H |= F’] = 1
W is ε-far from F ⇒ Prob[H |= not F’] > 2/3
Many 10 appear in W. Repeating the test will detect it with high probability.
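A minimal sketch of such a tester for 0*1* follows. It assumes that the random subword H is obtained by sampling positions of W and keeping them in order, and that F’ is again 0*1*; the slide does not spell out either choice, and the names `in_zero_star_one_star` and `subword_tester` are hypothetical.

```python
import random


def in_zero_star_one_star(word: str) -> bool:
    """A word over {0,1} is in 0*1* iff it never has a 1 directly before a 0."""
    return "10" not in word


def subword_tester(word: str, sample_size: int, rng=None) -> bool:
    """Accept iff a random subword of `word` (sampled positions, in order) is in 0*1*."""
    rng = rng or random.Random()
    k = min(sample_size, len(word))
    positions = sorted(rng.sample(range(len(word)), k))
    sample = "".join(word[i] for i in positions)
    return in_zero_star_one_star(sample)


print(subword_tester("000111", 4))  # a word of 0*1*: always accepted
```

Any sampled subword of a word in 0*1* is itself in 0*1*, so words satisfying F are accepted with probability 1; a word far from F contains many positions where a 1 precedes a 0, and a sample that hits such a pair rejects.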

Regular properties are testable. Theorem. Regular languages are testable (N. Alon, M. Krivelevich, I. Newman, M. Szegedy, FOCS 99). General idea: if a word is far from a regular language, it contains many subwords which are infeasible and can be detected. Theorem. Dyck languages are not testable.