Approximate Data Exchange. Michel de Rougemont (University Paris II & LRI), Adrien Vieilleribière (University Paris-Sud & LRI). ICDT 2007.

Presentation transcript:

1 Approximate Data Exchange Michel de Rougemont Adrien Vieilleribière University Paris II & LRI University Paris-Sud & LRI ICDT 2007

2 Motivation
1. Data from different imperfect sources: a framework for Data Exchange and Data Integration.
2. Logic and approximation: definability and complexity (scaling), robustness.
3. Statistics-based computations.

3 Plan
1. Classical Data Exchange on words and trees.
2. Approximation based on Property Testing: tester for regular words and regular trees (Edit Distance with Moves); Property testing for regular tree languages (ICALP 2004); Approximate Satisfiability and Equivalence (LICS 06).
3. Approximate Data Exchange.

4 1. Data Exchange on Trees
[Figure: a source tree and candidate target trees.]

5 Classical Data Exchange
Data Exchange setting: (K_S, τ, K_T).
Fagin et al. 2002: τ defined by source-to-target dependencies on relations.
Arenas, Libkin 2005: τ defined by tree-pattern formulas on trees.
Source-Consistency: given a source structure I in K_S, is there a target J in K_T s.t. (I,J) is in τ?
Typechecking: decide if for all I in K_S and all J s.t. (I,J) is in τ, J is in K_T.
Composition of settings?
Query Answering: given a source structure I in K_S, decide if for all J s.t. (I,J) is in τ, J is in K_Q.

6 Class τ defined by Transducers
Deterministic transducer on unranked trees with attributes. In practice, an XSLT program. Generalization to non-deterministic transducers.
[Figure: example word transducers with transitions labelled 0:ab, 1:a and 0:c, the target language c(ab)*ca*, sample outputs cabababcaaaaa, abababaaaaab and ababaaa + abcaaa + cabaaa + ccaaa, and the output pattern c* ab c* a c* a c* for the input 011.]
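To make the classical questions concrete on words, here is a minimal Python sketch. It assumes a one-state deterministic transducer using the labels 0:ab and 1:a from the slide; the source class `K_S = 0*1*` and the target class `K_T = (ab)*a*` are hypothetical choices made only for illustration, and typechecking is checked only on words up to a bounded length, which is a sanity check and not a decision procedure.

```python
import re
from itertools import product

# One-state transducer: each source letter is replaced by a fixed word.
# The rules reuse the labels 0:ab and 1:a; K_S and K_T are assumptions.
RULES = {"0": "ab", "1": "a"}
K_S = re.compile(r"0*1*")      # hypothetical source class
K_T = re.compile(r"(ab)*a*")   # hypothetical target class


def tau(word: str) -> str:
    """Image of a source word under the one-state transducer."""
    return "".join(RULES[letter] for letter in word)


def source_consistent(word: str) -> bool:
    """Is the word in K_S with its (unique) image in K_T?"""
    return bool(K_S.fullmatch(word)) and bool(K_T.fullmatch(tau(word)))


def typechecks_up_to(max_len: int) -> bool:
    """Bounded check: every word of K_S up to max_len maps into K_T."""
    for n in range(max_len + 1):
        for letters in product("01", repeat=n):
            word = "".join(letters)
            if K_S.fullmatch(word) and not K_T.fullmatch(tau(word)):
                return False
    return True
```

For instance, `tau("011")` gives `abaa`, and `source_consistent("10")` is false because `10` is not in the assumed source class.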

7 Approximate Data Exchange
(K_S, τ, K_T) is a setting, where τ is a transducer:
ε-Source-Consistency: given a source structure I, is there a source I' in K_S, ε-close to I, s.t. τ(I') is ε-close to K_T?
ε-Typechecking: decide if for all I in K_S, τ(I) is ε-close to K_T.
ε-Composition of settings.
General transducer τ:
ε-Query Answering: given a source structure I, is there a source I' ε-close to I s.t. any J [s.t. (I',J) is in τ] is ε-close to K_Q?

8 2. Property Testing
Let F be a property on a class K of structures U. An ε-tester for F is a probabilistic algorithm A such that:
- if U |= F, A accepts;
- if U is ε-far from F, A rejects with high probability.
A property F is testable if there exists a probabilistic algorithm A s.t. for all ε, A is an ε-tester for F and Time(A) is independent of n = |U|.
R. Rubinfeld, M. Sudan, Robust characterizations of polynomials, 1994.
O. Goldreich, S. Goldwasser and D. Ron, Property Testing and its connection to Learning and Approximation, 1996.
A tester usually implies a linear-time corrector. (ε1, ε2)-tolerant tester.

9 Approximate Satisfiability and Equivalence
1. Satisfiability: T |= F.
2. Approximate Satisfiability: T |=_ε F.
3. Approximate Equivalence: [formula image] on a class K of trees.

10 Edit Distances with Moves
1. Classical Edit Distance: insertions, deletions, modifications.
2. Edit Distance with Moves.
Edit Distance with Moves generalizes to ordered trees.

11 Uniform Statistics: k = 1/ε
W: word of length n, with n-k+1 blocks of length k. For k = 2 and n = 12, there are 11 blocks.
Distance between words (NP-complete).
Testable, O(1): sample N subwords of length k to obtain Y(W) and Y(W'). If |Y(W) - Y(W')|_1 < ε accept, else reject.
Fact 1: dist(W,W') is approximated by |u.stat(W) - u.stat(W')|_1 for words of similar length.
Fact 2: |u.stat(W) - Y(W)|_1 is bounded by a small error, for Y(W) the u.stat vector on N samples.
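The statistics-based test can be sketched in a few lines of Python. This is an illustration under assumptions: the function names, the default number of samples and the fixed seed are not from the slides, and the sketch uses the L1 distance between the two sampled vectors with the threshold ε stated on the slide.

```python
import random
from collections import Counter
from fractions import Fraction


def ustat(word, k):
    """Exact uniform statistics: frequency of each block of length k."""
    n_blocks = len(word) - k + 1
    counts = Counter(word[i:i + k] for i in range(n_blocks))
    return {block: Fraction(c, n_blocks) for block, c in counts.items()}


def sampled_stat(word, k, n_samples, rng):
    """Y(W): the same vector estimated from n_samples random blocks."""
    n_blocks = len(word) - k + 1
    counts = Counter()
    for _ in range(n_samples):
        i = rng.randrange(n_blocks)
        counts[word[i:i + k]] += 1
    return {block: c / n_samples for block, c in counts.items()}


def l1(p, q):
    """L1 distance between two sparse frequency vectors."""
    return sum(abs(p.get(b, 0) - q.get(b, 0)) for b in set(p) | set(q))


def close_by_statistics(w1, w2, eps, n_samples=2000, seed=0):
    """Accept when the sampled statistics differ by less than eps in L1."""
    k = max(1, round(1 / eps))          # block length k = 1/eps
    rng = random.Random(seed)           # fixed seed: an assumption, for repeatability
    y1 = sampled_stat(w1, k, n_samples, rng)
    y2 = sampled_stat(w2, k, n_samples, rng)
    return l1(y1, y2) < eps
```

With `k = 2`, a word of length 12 such as `000111000111` has 11 blocks, and `ustat` reports the block `00` with frequency 4/11. The number of samples depends only on the desired accuracy and not on the word length, which is what makes the test constant time.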

12 Statistics on Regular Expressions
r = (010)*0*1* + 1*(01)*(110)*, k = 2.
H = {u.stat(w) : w in r} is a union of polytopes: 2 polytopes for r.
Membership Tester: compute Y(w). Accept if d(Y(w), H) is within the tolerance, else reject.

13 3. Approximate Data Exchange
ε-Source-Consistency: given a source structure I, is there a source I' in K_S, ε-close to I, s.t. τ(I') is ε-close to K_T? Complexity parameter: n = |I|.
Case of 1-state on words: how to k-sample uniformly in τ(I)? Suppose τ(0) = a, τ(1) = bbb. Adjust the probabilities:
- if s = 0…, 1 possible block from τ(0): adjust with 1/3;
- if s = 1…, 3 possible blocks from τ(1): choose a shift in {0,1,2} uniformly.
This approximates u.stat(τ(I)).
Example: I = 0 0 0 0 1 1, τ(I) = a a a a b b b b b b.
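The probability adjustment can be read as rejection sampling: pick a source position uniformly, keep it with probability proportional to the length of its image, then pick a shift inside the image. The Python sketch below follows that reading for the rules τ(0) = a, τ(1) = bbb; the function name and the retry loop are assumptions, and the sketch assumes τ(I) has at least k letters (otherwise it would not terminate).

```python
import random
from collections import Counter

# One-state transducer from the slide: tau(0) = a, tau(1) = bbb.
RULES = {"0": "a", "1": "bbb"}
MAX_LEN = max(len(image) for image in RULES.values())


def sample_block(source, k, rng):
    """Draw a block of length k uniformly among the blocks of tau(source),
    reading only a few letters of the source around a random position."""
    while True:
        i = rng.randrange(len(source))
        image = RULES[source[i]]
        # Keep position i with probability len(image) / MAX_LEN
        # (1/3 for the letter 0, 1 for the letter 1).
        if rng.randrange(MAX_LEN) >= len(image):
            continue
        shift = rng.randrange(len(image))   # uniform shift inside the image
        out = image[shift:]
        j = i + 1
        while len(out) < k and j < len(source):
            out += RULES[source[j]]
            j += 1
        if len(out) >= k:
            return out[:k]
        # The block would run past the end of tau(source): draw again.
```

Every output position is selected with the same probability 1/(n · MAX_LEN), so the accepted blocks are uniform. On the source `000011` with `k = 2`, the empirical frequencies approach 3/9 for `aa`, 1/9 for `ab` and 5/9 for `bb`.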

14 Analysis for ε-Source-Consistency
u.stat(I) is decomposed over the source blocks u1, u2, u3. The statistics of the image of I are decomposed over the output blocks v1, v4, v2, v3, with the weight of u1 shared between v1 and v4.
[Figure: transducer with states q1, q2, q3, q4 and transitions labelled u1:v1, u2:v2, u3:v3, u1:v4; polytopes H_S for u.stat(K_S), H_T for u.stat(K_T), and the polytope of the image statistics.]

15 Tester for ε-Source-Consistency
Tester:
1. If u.stat(I) is ε-far from H_S: reject [I is far from K_S]. This step uses a tester for K_S.
2. Generate Π = {π | u.stat(I) is ε-close to being decomposable over H_π}. This step uses testers for the K_π.
3. While Π is not empty: take a π in Π, approximate u.stat(τ_π(I)) and x = d(u.stat(τ_π(I)), H_T). If x is within the tolerance, accept and stop; else remove π from Π.
4. Reject.
Find I': if the test accepts, split the u1 blocks with the accepted proportions, the weight of u1 being shared between v1 and v4 in u.stat(τ_π(I)):
I = u2 u1 u1 u1 u1 u1 u1 u1 u1 u1 u3 u3
I' = u1 u1 u1 u2 u3 u3 u1 u1 u1 u1 u1 u1
[Figure: H_T with the two extreme splits, where one of the two proportions is 0 and the other is 1.]

16 Approximate ε-Source-Consistency
Lemma: If I is s.t. τ(I) is in K_T, then A accepts because there is a path π with dist(τ_π(I), K_T) = 0.
Lemma: If I is ε-far from being Source-Consistent, then the tester rejects with high probability.
Theorem: For every ε > 0, there is an ε-tester for ε-Source-Consistency on words.
Corollary: If I is ε-Source-Consistent, the procedure leads to an I' s.t. τ(I') is close to K_T.

17 ε-Source-Consistency
Given a source structure I, is there a source I' ε-close to I s.t. τ(I') is ε-close to K_T?
Case of 1-state: how to k-sample uniformly in τ(I)? Suppose τ(0) = ab, τ(1) = a, τ(2) = ccc. Adjust the probabilities:
- if s = 0…, 2 possible blocks from τ(0): adjust with 1/3;
- if s = 1…, 1 possible block from τ(1): adjust with 1/6;
- if s = 2…, 3 possible blocks from τ(2): adjust with 1/2.
This approximates ustat(τ(I)).
Example: τ(I) = a b a b c c c a a a a. Outputs, used to approximate ustat(τ(w)):
aa: 4/7 · 1/6 · 3/4 = 1/14
ab: 2/7 · 1/3 · 1/2 = 1/21
bc: 1/7 · 1/3 · 1/2 = 1/42
ca: 1/7 · 1/2 · 1/3 = 1/42
cc: 1/7 · 1/2 · 2/3 = 1/21
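The listed fractions can be checked exactly. The Python sketch below enumerates every source position and shift with exact rational arithmetic, weighting each letter by the length of its image divided by 6 (that is 1/3, 1/6 and 1/2). The source word `0021111` is an assumption inferred from the displayed output a b a b c c c a a a a, and the function names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

# Rules from the slide: tau(0) = ab, tau(1) = a, tau(2) = ccc.
RULES = {"0": "ab", "1": "a", "2": "ccc"}
TOTAL = sum(len(image) for image in RULES.values())   # 6


def block_masses(source, k):
    """Exact probability mass that one sampling round puts on each block:
    P(position) * adjustment weight * P(shift)."""
    n = len(source)
    masses = defaultdict(Fraction)
    for i, letter in enumerate(source):
        image = RULES[letter]
        weight = Fraction(len(image), TOTAL)           # 1/3, 1/6 or 1/2
        for shift in range(len(image)):
            out = image[shift:]
            j = i + 1
            while len(out) < k and j < n:
                out += RULES[source[j]]
                j += 1
            if len(out) >= k:                          # block fits in tau(source)
                masses[out[:k]] += Fraction(1, n) * weight * Fraction(1, len(image))
    return dict(masses)


def normalize(masses):
    """Rescale the masses into a probability vector."""
    total = sum(masses.values())
    return {block: mass / total for block, mass in masses.items()}
```

On `0021111` with `k = 2` this yields 1/14 for `aa`, 1/21 for `ab`, 1/42 for `bc`, 1/42 for `ca` and 1/21 for `cc`, matching the listed values, plus 1/42 for the block `ba`, which the computation produces but the slide text does not list. The masses sum to 10/42, and after normalization they equal the block frequencies of `ababcccaaaa` (for example 3/10 for `aa`).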

18 Image of the statistics by a general transducer τ
[Figure: u.stat(I) and its image under τ, a union of polytopes.]
Applications:
ε-Source-Consistency: is d(u.stat[τ(I)], H_T) within the tolerance?
ε-Query Answering: is u.stat[τ(I)] within ε of H_Q?

19 Inclusion Tester for regular properties
Time polynomial in m = Max(|r1|, |r2|).
Application: ε-Typechecking: decide if J is ε-close to K_T [for all I in K_S and all (I,J) in τ].
Solution: inclusion tester for the approximate inclusion of τ(K_S) in K_T.

20 Statistics on Trees
T: ordered (extended) tree of rank 2. T': skeleton. W: word with labels. Apply u.stat on W and define u.stat(T).
[Figure labels: (1(1,1),.) and (1,.).]

21 Extension to trees
Statistics on DTDs: H = {stat(t) : t in DTD} is still a union of polytopes (harder analysis to construct it).
Transducer with attributes:
- a transition map from Σ_S × Q to Hedge_{Σ_T, A_T}[Q];
- h : Σ_S × Q × A_S → {1} ∪ Var, extended to Σ_S × Q × Str → Str ∪ Var;
- a map from Σ_S × Q × A_T × D_T to {1,…,k}, where D_T is the hedge defined by the transition map.
The transducer is decomposable into a finite number of paths in the graph of its strongly connected components.
Lemma: The image of a statistical vector through a path is a union of polytopes.

22 ε-Source-Consistency on trees
Test: if there is a path π (allowing a decomposition of t on H_π) s.t. u.stat(τ_π(t)) is close to H_T then accept, else reject. The test uses testers for K_S and K_π, and checks whether d(x, H_T) is within the tolerance, where x is an approximation of u.stat(τ_π(t)).
Lemma: If τ(t) is in K_T, then there is a π with dist(τ_π(t), K_T) = 0.
Lemma: If t is ε-far from being ε-Source-Consistent, then we reject with high probability.
Theorem: For every ε > 0, there is an ε-tester for ε-Source-Consistency on trees.
Corollary: If t is ε-Source-Consistent, the procedure leads to a t' s.t. τ(t') is close to K_T.

23 Composition of close settings
An ε-corrector for a class K_0 ⊆ K is an algorithm A which takes as input a structure I which is ε-close to K_0 and outputs a structure I_0 in K_0 such that I_0 is ε-close to I.
Ex: if an XML file F is ε-close to a DTD, find a valid F' ε-close to F.
Data Exchange settings (K_S1, τ_1, K_T1), (K_S2, τ_2, K_T2): there is a solution if they are ε-composable:
- K_T1 and K_S2 are ε-close;
- the settings satisfy ε-typechecking.
Composition: apply correctors at every stage to define the new τ. Then (K_S1, τ, K_T2) satisfies 3ε-typechecking.

24 Composition
τ = C_2 ∘ τ_2 ∘ C ∘ C_1 ∘ τ_1
[Figure: diagram of τ_1, C_1, C, τ_2 and C_2 through the classes K_T1, K_S2 and K_T2.]

25 Conclusion
1. Data Exchange:
- Source-Consistency,
- Typechecking,
- Query-Answering.
2. Approximate Data Exchange (approximation based on Property Testing):
- ε-Source-Consistency,
- ε-Typechecking,
- ε-Query-Answering,
- ε-Composition.

26 Questions? Adrien Vieilleribière, Michel de Rougemont.