# Restriction Access, Population Recovery & Partial Identification Avi Wigderson IAS, Princeton Joint with Zeev Dvir Anup Rao Amir Yehudayoff.

## Presentation on theme: "Restriction Access, Population Recovery & Partial Identification Avi Wigderson IAS, Princeton Joint with Zeev Dvir Anup Rao Amir Yehudayoff."— Presentation transcript:

Restriction Access, Population Recovery & Partial Identification Avi Wigderson IAS, Princeton Joint with Zeev Dvir Anup Rao Amir Yehudayoff

Restriction Access, A new model of “Grey-box” access

Systems, Models, Observations From Input-Output (I 1,O 1 ), (I 2,O 2 ), (I 3,O 3 ), ….? Typically more!

Black-box access Successes & Limits Learning: PAC, membership, statistical…queries Decision trees, DNFs? Cryptography: semantic, CPA, CCA, … security Cold boot, microwave,… attacks? Optimization: Membership, separation,… oracles Strongly polynomial algorithms? Pseudorandomness: Hardness vs. Randomness Derandomizing specific algorithms? Complexity:  2 = NP NP What problems can we solve if P=NP?

The gray scale of access f:  n   m D: “device” computing f (from a family of devices) D x 1, f(x 1 ) x 2, f(x 2 ) x 3, f(x 3 ) …. D D How to model? Many specific ideas. Ours: general, clean Black Box Gray Box – natural starting point - natural intermediate pt Clear Box

Restriction Access (RA) f:  n   m D: “device” computing f Restriction:  = (x,L), L  [n], x   n, L L live vars Observations: ( , D|  ) D|  (simplified after fixing) computes f|  on L Black L =  Gray Clear L = [n] (x,f(x)) ( , D|  ) ( x,D) D f(x) x 1, x 2, *,* …. * ||

Example: Decision Tree x1x1 x4x4 x2x2 x3x3 x2x2 0 1 01 01 10 D  = (x,L) L = {3,4} x = (1010) D|  = x4x4 x3x3 0 10

Modeling choices (RA-PAC) Restriction:  = (x,L), L  [n], x   n, unknown D Input x : friendly, adversarial, random Unknown distribution (as in PAC) Live vars L : friendly, adversarial, random  - independent dist (as in random restrictions)

RA-PAC Results Probably, Approximately Correct (PAC) learning of D, from restrictions with each variable remains alive with prob  Thm 1 [DRWY] : A poly(s,  ) alg for RA-PAC learning size-s decision trees, for every  >0 (reconstruction from pairs of live variables) Thm 2 [DRWY] : A poly(s,  ) alg for RA-PAC learning size-s DNFs, for every  >.365… (reduction to “Population Recovery Problem”) Positive - In contrast to PAC !!!

Population Recovery (learning a mixture of binomials)

Population Recovery Problem k species, n attributes, from , Vectors v 1, v 2, … v k   n Distribution p 1, p 2, … p k ,  >0 Task: Recover all v i, p i (upto  ) from samples p 1 1/2 0000 v 1 p 2 1/3 0110 v 2 p 3 1/6 1100 v 3 Red: Known Blue: Unknown n k

Population Recovery Problem k species, n attributes, from , ,  >0 v 1, v 2, … v k   n p 1, p 2, … p k fraction in population Task: Recover all v i, p i (upto  ) from samples Samplers: (1) u  v i with prob. p i  -Lossy Sampler: (2) u(j)  ? with prob. 1-   j  [n]  -Noisy Sampler: (2) u(j) flipped w.p. 1/2-   j  [n] 0110 ?1?0?1?0 11001100 p 1 1/2 0000 v 1 p 2 1/3 0110 v 2 p 3 1/6 1100 v 3

Skull Teeth Vertebrae Arms Ribs Legs Tail Loss – Paleontology Height (ft) Length (ft) Weight (lbs) True Data 26% 11% 13% 30% 20%

Skull Teeth Vertebrae Arms Ribs Legs Tail Loss – Paleontology Height (ft) Length (ft) Weight (lbs) From samples Dig #1 Dig #2 Dig #3 Dig #4 …… each finding common to many species! How do they do it?

Socialism Abortion Gay marriage Marijuana Male Rich North US Noise – Privacy True Data 2% 0 1 1 0 1 0 0 1% 1 1 0 0 0 1 1 …… From samples Joe 0 0 0 0 0 1 1 Jane 0 0 0 0 1 1 1 …. Who flipped every correct answer with probability 49% Deniability? Recovery?

PRP - applications Recovering from loss & noise - Clustering / Learning / Data mining - Computational biology / Archeology / …… - Error correction - Database privacy - …… Numerous related papers & books

PRP - Results Facts:  =0 obliterates all information. - No polytime algorithm for  = o(1) Thm 3 [DRWY] A poly(k, n,  ) algorithm, from lossy samples, for every  >.365… Thm 4 [WY]: A poly(k log k, n,  ) algorithm, from lossy and/or noisy samples, for every  > 0 Kearns, Mansour, Ron, Rubinfeld, Schapire, Sellie exp(k) algorithm for this discrete version Moitra, Valiant exp(k) algorithm for Gaussian version (even when noise is unknown)

Proof of Thm 4 Reconstruct v i, p i From samples,,,…. Lemma 1: Can assume we know the v i ’s ! Proof: Exposing one column at a time. Lemma 2: Easy in exp(n) time ! Proof: Lossy - enough samples without “?” Noisy – linear algebra on sample probabilities. Idea: Make n=O(log k) [Dimension Reduction] p 1 1/2 0000 v 1 p 2 1/3 0110 v 2 p 3 1/6 1100 v 3 ?1?00??01100 n k

Partial IDs a new dimension-reduction technique

Dimension Reduction and small IDs Lemma: Can approximate p i in exp(| S i |) time ! Does one always have small IDs? 1 2 3 4 5 6 7 8 p 1 0 0 0 0 0 1 0 1 v 1 p 2 0 1 1 0 1 0 1 0 v 2 p 3 0 1 0 0 1 0 1 1 v 3 p 4 1 1 1 0 1 0 1 1 v 4 p 5 1 1 0 0 0 1 1 1 v 5 p 6 1 1 0 0 1 0 0 1 v 6 p 7 0 1 0 0 0 1 1 1 v 7 p 8 1 1 0 1 1 0 1 1 v 8 p 9 1 1 0 0 0 1 1 1 v 9 IDs S 1 = {1,2} S 2 = {8} S 3 = {1,5,6} n = 8 k = 9 u – random sample q i = Pr[u[S i ]=v i [S i ]]

Small IDs ? NO! However,… 1 2 3 4 5 6 7 8 p 1 1 0 0 0 0 0 0 0 v 1 p 2 0 1 0 0 0 0 0 0 v 2 p 3 0 0 1 0 0 0 0 0 v 3 p 4 0 0 0 1 0 0 0 0 v 4 p 5 0 0 0 0 1 0 0 0 v 5 p 6 0 0 0 0 0 1 0 0 v 6 p 7 0 0 0 0 0 0 1 0 v 7 p 8 0 0 0 0 0 0 0 1 v 8 p 9 0 0 0 0 0 0 0 0 v 9 IDs S 1 = {1} S 2 = {2} S 3 = {3} … S 8 = {8} S 9 = {1,2,…,8} n = 8 k = 9

Linear algebra & Partial IDs However, we can compute p 9 = 1- p 1 - p 2 -…- p 8 1 2 3 4 5 6 7 8 p 1 1 0 0 0 0 0 0 0 v 1 p 2 0 1 0 0 0 0 0 0 v 2 p 3 0 0 1 0 0 0 0 0 v 3 p 4 0 0 0 1 0 0 0 0 v 4 p 5 0 0 0 0 1 0 0 0 v 5 p 6 0 0 0 0 0 1 0 0 v 6 p 7 0 0 0 0 0 0 1 0 v 7 p 8 0 0 0 0 0 0 0 1 v 8 p 9 0 0 0 0 0 0 0 0 v 9 IDs S 1 = {1} S 2 = {2} S 3 = {3} … S 8 = {8} S 9 =  n = 8 k = 9 P

Back substitution and Imposters Can use back substitution if no cycles ! Are there always acyclic small partial IDs? 1 2 3 4 5 6 7 8 p 1 0 0 1 0 0 1 0 1 v 1 p 2 0 1 1 0 1 0 1 0 v 2 p 3 0 1 0 0 1 0 1 1 v 3 p 4 1 1 1 0 1 0 1 1 v 4 p 5 1 1 0 0 0 1 1 1 v 5 p 6 1 1 0 0 1 0 0 1 v 6 p 7 0 1 0 0 0 1 1 1 v 7 p 8 1 1 0 1 1 0 1 1 v 8 p 9 1 1 0 0 0 1 1 1 v 9 PIDs S 1 = {1,2} S 2 = {8} S 3 = {1,5,6} u – random sample q i = Pr[u[S i ]=v i [S i ]] q 1 = q 2 = q 3 = q 4 - p 1 - p 2 = S 4 = {3} any subset

Acyclic small partial IDs exist Lemma: There is always an ID of length log k 1 2 3 4 5 6 7 8 p 1 0 0 0 0 0 0 0 1 v 1 p 2 0 1 1 0 1 0 1 0 v 2 p 3 1 1 0 0 1 0 1 1 v 3 p 4 1 1 1 0 1 0 1 1 v 4 p 5 1 1 0 0 0 1 1 1 v 5 p 6 1 1 0 0 1 0 0 1 v 6 p 7 1 1 1 1 1 0 1 1 v 7 p 8 0 1 0 0 0 1 1 1 v 8 p 9 0 1 0 0 1 1 1 1 v 9 PIDs S 8 = {1,5,6} n = 8 k = 9 Idea: Remove and iterate to find more PIDs Lemma: Acyclic (log k)-PIDs always exists!

Chains of small Partial IDs Compute: q i = Pr[u i = 1] = Σ j≤i p i from sample u Back substitution : p i = q i - Σ j { "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://images.slideplayer.com/11/3188837/slides/slide_25.jpg", "name": "Chains of small Partial IDs Compute: q i = Pr[u i = 1] = Σ j≤i p i from sample u Back substitution : p i = q i - Σ j

The PID (imposter) graph Given: V=(v 1, v 2, … v k )   n S=(S 1, S 2,…,S k )  [ n ] n Construct G (V;S) by connecting v j  v i iff v i is an imposter of v j : v i [ S j ] = v j [ S j ] 1 2 3 4 5 6 7 8 1 1 1 1 1 1 1 1 v 1 0 1 1 1 1 1 1 1 v 2 0 0 1 1 1 1 1 1 v 3 0 0 0 1 1 1 1 1 v 4 0 0 0 0 1 1 1 1 v 5 PIDs S 1 = {1} S 2 = {2} S 3 = {3} … S 5 = {5} width = max i |S i | depth = depth ( G ) Want: PIDs w/small width and depth for all V v i  v j iff i > j

Constructing cheap PID graphs Theorem: For every V=(v 1, v 2, … v k ), v i   n we can efficiently find PIDs S=(S 1,S 2,…,S k ), S i  [ n ] of width and depth at most log k Algorithm: Initialize S i =  for all i Invariant: |imposters( v i ;S i )| ≤ k /2 | Si | Repeat: (1) Make S i maximal if not, add minority coordinates to S i (2) Make chains monotone: v j  v i then | S j |<| S i | (so G acyclic) if not, set S i to S j ( and apply (1) to S i ) 1 2 3 4 0 0 1 0 v 1 0 0 0 0 v 2 0 0 0 1 v 3 1 0 0 1 v 4 1 1 1 0 v 5 1 0 1 0 v 6

Analysis of the algorithm Theorem: For every V=(v 1, v 2, … v k )   n we can efficiently find PIDs S=(S 1,S 2,…,S k )  [ n ] n of width and depth at most log k Algorithm: Initialize S i =  for all i Invariant: |imposters( v i ;S i )| ≤ k /2 | Si | Repeat: (1) Make S i maximal (2) Make chains monotone (v j  v i then | S j |<| S i |) Analysis: - | S i |  log k throughout for all i -  i | S i | increases each step - Termination in k log k steps. - width  log k and so depth  log k

Conclusions - Restriction access: a new, general model of “gray box” access (largely unexplored!) - A general problem of population recovery - Efficient reconstruction from loss & noise - Partial IDs, a new dimension reduction technique for databases. Open: polynomial time algorithm in k ? (currently k log k, PIDs can’t beat k loglog k ) Open: Handle unknown errors ?

Download ppt "Restriction Access, Population Recovery & Partial Identification Avi Wigderson IAS, Princeton Joint with Zeev Dvir Anup Rao Amir Yehudayoff."

Similar presentations