Helping Kinsey Compute. Cynthia Dwork, Microsoft Research.

The Problem
Exploit data, e.g., a medical insurance database:
- Does smoking contribute to heart disease?
- Was there a rise in asthma emergency room cases this month?
- What fraction of the admissions during 2004 were men 25-35?
...while preserving the privacy of individuals

Holistic Statistics Is the dataset well clustered? What is the single best predictor for risk of stroke? How are attributes X and Y correlated; what is the cov(X,Y)? Are the data inherently low-dimensional?

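Statistical Database
- Database: rows $(D_1, \dots, D_n)$
- Query: $(f, S)$, where $f: \text{row} \to [0,1]$ and $S \subseteq [n]$
- Exact answer: $\sum_{r \in S} f(\mathrm{row}_r)$
- Response: $\sum_{r \in S} f(\mathrm{row}_r)$ + noise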

Statistical Database
- Response to each query: $\sum f$ + noise
- Under control of the interlocutor: noise generation, and the number of queries $T$ permitted

Why Bother With Noise?
Limiting the interface to queries about large sets is insufficient. Take $A = \{1, \dots, n\}$ and $B = \{2, \dots, n\}$:
$\sum_{a \in A} f(\mathrm{row}_a) - \sum_{b \in B} f(\mathrm{row}_b) = f(\mathrm{row}_1)$
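
The differencing argument above can be illustrated with a minimal Python sketch. The toy `rows` list and the helper `exact_sum` are hypothetical names introduced only for this illustration; the point is that two exact answers over large sets reveal one row.

```python
# Hypothetical toy database: one sensitive bit per row.
rows = [1, 0, 1, 1, 0, 0, 1, 0]


def exact_sum(f, subset):
    """Exact (noise-free) answer to the query (f, S)."""
    return sum(f(rows[i]) for i in subset)


def identity(bit):
    """Query function f = identity on a single binary attribute."""
    return bit


n = len(rows)
A = range(0, n)  # all rows ({1, ..., n} on the slide, 0-indexed here)
B = range(1, n)  # all rows except the first ({2, ..., n} on the slide)

# Both queries cover large sets, yet their difference is exactly row 1.
leaked = exact_sum(identity, A) - exact_sum(identity, B)
print(leaked == rows[0])
```

Adding noise to each answer is what prevents this subtraction from isolating a single row.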

Previous (Modern) Work in this Model
Dinur, Nissim [2003]: single binary attribute (query function $f$ = identity)
- Non-privacy: whp the adversary guesses a $1-\varepsilon$ fraction of the rows
- Theorem: Polytime non-privacy if whp |noise| is $o(\sqrt{n})$
- Theorem: Privacy with $o(\sqrt{n})$ noise if #queries is $\ll n$
Privacy "for free"!
- Rows are samples from an underlying distribution: $\Pr[\mathrm{row}_i = 1] = p$
- $E[\#1\text{'s}] = pn$, $\mathrm{Var} = \Theta(n)$
- Actual #1's is about $pn \pm \Theta(\sqrt{n})$
- |Privacy-preserving noise| is $o$(sampling error)

Real Power in this Model
Dwork, Nissim [2004]: multiple binary attributes, $q = (S, f)$, $f: \{0,1\}^d \to \{0,1\}$
- Definition of privacy appropriate to the enriched query set
- Theorem: Privacy with $o(\sqrt{n})$ noise if #queries is $\ll n$
- Coined the term SuLQ
Vertically Partitioned Databases
- Learn joint statistics from independently operated SuLQ databases: given SuLQ A and SuLQ B, learn if A implies B in probability. E.g., heart disease risk increases with smoking.
- Enables learning statistics for all Boolean functions of attributes

Still More Power [Blum, Dwork, McSherry, Nissim 05]
Extend Privacy Proofs
- Real-valued functions $f: [0,1]^d \to [0,1]$
- Per-row analysis: drop dependence on $n$! How many queries has THIS row participated in? Our Data, Ourselves
Holistic Statistics: A Calculus of Noisy Computation
- Beyond statistics: (not too) noisy versions of the k-means, perceptron, and ID3 algorithms
- (Not too) noisy optimal projections: SVD, PCA
- All of STAT learning

Towards Defining Privacy: “Facts of Life” vs Privacy Breach Diabetes is more likely in obese persons — Does not imply THIS obese person has or will have diabetes Sneaker color preference is correlated with political party — Does not imply THIS person in red sneakers is a Republican Half of all marriages result in divorce — Does not imply Pr [ THIS marriage will fail ] = ½

$(\varepsilon, T)$-Privacy
Power of the adversary:
- Phase 0: Specify a goal function $g: \text{row} \to \{0,1\}$ (actually, a polynomial number of functions). The adversary will try to learn this information about someone.
- Phase 1: Adaptively make $T$ queries.
- Phase 2: Choose a row $i$ to attack; get the entire database except for row $i$.
Privacy breach: occurs if the adversary's "confidence" in $g(\mathrm{row}_i)$ changes by $\varepsilon$.
Notes:
- The adversary chooses the goal.
- My privacy is preserved even if everybody else tells their secrets to the adversary.

Flavor of Privacy Proofs
- Define confidence in the value of $g(\mathrm{row}_i)$: $c_0 = \log [p_0/(1-p_0)]$. This is 0 when $p = 1/2$ and skyrockets as $p$ moves toward 0 or 1.
- Model the evolution of confidence as a martingale:
  - Argue that the expected difference at each step is small
  - Compute an absolute upper bound on the difference
  - Plug these two parameters into Azuma's inequality
- Obtain a probabilistic statement regarding the change in confidence; equivalently, the change from prior to posterior probabilities about the value of $g(\mathrm{row}_i)$
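
As a small illustration of the confidence measure on this slide, the sketch below evaluates the log-odds for a few probabilities. The function name `confidence` is a label chosen here, and the natural logarithm is an assumption (the slide does not state the base).

```python
import math


def confidence(p):
    """Log-odds of a probability p in (0, 1); natural log assumed."""
    return math.log(p / (1.0 - p))


# Zero at p = 1/2, growing quickly in magnitude as p nears 0 or 1.
for p in (0.5, 0.9, 0.99, 0.999):
    print(p, round(confidence(p), 3))
```

The measure is symmetric: a prior of 0.1 and a prior of 0.9 have confidences of equal magnitude and opposite sign.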

Remainder of This Talk
- Description of SuLQ Algorithm + Statement of Main Theorem
- Examples: k-means; SVD, PCA; Perceptron; STAT learning
- Vertically Partitioned Data: determining if $\alpha \Rightarrow \beta$ in probability, i.e. $\Pr[\beta \mid \alpha] \ge \Pr[\beta] + \Delta$, when $\alpha$ and $\beta$ are in different SuLQ databases
- Summary

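Azuma's Inequality
Let $s_1, \dots, s_T$ be i.i.d. such that $E[s_j] \le \alpha$ and $|s_j| \le \beta$. Then
$\Pr\left[\left|\sum_i s_i\right| > \lambda(\alpha + \beta) T^{1/2} + T\alpha\right] \le 2e^{-\lambda^2/2}$
We will take $\alpha = 1/(2R)$ and $\beta = (2 \log(T/\delta)/R)^{1/2} + 1/(2R)$.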

The SuLQ Algorithm
Algorithm:
- Input: query $(S \subseteq [n],\ f: [0,1]^d \to [0,1])$
- Output: $\sum_{i \in S} f(\mathrm{row}_i) + N(0, R)$
Theorem: $\forall \delta$, with probability at least $1-\delta$, choosing $R > 32 \log(2/\delta) \log(T/\delta)\, T/\varepsilon^2$ ensures that for each (target, predicate) pair, after $T$ queries the probability that the confidence has increased by more than $\varepsilon$ is at most $\delta$.
$R$ is independent of $n$. Bigger $n$ means better stats.
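
The primitive on this slide can be sketched in a few lines of Python. This is a toy under stated assumptions, not the authors' implementation: the class name `SuLQ`, the budget handling, and the helper `required_R` are hypothetical, and the parameter names `eps` and `delta` are an assumed reading of the symbols in the theorem.

```python
import math
import random


class SuLQ:
    """Toy SuLQ-style database: noisy sums, at most T queries."""

    def __init__(self, rows, R, T, seed=None):
        self.rows = list(rows)
        self.R = R            # variance of the Gaussian noise
        self.remaining = T    # number of queries still permitted
        self.rng = random.Random(seed)

    def query(self, S, f):
        """Return sum over i in S of f(row i), plus N(0, R) noise."""
        if self.remaining <= 0:
            raise RuntimeError("query budget T exhausted")
        self.remaining -= 1
        exact = sum(f(self.rows[i]) for i in S)
        # random.gauss takes a standard deviation, so pass sqrt(R).
        return exact + self.rng.gauss(0.0, math.sqrt(self.R))


def required_R(eps, delta, T):
    """Assumed reading of the bound: 32 log(2/delta) log(T/delta) T / eps^2."""
    return 32 * math.log(2 / delta) * math.log(T / delta) * T / eps ** 2
```

For example, `SuLQ([0.2] * 1000, R=4.0, T=2).query(range(1000), lambda r: r)` returns a value near 200, off by noise with standard deviation 2. Note that the noise scale does not depend on the number of rows.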

k-Means Clustering
Used in physics, operations research (OR), machine learning, data mining, etc.

SuLQ k-Means
- Estimate the size of each cluster
- Estimate the average of the points in each cluster:
  - Estimate their sum; and
  - Divide the estimated sum by the estimated size

Side by Side: k-Means and SuLQ k-Means
k-means basic step:
- Input: data points $p_1, \dots, p_n$ and $k$ 'centers' $c_1, \dots, c_k$ in $[0,1]^d$
- $S_j$ = points for which $c_j$ is the closest center
- Output: $c'_j$ = average of the points in $S_j$, $j = 1, \dots, k$
SuLQ k-means basic step:
- Input: data points $p_1, \dots, p_n$ and $k$ 'centers' $c_1, \dots, c_k$ in $[0,1]^d$
- $s_j = \mathrm{SuLQ}(f)$ with $f(d_i) := 1$ if $j = \arg\min_{j'} \|c_{j'} - d_i\|$, and 0 otherwise
- $\mu'_j = \mathrm{SuLQ}(f) / s_j$ with $f(d_i) := d_i$ if $j = \arg\min_{j'} \|c_{j'} - d_i\|$, and 0 otherwise
- $k(1+d)$ queries total
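
One SuLQ k-means step could look roughly like the following Python sketch. The helper `noisy_sum` stands in for a SuLQ query, and the rule of keeping the old center when the noisy size is not above $R^{1/2}$ is an assumption added here so the division stays meaningful; neither is taken from the slides.

```python
import math
import random


def noisy_sum(values, R, rng):
    """Stand-in for one SuLQ query: exact sum plus N(0, R) noise."""
    return sum(values) + rng.gauss(0.0, math.sqrt(R))


def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((c - x) ** 2 for c, x in zip(centers[j], point)))


def sulq_kmeans_step(points, centers, R, rng):
    """One update of all k centers using k(1 + d) noisy sums."""
    d = len(points[0])
    # The assignment depends only on each row and the public centers,
    # so it plays the role of the indicator inside the query function f.
    assign = [closest(p, centers) for p in points]
    new_centers = []
    for j in range(len(centers)):
        # 1 query: noisy cluster size s_j.
        s_j = noisy_sum((1.0 if a == j else 0.0 for a in assign), R, rng)
        if s_j <= math.sqrt(R):
            # Assumption: cluster too small relative to the noise, keep old center.
            new_centers.append(list(centers[j]))
            continue
        # d queries: noisy coordinate sums, each divided by the noisy size.
        new_centers.append([
            noisy_sum((p[t] if a == j else 0.0
                       for p, a in zip(points, assign)), R, rng) / s_j
            for t in range(d)
        ])
    return new_centers
```

With two well-separated clusters of a few hundred points each and $R = 1$, the returned centers land close to the true cluster means.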

Small Error!
For each $1 \le j \le k$, if $|S_j| \gg R^{1/2}$ then with high probability $\|\mu'_j - c'_j\|$ is $O\big((\|\mu_j\| + d^{1/2})\, R^{1/2} / |S_j|\big)$.
Inaccuracies:
- Estimating $|S_j|$
- Summing the points in $S_j$
Even with just the first:
$(1/s_j - 1/|S_j|) \sum_{i \in S_j} d_i = (1/s_j - 1/|S_j|)(\mu_j |S_j|) = ((|S_j| - s_j)/s_j)\, \mu_j \approx (\text{noise}/\text{size})\, \mu_j$

Reducing Dimensionality
Reduce dimensionality in a dataset while retaining those characteristics that contribute most to its variance.
Find Optimal Linear Projections
- Latent semantic indexing, spectral clustering, etc., employ best rank-$k$ approximations to $A$
- Singular Value Decomposition uses the top $k$ eigenvectors of $A^T A$
- Principal Component Analysis uses the top $k$ eigenvectors of $\mathrm{cov}(A)$
Approach
- Approximate $A^T A$ and $\mathrm{cov}(A)$ using SuLQ, then compute eigenvectors

Optimal Projections
- $A^T A = \sum_i d_i^T d_i$
- $\mu = (\sum_i d_i)/n$
- $\mathrm{cov}(A) = \sum_i (d_i - \mu)^T (d_i - \mu)$
SuLQ versions:
- $\mathrm{SuLQ}(f(i) = d_i^T d_i) = A^T A + N(0,R)^{d \times d}$
- $\mu' = \mathrm{SuLQ}(f(i) = d_i)/n$
- $\mathrm{SuLQ}(f(i) = (d_i - \mu')^T (d_i - \mu'))$
$d^2$ and $d^2 + d$ queries, respectively
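
A rough Python sketch of the first approach on this slide: build a noisy version of $A^T A$ entry by entry, then extract its top eigenvector. Symmetrizing the noisy matrix and using power iteration are choices made here for a dependency-free example, not something the slides specify; the function names are hypothetical.

```python
import math
import random


def noisy_gram(data, R, rng):
    """Approximate A^T A with one noisy sum per entry (d^2 queries)."""
    d = len(data[0])
    return [[sum(row[a] * row[b] for row in data) + rng.gauss(0.0, math.sqrt(R))
             for b in range(d)] for a in range(d)]


def top_eigenvector(M, iters=200):
    """Power iteration on the symmetrized matrix (M + M^T) / 2."""
    d = len(M)
    sym = [[(M[a][b] + M[b][a]) / 2 for b in range(d)] for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(sym[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

When the data lie near a single direction and the noise is small compared with the leading eigenvalue, the recovered vector is close to that direction. The same pattern extends to the covariance matrix by first estimating the mean with $d$ additional queries.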

Perceptron [Rosenblatt 57]
- Input: $n$ points $p_1, \dots, p_n$ in $[-1,1]^d$, and labels $b_1, \dots, b_n$ in $\{-1,1\}$
- Assumed linearly separable, with a plane through the origin
- Initialize $w$ randomly
- $\langle w, p \rangle b > 0$ iff label $b$ agrees with the sign of $\langle w, p \rangle$
- While there exists a labeled point $(p_i, b_i)$ s.t. $\langle w, p_i \rangle b_i \le 0$, set $w = w + p_i \cdot b_i$
- Output: $w$

SuLQ Perceptron
Initialize $w_0 = 0^d$ and $s_0 = n$.
For $j = 0, 1, 2, \dots$, repeating so long as $s_j \gg R^{1/2}$:
- Count the misclassified rows (1 query): $s_j = \mathrm{SuLQ}(f)$ with $f(d_i) := 1$ if $\langle d_i, w_j \rangle b_i \le 0$, and 0 otherwise
- Synthesize a misclassified vector ($d$ queries): $v_j = \mathrm{SuLQ}(f) / s_j$ with $f(d_i) := b_i d_i$ if $\langle d_i, w_j \rangle \cdot b_i \le 0$, and 0 otherwise
- Update $w$: set $w_{j+1} = w_j + v_j$
Return the final value of $w$.

SuLQ Perceptron
Initialize $w = 0^d$ and $s = n$.
Repeat while $s \gg R^{1/2}$:
- Count the misclassified rows (1 query): $s = \mathrm{SuLQ}(f)$ with $f(d_i) := 1$ if $\langle d_i, w \rangle b_i \le 0$, and 0 otherwise
- Synthesize a misclassified vector ($d$ queries): $v = \mathrm{SuLQ}(f) / s$ with $f(d_i) := b_i d_i$ if $\langle d_i, w \rangle \cdot b_i \le 0$, and 0 otherwise
- Update $w$: set $w = w + v$
Return the final value of $w$.
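
The loop on this slide might be sketched in Python as follows. The helper `noisy_sum` stands in for a SuLQ query. Interpreting "$s \gg R^{1/2}$" as `s > threshold_factor * sqrt(R)` with a factor of 4, and capping the loop with `max_rounds`, are assumptions made here for a runnable example.

```python
import math
import random


def noisy_sum(values, R, rng):
    """Stand-in for one SuLQ query: exact sum plus N(0, R) noise."""
    return sum(values) + rng.gauss(0.0, math.sqrt(R))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sulq_perceptron(points, labels, R, rng, threshold_factor=4.0, max_rounds=1000):
    d = len(points[0])
    w = [0.0] * d
    for _ in range(max_rounds):
        # Misclassification test, evaluated per row inside the query function.
        wrong = [dot(p, w) * b <= 0 for p, b in zip(points, labels)]
        # 1 query: noisy count of misclassified rows.
        s = noisy_sum((1.0 if m else 0.0 for m in wrong), R, rng)
        if s <= threshold_factor * math.sqrt(R):
            break  # assumed stopping rule for "s >> R^(1/2)"
        # d queries: noisy sum of b_i * d_i over misclassified rows, divided by s.
        v = [noisy_sum((b * p[t] if m else 0.0
                        for p, b, m in zip(points, labels, wrong)), R, rng) / s
             for t in range(d)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w
```

Unlike the classical perceptron, each update uses an averaged, noisy misclassified vector rather than a single row, so no individual point is ever read directly.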

How Many Rounds?
Theorem: If there exists a unit vector $w'$ and scalar $\delta$ such that for all $i$, $\langle w', d_i \rangle b_i \ge \delta$, and for all $j$, $\delta \gg (dR)^{1/2}/|S_j|$, then with high probability the algorithm terminates in at most $32 \max_i |d_i|^2 / \delta^2$ rounds.
- $|S_j|$ = number of misclassified vectors at iteration $j$
- In each round $j$, $\langle w', w \rangle$ increases by more than $|w|$ does.
- Since $\langle w', w \rangle \le |w'| \cdot |w| = |w|$, this must stop. Otherwise $\langle w', w \rangle$ would overtake $|w|$.

The Statistical Queries Learning Model [Kearns93]
- Concept $c: \{0,1\}^d \to \{0,1\}$
- Distribution $D$ on $\{0,1\}^d$
- STAT$(c, D)$ oracle:
  - Query: $(p, \tau)$ where $p: \{0,1\}^{d+1} \to \{0,1\}$ and $\tau = 1/\mathrm{poly}(d)$
  - Answer: $\Pr_{x \sim D}[p(x, c(x))] + \eta$ for $|\eta| \le \tau$

Capturing STAT
Each row contains a labeled example $(x, c(x))$.
- Input: predicate $p$ and accuracy $\tau$
- Initialize tally = 0
- Reduce variance: repeat $t \ge R/(\tau^2 n^2)$ times: tally = tally + $\mathrm{SuLQ}(f(d_i) := p(d_i))$
- Output: tally $/ (tn)$
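
The repetition trick on this slide can be sketched in Python as below. The function name `stat_via_sulq` is hypothetical, and the repetition count $t = \lceil R/(\tau n)^2 \rceil$ is an assumed reading of the slide's bound, chosen so that the standard deviation of the output noise, $\sqrt{R/t}/n$, is at most $\tau$.

```python
import math
import random


def stat_via_sulq(rows, p, tau, R, rng):
    """Answer a statistical query (p, tau) using repeated noisy sums.

    rows: list of labeled examples (x, c(x)); p: predicate on (x, label).
    """
    n = len(rows)
    # Assumed repetition count: enough to push the noise std down to tau.
    t = max(1, math.ceil(R / (tau * n) ** 2))
    tally = 0.0
    for _ in range(t):
        exact = sum(1.0 if p(x, label) else 0.0 for x, label in rows)
        tally += exact + rng.gauss(0.0, math.sqrt(R))  # one SuLQ-style query
    return tally / (t * n)
```

Each repetition consumes one query from the budget $T$, which is why the number of rows $n$ must be large relative to $R$ and $1/\tau$ for many statistical queries to fit.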

Capturing STAT
Theorem: For any algorithm that $\varepsilon$-learns a class $C$ using at most $q$ statistical queries of accuracy $\{\tau_1, \dots, \tau_q\}$, the adapted algorithm can $\varepsilon$-learn $C$ on a SuLQ database of $n$ elements, provided that
$n^2 \ge \frac{R \log(q/\delta)}{T - q} \times \sum_{j \le q} \frac{1}{\tau_j^2}$

Probabilistic Implication: Two SuLQ Databases
- $\alpha$ implies $\beta$ in probability: $\Pr[\beta \mid \alpha] \ge \Pr[\beta] + \Delta$
- Construct a tester for distinguishing $\Delta < \Delta_1$ from $\Delta > \Delta_2$ (for constants $\Delta_1 < \Delta_2$)
  - Estimate $\Delta$ by binary search
- In the analysis we consider deviations from an expected value, of magnitude $\Theta(\sqrt{n})$
  - As the perturbation is $\ll \sqrt{n}$, it does not mask out these deviations
- Results generalize to functions $\alpha$ and $\beta$ of attributes in two distinct SuLQ databases

Key Insight: Test for $\Pr[\beta \mid \alpha] \ge \Pr[\beta] + \Delta$
Assume $T$ is chosen so that noise = $o(\sqrt{n})$.
1. Find a "heavy" set $S$ for $\alpha$: a subset of rows that have more than $|S| a + [a(1-a)|S|]^{1/2}$ ones in the $\alpha$ database. Here, $a = \Pr[\alpha]$ and $|S| = \Theta(n)$. That is, find $S$ s.t. $a_{S,\alpha} > |S| a + \sqrt{|S|\, a(1-a)}$. Let $\mathrm{excess}_\alpha = a_{S,\alpha} - |S| a$. Note that the excess is $\Omega(n^{1/2})$.
2. Query the SuLQ database for $\beta$, on $S$. If $a_{S,\beta} \ge |S| \Pr[\beta] + \mathrm{excess}_\alpha \cdot (\Delta / (1 - a))$ then return 1, else return 0.
If $\Delta$ is constant then the noise is too small to hide the correlation.
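
The two steps of the test reduce to simple threshold checks, sketched below in Python. The function and parameter names are hypothetical; `ones_alpha` and `ones_beta` stand for the (noisy) SuLQ answers on the set S from the two databases, which this sketch takes as inputs rather than simulating.

```python
import math


def is_heavy(ones_alpha, size, a):
    """Step 1: S is heavy for alpha if it has more than |S|a + sqrt(|S| a (1-a)) ones."""
    return ones_alpha > size * a + math.sqrt(size * a * (1 - a))


def implication_test(ones_alpha, ones_beta, size, a, p_beta, delta):
    """Step 2: return 1 if the beta count on S clears the threshold, else 0.

    a = Pr[alpha], p_beta = Pr[beta], delta = the gap being tested.
    """
    excess = ones_alpha - size * a
    threshold = size * p_beta + excess * (delta / (1 - a))
    return 1 if ones_beta >= threshold else 0
```

The factor $\Delta/(1-a)$ arises because $\Pr[\beta \mid \alpha] - \Pr[\beta \mid \neg\alpha] = \Delta/(1-a)$ when $\Pr[\beta \mid \alpha] = \Pr[\beta] + \Delta$, so each excess row with $\alpha = 1$ shifts the expected $\beta$ count by that amount.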

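Probabilistic Implication: The Tester
$\Pr[\beta \mid \alpha] \ge \Pr[\beta] + \Delta$
Distinguishing $\Delta < \Delta_1$ from $\Delta > \Delta_2$:
- Find a "heavy" query $(S, \alpha)$ s.t. $a_{S,\alpha} > |S| \cdot p_\alpha + \sqrt{n\, p_\alpha (1 - p_\alpha)}$. Let $\mathrm{bias}_\alpha = a_{S,\alpha} - |S| \cdot p_\alpha$.
- Issue query $(S, \beta)$. If $a_{S,\beta} > \mathrm{threshold}(\mathrm{bias}_\alpha, p_\alpha, \Delta_1)$, output 1.
[Figure: distribution $\Pr[a_{S,\beta}]$ of the answer $a_{S,\beta}$ for random $S$, in the cases $\Delta < \Delta_1$ (output 0) and $\Delta > \Delta_2$ (output 1).]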

Summary SuLQ framework for privacy-preserving statistical databases — real-valued query functions — Variance for noise depends (roughly linearly) on number of queries, not size of database Examples of power of SuLQ calculus Vertically Partitioned Databases

Sources
- C. Dwork and K. Nissim, Privacy-Preserving Datamining on Vertically Partitioned Databases
- A. Blum, C. Dwork, F. McSherry, and K. Nissim, Practical Privacy: The SuLQ Framework
See http://research.microsoft.com/research/sv/DabasePrivacy
