Foundations of Privacy Lecture 6 Lecturer: Moni Naor

(ε, δ)-Differential Privacy
Sanitizer M gives (ε, δ)-differential privacy if for all adjacent D_1 and D_2, and all A ⊆ range(M):
Pr[M(D_1) ∈ A] ≤ e^ε · Pr[M(D_2) ∈ A] + δ
[Figure: the Pr[response] ratio is bounded except on sets Z of bad responses.]
This course: δ negligible.

Example: NO Differential Privacy
U = set of (name, tag ∈ {0,1}) tuples. One counting query: # of participants with tag = 1.
Sanitizer A: choose and release a few random tags.
Bad event T: only my tag is 1, and my tag is released.
Pr_A[A(D+I) ∈ T] ≥ 1/n, while Pr_A[A(D−I) ∈ T] = 0, so the ratio Pr_A[A(D+I) ∈ T] / Pr_A[A(D−I) ∈ T] cannot be bounded by e^ε ≈ 1 + ε.
Not ε-differentially private for any ε! It is, however, (0, 1/n)-differentially private.
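To see the failure concretely, here is a minimal simulation sketch (the population size n, the number k of released tags, and the trial count are illustrative assumptions, not values from the lecture):

```python
import random

# Database: everyone else's tag is 0; the sanitizer releases k random tags.
n, k, trials = 100, 5, 100_000

def my_tag_released(my_tag):
    tags = [0] * (n - 1) + [my_tag]        # only my tag can be 1
    released = random.sample(range(n), k)  # sanitizer picks k random positions
    return any(tags[i] == 1 for i in released)

p_with_me = sum(my_tag_released(1) for _ in range(trials)) / trials
p_without = sum(my_tag_released(0) for _ in range(trials)) / trials
print(p_with_me, p_without)  # roughly k/n versus exactly 0: the ratio is unbounded
```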

Counting Queries
Database x of size n: n individuals, each contributing a single point in U.
Q is a set of predicates q: U → {0,1}. Query: how many participants satisfy q? (Sometimes stated as a fraction.)
Relaxed accuracy: answer each query within α additive error w.h.p.
Not so bad: some error is anyway inherent in statistical analysis.

Bounds on Achievable Privacy
Want to get bounds on:
– Accuracy: the responses from the mechanism to all queries are within α, except with small failure probability.
– The number of queries t for which we can receive accurate answers.
– The privacy parameter ε for which ε-differential privacy, or (ε, δ)-differential privacy, is achievable.

Composition: t-Fold
Suppose we are going to apply a DP mechanism t times, perhaps on different databases. Want: the combined outcome is differentially private.
A value b ∈ {0,1} is chosen. In each of the t rounds:
– adversary A picks two adjacent databases D_i^0 and D_i^1 and an ε-DP mechanism M_i
– A receives the result z_i of the ε-DP mechanism M_i on D_i^b
Want to argue: A's view is within ε' for both values of b.
A's view: (z_1, z_2, …, z_t) plus the randomness used.

[Diagram: adversary's view. In round i, A picks adjacent databases D_i^0, D_i^1; the mechanism M_i is run on D_i^b and returns z_i, for i = 1, …, t. A's view is its randomness plus (z_1, z_2, …, z_t); its distribution when the hidden bit is b is denoted V^b.]

Differential Privacy: Composition
Last week: if all mechanisms M_i are ε-DP, then for any view the probabilities that A gets that view when b = 0 and when b = 1 are within a factor e^{εt}. That is, t releases, each ε-DP, are tε-DP.
Today: t releases, each ε-DP, are (roughly) (√t·ε + tε², δ)-DP.
Therefore results for a single query translate into results for several queries.

Privacy Loss as a Random Walk
Idea: over t releases, the accumulated privacy loss behaves like a random walk whose steps are bounded by ε and have small expectation, so it concentrates well below the worst case tε. The next slides make this precise.

Kullback–Leibler Divergence, or Relative Entropy
KL(Y‖Z) = E_{y∼Y}[ log( Pr[Y = y] / Pr[Z = y] ) ]
Note: it is not a metric (asymmetric, no triangle inequality).

The Max Divergence
KL_∞(Y‖Z) = max over S ⊆ Supp(Y) of log( Pr[Y ∈ S] / Pr[Z ∈ S] )
The δ-approximate version: KL_∞^δ(Y‖Z) = max over S with Pr[Y ∈ S] ≥ δ of log( (Pr[Y ∈ S] − δ) / Pr[Z ∈ S] )

Differential Privacy and Max Divergence
Mechanism M is ε-differentially private if and only if, on every two neighboring databases x and x':
– KL_∞(M(x)‖M(x')) ≤ ε
– KL_∞(M(x')‖M(x)) ≤ ε
M is (ε, δ)-differentially private if and only if:
– KL_∞^δ(M(x)‖M(x')) ≤ ε
– KL_∞^δ(M(x')‖M(x)) ≤ ε

Connecting Max and Average Divergence
If the maximum privacy loss is bounded by ε, the expected privacy loss is actually quite a bit lower: a small worst-case difference in both directions implies an even smaller average difference.
Lemma: if for random variables Y and Z we have KL_∞(Y‖Z) ≤ ε and KL_∞(Z‖Y) ≤ ε, then KL(Y‖Z) ≤ ε(e^ε − 1).
For 0 ≤ ε ≤ 1 we have e^ε − 1 ≤ 2ε, so KL(Y‖Z) ≤ 2ε².

Martingales and Azuma's Inequality
A sequence of random variables X_0, X_1, …, X_m is a martingale if for 0 ≤ i ≤ m−1: E[X_{i+1} | X_i] = X_i.
Example: the sum a gambler holds over a series of fair bets.
Azuma's Inequality: let X_0 = 0, X_1, …, X_m be a martingale with |X_{i+1} − X_i| ≤ 1. Then for any λ > 0, Pr[X_m ≥ λ√m] ≤ e^{−λ²/2}.
High concentration around the expectation.
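As a quick empirical illustration of the bound, here is a sketch simulating the gambler example (the parameters m, λ, and the trial count are illustrative):

```python
import math
import random

# X_m = sum of m fair +/-1 bets: a martingale with |X_{i+1} - X_i| <= 1.
m, lam, trials = 400, 2.0, 20_000
threshold = lam * math.sqrt(m)

hits = sum(
    sum(random.choice((-1, 1)) for _ in range(m)) >= threshold
    for _ in range(trials)
)
print("empirical tail:", hits / trials)          # about 0.023 for lam = 2
print("Azuma bound:   ", math.exp(-lam**2 / 2))  # e^(-2) ~ 0.135
```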

Applying Azuma's Inequality
Suppose we have m random variables C_1, C_2, …, C_m with E[C_i | C_{i−1}] ≤ α and |C_i| ≤ β.
Then Pr[ |Σ_{i=1}^m C_i| ≥ (α+β)·λ√m + mα ] ≤ e^{−λ²/2}.
Proof: set Z_i = C_i − E[C_i], so |Z_i| ≤ α + β. By Azuma's inequality, Pr[ Σ_{i=1}^m Z_i ≥ (α+β)·λ√m ] ≤ e^{−λ²/2}, and since each E[C_i] ≤ α, |Σ_{i=1}^m C_i| = |Σ_{i=1}^m Z_i + Σ_{i=1}^m E[C_i]| ≤ |Σ_{i=1}^m Z_i| + mα.

The Composition Theorem
Choose ε' = tε(e^ε − 1) + ε·√(2t·log(1/δ)).
Then t releases, each ε-DP, are (ε', δ)-DP.
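A small sketch comparing the two bounds; basic_composition and advanced_composition are hypothetical helper names implementing the formulas from these slides:

```python
import math

def basic_composition(eps, t):
    # t releases, each eps-DP, are (t*eps)-DP.
    return t * eps

def advanced_composition(eps, t, delta):
    # eps' = t*eps*(e^eps - 1) + eps*sqrt(2*t*log(1/delta)), failure prob delta.
    return t * eps * (math.exp(eps) - 1) + eps * math.sqrt(2 * t * math.log(1 / delta))

eps, t, delta = 0.1, 100, 1e-6
print(basic_composition(eps, t))            # 10.0
print(advanced_composition(eps, t, delta))  # ~6.3: wins once t is large and eps small
```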

Scaling Noise to Sensitivity
Global sensitivity of a query q: Uⁿ → R: GS_q = max over adjacent D, D' of |q(D) − q(D')|.
For a counting query q (with range [0, n]): GS_q = 1.
The previous argument generalizes: for any query q: Uⁿ → R, release q(D) + Lap(GS_q/ε).
– ε-private
– error Õ(GS_q/ε)
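A minimal sketch of this mechanism in Python (laplace_mechanism is a hypothetical helper, not library code):

```python
import numpy as np

def laplace_mechanism(true_answer, gs, eps, rng=np.random.default_rng()):
    # Adding Lap(GS_q / eps) noise makes the release eps-differentially private.
    return true_answer + rng.laplace(loc=0.0, scale=gs / eps)

# Counting query (GS = 1): how many records have tag = 1?
db = [1, 0, 1, 1, 0, 1]
print(laplace_mechanism(sum(db), gs=1.0, eps=0.5))
```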

Scaling Noise to Sensitivity: Many Dimensions
Global sensitivity of a query q: Uⁿ → R^d: GS_q = max over adjacent D, D' of ‖q(D) − q(D')‖_1.
The previous argument generalizes: for any query q: Uⁿ → R^d, release q(D) + (Y_1, Y_2, …, Y_d), with the Y_i independent Lap(GS_q/ε).
– ε-private
– error Õ(GS_q/ε)

Histograms
Inputs x_1, x_2, …, x_n in domain U, with U partitioned into d disjoint bins S_1, …, S_d.
q(x_1, x_2, …, x_n) = (n_1, n_2, …, n_d) where n_j = #{i : x_i in the j-th bin}.
Can view this as d queries: q_i counts the # of points in the set S_i.
For adjacent D, D' only one answer can change, and it can change by 1, so the global sensitivity of the answer vector is 1.
Hence it is sufficient to add Lap(1/ε) noise to each query and still get ε-privacy.
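A sketch of the private histogram under the slide's adjacency notion (one record affects one bin, by 1); private_histogram is a hypothetical helper instantiating the d-dimensional Laplace mechanism above:

```python
import numpy as np

def private_histogram(data, bin_edges, eps, rng=np.random.default_rng()):
    # The count vector has L1 sensitivity 1, so Lap(1/eps) per bin gives eps-DP.
    counts, _ = np.histogram(data, bins=bin_edges)
    return counts + rng.laplace(scale=1.0 / eps, size=len(counts))

rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=1000)
print(private_histogram(data, np.linspace(0.0, 1.0, 11), eps=0.5, rng=rng))
```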

DP via Distance to a Property
Suppose P is a set of "good" databases, e.g. well-clustered databases.
Distance to P = # of points in x that must be changed to put x in P. This query always has GS = 1.
Example: the distance from a data set x to the set P of data sets with a "good clustering".

Median
Median of x_1, x_2, …, x_n ∈ [0,1] (n odd).
X = 0,…,0,0,1,…,1 with (n+1)/2 zeros: median(X) = 0.
X' = 0,…,0,1,1,…,1, obtained by flipping one 0 to 1: median(X') = 1.
So GS_median = 1, and the noise magnitude is 1. Too much noise!
But for "most" neighbor databases X, X', |median(X) − median(X')| is small. Can we add less noise on "good" instances?

Global Sensitivity vs. Local Sensitivity
Global sensitivity is a worst case over inputs.
Local sensitivity of query q at point D: LS_q(D) = max over D' adjacent to D of |q(D) − q(D')|.
Reminder: GS_q = max_D LS_q(D).
Goal: add less noise when the local sensitivity is small.
Problem: the amount of noise can itself leak information.

Local Sensitivity of the Median
For sorted X = x_1, x_2, …, x_{m−1}, x_m, x_{m+1}, …, x_n (n odd, median x_m):
LS_median(X) = max(x_m − x_{m−1}, x_{m+1} − x_m)
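A direct computation of this quantity (assuming n odd and 0-based indexing, so the median index is m = n // 2):

```python
def local_sensitivity_median(xs):
    # Changing one entry can move the median at most to x_{m-1} or x_{m+1}.
    xs = sorted(xs)
    m = len(xs) // 2
    return max(xs[m] - xs[m - 1], xs[m + 1] - xs[m])

print(local_sensitivity_median([0, 0, 0, 0, 0, 1, 1]))  # 0: median well inside the zeros
print(local_sensitivity_median([0, 0, 0, 1, 1, 1, 1]))  # 1: the entry below the median is 0
```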

Sensitivity of the Local Sensitivity of the Median
Median of x_1, x_2, …, x_n ∈ [0,1]:
X = 0,…,0,0,0,0,1,…,1 with (n−3)/2 ones: LS(X) = 0.
X' = 0,…,0,0,0,1,1,…,1, with one more 1: LS(X') = 1.
So the local sensitivity is itself highly sensitive: the noise magnitude must be an insensitive function!

Smooth Upper Bound
Compute a "smoothed" version of the local sensitivity: design a sensitivity function S(X).
S(X) is an ε-smooth upper bound on LS_f(X) if:
– for all X: S(X) ≥ LS_f(X)
– for all neighbors X and X': S(X) ≤ e^ε · S(X')
Theorem: if the response A(X) = f(X) + Lap(S(X)/ε) is given, then A is 2ε-differentially private: ε-DP for f(X) and ε-DP for S(X).

Smooth Sensitivity
S*_f(X) = max_Y { LS_f(Y) · e^{−ε·dist(X,Y)} }
Claim: S*_f(X) is an ε-smooth upper bound on LS_f(X).
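A brute-force sketch of S*_f for the median, under assumptions of this sketch: inputs lie in [lo, hi], n is odd, and the largest local sensitivity achievable by changing k entries is max_t (x_{m+t} − x_{m+t−k−1}) with out-of-range indices clamped to the domain endpoints:

```python
import math

def smooth_sensitivity_median(xs, eps, lo=0.0, hi=1.0):
    xs = sorted(xs)
    n, m = len(xs), len(xs) // 2  # 0-based median index, n odd

    def x(i):  # clamp out-of-range indices to the domain endpoints
        return lo if i < 0 else hi if i >= n else xs[i]

    best = 0.0
    for k in range(n + 1):
        # largest LS over databases Y at Hamming distance k from X
        ls_at_k = max(x(m + t) - x(m + t - k - 1) for t in range(k + 2))
        best = max(best, math.exp(-eps * k) * ls_at_k)
    return best

print(smooth_sensitivity_median([0.1, 0.2, 0.5, 0.8, 0.9], eps=1.0))  # 0.3
```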

Sensitivity and Differential Privacy: Summary
– Global sensitivity of q: Uⁿ → R^d: GS_q = max over adjacent D, D' of ‖q(D) − q(D')‖_1
– Local sensitivity of q at D: LS_q(D) = max over D' adjacent to D of |q(D) − q(D')|
– Smooth sensitivity: S*_f(X) = max_Y { LS_f(Y) · e^{−ε·dist(X,Y)} }
– Histograms; differential privacy of the median

The Exponential Mechanism [McSherry–Talwar]
A general mechanism that yields differential privacy and may yield utility/approximation.
It is defined and evaluated by considering all possible answers; the definition does not by itself yield an efficient way of evaluating the mechanism.
Application/original motivation: approximate truthfulness of auctions, collusion resistance, compatibility.

Sidebar: Digital Goods Auction
Some product with 0 cost of production; n individuals with valuations v_1, v_2, …, v_n. The auctioneer wants to maximize profit.
Key to truthfulness: what you say should not affect what you pay. What about approximate truthfulness?

Example of the Exponential Mechanism
Data: x_i = website visited by student i today. Range: Y = {website names}.
For each name y, let q(y, X) = #{i : x_i = y} (the size of the subset of students visiting y).
Goal: output the most frequently visited site.
Procedure: given X, output website y with probability proportional to e^{ε·q(y,X)}.
Popular sites are exponentially more likely than rare ones, and the scores don't change too quickly between neighboring data sets.
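A sketch of this procedure (exponential_mechanism is a hypothetical helper; with sensitivity-1 scores, the privacy claim two slides below gives 2ε-DP):

```python
import math
import random
from collections import Counter

def exponential_mechanism(candidates, score, eps):
    # Sample y with probability proportional to exp(eps * score(y)).
    # exp() can overflow for very large scores; a log-sum-exp trick would
    # fix that, omitted here for brevity.
    weights = [math.exp(eps * score(y)) for y in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

visits = ["a.com", "b.com", "a.com", "c.com", "a.com"]
freq = Counter(visits)
print(exponential_mechanism(list(freq), lambda y: freq[y], eps=0.5))
```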

The Setting
For input D ∈ Uⁿ we want to find r ∈ R.
Base measure μ on R, usually uniform.
A score function q': Uⁿ × R → R assigns each pair (D, r) a real value; we want to maximize it (approximately).
The exponential mechanism: assign output r ∈ R with probability proportional to e^{ε·q'(D,r)} · μ(r).
Normalizing factor: ∫_r e^{ε·q'(D,r)} μ(r).

The Exponential Mechanism is Private
Let Δ = max over adjacent D, D' and all r of |q'(D,r) − q'(D',r)|.
Claim: the exponential mechanism yields a 2εΔ-differentially private solution.
Pr[output = r on input D] = e^{ε·q'(D,r)} μ(r) / ∫_r e^{ε·q'(D,r)} μ(r)
Pr[output = r on input D'] = e^{ε·q'(D',r)} μ(r) / ∫_r e^{ε·q'(D',r)} μ(r)
For adjacent D, D' the numerators differ by a factor of at most e^{εΔ}, and so do the denominators, so the ratio is bounded by e^{2εΔ}.

Laplace Noise as an Exponential Mechanism
On a query q: Uⁿ → R, let q'(D, r) = −|q(D) − r|.
Then Pr[noise = y] = e^{−ε|y|} / (2 ∫_0^∞ e^{−εy} dy) = (ε/2) · e^{−ε|y|}.
Recall: the Laplace distribution Y = Lap(b) has density Pr[Y = y] = (1/2b) · e^{−|y|/b}, so this is exactly Lap(1/ε).
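An assumed numeric check of this correspondence on a discretized range (not from the lecture): the exponential-mechanism weights normalize to the Lap(1/ε) density around q(D).

```python
import numpy as np

eps, qD = 0.5, 10.0
grid = np.linspace(qD - 40.0, qD + 40.0, 100_001)
weights = np.exp(-eps * np.abs(grid - qD))          # e^(eps * q'(D, r)), uniform mu
density = weights / (weights.sum() * (grid[1] - grid[0]))  # Riemann-sum normalization
laplace = (eps / 2.0) * np.exp(-eps * np.abs(grid - qD))   # Lap(1/eps) density at q(D)
print(np.abs(density - laplace).max())  # ~0, up to grid truncation error
```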

Any Differentially Private Mechanism is an Instance of the Exponential Mechanism
Let M be a differentially private mechanism, and take q'(D, r) = log(Pr[M(D) = r]).
Remaining issue: accuracy.

Private Ranking
Each element i ∈ {1, …, n} has a real-valued score S_D(i) based on a data set D.
Goal: output the k elements with the highest scores.
Privacy: the data set D consists of n entries in domain D; differential privacy protects the privacy of the entries of D.
Condition (insensitive scores): for any element i and any data sets D, D' that differ in one entry, |S_D(i) − S_{D'}(i)| ≤ 1.

Approximate Ranking
Let S_k be the k-th highest score on data set D. An output list is α-useful if:
Soundness: no element in the output has score ≤ S_k − α.
Completeness: every element with score ≥ S_k + α is in the output.
(Elements whose score lies between S_k − α and S_k + α may or may not appear.)

Two Approaches
Score perturbation: perturb the scores of the elements with noise and pick the top k elements in terms of noisy scores. Fast and simple to implement. Question: what sort of noise should be added, and what guarantees does it give? (A sketch follows below.)
Exponential sampling: run the exponential mechanism k times; here each input affects all scores. More complicated and slower to implement. What sort of guarantees does it give?
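A sketch of the score-perturbation approach; the Lap(2k/ε) noise scale is an assumption chosen for illustration (one common calibration for insensitive scores), not a guarantee stated on the slide:

```python
import numpy as np

def noisy_top_k(scores, k, eps, rng=np.random.default_rng()):
    # Add independent Laplace noise to every score, then report the k
    # elements with the highest noisy scores.
    noisy = {e: s + rng.laplace(scale=2 * k / eps) for e, s in scores.items()}
    return sorted(noisy, key=noisy.get, reverse=True)[:k]

print(noisy_top_k({"a": 40, "b": 38, "c": 5, "d": 2}, k=2, eps=1.0))
```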

Exponential Mechanism: an (Almost Free) Private Lunch
Database of n individuals, lunch options {1, …, k}; each individual likes or dislikes each option (1 or 0).
Goal: output a lunch option that many people like.
For each lunch option j ∈ [k], let ℓ(j) = # of individuals who like j.
Exponential mechanism: output j with probability proportional to e^{ε·ℓ(j)}; the actual probability is e^{ε·ℓ(j)} / Σ_i e^{ε·ℓ(i)}, where the denominator is the normalizer.
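Using the exponential_mechanism sketch from the website example on this setting (the like-counts are illustrative):

```python
# With eps = 0.2, pizza (30 likes) is e^(0.2 * 5) ~ 2.7 times more
# likely to be output than sushi (25 likes).
likes = {"pizza": 30, "sushi": 25, "salad": 10}
print(exponential_mechanism(list(likes), lambda j: likes[j], eps=0.2))
```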