Foundations of Privacy Lecture 7

Slides:



Advertisements
Similar presentations
Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.
Advertisements

Truthful Mechanisms for Combinatorial Auctions with Subadditive Bidders Speaker: Shahar Dobzinski Based on joint works with Noam Nisan & Michael Schapira.
Eric Shou Stat/CSE 598B. What is Game Theory? Game theory is a branch of applied mathematics that is often used in the context of economics. Studies strategic.
Wavelet and Matrix Mechanism CompSci Instructor: Ashwin Machanavajjhala 1Lecture 11 : Fall 12.
Differentially Private Recommendation Systems Jeremiah Blocki Fall A: Foundations of Security and Privacy.
Foundations of Cryptography Lecture 10 Lecturer: Moni Naor.
Approximation Algorithms Chapter 14: Rounding Applied to Set Cover.
Foundations of Privacy Lecture 4 Lecturer: Moni Naor.
Foundations of Privacy Lecture 6 Lecturer: Moni Naor.
Seminar in Foundations of Privacy 1.Adding Consistency to Differential Privacy 2.Attacks on Anonymized Social Networks Inbal Talgam March 2008.
An brief tour of Differential Privacy Avrim Blum Computer Science Dept Your guide:
Differential Privacy 18739A: Foundations of Security and Privacy Anupam Datta Fall 2009.
Limitations of VCG-Based Mechanisms Shahar Dobzinski Joint work with Noam Nisan.
Foundations of Privacy Lecture 3 Lecturer: Moni Naor.
Foundations of Privacy Lecture 7 Lecturer: Moni Naor.
Privacy without Noise Yitao Duan NetEase Youdao R&D Beijing China CIKM 2009.
Computing Sketches of Matrices Efficiently & (Privacy Preserving) Data Mining Petros Drineas Rensselaer Polytechnic Institute (joint.
Foundations of Privacy Lecture 5 Lecturer: Moni Naor.
Foundations of Privacy Lecture 11 Lecturer: Moni Naor.
Clustering Ram Akella Lecture 6 February 23, & 280I University of California Berkeley Silicon Valley Center/SC.
Differentially Private Data Release for Data Mining Benjamin C.M. Fung Concordia University Montreal, QC, Canada Noman Mohammed Concordia University Montreal,
Multiplicative Weights Algorithms CompSci Instructor: Ashwin Machanavajjhala 1Lecture 13 : Fall 12.
Foundations of Privacy Lecture 6 Lecturer: Moni Naor.
Ragesh Jaiswal Indian Institute of Technology Delhi Threshold Direct Product Theorems: a survey.
Differentially Private Marginals Release with Mutual Consistency and Error Independent of Sample Size Cynthia Dwork, Microsoft TexPoint fonts used in EMF.
Privacy by Learning the Database Moritz Hardt DIMACS, October 24, 2012.
Personalized Social Recommendations – Accurate or Private? A. Machanavajjhala (Yahoo!), with A. Korolova (Stanford), A. Das Sarma (Google) 1.
Boosting and Differential Privacy Cynthia Dwork, Microsoft Research TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A.
Foundations of Privacy Lecture 5 Lecturer: Moni Naor.
Differential Privacy Xintao Wu Oct 31, Sanitization approaches Input perturbation –Add noise to data –Generalize data Summary statistics –Means,
1 Differential Privacy Cynthia Dwork Mamadou H. Diallo.
Yang, et al. Differentially Private Data Publication and Analysis. Tutorial at SIGMOD’12 Part 4: Data Dependent Query Processing Methods Yin “David” Yang.
PROBABILITY AND COMPUTING RANDOMIZED ALGORITHMS AND PROBABILISTIC ANALYSIS CHAPTER 1 IWAMA and ITO Lab. M1 Sakaidani Hikaru 1.
Theory of Computational Complexity Probability and Computing Chapter Hikaru Inada Iwama and Ito lab M1.
Estimating standard error using bootstrap
Ensemble Classifiers.
Machine Learning: Ensemble Methods
University of Texas at El Paso
Private Data Management with Verification
Game Theory Just last week:
A paper on Join Synopses for Approximate Query Answering
Understanding Generalization in Adaptive Data Analysis
Privacy-preserving Release of Statistics: Differential Privacy
Bolin Ding Silu Huang* Surajit Chaudhuri Kaushik Chakrabarti Chi Wang
Maximal Independent Set
Chapter 8: Inference for Proportions
Spectral Clustering.
Lecture 18: Uniformity Testing Monotonicity Testing
Distributed Submodular Maximization in Massive Datasets
Roberto Battiti, Mauro Brunato
Privacy as a tool for Robust Mechanism Design in Large Markets
Differential Privacy in Practice
Data Mining Practical Machine Learning Tools and Techniques
Vitaly (the West Coast) Feldman
Randomized Algorithms CS648
Enumerating Distances Using Spanners of Bounded Degree
Maximal Independent Set
CS 3343: Analysis of Algorithms
The Curve Merger (Dvir & Widgerson, 2008)
Coverage Approximation Algorithms
Differential Privacy (2)
8. Efficient Scoring Most slides were adapted from Stanford CS 276 course and University of Munich IR course.
Gentle Measurement of Quantum States and Differential Privacy
Scott Aaronson (UT Austin) UNM, Albuquerque, October 18, 2018
Compact routing schemes with improved stretch
Published in: IEEE Transactions on Industrial Informatics
Clustering.
Gentle Measurement of Quantum States and Differential Privacy *
Differential Privacy.
Presentation transcript:

Foundations of Privacy Lecture 7 Lecturer: Moni Naor

(, d) - Differential Privacy Sanitizer M gives (, d) -differential privacy if: for all adjacent D1 and D2, and all A µ range(M): Pr[M(D1) 2 A] ≤ e Pr[M(D2) 2 A] + d Pr [response] ratio bounded Bad Responses: Z Typical setting 𝜖= 1 10 and δ negligible This course: d negligible

Example: NO Differential Privacy U set of (name,tag 2{0,1}) tuples One counting query: #of participants with tag=1 Sanitizer A: choose and release a few random tags Bad event T: Only my tag is 1, my tag released PrA[A(D+I) 2 T] ≥ 1/n PrA[A(D-I) 2 T] = 0 Not ε diff private for any ε! It is (0,1/n) Differential Private PrA[A(D+I) 2 T] e-ε ≤ ≤ eε ≈ 1+ε PrA[A(D-I) 2 T]

Sometimes talk about fraction Counting Queries Database x of size n Counting-queries Q is a set of predicates q: U  {0,1} Query: how many x participants satisfy q? Relaxed accuracy: answer query within α additive error w.h.p Not so bad: some error anyway inherent in statistical analysis Query q n individuals, each contributing a single point in U U Sometimes talk about fraction

Bounds on Achievable Privacy Bounds on the Accuracy The responses from the mechanism to all queries are assured to be within α except with probability  Number of queries t for which we can receive accurate answers The privacy parameter ε for which ε differential privacy is achievable Or (ε,) differential privacy is achievable

Composition: t-Fold Suppose we are going to apply a DP mechanism t times. Perhaps on different databases Want: the combined outcome is differentially private A value b 2 {0,1} is chosen In each of the t rounds: adversary A picks two adjacent databases D0i and D1i and an -DP mechanism Mi receives result zi of the -DP mechanism Mi on Dbi Want to argue: A‘s view is within ’ for both values of b A‘s view: (z1, z2, …, zt) plus randomness used.

Adversary’s view A A’s view: randomness + (z1, z2, …, zt) Distribution with b: Vb A D01, D11 D02, D12 … D0t, D1t M1(Db1) M2(Db2) Mt(Dbt) M1 M2 Mt z1 z2 zt

Differential Privacy: Composition Last week: If all mechanisms Mi are -DP, then for any view the probability that A gets the view when b=0 and when b=1 are with et t releases , each -DP, are t¢ -DP Today: t releases, each -DP, are (√t+t 2,)-DP (roughly) Therefore results for a single query translate to results on several queries

Privacy Loss as a Random Walk 𝑂(𝑡) potentially dangerous rounds Number of Steps t 1 -1 1 1 -1 1 1 -1 Privacy loss grows as 𝒕

The Exponential Mechanism [McSherry Talwar] A general mechanism that yields Differential privacy May yield utility/approximation Is defined and evaluated by considering all possible answers The definition does not yield an efficient way of evaluating it Application/original motivation: Approximate truthfulness of auctions Collusion resistance Compatibility

Side bar: Digital Goods Auction Some product with 0 cost of production n individuals with valuation v1, v2, … vn Auctioneer wants to maximize profit Key to truthfulness: what you say should not affect what you pay What about approximate truthfulness?

Example of the Exponential Mechanism Data: xi = website visited by student i today Range: Y = {website names} For each name y, let q(y, X) = #{i : xi = y} Goal: output the most frequently visited site Procedure: Given X, Output website y with probability proportional to eq(y,X) Popular sites exponentially more likely than rare ones Website scores don’t change too quickly Size of subset

Setting For input D 2 Un want to find r2R Base measure  on R - usually uniform Score function w: Un £ R  R assigns any pair (D,r) a real value Want to maximize it (approximately) The exponential mechanism Assign output r2R with probability proportional to ew(D,r) (r) Normalizing factor r ew(D,r) (r) The reals

The exponential mechanism is private Let  = maxD,D’,r |w(D,r)-w(D’,r)| Claim: The exponential mechanism yields a 2¢¢ differentially private solution For adjacent databases D and D’ and for all possible outputs r 2R Prob[output = r when input is D] = ew(D,r) (r)/r ew(D,r) (r) Prob[output = r when input is D’] = ew(D’,r) (r)/r ew(D’,r) (r) sensitivity adjacent Ratio is bounded by e e

Laplace Noise as Exponential Mechanism On query q:Un→R let w(D,r) = -|q(D)-r| Prob noise = y e-y /2 y e-y =  /2 e-y Laplace distribution Y=Lap(b) has density function Pr[Y=y] =1/2b e-|y|/b y -4 -3 -2 -1 1 2 3 4 5

Any Differentially Private Mechanism is an instance of the Exponential Mechanism Let M be a differentially private mechanism Take w(D,r) to be log (Prob[M(D) =r]) Remaining issue: Accuracy

Private Ranking Each element i 2 {1, … n} has a real valued score SD(i) based on a data set D. Goal: Output k elements with highest scores. Privacy Data set D consists of n entries in domain D. Differential privacy: Protects privacy of entries in D. Condition: Insensitive Scores for any element i, for any data sets D and D’ that differ in one entry: |SD(i)- SD’(i)| · 1

Approximate ranking Let Sk be the kth highest score in on data set D. An output list is  -useful if: Soundness: No element in the output has score · Sk -  Completeness: Every element with score ¸ Sk +  is in the output. Score · Sk -  Sk +  · Score Sk -  · Score · Sk + 

Each input affects all scores Two Approaches Each input affects all scores Score perturbation Perturb the scores of the elements with noise Pick the top k elements in terms of noisy scores. Fast and simple implementation Question: what sort of noise should be added? What sort of guarantees? Exponential sampling Run the exponential mechanism k times. more complicated and slower implementation

Exponential Mechanism: Simple Example (almost free) private lunch Database of n individuals, lunch options {1…k}, each individual likes or dislikes each option (1 or 0) Goal: output a lunch option that many like For each lunch option j2 [k], ℓ(j) is # of individuals who like j Exponential Mechanism: Output j with probability eεℓ(j) Actual probability: eεℓ(j)/(∑i eεℓ(i)) Normalizer

The Net Mechanism Idea: limit the number of possible outputs Want |R| to be small Why is it good? The good (accurate) output has to compete with a few possible outputs If there is a guarantee that there is at least one good output, then the total weight of the bad outputs is limited

 Nets A collection N of databases is called an -net of databases for a class of queries C if: for all possible databases x there exists a y2N such that Maxq2C |q(x) –q(y)| ·  If we use the closest member of N instead of the real database lose at most  In terms of worst query

The Net Mechanism For a class of queries C, privacy  and accuracy , on data base x Let N be an -net for the class of queries C Let w(x,y) = - Maxq2C |q(x) –q(y)| Sample and output according to exponential mechanism with x, w,  and R=N For y2N: Prob[y] proportional to ew(x,y)

Privacy and Utility Claims Privacy: the net mechanism is ¢ differentially private Utility: the net mechanism is (2, ) accurate for any  and  such that ¸ 2/ ¢ log (|N|/) Proof: there is at least one good solution: gets weight at least e- there are at most |N| (bad) outputs: each get weight at most e-2 Use the Union Bound Accuracy less than 2

Synthetic DB: Output is a DB answer 1 answer 3 answer 2 Sanitizer query 1, query 2, . . . ? Database Synthetic DB: output is always a DB Of entries from same universe U User reconstructs answers to queries by evaluating the query on output DB Software and people compatible Consistent answers

Counting Queries Queries with low sensitivity Counting-queries Database D of size n Queries with low sensitivity Counting-queries C is a set of predicates c: U  {0,1} Query: how many D participants satisfy c ? Relaxed accuracy: answer query within α additive error w.h.p Not so bad: error anyway inherent in statistical analysis Assume all queries given in advance Query c U Non-interactive

-Net For Counting Queries If we want to answer many counting queries C with differential privacy: Sufficient to come up with an -Net for C Resulting accuracy max{, log (|N|/) / } Claim: consider the set N consisting of all databases of size m where m = log|C|/2 Consider each element in the set to have weight n/m Then N is an -Net for any collection C of counting queries Error is Õ(n2/3 log|C|)

Remarkable Hope for rich private analysis of small DBs! Quantitative: #queries >> DB size, Qualitative: output of sanitizer -synthetic DB- output is a DB itself

The BLR Algorithm Blum Ligett Roth 2008 Algorithm on input DB D: Sample from a distribution on DBs of size m: (m < n) DB F gets picked w.p. / e-ε·dist(F,D) For DBs F and D dist(F,D) = maxq2C |q(F) – q(D)| Intuition: far away DBs get smaller probability

The BLR Algorithm Idea: In general: Do not use large DB Sample and answer accordingly DB of size m guaranteeing hitting each query with sufficient accuracy

The BLR Algorithm: Error Õ(n2/3 log|C|) Algorithm on input DB D: Sample from a distribution on DBs of size m: (m < n) DB F gets picked w.p. / e-ε·dist(F,D) Goodness Lemma: there exists Fgood of size m =Õ((n\α)2·log|C|) s.t. dist(Fgood,D) ≤ α Proof: construct member of by Fgood taking m random samples from U

The BLR Algorithm: Error Õ(n2/3 log|C|) Algorithm on input DB D: Sample from a distribution on DBs of size m: (m < n) DB F gets picked w.p. / e-ε·dist(F,D) Goodness Lemma: there exists Fgood of size m =Õ((n\α)2·log|C|) s.t. dist(Fgood,D) ≤ α Pr [Fgood] ~ e-εα For any Fbad with dist 2α, Pr [Fbad] ~ e-2εα Union bound: ∑ bad DB Fbad Pr [Fbad] ~ |U|me-2εα For α=Õ(n2/3log|C|), Pr [Fgood] >> ∑ Pr [Fbad]

The BLR Algorithm: 2ε-Privacy Algorithm on input DB D: Sample from a distribution on DBs of size m: (m < n) DB F gets picked w.p. / e-ε·dist(F,D) For adjacent D,D’ for every F |dist(F,D) – dist(F,D’)| ≤ 1 Probability of F by D: e-ε·dist(F,D)/∑G of size m e-ε·dist(G,D) Probability of F by D’: numerator and denominator can change by eε-factor ) 2ε-privacy

The BLR Algorithm: Running Time Algorithm on input DB D: Sample from a distribution on DBs of size m: (m < n) DB F gets picked w.p. / e-ε·dist(F,D) Generating the distribution by enumeration: Need to enumerate every size-m database, where m = Õ((n\α)2·log|C|) Running time ≈ |U|Õ((n\α)2·log|c|)

Conclusion Offline algorithm, 2ε-Differential Privacy for any set C of counting queries Error α is Õ(n2/3 log|C|/ε) Super-poly running time: |U|Õ((n\α)2·log|C|)

Maintaining State Query q State = Distribution D

The Multiplicative Weights Algorithm Powerful tool in algorithms design Learn a Probability Distribution iteratively In each round: either current distribution is good or get a lot of information on distribution Update distribution

The PMW Algorithm Maintain a distribution D on universe U This is the state. Is completely public! Initialize D to be uniform on U Repeat up to k times Set T Ã T + Lap() Repeat while no update occurs: Receive query q 2 Q Let 𝑎 = x(q) + Lap() Test: If |q(D)- 𝑎 | · T output q(D). Else (update): Output 𝑎 Update D[i] / D[i] e±T/4q[i] and re-weight. Algorithm fails if more than k updates The true value the plus or minus are according to the sign of the error

Overview: Privacy Analysis For the query family Q = {0,1}U for (, d, ) and t the PMW mechanism is (, d) –differentially private (,) accurate for up to t queries where  = Õ(1/( n)1/2) State = Distribution is privacy preserving for individuals (but not for queries) accuracy Log dependency on |U|, d,  and t