
1. Privacy Preserving Data Mining, Lecture 3: Non-Cryptographic Approaches for Preserving Privacy. (Based on slides by Kobbi Nissim.) Benny Pinkas, HP Labs, Israel. 10th Estonian Winter School in Computer Science, March 3, 2005.

2. Why not use cryptographic methods? Many users contribute data, and we cannot require them to participate in a cryptographic protocol; in particular, we cannot require peer-to-peer communication between users. Cryptographic protocols also incur considerable overhead.

3. Data Privacy. (Diagram: users access the data d only through an access mechanism; the concern is that query answers may breach privacy.)

4. An Easy, Tempting Solution (a Bad Solution). Idea: (a) remove identifying information (name, SSN, ...); (b) publish the data. But 'harmless' attributes uniquely identify many patients (gender, age, approximate weight, ethnicity, marital status, ...). Recall that DOB + gender + zip code identify people with high probability. Worse: 'rare' attributes (e.g., a disease with probability ≈ 1/3000).

5. What is Privacy? Something should not be computable from the query answers, e.g., π_Joe = {Joe's private data}. The definition should take into account the adversary's power (computational power, number of queries, prior knowledge, ...). Quite often it is much easier to say what is surely non-private, e.g., strong breaking of privacy: the adversary is able to retrieve (almost) everybody's private data. Intuition: privacy is breached if it is possible to compute someone's private information from his identity.

6. The Data Privacy Game: an Information-Privacy Tradeoff. Private functions: want to hide π_x(DB) = d_x. Information functions: want to reveal f(q, DB) for queries q. Here we have an explicit definition of the private functions; the question is which information functions may be allowed. This differs from crypto (secure function evaluation): there, we want to reveal f() (an explicit definition of the information function) and to hide all functions π() not computable from f(), which is an implicit definition of the private functions; the question whether f() should be revealed is not asked.

7. A simplistic model: Statistical Database (SDB). The database is a vector of bits d ∈ {0,1}^n (one private bit per person, e.g., Mr. Fox, Ms. John, Mr. Doe). A query is a subset q ⊆ [n]; the answer is a_q = Σ_{i∈q} d_i.
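To make the model concrete, here is a minimal Python sketch (added for illustration; the database size and the query below are arbitrary, not from the slides):

```python
import random

n = 8
d = [random.randint(0, 1) for _ in range(n)]   # one private bit per person

def answer(q, d):
    """Exact answer a_q: sum of d_i over the indices i in the query set q."""
    return sum(d[i] for i in q)

q = {0, 2, 5}                                   # query over persons 0, 2 and 5
print(answer(q, d))
```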

8. Approaches to SDB Privacy. Studied extensively since the 70s. Perturbation: add randomness and give 'noisy' or 'approximate' answers. Techniques: data perturbation (perturb the data and then answer queries as usual) [Reiss 84; Liew, Choi, Liew 85; Traub, Yemini, Wozniakowski 84]; output perturbation (perturb the answers to queries) [Denning 80; Beck 80; Achugbue, Chin 79; Fellegi, Phillips 74]. Recent interest: [Agrawal, Srikant 00], [Agrawal, Aggarwal 01], ... Query restriction: answer queries accurately but sometimes disallow queries; require queries to obey some structure [Dobkin, Jones, Lipton 79]; restrict the number of queries; auditing [Chin, Ozsoyoglu 82; Kleinberg, Papadimitriou, Raghavan 01].

9. Some Recent Privacy Definitions. X is the data, Y a (noisy) observation of X. [Agrawal, Srikant '00] Interval of confidence: let Y = X + noise (e.g., uniform noise in [-100,100]). Perturb the input data; one can still estimate the underlying distribution. Tradeoff: more noise means less accuracy but more privacy. Intuition: a large possible interval means privacy is preserved. Given Y, we know that with c% confidence X lies in [a_1, a_2]; for example, for Y = 200, with 50% confidence X is in [150, 250]. The width a_2 - a_1 defines the amount of privacy at c% confidence. Problem: there might be some a-priori information about X (e.g., X is someone's age and Y = -97).
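A small sketch of this interval-of-confidence view, assuming the uniform-noise example from the slide (noise in [-100,100]); the function names are illustrative:

```python
import random

def randomize(x, w=100):
    """Publish Y = X + uniform noise in [-w, w] ([AS00]-style input perturbation)."""
    return x + random.uniform(-w, w)

def confidence_interval(y, w=100, c=0.5):
    """Given Y, X lies in [y - c*w, y + c*w] with confidence c under uniform noise.
    E.g. y = 200, c = 0.5 -> X in [150, 250] with 50% confidence."""
    return (y - c * w, y + c * w)

print(confidence_interval(200))   # (150.0, 250.0)
```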

10. The [AS] scheme can be turned against itself. Assume that N (the number of data items) is large: even if the data miner has no a-priori information about X, it can estimate the distribution of X from the randomized data Y. Suppose the perturbation is uniform in [-1,1]; by [AS] this gives a privacy interval of length 2 with 100% confidence. Let the distribution f_X put 50% of the mass on x ∈ [0,1] and 50% on x ∈ [4,5]. But after learning f_X, the value of X can easily be localized within an interval of size at most 1. Problem: aggregate information provides information that can be used to attack individual data.
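The localization step can be made concrete with a hypothetical sketch that assumes the slide's example distribution (mass split between [0,1] and [4,5]) and noise uniform in [-1,1]; intersecting the noise interval with the learned support leaves an interval of length at most 1:

```python
def possible_interval(y):
    """Intersect the noise interval [y-1, y+1] with the learned support of X."""
    support = [(0.0, 1.0), (4.0, 5.0)]          # assumed example distribution
    pieces = [(max(lo, y - 1), min(hi, y + 1)) for lo, hi in support]
    return [(lo, hi) for lo, hi in pieces if lo <= hi]

print(possible_interval(4.5))   # [(4.0, 5.0)]: X localized to an interval of length 1
```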

11. Some Recent Privacy Definitions (cont.). X is the data, Y a (noisy) observation of X. [Agrawal, Aggarwal '01] Mutual information. Intuition: high entropy is good. I(X;Y) = H(X) - H(X|Y) is the mutual information; small I(X;Y) means privacy is preserved (Y provides little information about X). Problem [EGS]: this is an average notion. Privacy loss can happen with low but significant probability without affecting I(X;Y), so sometimes I(X;Y) looks good but privacy is breached.

12. Output Perturbation (Randomization Approach). Exact answer to query q: a_q = Σ_{i∈q} d_i. Actual SDB answer: â_q. Perturbation E: for all q, |â_q - a_q| ≤ E. Questions: does perturbation give any privacy? How much perturbation is needed for privacy? What about usability?

13. Privacy Preserved by Perturbation E ≈ √n. Database: d ∈_R {0,1}^n (uniform input distribution!). Algorithm: on query q, (1) let a_q = Σ_{i∈q} d_i; (2) if |a_q - |q|/2| < E, return â_q = |q|/2; (3) otherwise return â_q = a_q. With E ≈ √n (lg n)^2, privacy is preserved: assuming poly(n) queries, w.h.p. rule (2) is always used, so no information about d is given (but the database is completely useless). This shows that a perturbation of ≈ √n is sometimes enough for privacy. Can we do better?
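A Python sketch of this answering rule, with an illustrative choice of n and the perturbation E set to roughly √n·(lg n)^2 as on the slide (assumed parameters, not a tuned implementation):

```python
import math
import random

n = 10_000
d = [random.randint(0, 1) for _ in range(n)]            # d drawn uniformly at random
E = math.isqrt(n) * int(math.log2(n)) ** 2              # perturbation ~ sqrt(n)*(lg n)^2

def perturbed_answer(q, d, E=E):
    """Return |q|/2 whenever the exact answer is within E of it; otherwise answer exactly.
    For random d and poly(n) queries, w.h.p. the first branch always fires."""
    a_q = sum(d[i] for i in q)
    if abs(a_q - len(q) / 2) < E:
        return len(q) / 2
    return a_q

q = random.sample(range(n), n // 2)
print(perturbed_answer(q, d))                           # w.h.p. just |q|/2 = 2500.0
```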

14. Perturbation E << √n Implies No Privacy (strong breaking of privacy). The previous useless database achieves the best possible perturbation. Theorem [Dinur-Nissim]: given any DB and any DB response algorithm with perturbation E = o(√n), there is a poly-time reconstruction algorithm that outputs a database d' such that dist(d, d') = o(n).

15. The Adversary as a Decoding Algorithm. (Diagram: the database d is 'encoded' into partial sums a_{q_1}, ..., a_{q_t}, which are perturbed into â_{q_1}, ..., â_{q_t}; the adversary's task is to decode the perturbed sums back into a database d'.)

16. Proof of Theorem [DN03]: the adversary's reconstruction algorithm. Query phase: get â_{q_j} for t random subsets q_1, ..., q_t. Weeding phase: solve the linear program (over the reals) with variables x_1, ..., x_n: 0 ≤ x_i ≤ 1 and |Σ_{i∈q_j} x_i - â_{q_j}| ≤ E for every j. Observation: a solution always exists, e.g., x = d. Rounding phase: let c_i = round(x_i) and output c.
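A hedged sketch of this reconstruction adversary using an off-the-shelf LP solver (SciPy here is an assumption; any LP solver over the reals works). It builds the weeding-phase constraints |Σ_{i∈q_j} x_i - â_{q_j}| ≤ E with 0 ≤ x_i ≤ 1, then rounds:

```python
import numpy as np
from scipy.optimize import linprog

def reconstruct(queries, noisy_answers, n, E):
    """queries: list of index sets q_j; noisy_answers: the â_{q_j}; E: perturbation bound."""
    A = np.array([[1.0 if i in q else 0.0 for i in range(n)] for q in queries])
    b = np.array(noisy_answers, dtype=float)
    # |A x - b| <= E  becomes  A x <= b + E  and  -A x <= -(b - E)
    A_ub = np.vstack([A, -A])
    b_ub = np.concatenate([b + E, -(b - E)])
    res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, 1)] * n, method="highs")   # feasibility LP (zero objective)
    return np.round(res.x).astype(int)                    # rounding phase: c_i = round(x_i)
```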

17. Why does the reconstruction algorithm work? Consider x ∈ {0,1}^n such that dist(x, d) = c·n = Ω(n). Observation: a random q contains c'·n coordinates in which x ≠ d, and the difference between the sums over these coordinates is, with constant probability, at least Ω(√n), which exceeds E = o(√n). Such a q disqualifies x as a solution of the LP. With a polynomial number of random queries, all such vectors x are disqualified with overwhelming probability.

18. Summary of Results (statistical database) [Dinur, Nissim 03]. Unlimited adversary: perturbation of magnitude Ω(n) is required. Polynomial-time adversary: perturbation of magnitude Ω(√n) is required (shown above). In both cases, with less perturbation the adversary may reconstruct a good approximation of the database, which disallows even very weak notions of privacy. Bounded adversary restricted to T << n queries (SuLQ): there is a privacy-preserving access mechanism with perturbation << √T, so there is a chance for usability; this is a reasonable model as databases grow larger and larger.

19. SuLQ for a Multi-Attribute Statistical Database (SDB). The database {d_{i,j}} has n persons (rows) and k attributes (columns); row i is drawn from a distribution D_i. A query is a pair (q, f) with q ⊆ [n] and f : {0,1}^k → {0,1}; the answer is a_{q,f} = Σ_{i∈q} f(d_i).

20. Privacy and Usability Concerns for the Multi-Attribute Model [DN]. Rich set of queries: subset sums over any property of the k attributes; this obviously increases usability, but how is privacy affected? There is more to protect: functions of the k attributes. Relevant factors: what is the adversary's goal? Row dependency. Vertically split data (between k or fewer databases): can privacy still be maintained with independently operating databases?

21. Privacy Definition: Intuition. A 3-phase adversary. Phase 0: it defines a target set G of poly(n) functions g : {0,1}^k → {0,1}, and will try to learn some of this information about someone. Phase 1: it adaptively queries the database T = o(n) times. Phase 2: it chooses an index i of a row it intends to attack and a function g ∈ G, using all gained information to choose i and g. The attack: given d_{-i}, try to guess g(d_{i,1}, ..., d_{i,k}).

22. The Privacy Definition. p^0_{i,g} is the a-priori probability that g(d_{i,1}, ..., d_{i,k}) = 1; p^T_{i,g} is the a-posteriori probability that g(d_{i,1}, ..., d_{i,k}) = 1, given d_{-i} and the answers to the T queries. Define conf(p) = log(p/(1-p)); there is a 1-1 relationship between p and conf(p), with conf(1/2) = 0, conf(2/3) = 1, conf(1) = ∞. Let Δconf_{i,g} = conf(p^T_{i,g}) - conf(p^0_{i,g}). (ε, T)-privacy ("relative privacy"): for all distributions D_1, ..., D_n, every row i, every function g, and any adversary making at most T queries, Pr[Δconf_{i,g} > ε] = neg(n).
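For reference, a tiny sketch of the confidence function; the base-2 logarithm is an assumption (it matches conf(2/3) = 1):

```python
import math

def conf(p):
    """Log-odds, base-2 log assumed: conf(1/2) = 0, conf(2/3) = 1, conf(1) = inf."""
    if p <= 0:
        return -math.inf
    if p >= 1:
        return math.inf
    return math.log2(p / (1 - p))
```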

23. The SuLQ* Database. The adversary is restricted to T << n queries. On query (q, f), with q ⊆ [n] and f : {0,1}^k → {0,1} a binary function: let a_{q,f} = Σ_{i∈q} f(d_{i,1}, ..., d_{i,k}); let N ~ Binomial(0, √T), i.e., zero-mean noise of magnitude on the order of √T; return a_{q,f} + N. (*SuLQ: Sub-Linear Queries.)
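A minimal sketch of the SuLQ answering rule. The exact noise distribution below (a centered binomial with standard deviation on the order of √T) is an interpretation of the slide's Binomial(0, √T) shorthand, not a statement of the original mechanism's parameters:

```python
import random

def sulq_answer(q, f, rows, T):
    """rows: list of k-bit tuples d_i; f: {0,1}^k -> {0,1}; q: set of row indices;
    T: bound on the number of queries, used here as the assumed noise parameter."""
    a_qf = sum(f(rows[i]) for i in q)
    trials = max(1, int(T))
    noise = sum(random.randint(0, 1) for _ in range(trials)) - trials / 2
    return a_qf + noise                          # zero-mean noise, std ~ sqrt(T)/2
```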

24. Privacy Analysis of the SuLQ Database. p^m_{i,g} is the a-posteriori probability that g(d_{i,1}, ..., d_{i,k}) = 1, given d_{-i} and the answers to the first m queries. conf(p^m_{i,g}) describes a random walk on the line with starting point conf(p^0_{i,g}); compromise occurs when conf(p^m_{i,g}) - conf(p^0_{i,g}) > ε, i.e., when the walk moves from conf(p^0_{i,g}) to conf(p^0_{i,g}) + ε. W.h.p. more than T steps are needed to reach compromise.

25. Usability: One Multi-Attribute SuLQ DB. Statistics of any property f of the k attributes, i.e., for what fraction of the (sub)population does f(d_1, ..., d_k) hold? Easy: just put f in the query. Other applications: k independent multi-attribute SuLQ DBs; vertically partitioned SuLQ DBs; testing whether Pr[α | β] ≥ Pr[α] + Δ. Caveat: we hide g() about a specific row, not about multiple rows.

26. Overview of Methods. Input perturbation: the SDB is perturbed into SDB' (data perturbation), and the user's queries are answered from SDB'. Output perturbation: the user poses a (restricted) query and receives a perturbed response. Query restriction: the user poses a (restricted) query and receives either an exact response or a denial.

27. Query Restriction. The decision whether to answer or deny a query can be based on the content of the query and on the answers to previous queries, or it can additionally be based on the content of the database. The user poses a (restricted) query and receives either an exact response or a denial.

28. Auditing. [AW89] classify auditing as a query restriction method: "Auditing of an SDB involves keeping up-to-date logs of all queries made by each user (not the data involved) and constantly checking for possible compromise whenever a new query is issued." Partial motivation: it may allow more queries to be posed if no privacy threat occurs. Early work: Hofmann 1977; Schlorer 1976; Chin, Ozsoyoglu 1981, 1986. Recent interest: Kleinberg, Papadimitriou, Raghavan 2000; Li, Wang, Wang, Jajodia 2002; Jonsson, Krokhin 2003.

29. How Auditors May Inadvertently Compromise Privacy.

30. The Setting. Dataset: d = {d_1, ..., d_n}, with entries d_i real, integer, or Boolean. Query: q = (f, i_1, ..., i_k), where f is Min, Max, Median, Sum, Average, Count, etc.; the statistical database returns f(d_{i_1}, ..., d_{i_k}). Bad users will try to breach the privacy of individuals. Compromise = uniquely determining some d_i (a very weak definition).

31. Auditing. The auditor keeps a query log q_1, ..., q_i. On a new query q_{i+1}, it either returns the answer or denies the query (because the answer would cause privacy loss).

32. Example 1: Sum/Max Auditing. The d_i are real, queries are sum/max, and privacy is breached if some d_i is learned. Attacker: q1 = sum(d1,d2,d3); answer: sum(d1,d2,d3) = 15. Attacker: q2 = max(d1,d2,d3); denied (the answer would cause privacy loss). But there must be a reason for the denial: q2 is denied iff d1 = d2 = d3 = 5. The attacker wins.
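The attacker's inference can be written out explicitly; a hypothetical sketch assuming the values from the example (sum = 15, real entries):

```python
def infer_from_denial(sum_value, denied):
    """The max query is denied exactly when answering it would pin down some d_i,
    i.e., when d1 = d2 = d3; combined with the sum, a denial reveals all three."""
    if denied:
        return [sum_value / 3] * 3   # -> [5.0, 5.0, 5.0] for sum_value = 15
    return None                      # answered: no single value is determined here

print(infer_from_denial(15, denied=True))
```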

33. Sounds Familiar? "Mr. Chairman, I would like to answer the committee's questions, but on the advice of my counsel I respectfully decline to answer the question based on the protection afforded me under the Constitution of the United States." (David Duncan, former auditor for Enron and partner in Andersen.)

34. Max Auditing. The d_i are real. q1 = max(d1,d2,d3,d4) is answered with M_1234. q2 = max(d1,d2,d3) is either answered with M_123 or denied; if denied, then d4 = M_1234. q3 = max(d1,d2) is either answered with M_12 or denied; if denied, then d3 = M_123. Repeating this over d1, d2, ..., d_{n-1}, d_n, the attacker learns an item with probability 1/2.

35. Boolean Auditing? The d_i are Boolean. Queries: q1 = sum(d1,d2), q2 = sum(d2,d3), and so on. Each q_i is either answered with 1 or denied, and q_i is denied iff d_i = d_{i+1} (an answer of 0 or 2 would reveal both bits). The denial pattern therefore reveals the database up to complementation.
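A sketch of this attack, assuming an auditor that denies exactly when the two queried bits are equal (since an answer of 0 or 2 would reveal them):

```python
def attack(denied):
    """denied[i] is True iff sum(d_i, d_{i+1}) was denied, i.e., iff d_i = d_{i+1}."""
    guess = [0]                               # assume d_0 = 0; otherwise take the complement
    for eq in denied:
        guess.append(guess[-1] if eq else 1 - guess[-1])
    return guess

d = [0, 0, 1, 1, 0]
denied = [d[i] == d[i + 1] for i in range(len(d) - 1)]   # what the attacker observes
print(attack(denied))                                     # [0, 0, 1, 1, 0] or its complement
```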

36. The Problem. Query denials leak (potentially sensitive) information, and users cannot predict denials by themselves. (Diagram: within the set of possible assignments to {d_1, ..., d_n}, denying q_{i+1} further narrows the assignments consistent with (q_1, ..., q_i, a_1, ..., a_i).)

37. Solution to the Problem: Simulatable Auditing. An auditor is simulatable if there exists a simulator that, given only the past queries q_1, ..., q_i, their answers a_1, ..., a_i, and the new query q_{i+1}, but no access to the statistical database, produces the same deny/answer decision as the auditor. Simulation implies that denials do not leak information.

38. Why do Simulatable Auditors not Leak Information? (Diagram: the deny/allow decision for q_{i+1} depends only on (q_1, ..., q_i, a_1, ..., a_i), so it does not shrink the set of assignments to {d_1, ..., d_n} consistent with those answers.)

39. Simulatable Auditing.

40. Query Restriction for Sum Queries. Given a dataset D = {x_1, ..., x_n} with real x_i, a query specifies a subset S and asks for Σ_{x_i∈S} x_i. Is it possible to compromise D? Here compromise means uniquely determining some x_i from the queries. Compromise is trivial if query sets may be arbitrarily small: sum(x_9) = x_9.

41. Query Set Size Control. Do not permit queries that involve a small subset of the database. Compromise is still possible: to discover x, ask sum(x, y_1, ..., y_k) - sum(y_1, ..., y_k) = x (see the sketch below). The issue is overlap, but in general restricting overlap is not enough; the number of queries must also be restricted. Note that overlap itself sometimes restricts the number of queries (e.g., if the query size is cn and the overlap is constant, only about 1/c queries are possible).
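A sketch of this "tracker" attack on query-set-size control; the data values and padding set below are arbitrary illustrations:

```python
def sum_query(indices, data):
    """Subset-sum query over the given indices."""
    return sum(data[i] for i in indices)

data = [3, 8, 1, 5, 9, 2, 7, 4]
padding = [1, 2, 3, 4, 5, 6, 7]               # y_1..y_k: any large-enough allowed set
target = 0                                     # we want x = data[0]
x = sum_query([target] + padding, data) - sum_query(padding, data)
print(x)                                       # 3 = data[0]
```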

42. Restricting Set-Sum Queries. Restrict the sum queries based on: the number of database elements in the sum; the overlap with previous sum queries; the total number of queries. Note that these criteria are known to the user and do not depend on the contents of the database. Therefore the user can simulate the deny/no-deny decision of the DB, i.e., this is simulatable auditing (a sketch of such a rule follows).
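A minimal sketch of such a content-independent restriction rule; the thresholds k, r, and t_max are illustrative parameters, not values from the slides. Because the rule never looks at the data, a user can run the same function to predict denials:

```python
def make_auditor(k, r, t_max):
    """Deny a sum query unless it has >= k elements, overlaps every previously
    answered query in <= r elements, and fewer than t_max queries were answered."""
    answered = []
    def allow(q):
        q = set(q)
        ok = (len(q) >= k
              and all(len(q & prev) <= r for prev in answered)
              and len(answered) < t_max)
        if ok:
            answered.append(q)
        return ok                 # decision uses only the queries, never the data
    return allow

allow = make_auditor(k=3, r=1, t_max=5)
print(allow({0, 1, 2}))           # True
print(allow({0, 1, 3}))           # False: overlaps the first query in 2 > r elements
```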

43. Restricting Overlap and Number of Queries. Assume: |Q_i| ≥ k for every query; |Q_i ∩ Q_j| ≤ r for every pair of queries; the adversary knows a-priori at most L values, with L + 1 < k. Claim: the data cannot be compromised with fewer than 1 + (2k - L)/r sum queries.

44. Overlap + Number of Queries. Claim [Dobkin, Jones, Lipton] [Reiss]: the data cannot be compromised with fewer than 1 + (2k - L)/r sum queries (k = minimum query size, r = maximum overlap, L = number of a-priori known items). Suppose x_c is compromised after t queries, each of the form Q_i = x_{i1} + x_{i2} + ... + x_{ik} for i = 1, ..., t. Then x_c = Σ_{i=1..t} α_i Q_i = Σ_{i=1..t} α_i Σ_{j=1..k} x_{ij} for some coefficients α_i. Let η_{iℓ} = 1 if x_ℓ appears in query i and 0 otherwise; then x_c = Σ_{i=1..t} α_i Σ_{ℓ=1..n} η_{iℓ} x_ℓ = Σ_{ℓ=1..n} (Σ_{i=1..t} α_i η_{iℓ}) x_ℓ.

45. Overlap + Number of Queries (cont.). We have x_c = Σ_{ℓ=1..n} (Σ_{i=1..t} α_i η_{iℓ}) x_ℓ. In this sum, the coefficient Σ_{i=1..t} α_i η_{iℓ} must be 0 for every x_ℓ except x_c (in order for x_c to be compromised). This happens iff η_{iℓ} = 0 for all i, or if η_{iℓ} = η_{jℓ} = 1 and α_i, α_j have opposite signs (or α_i = 0, in which case the i-th query did not matter).

46. Overlap + Number of Queries (cont.). W.l.o.g. the first query contains x_c and the second query has a coefficient of the opposite sign. The first query probes k elements; the second query adds at least k - r new elements. Elements from the first and second queries cannot be canceled within the same additional query (opposite signs would be required). Therefore each new query cancels items from the first query or from the second, but not from both. In total 2k - r - L elements must be canceled, so at least 2 + (2k - r - L)/r queries are needed, i.e., 1 + (2k - L)/r.

47. Notes. The number of queries satisfying |Q_i| ≥ k and |Q_i ∩ Q_j| ≤ r is small. If k = n/c for some constant c and r is a constant, then only about c nearly disjoint queries are possible, so the permitted query sequence may be uncomfortably short. Alternatively, if r = k/c (the overlap is a constant fraction of the query size), then the bound on the number of queries, 1 + (2k - L)/r, is O(c).

48. Conclusions. Privacy should be defined and analyzed rigorously; in particular, assuming that randomization implies privacy is dangerous. High perturbation is needed for privacy against polynomial adversaries: there is a threshold phenomenon (above √n, total privacy; below √n, no privacy for a poly-time adversary), and the main tool is a reconstruction algorithm. Careless auditing might leak private information. Self-auditing (simulatable auditors) is safe: the decision whether to allow a query is based on previous 'good' queries and their answers, without access to the DB contents, so users may apply the decision procedure by themselves.

49. To Do. Come up with a good model and requirements for database privacy: learn from crypto, and protect against more general loss of privacy. Simulatable auditors are a starting point for designing more reasonable audit mechanisms.

50. References. Course web pages: A Study of Perturbation Techniques for Data Privacy, Cynthia Dwork, Nina Mishra and Kobbi Nissim, http://theory.stanford.edu/~nmishra/cs369-2004.html; Privacy and Databases, http://theory.stanford.edu/~rajeev/privacy.html.

51. Foundations of CS at the Weizmann Institute. Faculty: Uri Feige, Oded Goldreich, Shafi Goldwasser, David Harel, Moni Naor, David Peleg, Amir Pnueli, Ran Raz, Omer Reingold, Adi Shamir. All students receive a fellowship; the language of instruction is English. (On the slide, names highlighted in yellow work in crypto.)

