
1. Privacy Preserving Data Mining, Lecture 3: Non-Cryptographic Approaches for Preserving Privacy. (Based on slides by Kobbi Nissim.) Benny Pinkas, HP Labs, Israel. 10th Estonian Winter School in Computer Science, March 3, 2005.

2. Why not use cryptographic methods? Many users contribute data, and we cannot require them to participate in a cryptographic protocol; in particular, we cannot require peer-to-peer communication between users. Cryptographic protocols also incur considerable overhead.

3. Data Privacy. (Diagram: users access the data d only through an access mechanism; the concern is that query answers may breach privacy.)

4. An Easy, Tempting Solution (a Bad Solution). Idea: (a) remove identifying information (name, SSN, ...); (b) publish the data. But 'harmless' attributes uniquely identify many patients (gender, age, approximate weight, ethnicity, marital status, ...). Recall that DOB + gender + zip code identify people with high probability. Worse: 'rare' attributes (e.g., a disease with probability ≈ 1/3000).

5. What is Privacy? Something should not be computable from the query answers, e.g., π_Joe = {Joe's private data}. The definition should take into account the adversary's power (computational power, number of queries, prior knowledge, ...). Quite often it is much easier to say what is surely non-private, e.g., strong breaking of privacy: the adversary is able to retrieve (almost) everybody's private data. Intuition: privacy is breached if it is possible to compute someone's private information from his identity.

6. The Data Privacy Game: an Information-Privacy Tradeoff. Private functions: want to hide π_x(DB) = d_x. Information functions: want to reveal f(q, DB) for queries q. Here we have an explicit definition of the private functions; the question is which information functions may be allowed. This differs from crypto (secure function evaluation): there, we want to reveal f() (an explicit definition of the information function) and to hide all functions π() not computable from f(), which is an implicit definition of the private functions; the question whether f() should be revealed is not asked.

7. A simplistic model: Statistical Database (SDB). The database is a vector of bits d ∈ {0,1}^n (one private bit per person, e.g., Mr. Fox, Ms. John, Mr. Doe). A query is a subset q ⊆ [n]; the answer is a_q = Σ_{i∈q} d_i.
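To make the model concrete, here is a minimal Python sketch (added for illustration; the database size and the query below are arbitrary, not from the slides):

```python
import random

n = 8
d = [random.randint(0, 1) for _ in range(n)]   # one private bit per person

def answer(q, d):
    """Exact answer a_q: sum of d_i over the indices i in the query set q."""
    return sum(d[i] for i in q)

q = {0, 2, 5}                                   # query over persons 0, 2 and 5
print(answer(q, d))
```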

8. Approaches to SDB Privacy. Studied extensively since the 70s. Perturbation: add randomness and give 'noisy' or 'approximate' answers. Techniques: data perturbation (perturb the data and then answer queries as usual) [Reiss 84; Liew, Choi, Liew 85; Traub, Yemini, Wozniakowski 84]; output perturbation (perturb the answers to queries) [Denning 80; Beck 80; Achugbue, Chin 79; Fellegi, Phillips 74]. Recent interest: [Agrawal, Srikant 00], [Agrawal, Aggarwal 01], ... Query restriction: answer queries accurately but sometimes disallow queries; require queries to obey some structure [Dobkin, Jones, Lipton 79]; restrict the number of queries; auditing [Chin, Ozsoyoglu 82; Kleinberg, Papadimitriou, Raghavan 01].

9. Some Recent Privacy Definitions. X is the data, Y a (noisy) observation of X. [Agrawal, Srikant '00] Interval of confidence: let Y = X + noise (e.g., uniform noise in [-100,100]). Perturb the input data; one can still estimate the underlying distribution. Tradeoff: more noise means less accuracy but more privacy. Intuition: a large possible interval means privacy is preserved. Given Y, we know that with c% confidence X lies in [a_1, a_2]; for example, for Y = 200, with 50% confidence X is in [150, 250]. The width a_2 - a_1 defines the amount of privacy at c% confidence. Problem: there might be some a-priori information about X (e.g., X is someone's age and Y = -97).
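A small sketch of this interval-of-confidence view, assuming the uniform-noise example from the slide (noise in [-100,100]); the function names are illustrative:

```python
import random

def randomize(x, w=100):
    """Publish Y = X + uniform noise in [-w, w] ([AS00]-style input perturbation)."""
    return x + random.uniform(-w, w)

def confidence_interval(y, w=100, c=0.5):
    """Given Y, X lies in [y - c*w, y + c*w] with confidence c under uniform noise.
    E.g. y = 200, c = 0.5 -> X in [150, 250] with 50% confidence."""
    return (y - c * w, y + c * w)

print(confidence_interval(200))   # (150.0, 250.0)
```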

10. The [AS] scheme can be turned against itself. Assume that N (the number of data items) is large: even if the data miner has no a-priori information about X, it can estimate the distribution of X from the randomized data Y. Suppose the perturbation is uniform in [-1,1]; by [AS] this gives a privacy interval of length 2 with 100% confidence. Let the distribution f_X put 50% of the mass on x ∈ [0,1] and 50% on x ∈ [4,5]. But after learning f_X, the value of X can easily be localized within an interval of size at most 1. Problem: aggregate information provides information that can be used to attack individual data.
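The localization step can be made concrete with a hypothetical sketch that assumes the slide's example distribution (mass split between [0,1] and [4,5]) and noise uniform in [-1,1]; intersecting the noise interval with the learned support leaves an interval of length at most 1:

```python
def possible_interval(y):
    """Intersect the noise interval [y-1, y+1] with the learned support of X."""
    support = [(0.0, 1.0), (4.0, 5.0)]          # assumed example distribution
    pieces = [(max(lo, y - 1), min(hi, y + 1)) for lo, hi in support]
    return [(lo, hi) for lo, hi in pieces if lo <= hi]

print(possible_interval(4.5))   # [(4.0, 5.0)]: X localized to an interval of length 1
```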

11. Some Recent Privacy Definitions (cont.). X is the data, Y a (noisy) observation of X. [Agrawal, Aggarwal '01] Mutual information. Intuition: high entropy is good. I(X;Y) = H(X) - H(X|Y) is the mutual information; small I(X;Y) means privacy is preserved (Y provides little information about X). Problem [EGS]: this is an average notion. Privacy loss can happen with low but significant probability without affecting I(X;Y), so sometimes I(X;Y) looks good but privacy is breached.

12. Output Perturbation (Randomization Approach). Exact answer to query q: a_q = Σ_{i∈q} d_i. Actual SDB answer: â_q. Perturbation E: for all q, |â_q - a_q| ≤ E. Questions: does perturbation give any privacy? How much perturbation is needed for privacy? What about usability?

13. Privacy Preserved by Perturbation E ≈ √n. Database: d ∈_R {0,1}^n (uniform input distribution!). Algorithm: on query q, (1) let a_q = Σ_{i∈q} d_i; (2) if |a_q - |q|/2| < E, return â_q = |q|/2; (3) otherwise return â_q = a_q. With E ≈ √n (lg n)^2, privacy is preserved: assuming poly(n) queries, w.h.p. rule (2) is always used, so no information about d is given (but the database is completely useless). This shows that a perturbation of ≈ √n is sometimes enough for privacy. Can we do better?
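A Python sketch of this answering rule, with an illustrative choice of n and the perturbation E set to roughly √n·(lg n)^2 as on the slide (assumed parameters, not a tuned implementation):

```python
import math
import random

n = 10_000
d = [random.randint(0, 1) for _ in range(n)]            # d drawn uniformly at random
E = math.isqrt(n) * int(math.log2(n)) ** 2              # perturbation ~ sqrt(n)*(lg n)^2

def perturbed_answer(q, d, E=E):
    """Return |q|/2 whenever the exact answer is within E of it; otherwise answer exactly.
    For random d and poly(n) queries, w.h.p. the first branch always fires."""
    a_q = sum(d[i] for i in q)
    if abs(a_q - len(q) / 2) < E:
        return len(q) / 2
    return a_q

q = random.sample(range(n), n // 2)
print(perturbed_answer(q, d))                           # w.h.p. just |q|/2 = 2500.0
```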

14. Perturbation E << √n Implies No Privacy (strong breaking of privacy). The previous useless database achieves the best possible perturbation. Theorem [Dinur-Nissim]: given any DB and any DB response algorithm with perturbation E = o(√n), there is a poly-time reconstruction algorithm that outputs a database d' such that dist(d, d') = o(n).

15. The Adversary as a Decoding Algorithm. (Diagram: the database d is 'encoded' into partial sums a_{q_1}, ..., a_{q_t}, which are perturbed into â_{q_1}, ..., â_{q_t}; the adversary's task is to decode the perturbed sums back into a database d'.)

16. Proof of Theorem [DN03]: the adversary's reconstruction algorithm. Query phase: get â_{q_j} for t random subsets q_1, ..., q_t. Weeding phase: solve the linear program (over the reals) with variables x_1, ..., x_n: 0 ≤ x_i ≤ 1 and |Σ_{i∈q_j} x_i - â_{q_j}| ≤ E for every j. Observation: a solution always exists, e.g., x = d. Rounding phase: let c_i = round(x_i) and output c.
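A hedged sketch of this reconstruction adversary using an off-the-shelf LP solver (SciPy here is an assumption; any LP solver over the reals works). It builds the weeding-phase constraints |Σ_{i∈q_j} x_i - â_{q_j}| ≤ E with 0 ≤ x_i ≤ 1, then rounds:

```python
import numpy as np
from scipy.optimize import linprog

def reconstruct(queries, noisy_answers, n, E):
    """queries: list of index sets q_j; noisy_answers: the â_{q_j}; E: perturbation bound."""
    A = np.array([[1.0 if i in q else 0.0 for i in range(n)] for q in queries])
    b = np.array(noisy_answers, dtype=float)
    # |A x - b| <= E  becomes  A x <= b + E  and  -A x <= -(b - E)
    A_ub = np.vstack([A, -A])
    b_ub = np.concatenate([b + E, -(b - E)])
    res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, 1)] * n, method="highs")   # feasibility LP (zero objective)
    return np.round(res.x).astype(int)                    # rounding phase: c_i = round(x_i)
```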

17. Why does the reconstruction algorithm work? Consider x ∈ {0,1}^n such that dist(x, d) = c·n = Ω(n). Observation: a random q contains c'·n coordinates in which x ≠ d, and the difference between the sums over these coordinates is, with constant probability, at least Ω(√n), which exceeds E = o(√n). Such a q disqualifies x as a solution of the LP. With a polynomial number of random queries, all such vectors x are disqualified with overwhelming probability.

18. Summary of Results (statistical database) [Dinur, Nissim 03]. Unlimited adversary: perturbation of magnitude Ω(n) is required. Polynomial-time adversary: perturbation of magnitude Ω(√n) is required (shown above). In both cases, with less perturbation the adversary may reconstruct a good approximation of the database, which disallows even very weak notions of privacy. Bounded adversary restricted to T << n queries (SuLQ): there is a privacy-preserving access mechanism with perturbation << √T, so there is a chance for usability; this is a reasonable model as databases grow larger and larger.

19. SuLQ for a Multi-Attribute Statistical Database (SDB). The database {d_{i,j}} has n persons (rows) and k attributes (columns); row i is drawn from a distribution D_i. A query is a pair (q, f) with q ⊆ [n] and f : {0,1}^k → {0,1}; the answer is a_{q,f} = Σ_{i∈q} f(d_i).

20. Privacy and Usability Concerns for the Multi-Attribute Model [DN]. Rich set of queries: subset sums over any property of the k attributes; this obviously increases usability, but how is privacy affected? There is more to protect: functions of the k attributes. Relevant factors: what is the adversary's goal? Row dependency. Vertically split data (between k or fewer databases): can privacy still be maintained with independently operating databases?

21. Privacy Definition: Intuition. A 3-phase adversary. Phase 0: it defines a target set G of poly(n) functions g : {0,1}^k → {0,1}, and will try to learn some of this information about someone. Phase 1: it adaptively queries the database T = o(n) times. Phase 2: it chooses an index i of a row it intends to attack and a function g ∈ G, using all gained information to choose i and g. The attack: given d_{-i}, try to guess g(d_{i,1}, ..., d_{i,k}).

22. The Privacy Definition. p^0_{i,g} is the a-priori probability that g(d_{i,1}, ..., d_{i,k}) = 1; p^T_{i,g} is the a-posteriori probability that g(d_{i,1}, ..., d_{i,k}) = 1, given d_{-i} and the answers to the T queries. Define conf(p) = log(p/(1-p)); there is a 1-1 relationship between p and conf(p), with conf(1/2) = 0, conf(2/3) = 1, conf(1) = ∞. Let Δconf_{i,g} = conf(p^T_{i,g}) - conf(p^0_{i,g}). (ε, T)-privacy ("relative privacy"): for all distributions D_1, ..., D_n, every row i, every function g, and any adversary making at most T queries, Pr[Δconf_{i,g} > ε] = neg(n).
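For reference, a tiny sketch of the confidence function; the base-2 logarithm is an assumption (it matches conf(2/3) = 1):

```python
import math

def conf(p):
    """Log-odds, base-2 log assumed: conf(1/2) = 0, conf(2/3) = 1, conf(1) = inf."""
    if p <= 0:
        return -math.inf
    if p >= 1:
        return math.inf
    return math.log2(p / (1 - p))
```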

23. The SuLQ* Database. The adversary is restricted to T << n queries. On query (q, f), with q ⊆ [n] and f : {0,1}^k → {0,1} a binary function: let a_{q,f} = Σ_{i∈q} f(d_{i,1}, ..., d_{i,k}); let N ~ Binomial(0, √T), i.e., zero-mean noise of magnitude on the order of √T; return a_{q,f} + N. (*SuLQ: Sub-Linear Queries.)
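A minimal sketch of the SuLQ answering rule. The exact noise distribution below (a centered binomial with standard deviation on the order of √T) is an interpretation of the slide's Binomial(0, √T) shorthand, not a statement of the original mechanism's parameters:

```python
import random

def sulq_answer(q, f, rows, T):
    """rows: list of k-bit tuples d_i; f: {0,1}^k -> {0,1}; q: set of row indices;
    T: bound on the number of queries, used here as the assumed noise parameter."""
    a_qf = sum(f(rows[i]) for i in q)
    trials = max(1, int(T))
    noise = sum(random.randint(0, 1) for _ in range(trials)) - trials / 2
    return a_qf + noise                          # zero-mean noise, std ~ sqrt(T)/2
```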

24. Privacy Analysis of the SuLQ Database. p^m_{i,g} is the a-posteriori probability that g(d_{i,1}, ..., d_{i,k}) = 1, given d_{-i} and the answers to the first m queries. conf(p^m_{i,g}) describes a random walk on the line with starting point conf(p^0_{i,g}); compromise occurs when conf(p^m_{i,g}) - conf(p^0_{i,g}) > ε, i.e., when the walk moves from conf(p^0_{i,g}) to conf(p^0_{i,g}) + ε. W.h.p. more than T steps are needed to reach compromise.

25. Usability: One Multi-Attribute SuLQ DB. Statistics of any property f of the k attributes, i.e., for what fraction of the (sub)population does f(d_1, ..., d_k) hold? Easy: just put f in the query. Other applications: k independent multi-attribute SuLQ DBs; vertically partitioned SuLQ DBs; testing whether Pr[α | β] ≥ Pr[α] + Δ. Caveat: we hide g() about a specific row, not about multiple rows.

26. Overview of Methods. Input perturbation: the SDB is perturbed into SDB' (data perturbation), and the user's queries are answered from SDB'. Output perturbation: the user poses a (restricted) query and receives a perturbed response. Query restriction: the user poses a (restricted) query and receives either an exact response or a denial.

27. Query Restriction. The decision whether to answer or deny a query can be based on the content of the query and on the answers to previous queries, or it can additionally be based on the content of the database. The user poses a (restricted) query and receives either an exact response or a denial.

28. Auditing. [AW89] classify auditing as a query restriction method: "Auditing of an SDB involves keeping up-to-date logs of all queries made by each user (not the data involved) and constantly checking for possible compromise whenever a new query is issued." Partial motivation: it may allow more queries to be posed if no privacy threat occurs. Early work: Hofmann 1977; Schlorer 1976; Chin, Ozsoyoglu 1981, 1986. Recent interest: Kleinberg, Papadimitriou, Raghavan 2000; Li, Wang, Wang, Jajodia 2002; Jonsson, Krokhin 2003.

29. How Auditors May Inadvertently Compromise Privacy.

30. The Setting. Dataset: d = {d_1, ..., d_n}, with entries d_i real, integer, or Boolean. Query: q = (f, i_1, ..., i_k), where f is Min, Max, Median, Sum, Average, Count, etc.; the statistical database returns f(d_{i_1}, ..., d_{i_k}). Bad users will try to breach the privacy of individuals. Compromise = uniquely determining some d_i (a very weak definition).

31. Auditing. The auditor keeps a query log q_1, ..., q_i. On a new query q_{i+1}, it either returns the answer or denies the query (because the answer would cause privacy loss).

32. Example 1: Sum/Max Auditing. The d_i are real, queries are sum/max, and privacy is breached if some d_i is learned. Attacker: q1 = sum(d1,d2,d3); answer: sum(d1,d2,d3) = 15. Attacker: q2 = max(d1,d2,d3); denied (the answer would cause privacy loss). But there must be a reason for the denial: q2 is denied iff d1 = d2 = d3 = 5. The attacker wins.
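The attacker's inference can be written out explicitly; a hypothetical sketch assuming the values from the example (sum = 15, real entries):

```python
def infer_from_denial(sum_value, denied):
    """The max query is denied exactly when answering it would pin down some d_i,
    i.e., when d1 = d2 = d3; combined with the sum, a denial reveals all three."""
    if denied:
        return [sum_value / 3] * 3   # -> [5.0, 5.0, 5.0] for sum_value = 15
    return None                      # answered: no single value is determined here

print(infer_from_denial(15, denied=True))
```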

33. Sounds Familiar? "Mr. Chairman, I would like to answer the committee's questions, but on the advice of my counsel I respectfully decline to answer the question based on the protection afforded me under the Constitution of the United States." (David Duncan, former auditor for Enron and partner in Andersen.)

34. Max Auditing. The d_i are real. q1 = max(d1,d2,d3,d4) is answered with M_1234. q2 = max(d1,d2,d3) is either answered with M_123 or denied; if denied, then d4 = M_1234. q3 = max(d1,d2) is either answered with M_12 or denied; if denied, then d3 = M_123. Repeating this over d1, d2, ..., d_{n-1}, d_n, the attacker learns an item with probability 1/2.

35. Boolean Auditing? The d_i are Boolean. Queries: q1 = sum(d1,d2), q2 = sum(d2,d3), and so on. Each q_i is either answered with 1 or denied, and q_i is denied iff d_i = d_{i+1} (an answer of 0 or 2 would reveal both bits). The denial pattern therefore reveals the database up to complementation.
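A sketch of this attack, assuming an auditor that denies exactly when the two queried bits are equal (since an answer of 0 or 2 would reveal them):

```python
def attack(denied):
    """denied[i] is True iff sum(d_i, d_{i+1}) was denied, i.e., iff d_i = d_{i+1}."""
    guess = [0]                               # assume d_0 = 0; otherwise take the complement
    for eq in denied:
        guess.append(guess[-1] if eq else 1 - guess[-1])
    return guess

d = [0, 0, 1, 1, 0]
denied = [d[i] == d[i + 1] for i in range(len(d) - 1)]   # what the attacker observes
print(attack(denied))                                     # [0, 0, 1, 1, 0] or its complement
```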

36. The Problem. Query denials leak (potentially sensitive) information, and users cannot predict denials by themselves. (Diagram: within the set of possible assignments to {d_1, ..., d_n}, denying q_{i+1} further narrows the assignments consistent with (q_1, ..., q_i, a_1, ..., a_i).)

37. Solution to the Problem: Simulatable Auditing. An auditor is simulatable if there exists a simulator that, given only the past queries q_1, ..., q_i, their answers a_1, ..., a_i, and the new query q_{i+1}, but no access to the statistical database, produces the same deny/answer decision as the auditor. Simulation implies that denials do not leak information.

38. Why do Simulatable Auditors not Leak Information? (Diagram: the deny/allow decision for q_{i+1} depends only on (q_1, ..., q_i, a_1, ..., a_i), so it does not shrink the set of assignments to {d_1, ..., d_n} consistent with those answers.)

39. Simulatable Auditing.

40. Query Restriction for Sum Queries. Given a dataset D = {x_1, ..., x_n} with real x_i, a query specifies a subset S and asks for Σ_{x_i∈S} x_i. Is it possible to compromise D? Here compromise means uniquely determining some x_i from the queries. Compromise is trivial if query sets may be arbitrarily small: sum(x_9) = x_9.

41. Query Set Size Control. Do not permit queries that involve a small subset of the database. Compromise is still possible: to discover x, ask sum(x, y_1, ..., y_k) - sum(y_1, ..., y_k) = x (see the sketch below). The issue is overlap, but in general restricting overlap is not enough; the number of queries must also be restricted. Note that overlap itself sometimes restricts the number of queries (e.g., if the query size is cn and the overlap is constant, only about 1/c queries are possible).
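A sketch of this "tracker" attack on query-set-size control; the data values and padding set below are arbitrary illustrations:

```python
def sum_query(indices, data):
    """Subset-sum query over the given indices."""
    return sum(data[i] for i in indices)

data = [3, 8, 1, 5, 9, 2, 7, 4]
padding = [1, 2, 3, 4, 5, 6, 7]               # y_1..y_k: any large-enough allowed set
target = 0                                     # we want x = data[0]
x = sum_query([target] + padding, data) - sum_query(padding, data)
print(x)                                       # 3 = data[0]
```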

42. Restricting Set-Sum Queries. Restrict the sum queries based on: the number of database elements in the sum; the overlap with previous sum queries; the total number of queries. Note that these criteria are known to the user and do not depend on the contents of the database. Therefore the user can simulate the deny/no-deny decision of the DB, i.e., this is simulatable auditing (a sketch of such a rule follows).
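A minimal sketch of such a content-independent restriction rule; the thresholds k, r, and t_max are illustrative parameters, not values from the slides. Because the rule never looks at the data, a user can run the same function to predict denials:

```python
def make_auditor(k, r, t_max):
    """Deny a sum query unless it has >= k elements, overlaps every previously
    answered query in <= r elements, and fewer than t_max queries were answered."""
    answered = []
    def allow(q):
        q = set(q)
        ok = (len(q) >= k
              and all(len(q & prev) <= r for prev in answered)
              and len(answered) < t_max)
        if ok:
            answered.append(q)
        return ok                 # decision uses only the queries, never the data
    return allow

allow = make_auditor(k=3, r=1, t_max=5)
print(allow({0, 1, 2}))           # True
print(allow({0, 1, 3}))           # False: overlaps the first query in 2 > r elements
```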

43. Restricting Overlap and Number of Queries. Assume: |Q_i| ≥ k for every query; |Q_i ∩ Q_j| ≤ r for every pair of queries; the adversary knows a-priori at most L values, with L + 1 < k. Claim: the data cannot be compromised with fewer than 1 + (2k - L)/r sum queries.

44. Overlap + Number of Queries. Claim [Dobkin, Jones, Lipton] [Reiss]: the data cannot be compromised with fewer than 1 + (2k - L)/r sum queries (k = minimum query size, r = maximum overlap, L = number of a-priori known items). Suppose x_c is compromised after t queries, each of the form Q_i = x_{i1} + x_{i2} + ... + x_{ik} for i = 1, ..., t. Then x_c = Σ_{i=1..t} α_i Q_i = Σ_{i=1..t} α_i Σ_{j=1..k} x_{ij} for some coefficients α_i. Let η_{iℓ} = 1 if x_ℓ appears in query i and 0 otherwise; then x_c = Σ_{i=1..t} α_i Σ_{ℓ=1..n} η_{iℓ} x_ℓ = Σ_{ℓ=1..n} (Σ_{i=1..t} α_i η_{iℓ}) x_ℓ.

45. Overlap + Number of Queries (cont.). We have x_c = Σ_{ℓ=1..n} (Σ_{i=1..t} α_i η_{iℓ}) x_ℓ. In this sum, the coefficient Σ_{i=1..t} α_i η_{iℓ} must be 0 for every x_ℓ except x_c (in order for x_c to be compromised). This happens iff η_{iℓ} = 0 for all i, or if η_{iℓ} = η_{jℓ} = 1 and α_i, α_j have opposite signs (or α_i = 0, in which case the i-th query did not matter).

46. Overlap + Number of Queries (cont.). W.l.o.g. the first query contains x_c and the second query has a coefficient of the opposite sign. The first query probes k elements; the second query adds at least k - r new elements. Elements from the first and second queries cannot be canceled within the same additional query (opposite signs would be required). Therefore each new query cancels items from the first query or from the second, but not from both. In total 2k - r - L elements must be canceled, so at least 2 + (2k - r - L)/r queries are needed, i.e., 1 + (2k - L)/r.

47. Notes. The number of queries satisfying |Q_i| ≥ k and |Q_i ∩ Q_j| ≤ r is small. If k = n/c for some constant c and r is a constant, then only about c nearly disjoint queries are possible, so the permitted query sequence may be uncomfortably short. Alternatively, if r = k/c (the overlap is a constant fraction of the query size), then the bound on the number of queries, 1 + (2k - L)/r, is O(c).

48. Conclusions. Privacy should be defined and analyzed rigorously; in particular, assuming that randomization implies privacy is dangerous. High perturbation is needed for privacy against polynomial adversaries: there is a threshold phenomenon (above √n, total privacy; below √n, no privacy for a poly-time adversary), and the main tool is a reconstruction algorithm. Careless auditing might leak private information. Self-auditing (simulatable auditors) is safe: the decision whether to allow a query is based on previous 'good' queries and their answers, without access to the DB contents, so users may apply the decision procedure by themselves.

49. To Do. Come up with a good model and requirements for database privacy: learn from crypto, and protect against more general loss of privacy. Simulatable auditors are a starting point for designing more reasonable audit mechanisms.

50. References. Course web pages: A Study of Perturbation Techniques for Data Privacy, Cynthia Dwork, Nina Mishra and Kobbi Nissim, http://theory.stanford.edu/~nmishra/cs369-2004.html; Privacy and Databases, http://theory.stanford.edu/~rajeev/privacy.html.

51. Foundations of CS at the Weizmann Institute. Faculty: Uri Feige, Oded Goldreich, Shafi Goldwasser, David Harel, Moni Naor, David Peleg, Amir Pnueli, Ran Raz, Omer Reingold, Adi Shamir. All students receive a fellowship; the language of instruction is English. (On the slide, names highlighted in yellow work in crypto.)

