Foundations of Privacy Lecture 1 Lecturer: Moni Naor.

What is Privacy?
Extremely overloaded term. Hard to define.
“Privacy is a value so complex, so entangled in competing and contradictory dimensions, so engorged with various and distinct meanings, that I sometimes despair whether it can be usefully addressed at all.” Robert C. Post, Three Concepts of Privacy, 89 Geo. L.J. 2087 (2001).
Privacy is like oxygen: you only feel it when it is gone.

What is Privacy?
Extremely overloaded term.
“the right to be let alone” - Samuel D. Warren and Louis D. Brandeis, The Right to Privacy, Harv. L. Rev. (1890)
“our concern over our accessibility to others: the extent to which we are known to others, the extent to which others have physical access to us, and the extent to which we are the subject of others’ attention.” - Ruth Gavison, “Privacy and the Limits of Law,” Yale Law Journal (1980)

What is Privacy?
Extremely overloaded term. Settings where it comes up:
Photojournalism (Louis Brandeis and Samuel Warren, The Right to Privacy, Harvard Law Rev. 1890)
Census data: mandatory participation; must not reveal individual data
Huge databases collected by companies
–Data deluge
–Example: “Ravkav”
Public Surveillance Information
–Cameras
–RFIDs
Social Networks

Official Description
The availability of fast and cheap computers coupled with massive storage devices has enabled the collection and mining of data on a scale previously unimaginable. This opens the door to potential abuse regarding individuals' information. There has been considerable research exploring the tension between utility and privacy in this context. The goal is to explore techniques and issues related to data privacy. In particular:
–Definitions of data privacy
–Techniques for achieving privacy
–Limitations on privacy in various settings
–Privacy issues in specific settings

Planned Topics
Privacy of Data Analysis
Differential Privacy
–Definition and Properties
–Statistical databases
–Dynamic data
Privacy of learning algorithms
Privacy of genomic data
Interaction with cryptography
SFE
Voting
Entropic Security
Data Structures
Everlasting Security
Privacy Enhancing Tech.
–Mix nets

Course Information
Foundations of Privacy - Spring 2010
Instructor: Moni Naor
When: Mondays, 11:00--13:00 (2 points)
Where: Ziskind 1
Course web page: www.wisdom.weizmann.ac.il/~naor/COURSE/foundations_of_privacy.html
Prerequisites: familiarity with algorithms, data structures, probability theory, and linear algebra, at an undergraduate level; a basic course in computability is assumed.
Requirements:
–Participation in discussion in class. Best: read the papers ahead of time
–Homework: there will be several homework assignments. Homework assignments should be turned in on time (usually two weeks after they are given)!
–Class project and presentation
–Exam: none planned
Office: Ziskind 248 Phone: 3701 E-mail: moni.naor@

Projects
Report on a paper
Apply a notion studied to some known domain
Check the state of privacy in some setting

Cryptography and Privacy
Extremely relevant, but does not solve the privacy problem
Secure Function Evaluation (SFE)
How to distributively compute a function $f(X_1, X_2, \ldots, X_n)$
–where $X_j$ is known to party $j$. E.g., $\sigma = \mathrm{sum}(a, b, c, \ldots)$
–Parties should only learn the final output ($\sigma$)
Many results depending on
–Number of players
–Means of communication
–The power and model of the adversary
–How the function is represented
More worried about what to compute than how to compute

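Example: Securely Computing Sums
Party $i$ holds $X_i$ with $0 \le X_i \le P-1$. Want to compute $\sum_i X_i \bmod P$
–Party 1 selects $r \in_R [0..P-1]$. Sends $Y_1 = X_1 + r \bmod P$
–Party $i$ receives $Y_{i-1}$ and sends $Y_i = Y_{i-1} + X_i \bmod P$
–Party 1 receives $Y_n$ and announces $\sigma = \sum_i X_i = Y_n - r \bmod P$
[Figure: five parties holding X_1, …, X_5 pass the messages Y_1, …, Y_5 from one to the next]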
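The protocol can be simulated in a few lines. This is only a sketch run inside one process: the function name `secure_sum`, the sample inputs and the modulus 101 are illustrative assumptions, and no real network or party separation is modelled.

```python
import secrets

def secure_sum(inputs, P):
    """Simulate the ring protocol; return (announced sum, messages Y_1..Y_n)."""
    if any(not 0 <= x < P for x in inputs):
        raise ValueError("each input must lie in [0, P-1]")
    r = secrets.randbelow(P)        # party 1's uniformly random mask
    y = (inputs[0] + r) % P         # Y_1 = X_1 + r mod P
    messages = [y]
    for x in inputs[1:]:            # party i adds X_i to Y_{i-1}
        y = (y + x) % P
        messages.append(y)
    return (y - r) % P, messages    # party 1 removes the mask

inputs = [3, 14, 15, 9, 2]          # hypothetical private inputs
total, messages = secure_sum(inputs, P=101)
print(total)
```

Each single message Y_i is uniformly distributed because of the mask r, but two consecutive messages are not independent: their difference mod P is exactly the input of the party in between.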

Is this Protocol Secure?
To talk rigorously about cryptographic security:
Specify the power of the adversary
–Access to the data/system
–Computational power? (Can be all-powerful here)
–“Auxiliary” information?
Define a break of the system
–What is compromise
–What is a “win” for the adversary?
If the adversary controls two players, the protocol is insecure: parties $i-1$ and $i+1$ together see $Y_{i-1}$ and $Y_i$, whose difference mod $P$ is $X_i$

The Simulation Paradigm
A protocol is considered secure if: for every adversary (of a certain type) there exists a simulator that outputs an indistinguishable “transcript”.
Examples:
–Encryption
–Zero-knowledge
–Secure function evaluation
Power of analogy

SFE: Simulating the ideal model
A protocol is considered secure if: for every adversary there exists a simulator operating in the “ideal” (trusted party) model that outputs an indistinguishable transcript.
Major result: “Any function f that can be evaluated using polynomial resources can be securely evaluated using polynomial resources”
Breaking = distinguishing!

The Problem with SFE
SFE does not imply privacy. The problem is with the ideal model
–E.g., $\sigma = \mathrm{sum}(a, b)$
–Each player learns only what can be deduced from $\sigma$ and her own input to $f$
–If $\sigma$ and $a$ yield $b$, so be it.
Need ways of talking about leakage even in the ideal model

Statistical Data Analysis
Huge social benefits from analyzing large collections of data:
–Finding correlations. E.g. medical: genotype/phenotype correlations
–Providing better services. Improve web search results, fit ads to queries
–Publishing official statistics. Census, contingency tables
–Data mining. Clustering, learning association rules, decision trees, separators, principal component analysis
However: data contains confidential information. WHAT ABOUT PRIVACY?
[Figure: Better Privacy versus Better Data]

Example of Utility
John Snow’s map: cholera cases in London, 1854 epidemic
[Figure: map marking the cholera cases and the suspected pump]

Modern Privacy of Data Analysis
Is public analysis of private data a meaningful/achievable goal?
The holy grail: get the utility of statistical analysis while protecting the privacy of every individual participant
Ideally: “privacy-preserving” sanitization allows reasonably accurate answers to meaningful questions

Sanitization: Traditional View
A trusted curator can access a DB of sensitive information and should publish a privacy-preserving sanitized version
[Figure: Data → Curator/Sanitizer → Output → A]

Traditional View: Interactive Model
Multiple queries, chosen adaptively
[Figure: query 1, query 2, … are sent to the Sanitizer, which holds the Data and answers each one]

Sanitization: Traditional View
How to sanitize? Anonymization?
[Figure: Data → Curator/Sanitizer → Output → A]

Auxiliary Information
Information from any source other than the statistical database
–Other databases, including old releases of this one
–Newspapers
–General comments from insiders
–Government reports, census website
–Inside information from a different organization. E.g., Google’s view, if the attacker/user is a Google employee
Linkage Attacks: Malicious Use of Aux Info

The Netflix Prize
Netflix recommends movies to its subscribers
–Seeks improved recommendation system
–Offered $1,000,000 for 10% improvement (not concerned here with how this is measured)
–Published training data
Prize won in September 2009 by the “BellKor's Pragmatic Chaos” team

From the Netflix Prize Rules Page… “The training data set consists of more than 100 million ratings from over 480 thousand randomly-chosen, anonymous customers on nearly 18 thousand movie titles.” “The ratings are on a scale from 1 to 5 (integral) stars. To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids have been replaced by randomly-assigned ids. The date of each rating and the title and year of release for each movie are provided.”

Netflix Data Release [Narayanan-Shmatikov 2008]
Ratings for subset of movies and users
Usernames replaced with random IDs
Some additional perturbation
[Figure: ratings matrix with rows User 1, User 2, …, User N and columns Item 1, Item 2, …, Item M]
Credit: Arvind Narayanan via Adam Smith

A Source of Auxiliary Information Internet Movie Database (IMDb) –Individuals may register for an account and rate movies – Need not be anonymous Probably want to create some web presence –Visible material includes ratings, dates, comments

Use Public Reviews from IMDb.com
Anonymized Netflix data combined with public, incomplete IMDb data = identified Netflix data
[Figure: the rows of the anonymized table are labelled with the IMDb users Alice, Bob, Charlie, Danielle, Erica, Frank]
Credit: Arvind Narayanan via Adam Smith

De-anonymizing the Netflix Dataset
Results
“With 8 movie ratings [of which 2 may be completely wrong] and dates that may have a 3-day error, 96% of Netflix subscribers whose records have been released can be uniquely identified in the dataset.”
“For 89%, 2 ratings and dates are enough to reduce the set of plausible records to 8 out of almost 500,000, which can then be inspected by a human for further deanonymization.”
Consequences?
–Learn about movies that IMDb users didn’t want to tell the world about: sexual orientation, religious beliefs
–Subject of current lawsuits (Video Privacy Protection Act 1988; settled, March 2010)
Credit: Arvind Narayanan via Adam Smith
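A toy version of such a linkage attack can be sketched as follows. This is not the scoring algorithm of Narayanan and Shmatikov, only a simplified matcher; the record ids, movie titles, dates and the `min_hits` threshold are all hypothetical.

```python
from datetime import date

def agreeing_ratings(aux, record, max_days=3):
    """Count aux ratings that match a released record on movie and stars,
    with the rating dates at most max_days apart."""
    hits = 0
    for movie, (stars, when) in aux.items():
        if movie in record:
            rec_stars, rec_when = record[movie]
            if rec_stars == stars and abs((rec_when - when).days) <= max_days:
                hits += 1
    return hits

def link(aux, released, min_hits):
    """Return the pseudonymous ids whose records agree with aux often enough."""
    return [rid for rid, record in released.items()
            if agreeing_ratings(aux, record) >= min_hits]

# Hypothetical "anonymized" release: random id -> {movie: (stars, date)}
released = {
    "id_17": {"Movie A": (5, date(2005, 3, 1)),
              "Movie B": (2, date(2005, 3, 9)),
              "Movie C": (4, date(2005, 4, 2))},
    "id_42": {"Movie A": (5, date(2005, 6, 20)),
              "Movie D": (3, date(2005, 7, 1))},
}
# Hypothetical public reviews by one named user; the Movie C rating is "wrong"
public_reviews = {"Movie A": (5, date(2005, 3, 2)),
                  "Movie B": (2, date(2005, 3, 10)),
                  "Movie C": (1, date(2005, 4, 2))}

candidates = link(public_reviews, released, min_hits=2)
print(candidates)
```

Tolerating a few disagreeing ratings and a small date error is what makes the match robust to noisy auxiliary information.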

AOL Search History Release (2006)
650,000 users, 20 million queries, 3 months
AOL’s goal:
–Provide real query logs from real users
Privacy?
–“Identifying information” replaced with random identifiers
–But: different searches by the same user still linked

AOL Search History Release (2006)
Name: Thelma Arnold. Age: 62. Widow. Residence: Lilburn, GA

Other Successful Attacks
Against anonymized HMO records [Sweeney 98]
–Proposed K-anonymity
Against K-anonymity [MGK06]
–Proposed L-diversity
Against L-diversity [XT07]
–Proposed M-Invariance
Against all of the above [GKS08]

“Composition” Attacks [Ganta-Kasiviswanathan-Smith, KDD 2008]
Example: two hospitals serve overlapping populations
–What if they independently release “anonymized” statistics?
Composition attack: combine independent releases
[Figure: individuals give sensitive information to the curators Hospital A and Hospital B; the attacker receives stats A and stats B]

“Composition” Attacks [Ganta-Kasiviswanathan-Smith, KDD 2008]
Example: two hospitals serve overlapping populations
–What if they independently release “anonymized” statistics?
Composition attack: combine independent releases
–stats B: “Adam has either diabetes or high blood pressure”
–stats A: “Adam has either diabetes or emphysema”
[Figure: individuals give sensitive information to the curators Hospital A and Hospital B; the attacker receives stats A and stats B]
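In this example the composition step is just a set intersection. A minimal sketch, assuming each release can be summarized as a map from a person to the set of conditions still consistent with it (the dictionary layout and the name `compose` are illustrative):

```python
def compose(release_a, release_b):
    """Intersect the candidate sets of everyone who appears in both releases."""
    shared = release_a.keys() & release_b.keys()
    return {name: release_a[name] & release_b[name] for name in shared}

stats_a = {"Adam": {"diabetes", "emphysema"}}
stats_b = {"Adam": {"diabetes", "high blood pressure"}}

combined = compose(stats_a, stats_b)
print(combined)
```

Each release alone leaves two candidates for Adam; together they leave one.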

“Composition” Attacks [Ganta-Kasiviswanathan-Smith, KDD 2008]
“IPUMS” census data set: 70,000 people, randomly split into 2 pieces with overlap 5,000.
With a popular technique (k-anonymity, k=30) applied to each database, can learn the “sensitive” variable for 40% of individuals

Analysis of Social Network Graphs “Friendship” Graph –Nodes correspond to users –Users may list others as “friend,” creating an edge Edges are annotated with directional information Hypothetical Research Question –How frequently is the “friend” designation reciprocated?

Attack
Sanitization: replace node names/labels with random identifiers
–Permits analysis of the structure of the graph
Privacy hope: randomized identifiers make it hard/impossible to identify nodes with specific individuals,
–thereby hiding who is connected to whom
Disastrous! [Backstrom-Dwork-Kleinberg 07]
–Vulnerable to active and passive attacks

Flavor of Active Attack
Connections:
–Targets: “Steve” and “Jerry”
–Attack contacts: A and B
–Finding A and B allows finding Steve and Jerry
[Figure: graph with nodes S, J, A, B]

Flavor of Active Attack
Magic step
–Isolate lightly linked-in subgraphs from the rest of the graph
–Special structure of the subgraph permits finding A, B
[Figure: graph with nodes S, J, A, B]

Why Settle for Ad Hoc Notions of Privacy?
Dalenius, 1977: anything that can be learned about a respondent from the statistical database can be learned without access to the database
–Captures possibility that “I” may be an extrovert
–The database doesn’t leak personal information
–Adversary is a user
Analogous to semantic security for crypto [Goldwasser-Micali 1982]
–Anything that can be learned from the ciphertext can be learned without the ciphertext
–Adversary is an eavesdropper

Computational Security of Encryption: Semantic Security
Whatever adversary A can compute on an encrypted string $X \in \{0,1\}^n$, so can A’ that does not see the encryption of X, yet simulates A’s knowledge with respect to X
A selects:
–Distribution $D_n$ on $\{0,1\}^n$
–Relation $R(X,Y)$, computable in probabilistic polynomial time
For every pptm A there is a pptm A’ so that for all pptm relations R, for $X \in_R D_n$:
$|\Pr[R(X, A(E(X)))] - \Pr[R(X, A'(\cdot))]|$ is negligible
Outputs of A and A’ are indistinguishable even for a tester who knows X

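[Figure: X is drawn from $D_n$ ($X \in_R D_n$). On the left, A chooses $D_n$, receives E(X) and outputs Y. On the right, A’ chooses $D_n$, receives no ciphertext and outputs Y. In both cases R(X,Y) is evaluated, and the two success probabilities are approximately equal ($\approx$)]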

Making it Slightly less Vague
Cryptographic Rigor Applied to Privacy
Define a break of the system
–What is compromise
–What is a “win” for the adversary?
Specify the power of the adversary
–Access to the data
–Computational power?
–“Auxiliary” information?
Conservative/paranoid by nature
–Protect against all feasible attacks

In full generality: Dalenius’ Goal Impossible
–Database teaches that smoking causes cancer
–I smoke in public
–Access to DB teaches that I am at increased risk for cancer
But what about cases where there is significant knowledge about the database distribution?

Outline The Framework A General Impossibility Result –Dalenius’ goal cannot be achieved in a very general sense The Proof –Simplified –General case

Two Models
Non-interactive: data are sanitized and released
[Figure: Database → San → Sanitized Database, which the user (“?”) receives]

Two Models
Interactive: multiple queries, adaptively chosen
[Figure: the user (“?”) sends queries to San, which holds the Database]

Auxiliary Information
Common theme in many privacy horror stories: not taking side information into account
–Netflix challenge: not taking IMDb into account [Narayanan-Shmatikov]. The database: the Netflix ratings; SAN(DB) = remove names; the auxiliary information: IMDb

Not learning from DB
There is some utility of DB that legitimate users should learn
Possible breach of privacy
Goal: users learn the utility without the breach
[Figure: with access to the database, A interacts with San(DB) and receives auxiliary information; without access to the database, A’ receives only the auxiliary information]

Not learning from DB
Want: anything that can be learned about an individual from the database can be learned without access to the database
$\forall \mathcal{D}\ \forall A\ \exists A'$ such that, w.h.p. over $DB \in_R \mathcal{D}$, $\forall$ auxiliary information $z$:
$|\Pr[A(z) \leftrightarrow DB \text{ wins}] - \Pr[A'(z) \text{ wins}]|$ is small
[Figure: with access to the database, A interacts with San(DB) and receives auxiliary information; without access to the database, A’ receives only the auxiliary information]

Illustrative Example for Difficulty
Want: anything that can be learned about a respondent from the database can be learned without access to the database
More formally: $\forall \mathcal{D}\ \forall A\ \exists A'$ such that, w.h.p. over $DB \in_R \mathcal{D}$, $\forall$ auxiliary information $z$:
$|\Pr[A(z) \leftrightarrow DB \text{ wins}] - \Pr[A'(z) \text{ wins}]|$ is small
Example: suppose the height of an individual is sensitive information
–Average height in DB not known a priori
–Aux z = “Adam is 5 cm shorter than average in DB”
–A learns the average height in DB, hence also Adam’s height
–A’ does not
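The height example can be played out numerically. The names and heights below are made up, and the "sanitized" release is assumed to be the exact average, which is the kind of utility a user is supposed to learn.

```python
from statistics import mean

db = {"Adam": 170, "Beth": 180, "Carl": 175}   # hypothetical heights in cm

def aux_gen(db, name):
    """Auxiliary info z: how far the person is from the DB average."""
    return db[name] - mean(db.values())

def adversary(z, released_average):
    """A combines z with the released average to recover the height."""
    return released_average + z

z = aux_gen(db, "Adam")                  # "Adam is 5 cm shorter than average"
released_average = mean(db.values())     # what the sanitizer publishes
recovered = adversary(z, released_average)
print(z, recovered)
```

The simulator A’ sees only z. Since the average is not known a priori, z alone is consistent with every possible height for Adam.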

Defining “Win”: The Compromise Function
Notion of privacy compromise
[Figure: DB is drawn from $\mathcal{D}$; Adv outputs y; the “Compromise?” box takes DB and y and outputs 0/1 (privacy breach or not)]
Privacy compromise should be non-trivial:
–Should not be possible to find a privacy breach from auxiliary information alone
Privacy breach should exist:
–Given DB there should be a y that is a privacy breach
–Should be possible to find y efficiently

Basic Concepts
Distribution $\mathcal{D}$ on (finite) databases
–Something about the database must be unknown
–Captures knowledge about the domain. E.g., rows of database correspond to owners of 2 pets
Privacy mechanism San($\mathcal{D}$, DB)
–Can be interactive or non-interactive
–May have access to the distribution $\mathcal{D}$
Auxiliary information generator AuxGen($\mathcal{D}$, DB)
–Has access to the distribution and to DB
–Formalizes partial knowledge about DB
Utility vector w
–Answers to k questions about the DB
–(Most of) the utility vector can be learned by the user
–Utility: must inherit sufficient min-entropy from the source $\mathcal{D}$

Impossibility Theorem: Informal
For any* distribution $\mathcal{D}$ on databases DB
For any* reasonable privacy compromise decider C
Fix any useful* privacy mechanism San (useful: tells us information we did not know)
Then there is an auxiliary info generator AuxGen and an adversary A such that, for all adversary simulators A’, with z = AuxGen(DB):
$[A(z) \leftrightarrow \mathrm{San}(DB)]$ wins (finds a compromise), but $[A'(z)]$ does not win

Impossibility Theorem
Fix any useful* privacy mechanism San and any reasonable privacy compromise decider C. Then there is an auxiliary info generator AuxGen and an adversary A such that for “all” distributions $\mathcal{D}$ and all adversary simulators A’:
$\Pr[A(\mathcal{D}, \mathrm{San}(\mathcal{D}, DB), \mathrm{AuxGen}(\mathcal{D}, DB)) \text{ wins}] - \Pr[A'(\mathcal{D}, \mathrm{AuxGen}(\mathcal{D}, DB)) \text{ wins}] \ge \Delta$
for suitable, large, $\Delta$
The probability spaces are over the choice of $DB \in_R \mathcal{D}$ and the coin flips of San, AuxGen, A, and A’
To completely specify: need an assumption on the entropy of the utility vector W and on how well SAN(W) behaves

Strategy
The auxiliary info generator will provide a hint that, together with the utility vector w, will yield the privacy breach.
Want AuxGen to work without knowing $\mathcal{D}$, just DB
–Find privacy breach y and encode it in z
–Make sure z alone does not give y; only together with w
Complication: is the utility vector w
–Completely learned by the user?
–Or just an approximation?

Entropy of Random Sources
Source:
–Probability distribution X on $\{0,1\}^n$
–Contains some “randomness”
Measure of “randomness”
–Shannon entropy: $H(X) = -\sum_{x \in \Gamma} P_X(x) \log P_X(x)$
Represents how much we can compress X on the average
But even a high entropy source may have a point with prob 0.9
–Min-entropy: $H_{\min}(X) = -\log \max_{x \in \Gamma} P_X(x)$
Determined by the probability of the most likely value of X
Definition: X is a k-source if $H_\infty(X) \ge k$ (where $H_\infty$ is another notation for $H_{\min}$), i.e. $\Pr[X = x] \le 2^{-k}$ for all $x \in \{0,1\}^n$
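The gap between the two measures can be checked numerically. In this sketch the source puts probability 0.9 on one string of length n = 16 and spreads the remaining 0.1 uniformly over all strings; the parameters and function names are illustrative choices.

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum p log2 p, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def min_entropy(probs):
    """H_min(X) = -log2 of the largest point probability, in bits."""
    return -math.log2(max(probs))

n = 16
size = 2 ** n
# One heavy point (extra mass 0.9), the remaining 0.1 spread uniformly
skewed = [0.9 + 0.1 / size] + [0.1 / size] * (size - 1)
uniform = [1 / size] * size

print(shannon_entropy(skewed), min_entropy(skewed))
print(shannon_entropy(uniform), min_entropy(uniform))
```

For the skewed source the Shannon entropy grows with n (roughly 0.1·n bits plus a constant), while the min-entropy stays below one bit. For the uniform source both equal n.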

Min-entropy
Definition: X is a k-source if $H_\infty(X) \ge k$, i.e. $\Pr[X = x] \le 2^{-k}$ for all x
Examples:
–Bit-fixing: some k coordinates of X uniform, rest fixed or even depend arbitrarily on others.
–Unpredictable source: $\forall i \in [n]$, $b_1, \ldots, b_{i-1} \in \{0,1\}$: $k/n \le \Pr[X_i = 1 \mid X_1, X_2, \ldots, X_{i-1} = b_1, \ldots, b_{i-1}] \le 1 - k/n$
–Flat k-source: uniform over $S \subseteq \{0,1\}^n$, $|S| = 2^k$
Fact: every k-source is a convex combination of flat ones.

Min-Entropy and Statistical Distance
For a probability distribution X over $\{0,1\}^n$: $H_\infty(X) = -\log \max_x \Pr[X = x]$
X is a k-source if $H_\infty(X) \ge k$
Determined by the probability of the most likely value of X
Statistical distance: $\Delta(X,Y) = \sum_a |\Pr[X=a] - \Pr[Y=a]|$
Want: X close to the uniform distribution in statistical distance

Extractors
Universal procedure for “purifying” an imperfect source
Definition: $\mathrm{Ext}: \{0,1\}^n \times \{0,1\}^d \to \{0,1\}^\ell$ is a $(k, \epsilon)$-extractor if for any k-source X:
$\Delta(\mathrm{Ext}(X, U_d), U_\ell) \le \epsilon$
[Figure: EXT takes x drawn from a k-source of length n (at least $2^k$ strings within $\{0,1\}^n$) and a “seed” s of d random bits, and outputs $\ell$ almost-uniform bits]

Strong extractors
Output looks random even after seeing the seed.
Definition: Ext is a $(k, \epsilon)$ strong extractor if $\mathrm{Ext}'(x,s) = s \circ \mathrm{Ext}(x,s)$ is a $(k, \epsilon)$-extractor
i.e. $\forall$ k-sources X, for a $1-\epsilon'$ fraction of $s \in \{0,1\}^d$, $\mathrm{Ext}(X,s)$ is $\epsilon$-close to $U_\ell$.

Extractors from Hash Functions
Leftover Hash Lemma [ILL89]: universal (pairwise independent) hash functions yield strong extractors
–output length: $\ell = k - O(1)$
–seed length: $d = O(n)$
Example: $\mathrm{Ext}(x, (a,b))$ = first $\ell$ bits of $a \cdot x + b$ in $GF[2^n]$
Almost pairwise independence:
–seed length: $d = O(\log n + k)$
–output length: $\ell = k - 2\log(1/\epsilon)$
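The example hash family can be written out for a small field. This sketch fixes n = 8 and represents GF[2^8] with the irreducible polynomial x^8 + x^4 + x^3 + x + 1; the field size, the polynomial and the function names are illustrative assumptions, and a useful extractor would need a much larger n.

```python
N = 8                    # field is GF[2^N]
IRREDUCIBLE = 0x11B      # x^8 + x^4 + x^3 + x + 1

def gf_mul(a, b):
    """Multiply two field elements (shift-and-add with modular reduction)."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # add the current multiple of a
        b >>= 1
        a <<= 1                  # multiply a by x
        if a >> N:               # degree reached N: reduce
            a ^= IRREDUCIBLE
    return result

def ext(x, seed, ell):
    """Ext(x, (a, b)) = first ell bits of a*x + b in GF[2^N]."""
    a, b = seed
    return (gf_mul(a, x) ^ b) >> (N - ell)

print(ext(0x53, (0xCA, 0x1F), ell=3))
```

Pairwise independence here means that for any fixed x1 ≠ x2, the pair (a·x1 + b, a·x2 + b) is uniform over all pairs of field elements as the seed (a, b) ranges over the field, because the map from seeds to pairs is a bijection.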

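Suppose w Learned Completely
AuxGen and A share a secret: w
AuxGen(DB):
–Find privacy breach y of DB, of length $\ell$
–Find w from DB (simulate A)
–Choose $s \in_R \{0,1\}^d$ and compute $\mathrm{Ext}(w,s)$
–Set $z = (s, \mathrm{Ext}(w,s) \oplus y)$
[Figure: DB feeds San and AuxGen; A gets w from San and z from AuxGen; A’s output goes to C, which outputs 0/1]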

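Suppose w Learned Completely
AuxGen and A share a secret: w
$z = (s, \mathrm{Ext}(w,s) \oplus y)$
[Figure: with access, DB feeds San and AuxGen; A gets w from San and z from AuxGen, and its output goes to C, which outputs 0/1. Without access, A’ gets only z from AuxGen, and its output goes to C, which outputs 0/1]
Technical conditions: $H_{\min}(W \mid y) \ge |y|$ and $|y|$ “safe”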

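Why is it a compromise?
AuxGen and A share a secret: w
$z = (s, \mathrm{Ext}(w,s) \oplus y)$
Why doesn’t A’ learn y:
–For each possible value of y, $(s, \mathrm{Ext}(w,s))$ is $\epsilon$-close to uniform
–Hence: $(s, \mathrm{Ext}(w,s) \oplus y)$ is $\epsilon$-close to uniform
Need $H_{\min}(W) \ge 3\ell + O(1)$
Technical conditions: $H_{\min}(W \mid y) \ge |y|$ and $|y|$ “safe”
[Figure: DB feeds San and AuxGen; A gets w from San and z from AuxGen; A’s output goes to C, which outputs 0/1]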
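The masking step can be sketched as follows. The extractor used here is a swapped-in stand-in, not the one from the slides: multiplication by a random binary matrix over GF(2), which is a universal hash family but needs a longer seed (n·ℓ bits). The lengths n = 64 and ℓ = 4 and all names are illustrative.

```python
import secrets

def ext(w, seed):
    """Output bit i is the inner product <seed[i], w> mod 2 (w is an n-bit int)."""
    out = 0
    for row in seed:
        out = (out << 1) | (bin(row & w).count("1") & 1)
    return out

def aux_gen(w, y, n, ell):
    """z = (s, Ext(w, s) XOR y): the breach y masked by bits extracted from w."""
    seed = [secrets.randbits(n) for _ in range(ell)]
    return seed, ext(w, seed) ^ y

def adversary(z, w):
    """A knows w (learned via San), so it can strip the mask and recover y."""
    seed, masked = z
    return ext(w, seed) ^ masked

n, ell = 64, 4
w = secrets.randbits(n)     # stand-in for the utility vector
y = 0b1011                  # stand-in for an ell-bit privacy breach
z = aux_gen(w, y, n, ell)
print(adversary(z, w) == y)
```

A party holding only z sees the seed and an ℓ-bit string that is close to uniform whenever w has enough min-entropy, which is the reason A’ does not learn y.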

To complete the proof Handle the case where not all of w is retrieved
