Privacy-Preserving Data Mining


1 Privacy-Preserving Data Mining
Jaideep Vaidya Joint work with Chris Clifton (Purdue University)

2 Outline
Introduction: Privacy-Preserving Data Mining; Horizontal / Vertical Partitioning of Data; Secure Multi-party Computation. Privacy-Preserving Outlier Detection. Privacy-Preserving Association Rule Mining. Conclusion. Speaker note: security proofs are very necessary and quite complex; they are not discussed in this talk but are present in the paper.

3 Back in the good ol’ days
[Figure: panels “Back in the good ol’ days”, “Now”, and “Future”; store labels Dominick’s, Safeway, Jewel]

4 A “real” example: Ford / Firestone
Ford and Firestone each held individual databases. It was possible to join both databases (find corresponding transactions), but there were commercial reasons not to share data: valuable corporate information such as complete cost structures and business structures. Ford Explorers with Firestone tires → tread separation problems (accidents!). The problem might have been figured out a bit earlier (tires from the Decatur, Ill. plant, in certain situations).

5 Public (mis)Perception of Data Mining: Attack on Privacy
Fears of loss of privacy constrain data mining. Protests over a national registry in Japan. Data Mining Moratorium Act: would stop all data mining R&D by DoD. Terrorism Information Awareness ended. Data mining could be a key technology. By the way, the other facet is security.

6 Is Data Mining a Threat?
Data mining summarizes data. (Possible?) exception: anomaly / outlier detection. Summaries aren’t private. Or are they? Does generating them raise issues? Data mining can be a privacy solution: data mining enables safe use of private data.

7 Privacy Problems with Data Mining
The problem isn’t data mining, it is the infrastructure to support it! Japanese registry data was already held by prefectures; protests arose over moving to a national registry. The Total Information Awareness program doesn’t generate new data; the goal is to enable use of data from multiple agencies. Loss of separation of control increases the potential for misuse. Find patterns while seeing only your own data!

8 Privacy-Preserving Data Mining
How can we mine data if we cannot see it? Perturbation (Agrawal & Srikant, Evfimievski et al.): extremely scalable, approximate results; debate about security properties. Cryptographic (Lindell & Pinkas, Vaidya & Clifton): completely accurate, completely secure (tight bound on disclosure), appropriate for a small number of parties. Condensation / hybrid approaches give a tradeoff (accuracy vs. scalability). Secure Multiparty Computation: proof that this is (theoretically) possible. Speaker notes: I am first to do vertical partitioning (done: clustering, classification, association rules). Access control: applied SMC (similar to what we are doing) appears in many papers, but most are restricted to 2 parties.

9 Assumptions
Data is distributed: each data set is held by a source authorized to see it. Nobody is allowed to see aggregate data: knowing all data about an individual violates privacy. Data holders don’t want to disclose data, but won’t collude to violate privacy.

10 Gold Standard: Trusted Third Party

11 Horizontal Partitioning of Data
Columns: CC#, Active?, Delinquent?, Amount
Bank of America: 123 Yes <$300; 324 No $; 919 >$1000
Chase Manhattan: 3450 Yes <$300; 4127 No $; 8772 >$1000

12 Vertical Partitioning of Data
Global Database View: TID, Brain Tumor?, Diabetes?, Model, Battery
Medical Records: RPJ Yes Diabetic; CAC No Tumor No; PTR
Cell Phone Data: RPJ 5210 Li/Ion; CAC none; PTR 3650 NiCd
Speaker note: need to give horizontal partitioning.

13 Secure Multi-Party Computation (SMC)
Given a function f and n inputs, distributed at n sites, compute the result while revealing nothing to any site except its own input(s) and the result. Speaker notes: meaning of security; excepting polynomial predicates (not clear or necessary); skip input problems with semi-honest input.

14 Secure Multi-Party Computation: It can be done!
Yao’s Millionaires’ problem (Yao ’86): secure computation is possible if the function can be represented as a circuit. Idea: securely compute a gate, then continue to evaluate the circuit. Extended to multiple parties (BGW/GMW ’87). Biggest problem: efficiency. It will not work for lots of parties or large quantities of data. Proof of security: simulator-based approach (mentioned later). Speaker note: efficiency and Yao’s protocol; maybe use simulation figures from Agrawal and Srikant?

15 SMC – Models of Computation
Semi-honest model: parties follow the protocol faithfully. Malicious model: anything goes! Provably secure protocols exist in both models. In either case, input can always be modified. No-collusion model: no collusion allowed; only sensible for multiple parties. Speaker notes: there are ways of proving security in both kinds of models. Change this slide and add an incentive compatibility slide.

16 Incentive compatibility
From a higher-level perspective (an economic notion): if a party cheats, either the party is caught or the party suffers an economic loss. This is possible for many useful collaboration problems. If the protocol is incentive compatible, the semi-honest model is sufficient for security.

17 What is an Outlier?
An object O in a dataset T is a DB(p,dt)-outlier if at least fraction p of the objects in T lie at distance greater than dt from O. Centralized solution from Knorr and Ng: nested loop comparison; maintain a count of objects inside the distance threshold; if the count exceeds the threshold, declare a non-outlier and move to the next object. A clever processing order minimizes I/O cost.
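The nested-loop test for a DB(p,dt)-outlier can be sketched as follows. This is a minimal in-memory illustration, assuming Euclidean distance and counting O itself as an object of T; the function name `is_db_outlier` is hypothetical, and the I/O-aware processing order of Knorr and Ng is not modeled.

```python
from math import dist  # Euclidean distance, Python 3.8+


def is_db_outlier(index, points, p, dt):
    """Return True if points[index] is a DB(p, dt)-outlier of `points`.

    O is an outlier when at least a fraction p of all objects lie farther
    than dt from it, i.e. when at most n - p*n objects lie within dt.
    """
    n = len(points)
    max_close = n - p * n  # largest "close" count an outlier may have
    close = 0
    for other in points:  # inner loop of the nested-loop comparison
        if dist(points[index], other) <= dt:
            close += 1
            if close > max_close:
                return False  # enough neighbours: stop early, not an outlier
    return True


def find_outliers(points, p, dt):
    """Outer loop: test every object in turn."""
    return [i for i in range(len(points)) if is_db_outlier(i, points, p, dt)]
```

The early exit mirrors the "declare non-outlier and move to next" step: only true outliers force a full pass over the data.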

18 Privacy-Preserving Solution
Key idea: share splitting. Computations leave results (randomly) split between parties. The only outcome is whether the count of points within the distance threshold exceeds the outlier threshold. Requires pairwise comparison of all points, but failure to compare all points reveals information about non-outliers: this alone makes it possible to cluster points, which is a privacy violation. Asymptotically equivalent to Knorr & Ng.

19 Solution: Horizontal Partition
Compare locally with your own points. For remote points, get a random share of the distance. Calculate a random share of “exceeds threshold or doesn’t”. Sum the shares and test if there are enough “close” points. [Figure values: 1.5 32 -31 -0.9 0.3 3 -3 0.9 2.5 -12 12 -0.7 1.5 1 -1 3.2 1 24 -23]
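The share bookkeeping in these steps can be illustrated with additive shares. This is only a sketch of the arithmetic, not a secure protocol: the close/not-close bit is computed in the clear here, whereas the talk obtains its shares through a cryptographic comparison. `MODULUS`, `split`, `shared_close_count`, and `reconstruct` are hypothetical names.

```python
import secrets
from math import dist

MODULUS = 2**32  # assumed size of the share domain (illustrative)


def split(value):
    """Split an integer into two additive shares modulo MODULUS."""
    r = secrets.randbelow(MODULUS)
    return r, (value - r) % MODULUS


def shared_close_count(my_point, remote_points, dt):
    """Return two running totals whose sum (mod MODULUS) is the close count.

    NOT secure: the 1/0 bit is computed in the clear as a stand-in for the
    secure comparison, purely to show how per-point shares are summed.
    """
    mine = theirs = 0
    for q in remote_points:
        bit = 1 if dist(my_point, q) <= dt else 0
        s_mine, s_theirs = split(bit)
        mine = (mine + s_mine) % MODULUS
        theirs = (theirs + s_theirs) % MODULUS
    return mine, theirs


def reconstruct(share_a, share_b):
    """Combine two additive shares; each share alone is a uniform value."""
    return (share_a + share_b) % MODULUS
```

Each party only ever holds one column of shares, so an individual share says nothing about whether a given remote point was close.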

20 Random share of distance
x², y² are local; the sum of xy is a scalar product. Several protocols exist for share-splitting the scalar product (Du & Atallah ’01; Vaidya & Clifton ’02; Ioannidis, Grama, Atallah ’02).
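The slide's formula did not survive extraction. Assuming squared Euclidean distance between a point X = (x_1, …, x_m) held by one party and Y = (y_1, …, y_m) held by the other, the decomposition it refers to would read as below. The share names s_X and s_Y are introduced here for illustration: they are the two outputs of a share-splitting scalar product protocol.

```latex
\begin{align}
\mathrm{dist}^2(X, Y) &= \sum_{i=1}^{m} (x_i - y_i)^2 \\
  &= \underbrace{\sum_{i=1}^{m} x_i^2}_{\text{local to } X}
   + \underbrace{\sum_{i=1}^{m} y_i^2}_{\text{local to } Y}
   - 2 \underbrace{\sum_{i=1}^{m} x_i y_i}_{\text{scalar product}} \\
\text{with } s_X + s_Y &= \sum_{i=1}^{m} x_i y_i: \quad
\mathrm{dist}^2(X, Y) =
  \Bigl(\sum_{i=1}^{m} x_i^2 - 2 s_X\Bigr) + \Bigl(\sum_{i=1}^{m} y_i^2 - 2 s_Y\Bigr)
\end{align}
```

Each bracketed term can be computed by a single party, so the two parties end up holding random shares of the distance without either one seeing the other's point.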

21 Shares of “Within Threshold”
Goal: is x + y ≤ dt? Essentially Yao’s Millionaires’ problem (Yao ’86). Represent the function to be computed as a circuit. The cryptographic protocol gives random shares of each wire. This solves “sum of shares from within dt exceeds minimum” as well.

22 Vertically Partitioned Data
Each party computes its part of the distance. Secure comparison (circuit evaluation) gives each party shares of 1/0 (close/not). Sum and compare as with horizontal partitioning.

23 Why is this Secure?
Random shares are indistinguishable from random values: they contain no knowledge in isolation, and assuming no collusion, shares are viewed in isolation. The number of values (= number of shares) is known, so nothing new is revealed. Too few close points is the outlier definition, so this is the desired result. No knowledge is revealed that can’t be discovered from one’s own input and the result!

24 Conclusion (Outlier Detection)
Outlier detection is feasible without revealing anything but the outliers. Possibly expensive (quadratic), but a more efficient solution for this definition of outlier inherently reveals potentially privacy-violating information. Key: privacy of non-outliers is preserved, and the reason why outliers are outliers is also hidden. Allows search for “unusual” entities without disclosing private information about entities.

25 Association Rules
Association rules are a common data mining task. Find A, B, C such that AB ⇒ C holds frequently (e.g. Diapers ⇒ Beer). Fast algorithms exist for centralized and distributed computation. Basic idea: for AB ⇒ C to be frequent, AB, AC, and BC must all be frequent. These algorithms require sharing data, and Secure Multiparty Computation is too expensive. Rules are not limited to 3 items, e.g. ABCD ⇒ E.

26 Association Rule Mining
Find out if itemset {A1, B1} is frequent (i.e. if support of {A1, B1} ≥ k). Support of an itemset is defined as the number of transactions in which all attributes of the itemset are present. For binary data, support = |Ai ∧ Bi|. [Tables: party A holds column A1 and party B holds column B1, both keyed by k1 to k5.] {A1, B1} is supported for keys k4, k5. Support is 2.
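For binary columns, the support of {A1, B1} is just the count of keys where both bits are 1. The sketch below computes it in the clear; the column values are illustrative assumptions (the slide's table cells were not preserved), chosen so that keys k4 and k5 are the supporting transactions.

```python
def support(column_a, column_b):
    """Support of a 2-itemset on binary data: number of rows where both are 1."""
    return sum(a & b for a, b in zip(column_a, column_b))


# Hypothetical values for keys k1..k5 (assumed, not taken from the slide).
A1 = [1, 0, 0, 1, 1]  # held by party A
B1 = [0, 1, 0, 1, 1]  # held by party B

print(support(A1, B1))  # rows k4 and k5 have both bits set
```

Seen this way, support is a scalar product of two vectors held by different parties, which is why the share-splitting scalar product protocols cited earlier apply.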

27 Association Rule Mining
Idea based on TID-list representation of data. Represent attribute A as TID-list Atid. Support of ABC is |Atid ∩ Btid ∩ Ctid|. Use a secure protocol to find the size of the set intersection to find candidate sets. Speaker notes: we now know how to compute one frequent set (add half a slide on how to compute one frequent set from another); with millions of candidate itemsets this won’t work.
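The TID-list view can be sketched in a few lines. This is the plain, non-private computation, shown only to fix the definition that the secure protocol must reproduce; the data and the helper names `tid_list` and `support` are hypothetical.

```python
def tid_list(column):
    """TID-list of an attribute: the set of transaction IDs where it is present."""
    return {tid for tid, present in column.items() if present}


def support(*tid_lists):
    """Support of an itemset = size of the intersection of its TID-lists."""
    return len(set.intersection(*tid_lists))


# Hypothetical attribute columns, keyed by transaction ID.
A = tid_list({1: 1, 2: 1, 3: 0, 4: 1, 5: 1})
B = tid_list({1: 0, 2: 1, 3: 1, 4: 1, 5: 1})
C = tid_list({1: 1, 2: 0, 3: 1, 4: 1, 5: 0})
```

Only the size of the intersection is needed, never its members, which is what makes a cardinality-only protocol sufficient.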

28 Cardinality of Set Intersection
Use a secure commutative hash function: Pohlig-Hellman encryption. Each party generates its own encryption key. All parties encrypt all the input sets. Commutativity means E1(E2(…Ek(X))…) = El(Ei(…Ej(X))…) for any order of keys. The result is the number of common objects in all sets. No need to decrypt.
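The commutativity property can be checked with a toy Pohlig-Hellman style cipher, E_k(x) = x^k mod P. This is a sketch under stated assumptions: the prime `P` (the Mersenne prime 2^127 - 1) is chosen for convenience and is not a vetted security parameter, and `to_group`, `keygen`, and `encrypt` are hypothetical helper names.

```python
import hashlib
import secrets
from math import gcd

P = 2**127 - 1  # toy prime modulus (assumption; not a recommended parameter)


def keygen():
    """Pick an exponent coprime to P - 1 so that x -> x^e mod P is a bijection."""
    while True:
        e = secrets.randbelow(P - 3) + 2  # e in [2, P - 2]
        if gcd(e, P - 1) == 1:
            return e


def to_group(item):
    """Map an arbitrary item to an integer in [1, P - 1] via SHA-256."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def encrypt(key, value):
    """Exponentiation cipher: E_key(value) = value^key mod P."""
    return pow(value, key, P)
```

Because (x^a)^b = (x^b)^a mod P, the order in which the parties apply their keys does not matter, so equal items end up with equal ciphertexts once every party has encrypted them.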

29 Cardinality of Set Intersection
Hashing: all parties hash all sets with their key; the order is permuted in each hashing step. Initial intersection: the fully hashed set is sent to every party except the owner of the original set, and each party finds the intersection of all sets (except its own). Final intersection: parties exchange the final intersection set and compute the intersection of all sets.

30 Computing Size of Intersection
[Figure: three parties in a ring. Party 1 holds X: α,λ,σ,β; party 2 holds Y: λ,σ,φ,υ,β; party 3 holds Z: α,β,κ,λ,γ. Sets are passed around and encrypted in turn: E1(X), E3(E1(X)), E2(E3(E1(X))); E2(Y), E1(E2(Y)), E3(E1(E2(Y))); E3(Z), E2(E3(Z)), E1(E2(E3(Z))). Intermediate intersections: X∩Y: λ,σ,β; X∩Z: α,β,λ; Y∩Z: λ,β. Final result: X∩Y∩Z: λ,β (shown at all three parties).]
Speaker note: probing attacks; it is possible to design algorithms to prevent / detect certain kinds of inputs. Too many concepts on one slide.
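The three-party example can be replayed in a single process as a rough sketch. It assumes the toy exponentiation cipher E_k(x) = x^k mod P with an illustrative prime, uses hypothetical helper names, and omits the message passing and the intermediate "all sets but your own" check; it only shows that intersecting fully encrypted, shuffled sets yields the right cardinality.

```python
import hashlib
import random
import secrets
from math import gcd

P = 2**127 - 1  # toy prime modulus (assumption; not a vetted parameter)


def keygen():
    """Exponent coprime to P - 1, so encryption is a bijection on [1, P - 1]."""
    while True:
        e = secrets.randbelow(P - 3) + 2
        if gcd(e, P - 1) == 1:
            return e


def to_group(item):
    """Hash an item to an integer in [1, P - 1]."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def intersection_size(party_sets):
    """Size of the intersection of all parties' sets, via layered encryption."""
    k = len(party_sets)
    keys = [keygen() for _ in range(k)]  # one private key per party
    fully_encrypted = []
    for owner, items in enumerate(party_sets):
        values = [to_group(x) for x in items]
        for step in range(k):  # pass around the ring, starting at the owner
            key = keys[(owner + step) % k]
            values = [pow(v, key, P) for v in values]
            random.shuffle(values)  # permute order in each hashing step
        fully_encrypted.append(set(values))
    # Commutativity makes equal items collide regardless of encryption order.
    return len(set.intersection(*fully_encrypted))


X = ["α", "λ", "σ", "β"]
Y = ["λ", "σ", "φ", "υ", "β"]
Z = ["α", "β", "κ", "λ", "γ"]
```

Nothing is ever decrypted: the parties compare ciphertexts only, so they learn how many items are shared but not which ones.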

31 Why need an intermediate intersection step?
Probing: one party is only interested in a particular item. Its input set is composed of the interesting item and junk, so the output reveals information about the presence / absence of the item. (What if the item represents medical records for a celebrity?) Solution: in the intermediate step, every party receives the encrypted sets of all other parties (but not its own). If the intersection size is lower than a threshold, there is a possibility of probing => abort the protocol.

32 Proof of Security
Proof by simulation: state what is known and how it can be simulated. What is known: site i learns the size of the intersection set. The protocol is symmetric, so simulating the view of one party is sufficient.

33 Proof of Security
Hashing: party i receives an encrypted set from party i-1; random numbers can be used to simulate this. Intersection: party i receives the fully hashed sets of all parties.

34 Simulating Fully Encrypted Sets
Given |ABC| = 2, |AB| = 3, |AC| = 4, |BC| = 2, |A| = 6, |B| = 7, |C| = 8, the region sizes are: ABC: 2; AB only: 3-2 = 1; AC only: 4-2 = 2; BC only: 2-2 = 0; A only: 1; B only: 4; C only: 4.

35 A: R1 R2 R3 R4 R5 R6; B: R1 R2 R3 R7 R8 R9 R10; C: R1 R2 R4 R5 R11 R12 R13
Speaker note: explain why this is computationally indistinguishable; the slide is of no use without this.
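The counting step behind the simulated sets can be sketched as follows: from the known intersection sizes, derive how many elements fall in each exclusive region, then fill each region with fresh random values. The function names and the 128-bit value size are assumptions made for illustration; this shows the bookkeeping only and is not the indistinguishability proof itself.

```python
import secrets

SIZES = {"ABC": 2, "AB": 3, "AC": 4, "BC": 2, "A": 6, "B": 7, "C": 8}


def region_sizes(sizes):
    """Number of elements in each exclusive region of the three-set diagram."""
    abc = sizes["ABC"]
    only = {"ABC": abc}
    for pair in ("AB", "AC", "BC"):
        only[pair] = sizes[pair] - abc  # in exactly these two sets
    only["A"] = sizes["A"] - only["AB"] - only["AC"] - abc
    only["B"] = sizes["B"] - only["AB"] - only["BC"] - abc
    only["C"] = sizes["C"] - only["AC"] - only["BC"] - abc
    return only


def simulate_sets(sizes, bits=128):
    """Build random sets A, B, C whose intersection sizes match `sizes`."""
    sets = {"A": set(), "B": set(), "C": set()}
    used = set()
    for region, count in region_sizes(sizes).items():
        for _ in range(count):
            r = secrets.randbits(bits)
            while r in used:  # keep simulated elements distinct
                r = secrets.randbits(bits)
            used.add(r)
            for name in region:  # region "AB" contributes to sets A and B
                sets[name].add(r)
    return sets
```

A simulator that knows only the sizes can therefore produce sets with exactly the same intersection structure as the real fully encrypted sets.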

36 Optimized version

37 Association Rule Mining (Revisited)
Naïve algorithm => simply use APRIORI: a single set intersection determines the frequency of a single candidate itemset, but there are thousands of itemsets. Key intuition: the set intersection algorithm developed also allows computation of intermediate sets. All parties get fully encrypted sets for all attributes. Local computation then allows efficient discovery of all association rules.
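Once every party holds the fully encrypted TID-list of every attribute, itemset supports can be found by purely local set intersections. The sketch below assumes this setting and uses a single-key HMAC as a stand-in for the layered commutative encryption (in the actual protocol no single party would hold such a key); `encrypt_tids`, `support`, and `frequent_itemsets` are hypothetical names, and the level-wise search is a simplified Apriori.

```python
import hashlib
import hmac
from itertools import combinations


def encrypt_tids(tids, key=b"stand-in for the joint encryption"):
    """Stand-in only: deterministic keyed hash of each transaction ID."""
    return {hmac.new(key, str(t).encode(), hashlib.sha256).hexdigest() for t in tids}


def support(itemset, encrypted):
    """Support = size of the intersection of the encrypted TID-lists."""
    return len(set.intersection(*(encrypted[a] for a in itemset)))


def frequent_itemsets(encrypted, min_support):
    """Level-wise search over encrypted TID-lists, done entirely locally."""
    frequent = {}
    level = [frozenset([a]) for a in encrypted]
    while level:
        kept = []
        for itemset in level:
            s = support(itemset, encrypted)
            if s >= min_support:
                frequent[itemset] = s
                kept.append(itemset)
        size = len(level[0]) + 1
        # Candidates for the next level: unions of frequent sets one item apart.
        level = list({a | b for a, b in combinations(kept, 2) if len(a | b) == size})
    return frequent


# Hypothetical TID-lists for three attributes.
encrypted = {
    "A": encrypt_tids({1, 2, 3, 4}),
    "B": encrypt_tids({2, 3, 4, 5}),
    "C": encrypt_tids({3, 4, 6}),
}
```

No further communication is needed after the encrypted sets are distributed, which is why the cost does not grow with the number of candidate itemsets.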

38 Communication Cost
k parties, m set size, p frequent attributes. k*(2k-2) = O(k^2) messages. p*(2p-2)*m*encrypted message size = O(p^2 m) bits. k rounds. Independent of the number of itemsets found. Speaker notes: these are big-O estimates. The metric is not whether this is as efficient as non-privacy-preserving computation; the right question is whether it is sufficiently fast for practical use. Consider dropping the non-secure method (especially if giving actual times). People might ask about the cost of the non-secure method: make the point that it is not the right metric of comparison. This needs a solid answer with an appropriate, non-defensive tone: the non-secure method would be faster, but the right way to think about it is whether the secure method is practical.
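The cost expressions on this slide can be evaluated directly for a given configuration. The sketch below simply plugs numbers into the formulas as stated; the function name and the `encrypted_bits` parameter (size of one encrypted value) are assumptions, and the sample values are arbitrary.

```python
def communication_cost(k, m, p, encrypted_bits):
    """Evaluate the slide's cost formulas.

    k: number of parties, m: set size, p: number of frequent attributes,
    encrypted_bits: size in bits of one encrypted value (assumed parameter).
    """
    messages = k * (2 * k - 2)                   # O(k^2) messages
    bits = p * (2 * p - 2) * m * encrypted_bits  # O(p^2 m) bits
    rounds = k
    return {"messages": messages, "bits": bits, "rounds": rounds}


cost = communication_cost(k=3, m=1000, p=10, encrypted_bits=1024)
```

None of the terms depends on the number of itemsets found, which matches the slide's claim.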

39 Other Results
ID3 decision tree learning: horizontal partitioning (Lindell & Pinkas ’00); also vertical partitioning (Du, Vaidya). Association rules: horizontal partitioning (Kantarcıoğlu). K-Means / EM clustering. K-Nearest Neighbor. Naïve Bayes, Bayes network structure. And many more.

40 Challenges
What do the results reveal? A general approach (instead of one per data mining technique). Experimental results. Incentive compatibility. Note: upcoming book in the Advances in Information Security series by Springer-Verlag.

41 Questions

