PDM Workshop April 8, 2006 Deriving Private Information from Perturbed Data Using IQR-based Approach Songtao Guo UNC Charlotte Xintao Wu UNC Charlotte.


1 PDM Workshop April 8, 2006 Deriving Private Information from Perturbed Data Using IQR-based Approach Songtao Guo UNC Charlotte Xintao Wu UNC Charlotte Yingjiu Li Singapore Management Univ

2 Source: http://www.privacyinternational.org/issues/foia/foia-laws.jpg

3 Source: http://www.privacyinternational.org/survey/dpmap.jpg
US: HIPAA for health care; California State Bill 1386; Gramm-Leach-Bliley Act for financial institutions; COPPA for children's online privacy; etc.
Canada: PIPEDA 2000
European Union: Directive 95/46/EC

4 Mining vs. Privacy
Data mining: the goal is summary results (e.g., classification, clustering, association rules) derived from the data distribution.
Individual privacy: individual values in the database must not be disclosed, or at least attackers must not be able to derive a close estimate of them.
Privacy Preserving Data Mining (PPDM): how can we "perturb" the data so that we can still build a good data mining model (data utility) while preserving each individual's privacy at the record level (privacy)?

5 Our Focus
     SSN  Name  Zip    Age  Sex  Balance  …  Income  Interest Paid
1    ***        28223  20   M    10k      …  85k     2k
2    ***        28223  30   F    15k      …  70k     18k
3    ***        28262  20   M    50k      …  120k    35k
.    .          .      .    .    .        …  .       .
n    ***        28223  20   M    80k      …  110k    15k
Slide annotations: "Focus in this talk"; "k-anonymity, L-diversity, SDC etc."

6 Additive Noise based PPDM
Distribution reconstruction:
- AS method: Agrawal and Srikant, SIGMOD 00
- EM method: Agrawal and Aggarwal, PODS 01
Individual value reconstruction:
- Spectral Filtering (SF): Kargupta et al., ICDM 03
- PCA: Huang, Du and Chen, SIGMOD 05

7 Additive Randomization (Y = X + R)
R. Agrawal and R. Srikant, SIGMOD 00
Original records (e.g. 50 | 40K | ..., 30 | 70K | ...) pass through a randomizer that adds a random number to each value, producing the released records (65 | 20K | ..., 25 | 60K | ...). For example, Alice's age 30 becomes 65 (30 + 35).
From the randomized records, the distributions of Age and Salary are reconstructed and passed to a classification algorithm, which builds the model.
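The randomizer step can be illustrated with a minimal sketch. The uniform noise range, the seed, and the sample ages below are assumptions made for illustration; the slide itself only gives the model Y = X + R.

```python
import random

def perturb(values, noise_low, noise_high, rng):
    """Return Y = X + R, with R drawn uniformly from [noise_low, noise_high]."""
    return [x + rng.uniform(noise_low, noise_high) for x in values]

rng = random.Random(0)                   # fixed seed, chosen only for reproducibility
ages = [30, 50, 25, 41, 63]              # hypothetical original ages (X)
released = perturb(ages, -35, 35, rng)   # what the data miner receives (Y)

for original, noisy in zip(ages, released):
    print(f"{original} -> {noisy:.1f}")
```

Each released value stays within 35 of its original, which is exactly the kind of bounded uncertainty the later slides try to narrow.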

8 Distribution Reconstruction
Algorithm:
f_X^0 := uniform distribution
j := 0 // iteration number
repeat
    f_X^(j+1)(a) := [update formula not preserved in this transcript]
    j := j + 1
until (stopping criterion met)
Converges to the maximum likelihood estimate (Agrawal and Aggarwal, PODS 01).
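The loop above can be sketched as a discretized Bayesian update. This is an assumption-laden illustration, not the slide's own formula (which is missing from the transcript): it assumes X lives on a fixed grid of candidate values, that the noise density is known to the attacker, and that a fixed iteration count stands in for the stopping criterion. The names `reconstruct` and `uniform_noise_density` are hypothetical.

```python
import random

def uniform_noise_density(half_width):
    """Density of R ~ Uniform[-half_width, half_width]."""
    def f_r(r):
        return 1.0 / (2 * half_width) if abs(r) <= half_width else 0.0
    return f_r

def reconstruct(perturbed, grid, f_r, iterations=50):
    """Iteratively estimate the probability mass of X on the grid points."""
    m = len(grid)
    est = [1.0 / m] * m                       # f_X^0 := uniform
    for _ in range(iterations):               # fixed count replaces the stopping criterion
        new = [0.0] * m
        used = 0
        for y in perturbed:
            weights = [f_r(y - a) * p for a, p in zip(grid, est)]
            total = sum(weights)
            if total == 0.0:                  # y cannot be explained by any grid point
                continue
            used += 1
            for k in range(m):
                new[k] += weights[k] / total  # posterior P(X = a_k | Y = y)
        if used == 0:
            break
        est = [v / used for v in new]         # average the posteriors over all records
    return est

rng = random.Random(1)
originals = [20] * 700 + [80] * 300           # hypothetical X: 70% at 20, 30% at 80
perturbed = [x + rng.uniform(-15, 15) for x in originals]
grid = list(range(0, 101, 10))
estimate = reconstruct(perturbed, grid, uniform_noise_density(15))
```

In this toy setting the estimate concentrates on the grid points 20 and 80, showing how much of the original distribution survives the perturbation.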

9 Individual Reconstruction
Spectral Filtering technique (Kargupta et al., ICDM 03):
- Apply EVD.
- Using the covariance of V, extract the first k principal components: λ_1 ≥ λ_2 ≥ ··· ≥ λ_k ≥ λ_e, where e_1, e_2, ···, e_k are the corresponding eigenvectors.
- Q_k = [e_1 e_2 ··· e_k] forms an orthonormal basis of a subspace X.
- Find the orthogonal projection onto X.
- Estimate the data from this projection.
PCA technique: Huang, Du and Chen, SIGMOD 05
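The projection idea can be sketched for the smallest possible case: two correlated attributes and k = 1. This is a simplified PCA-style filter, not the published spectral filtering algorithm (which also needs a rule for choosing k from the noise eigenvalues). The data, noise level, and function names are assumptions, and the closed-form eigenvector works only for a 2×2 covariance matrix.

```python
import math
import random

def filter_2d(points):
    """Project 2-D points onto their first principal component (k = 1)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum(u * u for u, _ in centered) / n
    b = sum(u * v for u, v in centered) / n
    c = sum(v * v for _, v in centered) / n
    # Closed-form direction of the eigenvector with the largest eigenvalue.
    theta = 0.5 * math.atan2(2 * b, a - c)
    e1 = (math.cos(theta), math.sin(theta))
    filtered = []
    for u, v in centered:
        t = u * e1[0] + v * e1[1]             # coordinate along e1
        filtered.append((mx + t * e1[0], my + t * e1[1]))
    return filtered

def mse(estimates, truth):
    """Mean squared Euclidean error between two point lists."""
    return sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
               for p, q in zip(estimates, truth)) / len(truth)

rng = random.Random(2)
originals = [(t, 2 * t) for t in (rng.uniform(0, 100) for _ in range(500))]
perturbed = [(x + rng.gauss(0, 5), y + rng.gauss(0, 5)) for x, y in originals]
filtered = filter_2d(perturbed)

print(mse(perturbed, originals), mse(filtered, originals))
```

Because the hypothetical attributes are strongly correlated, the projection discards the noise component orthogonal to the data direction, so the filtered values sit closer to the originals than the released values do.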

10 Motivation
The goal of randomization-based perturbation:
- To hide the sensitive data by randomly modifying the data values using additive noise.
- To keep the aggregate characteristics or distribution unchanged or recoverable.
Do those aggregate characteristics or distributions contain confidential information that snoopers may exploit to derive an individual's sensitive data (private information)?

11 Our Scenario
A single party (the data holder) holds a collection of original individual data:
     Balance  …  Income  Interest Paid
1    10k      …  85k     2k
2    15k      …  70k     18k
3    50k      …  120k    35k
.    .        …  .       .
n    80k      …  110k    15k
Each individual data value is associated with one privacy interval (privacy policies, corporate agreements).
The data holder can use the data or release it to a third party for analysis, but is required not to disclose any individual data value within its privacy interval.

12 Inter-Quantile Range (IQR)
The inter-quantile range [x_α1, x_α2] is defined by P(x_α1 ≤ x ≤ x_α2) ≥ c%, where c = α2 − α1 denotes the confidence.
The IQR measures the spread and variability of the variable, so attackers can use it to estimate the range of each individual value.
IQR used here: [x_(1−c)/2, x_(1+c)/2]
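A minimal sketch of computing the symmetric range [x_(1−c)/2, x_(1+c)/2] from a sample follows. The index rule (floor for the lower quantile, ceiling for the upper) is an assumption chosen so that the range covers at least a fraction c of the sample; the slides do not say which quantile estimator was used.

```python
import math

def inter_quantile_range(samples, c):
    """Return (x_lo, x_hi) covering at least a fraction c of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    lo_idx = math.floor((1 - c) / 2 * (n - 1))   # lower quantile index, rounded outward
    hi_idx = math.ceil((1 + c) / 2 * (n - 1))    # upper quantile index, rounded outward
    return ordered[lo_idx], ordered[hi_idx]

values = list(range(1, 101))                     # hypothetical sample: 1, 2, ..., 100
low, high = inter_quantile_range(values, 0.95)
coverage = sum(low <= v <= high for v in values) / len(values)
print(low, high, coverage)
```

An attacker would apply the same computation to the reconstructed or perturbed distribution rather than to the original values.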

13 Comparison with Other Privacy Definitions
- Interval privacy (Agrawal and Srikant, SIGMOD 00): if the original value can be estimated with c% confidence to lie in the interval [a, b], then the interval width (b − a) defines the amount of privacy at the c% confidence level.
- Mutual information (Agrawal and Aggarwal, PODS 01)
- Reconstruction privacy (Rizvi and Haritsa, VLDB 02)
- ρ1-to-ρ2 privacy breach (Evfimievski et al., PODS 03)

14 Disclosure Measure
Measure the similarity between the individual's privacy interval and the attacker's estimated range.
A point is completely disclosed if its estimated range:
- contains the original value, and
- falls fully within the pre-specified privacy interval.
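The two conditions can be written as a small predicate. The sketch below assumes the privacy interval has the form [u(1−p), u(1+p)] used in the evaluation, that all values are positive, and that the attacker applies one estimated range to every record; the function names and sample numbers are hypothetical.

```python
def privacy_interval(u, p):
    """Privacy interval [u(1 - p), u(1 + p)] around a positive value u."""
    return u * (1 - p), u * (1 + p)

def is_completely_disclosed(original, estimated, privacy):
    """True if the estimated range contains the value and lies inside the privacy interval."""
    est_lo, est_hi = estimated
    priv_lo, priv_hi = privacy
    contains_original = est_lo <= original <= est_hi
    inside_privacy = priv_lo <= est_lo and est_hi <= priv_hi
    return contains_original and inside_privacy

def disclosure_ratio(originals, estimated, p):
    """Fraction of records that are completely disclosed by one estimated range."""
    hits = sum(is_completely_disclosed(u, estimated, privacy_interval(u, p))
               for u in originals)
    return hits / len(originals)

balances = [100, 200, 400]           # hypothetical original values
attacker_range = (90, 130)           # hypothetical estimated range, e.g. an IQR
ratio = disclosure_ratio(balances, attacker_range, p=0.5)
print(ratio)
```

Only the first record is disclosed here: the range contains 100 and sits inside [50, 150], while 200 and 400 fall outside the attacker's range.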

15 Empirical Evaluation
Data sets:
- Bank: 5 attributes (Home Equity, Stock/Bonds, Liabilities, Savings, CDs); 50,000 tuples
- Signal: 35 correlated features (sinusoidal, square, triangle, normal distributions); 30,000 tuples
Pre-specified individual privacy intervals: [u_i(1−p), u_i(1+p)], where p is varied.

16 IQR from Reconstructed Dist. Using AS with Uniform Noise
Setup: Bank data set, attribute Stock/Bonds; uniform noise on [-125, 125]; 95% IQR. Information loss for AS: 14.6%.
[Figure: ratio of complete disclosure points for three inference methods: direct inference (perturbed data), IQR with AS inference (reconstructed distribution), and IQR ideal inference (original distribution).]

17 IQR from Reconstructed Dist. Using AS with Uniform Noise
Interval    Disclosed points (%)              D
p (%)       direct  IQR ideal  IQR with AS    ideal  AS
35          13.9    21.2       3.5            0.605  0.663
40          16.0    32.5       15.1           0.660  0.698
45          17.9    43.0       29.6           0.712  0.746
50          19.8    52.9       41.8           0.763  0.796
55          22.0    62.9       53.2           0.814  0.844
60          23.9    72.9       63.4           0.864  0.889
65          26.0    83.3       73.5           0.916  0.932
70          28.0    94.3       83.7           0.972  0.977
75          29.9    99.9       94.5           0.999
80          32.0    100                       1      1

18 AS vs. SF with Gaussian Noise
Setup: Signal data set, feature 2 (sinusoidally distributed); Gaussian noise N(0, 8); 95% IQR.
Information loss for AS: 32.9%; information loss for SF: 47.0%.

19 Disclosure vs. Noise
Setup: Bank data set, attribute Stock/Bonds; uniform noise with varied range; 95% IQR.

20 Extend to Multivariate Cases
In practice, the distribution of multiple numerical attributes is often modeled by one multivariate normal distribution, N(μ, Σ).
The ellipsoid {z : (z − μ)′ Σ^−1 (z − μ) ≤ χ²_p(α)} contains a fixed percentage, (1 − α)100%, of the data values.
The projection of this ellipsoid onto each axis z_i is a bounded interval.
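The slide's bound formula is not preserved in the transcript. A standard result for such ellipsoids is that the projection onto axis z_i is μ_i ± sqrt(χ²_p(α) · σ_ii), where σ_ii is the i-th diagonal entry of Σ; the sketch below assumes this is the intended bound. To stay within the standard library it handles only p = 2, where the chi-square quantile has the closed form −2 ln α. The mean vector and covariance matrix are hypothetical.

```python
import math

def chi2_upper_quantile_2dof(alpha):
    """Upper-alpha quantile of the chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(alpha)

def projection_bounds(mu, sigma, chi2_value):
    """Per-attribute interval mu_i +/- sqrt(chi2_value * sigma_ii)."""
    bounds = []
    for i, m in enumerate(mu):
        half_width = math.sqrt(chi2_value * sigma[i][i])
        bounds.append((m - half_width, m + half_width))
    return bounds

mu = [10.0, 20.0]                     # hypothetical attribute means
sigma = [[4.0, 1.0], [1.0, 9.0]]      # hypothetical covariance matrix
bounds = projection_bounds(mu, sigma, chi2_upper_quantile_2dof(0.05))
print(bounds)
```

Note that only the diagonal of Σ enters the projected interval; the correlations shape the ellipsoid but not its shadow on each axis.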

21 Related Work
Rotation-based approach: Y = RX
When R is an orthonormal matrix (R R^T = I), the following are preserved:
- Vector length: |Rx| = |x|
- Euclidean distance: |Rx − Ry| = |x − y|
- Inner product: <Rx, Ry> = <x, y>
Popular classifiers and clustering methods are invariant to this perturbation.
References:
- K. Liu, H. Kargupta et al. Random projection based multiplicative data perturbation for privacy preserving distributed data mining. TKDE 2006.
- K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. ICDM 2005.
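The three invariants can be checked numerically. The matrix below is an assumption: an exactly orthonormal matrix with entries in thirds, chosen because it rounds to values like 0.3333 and 0.6667; the two records are hypothetical.

```python
import math

# Assumed orthonormal matrix (rows are mutually orthogonal unit vectors).
R = [[1 / 3, 2 / 3, 2 / 3],
     [-2 / 3, 2 / 3, -1 / 3],
     [-2 / 3, -1 / 3, 2 / 3]]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def diff(u, v):
    return [a - b for a, b in zip(u, v)]

x_a = [10.0, 85.0, 2.0]       # hypothetical record (balance, income, interest paid)
x_b = [15.0, 70.0, 18.0]      # a second hypothetical record
y_a, y_b = matvec(R, x_a), matvec(R, x_b)

print(norm(x_a), norm(y_a))                          # vector length preserved
print(norm(diff(x_a, x_b)), norm(diff(y_a, y_b)))    # Euclidean distance preserved
print(dot(x_a, x_b), dot(y_a, y_b))                  # inner product preserved
```

These invariants are what make distance-based mining work on Y, and they are also what an attacker with partial knowledge of X can exploit.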

22 Is Y = RX Secure?
R R^T = R^T R = I
Original data:
     Bal   Income  …  IntP
1    10k   85k     …  2k
2    15k   70k     …  18k
3    50k   120k    …  35k
4    45k   23k     …  134k
.    .     .       …  .
N    80k   110k    …  15k
Y = R X, with
R =   0.3333   0.6667   0.6667
     -0.6667   0.6667  -0.3333
     -0.6667  -0.3333   0.6667
X =   10   15   50   45   80
      85   70  120   23  110
       2   18   35  134   15
Y =   61.33   63.67  110.00  119.67   63.33
      49.33   30.67   55.00  -59.33  -31.67
     -33.67  -21.33  -30.00   51.67  -51.67

23 Our Preliminary Results
Even Y = RX + E is NOT secure when some a-priori knowledge is available to attackers. R can be any random matrix.
Y = R X + E, with
R =   4.751   2.429   2.282
      1.156   4.457   0.093
      3.034   3.811   4.107
X =   10   15   50   45   80
      85   70  120   23  110
       2   18   35  134   15
E =   7.334   4.199   9.199   6.208   9.048
      3.759   7.537   8.447   7.313   5.692
      0.099   7.939   3.678   1.939   6.318
Y =   265.95  286.63  475.68  581.71  520.53
      394.30  338.49  569.58  174.22  277.79
      362.55  394.11  665.37  776.46  463.08

24 A-priori Knowledge ICA Based Attack
Privacy can be breached when a small subset of the original data X is available to attackers.
     Bal   Income  …  IntP
1    10k   85k     …  2k
2    15k   70k     …  18k
3    50k   120k    …  35k
4    45k   23k     …  134k
.    .     .       …  .
N    80k   110k    …  15k

25 Summary
- The reconstructed distribution can be exploited by attackers to derive sensitive individual information.
- We present a simple IQR-based attacking method.
- Complex and effective attacking methods exist.
- More research is needed on attacking methods from the attacker's point of view.

26 Acknowledgement
NSF Grants CCR-0310974 and IIS-0546027
Personnel: Xintao Wu, Songtao Guo, Ling Guo
More info: http://www.cs.uncc.edu/~xwu/, xwu@uncc.edu

27 Questions? Thank you!

28 Information Loss
- Distribution level
- Individual value level

29 National Laws
US:
- HIPAA for health care: passed August 21, 1996; sets the lowest bar, and the states are welcome to enact more stringent rules
- California State Bill 1386
- Gramm-Leach-Bliley Act of 1999 for financial institutions
- COPPA for children's online privacy
- etc.
Canada:
- PIPEDA 2000 (Personal Information Protection and Electronic Documents Act); effective from Jan 2004
European Union (Directive 95/46/EC):
- Passed by the European Parliament in Oct 95 and effective from Oct 98
- Provides guidelines for member state legislation
- Forbids sharing data with states that do not protect privacy

30 ICA Direct Attack?
Can we get X when only Y is available? It seems Independent Component Analysis (ICA) can help.
General linear perturbation model: Y = RX + E
ICA model: X = AS + N

31 ICA
[Figure: a linear mixing process (source, mixing matrix, observed) followed by a separation process (demixing matrix, separated), with an "Independent?" check feeding a cost function that is optimized.]

32 Restriction of ICA
Restrictions:
- All the components s_i should be independent.
- They must be non-Gaussian, with the possible exception of one component.
- The number of observed linear mixtures m must be at least as large as the number of independent components n.
- The matrix A must be of full column rank.
Can we apply ICA directly to Y = RX + E, given the ICA model X = AS + N? No:
- There are correlations among the attributes of X.
- More than one attribute may have a Gaussian distribution.

33 A-priori Knowledge based ICA (AK-ICA) Attack

34 Correctness of AK-ICA
We prove that J exists such that [formula not preserved in this transcript].
J represents the connection between the two distributions [symbols not preserved in this transcript].

