PDM Workshop April 8, 2006 Deriving Private Information from Perturbed Data Using IQR-based Approach Songtao Guo UNC Charlotte Xintao Wu UNC Charlotte Yingjiu Li Singapore Management Univ
HIPAA for health care; California State Bill 1386; Gramm-Leach-Bliley Act for financial institutions; COPPA for children's online privacy; etc.
PIPEDA 2000
European Union (Directive 95/46/EC)

Mining vs. Privacy
Data mining: the goal is to derive summary results (e.g., classification, clusters, association rules) from the data distribution.
Individual privacy: individual values in the database must not be disclosed, or at least attackers must not be able to derive a close estimate of them.
Privacy Preserving Data Mining (PPDM): how can we "perturb" the data so that we can still build a good data mining model (data utility) while preserving each individual's privacy at the record level (privacy)?

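Our Focus
     SSN  Name  Zip  Age  Sex  Balance  ...  Income  Interest Paid
1    ***                  M    10k      ...  85k     2k
2    ***                  F    15k      ...  70k     18k
3    ***                  M    50k      ...  120k    35k
...
n    ***                  M    80k      ...  110k    15k
Focus in this talk: the numerical attributes (Balance ... Interest Paid)
Identifying attributes: k-anonymity, L-diversity, SDC, etc.
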
Additive Noise based PPDM
Distribution reconstruction:
  AS method, Agrawal and Srikant, SIGMOD 00
  EM method, Agrawal and Aggarwal, PODS 01
Individual value reconstruction:
  Spectral Filtering (SF), Kargupta et al., ICDM 03
  PCA, Huang, Du and Chen, SIGMOD 05

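Additive Randomization (Y = X + R)
[Figure: original records (Age | Salary) such as 50 | 40K and 30 | 70K pass through a randomizer and come out as records such as 65 | 20K and 25 | 60K. The distributions of Age and Salary are reconstructed from the randomized records and fed to a classification algorithm to build the model.]
Example: a random number is added to Age, so Alice's age 30 becomes 65 (30 + 35).
R. Agrawal and R. Srikant, SIGMOD 00
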
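The randomizer step can be sketched in a few lines. This is only an illustration, assuming uniform noise on [-alpha, alpha]; the function name `randomize` and the parameters `alpha` and `seed` are hypothetical and do not come from the talk.

```python
import random

def randomize(values, alpha, seed=None):
    """Return Y = X + R, with R drawn i.i.d. from Uniform[-alpha, alpha].

    The uniform noise model is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    return [x + rng.uniform(-alpha, alpha) for x in values]

ages = [50, 30, 42, 27]                      # made-up original values
perturbed = randomize(ages, alpha=35, seed=0)
```

Each released value differs from its original by at most `alpha`, which is exactly the information an attacker can combine with the reconstructed distribution.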
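Distribution Reconstruction
Algorithm:
  f_X^0 := uniform distribution
  j := 0  // iteration number
  repeat
    f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_R(y_i - a)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_R(y_i - z)\, f_X^{j}(z)\, dz}
    j := j + 1
  until (stopping criterion met)
Here y_i are the perturbed values and f_R is the density of the noise R.
Converges to the maximum likelihood estimate (Agrawal and Aggarwal, PODS 01).
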
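The iteration above can be sketched on a discretized domain. This is a minimal illustration rather than the authors' implementation: it assumes the values of X are binned and represented by bin midpoints, that the noise density is known, and that a fixed iteration count stands in for the stopping criterion. The names `reconstruct_distribution`, `gaussian_pdf`, and `bin_mids` are hypothetical.

```python
import math
import random

def gaussian_pdf(r, sigma):
    """Density of N(0, sigma^2) at r (assumed noise model for this sketch)."""
    return math.exp(-r * r / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))

def reconstruct_distribution(perturbed, bin_mids, noise_pdf, iterations=50):
    """Estimate P(X in bin k) from perturbed values y_i = x_i + r_i."""
    m = len(bin_mids)
    probs = [1.0 / m] * m                      # f_X^0 := uniform distribution
    for _ in range(iterations):                # fixed count replaces the stopping criterion
        new_probs = [0.0] * m
        used = 0
        for y in perturbed:
            weights = [noise_pdf(y - a) * p for a, p in zip(bin_mids, probs)]
            total = sum(weights)               # discretized version of the integral
            if total == 0.0:                   # y is unreachable from every bin; skip it
                continue
            used += 1
            for k in range(m):
                new_probs[k] += weights[k] / total
        if used == 0:
            break
        probs = [p / used for p in new_probs]  # average over the records used
    return probs

# Made-up data: every original value is 25, perturbed with Gaussian noise.
rng = random.Random(0)
sigma = 2.0
perturbed = [25.0 + rng.gauss(0.0, sigma) for _ in range(1000)]
bin_mids = [5.0, 15.0, 25.0, 35.0, 45.0]
probs = reconstruct_distribution(perturbed, bin_mids, lambda r: gaussian_pdf(r, sigma))
```

In this toy run the estimated mass concentrates on the bin around 25, which is the kind of aggregate information the talk argues an attacker can exploit.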
Individual Reconstruction
Spectral Filtering technique (Kargupta et al., ICDM 03):
  Apply EVD.
  Using the covariance of V, extract the first k principal components, with λ_1 ≥ λ_2 ≥ ··· ≥ λ_k ≥ λ_e, where e_1, e_2, ···, e_k are the eigenvectors corresponding to λ_1, ..., λ_k.
  Q_k = [e_1 e_2 ··· e_k] forms an orthonormal basis of a subspace X.
  Find the orthogonal projection onto X.
  Estimate the original data as the projection of the perturbed data onto X.
PCA technique (Huang, Du and Chen, SIGMOD 05)

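The projection idea can be illustrated in the smallest possible setting. The sketch below assumes two attributes and k = 1, so the leading eigenvector of the 2x2 covariance matrix has a closed form; it follows the PCA-style projection described above and does not implement the eigenvalue-separation step of spectral filtering. All function names and the toy data are hypothetical.

```python
import math
import random

def top_eigenvector_2x2(a, b, c):
    """Unit eigenvector for the largest eigenvalue of the symmetric matrix [[a, b], [b, c]]."""
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    return math.cos(theta), math.sin(theta)

def pca_filter(records):
    """Project 2-attribute records onto their first principal component (k = 1)."""
    n = len(records)
    m1 = sum(r[0] for r in records) / n
    m2 = sum(r[1] for r in records) / n
    centered = [(r[0] - m1, r[1] - m2) for r in records]
    s11 = sum(u * u for u, _ in centered) / n
    s12 = sum(u * v for u, v in centered) / n
    s22 = sum(v * v for _, v in centered) / n
    e1, e2 = top_eigenvector_2x2(s11, s12, s22)
    estimates = []
    for u, v in centered:
        t = u * e1 + v * e2                    # coordinate along the principal direction
        estimates.append((m1 + t * e1, m2 + t * e2))
    return estimates

def mean_squared_error(estimates, truth):
    return sum((a - c) ** 2 + (b - d) ** 2
               for (a, b), (c, d) in zip(estimates, truth)) / len(truth)

# Made-up data: two perfectly correlated attributes plus independent Gaussian noise.
rng = random.Random(1)
xs = [rng.uniform(0.0, 100.0) for _ in range(1000)]
original = [(x, x) for x in xs]
perturbed = [(a + rng.gauss(0.0, 5.0), b + rng.gauss(0.0, 5.0)) for a, b in original]
filtered = pca_filter(perturbed)
mse_perturbed = mean_squared_error(perturbed, original)
mse_filtered = mean_squared_error(filtered, original)
```

Because the attributes are strongly correlated, the projection discards the noise component orthogonal to the principal direction, so the filtered records are closer to the originals than the perturbed ones.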
Motivation
The goals of randomization-based perturbation:
  To hide the sensitive data by randomly modifying the data values with additive noise
  To keep the aggregate characteristics or distribution unchanged or recoverable
Do those aggregate characteristics or distributions contain confidential (private) information that snoopers may exploit to derive an individual's sensitive data?

Our Scenario
A single party (data holder) holds a collection of original individual data:
     Balance  ...  Income  Interest Paid
1    10k      ...  85k     2k
2    15k      ...  70k     18k
3    50k      ...  120k    35k
...
n    80k      ...  110k    15k
Each individual data value is associated with one privacy interval (privacy policies, corporate agreements).
The data holder can use the data or release it to a third party for analysis, but is required not to disclose any individual value within its privacy interval.

Inter-Quantile Range (IQR)
The inter-quantile range [x_{α1}, x_{α2}] is defined by P(x_{α1} ≤ x ≤ x_{α2}) ≥ c%, where c = α2 − α1 denotes the confidence.
The IQR measures the spread and variability of the variable, so attackers can use it to estimate the range of each individual value.
IQR we used: [x_{(1−c)/2}, x_{(1+c)/2}]
[Figure: distribution with α1, α2 and the quantiles x_{α1}, x_{α2} marked.]

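An empirical version of the symmetric IQR can be computed directly from a sample. The sketch below assumes nearest-rank quantiles, which is one of several common quantile conventions; the function name `inter_quantile_range` and the sample are hypothetical.

```python
import math

def inter_quantile_range(values, c=0.95):
    """Empirical [x_{(1-c)/2}, x_{(1+c)/2}] using nearest-rank quantiles."""
    ordered = sorted(values)
    n = len(ordered)

    def nearest_rank(q):
        # Smallest order statistic with at least a fraction q of the sample at or below it.
        idx = max(math.ceil(q * n) - 1, 0)
        return ordered[min(idx, n - 1)]

    return nearest_rank((1.0 - c) / 2.0), nearest_rank((1.0 + c) / 2.0)

sample = list(range(1, 101))                 # made-up values 1..100
lo, hi = inter_quantile_range(sample, c=0.9)
coverage = sum(lo <= v <= hi for v in sample) / len(sample)
```

The same function can be applied to the perturbed values, to samples drawn from a reconstructed distribution, or to the original values, depending on what the attacker is assumed to hold.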
Comparison with Other Privacy Definitions
Interval privacy (Agrawal and Srikant, SIGMOD 00): if the original value can be estimated with c% confidence to lie in the interval [a, b], then the interval width (b − a) defines the amount of privacy at the c% confidence level.
Mutual information (Agrawal and Aggarwal, PODS 01)
Reconstruction privacy (Rizvi and Haritsa, VLDB 02)
ρ1-to-ρ2 privacy breach (Evfimievski et al., PODS 03)

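As a worked instance of the interval privacy definition, consider uniform noise R on [-α, α], and assume the attacker uses only the perturbed value y = x + r and no prior on x. The value α = 125 matches the uniform noise range [-125, 125] used in the experiments of this talk; the rest is a sketch under those assumptions.

```latex
% r is uniform on [-\alpha, \alpha], so |r| \le c\,\alpha holds with probability c.
P\bigl(y - c\,\alpha \le x \le y + c\,\alpha\bigr) = c
\quad\Longrightarrow\quad
b - a = 2\,c\,\alpha .
% With \alpha = 125 and c = 0.95:
b - a = 2 \times 0.95 \times 125 = 237.5 .
```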
Disclosure Measure
Measure the similarity between the individual's privacy interval and the attacker's estimated range.
A point is completely disclosed if its estimated range
  contains the original value, and
  falls fully within the pre-specified privacy interval.

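The two conditions for a completely disclosed point translate into a short check. This is a sketch that takes the estimated ranges and privacy intervals as given inputs and does not assume how the attacker produced them; the function names and the numbers are hypothetical.

```python
def is_completely_disclosed(original, estimated_range, privacy_interval):
    """True if the estimated range contains the original value and lies inside the privacy interval."""
    est_lo, est_hi = estimated_range
    priv_lo, priv_hi = privacy_interval
    contains_original = est_lo <= original <= est_hi
    within_privacy = priv_lo <= est_lo and est_hi <= priv_hi
    return contains_original and within_privacy

def disclosure_ratio(originals, estimated_ranges, privacy_intervals):
    """Fraction of points that are completely disclosed."""
    flags = [is_completely_disclosed(x, est, priv)
             for x, est, priv in zip(originals, estimated_ranges, privacy_intervals)]
    return sum(flags) / len(flags)

# Made-up example with three individuals.
originals = [100, 200, 300]
estimated = [(95, 105), (150, 260), (310, 320)]
privacy = [(90, 110), (180, 220), (270, 330)]
ratio = disclosure_ratio(originals, estimated, privacy)
```

Only the first individual is completely disclosed here: the second estimated range is wider than the privacy interval, and the third does not contain the original value.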
Empirical Evaluation
Data sets:
  Bank: 5 attributes (Home Equity, Stock/Bonds, Liabilities, Savings, CDs), 50,000 tuples
  Signal: 35 correlated features (sinusoidal, square, triangle, normal distributions), 30,000 tuples
Pre-specified privacy intervals for each individual: [u_i(1−p), u_i(1+p)], where p is varied

IQR from Reconstructed Dist. Using AS with Uniform Noise
Setup: Bank data set, attribute Stock/Bonds; uniform noise on [-125, 125]; 95% IQR
Information loss for AS: 14.6%
[Figure: ratio of complete disclosure points for three inferences: direct IQR inference (from the perturbed data), IQR with AS inference (from the reconstructed distribution), and ideal IQR inference (from the original data).]

IQR from Reconstructed Dist. Using AS with Uniform Noise
[Table: interval p (%) versus number of disclosed points for direct, IQR with AS, and ideal inference.]

AS vs. SF with Gaussian Noise
Setup: Signal data set, feature 2 (sinusoidal distribution); Gaussian noise N(0, 8); 95% IQR
Information loss for AS: 32.9%
Information loss for SF: 47.0%

Disclosure vs. Noise
Setup: Bank data set, attribute Stock/Bonds; uniform noise with varied range; 95% IQR

Extend to Multivariate Cases
In practice, the distribution of multiple numerical attributes is often modeled by one multivariate normal distribution, N(μ, Σ).
The ellipsoid {z : (z − μ)′ Σ^{−1} (z − μ) ≤ χ²_p(α)} contains a fixed percentage, (1 − α)100%, of the data values.
The projection of this ellipsoid on axis z_i has the bound:
  \mu_i - \sqrt{\chi^2_p(\alpha)\,\sigma_{ii}} \le z_i \le \mu_i + \sqrt{\chi^2_p(\alpha)\,\sigma_{ii}}
where σ_ii is the i-th diagonal entry of Σ.

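The per-attribute bound obtained by projecting the ellipsoid can be computed as follows. This sketch assumes μ and Σ are already known or estimated, and it restricts the demonstration to p = 2 attributes because the chi-square quantile then has the closed form -2 ln(α); for other p the quantile would have to come from a statistics library. The function names and the numbers are hypothetical.

```python
import math

def chi2_upper_quantile_2dof(alpha):
    """Upper-alpha quantile of the chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(alpha)

def projection_bounds(mu, sigma, chi2_value):
    """Axis-aligned bounds of the ellipsoid {z : (z - mu)' Sigma^{-1} (z - mu) <= chi2_value}."""
    bounds = []
    for i, m in enumerate(mu):
        half_width = math.sqrt(chi2_value * sigma[i][i])
        bounds.append((m - half_width, m + half_width))
    return bounds

# Made-up mean vector and covariance matrix for two attributes.
mu = [50.0, 100.0]
sigma = [[25.0, 10.0],
         [10.0, 100.0]]
q = chi2_upper_quantile_2dof(0.05)           # about 5.99 for a 95% ellipsoid
bounds = projection_bounds(mu, sigma, q)
```

Only the diagonal of Σ enters the projected bounds; the off-diagonal entries shape the ellipsoid but not its extent along each axis.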
Related Work
Rotation based approach: Y = RX
When R is an orthonormal matrix (RR^T = I):
  Vector length: |Rx| = |x|
  Euclidean distance: |Rx − Ry| = |x − y|
  Inner product: <Rx, Ry> = <x, y>
Popular classifiers and clustering methods are invariant to this perturbation.
K. Liu, H. Kargupta, et al. Random projection based multiplicative data perturbation for privacy preserving distributed data mining. TKDE.
K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. ICDM 2005.

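The three invariants can be checked numerically for a concrete orthonormal R. The sketch below uses a plane rotation by an arbitrary angle as the example matrix; the vectors, the angle, and the helper names are made up for illustration.

```python
import math

def rotate(v, theta):
    """Apply the 2-D rotation matrix R(theta), which satisfies R R^T = I."""
    x, y = v
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def norm(v):
    return math.sqrt(dot(v, v))

def distance(u, v):
    return norm((u[0] - v[0], u[1] - v[1]))

x, y = (3.0, 4.0), (-1.0, 2.0)
theta = 0.7
rx, ry = rotate(x, theta), rotate(y, theta)

length_preserved = math.isclose(norm(rx), norm(x))
distance_preserved = math.isclose(distance(rx, ry), distance(x, y))
inner_product_preserved = math.isclose(dot(rx, ry), dot(x, y))
```

Any method that depends on the data only through lengths, distances, or inner products therefore produces the same result on Y as on X.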
Is Y = RX Secure?
Y = RX, where RR^T = R^T R = I
Original data X:
     Bal   Income  ...  IntP
1    10k   85k     ...  2k
2    15k   70k     ...  18k
3    50k   120k    ...  35k
4    45k   23k     ...  134k
...
N    80k   110k    ...  15k

Our Preliminary Results
Even Y = RX + E is NOT secure when some a-priori knowledge is available to attackers.
R can be any random matrix.

A-priori Knowledge ICA Based Attack
Privacy can be breached when a small subset of the original data X is available to attackers.
[Table: original data X with attributes Bal, Income, ..., IntP for records 1 to N.]

Summary
The reconstructed distribution can be exploited by attackers to derive sensitive individual information.
We presented a simple IQR-based attack method.
More complex and effective attack methods exist.
More research is needed on attack methods from the attacker's point of view.

Acknowledgement
NSF Grant CCR IIS
Personnel: Xintao Wu, Songtao Guo, Ling Guo

Questions? Thank you!

Information Loss
Distribution level
Individual value level

National Laws
US:
  HIPAA for health care: passed August 21, 1996; sets the lowest bar, and the states are welcome to enact more stringent rules
  California State Bill 1386
  Gramm-Leach-Bliley Act of 1999 for financial institutions
  COPPA for children's online privacy
  etc.
Canada:
  PIPEDA 2000 (Personal Information Protection and Electronic Documents Act): effective from Jan 2004
European Union (Directive 95/46/EC):
  Passed by the European Parliament in Oct 95 and effective from Oct 98
  Provides guidelines for member state legislation
  Forbids sharing data with states that do not protect privacy

ICA Direct Attack?
Can we get X when only Y is available? It seems Independent Component Analysis (ICA) can help.
General linear perturbation model: Y = RX + E
ICA model: X = AS + N

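ICA
[Figure: in the linear mixing process, a mixing matrix combines the sources into the observed signals; in the separation process, a demixing matrix produces the separated signals. The demixing matrix is optimized against a cost function that checks whether the separated signals are independent.]
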
Restrictions of ICA
Restrictions:
  All the components s_i should be independent.
  They must be non-Gaussian, with the possible exception of one component.
  The number of observed linear mixtures m must be at least as large as the number of independent components n.
  The matrix A must be of full column rank.
Can we apply ICA directly? No:
  There are correlations among the attributes of X.
  More than one attribute may have a Gaussian distribution.
Models compared: Y = RX + E and X = AS + N

A-priori Knowledge based ICA (AK-ICA) Attack

Correctness of AK-ICA
We prove that J exists such that [...]
J represents the connection between the distributions of [...] and [...]