PDM Workshop April 8, 2006 Deriving Private Information from Perturbed Data Using IQR-based Approach Songtao Guo UNC Charlotte Xintao Wu UNC Charlotte.

Presentation transcript:

PDM Workshop April 8, 2006 Deriving Private Information from Perturbed Data Using IQR-based Approach Songtao Guo UNC Charlotte Xintao Wu UNC Charlotte Yingjiu Li Singapore Management Univ

HIPAA for health care; California State Bill 1386; Gramm-Leach-Bliley Act for financial institutions; COPPA for children's online privacy; etc. PIPEDA 2000. European Union (Directive 95/46/EC).

Mining vs. Privacy
- Data mining: the goal of data mining is summary results (e.g., classification, clustering, association rules) from the data (distribution).
- Individual privacy: individual values in the database must not be disclosed, or at least no close estimate may be derived by attackers.
- Privacy Preserving Data Mining (PPDM): how to "perturb" data such that we can build a good data mining model (data utility) while preserving each individual's privacy at the record level (privacy)?

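Our Focus
Columns: SSN, Name, Zip, Age, Sex, Balance, …, Income, Interest Paid
Rows as transcribed (masked values appear as ***):
1    ***   M   10k   …   85k    2k
2    ***   F   15k   …   70k    18k
3    ***   M   50k   …   120k   35k
…
n    ***   M   80k   …   110k   15k
Annotations on the table: "Focus in this talk"; "k-anonymity, L-diversity, SDC etc."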

Additive Noise based PPDM
- Distribution reconstruction: AS method (Agrawal and Srikant, SIGMOD 00); EM method (Agrawal and Aggarwal, PODS 01)
- Individual value reconstruction: Spectral Filtering (SF) (Kargupta et al., ICDM 03); PCA (Huang, Du and Chen, SIGMOD 05)

Additive Randomization (Y = X + R)
A random number is added to each value before release. Example: Alice's age 30 becomes 65 (30 + 35).
Original records (30 | 70K | ..., 50 | 40K | ...) pass through the randomizer and become (65 | 20K | ..., 25 | 60K | ...). The distributions of Age and Salary are then reconstructed and passed to the classification algorithm, which builds the model.
R. Agrawal and R. Srikant, SIGMOD 00
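
The additive scheme Y = X + R can be sketched in a few lines. This is only an illustration: the uniform noise range of 35, the synthetic ages, and the names `randomize` and `mean` are assumptions made for the example, not values or code from the talk.

```python
import random

def randomize(values, noise_range, rng):
    """Return Y = X + R, with R drawn uniformly from [-noise_range, +noise_range]."""
    return [x + rng.uniform(-noise_range, noise_range) for x in values]

def mean(values):
    return sum(values) / len(values)

rng = random.Random(0)                              # fixed seed so the sketch is repeatable
ages = [rng.gauss(40, 10) for _ in range(20000)]    # made-up original ages
perturbed = randomize(ages, 35, rng)

# Individual values move a lot, but the aggregate (here the mean) barely changes.
print(round(mean(ages), 1), round(mean(perturbed), 1))
```

Each released value can be off by as much as the noise range, while aggregates over many records stay close to the originals, which is what makes distribution reconstruction possible.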

Distribution Reconstruction
Algorithm:
  f_X^0 := uniform distribution
  j := 0   // iteration number
  repeat
    f_X^{j+1}(a) := [update expression not preserved]
    j := j + 1
  until (stopping criterion met)
Converges to the maximum likelihood estimate (Agrawal and Aggarwal, PODS 01).
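
A minimal sketch of the iterative reconstruction, assuming the update is the Bayesian one published by Agrawal and Srikant, $f_X^{j+1}(a) = \frac{1}{n}\sum_{i=1}^{n} \frac{f_R(y_i - a)\, f_X^{j}(a)}{\int f_R(y_i - z)\, f_X^{j}(z)\, dz}$, discretized on a grid. The Gaussian noise, the grid, the L1 stopping criterion, and all names below are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_pdf(d, sd):
    return np.exp(-0.5 * (d / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def reconstruct_distribution(y, grid, noise_sd, iters=100, tol=1e-6):
    """Estimate P(X = grid[k]) from perturbed values y = x + r."""
    p = np.full(grid.size, 1.0 / grid.size)           # f_X^0 := uniform
    # like[i, k] = f_R(y_i - a_k): density of the noise needed to turn a_k into y_i
    like = gaussian_pdf(y[:, None] - grid[None, :], noise_sd)
    for _ in range(iters):
        post = like * p                                # unnormalized posterior per record
        post /= post.sum(axis=1, keepdims=True)        # normalize over grid points
        p_new = post.mean(axis=0)                      # average the posteriors over records
        converged = np.abs(p_new - p).sum() < tol      # assumed stopping criterion
        p = p_new
        if converged:
            break
    return p

rng = np.random.default_rng(0)
n = 4000
# Synthetic bimodal original data (illustrative only)
x = np.where(rng.random(n) < 0.7, rng.normal(30, 3, n), rng.normal(60, 3, n))
y = x + rng.normal(0, 10, n)                           # perturbed data Y = X + R
grid = np.arange(0.0, 91.0, 3.0)
p_hat = reconstruct_distribution(y, grid, noise_sd=10.0)
est_mean = float((p_hat * grid).sum())
```

Only the perturbed values `y` and the noise density are used; the original `x` appears solely to generate the test data.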

Individual Reconstruction
Spectral Filtering Technique (Kargupta et al., ICDM 03)
- Apply EVD.
- Using the covariance of V, extract the first k principal components:
  - $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k \ge \lambda_e$, and $e_1, e_2, \ldots, e_k$ are the corresponding eigenvectors.
  - $Q_k = [e_1\ e_2\ \cdots\ e_k]$ forms an orthonormal basis of a subspace X.
- Find the orthogonal projection onto X and estimate the data from it.
PCA Technique (Huang, Du and Chen, SIGMOD 05)
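
The projection step can be sketched as follows. The transcript does not preserve the projection and estimation formulas, so this assumes the usual construction: project the centered perturbed records onto the span of the top-k eigenvectors, $\hat{X} = Y Q_k Q_k^T$ with records as rows. The function name, the synthetic data, and the choice k = 2 are illustrative assumptions.

```python
import numpy as np

def pca_reconstruct(Y, k):
    """Project perturbed records (rows of Y) onto the top-k principal subspace."""
    mean = Y.mean(axis=0)
    Yc = Y - mean                                   # center the data
    cov = np.cov(Yc, rowvar=False)                  # covariance of the perturbed data
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigh returns ascending eigenvalues
    Qk = eigvecs[:, ::-1][:, :k]                    # eigenvectors of the k largest eigenvalues
    return Yc @ Qk @ Qk.T + mean                    # orthogonal projection, then un-center

rng = np.random.default_rng(0)
latent = rng.normal(0, 5, size=(2000, 2))           # two hidden factors (assumed)
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 50.0                          # 10 highly correlated attributes
Y = X + rng.normal(0, 1, size=X.shape)              # additive noise
X_hat = pca_reconstruct(Y, k=2)

mse_perturbed = float(np.mean((Y - X) ** 2))
mse_reconstructed = float(np.mean((X_hat - X) ** 2))
print(mse_perturbed, mse_reconstructed)
```

When the attributes are strongly correlated, most of the independent noise lies outside the principal subspace, so the projected values sit closer to the originals than the released ones.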

Motivation
The goal of randomization-based perturbation:
- To hide the sensitive data by randomly modifying the data values using additive noise.
- To keep the aggregate characteristics or distribution unchanged or recoverable.
Do those aggregate characteristics or distributions contain confidential (private) information that snoopers may exploit to derive an individual's sensitive data?

Our Scenario
- A single party (data holder) holds a collection of original individual data.
- Each individual data value is associated with one privacy interval (privacy policies, corporate agreements).
- The data holder can use the data or release it to a third party for analysis, but is required not to disclose any individual value within its privacy interval.
Row   Balance   …   Income   Interest Paid
1     10k       …   85k      2k
2     15k       …   70k      18k
3     50k       …   120k     35k
…
n     80k       …   110k     15k

Inter-Quantile Range (IQR)
The inter-quantile range $[x_{\alpha_1}, x_{\alpha_2}]$ is defined by $P(x_{\alpha_1} \le x \le x_{\alpha_2}) \ge c\%$, where $c = \alpha_2 - \alpha_1$ denotes the confidence.
The IQR measures the spread and variability of the variable, so attackers can use it to estimate the range of each individual value.
IQR used here: $[x_{(1-c)/2}, x_{(1+c)/2}]$.
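
As a small illustration, the symmetric range $[x_{(1-c)/2}, x_{(1+c)/2}]$ can be read off empirical quantiles. The normal samples and the name `iqr_interval` are assumptions made for the example.

```python
import numpy as np

def iqr_interval(samples, c=0.95):
    """Return the empirical inter-quantile range [x_((1-c)/2), x_((1+c)/2)]."""
    lo, hi = np.quantile(samples, [(1 - c) / 2, (1 + c) / 2])
    return float(lo), float(hi)

rng = np.random.default_rng(0)
samples = rng.normal(100.0, 15.0, size=50_000)      # made-up attribute values
lo, hi = iqr_interval(samples, c=0.95)

# Fraction of samples that fall inside the range; close to c by construction.
coverage = float(np.mean((samples >= lo) & (samples <= hi)))
print(lo, hi, coverage)
```

An attacker who holds only a reconstructed distribution could draw samples from it, or use its quantile function directly, in place of `samples`.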

Comparison with Other Privacy Definitions
- Interval privacy (Agrawal and Srikant, SIGMOD 00): if the original value can be estimated with c% confidence to lie in the interval [a, b], then the interval width (b - a) defines the amount of privacy at the c% confidence level.
- Mutual information (Agrawal and Aggarwal, PODS 01)
- Reconstruction privacy (Rizvi and Haritsa, VLDB 02)
- $\rho_1$-to-$\rho_2$ privacy breach (Evfimievski et al., PODS 03)

Disclosure Measure
Measure the similarity between the individual's privacy interval and the attacker's estimated range.
A point is completely disclosed if its estimated range:
- contains the original value
- fully falls within the pre-specified privacy interval

Empirical Evaluation
Data sets:
- Bank: 5 attributes (Home Equity, Stock/Bonds, Liabilities, Savings, CDs); 50,000 tuples
- Signal: 35 correlated features (sinusoidal, square, triangle, normal distributions); 30,000 tuples
Pre-specified individual privacy intervals: $[u_i(1-p), u_i(1+p)]$, where p is varied.
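
A hedged sketch of how the ratio of completely disclosed points could be computed for privacy intervals of the form $[u_i(1-p), u_i(1+p)]$. It assumes a point counts as completely disclosed only when both conditions hold (the estimated range contains the original value and lies inside the privacy interval); the function names and the three sample values are hypothetical.

```python
def privacy_interval(u, p):
    """Privacy interval [u(1-p), u(1+p)]; sorted so it also works for negative u."""
    lo, hi = sorted((u * (1 - p), u * (1 + p)))
    return lo, hi

def is_completely_disclosed(u, estimated_range, p):
    est_lo, est_hi = estimated_range
    priv_lo, priv_hi = privacy_interval(u, p)
    contains_original = est_lo <= u <= est_hi
    inside_privacy_interval = priv_lo <= est_lo and est_hi <= priv_hi
    # Assumption: both conditions are required together.
    return contains_original and inside_privacy_interval

def disclosure_ratio(originals, estimated_ranges, p):
    flags = [is_completely_disclosed(u, r, p)
             for u, r in zip(originals, estimated_ranges)]
    return sum(flags) / len(flags)

# Hypothetical example: three records with original value 100 and p = 0.2,
# so the privacy interval is [80, 120].
originals = [100.0, 100.0, 100.0]
ranges = [(90.0, 110.0),    # contains 100 and lies inside [80, 120]
          (70.0, 110.0),    # contains 100 but extends below 80
          (101.0, 110.0)]   # inside [80, 120] but misses 100
ratio = disclosure_ratio(originals, ranges, p=0.2)
```

In an attack, `ranges` would hold the attacker's estimated range for each record, for example an IQR derived from the perturbed or reconstructed distribution.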

IQR from Reconstructed Dist. Using AS with Uniform Noise
[Figure: ratio of complete disclosure points for three inference methods: direct inference (IQR from the perturbed data), inference with AS (IQR from the reconstructed distribution), and ideal inference (IQR from the original data).]
Setting: uniform noise on [-125, 125]; Bank data set, attribute Stock/Bonds; 95% IQR.
Information loss for AS: 14.6%

IQR from Reconstructed Dist. Using AS with Uniform Noise
[Table: cell values not preserved. Header text as transcribed: interval p (%); no. of disclosed points; (100%) D; direct; IQR ideal; IQR with AS; ideal; AS.]

AS vs. SF with Gaussian Noise
Setting: Gaussian noise N(0, 8); Signal data set, Feature 2 (sinusoidally distributed); 95% IQR.
Information loss for AS: 32.9%
Information loss for SF: 47.0%

Disclosure vs. Noise
Setting: uniform noise with varied range; Bank data set, attribute Stock/Bonds; 95% IQR.

Extend to Multivariate Cases
In practice, the distribution of multiple numerical attributes is often modeled by one multivariate normal distribution, $N(\mu, \Sigma)$.
The ellipsoid $\{z : (z - \mu)' \Sigma^{-1} (z - \mu) \le \chi^2_p(\alpha)\}$ contains a fixed percentage, $(1 - \alpha)100\%$, of the data values.
The projection of this ellipsoid onto each axis $z_i$ is bounded.
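
The bound itself is not preserved in the transcript. A standard result for such ellipsoids is that the projection onto axis $z_i$ is $\mu_i \pm \sqrt{\chi^2_p(\alpha)\,\sigma_{ii}}$, where $\sigma_{ii}$ is the i-th diagonal entry of $\Sigma$; the sketch below assumes that is the bound meant here and checks it numerically in two dimensions, where $\chi^2_2(\alpha) = -2\ln\alpha$ has a closed form. The mean vector, covariance matrix, and function name are made up for the example.

```python
import math
import numpy as np

def projection_bounds(mu, sigma, chi2_quantile):
    """Per-axis bounds mu_i +/- sqrt(chi2_quantile * sigma_ii) of the ellipsoid."""
    half_width = np.sqrt(chi2_quantile * np.diag(sigma))
    return mu - half_width, mu + half_width

mu = np.array([50.0, 100.0])                         # illustrative mean
sigma = np.array([[25.0, 30.0],
                  [30.0, 100.0]])                    # illustrative covariance
alpha = 0.05
c2 = -2.0 * math.log(alpha)                          # upper-alpha chi-square quantile for p = 2
lower, upper = projection_bounds(mu, sigma, c2)

# Numerical check: trace the ellipse boundary z = mu + sqrt(c2) * L u with |u| = 1,
# where L is the Cholesky factor of sigma, and take the extremes per axis.
theta = np.linspace(0.0, 2.0 * np.pi, 20001)
unit = np.stack([np.cos(theta), np.sin(theta)])
L = np.linalg.cholesky(sigma)
boundary = mu[:, None] + math.sqrt(c2) * (L @ unit)
print(lower, upper, boundary.min(axis=1), boundary.max(axis=1))
```

For more than two attributes the same function applies once the chi-square quantile with p degrees of freedom is supplied.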

Related Work
Rotation based approach: Y = RX
When R is an orthonormal matrix ($R R^T = I$):
- Vector length: $|Rx| = |x|$
- Euclidean distance: $|Rx - Ry| = |x - y|$
- Inner product: $\langle Rx, Ry \rangle = \langle x, y \rangle$
Popular classifiers and clustering methods are invariant to this perturbation.
K. Liu, H. Kargupta, et al. Random projection based multiplicative data perturbation for privacy preserving distributed data mining. TKDE.
K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. ICDM 2005.
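
The three invariants can be checked numerically. In this sketch the orthonormal matrix is obtained from a QR decomposition of a random matrix, which is one convenient construction and not necessarily the one used in the cited papers; the dimension and vectors are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Q from a QR decomposition of a square matrix has orthonormal columns and rows.
R, _ = np.linalg.qr(rng.normal(size=(4, 4)))
x = rng.normal(size=4)
y = rng.normal(size=4)

length_preserved = np.isclose(np.linalg.norm(R @ x), np.linalg.norm(x))
distance_preserved = np.isclose(np.linalg.norm(R @ x - R @ y), np.linalg.norm(x - y))
inner_product_preserved = np.isclose((R @ x) @ (R @ y), x @ y)
print(length_preserved, distance_preserved, inner_product_preserved)
```

Because distances and inner products survive the transformation, any method that depends only on them behaves the same on Y as on X.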

Is Y = RX Secure?
Y = RX, with $R R^T = R^T R = I$
Original data X:
Row   Bal   income   …   IntP
1     10k   85k      …   2k
2     15k   70k      …   18k
3     50k   120k     …   35k
4     45k   23k      …   134k
…
N     80k   110k     …   15k

Our Preliminary Results
Even Y = RX + E is NOT secure when some a-priori knowledge is available to attackers.
R can be any random matrix.

A-priori Knowledge ICA Based Attack
Privacy can be breached when a small subset of the original data X is available to attackers.
Row   Bal   income   …   IntP
1     10k   85k      …   2k
2     15k   70k      …   18k
3     50k   120k     …   35k
4     45k   23k      …   134k
…
N     80k   110k     …   15k

Summary
- The reconstructed distribution can be exploited by attackers to derive sensitive individual information.
- We present a simple IQR-based attack method.
- Complex and effective attack methods exist.
- More research is needed on attack methods from the attacker's point of view.

Acknowledgement
NSF Grant CCR, IIS
Personnel: Xintao Wu, Songtao Guo, Ling Guo

Questions? Thank you!

Information Loss
- Distribution level
- Individual value level

National Laws
US
- HIPAA for health care. Passed August 21, 96. It sets the lowest bar, and the States are welcome to enact more stringent rules. California State Bill 1386.
- Gramm-Leach-Bliley Act of 1999 for financial institutions
- COPPA for children's online privacy
- etc.
Canada
- PIPEDA 2000: Personal Information Protection and Electronic Documents Act. Effective from Jan 2004.
European Union (Directive 95/46/EC)
- Passed by European Parliament Oct 95 and effective from Oct 98.
- Provides guidelines for member state legislation.
- Forbids sharing data with states that do not protect privacy.

ICA Direct Attack?
Can we get X when only Y is available? It seems Independent Component Analysis (ICA) can help.
- General linear perturbation model: Y = RX + E
- ICA model: X = AS + N

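ICA
[Diagram: in the linear mixing process, a mixing matrix maps the source signals to the observed signals. In the separation process, a demixing matrix maps the observed signals to the separated signals; a cost function that tests whether the separated signals are independent is optimized to find the demixing matrix.]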

Restrictions of ICA
Restrictions:
- All the components $s_i$ should be independent.
- They must be non-Gaussian, with the possible exception of one component.
- The number of observed linear mixtures m must be at least as large as the number of independent components n.
- The matrix A must be of full column rank.
Can we apply ICA directly to Y = RX + E, given the ICA model X = AS + N? No:
- There are correlations among the attributes of X.
- More than one attribute may have a Gaussian distribution.

A-priori Knowledge based ICA (AK-ICA) Attack

Correctness of AK-ICA
We prove that J exists such that J represents the connection between the distributions of and