SAC’06 April 23-27, 2006, Dijon, France Towards Value Disclosure Analysis in Modeling General Databases Xintao Wu UNC Charlotte Songtao Guo UNC Charlotte.

1 Towards Value Disclosure Analysis in Modeling General Databases
SAC’06, April 23-27, 2006, Dijon, France
Xintao Wu, UNC Charlotte
Songtao Guo, UNC Charlotte
Yingjiu Li, Singapore Management Univ.

2 Outline
- Motivation
- General Location Model
- Value Disclosure Analysis
  - Basic disclosure scenario
  - Conditional disclosure scenario
  - Combinatorial disclosure scenario
- Conclusion and Future Work
SAC, Dijon, France, April 23-27, 2006

3 Motivation
- Information disclosure in general databases
  - Identity disclosure
  - Value disclosure
     SSN  Name  Zip    Race   Age  Sex  Dividends  Wages  Interests
  1             28223  Asian  20   M    10k        85k    2k
  2             28223  Asian  30   F    15k        70k    18k
  3             28262  Black  20   M    50k        120k   35k
  ...
  n             28223  Asian  20   M    80k        110k   15k

4 Motivation
- Previous work
  - Additive randomization approach
    - Agrawal & Srikant, SIGMOD00; Agrawal & Aggarwal, PODS01
    - Kargupta et al., ICDM03; Du et al., SIGMOD05
    - Various methods from statistical databases
  - Multiplicative rotation approach
    - Chen et al., ICDM05
    - Kargupta et al., TKDE06
  - Limitations
    - Disclosure analysis is conducted on the data space
    - Prone to potential attacks
- Our modeling-based approach
  - First build an approximate statistical model
  - Analyze disclosure on the parameter space
  - Apply the model to generate data for future mining

5 Application
- Database application testing
  - Testing on local development databases
    - only a small number of data samples
    - cannot conduct performance testing
  - Testing against live production databases
    - risk of privacy disclosure
    - risk of incorrectly updating the underlying databases
- Generate mock databases for application software testing such that the generated data are
  - valid
  - similar to the original data in terms of statistical distribution
  - privacy preserving

6 [Figure: system architecture. Labeled components: ER, Data, DDL, Catalog, Schema & Domain Filter, Schema’, Domain’, Disclosure Assessment, Performance Assessment, General Location Model, Data Generator, Synthetic database. Edge labels: R, NR, S.]

7 General Location Model
     SSN  Name  Zip    Race   Age  Sex  Dividends  Wages  Interests
  1             28223  Asian  20   M    10k        85k    2k
  2             28223  Asian  30   F    15k        70k    18k
  3             28262  Black  20   M    50k        120k   35k
  ...
  n             28223  Asian  20   M    80k        110k   15k
- Categorical attributes (multinomial distribution)
- Numerical attributes (multivariate Gaussian distributions)

8 General Location Model
- Given a dataset of n tuples with categorical attributes and numerical attributes
- The categorical part can be summarized by a contingency table of cells; the number of tuples in each cell has a multinomial distribution
- For each cell d, the numerical attributes follow a conditional multivariate normal distribution
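The symbols on this slide did not survive in the transcript. As a rough sketch only, the general location model is commonly written as below. The notation ($A_j$, $d_j$, $Z_j$, $D$, $x_d$, $\pi_d$, $\mu_d$, $\Sigma$) and the covariance shared across cells are assumptions taken from the standard textbook formulation and may differ from the authors' own symbols.

```latex
\begin{align*}
&\text{Categorical attributes: } A_1, \dots, A_p, \quad A_j \text{ taking } d_j \text{ levels} \\
&\text{Numerical attributes: } Z_1, \dots, Z_q \\
&\text{Contingency table with } D = \prod_{j=1}^{p} d_j \text{ cells and cell counts } x = (x_1, \dots, x_D) \\
&x \mid \pi \sim \mathrm{Multinomial}(n, \pi), \qquad \pi = (\pi_1, \dots, \pi_D), \quad \sum_{d=1}^{D} \pi_d = 1 \\
&z_i \mid (\text{tuple } i \text{ falls in cell } d) \sim N_q(\mu_d, \Sigma)
\end{align*}
```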

9 Parameter Fitting
- The maximum likelihood estimates (MLE) of the parameters are computed from the set of tuples belonging to each cell d
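The estimation formulas themselves are missing from the transcript. The sketch below assumes the usual maximum likelihood estimates for a general location model with a covariance matrix shared by all cells: the cell probability is the cell count divided by n, the cell mean is the average of the numerical vectors in the cell, and the covariance pools the within-cell deviations and divides by n. The function name `fit_general_location_model`, the column layout, and the choice to treat Zip, Race, Age, and Sex as categorical are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict


def fit_general_location_model(rows, cat_idx, num_idx):
    """Return (pi, mu, sigma): cell probabilities, per-cell means, pooled covariance.

    Assumed estimator (not confirmed by the slides):
      pi_d  = x_d / n
      mu_d  = mean of the numerical vectors in cell d
      sigma = (1/n) * sum over cells of within-cell outer products of deviations
    """
    n = len(rows)
    cells = defaultdict(list)  # cell key -> numerical vectors of the tuples in that cell
    for row in rows:
        key = tuple(row[i] for i in cat_idx)
        cells[key].append([float(row[i]) for i in num_idx])

    q = len(num_idx)
    pi = {d: len(zs) / n for d, zs in cells.items()}
    mu = {d: [sum(z[j] for z in zs) / len(zs) for j in range(q)]
          for d, zs in cells.items()}

    sigma = [[0.0] * q for _ in range(q)]
    for d, zs in cells.items():
        for z in zs:
            dev = [z[j] - mu[d][j] for j in range(q)]
            for a in range(q):
                for b in range(q):
                    sigma[a][b] += dev[a] * dev[b] / n
    return pi, mu, sigma


# The four rows shown on the slides, with amounts in thousands.
# Columns: Zip, Race, Age, Sex, Dividends, Wages, Interests.
rows = [
    (28223, "Asian", 20, "M", 10, 85, 2),
    (28223, "Asian", 30, "F", 15, 70, 18),
    (28262, "Black", 20, "M", 50, 120, 35),
    (28223, "Asian", 20, "M", 80, 110, 15),
]
pi, mu, sigma = fit_general_location_model(rows, cat_idx=[0, 1, 2, 3], num_idx=[4, 5, 6])
```

With only four tuples the estimates are of course degenerate; the point is the grouping by cell, not the numbers.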

10 Value Disclosure
- Attackers may be able to estimate or infer the value of a confidential numerical attribute of an entity, or of a group of entities, with a level of accuracy higher than a threshold
- All numerical attribute values are generated from a multivariate normal distribution, specifically from the distribution of the cell the tuple belongs to
  SSN  Name  Zip    Race   Age  Sex  Dividends  Wages  Interests
             28262  Asian  30   M
             ...
             28262  Asian  30   M
             28223  White  50   F
             28223  White  50   F
             ...
             28223  White  50   F

11 Value Disclosure Analysis
- Basic disclosure scenario
  - All numerical attributes are confidential
  - The analysis is based on the probability density contour
  - Disclosure is measured in terms of a confidence interval or confidence region
- Conditional scenario
  - Non-confidential and confidential attributes are both present
- Combinatorial scenario
  - Linear combinations exist among both confidential and non-confidential attributes

12 Privacy Measure
- Confidence interval
  - Agrawal & Srikant, SIGMOD00
  - If the original value can be estimated with c% confidence to lie in the interval [a, b], then the interval width (b - a) defines the amount of privacy at the c% confidence level
- Confidence region
  - In the p-dimensional case, a c% confidence region is determined by the probability density contour of the data
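As a minimal sketch of the interval-width measure, assume the attacker's estimate of a value is Gaussian with known standard deviation `sigma`. The shortest c% interval is then centered on the mean, and its width is the privacy amount. The function name `interval_privacy` and the Gaussian assumption are illustrative and not part of the slides.

```python
from statistics import NormalDist


def interval_privacy(sigma, confidence=0.95):
    """Width (b - a) of the central confidence interval of a N(mu, sigma^2) value."""
    # Two-sided interval: put (1 - confidence) / 2 probability in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return 2.0 * z * sigma


width_95 = interval_privacy(10.0, 0.95)  # about 39.2 for sigma = 10
width_50 = interval_privacy(10.0, 0.50)  # narrower interval at lower confidence
```

A larger width at the same confidence level means the attacker can pin the value down less precisely, that is, more privacy.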

13 Basic Disclosure Scenario
- Confidential attributes X ~ N(μ, Σ)
- The probability density contour is a multidimensional ellipsoid; its projection on axis z_i is a bounded interval
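The bounds formula is missing from the transcript. A standard result for the ellipsoid (x - μ)ᵀ Σ⁻¹ (x - μ) ≤ c is that its projection on axis i is μ_i ± sqrt(c · σ_ii), where c is the chi-square quantile for the chosen confidence. The sketch below checks this numerically for the bivariate case only, where the quantile has the closed form -2 ln(1 - confidence). All function names and the example parameters are assumptions made for illustration.

```python
import math


def chi2_quantile_2dof(confidence):
    # Closed form valid only for 2 degrees of freedom: P(chi2 <= t) = 1 - exp(-t / 2).
    return -2.0 * math.log(1.0 - confidence)


def projection_bounds(mu, sigma, confidence=0.95):
    """Axis-aligned bounds of the confidence ellipse of a bivariate normal."""
    c = chi2_quantile_2dof(confidence)
    return [(m - math.sqrt(c * sigma[i][i]), m + math.sqrt(c * sigma[i][i]))
            for i, m in enumerate(mu)]


def boundary_extent(mu, sigma, confidence=0.95, steps=3600):
    """Numerical check: walk the ellipse boundary and record min/max per axis."""
    r = math.sqrt(chi2_quantile_2dof(confidence))
    # Cholesky factor L of the 2x2 covariance, so that sigma = L L^T.
    l11 = math.sqrt(sigma[0][0])
    l21 = sigma[1][0] / l11
    l22 = math.sqrt(sigma[1][1] - l21 * l21)
    xs, ys = [], []
    for k in range(steps):
        t = 2.0 * math.pi * k / steps
        u, v = math.cos(t), math.sin(t)
        xs.append(mu[0] + r * l11 * u)            # boundary point: mu + r * L @ (u, v)
        ys.append(mu[1] + r * (l21 * u + l22 * v))
    return [(min(xs), max(xs)), (min(ys), max(ys))]


# Hypothetical parameters for two confidential attributes (in thousands).
mu = [85.0, 10.0]
sigma = [[400.0, 120.0], [120.0, 100.0]]
bounds = projection_bounds(mu, sigma)
traced = boundary_extent(mu, sigma)
```

Note that the bounds depend only on the diagonal of Σ; the correlation changes the shape of the ellipse but not its bounding box, which is why a hyper-rectangle is a convenient approximation.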

14 Basic Disclosure Scenario
- Measure privacy
  - Heuristic method
  - Use a hyper-rectangle to approximate the ellipsoid
  - Measure privacy for one dimension
- Adjust parameters
[Figure labels: Original Interval, Dissimilarity Constraint (d), New Interval]

15 Conditional Scenario
- Confidential attributes (X) and non-confidential attributes (S)
  - E.g., the non-confidential values of Dividends and Wages can help predict the confidential values of Interests
- The same method applies, using the conditional parameters of X given S
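The conditional parameters are not preserved in the transcript. For a jointly normal (X, S), the standard conditional distribution of X given S = s is sketched below. The block notation ($\mu_X$, $\mu_S$, $\Sigma_{XX}$, $\Sigma_{XS}$, $\Sigma_{SS}$) is an assumption and may not match the slide's symbols.

```latex
\begin{align*}
\begin{pmatrix} X \\ S \end{pmatrix}
  &\sim N\!\left(
     \begin{pmatrix} \mu_X \\ \mu_S \end{pmatrix},
     \begin{pmatrix} \Sigma_{XX} & \Sigma_{XS} \\ \Sigma_{SX} & \Sigma_{SS} \end{pmatrix}
   \right) \\
X \mid S = s &\sim N\!\left(\mu_{X \mid S},\; \Sigma_{X \mid S}\right) \\
\mu_{X \mid S} &= \mu_X + \Sigma_{XS}\,\Sigma_{SS}^{-1}\,(s - \mu_S) \\
\Sigma_{X \mid S} &= \Sigma_{XX} - \Sigma_{XS}\,\Sigma_{SS}^{-1}\,\Sigma_{SX}
\end{align*}
```

Because $\Sigma_{X \mid S}$ is never larger than $\Sigma_{XX}$, knowing S can only narrow the confidence region for X, which is why the conditional scenario is the more demanding one for privacy.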

16 Combinatorial Scenario
  Race   Age  Sex  Dividends  Wages  Interests  Total Income
  Asian  20   M    10k        85k    2k         97k
  Asian  30   F    15k        70k    18k        103k
  Black  20   M    50k        120k   35k        205k
- Many potential combinations exist, e.g. Dividends + Wages + Interests = Total Income
- Even if the level of security provided for a single confidential attribute is adequate, the level of security provided for linear combinations of confidential attributes could be very low

17 Combinatorial Scenario
- Canonical Correlation Analysis (CCA)
  - A statistical procedure used to identify and quantify the relationship between two sets of variables, S and X
  - CCA can identify the linear combination of variables in one set, X, that has the highest correlation with a linear combination of variables in the other set, S
  - It can be used to evaluate the level of security when the linear combinations of the confidential attributes, X, are estimated from the non-confidential attributes, S

18 Combinatorial Scenario
- Canonical Correlation Analysis (CCA)
  - λ_1: the most general measure of inferential value disclosure for any combination
  - 1 - λ_1: the worst-case security
  - λ_1 ≤ λ: no combinatorial disclosure exists, where λ is the specified bound
- Adjust parameters
  - If λ_i > λ, then set λ_i = λ, keeping the other eigenvalues and the eigenvectors unchanged
  - Recompute the model parameters from the adjusted eigenvalues; part of this adjustment is formulated as an optimization problem
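A small sketch of the eigenvalue check, assuming that the λ_i are the eigenvalues of Σ_XX⁻¹ Σ_XS Σ_SS⁻¹ Σ_SX (the squared canonical correlations), which is the usual CCA formulation but is not spelled out in the transcript. To stay dependency-free, the code handles only two confidential and two non-confidential attributes. The helper names, the example covariance blocks, and the threshold value are all hypothetical, and the final step of rebuilding the model parameters is not attempted.

```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def mul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose2(m):
    return [[m[j][i] for j in range(2)] for i in range(2)]


def squared_canonical_correlations(s_xx, s_xs, s_ss):
    """Eigenvalues of Sxx^-1 Sxs Sss^-1 Ssx for 2x2 blocks, largest first."""
    m = mul2(mul2(inv2(s_xx), s_xs), mul2(inv2(s_ss), transpose2(s_xs)))
    tr = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))  # eigenvalues are real here
    return [(tr + disc) / 2.0, (tr - disc) / 2.0]


def clip_eigenvalues(lams, threshold):
    """Apply the slide's rule: any eigenvalue above the bound is set to the bound."""
    return [min(lam, threshold) for lam in lams]


# Hypothetical standardized covariance blocks for X (confidential) and S (non-confidential).
s_xx = [[1.0, 0.0], [0.0, 1.0]]
s_ss = [[1.0, 0.0], [0.0, 1.0]]
s_xs = [[0.9, 0.0], [0.0, 0.2]]

lams = squared_canonical_correlations(s_xx, s_xs, s_ss)
threshold = 0.5                              # hypothetical bound on lambda
worst_case_security = 1.0 - lams[0]
disclosure = lams[0] > threshold             # True: some combination is too predictable
adjusted = clip_eigenvalues(lams, threshold)
```

In this example one combination of X is predicted from S with squared correlation 0.81, so the worst-case security is only 0.19 even though the second combination is well protected.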

19 Conclusion
- Proposed a model-based privacy-preserving approach
- Investigated value disclosure in three scenarios

20 Future Work
- How to conduct individual value disclosure analysis when individual privacy intervals are specified
- How the information loss due to modeling affects the utility of the generated data

21 Acknowledgement
- NSF Grants CCR-0310974 and IIS-0546027
- Personnel: Xintao Wu and Songtao Guo, UNC Charlotte; Yingjiu Li, Singapore Management Univ.
- More info: http://www.cs.uncc.edu/~xwu/, xwu@uncc.edu

22 Questions? Thank you!

