SAC’06 April 23-27, 2006, Dijon, France Towards Value Disclosure Analysis in Modeling General Databases Xintao Wu UNC Charlotte Songtao Guo UNC Charlotte.

Presentation transcript:

SAC’06, April 23-27, 2006, Dijon, France
Towards Value Disclosure Analysis in Modeling General Databases
Xintao Wu, UNC Charlotte
Songtao Guo, UNC Charlotte
Yingjiu Li, Singapore Management Univ.

Outline
- Motivation
- General Location Model
- Value Disclosure Analysis
  - Basic disclosure scenario
  - Conditional disclosure scenario
  - Combinatorial disclosure scenario
- Conclusion and Future Work

Motivation
- Information disclosure in general databases
  - Identity disclosure
  - Value disclosure

     SSN  Name  Zip    Race   Age  Sex  Dividends  Wages  Interests
                       Asian  20   M    10k        85k    2k
                       Asian  30   F    15k        70k    18k
                       Black  20   M    50k        120k   35k
n               28223  Asian  20   M    80k        110k   15k

Motivation
- Previous work
  - Additive randomization approach
    - Agrawal & Srikant, SIGMOD00; Agrawal & Aggarwal, PODS01
    - Kargupta et al., ICDM03; Du et al., SIGMOD05
    - Various methods from statistical databases
  - Multiplicative rotation approach
    - Chen et al., ICDM05
    - Kargupta et al., TKDE06
  - Limitations
    - Disclosure analysis is conducted on the data space
    - Prone to potential attacks
- Our modeling-based approach
  - First build an approximate statistical model
  - Analyze disclosure on the parameter space
  - Apply the model to generate data for future mining

Application
- Database application testing
  - Testing on local development databases
    - only a small number of data samples
    - performance testing cannot be conducted
  - Testing against live production databases
    - risk of privacy disclosure
    - risk of incorrectly updating the underlying databases
- Generate mock databases for application software testing such that the generated data are
  - valid
  - similar to the original data in terms of statistical distribution
  - privacy preserving

[Figure: system architecture. Labeled components: ER, Data, DDL, Catalog, Schema & Domain Filter, Schema’, Domain’, Disclosure Assessment, Performance Assessment, General Location Model, Data Generator, Synthetic database. Edge labels: R, NR, S.]

General Location Model

     SSN  Name  Zip    Race   Age  Sex  Dividends  Wages  Interests
                       Asian  20   M    10k        85k    2k
                       Asian  30   F    15k        70k    18k
                       Black  20   M    50k        120k   35k
n               28223  Asian  20   M    80k        110k   15k

- Categorical attributes: multinomial distribution
- Numerical attributes: multivariate Gaussian distributions

SAC, Dijon, FranceApril 23-27, General Location Model  Given a dataset which contains n tuples Categorical attributes: Numerical attributes :  The categorical part can be summarized by a contingency table with cells. The number of tuples in each cell, has a multinomial distribution  For each cell d, the numerical attributes satisfy a conditionally multivariate normal distribution

Parameter Fitting
- The parameters are fitted by maximum likelihood estimation (MLE). The estimates for cell d are computed from the set of tuples belonging to cell d.
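The fitting step can be sketched as follows. The slide's own formulas are not preserved in this transcript, so this is a minimal sketch of the textbook maximum likelihood estimates for a general location model, under the assumption that all cells share one covariance matrix (the slides may instead use a covariance per cell). The function name fit_general_location_model and the input layout are hypothetical. The sample rows reuse the Dividends and Wages values (in thousands) from the example table.

```python
from collections import defaultdict


def fit_general_location_model(rows):
    """Fit cell probabilities, per-cell means and a pooled covariance.

    rows: list of (cell, numeric_vector) pairs, where `cell` is a hashable
    combination of categorical values (hypothetical layout for this sketch).
    Assumes a single covariance matrix shared by all cells.
    """
    n = len(rows)
    q = len(rows[0][1])

    # Group the numerical vectors by contingency-table cell.
    cells = defaultdict(list)
    for cell, z in rows:
        cells[cell].append(z)

    # Multinomial part: cell probability = cell count / n.
    pi = {d: len(zs) / n for d, zs in cells.items()}

    # Gaussian part: per-cell mean vector.
    mu = {
        d: [sum(z[j] for z in zs) / len(zs) for j in range(q)]
        for d, zs in cells.items()
    }

    # Pooled covariance: within-cell deviations summed over all cells, divided by n.
    sigma = [[0.0] * q for _ in range(q)]
    for d, zs in cells.items():
        for z in zs:
            dev = [z[j] - mu[d][j] for j in range(q)]
            for a in range(q):
                for b in range(q):
                    sigma[a][b] += dev[a] * dev[b] / n
    return pi, mu, sigma


rows = [
    (("Asian", 20, "M"), [10.0, 85.0]),
    (("Asian", 20, "M"), [80.0, 110.0]),
    (("Asian", 30, "F"), [15.0, 70.0]),
    (("Black", 20, "M"), [50.0, 120.0]),
]
pi, mu, sigma = fit_general_location_model(rows)
```

With only four tuples the estimates are of course not meaningful; the point is the shape of the computation, where disclosure analysis later works on pi, mu and sigma rather than on the raw tuples.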

SAC, Dijon, FranceApril 23-27, Value Disclosure  Attackers may be able to estimate or infer the value of a certain confidential numerical attribute of an entity or a group of entities with a level of accuracy than a threshold  All numerical attribute values are generated from multi- variate normal distribution, specifically from SSNNameZipRaceAgeSexDividendsWagesInterests 28262Asian30M ……… Asian30M 28223White50F 28223White50F ………… 28223White50F

Value Disclosure Analysis
- Basic disclosure scenario
  - All numerical attributes are confidential.
  - The analysis is based on the probability density contour.
  - Disclosure is measured in terms of a confidence interval or confidence region.
- Conditional scenario
  - Non-confidential and confidential attributes coexist.
- Combinatorial scenario
  - Linear combinations exist among both confidential and non-confidential attributes.

Privacy Measure
- Confidence interval
  - Agrawal & Srikant, SIGMOD00
  - If the original value can be estimated with c% confidence to lie in the interval [a, b], then the interval width (b - a) defines the amount of privacy at the c% confidence level.
- Confidence region
  - In the p-dimensional case, a c% confidence region is determined by the probability density contour of the data.
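As a rough illustration of the confidence-interval measure, the sketch below assumes that the attacker's knowledge of a single attribute is a univariate normal distribution with known mean and standard deviation. That modeling choice, the function name privacy_interval, and the example numbers are assumptions made here for illustration, not taken from the slides.

```python
from statistics import NormalDist


def privacy_interval(mu, sigma, c):
    """Return the central interval [a, b] that holds a N(mu, sigma^2) value
    with probability c (for example c = 0.95)."""
    # Two-sided quantile of the standard normal for confidence level c.
    z = NormalDist().inv_cdf(0.5 + c / 2.0)
    return mu - z * sigma, mu + z * sigma


# Hypothetical attribute with mean 100 and standard deviation 10.
a, b = privacy_interval(mu=100.0, sigma=10.0, c=0.95)
width = b - a  # the amount of privacy at the 95% confidence level
```

A narrower interval at the same confidence level means the value can be pinned down more tightly, that is, less privacy.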

Basic Disclosure Scenario
- Confidential attributes (X) ~ N(μ, Σ)
- The c% probability density contour is a multidimensional ellipsoid. Its projection on axis z_i bounds the i-th confidential attribute.
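The bounds formula from the slide is not preserved in this transcript. A standard result that fits the description is that the ellipsoid (z - μ)^T Σ^{-1} (z - μ) ≤ χ²_p(c) projects onto axis z_i as the interval μ_i ± sqrt(χ²_p(c) · σ_ii), where σ_ii is the i-th diagonal entry of Σ. The sketch below applies that result under the assumption p = 2, because the chi-square quantile with 2 degrees of freedom has the closed form -2 ln(1 - c). The function names and the example covariance are hypothetical.

```python
import math


def chi2_quantile_2df(c):
    """Quantile of the chi-square distribution with 2 degrees of freedom.
    Exact for 2 df only, since its CDF is 1 - exp(-x / 2)."""
    return -2.0 * math.log(1.0 - c)


def projection_bounds(mu, cov, c):
    """Project the c-level density contour of a bivariate normal onto each axis.

    Returns one (low, high) pair per attribute. Only valid for p = 2
    because of the closed-form quantile used above.
    """
    r2 = chi2_quantile_2df(c)
    bounds = []
    for i, m in enumerate(mu):
        half_width = math.sqrt(r2 * cov[i][i])
        bounds.append((m - half_width, m + half_width))
    return bounds


# Hypothetical mean vector and covariance matrix for two confidential attributes.
bounds = projection_bounds(mu=[0.0, 0.0], cov=[[4.0, 1.0], [1.0, 9.0]], c=0.95)
```

Note that the projected interval on each axis depends only on the variance of that attribute, not on the off-diagonal entries; the correlation changes the shape of the ellipsoid but not its shadow on an axis.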

Basic Disclosure Scenario
- Measure privacy
  - Heuristic method
  - Use a hyper-rectangle to approximate the ellipsoid
  - Measure privacy for one dimension
- Adjust parameters
  - [Figure: original interval, dissimilarity constraint (d), new interval]

Conditional Scenario
- Confidential attributes (X) and non-confidential attributes (S)
  - E.g., the non-confidential values of Dividends and Wages can help predict the confidential values of Interests.
- The same method applies, with the conditional mean and covariance of X given S in place of μ and Σ.
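The conditional parameters shown on the slide are not preserved in this transcript. For jointly normal X and S, the standard conditioning result is: conditional mean = μ_X + Σ_XS Σ_SS^{-1} (s - μ_S) and conditional covariance = Σ_XX - Σ_XS Σ_SS^{-1} Σ_SX. The sketch below shows the scalar case (one confidential and one non-confidential attribute); the function name and all numbers are hypothetical.

```python
def conditional_params(mu_x, mu_s, var_x, var_s, cov_xs, s):
    """Mean and variance of X given S = s for a bivariate normal (X, S)."""
    # Shift the mean of X by the regression of X on S.
    cond_mean = mu_x + cov_xs / var_s * (s - mu_s)
    # Knowing S removes the part of X's variance that S explains.
    cond_var = var_x - cov_xs ** 2 / var_s
    return cond_mean, cond_var


# Hypothetical parameters: X confidential, S non-confidential, correlation 0.8.
cond_mean, cond_var = conditional_params(
    mu_x=10.0, mu_s=90.0, var_x=100.0, var_s=400.0, cov_xs=160.0, s=110.0
)
```

In this example the variance of X drops from 100 to 36 once S is known, so any confidence interval built from the conditional distribution is narrower than the unconditional one, which is exactly why the non-confidential attributes matter for disclosure.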

Combinatorial Scenario

Race   Age  Sex  Dividends  Wages  Interests  Total Income
Asian  20   M    10k        85k    2k         97k
Asian  30   F    15k        70k    18k        103k
Black  20   M    50k        120k   35k        205k

- Many potential combinations exist, e.g. Dividends + Wages + Interests = Total Income.
- Even if the level of security provided for a single confidential attribute is adequate, the level of security provided for linear combinations of confidential attributes could be very low.

SAC, Dijon, FranceApril 23-27, Combinatorial Scenario  Canonical Correlation Analysis (CCA) A statistical procedure that is used to identify and quantify the relationship between two sets of variables, S and X. CCA can identify a linear combination of variables in one set, X, that have the highest correlation with a linear combination of variables in another set, S. It can be used to evaluate the level of security when estimating the linear combinations of the confidential attributes, X, using the non-confidential attributes, S.

SAC, Dijon, FranceApril 23-27, Combinatorial Scenario  Canonical Correlation Analysis (CCA) λ 1 : represents the most general measure of inferential value disclosure for any combination 1− λ 1 : the worst-case security λ 1 ≤λ : no combinatorial disclosure exists  Adjust parameters If λ i > λ then λ i = λ, keeping other eigenvalues, eigenvectors unchanged. Get a new Adjust : Adjust : optimization problem

Conclusion
- Propose a model-based privacy preserving approach
- Investigate value disclosure in three scenarios

Future Work
- How to conduct individual value disclosure analysis when individual privacy intervals are specified
- How the information loss due to modeling affects the utility of the generated data

Acknowledgement
- NSF Grant CCR, IIS
- Personnel
  - Xintao Wu, Songtao Guo, UNC Charlotte
  - Yingjiu Li, Singapore Management Univ.
- More Info

Questions? Thank you!