On Privacy Preservation against Adversarial Data Mining
Presented by Conner Simmons

Abstract
- The goal is to hide a minimal set of entries so that the privacy of the sensitive fields is preserved.
- An adversary applying data mining to the published data should be unable to accurately recover the hidden entries.
- The authors model the problem of privacy preservation against adversarial data mining in a general form.
- They show that finding an optimal solution to this problem is NP-hard.
- An efficient heuristic algorithm is then developed to find practical solutions to the problem.
- Finally, the authors conduct an extensive evaluation on both real and synthetic data sets to measure the method's effectiveness.

Introduction
- As corporations grow in size, the misuse and privacy of data become growing concerns.
- The most basic approach is to erase the sensitive data.
- Advantage: easy to implement.
- Problem: data mining algorithms can fill in the blanks by exploiting correlations among the fields.

Example 1: Motivation
- Consider the table Employee in Table 1. Suppose some users want to hide information as follows:
  (1) Cathy wants to hide her salary level;
  (2) Frank wants to hide his education background;
  (3) Grace wants to hide her marital status;
  (4) Ian wants to hide his gender;
  (5) Janet wants to hide her gender as well; and
  (6) all names should be hidden and replaced by random row-ids.

Table 1: Table Employee.

Name   | Title      | Gender | Mstatus   | Education  | Sal-lvl
-------|------------|--------|-----------|------------|--------
Alice  | Assistant  | F      | Unmarried | College    | SL-3
Bob    | Assistant  | M      | Married   | College    | SL-3
Cathy  | Assistant  | F      | Married   | University | SL-3
Daniel | Manager    | F      | Unmarried | University | SL-5
Elena  | Manager    | F      | Married   | University | SL-5
Frank  | Manager    | M      | Married   | University | SL-5
Grace  | Manager    | F      | Married   | MBA        | SL-7
Helen  | Manager    | F      | Married   | Ph.D.      | SL-5
Ian    | Accountant | M      | Unmarried | MBA        | SL-5
Janet  | Accountant | F      | Married   | University | SL-4

Example 1: Motivation
- Table 2 shows the result after hiding the sensitive entries.

Table 2: Table Employee after hiding some sensitive entries.

Id | Title      | Gender | Mstatus   | Education  | Sal-lvl
---|------------|--------|-----------|------------|--------
1  | Manager    | F      | Married   | University | SL-5
2  | Assistant  | F      | Unmarried | College    | SL-3
3  | Manager    | F      | Married   | Ph.D.      | SL-5
4  | Assistant  | M      | Married   | College    | SL-3
5  | Manager    | M      | Married   | #          | SL-5
6  | Accountant | #      | Unmarried | MBA        | SL-5
7  | Manager    | F      | Unmarried | University | SL-5
8  | Assistant  | F      | Married   | University | #
9  | Manager    | F      | #         | MBA        | SL-7
10 | Accountant | #      | Married   | University | SL-4

- It is still possible to derive the following association rules from Table 2:
  R1: Assistant → SL-3 (confidence 100%)
  R2: Manager ∧ SL-5 → University (confidence 66.7%)
  R3: Manager ∧ Female → Married (confidence 66.7%)
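To make the rule derivation concrete, here is a minimal sketch that recomputes the confidences of R1-R3 from Table 2, treating "#" as a hidden entry. The names TABLE2, ATTRS, and confidence are this writeup's own illustrative choices, not the paper's code.

```python
# Table 2, with '#' marking a hidden entry.
TABLE2 = [
    # (Title, Gender, Mstatus, Education, Sal-lvl)
    ("Manager",    "F", "Married",   "University", "SL-5"),
    ("Assistant",  "F", "Unmarried", "College",    "SL-3"),
    ("Manager",    "F", "Married",   "Ph.D.",      "SL-5"),
    ("Assistant",  "M", "Married",   "College",    "SL-3"),
    ("Manager",    "M", "Married",   "#",          "SL-5"),
    ("Accountant", "#", "Unmarried", "MBA",        "SL-5"),
    ("Manager",    "F", "Unmarried", "University", "SL-5"),
    ("Assistant",  "F", "Married",   "University", "#"),
    ("Manager",    "F", "#",         "MBA",        "SL-7"),
    ("Accountant", "#", "Married",   "University", "SL-4"),
]
ATTRS = ["Title", "Gender", "Mstatus", "Education", "Sal-lvl"]

def confidence(rows, antecedent, cons_attr, cons_val):
    """Public confidence of antecedent -> (cons_attr = cons_val).

    antecedent: dict {attribute: value}.  Rows with '#' in a matched
    attribute or in the consequent attribute are not publicly visible
    and are excluded from the rule body.
    """
    idx = {a: i for i, a in enumerate(ATTRS)}
    body = [r for r in rows
            if all(r[idx[a]] == v for a, v in antecedent.items())
            and r[idx[cons_attr]] != "#"]
    if not body:
        return 0.0
    hits = sum(1 for r in body if r[idx[cons_attr]] == cons_val)
    return hits / len(body)

# R1: Assistant -> SL-3                (expected 1.0)
print(confidence(TABLE2, {"Title": "Assistant"}, "Sal-lvl", "SL-3"))
# R2: Manager ∧ SL-5 -> University     (expected 0.667)
print(confidence(TABLE2, {"Title": "Manager", "Sal-lvl": "SL-5"},
                 "Education", "University"))
# R3: Manager ∧ Female -> Married      (expected 0.667)
print(confidence(TABLE2, {"Title": "Manager", "Gender": "F"},
                 "Mstatus", "Married"))
```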

Example 1: Motivation
- From these rules and Table 2, one may accurately predict that:
  (1) the missing value in tuple 5 is "University" (by rule R2);
  (2) the missing value in tuple 8 is "SL-3" (by rule R1); and
  (3) the missing value in tuple 9 is "Married" (by rule R3).
- However, the missing values in tuples 6 and 10 cannot be predicted accurately.

Problem Statement
- Consider a database T of n records t_1, ..., t_n over m attributes D_1, ..., D_m.
- The value of record t_i on attribute D_j is denoted by t_{i,j} (also referred to as an entry).
- A set of entries within one record is called an entry-set.
- The set of privacy-sensitive entries in the data is called the directly private set, denoted by P.
- For simplicity, a value of an attribute is called an item, and a combination of multiple items is called an itemset.

Problem Statement
- Further, it is assumed that the information hiding is done at the server end.
- Advantage: the server can use the inter-attribute correlations among the different records to decide which entries should be masked.
- Primary question: which entries should be hidden?
- We write t_{i,j} = # to denote that a value is hidden.
- The table in which the directly private entries are blanked out is denoted by (T − P).

Problem Statement
- An itemset X is said to appear in a tuple t if X matches a set of entries in t.
- In (T − P), X is said to publicly appear in a tuple t if the entries in t matching X are not in P.
- Example 1 shows that publishing Table 2 as-is may not preserve privacy sufficiently: blanking out only the sensitive entries is inadequate.
- Entries with strong predictive power over any entry in P must also be removed from the data.

Problem Statement
- The problem of privacy preservation against adversarial data mining is to find a set of entries K such that no predictive method using only the information in (T − P − K) can achieve accuracy δ or higher in predicting the value of any entry in P, where δ ∈ (0, 1] is a user-specified parameter.
- We want to minimize the size of K, so that the information loss is as small as possible.
- K is referred to as the derived private set.
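Stated compactly (a restatement of the slide above, not notation taken from the paper):

```latex
\min_{K \subseteq (T - P)} |K|
\quad \text{subject to} \quad
\nexists \text{ rule } X \to y \text{ mined from } (T - P - K)
\text{ predicting any } y \in P \text{ with accuracy } \geq \delta .
```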

Example 2: Rule Invalidation
- Rule invalidation: removing derived private entries so as to reduce the confidence of a rule in the data.
- Consider rule R2: Manager ∧ SL-5 → University from Example 1.
- To invalidate this rule, we can blank some occurrences of "Manager", "SL-5", and/or "University" in tuples 1 and 7, but not all occurrences.
- Removing t_{1,Title} = Manager is more advantageous than removing t_{1,Education} = University, since t_{1,Title} affects two rules (R2 and R3) rather than just one.
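Continuing the sketch after Example 1 (same TABLE2 and confidence helper), one can verify by hand that hiding t_{1,Title} weakens R2 and R3 simultaneously:

```python
# Invalidation: hide Title of tuple 1 and recompute both rules.
blanked = [list(r) for r in TABLE2]
blanked[0][0] = "#"                               # hide t_{1,Title}
print(confidence(blanked, {"Title": "Manager", "Sal-lvl": "SL-5"},
                 "Education", "University"))      # R2: 0.667 -> 0.5
print(confidence(blanked, {"Title": "Manager", "Gender": "F"},
                 "Mstatus", "Married"))           # R3: 0.667 -> 0.5
```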

Example 3: Rule Marginalization
- Rule marginalization: a rule may keep its high confidence but become marginalized because it is no longer relevant to any sensitive entry in the data; its predictive power is thereby removed.
- Consider rule R1: Assistant → SL-3 from Example 1.
- In Table 2, only the sensitive entry t_{8,Sal-lvl} can be predicted using this rule.
- Therefore, instead of invalidating the rule, we can simply blank out entry t_{8,Title} = Assistant. The sensitive entry t_{8,Sal-lvl} can then no longer be predicted accurately.
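The same sketch also shows the marginalization effect: after hiding t_{8,Title}, R1 keeps its 100% public confidence but no longer applies to any tuple whose Sal-lvl is hidden, so it reveals nothing.

```python
# Marginalization: hide Title of tuple 8; R1 stays confident but idle.
blanked = [list(r) for r in TABLE2]
blanked[7][0] = "#"                               # hide t_{8,Title}
print(confidence(blanked, {"Title": "Assistant"},
                 "Sal-lvl", "SL-3"))              # still 1.0
applies_to_hidden = any(r[0] == "Assistant" and r[4] == "#"
                        for r in blanked)
print(applies_to_hidden)                          # False: R1 is marginalized
```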

Overview of the Approach
Step 1: Mining adversarial rules.
- We mine from (T − P) all association rules of the form X → y, where X is a set of attribute values and y is a value of an attribute Y, such that:
  (1) the confidence of the rule in (T − P) is no less than δ; and
  (2) among the tuples t in which X publicly appears and whose value on attribute Y is blanked out, t has value y on Y with probability at least δ.
- Such confident association rules, usable for adversarial data mining, are called adversarial rules.

Overview of the Approach
Step 2: Determining the derived private set.
- We select a set of entries K ⊆ (T − P) such that, after deleting the entries in P ∪ K, no adversarial rule can be fired to accurately predict any value in P.
- An adversarial rule is potentially revealing only when it has both high public confidence and high hidden confidence; otherwise it cannot be used to predict accurately.
- If a rule has high hidden confidence but low public confidence, a user who can only read (T − P) cannot identify the rule from the public data.
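A sketch of the two confidence measures, reusing TABLE2 and ATTRS from the first sketch. TRUE2 is an assumption of this writeup: the unblanked values (Table 1 in Table 2's row order), which only the server knows.

```python
# Unblanked table, server side only (reconstructed from Table 1).
TRUE2 = [
    ("Manager",    "F", "Married",   "University", "SL-5"),
    ("Assistant",  "F", "Unmarried", "College",    "SL-3"),
    ("Manager",    "F", "Married",   "Ph.D.",      "SL-5"),
    ("Assistant",  "M", "Married",   "College",    "SL-3"),
    ("Manager",    "M", "Married",   "University", "SL-5"),
    ("Accountant", "M", "Unmarried", "MBA",        "SL-5"),
    ("Manager",    "F", "Unmarried", "University", "SL-5"),
    ("Assistant",  "F", "Married",   "University", "SL-3"),
    ("Manager",    "F", "Married",   "MBA",        "SL-7"),
    ("Accountant", "F", "Married",   "University", "SL-4"),
]

def public_and_hidden_conf(x, y_attr, y_val):
    """Public confidence is measured on tuples whose Y-entry is visible;
    hidden confidence on tuples where X publicly appears but Y is
    blanked, checked against the true (server-side) values."""
    idx = {a: i for i, a in enumerate(ATTRS)}
    pub = [r[idx[y_attr]] for r in TABLE2
           if all(r[idx[a]] == v for a, v in x.items())]
    hidden = [t[idx[y_attr]] for r, t in zip(TABLE2, TRUE2)
              if all(r[idx[a]] == v for a, v in x.items())
              and r[idx[y_attr]] == "#"]
    visible = [v for v in pub if v != "#"]
    pub_conf = visible.count(y_val) / len(visible) if visible else 0.0
    hid_conf = hidden.count(y_val) / len(hidden) if hidden else 0.0
    return pub_conf, hid_conf

# R2: public 2/3 ≈ 0.667, hidden 1/1 = 1.0 -> adversarial for δ ≤ 0.667
print(public_and_hidden_conf({"Title": "Manager", "Sal-lvl": "SL-5"},
                             "Education", "University"))
```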

Theorem 1: Correctness
- For a table T and a directly private set P, an entry z ∈ P on attribute Y cannot be predicted with confidence δ or higher using only the information in (T − P) if and only if there exists no adversarial rule R: X → y in (T − P) such that y is a value on attribute Y and X publicly appears in the tuple t containing z.

Mining All Adversarial Rules
- The general idea of the algorithm for mining the complete set of adversarial rules is to conduct a depth-first search of the set enumeration tree.
- The nodes of the tree are the itemsets that publicly appear in some tuples.
- At each node, we check whether the itemset can be the antecedent of some adversarial rule.
- The search for adversarial rules is complete if all itemsets publicly appearing in some tuples are enumerated completely.

Theorem 2: Anti-Monotonicity
- If an itemset X does not publicly appear in any tuple, then no superset of X can publicly appear in any tuple.

Pruning Rule 1: Pruning Itemsets That Do Not Publicly Appear
- In the depth-first search of the set enumeration tree, an itemset X that does not publicly appear in any tuple, as well as all supersets of X, can be pruned.
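An illustrative skeleton of the depth-first enumeration with Pruning Rule 1; this is an assumed sketch, not the paper's pseudocode. The caller supplies publicly_appears, e.g. a scan of (T − P):

```python
# Depth-first search of the set enumeration tree.  By anti-monotonicity
# (Theorem 2), once an itemset has no public occurrence, none of its
# supersets can, so the whole subtree is skipped.
def dfs(items, publicly_appears, itemset=(), start=0, out=None):
    """items: ordered list of (attribute, value) pairs;
    publicly_appears(itemset) -> bool, supplied by the caller."""
    if out is None:
        out = []
    for i in range(start, len(items)):
        candidate = itemset + (items[i],)
        if not publicly_appears(candidate):   # Pruning Rule 1:
            continue                          # prune candidate's subtree
        out.append(candidate)
        dfs(items, publicly_appears, candidate, i + 1, out)
    return out

# Demo against TABLE2/ATTRS from the first sketch:
items = [("Title", "Manager"), ("Gender", "F"), ("Sal-lvl", "SL-5")]
def publicly_appears(itemset):
    idx = {a: i for i, a in enumerate(ATTRS)}
    return any(all(r[idx[a]] == v for a, v in itemset) for r in TABLE2)
print(dfs(items, publicly_appears))
```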

Mining All Adversarial Rules
- Not every itemset that publicly appears in some tuples is the antecedent of an adversarial rule.
- This fact can be used to improve the efficiency of the depth-first search algorithm.
- If we can determine that no node in a subtree can be the antecedent of any adversarial rule, we can prune the subtree and narrow the search space.
- Projected databases allow us to examine whether an itemset and its supersets are antecedents of some adversarial rules.

Mining All Adversarial Rules
- Let X be an itemset, T a table, and P a directly private set.
- For a tuple t in T, if X publicly appears in t, then the projection of t with respect to X, denoted by t|X, is the set of entries in t that are not matched by X.
- If X does not publicly appear in t, then t|X = ∅.
- The projected database with respect to X, denoted by T|X, is the set of nonempty projections with respect to X in T.

Example 4: Projected Databases
- Consider Table 2. The projected databases for the following itemsets are:
  X1 = {College}: T|X1 = {(Assistant, Female, Unmarried, SL-3), (Assistant, Male, Married, SL-3)}.
  X2 = {Accountant}: T|X2 = {(#, Unmarried, MBA, SL-5), (#, Married, University, SL-4)}.
  X3 = {Unmarried, SL-5}: T|X3 = {(Accountant, #, MBA), (Manager, Female, University)}.
  X4 = {Manager, Female}: T|X4 = {(Married, University, SL-5), (Married, Ph.D., SL-5), (Unmarried, University, SL-5), (#, MBA, SL-7)}.
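The projections of Example 4 can be reproduced with a few lines, reusing TABLE2 and ATTRS from the first sketch (an itemset is written as a dict here; the table abbreviates Female/Male as F/M):

```python
# Projected database T|X: for each tuple where X publicly appears,
# keep the entries not matched by X.
def project(rows, attrs, x):
    idx = {a: i for i, a in enumerate(attrs)}
    out = []
    for r in rows:
        if all(r[idx[a]] == v for a, v in x.items()):  # X publicly appears
            out.append(tuple(v for a, v in zip(attrs, r) if a not in x))
    return out

print(project(TABLE2, ATTRS, {"Education": "College"}))   # T|X1
print(project(TABLE2, ATTRS, {"Title": "Accountant"}))    # T|X2
print(project(TABLE2, ATTRS,
              {"Mstatus": "Unmarried", "Sal-lvl": "SL-5"}))  # T|X3
print(project(TABLE2, ATTRS,
              {"Title": "Manager", "Gender": "F"}))          # T|X4
```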

Example 5: Privacy-Free Itemsets
- An itemset is privacy-free if its projected database does not contain any directly private entry at all.
- Consider itemset X1 = {College} from Example 4.
- Since its projected database contains no directly private entry, X1 cannot be the antecedent of any adversarial rule.
- Nor can any superset of X1, such as {Assistant, College} or {Assistant, Male, College}.

Pruning Rule 2: Pruning Privacy-Free Itemsets
- In the depth-first search of the set enumeration tree, a privacy-free itemset and all of its supersets can be pruned.

Example 6: Non-Discriminative Itemsets
- An itemset X is non-discriminative if every tuple in the projected database of X contains directly private entries in the same attribute(s).
- Consider itemset X2 = {Accountant} from Example 4.
- Every tuple in T|X2 has a blanked value in attribute Gender.
- No association rule with X2, or any superset of X2, as its antecedent can accurately predict the gender of accountants.
- Hence those itemsets cannot be antecedents of any adversarial rules with respect to Gender.

Pruning Rule 3: Pruning Non-Discriminative Itemsets
- In the depth-first search of the set enumeration tree, if an itemset X is non-discriminative with respect to an attribute Y, then any superset of X is also non-discriminative with respect to Y.
- Y can therefore be pruned from the projected databases of X and of any superset of X.
- If an itemset X is non-discriminative with respect to every attribute that contains entries in the directly private set, then X and its supersets can be pruned.

Example 7: Contrast Itemsets
- An itemset X is said to be a contrast itemset if, for any entry y ∈ P such that X ∪ {y} appears in some tuples in T, the rule X → y has a public confidence of 0.
- Clearly, neither X nor any superset of X can be used to accurately predict y.
- Consider itemset X3 = {Unmarried, SL-5} from Example 4.
- In the projected database, there is one blanked entry in attribute Gender, whose value is "Male".
- However, in the public set of X3, the rule X3 → Female has a public confidence of 100%. Thus, X3 and its supersets cannot be used to predict the gender accurately.

Pruning Rule 4: Pruning Contrast Itemsets
- In the depth-first search of the set enumeration tree, a contrast itemset and all of its supersets can be pruned.

Example 8: Discriminative Itemsets
- An itemset X is discriminative if X is the antecedent of some adversarial rule.
- Consider itemset X4 = {Manager, Female} from Example 4.
- In the projected database, there is one blanked entry in attribute Mstatus, whose value is "Married".
- Two of the three tuples in the projected database with a non-blanked Mstatus value have value "Married".
- We can therefore generate the confident association rule Manager ∧ Female → Married, which is the adversarial rule R3 discussed in Example 1.

Example 8: Discriminative Itemsets (Cont'd)
- The supersets of a discriminative itemset must still be checked, because confident adversarial rules may also be found among these supersets.
- All adversarial rules must be invalidated or marginalized.
- An itemset is undetermined if it falls into none of the previous four categories.
- Such an itemset can neither be pruned nor used to generate adversarial rules, so the depth-first search continues at such nodes in order to classify the itemsets in the corresponding subtrees.
- If the support of an adversarial rule is very low, the rule is statistically insignificant. Any itemset whose support is below the support threshold can be pruned, along with its supersets.
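The five categories can be captured in one rough classifier. This is a simplified, per-target-attribute reading of the definitions (the paper's non-discriminative test considers sets of attributes), reusing TABLE2, ATTRS, and TRUE2 from the earlier sketches, with δ = 0.6 assumed:

```python
def classify(x, y_attr, delta=0.6):
    """Rough category of itemset x w.r.t. hidden entries of y_attr,
    following Examples 5-8 (simplified sketch)."""
    idx = {a: i for i, a in enumerate(ATTRS)}
    proj = [(r, t) for r, t in zip(TABLE2, TRUE2)
            if all(r[idx[a]] == v for a, v in x.items())]
    if not proj:
        return "not publicly appearing"        # Pruning Rule 1
    if not any("#" in r for r, _ in proj):
        return "privacy-free"                  # Pruning Rule 2
    if all(r[idx[y_attr]] == "#" for r, _ in proj):
        return "non-discriminative"            # Pruning Rule 3 (w.r.t. y_attr)
    visible = [r[idx[y_attr]] for r, _ in proj if r[idx[y_attr]] != "#"]
    hidden = [t[idx[y_attr]] for r, t in proj if r[idx[y_attr]] == "#"]
    if hidden and all(visible.count(h) == 0 for h in hidden):
        return "contrast"                      # Pruning Rule 4
    for h in set(hidden):                      # adversarial antecedent?
        pub_conf = visible.count(h) / len(visible) if visible else 0.0
        if pub_conf >= delta and hidden.count(h) / len(hidden) >= delta:
            return "discriminative"
    return "undetermined"

for x, y in [({"Education": "College"}, "Sal-lvl"),               # X1
             ({"Title": "Accountant"}, "Gender"),                 # X2
             ({"Mstatus": "Unmarried", "Sal-lvl": "SL-5"}, "Gender"),  # X3
             ({"Title": "Manager", "Gender": "F"}, "Mstatus")]:   # X4
    print(x, "->", classify(x, y))
```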

Determining the Optimal Derived Private Set
- The set of adversarial rules determines the set of entries that need to be removed from the data.
- Finding the smallest derived private set is NP-hard, so a heuristic algorithm is designed instead.
- The more information an entry reveals, the greater its weight. The weight of entry t_{i,j} is denoted by c_{i,j} and is initially zero.
- Deleting an entry can either prevent a rule from being fired (rule marginalization) or prevent the rule from being found at all (rule invalidation).

Determining the Optimal Derived Private Set
- If tuple t_i is in the public set of a rule R, then blanking out t_{i,j} reduces either the public confidence of R or the size of the public set of R.
- If tuple t_i is in the hidden set of R, then blanking out t_{i,j} marginalizes R in one instance.
- If t_i is in neither the public set nor the hidden set of R, then blanking out t_{i,j} contributes nothing to the invalidation or marginalization of R.

Determining the Optimal Derived Private Set
- The weight c_{i,j} is calculated as the sum of the contributions of blanking out t_{i,j} to all adversarial rules.
- The entries with the highest weights are blanked out first.
- Once an entry is blanked out, the weights of the remaining entries are adjusted.
- The blanking process is conducted iteratively.
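At this level of abstraction the heuristic behaves like a greedy hitting-set procedure. The sketch below is a toy reduction of the weighting scheme, not the paper's exact bookkeeping: each rule is encoded simply as the set of candidate entries whose removal would invalidate or marginalize it, and entry names are illustrative.

```python
# Greedy derived-private-set sketch: weight each candidate entry by how
# many live adversarial rules its removal neutralizes, blank the
# heaviest entries, drop the rules they hit, and repeat.
def greedy_blanking(rules, blanking_factor=1):
    """rules: list of sets of candidate entries, one set per adversarial
    rule; removing any entry in the set neutralizes that rule."""
    derived_private = set()
    live = [set(r) for r in rules]
    while live:
        weights = {}
        for r in live:
            for e in r:
                weights[e] = weights.get(e, 0) + 1
        chosen = sorted(weights, key=weights.get,
                        reverse=True)[:blanking_factor]
        derived_private.update(chosen)
        live = [r for r in live if not (r & set(chosen))]
    return derived_private

# Toy demo echoing Example 2: "t1,Title" hits two rules at once.
rules = [
    {"t1,Title", "t7,Title"},   # entries whose removal weakens R2
    {"t1,Title", "t3,Title"},   # ... weakens R3
    {"t8,Title"},               # ... marginalizes R1
]
print(greedy_blanking(rules))   # {'t1,Title', 't8,Title'}
```

A larger blanking factor blanks more entries per round, trading a slightly larger derived private set for fewer recomputation rounds, which matches the running-time behavior reported below.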

Empirical Evaluation: Results on Synthetic Data Sets
- Figure 3 shows how the number of adversarial rules varies with the Zipf factor for different dimensionalities and cardinalities of the data set.
- The number of rules increases with dimensionality but decreases as the cardinality of each categorical attribute increases.

Empirical Evaluation: Results on Synthetic Data Sets
- Figure 4 tests the effect of the support threshold on the number of adversarial rules and the number of sensitive entries.
- The number of adversarial rules increases exponentially as the support threshold decreases.
- The number of sensitive entries changes linearly, since the Zipf distribution embeds some correlations with high support, which affect many tuples.

Empirical Evaluation: Results on Synthetic Data Sets
- Figure 5 shows the effect of the confidence threshold on the number of adversarial rules and the number of sensitive entries.
- A lower confidence threshold introduces more adversarial rules and more sensitive entries.
- Note that the corresponding entries are very valuable from the point of view of an adversary.

Empirical Evaluation: Results on Synthetic Data Sets
- Figure 6 tests the effect of the size of the directly private set on the number of adversarial rules.
- When the directly private set is small, the number of adversarial rules is also small, since each rule must be associated with some directly private entry.
- When it is large, the number of adversarial rules increases linearly.

Empirical Evaluation: Results on Synthetic Data Sets
- Figure 7 plots the size of the derived private set with respect to the blanking factor.
- Figure 8 displays the corresponding running time.

Empirical Evaluation: Results on Synthetic Data Sets
- Figures 7 and 8 show that privacy can be preserved by blanking out only a very small subset of sensitive entries.
- In general, the fewer entries we blank out in each round, the smaller the derived private set we obtain; on the other hand, this increases the running time.
- The iterative use of a blanking factor substantially improves the efficiency of the algorithm at the cost of only a modest loss in data entries.

Empirical Evaluation: Results on Synthetic Data Sets
- Figure 9 shows the scalability with respect to the number of tuples in the table.
- The number of adversarial rules decreases as the size of the table increases.

Empirical Evaluation: Results on Synthetic Data Sets
- The number of sensitive entries depends on two factors: the number of adversarial rules and the size of the directly private set.
- Experiments show that the runtime of the method is mainly proportional to the number of sensitive entries, since determining the derived private set dominates the cost.
- Since the number of sensitive entries shows more modest scalability behavior, this also helps to contain the running time of the method.

Empirical Evaluation: Experiments on Real Data Set (Adult)
- Figure 10 shows the number of sensitive entries and the size of the derived private set as the size of the directly private set increases.
- The number of sensitive entries is not very sensitive to changes in the directly private set, since it is bounded by the total number of non-blanked entries.
- On the other hand, the number of derived private entries increases with the size of the directly private set.

Empirical Evaluation: Experiments on Real Data Set (Adult)
- Figure 11 shows the number of adversarial rules, the number of sensitive entries, and the size of the derived private set as functions of the support threshold.
- All three measures go up as the support threshold goes down.
- The running time is shown in Figure 12.
- The size of the derived private set is within a small factor of the directly private set over the entire range of the support parameter, indicating only a modest level of information loss.

Summary
- The study on both synthetic and real data sets strongly suggests that privacy preservation against adversarial data mining is practical from an information-loss perspective.
- The results indicate that the derived private set does not grow as fast as the directly private set and tends to be quite stable over a wide range of user parameters.
- The heuristic approach is also computationally efficient, requiring only a few seconds in many practical settings on data sets containing thousands of records.
- Finally, the scheme scales modestly over a wide range of user-specified parameters, contributing to the practicality and usability of the approach.

References
Aggarwal, C. C., Pei, J., and Zhang, B. 2005. On privacy preservation against adversarial data mining. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). 510-516.