1 Data Anonymization (1)

2 Outline
- Problem
- Concepts
- Algorithms on the domain generalization hierarchy
- Algorithms on numerical data

3 The Massachusetts Governor Privacy Breach
(Figure: linking attack, Sweeney, IJUFKS 2002)
- Medical data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
- Voter list: Name, Address, Date Registered, Party Affiliation, Date Last Voted, Zip, Birth Date, Sex
- Quasi-identifier shared by the two tables: Zip, Birth Date, Sex
- The Governor of MA was uniquely identified using Zip Code, Birth Date, and Sex; his name was linked to his Diagnosis
- 87% of the US population is uniquely identifiable by these three attributes

4 Definition
- Table: columns are attributes, rows are records
- Quasi-identifier (QI): a set of attributes that can potentially be used to identify individuals
- K-anonymity: every combination of QI values that appears in the table appears at least k times (see the sketch below)
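A minimal sketch of checking the k-anonymity condition with pandas. The table, column names, and helper name are illustrative, not part of the original slides.

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list, k: int) -> bool:
    """Return True if every combination of QI values appears at least k times."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Illustrative data only
table = pd.DataFrame({
    "zip":        ["02138", "02138", "02139", "02139"],
    "birth_date": ["1965",  "1965",  "1972",  "1972"],
    "sex":        ["F",     "F",     "M",     "M"],
    "diagnosis":  ["flu",   "cold",  "flu",   "asthma"],
})
print(is_k_anonymous(table, ["zip", "birth_date", "sex"], k=2))  # True
```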

5 Basic techniques
- Generalization: Zip {02138, 02139} -> 0213* (helper sketch below)
- Domain generalization hierarchy: A0 -> A1 -> ... -> An
  e.g. 02138 -> 0213* -> 021* -> 02* -> 0* -> *
  The hierarchy is a tree structure
- Suppression (removing values or records outright)
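As an illustration of the ZIP-code hierarchy above, a small hypothetical helper that generalizes a code by a given number of levels, replacing trailing digits with '*':

```python
def generalize_zip(zip_code: str, level: int) -> str:
    """Move `level` steps up the hierarchy by masking the last `level` digits."""
    if level <= 0:
        return zip_code
    if level >= len(zip_code):
        return "*"                      # top of the hierarchy
    return zip_code[:-level] + "*"

print(generalize_zip("02138", 1))  # 0213*
print(generalize_zip("02139", 1))  # 0213*  (both codes fall into the same node)
print(generalize_zip("02138", 3))  # 02*
```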

6
- Balance: a better privacy guarantee means lower data utility
- Many generalization schemes satisfy a given k-anonymity specification; we want to minimize the distortion of the table in order to maximize data utility
- Suppression is required if we cannot find a k-anonymity group for a record

7 Criteria
- Minimal generalization: the least generalization that satisfies the k-anonymity specification
- Minimal table distortion: a minimal generalization with minimal utility loss
  - Precision can be used to evaluate the loss [Sweeney's papers]
  - Application-specific utility measures

8 Complexity of finding the optimal generalization
- NP-hard (Bayardo, ICDE05)
- Hence all proposed algorithms are approximate (heuristic) algorithms

9 Shared features of the different solutions
- All satisfy the k-anonymity specification; records that cannot be anonymized are suppressed
- The differences lie in the utility loss / cost function:
  - Sweeney's precision metric
  - Discernibility and classification metrics
  - Information-privacy metric
- Algorithms:
  - Assume the domain generalization hierarchy is given
  - Aim for efficiency and utility maximization

10 Metrics to be optimized
- Two cost metrics we want to minimize (Bayardo, ICDE05) (sketch below):
  - Discernibility: each record is penalized by the number of records in its k-anonymity group
  - Classification: the dataset has a class-label column and we want to preserve the classification model; each group is penalized by the number of its records in minority classes
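A hedged sketch of how the two cost metrics could be computed for a fixed grouping. It ignores suppressed records and uses made-up data, so it is not Bayardo's implementation, only an illustration of the two penalties described above.

```python
from collections import Counter

def discernibility(groups):
    """groups: one list of class labels per QI equivalence group.
    Each record is charged the size of its group, i.e. sum of |group|^2."""
    return sum(len(g) ** 2 for g in groups)

def classification_metric(groups):
    """Each record that is not in its group's majority class is charged 1."""
    penalty = 0
    for g in groups:
        majority = Counter(g).most_common(1)[0][1]
        penalty += len(g) - majority
    return penalty

groups = [["flu", "flu", "cold"], ["asthma", "asthma"]]
print(discernibility(groups))         # 3^2 + 2^2 = 13
print(classification_metric(groups))  # 1 + 0 = 1
```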

11 Metrics
- A combination of information loss and anonymity gain (Wang, ICDE04)
  - Information loss and anonymity gain
  - Combined into the information-privacy metric

12 Metrics
- Information loss (sketch below)
  - Assumes the dataset has class labels
  - Entropy of a set S labeled by different classes measures the impurity of the labels:
    Info(S) = -sum_i p_i * log2(p_i), where p_i is the fraction of records in S with label i
  - Information loss of a generalization G: {c1, c2, ..., cn} -> p:
    I(G) = Info(S_p) - sum_i (|S_ci| / |S_p|) * Info(S_ci),
    where S_ci is the set of records with child value ci and S_p the set of records generalized to the parent value p
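A small Python sketch of the entropy and information-loss formulas above; set and function names are illustrative.

```python
import math
from collections import Counter

def info(labels):
    """Entropy Info(S) = -sum_i p_i * log2(p_i) over the class labels of S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_loss(child_label_sets):
    """I(G) = Info(S_p) - sum_i |S_ci|/|S_p| * Info(S_ci), with S_p the union of the children."""
    parent = [lab for child in child_label_sets for lab in child]
    return info(parent) - sum(len(c) / len(parent) * info(c) for c in child_label_sets)

# Merging two pure child values into one parent value mixes the class labels:
print(information_loss([["yes", "yes"], ["no", "no"]]))  # 1.0 bit of information lost
```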

13 Anonymity gain
- A(VID): the number of records sharing the VID (virtual identifier, i.e. the quasi-identifier value)
- A_G(VID) >= A(VID): applying a generalization G improves, or at least does not change, A(VID)
- Anonymity gain P(G) = x - A(VID), where x = A_G(VID) if A_G(VID) <= K, and x = K otherwise
- Once k-anonymity is satisfied, further generalization of the VID yields no additional gain

14 Information-privacy combined metric
- IP(G) = information loss / anonymity gain = I(G) / P(G) (sketch below)
- We want to minimize IP
  - If P(G) == 0, use I(G) alone
  - Either a small I(G) or a large P(G) reduces IP
  - Among candidates with the same P(G), pick the one with minimum I(G)
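A sketch combining the anonymity gain of slide 13 with the IP metric above; the candidate tuples are made up for illustration.

```python
def anonymity_gain(a_before: int, a_after: int, k: int) -> int:
    """P(G) = min(A_G(VID), K) - A(VID): no credit for generalizing past k."""
    return min(a_after, k) - a_before

def ip_score(info_loss: float, a_before: int, a_after: int, k: int) -> float:
    """IP(G) = I(G) / P(G); fall back to I(G) alone when P(G) == 0."""
    p = anonymity_gain(a_before, a_after, k)
    return info_loss if p == 0 else info_loss / p

# Pick, among candidate generalizations, the one with the smallest IP score:
candidates = [("G1", 0.4, 2, 5), ("G2", 0.9, 2, 10)]   # (name, I(G), A(VID), A_G(VID))
best = min(candidates, key=lambda c: ip_score(c[1], c[2], c[3], k=5))
print(best[0])  # G1: same P(G) = 3, smaller I(G)
```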

15 Domain-hierarchy based algorithms
- Sweeney's algorithm
- Bayardo's tree-pruning algorithm
- Wang's top-down and bottom-up algorithms
- All of these are dimension-by-dimension methods

16 Multidimensional techniques
- Categorical data? Categories are mapped to numbers (numerically encoded)
  - Bayardo 05 paper
  - Does the ordering matter? (no research on that)
- Numerical data: k-anonymization becomes an n-dimensional space partitioning problem
  - Many existing techniques can be applied

17 Single-dimensional vs. multidimensional

18 The evolution of the methods
Categorical (domain hierarchy) [Sweeney, top-down/bottom-up] -> numerically encoded categories, single-dimensional [Bayardo05] -> numerically encoded / numerical, multidimensional [Mondrian, spatial indexing, ...]

19 Method 1: Mondrian
- Numerically encode categorical data
- Apply a top-down partitioning process (sketch below)
(Figure: Step 1, Step 2.1, Step 2.2 of the recursive partitioning)
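A simplified sketch of Mondrian-style top-down partitioning on numerical QI data: split at the median of the widest dimension and recurse while both halves keep at least k records. It uses raw (unnormalized) ranges and made-up data, so it is only an illustration of the idea, not the published algorithm.

```python
def mondrian(records, qi_dims, k):
    """records: list of dicts; returns a list of partitions, each with >= k records."""
    def widest_dim(part):
        # pick the QI dimension with the largest (raw) value range in this partition
        return max(qi_dims, key=lambda d: max(r[d] for r in part) - min(r[d] for r in part))

    def partition(part):
        dim = widest_dim(part)
        values = sorted(r[dim] for r in part)
        median = values[len(values) // 2]
        left = [r for r in part if r[dim] < median]
        right = [r for r in part if r[dim] >= median]
        if len(left) < k or len(right) < k:        # cut not allowable -> stop here
            return [part]
        return partition(left) + partition(right)

    return partition(records)

data = [{"age": a, "zip": z} for a, z in [(25, 2138), (28, 2139), (44, 2141), (47, 2142)]]
for p in mondrian(data, ["age", "zip"], k=2):
    print(p)
```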

20 Allowable cut
(Figure: a cut is allowable only if each of the two resulting partitions still contains at least k records)

21 Method 2: spatial indexing
- Multidimensional spatial techniques:
  - Kd-tree (similar to the Mondrian algorithm)
  - R-tree and its variations (R-tree, R+-tree)
(Figure: leaf layer and upper layer of the tree)

22 Compacting bounds
- Example: uncompacted: age [1-80], salary [10k-100k]; compacted: age [20-40], salary [10k-50k] (sketch below)
- The original Mondrian does not consider compacting bounds
- With the R+-tree it is done automatically, so information is better preserved
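A minimal sketch of compacting bounds: publish each group's minimum bounding ranges instead of the full attribute ranges. The helper name and data are illustrative.

```python
def compact_bounds(group, attributes):
    """Return {attribute: (min, max)} over the records actually in the group."""
    return {a: (min(r[a] for r in group), max(r[a] for r in group)) for a in attributes}

group = [{"age": 20, "salary": 10_000}, {"age": 35, "salary": 42_000}, {"age": 40, "salary": 50_000}]
print(compact_bounds(group, ["age", "salary"]))
# {'age': (20, 40), 'salary': (10000, 50000)}  -- tighter than age [1-80], salary [10k-100k]
```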

23 Benefits of using the R+-tree
- Scalable: originally designed for indexing large disk-based data
- Multi-granularity k-anonymity via the tree layers
- Better performance
- Better quality

24 Performance
(Figure: performance comparison with the Mondrian algorithm)

25 Utility
- Metrics:
  - Discernibility penalty
  - KL divergence: measures the difference between a pair of distributions (original vs. anonymized data distribution)
  - Certainty penalty (sketch below): for a table T with m attributes, the penalty of a record t is sum_{i=1..m} |t.Ai| / |T.Ai|, where t.Ai is the generalized range of t on attribute Ai and T.Ai is the total range of Ai in T
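A small sketch of the certainty penalty as stated above. Representing each generalized value as a (low, high) pair is an assumption made for the example, not something specified on the slide.

```python
def certainty_penalty(record_ranges, total_ranges):
    """record_ranges / total_ranges: {attribute: (low, high)} for one record / the whole table T.
    Returns sum_i |t.Ai| / |T.Ai| over the attributes."""
    penalty = 0.0
    for attr, (lo, hi) in record_ranges.items():
        t_lo, t_hi = total_ranges[attr]
        penalty += (hi - lo) / (t_hi - t_lo)
    return penalty

total = {"age": (1, 80), "salary": (10_000, 100_000)}
print(certainty_penalty({"age": (20, 40), "salary": (10_000, 50_000)}, total))  # ~0.70
```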

26

27 Other issues
- Sparse, high-dimensional data
  - Transactional data -> boolean matrix (sketch below)
  - "On the anonymization of sparse high-dimensional data", ICDE08
  - Related to the problem of clustering transactional data
    - The paper above uses matrix-based clustering; item-based clustering (?)
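A tiny illustration of turning transactional data into the boolean record-by-item matrix mentioned above; the transactions are made up.

```python
transactions = [{"bread", "milk"}, {"milk", "beer"}, {"bread", "beer", "milk"}]
items = sorted(set().union(*transactions))                   # column order: item universe
matrix = [[int(item in t) for item in items] for t in transactions]
print(items)        # ['beer', 'bread', 'milk']
for row in matrix:
    print(row)      # one boolean row per transaction, e.g. [0, 1, 1]
```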

28 Other issues
- Effect of numerically encoding categorical data: the ordering of the categories may impact quality
- General-purpose utility metrics vs. task-specific utility metrics
- Attacks on the k-anonymity definition itself

