Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong.

Centralized publication  Assume that a hospital wants to publish the following table, called the microdata.  The publication must preserve the privacy of patients.  Prevent an adversary from knowing who-contracted-what. [Table: Microdata]

Centralized publication (cont.)  A simple solution: Remove column ‘Name’ and publish the rest.  It does not work. See next.

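Linking attacks  An adversary joins the published table with a voter registration list on the quasi-identifier (QI) attributes. [Figure: the published table, a voter registration list, and the QI attributes shared by both]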
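A minimal sketch of a linking attack, assuming the published table keeps exact QI values. The rows are borrowed from the microdata and voter registration list that appear in later slides; the function name `link` is hypothetical.

```python
# Published table with the Name column removed (QI values kept exact).
published = [
    (4, "M", 12000, "gastric ulcer"),
    (5, "M", 14000, "dyspepsia"),
    (28, "F", 37000, "flu"),
]

# Public voter registration list: (name, age, sex, zipcode).
voters = [
    ("Andy", 4, "M", 12000),
    ("Bill", 5, "M", 14000),
    ("Mike", 7, "M", 17000),
    ("Sarah", 28, "F", 37000),
]

def link(published, voters):
    """Join the two tables on the QI attributes (age, sex, zipcode)."""
    by_qi = {}
    for age, sex, zipcode, disease in published:
        by_qi.setdefault((age, sex, zipcode), []).append(disease)
    return {
        name: by_qi[(age, sex, zipcode)]
        for name, age, sex, zipcode in voters
        if (age, sex, zipcode) in by_qi
    }

reidentified = link(published, voters)
```

Every voter whose QI values match exactly one published tuple is re-identified together with the disease; Mike matches nothing because he is not in the microdata.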

These are real threats  Fact: 87% of Americans can be uniquely identified by {Zipcode, gender, date-of-birth}.  A famous experiment by Sweeney [International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002] found the medical record of an ex-governor of Massachusetts.

Objectives  Publish a distorted version of the dataset so that  [Privacy] the privacy of all individuals is “adequately” protected;  [Utility] the dataset is useful for analyzing the characteristics of the microdata.  Paradox: Privacy protection ↑, utility ↓.

Issues  Privacy principle  What is adequate privacy protection?  Distortion approach  How to achieve the privacy principle?  The literature has discussed other issues as well.  Complexities, improving the utility of the published data, etc.

Principle 1: k-anonymity [Sweeney, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002]  2-anonymous generalization: every QI group contains at least 2 tuples. [Figure: the published table with its QI attributes, sensitive attribute, and 4 QI groups, next to a voter registration list]
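A small sketch of how k-anonymity can be checked, assuming the 2-anonymous table is the generalized table shown in the later slides (4 QI groups). The function name `is_k_anonymous` is hypothetical.

```python
from collections import Counter

def is_k_anonymous(qi_rows, k):
    """True if every combination of generalized QI values occurs at least k times."""
    group_sizes = Counter(qi_rows)
    return all(size >= k for size in group_sizes.values())

# Generalized (Age, Sex, Zipcode) values, one entry per published tuple.
qi_rows = (
    [("[1, 5]", "M", "[10001, 15000]")] * 2
    + [("[6, 10]", "M", "[15001, 20000]")] * 2
    + [("[11, 20]", "F", "[20001, 25000]")] * 2
    + [("[21, 60]", "F", "[30001, 60000]")] * 4
)

two_anonymous = is_k_anonymous(qi_rows, 2)
three_anonymous = is_k_anonymous(qi_rows, 3)
```

The table passes for k = 2 but not for k = 3, because three of the four QI groups hold only two tuples.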

Defects of k-anonymity  What is the disease of Joe? No “diversity” in this QI group. [Figure: a voter registration list next to the published table]

Principle 2: l-diversity  Each QI group should have at least l “well-represented” sensitive values.  Different ways to interpret “well-represented”. [Machanavajjhala et al., ICDE, 2006]

Naive interpretation
 Each QI-group has l different sensitive values.

A 2-diverse table
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
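The naive interpretation can be checked mechanically. The sketch below, with the hypothetical helper `distinct_diversity`, reports the smallest number of distinct sensitive values found in any QI group of the table above.

```python
from collections import defaultdict

def distinct_diversity(rows):
    """Smallest number of distinct sensitive values in any QI group."""
    groups = defaultdict(set)
    for *qi, sensitive in rows:
        groups[tuple(qi)].add(sensitive)
    return min(len(values) for values in groups.values())

table = [
    ("[1, 5]", "M", "[10001, 15000]", "gastric ulcer"),
    ("[1, 5]", "M", "[10001, 15000]", "dyspepsia"),
    ("[6, 10]", "M", "[15001, 20000]", "pneumonia"),
    ("[6, 10]", "M", "[15001, 20000]", "bronchitis"),
    ("[11, 20]", "F", "[20001, 25000]", "flu"),
    ("[11, 20]", "F", "[20001, 25000]", "pneumonia"),
    ("[21, 60]", "F", "[30001, 60000]", "gastritis"),
    ("[21, 60]", "F", "[30001, 60000]", "gastritis"),
    ("[21, 60]", "F", "[30001, 60000]", "flu"),
    ("[21, 60]", "F", "[30001, 60000]", "flu"),
]

diversity = distinct_diversity(table)
```

Each of the four groups holds exactly two distinct diseases, so the table is 2-diverse under this interpretation.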

Defects of the naive interpretation  Assume that Joe is identified in the QI group. What is the probability that he contracted HIV?  Implication: The most frequent sensitive value in a QI group cannot be too frequent.  But accomplishing only this still leaves the table vulnerable to attacks with background knowledge. [Figure: a QI group with 100 tuples, 98 of which share one sensitive value]

Background knowledge attack  Let Joe be an individual in the QI group having HIV.  A friend of Joe has the background knowledge: “Joe does not have pneumonia”.  How likely is this friend to conclude that Joe has HIV? [Figure: a QI group with 100 tuples, with blocks of 50 tuples and 49 tuples]

Controlling also the 2nd most frequent value  Even if an adversary can eliminate pneumonia, s/he can conclude that Joe has HIV with probability only 40 / 70. [Figure: a QI group with 100 tuples, with blocks of 40 tuples and 30 tuples]
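The 40 / 70 figure is a conditional probability: the adversary discards the eliminated value and renormalizes. A sketch, assuming the 100-tuple group holds 40 HIV tuples, 30 pneumonia tuples, and 30 tuples with other values (the slide fixes only the 40 and the 30; the function name `posterior` is hypothetical):

```python
from fractions import Fraction

def posterior(counts, target, eliminated=()):
    """Probability of `target` once the values in `eliminated` are ruled out."""
    remaining = {v: n for v, n in counts.items() if v not in eliminated}
    return Fraction(remaining[target], sum(remaining.values()))

# Assumed composition of the QI group.
group = {"HIV": 40, "pneumonia": 30, "other": 30}

before = posterior(group, "HIV")
after = posterior(group, "HIV", eliminated=("pneumonia",))
```

Without background knowledge the probability is 40 / 100; after pneumonia is ruled out it rises to 40 / 70.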

An example of 4-diversity [Figure: a QI group divided into the most frequent value, the 2nd, 3rd and 4th most frequent values, and the other values]

An example of 4-diversity (cont.) [Figure: in the QI group, the most frequent value and the other values have the same cardinality]

An example of 4-diversity (cont.)  Assume that Joe is a person in the QI group.  Property: If an adversary can eliminate only 3 diseases, s/he can correctly guess the disease of Joe with at most 50% probability. [Figure: a QI group containing HIV, pneumonia, bronchitis, cancer, and the other values]

l-diversity  Consider a QI group.  m is the number of distinct sensitive values in the group.  r_1 is the number of tuples having the most frequent sensitive value.  r_2 is the number of tuples having the 2nd most frequent sensitive value.  …  r_m is the number of tuples having the m-th most frequent sensitive value.  Then, r_1 ≤ c (r_l + … + r_m), where c is a constant.  If an adversary can eliminate only l – 1 sensitive values, s/he can infer the disease of a person with probability at most c / (c + 1), since r_1 / (r_1 + r_l + … + r_m) ≤ c / (c + 1).  Called (c, l)-diversity precisely.
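A sketch of the (c, l)-diversity test for a single QI group, assuming the condition reads r_1 ≤ c (r_l + … + r_m) with the counts sorted from most to least frequent. The function name `is_cl_diverse` and the sample counts are illustrative only.

```python
def is_cl_diverse(counts, c, l):
    """counts: number of tuples per distinct sensitive value in one QI group."""
    r = sorted(counts, reverse=True)   # r[0] is r_1, the most frequent value
    if len(r) < l:
        return False                   # fewer than l distinct sensitive values
    return r[0] <= c * sum(r[l - 1:])  # r_1 <= c * (r_l + ... + r_m)

balanced = is_cl_diverse([40, 30, 30], c=1, l=2)   # 40 <= 30 + 30
skewed = is_cl_diverse([98, 1, 1], c=1, l=2)       # 98 <= 1 + 1 fails
```

A group dominated by one value fails the test even though it contains several distinct values, which is the case the naive interpretation misses.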

Defects of l-diversity
 Andy does not want anyone to know that he had a stomach problem.
 Sarah does not mind at all if others find out that she had flu.

A voter registration list
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

A 2-diverse table
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu

Defects of l-diversity (cont.)
 Does not work if an individual can have multiple tuples in the microdata.

Microdata
Name   Age  Sex  Zipcode  Disease
Andy   4    M    12000    gastric ulcer
Andy   4    M    12000    dyspepsia
Ken    6    M    18000    pneumonia
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu

Defects of l-diversity (cont.)

A voter registration list
Name   Age  Sex  Zipcode
Andy   4    M    12000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

A 2-diverse table
Age       Sex  Zipcode         Disease
4         M    12000           gastric ulcer
4         M    12000           dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu

Principle 3: Personalized anonymity  Key ideas: Guarding node + sensitive attribute (SA) generalization  Assume a publicly-known hierarchy on the sensitive attribute. [Xiao and Tao, SIGMOD, 2006]

Guarding node
 Andy does not want anyone to know that he had a stomach problem.
 He can specify “stomach disease” as the guarding node for his tuple.
 Protect Andy from being conjectured to have any disease in the subtree of the guarding node.

Name  Age  Sex  Zipcode  Disease        Guarding node
Andy  4    M    12000    gastric ulcer  stomach disease

Guarding node (cont.)
 Sarah is willing to disclose her exact symptom.
 She can specify Ø as the guarding node for her tuple.

Name   Age  Sex  Zipcode  Disease  Guarding node
Sarah  28   F    37000    flu      Ø

Guarding node (cont.)
 Bill does not have any special preference.
 He sets the guarding node of his tuple to be the same as his sensitive value.

Name  Age  Sex  Zipcode  Disease    Guarding node
Bill  5    M    14000    dyspepsia  dyspepsia

A personalized approach

Name   Age  Sex  Zipcode  Disease        Guarding node
Andy   4    M    12000    gastric ulcer  stomach disease
Bill   5    M    14000    dyspepsia      dyspepsia
Ken    6    M    18000    pneumonia      respiratory infection
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis      Ø
Sarah  28   F    37000    flu            Ø
Mary   56   F    58000    flu

Personalized anonymity
 No adversary should be able to breach the privacy requirement of any guarding node with a probability above p_breach.
 If p_breach = 0.3, then no adversary can have more than 30% probability to find out that:
 Andy had a stomach disease
 Bill had dyspepsia
……

Name   Age  Sex  Zipcode  Disease        Guarding node
Andy   4    M    12000    gastric ulcer  stomach disease
Bill   5    M    14000    dyspepsia      dyspepsia
Ken    6    M    18000    pneumonia      respiratory infection
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis      Ø
Sarah  28   F    37000    flu            Ø
Mary   56   F    58000    flu

Why SA generalization?
 How many female patients are there with age above 30?
 Estimate from the generalized table: 4 ∙ (60 – 30) / (60 – 20) = 3
 Real answer: 1

Microdata
Name   Age  Sex  Zipcode  Disease
Andy   4    M    12000    gastric ulcer
Bill   5    M    14000    dyspepsia
Ken    6    M    18000    pneumonia
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu

Pure QI generalization
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
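The estimate of 3 comes from assuming that ages are spread uniformly over the generalized interval. A sketch of that calculation, treating [21, 60] as 40 integer ages of which 30 exceed 30 (the function name `estimate_count` is hypothetical):

```python
from fractions import Fraction

def estimate_count(group_size, lo, hi, threshold):
    """Expected number of tuples with value > threshold, assuming the values
    are uniformly distributed over the integers lo..hi."""
    width = hi - lo + 1
    matching = max(0, hi - max(threshold, lo - 1))
    return group_size * Fraction(matching, width)

# The QI group [21, 60] of the generalized table holds 4 female patients.
estimated = estimate_count(4, 21, 60, 30)

# Exact ages of the same 4 patients in the microdata.
real = sum(1 for age in (21, 25, 28, 56) if age > 30)
```

The uniform assumption gives 3 while the microdata gives 1, because three of the four ages sit at the low end of the interval.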

SA generalization (cont.)

With SA generalization
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  flu
56        F    58000           respiratory infection

Pure QI generalization
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu

Evaluation of disclosure risk
 What is the probability that the adversary can find out that “Andy had a stomach disease”?

A voter registration list
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

The published data
Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection

Combinatorial reconstruction (cont.)  Can each individual appear more than once?  No = the primary case  Yes = the non-primary case  Some possible reconstructions: [Figure: two mappings from the individuals Andy, Bill, Ken, Nash, Mike to the tuples gastric ulcer, dyspepsia, pneumonia, bronchitis, one for the primary case and one for the non-primary case]

Breach probability (primary)  Totally 120 possible reconstructions  If Andy is associated with a stomach disease in n_b reconstructions  The probability that the adversary should associate Andy with some stomach problem is n_b / 120  Andy is associated with  gastric ulcer in 24 reconstructions  dyspepsia in 24 reconstructions  gastritis in 0 reconstructions  n_b = 48  The breach probability for Andy’s tuple is 48 / 120 = 2 / 5. [Figure: individuals Andy, Bill, Ken, Nash, Mike mapped to the tuples gastric ulcer, dyspepsia, pneumonia, bronchitis]

Breach probability (non-primary)  Totally 625 possible reconstructions  Andy is associated with gastric ulcer or dyspepsia or gastritis in 225 reconstructions.  n_b = 225  The breach probability for Andy’s tuple is 225 / 625 = 9 / 25. [Figure: individuals Andy, Bill, Ken, Nash, Mike mapped to the tuples gastric ulcer, dyspepsia, pneumonia, bronchitis]
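Both breach probabilities can be reproduced by brute-force enumeration. The sketch below assumes a reconstruction assigns an owner to each of the four tuples of the first QI group: distinct owners in the primary case, arbitrary owners in the non-primary case. The helper name `breach_probability` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product

people = ["Andy", "Bill", "Ken", "Nash", "Mike"]
diseases = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}  # subtree of Andy's guarding node

def breach_probability(assignments):
    """Count the reconstructions that give Andy at least one stomach disease."""
    total = breached = 0
    for owners in assignments:
        total += 1
        if any(o == "Andy" and d in stomach for o, d in zip(owners, diseases)):
            breached += 1
    return total, breached, Fraction(breached, total)

# Primary case: each individual owns at most one tuple.
primary = breach_probability(permutations(people, len(diseases)))

# Non-primary case: an individual may own several tuples.
non_primary = breach_probability(product(people, repeat=len(diseases)))
```

The enumeration yields 48 of 120 reconstructions in the primary case and 225 of 625 in the non-primary case, matching 2 / 5 and 9 / 25.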

A defect of personalized anonymity  Does not guard against background knowledge.  Recall that l-diversity can achieve this purpose.  But it seems possible to adapt the personalized approach to tackle background knowledge.  Future work?

Other privacy principles  k-gather.  Due to [Aggarwal et al., PODS, 2006]  Suffers from the problems of k-anonymity.  (a, k)-anonymity  Due to [Wong et al., KDD, 2006]  t-closeness.  Recently proposed by [Li and Li, ICDE, 2007]

Issues  Privacy principle  What is adequate privacy protection?  Distortion approach  How to achieve the privacy principle?

Three approaches  Suppression  We do not discuss it because the utility of the resulting table is low; it can be regarded as a special case of generalization.  Generalization  Due to [Sweeney, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002]  Anatomy (also called “bucketization”)  Due to [Xiao and Tao, VLDB, 2006]  Each of the above approaches can be integrated with all the privacy principles discussed earlier.

A multidimensional view of generalization

Taxonomy of generalization  Local recoding  (Generalized) rectangles may overlap.  Suppression is a special case of local recoding.  Global recoding  All rectangles are disjoint. [LeFevre et al. SIGMOD, 2005]

Taxonomy of generalization (cont.)  Global recoding can be further divided.  Single-dimension recoding  Rectangles form a grid.  Multi-dimension recoding  The opposite of single-dimension recoding.

Taxonomy of generalization (cont.)  Single-dimension recoding can be further divided.  Full-domain recoding  Full-subtree recoding  Both assume a hierarchy on each QI attribute.  Example: A hierarchy on Age

Taxonomy of generalization (cont.)  Full-domain recoding  All age values must be generalized to the same level of the hierarchy.

Taxonomy of generalization (cont.)  Full-subtree recoding  The subtrees of all generalized values must be disjoint.  Permissible generalization: [1, 30], [31, 40], [41, 50], [51, 60], [61, 90].  Illegal generalization: [1, 10], [1, 30], [31, 60], [61, 90].
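The disjointness requirement can be checked directly on the generalized intervals. This sketch covers only that requirement; it does not verify that each interval is a node of the Age hierarchy, which is not reproduced in this transcript. The function name `is_disjoint` is hypothetical.

```python
def is_disjoint(intervals):
    """True if no two closed integer intervals share a value."""
    ordered = sorted(intervals)
    return all(prev[1] < cur[0] for prev, cur in zip(ordered, ordered[1:]))

permissible = [(1, 30), (31, 40), (41, 50), (51, 60), (61, 90)]
illegal = [(1, 10), (1, 30), (31, 60), (61, 90)]

permissible_ok = is_disjoint(permissible)
illegal_ok = is_disjoint(illegal)
```

The second list fails because [1, 10] lies inside [1, 30], so the two subtrees overlap.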

Why all these generalization types?  Reason 1: If a dataset is generalized in a more restricted manner, less preprocessing is required before it can be analyzed by a standard statistical tool (such as SAS).

Why all these generalization types?  Reason 2: More restrictive generalization is usually faster to compute and easier to analyze.

Why all these generalization types?  Reason 3: Less restrictive generalization promises more accurate data analysis, provided that a sophisticated analytical method is used.

Generalization algorithms  Operate on a quality metric. Examples:  The generalization level (for full-domain recoding)  Total rectangle size (for local recoding) ……  Mostly heuristics-based.  Finding the optimal generalization is often NP-hard.

Defect of generalization
 Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = ‘pneumonia’ AND Age in [0, 30] AND Zipcode in [10001, 20000]
 Estimated answer: 2p, where p is the probability that each of the two pneumonia tuples satisfies the query conditions on Age and Zipcode.

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

Defect of generalization (cont.)
 Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = ‘pneumonia’ AND Age in [0, 30] AND Zipcode in [10001, 20000]
 p = Area(R_1 ∩ Q) / Area(R_1) = 0.05, where R_1 is the rectangle of the first QI group and Q is the query rectangle.
 Estimated answer for Query A: 2p = 0.1

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  pneumonia
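The value p = 0.05 follows from the overlap of the two rectangles on each axis. A sketch of the area computation, treating the ranges as closed integer intervals (the helper names `overlap` and `selectivity` are hypothetical):

```python
from fractions import Fraction

def overlap(lo1, hi1, lo2, hi2):
    """Number of integers shared by [lo1, hi1] and [lo2, hi2]."""
    return max(0, min(hi1, hi2) - max(lo1, lo2) + 1)

def selectivity(rect, query):
    """Fraction of the rectangle's area that falls inside the query."""
    p = Fraction(1)
    for (r_lo, r_hi), (q_lo, q_hi) in zip(rect, query):
        p *= Fraction(overlap(r_lo, r_hi, q_lo, q_hi), r_hi - r_lo + 1)
    return p

R1 = [(21, 60), (10001, 60000)]   # Age and Zipcode ranges of the QI group
Q = [(0, 30), (10001, 20000)]     # Age and Zipcode ranges of Query A

p = selectivity(R1, Q)
estimate = 2 * p                  # two pneumonia tuples in the group
```

The Age axis contributes 10 / 40 and the Zipcode axis 10000 / 50000, giving p = 1 / 20 and an estimate of 0.1.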

Defect of generalization (cont.)
 Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = ‘pneumonia’ AND Age in [0, 30] AND Zipcode in [10001, 20000]
 Estimated answer = 0.1
 The exact answer = 1

Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis

Defect of generalization (cont.)
 Cause of inaccuracy: QI distribution inside each QI group is lost!

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  pneumonia

Anatomy
 Releases a quasi-identifier table (QIT) and a sensitive table (ST).

Microdata
Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

Quasi-identifier table (QIT)
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

Sensitive table (ST)
Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

Anatomy (cont.)
1. Decide an l-diverse partition of the tuples.

A 2-diverse partition
Age  Sex  Zipcode  Disease
QI group 1
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
QI group 2
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition.

Quasi-identifier table (QIT)    Sensitive table (ST)
Age  Sex  Zipcode               Disease
group 1
23   M    11000                 pneumonia
27   M    13000                 dyspepsia
35   M    59000                 dyspepsia
59   M    12000                 pneumonia
group 2
61   F    54000                 flu
65   F    25000                 gastritis
65   F    25000                 flu
70   F    30000                 bronchitis

Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the decided partition.

Quasi-identifier table (QIT)
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

Sensitive table (ST)
Group-ID  Disease
1         pneumonia
1         dyspepsia
1         dyspepsia
1         pneumonia
2         flu
2         gastritis
2         flu
2         bronchitis
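The two steps can be sketched as a single pass over the partition. The code assumes the 2-diverse partition has already been decided (step 1) and only illustrates step 2, producing the ST with per-group counts as in the final published form; the function name `anatomize` is hypothetical.

```python
from collections import Counter

def anatomize(groups):
    """groups: list of QI groups, each a list of (age, sex, zipcode, disease).
    Returns the QIT rows and the ST as {(group_id, disease): count}."""
    qit, st = [], Counter()
    for group_id, group in enumerate(groups, start=1):
        for age, sex, zipcode, disease in group:
            qit.append((age, sex, zipcode, group_id))  # QI values stay exact
            st[(group_id, disease)] += 1               # diseases only per group
    return qit, dict(st)

partition = [
    [(23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
     (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia")],
    [(61, "F", 54000, "flu"), (65, "F", 25000, "gastritis"),
     (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis")],
]

qit, st = anatomize(partition)
```

The QIT keeps every QI value unchanged, while the ST reveals only how many times each disease occurs in a group, not which tuple carries it.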

Privacy preservation
 Given a pair of QIT and ST generated from an l-diverse partition, an adversary can infer the sensitive value of each individual with confidence at most 1 / l.

Voter registration entry
Name  Age  Sex  Zipcode
Bob   23   M    11000

Quasi-identifier table (QIT)
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

Sensitive table (ST)
Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

Accuracy of data analysis
 Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = ‘pneumonia’ AND Age in [0, 30] AND Zipcode in [10001, 20000]

Quasi-identifier table (QIT)
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

Sensitive table (ST)
Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

Accuracy of data analysis
 Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = ‘pneumonia’ AND Age in [0, 30] AND Zipcode in [10001, 20000]
 2 patients contracted pneumonia
 2 out of 4 patients satisfy the query conditions on Age and Zipcode
 Estimated answer = 2 * 2 / 4 = 1.

Tuple  Age  Sex  Zipcode  Group-ID
t1     23   M    11000    1
t2     27   M    13000    1
t3     35   M    59000    1
t4     59   M    12000    1
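The same estimate can be computed from the published QIT and ST. For each group, the sketch multiplies the ST count of the queried disease by the fraction of the group's QIT rows that satisfy the QI conditions; the function name `estimate` is hypothetical.

```python
qit = [  # (age, sex, zipcode, group_id)
    (23, "M", 11000, 1), (27, "M", 13000, 1),
    (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2),
    (65, "F", 25000, 2), (70, "F", 30000, 2),
]
st = {  # (group_id, disease) -> count
    (1, "dyspepsia"): 2, (1, "pneumonia"): 2,
    (2, "bronchitis"): 1, (2, "flu"): 2, (2, "gastritis"): 1,
}

def estimate(qit, st, disease, qi_predicate):
    """Sum over groups of: disease count * fraction of rows matching the QI conditions."""
    total = 0.0
    for group in {row[3] for row in qit}:
        members = [row for row in qit if row[3] == group]
        hits = sum(1 for row in members if qi_predicate(row))
        total += st.get((group, disease), 0) * hits / len(members)
    return total

answer = estimate(
    qit, st, "pneumonia",
    lambda row: 0 <= row[0] <= 30 and 10001 <= row[2] <= 20000,
)
```

Group 1 contributes 2 * 2 / 4 = 1 and group 2 contributes nothing, so the estimate equals the exact answer for this query.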

A defect of anatomy  Existence breach: Does an individual exist in the microdata?

Future work  Re-publication  Tackle stronger background knowledge  Recent work [Martin et al., ICDE, 2007]  Improving utility  Pioneering work [Kifer and Gehrke, SIGMOD, 2006]  Application to specific (non-trivial) applications  Location privacy Pioneering work [Mokbel et al., VLDB, 2006]