Anonymity for Continuous Data Publishing

1 Anonymity for Continuous Data Publishing
The 11th International Conference on Extending Database Technology (EDBT 2008) Anonymity for Continuous Data Publishing Benjamin C. M. Fung Concordia University Montreal, QC, Canada Ke Wang Simon Fraser University Burnaby, BC, Canada Ada Wai-Chee Fu The Chinese University of Hong Kong Jian Pei Simon Fraser University Burnaby, BC, Canada

2 Privacy-Preserving Data Publishing: k-anonymity [SS98]
Raw patient table (Hospital) — Quasi-Identifier (QID) attributes: Birthplace, Job; Sensitive attribute: Disease. Example records: (UK, Engineer, Flu), (UK, Lawyer, Diabetes), plus records with Birthplace France.
2-anonymous patient table — Job values generalized to Professional, so that every QID group, e.g. (UK, Professional), contains at least two records.

3 Privacy Requirements
k-anonymity [SS98]: Every QID group contains at least k records.
Confidence bounding [WFY05, WFY07]: Bound the confidence of inferring a sensitive value from a QID group within h%.
l-diversity [MGKV06]: Every QID group contains l well-represented distinct sensitive values.
To simplify the discussion, we assume each attribute has a finite set of domain values, and each QID attribute has a taxonomy tree for generalization.
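The three requirements above are all checks over QID groups. A minimal sketch of the first two (k-anonymity and confidence bounding) on a toy table, with the record layout and helper names being assumptions for illustration:

```python
from collections import Counter, defaultdict

def qid_groups(records, qid_attrs):
    """Group records by their quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[a] for a in qid_attrs)].append(rec)
    return groups

def is_k_anonymous(records, qid_attrs, k):
    """k-anonymity: every QID group contains at least k records."""
    return all(len(g) >= k for g in qid_groups(records, qid_attrs).values())

def max_confidence(records, qid_attrs, sensitive):
    """Largest confidence of inferring a sensitive value from any QID group."""
    best = 0.0
    for g in qid_groups(records, qid_attrs).values():
        counts = Counter(rec[sensitive] for rec in g)
        best = max(best, counts.most_common(1)[0][1] / len(g))
    return best

# Toy generalized table in the spirit of the running example:
table = [
    {"Birthplace": "UK", "Job": "Professional", "Disease": "Flu"},
    {"Birthplace": "UK", "Job": "Professional", "Disease": "Diabetes"},
    {"Birthplace": "France", "Job": "Professional", "Disease": "Flu"},
    {"Birthplace": "France", "Job": "Professional", "Disease": "Flu"},
]
print(is_k_anonymous(table, ["Birthplace", "Job"], 2))        # True
print(max_confidence(table, ["Birthplace", "Job"], "Disease"))  # 1.0
```

The France group is homogeneous (both records have Flu), so confidence bounding with, say, h = 80% would reject this table even though it is 2-anonymous.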

4 Continuous Data Publishing Model
At time T1:
Collect a set of raw data records D1.
Publish a k-anonymous version of D1, denoted release R1.
At time T2:
Collect a new set of raw data records D2.
Want to publish all data collected so far: publish a k-anonymous version of D1 ∪ D2, denoted release R2.
Assumption: each individual has at most one record in D1 ∪ D2.

5 Continuous Data Publishing Model
R1 (k-anonymized D1), with raw values in parentheses:
(a1)-(a3): Europe (UK), Lawyer, Flu
(a4)-(a5): Europe (France), Lawyer, Diabetes
R2 (k-anonymized D1 ∪ D2):
(b1)-(b3): UK, Professional (Lawyer), Flu
(b4)-(b6): France, Professional, Diabetes
(b7)-(b8): UK, Professional (Doctor), …
(b9)-(b10): France, Professional, …

6 Correspondence Attacks
An attacker could “crack” the k-anonymity by comparing R1 and R2. Background knowledge: QID of a target victim (e.g., Alice is born in France and is a lawyer.) Timestamp of a target victim. Correspondence knowledge: Every record in R1 has a corresponding record in R2. Every record timestamped T2 has a record in R2, but not in R1.

7 Our Contributions What exactly are the records that can be excluded (cracked) based on R1 and R2? Systematically characterize the set of records cracked by correspondence attacks. Propose the notion of BCF-anonymity to measure anonymity after excluding the cracked records. Develop an efficient algorithm to identify a BCF-anonymized R2, and study its data quality. Extend the proposed approach to deal with more than two releases and other privacy notions.

8 Problem Statements
Detection problem: Determine the number of cracked records in the worst case by applying the correspondence knowledge to the k-anonymized R1 and R2.
Anonymization problem: Given R1, D1, and D2, generalize R2 = D1 ∪ D2 so that R2 satisfies a given BCF-anonymity requirement and remains as useful as possible w.r.t. a specified information metric.

9 Forward-Attack (F-Attack)
Background: Alice, {France, Lawyer}, with timestamp T1. The attacker attempts to identify her record in R1. (R1 and R2 as in slide 5.)
a1, a2, a3 cannot all originate from [France, Lawyer]; otherwise, R2 would contain at least three records [France, Professional, Flu].

10 F-Attack
For a target P matching (qid1, qid2), the correspondence groups are CG(qid1, qid2) = {(g1, g2), (g1', g2')}: each group g1 within qid1 of R1 is paired with the group g2 within qid2 of R2 that shares its sensitive value.

11 F-Attack
Crack size of g1 w.r.t. P: c = |g1| − min(|g1|, |g2|) = 3 − min(3, 2) = 1.
Crack size of g1' w.r.t. P: c = |g1'| − min(|g1'|, |g2'|) = 2 − min(2, 3) = 0.
F(P, qid1, qid2) = Σ c over all pairs in CG(qid1, qid2) = 1.
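The F-attack crack-size computation can be sketched directly from the formula, using the group sizes of the running example (function names are illustrative):

```python
def f_crack(g1_size, g2_size):
    """F-attack: records of g1 (in R1) exceeding |g2| (its corresponding
    group in R2) cannot all originate from the target's QID, so they
    are cracked: c = |g1| - min(|g1|, |g2|)."""
    return g1_size - min(g1_size, g2_size)

def f_total(cg):
    """F(P, qid1, qid2): sum of crack sizes over all pairs in CG(qid1, qid2)."""
    return sum(f_crack(g1, g2) for g1, g2 in cg)

# Slide's sizes: (|g1|, |g2|) = (3, 2) and (|g1'|, |g2'|) = (2, 3).
print(f_crack(3, 2))            # 1
print(f_crack(2, 3))            # 0
print(f_total([(3, 2), (2, 3)]))  # 1
```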

12 Definition: F-Anonymity
F(qid1, qid2): the maximum F(P, qid1, qid2) over all targets P matching (qid1, qid2).
F(qid1): the maximum F(qid1, qid2) over all qid2 in R2.
F-anonymity of (R1, R2), denoted FA(R1, R2): the minimum of (|qid1| − F(qid1)) over all qid1 in R1.
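Once the maxima F(qid1) are available, FA is a single minimization over the groups of R1. A minimal sketch with hypothetical inputs (the sizes are chosen to be consistent with the running example, where the qid1 group [Europe, Lawyer] holds 5 records and loses at most 1 to an F-attack):

```python
def fa(qid1_sizes, f_max):
    """FA(R1, R2) = min over qid1 in R1 of (|qid1| - F(qid1)).

    qid1_sizes: {qid1 -> number of records}, f_max: {qid1 -> F(qid1)}."""
    return min(qid1_sizes[q] - f_max[q] for q in qid1_sizes)

print(fa({"[Europe, Lawyer]": 5}, {"[Europe, Lawyer]": 1}))  # 4
```

In words: after excluding the cracked record, an F-attacker still faces 4 indistinguishable candidates in this group.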

13 Cross-Attack (C-Attack)
Background: Alice, {France, Lawyer}, with timestamp T1. The attacker attempts to identify her record in R2. (R1 and R2 as in slide 5.)
At least one of b4, b5, b6 must have timestamp T2; otherwise, R1 would contain at least three records [Europe, Lawyer, Diabetes].

14 C-Attack
Crack size of g2 w.r.t. P: c = |g2| − min(|g1|, |g2|) = 2 − min(3, 2) = 0.
Crack size of g2' w.r.t. P: c = |g2'| − min(|g1'|, |g2'|) = 3 − min(2, 3) = 1.
C(P, qid1, qid2) = Σ c over all pairs in CG(qid1, qid2) = 1.
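The C-attack mirrors the F-attack with the roles of g1 and g2 swapped; a sketch with the slide's group sizes (function names are illustrative):

```python
def c_crack(g1_size, g2_size):
    """C-attack: records of g2 (in R2) exceeding |g1| (its corresponding
    group in R1) must carry timestamp T2, so they are cracked for a
    target known to be in D1: c = |g2| - min(|g1|, |g2|)."""
    return g2_size - min(g1_size, g2_size)

def c_total(cg):
    """C(P, qid1, qid2): sum of crack sizes over all pairs in CG(qid1, qid2)."""
    return sum(c_crack(g1, g2) for g1, g2 in cg)

# Slide's sizes: (|g1|, |g2|) = (3, 2) and (|g1'|, |g2'|) = (2, 3).
print(c_crack(3, 2))            # 0
print(c_crack(2, 3))            # 1
print(c_total([(3, 2), (2, 3)]))  # 1
```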

15 Definition: C-Anonymity
C(qid1, qid2): the maximum C(P, qid1, qid2) over all targets P matching (qid1, qid2).
C(qid2): the maximum C(qid1, qid2) over all qid1 in R1.
C-anonymity of (R1, R2), denoted CA(R1, R2): the minimum of (|qid2| − C(qid2)) over all qid2 in R2.

16 Backward-Attack (B-Attack)
Background: Alice, {UK, Lawyer}, with timestamp T2. The attacker attempts to identify her record in R2. (R1 and R2 as in slide 5.)
At least one of b1, b2, b3 must have timestamp T1; otherwise, one of a1, a2, a3 would have no corresponding record in R2.

17 B-Attack
Target person P: {UK, Lawyer} with timestamp T2.
Crack size of g2 w.r.t. P: c = max(0, |G1| − (|G2| − |g2|)), where
g2 = {b1, b2, b3}, G1 = {a1, a2, a3}, G2 = {b1, b2, b3, b7, b8},
so c = max(0, 3 − (5 − 3)) = 1.

18 B-Attack
Crack size of g2' w.r.t. P: c = max(0, |G1'| − (|G2'| − |g2'|)), where
g2' = {b9, b10}, G1' = {a4, a5}, G2' = {b4, b5, b6, b9, b10},
so c = max(0, 2 − (5 − 2)) = 0.
B(P, qid2) = Σ c over all g2 in qid2.
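The B-attack crack size can be sketched on the slides' two examples (function name is illustrative):

```python
def b_crack(G1, G2, g2):
    """B-attack crack size: c = max(0, |G1| - (|G2| - |g2|)).

    At least |G1| records of G2 must carry timestamp T1; if fewer than
    |G1| of them can sit outside g2, some records of g2 are cracked
    for a target with timestamp T2."""
    return max(0, len(G1) - (len(G2) - len(g2)))

# Slide 17's groups:
g2  = {"b1", "b2", "b3"}
G1  = {"a1", "a2", "a3"}
G2  = {"b1", "b2", "b3", "b7", "b8"}
print(b_crack(G1, G2, g2))     # 1

# Slide 18's groups:
g2p = {"b9", "b10"}
G1p = {"a4", "a5"}
G2p = {"b4", "b5", "b6", "b9", "b10"}
print(b_crack(G1p, G2p, g2p))  # 0
```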

19 Definition: B-Anonymity
B(qid2): the maximum B(P, qid2) over all targets P matching qid2.
B-anonymity of (R1, R2), denoted BA(R1, R2): the minimum of (|qid2| − B(qid2)) over all qid2 in R2.

20 In brief… Cracked records either do not originate from Alice's QID or do not have Alice's timestamp. Such records cannot be Alice's; excluding them lets the attacker focus on a smaller set of candidate records.

21 Definition: BCF-Anonymity
A BCF-anonymity requirement states that BA(R1, R2) ≥ k, CA(R1, R2) ≥ k, and FA(R1, R2) ≥ k, where k is a user-specified threshold. We now present an algorithm for anonymizing R2 = D1 ∪ D2.
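Given the three anonymity measures, checking the requirement is a direct conjunction; a minimal sketch with hypothetical values:

```python
def satisfies_bcf(ba, ca, fa, k):
    """BCF-anonymity holds iff BA(R1,R2), CA(R1,R2) and FA(R1,R2)
    all reach the user-specified threshold k."""
    return ba >= k and ca >= k and fa >= k

# Hypothetical anonymity values for a release pair:
print(satisfies_bcf(ba=4, ca=4, fa=4, k=2))  # True
print(satisfies_bcf(ba=1, ca=4, fa=4, k=2))  # False (B-attack violation)
```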

22 BCF-Anonymizer
generalize every value for Aj ∈ QID in R2 to ANYj;
let the candidate list contain all ANYj;
sort the candidate list by Score in descending order;
while the candidate list is not empty do
    if the first candidate w in the candidate list is valid then
        specialize w into {w1, …, wz} in R2;
        compute Score for each wi and add the wi to the candidate list;
        sort the candidate list by Score in descending order;
    else
        remove w from the candidate list;
    end if
end while
output R2
(Taxonomy: ANY → {Europe, America, …}; Europe → {France, UK}.)
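The loop above is a greedy top-down specialization. The sketch below captures its shape in Python, with `score`, `specialize`, and `is_valid` as hypothetical stand-ins for the paper's Score metric, taxonomy-tree specialization, and BCF-validity check; a heap replaces repeated sorting:

```python
import heapq

def bcf_anonymizer(score, specialize, is_valid, root_candidates):
    """Greedy top-down specialization sketch of the BCF-Anonymizer loop.

    score(w)      -- benefit of specializing w (assumed callable)
    specialize(w) -- children of w in the taxonomy, applied to R2
    is_valid(w)   -- True if specializing w keeps the BCF requirement
    Starts from the most general value ANYj of each QID attribute and
    repeatedly specializes the highest-scoring valid candidate."""
    heap = [(-score(w), w) for w in root_candidates]
    heapq.heapify(heap)
    applied = []
    while heap:
        _, w = heapq.heappop(heap)
        if not is_valid(w):
            continue  # w stays generalized; drop it from the candidates
        applied.append(w)
        for child in specialize(w):
            heapq.heappush(heap, (-score(child), child))
    return applied

# Toy taxonomy: ANY -> {Europe, America}, Europe -> {France, UK}.
tree = {"ANY": ["Europe", "America"], "Europe": ["France", "UK"]}
scores = {"ANY": 3, "Europe": 2, "America": 1, "France": 0, "UK": 0}
valid = {"ANY", "Europe", "America"}  # leaf specializations would violate k
out = bcf_anonymizer(scores.get, lambda w: tree.get(w, []),
                     valid.__contains__, ["ANY"])
print(out)  # ['ANY', 'Europe', 'America']
```

By the anti-monotonicity result on the next slide, once a candidate is invalid it can be discarded for good, which is what makes this greedy loop safe.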

23 Anti-Monotonicity of BCF-Anonymity
Theorem: Each of FA, CA, and BA is non-increasing with respect to a specialization on R2.
This guarantees that the produced BCF-anonymized R2 is maximally specialized (suboptimal): any further specialization would lead to a violation.

24 Empirical Study
Study the threat of correspondence attacks. Evaluate the information usefulness of a BCF-anonymized R2.
Adult dataset (US Census data): 8 categorical attributes; 30,162 records in the training set; 15,060 records in the testing set.

25 Experiment Settings
D1 contains all records in the testing set. Three cases of D2 at timestamp T2:
200D2: D2 contains the first 200 records in the training set, modelling a small set of new records at T2.
2000D2: D2 contains the first 2,000 records in the training set, modelling a medium set of new records at T2.
allD2: D2 contains all 30,162 records in the training set, modelling a large set of new records at T2.

26 Violations of BCF-Anonymity

27 Anonymization
BCF-Anonymized R2: our method.
k-Anonymized R2: not safe from correspondence attacks.
k-Anonymized D2: anonymize D2 separately from D1.

28 Extension: Beyond Two Releases
Consider raw data D1, …, Dn collected at timestamps T1, …, Tn.
Correspondence knowledge: every record in Ri has a corresponding record in every later release Rj with j > i.
Optimal micro attacks: choose the "best" background release, yielding the largest possible crack size.
Composition of micro attacks: "compose" multiple micro attacks (apply one after another) to increase the crack size of a group.

29 Related Work
Byun et al. (VLDB-SDM06) is an early study of the continuous data publishing scenario. Its anonymization relies on delaying the release of records, and the delay can be unbounded. In our method, records collected at timestamp Ti are always published in the corresponding release Ri without delay.
Xiao and Tao (SIGMOD07) present the first study to address both record insertions and deletions in data re-publication. Their anonymization relies on generalization and adding counterfeit records.

30 Related Work
Wang and Fung (SIGKDD06) study the problem of anonymizing sequential releases, where each subsequent release publishes a different subset of attributes for the same set of records. (Figure: R1 and R2 cover different subsets of attributes A, B, C, D.)

31 Conclusion & Contributions
Systematically characterize different types of correspondence attacks and concisely compute their crack sizes.
Define the BCF-anonymity requirement.
Present an anonymization algorithm that achieves BCF-anonymity while preserving information usefulness.
Extendable to multiple releases.

32 For more information: http://www.ciise.concordia.ca/~fung
Acknowledgement: Reviewers of EDBT; Concordia University Faculty Start-up Grants; Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants and PGS Doctoral Award.

33 References
[BSBL06] J.-W. Byun, Y. Sohn, E. Bertino, and N. Li. Secure anonymization for incremental datasets. In VLDB Workshop on Secure Data Management (SDM), 2006.
[MGKV06] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In ICDE, Atlanta, GA, April 2006.
[PXW07] J. Pei, J. Xu, Z. Wang, W. Wang, and K. Wang. Maintaining k-anonymity against incremental updates. In SSDBM, Banff, Canada, 2007.

34 References
[SS98] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, SRI International, March 1998.
[WF06] K. Wang and B. C. M. Fung. Anonymizing sequential releases. In ACM SIGKDD, Philadelphia, PA, August 2006.
[WFY05] K. Wang, B. C. M. Fung, and P. S. Yu. Template-based privacy preservation in classification problems. In IEEE ICDM, November 2005.

35 References
[WFY07] K. Wang, B. C. M. Fung, and P. S. Yu. Handicapping attacker's confidence: an alternative to k-anonymization. Knowledge and Information Systems (KAIS), 11(3), April 2007.
[XY07] X. Xiao and Y. Tao. m-invariance: Towards privacy preserving re-publication of dynamic datasets. In ACM SIGMOD, June 2007.

