
1 Anonymity for Continuous Data Publishing
Benjamin C. M. Fung, Concordia University, Montreal, QC, Canada
Ke Wang, Simon Fraser University, Burnaby, BC, Canada
Ada Wai-Chee Fu, The Chinese University of Hong Kong
Jian Pei, Simon Fraser University, Burnaby, BC, Canada
The 11th International Conference on Extending Database Technology (EDBT 2008)

2 Privacy-Preserving Data Publishing
k-anonymity [SS98]

Raw patient table (Hospital)
Quasi-Identifier (QID)    Sensitive
Birthplace  Job           Disease
UK          Engineer      Flu
UK          Lawyer        Diabetes
France      Engineer      Diabetes
France      Lawyer        Flu

2-anonymous patient table
Birthplace  Job           Disease
UK          Professional  Flu
UK          Professional  Diabetes
France      Professional  Diabetes
France      Professional  Flu
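The k-anonymity condition on this slide can be checked mechanically. The following is a minimal sketch, not code from the presentation; the function name `is_k_anonymous` and the dictionary layout of the records are assumptions made for illustration.

```python
from collections import Counter

def is_k_anonymous(records, qid_attrs, k):
    """Return True if every QID group in `records` holds at least k records."""
    groups = Counter(tuple(r[a] for a in qid_attrs) for r in records)
    return all(size >= k for size in groups.values())

QID = ("Birthplace", "Job")

# Raw patient table from the slide.
raw = [
    {"Birthplace": "UK", "Job": "Engineer", "Disease": "Flu"},
    {"Birthplace": "UK", "Job": "Lawyer", "Disease": "Diabetes"},
    {"Birthplace": "France", "Job": "Engineer", "Disease": "Diabetes"},
    {"Birthplace": "France", "Job": "Lawyer", "Disease": "Flu"},
]

# Same table with Job generalized to Professional.
anonymized = [dict(r, Job="Professional") for r in raw]

print(is_k_anonymous(raw, QID, 2))         # False: every QID group has size 1
print(is_k_anonymous(anonymized, QID, 2))  # True: two QID groups of size 2
```

Generalizing Job alone is enough here because each birthplace already appears twice.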

3 Privacy Requirement
k-anonymity [SS98]
- Every QID group contains at least k records.
Confidence bounding [WFY05, WFY07]
- Bound the confidence of the inference QID → sensitive value within h%.
l-diversity [MGKV06]
- Every QID group contains l well-represented distinct sensitive values.

Patient table
QID                       Sensitive
Birthplace  Job           Disease
UK          Professional  Flu
UK          Professional  Diabetes
UK          Professional  Diabetes
UK          Professional  Diabetes
France      Professional  Diabetes
France      Professional  Diabetes
France      Professional  Flu
France      Professional  Flu
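To make the last two requirements concrete, the sketch below computes the highest confidence of a QID → sensitive value inference and the smallest number of distinct sensitive values per QID group for the patient table above. It is an illustration only: the helper names are hypothetical, and counting distinct values is the simplest reading of "well-represented", which the slide does not define further.

```python
from collections import Counter, defaultdict

# Patient table from the slide: (Birthplace, Job, Disease).
PATIENTS = (
    [("UK", "Professional", "Flu")]
    + [("UK", "Professional", "Diabetes")] * 3
    + [("France", "Professional", "Diabetes")] * 2
    + [("France", "Professional", "Flu")] * 2
)

def sensitive_counts(records):
    """Map each QID group to a Counter of its sensitive values."""
    groups = defaultdict(Counter)
    for *qid, sensitive in records:
        groups[tuple(qid)][sensitive] += 1
    return groups

def max_confidence(records):
    """Highest fraction of a single sensitive value within any QID group."""
    return max(max(c.values()) / sum(c.values())
               for c in sensitive_counts(records).values())

def min_distinct_values(records):
    """Smallest number of distinct sensitive values in any QID group."""
    return min(len(c) for c in sensitive_counts(records).values())

print(max_confidence(PATIENTS))       # 0.75: [UK, Professional] -> Diabetes
print(min_distinct_values(PATIENTS))  # 2
```

The table is 4-anonymous, yet an attacker infers Diabetes for a UK professional with 75% confidence, which is the gap that confidence bounding and l-diversity address.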

4 Continuous Data Publishing Model
At time T1:
- Collected a set of raw data records D1.
- Published a k-anonymous version of D1, denoted release R1.
At time T2:
- Collect a new set of raw data records D2.
- Want to publish all data collected so far.
- Publish a k-anonymous version of D1 ∪ D2, denoted release R2.

5 Continuous Data Publishing Model
Raw values that were generalized away are shown in parentheses.

R1 (release of D1)
ID   Birthplace       Job     Disease
a1   Europe (UK)      Lawyer  Flu
a2   Europe (UK)      Lawyer  Flu
a3   Europe (UK)      Lawyer  Flu
a4   Europe (France)  Lawyer  Diabetes
a5   Europe (France)  Lawyer  Diabetes

R2 (release of D1 ∪ D2)
ID   Birthplace  Job                    Disease
b1   UK          Professional (Lawyer)  Flu
b2   UK          Professional (Lawyer)  Flu
b3   UK          Professional (Lawyer)  Flu
b4   France      Professional (Lawyer)  Diabetes
b5   France      Professional (Lawyer)  Diabetes
b6   France      Professional (Lawyer)  Diabetes
b7   France      Professional (Doctor)  Flu
b8   France      Professional (Doctor)  Flu
b9   UK          Professional (Doctor)  Diabetes
b10  UK          Professional (Doctor)  Diabetes
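The two releases on this slide can be reproduced from the raw records with a one-level generalization. This is a sketch under stated assumptions: the `PARENT` map is read off the parenthesized raw values, and the contents of D2 are inferred by removing the five D1 records from the ten raw records behind R2; the slide does not mark which R2 rows belong to D2.

```python
from collections import Counter

# Assumed one-level taxonomy, read off the parenthesized raw values.
PARENT = {"UK": "Europe", "France": "Europe",
          "Lawyer": "Professional", "Doctor": "Professional"}

def generalize(records, attr_index):
    """Replace the attribute at attr_index by its parent in the taxonomy."""
    return [rec[:attr_index] + (PARENT[rec[attr_index]],) + rec[attr_index + 1:]
            for rec in records]

# Raw records as (Birthplace, Job, Disease).
D1 = [("UK", "Lawyer", "Flu")] * 3 + [("France", "Lawyer", "Diabetes")] * 2
D2 = [("France", "Lawyer", "Diabetes"),
      ("France", "Doctor", "Flu"), ("France", "Doctor", "Flu"),
      ("UK", "Doctor", "Diabetes"), ("UK", "Lawyer", "Diabetes")]

R1 = generalize(D1, 0)        # generalize Birthplace, keep Job
R2 = generalize(D1 + D2, 1)   # keep Birthplace, generalize Job

print(Counter(R1))
print(Counter(R2))
```

Each release is 5-anonymous on its own, but the two releases generalize different attributes, which is what the correspondence attacks on the following slides exploit.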

6 Correspondence Attacks
An attacker could "crack" the k-anonymity by comparing R1 and R2.
Background knowledge:
- QID of a target victim (e.g., Alice was born in France and is a lawyer).
- Timestamp of a target victim.
Correspondence knowledge:
- Every record in R1 has a corresponding record in R2.
- Every record timestamped T2 has a record in R2, but not in R1.

7 Our Contributions
What exactly are the records that can be excluded (cracked) based on R1 and R2?
- Systematically characterize the set of records cracked by correspondence attacks.
- Propose the notion of BCF-anonymity to measure anonymity after excluding the cracked records.
Develop an efficient algorithm to identify a BCF-anonymized R2, and study its data quality.
Extend the proposed approach to deal with more than two releases and other privacy notions.

8 Problem Statements
Detection problem:
- Determine the number of cracked records in the worst case by applying the correspondence knowledge to the k-anonymized R1 and R2.
Anonymization problem:
- Given R1, D1 and D2, generalize R2 = D1 ∪ D2 so that R2 satisfies a given BCF-anonymity requirement and remains as useful as possible with respect to a specified information metric.

9 Forward-Attack (F-Attack)

R1
ID   Birthplace  Job     Disease
a1   Europe      Lawyer  Flu
a2   Europe      Lawyer  Flu
a3   Europe      Lawyer  Flu
a4   Europe      Lawyer  Diabetes
a5   Europe      Lawyer  Diabetes

R2
ID   Birthplace  Job           Disease
b1   UK          Professional  Flu
b2   UK          Professional  Flu
b3   UK          Professional  Flu
b4   France      Professional  Diabetes
b5   France      Professional  Diabetes
b6   France      Professional  Diabetes
b7   France      Professional  Flu
b8   France      Professional  Flu
b9   UK          Professional  Diabetes
b10  UK          Professional  Diabetes

Alice: {France, Lawyer} with timestamp T1. Attempt to identify her record in R1.
a1, a2, a3 cannot all originate from [France, Lawyer]. Otherwise, R2 would have at least three [France, Professional, Flu] records, but it has only two (b7, b8).

10 F-Attack
R1 and R2: same tables as on slide 9.
For Alice, qid1 = [Europe, Lawyer] in R1 and qid2 = [France, Professional] in R2.
Groups sharing a sensitive value are paired: g1 = {a1, a2, a3} with g2 = {b7, b8} (Flu), and g1' = {a4, a5} with g2' = {b4, b5, b6} (Diabetes).
CG(qid1, qid2) = {(g1, g2), (g1', g2')}

11 F-Attack
R1 and R2: same tables as on slide 9.
Crack size of g1 wrt P: c = |g1| - min(|g1|, |g2|) = 3 - min(3, 2) = 1.
Crack size of g1' wrt P: c = |g1'| - min(|g1'|, |g2'|) = 2 - min(2, 3) = 0.
F(P, qid1, qid2) = Σ c over all pairs in CG(qid1, qid2).

12 Definition: F-Anonymity
F(qid1, qid2) denotes the maximum F(P, qid1, qid2) for any target P that matches (qid1, qid2).
F(qid1) denotes the maximum F(qid1, qid2) for all qid2 in R2.
The F-anonymity of (R1, R2), denoted by FA(R1, R2), is the minimum of (|qid1| - F(qid1)) over all qid1 in R1.
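The F-attack crack size and the F-anonymity definition can be traced on the running example with a short sketch. The function names are hypothetical, and two simplifying assumptions are made: the crack size is taken to depend only on the pair (qid1, qid2), and every qid2 in R2 is treated as compatible with every qid1 in R1, which holds for this example but would need a taxonomy check in general.

```python
from collections import Counter

# Releases from the running example: (Birthplace, Job, Disease).
R1 = [("Europe", "Lawyer", "Flu")] * 3 + [("Europe", "Lawyer", "Diabetes")] * 2
R2 = ([("UK", "Professional", "Flu")] * 3
      + [("France", "Professional", "Diabetes")] * 3
      + [("France", "Professional", "Flu")] * 2
      + [("UK", "Professional", "Diabetes")] * 2)

def groups(release):
    """Map each QID to a Counter of sensitive value -> group size."""
    out = {}
    for *qid, sensitive in release:
        out.setdefault(tuple(qid), Counter())[sensitive] += 1
    return out

def f_crack(g1_sizes, g2_sizes):
    """Sum of |g1| - min(|g1|, |g2|) over groups paired by sensitive value."""
    shared = g1_sizes.keys() & g2_sizes.keys()
    return sum(g1_sizes[s] - min(g1_sizes[s], g2_sizes[s]) for s in shared)

def f_anonymity(r1, r2):
    """Minimum over qid1 of |qid1| minus the largest crack size over qid2."""
    g_r1, g_r2 = groups(r1), groups(r2)
    return min(
        sum(g1.values()) - max(f_crack(g1, g2) for g2 in g_r2.values())
        for g1 in g_r1.values()
    )

alice = f_crack(groups(R1)[("Europe", "Lawyer")],
                groups(R2)[("France", "Professional")])
print(alice)                # 1, i.e. (3 - 2) + (2 - 2)
print(f_anonymity(R1, R2))  # 4, i.e. 5 - 1
```

Under these assumptions one of the five [Europe, Lawyer] records can be excluded for Alice, so the F-anonymity of the example is 4 even though R1 is 5-anonymous.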

13 Cross-Attack (C-Attack)
R1 and R2: same tables as on slide 9.
Alice: {France, Lawyer} with timestamp T1. Attempt to identify her record in R2.
At least one of b4, b5, b6 must have timestamp T2. Otherwise, R1 would have at least three [Europe, Lawyer, Diabetes] records, but it has only two (a4, a5).

14 C-Attack
R1 and R2: same tables as on slide 9.
Crack size of g2 wrt P: c = |g2| - min(|g1|, |g2|) = 2 - min(3, 2) = 0.
Crack size of g2' wrt P: c = |g2'| - min(|g1'|, |g2'|) = 3 - min(2, 3) = 1.
C(P, qid1, qid2) = Σ c over all pairs in CG(qid1, qid2).

15 Definition: C-Anonymity
C(qid1, qid2) denotes the maximum C(P, qid1, qid2) for any target P that matches (qid1, qid2).
C(qid2) denotes the maximum C(qid1, qid2) for all qid1 in R1.
The C-anonymity of (R1, R2), denoted by CA(R1, R2), is the minimum of (|qid2| - C(qid2)) over all qid2 in R2.
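A matching sketch for the C-attack works directly on group sizes read off the R1 and R2 tables of the running example. The names `R1_GROUPS`, `R2_GROUPS`, `c_crack` and `c_anonymity` are hypothetical, and as a simplification every qid1 is assumed compatible with every qid2, which holds here because R1 contains a single QID.

```python
# Group sizes per QID and sensitive value, read off the R1 and R2 tables.
R1_GROUPS = {("Europe", "Lawyer"): {"Flu": 3, "Diabetes": 2}}
R2_GROUPS = {
    ("UK", "Professional"): {"Flu": 3, "Diabetes": 2},
    ("France", "Professional"): {"Diabetes": 3, "Flu": 2},
}

def c_crack(g1_sizes, g2_sizes):
    """Sum of |g2| - min(|g1|, |g2|) over groups paired by sensitive value."""
    shared = g1_sizes.keys() & g2_sizes.keys()
    return sum(g2_sizes[s] - min(g1_sizes[s], g2_sizes[s]) for s in shared)

def c_anonymity(r1_groups, r2_groups):
    """Minimum over qid2 of |qid2| minus the largest crack size over qid1."""
    return min(
        sum(g2.values()) - max(c_crack(g1, g2) for g1 in r1_groups.values())
        for g2 in r2_groups.values()
    )

alice = c_crack(R1_GROUPS[("Europe", "Lawyer")],
                R2_GROUPS[("France", "Professional")])
print(alice)                              # 1, i.e. (2 - 2) + (3 - 2)
print(c_anonymity(R1_GROUPS, R2_GROUPS))  # 4, i.e. 5 - 1
```

The only change from the F-attack is which side of the pair is being cracked: the subtraction starts from |g2| instead of |g1|.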

16 Backward-Attack (B-Attack)
R1 and R2: same tables as on slide 9.
Alice: {UK, Lawyer} with timestamp T2. Attempt to identify her record in R2.
At least one of b1, b2, b3 must have timestamp T1. Otherwise, only b7 and b8 would remain as Flu records for a1, a2, a3 to correspond to, and one of a1, a2, a3 would have no corresponding record in R2.

17 B-Attack
R1 and R2: same tables as on slide 9.
Target person P: {UK, Lawyer} with timestamp T2.
Crack size of g2 wrt P: c = max(0, |G1| - (|G2| - |g2|))
g2 = {b1, b2, b3}
G1 = {a1, a2, a3}
G2 = {b1, b2, b3, b7, b8}
c = max(0, 3 - (5 - 3)) = 1

18 B-Attack
R1 and R2: same tables as on slide 9.
Crack size of g2' wrt P: c = max(0, |G1'| - (|G2'| - |g2'|))
g2' = {b9, b10}
G1' = {a4, a5}
G2' = {b4, b5, b6, b9, b10}
c = max(0, 2 - (5 - 2)) = 0
B(P, qid2) = Σ c over all g2 in qid2.

19 Definition: B-Anonymity
B(qid2) denotes the maximum B(P, qid2) for any target P that matches qid2.
The B-anonymity of (R1, R2), denoted by BA(R1, R2), is the minimum of (|qid2| - B(qid2)) over all qid2 in R2.
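The B-attack arithmetic can be sketched the same way. This is an illustration under assumptions, not the authors' implementation: G1 is taken to be all R1 records with the same sensitive value as g2, and G2 all R2 records with that value, which matches the sets listed for the running example but ignores the taxonomy compatibility check a general implementation would need. All names are hypothetical.

```python
# Group sizes per QID and sensitive value, read off the R1 and R2 tables.
R1_GROUPS = {("Europe", "Lawyer"): {"Flu": 3, "Diabetes": 2}}
R2_GROUPS = {
    ("UK", "Professional"): {"Flu": 3, "Diabetes": 2},
    ("France", "Professional"): {"Diabetes": 3, "Flu": 2},
}

def b_crack(g2_size, big_g1_size, big_g2_size):
    """Crack size of one group: max(0, |G1| - (|G2| - |g2|))."""
    return max(0, big_g1_size - (big_g2_size - g2_size))

def b_crack_for_qid(qid2):
    """Sum of the crack sizes of all groups g2 inside qid2."""
    total = 0
    for sensitive, g2_size in R2_GROUPS[qid2].items():
        # Assumption: all records sharing the sensitive value are compatible.
        big_g1 = sum(g.get(sensitive, 0) for g in R1_GROUPS.values())
        big_g2 = sum(g.get(sensitive, 0) for g in R2_GROUPS.values())
        total += b_crack(g2_size, big_g1, big_g2)
    return total

def b_anonymity():
    """Minimum over qid2 of |qid2| minus its crack size."""
    return min(sum(sizes.values()) - b_crack_for_qid(qid2)
               for qid2, sizes in R2_GROUPS.items())

print(b_crack(3, 3, 5))                         # 1, group {b1, b2, b3}
print(b_crack(2, 2, 5))                         # 0, group {b9, b10}
print(b_crack_for_qid(("UK", "Professional")))  # 1
print(b_anonymity())                            # 4, i.e. 5 - 1
```

Intuitively, |G2| - |g2| counts the R2 records outside g2 that could absorb the R1 records in G1; whatever does not fit must come from g2 itself and therefore carries timestamp T1.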

20 In brief
Cracked records either do not originate from Alice's QID or do not have Alice's timestamp.
Such cracked records are not related to Alice, so excluding them allows the attacker to focus on a smaller set of candidate records.

21 Definition: BCF-Anonymity
A BCF-anonymity requirement states that all of BA(R1, R2) ≥ k, CA(R1, R2) ≥ k, and FA(R1, R2) ≥ k hold, where k is a user-specified threshold.
We now present an algorithm for anonymizing R2 = D1 ∪ D2.

22 BCF-Anonymizer
 1. generalize every value for Aj ∈ QID in R2 to ANYj;
 2. let candidate list contain all ANYj;
 3. sort candidate list by Score in descending order;
 4. while the candidate list is not empty do
 5.   if the first candidate w in candidate list is valid then
 6.     specialize w into {w1, …, wz} in R2;
 7.     compute Score for all wi, and add them to candidate list;
 8.     sort the candidate list by Score in descending order;
 9.   else
10.     remove w from the candidate list;
11.   end if
12. end while
13. output R2

Taxonomy tree (excerpt): ANY → {Europe, …, America}; Europe → {France, UK, …}
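The control flow of the pseudocode above can be sketched as a generic top-down specialization loop. This is a skeleton under assumptions rather than the authors' implementation: `children`, `score` and `is_valid` are hypothetical callbacks standing in for the taxonomy, the Score function and the BCF-anonymity check, none of which the slide specifies, and the toy validity rule in the usage example is not a real BCF check.

```python
def bcf_anonymizer(roots, children, score, is_valid):
    """Top-down specialization loop following steps 1-13 of the pseudocode.

    roots:    most general value ANY_j of each QID attribute (steps 1-2)
    children: hypothetical callback, value -> list of child values
    score:    hypothetical callback, value -> numeric Score
    is_valid: hypothetical callback, (value, cut) -> True if specializing
              the value keeps the release BCF-anonymous
    Returns the final cut: the set of values R2 is generalized to.
    """
    cut = set(roots)
    candidates = sorted(roots, key=score, reverse=True)   # step 3
    while candidates:                                     # step 4
        w = candidates.pop(0)                             # first candidate
        # Leaf values have no children and cannot be specialized further.
        if children(w) and is_valid(w, cut):              # step 5
            cut.remove(w)                                 # step 6
            cut.update(children(w))
            candidates.extend(children(w))                # step 7
            candidates.sort(key=score, reverse=True)      # step 8
        # Otherwise w is dropped for good (step 10).
    return cut                                            # step 13


# Toy usage with the taxonomy excerpt from the slide.
TAXONOMY = {"ANY": ["Europe", "America"], "Europe": ["France", "UK"]}

def toy_children(value):
    return TAXONOMY.get(value, [])

def toy_score(value):
    return len(toy_children(value))   # stand-in for the real Score

def toy_is_valid(value, cut):
    return value != "Europe"          # pretend specializing Europe violates

print(sorted(bcf_anonymizer(["ANY"], toy_children, toy_score, toy_is_valid)))
# ['America', 'Europe']
```

Dropping an invalid candidate permanently is only safe if validity can never be regained by later specializations, which is the role of the anti-monotonicity property.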

23 Anti-Monotonicity of BCF-Anonymity
Theorem: Each of FA, CA and BA is non-increasing with respect to a specialization on R2.
This guarantees that the produced BCF-anonymized R2 is maximally specialized (suboptimal): any further specialization leads to a violation.

24 Empirical Study
Study the threat of correspondence attacks.
Evaluate the information usefulness of a BCF-anonymized R2.
Adult dataset (US Census data)
- 8 categorical attributes
- 30,162 records in training set
- 15,060 records in testing set

25 Experiment Settings
D1 contains all records in the testing set.
Three cases of D2 at timestamp T2:
- 200D2: D2 contains the first 200 records in the training set, modelling a small set of new records at T2.
- 2000D2: D2 contains the first 2000 records in the training set, modelling a medium set of new records at T2.
- allD2: D2 contains all 30,162 records in the training set, modelling a large set of new records at T2.

26 Violations of BCF-Anonymity

27 Anonymization
- BCF-Anonymized R2: Our method.
- k-Anonymized R2: Not safe from correspondence attacks.
- k-Anonymized D2: Anonymize D2 separately from D1.

28 Extension: Beyond two releases
Beyond two releases
- Consider the raw data D1, …, Dn collected at timestamps T1, …, Tn.
- Correspondence knowledge: Every record in Ri has a corresponding record in all releases Rj such that j > i.
Optimal micro attacks: Choose the "best" background release, yielding the largest possible crack size.
Composition of micro attacks: "Compose" multiple micro attacks together (apply one after another) in order to increase the crack size of a group.

29 Related Work
Byun et al. (VLDB-SDM06) is an early study of the continuous data publishing scenario.
- Anonymization relies on delaying the release of records, and the delay can be unbounded.
- In our method, records collected at timestamp Ti are always published in the corresponding release Ri without delay.
Xiao and Tao (SIGMOD07) present the first study to address both record insertions and deletions in data re-publication.
- Anonymization relies on generalization and adding counterfeit records.

30 Related Work
Wang and Fung (SIGKDD06) study the problem of anonymizing sequential releases where each subsequent release publishes a different subset of attributes for the same set of records.
[Figure: releases R1 and R2 covering different subsets of the attributes A, B, C, D]

31 Conclusion & Contributions
- Systematically characterize different types of correspondence attacks and concisely compute their crack size.
- Define the BCF-anonymity requirement.
- Present an anonymization algorithm to achieve BCF-anonymity while preserving information usefulness.
- Extendable to multiple releases.

32 For more information:
Acknowledgement:
Reviewers of EDBT
Concordia University
- Faculty Start-up Grants
Natural Sciences and Engineering Research Council of Canada (NSERC)
- Discovery Grants
- PGS Doctoral Award

33 References
[BSBL06] J.-W. Byun, Y. Sohn, E. Bertino, and N. Li. Secure anonymization for incremental datasets. In VLDB Workshop on Secure Data Management (SDM), 2006.
[MGKV06] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In ICDE, Atlanta, GA, April 2006.
[PXW07] J. Pei, J. Xu, Z. Wang, W. Wang, and K. Wang. Maintaining k-anonymity against incremental updates. In SSDBM, Banff, Canada, 2007.

34 References
[SS98] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, SRI International, March 1998.
[WF06] K. Wang and B. C. M. Fung. Anonymizing sequential releases. In ACM SIGKDD, Philadelphia, PA, August 2006.
[WFY05] K. Wang, B. C. M. Fung, and P. S. Yu. Template-based privacy preservation in classification problems. In IEEE ICDM, November 2005.

35 References
[WFY07] K. Wang, B. C. M. Fung, and P. S. Yu. Handicapping attacker's confidence: an alternative to k-anonymization. Knowledge and Information Systems: An International Journal (KAIS), 11(3), April 2007.
[XY07] X. Xiao and Y. Tao. m-invariance: Towards privacy preserving re-publication of dynamic datasets. In ACM SIGMOD, June 2007.

