
1 To Do or Not To Do: The Dilemma of Disclosing Anonymized Data Lakshmanan L, Ng R, Ramesh G Univ. of British Columbia Oren Fine Nov. 2008 CS Seminar in Databases (236826)

2 Once Upon a Time… The police are after Edgar, a suspected drug lord. –Intelligence has gathered call and meeting data records as a transactional database –In order to incriminate Edgar, the police must find hard evidence, and wish to outsource data mining tasks to “We Mind your Data Ltd.” –But the police are subject to the law, and are obligated to protect the privacy of the people in the database – including Edgar, who is innocent until proven otherwise –Furthermore, Edgar is watching for the smallest hint that it is time to disappear…

3 I have the pleasure of introducing: Edgar vs. The Police

4 Motivation The Classic Dilemma: –Keep your data close to your chest and never risk privacy or confidentiality, or… –Disclose the data and gain potentially valuable knowledge and benefits In order to decide, we need to answer a major question –“Just how safe is the anonymized data?” –Safe = protecting the identities of the objects.

5 Agenda Anonymization Model the Attacker’s Knowledge Determine the risk to our data

6 Anonymization or De-Identification Transform sensitive data into generated unique content (strings, numbers). Example:

TID | Names
1 | {Hussein, Hassan, Dimitri}
2 | {Hussein, Edgar, Angela}
3 | {Angela, Edgar}
4 | {Raz, Adi, Yishai}
5 | {Hassan, Yishai, Dimitri, Raz}
6 | {Raz, Angela, Nithai}

TID | Transaction
1 | {1, 2, 3}
2 | {1, 4, 5}
3 | {5, 4}
4 | {6, 7, 8}
5 | {2, 8, 3, 6}
6 | {6, 5, 9}
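The anonymization step can be sketched in a few lines of Python (illustrative only, using the first three transactions from the slide; the function name is ours, not from the paper):

```python
# Minimal sketch of anonymization / de-identification: replace each
# distinct name with a fresh integer ID, preserving co-occurrence
# structure while hiding identities.
def anonymize(transactions):
    mapping = {}          # name -> anonymized ID
    anonymized = []
    for t in transactions:
        row = []
        for item in t:
            if item not in mapping:
                mapping[item] = len(mapping) + 1  # next unused ID
            row.append(mapping[item])
        anonymized.append(row)
    return anonymized, mapping

db = [["Hussein", "Hassan", "Dimitri"],
      ["Hussein", "Edgar", "Angela"],
      ["Angela", "Edgar"]]
anon, mapping = anonymize(db)
print(anon)  # [[1, 2, 3], [1, 4, 5], [5, 4]]
```

First-seen order assigns Hussein=1, Hassan=2, Dimitri=3, Edgar=4, Angela=5, which matches the slide's example tables.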

7 Anonymization or De-Identification Advantages –Very simple –Does not affect final outcome or perturb data characteristics We do not suggest that anonymization is the “right” way, but it is probably the most common

8 Frequent Set Mining Crash Course Transactional database Each transaction has a TID and a set of items An association rule of the form X → Y has –Support s if s% of the transactions include X ∪ Y –Confidence c if c% of the transactions that include X also include Y Support gives frequent sets; confidence gives association rules A k-itemset is a set of k items

9 Example

TID | Names
1 | Angela, Ariel, Edgar, Steve, Benny
2 | Edgar, Hassan, Steve, Tommy
3 | Joe, Sara, Israel
4 | Steve, Angela, Edgar
5 | Benny, Mahhmud, Tommy
6 | Angela, Sara, Edgar
7 | Hassan, Angela, Joe, Edgar, Noa
8 | Edgar, Benny, Steve, Tommy

10 Example (Cont.) First, we look for frequent sets, according to a support threshold. 2-itemsets: {Angela, Edgar}, {Edgar, Steve} have 50% support (4 out of 8 transactions). 3-itemsets: {Angela, Edgar, Steve}, {Benny, Edgar, Steve} and {Tommy, Edgar, Steve} have only 25% support (2 out of 8 transactions). The rule {Edgar, Steve} → {Angela} has 50% confidence (2 out of 4 transactions) and the rule {Tommy} → {Edgar, Steve} has 66.6% confidence (2 out of 3).
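These support and confidence figures can be checked mechanically (a sketch over the slide-9 database; the helper names are ours):

```python
# Support of an itemset = fraction of transactions that contain it;
# confidence of X -> Y = support(X | Y) / support(X).
db = [
    {"Angela", "Ariel", "Edgar", "Steve", "Benny"},
    {"Edgar", "Hassan", "Steve", "Tommy"},
    {"Joe", "Sara", "Israel"},
    {"Steve", "Angela", "Edgar"},
    {"Benny", "Mahhmud", "Tommy"},
    {"Angela", "Sara", "Edgar"},
    {"Hassan", "Angela", "Joe", "Edgar", "Noa"},
    {"Edgar", "Benny", "Steve", "Tommy"},
]

def support(itemset):
    # Count transactions of which itemset is a subset.
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y):
    return support(x | y) / support(x)

print(support({"Angela", "Edgar"}))                 # 0.5
print(confidence({"Edgar", "Steve"}, {"Angela"}))   # 0.5
print(confidence({"Tommy"}, {"Edgar", "Steve"}))    # 0.666...
```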

11 Frequent Set Mining Crash Course (You’re Qualified!) Widely used in market basket analysis, intrusion detection, Web usage mining and bioinformatics Aimed at discovering non-trivial, not necessarily intuitive relations between items/variables in large databases “Extracting wisdom out of data” Who knows what the most famous frequent set is?

12 Big Mart’s Database

13 Modeling the Attacker’s Knowledge We assume that the attacker has prior knowledge about the items in the original domain The prior information concerns the frequencies of items in the original domain We capture the attacker’s knowledge with “belief functions”

14 Examples of Belief Functions

15 Consistent Mapping A mapping of anonymized entities to original entities that is consistent with the belief function

16 Ignorant Belief Function (Q) What does the graph look like? What is the expected number of cracks? Suppose there are n items. Further suppose that we are only interested in a partial group of size n1. What is the expected number of cracks now? Don’t underestimate Edgar…

17 Ignorant Belief Function (A)
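The answer slide's content was an image that did not survive the transcript. As a hedged reconstruction (a standard result, not copied from the slide): under an ignorant belief function every one-to-one guess is consistent, so the attacker is effectively drawing a uniformly random permutation; each item is cracked with probability 1/n, giving an expected n · (1/n) = 1 crack regardless of n, and n1/n for a partial group of size n1. A quick simulation:

```python
import random

# Simulate an ignorant attacker: guess a uniformly random one-to-one
# mapping and count fixed points ("cracks"). The expectation is 1.
def expected_cracks(n, trials=20000, seed=0):
    rng = random.Random(seed)
    items = list(range(n))
    total = 0
    for _ in range(trials):
        guess = items[:]
        rng.shuffle(guess)
        total += sum(g == i for i, g in enumerate(guess))
    return total / trials

print(expected_cracks(12))  # close to 1.0, for any n
```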

18 Compliant Point-Valued Belief Function (Q) What does the graph look like? What is the expected number of cracks? Suppose there are n items. Further suppose that we are only interested in a partial group of size n1. What is the expected number of cracks now? Unless he has an inside source, we shouldn’t overestimate Edgar either…

19 Compliant Point-Valued Belief Function (A)

20 Compliant Interval Belief Functions Direct Computation Method –Build a graph G and its adjacency matrix A_G –The probability of cracking k out of n items is expressed through the permanent of A_G Computing the permanent is known to be a #P-complete problem; the state-of-the-art approximation runs in O(n^22) time!! What the !#$!% is a permanent or #P-complete?

21 Permanent The permanent of an n×n matrix A is perm(A) = Σ_σ Π_{i=1..n} a_{i,σ(i)}, where the sum is over all permutations σ of 1, 2, …, n Calculating the permanent is #P-complete Which brings us to…
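For small matrices the permanent can be evaluated straight from this definition (brute force over all n! permutations, which is exactly why the complexity question matters):

```python
from itertools import permutations

# perm(A) = sum over all permutations s of prod_i A[i][s(i)]
def permanent(a):
    n = len(a)
    total = 0
    for perm in permutations(range(n)):
        prod = 1
        for i, j in enumerate(perm):
            prod *= a[i][j]
        total += prod
    return total

# Every permutation of the all-ones 3x3 matrix contributes 1, so perm = 3! = 6.
print(permanent([[1, 1, 1], [1, 1, 1], [1, 1, 1]]))  # 6
print(permanent([[1, 2], [3, 4]]))                   # 1*4 + 2*3 = 10
```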

22 #P-Complete Unlike the well-known complexity classes of decision problems, this is a class of function problems: “compute f(x),” where f is the number of accepting paths of an NP machine Example –NP: Is there a subset of a list of integers that adds up to zero? –#P: How many subsets of a list of integers add up to zero?
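The decision-vs-counting pair from the slide can be made concrete with a brute-force counter (a toy sketch; the input list is hypothetical):

```python
from itertools import combinations

# Count the non-empty subsets that sum to zero. The NP-style question
# asks whether the count is nonzero; the #P-style question asks for it.
def zero_subsets(nums):
    count = 0
    for r in range(1, len(nums) + 1):
        for combo in combinations(nums, r):
            if sum(combo) == 0:
                count += 1
    return count

nums = [-3, 1, 2, 5, -2]
count = zero_subsets(nums)
print(count > 0)  # NP: yes, a zero-sum subset exists
print(count)      # #P: there are 3 of them
```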

23 Chain Belief Functions

24

25 Unfortunately… A general belief function does not always produce a chain… We seek a way to estimate the number of cracks.

26 The O-estimate Heuristic Suppose we are given a graph G and an interval belief function β. For each x, let O_x denote the outdegree of x in G. The probability of cracking x is estimated simply as 1/O_x The expected number of cracks is then the sum of 1/O_x over all items x
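The heuristic itself is a one-liner once the outdegrees are known (a minimal sketch; the example outdegrees are hypothetical):

```python
# O-estimate: estimate P(crack x) as 1/O_x and sum over all items.
def o_estimate(outdegrees):
    return sum(1.0 / o for o in outdegrees if o > 0)

print(o_estimate([1, 2, 4]))  # 1 + 0.5 + 0.25 = 1.75
```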

27 Properties of O-estimate Inexact (hence “estimate”) Monotonic

28 α-Compliant Belief Function Suppose we “somehow” know which items are guessed wrong We sum the O-estimates only over the compliant frequency groups

29 Risk Assessment Worst case / best case – unrealistic Determine the interval width –Twice the median gap of all successive frequency groups –Why? Determine the degree of compliancy –Perform a binary search on α, subject to a “degree of tolerance”

30 End-to-End Example These intelligence call & meeting data records are classified “Top Secret”

TID | Names
1 | Angela, Ariel, Edgar, Steve, Benny
2 | Edgar, Hassan, Steve, Tommy
3 | Joe, Sara, Israel
4 | Steve, Angela, Edgar
5 | Benny, Mahhmud, Tommy
6 | Angela, Sara, Edgar
7 | Hassan, Angela, Joe, Edgar, Noa
8 | Edgar, Benny, Steve, Tommy

31 We Anonymize the Database

I (name) | J (anonymized ID) | freq
Angela | 1 | 4/8
Ariel | 2 | 1/8
Edgar | 3 | 6/8
Steve | 4 | 4/8
Benny | 5 | 3/8
Hassan | 6 | 2/8
Tommy | 7 | 3/8
Joe | 8 | 2/8
Sara | 9 | 2/8
Israel | 10 | 1/8
Noa | 11 | 1/8
Mahhmud | 12 | 1/8

TID | Items
1 | 1, 2, 3, 4, 5
2 | 3, 6, 4, 7
3 | 8, 9, 10
4 | 4, 1, 3
5 | 5, 7, 12
6 | 1, 9, 3
7 | 6, 1, 8, 3, 11
8 | 3, 5, 4, 7

32 Frequency Groups

Frequency | Items
1/8 | 2, 10, 11, 12
2/8 | 6, 8, 9
3/8 | 5, 7
4/8 | 1, 4
6/8 | 3

The gaps between successive frequency groups: 1/8, 1/8, 1/8, 2/8 The median gap = 1/8
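The grouping and median-gap computation can be reproduced from the anonymized frequencies (a sketch using exact fractions and the slide-31 data):

```python
from fractions import Fraction
from statistics import median

# Observed frequency of each anonymized item (slide 31's data).
freqs = {1: Fraction(4, 8), 2: Fraction(1, 8), 3: Fraction(6, 8),
         4: Fraction(4, 8), 5: Fraction(3, 8), 6: Fraction(2, 8),
         7: Fraction(3, 8), 8: Fraction(2, 8), 9: Fraction(2, 8),
         10: Fraction(1, 8), 11: Fraction(1, 8), 12: Fraction(1, 8)}

# Group items by their observed frequency.
groups = {}
for item, f in freqs.items():
    groups.setdefault(f, []).append(item)

levels = sorted(groups)                             # 1/8, 2/8, 3/8, 4/8, 6/8
gaps = [b - a for a, b in zip(levels, levels[1:])]  # 1/8, 1/8, 1/8, 2/8
print(median(gaps))  # 1/8
```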

33 The Attacker’s Prior Knowledge

I | Frequency interval
Angela | 3/8 – 5/8
Ariel | 0 – 2/8
Edgar | 5/8 – 7/8
Steve | 3/8 – 5/8
Benny | 2/8 – 4/8
Hassan | 1/8 – 3/8
Tommy | 2/8 – 4/8
Joe | 1/8 – 3/8
Sara | 1/8 – 3/8
Israel | 0 – 2/8
Noa | 0 – 2/8
Mahhmud | 0 – 2/8

34 The Graph, By the Way… [Figure: the bipartite consistency graph, with anonymized IDs 1–12 on one side and the twelve names on the other; an edge connects an anonymized ID to every name it could plausibly map to under the belief function.]

35 Calculating the Risk O_est = 1/4 + 1/7 + 1/3 + 1/4 + 1/7 + 1/9 + 1/7 + 1/9 + 1/9 + 1/7 + 1/7 + 1/7 ≈ 2.024 Now it’s a question of how much you would tolerate... Note that this is the expected number of cracks. However, if we are interested specifically in Edgar, as we’ve seen in previous lemmas, his probability of being cracked is 1/3.
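The arithmetic on this slide can be checked exactly with rational numbers (the outdegrees are read off the sum of reciprocals on the slide):

```python
from fractions import Fraction

# The twelve outdegrees implied by slide 35's sum of reciprocals.
outdegrees = [4, 7, 3, 4, 7, 9, 7, 9, 9, 7, 7, 7]
o_est = sum(Fraction(1, o) for o in outdegrees)
print(o_est, float(o_est))  # 85/42, about 2.0238
```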

36 Experiments

37 Open Problems The attacker’s prior knowledge remains a largely unsolved issue This article does not really deal with frequent sets but rather with frequent items –Frequent sets can add more information and differentiate objects within one frequency group

38 Modeling the Attacker’s Knowledge in the Real World A report for the Canadian Privacy Commissioner gives a broad mapping of adversary knowledge –Mapping phone directories –CVs –Inferring gender, year of birth and postal code from other details –Data remnants on second-hand hard disks –Etc.

39 All’s well that ends well

40 Bibliography Lakshmanan L., Ng R., Ramesh G. To Do or Not To Do: The Dilemma of Disclosing Anonymized Data. ACM SIGMOD Conference, 2005. Agrawal R., Srikant R. Fast Algorithms for Mining Association Rules. In Proc. 1994 Int. Conf. on Very Large Data Bases (VLDB ’94), Santiago, Chile, pp. 487–499. El Emam K. et al. Pan-Canadian De-Identification Guidelines for Personal Health Information, April 2007. Wikipedia –Association rule –#P –Permanent

41 Questions?

