Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.


1 Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo

2 My topic
How to share medical records with third parties without compromising data privacy

3 Reported by AMRA "...medical information is routinely shared with and viewed by third parties who are not involved in patient care.... The American Medical Records Association has identified twelve categories of information seekers outside of the health care industry who have access to health care files, including employers, government agencies, credit bureaus, insurers, educational institutions, and the media."

4 Privacy preserving data publishing
Microdata:
Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis
Purposes:
– Allow researchers to effectively study the correlation between various attributes
– Protect the privacy of every patient

5 A naïve solution
Publish the microdata with the Name column removed. It does not work; see the next slide.
Microdata:
Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis
Published table:
Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

6 Inference attack
An adversary knows that Bob
– has been hospitalized before
– is 23 years old
– lives in an area with zipcode 11000
Published table:
Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis
Quasi-identifier (QI) attributes: Age, Sex, Zipcode
Only the first row matches Bob's age and zipcode, so the adversary learns that Bob has contracted pneumonia.
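
The linking step behind this attack can be sketched in a few lines of Python. This is an illustration only and not part of the original slides; the `published` list and the `candidate_diseases` helper are names assumed here.

```python
# Published table from slide 5: names removed, QI values left exact.
# Each row is (age, sex, zipcode, disease).
published = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),
    (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]


def candidate_diseases(age, zipcode):
    """Return the diseases of every published row matching the known QI values."""
    return [d for (a, _sex, z, d) in published if a == age and z == zipcode]


# The adversary knows Bob is 23 and lives in zipcode 11000.
print(candidate_diseases(23, 11000))  # ['pneumonia']
```

A single candidate means the sensitive value is disclosed with certainty. Alice and Linda share the same QI values, so `candidate_diseases(65, 25000)` returns two candidates instead of one.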

7 Background
– Generalization
– Anatomy

8 Generalization
Transform each QI value into a less specific form.
A generalized table:
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis
Adversary's knowledge:
Name  Age  Sex  Zipcode
Bob   23   M    11000
How much generalization do we need?

9 l-diversity
A QI-group with m tuples is l-diverse iff each sensitive value appears no more than m / l times in the QI-group.
A table is l-diverse iff all of its QI-groups are l-diverse.
The table below has 2 QI-groups and is 2-diverse.
Quasi-identifier (QI) attributes: Age, Sex, Zipcode. Sensitive attribute: Disease.
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis
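
The definition above translates directly into a check. The following Python sketch is an illustration and not taken from the slides; `is_l_diverse` is a hypothetical helper that receives, for each QI-group, the list of its sensitive values.

```python
from collections import Counter


def is_l_diverse(groups, l):
    """Check the slide's definition: in every QI-group of m tuples,
    no sensitive value may appear more than m / l times."""
    for values in groups:
        if not values:
            continue  # an empty group constrains nothing
        m = len(values)
        most_frequent = Counter(values).most_common(1)[0][1]
        # most_frequent > m / l, written without division
        if most_frequent * l > m:
            return False
    return True


# Sensitive values of the two QI-groups in the generalized table.
groups = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "gastritis", "flu", "bronchitis"],
]
print(is_l_diverse(groups, 2))  # True
print(is_l_diverse(groups, 3))  # False
```

With m = 4 and l = 2, each value may appear at most twice, which both groups satisfy. For l = 3 the bound drops to 4 / 3, so any value appearing twice violates it.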

10 What l-diversity guarantees
From an l-diverse generalized table, an adversary (without any prior knowledge) can infer the sensitive value of each individual with confidence at most 1/l.
A 2-diverse generalized table:
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis
Adversary's knowledge:
Name  Age  Sex  Zipcode
Bob   23   M    11000
A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006

11 Defect of generalization
Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis
Estimated answer: 2 * p, where p is the probability that each of the two pneumonia tuples satisfies the query conditions on Age and Zipcode

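12 Defect of generalization (cont.)
Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  pneumonia
p = Area(R1 ∩ Q) / Area(R1) = 0.05, where R1 is the rectangle covered by the generalized tuple (Age [21, 60], Zipcode [10001, 60000]) and Q is the query rectangle (Age [0, 30], Zipcode [10001, 20000])
Estimated answer for query A: 2 * p = 0.1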
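
The value p = 0.05 can be reproduced with a short Python sketch. It is an illustration, not the author's code, and it assumes that Age and Zipcode take integer values spread uniformly over each generalized interval, so that "area" is a count of integer points; `overlap_fraction` is a hypothetical helper.

```python
from fractions import Fraction


def overlap_fraction(group_range, query_range):
    """Fraction of the integers in group_range that also lie in query_range.
    Both ranges are inclusive (lo, hi) pairs."""
    g_lo, g_hi = group_range
    q_lo, q_hi = query_range
    lo, hi = max(g_lo, q_lo), min(g_hi, q_hi)
    covered = max(0, hi - lo + 1)
    return Fraction(covered, g_hi - g_lo + 1)


# Generalized tuple R1: Age [21, 60], Zipcode [10001, 60000].
# Query Q: Age [0, 30], Zipcode [10001, 20000].
p_age = overlap_fraction((21, 60), (0, 30))             # 10 / 40
p_zip = overlap_fraction((10001, 60000), (10001, 20000))  # 10000 / 50000
p = p_age * p_zip

# Two generalized tuples carry the value pneumonia.
estimate = 2 * p
print(float(p), float(estimate))  # 0.05 0.1
```

Under the uniformity assumption the Age overlap contributes 1/4 and the Zipcode overlap 1/5, giving p = 1/20 and an estimate of 0.1.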

13 Defect of generalization (cont.)
Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
Estimated answer from the generalized table: 0.1
Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis
The exact answer should be: 1

14 Basic Idea of Anatomy
For a given microdata table, Anatomy releases a quasi-identifier table (QIT) and a sensitive table (ST).
Microdata:
Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis
Quasi-identifier Table (QIT):
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2
Sensitive Table (ST):
Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

15 Basic Idea of Anatomy (cont.)
1. Select a partition of the tuples
A 2-diverse partition:
Age  Sex  Zipcode  Disease     QI group
23   M    11000    pneumonia   1
27   M    13000    dyspepsia   1
35   M    59000    dyspepsia   1
59   M    12000    pneumonia   1
61   F    54000    flu         2
65   F    25000    gastritis   2
65   F    25000    flu         2
70   F    30000    bronchitis  2

16 Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition
         quasi-identifier table (QIT)  sensitive table (ST)
         Age  Sex  Zipcode             Disease
group 1  23   M    11000               pneumonia
group 1  27   M    13000               dyspepsia
group 1  35   M    59000               dyspepsia
group 1  59   M    12000               pneumonia
group 2  61   F    54000               flu
group 2  65   F    25000               gastritis
group 2  65   F    25000               flu
group 2  70   F    30000               bronchitis

17 Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition
quasi-identifier table (QIT):
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2
sensitive table (ST):
Group-ID  Disease
1         pneumonia
1         dyspepsia
1         dyspepsia
1         pneumonia
2         flu
2         gastritis
2         flu
2         bronchitis

18 Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition
quasi-identifier table (QIT):
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2
sensitive table (ST):
Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1
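
Step 2 can be sketched in Python as follows. The sketch covers only the split into QIT and ST for a partition that has already been chosen; it does not show how Anatomy selects an l-diverse partition. The `anatomize` function and the `partition` mapping (group ID to row indices) are names assumed for this illustration.

```python
from collections import Counter

# Microdata without names: (age, sex, zipcode, disease).
microdata = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),
    (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]

# The 2-diverse partition from slide 15, as group ID -> row indices.
partition = {1: [0, 1, 2, 3], 2: [4, 5, 6, 7]}


def anatomize(rows, partition):
    """Split rows into a QIT (exact QI values plus group ID) and an
    ST (group ID, disease, count) for the given partition."""
    qit = []
    counts = Counter()
    for gid, indices in sorted(partition.items()):
        for i in indices:
            age, sex, zipcode, disease = rows[i]
            qit.append((age, sex, zipcode, gid))
            counts[(gid, disease)] += 1
    st = [(gid, disease, n) for (gid, disease), n in sorted(counts.items())]
    return qit, st


qit, st = anatomize(microdata, partition)
for row in st:
    print(row)
```

The QIT keeps every QI value exact, while the ST only records how often each disease occurs in each group, so the link between a QI row and its disease is hidden within the group.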

19 Privacy Preservation
From a pair of QIT and ST generated from an l-diverse partition, the adversary can infer the sensitive value of each individual with confidence at most 1/l.
quasi-identifier table (QIT):
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2
sensitive table (ST):
Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1
Adversary's knowledge:
Name  Age  Sex  Zipcode
Bob   23   M    11000

20 Accuracy of Data Analysis
Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
quasi-identifier table (QIT):
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2
sensitive table (ST):
Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

21 Accuracy of Data Analysis (cont.)
Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
According to the ST, 2 patients in group 1 have contracted pneumonia.
2 out of the 4 patients in group 1 satisfy the query conditions on Age and Zipcode.
Estimated answer for query A: 2 * 2 / 4 = 1, which is also the actual result from the original microdata
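
The estimate above can be reproduced from the QIT and ST alone. The Python sketch below is an illustration under the assumption that, within a group, every tuple is equally likely to carry each of the group's sensitive values; `estimate_count` is a hypothetical helper, not an API from the slides.

```python
from fractions import Fraction

# QIT rows: (age, sex, zipcode, group_id).
qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1),
    (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2),
    (65, "F", 25000, 2), (70, "F", 30000, 2),
]
# ST rows: (group_id, disease, count).
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]


def estimate_count(qit, st, disease, age_range, zip_range):
    """Estimate COUNT(*) for a query on Disease, Age and Zipcode.
    Per group: (tuples with the disease) * (share of tuples matching the QI conditions)."""
    total = Fraction(0)
    for gid in sorted({g for (_a, _s, _z, g) in qit}):
        members = [(a, z) for (a, _s, z, g) in qit if g == gid]
        matching = sum(
            1 for (a, z) in members
            if age_range[0] <= a <= age_range[1]
            and zip_range[0] <= z <= zip_range[1]
        )
        with_disease = sum(n for (g, d, n) in st if g == gid and d == disease)
        total += Fraction(with_disease * matching, len(members))
    return total


print(estimate_count(qit, st, "pneumonia", (0, 30), (10001, 20000)))  # 1
```

Group 1 contributes 2 * 2 / 4 = 1 and group 2 contributes 0 because it contains no pneumonia tuple, so the estimate equals the exact answer.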

22 Anatomy vs. Generalization Revisit
Sometimes the adversary is not sure whether an individual appears in the microdata or not.
A 2-diverse generalized table:
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis
A Voter Registration List:
Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Sam    59   M    12000
…      …    …    …
Ric    50   M    40000
Mark   40   M    30000

23 Anatomy vs. Generalization Revisit
From the adversary's perspective:
– Bob has a 4 / 6 probability of being in the microdata
– If Bob indeed appears in the microdata, there is a 2 / 4 probability that he has contracted pneumonia
– So Bob has a 4/6 * 2/4 = 1/3 probability of having contracted pneumonia
A 2-diverse generalized table:
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
…         …    …               …
A Voter Registration List:
Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Sam    59   M    12000
…      …    …    …
Ric    50   M    40000
Mark   40   M    30000

24 Anatomy vs. Generalization Revisit
The adversary knows that
– Bob must appear in the microdata
– There is a 1/2 probability that Bob has contracted pneumonia
2-diverse QIT:
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
…    …    …        …
2-diverse ST:
Group-ID  Disease    Count
1         dyspepsia  2
1         pneumonia  2
…         …          …
A Voter Registration List:
Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Sam    59   M    12000
…      …    …    …
Ric    50   M    40000
Mark   40   M    30000
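
The two adversary estimates, 1/3 under generalization and 1/2 under anatomy, differ only in whether Bob's presence in the microdata is certain. A small Python sketch makes the arithmetic explicit; `breach_probability` and its parameters are hypothetical names introduced for this illustration.

```python
from fractions import Fraction


def breach_probability(candidates, group_size, disease_count, presence_known):
    """Probability that the target has the disease, from the adversary's view.
    candidates: voter-list entries that fit the QI-group (6 on the slides)
    group_size: tuples in the QI-group (4)
    disease_count: tuples in the group with the disease (2)
    presence_known: True when exact QI values are released, as in the QIT"""
    presence = Fraction(1) if presence_known else Fraction(group_size, candidates)
    return presence * Fraction(disease_count, group_size)


generalization = breach_probability(6, 4, 2, presence_known=False)
anatomy = breach_probability(6, 4, 2, presence_known=True)
print(generalization, anatomy)  # 1/3 1/2
```

Both values respect the 1/l = 1/2 bound; generalization falls below it here only because the voter list contains two individuals who are not in the microdata.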

25 Anatomy vs. Generalization Revisit
For a given value of l, l-diverse generalization may lead to higher privacy protection than l-diverse anatomy does.
But this is not always the case, since:
– the external database may not contain any irrelevant individuals (people who do not appear in the microdata)
– the adversary may know that some individuals indeed appear in the microdata
A Voter Registration List:
Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Sam    59   M    12000
…      …    …    …
Ric    50   M    40000
Mark   40   M    30000

