Source: IEEE Journal of Biomedical and Health Informatics, Vol

1 A Scalable and Pragmatic Method for the Safe Sharing of High-Quality Health Data
Source: IEEE Journal of Biomedical and Health Informatics, Vol. 22, pp , Mar
Authors: Fabian Prasser, Florian Kohlmayer, Helmut Spengler, and Klaus A. Kuhn
Speaker: Joyun Liu
Date: 05/02/2019

2 Outline
1 Introduction
2 Proposed scheme
3 Experimental result
4 Conclusions

3 Introduction(1/3)
Medical data released as anonymous (columns: SSN, Name, DOB, Sex, ZIP, Marital Status, Problem):
09/13/64, female, 02141, married, shortness of breath; 09/07/64, obesity; 05/14/61, male, 02138, single, chest pain; 05/08/61; 09/15/61, 02139, widow
Voter list (columns: Name, City, ZIP, DOB, Sex):
Sue J. Carlson, Cambridge, 02139, 09/15/61, female
Linkage attack: the voter list entry matches the medical record with DOB 09/15/61 and ZIP 02139.

4 Introduction(2/3)
Data de-identification: privacy versus data quality.
Original data (columns: Ethnicity, DOB, Sex, ZIP, Marital Status):
black, 09/13/64, female, 02141, married; 09/07/64; white, 05/14/61, male, 02138, single; 05/08/61; 09/15/61, 02139, widow
De-identified data (columns: Ethnicity, DOB, Sex, ZIP, Marital Status):
black, 64, not-rel, 02100; white, 61

5 Introduction(3/3) - k-anonymity
Columns: Ethnicity, Date of birth, Sex, ZIP, Marital Status, Problem
Records as shown: 1, black, 64, *, > 02140, shortness of breath; 2, obesity; 3, white, 61, < 02140, chest pain; 4; 5
Group annotations: 2-anonymity, 3-anonymity
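A minimal sketch of how k-anonymity can be checked: group records by their quasi-identifier values and take the size of the smallest group. The function name `k_of`, the field names, and the toy records (including the "problem" values of the last two rows) are illustrative assumptions, not data from the paper.

```python
from collections import Counter

def k_of(records, quasi_identifiers):
    """Return the largest k for which `records` is k-anonymous."""
    # Records sharing all quasi-identifier values form one group.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0

# Toy records (hypothetical): two generalized quasi-identifiers plus a sensitive value.
records = [
    {"dob": "64", "zip": "> 02140", "problem": "shortness of breath"},
    {"dob": "64", "zip": "> 02140", "problem": "obesity"},
    {"dob": "61", "zip": "< 02140", "problem": "chest pain"},
    {"dob": "61", "zip": "< 02140", "problem": "obesity"},
    {"dob": "61", "zip": "< 02140", "problem": "chest pain"},
]

k = k_of(records, ["dob", "zip"])
print(k)
```

The smallest group determines k for the whole dataset, so one group of two records and one group of three yields a 2-anonymous dataset.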

6 Proposed scheme(1/10)
Measures protecting the data:
Organizational and legal measures to ensure trust.
Controlled data sharing to prevent malicious attacks.
Tailored data de-identification to prevent direct disclosure.
Diagram labels: Data custodian, Data recipient, Hospital, Data use agreements

7 Proposed scheme(2/10)
Controlled data sharing to prevent malicious attacks (LINDDUN methodology).
Data flow diagram (element types: Entity, Process, Data store, Data flow): User, Proxy, Analytics, Research data
Threat tree nodes: Privacy breach, Linkage attack, Direct disclosure, Linkage on server, Linkage on other system

8 Proposed scheme(3/10)
Tailored data de-identification to prevent direct disclosure.
Traditional process for evaluating de-identification policies: Apply generalization; Evaluate risk model; Suppress all groups with risk above threshold; Calculate quality.
Novel process for evaluating de-identification policies: Apply generalization; Evaluate risk model; Risk below threshold? Yes: Calculate quality. No: Suppress group with lowest information content, then evaluate the risk model again.
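The iterative part of the novel process can be sketched as a loop that removes one group at a time until the risk falls below the threshold. This is only a sketch under stated assumptions: the risk measure `share_of_unique_records` is a simple stand-in for the super-population model used in the paper, and "lowest information content" is approximated by "smallest group". Both choices, and all names, are hypothetical.

```python
from collections import Counter

def share_of_unique_records(groups):
    """Stand-in risk measure (an assumption): fraction of records alone in their group."""
    total = sum(groups.values())
    return sum(1 for size in groups.values() if size == 1) / total if total else 0.0

def suppress_until_safe(rows, threshold, risk=share_of_unique_records):
    """Suppress one group per iteration until the risk is below the threshold."""
    groups = Counter(rows)
    suppressed = 0
    while groups and risk(groups) >= threshold:
        # Stand-in for "group with lowest information content": the smallest group.
        victim = min(groups, key=groups.get)
        suppressed += groups.pop(victim)
    return groups, suppressed

rows = ["A", "A", "B", "C", "C", "C"]  # each letter is one generalized record
remaining, suppressed = suppress_until_safe(rows, threshold=0.1)
print(dict(remaining), suppressed)
```

Because the loop stops only when the risk is strictly below the threshold, a threshold of 0 would suppress every group; a real implementation would pick the comparison to match its risk model.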

9 Proposed scheme(4/10) - Apply generalization
Dataset (columns: Age, Sex): 53, Male; 55; 40, Female; 65
Hierarchy for “Sex”: Level 0: Male, Female. Level 1: *.
Hierarchy for “Age”: Level 0: 1, …, 19; 20, …, 39; 40, …, 59; 60, …, 79; 80, …, 100. Level 1: [20, 40), [40, 60), [60, 80). Level 2: ≤ 19, [20, 80), ≥ 80. Level 3: *.
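As an illustration of how such hierarchies can be applied, the sketch below maps each value to its label at a chosen level. The function names, the ASCII labels, the toy records, and the assumption that ages up to 19 and from 80 keep the labels "<= 19" and ">= 80" at Level 1 are not taken from the paper.

```python
def generalize_age(age, level):
    """Map an age to its label at the given hierarchy level (0 to 3)."""
    if level == 0:
        return str(age)
    if level == 3:
        return "*"
    if age <= 19:
        return "<= 19"  # assumed to apply at Level 1 as well as Level 2
    if age >= 80:
        return ">= 80"  # assumed to apply at Level 1 as well as Level 2
    if level == 2:
        return "[20, 80)"
    lower = 20 + ((age - 20) // 20) * 20  # Level 1: 20-year intervals
    return f"[{lower}, {lower + 20})"

def generalize_sex(sex, level):
    """Map a sex value to its label at the given hierarchy level (0 or 1)."""
    return sex if level == 0 else "*"

def apply_policy(records, policy):
    """Apply a policy (age_level, sex_level) to (age, sex) records."""
    age_level, sex_level = policy
    return [(generalize_age(a, age_level), generalize_sex(s, sex_level))
            for a, s in records]

# Toy records, loosely modeled on the Age/Sex example.
records = [(53, "Male"), (55, "Male"), (40, "Female"), (65, "Female")]
print(apply_policy(records, (1, 0)))
```

A policy is simply one level per attribute, so the same function covers every combination from (0, 0), which leaves the data unchanged, to (3, 1), which replaces every value with "*".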

10 Proposed scheme(5/10) - Apply generalization
Highlighted policies (Age level, Sex level) and the data they produce:
0, 0: 53 Male 55 40 Female 65
1, 0: [40, 60) Male Female [60, 80)
0, 1: 53 * 55 40 65
3, 1: *
Lattice nodes: 0, 0; 1, 0; 2, 0; 3, 0; 0, 1; 1, 1; 2, 1; 3, 1
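The set of candidate policies forms a lattice with one generalization level per attribute. A short sketch, assuming maximum levels of 3 for Age and 1 for Sex as in the hierarchies of this example, enumerates the nodes and the direct successors of a node; the function names are illustrative.

```python
from itertools import product

def lattice(max_levels):
    """Enumerate every combination of generalization levels."""
    return list(product(*(range(m + 1) for m in max_levels)))

def successors(node, max_levels):
    """Nodes reached by raising exactly one attribute by one level."""
    return [node[:i] + (node[i] + 1,) + node[i + 1:]
            for i in range(len(node)) if node[i] < max_levels[i]]

max_levels = (3, 1)  # (Age, Sex)
nodes = lattice(max_levels)
print(len(nodes))
print(successors((0, 0), max_levels))
```

The number of nodes is the product of (maximum level + 1) over all attributes, which is why the search space grows quickly as attributes are added.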

11 Proposed scheme(6/10) - Evaluate risk model
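Measure risk with the super-population model (Dankar et al.).
Example: the influencing factors of heart disease in Taiwan.
Diagram labels: Population, Sampling, Sample; Sample ?= Population uniques (PU)
$f_1(\alpha, \theta) = \sum_{i=1}^{u-1} \frac{1}{\theta + i\alpha} - \sum_{i=1}^{n-1} \frac{1}{\theta + i} = 0$
$f_2(\alpha, \theta) = \sum_{i=1}^{u-1} \frac{i}{\theta + i\alpha} - \sum_{i=2}^{n} s_i \sum_{j=1}^{i-1} \frac{1}{j - \alpha} = 0$
$\mathrm{PU} = \frac{\Gamma(\theta + 1)}{\Gamma(\theta + \alpha)} N^{\alpha}$
The super-population model is by Hoshino, “Applying Pitman’s Sampling Formula to Microdata Disclosure Risk Analysis”.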
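The formulas on this slide can be evaluated directly, as in the sketch below. The reading of the symbols is an assumption based on the usual presentation of Pitman's sampling formula: n is the sample size, u the number of groups in the sample, s_i the number of groups of size i, and N the population size. The parameter values in the usage lines are made up; in practice alpha and theta would be obtained by solving f1 = f2 = 0 numerically.

```python
import math

def f1(alpha, theta, u, n):
    """Left-hand side of the first estimating equation."""
    return (sum(1.0 / (theta + i * alpha) for i in range(1, u))
            - sum(1.0 / (theta + i) for i in range(1, n)))

def f2(alpha, theta, u, class_size_counts):
    """Left-hand side of the second equation; class_size_counts maps i to s_i."""
    first = sum(i / (theta + i * alpha) for i in range(1, u))
    second = sum(count * sum(1.0 / (j - alpha) for j in range(1, size))
                 for size, count in class_size_counts.items())
    return first - second

def population_uniques(alpha, theta, population_size):
    """PU = Gamma(theta + 1) / Gamma(theta + alpha) * N ** alpha, in log space."""
    log_pu = (math.lgamma(theta + 1) - math.lgamma(theta + alpha)
              + alpha * math.log(population_size))
    return math.exp(log_pu)

# Hypothetical parameter values, only to show the call.
pu = population_uniques(alpha=0.5, theta=0.5, population_size=10_000)
print(round(pu, 2))
```

Working with `math.lgamma` rather than `math.gamma` avoids overflow when theta is large.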

12 Proposed scheme(7/10) - Suppress group with lowest information content
Policy 1, 0; Threshold = 2
53 Male 55 65 Female 40
[40, 60) Male [60, 80) Female
Population
[40, 60) Male [60, 80) Female
Generalization.
[40, 60) Male [60, 80) Female *
Unique record suppressed.

13 Proposed scheme(8/10) - Calculate quality
Quality per highlighted policy: 0, 0: Quality = 100%; 1, 0: Quality = 75%; 0, 1: Quality = 50%; 3, 1: Quality = 0%
Data shown: 53 Male 55 65 Female 40; [40, 60) Male [60, 80) Female *; 53 * 55 65 40; *
Lattice nodes: 0, 0; 1, 0; 2, 0; 3, 0; 0, 1; 1, 1; 2, 1; 3, 1

14 Proposed scheme(9/10)
Reducing the number of candidate policies
Lattice nodes shown: 0, 1; 1, 1; 2, 1; 3, 1
Labels: Generalization level; Quality = 50%

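15 Proposed scheme(10/10)
Reducing the number of risk calculations (Subprocess A): ordering groups to enable fast record suppression.
Diagram labels: Group 1, Group 2, Group 3, Group 4, …, Group n; Increasing information content per group; Increasing uniqueness in output data; Risk below threshold
Reducing the complexity of risk calculation (Subprocess B):
$\sum_{i=0}^{N-1} \frac{1}{x + i} = \psi(N + x) - \psi(x)$
$\sum_{i=1}^{u-1} \frac{1}{\theta + i\alpha} = \frac{1}{\alpha} \left[ \psi\left(u + \frac{\theta}{\alpha}\right) - \psi\left(1 + \frac{\theta}{\alpha}\right) \right]$
$\sum_{i=1}^{u-1} \frac{i}{\theta + i\alpha} = \frac{1}{\alpha^2} \left[ -\theta \psi\left(u + \frac{\theta}{\alpha}\right) + \theta \psi\left(1 + \frac{\theta}{\alpha}\right) + \alpha (u - 1) \right]$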
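The closed forms for Subprocess B replace sums over u - 1 terms with a constant number of digamma evaluations. The sketch below checks them numerically against the direct sums. The `digamma` helper is a simple hand-written approximation (recurrence plus asymptotic series) included only to keep the example self-contained; it is an assumption of this sketch, not the paper's implementation, and the parameter values are made up.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0 via recurrence and an asymptotic series."""
    result = 0.0
    while x < 10.0:  # shift x upward using psi(x) = psi(x + 1) - 1 / x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def sum_inverse_direct(alpha, theta, u):
    return sum(1.0 / (theta + i * alpha) for i in range(1, u))

def sum_inverse_closed(alpha, theta, u):
    r = theta / alpha
    return (digamma(u + r) - digamma(1 + r)) / alpha

def sum_weighted_direct(alpha, theta, u):
    return sum(i / (theta + i * alpha) for i in range(1, u))

def sum_weighted_closed(alpha, theta, u):
    r = theta / alpha
    return (-theta * digamma(u + r) + theta * digamma(1 + r)
            + alpha * (u - 1)) / alpha ** 2

alpha, theta, u = 0.4, 2.5, 50  # hypothetical values
print(sum_inverse_direct(alpha, theta, u), sum_inverse_closed(alpha, theta, u))
print(sum_weighted_direct(alpha, theta, u), sum_weighted_closed(alpha, theta, u))
```

The direct sums cost time proportional to u, while each closed form costs a fixed number of function evaluations, which matters when the risk model is evaluated many times per candidate policy.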

16 Experimental result(1/3)
Analysis of the impact of the developed optimizations on data de-identification with the model by Dankar et al. and a 1% threshold. Evaluation datasets:
ADULT: 30,162 records from the 1994 U.S. Census
CUP: 63,441 records from the 1998 KDD competition
FARS: 100,937 records about traffic accidents from the NHTSA Fatality Analysis Reporting System
ATUS: 539,253 records from the American Time Use Survey
IHIS: 1,193,504 records from the Integrated Health Interview Series

17 Experimental result(2/3)
Scalability of our approach for the largest evaluation dataset (Integrated Health Interview Series).

18 Experimental result(3/3)
Figure panels (labels: 20%, 50%):
Non-Uniform Entropy by DeWaal and Willenborg, “Information loss through global recoding and local suppression”
Loss by Iyengar, “Transforming Data to Satisfy Privacy Constraints”

19 Conclusions
A concept has been introduced for the safe sharing of high-quality health data.
The concept is scalable: it is also suitable for large datasets.
The concept is pragmatic: it also covers the implementation of regulations and the setup of the environment.

