deidentify Philipp Burckhardt and Rema Padman

1 deidentify Philipp Burckhardt and Rema Padman
Session: Privacy and Deidentification S35
Philipp Burckhardt and Rema Padman
Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213
Twitter: #AMIA2017

2 Disclosure We have no relevant relationships with commercial interests to disclose. AMIA | amia.org

3 Learning Objectives
After participating in this session the learner should be better able to:
- De-identify free-text medical records via open-source tools in order to comply with the privacy rule of the Health Insurance Portability and Accountability Act (HIPAA).
- Identify the challenges posed by a certain data set of free-text medical records for de-identification tasks.

4 Introduction: Why deidentify?

5 EHRs have become ubiquitous.
Photo by NEC Corporation of America with Creative Commons license.

6 90% of hospitals use an EHR system.
Photo by NEC Corporation of America with Creative Commons license.

7 Privacy Protection
The Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA) restricts distribution of all medical data containing protected health information (PHI).
HIPAA permits two methods for the de-identification of PHI:
- the “Safe Harbor” rule lists 18 identifiers, which have to be removed
- an expert can testify that the employed statistical or scientific method provides only a small risk of identification
Identifier List:
(A) Names
(B) All geographic subdivisions smaller than a state (…)
(C) All elements of dates (except year) for dates that are directly related to an individual (…)
(D) Telephone numbers
(E) Fax numbers
(F) Email addresses
(G) Social security numbers
(H) Medical record numbers
(I) Health plan beneficiary numbers
(…)
(L) Vehicle identifiers and serial numbers, including license plate numbers
(M) Device identifiers and serial numbers
(N) Web Universal Resource Locators (URLs)
(O) Internet Protocol (IP) addresses
(P) Biometric identifiers, including finger and voice prints

8 The Promise
- Data-driven diagnostics
- Predictive analytics
- Precision medicine

9 Related Work: Two Strands
Rule-based Approaches:
- Scrub
Machine-Learning Methods:
- Conditional Random Fields (CRF)
- Decision Trees
- Maximum Entropy models
- Support Vector Machines (SVM)
Sweeney L. Replacing personally-identifying information in medical records, the Scrub system. Proc AMIA Annu Fall Symp. 1996;p. 333–337.
Aramaki E, Imai T, Miyo K, Ohe K. Automatic deidentification by using sentence features and label consistency. i2b2 Work Challenges Nat Lang Process Clin Data. 2006;p. 10–11.

10 deidentify

11 deidentify
- Automatic scrubbing or replacement of all protected health information (PHI)
- Supports *.pdf, *.doc(x), and *.txt formats
- Graphical User Interface (GUI)
- Customizable
- Platform-agnostic (Windows, macOS, Linux)
- Open-source (GNU General Public License v2)

12 Methodology

13 Methodology
Dear Janine Keane, as we have discussed, I hereby send you the requested information about my patient, Julie Andrews. You can reach her via (her address is or via phone: Sincerely, Elijah Hunt, MD

19 Methodology
Pattern matching via Regular Expressions:
  Type    Regular Expression
  Phone   /(\+\d{1,2}\s)?\(?\d{3}\)?[/\s.-]?\d{3}[/\s.-]?\d{4}/g
  Fax     /\+?[0-9]{7,}/g
  SSN     /\d{3}-?\d{2}-?\d{4}/g
Named Entity Recognition via Conditional Random Fields:
- Model by Stanford Natural Language Processing
- Pre-trained on large corpora (CoNLL, MUC-6, MUC-7 and ACE)
- Features:
  - the current, previous, and next words
  - the current word character n-gram
  - the current Part-of-Speech (POS) tag
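
The three regular expressions on this slide are written as JavaScript literals, so they can be tried out directly. The following is a minimal sketch, not the tool's actual code: the names PATTERNS and findPhi and the sample sentence are assumptions made for illustration, and only the patterns themselves come from the slide.

```javascript
// Patterns copied from the slide; everything else here is hypothetical.
const PATTERNS = {
  phone: /(\+\d{1,2}\s)?\(?\d{3}\)?[/\s.-]?\d{3}[/\s.-]?\d{4}/g,
  fax: /\+?[0-9]{7,}/g,
  ssn: /\d{3}-?\d{2}-?\d{4}/g,
};

// Collect every match of every pattern, with its type and character offsets.
function findPhi(text) {
  const hits = [];
  for (const [type, regex] of Object.entries(PATTERNS)) {
    for (const match of text.matchAll(regex)) {
      hits.push({
        type,
        value: match[0],
        start: match.index,
        end: match.index + match[0].length,
      });
    }
  }
  // Report hits in reading order.
  return hits.sort((a, b) => a.start - b.start);
}

// Made-up sample text with a fictitious phone number and SSN.
const sample = "Call (412) 555-0123 or use SSN 123-45-6789.";
console.log(findPhi(sample));
```

Note that the patterns overlap: a bare ten-digit string such as 4125550123 is matched by all three, so a pipeline built this way would presumably need a rule for resolving competing matches.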

20 Substitution Strategies
Original:
Dear Janine Keane, as we have discussed, I hereby send you the requested information about my patient, Julie Andrews. You can reach her via (her address is or via phone: Sincerely, Elijah Hunt, MD
Fake Data:
Dear Rosie Copeland, as we have discussed, I hereby send you the requested information about my patient, Beatrice Burton. You can reach her via (her address is or via phone: (836) Sincerely, Jayden Bush, MD
Identifier:
Dear <name>, as we have discussed, I hereby send you the requested information about my patient, <name>. You can reach her via (her address is < >) or via phone: <phone>. Sincerely, <name>, MD
Redacted:
Dear **********, as we have discussed, I hereby send you the requested information about my patient, *************. You can reach her via (her address is ******************) or via phone: ************. Sincerely, ***********, MD
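
The identifier and redaction strategies can be sketched in a few lines of JavaScript. This is an illustrative sketch only: PHONE_PATTERN reuses the phone expression from the methodology slide, while replacementStrategies and substitute are hypothetical names, not the tool's API. The fake-data strategy is left out because it needs a generator of surrogate values.

```javascript
// Phone pattern as given on the methodology slide.
const PHONE_PATTERN = /(\+\d{1,2}\s)?\(?\d{3}\)?[/\s.-]?\d{3}[/\s.-]?\d{4}/g;

// Hypothetical strategy table: each entry maps a matched string to its replacement.
const replacementStrategies = {
  // Replace the match with a tag naming the PHI type, e.g. <phone>.
  identifier: (match, type) => `<${type}>`,
  // Replace every character of the match with an asterisk, preserving length.
  redact: (match) => "*".repeat(match.length),
};

// Apply one strategy to all matches of one pattern.
function substitute(text, regex, type, strategy) {
  const replace = replacementStrategies[strategy];
  return text.replace(regex, (match) => replace(match, type));
}

const letter = "You can reach her via phone: 412-555-0123.";
console.log(substitute(letter, PHONE_PATTERN, "phone", "identifier"));
// prints: You can reach her via phone: <phone>.
console.log(substitute(letter, PHONE_PATTERN, "phone", "redact"));
// prints: You can reach her via phone: ************.
```

Redaction keeps the length of the original value visible, whereas the identifier tag hides the length but keeps the PHI type.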

22 Substitution Strategies: Fake Data
Dear Rosie Copeland, as we have discussed, I hereby send you the requested information about my patient, Beatrice Burton. You can reach her via (her address is or via phone: (836) Sincerely, Jayden Bush, MD

23 Substitution Strategies: Identifiers
Dear <name>, as we have discussed, I hereby send you the requested information about my patient, <name>. You can reach her via (her address is < >) or via phone: <phone>. Sincerely, <name>, MD

24 Substitution Strategies: Redaction
Dear **********, as we have discussed, I hereby send you the requested information about my patient, *************. You can reach her via (her address is ******************) or via phone: ************. Sincerely, ***********, MD

25 Evaluation

26 Evaluation Data
Corpus of nursing notes from PhysioNet:
- 2,434 records
- Manually de-identified by several clinicians
- Example: 58 YO FEMALE READMITTED TO CCU TODAY S/P CATH WITH PA LINE ON MILRINONE. PT WITH PMH MI ’92, CABG X3 ’92, REDO ’95, DDD PACER ’95, AFLUTTER S/P ABLATION ’96, AFIB S/P CARDIOVERSION
Notes written by specialist physicians on Chronic Kidney Disease after each patient visit:
- 48 records
- Example: Dear XXX, I saw your patient, Mrs. YYY, in the office today for follow up of her transplantation. As you will recall she had problems with acute interstitial nephritis which ended up being endstage renal disease in early 2000’s and has subsequently undergone transplantation from one of her family members. She was seen in September at which point her creatinine was stable at (...)
Neamatullah I, Douglass MM, Lehman LwH, Reisner A, Villarroel M, Long WJ, et al. Automated deidentification of free-text medical records. BMC Med Inform Decis Mak. 2008;8:32.

27 Evaluation Metrics
Recall (also known as sensitivity): proportion of PHIs correctly identified throughout the doctor’s notes.
Precision (also known as positive predictive value): proportion of correct findings among the terms identified as PHI.
To protect the privacy of patients’ health care data, high recall is essential, whereas high precision preserves the integrity and readability of the text.
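
Both metrics follow from three counts: true positives (PHI terms that were flagged), false negatives (PHI terms that were missed), and false positives (non-PHI terms that were flagged). The sketch below uses the standard definitions; the function names and the counts 90, 10 and 30 are made up for illustration and are not taken from the evaluation.

```javascript
// recall = TP / (TP + FN): share of all PHI terms that were found.
function recall(truePositives, falseNegatives) {
  return truePositives / (truePositives + falseNegatives);
}

// precision = TP / (TP + FP): share of flagged terms that really are PHI.
function precision(truePositives, falsePositives) {
  return truePositives / (truePositives + falsePositives);
}

// Hypothetical counts: 90 PHI terms found, 10 missed, 30 harmless terms flagged.
console.log(recall(90, 10).toFixed(3));    // prints 0.900
console.log(precision(90, 30).toFixed(3)); // prints 0.750
```

When only the total number of PHI instances and the number of false negatives are known, the true positives are the difference between the two. For 367 instances with 95 misses, recall(367 - 95, 95) evaluates to roughly 0.741.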

28 De-Identification Performance (Nursing Notes)
PHI Type   PHI Sub-Type       Count   # FNs   Recall   Precision
Name       Patient Name          54       3    0.944
           Patient Initial        2            0.0
           Clinician Name       593      41    0.925
           Relative / Proxy     175            0.989
           Name (overall)       822      48    0.95     0.734
Date                            482       4    0.992    0.256
Location                        367      95    0.741    0.922
Phone                            53            1.0      0.899
Overall                        1724     124    0.919    0.645

29 De-Identification Performance (Doctor’s Notes)
PHI Type   PHI Sub-Type       Count   # FNs   Recall   Precision
Name       Patient Name          24            1.0
           Clinician Name       112       6    0.946
           Name (overall)       136            0.956    0.738

30 Conclusion
deidentify:
- is easy-to-use
- does not require training data, but can be manually augmented
- cross-platform and open-source
Challenges:
- lab values are frequently confused with dates
- issues with texts containing only uppercase letters and inconsistent punctuation
Future Work:
- better handling of dates
- embedding of structured information (laboratory results, medical staff lists, patient information, etc.)
- integration through CLI for real-time de-identification of PHIs
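
The first challenge, numeric values being mistaken for dates, is easy to reproduce with any loose date pattern. The sketch below is purely illustrative: NAIVE_DATE is a hypothetical, deliberately simple expression (the slides do not show the tool's own date handling), and the note text is made up, with a blood pressure reading standing in for a lab value.

```javascript
// Hypothetical, deliberately naive pattern: month/day with an optional year.
const NAIVE_DATE = /\b\d{1,2}\/\d{1,2}(?:\/\d{2,4})?\b/g;

// Made-up note containing one real date and two numeric clinical values.
const note = "Seen on 9/14. BP 98/60, creatinine 1.4.";

// The pattern cannot tell a visit date from a ratio-style measurement.
const flagged = note.match(NAIVE_DATE);
console.log(flagged); // the date 9/14 and the reading 98/60 are both flagged
```

Every such false positive lowers precision for the date category without improving recall, which matches the kind of trade-off the evaluation reports for dates.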

31 GitHub: https://github.com/Planeshifter/deidentify

32 The End
Thank you!

33 Question 1
Imagine that you are the executive officer of a large clinic, who is responsible for evaluating various software solutions for de-identification of medical records for HIPAA compliance. After some research, you have narrowed down the number of available options to five software tools, which make different quality claims. However, these claims were produced by marketing professionals and might overstate certain advantages. Which of the following claims seems most credible?
A. “Out of the box, our solution is guaranteed to remove all personal identifiers.”
B. “With our tooling consistently resulting in zero false positives, you don’t have to worry about HIPAA compliance anymore.”
C. “Using advanced machine learning techniques, our system just works. You don’t need any human revision of the de-identified output.”
D. “Building on various dictionaries and other lists of hand-selected identifiers, our tool will serve you best to de-identify your medical records.”
E. “Our system combines the machine learning techniques with dictionaries that are customizable through an easy-to-use interface. Thus, you can achieve HIPAA compliance and preserve the readability of the patient records.”

34 Correct Answer: E
“Our system combines the machine learning techniques with dictionaries that are customizable through an easy-to-use interface. Thus, you can achieve HIPAA compliance and preserve the readability of the patient records.”
Explanation: The goal of the HIPAA Privacy Rule is to protect patients’ personal identifiers. This is a complex task because a combination of small details can make it possible to identify individuals even if their names have been removed from the records. For example, take the case where the phone number of a patient was not removed from a medical record. However, for the readability of the records it is crucial not to mistakenly flag all number sequences as personal identifiers, as dosages or other medical data might need to be preserved. With these conflicting objectives, there will always be a trade-off resulting in non-zero false positives and false negatives. Therefore, claims that promise a 100% success rate need to be taken with a grain of salt, the more so if they promise to achieve this out of the box (such as answer A). However, the statement that no false positives occur is not by itself a quality mark: if one’s method was to not label anything as a personal identifier, no false positives would occur, so answer B emphasizes the wrong thing. Solely relying on machine learning techniques, as promised in answer C, is not viable either: since there is no 100% success rate, human supervision and redaction is crucial. On the other hand, dictionaries alone (answer D) cannot adequately serve the purpose of de-identification, as they will not capture spelling errors, acronyms, or rare names. This leaves us with answer E as the most plausible one, which combines the various approaches (machine learning, dictionaries, custom rules, human supervision and redaction) and highlights the two competing goals: de-identification and preservation of readability.
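
The argument against answer B can be checked on a toy example. The sketch below is an illustration under made-up assumptions: the four labelled tokens and the two detectors (flagNothing, flagAnyDigit) are hypothetical and only serve to show that zero false positives says nothing about how much PHI is missed.

```javascript
// Tiny hand-labelled example (hypothetical data).
const tokens = [
  { text: "412-555-0123", isPhi: true },  // phone number
  { text: "500", isPhi: false },          // dosage in mg
  { text: "98", isPhi: false },           // heart rate
  { text: "Andrews", isPhi: true },       // patient name
];

// Degenerate detector: never flags anything.
const flagNothing = () => false;
// Overeager detector: flags every token that contains a digit.
const flagAnyDigit = (token) => /\d/.test(token.text);

// Count true positives, false positives and false negatives for a detector.
function evaluate(detector) {
  let tp = 0, fp = 0, fn = 0;
  for (const token of tokens) {
    const isFlagged = detector(token);
    if (isFlagged && token.isPhi) tp++;
    else if (isFlagged && !token.isPhi) fp++;
    else if (!isFlagged && token.isPhi) fn++;
  }
  return { tp, fp, fn };
}

console.log(evaluate(flagNothing));  // no false positives, but both PHI tokens are missed
console.log(evaluate(flagAnyDigit)); // finds the phone number, flags two harmless numbers, misses the name
```

The first detector achieves zero false positives with zero recall. The second detector trades readability for coverage and still misses the name, which is why neither extreme is a credible quality claim.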

35 Question 2
As chief executive officer, you have decided to use an open-source solution for de-identification of medical records. In a board meeting with doubtful members, you need to make the case why an in-house solution based on open-source software might be preferable to a commercial offering where everything is taken care of. Which of the following arguments is most convincing?
A. The open-source solution will cost less money.
B. As studies have shown, open-source software is generally more resilient than proprietary solutions.
C. Using in-house tools based on open-source software, we can adapt it to our needs, leading to better performance and adaptability to changing circumstances (e.g. moving to a new Electronic Health Record system).
D. We minimize risk this way since we cannot be sure that the respective company is still operating in a few years. An open standard on the other hand is sustainable.

36 Correct Answer: C
Using in-house tools based on open-source software, we can adapt it to our needs, leading to better performance and adaptability to changing circumstances (e.g. moving to a new Electronic Health Record system).
Explanation: Answer A might be correct sometimes, but not always: in-house staff may also be expensive. The same applies to answer B: there are great open-source projects that can rival commercial offerings because of their large number of users (and testers), but also many which fall short. One needs to evaluate the alternatives on an individual basis. The argument of risk minimization in answer D is quite plausible; however, it is generally preferable to argue for something via its virtues rather than pointing out risks associated with an alternative. Therefore, answer C is the most convincing one: building such an in-house solution on a sound open-source basis provides resilience, an open standard, and adaptability. The last point is in stark contrast to most proprietary offerings, and in times of Big Data, accumulating in-house skills is necessary for transferring knowledge into the future.

37 AMIA is the professional home for more than 5,400 informatics professionals, representing frontline clinicians, researchers, public health experts and educators who bring meaning to data, manage information and generate new knowledge across the research and healthcare enterprise.

38 Thank you!
Email me at: pgb@andrew.cmu.edu

