1
deidentify
Philipp Burckhardt and Rema Padman
Session: Privacy and Deidentification, S35
Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213
Twitter: #AMIA2017
2
Disclosure
We have no relevant relationships with commercial interests to disclose.
AMIA | amia.org
3
Learning Objectives
After participating in this session the learner should be better able to:
- De-identify free-text medical records via open-source tools in order to comply with the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA).
- Identify the challenges that a given data set of free-text medical records poses for de-identification tasks.
4
Introduction: Why deidentify?
5
EHRs have become ubiquitous.
Photo by NEC Corporation of America with Creative Commons license.
6
90% of hospitals use an EHR system.
7
Privacy Protection
The Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA) restricts distribution of all medical data containing protected health information (PHI).
HIPAA permits two methods for the de-identification of PHI:
- the “Safe Harbor” rule lists 18 identifiers, which have to be removed
- an expert can attest that the statistical or scientific method employed leaves only a small risk of identification
Identifier List:
(A) Names
(B) All geographic subdivisions smaller than a state (…)
(C) All elements of dates (except year) for dates that are directly related to an individual (…)
(D) Telephone numbers
(E) Fax numbers
(F) Email addresses
(G) Social security numbers
(H) Medical record numbers
(I) Health plan beneficiary numbers
(…)
(L) Vehicle identifiers and serial numbers, including license plate numbers
(M) Device identifiers and serial numbers
(N) Web Universal Resource Locators (URLs)
(O) Internet Protocol (IP) addresses
(P) Biometric identifiers, including finger and voice prints
8
The Promise
- Data-driven diagnostics
- Predictive analytics
- Precision medicine
9
Related Work: Two Strands
Rule-based Approaches:
- Scrub (Sweeney, 1996)
Machine-Learning Methods:
- Conditional Random Fields (CRF)
- Decision Trees
- Maximum Entropy models
- Support Vector Machines (SVM)
References:
Sweeney L. Replacing personally-identifying information in medical records, the Scrub system. Proc AMIA Annu Fall Symp. 1996;p. 333–337.
Aramaki E, Imai T, Miyo K, Ohe K. Automatic deidentification by using sentence features and label consistency. i2b2 Work Challenges Nat Lang Process Clin Data. 2006;p. 10–11.
10
deidentify
11
deidentify
- Automatic scrubbing or replacement of all protected health information (PHI)
- Supports *.pdf, *.doc(x), and *.txt formats
- Graphical User Interface (GUI)
- Customizable
- Platform-agnostic (Windows, macOS, Linux)
- Open-source (GNU General Public License v2)
12
Methodology
13
Methodology
Dear Janine Keane,
as we have discussed, I hereby send you the requested information about my patient, Julie Andrews. You can reach her via (her address is or via phone:
Sincerely,
Elijah Hunt, MD
19
Methodology
Pattern matching via Regular Expressions:
Type   Regular Expression
Phone  /(\+\d{1,2}\s)?\(?\d{3}\)?[/\s.-]?\d{3}[/\s.-]?\d{4}/g
Fax    /\+?[0-9]{7,}/g
SSN    /\d{3}-?\d{2}-?\d{4}/g
Named Entity Recognition via Conditional Random Fields:
- Model by Stanford Natural Language Processing
- Pre-trained on large corpora (CoNLL, MUC-6, MUC-7 and ACE)
- Features:
  - the current, previous, and next words
  - the current word character n-gram
  - the current Part-of-Speech (POS) tag
  - …
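The two steps on this slide can be sketched roughly as follows. This is an illustration in Python, not the tool's own code: the three patterns are the ones shown above, translated from JavaScript literals to Python's re module, while the function names, the order in which the patterns are applied, the placeholder format, and the feature dictionary layout are all assumptions made for the example.

```python
import re

# Patterns from the slide, translated from JavaScript literals (/.../g).
# Assumed order: the broad fax pattern runs last so that it does not
# swallow digit-only phone numbers or SSNs before the narrower patterns run.
PATTERNS = [
    ("phone", re.compile(r"(\+\d{1,2}\s)?\(?\d{3}\)?[/\s.-]?\d{3}[/\s.-]?\d{4}")),
    ("ssn", re.compile(r"\d{3}-?\d{2}-?\d{4}")),
    ("fax", re.compile(r"\+?[0-9]{7,}")),
]


def scrub_patterns(text):
    """Replace every match of each pattern with a <label> placeholder."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"<{label}>", text)
    return text


def token_features(tokens, pos_tags, i, n=3):
    """Hypothetical feature dictionary for token i, mirroring the slide's
    feature list: neighbouring words, character n-grams, and the POS tag."""
    word = tokens[i]
    return {
        "word": word,
        "prev_word": tokens[i - 1] if i > 0 else "<START>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<END>",
        "pos": pos_tags[i],
        "char_ngrams": [word[j:j + n] for j in range(len(word) - n + 1)],
    }


print(scrub_patterns("Phone: (412) 555-0199, fax: 5550100, SSN: 123-45-6789."))
print(token_features(["Dear", "Janine", "Keane"], ["JJ", "NNP", "NNP"], 1))
```

Note that a ten-digit fax number would be labelled as a phone number by these patterns, since the phone pattern also accepts ten digits without separators. Overlaps of this kind are one reason pattern matching alone tends to need manual review.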
20
Substitution Strategies
Original:
Dear Janine Keane, as we have discussed, I hereby send you the requested information about my patient, Julie Andrews. You can reach her via (her address is or via phone: Sincerely, Elijah Hunt, MD
Fake Data:
Dear Rosie Copeland, as we have discussed, I hereby send you the requested information about my patient, Beatrice Burton. You can reach her via (her address is or via phone: (836) Sincerely, Jayden Bush, MD
Identifier:
Dear <name>, as we have discussed, I hereby send you the requested information about my patient, <name>. You can reach her via (her address is < >) or via phone: <phone>. Sincerely, <name>, MD
Redacted:
Dear **********, as we have discussed, I hereby send you the requested information about my patient, *************. You can reach her via (her address is ******************) or via phone: ************. Sincerely, ***********, MD
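The three substitution strategies can be illustrated with a short Python sketch. It assumes that detection has already produced non-overlapping character spans with a PHI label. The function name, the span format, the FAKE_VALUES lookup, and the same-length redaction are assumptions for the example and not taken from the tool. The replacement names are the ones shown in the fake-data example.

```python
import random

# Hypothetical pool of replacement values per PHI label (names from the slide).
FAKE_VALUES = {"name": ["Rosie Copeland", "Beatrice Burton", "Jayden Bush"]}


def substitute(text, spans, strategy, rng=None):
    """Replace each (start, end, label) span according to the strategy:
    'fake' inserts a random stand-in, 'identifier' inserts <label>,
    'redact' inserts asterisks of the same length (an assumption)."""
    rng = rng or random.Random(0)
    pieces, cursor = [], 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the span
        if strategy == "fake":
            pieces.append(rng.choice(FAKE_VALUES[label]))
        elif strategy == "identifier":
            pieces.append(f"<{label}>")
        elif strategy == "redact":
            pieces.append("*" * (end - start))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        cursor = end
    pieces.append(text[cursor:])  # remaining text after the last span
    return "".join(pieces)


letter = "Dear Janine Keane, my patient is Julie Andrews."
spans = []
for name in ("Janine Keane", "Julie Andrews"):
    start = letter.index(name)
    spans.append((start, start + len(name), "name"))

for strategy in ("fake", "identifier", "redact"):
    print(substitute(letter, spans, strategy))
```

Fake data keeps the text natural to read, identifiers keep the PHI type visible, and redaction removes both, so the choice depends on how the de-identified records will be used.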
25
Evaluation
26
Evaluation Data
Corpus of nursing notes from PhysioNet:
- 2,434 records
- Manually de-identified by several clinicians
- Example: 58 YO FEMALE READMITTED TO CCU TODAY S/P CATH WITH PA LINE ON MILRINONE. PT WITH PMH MI ’92, CABG X3 ’92, REDO ’95, DDD PACER ’95, AFLUTTER S/P ABLATION ’96, AFIB S/P CARDIOVERSION
Notes written by specialist physicians on Chronic Kidney Disease after each patient visit:
- 48 records
- Example: Dear XXX, I saw your patient, Mrs. YYY, in the office today for follow up of her transplantation. As you will recall she had problems with acute interstitial nephritis which ended up being endstage renal disease in early 2000’s and has subsequently undergone transplantation from one of her family members. She was seen in September at which point her creatinine was stable at (...)
Reference:
Neamatullah I, Douglass MM, Lehman LwH, Reisner A, Villarroel M, Long WJ, et al. Automated deidentification of free-text medical records. BMC Med Inform Decis Mak. 2008;8:32.
27
Evaluation Metrics
- Recall (also known as sensitivity): proportion of the PHI instances in the notes that are correctly identified.
- Precision (also known as positive predictive value): proportion of correct findings among the terms identified as PHI.
To protect the privacy of patients’ health care data, high recall is essential, whereas high precision preserves the integrity and readability of the text.
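As a worked example of the two metrics, the following Python sketch computes them from raw counts. The recall figures use the Location row of the nursing-notes results (367 PHI instances, 95 false negatives). The slides do not report false-positive counts, so the value of 23 used for precision is a hypothetical number chosen only for illustration.

```python
def recall(true_positives, false_negatives):
    """Share of actual PHI instances that were found."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


def precision(true_positives, false_positives):
    """Share of flagged terms that really are PHI."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0


# Location row of the nursing-notes results: 367 instances, 95 missed.
count, false_negatives = 367, 95
true_positives = count - false_negatives  # 272 correctly identified

print(round(recall(true_positives, false_negatives), 3))  # 0.741

# Hypothetical false-positive count; not reported on the slides.
false_positives = 23
print(round(precision(true_positives, false_positives), 3))  # 0.922
```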
28
De-Identification Performance (Nursing Notes)
PHI Type  PHI Sub-Type      Count  # FNs  Recall  Precision
Name      Patient Name      54     3      0.944
          Patient Initial   2             0.0
          Clinician Name    593    41     0.925
          Relative / Proxy  175           0.989
          Name (overall)    822    48     0.95    0.734
Date                        482    4      0.992   0.256
Location                    367    95     0.741   0.922
Phone                       53            1.0     0.899
Overall                     1724   124    0.919   0.645
29
De-Identification Performance (Doctor’s Notes)
PHI Type  PHI Sub-Type      Count  # FNs  Recall  Precision
Name      Patient Name      24            1.0
          Clinician Name    112    6      0.946
          Name (overall)    136           0.956   0.738
30
Conclusion
deidentify:
- is easy to use
- does not require training data, but can be manually augmented
- is cross-platform and open-source
Challenges:
- lab values are frequently confused with dates
- issues with texts containing only uppercase letters and inconsistent punctuation
Future Work:
- better handling of dates
- embedding of structured information (laboratory results, medical staff lists, patient information, etc.)
- integration through a CLI for real-time de-identification of PHI
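The confusion between dates and numeric values can be reproduced with a small Python sketch. The date pattern and the sample note below are hypothetical and only stand in for whatever date rules a de-identification tool applies; the point is that a value written with a slash is indistinguishable from a day/month date at the pattern level.

```python
import re

# Hypothetical, deliberately naive date pattern (not taken from the tool).
NAIVE_DATE = re.compile(r"\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b")

# Made-up note: one real date and two numeric values written with a slash.
note = "Seen on 11/02/2016. Pain 3/10, strength 5/5 in both arms."

print(NAIVE_DATE.findall(note))  # ['11/02/2016', '3/10', '5/5']
```

Only the first match is a date. The other two are false positives, which is consistent with the low precision reported for dates in the nursing-notes results.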
31
GitHub: https://github.com/Planeshifter/deidentify
32
The End
Thank you!
33
Question 1
Imagine that you are the executive officer of a large clinic who is responsible for evaluating various software solutions for de-identification of medical records for HIPAA compliance. After some research, you have narrowed down the available options to five software tools, which make different quality claims. However, these claims were produced by marketing professionals and might overstate certain advantages. Which of the following claims seems most credible?
A. “Out of the box, our solution is guaranteed to remove all personal identifiers.”
B. “With our tooling consistently resulting in zero false positives, you don’t have to worry about HIPAA compliance anymore.”
C. “Using advanced machine learning techniques, our system just works. You don’t need any human revision of the de-identified output.”
D. “Building on various dictionaries and other lists of hand-selected identifiers, our tool will serve you best to de-identify your medical records.”
E. “Our system combines the machine learning techniques with dictionaries that are customizable through an easy-to-use interface. Thus, you can achieve HIPAA compliance and preserve the readability of the patient records.”
34
Correct Answer: E
“Our system combines the machine learning techniques with dictionaries that are customizable through an easy-to-use interface. Thus, you can achieve HIPAA compliance and preserve the readability of the patient records.”
Explanation:
The goal of the HIPAA Privacy Rule is to protect patients’ personal identifiers. This is a complex task because combining small details can make it possible to identify individuals even if their names have been removed from the records. For example, take the case where the phone number of a patient was not removed from a medical record. However, for the readability of the records it is crucial not to mistakenly flag all number sequences as personal identifiers, as dosages or other medical data might need to be preserved. With these conflicting objectives, there will always be a trade-off resulting in non-zero false positives and false negatives.
Therefore, claims that promise a 100% success rate need to be taken with a grain of salt, the more so if they promise to achieve this out of the box (such as answer A). However, the statement that no false positives occur is not a quality mark on its own: if one’s method was to not label anything as a personal identifier, no false positives would occur, so answer B emphasizes the wrong thing. Solely relying on machine learning techniques, as promised in answer C, is not viable either: since there is no 100% success rate, human supervision and redaction is crucial. On the other hand, dictionaries alone (answer D) cannot adequately serve the purpose of de-identification, as they will not capture spelling errors, acronyms, or rare names.
This leaves us with answer E as the most plausible one, which combines the various approaches (machine learning, dictionaries, custom rules, human supervision and redaction) and highlights the two competing goals: de-identification and preservation of readability.
35
Question 2
As chief executive officer, you have decided to use an open-source solution for de-identification of medical records. In a board meeting with doubtful members, you need to make the case for why an in-house solution based on open-source software might be preferable to a commercial offering where everything is taken care of. Which of the following arguments is most convincing?
A. The open-source solution will cost less money.
B. As studies have shown, open-source software is generally more resilient than proprietary solutions.
C. Using in-house tools based on open-source software, we can adapt it to our needs, leading to better performance and adaptability to changing circumstances (e.g. moving to a new Electronic Health Record system).
D. We minimize risk this way since we cannot be sure that the respective company is still operating in a few years. An open standard on the other hand is sustainable.
36
Correct Answer: C
Using in-house tools based on open-source software, we can adapt it to our needs, leading to better performance and adaptability to changing circumstances (e.g. moving to a new Electronic Health Record system).
Explanation:
Answer A might be correct sometimes, but not always: in-house staff may also be expensive. The same applies to answer B: there are great open-source projects that can rival commercial offerings because of their large number of users (and testers), but also many which fall short. One needs to evaluate the alternatives on an individual basis. The argument of risk minimization in answer D is quite plausible; however, it is generally preferable to argue for something via its virtues rather than by pointing out risks associated with an alternative.
Therefore, answer C is the most convincing one: building such an in-house solution on a sound open-source basis provides resilience, an open standard, and adaptability. The last point is in stark contrast to most proprietary offerings, and in times of Big Data, accumulating in-house skills is necessary for the knowledge transfer into the future.
37
AMIA is the professional home for more than 5,400 informatics professionals, representing frontline clinicians, researchers, public health experts and educators who bring meaning to data, manage information and generate new knowledge across the research and healthcare enterprise.
38
Thank you!
Email me at: pgb@andrew.cmu.edu