De-identifying Pathology Reports for Pathology Informatics


1 De-identifying Pathology Reports for Pathology Informatics
James Gardner, Li Xiong (Department of Math and Computer Science)
Fusheng Wang, Andrew Post, Joel Saltz (Center for Comprehensive Informatics)

2 Introduction
The HIPAA Privacy Rule regulates the use and disclosure of Protected Health Information (PHI).
De-identification of pathology reports is critical to enabling secondary use of medical records for research.
HIDE (Health Information DE-identification) is an open-source de-identification tool based on statistical learning techniques.
Statistical learning techniques have shown promising results for de-identification, but few such systems are publicly available.
This work presents a comprehensive study of how different feature sets and sampling affect the extraction of PHI from pathology reports.

3 HIPAA Identifiers
Either these identifiers have to be removed, or, in the opinion of a qualified statistical expert, the risk of identifying an individual must be very small:
1. Names
2. All geographical subdivisions smaller than a state
3. All elements of dates (except year)
4. Phone numbers
5. Fax numbers
6. Electronic mail addresses
7. Social Security numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers and serial numbers
13. Device identifiers and serial numbers
14. Web Universal Resource Locators (URLs)
15. Internet Protocol (IP) address numbers
16. Biometric identifiers, including finger and voice prints
17. Full face photographic images or comparable images
18. Any other unique identifying number, characteristic, or code

4 HIDE Overview
Uses Conditional Random Fields (CRFs), a state-of-the-art named entity recognition technique, to extract PHI.
Previous tools such as DE-ID and HMS Scrubber use rule-based approaches, which are labor intensive and not portable.
Provides flexible de-identification options, including full de-identification and statistical de-identification.
Previous tools allow only simple removal or substitution of the PHI.
Provides an easy-to-use web-based interface built on current web technologies.
Integrated with caTIES and caTissue (in progress).

5 PHI Extraction
Uses Conditional Random Fields, a state-of-the-art NLP technique: high accuracy, easy to train, portable.
Combines different feature sets and sampling techniques.
Feature sets: dictionary, affix, regular expression, and context.
Can use default models or custom-trained models.
Web interface for annotating and training custom models:
A set of reports is loaded and manually labeled.
The labeled documents are used to train a model that automatically de-identifies new reports.

6 HIDE: De-identification Options
Full de-identification (safe harbor): all 18 HIPAA identifiers are removed or substituted.
Partial de-identification (limited data set): all direct HIPAA identifiers are removed or substituted, except dates and address elements other than street or P.O. box.
Configurable de-identification: a configurable set of identifiers is removed or substituted.
Statistical de-identification: advanced anonymization that provides rigorous statistical privacy guarantees while preserving the utility of the data.
An example of utility from statistical de-id results?
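The difference between removing and substituting a configurable set of identifiers can be sketched in a few lines of Python. This is only an illustration, not HIDE's code: the `(start, end, type)` span format, the `find_span` and `deidentify` helpers, the type labels, and the sample sentence are all assumptions made for the example, and the spans are assumed not to overlap.

```python
from typing import List, Set, Tuple

# Hypothetical span format: (start offset, end offset, PHI type).
Span = Tuple[int, int, str]

def find_span(text: str, value: str, phi_type: str) -> Span:
    """Locate the first occurrence of value and tag it with a PHI type."""
    start = text.index(value)
    return (start, start + len(value), phi_type)

def deidentify(text: str, spans: List[Span], types_to_hide: Set[str],
               substitute: bool = True) -> str:
    """Remove or substitute every span whose type is in types_to_hide."""
    pieces = []
    cursor = 0
    for start, end, phi_type in sorted(spans):
        if phi_type not in types_to_hide:
            continue  # this identifier type is configured to stay
        pieces.append(text[cursor:start])
        if substitute:
            pieces.append(f"[{phi_type}]")  # placeholder instead of the value
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)

# Made-up sample text; in practice the spans would come from a PHI extractor.
report = "Patient John Smith, MRN 1234567, seen on 03/14/2009."
spans = [
    find_span(report, "John Smith", "NAME"),
    find_span(report, "1234567", "MRN"),
    find_span(report, "03/14/2009", "DATE"),
]

# Hide everything that was tagged.
print(deidentify(report, spans, {"NAME", "MRN", "DATE"}))
# Hide direct identifiers only and keep the date.
print(deidentify(report, spans, {"NAME", "MRN"}))
```

Passing a different `types_to_hide` set is what makes the option configurable; `substitute=False` drops the value without leaving a placeholder.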

7 Statistical De-identification Example
De-identification satisfying k-anonymity (k=2): every record is indistinguishable from the others within a group of at least k records.
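A minimal Python sketch of checking the k-anonymity property: group the records by their quasi-identifier values and confirm that every group has at least k members. The records, the field names (`age`, `zip`), and the generalized values are made up for illustration and do not come from the presentation.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs >= k times."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return all(count >= k for count in groups.values())

# Hypothetical raw records: each (age, zip) pair is unique.
raw = [
    {"age": "34", "zip": "30322", "diagnosis": "A"},
    {"age": "36", "zip": "30329", "diagnosis": "B"},
    {"age": "41", "zip": "30307", "diagnosis": "C"},
    {"age": "47", "zip": "30306", "diagnosis": "D"},
]

# The same records after generalizing age to a range and truncating the zip.
generalized = [
    {"age": "30-39", "zip": "303**", "diagnosis": "A"},
    {"age": "30-39", "zip": "303**", "diagnosis": "B"},
    {"age": "40-49", "zip": "303**", "diagnosis": "C"},
    {"age": "40-49", "zip": "303**", "diagnosis": "D"},
]

print(is_k_anonymous(raw, ["age", "zip"], k=2))          # False
print(is_k_anonymous(generalized, ["age", "zip"], k=2))  # True
```

The generalized table still carries the diagnosis column, which is the utility that statistical de-identification aims to keep.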

8 Study 1: PHI Extraction on Emory Pathology Reports
The CRF classifier with a good feature set achieves good attribute extraction accuracy (100 reports, 10-fold cross-validation).
Precision: true positives over the sum of true positives and false positives.
Recall (sensitivity): true positives over total actual positives.
F1: the harmonic mean of precision and recall, 2 * precision * recall / (precision + recall).
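The three metrics can be computed directly from token-level counts. The sketch below is a plain Python restatement of the definitions; the counts in the usage example are illustrative and are not results from the study.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from raw counts.

    tp: PHI tokens correctly extracted
    fp: non-PHI tokens wrongly extracted
    fn: PHI tokens that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts only: 90 hits, 10 false alarms, 30 misses.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.900 recall=0.750 f1=0.818
```

For de-identification, recall is usually the metric to watch most closely, since a missed identifier leaks PHI while a false positive only removes extra text.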

9 Study 2: PHI Extraction on i2b2 Reports
Based on 669 discharge summaries, 10-fold cross-validation.
Good precision and recall for most individual PHI identifiers.
Good overall precision and recall for PHI extraction.
i2b2: Informatics for Integrating Biology and the Bedside.
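For readers unfamiliar with 10-fold cross-validation, the following Python sketch shows one common way to split 669 documents into ten train/test partitions. The slides do not say how folds were assigned, so the shuffling, the seed, and the `k_fold_indices` helper are assumptions.

```python
import random

def k_fold_indices(n_items: int, k: int = 10, seed: int = 0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)        # assumed: random assignment
    folds = [indices[i::k] for i in range(k)]   # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Each document is used for testing exactly once across the ten rounds.
for round_number, (train, test) in enumerate(k_fold_indices(669), start=1):
    print(round_number, len(train), len(test))
```

The model is trained on the `train` indices and scored on the `test` indices in each round, and the ten scores are then averaged.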

10 Study 3: Impact of Different Feature Sets
The context features (the previous word, the next word, etc.) are the most important.
Regular expression features.
Affix features: prefix and suffix.
Dictionary features (phrase or token).
Context features: previous and next words, and occurrence counts.
Dictionary (d), affix (a), regular expression (r), and context (c) features are listed in order of increasing importance for statistical CRF-based PHI extraction.
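To make the four feature families concrete, here is a small Python sketch that computes dictionary, affix, regular expression, and context features for one token. It is not HIDE's feature extractor: the feature names, the tiny name dictionary, the patterns, and the affix length of 3 are all assumptions chosen for illustration.

```python
import re

# Hypothetical mini-dictionary; a real system would load large name lists.
FIRST_NAMES = {"john", "mary", "james"}

# Hypothetical patterns for a few PHI-like token shapes.
REGEXES = {
    "is_date": re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$"),
    "is_phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
    "all_digits": re.compile(r"^\d+$"),
}

def token_features(tokens, i, affix_len=3):
    """Build a feature dict for tokens[i] from the four feature families."""
    tok = tokens[i]
    feats = {
        # Dictionary feature: does the token appear in a known list?
        "dict:first_name": tok.lower() in FIRST_NAMES,
        # Affix features: leading and trailing characters.
        "affix:prefix": tok[:affix_len].lower(),
        "affix:suffix": tok[-affix_len:].lower(),
        # Context features: neighbouring words and occurrence count.
        "ctx:prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "ctx:next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        "ctx:count": tokens.count(tok),
    }
    # Regular expression features: one boolean per pattern.
    for name, pattern in REGEXES.items():
        feats["regex:" + name] = bool(pattern.match(tok))
    return feats

tokens = ["Seen", "by", "Dr.", "John", "Smith", "on", "03/14/2009"]
print(token_features(tokens, 3))  # features for "John"
print(token_features(tokens, 6))  # features for "03/14/2009"
```

A sequence tagger such as a CRF would receive one such dict per token; the context feature `ctx:prev` being "dr." is the kind of cue that helps it label "John" as a name.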

11 Integrating HIDE with caTIES
caTIES (cancer Text Information Extraction System) provides tools for de-identification and automated coding of free-text pathology reports.
caTIES supports de-identification extensions through its CaTIES_DeIdentifier interface.
HIDEDeIdentifier implements this interface and calls the HIDE client API.
Added a HIDE de-identification option to the caTIES installer.
HIDE has been bundled with caTIES since release v3.7 (May ).

12 Integrating HIDE with caTissue (in Progress)
caTissue uses caTIES V2.x, refactored into caTissue's workflow.
HIDE integration with caTissue is similar to the integration with caTIES.
Implementation and evaluation are under way.
Goal: integrate pathology reports into the caTissue installation at the Winship Cancer Institute at Emory University.

13 Ongoing Development
Continue development on HIDE/caTissue integration.
Usability improvement: simplified installation process.
System improvements:
Efficiency and scalability of the system.
Multiple file format support.
Additional statistical de-identification options.

14 HIDE Demo

15 Thank you
Li Xiong (lxiong@mathcs.emory.edu)

