De-Identification
Privacy in Organizational Processes [Diagram labels: Patient; Hospital (complex process within a hospital); Insurance Company; Drug Company; patient information; patient medical bills; advertising; aggregate anonymized patient information released to the PUBLIC]
Transfer and Use Between Organizations: achieve the organizational purpose while respecting privacy expectations in the transfer and use of personal information (individual and aggregate) within and across organizational boundaries.
We Use Health Data for Many Kinds of Research
Two Swords in Health Research (IRB): the Informed Consent Form (ICF) and De-Identification
HIPAA Background: Commercial healthcare insurance. Pharmacy benefit managers (PBMs) as potential intruders. Health maintenance organizations (HMOs) holding hospital stock or acquiring hospitals through M&A. Research fraud and clinical trial scandals. Who can market our medical record data?
Health Insurance Portability and Accountability Act (HIPAA), enacted by the US Congress in 1996
Title I: Health Care Access, Portability, and Renewability
Title II: Preventing Health Care Fraud and Abuse; Administrative Simplification; Medical Liability Reform
1. Privacy Rule
2. Transactions and Code Sets Rule
3. Security Rule
4. Unique Identifiers Rule
5. Enforcement Rule
HITECH Act: Privacy Requirements
De-Identification and Re-Identification
Which items may not be disclosed?
HIPAA Privacy Rule and Research with De-identified Information (1)
(1) Names
(2) All geographic subdivisions smaller than a State, including street, city, county, precinct, and zip code. The first three digits of the zip code can be used if the area they cover contains more than 20,000 people; if it contains 20,000 or fewer people, "000" must be used as the zip code.
(3) All elements of dates (except year) related to an individual, including birth date, admission date, discharge date, and date of death. For individuals over 89 years of age, the year of birth cannot be used; all such ages must be aggregated into a single category of 90 and older.
HIPAA Privacy Rule and Research with De-identified Information (2)
(4) Telephone numbers
(5) Fax numbers
(6) Electronic mail addresses
(7) Social Security numbers (SSN)
(8) Medical record numbers
(9) Health plan beneficiary numbers
(10) Account numbers
(11) Certificate/license numbers
(12) Vehicle identifiers and serial numbers, including license plates
(13) Device identifiers and serial numbers
(14) Web Universal Resource Locators (URLs)
(15) Internet Protocol (IP) addresses
(16) Biometric identifiers, including finger and voice prints
(17) Full-face photos and comparable images
(18) Any unique identifying number, characteristic, or code
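The zip code, date, and age rules in items (2) and (3) can be sketched in code. This is a minimal illustration, not a compliance tool: the field names (`zip`, `birth_date`, `admission_date`), the function names, and the contents of `SMALL_ZIP3` are assumptions made for the example, and a real list of sparsely populated three-digit prefixes would have to come from current census data.

```python
from datetime import date

# Placeholder set of 3-digit zip prefixes treated as sparsely populated.
# Illustrative only; a real list must be derived from census data.
SMALL_ZIP3 = {"036", "059"}


def generalize_zip(zip_code, small_prefixes=SMALL_ZIP3):
    """Keep the first three digits, or '000' for sparsely populated prefixes."""
    prefix = zip_code.strip()[:3]
    return "000" if prefix in small_prefixes else prefix


def age_on(birth, as_of):
    """Age in completed years on the date `as_of`."""
    before_birthday = (as_of.month, as_of.day) < (birth.month, birth.day)
    return as_of.year - birth.year - before_birthday


def deidentify(record, as_of):
    """Apply the zip, date, and age generalizations to one record."""
    age = age_on(record["birth_date"], as_of)
    return {
        "zip3": generalize_zip(record["zip"]),
        # Year of birth is withheld for individuals over 89.
        "birth_year": None if age > 89 else record["birth_date"].year,
        # Ages over 89 are aggregated into one category.
        "age": "90+" if age > 89 else str(age),
        # Only the year of other dates is kept.
        "admission_year": record["admission_date"].year,
    }
```

Direct identifiers such as names and record numbers are simply left out of the output; only the generalized fields are returned.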
Following the HIPAA Regulation: Is this really a safe de-identification procedure? (Yes or No) Are you sure that researchers can still carry out their research after these tags or codes are deleted? (Yes or No)
Example: Tracking cervical cancer subjects by comparing ICD-9 and SCC data (date, tag, and result). Age and location (place) are very important influencing factors. Will removing the link between these data sets spoil your research?
Categories of variables in a data set: directly identifying variables, quasi-identifiers, and sensitive variables. A sensitive variable is, for example, the financial or health status of an individual. How many sensitive variables are allowed in a limited database?
Direct Identifiers: variables that can be linked directly to a subject's personal data through public information infrastructure. Examples: name, account number, medical record number, ID number, ...
Indirect Identifiers (Quasi-Identifiers): location (address, zip code); communication identifiers (telephone, fax); Internet identifiers (IP address, machine code); any unique identifying number, characteristic, or code.
Quasi-Identifiers: date of birth (DoB); DoB as month and year; day, month, and year of admission, discharge, or operation; gender; initials; address; city; region; postal code.
The Difference: Anonymous, Confidential, De-identified. The IRB often finds that the terms anonymous, confidential, and de-identified are used incorrectly. These terms are described below as they relate to an individual's participation in the research and the way that their data are collected and maintained for analysis.
Anonymous: It is impossible to know whether or not an individual participated in the study. However, a study participant who is a member of a minority ethnic group might be identifiable even in a large data pool. Information about other unique individual characteristics (indirect identifiers) might also make it possible to identify an individual in a data set.
Example A: A Taiwan health insurance claims data set on physician prescribing behavior, in commercial use (PBMs know which physician prescribed their medications).
Confidential: The research team is obligated to protect the data from disclosure outside the research according to the terms of the research protocol and the informed consent document. To protect against accidental disclosure, the subject's name and other identifiers should be stored separately from the research data and replaced with a unique code that creates a new identity for the subject. Note that coded data are not anonymous.
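The coding step described on this slide can be sketched as follows. It is a minimal illustration under stated assumptions: the function name `pseudonymize`, the field names, and the `subject_code` column are hypothetical, and each input record is assumed to belong to a different subject.

```python
import secrets


def pseudonymize(records, id_fields=("name", "medical_record_number")):
    """Split records into coded research data and a separate key table.

    Returns (coded_records, key_table). The key table maps each random
    code back to the identifiers and must be stored apart from the
    research data, under its own access controls.
    """
    key_table = {}
    coded_records = []
    for record in records:
        # Draw a random code that carries no information about the subject.
        code = secrets.token_hex(8)
        while code in key_table:
            code = secrets.token_hex(8)
        key_table[code] = {f: record[f] for f in id_fields if f in record}
        research_row = {k: v for k, v in record.items() if k not in id_fields}
        research_row["subject_code"] = code
        coded_records.append(research_row)
    return coded_records, key_table
```

Because the key table still exists, anyone holding both parts can restore the identities, which is why coded data are confidential rather than anonymous.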
Example B: Use the distrust or conflict between different individuals or branches as a control mechanism: congressmen and officials; accounting and finance branches; marketing and sales; IRB and researcher.
De-identified: Data have been de-identified when all direct or indirect identifiers or codes linking the data to the individual subject's identity are destroyed. In principle, there is then no risk of re-identification. From a research perspective, however, many details and facts are ignored and lost.
Re-Identification: Re-linking records through some identifier or quasi-identifier to recover the original identity. Evaluating the risk of re-identification is a matter of attitude or consensus among reviewers.
Limited or De-Identified: Contract or not? (Non-disclosure agreement) Regulation or not? Expedited or full board review? Retention or availability period? (Indefinite, or with an expiration date) Database access committee? Database administrator?
A Perfect Data Security Management & Infrastructure; IRB Role and Review; FAQ; Heuristics
Are subjects identifiable by their age, gender, and residence? [Figure labels: Interval (10, 5); ZIP Code]
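One way to explore this question is to count how many records are unique on the combination of age band, gender, and zip code. The sketch below assumes that the "Interval 10 5" labels refer to age bands of 10 and 5 years; the function names and field names are hypothetical and the sample records are invented for illustration.

```python
from collections import Counter


def age_band(age, width):
    """Map an age to a band label such as '30-39' for width 10."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def unique_fraction(records, width):
    """Fraction of records that are unique on (age band, gender, zip)."""
    keys = [(age_band(r["age"], width), r["gender"], r["zip"]) for r in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


sample = [
    {"age": 31, "gender": "F", "zip": "100"},
    {"age": 33, "gender": "F", "zip": "100"},
    {"age": 37, "gender": "F", "zip": "100"},
    {"age": 52, "gender": "M", "zip": "100"},
    {"age": 58, "gender": "M", "zip": "100"},
]
print(unique_fraction(sample, 10))  # 0.0: no record is unique with 10-year bands
print(unique_fraction(sample, 5))   # 0.6: three of five are unique with 5-year bands
```

Narrower age intervals keep more analytic detail but leave more subjects alone in their group, and a subject who is alone in a group is the easiest to single out.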
Can a person be re-identified from their diagnosis code? Many data sets also include diagnosis codes (for example, ICD-10 codes). Hospital medical record abstract data are almost publicly available. A set of diagnosis codes can make an individual unique. Some of the records in the disclosed data set have diagnosis codes for rare and visible diseases or conditions.
Can a claims database be used for re-identification? A lot of literature makes the point that claims databases can be used for re-identification. However, the accuracy of this statement will depend on your jurisdiction. Other sources of public information can still be very useful for re-identification.
Can individuals be re-identified from disease maps?
Do these maps risk identifying any of the individuals? There are three questions that need to be answered to determine the risk: 1. Is the disease visible? 2. Is the disease rare in the geography? 3. If I re-identify an individual, will I learn something new about them?
Can postal codes re-identify individuals? Five-digit postal codes are the smallest geographic unit that Taiwan's postal service uses to deliver mail. In a health care context they are the most common geographic unit because that is what patients know and are able to provide. Suppose the postal code is the only demographic information disclosed in a data set. The smallest postal codes in all provinces and territories have very few people living there, so any information about such a postal code would pertain to a very small number of individuals.
Definition of an identifiable data set: a data set is identifiable if a person can find their own record(s) in it. Who is most sensitive to de-identification, the individual or the reviewer? The best de-identification is one in which an individual cannot pick out his or her own record.
How can I de-identify longitudinal records? A time-series record is like DNA: a unique sequential data set. It can easily be re-identified, so it should be treated as a limited database. Intervals between events are less likely to be unique than actual dates.
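The last point, replacing dates with intervals, can be illustrated with a short sketch. The function name `to_intervals` and the two visit histories are invented for the example; the idea is only that two patients with different calendar dates can share the same interval pattern.

```python
from datetime import date


def to_intervals(visit_dates):
    """Replace a list of visit dates with the gaps in days between them."""
    ordered = sorted(visit_dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


patient_a = [date(2020, 1, 10), date(2020, 1, 24), date(2020, 3, 1)]
patient_b = [date(2021, 5, 3), date(2021, 5, 17), date(2021, 6, 23)]

print(to_intervals(patient_a))  # [14, 37]
print(to_intervals(patient_b))  # [14, 37]
```

The exact dates distinguish the two patients, while the interval sequences do not. Long or unusual interval sequences can still be unique, so this reduces the risk rather than removing it.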
How can I safely release data to multiple researchers? Re-number the records. Re-rank the records. Use a different sample for each release. Shuffle your data before disclosure. Create a strong disincentive to match the two data sets. Change the representation (say, 0.4 to 40%, or English units to metric).
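Several of these measures (a different sample per release, shuffling, and re-numbering) can be combined in one step. The sketch below is illustrative only: `release_copy`, `recipient_seed`, `sample_fraction`, and `row_id` are hypothetical names, and the input records are assumed to contain no original record numbers.

```python
import random


def release_copy(records, recipient_seed, sample_fraction=0.8):
    """Build one recipient's copy: sample, shuffle, then re-number.

    A different seed per recipient gives each researcher a different
    subset in a different order with unrelated row numbers, which makes
    matching two released copies harder.
    """
    rng = random.Random(recipient_seed)
    size = max(1, round(len(records) * sample_fraction))
    subset = rng.sample(records, size)  # different sample per recipient
    rng.shuffle(subset)                 # order carries no information
    # New row numbers start at 1 and are unrelated to the source order.
    return [{**record, "row_id": i} for i, record in enumerate(subset, start=1)]
```

The seed used for each recipient should be kept by the data custodian only, since it allows the release to be reproduced.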
Is sampling sufficient to de-identify a data set? Both statistical significance and re-identification risk should be taken into consideration. With sampling, an intruder may not know whether their target is in the disclosed database. What if the sampling fraction is higher? (The situation becomes similar to that of a public database.)
Is there a secondary use market for health information? Yes or No. Pharmacy benefit managers (PBMs); private health insurance services; other services (women and children).
Should de-identified data go through a research ethics review? In the first approach, the IRB form has a checkbox question asking the investigator if the data are de-identified (UM forms). If the investigator checks that box, the IRB does not review the protocol and it is automatically approved. The reasoning is that the data are de-identified and therefore there is no requirement to review the protocol.
Should IRBs decide if a data set is de-identified? Yes or No? (No) We don't have a privacy expert. Determining whether a particular data set is identifiable, and resolving any re-identification risk concerns, is an iterative process. If these interactions are attempted, they can be very slow and consequently frustrating.
Should we de-identify if technology is moving so fast? Re-identification technology moves faster than de-identification technology. Education in data security is cheaper than new technology. High technology means high risk.
The difference between consenters and non-consenters: Secondary use of an existing data set contributed by earlier consenters. Should the data of consenters who dropped out be included? Does the ICF contain any wording that consents to the use of his or her personal information? (Usually consent covers the specimen but not personal information.) Data from non-consenters would be reviewed by a data access center or a privacy expert.
The five levels of identifiability Level 1. The full data set as is. Level 2. The names are replaced by fake names, the health insurance number is replaced with a fake number, and the street address field is removed altogether. Level 3. The data set at Level 2 also has the postal code generalized from six characters to five characters. The risk at Level 3 is the same as Level 2, but the organization believes it has de-identified the data and discloses it. Therefore, the organization is exposed.
The five levels of identifiability Level 4. The data set at Level 3 is further modified by replacing the five-character postal code with a single-character postal code, the date of birth is replaced by age, and the date of visit is replaced by the month of the visit. A re-identification risk assessment is then performed on this data set and the risk is found to be below a pre-specified threshold. Level 5. Aggregate data only, for example the number of individuals with a sexually transmitted disease.
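The Level 4 steps can be sketched as a generalization followed by a simple risk check. This is an illustration under stated assumptions, not the assessment method the slide refers to: risk is taken here as 1 divided by the size of the smallest group of records sharing the same generalized values, and the threshold of 0.2, the function names, and the field names are hypothetical.

```python
from collections import Counter
from datetime import date


def generalize(record, as_of):
    """Level 4 style generalization of one record's quasi-identifiers."""
    birth = record["birth_date"]
    before_birthday = (as_of.month, as_of.day) < (birth.month, birth.day)
    age = as_of.year - birth.year - before_birthday
    return (
        record["postal_code"][:1],                # single-character postal code
        age,                                      # date of birth replaced by age
        record["visit_date"].strftime("%Y-%m"),   # visit date replaced by month
    )


def max_risk(records, as_of):
    """Worst-case risk: 1 / size of the smallest group of matching records."""
    groups = Counter(generalize(r, as_of) for r in records)
    return 1 / min(groups.values())


def below_threshold(records, as_of, threshold=0.2):
    """True when the worst-case risk does not exceed the chosen threshold."""
    return max_risk(records, as_of) <= threshold
```

A data set in which every generalized combination is shared by at least five records would pass the 0.2 threshold used here; a single record that stands alone pushes the risk to 1.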
What are the quasi-identifiers that I should use for managing risk? (Neighbor) Address and telephone information about the target individual; household and dwelling information (number of children, value of property, type of property); key dates (births, deaths, weddings, admissions, discharges); visible characteristics (gender, race, ethnicity, language spoken at home, weight, height, physical disabilities); profession.
What are the quasi-identifiers that I should use for managing prosecutor risk? (Ex-Spouse) The same things that a neighbor would know; basic medical history (allergies, chronic diseases); income; years of schooling.
What de-identification software tools are there? The PARAT tool from Privacy Analytics implements comprehensive risk management for three types of identity disclosure risk. mu-Argus was developed by the Netherlands national statistical agency. The Cornell Anonymization Toolkit (CAT) implements a k-anonymity algorithm. The University of Texas at Dallas offers the Anonymization Toolbox.
Who cares about my medical records? (Finance) Some medical records contain financial information (e.g., information used for billing purposes). They may also contain, for example, date of birth, address, and mother's maiden name, which are frequently used as PINs or passwords. Even if a medical record does not contain information suitable for financial fraud, a record that includes your health insurance information can be very valuable.
Who cares about my medical records? (Media) If you ever become of interest to the media and they want to do a story on you or your family, reporters may be interested in re-identifying records about you. Medical records are a good source of revenue if you are in the extortion business. Even if there is no financial impact, some people feel violated if there is a breach of privacy of their medical information, and they change their behavior by adopting privacy-protective behaviors. There are a number of attempts to make your health information publicly (or at least very widely) available.
Other researchers' questions: What genes predict better prognosis or response to treatment? These questions can be studied using cancer registries and claims data.
Ethical questions: 1. May you share (coded genome-wide) data with other researchers? 2. May you use (coded genome-wide) data for additional research without consent? 3. May you do whole genome sequencing on existing coded samples without consent?
Ethical concerns: 4. May you follow participants as a prospective cohort using medical records without consent? With identifiers, records can be linked to cancer registry and Medicare claims databases.
Federal regulations on human subjects research: IRB review and informed consent. These do not apply if the researcher does not interact with the participant AND the information is not identifiable.
What are de-identified data? Data without the 18 HIPAA-specified identifiers: overt identifiers, including SSN and medical record number; geographic data more precise than the first 3 digits of the zip code; dates except for the year; biometric identifiers; any other unique identifying characteristic.
Ethical issues: 1. Informed consent. When giving broad permission for future research, do donors appreciate that it may include whole genome sequencing? Very sensitive downstream research?
Sensitive research projects: Some donors may object to research on the genetics of antisocial behavior, human evolution, or beliefs about group ancestry.
Ethical issues: 2. Privacy and confidentiality. Heightened concerns, particularly about whole genome sequencing.
Special concerns about genetic confidentiality: Genetic information is considered particularly sensitive because it is about relatives and groups and is highly predictive of future illness ("future diaries").
Genetic Information Nondiscrimination Act (2008): Removes barriers to genetic testing. Health insurers may not use genetic information to set eligibility or premiums, and may not require or request genetic testing.
Genetic Information Nondiscrimination Act (2008): Employers may not use genetic information in employment or promotion decisions, and may not require or request genetic testing.
Limitations of GINA: After a job offer, an employer may request medical records, and it is impractical to delete genetic information from them. GINA does not apply to disability, life, or long-term care insurance (adverse selection under individual rating).
HIPAA fails to protect privacy: Weak security protections. Applies only to covered entities, so protection does not follow the information. Technology is advancing.