Using Large Databases for Research
Melissa Schiff, MD, MPH
UNM Hospitalists’ Research Club
July 16, 2019
How many people have used a large database for a research project?
Outline
- Reasons for using large databases
- How to evaluate if a database will work for your research
- Nuts and bolts of using large databases
- Useful databases for internal med / hospitalist research projects
“Epidemiologists are data scavengers”
- Secondary database – administrative data collected by an organization, surveillance data
- Use of secondary data revolutionized by advances in information technology
- “Big Data” used in health and many other fields outside of medicine
Reasons to use Large Databases
- Primary data collection – expensive in terms of money and time
- Secondary database advantages
  - Data available on the web, from data collection organization or third party
    Ex: Medicare data available from CMS
  - Data collected on large number of people – large sample size, ascertain rare diseases or exposures
    Ex: Cerner Health Facts, 600 hospitals, 85+ million patients
- Population-based data – calculate incidence rates, avoid referral bias
  Ex: Birth certificates, death certificates, NM HIDD
- Data less biased – collected for another purpose
  Ex: Self-report alcohol-related admissions vs hospitalization data
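The incidence-rate point can be made concrete with a minimal sketch. The counts below are hypothetical and not taken from any database named in these slides; the function name `incidence_rate` is also an illustrative choice. With population-based data, the denominator is the whole population at risk, which is what makes the rate interpretable.

```python
def incidence_rate(new_cases, person_years, per=100_000):
    """Return new cases per `per` person-years at risk."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return new_cases / person_years * per


# Hypothetical figures for illustration only.
cases = 150                         # new cases identified in the database
population_person_years = 2_000_000  # person-years at risk in the covered population

rate = incidence_rate(cases, population_person_years)
print(f"{rate:.1f} per 100,000 person-years")
```

A referral-center case series has no equivalent denominator, so the same calculation cannot be done with it.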
- Linkage of secondary database to primary data – independent measure or validation
  Ex: Occupational lung disease study linked to employment records
- Use of natural language processing – search text for keywords
  Ex: OMI data searched for oil and gas worker deaths
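Searching free text for keywords can be sketched as below. This is plain keyword matching with regular expressions, a much simpler stand-in for full natural language processing, and it is not the method used in the OMI example. The keyword list, the field name `narrative`, and the records are all invented for illustration.

```python
import re

# Hypothetical keyword list; a real project would develop and validate its own.
KEYWORDS = ["oil field", "oilfield", "drilling rig", "gas well", "roughneck"]
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def flag_records(records, text_field="narrative"):
    """Return (record id, sorted matched keywords) for records mentioning any keyword."""
    flagged = []
    for rec in records:
        text = rec.get(text_field) or ""
        hits = sorted({m.lower() for m in PATTERN.findall(text)})
        if hits:
            flagged.append((rec["id"], hits))
    return flagged


# Made-up records for illustration only.
records = [
    {"id": 1, "narrative": "Decedent was struck by equipment on a drilling rig."},
    {"id": 2, "narrative": "Found at home; history of heart disease."},
    {"id": 3, "narrative": "Worked as a ROUGHNECK at a gas well site."},
]

flagged = flag_records(records)
for rec_id, hits in flagged:
    print(rec_id, hits)
```

Flagged records would still need manual review, since a keyword hit does not confirm that the death was work-related.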
Evaluating a Large Database for Your Research
- Research question definition, variables
  - Exposure(s)
  - Outcome(s)
  - Confounding factors
- Identify the database of interest – publicly available, documentation available
- Population included in database
  - Demographics, geographic location
  - Inclusion/exclusion criteria
Database – SEER linked to Medicare, 2001-2011
- Research question: Does use of preventive care differ by race among Medicare beneficiaries with early-stage endometrial cancer?
- Population – women 65+ years (Medicare), endometrial cancer (SEER), N=13,054
- Exposure – race (Medicare)
- Outcome – preventive care (Medicare outpatient visits)
  - Well visit
  - Flu vaccine
  - Mammogram
  - Diabetes screening
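Once exposure and outcome are defined, a first descriptive look is the proportion with the outcome in each exposure group. The sketch below uses toy rows with placeholder group labels; it is not SEER-Medicare data and does not reproduce any study result. It also ignores confounding factors, which a real analysis would need to address.

```python
from collections import Counter

# Toy (exposure group, outcome flag) pairs, invented for illustration.
rows = [
    ("Group 1", 1), ("Group 1", 0), ("Group 1", 1), ("Group 1", 1),
    ("Group 2", 1), ("Group 2", 0), ("Group 2", 0), ("Group 2", 0),
]


def proportion_by_group(rows):
    """Return {group: proportion of rows in that group with outcome == 1}."""
    totals, events = Counter(), Counter()
    for group, outcome in rows:
        totals[group] += 1
        events[group] += outcome
    return {g: events[g] / totals[g] for g in totals}


proportions = proportion_by_group(rows)
for group, p in proportions.items():
    print(f"{group}: {p:.0%} received the service")
```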
- Time frame – months, years
- Data dictionary – document listing the variables, definitions
- Are your key variables included in the database?
- How are they measured?
- How much missing data for key variables?
- How many cases available in database?
- May need to download database to evaluate
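Checking how much data is missing for key variables is quick once an extract is in hand. The sketch below assumes a small CSV extract; the column names, values, and the set of missing-value codes are hypothetical, and a real data dictionary would say which codes mean "missing".

```python
import csv
import io

# Hypothetical extract; column names and values are made up for illustration.
SAMPLE = """patient_id,age,race,flu_vaccine
1,67,White,1
2,,Hispanic,0
3,72,,
4,80,Black,
"""

# Assumed missing-value codes, compared in lowercase.
MISSING = {"", "na", "n/a", ".", "unknown"}


def percent_missing(rows, variables):
    """Return {variable: percent of rows where the value is blank or a missing code}."""
    rows = list(rows)
    result = {}
    for var in variables:
        n_missing = sum(
            1 for r in rows if (r.get(var) or "").strip().lower() in MISSING
        )
        result[var] = 100.0 * n_missing / len(rows) if rows else 0.0
    return result


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
missing = percent_missing(rows, ["age", "race", "flu_vaccine"])
for var, pct in missing.items():
    print(f"{var}: {pct:.0f}% missing")
```

The same function also answers "how many cases are available" indirectly: `len(rows)` is the case count before any exclusions.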
Nuts and Bolts for using Large Databases
- Accessing database
  - Available online, from organization
  - May need to submit written request with research question
  - Cost of database
- Finding help
  - Researchers at UNM with specific database experience
  - CTSC
  - State health department
  - Experts at federal agencies (e.g. CMS)
- Statistical / data management needs
  - Statistical package capacity for large databases
  - Biostatistician familiar with the database
  - Data collection at the individual level or encounter level – un-duplication
  - Subject identifiers
  - Longitudinal – analysis over time
- IRB – typically considered “Exempt” status, check with UNM HRPO
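Un-duplication of encounter-level data can be sketched as collapsing rows to one record per subject identifier. The rows and field names below are hypothetical. The sketch also assumes that two rows with the same patient and the same admission date are the same encounter, which may not hold in every database.

```python
from collections import defaultdict
from datetime import date

# Hypothetical encounter-level rows: one row per hospital stay.
encounters = [
    {"patient_id": "A", "admit_date": date(2018, 1, 5), "dx": "pneumonia"},
    {"patient_id": "B", "admit_date": date(2018, 2, 1), "dx": "sepsis"},
    {"patient_id": "A", "admit_date": date(2018, 3, 9), "dx": "pneumonia"},
    {"patient_id": "A", "admit_date": date(2018, 3, 9), "dx": "pneumonia"},  # repeated row
]


def to_person_level(rows, id_field="patient_id", date_field="admit_date"):
    """Collapse encounter rows to one summary per person.

    Rows sharing an identifier and admission date are counted once.
    """
    dates_by_person = defaultdict(set)
    for r in rows:
        dates_by_person[r[id_field]].add(r[date_field])
    return {
        pid: {"n_encounters": len(dates), "first_admit": min(dates)}
        for pid, dates in dates_by_person.items()
    }


persons = to_person_level(encounters)
for pid, summary in sorted(persons.items()):
    print(pid, summary["n_encounters"], summary["first_admit"].isoformat())
```

Counting people rather than stays matters for anything reported per patient; keeping `first_admit` also sets up longitudinal analysis from an index admission.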
Efficient use of one database
- Main research question
- Specific populations – age groups, disease severity
- Trends over time
- Evaluation of policy change
- Financial data – health economics
- Identify refined research question – future primary data collection
- Become the local expert
Useful Databases
Databases available at UNM
- Health Facts – Cerner EMR data for UNM hospital, 600 hospitals
- Truven – 240 million patients, pharmaceuticals
- I2B2 – identify number of cases in UNM EMR
- Vizient
- New Mexico Tumor Registry (SEER) – all cancer cases in NM
New Mexico databases
- Office of the Medical Investigator
- New Mexico Death Certificates
- SEER-Medicare linked data
- Behavioral Risk Factor Surveillance System
- New Mexico Prescription Monitoring Program
- Indicator-based Information System (IBIS)
National databases
- Healthcare Cost and Utilization Project (HCUP)
  - National Inpatient Sample (NIS) – 7 million hospital stays
- National Health and Nutrition Examination Survey (NHANES) – 5000 people annually
- National Ambulatory Medical Care Survey
- Medicaid
- Medicare
- Veterans Administration
List of large databases - https://www.ehdp.com/vitalnet/datasets.htm
Summary
- Research question – consider using a large database
- Investigating a large database – population, variables, missing data, cost
- Accessing a large database – availability, time frame to get data, data management/analysis, people to help
- Multiple uses – consider variety of research questions to answer
Questions?
Contact information: mschiff@salud.unm.edu