Scenario-Based Evaluation of Cluster Detection and Tracking Capability

Howard Burkom (1), Jian Xing (2), Linda Moniz (1), Jerry Tokars (2)
(1) Johns Hopkins University Applied Physics Laboratory; (2) Centers for Disease Control and Prevention
6th Annual Public Health Information Network Conference, Section F7: Evaluation of Surveillance Systems, Atlanta, GA, August 27, 2008
Copyright 2006 The Johns Hopkins University Applied Physics Laboratory. All Rights Reserved.
Cite the collaboration of Linda, Jim, and Jerry at CDC BioSense. This work bridges the gap between statistical research and public health practice, which is data dependent, and addresses a challenge others have posed, including at this conference.
This presentation was supported by Grant R01 PH from CDC. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of CDC.

Presentation Outline
BioSense data environment: DoD outpatient clinic visit records; cluster detection methodology; synthetic signal approach; study design; study results; conclusions. Note the importance of the data source: the conclusions do not necessarily carry over to other data sources and syndrome filterings, and we have developed methods for other data types.

Data Source: Daily Outpatient Visits Counts by Syndrome
Data were provided through the CDC BioSense program under R01 grant PH. The dataset covers 3 years of military clinic data from Texas, with some clinics in Oklahoma and Louisiana; most patients at the Louisiana and Oklahoma clinics come from Texas. There are 32 facility zip codes and 1233 residence zip codes. The dataset includes active duty personnel, dependents, and retirees. Data are at the zip code level: each patient record contains 2 zip codes, treatment facility and residence, and no other spatial information is available. Spatial clusters can be examined by either one: facility zip as a surrogate for work location, residence zip as a surrogate for home.

Data Highly Skewed Toward a Few Locations
A few clinics completely dominate the case counts, and a few home zips dominate the case counts. For both types of spatial organization, recorded zips are highly concentrated in relatively few locations. Facility: 20 of 32 zips have a median daily count of 0, and just 3 have a median above 100. Residence: 1099 of 1233 zips have a median daily count of 0. Records are not spatially distributed like the census population (see Jim's presentation last year).

Method of Scan Statistics
Form cylinders: bases are circles about each centroid in surveillance region A (the centroids of the data collection regions); height is time. Calculate a statistic for the event count in each cylinder relative to the entire region, within space and time limits. The most significant clusters are the regions whose centroids form the base of the cylinder with the maximum statistic. But how unusual is that maximum? Repeat the procedure with Monte Carlo trials and compare the observed maximum statistic to the set of maxima from those trials. This summarizes the cluster detection method: scan statistics, popularized by M. Kulldorff's downloadable SaTScan software.
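The purely spatial, single-day case of the procedure above can be sketched as follows. This is a minimal illustration, not the authors' C or Matlab code: it assumes the Poisson likelihood-ratio form of the scan statistic, and the centroid coordinates, counts, and function names are hypothetical. Expected counts are assumed to be scaled so they sum to the observed regional total.

```python
import math

def poisson_llr(c, e, total):
    """Poisson log-likelihood ratio for one candidate zone.

    c: observed count inside the zone; e: expected count inside the zone;
    total: observed count in the whole region (expected counts sum to total).
    Only excesses (c > e) are scored.
    """
    if c <= e or e <= 0:
        return 0.0
    llr = c * math.log(c / e)
    if total > c:  # the outside-zone term vanishes when every case is inside
        llr += (total - c) * math.log((total - c) / (total - e))
    return llr

def best_circle(centroids, observed, expected, max_radius):
    """Scan circles centred on each centroid; return (llr, centre, members)."""
    total = sum(observed.values())
    best = (0.0, None, [])
    for centre, origin in centroids.items():
        # Order centroids by distance so the circle grows one area at a time.
        by_dist = sorted(centroids, key=lambda z: math.dist(origin, centroids[z]))
        c = e = 0.0
        members = []
        for z in by_dist:
            if math.dist(origin, centroids[z]) > max_radius:
                break
            c += observed[z]
            e += expected[z]
            members.append(z)
            llr = poisson_llr(c, e, total)
            if llr > best[0]:
                best = (llr, centre, list(members))
    return best

# Hypothetical toy region: four areas, two of them with an excess of cases.
centroids = {"A": (0, 0), "B": (1, 0), "C": (5, 5), "D": (6, 5)}
observed = {"A": 12, "B": 9, "C": 3, "D": 4}
expected = {"A": 5, "B": 5, "C": 9, "D": 9}
llr, centre, members = best_circle(centroids, observed, expected, max_radius=2.0)
print(centre, members, round(llr, 2))
```

The Monte Carlo step would rerun `best_circle` on counts redistributed at random according to `expected` and collect the maximum statistic from each replicate.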

Implementations of Scan Statistics
Separate codes in C and Matlab, precisely cross-checked. These enable testing of spatial distribution estimation methods. Significance testing uses the Gumbel (extreme value) distribution to reduce the number of required Monte Carlo runs, following the method of Abrams A, Kulldorff M, Kleinman K, “Empirical/Asymptotic P-values for Monte Carlo-Based Hypothesis Testing: an Application to Cluster Detection Using the Scan Statistic”, Adv. in Disease Surv. Vol. 1. They showed that thresholds derived from the Gumbel distribution with 100 replicates may be used in place of rank-based p-values using 999 Monte Carlo replicates. We added the capability to inject parametric, stochastic signals into background data in order to evaluate detection performance.
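To illustrate why a Gumbel fit can reduce the number of Monte Carlo replicates, the sketch below compares a Gumbel-based p-value with the usual rank-based one. It assumes a method-of-moments fit, which may differ from the estimator used in the cited paper or in the authors' implementation; the function names and the simulated maxima are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329

def gumbel_fit(maxima):
    """Method-of-moments fit of a Gumbel (maximum) distribution."""
    scale = statistics.stdev(maxima) * math.sqrt(6) / math.pi
    loc = statistics.fmean(maxima) - EULER_GAMMA * scale
    return loc, scale

def gumbel_p_value(observed_max, maxima):
    """Upper-tail probability of observed_max under the fitted Gumbel."""
    loc, scale = gumbel_fit(maxima)
    z = (observed_max - loc) / scale
    # 1 - exp(-exp(-z)), written with expm1 to stay accurate for tiny p-values.
    return -math.expm1(-math.exp(-z))

def rank_p_value(observed_max, maxima):
    """Standard Monte Carlo p-value; it can never fall below 1 / (M + 1)."""
    exceed = sum(m >= observed_max for m in maxima)
    return (exceed + 1) / (len(maxima) + 1)

# Hypothetical stand-in for 100 Monte Carlo maxima of the scan statistic.
rng = random.Random(0)
sims = [2.0 - 1.5 * math.log(-math.log(rng.random())) for _ in range(100)]
print(gumbel_p_value(50.0, sims), rank_p_value(50.0, sims))
```

With 100 replicates the rank-based p-value bottoms out at 1/101, so thresholds stricter than about 0.01 need either many more replicates or a parametric tail such as the Gumbel fit.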

Details of Scan Statistic Study Implementation
The estimated spatial distribution is inferred from a sliding 56-day baseline with a 2-day separation buffer from the test day(s). Two estimation methods were used in this study for individual zip-code totals: (a) a flat baseline average; (b) averages stratified by day-of-week (DOW), counting holidays as Sundays. Only single-day clusters were examined in this study. Output of the scan statistic: clusters whose Gumbel p-values are below a chosen threshold. We ran the scan statistics on the entire 3 years of data to get the background daily rate of significant cluster determination, then injected sets of stochastic signals to measure detection performance.
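The two baseline estimators can be sketched as below. This is an illustration under stated assumptions: the 2-day buffer is read as skipping the two days immediately before the test day, the `counts` layout (date to zip to count) and all function names are hypothetical, and the dates are placeholders rather than the study period.

```python
import datetime as dt
from collections import defaultdict

BASELINE_DAYS = 56
BUFFER_DAYS = 2

def baseline_window(test_day):
    """Dates of the 56-day baseline that ends 2 buffer days before test_day."""
    last = test_day - dt.timedelta(days=BUFFER_DAYS + 1)
    return [last - dt.timedelta(days=k) for k in range(BASELINE_DAYS)]

def dow_key(day, holidays=frozenset()):
    """Day-of-week stratum, with holidays counted as Sundays (weekday 6)."""
    return 6 if day in holidays else day.weekday()

def expected_counts(counts, test_day, stratify=False, holidays=frozenset()):
    """counts: {date: {zip: count}}. Returns {zip: baseline mean for test_day}."""
    days = baseline_window(test_day)
    if stratify:
        # Keep only baseline days in the same day-of-week stratum as the test day.
        target = dow_key(test_day, holidays)
        days = [d for d in days if dow_key(d, holidays) == target]
    if not days:
        return {}
    sums = defaultdict(float)
    for d in days:
        for z, n in counts.get(d, {}).items():
            sums[z] += n
    return {z: s / len(days) for z, s in sums.items()}

# Hypothetical single zip with a strong weekday/weekend pattern.
test_day = dt.date(2008, 8, 25)  # a Monday
counts = {}
for k in range(1, 70):
    d = test_day - dt.timedelta(days=k)
    counts[d] = {"Z1": 10 if d.weekday() < 5 else 2}
flat = expected_counts(counts, test_day)
strat = expected_counts(counts, test_day, stratify=True)
print(flat, strat)
```

In this toy series the flat average understates the Monday expectation because weekend days pull it down, which is the kind of bias the day-of-week stratification is meant to remove.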

Algorithm Evaluation: Authentic vs Synthetic Data
Authentic data: advantage of representing true background noise; data very difficult to obtain and publish; known outbreak effects, especially with spatial distribution, are very rare.
Synthetic data: fully understood noise field, but must show epidemiological relevance; onset and strength of signal known precisely; can know that background alarms are false alarms.
Semi-synthetic approach: injecting realistic signals into authentic background data.
Authentic data for algorithm evaluation are hard to get and sometimes harder to publish. This difficulty has led to frequent discussion of whether to use relatively small sets of unlabelled data or to simulate the noise background and/or the signal. In authentic data, the exact date of onset in a population sense is almost never known, and even individual onset dates can be questionable. (We have been involved in expert-directed studies; they were never satisfactory because the information required by the experts is so difficult to get.)

Repeated Trials with Synthetic Signals
Add a fixed number of inject cases to selected daily zip code counts using a chosen case spread geometry; the study added records region-wide over ~8 days. Run scan statistic detection on the resultant dataset. Criterion for detection: a statistically significant cluster containing a zip code with injected cases. Subsequent trials: advance the start day by 8 days and repeat the process, for about 130 trials using ~3 years' data, so that day-of-week and seasonal effects are sampled evenly.
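The trial schedule and detection criterion described above can be sketched as follows. The day count (1096) and the 58-day warm-up before the first inject are assumptions chosen for illustration, and the function names are hypothetical.

```python
def trial_start_days(n_days, first_start, step=8, signal_len=8):
    """Start-day indices for successive inject trials, each fitting in the data."""
    return list(range(first_start, n_days - signal_len + 1, step))

def detected(significant_clusters, injected_zips):
    """Detection criterion: some significant cluster contains an injected zip."""
    injected = set(injected_zips)
    return any(injected & set(cluster) for cluster in significant_clusters)

# Assumed: about 3 years of daily data, first inject after baseline plus buffer.
starts = trial_start_days(n_days=1096, first_start=58)
print(len(starts), [s % 7 for s in starts[:7]])
print(detected([["78234", "78236"], ["79906"]], ["78236"]))
```

Because the 8-day step is one more than a week, consecutive trials begin on successive days of the week, and stepping through the full record spreads the trials across seasons. Under the assumed values this yields 129 start days, in line with the "about 130 trials" in the slide.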

Idealized Signal Shape: Sartwell’s Lognormal Model
Incubation periods may be described by a lognormal distribution, with parameters depending on disease type, route of infection, and individual factors: dosage and susceptibility (vaccination status, genetic factors). Epicurve: plot of the number of symptomatic cases by day.
The canonical idea of a bioterrorist attack is a localized, point-source outbreak, such as the 1979 accident at Sverdlovsk, where weaponized anthrax spores were released in aerosol form, an unknown number were infected, and about 70 died. The magenta dotted curve shows the actual epicurve we constructed from a plot in the 1992 Meselson paper. We have taken data such as this to calculate zeta and sigma for a disease-specific lognormal distribution, and can then plot the “maximum likelihood epicurve”.
The modal day is exp(zeta - sigma^2), the mode of a lognormal whose logarithm has mean zeta and standard deviation sigma. In constructing a test signal, we set the modal number of cases to a multiple of the estimated standard deviation of the time series of interest, then divide by the modal probability to get the total number infected, and we add the resulting counts to the authentic data.
Sartwell, PE. The distribution of incubation periods of infectious disease. Am J Hyg 1950; 51:310-318.
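The sizing step, going from a desired peak height to the total number of cases N, can be sketched as below. The parameter values are hypothetical rather than fitted to any disease, the daily binning (day d covers incubation times in (d-1, d]) is an assumption, and the function names are illustrative. For a lognormal whose logarithm has mean zeta and standard deviation sigma, the continuous mode is exp(zeta - sigma^2), which the code uses only as a cross-check.

```python
import math

def lognormal_cdf(t, zeta, sigma):
    """CDF of a lognormal whose log has mean zeta and standard deviation sigma."""
    if t <= 0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(t) - zeta) / (sigma * math.sqrt(2.0))))

def daily_probs(zeta, sigma, n_days):
    """P(symptom onset on day d) for d = 1..n_days, day d covering (d-1, d]."""
    return [lognormal_cdf(d, zeta, sigma) - lognormal_cdf(d - 1, zeta, sigma)
            for d in range(1, n_days + 1)]

def total_cases_for_peak(peak_cases, zeta, sigma, n_days=30):
    """Total N such that the expected count on the modal day equals peak_cases."""
    probs = daily_probs(zeta, sigma, n_days)
    modal_prob = max(probs)
    modal_day = probs.index(modal_prob) + 1
    return round(peak_cases / modal_prob), modal_day

ZETA, SIGMA = 1.5, 0.5          # hypothetical parameters, not from the study
PEAK = 10                       # e.g. a chosen multiple of the series' std dev
n_total, modal_day = total_cases_for_peak(PEAK, ZETA, SIGMA)
continuous_mode = math.exp(ZETA - SIGMA ** 2)
print(n_total, modal_day, round(continuous_mode, 2))
```

In practice `PEAK` would be set from the standard deviation of the background series being tested, so the same procedure scales the signal to each syndrome and spatial level.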

Form Stochastic Epicurve, then Distribute Each Day’s Cases by Subregion
Figure labels: “maximum likelihood” epicurve; each symptomatic case a random draw.
For each trial, compute a histogram of cases by day from random draws, then distribute each day’s cases according to the chosen geometry of spread. We do not assume the maximum likelihood epicurve in our simulation; instead we form a stochastic signal using the calculated N, zeta, and sigma. For each of the N simulated cases, we take a random draw from the lognormal with parameters zeta and sigma, just as each individual’s incubation period could be seen as a random draw from distributions of dosage and susceptibility. Each trial is then a set of N such random draws, giving a large set of random signals. These signals are what we add to the noise background of authentic data.
Rolka H, Burkom H, Cooper GF, Kulldorff M, Madigan D, Wong W-K, Issues in applied statistics for public health, Bioterrorism surveillance using multiple data streams: research needs, Statistics in Medicine (2007), Vol. 26, 8:
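One trial's signal, as described above, can be sketched with the standard library alone. The zip codes and weights stand in for a spread geometry and are hypothetical, as are the parameter values and function names; rounding each incubation draw up to a whole day is also an assumption.

```python
import math
import random
from collections import Counter

def stochastic_epicurve(n_cases, zeta, sigma, rng):
    """Histogram of onset days from n_cases lognormal incubation draws."""
    days = Counter()
    for _ in range(n_cases):
        incubation = rng.lognormvariate(zeta, sigma)
        days[max(1, math.ceil(incubation))] += 1
    return days

def distribute_by_zip(epicurve, zip_weights, rng):
    """Assign each day's cases to zips in proportion to the geometry weights."""
    zips = list(zip_weights)
    weights = [zip_weights[z] for z in zips]
    injects = {}
    for day, n in sorted(epicurve.items()):
        injects[day] = Counter(rng.choices(zips, weights=weights, k=n))
    return injects

rng = random.Random(2008)
N, ZETA, SIGMA = 50, 1.5, 0.5                              # hypothetical values
geometry = {"78234": 0.5, "78236": 0.3, "78239": 0.2}      # hypothetical spread
epicurve = stochastic_epicurve(N, ZETA, SIGMA, rng)
injects = distribute_by_zip(epicurve, geometry, rng)
for day in sorted(injects):
    print(day, dict(injects[day]))
```

Each call with a fresh random state gives a different epicurve with the same total N, so repeated trials exercise the detector against a family of plausible signals rather than one idealized curve.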

Inject Geometries: Circular (resp), Hourglass (resp), Wedge (resp)

Sample Distribution of Background and Injected Visit Records
Typical data scenario: effects of an outbreak, with 50 additional cases spread out over 9 days. Note the region-wide data series and epicurve: black for background counts, red for injects.

Results Chart for Respiratory Syndrome
Note the drop in sensitivity from residence-level to facility-level analysis. Sensitivity is high in rural regions and drops as the background gets noisier.

Results Chart for Rash Syndrome
Sensitivity extends to weaker signals for the more specific rash syndrome, which has a sparser background. At the level of residence zip codes, sensitivity increases as the signal moves away from the noisy urban center.

Outbreak Signal Sizes Required for Detection using only Time Series
Multiple testing problems may be severe at facility or local levels. Does distributed investigation capability exist?

Comparison of Estimation Methods by threshold p-value
Figure: Resp syndrome, by residence zip, “suburban” center, probability of detection (PD) by Day 4. Curves: flat baseline avg., 100 injects; wkday/wkend avg., 100 injects; flat baseline avg., no injects; wkday/wkend avg., no injects.

Comparison of Estimation Methods by empirical background cluster rate
Figure: curves for 50 total injects (flat baseline avg.; wkday/wkend avg.) and 100 total injects (flat baseline avg.; wkday/wkend avg.).
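Comparing estimation methods at a common empirical background cluster rate, rather than at a common nominal p-value, could be done along the following lines. This is a simplified sketch, not the study's procedure: it assumes one p-value per day (the most significant cluster) from the no-inject run, and the function names and toy numbers are hypothetical.

```python
def threshold_for_rate(background_p, clusters_per_month, days_per_month=30.4):
    """Largest p-value threshold keeping the background cluster rate within target.

    background_p: one p-value per day (most significant cluster) from the
    run on authentic data with no injects.
    """
    allowed = int(clusters_per_month * len(background_p) / days_per_month)
    if allowed <= 0:
        return 0.0
    ranked = sorted(background_p)
    return ranked[min(allowed, len(ranked)) - 1]

def detection_probability(trial_p, threshold):
    """Fraction of inject trials with a cluster p-value at or below threshold."""
    return sum(p <= threshold for p in trial_p) / len(trial_p)

# Hypothetical toy inputs: 100 background days and 4 inject trials.
background_p = [i / 100 for i in range(1, 101)]
thr = threshold_for_rate(background_p, clusters_per_month=3, days_per_month=30)
pd_at_rate = detection_probability([0.01, 0.2, 0.05, 0.5], thr)
print(thr, pd_at_rate)
```

Calibrating each method's threshold on its own background run means that a method producing more spurious clusters is held to a stricter p-value before its detection probability is compared.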

Utility of Signal Formation/Injection Procedure
Scenario-based comparison and tuning of cluster detection methods. Quantifying alerting capability, use-case specific: how large a signal is detectable? How should thresholds be set and the spatial distribution estimated for the required sensitivity? Clarifying sensitivity by background density (urban, suburban, rural) in highly skewed distributions. Helping to evaluate syndrome groupings for a clinical data source by quantifying spatial noise effects on detection capability.

Conclusions 1: DoD outpatient visit dataset
Event-based scenarios involving military bases: use facility-level cluster detection, which is more representative of work location. Maximum general sensitivity: use residence-level zips. Estimating the spatial distribution must account for day-of-week and seasonal effects to avoid spurious clustering, but overfitting the model for a specific syndrome grouping can lose specificity.

The BioSense rash category allowed detection of small clusters (<50 additional region-wide cases/week) at low background cluster rates (1-2 per month). The BioSense resp-related record grouping was too noisy for sensitivity to modest-sized clusters (~100 cases) at low alert rates; consider more specific filtering (subsyndrome) for this dataset.

Findings emphasize the importance of: the ability to classify records by patient residence or work location, to capture the location of exposure; syndromic record filtering for effective signal-to-noise discrimination; developing a rapid capability to display clinics, chief complaints, ages, and other key record fields for all cases in a suspected cluster; and identifying potential linkages among these records.

Acknowledgements
Data were provided through the CDC BioSense program under R01 grant PH. Jim Edgerton (APL) and Mike Leuze (SAIC) did principal development of the MATLAB and C implementations of the scan statistic algorithms; Jim Edgerton developed the Gumbel distribution adaptation in the MATLAB implementation.
