Published by Pauline Harmon. Modified over 9 years ago.
1
Practical Aspects of Alerting Algorithms in Biosurveillance
Howard S. Burkom
The Johns Hopkins University Applied Physics Laboratory, National Security Technology Department
Biosurveillance Information Exchange Working Group, DIMACS Program/Rutgers University, Piscataway, NJ
February 22, 2006
2
Outline
– What information do temporal alerting algorithms give the health monitor?
– How can typical data issues introduce bias or other misinformation?
– How do spatial scan statistics and other spatiotemporal methods give the monitor a different look at the data?
– What data issues are important for the quality of this information?
3
Conceptual Approaches to Aberration Detection
What does ‘aberration’ mean? Different approaches for a single data source:
– Process control-based: “The underlying data distribution has changed” (many measures)
– Model-based: “The data do not fit an analytical model based on a historical baseline” (many models)
– Can combine these approaches
Spatiotemporal approach: “The relationship of local data to neighboring data differs from expectations based on model or recent history”
4
Comparing Alerting Algorithms
Criteria:
Sensitivity
– Probability of detecting an outbreak signal
– Depends on effect of outbreak in data
Specificity (1 – false alert rate)
– Probability(no alert | no outbreak)
– May be difficult to prove no outbreak exists
Timeliness
– Once the effects of an outbreak appear in the data, how soon is an alert expected?
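The three criteria can be computed from a daily alert record once outbreak intervals are known (or injected). The following is a minimal sketch, not the presenter's evaluation code: the function name `evaluate_alerts`, the per-outbreak definition of sensitivity, and the per-day definition of specificity are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple


def evaluate_alerts(
    alerts: Sequence[bool],
    outbreaks: Sequence[Tuple[int, int]],
) -> Tuple[Optional[float], Optional[float], Optional[float]]:
    """Return (sensitivity, specificity, mean_delay_days).

    alerts[d] is True if the algorithm alerted on day d.
    outbreaks holds (first_day, last_day) intervals, inclusive, during
    which outbreak effects are present in the data.
    """
    outbreak_days = set()
    delays = []
    for first, last in outbreaks:
        days = range(first, last + 1)
        outbreak_days.update(days)
        # First alert inside the outbreak interval, if any.
        hit = next((d for d in days if alerts[d]), None)
        if hit is not None:
            delays.append(hit - first)

    quiet = [d for d in range(len(alerts)) if d not in outbreak_days]
    false_alerts = sum(1 for d in quiet if alerts[d])

    # Sensitivity: fraction of outbreaks with at least one alert.
    sensitivity = len(delays) / len(outbreaks) if outbreaks else None
    # Specificity: 1 - false alert rate over non-outbreak days.
    specificity = 1 - false_alerts / len(quiet) if quiet else None
    # Timeliness: mean days from outbreak onset to first alert.
    mean_delay = sum(delays) / len(delays) if delays else None
    return sensitivity, specificity, mean_delay
```

Because it is hard to prove that no outbreak exists, the "quiet" days in real data are only presumed outbreak-free, so the specificity estimate inherits that uncertainty.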
5
Aggregating Data in Time
Data stream(s) to monitor in time, divided into three intervals:
Baseline interval
– Used to get some estimate of normal data behavior
– Mean, variance
– Regression coefficients
– Expected covariate distribution: spatial, age category, % of claims/syndrome
Guardband
– Avoids contamination of baseline with outbreak signal
Test interval
– Counts to be tested for anomaly
– Nominally 1 day
– Longer to reduce noise, test for epicurve shape
– Will shorten as data acquisition improves
6
Elements of an Alerting Algorithm
– Values to be tested: raw data, or residuals from a model?
– Baseline period
   Historical data used to determine expected data behavior
   Fixed or a sliding window?
   Outlier removal: to avoid training on unrepresentative data
   What does the algorithm do when there is all-zero or no baseline data?
   Is a warmup period of data history required?
– Buffer period (or guardband)
   Separation between the baseline period and interval to be tested
– Test period
   Interval of current data to be tested
– Reset criterion
   To prevent flooding by persistent alerts caused by extreme values
– Test statistic: value computed to make alerting decisions
– Threshold: alert issued if test statistic exceeds this value
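To show how these elements fit together, here is a minimal sketch of a sliding-baseline z-score detector. It is an illustration under stated assumptions, not an algorithm from the talk: the function name, the 28-day baseline, the 2-day guardband, the threshold of 3, and the standard-deviation floor `min_sd` (one possible answer to the all-zero baseline question) are all hypothetical choices. Outlier removal and a reset criterion are omitted.

```python
from statistics import mean, stdev


def sliding_zscore_alerts(counts, baseline=28, guardband=2,
                          threshold=3.0, min_sd=0.5):
    """Return a list of (day, test_statistic, alert) tuples.

    Values tested: raw daily counts.
    Baseline: the `baseline` days ending `guardband` days before the test day.
    Warmup: no test is made until a full baseline plus guardband exists.
    """
    results = []
    for day in range(baseline + guardband, len(counts)):
        # Baseline window, separated from the test day by the guardband.
        window = counts[day - guardband - baseline:day - guardband]
        mu = mean(window)
        # Floor the standard deviation so an all-zero or constant
        # baseline does not cause division by zero (assumed policy).
        sd = max(stdev(window), min_sd)
        z = (counts[day] - mu) / sd  # test statistic
        results.append((day, z, z > threshold))
    return results
```

For example, `sliding_zscore_alerts([10, 12] * 15 + [11, 11, 30])` tests days 30 through 32 and flags only the final day.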
7
Rash Syndrome Grouping of Diagnosis Codes
www.bt.cdc.gov/surveillance/syndromedef/word/syndromedefinitions.doc
8
Example: Daily Counts with Injected Cases Injected Cases Presumed Attributable to Outbreak Event
9
Example: Algorithm Alerts Indicated Test Statistic Exceeds Chosen Threshold
10
EWMA Monitoring
Exponentially Weighted Moving Average: average with most weight on recent X_k:
S_k = ω S_{k-1} + (1 - ω) X_k, where 0 < ω < 1
Test statistic: S_k compared to expectation from sliding baseline
Basic idea: monitor (S_k - μ_k) / σ_k
Added sensitivity for gradual events
A larger weight (1 - ω) on the current count X_k means less smoothing
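The EWMA recursion and the standardized statistic can be sketched as follows, with the weight on S_{k-1} called `omega` here. This is a sketch under assumptions, not the presenter's implementation: the slide does not say how σ_k is obtained, so this version takes μ_k and the standard deviation from a sliding baseline of raw counts and scales the latter by the asymptotic EWMA factor sqrt((1 - omega) / (1 + omega)). The default parameter values and `min_sd` floor are hypothetical.

```python
from statistics import mean, stdev


def ewma_alerts(counts, omega=0.6, baseline=28, guardband=2,
                threshold=3.0, min_sd=0.5):
    """Return (day, statistic, alert) tuples for an EWMA detector."""
    # Recursion: S_k = omega * S_{k-1} + (1 - omega) * X_k
    s = float(counts[0])
    smoothed = [s]
    for x in counts[1:]:
        s = omega * s + (1 - omega) * x
        smoothed.append(s)

    # Assumption: for independent counts with standard deviation sigma,
    # the long-run standard deviation of S_k is sigma * scale.
    scale = ((1 - omega) / (1 + omega)) ** 0.5

    results = []
    for k in range(baseline + guardband, len(counts)):
        window = counts[k - guardband - baseline:k - guardband]
        mu_k = mean(window)
        sigma_k = max(stdev(window), min_sd) * scale
        stat = (smoothed[k] - mu_k) / sigma_k  # (S_k - mu_k) / sigma_k
        results.append((k, stat, stat > threshold))
    return results
```

On a slow ramp such as `[10, 12] * 15 + [13, 14, 15, 16, 17, 18]`, the smoothed value accumulates the small daily increases, which is the added sensitivity for gradual events.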
11
Example with Detection Statistic Plot
[Figure labels: Statistic Exceeds Threshold; Threshold]
12
Example: EWMA applied to Rash Data
13
Effects of Data Problems
[Figure labels: missed event; additional flags]
14
Importance of Spatial Data for Biosurveillance
– Purely temporal methods can find anomalies, IF you know which case counts to monitor
   Location of outbreak? Extent?
– Advantages of spatial clustering
   Tracking progression of outbreak
   Identifying population at risk
15
Evaluating Candidate Clusters
[Figure: surveillance region with points marked x and a candidate cluster]
The scan statistic gives a measure of “how unlikely is the number of cases inside the candidate cluster relative to the number outside, given the expected spatial distribution of cases.” (Thus, a populous region won’t necessarily flag.)
16
Selecting Candidate Clusters
[Figure: points marked x]
17
Searching for Spatial Clustering
– Form cylinders: bases are circles about each centroid in region A, height is time
– Calculate statistic for event count in each cylinder relative to entire region, within space & time limits
– Most significant clusters: regions whose centroids form base of cylinder with maximum statistic
– But how unusual is it? Repeat procedure with Monte Carlo runs, compare max statistic to maxima of each of these
[Figure: region A, with x marking the centroids of data collection regions]
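The search-and-test procedure can be sketched in simplified form. The version below is purely spatial (circles only, no time axis, so no cylinders) and uses the Poisson log-likelihood ratio commonly associated with Kulldorff's scan statistic; the slides do not state which statistic was used, so that choice is an assumption. All function names, the `max_regions` limit, and the 199 Monte Carlo runs are hypothetical.

```python
import math
import random


def poisson_llr(c, e, total):
    """Log-likelihood ratio for c observed vs e expected cases in a window."""
    if e <= 0 or c <= e:
        return 0.0  # only excesses inside the window are of interest
    llr = c * math.log(c / e)
    if total > c:
        llr += (total - c) * math.log((total - c) / (total - e))
    return llr


def candidate_windows(centroids, max_regions):
    """Circles about each centroid: its 1..max_regions nearest regions."""
    windows = set()
    for cx, cy in centroids:
        order = sorted(
            range(len(centroids)),
            key=lambda i: (centroids[i][0] - cx) ** 2
            + (centroids[i][1] - cy) ** 2,
        )
        for k in range(1, max_regions + 1):
            windows.add(frozenset(order[:k]))
    return list(windows)


def max_scan(counts, expected, windows):
    """Return (maximum statistic, window achieving it)."""
    total = sum(counts)
    best, best_window = 0.0, None
    for w in windows:
        c = sum(counts[i] for i in w)
        e = sum(expected[i] for i in w)
        llr = poisson_llr(c, e, total)
        if llr > best:
            best, best_window = llr, w
    return best, best_window


def scan_test(centroids, counts, expected, max_regions=3, runs=199, seed=0):
    """Return (most likely cluster, its statistic, Monte Carlo p-value)."""
    rng = random.Random(seed)
    windows = candidate_windows(centroids, max_regions)
    total = sum(counts)
    # Rescale expectations so they sum to the observed total.
    factor = total / sum(expected)
    exp = [e * factor for e in expected]
    observed, cluster = max_scan(counts, exp, windows)

    exceed = 0
    for _ in range(runs):
        # Redistribute the same number of cases under the null hypothesis.
        sim = [0] * len(counts)
        for i in rng.choices(range(len(counts)), weights=exp, k=total):
            sim[i] += 1
        if max_scan(sim, exp, windows)[0] >= observed:
            exceed += 1
    return cluster, observed, (exceed + 1) / (runs + 1)
```

Comparing the observed maximum against the maximum from each simulated data set, rather than against a per-window null distribution, is what provides the control for multiple testing across all candidate windows.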
18
Scan Statistic Demo
19
Scan Statistics: Advantages
– Gives monitor guidance for cluster size, location, significance
– Avoids preselection bias regarding cluster size or location
– Significance testing has control for multiple testing
– Can tailor problem design by data, objective:
   – Location (zipcode, hospital/provider site, patient/customer residence, school/store address)
   – Time windows used (cases, history, guardband)
   – Background estimation method: model, history, population, eligible customers
20
Surveillance Application
OTC Anti-flu Sales, Dates: 15-24 Apr 2002
Total sales as of 25 Apr: 1804
Potential cluster: center at 22311
– 63 sales, 39 expected from recent data
– Relative risk = 1.6
– p = 0.041
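The reported relative risk can be checked from the numbers on the slide. The short calculation below is a sketch: it assumes the total of 1804 sales covers the same region and time window as the 63 observed and 39 expected sales, which the slide does not state. Both the simple observed-to-expected ratio and the inside-versus-outside ratio round to the reported 1.6.

```python
observed = 63      # sales inside the potential cluster
expected = 39.0    # expected sales from recent data
total = 1804       # total sales (assumed to match the cluster's time window)

# Simple ratio of observed to expected inside the cluster.
ratio_inside = observed / expected

# Inside ratio relative to the corresponding ratio outside the cluster.
ratio_outside = (total - observed) / (total - expected)
relative_risk = ratio_inside / ratio_outside

print(round(ratio_inside, 1), round(relative_risk, 1))  # prints: 1.6 1.6
```

The p-value of 0.041 cannot be reproduced from these summary numbers alone, since it depends on the Monte Carlo comparison over all candidate clusters.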
21
Distribution of Nonsyndromic Visits 4 San Diego Hospitals
22
Effect of Data Discontinuities on OTC Cough/Cold Clusters
Before removing problem zips, cluster groups are dominated by zips that “turn on” after sustained periods of zero or abnormally low counts. After editing, more interesting cluster groups emerge.
[Figure axes: Days; Zip (S to N)]
23
School Nurse Data: All Visits unreported
24
Cluster Investigation by Record Inspection Records Corresponding to a Respiratory Cluster
25
Backups
26
Cumulative Summation Approach (CUSUM)
– Widely adapted to disease surveillance
– Devised for prompt detection of small shifts
– Look for changes of 2k standard deviations from the mean (often k = 0.5)
– Take normalized deviation: often Z_t = (x_t - μ) / σ
– Compare lower, upper sums to threshold h:
   S_{H,t} = max(0, (Z_t - k) + S_{H,t-1})
   S_{L,t} = max(0, (-Z_t - k) + S_{L,t-1})
– Phase I sets h, k
Upper sum: keep adding the differences between today’s normalized count and k standard deviations above the mean. Alert when the sum exceeds threshold h.
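The two-sided recursion can be sketched directly. This is an illustrative version, assuming a fixed mean `mu` and standard deviation `sigma` supplied by the caller; the default h = 4.0 is a hypothetical placeholder (the slide says h and k are set in Phase I), and resetting both sums after an alert is an assumed reset criterion, not something the slide specifies.

```python
def cusum_alerts(counts, mu, sigma, k=0.5, h=4.0):
    """Return (day, upper_sum, lower_sum, alert) tuples for a two-sided CUSUM."""
    s_hi = s_lo = 0.0
    results = []
    for t, x in enumerate(counts):
        z = (x - mu) / sigma                # normalized deviation Z_t
        s_hi = max(0.0, (z - k) + s_hi)     # accumulates increases beyond k
        s_lo = max(0.0, (-z - k) + s_lo)    # accumulates decreases beyond k
        alert = s_hi > h or s_lo > h
        results.append((t, s_hi, s_lo, alert))
        if alert:
            # Assumed reset criterion: restart both sums after an alert
            # so one extreme value does not cause a run of alerts.
            s_hi = s_lo = 0.0
    return results
```

With `mu=10`, `sigma=2`, `k=0.5`, and `h=2.0`, a sustained shift from 10 to 12 (one standard deviation) adds 0.5 to the upper sum each day and triggers an alert on the fifth shifted day, showing how small persistent shifts accumulate.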
27
CuSum Example: CDC EARS Methods C1-C3
– Three adaptive methods chosen by National Center for Infectious Diseases after 9/1/2001 as most consistent
– Look for aberrations representing increases, not decreases
– Fixed mean, variance replaced by values from sliding baseline (usually 7 days)
Baselines relative to the current count on Day 0:
– C1-MILD: Day-1 to Day-7
– C2-MEDIUM: Day-3 to Day-9
– C3-ULTRA: Day-3 to Day-9
[Figure: timeline from Day-9 through Day 0 (current count)]
28
Calculation for C1-C3
Individual day statistic for day j with lag n:
S_{j,n} = max{0, (Count_j - [μ_n + σ_n]) / σ_n},
where μ_n is the 7-day average with n-day lag (so μ_3 is the mean of counts in [j-9, j-3]), and σ_n is the standard deviation of the same 7-day window
– C1 statistic for day k is S_{k,1} (no lag)
– C2 statistic for day k is S_{k,3} (2-day lag)
– C3 statistic for day k is S_{k,3} + S_{k-1,3} + S_{k-2,3}, where S_{k-1,3}, S_{k-2,3} are added if they do not exceed the threshold
Upper bound threshold of 2: equivalent to 3 standard deviations above mean
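The C1-C3 formulas above translate into a few lines of code. This is a sketch following the slide's definitions, not the CDC EARS implementation: the function names and the `min_sd` floor for a zero-variance baseline are assumptions, and the caller must supply at least 11 prior days so that every 7-day window exists.

```python
from statistics import mean, stdev


def day_stat(counts, j, lag, width=7, min_sd=0.1):
    """S_{j,n}: max{0, (Count_j - [mu_n + sigma_n]) / sigma_n} for lag n."""
    # Baseline window covers days j-lag-width+1 .. j-lag inclusive,
    # e.g. [j-7, j-1] for lag 1 and [j-9, j-3] for lag 3.
    window = counts[j - lag - width + 1:j - lag + 1]
    mu = mean(window)
    # Assumed policy: floor sigma so a constant baseline does not divide by zero.
    sigma = max(stdev(window), min_sd)
    return max(0.0, (counts[j] - (mu + sigma)) / sigma)


def ears_c1_c2_c3(counts, k, threshold=2.0):
    """Return (C1, C2, C3) for day k; requires k >= 11."""
    c1 = day_stat(counts, k, lag=1)
    c2 = day_stat(counts, k, lag=3)
    # C3 adds the two previous days' statistics unless they exceeded the threshold.
    previous = (day_stat(counts, k - 1, lag=3), day_stat(counts, k - 2, lag=3))
    c3 = c2 + sum(s for s in previous if s <= threshold)
    return c1, c2, c3
```

Because the statistic subtracts one standard deviation before dividing, a value of 2 corresponds to a count 3 standard deviations above the baseline mean, matching the threshold note on the slide.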
29
Detailed Example, I Fewer alerts AND more sensitive: why?
30
Detailed Example, II Signal Detected only with 28-day baseline
31
Detailed Example, III “the rest of the story”