Practical Aspects of Alerting Algorithms in Biosurveillance Howard S. Burkom The Johns Hopkins University Applied Physics Laboratory National Security.

Presentation transcript:

Practical Aspects of Alerting Algorithms in Biosurveillance Howard S. Burkom The Johns Hopkins University Applied Physics Laboratory National Security Technology Department Biosurveillance Information Exchange Working Group DIMACS Program/Rutgers University Piscataway, NJ February 22, 2006

Outline What information do temporal alerting algorithms give the health monitor? How can typical data issues introduce bias or other misinformation? How do spatial scan statistics and other spatiotemporal methods give the monitor a different look at the data? What data issues are important for the quality of this information?

Conceptual approaches to Aberration Detection What does ‘aberration’ mean? Different approaches for a single data source: Process control-based: “The underlying data distribution has changed” – many measures Model-based: “The data do not fit an analytical model based on a historical baseline” – many models Can combine these approaches Spatiotemporal Approach: “The relationship of local data to neighboring data differs from expectations based on model or recent history”

Comparing Alerting Algorithms
Criteria:
Sensitivity
–Probability of detecting an outbreak signal
–Depends on effect of outbreak in data
Specificity (1 – false alert rate)
–Probability(no alert | no outbreak)
–May be difficult to prove no outbreak exists
Timeliness
–Once the effects of an outbreak appear in the data, how soon is an alert expected?
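The three criteria can be made concrete with a small scoring routine. This is a minimal sketch, assuming daily boolean alert flags and known outbreak intervals (for example, injected cases); the function name `evaluate_alerts` and the choice to count an outbreak as detected if any day inside its interval alerts are illustrative assumptions, not definitions taken from the slides.

```python
from typing import Optional, Sequence, Tuple


def evaluate_alerts(
    alerts: Sequence[bool],
    outbreaks: Sequence[Tuple[int, int]],
) -> Tuple[float, float, Optional[float]]:
    """Return (sensitivity, specificity, mean delay in days).

    alerts[d] is True if the algorithm alerted on day d.
    outbreaks holds (first_day, last_day) index pairs, inclusive.
    """
    outbreak_days = set()
    for start, end in outbreaks:
        outbreak_days.update(range(start, end + 1))

    # An outbreak counts as detected if any day inside it alerts (assumption).
    delays = []
    for start, end in outbreaks:
        hit = next((d for d in range(start, end + 1) if alerts[d]), None)
        if hit is not None:
            delays.append(hit - start)

    sensitivity = len(delays) / len(outbreaks) if outbreaks else float("nan")

    # Specificity = P(no alert | no outbreak), estimated over non-outbreak days.
    quiet_days = [d for d in range(len(alerts)) if d not in outbreak_days]
    false_alerts = sum(1 for d in quiet_days if alerts[d])
    specificity = (
        1.0 - false_alerts / len(quiet_days) if quiet_days else float("nan")
    )

    # Timeliness: average days from outbreak start to first alert.
    timeliness = sum(delays) / len(delays) if delays else None
    return sensitivity, specificity, timeliness
```

In practice the "no outbreak" days are themselves uncertain, as the slide notes, so the specificity estimate is only as good as the labeling of quiet days.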

Aggregating Data in Time
Baseline interval: used to get some estimate of normal data behavior
–Mean, variance
–Regression coefficients
–Expected covariate distribution (spatial, age category, % of claims/syndrome)
Guardband: avoids contamination of baseline with outbreak signal
Test interval: data stream(s) to monitor in time
–Counts to be tested for anomaly
–Nominally 1 day
–Longer to reduce noise, test for epicurve shape
–Will shorten as data acquisition improves

Elements of an Alerting Algorithm
–Values to be tested: raw data, or residuals from a model?
–Baseline period: historical data used to determine expected data behavior
  Fixed or a sliding window?
  Outlier removal: to avoid training on unrepresentative data
  What does the algorithm do when there is all-zero or no baseline data?
  Is a warmup period of data history required?
–Buffer period (or guardband): separation between the baseline period and the interval to be tested
–Test period: interval of current data to be tested
–Reset criterion: to prevent flooding by persistent alerts caused by extreme values
–Test statistic: value computed to make alerting decisions
–Threshold: alert issued if test statistic exceeds this value
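The elements listed above can be seen together in a very small detector. The following is a sketch under stated assumptions rather than any particular system's method: it tests raw daily counts with a z-score, uses a sliding baseline, and handles an all-constant baseline by flooring the standard deviation. The name `zscore_alerts` and the default values for `baseline_len`, `guardband`, `threshold`, and `min_sigma` are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def zscore_alerts(
    counts: Sequence[float],
    baseline_len: int = 28,
    guardband: int = 2,
    threshold: float = 3.0,
    min_sigma: float = 0.5,
) -> List[bool]:
    """Flag day t when its count is far above the sliding-baseline mean."""
    alerts = []
    for t in range(len(counts)):
        # Baseline ends `guardband` days before the test day, so the most
        # recent days (possibly contaminated by an outbreak) are skipped.
        end = t - guardband          # exclusive index
        start = end - baseline_len
        if start < 0:
            alerts.append(False)     # warm-up: not enough history yet
            continue
        baseline = counts[start:end]
        mu = mean(baseline)
        # Floor sigma so an all-zero or constant baseline cannot divide by 0.
        sigma = max(stdev(baseline), min_sigma)
        test_statistic = (counts[t] - mu) / sigma
        alerts.append(test_statistic > threshold)
    return alerts
```

Outlier removal and a reset criterion are left out to keep the sketch short; both would change which days enter the baseline or which alerts are reported.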

Rash Syndrome Grouping of Diagnosis Codes

Example: Daily Counts with Injected Cases Injected Cases Presumed Attributable to Outbreak Event

Example: Algorithm Alerts Indicated Test Statistic Exceeds Chosen Threshold

EWMA Monitoring
Exponentially Weighted Moving Average: an average with most weight on recent $X_k$:
$S_k = \omega S_{k-1} + (1 - \omega) X_k$, where $0 < \omega < 1$
Test statistic: $S_k$ compared to expectation from sliding baseline
Basic idea: monitor $(S_k - \mu_k) / \sigma_k$
Added sensitivity for gradual events
In this form, a larger $\omega$ puts more weight on $S_{k-1}$ and so smooths more; a smaller $\omega$ means less smoothing
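A minimal sketch of the EWMA recursion and the standardized statistic follows. The weight symbol was lost in the transcript and is written here as `omega` (an assumption), applied to the previous smoothed value as in the slide's recursion. The baseline mean and standard deviation are taken from the raw counts in a sliding window with a guardband; the window lengths, the `min_sigma` floor, and the function name are illustrative choices.

```python
from statistics import mean, stdev
from typing import List, Optional, Sequence


def ewma_statistics(
    counts: Sequence[float],
    omega: float = 0.6,
    baseline_len: int = 28,
    guardband: int = 2,
    min_sigma: float = 0.5,
) -> List[Optional[float]]:
    """Return (S_k - mu_k) / sigma_k per day, or None during warm-up."""
    stats: List[Optional[float]] = []
    s = float(counts[0])  # initialize the smoothed value at the first count
    for k, x in enumerate(counts):
        if k > 0:
            # omega weights the past; (1 - omega) weights today's count.
            s = omega * s + (1.0 - omega) * x
        end = k - guardband
        start = end - baseline_len
        if start < 0:
            stats.append(None)
            continue
        baseline = counts[start:end]
        mu_k = mean(baseline)
        # sigma_k is the raw-count standard deviation (a simplification;
        # the smoothed series itself varies less than the raw counts).
        sigma_k = max(stdev(baseline), min_sigma)
        stats.append((s - mu_k) / sigma_k)
    return stats
```

An alert would be issued on any day whose statistic exceeds a chosen threshold; because the smoothed value accumulates several days of modest excess, a gradual rise can cross the threshold even when no single day does.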

Example with Detection Statistic Plot (figure annotations: statistic exceeds threshold; threshold)

Example: EWMA applied to Rash Data

Effects of Data Problems (figure annotations: missed event; additional flags)

Importance of spatial data for biosurveillance –Purely temporal methods can find anomalies, IF you know which case counts to monitor Location of outbreak? Extent? –Advantages of spatial clustering Tracking progression of outbreak Identifying population at risk

Evaluating Candidate Clusters (figure: case locations marked x within the surveillance region, with a candidate cluster outlined)
The scan statistic gives a measure of: “how unlikely is the number of cases inside relative to the number outside, given the expected spatial distribution of cases” (Thus, a populous region won’t necessarily flag.)

Selecting Candidate Clusters

Searching for Spatial Clustering
–Form cylinders: bases are circles about each centroid in region A, height is time
–Calculate statistic for event count in each cylinder relative to entire region, within space & time limits
–Most significant clusters: regions whose centroids form base of cylinder with maximum statistic
–But how unusual is it? Repeat procedure with Monte Carlo runs, compare max statistic to maxima of each of these
(Figure: centroids of data collection regions, marked x, within region A)
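The search and the Monte Carlo significance step can be sketched in a few lines. This is a simplified, purely spatial version (circles only, no time dimension), assuming a Poisson log-likelihood ratio of the kind used in Kulldorff-style scan statistics and expected counts scaled to sum to the observed total. The function names, the `max_radius` limit, and the number of simulations are illustrative assumptions.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def poisson_llr(c: float, e: float, total: float) -> float:
    """Log-likelihood ratio for c cases inside a circle where e are expected."""
    if e <= 0 or c <= e:
        return 0.0  # only excesses inside the circle are of interest
    llr = c * math.log(c / e)
    if total - c > 0:
        llr += (total - c) * math.log((total - c) / (total - e))
    return llr


def max_scan(
    cases: Sequence[int], expected: Sequence[float],
    coords: Sequence[Point], max_radius: float,
) -> Tuple[float, Optional[int], Tuple[int, ...]]:
    """Return (max statistic, center index, member regions) over all circles."""
    total = sum(cases)
    best: Tuple[float, Optional[int], Tuple[int, ...]] = (0.0, None, ())
    for i in range(len(coords)):
        # Grow a circle about centroid i by adding regions nearest-first.
        order = sorted(range(len(coords)),
                       key=lambda j: math.dist(coords[i], coords[j]))
        c = e = 0.0
        members: List[int] = []
        for j in order:
            if math.dist(coords[i], coords[j]) > max_radius:
                break
            c += cases[j]
            e += expected[j]
            members.append(j)
            llr = poisson_llr(c, e, total)
            if llr > best[0]:
                best = (llr, i, tuple(members))
    return best


def scan_p_value(cases, expected, coords, max_radius, n_sim=199, seed=0):
    """Monte Carlo p-value: rank of the observed maximum among simulated maxima."""
    rng = random.Random(seed)
    observed, _, members = max_scan(cases, expected, coords, max_radius)
    total = sum(cases)
    exceed = 0
    for _ in range(n_sim):
        # Scatter the same number of cases according to the expected spread.
        sim = [0] * len(cases)
        for j in rng.choices(range(len(cases)), weights=expected, k=total):
            sim[j] += 1
        if max_scan(sim, expected, coords, max_radius)[0] >= observed:
            exceed += 1
    return observed, members, (exceed + 1) / (n_sim + 1)
```

Comparing the observed maximum against the maximum of each simulated data set, rather than against each circle separately, is what gives the significance test its control for multiple testing.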

Scan Statistic Demo

Scan Statistics: Advantages Gives monitor guidance for cluster size, location, significance Avoids preselection bias regarding cluster size or location Significance testing has control for multiple testing Can tailor problem design by data, objective: –Location (zipcode, hospital/provider site, patient/customer residence, school/store address) –Time windows used (cases, history, guardband) –Background estimation method: model, history, population, eligible customers

Surveillance Application OTC Anti-flu Sales, Dates: 15-24Apr2002 Total sales as of 25Apr: 1804 potential cluster: center at sales, 39 exp. from recent data rel. risk = 1.6 p = 0.041

Distribution of Nonsyndromic Visits 4 San Diego Hospitals

Effect of Data Discontinuities on OTC Cough/Cold Clusters
Before removing problem zips, cluster groups are dominated by zips that “turn on” after sustained periods of zero or abnormally low counts. After editing, more interesting cluster groups emerge.
(Figure axes: days vs. zip, ordered S to N)

School Nurse Data: All Visits unreported

Cluster Investigation by Record Inspection Records Corresponding to a Respiratory Cluster

Backups

Cumulative Summation Approach (CUSUM)
Widely adapted to disease surveillance
Devised for prompt detection of small shifts
Look for changes of 2k standard deviations from the mean $\mu$ (often k = 0.5)
Take normalized deviation: often $Z_j = (x_j - \mu) / \sigma$
Compare lower, upper sums to threshold h:
$S_{H,j} = \max(0, (Z_j - k) + S_{H,j-1})$
$S_{L,j} = \max(0, (-Z_j - k) + S_{L,j-1})$
Phase I sets $\mu$, $\sigma$, h, k
Upper sum: keep adding differences between today’s normalized count and k std deviations above mean. Alert when the sum exceeds threshold h.
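The upper and lower sums translate directly into a short loop. The sketch below assumes a fixed mean and standard deviation (`mu`, `sigma`) supplied from a Phase I baseline and a caller-chosen threshold `h`; resetting both sums to zero after an alert is one possible reset criterion added here as an assumption, not something the slide specifies.

```python
from typing import List, Sequence


def cusum_alerts(
    counts: Sequence[float],
    mu: float,
    sigma: float,
    h: float,
    k: float = 0.5,
) -> List[bool]:
    """Two-sided CUSUM on normalized counts; True where a sum exceeds h."""
    s_hi = 0.0  # accumulates deviations more than k above the mean
    s_lo = 0.0  # accumulates deviations more than k below the mean
    alerts = []
    for x in counts:
        z = (x - mu) / sigma
        s_hi = max(0.0, s_hi + (z - k))
        s_lo = max(0.0, s_lo + (-z - k))
        alert = s_hi > h or s_lo > h
        alerts.append(alert)
        if alert:
            # Assumed reset criterion: start over after alerting so a single
            # excursion does not produce a long run of repeated alerts.
            s_hi = s_lo = 0.0
    return alerts
```

With k = 0.5, a sustained shift of one standard deviation adds 0.5 to the upper sum each day, so even a small shift crosses the threshold after enough days.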

CuSum Example: CDC EARS Methods C1-C3
Three adaptive methods chosen by National Center for Infectious Diseases after 9/1/2001 as most consistent
Look for aberrations representing increases, not decreases
Fixed mean, variance replaced by values from sliding baseline (usually 7 days)
–Baseline for C1-MILD: day -1 to day -7
–Baseline for C2-MEDIUM: day -3 to day -9
–Baseline for C3-ULTRA: day -3 to day -9
(Figure: timeline from Day-9 to Day 0, the current count, showing each baseline window)

Calculation for C1-C3:
Individual day statistic for day j with lag n:
$S_{j,n} = \max\{0, (\mathrm{Count}_j - [\mu_n + \sigma_n]) / \sigma_n\}$,
where $\mu_n$ is the 7-day average with n-day lag (so $\mu_3$ is the mean of counts in [j-9, j-3]), and $\sigma_n$ is the standard deviation of the same 7-day window
C1 statistic for day k is $S_{k,1}$ (no lag)
C2 statistic for day k is $S_{k,3}$ (2-day lag)
C3 statistic for day k is $S_{k,3} + S_{k-1,3} + S_{k-2,3}$, where $S_{k-1,3}$ and $S_{k-2,3}$ are added if they do not exceed the threshold
Upper bound threshold of 2: equivalent to 3 standard deviations above mean, since $S_{j,n} > 2$ exactly when $\mathrm{Count}_j > \mu_n + 3\sigma_n$
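The C1-C3 formulas above can be sketched as follows. This is an illustration of the slide's formulas, not the EARS software itself: the use of the sample standard deviation, the `min_sigma` floor for a constant baseline, and the function names are assumptions.

```python
from statistics import mean, stdev
from typing import Optional, Sequence, Tuple


def day_statistic(
    counts: Sequence[float], j: int, lag: int,
    window: int = 7, min_sigma: float = 0.5,
) -> Optional[float]:
    """S_{j,n}: baseline is the `window` days ending `lag` days before day j."""
    hi = j - lag + 1          # exclusive end, so the last baseline day is j-lag
    lo = hi - window
    if j < 0 or lo < 0:
        return None           # not enough history
    baseline = counts[lo:hi]
    mu = mean(baseline)
    # Sample standard deviation, floored to avoid division by zero (assumed).
    sigma = max(stdev(baseline), min_sigma)
    return max(0.0, (counts[j] - (mu + sigma)) / sigma)


def ears_c1_c2_c3(
    counts: Sequence[float], k: int, threshold: float = 2.0,
) -> Tuple[Optional[float], Optional[float], Optional[float]]:
    """Return (C1, C2, C3) for day k; None where history is too short."""
    c1 = day_statistic(counts, k, lag=1)   # baseline days k-7 .. k-1
    c2 = day_statistic(counts, k, lag=3)   # baseline days k-9 .. k-3
    c3 = None
    if c2 is not None:
        c3 = c2
        for d in (k - 1, k - 2):
            prev = day_statistic(counts, d, lag=3)
            # Earlier days contribute only if they did not already alert.
            if prev is not None and prev <= threshold:
                c3 += prev
    return c1, c2, c3
```

A count exactly three baseline standard deviations above the baseline mean yields a statistic of 2, which is why a threshold of 2 on this scale corresponds to the familiar 3-sigma rule.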

Detailed Example, I Fewer alerts AND more sensitive: why?

Detailed Example, II Signal Detected only with 28-day baseline

Detailed Example, III “the rest of the story”