What’s Strange About Recent Events (WSARE)




1 What’s Strange About Recent Events (WSARE)
Weng-Keen Wong (Carnegie Mellon University) Andrew Moore (Carnegie Mellon University) Gregory Cooper (University of Pittsburgh) Michael Wagner (University of Pittsburgh) DIMACS Tutorial on Statistical and Other Analytic Health Surveillance Methods

2 Motivation Suppose we have access to Emergency Department data from hospitals around a city (with patient confidentiality preserved):

Primary Key | Date   | Time  | Hospital | ICD9 | Prodrome    | Gender | Age | Home Location | Work Location | Many more…
100         | 6/1/03 | 9:12  | 1        | 781  | Fever       | M      | 20s | NE            | ?             |
101         |        | 10:45 |          | 787  | Diarrhea    | F      | 40s |               |               |
102         |        | 11:03 |          | 786  | Respiratory |        | 60s | N             |               |
103         |        | 11:07 | 2        |      |             |        |     | E             |               |
104         |        | 12:15 |          | 717  |             |        |     |               |               |
105         |        | 13:01 | 3        | 780  | Viral       |        | 50s | NW            |               |
106         |        | 13:05 |          | 487  |             |        |     | SW            |               |
107         |        | 13:57 |          |      | Unmapped    |        |     | SE            |               |
108         |        | 14:22 |          |      |             |        |     |               |               |

3 The Problem From this data, can we detect if a disease outbreak is happening?

4 The Problem From this data, can we detect if a disease outbreak is happening? (We are talking about non-specific disease detection.)

5 The Problem From this data, can we detect if a disease outbreak is happening? How early can we detect it?

6 The Problem From this data, can we detect if a disease outbreak is happening? How early can we detect it? The question we’re really asking: In the last n hours, has anything strange happened?

7 Traditional Approaches
What about using traditional anomaly detection? Such methods typically assume the data is generated by a model and find individual data points that have low probability with respect to this model. These outliers have rare attributes or combinations of attributes. But we need to identify anomalous patterns, not isolated data points.

8 Traditional Approaches
What about monitoring aggregate daily counts of certain attributes? We have now turned multivariate data into univariate data, and many algorithms have been developed for monitoring univariate data: time series algorithms, regression techniques, and Statistical Quality Control methods. But we need to know a priori which attributes to form daily aggregates for! A nasty disease might not show up in the aggregates, yet might show up in a specific subgroup such as a particular region × gender combination.

9 Traditional Approaches
What if we don’t know what attributes to monitor?

10 Traditional Approaches
What if we don’t know what attributes to monitor? What if we want to exploit the spatial, temporal and/or demographic characteristics of the epidemic to detect the outbreak as early as possible?

11 Traditional Approaches
We need to build a univariate detector to monitor each interesting combination of attributes:
- Diarrhea cases among children
- Number of cases involving people working in the southern part of the city
- Respiratory syndrome cases among females
- Number of cases involving teenage girls living in the western part of the city
- Viral syndrome cases involving senior citizens from the eastern part of the city
- Botulinic syndrome cases
- Number of children from the downtown hospital
- And so on…

12 Traditional Approaches
We need to build a univariate detector to monitor each interesting combination of attributes:
- Diarrhea cases among children
- Number of cases involving people working in the southern part of the city
- Respiratory syndrome cases among females
- Number of cases involving teenage girls living in the western part of the city
- Viral syndrome cases involving senior citizens from the eastern part of the city
- Botulinic syndrome cases
- Number of children from the downtown hospital
- And so on…
You'll need hundreds of univariate detectors! We would like to identify the groups with the strangest behavior in recent events.

13 Our Approach
We use Rule-Based Anomaly Pattern Detection: association rules are used to characterize anomalous patterns. For example, a two-component rule would be: Gender = Male AND 40 ≤ Age < 50.
Related work: market basket analysis [Agrawal et al., Brin et al.], contrast sets [Bay and Pazzani], the Spatial Scan Statistic [Kulldorff], and Association Rules and Data Mining in Hospital Infection Control and Public Health Surveillance [Brossette et al.].
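As a concrete illustration (not the authors' code), such a rule is just a conjunctive predicate over a record's attributes. A minimal sketch, with hypothetical field names:

```python
# Illustrative sketch only: a WSARE-style association rule as a predicate
# over a record (a dict of attributes). Field names are hypothetical.
def rule_matches(record):
    """The slide's two-component rule: Gender = Male AND 40 <= Age < 50."""
    return record["gender"] == "M" and 40 <= record["age"] < 50

# Count how many recent records match the rule
recent = [{"gender": "M", "age": 45}, {"gender": "F", "age": 44},
          {"gender": "M", "age": 62}]
count_recent = sum(rule_matches(r) for r in recent)
```

The rule search then considers all one- and two-component rules of this form and scores each one against the baseline.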

14 WSARE v2.0 Inputs:
1. Multivariate date/time-indexed biosurveillance-relevant data stream ("Emergency Department Data"; the primary key is ignored)
2. Time window length ("Last 24 hours")
3. Which attributes to use
[Example table of Emergency Department records, as on the Motivation slide]

15 WSARE v2.0
Inputs:
1. Multivariate date/time-indexed biosurveillance-relevant data stream
2. Time window length
3. Which attributes to use
Outputs:
1. Here are the records that most surprise me
2. Here's why
3. And here's how seriously you should take it

16 WSARE v2.0 Overview
1. Obtain Recent and Baseline datasets from all the data
2. Search for the rule with the best score
3. Determine the p-value of the best-scoring rule through a randomization test
4. If the p-value is less than a threshold, signal an alert

17 Step 1: Obtain Recent and Baseline Data
Recent: data from the last 24 hours. Baseline: data assumed to capture non-outbreak behavior; we use data from 35, 42, 49 and 56 days prior to the current day.

18 Step 2: Search for the Best Scoring Rule
For each rule, form a 2x2 contingency table, e.g. for the rule "Age Decile = 3":

                  Count_Recent   Count_Baseline
Age Decile = 3         48              45
Age Decile ≠ 3         86             220

Perform Fisher's Exact Test to get a p-value for each rule; call this p-value the "score". Take the rule with the lowest score and call it R_BEST. This score is not the true p-value of R_BEST, because we perform multiple hypothesis tests each day to find the rule with the best score.
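The scoring step can be sketched with the slide's counts. WSARE uses Fisher's exact test; this stdlib-only sketch computes the one-sided hypergeometric tail (in practice you would use a two-sided test such as scipy.stats.fisher_exact):

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value: probability of a count >= a in the
    top-left cell, given the fixed table margins (hypergeometric tail)."""
    n = a + b + c + d
    row1 = a + b          # records matching the rule
    col1 = a + c          # records in the recent set
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)
    return p

# Slide's example, rule "Age Decile = 3":
#                 Count_Recent  Count_Baseline
# Age Decile = 3       48             45
# Age Decile != 3      86            220
score = fisher_one_sided(48, 45, 86, 220)
```

Here 48 of the 93 rule-matching records are recent, against an overall recent fraction of 134/399, so the score (p-value) comes out very small.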

19 The Multiple Hypothesis Testing Problem
Suppose we reject the null hypothesis when the score < α, where α = 0.05. For a single hypothesis test, the probability of making a false discovery is α. But suppose we do 1000 tests, one for each possible rule. The probability of at least one false discovery could then be as bad as 1 − (1 − 0.05)^1000 >> 0.05.

20 Step 3: Randomization Test
Take the recent cases and the baseline cases. Shuffle the date field to produce a randomized dataset called DB_Rand. Find the rule with the best score on DB_Rand.
[Diagram: the date column of the combined recent and baseline records, before and after shuffling]

21 Step 3: Randomization Test
Repeat the procedure on the previous slide for 1000 iterations, and determine how many scores from the 1000 iterations are better than the original score. The estimated p-value of the rule is: (# better scores) / (# iterations). If the original score placed in, say, the top 1% of the 1000 scores from the randomization test, we would be impressed and an alert should be raised.
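The randomization test can be sketched as follows. The toy scoring function below stands in for the full best-rule search that WSARE re-runs on each shuffled dataset; lower scores are better, as with p-values:

```python
import random

def randomization_pvalue(recent, baseline, score_fn, iters=1000, seed=0):
    """Shuffle which records count as 'recent' vs 'baseline' (equivalent to
    shuffling the date field) and count how often the shuffled data yields
    a score at least as good (as low) as the original."""
    rng = random.Random(seed)
    pooled = list(recent) + list(baseline)
    original = score_fn(recent, baseline)
    better = 0
    for _ in range(iters):
        rng.shuffle(pooled)
        if score_fn(pooled[:len(recent)], pooled[len(recent):]) <= original:
            better += 1
    return better / iters   # estimated p-value: # better scores / # iterations

# Toy stand-in score: more 'fever' records in the recent set => lower score
def toy_score(recent, baseline):
    return -sum(1 for r in recent if r == "fever")

recent = ["fever"] * 8 + ["other"] * 2
baseline = ["fever"] * 10 + ["other"] * 30
p_value = randomization_pvalue(recent, baseline, toy_score)
```

Because 80% of the recent records are fever cases against 25% of the baseline, almost no shuffles match the original score, so the estimated p-value is small.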

22 Two Kinds of Analysis
Day-by-day analysis: if we want to run WSARE just for the current day, then we end here.
Historical analysis: if we want to review all previous days and their p-values for several years, and control for some percentage of false positives, then we once again run into overfitting problems. We need to compensate for multiple hypothesis testing, because we perform a hypothesis test on each day in the history.

23 False Discovery Rate [Benjamini and Hochberg]
We only need to do this for historical analysis! FDR can determine which of these p-values are significant. Specifically, given an α_FDR, the procedure guarantees that the expected proportion of false discoveries among the signalled alerts is at most α_FDR. It produces a threshold below which any p-values in the history are considered significant. (WSARE 2.0 results will be shown after WSARE 3.0 is explained.)
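A minimal sketch of the Benjamini-Hochberg step-up procedure; the per-day p-values here are invented for illustration:

```python
def bh_threshold(pvalues, alpha_fdr):
    """Benjamini-Hochberg step-up: find the largest p(k) (sorted ascending)
    with p(k) <= (k/m) * alpha_fdr; p-values at or below it are significant."""
    m = len(pvalues)
    threshold = 0.0
    for k, p in enumerate(sorted(pvalues), start=1):
        if p <= k / m * alpha_fdr:
            threshold = p
    return threshold

# Hypothetical per-day p-values from a historical WSARE run
history = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.2, 0.5, 0.9, 1.0]
threshold = bh_threshold(history, alpha_fdr=0.05)
significant = [p for p in history if p <= threshold]
```

Note that BH can admit p-values that an individual 0.005-style Bonferroni cutoff would reject, which is why it is the preferred correction for long histories.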

24 WSARE v3.0

25 WSARE v2.0 Review
1. Obtain Recent and Baseline datasets from all the data
2. Search for the rule with the best score
3. Determine the p-value of the best-scoring rule through a randomization test
4. If the p-value is less than a threshold, signal an alert

26 Obtaining the Baseline
Recall that the baseline was assumed to be captured by data that was from 35, 42, 49, and 56 days prior to the current day.

27 Obtaining the Baseline
Recall that the baseline was assumed to be captured by data that was from 35, 42, 49, and 56 days prior to the current day. What if this assumption isn’t true? What if data from 7, 14, 21 and 28 days prior is better? We would like to determine the baseline automatically!

28 Temporal Trends
But health care data has many different trends, due to seasonal effects in temperature and weather, day-of-week effects, holidays, etc. Allowing the baseline to be affected by these trends may dramatically alter the detection time and false positive rate of the detection algorithm.

29 Temporal Trends From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences.

30 WSARE v3.0 Generate the baseline…
"Taking into account recent flu levels…" "Taking into account that today is a public holiday…" "Taking into account that this is Spring…" "Taking into account the recent heatwave…" "Taking into account that there's a known natural food-borne outbreak in progress…"
Bonus: more efficient use of historical data.

31 Conditioning on observed environment: well understood for univariate time series
[Plot: a signal over time]
Example signals: number of ED visits today, number of ED visits this hour, number of respiratory cases today, school absenteeism today, Nyquil sales today.

32 An easy case
[Plot: a signal over time, with its mean and an upper safe range]
Dealt with by Statistical Quality Control: record the mean and standard deviation up to the current time, and signal an alarm if we go outside 3 sigmas.
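This control-chart rule is easy to sketch; the daily counts below are synthetic, for illustration only:

```python
from statistics import mean, stdev

def three_sigma_alarm(history, today):
    """Signal an alarm if today's count falls outside mean +/- 3 standard
    deviations of the historical counts."""
    mu, sigma = mean(history), stdev(history)
    return abs(today - mu) > 3 * sigma

daily_counts = [40, 42, 38, 41, 39, 43, 40, 37, 42, 41]  # past ED visit counts
alarm = three_sigma_alarm(daily_counts, 58)  # well above 3 sigmas -> alarm
```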

33 Conditioning on Seasonal Effects
[Plot: a seasonally varying signal over time]

34 Conditioning on Seasonal Effects
[Plot: a seasonally varying signal over time]
Fit a periodic function (e.g. a sine wave) to previous data. Predict today's signal and 3-sigma confidence intervals, and signal an alarm if we're off. This reduces false alarms from natural outbreaks; different times of year deserve different thresholds.

35 Example [Tsui et al.]: weekly counts of P&I from week 1/98 to week 48/00
From: "Value of ICD‑9–Coded Chief Complaints for Detection of Epidemics", Fu-Chiang Tsui, Michael M. Wagner, Virginia Dato, Chung-Chou Ho Chang, AMIA 2000.

36 Seasonal Effects with Long-Term Trend
Weekly counts of IS from week 1/98 to 48/00. From: “Value of ICD‑9–Coded Chief Complaints for Detection of Epidemics”, Fu-Chiang Tsui, Michael M. Wagner, Virginia Dato, Chung-Chou Ho Chang, AMIA 2000

37 Seasonal Effects with Long-Term Trend
Called the Serfling Method [Serfling, 1963]. Weekly counts of IS from week 1/98 to 48/00. Fit a periodic function (e.g. a sine wave) plus a linear trend: E[Signal] = a + bt + c·sin(d + 2πt/365). Good if there's a long-term trend in the disease or the population. From: "Value of ICD‑9–Coded Chief Complaints for Detection of Epidemics", Fu-Chiang Tsui, Michael M. Wagner, Virginia Dato, Chung-Chou Ho Chang, AMIA 2000.
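The Serfling fit can be sketched with ordinary least squares. Using the identity c·sin(d + x) = c₁·sin(x) + c₂·cos(x), the model becomes linear in its coefficients, so a stdlib-only normal-equations solver suffices (the data below is noiseless and synthetic, just to show the fit recovering known coefficients):

```python
import math

def serfling_fit(t, y, period=365.0):
    """Least-squares fit of E[y] = a + b*t + c1*sin(2*pi*t/P) + c2*cos(2*pi*t/P),
    equivalent to the phase form c*sin(d + 2*pi*t/P)."""
    X = [[1.0, ti, math.sin(2 * math.pi * ti / period),
          math.cos(2 * math.pi * ti / period)] for ti in t]
    n, k = len(X), 4
    # Normal equations (X^T X) beta = X^T y, solved by Gaussian elimination
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         for r in range(k)]
    g = [sum(X[i][r] * y[i] for i in range(n)) for r in range(k)]
    for col in range(k):                         # elimination w/ partial pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        g[col], g[piv] = g[piv], g[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            g[r] -= f * g[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):               # back substitution
        beta[r] = (g[r] - sum(A[r][c] * beta[c]
                              for c in range(r + 1, k))) / A[r][r]
    return beta  # [a, b, c1, c2]

# Recover known coefficients from noiseless synthetic daily data
t = list(range(730))
y = [100 + 0.05 * ti + 20 * math.sin(2 * math.pi * ti / 365.0) for ti in t]
a, b, c1, c2 = serfling_fit(t, y)
```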

38 Day-of-week effects From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences.

39 Day-of-week effects: another simple form of ANOVA
Fit a day-of-week component: E[Signal] = a + δ_day. E.g. δ_mon = +5.42, δ_tue = +2.20, δ_wed = +3.33, δ_thu = +3.10, δ_fri = +4.02, δ_sat = −12.2, δ_sun =
From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences.
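A minimal version of this day-of-week fit estimates the overall mean and a per-day offset; the counts below are toy values, not the paper's data:

```python
from collections import defaultdict
from statistics import mean

def day_of_week_fit(records):
    """records: list of (weekday, count). Returns (a, delta) such that
    E[count] ~ a + delta[weekday], where a is the overall mean."""
    a = mean(c for _, c in records)
    by_day = defaultdict(list)
    for day, count in records:
        by_day[day].append(count)
    delta = {day: mean(counts) - a for day, counts in by_day.items()}
    return a, delta

# Toy data: weekdays busier than Saturdays
records = [("Mon", 50), ("Mon", 50), ("Wed", 40), ("Wed", 40),
           ("Sat", 30), ("Sat", 30)]
a, delta = day_of_week_fit(records)
```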

40 Analysis of Variance (ANOVA)
Good news: if you're tracking a daily aggregate (univariate data), then ANOVA can take care of many of these effects. But what if you're tracking a whole joint distribution of events?

41 Idea: Bayesian Networks
A Bayesian network is a graphical model representing the joint probability distribution of a set of random variables. It can capture statements such as: "Patients from West Park Hospital are less likely to be young"; "On cold Tuesday mornings, the folks coming in from the north part of the city are more likely to have respiratory problems"; "On the day after a major holiday, expect a boost in the morning followed by a lull in the afternoon"; "The Viral prodrome is more likely to co-occur with a Rash prodrome than Botulinic".

42 WSARE Overview
1. Obtain Recent and Baseline datasets from all the data
2. Search for the rule with the best score
3. Determine the p-value of the best-scoring rule through a randomization test
4. If the p-value is less than a threshold, signal an alert

43 Obtaining Baseline Data
1. Learn a Bayesian network from all historical data
2. Generate the baseline given today's environment

44 Obtaining Baseline Data
1. Learn a Bayesian network from all historical data
2. Generate the baseline given today's environment
The baseline represents what should be happening today given today's environment.

45 Step 1: Learning the Bayes Net Structure
This involves searching over DAGs for the structure that maximizes a scoring function. The most common algorithm is hill climbing: start from an initial structure and repeatedly apply one of 3 possible operations: add an arc, delete an arc, or reverse an arc.

46 Step 1: Learning the Bayes Net Structure
This involves searching over DAGs for the structure that maximizes a scoring function. The most common algorithm is hill climbing, with 3 possible operations: add an arc, delete an arc, or reverse an arc. But hill climbing is too slow, and single-link modifications may not find the correct structure (Xiang, Wong and Cercone 1997). We use Optimal Reinsertion (Moore and Wong 2002).

47 Optimal Reinsertion
1. Select a target node T in the current graph
2. Remove all arcs connected to T

48 Optimal Reinsertion
3. Efficiently find new in/out arcs for T
4. Choose the best new way to connect T

49 The Outer Loop
Until no change in the current DAG:
  Generate a random ordering of the nodes
  For each node in the ordering, do Optimal Reinsertion

50 The Outer Loop
For NumJolts:
  Begin with a randomly corrupted version of the best DAG so far
  Until no change in the current DAG:
    Generate a random ordering of the nodes
    For each node in the ordering, do Optimal Reinsertion

51 The Outer Loop
For NumJolts:
  Begin with a randomly corrupted version of the best DAG so far
  Until no change in the current DAG:
    Generate a random ordering of the nodes
    For each node in the ordering, do Optimal Reinsertion
Finish with conventional hill climbing, without the maxParams restriction.

52 How is Optimal Reinsertion done efficiently?
Scoring functions can be decomposed into a sum of per-node terms, so only the terms involving T need to be recomputed. Efficiency tricks:
- Create an efficient cache of NodeScore(PS → T) values using ADSearch [Moore and Schneider 2002]
- Restrict the candidate parent sets PS → T to those whose CPTs have maxParams or fewer parameters
- Additional branch and bound is used to restrict the search space by a further order of magnitude

53 Environmental Attributes
Divide the data into two types of attributes:
Environmental attributes: attributes that cause trends in the data, e.g. day of week, season, weather, flu levels.
Response attributes: all other (non-environmental) attributes.

54 Environmental Attributes
When learning the Bayesian network structure, do not allow environmental attributes (Season, Day of Week, Weather, Flu Level) to have parents. Why? We are not interested in predicting their distributions; instead, we use them to predict the distributions of the response attributes. Side benefit: we can speed up the structure search by avoiding DAGs that assign parents to the environmental attributes.

55 Step 2: Generate Baseline Given Today's Environment
Suppose we know the following for today: Season = Winter, Day of Week = Monday, Weather = Snow, Flu Level = High. We fill in these values for the environmental attributes in the learned Bayesian network, then sample records from the network and make this data set the baseline.

56 Step 2: Generate Baseline Given Today's Environment
Suppose we know the following for today: Season = Winter, Day of Week = Monday, Weather = Snow, Flu Level = High. We fill in these values for the environmental attributes in the learned Bayesian network, then sample records from the network and make this data set the baseline. Sampling is easy because the environmental attributes are at the top of the Bayes net.
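Generating the baseline by ancestral sampling with the environmental attributes clamped can be sketched on a toy, hand-specified network; the structure and probabilities here are invented for illustration, not the learned WSARE network:

```python
import random

def sample_record(env, rng):
    """Ancestral sampling: the environmental attributes are clamped to
    today's values, then each response attribute is drawn from a
    conditional distribution given its parents."""
    rec = dict(env)
    # Toy CPT: respiratory prodrome is more likely in winter with high flu
    p_resp = 0.30 if (rec["season"] == "winter"
                      and rec["flu_level"] == "high") else 0.10
    rec["prodrome"] = "respiratory" if rng.random() < p_resp else "other"
    return rec

rng = random.Random(1)
today_env = {"season": "winter", "day_of_week": "Mon", "flu_level": "high"}
baseline = [sample_record(today_env, rng) for _ in range(2000)]
```

Because the environmental attributes sit at the top of the network, no inference is needed: clamping them and sampling downward already yields records distributed according to today's environment.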

57 Why not use inference? With sampling, we create the baseline data and then use it to obtain the p-value of the rule for the randomization test. If we used inference, we would not be able to perform the same randomization test and would need some other way to correct for the multiple hypothesis testing. Sampling was chosen for its simplicity.

58 Why not use inference? With sampling, we create the baseline data and then use it to obtain the p-value of the rule for the randomization test. If we used inference, we would not be able to perform the same randomization test and would need some other way to correct for the multiple hypothesis testing. Sampling was chosen for its simplicity. But there may be clever things to do with inference which may help us; file this under future work.

59 Simulation
City with 9 regions and a different population in each region:

NW 100   N 400   NE 500
W        C 200   E 300
SW       S       SE 600

For each day, sample the city's environment from a Bayesian network over Date, Season, Day of Week, Weather, Flu Level, Region Anthrax Concentration and Region Food Condition, with Previous Weather, Previous Flu Level, Previous Region Anthrax Concentration and Previous Region Food Condition as parents. The anthrax concentration remains high in the affected region for a random length of time.

60 Simulation
For each person in a region, sample their profile from a Bayesian network. Capital letters denote visible attributes: DATE, DAY OF WEEK, SEASON, WEATHER, FLU LEVEL, REGION, AGE, GENDER, REPORTED SYMPTOM, ACTION, DRUG. Hidden nodes include Region Anthrax Concentration, Region Grassiness, Region Food Condition, Outside Activity, Immune System, Heart Health, the disease indicators (Has Anthrax, Has Flu, Has Sunburn, Has Cold, Has Allergy, Has Heart Attack, Has Food Poisoning), Disease, and Actual Symptom.

61 Visible Environmental Attributes
[The same simulation Bayesian network, with the visible environmental attributes highlighted]

62 Simulation
Diseases: allergy, cold, sunburn, flu, food poisoning, heart problems, anthrax (in order of precedence).
[The same simulation Bayesian network diagram]

63 Simulation
Actions: None, Purchase Medication, ED visit, Absent. If the Action is not None, output the record to the dataset.
[The same simulation Bayesian network diagram]

64 Simulation Plot

65 Simulation Plot: the anthrax release is marked, and it is not the highest peak.

66 Simulation
100 different data sets. Each data set consisted of a two-year period, with the anthrax release occurring at a random point during the second year. Algorithms were allowed to train on data from the current day back to the first day in the simulation. Any alerts before the actual anthrax release are considered false positives. Detection time is calculated as the time to the first alert after the anthrax release; if no alerts are raised, detection time is capped at 14 days.
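The evaluation metrics described above can be computed per run as follows; the day indices in the usage lines are illustrative:

```python
def evaluate_run(alert_days, release_day, cap_days=14):
    """Count alerts before the release as false positives; detection time is
    days from release to the first subsequent alert, capped if never detected."""
    false_positives = sum(1 for d in alert_days if d < release_day)
    post_release = [d for d in sorted(alert_days) if d >= release_day]
    detection_time = post_release[0] - release_day if post_release else cap_days
    return false_positives, min(detection_time, cap_days)

# One early false alarm, then detection 3 days after a day-400 release
fp, dt = evaluate_run([100, 403], release_day=400)
```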

67 Other Algorithms used in Simulation
1. Standard algorithm: univariate Statistical Quality Control on the aggregate signal (mean and upper safe range over time)
2. WSARE 2.0
3. WSARE 2.5: use all past data, but condition on environmental attributes

68 Results on Simulation
A 24-hour delay is added to simulate the real-world release of data.

69 Conclusion
One approach to biosurveillance: one algorithm monitoring millions of signals derived from multivariate data, instead of hundreds of univariate detectors. WSARE is best used as a general-purpose safety net in combination with other detectors. Modeling historical data with Bayesian networks allows conditioning on unique features of today. It is computationally intense unless we use clever algorithms.

70 Conclusion
WSARE 2.0 was deployed during the past year, and WSARE 3.0 is about to go online. WSARE is now being extended to additionally exploit over-the-counter medicine sales.

71 For more information
References:
Wong, W. K., Moore, A. W., Cooper, G., and Wagner, M. (2002). Rule-based Anomaly Pattern Detection for Detecting Disease Outbreaks. Proceedings of AAAI-02. MIT Press.
Wong, W. K., Moore, A. W., Cooper, G., and Wagner, M. (2003). Bayesian Network Anomaly Pattern Detection for Disease Outbreaks. Proceedings of ICML 2003.
Moore, A., and Wong, W. K. (2003). Optimal Reinsertion: A New Search Operator for Accelerated and More Accurate Bayesian Network Structure Learning. Proceedings of ICML 2003.
AUTON lab website:




