Summarization and Deviation Detection -- What is new?


1 Summarization and Deviation Detection -- What is new?

2 Outline
- Summarization
- KEFIR – Key Findings Reporter
- WSARE – What is Strange About Recent Events

3 What is New?
[Figure: old data vs. new data]

4 Summarization
- Concisely summarize what is new, different, or unexpected
  - with respect to previous values
  - with respect to expected values
  - …
- Focus on what is actionable!

5 Problem: Healthcare Costs
- Healthcare costs in the US: 1 out of every 7 GDP dollars, and rising
  - potential problems: fraud, misuse, …
  - understanding where the problems are is the first step to fixing them
- GTE is self-insured for medical costs
  - GTE healthcare costs: $X00,000,000
- Task: analyze employee healthcare data and generate a report that describes the major problems

6 GTE Key Findings Reporter: KEFIR
KEFIR approach:
- Analyze all possible deviations
- Select interesting findings
- Augment key findings with:
  - explanations of plausible causes
  - recommendations of appropriate actions
- Convert findings to a user-friendly report with text and graphics

7 KEFIR Search Space

8 Drill-Down Example

9 What Change Is Important?

10 Deviation Detection
- Drill down through the search space
- Generate a finding for each measure:
  - deviation from the previous period
  - deviation from the norm
  - deviation projected for the next period, if no action is taken

11 Interestingness of Deviations
- Impact: how much the deviation affects the bottom line
- Savings Percentage: how much of the deviation from the norm can be expected to be saved by the action
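A minimal sketch of how these two quantities might be combined to rank findings. It assumes, which the slide does not state, that interestingness is estimated as impact multiplied by savings percentage. The `Finding` class, its field names, and the dollar figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    measure: str
    impact: float              # dollars by which the deviation affects the bottom line
    savings_percentage: float  # fraction of the deviation an action is expected to save

    def interestingness(self) -> float:
        # Assumption: estimated savings = impact * savings percentage.
        return self.impact * self.savings_percentage


def rank_findings(findings):
    """Return findings sorted from most to least interesting."""
    return sorted(findings, key=lambda f: f.interestingness(), reverse=True)


# Hypothetical example data.
findings = [
    Finding("average length of stay", impact=3_000_000, savings_percentage=0.05),
    Finding("admission rate per 1000", impact=2_000_000, savings_percentage=0.20),
]
ranked = rank_findings(findings)
```

Under this assumption a smaller deviation with a high savings percentage can outrank a larger deviation that is hard to act on, which matches the emphasis on actionable findings.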

12 Recommendations
Hierarchical recommendation rules define appropriate intervention strategies for important measures and study areas.
Example:
If measure = admission rate per 1000 & study_area = Inpatient admissions & percent_change > 0.10
Then Utilization review is needed in the area of admission certification. Expected Savings: 20%
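The example rule can be sketched as data plus a small matcher. This is an illustrative encoding only: the dictionary layout and the `recommend` function are hypothetical, and the hierarchical aspect of the rules is not modeled.

```python
from typing import Optional

# Hypothetical encoding of the example rule from the slide.
RULES = [
    {
        "measure": "admission rate per 1000",
        "study_area": "Inpatient admissions",
        "min_percent_change": 0.10,
        "recommendation": "Utilization review is needed in the area of admission certification.",
        "expected_savings": 0.20,
    },
]


def recommend(measure: str, study_area: str, percent_change: float) -> Optional[dict]:
    """Return the first rule whose conditions all hold, or None if no rule fires."""
    for rule in RULES:
        if (
            rule["measure"] == measure
            and rule["study_area"] == study_area
            and percent_change > rule["min_percent_change"]
        ):
            return rule
    return None


fired = recommend("admission rate per 1000", "Inpatient admissions", 0.15)
not_fired = recommend("admission rate per 1000", "Inpatient admissions", 0.05)
```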

13 Explanation
A measure is explained by finding the path of related measures with the highest impact.
Example: The large increase in m1 in group s1 was caused by an increase in m3, which was caused by a rise in m5, primarily in sector s13.

14 Report Generation
- Automatic generation of business-user-oriented reports
- Natural language generation with template matching
- Graphics
- Delivered via browser

16 Sample KEFIR pages: Overview; Inpatient admissions

17 Status
- Prototype implemented at GTE in 1995
- KEFIR received GTE’s highest award for technical achievement in 1995
- Key business user left GTE in 1996 and the system was no longer used
- Publication: Selecting and Reporting What is Interesting: The KEFIR Application to Healthcare Data, C. Matheus, G. Piatetsky-Shapiro, and D. McNeill, in Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996

18 What’s Strange About Recent Events (WSARE)
Weng-Keen Wong (Carnegie Mellon University)
Andrew Moore (Carnegie Mellon University)
Gregory Cooper (University of Pittsburgh)
Michael Wagner (University of Pittsburgh)
http://www.autonlab.org/wsare
Designed to be easily applicable to any date/time-indexed biosurveillance-relevant data stream

19 Motivation
Suppose we have access to Emergency Department data from hospitals around a city (with patient confidentiality preserved):

Primary Key  Date    Time   Hospital  ICD9  Prodrome     Gender  Age  Home Location  Work Location  Many more…
100          6/1/03  9:12   1         781   Fever        M       20s  NE             ?              …
101          6/1/03  10:45  1         787   Diarrhea     F       40s  NE                            …
102          6/1/03  11:03  1         786   Respiratory  F       60s  NE             N              …
103          6/1/03  11:07  2         787   Diarrhea     M       60s  E              ?              …
104          6/1/03  12:15  1         717   Respiratory  M       60s  E              NE             …
105          6/1/03  13:01  3         780   Viral        F       50s  ?              NW             …
106          6/1/03  13:05  3         487   Respiratory  F       40s  SW                            …
107          6/1/03  13:57  2         786   Unmapped     M       50s  SE             SW             …
108          6/1/03  14:22  1         780   Viral        M       40s  ?              ?              …
:            :       :      :         :     :            :       :    :              :              :

20 Traditional Approaches
We need to build a univariate detector to monitor each interesting combination of attributes:
- Diarrhea cases among children
- Respiratory syndrome cases among females
- Viral syndrome cases involving senior citizens from the eastern part of the city
- Number of children from downtown hospital
- Number of cases involving people working in the southern part of the city
- Number of cases involving teenage girls living in the western part of the city
- Botulinic syndrome cases
- And so on…
You’ll need hundreds of univariate detectors! We would like to identify the groups with the strangest behavior in recent events.

21 WSARE Approach
- Rule-Based Anomaly Pattern Detection
- Association rules are used to characterize anomalous patterns. For example, a two-component rule would be: Gender = Male AND 40 <= Age < 50

22 WSARE v2.0 Overview
1. Obtain Recent and Baseline datasets
2. Search for the rule with the best score
3. Determine the p-value of the best-scoring rule through a randomization test
4. If the p-value is less than the threshold, signal an alert
[Diagram: All Data is split into Recent Data and Baseline]

23 Step 1: Obtain Recent and Baseline Data
- Recent Data: data from the last 24 hours
- Baseline: data from 35, 42, 49 and 56 days prior to the current day. Baseline data is assumed to capture non-outbreak behavior.
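A minimal sketch of Step 1, assuming each record is a dict with a `date` field and that "the last 24 hours" can be approximated by the current calendar day. The function names and record layout are hypothetical. Because the offsets are all multiples of 7, every baseline day falls on the same day of the week as the current day.

```python
from datetime import date, timedelta

BASELINE_OFFSETS_DAYS = (35, 42, 49, 56)  # offsets stated on the slide


def baseline_dates(current_day: date, offsets=BASELINE_OFFSETS_DAYS):
    """Dates whose records form the baseline for current_day."""
    return [current_day - timedelta(days=d) for d in offsets]


def split_recent_and_baseline(records, current_day: date):
    """Split records (dicts with a 'date' key) into recent and baseline lists."""
    wanted = set(baseline_dates(current_day))
    recent = [r for r in records if r["date"] == current_day]
    baseline = [r for r in records if r["date"] in wanted]
    return recent, baseline


# Hypothetical records: one today, one 35 days ago, one 10 days ago (ignored).
today = date(2003, 6, 1)
records = [
    {"date": today, "prodrome": "Fever"},
    {"date": today - timedelta(days=35), "prodrome": "Viral"},
    {"date": today - timedelta(days=10), "prodrome": "Diarrhea"},
]
recent, baseline = split_recent_and_baseline(records, today)
```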

24 Example
Sat 12-23-2001
35.8% (48/134) of today's cases have 30 <= age < 40
17.0% (45/265) of other (baseline) cases have 30 <= age < 40

25 Step 2: Search for Best Rule
For each rule, form a 2x2 contingency table, e.g.:

                  Count Recent  Count Baseline
Age Decile = 3    48            45
Age Decile ≠ 3    86            220

- Perform Fisher’s Exact Test to get a p-value (score) for each rule (for this data, 0.00005)
- Find the rule R_best with the lowest score
- Caution: this score is not the true p-value of R_best because of multiple tests
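A minimal pure-Python sketch of the scoring step, using a two-sided Fisher's exact test built on `math.comb`. Whether WSARE uses a one-sided or two-sided variant is not stated on the slide, so the two-sided choice here is an assumption, and the result for the example table is not guaranteed to reproduce the quoted 0.00005 exactly.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    With the margins fixed, the top-left cell follows a hypergeometric
    distribution. The p-value sums the probabilities of all tables that
    are no more likely than the observed one.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d

    def weight(x: int) -> int:
        # Unnormalized hypergeometric probability, kept as an exact integer.
        return comb(col1, x) * comb(n - col1, row1 - x)

    observed = weight(a)
    lowest = max(0, row1 + col1 - n)
    highest = min(row1, col1)
    tail = sum(weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed)
    return tail / comb(n, row1)


# Contingency table from the slide: rows are rule match / no match,
# columns are recent / baseline counts.
score = fisher_exact_two_sided(48, 45, 86, 220)
```

Keeping the weights as integers avoids floating-point ties when deciding which tables are "as or more extreme" than the observed one.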

26 Step 3: Randomization Test
- Take the recent cases and the baseline cases. Shuffle the date field to produce a randomized dataset called DB_Rand
- Find the rule with the best score on DB_Rand

Case  Original date   Shuffled date
C2    June 4, 2002    June 4, 2002
C3    June 5, 2002    June 12, 2002
C4    June 12, 2002   July 31, 2002
C5    June 19, 2002   June 26, 2002
C6    June 26, 2002   July 31, 2002
C7    June 26, 2002   June 5, 2002
C8    July 2, 2002    July 2, 2002
C9    July 3, 2002    July 3, 2002
C10   July 10, 2002   July 10, 2002
C11   July 17, 2002   July 17, 2002
C12   July 24, 2002   July 24, 2002
C13   July 30, 2002   July 30, 2002
C14   July 31, 2002   June 19, 2002
C15   July 31, 2002   June 26, 2002

27 Step 3: Randomization Test
- Repeat the procedure on the previous slide for 1000 iterations.
- Determine how many scores from the 1000 iterations are better than the original score.
- [Figure: distribution of the 1000 randomized scores, with a marker in the extreme tail] If the original score were at the marker, it would place in the top 1% of the 1000 scores from the randomization test. We would be impressed and an alert should be raised.
- Estimated p-value of the rule is: # better scores / # iterations
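The randomization test can be sketched as follows. This is a simplified illustration, not the WSARE implementation: it searches only one-component rules of the form attribute = value, it shuffles recent/baseline labels as a stand-in for shuffling the date field, and the function names and the toy data are hypothetical.

```python
import random
from math import comb


def fisher_p(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    weight = lambda x: comb(col1, x) * comb(n - col1, row1 - x)
    observed = weight(a)
    xs = range(max(0, row1 + col1 - n), min(row1, col1) + 1)
    return sum(weight(x) for x in xs if weight(x) <= observed) / comb(n, row1)


def best_rule_score(cases, is_recent):
    """Lowest Fisher score over all one-component rules (attribute = value)."""
    n_recent = sum(is_recent)
    n_base = len(cases) - n_recent
    best = 1.0
    for attr, value in {(k, v) for case in cases for k, v in case.items()}:
        a = sum(1 for case, r in zip(cases, is_recent) if r and case.get(attr) == value)
        b = sum(1 for case, r in zip(cases, is_recent) if not r and case.get(attr) == value)
        best = min(best, fisher_p(a, b, n_recent - a, n_base - b))
    return best


def randomization_p_value(cases, is_recent, iterations=1000, seed=0):
    """Estimated p-value: # better (lower) scores on shuffled data / # iterations."""
    rng = random.Random(seed)
    original = best_rule_score(cases, is_recent)
    labels = list(is_recent)
    better = 0
    for _ in range(iterations):
        rng.shuffle(labels)  # breaks any link between case attributes and recency
        if best_rule_score(cases, labels) < original:
            better += 1
    return better / iterations


# Toy data: 200 baseline cases, then 40 recent cases that all share one prodrome.
cases = [
    {"prodrome": "Fever" if i % 2 else "Respiratory", "gender": "M" if i % 3 else "F"}
    for i in range(200)
] + [{"prodrome": "Rash", "gender": "M"} for _ in range(40)]
is_recent = [False] * 200 + [True] * 40
p_value = randomization_p_value(cases, is_recent, iterations=200)
```

Because the best rule is searched again on every shuffled dataset, the estimate accounts for the multiple tests that make the raw Fisher score of the best rule too optimistic.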

28 Results on Actual ED Data from 2001
1. Sat 2001-02-13: SCORE = -0.00000004 PVALUE = 0.00000000
   14.80% ( 74/500) of today's cases have Viral Syndrome = True and Encephalitic Prodrome = False
    7.42% (742/10000) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False
2. Sat 2001-03-13: SCORE = -0.00000464 PVALUE = 0.00000000
   12.42% ( 58/467) of today's cases have Respiratory Syndrome = True
    6.53% (653/10000) of baseline have Respiratory Syndrome = True
3. Wed 2001-06-30: SCORE = -0.00000013 PVALUE = 0.00000000
    1.44% (  9/625) of today's cases have 100 <= Age < 110
    0.08% (  8/10000) of baseline have 100 <= Age < 110
4. Sun 2001-08-08: SCORE = -0.00000007 PVALUE = 0.00000000
   83.80% (481/574) of today's cases have Unknown Syndrome = False
   74.29% (7430/10001) of baseline have Unknown Syndrome = False
5. Thu 2001-12-02: SCORE = -0.00000087 PVALUE = 0.00000000
   14.71% ( 70/476) of today's cases have Viral Syndrome = True and Encephalitic Syndrome = False
    7.89% (789/9999) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False

29 WSARE 3.0: Improving the Baseline
Recall that the baseline was assumed to be captured by data from 35, 42, 49, and 56 days prior to the current day.
- What if this assumption isn’t true?
- What if data from 7, 14, 21 and 28 days prior is better?
We would like to determine the baseline automatically!

30 Temporal Trends
From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences (pp. 5237-5249)

31 WSARE v3.0
Generate the baseline…
- “Taking into account recent flu levels…”
- “Taking into account that today is a public holiday…”
- “Taking into account that this is Spring…”
- “Taking into account recent heatwave…”
- “Taking into account that there’s a known natural food-borne outbreak in progress…”
Bonus: More efficient use of historical data

32 Idea: Bayesian Networks
Bayesian Network: A graphical model representing the joint probability distribution of a set of random variables
- “On Cold Tuesday Mornings the folks coming in from the North part of the city are more likely to have respiratory problems”
- “Patients from West Park Hospital are less likely to be young”
- “On the day after a major holiday, expect a boost in the morning followed by a lull in the afternoon”
- “The Viral prodrome is more likely to co-occur with a Rash prodrome than Botulinic”

33 Obtaining Baseline Data
1. Learn a Bayesian Network from All Historical Data
2. Generate the baseline given Today’s Environment
Baseline: what should be happening today given today’s environment

34 Simulation
[Figure: Bayesian network used to generate the simulated data. Nodes: DATE, DAY OF WEEK, SEASON, FLU LEVEL, WEATHER, REGION, AGE, GENDER, Region Grassiness, Region Anthrax Concentration, Region Food Condition, Immune System, Outside Activity, Has Anthrax, Has Flu, Has Allergy, Has Heart Attack, Has Sunburn, Has Cold, Heart Health, Has Food Poisoning, Disease, ACTION, Actual Symptom, REPORTED SYMPTOM, DRUG]
Actions: None, Purchase Medication, ED visit, Absent. If Action is not None, output record to dataset.

35 Simulation
- 100 different data sets
- Each data set consisted of a two-year period
- Anthrax release occurred at a random point during the second year
- Algorithms were allowed to train on data from the current day back to the first day in the simulation
- Any alert before the actual anthrax release is considered a false positive
- Detection time is calculated as the first alert after the anthrax release. If no alerts are raised, detection time is capped at 14 days
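The evaluation rules above can be sketched as a small scoring function for one simulated data set. It assumes days are integer indices, that an alert on the release day itself counts as a detection with time 0, and that detection times beyond 14 days are also capped. These choices and the function name are hypothetical.

```python
MAX_DETECTION_DAYS = 14  # cap stated on the slide


def evaluate_alerts(alert_days, release_day, cap=MAX_DETECTION_DAYS):
    """Return (false_positives, detection_time_days) for one simulated data set.

    alert_days: iterable of integer day indices on which an alert was raised.
    release_day: integer day index of the anthrax release.
    """
    alert_days = list(alert_days)
    # Alerts raised before the release are false positives.
    false_positives = sum(1 for day in alert_days if day < release_day)
    # Detection time is measured from the release to the first later alert.
    after_release = [day for day in alert_days if day >= release_day]
    detection = min(after_release) - release_day if after_release else cap
    return false_positives, min(detection, cap)


# Hypothetical alert logs for a release on day 515.
with_detection = evaluate_alerts([100, 400, 520], release_day=515)
no_detection = evaluate_alerts([10], release_day=515)
```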

36 Simulation Plot
[Figure annotation: anthrax release (not the highest peak)]

37 Results on Simulation

38 Summary
- Summarization of what is new and interesting
- Key ideas
  - search many possible findings
  - compare to past data and expected data
  - avoid overfitting
  - focus on actionable changes
- Example systems
  - KEFIR (GTE, 1992-1995)
  - WSARE (CMU/Pitt, 2002-3)

