2005 Carnegie Mellon University. A Bayesian Scan Statistic for Spatial Cluster Detection. Daniel B. Neill, Andrew W. Moore, Gregory F. Cooper.

Presentation transcript:

A Bayesian Scan Statistic for Spatial Cluster Detection
Daniel B. Neill (1), Andrew W. Moore (1), Gregory F. Cooper (2)
(1) Carnegie Mellon University, School of Computer Science
(2) University of Pittsburgh, Center for Biomedical Informatics
{neill,

Prospective disease surveillance
Nationwide disease surveillance data, aggregated by zip code:
- Daily counts of OTC drug sales in 18 product categories from 20,000 retail stores (NRDM, the National Retail Data Monitor: rods.health.pitt.edu).
- Daily counts of Emergency Department visits, grouped by syndrome.
Each day we want to answer the questions: what's happening, and where? Are there any emerging clusters of symptoms that are worthy of further investigation? Where are they, how large, and how serious?
Goal: automatically detect emerging disease outbreaks as quickly as possible, while keeping the number of false positives low.

Spatial cluster detection
Given: count $c_i$ and baseline $b_i$ for each zip code $s_i$.
- Count $c_i$: e.g. number of Emergency Dept. visits, or over-the-counter drug sales, of a specific type.
- Baseline $b_i$: at-risk population or expected count, inferred from historical data.
Does any spatial region $S$ have sufficiently high counts $c_i$ to be indicative of an emerging disease epidemic in that area?
Our typical assumption: counts are aggregated to a uniform grid, and we search over the set of rectangles on the grid.

The spatial scan statistic
The spatial scan statistic (Kulldorff, 1997) is a powerful method for spatial cluster detection:
- Search over a given set of spatial regions.
- Find those regions which are most likely to be clusters.
- Correctly adjust for multiple hypothesis testing.
Problems with the spatial scan:
- Difficult to incorporate prior knowledge. Size and shape of outbreak? Impact on disease rate?
- Computing statistical significance (p-values) by randomization requires searching a huge number of "replica" datasets and comparing results to the original. Computationally infeasible for massive datasets!
Here we propose a Bayesian spatial scan statistic, which allows us to incorporate prior knowledge and, since randomization testing is unnecessary, is much more efficient to compute.

The generalized spatial scan
1. Obtain data for a set of spatial locations $s_i$.
2. Choose a set of spatial regions $S$ to search.
3. Choose models of the data under the null hypothesis $H_0$ (no clusters) and the alternative hypotheses $H_1(S)$ (cluster in region $S$).
4. Derive a score function $F(S)$ based on $H_1(S)$ and $H_0$.
5. Find the most anomalous regions (i.e. those regions $S$ with highest $F(S)$).
6. Determine whether each of these potential clusters is actually an anomalous cluster.

Population-based model (Kulldorff, 1997)
Each count $c_i$ (number of cases in location $s_i$) is generated from a Poisson distribution with mean $q_i b_i$.
$b_i$ represents the at-risk population, often estimated from census data.
$q_i$ represents the disease rate.
Is there any region with disease rates significantly higher inside than outside?
[Figure: example map with $q_{in} = .02$ inside the region and $q_{out} = .01$ outside.]

The frequentist model
In both hypotheses, $c_i \sim \mathrm{Poisson}(q_i b_i)$.
Null hypothesis $H_0$ (no clusters): $q_i = q_{all}$ everywhere. Use the maximum likelihood estimate of $q_{all}$.
Alternative hypothesis $H_1(S)$ (cluster in region $S$): $q_i = q_{in}$ inside region $S$, $q_i = q_{out}$ elsewhere. Use maximum likelihood estimates of $q_{in}$ and $q_{out}$, subject to $q_{in} > q_{out}$.
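
The frequentist model above leads to a likelihood ratio score once the maximum likelihood rates are plugged in. The following is a minimal Python sketch, not the authors' code: the function name `kulldorff_log_score` and its arguments are hypothetical, and it assumes the standard Poisson likelihood ratio in which the exponential terms cancel at the maximum likelihood estimates $q = C/B$.

```python
import math


def kulldorff_log_score(c_in, b_in, c_all, b_all):
    """Log likelihood ratio log F(S) for the Poisson population-based model.

    c_in, b_in   -- total count and baseline inside region S (hypothetical names)
    c_all, b_all -- total count and baseline over the whole map
    Assumes 0 < b_in < b_all.
    """
    c_out = c_all - c_in
    b_out = b_all - b_in

    # Constraint q_in > q_out: if the rate inside is not higher,
    # the constrained maximum coincides with H0, so F(S) = 1 and log F(S) = 0.
    if c_in / b_in <= c_out / b_out:
        return 0.0

    def c_log_rate(c, b):
        # c * log(c / b), with the convention 0 * log(0) = 0
        return c * math.log(c / b) if c > 0 else 0.0

    return (c_log_rate(c_in, b_in)
            + c_log_rate(c_out, b_out)
            - c_log_rate(c_all, b_all))


# Example: a region with rate .02 inside against .01 outside.
score = kulldorff_log_score(c_in=20, b_in=1000, c_all=110, b_all=10000)
```

Working in log space avoids overflow for large counts; the region with the highest log score is also the region with the highest $F(S)$.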

The Bayesian hierarchical model
In both hypotheses, $c_i \sim \mathrm{Poisson}(q_i b_i)$.
Null hypothesis $H_0$ (no clusters): $q_i = q_{all}$ everywhere, with $q_{all} \sim \mathrm{Gamma}(\alpha_{all}, \beta_{all})$.
Alternative hypothesis $H_1(S)$ (cluster in region $S$): $q_i = q_{in}$ inside region $S$, $q_i = q_{out}$ elsewhere, with $q_{in} \sim \mathrm{Gamma}(\alpha_{in}(S), \beta_{in}(S))$ and $q_{out} \sim \mathrm{Gamma}(\alpha_{out}(S), \beta_{out}(S))$.
The top two levels of the hierarchy are the same as in the frequentist model.

Frequentist approach vs. Bayesian approach
Frequentist approach:
- Use the likelihood ratio: $F(S) = \Pr(Data \mid H_1(S)) / \Pr(Data \mid H_0)$.
- Use maximum likelihood parameter estimates.
- Calculate statistical significance by randomization testing: compute the maximum $F(S)$ for each of $R = 1000$ replica grids generated under $H_0$. p-value $= (R_{beat} + 1) / (R + 1)$, where $R_{beat}$ = number of replicas with maximum score greater than that of the original region.
Bayesian approach:
- Use the posterior probability: $\Pr(H_1(S) \mid Data) = \Pr(Data \mid H_1(S)) \Pr(H_1(S)) / \Pr(Data)$.
- Use the marginal likelihood (integrate over possible values of the parameters).
- No randomization testing necessary. Instead, normalize the posterior probabilities by computing and dividing by the total data likelihood, $\Pr(Data) = \Pr(Data \mid H_0) \Pr(H_0) + \sum_S \Pr(Data \mid H_1(S)) \Pr(H_1(S))$. This gives the probability of an outbreak in each region; sum these to get the total probability of an outbreak.
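
The randomization p-value quoted on this slide can be written out in a few lines. This is only an illustrative sketch: the function name `randomization_p_value` is hypothetical, and it assumes the maximum score of each replica grid has already been computed elsewhere.

```python
def randomization_p_value(original_score, replica_max_scores):
    """p-value = (R_beat + 1) / (R + 1) for a randomization test.

    original_score     -- score of the region found in the original data
    replica_max_scores -- maximum score found in each of the R replica grids
                          generated under H0 (hypothetical input)
    """
    r = len(replica_max_scores)
    # R_beat: number of replicas whose maximum score exceeds the original score.
    r_beat = sum(1 for s in replica_max_scores if s > original_score)
    return (r_beat + 1) / (r + 1)


# Example with R = 4 replicas, one of which beats the original score.
p = randomization_p_value(5.0, [1.0, 2.0, 6.0, 3.0])  # (1 + 1) / (4 + 1)
```

The cost of this test lies in producing `replica_max_scores`: each entry requires a full search over all regions of one replica grid.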

Computing Bayesian likelihoods
Marginal likelihood approach: integrate over possible values of the disease rate parameters ($q_{in}$, $q_{out}$, $q_{all}$), weighted by prior probability.
The conjugate prior allows a closed form solution: Gamma prior, Poisson counts → negative binomial.
Here $C = \sum_i c_i$ and $B = \sum_i b_i$.
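
The closed form itself is not preserved in this transcript. As a worked sketch of the standard Gamma-Poisson calculation, assuming the rate parameterization of the Gamma distribution (mean $\alpha/\beta$, variance $\alpha/\beta^2$) and a single rate $q$ shared by all locations in the set being integrated over:

```latex
\begin{aligned}
P(c \mid \alpha, \beta)
  &= \int_0^\infty \prod_i \frac{(q b_i)^{c_i} e^{-q b_i}}{c_i!}
     \cdot \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, q^{\alpha - 1} e^{-\beta q} \, dq \\
  &= \left( \prod_i \frac{b_i^{c_i}}{c_i!} \right)
     \frac{\beta^{\alpha}}{\Gamma(\alpha)}
     \int_0^\infty q^{\alpha + C - 1} e^{-(\beta + B) q} \, dq \\
  &= \left( \prod_i \frac{b_i^{c_i}}{c_i!} \right)
     \frac{\beta^{\alpha}\, \Gamma(\alpha + C)}{\Gamma(\alpha)\, (\beta + B)^{\alpha + C}}
\end{aligned}
```

The leading product does not depend on the rate parameters, and taken over all locations it is the same under $H_0$ and under every $H_1(S)$, so it cancels when the posterior probabilities are normalized. Under $H_1(S)$ the likelihood is the product of two such terms, one for the locations inside $S$ and one for those outside.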

Obtaining priors
Choose the prior outbreak probability $P_1$ and assume a uniform region prior: $P(H_0) = 1 - P_1$ and $P(H_1(S)) = P_1 / N_{reg}$.
Choose the parameter priors ($\alpha$ and $\beta$) by matching $\alpha/\beta$ and $\alpha/\beta^2$ to the mean and variance of the observed disease rate $q = C/B$.
Assume an outbreak increases the rate by a multiplicative factor $m$. Use a (discretized) uniform distribution for $m$: $m \sim \mathrm{Uniform}[1, 3]$.
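
The moment matching and the uniform region prior on this slide amount to a few lines of arithmetic. The sketch below is illustrative only: all function names are hypothetical, and the number of grid points used to discretize $m$ is an assumption, since the slide does not state it.

```python
def gamma_prior_from_moments(mean_rate, var_rate):
    """Solve alpha/beta = mean and alpha/beta**2 = variance for (alpha, beta)."""
    beta = mean_rate / var_rate
    alpha = mean_rate * beta
    return alpha, beta


def hypothesis_priors(p1, n_regions):
    """Return (P(H0), P(H1(S))) under a uniform prior over regions."""
    return 1.0 - p1, p1 / n_regions


def discretized_uniform(low=1.0, high=3.0, n_points=5):
    """Equally weighted grid of values for the multiplicative factor m.

    n_points = 5 is an arbitrary choice made for this example.
    """
    step = (high - low) / (n_points - 1)
    values = [low + k * step for k in range(n_points)]
    weights = [1.0 / n_points] * n_points
    return values, weights


# Example: historical rates with mean 0.01 and variance 1e-6.
alpha, beta = gamma_prior_from_moments(0.01, 1e-6)
p_h0, p_h1 = hypothesis_priors(p1=0.05, n_regions=100)
m_values, m_weights = discretized_uniform()
```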

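Computing the Bayesian statistic
1. From $\alpha_{all}, \beta_{all}$ and $C_{all}, B_{all}$, compute $\Pr(Data \mid H_0)$. Multiply by $\Pr(H_0)$ to obtain score($H_0$).
2. For all regions $S$: from $\alpha_{in}(S), \beta_{in}(S), \alpha_{out}(S), \beta_{out}(S)$ and $C_{in}(S), B_{in}(S), C_{out}(S), B_{out}(S)$, compute $\Pr(Data \mid H_1(S))$. Multiply by $\Pr(H_1(S))$ to obtain score($S$).
3. Add score($H_0$) and all score($S$) to obtain $\Pr(Data)$.
4. Divide score($H_0$) by $\Pr(Data)$ to obtain $\Pr(H_0 \mid Data)$, and divide each score($S$) by $\Pr(Data)$ to obtain $\Pr(H_1(S) \mid Data)$.
The priors $\Pr(H_0)$ and $\Pr(H_1(S))$ are determined by $P_1$.
Report all regions with probability > $P_{thresh}$, or "sound the alarm" if the total probability of an outbreak > $P_{alarm}$.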
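
The flow on this slide (likelihood, times prior, normalized by the total) can be sketched end to end in Python. This is a simplified illustration rather than the authors' implementation: the function names and the `regions` input layout are hypothetical, the Gamma prior parameters are taken as given inputs, the factor $m$ is not integrated over, and `log_marginal` uses the Gamma-Poisson marginal likelihood with the hypothesis-independent factor dropped.

```python
import math


def log_marginal(c, b, alpha, beta):
    """log of beta^alpha * Gamma(alpha + c) / (Gamma(alpha) * (beta + b)^(alpha + c)).

    Gamma-Poisson marginal likelihood for total count c and total baseline b,
    omitting the factor that is identical under every hypothesis.
    """
    return (alpha * math.log(beta) + math.lgamma(alpha + c)
            - math.lgamma(alpha) - (alpha + c) * math.log(beta + b))


def bayesian_scan(regions, c_all, b_all, prior_all, p1):
    """Posterior probability of H0 and of H1(S) for every region S.

    regions   -- dict: name -> (c_in, b_in, (alpha_in, beta_in), (alpha_out, beta_out))
                 (hypothetical layout; names must not be "H0")
    prior_all -- (alpha_all, beta_all)
    p1        -- prior probability of an outbreak, spread uniformly over regions
    """
    log_scores = {"H0": log_marginal(c_all, b_all, *prior_all) + math.log(1.0 - p1)}
    for name, (c_in, b_in, prior_in, prior_out) in regions.items():
        log_lik = (log_marginal(c_in, b_in, *prior_in)
                   + log_marginal(c_all - c_in, b_all - b_in, *prior_out))
        log_scores[name] = log_lik + math.log(p1 / len(regions))

    # Normalize in log space (log-sum-exp) to avoid underflow.
    top = max(log_scores.values())
    log_total = top + math.log(sum(math.exp(v - top) for v in log_scores.values()))
    return {name: math.exp(v - log_total) for name, v in log_scores.items()}


# Toy example: region A has twice the baseline rate, region B does not.
regions = {
    "A": (20, 1000, (200.0, 10000.0), (100.0, 10000.0)),
    "B": (10, 1000, (200.0, 10000.0), (100.0, 10000.0)),
}
posterior = bayesian_scan(regions, c_all=110, b_all=10000,
                          prior_all=(100.0, 10000.0), p1=0.05)
total_outbreak_probability = 1.0 - posterior["H0"]
```

Regions whose posterior exceeds a reporting threshold would be listed, and `total_outbreak_probability` would be compared against an alarm threshold; both thresholds are left to the caller here.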

Testing detection power
We use a semi-synthetic testing framework, in which we inject simulated respiratory outbreaks into real baseline ED and OTC data (assumed to have no outbreaks): simulated outbreak = baseline + injected.
Baseline data: 1 year of ED data from Allegheny County, year of OTC data from Allegheny County.
Injected outbreaks: BARD-simulated anthrax cases; Fictional Linear Onset Outbreak (FLOO).

Testing methodology
The daily score is $F^* = \max_S F(S)$ for the frequentist method and $F^* = \sum_S \Pr(H_1(S) \mid Data)$ for the Bayesian method.
1. Baseline data without injected cases: compute the score $F^*$ for each day.
2. For each simulated outbreak:
a) Inject the outbreak into the baseline data.
b) Compute the score $F^*$ for each day of the outbreak.
c) For each day of the outbreak ($t = 1 \ldots T_{outbreak}$), compute the fraction of baseline days with scores higher than the maximum score of outbreak days 1 through $t$. This is the proportion of false positives we would have to accept to detect that outbreak by day $t$.
3. Average over multiple outbreaks to obtain an AMOC curve (average days to detect an outbreak vs. false positive rate).
4. Also compute the proportion of outbreaks detected and the days to detect at various false positive rates (e.g. 1/month).
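
Step 2c of this methodology can be sketched as follows. The function names are hypothetical, and the inputs are assumed to be plain lists of daily scores $F^*$ (one list for baseline days, one for the days of a single injected outbreak).

```python
def false_positive_fractions(baseline_scores, outbreak_scores):
    """For each outbreak day t, the fraction of baseline days whose score is
    higher than the maximum outbreak score over days 1 through t."""
    fractions = []
    running_max = float("-inf")
    for score in outbreak_scores:
        running_max = max(running_max, score)
        n_higher = sum(1 for b in baseline_scores if b > running_max)
        fractions.append(n_higher / len(baseline_scores))
    return fractions


def days_to_detect(fractions, max_false_positive_rate):
    """First outbreak day (1-based) detectable at the given false positive
    rate, or None if the outbreak is never detected at that rate."""
    for day, fraction in enumerate(fractions, start=1):
        if fraction <= max_false_positive_rate:
            return day
    return None


# Toy example: four baseline days, a three-day outbreak.
fractions = false_positive_fractions([0.1, 0.2, 0.3, 0.4], [0.15, 0.35, 0.5])
day = days_to_detect(fractions, max_false_positive_rate=0.25)
```

Averaging `days_to_detect` over many injected outbreaks, at a range of false positive rates, gives the points of the AMOC curve described in step 3.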

Why Bayesian? (part 1)
Better detection power!
[Table: average days to detect at 1 false positive/month, frequentist vs. Bayesian, by outbreak type: FLOO_ED (4, 14), FLOO_ED (2, 20), FLOO_ED (1, 20), BARD_ED (.125), BARD_ED (.016), FLOO_OTC (40, 14), FLOO_OTC (25, 20).]
Bayesian better for 6 of 7 outbreaks, by an average of 0.22 days. Should do even better with an informative prior!

Why Bayesian? (part 2)
It's fast!
Because no randomization testing is necessary, we can search approximately 1000x faster than the (naïve) frequentist approach. For a 128 x 128 grid: 44 minutes vs. 31 days.
For small to moderate grid sizes, this is even faster than the fast spatial scan (Neill and Moore, 2004). For a 128 x 128 grid: 44 minutes vs. 77 minutes.
For larger grid sizes, the fast spatial scan wins. For a 256 x 256 grid: 10 hours vs. 12 hours.

Why Bayesian? (part 3)
Easier to interpret results (posterior probability of an outbreak, distribution over possible regions).
Easier to visualize.
Easier to calibrate (by setting prior probability $P_1$).
Easier to combine evidence from multiple detectors, by modeling the joint distribution.
[Figure: grid map of posterior outbreak probability. Cell color is based on the posterior probability of an outbreak in that cell, ranging from white (0%) to black (100%). The red rectangle represents the most likely region. Total posterior probability of outbreak: 86.61%. Maximum region outbreak probability: 12.27%. Maximum cell outbreak probability: 86.57%.]

Making the spatial scan fast
256 x 256 grid = 1 billion rectangular regions!
Naïve frequentist scan: 1000 replicas x 12 hrs / replica = 500 days!
Fast frequentist scan (Neill and Moore, KDD 2004): 1000 replicas x 36 sec / replica = 10 hrs.
Naïve Bayesian scan (Neill, Moore, and Cooper, NIPS 2005): 12 hrs (to search the original grid).
Fast Bayesian scan??
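
The arithmetic on this slide can be checked directly. The sketch below assumes that "rectangular regions" means all axis-aligned rectangles of grid cells, so the count is the number of ways to choose a row range times the number of ways to choose a column range; the function name is hypothetical.

```python
def count_rectangles(n):
    """Number of axis-aligned rectangles of cells on an n x n grid."""
    ranges_per_axis = n * (n + 1) // 2  # choices of (start, end) along one axis
    return ranges_per_axis * ranges_per_axis


n_regions = count_rectangles(256)             # 1,082,146,816, about 1 billion
naive_frequentist_days = 1000 * 12 / 24       # 1000 replicas x 12 hrs = 500 days
fast_frequentist_hours = 1000 * 36 / 3600     # 1000 replicas x 36 sec = 10 hrs
```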