Published by Walter Doyle. Modified about 1 year ago.

1
2005 Carnegie Mellon University
A Bayesian Scan Statistic for Spatial Cluster Detection
Daniel B. Neill (1), Andrew W. Moore (1), Gregory F. Cooper (2)
(1) Carnegie Mellon University, School of Computer Science
(2) University of Pittsburgh, Center for Biomedical Informatics
{neill, awm}@cs.cmu.edu, gfc@cbmi.pitt.edu

2
Prospective disease surveillance
Nationwide disease surveillance data, aggregated by zip code:
  Daily counts of OTC drug sales in 18 product categories from 20,000 retail stores (NRDM, the National Retail Data Monitor: rods.health.pitt.edu).
  Daily counts of Emergency Department visits, grouped by syndrome.
Each day we want to answer the questions: what’s happening, and where?
  Are there any emerging clusters of symptoms that are worthy of further investigation?
  Where are they, how large, and how serious?
Goal: automatically detect emerging disease outbreaks as quickly as possible, while keeping the number of false positives low.

3
Spatial cluster detection
Given: count c_i and baseline b_i for each zip code s_i.
  Count c_i: e.g. number of Emergency Dept. visits, or over-the-counter drug sales, of a specific type.
  Baseline b_i: at-risk population or expected count, inferred from historical data.
Does any spatial region S have sufficiently high counts c_i to be indicative of an emerging disease epidemic in that area?
Our typical assumption: counts are aggregated to a uniform grid; search over the set of rectangles on the grid.

5
The spatial scan statistic
The spatial scan statistic (Kulldorff, 1997) is a powerful method for spatial cluster detection.
  Search over a given set of spatial regions.
  Find those regions which are most likely to be clusters.
  Correctly adjust for multiple hypothesis testing.
Problems with the spatial scan:
  Difficult to incorporate prior knowledge. Size and shape of outbreak? Impact on disease rate?
  Computing statistical significance (p-values) by randomization requires searching a huge number of “replica” datasets and comparing results to the original. Computationally infeasible for massive datasets!
Here we propose a Bayesian spatial scan statistic, which allows us to incorporate prior knowledge and, since randomization testing is unnecessary, is much more efficient to compute.

6
The generalized spatial scan
1. Obtain data for a set of spatial locations s_i.
2. Choose a set of spatial regions S to search.
3. Choose models of the data under null hypothesis H_0 (no clusters) and alternative hypotheses H_1(S) (cluster in region S).
4. Derive a score function F(S) based on H_1(S) and H_0.
5. Find the most anomalous regions (i.e. those regions S with highest F(S)).
6. Determine whether each of these potential clusters is actually an anomalous cluster.
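The search steps above can be sketched as a brute-force scan over every axis-aligned rectangle of a small grid. This is only an illustration, not the authors' implementation: the names `rectangles`, `region_totals`, `scan`, and the placeholder score `ratio_score` (observed count divided by baseline) are assumptions made for the example, and step 6 (deciding whether a top region is a real cluster) is left out.

```python
def rectangles(n_rows, n_cols):
    """Yield every axis-aligned rectangle as (r0, r1, c0, c1), bounds inclusive."""
    for r0 in range(n_rows):
        for r1 in range(r0, n_rows):
            for c0 in range(n_cols):
                for c1 in range(c0, n_cols):
                    yield (r0, r1, c0, c1)


def region_totals(counts, baselines, rect):
    """Aggregate count C and baseline B over the cells of one rectangle."""
    r0, r1, c0, c1 = rect
    cells = [(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]
    total_count = sum(counts[r][c] for r, c in cells)
    total_baseline = sum(baselines[r][c] for r, c in cells)
    return total_count, total_baseline


def scan(counts, baselines, score, top_k=3):
    """Score every rectangle and return the top_k (score, rectangle) pairs."""
    scored = []
    for rect in rectangles(len(counts), len(counts[0])):
        total_count, total_baseline = region_totals(counts, baselines, rect)
        scored.append((score(total_count, total_baseline), rect))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def ratio_score(total_count, total_baseline):
    """Hypothetical placeholder score: observed count relative to baseline."""
    return total_count / total_baseline if total_baseline > 0 else 0.0


# Toy 3 x 3 grid with one elevated cell in the centre.
counts = [[5, 5, 5], [5, 20, 5], [5, 5, 5]]
baselines = [[5.0, 5.0, 5.0], [5.0, 5.0, 5.0], [5.0, 5.0, 5.0]]
best = scan(counts, baselines, ratio_score, top_k=1)[0]
```

Any score function F(S) that depends only on the aggregate count and baseline of a region can be passed in place of `ratio_score`.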

7
Population-based model (Kulldorff, 1997)
Each count c_i (number of cases in location s_i) is generated from a Poisson distribution with mean q_i b_i.
  b_i represents the at-risk population, often estimated from census data.
  q_i represents the disease rate.
Is there any region with disease rates significantly higher inside than outside?
(Figure: example with q_in = .02 inside the region and q_out = .01 outside.)

8
The frequentist model
c_i ~ Poisson(q_i b_i)
Null hypothesis H_0 (no clusters): q_i = q_all everywhere. Use maximum likelihood estimate of q_all.
Alternative hypothesis H_1(S) (cluster in region S): q_i = q_in inside region S, q_i = q_out elsewhere. Use maximum likelihood estimates of q_in and q_out, subject to q_in > q_out.
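Under this model the maximum likelihood estimates are q_in = C_in / B_in, q_out = C_out / B_out and q_all = C_all / B_all, and the exponential terms of the Poisson likelihoods cancel in the ratio. A minimal sketch of the resulting log-likelihood ratio follows; the function names are assumptions for illustration, and the example numbers are chosen only to reproduce the rates .02 and .01 from the population-based model slide.

```python
import math


def _count_times_log_rate(count, baseline):
    """count * log(count / baseline), with the convention 0 * log(0) = 0."""
    return 0.0 if count == 0 else count * math.log(count / baseline)


def log_likelihood_ratio(c_in, b_in, c_out, b_out):
    """Log of the Poisson likelihood ratio for H_1(S) against H_0.

    Returns 0 when the estimated rate inside the region is not higher
    than outside, which enforces the constraint q_in > q_out.
    """
    if c_in / b_in <= c_out / b_out:
        return 0.0
    c_all, b_all = c_in + c_out, b_in + b_out
    return (_count_times_log_rate(c_in, b_in)
            + _count_times_log_rate(c_out, b_out)
            - _count_times_log_rate(c_all, b_all))


# Region with rate 20/1000 = .02 inside and 90/9000 = .01 outside.
elevated = log_likelihood_ratio(20, 1000.0, 90, 9000.0)
# Same totals, but the lower rate is inside: the score is clipped to 0.
not_elevated = log_likelihood_ratio(5, 1000.0, 105, 9000.0)
```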

9
The Bayesian hierarchical model
c_i ~ Poisson(q_i b_i)
Null hypothesis H_0 (no clusters): q_i = q_all everywhere, with q_all ~ Gamma(α_all, β_all).
Alternative hypothesis H_1(S) (cluster in region S): q_i = q_in inside region S, q_i = q_out elsewhere, with q_in ~ Gamma(α_in(S), β_in(S)) and q_out ~ Gamma(α_out(S), β_out(S)).
Top two levels of the hierarchy are the same as in the frequentist model.

10
Frequentist approach vs. Bayesian approach
Frequentist approach:
  Use the likelihood ratio as the score F(S).
  Use maximum likelihood parameter estimates.
  Calculate statistical significance by randomization testing:
    Compute maximum F(S) for each of R = 1000 replica grids generated under H_0.
    p-value = (R_beat + 1) / (R + 1), where R_beat = number of replicas with max score > original region.
Bayesian approach:
  Use the posterior probability of each hypothesis given the data.
  Use marginal likelihood (integrate over possible values of parameters).
  No randomization testing necessary. Instead, normalize posterior probabilities by computing and dividing by the total data likelihood:
    P(Data) = P(Data | H_0) P(H_0) + ∑_S P(Data | H_1(S)) P(H_1(S))
  This gives the probability of an outbreak in each region; sum these to get the total probability of an outbreak.
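The randomization p-value formula on this slide is simple enough to state as code. The sketch below assumes the maximum score of each replica grid has already been computed; the function name and the toy scores are illustrative only.

```python
def randomization_p_value(original_score, replica_max_scores):
    """p-value = (R_beat + 1) / (R + 1).

    R is the number of replica grids and R_beat is the number of replicas
    whose maximum score exceeds the score of the original region.
    """
    num_replicas = len(replica_max_scores)
    num_beat = sum(1 for s in replica_max_scores if s > original_score)
    return (num_beat + 1) / (num_replicas + 1)


# Toy example with R = 4 replicas, one of which beats the original score.
p_value = randomization_p_value(3.5, [1.0, 2.0, 3.0, 4.0])
```

With R = 1000 replicas, the smallest attainable p-value is 1/1001, which is one reason the frequentist approach needs so many replica searches.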

11
Computing Bayesian likelihoods
Marginal likelihood approach: integrate over possible values of the disease rate parameters (q_in, q_out, q_all), weighted by prior probability.
Conjugate prior allows a closed-form solution: Gamma prior + Poisson counts → negative binomial marginal, expressed in terms of C = ∑ c_i and B = ∑ b_i.
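The closed form itself did not survive in this transcript. The standard Gamma-Poisson derivation is sketched below, assuming the rate parameterization of the Gamma distribution (mean α/β, variance α/β²) and a single rate q shared by the n locations being aggregated; it is a reconstruction from the model definitions, not a copy of the slide.

```latex
\begin{align*}
P(c_1,\dots,c_n \mid \alpha,\beta)
  &= \int_0^\infty \prod_{i=1}^{n} \frac{(q b_i)^{c_i} e^{-q b_i}}{c_i!}
     \cdot \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, q^{\alpha-1} e^{-\beta q}\, dq \\
  &= \left( \prod_{i=1}^{n} \frac{b_i^{c_i}}{c_i!} \right)
     \frac{\beta^{\alpha}}{\Gamma(\alpha)}
     \int_0^\infty q^{\alpha + C - 1} e^{-(\beta + B) q}\, dq \\
  &= \left( \prod_{i=1}^{n} \frac{b_i^{c_i}}{c_i!} \right)
     \frac{\beta^{\alpha}\, \Gamma(\alpha + C)}{\Gamma(\alpha)\, (\beta + B)^{\alpha + C}},
  \qquad C = \sum_i c_i,\quad B = \sum_i b_i .
\end{align*}
```

The leading product over locations is the same under H_0 and under every H_1(S), so it cancels when posterior probabilities are normalized and only the factor depending on C and B needs to be computed.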

12
Obtaining priors
Choose prior outbreak probability P_1, assume uniform region prior: P(H_0) = 1 - P_1, P(H_1(S)) = P_1 / N_reg.
Choose parameter priors (α and β) by matching α/β and α/β^2 to the mean and variance of the observed disease rate q = C/B.
Assume outbreak increases rate by a multiplicative factor m. Use (discretized) uniform distribution for m: m ~ Uniform[1, 3].
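Matching α/β to the mean and α/β² to the variance gives β = mean / variance and α = mean² / variance. A minimal sketch of the prior setup follows; the function names, the example moments, and the five-point grid for m are assumptions chosen for illustration (the slide does not say how finely m is discretized).

```python
def gamma_from_moments(mean, variance):
    """Solve alpha / beta = mean and alpha / beta**2 = variance."""
    beta = mean / variance
    alpha = mean * beta
    return alpha, beta


def hypothesis_priors(p1, num_regions):
    """Return (P(H_0), P(H_1(S))) under a uniform prior over regions."""
    return 1.0 - p1, p1 / num_regions


def multiplier_grid(low=1.0, high=3.0, num_points=5):
    """Equally spaced values of m on [low, high], each with equal weight."""
    step = (high - low) / (num_points - 1)
    return [low + k * step for k in range(num_points)]


# Hypothetical historical rate: mean 0.01 per unit baseline, variance 1e-6.
alpha, beta = gamma_from_moments(0.01, 1e-6)
p_h0, p_h1_each = hypothesis_priors(p1=0.05, num_regions=36)
m_values = multiplier_grid()
```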

13
Computing the Bayesian statistic
P_1 determines the priors Pr(H_0) and Pr(H_1(S)).
score(H_0) = Pr(Data | H_0) x Pr(H_0), where Pr(Data | H_0) is computed from α_all, β_all, C_all, B_all.
For all regions S: score(S) = Pr(Data | H_1(S)) x Pr(H_1(S)), where Pr(Data | H_1(S)) is computed from α_in(S), β_in(S), α_out(S), β_out(S), C_in(S), B_in(S), C_out(S), B_out(S).
Pr(Data) = score(H_0) + sum of score(S) over all regions S.
Pr(H_0 | Data) = score(H_0) ÷ Pr(Data); Pr(H_1(S) | Data) = score(S) ÷ Pr(Data).
Report all regions with probability > P_thresh, OR “sound the alarm” if total probability of outbreak > P_alarm.
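The flow above can be sketched in log space to avoid numerical underflow. This is a simplified illustration under stated assumptions, not the authors' code: one (α, β) pair is reused for the inside, outside and whole-grid rates instead of region-specific values, the multiplier m is assumed to scale α inside the region (so the prior mean rate is multiplied by m), and the location-level factor shared by all hypotheses is dropped because it cancels in the normalization. All names are hypothetical.

```python
import math


def log_marginal(count, baseline, alpha, beta):
    """Log Gamma-Poisson marginal likelihood, up to a shared constant."""
    return (alpha * math.log(beta) + math.lgamma(alpha + count)
            - (alpha + count) * math.log(beta + baseline)
            - math.lgamma(alpha))


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def posteriors(regions, c_all, b_all, alpha, beta, p1, m_values):
    """Return (Pr(H_0 | Data), {region name: Pr(H_1(S) | Data)}).

    regions maps a region name to its aggregate (C_in, B_in).
    """
    log_score_h0 = math.log(1.0 - p1) + log_marginal(c_all, b_all, alpha, beta)
    log_prior_region = math.log(p1 / len(regions))
    log_scores = {}
    for name, (c_in, b_in) in regions.items():
        c_out, b_out = c_all - c_in, b_all - b_in
        # Average the inside likelihood over the discretized uniform prior on m.
        per_m = [log_marginal(c_in, b_in, m * alpha, beta) for m in m_values]
        log_inside = log_sum_exp(per_m) - math.log(len(m_values))
        log_outside = log_marginal(c_out, b_out, alpha, beta)
        log_scores[name] = log_prior_region + log_inside + log_outside
    log_data = log_sum_exp([log_score_h0] + list(log_scores.values()))
    p_h0 = math.exp(log_score_h0 - log_data)
    p_regions = {k: math.exp(v - log_data) for k, v in log_scores.items()}
    return p_h0, p_regions


# Toy data: region A has a doubled rate, region B does not.
p_h0, p_regions = posteriors(
    regions={"A": (20, 1000.0), "B": (10, 1000.0)},
    c_all=110, b_all=10000.0,
    alpha=100.0, beta=10000.0,
    p1=0.05, m_values=[1.0, 1.5, 2.0, 2.5, 3.0],
)
total_outbreak_probability = sum(p_regions.values())
```

Comparing each entry of `p_regions` with P_thresh and `total_outbreak_probability` with P_alarm gives the two reporting rules named on the slide.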

14
Testing detection power
We use a semi-synthetic testing framework, in which we inject simulated respiratory outbreaks into real baseline ED and OTC data (assumed to have no outbreaks).
Baseline data:
  1 year of ED data from Allegheny County, 2002
  1 year of OTC data from Allegheny County, 2004-2005
Injected outbreaks:
  BARD-simulated anthrax cases
  Fictional Linear Onset Outbreak (FLOO)
Simulated outbreak = baseline + injected

15
Testing methodology
1. Baseline data without injected cases: compute the score F* for each day.
2. For each simulated outbreak:
   a) Inject outbreak into baseline data.
   b) Compute score F* for each day of outbreak.
   c) For each day of outbreak (t = 1..T_outbreak), compute fraction of baseline days with scores higher than maximum score of outbreak days 1 through t. This is the proportion of false positives we would have to accept to detect that outbreak by day t.
3. Average over multiple outbreaks to obtain an AMOC curve (avg. days to detect outbreak vs. false positive rate).
4. Also compute proportion of outbreaks detected and days to detect at various false positive rates (e.g. 1/month).
Frequentist: F* = max_S F(S)
Bayesian: F* = ∑_S Pr(H_1(S) | Data)
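Step 2c and the days-to-detect measurement can be sketched as follows, given the daily scores F* for the baseline and for one injected outbreak. The function names and toy scores are illustrative, and counting an undetected outbreak as its full length is an assumption of this sketch, not something stated on the slide.

```python
def false_positive_fractions(baseline_scores, outbreak_scores):
    """For each outbreak day t, the fraction of baseline days scoring higher
    than the maximum outbreak score over days 1 through t."""
    fractions = []
    running_max = float("-inf")
    for score in outbreak_scores:
        running_max = max(running_max, score)
        higher = sum(1 for b in baseline_scores if b > running_max)
        fractions.append(higher / len(baseline_scores))
    return fractions


def days_to_detect(fractions, allowed_fp_rate):
    """First outbreak day (1-indexed) detectable at the allowed false positive
    rate. Assumption: an undetected outbreak counts as its full length."""
    for day, fraction in enumerate(fractions, start=1):
        if fraction <= allowed_fp_rate:
            return day
    return len(fractions)


baseline = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
outbreak = [0.35, 0.75, 0.5, 1.2]
fractions = false_positive_fractions(baseline, outbreak)
```

Averaging `days_to_detect` over many simulated outbreaks, for a range of allowed false positive rates, traces out the AMOC curve described in step 3.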

16
Why Bayesian? (part 1)
Better detection power!
Average days to detect (1 false positive/month):
Outbreak type        Frequentist   Bayesian
FLOO_ED (4, 14)      1.86          1.88
FLOO_ED (2, 20)      3.32          3.20
FLOO_ED (1, 20)      6.68          5.78
BARD_ED (.125)       1.73          1.63
BARD_ED (.016)       3.93          3.81
FLOO_OTC (40, 14)    3.58          3.48
FLOO_OTC (25, 20)    5.39          5.19
Bayesian better for 6 of 7 outbreaks, by an average of 0.22 days.
Should do even better with an informative prior!

17
Why Bayesian? (part 2)
It’s fast!
Because no randomization testing is necessary, we can search approximately 1000x faster than the (naïve) frequentist approach. For 128 x 128 grid: 44 minutes vs. 31 days.
For small to moderate grid sizes, this is even faster than the fast spatial scan (Neill and Moore, 2004). For 128 x 128 grid: 44 minutes vs. 77 minutes.
For larger grid sizes, the fast spatial scan wins. For 256 x 256 grid: 10 hours vs. 12 hours.

18
Why Bayesian? (part 3)
Easier to interpret results (posterior probability of an outbreak, distribution over possible regions).
Easier to visualize.
Easier to calibrate (by setting prior probability P_1).
Easier to combine evidence from multiple detectors, by modeling the joint distribution.
(Figure: map of posterior outbreak probability. Cell color is based on posterior probability of outbreak in that cell, ranging from white (0%) to black (100%). Red rectangle represents most likely region. Total posterior probability of outbreak: 86.61%. Maximum region outbreak probability: 12.27%. Maximum cell outbreak probability: 86.57%.)

19
Making the spatial scan fast
256 x 256 grid = 1 billion rectangular regions!
Naïve frequentist scan: 1000 replicas x 12 hrs / replica = 500 days!
Fast frequentist scan (Neill and Moore, KDD 2004): 1000 replicas x 36 sec / replica = 10 hrs
Naïve Bayesian scan (Neill, Moore, and Cooper, NIPS 2005): 12 hrs (to search original grid)
Fast Bayesian scan??
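The figures on this slide can be checked with a few lines of arithmetic. The sketch assumes "rectangular regions" means all axis-aligned rectangles on an N x N grid, each fixed by choosing a row interval and a column interval; the function name is illustrative.

```python
def num_rectangles(grid_size):
    """Number of axis-aligned rectangles on a grid_size x grid_size grid."""
    intervals = grid_size * (grid_size + 1) // 2  # choices of (start, end) per axis
    return intervals * intervals


regions_256 = num_rectangles(256)           # roughly 1 billion
naive_frequentist_days = 1000 * 12 / 24     # 1000 replicas x 12 hrs each, in days
fast_frequentist_hours = 1000 * 36 / 3600   # 1000 replicas x 36 sec each, in hours
```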
