General Elliptical Hotspot Detection
Xun Tang, Yameng Zhang (Group 2)
https://sites.google.com/site/8715xuntang/

2 Outline: Motivation; Basic Concepts and Problem Statement; Challenges; Related Work; Proposed Approach; Validation; Summary

3 Motivation
Public safety
– Crime hotspots
Epidemiology
– Geographical surveillance of disease
– Early detection of disease outbreaks
– Finding the source of an outbreak
Theory
– Diffusion theory
Diffusion is NOT always isotropic, because a preferred axis of transmission may exist (e.g., highways).
Figure: Snow's map of the London cholera epidemic, 1854.

5 Basic Concepts
Study area: A finite rectangular area S(m, n) in 2-D Euclidean space, where m and n are the lengths of its two sides.
Activity set: A set A of activities distributed in the study area, where |A| is the number of activities. Each activity is represented by its location a = (x, y).

6 Basic Concepts
Ellipse: An ellipse is a generalization of a circle. It is represented by 5 parameters: its center location (c_x, c_y), its radii along the x and y axes (r_x, r_y), and a rotation angle.
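A minimal sketch of how the five ellipse parameters can be used to test whether an activity falls inside an ellipse. It assumes the rotation angle is given in degrees, measured counter-clockwise from the x axis; the slides do not state the convention, and the function name `in_ellipse` is hypothetical.

```python
import math


def in_ellipse(x, y, cx, cy, rx, ry, theta_deg):
    """Return True if point (x, y) lies inside or on the rotated ellipse.

    (cx, cy): center; rx, ry: radii along the ellipse's own axes;
    theta_deg: rotation angle in degrees (assumed counter-clockwise).
    """
    theta = math.radians(theta_deg)
    dx, dy = x - cx, y - cy
    # Rotate the offset by -theta so the ellipse becomes axis-aligned.
    u = dx * math.cos(theta) + dy * math.sin(theta)
    v = -dx * math.sin(theta) + dy * math.cos(theta)
    return (u / rx) ** 2 + (v / ry) ** 2 <= 1.0
```

For example, with r_x = 2 and r_y = 1 the point (1.5, 0) is inside the unrotated ellipse but outside it once the ellipse is rotated by 90 degrees.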

7 Basic Concepts: Interest Measure
Interest measure: compare the point density inside ellipse E with that outside.
Continuous Poisson point process [1]:
p: intensity of the homogeneous Poisson process inside the ellipse
q: intensity of the homogeneous Poisson process outside the ellipse
H_0: p = q, i.e. a unit of area inside E is as likely to contain a point as a unit of area outside E
H_1: p > q, i.e. a unit of area inside E is more likely to contain a point than a unit of area outside E
Likelihood ratio: LR_E(H_0, H_1, Poisson distribution) = Likelihood(H_1) / Likelihood(H_0), where both likelihoods are Poisson likelihood functions.
[1] M. Kulldorff. A spatial scan statistic. Communications in Statistics - Theory and Methods, 26(6):1481–1496, 1997.

8 Basic Concepts: Interest Measure
Interest measure: compare the point density inside the ellipse with that outside.
Example: study area = 5 × 4, |A| = 51
In the log likelihood ratio, c is the number of activities in ellipse E, B is the expected number of activities in E, and |A| is the cardinality of activity set A.
For the green ellipse,
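A minimal sketch of this interest measure, assuming the standard Poisson form from Kulldorff [1]: logLR = c ln(c/B) + (|A| − c) ln((|A| − c)/(|A| − B)) when c > B, and 0 otherwise. The slides name c, B and |A| but the exact expression is an assumption here, as are the function names and the idea of deriving B from the ellipse's share of the study area.

```python
import math


def expected_count(n, rx, ry, study_area):
    """Expected activities in an ellipse under H_0 (uniform density).

    n: |A|; rx, ry: ellipse radii; study_area: area of the study region.
    """
    return n * math.pi * rx * ry / study_area


def log_lr(c, B, n):
    """Log likelihood ratio in the (assumed) Kulldorff Poisson form.

    c: activities observed in the ellipse; B: expected count under H_0;
    n: |A|. Returns 0 when the ellipse is not denser than expected.
    """
    if c <= B:
        return 0.0
    inside = c * math.log(c / B)
    # When c == n the outside term is the limit 0 * log(0) = 0.
    outside = (n - c) * math.log((n - c) / (n - B)) if c < n else 0.0
    return inside + outside
```

With the slide's setting (study area 5 × 4, |A| = 51), a hypothetical ellipse with radii 1 and 0.5 has an expected count of about 4, so it only scores above zero if it contains more than 4 activities.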

9 Basic Concepts: Statistical Significance
Is it a chance pattern?
Monte Carlo simulation:
– Shuffle the data points in each trial
– Find the highest logLR in each trial
– Compute the p-value
Example: study area = 5 × 4, |A| = 51
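The Monte Carlo steps above can be sketched as follows. This is an illustration under assumptions the slides do not spell out: "shuffling" is interpreted as redrawing all |A| points uniformly over the study area, and the p-value uses the common rank convention (1 + number of trials at least as extreme) / (m + 1). The `statistic` argument is a hypothetical callable standing in for "highest logLR over all candidate ellipses".

```python
import random


def monte_carlo_p_value(observed_stat, statistic, n_points,
                        width, height, m, seed=0):
    """Estimate a p-value by simulating m data sets under H_0.

    observed_stat: the statistic on the real data (e.g. highest logLR).
    statistic: callable taking a list of (x, y) points, returning a number.
    n_points: |A|; width, height: study-area side lengths; m: trials.
    """
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(m):
        # Assumed null model: points uniformly distributed in the area.
        simulated = [(rng.uniform(0, width), rng.uniform(0, height))
                     for _ in range(n_points)]
        if statistic(simulated) >= observed_stat:
            at_least_as_extreme += 1
    # Rank-based estimate; the observed data counts as one of m + 1 sets.
    return (1 + at_least_as_extreme) / (m + 1)
```

Under this convention the smallest reportable p-value with m trials is 1 / (m + 1), which is why the number of trials limits how small a p-value threshold can meaningfully be.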

10 Basic Concepts: Discretization Interval
The parametric space of ellipses is infinite, so discretization is needed.
(1) We partition the study area into cells; centers and radii change in steps of one cell.
(2) An interval between two enumerated angles is also specified (e.g., 10 degrees).
The discretization interval is the “precision” of hotspot detection.
Example: two ellipses with the same center and the same r_y, but different r_x.

11 Problem Statement
Input
– An activity set A = {a_1, a_2, a_3, …, a_n} in a study area, where each activity has a location a_i = (x_i, y_i)
– A log likelihood ratio threshold (t_lr)
– A p-value threshold (t_p)
– The number of Monte Carlo simulation trials (m)
– A radius interval i_r and an angle interval i_a
Output
– Ellipses with p-value ≤ t_p and logLR ≥ t_lr
Objective
– Computational efficiency
Constraints
– Correctness and completeness under i_r and i_a
Example: study area = 5 × 4, |A| = 51, t_lr = 5, t_p = 0.01, m = 100, i_r = 0.2, i_a = 10°

13 Challenges
Large data sets
– E.g., approximately 10^5 cases in the H1N1 epidemic in 2009
Computational complexity: given |A| activities in a study area of N^2 cells and an angle interval i_a
– O(N^4 * (180 / i_a)) candidate ellipses
– O(|A|) to evaluate an ellipse
– Monte Carlo simulations multiply the cost by 10^2 to 10^3
Interest measures are not monotonic
– The log likelihood ratio of a sub-ellipse is not bounded by the log likelihood ratio of the enclosing ellipse.

15 Related Work 15

17 NaïveEHD Algorithm
Idea: exhaustively enumerate all possible ellipses at a given precision.
Algorithm:
Step 1: enumerate center_x, center_y, radius_x, radius_y, angle: O(N^4 * |angle|)
Step 2: evaluate the log likelihood ratio of each ellipse: O(|A|) per ellipse
Step 3: Monte Carlo simulations to calculate p-values: O(m * |A| * N^4 * |angle|)
Total cost: O(m * |A| * N^4 * |angle|)
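Step 1 of the naive algorithm can be sketched as a generator over the discretized parameter space. The concrete choices here are assumptions for illustration: centers sit at cell centers, radii are whole multiples of the cell size from 1 to N cells, and angles run from 0 up to (but not including) 180 degrees. The name `enumerate_ellipses` is hypothetical.

```python
from itertools import product


def enumerate_ellipses(n_cells, cell_size, angle_step_deg):
    """Yield every candidate ellipse (cx, cy, rx, ry, angle_deg).

    n_cells: N, the number of cells per side of the study area.
    cell_size: side length of one cell (the radius/center interval).
    angle_step_deg: the angle interval in degrees.
    """
    centers = [(i + 0.5) * cell_size for i in range(n_cells)]
    radii = [k * cell_size for k in range(1, n_cells + 1)]
    angles = range(0, 180, angle_step_deg)
    # N choices each for cx, cy, rx, ry gives N^4 shapes per angle.
    yield from product(centers, centers, radii, radii, angles)
```

Even a 3 × 3 grid with a 10 degree interval yields 3^4 × 18 = 1458 candidates, which illustrates how quickly the O(N^4 * |angle|) term grows. This sketch does not deduplicate equivalent candidates, such as circles repeated at every angle.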

18 SmartEHD Algorithm (1)
Key ideas: (1) upper-bound pruning; (2) evaluate a collection of ellipses at once.
Log likelihood ratio (logLR) = f(number of points, area)
Lemma 1: Ellipses with the same center and major (longer) axis can be bounded by a square.
Lemma 2: With a fixed area, logLR increases when the number of points increases.
Lemma 3: With a fixed number of points, logLR increases when the area decreases.

19 SmartEHD Algorithm (2)
A filter-and-refine approach:
loop for each possible square bounding box
    loop for minor (shorter) axis from small to large
        determine the upper bound of logLR (U_lr) using the total points and the minor axis
        if (U_lr < logLR threshold) break;
        else compute the exact logLR of ellipses at all angles (no need to check all points)
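A sketch of the filter step for one bounding square, under stated assumptions: logLR takes the Kulldorff Poisson form (c ln(c/B) + (|A| − c) ln((|A| − c)/(|A| − B)) for c > B, else 0), the expected count B is proportional to the ellipse area π × major × minor, and the point count of the bounding square stands in for the count of every ellipse inside it. Function names are hypothetical and the refine step is left out.

```python
import math


def log_lr(c, B, n):
    """Assumed Kulldorff-form log likelihood ratio (0 if c <= B)."""
    if c <= B:
        return 0.0
    outside = (n - c) * math.log((n - c) / (n - B)) if c < n else 0.0
    return c * math.log(c / B) + outside


def upper_bound_log_lr(points_in_square, major, minor, n, study_area):
    """Upper bound U_lr for all ellipses with these semi-axes in the square.

    No such ellipse holds more points than the square, and all of them
    have the same area, so the square's count gives a bound on logLR.
    """
    B = n * math.pi * major * minor / study_area
    return log_lr(points_in_square, B, n)


def minors_to_refine(points_in_square, major, minors, n, study_area,
                     threshold):
    """Return the minor semi-axes whose ellipses still need exact logLR."""
    kept = []
    for minor in sorted(minors):
        bound = upper_bound_log_lr(points_in_square, major, minor,
                                   n, study_area)
        if bound < threshold:
            break  # a larger minor axis means larger area, lower bound
        kept.append(minor)
    return kept
```

The early `break` relies on the area lemma: with the point count held at the square's total, growing the minor axis can only lower the bound, so once it drops below the threshold no larger minor axis in that square can qualify.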

20 SmartEHD Algorithm (3)
How to quickly compute the number of points inside each square?
Algorithm (with lookup_table(0, j) = lookup_table(i, 0) = 0):
1. Store the number of activities in each cell. Cost: O(|A|), where |A| is the total number of activities
2. for i = 1:N for j = 1:N lookup_table(i, j) = lookup_table(i-1, j) + cell(i, j). Cost: O(N^2), where N is the number of cells in a row
3. for i = 1:N for j = 1:N lookup_table(i, j) = lookup_table(i, j-1) + lookup_table(i, j). Cost: O(N^2)
After both passes, lookup_table(m, n) holds the sum of cell(i, j) over all i ≤ m and j ≤ n.

22 SmartEHD Algorithm (4)
With the lookup table, computing the sum of points in any rectangular block of cells costs O(1).
Example: sum(2:3, 2:3) = lookup_table(3, 3) – lookup_table(1, 3) – lookup_table(3, 1) + lookup_table(1, 1) = 35 – 9 – 12 + 2 = 16
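A minimal sketch of the lookup table (a 2-D prefix sum, also known as a summed-area table) and the O(1) block query. The table is built here in a single pass rather than two, which gives the same result. The 3 × 3 grid is hypothetical: the slide's grid is not in the transcript, so the counts were chosen to reproduce the four lookup values quoted in the example (35, 9, 12 and 2).

```python
def build_lookup_table(cell):
    """Build a 2-D prefix-sum table with a zero border row and column.

    table[i][j] is the sum of cell counts in rows 1..i and columns 1..j
    (1-based, matching the slide's indexing).
    """
    rows, cols = len(cell), len(cell[0])
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            table[i][j] = (cell[i - 1][j - 1] + table[i - 1][j]
                           + table[i][j - 1] - table[i - 1][j - 1])
    return table


def block_sum(table, r1, c1, r2, c2):
    """Sum of cells in rows r1..r2 and columns c1..c2 (1-based), in O(1)."""
    return (table[r2][c2] - table[r1 - 1][c2]
            - table[r2][c1 - 1] + table[r1 - 1][c1 - 1])


# Hypothetical activity counts per cell, chosen to match the example.
cell = [[2, 3, 4],
        [4, 5, 3],
        [6, 2, 6]]
table = build_lookup_table(cell)
print(block_sum(table, 2, 2, 3, 3))  # 35 - 9 - 12 + 2 = 16
```

The zero border removes the need for special cases when a block touches the first row or column.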

23 SmartEHD Algorithm (5)
Algorithm: suppose there are N cells in each dimension of the study area and |A| total activities.
Step 1: store the number of points for each cell: O(|A|)
Step 2: make the lookup table: O(N^2)
Step 3: upper-bound pruning
    loop for each possible square bounding box: O(N^3)
        loop for minor (shorter) axis from small to large: O(N)
            determine the upper bound of logLR (U_lr) using the total points and the minor axis: O(1)
            if (U_lr < logLR threshold) break;
            else compute the exact logLR of ellipses at all angles: f_1 * O(f_2 * |A| * |angle|)
Step 4: Monte Carlo simulations: O(f_1 * m * f_2 * |A| * N^4 * |angle|)
Total: O(f_1 * m * f_2 * |A| * N^4 * |angle|)

25 Validation 25

26 Theoretical Analysis
Proofs:
– Correctness: all significant elliptical hotspots are checked against the log likelihood ratio and p-value thresholds
– Completeness: all possible ellipses under the given cell size and angle interval are enumerated and examined
Asymptotic time complexity:
– Naïve approach: O(m * |A| * N^4 * |angle|)
– SmartEHD: O(f_1 * m * f_2 * |A| * N^4 * |angle|)
The coefficients f_1 and f_2 come from the pruning.

27 Case Study (London Cholera) 250 deaths in the 1854 London cholera outbreak

28 Case Study (London Cholera) 250 deaths in the 1854 London cholera outbreak
Figure panels: hotspots by SaTScan; hotspots by Elliptical Hotspot Detection (interval: 1/20 of side length, angle interval = 10°)

29 Case Study (Manhattan Robbery) 272 robberies in Manhattan, NY in December 2015 29

30 Case Study (Manhattan Robbery) 272 robberies in Manhattan, NY in December 2015
Figure panels: hotspots by SaTScan; hotspots by Elliptical Hotspot Detection (interval: 1/20 of side length, angle interval = 10°)

31 Experiment Results: Setup
Data: synthetic data containing ellipses and noise, varying (1) the total number of points, (2) the size of the ellipses, (3) the number of points inside each ellipse, and (4) the number of ellipses.
Platform: MacBook Pro with an Intel Core i7 at 2.2 GHz and 16 GB RAM; implemented in Java.

33 Summary
Formulate the General Elliptical Hotspot Detection (EHD) problem
Propose a naïve algorithm
Propose a smart algorithm
– Lookup table
– Filter-and-refine approach based on a square bounding box
Mathematical proof and asymptotic complexity analysis
Case study on real data
Experimental evaluation on synthetic data (to be finished)

