
Published by Mekhi Astor. Modified about 1 year ago.

1
1 Should SDBMS support the Join Index?: A Case Study from CrimeStat
Pradeep Mohan¹, Shashi Shekhar¹, Ned Levine², Ronald E. Wilson³, Betsy George¹, Mete Celik¹
¹University of Minnesota, Twin-Cities, {mohan,shekhar,bgeorge,mcelik}@cs.umn.edu
²Ned Levine and Associates, Houston, TX, Ned@nedlevine.com
³National Institute of Justice, Washington, D.C., Ronald.Wilson@usdoj.gov

2
2 Outline
Introduction
  Motivation
  Problem Statement
  Related Work
  Contributions
Self-Join Index
Experimental Evaluation
Conclusion and Future Work

3
3 Motivation
Application Domains
  Crime Analysis: Where are the burglary hotspots?
  Epidemiology: Is cancer spatially clustered?
  Transportation: Which major highways require traffic calming measures?
An Example
  Query: Where are the burglary hotspots?

4
4 W-Matrix and W-Queries
W-Matrix: the neighborhood graph of the dataset.
W_N: row-normalized W-Matrix.
W-Queries: queries that perform a repeated computation of the W-Matrix.
Examples of W-Queries: K-Function, Moran’s I, Geary’s C, G Statistic, Hotspots.
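As a rough illustration of the W-Matrix and its row-normalized form W_N, the sketch below builds a binary W from point locations and then divides each row by its sum. The distance-threshold neighbor rule, the function names and the sample points are assumptions made for the example; the slides do not specify how neighbors are defined.

```python
import math

def w_matrix(points, radius):
    """Binary W-matrix: w[i][j] = 1.0 when i != j and the two points
    lie within `radius` of each other (assumed neighbor rule)."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= radius:
                w[i][j] = 1.0
    return w

def row_normalize(w):
    """Divide every row by its sum; rows without neighbors stay all-zero."""
    normalized = []
    for row in w:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized

# Hypothetical sample: three nearby points and one isolated point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
W = w_matrix(points, radius=1.5)
WN = row_normalize(W)
```

Every W-Query that recomputes W this way pays the quadratic pairwise scan again, which is the repeated work the talk is concerned with.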

5
5 W-Operations
Notion of neighbors, successors and predecessors.
[Figure: example neighborhood graph with nodes N1 to N7, shown as the input and as the output of each operation.]

Type            Operation                                Example call
Neighbors       get-all-neighbors()                      get-all-neighbors(N2)
Successor(s)    get-all-successors()                     get-all-successors(N2)
                get-a-successor()                        get-a-successor(N2, Node-id)
Predecessor(s)  get-all-predecessors()                   get-all-predecessors(N2)
                get-a-predecessor()                      get-a-predecessor(N2, Node-id)
Composite       get-all-predecessors-of-a-successor()    get-all-predecessors-of-a-successor(N2, Node-id)
                get-a-predecessor-of-a-successor()       get-a-predecessor-of-a-successor(N2, Node-id, Node-id)
Others          Delete()                                 Delete(N2, N1, N3)
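The operations listed above can be sketched on a small directed neighborhood graph. This is only an illustration under stated assumptions: successors are taken to be the nodes an edge points to, predecessors the nodes pointing in, and neighbors their union. The class name, the method signatures, the single-argument `delete` and the sample edges are hypothetical and are not taken from the slides.

```python
from collections import defaultdict

class NeighborhoodGraph:
    """Toy in-memory neighborhood graph (assumed semantics, see lead-in)."""

    def __init__(self, edges):
        self.succ = defaultdict(set)  # node -> nodes it points to
        self.pred = defaultdict(set)  # node -> nodes pointing to it
        for src, dst in edges:
            self.succ[src].add(dst)
            self.pred[dst].add(src)

    def get_all_successors(self, node):
        return set(self.succ.get(node, ()))

    def get_all_predecessors(self, node):
        return set(self.pred.get(node, ()))

    def get_all_neighbors(self, node):
        # Assumption: neighbors = successors plus predecessors.
        return self.get_all_successors(node) | self.get_all_predecessors(node)

    def get_all_predecessors_of_a_successor(self, node, succ_id):
        if succ_id not in self.succ.get(node, ()):
            raise KeyError(f"{succ_id} is not a successor of {node}")
        return self.get_all_predecessors(succ_id)

    def delete(self, node):
        # Remove the node and every edge that touches it.
        for s in self.succ.pop(node, set()):
            self.pred[s].discard(node)
        for p in self.pred.pop(node, set()):
            self.succ[p].discard(node)

# Hypothetical edges, reusing the node labels from the slide.
g = NeighborhoodGraph([("N1", "N2"), ("N2", "N3"), ("N2", "N5"), ("N4", "N3")])
successors_n2 = g.get_all_successors("N2")
predecessors_n2 = g.get_all_predecessors("N2")
neighbors_n2 = g.get_all_neighbors("N2")
preds_of_n3 = g.get_all_predecessors_of_a_successor("N2", "N3")
g.delete("N3")
successors_after_delete = g.get_all_successors("N2")
```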

6
6 W-Query Processing Algorithms
Algorithm CalcRipleyK
  get-all-neighbors(N)
  Frequency ← Size(get-all-successors(N))
Algorithm Hotspots_JI
  Stage 1: Hotspot Identification
    Identify a Seed.
    get-all-neighbors(Seed)
    get-all-successors(Seed)
  Stage 2: Hotspot Refinement
    P ← get-a-predecessor-of-a-successor(Seed, succ-id)
    If P correlates better with the Successor than with the Seed, remove the Successor from the successor list.
  Stage 3: Update Remaining Nodes
    For each S in Hotspot: Delete(S)
[Figure: Input and Output. The output plot shows K against Distance (Miles), with Complete Spatial Randomness as a reference.]
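The counting step of CalcRipleyK can be illustrated with a small sketch. It assumes the plain, uncorrected estimator K(d) = A / n² × (number of ordered pairs within distance d), which is one common form and may differ from what CrimeStat computes (for example in edge correction). The self-join is precomputed once as per-node sorted distance lists, so each distance band only needs a count; the function names, the study area value and the sample points are hypothetical.

```python
import bisect
import math

def build_self_join(points, max_dist):
    """Pre-compute, for each point, the sorted distances to every other
    point within max_dist (a stand-in for the stored self-join)."""
    index = []
    for i, p in enumerate(points):
        dists = []
        for j, q in enumerate(points):
            if i != j:
                d = math.dist(p, q)
                if d <= max_dist:
                    dists.append(d)
        dists.sort()
        index.append(dists)
    return index

def ripley_k(index, area, distances):
    """Uncorrected K(d) for each d; every d must be <= the max_dist
    used when the index was built."""
    n = len(index)
    result = {}
    for d in distances:
        # Frequency: total successor count over all nodes at distance <= d.
        frequency = sum(bisect.bisect_right(dists, d) for dists in index)
        result[d] = area * frequency / (n * n)
    return result

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
index = build_self_join(points, max_dist=2.0)
k_values = ripley_k(index, area=16.0, distances=[1.0, 1.5])
```

The point of the design is that the pairwise distance scan happens once in `build_self_join`; each additional distance band then reuses the stored lists.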

7
7 Problem Statement
Given: A spatial (crime) data warehouse and a set of W-Operations.
Find: A suitable spatial index type representation.
Objective: Minimize user response time.
Constraints: The dataset is updated infrequently. Concurrency control and recovery considerations are addressed separately.
[Figures: Input Data; Output & W Operations. Courtesy: Ned Levine and Associates]

8
8 Challenges
Scalability to Large Datasets
Query: Where are the burglary hotspots?
Dataset Size = 14852 Crime Reports
CrimeStat Libraries’ Response Time = 2 Hrs 30 Minutes

9
9 Related Work: Classification
SDBMS Tools

SDBMS Tool        Spatial Indices Supported   Spatial Self-Join Indices
CrimeStat         -                           NO
Oracle Spatial    R Tree, Quad Tree           NO
SQL Server 2008   Grid files                  NO
PostGIS           R Tree                      NO
ESRI ArcSDE       Grid files                  NO

Current R Tree family index structures perform repeated on-the-fly W computation, which is computationally expensive.
Our Approach: Pre-computed W (Self-join).

10
10 Contributions
Modeled W-Queries
  Proposed a set of W-Operations
  W-Query Processing Algorithms
Self-join Index
  Representation
  Algebraic Cost model: Operations
Experimental Evaluation
  Experimental Setup
  User Response time analysis

11
11 Self-Join Index: Representation
Key Observations
  W-Matrix ↔ Neighborhood Graph or Self-join.
  Classical Join Index: Edge List.
  Which representation can localize neighbor, successor and predecessor information?
Representations compared: Edge List and Self-Join Adjacency List Index.
  The adjacency list LOCALIZES successor, predecessor and row-normalized information.
  The edge list SCATTERS these.
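The contrast between the two representations can be sketched as follows. With an edge list, finding the successors or predecessors of one node means scanning every pair; with an adjacency list, one record per node holds its successors, predecessors and row-normalized weights together. The record layout (`succ`, `pred`, `w_norm`), the equal-weight normalization and the sample edges are assumptions made for illustration, not the structure used in the talk.

```python
from collections import defaultdict

def successors_from_edge_list(edge_list, node):
    """Edge list lookup: the whole list has to be scanned."""
    return [dst for src, dst in edge_list if src == node]

def build_adjacency_index(edge_list):
    """Group the self-join pairs into one record per node."""
    index = defaultdict(lambda: {"succ": [], "pred": [], "w_norm": {}})
    for src, dst in edge_list:
        index[src]["succ"].append(dst)
        index[dst]["pred"].append(src)
    for record in index.values():
        k = len(record["succ"])
        # Assumed binary weights, so each successor gets weight 1/k.
        record["w_norm"] = {s: 1.0 / k for s in record["succ"]}
    return dict(index)

edges = [("N1", "N2"), ("N2", "N3"), ("N2", "N5"), ("N4", "N3")]
scanned = successors_from_edge_list(edges, "N2")
adjacency = build_adjacency_index(edges)
record_n2 = adjacency["N2"]
```

A single lookup of `adjacency["N2"]` returns everything the W-Operations need for that node, which is the localization property the slide refers to.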

12
12 Algebraic Cost Models
Overview: Worst-case retrieval costs for W-Operations.
Notation:
a. $Z$: cost of accessing a single spatial instance from the Self-join Index.
b. $|S|$: average number of successors of a particular node.
c. $|P|$: average number of predecessors of a particular node.
d. $\mathrm{CRR}$: Connectivity Residue Ratio (adapted from CCAM), the probability that a node or a spatial instance is found on a particular page.
e. $|S_R|$: number of instances satisfying the Neighbor Relation $R$.
f. $|S_D|$: total size of the spatial dataset.
g. $\rho$: selectivity for a neighbor query with a neighbor size $R$, $\{|S_R| / (|S_D| - 1)\} \times |S_D|$.

13
13 Algebraic Cost Model: Self-Join Index
Node Retrieval Cost $= Z =$ 1 lookup cost for a Join Index.
get-all-neighbors(): = cost of selecting neighbors × probability that the instances that satisfy the neighbor relation are not in the same page
  Cost of selecting neighbors (for one data item) $= \{|S_R| / (|S_D| - 1)\} \cdot |S_D| \cdot Z$
  Probability that the instances that satisfy the neighbor relation are not in the same page $= (1 - \mathrm{CRR})$
  Total Cost of Find() $= \rho \cdot Z \cdot (1 - \mathrm{CRR}) \times |S_D|$
get-all-successors(): = number of successors × probability that the successors are not in the same page × cost of retrieving them
  $= |S| \times (1 - \mathrm{CRR}) \cdot Z$
get-all-predecessors(): = number of predecessors × probability that the predecessors are not in the same page × cost of retrieving them
  $= |P| \cdot (1 - \mathrm{CRR}) \cdot Z$

14
14 Algebraic Cost Model: Self-Join Index
get-all-predecessors-of-a-successor(): = probability that a successor is not in the same page × probability that all the predecessors of that successor are not in the same page
  $= (|P| \cdot Z + 1) \cdot (1 - \mathrm{CRR})$
get-a-predecessor-of-a-successor(): = probability that a successor is not in the same page × probability that all the predecessors of that successor are not in the same page
  $= 2 \cdot Z \cdot (1 - \mathrm{CRR})$
Delete(): = cost of retrieving all the successors
  $= Z \cdot (1 - \mathrm{CRR}) \cdot |\ |$
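To make the cost formulas concrete, the sketch below evaluates them for one set of parameter values. All numbers are invented for illustration and are not measurements from the talk. The get-all-neighbors() cost follows the "cost of selecting neighbors × (1 − CRR)" reading of the slide, and Delete() is left out because its final factor is not legible in the transcript; the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CostParams:
    z: float    # Z: cost of accessing one spatial instance
    s: float    # |S|: average number of successors per node
    p: float    # |P|: average number of predecessors per node
    crr: float  # CRR: Connectivity Residue Ratio
    s_r: int    # |S_R|: instances satisfying the neighbor relation R
    s_d: int    # |S_D|: total size of the spatial dataset

def miss_probability(c):
    """Probability that the needed instances are not on the same page."""
    return 1.0 - c.crr

def cost_get_all_neighbors(c):
    selecting = (c.s_r / (c.s_d - 1)) * c.s_d * c.z
    return selecting * miss_probability(c)

def cost_get_all_successors(c):
    return c.s * miss_probability(c) * c.z

def cost_get_all_predecessors(c):
    return c.p * miss_probability(c) * c.z

def cost_get_all_predecessors_of_a_successor(c):
    return (c.p * c.z + 1) * miss_probability(c)

def cost_get_a_predecessor_of_a_successor(c):
    return 2 * c.z * miss_probability(c)

# Invented values, chosen only so the arithmetic is easy to follow.
params = CostParams(z=1.0, s=4.0, p=4.0, crr=0.75, s_r=10, s_d=101)
costs = {
    "get-all-neighbors": cost_get_all_neighbors(params),
    "get-all-successors": cost_get_all_successors(params),
    "get-all-predecessors": cost_get_all_predecessors(params),
    "get-all-predecessors-of-a-successor":
        cost_get_all_predecessors_of_a_successor(params),
    "get-a-predecessor-of-a-successor":
        cost_get_a_predecessor_of_a_successor(params),
}
```

With these values every cost scales with (1 − CRR) = 0.25, so a higher CRR (better clustering of connected nodes on a page) lowers every operation's cost proportionally.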

15
15 Experimental Evaluation: Experiment Setup
Experiment Goals: Compare candidates on response times.
Metric of Comparison: Response time
Workload: Baltimore Auto theft ’96 (Crime Report ID, Location, Date)
Hardware: Intel Xeon 3.2 GHz, 4 GB RAM
Candidates: CrimeStat Libraries; R-Tree: Tree Matching; Self-Join Index
[Figure: experiment workflow with the components Dataset Size, Size of the Police Precincts, Self-Join Index Generator, SJALI, W Query Processing Algorithms, Candidate Algorithms (CalcRipleyK, Hotspots_JI) and Response time Analysis.]

16
16 Baltimore Auto-theft Dataset
Baltimore County Auto Thefts from Jan 1996 to Sept 1996: 14852 Crime Reports
[Figure: map of crime reports. Courtesy: Ned Levine and Associates (www.nedlevine.com)]

17
17 Response Time Analysis: Comparison with R-Tree
Questions:
  How does the response time of the Ripley’s K function Query vary with dataset size?
  How does the response time of the Hotspot Identification Query vary with dataset size?
Fixed Parameters
  Hotspots: Hotspot min-Size Threshold = 10 Crime Reports
  K Function: # of max-significance levels = 100
[Figures: Response time comparison for hotspot identification. Response time comparison for K-Function computation.]
Overall Trend: Self-join Index vs. R-Tree: response time reduced by a factor of 2.

18
18 Response Time Analysis: Comparison with CrimeStat
Questions:
  How does the response time of the Ripley’s K function Query vary with dataset size?
  How does the response time of the Hotspot Identification Query vary with dataset size?
Fixed Parameters
  Hotspots: Hotspot min-Size Threshold = 10 Crime Reports
  K Function: # of max-significance levels = 100
[Figures: Response time comparison for hotspot identification. Response time comparison for K-Function computation.]
Overall Trend: Self-join Index vs. CrimeStat: response time reduced by a factor of 40.

19
19 Conclusions
  W-Queries are important in Spatial Statistics, e.g. crime analysis, public health, transportation.
  W-Operations of W-Queries.
  The Self-join adjacency list index is more scalable than R-Tree and CrimeStat.
Future work
  Experimental Quantification
    I/O costs of W-Query Processing Algorithms.
    I/O Cost Models for W-Query Processing Algorithms.
  Further I/O Optimization
    Extracting optimal page access sequences for processing W-Queries.
    Optimizing the number of W-Query operations.
  Other W-Queries
    Local Moran’s I, Local Getis Ord.
  Larger datasets of >= 100000: will R-Tree be comparable?

20
20 Acknowledgment
Members of the Spatial Database and Data Mining Research Group, University of Minnesota, Twin-Cities.
This work was supported by grants from NSF, USDOD and NIJ.
Thank you for your questions, comments and patience!
