Published by Mekhi Astor. Modified over 2 years ago.

1 Should SDBMS support the Join Index?: A Case Study from CrimeStat
Pradeep Mohan¹, Shashi Shekhar¹, Ned Levine², Ronald E. Wilson³, Betsy George¹, Mete Celik¹
¹University of Minnesota, Twin-Cities, {mohan,shekhar,bgeorge,mcelik}@cs.umn.edu
²Ned Levine and Associates, Houston, TX, Ned@nedlevine.com
³National Institute of Justice, Washington, D.C., Ronald.Wilson@usdoj.gov

2 Outline
Introduction
Motivation
Problem Statement
Related Work
Contributions
Self-Join Index
Experimental Evaluation
Conclusion and Future Work

3 Motivation
Application Domains
  Crime Analysis: Where are the burglary hotspots?
  Epidemiology: Is cancer spatially clustered?
  Transportation: Which major highways require traffic calming measures?
An Example
  Query: Where are the burglary hotspots?

4 W-Matrix and W-Queries
W-Queries: queries that perform a repeated computation of the W-Matrix.
Examples of W-Queries: K-Function, Moran's I, Geary's C, G Statistic, Hotspots.
W_N: row-normalized W-Matrix.
[Figure labels: W-Matrix, Neighborhood Graph, W-Queries]
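
To make the row-normalized W-Matrix concrete, here is a minimal sketch. It assumes a distance-threshold neighbor relation; the function names, the `radius` parameter and the sample points are hypothetical and do not come from the slides.

```python
import math

def build_w_matrix(points, radius):
    """Binary W-matrix: w[i][j] = 1 if point j lies within `radius` of point i (i != j)."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= radius:
                w[i][j] = 1.0
    return w

def row_normalize(w):
    """Divide each row by its sum, so every non-empty row sums to 1 (W_N)."""
    normalized = []
    for row in w:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized

# Hypothetical sample: three nearby points and one isolated point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
w = build_w_matrix(points, radius=1.5)
wn = row_normalize(w)
```

Every W-Query that rebuilds `w` from raw coordinates repeats the quadratic pair scan above, which is the repeated computation the slide refers to.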

5 W-Operations
Notion of neighbors, successors and predecessors.
[Figure: neighborhood graph with nodes N1 to N7]
Operations (category: operation, example input)
  Neighbors: get-all-neighbors(), e.g. get-all-neighbors(N2)
  Successor(s): get-all-successors(), e.g. get-all-successors(N2)
  Successor(s): get-a-successor(), e.g. get-a-successor(N2, Node-id)
  Predecessor(s): get-all-predecessors(), e.g. get-all-predecessors(N2)
  Predecessor(s): get-a-predecessor(), e.g. get-a-predecessor(N2, Node-id)
  Composite: get-all-predecessors-of-a-successor(), e.g. get-all-predecessors-of-a-successor(N2, Node-id)
  Composite: get-a-predecessor-of-a-successor(), e.g. get-a-predecessor-of-a-successor(N2, Node-id, Node-id)
  Others: Delete(), e.g. Delete(N2, N1, N3)
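
The W-Operations can be sketched over a directed neighborhood graph as follows. This is an illustration under stated assumptions, not the authors' implementation: the class name, the reading of "neighbors" as the union of successors and predecessors, the single-argument `delete`, and the sample edges are all hypothetical.

```python
from collections import defaultdict

class NeighborhoodGraph:
    """Directed neighborhood graph holding successor and predecessor sets per node."""

    def __init__(self, edges):
        self.succ = defaultdict(set)
        self.pred = defaultdict(set)
        for u, v in edges:
            self.succ[u].add(v)
            self.pred[v].add(u)

    def get_all_successors(self, node):
        return set(self.succ.get(node, ()))

    def get_all_predecessors(self, node):
        return set(self.pred.get(node, ()))

    def get_all_neighbors(self, node):
        # Assumption: neighbors = successors plus predecessors.
        return self.get_all_successors(node) | self.get_all_predecessors(node)

    def get_all_predecessors_of_a_successor(self, node, succ_id):
        if succ_id not in self.succ.get(node, ()):
            raise KeyError(f"{succ_id} is not a successor of {node}")
        return self.get_all_predecessors(succ_id)

    def delete(self, node):
        # Assumption: deleting a node also removes it from its neighbors' lists.
        for s in self.succ.pop(node, set()):
            self.pred[s].discard(node)
        for p in self.pred.pop(node, set()):
            self.succ[p].discard(node)

# Hypothetical edges; the actual graph on the slide is not recoverable.
g = NeighborhoodGraph([("N2", "N1"), ("N2", "N3"), ("N4", "N1"), ("N5", "N2")])
successors_of_n2 = g.get_all_successors("N2")
preds_of_n1_via_n2 = g.get_all_predecessors_of_a_successor("N2", "N1")
```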

6 W-Query Processing Algorithms
Algorithm CalcRipleyK
  get-all-neighbors(N)
  Frequency ← Size(get-all-successors(N))
Algorithm Hotspots_JI
  Stage 1: Hotspot Identification
    Identify a Seed.
    get-all-neighbors(Seed)
    get-all-successors(Seed)
  Stage 2: Hotspot Refinement
    P ← get-a-predecessor-of-a-successor(Seed, succ-id)
    If P correlates better with the Successor than with the Seed, remove the Successor from the successor list.
  Stage 3: Update Remaining Nodes
    For each S in Hotspot: Delete(S)
[Figure: input and output of the algorithms, including a plot of K against Distance (Miles) compared with Complete Spatial Randomness]
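
The idea behind CalcRipleyK, counting neighbors from a precomputed self-join rather than from raw coordinates, can be sketched as below. The sketch assumes the uncorrected estimator K(d) = A / n² × (number of ordered pairs within distance d); the function names, the study-area value and the sample points are hypothetical.

```python
import bisect
import math

def build_self_join(points, max_dist):
    """Precompute, per point, the sorted distances to all other points within max_dist."""
    index = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q)
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= max_dist
        )
        index.append(dists)
    return index

def ripley_k(index, area, d):
    """Uncorrected Ripley's K at distance d, read from the precomputed index."""
    n = len(index)
    # bisect_right counts how many stored distances are <= d for each point.
    pairs = sum(bisect.bisect_right(dists, d) for dists in index)
    return area * pairs / (n * n)

# Hypothetical sample: four points in a study area of 25 square units.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
index = build_self_join(points, max_dist=2.0)
k_values = [ripley_k(index, 25.0, d) for d in (1.0, 1.5)]
```

The index is built once; each additional distance band then costs only one binary search per point.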

7 Problem Statement
Given: A spatial (crime) data warehouse and a set of W-Operations.
Find: A suitable spatial index type representation.
Objective: Minimize user response time.
Constraints: The dataset is updated infrequently. Concurrency control and recovery considerations are addressed separately.
[Figures: Input Data; Output & W Operations. Courtesy: Ned Levine and Associates]

8 Challenges
Scalability to Large Datasets
Query: Where are the burglary hotspots?
Dataset Size = 14852 Crime Reports
CrimeStat Libraries' Response Time = 2 Hrs 30 Minutes

9 Related Work: Classification
SDBMS Tool | Spatial Indices Supported | Spatial Self-Join Indices
CrimeStat | | NO
Oracle Spatial | R-Tree, Quad Tree | NO
SQL Server 2008 | Grid Files | NO
PostGIS | R-Tree | NO
ESRI ArcSDE | Grid Files | NO
Current R-Tree family index structures perform repeated on-the-fly W computation, which is computationally expensive.
Our Approach: Pre-computed W (Self-join).

10 Contributions
Modeled W-Queries
  Proposed a set of W-Operations
  W-Query Processing Algorithms
Self-join Index
  Representation
  Algebraic Cost Model: Operations
Experimental Evaluation
  Experimental Setup
  User Response Time Analysis

11 Self-Join Index: Representation
Key Observations
  W-Matrix: Neighborhood Graph or Self-join.
  Classical Join Index: Edge List.
Which representation can localize neighbor, successor and predecessor information?
  Adjacency List LOCALIZES successor, predecessor and row-normalized information.
  Edge List SCATTERS these.
[Figure labels: W-Matrix ↔ Neighborhood Graph ↔ Self-join; Edge List; Self-Join Adjacency List Index]
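
The contrast between the two representations can be illustrated with a small sketch. The record layout (`succ`, `pred`, `w`), the function names and the sample edges are hypothetical, and the row-normalized weight assumes a binary W-Matrix, so each successor of a node gets weight 1 / (number of successors).

```python
def successors_from_edge_list(edges, node):
    """Edge list: successor entries are scattered, so the whole list is scanned."""
    return [v for u, v in edges if u == node]

def to_adjacency_list(edges):
    """Adjacency list: one record per node holding successors, predecessors and weight."""
    index = {}
    for u, v in edges:
        index.setdefault(u, {"succ": [], "pred": [], "w": 0.0})["succ"].append(v)
        index.setdefault(v, {"succ": [], "pred": [], "w": 0.0})["pred"].append(u)
    for record in index.values():
        # Row-normalized weight shared by every successor of this node.
        record["w"] = 1.0 / len(record["succ"]) if record["succ"] else 0.0
    return index

# Hypothetical edges (a classical join index stored as pairs).
edges = [("N2", "N1"), ("N2", "N3"), ("N4", "N1")]
index = to_adjacency_list(edges)
record_n2 = index["N2"]  # a single lookup returns everything known about N2
```

With the edge list, answering get-all-successors or get-all-predecessors for one node touches every pair; with the adjacency list the same information sits in one record.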

12 Algebraic Cost Models Overview
Worst-case retrieval costs for W-Operations.
Notation:
a. Z: the cost of accessing a single spatial instance from the Self-join Index.
b. |S|: average number of successors of a particular node.
c. |P|: average number of predecessors of a particular node.
d. CRR: Connectivity Residue Ratio (adapted from CCAM), the probability that a node or a spatial instance is found on a particular page.
e. |S_R|: the number of instances satisfying the Neighbor Relation R.
f. |S_D|: the total size of the spatial dataset.
g. ρ: selectivity for a neighbor query with a neighbor size R, ρ = {|S_R| / (|S_D| - 1)} × |S_D|.

13 Algebraic Cost Model: Self-Join Index
Node retrieval cost = Z = 1 lookup cost for a Join Index.
get-all-neighbors():
  = cost of selecting neighbors × probability that the instances that satisfy the neighbor relation are not in the same page
  Cost of selecting neighbors (for one data item) = {|S_R| / (|S_D| - 1)} × |S_D| × Z = ρ × Z
  Probability that the instances that satisfy the neighbor relation are not in the same page = (1 - CRR)
  Total cost of Find() = ρ × Z × (1 - CRR) × |S_D|
get-all-successors():
  = number of successors × probability that the successors are not in the same page × cost of retrieving them
  = |S| × (1 - CRR) × Z
get-all-predecessors():
  = number of predecessors × probability that the predecessors are not in the same page × cost of retrieving them
  = |P| × (1 - CRR) × Z
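
The formulas on this slide can be evaluated directly. The sketch below transcribes them as written; the function and parameter names are hypothetical, and the numbers in the usage lines are made-up inputs, not measurements from the study.

```python
def selectivity(s_r, s_d):
    """rho = {|S_R| / (|S_D| - 1)} x |S_D|"""
    return (s_r / (s_d - 1)) * s_d

def cost_get_all_neighbors(s_r, s_d, z, crr):
    """Total cost of Find() = rho x Z x (1 - CRR) x |S_D|"""
    return selectivity(s_r, s_d) * z * (1 - crr) * s_d

def cost_get_all_successors(avg_successors, z, crr):
    """|S| x (1 - CRR) x Z"""
    return avg_successors * (1 - crr) * z

def cost_get_all_predecessors(avg_predecessors, z, crr):
    """|P| x (1 - CRR) x Z"""
    return avg_predecessors * (1 - crr) * z

# Made-up inputs: 4 successors on average, unit access cost, CRR = 0.5.
succ_cost = cost_get_all_successors(4, z=1.0, crr=0.5)
pred_cost = cost_get_all_predecessors(6, z=2.0, crr=0.25)
```

A higher CRR means related instances are more often co-located on a page, so every cost above shrinks by the factor (1 - CRR).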

14 Algebraic Cost Model: Self-Join Index
get-all-predecessors-of-a-successor():
  = probability that a successor is not in the same page × probability that all the predecessors of that successor are not in the same page
  = (|P| × Z + 1) × (1 - CRR)
get-a-predecessor-of-a-successor():
  = probability that a successor is not in the same page × probability that all the predecessors of that successor are not in the same page
  = 2 × Z × (1 - CRR)
Delete():
  = cost of retrieving all the successors
  = Z × (1 - CRR) × | |

15 Experimental Evaluation: Experiment Setup
Experiment Goals: Compare candidates on response times.
Metric of Comparison: Response time
Workload: Baltimore Auto theft '96 (Crime Report ID, Location, Date)
Hardware: Intel Xeon 3.2 GHz, 4 GB RAM
Candidates
  CrimeStat Libraries
  R-Tree: Tree Matching
  Self-Join Index
[Figure: experiment setup diagram with components Dataset Size, Size of the Police Precincts, Self-Join Index Generator, SJALI, W-Query Processing Algorithms, Candidate Algorithms (CalcRipleyK, Hotspots_JI), Response Time Analysis]

16 Baltimore Auto-theft Dataset
Baltimore County Auto Thefts from Jan 1996 to Sept 1996: 14852 Crime Reports
[Figure legend: Crime Report. Courtesy: Ned Levine and Associates (www.nedlevine.com)]

17 Response Time Analysis: Comparison with R-Tree
Questions:
  How does the response time of the Ripley's K function query vary with dataset size?
  How does the response time of the hotspot identification query vary with dataset size?
Fixed Parameters
  Hotspots: Hotspot min-Size Threshold = 10 Crime Reports
  K Function: # of max-significance levels = 100
[Figure: Response time comparison for hotspot identification.]
[Figure: Response time comparison for K-Function computation.]
Overall Trend: Self-join Index vs. R-Tree: response time reduced by a factor of 2.

18 Response Time Analysis: Comparison with CrimeStat
Questions:
  How does the response time of the Ripley's K function query vary with dataset size?
  How does the response time of the hotspot identification query vary with dataset size?
Fixed Parameters
  Hotspots: Hotspot min-Size Threshold = 10 Crime Reports
  K Function: # of max-significance levels = 100
[Figure: Response time comparison for hotspot identification.]
[Figure: Response time comparison for K-Function computation.]
Overall Trend: Self-join Index vs. CrimeStat: response time reduced by a factor of 40.

19 Conclusions
W-Queries are important in spatial statistics, e.g. crime analysis, public health, transportation.
Proposed W-Operations for W-Queries.
The self-join adjacency list index is more scalable than R-Tree and CrimeStat.
Future work
  Experimental quantification of the I/O costs of W-Query Processing Algorithms.
  I/O cost models for W-Query Processing Algorithms.
  Further I/O optimization
    Extracting optimal page access sequences for processing W-Queries.
    Optimizing the number of W-Query operations.
  Other W-Queries: Local Moran's I, Local Getis-Ord.
  Larger datasets (>= 100000 reports): will R-Tree be comparable?

20 Acknowledgment
Members of the Spatial Database and Data Mining Research Group, University of Minnesota, Twin-Cities.
This work was supported by grants from NSF, USDOD and NIJ.
Thank you for your questions, comments and patience!
