1
Spatial Online Sampling and Aggregation
Lu Wang, Ke Yi (Hong Kong University of Science and Technology)
Robert Christensen, Feifei Li (University of Utah)
2
Motivation
Geospatial data is being collected on a massive scale:
* Cell phone trajectories
* Weather data
* Wearables
Approximate aggregation is fast and often effective for this data.
Users are interested in interactive data analysis.
Approximate answers can be sufficient because they support interactive queries. Exact answers can always be computed offline.
3
Problem description
* Uniform sampling of spatio-temporal points for a query region Q
* Samples from the query are reported online
* We focus on reporting samples:
  * Aggregation estimates can be calculated using traditional statistical methods
  * Samples can be an input to other data processing algorithms
4
Running Example
[Figure: spatial data points labeled a through o, with the query region Q]
5
Baseline Methods: Query First
* Compute the result set P ∩ Q, where P is the data set, then extract samples from it upon request
* Must materialize all potential samples
* At least as costly as reporting all data in the query region
* Expensive when the result set is large
[Figure: running example; the results buffer holds e, f, g, h, i]
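A minimal sketch of Query First, assuming a plain Python list stands in for the R-tree and a linear scan stands in for the range query. The names `query_first` and `in_query` are hypothetical and not taken from the paper.

```python
import random

def query_first(points, in_query, rng=random):
    """Yield the points inside the query region in uniformly random order."""
    # Materialize every point inside the query region before any sample is returned.
    # (Assumption: a linear scan replaces the R-tree range query.)
    result = [p for p in points if in_query(p)]
    # Shuffling once gives samples without replacement on each request.
    rng.shuffle(result)
    yield from result
```

The whole result set is built before the first sample is available, which is why the cost grows with the number of points inside the query region.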
6
Baseline Methods: Sample First
* Upon request, pick a point p ∈ P uniformly at random; if p ∈ Q, report it, otherwise reject and repeat
* Probability of rejection = 1 − q/|P|, where q is the number of points inside Q
* Costly if q is small
Optimizations to Query First and Sample First have been implemented. They are described in the paper, and their effects are shown in the experimental results.
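Sample First is rejection sampling. The sketch below assumes the data set fits in a Python list and that at least one point lies inside the query region (otherwise the loop never yields). `sample_first` and `rejection_probability` are illustrative names only.

```python
import random

def sample_first(points, in_query, rng=random):
    """Yield uniform samples, with replacement, from the points inside the query."""
    while True:
        p = rng.choice(points)   # pick a point from the whole data set
        if in_query(p):          # accept only if it falls inside the query region
            yield p

def rejection_probability(points, in_query):
    """Fraction of draws that are rejected: 1 - q/|P|."""
    q = sum(1 for p in points if in_query(p))
    return 1.0 - q / len(points)
```

With 10 matching points out of 100, nine draws in ten are wasted on average, which illustrates why the method is costly for small query regions.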
7
LS-Tree
* We propose the Level Sampling Tree (LS-tree)
* An LS-tree is a collection of several R-trees of different sizes
* The largest R-tree contains all data points
* The smallest R-tree contains a very small sample of the data points
* Queries perform Query First operations on progressively larger R-trees
8
LS-Tree: Building the Level Sampling Tree
* Build an R-tree over P_0 = P
* Continue building an R-tree over P_{i+1}, obtained by sampling each point of P_i with probability 1/2
* End when P_ℓ is sufficiently small
[Figure: P_0 = {a, b, c, d, e, f, g, h, i, j, k, l, m, n, o}, P_1 = {a, d, f, g, j, m, n}, P_2 = {f, j, m}]
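The level construction can be sketched as follows, with each level kept as a plain list rather than an R-tree. The `min_size` stopping threshold is an assumption; the slides only say the last level should be "sufficiently small".

```python
import random

def build_levels(points, min_size=4, rng=random):
    """Return [P_0, P_1, ..., P_l], where P_0 is the full data set."""
    levels = [list(points)]
    # Keep halving (in expectation) until the last level is small enough.
    while len(levels[-1]) > min_size:
        # Each point of the previous level survives independently with probability 1/2.
        levels.append([p for p in levels[-1] if rng.random() < 0.5])
    return levels
```

Since the level sizes shrink geometrically, the expected total size of all levels is about twice the size of the original data.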
9
LS-Tree: Queries using LS-tree
* Use Query First on P_ℓ
* After the samples of P_i are exhausted, do Query First on P_{i−1}
* Only return samples not previously reported
[Figure: the query evaluated on P_2, then P_1, then P_0 of the running example]
LS-tree must pause to perform Query First queries, causing periodic delays in the online sampling. As the query progresses, state must be kept to remember which samples have been previously reported.
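A rough sketch of this query procedure, assuming the levels are plain lists ordered from largest to smallest and the points are distinct and hashable. `ls_query` is a hypothetical name, and a linear scan again replaces the R-tree range query.

```python
import random

def ls_query(levels, in_query, rng=random):
    """Yield query results level by level, smallest level first, without repeats."""
    reported = set()  # state remembering which samples were already reported
    for level in reversed(levels):
        # Query First on this level, skipping points already reported.
        fresh = [p for p in level if in_query(p) and p not in reported]
        rng.shuffle(fresh)
        for p in fresh:
            reported.add(p)
            yield p
```

The `reported` set is the per-query state mentioned above; it grows with the number of samples returned.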
10
LS-Tree
* LS-tree starts to report samples quickly
* Samples are reported without replacement
* Verifying without replacement becomes more costly as the query keeps asking for more samples
* Requires maintenance of multiple R-trees
* Expensive when supporting transactions
11
RS-tree: Modifying the structure of internal nodes
* Add to each internal node u a buffer of samples of the points in the subtree rooted at u
* Each p ∈ P_u, where P_u is the set of points in the subtree of u, has equal probability of being an entry in the sample buffer at u
12
RS-Tree: Filling the sample buffers in RS-tree
* Recursive algorithm: randomly decide how many samples to request from each child node
* Sample buffers are unneeded in leaf nodes
[Figure: RS-tree with Root (depth = 0), internal nodes (depth = 1), and Leaves (depth = 2), showing the samples and the minimum bounding box (MBB) of each node]
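One way to realize the recursive fill is sketched below. It is an assumption, not the paper's algorithm: the number of samples requested from each child is drawn from a multinomial split weighted by subtree size, samples are taken with replacement, and every leaf is non-empty. The `Node` class and the function names are hypothetical.

```python
import random

class Node:
    def __init__(self, children=None, points=None):
        self.children = children or []   # internal node: child nodes
        self.points = points or []       # leaf node: data points
        self.size = len(self.points) + sum(c.size for c in self.children)
        self.buffer = []                 # sample buffer (internal nodes only)

def draw(node, s, rng):
    """Draw s points uniformly, with replacement, from the subtree of node."""
    if not node.children:
        return [rng.choice(node.points) for _ in range(s)]
    # Randomly decide how many samples each child supplies, weighted by subtree size.
    counts = [0] * len(node.children)
    weights = [c.size for c in node.children]
    for i in rng.choices(range(len(node.children)), weights=weights, k=s):
        counts[i] += 1
    out = []
    for child, c in zip(node.children, counts):
        out.extend(draw(child, c, rng))
    rng.shuffle(out)
    return out

def fill_buffers(node, s, rng=random):
    """Fill a buffer of s samples at every internal node; leaves get none."""
    if node.children:
        node.buffer = draw(node, s, rng)
        for child in node.children:
            fill_buffers(child, s, rng)
```

Choosing a child with probability proportional to its size and then a uniform point inside it gives every point of the subtree the same chance of entering the buffer.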
13
RS-Tree: Queries over RS-tree
* Start at the root node. Scan the sample buffer and report the samples within Q
* Descend the tree, adding only nodes that intersect Q
* Obtain samples from nodes with appropriate probability during sampling
[Figure: the tree at Root (depth = 0), depth = 1, and Leaves (depth = 2), with the sample buffers scanned at each step]
Let's consider this instance in more detail.
14
RS-Tree
* We do not know how many points inside a node satisfy the query
* We sample each node with probability proportional to the total number of items in the node
* If the sampled point p ∈ Q, accept; otherwise reject and repeat
[Figure: running example; each candidate node is labeled with the probability with which it is sampled]
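The size-proportional choice followed by rejection can be sketched as below. As a simplifying assumption, each frontier node is represented by the full list of its points rather than by a sample buffer, and at least one point satisfies the query. `sample_from_nodes` is an illustrative name.

```python
import random

def sample_from_nodes(nodes, in_query, rng=random):
    """nodes: list of non-empty point lists, one per node intersecting the query."""
    sizes = [len(n) for n in nodes]
    while True:
        # Pick a node with probability proportional to its total number of items.
        node = rng.choices(nodes, weights=sizes, k=1)[0]
        # Pick a uniform point in that node; accept only if it satisfies the query.
        p = rng.choice(node)
        if in_query(p):
            yield p
```

A point in node i is drawn with probability (n_i/N)(1/n_i) = 1/N, so every accepted point is uniform over the points that satisfy the query.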
15
RS-Tree: Estimating query size
* An estimate of the query size is important for aggregations such as SUM and COUNT
* Use the frontier list of a query cursor
* If node u lies entirely inside Q, all |P_u| of its points are inside Q
* Otherwise estimate |P_u ∩ Q| using the uniform samples of the node: |P_u ∩ Q| ≈ |P_u| · |samples(u) ∩ Q| / |samples(u)|
* Proof and calculations of the standard deviation are in the paper
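The estimator above can be written out as a short sketch. The frontier is assumed to be a list of `(size, fully_inside, samples)` tuples, a hypothetical representation chosen only for illustration.

```python
def estimate_query_size(frontier, in_query):
    """Estimate the number of points inside the query from the frontier nodes."""
    total = 0.0
    for size, fully_inside, samples in frontier:
        if fully_inside:
            # Node lies entirely inside the query: all of its points count.
            total += size
        elif samples:
            # Scale the node size by the fraction of its samples inside the query.
            hits = sum(1 for p in samples if in_query(p))
            total += size * hits / len(samples)
    return total
```

A SUM estimate then follows by multiplying the estimated query size by the mean of the sampled values.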
16
RS-Tree
* Queries avoid disk access
* Samples start being reported quickly
* Minimal added overhead (versus R-tree)
* Algorithms for efficient modifications to the RS-tree
* Samples reported with replacement
17
Experimental Setup
* OpenStreetMap data (2.2 billion points)
* RS-tree: Range Sampling Tree implementation
* LS-tree: Level Sampling Tree implementation
* Random Path: Sample First over R-tree
* Range Report: Query First implementation
* Random Shuffle: Sample First with linear scan
* q = number of items inside query region Q
* k = number of samples requested for the query
We will review a few experiments. A complete experimental section exists in the paper.
18
Analysis of each sampling method (Time & I/O)
19
Cost of Building
* LS-tree must build multiple R-trees, taking additional time and space.
* RS-tree builds an R-tree and fills the sample buffer of each inner node.

Index      Size
Raw data   50 GB
R-tree     75 GB
RS-tree    80 GB
LS-tree    151 GB
20
Aggregation query
* LS-tree starts reporting faster than RS-tree
* RS-tree converges faster than LS-tree
* Exact answer is about 1800
21
Query speed over large data (2.2 billion points)
[Figure, two panels: (1) varying the sampling percentage, with q = 2.2 million; (2) varying the number of points in the full query, with k = 10,000]
22
RS-tree in real world application
RS-tree is used to support fast spatial sampling in the project STORM: Spatio-Temporal Online Reasoning and Management.
Best Demo Award, SIGMOD 2015
23
Conclusion
* Developed spatial indexes supporting sampling:
  * LS-tree: stores multiple levels of samples in a collection of R-tree indexes
  * RS-tree: caches samples in each R-tree node
* Performed an in-depth analysis of each sampling technique
* Experimentally validated the superior performance of LS-tree and RS-tree versus baseline methods
24
Thank You
25
RS-tree time to estimate size
Random regions of various sizes are queried. Using 10% of the time needed to count the size exactly with an R-tree, we estimate the size and calculate the error.
26
RS-tree diagram
27
Vary sample buffer size
* Small data set (24 million points)
* Query returns 400k samples
28
Aggregation query
* Exact answer is about 1800
* LS-tree starts reporting samples most quickly, but converges slowly
* RS-tree reports samples faster once they are being reported
* RS-tree converges faster than LS-tree
29
Analysis of each sampling method (Time & I/O)
30
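Hoeffding's inequality
Let X_1, ..., X_n be independent random variables bounded by the interval [0, 1], i.e., 0 ≤ X_i ≤ 1. We define the empirical mean of these variables by
\[ \bar{X} = \frac{1}{n}\,(X_1 + X_2 + \dots + X_n). \]
Then, for any t > 0:
\[ \Pr\left(\bar{X} - \mathrm{E}[\bar{X}] \ge t\right) \le e^{-2nt^2}. \]
If we know a_i < X_i < b_i:
\[ \Pr\left(\bar{X} - \mathrm{E}[\bar{X}] \ge t\right) \le \exp\left(-\frac{2n^2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right). \]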
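For online aggregation, Hoeffding's inequality gives a distribution-free error bound on a running sample mean. The sketch below assumes all values lie in a common known range [lo, hi]. Setting the two-sided bound 2·exp(−2nt²/(hi − lo)²) equal to a failure probability delta and solving for t gives the half-width. The function name is hypothetical.

```python
import math

def hoeffding_half_width(n, delta, lo=0.0, hi=1.0):
    """Half-width t such that |sample mean - true mean| < t with probability >= 1 - delta."""
    # Solve 2 * exp(-2 * n * t**2 / (hi - lo)**2) = delta for t.
    return (hi - lo) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
```

The half-width shrinks as 1/sqrt(n), so quadrupling the number of samples halves the error bound.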
31
Central limit theorem
Let {X_1, ..., X_n} be a random sample of size n, that is, a sequence of independent and identically distributed (i.i.d.) random variables drawn from a distribution with expected value µ and finite variance σ². Suppose we are interested in the sample average
\[ \bar{X}_n = \frac{X_1 + \dots + X_n}{n}. \]
As n grows, \(\sqrt{n}\,(\bar{X}_n - \mu)\) converges in distribution to the normal distribution N(0, σ²).
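In practice the central limit theorem is what justifies attaching an approximate confidence interval to an estimate computed from samples. A minimal sketch, assuming i.i.d. samples, at least two of them, and the usual normal quantile z = 1.96 for roughly 95% coverage. `clt_interval` is an illustrative name.

```python
import math

def clt_interval(samples, z=1.96):
    """Approximate confidence interval for the mean of i.i.d. samples."""
    n = len(samples)
    mean = sum(samples) / n
    # Unbiased sample variance (requires n >= 2).
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean - half, mean + half
```

The interval is only approximate for small n; it tightens as more samples are reported.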