Spatial Online Sampling and Aggregation


1 Spatial Online Sampling and Aggregation
Lu Wang, Ke Yi (Hong Kong University of Science and Technology); Robert Christensen, Feifei Li (University of Utah)

2 Motivation
Geospatial data is being collected on a massive scale: cell phone trajectories, weather data, wearables. Approximate aggregation is fast and often effective for this data. Users are interested in interactive data analysis. Approximate answers can be sufficient because they support interactive queries; exact answers can always be computed offline.

3 Problem description
Uniform sampling of spatio-temporal points for a query region Q. Samples from the query are reported online. We focus on reporting samples: aggregation estimates can be calculated from them using traditional statistical methods, and the samples can serve as input to other data processing algorithms.
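As one illustration of the "traditional statistical methods" mentioned on this slide, the sketch below maintains a running AVG estimate and a CLT-based confidence half-width as samples arrive. The name `online_average` and the 95% z-value are illustrative assumptions, not taken from the slides.

```python
import math

def online_average(sample_stream, z=1.96):
    """Yield (n, mean, half_width) after each sample (Welford's update).

    half_width is a CLT-based confidence half-width (z = 1.96 for ~95%);
    it is infinite until at least two samples have been seen.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in sample_stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # running sum of squared deviations
        if n > 1:
            sample_var = m2 / (n - 1)
            half_width = z * math.sqrt(sample_var / n)
        else:
            half_width = float("inf")
        yield n, mean, half_width

# Any iterable of sampled values works; a list stands in for the sampler here.
estimates = list(online_average([1.0, 2.0, 3.0, 4.0]))
```

A consumer can stop pulling samples as soon as the half-width is small enough for its purpose, which is what makes online sampling useful for interactive analysis.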

4 Running Example
[Figure: spatial data points with point labels a through o, and a query region Q.]

5 Baseline Methods: Query First
Calculate S = P ∩ Q, then extract samples from S upon request. This must materialize all potential samples, so it is at least as costly as reporting all data in Q and is expensive when Q is large.
[Figure: the running example; the results buffer holds e, f, g, h, i.]
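A minimal sketch of Query First, assuming 2-D points stored as `(x, y)` tuples and a rectangular query `(x1, y1, x2, y2)`. A linear scan stands in for the R-tree range query, and the names `in_region` and `query_first` are hypothetical.

```python
import random

def in_region(p, q):
    """True if point p = (x, y) lies in rectangle q = (x1, y1, x2, y2)."""
    x, y = p
    x1, y1, x2, y2 = q
    return x1 <= x <= x2 and y1 <= y <= y2

def query_first(points, q, rng=random):
    """Materialise S = P ∩ Q, then hand out its points in random order."""
    s = [p for p in points if in_region(p, q)]  # cost grows with |S|
    rng.shuffle(s)
    yield from s

points = [(0, 0), (1, 1), (2, 2), (5, 5), (9, 9)]
samples = list(query_first(points, (0, 0, 2, 2), random.Random(1)))
```

The whole of S is built before the first sample is returned, which is the cost the slide points out for large Q.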

6 Baseline Methods: Sample First
Upon request, pick p ∈ P; if p ∈ Q report it, otherwise reject and repeat. The probability of rejection is 1 − |Q|/|P|, where |Q| counts the points inside Q, so this is costly if Q is small. Optimizations to Query First and Sample First have been implemented; they are described in the paper, and their effects are described in the experimental results.
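Sample First is plain rejection sampling. The sketch below assumes `(x, y)` tuples and a rectangular query; `sample_first` and `expected_draws` are hypothetical names. With the 15 points of the running example and the 5 points shown in the results buffer, each accepted sample costs 3 draws on average.

```python
import random

def sample_first(points, q, rng=random):
    """Endless stream of uniform samples (with replacement) from P ∩ Q."""
    x1, y1, x2, y2 = q
    while True:  # never yields if no point of P lies in Q
        p = rng.choice(points)
        if x1 <= p[0] <= x2 and y1 <= p[1] <= y2:
            yield p  # accepted
        # otherwise rejected: draw again

def expected_draws(n_points, n_in_query):
    """Mean number of draws per accepted sample, |P| / |Q|."""
    return n_points / n_in_query

points = [(0, 0), (1, 1), (2, 2), (5, 5), (9, 9)]
stream = sample_first(points, (0, 0, 2, 2), random.Random(0))
first_ten = [next(stream) for _ in range(10)]
```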

7 LS-tree
We propose the Level Sampling Tree (LS-tree). The LS-tree is a collection of several R-trees of different sizes. The largest R-tree contains all data points; the smallest contains a very small sample of the data points. Queries perform Query First operations on progressively larger R-trees.

8 LS-tree: Building the Level Sampling Tree
Build an R-tree over S_0 = P. Continue by building an R-tree over S_{i+1}, obtained by sampling each point of S_i with probability 1/2. End when S_ℓ is sufficiently small.
[Figure: in the running example, S_0 = {a, b, c, d, e, f, g, h, i, j, k, l, m, n, o}, S_1 = {a, d, f, g, j, m, n}, S_2 = {f, j, m}.]
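The level construction can be sketched as repeated coin flips. Plain lists stand in for the per-level R-trees, and `min_size` is an assumed stand-in for "sufficiently small"; neither detail comes from the slides.

```python
import random

def build_levels(points, min_size=3, rng=random):
    """Return [S_0, S_1, ..., S_l], where S_0 = P and each point of S_i
    is kept in S_{i+1} independently with probability 1/2."""
    levels = [list(points)]
    while len(levels[-1]) > min_size:
        levels.append([p for p in levels[-1] if rng.random() < 0.5])
    return levels

levels = build_levels("abcdefghijklmno", min_size=3, rng=random.Random(42))
```

Each level is expected to be half the size of the one before it, so the total number of stored points is about twice |P| in expectation.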

9 LS-tree: Queries using the LS-tree
Use Query First on S_ℓ. After the samples from S_i are exhausted, do Query First on S_{i−1}, returning only samples not previously reported. The LS-tree must pause to perform Query First queries, causing periodic delays in the online sampling. As the query progresses, state must be kept to remember which samples have been previously reported.
[Figure: the query evaluated on S_2, S_1 and S_0 of the running example.]
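A sketch of the query loop, assuming distinct `(x, y)` points, a rectangular query, and levels given as plain lists ordered from S_0 (all points) to S_ℓ (smallest). A linear scan again replaces the per-level R-tree query, and `ls_query` is a hypothetical name.

```python
import random

def ls_query(levels, q, rng=random):
    """Report the points of P ∩ Q without replacement, smallest level first."""
    x1, y1, x2, y2 = q
    reported = set()  # state remembering what was already returned
    for level in reversed(levels):  # S_l first, S_0 last
        # "Query First" on this level; this is where the stream pauses.
        hits = [p for p in level
                if x1 <= p[0] <= x2 and y1 <= p[1] <= y2
                and p not in reported]
        rng.shuffle(hits)
        for p in hits:
            reported.add(p)
            yield p

levels = [
    [(0, 0), (1, 1), (2, 2), (5, 5), (9, 9)],  # S_0 = P
    [(1, 1), (5, 5)],                          # S_1
    [(1, 1)],                                  # S_2
]
order = list(ls_query(levels, (0, 0, 2, 2), random.Random(3)))
```

The `reported` set grows with the number of samples returned, which mirrors the bookkeeping cost noted on the slide.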

10 LS-tree
The LS-tree starts to report samples quickly. Samples are reported without replacement; verifying this becomes more costly as the query asks for more and more samples. It requires maintenance of multiple R-trees, which is expensive when supporting transactions.

11 RS-tree: Modifying the structure of internal nodes
Add a buffer of samples of points from the subtree rooted at the node u. Each p ∈ P_u has equal probability of being an entry in the sample buffer at u.

12 RS-tree: Filling the sample buffers
A recursive algorithm randomly decides how many samples to request from each child node. Sample buffers are unneeded in leaf nodes.
[Figure: an RS-tree over the running example with the root at depth 0, inner nodes at depth 1 and leaves at depth 2, showing the samples stored at the nodes and the MBB of a given node.]
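One way the recursive filling could look is sketched below. It assumes samples are drawn with replacement and that requests are split among children at random, weighted by subtree size; the slides do not spell out these details, and the `Node` class, `draw` and `fill_buffers` are hypothetical.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    points: list = field(default_factory=list)    # filled only for leaves
    children: list = field(default_factory=list)  # filled only for inner nodes
    buffer: list = field(default_factory=list)    # sample buffer (inner nodes)

    @property
    def size(self):
        """|P_u|, recomputed here for brevity; a real index would store it."""
        if not self.children:
            return len(self.points)
        return sum(c.size for c in self.children)

def draw(node, s, rng):
    """Return s uniform samples (with replacement) from node's subtree.
    Assumes every leaf holds at least one point."""
    if not node.children:
        return [rng.choice(node.points) for _ in range(s)]
    # Randomly decide how many of the s requests go to each child.
    picks = rng.choices(range(len(node.children)),
                        weights=[c.size for c in node.children], k=s)
    out = []
    for i, child in enumerate(node.children):
        out.extend(draw(child, picks.count(i), rng))
    rng.shuffle(out)  # buffer order should not reveal the child of origin
    return out

def fill_buffers(node, s, rng):
    """Fill the buffer of every inner node; leaves need none."""
    if node.children:
        node.buffer = draw(node, s, rng)
        for child in node.children:
            fill_buffers(child, s, rng)

root = Node(children=[Node(points=list("abc")), Node(points=list("defg"))])
fill_buffers(root, 5, random.Random(7))
```

Choosing a child with probability proportional to its size and then a uniform point below it gives every point of the subtree the same chance, 1/|P_u|, of filling a buffer slot.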

13 RS-tree: Queries over the RS-tree
Start at the root node. Scan the sample buffer and report the samples within Q. Descend the tree, adding only nodes which intersect Q. During sampling, obtain samples from the nodes with the appropriate probability.
[Figure: the RS-tree of the running example (root at depth 0, depth 1, leaves at depth 2) as the query descends.]
Let's consider this instance in more detail.

14 RS-tree
We do not know how many points inside a node satisfy the query. We sample each node with probability proportional to the total number of items in the node, then pick a point p from it: if p ∈ Q accept, otherwise reject and repeat.
[Figure: in the running example, the nodes are sampled with probability 3/7 and 4/7.]
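The weighted pick followed by accept/reject can be sketched as follows. Each node is represented simply as a list of its `(x, y)` points and the query as a rectangle; both are simplifying assumptions, and `sample_from_nodes` is a hypothetical name.

```python
import random

def sample_from_nodes(nodes, q, rng=random):
    """Draw one uniform sample from the points of `nodes` that lie in Q.

    `nodes` is a list of point lists standing in for tree nodes whose
    bounding boxes intersect Q; q = (x1, y1, x2, y2).
    """
    x1, y1, x2, y2 = q
    weights = [len(n) for n in nodes]  # total items per node, not items in Q
    while True:  # assumes at least one point lies in Q
        node = rng.choices(nodes, weights=weights, k=1)[0]
        p = rng.choice(node)
        if x1 <= p[0] <= x2 and y1 <= p[1] <= y2:
            return p  # accept
        # reject and repeat

# Nodes with 3 and 4 points are picked w.p. 3/7 and 4/7.
nodes = [[(0, 0), (1, 1), (8, 8)], [(2, 2), (3, 3), (7, 7), (9, 9)]]
sample = sample_from_nodes(nodes, (0, 0, 3, 3), random.Random(0))
```

Every point is proposed with the same probability 1/7 per trial, so the accepted point is uniform over the points inside Q even though the per-node counts inside Q are unknown.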

15 RS-tree
An estimate of the query size is important for aggregations such as SUM and COUNT. To estimate the query size, use the frontier list of a query cursor. If R_u ⊆ Q, all elements of P_u are inside Q. Otherwise, estimate |P_u ∩ Q| using the uniform samples of the node: |P_u| · |samples_u ∩ Q| / |samples_u|. The proof and the calculation of the standard deviation are in the paper.
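The size estimate over a frontier list can be sketched as below. Each frontier entry is assumed to be a tuple `(mbr, size, samples)` with rectangular bounding boxes; this layout and the function names are illustrative, not the paper's data structure.

```python
def contains(outer, inner):
    """True if rectangle `inner` lies entirely inside rectangle `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def in_region(p, q):
    """True if point p = (x, y) lies in rectangle q = (x1, y1, x2, y2)."""
    return q[0] <= p[0] <= q[2] and q[1] <= p[1] <= q[3]

def estimate_query_size(frontier, q):
    """Sum exact counts for covered nodes and sample-based estimates otherwise."""
    total = 0.0
    for mbr, size, samples in frontier:
        if contains(q, mbr):
            total += size  # R_u inside Q: every point of P_u counts
        elif samples:
            hits = sum(in_region(p, q) for p in samples)
            total += size * hits / len(samples)  # |P_u| * |samples ∩ Q| / |samples|
    return total

frontier = [
    ((1, 1, 2, 2), 10, []),                               # fully inside Q
    ((3, 3, 9, 9), 8, [(3, 3), (6, 6), (7, 7), (9, 9)]),  # straddles Q
]
estimate = estimate_query_size(frontier, (0, 0, 4, 4))
```

Here the first node contributes its full 10 points and the second contributes 8 × 1/4, since one of its four samples falls inside Q.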

16 RS-tree
Queries avoid disk access. Samples start being reported quickly. Minimal added overhead (versus an R-tree). Algorithms exist for efficient modifications to the RS-tree. Samples are reported with replacement.

17 Experimental Setup
Data: OpenStreetMap (2.2 billion points).
Methods compared:
RS-tree: Range Sampling Tree implementation
LS-tree: Level Sampling Tree implementation
Random Path: Sample First over an R-tree
Range Report: Query First implementation
Random Shuffle: Sample First with a linear scan
Notation: q = number of items inside the query region Q; k = number of samples requested for the query.
We will review a few experiments; the complete experimental section is in the paper.

18 Analysis of each sampling method (Time & I/O)

19 Cost of Building
The LS-tree must build multiple R-trees, taking additional time and space. The RS-tree builds an R-tree and fills a sample buffer for each inner node.

Index      Size
Raw data   50 GB
R-tree     75 GB
RS-tree    80 GB
LS-tree    151 GB

20 Aggregation query
The LS-tree starts reporting faster than the RS-tree. The RS-tree converges faster than the LS-tree. The exact answer is about 1800.

21 Query speed over large data (2.2 billion points)
[Figures: one plot varies the sampling percentage (q = 2.2 million); the other varies the number of points in the full query (k = 10,000).]

22 RS-tree in real world application
The RS-tree is used to support fast spatial samples in the project STORM: Spatio-Temporal Online Reasoning and Management (Best Demo Award, SIGMOD 2015).

23 Conclusion
Developed spatial indexes supporting sampling. LS-tree: stores multiple levels of samples in a collection of R-tree indexes. RS-tree: caches samples in each R-tree node. Performed an in-depth analysis of each sampling technique. Experimentally validated the superior performance of the LS-tree and RS-tree versus the baseline methods.

24 Conclusion
Thank You

25 RS-tree time to estimate size
Random regions of various sizes are queried. Using 10% of the time needed to count the size exactly with an R-tree, we estimate the size and calculate the error.

26 RS-tree diagram
[Figure: RS-tree diagram.]

27 Vary sample buffer size
Small data set (24 million points). The query returns 400k samples.

28 Aggregation query
The exact answer is about 1800. The LS-tree starts reporting samples most quickly, but converges slowly. The RS-tree reports samples faster once it starts reporting, and converges faster than the LS-tree.

29 Analysis of each sampling method (Time & I/O)

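30 Hoeffding's inequality
Let $X_1, \dots, X_n$ be independent random variables bounded by the interval $[0, 1]$, i.e., $0 \le X_i \le 1$. We define the empirical mean of these variables by
\[ \bar{X} = \frac{1}{n}\,(X_1 + X_2 + \dots + X_n). \]
Then, for any $t > 0$:
\[ \Pr\big(\bar{X} - \mathrm{E}[\bar{X}] \ge t\big) \le e^{-2nt^2}. \]
If we know $a_i \le X_i \le b_i$:
\[ \Pr\big(|\bar{X} - \mathrm{E}[\bar{X}]| \ge t\big) \le 2\exp\!\left(-\frac{2n^2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right). \]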
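To make the bound concrete: for variables in [0, 1], the two-sided form of Hoeffding's inequality, Pr(|mean − E[mean]| ≥ eps) ≤ 2·exp(−2·n·eps²), can be inverted to get a sample count that guarantees error at most eps with probability at least 1 − delta. The function names below are illustrative.

```python
import math

def hoeffding_bound(n, eps):
    """Upper bound on Pr(|mean - E[mean]| >= eps) for n variables in [0, 1]."""
    return min(1.0, 2 * math.exp(-2 * n * eps ** 2))

def hoeffding_sample_size(eps, delta):
    """Smallest n for which the two-sided bound is at most delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

n_needed = hoeffding_sample_size(eps=0.05, delta=0.05)
```

For eps = delta = 0.05 this asks for 738 samples, independent of how many points fall inside the query region.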

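31 Central limit theorem
Let $\{X_1, \dots, X_n\}$ be a random sample of size $n$, that is, a sequence of independent and identically distributed (iid) random variables drawn from a distribution with expected value $\mu$ and finite variance $\sigma^2$. Suppose we are interested in the sample average
\[ \bar{X}_n = \frac{1}{n}\,(X_1 + \dots + X_n). \]
The theorem states that $\sqrt{n}\,(\bar{X}_n - \mu)$ converges in distribution to a normal $N(0, \sigma^2)$ as $n \to \infty$.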

