Range Queries on Uncertain Data

1 Range Queries on Uncertain Data
Jian Li, Tsinghua University
Haitao Wang, Utah State University
ISAAC 2014

2 One dimensional range queries
Input: a set of points on a line
Given a query interval I, return the points in the interval I
A trivial solution: balanced binary search tree
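
A minimal sketch of the query on this slide, assuming a static point set. A sorted list with binary search stands in for the balanced binary search tree named on the slide (a swapped-in structure, chosen because Python's standard library has no balanced tree). The name `range_query` and the sample points are illustrative only.

```python
from bisect import bisect_left, bisect_right

def range_query(sorted_points, lo, hi):
    """Return all points p with lo <= p <= hi from a sorted list."""
    left = bisect_left(sorted_points, lo)    # first index with value >= lo
    right = bisect_right(sorted_points, hi)  # first index with value > hi
    return sorted_points[left:right]

points = sorted([9.0, 2.5, 4.0, 7.5, 1.0])
print(range_query(points, 2.0, 8.0))
```

Each query costs O(log n) for the two searches plus the size of the output.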

3 An uncertain point p
p can appear in different locations with probabilities
Given a query interval I, Pr[p ∈ I] is the probability that p lies in I, called the I-probability of p
[Figure: p has four possible locations with probabilities 0.1, 0.3, 0.2, 0.4; for the interval I shown, Pr[p ∈ I] = 0.5]
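
A small sketch of the I-probability for a point with finitely many possible locations. Only the four probabilities come from the slide; the coordinates and the query interval are hypothetical, picked so that the interval covers the locations with probabilities 0.3 and 0.2.

```python
def interval_probability(locations, lo, hi):
    """Sum the probabilities of the locations that fall inside [lo, hi]."""
    return sum(prob for x, prob in locations if lo <= x <= hi)

# (coordinate, probability) pairs; the coordinates are made up for illustration.
p = [(1.0, 0.1), (2.0, 0.3), (3.0, 0.2), (5.0, 0.4)]
print(interval_probability(p, 1.5, 3.5))
```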

4 An uncertain point p: A general case
The location of p is specified by its PDF (probability density function) f_p(x), which is a step function or histogram
Given a query interval I, Pr[p ∈ I] is the I-probability of p
[Figure: histogram f_p(x) against x, with the levels 0.25, 0.22, 0.2, 0.15 and a query interval I]

5 The cumulative distribution function (CDF)
C_p(x) is a piecewise linear function, because f_p(x) is a step function
[Figure: f_p(x) and C_p(x) against x, with C_p(x) rising to 1 and the value C_p(x') marked at a point x']

6 Computing the š¼-probability using CDF
C_p(x) is a piecewise linear function
A query interval I = [x_l, x_r]
Pr[p ∈ I] = C_p(x_r) - C_p(x_l)
[Figure: C_p(x) against x, with C_p(x_l) and C_p(x_r) marked at x_l and x_r]
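
The formula Pr[p ∈ I] = C_p(x_r) - C_p(x_l) can be sketched for a histogram PDF as follows. The representation (a list of breakpoints plus one height per bin) and the sample histogram are assumptions made for this example, not taken from the slides.

```python
def histogram_cdf(breaks, heights, x):
    """C_p(x) for a step-function PDF with heights[i] on [breaks[i], breaks[i+1]]."""
    total = 0.0
    for i, h in enumerate(heights):
        left, right = breaks[i], breaks[i + 1]
        if x <= left:
            break
        total += h * (min(x, right) - left)  # area of this bin to the left of x
    return total

def interval_probability(breaks, heights, x_l, x_r):
    """Pr[p in [x_l, x_r]] as a difference of two CDF values."""
    return histogram_cdf(breaks, heights, x_r) - histogram_cdf(breaks, heights, x_l)

# Hypothetical histogram: bin widths 1, 2, 1, heights chosen so the area is 1.
breaks = [0.0, 1.0, 3.0, 4.0]
heights = [0.2, 0.3, 0.2]
print(interval_probability(breaks, heights, 0.5, 2.0))
```

Because each bin contributes a linear term in x, the CDF computed here is piecewise linear, as the slide states.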

7 Range query problems on uncertain points
Input: a set P of n uncertain points
For any query interval I:
  top-1 query: return the point in P with the largest I-probability
  top-k query: return the k points in P with the largest I-probabilities
  threshold query: given any threshold t, return the points in P with I-probabilities ≥ t
Goal: build data structures on P to quickly answer these queries
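
For reference, the three query types can be answered by a linear scan with no preprocessing. The sketch below is such a baseline, not the data structures of the talk, and it assumes every uncertain point is uniform on an interval (a, b) to keep the probability function short. All names and sample data are hypothetical.

```python
import heapq

def uniform_cdf(a, b, x):
    """CDF of the uniform distribution on [a, b], clamped to [0, 1]."""
    return min(1.0, max(0.0, (x - a) / (b - a)))

def prob(point, x_l, x_r):
    """I-probability of a uniform point for I = [x_l, x_r]."""
    a, b = point
    return uniform_cdf(a, b, x_r) - uniform_cdf(a, b, x_l)

def top_1(points, x_l, x_r):
    return max(points, key=lambda pt: prob(pt, x_l, x_r))

def top_k(points, x_l, x_r, k):
    return heapq.nlargest(k, points, key=lambda pt: prob(pt, x_l, x_r))

def threshold(points, x_l, x_r, t):
    return [pt for pt in points if prob(pt, x_l, x_r) >= t]

P = [(0.0, 4.0), (1.0, 3.0), (2.0, 10.0)]
print(top_1(P, 1.0, 3.0), top_k(P, 1.0, 3.0, 2), threshold(P, 1.0, 3.0, 0.5))
```

Every query here takes time linear in n (plus O(n log k) for top-k), which is the cost the data structures in the talk are meant to avoid.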

8 An application on deterministic data

9 An application on deterministic data (cont.)
A query interval I = [7, +∞)
top-1 query: find the movie whose total percentage of ratings ≥ 7 is the largest
top-k query: find the top-k movies whose total percentages of ratings ≥ 7 are the largest
threshold query: e.g., for t = 0.8, find the movies whose total percentages of ratings ≥ 7 are ≥ 80%
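
As a concrete illustration of the movie example, the sketch below treats each movie's rating counts as a distribution and evaluates the query I = [7, +∞). The movie names and counts are invented for the example.

```python
def share_at_least(rating_counts, cutoff):
    """Fraction of a movie's ratings that are >= cutoff."""
    total = sum(rating_counts.values())
    return sum(count for rating, count in rating_counts.items() if rating >= cutoff) / total

# Hypothetical data: rating value -> number of ratings.
movies = {
    "A": {5: 10, 7: 30, 9: 60},
    "B": {6: 50, 8: 50},
    "C": {3: 20, 7: 40, 10: 40},
}
shares = {name: share_at_least(counts, 7) for name, counts in movies.items()}
best = max(shares, key=shares.get)                                 # top-1 query
over_80 = sorted(name for name, s in shares.items() if s >= 0.8)   # threshold t = 0.8
print(shares, best, over_80)
```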

10 Previous work: only on threshold queries
A heuristic solution using R-trees, Cheng et al., VLDB '04
  fast in practice, but O(n) time in the worst case
Theoretical results: Agarwal et al., PODS '09
  preprocessing: O(n log^2 n) space and O(n log^3 n) expected time
  query: O(m + log^3 n) time, where m is the output size
  A special case: t is fixed for all queries
    preprocessing: O(n) space and O(n log n) time
    query: O(m + log n) time, where m is the output size
Heuristic solutions in 2-D or higher-D, Tao et al. 2005
  O(n) time in the worst case

12 Variations
The query interval I = [x_l, x_r]
  unbounded query: either x_l = -∞ or x_r = +∞
  bounded query: otherwise
f_p(x): PDF of each uncertain point p
  uniform distribution: f_p(x) has only one interval
  histogram distribution: otherwise
Together these give four variations
[Figure: f_p(x) against x]

13 Our results: uniform unbounded
top-1: preprocessing time O(n log n), space O(n), query time O(log n)
top-k: query time O(k + log n)
threshold: query time O(m + log n)

14 Our results: histogram unbounded
top-1: preprocessing time O(n log n), space O(n), query time O(log n)
top-k: query time T
threshold: query time O(m + log n)
T = O(k) if k = Ω(log n loglog n), and O(log n + k log k) otherwise

15 Our results: uniform bounded
top-1: preprocessing time O(n log n), space O(n), query time O(log n)
top-k: O(n log^2 n), query time T
threshold: query time O(m + log n)
T = O(k) if k = Ω(log n loglog n), and O(log n + k log k) otherwise

16 Future work: histogram bounded
No new results
Previous work only on threshold queries: P.K. Agarwal, S.-W. Cheng, Y. Tao, and K. Yi, PODS 2009
  preprocessing: O(n log^2 n) space and O(n log^3 n) expected time
  query: O(m + log^3 n) time, where m is the output size

17 The š¼-probability: unbounded
Given a query interval I = [x_l, x_r]: Pr[p ∈ I] = C_p(x_r) - C_p(x_l)
If x_l = -∞, then C_p(x_l) = 0 and Pr[p ∈ I] = C_p(x_r)
This is why the unbounded case is easier
[Figure: C_p(x) against x, with C_p(x_l) and C_p(x_r) marked at x_l and x_r]

18 The arrangement of CDFs
Key: the intersections of all CDFs with the vertical line L(x_r)
  top-1: the highest intersection
  top-k: the highest k intersections
  threshold: the intersections above the threshold t
[Figure: the CDFs C_p(x) against x, with the line L(x_r) at x_r and the threshold t]

19 Top-1: unbounded
Preprocessing: compute the upper envelope of all CDFs
Query: find the intersection of L(x_r) with the upper envelope
[Figure: the CDFs C_p(x) against x, with the query position x_r]
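
To illustrate the envelope idea, here is a sketch for the simpler setting where every curve is a full line y = m*x + c rather than a piecewise linear CDF. That simplification is an assumption of this example; the slide's envelope is over CDFs. The function names are hypothetical. Preprocessing sorts the lines by slope and keeps only those that appear on the upper envelope, and a query is a binary search over the envelope's breakpoints.

```python
from bisect import bisect_right

def build_upper_envelope(lines):
    """lines: iterable of (slope, intercept). Returns (hull, breaks)."""
    best = {}
    for m, c in lines:                      # among equal slopes keep the highest line
        if m not in best or c > best[m]:
            best[m] = c
    hull = []
    for m, c in sorted(best.items()):       # increasing slope = left-to-right on the envelope
        while len(hull) >= 2:
            (m1, c1), (m2, c2) = hull[-2], hull[-1]
            # hull[-1] is hidden if the new line overtakes hull[-2] no later than hull[-1] does
            if (c1 - c) * (m2 - m1) <= (c1 - c2) * (m - m1):
                hull.pop()
            else:
                break
        hull.append((m, c))
    # x-coordinates where consecutive envelope lines cross
    breaks = [(c1 - c2) / (m2 - m1) for (m1, c1), (m2, c2) in zip(hull, hull[1:])]
    return hull, breaks

def top1_at(hull, breaks, x):
    """Highest line at the vertical line through x, found in O(log n) time."""
    m, c = hull[bisect_right(breaks, x)]
    return (m, c), m * x + c

lines = [(-1, 0), (0, 1), (1, 0), (0.5, -3)]
hull, breaks = build_upper_envelope(lines)
print(hull, breaks, top1_at(hull, breaks, 5))
```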

20 Difficulty for top-k queries
Arrangements of segments: difficult!
Arrangements of lines: much easier!
Uniform case: change each CDF to a line
[Figure: a CDF C_p(x) against x, with the level 1 and the query line L(x_r)]
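
For a uniform distribution on [a, b], the CDF coincides on [a, b] with the line through (a, 0) and (b, 1), and equals that line clamped to [0, 1] everywhere else. The sketch below shows only this relationship. How the talk treats the parts of the line outside [0, 1] is not stated in the transcript, and the names `uniform_cdf_line` and `uniform_cdf` are hypothetical.

```python
def uniform_cdf_line(a, b):
    """Slope and intercept of the line through (a, 0) and (b, 1)."""
    slope = 1.0 / (b - a)
    return slope, -a * slope

def uniform_cdf(a, b, x):
    """CDF of the uniform distribution on [a, b]: the line clamped to [0, 1]."""
    slope, intercept = uniform_cdf_line(a, b)
    return min(1.0, max(0.0, slope * x + intercept))

slope, intercept = uniform_cdf_line(2.0, 6.0)
print(slope, intercept, uniform_cdf(2.0, 6.0, 4.0))
```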

21 Uniform unbounded
Given an arrangement of n lines, for any query vertical line L(x_r):
  top-k: return the top k intersections
  threshold: return the intersections above t
[Figure: the lines C_p(x) against x, with the threshold t and the query position x_r]

22 A half-plane range reporting data structure
Problem: given a line arrangement, for any query point q, return the lines above q
Data structure: partition the lines into layers, where each layer consists of the lines on the upper envelope after the previous layers are removed
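
The layer partition can be sketched by peeling upper envelopes one at a time. This is a naive construction that recomputes an envelope per layer (up to O(n^2 log n) time in total), shown only to make the definition concrete; it is not claimed to be the preprocessing used in the talk. Lines are (slope, intercept) pairs, assumed distinct, and the function names are hypothetical.

```python
def upper_envelope(lines):
    """Lines of the upper envelope, ordered by increasing slope."""
    best = {}
    for m, c in lines:                      # among equal slopes keep the highest line
        if m not in best or c > best[m]:
            best[m] = c
    hull = []
    for m, c in sorted(best.items()):
        while len(hull) >= 2:
            (m1, c1), (m2, c2) = hull[-2], hull[-1]
            if (c1 - c) * (m2 - m1) <= (c1 - c2) * (m - m1):
                hull.pop()                  # hull[-1] never rises strictly above its neighbours
            else:
                break
        hull.append((m, c))
    return hull

def envelope_layers(lines):
    """Layer i is the upper envelope of the lines left after removing layers 1..i-1."""
    remaining = set(lines)
    layers = []
    while remaining:
        layer = upper_envelope(remaining)
        layers.append(layer)
        remaining.difference_update(layer)
    return layers

print(envelope_layers([(-1, 0), (0, 1), (1, 0), (0.5, -3), (0, -10)]))
```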

23 Threshold query: uniform unbounded
Given I = (-∞, x_r] and the threshold t:
  determine the intersections of L(x_r) and the upper envelopes above t
  for each such intersection, walk along the envelope to the left and right to find the lines that intersect L(x_r) above t
query time: O(log n + m)
[Figure: the line L(x_r) at x_r and the threshold t]

24 Top-k query: uniform unbounded
Use a heap: O(log n + k log k) query time
Observation: largest k elements in O(k) sorted arrays
  a selection algorithm on sorted matrices, Frederickson and Johnson, '82
  gives O(log n + k) time
[Figure: the line L(x_r) at x_r]
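
The heap-based step can be sketched as a k-way selection over arrays that are each sorted in descending order. How those arrays are obtained from the line arrangement is not spelled out in the transcript, so the input here is simply assumed. The sketch implements only the heap approach, not the Frederickson and Johnson selection algorithm, and `k_largest` is a hypothetical name.

```python
import heapq

def k_largest(sorted_desc_arrays, k):
    """k largest values across arrays that are each sorted in descending order."""
    # heapq is a min-heap, so values are negated: (negated value, array index, position).
    heap = [(-arr[0], i, 0) for i, arr in enumerate(sorted_desc_arrays) if arr]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        neg, i, j = heapq.heappop(heap)
        result.append(-neg)
        if j + 1 < len(sorted_desc_arrays[i]):   # expose the next element of that array
            heapq.heappush(heap, (-sorted_desc_arrays[i][j + 1], i, j + 1))
    return result

print(k_largest([[9, 4, 1], [8, 7, 2], [6, 5]], 4))
```

With a arrays this takes O(a + k log a) time, which is O(k log k) when there are O(k) arrays.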

25 Our results: uniform unbounded
top-1: preprocessing time O(n log n), space O(n), query time O(log n)
top-k: query time O(k + log n)
threshold: query time O(m + log n)

26 Uniform bounded
Transform the problem to the unbounded case
If x_l ≤ the left endpoint of the blue interval, then Pr[p ∈ I] = Pr[p ∈ I'] for I' = (-∞, x_r]
It becomes the unbounded case!
[Figure: a blue interval, the query interval I = [x_l, x_r], and the interval I']
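
A quick numeric check of this transformation, assuming the blue interval is the interval [a, b] on which p is uniformly distributed. The endpoints and query values are made up for the example.

```python
def uniform_cdf(a, b, x):
    """CDF of the uniform distribution on [a, b]."""
    return min(1.0, max(0.0, (x - a) / (b - a)))

def prob(a, b, x_l, x_r):
    """Pr[p in [x_l, x_r]] for p uniform on [a, b]."""
    return uniform_cdf(a, b, x_r) - uniform_cdf(a, b, x_l)

a, b = 3.0, 7.0          # hypothetical interval of p
x_l, x_r = 1.0, 5.0      # x_l lies at or left of a
bounded = prob(a, b, x_l, x_r)
unbounded = prob(a, b, float("-inf"), x_r)
print(bounded, unbounded)
```

Both values agree because C_p(x_l) = 0 whenever x_l is at or to the left of a.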

27 Uniform bounded (cont.)
Classify blue intervals into three types:
  L-type: left endpoints ≥ x_l
  R-type: right endpoints ≤ x_r
  M-type: each contains I
[Figure: blue intervals and the query interval I = [x_l, x_r]]
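
The classification can be sketched as below, assuming each blue interval is the interval [a, b] of a uniform point. The slide's conditions can overlap (an interval inside I satisfies both the L and R conditions), so the order of the checks is an assumption of this sketch. The per-type probability formulas follow from the uniform CDF rather than from the transcript.

```python
def uniform_cdf(a, b, x):
    """CDF of the uniform distribution on [a, b]."""
    return min(1.0, max(0.0, (x - a) / (b - a)))

def classify(a, b, x_l, x_r):
    """Type of the interval [a, b] relative to the query [x_l, x_r]."""
    if a <= x_l and x_r <= b:
        return "M"          # [a, b] contains the query interval
    if a >= x_l:
        return "L"          # left endpoint at or to the right of x_l
    return "R"              # here a < x_l, which forces b < x_r

def prob_by_type(a, b, x_l, x_r):
    """I-probability computed with the formula that matches the interval's type."""
    kind = classify(a, b, x_l, x_r)
    if kind == "M":
        return (x_r - x_l) / (b - a)
    if kind == "L":
        return uniform_cdf(a, b, x_r)        # C_p(x_l) = 0
    return 1.0 - uniform_cdf(a, b, x_l)      # C_p(x_r) = 1

for a, b in [(3.0, 9.0), (0.0, 4.0), (0.0, 8.0)]:
    print((a, b), classify(a, b, 2.0, 6.0), prob_by_type(a, b, 2.0, 6.0))
```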

28 Uniform bounded (cont.)
Top-1 queries:
  L-type and R-type: use a persistent data structure to maintain O(n) upper envelopes in the preprocessing
  M-type: transform to segment dragging queries in 2D
Top-k queries:
  L-type and R-type: use a binary tree T and, on each node, build a data structure as in the unbounded case; build a fractional cascading structure on T
  M-type: transform to a range query in 3D
Threshold queries: similar to top-k queries

29 Histogram unbounded
A segment query problem
Given a set of n segments, for any point q, return all segments vertically above q
P.K. Agarwal, S.-W. Cheng, Y. Tao, and K. Yi, PODS 2009
  preprocessing: O(n) space and O(n log n) time
  query: O(log n + m) time
[Figure: segments and a query point q]
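
To make the segment query concrete, here is a brute-force version that scans every segment in O(n) time per query. It is a baseline for checking results, not the PODS 2009 data structure. Treating a segment that passes exactly through q as "above" is an assumption, and `segments_above` is a hypothetical name.

```python
def segments_above(segments, q):
    """Segments ((x1, y1), (x2, y2)), with x1 <= x2, that lie vertically above q."""
    qx, qy = q
    hits = []
    for (x1, y1), (x2, y2) in segments:
        if not (x1 <= qx <= x2):
            continue                         # the vertical line through q misses it
        if x1 == x2:
            y_at = max(y1, y2)               # vertical segment: use its top endpoint
        else:
            y_at = y1 + (y2 - y1) * (qx - x1) / (x2 - x1)   # height of the segment at qx
        if y_at >= qy:
            hits.append(((x1, y1), (x2, y2)))
    return hits

segs = [((0, 0), (4, 4)), ((0, 3), (4, 3)), ((5, 5), (6, 5))]
print(segments_above(segs, (2, 2.5)))
```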

30 Thank you for your attention!

