Range Queries on Uncertain Data

Jian Li, Tsinghua University
Haitao Wang, Utah State University
ISAAC 2014

One dimensional range queries
Input: a set of points on a line
Given a query interval 𝐼, return the points in the interval 𝐼
A trivial solution: balanced binary search tree

An uncertain point p
p can appear in different locations with probabilities
Given a query interval 𝐼, Pr[p∈𝐼]: the probability that p lies in 𝐼, called the 𝐼-probability of p
[Figure: p has four possible locations with probabilities 0.1, 0.3, 0.2, 0.4; for the pictured interval 𝐼, Pr[p∈𝐼] = 0.5]

An uncertain point p: A general case
The location of p is specified by its PDF (probability density function) f_p(x), which is a step function or histogram
Given a query interval 𝐼, Pr[p∈𝐼]: the 𝐼-probability of p
[Figure: a histogram PDF f_p(x) with labeled values 0.25, 0.22, 0.2, 0.15 and a query interval 𝐼 on the x-axis]

The cumulative distribution function (CDF)
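C_p(x): a piecewise linear function (because f_p(x) is a step function)
[Figure: the CDF C_p(x), rising to 1, drawn over the PDF f_p(x); the value C_p(x′) is marked at a point x′]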

Computing the 𝐼-probability using CDF
C_p(x): a piecewise linear function
A query interval 𝐼 = [x_l, x_r]
Pr[p∈𝐼] = C_p(x_r) − C_p(x_l)
[Figure: the CDF C_p(x) with the values C_p(x_l) and C_p(x_r) marked at x_l and x_r]
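The CDF formula above can be tried out with a short sketch. It assumes a histogram PDF given as breakpoints plus one density value per bin; the names `make_cdf` and `interval_probability` are illustrative and do not come from the talk.

```python
from bisect import bisect_right

def make_cdf(breaks, densities):
    """Build C_p for a histogram PDF.

    breaks: x_0 < x_1 < ... < x_m (bin boundaries, assumed input format)
    densities[i]: value of f_p on [x_i, x_{i+1})
    """
    # Cumulative probability mass at each breakpoint.
    cum = [0.0]
    for i, d in enumerate(densities):
        cum.append(cum[-1] + d * (breaks[i + 1] - breaks[i]))

    def cdf(x):
        if x <= breaks[0]:       # also covers x = -infinity
            return 0.0
        if x >= breaks[-1]:      # also covers x = +infinity
            return cum[-1]
        i = bisect_right(breaks, x) - 1   # bin containing x
        return cum[i] + densities[i] * (x - breaks[i])  # linear inside a bin

    return cdf

def interval_probability(cdf, x_l, x_r):
    """Pr[p in I] = C_p(x_r) - C_p(x_l) for I = [x_l, x_r]."""
    return cdf(x_r) - cdf(x_l)

cdf = make_cdf([0, 1, 2, 4], [0.2, 0.4, 0.2])
print(interval_probability(cdf, 1.5, 3))
```

Each evaluation costs one binary search over the breakpoints of a single point; the data structures in the talk aim to avoid doing this for all n points per query.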

Range query problems on uncertain points
Input: a set P of n uncertain points
For any query interval 𝐼:
top-1 query: return the point in P with the largest 𝐼-probability
top-k query: return the k points in P with the largest 𝐼-probabilities
threshold query: given any threshold t, return the points in P with 𝐼-probabilities ≥ t
Goal: build data structures on P to quickly answer these queries
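As a point of reference, the three query types can be answered by brute force in time linear in the input per query. The sketch below is only that baseline, not a data structure from the talk; it assumes each uncertain point is a list of (location, probability) pairs, and all names are illustrative.

```python
import heapq

def i_probability(point, x_l, x_r):
    """Sum the probabilities of the locations that fall inside I = [x_l, x_r]."""
    return sum(pr for loc, pr in point if x_l <= loc <= x_r)

def top_k(points, x_l, x_r, k):
    """Names of the k points with the largest I-probabilities (brute force)."""
    scored = [(i_probability(p, x_l, x_r), name) for name, p in points.items()]
    return [name for _, name in heapq.nlargest(k, scored)]

def top_1(points, x_l, x_r):
    return top_k(points, x_l, x_r, 1)[0]

def threshold(points, x_l, x_r, t):
    """Names of the points whose I-probability is at least t."""
    return sorted(name for name, p in points.items()
                  if i_probability(p, x_l, x_r) >= t)

# Hypothetical example data.
points = {
    "a": [(1, 0.5), (5, 0.5)],
    "b": [(2, 0.25), (8, 0.25), (9, 0.5)],
    "c": [(4, 1.0)],
}
print(top_1(points, 2, 5), top_k(points, 2, 5, 2), threshold(points, 2, 5, 0.5))
```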

An application on deterministic data

An application on deterministic data (cont.)
A query interval 𝐼 = [7, +∞)
top-1 query: find the movie whose total percentage of ratings ≥ 7 is the largest
top-k query: find the top-k movies whose total percentages of ratings ≥ 7 are the largest
threshold query: e.g., for t = 0.8, find the movies whose total percentages of ratings ≥ 7 are ≥ 80%

Previous work: only on threshold queries
A heuristic solution using R-trees, Cheng et al., VLDB '04
  fast in practice, but O(n) time in the worst case
Theoretical results: Agarwal et al., PODS '09
  preprocessing: O(n log^2 n) space and O(n log^3 n) expected time
  query: O(m + log^3 n) time, where m is the output size
  A special case: t is fixed for all queries
    preprocessing: O(n) space and O(n log n) time
    query: O(m + log n) time, where m is the output size
Heuristic solutions in 2-D or higher dimensions, Tao et al., 2005
  O(n) time in the worst case

Variations
The query interval 𝐼 = [x_l, x_r]
  unbounded query: either x_l = −∞ or x_r = +∞
  bounded query: otherwise
f_p(x): the PDF of each uncertain point p
  uniform distribution: f_p(x) has only one interval
  histogram distribution: otherwise
Two query types and two distribution types: four variations

Our results: uniform unbounded
Columns: preprocessing time, space, query time
top-1: O(n log n), O(n), O(log n)
top-k: query time O(k + log n)
threshold: query time O(m + log n)

Our results: histogram unbounded
Columns: preprocessing time, space, query time
top-1: O(n log n), O(n), O(log n)
top-k: query time T
threshold: query time O(m + log n)
T = O(k) if k = Ω(log n log log n), and O(log n + k log k) otherwise

Our results: uniform bounded
Columns: preprocessing time, space, query time
top-1: O(n log n), O(n), O(log n)
top-k: O(n log^2 n), query time T
threshold: query time O(m + log n)
T = O(k) if k = Ω(log n log log n), and O(log n + k log k) otherwise

Future work: histogram bounded
No new results
Previous work only on threshold queries: P.K. Agarwal, S.-W. Cheng, Y. Tao, and K. Yi, PODS 2009
preprocessing: O(n log^2 n) space and O(n log^3 n) expected time
query: O(m + log^3 n) time, where m is the output size

The 𝐼-probability: unbounded
Given a query interval 𝐼 = [x_l, x_r]: Pr[p∈𝐼] = C_p(x_r) − C_p(x_l)
If x_l = −∞, then C_p(x_l) = 0 and Pr[p∈𝐼] = C_p(x_r)
This is why the unbounded case is easier
[Figure: the CDF C_p(x) with the values C_p(x_l) and C_p(x_r) marked at x_l and x_r]

The arrangement of CDFs
Key: the intersections of all CDFs with the vertical line L(x_r)
top-1: the highest intersection
top-k: the highest k intersections
threshold: the intersections above the threshold t
[Figure: the CDFs C_p(x), the vertical line L(x_r) at x = x_r, and the threshold t]

Top-1: unbounded
Preprocessing: compute the upper envelope of all CDFs
Query: find the intersection of L(x_r) with the upper envelope
[Figure: the CDFs C_p(x), their upper envelope, and the vertical line at x_r]

Difficulty for top-k queries
Arrangements of segments: difficult!
Arrangements of lines: much easier!
Uniform case: change each CDF to a line
[Figure: a uniform CDF C_p(x) rising from 0 to 1, extended to a full line, with the vertical line L(x_r)]
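For the uniform case with 𝐼 = (−∞, x_r], the line trick can be sketched for top-1 queries as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each point is uniform on an interval (a, b) with a < b, its CDF is replaced by the line y = (x − a)/(b − a), and the names `build_upper_envelope` and `top1` are hypothetical. Clamping a line value to [0, 1] is monotone, so the highest line at x_r also has the highest 𝐼-probability (ties at 0 or 1 are possible).

```python
from bisect import bisect_left

def build_upper_envelope(intervals):
    """intervals: list of (a, b), a < b. Returns (hull, xs): the envelope lines
    in increasing slope order and the x-coordinates where the top line changes."""
    lines = []
    for idx, (a, b) in enumerate(intervals):
        m = 1.0 / (b - a)              # slope of the extended CDF
        lines.append((m, -a * m, idx))  # (slope, intercept, point id)
    lines.sort()
    hull = []
    for m, c, idx in lines:
        if hull and hull[-1][0] == m:
            hull.pop()                 # equal slopes: keep the larger intercept
        while len(hull) >= 2:
            (m1, c1, _), (m2, c2, _) = hull[-2], hull[-1]
            # hull[-1] is never on top if the new line overtakes hull[-2]
            # no later than hull[-1] does.
            if (c1 - c) * (m2 - m1) <= (c1 - c2) * (m - m1):
                hull.pop()
            else:
                break
        hull.append((m, c, idx))
    xs = [(hull[i][1] - hull[i + 1][1]) / (hull[i + 1][0] - hull[i][0])
          for i in range(len(hull) - 1)]
    return hull, xs

def top1(hull, xs, x_r):
    """Point id and I-probability of the top-1 answer for I = (-inf, x_r]."""
    m, c, idx = hull[bisect_left(xs, x_r)]   # binary search: O(log n)
    return idx, min(1.0, max(0.0, m * x_r + c))

hull, xs = build_upper_envelope([(0, 4), (1, 2), (3, 5)])
print(top1(hull, xs, 1.0), top1(hull, xs, 1.5))
```

Sorting dominates the preprocessing, and each query is one binary search, which is in line with the O(n log n) and O(log n) bounds quoted for top-1 queries.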

Uniform unbounded
Given an arrangement of n lines, for any query vertical line L(x_r):
top-k: return the top k intersections
threshold: return the intersections above t
[Figure: the lines, the vertical line L(x_r) at x = x_r, and the threshold t]

A half-plane range reporting data structure
Problem: given a line arrangement, for any query point q, return the lines above q
Data structure: partition the lines into layers; each layer consists of the lines on the upper envelope after removing the previous layers

Threshold query: uniform unbounded
Given 𝐼 = (−∞, x_r] and the threshold t:
determine the intersections of L(x_r) with the upper envelopes (one per layer) that lie above t
for each such intersection, walk along the envelope to the left and to the right to find the lines that intersect L(x_r) above t
query time: O(log n + m)
[Figure: the layers of envelopes, the vertical line L(x_r) at x_r, and the threshold t]
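The layer decomposition and the envelope walk can be sketched as below. Several assumptions apply: lines are (slope, intercept, id) tuples with unique ids, the layers are built by naive repeated peeling (quadratic or worse, unlike an O(n log n) construction), and each layer is searched separately, so the query costs O(log n) per visited layer rather than the O(log n + m) total stated above. The function names are illustrative.

```python
from bisect import bisect_left

def upper_envelope(lines):
    """Lines on the upper envelope, in increasing slope order."""
    hull = []
    for m, c, i in sorted(lines):
        if hull and hull[-1][0] == m:
            hull.pop()                       # equal slopes: keep larger intercept
        while len(hull) >= 2:
            (m1, c1, _), (m2, c2, _) = hull[-2], hull[-1]
            if (c1 - c) * (m2 - m1) <= (c1 - c2) * (m - m1):
                hull.pop()                   # hull[-1] is never on top
            else:
                break
        hull.append((m, c, i))
    return hull

def build_layers(lines):
    """Peel envelopes repeatedly; each layer stores its lines and breakpoints."""
    remaining, layers = list(lines), []
    while remaining:
        hull = upper_envelope(remaining)
        xs = [(hull[j][1] - hull[j + 1][1]) / (hull[j + 1][0] - hull[j][0])
              for j in range(len(hull) - 1)]
        layers.append((hull, xs))
        on_hull = {i for _, _, i in hull}
        remaining = [ln for ln in remaining if ln[2] not in on_hull]
    return layers

def lines_above(layers, x, t):
    """Ids of the lines whose value at x is at least t."""
    out = []
    for hull, xs in layers:
        top = bisect_left(xs, x)             # envelope line at x in this layer
        if hull[top][0] * x + hull[top][1] < t:
            break                            # deeper layers are lower still
        out.append(hull[top][2])
        for step in (-1, 1):                 # walk left, then right
            j = top + step
            while 0 <= j < len(hull) and hull[j][0] * x + hull[j][1] >= t:
                out.append(hull[j][2])
                j += step
    return out

lines = [(0.25, 0.0, 0), (1.0, -1.0, 1), (0.5, -1.5, 2), (0.5, 0.0, 3)]
layers = build_layers(lines)
print(sorted(lines_above(layers, 3.0, 0.5)))
```

The walk is valid because, within one layer, the line values at a fixed x decrease as one moves away from the envelope line at x in either direction.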

Top-k query: uniform unbounded
Use a heap: O(log n + k log k) query time
Observation: the task is to find the largest k elements in O(k) sorted arrays
Use a selection algorithm on sorted matrices (Frederickson and Johnson, '82) ⇒ O(log n + k) time
[Figure: the layers of envelopes and the vertical line L(x_r) at x_r]
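The heap-based variant can be illustrated on its own, assuming the candidate values have already been arranged as arrays sorted in descending order (for instance, one or two arrays per layer). The sketch shows only the simple heap approach, not the Frederickson and Johnson selection algorithm; `k_largest_from_sorted` is a hypothetical name.

```python
import heapq

def k_largest_from_sorted(arrays, k):
    """Return the k largest values, in descending order, from arrays that are
    each sorted in descending order."""
    # Max-heap (via negated keys) holding the current head of every array.
    heap = [(-arr[0], i, 0) for i, arr in enumerate(arrays) if arr]
    heapq.heapify(heap)
    out = []
    while heap and len(out) < k:
        neg, i, j = heapq.heappop(heap)
        out.append(-neg)
        if j + 1 < len(arrays[i]):           # advance within the same array
            heapq.heappush(heap, (-arrays[i][j + 1], i, j + 1))
    return out

print(k_largest_from_sorted([[0.9, 0.4, 0.1], [0.8, 0.7], [0.5]], 4))
```

With O(k) arrays this takes O(k log k) time after the arrays are located, which corresponds to the k log k term in the heap bound.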

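Uniform bounded
Transform the problem to the unbounded case
If x_l ≤ the left endpoint of the blue interval (the interval on which p is uniformly distributed): Pr[p∈𝐼] = Pr[p∈𝐼′] for 𝐼′ = (−∞, x_r]
It becomes the unbounded case!
[Figure: the blue interval of p, the query interval 𝐼 = [x_l, x_r], and the extended interval 𝐼′]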
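A small numeric check of this reduction, assuming p is uniform on an interval [a, b] with a < b (function names are illustrative):

```python
def uniform_cdf(a, b, x):
    """C_p(x) for p uniform on [a, b]."""
    return min(1.0, max(0.0, (x - a) / (b - a)))

def uniform_interval_probability(a, b, x_l, x_r):
    """Pr[p in [x_l, x_r]]: overlap length divided by the length of [a, b]."""
    overlap = max(0.0, min(b, x_r) - max(a, x_l))
    return overlap / (b - a)

a, b = 2.0, 6.0
# x_l <= a: the bounded query agrees with the unbounded one, C_p(x_r).
print(uniform_interval_probability(a, b, 1.0, 4.0), uniform_cdf(a, b, 4.0))
# x_l > a: the two differ, so this point needs separate handling.
print(uniform_interval_probability(a, b, 3.0, 4.0), uniform_cdf(a, b, 4.0))
```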

Uniform bounded (cont.)
Classify blue intervals into three types
L-type: left endpoints ≥ x_l
R-type: right endpoints ≤ x_r
M-type: each contains 𝐼
[Figure: blue intervals positioned relative to 𝐼 = [x_l, x_r]]

Uniform bounded (cont.)
Top-1 queries:
  L-type and R-type: use a persistent data structure to maintain O(n) upper envelopes in the preprocessing
  M-type: transform to segment dragging queries in 2D
Top-k queries:
  L-type and R-type: use a binary tree T and, on each node, build a data structure as in the unbounded case; build a fractional cascading structure on T
  M-type: transform to a range query in 3D
Threshold queries: similar to top-k queries

Histogram unbounded
A segment query problem
Given a set of n segments, for any point q, return all segments vertically above q
P.K. Agarwal, S.-W. Cheng, Y. Tao, and K. Yi, PODS 2009
preprocessing: O(n) space and O(n log n) time
query: O(log n + m) time

Thank you for your attention!
