# Probabilistic Skyline Operator over Sliding Windows

Probabilistic Skyline Operator over Sliding Windows
Wenjie Zhang, University of New South Wales & NICTA, Australia
Joint work: Xuemin Lin, Ying Zhang, Wei Wang (UNSW & NICTA), Jeffrey Xu Yu (CUHK)

Outline
- Background
- Framework
- Algorithms
- Experiment
- Conclusion

Background
- Elements arrive continuously, each with an occurrence probability.
- Problem: how to continuously compute skylines over a sliding window of size N (elements)?
- [Figure: stream of elements 1 to 6 labeled with occurrence probabilities; sliding window N = 5]

Background
Multi-criteria decision making regarding uncertain data:
- Online auction
- Financial market
- …

Related work
- Probabilistic skyline computation
  - Probabilistic skyline (VLDB07)
  - Probabilistic reverse skyline (SIGMOD08)
- Uncertain stream processing
  - Probabilistic aggregates and sketches over uncertain streams (SIGMOD07, SODA07, PODS07)
  - Frequent items on uncertain streams (SIGMOD08)
  - Top-k queries over uncertain sliding window (VLDB08)
  - …

Models and Problem Definition
- Model: DS is a stream of elements. Each element a lies in a d-dimensional space and has an occurrence probability P(a) in (0, 1].
- The skyline probability of an element a, taken over the most recent N elements DS_N, is

  $$P_{sky}(a) = P(a) \times \prod_{a' \in DS_N,\; a' \prec a} \bigl(1 - P(a')\bigr)$$

  where a' ≺ a means that a' dominates a.
- Problem definition: retrieve, from the most recent N elements, those whose skyline probability is no less than a given threshold q.
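The definition can be checked with a brute-force sketch. It assumes the usual form of skyline probability, in which P(a) is multiplied by (1 − P(a')) for every window element a' that dominates a, and it assumes that smaller values are better in every dimension. The function names and the sample window are hypothetical, chosen only for illustration.

```python
from math import prod


def dominates(a, b):
    """Return True if point a dominates point b.

    Assumption: smaller values are better in every dimension.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline_probability(i, window):
    """Skyline probability of window[i]; window is a list of (point, prob)."""
    point, p = window[i]
    return p * prod(
        1.0 - other_p
        for j, (other, other_p) in enumerate(window)
        if j != i and dominates(other, point)
    )


def probabilistic_skyline(window, q):
    """Indices of the elements whose skyline probability is at least q."""
    return [i for i in range(len(window)) if skyline_probability(i, window) >= q]


# Hypothetical 2-d window (made-up points and probabilities).
window = [((1, 1), 0.3), ((2, 2), 0.4), ((3, 3), 0.9), ((0, 5), 0.8)]
print(skyline_probability(2, window))
print(probabilistic_skyline(window, 0.5))
```

This recomputes every product from scratch, which costs O(N²) dominance tests per query. It is a reference point for correctness, not the incremental method of the talk.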

Challenges and Contributions
- Space efficiency
  - Contribution: space reduced from O(N) to O(ln^{d-1} N)
- Time efficiency
  - Contribution: efficient incremental algorithms based on R-trees

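Framework: what to keep?
- [Figure: five elements in the window with their occurrence probabilities; window size N = 5, probability threshold q = 0.5]
- P_old(a) collects the factors (1 − P(a')) of the dominating elements that arrived before a; P_new(a) collects those of the dominating elements that arrived after a.
- P_old(2) = 1 − P(1)
- P_new(2) = (1 − P(3)) × (1 − P(4))
- Since P_new(2) < q, element 2 will never become a skyline point in the window.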

Framework: what to keep?
- Candidate set S_{N,q}: the elements a among the most recent N elements with P_new(a) ≥ q.
- Correctness:
  1. no missing skyline points
  2. no false hits when determining S_{N,q}
  3. no false positives when determining skyline results
  4. no false negatives when determining skyline results
- Note: a probability computed from S_{N,q} alone may not be exact, but it still satisfies the threshold requirement.
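A minimal sketch of the candidate test follows. It assumes that P_old and P_new are the products of (1 − P(a')) over dominating elements that arrived before and after a, respectively, and that an element is kept as a candidate when its P_new is at least q. The window is a list ordered oldest first; the names and data are hypothetical.

```python
from math import prod


def dominates(a, b):
    """Return True if point a dominates point b (assumed: smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def p_old(i, window):
    """Product of (1 - P) over elements older than window[i] that dominate it."""
    point = window[i][0]
    return prod(1.0 - p for other, p in window[:i] if dominates(other, point))


def p_new(i, window):
    """Product of (1 - P) over elements newer than window[i] that dominate it."""
    point = window[i][0]
    return prod(1.0 - p for other, p in window[i + 1:] if dominates(other, point))


def candidate_set(window, q):
    """Indices of elements with P_new >= q, i.e. those that may still matter."""
    return [i for i in range(len(window)) if p_new(i, window) >= q]


# Hypothetical window, oldest first: (point, occurrence probability).
window = [((1, 1), 0.1), ((4, 4), 0.4), ((2, 2), 0.1), ((3, 3), 0.8)]
print(p_old(1, window), p_new(1, window))
print(candidate_set(window, 0.5))
```

The reason P_new is the right test: newer dominating elements outlive the element itself, so P_new can only shrink while the element stays in the window, whereas P_old can grow again as older elements expire.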

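Framework
- Space required for S_{N,q}: S_{N,q} is the minimum information that must be maintained to get a correct answer.
- [Figure: example with elements 1 to 4; window size N = 4, probability threshold q = 0.5]
- While the two dominating elements (probabilities 0.4 and 0.3) are in the window: P_sky(3) = 0.9 × (1 − 0.4) × (1 − 0.3) = 0.378 < q.
- Once they are no longer in the window: P_sky(3) = 0.9 > q.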

Space of Candidate Set
- Theorem: for uniform distributions, the candidate set requires poly-logarithmic space in the average case, O(f(q) ln^{d-1} N).

Algorithms
- We maintain two R-trees:
  - R1: SKY_{N,q}, the skyline elements
  - R2: S_{N,q} − SKY_{N,q}, the candidates that are not skyline elements

Algorithms
[Figure: window of 13 elements, written as element (probability): 1 (.1), 2 (.1), 3 (.4), 4 (.1), 5 (.8), 6 (.8), 7 (.6), 8 (.2), 9 (.5), 10 (.2), 11 (.6), 12 (.1), 13 (.1); window size N = 13; probability threshold q = 0.2; legend: not in S_{N,q}, R1: SKY_{N,q}, R2: S_{N,q} − SKY_{N,q}]

Algorithms
- New element arrives:
  - Check P_sky and P_new on R1
  - Check P_new on R2
  - Handle elements with P_new < q
- Old element expires:
  - Update P_old
  - Check P_sky on R2
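The two events above can be sketched without R-trees. This is a deliberately naive version, not the algorithm of the talk: it keeps the full window in a deque, so it gains none of the space saving of the candidate set, and it scans every element on each event. The class and method names are hypothetical, and dominance assumes that smaller values are better. On expiry it recomputes P_old rather than dividing by (1 − P), which avoids a division by zero when an expiring element has probability 1.

```python
from collections import deque
from itertools import islice
from math import prod


def dominates(a, b):
    """Return True if point a dominates point b (assumed: smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


class Entry:
    def __init__(self, point, prob):
        self.point = point
        self.prob = prob
        self.p_old = 1.0  # factors from older dominating elements
        self.p_new = 1.0  # factors from newer dominating elements

    @property
    def p_sky(self):
        return self.prob * self.p_old * self.p_new


class NaiveWindow:
    """Count-based sliding window of size n with probability threshold q."""

    def __init__(self, n, q):
        self.n = n
        self.q = q
        self.items = deque()  # oldest element first

    def push(self, point, prob):
        if len(self.items) == self.n:
            self._expire_oldest()
        new = Entry(point, prob)
        for e in self.items:
            if dominates(point, e.point):
                e.p_new *= 1.0 - prob  # the new element lowers P_new of e
            if dominates(e.point, point):
                new.p_old *= 1.0 - e.prob  # e lowers P_old of the new element
        self.items.append(new)

    def _expire_oldest(self):
        old = self.items.popleft()
        for i, e in enumerate(self.items):
            if dominates(old.point, e.point):
                # Recompute P_old of e from the elements still older than it.
                e.p_old = prod(
                    1.0 - o.prob
                    for o in islice(self.items, i)
                    if dominates(o.point, e.point)
                )

    def candidates(self):
        return [e.point for e in self.items if e.p_new >= self.q]

    def skyline(self):
        return [e.point for e in self.items if e.p_sky >= self.q]


w = NaiveWindow(n=3, q=0.5)
for pt, p in [((2, 2), 0.6), ((3, 3), 0.9), ((1, 1), 0.5), ((5, 5), 0.7)]:
    w.push(pt, p)
print(w.candidates())
print(w.skyline())
```

Replacing the linear scans with range searches over two R-trees, and pruning elements whose P_new falls below q, is what the slides describe; the sketch only fixes the bookkeeping of P_old, P_new and P_sky.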

Algorithms: new element arrives. Delete an entry
[Figure: R1 (SKY_{N,q}) and R2 (S_{N,q} − SKY_{N,q}) holding elements 2(.1), 3(.4), 4(.1), 5(.8), 6(.8), 7(.6), 8(.2), 9(.5), 10(.2), 11(.6), 12(.1), 13(.1); new element 14(0.8); window size N = 13; probability threshold q = 0.2]
- Before update: P_new: (1, 1); P_sky: (0.8, 0.8); global P_new = 1 − 0.2
- After update: global P_new *= 1 − 0.8, giving 0.8 × 0.2 = 0.16 < q
- Action: delete the entry from R1

Algorithms: new element arrives. Move an entry from R1 to R2
[Figure: R1 (SKY_{N,q}) and R2 (S_{N,q} − SKY_{N,q}) holding elements 2(.1), 3(.4), 4(.1), 7(.6), 8(.2), 9(.5), 10(.2), 11(.6), 12(.1), 13(.1); new element 14(0.8); window size N = 13; probability threshold q = 0.2]
- Before update, as (min, max) over the entry: P_new: (1, 1); P_sky: (0.24, 0.6); global P_new = 1
- After update: global P_new *= 1 − 0.8, so min P_new = 0.2 ≥ q and max P_sky = 0.12 < q
- Action: move the entry from R1 to R2

Algorithms: new element arrives. Drill down
[Figure: R1 (SKY_{N,q}) and R2 (S_{N,q} − SKY_{N,q}) holding elements 2(.1), 3(.4), 4(.1), 7(.6), 8(.2), 9(.5), 10(.2), 11(.6), 12(.1), 13(.1); new element 14(0.8); window size N = 13; probability threshold q = 0.2]
- Before update, as (min, max) over the entry: P_new: (0.9, 1); global P_new = 1
- After update: global P_new *= 1 − 0.8, so min P_new < q and max P_new ≥ q
- Action: drill down into the entry and delete element 2
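The three cases on these slides (delete, move to R2, drill down) can be read as one decision on an entry's (min, max) ranges. The sketch below is an interpretation, not the authors' code: it assumes every element under the entry is dominated by the new element, that the ranges passed in already include any pending global factor, and that an R2 entry is represented by passing `None` for the P_sky range. The function name and return strings are hypothetical. `Fraction` is used because a float such as `1 - 0.8` is slightly below 0.2 and would flip the boundary comparison `min P_new ≥ q`.

```python
from fractions import Fraction


def classify_entry(p_new_range, p_sky_range, new_prob, q):
    """Decide what to do with an R-tree entry fully dominated by a new element.

    p_new_range, p_sky_range: (min, max) over the entry before the arrival.
    p_sky_range is None for an entry of R2, where only P_new is checked.
    """
    factor = 1 - new_prob  # every element under the entry gains this factor
    min_new, max_new = (v * factor for v in p_new_range)
    if max_new < q:
        return "delete"  # no element under the entry can stay a candidate
    if min_new < q:
        return "drill down"  # some elements drop out, some stay
    if p_sky_range is None:
        return "keep"  # R2 entry, all elements remain candidates
    min_sky, max_sky = (v * factor for v in p_sky_range)
    if max_sky < q:
        return "move to R2"  # still candidates, but no longer skyline
    if min_sky >= q:
        return "keep"  # all elements remain skyline elements
    return "drill down"


q = Fraction(1, 5)  # threshold 0.2
p14 = Fraction(4, 5)  # probability of the new element 14

# Ranges taken from the three slide examples (0.8 = 1 x global factor 0.8).
print(classify_entry((Fraction(4, 5), Fraction(4, 5)), (Fraction(4, 5), Fraction(4, 5)), p14, q))
print(classify_entry((1, 1), (Fraction(6, 25), Fraction(3, 5)), p14, q))
print(classify_entry((Fraction(9, 10), 1), None, p14, q))
```

The last two branches for R1 entries (keep, or drill down when the P_sky range straddles q) are not shown on the slides and are added here only to make the function total.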

Algorithms: new element arrives. Update P_old
[Figure: R1 (SKY_{N,q}) and R2 (S_{N,q} − SKY_{N,q}) holding elements 2(.1), 3(.4), 4(.1), 7(.6), 8(.2), 9(.5), 10(.2), 11(.6), 12(.1), 13(.1); new element 14(0.8); window size N = 13; probability threshold q = 0.2]
- Update P_old of elements 12 and 13
- global P_old /= (1 − 0.1)

Algorithms: new element arrives. Insert the new element
[Figure: R1 (SKY_{N,q}) and R2 (S_{N,q} − SKY_{N,q}) holding elements 3(.4), 4(.1), 7(.6), 8(.2), 9(.5), 10(.2), 11(.6), 12(.1), 13(.1); new element 14(0.8); window size N = 13; probability threshold q = 0.2]
- Insert the new element with P_new = 1 and compute its P_sky

Algorithms: old element expires
- Delete it from R1 or R2.
- Update P_old of the remaining elements:
  - Record a global P_old on intermediate entries fully dominated by it
- Check P_sky after the update

Algorithms: old element expires
[Figure: R1 and R2 holding elements 3(.4), 4(.1), 7(.6), 8(.2), 9(.5), 10(.2), 11(.6), 12(.1), 13(.1), 14(0.8); window size N = 13; probability threshold q = 0.2]
- P_old(7) /= 1 − P(3)
- global P_old /= 1 − P(4)

Algorithms: handling multiple thresholds
- Continuous queries
  - Users specify k probability thresholds q_1, …, q_k (q_i < q_{i-1}).
  - Solution: instead of maintaining R1, we maintain R_1, …, R_k, each corresponding to a confidence value.
- Ad-hoc queries
  - Users issue a query: retrieve skylines with probability at least q' (q' ≥ q_k).
  - Solution: find an R_i with q_i ≤ q' < q_{i-1}. Then all elements in {R_j : j < i-1} are results. We search R_{i-1} to output qualified skylines.
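The ad-hoc query idea can be sketched with plain lists in place of R-trees. The bucket convention here is an assumption of the sketch and may be offset from the R_i numbering on the slide: with thresholds sorted in decreasing order and 0-based indices, bucket i holds the elements whose skyline probability lies in [q_i, q_{i-1}), and bucket 0 holds everything at or above the largest threshold. All names and data are hypothetical.

```python
def build_buckets(p_sky_by_element, thresholds):
    """Group elements by the first (largest) threshold they satisfy.

    thresholds must be sorted in decreasing order; elements below the
    smallest threshold are not stored.
    """
    buckets = [[] for _ in thresholds]
    for element, p in p_sky_by_element.items():
        for i, q in enumerate(thresholds):
            if p >= q:
                buckets[i].append((element, p))
                break
    return buckets


def ad_hoc_query(buckets, thresholds, q_prime):
    """Elements with skyline probability at least q_prime."""
    if q_prime < thresholds[-1]:
        raise ValueError("q' must be at least the smallest maintained threshold")
    # First bucket whose lower bound does not exceed q'.
    i = next(idx for idx, q in enumerate(thresholds) if q <= q_prime)
    # Buckets above it qualify wholesale; only bucket i needs filtering.
    result = [element for bucket in buckets[:i] for element, _ in bucket]
    result += [element for element, p in buckets[i] if p >= q_prime]
    return result


thresholds = [0.8, 0.5, 0.2]
p_sky = {"a": 0.9, "b": 0.6, "c": 0.55, "d": 0.3, "e": 0.1}
buckets = build_buckets(p_sky, thresholds)
print(ad_hoc_query(buckets, thresholds, 0.58))
```

Only one bucket is ever searched per query, which matches the trade-off reported later in the talk: more thresholds mean more maintenance work but smaller buckets to search.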

Experiment
- Data sets:
  - Real: stock transactions; 2-d; probabilities assigned randomly; size: 2 million
  - Synthetic: spatial location (independent or anti-correlated); probability (uniform or normal); 2-d to 5-d; size: 2 million
- Default values: p: 0.3; d: 3; N: 1M; spatial distribution: anti-correlated; probability: uniform

Experiment: space
The candidate set is about 0.1% of the sliding window size for 2-d data, and around 89% of the space is saved even for 5-d data.

Experiment: space
The size of S_{N,q} decreases as P_u increases, while the size of SKY_{N,q} increases with it.

Experiment: space
[Figure: space results]

Experiment: time
[Figure: time results]

Experiment: time
Maintenance time increases with the number of probability thresholds; query time decreases with it.

Conclusion
- We characterize a candidate set of minimum size and propose time-efficient techniques.
- We extend the framework to handle multiple thresholds.

Thanks!
