Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei.

Presentation transcript:

Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei Wang (UNSW & NICTA) Jeffrey Xu Yu (CUHK)

Outline
• Background
• Framework
• Algorithms
• Experiment
• Conclusion

Background
• Elements arrive continuously, each with an occurrence probability.
• Problem: how to continuously compute skylines over a sliding window of size N (elements)?
• Example sliding window: N = 5

Background
Multi-criteria decision making over uncertain data:
• Online auctions
• Financial markets
• …

Related work
Probabilistic skyline computation:
• Probabilistic skyline (VLDB07)
• Probabilistic reverse skyline (SIGMOD08)
Uncertain stream processing:
• Probabilistic aggregates and sketches over uncertain streams (SIGMOD07, SODA07, PODS07)
• Frequent items on uncertain streams (SIGMOD08)
• Top-k queries over uncertain sliding window (VLDB08)
• …

Models and Problem Definition
• Model: DS is a stream of elements; each element a lies in a d-dimensional space and has an occurrence probability P(a) in (0, 1]. The skyline probability of an element a is $P_{sky}(a) = P(a) \cdot \prod_{a' \prec a} (1 - P(a'))$, where the product runs over the elements a' among the most recent N that dominate a.
• Problem definition: among the most recent N elements, retrieve those whose skyline probability is no less than a given threshold q.
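To make the problem definition concrete, here is a minimal brute-force sketch. It assumes that the skyline probability of an element is its own occurrence probability times (1 - P) for every element in the window that dominates it, and that smaller values are better in every dimension. The function names and the (point, probability) tuple layout are illustrative only and do not come from the slides.

```python
from math import prod


def dominates(a, b):
    """True if a is no worse than b in every dimension and strictly better
    in at least one (assumption: smaller values are better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline_probability(i, window):
    """Skyline probability of window[i]; window is a list of (point, prob)."""
    point, p = window[i]
    # Every dominating element must be absent for window[i] to be a skyline point.
    return p * prod(1.0 - q for other, q in window if dominates(other, point))


def probabilistic_skyline(stream, n, q):
    """Elements among the most recent n whose skyline probability is >= q."""
    window = stream[-n:]
    return [window[i] for i in range(len(window))
            if skyline_probability(i, window) >= q]


if __name__ == "__main__":
    window = [((1, 1), 0.4), ((2, 1), 0.3), ((3, 3), 0.9)]
    print(skyline_probability(2, window))
    print(probabilistic_skyline(window, 3, 0.35))
```

This recomputes everything from scratch in O(N^2) time per query, which is exactly the cost the incremental techniques in the rest of the talk aim to avoid.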

Challenges and Contributions
• Space efficiency. Contribution: space reduced from $O(N)$ to $O(\ln^{d-1} N)$.
• Time efficiency. Contribution: efficient R-tree based incremental algorithms.

Outline
• Background and Preliminaries
• Framework
• Algorithms
• Experiment
• Conclusion

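Framework: what to keep?
• Window size N: 5; probability threshold q: 0.5
• $P_{old}(2) = 1 - P(1)$
• $P_{new}(2) = (1 - P(3)) \cdot (1 - P(4))$
• $P_{new}(2) < q$, so element 2 will never become a skyline point in the window: elements 3 and 4 arrived later and stay in the window at least as long as element 2 does.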

Framework: what to keep?
• Candidate set $S_{N,q}$: the elements a in the window with $P_{new}(a) \ge q$.
• Correctness:
  (1) no skyline point is missed
  (2) no false hits when determining $S_{N,q}$
  (3) no false positives when determining the skyline results
  (4) no false negatives when determining the skyline results
• A skyline probability computed from $S_{N,q}$ alone may not be exact, but it still falls on the correct side of the threshold.
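The pruning idea can be sketched as follows. This is a hedged illustration, assuming that $P_{old}$ and $P_{new}$ are the products of (1 - P) over the older and newer dominating elements respectively (as in the element 2 example), that the candidate set keeps exactly the elements with $P_{new} \ge q$, and that smaller values are better. The helper names and the sample data are hypothetical.

```python
from math import prod


def dominates(a, b):
    """Assumption: smaller is better in every dimension."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def p_old_new(i, window):
    """Return (P_old, P_new) of window[i]; window is ordered oldest first."""
    point, _ = window[i]
    p_old = prod(1.0 - p for pt, p in window[:i] if dominates(pt, point))
    p_new = prod(1.0 - p for pt, p in window[i + 1:] if dominates(pt, point))
    return p_old, p_new


def candidate_set(window, q):
    """Indices of elements whose P_new is still at least q.

    An element with P_new < q can be dropped for good: the newer elements
    that dominate it will outlive it in the window."""
    return [i for i in range(len(window)) if p_old_new(i, window)[1] >= q]


if __name__ == "__main__":
    # Hypothetical window, oldest first: (point, occurrence probability).
    window = [((1, 1), 0.5), ((3, 3), 0.9), ((2, 2), 0.4), ((2, 1), 0.3)]
    print(p_old_new(1, window))
    print(candidate_set(window, 0.5))
```

In the sample window the second element has a high occurrence probability of its own, yet it is pruned because the two newer dominating elements push its $P_{new}$ below the threshold.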

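Framework
• Space required for $S_{N,q}$:
• $S_{N,q}$ is the minimum information that must be maintained to get a correct answer.
• Example (window size N: 4): $P_{sky}(3) = 0.9 \cdot (1 - 0.4) \cdot (1 - 0.3) < q$ while the two elements that dominate element 3 are in the window, and $P_{sky}(3) = 0.9 > q$ once neither of them remains.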

Space of Candidate Set
• Theorem: under uniform distributions, the candidate set requires poly-logarithmic space in the average case, $O(f(q) \ln^{d-1} N)$.

Outline
• Background and Preliminaries
• Framework
• Algorithms
• Experiment
• Conclusion

Algorithms
We maintain two R-trees:
• R1: $SKY_{N,q}$ (the skylines)
• R2: $S_{N,q} - SKY_{N,q}$ (the candidates that are not skylines)

Algorithms (example)
Elements with occurrence probabilities: 1 (.1), 2 (.1), 3 (.4), 4 (.1), 5 (.8), 6 (.8), 7 (.6), 8 (.2), 9 (.5), 10 (.2), 11 (.6), 12 (.1), 13 (.1)
Window size N: 13. Each element is either in R1 ($SKY_{N,q}$), in R2 ($S_{N,q} - SKY_{N,q}$), or not in $S_{N,q}$.

Algorithms
• New element arrives
  • Check $P_{sky}$ and $P_{new}$ on R1
  • Check $P_{new}$ on R2
  • Handle elements with $P_{new} < q$
• Old element expires
  • Update $P_{old}$
  • Check $P_{sky}$ on R2
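The arrival and expiry steps above can be sketched without R-trees. This minimal version keeps the candidates in a plain deque, updates $P_{new}$ incrementally on every arrival, and recomputes $P_{old}$ from the surviving candidates at query time instead of maintaining it by division as the slides do. The class and attribute names are hypothetical, and dominance assumes smaller values are better.

```python
from collections import deque
from math import prod


def dominates(a, b):
    """Assumption: smaller is better in every dimension."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


class WindowSkyline:
    """List-based stand-in for the two R-trees (illustrative only)."""

    def __init__(self, n, q):
        self.n, self.q = n, q
        self.count = 0        # number of elements seen so far
        self.cands = deque()  # [seq, point, prob, p_new], oldest first

    def insert(self, point, prob):
        self.count += 1
        # Expire the element that just left the window, if it is still kept.
        if self.cands and self.cands[0][0] <= self.count - self.n:
            self.cands.popleft()
        # The new element lowers P_new of every candidate it dominates.
        for c in self.cands:
            if dominates(point, c[1]):
                c[3] *= 1.0 - prob
        # Candidates whose P_new fell below q can never be skylines again.
        self.cands = deque(c for c in self.cands if c[3] >= self.q)
        self.cands.append([self.count, point, prob, 1.0])

    def skyline(self):
        """Sequence numbers of candidates with P * P_old * P_new >= q."""
        items = list(self.cands)
        result = []
        for i, (seq, point, prob, p_new) in enumerate(items):
            p_old = prod(1.0 - c[2] for c in items[:i] if dominates(c[1], point))
            if prob * p_old * p_new >= self.q:
                result.append(seq)
        return result


if __name__ == "__main__":
    w = WindowSkyline(n=3, q=0.5)
    for point, prob in [((3, 3), 0.9), ((1, 1), 0.4), ((2, 2), 0.3)]:
        w.insert(point, prob)
        print(w.skyline())
```

Computing $P_{old}$ over the candidates only can overestimate the true skyline probability, but, as the framework slide notes, the comparison against q still comes out right.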

Algorithms: new element arrives. Delete an entry
Elements: 2(.1), 3(.4), 4(.1), 5(.8), 6(.8), 7(.6), 8(.2), 9(.5), 10(.2), 11(.6), 12(.1), 13(.1); R1: $SKY_{N,q}$; R2: $S_{N,q} - SKY_{N,q}$; window size N: 13; arriving element: (0.8)
• Before update: $P_{new}$: (1, 1); $P_{sky}$: (0.8, 0.8); global $P_{new}$ = 1 – 0.2
• After update: global $P_{new}$ is multiplied by the update factor
• Delete the entry from R1

Algorithms: new element arrives. Move an entry from R1 to R2
Elements: 2(.1), 3(.4), 4(.1), 7(.6), 8(.2), 9(.5), 10(.2), 11(.6), 12(.1), 13(.1); R1: $SKY_{N,q}$; R2: $S_{N,q} - SKY_{N,q}$; window size N: 13; arriving element: (0.8)
• Before update (min, max): $P_{new}$: (1, 1); $P_{sky}$: (0.24, 0.6); global $P_{new}$ = 1
• After update: global $P_{new}$ *= 1 – 0.8, so min $P_{new}$ = 0.2 ≥ q and max $P_{sky}$ = 0.12 < q
• Move the entry from R1 to R2

Algorithms: new element arrives. Drill down
Elements: 2(.1), 3(.4), 4(.1), 7(.6), 8(.2), 9(.5), 10(.2), 11(.6), 12(.1), 13(.1); R1: $SKY_{N,q}$; R2: $S_{N,q} - SKY_{N,q}$; window size N: 13; arriving element: (0.8)
• Before update (min, max): $P_{new}$: (0.9, 1); global $P_{new}$ = 1
• After update: global $P_{new}$ *= 1 – 0.8, so min $P_{new}$ < q and max $P_{new}$ ≥ q
• Drill down and delete element 2
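The three outcomes walked through above (delete, move from R1 to R2, drill down) can be expressed as one decision on an entry's aggregate bounds. The sketch below is a simplification under stated assumptions: the bounds are treated as effective (min, max) values with any lazy global factor already applied, the arriving element is assumed to dominate the whole entry, and the threshold q = 0.19 is a hypothetical value chosen only because the slides' threshold is not recoverable from the transcript. `EntryBounds` and `on_dominating_arrival` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class EntryBounds:
    """Aggregate bounds kept on an index entry (hypothetical layout)."""
    min_p_new: float
    max_p_new: float
    min_p_sky: float
    max_p_sky: float


def on_dominating_arrival(e, p, q, in_r1=True):
    """Scale the bounds by (1 - p) and decide what to do with the entry."""
    f = 1.0 - p
    after = EntryBounds(e.min_p_new * f, e.max_p_new * f,
                        e.min_p_sky * f, e.max_p_sky * f)
    if after.max_p_new < q:
        action = "delete"        # nothing below can stay a candidate
    elif after.min_p_new < q:
        action = "drill down"    # some children stay, some must go
    elif in_r1 and after.max_p_sky < q:
        action = "move to R2"    # still candidates, no longer skylines
    elif in_r1 and after.min_p_sky < q:
        action = "drill down"    # some children drop out of the skyline
    else:
        action = "keep"
    return after, action


if __name__ == "__main__":
    q = 0.19  # hypothetical threshold
    print(on_dominating_arrival(EntryBounds(0.8, 0.8, 0.8, 0.8), 0.8, q)[1])
    print(on_dominating_arrival(EntryBounds(1.0, 1.0, 0.24, 0.6), 0.8, q)[1])
    print(on_dominating_arrival(EntryBounds(0.9, 1.0, 0.05, 0.1), 0.8, q, in_r1=False)[1])
```

Deciding on bounds is what lets an R-tree based algorithm skip whole subtrees: only entries whose bounds straddle q need to be opened.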

Algorithms: new element arrives. Update $P_{old}$
Elements: 2(.1), 3(.4), 4(.1), 7(.6), 8(.2), 9(.5), 10(.2), 11(.6), 12(.1), 13(.1); R1: $SKY_{N,q}$; R2: $S_{N,q} - SKY_{N,q}$; window size N: 13; arriving element: (0.8)
• Update $P_{old}$ of elements 12 and 13: global $P_{old}$ /= (1 – 0.1)

Algorithms: new element arrives. Insert the new element
Elements: 3(.4), 4(.1), 7(.6), 8(.2), 9(.5), 10(.2), 11(.6), 12(.1), 13(.1); R1: $SKY_{N,q}$; R2: $S_{N,q} - SKY_{N,q}$; window size N: 13; arriving element: (0.8)
• Insert the new element with $P_{new}$ = 1 and compute its $P_{sky}$

Algorithms: old element expires
• Delete it from R1 or R2.
• Update $P_{old}$ of the remaining elements: record a global $P_{old}$ on intermediate entries that are fully dominated by the expired element.
• Check $P_{sky}$ after the update.
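The $P_{old}$ update on expiry can be illustrated with a flat list. This is a sketch under assumptions: each kept element carries its own `p_old` value (no lazy global factor on intermediate entries), dominance means smaller is better, and the expiring element's probability is strictly below 1 so that the division is defined. The dictionary keys and the function name are hypothetical.

```python
def dominates(a, b):
    """Assumption: smaller is better in every dimension."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def expire_oldest(window):
    """Remove the oldest element and undo its contribution to P_old.

    window: list of dicts with keys 'point', 'prob', 'p_old', oldest first.
    The list and the p_old values are updated in place."""
    old = window.pop(0)
    if old["prob"] >= 1.0:
        raise ValueError("sketch assumes prob < 1 so the division is defined")
    for e in window:
        if dominates(old["point"], e["point"]):
            # The expired element no longer has to be absent.
            e["p_old"] /= 1.0 - old["prob"]
    return old


if __name__ == "__main__":
    window = [
        {"point": (1, 1), "prob": 0.1, "p_old": 1.0},
        {"point": (2, 2), "prob": 0.6, "p_old": 0.9},
        {"point": (0, 5), "prob": 0.5, "p_old": 1.0},
    ]
    expire_oldest(window)
    print([e["p_old"] for e in window])
```

After the update, any element whose skyline probability rose to q or above has to be promoted to the skyline set, which is why the slide checks $P_{sky}$ afterwards.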

Algorithms: old element expires
Elements: 3(.4), 4(.1), 7(.6), 8(.2), 9(.5), 10(.2), 11(.6), 12(.1), 13(.1); R1: $SKY_{N,q}$; R2: $S_{N,q} - SKY_{N,q}$; window size N: 13
• $P_{old}(7)$ /= 1 – P(3)
• global $P_{old}$ /= 1 – P(4)

Algorithms: handling multiple thresholds
• Continuous queries
  • Users specify k probability thresholds $q_1, \dots, q_k$ ($q_i < q_{i-1}$).
  • Solution: instead of maintaining a single R1, we maintain $R_1, \dots, R_k$, each corresponding to one confidence value.
• Ad-hoc queries
  • Users issue a query: retrieve skylines with probability at least q' ($q' \ge q_k$).
  • Solution: find an $R_i$ with $q_i \le q' < q_{i-1}$. Then all elements in $\{R_j : j < i-1\}$ are results, and we search $R_{i-1}$ to output the qualifying skylines.
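The bucketing idea can be sketched with plain lists in place of R-trees. The sketch fixes its own convention, which may be offset by one from the indexing used on the slide: with thresholds sorted in decreasing order, bucket i holds the elements whose skyline probability lies in [q_i, q_{i-1}). An ad-hoc query then takes every bucket whose lower bound already meets q' in full and filters only the first bucket that does not. All names and the sample probabilities are hypothetical.

```python
def build_buckets(p_sky, thresholds):
    """Group element ids by the largest threshold their P_sky reaches.

    p_sky: dict mapping element id to skyline probability.
    thresholds: strictly decreasing list q_1 > ... > q_k.
    Elements below q_k are not kept at all."""
    buckets = [[] for _ in thresholds]
    for elem, p in p_sky.items():
        for i, q in enumerate(thresholds):
            if p >= q:
                buckets[i].append(elem)
                break
    return buckets


def adhoc_query(buckets, thresholds, p_sky, q_prime):
    """Elements with skyline probability >= q_prime, for q_prime >= q_k."""
    if q_prime < thresholds[-1]:
        raise ValueError("q_prime must be at least the smallest threshold")
    result = []
    for i, q in enumerate(thresholds):
        if q >= q_prime:
            result.extend(buckets[i])  # the whole bucket qualifies
        else:
            # Only this one bucket needs an element-by-element check.
            result.extend(e for e in buckets[i] if p_sky[e] >= q_prime)
            break
    return result


if __name__ == "__main__":
    p_sky = {"a": 0.9, "b": 0.7, "c": 0.55, "d": 0.35, "e": 0.1}
    thresholds = [0.8, 0.5, 0.3]
    buckets = build_buckets(p_sky, thresholds)
    print(buckets)
    print(adhoc_query(buckets, thresholds, p_sky, 0.6))
```

More buckets mean more structures to maintain per update but a smaller bucket to scan per ad-hoc query, which matches the trade-off reported in the timing experiments.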

Experiment
• Data sets:
  • Real: stock transactions; 2-d; probabilities assigned randomly; size: 2 million
  • Synthetic: spatial location (independent or anti-correlated); probability (uniform or normal); 2-d to 5-d; size: 2 million
• Default values: p: 0.3; d: 3; N: 1M; spatial distribution: anti-correlated; probability distribution: uniform

Experiment: space
Space usage is about 0.1% of the sliding window size for 2-d data; around 89% of the space is saved even for 5-d data.

Experiment: space
The size of $S_{N,q}$ decreases as $P_u$ increases, while the size of $SKY_{N,q}$ increases with it.

Experiment: space

Experiment: time

Experiment: time
Maintenance time increases with the number of probability thresholds; query time decreases with it.

Conclusion
• We characterize a candidate set of minimum size and propose time-efficient techniques.
• We extend the framework to handle multiple thresholds.

Thanks!