Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach. Wenjie Zhang, Xuemin Lin (The University of New South Wales & NICTA); Ming Hua, Jian Pei (Simon Fraser University). Presenter: Wang Liang. Supervisor: Prof. David Cheung.

Outline Introduction Problem Algorithms Experiments Conclusions

Introduction Top-k query on a certain (deterministic) database: return the k tuples with the highest scores under a given scoring function. Example query: the top-2 longest durations that a panda stays at a location within a time period.

Introduction If the database is uncertain, what is the answer to a top-k query? An uncertain database can be viewed as a summary of a set of possible worlds. Each tuple is associated with a membership probability. Multiple tuples may be constrained by generation rules, such as mutual exclusion among them, e.g., {R2, R3} and {R5, R6}. A possible world is a deterministic database instance. Example query: the top-2 longest durations that a panda stays at a location within a time period.

Introduction Two Problems: What does a probabilistic top-k query mean? How can a probabilistic threshold top-k query be answered efficiently?

Outline Introduction Problem Algorithms Experiments Conclusions

Problem Settings Database: an uncertain database. A probabilistic threshold top-k query (PT-k query): query Q(k, f(x), T) with threshold p. For each possible world W, Q is applied and a set Q^k(W) of k tuples is returned. The top-k probability of a tuple t is the probability that t is in Q^k(W) over all possible worlds W, i.e., Pr^k(t) = Σ_{W : t ∈ Q^k(W)} Pr(W). The answer set of a PT-k query is the set of all tuples whose top-k probability is at least p.

Example A probabilistic threshold top-k query (PT-k query): query Q(2, Duration(x), Table 3) with threshold p = 0.3. For each possible world W, Q is applied and a set Q^2(W) of 2 tuples is returned. The top-k probability of t is the probability that t is in Q^2(W) over all possible worlds W. The answer set of the PT-k query is the set of all tuples whose top-k probability is at least p = 0.3.
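To make the PT-k semantics concrete, here is a minimal brute-force Python sketch that enumerates all possible worlds of a small uncertain table of independent tuples (generation rules are omitted). The tuple names, scores, and probabilities are hypothetical illustration values, not the data of Table 3.

```python
from itertools import product

# Toy uncertain table: (name, score, membership probability); all independent.
tuples = [("t1", 30, 0.4), ("t2", 25, 0.9), ("t3", 20, 0.6), ("t4", 15, 0.7)]

def ptk_answer(tuples, k, p):
    """Return the tuples whose top-k probability is at least p,
    computed by enumerating every possible world."""
    topk_prob = {name: 0.0 for name, _, _ in tuples}
    # Each independent tuple either appears (1) or not (0) in a possible world.
    for world in product([0, 1], repeat=len(tuples)):
        prob = 1.0
        present = []
        for bit, (name, score, pr) in zip(world, tuples):
            prob *= pr if bit else (1.0 - pr)
            if bit:
                present.append((score, name))
        # Top-k tuples of this possible world, ranked by score.
        for _, name in sorted(present, reverse=True)[:k]:
            topk_prob[name] += prob
    return {name: q for name, q in topk_prob.items() if q >= p}

print(ptk_answer(tuples, k=2, p=0.3))
```

Enumeration is exponential in the number of tuples, which is exactly why the algorithms below avoid materializing possible worlds.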

Outline Introduction Problem Algorithms Experiments Conclusions

Algorithms An Exact Algorithm A Sampling Method A Poisson Approximation Based Method

An Exact Algorithm The Basic Case Handling Generation Rules Pruning Techniques

The Basic Case Assumption: all tuples are independent. Scan the tuples in the table in ranking order. Let L = t_1, t_2, ..., t_n be the list of all tuples in ranking order. For a tuple t_i, the dominant set S_{t_i} = {t_1, ..., t_{i-1}} contains the tuples ranked higher than t_i. Pr(t_i, j) is the probability that tuple t_i is ranked at the j-th position over all possible worlds; Pr(S_{t_i}, j) is the probability that exactly j tuples in S_{t_i} appear; Pr^k(t_i) is the top-k probability of t_i. In the basic case, Pr(t_i, j) = Pr(t_i) · Pr(S_{t_i}, j − 1), so for 1 ≤ i ≤ n, Pr^k(t_i) = Σ_{j=1}^{k} Pr(t_i, j) = Pr(t_i) · Σ_{j=0}^{k−1} Pr(S_{t_i}, j), where the subset probabilities Pr(S_{t_i}, j) are maintained incrementally by the Poisson binomial recurrence as the scan proceeds.
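A minimal Python sketch of the basic case (not the authors' C++ implementation): it scans independent tuples in ranking order and maintains Pr(S_{t_i}, j) for j = 0 .. k−1 with the Poisson binomial recurrence. The input probabilities are the same illustrative values as in the brute-force example above.

```python
def topk_probabilities(probs, k):
    """Top-k probability of each tuple when all tuples are independent.

    `probs` lists the membership probabilities Pr(t_1), ..., Pr(t_n) of the
    tuples already sorted in ranking (score) order.  For tuple t_i,
    Pr^k(t_i) = Pr(t_i) * sum_{j=0..k-1} Pr(S_{t_i}, j), where Pr(S, j) is
    the probability that exactly j dominating tuples appear.
    """
    subset = [1.0] + [0.0] * (k - 1)      # Pr(S, j) for j = 0 .. k-1, S empty
    result = []
    for p in probs:
        result.append(p * sum(subset))    # Pr^k(t_i)
        # Poisson binomial recurrence: fold the current tuple into the dominant set.
        new_subset = [0.0] * k
        for j in range(k):
            new_subset[j] = (1.0 - p) * subset[j]
            if j > 0:
                new_subset[j] += p * subset[j - 1]
        subset = new_subset
    return result

# Example: top-2 probabilities for four independent tuples in ranking order.
print(topk_probabilities([0.4, 0.9, 0.6, 0.7], k=2))
```

On these toy probabilities it prints the same top-2 probabilities as the brute-force enumeration, while reading each tuple only once.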

Handling Generation Rules Rule-tuple compression: let L = t_1, t_2, ..., t_n be the list of all tuples in ranking order. For a tuple t_i, two situations caused by multi-tuple generation rules complicate the computation. Situation one: t_i is an independent tuple, and some tuples involved in a generation rule R are ranked higher than t_i. Situation two: t_i is involved in a generation rule R, and some tuples in R are ranked higher than t_i.

Situation One Situation one: t_i is an independent tuple, and some tuples involved in a generation rule R are ranked higher than t_i. Solution: suppose R = t_{r1} ⊕ ... ⊕ t_{rm} is listed in ranking order and t_{r1}, ..., t_{rm0} are ranked higher than t_i. The tuples involved in R can be divided into two parts: the tuples ranked higher than t_i, which are compressed into a single rule-tuple whose membership probability is the sum of their probabilities, and the tuples ranked lower than t_i, which do not affect Pr^k(t_i) and can be ignored.

Situation Two Situation two: t_i is involved in a generation rule R, and some tuples in R are ranked higher than t_i. Solution: suppose R = t_{r1} ⊕ ... ⊕ t_{rm} is listed in ranking order and t_{r1}, ..., t_{rm0} are ranked higher than t_i. The tuples involved in R can be divided into two parts: because the tuples in R are mutually exclusive, the members ranked higher than t_i cannot appear in any possible world where t_i appears and are therefore excluded when computing Pr^k(t_i), while the members ranked lower than t_i can be ignored as before.
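A hedged sketch of rule-tuple compression covering both situations; the data layout (`tuples`, `rules`, `target`) is hypothetical, and score ties are ignored for simplicity.

```python
def compressed_dominant_probs(tuples, rules, target):
    """Membership probabilities of the rule-compressed dominant set of `target`.

    `tuples` maps tuple id -> (score, probability); `rules` lists sets of
    mutually exclusive tuple ids.  Higher-ranked members of a rule are merged
    into one rule-tuple whose probability is the sum of theirs (situation one).
    If `target` itself belongs to the rule, its higher-ranked rule mates are
    dropped, since they cannot coexist with `target` (situation two).
    """
    t_score = tuples[target][0]
    in_rule = {t for r in rules for t in r}
    probs = []
    # Independent tuples ranked higher than the target.
    for tid, (score, p) in tuples.items():
        if tid != target and tid not in in_rule and score > t_score:
            probs.append(p)
    # Rule-tuple compression for each multi-tuple generation rule.
    for rule in rules:
        if target in rule:
            continue  # situation two: exclusive mates of the target are dropped
        higher = [tuples[t][1] for t in rule if tuples[t][0] > t_score]
        if higher:
            probs.append(sum(higher))  # situation one: one compressed rule-tuple
    return probs

# Pr^k(target) = Pr(target) * sum_{j=0..k-1} Pr(S, j), with Pr(S, j) computed
# over the returned probabilities by the recurrence shown earlier.
```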

Example Query: top-3 with p = 0.3. Pr(t_i, j) is the probability that tuple t_i is ranked at the j-th position over all possible worlds. Pr(S_{t_i}, j) is the probability that exactly j tuples in S_{t_i} appear. Pr^k(t_i) is the top-k probability of t_i. Generation rules: {t1, t2, t8} and {t4, t5}.

Pruning Techniques Four pruning rules: two of them avoid checking tuples that cannot satisfy the probability threshold, and two of them provide stopping conditions for the scan.

Time Complexity R_T is the set of all generation rules in table T, n is the number of tuples in T, and span(R) is the number of tuples involved in a generation rule R. Scanning the n tuples costs O(k) recurrence updates per tuple, and handling each rule R adds work that grows with span(R), so the overall cost of the exact algorithm depends on n, k, and Σ_{R ∈ R_T} span(R).

Algorithms An Exact Algorithm A Sampling Method A Poisson Approximation Based Method

A Sampling Method Trades off the accuracy of answers against efficiency. For a tuple t, let X_t be an indicator random variable for the event that t is ranked in the top-k in a possible world: X_t = 1 if t is ranked in the top-k list, and X_t = 0 otherwise. Then Pr^k(t) = E[X_t]. Generate a set S of sample possible worlds and compute the mean of X_t over S, namely E_S[X_t], as an approximation of E[X_t].

A Sampling Method Scan table T once to generate one sample unit s. An independent tuple t_i is included in s with probability Pr(t_i). For a multi-tuple generation rule R, s includes one tuple involved in R with probability Pr(R); if s takes a tuple from R, tuple t_{rl} is chosen with probability Pr(t_{rl}) / Pr(R). Compute the top-k tuples in s; for each tuple t in the top-k list, set X_t = 1. The top-k probability of t_i is estimated as Pr^k(t_i) ≈ E_S[X_{t_i}]. Stopping condition: by the Chernoff-Hoeffding bound, a number of sample units on the order of (1/ε²) ln(2/δ) suffices to estimate each Pr^k(t_i) within ε with probability at least 1 − δ.
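A Python sketch of this sampling procedure under the assumptions above. The sample-size formula below uses the standard Hoeffding inequality, which may differ in constants from the exact bound used in the paper, and the data layout (`tuples`, `rules`) is hypothetical.

```python
import math
import random

def sample_topk_probabilities(tuples, rules, k, eps=0.05, delta=0.05, seed=0):
    """Estimate the top-k probability of every tuple by sampling possible worlds.

    `tuples` maps tuple id -> (score, probability); `rules` lists lists of
    mutually exclusive tuple ids.  An independent tuple joins a sample with its
    own probability; for a rule R, member t is drawn with probability Pr(t),
    and R contributes no tuple with probability 1 - Pr(R).  By Hoeffding's
    inequality, ln(2/delta) / (2 * eps^2) samples keep each estimate within
    eps of the true value with probability at least 1 - delta.
    """
    rng = random.Random(seed)
    num_samples = math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
    in_rule = {t for r in rules for t in r}
    counts = {tid: 0 for tid in tuples}
    for _ in range(num_samples):
        world = [tid for tid, (_, p) in tuples.items()
                 if tid not in in_rule and rng.random() < p]
        for rule in rules:
            u, acc = rng.random(), 0.0
            for t in rule:                       # pick at most one member of R
                acc += tuples[t][1]
                if u < acc:
                    world.append(t)
                    break
        world.sort(key=lambda t: tuples[t][0], reverse=True)
        for t in world[:k]:                      # top-k of this sampled world
            counts[t] += 1
    return {tid: c / num_samples for tid, c in counts.items()}
```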

Algorithms An Exact Algorithm A Sampling Method A Poisson Approximation Based Method

A Poisson Approximation Based Method Let X_1, ..., X_n be a set of independent random variables such that Pr(X_i = 1) = p_i and Pr(X_i = 0) = 1 − p_i, and let X = Σ_{i=1}^{n} X_i, so E[X] = Σ_{i=1}^{n} p_i. If all p_i's are identical, X_1, ..., X_n are called Bernoulli trials and X follows a binomial distribution; otherwise, X_1, ..., X_n are called Poisson trials and X follows a Poisson binomial distribution. Construct a set of Poisson trials corresponding to S_{t_i} as follows: for each independent tuple t in S_{t_i}, construct a trial with success probability Pr(t); for each multi-tuple rule R with tuples in S_{t_i}, combine those tuples into a rule-tuple t_R whose probability Pr(t_R) is the sum of their membership probabilities and construct a trial with success probability Pr(t_R). Let X be the sum of the constructed trials; then Pr^k(t_i) = Pr(t_i) · Pr(X ≤ k − 1).

A Poisson Approximation Based Method Distribution of the Poisson binomial probability: if the expected number of appearing dominating tuples μ = E[X] is sufficiently larger than k, then the top-k probability of t_i is small. This yields a general stopping condition: once μ, accumulated over the scanned tuples, exceeds a bound determined by k and the threshold p, no remaining tuple can have top-k probability at least p and the scan can stop. For example, the bound can be instantiated for k = 100 and p = 0.3.

A Poisson Approximation Based Method When the success probabilities are small and the number of Poisson trials is large, the Poisson binomial distribution can be approximated well by a Poisson distribution. For a set of Poisson trials X_1, ..., X_n such that Pr(X_i = 1) = p_i, let X = Σ_{i=1}^{n} X_i; X follows a Poisson binomial distribution. Let μ = Σ_{i=1}^{n} p_i; the probability Pr(X ≤ k − 1) can then be approximated by the Poisson tail Σ_{j=0}^{k−1} e^{−μ} μ^j / j!, so Pr^k(t_i) ≈ Pr(t_i) · Σ_{j=0}^{k−1} e^{−μ} μ^j / j!.
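A short sketch of this Poisson approximation for a single tuple; the function name and the example numbers are illustrative, and the dominant-set probabilities are assumed to already be rule-compressed as described earlier.

```python
import math

def poisson_topk_probability(p_target, dominant_probs, k):
    """Approximate Pr^k(t) for a tuple with membership probability `p_target`.

    `dominant_probs` holds the membership probabilities of its (rule-compressed)
    dominant set.  The number X of dominating tuples that appear follows a
    Poisson binomial distribution with mean mu = sum(dominant_probs);
    Pr(X <= k-1) is approximated by the Poisson CDF with the same mean.
    """
    mu = sum(dominant_probs)
    # Poisson CDF: sum_{j=0}^{k-1} e^{-mu} * mu^j / j!
    term = math.exp(-mu)            # j = 0
    cdf = term
    for j in range(1, k):
        term *= mu / j
        cdf += term
    return p_target * cdf

# Example (hypothetical numbers): a tuple with probability 0.6 whose dominant
# set contains 150 tuples of probability 0.05 each, for a top-100 query.
print(poisson_topk_probability(0.6, [0.05] * 150, k=100))
```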

A Poisson Approximation Based Method Time complexity: the method reads only n' tuples, where n' is the number of tuples read before the general stopping condition is satisfied; n' depends on the parameter k, the probability threshold p, and the probability distribution of the tuples.

Outline Introduction Problem Algorithms Experiments Conclusions

Experiments Setting PC: 3.0 GHz Pentium 4 CPU, 1.0 GB main memory, and a 160 GB hard disk, running Microsoft Windows XP Professional Edition. Algorithms: implemented in Microsoft Visual C++ 6.0. Data sets: one real data set and several synthetic data sets. Real data set: the International Ice Patrol Iceberg Sightings Database, used to show the differences among the answers of different definitions of top-k queries on uncertain data. Synthetic data sets: used to evaluate the algorithms.

Synthetic Data Sets 20,000 tuples and 2,000 multi-tuple generation rules. The number of tuples involved in each multi-tuple generation rule follows the normal distribution N(5, 2). The probability values of independent tuples and of multi-tuple generation rules follow the normal distributions N(0.5, 0.2) and N(0.7, 0.2), respectively. By default, k = 200 and p = 0.3. NB: since ranking queries are extensively supported by modern database management systems, the experiments treat the generation of the ranked list of uncertain tuples as a black box and test the algorithms on top of the ranked list.
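A sketch of a generator for synthetic data of this shape. How a rule's probability is divided among its members is not specified above, so the uniform split below is an assumption, as are the clamping of probabilities and the arbitrary random scores.

```python
import random

def generate_synthetic_table(n_tuples=20000, n_rules=2000, seed=0):
    """Generate a synthetic uncertain table following the setup above:
    rule sizes ~ N(5, 2), independent-tuple probabilities ~ N(0.5, 0.2),
    rule probabilities ~ N(0.7, 0.2)."""
    rng = random.Random(seed)

    def clamp(x):
        return max(0.01, min(1.0, x))   # keep values valid as probabilities

    probs = [0.0] * n_tuples
    scores = [rng.random() for _ in range(n_tuples)]
    ids = list(range(n_tuples))
    rng.shuffle(ids)
    rules, pos = [], 0
    for _ in range(n_rules):
        size = max(2, int(round(rng.gauss(5, 2))))
        members = ids[pos:pos + size]
        if len(members) < 2:            # ran out of tuples to put into rules
            break
        pos += len(members)
        rule_p = clamp(rng.gauss(0.7, 0.2))
        for m in members:               # assumed: split rule probability uniformly
            probs[m] = rule_p / len(members)
        rules.append(members)
    for i in ids[pos:]:                 # the remaining tuples are independent
        probs[i] = clamp(rng.gauss(0.5, 0.2))
    return scores, probs, rules
```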

Scan Depth Stopping condition: the number of tuples scanned by the Poisson approximation based method. Avg sample length: the average number of tuples read by the sampling algorithm to generate one sample unit. Exact algo: the number of tuples scanned by the exact algorithm. Answer set: the number of tuples in the answer set.

Efficiency RC: exact algorithm with rule-tuple compression only. RC+AR: exact algorithm with rule-tuple compression and aggressive reordering. RC+LR: exact algorithm with rule-tuple compression and lazy reordering. Sampling: the sampling method. The runtime of the Poisson approximation based method is always less than one second.

The approximation quality The recall and precision of the Poisson approximation based method are always higher than 85%, with runtime less than one second. Precision: the percentage of tuples returned by the approximate method that are in the answer set returned by the exact algorithm. Recall: the percentage of tuples returned by the exact algorithm that are also returned by the approximate method.
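For reference, precision and recall as defined above can be computed directly from the two answer sets; the tuple ids in the example are illustrative.

```python
def precision_recall(approx_answer, exact_answer):
    """Precision and recall of an approximate PT-k answer set, as defined above."""
    approx, exact = set(approx_answer), set(exact_answer)
    precision = len(approx & exact) / len(approx) if approx else 1.0
    recall = len(approx & exact) / len(exact) if exact else 1.0
    return precision, recall

# Example: precision 2/3, recall 2/4.
print(precision_recall({"t1", "t2", "t9"}, {"t1", "t2", "t3", "t4"}))
```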

Scalability (a): the number of tuples varies from 20,000 to 100,000; the number of multi-tuple rules is kept at 10% of the number of tuples; k = 200 and p = 0.3. (b): the number of tuples is fixed while the number of rules varies from 500 to 2,500. The runtime increases mildly as the database size increases, thanks to the pruning rules and the improvement in extracting sample units.

Outline Introduction Problem Algorithms Experiments Conclusions

Conclusions Proposed a new definition of top-k queries on uncertain data: the probabilistic threshold top-k query. Developed three different algorithms: an exact algorithm, a sampling method, and a Poisson approximation based method.

The End

The approximation quality Average error rate: the average deviation of the approximate top-k probabilities from the exact values. Precision: the percentage of tuples returned by the approximate method that are in the answer set returned by the exact algorithm. Recall: the percentage of tuples returned by the exact algorithm that are also returned by the approximate method.

Pruning Technique Let t_1, ..., t_m, ..., t_n be the tuples in ranking order, and assume the prefix L = t_1, ..., t_m has been read. Let LR be the set of rules open with respect to t_{m+1} (rules whose member tuples have been only partially read). For any tuple t_i (i > m), an upper bound on its top-k probability can be derived from the subset probabilities already computed for L: one bound applies if t_i is not involved in any rule in LR, and another if t_i is involved in a rule in LR. Any tuple whose upper bound is below the threshold p can be pruned without being read.
