2009.01 Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua, Jian Pei Simon Fraser University Presenter: Wang Liang Supervisor: Prof. David Cheung

2009.01 Outline Introduction Problem Algorithms Performance of Experiments Conclusions

2009.01 Introduction A top-k query on a certain (deterministic) database returns the k tuples with the highest scores under some scoring function. Query: Top-2 longest durations that a panda stays in a location in a time

2009.01 Introduction If the database is uncertain, what is the answer to a top-k query? An uncertain database can be viewed as a summary of a set of possible worlds, where a possible world is a deterministic database instance. Each tuple is associated with a probability. Multiple tuples can be subject to constraints, such as mutual exclusion, expressed as generation rules: {R2, R3} {R5, R6}. Query: Top-2 longest durations that a panda stays in a location in a time

2009.01 Introduction Two Problems: What does a probabilistic top-k query mean? How can a probabilistic threshold top-k query be answered efficiently?

2009.01 Problem Settings Database: an uncertain database. A probabilistic threshold top-k query (PT-k query) consists of a top-k query Q(k, f(x), T) and a probability threshold p. For each possible world W, Q is applied and a set of k tuples $Q^k(W)$ is returned. The top-k probability of a tuple t is the probability that t is in $Q^k(W)$, taken over all possible worlds W. The answer set to a PT-k query is the set of all tuples whose top-k probability is at least p.
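The definition above can be checked directly on a tiny table by enumerating every possible world. The following is a minimal brute-force sketch, not one of the algorithms presented in the slides; the function names, the dictionary layout, and the representation of a generation rule as a list of mutually exclusive tuple names are assumptions made for illustration. It is exponential in the number of tuples and only useful as a reference on toy inputs.

```python
from itertools import product

def topk_probabilities(tuples, rules, k):
    """tuples: name -> (score, probability); rules: lists of mutually exclusive names.

    Returns name -> top-k probability, computed by enumerating all possible worlds.
    """
    in_rule = {name for rule in rules for name in rule}
    # Every independent tuple forms its own single-member group.
    groups = [list(rule) for rule in rules]
    groups += [[name] for name in tuples if name not in in_rule]
    # Per group: either no member appears (None) or exactly one member appears.
    options = []
    for group in groups:
        none_prob = 1.0 - sum(tuples[name][1] for name in group)
        options.append([(None, none_prob)] + [(name, tuples[name][1]) for name in group])
    result = {name: 0.0 for name in tuples}
    for choice in product(*options):
        world_prob = 1.0
        world = []
        for name, prob in choice:
            world_prob *= prob
            if name is not None:
                world.append(name)
        # Apply the top-k query to this deterministic world.
        top = sorted(world, key=lambda name: tuples[name][0], reverse=True)[:k]
        for name in top:
            result[name] += world_prob
    return result

def pt_k(tuples, rules, k, p):
    """Answer set of a PT-k query: tuples whose top-k probability is at least p."""
    probs = topk_probabilities(tuples, rules, k)
    return {name for name, prob in probs.items() if prob >= p}
```

With three hypothetical tuples a, b, c scored 10, 9, 8 and probabilities 0.5, 0.5, 1.0, the top-1 probabilities are 0.5, 0.25, 0.25 when all tuples are independent, so a threshold of 0.3 keeps only a. Adding the rule {a, b} changes them to 0.5, 0.5, 0.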

2009.01 Example A PT-k query: Q(2, Duration(x), Table 3) with threshold p = 0.3. For each possible world W, Q is applied and the top-2 tuples $Q^2(W)$ are returned. The answer set contains every tuple whose top-2 probability, taken over all possible worlds, is at least 0.3.

2009.01 Algorithms An Exact Algorithm A Sampling Method A Poisson Approximation Based Method

2009.01 An Exact Algorithm The Basic Case Handling Generation Rules Pruning Techniques

2009.01 The Basic Case Assumption: all tuples are independent. Scan the tuples in the table in ranking order. Let $t_1, \ldots, t_n$ be the list of all tuples in table T in ranking order. For a tuple $t_i$, $S_{t_i} = \{t_1, \ldots, t_{i-1}\}$ is its dominant set. $\Pr(t_i, j)$ is the probability that tuple $t_i$ is ranked at the j-th position, taken over all possible worlds. $\Pr(S_{t_i}, j)$ is the probability that exactly j tuples in $S_{t_i}$ appear in a possible world. $\Pr^k(t_i)$ is the top-k probability of $t_i$. In the basic case, for $1 \le j \le k$, $\Pr(t_i, j) = \Pr(t_i)\,\Pr(S_{t_i}, j-1)$, so $\Pr^k(t_i) = \Pr(t_i) \sum_{j=0}^{k-1} \Pr(S_{t_i}, j)$.
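Under the independence assumption, the number of dominant-set tuples that appear is a sum of independent 0/1 trials, so its distribution can be built up one tuple at a time while scanning in ranking order. The sketch below uses the standard dynamic-programming recurrence for such sums; the slide's own formula is not preserved in this transcript, so the code should be read as one plausible realisation rather than the authors' exact formulation. The function name and the list-of-probabilities input are assumptions.

```python
def topk_probabilities_independent(probs, k):
    """probs[i] is the probability of the i-th tuple in ranking order (all independent).

    Returns the top-k probability of every tuple in a single scan.
    """
    # dist[j] = probability that exactly j tuples of the current dominant set appear.
    # Only j = 0..k-1 matters: a tuple is in the top-k iff fewer than k
    # higher-ranked tuples appear together with it.
    dist = [1.0] + [0.0] * (k - 1)
    result = []
    for p in probs:
        result.append(p * sum(dist))
        # Fold this tuple into the dominant set of the next tuple:
        # new dist[j] = dist[j] * (1 - p) + dist[j - 1] * p
        for j in range(k - 1, 0, -1):
            dist[j] = dist[j] * (1.0 - p) + dist[j - 1] * p
        dist[0] *= 1.0 - p
    return result
```

The scan costs O(kn) time and O(k) space for n tuples, because each tuple updates at most k entries.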

2009.01 Handling Generation Rules Rule-tuple compression: let $t_1, \ldots, t_n$ be the list of all tuples in table T in ranking order. For a tuple $t_i$, the presence of multi-tuple generation rules complicates the computation in two situations. Situation one: $t_i$ is an independent tuple, and some tuples involved in a generation rule R are ranked higher than $t_i$. Situation two: $t_i$ is involved in a generation rule R, and some tuples in R are ranked higher than $t_i$.

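2009.01 Situation One Situation one: $t_i$ is an independent tuple, and some tuples involved in a generation rule R are ranked higher than $t_i$. Solution: suppose the tuples $t_{r_1}, \ldots, t_{r_m}$ of R are in ranking order and $t_{r_1}, \ldots, t_{r_{m_0}}$ are ranked higher than $t_i$. The tuples involved in R can be divided into two parts: those ranked higher than $t_i$ ($t_{r_1}, \ldots, t_{r_{m_0}}$) and those ranked lower ($t_{r_{m_0+1}}, \ldots, t_{r_m}$).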

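2009.01 Situation Two Situation two: $t_i$ is involved in a generation rule R, and some tuples in R are ranked higher than $t_i$. Solution: suppose the tuples $t_{r_1}, \ldots, t_{r_m}$ of R are in ranking order and $t_i = t_{r_{m_0}}$, so that $t_{r_1}, \ldots, t_{r_{m_0-1}}$ are ranked higher than $t_i$. The tuples involved in R can be divided into two parts: those ranked higher than $t_i$ and those ranked lower.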
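The two situations can be illustrated with a short sketch that follows directly from mutual exclusion; the exact formulas on the original slides are not preserved here, so this is an interpretation, not the authors' code. It assumes that at most one tuple of a rule appears in any world (so the higher-ranked tuples of another rule behave like one trial whose probability is the sum of their probabilities) and that tuples sharing a rule with $t_i$ never co-occur with $t_i$ (so they can be skipped). The names `prob_at_most`, `topk_probability_with_rules`, and `rule_of` are hypothetical.

```python
def prob_at_most(probs, m):
    """Pr(at most m successes) among independent trials with the given probabilities."""
    if m < 0:
        return 0.0
    dist = [1.0] + [0.0] * m
    for p in probs:
        for j in range(m, 0, -1):
            dist[j] = dist[j] * (1.0 - p) + dist[j - 1] * p
        dist[0] *= 1.0 - p
    return sum(dist)

def topk_probability_with_rules(ranked, rule_of, i, k):
    """ranked: (name, probability) pairs in ranking order.

    rule_of maps a tuple name to its rule id; independent tuples are absent.
    Returns the top-k probability of ranked[i].
    """
    name_i, prob_i = ranked[i]
    own_rule = rule_of.get(name_i)
    trials = []      # one trial per independent higher-ranked tuple
    rule_mass = {}   # one compressed rule-tuple per other rule
    for name, prob in ranked[:i]:
        rule = rule_of.get(name)
        if rule is None:
            trials.append(prob)
        elif rule != own_rule:
            # Situation one: compress the higher-ranked part of the rule.
            rule_mass[rule] = rule_mass.get(rule, 0.0) + prob
        # Situation two: same rule as t_i, mutually exclusive with it, so skipped.
    trials.extend(rule_mass.values())
    return prob_i * prob_at_most(trials, k - 1)
```

For tuples a, b, c in ranking order with probabilities 0.5, 0.5, 1.0 and the rule {a, b}, exactly one of a and b always appears, so c has top-1 probability 0 and top-2 probability 1, which is what the compressed computation returns.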

2009.01 Example Query: top-3 with p = 0.3. Generation rules: {t1, t2, t8} and {t4, t5}. $\Pr(t_i, j)$ is the probability that tuple $t_i$ is ranked at the j-th position, taken over all possible worlds. $\Pr(S_{t_i}, j)$ is the probability that exactly j tuples in $S_{t_i}$ appear in a possible world. $\Pr^k(t_i)$ is the top-k probability of $t_i$.

2009.01 Pruning Techniques Four pruning rules. Two of them avoid checking tuples that cannot satisfy the probability threshold. The other two are stopping conditions.

2009.01 Time Complexity $R_T$ is the set of all generation rules in table T. n is the number of tuples in table T. span(R) is the number of tuples in generation rule R.

2009.01 A Sampling Method Trade accuracy of the answers for efficiency. For a tuple t, let $X_t$ be an indicator random variable for the event that t is ranked in the top-k in a possible world: $X_t = 1$ if t is in the top-k list, and $X_t = 0$ otherwise. Then $\Pr^k(t) = E[X_t]$. Generate a set S of sampled possible worlds and compute the mean of $X_t$ over S, namely $E_S[X_t]$, as the approximation of $E[X_t]$.

2009.01 A Sampling Method Scan table T once to generate one sample unit s. An independent tuple $t_i$ is included in s with probability $\Pr(t_i)$. For a multi-tuple generation rule R, s includes one tuple involved in R with probability $\Pr(R)$; if it does, tuple $t_{r_l}$ is chosen with probability $\Pr(t_{r_l}) / \Pr(R)$. Compute the top-k tuples in s; for each tuple t in the top-k list, $X_t = 1$. The top-k probability of $t_i$ is estimated as $\Pr^k(t_i) \approx E_S[X_{t_i}]$. Stopping condition: Chernoff-Hoeffding bound:
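The sampling procedure can be sketched as follows. This is an illustrative version under stated assumptions: it draws a fixed number of samples instead of applying the Chernoff-Hoeffding stopping condition (whose formula is not preserved in the transcript), it takes $\Pr(R)$ to be the sum of the probabilities of the tuples in R, and the function names and dictionary-based inputs are hypothetical.

```python
import random

def sample_world(independent, rules, rng):
    """Draw one sample unit.

    independent: name -> probability; rules: list of dicts name -> probability,
    each dict holding the mutually exclusive tuples of one rule.
    """
    world = [name for name, p in independent.items() if rng.random() < p]
    for rule in rules:
        names = list(rule)
        weights = [rule[name] for name in names]
        rule_prob = sum(weights)  # assumed: Pr(R) is the sum over the rule's tuples
        if rng.random() < rule_prob:
            # Choose one member with probability Pr(t) / Pr(R).
            world.append(rng.choices(names, weights=weights)[0])
    return world

def estimate_topk_probabilities(scores, independent, rules, k, n_samples, seed=0):
    """Estimate every tuple's top-k probability as the sample mean of X_t."""
    rng = random.Random(seed)
    counts = {name: 0 for name in scores}
    for _ in range(n_samples):
        world = sample_world(independent, rules, rng)
        for name in sorted(world, key=scores.get, reverse=True)[:k]:
            counts[name] += 1  # X_t = 1 for this sample
    return {name: count / n_samples for name, count in counts.items()}
```

A fixed seed makes runs repeatable; in practice the number of samples would be chosen from the desired error bound and confidence rather than passed in by hand.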

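2009.01 A Poisson Approximation Based Method Let $X_1, \ldots, X_n$ be a set of independent random variables such that $\Pr(X_i = 1) = p_i$ and $\Pr(X_i = 0) = 1 - p_i$. Let $X = \sum_{i=1}^{n} X_i$. Then $E[X] = \sum_{i=1}^{n} p_i$. If all $p_i$'s are identical, $X_1, \ldots, X_n$ are called Bernoulli trials and X follows a binomial distribution; otherwise, they are called Poisson trials, and X follows a Poisson binomial distribution. Construct a set of Poisson trials corresponding to $S_{t_i}$ as follows. For each independent tuple $t_j \in S_{t_i}$, construct a random trial $X_{t_j}$ with success probability $\Pr(t_j)$. For each multi-tuple rule R with tuples in $S_{t_i}$, combine those tuples into a rule-tuple $t_R$ such that $\Pr(t_R)$ is the sum of their probabilities, and construct a random trial $X_{t_R}$ with success probability $\Pr(t_R)$. Let X be the sum of these trials; then $\Pr(S_{t_i}, j) = \Pr(X = j)$.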

2009.01 A Poisson Approximation Based Method Distribution of Poisson Binomial Probability If, then the top-k probability of t i is small. General stopping condition: For example: if k = 100, p = 0.3, then the stopping condition is
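The formulas for this slide did not survive in the transcript. As a hedged illustration of how a stopping condition of this kind can be derived, the sketch below assumes the standard Chernoff lower-tail bound for Poisson trials, $\Pr[X \le (1-\delta)\mu] \le e^{-\mu\delta^2/2}$ for $0 < \delta < 1$, where $\mu$ is the expected number of higher-ranked tuples that appear. Setting $(1-\delta)\mu = k$ and requiring the bound to be at most p gives $\mu \ge k + \ln(1/p) + \sqrt{\ln^2(1/p) + 2k\ln(1/p)}$. Whether this matches the slide's exact expression is an assumption; the function names are hypothetical.

```python
import math

def chernoff_tail_bound(mu, k):
    """Upper bound on Pr(X <= k) for a sum X of Poisson trials with mean mu > k."""
    return math.exp(-((mu - k) ** 2) / (2.0 * mu))

def stopping_mean(k, p):
    """Smallest mean mu at which the Chernoff bound on Pr(X <= k) drops to p."""
    c = math.log(1.0 / p)
    return k + c + math.sqrt(c * c + 2.0 * k * c)
```

For k = 100 and p = 0.3 this gives a mean of roughly 116.8. Since a tuple's top-k probability is at most Pr(X <= k - 1), which is at most Pr(X <= k), a tuple whose dominant set has at least this mean cannot reach the threshold; when all tuples are independent the mean only grows further down the ranking, so the scan can stop there.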

2009.01 A Poisson Approximation Based Method When the success probabilities are small and the number of Poisson trials is large, the Poisson binomial distribution can be approximated well by a Poisson distribution. For a set of Poisson trials $X_1, \ldots, X_n$ such that $\Pr(X_i = 1) = p_i$, let $X = \sum_{i=1}^{n} X_i$; X follows a Poisson binomial distribution. Let $\mu = E[X] = \sum_{i=1}^{n} p_i$; the probability of $X = j$ can be approximated by $\Pr(X = j) \approx \frac{\mu^j e^{-\mu}}{j!}$.
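A short sketch of how this approximation can be turned into an estimate of a top-k probability: sum the probabilities of the (compressed) dominant set to get the mean, then replace the exact count distribution with a Poisson distribution of that mean. The function names and the list-of-probabilities input are assumptions for illustration, and the estimate is only as good as the approximation itself, which degrades when individual probabilities are large.

```python
import math

def poisson_cdf(j_max, mu):
    """Pr(Y <= j_max) for a Poisson random variable Y with mean mu."""
    if j_max < 0:
        return 0.0
    term = math.exp(-mu)  # Pr(Y = 0)
    total = term
    for j in range(1, j_max + 1):
        term *= mu / j    # Pr(Y = j) from Pr(Y = j - 1)
        total += term
    return total

def approx_topk_probability(prob_t, dominant_probs, k):
    """Approximate top-k probability of a tuple with probability prob_t.

    dominant_probs holds the success probabilities of the trials built
    from its dominant set (independent tuples and rule-tuples).
    """
    mu = sum(dominant_probs)
    return prob_t * poisson_cdf(k - 1, mu)
```

Only the running sum of probabilities is needed per tuple, so each estimate costs O(k) time once the mean is known.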

2009.01 A Poisson Approximation Based Method Time Complexity n’ is the number of tuples read before the general stopping condition is satisfied, which depends on the parameter k, the probability threshold p, and the probability distribution of the tuples.

2009.01 Experiments Setting PC: 3.0 GHz Pentium 4 CPU, 1.0 GB main memory, and a 160 GB hard disk, running Microsoft Windows XP Professional Edition. Algorithms: implemented in Microsoft Visual C++ V6.0. Data sets: one real data set and several synthetic data sets. Real data set: the International Ice Patrol Iceberg Sightings Database, used to show how the answers differ under different definitions of top-k queries on uncertain data. Synthetic data sets: used to evaluate the algorithms.

2009.01 Synthetic Data Sets 20,000 tuples and 2,000 multi-tuple generation rules. The number of tuples involved in each multi-tuple generation rule follows the normal distribution N(5, 2). The probability values of independent tuples and of multi-tuple generation rules follow the normal distributions N(0.5, 0.2) and N(0.7, 0.2), respectively. By default, k = 200 and p = 0.3. NB: since ranking queries are extensively supported by modern database management systems, the authors treat the generation of a ranked list of uncertain tuples as a black box and test the algorithms on top of the ranked list.

2009.01 Scan Depth Stopping condition: number of tuples scanned by Poisson approximation based method. Avg sample length: average number of tuples read by the sampling algorithm to generate a sample unit. Exact algo: number of tuples scanned by exact algorithm. Answer set: the number of tuples in answer set.

2009.01 Efficiency RC: Exact algorithm with rule-tuple compression only. RC+AR: Exact algorithm with RC and aggressive reordering. RC+LR: Exact algorithm with RC and lazy reordering. Sampling: Sampling Method. The runtime of the Poisson approximation based method is always less than one second

2009.01 The approximation quality The recall and precision of Poisson approximation based method are always higher than 85% with runtime less than one second. Precision: percentage of tuples returned by sampling method that are in the actual top-k list returned by the exact method. Recall: percentage of tuples returned by the exact method that are also returned by the sampling method.
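The two quality measures can be computed from the answer sets as follows. This is a minimal sketch; the function name is hypothetical, and returning 1.0 when a set is empty is a convention chosen here, not something stated in the slides.

```python
def precision_recall(approx_answer, exact_answer):
    """Precision and recall of an approximate answer set against the exact one."""
    approx, exact = set(approx_answer), set(exact_answer)
    hits = len(approx & exact)
    # Precision: share of approximate answers that are truly in the exact answer.
    precision = hits / len(approx) if approx else 1.0
    # Recall: share of exact answers that the approximate method also returned.
    recall = hits / len(exact) if exact else 1.0
    return precision, recall
```

For example, if an approximate method returns four tuples of which two are in a three-tuple exact answer, precision is 2/4 and recall is 2/3.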

2009.01 Scalability (a): the number of tuples varies from 20,000 to 100,000, the number of multi-tuple rules is 10% of the number of tuples, k = 200 and p = 0.3. (b): the number of tuples is fixed and the number of rules varies from 500 to 2,500. The runtime increases mildly as the database size increases, due to the pruning rules and the improved extraction of sample units.

2009.01 Conclusions Proposed a new definition of top-k queries on uncertain data. Developed three algorithms: an exact algorithm, a sampling method, and a Poisson approximation based method.

2009.01 The End

2009.01 The approximation quality Average error rate: Precision: percentage of tuples returned by sampling method that are in the actual top-k list returned by the exact method. Recall: percentage of tuples returned by the exact method that are also returned by the sampling method.

2009.01 Pruning Technique Let t 1 …t m …t n be the tuples in the ranking order. Assume L = t 1 …t m are read. Let LR be the set of open rules with respect to t m+1. For any tuple t i (i > m), If t i is not in any rule in LR, the top-k probability of t i If t i is in a rule in LR, the top-k probability of t i

2009.01 The Reference
