Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data Wenjie Zhang University of New South Wales & NICTA, Australia Joint work:


1 Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Shuxiang Yang, Ying Zhang, Xuemin Lin (UNSW & NICTA)

2 Outline DB@UNSW
- Background and Preliminaries
- Probabilistic Threshold Range Aggregate Query
- Exact query processing
- Approximate query processing: Single Sampling & Double Sampling
- Experiments
- Conclusion

3 Applications
- Many applications involve data that is imperfect due to
  - data randomness and incompleteness
  - limitations of equipment
  - delay or loss in data transfer
  - …
- Applications
  - Sensor networks
  - Environmental surveillance
  - Moving objects
  - Data cleaning and integration
  - …

4 Applications
- Sensor networks: sensor readings are often imprecise due to equipment limitations and the periodic reporting mechanism. (Figures are borrowed from Jian et al., SIGMOD08.)

5 Applications
- Mobile equipment / moving objects: a mobile object reports its location only periodically, so its exact location is often uncertain.

6 Applications
- Satellite data

7 Applications
- Data quality
- Social data collection: errors and estimation inherent in customer surveys and sampling

8 Outline
- Background and Preliminaries
- Modeling Uncertainty & Related Work
- Probabilistic Threshold Range Query
- Conclusion

9 Modeling Uncertainty (cont.)
- Uncertain Objects Model
  1. Continuous case: described using a probability density function (PDF) $f_U$ such that $\int_{x \in U} f_U(x)\,dx = 1$. E.g., uniform distribution, normal distribution.

10 Modeling Uncertainty (cont.)
- Uncertain Objects Model
  2. Discrete case: described using a set of instances; each instance $u$ has an occurrence probability $p_u$.

11 Possible World Semantics
- Given a set of uncertain objects $\{U_1, U_2, \ldots, U_n\}$, a possible world $W = \{u_1, u_2, \ldots, u_n\}$ is a set of $n$ instances, one instance per uncertain object.
- The probability of a possible world is $P(W) = \prod_{i=1}^{n} p_{u_i}$.
- Let $\Omega$ be the set of all possible worlds; clearly, $\sum_{W \in \Omega} P(W) = 1$.
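The possible-world semantics can be checked on a toy example. The following is a minimal Python sketch, assuming the discrete model and independent objects (so that a world's probability is the product of its instances' occurrence probabilities). The data, the instance names, and the function name `possible_worlds` are made up for illustration and do not come from the talk.

```python
from itertools import product
from math import prod

# Hypothetical toy data: each uncertain object is a list of
# (instance, occurrence probability) pairs; probabilities sum to 1 per object.
objects = [
    [("a1", 0.5), ("a2", 0.5)],
    [("b1", 0.2), ("b2", 0.3), ("b3", 0.5)],
]

def possible_worlds(objects):
    """Yield (world, probability) for every choice of one instance per object."""
    for choice in product(*objects):
        world = tuple(instance for instance, _ in choice)
        # Assumes objects are independent, so probabilities multiply.
        yield world, prod(p for _, p in choice)

worlds = list(possible_worlds(objects))
total = sum(p for _, p in worlds)  # should be 1 up to floating-point rounding
```

With 2 and 3 instances per object there are 2 × 3 = 6 possible worlds. The number of worlds grows exponentially with the number of objects, which is why query processing avoids enumerating them.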

12 Probabilistic Queries
- Query Evaluation [CKP03, CXPSV04, DS04, DS05, DS07, SD07]
- Aggregate Queries [BDJR05, MJ07, CG07]
- Join Queries [CSP06, AW07]
- Top-k Queries [SIC07, YLSK08, RDS07, HJZL08]
- Nearest Neighbor Queries [KKR07, CCMC08]
- Skyline Queries [PJLY07]
- …

13 Range query
- Uncertain objects, exact query
- A probability threshold is often assigned

14 Related Work
- Range Queries [TCXNKP05, BPS06, AY08]: given a rectangle $r$ and a probability threshold $t$, find all objects that appear in $r$ with probability at least $t$.
- Appearance probability: $P_{app}(U, r) = \int_{x \in U \cap r} f_U(x)\,dx$

15 U-tree
- Probabilistically Constrained Region (PCR) [TCXNKP05]
[Figure panels: PCR(0.2); multiple PCRs]

16 Outline
- Introduction
- Modeling Uncertainty & Related Work
- Probabilistic Threshold Range Aggregate Query (PTRA)
- Conclusion

17 Contribution
- Formal definition of the PTRA query
- aU-Tree structure for exact PTRA query processing
- singleSample and doubleSample techniques for approximate answers

18 Problem Statement
Given a set of uncertain objects and a query $q$, return the number of uncertain objects whose appearance probability is no less than the threshold $p_q$.

19 Problem Definition
Assume the threshold is 0.5. If the appearance probability computed for b is > 0.5 and for c is < 0.5, then the aggregate returned is 2 (a and b).
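The example can be reproduced with a brute-force evaluation. This is a minimal Python sketch, assuming the discrete model, 2-D points, and a closed axis-aligned query rectangle. The coordinates and probabilities for a, b and c are invented to match the slide's description (a lies entirely inside the query), and the function names are illustrative.

```python
def in_rect(point, rect):
    """True if a 2-D point lies in the closed rectangle (xmin, ymin, xmax, ymax)."""
    x, y = point
    xmin, ymin, xmax, ymax = rect
    return xmin <= x <= xmax and ymin <= y <= ymax

def appearance_probability(instances, rect):
    """Sum the occurrence probabilities of the instances that fall inside rect."""
    return sum(p for point, p in instances if in_rect(point, rect))

def ptra_count(objects, rect, threshold):
    """Count objects whose appearance probability in rect is >= threshold."""
    return sum(
        1
        for instances in objects.values()
        if appearance_probability(instances, rect) >= threshold
    )

query = (0.0, 0.0, 4.0, 4.0)
# Hypothetical objects: lists of ((x, y), occurrence probability).
objects = {
    "a": [((1.0, 1.0), 0.5), ((2.0, 2.0), 0.5)],    # entirely inside: 1.0
    "b": [((3.0, 3.0), 0.75), ((6.0, 6.0), 0.25)],  # appearance probability 0.75
    "c": [((3.5, 3.5), 0.25), ((7.0, 7.0), 0.75)],  # appearance probability 0.25
}
result = ptra_count(objects, query, 0.5)  # a and b qualify
```

This scan touches every instance of every object, which is the cost that the index-based and sampling-based techniques aim to reduce.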

20 Exact Query Processing (aU-Tree)
- Main idea: add aggregate information to the U-tree.
- Advantage: the search stops at an intermediate level if a node is pruned or fully covered by the query.
- Disadvantage: otherwise, it still needs to drill down to the leaf nodes.
  - For a large portion of the uncertain objects, the appearance probability needs to be computed.
  - Expensive for a massive number of instances per object!
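The stop-early idea can be sketched with a simplified aggregate tree. This is not the aU-tree itself: the sketch below assumes each node stores only a bounding rectangle and an object count, and it omits the PCR-based pruning of the U-tree. The `Node` class, the function names, and the toy data are hypothetical, and the threshold is assumed to lie in (0, 1].

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    mbr: tuple          # (xmin, ymin, xmax, ymax) bounding all instances below
    count: int          # aggregate: number of uncertain objects below this node
    children: list = field(default_factory=list)
    objects: list = field(default_factory=list)  # leaves only: instance lists

def disjoint(a, b):
    return a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]

def covers(outer, inner):
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def appearance_probability(instances, rect):
    return sum(p for (x, y), p in instances
               if rect[0] <= x <= rect[2] and rect[1] <= y <= rect[3])

def ptra_exact(node, rect, threshold):
    """Count qualifying objects, stopping early at pruned or covered nodes."""
    if disjoint(node.mbr, rect):
        return 0                  # pruned: every object below has probability 0
    if covers(rect, node.mbr):
        return node.count         # fully covered: every object has probability 1
    if not node.children:         # leaf: compute each object's probability
        return sum(1 for inst in node.objects
                   if appearance_probability(inst, rect) >= threshold)
    return sum(ptra_exact(child, rect, threshold) for child in node.children)

leaf_a = Node((0, 0, 2, 2), 2, objects=[[((1, 1), 1.0)],
                                        [((0, 0), 0.5), ((2, 2), 0.5)]])
leaf_b = Node((3, 3, 6, 6), 2, objects=[[((3, 3), 0.75), ((6, 6), 0.25)],
                                        [((3.5, 3.5), 0.25), ((6, 6), 0.75)]])
leaf_c = Node((8, 8, 9, 9), 1, objects=[[((8, 8), 0.5), ((9, 9), 0.5)]])
root = Node((0, 0, 9, 9), 5, children=[leaf_a, leaf_b, leaf_c])

answer = ptra_exact(root, (0, 0, 4, 4), 0.5)
```

In this toy tree, leaf_a is fully covered and leaf_c is pruned, so only leaf_b requires per-object probability computation.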

21 Exact Query Processing (aU-Tree)

22 singleSample
- Sample the instances of each uncertain object.
- If m′ out of m sampled instances are inside the query region, then the approximate appearance probability is m′/m.
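The m′/m estimate can be sketched as follows. This is a minimal Python illustration, assuming the discrete model and that instances are drawn with replacement according to their occurrence probabilities; the talk does not specify the sampling procedure, so that choice, the function name, and the data are assumptions.

```python
import random

def estimate_appearance(instances, rect, m, rng):
    """Estimate the appearance probability as m'/m from m sampled instances."""
    points = [point for point, _ in instances]
    weights = [p for _, p in instances]
    # Draw m instances with replacement, weighted by occurrence probability.
    sampled = rng.choices(points, weights=weights, k=m)
    xmin, ymin, xmax, ymax = rect
    inside = sum(1 for x, y in sampled
                 if xmin <= x <= xmax and ymin <= y <= ymax)
    return inside / m

rng = random.Random(7)  # fixed seed so the sketch is repeatable
rect = (0.0, 0.0, 4.0, 4.0)
# Hypothetical object whose true appearance probability in rect is 0.6.
obj = [((1.0, 1.0), 0.3), ((3.0, 2.0), 0.3), ((6.0, 6.0), 0.4)]
estimate = estimate_appearance(obj, rect, 2000, rng)
```

The estimate fluctuates around 0.6, and the fluctuation shrinks as m grows.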

23 singleSample (cont.)
An immediate application of the Chernoff-Hoeffding bound.
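The transcript does not preserve the exact statement on this slide. As a reference point, the standard two-sided Hoeffding inequality for an average of m independent 0/1 samples with mean p is $\Pr(|m'/m - p| \ge \epsilon) \le 2e^{-2m\epsilon^2}$, which gives the sample size $m \ge \ln(2/\delta) / (2\epsilon^2)$ for failure probability at most $\delta$. The Python sketch below evaluates these two expressions; the function names are illustrative, and the bound shown is the textbook form, not necessarily the theorem from the talk.

```python
import math

def hoeffding_sample_size(eps, delta):
    """Smallest m with 2 * exp(-2 * m * eps^2) <= delta."""
    if not (0 < eps < 1 and 0 < delta < 1):
        raise ValueError("eps and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

def hoeffding_failure_bound(m, eps):
    """Upper bound on Pr(|m'/m - p| >= eps) for m independent samples."""
    return min(1.0, 2.0 * math.exp(-2.0 * m * eps * eps))

m_needed = hoeffding_sample_size(0.1, 0.05)  # error 0.1, confidence 95%
```

The required sample size depends only on the error and confidence parameters, not on the number of instances per object.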

24 doubleSample
- Single sampling is expensive when there is a massive number of objects!
- Sample the uncertain objects as well. Naive approach: sample objects uniformly from all uncertain objects.
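The naive variant can be sketched in Python as below. It assumes that objects are sampled uniformly without replacement, that each sampled object's appearance probability is estimated from m sampled instances, and that the count of qualifying sampled objects is scaled up by the sampling fraction. The scaling step, the function names, and the data are assumptions for illustration, not details taken from the talk.

```python
import random

def estimate_appearance(instances, rect, m, rng):
    """Estimate one object's appearance probability from m sampled instances."""
    points = [point for point, _ in instances]
    weights = [p for _, p in instances]
    sampled = rng.choices(points, weights=weights, k=m)
    xmin, ymin, xmax, ymax = rect
    return sum(1 for x, y in sampled
               if xmin <= x <= xmax and ymin <= y <= ymax) / m

def double_sample_count(objects, rect, threshold, n_sampled, m, rng):
    """Estimate the PTRA count from a uniform sample of n_sampled objects."""
    chosen = rng.sample(objects, n_sampled)   # first level: sample objects
    hits = sum(1 for instances in chosen      # second level: sample instances
               if estimate_appearance(instances, rect, m, rng) >= threshold)
    return hits * len(objects) / n_sampled    # scale up to the full data set

rng = random.Random(11)
rect = (0.0, 0.0, 4.0, 4.0)
# Hypothetical data: 40 objects entirely inside the query, 60 entirely outside.
objects = ([[((1.0, 1.0), 1.0)] for _ in range(40)]
           + [[((9.0, 9.0), 1.0)] for _ in range(60)])
estimate = double_sample_count(objects, rect, 0.5, 20, 100, rng)
```

The true answer for this data is 40; the estimate varies with which 20 objects happen to be drawn.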

25 doubleSample: Accuracy
Note: saying that the “appearance probability” of each object follows a uniform distribution means that the objects' spatial locations are uniformly distributed. The accuracy analysis uses the Chernoff-Hoeffding bound.

26 doubleSample: Our Approach
- Skew!
- Aim: select K disjoint groups covering all objects with the minimum “skew”, i.e., the objects in each group follow a “uniform” distribution. (Then sample objects uniformly within each group.)
- The optimization problem is NP-hard.
- Observation:
  - Min-skew is a good heuristic for constructing such groups.
  - The aU-tree groups objects using a principle similar to min-skew.

27 doubleSample: Our Approach
- Step 1: choose K subtrees to cover all objects with the minimum total skew. NP-hard!
  - Find a level L such that the number of nodes at level L is smaller than K but the number of nodes at level L-1 is larger than K.
  - Feed the min-skew algorithm with the subtrees at level L. (Note: if at some level L the number of nodes equals K, then these K subtrees are chosen.)
- Step 2: sample objects in each subtree.
- Step 3: sample instances in each sampled object.
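The level selection and the per-group sampling of Steps 2 and 3 can be sketched as follows. This Python sketch assumes that levels are numbered from the leaves (level 0) up to the root, that node counts do not increase toward the root, and that the K groups have already been formed. The min-skew refinement itself is not implemented, and all names and data are hypothetical.

```python
import random

def pick_level(nodes_per_level, k):
    """Return (level, exact): the lowest level with at most k nodes.

    nodes_per_level[i] is the node count at level i (0 = leaves, last = root).
    exact is True when that level has exactly k nodes, so no refinement is needed.
    """
    for level, n in enumerate(nodes_per_level):
        if n <= k:
            return level, n == k
    raise ValueError("k must be at least 1")

def estimate_appearance(instances, rect, m, rng):
    points = [point for point, _ in instances]
    weights = [p for _, p in instances]
    sampled = rng.choices(points, weights=weights, k=m)
    xmin, ymin, xmax, ymax = rect
    return sum(1 for x, y in sampled
               if xmin <= x <= xmax and ymin <= y <= ymax) / m

def stratified_count(groups, rect, threshold, per_group, m, rng):
    """Sample objects within each group, then instances within each object."""
    total = 0.0
    for group in groups:
        if not group:
            continue
        k = min(per_group, len(group))
        chosen = rng.sample(group, k)                       # Step 2
        hits = sum(1 for instances in chosen                # Step 3
                   if estimate_appearance(instances, rect, m, rng) >= threshold)
        total += hits * len(group) / k                      # scale per group
    return total

level, exact = pick_level([120, 14, 3, 1], 8)  # level 2 has 3 < 8 nodes

rect = (0.0, 0.0, 4.0, 4.0)
# Two hypothetical groups with no internal skew: one inside, one outside.
groups = [[[((1.0, 1.0), 1.0)] for _ in range(4)],
          [[((9.0, 9.0), 1.0)] for _ in range(6)]]
estimate = stratified_count(groups, rect, 0.5, 2, 20, random.Random(5))
```

When each group is internally homogeneous, as in this toy data, the per-group scaling recovers the exact count, which is the motivation for choosing groups with minimum skew.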

28 Experiments
- Algorithms: exact, singleSample, doubleSample
- Data sets:
  - LB: 53k objects in Long Beach County
  - CA: 62k objects in California
  - Synthetic aircraft dataset in 3D
  - 10k instances for each point, following a Uniform or constrained-Gaussian distribution
- Setting: C++, P4 2.8GHz, 2G memory, Debian Linux, page size 8K

29 Efficiency

30 Accuracy

31 Accuracy (cont.)

32 Conclusion
- Definition of PTRA
- aU-Tree technique
- Sampling technique
- Future work: any approach with a theoretical guarantee?

33 Thanks

34 Min-Skew technique

