On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases. Presented by Xi Zhang, February 8th, 2008.

1 On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases. Presented by Xi Zhang, February 8th, 2008

2 Outline Background Motivation Examples Top-k Queries in Probabilistic Databases Conclusion

3 Outline Background Probabilistic database model Top-k queries & scoring functions Motivation Examples Top-k Queries in Probabilistic Databases Conclusion

4 Probabilistic Databases. Motivation: uncertainty/vagueness/imprecision in data. History: incomplete information in relational DB [Imielinski & Lipski 1984]; probabilistic DB model [Cavallo & Pittarelli 1987]; Probabilistic Relational Algebra [Fuhr & Rölleke 1997 etc.]. Comeback: a flourish of uncertain data in real-world applications, e.g. WWW, biological data, sensor networks.

5 Probabilistic Database Model [Fuhr & Rölleke 1997]. Probabilistic database model: a generalization of the relational DB. Probabilistic Relational Algebra (PRA): a generalization of standard relational algebra.

6 A Table in a Probabilistic Database
DocTerm:
DocNo  Term  Prob  Basic Event
1      IR    0.9   e_DT(1, IR)
2      DB    0.7   e_DT(2, DB)
3      IR    0.8   e_DT(3, IR)
3      DB    0.5   e_DT(3, DB)
4      AI    0.8   e_DT(4, AI)
Each tuple carries an event expression; basic events are independent.

7 Probabilistic Relational Algebra. Just like in relational algebra: Selection (σ), Projection (π), Join (⋈), Union (∪), Difference (−).


9 Selection
DocTerm:
DocNo  Term  Prob  Basic Event
1      IR    0.9   e_DT(1, IR)
2      DB    0.7   e_DT(2, DB)
3      IR    0.8   e_DT(3, IR)
3      DB    0.5   e_DT(3, DB)
4      AI    0.8   e_DT(4, AI)
Selection of the IR rows:
DocNo  Term  Prob  Complex Event
1      IR    0.9   e_DT(1, IR)
3      IR    0.8   e_DT(3, IR)
In a derived table, each complex event is a propositional expression of basic events.

10 Projection
DocTerm:
DocNo  Term  Prob  Basic Event
1      IR    0.9   e_DT(1, IR)
2      DB    0.7   e_DT(2, DB)
3      IR    0.8   e_DT(3, IR)
3      DB    0.5   e_DT(3, DB)
4      AI    0.8   e_DT(4, AI)
Projection on Term:
Term  Prob  Complex Event
IR    0.98  e_DT(1, IR) ∨ e_DT(3, IR)
DB    0.85  e_DT(2, DB) ∨ e_DT(3, DB)
AI    0.80  e_DT(4, AI)
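The projected probabilities on this slide follow from treating the basic events as independent: a disjunction of independent events has probability 1 minus the product of the complements. The sketch below reproduces the three values; the function names `prob_or` and `project_term` are illustrative and not taken from the presentation.

```python
from math import prod

def prob_or(probs):
    """Probability that at least one of several independent events occurs."""
    return 1.0 - prod(1.0 - p for p in probs)

# DocTerm rows from the slide: (DocNo, Term, Prob)
doc_term = [(1, "IR", 0.9), (2, "DB", 0.7), (3, "IR", 0.8), (3, "DB", 0.5), (4, "AI", 0.8)]

def project_term(rows):
    """Project on Term, merging duplicate terms with an independent OR."""
    by_term = {}
    for _doc, term, p in rows:
        by_term.setdefault(term, []).append(p)
    return {term: prob_or(ps) for term, ps in by_term.items()}

projected = project_term(doc_term)
for term, p in projected.items():
    print(term, round(p, 2))
```

This shortcut is only valid here because each projected row combines distinct basic events.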

11 Join
DocTerm:
DocNo  Term  Prob  Basic Event
1      IR    0.9   e_DT(1, IR)
2      DB    0.7   e_DT(2, DB)
DocAu:
DocNo  AName  Prob  Basic Event
1      Bauer  0.9   e_DU(1, Bauer)
2      Meier  0.8   e_DU(2, Meier)
Join of DocAu and DocTerm:
DocAu.DocNo  AName  DocTerm.DocNo  Term  Prob     Complex Event
1            Bauer  1              IR    0.9*0.9  e_DU(1, Bauer) ∧ e_DT(1, IR)
1            Bauer  2              DB    0.9*0.7  e_DU(1, Bauer) ∧ e_DT(2, DB)
2            Meier  1              IR    0.8*0.9  e_DU(2, Meier) ∧ e_DT(1, IR)
2            Meier  2              DB    0.8*0.7  e_DU(2, Meier) ∧ e_DT(2, DB)

12 Join + Projection
DocTerm:
DocNo  Term  Prob  Basic Event
1      IR    0.9   e_DT(1, IR)
2      DB    0.7   e_DT(2, DB)
3      IR    0.8   e_DT(3, IR)
3      DB    0.5   e_DT(3, DB)
4      AI    0.8   e_DT(4, AI)
IR:
DocNo  Prob  Complex Event
1      0.9   e_DT(1, IR)
3      0.8   e_DT(3, IR)
DB:
DocNo  Prob  Complex Event
2      0.7   e_DT(2, DB)
3      0.5   e_DT(3, DB)
DocAu:
DocNo  AName    Prob  Basic Event
1      Bauer    0.9   e_DU(1, Bauer)
2      Bauer    0.3   e_DU(2, Bauer)
2      Meier    0.9   e_DU(2, Meier)
2      Schmidt  0.8   e_DU(2, Schmidt)
3      Schmidt  0.7   e_DU(3, Schmidt)
4      Koch     0.9   e_DU(4, Koch)
4      Bauer    0.6   e_DU(4, Bauer)
Authors of IR documents:
AName    Prob  Complex Event
Bauer    0.81  e_DU(1, Bauer) ∧ e_DT(1, IR)
Schmidt  0.56  e_DU(3, S) ∧ e_DT(3, IR)
Authors of DB documents:
AName    Prob  Complex Event
Bauer    0.21  e_DU(2, Bauer) ∧ e_DT(2, DB)
Meier    0.63  e_DU(2, Meier) ∧ e_DT(2, DB)
Schmidt  0.91  (e_DU(2, S) ∧ e_DT(2, DB)) ∨ (e_DU(3, S) ∧ e_DT(3, DB))
Authors of both IR and DB documents:
AName    Product of input probabilities  Complex Event
Bauer    0.81 * 0.21 = 0.1701            (e_DU(1, B) ∧ e_DT(1, IR)) ∧ (e_DU(2, B) ∧ e_DT(2, DB))
Schmidt  0.56 * 0.91 = 0.5096            (e_DU(3, S) ∧ e_DT(3, IR)) ∧ ((e_DU(2, S) ∧ e_DT(2, DB)) ∨ (e_DU(3, S) ∧ e_DT(3, DB)))
Probability of Schmidt's complex event: 0.4368

13 Join + Projection: Intensional Semantics vs. Extensional Semantics. Same tables as on slide 12. For Schmidt, extensional semantics multiplies the input probabilities, 0.56 * 0.91 = 0.5096, while intensional semantics evaluates the complex event (e_DU(3, S) ∧ e_DT(3, IR)) ∧ ((e_DU(2, S) ∧ e_DT(2, DB)) ∨ (e_DU(3, S) ∧ e_DT(3, DB))) and obtains 0.4368. For Bauer, both give 0.81 * 0.21 = 0.1701.

14 Intensional vs. Extensional. Intensional semantics: assumes data independence of the base tables; keeps track of data dependence during the evaluation. Extensional semantics: assumes data independence during the evaluation; could be WRONG in its probability computation!

15 When Intensional = Extensional? When there are no identical underlying basic events in the event expression.
AName    Product of input probabilities  Complex Event
Bauer    0.81 * 0.21 = 0.1701            (e_DU(1, B) ∧ e_DT(1, IR)) ∧ (e_DU(2, B) ∧ e_DT(2, DB))
Schmidt  0.56 * 0.91 = 0.5096            (e_DU(3, S) ∧ e_DT(3, IR)) ∧ ((e_DU(2, S) ∧ e_DT(2, DB)) ∨ (e_DU(3, S) ∧ e_DT(3, DB)))
Schmidt's expression contains the identical basic event e_DU(3, S) twice, so its event probability is 0.4368 rather than 0.5096.
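To see where 0.4368 comes from, one can evaluate Schmidt's event expression intensionally by brute force: enumerate every truth assignment of the independent basic events and add up the weights of the assignments that satisfy the expression. This is only a sketch for a five-event example (the enumeration is exponential in general); the dictionary keys and function names are illustrative, and the extensional figure simply multiplies the two marginals as printed on the slide.

```python
from itertools import product

# Basic events appearing in Schmidt's expression, with the slide's probabilities.
basic = {"DU(2,S)": 0.8, "DT(2,DB)": 0.7, "DU(3,S)": 0.7, "DT(3,IR)": 0.8, "DT(3,DB)": 0.5}

def schmidt_event(w):
    """(DU(3,S) and DT(3,IR)) and ((DU(2,S) and DT(2,DB)) or (DU(3,S) and DT(3,DB)))."""
    ir = w["DU(3,S)"] and w["DT(3,IR)"]
    db = (w["DU(2,S)"] and w["DT(2,DB)"]) or (w["DU(3,S)"] and w["DT(3,DB)"])
    return ir and db

def intensional_probability(event, basic_events):
    """Sum the weights of all truth assignments that satisfy the event."""
    names = list(basic_events)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if event(world):
            weight = 1.0
            for name in names:
                weight *= basic_events[name] if world[name] else 1.0 - basic_events[name]
            total += weight
    return total

intensional = intensional_probability(schmidt_event, basic)
extensional = 0.56 * 0.91  # product of the marginals as printed on the slide
print(round(intensional, 4), round(extensional, 4))
```

The two numbers differ because e_DU(3, S) occurs in both conjuncts, so the conjuncts are not independent.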

16 Fuhr & Rölleke 1997: Summary. Probabilistic DB model: concept of event; basic vs. complex event; event expression. Probabilistic Relational Algebra: just like in relational algebra; computation of event probabilities. Intensional vs. extensional semantics: they yield the same result when there is NO data dependence in event expressions.

17 Outline Background Probabilistic database model Top-k queries & scoring functions Motivation Examples Top-k Queries in Probabilistic Databases Semantics Query Evaluation Conclusion

18 Top-k Queries. Traditionally, given: objects o_1, o_2, …, o_n; a non-negative integer k; a scoring function s that assigns each object a real-valued score. Question: What are the k objects with the highest scores? Top-k queries have been studied in Web, XML, and relational databases, and more recently in probabilistic databases.

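19 Scoring Function. A scoring function s over a deterministic relation R is a function from R to the real numbers, s: R → ℝ. For any t_i and t_j from R, t_i ranks above t_j if and only if s(t_i) > s(t_j).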
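In the deterministic setting, a top-k query is just a selection of the k highest-scoring objects. A minimal sketch follows, assuming the scoring function is given as a Python callable; the names and scores are borrowed from the smart environment example later in the talk, used here without their probabilities.

```python
import heapq

def top_k(objects, score, k):
    """Return the k objects with the highest scores, best first."""
    return heapq.nlargest(k, objects, key=score)

# Deterministic scores only (no probabilities yet).
scores = {"Aiden": 0.65, "Bob": 0.55, "Chris": 0.45}
answer = top_k(scores.keys(), scores.get, 2)
print(answer)
```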

20 Outline Background Motivation Examples Smart Environment Example Sensor Network Example Top-k Queries in Probabilistic Databases Conclusion

21 Motivating Example I: Smart Environment. Sample question: “Who were the two visitors in the lab last Saturday night?” Data: biometric data from sensors (we can see how well those data match the profile of every candidate, which gives a scoring function); historical statistics (e.g. the probability of a certain candidate being in the lab on Saturday nights).

22 Motivating Example I (cont.)
Personnel  Biometrics (Face Detection, Voice Detection, …)  score( … )  Probability of being in lab on Saturday nights
Aiden      (0.70, 0.60, …)                                  0.65        0.3
Bob        (0.50, 0.60, …)                                  0.55        0.9
Chris      (0.50, 0.40, …)                                  0.45        0.4
Question: Find two people in the lab last Saturday night. This is a top-2 query over the above probabilistic database under the above scoring function.

23 Motivating Example II Sensor Network in a Habitat Sample Question “What is the temperature of the warmest spot?” Data Sensor readings from different sensors At a sampling time, only one “real” reading from a sensor Each sensor reading comes with a confidence value

24 Motivating Example II (cont.)
Part                Temp (F)  Prob
C1 (from Sensor 1)  22        0.6
C1 (from Sensor 1)  10        0.4
C2 (from Sensor 2)  25        0.1
C2 (from Sensor 2)  15        0.6
Question: What is the temperature of the warmest spot? This is a top-1 query over the above probabilistic database under a scoring function proportional to temperature.

25 Outline Background Motivation Examples Top-k Queries in Probabilistic Databases Semantics Query Evaluation Conclusion

26 Models. A probabilistic relation R^p = <R, p, C>, where R is the support deterministic relation, p is the probability function, and C is a partition of R such that the probabilities of the tuples in each part sum to at most 1. Simple vs. general probabilistic relation. Simple: assumes tuple independence, i.e. |C| = |R|; e.g. the smart environment example. General: tuples can be independent or exclusive, i.e. |C| < |R|; e.g. the sensor network example.

27 Challenges. Given: a probabilistic relation R^p = <R, p, C>; an injective scoring function s over R (no ties); a non-negative integer k. What is the top-k answer set over R^p? (Semantics) How to compute the top-k answer of R^p? (Query Evaluation)

28 What is a “Good” Semantics? Desired Properties Exact-k Faithfulness Stability

29 Properties. Exact-k: if R has at least k tuples, then exactly k tuples are returned as the top-k answer. Faithfulness: a “better” tuple, i.e. one higher in both score and probability, is more likely to be in the top-k answer than a “worse” one. Stability: raising the score/prob. of a winning tuple will not cause it to lose; lowering the score/prob. of a losing tuple will not cause it to win.

30 Global-Topk Semantics. Given: a probabilistic relation R^p = <R, p, C>; an injective scoring function s over R (no ties); a non-negative integer k. What is the top-k answer set over R^p? (Semantics) Global-Topk: return the k highest-ranked tuples according to their probability of being in the top-k answers in possible worlds. Global-Topk satisfies the aforementioned three properties.

31 Smart Environment Example
Query: Find two people in lab on last Saturday night
Personnel  Biometrics (Face Detection, Voice Detection, …)  Score( … )  Prob.
Aiden      (0.70, 0.60, …)                                  0.65        0.3
Bob        (0.50, 0.60, …)                                  0.55        0.9
Chris      (0.50, 0.40, …)                                  0.45        0.4
Possible worlds (probability): {} 0.042; {Aiden} 0.018; {Bob} 0.378; {Chris} 0.028; {Aiden, Bob} 0.162; {Aiden, Chris} 0.012; {Bob, Chris} 0.252; {Aiden, Bob, Chris} 0.108
Global-Topk Semantics: Pr(Aiden in top-2) = 0.3; Pr(Bob in top-2) = 0.9; Pr(Chris in top-2) = 0.028 + 0.012 + 0.252 = 0.292
Top-2 Answer: {Aiden, Bob}
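The Global-Topk numbers on this slide can be checked by enumerating the eight possible worlds directly. The sketch below is a brute-force illustration for a simple (tuple-independent) relation, not the evaluation algorithm from the talk; the function names are illustrative.

```python
from itertools import product

# (name, score, probability) from the slide
people = [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]

def possible_worlds(tuples):
    """Yield (world, probability) for every subset of independent tuples."""
    for present in product([True, False], repeat=len(tuples)):
        weight = 1.0
        world = []
        for (name, score, p), here in zip(tuples, present):
            weight *= p if here else 1.0 - p
            if here:
                world.append((name, score))
        yield world, weight

def global_topk(tuples, k):
    """Rank tuples by their probability of being in the top-k of a possible world."""
    membership = {name: 0.0 for name, _, _ in tuples}
    for world, weight in possible_worlds(tuples):
        top = sorted(world, key=lambda t: t[1], reverse=True)[:k]
        for name, _ in top:
            membership[name] += weight
    ranked = sorted(membership, key=membership.get, reverse=True)
    return ranked[:k], membership

answer, membership = global_topk(people, 2)
print(answer, {name: round(p, 3) for name, p in membership.items()})
```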

32 Other Semantics Soliman, Ilyas & Chang 2007 Two Alternative Semantics U-Topk U-kRanks

33 U-Topk Semantics. Given: a probabilistic relation R^p = <R, p, C>; an injective scoring function s over R (no ties); a non-negative integer k. What is the top-k answer set over R^p? (Semantics) U-Topk: return the most probable top-k answer set that belongs to possible worlds. U-Topk does not satisfy all three properties.

34 Smart Environment Example
Query: Find two people in lab on last Saturday night
Personnel  Biometrics (Face Detection, Voice Detection, …)  Score( … )  Prob.
Aiden      (0.70, 0.60, …)                                  0.65        0.3
Bob        (0.50, 0.60, …)                                  0.55        0.9
Chris      (0.50, 0.40, …)                                  0.45        0.4
Possible worlds (probability): {} 0.042; {Aiden} 0.018; {Bob} 0.378; {Chris} 0.028; {Aiden, Bob} 0.162; {Aiden, Chris} 0.012; {Bob, Chris} 0.252; {Aiden, Bob, Chris} 0.108
U-Topk Semantics: Pr({Aiden, Bob}) = 0.162 + 0.108 = 0.27; …; Pr({Bob}) = 0.378
Top-2 Answer: {Bob}
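The U-Topk answer can be verified the same way: group the possible worlds by the top-2 list they produce and pick the list with the largest total probability. This brute-force sketch follows the slide in allowing answers shorter than k (such as {Bob}); the function name is illustrative.

```python
from collections import defaultdict
from itertools import product

# (name, score, probability) from the slide
people = [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]

def u_topk(tuples, k):
    """Return the most probable top-k list over all possible worlds."""
    answer_prob = defaultdict(float)
    for present in product([True, False], repeat=len(tuples)):
        weight = 1.0
        world = []
        for (name, score, p), here in zip(tuples, present):
            weight *= p if here else 1.0 - p
            if here:
                world.append((score, name))
        # The top-k of this world, best score first.
        top = tuple(name for _, name in sorted(world, reverse=True)[:k])
        answer_prob[top] += weight
    best = max(answer_prob, key=answer_prob.get)
    return best, dict(answer_prob)

best, answer_prob = u_topk(people, 2)
print(best, round(answer_prob[best], 3))
```

With these numbers the single-tuple answer wins, which illustrates why U-Topk does not guarantee exactly k tuples.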

35 U-kRanks Semantics. Given: a probabilistic relation R^p = <R, p, C>; an injective scoring function s over R (no ties); a non-negative integer k. What is the top-k answer set over R^p? (Semantics) U-kRanks: for i = 1, 2, …, k, return the most probable i-th-ranked tuple across all possible worlds. U-kRanks does not satisfy all three properties.

36 Smart Environment Example
Query: Find two people in lab on last Saturday night
Personnel  Biometrics (Face Detection, Voice Detection, …)  Score( … )  Prob.
Aiden      (0.70, 0.60, …)                                  0.65        0.3
Bob        (0.50, 0.60, …)                                  0.55        0.9
Chris      (0.50, 0.40, …)                                  0.45        0.4
Possible worlds (probability): {} 0.042; {Aiden} 0.018; {Bob} 0.378; {Chris} 0.028; {Aiden, Bob} 0.162; {Aiden, Chris} 0.012; {Bob, Chris} 0.252; {Aiden, Bob, Chris} 0.108
U-kRanks Semantics, e.g. Pr(Chris at rank-2) = 0.012 + 0.252 = 0.264
        Aiden  Bob   Chris
Rank-1  0.3    0.63  0.028
Rank-2  0      0.27  0.264
Bob is the highest at rank-1 and the highest at rank-2.
Top-2 Answer: {Bob}
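The rank table can be reproduced by enumerating the possible worlds and recording which tuple occupies each rank. The sketch below is a brute-force illustration with illustrative names; it assumes every rank up to k is occupied in at least one world.

```python
from collections import defaultdict
from itertools import product

# (name, score, probability) from the slide
people = [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]

def u_kranks(tuples, k):
    """For each rank i < k, find the tuple most likely to appear at that rank."""
    rank_prob = [defaultdict(float) for _ in range(k)]
    for present in product([True, False], repeat=len(tuples)):
        weight = 1.0
        world = []
        for (name, score, p), here in zip(tuples, present):
            weight *= p if here else 1.0 - p
            if here:
                world.append((score, name))
        for i, (_, name) in enumerate(sorted(world, reverse=True)[:k]):
            rank_prob[i][name] += weight
    winners = [max(rp, key=rp.get) for rp in rank_prob]
    return winners, rank_prob

winners, rank_prob = u_kranks(people, 2)
print(winners)
```

Bob wins both ranks, so the returned set collapses to {Bob}.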

37 Properties
Semantics    Exact-k  Faithfulness  Stability
Global-Topk  Yes      Yes           Yes
U-Topk       No       Yes/No*       Yes
U-kRanks     No       No            No
* Yes when the relation is simple, No otherwise
A better semantics

38 Challenges. Given: a probabilistic relation R^p = <R, p, C>; an injective scoring function s over R (no ties); a non-negative integer k. What is the top-k answer set over R^p? (Semantics): Global-Topk. How to compute the top-k answer of R^p? (Query Evaluation)

39 Global-Topk in Simple Relation. Given R^p = <R, p, C>, a scoring function s, and a non-negative integer k. Assumptions: tuples are independent, i.e. |C| = |R|; R = {t_1, t_2, …, t_n}, ordered in decreasing order of their scores, i.e. s(t_1) > s(t_2) > … > s(t_n).

40 Global-Topk in Simple Relation: Query Evaluation. Recursion on P_{k,s}(t_i), the Global-Topk probability of tuple t_i. Dynamic programming.
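The recurrence itself is not legible in this transcript, so the following is only a sketch of one dynamic program that computes the same quantity in O(kn) time, and it may not match the slide's recurrence term by term. It relies on the observation that, with independent tuples sorted by decreasing score, t_i is in the top-k exactly when t_i exists and at most k-1 of t_1, …, t_{i-1} exist. The function name is illustrative.

```python
def global_topk_probabilities(probs, k):
    """probs: existence probabilities of t_1..t_n in decreasing score order.

    Returns the Global-Topk probability of each tuple for a simple
    (tuple-independent) relation.
    """
    if k <= 0:
        return [0.0 for _ in probs]
    # count[j] = Pr(exactly j of the tuples seen so far exist), tracked for j < k only
    count = [1.0] + [0.0] * (k - 1)
    result = []
    for p in probs:
        # t_i makes the top-k iff it exists and fewer than k better tuples exist.
        result.append(p * sum(count))
        updated = [0.0] * k
        for j in range(k):
            updated[j] = count[j] * (1.0 - p)
            if j > 0:
                updated[j] += count[j - 1] * p
        count = updated
    return result

# Smart environment example: Aiden, Bob, Chris in decreasing score order.
topk_probs = global_topk_probabilities([0.3, 0.9, 0.4], 2)
print([round(x, 3) for x in topk_probs])
```

On the smart environment data this agrees with the possible-world enumeration (0.3, 0.9, 0.292) without touching all 2^n worlds.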

41 Optimization: Threshold Algorithm (TA) [Fagin & Lotem 2001]. Given a system of objects such that: for each object attribute, there is a sorted list ranking objects in decreasing order of their score on that attribute; an aggregation function f combines individual attribute scores x_i, i = 1, 2, …, m, to obtain the overall object score f(x_1, x_2, …, x_m); f is monotonic, i.e. f(x_1, x_2, …, x_m) <= f(x'_1, x'_2, …, x'_m) whenever x_i <= x'_i for every i. TA is cost-optimal in finding the top-k objects. TA and its variants are widely used in ranking queries, e.g. top-k, skyline, etc.
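A minimal in-memory sketch of the Threshold Algorithm as described above may help: it reads the sorted lists in parallel, looks up the full score of every newly seen object, and stops once k seen objects score at least as high as the threshold formed from the last values read. The data layout (a dictionary of attribute vectors) and all names are assumptions made for illustration; a real deployment would use sorted and random access to external lists.

```python
import heapq

def threshold_algorithm(scores, f, k):
    """scores: {object: [x_1, ..., x_m]}; f: monotone aggregation over a list."""
    if not scores or k <= 0:
        return []
    m = len(next(iter(scores.values())))
    # One list per attribute, sorted by decreasing attribute score.
    lists = [sorted(scores, key=lambda o, i=i: scores[o][i], reverse=True)
             for i in range(m)]
    seen = {}
    for depth in range(len(scores)):
        last = []
        for i in range(m):
            obj = lists[i][depth]            # sorted access
            last.append(scores[obj][i])
            if obj not in seen:
                seen[obj] = f(scores[obj])   # random access to all attributes
        threshold = f(last)                  # no unseen object can score higher
        best = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(best) == k and best[-1][1] >= threshold:
            return best
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

# Hypothetical data: four objects, two attributes, sum as the aggregation.
data = {"a": [0.9, 0.2], "b": [0.8, 0.8], "c": [0.1, 0.9], "d": [0.3, 0.3]}
result = threshold_algorithm(data, sum, 1)
print(result)
```

In this run the algorithm stops after two rounds of sorted access and never needs the score of object d.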

42 Applying TA Optimization Global-Topk Two attributes: probability & score Aggregation function: Global-Topk probability

43 Global-Topk in General Relation. Given R^p = <R, p, C>, a scoring function s, and a non-negative integer k. Assumptions: tuples are independent or exclusive, i.e. |C| < |R|; R = {t_1, t_2, …, t_n}, ordered in decreasing order of their scores, i.e. s(t_1) > s(t_2) > … > s(t_n).

44 Global-Topk in General Relation: Induced Event Relation. For each tuple t in R, there is a probabilistic relation E^p generated by the following two rules. E^p is simple.

45 Sensor Network Example
Prob. Relation (general):
Part                Temp (F)  Prob
C1 (from Sensor 1)  22        0.6
C1 (from Sensor 1)  10        0.4
C2 (from Sensor 2)  25        0.1
C2 (from Sensor 2)  15        0.6
For example, for t = (Temp 15, Prob 0.6), the Induced Event Relation (simple) is:
Event  Prob        Rule
t_eC1  0.6         Rule 2, where i = 1
t_et   0.6 = p(t)  Rule 1
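The two rules are not legible in this transcript, so the sketch below rests on an assumption that is consistent with the example: Rule 1 creates an event for t itself with probability p(t), and Rule 2 creates, for every other part C_i that contains a tuple scoring higher than t, one event whose probability is the total probability of those higher-scoring tuples. The data layout and names are illustrative, and the score is taken to be the temperature itself.

```python
# General relation: each part lists mutually exclusive (temperature, probability) tuples.
parts = {
    "C1": [(22, 0.6), (10, 0.4)],
    "C2": [(25, 0.1), (15, 0.6)],
}

def induced_event_relation(parts, part_of_t, t):
    """Build the (assumed) induced event relation of tuple t as {event: prob}."""
    score_t, prob_t = t
    events = {"e_t": prob_t}                      # assumed Rule 1
    for name, tuples in parts.items():            # assumed Rule 2
        if name == part_of_t:
            continue
        mass = sum(p for s, p in tuples if s > score_t)
        if mass > 0:
            events["e_" + name] = mass
    return events

events = induced_event_relation(parts, "C2", (15, 0.6))
print(events)
```

For t = (15, 0.6) this yields the two events shown on the slide, each with probability 0.6.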

46 Global-Topk in General Relation

47 Evaluating Global-Topk in General Relation For each tuple t, generate corresponding induced event relation Compute the Global-Topk probability of t by Theorem 4.3 Pick the k tuples with the highest Global-Topk probability
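The three steps above can be sketched end to end on the sensor network example. Theorem 4.3 is not reproduced in this transcript, so the code relies on a direct argument instead of the theorem's statement: t is in the top-k exactly when t exists and fewer than k of the other parts contribute a higher-scoring tuple, and those parts are independent of each other. All names and the data layout are illustrative assumptions.

```python
# General relation: each part lists mutually exclusive (temperature, probability) tuples.
parts = {
    "C1": [(22, 0.6), (10, 0.4)],
    "C2": [(25, 0.1), (15, 0.6)],
}

def at_most(probs, limit):
    """Pr(at most `limit` of the independent events with these probabilities occur)."""
    if limit < 0:
        return 0.0
    count = [1.0] + [0.0] * limit   # count[j] = Pr(exactly j events so far)
    for p in probs:
        updated = [0.0] * (limit + 1)
        for j in range(limit + 1):
            updated[j] = count[j] * (1.0 - p)
            if j > 0:
                updated[j] += count[j - 1] * p
        count = updated
    return sum(count)

def global_topk_general(parts, k):
    """Global-Topk probabilities for a relation with exclusive parts (scores are unique)."""
    result = {}
    for name, tuples in parts.items():
        for score, prob in tuples:
            # One event per other part that can produce a higher-scoring tuple.
            others = []
            for other, other_tuples in parts.items():
                if other == name:
                    continue
                mass = sum(p for s, p in other_tuples if s > score)
                if mass > 0:
                    others.append(mass)
            result[score] = prob * at_most(others, k - 1)
    answer = sorted(result, key=result.get, reverse=True)[:k]
    return answer, result

answer, topk_prob = global_topk_general(parts, 1)
print(answer, {temp: round(p, 2) for temp, p in topk_prob.items()})
```

For the top-1 query the most likely warmest reading is 22, and the four probabilities sum to 1 because part C1 always contributes a tuple.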

48 Summary on Query Evaluation Simple (Independent Tuples) Dynamic Programming Tuples are ordered on their scores Recursion on the tuple index and k General (Independent/Exclusive Tuples) Polynomial reduction to simple cases

49 Complexity
         Global-Topk  U-Topk                  U-kRanks
Simple   O(kn)
General  O(kn^2)      Θ(m k n^(k-1) lg n)*    Ω(m n^(k-1))*
* m is a rule-engine-related factor; it represents how complicated the relationships between tuples can be.

50 Outline Background Motivation Examples Top-k Queries in Probabilistic Databases Conclusion

51 Conclusion. Three intuitive semantic properties for top-k queries in probabilistic databases. Global-Topk semantics, which satisfies all of the properties above. Query evaluation algorithms for Global-Topk in simple and general probabilistic databases.

52 Future Problems. Weak-order scoring function: allow ties; not clear how to extend the properties; not clear how to define the semantics (other than an “arbitrary tie breaker”). Preference strength and sensitivity to score: given a prob. relation R^p, if the DB is sufficiently large, by manipulating the scores of tuples we would be able to get different answers; NOT satisfied by our semantics; NOT satisfied by any semantics in the literature; need to consider preference strength in the semantics.

53 Thank you!

54 Related Works. Introduction to probabilistic databases: Probabilistic DB Model & Probabilistic Relational Algebra [Fuhr & Rölleke 1997]. Top-k queries in probabilistic databases: On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases [Zhang & Chomicki 2008]; Alternative Top-k Semantics and Query Evaluation in Probabilistic Databases [Soliman, Ilyas & Chang 2007].

