Ludwig-Maximilians-University Munich · Institute for Informatics · Database Systems Group

Probabilistic Similarity Queries in Uncertain Databases

Matthias Renz
Ludwig-Maximilians-Universität München, Munich, Germany

Dagstuhl Seminar 2008: Uncertainty Management in Information Systems

DATABASE SYSTEMS GROUP · M. Renz: Probabilistic Similarity Queries in Uncertain Databases, Seminar 08421, Schloss Dagstuhl

Outline
Introduction
Probabilistic Similarity Queries
  – multi-step query processing
  – probabilistic ε-range/k-NN queries
Probabilistic Similarity Ranking
  – probabilistic ranking models
  – efficient computation of probabilistic ranking queries
Summary

Introduction
modern database applications involve spatial, temporal and multimedia data
often vague and imprecise attributes
  – sensor data, e.g. traffic monitoring
  – feature extraction, e.g. person identification
→ probabilistic databases

Introduction
types of probabilistic databases
  – relational uncertainty representation
      tuples with confidence, e.g. x-relation model (Trio system)

      ID   NAME   CONF
      p1   john   0.6
      p2   fred   0.3
      p3   mary   0.7
      …    …      …

  – uncertainty in feature spaces
      uncertain vectors
      representations: continuous, discrete (point objects)
  – spatial uncertainty representation
      uncertain spatially extended objects

Introduction
Probabilistic Similarity Queries
  – given: database with uncertain vectors, (uncertain) query object Q
  – queries:
      ε-range query
      k-NN query
      ranking query

Introduction
Probabilistic Similarity Queries
  – given: two databases DB_A and DB_B with uncertain vectors
  – queries: join query
  – challenges: uncertain similarity distances, uncertain query results

Modelling Uncertainty in Feature Spaces
Uncertain Vector Data
  – vector data in d-dimensional space ℝ^d
  – objects are represented by multiple d-dimensional vectors that are mutually exclusive
      a confidence value is assigned to each vector
  – types of uncertain object representations
      pdf (continuous)
      vector samples (discrete)

Probabilistic Similarity Queries
Example: Probabilistic ε-Range Query
  – query object and set of uncertain objects (discrete):
      q = {q_1, …, q_M} and o_i = {o_i,1, …, o_i,N}
  – distance between q and o_i: [equation not preserved]
  – probability that the distance between q and o_i is at most ε ∈ ℝ_0^+: [equation not preserved]
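
The probability on this slide can be illustrated with a minimal sketch. It assumes that q and o_i are independent and that each sample carries a confidence value, with the confidences of one object summing to 1; the function name `range_probability` and the example data are hypothetical, not taken from the talk.

```python
import math

def range_probability(q_samples, o_samples, eps):
    """P(d(q, o) <= eps) for two discrete uncertain objects.

    Each object is a list of (vector, confidence) pairs. The sketch assumes
    the two objects are independent and that confidences sum to 1 per object.
    """
    prob = 0.0
    for q_vec, q_conf in q_samples:
        for o_vec, o_conf in o_samples:
            # Every sample pair is one possible world with weight q_conf * o_conf.
            if math.dist(q_vec, o_vec) <= eps:
                prob += q_conf * o_conf
    return prob

# Hypothetical example: two samples per object, equal confidences.
q = [((0.0, 0.0), 0.5), ((1.0, 0.0), 0.5)]
o = [((0.0, 1.0), 0.5), ((3.0, 0.0), 0.5)]
p = range_probability(q, o, eps=1.5)
```

The exact evaluation touches all M·N sample pairs, which is the cost the filter steps on the following slides try to avoid.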

Probabilistic Similarity Queries
Clustered Object Representation
build approximations by grouping vector points of an object into clusters
  – object: o = {o_1, .., o_s}
  – simple object approximation: MBR(o)
  – clustered object approximation: MBR(C_1(o)), .., MBR(C_k(o))

Probabilistic Similarity Queries
advantages of clustered object approximation
  – efficiently managed by spatial access methods, e.g. R-tree, X-tree
  – supports multi-step query processing
      true hits can be reported very early
      reduced refinement cost
  – efficient computation of approximate answers
      PTSQ and PTopkSQ efficiently supported

Probabilistic Similarity Queries
multi-step query processing:
  – probabilistic filter
      estimation of the probability p = P(d(o,q) ≤ ε):
      … ≤ P(d(o,q) ≤ ε) ≤ 0.6
      (lower bounding prob. estimation on the left, upper bounding prob. estimation on the right)
  [Figure: query point q and uncertain object o (clustered object representation)]
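
One way such bounds could be derived from the cluster MBRs is sketched below. This is an assumption about the filter, not the talk's stated procedure: a cluster whose MBR lies entirely within ε of q counts toward the lower bound, and a cluster whose MBR intersects the ε-range counts toward the upper bound. The names `min_dist`, `max_dist`, `probability_bounds` and the cluster weights are hypothetical.

```python
import math

def min_dist(q, lo, hi):
    """Smallest Euclidean distance from point q to the box [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))

def max_dist(q, lo, hi):
    """Largest Euclidean distance from point q to the box [lo, hi]."""
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(q, lo, hi)))

def probability_bounds(q, clusters, eps):
    """Bound P(d(o, q) <= eps) using only the cluster MBRs of o.

    clusters: list of (lo, hi, weight), where weight is the summed
    confidence of the samples inside that cluster (assumed to sum to 1).
    """
    # Clusters completely inside the eps-range are certain contributions.
    p_low = sum(w for lo, hi, w in clusters if max_dist(q, lo, hi) <= eps)
    # Clusters that at least touch the eps-range are possible contributions.
    p_up = sum(w for lo, hi, w in clusters if min_dist(q, lo, hi) <= eps)
    return p_low, p_up

# Hypothetical object with three clusters.
clusters = [
    ((0.5, 0.5), (1.0, 1.0), 0.4),  # fully inside the range
    ((1.5, 0.0), (3.0, 1.0), 0.3),  # partially inside
    ((4.0, 4.0), (5.0, 5.0), 0.3),  # fully outside
]
p_low, p_up = probability_bounds((0.0, 0.0), clusters, eps=2.0)
```

Only the partially covered cluster keeps the two bounds apart, so refinement would only need to look at its samples.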

Probabilistic Similarity Queries
Filter Step for PSQs:
  – Probabilistic ε-Range Queries (PTSQ type):
      for each uncertain object o:
      – compute lower and upper bounding probabilities based on cluster representations
      – if the lower bounding probability P_low exceeds the probability threshold, then report o
      – if the upper bounding probability P_upper is below the probability threshold, then prune o
      – otherwise refine o (partial refinement)
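
The three-way decision of the filter step can be sketched as follows. The threshold symbol did not survive in the slide text, so the parameter name `tau` is an assumption, as are the function name `filter_step` and the example bounds.

```python
def filter_step(bounds, tau):
    """Classify objects by their probability bounds.

    bounds: dict mapping object id -> (p_low, p_up).
    tau: probability threshold of the query (name assumed for this sketch).
    Returns (hits, candidates); pruned objects appear in neither list.
    """
    hits, candidates = [], []
    for oid, (p_low, p_up) in bounds.items():
        if p_low > tau:
            hits.append(oid)        # true hit, reported without refinement
        elif p_up < tau:
            continue                # true drop, pruned
        else:
            candidates.append(oid)  # bounds are inconclusive, refine
    return hits, candidates

# Hypothetical bounds for three objects.
hits, candidates = filter_step(
    {"a": (0.8, 0.9), "b": (0.1, 0.3), "c": (0.4, 0.7)}, tau=0.5
)
```

Refinement of a candidate can stop as soon as its tightened bounds fall on one side of the threshold, which is what makes the refinement partial.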

Probabilistic Similarity Queries
Filter Step for PSQs:
  – Probabilistic k-NN Queries (PTSQ type)
      Example: the upper bounding probability that p is the NN is P_upper = 0.7
  [Figure: query point q, object o, object p]

Probabilistic Similarity Ranking
Ranking Queries
  – very important for similarity search applications
  – give the most relevant answers first
  – are more flexible than ε-range and NN queries
probabilistic ranking queries
  – results are associated with confidence values
  – in contrast to ε-range / NN queries, no unique query predicate

Probabilistic Similarity Ranking
output of probabilistic ranking:
  – for each object: discrete pdf over ranking positions
      prob_ranked_q : D × {1, .., N} → [0, 1]
  – prob_ranked_q(o, k) reports the probability that object o is exactly the k-th nearest neighbor of the query object q
  [Figure: probability plotted over ranking position k]

Probabilistic Similarity Ranking
Example: Probabilistic Ranking Output
  [Figure: vector space with query q and objects A–T (left); probabilistic ranking output as a probability table over objects and ranking position k (right)]

Probabilistic Similarity Ranking
probabilistic ranking output is inconvenient for most users
coping with probabilistic ranking:
  – ranking with unique order and confidences
  – aggregate conf. values to deterministic results

      Rank   OID   Conf.
      1      A     …
      2      B     …
      3      E     …
      4      C     …

How should we extract the conf. from the prob. ranking output? Which ranking order?

Probabilistic Similarity Ranking
Approaches for Ranking Objects:
  – Approach 1: highest confidence [Soliman ICDE'07, Yi ICDE'08]
  – problem: duplicates, neglected objects
  – Example:
      Result: 1. (A,0.45) | 2. (C,0.40) | 3. (C,0.45)
      or with duplicate elimination: 1. (A,0.45) | 2. (C,0.40) | 3. (B,0.35)
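
A minimal sketch of Approach 1 follows, read here as: each ranking position is given to the object with the highest probability for exactly that position. The probability table is hypothetical (it does not reproduce the numbers on the slide), and the function name and the `eliminate_duplicates` flag are assumptions for illustration.

```python
# Hypothetical table: TABLE[obj][k - 1] = probability that obj is at rank k.
TABLE = {
    "A": [0.60, 0.20, 0.20],
    "B": [0.30, 0.35, 0.35],
    "C": [0.10, 0.45, 0.45],
}

def rank_by_highest_confidence(table, eliminate_duplicates=False):
    """Assign to each rank the object with the highest probability for it.

    Assumes there are at least as many objects as ranking positions.
    """
    n_ranks = len(next(iter(table.values())))
    used, result = set(), []
    for k in range(n_ranks):
        # With duplicate elimination, already ranked objects are skipped.
        pool = [o for o in table if not (eliminate_duplicates and o in used)]
        best = max(pool, key=lambda o: table[o][k])
        used.add(best)
        result.append((best, table[best][k]))
    return result

plain = rank_by_highest_confidence(TABLE)
dedup = rank_by_highest_confidence(TABLE, eliminate_duplicates=True)
```

With this table the plain variant ranks C at both positions 2 and 3 and never reports B, which is the duplicate and neglected-object problem named on the slide; the variant with duplicate elimination places B at position 3.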

Probabilistic Similarity Ranking
Approaches for Ranking Objects:
  – Approach 2: highest aggregated confidence
      the object with the highest prob. that it is one of the first k objects is assigned to ranking position k
      sensible with duplicate elimination
  – Example:
      Result: 1. (A,0.45) | 2. (B,0.35) | 3. (C,0.45)
      or: 1. (A,0.45) | 2. (B,0.75) | 3. (C,1.00)
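
Approach 2 can be sketched in the same style. The sketch assumes that the aggregated confidence of an object at position k is the sum of its probabilities for positions 1..k, and that duplicate elimination is always applied; the table values and the function name are hypothetical.

```python
from itertools import accumulate

# Hypothetical table: TABLE[obj][k - 1] = probability that obj is at rank k.
TABLE = {
    "A": [0.60, 0.20, 0.20],
    "B": [0.30, 0.35, 0.35],
    "C": [0.10, 0.45, 0.45],
}

def rank_by_aggregated_confidence(table):
    """Assign rank k to the unranked object most likely to be among the first k.

    Assumes there are at least as many objects as ranking positions.
    """
    # cumulative[obj][k - 1] = probability that obj is among the first k.
    cumulative = {o: list(accumulate(p)) for o, p in table.items()}
    n_ranks = len(next(iter(table.values())))
    used, result = set(), []
    for k in range(n_ranks):
        pool = [o for o in table if o not in used]
        best = max(pool, key=lambda o: cumulative[o][k])
        used.add(best)
        result.append((best, cumulative[best][k]))
    return result

ranking = rank_by_aggregated_confidence(TABLE)
```

For this table the order is A, B, C, and the reported confidences grow with the position because each one covers all positions up to k.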

Probabilistic Similarity Ranking
further approaches to determine the ranking order, e.g.
  – expected ranking position
  – etc.
most intuitive and robust: Approach 2
problem:
  – full probabilistic ranking information is required
required:
  – efficient computation of prob. ranking output

Probabilistic Similarity Ranking
Iterative Probability Computation
  – ranking applied on object vectors (samples)
  – during the radial sweep: maintain for each object o the probability P_o that o lies within the current sweep range
  – for each accessed sample o_i,j, compute the probability P(o_i,j, k) that exactly (k-1) objects o ≠ o_i are within the sweep range ε, for k = 1..N
  [Figure: radial sweep with increasing range ε over objects A, B, C, D, with the maintained values P_o]
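
The radial sweep can be sketched as a generator that visits all samples in ascending distance from q and keeps, per object, the confidence mass of its samples already passed. This is an illustrative reading of the slide for a certain query point q; the function name and the data layout are assumptions, and ties in distance are broken arbitrarily by sort order.

```python
import math

def radial_sweep(q, objects):
    """Visit all samples in ascending distance from the query point q.

    objects: dict mapping object id -> list of (vector, confidence).
    Yields (object id, sample index, confidence, others), where others maps
    every other object to the probability that it is already inside the
    current sweep range.
    """
    events = sorted(
        (math.dist(q, vec), oid, j, conf)
        for oid, samples in objects.items()
        for j, (vec, conf) in enumerate(samples)
    )
    p_in_range = {oid: 0.0 for oid in objects}
    for _dist, oid, j, conf in events:
        others = {o: p for o, p in p_in_range.items() if o != oid}
        yield oid, j, conf, others
        # The sample is now inside the sweep range; update its object.
        p_in_range[oid] += conf

# Hypothetical data: object A has two samples, object B has one.
objects = {
    "A": [((1.0, 0.0), 0.5), ((4.0, 0.0), 0.5)],
    "B": [((2.0, 0.0), 1.0)],
}
steps = list(radial_sweep((0.0, 0.0), objects))
```

The `others` dictionary at each step is the input from which P(o_i,j, k) would be computed for that sample.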

Probabilistic Similarity Ranking
computation of P(o_i,j, k): [equation not preserved]
problem: comp. very expensive
  – a lot of possibilities for i must be reconsidered
1. Approach:
  – pruning objects that are beyond ε: reduce DB → DB′ (|DB′| ≪ |DB|)

Probabilistic Similarity Ranking
applying only relevant objects:
  A (1.0) | B (1.0) | F (0.8) | D (0.6) | H (0.2) | C (0.1) | E (0.0) | G (0.0)
  [Figure: objects A–I around query q and sample o_i,j, with the object sets N, N′ and N″ marked]

Probabilistic Similarity Ranking
problem: computation still exponential
2. Approach:
  – problem can be solved in polynomial time by means of a dynamic programming technique:
      P(2 of 4 in ε-range) is composed of
      P(1 of 3 in ε-range), assuming C in ε-range
      P(2 of 3 in ε-range), assuming C not in ε-range
  [Figure: objects F, D, H, C around query q and sample o_i,j, once for each of the three cases]

Probabilistic Similarity Ranking
2. Approach (continued):
  – recursive function: [equation not preserved]
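
The decomposition used in the dynamic programming approach (either the last object is in the ε-range and one fewer of the remaining objects must be, or it is not and the count is unchanged) can be sketched as below. The sketch assumes mutually independent objects, as stated in the summary; the function name `count_distribution` is hypothetical, and the example probabilities are the values shown for F, D, H and C on the earlier slide.

```python
def count_distribution(probs):
    """Distribution of the number of objects inside the eps-range.

    probs: for each object, the probability that it lies inside the range.
    Objects are assumed to be mutually independent.
    Returns dist with dist[k] = P(exactly k objects are inside the range).
    """
    dist = [1.0]  # with no objects, exactly 0 are in range
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # this object is not in range
            nxt[k + 1] += mass * p      # this object is in range
        dist = nxt
    return dist

# In-range probabilities of F, D, H, C from the earlier slide.
dist = count_distribution([0.8, 0.6, 0.2, 0.1])
p_two_of_four = dist[2]
```

Each object adds one pass over the table, so n objects cost O(n²) operations instead of enumerating all 2ⁿ in/out combinations. Under this reading, P(o_i,j, k) would be `dist[k - 1]` computed over the objects other than o_i.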

Summary
approaches to accelerate probabilistic similarity queries in vector spaces
assumptions:
  – objects are mutually independent
  – discrete uncertainty representations
supported by
  – traditional access methods
  – multi-step query processing techniques
very high speed-up factor using dynamic programming

Discussion
Any questions?
Thank you for your attention.