Cheng, Chen, Chen, Xie
Evaluating Probability Threshold k-Nearest-Neighbor Queries over Uncertain Data
Reynold Cheng (University of Hong Kong)
Lei Chen (Hong Kong University of Science & Technology)
Jinchuan Chen (Hong Kong Polytechnic University)
Xike Xie (University of Hong Kong)
International Conference on Extending Database Technology 2009

Agenda
1. Introduction
2. Problem Definition
3. Basic Solution
4. Efficient Solution
5. Results

Data Uncertainty
Inherent in various applications:
- Location-based services (e.g., using GPS, RFID) [TDRP98, SSDBM99]
- Natural habitat monitoring with sensor networks [VLDB04a]
- Biomedical and biometric databases [ICDE06, ICDE07]

Attribute Uncertainty Model [TDRP98, ISSD99, VLDB04b]
[Figure: an object's uncertainty region and the pdf defined over it]
We represent an uncertainty pdf as a histogram.
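A histogram pdf makes cumulative probabilities cheap to evaluate. The following is a minimal Python sketch, assuming a one-dimensional histogram with uniform density inside each bin; the function name `histogram_cdf` and the bin layout are illustrative assumptions, not taken from the slides.

```python
from bisect import bisect_right
from typing import Sequence


def histogram_cdf(edges: Sequence[float], masses: Sequence[float], x: float) -> float:
    """Return Pr(X <= x) for a histogram pdf.

    edges  -- sorted bin boundaries, len(masses) + 1 values (assumed layout)
    masses -- probability mass of each bin, summing to 1
    """
    if x <= edges[0]:
        return 0.0
    if x >= edges[-1]:
        return 1.0
    i = bisect_right(edges, x) - 1                 # index of the bin containing x
    frac = (x - edges[i]) / (edges[i + 1] - edges[i])
    # Full mass of the bins to the left, plus a linear share of bin i.
    return sum(masses[:i]) + masses[i] * frac


# Hypothetical histogram with three bins: [0, 1), [1, 2), [2, 4)
edges = [0.0, 1.0, 2.0, 4.0]
masses = [0.25, 0.5, 0.25]
half = histogram_cdf(edges, masses, 1.5)
```

The same routine applies to a distance pdf stored as a histogram, in which case the returned value is the distance cdf at x.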

k-NN Queries
k-NN Query over Precise Data
- application in LBS [VLDB03]
- natural habitat monitoring system [VLDB04a]
- network traffic analysis [ICDCS07]
- pattern matching in CAM [VLDB04c]
k-NN over Uncertain Objects
- [VLDB08a] ranks objects by the probability that each is the NN of the query point.
- [ICDE07a] uses expected distance and does not discuss the probability.

2. Problem Definition

Probability Threshold k-Nearest-Neighbor Query (T-k-PNN)
INPUT
1. A query point q, parameter k, threshold T
2. A set of n objects with uncertainty regions and pdfs
OUTPUT
A number of k-subsets S with p(S) ≥ T, where p(S) is the qualification probability of the k-subset S

Example of a k-PNN query (k=3)
[Figure: uncertain objects O1 to O8 around the query point q]
Example 3-subsets: {O1, O2, O3}, {O1, O2, O4}

Example of a k-PNN query (k=3)
[Figure: objects O1 to O6 and the k-bound around q; O7 and O8 lie outside the k-bound]
Without the k-bound: {O1, O2, O3}, {O1, O2, O4}, …, {O6, O7, O8}
With the k-bound: {O1, O2, O3}, {O1, O2, O4}, …, {O4, O5, O6}

k-bound Filtering (k=3)
[Figure: objects O1 to O8 around q, with the maximum distances f1, f2, f3 and the k-bound]
f_k (k-bound): the k-th smallest of the objects' maximum distances from q.
Since min(r7) > f3, O7 cannot be one of the 3 nearest neighbors of q, because there are always 3 objects with distances no larger than f3.
We apply k-bound filtering on an index (e.g., an R-tree) to prune unqualified objects.
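The filtering rule can be sketched in a few lines of Python. This is a minimal illustration that scans a list of distance ranges directly and does not use an index such as an R-tree; the function name `k_bound_filter` and the sample distances are hypothetical.

```python
from typing import Dict, List, Tuple

Range = Tuple[float, float]  # (min distance, max distance) of an object from q


def k_bound_filter(dist_ranges: Dict[str, Range], k: int) -> Tuple[float, List[str]]:
    """Return the k-bound f_k and the objects that survive the filter."""
    max_dists = sorted(hi for _, hi in dist_ranges.values())
    f_k = max_dists[k - 1]  # k-th smallest maximum distance
    # An object whose minimum distance exceeds f_k can never be a k-NN of q.
    survivors = [name for name, (lo, _) in dist_ranges.items() if lo <= f_k]
    return f_k, survivors


# Hypothetical distance ranges for five objects
ranges = {
    "O1": (1.0, 2.0),
    "O2": (1.5, 3.0),
    "O3": (2.0, 4.0),
    "O4": (3.5, 6.0),
    "O7": (5.0, 7.0),
}
f3, kept = k_bound_filter(ranges, k=3)
```

With these sample values f3 is the maximum distance of O3, and O7 is pruned because its minimum distance lies beyond f3.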

3. Basic Solution

Basic solution for a T-k-PNN query (k=3, T=0.1)
[Figure: objects O1 to O6 around q, with the k-bound]
Step 1: k-bound filtering
Step 2: QP calculation
3-subset        QP
{O1, O2, O3}    0.2
{O1, O2, O4}    0.1
{O1, O3, O4}    0.1
{O2, O3, O4}    0.1
{O1, O2, O5}    0.05
{O1, O3, O5}    0.05
{O2, O3, O5}    0.05
…               …
Step 3: Accept S if qp(S) ≥ T (T=0.1)
3-subset        QP
{O1, O2, O3}    0.2
{O1, O2, O4}    0.1
{O1, O3, O4}    0.1
{O2, O3, O4}    0.1
Exact QP is expensive to compute! Too many k-subsets!
Symbol    Meaning
r_i       |o_i − q|
d_i(r)    pdf of r_i (distance pdf)
D_i(r)    cdf of r_i (distance cdf)
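Because exact QP evaluation is costly, a small simulation is a convenient way to sanity-check QP values on toy inputs. The sketch below is not the basic solution from the slides: it swaps in Monte Carlo sampling, and it assumes each distance r_i is uniformly distributed between its minimum and maximum. The name `estimate_qp` and the sample ranges are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, Tuple

Range = Tuple[float, float]  # (min distance, max distance) of an object from q


def estimate_qp(dist_ranges: Dict[str, Range], k: int,
                trials: int = 10_000, seed: int = 0) -> Dict[Tuple[str, ...], float]:
    """Estimate qp(S) for every k-subset S that occurs as the k-NN set."""
    rng = random.Random(seed)
    counts: Counter = Counter()
    for _ in range(trials):
        # Assumption: each distance is uniform over its range.
        sampled = {name: rng.uniform(lo, hi) for name, (lo, hi) in dist_ranges.items()}
        nearest = sorted(sampled, key=sampled.get)[:k]
        counts[tuple(sorted(nearest))] += 1
    return {subset: c / trials for subset, c in counts.items()}


# Hypothetical distance ranges
ranges = {"O1": (1.0, 2.0), "O2": (1.5, 3.0), "O3": (2.0, 4.0), "O4": (3.5, 6.0)}
qp = estimate_qp(ranges, k=3, trials=2_000)
accepted = {s: p for s, p in qp.items() if p >= 0.1}  # Step 3 with T = 0.1
```

The estimates sum to 1 over all k-subsets that appear, so subsets that never occur in the samples implicitly receive probability 0.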

Qualification Probability
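One way to write the qualification probability, reconstructed here from the distance pdf d_i(r) and distance cdf D_i(r) defined for the basic solution rather than copied from the slide. It assumes the objects' distances from q are independent and continuous; the symbol \mathcal{O} for the full set of objects is introduced only for this sketch.

```latex
qp(S) = \sum_{O_i \in S} \int_{0}^{\infty} d_i(r)
        \prod_{O_j \in S,\; j \neq i} D_j(r)
        \prod_{O_h \in \mathcal{O} \setminus S} \bigl(1 - D_h(r)\bigr) \, dr
```

Each term covers the case where O_i is the farthest member of S, at distance r: the other members of S are closer than r, and every object outside S is farther than r.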

Basic Solution [TKDE04]
[Figure: objects O1 to O6 around q, showing the near point n1 of O1 and the distance f]
- d_i(r): distance pdf of O_i from q
- D_i(r): distance cdf of O_i from q
- r_i: distance of O_i from q
- n_i: smallest distance of O_i from q (min(r_i))
- f: shortest max distance of all objects from q

4. Efficient Solution

Efficient Solution Framework (GVR)
k-subset Generation: candidate objects → k-subsets
1. k-bound Filtering
2. Probabilistic Candidate Selection
k-subset Verification and Refinement:
- Verification: lower bound and upper bound → k-subsets accepted, k-subsets rejected
- Refinement: the remaining k-subsets

Probabilistic Candidates Selection
[Figure: objects O1 to O6 around q, with the k-bound]
Cutoff probability of O_i: Pr(r_i ≤ f_k)
S1 = {O4, O5, O6}: cp(S1) = 0.5 × 0.2 × 0.1 = 0.01
S2 = {O4, O5}: cp(S2) = 0.5 × 0.2 = 0.1
Given T=0.2, if cp(S2) < T, then qp(S1) < T
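The arithmetic on this slide can be reproduced with a short Python sketch. It takes the cutoff probabilities of O4, O5 and O6 as given and multiplies them per subset; the helper names `cutoff_product` and `can_prune` are hypothetical.

```python
from math import isclose, prod
from typing import Dict, Iterable


def cutoff_product(cutoff: Dict[str, float], subset: Iterable[str]) -> float:
    """cp(S): product of the cutoff probabilities of the members of S."""
    return prod(cutoff[name] for name in subset)


def can_prune(cutoff: Dict[str, float], subset: Iterable[str], threshold: float) -> bool:
    """A subset whose cp is below T cannot be part of a qualifying k-subset."""
    return cutoff_product(cutoff, subset) < threshold


# Cutoff probabilities taken from the slide
cutoff = {"O4": 0.5, "O5": 0.2, "O6": 0.1}
cp_s1 = cutoff_product(cutoff, ["O4", "O5", "O6"])  # about 0.01
cp_s2 = cutoff_product(cutoff, ["O4", "O5"])        # about 0.1
prune_s2 = can_prune(cutoff, ["O4", "O5"], threshold=0.2)
```

Since every cutoff probability is at most 1, adding O6 to S2 can only lower the product, which is why pruning S2 also rules out S1.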

Probabilistic Candidates Selection
T=0.2, k=3
1-subset    CP
{O1}        1
{O2}        1
{O3}        1
{O4}        0.5
{O5}        0.2
{O6}        0.1
2-subset    CP
{O1, O2}    1
{O1, O3}    1
{O1, O4}    0.5
{O1, O5}    0.2
{O2, O3}    1
{O2, O4}    0.5
{O2, O5}    0.2
{O3, O4}    0.5
{O3, O5}    0.2
{O4, O5}    0.1
3-subset        CP
{O1, O2, O3}    1
{O1, O2, O4}    0.5
{O1, O2, O5}    0.2
{O1, O3, O4}    0.5
{O1, O3, O5}    0.2
{O1, O4, O5}    0.1
{O2, O3, O4}    0.5
{O2, O3, O5}    0.2
{O2, O4, O5}    0.1
{O3, O4, O5}    0.1
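The level-by-level growth from 1-subsets to 3-subsets can be sketched as follows. This is a minimal Python illustration, assuming the CP of a subset is the product of its members' cutoff probabilities and that a subset is kept when its CP is at least T; the function name `generate_k_subsets` is hypothetical and the code does not reproduce the authors' implementation.

```python
from typing import Dict, List, Tuple


def generate_k_subsets(cutoff: Dict[str, float], k: int,
                       threshold: float) -> List[Tuple[Tuple[str, ...], float]]:
    """Grow subsets one object at a time, dropping any whose CP falls below T."""
    names = sorted(cutoff)  # fixed order, so each subset is built exactly once
    level: List[Tuple[Tuple[str, ...], float]] = [((), 1.0)]
    for _ in range(k):
        next_level = []
        for subset, cp in level:
            start = names.index(subset[-1]) + 1 if subset else 0
            for name in names[start:]:
                new_cp = cp * cutoff[name]
                if new_cp >= threshold:  # supersets of a dropped subset are never built
                    next_level.append((subset + (name,), new_cp))
        level = next_level
    return level


# Cutoff probabilities from the 1-subset table
cutoff = {"O1": 1.0, "O2": 1.0, "O3": 1.0, "O4": 0.5, "O5": 0.2, "O6": 0.1}
candidates = generate_k_subsets(cutoff, k=3, threshold=0.2)
candidate_sets = [subset for subset, _ in candidates]
```

On the slide's values this keeps seven 3-subsets and never materializes subsets such as {O1, O4, O5}, because the 2-subset {O4, O5} is already below the threshold.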

Storage Efficient Compression
Subsets are sorted in descending order of their CPs.
Original subsets:
2-subset    CP
{O1, O2}    1
{O1, O3}    1
{O2, O3}    1
{O1, O4}    0.5
{O2, O4}    0.5
{O3, O4}    0.5
{O1, O5}    0.2
{O2, O5}    0.2
{O3, O5}    0.2
Compressed subsets (Size-2 Set): {O1, O5}, {O2, O5}, {O3, O5}
Store the common prefix of the subsets, and the last element of the subset that has the minimum product of cutoff probabilities greater than T.

Storage Efficient Compression
T=0.2, k=3
(CP tables for the 1-subsets, 2-subsets and 3-subsets as on the Probabilistic Candidates Selection slide)
Size-1 Set: {O1}, {O2}, {O3}, {O4}, {O5}
Size-2 Set: {O1, O5}, {O2, O5}, {O3, O5}
Size-3 Set: {O1, O2, O5}, {O1, O3, O5}, {O2, O3, O5}

Seeds Pruning (k=3)
[Figure: objects O1 to O4 around q, with max(r1) = f1, max(r2) = f2, max(r3) = f3 and min(r4) marked]
Seeds: o1, o2, o3
min(r4) > f2 > f1, so r4 > r2 and r4 > r1 always hold.
If o4 belongs to a 3-NN set S, o1 and o2 must also belong to S.
For example, we can prune the set {o1, o3, o4} according to the above rule.
- No CP calculation is needed.
- Can prune more candidate k-sets.
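The seeds rule needs only distance ranges, as the following Python sketch shows. It treats the k objects with the smallest maximum distances as seeds and rejects any candidate subset that omits a seed while containing an object that is always farther away than that seed. The function names and the sample ranges are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Range = Tuple[float, float]  # (min distance, max distance) of an object from q


def seeds(dist_ranges: Dict[str, Range], k: int) -> List[str]:
    """The k objects with the smallest maximum distances from q."""
    return sorted(dist_ranges, key=lambda name: dist_ranges[name][1])[:k]


def pruned_by_seeds(subset: Iterable[str], dist_ranges: Dict[str, Range], k: int) -> bool:
    """True if the subset can never be the k-NN set according to the seeds rule."""
    members = set(subset)
    for seed in seeds(dist_ranges, k):
        if seed in members:
            continue
        max_seed = dist_ranges[seed][1]
        # A member that is always farther than a missing seed rules the subset out.
        if any(dist_ranges[name][0] > max_seed for name in members):
            return True
    return False


# Hypothetical ranges in which min(r4) > f2 > f1, as on the slide
ranges = {"o1": (1.0, 2.0), "o2": (1.5, 3.0), "o3": (2.0, 4.0), "o4": (3.5, 6.0)}
drop = pruned_by_seeds(("o1", "o3", "o4"), ranges, k=3)
keep = pruned_by_seeds(("o1", "o2", "o4"), ranges, k=3)
```

Here {o1, o3, o4} is pruned because it contains o4 but omits o2, while {o1, o2, o4} is kept since o3 is not always closer than o4.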

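Verifiers: Upper and Lower Bounds (T=0.2)
[Figure: the candidate k-subsets (after PCS) S1, S2 and S3 pass through the verifiers, which place bounds on each subset's QP on a scale from 0 to 1; a classifier then decides each subset, and undecided subsets go to incremental refinement]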

Verification and Refinement
[Figure: partitions and the stair-case model]
- Extended from the probabilistic verifiers in [ICDE08b].
- Divide the range [min(r1), f_k] into a series of partitions.
- Build a data structure, the stair-case model, to store the distance cdf of each object.
- Derive the lower and upper bounds of a k-set's QP based on the stair-case model.
- Reject a k-set once its QP must be lower than the threshold; accept it once its QP must reach the threshold.
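The accept/reject decision can be sketched as a three-way classifier in Python. This assumes the lower and upper bounds on a k-set's QP have already been computed (the stair-case computation is not shown) and that acceptance means qp(S) ≥ T; the function name `classify` and the sample bounds are hypothetical.

```python
def classify(lower: float, upper: float, threshold: float) -> str:
    """Decide a k-set from bounds on its QP, where lower <= qp <= upper."""
    if lower >= threshold:
        return "accept"   # QP is guaranteed to reach the threshold
    if upper < threshold:
        return "reject"   # QP is guaranteed to stay below the threshold
    return "refine"       # bounds are inconclusive, so tighten them further


# Hypothetical bounds for three candidate k-sets with T = 0.2
decisions = {
    "S1": classify(lower=0.25, upper=0.40, threshold=0.2),
    "S2": classify(lower=0.05, upper=0.15, threshold=0.2),
    "S3": classify(lower=0.10, upper=0.30, threshold=0.2),
}
```

Only k-sets labelled "refine" need tighter bounds or an exact QP computation, which is where the savings over the basic solution come from.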

Lower and Upper Bounds
[Equations: lower and upper bounds of a k-subset's QP]

Upper- and Lower-Bound Verifiers

Complexity of Verifiers
Item                                Cost
Upper bound for one object          O(kM|C|)
Lower bound for one object          O(kM|C|)
Total complexity of verification    O(kM|C||Q|)
|Q| = no. of k-subsets generated from PCS
|C| = no. of candidates with non-zero probability
M = no. of subregions

5. Results

Experiment Setup
Uncertain object DB: Long Beach (53k)
Uncertainty pdf: Uniform (default), Gaussian (represented by histograms)
Threshold (T): 0.1
k: 6

1. k-bound Filtering

2. Performance of GVR

3. k-subset Generation

3. k-subset Generation

4. Verification and Refinement

5. Time Analysis

6. Gaussian Distribution

Conclusion
We proposed an efficient evaluation framework for T-k-PNN queries.
We proposed various techniques:
- k-bound filtering to remove unqualified objects
- PCS to reduce the number of k-subsets
- verification/refinement methods to avoid exact calculation
Future Work
- extend the techniques to other queries

Q & A
Thanks!

