Database Implementation of a Model-Free Classifier
Konstantinos Morfonios, ADBIS 2007, University of Athens
Outline
- Introduction & Motivation
- LOCUS
- Parallel Execution
- Experimental Evaluation
- Conclusions & Future Work
Introduction: Classification assigns to an object x a class label ω = f(x), chosen among predefined classes ω1, ω2, …
Introduction: given objects x1, x2, …, two families of classifiers
- “Eager” (e.g. Decision Trees): (+) faster decisions; (−) large/complex datasets; (−) dynamic datasets; (−) dynamic models
- “Lazy” (e.g. Nearest Neighbors): no model built in advance
Motivation
- Large/complex datasets, dynamic datasets, dynamic models
- These call for a lazy (model-free) approach: Nearest Neighbors
- A disk-based implementation is needed
Motivation: Nearest Neighbors suffers from the “curse of dimensionality”
- Not reliable in high dimensions [Beyer et al., ICDT 1999]
- Not indexable [Shaft et al., ICDT 2005]
Proposed alternative: LOCUS (Lazy Optimal Classifier of Unlimited Scalability)
Motivation: LOCUS (Lazy Optimal Classifier of Unlimited Scalability)
- Category? Lazy
- Scaling? Based on simple SQL queries
- Accuracy? Converges to the optimal Bayes classifier
- Other features? Parallelizable
LOCUS: running example with two classes ω1, ω2 and two numeric features, f1 ∈ [0, 20] and f2 ∈ [0, 10]; an unlabeled object x must be classified in the (f1, f2) plane.
LOCUS: Ideally the feature space is dense, and ω(x) can be read off directly from the training points at (or right around) x.
LOCUS: In reality there are many features with large domains, so the space is sparse and the class of x is not directly evident.
LOCUS: 3-NN decides by voting among the three nearest neighbors of x (here ω1: 2, ω2: 1), so ω(x) = ω1.
LOCUS instead counts all training points of each class inside an axis-aligned window around x (here ω1: 7, ω2: 3), so ω(x) = ω1.
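The two decision rules compared on these slides can be sketched in a few lines of Python. The toy points below are illustrative assumptions, chosen so that the window counts reproduce the slide's ω1: 7 vs. ω2: 3:

```python
from math import dist

# Toy labeled points in the (f1, f2) plane (hypothetical data, not the paper's).
train = [((2, 3), "w1"), ((3, 4), "w1"), ((4, 2), "w1"), ((5, 5), "w1"),
         ((6, 3), "w1"), ((7, 4), "w1"), ((3, 6), "w1"),
         ((4, 5), "w2"), ((6, 6), "w2"), ((2, 6), "w2"), ((9, 9), "w2")]

def knn(x, k=3):
    """k-NN: majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: dist(p[0], x))[:k]
    votes = {}
    for _, w in nearest:
        votes[w] = votes.get(w, 0) + 1
    return max(votes, key=votes.get)

def locus(x, deltas):
    """LOCUS: count every training point inside the axis-aligned window
    [x_i - delta_i, x_i + delta_i] per class; the majority class wins."""
    counts = {}
    for point, w in train:
        if all(abs(p - xi) <= d for p, xi, d in zip(point, x, deltas)):
            counts[w] = counts.get(w, 0) + 1
    return max(counts, key=counts.get)

print(knn((4, 4)))            # vote among the 3 nearest neighbors -> w1
print(locus((4, 4), (3, 3)))  # counts inside the window: w1: 7, w2: 3 -> w1
```

Both rules agree here, but 3-NN bases its answer on 3 points while LOCUS uses all 10 points falling inside the window.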
LOCUS: disk-based implementation
LOCUS over a relation R(f1, f2, ω), using a window of size 2δ1 × 2δ2 around x:

SELECT ω, count(*)
FROM R
WHERE f1 ≥ x1-δ1 AND f1 ≤ x1+δ1
  AND f2 ≥ x2-δ2 AND f2 ≤ x2+δ2
GROUP BY ω

Here the counts are ω1: 7, ω2: 3, so ω(x) = ω1.
LOCUS: What if R is large? This is a well-known type of aggregate query, so classical optimization techniques apply: indexing, presorting, materialized views.
LOCUS: Is the method reliable? LOCUS converges to the optimal Bayes classifier as the size of the dataset increases (proof in the paper).
LOCUS: What if a feature, say f2, is categorical (e.g. sex)? Replace its range predicate with an equality:

SELECT ω, count(*)
FROM R
WHERE f1 ≥ x1-δ1 AND f1 ≤ x1+δ1
  AND f2 = x2
GROUP BY ω

This is not a problem, since in practice datasets combine categorical and numeric features, categorical features have small domains, and hence they do not contribute to sparsity.
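Because LOCUS is just a SQL aggregate, it runs unchanged on any relational engine. A minimal sqlite3 sketch with one numeric and one categorical feature; the table layout and toy rows are assumptions for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE R (f1 REAL, f2 TEXT, w TEXT)")  # w plays the role of ω
rows = [(3.0, "M", "w1"), (4.5, "M", "w1"), (5.0, "F", "w1"),
        (4.0, "M", "w2"), (9.0, "M", "w2"), (4.2, "F", "w2")]
con.executemany("INSERT INTO R VALUES (?, ?, ?)", rows)

def classify(x1, x2, delta1):
    """LOCUS query: range predicate on the numeric feature f1,
    equality predicate on the categorical feature f2."""
    cur = con.execute(
        "SELECT w, count(*) FROM R "
        "WHERE f1 >= ? AND f1 <= ? AND f2 = ? "
        "GROUP BY w",
        (x1 - delta1, x1 + delta1, x2))
    counts = dict(cur.fetchall())
    return max(counts, key=counts.get)  # majority class inside the window

print(classify(4.0, "M", 1.0))  # window f1 in [3, 5], f2 = 'M': w1: 2, w2: 1 -> w1
```

All indexing, presorting, and materialized-view optimizations mentioned above then come for free from the database engine.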
Parallel Execution: partition R horizontally into R1, R2, R3, R4 (R = R1 ∪ R2 ∪ R3 ∪ R4) and send the same SELECT to every partition.
Parallel Execution: count is a distributive function, so the per-partition results (ω1: 5, ω2: 2), (ω1: 7, ω2: 1), (ω1: 5, ω2: 1), (ω1: 6, ω2: 0) merge by simple addition into ω1: 23, ω2: 4.
Parallel Execution: benefits
- Small network traffic (only per-class counts are shipped)
- Load balancing
- Lightweight operations on the main server
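Since count(*) is distributive, the coordinator only has to sum per-class counts. A sketch of the merge step with collections.Counter, using the partition results from the slide:

```python
from collections import Counter

# Per-partition GROUP BY results returned by the four nodes (from the slide).
partials = [Counter({"w1": 5, "w2": 2}),
            Counter({"w1": 7, "w2": 1}),
            Counter({"w1": 5, "w2": 1}),
            Counter({"w1": 6, "w2": 0})]

total = sum(partials, Counter())    # coordinator merges by addition
print(total)                        # Counter({'w1': 23, 'w2': 4})
print(max(total, key=total.get))    # predicted class: w1
```

The merge is O(number of classes) per partition, which is why the operations on the main server stay lightweight.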
Experimental Evaluation: LOCUS vs. Decision Trees and Nearest Neighbors (Weka)
- Synthetic datasets: ten functions [Agrawal et al., IEEE TKDE 1993], D = 9, N ∈ [5·10^3, 5·10^6]
- Real-world datasets: UCI Repository
Experimental Evaluation: Classification error rate (synthetic datasets, N = 5·10^4)
Experimental Evaluation: Effect of dataset size on classification error rate of LOCUS (synthetic datasets, N ∈ [5·10^3, 5·10^6])
Experimental Evaluation: Effect of dataset size on time scalability of LOCUS (synthetic datasets, N ∈ [5·10^3, 5·10^6])
Experimental Evaluation Classification error rate (real-world datasets)
Experimental Evaluation: Effect of dataset size on classification error rate (dataset CovType, N ∈ [5·10^3, 5·10^5])
Conclusions & Future Work: LOCUS is
- Lazy (suited to complex/dynamic datasets and models)
- Efficient (based on simple SQL queries)
- Reliable (converges to the optimal Bayes classifier)
- Parallelizable
Conclusions & Future Work: next steps
- Similar techniques for feature selection and regression
- Implementation of a parallel version
Questions?
Thank you!