Database Implementation of a Model-Free Classifier
Konstantinos Morfonios, ADBIS 2007, University of Athens
Outline
- Introduction & Motivation
- LOCUS
- Parallel Execution
- Experimental Evaluation
- Conclusions & Future Work
Introduction: Classification assigns to an object x a class label ω = f(x), chosen among predefined classes ω1, ω2, …
Introduction: given objects x1, x2, …, two families of classifiers
- “Eager” (e.g. Decision Trees): (+) faster decisions; (−) large/complex datasets; (−) dynamic datasets; (−) dynamic models
- “Lazy” (e.g. Nearest Neighbors): no model built in advance
Motivation
- Large/complex datasets, dynamic datasets, dynamic models
- These call for a lazy (model-free) approach: Nearest Neighbors
- A disk-based implementation is needed
Motivation: Nearest Neighbors suffers from the “curse of dimensionality”
- Not reliable in high dimensions [Beyer et al., ICDT 1999]
- Not indexable [Shaft et al., ICDT 2005]
Proposed alternative: LOCUS (Lazy Optimal Classifier of Unlimited Scalability)
Motivation: LOCUS (Lazy Optimal Classifier of Unlimited Scalability)
- Category? Lazy
- Scaling? Based on simple SQL queries
- Accuracy? Converges to the optimal Bayes classifier
- Other features? Parallelizable
LOCUS: running example with two classes ω1, ω2 and two numeric features, f1 ∈ [0, 20] and f2 ∈ [0, 10]; an unlabeled object x must be classified in the (f1, f2) plane.
LOCUS: Ideally the feature space is dense, and ω(x) can be read off directly from the training points at (or right around) x.
LOCUS: In reality there are many features with large domains, so the space is sparse and the class of x is not directly evident.
LOCUS: 3-NN decides by voting among the three nearest neighbors of x (here ω1: 2, ω2: 1), so ω(x) = ω1.
LOCUS instead counts all training points of each class inside an axis-aligned window around x (here ω1: 7, ω2: 3), so ω(x) = ω1.
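The two decision rules compared on these slides can be sketched in a few lines of Python. The toy points below are illustrative assumptions, chosen so that the window counts reproduce the slide's ω1: 7 vs. ω2: 3:

```python
from math import dist

# Toy labeled points in the (f1, f2) plane (hypothetical data, not the paper's).
train = [((2, 3), "w1"), ((3, 4), "w1"), ((4, 2), "w1"), ((5, 5), "w1"),
         ((6, 3), "w1"), ((7, 4), "w1"), ((3, 6), "w1"),
         ((4, 5), "w2"), ((6, 6), "w2"), ((2, 6), "w2"), ((9, 9), "w2")]

def knn(x, k=3):
    """k-NN: majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: dist(p[0], x))[:k]
    votes = {}
    for _, w in nearest:
        votes[w] = votes.get(w, 0) + 1
    return max(votes, key=votes.get)

def locus(x, deltas):
    """LOCUS: count every training point inside the axis-aligned window
    [x_i - delta_i, x_i + delta_i] per class; the majority class wins."""
    counts = {}
    for point, w in train:
        if all(abs(p - xi) <= d for p, xi, d in zip(point, x, deltas)):
            counts[w] = counts.get(w, 0) + 1
    return max(counts, key=counts.get)

print(knn((4, 4)))            # vote among the 3 nearest neighbors -> w1
print(locus((4, 4), (3, 3)))  # counts inside the window: w1: 7, w2: 3 -> w1
```

Both rules agree here, but 3-NN bases its answer on 3 points while LOCUS uses all 10 points falling inside the window.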
LOCUS: disk-based implementation
LOCUS over a relation R(f1, f2, ω), using a window of size 2δ1 × 2δ2 around x:

SELECT ω, count(*)
FROM R
WHERE f1 ≥ x1-δ1 AND f1 ≤ x1+δ1
  AND f2 ≥ x2-δ2 AND f2 ≤ x2+δ2
GROUP BY ω

Here the counts are ω1: 7, ω2: 3, so ω(x) = ω1.
LOCUS: What if R is large? This is a well-known type of aggregate query, so classical optimization techniques apply: indexing, presorting, materialized views.
LOCUS: Is the method reliable? LOCUS converges to the optimal Bayes classifier as the size of the dataset increases (proof in the paper).
LOCUS: What if a feature, say f2, is categorical (e.g. sex)? Replace its range predicate with an equality:

SELECT ω, count(*)
FROM R
WHERE f1 ≥ x1-δ1 AND f1 ≤ x1+δ1
  AND f2 = x2
GROUP BY ω

This is not a problem, since in practice datasets combine categorical and numeric features, categorical features have small domains, and hence they do not contribute to sparsity.
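Because LOCUS is just a SQL aggregate, it runs unchanged on any relational engine. A minimal sqlite3 sketch with one numeric and one categorical feature; the table layout and toy rows are assumptions for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE R (f1 REAL, f2 TEXT, w TEXT)")  # w plays the role of ω
rows = [(3.0, "M", "w1"), (4.5, "M", "w1"), (5.0, "F", "w1"),
        (4.0, "M", "w2"), (9.0, "M", "w2"), (4.2, "F", "w2")]
con.executemany("INSERT INTO R VALUES (?, ?, ?)", rows)

def classify(x1, x2, delta1):
    """LOCUS query: range predicate on the numeric feature f1,
    equality predicate on the categorical feature f2."""
    cur = con.execute(
        "SELECT w, count(*) FROM R "
        "WHERE f1 >= ? AND f1 <= ? AND f2 = ? "
        "GROUP BY w",
        (x1 - delta1, x1 + delta1, x2))
    counts = dict(cur.fetchall())
    return max(counts, key=counts.get)  # majority class inside the window

print(classify(4.0, "M", 1.0))  # window f1 in [3, 5], f2 = 'M': w1: 2, w2: 1 -> w1
```

All indexing, presorting, and materialized-view optimizations mentioned above then come for free from the database engine.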
Parallel Execution: partition R horizontally into R1, R2, R3, R4 (R = R1 ∪ R2 ∪ R3 ∪ R4) and send the same SELECT to every partition.
Parallel Execution: count is a distributive function, so the per-partition results (ω1: 5, ω2: 2), (ω1: 7, ω2: 1), (ω1: 5, ω2: 1), (ω1: 6, ω2: 0) merge by simple addition into ω1: 23, ω2: 4.
Parallel Execution: benefits
- Small network traffic (only per-class counts are shipped)
- Load balancing
- Lightweight operations on the main server
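Since count(*) is distributive, the coordinator only has to sum per-class counts. A sketch of the merge step with collections.Counter, using the partition results from the slide:

```python
from collections import Counter

# Per-partition GROUP BY results returned by the four nodes (from the slide).
partials = [Counter({"w1": 5, "w2": 2}),
            Counter({"w1": 7, "w2": 1}),
            Counter({"w1": 5, "w2": 1}),
            Counter({"w1": 6, "w2": 0})]

total = sum(partials, Counter())    # coordinator merges by addition
print(total)                        # Counter({'w1': 23, 'w2': 4})
print(max(total, key=total.get))    # predicted class: w1
```

The merge is O(number of classes) per partition, which is why the operations on the main server stay lightweight.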
Experimental Evaluation: LOCUS vs. Decision Trees and Nearest Neighbors (Weka)
- Synthetic datasets: ten functions [Agrawal et al., IEEE TKDE 1993], D = 9, N ∈ [5·10^3, 5·10^6]
- Real-world datasets: UCI Repository
Experimental Evaluation: Classification error rate (synthetic datasets, N = 5·10^4)
Experimental Evaluation: Effect of dataset size on classification error rate of LOCUS (synthetic datasets, N ∈ [5·10^3, 5·10^6])
Experimental Evaluation: Effect of dataset size on time scalability of LOCUS (synthetic datasets, N ∈ [5·10^3, 5·10^6])
Experimental Evaluation Classification error rate (real-world datasets)
Experimental Evaluation: Effect of dataset size on classification error rate (dataset CovType, N ∈ [5·10^3, 5·10^5])
Conclusions & Future Work: LOCUS is
- Lazy (suited to complex/dynamic datasets and models)
- Efficient (based on simple SQL queries)
- Reliable (converges to the optimal Bayes classifier)
- Parallelizable
Conclusions & Future Work: next steps
- Similar techniques for feature selection and regression
- Implementation of a parallel version
Questions?
Thank you!