Support Vector Data Description. D. M. J. Tax and R. P. W. Duin. Presented by Mihajlo Grbovic.

INTRODUCTION
The problem of data description, or one-class classification, is to make a description of a training set of objects and to detect which (new) objects resemble this training set.
Data description can be used for:
1. Outlier detection – detecting uncharacteristic objects in a data set. Outliers often show an exceptionally large or small feature value compared with the other training objects. For many machine learning techniques it is useful to detect and reject such outliers first, to avoid unfoundedly confident classifications.

INTRODUCTION
2. Classification problems where one of the classes is undersampled. For example, in machine monitoring, measurements of the normal working conditions are cheap and easy to obtain, whereas measurements of the outlier class (a machine with problems) would require damaging the machine in all possible ways – which is anything but cheap.

INTRODUCTION
3. Comparison of two data sets. Suppose a classifier has been trained on some data after a long optimization. When a similar problem arises with new data, the new data can first be compared with the old training set: if the two are comparable, the old classifier can be reused; if not, a new classifier has to be trained.
[Slide figure: a male/female classifier with outputs 0/1 applied to a new, unknown object.]

SOLUTIONS FOR SOLVING DATA DESCRIPTION
Most often the solutions focus on outlier detection.
Simplest solution:
- Generate outlier data around the target set; an ordinary classifier is then trained to distinguish between the target data and the outliers (see the sketch after this slide).
- This method requires near-target objects belonging to the outlier class (if they are not already present, they have to be created).
- It scales poorly to high-dimensional problems.
A Bayesian approach can also be used for detecting outliers:
- Instead of using only the most probable weight configuration of a classifier to compute the output, the output is weighted by the probability that each weight configuration is correct given the data.
- The method can then provide an estimate of the probability of a certain object under the model family; low probabilities indicate a possible outlier.
- The method is computationally expensive.
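
A minimal sketch of this baseline, assuming scikit-learn is available; the uniform bounding-box sampling and the logistic-regression classifier are illustrative choices, not the paper's method.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def outlier_generation_baseline(X_target, n_outliers=1000, margin=0.5, seed=0):
        # Sample artificial outliers uniformly from a box slightly larger than the
        # bounding box of the target data, then train an ordinary two-class classifier
        # to separate targets from artificial outliers. Note that the number of samples
        # needed to cover the box grows quickly with the dimensionality.
        rng = np.random.default_rng(seed)
        lo, hi = X_target.min(axis=0) - margin, X_target.max(axis=0) + margin
        X_out = rng.uniform(lo, hi, size=(n_outliers, X_target.shape[1]))
        X = np.vstack([X_target, X_out])
        y = np.hstack([np.ones(len(X_target)), np.zeros(n_outliers)])
        return LogisticRegression(max_iter=1000).fit(X, y)

    # Usage: clf = outlier_generation_baseline(X_train); clf.predict(X_new) == 1 means "accepted as target".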

SOLUTIONS FOR SOLVING DATA DESCRIPTION
Our solution: one-class classifiers.
- One class is the target class, and all other data are outlier data.
- Create a spherically shaped boundary around the complete target set.
- To minimize the chance of accepting outliers, the volume of this description is minimized.
- Outlier sensitivity can be controlled by changing the ball-shaped boundary into a more flexible boundary.
- Example outliers can be included in the training procedure to find a more efficient description.

METHOD
We assume the vectors x are column vectors. We have a training set {xi}, i = 1, ..., N, for which we want to obtain a description. We further assume that the data shows variance in all feature directions.
NORMAL DATA DESCRIPTION
- The sphere is characterized by a center a and a radius R > 0.
- We minimize the volume of the sphere by minimizing R², and demand that the sphere contains all training objects xi.
- To allow for the possibility of outliers in the training set, the squared distance from xi to the center a is not required to be strictly smaller than R², but larger distances are penalized through slack variables ξi.
- Minimization problem: F(R, a) = R² + C ∑i ξi, subject to ||xi − a||² ≤ R² + ξi and ξi ≥ 0 for all i.

METHOD
NORMAL DATA DESCRIPTION
Lagrangian: L(R, a, αi, γi, ξi) = R² + C∑ξi − ∑αi {R² + ξi − (||xi||² − 2 a·xi + ||a||²)} − ∑γi ξi
L should be minimized with respect to R, a, ξi and maximized with respect to αi and γi (with αi ≥ 0, γi ≥ 0). Setting the partial derivatives to zero gives:
∂L/∂R = 0: ∑αi = 1
∂L/∂a = 0: a = ∑αi xi
∂L/∂ξi = 0: C − αi − γi = 0, which together with αi ≥ 0, γi ≥ 0 gives 0 ≤ αi ≤ C
Resubstituting these into L gives the dual problem: maximize L = ∑i αi (xi·xi) − ∑i,j αi αj (xi·xj), subject to 0 ≤ αi ≤ C and ∑αi = 1.
Objects with αi > 0 are the support vectors of the description: objects with αi = 0 lie inside the sphere, objects with 0 < αi < C lie on the boundary, and objects with αi = C fall outside the sphere.
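
A minimal numerical sketch of this dual problem (a hypothetical helper, not the authors' implementation), assuming NumPy and SciPy; a general-purpose SLSQP solver stands in for a dedicated QP solver.

    import numpy as np
    from scipy.optimize import minimize

    def svdd_fit(X, C=1.0):
        # Solve the linear-kernel SVDD dual: maximize sum_i a_i (xi.xi) - sum_ij a_i a_j (xi.xj)
        # subject to sum_i a_i = 1 and 0 <= a_i <= C.
        K = X @ X.T                       # inner products (xi . xj)
        diag = np.diag(K)                 # (xi . xi)
        n = len(X)

        def neg_dual(alpha):              # negate because we maximize the dual
            return -(alpha @ diag - alpha @ K @ alpha)

        res = minimize(neg_dual, np.full(n, 1.0 / n), method="SLSQP",
                       bounds=[(0.0, C)] * n,
                       constraints={"type": "eq", "fun": lambda a: a.sum() - 1.0})
        alpha = res.x
        center = alpha @ X                # a = sum_i alpha_i xi
        on_boundary = (alpha > 1e-6) & (alpha < C - 1e-6)   # unbounded support vectors
        R2 = np.mean(np.sum((X[on_boundary] - center) ** 2, axis=1))  # assumes at least one unbounded SV
        return center, R2, alpha

    # Usage: center, R2, alpha = svdd_fit(X_train); accept z if np.sum((z - center)**2) <= R2.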

METHOD
SVDD with negative examples
- When negative examples (objects which should be rejected) are available, they can be incorporated in the training to improve the description.
- In contrast with the training (target) examples, which should be within the sphere, the negative examples should be outside it.
- Minimization problem (target objects indexed by i, negative examples by l):
F(R, a) = R² + C1 ∑i ξi + C2 ∑l ξl
with constraints ||xi − a||² ≤ R² + ξi, ||xl − a||² ≥ R² − ξl, ξi ≥ 0, ξl ≥ 0.
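
A hedged sketch of how the same dual solver could handle negative examples: with labels yi = +1 for targets and yl = −1 for outliers, signed coefficients αi' = yi αi turn the problem back into the original dual form, so only the box bounds change. The helper below reuses the machinery of the hypothetical svdd_fit above under that assumption.

    import numpy as np
    from scipy.optimize import minimize

    def svdd_fit_with_negatives(X, y, C1=1.0, C2=1.0):
        # y: array with +1 for target objects and -1 for negative (outlier) examples.
        # Solve over signed coefficients a'_i = y_i * alpha_i:
        # bounds are [0, C1] for targets and [-C2, 0] for outliers, while the
        # equality constraint sum_i a'_i = 1 and the objective keep their form.
        K = X @ X.T
        diag = np.diag(K)
        n = len(X)

        def neg_dual(a):
            return -(a @ diag - a @ K @ a)

        bounds = [(0.0, C1) if yi > 0 else (-C2, 0.0) for yi in y]
        a0 = np.where(y > 0, 1.0 / np.sum(y > 0), 0.0)   # feasible start: uniform over targets
        res = minimize(neg_dual, a0, method="SLSQP", bounds=bounds,
                       constraints={"type": "eq", "fun": lambda a: a.sum() - 1.0})
        signed_alpha = res.x
        center = signed_alpha @ X                        # a = sum_i a'_i xi
        return center, signed_alpha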

METHOD
[Slide figure, two panels:]
On the left: the normal data description with no outliers; the circled objects are support vectors, and three objects are required to describe the data set.
On the right: the same data set with one outlier object. A new description has to be computed to reject this outlier. With a minimal adjustment to the old description, the outlier is placed on the boundary of the description and becomes a support vector for the outlier class.
Although the description is adjusted to reject the outlier object, it does not fit tightly around the rest of the target set. A more flexible description is required.

METHOD
Introducing kernel functions
- The inner products (xi · xj) in the dual are replaced by a kernel function K(xi, xj) = (Φ(xi) · Φ(xj)), which implicitly defines a mapping Φ of the data into another (possibly high-dimensional) feature space.
- An ideal kernel function would map the target data onto a bounded, spherically shaped area in the feature space and the outlier objects outside this area.
1. The polynomial kernel: K(xi, xj) = (xi · xj)^d, where d is the degree. For degree d = 6 the description is a sixth-order polynomial. Here the training objects most remote from the origin become support objects.
Problem: large regions in the input space without target objects will be accepted by the description. (A kernelized version of the data description is sketched below.)
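
A minimal kernelized sketch under the same assumptions as the earlier svdd_fit helper: any kernel function can be plugged in, and a test object z is accepted when its kernel-space distance to the center is at most R², i.e. K(z, z) − 2 ∑i αi K(z, xi) + ∑i,j αi αj K(xi, xj) ≤ R².

    import numpy as np
    from scipy.optimize import minimize

    def poly_kernel(A, B, d=6):
        return (A @ B.T) ** d              # polynomial kernel (xi . xj)^d

    def svdd_fit_kernel(X, kernel, C=1.0):
        K = kernel(X, X)
        diag = np.diag(K)
        n = len(X)
        res = minimize(lambda a: -(a @ diag - a @ K @ a), np.full(n, 1.0 / n),
                       method="SLSQP", bounds=[(0.0, C)] * n,
                       constraints={"type": "eq", "fun": lambda a: a.sum() - 1.0})
        alpha = res.x
        const = alpha @ K @ alpha          # sum_ij alpha_i alpha_j K(xi, xj)
        # R^2 from an unbounded support vector xk (assumes one exists):
        k = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
        R2 = K[k, k] - 2 * alpha @ K[:, k] + const

        def accept(Z):                     # True where a test object falls inside the description
            Kzz = np.diag(kernel(Z, Z))
            Kzx = kernel(Z, X)
            dist2 = Kzz - 2 * Kzx @ alpha + const
            return dist2 <= R2
        return accept

    # Usage: accept = svdd_fit_kernel(X_train, poly_kernel); mask = accept(X_test)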

METHOD
2. The Gaussian kernel: K(xi, xj) = exp(−||xi − xj||² / s²).
A test object z is accepted when its kernel-space distance to the center is at most R², as before.
- For small values of the width s all objects become support vectors.
- For very large s the solution approximates the original spherically shaped solution.
- Decreasing the parameter C constrains the values of αi more, and more objects become support vectors.
- Also, with decreasing C the error on the target class increases, but the covered volume of the data description decreases.
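
As a practical aside (my addition, not in the slides): for the Gaussian kernel, the one-class ν-SVM formulation of Schölkopf et al. gives a closely related boundary, so an off-the-shelf implementation such as scikit-learn's OneClassSVM with an RBF kernel can serve as a stand-in; the gamma and nu values below are arbitrary examples.

    import numpy as np
    from sklearn.svm import OneClassSVM

    # The RBF kernel exp(-gamma * ||xi - xj||^2) matches the Gaussian kernel with gamma = 1 / s^2.
    X_train = np.random.randn(200, 2)                      # stand-in target data
    ocsvm = OneClassSVM(kernel="rbf", gamma=1.0 / 2.0**2, nu=0.05).fit(X_train)
    accepted = ocsvm.predict(X_train) == 1                 # +1 = inside the description, -1 = outlier
    print("fraction of training targets accepted:", accepted.mean())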

METHOD
- One-class classifiers that only use information from the target set perform worse than, but in some cases are still comparable to, classifiers that use information about both the target and the outlier data.
- In most cases the data descriptions with the polynomial kernel perform worse than those with the Gaussian kernel, except for a few cases.
SVDD characteristics
- The minimum number of support vectors is an indication of the target error that can minimally be achieved (this leads to the notion of essential support vectors).
- Leave-one-out error estimate on the target set: the fraction of support vectors, E ≈ #SVs / N.
- With increasing dimensionality the volume of the outlier block tends to grow faster than the volume of the target class; the overlap between target and outlier data decreases and the classification problem becomes easier, but more data are needed to estimate the boundary.
- The parameter s can be set to give a desired number of support vectors. (A simple search for such an s is sketched below.)
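
A hedged sketch of that last point: a simple grid search over the Gaussian width s that picks the value whose support-vector count is closest to a requested number (the grid and tolerance are arbitrary choices, not the paper's procedure).

    import numpy as np
    from scipy.optimize import minimize

    def count_support_vectors(X, s, C=1.0, tol=1e-6):
        # Solve the Gaussian-kernel SVDD dual and count objects with alpha_i > 0.
        K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1) / s**2)
        n = len(X)
        res = minimize(lambda a: -(a @ np.diag(K) - a @ K @ a), np.full(n, 1.0 / n),
                       method="SLSQP", bounds=[(0.0, C)] * n,
                       constraints={"type": "eq", "fun": lambda a: a.sum() - 1.0})
        return int(np.sum(res.x > tol))

    def choose_width(X, target_n_sv, grid=np.logspace(-1, 2, 20)):
        # Pick the width s whose support-vector count is closest to the requested number.
        counts = [count_support_vectors(X, s) for s in grid]
        return grid[int(np.argmin([abs(c - target_n_sv) for c in counts]))]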

EXPERIMENTS
- How the SVDD works in a real one-class classification problem, compared to other methods: a normal (Gaussian) density, a Parzen density estimator, a mixture of Gaussians, and kNN.
- Machine diagnostics problem: characterization of a submersible water pump.
- Target objects: measurements on a normally operating pump.
- Outlier objects (negative examples): measurements on a damaged pump.

EXPERIMENTS
- Good discrimination between target and outlier objects means both a small fraction of outliers accepted and a large fraction of target objects accepted.
- Data set 1: the whole working area of the pump; 64-dimensional features, reduced to 15 dimensions by PCA.
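
A small illustrative snippet (my addition) for computing those two acceptance fractions, assuming a boolean accept(.) predicate such as the one returned by the kernelized sketch above:

    import numpy as np

    def acceptance_fractions(accept, X_target, X_outlier):
        # Fraction of target objects accepted (should be high) and
        # fraction of outlier objects accepted (should be low).
        target_accept = float(np.mean(accept(X_target)))
        outlier_accept = float(np.mean(accept(X_outlier)))
        return target_accept, outlier_accept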

EXPERIMENTS
- Data sets 2 to 5 are approximations of the first one.
- In almost all cases the SVDD obtains better performance.
- The other methods improve when the dimensionality is reduced, which is not useful in practice.

CONCLUSION AND COMMENTS
CONCLUSION
It is possible to solve the multidimensional outlier detection problem by obtaining a boundary around the data. Inspired by support vector machines, this boundary can be described by a few support vectors.
STRONG POINTS
- Shows comparable or better results on sparse and complex data sets.
- Needs less training data than other methods.
WEAK POINTS
- Sets of outliers can be "good" or "poor", in which case manual tuning of C1 and C2 is required.
- Selection of s and C can sometimes be hard.
- For large sample sizes, density estimation methods are preferred.