Support Feature Machines: Support Vectors are not enough
Tomasz Maszczyk and Włodzisław Duch
Department of Informatics, Nicolaus Copernicus University, Toruń, Poland
WCCI 2010

Plan
Main idea
SFM vs SVM
Description of our approach
Types of new features
Results
Conclusions

Main idea I
SVM is based on linear discrimination (LD) and margin maximization.
Cover theorem: an extended feature space gives better separability of data and flat decision borders.
Kernel methods implicitly create new features localized around support vectors (for localized kernels), based on similarity.
Instead of the original input space, SVM works in the "kernel space" without explicitly constructing it.

Main idea II
SVM does not work well when there is a complex logical structure in the data (e.g. the parity problem; see the toy sketch below).
Each SV may provide a useful feature.
Additional features may be generated by: random linear projections; ICA or PCA derived from the data; various projection pursuit algorithms (QPC).
Define an appropriate feature space => optimal solution.
To be the best, learn from the rest (transfer learning from other models): prototypes, linear combinations, fragments of branches in decision trees, etc.
The final classification model in the enhanced space may not be so important if an appropriate space is defined.
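The following minimal sketch (not from the paper; the toy data and feature choices are illustrative assumptions) shows the parity difficulty mentioned above: a linear model struggles on raw 4-bit parity inputs, but a single projection on the diagonal direction plus cluster-indicator features makes the data linearly separable.

```python
# Toy sketch (not the authors' code): parity is hard for a linear margin
# model on raw inputs, but easy after one projection + cluster indicators.
import numpy as np
from itertools import product
from sklearn.svm import LinearSVC

# 4-bit parity: label = XOR of the bits (16 points).
X = np.array(list(product([0, 1], repeat=4)), dtype=float)
y = X.sum(axis=1).astype(int) % 2

raw = LinearSVC(max_iter=10000).fit(X, y)
print("raw inputs:", raw.score(X, y))        # far below 1.0 - not separable

# Project on w = (1,1,1,1); clusters sit at z = 0,1,2,3,4 and alternate in
# class, so indicator features for those clusters separate the data linearly.
z = X @ np.ones(4)
H = np.column_stack([(z == v).astype(float) for v in range(5)])
enhanced = LinearSVC(max_iter=10000).fit(H, y)
print("cluster indicator features:", enhanced.score(H, y))   # now separable
```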

SFM vs SVM
SFM generalizes the SVM approach by explicitly building the feature space: enhance your input space by adding kernel features z_i(X) = K(X; SV_i) plus any other useful types of features.
SFM advantages compared to SVM:
LD on an explicit representation of features = easy interpretation.
Kernel-based SVM ≡ linear SVM (SVML) in the explicitly constructed kernel space (see the sketch below).
Extending the input + kernel space => improvement.
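To make the equivalence concrete, here is a small, hedged sketch (not the authors' code; the toy dataset and β value are arbitrary assumptions): kernel evaluations K(x, x_i) are materialized as explicit feature columns, and a linear SVM trained on them plays the role of the RBF-kernel SVM.

```python
# Sketch: kernel SVM vs. a linear model (SVML) on explicit kernel features.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC, LinearSVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
beta = 1.0

implicit = SVC(kernel="rbf", gamma=beta).fit(X, y)   # kernel space stays implicit

K = rbf_kernel(X, X, gamma=beta)                     # column i is z_i(x) = K(x, x_i)
explicit = LinearSVC(max_iter=10000).fit(K, y)       # linear discrimination in SF space

print("implicit kernel SVM:", implicit.score(X, y))
print("linear model on kernel features:", explicit.score(K, y))
```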

SFM vs SVM
How to extend the feature space, creating the SF space?
Use various kernels with various parameters (sketch below).
Use global features obtained from various projections.
Use local features to handle exceptions.
Use feature selection to define the optimal support feature space.
Many algorithms may be used in the SF space to generate the final solution. In the current version three types of features are used.
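As a hedged sketch of the first point, kernel feature blocks computed with several Gaussian widths can simply be stacked, letting the later linear model and feature selection pick the resolution that fits each region; the β values and function name below are illustrative assumptions.

```python
# Sketch: stack Gaussian kernel features for several widths (multiresolution).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def multi_beta_kernel_features(X_ref, X, betas=(0.1, 1.0, 10.0)):
    """Columns K(x, x_ref) for every reference vector and every beta."""
    return np.hstack([rbf_kernel(X, X_ref, gamma=b) for b in betas])

# Usage: F_train = multi_beta_kernel_features(X_train, X_train)
#        F_test  = multi_beta_kernel_features(X_train, X_test)
```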

SFM feature types
1. Projections on N randomly generated directions in the original input space (Cover theorem).
2. Restricted random projections (aRPM): a projection on a random direction z_i(x) = w_i·x may be useful only in some range of z_i values, i.e. if large pure clusters are found in some intervals [a,b]; this creates binary features h_i(x) ∈ {0,1}; QPC is used to optimize w_i and improve cluster sizes.
3. Kernel-based features: here only Gaussian kernels with the same β for each SV, k_i(x) = exp(-β ||x - x_i||^2).
The number of features grows with the number of training vectors; reduce the SF space using simple filters (MI).

Algorithm
Fix the values of the α, β and η parameters
for i = 0 to N do
    Randomly generate a new direction w_i ∈ [0,1]^n
    Project all x on this direction: z_i = w_i·x (features z)
    Analyze the p(z_i|C) distributions to determine if there are pure clusters
    if the number of vectors in cluster H_j(z_i;C) exceeds η then
        Accept the new binary feature h_ij
    end if
end for
Create kernel features k_i(x), i = 1..m
Rank all original and additional features f_i using Mutual Information
Remove features for which MI(k_i, C) ≤ α
Build a linear model on the enhanced feature space
Classify test data mapped into the enhanced space
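A rough Python sketch of the loop above follows (assuming binary labels and a simplified purity test; the interval search, thresholds and helper names are placeholders, not the authors' exact procedure):

```python
# Simplified sketch of SFM feature construction: restricted random
# projections (binary cluster features h), Gaussian kernel features k,
# MI-based filtering, then a linear model on the enhanced space.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import LinearSVC

def sfm_enhanced_space(X, y, N=50, beta=1.0, eta=10, alpha=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    H = []
    for _ in range(N):
        w = rng.random(d)                     # random direction w_i in [0,1]^n
        z = X @ w                             # projection z_i = w_i . x
        order = np.argsort(z)
        start = 0
        for k in range(1, n + 1):             # scan class-pure runs along z
            if k == n or y[order[k]] != y[order[start]]:
                if k - start >= eta:          # pure cluster large enough
                    a, b = z[order[start]], z[order[k - 1]]
                    H.append(((z >= a) & (z <= b)).astype(float))
                start = k
    H = np.column_stack(H) if H else np.empty((n, 0))
    K = rbf_kernel(X, X, gamma=beta)          # kernel features k_i(x)
    F = np.hstack([X, H, K])                  # enhanced (support feature) space
    mi = mutual_info_classif(F, y, random_state=seed)
    return F[:, mi > alpha]                   # drop features with MI <= alpha

# Usage (training data only; a full version would store w, [a,b] and the
# selected columns so that test data can be mapped into the same space):
# F = sfm_enhanced_space(X_train, y_train)
# clf = LinearSVC(max_iter=10000).fit(F, y_train)
```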

SFM - summary
In essence, the SFM algorithm constructs a new feature space, followed by a simple linear model or any other learning model.
More attention is paid to the generation of features than to sophisticated optimization algorithms or new classification methods.
Several parameters may be used to control the process of feature creation and selection, but here they are fixed or set in an automatic way.
New features created in this way are based on those transformations of the inputs that have been found interesting for some task, and thus have a meaningful interpretation.
SFM solutions are highly accurate and easy to understand.

Features description
X - original features
K - kernel features (Gaussian local kernels)
Z - unrestricted linear projections
H - restricted (clustered) projections
15 feature spaces based on combinations of these different types of features may be constructed: X, K, Z, H, K+Z, K+H, Z+H, K+Z+H, X+K, X+Z, X+H, X+K+Z, X+K+H, X+Z+H, X+K+Z+H (see the sketch below). Here only partial results are presented (big table).
The final vector X is thus composed of features X = [x_1..x_n, z_1.., h_1.., k_1..].
In the SF space linear discrimination is used (SVML), although other methods may find a better solution.
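As an illustrative aside (with placeholder arrays standing in for the four blocks, not the paper's data), the 15 feature-space combinations listed above can be enumerated mechanically:

```python
# Sketch: enumerate the 15 non-empty combinations of the X, K, Z, H blocks.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
n = 100
blocks = {"X": rng.normal(size=(n, 8)),                    # original features
          "K": rng.normal(size=(n, n)),                    # kernel features
          "Z": rng.normal(size=(n, 20)),                   # unrestricted projections
          "H": rng.integers(0, 2, (n, 15)).astype(float)}  # restricted projections

spaces = {"+".join(c): np.hstack([blocks[b] for b in c])
          for r in range(1, 5) for c in combinations(blocks, r)}
print(sorted(spaces))   # X, K, Z, H, ..., X+K+Z+H (15 spaces)
```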

Datasets

Results (SVM vs SFM in the kernel space only)

Results ( SFM in extended spaces)

Results (kNN in extended spaces)

Results (SSV in extended spaces)

Conclusions
SFM is focused on the generation of new features, rather than on optimization and improvement of classifiers.
SFM may be seen as a mixture of experts; each expert is a simple model based on a single feature: a projection, a localized projection, an optimized projection, or a kernel feature.
For different data, different types of features may be important => there is no universal set of features, but they are easy to test and select.

Conclusions
Kernel-based SVM is equivalent to the use of kernel features combined with LD.
Mixing different kernels and different types of features gives a better feature space than a single-kernel solution.
Complex data require decision borders of varying complexity; SFM offers multiresolution (e.g. different dispersions for every SV).
Kernel-based learning implicitly projects data into a high-dimensional space, creating flat decision borders there and facilitating separability.

Conclusions
Learning is simplified by changing the goal of learning to an easier target and handling the remaining nonlinearities with a well-defined structure.
Instead of hiding information in kernels and sophisticated optimization techniques, features based on kernels and projection techniques make this information explicit.
Finding interesting views on the data, or constructing interesting information filters, is very important: the combination of transformation-based systems should bring us significantly closer to practical applications that automatically create the best data models for any data.

Thank You!