Nycomed Chair for Bioinformatics and Information Mining
Kernel Methods for Classification – From Theory to Practice
14 September 2009
Iris Adä, Michael Berthold, Ulrik Brandes, Martin Mader, Uwe Nagel

#2 Goals of the Tutorial
By lunch time on Tuesday, you will
– have learned about linear classifiers and SVMs
– have improved a kernel-based classifier
– know what Finnish looks like
– have a hunch what a kernel is
– have had a chance at winning a trophy.

#3 Outline – Monday (13:15 – 23:30)
The Theory:
– Motivation: Learning Classifiers from Data
– Linear Classifiers
  Delta Learning Rule
– Kernel Methods & Support Vector Machines
  Dual Representation
  Maximal Margin
  Kernels
The Environment:
– KNIME: A Short Intro
Practical Stuff:
– How to develop nodes in KNIME
– Install on your laptop(s)
You work, we rest:
– Invent a new (and better) Kernel
Dinner (Invent an even better Kernel…)

#4 Outline – Tuesday (9:00 – 12:00)
~9:00–11:00: refine your kernel
11:00: score test data set
11:13: winning kernel(s) presented
12:00: Lunch and Award “Ceremony”

#5 Learning Models
Assumption: no major influence of non-observed inputs.
[Figure: observed inputs and other inputs feed into the System, which produces observed outputs; the observed inputs and outputs form the Data from which the Model is learned.]

#6 Predicting Outcomes
Assumption: static system.
[Figure: new inputs are fed into the learned Model, which produces predicted outputs.]

#7 Learning Classifiers from Data
Training data consist of inputs with labels, e.g.
– credit card transactions (fraud: yes/no)
– hand-written letters (“A”, …, “Z”)
– drug candidate classification (toxic, non-toxic)
– …
Multi-class classification problems can be reduced to binary yes/no classifications.
Many, many algorithms around. Why?
– The choice of algorithm influences generalization capability.
– There is no single best algorithm for all classification problems.

#8 Linear Discriminant
Simple linear, binary classifier:
– Class A if f(x) is positive
– Class B if f(x) is negative
e.g. f(x) = ⟨w, x⟩ + b is the decision function.

#9 Linear Discriminant
Linear discriminants represent hyperplanes in feature space.

#10 Primal Perceptron
Rosenblatt (1958) introduced a simple learning algorithm for linear discriminants (the “perceptron”).

#11 Rosenblatt Algorithm
The algorithm is
– on-line (pattern-by-pattern approach)
– mistake driven (updates only in case of a wrong classification).
Convergence is guaranteed if a hyperplane exists which classifies all training data correctly (i.e. the data are linearly separable).
Learning rule (on a misclassified example (x_i, y_i)): w ← w + η y_i x_i,  b ← b + η y_i
One observation:
– The weight vector (if initialized properly) is simply a weighted sum of the input vectors (b is even more trivial).
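A minimal sketch of the primal perceptron update described above (NumPy-based; the function name, the learning rate eta, and the epoch limit are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, epochs=100):
    """Primal perceptron: X is (n_samples, n_features), y holds labels in {-1, +1}."""
    X, y = np.asarray(X), np.asarray(y)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            # mistake driven: update only if xi is (still) misclassified
            if yi * (np.dot(w, xi) + b) <= 0:
                w += eta * yi * xi
                b += eta * yi
                mistakes += 1
        if mistakes == 0:   # converged: all training data classified correctly
            break
    return w, b

def predict(w, b, X):
    return np.sign(np.asarray(X) @ w + b)
```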

#12 Dual Representation
The weight vector is a weighted sum of the input vectors: w = Σ_i α_i y_i x_i
– “difficult” training patterns have a larger α_i
– “easier” ones have a smaller or zero α_i.

#13 Dual Representation
Dual representation of the linear discriminant function: f(x) = Σ_i α_i y_i ⟨x_i, x⟩ + b

#14 Dual Representation
Dual representation of the learning algorithm: whenever example (x_j, y_j) is misclassified, i.e. y_j (Σ_i α_i y_i ⟨x_i, x_j⟩ + b) ≤ 0, increase its coefficient α_j (and update b accordingly).

#15 Dual Representation – Learning Rule
– Harder-to-learn examples have a larger α (higher information content).
– The information about the training examples enters the algorithm only through the inner products ⟨x_i, x_j⟩ (which we could pre-compute!).
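A minimal sketch of the dual perceptron update, operating only on a pre-computed matrix G of pairwise inner products; the function name and parameters are illustrative assumptions:

```python
import numpy as np

def train_dual_perceptron(G, y, epochs=100):
    """Dual perceptron: G[i, j] holds <x_i, x_j> (or K(x_i, x_j)); y holds labels in {-1, +1}."""
    y = np.asarray(y)
    n = len(y)
    alpha = np.zeros(n)
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for j in range(n):
            # the decision value of example j uses only inner products with the training set
            f_j = np.sum(alpha * y * G[:, j]) + b
            if y[j] * f_j <= 0:
                alpha[j] += 1.0    # "difficult" patterns accumulate a larger alpha
                b += y[j]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b
```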

#16 Dual Representation in Other Spaces
All we need for training:
– computation of the inner products of all training examples.
If we train in a different space:
– computation of the inner products in the projected space.

#17 Kernel Functions
A kernel K(x, y) = ⟨Φ(x), Φ(y)⟩ allows us to compute the inner product of two points x and y in the projected space without ever entering that space…

#18 …in Kernel Land…
The discriminant function in our projected space: f(x) = Σ_i α_i y_i ⟨Φ(x_i), Φ(x)⟩ + b
And, using a kernel: f(x) = Σ_i α_i y_i K(x_i, x) + b

#19 The Gram Matrix
All data necessary for
– the decision function
– the training of the coefficients
can be pre-computed using a Kernel or Gram matrix: G_ij = K(x_i, x_j).
(If the Gram matrix is symmetric and positive semi-definite for every choice of training points, then K(·,·) is a kernel.)
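A small sketch of pre-computing a Gram matrix for an arbitrary kernel and checking the symmetry/positive-semi-definiteness condition numerically (the function names and the eigenvalue tolerance are illustrative assumptions):

```python
import numpy as np

def gram_matrix(X, kernel):
    """Pre-compute G[i, j] = kernel(x_i, x_j) for all pairs of training examples."""
    n = len(X)
    G = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            G[i, j] = G[j, i] = kernel(X[i], X[j])   # kernels are symmetric
    return G

def looks_like_valid_kernel(G, tol=1e-10):
    """Numerical check: symmetric, and no significantly negative eigenvalues."""
    return np.allclose(G, G.T) and np.linalg.eigvalsh(G).min() >= -tol

# usage example with a linear kernel
X = np.random.randn(20, 3)
G = gram_matrix(X, kernel=lambda a, b: np.dot(a, b))
print(looks_like_valid_kernel(G))   # True for the linear kernel
```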

#20 Kernels
A simple kernel is K(x, z) = ⟨x, z⟩², and the corresponding projected space (for x ∈ ℝ²) is Φ(x) = (x₁², x₂², √2·x₁x₂).
Why?
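The “why” can be answered by expanding the square (assuming the standard degree-2 example above, with x, z ∈ ℝ²):

```latex
\langle x, z \rangle^2
  = (x_1 z_1 + x_2 z_2)^2
  = x_1^2 z_1^2 + x_2^2 z_2^2 + 2\, x_1 x_2\, z_1 z_2
  = \left\langle (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2),\ (z_1^2,\; z_2^2,\; \sqrt{2}\, z_1 z_2) \right\rangle
  = \langle \Phi(x), \Phi(z) \rangle
```

So evaluating K in input space yields exactly the inner product in the projected space.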

#21 Kernels
A few (slightly less) simple kernels are K(x, z) = ⟨x, z⟩^d for higher degrees d, and the corresponding projected spaces (all monomials of degree d over n input dimensions) are of dimension C(n+d−1, d).
…computing the inner products explicitly in the projected space becomes pretty expensive rather quickly…

#22 Kernels
Gaussian kernel: K(x, z) = exp(−‖x − z‖² / (2σ²))
Polynomial kernel of degree d: K(x, z) = (⟨x, z⟩ + c)^d
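Minimal NumPy sketches of the two kernels above (the parameter names sigma, c, and d and their defaults are illustrative, not values from the tutorial):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - z||^2 / (2 * sigma^2))."""
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def polynomial_kernel(x, z, d=3, c=1.0):
    """Polynomial kernel of degree d: (<x, z> + c)^d."""
    return (np.dot(x, z) + c) ** d
```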

#23 Why?
Great: we can also apply Rosenblatt’s algorithm to other spaces implicitly. So what?

#24 Transformations…

#25 Polynomial Kernel

#26 Gaussian Kernel

#27 Kernels
Note that we do not need to know the projection Φ; it is sufficient to prove that K(·,·) is a kernel.
A few notes:
– Kernels are modular and closed: we can compose new kernels from existing ones (as sketched below).
– Kernels can be defined over non-numerical objects:
  text (e.g. string-matching kernels), images, trees, graphs, …
Note also: a good kernel is crucial.
– If the Gram matrix is (nearly) diagonal, classification is easy but useless.
– No free kernel: too many irrelevant attributes drive the Gram matrix towards diagonal.
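A sketch of the closure properties mentioned above: sums, products, and positive scalings of kernels are again kernels. The helper functions are hypothetical and build on the gaussian_kernel and polynomial_kernel sketches from slide #22:

```python
def sum_kernel(k1, k2):
    """The sum of two kernels is again a kernel."""
    return lambda x, z: k1(x, z) + k2(x, z)

def product_kernel(k1, k2):
    """The product of two kernels is again a kernel."""
    return lambda x, z: k1(x, z) * k2(x, z)

def scaled_kernel(k, a):
    """A kernel multiplied by a positive constant a is again a kernel."""
    return lambda x, z: a * k(x, z)

# usage example: compose a new kernel from the earlier sketches
combined = sum_kernel(lambda x, z: gaussian_kernel(x, z, sigma=0.5),
                      lambda x, z: polynomial_kernel(x, z, d=2))
```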

#28 Finding Linear Discriminants
Finding a separating hyperplane (in any space) still leaves lots of room for variation – does it?
We can define “margins” of individual training examples: γ_i = y_i (⟨w, x_i⟩ + b)
(appropriately normalized by ‖w‖, this is a “geometrical” margin).
The margin of a hyperplane (with respect to a training set) is the minimum margin over all training examples.
And the maximum margin hyperplane is the one that maximizes this margin over all hyperplanes.
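A small sketch computing the functional margins and the geometric margin of a given hyperplane (w, b); the function and variable names are illustrative:

```python
import numpy as np

def margins(w, b, X, y):
    """Functional margins y_i * (<w, x_i> + b) and the geometric margin of the hyperplane."""
    X, y, w = np.asarray(X), np.asarray(y), np.asarray(w)
    functional = y * (X @ w + b)
    geometric = functional / np.linalg.norm(w)   # normalize by ||w||
    return functional, geometric.min()           # hyperplane margin = minimum over the training set
```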

#29 (Maximum) Margin of a Hyperplane

#30 Support Vector Machines
Dual representation
– The classifier is a weighted sum over the inner products between the training patterns (or only the support vectors) and the new pattern.
– Training works analogously.
Kernel-induced feature space
– Transformation into a higher-dimensional space (where we will hopefully be able to find a linear separating hyperplane).
– Representation of the solution through a few support vectors (α > 0).
Maximum margin classifier
– Reduction of capacity (bias) via maximization of the margin (and not via reduction of the degrees of freedom).
– Efficient parameter estimation: see the IDA book.

#31 Soft and Hard Margin Classifiers
What can we do if no linear separating hyperplane exists?
Instead of insisting on a hard margin, allow minor violations:
– Introduce (positive) slack variables ξ_i (patterns with slack are allowed to lie inside the margin).
– Misclassifications are allowed when the slack grows large enough (ξ_i > 1).
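For illustration, a hedged sketch of a soft-margin SVM with a Gaussian kernel using scikit-learn (an assumption for the example; the tutorial itself works in KNIME, and the parameter values are illustrative). The parameter C trades off margin width against the total slack:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# toy data that is not linearly separable in input space
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# soft-margin SVM with a Gaussian (RBF) kernel; a smaller C allows more slack
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)

print("number of support vectors:", len(clf.support_vectors_))
print("training accuracy:", clf.score(X, y))
```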

#32 Kernel Methods: Summary
Main idea of kernel methods:
– Embed the data into a suitable vector space.
– Find a linear classifier (or other linear pattern of interest) in the new space.
Needed: a mapping Φ (implicit or explicit).
Key assumptions:
– Information about relative position is often all that is needed by learning methods.
– The inner products between points in the projected space can be computed in the original space using special functions (kernels).

#33 Support Vector Machines
– Powerful classifier
– Computation of the optimal classifier is possible
– Choice of kernel is critical

#34 KNIME
Coffee break. And then:
– KNIME, the Konstanz Information Miner
– SVMs (and other classifiers) in KNIME