Part 4: ADVANCED SVM-based LEARNING METHODS

1 Part 4: ADVANCED SVM-based LEARNING METHODS
Vladimir Cherkassky, University of Minnesota. Presented at Tech Tune Ups, ECE Dept, June 1, 2011.

2 OUTLINE
Motivation for non-standard approaches: high-dimensional data
Alternative Learning Settings:
- Transduction and SSL
- Inference Through Contradictions
- Learning Using Privileged Information (SVM+)
- Multi-Task Learning
Summary

3 Insights provided by SVM (VC-theory)
Why can linear classifiers generalize?
(1) the margin is large (relative to R)
(2) the fraction of support vectors is small
(3) the ratio d/n is small
SVM offers an effective way to control complexity (via margin + kernel selection), i.e., implementing (1) or (2) or both.
What happens when d >> n? Standard inductive methods usually fail.

4 How to improve generalization for HDLSS?
Conventional approach: incorporate a priori knowledge into the learning method
- preprocessing and feature selection
- model parameterization (~ good kernels in SVM)
Assumption: a priori knowledge about a good model
Non-standard learning formulations: incorporate a priori knowledge into a new, non-standard learning formulation (learning setting)
Assumption: a priori knowledge is about properties of the application data and/or the goal of learning
Which type of assumption makes more sense?

5 OUTLINE
Motivation for non-standard approaches
Alternative Learning Settings:
- Transduction and SSL
- Inference Through Contradictions
- Learning with Structured Data
- Multi-Task Learning
Summary

6 Examples of non-standard settings
Application domain: handwritten digit recognition
- Standard inductive setting
- Transduction: labeled training + unlabeled data
- Learning through contradictions: labeled training data ~ examples of digits 5 and 8; unlabeled examples (Universum) ~ all other (eight) digits
- Learning using hidden information: training data ~ t groups (i.e., from t different persons); test data ~ group label not known
- Multi-task learning: training data ~ t groups (from different persons); test data ~ t groups (group label is known)

7 Modifications of Inductive Setting
Standard inductive learning assumes:
- a finite training set
- a predictive model derived using only training data
- prediction for all possible test inputs
Possible modifications:
1. Predict only for given test points → transduction
2. A priori knowledge in the form of additional 'typical' samples → learning through contradiction
3. Additional (group) info about training data → learning using privileged information (LUPI), aka SVM+
4. Additional (group) info about training + test data → multi-task learning

8 Transduction (Vapnik, 1982, 1995)
How to incorporate unlabeled test data into the learning process?
Assume binary classification; the task is estimating a function at given points.
Given: labeled training data (x_i, y_i), i = 1, ..., n, and unlabeled test points x*_1, ..., x*_m
Estimate: class labels y*_1, ..., y*_m at these test points
Goal of learning: minimization of risk on the test set:
R = (1/m) Σ_{j=1..m} L(y*_j, f(x*_j)), where L is the 0/1 loss.

9 Induction vs Transduction

10 Transduction based on margin size
Single unlabeled test point X

11 Many test points X, aka working samples

12 Transduction based on margin size
Binary classification, linear parameterization, joint set of (training + working) samples.
Two objectives of transductive learning:
(TL1) separate labeled training data using a large-margin hyperplane (as in standard inductive SVM)
(TL2) separate (explain) the working data set using a large-margin hyperplane

13 Transduction based on margin size
Standard SVM hinge loss for labeled samples: max(0, 1 − y f(x))
Loss function for unlabeled samples: max(0, 1 − |f(x)|), a symmetric hinge
→ Mathematical optimization formulation
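A minimal numpy sketch of these two loss terms (function names are mine, not from the slides); note that the unlabeled-sample loss is non-convex in f:

import numpy as np

def hinge_loss(y, f):
    """Standard SVM hinge loss for labeled samples: max(0, 1 - y*f(x))."""
    return np.maximum(0.0, 1.0 - y * f)

def symmetric_hinge_loss(f):
    """Loss for unlabeled (working) samples: max(0, 1 - |f(x)|).
    Zero only when the sample falls outside the margin, on either
    side of the hyperplane -- the source of non-convexity."""
    return np.maximum(0.0, 1.0 - np.abs(f))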

14 Optimization formulation for SVM transduction
Given: joint set of (training + working) samples
Denote slack variables: ξ_i for training samples, ξ*_j for working samples
Minimize
R(w, b) = (1/2)(w · w) + C Σ_{i=1..n} ξ_i + C* Σ_{j=1..m} ξ*_j
subject to
y_i [(w · x_i) + b] ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., n
y*_j [(w · x*_j) + b] ≥ 1 − ξ*_j, ξ*_j ≥ 0, j = 1, ..., m
where y*_j = sign[(w · x*_j) + b]
→ Solution (~ decision boundary): f(x) = (w · x) + b
Unbalanced situation (small training set / large test set) → all unlabeled samples get assigned to one class. Additional balance constraint:
(1/m) Σ_{j=1..m} [(w · x*_j) + b] = (1/n) Σ_{i=1..n} y_i

15 Optimization formulation (cont’d)
Hyperparameters C and C* control the trade-off between explanation and margin size.
Soft-margin inductive SVM is a special case of soft-margin transduction with zero slacks for working samples.
A dual + kernel version of SVM transduction also exists.
Transductive SVM optimization is not convex (due to the non-convexity of the loss for unlabeled data) → different optimization heuristics yield different solutions; one such heuristic is sketched below.
An exact solution (via exhaustive search over labelings) is possible for a small number of test samples m, but this solution is NOT very useful (~ inductive SVM).
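For illustration, a crude alternating heuristic assuming scikit-learn (a sketch of one possible local search, not a specific published algorithm; names are mine):

import numpy as np
from sklearn.svm import SVC

def tsvm_alternating(X_l, y_l, X_u, C=1.0, max_iter=20):
    """Fit an SVM on labeled + tentatively labeled working samples,
    re-label the working set with the new decision rule, and repeat
    until the working labels stabilize (a local optimum of the
    non-convex transductive objective)."""
    clf = SVC(kernel="linear", C=C).fit(X_l, y_l)
    y_u = clf.predict(X_u)                      # initial working labels
    for _ in range(max_iter):
        clf.fit(np.vstack([X_l, X_u]), np.concatenate([y_l, y_u]))
        y_new = clf.predict(X_u)
        if np.array_equal(y_new, y_u):          # labels stable -> stop
            break
        y_u = y_new
    return clf, y_u

Different initializations or update orders land in different local optima, which is exactly the "different heuristics ~ different solutions" point above.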

16 Many applications for transduction
Text categorization: classify word documents into a number of predetermined categories
Email classification: spam vs non-spam
Web page classification
Image database classification
All these applications share:
- high-dimensional data
- a small labeled training set (human-labeled)
- a large unlabeled test set

17 Example application: prediction of molecular bioactivity for drug discovery
Training data ~ 1,909 samples; test data ~ 634 samples
Input space ~ 139,351-dimensional
Prediction accuracy: SVM induction ~ 74.5%; transduction ~ 82.3%
Ref: J. Weston et al., KDD Cup 2001 data analysis: prediction of molecular bioactivity for drug design – binding to thrombin, Bioinformatics, 2003

18 Semi-Supervised Learning (SSL)
Labeled data + unlabeled data → model
Similar to transduction (but not the same):
- Goal 1 ~ prediction for unlabeled samples
- Goal 2 ~ estimate an inductive model
Many algorithms; applications similar to transduction.
Typically:
- transduction works better for HDLSS data
- SSL works better for low-dimensional data

19 Example: Self-Learning Algorithm
Given an initial labeled set L and an unlabeled set U, repeat:
(1) estimate a classifier using labeled set L
(2) classify a randomly chosen unlabeled sample using the decision rule estimated in step (1)
(3) move this newly labeled sample to set L
Iterate steps (1)–(3) until all unlabeled samples are classified, as in the sketch below.
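A minimal sketch of this loop, assuming scikit-learn's SVC as the base classifier (function and variable names are mine):

import numpy as np
from sklearn.svm import SVC

def self_learning(X_labeled, y_labeled, X_unlabeled, seed=0):
    """Self-learning: steps (1)-(3) above, iterated until U is empty."""
    rng = np.random.default_rng(seed)
    X_l, y_l = X_labeled.copy(), y_labeled.copy()
    X_u = X_unlabeled.copy()
    clf = SVC(kernel="rbf")
    while len(X_u) > 0:
        clf.fit(X_l, y_l)                      # (1) train on labeled set L
        j = rng.integers(len(X_u))             # (2) pick a random unlabeled sample
        y_new = clf.predict(X_u[j:j + 1])      #     and label it with the decision rule
        X_l = np.vstack([X_l, X_u[j:j + 1]])   # (3) move it into L
        y_l = np.concatenate([y_l, y_new])
        X_u = np.delete(X_u, j, axis=0)
    return clf.fit(X_l, y_l)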

20 Example of Self-Learning Algorithm
Noisy Hyperbolas data set, with unlabeled samples shown in green.
[Figure: initial condition of the self-learning algorithm]

21 Example of Self-Learning Algorithm
[Figure: decision boundary at iteration 50 and at iteration 100 (final)]

22 Inference through contradiction (Vapnik 2006)
Motivation: what is a priori knowledge?
- info about the space of admissible models
- info about admissible data samples
Setting: labeled training samples + unlabeled samples from the Universum.
Universum samples encode info about the region of input space where the application data lives:
- usually from a different distribution than the training/test data
Examples of Universum data follow; large improvement for small training samples.

23 Inference through contradictions aka Universum learning

24 Main Idea Handwritten digit recognition: digit 5 vs 8
Fig. courtesy of J. Weston (NEC Labs)

25 Learning with the Universum
Inductive setting for binary classification.
Given: labeled training data and unlabeled Universum samples
Goal of learning: minimization of prediction risk (as in the standard inductive setting)
Balance between two goals:
- explain labeled training data using a large-margin hyperplane
- achieve maximum falsifiability ~ the maximum number of contradictions on the Universum
→ Math optimization formulation (an extension of SVM)

26 ε-insensitive loss for Universum samples
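As a numpy sketch of this loss (the eps value and function name are illustrative assumptions):

import numpy as np

def universum_loss(f, eps=0.1):
    """epsilon-insensitive loss for Universum samples: max(0, |f(x)| - eps).
    A Universum sample contributes no penalty while its decision value
    stays inside the eps-tube around the hyperplane (a 'contradiction');
    samples classified confidently into either class are penalized."""
    return np.maximum(0.0, np.abs(f) - eps)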

27 Random averaging Universum
[Figure: a random-averaging Universum sample is the average of a Class 1 example and a Class -1 example; such averages fall near the separating hyperplane]

28 Random Averaging for digits 5 and 8
[Figure: two randomly selected training examples (a 5 and an 8) and the resulting averaged Universum sample]

29 Application Study (Vapnik, 2006)
Binary classification of handwritten digits 5 and 8. The following Universum sets were used:
U1: randomly selected other digits (0, 1, 2, 3, 4, 6, 7, 9)
U2: randomly mixing pixels from images of 5 and 8
U3: average of randomly selected examples of 5 and 8
Training set sizes tried: 250, 500, ..., 3,000 samples
Universum set size: 5,000 samples
Prediction error: improved over standard SVM, e.g., for 500 training samples: 1.4% vs 2% (SVM)

30 Cultural Interpretation of Universum: jokes, absurd examples
- neither Hillary nor Obama
- Dadaism

31 Application Study: predicting gender of human faces
Binary classification setting. A difficult problem:
- dimensionality ~ large (10K - 20K)
- labeled sample size ~ small
Humans perform very well on this task.
Issues:
- possible improvement (vs standard SVM)
- how to choose a 'good' Universum?
- model parameter tuning

32 Male Faces: examples

33 Female Faces: examples

34 Universum Faces: neither male nor female

35 Empirical Study (cont’d)
Universum generation:
U1 Average: average of male and female samples randomly selected from the training set (U. of Essex database)
U2 Empirical Distribution: estimate the pixel-wise distribution of the training data and generate a new picture from this distribution
U3 Animal faces
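A possible implementation sketch of the U1 and U2 schemes, assuming the face images are flattened into rows of a numpy array (function names are mine):

import numpy as np

rng = np.random.default_rng(0)

def universum_averaging(X_male, X_female, n_samples):
    """U1: average a randomly chosen male and a randomly chosen female image."""
    i = rng.integers(len(X_male), size=n_samples)
    j = rng.integers(len(X_female), size=n_samples)
    return 0.5 * (X_male[i] + X_female[j])

def universum_empirical(X_train, n_samples):
    """U2: draw each pixel independently from its empirical distribution
    over the training images (a pixel-wise bootstrap)."""
    n, d = X_train.shape
    src = rng.integers(n, size=(n_samples, d))  # a source image per pixel
    return X_train[src, np.arange(d)]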

36 Universum generation: examples
[Figure: example Universum faces generated by U1 (averaging) and U2 (empirical distribution)]

37 Results of gender classification
Classification accuracy improves vs standard SVM: by ~2% with the U1 Universum and by ~1% with the U2 Universum. Universum by averaging gives better results for this problem when the number of Universum samples is N = 500 or 1,000.

38 Results of gender classification
Universum ~ animal faces: degrades classification accuracy by 2-5% (vs standard SVM).
Animal faces are not relevant to this problem.

39 Learning with Structured Data (Vapnik, 2006)
• Application: handwritten digit recognition
Labeled training data provided by t persons (t > 1)
Goal 1: find a single classifier that will generalize well for future samples generated by these persons ~ Learning with Structured Data, or Learning Using Hidden Information
Goal 2: find t classifiers that generalize well, one for each person ~ Multi-Task Learning (MTL)
• Application: medical diagnosis
Labeled training data provided by t groups of patients (t > 1), say men and women (t = 2)
Goal 1: estimate a single classifier to predict/diagnose a disease using training data from all t groups of patients ~ LWSD
Goal 2: find t classifiers specialized for each group of patients ~ MTL

40 Different Ways of Using Group Information
[Diagram: sSVM ~ a single SVM classifier f(x), ignoring group information; SVM+ ~ a single decision function f(x) estimated using group information; mSVM ~ separate SVM classifiers f1(x), f2(x), one per group; SVM+MTL ~ group-specific functions f1(x), f2(x) estimated jointly]

41 SVM+ technology (Vapnik, 2006)
Map the input vectors simultaneously into:
- the decision space (standard SVM classifier)
- the correcting space (where correcting functions model slack variables for different groups)
Decision space/function ~ the same for all groups
Correcting functions ~ different for each group (but the correcting space may be the same)
The SVM+ optimization formulation incorporates:
- the capacity of the decision function
- the capacity of the correcting functions for each group r
- the relative importance (weight) of these two capacities

42 SVM+ approach (Vapnik, 2006)
[Figure: samples from Group 1 and Group 2, Class 1 and Class -1, are mapped into the decision space (a single decision function) and into the correcting space, where a correcting function models the slack variable for each group r]

43 SVM+ Formulation
Decision space: f(x) = (w · z) + b, where z is the image of x in the decision space.
Correcting space: the slack of sample i from group r is modeled by that group's correcting function, ξ_i = (w_r · z_i^r) + b_r, where z^r is the image of x in the correcting space.
Minimize
R(w, b, w_1, ..., w_t, b_1, ..., b_t) = (1/2)(w · w) + (γ/2) Σ_{r=1..t} (w_r · w_r) + C Σ_i [(w_r · z_i^r) + b_r]
subject to:
y_i [(w · z_i) + b] ≥ 1 − [(w_r · z_i^r) + b_r] and (w_r · z_i^r) + b_r ≥ 0, for each sample i in group r
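For illustration, a sketch of this QP in cvxpy, under the simplifying assumption that both the decision space and the correcting space are the linear input space (function name and structure are mine, not from the slides):

import numpy as np
import cvxpy as cp

def svm_plus_linear(X, y, groups, C=1.0, gamma=1.0):
    """SVM+ sketch with linear decision and correcting spaces.
    The slack of sample i in group r is NOT a free variable: it is
    the value of that group's correcting function, W[r] @ x_i + B[r]."""
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    ids = list(np.unique(groups))
    W = {r: cp.Variable(d) for r in ids}      # correcting weights per group
    B = {r: cp.Variable() for r in ids}
    slack = [X[i] @ W[groups[i]] + B[groups[i]] for i in range(n)]
    cons = [y[i] * (X[i] @ w + b) >= 1 - slack[i] for i in range(n)]
    cons += [s >= 0 for s in slack]           # correcting functions stay non-negative
    capacity = 0.5 * cp.sum_squares(w) + 0.5 * gamma * sum(
        cp.sum_squares(W[r]) for r in ids)
    prob = cp.Problem(cp.Minimize(capacity + C * cp.sum(cp.hstack(slack))), cons)
    prob.solve()
    return w.value, b.value

# Usage: w, b = svm_plus_linear(X, y, groups); predict via np.sign(X_new @ w + b)

In Vapnik's formulation the correcting space is typically a different (e.g., RBF) kernel space, handled via the dual; the problem remains convex for any fixed kernels.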

44 SVM+ for Multi-task Learning (Liang 2008)
New learning formulation: SVM+MTL.
Define the decision function for each group r as f_r(x) = [(w · z) + b] + [(w_r · z^r) + b_r].
The common decision function (w · z) + b models the relatedness among groups.
The correcting functions fine-tune the model for each group (task).

45 svm+MTL Formulation
Decision space and correcting space as in SVM+, with group-specific decision functions f_r(x) = (w · z) + b + (w_r · z^r) + b_r.
Minimize
R = (1/2)(w · w) + (γ/2) Σ_{r=1..t} (w_r · w_r) + C Σ_i ξ_i
subject to:
y_i [(w · z_i) + b + (w_r · z_i^r) + b_r] ≥ 1 − ξ_i, ξ_i ≥ 0, for each sample i in group r

46 Empirical Validation
Different ways of using group info → different learning settings:
- which one yields better generalization?
- how is performance affected by sample size?
Empirical comparisons:
- synthetic data set

47 Different Ways of Using Group Information
[Diagram repeated from slide 40: sSVM, SVM+, mSVM, and SVM+MTL]

48 Comparison for Synthetic Data Set
Data generation: inputs x with i.i.d. components; the coefficient vectors of the three tasks are specified so the tasks are related; for each task and each data vector, the label is generated from that task's coefficient vector [exact formulas lost in transcription].
Details of methods used:
- linear SVM classifier (single parameter C)
- SVM+ and SVM+MTL classifiers (3 parameters: linear kernel for the decision space, RBF kernel for the correcting space, and parameter γ)
- independent validation set for model selection

49 Experimental Results
Comparison results (averaged over 10 trials); n ~ number of training samples per task; average test error (%):

n      sSVM   SVM+   mSVM   SVM+MTL
15     19.9   19.1   29.3   20.8
100    11.9   11.7    8.8    8.5

Note: relative performance depends on sample size.
Note: SVM+ is always better than SVM, and SVM+MTL is always better than mSVM.

50 OUTLINE
Motivation for non-standard approaches
Alternative Learning Settings
Summary: advantages/limitations of non-standard settings

51 Advantages + limitations of non-standard settings
Advantages:
- make common sense
- follow a methodological framework (VC-theory)
- yield better generalization (but not always)
Limitations:
- need to formalize application requirements → need to understand the application domain
- generally more complex learning formulations
- more difficult model selection
- few known empirical comparisons (to date)
SVM+ is a promising new technology for hard problems.

52 References and Resources
Vapnik, V., Estimation of Dependences Based on Empirical Data. Empirical Inference Science: Afterword of 2006, Springer, 2006
Cherkassky, V. and F. Mulier, Learning from Data, second edition, Wiley, 2007
Chapelle, O., Schölkopf, B., and A. Zien, Eds., Semi-Supervised Learning, MIT Press, 2006
Cherkassky, V. and Y. Ma, Introduction to Predictive Learning, Springer, 2011 (to appear)
Hastie, T., R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, New York: Springer, 2001
Schölkopf, B. and A. Smola, Learning with Kernels, MIT Press, 2002
Public-domain SVM software:
- LIBSVM software library
- SVM-Light software library
- non-standard SVM-based methodologies: Universum, SVM+, MTL

