MAS 622J Course Project Classification of Affective States - GP Semi-Supervised Learning, SVM and kNN Hyungil Ahn (hiahn@media.mit.edu)

Objective & Dataset Objective: recognize the affective states of a child solving a puzzle. Affective Dataset - 1024 features from three modalities (face, posture and game information) - 3 affective states (labels) annotated by teachers: high interest (61 samples), low interest (59 samples) and refreshing (16 samples) - The dataset consists only of samples on which all teachers agreed on the label of the affective state Classification Task Binary classification -- { High interest } vs. { Low interest or Refreshing }

Task & Approaches Binary Classification: High interest (61 samples) vs. Low interest or Refreshing (75 samples) Approaches - Semi-Supervised Learning with Gaussian Processes (GP) - Support Vector Machine (SVM) - k-Nearest Neighbor (k = 1)

GP Semi-Supervised Learning Given labeled data (X_L, y_L) and unlabeled data X_U, predict the labels of the unlabeled points. Assume the following data generation process: X : inputs, y : vector of labels, t : vector of hidden soft labels. Each label y_i ∈ {-1, +1} (binary classification). Final classifier: y_i = sign[ t_i ]. Define a similarity (kernel) function k(x_i, x_j), then infer t given the labeled data.

GP Semi-Supervised Learning Infer t given the data via a Bayesian model: p(t | X, y) ∝ p(t | X) · p(y | t), where p(t | X) is the prior of the classifier and p(y | t) is the likelihood of the classifier given the labeled data.

GP Semi-Supervised Learning  How to model the prior & the likelihood ? The prior : Using GP, (Soft labels vary smoothly across the data manifold!) The likelihood :

GP Semi-Supervised Learning EP (Expectation Propagation) approximates the posterior as a Gaussian. Select the hyperparameters { kernel width σ, labeling error rate ε } that maximize the evidence! Advantage of using EP: we get the evidence as a side product, so EP estimates the leave-one-out predictive performance without performing any expensive cross-validation.
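EP itself is involved, but the evidence-maximization idea can be sketched with a simpler stand-in: the exact log marginal likelihood of a GP regression that treats the ±1 labels as noisy targets. The noise term and the grid of widths below are illustrative assumptions, not the slides' actual EP computation.

```python
import numpy as np

def log_evidence(X, y, sigma, noise):
    """Stand-in for the EP evidence: GP-regression log marginal likelihood
    log N(y; 0, K + noise * I) with an RBF kernel of width sigma."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma ** 2)) + noise * np.eye(len(X))
    sign, logdet = np.linalg.slogdet(K)
    alpha = np.linalg.solve(K, y)
    return -0.5 * (y @ alpha + logdet + len(y) * np.log(2 * np.pi))

def select_sigma(X, y, sigmas, noise=0.1):
    """Pick the kernel width that maximizes the (approximate) evidence."""
    return max(sigmas, key=lambda s: log_evidence(X, y, s, noise))
```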

Support Vector Machine OSU SVM toolbox, RBF kernel: k(x, x') = exp(-||x - x'||² / (2σ²)). Hyperparameter {C, σ} selection: use leave-one-out validation!
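The project used the OSU SVM toolbox for MATLAB; a comparable setup in Python (a hypothetical stand-in, not the original code) uses scikit-learn's grid search with leave-one-out splits. Note that scikit-learn parameterizes the RBF kernel by gamma = 1 / (2σ²), and the blobs below are toy data standing in for the affective dataset.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, LeaveOneOut

# Toy separable data standing in for the (unavailable) affective dataset.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + 3, rng.randn(20, 2) - 3])
y = np.array([1] * 20 + [-1] * 20)

# Grid over {C, gamma}, scored by leave-one-out validation accuracy.
grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=LeaveOneOut())
search.fit(X, y)
best = search.best_params_
```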

kNN (k = 1) The label of a test point follows that of its nearest training point. This algorithm is simple to implement, and its accuracy can serve as a baseline. Sometimes, however, it gives a good result!
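A minimal 1-NN classifier matching this description, in pure Python (illustrative, not the project's code):

```python
import math

def one_nn_predict(train, x):
    """1-NN: return the label of the training point nearest to x.
    `train` is a list of (point, label) pairs."""
    _, label = min(((math.dist(p, x), y) for p, y in train),
                   key=lambda pair: pair[0])
    return label
```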

Split of the dataset & Experiment GP semi-supervised learning - Randomly select labeled data (p % of the overall data), use the remaining data as unlabeled data, and predict the labels of the unlabeled data (in this setting, unlabeled data == test data) - 50 tries for each p (p = 10, 20, 30, 40, 50) - Each time, select the hyperparameters that maximize the evidence from EP SVM and kNN - Randomly select train data (p % of the overall data), use the remaining data as test data, and predict the labels of the test data - For the SVM, leave-one-out validation for hyperparameter selection was performed on the train data
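The random p % split described above can be sketched as follows (a hypothetical helper with a fixed seed for reproducibility, not the original experiment code):

```python
import random

def split_labeled(data, p, seed=0):
    """Randomly take p % of the samples as labeled/train data;
    the rest serve as unlabeled/test data, mirroring the protocol above."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_train = max(1, round(len(data) * p / 100))
    train = [data[i] for i in idx[:n_train]]
    test = [data[i] for i in idx[n_train:]]
    return train, test
```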

GP – evidence & accuracy The case of percentage of train points per class = 50 % (average over 10 tries). (Note: an offset was added to the log evidence to plot all curves in the same figure.) The maximum of the recognition accuracy roughly coincides with the maximum of the log evidence, so we can find the optimal hyperparameters by using the evidence from EP.

SVM – hyperparameter selection [Figure: leave-one-out validation accuracy plotted over log(C) and log(1/σ)] Select the hyperparameters {C, σ} that maximize the evidence from leave-one-out validation!

Classification Accuracy As expected, kNN is bad at a small # of train pts and better at a large # of train pts. SVM has good accuracy even when the # of train pts is small. Why? GP has bad accuracy when the # of train pts is small. Why?

Analysis-SVM Why does SVM give good test accuracy even when the number of train points is small? The best explanations I can offer: 1. {# support vectors} / {# train points} is high in this task, in particular when the percentage of train points is low. The support vectors determine the decision boundary. It is not guaranteed that the SV ratio is strongly related to the test accuracy; however, it is known that the leave-one-out CV error is at most {# support vectors} / {# train points}. 2. The CV accuracy rate is high even when the # of train pts is small, and the CV accuracy rate is strongly related to the test accuracy rate.
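The bound in point 1 (leave-one-out CV error ≤ #SV / #train points) can be checked empirically. The sketch below uses scikit-learn and synthetic blobs as a stand-in for the real dataset; the hyperparameters are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic two-class data (stand-in for the affective dataset).
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(15, 2) + 2, rng.randn(15, 2) - 2])
y = np.array([1] * 15 + [-1] * 15)

clf = SVC(kernel="rbf", C=10, gamma=0.5)

# Leave-one-out error rate.
loo_err = 1.0 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

# Support-vector ratio of the model trained on all points.
clf.fit(X, y)
sv_ratio = clf.n_support_.sum() / len(X)
```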

Analysis-GP Why does GP give bad test accuracy when the number of train points is small? At percentage of train points per class = 50 %, the max of the recognition accuracy ≈ the max of the log evidence. At percentage of train points per class = 10 %, the log evidence curve is flat, so we fail to find the optimal σ!

Conclusion GP: small number of train points, bad accuracy; large number of train points, good accuracy. SVM: good accuracy regardless of the number of train points. kNN (k = 1): bad accuracy with a small number of train points; better with a large number.