Slide 1: Machine learning, pattern recognition and statistical data modelling. Lecture 10: Model selection and combination. Coryn Bailer-Jones

Slide 2: Topics
● SVMs for regression
● How many clusters in mixture models?
  – model selection
  – BIC and AIC
● Combining weak learners
  – boosting
  – classification and regression trees

Slide 3: A reminder of Support Vector Machines

Slide 4: A reminder of Support Vector Machines
● SVMs operate on the principle of separating hyperplanes
  – maximize the margin
  – only data points near the margin are relevant (the support vectors)
● Nonlinearity via kernels
  – possible because the data appear only as dot products
  – project the data into a higher dimensional space
  – the projection is only implicit (a big computational saving)
● Solution
  – Lagrangian dual
  – convex cost function, therefore a unique solution
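For reference, the two ingredients above can be written compactly (standard notation, not necessarily exactly the slide's): for separable two-class data with labels y_i \in \{-1,+1\}, the maximum-margin problem is
\min_{w,b} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 ,
and the kernel trick replaces every dot product by a kernel evaluation,
x_i \cdot x_j \;\rightarrow\; k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) ,
so the mapping \phi into the higher dimensional space never has to be computed explicitly.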

Slide 5: SVMs for regression

Slide 6: SVMs for regression
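As a reminder of the standard formulation (a sketch in the usual epsilon-insensitive notation, which may differ in detail from the slides): SVM regression fits f(x) = w \cdot x + b by solving
\min_{w,b,\xi,\xi^*} \; \tfrac{1}{2}\|w\|^2 + C \sum_i (\xi_i + \xi_i^*)
subject to
y_i - w \cdot x_i - b \le \epsilon + \xi_i , \quad w \cdot x_i + b - y_i \le \epsilon + \xi_i^* , \quad \xi_i, \xi_i^* \ge 0 ,
i.e. errors smaller than \epsilon are ignored and larger ones are penalized linearly, with C controlling the trade-off between flatness and training error.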

Slide 7: SVMs for regression. In the Teff prediction problem using PS1 simulated data, the training set error increases monotonically with increasing epsilon for fixed C and gamma.
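A minimal sketch of epsilon-SVR in R using the e1071 package (illustrative toy data, not the lecture's Teff/PS1 setup), showing the three hyperparameters involved:

library(e1071)
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)                 # toy regression data
d <- data.frame(x = x, y = y)
fit <- svm(y ~ x, data = d, type = "eps-regression",
           kernel = "radial", cost = 1, gamma = 0.5, epsilon = 0.1)
mean((predict(fit, d) - d$y)^2)                    # training error

Re-fitting with, say, epsilon = 0.5 while keeping cost and gamma fixed reproduces the qualitative behaviour described on the slide: more points fall inside the insensitive tube, so the training error grows.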

Slide 8: Mixture models

Slide 9: Gaussian mixture model applied to the geyser data. Application of the Mclust{mclust} function to the faithful{datasets} data set. See the R scripts on the lecture web site.
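A minimal sketch of such a fit (the lecture's own R scripts are on the course web site; this is just an illustration):

library(mclust)
data(faithful)
fit <- Mclust(faithful)              # selects the number of components and covariance model by BIC
summary(fit)                         # chosen model, number of clusters, mixing proportions
plot(fit, what = "classification")   # cluster assignments in the eruptions/waiting plane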

Slide 10: Model selection methods
● Two levels of inference
  – optimal parameters for a given model
  – optimal model: how do we choose between two different models, even of different types (e.g. SVM or mixture model)?
  – both involve a fundamental trade-off between fit on the training data and model complexity; within a given model we can include a regularization term
● One approach is cross validation (a small sketch follows below)
  – let predictive error be your guide
  – variants: k-fold, leave-one-out, generalized
  – strictly we need a third data set for model comparison (the second is used for fixing the regularization parameters)
  – CV is also slow and still depends on a specific (and finite) data set
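A minimal k-fold cross-validation sketch in R (illustrative only, using a classification tree on the built-in iris data rather than anything from the lecture):

library(rpart)
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(iris)))   # random fold assignment
err <- numeric(k)
for (i in 1:k) {
  train <- iris[folds != i, ]
  test  <- iris[folds == i, ]
  fit   <- rpart(Species ~ ., data = train)
  pred  <- predict(fit, test, type = "class")
  err[i] <- mean(pred != test$Species)               # misclassification rate on the held-out fold
}
mean(err)                                            # cross-validated error estimate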

Slide 11: Model Selection: Akaike Information Criterion (AIC). (Equation slide; the annotations note that smaller values indicate a better model fit, that the expectation values are taken w.r.t. the truth, and that the constant C is the same for all models g.)

Slide 12: Model Selection: Akaike Information Criterion (AIC). (Equation slide; the factor of 2 is there for "historical" reasons.)
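For reference, the standard definition (which the slide's equations presumably state in an equivalent form): for a model g with p fitted parameters and maximized likelihood \hat{L},
\mathrm{AIC} = -2 \ln \hat{L} + 2p ,
and the model with the smallest AIC is preferred; the first term rewards fit, the second penalizes complexity.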

Slide 13: Model Selection: Bayesian Information Criterion (BIC)
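Again for reference, the standard definition (the slide's own equations may use a different but equivalent notation): with N data points,
\mathrm{BIC} = -2 \ln \hat{L} + p \ln N ,
so for N larger than about 8 the BIC penalizes extra parameters more heavily than the AIC does.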

Slide 14: Mclust models (covariance parametrizations)

Slide 15: Model selection using the R mclust package. Application of the Mclust{mclust} package to the faithful{datasets} data set. See the R scripts on the lecture web site. Note that what mclust reports as the BIC is actually the negative BIC!
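A minimal sketch of BIC-based model selection with mclust (illustrative; the lecture's own scripts are on the course web site). Keep the sign convention just mentioned in mind: mclust reports the BIC such that larger values are better.

library(mclust)
data(faithful)
bic <- mclustBIC(faithful)   # BIC over numbers of components and covariance parametrizations
summary(bic)                 # the best few (covariance model, G) combinations
plot(bic)                    # BIC curves versus number of components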

Slide 16: Boosting: combining weak classifiers

Slide 17: Figure © Hastie, Tibshirani & Friedman (2001)

Slide 18: A weak learner: classification trees (an R sketch follows after this slide)
● N continuous or discrete input variables; multiple output classes
● make successive binary splits on a single variable
  – this makes a hierarchical partitioning of the data space
  – a simple model (a constant!) is fitted to each partition, i.e. all objects in a partition are assigned a single class
● iteratively grow the tree by splitting each node on the variable j at the point s which reduces the loss (error) the most
● properties of the resulting trees
  – the partitioning can form non-contiguous regions for each class
  – class boundaries are parallel to the axes
● regression trees
  – split by minimizing the sum of squares in each partition (the fitted value in a partition is just the average)
● regularization
  – grow a large tree, then prune it back
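A minimal classification-tree sketch with the rpart package (illustrative only, on the built-in iris data): grow a deliberately large tree and then prune it back.

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0.001, minsplit = 5))   # grow a large tree
printcp(fit)                                   # cross-validated error vs complexity parameter
pruned <- prune(fit, cp = 0.02)                # prune back to a simpler tree
plot(pruned); text(pruned)                     # axis-parallel splits, one constant class per leaf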

Slide 19: Classification tree. Figure © Venables & Ripley (2002)

Slide 20: Regression tree. Figure © Hastie, Tibshirani & Friedman (2001)

Slide 21: Example of boosting a classification tree

Slide 22: Boosted tree performance. Figure © Hastie, Tibshirani & Friedman (2001)

Slide 23: Boosting
● boosting fits an additive model: each classifier is a basis function
● it is equivalent to doing “forward stagewise modelling”:
  – it adds a new term (basis function) without modifying the existing ones
  – the next term is the one which minimizes the loss function over the choice of models
● it can be shown that it does this using an exponential rather than a squared-error loss (see below)
  – this is more logical anyway for indicator variables
  – see Hastie et al. sections 10.3 & 10.4 for a proof
● trees are often used as the weak learner
  – they are fast to fit
● boosting shows good performance on many problems
● there are many variants on the basic model
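For reference (a standard result, see Hastie et al. section 10.4): for two-class labels y \in \{-1,+1\}, AdaBoost corresponds to forward stagewise additive modelling under the exponential loss
L(y, f(x)) = \exp\!\big(-y\, f(x)\big) ,
whereas a squared-error loss (y - f(x))^2 would keep penalizing points that are already correctly classified with a large margin.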

Slide 24: Forward stagewise additive modelling algorithm
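In the generic form of Hastie et al. (which the slide presumably follows), the algorithm reads:
1. Initialize f_0(x) = 0.
2. For m = 1, \dots, M:
   (a) (\beta_m, \gamma_m) = \arg\min_{\beta,\gamma} \sum_i L\big(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big), where b(x; \gamma) is the weak learner (basis function) with parameters \gamma;
   (b) f_m(x) = f_{m-1}(x) + \beta_m\, b(x; \gamma_m).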

Slide 25: R packages for boosting
● ada
  – many types of boosting
  – nice manual/document (Culp et al. 2006) which covers the chisq example
● adabag
  – Adaboost.M1 and bagging with trees
● boost
● gbm
  – various loss functions, also for regression
● GAMBoost
  – fits GAMs using boosting
● rpart
  – classification and regression trees (the basis for many boosting algorithms)
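A minimal AdaBoost.M1 sketch with the adabag package (illustrative only; any of the packages above would do), boosting classification trees on the built-in iris data:

library(adabag)
set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]
fit   <- boosting(Species ~ ., data = train, mfinal = 50)   # 50 rounds of boosted rpart trees
pred  <- predict(fit, newdata = test)
pred$error                                                   # test-set misclassification rate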

Slide 26: Vapnik-Chervonenkis (VC) dimension
● the effective number of parameters is not very general
● consider a class of functions f(x, α) for separating two-class data
● a given set of points is shattered by this class if, no matter how the class labels are assigned, a member of this class can separate them
● the VC dimension of the class of functions is the largest number of points in some configuration which can be shattered by at least one member of the class
  – it is not necessary that all configurations be shattered
● the VC dimension is an alternative measure of the “capacity” of a model to fit complex data

Slide 27: Vapnik-Chervonenkis (VC) dimension. Figure © Hastie, Tibshirani & Friedman (2001)

Slide 28: Vapnik-Chervonenkis (VC) dimension. Figure © Hastie, Tibshirani & Friedman (2001)

Slide 29: VC dimension of models
● generally, a linear indicator function in p dimensions has VC dimension p+1
● sin(αx) has an infinite VC dimension (see Burges 1998)
  – but note that four equally spaced points cannot be shattered
● k-nn also has an infinite VC dimension
● SVMs can have a very large, even infinite, VC dimension
● the VC dimension measures the “capacity” of a model
  – a larger VC dimension gives more flexibility...
  – ...but potentially poor performance (unconstrained)
● the VC dimension can be used for model selection (see the bound sketched after this slide)
  – it can be used to calculate an upper bound on the true error given the error on a (finite) training set (cf. AIC, BIC)
  – a larger VC dimension sometimes means poorer generalization ability (bias/variance trade-off)
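The bound referred to above is, in the form given by Vapnik (see e.g. Burges 1998; the slide may use a slightly different notation): with probability at least 1 - \eta over training sets of size N, a classifier drawn from a class with VC dimension h satisfies
E_{\mathrm{true}} \;\le\; E_{\mathrm{train}} + \sqrt{ \frac{ h\,(\ln(2N/h) + 1) - \ln(\eta/4) }{ N } } ,
so, like AIC and BIC, it bounds the true error by the training error plus a complexity penalty, here growing with h/N.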

Slide 30: Summary
● model selection
  – account for model complexity and the bias from a finite-sized training set
  – evaluate the error (log likelihood) on the training sample and apply a 'correction'
● classification and regression trees (CART)
  – greedy, top-down partitioning algorithm (then 'prune' back)
  – splits (partition boundaries) are parallel to the axes
  – constant fit to the partitions
● boosting
  – combine weak learners (e.g. CART) to get a powerful additive model
  – recursively build up models by reweighting the data