Combining Models
Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya



Motivation 1
- Many models can be trained on the same data.
- Typically none is strictly better than the others; recall the "no free lunch" theorem.
- Can we "combine" the predictions from multiple models? Yes, and typically with a significant reduction of error.

Motivation 2
- Combined prediction using adaptive basis functions:
  $f(x) = \sum_{m=1}^{M} w_m \phi_m(x; v_m)$
- M basis functions, each with its own parameters $v_m$.
- A weight / confidence $w_m$ for each basis function.
- All parameters, including M, are trained using data.
- Another interpretation: automatically learning the best representation of the data for the task at hand.
- How does this differ from mixture models?
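As a small illustration of the formula above (my own sketch, not from the slides): evaluating $f(x) = \sum_m w_m \phi_m(x; v_m)$ in NumPy, with hypothetical sigmoidal basis functions and made-up parameter values.

```python
import numpy as np

def adaptive_basis_predict(x, weights, basis_fns, basis_params):
    """Evaluate f(x) = sum_m w_m * phi_m(x; v_m).

    Unlike a linear model with fixed basis functions, each phi_m here takes
    its own (trainable) parameter vector v_m."""
    return sum(w * phi(x, v) for w, phi, v in zip(weights, basis_fns, basis_params))

# Hypothetical example: two sigmoidal basis functions with parameters v = (offset, slope).
sigmoid = lambda x, v: 1.0 / (1.0 + np.exp(-(v[0] + v[1] * x)))
f_at_2 = adaptive_basis_predict(2.0,
                                weights=[0.7, -0.3],
                                basis_fns=[sigmoid, sigmoid],
                                basis_params=[(0.0, 1.0), (1.0, -2.0)])
```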

Examples of Model Combination
Also called ensemble learning:
- Decision trees
- Bagging
- Boosting
- Committees / mixtures of experts
- Feed-forward neural networks / multi-layer perceptrons
- ...

Decision Trees
- Partition the input space into cuboid regions.
- Fit a simple model in each region. Classification: a single label; regression: a constant real value.
- A sequential decision process chooses the model for each instance.
(Figure from Murphy.)

Learning Decision Trees
- Decision for each region:
  - Regression: the average of the training targets in the region.
  - Classification: the most likely label in the region.
- Learning the tree structure and splitting values:
  - Learning the optimal tree is intractable, so use a greedy algorithm: find the (node, dimension, value) with the largest reduction of "error" (sketched below).
  - Regression error: residual sum of squares. Classification: misclassification error, entropy, ...
  - Stopping condition.
- Preventing overfitting: pruning using cross-validation.
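A minimal sketch of the greedy step for a regression tree (my own illustration, not code from the lecture): exhaustively search for the (dimension, threshold) split that most reduces the residual sum of squares.

```python
import numpy as np

def best_split(X, y):
    """Return the (dimension, threshold, error) of the single split that most
    reduces the residual sum of squares, where each region predicts its mean."""
    def rss(t):
        return float(((t - t.mean()) ** 2).sum()) if len(t) else 0.0

    best_dim, best_thr, best_err = None, None, rss(y)   # error of the unsplit node
    for d in range(X.shape[1]):
        for thr in np.unique(X[:, d]):
            mask = X[:, d] <= thr
            err = rss(y[mask]) + rss(y[~mask])
            if err < best_err:
                best_dim, best_thr, best_err = d, thr, err
    return best_dim, best_thr, best_err
```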

Pros and Cons of Decision Trees
- Easily interpretable decision process; widely used in practice, e.g. in medical diagnosis.
- But predictive performance is often not very good: the partition of the space is restricted, and only one model can be chosen per instance.
- Unstable: small changes in the training data can produce a very different tree.

Mixture of Supervised Models
- Combined prediction: $f(x) = \sum_k \pi_k \phi_k(x, w)$
- Examples: a mixture of linear regression models, a mixture of logistic regression models.
- Training using EM.
(Figure from Murphy.)
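A sketch of EM for a mixture of linear regression models (my own illustration of the idea, assuming Gaussian noise in each component and a design matrix that already contains a bias column; not the exact algorithm from the lecture).

```python
import numpy as np

def em_mixture_linreg(X, y, K=2, n_iter=50, seed=0):
    """Fit K linear regression components with mixing weights pi by EM.
    X: (N, D) design matrix (include a column of ones for a bias), y: (N,) targets."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                    # mixing proportions
    W = rng.normal(scale=0.1, size=(K, D))      # per-component regression weights
    sigma2 = np.full(K, y.var() + 1e-6)         # per-component noise variances

    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] proportional to pi_k * N(y_n | x_n.w_k, sigma2_k)
        mu = X @ W.T                                                        # (N, K)
        log_lik = -0.5 * (np.log(2 * np.pi * sigma2) + (y[:, None] - mu) ** 2 / sigma2)
        log_r = np.log(pi) + log_lik
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: weighted least squares for each component, then update pi and sigma2
        for k in range(K):
            Rk = r[:, k]
            A = X.T @ (Rk[:, None] * X) + 1e-8 * np.eye(D)
            W[k] = np.linalg.solve(A, X.T @ (Rk * y))
            resid = y - X @ W[k]
            sigma2[k] = (Rk * resid ** 2).sum() / Rk.sum()
        pi = r.mean(axis=0)
    return pi, W, sigma2
```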

Conditional Mixture of Supervised Models
- Mixture of experts: the mixing proportions themselves depend on the input,
  $f(x) = \sum_k \pi_k(x) \phi_k(x, w)$
(Figure from Murphy.)

Bootstrap Aggregation / Bagging
- Individual models (e.g. decision trees) may have high variance along with low bias.
- Construct M bootstrap datasets, train a separate copy of the predictive model on each, and average the predictions over the copies:
  $f(x) = \frac{1}{M} \sum_{m=1}^{M} f_m(x)$
- If the errors of the copies are uncorrelated, the bagged error reduces linearly with M.
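A minimal bagging sketch (my own, using scikit-learn decision trees as the base model and a simple majority vote; class labels are assumed to be small non-negative integers so that the vote count works).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, M=25, seed=0):
    """Train M decision trees, each on a bootstrap resample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)                 # sample n points with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the copies by majority vote over their predictions."""
    votes = np.stack([m.predict(X) for m in models])     # shape (M, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```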

Random Forests
- Training the same algorithm on bootstrap datasets creates correlated errors.
- Decorrelate the trees by randomly choosing (a) a subset of the variables at each split and (b) a subset of the training data for each tree.
- Good predictive accuracy, at some loss of interpretability.
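For reference, scikit-learn's random forest exposes both sources of randomness mentioned above; a small, self-contained usage sketch on toy data (the hyperparameter values are only illustrative).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data, only to make the example runnable end to end.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# max_features="sqrt" restricts each split to a random subset of the variables;
# bootstrap=True trains every tree on a bootstrap resample of the training data.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            bootstrap=True, random_state=0)
print("mean CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```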

Boosting
- Combines weak learners that are only $\epsilon$-better than random, e.g. decision stumps.
- Trains on a sequence of weighted datasets: the weight of a data point in each iteration is proportional to the number of times it was misclassified in earlier iterations.
- The specific weighting scheme depends on the loss function.
- Theoretical bounds on the error are available.

Example Loss Functions and Algorithms
- Squared error: $(y_i - f(x_i))^2$
- Absolute error: $|y_i - f(x_i)|$
- Squared loss: $(1 - y_i f(x_i))^2$
- 0-1 loss: $I(y_i \neq f(x_i))$
- Exponential loss: $\exp(-y_i f(x_i))$
- Log loss: $\frac{1}{\log 2} \log(1 + e^{-y_i f(x_i)})$
- Hinge loss: $[1 - y_i f(x_i)]_+$
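The same losses written out in NumPy (my own sketch): labels $y_i$ are assumed to be in $\{-1, +1\}$ and f holds real-valued predictions, so most of the losses are functions of the margin $y_i f(x_i)$.

```python
import numpy as np

def margin_losses(y, f):
    """Losses from the slide for y in {-1, +1} and real-valued predictions f.
    Squared and absolute error are the regression forms; the 0-1 loss treats
    a non-positive margin as a misclassification, i.e. I(y != sign(f))."""
    m = y * f                                           # margin y_i * f(x_i)
    return {
        "squared error":    (y - f) ** 2,
        "absolute error":   np.abs(y - f),
        "squared loss":     (1 - m) ** 2,
        "0-1 loss":         (m <= 0).astype(float),
        "exponential loss": np.exp(-m),
        "log loss":         np.log1p(np.exp(-m)) / np.log(2),
        "hinge loss":       np.maximum(0.0, 1 - m),
    }

losses = margin_losses(np.array([1, -1, 1]), np.array([0.8, 0.3, -1.2]))
```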

Example: AdaBoost
Binary classification with the exponential loss.
1. Initialize $w_n^{(1)} = \frac{1}{N}$.
2. For $m = 1, \dots, M$:
   - Train a classifier $y_m(x)$ minimizing the weighted error $\sum_n w_n^{(m)} I(y_m(x_n) \neq y_n)$.
   - Evaluate $\epsilon_m = \frac{\sum_n w_n^{(m)} I(y_m(x_n) \neq y_n)}{\sum_n w_n^{(m)}}$ and $\alpha_m = \log \frac{1 - \epsilon_m}{\epsilon_m}$.
   - Update the weights: $w_n^{(m+1)} = w_n^{(m)} \exp\{\alpha_m I(y_m(x_n) \neq y_n)\}$.
3. Predict with $f_M(x) = \mathrm{sgn}\left(\sum_{m=1}^{M} \alpha_m y_m(x)\right)$.
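A direct transcription of these steps into Python (my own sketch), using depth-1 scikit-learn trees as the weak classifiers; labels are assumed to be in {-1, +1}.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """AdaBoost with decision stumps for labels y in {-1, +1}.
    Returns the weak classifiers and their weights alpha_m."""
    n = len(X)
    w = np.full(n, 1.0 / n)                              # w_n^(1) = 1/N
    stumps, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y                     # I(y_m(x_n) != y_n)
        eps = w[miss].sum() / w.sum()                    # weighted error eps_m
        alpha = np.log((1 - eps) / (eps + 1e-12))        # alpha_m
        w = w * np.exp(alpha * miss)                     # upweight misclassified points
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    """f_M(x) = sign(sum_m alpha_m * y_m(x))."""
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
```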

Neural Networks: Multilayer Perceptrons
- Multiple layers of logistic regression models.
- The parameters of each layer are optimized by training.
- Motivated by models of the brain, but a powerful learning model regardless.

LR and R Remembered ...
- Linear models with fixed basis functions: a non-linear transformation $\phi_i$ followed by a linear combination,
  $y(x, w) = f\left(\sum_i w_i \phi_i(x)\right)$
- Multilayer perceptron: a linear combination followed by a non-linear transformation, and again,
  $y_k(x; w, v) = h\left(\sum_{j=1}^{M} w_{kj}\, g\left(\sum_{i=1}^{D} v_{ji} x_i\right)\right)$

Feed-Forward Network Functions
- Form M linear combinations of the input variables: $a_j = \sum_{i=1}^{D} v_{ji} x_i$
- Apply a non-linear activation function: $z_j = g(a_j)$
- Form linear combinations to get the output activations: $b_k = \sum_{j=1}^{M} w_{kj} z_j$
- Apply the output activation function to get the outputs: $y_k = h(b_k)$
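In code, these four equations are two matrix-vector products with element-wise non-linearities; a minimal NumPy sketch (my own, with bias terms omitted as on the slide, and tanh / identity as illustrative choices for g and h).

```python
import numpy as np

def forward(x, V, W, g=np.tanh, h=lambda b: b):
    """Forward pass in the slide's notation.
    V: (M, D) input-to-hidden weights, W: (K, M) hidden-to-output weights.
    Returns a_j, z_j, b_k, y_k. The identity h suits regression; use a sigmoid
    or softmax output activation for classification."""
    a = V @ x      # a_j = sum_i v_ji x_i
    z = g(a)       # z_j = g(a_j)
    b = W @ z      # b_k = sum_j w_kj z_j
    y = h(b)       # y_k = h(b_k)
    return a, z, b, y
```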

Network Representation
(Figure: inputs $x_1, \dots, x_D$ are connected through weights $v_{ji}$ to hidden units $z_j = g(a_j)$, which are connected through weights $w_{kj}$ to outputs $y_1, \dots, y_K$ with $y_k = h(b_k)$.)
- Easy to generalize to multiple layers.

Power of Feed-Forward Networks
- Universal approximators: a two-layer network with linear outputs can uniformly approximate any smooth continuous function to arbitrary accuracy, given a sufficient number of nodes in the hidden layer.
- Why, then, are more than two layers needed?

Training
- Formulate an error function in terms of the weights:
  $E(w, v) = \sum_{n=1}^{N} \left(y(x_n; w, v) - y_n\right)^2$
- Optimize the weights using gradient descent:
  $(w, v)^{(t+1)} = (w, v)^{(t)} - \eta \nabla E\left((w, v)^{(t)}\right)$
- Deriving the gradient looks complicated because of the feed-forward structure ...

Error Backpropagation
- The full gradient is obtained by a sequence of local computations and propagations over the network.
- Output layer (using the forward-pass quantities):
  $\frac{\partial E_n}{\partial w_{kj}} = \frac{\partial E_n}{\partial b_{nk}} \frac{\partial b_{nk}}{\partial w_{kj}} = \delta^w_{nk} z_{nj}$, where $\delta^w_{nk} = y_k(x_n) - y_{nk}$ is the network output minus the target.
- Hidden layer (backward propagation of the errors):
  $\frac{\partial E_n}{\partial v_{ji}} = \frac{\partial E_n}{\partial a_{nj}} \frac{\partial a_{nj}}{\partial v_{ji}} = \delta^v_{nj} x_{ni}$, where $\delta^v_{nj} = \sum_k \frac{\partial E_n}{\partial b_{nk}} \frac{\partial b_{nk}}{\partial a_{nj}} = \sum_k \delta^w_{nk} w_{kj}\, g'(a_{nj})$.
- Total gradient: $\frac{\partial E}{\partial w} = \sum_n \frac{\partial E_n}{\partial w}$.
(Figure: the forward pass $x_i \to a_j \to z_j \to b_k \to y_k$, with the errors $\delta^w_{nk}$ propagated backwards to $\delta^v_{nj}$.)

Backpropagation Algorithm
1. Apply the input vector $x_n$ and compute the derived variables $a_j, z_j, b_k, y_k$ (forward pass).
2. Compute $\delta^w_{nk}$ at all output nodes.
3. Back-propagate the $\delta^w_{nk}$ to compute $\delta^v_{nj}$ at all hidden nodes.
4. Compute the derivatives $\frac{\partial E_n}{\partial w_{kj}}$ and $\frac{\partial E_n}{\partial v_{ji}}$.
- Batch mode: sum the derivatives over all input vectors.
- Vanishing gradient problem.
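These steps, for one training example with squared error and linear outputs, as a NumPy sketch (my own, reusing the V/W notation from the slides; batch training would sum these gradients over all examples).

```python
import numpy as np

def backprop_single(x, t, V, W, g=np.tanh, g_prime=lambda a: 1 - np.tanh(a) ** 2):
    """Gradients dE_n/dW and dE_n/dV for one example (x, t), squared error,
    linear output activation h(b) = b."""
    # 1. forward pass
    a = V @ x
    z = g(a)
    y = W @ z
    # 2. output errors: delta_k = y_k - t_k
    delta_out = y - t
    # 3. back-propagate: delta_j = g'(a_j) * sum_k w_kj delta_k
    delta_hid = g_prime(a) * (W.T @ delta_out)
    # 4. derivatives: dE/dw_kj = delta_k z_j, dE/dv_ji = delta_j x_i
    dW = np.outer(delta_out, z)
    dV = np.outer(delta_hid, x)
    return dW, dV
```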

Neural Network Regularization
Given such a large number of parameters, preventing overfitting is vitally important.
- Choosing the number of layers and the number of hidden nodes.
- Controlling the weights: weight decay, early stopping, weight sharing.
- Structural regularization: convolutional neural networks for invariances in image data.
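As one concrete illustration of weight decay and early stopping (my own sketch, reusing the `forward` and `backprop_single` functions above on toy data; all hyperparameter values are only illustrative): weight decay adds $\lambda w$ to each gradient, and early stopping halts once the validation error stops improving.

```python
import numpy as np

# Toy 1-D regression data and a small network, just to make the sketch self-contained.
rng = np.random.default_rng(0)
X_all = rng.uniform(-1, 1, size=(200, 1))
T_all = np.sin(3 * X_all) + 0.1 * rng.normal(size=X_all.shape)
X_train, T_train, X_val, T_val = X_all[:150], T_all[:150], X_all[150:], T_all[150:]
V = rng.normal(scale=0.5, size=(10, 1))          # input-to-hidden weights
W = rng.normal(scale=0.5, size=(1, 10))          # hidden-to-output weights

eta, lam = 0.05, 1e-4                            # illustrative learning rate and decay strength
best_val, patience = np.inf, 0
for epoch in range(1000):
    for x, t in zip(X_train, T_train):
        dW, dV = backprop_single(x, t, V, W)     # from the earlier backprop sketch
        W -= eta * (dW + lam * W)                # weight decay: gradient of (lam/2)*||W||^2 is lam*W
        V -= eta * (dV + lam * V)
    val_err = np.mean([np.sum((forward(x, V, W)[3] - t) ** 2)
                       for x, t in zip(X_val, T_val)])
    if val_err < best_val - 1e-6:
        best_val, patience = val_err, 0          # validation error improved: keep training
    else:
        patience += 1
        if patience >= 10:                       # early stopping after 10 stagnant epochs
            break
```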

So ... Which Classifier Is Best in Practice?
Extensive experimentation by Caruana et al. (2006); see also the more recent study linked from the original slides.
- Low dimensions (9-200 features), roughly in order: boosted decision trees, random forests, bagged decision trees, SVMs, neural nets, k-nearest neighbors, boosted stumps, decision trees, logistic regression, naive Bayes.
- High dimensions (500-100K features): HMC MLP, boosted MLP, bagged MLP, boosted trees, random forests.
- Remember the "No Free Lunch" theorem ...