Neural networks (1) Traditional multi-layer perceptrons http://neuralnetworksanddeeplearning.com/
Neural network
K-class classification: K nodes in the top layer.
Continuous outcome: a single node in the top layer.
The derived features $Z_m$ are created from linear combinations of the inputs, and $Y_k$ is modeled as a function of linear combinations of the $Z_m$.
Regression (typically K = 1):
$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \dots, M$
$Y = \beta_0 + \beta^T Z$
K-class classification:
$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \dots, M$
$T_k = \beta_{0k} + \beta_k^T Z, \quad k = 1, \dots, K$
$Y_k = \dfrac{e^{T_k}}{\sum_{l=1}^{K} e^{T_l}}, \quad k = 1, \dots, K$
Or in more general terms: $f_k(X) = g_k(T)$, with
regression: $g_k(T) = T_k$, $K = 1$
classification: $g_k(T) = \dfrac{e^{T_k}}{\sum_{l=1}^{K} e^{T_l}}$ (softmax)
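A minimal NumPy sketch of this forward computation for a single input x; the variable names and array shapes are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, alpha0, alpha, beta0, beta):
    """x: (p,) input; alpha0: (M,), alpha: (M, p); beta0: (K,), beta: (K, M)."""
    Z = sigmoid(alpha0 + alpha @ x)      # Z_m = sigma(alpha_{0m} + alpha_m^T x)
    T = beta0 + beta @ Z                 # T_k = beta_{0k} + beta_k^T Z
    Y = np.exp(T - T.max())              # softmax: g_k(T) = e^{T_k} / sum_l e^{T_l}
    return Z, T, Y / Y.sum()
```

For regression with K = 1, the softmax step is dropped and the output is simply T.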
Neural network
An old activation function: the sigmoid, $\sigma(v) = 1 / (1 + e^{-v})$.
Neural network
Other activation functions are used in practice. We will continue using the sigmoid function for this discussion.
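A short illustrative sketch evaluating the sigmoid next to two other common activations (tanh and ReLU) for comparison; none of these snippets come from the slides.

```python
import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))   # smooth, bounded in (0, 1)
tanh    = np.tanh                              # smooth, bounded in (-1, 1)
relu    = lambda v: np.maximum(0.0, v)         # piecewise linear, unbounded above

v = np.linspace(-4.0, 4.0, 9)
print(np.round(sigmoid(v), 3))
print(np.round(relu(v), 3))
```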
Neural network
A simple network with linear threshold functions ("bias" = intercept):
$y_1 = +1$ if and only if $x_1 + x_2 + 0.5 \ge 0$
$y_2 = +1$ if and only if $x_1 + x_2 - 1.5 \ge 0$
$z_1 = +1$ if and only if both $y_1 = +1$ and $y_2 = -1$, i.e. in the band between the two parallel lines.
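A minimal sketch of this small threshold network; the step function and variable names are illustrative.

```python
import numpy as np

def step(v):
    return np.where(v >= 0, 1, -1)          # hard threshold, output in {-1, +1}

def simple_net(x1, x2):
    y1 = step(x1 + x2 + 0.5)                # y1 = +1 iff x1 + x2 + 0.5 >= 0
    y2 = step(x1 + x2 - 1.5)                # y2 = +1 iff x1 + x2 - 1.5 >= 0
    z1 = np.where((y1 == 1) & (y2 == -1), 1, -1)   # fires only in the band between the lines
    return z1

print(simple_net(0.0, 0.0))   # +1: inside the band
print(simple_net(2.0, 2.0))   # -1: above the upper line
```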
Fitting Neural Networks
Set of parameters (weights): $\theta = \{\alpha_{0m}, \alpha_m;\ m = 1, \dots, M\} \cup \{\beta_{0k}, \beta_k;\ k = 1, \dots, K\}$, i.e. $M(p+1)$ hidden-layer weights plus $K(M+1)$ output-layer weights.
Objective function:
Regression (typically K = 1): sum of squared errors, $R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} (y_{ik} - f_k(x_i))^2$
Classification: cross-entropy (deviance), $R(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_k(x_i)$
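A minimal NumPy sketch of the two objective functions; function names and array shapes are illustrative.

```python
import numpy as np

def squared_error(Y, F):
    """R(theta) = sum_k sum_i (y_ik - f_k(x_i))^2; Y and F are (N, K) arrays."""
    return np.sum((Y - F) ** 2)

def cross_entropy(Y, F, eps=1e-12):
    """R(theta) = -sum_i sum_k y_ik log f_k(x_i); Y is one-hot, F holds class probabilities."""
    return -np.sum(Y * np.log(F + eps))
```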
Fitting Neural Networks
$R(\theta)$ is minimized by gradient descent, which in this setting is called "back-propagation".
Middle-layer values for each data point: $z_{mi} = \sigma(\alpha_{0m} + \alpha_m^T x_i)$, with $z_i = (z_{1i}, z_{2i}, \dots, z_{Mi})$.
We use the squared error loss for demonstration: $R(\theta) = \sum_{i=1}^{N} R_i = \sum_{i=1}^{N} \sum_{k=1}^{K} (y_{ik} - f_k(x_i))^2$.
Fitting Neural Networks
Rules of derivatives used here:
Sum rule and constant multiple rule: $(a f(x) + b g(x))' = a f'(x) + b g'(x)$
Chain rule: $(f(g(x)))' = f'(g(x))\, g'(x)$
Note: we take derivatives with respect to the coefficients $\alpha$ and $\beta$.
Fitting Neural Networks
Derivatives (for observation $i$, squared error loss):
$\dfrac{\partial R_i}{\partial \beta_{km}} = -2\,(y_{ik} - f_k(x_i))\, g_k'(\beta_k^T z_i)\, z_{mi}$
$\dfrac{\partial R_i}{\partial \alpha_{ml}} = -\sum_{k=1}^{K} 2\,(y_{ik} - f_k(x_i))\, g_k'(\beta_k^T z_i)\, \beta_{km}\, \sigma'(\alpha_m^T x_i)\, x_{il}$
Descent along the gradient at iteration $r$:
$\beta_{km}^{(r+1)} = \beta_{km}^{(r)} - \gamma_r \sum_{i=1}^{N} \dfrac{\partial R_i}{\partial \beta_{km}^{(r)}}$
$\alpha_{ml}^{(r+1)} = \alpha_{ml}^{(r)} - \gamma_r \sum_{i=1}^{N} \dfrac{\partial R_i}{\partial \alpha_{ml}^{(r)}}$
where $i$ is the observation index and $\gamma_r$ is the learning rate.
Fitting Neural Networks
By definition, write the two gradients as
$\dfrac{\partial R_i}{\partial \beta_{km}} = \delta_{ki}\, z_{mi}$ and $\dfrac{\partial R_i}{\partial \alpha_{ml}} = s_{mi}\, x_{il}$,
where $\delta_{ki} = -2\,(y_{ik} - f_k(x_i))\, g_k'(\beta_k^T z_i)$ is the "error" at the output layer, and
$s_{mi} = \sigma'(\alpha_m^T x_i) \sum_{k=1}^{K} \beta_{km}\, \delta_{ki}$ is the "error" at the hidden layer (the back-propagation equation).
Fitting Neural Networks
General workflow of back-propagation:
Forward pass: fix the weights and compute $f_k(x_i)$.
Backward pass: compute $\delta_{ki}$; back-propagate to compute $s_{mi}$; use both $\delta_{ki}$ and $s_{mi}$ to compute the gradients; update the weights.
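A minimal NumPy sketch of one back-propagation sweep for the single-hidden-layer regression model above (identity output, squared error). The array layouts, the leading column of ones for the biases, and the name backprop_step are illustrative assumptions, not code from the slides.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(X, Y, alpha, beta, gamma):
    """X: (N, p+1) inputs with a leading 1 for the bias; Y: (N, K) targets;
    alpha: (M, p+1); beta: (K, M+1); gamma: learning rate. Here g_k(T) = T_k (regression)."""
    N = X.shape[0]
    # Forward pass: fix weights, compute f_k(x_i)
    Z = sigmoid(X @ alpha.T)                     # (N, M): z_mi
    Z1 = np.hstack([np.ones((N, 1)), Z])         # prepend 1 for the bias beta_{0k}
    F = Z1 @ beta.T                              # (N, K): f_k(x_i)
    # Backward pass: delta_ki = -2 (y_ik - f_k(x_i)), since g_k'(T) = 1 here
    delta = -2.0 * (Y - F)                       # (N, K)
    # Back-propagate: s_mi = sigma'(alpha_m^T x_i) * sum_k beta_km * delta_ki
    s = (Z * (1 - Z)) * (delta @ beta[:, 1:])    # (N, M)
    # Gradients (summed over observations, as in the update rule above)
    grad_beta = delta.T @ Z1                     # (K, M+1): dR/dbeta_km
    grad_alpha = s.T @ X                         # (M, p+1): dR/dalpha_ml
    # Descent step; gamma should be small since the gradients are sums over i
    return beta - gamma * grad_beta, alpha - gamma * grad_alpha
```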
Fitting Neural Networks
Can use parallel computing: each hidden unit passes and receives information only to and from the units that share a connection.
Online training: the fitting scheme allows the network to handle very large training sets, and also to update the weights as new observations come in.
Training a neural network is an "art": the model is generally over-parametrized, and the optimization problem is non-convex and unstable.
A neural network model is a black box and hard to interpret directly.
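A minimal sketch of online training: take one small gradient step per observation as it is visited, reusing the illustrative backprop_step from the sketch above; the function name and defaults are assumptions.

```python
import numpy as np

def online_fit(X, Y, alpha, beta, gamma=0.01, sweeps=10):
    """Update the weights one observation at a time (stochastic updates)."""
    for _ in range(sweeps):
        for i in np.random.permutation(len(X)):
            beta, alpha = backprop_step(X[i:i+1], Y[i:i+1], alpha, beta, gamma)
    return alpha, beta
```

The same one-observation update can be applied as new cases arrive, which is what makes this scheme usable for very large or streaming training sets.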
Fitting Neural Networks
Initialization
When the weight vectors are close to length zero, all Z values are close to zero, the sigmoid is close to linear, and the overall model is close to linear: a relatively simple model. (This can be seen as a regularized solution.)
Start with very small weights and let the neural network learn the necessary nonlinear relations from the data.
Starting with large weights often leads to poor solutions.
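A minimal sketch of this initialization, with weights drawn uniformly from a small interval around zero; the interval, layer sizes, and names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
M, p, K = 10, 4, 3
alpha = rng.uniform(-0.1, 0.1, size=(M, p + 1))   # hidden-layer weights (bias column included)
beta  = rng.uniform(-0.1, 0.1, size=(K, M + 1))   # output-layer weights (bias column included)
# With weights this small, each Z_m = sigma(alpha_m^T x) sits on the near-linear part
# of the sigmoid, so the initial model is close to linear.
```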
Fitting Neural Networks
Overfitting
The model is too flexible, involving too many parameters, and may easily overfit the data.
Early stopping: do not let the algorithm converge. Because the model starts out close to linear, this gives a regularized solution (shrunk toward linearity).
Explicit regularization ("weight decay"): minimize $R(\theta) + \lambda J(\theta)$ with $J(\theta) = \sum_{km} \beta_{km}^2 + \sum_{ml} \alpha_{ml}^2$; the related weight elimination penalty tends to shrink smaller weights more. Cross-validation is used to estimate $\lambda$.
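A minimal sketch of weight decay with the ridge-type penalty above: the gradient of $\lambda(\sum \beta^2 + \sum \alpha^2)$ is added to the loss gradients before the descent step. Names (grad_alpha, grad_beta) continue the earlier illustrative sketch; leaving the bias columns unpenalized is a common choice, not something stated on the slides.

```python
import numpy as np

def decay_step(alpha, beta, grad_alpha, grad_beta, gamma=0.01, lam=1e-3):
    """One descent step on R(theta) + lambda * J(theta); lam plays the role of lambda."""
    grad_alpha = grad_alpha.copy()
    grad_beta = grad_beta.copy()
    grad_alpha[:, 1:] += 2 * lam * alpha[:, 1:]   # d/d_alpha of lambda * sum alpha^2
    grad_beta[:, 1:]  += 2 * lam * beta[:, 1:]    # d/d_beta  of lambda * sum beta^2
    return alpha - gamma * grad_alpha, beta - gamma * grad_beta
```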
Fitting Neural Networks
Number of Hidden Units and Layers
Too few: the model might not have enough flexibility to capture the nonlinearities in the data.
Too many: the model is overly flexible, but the extra weights can be shrunk toward zero if appropriate regularization is used.
Examples “A radial function is in a sense the most difficult for the neural net, as it is spherically symmetric and with no preferred directions.”
Going beyond a single hidden layer
An old benchmark problem: classification of handwritten numerals.
Going beyond a single hidden layer
Local connectivity: each hidden unit takes inputs from a small patch of the image (for example a 3x3 or 5x5 receptive field) rather than from all pixels; without weight sharing, every unit has its own weights.
Weight sharing: each of the units in a single 8 x 8 feature map shares the same set of nine (3 x 3) weights, but has its own bias parameter, so the same operation is applied to different parts of the image.
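A minimal sketch of the weight-sharing idea: all 64 units of an 8 x 8 feature map apply the same nine weights to their own patch of the input, each with its own bias. The 10 x 10 input size, stride of one, and function name are illustrative assumptions.

```python
import numpy as np

def feature_map(image, w, b):
    """image: (10, 10) input; w: (3, 3) shared weights; b: (8, 8) per-unit biases."""
    out = np.empty((8, 8))
    for i in range(8):
        for j in range(8):
            patch = image[i:i+3, j:j+3]                 # local 3x3 receptive field
            out[i, j] = np.sum(w * patch) + b[i, j]     # same operation, different location
    return 1.0 / (1.0 + np.exp(-out))                   # sigmoid activation
```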
Going beyond a single hidden layer
A training epoch: one sweep through the entire training set.
Too few epochs: underfitting. Too many epochs: potential overfitting.
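A minimal sketch of tracking the number of epochs with a held-out validation set; backprop_step reuses the illustrative sketch above, and the data names and monitoring scheme are assumptions, not from the slides.

```python
import numpy as np

def fit_with_monitoring(X_tr, Y_tr, X_va, Y_va, alpha, beta, n_epochs=100, gamma=0.01):
    """Run one epoch at a time and record training / validation squared error."""
    def predict(X):  # forward pass of the same single-hidden-layer model
        Z = 1.0 / (1.0 + np.exp(-(X @ alpha.T)))
        return np.hstack([np.ones((len(X), 1)), Z]) @ beta.T
    history = []
    for epoch in range(n_epochs):                       # one epoch = one sweep through X_tr
        for i in np.random.permutation(len(X_tr)):
            beta, alpha = backprop_step(X_tr[i:i+1], Y_tr[i:i+1], alpha, beta, gamma)
        history.append((np.sum((Y_tr - predict(X_tr)) ** 2),    # training error keeps falling
                        np.sum((Y_va - predict(X_va)) ** 2)))   # validation error may rise again
    return alpha, beta, history
```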