The Multi-Layer Perceptron: non-linear modelling by adaptive pre-processing
Rob Harrison, AC&SE


[Diagram: input data x → pre-processing layer → adaptive layer → output y, compared with target z to give error e]

Adaptive Basis Functions
"linear" models
– fixed pre-processing
– parameter → cost relationship is "benign": easy to optimize
but
– combinatorial (the number of fixed basis functions needed grows rapidly)
– arbitrary choices
what is the best pre-processor to choose?
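For comparison, a fixed-basis "linear" model (linear in its parameters) and the adaptive alternative can be written as below; in the MLP the basis functions themselves carry learnable parameters. The notation is mine, chosen to match the later slides.

```latex
% fixed pre-processing: linear in the parameters w_j
y(\mathbf{x}) = \sum_{j=1}^{M} w_j\,\phi_j(\mathbf{x}) + w_0
% adaptive pre-processing (MLP): the basis functions are learned too
\phi_j(\mathbf{x}) = \sigma\!\left(\mathbf{w}_j^{\mathsf T}\mathbf{x} + b_j\right)
```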

The Multi-Layer Perceptron
formulated from loose biological principles
popularized mid-1980s
– Rumelhart, Hinton & Williams 1986 (earlier: Werbos 1974, Ho 1964)
"learns" the pre-processing stage from data
layered, feed-forward structure
– sigmoidal pre-processing
– task-specific output
a non-linear model

Generic MLP
[Diagram: input layer → hidden layers → output layer]

Two-Layer MLP
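A minimal sketch, assuming a NumPy environment and illustrative layer sizes, of the forward pass through a two-layer MLP: a sigmoidal hidden layer (the adaptive pre-processing) followed by a linear output layer.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, b1, W2, b2):
    """Two-layer MLP: sigmoidal hidden layer, linear output (regression case).

    x: (n_in,), W1: (n_hidden, n_in), b1: (n_hidden,),
    W2: (n_out, n_hidden), b2: (n_out,)
    """
    h = sigmoid(W1 @ x + b1)   # adaptive pre-processing layer
    y = W2 @ h + b2            # linear output layer
    return y, h

# illustrative usage with arbitrary sizes
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)
y, h = mlp_forward(x, W1, b1, W2, b2)
```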

A Sigmoidal Unit

Combinations of Sigmoids

In 2-D from Pattern Classification, 2e, by Duda, Hart & Stork, © 2001 John Wiley

Universal Approximation
a linear combination of "enough" sigmoids
– Cybenko, 1989
a single hidden layer is adequate
– more may be better
choose the hidden parameters (w, b) optimally
problem solved?
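In symbols (a sketch using the notation above): a weighted sum of N sigmoidal units can approximate any continuous function on a compact domain,

```latex
f(\mathbf{x}) \;\approx\; \sum_{i=1}^{N} v_i\,\sigma\!\left(\mathbf{w}_i^{\mathsf T}\mathbf{x} + b_i\right) + v_0 ,
```

for suitable (w_i, b_i, v_i) and large enough N; finding good hidden parameters is exactly the hard part.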

Pros
compactness
– potential to obtain the same veracity with a much smaller model
– c.f. sparsity / complexity control in linear models

Compactness of Model

& Cons
parameter → cost relationship is "malign"
– optimization difficult
– many solutions possible
– effect of the hidden weights on the output is non-linear

Multi-Modal Cost Surface
[Figure: cost surface with a global minimum and local minima; which way does the gradient point?]

Heading Downhill
assume
– minimization of a cost (e.g. SSE)
– analytically intractable
step the parameters downhill: w_new = w_old + step in the right direction
backpropagation (of error)
– slow but efficient
conjugate gradients, Levenberg-Marquardt
– for preference
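A minimal sketch, for the two-layer network and SSE cost assumed above, of one steepest-descent update with gradients obtained by back-propagating the error. Plain gradient descent is shown for clarity even though, as the slide notes, conjugate gradients or Levenberg-Marquardt are usually preferred.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, z, W1, b1, W2, b2, lr=0.1):
    """One steepest-descent update, w_new = w_old - lr * dE/dw, for SSE cost."""
    # forward pass
    h = sigmoid(W1 @ x + b1)     # hidden activations
    y = W2 @ h + b2              # linear output
    e = y - z                    # error (dE/dy for E = 0.5*||y - z||^2)

    # backward pass: propagate the error through the layers
    dW2 = np.outer(e, h)
    db2 = e
    delta1 = (W2.T @ e) * h * (1.0 - h)   # sigmoid derivative is h*(1-h)
    dW1 = np.outer(delta1, x)
    db1 = delta1

    # gradient-descent step
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return 0.5 * float(e @ e)    # SSE for this example
```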

[Figure: training-curve comparison, backpropagation vs conjugate gradients]

Implications
the correct structure can still give the "wrong" answer
– dependency on initial conditions
– might be good enough
train / test (cross-validation) required
– is poor behaviour due to the network structure or to the initial conditions?
an additional dimension in development

Is It a Problem?
pros seem to outweigh cons
good solutions often arrived at quickly
all previous issues apply
– sample density
– sample distribution
– lack of prior knowledge

How to Use
to "generalize" a GLM
– linear regression (curve fitting): linear output + SSE
– logistic regression (classification): logistic output + cross-entropy (deviance)
– extends to multinomial / ordinal targets, e.g. softmax output + cross-entropy
– Poisson regression (count data)
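An illustrative sketch of these output-activation / cost pairings (function names are mine, not from the slides), each applied to the network's final linear activation a with target z:

```python
import numpy as np

def regression(a, z):
    """Linear output + sum-of-squared-errors (curve fitting)."""
    y = a
    return y, 0.5 * np.sum((y - z) ** 2)

def logistic_classification(a, z):
    """Logistic output + cross-entropy / deviance, z in {0, 1}."""
    y = 1.0 / (1.0 + np.exp(-a))
    return y, -np.sum(z * np.log(y) + (1 - z) * np.log(1 - y))

def multiclass_classification(a, z):
    """Softmax output + cross-entropy, z one-hot over the classes."""
    y = np.exp(a - a.max())
    y /= y.sum()
    return y, -np.sum(z * np.log(y))

def poisson_regression(a, z):
    """Exponential output + Poisson deviance (up to a constant), z a count."""
    y = np.exp(a)
    return y, np.sum(y - z * np.log(y))
```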

What is Learned?
the right thing, in a maximum-likelihood sense
– theoretically, the conditional mean of the target data, E(z|x)
– for classification this implies the probability of class membership, P(Ci|x), is estimated
if the estimate is good then y → E(z|x)
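A one-line sketch of why the least-squares solution targets the conditional mean: the expected squared error decomposes as

```latex
\mathbb{E}\!\left[(z - y(\mathbf{x}))^{2} \mid \mathbf{x}\right]
= \bigl(y(\mathbf{x}) - \mathbb{E}[z \mid \mathbf{x}]\bigr)^{2}
+ \operatorname{var}(z \mid \mathbf{x}),
```

so the best attainable output is y(x) = E(z|x), with the residual variance term independent of the network.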

Simple 1-D Function

More Complex 1-D Example

Local Solutions

A Dichotomy (data courtesy of B. Ripley)

Class-Conditional Probabilities

Linear Decision Boundary, induced by P(C1|x) = P(C2|x)

Class-Conditional Probabilities

Non-Linear Decision Boundary

Class-Conditional Probabilities

Over-fitted Decision Boundary

Invariances
e.g. scale, rotation, translation
– augment the training data
– design a regularizer: tangent propagation (Simard et al, 1992)
– build into the MLP structure: convolutional networks (LeCun et al, 1989)
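A minimal sketch of the first option (augmenting the training data), assuming image-like inputs stored as 2-D arrays; each example is duplicated under small translations, and rotations or re-scalings would be handled in the same spirit.

```python
import numpy as np

def augment_with_shifts(images, labels,
                        shifts=((0, 1), (0, -1), (1, 0), (-1, 0))):
    """Return the originals plus copies translated by one pixel.

    images: (n, H, W) array; labels: (n, ...) targets, unchanged by the shift.
    np.roll wraps pixels around the border, a simplification kept for brevity.
    """
    xs, ys = [images], [labels]
    for dy, dx in shifts:
        xs.append(np.roll(images, shift=(dy, dx), axis=(1, 2)))
        ys.append(labels)
    return np.concatenate(xs), np.concatenate(ys)
```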

Multi-Modal Outputs
the assumed distribution of outputs is unimodal
sometimes it is not
– e.g. inverse robot kinematics, where E(z|x) is not unique
use a mixture model for the output density
– Mixture Density Networks (Bishop, 1996)
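The mixture-density-network idea in symbols (a sketch): the mixing coefficients and component means and variances are all outputs of the network, with the number of components K fixed in advance,

```latex
p(z \mid \mathbf{x}) = \sum_{k=1}^{K} \pi_k(\mathbf{x})\,
\mathcal{N}\!\bigl(z;\ \mu_k(\mathbf{x}),\ \sigma_k^{2}(\mathbf{x})\bigr),
\qquad \sum_{k} \pi_k(\mathbf{x}) = 1 .
```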

Bayesian MLPs
the deterministic treatment gives point estimates, E(z|x)
can use the MLP as an estimator in a Bayesian treatment
– non-linearity → approximate solutions
approximate estimate of spread, E((z - y)^2 | x)

NETTalk
introduced by T. Sejnowski & C. Rosenberg
first realistic ANN demo
English text-to-speech: English spelling is non-phonetic
context sensitive
– brave, gave, save, have (within word)
– read, read (within sentence)
mimicked DECtalk (rule-based)

Representing Text
29 "letters" (a–z plus space, comma, full stop)
– 29 binary inputs indicating the letter
context
– local (within word)
– global (sentence)
7 letters at a time: the middle letter → AFs (articulatory features)
– 3 letters each side give context, e.g. boundaries
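A sketch of this input coding, assuming the 29-symbol alphabet above and a 7-letter window (3 letters of context either side of the letter being pronounced); the exact symbol set and ordering used in NETTalk may differ.

```python
import numpy as np

ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + [" ", ",", "."]  # 29 "letters"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def encode_window(text, centre, half_width=3):
    """One-of-29 coding of a 7-letter window centred on position `centre`.

    Returns a 7 x 29 = 203 element binary input vector; positions falling
    outside the text, and unknown characters, are coded as spaces.
    """
    x = np.zeros((2 * half_width + 1, len(ALPHABET)))
    for slot, pos in enumerate(range(centre - half_width, centre + half_width + 1)):
        ch = text[pos] if 0 <= pos < len(text) else " "
        x[slot, INDEX.get(ch, INDEX[" "])] = 1.0
    return x.ravel()

x = encode_window("brave gave save have", centre=2)  # window centred on the 'a' in "brave"
```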

Coding

NETTalk Performance
– 1024 words of continuous, informal speech
– 1000 most common words
best-performing MLP
– 26 outputs, 7×29 inputs, 80 hidden units
– 309 units, 18,320 weights
– 94% training accuracy
– 85% testing accuracy
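For reference, these counts are consistent with the stated architecture: 7 × 29 + 80 + 26 = 203 + 80 + 26 = 309 units, and 203 × 80 + 80 × 26 = 16,240 + 2,080 = 18,320 weights (excluding biases).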