The Multi-Layer Perceptron: non-linear modelling by adaptive pre-processing
Rob Harrison, AC&SE


[Diagram: input data x → pre-processing layer → adaptive layer → output y, compared with target z to give error e]

Adaptive Basis Functions
"linear" models
– fixed pre-processing
– parameter → cost relationship is "benign": easy to optimize
but
– combinatorial (the number of fixed basis functions needed grows rapidly)
– arbitrary choices
what is the best pre-processor to choose?
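For comparison, a fixed-basis "linear" model (linear in its parameters) and the adaptive alternative can be written as below; in the MLP the basis functions themselves carry learnable parameters. The notation is mine, chosen to match the later slides.

```latex
% fixed pre-processing: linear in the parameters w_j
y(\mathbf{x}) = \sum_{j=1}^{M} w_j\,\phi_j(\mathbf{x}) + w_0
% adaptive pre-processing (MLP): the basis functions are learned too
\phi_j(\mathbf{x}) = \sigma\!\left(\mathbf{w}_j^{\mathsf T}\mathbf{x} + b_j\right)
```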

The Multi-Layer Perceptron
formulated from loose biological principles
popularized mid-1980s
– Rumelhart, Hinton & Williams 1986 (earlier: Werbos 1974, Ho 1964)
"learns" the pre-processing stage from data
layered, feed-forward structure
– sigmoidal pre-processing
– task-specific output
a non-linear model

Generic MLP
[Diagram: input layer → hidden layers → output layer]

Two-Layer MLP
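A minimal sketch, assuming a NumPy environment and illustrative layer sizes, of the forward pass through a two-layer MLP: a sigmoidal hidden layer (the adaptive pre-processing) followed by a linear output layer.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, b1, W2, b2):
    """Two-layer MLP: sigmoidal hidden layer, linear output (regression case).

    x: (n_in,), W1: (n_hidden, n_in), b1: (n_hidden,),
    W2: (n_out, n_hidden), b2: (n_out,)
    """
    h = sigmoid(W1 @ x + b1)   # adaptive pre-processing layer
    y = W2 @ h + b2            # linear output layer
    return y, h

# illustrative usage with arbitrary sizes
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)
y, h = mlp_forward(x, W1, b1, W2, b2)
```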

A Sigmoidal Unit

Combinations of Sigmoids

In 2-D from Pattern Classification, 2e, by Duda, Hart & Stork, © 2001 John Wiley

Universal Approximation
a linear combination of "enough" sigmoids
– Cybenko, 1989
a single hidden layer is adequate
– more may be better
choose the hidden parameters (w, b) optimally
problem solved?
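In symbols (a sketch using the notation above): a weighted sum of N sigmoidal units can approximate any continuous function on a compact domain,

```latex
f(\mathbf{x}) \;\approx\; \sum_{i=1}^{N} v_i\,\sigma\!\left(\mathbf{w}_i^{\mathsf T}\mathbf{x} + b_i\right) + v_0 ,
```

for suitable (w_i, b_i, v_i) and large enough N; finding good hidden parameters is exactly the hard part.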

Pros
compactness
– potential to obtain the same veracity with a much smaller model
– c.f. sparsity / complexity control in linear models

Compactness of Model

& Cons
parameter → cost relationship is "malign"
– optimization difficult
– many solutions possible
– effect of the hidden weights on the output is non-linear

Multi-Modal Cost Surface
[Figure: cost surface with a global minimum and local minima; which way does the gradient point?]

Heading Downhill
assume
– minimization of a cost (e.g. SSE)
– analytically intractable
step the parameters downhill: w_new = w_old + step in the right direction
backpropagation (of error)
– slow but efficient
conjugate gradients, Levenberg-Marquardt
– for preference
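A minimal sketch, for the two-layer network and SSE cost assumed above, of one steepest-descent update with gradients obtained by back-propagating the error. Plain gradient descent is shown for clarity even though, as the slide notes, conjugate gradients or Levenberg-Marquardt are usually preferred.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, z, W1, b1, W2, b2, lr=0.1):
    """One steepest-descent update, w_new = w_old - lr * dE/dw, for SSE cost."""
    # forward pass
    h = sigmoid(W1 @ x + b1)     # hidden activations
    y = W2 @ h + b2              # linear output
    e = y - z                    # error (dE/dy for E = 0.5*||y - z||^2)

    # backward pass: propagate the error through the layers
    dW2 = np.outer(e, h)
    db2 = e
    delta1 = (W2.T @ e) * h * (1.0 - h)   # sigmoid derivative is h*(1-h)
    dW1 = np.outer(delta1, x)
    db1 = delta1

    # gradient-descent step
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return 0.5 * float(e @ e)    # SSE for this example
```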

[Figure: training-curve comparison, backpropagation vs conjugate gradients]

Implications
the correct structure can still give the "wrong" answer
– dependency on initial conditions
– might be good enough
train / test (cross-validation) required
– is poor behaviour due to the network structure or to the initial conditions?
an additional dimension in development

Is It a Problem?
pros seem to outweigh cons
good solutions often arrived at quickly
all previous issues apply
– sample density
– sample distribution
– lack of prior knowledge

How to Use
to "generalize" a GLM
– linear regression (curve fitting): linear output + SSE
– logistic regression (classification): logistic output + cross-entropy (deviance)
– extends to multinomial / ordinal targets, e.g. softmax output + cross-entropy
– Poisson regression (count data)
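An illustrative sketch of these output-activation / cost pairings (function names are mine, not from the slides), each applied to the network's final linear activation a with target z:

```python
import numpy as np

def regression(a, z):
    """Linear output + sum-of-squared-errors (curve fitting)."""
    y = a
    return y, 0.5 * np.sum((y - z) ** 2)

def logistic_classification(a, z):
    """Logistic output + cross-entropy / deviance, z in {0, 1}."""
    y = 1.0 / (1.0 + np.exp(-a))
    return y, -np.sum(z * np.log(y) + (1 - z) * np.log(1 - y))

def multiclass_classification(a, z):
    """Softmax output + cross-entropy, z one-hot over the classes."""
    y = np.exp(a - a.max())
    y /= y.sum()
    return y, -np.sum(z * np.log(y))

def poisson_regression(a, z):
    """Exponential output + Poisson deviance (up to a constant), z a count."""
    y = np.exp(a)
    return y, np.sum(y - z * np.log(y))
```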

What is Learned?
the right thing, in a maximum-likelihood sense
– theoretically, the conditional mean of the target data, E(z|x)
– for classification this implies the probability of class membership, P(Ci|x), is estimated
if the estimate is good then y → E(z|x)
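A one-line sketch of why the least-squares solution targets the conditional mean: the expected squared error decomposes as

```latex
\mathbb{E}\!\left[(z - y(\mathbf{x}))^{2} \mid \mathbf{x}\right]
= \bigl(y(\mathbf{x}) - \mathbb{E}[z \mid \mathbf{x}]\bigr)^{2}
+ \operatorname{var}(z \mid \mathbf{x}),
```

so the best attainable output is y(x) = E(z|x), with the residual variance term independent of the network.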

Simple 1-D Function

More Complex 1-D Example

Local Solutions

A Dichotomy (data courtesy of B. Ripley)

Class-Conditional Probabilities

Linear Decision Boundary, induced by P(C1|x) = P(C2|x)

Class-Conditional Probabilities

Non-Linear Decision Boundary

Class-Conditional Probabilities

Over-fitted Decision Boundary

Invariances
e.g. scale, rotation, translation
– augment the training data
– design a regularizer: tangent propagation (Simard et al, 1992)
– build into the MLP structure: convolutional networks (LeCun et al, 1989)
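A minimal sketch of the first option (augmenting the training data), assuming image-like inputs stored as 2-D arrays; each example is duplicated under small translations, and rotations or re-scalings would be handled in the same spirit.

```python
import numpy as np

def augment_with_shifts(images, labels,
                        shifts=((0, 1), (0, -1), (1, 0), (-1, 0))):
    """Return the originals plus copies translated by one pixel.

    images: (n, H, W) array; labels: (n, ...) targets, unchanged by the shift.
    np.roll wraps pixels around the border, a simplification kept for brevity.
    """
    xs, ys = [images], [labels]
    for dy, dx in shifts:
        xs.append(np.roll(images, shift=(dy, dx), axis=(1, 2)))
        ys.append(labels)
    return np.concatenate(xs), np.concatenate(ys)
```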

Multi-Modal Outputs
the assumed distribution of outputs is unimodal
sometimes it is not
– e.g. inverse robot kinematics, where E(z|x) is not unique
use a mixture model for the output density
– Mixture Density Networks (Bishop, 1996)
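The mixture-density-network idea in symbols (a sketch): the mixing coefficients and component means and variances are all outputs of the network, with the number of components K fixed in advance,

```latex
p(z \mid \mathbf{x}) = \sum_{k=1}^{K} \pi_k(\mathbf{x})\,
\mathcal{N}\!\bigl(z;\ \mu_k(\mathbf{x}),\ \sigma_k^{2}(\mathbf{x})\bigr),
\qquad \sum_{k} \pi_k(\mathbf{x}) = 1 .
```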

Bayesian MLPs
the deterministic treatment gives point estimates, E(z|x)
can use the MLP as an estimator in a Bayesian treatment
– non-linearity → approximate solutions
approximate estimate of spread, E((z - y)^2 | x)

NETTalk
introduced by T. Sejnowski & C. Rosenberg
first realistic ANN demo
English text-to-speech: English spelling is non-phonetic
context sensitive
– brave, gave, save, have (within word)
– read, read (within sentence)
mimicked DECtalk (rule-based)

Representing Text
29 "letters" (a–z plus space, comma, full stop)
– 29 binary inputs indicating the letter
context
– local (within word)
– global (sentence)
7 letters at a time: the middle letter → AFs (articulatory features)
– 3 letters each side give context, e.g. boundaries
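A sketch of this input coding, assuming the 29-symbol alphabet above and a 7-letter window (3 letters of context either side of the letter being pronounced); the exact symbol set and ordering used in NETTalk may differ.

```python
import numpy as np

ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + [" ", ",", "."]  # 29 "letters"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def encode_window(text, centre, half_width=3):
    """One-of-29 coding of a 7-letter window centred on position `centre`.

    Returns a 7 x 29 = 203 element binary input vector; positions falling
    outside the text, and unknown characters, are coded as spaces.
    """
    x = np.zeros((2 * half_width + 1, len(ALPHABET)))
    for slot, pos in enumerate(range(centre - half_width, centre + half_width + 1)):
        ch = text[pos] if 0 <= pos < len(text) else " "
        x[slot, INDEX.get(ch, INDEX[" "])] = 1.0
    return x.ravel()

x = encode_window("brave gave save have", centre=2)  # window centred on the 'a' in "brave"
```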

Coding

NETTalk Performance
– 1024 words of continuous, informal speech
– 1000 most common words
best-performing MLP
– 26 outputs, 7×29 inputs, 80 hidden units
– 309 units, 18,320 weights
– 94% training accuracy
– 85% testing accuracy
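For reference, these counts are consistent with the stated architecture: 7 × 29 + 80 + 26 = 203 + 80 + 26 = 309 units, and 203 × 80 + 80 × 26 = 16,240 + 2,080 = 18,320 weights (excluding biases).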