Artificial Intelligence 13. Multi-Layer ANNs Course V231 Department of Computing Imperial College © Simon Colton

Multi-Layer Networks Built from Perceptron Units Perceptrons are not able to learn certain concepts – They can only learn linearly separable functions But they can be the basis for larger structures – Which can learn more sophisticated concepts, such as XOR (see the sketch below) – We say that such networks are built from “perceptron units”
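As an illustration (not part of the original slides), XOR is a standard example of a concept a single perceptron cannot learn, but a two-layer structure of perceptron units can represent it. A minimal Python sketch with hand-picked weights and thresholds:

```python
def step(weighted_sum):
    """A perceptron unit: fires (outputs 1) when its weighted sum exceeds 0."""
    return 1 if weighted_sum > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)       # behaves like OR
    h2 = step(x1 + x2 - 1.5)       # behaves like AND
    return step(h1 - h2 - 0.5)     # OR and not AND = XOR

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # [0, 1, 1, 0]
```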

Problem With Perceptron Units The learning rule relies on differential calculus – Finding minima by differentiating, etc. Step functions aren’t differentiable – They are not continuous at the threshold Alternative threshold function sought – Must be differentiable – Must be similar to step function i.e., exhibit a threshold so that units can “fire” or not fire Sigmoid units used for backpropagation – There are other alternatives that are often used

Sigmoid Units Take in a weighted sum of the inputs, S, and output:
– σ(S) = 1/(1 + e^-S)
Advantages: – Looks very similar to the step function – Is differentiable – Derivative easily expressible in terms of σ itself:
– dσ(S)/dS = σ(S)(1 - σ(S))
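A minimal Python sketch (my own, not from the slides) of the sigmoid and its derivative, showing how the derivative is expressed in terms of σ itself:

```python
import math

def sigmoid(s):
    """sigma(S) = 1 / (1 + e^-S): a smooth, differentiable version of the step function."""
    return 1.0 / (1.0 + math.exp(-s))

def sigmoid_derivative(s):
    """The derivative expressed in terms of sigma itself: sigma(S) * (1 - sigma(S))."""
    y = sigmoid(s)
    return y * (1.0 - y)

print(sigmoid(-7), sigmoid(0), sigmoid(7))   # ~0.0009, 0.5, 0.9991 - behaves like a soft step
print(sigmoid_derivative(0))                 # 0.25, the steepest point of the curve
```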

Example ANN with Sigmoid Units Feed-forward network – Feed inputs in on the left, propagate numbers forward Suppose we have this ANN – With weights set arbitrarily [The network in the figure has three input units, two hidden units (H1, H2) and two output units (O1, O2); the weights, used in the calculations below, are 0.2, -0.1, 0.4 into H1; 0.7, -1.2, 1.2 into H2; 1.1, 0.1 from (H1, H2) into O1; and 3.1, 1.17 from (H1, H2) into O2]

Propagation of Example Suppose the input to the ANN is (10, 30, 20) First calculate the weighted sums to the hidden layer:
– S_H1 = (0.2*10) + (-0.1*30) + (0.4*20) = 2 - 3 + 8 = 7
– S_H2 = (0.7*10) + (-1.2*30) + (1.2*20) = 7 - 36 + 24 = -5
Next calculate the output from the hidden layer:
– Using: σ(S) = 1/(1 + e^-S)
– σ(S_H1) = 1/(1 + e^-7) = 1/(1.0009) = 0.999
– σ(S_H2) = 1/(1 + e^5) = 1/(149.4) = 0.0067
– So, H1 has fired, H2 has not

Propagation of Example Next calculate the weighted sums into the output layer:
– S_O1 = (1.1 * 0.999) + (0.1 * 0.0067) = 1.0996
– S_O2 = (3.1 * 0.999) + (1.17 * 0.0067) = 3.1047
Finally, calculate the output from the ANN:
– σ(S_O1) = 1/(1 + e^-1.0996) = 1/(1.333) = 0.750
– σ(S_O2) = 1/(1 + e^-3.1047) = 1/(1.045) = 0.957
Output from O2 > output from O1 – So, the ANN predicts the category associated with O2 – For the example input (10, 30, 20)
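The forward pass above can be reproduced with a short Python sketch; the weights and inputs are those of the worked example, while the variable names are my own:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

x = [10, 30, 20]                            # the example input
w_hidden = {"H1": [0.2, -0.1, 0.4],         # weights from the inputs into H1
            "H2": [0.7, -1.2, 1.2]}         # weights from the inputs into H2
w_output = {"O1": [1.1, 0.1],               # weights from (H1, H2) into O1
            "O2": [3.1, 1.17]}              # weights from (H1, H2) into O2

# Hidden layer: weighted sums, then the sigmoid
h = {name: sigmoid(sum(w * xi for w, xi in zip(ws, x)))
     for name, ws in w_hidden.items()}      # H1 ~ 0.999 (fired), H2 ~ 0.0067 (did not)

# Output layer: weighted sums of the hidden outputs, then the sigmoid
o = {name: sigmoid(sum(w * hi for w, hi in zip(ws, [h["H1"], h["H2"]])))
     for name, ws in w_output.items()}      # O1 ~ 0.750, O2 ~ 0.957

print(h, o)    # the ANN predicts the category of the larger output (O2 here)
```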

Backpropagation Learning Algorithm Same task as with perceptrons – Learn a multi-layer ANN to correctly categorise unseen examples – We’ll concentrate on ANNs with one hidden layer Overview of the routine – Fix the architecture and the sigmoid units within it i.e., the number of units in the hidden layer; the way the input units represent an example; the way the output units categorise examples – Randomly assign weights to the whole network Use small values (between -0.5 and 0.5) – Use each example in the training set to retrain the weights – Have multiple epochs (iterations through the training set) Until some termination condition is met (not necessarily 100% accuracy)

Weight Training Calculations (Overview) Use notation w_ij to specify: – Weight between unit i and unit j Look at the calculation with respect to example E Going to calculate a value Δ_ij for each w_ij – And add Δ_ij on to w_ij Do this by calculating error terms for each unit The error term for output units is found – And then this information is used to calculate the error terms for the hidden units So, the error is propagated back through the ANN

Propagate E through the Network Feed E through the network (as in the example above) Record the target and observed values for example E – i.e., determine the weighted sums from the hidden units, do the sigmoid calculation – Let t_i(E) be the target value for output unit i – Let o_i(E) be the observed value for output unit i Note that for categorisation learning tasks, – Each t_i(E) will be 0, except for a single t_j(E), which will be 1 – But o_i(E) will be a real valued number between 0 and 1 Also record the outputs from the hidden units – Let h_i(E) be the output from hidden unit i

Error Terms for Each Unit The error term for output unit k is calculated as:
– δ_Ok = o_k(E) (1 - o_k(E)) (t_k(E) - o_k(E))
The error term for hidden unit k is:
– δ_Hk = h_k(E) (1 - h_k(E)) Σ_j (w_kj δ_Oj), where the sum is over the output units j
In English: – For hidden unit k, add together all the errors for the output units, multiplied by the appropriate weight. – Then multiply this sum by h_k(E)(1 - h_k(E))

Final Calculations Choose a learning rate, η (= 0.1 again, perhaps) For each weight w_ij between input unit i and hidden unit j:
– Calculate: Δ_ij = η δ_Hj x_i
– Where x_i is the input to the system at input unit i for E
For each weight w_ij between hidden unit i and output unit j:
– Calculate: Δ_ij = η δ_Oj h_i(E)
– Where h_i(E) is the output from hidden unit i for E
Finally, add each Δ_ij on to w_ij
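Putting the error terms and the weight updates together, here is a minimal, runnable Python sketch of the routine for one hidden layer (my own code, not the course's; bias inputs, momentum and a proper termination condition are left out for brevity):

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def backprop_train(examples, n_in, n_hidden, n_out, eta=0.1, epochs=1000):
    """A sketch of backpropagation for one hidden layer. `examples` is a list of
    (inputs, targets) pairs; bias inputs, momentum and a termination test are omitted."""
    # Randomly assign small weights (between -0.5 and 0.5) to the whole network
    w_h = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w_o = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]

    for _ in range(epochs):                              # multiple epochs through the set
        for x, t in examples:                            # retrain on each example in turn
            # Propagate the example through the network
            h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_h]
            o = [sigmoid(sum(w * hi for w, hi in zip(ws, h))) for ws in w_o]
            # Error terms: delta_Ok = o_k(1 - o_k)(t_k - o_k)
            d_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # delta_Hk = h_k(1 - h_k) * sum_j w_kj * delta_Oj
            d_h = [hk * (1 - hk) * sum(w_o[j][k] * d_o[j] for j in range(n_out))
                   for k, hk in enumerate(h)]
            # Weight changes: eta * (error term of downstream unit) * (input along the weight)
            for j in range(n_out):
                for k in range(n_hidden):
                    w_o[j][k] += eta * d_o[j] * h[k]
            for k in range(n_hidden):
                for i in range(n_in):
                    w_h[k][i] += eta * d_h[k] * x[i]
    return w_h, w_o

# Toy run only (no claim about what is learned; a real run would add bias units
# and a termination condition on the network error):
data = [([0.0, 0.0], [0.0]), ([1.0, 1.0], [1.0])]
backprop_train(data, n_in=2, n_hidden=2, n_out=1, epochs=10)
```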

Worked Backpropagation Example Start with the previous ANN We will retrain the weights – In the light of example E = (10,30,20) – Stipulate that E should have been categorised as O1 – Will use a learning rate of η = 0.1

Previous Calculations Need the calculations from when we propagated E through the ANN: t_1(E) = 1 and t_2(E) = 0 [from the categorisation] o_1(E) = 0.750 and o_2(E) = 0.957

Error Values for Output Units t_1(E) = 1 and t_2(E) = 0 [from the categorisation] o_1(E) = 0.750 and o_2(E) = 0.957 So:
– δ_O1 = o_1(E)(1 - o_1(E))(t_1(E) - o_1(E)) = 0.750 * 0.250 * 0.250 ≈ 0.0469
– δ_O2 = o_2(E)(1 - o_2(E))(t_2(E) - o_2(E)) = 0.957 * 0.043 * (-0.957) ≈ -0.0394

Error Values for Hidden Units δ_O1 = 0.0469 and δ_O2 = -0.0394 h_1(E) = 0.999 and h_2(E) = 0.0067 So, for H1, we add together:
– (w_11*δ_O1) + (w_12*δ_O2) = (1.1*0.0469) + (3.1*-0.0394) ≈ -0.0706
– And multiply by: h_1(E)(1-h_1(E)) to give us: -0.0706 * (0.999 * (1 - 0.999)) ≈ -0.0000705 = δ_H1
For H2, we add together:
– (w_21*δ_O1) + (w_22*δ_O2) = (0.1*0.0469) + (1.17*-0.0394) ≈ -0.0414
– And multiply by: h_2(E)(1-h_2(E)) to give us: -0.0414 * (0.0067 * (1 - 0.0067)) ≈ -0.000275 = δ_H2

Calculation of Weight Changes For weights between the input and hidden layer, using Δ_ij = η δ_Hj x_i with η = 0.1 and input (10, 30, 20):
– Δ(I1→H1) = 0.1 * -0.0000705 * 10 ≈ -0.0000705
– Δ(I2→H1) = 0.1 * -0.0000705 * 30 ≈ -0.00021
– Δ(I3→H1) = 0.1 * -0.0000705 * 20 ≈ -0.00014
– Δ(I1→H2) = 0.1 * -0.000275 * 10 ≈ -0.000275
– Δ(I2→H2) = 0.1 * -0.000275 * 30 ≈ -0.000825
– Δ(I3→H2) = 0.1 * -0.000275 * 20 ≈ -0.00055

Calculation of Weight Changes For weights between the hidden and output layer, using Δ_ij = η δ_Oj h_i(E):
– Δ(H1→O1) = 0.1 * 0.0469 * 0.999 ≈ 0.0047
– Δ(H2→O1) = 0.1 * 0.0469 * 0.0067 ≈ 0.00003
– Δ(H1→O2) = 0.1 * -0.0394 * 0.999 ≈ -0.0039
– Δ(H2→O2) = 0.1 * -0.0394 * 0.0067 ≈ -0.00003
Weight changes are not very large – Small differences in weights can make big differences in calculations – But it might be a good idea to increase η
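The figures above can be checked with a short script that starts from the rounded values of the forward pass (the last digits may differ slightly because of rounding):

```python
# Rounded values carried over from the forward pass of E = (10, 30, 20)
x = [10, 30, 20]                       # inputs
h = [0.999, 0.0067]                    # hidden outputs h_1(E), h_2(E)
o = [0.750, 0.957]                     # observed outputs o_1(E), o_2(E)
t = [1, 0]                             # targets t_1(E), t_2(E): category O1
w_o = [[1.1, 0.1], [3.1, 1.17]]        # weights from (H1, H2) into O1 and into O2
eta = 0.1

# Error terms
d_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
d_h = [hk * (1 - hk) * sum(w_o[j][k] * d_o[j] for j in range(2))
       for k, hk in enumerate(h)]
print(d_o)    # approximately [0.0469, -0.0394]
print(d_h)    # approximately [-7.0e-05, -0.000275]

# Weight changes
print([[eta * d_h[k] * xi for xi in x] for k in range(2)])   # input -> hidden
print([[eta * d_o[j] * hk for hk in h] for j in range(2)])   # hidden -> output
```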

Calculation of Network Error Could calculate the network error as – Proportion of mis-categorised examples But there are multiple output units, with numerical output – So we use a more sophisticated measure:
– Error = Σ over every training example E and every output unit k of (t_k(E) - o_k(E))^2
Not as complicated as it looks – Square the difference between target and observed Squaring ensures we get a positive number Add up all the squared differences – For every output unit and every example in the training set
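A small sketch of this error measure, assuming targets and observed outputs are held as lists of per-example vectors:

```python
def network_error(targets, observed):
    """Sum of squared differences between target and observed values,
    over every output unit and every example in the training set."""
    return sum((t - o) ** 2
               for t_vec, o_vec in zip(targets, observed)
               for t, o in zip(t_vec, o_vec))

# For the single worked example above (targets (1, 0), observed (0.750, 0.957)):
print(network_error([[1, 0]], [[0.750, 0.957]]))    # (1 - 0.750)^2 + (0 - 0.957)^2 ~ 0.978
```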

Problems with Local Minima Backpropagation is a gradient descent search – Where the height of the hills is determined by the error – But there are many dimensions to the space One for each weight in the network Therefore backpropagation – Can find its way into local minima One partial solution: – Random re-start: learn lots of networks Starting with different random weight settings – Can take the best network – Or can set up a “committee” of networks to categorise examples (see the sketch below) Another partial solution: Momentum
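A sketch of the random re-start idea; `train_fn`, `evaluate_fn` and `predict_fn` are placeholders for whatever training, validation and prediction routines are in use, not functions from the course:

```python
import random

def random_restart(train_fn, evaluate_fn, n_restarts=5):
    """Learn several networks from different random weight settings (placeholder
    train_fn builds and trains one network; evaluate_fn returns its error)."""
    best_net, best_err = None, float("inf")
    for seed in range(n_restarts):
        random.seed(seed)                 # different random initial weights each time
        net = train_fn()
        err = evaluate_fn(net)
        if err < best_err:
            best_net, best_err = net, err
    return best_net                       # keep the best network found

def committee_predict(networks, predict_fn, example):
    """Alternatively, keep all the networks and let them vote on the category."""
    votes = [predict_fn(net, example) for net in networks]
    return max(set(votes), key=votes.count)
```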

Adding Momentum Imagine rolling a ball down a hill [Figure: without momentum, the ball gets stuck in a shallow local dip; with momentum, it rolls on through to the bottom]

Momentum in Backpropagation For each weight – Remember what was added in the previous epoch In the current epoch – Add on a small amount of the previous Δ The amount is determined by – The momentum parameter, denoted α – α is taken to be between 0 and 1

How Momentum Works If the direction of the weight change doesn’t change – Then the movement of the search gets bigger – The extra amount added is compounded in each epoch – May mean that narrow local minima are avoided – May also mean that the convergence rate speeds up Caution: – May not have enough momentum to get out of local minima – Also, too much momentum might carry the search back out of the global minimum, into a local minimum
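A minimal sketch of the momentum update described on the last two slides; the small demo shows how the step size compounds when the weight keeps moving in the same direction:

```python
def momentum_update(weight, delta, previous_delta, alpha=0.9):
    """Add on a fraction (alpha, the momentum parameter, between 0 and 1)
    of the change made in the previous epoch."""
    change = delta + alpha * previous_delta
    return weight + change, change          # remember `change` for the next epoch

# If the weight keeps changing in the same direction, the steps compound:
w, prev = 0.0, 0.0
for _ in range(5):
    w, prev = momentum_update(w, delta=0.1, previous_delta=prev)
    print(round(prev, 3))                   # roughly 0.1, 0.19, 0.27, 0.34, 0.41
```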

Problems with Overfitting Plot the training example error versus the test example error over the training epochs: at some point the test set error starts increasing! – The network is overfitting the data – Learning idiosyncrasies in the data, not general principles – A big problem in Machine Learning (ANNs in particular)

Avoiding Overfitting Bad idea to use training set accuracy to terminate One alternative: Use a validation set – Hold back some of the training set during training – Like a miniature test set (not used to train weights at all) – If the validation set error stops decreasing, but the training set error continues decreasing Then it’s likely that overfitting has started to occur, so stop – Be careful, because the validation set error could get into a local minimum of its own Worthwhile running the training for longer, to wait and see Another alternative: use a weight decay factor – Take a small amount off every weight after each epoch – Networks with smaller weights aren’t as highly fine-tuned (overfit)
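Two small sketches of these ideas (my own, with assumed parameter names such as `patience` and `decay`): an early-stopping test on the validation error, and a weight-decay step applied after each epoch:

```python
def should_stop(validation_errors, patience=5):
    """Early-stopping check (a sketch): stop when the validation error has not
    improved for `patience` consecutive epochs, rather than at the first rise,
    since the validation error can dip into a local minimum of its own."""
    if len(validation_errors) <= patience:
        return False
    best_before = min(validation_errors[:-patience])
    return min(validation_errors[-patience:]) >= best_before

def weight_decay(weights, decay=0.001):
    """Weight decay (a sketch): shrink every weight slightly after each epoch,
    one common way of 'taking a small amount off' each weight."""
    return [[w * (1.0 - decay) for w in unit_weights] for unit_weights in weights]
```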

Suitable Problems for ANNs Examples and target categorisations – Can be expressed as real values ANNs are just fancy numerical functions Predictive accuracy is more important – Than understanding what the machine has learned Black-box, non-symbolic approach, not easy to digest Slow training times are OK – Can take hours and days to train networks Execution of the learned function must be quick – Learned networks can categorise very quickly Very useful in time-critical situations (is that a tank, car or old lady?) Errors (noise) in the data are expected – ANNs are fairly robust to noise in data