2 Overview
• Motivation & Goals
• Perceptron Learning
• Gradient Algorithms & the Δ-Rule
• Multi-Layer Nets
• The Backpropagation Algorithm
• Example Application: Recognition of Faces
• More Network Architectures
• Application Areas of ANNs
3 Model: The Brain
• A complex learning system built from simple learning units: the neurons.
• A network of ~ neurons, where each neuron has ~ connections.
• Transmission time of a neuron: ~ (speed versus flexibility).
• Observation: face recognition time = ~ → parallelism.
4 Goals of ANNs
• Learning instead of programming.
• Learning complex functions with simple learning units.
• Parallel computation (e.g. layer model).
• The network parameters are found automatically by a learning algorithm.
• An ANN ↔ black box: input → output.
5 When are ANNs used?
• Input instances are described as vectors of discrete or real values.
• The output of the target function is a single value or a vector of discrete or real-valued attributes.
• The input data contains noise.
• The target function is unknown or difficult to describe.
6 The Perceptron (as a NN Unit) (1/2)
A linear unit with a threshold.
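A minimal sketch of such a thresholded linear unit, assuming outputs of +1 and −1 and a threshold folded into a bias term (the function name and the example weights are illustrative, not taken from the slides):

```python
def perceptron_output(weights, inputs, bias):
    """Thresholded linear unit: +1 if the weighted sum is positive, else -1."""
    weighted_sum = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > 0 else -1


# Hypothetical weights: with bias -0.3 and both weights 0.5 the unit behaves like OR.
print(perceptron_output((0.5, 0.5), (0, 0), bias=-0.3))
print(perceptron_output((0.5, 0.5), (1, 0), bias=-0.3))
```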
8 Geometrical Classification (Decision Surface)
• A perceptron can classify only linearly separable training data. → We need networks of these units.
• Linearly separable example: the OR function.
• Not linearly separable example: the XOR function.
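The difference between OR and XOR can be illustrated with a small brute-force search. The sketch below tries every weight triple on a coarse grid and reports whether any single threshold unit classifies all four points; it is an illustration under these assumptions (grid range, ±1 targets), not a proof of non-separability:

```python
from itertools import product


def classify(w0, w1, w2, x1, x2):
    """Single threshold unit with bias w0 and input weights w1, w2."""
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1


def separable_on_grid(examples, grid):
    """True if some weight triple from the grid classifies every example correctly."""
    return any(
        all(classify(w0, w1, w2, x1, x2) == t for (x1, x2), t in examples)
        for w0, w1, w2 in product(grid, repeat=3)
    )


OR = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
GRID = [i / 10 for i in range(-10, 11)]  # candidate weights from -1.0 to 1.0

print("OR separable on grid: ", separable_on_grid(OR, GRID))
print("XOR separable on grid:", separable_on_grid(XOR, GRID))
```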
9 The Perceptron Learning Rule (1/2)
• Training a perceptron = learning the best hypothesis, i.e. one that classifies all training data correctly.
• A hypothesis = a vector of weights.
10 The Perceptron Learning Rule (2/2)
Idea:
1. Initialise the weights with random values.
2. Apply the perceptron iteratively to each training example and modify the weights according to the learning rule $w_i \leftarrow w_i + \Delta w_i$ with $\Delta w_i = \eta\,(t - o)\,x_i$, where:
   · t: target output
   · o: actual output
   · η: the learning rate
   · x_i: the i-th input
Step 2 is repeated for all training examples until all of them are correctly classified.
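The two steps can be sketched as follows, assuming the usual form of the rule, $\Delta w_i = \eta (t - o) x_i$, targets of ±1, and a bias weight with constant input 1. The function names, the seed and the epoch cap are illustrative choices:

```python
import random


def predict(weights, x):
    """Thresholded output; weights[0] is the bias (constant input 1)."""
    s = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if s > 0 else -1


def train_perceptron(examples, eta=0.1, max_epochs=1000, seed=0):
    rng = random.Random(seed)
    n = len(examples[0][0])
    # Step 1: small random initial weights.
    weights = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]
    # Step 2: sweep over the examples until none is misclassified.
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = predict(weights, x)
            if o != t:
                mistakes += 1
                weights[0] += eta * (t - o)  # bias input is 1
                for i, xi in enumerate(x):
                    weights[i + 1] += eta * (t - o) * xi
        if mistakes == 0:
            break
    return weights


OR_EXAMPLES = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights = train_perceptron(OR_EXAMPLES)
print([predict(weights, x) for x, _ in OR_EXAMPLES])
```

The epoch cap is only a safety net: on data that is not linearly separable the loop would otherwise never terminate.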
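11 The Perceptron Learning Rule: Convergence
The perceptron learning rule converges if:
• the training examples are linearly separable, and
• η is chosen small enough (e.g. 0.1).
Intuitive explanation: if an example is already classified correctly (t = o), the weights stay unchanged; if the output is too low, the weights of the positive inputs are increased, and if it is too high, they are decreased.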
12 The Gradient Descent Algorithm & the Δ-Rule (1/5)
• Better: the Δ-rule converges even if the training examples are not linearly separable.
• Idea: use gradient descent to search the hypothesis space for the best hypothesis, i.e. the one that minimises the squared error.
• ⇒ Basis of the backpropagation algorithm.
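13 The Gradient Descent Algorithm & the Δ-Rule (2/5)
Because its output is continuous (and therefore differentiable), the Δ-rule is applied to a linear unit instead of the thresholded perceptron.
Linear unit: $o(\vec{x}) = \vec{w} \cdot \vec{x}$
The squared error to be minimised: $E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$, where:
• $D$: set of training examples
• $t_d$: target output of example d
• $o_d$: computed output of example d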
14 The Gradient Descent Algorithm & the Δ-Rule (3/5)
Geometric interpretation: the error function plotted over the hypothesis space (e.g. 2-dimensional).
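15 The Gradient Descent Algorithm & the Δ-Rule (4/5)
Derivation
Gradient of the squared error E: $\nabla E(\vec{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$
Learning rule: $\vec{w} \leftarrow \vec{w} + \Delta\vec{w}$ with $\Delta\vec{w} = -\eta\,\nabla E(\vec{w})$, i.e. $\Delta w_i = -\eta\,\frac{\partial E}{\partial w_i}$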
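For reference, the individual steps of the derivation can be written out as below. This assumes the squared error $E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$ over the training set $D$ and a linear unit $o_d = \vec{w} \cdot \vec{x}_d$; $x_{id}$ denotes the i-th input of example d, a symbol introduced here for illustration:

```latex
\begin{align*}
\frac{\partial E}{\partial w_i}
  &= \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_{d \in D}(t_d - o_d)^2 \\
  &= \sum_{d \in D}(t_d - o_d)\,\frac{\partial}{\partial w_i}\bigl(t_d - \vec{w}\cdot\vec{x}_d\bigr) \\
  &= -\sum_{d \in D}(t_d - o_d)\,x_{id} \\
\Delta w_i &= -\eta\,\frac{\partial E}{\partial w_i}
  = \eta\sum_{d \in D}(t_d - o_d)\,x_{id}
\end{align*}
```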
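16 The Gradient Descent Algorithm & the Δ-Rule (5/5)
Standard method:
Do until the termination criterion is satisfied:
• Initialise $\Delta w_i \leftarrow 0$ for all $w_i$.
• For all $d \in D$: compute $o$; for all $w_i$: $\Delta w_i \leftarrow \Delta w_i + \eta\,(t - o)\,x_i$.
• For all $w_i$: $w_i \leftarrow w_i + \Delta w_i$.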
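A minimal sketch of the standard (batch) method for a linear unit, assuming the per-example contribution $\eta (t - o) x_i$ is accumulated over the whole training set before the weights change. The zero initial weights, the fixed number of epochs as termination criterion and the toy data ($t = 2x + 1$) are assumptions made for the example:

```python
def batch_gradient_descent(examples, eta=0.05, epochs=1000):
    n = len(examples[0][0])
    w = [0.0] * (n + 1)  # w[0] is the bias weight (constant input 1)
    for _ in range(epochs):
        delta = [0.0] * (n + 1)  # accumulated weight changes for this pass
        for x, t in examples:
            xb = [1.0] + list(x)
            o = sum(wi * xi for wi, xi in zip(w, xb))  # linear unit output
            for i in range(n + 1):
                delta[i] += eta * (t - o) * xb[i]
        # Weights change only after all examples have been seen.
        for i in range(n + 1):
            w[i] += delta[i]
    return w


examples = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]
w = batch_gradient_descent(examples)
print(w)  # should be close to [1, 2]
```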
17 The Δ-Rule
Stochastic method:
• Initialise the weights $w_i$.
• Do until the termination criterion is satisfied: for all $d \in D$: compute $o$; for all $w_i$: $w_i \leftarrow w_i + \eta\,(t - o)\,x_i$ ⇐ the Δ-rule
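The stochastic variant differs only in when the weights change: after every single example rather than after a full pass. A sketch under the same assumptions as before (linear unit, zero initial weights, fixed epoch count, illustrative toy data):

```python
def stochastic_gradient_descent(examples, eta=0.05, epochs=2000):
    n = len(examples[0][0])
    w = [0.0] * (n + 1)  # w[0] is the bias weight (constant input 1)
    for _ in range(epochs):
        for x, t in examples:
            xb = [1.0] + list(x)
            o = sum(wi * xi for wi, xi in zip(w, xb))
            # Delta rule: update immediately, using this example only.
            for i in range(n + 1):
                w[i] += eta * (t - o) * xb[i]
    return w


sgd_examples = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]
w_sgd = stochastic_gradient_descent(sgd_examples)
print(w_sgd)  # should be close to [1, 2]
```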
18 Remarks
Advantages of the stochastic approximation of the gradient:
⇒ quicker convergence (incremental update of the weights)
⇒ less likely to get stuck in a local minimum.
19 Remarks
• Single perceptrons can learn only linearly separable training data. ⇒ We need multi-layer networks of several 'neurons'.
• Example: the XOR problem is not linearly separable.
[Figure: the XOR points in the (x1, x2) plane, marked + and −; not linearly separable.]
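To illustrate why several units help, here is a hand-wired two-layer network of threshold units that computes XOR. The weights and thresholds are chosen by hand for this sketch (they are not taken from the slide), and outputs are 0/1:

```python
def step(s):
    """Threshold unit: 1 if the weighted sum is positive, else 0."""
    return 1 if s > 0 else 0


def xor_net(x1, x2):
    h_or = step(x1 + x2 - 0.5)    # hidden unit 1: fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # hidden unit 2: fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # output: OR but not AND


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```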
21 Supervised Learning Backpropagation NN
• Since 1985 the BP algorithm has become one of the most widespread and successful learning algorithms for NNs.
• Idea: the minimum of the network's error function is sought by descending in the direction of the negative gradient.
• The weight vector that minimises the error of the network is taken as the solution of the learning problem.
• The gradient of the error function must therefore exist at every point of the weight space, i.e. the error function must be differentiable.
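22 Learning in Backpropagation Networks
The sigmoid unit: $o = \sigma(\vec{w} \cdot \vec{x})$
Properties of the sigmoid unit: $\frac{d\sigma(y)}{dy} = \sigma(y)\,(1 - \sigma(y))$, with $\sigma(y) = \frac{1}{1 + e^{-y}}$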
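A short sketch of the sigmoid and its derivative, assuming the logistic function $\sigma(y) = 1 / (1 + e^{-y})$. The numerical derivative is included only as a sanity check of the identity $\sigma'(y) = \sigma(y)(1 - \sigma(y))$:

```python
import math


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def sigmoid_derivative(y):
    """Derivative expressed through the function value itself."""
    s = sigmoid(y)
    return s * (1.0 - s)


def numerical_derivative(f, y, h=1e-6):
    """Central difference approximation, used here only for checking."""
    return (f(y + h) - f(y - h)) / (2.0 * h)


for y in (-2.0, 0.0, 1.5):
    print(y, sigmoid_derivative(y), numerical_derivative(sigmoid, y))
```

Because the derivative reuses the unit's output, it costs almost nothing during training.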
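23 Definitions used by the BP Algorithm
• $x_{ji}$: the i-th input to unit j (the input from node i to unit j)
• $w_{ji}$: weight of the i-th input to unit j
• outputs: set of output units
• $o_i$: output of unit i
• $t_i$: target output of unit i
• $\delta_n$: error term of unit n
[Figure: layered network with input units, hidden units and output units; the error is propagated backwards from the output units.]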
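24 The Backpropagation Algorithm
• Initialise all weights to small random numbers.
• Until the termination criterion is satisfied, do:
  • For each training example, do:
    1. Compute the network's output.
    2. For each output unit k: $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
    3. For each hidden unit h: $\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{kh}\,\delta_k$
    4. Update each network weight: $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = \eta\,\delta_j\,x_{ji}$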
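The algorithm can be sketched for a network with one hidden layer of sigmoid units. The sketch assumes the standard error terms $\delta_k = o_k(1 - o_k)(t_k - o_k)$ for output units and $\delta_h = o_h(1 - o_h)\sum_k w_{kh}\delta_k$ for hidden units. The layer sizes, learning rate, seed, the OR training set and the fixed epoch count are illustrative assumptions:

```python
import math
import random


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def forward(w_hidden, w_out, x):
    """Return (hidden outputs, network outputs); index 0 of each weight row is the bias."""
    xb = [1.0] + list(x)
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = [1.0] + hidden
    out = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
    return hidden, out


def backprop_step(w_hidden, w_out, x, t, eta):
    """One in-place weight update for a single training example (x, t)."""
    hidden, out = forward(w_hidden, w_out, x)
    # Error terms of the output units.
    delta_out = [o * (1 - o) * (tk - o) for o, tk in zip(out, t)]
    # Error terms of the hidden units (computed before any weight changes).
    delta_hidden = [
        h * (1 - h) * sum(w_out[k][j + 1] * delta_out[k] for k in range(len(w_out)))
        for j, h in enumerate(hidden)
    ]
    # Weight updates: w_ji <- w_ji + eta * delta_j * x_ji
    hb = [1.0] + hidden
    for k, row in enumerate(w_out):
        for i in range(len(row)):
            row[i] += eta * delta_out[k] * hb[i]
    xb = [1.0] + list(x)
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_hidden[j] * xb[i]


def total_error(w_hidden, w_out, examples):
    """Half the summed squared error over all examples and output units."""
    return 0.5 * sum(
        (tk - o) ** 2
        for x, t in examples
        for o, tk in zip(forward(w_hidden, w_out, x)[1], t)
    )


rng = random.Random(0)
n_in, n_hidden, n_out = 2, 2, 1
w_hidden = [[rng.uniform(-0.05, 0.05) for _ in range(n_in + 1)] for _ in range(n_hidden)]
w_out = [[rng.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)] for _ in range(n_out)]

bp_examples = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (1,))]  # OR
error_before = total_error(w_hidden, w_out, bp_examples)
for _ in range(2000):  # fixed epoch count stands in for the termination criterion
    for x, t in bp_examples:
        backprop_step(w_hidden, w_out, x, t, eta=0.5)
error_after = total_error(w_hidden, w_out, bp_examples)
print(error_before, error_after)
```

The updates here are stochastic (one example at a time); accumulating them over all examples before changing the weights would give the batch variant.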
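25 Derivation of the BP Algorithm
For each training example d: $\Delta w_{ji} = -\eta\,\frac{\partial E_d}{\partial w_{ji}}$
with $E_d(\vec{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$
and $\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\,x_{ji}$
where $net_j = \sum_i w_{ji}\,x_{ji}$ (weighted sum of inputs for unit j)
26 Derivation of the BP Algorithm
Output layer: $\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j}\,\frac{\partial o_j}{\partial net_j} = -(t_j - o_j)\,o_j (1 - o_j)$
Hidden layer: $\frac{\partial E_d}{\partial net_j} = \sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k}\,\frac{\partial net_k}{\partial o_j}\,\frac{\partial o_j}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k\,w_{kj}\,o_j (1 - o_j)$
And therefore, with $\delta_j = -\frac{\partial E_d}{\partial net_j}$: $\delta_j = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k\,w_{kj}$
Downstream(j): the set of units whose immediate inputs include the output of unit j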
28 Convergence of the BP Algorithm
• Generalisation to arbitrary acyclic directed network architectures is simple.
• In practice it works well, but it sometimes gets stuck in a local minimum that is not the global one ⇒ introduction of a momentum term α ("escape routes"): $\Delta w_{ji}(n) = \eta\,\delta_j\,x_{ji} + \alpha\,\Delta w_{ji}(n-1)$
• Disadvantage: a global minimum can also be skipped by this "jumping"!
• Training can take thousands of iterations → slow (accelerated by momentum).
• Over-fitting versus adaptability of the NN.
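The effect of a momentum term can be sketched on a one-dimensional toy problem. The sketch assumes the common form in which each weight change is the gradient step plus α times the previous change; the function name, the quadratic objective $(w - 3)^2$ and the parameter values are illustrative only. It shows the update rule itself and does not demonstrate escaping a local minimum:

```python
def minimise_with_momentum(grad, w0, eta=0.1, alpha=0.9, steps=500):
    w, delta_prev = w0, 0.0
    for _ in range(steps):
        # Current change = gradient step + alpha * previous change.
        delta = -eta * grad(w) + alpha * delta_prev
        w += delta
        delta_prev = delta
    return w


# Gradient of (w - 3)^2 is 2 * (w - 3); the minimum is at w = 3.
w_min = minimise_with_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)
```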
29 Example: Recognition of Faces
Given: 32 photos of 20 persons, in different positions:
→ Direction of view: right, left, up or straight.
→ With and without sunglasses.
→ Expression: happy, sad, neutral, ...
30 Example: Recognition of Faces
Goal: classification of the photos by direction of view.
Preparation of the input:
• Rastering the photos → acceleration of the learning process
• Input vector = the grayscale values of the 30 * 32 pixels
• Output vector = (left, straight, right, up)
Solution = max(left, right, up, straight), e.g. o = (0.9, 0.1, 0.1, 0.1) → looking to the left
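Reading the result off the output vector amounts to picking the unit with the largest value. A minimal sketch, assuming the output order (left, straight, right, up) given above; the function name is illustrative:

```python
DIRECTIONS = ("left", "straight", "right", "up")


def decode_direction(output):
    """Return the direction whose output unit has the highest activation."""
    best = max(range(len(DIRECTIONS)), key=lambda i: output[i])
    return DIRECTIONS[best]


print(decode_direction((0.9, 0.1, 0.1, 0.1)))
```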
32 Recurrent Neural Networks
• They are directed cyclic networks "with memory":
  → Outputs at time t = inputs at time t+1.
  → The cycles allow results to be fed back into the network.
• (+) They are more expressive than acyclic networks.
• (−) Training of recurrent networks is expensive.
• In some cases recurrent networks can be trained using a variant of the backpropagation algorithm.
• Example: forecast of the next stock market price y(t+1), based on the current indicator x(t) and the previous indicator x(t-1).
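The "memory" can be sketched with a single recurrent unit whose hidden value is fed back as a context input at the next time step. The scalar weights, the sigmoid hidden unit and the names below are assumptions made for illustration, not the architecture from the slides:

```python
import math


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def run_recurrent(xs, w_x=1.0, w_c=1.0, w_y=1.0):
    """Process a sequence; c carries the previous hidden value forward in time."""
    c = 0.0  # context input before the first time step
    ys = []
    for x in xs:
        b = sigmoid(w_x * x + w_c * c)  # hidden value from input and context
        ys.append(w_y * b)              # prediction for the next time step
        c = b                           # fed back as context at time t+1
    return ys


# Same final input, different history: the final outputs differ.
print(run_recurrent([1.0, 0.0])[-1])
print(run_recurrent([0.0, 0.0])[-1])
```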
33 Recurrent NNs
[Figure: three panels: feedforward network, recurrent network, and recurrent network unfolded in time. Labels: x(t), x(t-2), c(t), c(t-1), c(t-2), b, y(t+1).]