CSE 4705 Artificial Intelligence
Jinbo Bi
Department of Computer Science & Engineering
http://www.engr.uconn.edu/~jinbo
Practical issues in training
– Underfitting
– Overfitting
Before introducing these important concepts, let us study a simple regression algorithm: linear regression.
Least squares
– We wish to use some real-valued input variables x to predict the value of a target y.
– We collect training data of pairs (x_i, y_i), i = 1, …, N.
– Suppose we have a model f that maps each example x to a predicted value y'.
– Sum-of-squares function: the sum of the squared deviations between the observed target values y and the predicted values y' (written out below).
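A sketch of this error function in standard notation (the 1/2 factor is a common convention and is an assumption here, since the slide's own formula is not reproduced in the text):

```latex
E(f) = \frac{1}{2} \sum_{i=1}^{N} \bigl( y_i - f(x_i) \bigr)^2
```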
Least squares
– Find a function f such that the sum of squares is minimized.
– For example, your function may take the linear form f(x) = w^T x.
– Least squares with a function that is linear in the parameters w is called "linear regression".
Linear regression
– Linear regression has a closed-form solution for w.
– The minimum is attained where the derivative is zero (solution sketched below).
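As a sketch, with X the N-by-d matrix whose rows are the training inputs and y the vector of targets (this notation is assumed, not taken from the slide), setting the derivative to zero gives the normal equations:

```latex
\nabla_w E(w) = X^\top (X w - y) = 0
\quad\Longrightarrow\quad
w^* = (X^\top X)^{-1} X^\top y
```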
Polynomial Curve Fitting
– x is evenly distributed on [0, 1]
– y = f(x) + random error
– y = sin(2πx) + ε, ε ~ N(0, σ) (see the code sketch below)
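A minimal Python sketch of this setup and of the polynomial fits shown on the following slides, assuming NumPy is available; the sample size, noise level, and random seed are illustrative choices, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 10, 0.3                       # illustrative sample size and noise level
x = np.linspace(0.0, 1.0, N)             # x evenly distributed on [0, 1]
y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, N)   # y = sin(2*pi*x) + noise

for order in (0, 1, 3, 9):               # polynomial orders shown on the next slides
    coeffs = np.polyfit(x, y, deg=order)        # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    rms = np.sqrt(np.mean((y - y_hat) ** 2))    # training RMS error
    print(f"order {order}: training RMS = {rms:.3f}")
```

The higher-order fits drive the training error toward zero while oscillating between the data points, which is the over-fitting behaviour discussed below.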
Sum-of-Squares Error Function
0th Order Polynomial
1st Order Polynomial
3rd Order Polynomial
9th Order Polynomial
Over-fitting
Root-Mean-Square (RMS) Error:
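The root-mean-square error referred to here is typically defined as follows (assuming the sum-of-squares error E with the 1/2 factor used earlier):

```latex
E_{\mathrm{RMS}} = \sqrt{\frac{2\,E(\mathbf{w}^*)}{N}}
                 = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \bigl( y_i - f(x_i; \mathbf{w}^*) \bigr)^2}
```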
Polynomial Coefficients
Data Set Size: 9th Order Polynomial
Data Set Size: 9th Order Polynomial
Regularization
– Penalize large coefficient values
– Ridge regression (objective sketched below)
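A sketch of the penalized objective, assuming the squared-norm (weight-decay) penalty that ridge regression uses; λ controls the trade-off between fit and coefficient size:

```latex
\tilde{E}(w) = \frac{1}{2} \sum_{i=1}^{N} \bigl( y_i - w^\top x_i \bigr)^2
             + \frac{\lambda}{2} \, \| w \|^2
```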
Regularization:
Regularization:
Regularization: vs.
Polynomial Coefficients
Ridge Regression
– Derive the analytic solution to the ridge regression optimization problem.
– Use the first-order optimality (KKT) condition: set the first derivative to zero (derivation sketched below).
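A sketch of the derivation in the matrix notation assumed earlier: setting the first derivative of the ridge objective to zero yields

```latex
\nabla_w \tilde{E}(w) = X^\top (X w - y) + \lambda w = 0
\quad\Longrightarrow\quad
w^* = (X^\top X + \lambda I)^{-1} X^\top y
```

Adding λI also makes the matrix invertible even when X^T X is singular, which is one practical benefit of the regularizer.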
Neural networks
– Introduction
– Different designs of NN
– Feed-forward network (MLP)
– Network training
– Error back-propagation
– Regularization
Introduction
– Neuroscience studies how networks of neurons produce intellectual behavior, cognition, emotion, and physiological responses.
– Computer science studies how to simulate knowledge from cognitive science, including the way neurons process signals.
– Artificial neural networks simulate the connectivity of the neural system and the way signals pass through it, mimicking the massively parallel operations of the human brain.
Common features
Dendrites
Different types of NN
– Adaptive NN: has a set of adjustable parameters that can be tuned
– Topological NN
– Recurrent NN
Different types of NN
– Feed-forward NN
– Multi-layer perceptron
– Linear perceptron
Layers: input layer, hidden layer, output layer
Different types of NN
– Radial basis function NN (RBFN)
Multi-Layer Perceptron
– Layered perceptron networks can realize any logical function; however, there is no simple way to estimate the parameters or to generalize the (single-layer) perceptron convergence procedure.
– Multi-layer perceptron (MLP) networks are a class of models formed from layers of sigmoidal nodes, which can be used for regression or classification purposes.
– They are commonly trained using gradient descent on a mean-squared-error performance function, with a technique known as error back-propagation used to calculate the gradients.
– They have been widely applied to many prediction and classification problems over the past 15 years.
Linear perceptron
Input layer to output layer: inputs x1, x2, …, xt with weights w1, w2, …, wt feed a summation node Σ producing
y = w1*x1 + w2*x2 + … + wt*xt
Many functions cannot be approximated using a perceptron.
Multi-Layer Perceptron
– XOR (exclusive OR) problem:
  0 + 0 = 0
  1 + 1 = 2 = 0 mod 2
  1 + 0 = 1
  0 + 1 = 1
– The perceptron does not work here! A single layer generates only a linear decision boundary.
Multi-Layer Perceptron
(Network diagram: input layer x1, …, xt; hidden layer of f(Σ) units with weights W^(1); output layer y with weights W^(2).)
– Each link is associated with a weight, and these weights are the tuning parameters to be learned.
– Each neuron, except those in the input layer, receives inputs from the previous layer and reports an output to the next layer.
Each neuron
(Diagram: inputs weighted by w1, w2, …, wn are summed, then passed through the activation function f to produce the output o.)
– The activation function f can be:
  – the identity function f(x) = x
  – the sigmoid function
  – the hyperbolic tangent (all three are written out below)
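Written out, using the standard definitions (the slide's own formulas are not reproduced in the text):

```latex
\text{identity: } f(x) = x, \qquad
\text{sigmoid: } f(x) = \frac{1}{1 + e^{-x}}, \qquad
\text{hyperbolic tangent: } f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
```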
Universal Approximation of MLP
(Diagram: 1st layer, 2nd layer, 3rd layer.)
Universal approximation: a three-layer network can in principle approximate any function to any accuracy!
Feed-forward network function
– The output from each hidden node
– The final output (both are written out below)
(Diagram: the signal flows from inputs x1, …, xt through hidden layers of N and M nodes to the output y.)
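As a sketch, for a network with a single hidden layer of M nodes, hidden activation f, and a linear output (the exact architecture and notation on the slide may differ):

```latex
z_j = f\Bigl( \sum_{i=1}^{t} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \Bigr), \qquad
y = \sum_{j=1}^{M} w_{j}^{(2)} z_j + w_{0}^{(2)}
```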
Network Training
– A supervised neural network is a function h(x; w) that maps inputs x to targets y.
– Training a NN usually does not involve changing the NN structure (such as how many hidden layers or how many hidden nodes).
– Training a NN refers to adjusting the values of the connection weights so that h(x; w) adapts to the problem.
– Use the sum of squares as the error metric; use gradient descent.
Gradient descent
– Review of gradient descent: an iterative algorithm with many iterations.
– In each iteration, the weights w receive a small update (the update rule is written out below).
– Terminate
  – when the network is stable (in other words, the training error cannot be reduced further, i.e., E(w_new) < E(w) no longer holds), or
  – when the error on a validation set starts to climb (early stopping).
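The small update applied in each iteration is the usual gradient-descent step with learning rate η (a standard formulation, assumed here):

```latex
\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \, \nabla E\bigl( \mathbf{w}^{(\tau)} \bigr)
```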
Error Back-propagation
The update of the weights goes backwards because we have to use the chain rule to evaluate the gradient of E(w).
(Diagram: the signal flows forwards from inputs x1, …, xt through hidden layers of M and N nodes, with weights W_ij and W_jk, to the output y = h(x; w); learning proceeds backwards.)
Error Back-propagation
– Update the weights in the output layer first.
– Propagate errors from the higher layers to the lower layers.
– Recall: learning is backwards.
(Same network diagram as the previous slide.)
Evaluate gradient
– First, compute the partial derivatives for the weights in the output layer.
– Second, compute the partial derivatives for the weights in the hidden layer (both are written out below).
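A sketch of these two steps for the one-hidden-layer network above, with sum-of-squares error, hidden activations z_j = f(a_j), prediction ŷ, and target y (this notation is assumed, not taken from the slides):

```latex
\delta_{\text{out}} = \hat{y} - y, \qquad
\frac{\partial E}{\partial w_j^{(2)}} = \delta_{\text{out}} \, z_j, \qquad
\delta_j = f'(a_j) \, w_j^{(2)} \, \delta_{\text{out}}, \qquad
\frac{\partial E}{\partial w_{ji}^{(1)}} = \delta_j \, x_i
```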
Back-propagation algorithm
– Design the structure of the NN.
– Initialize all connection weights.
– For t = 1 to T:
  – Present the training examples, propagate forwards from the input layer to the output layer, compute y, and evaluate the errors.
  – Pass the errors backwards through the network to recursively compute the derivatives, and use them to update the weights.
  – If the termination rule is met, stop; otherwise continue.
– End (a runnable sketch of this procedure follows).
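A minimal runnable sketch of this algorithm for a one-hidden-layer MLP with sigmoid hidden units, a linear output, and online gradient-descent updates; all names (train_mlp, n_hidden, lr, ...) and the XOR demo are illustrative choices, not from the lecture:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp(X, y, n_hidden=5, lr=0.1, n_epochs=1000, seed=0):
    rng = np.random.default_rng(seed)
    n_inputs = X.shape[1]
    # Initialize all connection weights (small random values) and biases.
    W1 = rng.normal(scale=0.5, size=(n_hidden, n_inputs))
    b1 = np.zeros(n_hidden)
    w2 = rng.normal(scale=0.5, size=n_hidden)
    b2 = 0.0
    for epoch in range(n_epochs):
        for x_i, y_i in zip(X, y):
            # Forward pass: propagate the signal from input to output.
            z = sigmoid(W1 @ x_i + b1)        # hidden activations
            y_hat = w2 @ z + b2               # network output h(x; w)
            # Backward pass: propagate the error from output to hidden layer.
            delta_out = y_hat - y_i                          # output-layer error
            delta_hidden = (w2 * delta_out) * z * (1.0 - z)  # chain rule through sigmoid
            # Gradient-descent weight updates.
            w2 -= lr * delta_out * z
            b2 -= lr * delta_out
            W1 -= lr * np.outer(delta_hidden, x_i)
            b1 -= lr * delta_hidden
    return W1, b1, w2, b2

def predict(params, X):
    W1, b1, w2, b2 = params
    return np.array([w2 @ sigmoid(W1 @ x + b1) + b2 for x in X])

if __name__ == "__main__":
    # Toy example: the XOR problem from the earlier slide, which a single-layer
    # perceptron cannot solve but a one-hidden-layer MLP can.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0], dtype=float)
    params = train_mlp(X, y, n_hidden=4, lr=0.5, n_epochs=5000)
    print(np.round(predict(params, X), 2))
```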
Notes on back-propagation
– These rules apply to different kinds of feed-forward networks. It is possible for connections to skip layers, or to have mixtures. However, errors always start at the highest layer and propagate backwards.
Activation and Error Back-propagation
Two schemes of training
– There are two schemes for updating the weights:
  – Batch: update the weights after all examples have been presented (an epoch).
  – Online: update the weights after each example is presented.
– Although the batch update scheme implements the true gradient descent, the online scheme is often preferred since
  – it requires less storage, and
  – it has more noise, so it is less likely to get stuck in a local minimum (which is a problem with nonlinear activation functions).
– In the online update scheme, the order of presentation matters!
Problems of back-propagation
– It is extremely slow, if it converges at all.
– It may get stuck in a local minimum.
– It is sensitive to initial conditions.
– It may start oscillating.
Overfitting: number of hidden units
(Plots: over-fitting on the sinusoidal data set used in the polynomial curve-fitting example.)
Regularization (1)
– How do we adjust the number of hidden units to get the best performance while avoiding over-fitting?
– Add a penalty term to the error function.
– The simplest regularizer is weight decay:
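The weight-decay regularizer adds the squared norm of all the weights to the error, with regularization coefficient λ:

```latex
\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2} \, \mathbf{w}^\top \mathbf{w}
```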
Regularization (2): Early Stopping
– A method to
  – obtain good generalization performance and
  – control the effective complexity of the network.
– Instead of iteratively reducing the error until a minimum error on the training data set has been reached,
– we keep a validation set of data available and
– stop when the NN achieves the smallest error w.r.t. the validation data set.
Effect of early stopping
(Plots: error vs. number of iterations for the training set and the validation set, showing a slight increase in the validation set error.)