
1 CSE 4705 Artificial Intelligence. Jinbo Bi, Department of Computer Science & Engineering. http://www.engr.uconn.edu/~jinbo

2 Practical issues in training
- Underfitting
- Overfitting
Before introducing these important concepts, let us study a simple regression algorithm: linear regression.

3 Least squares
- We wish to use some real-valued input variables x to predict the value of a target y.
- We collect training data of pairs (x_i, y_i), i = 1, ..., N.
- Suppose we have a model f that maps each example x to a predicted value y'.
- Sum-of-squares function: the sum of the squared deviations between the observed target value y and the predicted value y'.
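The error function itself appears on the slide only as an image; written out in the standard form (the factor 1/2 is a common convenience, not something fixed by the slide), it is

E(f) = \frac{1}{2} \sum_{i=1}^{N} \bigl( y_i - f(x_i) \bigr)^2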

4 Least squares
- Find a function f such that the sum of squares is minimized.
- For example, the function may be linear in its parameters: f(x) = w^T x.
- Least squares with a function that is linear in the parameters w is called "linear regression".

5 Linear regression
- Linear regression has a closed-form solution for w.
- The minimum is attained where the derivative is zero.
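The derivation on the slide is shown as images; the standard result, assuming a linear model f(x) = w^T x with X the N-by-d matrix whose rows are the training inputs and y the vector of targets, is

\frac{\partial E}{\partial w} = -X^\top (y - Xw) = 0 \quad\Rightarrow\quad w^* = (X^\top X)^{-1} X^\top y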

6 Polynomial Curve Fitting
- x is evenly distributed over [0, 1]
- y = f(x) + random error
- y = sin(2πx) + ε, ε ~ N(0, σ)
(an illustrative code sketch of this setup appears below)
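This sketch generates data as described above and fits polynomials of the orders shown on the following slides. The sample size, noise level, and use of numpy.polyfit are my own illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
x = np.linspace(0, 1, N)                                  # x evenly spaced over [0, 1]
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=N)    # y = sin(2*pi*x) + Gaussian noise

for order in (0, 1, 3, 9):
    coeffs = np.polyfit(x, y, deg=order)                  # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    rms = np.sqrt(np.mean((y - y_hat) ** 2))              # training RMS error
    print(f"order {order}: training RMS error = {rms:.3f}")
# The 9th-order fit interpolates all 10 points (training error near 0) yet over-fits.
```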

7 (figure only)

8 Sum-of-Squares Error Function

9 0th Order Polynomial

10 1st Order Polynomial

11 3rd Order Polynomial

12 9th Order Polynomial

13 Over-fitting
- Root-Mean-Square (RMS) error:
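The formula on the slide is an image; the RMS error is commonly defined from the sum-of-squares error E(w*) at the fitted parameters as (following the convention usually used with this curve-fitting example)

E_{RMS} = \sqrt{ 2 E(w^*) / N }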

14 Polynomial Coefficients

15 Data Set Size: 9th Order Polynomial

16 Data Set Size: 9th Order Polynomial

17 Regularization
- Penalize large coefficient values
- Ridge regression
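Written out for a linear model (the slide shows this as an image; λ denotes the regularization coefficient):

\tilde{E}(w) = \frac{1}{2} \sum_{i=1}^{N} \bigl( y_i - w^\top x_i \bigr)^2 + \frac{\lambda}{2} \lVert w \rVert^2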

18 Regularization: (figure only)

19 Regularization: (figure only)

20 Regularization: (comparison figure only)

21 Polynomial Coefficients

22 Ridge Regression
- Derive the analytic solution to the ridge regression optimization problem.
- Use the KKT condition: set the first-order derivative to zero.
(a code sketch of the resulting closed form appears below)
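A minimal NumPy sketch of the two closed-form solutions: setting the derivative of the ridge objective to zero gives w = (X^T X + λI)^{-1} X^T y. The function names, the example data, and the use of numpy.linalg.solve are illustrative choices.

```python
import numpy as np

def least_squares(X, y):
    """Ordinary least squares: w = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    """Ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Example: a 9th-order polynomial fit with and without the ridge penalty.
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + np.random.default_rng(0).normal(0, 0.3, size=10)
X = np.vander(x, 10, increasing=True)        # columns 1, x, x^2, ..., x^9
print(np.abs(least_squares(X, y)).max())     # typically very large coefficients (over-fitting)
print(np.abs(ridge(X, y, 1e-3)).max())       # shrunken coefficients
```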

23 Neural networks
- Introduction
- Different designs of NN
- Feed-forward Network (MLP)
- Network Training
- Error Back-propagation
- Regularization

24 Introduction
- Neuroscience studies how networks of neurons produce intellectual behavior, cognition, emotion, and physiological responses.
- Computer science studies how to simulate knowledge in cognitive science, including the way neurons process signals.
- Artificial neural networks simulate the connectivity of the neural system and the way signals pass through it, and mimic the massively parallel operations of the human brain.

25 Common features (figure of a neuron; label: dendrites)

26 Different types of NN
- Adaptive NN: have a set of adjustable parameters that can be tuned
- Topological NN
- Recurrent NN

27 Different types of NN
- Feed-forward NN
- Multi-layer perceptron
- Linear perceptron
(figure labels: input layer, hidden layer, output layer)

28 Different types of NN
- Radial basis function NN (RBFN)

29 Multi-Layer Perceptron
- Layered perceptron networks can realize any logical function; however, there is no simple way to estimate the parameters or to generalize the (single-layer) perceptron convergence procedure.
- Multi-layer perceptron (MLP) networks are a class of models formed from layered sigmoidal nodes, which can be used for regression or classification.
- They are commonly trained by gradient descent on a mean-squared-error performance function, using a technique known as error back-propagation to calculate the gradients.
- They have been widely applied to prediction and classification problems over the past 15 years.

30 Linear perceptron
(figure: input layer x1, x2, ..., xt with weights w1, w2, ..., wt feeding a summation node Σ in the output layer, which outputs y)
- y = w1*x1 + w2*x2 + ... + wt*xt
- Many functions cannot be approximated using a perceptron.

31 Multi-Layer Perceptron
- XOR (exclusive OR) problem:
  - 0 XOR 0 = 0
  - 1 XOR 1 = 0 (1 + 1 = 2 = 0 mod 2)
  - 1 XOR 0 = 1
  - 0 XOR 1 = 1
- The perceptron does not work here! A single layer generates a linear decision boundary, and XOR is not linearly separable.

32 Multi-Layer Perceptron
(figure: inputs x1, x2, ..., xt connected to hidden units f(Σ) through first-layer weights w11(1), w21(1), ..., wt1(1), and the hidden units connected to the output y through second-layer weights w11(2), w21(2), ...; input layer, hidden layer, output layer)
- Each link is associated with a weight, and these weights are the tuning parameters to be learned.
- Each neuron, except those in the input layer, receives inputs from the previous layer and reports an output to the next layer.

33 Each neuron
(figure: inputs weighted by w1, w2, ..., wn, summed, and passed through an activation function f to produce the output o)
- The activation function f can be:
  - Identity function: f(x) = x
  - Sigmoid function
  - Hyperbolic tangent
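The sigmoid and hyperbolic tangent appear on the slide as images; their standard forms are

\sigma(a) = \frac{1}{1 + e^{-a}}, \qquad \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}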

34 Universal Approximation of MLP
(figure: a network with a 1st layer, 2nd layer, and 3rd layer)
- Universal approximation: a three-layer network can in principle approximate any function with any accuracy!

35 Feed-forward network function
- The output from each hidden node
- The final output
(figure: inputs x1, x2, ..., xt, a hidden layer, and output y; the layers are labeled N nodes and M nodes; signal flows forwards. The formulas are written out below.)
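The two formulas on the slide are images; for a network with inputs x_1, ..., x_t, M hidden nodes with activation f, and a single output, a standard form is (the superscripts (1) and (2) distinguishing the two weight layers are my notation, and an activation may also be applied to the output sum, depending on the design)

z_j = f\Bigl( \sum_{i=1}^{t} w^{(1)}_{ij} x_i \Bigr), \qquad y = \sum_{j=1}^{M} w^{(2)}_{j} z_j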

36 Network Training
- A supervised neural network is a function h(x; w) that maps inputs x to a target y.
- Usually, training a NN does not involve changing the NN structure (such as how many hidden layers or how many hidden nodes).
- Training a NN refers to adjusting the values of the connection weights so that h(x; w) adapts to the problem.
- Use sum of squares as the error metric; use gradient descent to minimize it.

37 Gradient descent
- Review of gradient descent: an iterative algorithm containing many iterations.
- In each iteration, the weights w receive a small update (the standard update rule is written below).
- Terminate:
  - when the network is stable, in other words, when the training error cannot be reduced further, i.e., E(w_new) < E(w) no longer holds; or
  - when the error on a validation set starts to climb (early stopping).
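The per-iteration update, in its standard form (η > 0 is the learning rate; the symbol is the conventional one, not necessarily the slide's), is

w^{(\tau+1)} = w^{(\tau)} - \eta \, \nabla E\bigl(w^{(\tau)}\bigr)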

38 Error Back-propagation
- The update of the weights goes backwards because we have to use the chain rule to evaluate the gradient of E(w).
(figure: inputs x1, x2, ..., xt feed a hidden layer through weights w_ij and then the output y = h(x; w) through weights w_jk; the signal flows forwards, while learning is backwards)

39 Error Back-propagation
- Update the weights in the output layer first.
- Propagate errors from the higher layer to the lower layer.
- Recall: learning is backwards.
(figure: same network as on the previous slide)

40 Evaluate gradient
- First, compute the partial derivatives for the weights in the output layer.
- Second, compute the partial derivatives for the weights in the hidden layer.
(the corresponding formulas are sketched below)
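For the sum-of-squares error on one example, with hidden activations z_j = f(a_j), weights w_ij (input-to-hidden) and w_jk (hidden-to-output) as in the figure, and a linear output unit (the output activation and the notation mapping are my assumptions), the standard back-propagation formulas are

\delta_k = y_k - t_k, \quad \frac{\partial E_n}{\partial w_{jk}} = \delta_k z_j, \quad
\delta_j = f'(a_j) \sum_k w_{jk} \delta_k, \quad \frac{\partial E_n}{\partial w_{ij}} = \delta_j x_i

where t_k is the target value.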

41 Back-propagation algorithm
- Design the structure of the NN.
- Initialize all connection weights.
- For t = 1 to T:
  - Present the training examples, propagate forwards from the input layer to the output layer, compute y, and evaluate the errors.
  - Pass the errors backwards through the network to recursively compute the derivatives, and use them to update the weights.
  - If the termination rule is met, stop; otherwise continue.
- End. (A minimal code sketch of this loop follows below.)
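A minimal NumPy sketch of this loop for a one-hidden-layer network with sigmoid hidden units, a linear output, sum-of-squares error, and batch gradient descent. The layer sizes, learning rate, iteration count, and variable names are illustrative choices, not taken from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp(X, y, n_hidden=4, eta=0.5, T=20000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0, 0.5, size=(d, n_hidden))       # input -> hidden weights
    W2 = rng.normal(0, 0.5, size=(n_hidden, 1))       # hidden -> output weights
    for _ in range(T):
        # Forward pass: propagate signals from the input layer to the output layer.
        Z = sigmoid(X @ W1)                           # hidden-layer outputs
        y_hat = Z @ W2                                # linear output unit
        # Backward pass: propagate errors using the chain rule.
        delta_out = y_hat - y.reshape(-1, 1)          # output-layer errors
        delta_hid = (delta_out @ W2.T) * Z * (1 - Z)  # hidden-layer errors
        # Gradient-descent weight updates.
        W2 -= eta * Z.T @ delta_out / n
        W1 -= eta * X.T @ delta_hid / n
    return W1, W2

# Example: learn XOR (a bias feature of 1 is appended to each input).
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])
W1, W2 = train_mlp(X, y)
print(sigmoid(X @ W1) @ W2)   # predictions should be close to [0, 1, 1, 0]
```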

42 Notes on back-propagation
- These rules apply to different kinds of feed-forward networks: it is possible for connections to skip layers, or to have mixtures. However, errors always start at the highest layer and propagate backwards.

43 Activation and Error back-propagation

44 Two schemes of training
- There are two schemes for updating the weights:
  - Batch: update the weights after all examples have been presented (an epoch).
  - Online: update the weights after each example is presented.
- Although the batch update scheme implements true gradient descent, the online scheme is often preferred since:
  - it requires less storage, and
  - it is noisier, hence less likely to get stuck in a local minimum (which is a problem with nonlinear activation functions).
- In the online update scheme, the order of presentation matters!

45 Problems of back-propagation
- It is extremely slow, if it converges at all.
- It may get stuck in a local minimum.
- It is sensitive to initial conditions.
- It may start oscillating.

46 Over-fitting: number of hidden units
(figure: fits to the sinusoidal data set used in the polynomial curve-fitting example)

47 Regularization (1)
- How to adjust the number of hidden units to get the best performance while avoiding over-fitting?
- Add a penalty term to the error function.
- The simplest regularizer is weight decay:
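Written out (the slide shows the formula as an image; λ is the regularization coefficient):

\tilde{E}(w) = E(w) + \frac{\lambda}{2} \, w^\top w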

48 Regularization (2)
- Early stopping: a method to
  - obtain good generalization performance, and
  - control the effective complexity of the network.
- Instead of iteratively reducing the error until a minimum error on the training data set has been reached:
  - keep a validation set of data available, and
  - stop when the NN achieves the smallest error w.r.t. the validation data set.
(a sketch of such a training loop appears below)
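A sketch of an early-stopping training loop. The helper callables train_one_epoch and validation_error, the patience parameter, and the assumption that the model exposes get_weights/set_weights are all hypothetical placeholders standing in for the actual training and evaluation code.

```python
def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    # train_one_epoch(model) runs one pass of weight updates (hypothetical helper);
    # validation_error(model) returns the error on the held-out validation set.
    best_err, best_weights, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_err:
            best_err, best_weights, bad_epochs = err, model.get_weights(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                    # validation error has started to climb
    model.set_weights(best_weights)      # restore the best weights seen
    return model
```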

49 Effect of early stopping
(figure: error vs. number of iterations for the training set and the validation set; a slight increase in the validation-set error)

