1 Artificial Neural Network

2 Introduction ANNs provide a robust approach to approximating real-valued, discrete-valued, and vector-valued target functions. The Backpropagation algorithm has been successful in many practical problems, such as interpreting visual scenes and speech recognition. ANN learning is robust to errors in the training data.

3 Biological Motivation Biological learning systems are built of very complex webs of interconnected neurons. ANNs are likewise built from a densely interconnected set of simple units, where each unit takes a number of real-valued inputs and produces a single real-valued output. The human brain contains a densely interconnected network of approximately $10^{11}$ neurons, each connected to about $10^{3}$ other neurons. Even the fastest neuron switching times are slow compared with computer switching speeds, yet humans make complex decisions quickly.

4 Biological Motivation ANNs are not exactly the same as biological systems. There are two groups of research: – Using ANNs to study and model biological learning processes – Pursuing the goal of obtaining highly effective machine learning algorithms

5 Neural Network Representation Example: ALVINN, an ANN-based system that steers an autonomous vehicle driving at normal speeds on public highways

6 Neural Network Representation ALVINN is typical of many ANNs – Directed and cycle-free (acyclic) Other structures – Acyclic or cyclic – Directed or undirected The Backpropagation algorithm assumes the network is a fixed structure that corresponds to a directed graph, possibly containing cycles. Learning corresponds to choosing a weight value for each edge in the graph.

7 Appropriate Problems for Neural Network Learning Instances are represented by many attribute-value pairs. The target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes. The training examples may contain errors. Long training times are acceptable. Fast evaluation of the learned target function may be required. The ability of humans to understand the learned target function is not important.

8 Perceptrons One type of ANN is based on a unit called a perceptron. A perceptron takes a vector of real-valued inputs, calculates a linear combination of these inputs, then outputs 1 if the result is greater than some threshold and -1 otherwise: $o(x_1, \ldots, x_n) = 1$ if $w_0 + w_1 x_1 + \cdots + w_n x_n > 0$, and $-1$ otherwise. Each $w_i$ is a real-valued constant, or weight, that determines the contribution of input $x_i$ to the perceptron output.

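9 Perceptrons To simplify notation, we imagine an additional constant input $x_0 = 1$, allowing us to write the above inequality as $\sum_{i=0}^{n} w_i x_i > 0$, or in vector form as $\vec{w} \cdot \vec{x} > 0$. The perceptron function is $o(\vec{x}) = \mathrm{sgn}(\vec{w} \cdot \vec{x})$, where $\mathrm{sgn}(y) = 1$ if $y > 0$ and $-1$ otherwise.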
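
The perceptron function can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name `perceptron_output` and the convention that `weights[0]` holds $w_0$ are assumptions made here.

```python
from typing import Sequence


def perceptron_output(weights: Sequence[float], inputs: Sequence[float]) -> int:
    """Return 1 if w0 + w1*x1 + ... + wn*xn > 0, otherwise -1.

    `weights` has n + 1 entries (w0 first); `inputs` has n entries (x1..xn).
    """
    if len(weights) != len(inputs) + 1:
        raise ValueError("expected exactly one more weight than inputs")
    x = [1.0, *inputs]  # constant input x0 = 1 folds the threshold into w0
    net = sum(w * xi for w, xi in zip(weights, x))
    return 1 if net > 0 else -1
```

For example, with `weights = [-0.5, 1.0]` the unit outputs 1 for the input `[1.0]` and -1 for the input `[0.0]`.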

10 Perceptrons Learning a perceptron involves choosing values for the weights $w_0, w_1, \ldots, w_n$. Therefore, the space $H$ of candidate hypotheses considered in perceptron learning is the set of all possible real-valued weight vectors: $H = \{\vec{w} \mid \vec{w} \in \mathbb{R}^{n+1}\}$.

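11 Representational Power of Perceptrons A perceptron represents a hyperplane decision surface in the n-dimensional space of instances. It outputs 1 for instances lying on one side of the hyperplane and -1 for instances lying on the other side. The decision hyperplane is $\vec{w} \cdot \vec{x} = 0$. Sets of examples that can be separated by such a hyperplane are called linearly separable.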

12 Representational Power of Perceptrons A single perceptron can be used to represent many Boolean functions. – How to implement AND and OR? Perceptrons can represent all of the primitive Boolean functions AND, OR, NAND, and NOR. XOR cannot be represented by a single perceptron, because its training examples are not linearly separable.
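
One possible answer to the AND/OR question, as a hedged sketch: assuming Boolean inputs are encoded as 0 and 1, a two-input perceptron with $w_1 = w_2 = 0.5$ computes AND when $w_0 = -0.8$ and OR when $w_0 = -0.3$. These particular weight values and the names `AND_WEIGHTS`, `OR_WEIGHTS`, and `truth_table` are illustrative choices, not taken from the slides; many other weight settings work equally well.

```python
def perceptron(w0: float, w1: float, w2: float, x1: int, x2: int) -> int:
    """Two-input perceptron: 1 if w0 + w1*x1 + w2*x2 > 0, otherwise -1."""
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1


# Illustrative weights (assumed, not from the slides); inputs are 0 or 1.
AND_WEIGHTS = (-0.8, 0.5, 0.5)  # fires only when both inputs are 1
OR_WEIGHTS = (-0.3, 0.5, 0.5)   # fires when at least one input is 1


def truth_table(weights):
    """Map every (x1, x2) pair in {0, 1}^2 to the perceptron output."""
    return {(x1, x2): perceptron(*weights, x1, x2)
            for x1 in (0, 1) for x2 in (0, 1)}
```

No choice of three weights reproduces XOR in this way, which is the non-separability noted on the slide.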

13 Representational Power of Perceptrons Every Boolean function can be represented by some network of interconnected units based on these primitives. In fact, every Boolean function can be represented by a network of perceptrons only two levels deep. Networks can represent a rich variety of functions that single units alone cannot.

14 The Perceptron Training Rules Learning the weights for a single perceptron: determine a weight vector that causes the perceptron to produce the correct output (1 or -1) for each training example. Two algorithms – Perceptron rule – Delta rule These converge to somewhat different acceptable hypotheses.

15 The Perceptron Training Rules Perceptron rule – Begin with random weights – Iteratively apply the perceptron to each training example, modifying the weights whenever it misclassifies an example – Iterate through the training examples as many times as needed until all examples are classified correctly
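
The perceptron rule can be sketched as follows. The slide does not show the update formula; the sketch assumes the standard perceptron update $w_i \leftarrow w_i + \eta (t - o) x_i$, and the names `train_perceptron` and `predict`, the default learning rate, and the epoch cap are all hypothetical choices made for this illustration.

```python
import random


def predict(w, x):
    """Thresholded output: 1 if w0 + w . x > 0, otherwise -1."""
    xs = [1.0, *x]
    return 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else -1


def train_perceptron(examples, eta=0.1, max_epochs=1000, seed=0):
    """examples: list of (inputs, target) pairs with targets in {1, -1}."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]  # random start
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = predict(w, x)
            if o != t:  # weights change only on misclassified examples
                mistakes += 1
                xs = [1.0, *x]
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
        if mistakes == 0:  # every example classified correctly
            break
    return w
```

The epoch cap is a safeguard: for data that are not linearly separable the loop would otherwise never terminate.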

16 Gradient Descent and the Delta Rule The delta rule converges toward a best-fit approximation to the target concept even when the training examples are not linearly separable. It uses gradient descent to search the hypothesis space of possible weight vectors for the weights that best fit the training examples. Gradient descent can search hypothesis spaces containing many different types of continuously parameterized hypotheses.

17 Gradient Descent To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.

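18 Gradient Descent and the Delta Rule Consider the task of training an unthresholded perceptron, that is, a linear unit whose output is $o(\vec{x}) = \vec{w} \cdot \vec{x}$. The training error of a hypothesis (weight vector) is $E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$, where $D$ is the set of training examples, $t_d$ is the target output for example $d$, and $o_d$ is the output of the linear unit for example $d$.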

19 Visualizing the Hypothesis Space The $w_0$, $w_1$ plane represents the entire hypothesis space. Gradient descent starts with an arbitrary initial weight vector, then repeatedly modifies it in small steps, at each step moving in the direction that produces the steepest descent along the error surface.

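20 Gradient Descent Direction of steepest descent along the error surface. – Take the derivative of $E$ with respect to each component of the vector $\vec{w}$, written $\nabla E(\vec{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$ – The gradient specifies the direction that produces the steepest increase in $E$, so $-\nabla E(\vec{w})$ is the direction of steepest decrease. Gradient descent rule: $\vec{w} \leftarrow \vec{w} + \Delta\vec{w}$, where $\Delta\vec{w} = -\eta \nabla E(\vec{w})$; in component form, $w_i \leftarrow w_i + \Delta w_i$ with $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$.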

21 Derivative Rules

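22 Gradient Descent The vector of derivatives that form the gradient can be obtained by differentiating $E$: $\frac{\partial E}{\partial w_i} = \sum_{d \in D} (t_d - o_d)(-x_{id})$, where $t_d$ and $o_d$ are the target and the unit output for training example $d$, and $x_{id}$ is the input component $x_i$ for that example. Substituting into the gradient descent rule gives $\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) x_{id}$.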
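
The differentiation can be written out step by step. This is a sketch of the standard derivation, assuming a linear unit $o_d = \vec{w} \cdot \vec{x}_d$ and the squared-error measure $E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$; the example index $d$ and the symbol $x_{id}$ (component $i$ of the input for example $d$) are notation assumed here.

```latex
\begin{align*}
\frac{\partial E}{\partial w_i}
  &= \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_{d \in D}(t_d - o_d)^2 \\
  &= \frac{1}{2}\sum_{d \in D} 2\,(t_d - o_d)\,\frac{\partial}{\partial w_i}(t_d - o_d) \\
  &= \sum_{d \in D}(t_d - o_d)\,\frac{\partial}{\partial w_i}\bigl(t_d - \vec{w}\cdot\vec{x}_d\bigr) \\
  &= \sum_{d \in D}(t_d - o_d)\,(-x_{id}) \\[4pt]
\Delta w_i &= -\eta\,\frac{\partial E}{\partial w_i}
            = \eta \sum_{d \in D}(t_d - o_d)\,x_{id}
\end{align*}
```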

23 Gradient Descent Algorithm Gradient-Descent(training_examples, η) – Each training example is a pair of the form $\langle \vec{x}, t \rangle$, where $\vec{x}$ is the vector of input values and $t$ is the target output value. η is the learning rate. Initialize each $w_i$ to some small random value. Until the termination condition is met, Do – Initialize each $\Delta w_i$ to zero – For each $\langle \vec{x}, t \rangle$ in training_examples, Do: input the instance $\vec{x}$ to the unit and compute the output $o$; for each linear unit weight $w_i$, Do $\Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i$ – For each linear unit weight $w_i$, Do $w_i \leftarrow w_i + \Delta w_i$
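
The algorithm translates fairly directly into Python. The following is a minimal sketch for a single linear unit; the fixed epoch count stands in for the unspecified termination condition, and the function name, default values, and seed parameter are assumptions of this example.

```python
import random


def gradient_descent(training_examples, eta=0.05, epochs=500, seed=0):
    """Batch gradient descent for one linear unit.

    training_examples: list of (x, t) pairs, x a sequence of inputs and
    t the target output. Returns [w0, w1, ..., wn].
    """
    rng = random.Random(seed)
    n = len(training_examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):  # assumed termination condition: fixed epochs
        delta_w = [0.0] * (n + 1)  # initialize each delta w_i to zero
        for x, t in training_examples:
            xs = [1.0, *x]  # constant input x0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xs))  # linear unit output
            for i, xi in enumerate(xs):
                delta_w[i] += eta * (t - o) * xi  # accumulate over examples
        w = [wi + dwi for wi, dwi in zip(w, delta_w)]  # one update per pass
    return w
```

On noise-free data generated from $t = 2x + 1$, the returned weights approach `[1.0, 2.0]`. Convergence depends on η being small enough for the data at hand.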

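24 Gradient Descent Algorithm Gradient-Descent – Pick an initial random weight vector – Apply the linear unit to all training examples and compute $\Delta w_i$ for each weight – Update each weight $w_i$ by adding $\Delta w_i$, then repeat. Because the error surface contains only a single global minimum, this algorithm converges to a weight vector with minimum error, provided a sufficiently small learning rate η is used. If η is too large, the search can overstep the minimum.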

25 Stochastic Approximation to Gradient Descent Gradient descent is a strategy for searching through a large or infinite hypothesis space that can be applied whenever – The hypothesis space contains continuously parameterized hypotheses – The error can be differentiated with respect to these hypothesis parameters. Major problems of gradient descent – Converging to a local minimum can sometimes be quite slow – If the error surface has multiple local minima, there is no guarantee of finding the global minimum

26 Stochastic Gradient Descent Algorithm Gradient-Descent(training_examples, η), stochastic version – Each training example is a pair of the form $\langle \vec{x}, t \rangle$, where $\vec{x}$ is the vector of input values and $t$ is the target output value. η is the learning rate. Initialize each $w_i$ to some small random value. Until the termination condition is met, Do – For each $\langle \vec{x}, t \rangle$ in training_examples, Do: input the instance $\vec{x}$ to the unit and compute the output $o$; for each linear unit weight $w_i$, Do $w_i \leftarrow w_i + \eta (t - o) x_i$. Unlike the standard algorithm, no $\Delta w_i$ accumulator is needed, because each weight is updated immediately after each example.
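
A minimal Python sketch of the stochastic version for a single linear unit follows. The fixed epoch count again stands in for the termination condition, the examples are visited in their given order, and the function name and defaults are assumptions of this example.

```python
import random


def stochastic_gradient_descent(training_examples, eta=0.05, epochs=500, seed=0):
    """Incremental (stochastic) gradient descent for one linear unit.

    The weights are updated after every single example instead of once
    per pass over the training set.
    """
    rng = random.Random(seed)
    n = len(training_examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):  # assumed termination condition: fixed epochs
        for x, t in training_examples:
            xs = [1.0, *x]  # constant input x0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xs))
            # immediate update: w_i <- w_i + eta * (t - o) * x_i
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
    return w
```

In practice the examples are often shuffled between passes; that refinement is omitted here to stay close to the slide.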

27 Stochastic Approximation to Gradient Descent Stochastic gradient descent – Approximates the gradient descent search by updating weights incrementally, following the calculation of the error for each individual example: $\Delta w_i = \eta (t - o) x_i$ – Can be viewed as defining a distinct error function for each individual training example $d$: $E_d(\vec{w}) = \frac{1}{2}(t_d - o_d)^2$ – The sequence of these updates provides a reasonable approximation to descending the gradient with respect to the original error function $E(\vec{w})$ – By making η sufficiently small, stochastic gradient descent can be made to approximate true gradient descent arbitrarily closely.

28 Gradient Descent and Stochastic Gradient Descent In standard gradient descent, the error is summed over all examples before updating weights, whereas in stochastic gradient descent weights are updated upon examining each training example. Summing over multiple examples in standard gradient descent requires more computation per weight update step. On the other hand, because it uses the true gradient, standard gradient descent is often used with a larger step size per weight update than stochastic gradient descent. In cases where there are multiple local minima with respect to $E$, stochastic gradient descent can sometimes avoid falling into these local minima.

29 Multilayer Network and The Backpropagation Algorithm Single perceptrons can only express linear decision surfaces. In contrast, the kind of multilayer networks learned by the Backpropagation algorithm are capable of expressing a rich variety of nonlinear decision surfaces.

30 Multilayer Network and The Backpropagation Algorithm

31 A Differentiable Threshold Unit What type of unit shall we use as the basis for constructing multilayer networks? Multiple layers of cascaded linear units still produce only linear functions. The perceptron unit is an option; however, its discontinuous threshold makes it non-differentiable and hence unsuitable for gradient descent. Sigmoid unit – Output is a nonlinear function of its inputs – Output is a differentiable function of its inputs

32 A Differentiable Threshold Unit The sigmoid unit computes its output $o$ as $o = \sigma(\vec{w} \cdot \vec{x})$, where $\sigma(y) = \frac{1}{1 + e^{-y}}$.

33 A Differentiable Threshold Unit Sigmoid function – Also called the logistic function – Output ranges between 0 and 1 – Increases monotonically with its input – Its derivative is easily expressed in terms of its output: $\frac{d\sigma(y)}{dy} = \sigma(y)(1 - \sigma(y))$ Other functions – Other differentiable functions with easily calculated derivatives are sometimes used in place of σ – For example, $e^{-y}$ in the sigmoid function can be replaced by $e^{-k \cdot y}$, where $k$ is some positive constant that determines the steepness of the threshold
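
These properties are easy to check numerically. The sketch below uses only the standard library; the function names are hypothetical, and the two-branch form of `sigmoid` is an implementation choice made here to avoid floating-point overflow for large negative inputs.

```python
import math


def sigmoid(y: float) -> float:
    """Logistic function 1 / (1 + e^-y), written to avoid overflow."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    z = math.exp(y)  # safe: y < 0, so z is in (0, 1)
    return z / (1.0 + z)


def sigmoid_derivative(o: float) -> float:
    """Derivative of the sigmoid expressed via its output o = sigmoid(y)."""
    return o * (1.0 - o)


def steep_sigmoid(y: float, k: float) -> float:
    """Variant with e^-y replaced by e^(-k*y); larger k gives a sharper threshold."""
    return sigmoid(k * y)
```

Because the derivative depends only on the unit's output, a network can reuse the values computed in the forward pass when it later computes error terms.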

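34 Backpropagation Algorithm The Backpropagation algorithm learns the weights for a multilayer network. It employs gradient descent to attempt to minimize the squared error between the network output values and the target values. Because the network can have multiple output units, the error is summed over all of them: $E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in \mathit{outputs}} (t_{kd} - o_{kd})^2$, where $t_{kd}$ and $o_{kd}$ are the target and output values of output unit $k$ for training example $d$.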

35 Backpropagation Algorithm Backpropagation searches a large hypothesis space defined by all possible weight values for all the units in the network. One major difference in the case of multilayer networks is that the error surface can have multiple local minima, so gradient descent is guaranteed only to converge toward some local minimum. It can still produce excellent results in many real-world applications.

36 Backpropagation Algorithm Backpropagation(training_examples, η, $n_{in}$, $n_{out}$, $n_{hidden}$) – Create a feed-forward network with $n_{in}$ inputs, $n_{hidden}$ hidden units, and $n_{out}$ output units – Initialize all network weights to small random numbers (between -0.05 and 0.05) – Until the termination condition is met, Do: for each $\langle \vec{x}, \vec{t} \rangle$ in training_examples, Do – Propagate the input forward through the network: – 1. Input the instance $\vec{x}$ to the network and compute the output $o_u$ of every unit $u$ in the network – Propagate the errors backward through the network: – 2. For each network output unit $k$, calculate its error term $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$ – 3. For each hidden unit $h$, calculate its error term $\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in \mathit{outputs}} w_{kh} \delta_k$ – 4. Update each network weight $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = \eta \delta_j x_{ji}$
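
Steps 1 to 4 can be sketched in plain Python for a network with one hidden layer of sigmoid units. This is an illustrative sketch, not the presenter's code: the function names, the weight layout (one row per unit, with index 0 holding the weight on a constant input of 1), the fixed epoch count, and the default parameter values are all assumptions made here.

```python
import math
import random


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def init_network(n_in, n_hidden, n_out, seed=0):
    """Return (hidden_w, output_w); each row holds one unit's weights."""
    rng = random.Random(seed)

    def layer(n_units, n_inputs):
        return [[rng.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
                for _ in range(n_units)]

    return layer(n_hidden, n_in), layer(n_out, n_hidden)


def forward(hidden_w, output_w, x):
    """Step 1: compute the output of every unit in the network."""
    xs = [1.0, *x]
    h = [sigmoid(sum(w * v for w, v in zip(row, xs))) for row in hidden_w]
    hs = [1.0, *h]
    o = [sigmoid(sum(w * v for w, v in zip(row, hs))) for row in output_w]
    return xs, hs, o


def backprop_step(hidden_w, output_w, x, t, eta):
    """Steps 2 to 4 for one training example; updates weights in place."""
    xs, hs, o = forward(hidden_w, output_w, x)
    # Step 2: error terms of the output units.
    delta_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # Step 3: error terms of the hidden units, using the weights w_kh
    # as they were before this example's update.
    delta_hid = [hs[h + 1] * (1 - hs[h + 1]) *
                 sum(output_w[k][h + 1] * delta_out[k]
                     for k in range(len(output_w)))
                 for h in range(len(hidden_w))]
    # Step 4: w_ji <- w_ji + eta * delta_j * x_ji.
    for k, row in enumerate(output_w):
        for i in range(len(row)):
            row[i] += eta * delta_out[k] * hs[i]
    for h, row in enumerate(hidden_w):
        for i in range(len(row)):
            row[i] += eta * delta_hid[h] * xs[i]


def total_error(hidden_w, output_w, examples):
    """Half the summed squared error over all examples and output units."""
    return 0.5 * sum((tk - ok) ** 2
                     for x, t in examples
                     for tk, ok in zip(t, forward(hidden_w, output_w, x)[2]))


def train(examples, n_hidden=2, eta=0.3, epochs=2000, seed=0):
    n_in, n_out = len(examples[0][0]), len(examples[0][1])
    hidden_w, output_w = init_network(n_in, n_hidden, n_out, seed)
    for _ in range(epochs):  # assumed termination condition: fixed epochs
        for x, t in examples:
            backprop_step(hidden_w, output_w, x, t, eta)
    return hidden_w, output_w
```

Each example is a pair of an input tuple and a target tuple, for instance `((0, 1), (0.9,))`. Comparing `total_error` before and after `train` is a quick check that the updates reduce the training error.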

37 Backpropagation Algorithm This algorithm applies to layered feed-forward networks containing two layers of sigmoid units, with units at each layer connected to all units from the preceding layer. This is the incremental, or stochastic gradient descent, version of the Backpropagation algorithm. Notation – An index is assigned to each node in the network, where a node is either an input to the network or the output of some unit in the network – $x_{ji}$ denotes the input from node $i$ to unit $j$, and $w_{ji}$ denotes the corresponding weight – $\delta_n$ denotes the error term associated with unit $n$.

38 Backpropagation Algorithm The algorithm begins by constructing a network with the desired number of hidden and output units and initializing all network weights to small random values. Given this fixed network structure, the main loop of the algorithm then repeatedly iterates over the training examples. For each training example, it calculates the error of the network output for this example, computes the gradient with respect to this error, and updates the weights. This gradient descent step is iterated until the network performs acceptably well.

39 Weight Update Rule Similar to the delta rule: each weight is updated in proportion to the learning rate η, the input value $x_{ji}$ to which the weight is applied, and the error in the output of the unit. The error $(t - o)$ in the delta rule is replaced by a more complex error term $\delta_j$.

40 Error in Backpropagation Algorithm Error for output unit $k$ – $\delta_k$ is the familiar $(t_k - o_k)$ from the delta rule, multiplied by the factor $o_k (1 - o_k)$, which is the derivative of the sigmoid squashing function. Error for hidden unit $h$ – No target values are directly available to indicate the error of hidden units' values – The error term for hidden unit $h$ is calculated by summing the error terms $\delta_k$ for each output unit influenced by $h$, weighting each $\delta_k$ by $w_{kh}$, the weight from hidden unit $h$ to output unit $k$.

41 Backpropagation Algorithm Weights are updated incrementally, following the presentation of each training example. This corresponds to a stochastic approximation to gradient descent. To obtain the true gradient of $E$, one would sum the $\delta_j x_{ji}$ values over all training examples before altering weight values. The weight-update loop may be iterated thousands of times in a typical application. A termination condition is used to halt the procedure – Halt after a fixed number of iterations – Halt once the error on the training examples falls below some threshold – Halt once the error on a separate validation set of examples meets some criterion. The choice matters because too many iterations can lead to overfitting the training data.

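42 Derivation of the Backpropagation Rules Stochastic gradient descent involves iterating through the training examples one at a time, for each training example $d$ descending the gradient of the error $E_d$ with respect to this single example: $\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}}$, where $E_d(\vec{w}) = \frac{1}{2} \sum_{k \in \mathit{outputs}} (t_k - o_k)^2$.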

43 Subscripts and Variables – $x_{ji}$: the ith input to unit $j$ – $w_{ji}$: the weight associated with the ith input to unit $j$ – $net_j$: the weighted sum of inputs for unit $j$, $net_j = \sum_i w_{ji} x_{ji}$ – $o_j$: the output computed by unit $j$ – $t_j$: the target output for unit $j$ – σ: the sigmoid function – $\mathit{outputs}$: the set of units in the final layer of the network – $Downstream(j)$: the set of units whose immediate inputs include the output of unit $j$

44 Derivation of the Backpropagation Rules We consider two cases in turn: the case where unit $j$ is an output unit for the network, and the case where $j$ is an internal (hidden) unit.

45 Hidden Unit Weight
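
The equations for the two cases did not survive in this transcript. The following is a sketch of the standard chain-rule derivation, assuming sigmoid units, the per-example error $E_d = \frac{1}{2} \sum_{k \in \mathit{outputs}} (t_k - o_k)^2$, and the definition $\delta_j \equiv -\partial E_d / \partial net_j$ (a notational convention assumed here); it uses the fact that $w_{ji}$ influences the network only through $net_j$.

```latex
\begin{align*}
\frac{\partial E_d}{\partial w_{ji}}
  &= \frac{\partial E_d}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ji}}
   = \frac{\partial E_d}{\partial net_j}\,x_{ji},
  \qquad \delta_j \equiv -\frac{\partial E_d}{\partial net_j} \\[6pt]
\text{Output unit } j:\quad
\frac{\partial E_d}{\partial net_j}
  &= \frac{\partial E_d}{\partial o_j}\,\frac{\partial o_j}{\partial net_j}
   = -(t_j - o_j)\,o_j(1 - o_j) \\
\delta_j &= (t_j - o_j)\,o_j(1 - o_j) \\[6pt]
\text{Hidden unit } j:\quad
\frac{\partial E_d}{\partial net_j}
  &= \sum_{k \in Downstream(j)}
     \frac{\partial E_d}{\partial net_k}\,
     \frac{\partial net_k}{\partial o_j}\,
     \frac{\partial o_j}{\partial net_j}
   = \sum_{k \in Downstream(j)} (-\delta_k)\,w_{kj}\,o_j(1 - o_j) \\
\delta_j &= o_j(1 - o_j) \sum_{k \in Downstream(j)} \delta_k\,w_{kj} \\[6pt]
\Delta w_{ji} &= -\eta\,\frac{\partial E_d}{\partial w_{ji}} = \eta\,\delta_j\,x_{ji}
\end{align*}
```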

46 Convergence and Local Minima Backpropagation is guaranteed only to converge toward some local minimum of $E$, not necessarily to the global minimum error. Nevertheless, Backpropagation is a highly effective function approximation method in practice.

47 Overfitting

49 As training proceeds, some weights begin to grow in order to reduce the error over the training data, and the complexity of the learned decision surface increases. Given enough iterations, Backpropagation will often be able to create overly complex decision surfaces that fit noise in the training data or unrepresentative characteristics of the particular training sample.

50 Solutions for Overfitting Weight decay – Decrease each weight by some small factor during each iteration Validation data – Cross-validation – k-fold cross-validation – Test data distinct from the training data
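
Both remedies can be sketched briefly. The code below is illustrative only: it assumes weight decay is applied by shrinking every weight by a factor `(1 - decay)` before adding the usual gradient step, and it forms k folds by striding through the example indices. The function names and the default decay value are hypothetical.

```python
def apply_weight_decay(weights, gradient_step, decay=1e-4):
    """Shrink each weight slightly, then add the usual update.

    weights, gradient_step: equal-length lists of floats, where
    gradient_step[i] is the ordinary update (e.g. eta * delta * x).
    """
    return [(1.0 - decay) * w + dw for w, dw in zip(weights, gradient_step)]


def k_fold_splits(n_examples, k):
    """Yield (training_indices, validation_indices) for each of k folds."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation
```

With k-fold cross-validation every example is used for validation exactly once, which helps when the data set is too small to set aside a large validation set.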

51 An Illustrative Example Training data – Images of 20 different people – Approximately 32 images per person, varying in: expression (happy, sad, angry, neutral); direction in which they were looking (left, right, straight ahead, up); whether they were wearing sunglasses – 624 greyscale images in total, each with a resolution of 120×128 Output: the direction in which the person is looking

52 Learned Hidden Representations [Figure: 30×32 resolution input images; network weights after 1 iteration and after 100 iterations; output units: left, straight, right, up]

53 Face Recognition Input encoding – The ANN input is to be some representation of the image – One option: preprocess the image to extract edges, regions of uniform intensity, or other local image features. One difficulty with this design option is that it would lead to a variable number of features per image – Alternative: encode the image as a fixed set of 30×32 pixel intensity values, with one network input per pixel. Intensity values range from 0 to 255

54 Face Recognition Output encoding – The ANN must output one of four values indicating the direction in which the person is looking – Option: a single output unit – Option: four distinct output units (1-of-n encoding) Advantages of 1-of-n – Provides more degrees of freedom to the network for representing the target function – The difference between the highest-valued output and the second-highest can be used as a measure of the confidence in the network prediction

55 Face Recognition Output – Four target values … – We use target values of 0.1 and 0.9 rather than 0 and 1 – The reason for avoiding target values of 0 and 1 is that sigmoid units cannot produce these output values given finite weights – Values of 0.1 and 0.9 are achievable using a sigmoid unit with finite weights
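
The 1-of-n output scheme with 0.1/0.9 targets can be sketched as follows. The ordering of the four output units, the label strings, and the function names are assumptions made for this illustration; the confidence measure is the gap between the highest and second-highest outputs.

```python
# Assumed ordering of the four output units.
DIRECTIONS = ("left", "straight", "right", "up")


def encode_target(direction: str) -> list:
    """1-of-n target vector using 0.9 for the true class and 0.1 elsewhere."""
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction}")
    return [0.9 if d == direction else 0.1 for d in DIRECTIONS]


def decode_output(outputs):
    """Return (predicted direction, confidence).

    Confidence is the highest output minus the second-highest output.
    """
    ranked = sorted(zip(outputs, DIRECTIONS), reverse=True)
    (best, label), (second, _) = ranked[0], ranked[1]
    return label, best - second
```

For instance, the output vector `[0.2, 0.7, 0.4, 0.1]` decodes to `"straight"` with a confidence of about 0.3.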

56 Face Recognition Network graph structure – How many units to include in the network and how to interconnect them – A layered network with feedforward connections from every unit in one layer to every unit in the next (two layers) – How many hidden units: with 3 units, 90% accuracy with 5 minutes running time; with 100 units, 91%-92% accuracy with 1 hour running time – Extra hidden units above this number do not dramatically affect generalization accuracy – Increasing the number of hidden units often increases the tendency to overfit the training data

57 Face Recognition Other learning algorithm parameters – The learning rate η was set to 0.3 and the momentum α was set to 0.3 – Lower values result in longer running times – Full gradient descent was used in all these experiments – Weights were initialized to 0 at the beginning
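
The slides mention a momentum α without giving its update rule. A commonly used form, assumed here rather than taken from the slides, makes each weight change depend partly on the previous one: $\Delta w_{ji}(n) = \eta \delta_j x_{ji} + \alpha \Delta w_{ji}(n - 1)$. A minimal sketch with hypothetical names:

```python
def momentum_update(weights, gradient_terms, previous_delta, eta=0.3, alpha=0.3):
    """One weight update with momentum.

    gradient_terms[i] is delta_j * x_ji for weight i, and previous_delta[i]
    is the change applied to weight i on the previous iteration.
    Returns (new_weights, delta) so delta can be passed to the next call.
    """
    delta = [eta * g + alpha * p
             for g, p in zip(gradient_terms, previous_delta)]
    new_weights = [w + d for w, d in zip(weights, delta)]
    return new_weights, delta
```

On the first iteration `previous_delta` is all zeros, so the update reduces to the plain rule $\Delta w_{ji} = \eta \delta_j x_{ji}$.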

