Published by Cecil Chandler. Modified over 8 years ago.
1
Artificial Neural Network
2
Introduction
Artificial neural networks (ANNs) provide a robust approach to approximating real-valued, discrete-valued, and vector-valued target functions.
Trained with the Backpropagation algorithm, ANNs have been successful in many practical problems, such as interpreting visual scenes and speech recognition.
ANN learning is robust to errors in the training data.
3
Biological Motivation
Biological learning systems are built of very complex webs of interconnected neurons.
An ANN is likewise a densely interconnected set of simple units, where each unit takes a number of real-valued inputs and produces a single real-valued output.
The human brain contains a densely interconnected network of approximately $10^{11}$ neurons, each connected to roughly $10^{3}$ other neurons.
Even the fastest neuron switching times are far slower than computer switching speeds, yet humans make complex decisions quickly.
4
Biological Motivation
ANNs are not exactly the same as biological systems.
Two groups of research:
– Using ANNs to study and model biological learning processes
– Pursuing the goal of highly effective machine learning algorithms
5
Neural Network Representation
Example: ALVINN uses a learned ANN to steer an autonomous vehicle driving at normal speeds on public highways.
6
Neural Network Representation
The ALVINN network is typical of many ANNs:
– Directed and cycle-free
Other structures are possible:
– Acyclic or cyclic
– Directed or undirected
The Backpropagation algorithm assumes the network is a fixed structure that corresponds to a directed graph, possibly containing cycles.
Learning corresponds to choosing a weight value for each edge in the graph.
7
Appropriate Problems for Neural Network Learning
Instances are represented by many attribute-value pairs.
The target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes.
The training examples may contain errors.
Long training times are acceptable.
Fast evaluation of the learned target function may be required.
The ability of humans to understand the learned target function is not important.
8
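Perceptrons
One type of ANN is based on a unit called a perceptron.
A perceptron takes a vector of real-valued inputs, calculates a linear combination of these inputs, then outputs 1 if the result is greater than some threshold and -1 otherwise:
$o(x_1, \ldots, x_n) = 1$ if $w_0 + w_1 x_1 + \cdots + w_n x_n > 0$, and $-1$ otherwise.
Each $w_i$ is a real-valued constant, or weight, that determines the contribution of input $x_i$ to the perceptron output. The quantity $-w_0$ plays the role of the threshold.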
9
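Perceptrons
To simplify notation, we imagine an additional constant input $x_0 = 1$, allowing us to write the above inequality as $\sum_{i=0}^{n} w_i x_i > 0$, or in vector form as $\vec{w} \cdot \vec{x} > 0$.
Perceptron function: $o(\vec{x}) = \operatorname{sgn}(\vec{w} \cdot \vec{x})$, where $\operatorname{sgn}(y) = 1$ if $y > 0$ and $-1$ otherwise.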
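The perceptron function described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function name `perceptron_output` and the convention that `weights[0]` holds $w_0$ are assumptions made here.

```python
from typing import Sequence


def perceptron_output(weights: Sequence[float], inputs: Sequence[float]) -> int:
    """Return sgn(w . x): 1 if the weighted sum is positive, otherwise -1.

    Assumption for this sketch: weights[0] is w0 and `inputs` holds x1..xn only.
    """
    x = [1.0, *inputs]  # prepend the constant input x0 = 1
    if len(weights) != len(x):
        raise ValueError("expected one weight per input plus w0")
    net = sum(w * xi for w, xi in zip(weights, x))
    return 1 if net > 0 else -1
```

For example, with the hypothetical weights `[-0.5, 1.0, 1.0]` the unit fires for the input `[1, 0]` (weighted sum 0.5) but not for `[0, 0]` (weighted sum -0.5).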
10
Perceptrons
Learning a perceptron involves choosing values for the weights $w_0, w_1, \ldots, w_n$.
Therefore, the space $H$ of candidate hypotheses considered in perceptron learning is the set of all possible real-valued weight vectors: $H = \{ \vec{w} \mid \vec{w} \in \mathbb{R}^{n+1} \}$.
11
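Representational Power of Perceptrons
A perceptron represents a hyperplane decision surface in the n-dimensional space of instances.
It outputs 1 for instances on one side of the hyperplane and -1 for instances on the other side.
The decision hyperplane is $\vec{w} \cdot \vec{x} = 0$.
Sets of examples that can be separated by such a hyperplane are called linearly separable.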
12
Representational Power of Perceptrons
A single perceptron can be used to represent many Boolean functions.
– How to implement AND and OR?
Perceptrons can represent all of the primitive Boolean functions AND, OR, NAND, and NOR.
XOR cannot be represented by a single perceptron, because its training examples are not linearly separable.
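One possible answer to the AND/OR question is sketched below. The weight values and the 0/1 input encoding are assumptions chosen for illustration (many other weight vectors work equally well); the helper names `perceptron` and `truth_table` are hypothetical.

```python
from itertools import product


def perceptron(weights, inputs):
    """Thresholded unit: 1 if w0 + sum(w_i * x_i) > 0, otherwise -1."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if net > 0 else -1


# Assumed encoding: inputs are 0 (false) or 1 (true); output 1 means true.
AND_WEIGHTS = (-0.8, 0.5, 0.5)  # fires only when both inputs are 1
OR_WEIGHTS = (-0.3, 0.5, 0.5)   # fires when at least one input is 1


def truth_table(weights):
    """Evaluate the unit on all four Boolean input pairs."""
    return {bits: perceptron(weights, bits) for bits in product((0, 1), repeat=2)}
```

Calling `truth_table(AND_WEIGHTS)` and `truth_table(OR_WEIGHTS)` lists the output for every input pair. Only the bias weight $w_0$ differs between the two units, which shifts the same hyperplane to separate a different set of corners.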
13
Representational Power of Perceptrons
Every Boolean function can be represented by some network of interconnected units based on these primitives.
In fact, every Boolean function can be represented by a network of perceptrons only two levels deep.
Networks of units can represent a rich variety of functions that single units alone cannot.
14
The Perceptron Training Rules
Learning the weights for a single perceptron
Determine a weight vector that causes the perceptron to produce the correct output (1 or -1) for each training example.
Two algorithms
– Perceptron rule
– Delta rule
They converge to somewhat different acceptable hypotheses, under somewhat different conditions.
15
The Perceptron Training Rules
Perceptron rule
– Begin with random weights
– Iteratively apply the perceptron to each training example, and modify the weights whenever it misclassifies an example
– Iterate through the training examples as many times as needed until all the examples have been correctly classified
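The procedure above can be sketched as follows, using the standard perceptron training rule $w_i \leftarrow w_i + \eta (t - o) x_i$. The function names, the learning rate of 0.1, the epoch cap, and the fixed random seed are assumptions added for this illustration; the loop is only guaranteed to terminate early when the data are linearly separable.

```python
import random


def predict(w, x):
    """Perceptron output for inputs x, with w[0] acting as the bias weight w0."""
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else -1


def train_perceptron(examples, eta=0.1, max_epochs=1000, seed=0):
    """examples: list of (inputs, target) pairs with targets in {1, -1}."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = predict(w, x)
            if o != t:  # weights change only on misclassified examples
                mistakes += 1
                w[0] += eta * (t - o)  # x0 = 1
                for i, xi in enumerate(x, start=1):
                    w[i] += eta * (t - o) * xi
        if mistakes == 0:  # a full pass with no errors: stop
            break
    return w
```

A usage example: `train_perceptron([((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)])` returns weights that classify all four AND examples correctly.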
16
Gradient Descent and the Delta Rule
The delta rule converges toward a best-fit approximation to the target concept even if the examples are not linearly separable.
It uses gradient descent to search the hypothesis space of possible weight vectors to find the one that best fits the training examples.
Gradient descent can also be used to search hypothesis spaces containing many different types of continuously parameterized hypotheses.
17
Gradient Descent To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.
18
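Gradient Descent and the Delta Rule
Consider the task of training an unthresholded perceptron, that is, a linear unit whose output is $o(\vec{x}) = \vec{w} \cdot \vec{x}$.
Training error of a hypothesis (weight vector): $E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$, where $D$ is the set of training examples, $t_d$ is the target output for example $d$, and $o_d$ is the output of the linear unit for $d$.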
19
Visualizing the Hypothesis Space
The $w_0, w_1$ plane represents the entire hypothesis space.
Gradient descent starts with an arbitrary initial weight vector, then repeatedly modifies it in small steps, each in the direction that produces the steepest descent along the error surface.
20
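Gradient Descent
Gradient descent follows the direction of steepest descent along the error surface.
– The gradient is the vector of derivatives of $E$ with respect to each component of $\vec{w}$, written $\nabla E(\vec{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$
– The gradient specifies the direction that produces the steepest increase in $E$; its negative, $-\nabla E(\vec{w})$, is the direction of steepest decrease
Gradient descent rule: $\vec{w} \leftarrow \vec{w} + \Delta \vec{w}$, where $\Delta \vec{w} = -\eta \nabla E(\vec{w})$. In component form, $w_i \leftarrow w_i + \Delta w_i$ with $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$.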
21
Derivative Rules
22
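Gradient Descent
The vector of derivatives that forms the gradient can be obtained by differentiating $E$: $\frac{\partial E}{\partial w_i} = \sum_{d \in D} (t_d - o_d)(-x_{id})$, where $x_{id}$ is the input component $x_i$ for training example $d$.
Substituting into the gradient descent rule gives $\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) x_{id}$.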
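The differentiation can be written out step by step. The derivation below is a sketch that assumes the linear unit $o_d = \vec{w} \cdot \vec{x}_d$ and the squared-error measure $E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$, with $x_{id}$ denoting input component $i$ of example $d$.

```latex
\begin{align*}
\frac{\partial E}{\partial w_i}
  &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \\
  &= \frac{1}{2} \sum_{d \in D} 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d) \\
  &= \sum_{d \in D} (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d) \\
  &= \sum_{d \in D} (t_d - o_d)(-x_{id})
\end{align*}
```

The last step uses the fact that only the term $w_i x_{id}$ of $\vec{w} \cdot \vec{x}_d$ depends on $w_i$.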
23
Gradient Descent Algorithm
Gradient-Descent(training_examples, $\eta$)
– Each training example is a pair of the form $\langle \vec{x}, t \rangle$, where $\vec{x}$ is the vector of input values and $t$ is the target output value. $\eta$ is the learning rate.
Initialize each $w_i$ to some small random value
Until the termination condition is met, Do
– Initialize each $\Delta w_i$ to zero
– For each $\langle \vec{x}, t \rangle$ in training_examples, Do
  Input the instance $\vec{x}$ to the unit and compute the output $o$
  For each linear unit weight $w_i$, Do
  – $\Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i$ (1)
– For each linear unit weight $w_i$, Do
  $w_i \leftarrow w_i + \Delta w_i$
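A minimal Python sketch of this batch procedure for a single linear unit follows. The function name, the default learning rate, the fixed number of epochs used as the termination condition, and the random seed are all assumptions made for illustration.

```python
import random


def gradient_descent(examples, eta=0.05, epochs=500, seed=0):
    """Batch gradient descent for a linear unit o = w . x.

    examples: list of (inputs, target) pairs; w[0] is the weight of x0 = 1.
    Termination condition (an assumption of this sketch): a fixed epoch count.
    """
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        delta = [0.0] * (n + 1)  # initialize each delta w_i to zero
        for x, t in examples:
            xs = [1.0, *x]
            o = sum(wi * xi for wi, xi in zip(w, xs))  # linear unit output
            for i, xi in enumerate(xs):
                delta[i] += eta * (t - o) * xi  # accumulate over all examples
        # weights change only after the whole training set has been seen
        w = [wi + di for wi, di in zip(w, delta)]
    return w
```

On noise-free data generated by $t = 1 + 2x$, for example `[((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]`, the returned weights approach `[1.0, 2.0]`. A learning rate that is too large for the data makes the updates diverge, so the default here should be treated as specific to this small example.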
24
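Gradient Descent Algorithm
Gradient-Descent
– Pick an initial random weight vector
– Update each weight $w_i$ by adding $\Delta w_i$, and repeat
Because the error surface contains only a single global minimum, this algorithm converges to a weight vector with minimum error, provided the learning rate $\eta$ is sufficiently small.
Choosing $\eta$: if it is too large, the search may overstep the minimum in the error surface rather than settle into it.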
25
Stochastic Approximation to Gradient Descent
Gradient descent is a strategy for searching through a large or infinite hypothesis space that can be applied whenever
– The hypothesis space contains continuously parameterized hypotheses
– The error can be differentiated with respect to these hypothesis parameters
Major problems of gradient descent
– Converging to a local minimum can sometimes be quite slow
– If the error surface has multiple local minima, there is no guarantee of finding the global minimum
26
Stochastic Gradient Descent Algorithm
Stochastic-Gradient-Descent(training_examples, $\eta$)
– Each training example is a pair of the form $\langle \vec{x}, t \rangle$, where $\vec{x}$ is the vector of input values and $t$ is the target output value. $\eta$ is the learning rate.
Initialize each $w_i$ to some small random value
Until the termination condition is met, Do
– For each $\langle \vec{x}, t \rangle$ in training_examples, Do
  Input the instance $\vec{x}$ to the unit and compute the output $o$
  For each linear unit weight $w_i$, Do
  – $w_i \leftarrow w_i + \eta (t - o) x_i$
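For comparison with the batch version, here is a hedged sketch of the stochastic variant for a linear unit, in which the weights are updated immediately after each example. The function name, defaults, and fixed epoch count are assumptions of this illustration.

```python
import random


def stochastic_gradient_descent(examples, eta=0.05, epochs=1000, seed=0):
    """Incremental (stochastic) gradient descent for a linear unit o = w . x.

    examples: list of (inputs, target) pairs; w[0] is the weight of x0 = 1.
    """
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        for x, t in examples:
            xs = [1.0, *x]
            o = sum(wi * xi for wi, xi in zip(w, xs))
            # update right away, using the error on this single example
            for i, xi in enumerate(xs):
                w[i] += eta * (t - o) * xi
    return w
```

The only structural difference from the batch sketch is that no $\Delta w_i$ accumulator is kept: each example changes the weights before the next example is evaluated.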
27
Stochastic Approximation to Gradient Descent
Stochastic gradient descent
– Approximates the gradient descent search by updating weights incrementally, following the calculation of the error for each individual example
– Uses a distinct error function for each individual training example $d$: $E_d(\vec{w}) = \frac{1}{2} (t_d - o_d)^2$, giving the update $\Delta w_i = \eta (t - o) x_i$
– Provides a reasonable approximation to descending the gradient with respect to the original error function $E$
– By making $\eta$ sufficiently small, it can be made to approximate true gradient descent arbitrarily closely
28
Gradient Descent and Stochastic Gradient Descent
In standard gradient descent, the error is summed over all examples before updating weights, whereas in stochastic gradient descent weights are updated upon examining each training example.
Summing over multiple examples in standard gradient descent requires more computation per weight update step. Because it uses the true gradient, standard gradient descent is often used with a larger step size per weight update than stochastic gradient descent.
In cases where there are multiple local minima with respect to $E$, stochastic gradient descent can sometimes avoid falling into these local minima, because it follows the gradients of the individual $E_d$ rather than the gradient of $E$.
29
Multilayer Network and The Backpropagation Algorithm
Single perceptrons can only express linear decision surfaces.
In contrast, multilayer networks learned by the Backpropagation algorithm are capable of expressing a rich variety of nonlinear decision surfaces.
30
Multilayer Network and The Backpropagation Algorithm
31
A Differentiable Threshold Unit
What type of unit shall we use as the basis for constructing multilayer networks?
Multiple layers of cascaded linear units still produce only linear functions.
The perceptron unit is an option; however, its discontinuous threshold makes it non-differentiable and hence unsuitable for gradient descent.
Sigmoid unit
– Output is a nonlinear function of its inputs
– Output is a differentiable function of its inputs
32
A Differentiable Threshold Unit
The sigmoid unit computes its output $o$ as $o = \sigma(\vec{w} \cdot \vec{x})$, where $\sigma(y) = \frac{1}{1 + e^{-y}}$.
33
A Differentiable Threshold Unit
Sigmoid function
– Also called the logistic function
– Output ranges between 0 and 1
– Increases monotonically with its input
– Its derivative is easily expressed in terms of its output: $\frac{d\sigma(y)}{dy} = \sigma(y)(1 - \sigma(y))$
Other functions
– Other differentiable functions with easily calculated derivatives are sometimes used in place of $\sigma$
– For example, $e^{-y}$ in the sigmoid function can be replaced by $e^{-k \cdot y}$, where $k$ is some positive constant that determines the steepness of the threshold
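The sigmoid and its derivative can be sketched as below. The function names and the numerically stable two-branch form are choices made for this illustration; the steepness parameter `k` corresponds to the $e^{-k \cdot y}$ variant mentioned above, with `k = 1` giving the ordinary logistic function.

```python
import math


def sigmoid(y, k=1.0):
    """Logistic function 1 / (1 + e^(-k*y)), written to avoid overflow."""
    z = k * y
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)


def sigmoid_derivative_from_output(o, k=1.0):
    """Derivative with respect to y, expressed via the output o = sigmoid(y, k)."""
    return k * o * (1.0 - o)
```

Because the derivative depends only on the unit's output, a network that has already computed `o` in its forward pass does not need to re-evaluate the exponential when computing gradients.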
34
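Backpropagation Algorithm
The Backpropagation algorithm learns the weights for a multilayer network.
It employs gradient descent to attempt to minimize the squared error between the network output values and the target values.
Because the network can have multiple output units, the errors are summed over all of them: $E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2$, where $outputs$ is the set of output units and $t_{kd}$ and $o_{kd}$ are the target and output values of output unit $k$ for training example $d$.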
35
Backpropagation Algorithm
The algorithm searches a large hypothesis space defined by all possible weight values for all the units in the network.
One major difference in the case of multilayer networks is that the error surface can have multiple local minima, so gradient descent is guaranteed only to converge toward some local minimum.
It can still produce excellent results in many real-world applications.
36
Backpropagation Algorithm
Backpropagation(training_examples, $\eta$, $n_{in}$, $n_{out}$, $n_{hidden}$)
– Create a feed-forward network with $n_{in}$ inputs, $n_{hidden}$ hidden units, and $n_{out}$ output units
– Initialize all network weights to small random numbers (between -0.05 and 0.05)
– Until the termination condition is met, Do
  For each $\langle \vec{x}, \vec{t} \rangle$ in training_examples, Do
  Propagate the input forward through the network:
  – 1. Input the instance $\vec{x}$ to the network and compute the output $o_u$ of every unit $u$ in the network
  Propagate the errors backward through the network:
  – 2. For each network output unit $k$, calculate its error term $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
  – 3. For each hidden unit $h$, calculate its error term $\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{kh} \delta_k$
  – 4. Update each network weight $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = \eta \delta_j x_{ji}$
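The four numbered steps can be sketched in plain Python for a network with one hidden layer of sigmoid units. Everything named here (`init_network`, `forward`, `backprop_step`, `total_error`, `train`, the dictionary layout, the bias weights fed by a constant input of 1, and the default learning rate) is an assumption of this illustration rather than part of the slides, and the plain `math.exp` sigmoid assumes moderately sized weighted sums.

```python
import math
import random


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def init_network(n_in, n_hidden, n_out, seed=0):
    """Weights in [-0.05, 0.05]; index 0 of each unit is an assumed bias weight."""
    rng = random.Random(seed)

    def layer(n_units, n_inputs):
        return [[rng.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
                for _ in range(n_units)]

    return {"hidden": layer(n_hidden, n_in), "output": layer(n_out, n_hidden)}


def forward(net, x):
    """Step 1: compute the output of every hidden and output unit."""
    xs = [1.0, *x]
    hidden = [sigmoid(sum(w * v for w, v in zip(unit, xs))) for unit in net["hidden"]]
    hs = [1.0, *hidden]
    out = [sigmoid(sum(w * v for w, v in zip(unit, hs))) for unit in net["output"]]
    return hidden, out


def backprop_step(net, x, t, eta):
    hidden, out = forward(net, x)
    # Step 2: error term of each output unit k
    delta_out = [o * (1 - o) * (tk - o) for o, tk in zip(out, t)]
    # Step 3: error term of each hidden unit h, computed from the weights
    # before they are updated (index h + 1 skips the bias weight)
    delta_hidden = [
        oh * (1 - oh) * sum(net["output"][k][h + 1] * delta_out[k]
                            for k in range(len(out)))
        for h, oh in enumerate(hidden)
    ]
    # Step 4: w_ji <- w_ji + eta * delta_j * x_ji
    for k, unit in enumerate(net["output"]):
        for i, v in enumerate([1.0, *hidden]):
            unit[i] += eta * delta_out[k] * v
    for h, unit in enumerate(net["hidden"]):
        for i, v in enumerate([1.0, *x]):
            unit[i] += eta * delta_hidden[h] * v


def total_error(net, examples):
    """Half the squared error summed over all examples and output units."""
    return 0.5 * sum((tk - ok) ** 2
                     for x, t in examples
                     for tk, ok in zip(t, forward(net, x)[1]))


def train(net, examples, eta=0.3, epochs=1000):
    """Termination condition (an assumption here): a fixed number of epochs."""
    for _ in range(epochs):
        for x, t in examples:
            backprop_step(net, x, t, eta)
    return net
```

A usage example: build `net = init_network(2, 2, 1)`, train it on the OR function with targets 0.1 and 0.9, and compare `total_error` before and after training to see the error decrease.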
37
Backpropagation Algorithm
This algorithm applies to layered feedforward networks containing two layers of sigmoid units, with units at each layer connected to all units from the preceding layer.
This is the incremental, or stochastic gradient descent, version of the Backpropagation algorithm.
– An index is assigned to each node in the network, where a node is either an input to the network or the output of some unit
– $x_{ji}$ denotes the input from node $i$ to unit $j$, and $w_{ji}$ denotes the corresponding weight
– $\delta_n$ denotes the error term associated with unit $n$
38
Backpropagation Algorithm
The algorithm begins by constructing a network with the desired number of hidden and output units and initializing all network weights to small random values.
Given this fixed network structure, the main loop of the algorithm then repeatedly iterates over the training examples.
For each training example, it calculates the error of the network output for this example, computes the gradient with respect to this error, and updates the weights.
This gradient descent step is iterated until the network performs acceptably well.
39
Weight Update Rule
Similar to the delta rule
Update each weight in proportion to the learning rate $\eta$, the input value $x_{ji}$ to which the weight is applied, and the error in the output of the unit.
The error $(t - o)$ in the delta rule is replaced by a more complex error term $\delta_j$.
40
Error in Backpropagation Algorithm
Error for output unit $k$
– $\delta_k$ is the familiar $(t_k - o_k)$ from the delta rule, multiplied by the factor $o_k (1 - o_k)$, which is the derivative of the sigmoid squashing function
Error for hidden unit $h$
– No target values are directly available to indicate the error of hidden units' values
– The error term for hidden unit $h$ is calculated by summing the error terms $\delta_k$ for each output unit influenced by $h$, weighting each $\delta_k$ by $w_{kh}$, the weight from hidden unit $h$ to output unit $k$
41
Backpropagation Algorithm
The algorithm updates weights incrementally, following the presentation of each training example. This corresponds to a stochastic approximation to gradient descent.
To obtain the true gradient of $E$, one would sum the $\delta_j x_{ji}$ values over all training examples before altering weight values.
The weight-update loop may be iterated thousands of times in a typical application.
A variety of termination conditions can be used to halt the procedure:
– Halt after a fixed number of iterations
– Halt once the error on the training examples falls below some threshold
– Halt once the error on a separate validation set of examples meets some criterion
The choice matters: too few iterations can fail to reduce the error sufficiently, and too many can lead to overfitting the training data.
42
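Derivation of the Backpropagation Rules
Stochastic gradient descent involves iterating through the training examples one at a time, for each training example $d$ descending the gradient of the error $E_d$ with respect to this single example.
For each example, every weight $w_{ji}$ is updated by $\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}}$, where $E_d(\vec{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$.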
43
Subscripts and Variables
$x_{ji}$, the ith input to unit j
$w_{ji}$, the weight associated with the ith input to unit j
$net_j = \sum_i w_{ji} x_{ji}$, the weighted sum of inputs for unit j
$o_j$, the output computed by unit j
$t_j$, the target output for unit j
$\sigma$, the sigmoid function
$outputs$, the set of units in the final layer of the network
$Downstream(j)$, the set of units whose immediate inputs include the output of unit j
44
Derivation of the Backpropagation Rules
We consider two cases in turn: the case where unit j is an output unit for the network, and the case where j is an internal (hidden) unit.
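For the output-unit case, the derivation can be sketched as follows. It assumes the per-example error $E_d = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$, sigmoid units with $o_j = \sigma(net_j)$, and $net_j = \sum_i w_{ji} x_{ji}$; the weight $w_{ji}$ influences the error only through $net_j$.

```latex
\begin{align*}
\frac{\partial E_d}{\partial w_{ji}}
  &= \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}}
   = \frac{\partial E_d}{\partial net_j} \, x_{ji} \\
\frac{\partial E_d}{\partial net_j}
  &= \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j} \\
\frac{\partial E_d}{\partial o_j}
  &= \frac{\partial}{\partial o_j} \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2
   = -(t_j - o_j) \\
\frac{\partial o_j}{\partial net_j}
  &= \frac{\partial \sigma(net_j)}{\partial net_j} = o_j (1 - o_j) \\
\Delta w_{ji}
  &= -\eta \frac{\partial E_d}{\partial w_{ji}}
   = \eta \, (t_j - o_j) \, o_j (1 - o_j) \, x_{ji}
\end{align*}
```

Writing $\delta_j = -\frac{\partial E_d}{\partial net_j} = (t_j - o_j) \, o_j (1 - o_j)$ gives the compact form $\Delta w_{ji} = \eta \, \delta_j \, x_{ji}$.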
45
Hidden Unit Weight
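For a hidden unit $j$, a sketch of the derivation under the same assumptions (sigmoid units, $\delta_k = -\frac{\partial E_d}{\partial net_k}$) follows. The key observation is that $net_j$ can influence the error only through the units in $Downstream(j)$.

```latex
\begin{align*}
\frac{\partial E_d}{\partial net_j}
  &= \sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k}
     \frac{\partial net_k}{\partial net_j} \\
  &= \sum_{k \in Downstream(j)} -\delta_k \,
     \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j} \\
  &= \sum_{k \in Downstream(j)} -\delta_k \, w_{kj} \, o_j (1 - o_j) \\
\delta_j
  &= -\frac{\partial E_d}{\partial net_j}
   = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k \, w_{kj} \\
\Delta w_{ji}
  &= \eta \, \delta_j \, x_{ji}
\end{align*}
```

In a two-layer network, $Downstream(j)$ for a hidden unit is simply the set of output units, which recovers the hidden-unit error term used in the algorithm.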
46
Convergence and Local Minima
Backpropagation is guaranteed to converge only toward some local minimum of $E$, not necessarily to the global minimum error.
Despite this, Backpropagation is a highly effective function approximation method in practice.
47
Overfitting
49
As training proceeds, some weights begin to grow in order to reduce the error over the training data, and the complexity of the learned decision surface increases.
Given enough iterations, Backpropagation will often be able to create overly complex decision surfaces that fit noise in the training data or unrepresentative characteristics of the particular training sample.
50
Solution for Overfitting
Weight decay
– Decrease each weight by some small factor during each iteration
Validation data
– Cross-validation
– k-fold cross-validation
– Separate test data
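Weight decay can be sketched as a small change to the weight update: each weight is shrunk slightly before the usual gradient step is added. The function name `apply_weight_decay`, the argument layout, and the default decay factor are assumptions chosen for this illustration.

```python
def apply_weight_decay(weights, gradient_steps, decay=1e-3):
    """Shrink each weight by the factor (1 - decay), then add its gradient step.

    weights: current weight values
    gradient_steps: the usual updates (for example eta * delta_j * x_ji)
    decay: small positive factor; the default is an arbitrary illustrative value
    """
    return [(1.0 - decay) * w + step for w, step in zip(weights, gradient_steps)]
```

With a zero gradient step the weights simply decay toward zero, which biases learning against the large weights associated with overly complex decision surfaces.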
51
An Illustrative Example
Training data
– Images of 20 different people
– About 32 images per person, varying in:
  Expression (happy, sad, angry, neutral)
  Direction in which they were looking (left, right, straight, up)
  Whether they were wearing sunglasses
– 624 greyscale images in total, each with a resolution of 120*128
Output: the direction in which the person is looking
52
Learned Hidden Representations
30*32 resolution input images
[Figure: network weights after 1 iteration and after 100 iterations, shown for the four output units: left, straight, right, up]
53
Face Recognition
Input encoding
– The ANN input must be some representation of the image
– One option: preprocess the image to extract edges, regions of uniform intensity, or other local image features. One difficulty with this design option is that it would lead to a variable number of features per image, whereas an ANN has a fixed number of input units
– The option used: encode the image as a fixed set of 30*32 pixel intensity values, with one network input per pixel. Intensity values range from 0 to 255
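One way to produce such a fixed-size input vector is sketched below. The slides do not say how the 120*128 images are reduced to 30*32 inputs or whether intensities are rescaled, so both the 4*4 block averaging and the scaling of intensities to the range 0 to 1 are assumptions of this sketch, as is the function name `encode_image`.

```python
def encode_image(image, block=4, max_intensity=255):
    """Turn a greyscale image (list of rows of 0..255 ints) into network inputs.

    Assumptions of this sketch: each coarse pixel is the mean of a
    block x block patch, and intensities are linearly scaled to [0, 1].
    A 120 x 128 image yields 30 * 32 = 960 input values.
    """
    rows, cols = len(image), len(image[0])
    if rows % block or cols % block:
        raise ValueError("image dimensions must be multiples of the block size")
    inputs = []
    for r in range(0, rows, block):
        for c in range(0, cols, block):
            total = sum(image[r + dr][c + dc]
                        for dr in range(block) for dc in range(block))
            inputs.append(total / (block * block * max_intensity))
    return inputs
```

The result always has the same length for a given image size, which is what makes this encoding compatible with a fixed number of network inputs.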
54
Face Recognition
Output encoding
The ANN must output one of four values indicating the direction in which the person is looking.
– A single output unit, or
– Four distinct output units (1-of-n encoding)
Advantages of 1-of-n encoding
– It gives the network more degrees of freedom for representing the target function
– The difference between the highest-valued output and the second-highest can be used as a measure of the confidence in the network prediction
55
Face Recognition
Output
– Four target values …
– We use target values of 0.1 and 0.9 rather than 0 and 1
– The reason for avoiding target values of 0 and 1 is that sigmoid units cannot produce these output values given finite weights
– Values of 0.1 and 0.9 are achievable using a sigmoid unit with finite weights
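A short sketch of 1-of-n target encoding with 0.1 and 0.9, together with the confidence measure based on the two highest outputs, is shown below. The ordering of the directions and the function names are assumptions made for illustration.

```python
# Assumed ordering of the four output units.
DIRECTIONS = ("left", "straight", "right", "up")


def encode_target(direction, low=0.1, high=0.9):
    """1-of-n target vector: `high` for the true direction, `low` elsewhere."""
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction}")
    return [high if d == direction else low for d in DIRECTIONS]


def decode_output(outputs):
    """Return (predicted direction, confidence).

    Confidence is the gap between the highest and second-highest outputs.
    """
    ranked = sorted(zip(outputs, DIRECTIONS), reverse=True)
    (best, label), (second, _) = ranked[0], ranked[1]
    return label, best - second
```

For instance, `encode_target("right")` gives `[0.1, 0.1, 0.9, 0.1]`, and `decode_output([0.2, 0.7, 0.1, 0.3])` predicts `"straight"` with a confidence of about 0.4.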
56
Face Recognition
Network graph structure
– How many units to include in the network and how to interconnect them
– Layered network with feedforward connections from every unit in one layer to every unit in the next (two layers)
– How many hidden units?
  3 units: 90% accuracy with 5 minutes running time
  100 units: 91%-92% accuracy with 1 hour running time
– Extra hidden units above this number do not dramatically affect generalization accuracy
– Increasing the number of hidden units often increases the tendency to overfit the training data
57
Face Recognition
Other learning algorithm parameters
– The learning rate $\eta$ and the momentum $\alpha$ were both set to 0.3
– A lower learning rate results in a longer running time
– Full gradient descent was used in all these experiments
– Weights are initialized to 0 at the beginning
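The slide mentions momentum without giving its formula. A common form, assumed here, makes each weight change depend partly on the previous one: $\Delta w_{ji}(n) = \eta \, \delta_j \, x_{ji} + \alpha \, \Delta w_{ji}(n-1)$. The sketch below applies that rule to a flat list of weights; the function name and argument layout are hypothetical.

```python
def momentum_update(weights, gradient_terms, prev_deltas, eta=0.3, alpha=0.3):
    """One weight update with momentum.

    gradient_terms: the delta_j * x_ji value for each weight
    prev_deltas: the weight changes applied in the previous iteration
    Returns (new_weights, deltas); pass `deltas` back in as prev_deltas next time.
    """
    deltas = [eta * g + alpha * p for g, p in zip(gradient_terms, prev_deltas)]
    new_weights = [w + d for w, d in zip(weights, deltas)]
    return new_weights, deltas
```

When successive gradient terms point in the same direction, the changes build up (0.3, then 0.39, and so on for a constant gradient term of 1), which is how momentum speeds movement along flat regions of the error surface.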