CS 484 – Artificial Intelligence
Announcements
Homework 5 due today, October 30
Book Review due today, October 30
Lab 3 due Thursday, November 1
Homework 6 due Tuesday, November 6
Current Event: Kay - today; Chelsea - Thursday, November 1
Neural Networks Lecture 12
Artificial Neural Networks
Artificial neural networks (ANNs) provide a practical method for learning real-valued functions, discrete-valued functions, and vector-valued functions.
They are robust to errors in the training data.
They have been successfully applied to such problems as interpreting visual scenes, speech recognition, and learning robot control strategies.
Biological Neurons
The human brain is made up of billions of simple processing units – neurons.
Inputs are received on dendrites, and if the input levels are over a threshold, the neuron fires, passing a signal through the axon to a synapse, which then connects to another neuron.
Biological learning systems are built out of very complex webs of interconnected neurons.
The human brain contains approximately 10^11 neurons, each connected to approximately 10^4 others. Neuron activity is typically excited or inhibited through connections to other neurons.
The fastest neuron switching time is about 10^-3 seconds (versus about 10^-10 seconds for computer switching), yet you can recognize your mother in about 10^-1 seconds – brains are highly parallel.
There is more complexity in biological systems than is modeled by ANNs, and many features of ANNs are known to be inconsistent with biological systems. For example, ANN units output a single constant value, whereas biological neurons output a complex time series of spikes.
Here we are interested in obtaining highly effective machine learning algorithms, independent of whether these algorithms mirror biological processes.
Neural Network Representation
ALVINN uses a learned ANN to steer an autonomous vehicle driving at normal speeds on public highways.
Input to the network: a 30x32 grid of pixel intensities obtained from a forward-pointing camera mounted on the vehicle.
Output: the direction in which the vehicle is steered.
It was trained to mimic the observed steering commands of a human driving the vehicle for approximately 5 minutes.
ALVINN has successfully driven at speeds up to 70 miles per hour for 90 miles (driving in the left lane of a divided public highway, with other vehicles present).
ALVINN
Each node (circle) corresponds to the output of a single network unit, and the lines entering the node from below are its inputs.
4 nodes receive inputs from all of the 30x32 pixels. They are called hidden units because their output is available only within the network and not as part of the global network output. Each computes a single real-valued output based on a weighted combination of its 960 inputs.
The hidden outputs are used as inputs to a second layer of 30 "output" units. Each output unit corresponds to a particular steering direction, and the output values determine which steering direction is recommended most strongly.
The figure on the right shows the weights associated with one of the hidden nodes: the large matrix shows the weights from the 30x32 pixels (white indicates positive weights, black negative), and the smaller rectangle shows the weights from that hidden unit to each of the 30 output units.
ALVINN is typical of many ANNs: a directed acyclic graph. ANNs can be graphs of many structures (acyclic or cyclic, directed or undirected); we focus on directed graphs (possibly with cycles). Learning corresponds to choosing a weight value for each edge in the graph.
Appropriate Problems
ANN learning is well suited to problems in which the training data corresponds to noisy, complex data (such as inputs from cameras or microphones). It can also be used for problems with symbolic representations.
It is most appropriate for problems where:
Instances have many attribute-value pairs
The target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes
Training examples may contain errors
Long training times are acceptable
Fast evaluation of the learned target function may be required
The ability for humans to understand the learned target function is not important
As with problems where decision trees are used, instances are described by a vector of predefined features (the pixels in ALVINN); inputs may be highly correlated or independent, and input values can be real (floating point).
ALVINN's output is a vector of 30 attributes recommending a steering direction, each a real number between 0 and 1 corresponding to the confidence in predicting the corresponding steering direction. A single network could output both the steering command and a suggested acceleration: concatenate the vectors that encode the two outputs.
An ANN takes longer to train than a decision tree (from a few seconds to many hours), depending on factors such as the number of weights in the network, the number of training examples considered, and the settings of various learning algorithm parameters.
Once trained, evaluating the network on a particular instance is typically very fast. ALVINN applies its neural network several times per second to continually update its steering command as the vehicle drives forward.
Learned neural networks are less easily communicated to humans than learned rules.
Artificial Neurons (1)
Artificial neurons are based on biological neurons.
Each neuron in the network receives one or more inputs.
An activation function is applied to the inputs, which determines the output of the neuron – the activation level.
The charts on the right show three typical activation functions:
(a) step function – once the input reaches a threshold, the output becomes 1
(b) sigmoid function – compresses a real value to a number between 0 and 1 (0.9 is usually considered "on", since the function never reaches 1, and 0.1 is considered "off")
(c) linear function – the weighted sum of the inputs is used directly as the activation level
Artificial Neurons (2)
A typical activation function works as follows:
Each input i to the neuron has a weight w_i associated with it; the value arriving on input i is x_i; t is the threshold.
If the weighted sum of the inputs to the neuron is above the threshold, then the neuron fires.
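In symbols, using the slide's w_i, x_i, and threshold t, the rule just described is:
X = \sum_{i=1}^{n} w_i x_i, \qquad Y = \begin{cases} 1 & \text{if } X > t \\ 0 & \text{otherwise} \end{cases}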
Perceptrons
A perceptron is a single neuron that classifies a set of inputs into one of two categories (usually 1 or -1).
If the inputs are in the form of a grid, a perceptron can be used to recognize visual images of shapes.
The perceptron usually uses a step function, which returns 1 if the weighted sum of inputs exceeds a threshold, and 0 otherwise.
It takes real-valued inputs, calculates a linear combination of these inputs, then outputs a 1 if the result is greater than some threshold and a zero otherwise. NOTE: some systems output zero for the negative class, and others output -1.
w_0 is defined as the negative of the threshold that the weighted sum must surpass (with x_0 fixed at 1), so the threshold can be treated as just another weight.
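With that w_0 convention the same rule can be written as a single sum (the 1/-1 output variant is shown; the 1/0 variant just changes the lower case):
o(x_1,\ldots,x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases} \qquad \text{with } w_0 = -t,\; x_0 = 1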
Training Perceptrons
Learning involves choosing values for the weights.
The perceptron is trained as follows:
First, the inputs are given random weights (usually between -0.5 and 0.5).
An item of training data is presented. If the perceptron misclassifies it, each weight is modified according to
w_i \leftarrow w_i + a\,(t - o)\,x_i
where t is the target output for the training example, o is the output generated by the perceptron, and a is the learning rate, between 0 and 1 (usually small, such as 0.1).
Cycle through the training examples until all examples are classified correctly; each cycle is known as an epoch. The goal is to determine a weight vector that causes the perceptron to produce the correct +1/-1 output for each of the given training examples. The learning constant can decay over time.
Why this converges to a successful weight vector: when the perceptron classifies an example correctly, the error is 0 and the weights don't change. If it outputs -1 when it should output +1, the weights are altered to increase the sum; if x_i > 0, then increasing w_i brings the output closer to the correct classification.
Example: x_i = 0.8, a = 0.1, t = 1, and o = -1, so the weight changes by a(t - o)x_i = 0.1 × 2 × 0.8 = 0.16. If t = -1 and o = 1, the weight is decreased instead. (A sketch of the full training loop follows.)
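A minimal sketch of this training loop in Python, using the +1/-1 output convention and the update rule above; the OR data set at the end and the helper names predict and train are illustrative, not from the slides:

import random

def predict(weights, inputs):
    # weights[0] is w0 (the negative threshold); a constant input x0 = 1 is assumed
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if total > 0 else -1

def train(examples, alpha=0.1, max_epochs=100):
    n = len(examples[0][0])
    weights = [random.uniform(-0.5, 0.5) for _ in range(n + 1)]
    for _ in range(max_epochs):                  # each pass over the data is one epoch
        misclassified = 0
        for inputs, target in examples:
            output = predict(weights, inputs)
            if output != target:                 # only update on a misclassification
                misclassified += 1
                weights[0] += alpha * (target - output) * 1      # bias input x0 = 1
                for i, x in enumerate(inputs):
                    weights[i + 1] += alpha * (target - output) * x
        if misclassified == 0:                   # all examples classified correctly
            break
    return weights

# Example: learn OR (linearly separable, so the rule converges)
or_examples = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train(or_examples)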
Bias of Perceptrons
Perceptrons can only classify linearly separable functions.
The first of the following graphs shows a linearly separable function (OR). The second is not linearly separable (Exclusive-OR).
A perceptron represents a hyperplane decision surface in the n-dimensional space of instances: it outputs a 1 for instances lying on one side of the hyperplane and 0 for instances on the other side.
It can represent AND, OR, NOR, and NAND – which means that any boolean function can be represented by some network of interconnected units based on these primitives, using perceptrons rather than logic circuits, with the inputs fed to multiple units.
Given X = \sum_{i=1}^{n} w_i x_i,
Y = \begin{cases} +1 & \text{for } X > t \\ -1 & \text{for } X \le t \end{cases}
The perceptron divides the input space using the hyperplane X = t; with 2 inputs this is the line w_1 x_1 + w_2 x_2 = t.
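As a worked example (weight values chosen here for illustration, not from the slides): with two 0/1 inputs, w_1 = w_2 = 1 and t = 1.5 gives AND, since only x_1 = x_2 = 1 makes x_1 + x_2 > 1.5; the same weights with t = 0.5 give OR. For Exclusive-OR no choice of weights works, since we would need
w_1 \cdot 0 + w_2 \cdot 0 \le t, \qquad w_1 > t, \qquad w_2 > t, \qquad w_1 + w_2 \le t,
and the first three constraints imply w_1 + w_2 > 2t \ge t, contradicting the fourth. This is why Exclusive-OR is not linearly separable.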
Convergence
The perceptron training rule only converges when the training examples are linearly separable and the learning constant a is sufficiently small. Under those conditions it is proven to converge, within a finite number of applications of the rule, to a weight vector that correctly classifies all training examples.
Another approach uses the delta rule and gradient descent. It keeps the same basic rule for finding the update value, with two changes:
Do not incorporate the threshold in the output value (an unthresholded perceptron)
Wait to update the weights until the cycle through the training examples is complete
The delta rule converges asymptotically toward the minimum-error hypothesis, possibly requiring unbounded time, but it converges regardless of whether the training data are linearly separable. In other words, it will still converge when the examples are not linearly separable, by removing the threshold (summing over 1 to n rather than 0 to n) and applying the accumulated updates only once the epoch is finished (see the sketch after this slide).
It converges because there is a single global minimum. If a is too large, gradient descent may overstep the minimum, so a is usually reduced as the number of steps grows.
It can be slow – it may require thousands of steps (cycles).
If there are multiple local minima, gradient descent might not find the global minimum.
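A minimal sketch of the delta rule with batch gradient descent, assuming an unthresholded linear output o = w · x and accumulating the updates over a full epoch before applying them (function and variable names are illustrative):

def delta_rule_train(examples, alpha=0.05, epochs=1000):
    # examples: list of (inputs, target) pairs with real-valued targets
    n = len(examples[0][0])
    weights = [0.0] * n
    for _ in range(epochs):
        deltas = [0.0] * n                        # accumulate updates over the whole epoch
        for inputs, target in examples:
            output = sum(w * x for w, x in zip(weights, inputs))   # unthresholded output
            error = target - output
            for i, x in enumerate(inputs):
                deltas[i] += alpha * error * x
        weights = [w + d for w, d in zip(weights, deltas)]         # apply after the cycle
    return weights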
Multilayer Neural Networks
Multilayer neural networks can classify a range of functions, including ones that are not linearly separable.
Each input-layer neuron connects to all neurons in the hidden layer, and the neurons in the hidden layer connect to all neurons in the output layer: a feed-forward network.
Speech Recognition
An ANN trained to recognize 1 of 10 vowel sounds occurring in the context "h_d".
2 input parameters, 10 outputs.
The plot shows the highly nonlinear decision surface represented by the learned network; the points shown are test examples, which were distinct from the training examples.
Sigmoid Unit
σ(x) is the sigmoid function, also known as a squashing function.
Nice property: it is differentiable, which lets us derive gradient descent rules to train a single sigmoid unit (one node) and multilayer networks of sigmoid units.
We don't want to use the step function, because it can't be differentiated, and the gradient descent learning rule makes use of the derivative.
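The standard definition and the derivative identity that the gradient descent rules rely on (the same identity appears as o(1 - o) in the backpropagation example later in the lecture):
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{d\,\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))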
Backpropagation
Multilayer neural networks learn in the same way as perceptrons. However, there are many more weights, and it is important to assign credit (or blame) correctly when changing weights.
The error measure E sums the errors over all of the network output units.
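The standard squared-error definition behind that statement, summing over training examples d and output units k, with t_{kd} and o_{kd} the target and actual outputs:
E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in \text{outputs}} (t_{kd} - o_{kd})^2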
Backpropagation Algorithm
Create a feed-forward network with n_in inputs, n_hidden hidden units, and n_out output units. Initialize all network weights to small random numbers.
Until the termination condition is met, do:
For each <x, t> in the training examples, do:
Propagate the input forward through the network: input the instance x to the network and compute the output o_u of every unit u in the network.
Propagate the errors backward through the network:
For each network output unit k, calculate its error term δ_k = o_k (1 - o_k)(t_k - o_k)
For each hidden unit h, calculate its error term δ_h = o_h (1 - o_h) \sum_{k \in \text{outputs}} w_{kh} δ_k
Update each network weight w_{ji}: w_{ji} \leftarrow w_{ji} + α δ_j x_{ji}
Notes:
The algorithm first constructs a fixed network structure. The main loop then repeatedly iterates over the training examples. For each training example it: 1. applies the network; 2. calculates the error of the output; 3. computes the gradient with respect to the error; 4. updates the weights.
x_{ji} denotes the input from node i to unit j, and w_{ji} denotes the corresponding weight. δ_n denotes the error term associated with unit n; it plays a role analogous to the quantity (t - o) in the delta rule. α is the learning rate.
First consider the δ of an output unit: it is the delta rule's (t - o) times the derivative of the sigmoid function. For the δ of a hidden unit we don't have targets, so we sum over the errors of the output units influenced by h, weighting each δ by the weight of the connecting edge (the degree to which h is responsible for the error).
The weights are updated incrementally, which is known as a stochastic approximation to gradient descent.
When to stop: after a fixed number of iterations, when the error falls below some threshold, or once the error on a separate validation set of examples meets some criterion. This is an important question that we come back to.
Backpropagation may not find the global minimum because there can be many local minima. It can be run several times from different initial weights to search for the global minimum; in practice it works well.
Example: Learning AND
Network structure: inputs a, b, c; hidden units d, e; output unit f.
Training Data: AND(1,0,1) = 0 and AND(1,1,1) = 1. Learning rate alpha = 0.1.

import math

# first training example: AND(1,0,1) = 0
train1 = {'a': 1, 'b': 0, 'c': 1, 'out': 0}
alpha = 0.1

# initial weights
w_da = .2;  w_db = .1;  w_dc = -.1; w_d0 = .1
w_ea = -.5; w_eb = .3;  w_ec = -.2; w_e0 = 0
w_fd = .4;  w_fe = -.2; w_f0 = -.1

# feedforward
sum_d = w_d0 + w_da*train1['a'] + w_db*train1['b'] + w_dc*train1['c']  # = .1 + .2*1 + .1*0 + -.1*1
o_d = 1.0 / (1 + math.exp(-sum_d))                                     # = 0.549833997312
sum_e = w_e0 + w_ea*train1['a'] + w_eb*train1['b'] + w_ec*train1['c']  # = 0 + -.5*1 + .3*0 + -.2*1
o_e = 1.0 / (1 + math.exp(-sum_e))                                     # = 0.331812227832
sum_f = w_f0 + w_fd*o_d + w_fe*o_e                                     # = -.1 + .4*0.550 + -.2*0.332
o_f = 1.0 / (1 + math.exp(-sum_f))                                     # = 0.513389586297

# backpropagation of the error terms
delta_f = o_f * (1 - o_f) * (train1['out'] - o_f)   # = -0.128255355565
delta_d = o_d * (1 - o_d) * (w_fd * delta_f)        # = -0.0126981304165
delta_e = o_e * (1 - o_e) * (w_fe * delta_f)        # = 0.0056871726795

# reassign weight values
w_da = w_da + alpha * delta_d * train1['a']   # .2 -> 0.198730186958
w_db = w_db + alpha * delta_d * train1['b']   # .1 -> 0.1
w_dc = w_dc + alpha * delta_d * train1['c']   # -.1 -> -0.101269813042
w_d0 = w_d0 + alpha * delta_d * 1             # .1 -> 0.0987301869583
w_ea = w_ea + alpha * delta_e * train1['a']   # -.5 -> -0.499431282732
w_eb = w_eb + alpha * delta_e * train1['b']   # .3 -> 0.3
w_ec = w_ec + alpha * delta_e * train1['c']   # -.2 -> -0.199431282732
w_e0 = w_e0 + alpha * delta_e * 1             # 0 -> 0.00056871726795
w_fd = w_fd + alpha * delta_f * o_d           # .4 -> 0.392948084517
w_fe = w_fe + alpha * delta_f * o_e           # -.2 -> -0.204255669526
w_f0 = w_f0 + alpha * delta_f * 1             # -.1 -> -0.112825535556
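A vectorized sketch of the same update, generalized to a full stochastic training loop for one hidden layer (layer sizes, variable names, and the fixed epoch count are illustrative choices, not from the slides; biases are folded in as extra weights fed by a constant input of 1, mirroring w_d0, w_e0, w_f0 above):

import numpy as np

def train_backprop(X, T, n_hidden=2, alpha=0.1, epochs=5000, rng=np.random.default_rng(0)):
    # X: (m, n_in) inputs; T: (m, n_out) targets in [0, 1]
    n_in, n_out = X.shape[1], T.shape[1]
    W_hidden = rng.uniform(-0.5, 0.5, (n_hidden, n_in + 1))   # last column holds the bias weight
    W_out = rng.uniform(-0.5, 0.5, (n_out, n_hidden + 1))
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(epochs):
        for x, t in zip(X, T):                       # stochastic: update after each example
            x1 = np.append(x, 1.0)                   # constant bias input
            o_h = sigmoid(W_hidden @ x1)             # hidden unit outputs
            o_h1 = np.append(o_h, 1.0)
            o_k = sigmoid(W_out @ o_h1)              # network outputs
            delta_k = o_k * (1 - o_k) * (t - o_k)                     # output error terms
            delta_h = o_h * (1 - o_h) * (W_out[:, :-1].T @ delta_k)   # hidden error terms
            W_out += alpha * np.outer(delta_k, o_h1)                  # w_ji <- w_ji + alpha*delta_j*x_ji
            W_hidden += alpha * np.outer(delta_h, x1)
    return W_hidden, W_out

# e.g. the two three-input AND examples from the worked trace above
X = np.array([[1, 0, 1], [1, 1, 1]], dtype=float)
T = np.array([[0], [1]], dtype=float)
W_hidden, W_out = train_backprop(X, T)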
Hidden Layer Representation
Target function: the identity mapping over eight one-hot patterns – each 8-bit input (10000000 through 00000001) should be reproduced at the eight outputs, passing through only 3 hidden units.
Can this be learned?
Yes.
Input → Hidden Values → Output
10000000 → .89 .04 .08 → 10000000
01000000 → .15 .99 .99 → 01000000
00100000 → .01 .97 .27 → 00100000
00010000 → .99 .97 .71 → 00010000
00001000 → .03 .05 .02 → 00001000
00000100 → .01 .11 .88 → 00000100
00000010 → .80 .01 .98 → 00000010
00000001 → .60 .94 .01 → 00000001
The network discovered intermediate representations – new hidden layer features that are not explicit in the input representation but that capture the properties most relevant to learning the target function.
The 3 hidden nodes are forced to re-represent the eight input values by some set of relevant features.
You can see that the learned encoding is similar to the standard binary encoding of eight values: rounding the hidden values to 0 or 1 gives a distinct 3-bit code for each input.
Plots of Squared Error
The plot shows the error of each of the eight output units as the number of training iterations increases.
The error decreases as gradient descent proceeds (for some units more quickly than others).
Hidden Unit Encoding (converging to .15 .99 .99)
The network passes through a number of different encodings before converging to the final encoding; .15 .99 .99 is the learned encoding for input 01000000 in the table above.
Evolving Weights
Notice that the significant changes in the weight values coincide with significant changes in the hidden layer encoding and in the squared error.
Momentum
One of many variations on backpropagation.
Modify the update rule by making the weight update on the nth iteration depend partially on the update that occurred in the (n-1)th iteration.
This still minimizes the error over the training examples, but it speeds up training, which can otherwise take thousands of iterations.
Consider the gradient descent search trajectory as analogous to a ball rolling down the error surface. The effect of β is like momentum that tends to keep the ball rolling in the same direction from one iteration to the next.
This has several effects: it can keep the ball rolling through small local minima or along flat regions, and it gradually increases the step size where the gradient is unchanging, which speeds convergence.
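The modified rule in its usual form, writing \Delta w_{ji}(n) for the weight update on the nth iteration and \beta (with 0 \le \beta < 1) for the momentum term described above:
\Delta w_{ji}(n) = \alpha\, \delta_j\, x_{ji} + \beta\, \Delta w_{ji}(n-1)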
When to Stop Training
Continuing until the error falls below some predefined threshold is a bad choice, because backpropagation is susceptible to overfitting: the network won't be able to generalize as well over unseen data.
Notice that the error on the validation set first decreases, then increases. This occurs because the weights are being fit to idiosyncrasies of the training examples that are not representative of the general distribution.
The effective complexity of the hypotheses that can be reached by backpropagation increases with the number of iterations (with small initial weights the decision surface is smooth to begin with).
One way to address the problem is weight decay: decrease each weight by some small factor during each iteration. This corresponds to adding a penalty term for the total magnitude of the network weights; it keeps the weight values small and biases learning against complex decision surfaces.
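One common way to write that penalized objective (γ is a small weight-decay constant; the exact form here is an assumption, since the slide only describes it in words):
E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in \text{outputs}} (t_{kd} - o_{kd})^2 + \gamma \sum_{j,i} w_{ji}^2
Minimizing this with gradient descent is what shrinks each weight by a small factor on every iteration.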
Cross Validation
A common approach to avoid overfitting: reserve part of the training data for validation.
The m examples are partitioned into k disjoint subsets, and the procedure is run k times, each time using a different one of these subsets as the validation set.
Each run determines the number of iterations that yields the best performance on its validation set; the mean of these numbers of iterations is then used when training on all m examples.
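A minimal sketch of that procedure, assuming the caller supplies train_for_epochs(train_set, n), which returns a network trained for n iterations, and validation_error(net, val_set), which returns its error (both names are hypothetical placeholders, not from the slides):

def choose_epochs_by_cross_validation(examples, k, max_epochs,
                                      train_for_epochs, validation_error):
    """Return the mean of the per-fold best epoch counts."""
    folds = [examples[i::k] for i in range(k)]               # k disjoint subsets
    best_epochs = []
    for i in range(k):
        val_set = folds[i]
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        errors = [validation_error(train_for_epochs(train_set, n), val_set)
                  for n in range(1, max_epochs + 1)]
        best_epochs.append(1 + errors.index(min(errors)))    # epoch count with lowest validation error
    return sum(best_epochs) / k                              # train on all examples for this many epochs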
Neural Nets for Face Recognition
The learned network is 90% accurate on head poses and on identifying 1 of 20 people.
Hidden Unit Weights
(Figure: the learned weights, shown for the four outputs: left, straight, right, up.)
Error gradient for the sigmoid function
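The derivation itself is not reproduced on these slides; the standard result it builds toward, for a single sigmoid unit with output o_d = \sigma(\vec{w} \cdot \vec{x}_d) on training example d, is:
\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_{d}(t_d - o_d)^2 = -\sum_{d}(t_d - o_d)\,\frac{\partial o_d}{\partial w_i} = -\sum_{d}(t_d - o_d)\,o_d\,(1 - o_d)\,x_{id}
so the gradient descent update is \Delta w_i = -\alpha\,\partial E / \partial w_i = \alpha \sum_d (t_d - o_d)\,o_d\,(1 - o_d)\,x_{id}: the delta rule's (t - o) term multiplied by the sigmoid derivative, exactly as in the backpropagation error term δ_k.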