Chapter 4 Artificial Neural Networks
Questions: What are ANNs? How is an ANN learned (the algorithms)? What is the representational power of ANNs (advantages and disadvantages)?
What are ANNs? Background
Consider humans:
Neuron switching time: about 0.001 second
Number of neurons: about $10^{10}$
Connections per neuron: about $10^{4}$ to $10^{5}$
Scene recognition time: about 0.1 second
At 0.001 second per switch, 0.1 second leaves room for only about 100 sequential steps, which implies much parallel computation.
Property of a neuron: it is a thresholded unit.
One motivation for ANN systems is to capture this kind of highly parallel computation based on distributed representations.
What are ANNs? Problems addressed with ANNs: classification, voice recognition, and others.
Properties of artificial neural nets (ANNs):
Many neuron-like threshold switching units
Many weighted interconnections among units
Highly parallel, distributed processing
Emphasis on tuning weights automatically
4.1 Perceptrons
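A perceptron outputs $o(x_1, \ldots, x_n) = 1$ if $\omega_0 + \omega_1 x_1 + \cdots + \omega_n x_n > 0$, and $-1$ otherwise. To simplify notation, set $x_0 = 1$, so that $o(\vec{x}) = \mathrm{sgn}(\vec{\omega} \cdot \vec{x})$. Learning a perceptron involves choosing values for the weights $\omega_0, \ldots, \omega_n$. Therefore, the space H of candidate hypotheses considered in perceptron learning is the set of all possible real-valued weight vectors.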
We can view the perceptron as representing a hyperplane decision surface in the n-dimensional space of instances. Two ways to train a perceptron: the perceptron training rule and gradient descent.
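(1). Perceptron Training Rule
Initialize each $\omega_i$ with a random value in a given interval.
Update each $\omega_i$ according to every training example: $\omega_i \leftarrow \omega_i + \Delta\omega_i$, with $\Delta\omega_i = \eta\,(t - o)\,x_i$, where
$t$ is the target value,
$o$ is the perceptron output,
$\eta$ is a small constant called the learning rate.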
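A minimal sketch of the perceptron training rule, assuming the usual update $\omega_i \leftarrow \omega_i + \eta\,(t - o)\,x_i$ with $x_0 = 1$. The function names, the initialization interval, the learning rate, and the AND training set (encoded with 1 for true and -1 for false) are illustrative assumptions, not part of the slides.

```python
import random

def sgn(y):
    """Threshold function: +1 if y > 0, otherwise -1."""
    return 1 if y > 0 else -1

def perceptron_output(w, x):
    """Output of the perceptron; the constant input x0 = 1 is prepended here."""
    xs = [1.0] + list(x)
    return sgn(sum(wi * xi for wi, xi in zip(w, xs)))

def train_perceptron(examples, eta=0.1, max_epochs=100, seed=0):
    """Apply w_i <- w_i + eta * (t - o) * x_i until every example is classified."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    # Assumed interval for the random initial weights.
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = perceptron_output(w, x)
            if o != t:
                mistakes += 1
            xs = [1.0] + list(x)
            for i in range(len(w)):
                # The update is zero whenever the example is already correct.
                w[i] += eta * (t - o) * xs[i]
        if mistakes == 0:
            break
    return w

# Hypothetical training set: boolean AND with 1 = true and -1 = false.
AND_EXAMPLES = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
w = train_perceptron(AND_EXAMPLES)
```

Because AND is linearly separable, the loop stops after a few passes with weights that classify all four examples correctly.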
Representation Power of Perceptrons
A single perceptron can represent many boolean functions, such as AND, OR, NAND, and NOR, but it cannot represent XOR.
E.g., with 1 for true and -1 for false: g(x1, x2) = AND(x1, x2) is represented by o(x1, x2) = sgn(-0.8 + 0.5x1 + 0.5x2)
x1   x2   -0.8 + 0.5x1 + 0.5x2    o
-1   -1          -1.8            -1
-1    1          -0.8            -1
 1   -1          -0.8            -1
 1    1           0.2             1
Representation Power of Perceptrons
(a) The perceptron training rule can be proven to converge if the training data are linearly separable and η is sufficiently small.
(b) Some functions are not representable, e.g. those that are not linearly separable (such as XOR).
(c) Every boolean function can be represented by some network of perceptrons only two levels deep.
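As an illustration of point (c), the sketch below builds XOR from a two-level network of perceptrons, using XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)). Inputs and outputs use 1 for true and -1 for false. The AND weights are the ones from the slide; the OR and NAND weights are assumptions chosen for this example.

```python
def sgn(y):
    """Threshold function: +1 if y > 0, otherwise -1."""
    return 1 if y > 0 else -1

def perceptron(w0, w1, w2):
    """Return a two-input perceptron with fixed weights."""
    return lambda x1, x2: sgn(w0 + w1 * x1 + w2 * x2)

AND = perceptron(-0.8, 0.5, 0.5)    # weights from the slide
OR = perceptron(0.8, 0.5, 0.5)      # assumed weights
NAND = perceptron(0.8, -0.5, -0.5)  # assumed weights

def xor(x1, x2):
    """First level: OR and NAND. Second level: AND of the two results."""
    return AND(OR(x1, x2), NAND(x1, x2))

table = {(x1, x2): xor(x1, x2) for x1 in (-1, 1) for x2 in (-1, 1)}
```

No single perceptron can produce this table, but three of them arranged in two levels can.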
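(2). Gradient Descent
Key idea: search the hypothesis space to find the weights that best fit the training examples.
To derive the rule, consider an unthresholded linear unit whose output is $o = \vec{\omega} \cdot \vec{x}$.
Best fit: minimize the squared error $E(\vec{\omega}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$, where $D$ is the set of training examples, $t_d$ is the target value for example $d$, and $o_d$ is the output of the unit for example $d$.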
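Gradient Descent
Gradient: $\nabla E(\vec{\omega}) = \left[ \frac{\partial E}{\partial \omega_0}, \frac{\partial E}{\partial \omega_1}, \ldots, \frac{\partial E}{\partial \omega_n} \right]$
Training rule: $\Delta\vec{\omega} = -\eta\,\nabla E(\vec{\omega})$, or componentwise $\Delta\omega_i = -\eta\,\frac{\partial E}{\partial \omega_i}$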
Gradient Descent
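The derivation on this slide did not survive extraction. A sketch of the standard derivation, assuming a linear unit $o_d = \vec{\omega} \cdot \vec{x}_d$ and the squared error $E(\vec{\omega}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$, where $x_{id}$ denotes the $i$-th input component of example $d$:

```latex
\begin{align*}
\frac{\partial E}{\partial \omega_i}
  &= \frac{\partial}{\partial \omega_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \\
  &= \sum_{d \in D} (t_d - o_d) \frac{\partial}{\partial \omega_i} \left( t_d - \vec{\omega} \cdot \vec{x}_d \right) \\
  &= \sum_{d \in D} (t_d - o_d)(-x_{id}), \\
\Delta\omega_i &= -\eta \frac{\partial E}{\partial \omega_i}
  = \eta \sum_{d \in D} (t_d - o_d)\, x_{id}.
\end{align*}
```

The last line is the weight update that the gradient descent algorithm accumulates over all training examples.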
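Gradient Descent Algorithm
Initialize each $\omega_i$ to some small random value.
Until the termination condition is met, do:
  Initialize each $\Delta\omega_i$ to zero.
  For each <x, t> in the training examples, do:
    Input the instance x to the unit and compute the output o.
    For each linear unit weight $\omega_i$, do: $\Delta\omega_i \leftarrow \Delta\omega_i + \eta\,(t - o)\,x_i$
  For each linear unit weight $\omega_i$, do: $\omega_i \leftarrow \omega_i + \Delta\omega_i$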
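The algorithm can be sketched in plain Python for a single linear unit, assuming the accumulated update $\Delta\omega_i \leftarrow \Delta\omega_i + \eta\,(t - o)\,x_i$. The fixed number of epochs stands in for the termination condition, and the toy data set (targets generated by t = 1 + 2x) is a hypothetical example.

```python
import random

def linear_output(w, x):
    """Unthresholded output w . x, with the constant input x0 = 1 prepended."""
    return sum(wi * xi for wi, xi in zip(w, [1.0] + list(x)))

def squared_error(w, examples):
    """E(w) = 1/2 * sum over examples of (t - o)^2."""
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in examples)

def gradient_descent(examples, eta=0.05, epochs=500, seed=0):
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]  # small random values
    for _ in range(epochs):  # assumed termination condition: fixed epoch count
        delta = [0.0] * (n + 1)
        for x, t in examples:
            o = linear_output(w, x)
            xs = [1.0] + list(x)
            for i in range(n + 1):
                delta[i] += eta * (t - o) * xs[i]
        # Weights change only after the whole training set has been seen.
        for i in range(n + 1):
            w[i] += delta[i]
    return w

# Hypothetical toy data generated by t = 1 + 2 * x.
EXAMPLES = [((x,), 1.0 + 2.0 * x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w = gradient_descent(EXAMPLES)
```

On this noise-free data the weights approach (1, 2) and the squared error approaches zero.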
When to use gradient descent:
The hypothesis space is continuously parameterized.
The error is differentiable with respect to the hypothesis parameters.
Advantages vs. Disadvantages
Advantages: for a linear unit the error surface has a single global minimum, so gradient descent is guaranteed to converge to the minimum-error hypothesis, given a sufficiently small learning rate η, even when the training data contain noise and even when the training data are not linearly separable.
Disadvantages: convergence can sometimes be very slow; when the error surface has multiple local minima, there is no guarantee of converging to the global minimum.
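Incremental (Stochastic) Gradient Descent vs. Standard Gradient Descent
Standard gradient descent: do until satisfied:
1. Compute the gradient $\nabla E_D(\vec{\omega})$
2. $\vec{\omega} \leftarrow \vec{\omega} - \eta\,\nabla E_D(\vec{\omega})$
Stochastic gradient descent: do until satisfied: for each training example $d$ in $D$:
1. Compute the gradient $\nabla E_d(\vec{\omega})$
2. $\vec{\omega} \leftarrow \vec{\omega} - \eta\,\nabla E_d(\vec{\omega})$
Here $E_D(\vec{\omega}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$ and $E_d(\vec{\omega}) = \frac{1}{2} (t_d - o_d)^2$.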
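A minimal sketch of the stochastic variant for a linear unit: the weights are updated after every single example with $\omega_i \leftarrow \omega_i + \eta\,(t - o)\,x_i$ instead of once per pass. The zero initialization, the fixed visiting order, the epoch count, and the toy data (t = 1 + 2x) are assumptions made for this example.

```python
def linear_output(w, x):
    """Unthresholded output w . x, with the constant input x0 = 1 prepended."""
    return sum(wi * xi for wi, xi in zip(w, [1.0] + list(x)))

def stochastic_gradient_descent(examples, eta=0.05, epochs=500):
    n = len(examples[0][0])
    w = [0.0] * (n + 1)  # assumed starting point; small random values also work
    for _ in range(epochs):
        for x, t in examples:
            o = linear_output(w, x)
            xs = [1.0] + list(x)
            # Update immediately, using the gradient of E_d for this example only.
            for i in range(n + 1):
                w[i] += eta * (t - o) * xs[i]
    return w

# Hypothetical toy data generated by t = 1 + 2 * x.
EXAMPLES = [((x,), 1.0 + 2.0 * x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w_sgd = stochastic_gradient_descent(EXAMPLES)
```

The only structural difference from the standard version is where the weight update happens: inside the loop over examples rather than after it.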
Standard Gradient Descent vs. Stochastic Gradient Descent
Stochastic gradient descent can approximate standard gradient descent arbitrarily closely if η is made small enough.
Stochastic mode can converge faster.
Stochastic gradient descent can sometimes avoid falling into local minima.
(3). Perceptron training rule vs. gradient descent
Perceptron training rule: uses the thresholded perceptron output. Provided the training examples are linearly separable, it converges to a hypothesis that perfectly classifies the training data.
Gradient descent: uses the unthresholded linear output. Regardless of whether the training data are linearly separable, it converges asymptotically toward the minimum-error hypothesis.
4.2 Multilayer Networks
A single perceptron can only express linear decision surfaces; multilayer networks are needed to express a rich variety of nonlinear decision surfaces.
[Figures: Perceptron; Network]
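Sigmoid unit: a differentiable threshold unit
Sigmoid function: $\sigma(y) = \frac{1}{1 + e^{-y}}$
Property: $\frac{d\sigma(y)}{dy} = \sigma(y)\,(1 - \sigma(y))$
Output: $o = \sigma(\vec{\omega} \cdot \vec{x})$
Why do we use the sigmoid instead of a linear unit or sgn(x)? A network of linear units can still only represent linear functions, and sgn(x) is not differentiable, so it is unsuitable for gradient descent; the sigmoid is both nonlinear and differentiable.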
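A short sketch of the sigmoid and its derivative property, assuming the standard logistic form $\sigma(y) = 1 / (1 + e^{-y})$. The helper `numerical_derivative` is a hypothetical name, included only to check the closed form against a finite difference.

```python
import math

def sigmoid(y):
    """Logistic function: maps any real y into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_derivative(y):
    """Uses the property d sigma / dy = sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)

def numerical_derivative(f, y, h=1e-6):
    """Central-difference estimate, used here only as a sanity check."""
    return (f(y + h) - f(y - h)) / (2 * h)

def sigmoid_unit_output(w, x):
    """o = sigma(w . x), with the constant input x0 = 1 prepended."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, [1.0] + list(x))))
```

Because the derivative can be written in terms of the unit's own output, backpropagation can reuse the value already computed in the forward pass.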
The Backpropagation Algorithm
The main idea of the backpropagation algorithm:
compute the input and output of each unit in a forward pass;
then modify the weights between pairs of units in a backward pass, according to the errors.
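Error definition:
Batch mode: $E(\vec{\omega}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2$
Individual mode: $E_d(\vec{\omega}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$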
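[Figure: unit i feeds unit j. $o_i = x_{ji}$ is the output of unit i and the input that unit j receives from unit i; $\omega_{ij}$ is the weight on the connection from unit i to unit j; $o_j$ is the output of unit j.]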
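Training rule for output unit weights
For each output unit k with target $t_k$ and output $o_k$, the error term is $\delta_k = o_k\,(1 - o_k)\,(t_k - o_k)$.
The weight on the connection from hidden unit h to output unit k changes by $\Delta\omega_{hk} = \eta\,\delta_k\,x_{kh}$, where $x_{kh}$ is the input that unit k receives from unit h.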
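Training rule for hidden unit weights
Error term for each hidden unit h with output $o_h$: $\delta_h = o_h\,(1 - o_h) \sum_{k \in outputs} \omega_{hk}\,\delta_k$, where $\omega_{hk}$ is the weight on the connection from hidden unit h to output unit k and $\delta_k$ is the error term of output unit k.
The weight on the connection from unit i to hidden unit h changes by $\Delta\omega_{ih} = \eta\,\delta_h\,x_{hi}$.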
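Backpropagation Algorithm
Initialize all weights to small random numbers.
Until the termination condition is met, do:
  For each training example, do:
    // Propagate the input forward
    1. Input the training example to the network and compute the network outputs.
    // Propagate the errors backward
    2. For each output unit k: $\delta_k \leftarrow o_k\,(1 - o_k)\,(t_k - o_k)$
    3. For each hidden unit h: $\delta_h \leftarrow o_h\,(1 - o_h) \sum_{k \in outputs} \omega_{hk}\,\delta_k$
    4. Update each network weight: $\omega_{ij} \leftarrow \omega_{ij} + \Delta\omega_{ij}$, where $\Delta\omega_{ij} = \eta\,\delta_j\,x_{ji}$
Here $\omega_{ij}$ is the weight on the connection from unit i to unit j, and $x_{ji}$ is the input that unit j receives from unit i.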
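The four steps can be sketched for a network with one hidden layer of sigmoid units, assuming the standard error terms $\delta_k = o_k(1 - o_k)(t_k - o_k)$ for output units and $\delta_h = o_h(1 - o_h)\sum_k \omega_{hk}\delta_k$ for hidden units. The dictionary layout, function names, network size, and the single training example are illustrative assumptions. In the code, `net["hidden"][h][i]` holds the weight from input i into hidden unit h, and index 0 of every weight list is the bias weight.

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def init_network(n_in, n_hidden, n_out, seed=0):
    """All weights start as small random numbers; index 0 is the bias weight."""
    rng = random.Random(seed)
    def layer(n_units, n_inputs):
        return [[rng.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
                for _ in range(n_units)]
    return {"hidden": layer(n_hidden, n_in), "out": layer(n_out, n_hidden)}

def forward(net, x):
    """Step 1: propagate the input forward through the network."""
    xs = [1.0] + list(x)
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, xs))) for ws in net["hidden"]]
    hs = [1.0] + hidden
    out = [sigmoid(sum(w * v for w, v in zip(ws, hs))) for ws in net["out"]]
    return xs, hs, out

def example_error(net, x, t):
    """E_d = 1/2 * sum over output units of (t_k - o_k)^2."""
    _, _, out = forward(net, x)
    return 0.5 * sum((tk - o) ** 2 for tk, o in zip(t, out))

def backprop_step(net, x, t, eta=0.3):
    xs, hs, out = forward(net, x)
    # Step 2: error term of each output unit.
    delta_out = [o * (1 - o) * (tk - o) for o, tk in zip(out, t)]
    # Step 3: error term of each hidden unit (hs[0] is the bias input, skipped).
    delta_hidden = []
    for h in range(1, len(hs)):
        back = sum(net["out"][k][h] * delta_out[k] for k in range(len(out)))
        delta_hidden.append(hs[h] * (1 - hs[h]) * back)
    # Step 4: update every weight by eta * delta_j * x_ji.
    for k, ws in enumerate(net["out"]):
        for h in range(len(ws)):
            ws[h] += eta * delta_out[k] * hs[h]
    for h, ws in enumerate(net["hidden"]):
        for i in range(len(ws)):
            ws[i] += eta * delta_hidden[h] * xs[i]

def numerical_gradient(net, layer, j, i, x, t, h=1e-6):
    """Central-difference estimate of dE_d/dw for one weight (check only)."""
    original = net[layer][j][i]
    net[layer][j][i] = original + h
    e_plus = example_error(net, x, t)
    net[layer][j][i] = original - h
    e_minus = example_error(net, x, t)
    net[layer][j][i] = original
    return (e_plus - e_minus) / (2 * h)

# Hypothetical single training example for a 2-2-1 network.
net = init_network(n_in=2, n_hidden=2, n_out=1)
x, t = (1.0, 0.0), (1.0,)
grad_out = numerical_gradient(net, "out", 0, 1, x, t)
grad_hid = numerical_gradient(net, "hidden", 0, 1, x, t)
before_out, before_hid = net["out"][0][1], net["hidden"][0][1]
error_before = example_error(net, x, t)
backprop_step(net, x, t, eta=0.3)
error_after = example_error(net, x, t)
```

The numerical check confirms that each weight change equals the negative gradient of $E_d$ scaled by η, which is what makes backpropagation a form of stochastic gradient descent. A full training run would repeat `backprop_step` over all examples until a termination condition is met.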
Hidden layer Representations
Convergence and local minima
Backpropagation converges to some local minimum, not necessarily to the global minimum error.
Use stochastic gradient descent rather than standard gradient descent.
Initialization influences convergence: train multiple networks with different random initial weights over the same data, then select the best one.
Initialize weights near zero, so that the initial network is nearly linear; increasingly nonlinear functions become possible as training progresses.
Training can take thousands of iterations, so it is slow.
Add a momentum term to speed convergence.
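The slide does not spell out the momentum term. A common form, assumed here, makes the n-th weight change depend partly on the previous one: $\Delta\omega(n) = \eta\,g + \alpha\,\Delta\omega(n-1)$, where $g$ stands for the usual gradient term (for backpropagation, $\delta_j x_{ji}$) and $0 \le \alpha < 1$ is the momentum constant. The function name and the constant-gradient demonstration are hypothetical.

```python
def momentum_step(weight, gradient_term, previous_delta, eta=0.3, alpha=0.9):
    """Return the updated weight and the weight change that was applied."""
    delta = eta * gradient_term + alpha * previous_delta
    return weight + delta, delta

# Demonstration: where the gradient term stays constant (a flat slope), the
# effective step grows from eta * g toward eta * g / (1 - alpha).
weight, delta = 0.0, 0.0
deltas = []
for _ in range(200):
    weight, delta = momentum_step(weight, 1.0, delta)
    deltas.append(delta)
```

With η = 0.3 and α = 0.9 the step starts at 0.3 and approaches 3.0, which is how momentum speeds up progress across regions where the gradient is unchanging.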
Expressive Capabilities of ANNs
Every boolean function can be represented by a network with a single hidden layer.
Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer.
Any function can be approximated to arbitrary accuracy by a network with two hidden layers.
A network with more hidden layers may achieve higher precision; however, the possibility of converging to a local minimum increases as well.
When to Consider Neural Networks
Input is high-dimensional, discrete or real-valued
Output is discrete or real-valued
Output is a vector of values
Possibly noisy data
Form of target function is unknown
Human readability of the result is unimportant
Overfitting in ANNs
Strategies to avoid overfitting
Poor strategy: continue training until the error on the training examples falls below some threshold.
A good indicator: the number of iterations that produces the lowest error over the validation set.
Keep a copy of the weights with the lowest validation error so far; once the trained weights reach a significantly higher error over the validation set than the stored weights, terminate and return the stored weights.
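A minimal sketch of validation-based stopping, assuming the caller supplies two hypothetical callbacks: `train_one_epoch(weights)` returns the weights after one more pass over the training set, and `validation_error(weights)` returns the error over the validation set. The `tolerance` factor is an assumed way of making "significantly higher" concrete.

```python
import copy

def train_with_early_stopping(weights, train_one_epoch, validation_error,
                              tolerance=1.1, max_epochs=1000):
    """Return the stored weights with the lowest validation error seen."""
    best_weights = copy.deepcopy(weights)
    best_error = validation_error(weights)
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        weights = train_one_epoch(weights)
        err = validation_error(weights)
        if err < best_error:
            # New lowest validation error: store a copy of these weights.
            best_weights, best_error, best_epoch = copy.deepcopy(weights), err, epoch
        elif err > tolerance * best_error:
            # Significantly worse than the stored weights: terminate.
            break
    return best_weights, best_error, best_epoch

# Hypothetical stand-ins: each "epoch" moves the single weight by 1, and the
# validation error is U-shaped with its minimum at weight 5.
def fake_epoch(ws):
    return [ws[0] + 1.0]

def fake_validation_error(ws):
    return (ws[0] - 5.0) ** 2 + 1.0

best = train_with_early_stopping([0.0], fake_epoch, fake_validation_error)
```

In the toy run the validation error falls until epoch 5 and rises at epoch 6, so training stops there and the weights stored at epoch 5 are returned.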
Alternative Error Functions
Recurrent Networks
Thank you!