Chapter 4 Artificial Neural Networks
Questions:
What are ANNs?
How is an ANN learned? (algorithms)
The representational power of ANNs (advantages and disadvantages)
3
What is ANNs ------Background
Consider humans Neuron switching time second Number of neurons 1010 Connections per neuron 104~5 Scene recognition time second much parallel computation Property of neuron: thresholded unit One motivation for ANN systems is to capture this kind of highly parallel computation based on distributed reprensetation
4
What are ANNs? ----- Problems suited to ANNs
Classification
Voice recognition
Others
Another example: Properties of artificial neural nets (ANNs):
Many neuron-like threshold switching units
Many weighted interconnections among units
Highly parallel, distributed process
Emphasis on tuning weights automatically
4.1 Perceptrons
To simplify notation, set x0 = 1, so the threshold can be treated as an ordinary weight ω0.
Learning a perceptron involves choosing values for the weights. Therefore, the space H of candidate hypotheses considered in perceptron learning is the set of all possible real-valued weight vectors.
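As a concrete sketch, the perceptron hypothesis above can be written in a few lines of Python (the function name and the use of plain lists are illustrative assumptions):

```python
# Minimal sketch of a perceptron's output, assuming inputs are plain lists
# and x0 = 1 is prepended so the threshold is absorbed into weight w0.
def perceptron_output(weights, x):
    """Return sgn(w . x) as +1 or -1, with x0 = 1 prepended."""
    xs = [1.0] + list(x)
    s = sum(w * xi for w, xi in zip(weights, xs))
    return 1 if s > 0 else -1
```

For example, with weights (−0.8, 0.5, 0.5) this unit computes AND over inputs in {−1, +1}.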
We can view the perceptron as representing a hyperplane decision surface in the n-dimensional space of instances. There are two ways to train a perceptron: the perceptron training rule and gradient descent.
(1). Perceptron Training Rule
Initialize each ωi to a random value in the given interval.
Update each ωi according to the training example: ωi ← ωi + Δωi, where Δωi = η(t − o)xi.
Here t is the target value, o is the perceptron output, and η is a small positive constant called the learning rate.
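The update loop above can be sketched as follows; the initial-weight interval (−0.05, 0.05), η = 0.1, and the epoch count are illustrative choices, not part of the rule itself:

```python
import random

def sgn(s):
    return 1 if s > 0 else -1

# A sketch of the perceptron training rule: after each example,
# wi <- wi + eta * (t - o) * xi, with x0 = 1 for the bias weight.
def train_perceptron(examples, n_inputs, eta=0.1, epochs=100):
    """examples: list of (x, t) pairs with t in {-1, +1}; returns weights."""
    w = [random.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
    for _ in range(epochs):
        for x, t in examples:
            xs = [1.0] + list(x)
            o = sgn(sum(wi * xi for wi, xi in zip(w, xs)))
            for i in range(len(w)):
                w[i] += eta * (t - o) * xs[i]
    return w
```

On linearly separable data (such as AND) and a small η, this loop converges to a perfectly classifying weight vector.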
Representation Power of Perceptrons
A single perceptron can represent many boolean functions, such as AND, OR, NAND, and NOR, but fails to represent XOR.
E.g. g(x1, x2) = AND(x1, x2), with o(x1, x2) = sgn(0.5x1 + 0.5x2 − 0.8):

x1   x2   0.5x1 + 0.5x2 − 0.8   O
−1   −1   −1.8                  −1
−1    1   −0.8                  −1
 1   −1   −0.8                  −1
 1    1    0.2                   1
Representation Power of Perceptrons
(a) The perceptron training rule can be proved to converge if the training data are linearly separable and η is sufficiently small.
(b) But some functions are not representable, e.g. functions that are not linearly separable, such as XOR.
(c) Every boolean function can be represented by some network of perceptrons only two levels deep.
(2). Gradient Descent
Key idea: search the hypothesis space to find the weight vector that best fits the training examples.
Best fit: minimize the squared error
E(ω) = ½ Σ_{d∈D} (t_d − o_d)²
where D is the set of training examples, t_d is the target output for example d, and o_d is the linear unit's output for d.
Gradient Descent
Gradient: ∇E(ω) = [∂E/∂ω0, ∂E/∂ω1, …, ∂E/∂ωn]
Training rule: Δω = −η∇E(ω), or component-wise Δωi = −η ∂E/∂ωi = η Σ_{d∈D} (t_d − o_d) x_id
Gradient Descent
Gradient Descent Algorithm
Initialize each ωi to some small random value.
Until the termination condition is met, Do
  Initialize each Δωi to zero.
  For each <x, t> in the training examples, Do
    Input the instance x to the unit and compute the output o.
    For each linear unit weight ωi, Do
      Δωi ← Δωi + η(t − o)xi
  For each linear unit weight ωi, Do
    ωi ← ωi + Δωi
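The batch loop above can be sketched for an unthresholded linear unit; η = 0.05, the initial-weight interval, and the epoch count are illustrative assumptions:

```python
import random

# A sketch of batch gradient descent for a linear unit: accumulate
# delta_wi = eta * sum_d (t_d - o_d) * x_id over the whole training set,
# then apply one weight update per pass.
def gradient_descent(examples, n_inputs, eta=0.05, epochs=500):
    w = [random.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
    for _ in range(epochs):
        delta = [0.0] * len(w)
        for x, t in examples:
            xs = [1.0] + list(x)                       # x0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xs))  # linear unit output
            for i in range(len(w)):
                delta[i] += eta * (t - o) * xs[i]
        for i in range(len(w)):                        # one batch update
            w[i] += delta[i]
    return w
```

On data generated by t = 2x + 1, for example, the weights converge toward (1, 2).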
When to use gradient descent
The hypothesis is continuously parameterized.
The error must be differentiable with respect to those parameters.
Advantages vs. Disadvantages
Advantages:
Guaranteed to converge to a hypothesis with locally minimum error, given a sufficiently small learning rate η, even when the training data contain noise or are not linearly separable.
For a linear unit, the error surface has a single global minimum, so gradient descent converges to it.
Disadvantages:
Convergence can sometimes be very slow.
When there are multiple local minima, there is no guarantee of converging to the global minimum.
Incremental (Stochastic) Gradient Descent
Standard gradient descent: Do until satisfied: compute the gradient ∇E_D(ω) over all of D, then update ω ← ω − η∇E_D(ω).
Stochastic gradient descent: Do until satisfied: for each training example d in D, compute ∇E_d(ω) and update ω immediately, i.e. Δωi = η(t_d − o_d)x_id.
Standard Gradient Descent vs. Stochastic Gradient Descent
Stochastic gradient descent can approximate standard gradient descent arbitrarily closely if η is made small enough.
The stochastic mode can converge faster, since it updates the weights after every example.
Stochastic gradient descent can sometimes avoid falling into local minima, because it follows the varying ∇E_d(ω) rather than the single ∇E_D(ω).
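The stochastic variant differs from the batch sketch only in where the update happens; η = 0.02 and the epoch count are again illustrative choices:

```python
import random

# Stochastic (incremental) gradient descent for a linear unit: update the
# weights after every training example instead of once per pass over D.
def stochastic_gradient_descent(examples, n_inputs, eta=0.02, epochs=500):
    w = [random.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
    for _ in range(epochs):
        for x, t in examples:
            xs = [1.0] + list(x)                       # x0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xs))  # linear unit output
            for i in range(len(w)):                    # per-example update
                w[i] += eta * (t - o) * xs[i]
    return w
```

A smaller η is typical here than in batch mode, since each single-example gradient is noisier than the summed one.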
(3). Perceptron training rule vs. gradient descent
Perceptron training rule: uses the thresholded perceptron output. Provided the examples are linearly separable, it converges to a hypothesis that perfectly classifies the training data.
Gradient descent: uses the unthresholded linear output. Regardless of whether the training data are linearly separable, it converges asymptotically toward the minimum-error hypothesis.
4.2 Multilayer Networks
Perceptrons can only express linear decision surfaces; we need to express a rich variety of nonlinear decision surfaces.
Sigmoid unit – a differentiable threshold unit
Sigmoid function: σ(x) = 1 / (1 + e^(−x))
Property: dσ(x)/dx = σ(x)(1 − σ(x))
Output: o = σ(ω · x)
Why do we use the sigmoid instead of a linear unit or sgn(x)? Unlike sgn(x) it is differentiable, so gradient descent applies; unlike a linear unit it is nonlinear, so multilayer networks gain expressive power.
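The squashing function and the derivative property above translate directly to code:

```python
import math

# The sigmoid unit's squashing function and the property that
# backpropagation exploits: d sigma/dx = sigma(x) * (1 - sigma(x)).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)
```

The derivative is cheap to compute from the output itself, which is why the error terms below reuse o(1 − o).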
The Backpropagation Algorithm
The main idea of the backpropagation algorithm:
compute the input and output of each unit in a forward pass;
modify the weights of unit pairs in a backward pass with respect to the errors.
Error definition:
Batch mode: E(ω) = ½ Σ_{d∈D} Σ_{k∈outputs} (t_kd − o_kd)²
Individual (per-example) mode: E_d(ω) = ½ Σ_{k∈outputs} (t_k − o_k)²
[Network diagram: unit i's output o_i feeds unit j as input x_ji over weight ω_ji]
Training Rule for Output Unit Weights
For each output unit k: δ_k = o_k(1 − o_k)(t_k − o_k), and Δω_kj = η δ_k x_kj
Training Rule for Hidden Unit Weights
For each hidden unit h, the error term sums the δ_k of the output units it feeds:
δ_h = o_h(1 − o_h) Σ_{k∈outputs} ω_kh δ_k, and Δω_hi = η δ_h x_hi
Backpropagation Algorithm
Initialize all weights to small random numbers.
Until the termination condition is met, Do
  For each training example <x, t>, Do
    // Propagate the input forward
    1. Input the training example to the network and compute the output of every unit.
    // Propagate the errors backward
    2. For each output unit k: δ_k ← o_k(1 − o_k)(t_k − o_k)
    3. For each hidden unit h: δ_h ← o_h(1 − o_h) Σ_{k∈outputs} ω_kh δ_k
    4. Update each network weight: ω_ji ← ω_ji + Δω_ji, where Δω_ji = η δ_j x_ji
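The steps above can be sketched for a single hidden layer of sigmoid units; the list-of-lists weight layout, η = 0.5, the (−0.5, 0.5) initialization interval, and the fixed epoch count are illustrative assumptions:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# A sketch of backpropagation for one hidden layer of sigmoid units.
# w_h[j][i] connects input i (index 0 is the bias x0 = 1) to hidden unit j;
# w_o[k][j] connects hidden unit j (index 0 is the bias) to output unit k.
def train_backprop(examples, n_in, n_hidden, n_out, eta=0.5, epochs=5000):
    w_h = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
           for _ in range(n_hidden)]
    w_o = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
           for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            # 1. propagate the input forward
            xs = [1.0] + list(x)
            h = [sigmoid(sum(wi * xi for wi, xi in zip(wj, xs))) for wj in w_h]
            hs = [1.0] + h
            o = [sigmoid(sum(wi * hi for wi, hi in zip(wk, hs))) for wk in w_o]
            # 2. delta_k = o_k (1 - o_k) (t_k - o_k) for each output unit
            d_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # 3. delta_h = o_h (1 - o_h) * sum_k w_kh delta_k for each hidden unit
            d_h = [h[j] * (1 - h[j]) *
                   sum(w_o[k][j + 1] * d_o[k] for k in range(n_out))
                   for j in range(n_hidden)]
            # 4. w_ji <- w_ji + eta * delta_j * x_ji
            for k in range(n_out):
                for i in range(n_hidden + 1):
                    w_o[k][i] += eta * d_o[k] * hs[i]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_h[j][i] += eta * d_h[j] * xs[i]
    return w_h, w_o
```

Note that the hidden-unit error terms are computed from the old output weights before step 4 updates them, matching the algorithm's ordering.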
Hidden layer Representations
Convergence and local minima
Gradient descent converges to some local minimum of E, not necessarily the global minimum.
Heuristics to alleviate this:
Use stochastic gradient descent rather than standard gradient descent.
Initialization influences convergence: train multiple networks with different initial random weights over the same data, then select the best one.
Training can take thousands of iterations → slow.
Initialize weights near zero, so the initial network is nearly linear; increasingly nonlinear functions become possible as training progresses.
Add a momentum term to speed convergence: Δω_ji(n) = η δ_j x_ji + α Δω_ji(n − 1), where 0 ≤ α < 1 is the momentum constant.
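The momentum term mentioned above keeps a fraction of the previous update; as a small sketch (the function name and α = 0.9 default are illustrative):

```python
# One momentum update: delta_w(n) = eta * gradient_term + alpha * delta_w(n-1),
# so repeated steps in a consistent direction build up speed.
def momentum_step(w, grad_term, prev_delta, eta=0.1, alpha=0.9):
    """Return (new_weights, new_deltas) for a single momentum update."""
    delta = [eta * g + alpha * d for g, d in zip(grad_term, prev_delta)]
    new_w = [wi + di for wi, di in zip(w, delta)]
    return new_w, delta
```

The accumulated velocity can carry the search through small local minima and flat regions of the error surface.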
Expressive Capabilities of ANNs
Every boolean function can be represented by a network with a single hidden layer (though it may require exponentially many hidden units).
Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer.
Any function can be approximated to arbitrary accuracy by a network with two hidden layers.
More hidden layers can raise precision; however, the risk of converging to a local minimum rises as well.
When to Consider Neural Networks
Input is high-dimensional, discrete or real-valued.
Output is discrete or real-valued.
Output is a vector of values.
The data may be noisy.
The form of the target function is unknown.
Human readability of the result is unimportant.
Overfitting in ANNs
Strategy applied to avoid overfitting
A poor strategy: continue training until the training-set error falls below some threshold, since this can overfit the training data.
A good indicator: the number of iterations that produces the lowest error over a separate validation set.
Keep the weights with the lowest validation error; once the current weights reach a significantly higher validation error than the stored weights, terminate!
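This early-stopping strategy can be sketched as a generic loop; the `step` and `val_error` callables and the `patience` counter are assumptions supplied by the caller, not part of the original algorithm:

```python
# Early stopping via a validation set: remember the weights with the lowest
# validation error, and stop once validation error has failed to improve
# for `patience` consecutive checks.
def train_with_early_stopping(step, val_error, patience=10, max_iters=10000):
    """step() performs one training update and returns the current weights;
    val_error(w) returns the error of w over the validation set."""
    best_w, best_err, bad = None, float("inf"), 0
    for _ in range(max_iters):
        w = step()
        err = val_error(w)
        if err < best_err:
            best_w, best_err, bad = w, err, 0
        else:
            bad += 1
            if bad >= patience:
                break          # validation error stopped improving
    return best_w, best_err
```

Returning the stored best weights, rather than the final ones, is what makes the validation minimum the effective stopping point.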
Alternative Error Functions
Recurrent Networks
Thank you!