1
Introduction to Neural Networks
John Paxton Montana State University Summer 2003
2
Textbook Fundamentals of Neural Networks:
Architectures, Algorithms, and Applications Laurene Fausett Prentice-Hall 1994
3
Chapter 1: Introduction
Why Neural Networks? Training techniques exist. High-speed digital computers are available. Specialized hardware can be built. They better capture the behavior of biological neural systems.
4
Who is interested? Electrical Engineers – signal processing, control theory Computer Engineers – robotics Computer Scientists – artificial intelligence, pattern recognition Mathematicians – modelling tool when explicit relationships are unknown
5
Characterizations Architecture – a pattern of connections between neurons Learning Algorithm – a method of determining the connection weights Activation Function – a function that maps a neuron's net input to its output signal
6
Problem Domains Storing and recalling patterns Classifying patterns
Mapping inputs onto outputs Grouping similar patterns Finding solutions to constrained optimization problems
7
A Simple Neural Network
[diagram: inputs x1, x2 with weights w1, w2 feeding a single output unit y] yin = x1w1 + x2w2 Activation is f(yin)
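The net input on this slide can be sketched directly in code; the step activation and threshold value below are illustrative choices, not part of the slide:

```python
# Minimal sketch of the two-input unit above: y_in = x1*w1 + x2*w2,
# followed by an activation f(y_in). The step threshold theta is assumed.
def neuron(x1, x2, w1, w2, theta=0.0):
    y_in = x1 * w1 + x2 * w2          # net input
    return 1 if y_in >= theta else 0  # activation f(y_in)

print(neuron(1, 1, 0.5, 0.5))  # 1 (y_in = 1.0 clears the threshold)
```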
8
Biological Neuron Dendrites receive electrical signals, affected by chemical processes Soma fires at differing frequencies [diagram: soma, dendrite, axon]
9
Observations A neuron can receive many inputs
Inputs may be modified by weights at the receiving dendrites A neuron sums its weighted inputs A neuron can transmit an output signal The output can go to many other neurons
10
Features Information processing is local
Memory is distributed (short term = signals, long term = dendrite weights) The dendrite weights learn through experience The weights may be inhibitory or excitatory
11
Features Neurons can generalize novel input stimuli
Neurons are fault tolerant and can sustain damage
12
Applications Signal processing, e.g. suppress noise on a phone line.
Control, e.g. backing up a truck with a trailer. Pattern recognition, e.g. handwritten characters or identifying the sex of a face. Diagnosis, e.g. arrhythmia classification or mapping symptoms to a medical case.
13
Applications Speech production, e.g. NETtalk (Sejnowski and Rosenberg, 1986). Speech recognition. Business, e.g. mortgage underwriting (Collins et al.). Unsupervised, e.g. TD-Gammon.
14
Single Layer Feedforward NN
[diagram: inputs x1 … xn fully connected to outputs y1 … ym with weights w11 … wnm]
15
Multilayer Neural Network
More powerful Harder to train [diagram: inputs x1 … xn, hidden units z1 … zp, outputs y1 … ym]
16
Setting the Weights Supervised Unsupervised Fixed-weight nets
17
Activation Functions Identity f(x) = x
Binary step f(x) = 1 if x >= θ f(x) = 0 otherwise Binary sigmoid f(x) = 1 / (1 + e^(-sx))
18
Activation Functions Bipolar sigmoid f(x) = -1 + 2 / (1 + e^(-sx))
Hyperbolic tangent f(x) = (e^x – e^(-x)) / (e^x + e^(-x))
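The activation functions from these two slides, written out; s is the steepness parameter and θ the step threshold (the default values are illustrative):

```python
import math

# The activation functions listed above.
def identity(x):
    return x

def binary_step(x, theta=0.0):
    return 1 if x >= theta else 0

def binary_sigmoid(x, s=1.0):
    return 1.0 / (1.0 + math.exp(-s * x))

def bipolar_sigmoid(x, s=1.0):
    return -1.0 + 2.0 / (1.0 + math.exp(-s * x))

def hyperbolic_tangent(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

print(binary_sigmoid(0))   # 0.5
print(bipolar_sigmoid(0))  # 0.0
```

Note that the bipolar sigmoid is just 2 * (binary sigmoid) − 1, and tanh equals the bipolar sigmoid with s = 2.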
19
History 1943 McCulloch-Pitts neurons 1949 Hebb’s law
1958 Perceptron (Rosenblatt) 1960 Adaline, better learning rule (Widrow, Hoff) 1969 Limitations (Minsky, Papert) 1972 Kohonen nets, associative memory
20
History 1977 Brain State in a Box (Anderson)
1982 Hopfield net, constraint satisfaction 1985 ART (Carpenter, Grossberg) 1986 Backpropagation (Rumelhart, Hinton, Williams) 1988 Neocognitron, character recognition (Fukushima)
21
McCulloch-Pitts Neuron
f(yin) = 1 if yin >= θ [diagram: inputs x1, x2, x3 feeding output unit y]
22
Exercises 2 input AND 2 input OR 3 input OR 2 input XOR
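One possible answer to the first exercise, sketched as code: a McCulloch-Pitts unit computes 2-input AND with both weights 1 and threshold θ = 2 (these particular values are my choice, not the slide's):

```python
# McCulloch-Pitts unit: fires (outputs 1) iff the weighted sum reaches theta.
def mcp_neuron(inputs, weights, theta):
    y_in = sum(x * w for x, w in zip(inputs, weights))
    return 1 if y_in >= theta else 0

# 2-input AND with weights (1, 1) and theta = 2
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, mcp_neuron((x1, x2), (1, 1), theta=2))
```

With the same weights, θ = 1 gives 2-input OR; XOR has no single-unit solution, which is why it takes more than one neuron.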
24
Chapter 2: Simple Neural Networks for Pattern Classification
ARCHITECTURE: [diagram: bias unit 1 with weight w0, inputs x1 … xn with weights w1 … wn, output y] w0 is the bias f(yin) = 1 if yin >= 0 f(yin) = 0 otherwise
25
Representations Binary: 0 no, 1 yes Bipolar: -1 no, 0 unknown, 1 yes
Bipolar is superior
26
Interpreting the Weights
w0 = -1, w1 = 1, w2 = 1 0 = -1 + x1 + x2, i.e. the decision boundary x2 = 1 – x1 [graph: YES region on one side of the line, NO on the other]
27
Modelling a Simple Problem
Should I attend this lecture? x1 = it’s hot x2 = it’s raining [diagram: bias weight 2.5, weight -2 on x1, weight 1 on x2]
28
Linear Separability [figure: 2-D plots of the AND, OR, and XOR truth tables; AND and OR are linearly separable, XOR is not]
29
Hebb’s Rule Increase the weight between two neurons that are both “on”. Increase the weight between two neurons that are both “off”. wi(new) = wi(old) + xi*y
30
Algorithm 1. set wi = 0 for 0 <= i <= n
2. for each training vector 3. set xi = si for all input units 4. set y = t 5. wi(new) = wi(old) + xi*y
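The algorithm above, run on the bipolar 2-input AND example of the next slides (x0 = 1 is the bias input):

```python
# One pass of Hebb-rule training: wi(new) = wi(old) + xi*y for each pair.
def hebb_train(samples):
    w = [0] * len(samples[0][0])
    for x, t in samples:  # presented in the given order
        w = [wi + xi * t for wi, xi in zip(w, x)]
    return w

# bipolar AND: (bias x0, x1, x2) -> target
and_data = [((1, 1, 1), 1), ((1, 1, -1), -1),
            ((1, -1, 1), -1), ((1, -1, -1), -1)]
print(hebb_train(and_data))  # [-2, 2, 2]
```

The resulting weights give the decision boundary -2 + 2x1 + 2x2 = 0 interpreted a few slides later.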
31
Example: 2 input AND [table: columns s0, s1, s2, t listing the four bipolar training vectors with targets 1 and -1]
32
Training Procedure [table: columns w0 w1 w2 x0 x1 x2 y tracing the weights over the four examples; final weights -2, 2, 2]
33
Result Interpretation
-2 + 2x1 + 2x2 = 0, i.e. x2 = -x1 + 1 This training procedure is order dependent and not guaranteed to converge.
34
Pattern Recognition Exercise
[figure: two 3x3 pixel patterns, “X” and “O”]
35
Pattern Recognition Exercise
Architecture? Weights? Are the original patterns classified correctly? Are the original patterns with 1 piece of wrong data classified correctly? Are the original patterns with 1 piece of missing data classified correctly?
36
Perceptrons (1958) Very important early neural network
Guaranteed training procedure under certain circumstances [diagram: bias unit 1 with weight w0, inputs x1 … xn with weights w1 … wn, output y]
37
Activation Function f(yin) = 1 if yin > θ f(yin) = 0 if -θ <= yin <= θ f(yin) = -1 otherwise [graph: a step function taking values 1, 0, -1]
38
Learning Rule wi(new) = wi(old) + a*t*xi if error
a is the learning rate Typically, 0 < a <= 1
39
Algorithm 1. set wi = 0 for 0 <= i <= n (can be random)
2. for each training exemplar do 3. xi = si 4. yin = Σ xi*wi 5. y = f(yin) 6. wi(new) = wi(old) + a*t*xi if error 7. if stopping condition not reached, go to 2
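The algorithm can be sketched as follows, trained on the bipolar AND example of the next slide (θ = 0, a = 1); it reaches the weights (-1, 1, 1) here, though other runs of perceptron learning can end at different valid weight vectors:

```python
# Perceptron training: repeat epochs until one full pass makes no errors.
def perceptron_train(samples, alpha=1, theta=0):
    w = [0] * len(samples[0][0])
    changed = True
    while changed:
        changed = False
        for x, t in samples:
            y_in = sum(xi * wi for xi, wi in zip(x, w))
            y = 1 if y_in > theta else (-1 if y_in < -theta else 0)
            if y != t:  # update only when the response is wrong
                w = [wi + alpha * t * xi for wi, xi in zip(w, x)]
                changed = True
    return w

# bipolar AND: (bias x0, x1, x2) -> target
and_data = [((1, 1, 1), 1), ((1, 1, -1), -1),
            ((1, -1, 1), -1), ((1, -1, -1), -1)]
print(perceptron_train(and_data))  # [-1, 1, 1]
```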
40
Example: AND concept bipolar inputs bipolar target θ = 0 a = 1
41
Epoch 1 [table: columns w0 w1 w2 x0 x1 x2 y t tracing the weight values through the first epoch]
42
Exercise Continue the above example until the learning algorithm is finished.
43
Perceptron Learning Rule Convergence Theorem
If a weight vector exists that correctly classifies all of the training examples, then the perceptron learning rule will converge to some weight vector that gives the correct response for all training patterns. This will happen in a finite number of steps.
44
Exercise Show perceptron weights for the 2-of-3 concept [table: columns x1 x2 x3 y; y = 1 when at least two inputs are 1, -1 otherwise]
45
Adaline (Widrow, Hoff 1960) Adaptive Linear Neuron
Learning rule minimizes the mean squared error Learns on all examples, not just ones with errors
46
Architecture [diagram: bias unit 1 with weight w0, inputs x1 … xn with weights w1 … wn, output y]
47
Training Algorithm 1. set wi (small random values typical)
2. set a (0.1 typical) 3. for each training exemplar do 4. xi = si 5. yin = Σ xi*wi 6. wi(new) = wi(old) + a*(t – yin)*xi 7. go to 3 if largest weight change big enough
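A sketch of the algorithm on the bipolar AND data; the learning rate and epoch count are illustrative, and the weights should approach (-0.5, 0.5, 0.5), the least-squares solution noted a few slides down:

```python
# Adaline (delta-rule) training: updates on every example, not just errors.
def adaline_train(samples, alpha=0.05, epochs=100):
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, t in samples:
            y_in = sum(xi * wi for xi, wi in zip(x, w))
            # wi(new) = wi(old) + a*(t - yin)*xi
            w = [wi + alpha * (t - y_in) * xi for wi, xi in zip(w, x)]
    return w

and_data = [((1, 1, 1), 1), ((1, 1, -1), -1),
            ((1, -1, 1), -1), ((1, -1, -1), -1)]
print(adaline_train(and_data))  # close to [-0.5, 0.5, 0.5]
```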
48
Activation Function f(yin) = 1 if yin >= 0 f(yin) = -1 otherwise
49
Delta Rule squared error E = (t – yin)^2
minimize by gradient descent: ∂E/∂wi = -2(t – yin)xi, giving the weight update Δwi = a(t – yin)xi
50
Example: AND concept bipolar inputs bipolar targets
w0 = -0.5, w1 = 0.5, w2 = 0.5 minimizes E [table: columns x0 x1 x2 yin t E listing the four training cases]
51
Exercise Demonstrate that you understand the Adaline training procedure.
52
Madaline Many adaptive linear neurons [diagram: bias units, inputs x1 … xm, hidden Adalines z1 … zk, output y]
53
Madaline MRI (1960) – only learns weights from input layer to hidden layer MRII (1987) – learns all weights
55
Chapter 3: Pattern Association
Aristotle observed that human memory associates similar items contrary items items close in proximity items close in succession (a song)
56
Terminology and Issues
Autoassociative Networks Heteroassociative Networks Feedforward Networks Recurrent Networks How many patterns can be stored?
57
Hebb Rule for Pattern Association
Architecture [diagram: inputs x1 … xn fully connected to outputs y1 … ym with weights w11 … wnm]
58
Algorithm 1. set wij = 0 1 <= i <= n, 1 <= j <= m
2. for each training pair s:t 3. xi = si 4. yj = tj 5. wij(new) = wij(old) + xiyj
59
Example s1 = (1 -1 -1), s2 = (-1 1 1) t1 = (1 -1), t2 = (-1 1)
w11 = 1*1 + (-1)(-1) = 2 w12 = 1*(-1) + (-1)1 = -2 w21 = (-1)1+ 1(-1) = -2 w22 = (-1)(-1) + 1(1) = 2 w31 = (-1)1 + 1(-1) = -2 w32 = (-1)(-1) + 1*1 = 2
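These weight computations are the outer-product form of the Hebb rule, wij = Σ si*tj over the training pairs; a quick check in code:

```python
# Outer-product (Hebb) weights: wij = sum over training pairs of si * tj.
def hebb_assoc(pairs):
    n, m = len(pairs[0][0]), len(pairs[0][1])
    w = [[0] * m for _ in range(n)]
    for s, t in pairs:
        for i in range(n):
            for j in range(m):
                w[i][j] += s[i] * t[j]
    return w

pairs = [((1, -1, -1), (1, -1)), ((-1, 1, 1), (-1, 1))]
print(hebb_assoc(pairs))  # [[2, -2], [-2, 2], [-2, 2]]
```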
60
Matrix Alternative s1 = (1 -1 -1), s2 = (-1 1 1)
t1 = (1 -1), t2 = (-1 1) W = s1^T t1 + s2^T t2 = [ 2 -2 ; -2 2 ; -2 2 ]
61
Final Network f(yin) = 1 if yin > 0, 0 if yin = 0, else -1
[diagram: x1 connects to y1 with weight 2 and to y2 with weight -2; x2 and x3 each connect to y1 with weight -2 and to y2 with weight 2]
62
Properties Weights exist if input vectors are linearly independent
Orthogonal vectors can be learned perfectly High weights imply strong correlations
63
Exercises What happens if ( ) is tested? This vector has one mistake. What happens if ( ) is tested? This vector has one piece of missing data. Show an example of training data that is not learnable. Show the learned network.
64
Delta Rule for Pattern Association
Works when patterns are linearly independent but not orthogonal Introduced in the 1960s for ADALINE Produces a least squares solution
65
Activation Functions Delta Rule (derivative factor 1) wij(new) = wij(old) + a(tj – yj)*xi*1 Extended Delta Rule (derivative factor f’(yin.j)) wij(new) = wij(old) + a(tj – yj)*xi*f’(yin.j)
66
Heteroassociative Memory Net
Application: Associate characters. A <-> a B <-> b
67
Autoassociative Net Architecture [diagram: x1 … xn fully connected to y1 … yn with weights w11 … wnn]
68
Training Algorithm Assuming that the training vectors are orthogonal, we can use the Hebb rule algorithm mentioned earlier. Application: Find out whether an input vector is familiar or unfamiliar. For example, voice input as part of a security system.
69
Autoassociative Example [weight matrix computation omitted in the transcript]
70
Evaluation What happens if (1 1 1) is presented?
Why are the diagonals set to 0?
71
Storage Capacity 2 vectors (1 1 1), (-1 -1 -1) Recall is perfect
W = [ 0 2 2 ; 2 0 2 ; 2 2 0 ]
72
Storage Capacity 3 vectors: (1 1 1), (-1 -1 -1), (1 -1 1)
Recall is no longer perfect W = [ 0 1 3 ; 1 0 1 ; 3 1 0 ]
73
Theorem Up to n-1 bipolar vectors of n dimensions can be stored in an autoassociative net.
74
Iterative Autoassociative Net
1 vector: s = (1 1 -1), W = s^T s with zero diagonal (1 0 0) -> (0 1 -1) (0 1 -1) -> (2 1 -1) -> (1 1 -1) (1 1 -1) -> (2 2 -2) -> (1 1 -1)
75
Testing Procedure 1. initialize weights using Hebb learning
2. for each test vector do 3. set xi = si 4. calculate ti 5. set si = ti go to step 4 if the s vector is new
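This testing procedure, sketched for the (1 1 -1) example above: Hebb weights with a zero diagonal, iterated until the activations stop changing:

```python
# Iterative autoassociative recall.
def sign(v):
    return [1 if x > 0 else (-1 if x < 0 else 0) for x in v]

def recall(w, x, max_iters=10):
    for _ in range(max_iters):
        net = [sum(x[i] * w[i][j] for i in range(len(x)))
               for j in range(len(x))]
        new_x = sign(net)
        if new_x == x:   # converged to a stable pattern
            return new_x
        x = new_x
    return x

s = (1, 1, -1)
w = [[0 if i == j else s[i] * s[j] for j in range(3)] for i in range(3)]
print(recall(w, [1, 0, 0]))  # [1, 1, -1]
```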
76
Exercises 1 piece of missing data: (0 1 -1)
2 pieces of missing data: (0 0 -1) 3 pieces of missing data: (0 0 0) 1 mistake: ( ) 2 mistakes: ( )
77
Discrete Hopfield Net content addressable problems
pattern association problems constrained optimization problems wij = wji wii = 0
78
Characteristics Only 1 unit updates its activation at a time
Each unit continues to receive the external signal An energy (Lyapunov) function can be found that allows the net to converge, unlike the previous system Autoassociative
79
Architecture [diagram: units y1, y2, y3 fully interconnected, each with an external input xi]
80
Algorithm 1. initialize weights using Hebb rule
2. for each input vector do 3. yi = xi 4. do steps 5-6 randomly for each yi 5. yin.i = xi + Syjwji 6. calculate f(yin.i) 7. go to step 2 if the net hasn’t converged
81
Example training vector: (1 -1) [diagram: units y1, y2 joined by weight -1, with external inputs x1, x2]
82
Example input (0 -1) update y1 = 0 + (-1)(-1) = 1 update y2 = -1 + (1)(-1) = -2 -> -1 input (1 -1) update y2 = -1 + (1)(-1) = -2 -> -1 update y1 = 1 + (-1)(-1) = 2 -> 1
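A sketch of this example: Hebb learning on (1 -1) gives w12 = w21 = -1, and one asynchronous pass (external input included, per step 5 of the algorithm) repairs the noisy input (0 -1):

```python
# One asynchronous pass over a discrete Hopfield net.
def hopfield_pass(w, x, order):
    y = list(x)
    for i in order:  # update one unit at a time
        y_in = x[i] + sum(y[j] * w[j][i] for j in range(len(y)) if j != i)
        y[i] = 1 if y_in > 0 else -1  # ties don't occur in this example
    return y

w = [[0, -1], [-1, 0]]
print(hopfield_pass(w, [0, -1], order=[0, 1]))  # [1, -1]
```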
83
Hopfield Theorems Convergence is guaranteed.
The number of storable patterns is approximately n / (2 * log n) where n is the dimension of a vector
84
Bidirectional Associative Memory (BAM)
Heteroassociative Recurrent Net Kosko, 1988 Architecture [diagram: x1 … xn bidirectionally connected to y1 … ym]
85
Activation Function f(yin) = 1, if yin > 0 f(yin) = 0, if yin = 0
f(yin) = -1 otherwise
86
Algorithm 1. initialize weights using Hebb rule
2. for each test vector do 3. present s to x layer 4. present t to y layer 5. while equilibrium is not reached 6. compute f(yin.j) 7. compute f(xin.j)
87
Example s1 = (1 1), t1 = (1 -1) s2 = (-1 -1), t2 = (-1 1)
88
Example Architecture [diagram: x1 and x2 each connect to y1 with weight 2 and to y2 with weight -2]
present (1 1) to x -> y = (1 -1) present (1 -1) to y -> x = (1 1)
89
Hamming Distance Definition: Number of different corresponding bits in two vectors For example, H[(1 -1), (1 1)] = 1 Average Hamming Distance is ½.
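The definition as a one-liner, checked against the slide's example:

```python
# Hamming distance: the number of positions where two vectors differ.
def hamming(v1, v2):
    return sum(a != b for a, b in zip(v1, v2))

print(hamming((1, -1), (1, 1)))  # 1
```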
90
About BAMs Observation: Encoding is better when the average Hamming distance of the inputs is similar to the average Hamming distance of the outputs. The memory capacity of a BAM is min(n-1, m-1).
92
Chapter 4: Competition Force a decision (yes, no, maybe) to be made.
Winner take all is a common approach. Kohonen learning wj(new) = wj(old) + a (x – wj(old)) where wj is the closest weight vector to x, determined by Euclidean distance.
93
MaxNet Lippmann, 1987 Fixed-weight competitive net.
Activation function f(x) = x if x > 0, else 0. Architecture [diagram: units a1, a2 with self-weights 1 and mutual inhibitory weights -ε]
94
Algorithm 1. wij = 1 if i = j, otherwise –ε 2. aj(0) = sj, t = 0.
3. aj(t+1) = f[aj(t) – ε*Σk≠j ak(t)] 4. go to step 3 if more than one node has a non-zero activation Special Case: More than one node has the same maximum activation.
95
Example s1 = .5, s2 = .1, ε = .1 a1(0) = .5, a2(0) = .1
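Tracing this example in code: the larger unit's activation decays only slightly while the other is driven to zero, so unit 1 wins:

```python
# MaxNet competition: each unit suppresses the others by eps times their
# activations until at most one stays positive.
def maxnet(a, eps=0.1):
    a = list(a)
    while sum(x > 0 for x in a) > 1:
        total = sum(a)
        a = [max(0.0, x - eps * (total - x)) for x in a]
    return a

a = maxnet([0.5, 0.1])   # a1(0) = .5, a2(0) = .1
print(a.index(max(a)))   # 0  (the initially largest unit wins)
```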
96
Mexican Hat Kohonen, 1989 Contrast enhancement
Architecture (w0, w1, w2, w3) w0 (xi -> xi), w1 (xi+1 -> xi and xi-1 -> xi) [diagram: units xi-3 … xi+3 arranged in a line]
97
Algorithm 1. initialize weights 2. xi(0) = si
3. for some number of steps do 4. xi(t+1) = f[ Σ wk xi+k(t) ] 5. xi(t+1) = max(0, xi(t+1))
98
Example x1, x2, x3, x4, x5 radius 0 weight = 1 radius 1 weight = 1
all other radii weights = 0 s = ( ) f(x) = 0 if x < 0, x if 0 <= x <= 2, 2 otherwise
99
Example x(0) = (0 .5 1 .5 1) x1(1) = 1(0) + 1(.5) -.5(1) = 0
100
Why the name? Plot x(0) vs. x(1) [graph: activations of x1 … x5 on a 0–2 scale, tracing a Mexican-hat shape]
101
Hamming Net Lippmann, 1987 Maximum likelihood classifier
The similarity of 2 vectors is taken to be n – H(v1, v2) where H is the Hamming distance Uses MaxNet with similarity metric
102
Architecture Concrete example: [diagram: inputs x1, x2, x3 feeding units y1, y2, whose outputs go into a MaxNet]
103
Algorithm 1. wij = si(j)/2 2. n is the dimensionality of a vector
3. yin.j = Σ xi*wij + (n/2) 4. select max(yin.j) using MaxNet
104
Example Training examples: (1 1 1), (-1 -1 -1) n = 3
Present (1 1 1): yin.1 = 1(.5) + 1(.5) + 1(.5) + 3/2 = 3 yin.2 = 1(-.5) + 1(-.5) + 1(-.5) + 3/2 = 0 These last 2 quantities represent the similarity n – H to the two stored examples They are then fed into MaxNet.
105
Kohonen Self-Organizing Maps
Maps inputs onto one of m clusters Human brains seem to be able to self organize.
106
Architecture [diagram: inputs x1 … xn fully connected to cluster units y1 … ym]
107
Neighborhoods Linear: 3 2 1 # 1 2 3 (units at distances 1, 2, 3 on either side of the winner #)
Rectangular: the winner # sits at the center of nested squares of units
108
Algorithm 1. initialize wij 2. select topology of yi
3. select learning rate parameters 4. while stopping criteria not reached 5. for each input vector do 6. compute D(j) = Σ(wij – xi)^2 for each j
109
Algorithm (continued) 7. select minimum D(j)
8. update neighborhood units wij(new) = wij(old) + a[xi – wij(old)] 9. update a 10. reduce radius of neighborhood at specified times
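Steps 6-8 with a winner-take-all neighborhood (radius 0) can be sketched as follows; the input and initial weights here are made up for illustration, not the example on the next slides:

```python
# One clustering step: find the closest weight vector (squared Euclidean
# distance) and move it a fraction alpha toward the input.
def som_step(weights, x, alpha):
    d = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    j = d.index(min(d))                    # winning cluster unit
    weights[j] = [wi + alpha * (xi - wi)   # wij(new) = wij(old) + a[xi - wij]
                  for wi, xi in zip(weights[j], x)]
    return j

weights = [[0.2, 0.6], [0.9, 0.1]]         # hypothetical initial weights
j = som_step(weights, [1.0, 0.0], alpha=0.6)
print(j, [round(wi, 2) for wi in weights[j]])  # 1 [0.96, 0.04]
```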
110
Example Place ( ), ( ), ( ), ( ) into two clusters a(0) = .6 a(t+1) = .5 * a(t) random initial weights
111
Example Present ( ) D(1) = (.2 – 1)2 + (.6 – 1)2 + (.5 – 0)2 + (.9 – 0)2 = 1.86 D(2) = .98 D(2) wins!
112
Example wi2(new) = wi2(old) + .6[xi – wi2(old)] (bigger) (bigger) (smaller) (smaller) This example assumes no neighborhood
113
Example After many epochs ( ) -> category ( ) -> category ( ) -> category ( ) -> category 1
114
Applications Grouping characters Travelling Salesperson Problem
Cluster units can be represented graphically by weight vectors Linear neighborhoods can be used with the first and last cluster units connected
115
Learning Vector Quantization
Kohonen, 1989 Supervised learning There can be several output units per class
116
Architecture Like Kohonen nets, but no topology for output units
Each yi represents a known class [diagram: inputs x1 … xn fully connected to outputs y1 … ym]
117
Algorithm 1. Initialize the weights
(first m training examples, random) 2. choose a 3. while stopping criteria not reached do (number of iterations, a is very small) 4. for each training vector do
118
Algorithm 5. find minimum || x – wj || 6. if minimum is target class
wj(new) = wj(old) + a[x – wj(old)] else wj(new) = wj(old) – a[x – wj(old)] 7. reduce a
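One LVQ update as code: the winner moves toward the input when its class matches the target and away otherwise. The weights, classes, and input below are illustrative, not the slides' example:

```python
# One LVQ step (steps 5-6 above).
def lvq_step(weights, classes, x, target, alpha):
    d = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    j = d.index(min(d))                      # closest output unit
    sign = 1 if classes[j] == target else -1
    weights[j] = [wi + sign * alpha * (xi - wi)
                  for wi, xi in zip(weights[j], x)]
    return j

weights = [[1.0, 1.0], [-1.0, -1.0]]   # y1 represents class 1, y2 class 2
classes = [1, 2]
j = lvq_step(weights, classes, [0.8, 0.6], target=1, alpha=0.1)
print(j, [round(wi, 2) for wi in weights[j]])  # 0 [0.98, 0.96]
```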
119
Example (1 1 -1 -1) belongs to category 1
2 output units, y1 represents category 1 and y2 represents category 2
120
Example Initial weights (where did these come from?) a = .1
121
Example Present training example 3, ( ). It belongs to category 2. D(1) = (1 + 1)^2 + (1 + 1)^2 + (-1 - 1)^2 + (-1 - 1)^2 = 16 D(2) = 4 Category 2 wins. That is correct!
122
Example w2(new) = ( ) [( ) - ( )] = ( )
123
Issues How many yi should be used?
How should we choose the class that each yi should represent? LVQ2, LVQ3 are enhancements to LVQ that modify the runner-up sometimes
124
Counterpropagation Hecht-Nielsen, 1987
There are input, output, and clustering layers Can be used to compress data Can be used to approximate functions Can be used to associate patterns
125
Stages Stage 1: Cluster input vectors
Stage 2: Adapt weights from cluster units to output units
126
Stage 1 Architecture [diagram: inputs x1 … xn (weights w) and outputs y1 … ym (weights v) both connected to cluster units z1 … zp]
127
Stage 2 Architecture [diagram: cluster unit zj connected to x*1 … x*n by weights tj and to y*1 … y*m by weights vj]
128
Full Counterpropagation
Stage 1 Algorithm 1. initialize weights, a, b 2. while stopping criteria is false do 3. for each training vector pair do 4. minimize ||x – wj|| + ||y – vj|| wj(new) = wj(old) + a[x – wj(old)] vj(new) = vj(old) + b[y-vj(old)] 5. reduce a, b
129
Stage 2 Algorithm 1. while stopping criteria is false
2. for each training vector pair do 3. perform step 4 above 4. tj(new) = tj(old) + a[x – tj(old)] vj(new) = vj(old) + b[y – vj(old)]
130
Partial Example Approximate y = 1/x [0.1, 10.0] 1 x unit 1 y unit
10 z units 1 x* unit 1 y* unit
131
Partial Example v11 = .11, w11 = 9.0 v12 = .14, w12 = 7.0 …
test .12, predict 9.0. In this example, the output weights will converge to the cluster weights.
132
Forward Only Counterpropagation
Sometimes the function y = f(x) is not invertible. Architecture (only 1 z unit active) [diagram: inputs x1 … xn, cluster units z1 … zp, outputs y1 … ym]
133
Stage 1 Algorithm 1. initialize weights, a (.1), b (.6)
2. while stopping criteria is false do 3. for each input vector do 4. find minimum || x – w|| w(new) = w(old) + a[x – w(old)] 5. reduce a
134
Stage 2 Algorithm 1. while stopping criteria is false do
2. for each training vector pair do 3. find minimum || x – w || w(new) = w(old) + a[x – w(old)] v(new) = v(old) + b[y – v(old)] 4. reduce b Note: interpolation is possible.
135
Example y = f(x) over [0.1, 10.0] 10 zi units
After phase 1, zi = 0.5, 1.5, …, 9.5. After phase 2, zi = 5.5, 0.75, …, 0.1
137
Chapter 5: Adaptive Resonance Theory
1987, Carpenter and Grossberg ART1: clusters binary vectors ART2: clusters continuous vectors
138
General Weights on a cluster unit can be considered to be a prototype pattern Relative similarity is used instead of an absolute difference. Thus, a difference of 1 in a vector with only a few non-zero components becomes more significant.
139
General Training examples may be presented several times.
Training examples may be presented in any order. An example might change clusters. Nets are stable (patterns don’t oscillate). Nets are plastic (examples can be added).
140
Architecture Input layer (xi)
Output layer or cluster layer – competitive (yi) Units in the output layer can be active, inactive, or inhibited.
141
Sample Network t (top down weights), b (bottom up weights) [diagram: inputs x1 … xn connected to cluster units y1 … ym by bottom-up weights bij and top-down weights tji]
142
Nomenclature bij: bottom up weight tij: top down weight
s: input vector x: activation vector n: number of components in input vector m: maximum number of clusters || x ||: Σ xi p: vigilance parameter
143
Training Algorithm 1. L > 1, 0 < p <= 1, tji(0) = 1, 0 < bij(0) < L / (L – 1 + n) 2. while stopping criterion is false do steps 3 – 12 3. for each training example do steps 4 – 12
144
Training Algorithm 4. yi = 0 5. compute || s || 6. xi = si
7. if yj (do for each j) is not inhibited then yj = Σ bij xi 8. find largest yj that is not inhibited 9. xi = si * tji
145
Training Algorithm 10. compute || x ||
11. if || x || / || s || < p then yj = -1, go to step 8 12. bij = L xi / ( L – 1 + || x || ) tji = xi
146
Possible Stopping Criterion
No weight changes. Maximum number of epochs reached.
147
What Happens If All Units Are Inhibited?
Lower p. Add a cluster unit. Throw out the current input as an outlier.
148
Example n = 4 m = 3 p = 0.4 (low vigilance) L = 2
bij(0) = 1/(1 + n) = 0.2 tji(0) = 1 [diagram: inputs x1 … x4 connected to cluster units y1 … y3]
149
Example 3. input vector (1 1 0 0) 4. yi = 0 5. || s || = 2
7. y1 = y2 = y3 = 0.2 + 0.2 = 0.4
150
Example 8. j = 1 (use lowest index to break ties)
9. x1 = s1 * t11 = 1 * 1 = 1 x2 = s2 * t12 = 1 * 1 = 1 x3 = s3 * t13 = 0 * 1 = 0 x4 = s4 * t14 = 0 * 1 = 0 10. || x || = 2 11. || x || / || s || = 1 >= 0.4
151
Example 12. b11 = b21 = L * x1 / (L – 1 + || x ||) = 2 * 1 / (1 + 2) = .667
b31 = b41 = 0 t11 = t12 = 1 t13 = t14 = 0
152
Exercise Show the network after the training example ( ) is processed.
153
Observations Typically, stable weight matrices are obtained quickly.
The cluster units are all topologically independent of one another. We have just looked at the fast learning version of ART1. There is also a slow learning version that updates just one weight per training example.
155
Chapter 6: Backpropagation
1986 Rumelhart, Hinton, Williams Gradient descent method that minimizes the total squared error of the output. Applicable to multilayer, feedforward, supervised neural networks. Revitalizes interest in neural networks!
156
Backpropagation Appropriate for any domain where inputs must be mapped onto outputs. 1 hidden layer is sufficient to learn any continuous mapping to any arbitrary accuracy! Memorization versus generalization tradeoff.
157
Architecture input layer, hidden layer, output layer [diagram: bias units (1), inputs x1 … xn, hidden units z1 … zp, outputs y1 … ym, with weights v (input to hidden) and w (hidden to output)]
158
General Process Feedforward the input signals.
Backpropagate the error. Adjust the weights.
159
Activation Function Characteristics
Continuous. Differentiable. Monotonically nondecreasing. Easy to compute. Saturates (reaches limits).
160
Activation Functions Binary Sigmoid f(x) = 1 / [1 + e^(-x)] f’(x) = f(x)[1 – f(x)] Bipolar Sigmoid f(x) = 2 / [1 + e^(-x)] – 1 f’(x) = 0.5 * [1 + f(x)] * [1 – f(x)]
161
Training Algorithm 1. initialize weights to small random values, for example [ ] 2. while stopping condition is false do steps 3 – 8 3. for each training pair do steps 4-8
162
Training Algorithm 4. zin.j = Σ (xi * vij) zj = f(zin.j)
5. yin.j = Σ (zi * wij) yj = f(yin.j) 6. error(yj) = (tj – yj) * f’(yin.j) tj is the target value 7. error(zk) = [ Σ error(yj) * wkj ] * f’(zin.k)
163
Training Algorithm 8. wkj(new) = wkj(old) + a*error(yj)*zk
vkj(new) = vkj(old) + a*error(zj)*xk a is the learning rate An epoch is one cycle through the training vectors.
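Steps 4-8 on a tiny 2-input, 2-hidden-unit, 1-output net with the bipolar sigmoid; the initial weights, input, and learning rate below are illustrative values I chose, not the textbook's:

```python
import math

def f(x):        # bipolar sigmoid
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def f_prime(fx): # derivative written in terms of the output f(x)
    return 0.5 * (1.0 + fx) * (1.0 - fx)

def backprop_step(v, w, x, t, alpha):
    # step 4: forward pass through the hidden layer (x[0] = 1 is the bias)
    z = [f(sum(x[i] * v[i][j] for i in range(len(x))))
         for j in range(len(v[0]))]
    zb = [1.0] + z                  # bias unit for the hidden layer
    # step 5: output
    y = f(sum(zb[i] * w[i] for i in range(len(w))))
    # step 6: output error
    err_y = (t - y) * f_prime(y)
    # step 7: hidden errors (w[0] is the bias weight, so skip it)
    err_z = [err_y * w[j + 1] * f_prime(z[j]) for j in range(len(z))]
    # step 8: weight updates
    w = [w[i] + alpha * err_y * zb[i] for i in range(len(w))]
    v = [[v[i][j] + alpha * err_z[j] * x[i] for j in range(len(v[0]))]
         for i in range(len(v))]
    return v, w, y

x = [1.0, 1.0, -1.0]                        # bias, x1, x2 (XOR target: 1)
v = [[0.1, -0.2], [0.3, 0.1], [-0.1, 0.2]]  # input->hidden weights (3x2)
w = [0.2, -0.1, 0.3]                        # hidden->output weights
v, w, y = backprop_step(v, w, x, t=1.0, alpha=0.5)
print(round(y, 3))  # the output before this update; repeating the step shrinks the error
```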
164
Choices Initial Weights
random [ ], don’t want the derivative to be 0 Nguyen-Widrow b = 0.7 * p^(1/n) n = number of input units p = number of hidden units vij = b * vij(random) / || vj(random) ||
165
Choices Stopping Condition (avoid overtraining!)
Set aside some of the training pairs as a validation set. Stop training when the error on the validation set stops decreasing.
166
Choices Number of Training Pairs
total number of weights / desired average error on the test set, where the average error on the training pairs is half of the above desired average
167
Choices Data Representation
Bipolar is better than binary because 0 units don’t learn. Discrete values: red, green, blue? Continuous values: [ ]? Number of Hidden Layers 1 is sufficient Sometimes, multiple layers might speed up the learning
168
Example XOR. Bipolar data representation.
Bipolar sigmoid activation function. a = 1 3 input units, 5 hidden units, 1 output unit (counts include the bias units) Initial weights are all 0. Training example (1 -1). Target: 1.
169
Example 4. z1 = f(1*0 + 1*0 + -1*0) = f(0) = 0 z2 = z3 = z4 = 0
5. y1 = f(1*0 + 0*0 + 0*0 + 0*0 + 0*0) = f(0) = 0 6. error(y1) = (1 – 0) * [0.5 * (1 + 0) * (1 – 0)] = 0.5 7. error(z1) = 0 * f’(zin.1) = 0 = error(z2) = error(z3) = error(z4)
170
Example 8. w01(new) = w01(old) + a*error(y1)*z0
= 0 + 1 * 0.5 * 1 = 0.5 v21(new) = v21(old) + a*error(z1)*x2 = 0 + 1 * 0 * -1 = 0.
171
Exercise Draw the updated neural network.
Present the example 1 -1 as an example to classify. How is it classified now? If learning were to occur, how would the network weights change this time?
172
XOR Experiments Binary Activation/Binary Representation: 3000 epochs.
Bipolar Activation/Bipolar Representation: 400 epochs. Bipolar Activation/Modified Bipolar Representation [ ]: 265 epochs. Above experiment with Nguyen-Widrow weight initialization: 125 epochs.
173
Variations Momentum Δwjk(t+1) = a * error(yj) * zk + m * Δwjk(t) Δvij(t+1) = similar m is [ ] The previous experiment takes 38 epochs.
174
Variations Batch update the weights to smooth the changes.
Adapt the learning rate. For example, in the delta-bar-delta procedure each weight has its own learning rate that varies over time. 2 consecutive weight increases or decreases will increase the learning rate.
175
Variations Alternate Activation Functions
Strictly Local Backpropagation makes the algorithm more biologically plausible by making all computations local cortical units sum their inputs synaptic units apply an activation function thalamic units compute errors equivalent to standard backpropagation
176
Variations Strictly Local Backpropagation input cortical layer -> input synaptic layer -> hidden cortical layer -> hidden synaptic layer -> output cortical layer-> output synaptic layer -> output thalamic layer Number of Hidden Layers
177
Hecht-Nielsen Theorem
Given any continuous function f: I^n -> R^m where I is [0, 1], f can be represented exactly by a feedforward network having n input units, 2n + 1 hidden units, and m output units.
179
Chapter 7: A Sampler Of Other Neural Nets
Optimization Problems Common Extensions Adaptive Architectures Neocognitron
180
I. Optimization Problems
Travelling Salesperson Problem. Map coloring. Job shop scheduling. RNA secondary structure.
181
Advantages of Neural Nets
Can find near optimal solutions. Can handle weak (desirable, but not required) constraints.
182
TSP Topology Each row has 1 unit that is on
Each column has 1 unit that is on [grid: rows = City A, City B, City C; columns = 1st, 2nd, 3rd position in the tour]
183
Boltzmann Machine Hinton, Sejnowski (1983)
Can be modelled using Markov chains Uses simulated annealing Each row is fully interconnected Each column is fully interconnected
184
Architecture ui,j connected to uk,j+1 with weight –di,k ui,1 connected to uk,n with weight –di,k [diagram: units u11 … unn, each with self-connection b and mutual inhibition –p]
185
Algorithm 1. Initialize weights b, p with p > b and p > greatest distance between cities Initialize temperature T Initialize activations of units to random binary values
186
Algorithm 2. while stopping condition is false, do steps 3 – 8
3. do steps 4 – 7 n2 times (1 epoch) 4. choose i and j randomly 1 <= i, j <= n uij is candidate to change state
187
Algorithm 5. Compute c = [1 – 2uij]b + Σ Σ ukm (-p)
where k <> i, m <> j 6. Compute probability to accept change a = 1 / (1 + e^(-c/T)) 7. Accept change if random number [0..1] < a. If change, uij = 1 – uij 8. Adjust temperature T = .95T
188
Stopping Condition No state change for a specified number of epochs.
Temperature reaches a certain value.
189
Example T(0) = 20 half of the units are on initially b = 60 p = 70
10 cities, all distances less than 1 200 or fewer epochs to find stable configuration in 100 random trials
190
Other Optimization Architectures
Continuous Hopfield Net Gaussian Machine Cauchy Machine Adds noise to input in attempt to escape from local minima Faster annealing schedule can be used as a consequence
191
II. Extensions Modified Hebbian Learning
Find parameters for optimal surface fit of training patterns
192
Boltzmann Machine With Learning
Add hidden units The 2-1-2 net below could be used for simple encoding/decoding (data compression) [diagram: inputs x1, x2, hidden unit z1, outputs y1, y2]
193
Simple Recurrent Net Learn sequential or time varying patterns
Doesn’t necessarily have steady state output input units context units hidden units output units
194
Architecture [diagram: inputs x1 … xn and context units c1 … cp feeding hidden units z1 … zp, which feed outputs y1 … ym]
195
Simple Recurrent Net f(ci(t)) = f(zi(t-1)) f(ci(0)) = 0.5
Can use backpropagation Can learn string of characters
196
Example: Finite State Automaton
[diagram: 4 input units xi, 4 output units yi, 2 hidden units zi, 2 context units ci; alphabet symbols A, B, BEGIN, END]
197
Backpropagation In Time
Rumelhart, Williams, Hinton (1986) Application: Simple shift register [diagram: inputs x1, x2, hidden unit z1, outputs y1, y2; two connections fixed at weight 1]
198
Backpropagation Training for Fully Recurrent Nets
Adapts backpropagation to arbitrary connection patterns.
199
III. Adaptive Architectures
Probabilistic Neural Net (Specht 1988) Cascade Correlation (Fahlman, Lebiere 1990)
200
Probabilistic Neural Net
Builds its own architecture as training progresses Chooses class A over class B if hAcAfA(x) > hBcBfB(x) cA is the cost of classifying an example as belonging to A when it belongs to B hA is the a priori probability of an example belonging to class A
201
Probabilistic Neural Net
fA(x) is the probability density function for class A, which is learned by the net zA1: pattern unit, fA: summation unit [diagram: inputs x1 … xn, pattern units zA1 … zAj and zB1 … zBk, summation units fA and fB, output unit y]
202
Cascade Correlation Builds own architecture while training progresses
Tries to overcome slow rate of convergence by other neural nets Dynamically adds hidden units (as few as possible) Trains one layer at a time
203
Cascade Correlation Stage 1 [diagram: inputs x0, x1, x2 connected directly to outputs y1, y2]
204
Cascade Correlation Stage 2 (fix weights into z1) [diagram: hidden unit z1 added, fed by x0, x1, x2 and feeding y1, y2]
205
Cascade Correlation Stage 3 (fix weights into z2) [diagram: hidden unit z2 added in cascade, fed by the inputs and z1]
206
Algorithm 1. Train stage 1. If the error is not acceptable, add hidden unit z1 and train stage 2. 2. If the error is still not acceptable, add z2 and train stage 3.
3. Etc.
207
IV. Neocognitron Fukushima, Miyake, Ito (1983)
Many layers, hierarchical Very sparse and localized connections Self organizing Supervised learning, layer by layer Recognizes handwritten 0, 1, 2, 3, … 9, regardless of position and style
208
Architecture
Layer     # of Arrays   Size
Input     1             19^2
S1 / C1   12 / 8        19^2 / 11^2
S2 / C2   38 / 22       11^2 / 7^2
S3 / C3   32 / 30       7^2 / 7^2
S4 / C4   16 / 10       3^2 / 1^2
209
Architecture S layers respond to patterns
C layers combine results, use a larger field of view For example, S11 responds to a particular small pattern [figure omitted in transcript]
210
Training Progresses layer by layer S1 connections to C1 are fixed
C1 connections to S2 are adaptable A V2 layer is introduced between C1 and S2; V2 is inhibitory C1 to V2 connections are fixed V2 to S2 connections are adaptable