1
Introduction to Neural Networks
John Paxton Montana State University Summer 2003
2
Textbook Fundamentals of Neural Networks:
Architectures, Algorithms, and Applications Laurene Fausett Prentice-Hall 1994
3
Chapter 1: Introduction
Why Neural Networks? Training techniques exist. High-speed digital computers are available. Specialized hardware can be built. They better capture the behavior of biological neural systems.
4
Who is interested? Electrical Engineers – signal processing, control theory Computer Engineers – robotics Computer Scientists – artificial intelligence, pattern recognition Mathematicians – modelling tool when explicit relationships are unknown
5
Characterizations Architecture – a pattern of connections between neurons Learning Algorithm – a method of determining the connection weights Activation Function – a function that maps a neuron's net input to its output signal
6
Problem Domains Storing and recalling patterns Classifying patterns
Mapping inputs onto outputs Grouping similar patterns Finding solutions to constrained optimization problems
7
A Simple Neural Network
[diagram: inputs x1, x2 with weights w1, w2 feeding a single output unit y] yin = x1w1 + x2w2 Activation is f(yin)
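The net input on this slide can be sketched directly in code; the step activation and threshold value below are illustrative choices, not part of the slide:

```python
# Minimal sketch of the two-input unit above: y_in = x1*w1 + x2*w2,
# followed by an activation f(y_in). The step threshold theta is assumed.
def neuron(x1, x2, w1, w2, theta=0.0):
    y_in = x1 * w1 + x2 * w2          # net input
    return 1 if y_in >= theta else 0  # activation f(y_in)

print(neuron(1, 1, 0.5, 0.5))  # 1 (y_in = 1.0 clears the threshold)
```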
8
Biological Neuron Dendrites receive electrical signals, affected by chemical processes Soma fires at differing frequencies [diagram: soma, dendrite, axon]
9
Observations A neuron can receive many inputs
Inputs may be modified by weights at the receiving dendrites A neuron sums its weighted inputs A neuron can transmit an output signal The output can go to many other neurons
10
Features Information processing is local
Memory is distributed (short term = signals, long term = dendrite weights) The dendrite weights learn through experience The weights may be inhibitory or excitatory
11
Features Neurons can generalize novel input stimuli
Neurons are fault tolerant and can sustain damage
12
Applications Signal processing, e.g. suppress noise on a phone line.
Control, e.g. backing up a truck with a trailer. Pattern recognition, e.g. handwritten characters or identifying the sex of a face. Diagnosis, e.g. arrhythmia classification or mapping symptoms to a medical case.
13
Applications Speech production, e.g. NETtalk (Sejnowski and Rosenberg, 1986). Speech recognition. Business, e.g. mortgage underwriting (Collins et al.). Unsupervised, e.g. TD-Gammon.
14
Single Layer Feedforward NN
[diagram: inputs x1 … xn fully connected to outputs y1 … ym with weights w11 … wnm]
15
Multilayer Neural Network
More powerful Harder to train [diagram: inputs x1 … xn, hidden units z1 … zp, outputs y1 … ym]
16
Setting the Weights Supervised Unsupervised Fixed-weight nets
17
Activation Functions Identity f(x) = x
Binary step f(x) = 1 if x >= θ f(x) = 0 otherwise Binary sigmoid f(x) = 1 / (1 + e^(-sx))
18
Activation Functions Bipolar sigmoid f(x) = -1 + 2 / (1 + e^(-sx))
Hyperbolic tangent f(x) = (e^x – e^(-x)) / (e^x + e^(-x))
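The activation functions from these two slides, written out; s is the steepness parameter and θ the step threshold (the default values are illustrative):

```python
import math

# The activation functions listed above.
def identity(x):
    return x

def binary_step(x, theta=0.0):
    return 1 if x >= theta else 0

def binary_sigmoid(x, s=1.0):
    return 1.0 / (1.0 + math.exp(-s * x))

def bipolar_sigmoid(x, s=1.0):
    return -1.0 + 2.0 / (1.0 + math.exp(-s * x))

def hyperbolic_tangent(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

print(binary_sigmoid(0))   # 0.5
print(bipolar_sigmoid(0))  # 0.0
```

Note that the bipolar sigmoid is just 2 * (binary sigmoid) − 1, and tanh equals the bipolar sigmoid with s = 2.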
19
History 1943 McCulloch-Pitts neurons 1949 Hebb’s law
1958 Perceptron (Rosenblatt) 1960 Adaline, better learning rule (Widrow, Hoff) 1969 Limitations (Minsky, Papert) 1972 Kohonen nets, associative memory
20
History 1977 Brain State in a Box (Anderson)
1982 Hopfield net, constraint satisfaction 1985 ART (Carpenter, Grossberg) 1986 Backpropagation (Rumelhart, Hinton, Williams) 1988 Neocognitron, character recognition (Fukushima)
21
McCulloch-Pitts Neuron
f(yin) = 1 if yin >= θ [diagram: inputs x1, x2, x3 feeding output unit y]
22
Exercises 2 input AND 2 input OR 3 input OR 2 input XOR
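One possible answer to the first exercise, sketched as code: a McCulloch-Pitts unit computes 2-input AND with both weights 1 and threshold θ = 2 (these particular values are my choice, not the slide's):

```python
# McCulloch-Pitts unit: fires (outputs 1) iff the weighted sum reaches theta.
def mcp_neuron(inputs, weights, theta):
    y_in = sum(x * w for x, w in zip(inputs, weights))
    return 1 if y_in >= theta else 0

# 2-input AND with weights (1, 1) and theta = 2
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, mcp_neuron((x1, x2), (1, 1), theta=2))
```

With the same weights, θ = 1 gives 2-input OR; XOR has no single-unit solution, which is why it takes more than one neuron.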
24
Chapter 2: Simple Neural Networks for Pattern Classification
ARCHITECTURE: [diagram: bias unit 1 with weight w0, inputs x1 … xn with weights w1 … wn, output y] w0 is the bias f(yin) = 1 if yin >= 0 f(yin) = 0 otherwise
25
Representations Binary: 0 no, 1 yes Bipolar: -1 no, 0 unknown, 1 yes
Bipolar is superior
26
Interpreting the Weights
w0 = -1, w1 = 1, w2 = 1 0 = -1 + x1 + x2, i.e. the decision boundary x2 = 1 – x1 [graph: YES region on one side of the line, NO on the other]
27
Modelling a Simple Problem
Should I attend this lecture? x1 = it’s hot x2 = it’s raining [diagram: bias weight 2.5, weight -2 on x1, weight 1 on x2]
28
Linear Separability [figure: 2-D plots of the AND, OR, and XOR truth tables; AND and OR are linearly separable, XOR is not]
29
Hebb’s Rule Increase the weight between two neurons that are both “on”. Increase the weight between two neurons that are both “off”. wi(new) = wi(old) + xi*y
30
Algorithm 1. set wi = 0 for 0 <= i <= n
2. for each training vector 3. set xi = si for all input units 4. set y = t 5. wi(new) = wi(old) + xi*y
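The algorithm above, run on the bipolar 2-input AND example of the next slides (x0 = 1 is the bias input):

```python
# One pass of Hebb-rule training: wi(new) = wi(old) + xi*y for each pair.
def hebb_train(samples):
    w = [0] * len(samples[0][0])
    for x, t in samples:  # presented in the given order
        w = [wi + xi * t for wi, xi in zip(w, x)]
    return w

# bipolar AND: (bias x0, x1, x2) -> target
and_data = [((1, 1, 1), 1), ((1, 1, -1), -1),
            ((1, -1, 1), -1), ((1, -1, -1), -1)]
print(hebb_train(and_data))  # [-2, 2, 2]
```

The resulting weights give the decision boundary -2 + 2x1 + 2x2 = 0 interpreted a few slides later.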
31
Example: 2 input AND [table: columns s0, s1, s2, t listing the four bipolar training vectors with targets 1 and -1]
32
Training Procedure [table: columns w0 w1 w2 x0 x1 x2 y tracing the weights over the four examples; final weights -2, 2, 2]
33
Result Interpretation
-2 + 2x1 + 2x2 = 0, i.e. x2 = -x1 + 1 This training procedure is order dependent and not guaranteed to converge.
34
Pattern Recognition Exercise
[figure: two 3x3 pixel patterns, “X” and “O”]
35
Pattern Recognition Exercise
Architecture? Weights? Are the original patterns classified correctly? Are the original patterns with 1 piece of wrong data classified correctly? Are the original patterns with 1 piece of missing data classified correctly?
36
Perceptrons (1958) Very important early neural network
Guaranteed training procedure under certain circumstances [diagram: bias unit 1 with weight w0, inputs x1 … xn with weights w1 … wn, output y]
37
Activation Function f(yin) = 1 if yin > θ f(yin) = 0 if -θ <= yin <= θ f(yin) = -1 otherwise [graph: a step function taking values 1, 0, -1]
38
Learning Rule wi(new) = wi(old) + a*t*xi if error
a is the learning rate Typically, 0 < a <= 1
39
Algorithm 1. set wi = 0 for 0 <= i <= n (can be random)
2. for each training exemplar do 3. xi = si 4. yin = Σ xi*wi 5. y = f(yin) 6. wi(new) = wi(old) + a*t*xi if error 7. if stopping condition not reached, go to 2
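The algorithm can be sketched as follows, trained on the bipolar AND example of the next slide (θ = 0, a = 1); it reaches the weights (-1, 1, 1) here, though other runs of perceptron learning can end at different valid weight vectors:

```python
# Perceptron training: repeat epochs until one full pass makes no errors.
def perceptron_train(samples, alpha=1, theta=0):
    w = [0] * len(samples[0][0])
    changed = True
    while changed:
        changed = False
        for x, t in samples:
            y_in = sum(xi * wi for xi, wi in zip(x, w))
            y = 1 if y_in > theta else (-1 if y_in < -theta else 0)
            if y != t:  # update only when the response is wrong
                w = [wi + alpha * t * xi for wi, xi in zip(w, x)]
                changed = True
    return w

# bipolar AND: (bias x0, x1, x2) -> target
and_data = [((1, 1, 1), 1), ((1, 1, -1), -1),
            ((1, -1, 1), -1), ((1, -1, -1), -1)]
print(perceptron_train(and_data))  # [-1, 1, 1]
```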
40
Example: AND concept bipolar inputs bipolar target θ = 0 a = 1
41
Epoch 1 [table: columns w0 w1 w2 x0 x1 x2 y t tracing the weight values through the first epoch]
42
Exercise Continue the above example until the learning algorithm is finished.
43
Perceptron Learning Rule Convergence Theorem
If a weight vector exists that correctly classifies all of the training examples, then the perceptron learning rule will converge to some weight vector that gives the correct response for all training patterns. This will happen in a finite number of steps.
44
Exercise Show perceptron weights for the 2-of-3 concept [table: columns x1 x2 x3 y; y = 1 when at least two inputs are 1, -1 otherwise]
45
Adaline (Widrow, Hoff 1960) Adaptive Linear Neuron
Learning rule minimizes the mean squared error Learns on all examples, not just ones with errors
46
Architecture [diagram: bias unit 1 with weight w0, inputs x1 … xn with weights w1 … wn, output y]
47
Training Algorithm 1. set wi (small random values typical)
2. set a (0.1 typical) 3. for each training exemplar do 4. xi = si 5. yin = Σ xi*wi 6. wi(new) = wi(old) + a*(t – yin)*xi 7. go to 3 if largest weight change big enough
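A sketch of the algorithm on the bipolar AND data; the learning rate and epoch count are illustrative, and the weights should approach (-0.5, 0.5, 0.5), the least-squares solution noted a few slides down:

```python
# Adaline (delta-rule) training: updates on every example, not just errors.
def adaline_train(samples, alpha=0.05, epochs=100):
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, t in samples:
            y_in = sum(xi * wi for xi, wi in zip(x, w))
            # wi(new) = wi(old) + a*(t - yin)*xi
            w = [wi + alpha * (t - y_in) * xi for wi, xi in zip(w, x)]
    return w

and_data = [((1, 1, 1), 1), ((1, 1, -1), -1),
            ((1, -1, 1), -1), ((1, -1, -1), -1)]
print(adaline_train(and_data))  # close to [-0.5, 0.5, 0.5]
```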
48
Activation Function f(yin) = 1 if yin >= 0 f(yin) = -1 otherwise
49
Delta Rule squared error E = (t – yin)^2
minimize by gradient descent: ∂E/∂wi = -2(t – yin)xi, giving the weight update Δwi = a(t – yin)xi
50
Example: AND concept bipolar inputs bipolar targets
w0 = -0.5, w1 = 0.5, w2 = 0.5 minimizes E [table: columns x0 x1 x2 yin t E listing the four training cases]
51
Exercise Demonstrate that you understand the Adaline training procedure.
52
Madaline Many adaptive linear neurons [diagram: bias units, inputs x1 … xm, hidden Adalines z1 … zk, output y]
53
Madaline MRI (1960) – only learns weights from input layer to hidden layer MRII (1987) – learns all weights
55
Chapter 3: Pattern Association
Aristotle observed that human memory associates similar items contrary items items close in proximity items close in succession (a song)
56
Terminology and Issues
Autoassociative Networks Heteroassociative Networks Feedforward Networks Recurrent Networks How many patterns can be stored?
57
Hebb Rule for Pattern Association
Architecture [diagram: inputs x1 … xn fully connected to outputs y1 … ym with weights w11 … wnm]
58
Algorithm 1. set wij = 0 1 <= i <= n, 1 <= j <= m
2. for each training pair s:t 3. xi = si 4. yj = tj 5. wij(new) = wij(old) + xiyj
59
Example s1 = (1 -1 -1), s2 = (-1 1 1) t1 = (1 -1), t2 = (-1 1)
w11 = 1*1 + (-1)(-1) = 2 w12 = 1*(-1) + (-1)1 = -2 w21 = (-1)1+ 1(-1) = -2 w22 = (-1)(-1) + 1(1) = 2 w31 = (-1)1 + 1(-1) = -2 w32 = (-1)(-1) + 1*1 = 2
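These weight computations are the outer-product form of the Hebb rule, wij = Σ si*tj over the training pairs; a quick check in code:

```python
# Outer-product (Hebb) weights: wij = sum over training pairs of si * tj.
def hebb_assoc(pairs):
    n, m = len(pairs[0][0]), len(pairs[0][1])
    w = [[0] * m for _ in range(n)]
    for s, t in pairs:
        for i in range(n):
            for j in range(m):
                w[i][j] += s[i] * t[j]
    return w

pairs = [((1, -1, -1), (1, -1)), ((-1, 1, 1), (-1, 1))]
print(hebb_assoc(pairs))  # [[2, -2], [-2, 2], [-2, 2]]
```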
60
Matrix Alternative s1 = (1 -1 -1), s2 = (-1 1 1)
t1 = (1 -1), t2 = (-1 1) W = s1^T t1 + s2^T t2 = [ 2 -2 ; -2 2 ; -2 2 ]
61
Final Network f(yin) = 1 if yin > 0, 0 if yin = 0, else -1
[diagram: x1 connects to y1 with weight 2 and to y2 with weight -2; x2 and x3 each connect to y1 with weight -2 and to y2 with weight 2]
62
Properties Weights exist if input vectors are linearly independent
Orthogonal vectors can be learned perfectly High weights imply strong correlations
63
Exercises What happens if ( ) is tested? This vector has one mistake. What happens if ( ) is tested? This vector has one piece of missing data. Show an example of training data that is not learnable. Show the learned network.
64
Delta Rule for Pattern Association
Works when patterns are linearly independent but not orthogonal Introduced in the 1960s for ADALINE Produces a least squares solution
65
Activation Functions Delta Rule (derivative factor 1) wij(new) = wij(old) + a(tj – yj)*xi*1 Extended Delta Rule (derivative factor f’(yin.j)) wij(new) = wij(old) + a(tj – yj)*xi*f’(yin.j)
66
Heteroassociative Memory Net
Application: Associate characters. A <-> a B <-> b
67
Autoassociative Net Architecture [diagram: x1 … xn fully connected to y1 … yn with weights w11 … wnn]
68
Training Algorithm Assuming that the training vectors are orthogonal, we can use the Hebb rule algorithm mentioned earlier. Application: Find out whether an input vector is familiar or unfamiliar. For example, voice input as part of a security system.
69
Autoassociative Example [weight matrix computation omitted in the transcript]
70
Evaluation What happens if (1 1 1) is presented?
Why are the diagonals set to 0?
71
Storage Capacity 2 vectors (1 1 1), (-1 -1 -1) Recall is perfect
W = [ 0 2 2 ; 2 0 2 ; 2 2 0 ]
72
Storage Capacity 3 vectors: (1 1 1), (-1 -1 -1), (1 -1 1)
Recall is no longer perfect W = [ 0 1 3 ; 1 0 1 ; 3 1 0 ]
73
Theorem Up to n-1 bipolar vectors of n dimensions can be stored in an autoassociative net.
74
Iterative Autoassociative Net
1 vector: s = (1 1 -1), W = s^T s with zero diagonal (1 0 0) -> (0 1 -1) (0 1 -1) -> (2 1 -1) -> (1 1 -1) (1 1 -1) -> (2 2 -2) -> (1 1 -1)
75
Testing Procedure 1. initialize weights using Hebb learning
2. for each test vector do 3. set xi = si 4. calculate ti 5. set si = ti go to step 4 if the s vector is new
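This testing procedure, sketched for the (1 1 -1) example above: Hebb weights with a zero diagonal, iterated until the activations stop changing:

```python
# Iterative autoassociative recall.
def sign(v):
    return [1 if x > 0 else (-1 if x < 0 else 0) for x in v]

def recall(w, x, max_iters=10):
    for _ in range(max_iters):
        net = [sum(x[i] * w[i][j] for i in range(len(x)))
               for j in range(len(x))]
        new_x = sign(net)
        if new_x == x:   # converged to a stable pattern
            return new_x
        x = new_x
    return x

s = (1, 1, -1)
w = [[0 if i == j else s[i] * s[j] for j in range(3)] for i in range(3)]
print(recall(w, [1, 0, 0]))  # [1, 1, -1]
```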
76
Exercises 1 piece of missing data: (0 1 -1)
2 pieces of missing data: (0 0 -1) 3 pieces of missing data: (0 0 0) 1 mistake: ( ) 2 mistakes: ( )
77
Discrete Hopfield Net content addressable problems
pattern association problems constrained optimization problems wij = wji wii = 0
78
Characteristics Only 1 unit updates its activation at a time
Each unit continues to receive the external signal An energy (Lyapunov) function can be found that allows the net to converge, unlike the previous system Autoassociative
79
Architecture [diagram: units y1, y2, y3 fully interconnected, each with an external input xi]
80
Algorithm 1. initialize weights using Hebb rule
2. for each input vector do 3. yi = xi 4. do steps 5-6 randomly for each yi 5. yin.i = xi + Syjwji 6. calculate f(yin.i) 7. go to step 2 if the net hasn’t converged
81
Example training vector: (1 -1) [diagram: units y1, y2 joined by weight -1, with external inputs x1, x2]
82
Example input (0 -1) update y1 = 0 + (-1)(-1) = 1 update y2 = -1 + (1)(-1) = -2 -> -1 input (1 -1) update y2 = -1 + (1)(-1) = -2 -> -1 update y1 = 1 + (-1)(-1) = 2 -> 1
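A sketch of this example: Hebb learning on (1 -1) gives w12 = w21 = -1, and one asynchronous pass (external input included, per step 5 of the algorithm) repairs the noisy input (0 -1):

```python
# One asynchronous pass over a discrete Hopfield net.
def hopfield_pass(w, x, order):
    y = list(x)
    for i in order:  # update one unit at a time
        y_in = x[i] + sum(y[j] * w[j][i] for j in range(len(y)) if j != i)
        y[i] = 1 if y_in > 0 else -1  # ties don't occur in this example
    return y

w = [[0, -1], [-1, 0]]
print(hopfield_pass(w, [0, -1], order=[0, 1]))  # [1, -1]
```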
83
Hopfield Theorems Convergence is guaranteed.
The number of storable patterns is approximately n / (2 * log n) where n is the dimension of a vector
84
Bidirectional Associative Memory (BAM)
Heteroassociative Recurrent Net Kosko, 1988 Architecture [diagram: x1 … xn bidirectionally connected to y1 … ym]
85
Activation Function f(yin) = 1, if yin > 0 f(yin) = 0, if yin = 0
f(yin) = -1 otherwise
86
Algorithm 1. initialize weights using Hebb rule
2. for each test vector do 3. present s to x layer 4. present t to y layer 5. while equilibrium is not reached 6. compute f(yin.j) 7. compute f(xin.j)
87
Example s1 = (1 1), t1 = (1 -1) s2 = (-1 -1), t2 = (-1 1)
88
Example Architecture [diagram: x1 and x2 each connect to y1 with weight 2 and to y2 with weight -2]
present (1 1) to x -> y = (1 -1) present (1 -1) to y -> x = (1 1)
89
Hamming Distance Definition: Number of different corresponding bits in two vectors For example, H[(1 -1), (1 1)] = 1 Average Hamming Distance is ½.
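The definition as a one-liner, checked against the slide's example:

```python
# Hamming distance: the number of positions where two vectors differ.
def hamming(v1, v2):
    return sum(a != b for a, b in zip(v1, v2))

print(hamming((1, -1), (1, 1)))  # 1
```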
90
About BAMs Observation: Encoding is better when the average Hamming distance of the inputs is similar to the average Hamming distance of the outputs. The memory capacity of a BAM is min(n-1, m-1).
92
Chapter 4: Competition Force a decision (yes, no, maybe) to be made.
Winner take all is a common approach. Kohonen learning wj(new) = wj(old) + a (x – wj(old)) where wj is the closest weight vector to x, determined by Euclidean distance.
93
MaxNet Lippmann, 1987 Fixed-weight competitive net.
Activation function f(x) = x if x > 0, else 0. Architecture [diagram: units a1, a2 with self-weights 1 and mutual inhibitory weights -ε]
94
Algorithm 1. wij = 1 if i = j, otherwise –ε 2. aj(0) = sj, t = 0.
3. aj(t+1) = f[aj(t) – ε*Σk≠j ak(t)] 4. go to step 3 if more than one node has a non-zero activation Special Case: More than one node has the same maximum activation.
95
Example s1 = .5, s2 = .1, ε = .1 a1(0) = .5, a2(0) = .1
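Tracing this example in code: the larger unit's activation decays only slightly while the other is driven to zero, so unit 1 wins:

```python
# MaxNet competition: each unit suppresses the others by eps times their
# activations until at most one stays positive.
def maxnet(a, eps=0.1):
    a = list(a)
    while sum(x > 0 for x in a) > 1:
        total = sum(a)
        a = [max(0.0, x - eps * (total - x)) for x in a]
    return a

a = maxnet([0.5, 0.1])   # a1(0) = .5, a2(0) = .1
print(a.index(max(a)))   # 0  (the initially largest unit wins)
```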
96
Mexican Hat Kohonen, 1989 Contrast enhancement
Architecture (w0, w1, w2, w3) w0 (xi -> xi), w1 (xi+1 -> xi and xi-1 -> xi) [diagram: units xi-3 … xi+3 arranged in a line]
97
Algorithm 1. initialize weights 2. xi(0) = si
3. for some number of steps do 4. xi(t+1) = f[ Σ wk xi+k(t) ] 5. xi(t+1) = max(0, xi(t+1))
98
Example x1, x2, x3, x4, x5 radius 0 weight = 1 radius 1 weight = 1
all other radii weights = 0 s = ( ) f(x) = 0 if x < 0, x if 0 <= x <= 2, 2 otherwise
99
Example x(0) = (0 .5 1 .5 1) x1(1) = 1(0) + 1(.5) -.5(1) = 0
100
Why the name? Plot x(0) vs. x(1) [graph: activations of x1 … x5 on a 0–2 scale, tracing a Mexican-hat shape]
101
Hamming Net Lippmann, 1987 Maximum likelihood classifier
The similarity of 2 vectors is taken to be n – H(v1, v2) where H is the Hamming distance Uses MaxNet with similarity metric
102
Architecture Concrete example: [diagram: inputs x1, x2, x3 feeding units y1, y2, whose outputs go into a MaxNet]
103
Algorithm 1. wij = si(j)/2 2. n is the dimensionality of a vector
3. yin.j = Σ xi*wij + (n/2) 4. select max(yin.j) using MaxNet
104
Example Training examples: (1 1 1), (-1 -1 -1) n = 3
Present (1 1 1): yin.1 = 1(.5) + 1(.5) + 1(.5) + 3/2 = 3 yin.2 = 1(-.5) + 1(-.5) + 1(-.5) + 3/2 = 0 These last 2 quantities represent the similarity n – H to the two stored examples They are then fed into MaxNet.
105
Kohonen Self-Organizing Maps
Maps inputs onto one of m clusters Human brains seem to be able to self organize.
106
Architecture [diagram: inputs x1 … xn fully connected to cluster units y1 … ym]
107
Neighborhoods Linear: 3 2 1 # 1 2 3 (units at distances 1, 2, 3 on either side of the winner #)
Rectangular: the winner # sits at the center of nested squares of units
108
Algorithm 1. initialize wij 2. select topology of yi
3. select learning rate parameters 4. while stopping criteria not reached 5. for each input vector do 6. compute D(j) = Σ(wij – xi)^2 for each j
109
Algorithm (continued) 7. select minimum D(j)
8. update neighborhood units wij(new) = wij(old) + a[xi – wij(old)] 9. update a 10. reduce radius of neighborhood at specified times
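Steps 6-8 with a winner-take-all neighborhood (radius 0) can be sketched as follows; the input and initial weights here are made up for illustration, not the example on the next slides:

```python
# One clustering step: find the closest weight vector (squared Euclidean
# distance) and move it a fraction alpha toward the input.
def som_step(weights, x, alpha):
    d = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    j = d.index(min(d))                    # winning cluster unit
    weights[j] = [wi + alpha * (xi - wi)   # wij(new) = wij(old) + a[xi - wij]
                  for wi, xi in zip(weights[j], x)]
    return j

weights = [[0.2, 0.6], [0.9, 0.1]]         # hypothetical initial weights
j = som_step(weights, [1.0, 0.0], alpha=0.6)
print(j, [round(wi, 2) for wi in weights[j]])  # 1 [0.96, 0.04]
```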
110
Example Place ( ), ( ), ( ), ( ) into two clusters a(0) = .6 a(t+1) = .5 * a(t) random initial weights
111
Example Present ( ) D(1) = (.2 – 1)2 + (.6 – 1)2 + (.5 – 0)2 + (.9 – 0)2 = 1.86 D(2) = .98 D(2) wins!
112
Example wi2(new) = wi2(old) + .6[xi – wi2(old)] (bigger) (bigger) (smaller) (smaller) This example assumes no neighborhood
113
Example After many epochs ( ) -> category ( ) -> category ( ) -> category ( ) -> category 1
114
Applications Grouping characters Travelling Salesperson Problem
Cluster units can be represented graphically by weight vectors Linear neighborhoods can be used with the first and last cluster units connected
115
Learning Vector Quantization
Kohonen, 1989 Supervised learning There can be several output units per class
116
Architecture Like Kohonen nets, but no topology for output units
Each yi represents a known class [diagram: inputs x1 … xn fully connected to outputs y1 … ym]
117
Algorithm 1. Initialize the weights
(first m training examples, random) 2. choose a 3. while stopping criteria not reached do (number of iterations, a is very small) 4. for each training vector do
118
Algorithm 5. find minimum || x – wj || 6. if minimum is target class
wj(new) = wj(old) + a[x – wj(old)] else wj(new) = wj(old) – a[x – wj(old)] 7. reduce a
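One LVQ update as code: the winner moves toward the input when its class matches the target and away otherwise. The weights, classes, and input below are illustrative, not the slides' example:

```python
# One LVQ step (steps 5-6 above).
def lvq_step(weights, classes, x, target, alpha):
    d = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    j = d.index(min(d))                      # closest output unit
    sign = 1 if classes[j] == target else -1
    weights[j] = [wi + sign * alpha * (xi - wi)
                  for wi, xi in zip(weights[j], x)]
    return j

weights = [[1.0, 1.0], [-1.0, -1.0]]   # y1 represents class 1, y2 class 2
classes = [1, 2]
j = lvq_step(weights, classes, [0.8, 0.6], target=1, alpha=0.1)
print(j, [round(wi, 2) for wi in weights[j]])  # 0 [0.98, 0.96]
```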
119
Example (1 1 -1 -1) belongs to category 1
2 output units, y1 represents category 1 and y2 represents category 2
120
Example Initial weights (where did these come from?) a = .1
121
Example Present training example 3, ( ). It belongs to category 2. D(1) = (1 + 1)^2 + (1 + 1)^2 + (-1 - 1)^2 + (-1 - 1)^2 = 16 D(2) = 4 Category 2 wins. That is correct!
122
Example w2(new) = ( ) [( ) - ( )] = ( )
123
Issues How many yi should be used?
How should we choose the class that each yi should represent? LVQ2, LVQ3 are enhancements to LVQ that modify the runner-up sometimes
124
Counterpropagation Hecht-Nielsen, 1987
There are input, output, and clustering layers Can be used to compress data Can be used to approximate functions Can be used to associate patterns
125
Stages Stage 1: Cluster input vectors
Stage 2: Adapt weights from cluster units to output units
126
Stage 1 Architecture [diagram: inputs x1 … xn (weights w) and outputs y1 … ym (weights v) both connected to cluster units z1 … zp]
127
Stage 2 Architecture [diagram: cluster unit zj connected to x*1 … x*n by weights tj and to y*1 … y*m by weights vj]
128
Full Counterpropagation
Stage 1 Algorithm 1. initialize weights, a, b 2. while stopping criteria is false do 3. for each training vector pair do 4. minimize ||x – wj|| + ||y – vj|| wj(new) = wj(old) + a[x – wj(old)] vj(new) = vj(old) + b[y-vj(old)] 5. reduce a, b
129
Stage 2 Algorithm 1. while stopping criteria is false
2. for each training vector pair do 3. perform step 4 above 4. tj(new) = tj(old) + a[x – tj(old)] vj(new) = vj(old) + b[y – vj(old)]
130
Partial Example Approximate y = 1/x [0.1, 10.0] 1 x unit 1 y unit
10 z units 1 x* unit 1 y* unit
131
Partial Example v11 = .11, w11 = 9.0 v12 = .14, w12 = 7.0 …
test .12, predict 9.0. In this example, the output weights will converge to the cluster weights.
132
Forward Only Counterpropagation
Sometimes the function y = f(x) is not invertible. Architecture (only 1 z unit active) [diagram: inputs x1 … xn, cluster units z1 … zp, outputs y1 … ym]
133
Stage 1 Algorithm 1. initialize weights, a (.1), b (.6)
2. while stopping criteria is false do 3. for each input vector do 4. find minimum || x – w|| w(new) = w(old) + a[x – w(old)] 5. reduce a
134
Stage 2 Algorithm 1. while stopping criteria is false do
2. for each training vector pair do 3. find minimum || x – w || w(new) = w(old) + a[x – w(old)] v(new) = v(old) + b[y – v(old)] 4. reduce b Note: interpolation is possible.
135
Example y = f(x) over [0.1, 10.0] 10 zi units
After phase 1, zi = 0.5, 1.5, …, 9.5. After phase 2, zi = 5.5, 0.75, …, 0.1
137
Chapter 5: Adaptive Resonance Theory
1987, Carpenter and Grossberg ART1: clusters binary vectors ART2: clusters continuous vectors
138
General Weights on a cluster unit can be considered to be a prototype pattern Relative similarity is used instead of an absolute difference. Thus, a difference of 1 in a vector with only a few non-zero components becomes more significant.
139
General Training examples may be presented several times.
Training examples may be presented in any order. An example might change clusters. Nets are stable (patterns don’t oscillate). Nets are plastic (examples can be added).
140
Architecture Input layer (xi)
Output layer or cluster layer – competitive (yi) Units in the output layer can be active, inactive, or inhibited.
141
Sample Network t (top down weights), b (bottom up weights) [diagram: inputs x1 … xn connected to cluster units y1 … ym by bottom-up weights bij and top-down weights tji]
142
Nomenclature bij: bottom up weight tij: top down weight
s: input vector x: activation vector n: number of components in input vector m: maximum number of clusters || x ||: Σ xi p: vigilance parameter
143
Training Algorithm 1. L > 1, 0 < p <= 1, tji(0) = 1, 0 < bij(0) < L / (L – 1 + n) 2. while stopping criterion is false do steps 3 – 12 3. for each training example do steps 4 – 12
144
Training Algorithm 4. yi = 0 5. compute || s || 6. xi = si
7. if yj (do for each j) is not inhibited then yj = Σ bij xi 8. find largest yj that is not inhibited 9. xi = si * tji
145
Training Algorithm 10. compute || x ||
11. if || x || / || s || < p then yj = -1, go to step 8 12. bij = L xi / ( L – 1 + || x || ) tji = xi
146
Possible Stopping Criterion
No weight changes. Maximum number of epochs reached.
147
What Happens If All Units Are Inhibited?
Lower p. Add a cluster unit. Throw out the current input as an outlier.
148
Example n = 4 m = 3 p = 0.4 (low vigilance) L = 2
bij(0) = 1/(1 + n) = 0.2 tji(0) = 1 [diagram: inputs x1 … x4 connected to cluster units y1 … y3]
149
Example 3. input vector (1 1 0 0) 4. yi = 0 5. || s || = 2
7. y1 = y2 = y3 = 0.2 + 0.2 = 0.4
150
Example 8. j = 1 (use lowest index to break ties)
9. x1 = s1 * t11 = 1 * 1 = 1 x2 = s2 * t12 = 1 * 1 = 1 x3 = s3 * t13 = 0 * 1 = 0 x4 = s4 * t14 = 0 * 1 = 0 10. || x || = 2 11. || x || / || s || = 1 >= 0.4
151
Example 12. b11 = b21 = L * x1 / (L – 1 + || x ||) = 2 * 1 / (1 + 2) = .667
b31 = b41 = 0 t11 = t12 = 1 t13 = t14 = 0
152
Exercise Show the network after the training example ( ) is processed.
153
Observations Typically, stable weight matrices are obtained quickly.
The cluster units are all topologically independent of one another. We have just looked at the fast learning version of ART1. There is also a slow learning version that updates just one weight per training example.
155
Chapter 6: Backpropagation
1986 Rumelhart, Hinton, Williams Gradient descent method that minimizes the total squared error of the output. Applicable to multilayer, feedforward, supervised neural networks. Revitalizes interest in neural networks!
156
Backpropagation Appropriate for any domain where inputs must be mapped onto outputs. 1 hidden layer is sufficient to learn any continuous mapping to any arbitrary accuracy! Memorization versus generalization tradeoff.
157
Architecture input layer, hidden layer, output layer [diagram: bias units (1), inputs x1 … xn, hidden units z1 … zp, outputs y1 … ym, with weights v (input to hidden) and w (hidden to output)]
158
General Process Feedforward the input signals.
Backpropagate the error. Adjust the weights.
159
Activation Function Characteristics
Continuous. Differentiable. Monotonically nondecreasing. Easy to compute. Saturates (reaches limits).
160
Activation Functions Binary Sigmoid f(x) = 1 / [1 + e^(-x)] f’(x) = f(x)[1 – f(x)] Bipolar Sigmoid f(x) = 2 / [1 + e^(-x)] – 1 f’(x) = 0.5 * [1 + f(x)] * [1 – f(x)]
161
Training Algorithm 1. initialize weights to small random values, for example [ ] 2. while stopping condition is false do steps 3 – 8 3. for each training pair do steps 4-8
162
Training Algorithm 4. zin.j = Σ (xi * vij) zj = f(zin.j)
5. yin.j = Σ (zi * wij) yj = f(yin.j) 6. error(yj) = (tj – yj) * f’(yin.j) tj is the target value 7. error(zk) = [ Σ error(yj) * wkj ] * f’(zin.k)
163
Training Algorithm 8. wkj(new) = wkj(old) + a*error(yj)*zk
vkj(new) = vkj(old) + a*error(zj)*xk a is the learning rate An epoch is one cycle through the training vectors.
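Steps 4-8 on a tiny 2-input, 2-hidden-unit, 1-output net with the bipolar sigmoid; the initial weights, input, and learning rate below are illustrative values I chose, not the textbook's:

```python
import math

def f(x):        # bipolar sigmoid
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def f_prime(fx): # derivative written in terms of the output f(x)
    return 0.5 * (1.0 + fx) * (1.0 - fx)

def backprop_step(v, w, x, t, alpha):
    # step 4: forward pass through the hidden layer (x[0] = 1 is the bias)
    z = [f(sum(x[i] * v[i][j] for i in range(len(x))))
         for j in range(len(v[0]))]
    zb = [1.0] + z                  # bias unit for the hidden layer
    # step 5: output
    y = f(sum(zb[i] * w[i] for i in range(len(w))))
    # step 6: output error
    err_y = (t - y) * f_prime(y)
    # step 7: hidden errors (w[0] is the bias weight, so skip it)
    err_z = [err_y * w[j + 1] * f_prime(z[j]) for j in range(len(z))]
    # step 8: weight updates
    w = [w[i] + alpha * err_y * zb[i] for i in range(len(w))]
    v = [[v[i][j] + alpha * err_z[j] * x[i] for j in range(len(v[0]))]
         for i in range(len(v))]
    return v, w, y

x = [1.0, 1.0, -1.0]                        # bias, x1, x2 (XOR target: 1)
v = [[0.1, -0.2], [0.3, 0.1], [-0.1, 0.2]]  # input->hidden weights (3x2)
w = [0.2, -0.1, 0.3]                        # hidden->output weights
v, w, y = backprop_step(v, w, x, t=1.0, alpha=0.5)
print(round(y, 3))  # the output before this update; repeating the step shrinks the error
```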
164
Choices Initial Weights
random [ ], don’t want the derivative to be 0 Nguyen-Widrow b = 0.7 * p^(1/n) n = number of input units p = number of hidden units vij = b * vij(random) / || vj(random) ||
165
Choices Stopping Condition (avoid overtraining!)
Set aside some of the training pairs as a validation set. Stop training when the error on the validation set stops decreasing.
166
Choices Number of Training Pairs
total number of weights / desired average error on the test set, where the average error on the training pairs is half of the above desired average
167
Choices Data Representation
Bipolar is better than binary because 0 units don’t learn. Discrete values: red, green, blue? Continuous values: [ ]? Number of Hidden Layers 1 is sufficient Sometimes, multiple layers might speed up the learning
168
Example XOR. Bipolar data representation.
Bipolar sigmoid activation function. a = 1 3 input units, 5 hidden units, 1 output unit (counts include the bias units) Initial weights are all 0. Training example (1 -1). Target: 1.
169
Example 4. z1 = f(1*0 + 1*0 + -1*0) = f(0) = 0 z2 = z3 = z4 = 0
5. y1 = f(1*0 + 0*0 + 0*0 + 0*0 + 0*0) = f(0) = 0 6. error(y1) = (1 – 0) * [0.5 * (1 + 0) * (1 – 0)] = 0.5 7. error(z1) = 0 * f’(zin.1) = 0 = error(z2) = error(z3) = error(z4)
170
Example 8. w01(new) = w01(old) + a*error(y1)*z0
= 0 + 1 * 0.5 * 1 = 0.5 v21(new) = v21(old) + a*error(z1)*x2 = 0 + 1 * 0 * -1 = 0.
171
Exercise Draw the updated neural network.
Present the example 1 -1 as an example to classify. How is it classified now? If learning were to occur, how would the network weights change this time?
172
XOR Experiments Binary Activation/Binary Representation: 3000 epochs.
Bipolar Activation/Bipolar Representation: 400 epochs. Bipolar Activation/Modified Bipolar Representation [ ]: 265 epochs. Above experiment with Nguyen-Widrow weight initialization: 125 epochs.
173
Variations Momentum Δwjk(t+1) = a * error(yj) * zk + m * Δwjk(t) Δvij(t+1) = similar m is [ ] The previous experiment takes 38 epochs.
174
Variations Batch update the weights to smooth the changes.
Adapt the learning rate. For example, in the delta-bar-delta procedure each weight has its own learning rate that varies over time. 2 consecutive weight increases or decreases will increase the learning rate.
175
Variations Alternate Activation Functions
Strictly Local Backpropagation makes the algorithm more biologically plausible by making all computations local cortical units sum their inputs synaptic units apply an activation function thalamic units compute errors equivalent to standard backpropagation
176
Variations Strictly Local Backpropagation input cortical layer -> input synaptic layer -> hidden cortical layer -> hidden synaptic layer -> output cortical layer-> output synaptic layer -> output thalamic layer Number of Hidden Layers
177
Hecht-Nielsen Theorem
Given any continuous function f: I^n -> R^m where I is [0, 1], f can be represented exactly by a feedforward network having n input units, 2n + 1 hidden units, and m output units.
179
Chapter 7: A Sampler Of Other Neural Nets
Optimization Problems Common Extensions Adaptive Architectures Neocognitron
180
I. Optimization Problems
Travelling Salesperson Problem. Map coloring. Job shop scheduling. RNA secondary structure.
181
Advantages of Neural Nets
Can find near optimal solutions. Can handle weak (desirable, but not required) constraints.
182
TSP Topology Each row has 1 unit that is on
Each column has 1 unit that is on [grid: rows = City A, City B, City C; columns = 1st, 2nd, 3rd position in the tour]
183
Boltzmann Machine Hinton, Sejnowski (1983)
Can be modelled using Markov chains Uses simulated annealing Each row is fully interconnected Each column is fully interconnected
184
Architecture ui,j connected to uk,j+1 with weight –di,k ui,1 connected to uk,n with weight –di,k [diagram: units u11 … unn, each with self-connection b and mutual inhibition –p]
185
Algorithm 1. Initialize weights b, p with p > b and p > greatest distance between cities Initialize temperature T Initialize activations of units to random binary values
186
Algorithm 2. while stopping condition is false, do steps 3 – 8
3. do steps 4 – 7 n2 times (1 epoch) 4. choose i and j randomly 1 <= i, j <= n uij is candidate to change state
187
Algorithm 5. Compute c = [1 – 2uij]b + Σ Σ ukm (-p)
where k <> i, m <> j 6. Compute probability to accept change a = 1 / (1 + e^(-c/T)) 7. Accept change if random number [0..1] < a. If change, uij = 1 – uij 8. Adjust temperature T = .95T
188
Stopping Condition No state change for a specified number of epochs.
Temperature reaches a certain value.
189
Example T(0) = 20 half of the units are on initially b = 60 p = 70
10 cities, all distances less than 1 200 or fewer epochs to find stable configuration in 100 random trials
190
Other Optimization Architectures
Continuous Hopfield Net Gaussian Machine Cauchy Machine Adds noise to input in attempt to escape from local minima Faster annealing schedule can be used as a consequence
191
II. Extensions Modified Hebbian Learning
Find parameters for optimal surface fit of training patterns
192
Boltzmann Machine With Learning
Add hidden units The 2-1-2 net below could be used for simple encoding/decoding (data compression) [diagram: inputs x1, x2, hidden unit z1, outputs y1, y2]
193
Simple Recurrent Net Learn sequential or time varying patterns
Doesn’t necessarily have steady state output input units context units hidden units output units
194
Architecture [diagram: inputs x1 … xn and context units c1 … cp feeding hidden units z1 … zp, which feed outputs y1 … ym]
195
Simple Recurrent Net f(ci(t)) = f(zi(t-1)) f(ci(0)) = 0.5
Can use backpropagation Can learn string of characters
196
Example: Finite State Automaton
[diagram: 4 input units xi, 4 output units yi, 2 hidden units zi, 2 context units ci; alphabet symbols A, B, BEGIN, END]
197
Backpropagation In Time
Rumelhart, Williams, Hinton (1986) Application: Simple shift register [diagram: inputs x1, x2, hidden unit z1, outputs y1, y2; two connections fixed at weight 1]
198
Backpropagation Training for Fully Recurrent Nets
Adapts backpropagation to arbitrary connection patterns.
199
III. Adaptive Architectures
Probabilistic Neural Net (Specht 1988) Cascade Correlation (Fahlman, Lebiere 1990)
200
Probabilistic Neural Net
Builds its own architecture as training progresses Chooses class A over class B if hAcAfA(x) > hBcBfB(x) cA is the cost of classifying an example as belonging to A when it belongs to B hA is the a priori probability of an example belonging to class A
201
Probabilistic Neural Net
fA(x) is the probability density function for class A, which is learned by the net zA1: pattern unit, fA: summation unit [diagram: inputs x1 … xn, pattern units zA1 … zAj and zB1 … zBk, summation units fA and fB, output unit y]
202
Cascade Correlation Builds own architecture while training progresses
Tries to overcome slow rate of convergence by other neural nets Dynamically adds hidden units (as few as possible) Trains one layer at a time
203
Cascade Correlation Stage 1 [diagram: inputs x0, x1, x2 connected directly to outputs y1, y2]
204
Cascade Correlation Stage 2 (fix weights into z1) [diagram: hidden unit z1 added, fed by x0, x1, x2 and feeding y1, y2]
205
Cascade Correlation Stage 3 (fix weights into z2) [diagram: hidden unit z2 added in cascade, fed by the inputs and z1]
206
Algorithm 1. Train stage 1. If the error is not acceptable, add hidden unit z1 and train stage 2. 2. If the error is still not acceptable, add z2 and train stage 3.
3. Etc.
207
IV. Neocognitron Fukushima, Miyake, Ito (1983)
Many layers, hierarchical Very sparse and localized connections Self organizing Supervised learning, layer by layer Recognizes handwritten 0, 1, 2, 3, … 9, regardless of position and style
208
Architecture
Layer     # of Arrays   Size
Input     1             19^2
S1 / C1   12 / 8        19^2 / 11^2
S2 / C2   38 / 22       11^2 / 7^2
S3 / C3   32 / 30       7^2 / 7^2
S4 / C4   16 / 10       3^2 / 1^2
209
Architecture S layers respond to patterns
C layers combine results, use a larger field of view For example, S11 responds to a particular small pattern [figure omitted in transcript]
210
Training Progresses layer by layer S1 connections to C1 are fixed
C1 connections to S2 are adaptable A V2 layer is introduced between C1 and S2; V2 is inhibitory C1 to V2 connections are fixed V2 to S2 connections are adaptable