# Feed-forward Networks


AI - NN Lecture Notes Chapter 8 Feed-forward Networks

§8.1 Introduction To Classification

The Classification Model

$X = [x_1\ x_2\ \dots\ x_n]^t$ -- the input pattern of the classifier.
$i(X)$ -- the decision function.
The response of the classifier is 1 or 2 or … or R.

[Figure: the pattern components $x_1, x_2, \dots, x_n$ enter the classifier, which outputs the class $i(X)$ = 1 or 2 or … or R.]

Geometric Explanation of Classification
Pattern -- an n-dimensional vector. All n-dimensional patterns constitute an n-dimensional Euclidean space $E^n$, called the pattern space. If all patterns can be divided into R classes, then the region of the space containing only patterns of the r-th class is called the r-th region, r = 1, …, R. Regions are separated from each other by decision surfaces. A pattern classifier maps sets of patterns in $E^n$ into one of the regions denoted by the numbers i = 1, 2, …, R.

Classifiers That Use The Discriminant Functions
Class membership is determined by comparing R discriminant functions $g_i(X)$, $i = 1, \dots, R$, computed for the input pattern under consideration. The $g_i(X)$ are scalar values, and the pattern X belongs to the i-th class iff $g_i(X) > g_j(X)$ for all $j = 1, \dots, R$, $j \ne i$. The decision surface between regions i and j is given by $g_i(X) - g_j(X) = 0$. Assuming that the discriminant functions are known, the block diagram of a basic pattern classifier is shown below:

[Figure: the pattern X feeds R discriminators computing $g_1(X), \dots, g_i(X), \dots, g_R(X)$; their outputs go to a maximum selector, which outputs the class i.]

For a given pattern, the i-th discriminator computes the value of the function $g_i(X)$, called briefly the discriminant.
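The maximum-selector classifier can be sketched in a few lines of Python. This is only an illustration: the three linear discriminants below are hypothetical and were chosen for the example, they do not come from the lecture.

```python
from typing import Callable, Sequence

Pattern = Sequence[float]
Discriminant = Callable[[Pattern], float]


def classify(x: Pattern, discriminants: Sequence[Discriminant]) -> int:
    """Return the 1-based class index i whose discriminant g_i(x) is largest."""
    values = [g(x) for g in discriminants]
    # The maximum selector: pick the index of the largest discriminant value.
    return max(range(len(values)), key=values.__getitem__) + 1


# Hypothetical discriminants for R = 3 classes in a 2-dimensional pattern space.
DISCRIMINANTS = [
    lambda x: x[0],           # g_1(X)
    lambda x: x[1],           # g_2(X)
    lambda x: -x[0] - x[1],   # g_3(X)
]

if __name__ == "__main__":
    for pattern in [(2.0, 0.5), (0.0, 3.0), (-1.0, -1.0)]:
        print(pattern, "-> class", classify(pattern, DISCRIMINANTS))
```

Ties are resolved here in favour of the lowest class index, which is a design choice of this sketch; patterns lying exactly on a decision surface have no well-defined class.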

When R = 2, the classifier, called a dichotomizer, is simplified as below:

[Figure: the pattern X enters a discriminator that computes the discriminant $g(X)$; a TLU then outputs the class i = 1 or -1.]

Its discriminant function is

$$g(X) = g_1(X) - g_2(X)$$

If $g(X) > 0$, then X belongs to Class 1; if $g(X) < 0$, then X belongs to Class 2.

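The following figure is an example where 6 patterns belong to one of the 2 classes and the decision surface is a straight line.

[Figure: the $(x_1, x_2)$ plane with the six patterns (0, 0), (2, 0), (-1/2, -1), (3/2, -1), (-1, -2) and (1, -2); the decision surface $g(X) = 0$ is a straight line separating the half-plane $g(X) > 0$ from the half-plane $g(X) < 0$.]

$$g(X) = -2x_1 + x_2 + 2$$

An infinite number of discriminant functions may exist; for example, any positive multiple of $g(X)$ yields the same classification.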
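As a quick check of this example, the sketch below evaluates $g(X) = -2x_1 + x_2 + 2$ on the six patterns and applies the dichotomizer rule. The class assignments are computed from $g$ here, not read from the figure, and the helper names are illustrative.

```python
def g(x1: float, x2: float) -> float:
    """Discriminant function of the example: g(X) = -2*x1 + x2 + 2."""
    return -2 * x1 + x2 + 2


def dichotomize(x1: float, x2: float, scale: float = 1.0):
    """Class 1 if scale*g(X) > 0, class 2 if < 0, None on the decision surface."""
    value = scale * g(x1, x2)
    if value > 0:
        return 1
    if value < 0:
        return 2
    return None


PATTERNS = [(0, 0), (2, 0), (-0.5, -1), (1.5, -1), (-1, -2), (1, -2)]

# Classes obtained with g itself and with a positive multiple of g.
LABELS = [dichotomize(x1, x2) for x1, x2 in PATTERNS]
LABELS_SCALED = [dichotomize(x1, x2, scale=3.0) for x1, x2 in PATTERNS]

if __name__ == "__main__":
    for (x1, x2), label in zip(PATTERNS, LABELS):
        print((x1, x2), "g =", g(x1, x2), "class", label)
```

Scaling $g$ by a positive constant leaves every label unchanged, which is one way to see that the discriminant function is not unique.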

Training and Classification
Consider neural network classifiers that derive their weights during the learning cycle. The sample patterns, called the training sequence, are presented to the machine along with the correct response provided by the teacher. After each incorrect response, the classifier modifies its parameters by means of iterative, supervised learning based on comparing the targeted correct response with the actual response.

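[Figure: block diagram of a TLU-based dichotomizer during training. The inputs $x_1, \dots, x_n$ and an augmented input are weighted by $w_1, \dots, w_{n+1}$ and summed to form the discriminant $g(y)$; the TLU outputs 1 or -1, and the difference between the desired response $d$ and the actual output is fed back to adjust the weights.]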

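§8.2 Single Layer Perceptron

1. Linear Threshold Unit and Separability

[Figure: a linear threshold unit with inputs $x_1, \dots, x_n, \dots, x_N$, weights $w_1, \dots, w_n, \dots, w_N$, threshold $T$ and output $y$.]

$$X = (x_1, \dots, x_N)^t, \quad x_n \in \{1, -1\}$$

$$W = (w_1, \dots, w_N)^t, \quad T \in R$$

$$y = \mathrm{sgn}(W^t X - T) \in \{1, -1\}$$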

Let $X' = (X^t, -1)^t$ and $W' = (W^t, T)^t$, then we have

$$y = \mathrm{sgn}(W'^t X') = \mathrm{sgn}\Big(\sum_{n=1}^{N+1} w_n x_n\Big)$$

Linearly Separable Patterns

Assume that there is a pattern set $\mathcal{X}$ which is divided into subsets $\mathcal{X}_1, \mathcal{X}_2, \dots, \mathcal{X}_N$. If a linear machine can classify the patterns from $\mathcal{X}_i$ as belonging to class i, for i = 1, …, N, then the pattern sets are linearly separable.

When R = 2, for given inputs X and outputs y, if there exist W and T such that the mapping $f: \{-1, 1\}^N \to \{-1, 1\}$ is realized by $y = \mathrm{sgn}(W^t X - T)$, then the function f is said to be linearly separable.

Example 1: NAND is a linearly separable function. Given the NAND truth table over the inputs $x_1, x_2 \in \{-1, 1\}$ with output y, it can be found that $W = (-1, -1)^t$ and $T = -3/2$ is a solution:

$$y = \mathrm{sgn}(W^t X - T) = \mathrm{sgn}(-x_1 - x_2 + 3/2)$$

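| $x_1$ | $x_2$ | $T$ | $w_1 x_1 + w_2 x_2 - T$ | $y$ |
|---|---|---|---|---|
| -1 | -1 | -3/2 | 1 + 1 - (-3/2) = 7/2 | 1 |
| -1 | 1 | -3/2 | 1 - 1 - (-3/2) = 3/2 | 1 |
| 1 | -1 | -3/2 | -1 + 1 - (-3/2) = 3/2 | 1 |
| 1 | 1 | -3/2 | -1 - 1 - (-3/2) = -1/2 | -1 |

[Figure: the neuron with inputs $x_1$, $x_2$, weights $w_1$, $w_2$ and threshold -3/2, next to the $(x_1, x_2)$ plane in which the decision line $x_1 + x_2 = 3/2$ separates the region $y > 0$ from the region $y < 0$, with $y = \mathrm{sgn}(-x_1 - x_2 + 3/2)$.]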
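The NAND solution can be verified mechanically. The sketch below implements the threshold unit $y = \mathrm{sgn}(W^t X - T)$ and its augmented form with $x_{N+1} = -1$ and $w_{N+1} = T$; the function names are illustrative, and sgn(0) is taken as +1 by assumption.

```python
def sgn(v: float) -> int:
    """Hard limiter; sgn(0) is taken as +1 (an assumption of this sketch)."""
    return 1 if v >= 0 else -1


def tlu(x, w, T) -> int:
    """Linear threshold unit: y = sgn(W^t X - T)."""
    return sgn(sum(w_n * x_n for w_n, x_n in zip(w, x)) - T)


def tlu_augmented(x, w_aug) -> int:
    """Same unit with augmented vectors X' = (X, -1) and W' = (W, T)."""
    x_aug = tuple(x) + (-1,)
    return sgn(sum(w_n * x_n for w_n, x_n in zip(w_aug, x_aug)))


W, T = (-1, -1), -1.5
NAND = {(-1, -1): 1, (-1, 1): 1, (1, -1): 1, (1, 1): -1}

if __name__ == "__main__":
    for x, target in NAND.items():
        print(x, "target", target, "output", tlu(x, W, T))
```

Both forms give the same output for every pattern, since moving the threshold into the weight vector only rewrites $-T$ as $w_{N+1} x_{N+1}$.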

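Example 2: XOR is not a linearly separable function. Given the XOR truth table over the inputs $x_1, x_2 \in \{-1, 1\}$ with output y, it is impossible to find W and T satisfying $y = \mathrm{sgn}(W^t X - T)$:

- The two patterns with y = -1 require $(-1)w_1 + (-1)w_2 < T$ and $(+1)w_1 + (+1)w_2 < T$; adding the two inequalities gives $T > 0$.
- The two patterns with y = +1 require $(-1)w_1 + (+1)w_2 > T$ and $(+1)w_1 + (-1)w_2 > T$; adding the two inequalities gives $T < 0$.

The two conditions contradict each other. It is seen that linearly separable binary patterns can be classified by a suitably designed neuron, but linearly non-separable binary patterns cannot.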
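The argument can be illustrated numerically with a brute-force search. The grid of candidate values below is an arbitrary assumption of this sketch, so an empty result for XOR only illustrates the claim and does not prove it; the same search does find solutions for NAND.

```python
from itertools import product


def sgn(v: float) -> int:
    """Hard limiter; sgn(0) is taken as +1 (an assumption of this sketch)."""
    return 1 if v >= 0 else -1


def realizes(table, w1: float, w2: float, T: float) -> bool:
    """True if y = sgn(w1*x1 + w2*x2 - T) reproduces every row of the table."""
    return all(sgn(w1 * x1 + w2 * x2 - T) == y for (x1, x2), y in table.items())


def count_solutions(table, grid) -> int:
    """Count the (w1, w2, T) triples on the grid that realize the table."""
    return sum(realizes(table, w1, w2, T) for w1, w2, T in product(grid, repeat=3))


# Candidate values -3, -2.75, ..., 3 for each of w1, w2 and T.
GRID = [k / 4 for k in range(-12, 13)]
NAND = {(-1, -1): 1, (-1, 1): 1, (1, -1): 1, (1, 1): -1}
XOR = {(-1, -1): -1, (-1, 1): 1, (1, -1): 1, (1, 1): -1}

NAND_COUNT = count_solutions(NAND, GRID)
XOR_COUNT = count_solutions(XOR, GRID)

if __name__ == "__main__":
    print("NAND solutions on the grid:", NAND_COUNT)
    print("XOR solutions on the grid:", XOR_COUNT)
```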

Perceptron Learning

Given the training set $\{X(t), d(t)\}$, t = 0, 1, 2, …, where $X(t) = \{x_1(t), \dots, x_N(t), -1\}$, let $w_{N+1} = T$ and $x_{N+1} = -1$, so that

$$y = \mathrm{sgn}\Big(\sum_{n=1}^{N+1} w_n x_n\Big) = \begin{cases} +1, & \sum_{n=1}^{N+1} w_n x_n \ge 0 \\ -1, & \text{otherwise} \end{cases}$$

(1) Set $w_n(0)$ = small random values, n = 1, …, N+1
(2) Input a sample $X(t) = \{x_1(t), \dots, x_N(t), -1\}$ and $d(t)$
(3) Compute the actual output $y(t)$
(4) Revise the weights: $w_n(t+1) = w_n(t) + \eta\,[d(t) - y(t)]\,x_n(t)$
(5) Return to (2) until $w_n(t+1) = w_n(t)$ for all n.

Here $0 < \eta < 1$ is a coefficient.
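Steps (1) to (5) can be sketched as follows. This is a minimal illustration, not a reference implementation: the initial weight range, the value of $\eta$, the epoch cap and the function names are all assumptions, and the stopping test checks a full pass over the samples without any weight change.

```python
import random


def sgn(v: float) -> int:
    """Hard limiter: +1 if v >= 0, otherwise -1."""
    return 1 if v >= 0 else -1


def output(w, x) -> int:
    """Actual output y for pattern x, using the augmented input x_{N+1} = -1."""
    x_aug = list(x) + [-1]
    return sgn(sum(w_n * x_n for w_n, x_n in zip(w, x_aug)))


def train_perceptron(samples, eta=0.1, max_epochs=1000, seed=0):
    """Return (weights, converged) after perceptron learning on (x, d) pairs."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    # (1) small random initial weights; w[N] plays the role of the threshold T
    w = [rng.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
    for _ in range(max_epochs):
        changed = False
        for x, d in samples:                      # (2) present a sample
            y = output(w, x)                      # (3) actual output
            if y != d:                            # (4) revise the weights
                x_aug = list(x) + [-1]
                w = [w_n + eta * (d - y) * x_n for w_n, x_n in zip(w, x_aug)]
                changed = True
        if not changed:                           # (5) weights are stable
            return w, True
    return w, False


NAND = [((-1, -1), 1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]
XOR = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]

if __name__ == "__main__":
    weights, converged = train_perceptron(NAND)
    print("NAND converged:", converged, "weights:", weights)
    print("XOR converged:", train_perceptron(XOR, max_epochs=50)[1])
```

On NAND the loop stops once a full pass produces no correction; on XOR it always runs into the epoch cap, because no weight vector classifies all four patterns.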
Theorem: The perceptron learning algorithm will converge if the function is linearly separable.

Gradient Descent Algorithm

The perceptron learning algorithm is restricted to linearly separable functions (hard-limiting activation function). The gradient descent algorithm can be applied in more general cases, with the only requirement that the activation function be differentiable.

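Given the training set $(x_n, y_n)$, n = 1, …, N, try to find $W^*$ such that $\hat{y}_n = f(W^* \cdot x_n) \approx y_n$. Let

$$E = \sum_{n=1}^{N} E_n = \frac{1}{2} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2 = \frac{1}{2} \sum_{n=1}^{N} \big(y_n - f(W \cdot x_n)\big)^2$$

be the error measure of learning. To minimize E, take the gradient with respect to each weight $w_m$:

$$\frac{\partial E}{\partial w_m} = \sum_{n=1}^{N} \frac{\partial E_n}{\partial w_m} = \frac{1}{2} \sum_{n=1}^{N} \frac{\partial (y_n - \hat{y}_n)^2}{\partial w_m} = \frac{1}{2} \sum_{n=1}^{N} \frac{\partial \big(y_n - f(W \cdot x_n)\big)^2}{\partial w_m}$$

$$= -\sum_{n=1}^{N} \big(y_n - f(W \cdot x_n)\big) \frac{\partial f(W \cdot x_n)}{\partial w_m} = -\sum_{n=1}^{N} \big(y_n - f(W \cdot x_n)\big) \frac{\partial f(W \cdot x_n)}{\partial (W \cdot x_n)} \cdot \frac{\partial (W \cdot x_n)}{\partial w_m}$$

$$= -\sum_{n=1}^{N} \big(y_n - f(W \cdot x_n)\big)\, f' \cdot x_{mn}$$

where $x_{mn}$ is the m-th component of $x_n$ and $f'$ is evaluated at $W \cdot x_n$.

The learning (adjusting) rule is thus as follows:

$$w_m \leftarrow w_m - \eta \frac{\partial E}{\partial w_m} = w_m + \eta \sum_{n=1}^{N} (y_n - \hat{y}_n)\, f' \cdot x_{mn}, \quad \eta > 0$$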
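A minimal sketch of this rule for a single unit follows. The derivation only requires $f$ to be differentiable; the logistic sigmoid used here, the OR-like training set with 0/1 targets, the constant third input acting as a bias, and the values of $\eta$ and the step count are all assumptions made for the illustration.

```python
import math


def f(net: float) -> float:
    """Differentiable activation; the logistic sigmoid is assumed here."""
    return 1.0 / (1.0 + math.exp(-net))


def predict(w, x) -> float:
    """y_hat = f(W . x)."""
    return f(sum(w_m * x_m for w_m, x_m in zip(w, x)))


def error(w, samples) -> float:
    """E = (1/2) * sum_n (y_n - y_hat_n)^2."""
    return 0.5 * sum((y - predict(w, x)) ** 2 for x, y in samples)


def gradient(w, samples):
    """dE/dw_m = -sum_n (y_n - y_hat_n) * f'(W . x_n) * x_mn."""
    grad = [0.0] * len(w)
    for x, y in samples:
        y_hat = predict(w, x)
        f_prime = y_hat * (1.0 - y_hat)   # derivative of the sigmoid
        for m, x_m in enumerate(x):
            grad[m] -= (y - y_hat) * f_prime * x_m
    return grad


def train(samples, eta=0.5, steps=200):
    """Batch gradient descent: w_m <- w_m - eta * dE/dw_m."""
    w = [0.0] * len(samples[0][0])
    for _ in range(steps):
        w = [w_m - eta * g_m for w_m, g_m in zip(w, gradient(w, samples))]
    return w


# OR-like targets in {0, 1}; the constant third component acts as a bias input.
SAMPLES = [((0.0, 0.0, 1.0), 0.0), ((0.0, 1.0, 1.0), 1.0),
           ((1.0, 0.0, 1.0), 1.0), ((1.0, 1.0, 1.0), 1.0)]

if __name__ == "__main__":
    w = train(SAMPLES)
    print("E before:", error([0.0, 0.0, 0.0], SAMPLES), "E after:", error(w, SAMPLES))
```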

§8.3 Multi-Layer Perceptron

1. Why Multi-Layer?

XOR, which as was seen cannot be implemented by a 1-layer network, can be implemented by a 2-layer network:

[Figure: 2-layer network for XOR. The inputs $x_1$ and $x_2$ feed a hidden sgn unit and the output sgn unit y; the diagram shows connection weights of 1 and thresholds of 1.5 and 0.5.]

Single-layer networks have no hidden units and thus have no internal representation ability. They can only map similar input patterns to similar output ones. If there is a layer of hidden units, there is always an internal representation of the input patterns that may support any mapping from input to output units.
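One way to see the role of the hidden layer is to build XOR from threshold units by hand. The sketch below uses the ±1 encoding and two hidden units (an OR unit and the NAND unit from Example 1) feeding an AND output unit. These weights are an assumption chosen for the illustration and are not the ones in the lecture's figure.

```python
def sgn(v: float) -> int:
    """Hard limiter: +1 if v >= 0, otherwise -1."""
    return 1 if v >= 0 else -1


def xor_two_layer(x1: int, x2: int) -> int:
    """XOR on inputs in {-1, +1}, computed as AND(OR(x1, x2), NAND(x1, x2))."""
    # Hidden layer: the internal representation of the input pattern.
    h_or = sgn(x1 + x2 + 1.0)       # -1 only when both inputs are -1
    h_nand = sgn(-x1 - x2 + 1.5)    # -1 only when both inputs are +1
    # Output layer: +1 only when both hidden units are +1.
    return sgn(h_or + h_nand - 1.0)


XOR = {(-1, -1): -1, (-1, 1): 1, (1, -1): 1, (1, 1): -1}

if __name__ == "__main__":
    for (x1, x2), target in XOR.items():
        print((x1, x2), "target", target, "output", xor_two_layer(x1, x2))
```

The hidden units map the four input patterns onto only three distinct points, (-1, +1), (+1, +1) and (+1, -1), and these are linearly separable for the output unit.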

This figure shows the internal representation abilities.

[Figure: table with the columns Network Structure, Types of the Classified Region, Classification for XOR, Meshed Classified Region, and General Form of Classified Region. The region types listed are: 2 regions separated by a hyperplane; open convex region or closed convex region; arbitrary forms.]

2. Back Propagation Algorithm

More precisely, BP is an error back-propagation learning algorithm for multi-layer perceptron networks and is also a sort of generalized gradient descent algorithm.

Assumptions:

1) The MLP has M layers and one single output node.
2) Each node is of sigmoid type with activation function $f(x) = \dfrac{1}{1 + e^{-x}}$.
3) Training samples are $(x_n, y_n)$, n = 1, …, N.

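4) The error measure is chosen as

$$E = \sum_{n=1}^{N} E_n = \frac{1}{2} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2$$

Let

$$\delta_{jn} = \frac{\partial E_n}{\partial net_{jn}}, \qquad net_{jn} = \sum_{i} w_{ji} O_{in}$$

where $O_{in}$ is the output of node i for sample n and $w_{ji}$ is the weight on the connection from node i to node j.

I) If j is the output node, then $O_{jn} = \hat{y}_n$ and

$$\delta_{jn} = \frac{\partial E_n}{\partial \hat{y}_n} \cdot \frac{\partial \hat{y}_n}{\partial net_{jn}} = -(y_n - \hat{y}_n)\, f'(net_{jn})$$

Thus

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial net_{jn}} \cdot \frac{\partial net_{jn}}{\partial w_{ji}} = \delta_{jn} O_{in} = -(y_n - \hat{y}_n)\, O_{in}\, f'(net_{jn})$$

II) If j is not the output node, then

$$\delta_{jn} = \frac{\partial E_n}{\partial net_{jn}} = \frac{\partial E_n}{\partial O_{jn}} \cdot \frac{\partial O_{jn}}{\partial net_{jn}} = \frac{\partial E_n}{\partial O_{jn}}\, f'(net_{jn})$$

but

$$\frac{\partial E_n}{\partial O_{jn}} = \sum_{k} \frac{\partial E_n}{\partial net_{kn}} \cdot \frac{\partial net_{kn}}{\partial O_{jn}} = \sum_{k} \frac{\partial E_n}{\partial net_{kn}} \cdot \frac{\partial \big(\sum_{i} w_{ki} O_{in}\big)}{\partial O_{jn}} = \sum_{k} \delta_{kn} w_{kj}$$

where k runs over the nodes that receive the output of node j. Therefore

$$\delta_{jn} = f'(net_{jn}) \sum_{k} \delta_{kn} w_{kj}$$

and

$$\frac{\partial E_n}{\partial w_{ji}} = \delta_{jn} O_{in} = O_{in}\, f'(net_{jn}) \sum_{k} \delta_{kn} w_{kj}$$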

The MLP-BP algorithm can then be described as below:

(1) Set the initial W
(2) Until convergence (w = const), repeat the following:
    (i) For n = 1 to N (number of samples)
        (a) Compute $O_{in}$, $net_{jn}$, $\hat{y}_n$ and $E_n$
        (b) For m = M to 2, for all units j in layer m, compute $\partial E_n / \partial w_{ji}$
    (ii) Revise the weights: $w_{ji} = w_{ji} - \eta \dfrac{\partial E}{\partial w_{ji}}$, $\eta > 0$, where $\dfrac{\partial E}{\partial w_{ji}} = \sum_{n} \dfrac{\partial E_n}{\partial w_{ji}}$
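The algorithm can be sketched for the smallest interesting case, a network with one hidden layer of sigmoid units and a single sigmoid output. Several details are assumptions of this sketch: each layer gets an extra constant input of 1 to act as a trainable threshold, the weights start uniformly in [-1, 1], and the loop runs for a fixed number of epochs instead of testing convergence.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def forward(W_hid, w_out, x):
    """Return (augmented input, augmented hidden outputs, y_hat)."""
    xa = list(x) + [1.0]                                   # bias input (assumed)
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xa))) for row in W_hid]
    ha = hidden + [1.0]                                    # bias input (assumed)
    y_hat = sigmoid(sum(w * v for w, v in zip(w_out, ha)))
    return xa, ha, y_hat


def total_error(W_hid, w_out, samples) -> float:
    """E = sum_n E_n = (1/2) sum_n (y_n - y_hat_n)^2."""
    return 0.5 * sum((y - forward(W_hid, w_out, x)[2]) ** 2 for x, y in samples)


def gradients(W_hid, w_out, samples):
    """Accumulate dE/dw over all samples with the delta rules I) and II)."""
    g_hid = [[0.0] * len(row) for row in W_hid]
    g_out = [0.0] * len(w_out)
    for x, y in samples:
        xa, ha, y_hat = forward(W_hid, w_out, x)
        # I) output node: delta = -(y - y_hat) * f'(net), with f' = f * (1 - f)
        delta_out = -(y - y_hat) * y_hat * (1.0 - y_hat)
        for i, o_i in enumerate(ha):
            g_out[i] += delta_out * o_i
        # II) hidden node j: delta_j = f'(net_j) * sum_k delta_k * w_kj
        for j in range(len(W_hid)):
            delta_j = ha[j] * (1.0 - ha[j]) * delta_out * w_out[j]
            for i, o_i in enumerate(xa):
                g_hid[j][i] += delta_j * o_i
    return g_hid, g_out


def train(samples, n_hidden=2, eta=0.5, epochs=2000, seed=0):
    """Batch BP: w <- w - eta * dE/dw after each pass over the samples."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    W_hid = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        g_hid, g_out = gradients(W_hid, w_out, samples)
        W_hid = [[w - eta * g for w, g in zip(row, g_row)]
                 for row, g_row in zip(W_hid, g_hid)]
        w_out = [w - eta * g for w, g in zip(w_out, g_out)]
    return W_hid, w_out


XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

if __name__ == "__main__":
    W_hid, w_out = train(XOR)
    print("final error:", total_error(W_hid, w_out, XOR))
```

Whether training actually reaches a small error on XOR depends on the initial weights, since gradient descent can settle in a local minimum; comparing `gradients` against finite differences of `total_error` is a useful sanity check on the delta rules.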

Mapping Ability of Feedforward Networks

An MLP plays the mathematical role of a mapping $R^m \to R^n$.

Kolmogorov Theorem (1957): Let $\varphi(x)$ be a bounded, monotonically increasing continuous function, K be a bounded closed subset of $R^n$, and $f(X) = f(x_1, \dots, x_n)$ be a real continuous function on K. Then for any $\varepsilon > 0$, there exist an integer N and constants $c_i$, $T_i$ and $w_{ij}$ (i = 1, …, N; j = 1, …, n) such that

$$\hat{f}(x_1, \dots, x_n) = \sum_{i=1}^{N} c_i\, \varphi\Big(\sum_{j=1}^{n} w_{ij} x_j - T_i\Big) \qquad (1)$$

and

$$\max_{X \in K} \big| f(x_1, \dots, x_n) - \hat{f}(x_1, \dots, x_n) \big| < \varepsilon \qquad (2)$$

That is to say, for any $\varepsilon > 0$, there exists a 3-layer network whose hidden unit output function is $\varphi(x)$, whose input and output units are linear, and whose total input-output relation $\hat{f}(x_1, \dots, x_n)$ satisfies Eq. (2).

Significance: Any continuous mapping $R^m \to R^n$ can be approximated by the input-output relation of a k-layer network (k-2 hidden layers), k ≥ 3.
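To make Eq. (1) concrete, the sketch below evaluates $\hat{f}$ for given constants and uses N = 2 hidden units to approximate a "bump" that is close to 1 inside an interval and close to 0 outside it. The logistic sigmoid as $\varphi$, the steepness k and the interval (0.2, 0.8) are assumptions chosen for the illustration; the theorem itself only asserts that suitable constants exist.

```python
import math


def phi(x: float) -> float:
    """A bounded, monotonically increasing continuous function (sigmoid assumed)."""
    return 1.0 / (1.0 + math.exp(-x))


def f_hat(x, c, w, T) -> float:
    """Eq. (1): f_hat(x) = sum_i c_i * phi(sum_j w_ij * x_j - T_i)."""
    return sum(
        c_i * phi(sum(w_ij * x_j for w_ij, x_j in zip(w_i, x)) - T_i)
        for c_i, w_i, T_i in zip(c, w, T)
    )


# Two hidden units on a 1-dimensional input: a rising step at a minus a
# rising step at b gives an approximate indicator of the interval (a, b).
K_STEEP, A, B = 50.0, 0.2, 0.8
C = [1.0, -1.0]
W = [[K_STEEP], [K_STEEP]]
T = [K_STEEP * A, K_STEEP * B]

if __name__ == "__main__":
    for x in (0.0, 0.5, 1.0):
        print(x, f_hat([x], C, W, T))
```

Summing more such bumps with different heights $c_i$ gives a staircase-like approximation of a continuous function on an interval, which is one intuitive reading of the theorem.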

§8.4 Applications of Feed-forward Networks

MLPs can be successfully applied to classification and diagnosis problems whose solution is obtained via experimentation and simulation rather than via a rigorous and formal approach to the problem. An MLP can also act as an expert system. Formulation of the rules is one of the bottlenecks in building expert systems. However, layered networks can acquire knowledge without extracting IF-THEN rules if the number of training vector pairs is sufficient to suitably form all decision regions.

1) Fault Diagnosing

Automobile engine diagnosing (Marko et al, 1989)
-- employs a single-hidden-layer network
-- identifies 26 different faults
-- training set consists of 16 sets of data for each failure
-- training time needed is 10 minutes
-- the main-frame is a NESTOR NDS-100 computer
-- fault recognition accuracy is 100%

Switching System Fault Diagnosing (Wen Fang, 1994)
-- BP algorithm
-- 3-layer network
-- no mis-diagnosing, much better than the existing one

2) Handwritten Digits Recognition

Postal Codes Recognition (Wang et al, 1995)
-- 3-layer network
-- preprocessing features
-- rejection rate < 5%
-- mis-classification rate < 0.01%

3) Other Applications Include
-- text reading
-- speech recognition
-- image recognition
-- medical diagnosing
-- approximation
-- optimization
-- coding
-- robot controlling, etc.

Advantages and Disadvantages
-- learning from examples
-- better performance than the traditional approach
-- long training time (much improved)
-- local minima (already overcome)