Presentation transcript: "Giansalvo EXIN Cirrincione, unit #4: Single-layer networks"

1

2

3 Giansalvo EXIN Cirrincione unit #4

4 Single-layer networks They compute linear discriminant functions directly from the training set (TS), without the need to determine probability densities.

5 Linear discriminant functions Two classes: the decision boundary is a (d - 1)-dimensional hyperplane in the d-dimensional input space.
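The slide's equations are not reproduced in the transcript; as a reminder, and as an assumption consistent with the slide text, the standard two-class form is:

```latex
% Two-class linear discriminant (standard form, assumed from the slide text)
y(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + w_0 ,
\qquad \text{decide } C_1 \text{ if } y(\mathbf{x}) > 0,\ C_2 \text{ otherwise.}
% The decision boundary y(x) = 0 is a (d-1)-dimensional hyperplane with normal
% vector w, at perpendicular distance |w_0| / \lVert \mathbf{w} \rVert from the origin.
```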

6 Linear discriminant functions Several classes

7 Linear discriminant functions Several classes The decision regions are always simply connected and convex.

8 Logistic discrimination With a monotonic activation function the decision boundary is still linear. Two classes: Gaussian class-conditional densities with equal covariance matrices.

9 Logistic discrimination logistic sigmoid

10 Logistic discrimination logistic sigmoid

11 Logistic discrimination logistic sigmoid The use of the logistic sigmoid activation function allows the outputs of the discriminant to be interpreted as posterior probabilities.
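Not on the original slides: a minimal Python sketch of this interpretation, assuming two Gaussian classes with a shared covariance matrix (the names mu1, mu2, Sigma, prior1 are illustrative). The posterior P(C1|x) is exactly a logistic sigmoid applied to a linear function of x.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def posterior_c1(x, mu1, mu2, Sigma, prior1=0.5):
    """P(C1|x) for two Gaussian classes sharing covariance Sigma.

    The posterior reduces to sigmoid(w.x + w0), i.e. a logistic
    single-layer network with a linear decision boundary.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / (1.0 - prior1)))
    return sigmoid(w @ x + w0)
```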

12 binary input vectors Let P_ki denote the probability that the input x_i takes the value +1 when the input vector is drawn from class C_k. The corresponding probability that x_i = 0 is then 1 - P_ki. Assuming the input variables are statistically independent, the probability for the complete input vector is the product of independent Bernoulli terms: P(x|C_k) = prod_i P_ki^{x_i} (1 - P_ki)^{1 - x_i}.

13 binary input vectors Linear discriminant functions arise when we consider input patterns in which the variables are binary.

14 binary input vectors Consider a set of independent binary variables having Bernoulli class-conditional densities and form the discriminant for the two-class problem. Both for normally distributed and for Bernoulli-distributed class-conditional densities, the posterior probabilities are obtained from a logistic single-layer network.

15 homework

16

17

18 Generalized discriminant functions With fixed non-linear basis functions, the network can approximate any CONTINUOUS functional transformation to arbitrary accuracy. An extra basis function equal to one plays the role of the bias.
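A possible illustration (not from the slides): a generalized linear discriminant built from fixed non-linear basis functions, here assumed to be Gaussian bumps, with the extra constant basis function phi_0 = 1 absorbing the bias.

```python
import numpy as np

def design_matrix(X, centres, width):
    """Fixed non-linear basis functions: Gaussian bumps plus a constant
    phi_0 = 1 that plays the role of the bias."""
    # X: (N, d) input patterns, centres: (M, d) basis-function centres
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2.0 * width ** 2))           # (N, M)
    return np.hstack([np.ones((len(X), 1)), Phi])    # (N, M+1)

def discriminant(X, W, centres, width):
    """y_k(x) = sum_j w_kj * phi_j(x); assign x to the class with the largest y_k."""
    Phi = design_matrix(X, centres, width)
    return Phi @ W.T                                  # (N, c) network outputs
```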

19 Sum-of-squares error function The sum-of-squares error between the network outputs and the targets is quadratic in the weights.

20 Geometrical interpretation of least squares The least-squares solution corresponds to a projection of the target vector onto the column space of the design matrix.

21 Pseudo-inverse solution normal equations (matrix dimensions from the slide: N x (M+1), c x (M+1), N x c)

22 Pseudo-inverse solution The singular case.
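A NumPy sketch of the pseudo-inverse solution (function and variable names are illustrative); np.linalg.pinv is based on the SVD, so it also covers the case where the normal-equations matrix is singular.

```python
import numpy as np

def least_squares_weights(Phi, T):
    """Solve the normal equations (Phi^T Phi) W^T = Phi^T T for the weights.

    Phi : (N, M+1) design matrix, T : (N, c) target matrix.
    Returns W of shape (c, M+1), so that the network outputs are Phi @ W.T.
    """
    return (np.linalg.pinv(Phi) @ T).T
```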

23 bias The role of the biases is to compensate for the difference between the averages (over the data set) of the target values and the averages of the output vectors.

24 gradient descent Group all of the parameters (weights and biases) together to form a single weight vector w. The updates can be made in batch mode (summing over the whole TS) or sequentially (pattern by pattern). If the learning rate η is chosen correctly, sequential gradient descent becomes the Robbins-Monro procedure for finding the root of the regression function.
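The two update schemes might be sketched as follows for a linear network with sum-of-squares error (the learning rate eta and the single-pass structure are illustrative assumptions):

```python
import numpy as np

def batch_update(w, Phi, t, eta):
    """One batch step: w <- w - eta * grad E, with the gradient summed over all patterns."""
    grad = Phi.T @ (Phi @ w - t)
    return w - eta * grad

def sequential_update(w, Phi, t, eta):
    """One pass of sequential (pattern-by-pattern) updates.
    With a suitably decreasing eta this is the Robbins-Monro procedure."""
    for phi_n, t_n in zip(Phi, t):
        w = w - eta * (phi_n @ w - t_n) * phi_n
    return w
```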

25 Differentiable non-linear activation functions Batch gradient descent with the logistic sigmoid.

26 homework Generate and plot a set of data points in two dimensions, drawn from two classes, each of which is described by a Gaussian class-conditional density function. Implement the gradient descent algorithm for training a logistic discriminant, and plot the decision boundary at regular intervals during the training procedure on the same graph as the data. Explore the effect of choosing different values for the learning rate. Compare the behaviour of the sequential and batch weight update procedures.
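Not part of the homework statement: one possible starting point, assuming NumPy and Matplotlib; all parameter values (class means, learning rate, number of epochs) are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 100),   # class 0
               rng.multivariate_normal([2, 2], np.eye(2), 100)])  # class 1
t = np.hstack([np.zeros(100), np.ones(100)])
Phi = np.hstack([np.ones((200, 1)), X])        # bias input plus the two coordinates

w = np.zeros(3)
eta = 0.01                                     # try different learning rates here
for epoch in range(1, 1001):
    y = 1.0 / (1.0 + np.exp(-Phi @ w))         # logistic discriminant output
    w -= eta * Phi.T @ (y - t)                 # batch gradient descent step
    if epoch % 200 == 0:                       # plot the boundary at regular intervals
        xs = np.array([-3.0, 5.0])
        plt.plot(xs, -(w[0] + w[1] * xs) / w[2])

plt.scatter(X[:, 0], X[:, 1], c=t)
plt.show()
```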

27 The perceptron Applied to classification problems in which the inputs are usually binary images of characters or simple shapes. The first-layer processing units have fixed weights and are connected to a random subset of the input pixels.

28 The perceptron One could define the error function in terms of the total number of misclassifications over the TS. However, an error function based on a loss matrix is piecewise constant w.r.t. the weights, so gradient descent cannot be applied. What is wanted instead is to minimize the perceptron criterion, which sums, over the misclassified patterns only, a quantity proportional to the absolute distance of each misclassified input pattern to the decision boundary. This criterion is continuous and piecewise linear.

29 The perceptron Apply the sequential gradient descent rule to the perceptron criterion (the sum runs over the misclassified patterns). Cycle through all of the patterns in the TS and test each pattern in turn using the current set of weight values. If the pattern is correctly classified, do nothing; otherwise add the pattern vector to the weight vector if the pattern is labelled class C1, or subtract the pattern vector from the weight vector if the pattern is labelled class C2. The value of the learning rate η is unimportant, since changing it is equivalent to a re-scaling of the weights and biases.
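A compact sketch of this rule, using the usual ±1 target coding so that "add for C1, subtract for C2" collapses into a single update line; η is taken as 1, since its value is unimportant.

```python
import numpy as np

def perceptron_train(X, t, max_epochs=100):
    """Sequential perceptron learning.

    X : (N, d) patterns (a bias input of 1 is appended internally),
    t : (N,) targets, +1 for class C1 and -1 for class C2.
    """
    Phi = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Phi.shape[1])                 # null initial conditions
    for _ in range(max_epochs):
        errors = 0
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:         # misclassified (or on the boundary)
                w += t_n * phi_n               # add for C1, subtract for C2
                errors += 1
        if errors == 0:                        # converged: all patterns correct
            break
    return w
```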

30 The perceptron

31 The perceptron convergence theorem For any data set which is linearly separable, the perceptron learning rule is guaranteed to find a solution in a finite number of steps. Proof: assume a solution weight vector exists and start from null initial conditions.

32 The perceptron convergence theorem For any data set which is linearly separable, the perceptron learning rule is guaranteed to find a solution in a finite number of steps. Proof (continued); end of proof.

33 The perceptron convergence theorem homework Prove that, for arbitrary vectors w and ŵ, the following equality is satisfied: Hence, show that an upper limit on the number of weight updates needed for convergence of the perceptron algorithm is given by:

34 If the data set happens not to be linearly separable, then the learning algorithm will never terminate. If we arbitrarily stop the learning process, there is no guarantee that the weight vector found will generalize well for new data. Possible remedies: decrease the learning rate η during the training process; or use the pocket algorithm, which involves retaining a copy (in one's pocket) of the set of weights which has so far survived unchanged for the longest number of pattern presentations.
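A sketch of the pocket idea described above (the bookkeeping shown, with random pattern presentation and run-length counting, is one common variant rather than necessarily the slide's exact formulation):

```python
import numpy as np

def pocket_train(X, t, max_presentations=10000, seed=0):
    """Perceptron with a 'pocket': keep the weights that survived unchanged
    for the longest run of consecutive pattern presentations."""
    rng = np.random.default_rng(seed)
    Phi = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Phi.shape[1])
    pocket_w, best_run, run = w.copy(), 0, 0
    for _ in range(max_presentations):
        n = rng.integers(len(Phi))             # present a random pattern
        if t[n] * (w @ Phi[n]) <= 0:           # misclassified: perceptron update
            if run > best_run:                 # current w survived longest so far
                best_run, pocket_w = run, w.copy()
            w = w + t[n] * Phi[n]
            run = 0
        else:
            run += 1
    return pocket_w if best_run > run else w
```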

35 Limitations of the perceptron Even though the data set of input patterns may not be linearly separable in the input space, it can become linearly separable in the φ-space. However, this requires the number and complexity of the φ_j's to grow very rapidly (typically exponentially). Limiting the complexity: the diameter-limited perceptron, in which each φ_j receives inputs only from a small receptive field of the image.

36 Fisher's linear discriminant Optimal linear dimensionality reduction (no bias term): select a projection y = w^T x which maximizes the class separation. The data set contains N_1 points of class C_1 and N_2 points of class C_2.

37 Fisher's linear discriminant A first idea is to maximize the separation of the projected class means (m_k denotes the class mean of the projected data from class C_k). This can be made arbitrarily large simply by increasing the magnitude of w, so w is constrained to unit length; the constrained optimization gives w ∝ (m_2 - m_1). Fisher's criterion instead maximizes a function which represents the difference between the projected class means, normalized by a measure of the within-class scatter along the direction of w.

38 Fisher's linear discriminant The within-class scatter of the transformed data from class C_k is described by the within-class covariance of the projected data. The Fisher criterion is the ratio of the between-class covariance to the within-class covariance, both projected along w: J(w) = (w^T S_B w) / (w^T S_W w), where S_B is the between-class covariance matrix and S_W is the within-class covariance matrix.

39 Fisher’s linear discriminant Generalized eigenvector problem
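For two classes the generalized eigenvector problem has the well-known closed-form solution w ∝ S_W^{-1} (m_2 - m_1); a NumPy sketch (variable names are illustrative):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher discriminant direction w ∝ S_W^{-1} (m2 - m1).

    X1, X2 : arrays of shape (N1, d) and (N2, d) with the class samples.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)                               # direction only
```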

40 Fisher’s linear discriminant EXAMPLE

41 Fisher's linear discriminant The projected data can subsequently be used to construct a discriminant, by choosing a threshold y_0 so that we classify a new point as belonging to C_1 if y(x) ≥ y_0 and classify it as belonging to C_2 otherwise. Note that y = w^T x is the sum of a set of random variables, and so we may invoke the central limit theorem and model the class-conditional density functions p(y|C_k) using normal distributions. Once we have obtained a suitable weight vector and a threshold, the procedure for deciding the class of a new vector is identical to that of the perceptron network. So, the Fisher criterion can be viewed as a learning law for the single-layer network.
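A sketch of this classification step (illustrative, not from the slides): fit 1-D normal densities to the projected training data of each class and assign a new point to the class with the larger posterior, with the priors estimated from the class frequencies.

```python
import numpy as np

def classify(x_new, w, X1, X2):
    """Project onto w, model p(y|Ck) as 1-D Gaussians, pick the larger posterior."""
    y1, y2 = X1 @ w, X2 @ w                    # projected training data per class
    y = x_new @ w
    def log_post(ys):
        mu, var = ys.mean(), ys.var()
        return (-0.5 * np.log(2 * np.pi * var)
                - 0.5 * (y - mu) ** 2 / var
                + np.log(len(ys)))             # prior proportional to class frequency
    return 1 if log_post(y1) >= log_post(y2) else 2
```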

42 Fisher’s linear discriminant relation to the least-squares approach

43 Fisher's linear discriminant relation to the least-squares approach: the bias gives the classification threshold.

44 Fisher's linear discriminant relation to the least-squares approach A new vector x is classified as belonging to C_1 if w^T (x - m) > 0.

45 Fisher's linear discriminant Several classes: d' linear features; within-class covariance matrix.

46 Fisher’s linear discriminant Several classes total covariance matrix

47 Fisher’s linear discriminant Several classes In the projected d’-dimensional y-space

48 Fisher’s linear discriminant Several classes One possible criterion... This criterion is unable to find more than (c - 1) linear features
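A sketch for the several-class case, assuming the criterion reduces to taking the d' leading generalized eigenvectors of S_B w = λ S_W w (hence at most c - 1 useful features); scipy.linalg.eigh solves the symmetric-definite generalized problem.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_features(X, labels, d_prime):
    """Multi-class Fisher: project onto the d' leading generalized
    eigenvectors of S_B w = lambda * S_W w (at most c-1 are useful)."""
    classes = np.unique(labels)
    m = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)                 # within-class scatter
        S_B += len(Xc) * np.outer(mc - m, mc - m)      # between-class scatter
    vals, vecs = eigh(S_B, S_W)                        # eigenvalues in ascending order
    W = vecs[:, -d_prime:]                             # leading d' directions
    return X @ W                                       # projected features
```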

49

