Linear Discriminant Functions Wen-Hung Liao, 11/25/2008
Introduction: LDF Assume we know the proper form of the discriminant functions, rather than the underlying probability densities. Use samples to estimate the parameters of the classifier (a statistical or non-statistical approach). We will be concerned with discriminant functions that are either linear in the components of x, or linear in some given set of functions of x.
Why LDF? Simplicity vs. accuracy trade-off. Attractive candidates for initial, trial classifiers. Related to neural networks.
Approach Find the LDF by minimizing a criterion function. Use a gradient descent procedure for minimization; examine its convergence properties and computational complexity. Example of a criterion function: sample risk, i.e., the training error. (Not appropriate, why? Because a small training error does not guarantee a small test error.)
LDF and Decision Surfaces A linear discriminant function: g(x) = w^t x + w_0, where w is the weight vector and w_0 is the bias or threshold.
Two-Category Case Decision rule: decide ω_1 if g(x) > 0, decide ω_2 if g(x) < 0. In other words, x is assigned to ω_1 if the inner product w^t x exceeds the threshold -w_0.
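A minimal Python sketch of this decision rule (illustrative only; the values of w and w_0 below are made up, not fitted to data):

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^t x + w0."""
    return np.dot(w, x) + w0

def decide(x, w, w0):
    """Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0
    (the boundary case g(x) = 0 is assigned to omega_2 here for simplicity)."""
    return "omega_1" if g(x, w, w0) > 0 else "omega_2"

w, w0 = np.array([1.0, 2.0]), -3.0   # illustrative parameters
x = np.array([2.0, 2.0])
print(decide(x, w, w0))   # w^t x = 6 exceeds -w0 = 3, so omega_1
```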
Decision Boundary A hyperplane H defined by g(x) = 0. If x_1 and x_2 are both on the decision surface, then w^t x_1 + w_0 = w^t x_2 + w_0, so w^t (x_1 - x_2) = 0: w is normal to any vector lying on the hyperplane.
Distance Measure For any x, write x = x_p + r (w / ||w||), where x_p is the normal projection of x onto H and r is the algebraic distance; since g(x_p) = 0, this gives r = g(x) / ||w||.
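A short sketch of the distance formula r = g(x)/||w|| and the projection x_p (numbers are illustrative):

```python
import numpy as np

def signed_distance(x, w, w0):
    """Algebraic distance r = g(x) / ||w|| from x to the hyperplane g(x) = 0."""
    return (np.dot(w, x) + w0) / np.linalg.norm(w)

def project_onto_hyperplane(x, w, w0):
    """Normal projection x_p = x - r * w / ||w||, so that g(x_p) = 0."""
    r = signed_distance(x, w, w0)
    return x - r * w / np.linalg.norm(w)

w, w0 = np.array([3.0, 4.0]), -5.0
x = np.array([2.0, 2.0])
print(signed_distance(x, w, w0))          # (6 + 8 - 5) / 5 = 1.8
print(project_onto_hyperplane(x, w, w0))  # lies on H: g(x_p) = 0
```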
Multi-category Case General case: reduce to c-1 two-class problems (ω_i versus not-ω_i), or to c(c-1)/2 linear discriminants, one per pair of classes; both reductions can leave ambiguous regions.
Use c linear discriminants: g_i(x) = w_i^t x + w_{i0}, assigning x to ω_i if g_i(x) > g_j(x) for all j ≠ i (a linear machine).
Distance Measure w_i - w_j is normal to H_ij. The distance from x to H_ij is given by (g_i(x) - g_j(x)) / ||w_i - w_j||.
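A sketch of a linear machine using c discriminants, together with the pairwise distance formula above; the three weight vectors are hypothetical:

```python
import numpy as np

def linear_machine(x, W, w0):
    """Assign x to the class with the largest g_i(x) = w_i^t x + w_i0.
    W: (c, d) array of weight vectors, w0: (c,) array of biases."""
    return int(np.argmax(W @ x + w0))

def pairwise_distance(x, W, w0, i, j):
    """Distance from x to the boundary H_ij:
    (g_i(x) - g_j(x)) / ||w_i - w_j||."""
    gi, gj = W[i] @ x + w0[i], W[j] @ x + w0[j]
    return (gi - gj) / np.linalg.norm(W[i] - W[j])

# Hypothetical 3-class machine in 2-D (weights are illustrative).
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])
x = np.array([2.0, 1.0])
print(linear_machine(x, W, w0))          # class 0: scores (2.0, 1.0, -2.5)
print(pairwise_distance(x, W, w0, 0, 1)) # 1 / sqrt(2)
```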
Quadratic DF Add terms involving products of pairs of components of x to obtain the quadratic discriminant function: g(x) = w_0 + Σ_{i=1}^d w_i x_i + Σ_{i=1}^d Σ_{j=1}^d w_{ij} x_i x_j. The separating surface defined by g(x) = 0 is a hyperquadric surface.
Hyperquadric Surfaces If W = [w_ij] is not singular, then the linear terms in g(x) can be eliminated by translating the axes. Define a scaled matrix \bar{W} = W / (w^t W^{-1} w - 4 w_0); depending on its eigenvalues, the separating surface is a hypersphere (all equal and positive), a hyperellipsoid (all positive), or a hyperhyperboloid (mixed signs).
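Assuming the scaled-matrix construction above, a small sketch that reads the surface type off the eigenvalues (the function name hyperquadric_type and the example coefficients are mine, for illustration):

```python
import numpy as np

def hyperquadric_type(W, w, w0):
    """Type of the surface g(x) = w0 + w^t x + x^t W x = 0, read off
    from the eigenvalues of the scaled matrix W / (w^t W^-1 w - 4 w0)."""
    scale = w @ np.linalg.solve(W, w) - 4.0 * w0   # W assumed nonsingular
    eig = np.linalg.eigvalsh(W / scale)            # W assumed symmetric
    if np.all(eig > 0):
        return "hypersphere" if np.allclose(eig, eig[0]) else "hyperellipsoid"
    if np.any(eig > 0) and np.any(eig < 0):
        return "hyperhyperboloid"
    return "degenerate or other"

# x1^2 + x2^2 + 0.1*x1 - 1 = 0 is a circle, i.e. a hypersphere in 2-D.
print(hyperquadric_type(np.eye(2), np.array([0.1, 0.0]), -1.0))
```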
Generalized LDF Polynomial discriminant functions. Generalized LDF: g(x) = Σ_{i=1}^{d̂} a_i y_i(x) = a^t y, where the y_i(x) are arbitrary functions of x.
Augmented Vectors Augmented feature vector: y = (1, x_1, ..., x_d)^t. Augmented weight vector: a = (w_0, w_1, ..., w_d)^t, so that g(x) = a^t y. This maps the d-dimensional x-space to a (d+1)-dimensional y-space.
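A brief sketch of augmentation and of one possible generalized mapping, y = (1, x, x^2) for scalar x, which turns a quadratic discriminant into one that is linear in y (the weights are illustrative):

```python
import numpy as np

def augment(x):
    """Map d-dimensional x to (d+1)-dimensional y = (1, x1, ..., xd)^t."""
    return np.concatenate(([1.0], x))

def quadratic_features(x):
    """One possible generalized mapping for scalar x: y = (1, x, x^2),
    so a quadratic g(x) becomes linear in y: g(x) = a^t y."""
    return np.array([1.0, x, x * x])

a = np.array([-1.0, 0.0, 2.0])   # illustrative weights: g(x) = 2x^2 - 1
for x in (-1.0, 0.0, 1.0):
    print(x, a @ quadratic_features(x))   # g(-1) = 1, g(0) = -1, g(1) = 1
```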
2-Category Separable Case Look for a weight vector that classifies all of the samples correctly. If such a weight vector exists, the samples are said to be linearly separable.
Gradient Descent Procedure Define a criterion function J(a) that is minimized if a is a solution vector. Step 1: randomly pick a(1) and compute the gradient vector ∇J(a(1)). Step 2: a(2) is obtained by moving some distance from a(1) in the direction of steepest descent, i.e., a(k+1) = a(k) - η(k) ∇J(a(k)).
Setting the Learning Rate Second-order expansion of J(a): J(a) ≈ J(a(k)) + ∇J^t (a - a(k)) + (1/2)(a - a(k))^t H (a - a(k)), where H is the Hessian matrix. Substituting a = a(k+1) = a(k) - η(k) ∇J gives J(a(k+1)) ≈ J(a(k)) - η(k) ||∇J||^2 + (1/2) η(k)^2 ∇J^t H ∇J, which is minimized when η(k) = ||∇J||^2 / (∇J^t H ∇J).
Newton Descent For nonsingular H, update a(k+1) = a(k) - H^{-1} ∇J. Converges in fewer steps, but each step is more expensive to compute, since it requires the inverse Hessian.
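A sketch contrasting gradient descent with the optimal rate η(k) = ||∇J||^2 / (∇J^t H ∇J) against Newton descent, on a hypothetical quadratic criterion (H and b below are made-up values):

```python
import numpy as np

# Hypothetical quadratic criterion J(a) = 0.5 a^t H a - b^t a, gradient H a - b.
H = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda a: H @ a - b

# Gradient descent with the second-order optimal rate.
a = np.zeros(2)
for k in range(100):
    gk = grad(a)
    if np.linalg.norm(gk) < 1e-12:
        break
    eta = (gk @ gk) / (gk @ H @ gk)   # eta(k) = ||grad||^2 / (grad^t H grad)
    a = a - eta * gk
print("gradient descent:", a)

# Newton descent a(k+1) = a(k) - H^-1 grad; exact in one step for quadratic J.
a_newton = np.zeros(2) - np.linalg.solve(H, grad(np.zeros(2)))
print("newton descent:  ", a_newton)
```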
Perceptron Criterion Function J_p(a) = Σ_{y ∈ Y(a)} (-a^t y), where Y(a) is the set of samples misclassified by a. Since ∇J_p = Σ_{y ∈ Y(a)} (-y), the update rule is a(k+1) = a(k) + η(k) Σ_{y ∈ Y(a(k))} y.
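A sketch of the batch perceptron update (assuming the usual convention that augmented samples from ω_2 are negated, so a^t y > 0 for every row y means every sample is classified correctly; the data here are made up):

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Batch perceptron on 'normalized' augmented samples Y of shape (n, d+1):
    omega_2 samples are assumed already negated."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ a <= 0]            # the set Y(a)
        if len(misclassified) == 0:
            return a                             # solution vector found
        a = a + eta * misclassified.sum(axis=0)  # a(k+1) = a(k) + eta * sum y
    return a

# Tiny linearly separable example (augmented; the omega_2 sample is negated).
Y = np.array([[1.0, 2.0, 1.0],    # omega_1 sample (1, x)
              [1.0, 1.0, 2.0],    # omega_1 sample
              [-1.0, 1.0, 1.0]])  # omega_2 sample (1, x), negated
print(batch_perceptron(Y))
```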
Convergence Proof Refer to pages 229 to 232 of the textbook.