1 Linear Discriminant Functions Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis
2 Generative vs Discriminant Approach Generative approaches find the discriminant function by first estimating the probability distribution of the patterns belonging to each class. Discriminant approaches find the discriminant function explicitly, without assuming a probability distribution.
3 Generative Approach – Example (two categories) It is more common to use a single discriminant function (dichotomizer) instead of two. Examples: g(x) = P(ω1|x) − P(ω2|x), or g(x) = ln [p(x|ω1)/p(x|ω2)] + ln [P(ω1)/P(ω2)].
4 Discriminant Approach Specify the parametric form of the discriminant function, for example a linear discriminant: g(x) = w^t x + w0. Decide ω1 if g(x) > 0 and ω2 if g(x) < 0. If g(x) = 0, then x lies on the decision boundary and can be assigned to either class.
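The two-category decision rule above can be sketched in a few lines of Python; the weight values w and w0 below are made-up illustrations, not values from the slides:

```python
import numpy as np

# Hypothetical 2-D linear discriminant g(x) = w^t x + w0 (illustrative weights).
w = np.array([1.0, -2.0])   # determines the orientation of the boundary
w0 = 0.5                    # determines its location (bias/offset)

def g(x):
    """Linear discriminant: positive -> class omega1, negative -> class omega2."""
    return w @ x + w0

def classify(x):
    value = g(x)
    if value > 0:
        return "omega1"
    elif value < 0:
        return "omega2"
    return "boundary"   # g(x) = 0: x lies on the decision boundary
```

A point such as (3, 0) gives g(x) = 3.5 > 0 and is assigned to ω1, while (0, 2) gives g(x) = −3.5 < 0 and goes to ω2.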
5 Discriminant Approach (cont’d) Find the “best” decision boundary (i.e., estimate w and w0) using a set of training examples xk.
6 Discriminant Approach (cont’d) The solution is found by minimizing a criterion function (e.g., the “training error” or “empirical risk”, which measures the discrepancy between the correct class labels and the predicted ones). Learning algorithms can then be applied to find the solution.
7 Linear Discriminant Functions: two-categories case A linear discriminant function has the following form: g(x) = w^t x + w0. The decision boundary g(x) = 0 is a hyperplane; the orientation of the hyperplane is determined by w and its location by w0. w is the normal to the hyperplane. If w0 = 0, the hyperplane passes through the origin.
8 Geometric Interpretation of g(x) g(x) provides an algebraic measure of the distance of x from the hyperplane. x can be expressed as x = xp + r (w/||w||), where xp is the projection of x onto the hyperplane and r is the signed distance along the direction of w.
9 Geometric Interpretation of g(x) (cont’d) Substitute x in g(x): g(x) = w^t (xp + r w/||w||) + w0 = g(xp) + r ||w|| = r ||w||, since g(xp) = 0 and w^t w = ||w||^2.
10 Geometric Interpretation of g(x) (cont’d) Therefore, the distance of x from the hyperplane is given by r = g(x)/||w||. Setting x = 0 gives the distance of the origin from the hyperplane: w0/||w||.
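The distance formula r = g(x)/||w|| is easy to verify numerically; the hyperplane below is a made-up example chosen so that ||w|| = 5:

```python
import numpy as np

# Illustrative hyperplane g(x) = w^t x + w0 = 0 (weights are not from the slides).
w = np.array([3.0, 4.0])    # normal vector, ||w|| = 5
w0 = -10.0

def signed_distance(x):
    """Signed distance r = g(x)/||w||; the sign tells which side of the plane x is on."""
    return (w @ x + w0) / np.linalg.norm(w)

# Setting x = 0 recovers the distance of the origin: w0/||w|| = -10/5 = -2.
```

Any point actually on the hyperplane, e.g. (2, 1) where 3·2 + 4·1 − 10 = 0, has distance zero.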
11 Linear Discriminant Functions: multi-category case There are several ways to devise multi-category classifiers using linear discriminant functions: (1) One against the rest. Problem: ambiguous regions.
12 Linear Discriminant Functions: multi-category case (cont’d) (2) One against another (i.e., c(c−1)/2 pairs of classes). Problem: ambiguous regions.
13 Linear Discriminant Functions: multi-category case (cont’d) To avoid the problem of ambiguous regions: define c linear discriminant functions gi(x) and assign x to ωi if gi(x) > gj(x) for all j ≠ i. The resulting classifier is called a linear machine (see Chapter 2).
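The linear-machine rule is just an argmax over the c discriminants; the weight matrix W and biases w0 below are illustrative placeholders for a c = 3 problem:

```python
import numpy as np

# Hypothetical linear machine with c = 3 classes in d = 2 dimensions.
# Row i of W holds w_i; w0[i] is the corresponding bias.
W = np.array([[ 1.0,  0.0],
              [-1.0,  0.0],
              [ 0.0,  1.0]])
w0 = np.array([0.0, 0.0, 0.0])

def linear_machine(x):
    """Assign x to the class i whose g_i(x) = w_i^t x + w0_i is largest."""
    scores = W @ x + w0
    return int(np.argmax(scores))
```

With these weights, points far along the positive x1-axis go to class 0, the negative x1-axis to class 1, and the positive x2-axis to class 2.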
14 Linear Discriminant Functions: multi-category case (cont’d) A linear machine divides the feature space into c convex decision regions. If x is in region Ri, then gi(x) is the largest. Note: although there are c(c−1)/2 pairs of regions, there are typically fewer decision boundaries.
15 Linear Discriminant Functions: multi-category case (cont’d) The decision boundary between adjacent regions Ri and Rj is a portion of the hyperplane Hij given by gi(x) = gj(x). (wi − wj) is normal to Hij, and the signed distance from x to Hij is (gi(x) − gj(x)) / ||wi − wj||.
16 Higher Order Discriminant Functions Can produce more complicated decision boundaries than linear discriminant functions.
17 Generalized discriminants Defined through special functions yi(x) called φ functions: g(x) = α^t y, where α is a d̂-dimensional weight vector. The φ functions yi(x) map a point from the d-dimensional x-space to a point in the d̂-dimensional y-space (usually d̂ >> d).
18 Generalized discriminants (cont’d) The resulting discriminant function is linear in y-space. It separates points in the transformed space by a hyperplane passing through the origin.
19 Example The φ functions (d = 1) map x to y = (1, x, x^2)^t. The corresponding decision regions R1, R2 in the x-space are not simply connected!
20 Example (cont’d) g(x) maps a line in x-space to a parabola in y-space. The plane α^t y = 0 divides the y-space into two decision regions.
21 Learning: two-category, linearly separable case Given a linear discriminant function, the goal is to “learn” the parameters w and w0 from a set of n labeled samples xi, where each xi has a class label ω1 or ω2.
22 Augmented feature/parameter space Simplify the notation by augmenting: y = (1, x1, ..., xd)^t and α = (w0, w1, ..., wd)^t. Dimensionality: d → (d+1).
23 Classification in augmented space Discriminant: g(x) = α^t y. Classification rule: if α^t yi > 0 assign yi to ω1, else if α^t yi < 0 assign yi to ω2.
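The augmentation trick is a one-liner; the weights below are illustrative, and the point is that α^t y reproduces w^t x + w0 exactly:

```python
import numpy as np

def augment(x):
    """Map x in d-space to the augmented vector y = (1, x1, ..., xd) in (d+1)-space."""
    return np.concatenate(([1.0], x))

# Packing the bias into alpha = (w0, w1, ..., wd) turns g(x) = w^t x + w0
# into the homogeneous form alpha^t y (weights below are made up).
w = np.array([1.0, -2.0])
w0 = 0.5
alpha = np.concatenate(([w0], w))
```

For any x, `alpha @ augment(x)` equals `w @ x + w0`, so the bias no longer needs special treatment during learning.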
24 Learning in augmented space: two-category, linearly separable case Given a linear discriminant function g(x) = α^t y, the goal is to learn the weights (parameters) α from a set of n labeled samples yi, where each yi has a class label ω1 or ω2.
25 Learning in augmented space: effect of training examples Every training sample yi places a constraint on the weight vector α: α^t y = 0 defines a hyperplane in parameter space having y as its normal vector. Given n examples, the solution α must lie in the intersection of n half-spaces.
26 Learning in augmented space: effect of training examples (cont’d) The solution can be visualized either in the parameter space (α1, α2) or in the feature space (y1, y2).
27 Uniqueness of Solution The solution vector α is usually not unique; we can impose certain constraints to enforce uniqueness, e.g.: “Find the unit-length weight vector that maximizes the minimum distance from the training examples to the separating plane.”
28 Iterative Optimization Define an error function J(α) (e.g., the number of misclassifications) that is minimized if α is a solution vector. Minimize J(α) iteratively: α(k+1) = α(k) + η(k) pk, where pk is the search direction and η(k) is the learning rate. How should we define pk?
29 Choosing pk using Gradient Descent Take pk = −∇J(α(k)), giving the update rule α(k+1) = α(k) − η(k) ∇J(α(k)).
31 Gradient Descent (cont’d) What is the effect of the learning rate η? A small η is slow but converges to the solution; a large η is fast but overshoots the solution.
32 Gradient Descent (cont’d) How should we choose the learning rate η(k)? Using a second-order Taylor series approximation of J(α), with H the Hessian (matrix of second derivatives), the optimum learning rate is η(k) = ||∇J||^2 / (∇J^t H ∇J). If J(α) is quadratic, then H is constant, which implies that the learning rate is constant.
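A minimal sketch of gradient descent with the optimum rate η = ||∇J||^2 / (∇J^t H ∇J), using a made-up quadratic criterion J(α) = ½ α^t H α − b^t α (H and b are illustrative, not from the slides):

```python
import numpy as np

# Illustrative quadratic criterion; its minimizer is a* = H^{-1} b = (1, 1).
H = np.array([[2.0, 0.0],
              [0.0, 4.0]])
b = np.array([2.0, 4.0])

def J(a):
    return 0.5 * a @ H @ a - b @ a

def grad(a):
    return H @ a - b

def gd_step(a):
    """One gradient-descent step with the optimum rate for a quadratic J:
    eta = ||grad||^2 / (grad^t H grad)."""
    g = grad(a)
    gg = g @ g
    if gg < 1e-20:              # already (numerically) at the minimum
        return a
    eta = gg / (g @ H @ g)
    return a - eta * g
```

Because the step size is chosen optimally at each iteration, J decreases monotonically and the iterates converge to the minimizer.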
33 Choosing pk using Newton’s Method Take pk = −H^(−1) ∇J(α(k)), giving the update α(k+1) = α(k) − H^(−1) ∇J(α(k)); this requires inverting H.
34 Newton’s method (cont’d) If J(α) is quadratic, Newton’s method converges in one step!
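The one-step property is easy to check on a quadratic; H and b below are illustrative, and `np.linalg.solve` is used instead of forming H^(−1) explicitly:

```python
import numpy as np

# Illustrative quadratic J(a) = 0.5 a^t H a - b^t a, with gradient H a - b.
H = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

def newton_step(a):
    """Newton update a <- a - H^{-1} grad J(a); solve() avoids an explicit inverse."""
    g = H @ a - b
    return a - np.linalg.solve(H, g)

# For a quadratic J, a single Newton step from any starting point lands
# exactly on the minimizer H^{-1} b.
```

Starting from zero (or anywhere else), one step reaches the point where the gradient H a − b vanishes, and further steps leave it unchanged.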
36 “Normalized” Problem If yi is in ω2, replace yi by −yi; then find α such that α^t yi > 0 for all i. Before normalization we seek a hyperplane that separates patterns from different categories; after normalization we seek a hyperplane that puts all the normalized patterns on the same (positive) side.
37 Perceptron rule Find α such that α^t yi > 0. Use gradient descent with the perceptron criterion Jp(α) = Σ over y in Y(α) of (−α^t y), where Y(α) is the set of samples misclassified by α. If Y(α) is empty, Jp(α) = 0; otherwise, Jp(α) > 0.
38 Perceptron rule (cont’d) The gradient of Jp(α) is ∇Jp(α) = Σ over y in Y(α) of (−y). The perceptron update rule is obtained using gradient descent: α(k+1) = α(k) + η(k) Σ over y in Y(α(k)) of y.
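The batch update rule above can be sketched directly; the sample matrix `ys` is a hypothetical set of already-normalized augmented samples (each row should satisfy α^t y > 0 for a solution α):

```python
import numpy as np

# Hypothetical normalized augmented samples (one per row); linearly separable.
ys = np.array([[ 1.0,  2.0],
               [ 1.0, -0.5],
               [-1.0,  1.0]])

def batch_perceptron(ys, eta=1.0, max_iter=1000):
    """Batch perceptron: at each step, add eta times the sum of all
    currently misclassified samples Y(alpha) to the weight vector."""
    alpha = np.zeros(ys.shape[1])
    for _ in range(max_iter):
        misclassified = ys[ys @ alpha <= 0]      # the set Y(alpha)
        if len(misclassified) == 0:
            return alpha                         # all samples on the positive side
        alpha = alpha + eta * misclassified.sum(axis=0)
    return alpha
```

On this small separable set the loop terminates after a handful of updates with every sample strictly on the positive side.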
40 Perceptron rule (cont’d) Example: each update moves the hyperplane so that the training samples end up on its positive side.
41 Perceptron rule (cont’d) Single-sample variant with η(k) = 1: update α using one misclassified example at a time. Perceptron Convergence Theorem: if the training samples are linearly separable, then the sequence of weight vectors generated by the above algorithm will terminate at a solution vector in a finite number of steps.
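The fixed-increment, single-sample variant (η = 1) cycles through the samples and adds each misclassified one to α as it is encountered; `ys` is again a hypothetical set of normalized augmented samples:

```python
import numpy as np

# Hypothetical normalized augmented samples (one per row); linearly separable.
ys = np.array([[ 1.0,  2.0],
               [ 1.0, -0.5],
               [-1.0,  1.0]])

def single_sample_perceptron(ys, max_epochs=100):
    """Fixed-increment single-sample perceptron (eta = 1): cycle through the
    samples, adding each misclassified y to alpha as soon as it is seen."""
    alpha = np.zeros(ys.shape[1])
    for _ in range(max_epochs):
        updated = False
        for y in ys:
            if alpha @ y <= 0:          # y is misclassified
                alpha = alpha + y
                updated = True
        if not updated:
            return alpha                # a full pass with no mistakes: done
    return alpha
```

By the convergence theorem, on linearly separable data this terminates in a finite number of updates; unlike the batch version, the trajectory of α depends on the order in which the examples are presented.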
42 Perceptron rule (cont’d) Order of examples: y2, y3, y1, y3. A “batch” algorithm (updating with all misclassified examples at once) leads to a smoother trajectory in solution space.