 # Linear Discriminant Functions Chapter 5 (Duda et al.)


Linear Discriminant Functions Chapter 5 (Duda et al.)
CS479/679 Pattern Recognition Dr. George Bebis

Generative vs Discriminant Approach
Generative approaches find the discriminant function by first estimating the probability distribution of the patterns belonging to each class. Discriminant approaches find the discriminant function explicitly, without assuming a probability distribution.

Generative Approach – Example (two categories)
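It is more common to use a single discriminant function (dichotomizer) instead of two: $g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$; decide ω1 if g(x) > 0, otherwise decide ω2. Examples: $g(\mathbf{x}) = P(\omega_1|\mathbf{x}) - P(\omega_2|\mathbf{x})$ and $g(\mathbf{x}) = \ln\frac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)} + \ln\frac{P(\omega_1)}{P(\omega_2)}$.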

Discriminant Approach
Specify a parametric form of the discriminant function, for example, a linear discriminant: $g(\mathbf{x}) = \mathbf{w}^t\mathbf{x} + w_0$. Decide ω1 if g(x) > 0 and ω2 if g(x) < 0. If g(x) = 0, then x lies on the decision boundary and can be assigned to either class.
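
The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function names `g` and `decide` and the values of `w` and `w0` are assumptions chosen only for the example.

```python
from typing import Sequence


def g(x: Sequence[float], w: Sequence[float], w0: float) -> float:
    """Linear discriminant g(x) = w^t x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def decide(x: Sequence[float], w: Sequence[float], w0: float) -> str:
    """Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0, 'boundary' if g(x) = 0."""
    value = g(x, w, w0)
    if value > 0:
        return "omega_1"
    if value < 0:
        return "omega_2"
    return "boundary"


# Hypothetical parameters, chosen only for illustration.
w, w0 = [1.0, -1.0], 0.5

print(decide([2.0, 1.0], w, w0))  # g = 2 - 1 + 0.5 = 1.5, positive side
print(decide([0.0, 2.0], w, w0))  # g = 0 - 2 + 0.5 = -1.5, negative side
print(decide([0.5, 1.0], w, w0))  # g = 0.5 - 1 + 0.5 = 0, on the boundary
```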

Discriminant Approach (cont’d)
Find the “best” decision boundary (i.e., estimate w and w0) using a set of training examples xk.

Discriminant Approach (cont’d)
The solution is found by minimizing a criterion function (e.g., “training error” or “empirical risk”) that compares the correct class of each training example with the class predicted by g(x). Learning algorithms can be applied to find the solution.

Linear Discriminant Functions: two-categories case
A linear discriminant function has the following form: $g(\mathbf{x}) = \mathbf{w}^t\mathbf{x} + w_0$. The decision boundary, g(x) = 0, is a hyperplane whose orientation is determined by w and whose location is determined by w0. w is the normal to the hyperplane. If w0 = 0, the hyperplane passes through the origin.

Geometric Interpretation of g(x)
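g(x) provides an algebraic measure of the distance of x from the hyperplane. x can be expressed as follows: $\mathbf{x} = \mathbf{x}_p + r\frac{\mathbf{w}}{\|\mathbf{w}\|}$, where x_p is the projection of x onto the hyperplane, r is the signed distance of x from the hyperplane, and w/||w|| is the direction of r.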

Geometric Interpretation of g(x) (cont’d)
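Substitute x in g(x): $g(\mathbf{x}) = \mathbf{w}^t\mathbf{x} + w_0 = \mathbf{w}^t\left(\mathbf{x}_p + r\frac{\mathbf{w}}{\|\mathbf{w}\|}\right) + w_0 = (\mathbf{w}^t\mathbf{x}_p + w_0) + r\frac{\mathbf{w}^t\mathbf{w}}{\|\mathbf{w}\|} = r\|\mathbf{w}\|$, since $g(\mathbf{x}_p) = \mathbf{w}^t\mathbf{x}_p + w_0 = 0$ and $\mathbf{w}^t\mathbf{w} = \|\mathbf{w}\|^2$.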

Geometric Interpretation of g(x) (cont’d)
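Therefore, the distance of x from the hyperplane is given by: $r = \frac{g(\mathbf{x})}{\|\mathbf{w}\|}$. Setting x = 0 gives the distance of the origin from the hyperplane: $r = \frac{w_0}{\|\mathbf{w}\|}$.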
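
As a quick numerical check of the distance formula, the sketch below computes r = g(x)/||w|| and verifies that moving x back by r along w/||w|| lands on the hyperplane. The helper names and the numbers are illustrative assumptions, not part of the slides.

```python
import math
from typing import List, Sequence


def g(x: Sequence[float], w: Sequence[float], w0: float) -> float:
    """Linear discriminant g(x) = w^t x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def norm(w: Sequence[float]) -> float:
    """Euclidean norm ||w||."""
    return math.sqrt(sum(wi * wi for wi in w))


def signed_distance(x: Sequence[float], w: Sequence[float], w0: float) -> float:
    """Signed distance r = g(x) / ||w|| of x from the hyperplane g(x) = 0."""
    return g(x, w, w0) / norm(w)


def project(x: Sequence[float], w: Sequence[float], w0: float) -> List[float]:
    """Projection x_p = x - r * w / ||w|| of x onto the hyperplane."""
    r = signed_distance(x, w, w0)
    n = norm(w)
    return [xi - r * wi / n for xi, wi in zip(x, w)]


# Hypothetical hyperplane with ||w|| = 5.
w, w0 = [3.0, 4.0], -5.0
x = [3.0, 4.0]

r = signed_distance(x, w, w0)                  # g(x) = 20, so r = 4
r_origin = signed_distance([0.0, 0.0], w, w0)  # w0 / ||w|| = -1
x_p = project(x, w, w0)                        # lies on the hyperplane

print(r, r_origin, g(x_p, w, w0))
```

The negative value for the origin indicates that it lies on the negative side of this particular hyperplane.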

Linear Discriminant Functions: multi-category case
There are several ways to devise multi-category classifiers using linear discriminant functions: (1) One against the rest: each discriminant separates one class ωi from all the others. Problem: ambiguous regions.

Linear Discriminant Functions: multi-category case (cont’d)
(2) One against another: one discriminant for each of the c(c-1)/2 pairs of classes. Problem: ambiguous regions.

Linear Discriminant Functions: multi-category case (cont’d)
To avoid the problem of ambiguous regions: Define c linear discriminant functions $g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{x} + w_{i0}$, i = 1, ..., c. Assign x to ωi if gi(x) > gj(x) for all j ≠ i. The resulting classifier is called a linear machine (see Chapter 2).

Linear Discriminant Functions: multi-category case (cont’d)
A linear machine divides the feature space into c convex decision regions. If x is in region Ri, then gi(x) is the largest. Note: although there are c(c-1)/2 pairs of regions, there are typically fewer decision boundaries.

Linear Discriminant Functions: multi-category case (cont’d)
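The decision boundary between adjacent regions Ri and Rj is a portion of the hyperplane Hij given by: $g_i(\mathbf{x}) = g_j(\mathbf{x})$, or $(\mathbf{w}_i - \mathbf{w}_j)^t\mathbf{x} + (w_{i0} - w_{j0}) = 0$. (wi-wj) is normal to Hij and the signed distance from x to Hij is $\frac{g_i(\mathbf{x}) - g_j(\mathbf{x})}{\|\mathbf{w}_i - \mathbf{w}_j\|}$.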
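
A linear machine and the signed distance to a pairwise boundary Hij can be sketched as follows. The three weight vectors are hypothetical values picked for illustration, and ties between discriminants are resolved by taking the first maximum, which is an arbitrary choice.

```python
import math
from typing import List, Sequence


def discriminants(x: Sequence[float], W: Sequence[Sequence[float]],
                  w0: Sequence[float]) -> List[float]:
    """Evaluate g_i(x) = w_i^t x + w_i0 for every class i."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in zip(W, w0)]


def linear_machine(x: Sequence[float], W: Sequence[Sequence[float]],
                   w0: Sequence[float]) -> int:
    """Return the index i with the largest g_i(x)."""
    values = discriminants(x, W, w0)
    return max(range(len(values)), key=values.__getitem__)


def distance_to_boundary(x: Sequence[float], W: Sequence[Sequence[float]],
                         w0: Sequence[float], i: int, j: int) -> float:
    """Signed distance (g_i(x) - g_j(x)) / ||w_i - w_j|| from x to H_ij."""
    values = discriminants(x, W, w0)
    diff = [a - b for a, b in zip(W[i], W[j])]
    return (values[i] - values[j]) / math.sqrt(sum(d * d for d in diff))


# Hypothetical three-class linear machine in two dimensions.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
w0 = [0.0, 0.0, 0.0]
x = [2.0, 1.0]

print(discriminants(x, W, w0))               # g = [2, 1, -3]
print(linear_machine(x, W, w0))              # index of the largest g_i
print(distance_to_boundary(x, W, w0, 0, 1))  # (2 - 1) / sqrt(2)
```

Because every point is assigned to the class with the largest discriminant, no region is left ambiguous, apart from ties on the boundaries themselves.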

Higher Order Discriminant Functions
Can produce more complicated decision boundaries than linear discriminant functions.

Generalized discriminants
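Generalized discriminants are defined through special functions yi(x) called φ functions: $g(\mathbf{x}) = \sum_{i=1}^{\hat{d}} a_i y_i(\mathbf{x}) = \boldsymbol{\alpha}^t\mathbf{y}$. α is a $\hat{d}$-dimensional weight vector. The φ functions yi(x) map a point from the d-dimensional x-space to a point in the $\hat{d}$-dimensional y-space (usually $\hat{d} \gg d$).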

Generalized discriminants (cont’d)
The resulting discriminant function is linear in y-space. Separates points in the transformed space by a hyperplane passing through the origin.

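Example
With d = 1 and $\hat{d} = 3$, take the φ functions $\mathbf{y} = (1, x, x^2)^t$, so that $g(x) = a_1 + a_2 x + a_3 x^2$. The corresponding decision regions R1, R2 in the x-space are not simply connected!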

Example (cont’d)
The φ functions map a line in x-space to a parabola in y-space. The plane αty=0 divides the y-space into two decision regions.
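
The example can be reproduced with a short sketch, assuming the mapping y = (1, x, x^2) and a hypothetical weight vector α = (-1, 1, 2). With these values g(x) = -1 + x + 2x^2 is positive for x < -1 and for x > 0.5, so region R1 consists of two disconnected intervals even though the discriminant is linear in y-space.

```python
from typing import List, Sequence


def phi(x: float) -> List[float]:
    """Assumed phi functions: map the scalar x to y = (1, x, x^2)."""
    return [1.0, x, x * x]


def g(x: float, alpha: Sequence[float]) -> float:
    """Generalized discriminant g(x) = alpha^t y, linear in y-space."""
    return sum(a * y for a, y in zip(alpha, phi(x)))


def region(x: float, alpha: Sequence[float]) -> str:
    """R1 where g(x) > 0, R2 otherwise."""
    return "R1" if g(x, alpha) > 0 else "R2"


# Hypothetical weights: g(x) = -1 + x + 2 x^2, with roots at x = -1 and x = 0.5.
alpha = [-1.0, 1.0, 2.0]

for x in (-2.0, -0.5, 0.0, 1.0):
    print(x, g(x, alpha), region(x, alpha))
```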

Learning: two-category, linearly separable case
Given a linear discriminant function $g(\mathbf{x}) = \mathbf{w}^t\mathbf{x} + w_0$, the goal is to “learn” the parameters w and w0 from a set of n labeled samples xi, where each xi has a class label ω1 or ω2.

Augmented feature/parameter space
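Simplify notation by absorbing w0 into the weight vector: $\mathbf{y} = (1, x_1, \ldots, x_d)^t$ and $\boldsymbol{\alpha} = (w_0, w_1, \ldots, w_d)^t$, so that $g(\mathbf{x}) = \mathbf{w}^t\mathbf{x} + w_0 = \boldsymbol{\alpha}^t\mathbf{y}$. Dimensionality: d → (d+1).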

Classification in augmented space
Discriminant: g(x)=αty. Classification rule: if αtyi>0 assign yi to ω1, else if αtyi<0 assign yi to ω2.
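
A small sketch of the augmented representation, assuming the usual convention of prepending a constant 1 to x and w0 to w. The helper names and the numbers are illustrative only.

```python
from typing import List, Sequence


def augment_sample(x: Sequence[float]) -> List[float]:
    """Map x = (x1, ..., xd) to y = (1, x1, ..., xd)."""
    return [1.0] + list(x)


def augment_weights(w: Sequence[float], w0: float) -> List[float]:
    """Map (w, w0) to alpha = (w0, w1, ..., wd)."""
    return [w0] + list(w)


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product u^t v."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical parameters and sample.
w, w0 = [1.0, -1.0], 0.5
x = [2.0, 1.0]

y = augment_sample(x)
alpha = augment_weights(w, w0)

g_original = dot(w, x) + w0   # w^t x + w0
g_augmented = dot(alpha, y)   # alpha^t y, the same value

print(g_original, g_augmented)
print("omega_1" if g_augmented > 0 else "omega_2")
```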

Learning in augmented space: two-category, linearly separable case
Given a linear discriminant function g(x)=αty, the goal is to learn the weights (parameters) α from a set of n labeled samples yi, where each yi has a class label ω1 or ω2.

Learning in augmented space: effect of training examples
Every training sample yi places a constraint on the weight vector α. αty=0 defines a hyperplane in parameter space having y as a normal vector. Given n examples, the solution α must lie in the intersection of n half-spaces. [Figure: half-space constraints in parameter space (α1, α2)]

Learning in augmented space: effect of training examples (cont’d)
Visualize the solution in the parameter or feature space. [Figure: solution region shown in parameter space (α1, α2) and in feature space (y1, y2)]

Uniqueness of Solution
Solution vector α is usually not unique; we can impose certain constraints to enforce uniqueness: “Find unit-length weight vector that maximizes the minimum distance from the training examples to the separating plane”

Iterative Optimization
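Define an error function J(α) (e.g., based on misclassifications) that is minimized if α is a solution vector. Minimize J(α) iteratively: $\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) + \eta(k)\mathbf{p}_k$, where pk is the search direction and η(k) is the learning rate. How should we define pk?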

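Choosing pk using Gradient Descent: $\mathbf{p}_k = -\nabla J(\boldsymbol{\alpha}(k))$, so $\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) - \eta(k)\nabla J(\boldsymbol{\alpha}(k))$, where η(k) is the learning rate.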

[Figure: J(α) over the solution space]

What is the effect of the learning rate η? If η is small, the search is slow but converges to the solution; if η is large, the search is fast but overshoots the solution.
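
The effect of the learning rate can be seen on a toy problem. The sketch below assumes the standard gradient descent update α(k+1) = α(k) - η∇J(α(k)) and a hypothetical one-dimensional criterion J(α) = α², neither of which is taken from the slides. For this criterion each step multiplies α by (1 - 2η).

```python
def gradient_descent(eta: float, alpha: float = 1.0, steps: int = 20) -> float:
    """Run fixed-rate gradient descent on the toy criterion J(alpha) = alpha**2."""
    for _ in range(steps):
        grad = 2.0 * alpha          # dJ/dalpha
        alpha = alpha - eta * grad  # alpha(k+1) = alpha(k) - eta * grad
    return alpha


small = gradient_descent(eta=0.05)     # factor 0.9 per step: slow, steady approach
large = gradient_descent(eta=0.9)      # factor -0.8 per step: overshoots, still converges
too_large = gradient_descent(eta=1.1)  # factor -1.2 per step: overshoots and diverges

print(small, large, too_large)
```

With η = 0.9 every step jumps past the minimum at α = 0 and lands on the other side, while η = 1.1 makes each overshoot larger than the last.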

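How to choose the learning rate η(k)? Use a second-order Taylor series approximation of J(α) around α(k): $J(\boldsymbol{\alpha}) \approx J(\boldsymbol{\alpha}(k)) + \nabla J^t(\boldsymbol{\alpha} - \boldsymbol{\alpha}(k)) + \frac{1}{2}(\boldsymbol{\alpha} - \boldsymbol{\alpha}(k))^t\mathbf{H}(\boldsymbol{\alpha} - \boldsymbol{\alpha}(k))$, where H is the Hessian (matrix of 2nd derivatives) evaluated at α(k). Substituting the gradient descent update and minimizing with respect to η(k) gives the optimum learning rate: $\eta(k) = \frac{\|\nabla J\|^2}{\nabla J^t\mathbf{H}\nabla J}$. If J(α) is quadratic, then H is constant, which implies that the learning rate is constant.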
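
As a sanity check on the optimum learning rate, the sketch below evaluates η = ||∇J||² / (∇Jᵗ H ∇J) for a hypothetical quadratic criterion J(α) = ½ αᵗHα - bᵗα and confirms that nearby step sizes along the same gradient direction give a larger J. The matrix H and vector b are made-up values.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product u^t v."""
    return sum(a * b for a, b in zip(u, v))


def mat_vec(A: Matrix, v: Sequence[float]) -> List[float]:
    """Matrix-vector product A v."""
    return [dot(row, v) for row in A]


# Hypothetical quadratic criterion J(alpha) = 0.5 alpha^t H alpha - b^t alpha.
H = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]


def J(alpha: Sequence[float]) -> float:
    return 0.5 * dot(alpha, mat_vec(H, alpha)) - dot(b, alpha)


def grad_J(alpha: Sequence[float]) -> List[float]:
    """Gradient H alpha - b."""
    return [h - bi for h, bi in zip(mat_vec(H, alpha), b)]


def optimal_eta(alpha: Sequence[float]) -> float:
    """eta = ||grad||^2 / (grad^t H grad)."""
    grad = grad_J(alpha)
    return dot(grad, grad) / dot(grad, mat_vec(H, grad))


def step(alpha: Sequence[float], eta: float) -> List[float]:
    """One gradient descent step alpha - eta * grad."""
    return [a - eta * gi for a, gi in zip(alpha, grad_J(alpha))]


alpha0 = [0.0, 0.0]
eta = optimal_eta(alpha0)  # grad = (-1, -1), so eta = 2 / 7

print(eta, J(step(alpha0, eta)))
print(J(step(alpha0, eta - 0.05)), J(step(alpha0, eta + 0.05)))
```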

Choosing pk using Newton’s Method
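$\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) - \mathbf{H}^{-1}\nabla J(\boldsymbol{\alpha}(k))$, i.e., $\mathbf{p}_k = -\mathbf{H}^{-1}\nabla J$ with η(k) = 1. This requires inverting H.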

Newton’s method (cont’d)
If J(α) is quadratic, Newton’s method converges in one step!
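
The one-step convergence can be checked numerically. The sketch assumes the standard Newton update α - H⁻¹∇J and a hypothetical two-dimensional quadratic J(α) = ½ αᵗHα - bᵗα. The 2x2 inverse is written out by hand to keep the example free of dependencies.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]

# Hypothetical quadratic criterion J(alpha) = 0.5 alpha^t H alpha - b^t alpha.
H = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]


def grad_J(alpha: Sequence[float]) -> List[float]:
    """Gradient H alpha - b of the quadratic criterion."""
    return [sum(h * a for h, a in zip(row, alpha)) - bi for row, bi in zip(H, b)]


def newton_step(alpha: Sequence[float], hessian: Matrix) -> List[float]:
    """One Newton update alpha - H^{-1} grad, for a 2x2 Hessian only."""
    (p, q), (r, s) = hessian
    det = p * s - q * r
    inverse = [[s / det, -q / det], [-r / det, p / det]]
    grad = grad_J(alpha)
    delta = [sum(i * gi for i, gi in zip(row, grad)) for row in inverse]
    return [a - d for a, d in zip(alpha, delta)]


alpha0 = [5.0, -3.0]             # arbitrary starting point
alpha1 = newton_step(alpha0, H)  # lands on the minimizer H^{-1} b

print(alpha1)          # approximately (0.2, 0.4)
print(grad_J(alpha1))  # approximately (0, 0): the gradient vanishes
```

The price of the single step is the inverse of H, which becomes expensive as the dimension grows.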

“Normalized” Problem
If yi is in ω2, replace yi by -yi. Then find α such that αtyi>0 for all i. Original problem: seek a hyperplane that separates patterns from different categories. Normalized problem: seek a hyperplane that puts the normalized patterns on the same (positive) side.
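
The normalization step can be sketched as follows. The samples, the integer labels 1 and 2 for ω1 and ω2, and the candidate α are assumptions made for the example, and the samples are augmented with a leading 1 before negation.

```python
from typing import List, Sequence


def normalize(samples: Sequence[Sequence[float]],
              labels: Sequence[int]) -> List[List[float]]:
    """Augment each sample with a leading 1 and negate those labeled omega_2."""
    normalized = []
    for x, label in zip(samples, labels):
        y = [1.0] + list(x)
        if label == 2:
            y = [-v for v in y]
        normalized.append(y)
    return normalized


def is_solution(alpha: Sequence[float], Y: Sequence[Sequence[float]]) -> bool:
    """alpha is a solution vector if alpha^t y > 0 for every normalized y."""
    return all(sum(a * v for a, v in zip(alpha, y)) > 0 for y in Y)


# Hypothetical linearly separable data: labels 1 and 2 stand for omega_1, omega_2.
samples = [[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]]
labels = [1, 1, 2, 2]

Y = normalize(samples, labels)
print(Y)
print(is_solution([0.0, 1.0, 1.0], Y))   # separates the two classes
print(is_solution([0.0, -1.0, 0.0], Y))  # puts every sample on the wrong side
```

After normalization a single inequality per sample replaces the two-sided test, which is what makes the criterion functions in the following slides easy to write.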

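Perceptron rule
Find α such that αtyi>0. Use gradient descent assuming the perceptron criterion: $J_p(\boldsymbol{\alpha}) = \sum_{\mathbf{y} \in Y(\boldsymbol{\alpha})} (-\boldsymbol{\alpha}^t\mathbf{y})$, where Y(α) is the set of samples misclassified by α. If Y(α) is empty, Jp(α)=0; otherwise, Jp(α)>0.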

Perceptron rule (cont’d)
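The gradient of Jp(α) is: $\nabla J_p = \sum_{\mathbf{y} \in Y(\boldsymbol{\alpha})} (-\mathbf{y})$. The perceptron update rule is obtained using gradient descent: $\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) - \eta(k)\nabla J_p$, or $\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) + \eta(k)\sum_{\mathbf{y} \in Y(\boldsymbol{\alpha}(k))} \mathbf{y}$.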
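
A minimal sketch of the batch perceptron update, assuming normalized augmented samples, a constant learning rate, a zero initial weight vector, and the convention that αᵗy = 0 counts as misclassified. The data and the `max_iter` safeguard are illustrative assumptions, not part of the slides.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product u^t v."""
    return sum(a * b for a, b in zip(u, v))


def batch_perceptron(Y: Sequence[Sequence[float]], eta: float = 1.0,
                     max_iter: int = 1000) -> List[float]:
    """Repeat alpha <- alpha + eta * (sum of misclassified y) until none remain."""
    alpha = [0.0] * len(Y[0])
    for _ in range(max_iter):
        # Y(alpha): samples that are not strictly on the positive side.
        misclassified = [y for y in Y if dot(alpha, y) <= 0]
        if not misclassified:
            return alpha
        for j in range(len(alpha)):
            alpha[j] += eta * sum(y[j] for y in misclassified)
    raise RuntimeError("no solution vector found within max_iter updates")


# Hypothetical normalized samples (omega_2 samples already negated).
Y = [[1.0, 2.0, 2.0], [1.0, 1.0, 3.0], [-1.0, 1.0, 1.0], [-1.0, 2.0, 0.0]]

alpha = batch_perceptron(Y)
print(alpha)
print([dot(alpha, y) for y in Y])  # all strictly positive at a solution
```

The iteration cap matters in practice: if the samples are not linearly separable, the set of misclassified samples never becomes empty.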

Perceptron rule (cont’d)
[Figure: perceptron update over the set of misclassified examples Y(α)]

Perceptron rule (cont’d)
Move the hyperplane so that training samples are on its positive side. [Figure: example in parameter space (α1, α2)]

Perceptron rule (cont’d)
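Fixed-increment, single-sample variant: η(k)=1 and the update uses one misclassified example at a time: $\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) + \mathbf{y}^k$, where y^k is the misclassified example used at step k. Perceptron Convergence Theorem: If the training samples are linearly separable, then the sequence of weight vectors generated by the above algorithm will terminate at a solution vector in a finite number of steps.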
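
The single-sample rule with η(k)=1 can be sketched as below, assuming normalized augmented samples, a zero initial weight vector, cyclic presentation of the samples, and αᵗy = 0 treated as a misclassification. The `max_passes` limit is an added safeguard for non-separable data and is not part of the theorem.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product u^t v."""
    return sum(a * b for a, b in zip(u, v))


def fixed_increment_perceptron(Y: Sequence[Sequence[float]],
                               max_passes: int = 1000) -> List[float]:
    """Cycle through the samples; add each misclassified y to alpha (eta = 1)."""
    alpha = [0.0] * len(Y[0])
    for _ in range(max_passes):
        updated = False
        for y in Y:
            if dot(alpha, y) <= 0:  # y is misclassified by the current alpha
                alpha = [a + v for a, v in zip(alpha, y)]
                updated = True
        if not updated:             # a full pass with no errors: solution found
            return alpha
    raise RuntimeError("no solution vector found within max_passes")


# Hypothetical normalized samples (omega_2 samples already negated).
Y = [[1.0, 2.0, 2.0], [1.0, 1.0, 3.0], [-1.0, 1.0, 1.0], [-1.0, 2.0, 0.0]]

alpha = fixed_increment_perceptron(Y)
print(alpha)
print([dot(alpha, y) for y in Y])  # all strictly positive at a solution
```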

Perceptron rule (cont’d)
[Figure: single-sample trajectory in solution space; order of examples: y2 y3 y1 y3] The “batch” algorithm leads to a smoother trajectory in solution space.