Linear Discriminant Functions


1 Linear Discriminant Functions
Presenter: 虞台文

2 Contents
Introduction
Linear Discriminant Functions and Decision Surface
Linear Separability
Learning
Gradient Descent Algorithm
Newton's Method

3 Linear Discriminant Functions
Introduction

4 Decision-Making Approaches
Probabilistic approaches: based on the underlying probability densities of the training sets. For example, the Bayesian decision rule assumes that the underlying probability densities are available. Discriminative approaches: assume that the proper forms of the discriminant functions are known, and use the samples to estimate the values of the parameters of the classifier.

5 Linear Discriminating Functions
Linear discriminant functions are easy to compute, analyze, and learn, which makes linear classifiers attractive candidates for initial, trial classifiers. Learning proceeds by minimizing a criterion function, e.g., the training error. Difficulty: a small training error does not guarantee a small test error.

6 Linear Discriminant Functions
Linear Discriminant Functions and Decision Surface

7 Linear Discriminant Functions
g(x) = wᵀx + w0, where w is the weight vector and w0 is the bias or threshold. The two-category classification rule: decide ω1 if g(x) > 0 and ω2 if g(x) < 0.
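
As a concrete illustration (not from the slides), a minimal NumPy sketch of this two-category rule; the weight vector, bias, and sample values are made up for the example:

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant function g(x) = w^T x + w0."""
    return float(np.dot(w, x) + w0)

def classify(x, w, w0):
    """Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0 (g(x) = 0 lies on the boundary)."""
    return 1 if g(x, w, w0) > 0 else 2

# Hypothetical example values, chosen only for illustration.
w = np.array([2.0, -1.0])   # weight vector: fixes the orientation of the hyperplane
w0 = -0.5                   # bias/threshold: fixes the location of the hyperplane
x = np.array([1.0, 0.5])
print(g(x, w, w0), classify(x, w, w0))   # 1.0 1
```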

8 Implementation
[Figure: a linear unit with inputs x1, …, xd, weights w1, …, wd, and bias w0, computing g(x).]

9 Implementation
[Figure: the same linear unit with an augmented input x0 = 1 whose weight is w0.]

10 Decision Surface
[Figure: the augmented linear unit whose output defines the decision surface g(x) = 0.]

11 Decision Surface
[Figure: the hyperplane g(x) = 0 in feature space, with normal vector w and distance w0/||w|| from the origin.]

12 Decision Surface
1. A linear discriminant function divides the feature space by a hyperplane.
2. The orientation of the surface is determined by the normal vector w.
3. The location of the surface is determined by the bias w0.
[Figure: the hyperplane g(x) = 0, its normal vector w, and the distance w0/||w|| from the origin.]

13 Augmented Space
In the d-dimensional feature space (the figure shows d = 2), the decision surface is g(x) = w1x1 + w2x2 + w0 = 0.
Let y = (1, x1, …, xd)ᵀ and a = (w0, w1, …, wd)ᵀ be the (d+1)-dimensional augmented feature and weight vectors.

14 Pass through the origin of augmented space
With the augmented feature vector y = (1, x1, …, xd)ᵀ and the augmented weight vector a = (w0, w1, …, wd)ᵀ, the problem moves from d dimensions to d+1 dimensions, and the decision surface g(x) = w1x1 + w2x2 + w0 = 0 becomes aᵀy = 0, which passes through the origin of the augmented space.

15 Augmented Space
[Figure: the decision surface g(x) = w1x1 + w2x2 + w0 = 0 in the feature space (x1, x2), and the corresponding plane aᵀy = 0 in the augmented space (y0, y1, y2), which contains the origin and has a as its normal vector.]

16 Augmented Space
Decision surface in feature space: passes through the origin only when w0 = 0.
Decision surface in augmented space: always passes through the origin.
By using this mapping, the problem of finding the weight vector w and the threshold w0 is reduced to finding a single vector a.
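
A small sketch of this augmentation, assuming the y = (1, x)ᵀ, a = (w0, w)ᵀ convention above; the function names and numeric values are illustrative:

```python
import numpy as np

def augment(x):
    """Map a d-dimensional feature vector x to the (d+1)-dimensional y = (1, x1, ..., xd)."""
    return np.concatenate(([1.0], x))

def augment_weights(w, w0):
    """Pack the threshold and weights into a single vector a = (w0, w1, ..., wd)."""
    return np.concatenate(([w0], w))

w, w0 = np.array([2.0, -1.0]), -0.5          # illustrative values only
x = np.array([1.0, 0.5])
a, y = augment_weights(w, w0), augment(x)
# a^T y equals w^T x + w0, and the surface a^T y = 0 always contains the origin
# of the augmented space.
print(np.dot(a, y), np.dot(w, x) + w0)       # both print 1.0
```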

17 Linear Discriminant Functions
Linear Separability

18 The Two-Category Case 2 2 1 1 Not Linearly Separable

19 The Two-Category Case
Given a set of samples y1, y2, …, yn, some labeled ω1 and some labeled ω2, if there exists a vector a such that
aᵀyi > 0 if yi is labeled ω1,
aᵀyi < 0 if yi is labeled ω2,
then the samples are said to be linearly separable. How do we find such an a?

20 Normalization
The samples y1, y2, …, yn are linearly separable if there exists a vector a with aᵀyi > 0 for every yi labeled ω1 and aᵀyi < 0 for every yi labeled ω2.
Dropping the labels and replacing every sample labeled ω2 by its negative, this is equivalent to finding a vector a such that aᵀyi > 0 for all i.
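
A minimal sketch of this normalization step and the resulting separability test, assuming augmented samples stored in the rows of an array `Y` with labels 1/2 (names and data are illustrative):

```python
import numpy as np

def normalize_samples(Y, labels):
    """Replace every augmented sample labeled omega_2 by its negative, so that a
    separating vector a must satisfy a^T y_i > 0 for all i."""
    signs = np.where(np.asarray(labels) == 1, 1.0, -1.0)
    return Y * signs[:, None]

def is_separated_by(a, Y_norm):
    """True if a^T y_i > 0 for every normalized sample (rows of Y_norm)."""
    return bool(np.all(Y_norm @ a > 0))

# Tiny illustrative example: two augmented samples per class.
Y = np.array([[1.0, 2.0, 1.0],    # labeled omega_1
              [1.0, 1.5, 0.5],    # labeled omega_1
              [1.0, -1.0, -1.0],  # labeled omega_2
              [1.0, -2.0, 0.0]])  # labeled omega_2
Y_norm = normalize_samples(Y, labels=[1, 1, 2, 2])
print(is_separated_by(np.array([0.0, 1.0, 0.0]), Y_norm))  # True for this toy data
```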

21 Solution Region in Feature Space
[Figure: a separating plane and the solution region in feature space, shown for the normalized case.]

22 Solution Region in Weight Space
[Figure: separating planes and the solution region in weight space; requiring a margin shrinks the solution region.]

23 Linear Discriminant Functions
Learning

24 Criterion Function To facilitate learning, we usually define a scalar criterion function. It usually represents the penalty or cost of a solution. Our goal is to minimize its value. ≡Function optimization.

25 Learning Algorithms To design a learning algorithm, we face the following problems: Whether to stop? In what direction to proceed? How long a step to take? Is the criterion satisfactory? e.g., steepest decent : learning rate

26 Criterion Functions: The Two-Category Case
J(a) = the number of misclassified patterns.
[Figure: J(a) over weight space with the solution region marked; from a given state, where to go?]
Is this criterion appropriate?

27 Criterion Functions: The Two-Category Case
Perceptron criterion function: Jp(a) = Σ (−aᵀy), summed over Y, the set of misclassified patterns.
[Figure: Jp(a) over weight space with the solution region marked; from a given state, where to go?]
Is this criterion better? What problem does it have?
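
A sketch of this criterion and its gradient for normalized, augmented samples (here a sample counts as misclassified when aᵀy ≤ 0); the array layout and names are assumptions, not the slides' code:

```python
import numpy as np

def perceptron_criterion(a, Y):
    """J_p(a) = sum over misclassified y of (-a^T y), for normalized augmented
    samples stored in the rows of Y (misclassified means a^T y <= 0)."""
    margins = Y @ a
    mis = margins <= 0          # the set Y of misclassified patterns
    return -np.sum(margins[mis])

def perceptron_gradient(a, Y):
    """Gradient of J_p(a): sum over misclassified y of (-y)."""
    mis = (Y @ a) <= 0
    return -np.sum(Y[mis], axis=0)
```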

28 Criterion Functions: The Two-Category Case
A relative of the perceptron criterion function: Jq(a) = Σ (aᵀy)², summed over Y, the set of misclassified patterns.
[Figure: Jq(a) over weight space with the solution region marked; from a given state, where to go?]
Is this criterion much better? What problem does it have?

29 Criterion Functions: The Two-Category Case
Jr(a) = ½ Σ (aᵀy − b)² / ||y||², summed over Y, the set of patterns with aᵀy ≤ b.
[Figure: Jr(a) over weight space with the solution region marked; from a given state, where to go?]
What is the difference from the previous one? Is this criterion good enough? Are there others?

30 Gradient Descent Algorithm
Learning: Gradient Descent Algorithm

31 Gradient Descent Algorithm
Our goal is to go downhill.
[Figure: the surface J(a) over (a1, a2) and its contour map, with the current point (a1, a2) marked.]

32 Gradient Descent Algorithm
Our goal is to go downhill. Define the gradient ∇aJ = (∂J/∂a1, ∂J/∂a2, …)ᵀ; ∇aJ is a vector.

33 Gradient Descent Algorithm
Our goal is to go downhill: move in the direction of the negative gradient, −∇aJ.

34 Gradient Descent Algorithm
Our goal is to go downhill: from (a1, a2), step in the direction −∇aJ.
[Figure: the surface J(a) with the direction −∇aJ marked "go this way" at the current point.]
How long a step shall we take?

35 Gradient Descent Algorithm
η(k): the learning rate.
Initial setting: a, θ, k = 0
do k ← k + 1
   a ← a − η(k) ∇aJ(a)
until |∇aJ(a)| < θ
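
A direct Python transcription of this loop (illustrative sketch, not the slides' code); the criterion's gradient, the learning-rate schedule eta, and the threshold theta are supplied by the caller:

```python
import numpy as np

def gradient_descent(a, grad_J, eta, theta, max_iter=10000):
    """Basic gradient descent following the loop above:
    repeat a <- a - eta(k) * grad_J(a) until |grad_J(a)| < theta."""
    k = 0
    while True:
        k += 1
        a = a - eta(k) * grad_J(a)
        if np.linalg.norm(grad_J(a)) < theta or k >= max_iter:
            return a

# Illustrative use: minimize J(a) = ||a||^2 with a small constant learning rate.
a_min = gradient_descent(a=np.array([3.0, -2.0]),
                         grad_J=lambda a: 2.0 * a,
                         eta=lambda k: 0.1,
                         theta=1e-6)
```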

36 Trajectory of Steepest Descent
If an improper learning rate η(k) is used, the convergence rate may be poor. Too small: slow convergence. Too large: overshooting, possibly leaving the basin of attraction of a*.
Furthermore, the best descent direction is not necessarily, and in fact is quite rarely, the direction of steepest descent.
[Figure: a trajectory a0, a1, … with J(a0) > J(a1) > …, descending toward a* within its basin of attraction.]

37 Learning Rate
Paraboloid: f(x) = ½ xᵀQx − bᵀx + c, where Q is symmetric and positive definite.

38 Learning Rate Paraboloid
All smooth functions can be approximated by paraboloids in a sufficiently small neighborhood of any point.

39 Learning Rate
We will discuss the convergence properties using paraboloids.
Global minimum x*: setting ∇f(x) = Qx − b = 0 gives x* = Q⁻¹b.

40 Learning Rate
Paraboloid: define the error E(x) = ½ (x − x*)ᵀQ(x − x*).

41 Learning Rate
Paraboloid: define the error E(x) = ½ (x − x*)ᵀQ(x − x*). We want to minimize f(x). Clearly, f(x) = f(x*) + E(x), and minimizing f(x) is equivalent to minimizing E(x).

42 Learning Rate
Suppose that we are at yk. The steepest descent direction will be pk = −∇f(yk) = b − Qyk. Let the learning rate be ηk; that is, yk+1 = yk + ηk pk.

43 Learning Rate
We'll choose the most 'favorable' ηk, i.e., the one that minimizes f(yk + ηk pk). Setting df(yk + ηk pk)/dηk = 0, we get ηk = pkᵀpk / pkᵀQpk.
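
A small numeric check of this formula on an assumed quadratic f(y) = ½ yᵀQy − bᵀy; the matrix Q, vector b, and starting point are made-up illustration values:

```python
import numpy as np

# Illustrative quadratic f(y) = 1/2 y^T Q y - b^T y.
Q = np.array([[4.0, 1.0],
              [1.0, 2.0]])            # symmetric, positive definite
b = np.array([1.0, 1.0])
f = lambda v: 0.5 * v @ Q @ v - b @ v

y = np.array([0.0, 0.0])              # current point y_k
p = b - Q @ y                         # steepest descent direction p_k = -grad f(y_k)
eta = (p @ p) / (p @ Q @ p)           # optimal learning rate eta_k = p^T p / (p^T Q p)
y_next = y + eta * p                  # y_{k+1} = y_k + eta_k * p_k

# The optimal eta_k minimizes f along p_k, so the step cannot increase f.
assert f(y_next) <= f(y)
print(eta, f(y), f(y_next))           # 0.25, 0.0, -0.25
```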

44 Learning Rate
If Q = I, then ηk = 1 and yk+1 = yk + pk = y*: the minimum is reached in a single step.

45 Convergence Rate
With the optimal learning rate ηk, the error E(y) = ½ (y − y*)ᵀQ(y − y*) satisfies E(yk+1) = [1 − (pkᵀpk)² / ((pkᵀQpk)(pkᵀQ⁻¹pk))] E(yk).

46 Convergence Rate
[Kantorovich Inequality] Let Q be a positive definite, symmetric, n×n matrix. For any vector x ≠ 0,
(xᵀx)² / ((xᵀQx)(xᵀQ⁻¹x)) ≥ 4λ1λn / (λ1 + λn)²,
where λ1 ≥ λ2 ≥ … ≥ λn > 0 are the eigenvalues of Q. Combined with the previous slide, E(yk+1) ≤ ((λ1 − λn)/(λ1 + λn))² E(yk); the smaller this factor, the better.

47 Convergence Rate
Condition number: κ = λ1/λn.
[Figure: contours of f for a small condition number (faster convergence) and a large condition number (slower convergence).]
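
A quick experiment (assumed setup, for illustration only) showing how the condition number of Q affects the number of optimal-step steepest-descent iterations on f(y) = ½ yᵀQy:

```python
import numpy as np

def descent_steps(Q, y0, tol=1e-8, max_iter=100000):
    """Steepest descent with the optimal step on f(y) = 1/2 y^T Q y;
    returns the number of iterations until ||grad f|| < tol."""
    y = y0.astype(float)
    for k in range(max_iter):
        g = Q @ y                       # gradient of f at y
        if np.linalg.norm(g) < tol:
            return k
        eta = (g @ g) / (g @ Q @ g)     # optimal learning rate
        y = y - eta * g
    return max_iter

y0 = np.array([1.0, 1.0])
print(descent_steps(np.diag([1.0, 2.0]), y0))    # small condition number: few steps
print(descent_steps(np.diag([1.0, 50.0]), y0))   # large condition number: many steps
```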

48 Convergence Rate
Paraboloid: suppose we are now at xk. Updating rule: xk+1 = xk + ηk pk with pk = −∇f(xk) and the optimal ηk; each step then shrinks the error by at least the factor ((λ1 − λn)/(λ1 + λn))².

49 Trajectory of Steepest Descent
In this case, the condition number of Q is moderately large. One then sees that the best descent direction is not necessarily, and in fact is quite rarely, the direction of steepest descent.

50 Learning: Newton's Method

51 Global minimum of a Paraboloid
We can find the global minimum of a paraboloid by setting its gradient to zero.

52 Function Approximation
Taylor series expansion: f(x) ≈ f(x0) + ∇f(x0)ᵀ(x − x0) + ½ (x − x0)ᵀH(x0)(x − x0).
All smooth functions can be approximated by paraboloids in a sufficiently small neighborhood of any point.

53 Function Approximation
Hessian matrix: H(x) = [∂²f(x)/∂xi∂xj], the matrix of second partial derivatives of f.

54 Newton's Method
Setting the gradient of the quadratic approximation to zero gives the update xk+1 = xk − H⁻¹(xk) ∇f(xk).
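
A sketch of this Newton update for a general smooth criterion, with the gradient and Hessian supplied by the caller; the function names, tolerances, and the quadratic test problem are assumptions for illustration:

```python
import numpy as np

def newton_method(x, grad_f, hess_f, tol=1e-8, max_iter=100):
    """Newton's method: x <- x - H(x)^{-1} grad_f(x) until the gradient is small.
    Not applicable when the Hessian is singular at some iterate."""
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess_f(x), g)   # solve H d = g rather than forming H^{-1}
    return x

# On a quadratic f(x) = 1/2 x^T Q x - b^T x, the very first step lands on the minimum.
Q = np.array([[4.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = newton_method(np.zeros(2),
                       grad_f=lambda x: Q @ x - b,
                       hess_f=lambda x: Q)
print(x_star, np.linalg.solve(Q, b))            # both equal Q^{-1} b
```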

55 Comparison
[Figure: trajectories of Newton's method and gradient descent on the same criterion function.]

56 Comparison Newton’s Method will usually give a greater improvement per step than the simple gradient decent algorithm, even with optimal value of k. However, Newton’s Method is not applicable if the Hessian matrix Q is singular. Even when Q is nonsingular, compute Q is time consuming O(d3). It often takes less time to set k to a constant (small than necessary) than it is to compute the optimum k at each step.

