1 Classification
Course web page: vision.cis.udel.edu/~cv
May 14, 2003, Lecture 34

2 Announcements
Read the selection from Trucco & Verri on deformable contours for the guest lecture on Friday
On Monday I'll cover neural networks (Forsyth & Ponce Chapter 22.4) and begin reviewing for the final

3 Outline
Linear discriminants
–Two-class
–Multicategory
Criterion functions for computing discriminants
Generalized linear discriminants

4 Discriminants for Classification
Previously, the decision boundary was chosen based on the underlying class probability distributions
–Completely known distribution
–Estimate parameters for a distribution with known form
–Nonparametrically approximate an unknown distribution
Idea: Ignore the class distributions and simply assume the decision boundary is of known form with unknown parameters
Discriminant function (two-class): Which side of the boundary is a data point on?
–Linear discriminant ⇒ hyperplane decision boundary
In general, this is not optimal

5 Two-Class Linear Discriminants
Represent the n-dimensional data points x in homogeneous coordinates: y = (x^T, 1)^T
The decision boundary is the hyperplane a = (w^T, w_0)^T
–w = (w_1, …, w_n)^T: the plane's normal (weight vector)
–w_0: the plane's offset from the origin (bias or threshold) in x space (in y space the plane always passes through the origin)
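A minimal sketch of this representation in NumPy, with made-up values for w and w_0 (not part of the original slides):

```python
import numpy as np

def to_homogeneous(X):
    """Append a constant 1 to each n-dimensional point x, giving y = (x^T, 1)^T."""
    X = np.atleast_2d(X)                                      # shape (num_points, n)
    return np.hstack([X, np.ones((X.shape[0], 1))])           # shape (num_points, n + 1)

# The parameter vector a = (w^T, w0)^T bundles the normal w and the bias w0.
w = np.array([2.0, -1.0])                 # hypothetical weight vector
w0 = 0.5                                  # hypothetical bias/threshold
a = np.append(w, w0)
Y = to_homogeneous(np.array([[1.0, 1.0], [-1.0, 2.0]]))       # toy data points
```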

6 Discriminant Function
Define the two-class linear discriminant function with the dot product g(x) = a^T y
–g(x) = 0 ⇒ the normal vector and y are orthogonal ⇒ y is on the plane
–g(x) > 0 ⇒ the angle between the vectors is acute ⇒ y is on the side of the plane that the normal points to ⇒ classify as c_1
–g(x) < 0 ⇒ the angle between the vectors is obtuse ⇒ y is on the plane's other side ⇒ classify as c_2
(figure from Duda et al.)
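A small illustrative classifier built on g(x) = a^T y; the parameter values and test points are invented for the example:

```python
import numpy as np

def classify(a, x):
    """Return 'c1' if g(x) = a^T y > 0, 'c2' if g(x) < 0, 'boundary' if g(x) = 0."""
    y = np.append(np.asarray(x, dtype=float), 1.0)   # homogeneous coordinates
    g = a @ y
    return "c1" if g > 0 else ("c2" if g < 0 else "boundary")

a = np.array([2.0, -1.0, 0.5])      # hypothetical (w1, w2, w0)
print(classify(a, [1.0, 1.0]))      # g = 2 - 1 + 0.5 = 1.5 > 0  -> c1
print(classify(a, [-1.0, 2.0]))     # g = -2 - 2 + 0.5 = -3.5 < 0 -> c2
```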

7 Distance to Decision Boundary
The distance from y to the hyperplane in y space is given by the projection a^T y / ||a||
Since ||a|| ≥ ||w||, this is a lower bound on the distance from x to the hyperplane in x space
(figure from Duda et al.)
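A quick numerical check of the two quantities, using an invented a; it verifies that |a^T y| / ||a|| never exceeds |g(x)| / ||w||, the distance in x space:

```python
import numpy as np

a = np.array([2.0, -1.0, 0.5])               # hypothetical (w, w0)
w = a[:-1]
x = np.array([1.0, 1.0])
y = np.append(x, 1.0)

g = a @ y
dist_y_space = abs(g) / np.linalg.norm(a)    # distance from y to the plane in y space
dist_x_space = abs(g) / np.linalg.norm(w)    # distance from x to the plane in x space
assert dist_y_space <= dist_x_space          # because ||a|| >= ||w||
```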

8 Multicategory Linear Discriminants
Given C categories, define C discriminant functions g_i(x) = a_i^T y
Classify x as a member of c_i if g_i(x) ≥ g_j(x) for all j ≠ i
(figure from Duda et al.)
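A sketch of the multicategory rule: stack the a_i as rows of a matrix and take the argmax of the discriminants (the weights here are made up):

```python
import numpy as np

def classify_multicategory(A, x):
    """A has one row a_i per category; pick the category with the largest g_i(x) = a_i^T y."""
    y = np.append(np.asarray(x, dtype=float), 1.0)
    scores = A @ y                     # g_1(x), ..., g_C(x)
    return int(np.argmax(scores))      # index i of the winning category c_i

A = np.array([[ 1.0,  0.0, 0.0],       # hypothetical a_1
              [-1.0,  1.0, 0.0],       # hypothetical a_2
              [ 0.0, -1.0, 1.0]])      # hypothetical a_3
print(classify_multicategory(A, [2.0, 0.5]))   # -> 0 (category c_1)
```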

9 Characterizing Solutions
Separability: there exists at least one a in weight space (y space) that classifies all samples correctly
Solution region: the region of weight space in which every a separates the classes (not the same as the decision regions!)
(figures from Duda et al.: separable data vs. non-separable data)

10 Normalization
Suppose each data point y_i is classified correctly as c_1 when a^T y_i > 0 and as c_2 when a^T y_i < 0
Idea: Replace the c_2-labeled samples with their negations -y_i
–This simplifies things, since now we need only look for an a such that a^T y_i > 0 for all of the data
(figure from Duda et al.)
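The normalization trick in code, assuming a labeling convention of +1 for c_1 and -1 for c_2 (the slide does not specify one):

```python
import numpy as np

def normalize_samples(Y, labels):
    """Negate the c2-labeled rows so that a correct a satisfies Y_norm @ a > 0 componentwise."""
    labels = np.asarray(labels).reshape(-1, 1)   # +1 for c1, -1 for c2
    return Y * labels

Y = np.array([[1.0, 2.0, 1.0],      # toy samples in homogeneous coordinates
              [3.0, 0.5, 1.0]])
labels = [+1, -1]                   # the second sample belongs to c2
Y_norm = normalize_samples(Y, labels)
```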

11 Margin
Set a minimum distance between the decision hyperplane and the nearest data point by requiring a^T y ≥ b. For a particular point y_i, this distance is b / ||y_i||
Intuitively, we want a maximal margin
(figure from Duda et al.)

12 Criterion Functions
To actually solve for a discriminant a, define a criterion function J(a; y_1, …, y_d) that is minimized when a is a solution
–For example, let J_e = the number of misclassified data points, which is minimal (J_e = 0) for solutions
For practical purposes, we will use something like gradient descent on J to arrive at a solution
–J_e is unsuitable for gradient descent since it is piecewise constant, so its gradient is zero or undefined everywhere
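A generic gradient-descent loop of the kind the slide alludes to; the step size, tolerance, and iteration cap are placeholder choices:

```python
import numpy as np

def gradient_descent(grad_J, a0, step=0.01, tol=1e-6, max_iters=10000):
    """Minimize a criterion J(a) given its gradient grad_J(a), starting from a0."""
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iters):
        g = grad_J(a)
        if np.linalg.norm(g) < tol:   # (near-)zero gradient: stop
            break
        a = a - step * g              # step against the gradient
    return a
```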

13 Example: Plot of J_e (figure from Duda et al.)

14 Perceptron Criterion Function
Define the following piecewise linear function: J_p(a) = Σ_{y ∈ Y(a)} (-a^T y), where Y(a) is the set of samples misclassified by a
This is proportional to the sum of the distances between the misclassified samples and the decision hyperplane
(figure from Duda et al.)
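A sketch of batch perceptron training driven by J_p: the gradient of J_p is minus the sum of the misclassified samples, so each descent step adds that sum back to a. The learning rate and epoch limit are arbitrary, and the rows of Y_norm are assumed to be normalized as on slide 10:

```python
import numpy as np

def perceptron(Y_norm, lr=1.0, max_epochs=100):
    """Batch perceptron: Y_norm has normalized samples as rows, so a is correct when Y_norm @ a > 0."""
    a = np.zeros(Y_norm.shape[1])
    for _ in range(max_epochs):
        misclassified = Y_norm[Y_norm @ a <= 0]    # the set Y(a)
        if len(misclassified) == 0:                # J_p(a) = 0: a solution (separable case)
            return a
        # grad J_p(a) = -sum of misclassified y's, so the descent step adds their sum
        a = a + lr * misclassified.sum(axis=0)
    return a                                       # may not have converged if non-separable
```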

15 Non-Separable Data: Error Minimization
The perceptron assumes separability (it won't stop otherwise)
–It only focuses on erroneous classifications
Idea: Minimize the mean squared error over all of the data
Trying to put the decision hyperplane exactly at the margin leads to linear equations rather than linear inequalities: a^T y_i = y_i^T a = b_i
Stack all of the data points as row vectors y_i^T and collect the margins b_i to get the system of equations Y a = b
This can be solved with the pseudoinverse: a = Y^+ b
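A minimal pseudoinverse solution of Y a = b using NumPy's pinv; setting all margins b_i to 1 is an arbitrary illustrative choice:

```python
import numpy as np

def mse_discriminant(Y, b=None):
    """Solve Y a = b in the least-squares sense with the pseudoinverse: a = Y^+ b."""
    if b is None:
        b = np.ones(Y.shape[0])        # arbitrary positive margins
    return np.linalg.pinv(Y) @ b

Y = np.array([[ 1.0,  2.0,  1.0],      # toy normalized samples as rows
              [ 2.0,  0.0,  1.0],
              [-1.0, -1.0, -1.0]])
a = mse_discriminant(Y)
```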

16 Non-Separable Data: Error Minimization
An alternative to the pseudoinverse approach is gradient descent on the criterion function J_s(a) = ||Ya - b||^2
This is called the Widrow-Hoff or least-mean-squared (LMS) procedure
It doesn't necessarily converge to a separating hyperplane even if one exists
Advantages
–Avoids problems that occur when Y^T Y is singular
–Avoids the need to manipulate large matrices
(figure from Duda et al.)
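A sketch of the Widrow-Hoff / LMS procedure, doing single-sample gradient descent on J_s(a) = ||Ya - b||^2; the decaying step size is one common choice, not something specified on the slide:

```python
import numpy as np

def lms(Y, b, eta0=0.1, max_epochs=100):
    """Single-sample gradient descent on J_s(a) = ||Y a - b||^2 (Widrow-Hoff / LMS)."""
    a = np.zeros(Y.shape[1])
    k = 1
    for _ in range(max_epochs):
        for y_i, b_i in zip(Y, b):
            eta = eta0 / k                        # decaying step size
            a = a + eta * (b_i - a @ y_i) * y_i   # per-sample step against the gradient
            k += 1
    return a
```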

17 Generalized Linear Discriminants
We originally constructed the vector y from the n-vector x by simply adding one coordinate for the homogeneous representation
We can go further and use any number m of arbitrary functions: y = (y_1(x), …, y_m(x)), sometimes called a basis expansion
Even if the y_i(x) are nonlinear, we can still use linear methods in the m-dimensional y space
Why? Because this lets us approximate nonlinear discriminant functions in x space in a straightforward way
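One way to write the basis expansion in code: each y_j is just a function of x, and the expanded vectors feed into the same linear machinery as before. The particular functions below are only an illustration:

```python
import numpy as np

def expand(x, basis_functions):
    """Map x to y = (y_1(x), ..., y_m(x)) using arbitrary basis functions."""
    return np.array([f(x) for f in basis_functions])

# Hypothetical basis functions for a 2-D input x = (x1, x2):
basis = [
    lambda x: 1.0,           # constant (homogeneous coordinate)
    lambda x: x[0],
    lambda x: x[1],
    lambda x: x[0] * x[1],   # a nonlinear term
]
y = expand(np.array([2.0, 3.0]), basis)   # -> [1.0, 2.0, 3.0, 6.0]
```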

18 Example: Quadratic Discriminant
Define the 1-D quadratic discriminant function as g(x) = a_1 + a_2 x + a_3 x^2
–This is nonlinear in x, so we can't directly use the methods described thus far
–But by mapping to 3-D with y = (1, x, x^2)^T, we can use linear methods (e.g., perceptron, LMS) to solve for a = (a_1, a_2, a_3)^T in y space
Inefficiency: for larger n, the number of variables may overwhelm the amount of data, since a full quadratic expansion has m = (n + 1)(n + 2)/2 terms
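A sketch of the 1-D quadratic example: map each scalar x to y = (1, x, x^2), then reuse the pseudoinverse solver from slide 15. The toy data (c_1 near zero, c_2 farther out) are invented for the illustration:

```python
import numpy as np

def quad_features(x):
    """Map scalar inputs x to y = (1, x, x^2)."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x**2], axis=-1)

x = np.array([-3.0, -2.0, 0.0, 0.5, 2.0, 3.0])
labels = np.array([-1, -1, +1, +1, -1, -1])    # toy data: c1 (+1) near zero, c2 (-1) farther out

Y = quad_features(x) * labels[:, None]         # normalization trick from slide 10
a = np.linalg.pinv(Y) @ np.ones(len(x))        # MSE solution, as on slide 15
g = quad_features(x) @ a                       # g(x) = a1 + a2*x + a3*x^2
# The sign of g now gives a quadratic decision rule in the original 1-D x space.
```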

19 Example: Quadratic Discriminant
(figures from Duda et al.: no linear decision boundary separates the classes in x space, but a hyperplane separates them in y space)

20 Support Vector Machines (SVM)
Map the input nonlinearly to a higher-dimensional space (where, in general, there is a separating hyperplane)
Find the separating hyperplane that maximizes the distance to the nearest data point (i.e., the margin)
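For completeness, a hedged sketch of fitting a kernel SVM with scikit-learn, which is not part of the original lecture; the RBF kernel and the toy data are arbitrary choices:

```python
import numpy as np
from sklearn.svm import SVC    # assumes scikit-learn is available

# Toy 2-D data that is not linearly separable in the original space.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) < 1.0).astype(int)   # class 1 inside the unit circle

# The RBF kernel implicitly maps the input to a higher-dimensional space;
# the SVM then finds the maximum-margin separating hyperplane there.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
print(clf.predict([[0.0, 0.0], [2.0, 2.0]]))        # expected: [1 0]
```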

21 Example: SVM for Gender Classification of Faces
Data: 1,755 cropped face images, each 21 x 12 pixels
Error rates
–Human: 30.7% (hampered by the lack of hair cues?)
–SVM: 3.4% (5-fold cross-validation)
(face images courtesy of B. Moghaddam, showing humans' top misclassifications; from Moghaddam & Yang, 2001)

