1 Fisher’s Linear Discriminant Analysis and Its Use in Feature Selection for Undergraduates and Graduates
Atalay Barkana, Mehmet Koc, Ozen Yelbasi

2 Main topics covered:
1. Introductory Derivations before Fisher’s LDA
2. Fisher’s Metric and Its Maximization
3. Traces in Fisher’s LDA
4. Weighted Features for Feature Selection

3 1. Introductory Derivations before Fisher’s LDA
1.1 Let’s start with a two-data problem
Let’s suppose that we work in 2D-space and we have two vectors representing two objects belonging to one class, C = {a_1, a_2}, where a_1 = [a_11 a_12]^T and a_2 = [a_21 a_22]^T. Here a_11 and a_21 are the numerical values of one feature, and a_12 and a_22 are the numerical values of another feature for the two objects. For example, the first feature may be the lengths and the second the widths of the objects in the same class. A geometric interpretation of the vectors in 2D-space: let the difference between the two vectors be a_diff = a_2 − a_1.

4 Now let’s try to rotate the x_1, x_2 axes so that the projection of a_diff onto one of the new axes has the largest projection length while the projection onto the other new axis has the smallest. These new axes are shown in purple. From the geometry it is obvious that the projection of a_diff onto x_2′ is zero and the projection of a_diff onto x_1′ is a_diff itself, that is, proj_{x_1′} a_diff = a_diff with length ‖a_diff‖.

5 Let the unit basis vectors of the x_1, x_2 axes system be e_1 = [1 0]^T and e_2 = [0 1]^T. Similarly, let w_1 and w_2 represent the unit basis vectors of the x_1′-x_2′ axes system given in red color. Then ‖w_1‖ = ‖w_2‖ = 1 and w_1 ∙ w_2 = 0. The projection of a_diff onto w_1 according to Calculus is a_diff,proj_w1 = (a_diff ∙ w_1) w_1 and onto w_2 is a_diff,proj_w2 = (a_diff ∙ w_2) w_2. According to Linear Algebra, a_diff,proj_wi = (a_diff^T w_i) w_i = (w_i^T a_diff) w_i, i = 1, 2.

6 a_diff,proj_wi = (a_diff^T w_i) w_i = (w_i^T a_diff) w_i, i = 1, 2.
Its length square becomes
l_i^2 = ‖a_diff,proj_wi‖^2 = (a_diff,proj_wi)^T (a_diff,proj_wi) = (w_i^T a_diff)^T (w_i^T a_diff) (w_i^T w_i) = (w_i^T w_i) (w_i^T a_diff a_diff^T w_i) for i = 1, 2.
Therefore, since w_i^T w_i = 1, l_i^2 = w_i^T a_diff a_diff^T w_i for i = 1, 2. Notice that Φ = a_diff a_diff^T is a square matrix. Also remember that w_1 and w_2 are directional vectors in the directions of x_1′ and x_2′.

7 Now we are ready to find the w_1 and w_2 vectors that minimize and/or maximize the length square l_i^2 for i = 1, 2. But we have to add a constraint before starting to take the derivatives with respect to w_i. Our constraints will be w_i^T w_i = 1 for i = 1, 2. Introducing the Lagrange multipliers λ_i, i = 1, 2, the metric to be optimized will be as follows: M_i = w_i^T Φ w_i + λ_i (1 − w_i^T w_i), i = 1, 2.

8 The derivative of M_i w.r.t. w_i is dM_i/dw_i = [∂M_i/∂w_i1 ∂M_i/∂w_i2]^T, where w_i1 and w_i2 are the first and the second elements of w_i, respectively. Then the first-order derivatives dM_i/dw_i = 2(Φ − λ_i I) w_i = 0, i = 1, 2, will give the critical values of M_i and the related w_i. Remember that (Φ − λ_i I) w_i = 0 is an eigenvalue-eigenvector problem. Since Φ is a real, symmetric matrix whose rank is equal to 1, one of the eigenvalues will be zero while the other one will be nonzero and positive.

9 Example: Let a_diff = [1 0]^T. Then Φ = a_diff a_diff^T = [1 0; 0 0].
The eigenvalue-eigenvector relation is Φ w_i = λ_i w_i. For λ = 0, w_1 = [0 1]^T; then a_diff^T w_1 = 0, so the projection of a_diff has the smallest length. For λ = 1, w_2 = [1 0]^T; then a_diff^T w_2 = 1, so the projection of a_diff has the largest length. One correction: the difference vectors, and correspondingly Φ, in Fisher’s LDA are calculated with respect to the class average vectors. Let’s extend our derivation.
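For readers who want to verify this numerically, here is a minimal sketch in Python/NumPy (NumPy is our own choice, not something used in the original slides); it builds Φ from the example’s a_diff and checks the eigenvalues and projection lengths.

```python
import numpy as np

# Difference vector from the example on this slide.
a_diff = np.array([1.0, 0.0])

# Rank-1 matrix Phi = a_diff a_diff^T.
Phi = np.outer(a_diff, a_diff)

# Phi is real and symmetric, so eigh returns real eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Phi)
print(eigvals)                    # [0. 1.] -> one zero, one positive eigenvalue
print(eigvecs)                    # columns are w_1 (lambda=0) and w_2 (lambda=1), up to sign

# Squared projection lengths of a_diff onto the two eigenvectors.
print((eigvecs.T @ a_diff) ** 2)  # [0. 1.] -> smallest and largest squared length
```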

10 1.2 Three Data Problem toward Covariance
Let a_1, a_2, a_3 ϵ ℝ^(2×1) belong to the same class C. The basis vectors of the x_1 and x_2 axes are e_1 and e_2. The basis vectors of the x_1′ and x_2′ axes are w_1 and w_2. In the figure above, a_diff_i = a_i − a_ave for i = 1, 2, 3.

11 From our previous experience, the projection of a difference vector a_diff_i onto w_j, j = 1, 2, has a length square of l_ij^2 = w_j^T a_diff_i a_diff_i^T w_j for i = 1, 2, 3 and j = 1, 2. Then the sum of the squares is
S_j = Σ_{i=1}^{3} l_ij^2 = w_j^T (Σ_{i=1}^{3} a_diff_i a_diff_i^T) w_j, j = 1, 2.
The summation in parentheses is called the covariance matrix (or within-class scatter matrix) Φ of class C. The metric to be minimized (or maximized), including the constraint, is
M_j = w_j^T Φ w_j + λ_j (1 − w_j^T w_j), j = 1, 2.
The critical points can be found by taking the first-order derivatives: dM_j/dw_j = 2(Φ − λ_j I) w_j = 0, j = 1, 2.

12 Example: Let a_1, a_2, a_3 be the feature vectors of the same class and let a_ave be the average of the class. Suppose the resulting difference vectors are a_diff_1 = [−2 −1]^T, a_diff_2 = [2 −1]^T, and a_diff_3 = [0 2]^T (the difference vectors taken from the class average always sum to zero). By using the difference vectors the covariance matrix can be calculated as below:
Φ = a_diff_1 a_diff_1^T + a_diff_2 a_diff_2^T + a_diff_3 a_diff_3^T.

13 Example (continued): Φ = a_diff_1 a_diff_1^T + a_diff_2 a_diff_2^T + a_diff_3 a_diff_3^T = [8 0; 0 6]. The eigenvalues of Φ are λ_1 = 8 and λ_2 = 6. From Φ w_1 = λ_1 w_1, w_1 = [1 0]^T, and from Φ w_2 = λ_2 w_2, w_2 = [0 1]^T. Since w_1 = e_1 and w_2 = e_2, no axes rotation is necessary. The sum of the largest projection squares of the difference vectors is sum_1 = (−2)^2 + 2^2 + 0^2 = 8 = λ_1 and the sum of the smallest projection squares of the difference vectors is sum_2 = (−1)^2 + (−1)^2 + 2^2 = 6 = λ_2. Notice the relation between the sums and the eigenvalues.
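The same check for this three-data example, as a short Python/NumPy sketch (the third difference vector is the one implied by the sum-to-zero property noted above):

```python
import numpy as np

# Difference vectors of the three-data example.
a_diff = np.array([[-2.0, -1.0],
                   [ 2.0, -1.0],
                   [ 0.0,  2.0]])

# Unscaled within-class covariance (scatter) matrix: sum of outer products.
Phi = sum(np.outer(d, d) for d in a_diff)
print(Phi)                                            # [[8. 0.] [0. 6.]]

eigvals, eigvecs = np.linalg.eigh(Phi)
print(eigvals)                                        # [6. 8.] -> lambda_2 = 6, lambda_1 = 8

# Sums of squared projections onto e_1 and e_2 equal the eigenvalues.
print(np.sum((a_diff @ np.array([1.0, 0.0])) ** 2))   # 8.0
print(np.sum((a_diff @ np.array([0.0, 1.0])) ** 2))   # 6.0
```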

14 Example (continued): Up to this point we have not used any scaling for the (within-class) covariance matrix. It can be a better idea to calculate it on a per-data basis, that is, Φ = (1/m) Σ_{i=1}^{m} a_diff_i a_diff_i^T, where m is the number of data in class C, a_ave = (1/m) Σ_{i=1}^{m} a_i, and a_diff_i = a_i − a_ave for i = 1, 2, …, m. Note that the zero eigenvalues of Φ indicate the directions w_j in which the lengths of the projections are all zero.

15 Fisher’s Metric and Its Maximization
Suppose that we have m data in each one of the C classes. From our previous derivation the within-class covariance (or scatter) matrix for each class will be
Φ_k = (1/m) Σ_{i=1}^{m} (a_ki − a_ave,k)(a_ki − a_ave,k)^T, k = 1, 2, …, C,
where a_ki is the ith data in the kth class and a_ave,k is the average of the kth class. The average of the within-class covariance matrices is
Φ_WT = (1/C) Σ_{k=1}^{C} Φ_k = (1/C)(Φ_1 + Φ_2 + … + Φ_C).

16 The metric w_j^T Φ_WT w_j still gives, for a directional vector w_j, the sum of the squared projection lengths of all the data in all classes; it can be maximized or minimized by including the constraints. In Fisher’s LDA, there is another metric included that emphasizes how the class centers are scattered in the feature vector space. For that, another covariance matrix, between the class centers, must be calculated. This is called the between-class covariance (scatter) matrix:
Φ_B = (1/C) Σ_{k=1}^{C} (a_ave,k − a_total_ave)(a_ave,k − a_total_ave)^T,
where a_total_ave is the average vector of all average feature vectors in all classes.
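A sketch of how Φ_WT and Φ_B can be computed from labeled data (the helper name scatter_matrices and the sample arrays C1, C2 are hypothetical; the per-data and per-class scalings follow the definitions above with equal class sizes):

```python
import numpy as np

def scatter_matrices(classes):
    """classes: list of (m x n) arrays, one per class (equal m, as on the slides).
    Returns the averaged within-class matrix Phi_WT and the between-class matrix Phi_B."""
    C = len(classes)
    class_means = [A.mean(axis=0) for A in classes]

    # Phi_k = (1/m) sum_i (a_ki - a_ave,k)(a_ki - a_ave,k)^T  (per-data scaling)
    Phi_k = [(A - mu).T @ (A - mu) / A.shape[0] for A, mu in zip(classes, class_means)]
    Phi_WT = sum(Phi_k) / C

    # Phi_B = (1/C) sum_k (a_ave,k - a_total_ave)(a_ave,k - a_total_ave)^T
    total_ave = np.mean(class_means, axis=0)
    Phi_B = sum(np.outer(mu - total_ave, mu - total_ave) for mu in class_means) / C
    return Phi_WT, Phi_B

# Hypothetical data: two classes, four 2-D samples each.
C1 = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0], [3.0, 2.0]])
C2 = np.array([[6.0, 5.0], [7.0, 6.0], [7.0, 4.0], [8.0, 5.0]])
Phi_WT, Phi_B = scatter_matrices([C1, C2])
print(Phi_WT)
print(Phi_B)
```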

17 For a good separation between all classes, one must look for the directional vectors that maximize the sum of the squared lengths between all the class centers and their average. Fisher (Fisher, 1936) combined the metrics w_j^T Φ_WT w_j and w_j^T Φ_B w_j by stating the maximization of
J = (w^T Φ_B w) / (w^T Φ_WT w).
Here, w is a directional vector which maximizes the numerator while minimizing the denominator. Both may not be satisfied simultaneously for any choice of w, therefore we will settle for the best possible solution.

18 For finding the critical point to maximize J, we do not need any constraints, and the subscript j will not be used for the moment:
dJ/dw = [2 Φ_B w (w^T Φ_WT w) − 2 Φ_WT w (w^T Φ_B w)] / (w^T Φ_WT w)^2 = 0.
By rearranging the expression, Φ_B w (w^T Φ_WT w) − Φ_WT w (w^T Φ_B w) = 0. After dividing by w^T Φ_WT w and recognizing J in the second term, the expression reduces to
Φ_B w − J Φ_WT w = 0.

19 If the inverse of Φ_WT exists, then we have (Φ_WT^{-1} Φ_B − J I) w = 0. This equation is an eigenvalue-eigenvector problem where J = λ. The eigenvector of Φ_WT^{-1} Φ_B belonging to its largest eigenvalue will maximize J. The next question is: can we state Fisher’s LDA metric in any other way? What about combining the numerator and denominator metrics of J by a subtraction (Fukunaga, 1990):
J_combined = J_c = w^T Φ_B w − η w^T Φ_WT w,
where η is a positive number.
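A minimal sketch of this eigenvalue solution in Python/NumPy (the function name fisher_direction and the small test matrices are hypothetical; it assumes Φ_WT is invertible, i.e., not the small-sample-size case mentioned below):

```python
import numpy as np

def fisher_direction(Phi_WT, Phi_B):
    """Unit eigenvector of Phi_WT^{-1} Phi_B with the largest eigenvalue J = lambda."""
    M = np.linalg.inv(Phi_WT) @ Phi_B
    eigvals, eigvecs = np.linalg.eig(M)          # M need not be symmetric
    k = np.argmax(eigvals.real)                  # index of the largest eigenvalue
    w = eigvecs[:, k].real
    return eigvals.real[k], w / np.linalg.norm(w)

# Hypothetical 2x2 scatter matrices, just to exercise the function.
Phi_WT = np.array([[1.0, 0.0], [0.0, 2.0]])
Phi_B  = np.array([[4.0, 2.0], [2.0, 1.0]])
J, w = fisher_direction(Phi_WT, Phi_B)
print(J, w)
```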

20 J_c must be maximized. The critical point of J_c can be calculated without constraints on w:
dJ_c/dw = 2 Φ_B w − 2η Φ_WT w = 0, that is, (Φ_B − η Φ_WT) w = 0.
Premultiplying with Φ_WT^{-1}, if it exists, gives (Φ_WT^{-1} Φ_B − η I) w = 0. So η must be an eigenvalue of Φ_WT^{-1} Φ_B; as before η = J, and w is the corresponding eigenvector with unit length. Note that when Φ_WT^{-1} does not exist, that is, when at least one of its eigenvalues is zero, the case is called the “small sample size” problem and its solution is given in (Cevikalp et al., 2005).

21 Traces in Fisher’s LDA
Remember that J(w_j) = (w_j^T Φ_B w_j)/(w_j^T Φ_WT w_j), that is, J is a function of w_j and J is the largest eigenvalue of Φ_WT^{-1} Φ_B. If one wants to consider all the eigenvalues and the eigenvectors of Φ_WT^{-1} Φ_B, then a matrix of the eigenvectors can be formed as W = [w_1 ⋮ w_2 ⋮ ⋯ ⋮ w_n] ϵ ℝ^(n×n). Then W^T Φ_B W is an n×n matrix whose (i, j) element is w_i^T Φ_B w_j; its first row is [w_1^T Φ_B w_1  w_1^T Φ_B w_2  ⋯  w_1^T Φ_B w_n], and so on down to the last row [w_n^T Φ_B w_1  w_n^T Φ_B w_2  ⋯  w_n^T Φ_B w_n]. What is important is the sum of the diagonal elements of this matrix.

22 Since the trace of W^T Φ_B W is
Tr(W^T Φ_B W) = w_1^T Φ_B w_1 + w_2^T Φ_B w_2 + ⋯ + w_n^T Φ_B w_n = Σ_{i=1}^{n} w_i^T Φ_B w_i,
the new form of Fisher’s LDA is J(W) = Tr(W^T Φ_B W) / Tr(W^T Φ_WT W). The critical points for each eigenvalue and eigenvector are the same as before:
∂J(W)/∂w_j = 0 gives (Φ_B − J Φ_WT) w_j = 0 and J = λ_j for j = 1, 2, …, n.
Fisher’s LDA is also written in (Fukunaga, 1990) as J(W) = Tr(Φ_WT^{-1} Φ_B) and J(W) = Tr[(W^T Φ_WT W)^{-1} (W^T Φ_B W)].
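A short numerical sanity check of the trace forms, as a sketch (the two matrices below are hypothetical): the trace of Φ_WT^{-1} Φ_B equals the sum of its eigenvalues, and the trace-ratio form can be evaluated at the eigenvector matrix W.

```python
import numpy as np

# Hypothetical symmetric scatter matrices.
Phi_WT = np.array([[2.0, 0.5], [0.5, 1.0]])
Phi_B  = np.array([[1.0, 0.2], [0.2, 3.0]])

M = np.linalg.inv(Phi_WT) @ Phi_B
eigvals, W = np.linalg.eig(M)                    # columns of W are the w_j's

# Tr(Phi_WT^{-1} Phi_B) equals the sum of the eigenvalues lambda_j.
print(np.trace(M), eigvals.real.sum())

# Trace-ratio form of this slide, evaluated at the eigenvector matrix W.
W = W.real
print(np.trace(W.T @ Phi_B @ W) / np.trace(W.T @ Phi_WT @ W))
```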

23 Example: Suppose that we have two classes, C_1 and C_2, each with 4 data (sample) points in 2D space. The data are shown in the figure below.

24 Example (continued): The means of the classes are a_ave,1 and a_ave,2 for C_1 and C_2, respectively, and the within-class covariance matrices are Φ_1 and Φ_2. Then the total within-class covariance matrix is Φ_WT = Φ_1 + Φ_2, and the average value of all data in the two classes is a_ave = (a_ave,1 + a_ave,2)/2.

25 Example (continued): The between-class covariance matrix is
Φ_B = (a_ave,1 − a_ave)(a_ave,1 − a_ave)^T + (a_ave,2 − a_ave)(a_ave,2 − a_ave)^T = [25 −25; −25 25].
With Φ_WT^{-1} available, the eigenvalues of Φ_WT^{-1} Φ_B can be calculated by solving the following equation:
det(λI − Φ_WT^{-1} Φ_B) = λ(λ − 5) = 0.

26 Example (continued): det(λI − Φ_WT^{-1} Φ_B) = λ(λ − 5) = 0. Therefore λ_1 = 5 and λ_2 = 0, and for finding the eigenvectors,
Φ_WT^{-1} Φ_B w_1 = 5 w_1 ⇒ w_1 = (1/√2) [−1 1]^T since ‖w_1‖ = 1,
Φ_WT^{-1} Φ_B w_2 = 0 w_2 ⇒ w_2 = (1/√2) [1 1]^T since ‖w_2‖ = 1,
and w_1 ∙ w_2 = 0.

27 Example (continued): Geometric interpretation: Note that the projection of all data onto w_1 separates the classes C_1 and C_2 well, whereas the projection onto w_2 causes confusion.

28 Example (continued): Reducing the dimension of the whole space by using directional vectors is related to subspace methods in Pattern Recognition. In 2D-space, finding the lines that separate the classes is called LDA (linear discriminant analysis), where the line is the discriminant line. In higher-dimensional spaces we end up finding hyperplanes that separate the whole space into two parts. That is given in (Bishop, 1995). The general equation for the discriminant hyperplanes is D(X) = w^T (X − a_ave) = 0.

29 Example (continued): For the last example, the eigenvector that belongs to the largest eigenvalue λ_1 = 5 is in the direction w_1 = [−1 1]^T; a_ave has equal components (it lies on the line x_1 = x_2) and X = [x_1 x_2]^T. Then the discriminant line is
D(X) = w_1^T (X − a_ave) = 0, which reduces to D(X) = D(x_1, x_2) = −x_1 + x_2 = 0.
The line x_1 = x_2 separates the x_1 x_2-plane into two parts, where the upper part belongs to the data points of C_1 and the lower part belongs to the data points of C_2.

30 Example (continued): That is, if D(X) > 0 for a data point, then that data point belongs to C_1, or else if D(X) < 0 then it belongs to C_2. For example, with D(X) = −x_1 + x_2, the first data point a_1 of C_1 gives D(a_1) = 4 > 0, so a_1 belongs to C_1; the first data point a_5 of C_2 gives D(a_5) = −4 < 0, so a_5 belongs to C_2.
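A sketch of this decision rule in Python/NumPy; w_1 and the x_1 = x_2 constraint on the class average come from the example above, while the particular average point and the two test points are hypothetical stand-ins (the slide’s own data values are not reproduced here).

```python
import numpy as np

def lda_classifier(w, a_ave):
    """Classify by the sign of D(X) = w^T (X - a_ave): C1 if D > 0, else C2."""
    def classify(X):
        D = w @ (np.asarray(X, dtype=float) - a_ave)
        return "C1" if D > 0 else "C2"
    return classify

w1 = np.array([-1.0, 1.0])        # direction of w_1 from the example
a_ave = np.array([3.0, 3.0])      # hypothetical: any point on the line x1 = x2

classify = lda_classifier(w1, a_ave)
print(classify([1.0, 5.0]))       # D = +4 > 0 -> "C1"
print(classify([5.0, 1.0]))       # D = -4 < 0 -> "C2"
```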

31 Weighted Features for Feature Selection
We will try to assign weights to each one of the features, that is, each element in all the feature vectors will be multiplied by a weight factor. We will use h_i (heaviness) for the ith feature. For example, if a_j = [a_1j a_2j ⋯ a_nj]^T, the weighted feature vector will be
[h_1 a_1j  h_2 a_2j  ⋯  h_n a_nj]^T = diag(h_1, h_2, …, h_n) [a_1j a_2j ⋯ a_nj]^T = H a_j.
We have c classes for classification, that is, C_1, C_2, …, C_c. In each class we assume that we have m feature vectors in an n-dimensional feature space. That is, C_i = {a_1^i, a_2^i, …, a_m^i} = {a_j^i}, i = 1, …, c, j = 1, …, m, and a_j^i ∈ ℝ^(n×1). The classes with the weighted features are C_i,weighted = {H a_j^i} for i = 1, …, c and j = 1, …, m.

32 The weights for each feature may be constrained to a finite interval: |h_k| ≤ 1 for k = 1, …, n. The covariance matrix must be calculated after this weight assignment. The averages of the classes are
a_ave,weighted^i = (1/m) Σ_{j=1}^{m} H a_j^i = H a_ave^i.
Then the difference vector after the weight assignment is H a_j^i − H a_ave^i = H a_j,dif^i, i = 1, …, c, j = 1, …, m. The projection of a weighted difference vector onto a unit directional vector w_k has a length square of
l_ijk^2 = w_k^T H a_j,dif^i (a_j,dif^i)^T H w_k.

33 The sum of the length squares for the feature vectors in C_i is
Sum = Σ_{j=1}^{m} l_ijk^2 = w_k^T H (Σ_{j=1}^{m} a_j,dif^i (a_j,dif^i)^T) H w_k,
where the summation in parentheses is the within-class covariance matrix with no weights, Φ_i. Therefore Sum = w_k^T H Φ_i H w_k = w_k^T Φ_H^i w_k. The total sum of the within-class covariance matrices is
Φ_H,Total = (1/c)(Φ_H^1 + Φ_H^2 + ⋯ + Φ_H^c) = (1/c) H (Φ_1 + Φ_2 + ⋯ + Φ_c) H = H Φ_W,Total H.

34 If Φ_W,Total = [d_11 d_12 ⋯ d_1n; d_21 d_22 ⋯ d_2n; ⋮; d_1n d_2n ⋯ d_nn], then the total weighted within-class matrix will be as below:
Φ_H,Total = [h_1^2 d_11  h_1 h_2 d_12 ⋯ h_1 h_n d_1n; h_1 h_2 d_21  h_2^2 d_22 ⋯ h_2 h_n d_2n; ⋮; h_1 h_n d_1n  h_2 h_n d_2n ⋯ h_n^2 d_nn].
Let’s also remember a critical formula: Trace(Φ_W,Total) = d_11 + d_22 + … + d_nn = λ_1 + λ_2 + … + λ_n. Similarly, Trace(Φ_H,Total) = h_1^2 d_11 + h_2^2 d_22 + … + h_n^2 d_nn = sum of the eigenvalues of Φ_H,Total. The between-class covariance matrix can also be calculated before and after the weight assignment.
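The weighted matrix and its trace relation are easy to check numerically; below is a sketch (the 3×3 matrix Φ_W,Total and the weight values are hypothetical):

```python
import numpy as np

# Hypothetical unweighted total within-class covariance matrix (symmetric).
Phi_W_total = np.array([[4.0, 1.0, 0.5],
                        [1.0, 3.0, 0.2],
                        [0.5, 0.2, 2.0]])

h = np.array([1.0, 0.5, 0.0])       # example weights with |h_k| <= 1
H = np.diag(h)

Phi_H_total = H @ Phi_W_total @ H   # weighted matrix: (i, j) entry is h_i h_j d_ij
print(Phi_H_total)

# Trace relation from this slide: Trace(H Phi H) = sum_i h_i^2 d_ii.
print(np.trace(Phi_H_total), np.sum(h**2 * np.diag(Phi_W_total)))
```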

35 The class averages are a_ave^i for i = 1, …, c. Then the average of all class averages is a_total,ave = (1/c) Σ_{i=1}^{c} a_ave^i, and the between-class covariance matrix is
Φ_B = (1/c) Σ_{k=1}^{c} (a_ave^k − a_total,ave)(a_ave^k − a_total,ave)^T.
After the weight assignment the class averages will be H a_ave^i for i = 1, …, c and the average of all classes will be H a_total,ave. Similarly, Φ_B,weighted = H Φ_B H. If Trace(Φ_B) = c_11 + c_22 + … + c_nn, then Trace(Φ_B,weighted) = h_1^2 c_11 + h_2^2 c_22 + … + h_n^2 c_nn.

36 Fisher’s LDA metric in its subtraction form is
J_sub = Tr[W^T (Φ_B − η Φ_WT) W], where W = [w_1 ⋮ w_2 ⋮ ⋯ ⋮ w_n] is an n×n matrix. Or
J_sub = Σ_{i=1}^{n} w_i^T (Φ_B − η Φ_WT) w_i = Σ_{i=1}^{n} (c_ii − η d_ii) = λ_1 + λ_2 + … + λ_n.
With the weights,
J_sub,weighted = Σ_{i=1}^{n} w_i^T H (Φ_B − η Φ_WT) H w_i = Σ_{i=1}^{n} h_i^2 (c_ii − η d_ii).
Here the weights are bounded by the inequalities |h_i| ≤ 1 for i = 1, …, n. Then the maximization of J_sub,weighted with respect to the weights yields the following: if c_ii − η d_ii > 0, then h_i^2 = 1; if c_ii − η d_ii < 0, then h_i^2 = 0.
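The weight rule above amounts to keeping a feature exactly when its diagonal between-class term exceeds η times its diagonal within-class term. A minimal sketch (the function name select_features and the diagonal test matrices are hypothetical):

```python
import numpy as np

def select_features(Phi_B, Phi_WT, eta):
    """h_i = 1 if c_ii - eta*d_ii > 0, else h_i = 0 (diagonals of Phi_B and Phi_WT)."""
    c = np.diag(Phi_B)
    d = np.diag(Phi_WT)
    return (c - eta * d > 0).astype(float)

# Hypothetical diagonals: feature 1 discriminates, feature 2 does not.
Phi_B  = np.diag([2.0, 0.1])
Phi_WT = np.diag([0.5, 1.0])
print(select_features(Phi_B, Phi_WT, eta=1.0))   # [1. 0.]
```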

37 Example: The simplest case, where C_1 = {1} and C_2 = {3}. Then a_1^1 = 1 = a_ave^1 and a_1^2 = 3 = a_ave^2. With no weight assignment, we will use the regular procedure. Since there is only one available data point in each of the classes, the within-class scatters are Φ_1 = 0 and Φ_2 = 0. Then Φ_WT = 0, whose inverse does not exist. a_ave,T = (1 + 3)/2 = 2. Then the between-class scatter is Φ_B = (1/2)[(1 − 2)(1 − 2) + (3 − 2)(3 − 2)] = 1. Φ_B − η Φ_WT = 1 − 0 = 1 and w(Φ_B − η Φ_WT)w = w^2 with the constraint w^2 = 1. It is best to classify data points as being on the left or right side of a_ave,T.

38 Example (continued): After the weight assignment a similar procedure will be followed. C_1,w = {h_1} and C_2,w = {3h_1}.
a_ave,T = (h_1 + 3h_1)/2 = 2h_1
Φ_B,weighted = (1/2)[(h_1 − 2h_1)(h_1 − 2h_1) + (3h_1 − 2h_1)(3h_1 − 2h_1)] = h_1^2
Φ_B,weighted − η Φ_WT,weighted = h_1^2 = Trace(Φ_B,weighted − η Φ_WT,weighted)
The trace of the expression is maximized when h_1 = ±1. Remark: J_sub,weighted = h_1^2 w^2 with w^2 = 1.

39 4.1 The Choice of the Constant η
If η = 1 is chosen for all w_i, i = 1, …, n, then J_sub,weighted = Σ_{i=1}^{n} w_i^T H (Φ_B − Φ_WT) H w_i. If Φ_B is calculated on a per-class basis and Φ_WT is calculated on a per-data basis, then Φ_B − Φ_WT becomes meaningful. Let the eigenvalues of Φ_B be λ_1B, λ_2B, …, λ_(c−1)B, 0, …, 0 and let the eigenvalues of Φ_WT be λ_1W, λ_2W, …, λ_nW. Then Trace(Φ_B − Φ_WT) = (λ_1B + λ_2B + … + λ_(c−1)B) − (λ_1W + λ_2W + … + λ_nW). Remember that the λ’s indicate the between-class and within-class variances.

40 For class separability, the difference between the terms in parentheses must be positive and as large as possible. If the difference is negative, the classes may overlap, causing misclassification. By assigning zero weights to some of the features that cause negativity, one may prevent the overlapping of classes. Example: Let there be two classes with C_1 = {[1 1]^T, [1 −1]^T} and C_2 = {[3 1]^T, [3 −1]^T}.

41 Example (continued): a_ave^1 = [1 0]^T, a_ave^2 = [3 0]^T, a_ave,T = [2 0]^T.
With no weight assignment we will follow the regular procedure. Φ_1 = Φ_2 = [0 0; 0 1] per data, Φ_WT = Φ_1 + Φ_2 = [0 0; 0 2], Φ_B = [1 0; 0 0].
J_sub = w^T (Φ_B − η Φ_WT) w = w^T [1 0; 0 −2η] w = w_1^2 − 2η w_2^2 with the constraint w_1^2 + w_2^2 = 1.
J_sub = (1 − w_2^2) − 2η w_2^2 = 1 − (1 + 2η) w_2^2
dJ/dw_2 = −2(1 + 2η) w_2 = 0, then w_2 = 0 and w_1 = 1; that is, the directional vector w = [1 0]^T = e_1 best separates C_1 and C_2.

42 Example (continued): In terms of LDA: w^T (X − a_ave,T) = [1 0]([x_1 x_2]^T − [2 0]^T) = x_1 − 2 = D(X) = 0, so x_1 = 2 is the discriminant line. C_1 data points are to the left of x_1 = 2, and C_2 data points are to the right of x_1 = 2. After the weight assignment to the features we will have the following:
Φ_B,weighted = [h_1^2 0; 0 0], Φ_WT,weighted = [0 0; 0 2h_2^2]

43 Example (continued): J_sub,weighted = w^T [h_1^2 0; 0 −2η h_2^2] w = h_1^2 w_1^2 − 2η h_2^2 w_2^2 with the constraint w_1^2 + w_2^2 = 1. J_sub,weighted can still be maximized using the constraint relation. Then J_sub,weighted = h_1^2 (1 − w_2^2) − 2η h_2^2 w_2^2 = h_1^2 − (h_1^2 + 2η h_2^2) w_2^2, and dJ_sub,weighted/dw_2 = −2(h_1^2 + 2η h_2^2) w_2 = 0 gives w_2 = 0. This result is independent of the weight assignment. But if we go back to the relation J_sub,weighted = h_1^2 w_1^2 − 2η h_2^2 w_2^2 with the constraints |h_1| ≤ 1 and |h_2| ≤ 1, J_sub,weighted can be maximized with respect to the weights. Since η is non-negative and its term carries a minus sign, h_2 = 0 for better classification. This is the elimination of the second feature in both classes. Classification can be done for the data along the x_1-axis only.
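Applying the rule directly to the two classes of this example gives the same conclusion. The sketch below assumes the per-data within-class matrices summed over the two classes (Φ_WT = Φ_1 + Φ_2), as used in this example, with η = 1.

```python
import numpy as np

C1 = np.array([[1.0, 1.0], [1.0, -1.0]])
C2 = np.array([[3.0, 1.0], [3.0, -1.0]])

mu1, mu2 = C1.mean(axis=0), C2.mean(axis=0)   # [1, 0] and [3, 0]
mu = (mu1 + mu2) / 2.0                        # [2, 0]

# Per-data within-class matrices, summed over the two classes.
Phi1 = (C1 - mu1).T @ (C1 - mu1) / len(C1)
Phi2 = (C2 - mu2).T @ (C2 - mu2) / len(C2)
Phi_WT = Phi1 + Phi2                          # [[0, 0], [0, 2]]

# Per-class between-class matrix.
Phi_B = (np.outer(mu1 - mu, mu1 - mu) + np.outer(mu2 - mu, mu2 - mu)) / 2.0  # [[1, 0], [0, 0]]

eta = 1.0
h = (np.diag(Phi_B) - eta * np.diag(Phi_WT) > 0).astype(float)
print(h)                                      # [1. 0.] -> keep feature 1, drop feature 2
```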

44 Example: There is a small probability that our feature selection method becomes useless; this is an example of that case. Let there be two classes C_1 and C_2 whose data points are shown in the figure, with averages a_ave^1 = [2 0]^T, a_ave^2 = [0 2]^T, and a_ave,T = [1 1]^T. The between-class covariance matrix is Φ_B = [1 −1; −1 1] per class. The within-class covariances Φ_1 and Φ_2 per data give Φ_WT = Φ_1 + Φ_2 = [1/2 0; 0 1/2] per class, so Φ_WT^{-1} = [2 0; 0 2] and Φ_WT^{-1} Φ_B = 2 [1 −1; −1 1] = [2 −2; −2 2].

45 Example (continued): For the rational expression of Fisher:
det(Φ_WT^{-1} Φ_B − λI) = det([2 −2; −2 2] − [λ 0; 0 λ]) = 0 will give λ(λ − 4) = 0. Then λ_1 = 4 and λ_2 = 0. The corresponding eigenvectors are w_1 = (1/√2) [−1 1]^T and w_2 = (1/√2) [1 1]^T.

46 Example (continued): We should go back to the figure to see the class separability with no weights. The direction of w_1 is along x_1′; the projection of all data onto the x_1′-axis gives us separable classes. The direction of w_2 is along x_2′; the projections of the data in C_1 and C_2 onto x_2′ are not separable. Equivalently, the line along the x_2′-axis is the linear discriminant.

47 Example (continued): Now let’s see if we can use the feature selection method. With the weight assignment the average values of the two classes are a_ave^1 = [2h_1 0]^T, a_ave^2 = [0 2h_2]^T, and a_ave,T = [h_1 h_2]^T. Then the covariance matrices with weights are Φ_W1 and Φ_W2 per data, which sum to Φ_WT = Φ_W1 + Φ_W2 = [h_1^2/2 0; 0 h_2^2/2] per class, and Φ_B = [h_1^2 −h_1 h_2; −h_1 h_2 h_2^2] per class. Then J_sub,weighted = w^T (Φ_B − η Φ_WT) w.

48 Example (continued): Assume η = 1. Then
J_sub,weighted = w^T [h_1^2 − h_1^2/2  −h_1 h_2; −h_1 h_2  h_2^2 − h_2^2/2] w = w^T [h_1^2/2 −h_1 h_2; −h_1 h_2 h_2^2/2] w
Trace(Φ_B − Φ_WT) = h_1^2/2 + h_2^2/2 = λ_1 + λ_2
This means that h_1 and h_2 have equal importance in J_sub,weighted, so the elimination of any feature is meaningless. The metric with η = 1 and no weight assignment is
J_sub = w^T (Φ_B − Φ_WT) w = w^T [1/2 −1; −1 1/2] w

49 Example (continued): J_sub = w^T (Φ_B − Φ_WT) w = w^T [1/2 −1; −1 1/2] w. Introduce the constraint w^T w = 1:
J_sub,augmented = w^T [1/2 −1; −1 1/2] w + λ(1 − w^T w)
dJ_sub,augmented/dw = 2([1/2 −1; −1 1/2] − λI) w = 0
This has two eigenvalues λ_1 = 3/2 and λ_2 = −1/2 with the eigenvectors w_1 = (1/√2) [−1 1]^T and w_2 = (1/√2) [1 1]^T. These eigenvectors give the same separability as before.

50 Example: This example is to show where the weight assignment is useful for feature selection.
Let C_1 = {[2 2]^T, [6 2]^T, [2 4]^T, [6 4]^T} with average a_ave^1 = [4 3]^T. Let the data of C_2 be the mirror image of the data in C_1 along the x_1-axis, that is, C_2 = {[2 −2]^T, [6 −2]^T, [2 −4]^T, [6 −4]^T} with average a_ave^2 = [4 −3]^T.

51 Example (continued): The total within-class matrix on a per-data and per-class basis is Φ_WT = [4 0; 0 1] and the between-class covariance matrix on a per-class basis is Φ_B = [0 0; 0 9]. Fisher’s metric in subtraction form (with η = 1) is
J_sub = w^T (Φ_B − Φ_WT) w = w^T [−4 0; 0 8] w,
with the eigenvalues λ_1 = −4 and λ_2 = 8 and the corresponding eigenvectors w_1 = [1 0]^T and w_2 = [0 1]^T. The projections onto w_2 (onto the x_2-axis) will separate the C_1 and C_2 classes, whereas the projections onto w_1 (onto the x_1-axis) will overlap.

52 Example (continued): With the use of the weight assignment method we will get the following:
Φ_B,weighted = [0 0; 0 9h_2^2], Φ_WT,weighted = [4h_1^2 0; 0 h_2^2], then
J_sub,weighted = w^T (Φ_B,weighted − Φ_WT,weighted) w = w^T [−4h_1^2 0; 0 8h_2^2] w
It has a trace of Tr(Φ_B,weighted − Φ_WT,weighted) = −4h_1^2 + 8h_2^2. In terms of the weights, J_sub,weighted is maximized when h_1 = 0 and h_2 = 1. Therefore one can eliminate the first feature in the data of all classes and still get a classification with the second feature. The 2D-space can be reduced to a 1D-space. (Notice that this is the projection of all data onto the x_2-axis.)
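The numbers of this example can be reproduced as follows; this sketch uses the per-data within-class scaling averaged over the two classes and the per-class between-class scaling, as in this example, with η = 1.

```python
import numpy as np

C1 = np.array([[2.0, 2.0], [6.0, 2.0], [2.0, 4.0], [6.0, 4.0]])
C2 = np.array([[2.0, -2.0], [6.0, -2.0], [2.0, -4.0], [6.0, -4.0]])

mu1, mu2 = C1.mean(axis=0), C2.mean(axis=0)   # [4, 3] and [4, -3]
mu = (mu1 + mu2) / 2.0                        # [4, 0]

# Per-data within-class matrices, averaged over the two classes.
Phi1 = (C1 - mu1).T @ (C1 - mu1) / len(C1)
Phi2 = (C2 - mu2).T @ (C2 - mu2) / len(C2)
Phi_WT = (Phi1 + Phi2) / 2.0                  # [[4, 0], [0, 1]]

# Per-class between-class matrix.
Phi_B = (np.outer(mu1 - mu, mu1 - mu) + np.outer(mu2 - mu, mu2 - mu)) / 2.0  # [[0, 0], [0, 9]]

# Diagonal of Phi_B - Phi_WT (eta = 1) and the resulting weights.
diff = np.diag(Phi_B - Phi_WT)
h = (diff > 0).astype(float)
print(diff, h)                                # [-4. 8.] [0. 1.] -> keep only the second feature
```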

53 Example: This example is to show a case where the classes are not separable in any direction. Let the data of C_1 be the same as the data of C_1 in the previous example. Let the data of C_2 be the data of C_1 shifted downward by one unit. That is, C_1 = {[2 2]^T, [6 2]^T, [2 4]^T, [6 4]^T} and C_2 = {[2 1]^T, [6 1]^T, [2 3]^T, [6 3]^T}. Then a_ave^1 = [4 3]^T, a_ave^2 = [4 2]^T, and a_ave,T = [4 2.5]^T. Φ_WT = [4 0; 0 1] per data and Φ_B = [0 0; 0 1/4] per class. Φ_WT^{-1} Φ_B = [0 0; 0 1/4] with eigenvalues λ_1 = 0, λ_2 = 1/4 and their corresponding eigenvectors w_1 = [1 0]^T = e_1, w_2 = [0 1]^T = e_2, respectively.

54 Example (continued): Even the projections of the data onto the e_2 direction (the x_2-axis) are not well separable with respect to their class averages. For the weighted case,
Φ_WT,weighted = [4h_1^2 0; 0 h_2^2], Φ_B,weighted = [0 0; 0 h_2^2/4].
Then Fisher’s metric is
J_sub,weighted = w^T (Φ_B,weighted − Φ_WT,weighted) w = w^T [−4h_1^2 0; 0 −(3/4)h_2^2] w

55 Example (continued): The trace of the matrix in J_sub,weighted is
Tr([−4h_1^2 0; 0 −(3/4)h_2^2]) = −4h_1^2 − (3/4)h_2^2.
To maximize this, both h_1 and h_2 must be equal to zero, so no classification will be possible. One may try to get rid of the −4h_1^2 term (by setting h_1 = 0), since it is larger in the negative direction than −(3/4)h_2^2. The best advice in this case is “try to find other measurable features for classification!”. If this is not possible, then one may try to use kernel methods to increase the number of features in both classes and apply other methods for pattern recognition (Cevikalp et al., 2007).

56 ACKNOWLEDGMENT
The authors would like to thank Emre Celebioglu for his contributions to this work.
References
R. A. Fisher, “The Use of Multiple Measurements in Taxonomic Problems”, Annals of Eugenics, Vol. 7, pp. 179-188, 1936.
C. M. Bishop, “Neural Networks for Pattern Recognition”, Oxford University Press, 1995.
H. Cevikalp, M. Neamtu, M. Wilkes, A. Barkana, “Discriminative Common Vectors for Face Recognition”, IEEE Trans. on PAMI, Vol. 27, No. 1, pp. 4-13, 2005.
K. Fukunaga, “Introduction to Statistical Pattern Recognition”, Academic Press, 1990.
H. Cevikalp, M. Neamtu, A. Barkana, “The Kernel Common Vector Method: A Novel Nonlinear Subspace Classifier for Pattern Recognition”, IEEE Trans. on Systems, Man, and Cybernetics—Part B: Cybernetics, Vol. 37, No. 4, 2007.

