Advanced Statistical Methods for Research Math 736/836


1 Advanced Statistical Methods for Research Math 736/836
Discriminant Analysis and Classification: supervised learning. ©2009 Philip J. Ramsey, Ph.D.

2 In this section we explore a technique for separating observations into groups using linear functions of some number of covariates. We are interested in developing linear functions that accurately separate the observations into predetermined groups and in understanding the relative contributions of the covariates to that separation. Today discriminant analysis is used for more than separation; it is also used as a classification method when a training dataset with known classifications exists. In this sense discriminant analysis is a method of supervised learning. We will discuss linear discriminant functions and quadratic discriminant functions. Discriminant analysis is related to methods such as logistic regression, CART, and neural nets, and these other methods are increasingly used in lieu of discriminant analysis for classification. ©2009 Philip J. Ramsey, Ph.D.

3 The origins of discriminant analysis trace to a paper published by R.A. Fisher in 1936.
Fisher was attempting to find a linear combination of covariates that could be used to discriminate between predetermined categories or groupings of the observations. He was also attempting to reduce the dimensionality of the covariates so that one could visually discriminate between the groups. In a sense, discriminant analysis combines concepts of ANOVA and PCA in developing these linear discriminant functions. PCA is an example of unsupervised learning in that no use is made of any natural groupings in the observations. Discriminant analysis, by contrast, is considered supervised learning since it explicitly makes use of the groupings in developing the functions. ©2009 Philip J. Ramsey, Ph.D.

4 We begin with the simplest case of k variables or covariates that can be used to separate a set of N observations into m = 2 groups. We assume that both groups have an identical covariance structure Σ among the k variables; however, they have different centroids μ1 and μ2. A linear discriminant function is a linear combination of the k variables that maximizes the distance between the centroids. In basic discriminant analysis we typically have two samples, one from each group, with n1 and n2 observations respectively, where N = n1 + n2. We let Y1j represent a row vector (k elements, one for each column or variable) from group 1, of which there are n1 such vectors, while Y2j represents a row vector of length k from group 2, of which there are n2 such vectors. ©2009 Philip J. Ramsey, Ph.D.

5 The linear discriminant function turns each row vector into a scalar value Z; that is, we create a linear combination of the k covariates, Z = a′Y, where a′ is a vector of coefficients for the linear combination. Note, in concept, the similarity to PCA. Applying the combination to each of the m = 2 groups produces two columns of scores, Z1 and Z2. The idea is to find estimates of the coefficients a′ such that the standardized difference between the two score means is maximized. Since the difference could be negative, we work with the squared distance. ©2009 Philip J. Ramsey, Ph.D.

6 The trick is to find the values of the weights a′ such that we have the maximum separation between the two groups. Fisher viewed this as an ANOVA-type problem in which he wanted to find the weights such that the variation between the transformed groups (the linear combinations) is maximized and the within-group variation is minimized. We can define the between-group covariance as a matrix H, which we explain in detail later in the notes. Likewise, we can identify the within-group covariance with the matrix E, which we also explain in detail later. Basically we wish to find the weights that maximize the ratio λ = (a′Ha)/(a′Ea). For only two groups the expression greatly simplifies. ©2009 Philip J. Ramsey, Ph.D.

7 Let Spool be the pooled covariance matrix (dimension k x k) from the two groups. Recall we assume both groups have the same underlying covariance structure, so it makes sense to pool the two estimates together in order to get a better overall estimate of Σ. Let SZ be the pooled standard deviation for the two columns of scores Z1 and Z2; then the standardized squared distance between the score means is (Z̄1 − Z̄2)²/SZ² = [a′(Ȳ1 − Ȳ2)]² / (a′Spool a). The RHS of this expression is our objective function, which we maximize to find the weights; i.e., with some calculus the maximum separation occurs where a = Spool⁻¹(Ȳ1 − Ȳ2). ©2009 Philip J. Ramsey, Ph.D.
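The coefficient solution above can be verified numerically. Below is a minimal numpy sketch with made-up data standing in for the two groups; the variable names (Y1, Y2, Spool, a) are illustrative only and are not taken from the course files.

import numpy as np

# Minimal sketch of the two-group solution a = Spool^-1 (ybar1 - ybar2).
# Y1 and Y2 are (n1 x k) and (n2 x k) arrays of observations; the data are made up.
rng = np.random.default_rng(0)
Y1 = rng.normal(loc=[40.0, 60.0], scale=2.0, size=(12, 2))   # e.g., Temperature A
Y2 = rng.normal(loc=[35.0, 55.0], scale=2.0, size=(15, 2))   # e.g., Temperature B

n1, n2 = len(Y1), len(Y2)
ybar1, ybar2 = Y1.mean(axis=0), Y2.mean(axis=0)

# Pooled covariance matrix (both groups are assumed to share the same covariance structure).
S1, S2 = np.cov(Y1, rowvar=False), np.cov(Y2, rowvar=False)
Spool = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

# Discriminant coefficients and the scalar scores Z = a'Y for every observation.
a = np.linalg.solve(Spool, ybar1 - ybar2)
Z1, Z2 = Y1 @ a, Y2 @ a
print("coefficients a:", a)
print("mean scores:", Z1.mean(), Z2.mean())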

8 What we are doing is trying to find a direction or line (for two groups) onto which we project the original observations, such that the standardized mean difference between the transformed groups is maximized on that line. The solution for a′, intuitively, projects the observations onto a line that runs parallel to the line joining the two centroids of the two groups. The difference between the two transformed or projected means is maximized only if that line is parallel to the line joining the centroids; a projection in any other direction would result in a smaller difference between the transformed means Zi. The solution for the weights a′ is not unique, however the projection direction is unique and is always parallel to the line joining the centroids. ©2009 Philip J. Ramsey, Ph.D.

9 Example: We use the dataset Table81Rencher.JMP to illustrate the discriminant function method for 2 groups. The dataset consists of two measurements of strength on samples of steel processed at two different temperatures; the temperatures form the groups. We will develop a linear discriminant function to classify the steel into the two groups based upon the two measurements. ©2009 Philip J. Ramsey, Ph.D.

10 Example: Notice from the Fit Y by X plot that we would have a great deal of overlap if we tried to project the observations onto either the Y axis or X axis. However, if it is possible to project the points onto a line in another direction, then we could greatly reduce the overlap between the two groups. This is the concept of a linear discriminant function. ©2009 Philip J. Ramsey, Ph.D.

11 Example: In the plot to the right, the ‘X’ symbols connect the centroids for the two groups (Temperature A and Temperature B). The line we project the points onto will always be parallel to a line connecting the centroids, and the exact position of this line depends upon the coefficient estimates a′. Remember the solution shown is not unique, but the projection direction is unique. ©2009 Philip J. Ramsey, Ph.D.

12 Example: Below are the calculations for the coefficients of the discriminant functions for the two groups. ©2009 Philip J. Ramsey, Ph.D.

13 Example: Below is a data table showing the values of the discriminant function Z. On the Z scale we can clearly differentiate between the two Temperature groups. ©2009 Philip J. Ramsey, Ph.D.

14 Example: One could also do a two-sample t test on Z to see if a true mean difference exists between the two groups. Below are the two-sample t test results from the JMP Fit Y by X platform. Notice that a highly significant difference exists. Equivalently, we can say that a significant difference exists between the centroids of the original data. ©2009 Philip J. Ramsey, Ph.D.
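For readers working outside JMP, a quick sketch of the same two-sample t test on the discriminant scores, continuing the numpy example after slide 7 (Z1 and Z2 are the made-up score vectors computed there):

from scipy import stats

# Two-sample pooled-variance t test on the discriminant scores Z1 and Z2 from the
# earlier sketch (made-up data); a significant difference on Z is equivalent to a
# difference between the centroids of the original data.
t_stat, p_value = stats.ttest_ind(Z1, Z2, equal_var=True)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")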

15 Example: JMP performs discriminant analysis in the Discriminant platform. The solution from JMP is shown below. Notice that it is a scaled version of the manual solution, but it is equivalent in its ability to classify into the two groups. ©2009 Philip J. Ramsey, Ph.D.

16 Example: The Canon[1] scores below are the discriminant scores for the JMP solution. We can see that the JMP solution equivalently separates the two temperatures into their respective groups. To show the equivalence, note that the ratio of the Z values to the Canon[1] values is a constant = 2.89. ©2009 Philip J. Ramsey, Ph.D.

17 Example: We can show the conceptual similarity to PCA by doing a PCA on Yield and Ultimate and showing that we can separate the two Temperature groups by projecting all of the points onto the PC2 line. In the PC1 direction we cannot separate the two groups, but in the PC2 direction we can easily separate them, so a projection onto the PC2 line does provide discrimination between the groups. ©2009 Philip J. Ramsey, Ph.D.

18 Example: The solution in Discriminant is not the eigen decomposition of S or R used in PCA; however, in concept the idea is the same: find a coordinate system on which to plot the points (rotate the original axes) such that the groups are separated as much as possible – the transformed means are as far apart as possible. Of course PCA is not intended to separate groups, so the solution is not identical to the discriminant solution. More on this later. ©2009 Philip J. Ramsey, Ph.D.

19 When we have more than two groups, the problem becomes more complex, however the basic principle of linear discriminant functions is still the same as in the two group case. In order to explain the solution for m > 2 groups we need to introduce a couple of concepts from one-way Multivariate Analysis of Variance (MANOVA). Recall in univariate one-way Analysis of Variance (ANOVA) the variation in the observations is broken down into two parts. The first part is the within variation, which describes the purely random variation within the replicates in each of the groups. The second part is the between variation, which describes the variation between groups. If the between variation is large compared to the within variation, then we have evidence of significant differences between the groups. ©2009 Philip J. Ramsey, Ph.D.

20 For the univariate one-way ANOVA case the sums of squares define the variation in the groups and we have the well known result SS(Total) = SS(Between) + SS(Error); for m groups the formula is Σi Σj (yij − ȳ..)² = Σi ni(ȳi. − ȳ..)² + Σi Σj (yij − ȳi.)². When multivariate responses exist, we have to re-express the within and between formulas using matrix notation. ©2009 Philip J. Ramsey, Ph.D.
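As a sketch of the matrix notation referred to here and on the earlier slides, these are the between-groups (H) and within-groups (E) SSCP matrices as they are usually defined for one-way MANOVA (e.g., in Rencher, 2002):

H = \sum_{i=1}^{m} n_i \,(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})'
\qquad
E = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\cdot})(y_{ij} - \bar{y}_{i\cdot})'
\qquad
T = H + E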

21 Recall the expression for the standardized squared distance between two groups, [a′(Ȳ1 − Ȳ2)]²/(a′Spool a). To generate the analogous multivariate expression for m > 2 groups we substitute in the H and E matrices, giving λ = (a′Ha)/(a′Ea). In this case we search for solutions a′ that maximize λ; notice that the trivial solution a′ = 0 is not permissible. Since λ is a ratio of the between-groups variation to the random within-groups variation, and we want to maximize the discrimination between the groups, we maximize λ. ©2009 Philip J. Ramsey, Ph.D.

22 By rearranging terms we have the eigen equation (E⁻¹H − λI)a = 0. With some calculus it can be shown that the solutions are the s nonzero eigenvalues and associated eigenvectors of E⁻¹H; thus the largest eigenvalue is the maximum value of λ and its associated eigenvector gives the solution a′, the scoring coefficients. Note the similarity to PCA. In other words, the coefficients of the linear discriminant functions are the weights from the eigenvectors of E⁻¹H. Typically the number of required linear discriminant functions is based upon the magnitudes of the eigenvalues, and generally only one or two are required. The method is inherently graphical. ©2009 Philip J. Ramsey, Ph.D.
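A minimal numpy sketch of this eigen solution, using made-up data for m = 3 groups and k = 3 variables (all names and values are illustrative, not from the course datasets):

import numpy as np

# Sketch (made-up data): within (E) and between (H) SSCP matrices for m = 3 groups and
# k = 3 variables, followed by the eigen decomposition of E^-1 H.
rng = np.random.default_rng(1)
groups = [rng.normal(loc=mu, scale=1.0, size=(30, 3))
          for mu in ([0, 0, 0], [1, 0.5, 0], [2, 1, 0.5])]

grand_mean = np.vstack(groups).mean(axis=0)
k = grand_mean.size
H, E = np.zeros((k, k)), np.zeros((k, k))
for Y in groups:
    ybar = Y.mean(axis=0)
    d = (ybar - grand_mean).reshape(-1, 1)
    H += len(Y) * (d @ d.T)          # between-groups contribution
    R = Y - ybar
    E += R.T @ R                     # within-group contribution

eigvals, eigvecs = np.linalg.eig(np.linalg.solve(E, H))
order = np.argsort(eigvals.real)[::-1]
eigvals, eigvecs = eigvals.real[order], eigvecs.real[:, order]
print("eigenvalues:", eigvals)                        # only s = min(m-1, k) are nonzero
print("relative importance:", eigvals / eigvals.sum())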

23 In general, for m groups and k variables, s = min(m − 1, k) and is the rank of H; the number of discriminant functions is no greater than the rank of H. For m = 2 groups we see that 1 discriminant function is required. The relative importance of each discriminant function can be assessed by the relative contribution of each eigenvalue to the total (recall the proportion of variation explained concept from PCA): for the ith discriminant function the relative importance is λi / (λ1 + λ2 + … + λs). Typically only the discriminant functions associated with the largest eigenvalues are retained and the remainder are ignored. ©2009 Philip J. Ramsey, Ph.D.

24 Example: We use the dataset FootBallHelmet.JMP. This is a dataset of head dimensions, measured for a study of football helmets, for three groups of individuals: high school football players, college football players, and non-football players. For each of the m = 3 groups, k = 6 variables are measured, and there are a total of N = 90 observations. We use the JMP Discriminant platform to find a solution. ©2009 Philip J. Ramsey, Ph.D.

25 Example: Results from the Discriminant report – the eigenvalues and eigenvectors of E⁻¹H. ©2009 Philip J. Ramsey, Ph.D.

26 Example: From the report we see that only s = 2 nonzero eigenvalues exist (m − 1 = 2), therefore there are only two useful discriminant functions, Z1 and Z2. The coefficients of the functions are obtained from the first two rows of the eigenvector matrix given on the previous slide. Since the first eigenvalue is 94.3% of the total, it is possible that only one discriminant function is actually needed to separate the groups. ©2009 Philip J. Ramsey, Ph.D.

27 Example: Below are Fit Y by X plots of the two discriminant functions by Group. The first function Canon[1] seems to separate the HSFB group quite well from the other two, while the second function Canon[2] has little or no discrimination ability between the groups. ©2009 Philip J. Ramsey, Ph.D.

28 Hypothesis tests can be performed for the significance of the discriminant functions. In developing the discriminant functions we have made no assumptions about the distributions of the multivariate groups; in order to perform hypothesis tests, we do have to assume that the groups are multivariate normal with an identical covariance structure Σ but possibly different centroids μ1, μ2, … , μm. If the discriminant functions are not significant (the null hypothesis), then this is equivalent to saying that the true coefficient vectors a′ of the discriminant functions are 0; stated another way, there is not a significant difference among the m centroids. The null hypothesis can be stated as H0: μ1 = μ2 = … = μm. ©2009 Philip J. Ramsey, Ph.D.

29 A number of multivariate hypothesis tests exist for a difference between a group of centroids; we will not discuss the majority of them in detail. One of the multivariate tests, called Wilks' Λ, is particularly useful for hypothesis testing of the discriminant functions. Again, we skip the details of the test for now, but the test can be shown to be a function of the eigenvalues of E⁻¹H, from which we derive the discriminant functions. Wilks' Λ for s nonzero eigenvalues (the number of discriminant functions) can be shown to be Λ = ∏ 1/(1 + λi), with the product taken over i = 1, … , s. ©2009 Philip J. Ramsey, Ph.D.

30 For k variables, m groups, significance level α (usually 0.05), and N observations, the distribution of the Wilks' Λ test statistic can be compared to tabled percentile values Λk,m-1,N-m,α. Various approximations exist, and JMP uses an approximation based upon the F distribution. Wilks' Λ is a bit unusual as a test statistic in that we reject for smaller values rather than larger. Looking at the formula, note that large eigenvalues lead to small values of the test statistic. In general larger eigenvalues indicate more significant discriminant functions, so our test rejects for values of the test statistic below the critical value. ©2009 Philip J. Ramsey, Ph.D.
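A small sketch of how Λ can be computed from the eigenvalues, here paired with Bartlett's chi-square approximation rather than JMP's F approximation; the eigenvalues, N, k, and m in the example call are made up.

import numpy as np
from scipy import stats

# Sketch: overall Wilks' Lambda from the eigenvalues of E^-1 H, with Bartlett's
# chi-square approximation (JMP uses an F approximation instead).
def wilks_lambda(eigvals):
    return float(np.prod(1.0 / (1.0 + np.asarray(eigvals))))

def bartlett_wilks_test(eigvals, N, k, m):
    lam = wilks_lambda(eigvals)
    chi2 = -(N - 1 - (k + m) / 2.0) * np.log(lam)     # approximate chi-square statistic
    df = k * (m - 1)
    return lam, chi2, stats.chi2.sf(chi2, df)

print(bartlett_wilks_test([1.9, 0.12], N=90, k=6, m=3))   # made-up eigenvalues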

31 A nice aspect of the Wilks' Λ test is that we can also use it to test each of the discriminant functions for significance, in addition to the overall test for the entire set of s functions. If the overall Wilks' Λ test rejects, then we are assured that at least the first discriminant function is significant, but this does not tell us about the remaining s − 1 functions. To test for the other functions, drop the largest eigenvalue and test the remaining set; this procedure can be followed iteratively for the entire set of s functions. The test statistic for the set with the largest eigenvalue deleted is Λ2 = ∏ 1/(1 + λi), with the product taken over i = 2, … , s. ©2009 Philip J. Ramsey, Ph.D.

32 JMP provides the overall Wilks' Λ test, but does not perform tests on the remaining discriminant functions. We can perform these tests by hand, but we will need to compute the F distribution approximation to the Λ distribution to evaluate the tests. For the lth test Λl, with p = k − l + 1, vH = m − l, and vE = N − m, the F approximation is F = [(1 − Λl^(1/t)) / Λl^(1/t)]·(df2/df1), where t = sqrt[(p²vH² − 4)/(p² + vH² − 5)], df1 = p·vH, and df2 = [vE + vH − (p + vH + 1)/2]·t − (p·vH − 2)/2. ©2009 Philip J. Ramsey, Ph.D.

33 Example: Continuing with the football helmet data, we compute the Wilks' test statistics for the s = 2 discriminant functions. From an F distribution table the critical value is F0.95,12,164 ≈ 1.89. Since our test statistic F is far greater than 1.89, we overwhelmingly reject the null hypothesis; the p-value is essentially zero. We can now assume that at least the first discriminant function is statistically significant. ©2009 Philip J. Ramsey, Ph.D.

34 Example: Next we perform the test for the second discriminant function. Since our test statistic F is smaller than the tabled critical value F0.95,5,83, we fail to reject the null hypothesis; the p-value is not small. We cannot conclude that the second discriminant function is statistically significant. ©2009 Philip J. Ramsey, Ph.D.

35 Example: From the discriminant analysis in JMP we can get the overall test of significance for the 2 discriminant functions. JMP provides several other tests, however we will not discuss these in this section. As shown, the Wilks' Λ test is useful for discriminant functions since we can partition the test for each of the discriminant functions. All of the multivariate tests overwhelmingly reject the null hypothesis, so they are in agreement that at least the first discriminant function is statistically significant. Our own partitioned test indicated that the second function may not be significant. ©2009 Philip J. Ramsey, Ph.D.

36 Another statistic that can be used to try to determine the importance of each of the discriminant functions is the canonical correlation coefficient. The canonical correlation represents the relationship between the discriminant function and the grouping variable; the higher the correlation, the greater the ability of the discriminant function to sort observations into the proper groups. In ANOVA a categorical factor is transformed into dummy variables (each taking only the values 1 or 0) in order to construct a linear model. For m levels of the categorical factor, m − 1 dummy variables are required. For example, suppose we have m = 3 groups; then we need 2 dummy variables. ©2009 Philip J. Ramsey, Ph.D.

37 We use the football helmet data as an example. Canonical correlation measures the association between each of the discriminant functions and a best linear combination of the dummy variables associated with the categories of the grouping variable. The correlation is a measure of how much of the variation between the groups can be explained by the discriminant function. Mathematically, the canonical correlation for the ith discriminant function can be computed as ri = sqrt[λi/(1 + λi)]. For the three groups the dummy coding is:
Category  D1  D2
HSFB       1   0
CollFB     0   1
NonFB      0   0
©2009 Philip J. Ramsey, Ph.D.

38 Example: Again using the football helmet data, we have JMP calculate the canonical correlations for the two discriminant functions. The first discriminant function appears to be highly correlated with the groupings, while the second appears weakly correlated. The hand computations follow ri = sqrt[λi/(1 + λi)] using the two eigenvalues from the report. ©2009 Philip J. Ramsey, Ph.D.

39 Another issue to consider in discriminant analysis is whether or not we require all of the potential k variables in our functions in order for them to classify correctly. Often some of the variables may not be of value for classification. A stepwise model building procedure can be used to see if some of the variables can be dropped from consideration. There are three basic approaches (a simplified forward-selection sketch follows below):
Forward – sequentially enter variables based on ability to separate the groups.
Backward – sequentially remove variables based on ability to separate the groups.
Stepwise – combine forward and backward (recommended).
The procedures are based upon computing Wilks' Λ. ©2009 Philip J. Ramsey, Ph.D.
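A simplified sketch of forward selection driven by Wilks' Λ; it omits the significance checks and the backward-removal step of the full stepwise procedure, and the data, labels, and function names are made up.

import numpy as np

# Simplified sketch of forward selection by Wilks' Lambda: X is an (N x k) data matrix,
# g an array of group labels; at each step the candidate giving the smallest Lambda
# (best added separation) is entered.
def wilks(X, g):
    groups = [X[g == lev] for lev in np.unique(g)]
    grand = X.mean(axis=0)
    E = sum((Y - Y.mean(axis=0)).T @ (Y - Y.mean(axis=0)) for Y in groups)
    H = sum(len(Y) * np.outer(Y.mean(axis=0) - grand, Y.mean(axis=0) - grand) for Y in groups)
    return np.linalg.det(E) / np.linalg.det(E + H)

def forward_select(X, g, n_keep):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_keep:
        lams = {j: wilks(X[:, selected + [j]], g) for j in remaining}
        best = min(lams, key=lams.get)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(3)
X = rng.normal(size=(90, 6))
X[:30, 0] += 2.0            # crude built-in separation on the first two variables
X[30:60, 1] += 1.0
g = np.repeat(np.array(["HS", "Coll", "Non"]), 30)
print(forward_select(X, g, n_keep=3))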

40 The initial step for the stepwise selection is to compute a univariate ANOVA for each of the k variables to see how they individually separate the m group centroids. The first variable entered into the model is the one with the smallest p-value for the univariate F test of significance. Since JMP provides stepwise Discriminant analysis we will work along with the football helmet data to explain the stepwise method. ©2009 Philip J. Ramsey, Ph.D.

41 The F tests and p-values are the univariate ANOVA tests. It appears that EyeHD is our first candidate to enter the model. On the next two slides we show that the initial p-values come from a set of 6 univariate ANOVAs. ©2009 Philip J. Ramsey, Ph.D.

42 The first three univariate F tests. ©2009 Philip J. Ramsey, Ph.D.

43 The next three univariate F tests. ©2009 Philip J. Ramsey, Ph.D.

44 After entering EyeHD into the model, partial Wilks' Λ tests are computed for each of the remaining k − 1 variables, and the one with the smallest p-value is entered into the model, since it gives the best separation of the groups given that the first selected variable is already in the model. Here Y1 is the variable entered at step 1 and Yr is one of the remaining variables; the partial statistic is Λ(Yr | Y1) = Λ(Y1, Yr) / Λ(Y1). We have to compute 5 partial tests: at step 2 a partial test is performed for every possible 2-variable model, since EyeHD is already selected. ©2009 Philip J. Ramsey, Ph.D.

45 The Wilks' Λ test for a model containing only EyeHD is shown below. This is equivalent to the univariate ANOVA F test. We need this test value to compute the partial test values at stage two. ©2009 Philip J. Ramsey, Ph.D.

46 With both EyeHD and WDim in the model we compute the test statistic Λ(EyeHD, WDim), and the associated partial Wilks' test for WDim is Λ(WDim | EyeHD) = Λ(EyeHD, WDim) / Λ(EyeHD). It can be shown that for a model with p variables, the associated partial F test with m − 1 and N − m − p + 1 degrees of freedom is given by F = [(1 − Λpartial)/Λpartial]·(N − m − p + 1)/(m − 1). ©2009 Philip J. Ramsey, Ph.D.

47 For stage two, the partial F test for a model containing EyeHD and WDim has 2 and 86 degrees of freedom. The associated p-value for this partial F test is the smallest among the 5 tests, so we enter WDim into the model. With WDim in the model, both variables still appear to be significant, so we do not remove a variable from the model. Next we perform the partial F tests for the remaining 4 variables and select the one with the most significant partial F test. ©2009 Philip J. Ramsey, Ph.D.

48 The partial Wilk’s test at step three (p = 3)
©2009 Philip J. Ramsey, Ph.D.

49 So, at step 3 we admit Jaw to the model and then check to make certain that all three variables are still significant. If a variable is no longer significant we may opt to remove it from the model – the backward aspect of the stepwise procedure. At step 4 it appears that EarHD is the only candidate to enter the model. ©2009 Philip J. Ramsey, Ph.D.

50 At step 4 we entered EarHD into the model, and all three variables previously entered remain significant, so we do not remove any variables. Since neither of the two remaining variables is even close to significant, we stop with the 4-variable model. In the original discriminant functions both Circum and FBEye had nearly 0 weights, so they did not contribute much to the separation capability, and we drop them in our new reduced model. ©2009 Philip J. Ramsey, Ph.D.

51 We next fit the reduced 4-variable model. Notice that by Wilks' Λ the reduced model appears more significant than the original 6-variable model. However, we still do not know whether it classifies or separates any better than the full model. ©2009 Philip J. Ramsey, Ph.D.

52 In comparing the full and reduced linear discriminant functions, we notice that the coefficients are not much different. Therefore, it is unlikely that the new model will separate any more effectively than the original or full model. ©2009 Philip J. Ramsey, Ph.D.

53 On the left is the Mosaic plot of Actual vs. Predicted for the 6-variable model, and on the right for the 4-variable model. Both plots are virtually identical. ©2009 Philip J. Ramsey, Ph.D.

54 The procedure we have just demonstrated with the football helmet data is often referred to as stepwise discriminant analysis. The title is a bit of a misnomer, since we never compute any discriminant functions during the stepwise procedure; rather, the procedure attempts to find the subset of the original k variables or covariates which provides significant separation between the m group means. The procedure might be better named stepwise MANOVA. Once the subset of the k variables is found, one can use this subset to construct discriminant functions. There is confusion on this point, so be wary of the term stepwise discriminant analysis and what it actually implies. ©2009 Philip J. Ramsey, Ph.D.

55 Our discussion to this point has mostly focused on the construction of linear discriminant functions. As noted, the original purpose of discriminant analysis was not classification into groups; Fisher developed discriminant analysis to provide a graphical technique that could distinguish between multivariate groups, very much in the spirit of PCA and biplots (which had not yet been invented). However, over time discriminant analysis and classification have become synonymous, and this does lead to some confusion. In general the linear discriminant functions are only used to build graphical displays such as the biplots in JMP's Discriminant platform. Classification in general uses a different set of functions; unfortunately these functions are also often called discriminant functions. ©2009 Philip J. Ramsey, Ph.D.

56 Fisher’s linear discriminant procedure is nonparametric in that it makes no assumption about the distribution for each group other than equal covariance structure. It can be shown for 2 multivariate normal groups with equal covariance matrices, Fisher’s linear discriminant functions are optimal for classification. If we depart from these assumptions they are not so. Therefore, the linear discriminant functions for 2 groups serve as optimal linear classification functions – a better term for them. For two populations the linear discriminant or classification function is very straightforward to use as a classification rule for new sets of observations not used in creating the discriminant functions. Assuming that we have no prior knowledge of the probability that an observation comes from one population or the other the classification rule is straightforward. ©2009 Philip J. Ramsey, Ph.D.

57 Assign the new observation vector Y0 (a row of a data table) to group 1 if its discriminant score is greater than the midpoint of the mean discriminant scores for the two groups. Z0 is calculated from the linear discriminant function, Z0 = a′Y0 = (Ȳ1 − Ȳ2)′Spool⁻¹Y0. We also have that the midpoint of the two mean scores is ½(Z̄1 + Z̄2) = ½(Ȳ1 − Ȳ2)′Spool⁻¹(Ȳ1 + Ȳ2). ©2009 Philip J. Ramsey, Ph.D.

58 The classification rule is to classify Y0 to group 1 if Z0 > ½(Z̄1 + Z̄2), and to assign Y0 to group 2 if Z0 < ½(Z̄1 + Z̄2) (taking group 1 to be the group with the larger mean score). In the very rare case of equality, randomly assign Y0 to either group. The rule implicitly assumes a priori that the probability that the new observation came from either group is equal, or simply p1 = p2 = 1/2. If we have prior information that favors one group or the other in terms of classification, then we can modify the classification rule to take this new information into account. ©2009 Philip J. Ramsey, Ph.D.
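A self-contained sketch of this midpoint rule; the coefficient vector, group means, and new observation are hypothetical stand-ins for a fit like the one in the slide-7 sketch.

import numpy as np

# Sketch of the two-group midpoint rule.
def classify_two_group(y0, a, ybar1, ybar2, labels=("group 1", "group 2")):
    z0 = a @ y0
    cutoff = 0.5 * (a @ ybar1 + a @ ybar2)   # midpoint of the two mean scores
    # a is assumed built as Spool^-1 (ybar1 - ybar2), so group 1 has the larger mean score
    return labels[0] if z0 > cutoff else labels[1]

a = np.array([1.5, 0.5])                                   # hypothetical coefficient vector
ybar1, ybar2 = np.array([40.0, 60.0]), np.array([35.0, 55.0])
y0 = np.array([40.0, 63.0])                                # hypothetical new observation
print(classify_two_group(y0, a, ybar1, ybar2, labels=("Temperature A", "Temperature B")))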

59 We illustrate the classification rule using the Temperature example covered earlier. Recall there are two groups, Temp A and Temp B. The classification rule is to classify Z0 to group 1 if Z0 > ½(Z̄A + Z̄B), or to group 2 otherwise. Next we assume that both groups have a normal distribution with the same standard deviation, but different means. To illustrate the classification we simulated 500 values of Z for each group assuming normal distributions. The Graph Builder display on the next slide illustrates the two Z distributions created by the linear discriminant; one can see that the classification rule is quite intuitive. ©2009 Philip J. Ramsey, Ph.D.

60 The Graph Builder display of the two simulated Z distributions, Temp A and Temp B, from the Temperature example. ©2009 Philip J. Ramsey, Ph.D.

61 Before discussing the m > 2 groups case, suppose we only have two groups but we have prior probabilities p1 and p2 that an observation may belong to either group. In order to use this information we need to assume a probability distribution for each group; the natural choice is to assume that each group is multivariate normal with the same covariance matrix Σ. With some algebra Rencher (2002) shows that the asymptotically optimal classification rule (assuming multivariate normality with equal covariance matrices) reduces to: assign Y0 to group 1 if Z0 ≥ ½(Z̄1 + Z̄2) + ln(p2/p1). Obviously if p1 = p2 (we say the priors are uniform) then ln(p2/p1) = 0 and the rule reduces to the classification rule given previously. ©2009 Philip J. Ramsey, Ph.D.

62 Example: Suppose for the steel processing example given earlier we have prior probabilities pA = 0.7 and pB = 0.3 that a sample of steel was processed at one of the two temperatures. Our classification criterion becomes: assign to Temperature A if Z0 ≥ ½(Z̄A + Z̄B) + ln(pB/pA) = ½(Z̄A + Z̄B) + ln(0.3/0.7). Suppose we test a new sample of steel without knowledge of which temperature it was processed at, and the values are Yield = 40 and Ultimate = 63. Using the linear discriminant function estimated earlier, Z0 = 49.76; therefore we assign the sample to Temperature B. Why? Notice that we have assigned the sample to B, but the discriminant value is close to the cutoff and we are not very confident in our classification. ©2009 Philip J. Ramsey, Ph.D.

63 For more than two groups we can take a different approach to classification, based upon the distance of a new observation from the estimated centroid of each group and the posterior probability that the new observation belongs to each group. Note this procedure can also be used for only two groups and is analogous to Fisher's procedure. We will once again assume that the observations come from distributions that are multivariate normal with the same covariance matrix Σ and possibly different centroids. We wish to develop a rule that assigns or classifies a new observation to a group based upon the highest posterior probability that the observation came from that distribution. We assume that we have k variables measured on m groups that we designate π1, π2, … , πm. Associated with each of the m groups is a prior probability of membership p1, p2, … , pm. If the priors are equal then we say they are uniform, as discussed earlier. ©2009 Philip J. Ramsey, Ph.D.

64 The squared Mahalanobis distance for an observation vector Y0 from the centroid of the ith group is Di²(Y0) = (Y0 − Ȳi)′Spool⁻¹(Y0 − Ȳi) = Y0′Spool⁻¹Y0 − 2Ȳi′Spool⁻¹Y0 + Ȳi′Spool⁻¹Ȳi. Notice that the first term on the right hand side is not a function of i and can be ignored for classification – it is constant for all groups. Rencher (2002) shows that an optimal linear classification function is then Li(Y0) = Ȳi′Spool⁻¹Y0 − ½Ȳi′Spool⁻¹Ȳi. Assign the observation vector Y0 to the group for which Li is a maximum, which is the same group for which Di² is smallest. ©2009 Philip J. Ramsey, Ph.D.
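A numpy sketch of these linear classification scores for m groups sharing a pooled covariance matrix; the group means, Spool, and new observation y0 below are made up, and the optional prior adjustment anticipates the next slide.

import numpy as np

# Sketch: Li(y0) = ybar_i' Spool^-1 y0 - 0.5 * ybar_i' Spool^-1 ybar_i for each group.
def linear_scores(y0, group_means, Spool, log_priors=None):
    Sinv = np.linalg.inv(Spool)
    scores = np.array([m @ Sinv @ y0 - 0.5 * m @ Sinv @ m for m in group_means])
    if log_priors is not None:       # optional ln(p_i) adjustment used on the next slide
        scores = scores + np.asarray(log_priors)
    return scores

group_means = [np.array([0.0, 0.0]), np.array([1.0, 0.5]), np.array([2.0, 1.0])]
Spool = np.array([[1.0, 0.3], [0.3, 1.0]])
y0 = np.array([1.8, 0.9])
scores = linear_scores(y0, group_means, Spool)
print("assigned group:", int(np.argmax(scores)))   # the group with the maximum score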

65 However, this simple linear classification rule does not allow us to make use of prior information about the probabilities of belonging to the different groups. If we assume that we have prior probabilities of membership for each of the m groups, then our linear classification rule is simply modified to incorporate this information. Let pi be the prior probability of membership in the ith group; then we have the classification rule (Rencher, 2002) Li(Y0) = ln(pi) + Ȳi′Spool⁻¹Y0 − ½Ȳi′Spool⁻¹Ȳi. Again assign Y0 to the group for which the function is maximized. Assuming multivariate normal distributions with equal covariance, the rule is optimal in terms of misclassification errors. ©2009 Philip J. Ramsey, Ph.D.

66 However, we can do more than simply use the prior probabilities to adjust the classification functions. Basically, what we wish to do is estimate the probability that Y0 came from a particular group given that we have observed the data; we call this the posterior probability of membership, since it is estimated after we observe the data. We will introduce a concept called Bayes' Rule to estimate the posterior probabilities given the data and the prior probabilities. Our classification rule will then be based upon the posterior probability of membership in each group: we assign the observation vector Y0 to the group for which the posterior probability of membership is highest. Note this is the rule JMP uses in the Discriminant platform to classify observations. ©2009 Philip J. Ramsey, Ph.D.

67 Assume that fi(Y) represents the probability density function for the ith group with prior probability pi; then using Bayes' Rule the posterior probability for the ith group is P(πi | Y0) = pi·fi(Y0) / [p1·f1(Y0) + … + pm·fm(Y0)]. For general probability distributions these posterior probabilities may be very difficult or impossible to calculate in closed form; however, for the multivariate normal distribution they are straightforward. Recall, the estimated density function for the multivariate normal is fi(Y) = (2π)^(-k/2)·|Spool|^(-1/2)·exp[-½(Y − Ȳi)′Spool⁻¹(Y − Ȳi)]. ©2009 Philip J. Ramsey, Ph.D.

68 For the multivariate normal, our expression for the posterior probabilities, if we assume an equal covariance matrix, becomes P(πi | Y0) = pi·exp[-½Di²(Y0)] / Σj pj·exp[-½Dj²(Y0)], since the constant (2π)^(-k/2)·|Spool|^(-1/2) cancels from the numerator and denominator. Furthermore, if the priors are uniform and equal to a value p, they also drop out of the expression. The above expression is how JMP calculates the posterior probabilities of membership. The observation vector Y0 is then assigned to the group for which the posterior probability of membership is highest. ©2009 Philip J. Ramsey, Ph.D.
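A self-contained sketch of this posterior probability calculation under a common covariance matrix; the group means, Spool, y0, and priors are made up.

import numpy as np

# Sketch: P(group i | y0) proportional to p_i * exp(-0.5 * D_i^2(y0)).
def posterior_probs(y0, group_means, Spool, priors):
    Sinv = np.linalg.inv(Spool)
    d2 = np.array([(y0 - m) @ Sinv @ (y0 - m) for m in group_means])  # squared Mahalanobis distances
    w = np.asarray(priors) * np.exp(-0.5 * d2)                        # prior times normal kernel
    return w / w.sum()

group_means = [np.array([0.0, 0.0]), np.array([1.0, 0.5]), np.array([2.0, 1.0])]
Spool = np.array([[1.0, 0.3], [0.3, 1.0]])
y0 = np.array([1.8, 0.9])
print(posterior_probs(y0, group_means, Spool, priors=[1/3, 1/3, 1/3]))      # uniform priors
print(posterior_probs(y0, group_means, Spool, priors=[0.15, 0.25, 0.60]))   # nonuniform priors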

69 Example: We will use the dataset Iris.JMP to demonstrate the impact of prior probabilities on the classification of observations into m groups. This is the classic dataset that R.A. Fisher first used to demonstrate the concept of discriminant analysis. The data consist of measurements on 150 Iris plants for which the species is known. The goal is to estimate a discriminant function to separate the three species and to classify them based on the measurements. Below is a partial view of the data table. ©2009 Philip J. Ramsey, Ph.D.

70 We first show the linear discriminant functions. These functions are used to generate the biplot and are not directly used for classification. In the biplot the confidence ellipsoids for each centroid are based on a multivariate normal distribution. ©2009 Philip J. Ramsey, Ph.D.

71 We next show a partial view of the calculated posterior probabilities for each group under the assumption that the prior probabilities are equal (uniform). Note there are 3 misclassifications out of 150. ©2009 Philip J. Ramsey, Ph.D.

72 Now suppose we have prior information about the occurrence of the three species in the population where the sample was collected. Suppose we know that 15% are Setosa, 25% are Versicolor, and 60% are Virginica. You can specify your own prior probabilities by selecting the “Specify Prior” option in the main menu of the Discriminant Analysis report window. ©2009 Philip J. Ramsey, Ph.D.

73 Notice that the posterior probabilities change once we specify the nonuniform priors; we now have 4 misclassifications. However, our priors in this case were completely arbitrary; in practice they should be based on scientific understanding. ©2009 Philip J. Ramsey, Ph.D.

74 You can save the posterior probabilities to the data table by selecting the option “Score Options” and then the option “Save Formulas”. ©2009 Philip J. Ramsey, Ph.D.

75 The posterior probabilities are stored in the data table in the format shown below. Here we assume uniform priors for the three groups. ©2009 Philip J. Ramsey, Ph.D.

76 The column Prob[0] represents the denominator in the posterior probability calculation using Bayes' rule. Note that the prior probabilities are factored into the SqDist[ ] functions as the term -2ln(pi), which is equivalent to the posterior probability formula shown earlier. The saved formula columns are Prob[0], Prob[setosa], Prob[versicolor], and Prob[virginica]. ©2009 Philip J. Ramsey, Ph.D.

77 Example: We will use the OwlDiet.JMP data. Below are the results for the analysis using uniform priors for the seven species. Notice that with uniform priors we have 13.97% misclassified. ©2009 Philip J. Ramsey, Ph.D.

78 Example: With priors proportional to occurrence in the sample we reduce the misclassification percentage. ©2009 Philip J. Ramsey, Ph.D.

79 Example: Let’s revisit the football helmet data
Example: Let’s revisit the football helmet data. Recall we earlier examined the data using the biplots from the linear discrimant functions. We reexamine the data looking at linear classification instead. We again assume uniform priors on class membership. Notice that 24 rows have been misclassified or 26.67%. The highlighted rows in the table are misclassifications. ©2009 Philip J. Ramsey, Ph.D.

80 Example: From the Mosaic plot of predicted group membership vs. actual membership, we can see that significant misclassification occurs for the NonFB and CollFB groups. The ROC curves plot, for each group, the probability of a correct classification (sensitivity, Y axis) against the probability of an incorrect classification (X axis). A perfect classifier has sensitivity = 1.0. ©2009 Philip J. Ramsey, Ph.D.

81 Recall that linear classification (discriminant) analysis makes an assumption of multivariate normal distributions for each of the m groups and assumes that all groups share a common covariance structure or matrix. A variation of classification analysis exists, where one does not assume equal covariance structure for the groups and this version is referred to as quadratic discriminant analysis. The boundaries between the groups in quadratic discriminant analysis are literally quadratic in shape, hence the term quadratic. The application of quadratic discriminant classification is analogous to the linear version except the discriminant score functions are more complicated and have more parameters to estimate. In general linear discriminant analysis uses simpler functions but can be biased if the equal covariance assumption is invalid. Quadratic discriminant functions are larger and more variable. ©2009 Philip J. Ramsey, Ph.D.

82 The quadratic discriminant function for the ith group, assuming one is interested in classification, can be shown to be (we omit considerable mathematical detail) Qi(Y0) = (Y0 − Ȳi)′Si⁻¹(Y0 − Ȳi) + ln|Si| − 2ln(pi). The rule is to assign the observation to the group for which Qi is a minimum. Notice that the first term on the right hand side is just the squared Mahalanobis distance for the observation vector Y0, computed with the ith group's own covariance matrix Si. Also notice that the discriminant score is weighted by the determinant of the covariance matrix: larger covariance matrices have a larger penalty applied to the score. The quadratic classification rules cannot be reduced to linear functions. ©2009 Philip J. Ramsey, Ph.D.
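A self-contained sketch of this quadratic score, with each group keeping its own covariance matrix; all means, covariance matrices, priors, and the new observation are made up.

import numpy as np

# Sketch: Qi(y0) = (y0 - ybar_i)' Si^-1 (y0 - ybar_i) + ln|Si| - 2 ln(pi); assign to the minimum.
def quadratic_scores(y0, group_means, group_covs, priors):
    scores = []
    for m, S, p in zip(group_means, group_covs, priors):
        d = y0 - m
        q = d @ np.linalg.solve(S, d) + np.log(np.linalg.det(S)) - 2.0 * np.log(p)
        scores.append(q)
    return np.array(scores)          # assign to the group with the smallest score

group_means = [np.array([0.0, 0.0]), np.array([1.0, 0.5]), np.array([2.0, 1.0])]
group_covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]]), np.array([[1.0, -0.3], [-0.3, 0.5]])]
y0 = np.array([1.8, 0.9])
q = quadratic_scores(y0, group_means, group_covs, priors=[1/3, 1/3, 1/3])
print("assigned group:", int(np.argmin(q)))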

83 We illustrate quadratic discriminant classification with the football helmet data. In the Discriminant platform in JMP Quadratic discriminant analysis is one of the options available. ©2009 Philip J. Ramsey, Ph.D.

84 First let’s examine the three covariance matrices to see if there may be differences – no straightforward tests exist for equal covariance structure and we will rely on visual assessment. Some differences do seem to exist in the sample covariance structures for the groups. For purposes of example we will assume the three matrices are different. ©2009 Philip J. Ramsey, Ph.D.

85 Remember the linear discriminant functions are unchanged by selecting quadratic discriminant analysis, and as a result the biplot is unchanged. It is in the classification probabilities that we see the difference; the report shows the Quadratic and Linear classification results for comparison. ©2009 Philip J. Ramsey, Ph.D.

86 The key difference in the classification algorithm for quadratic discriminant (classification) analysis is we assume a different covariance matrix in the probability calculation for each of the groups. Recall in linear discriminant analysis we use an overall pooled covariance matrix for the probability calculations in all of the groups. However, both methods do rely on an assumption that each group follows a multivariate normal distribution. In general, both methods are robust to the normal assumption (quadratic classification is more sensitive), however the equal covariance assumption is problematic in many cases for linear discriminant analysis. If the equal covariance assumption is invalid then the linear procedure is often quite biased in terms of correct classification. Unfortunately the quadratic procedure requires the estimation of more parameters and the classification formulas are more variable. ©2009 Philip J. Ramsey, Ph.D.

87 A compromise classification procedure exists between the linear and quadratic methods. The method, due to Friedman (1988), attempts to find a compromise between the bias of linear classification and the added variability of quadratic classification functions. The method is often referred to as regularized discriminant analysis, although once again it is a classification procedure that is quite different from Fisher's discriminant analysis. We will not delve into the mathematical details of Friedman's method; a copy of his paper can be found online. The key to his method is to find values for two parameters, λ and γ, which are used to create a regularized covariance matrix for each of the groups; choosing these values is difficult in practice, and we omit the details. JMP implements the method, but you have to supply the values for the two constants. ©2009 Philip J. Ramsey, Ph.D.
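A rough sketch of how such a regularized covariance matrix can be formed; the parameterization below is one common presentation and may differ in detail from Friedman's paper and from JMP, and the matrices are made up.

import numpy as np

# Sketch: lambda_ blends a group's covariance with the pooled matrix, and gamma shrinks
# the result toward a multiple of the identity.
def regularized_cov(S_i, S_pool, lambda_, gamma):
    k = S_i.shape[0]
    S_lam = (1.0 - lambda_) * S_i + lambda_ * S_pool   # lambda_ = 1 recovers the pooled (linear) case
    return (1.0 - gamma) * S_lam + gamma * (np.trace(S_lam) / k) * np.eye(k)

S_i = np.array([[2.0, 0.4], [0.4, 1.0]])        # one group's covariance matrix
S_pool = np.array([[1.5, 0.2], [0.2, 1.2]])     # pooled covariance matrix
print(regularized_cov(S_i, S_pool, lambda_=0.5, gamma=0.1))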

