Topic 20: Single Factor Analysis of Variance. Outline Analysis of Variance –One set of treatments (i.e., single factor) Cell means model Factor effects.


1 Topic 20: Single Factor Analysis of Variance

2 Outline
Analysis of Variance
  – One set of treatments (i.e., single factor)
Cell means model
Factor effects model
  – Link to linear regression using indicator explanatory variables

3 One-Way ANOVA
The response variable Y is continuous
The explanatory variable is categorical
  – We call it a factor
  – The possible values are called levels
This approach is a generalization of the independent two-sample pooled t-test
In other words, it can be used when there are more than two treatments

4 Data for One-Way ANOVA
Y is the response variable
X is the factor (it is qualitative/discrete)
  – r is the number of levels
  – we often refer to these levels as groups or treatments
$Y_{ij}$ is the $j$th observation in the $i$th group

5 Notation
For $Y_{ij}$ we use
  – $i$ to denote the level of the factor
  – $j$ to denote the $j$th observation at factor level $i$
$i = 1, \ldots, r$ levels of factor X
$j = 1, \ldots, n_i$ observations for level $i$ of factor X
  – $n_i$ does not need to be the same in each group

6 KNNL Example (p 685)
Y is the number of cases of cereal sold
X is the design of the cereal package
  – there are 4 levels for X because there are 4 different package designs
$i = 1$ to 4 levels
$j = 1$ to $n_i$ stores with design $i$ ($n_i = 5, 5, 4, 5$)
We will write $n$ when the $n_i$ are the same across groups

7 Data for one-way ANOVA
data a1;
  infile 'c:../data/ch16ta01.txt';
  input cases design store;
proc print data=a1;
run;

8 The data
Obs  cases  design  store
  1     11       1      1
  2     17       1      2
  3     16       1      3
  4     14       1      4
  5     15       1      5
  6     12       2      1
  7     10       2      2
  8     15       2      3

9 Plot the data
symbol1 v=circle i=none;
proc gplot data=a1;
  plot cases*design;
run;

10 The plot

11 Plot the means
proc means data=a1;
  var cases;
  by design;
  output out=a2 mean=avcases;
proc print data=a2;
symbol1 v=circle i=join;
proc gplot data=a2;
  plot avcases*design;
run;

12 New Data Set
Obs  design  _TYPE_  _FREQ_  avcases
  1       1       0       5     14.6
  2       2       0       5     13.4
  3       3       0       4     19.5
  4       4       0       5     27.2

13 Plot of the means

14 The Model
We assume that the response variable is
  – Normally distributed with
    1. a mean that may depend on the level of the factor
    2. constant variance
All observations are assumed independent
NOTE: Same assumptions as linear regression, except that no linear relationship between X and E(Y|X) is assumed

15 Cell Means Model
A “cell” refers to a level of the factor
$Y_{ij} = \mu_i + \varepsilon_{ij}$
  – where $\mu_i$ is the theoretical mean or expected value of all observations at level (or cell) $i$
  – the $\varepsilon_{ij}$ are iid $N(0, \sigma^2)$, which means
  – $Y_{ij} \sim N(\mu_i, \sigma^2)$ and independent
  – This is called the cell means model

16 Parameters
The parameters of the model are
  – $\mu_1, \mu_2, \ldots, \mu_r$
  – $\sigma^2$
Question (Version 1) – Does our explanatory variable help explain Y?
Question (Version 2) – Do the $\mu_i$ vary?
$H_0: \mu_1 = \mu_2 = \cdots = \mu_r = \mu$ (a constant)
$H_a$: not all of the $\mu_i$ are the same

17 Estimates
Estimate $\mu_i$ by the mean of the observations at level $i$ (sample mean)
  $\hat{\mu}_i = \bar{Y}_{i\cdot} = \sum_j Y_{ij} / n_i$
For each level $i$, also get an estimate of the variance (sample variance)
  $s_i^2 = \sum_j (Y_{ij} - \bar{Y}_{i\cdot})^2 / (n_i - 1)$
We combine these to get an overall estimate of $\sigma^2$
Same approach as pooled t-test
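
As a small illustration of these two estimators, the sketch below applies them to the five design 1 observations shown in the data listing (11, 17, 16, 14, 15). The variable names are illustrative assumptions, not part of the original SAS program.

```python
from statistics import mean, variance

# Cases sold at the five stores that used package design 1 (from the data listing).
design1_cases = [11, 17, 16, 14, 15]

# Sample mean: the estimate of mu_1 in the cell means model.
mu_hat_1 = mean(design1_cases)

# Sample variance with divisor n_1 - 1, and the matching standard deviation.
s2_1 = variance(design1_cases)
s_1 = s2_1 ** 0.5

print(mu_hat_1, s2_1, s_1)
```

The mean and standard deviation agree with the design 1 row that SAS reports for the MEANS statement.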

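18 Pooled estimate of $\sigma^2$
If the $n_i$ were all the same we would average the $s_i^2$
  – Do not average the $s_i$
In general we pool the $s_i^2$, giving weights proportional to the df, $n_i - 1$
The pooled estimate is
  $s^2 = \sum_i (n_i - 1) s_i^2 / \sum_i (n_i - 1) = \sum_i (n_i - 1) s_i^2 / (n_T - r)$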
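
The pooling step can be checked by hand. The sketch below weights each group's sample variance by its degrees of freedom, using the group sizes and standard deviations that SAS reports for the cereal example; the function name `pooled_variance` is a hypothetical helper introduced only for this illustration.

```python
from math import sqrt

# Group sizes and sample standard deviations from the MEANS statement output.
sizes = [5, 5, 4, 5]
std_devs = [2.30217289, 3.64691651, 2.64575131, 3.96232255]


def pooled_variance(sizes, std_devs):
    """Weight each s_i^2 by its degrees of freedom, n_i - 1."""
    numerator = sum((n_i - 1) * s_i ** 2 for n_i, s_i in zip(sizes, std_devs))
    denominator = sum(n_i - 1 for n_i in sizes)  # equals n_T - r
    return numerator / denominator


s2_pooled = pooled_variance(sizes, std_devs)
root_mse = sqrt(s2_pooled)
print(s2_pooled, root_mse)
```

The two printed values should match the error mean square and Root MSE in the proc glm ANOVA output.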

19 Running proc glm
proc glm data=a1;
  class design;
  model cases=design;
  means design;
  lsmeans design;
run;
Difference 1: Need to specify factor variables
Difference 2: Ask for mean estimates

20 Output
Class Level Information
Class    Levels  Values
design        4  1 2 3 4
Number of Observations Read  19
Number of Observations Used  19
It is important to check these summaries!

21 SAS 9.3 default output for MEANS statement

22 MEANS statement output
Level of design  N  cases Mean     Std Dev
              1  5  14.6000000  2.30217289
              2  5  13.4000000  3.64691651
              3  4  19.5000000  2.64575131
              4  5  27.2000000  3.96232255
Table of sample means and sample standard deviations

23 SAS 9.3 default output for LSMEANS statement

24 LSMEANS statement output
design  cases LSMEAN  Standard Error  Pr > |t|
     1    14.6000000       1.4523544    <.0001
     2    13.4000000       1.4523544    <.0001
     3    19.5000000       1.6237816    <.0001
     4    27.2000000       1.4523544    <.0001
Provides estimates based on model (i.e., constant variance)
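
The LSMEANS standard errors use the pooled variance instead of each group's own variance. A minimal sketch, assuming the usual formula sqrt(MSE / n_i) for the standard error of a cell mean under constant variance, with the error mean square taken from the proc glm ANOVA output:

```python
from math import sqrt

# Error mean square (pooled variance estimate) from the ANOVA output.
mse = 10.5466667
sizes = [5, 5, 4, 5]

# Standard error of each estimated cell mean under the constant-variance model.
std_errors = [sqrt(mse / n_i) for n_i in sizes]
print(std_errors)
```

Designs 1, 2 and 4 share a standard error because they share a sample size; design 3 has only four stores and so a larger one.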

25 Notation

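26 ANOVA Table
Source  df         SS                                                      MS
Model   $r - 1$    $\sum_{ij} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$  SSR/df_R
Error   $n_T - r$  $\sum_{ij} (Y_{ij} - \bar{Y}_{i\cdot})^2$               SSE/df_E
Total   $n_T - 1$  $\sum_{ij} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2$           SST/df_T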

27 ANOVA SAS Output
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             3     588.2210526  196.0736842    18.59  <.0001
Error            15     158.2000000   10.5466667
Corrected Total  18     746.4210526

R-Square  Coeff Var  Root MSE  cases Mean
0.788055   17.43042  3.247563    18.63158
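
The entries of this table can be rebuilt from the group summaries alone. The sketch below is an illustration under that assumption: it uses only the group sizes, means and standard deviations from the MEANS output, and the function name `one_way_anova` is hypothetical.

```python
# Group summaries from the MEANS statement output.
sizes = [5, 5, 4, 5]
means = [14.6, 13.4, 19.5, 27.2]
std_devs = [2.30217289, 3.64691651, 2.64575131, 3.96232255]


def one_way_anova(sizes, means, std_devs):
    """Rebuild the one-way ANOVA table from per-group summary statistics."""
    n_total = sum(sizes)
    r = len(sizes)
    # Overall mean of all observations (weighted by group size).
    grand_mean = sum(n_i * m for n_i, m in zip(sizes, means)) / n_total
    # Model sum of squares: group means around the overall mean.
    ssr = sum(n_i * (m - grand_mean) ** 2 for n_i, m in zip(sizes, means))
    # Error sum of squares: recovered from the within-group variances.
    sse = sum((n_i - 1) * s ** 2 for n_i, s in zip(sizes, std_devs))
    msr = ssr / (r - 1)
    mse = sse / (n_total - r)
    return {
        "grand_mean": grand_mean,
        "SSR": ssr,
        "SSE": sse,
        "SST": ssr + sse,
        "MSR": msr,
        "MSE": mse,
        "F": msr / mse,
        "R2": ssr / (ssr + sse),
    }


table = one_way_anova(sizes, means, std_devs)
for name, value in table.items():
    print(f"{name}: {value:.6f}")
```

Up to rounding in the reported standard deviations, the printed values reproduce the sums of squares, mean squares, F value and R-Square shown by SAS.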

28 Expected Mean Squares
E(MSR) > E(MSE) when the group means are different
See KNNL p 694–698 for more details
In more complicated models, these tell us how to construct the F test
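
For reference, the standard expected mean squares for the one-way fixed-effects model are sketched below. The symbol $\mu_{\cdot}$ for the weighted mean of the $\mu_i$ is notation introduced here and is not defined on the slide.

```latex
E(\mathrm{MSE}) = \sigma^2,
\qquad
E(\mathrm{MSR}) = \sigma^2 + \frac{\sum_{i=1}^{r} n_i\,(\mu_i - \mu_{\cdot})^2}{r - 1},
\qquad
\mu_{\cdot} = \frac{\sum_{i=1}^{r} n_i\,\mu_i}{n_T}.
```

The extra term in $E(\mathrm{MSR})$ is zero exactly when all the $\mu_i$ are equal, which is why a large ratio MSR/MSE is evidence against the null hypothesis.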

29 F test
F = MSR/MSE
$H_0: \mu_1 = \mu_2 = \cdots = \mu_r$
$H_a$: not all of the $\mu_i$ are equal
Under $H_0$, $F \sim F(r - 1, n_T - r)$
Reject $H_0$ when F is large
Report the P-value

30 Maximum Likelihood Approach
proc glimmix data=a1;
  class design;
  model cases=design / dist=normal;
  lsmeans design;
run;

31 GLIMMIX Output
Model Information
Data Set                   WORK.A1
Response Variable          cases
Response Distribution      Gaussian
Link Function              Identity
Variance Function          Default
Variance Matrix            Diagonal
Estimation Technique       Restricted Maximum Likelihood
Degrees of Freedom Method  Residual

32 GLIMMIX Output
Fit Statistics
-2 Res Log Likelihood       84.12
AIC (smaller is better)     94.12
AICC (smaller is better)   100.79
BIC (smaller is better)     97.66
CAIC (smaller is better)   102.66
HQIC (smaller is better)    94.08
Pearson Chi-Square         158.20
Pearson Chi-Square / DF     10.55

33 GLIMMIX Output
Type III Tests of Fixed Effects
Effect  Num DF  Den DF  F Value  Pr > F
design       3      15    18.59  <.0001

design Least Squares Means
design  Estimate  Standard Error  DF  t Value  Pr > |t|
     1   14.6000          1.4524  15    10.05    <.0001
     2   13.4000          1.4524  15     9.23    <.0001
     3   19.5000          1.6238  15    12.01    <.0001
     4   27.2000          1.4524  15    18.73    <.0001

34 Factor Effects Model
A reparameterization of the cell means model
A useful way of looking at more complicated models
Null hypotheses are easier to state
$Y_{ij} = \mu + \tau_i + \varepsilon_{ij}$
  – the $\varepsilon_{ij}$ are iid $N(0, \sigma^2)$

35 Parameters
The parameters of the model are
  – $\mu, \tau_1, \tau_2, \ldots, \tau_r$
  – $\sigma^2$
The cell means model had $r + 1$ parameters
  – the $r$ $\mu_i$ and $\sigma^2$
The factor effects model has $r + 2$ parameters
  – $\mu$, the $r$ $\tau_i$, and $\sigma^2$
  – Cannot uniquely estimate all parameters

36 An example
Suppose $r = 3$; $\mu_1 = 10$, $\mu_2 = 20$, $\mu_3 = 30$
What is an equivalent set of parameters for the factor effects model?
We need to have $\mu + \tau_i = \mu_i$
  $\mu = 0$, $\tau_1 = 10$, $\tau_2 = 20$, $\tau_3 = 30$
  $\mu = 20$, $\tau_1 = -10$, $\tau_2 = 0$, $\tau_3 = 10$
  $\mu = 5000$, $\tau_1 = -4990$, $\tau_2 = -4980$, $\tau_3 = -4970$
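
A quick check that the three parameter sets on this slide describe the same model: each one is mapped back to its cell means, which should come out identical. The helper name `cell_means` is an assumption made for this illustration.

```python
def cell_means(mu, effects):
    """Map factor effects parameters (mu, effects) back to the cell means mu_i."""
    return [mu + effect for effect in effects]


# The three equivalent parameterizations listed on the slide.
candidates = [
    (0, [10, 20, 30]),
    (20, [-10, 0, 10]),
    (5000, [-4990, -4980, -4970]),
]

results = [cell_means(mu, effects) for mu, effects in candidates]
for (mu, effects), means in zip(candidates, results):
    print(mu, effects, "->", means)
```

Because every set gives the same cell means, the data cannot distinguish between them, which is the sense in which the individual parameters are not uniquely determined.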

37 Problem with factor effects?
These parameters are not estimable, that is, not well defined (not unique)
There are many solutions to the least squares problem
The X΄X matrix for this parameterization does not have an inverse (perfect multicollinearity)
The parameter estimators here are biased (SAS proc glm)

38 Factor effects solution
Put a constraint on the $\tau_i$
Common to assume $\sum_i \tau_i = 0$
This effectively reduces the number of parameters by 1
Numerous other constraints possible

39 Consequences
Regardless of constraint, we always have $\mu_i = \mu + \tau_i$
The constraint $\sum_i \tau_i = 0$ implies
  – $\mu = (\sum_i \mu_i)/r$ (unweighted grand mean)
  – $\tau_i = \mu_i - \mu$ (group effect)
The “unweighted” complicates things when the $n_i$ are not all equal; see KNNL p 702-708
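
The two implications follow in one step from summing the identity for the cell means over the $r$ levels; a short derivation, using the effect symbol $\tau_i$ of the KNNL text:

```latex
\sum_{i=1}^{r} \mu_i
  = \sum_{i=1}^{r} (\mu + \tau_i)
  = r\mu + \sum_{i=1}^{r} \tau_i
  = r\mu
\quad\Longrightarrow\quad
\mu = \frac{1}{r}\sum_{i=1}^{r} \mu_i,
\qquad
\tau_i = \mu_i - \mu .
```

Note that this $\mu$ gives every level equal weight regardless of the $n_i$, so it differs from the mean of all observations whenever the group sizes are unequal.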

40 Hypotheses
$H_0: \mu_1 = \mu_2 = \cdots = \mu_r$
$H_1$: not all of the $\mu_i$ are equal
are translated into
$H_0: \tau_1 = \tau_2 = \cdots = \tau_r = 0$
$H_1$: at least one $\tau_i$ is not 0

41 Estimates of parameters
With the constraint $\sum_i \tau_i = 0$
  $\hat{\mu} = (\sum_i \bar{Y}_{i\cdot})/r$
  $\hat{\tau}_i = \bar{Y}_{i\cdot} - \hat{\mu}$
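
Under the sum-to-zero constraint, the estimates follow from the sample cell means by replacing each cell mean parameter with its estimate. The sketch below assumes that plug-in approach and applies it to the four cereal design means; the variable names are illustrative.

```python
# Sample cell means for the four package designs (MEANS statement output).
cell_means = [14.6, 13.4, 19.5, 27.2]
r = len(cell_means)

# Unweighted average of the cell means estimates mu.
mu_hat = sum(cell_means) / r

# Each effect is the cell mean minus that unweighted average.
effects_hat = [m - mu_hat for m in cell_means]

print(mu_hat, effects_hat, sum(effects_hat))
```

Here the estimate of mu is 18.675, not the overall sample mean 18.63158 reported by SAS, because design 3 has only four stores and the unweighted average ignores group sizes.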

42 Solution used by SAS
Recall, X΄X does not have an inverse
We can use a generalized inverse in its place
$(X'X)^{-}$ is the standard notation
There are many generalized inverses, each corresponding to a different constraint

43 Solution used by SAS
The $(X'X)^{-}$ used in proc glm corresponds to the constraint $\tau_r = 0$
Recall that $\mu$ and the $\tau_i$ are not estimable
But the linear combinations $\mu + \tau_i$ are estimable
These are estimated by the cell means

44 Cereal package example
Y is the number of cases of cereal sold
X is the design of the cereal package
$i = 1$ to 4 levels
$j = 1$ to $n_i$ stores with design $i$

45 SAS coding for X
Class statement generates r explanatory variables
The $i$th explanatory variable is equal to 1 if the observation is from the $i$th group
In other words, the rows of X are
  1 1 0 0 0   for design=1
  1 0 1 0 0   for design=2
  1 0 0 1 0   for design=3
  1 0 0 0 1   for design=4
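
To make the coding concrete, the sketch below builds these rows for the cereal example (group sizes 5, 5, 4, 5) and forms X'X in plain Python. It is an illustration of the indicator coding, not a description of how SAS builds the matrix internally; the function names are hypothetical.

```python
def design_matrix(levels, r):
    """One row per observation: an intercept followed by r indicator columns."""
    return [[1] + [1 if level == k else 0 for k in range(1, r + 1)]
            for level in levels]


def cross_product(X):
    """Compute X'X as a list of lists."""
    p = len(X[0])
    return [[sum(row[a] * row[b] for row in X) for b in range(p)]
            for a in range(p)]


# Factor level of each of the 19 stores, in design order.
sizes = [5, 5, 4, 5]
levels = [i for i, n_i in enumerate(sizes, start=1) for _ in range(n_i)]

X = design_matrix(levels, r=4)
XtX = cross_product(X)
for row in XtX:
    print(row)

# The intercept column equals the sum of the indicator columns in every row,
# so the columns of X are linearly dependent and X'X has no ordinary inverse.
intercept_is_sum = all(row[0] == sum(row[1:]) for row in X)
print(intercept_is_sum)
```

The printed matrix should agree with the upper-left 5 by 5 block of the X'X matrix that proc glm prints with the xpx option.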

46 Some options
proc glm data=a1;
  class design;
  model cases=design / xpx inverse solution;
run;

47 Output
The X'X Matrix
        Int   d1   d2   d3   d4  cases
Int      19    5    5    4    5    354
d1        5    5    0    0    0     73
d2        5    0    5    0    0     67
d3        4    0    0    4    0     78
d4        5    0    0    0    5    136
cases   354   73   67   78  136   7342
Also contains X'Y (last column) and Y'Y (lower right corner)

48 Output
X'X Generalized Inverse (g2)
         Int     d1     d2     d3   d4  cases
Int      0.2   -0.2   -0.2   -0.2    0   27.2
d1      -0.2    0.4    0.2    0.2    0  -12.6
d2      -0.2    0.2    0.4    0.2    0  -13.8
d3      -0.2    0.2    0.2   0.45    0   -7.7
d4         0      0      0      0    0      0
cases   27.2  -12.6  -13.8   -7.7    0  158.2

49 Output matrix
Actually, this matrix is
  [ $(X'X)^{-}$        $(X'X)^{-} X'Y$             ]
  [ $Y'X (X'X)^{-}$    $Y'Y - Y'X (X'X)^{-} X'Y$   ]
Parameter estimates are in the upper right corner, SSE is in the lower right corner (last column on previous page)
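
The block structure can be verified numerically from the two printed matrices. The sketch below copies the generalized inverse block, X'Y and Y'Y from the SAS output and recomputes the upper right and lower right blocks in plain Python; the variable names are illustrative assumptions.

```python
# Upper-left block of the printed generalized inverse (rows/columns Int, d1..d4).
G = [
    [0.2, -0.2, -0.2, -0.2, 0.0],
    [-0.2, 0.4, 0.2, 0.2, 0.0],
    [-0.2, 0.2, 0.4, 0.2, 0.0],
    [-0.2, 0.2, 0.2, 0.45, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]

# X'Y and Y'Y from the last column and corner of the printed X'X matrix.
XtY = [354, 73, 67, 78, 136]
YtY = 7342

# Upper right block: parameter estimates b = (X'X)^- X'Y.
b = [sum(g * v for g, v in zip(row, XtY)) for row in G]

# Lower right block: SSE = Y'Y - Y'X (X'X)^- X'Y = Y'Y - b'X'Y.
sse = YtY - sum(b_j * v for b_j, v in zip(b, XtY))

print([round(b_j, 4) for b_j in b], round(sse, 4))
```

The estimates should match the last column of the printed generalized inverse, and the SSE should match the error sum of squares in the ANOVA table.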

50 Parameter estimates
Par   Est       St Err      t       P
Int    27.2 B     1.45   18.73  <.0001
d1    -12.6 B     2.05   -6.13  <.0001
d2    -13.8 B     2.05   -6.72  <.0001
d3     -7.7 B     2.17   -3.53  0.0030
d4      0.0 B        .       .       .

51 Caution Message NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.

52 Interpretation
If $\tau_r = 0$ (in our case, $\tau_4 = 0$), then
  – the corresponding estimate should be zero
  – the intercept $\mu$ is estimated by the mean of the observations in group 4
  – since $\mu + \tau_i$ is the mean of group $i$, the $\tau_i$ are the differences between the mean of group $i$ and the mean of group 4

53 Recall the means output
Level of design  N  Mean  Std Dev
              1  5  14.6      2.3
              2  5  13.4      3.6
              3  4  19.5      2.6
              4  5  27.2      3.9

54 Parameter estimates based on means
Level of design  Mean  Parameter estimate
                       $\hat{\mu} = 27.2$
              1  14.6  $\hat{\tau}_1 = 14.6 - 27.2 = -12.6$
              2  13.4  $\hat{\tau}_2 = 13.4 - 27.2 = -13.8$
              3  19.5  $\hat{\tau}_3 = 19.5 - 27.2 = -7.7$
              4  27.2  $\hat{\tau}_4 = 27.2 - 27.2 = 0$

55 Last slide
Read KNNL Chapter 16 up to 16.10
We used the program topic20.sas to generate the output for today
We will focus more on the relationship between regression and one-way ANOVA in the next topic

