ANOVA Analysis of Variance.

One-way Analysis of Variance (ANOVA)
Comparing k Populations

The F test – for comparing k means
Situation
We have k normal populations.
Let $\mu_i$ and $\sigma^2$ denote the mean and variance of population i, i = 1, 2, 3, …, k.
Note: we assume that the variance for each population is unknown but the same:
$\sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2 = \sigma^2$

We want to test
$H_0: \mu_1 = \mu_2 = \dots = \mu_k$
against
$H_A: \mu_i \ne \mu_j$ for at least one pair i, j

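The F statistic
$F = \dfrac{\sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2 / (k-1)}{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 / (N-k)}$
where $x_{ij}$ = the jth observation in the ith sample, $n_i$ = the size of the ith sample, $\bar{x}_i$ = the mean of the ith sample, $\bar{x}$ = the overall mean, and $N = n_1 + n_2 + \dots + n_k$.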

The ANOVA table
Source     S.S.   d.f.   M.S.   F
Between
Within
The ANOVA table is a tool for displaying the computations for the F test. It is very important when the Between Sample variability is due to two or more factors.

Computing Formulae: Compute 1) 2) 3) 4) 5)

The data
Assume we have collected data from each of k populations.
Let $x_{i1}, x_{i2}, x_{i3}, \dots, x_{in_i}$ denote the $n_i$ observations from population i, i = 1, 2, 3, …, k.

Then 1) 2) 3)
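A standard set of computing formulas for the one-way layout can be sketched as follows. The labels $T_i$ and $G$ are assumptions chosen here for illustration and are not necessarily the symbols used in the numbered steps above; the sums of squares are the ones that enter the Anova table.

```latex
\begin{align*}
T_i &= \sum_{j=1}^{n_i} x_{ij} && \text{(total for sample } i\text{)}\\
G &= \sum_{i=1}^{k} T_i && \text{(grand total)}\\
N &= \sum_{i=1}^{k} n_i && \text{(total sample size)}\\[4pt]
SS_{Total} &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - \frac{G^2}{N}\\
SS_{Between} &= \sum_{i=1}^{k} \frac{T_i^2}{n_i} - \frac{G^2}{N}\\
SS_{Within} &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{k} \frac{T_i^2}{n_i}
\end{align*}
```

By construction $SS_{Total} = SS_{Between} + SS_{Within}$, which gives a quick arithmetic check on hand computations.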

Anova Table
Source    d.f.    Sum of Squares   Mean Square   F-ratio
Between   k - 1   SSBetween        MSBetween     MSB / MSW
Within    N - k   SSWithin         MSWithin
Total     N - 1   SSTotal

Example
In the following example we are comparing weight gains resulting from the following six diets:
Diet 1 - High Protein, Beef
Diet 2 - High Protein, Cereal
Diet 3 - High Protein, Pork
Diet 4 - Low Protein, Beef
Diet 5 - Low Protein, Cereal
Diet 6 - Low Protein, Pork

Thus, since the computed F exceeds the critical value of the F distribution with 5 and 54 degrees of freedom, we reject H0.

Anova Table
Source    d.f.   Sum of Squares   Mean Square   F-ratio
Between   5      4612.933         922.587       4.3**
Within    54
Total     59
* - Significant at 0.05 (not 0.01)
** - Significant at 0.01

Equivalence of the F-test and the t-test when k = 2

the F-test
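The equivalence can be sketched in a few lines. This is a reconstruction under the usual pooled-variance assumptions; $s_p^2$ denotes the pooled variance, a symbol introduced here for illustration.

```latex
\begin{align*}
\text{For } k = 2:\quad
SS_{Between} &= n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2
  = \frac{n_1 n_2}{n_1 + n_2}\,(\bar{x}_1 - \bar{x}_2)^2,\\
MS_{Within} &= \frac{\sum_{j}(x_{1j} - \bar{x}_1)^2 + \sum_{j}(x_{2j} - \bar{x}_2)^2}{n_1 + n_2 - 2} = s_p^2,\\[4pt]
F &= \frac{SS_{Between}/1}{MS_{Within}}
  = \frac{(\bar{x}_1 - \bar{x}_2)^2}{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}
  = t^2 .
\end{align*}
```

Because $F_{\alpha}(1, \nu) = t_{\alpha/2}^2(\nu)$ with $\nu = n_1 + n_2 - 2$, rejecting when $F > F_{\alpha}(1, \nu)$ is the same decision as rejecting when $|t| > t_{\alpha/2}(\nu)$.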

SAS Code for one-way ANOVA

Data oneway;
Input diet $ weight_gain;
Datalines;
1 73
1 102
1 118
1 104
1 81
1 107
1 100
1 87
1 117
1 111
2 98
2 74
2 56
2 111
2 95
2 88
2 82
2 77
2 86
2 92
3 94
3 79
3 96
3 98
3 102
3 108
3 91
3 120
3 105
4 90
4 76
4 64
4 86
4 51
4 72
4 95
4 78
5 107
5 95
5 97
5 80
5 98
5 74
5 67
5 89
5 58
6 49
6 82
6 73
6 86
6 81
6 97
6 106
6 70
6 61
;
Run;
Note: there are easier ways to enter the data. We will come to that later.

SAS Code for one-way ANOVA
To test our hypothesis, we use the following code in SAS:
"class" tells SAS the classification variable. In general, this is going to be the effect that you are studying. In this case, the effect is "diet."
"model" tells SAS the dependent variable. The general format is "model Y = X", where Y is the dependent variable and X is the independent variable. In this case, weight_gain is dependent on diet.
Often a "quit" statement is necessary, because SAS may continue to run a procedure until either another one has been run or SAS has been told to quit.
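A minimal sketch of the call described above, assuming the data set "oneway" from the data step shown earlier. PROC ANOVA is assumed because the output that follows is headed "The ANOVA Procedure".

```sas
/* Sketch: one-way ANOVA of weight_gain by diet.
   Assumes the data set "oneway" created in the earlier data step. */
proc anova data=oneway;
  class diet;                 /* classification variable: the effect being studied */
  model weight_gain = diet;   /* dependent variable = independent variable         */
run;
quit;                         /* ends the procedure, which otherwise keeps running */
```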

SAS Output
The ANOVA Procedure

Class Level Information
Class   Levels   Values
diet

Number of Observations Read
Number of Observations Used

The ANOVA Procedure
Dependent Variable: weight_gain

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model
Error
Corrected Total

R-Square   Coeff Var   Root MSE   weight_gain Mean

Source   DF   Anova SS   Mean Square   F Value   Pr > F
diet

Factorial Experiments
Analysis of Variance

Dependent variable Y
k Categorical independent variables A, B, C, … (the Factors)
Let
a = the number of categories of A
b = the number of categories of B
c = the number of categories of C
etc.

The Completely Randomized Design
We form the set of all treatment combinations, that is, the set of all combinations of the k factors.
Total number of treatment combinations: t = abc…
In the completely randomized design, n experimental units (test animals, test plots, etc.) are randomly assigned to each treatment combination.
Total number of experimental units: N = nt = nabc…

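The treatment combinations can be thought of as arranged in a k-dimensional rectangular block.
[Figure: rectangular grid of treatment combinations, with levels 1, 2, …, a of A along one side and levels 1, 2, …, b along the other]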

The Completely Randomized Design is called balanced when every treatment combination has the same number of observations, n.
If the number of observations per treatment combination is unequal, the design is called unbalanced (resulting in a mathematically more complex analysis and computations).
If for some of the treatment combinations there are no observations, the design is called incomplete. (In this case it may happen that some of the parameters, main effects and interactions, cannot be estimated.)

Example: Two-way ANOVA (two-factor experiment)
In this example we are examining the effect of the level of protein, A (High or Low), and the source of protein, B (Beef, Cereal, or Pork), on weight gains (grams) in rats.
We have n = 10 test animals randomly assigned to each of the k = 6 diets.

The k = 6 diets are the 6 = 3×2 Level-Source combinations:
High - Beef
High - Cereal
High - Pork
Low - Beef
Low - Cereal
Low - Pork

Treatment combinations
                     Source of Protein
Level of Protein     Beef     Cereal   Pork
High                 Diet 1   Diet 2   Diet 3
Low                  Diet 4   Diet 5   Diet 6

Summary Table of Means
                     Source of Protein
Level of Protein     Beef   Cereal   Pork   Overall
High
Low
Overall

Table: Gains in weight (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork)
Level of Protein:    High Protein            Low Protein
Source of Protein:   Beef   Cereal   Pork    Beef   Cereal   Pork
Diet
Mean
Std. Dev

Data twoway;
Input Protein $ Source $ weight_gain;
Datalines;
High Beef 73
High Beef 102
High Beef 118
High Beef 104
High Beef 81
High Beef 107
High Beef 100
High Beef 87
High Beef 117
High Beef 111
High Cereal 98
High Cereal 74
High Cereal 56
High Cereal 111
High Cereal 95
High Cereal 88
High Cereal 82
High Cereal 77
High Cereal 86
High Cereal 92
High Pork 94
High Pork 79
High Pork 96
High Pork 98
High Pork 102
High Pork 108
High Pork 91
High Pork 120
High Pork 105
Low Beef 90
Low Beef 76
Low Beef 64
Low Beef 86
Low Beef 51
Low Beef 72
Low Beef 95
Low Beef 78
Low Cereal 107
Low Cereal 95
Low Cereal 97
Low Cereal 80
Low Cereal 98
Low Cereal 74
Low Cereal 67
Low Cereal 89
Low Cereal 58
Low Pork 49
Low Pork 82
Low Pork 73
Low Pork 86
Low Pork 81
Low Pork 97
Low Pork 106
Low Pork 70
Low Pork 61
;
Run;

SAS Code for two-way ANOVA
To test our hypotheses, we use the following code in SAS:
"class" tells SAS the two classification variables, which are generally going to be the effects that you are studying. In this case, the effects are "Protein" and "Source".
"model" tells SAS the dependent variable. The general format is "model Y = X1 X2 X1*X2", where Y is the dependent variable and X1 and X2 are independent variables. X1*X2 means the interaction of X1 and X2.
Often a "quit" statement is necessary, because SAS may continue to run a procedure until either another one has been run or SAS has been told to quit.
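A minimal sketch of the two-way call described above, assuming the data set "twoway" from the data step shown earlier. PROC ANOVA is assumed because the output that follows is headed "The ANOVA Procedure".

```sas
/* Sketch: two-way ANOVA with interaction.
   Assumes the data set "twoway" created in the earlier data step. */
proc anova data=twoway;
  class Protein Source;                                /* the two classification variables  */
  model weight_gain = Protein Source Protein*Source;   /* main effects and their interaction */
run;
quit;
```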

SAS Output
The ANOVA Procedure

Class Level Information
Class     Levels   Values
Protein            High Low
Source             Beef Cereal Pork

Number of Observations Read
Number of Observations Used

The ANOVA Procedure
Dependent Variable: weight_gain

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model
Error
Corrected Total

R-Square   Coeff Var   Root MSE   weight_gain Mean

Source           DF   Anova SS   Mean Square   F Value   Pr > F
Protein
Source
Protein*Source

Profiles of the response relative to a factor
A graphical representation of the effect of a factor on a response variable (dependent variable).

Profile Y for A
[Figure: profile of Y plotted against the levels 1, 2, 3, …, a of A]
This could be for an individual case or averaged over a group of cases.
This could be for a specific level of another factor or averaged over the levels of another factor.

Profiles of Weight Gain for Source and Level of Protein

Example – Four factor experiment
Four factors are studied for their effect on Y (luster of paint film). The four factors are:
1) Film Thickness (1 or 2 mils)
2) Drying conditions (Regular or Special)
3) Length of wash (10, 30, 40 or 60 Minutes), and
4) Temperature of wash (92 °C or 100 °C)
Two observations of film luster (Y) are taken for each treatment combination.

The data is tabulated below:
[Table: film luster by length of wash (Minutes) and temperature of wash (92 °C, 100 °C), under Regular Dry and Special Dry, for 1-mil and 2-mil Thickness]

Definition: A factor is said to not affect the response if the profile of the factor is horizontal for all combinations of levels of the other factors:
No change in the response when you change the levels of the factor (true for all combinations of levels of the other factors).
Otherwise the factor is said to affect the response.

Profile Y for A: A affects the response
[Figure: profiles of Y against the levels 1, 2, 3, …, a of A, one profile per level of B]

Profile Y for A: no effect on the response
[Figure: horizontal profiles of Y against the levels 1, 2, 3, …, a of A, one profile per level of B]

Definition: Two (or more) factors are said to interact if changes in the response when you change the level of one factor depend on the level(s) of the other factor(s).
Profiles of the factor for different levels of the other factor(s) are not parallel.
Otherwise the factors are said to be additive.
Profiles of the factor for different levels of the other factor(s) are parallel.
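For two factors, the definition can be written in terms of cell means. This is a sketch in notation assumed here: $\mu_{ij}$ is the mean response when A = i and B = j, and $\mu$, $\alpha_i$, $\beta_j$ are the overall mean and the two sets of main effects.

```latex
\begin{align*}
\text{A and B are additive} \iff{}& \mu_{ij} - \mu_{i'j} \text{ does not depend on } j
  \quad \text{for every pair of levels } i, i' \text{ of A}\\
\iff{}& \mu_{ij} = \mu + \alpha_i + \beta_j \quad \text{for all } i, j .
\end{align*}
```

When the first condition fails for some pair of levels, the profiles across the levels of B are not parallel and the factors interact.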

Interacting factors A and B
[Figure: non-parallel profiles of Y against the levels 1, 2, 3, …, a of A, one profile per level of B]

Additive factors A and B
[Figure: parallel profiles of Y against the levels 1, 2, 3, …, a of A, one profile per level of B]

If two (or more) factors interact, each factor affects the response.
If two (or more) factors are additive, it still remains to be determined whether the factors affect the response.
In factorial experiments we are interested in determining which factors affect the response and which groups of factors interact.

Order of testing in factorial experiments
Test first the higher order interactions.
If an interaction is present, there is no need to test lower order interactions or main effects involving those factors. All factors in the interaction affect the response and they interact.
The testing continues for lower order interactions and main effects for factors which have not yet been determined to affect the response.

More SAS Program: Proc GLM
The ANOVA procedure is one of several procedures available in SAS/STAT software for analysis of variance. The ANOVA procedure is designed to handle balanced data (that is, data with equal numbers of observations for every combination of the classification factors), whereas the GLM procedure can analyze both balanced and unbalanced data. Because PROC ANOVA takes into account the special structure of a balanced design, it is faster and uses less storage than PROC GLM for balanced data.

Proc GLM
PROC GLM DATA = twoway;
  class Protein Source;
  model weight_gain = Protein Source Protein*Source;
  lsmeans Protein Source Protein*Source / out=outmns;
  *gives least squares means and outputs them into another data set called 'outmns';
  means Protein Source / cldiff bon;
  *asks SAS for the confidence limits for the difference of means and the type of comparison;
  output out=resout p=preds rstudent=exstdres;
  *outputs the residuals and predicted values to a data set called 'resout';
RUN;
QUIT;

Proc GLM, continued
title 'Profile/Interaction Plots';
symbol i=j;
*tells SAS to draw lines between joint means;
proc gplot data=outmns;
  where Protein ne ' ' and Source ne ' ';
  *remove the marginal means from the data set since we only wish to plot joint means;
  plot lsmean*Protein=Source;
  plot lsmean*Source=Protein;
run;
quit;
goptions reset=all;
*resets PROC GPLOT options;
title 'Residual Plot';
proc gplot data=resout;
  plot exstdres*preds;
run;
quit;

Mean versus LS Mean (LSM)
Note: for balanced designs, as is true for our examples, the mean and the LS mean are the same.
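The distinction can be sketched for level i of factor A in a two-factor model with interaction. The notation is assumed here for illustration: $n_{ij}$ is the number of observations in cell (i, j) and $\bar{y}_{ij\cdot}$ is the mean of that cell.

```latex
\begin{align*}
\text{Mean:}\quad \bar{y}_{i\cdot\cdot} &= \frac{\sum_{j=1}^{b} n_{ij}\,\bar{y}_{ij\cdot}}{\sum_{j=1}^{b} n_{ij}}
  && \text{(cell means weighted by cell size)}\\
\text{LS mean:}\quad \widehat{\mu}_{i\cdot} &= \frac{1}{b}\sum_{j=1}^{b} \bar{y}_{ij\cdot}
  && \text{(cell means weighted equally)}
\end{align*}
```

When every $n_{ij} = n$, the weights $n_{ij} / \sum_j n_{ij}$ all equal $1/b$ and the two quantities coincide, which is the balanced case noted above.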

Bonferroni Pairwise Mean Comparisons
The GLM Procedure
Bonferroni (Dunn) t Tests for weight_gain
NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than Tukey's for all pairwise comparisons.
Alpha
Error Degrees of Freedom
Error Mean Square
Critical Value of t
Minimum Significant Difference
Comparisons significant at the 0.05 level are indicated by ***.
Protein Comparison   Difference Between Means   Simultaneous 95% Confidence Limits
High - Low                                                                            ***
Low - High                                                                            ***

The GLM Procedure
Bonferroni (Dunn) t Tests for weight_gain
NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than Tukey's for all pairwise comparisons.
Alpha
Error Degrees of Freedom
Error Mean Square
Critical Value of t
Minimum Significant Difference
Comparisons significant at the 0.05 level are indicated by ***.
Source Comparison   Difference Between Means   Simultaneous 95% Confidence Limits
Beef - Pork
Beef - Cereal
Pork - Beef
Pork - Cereal
Cereal - Beef
Cereal - Pork
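The "Minimum Significant Difference" line in this output can be understood through the usual Bonferroni formula for a balanced design. This is a sketch in assumed notation: g is the number of levels being compared, m the number of pairwise comparisons, $n_0$ the number of observations behind each level mean, and $\nu_E$ the error degrees of freedom.

```latex
\begin{align*}
m &= \binom{g}{2} = \frac{g(g-1)}{2},\\
\text{MSD} &= t_{\alpha/(2m)}(\nu_E)\,\sqrt{\frac{2\,MS_{Error}}{n_0}} .
\end{align*}
```

For the two-way example, Source has g = 3 levels, so m = 3 and each Source mean rests on $n_0 = 2 \times 10 = 20$ observations; Protein has g = 2 levels, so m = 1 and $n_0 = 3 \times 10 = 30$.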

Tukey pairwise mean comparisons
PROC GLM DATA = twoway;
  class Protein Source;
  model weight_gain = Protein Source Protein*Source;
  means Protein Source / tukey;
RUN;
QUIT;

The GLM Procedure
Tukey's Studentized Range (HSD) Test for weight_gain
NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than REGWQ.
Alpha
Error Degrees of Freedom
Error Mean Square
Critical Value of Studentized Range
Minimum Significant Difference
Means with the same letter are not significantly different.
Tukey Grouping   Mean   N   Protein
A                           High
B                           Low

The GLM Procedure
Tukey's Studentized Range (HSD) Test for weight_gain
NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than REGWQ.
Alpha
Error Degrees of Freedom
Error Mean Square
Critical Value of Studentized Range
Minimum Significant Difference
Means with the same letter are not significantly different.
Tukey Grouping   Mean   N   Source
A                           Beef
A                           Pork
A                           Cereal
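For comparison with the Bonferroni version, the Tukey minimum significant difference for a balanced design has the following standard form. This is a sketch in assumed notation: $q_{\alpha}(g, \nu_E)$ is the upper-$\alpha$ point of the studentized range for g means and $\nu_E$ error degrees of freedom, and $n_0$ is the number of observations behind each mean.

```latex
\[
\text{MSD} = q_{\alpha}(g, \nu_E)\,\sqrt{\frac{MS_{Error}}{n_0}}
\]
```

Two level means are declared different when their absolute difference exceeds this value; levels that share a letter in the "Tukey Grouping" column do not differ by that much.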

Models for factorial Experiments

Part I. Factor Effects Model

The Single Factor Experiment (One-way ANOVA)
Situation
We have t = a treatment combinations.
Let $\mu_i$ and $\sigma^2$ denote the mean and variance of treatment (population) i, i = 1, 2, 3, …, a.
Note: we assume that the variance for each population is unknown but the same:
$\sigma_1^2 = \sigma_2^2 = \dots = \sigma_a^2 = \sigma^2$

The data
Assume we have collected data for each of the a treatments.
Let $y_{i1}, y_{i2}, y_{i3}, \dots, y_{in}$ denote the n observations for treatment i, i = 1, 2, 3, …, a.

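The model
$y_{ij} = \mu_i + \varepsilon_{ij}$
Note: $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$
where $\varepsilon_{ij}$ has the $N(0, \sigma^2)$ distribution,
$\mu = \frac{1}{a}\sum_{i=1}^{a} \mu_i$ (overall mean effect),
$\alpha_i = \mu_i - \mu$ (Effect of Factor A).
Note: $\sum_{i=1}^{a} \alpha_i = 0$ by their definition.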

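Model 1: $y_{ij}$ (i = 1, …, a; j = 1, …, n) are independent Normal with mean $\mu_i$ and variance $\sigma^2$.
Model 2: $y_{ij} = \mu_i + \varepsilon_{ij}$, where $\varepsilon_{ij}$ (i = 1, …, a; j = 1, …, n) are independent Normal with mean 0 and variance $\sigma^2$.
Model 3: $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$, where $\varepsilon_{ij}$ (i = 1, …, a; j = 1, …, n) are independent Normal with mean 0 and variance $\sigma^2$ and $\sum_{i=1}^{a} \alpha_i = 0$.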

The Two Factor Experiment
Situation
We have t = ab treatment combinations.
Let $\mu_{ij}$ and $\sigma^2$ denote the mean and variance of observations from the treatment combination when A = i and B = j, i = 1, 2, 3, …, a, j = 1, 2, 3, …, b.

The data
Assume we have collected data (n observations) for each of the t = ab treatment combinations.
Let $y_{ij1}, y_{ij2}, y_{ij3}, \dots, y_{ijn}$ denote the n observations for treatment combination A = i, B = j, i = 1, 2, 3, …, a, j = 1, 2, 3, …, b.

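The model
$y_{ijk} = \mu_{ij} + \varepsilon_{ijk}$
Note: $y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$
where $\varepsilon_{ijk}$ follows the $N(0, \sigma^2)$ distribution and
$\mu = \frac{1}{ab}\sum_{i}\sum_{j} \mu_{ij}$, $\alpha_i = \bar{\mu}_{i\cdot} - \mu$, $\beta_j = \bar{\mu}_{\cdot j} - \mu$, $(\alpha\beta)_{ij} = \mu_{ij} - \bar{\mu}_{i\cdot} - \bar{\mu}_{\cdot j} + \mu$,
with $\bar{\mu}_{i\cdot} = \frac{1}{b}\sum_{j} \mu_{ij}$ and $\bar{\mu}_{\cdot j} = \frac{1}{a}\sum_{i} \mu_{ij}$.
Note: $\sum_{i} \alpha_i = 0$, $\sum_{j} \beta_j = 0$, $\sum_{i} (\alpha\beta)_{ij} = 0$ for all j, and $\sum_{j} (\alpha\beta)_{ij} = 0$ for all i,
by their definition.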

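Model: $y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$
Mean: $\mu$. Main effects: $\alpha_i$, $\beta_j$. Interaction effect: $(\alpha\beta)_{ij}$. Error: $\varepsilon_{ijk}$.
where $\varepsilon_{ijk}$ (i = 1, …, a; j = 1, …, b; k = 1, …, n) are independent Normal with mean 0 and variance $\sigma^2$ and
$\sum_{i} \alpha_i = 0$, $\sum_{j} \beta_j = 0$, $\sum_{i} (\alpha\beta)_{ij} = 0$ for all j, $\sum_{j} (\alpha\beta)_{ij} = 0$ for all i.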

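Maximum Likelihood Estimates
$\hat{\mu} = \bar{y}_{\cdot\cdot\cdot}$
$\hat{\alpha}_i = \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot}$
$\hat{\beta}_j = \bar{y}_{\cdot j\cdot} - \bar{y}_{\cdot\cdot\cdot}$
$\widehat{(\alpha\beta)}_{ij} = \bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot}$
$\hat{\sigma}^2 = \frac{1}{abn}\sum_{i}\sum_{j}\sum_{k} (y_{ijk} - \bar{y}_{ij\cdot})^2$
where a bar with a dot denotes the average over the dotted subscript, and $\varepsilon_{ijk}$ (i = 1, …, a; j = 1, …, b; k = 1, …, n) are independent Normal with mean 0 and variance $\sigma^2$.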

This is not an unbiased estimator of $\sigma^2$ (as is usually the case when estimating a variance by maximum likelihood).
The unbiased estimator results when we divide by ab(n - 1) instead of abn.

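The unbiased estimator of $\sigma^2$ is
$s^2 = \frac{1}{ab(n-1)}\sum_{i}\sum_{j}\sum_{k} (y_{ijk} - \bar{y}_{ij\cdot})^2 = \frac{SS_{Error}}{ab(n-1)} = MS_{Error}$
where $SS_{Error} = \sum_{i}\sum_{j}\sum_{k} (y_{ijk} - \bar{y}_{ij\cdot})^2$ and $\bar{y}_{ij\cdot} = \frac{1}{n}\sum_{k} y_{ijk}$.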

Testing for Interaction:
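We want to test: $H_0: (\alpha\beta)_{ij} = 0$ for all i and j, against $H_A: (\alpha\beta)_{ij} \ne 0$ for at least one i and j.
The test statistic
$F = \dfrac{MS_{AB}}{MS_{Error}}$
where $MS_{AB} = \dfrac{SS_{AB}}{(a-1)(b-1)}$ and $SS_{AB} = n\sum_{i}\sum_{j} (\bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot})^2$.

We reject $H_0: (\alpha\beta)_{ij} = 0$ for all i and j, if $F > F_{\alpha}\big((a-1)(b-1),\, ab(n-1)\big)$.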

Testing for the Main Effect of A:
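We want to test: $H_0: \alpha_i = 0$ for all i, against $H_A: \alpha_i \ne 0$ for at least one i.
The test statistic
$F = \dfrac{MS_{A}}{MS_{Error}}$
where $MS_{A} = \dfrac{SS_{A}}{a-1}$ and $SS_{A} = nb\sum_{i} (\bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot})^2$.

We reject $H_0: \alpha_i = 0$ for all i, if $F > F_{\alpha}\big(a-1,\, ab(n-1)\big)$.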

Testing for the Main Effect of B:
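We want to test: $H_0: \beta_j = 0$ for all j, against $H_A: \beta_j \ne 0$ for at least one j.
The test statistic
$F = \dfrac{MS_{B}}{MS_{Error}}$
where $MS_{B} = \dfrac{SS_{B}}{b-1}$ and $SS_{B} = na\sum_{j} (\bar{y}_{\cdot j\cdot} - \bar{y}_{\cdot\cdot\cdot})^2$.

We reject $H_0: \beta_j = 0$ for all j, if $F > F_{\alpha}\big(b-1,\, ab(n-1)\big)$.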

The ANOVA Table
Source   S.S.      d.f.             MS = SS/df   F
A        SSA       a - 1            MSA          MSA / MSError
B        SSB       b - 1            MSB          MSB / MSError
AB       SSAB      (a - 1)(b - 1)   MSAB         MSAB / MSError
Error    SSError   ab(n - 1)        MSError
Total    SSTotal   abn - 1
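As a quick check of the degrees-of-freedom column, the table can be filled in for the diet example, where a = 2 protein levels, b = 3 protein sources and n = 10 animals per diet. The link to the one-way table is the standard decomposition for a balanced design, stated here as a sketch.

```latex
\begin{align*}
\text{d.f.}(A) &= a - 1 = 1, \qquad \text{d.f.}(B) = b - 1 = 2, \qquad \text{d.f.}(AB) = (a-1)(b-1) = 2,\\
\text{d.f.}(\text{Error}) &= ab(n-1) = 6 \times 9 = 54, \qquad \text{d.f.}(\text{Total}) = abn - 1 = 59,\\[4pt]
1 + 2 + 2 &= 5 = \text{d.f.}(\text{Between}) \text{ for the six diets},\\
SS_A + SS_B + SS_{AB} &= SS_{Between} = 4612.933 .
\end{align*}
```

The two-way analysis therefore splits the Between line of the one-way table into three parts and leaves the Within (Error) and Total lines unchanged.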


Part II. General Linear Model

One-way ANOVA The ANOVA is indeed a special case of the general linear model (GLM) when all the predictors are categorical variables. For one-way ANOVA, we have only one categorical predictor. As shown in the following slides, we can easily translate the ANOVA into a GLM using dummy variables.

Dummy Variables
Dummy coding uses 0s and 1s.
For a categorical predictor with k categories, k-1 dummy variables will go into the regression equation, leaving out one reference category (e.g. control).
Coefficients are interpreted as change with respect to the reference category (the one with all zeros), in this case group 3.
Group   D1   D2
1       1    0
2       0    1
3       0    0

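GLM representation and interpretations
GLM model: $y = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \varepsilon$ (with $D_1 = 1$ for group 1, $D_2 = 1$ for group 2, and group 3 as the reference)
Relation to category/group means: $\mu_1 = \beta_0 + \beta_1$, $\mu_2 = \beta_0 + \beta_2$, $\mu_3 = \beta_0$
Therefore the ANOVA hypothesis: $H_0: \mu_1 = \mu_2 = \mu_3$
Can be expressed as: $H_0: \beta_1 = \beta_2 = 0$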

Two-way ANOVA
We will revisit the two-way ANOVA example on the impact on weight_gain of two factors:
Protein level (denoted as Protein): it has two levels, High/Low
Protein source (denoted as Source): it has three levels, Beef/Cereal/Pork

Dummy Variables
Protein   D
High      1
Low       0

Source   D1   D2
Beef     1    0
Cereal   0    1
Pork     0    0

GLM model: Relation to category/group means:
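One way to write the model and its relation to the six cell means is sketched below. It assumes the coding D = 1 for High protein, $D_1 = 1$ for Beef and $D_2 = 1$ for Cereal, with Low and Pork as reference categories; the coefficient labels $\beta_0, \dots, \beta_5$ are chosen here for illustration.

```latex
\begin{align*}
y &= \beta_0 + \beta_1 D + \beta_2 D_1 + \beta_3 D_2 + \beta_4 D D_1 + \beta_5 D D_2 + \varepsilon\\[4pt]
\mu_{High,Beef} &= \beta_0 + \beta_1 + \beta_2 + \beta_4 & \mu_{Low,Beef} &= \beta_0 + \beta_2\\
\mu_{High,Cereal} &= \beta_0 + \beta_1 + \beta_3 + \beta_5 & \mu_{Low,Cereal} &= \beta_0 + \beta_3\\
\mu_{High,Pork} &= \beta_0 + \beta_1 & \mu_{Low,Pork} &= \beta_0
\end{align*}
```

Under this coding the High minus Low difference is $\beta_1 + \beta_4$ for Beef, $\beta_1 + \beta_5$ for Cereal and $\beta_1$ for Pork, so the two factors are additive exactly when $\beta_4 = \beta_5 = 0$.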

Test for Interaction: Test for Protein (level) main effect: Test for (protein) Source main effect:
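The interaction test can be sketched as a dummy-variable regression in SAS. This assumes the data set "twoway" from the earlier data step; the data set name twoway_dummy, the variable names D, D1, D2, D_D1, D_D2 and the choice of Low and Pork as reference categories are illustrative assumptions rather than anything prescribed above.

```sas
/* Sketch: dummy-variable regression for the two-way example.
   Assumes the data set "twoway" from the earlier data step;
   all new names below are illustrative. */
data twoway_dummy;
  set twoway;
  D    = (Protein = 'High');    /* 1 for High, 0 for Low (reference)   */
  D1   = (Source = 'Beef');     /* 1 for Beef                          */
  D2   = (Source = 'Cereal');   /* 1 for Cereal; Pork is the reference */
  D_D1 = D * D1;                /* interaction terms                   */
  D_D2 = D * D2;
run;

proc reg data=twoway_dummy;
  model weight_gain = D D1 D2 D_D1 D_D2;
  /* joint test that both interaction coefficients are zero */
  Interaction: test D_D1 = 0, D_D2 = 0;
run;
quit;
```

Because this regression and the two-way ANOVA with interaction fit the same six cell means, the F value reported for the labelled test should agree with the Protein*Source line of the ANOVA table.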

Acknowledgement: We thank colleagues who posted their lecture notes on the
Please note that in SAS, we have several procedures that will enable you to perform ANOVA. These include Proc ANOVA and Proc GLM, plus several other procedures such as Proc Mixed, etc.
The ANOVA procedures we have learned so far are just the basic fixed effect ANOVAs. In the future we will also learn those with random effects and mixed effects.
See the following websites for a review and preview:
