Analysis of Variance & One Factor Designs


1 1 Analysis of Variance & One Factor Designs
Y = DEPENDENT VARIABLE (“yield”, “response variable”, “quality indicator”)
X = INDEPENDENT VARIABLE (a possibly influential FACTOR)

2 2 OBJECTIVE: To determine the impact of X on Y
Mathematical Model: $Y = f(X, \varepsilon)$, where $\varepsilon$ = (impact of) all factors other than X
Ex: Y = Battery Life (hours); X = Brand of Battery; $\varepsilon$ = many other factors (possibly, some we’re unaware of)

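3 3 Statistical Model
“LEVEL” OF BRAND (Brand is, of course, represented as “categorical”)

Replication    Level 1    Level 2    ...     Level C
1              Y_11       Y_12       ...     Y_1C
2              Y_21       Y_22       ...     Y_2C
...            ...        ...        Y_ij    ...
R              Y_R1       Y_R2       ...     Y_RC

$Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$, with $i = 1, \ldots, R$ and $j = 1, \ldots, C$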

4 4 Where
$\mu$ = OVERALL AVERAGE
$j$ = index for FACTOR (Brand) LEVEL
$i$ = index for “replication”
$\tau_j$ = differential effect (response) associated with the $j$th level of X
$\varepsilon_{ij}$ = “noise” or “error” associated with the (particular) $(i, j)$th data value.
Let $\mu_j$ = AVERAGE associated with the $j$th level of X. Then $\tau_j = \mu_j - \mu$, and $\mu$ = AVERAGE of the $\mu_j$.

5 5 $Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$
By definition, $\sum_{j=1}^{C} \tau_j = 0$.
The experiment produces $R \times C$ data values $Y_{ij}$. The analysis produces estimates of $\mu, \tau_1, \tau_2, \ldots, \tau_C$. (We can then get estimates of the $\varepsilon_{ij}$ by subtraction.)

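6 6
Replication    Level 1                 Level 2                 ...    Level C
1              Y_11                    Y_12                    ...    Y_1C
2              Y_21                    ...                     ...    ...
...            ...                     ...                     ...    ...
R              Y_R1                    ...                     ...    Y_RC
Column mean    $\bar{Y}_{\cdot 1}$     $\bar{Y}_{\cdot 2}$     ...    $\bar{Y}_{\cdot C}$

$\bar{Y}_{\cdot 1}$, $\bar{Y}_{\cdot 2}$, etc., are Column Means (in general, $\bar{Y}_{\cdot j}$).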

7 7 $\bar{Y}_{\cdot\cdot} = \sum_{j=1}^{C} \bar{Y}_{\cdot j} / C$ = “GRAND MEAN” (assuming the same number of data points in each column; otherwise, $\bar{Y}_{\cdot\cdot}$ = mean of all the data)

8 8 MODEL: $Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$
$\bar{Y}_{\cdot\cdot}$ estimates $\mu$
$\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}$ estimates $\tau_j$ $(= \mu_j - \mu)$ (for all $j$)
These estimates are based on Gauss’ (1796) PRINCIPLE OF LEAST SQUARES and (I would argue) on COMMON SENSE
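As a rough illustration of these estimates, the sketch below computes the grand mean and the estimated column effects for the battery-lifetime data used in the example later in this deck. The function name `estimate_effects` and the list-of-columns layout are assumptions made for the illustration. Averaging the column means to get the grand mean relies on every column having the same number of replications.

```python
# Battery lifetimes (hours): one inner list per brand (column), 3 replications each.
columns = [
    [1.8, 5.0, 1.0],
    [4.2, 5.4, 4.2],
    [8.6, 4.6, 4.2],
    [7.0, 5.0, 9.0],
    [4.2, 7.8, 6.6],
    [4.2, 4.2, 5.4],
    [7.8, 7.0, 9.8],
    [9.0, 7.4, 5.8],
]


def estimate_effects(cols):
    """Return (grand_mean, column_means, tau_hat) for a balanced layout.

    Hypothetical helper: assumes every column has the same number of values.
    """
    col_means = [sum(c) / len(c) for c in cols]
    grand_mean = sum(col_means) / len(col_means)   # estimate of mu
    tau_hat = [m - grand_mean for m in col_means]  # estimates of tau_j
    return grand_mean, col_means, tau_hat


grand_mean, col_means, tau_hat = estimate_effects(columns)
print("grand mean:", round(grand_mean, 2))
print("column effects:", [round(t, 2) for t in tau_hat])
print("sum of effects:", round(sum(tau_hat), 10))
```

The estimated effects sum to zero (up to floating-point rounding), mirroring the constraint that the true column effects sum to zero.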

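9 9 MODEL: $Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$
If you insert the estimates into the MODEL,
(1) $Y_{ij} = \bar{Y}_{\cdot\cdot} + (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}) + \hat{\varepsilon}_{ij}$,
it follows that our estimate of $\varepsilon_{ij}$ is
(2) $\hat{\varepsilon}_{ij} = Y_{ij} - \bar{Y}_{\cdot j}$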

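10 10 Then, $Y_{ij} = \bar{Y}_{\cdot\cdot} + (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}) + (Y_{ij} - \bar{Y}_{\cdot j})$
or,
(3) $(Y_{ij} - \bar{Y}_{\cdot\cdot}) = (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}) + (Y_{ij} - \bar{Y}_{\cdot j})$
TOTAL VARIABILITY in Y = Variability in Y associated with X + Variability in Y associated with all other factors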

11 11 If you square both sides of (3), and double sum both sides (over i and j), you get [after some unpleasant algebra, but lots of terms which “cancel”]
$\sum_{j=1}^{C} \sum_{i=1}^{R} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2 = R \sum_{j=1}^{C} (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})^2 + \sum_{j=1}^{C} \sum_{i=1}^{R} (Y_{ij} - \bar{Y}_{\cdot j})^2$
TSS (TOTAL SUM OF SQUARES) = SSB_C (SUM OF SQUARES BETWEEN COLUMNS) + SSW, also written SSE (SUM OF SQUARES WITHIN COLUMNS)
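The cancellation mostly comes down to a single cross term. A sketch of the step, writing $\bar{Y}_{\cdot j}$ for the column means and $\bar{Y}_{\cdot\cdot}$ for the grand mean (this dot notation is an assumption of the sketch):

```latex
\begin{aligned}
\sum_{j=1}^{C}\sum_{i=1}^{R} (Y_{ij}-\bar{Y}_{\cdot\cdot})^2
 &= \sum_{j=1}^{C}\sum_{i=1}^{R}
    \bigl[(\bar{Y}_{\cdot j}-\bar{Y}_{\cdot\cdot}) + (Y_{ij}-\bar{Y}_{\cdot j})\bigr]^2 \\
 &= R\sum_{j=1}^{C} (\bar{Y}_{\cdot j}-\bar{Y}_{\cdot\cdot})^2
  + \sum_{j=1}^{C}\sum_{i=1}^{R} (Y_{ij}-\bar{Y}_{\cdot j})^2
  + 2\sum_{j=1}^{C} (\bar{Y}_{\cdot j}-\bar{Y}_{\cdot\cdot})
     \sum_{i=1}^{R} (Y_{ij}-\bar{Y}_{\cdot j}), \\
\sum_{i=1}^{R} (Y_{ij}-\bar{Y}_{\cdot j})
 &= R\,\bar{Y}_{\cdot j} - R\,\bar{Y}_{\cdot j} = 0
 \quad\text{for every } j .
\end{aligned}
```

Because the deviations within each column sum to zero, the cross term vanishes and only the between-column and within-column sums of squares remain.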

12 12 ANOVA TABLE
SOURCE OF VARIABILITY             SSQ      DF           MEAN SQUARE (M.S.)
Between Columns (due to brand)    SSB_C    C - 1        MSB_C = SSB_C / (C - 1)
Within Columns (due to error)     SSW      (R - 1)C     MSW = SSW / ((R - 1)C)
TOTAL                             TSS      RC - 1

13 13 Example: Y = LIFETIME (HOURS), 3 replications per level of BRAND

BRAND          1      2      3      4      5      6      7      8
               1.8    4.2    8.6    7.0    4.2    4.2    7.8    9.0
               5.0    5.4    4.6    5.0    7.8    4.2    7.0    7.4
               1.0    4.2    4.2    9.0    6.6    5.4    9.8    5.8
Column mean    2.6    4.6    5.8    7.0    6.2    4.6    8.2    7.4
Grand mean = 5.8

$SSB_C = 3\,([2.6 - 5.8]^2 + [4.6 - 5.8]^2 + \cdots + [7.4 - 5.8]^2) = 3\,(23.04) = 69.12$

14 14 Within-column squared deviations:
Column 1                  Column 2                  ...    Column 8
(1.8 - 2.6)^2 = 0.64      (4.2 - 4.6)^2 = 0.16      ...    (9.0 - 7.4)^2 = 2.56
(5.0 - 2.6)^2 = 5.76      (5.4 - 4.6)^2 = 0.64      ...    (7.4 - 7.4)^2 = 0
(1.0 - 2.6)^2 = 2.56      (4.2 - 4.6)^2 = 0.16      ...    (5.8 - 7.4)^2 = 2.56
Sum: 8.96                 Sum: 0.96                 ...    Sum: 5.12
SSW = total of (8.96 + 0.96 + ... + 5.12) = 46.72

15 15 ANOVA TABLE
Source of Variability    SSQ       df                    M.S.
BRAND                    69.12     7 = 8 - 1             9.87
ERROR                    46.72     16 = 2 (8)            2.92
TOTAL                    115.84    23 = (3 x 8) - 1
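The table entries can be checked with a few lines of Python. This is a minimal sketch for a balanced layout only. The function name `one_way_anova` and the dictionary keys are assumptions for the illustration, not anything taken from SPSS or Minitab.

```python
# Battery lifetimes (hours): one inner list per brand, 3 replications each.
battery = [
    [1.8, 5.0, 1.0], [4.2, 5.4, 4.2], [8.6, 4.6, 4.2], [7.0, 5.0, 9.0],
    [4.2, 7.8, 6.6], [4.2, 4.2, 5.4], [7.8, 7.0, 9.8], [9.0, 7.4, 5.8],
]


def one_way_anova(cols):
    """Sums of squares, df and mean squares for a balanced one-factor design."""
    C = len(cols)       # number of columns (levels)
    R = len(cols[0])    # replications per column (assumed equal)
    col_means = [sum(c) / R for c in cols]
    grand_mean = sum(sum(c) for c in cols) / (R * C)
    ssb = R * sum((m - grand_mean) ** 2 for m in col_means)
    ssw = sum((y - m) ** 2 for c, m in zip(cols, col_means) for y in c)
    tss = sum((y - grand_mean) ** 2 for c in cols for y in c)
    df_between, df_within = C - 1, (R - 1) * C
    return {
        "SSB": ssb, "SSW": ssw, "TSS": tss,
        "df_between": df_between, "df_within": df_within,
        "MSB": ssb / df_between, "MSW": ssw / df_within,
    }


table = one_way_anova(battery)
for key, value in table.items():
    print(f"{key:>10}: {value:.2f}")
```

Computing TSS directly from the data, rather than as SSB + SSW, gives an independent check that the decomposition holds.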

16 16 We can show:
$E(MSB_C) = \sigma^2 + V_{COL}$, where $V_{COL} = \frac{R}{C-1} \sum_{j=1}^{C} (\mu_j - \mu)^2$ is a MEASURE OF DIFFERENCES AMONG COLUMN MEANS
$E(MSW) = \sigma^2$
(Assuming each $Y_{ij}$ has (constant) standard deviation $\sigma$.) (More about assumptions later.)

17 17 $E(MSB_C) = \sigma^2 + V_{COL}$
$E(MSW) = \sigma^2$
This suggests that
if $MSB_C / MSW > 1$, there’s some evidence of non-zero $V_{COL}$, or “level of X affects Y”;
if $MSB_C / MSW < 1$, there’s no evidence that $V_{COL} > 0$, or that “level of X affects Y”.

18 18 With
$H_0$: Level of X has no impact on Y
$H_1$: Level of X does have impact on Y,
we need $MSB_C / MSW \gg 1$ to reject $H_0$.

19 19 More formally,
$H_0$: $\tau_1 = \tau_2 = \cdots = \tau_C = 0$
$H_1$: not all $\tau_j = 0$
OR
$H_0$: $\mu_1 = \mu_2 = \cdots = \mu_C$ (all column means are equal)
$H_1$: not all $\mu_j$ are EQUAL

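20 20 The probability law of $MSB_C / MSW$ = “$F_{calc}$” is the F-distribution with $(C - 1, (R - 1)C)$ degrees of freedom, assuming $H_0$ is true.
[Figure: F density with upper-tail area $\alpha$ to the right of the table value. In the figure, C labels the table (critical) value, not the number of columns.]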

21 21 In our problem: ANOVA TABLE
Source of Variability    SSQ      df    M.S.
BRAND                    69.12    7     9.87
ERROR                    46.72    16    2.92
$F_{calc} = 9.87 / 2.92 = 3.38$

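22 22 [Figure: F distribution with (7, 16) DF; $\alpha = .05$ upper tail beyond the table value C = 2.66; $F_{calc} = 3.38$ falls in that tail.] F table coming up.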

23 23 F-Table

24 24 Hence, at $\alpha = .05$, reject $H_0$ (i.e., conclude that level of BRAND does have an impact on battery lifetime).
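The decision rule can be sketched as follows. The critical value 2.66 is simply the table value quoted in the slides for (7, 16) degrees of freedom. It is passed in rather than computed, since the Python standard library has no F distribution. The function name `f_decision` is hypothetical.

```python
def f_decision(ssb, ssw, df_between, df_within, critical_value):
    """Return (F_calc, reject_H0) for a one-factor ANOVA.

    critical_value must be looked up in an F table for the chosen alpha
    and (df_between, df_within); it is not computed here.
    """
    msb = ssb / df_between
    msw = ssw / df_within
    f_calc = msb / msw
    return f_calc, f_calc > critical_value


# Sums of squares and degrees of freedom from the battery example.
f_calc, reject_h0 = f_decision(69.12, 46.72, 7, 16, critical_value=2.66)
print(f"F_calc = {f_calc:.2f}, reject H0 at alpha = .05: {reject_h0}")
```

With a statistics package available, the table lookup would normally be replaced by a p-value, as in the Minitab output in this deck.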

25 25

26 26 SPSS/MINITAB INPUT
VAR001    VAR002
1.8       1
5.0       1
1.0       1
4.2       2
5.4       2
4.2       2
...       ...
9.0       8
7.4       8
5.8       8
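The input layout above is the “long” format: one row per observation, with the response in the first column and the brand level in the second. A small sketch of the conversion from the brand-by-column table follows. The names `wide` and `to_long` are assumptions for the illustration.

```python
# Brand level -> the three lifetimes observed for that brand (hours).
wide = {
    1: [1.8, 5.0, 1.0], 2: [4.2, 5.4, 4.2], 3: [8.6, 4.6, 4.2], 4: [7.0, 5.0, 9.0],
    5: [4.2, 7.8, 6.6], 6: [4.2, 4.2, 5.4], 7: [7.8, 7.0, 9.8], 8: [9.0, 7.4, 5.8],
}


def to_long(table):
    """Flatten {level: [values]} into (value, level) rows, level by level."""
    return [(value, level) for level, values in table.items() for value in values]


rows = to_long(wide)
print("VAR001  VAR002")
for value, level in rows:
    print(f"{value:<7} {level}")
```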

27 27

28 28 ONE FACTOR ANOVA (MINITAB)
Analysis of Variance for life
Source    DF    SS        MS      F       P
brand     7     69.12     9.87    3.38    0.021
Error     16    46.72     2.92
Total     23    115.84
MINITAB: STAT >> ANOVA >> ONE-WAY

29 29

30 30

31 31 EXAMPLE: MORTAR The tension bond strength of cement mortar is an important characteristic of the product. An engineer is interested in comparing the strength of a modified formulation in which polymer latex emulsions have been added during mixing to the strength of the unmodified mortar. The experimenter has collected 10 observations on strength for the modified formulation and another 10 observations for the unmodified formulation.

32 32
Modified    Unmodified
16.85       17.50
16.40       17.63
17.21       18.25
16.35       18.00
16.52       17.86
17.04       17.75
16.96       18.22
17.15       17.90
16.59       17.96
16.57       18.15

33 33 One-way ANOVA: strength versus type (Minitab)
Analysis of Variance for strength
Source    DF    SS        MS        F        P
type      1     6.7048    6.7048    82.98    0.000
Error     18    1.4544    0.0808
Total     19    8.1592
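The Minitab figures can be reproduced from the raw mortar data with a short sketch. The function `one_way_f` is a hypothetical helper written for this illustration. It allows unequal group sizes, although both groups here have 10 observations.

```python
modified = [16.85, 16.40, 17.21, 16.35, 16.52, 17.04, 16.96, 17.15, 16.59, 16.57]
unmodified = [17.50, 17.63, 18.25, 18.00, 17.86, 17.75, 18.22, 17.90, 17.96, 18.15]


def one_way_f(groups):
    """Return (SS_between, SS_within, F) for a one-factor layout."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_stat


ss_between, ss_within, f_stat = one_way_f([modified, unmodified])
print(f"SS(type) = {ss_between:.4f}, SS(Error) = {ss_within:.4f}, F = {f_stat:.2f}")
```

With only two levels, this F statistic is the square of the pooled two-sample t statistic, so the ANOVA and the pooled t-test lead to the same conclusion.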

34 34

35 35

36 36

37 37 Assumptions
Basically, the same as in Regression analysis:
MODEL: $Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$
1.) The $\varepsilon_{ij}$ are independent random variables
2.) Each $\varepsilon_{ij}$ is Normally distributed, with $E(\varepsilon_{ij}) = 0$ for all i, j
3.) $\sigma^2(\varepsilon_{ij})$ = constant for all i, j
Diagnostic plots: Normality plot, Residual plot, Run order plot

38 38 Diagnosis: Normality
The points on the normality plot must more or less follow a straight line to claim “normally distributed”. There are statistical tests to check this formally. The ANOVA method we learn here is not very sensitive to the normality assumption; that is, a mild departure from the normal distribution will not change our conclusions much.
Normality plot: normal scores vs. residuals

39 39 From Mortar data:

40 40 Diagnosis: Constant Variances
The points on the residual plot must fall more or less within a horizontal band to claim “constant variances”. There are statistical tests to check this formally. The ANOVA method we learn here is not very sensitive to the constant variances assumption; that is, slightly different variances within groups will not change our conclusions much.
Residual plot: fitted values vs. residuals
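A numeric companion to the residual plot, sketched for the mortar data: the fitted value for each observation is its group mean, the residual is the observation minus that mean, and the sample variances of the two groups can be compared directly. The ratio printed at the end is only an informal indicator. No cut-off is given in the slides, and the names used here are assumptions for the illustration.

```python
from statistics import mean, variance

groups = {
    "modified": [16.85, 16.40, 17.21, 16.35, 16.52, 17.04, 16.96, 17.15, 16.59, 16.57],
    "unmodified": [17.50, 17.63, 18.25, 18.00, 17.86, 17.75, 18.22, 17.90, 17.96, 18.15],
}


def fitted_and_residuals(data):
    """Return {group: [(fitted value, residual), ...]} with fitted = group mean."""
    out = {}
    for name, ys in data.items():
        fitted = mean(ys)
        out[name] = [(fitted, y - fitted) for y in ys]
    return out


pairs = fitted_and_residuals(groups)
variances = {name: variance(ys) for name, ys in groups.items()}  # sample variances
ratio = max(variances.values()) / min(variances.values())

for name, var in variances.items():
    print(f"{name:>10}: fitted = {pairs[name][0][0]:.3f}, sample variance = {var:.4f}")
print(f"largest / smallest variance = {ratio:.2f}")
```

Plotting the `(fitted, residual)` pairs gives the residual plot itself, with one vertical strip of points per group.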

41 41 From Mortar data:

42 42 Diagnosis: Randomness/Independence
The run order plot must show no “systematic” patterns to claim “randomness”. There are statistical tests to check this formally. The ANOVA method is sensitive to the independence assumption; that is, even a small degree of dependence between data points can change our conclusions a lot.
Run order plot: order vs. residuals

43 43 From Mortar data:

44 44 This assumes a “fixed model”: there is inherent interest in the specific levels of the factors under study. There’s no direct interest in extrapolating to other levels, and inference is limited to the levels that appear in the experiment. The experimenter selects the levels.
If a “random model”: the levels in the experiment are randomly selected from a population of such levels, and inference is to be made about the entire population of levels. Then, besides assumptions 1 to 3, there is another assumption:
4) a) the $\tau_j$ are independent random variables which are normally distributed with constant variance
   b) the $\tau_j$ and $\varepsilon_{ij}$ are independent

45 45 With these assumptions, the estimates ($\bar{Y}_{\cdot\cdot}$ and the $\bar{Y}_{\cdot j}$) are “maximum likelihood estimates” (a statistical notion which could be thought of as “efficiency” [“most likely value”]), and, more directly relevant: the “conventional” F- and t-tests are applicable (VALID) for a variety of hypothesis testing and confidence interval computations.

46 46 KRUSKAL-WALLIS TEST (Non-Parametric Alternative)
$H_0$: The probability distributions are identical for each level of the factor
$H_1$: Not all the distributions are the same

47 47 BATTERY LIFETIME (hours) (each column rank ordered, for simplicity)
Brand    A       B       C
         32      32      28
         30      32      21
         30      26      15
         29      26      15
         26      22      14
         23      20      14
         20      19      14
         19      16      11
         18      14      9
         12      14      8
Mean:    23.9    22.1    14.9    (here, irrelevant!!)

48 48 $H_0$: no difference in distribution among the three brands with respect to battery lifetime
$H_1$: At least one of the 3 brands differs in distribution from the others with respect to lifetime

49 49 Ranks (rank of each value in the combined data, in parentheses)
Brand A       Brand B       Brand C
32 (29)       32 (29)       28 (24)
30 (26.5)     32 (29)       21 (18)
30 (26.5)     26 (22)       15 (10.5)
29 (25)       26 (22)       15 (10.5)
26 (22)       22 (19)       14 (7)
23 (20)       20 (16.5)     14 (7)
20 (16.5)     19 (14.5)     14 (7)
19 (14.5)     16 (12)       11 (3)
18 (13)       14 (7)        9 (2)
12 (4)        14 (7)        8 (1)
T_1 = 197     T_2 = 178     T_3 = 90
n_1 = 10      n_2 = 10      n_3 = 10
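The ranking step can be sketched in Python: pool all 30 lifetimes, give tied values the average of the positions they occupy (midranks), and sum the ranks by brand. The helper name `midranks` and the dictionary layout are assumptions for the illustration.

```python
brands = {
    "A": [32, 30, 30, 29, 26, 23, 20, 19, 18, 12],
    "B": [32, 32, 26, 26, 22, 20, 19, 16, 14, 14],
    "C": [28, 21, 15, 15, 14, 14, 14, 11, 9, 8],
}


def midranks(values):
    """Rank values from smallest (rank 1) up; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1  # mean of 1-based positions pos+1 .. end+1
        for k in order[pos:end + 1]:
            ranks[k] = average
        pos = end + 1
    return ranks


pooled = [y for ys in brands.values() for y in ys]
labels = [name for name, ys in brands.items() for _ in ys]
ranks = midranks(pooled)

rank_sums = {name: 0.0 for name in brands}
for name, r in zip(labels, ranks):
    rank_sums[name] += r
print(rank_sums)
```

As a check, the rank sums must add up to N(N + 1)/2 = 465 for N = 30 observations.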

50 50 TEST STATISTIC:
$H = \frac{12}{N(N+1)} \sum_{j=1}^{K} \frac{T_j^2}{n_j} - 3(N+1)$
$n_j$ = # data values in column j
$N = \sum_{j=1}^{K} n_j$
$K$ = # columns (levels)
$T_j$ = SUM OF RANKS OF DATA IN COLUMN j, when all DATA are COMBINED
(There is a slight adjustment in the formula as a function of the number of ties in rank.)

51 51 $H = \frac{12}{30\,(31)} \left[ \frac{197^2}{10} + \frac{178^2}{10} + \frac{90^2}{10} \right] - 3\,(31) = 8.41$ (with adjustment for ties, we get 8.46)

52 52 What do we do with H? We can show that, under $H_0$, H is well approximated by a $\chi^2$ distribution with df = K - 1. Here, df = 2, and at $\alpha = .05$ the critical value is $\chi^2_{.05,\,2} = 5.99$. Since H = 8.41 > 5.99, reject $H_0$; conclude that the lifetime distributions are NOT the same for all 3 BRANDS.
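A minimal sketch of the test, using the rank sums from the example and leaving out the adjustment for ties. The function name `kruskal_wallis_h` is hypothetical. The p-value line uses the fact that, for df = 2 only, the chi-square upper-tail probability is exp(-H/2); other degrees of freedom would need a table or a statistics library.

```python
import math


def kruskal_wallis_h(rank_sums, sizes):
    """Kruskal-Wallis H from rank sums T_j and sample sizes n_j (no tie correction)."""
    n_total = sum(sizes)
    weighted = sum(t * t / n for t, n in zip(rank_sums, sizes))
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


h = kruskal_wallis_h([197, 178, 90], [10, 10, 10])
critical = 5.99                # chi-square table value for df = 2, alpha = .05
p_value = math.exp(-h / 2)     # valid for df = 2 only
reject_h0 = h > critical

print(f"H = {h:.2f}, p = {p_value:.3f}, reject H0: {reject_h0}")
```

The p-value of roughly 0.015 is below .05, in line with the rejection based on the critical value.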

