F73DA2 INTRODUCTORY DATA ANALYSIS ANALYSIS OF VARIANCE.


1 F73DA2 INTRODUCTORY DATA ANALYSIS ANALYSIS OF VARIANCE

2 regression: x is a quantitative explanatory variable

3 type is a qualitative variable (a factor)

4 Illustration
Company 1: 36 28 32 43 30 21 33 37 26 34
Company 2: 26 21 31 29 27 35 23 33
Company 3: 39 28 45 37 21 49 34 38 44
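These samples can be entered in R as follows (the names company1, company2, company3, company and type are those used on later slides):

```r
# Enter the three samples (names as used on later slides)
company1 <- c(36, 28, 32, 43, 30, 21, 33, 37, 26, 34)
company2 <- c(26, 21, 31, 29, 27, 35, 23, 33)
company3 <- c(39, 28, 45, 37, 21, 49, 34, 38, 44)

# Combined response vector and group codes (see slide 46)
company <- c(company1, company2, company3)
type <- c(rep(1, 10), rep(2, 8), rep(3, 9))

# Sample means for the three companies
sapply(list(company1, company2, company3), mean)   # 32, 28.125, 37.222...
```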

5

6 Explanatory variable qualitative, i.e. categorical: a factor. Analysis of variance → linear models for comparative experiments

7

8

9 Using Factor Commands ► The display is different if “type” is declared as a factor.
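A sketch of the factor declaration (data vectors as on slides 4 and 46):

```r
company <- c(36,28,32,43,30,21,33,37,26,34,   # Company 1
             26,21,31,29,27,35,23,33,         # Company 2
             39,28,45,37,21,49,34,38,44)      # Company 3
type <- factor(c(rep(1, 10), rep(2, 8), rep(3, 9)))  # declare as a factor
levels(type)   # "1" "2" "3": three treatment levels, not a numeric scale
```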

10

11 ► We could check for significant differences between two companies using t tests. ► t.test(company1,company2) ► This calculates a 95% confidence interval for the difference between the means
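For example, with company1 and company2 as entered on slide 4 (R's default here is Welch's two-sample t test):

```r
company1 <- c(36, 28, 32, 43, 30, 21, 33, 37, 26, 34)
company2 <- c(26, 21, 31, 29, 27, 35, 23, 33)

tt <- t.test(company1, company2)  # Welch two-sample t test, 95% CI by default
tt$conf.int                       # the 95% CI for the difference in means spans 0
```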

12 Includes 0, so no significant difference

13 Instead use an analysis of variance technique

14 Taking all the results together

15 We calculate the total variation for the system: the sum of squared deviations of the individual values from the overall mean, 880/27 = 32.59259

16

17 ► We can also work out the sum of squares within each company. These sum to 1114.431

18 ► The total sum of squares must be made up of a contribution from variation WITHIN the companies and variation BETWEEN the companies. ► This means that the variation between the companies equals 1470.519 − 1114.431 = 356.088

19 ► This can all be shown in an analysis of variance table which has the format:

20 Source of variation          Degrees of freedom   Sum of squares   Mean squares      F
Between treatments              k − 1                SSB              SSB/(k − 1)       MSB/MSRES
Residual (within treatments)    n − k                SSRES            SSRES/(n − k)
Total                           n − 1                SST

21 Source of variation          Degrees of freedom   Sum of squares   Mean squares
Between treatments              k − 1                356.088          SSB/(k − 1)
Residual (within treatments)    n − k                1114.431         SSRES/(n − k)
Total                           n − 1                1470.519

22 ► Using the R package, the command is similar to that for linear regression
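A sketch of that command, assuming company and type (declared as a factor) are set up as on earlier slides:

```r
company <- c(36,28,32,43,30,21,33,37,26,34,
             26,21,31,29,27,35,23,33,
             39,28,45,37,21,49,34,38,44)
type <- factor(c(rep(1, 10), rep(2, 8), rep(3, 9)))

# Same syntax as linear regression; anova() prints the ANOVA table
a <- anova(lm(company ~ type))
a   # F = 3.83 on 2 and 24 df
```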

23 Theory. Data: y_ij is the jth observation using treatment i. Model: Y_ij = µ + τ_i + ε_ij, where the errors ε_ij are i.i.d. N(0, σ²)

24 The response variables Y_ij are independent, Y_ij ~ N(µ + τ_i, σ²). Constraint: Σ_i n_i τ_i = 0

25 Derivation of least-squares estimators
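The derivation behind this slide can be sketched as follows (a standard least-squares argument, assuming the side constraint Σ_i n_i τ_i = 0):

```latex
\text{Minimise } S(\mu,\tau_1,\dots,\tau_k)=\sum_{i=1}^{k}\sum_{j=1}^{n_i}\bigl(y_{ij}-\mu-\tau_i\bigr)^2 .

\frac{\partial S}{\partial \mu}= -2\sum_{i=1}^{k}\sum_{j=1}^{n_i}\bigl(y_{ij}-\mu-\tau_i\bigr)=0,
\qquad
\frac{\partial S}{\partial \tau_i}= -2\sum_{j=1}^{n_i}\bigl(y_{ij}-\mu-\tau_i\bigr)=0 .

\text{With } \textstyle\sum_i n_i\tau_i=0:\qquad
\hat\mu=\bar y_{\cdot\cdot},\qquad
\hat\tau_i=\bar y_{i\cdot}-\bar y_{\cdot\cdot},\qquad
\hat\mu+\hat\tau_i=\bar y_{i\cdot}.
```

So the fitted value for every observation in treatment i is the treatment mean, as stated on slide 26.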

26 The fitted values are the treatment means

27 Partitioning the observed total variation: SST = SSB + SSRES

28 The following results hold

29 Back to the example

30 Fitted values: Company 1: 320/10 = 32; Company 2: 225/8 = 28.125; Company 3: 335/9 = 37.222. Residuals: Company 1: ε̂_1j = y_1j − 32; Company 2: ε̂_2j = y_2j − 28.125; Company 3: ε̂_3j = y_3j − 37.222

31 SST = 30152 − 880²/27 = 1470.52
SSB = (320²/10 + 225²/8 + 335²/9) − 880²/27 = 356.09
⇒ SSRES = 1470.52 − 356.09 = 1114.43
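These hand calculations can be checked directly in R (data as on slide 4):

```r
company1 <- c(36, 28, 32, 43, 30, 21, 33, 37, 26, 34)
company2 <- c(26, 21, 31, 29, 27, 35, 23, 33)
company3 <- c(39, 28, 45, 37, 21, 49, 34, 38, 44)
company <- c(company1, company2, company3)
n <- length(company)   # 27

SST <- sum(company^2) - sum(company)^2 / n
SSB <- sum(company1)^2/10 + sum(company2)^2/8 + sum(company3)^2/9 -
       sum(company)^2 / n
SSRES <- SST - SSB

round(c(SST, SSB, SSRES), 2)   # 1470.52 356.09 1114.43
```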

32 ANOVA table
Source of variation   Degrees of freedom   Sum of squares   Mean squares      F
Between treatments    2                    356.09           178.04            3.83
Residual              24                   1114.43          46.44
Total                 26                   1470.52

33 Testing H0: τ_i = 0, i = 1,2,3 v H1: not H0 (i.e. τ_i ≠ 0 for at least one i). Under H0, F ~ F(2,24); the observed value is F = 3.83. P-value = P(F_2,24 > 3.83) = 0.036, so we can reject H0 at levels of testing down to 3.6%.
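The P-value and the 5% critical value can be checked with R's F-distribution functions:

```r
1 - pf(3.83, df1 = 2, df2 = 24)   # P-value, approximately 0.036
qf(0.95, df1 = 2, df2 = 24)       # 5% critical value, approximately 3.40
```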

34 Conclusion Results differ among the three companies (P-value 3.6%)

35 The fit of the model can be investigated by examining the residuals: the residual for response y_ij is ε̂_ij = y_ij − ȳ_i, which is just the difference between the response and its fitted value (the appropriate sample mean).

36 Plotting the residuals in various ways may reveal:
● a pattern (e.g. lack of randomness, suggesting that an additional, uncontrolled factor is present)
● non-normality (a transformation may help)
● heteroscedasticity (error variance differs among treatments; for example, it may increase with the treatment mean, in which case a transformation, perhaps log, may be required)

37

38

39 ► In this example, samples are small, but one might question the validity of the assumptions of normality (Company 2) and homoscedasticity (equality of variances, Company 2 v Companies 1/3).

40

41

42 ► plot(residuals(lm(company~type))~ fitted.values(lm(company~type)),pch=8)

43

44 ► abline(h=0,lty=2)

45

46 ► It is also possible to compare with an analysis using “type” as a quantitative explanatory variable ► type=c(rep(1,10),rep(2,8),rep(3,9)) ► No “factor” command

47

48 The fitted equation is company = 27.666 + 2.510 × type. Note the low R².
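A sketch of this fit (the name type_num is mine, to distinguish the numeric codes from the factor). Treating the arbitrary codes 1, 2, 3 as a measurement forces the three company means onto a straight line, which is why R² is low:

```r
company <- c(36,28,32,43,30,21,33,37,26,34,
             26,21,31,29,27,35,23,33,
             39,28,45,37,21,49,34,38,44)
type_num <- c(rep(1, 10), rep(2, 8), rep(3, 9))   # numeric codes, no factor()

fit <- lm(company ~ type_num)
coef(fit)                 # intercept 27.666, slope 2.510
summary(fit)$r.squared    # approximately 0.08
```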

49

50

51

52

53

54

55 Example A school is trying to grade 300 different scholarship applications. As the job is too much work for one grader, 6 are used.

56 The scholarship committee would like to ensure that each grader is using the same grading scale, as otherwise the students aren't being treated equally. One approach to checking whether the graders are using the same scale is to randomly assign each grader 50 exams and have them grade.

57 To illustrate, suppose we have just 24 tests and 3 graders (not 300 and 6, to simplify data entry). Furthermore, suppose the grading scale is on the range 1-5, with 5 being the best, and the scores are reported as:
grader 1: 4 3 4 5 2 3 4 5
grader 2: 4 4 5 5 4 5 4 4
grader 3: 3 4 2 4 5 5 4 4

58 The 5% cut-off for the F distribution with 2,21 df is 3.467. The null hypothesis cannot be rejected: no significant difference between graders.
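A sketch of this check in R, with the 24 scores as listed on the previous slide:

```r
grader1 <- c(4, 3, 4, 5, 2, 3, 4, 5)
grader2 <- c(4, 4, 5, 5, 4, 5, 4, 4)
grader3 <- c(3, 4, 2, 4, 5, 5, 4, 4)

score  <- c(grader1, grader2, grader3)
grader <- factor(rep(1:3, each = 8))

a <- anova(lm(score ~ grader))
a$"F value"[1]        # approximately 1.13, well below the cut-off
qf(0.95, 2, 21)       # 5% cut-off: 3.467
```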

59 Another Example
Class:   I     II    III   IV    V     VI
        151   152   175   149   123   145
        168   141   155   148   132   131
        128   129   162   137   142   155
        167   120   186   138   161   172
        134   115   148   169   152   141

60 Source of variation   df   Sum of squares   Mean squares   F
Between treatments       5    3046.7           609.3          2.54
Residual                 24   5766.8           240.3
Total                    29   8813.5
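The table can be reproduced in R (data read down the columns of slide 59):

```r
classI   <- c(151, 168, 128, 167, 134)
classII  <- c(152, 141, 129, 120, 115)
classIII <- c(175, 155, 162, 186, 148)
classIV  <- c(149, 148, 137, 138, 169)
classV   <- c(123, 132, 142, 161, 152)
classVI  <- c(145, 131, 155, 172, 141)

y <- c(classI, classII, classIII, classIV, classV, classVI)
class <- factor(rep(1:6, each = 5))

a <- anova(lm(y ~ class))
a                      # F = 2.54 on 5 and 24 df
qf(0.95, 5, 24)        # 5% point approximately 2.62, so not significant at 5%
```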

61

62 Normality and homoscedasticity (equality of variance) assumptions both seem reasonable

63

64 We now wish to calculate a 95% confidence interval for the underlying common standard deviation σ, using SSRES/σ² as a pivotal quantity with a χ² distribution on n − k = 24 degrees of freedom.
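A sketch of the calculation for the class data, where SSRES = 5766.8 on 24 df: since SSRES/σ² ~ χ²_24, a 95% CI for σ² is (SSRES/χ²_0.975, SSRES/χ²_0.025), and taking square roots gives the interval for σ.

```r
SSRES <- 5766.8
df    <- 24

ci_var <- SSRES / qchisq(c(0.975, 0.025), df)   # 95% CI for sigma^2
ci_sd  <- sqrt(ci_var)                          # 95% CI for sigma
round(ci_sd, 2)   # approximately (12.10, 21.56)
```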

65

66 It can easily be shown that Class III has the largest sample mean, 165.20, and that Class II has the smallest, 131.40. Consider performing a t test to compare these two classes.
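A sketch of that t test (class data from slide 59). It appears to show a significant difference, which is exactly the misuse warned against on slide 68:

```r
classIII <- c(175, 155, 162, 186, 148)   # mean 165.20
classII  <- c(152, 141, 129, 120, 115)   # mean 131.40

tt <- t.test(classIII, classII)   # Welch two-sample t test
tt$p.value                        # well below 0.05
```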

67

68 There is no contradiction between this and the ANOVA results. It is wrong to pick out the largest and the smallest of a set of treatment means, test for significance, and then draw conclusions about the whole set. Even if H0: "all treatment means equal" is true, the sample means would differ, and the largest and smallest sample means may well differ noticeably.

