
1 ANALYSIS OF VARIANCE STAT 200

2 OUTLINE
Introduction of concepts without numbers, notation or details:
- Motivation
- The four steps: the Hypotheses in words; the Assumptions (a graphical investigation); the Test Statistic (basic ideas); the Conclusions
- The Follow-Up
ANOVA in details: revisiting the above.

3 OBJECTIVES
On a conceptual level:
- Know what ANOVA is used for: what type of situation calls for it, and when we use it.
- Know the assumptions/conditions required for ANOVA.
- Know what affects the Test Statistic. How does each type of variance change it? How does the overall sample size change it?
- Know what is to be done if we reject the Null Hypothesis.
- Be able to interpret the P-value.
On a practical level, you will be expected to:
- Run an entire ANOVA from raw data using software, not by hand.
- Complete a partially filled ANOVA table.
- Run an entire ANOVA starting from summary statistics.
- Reach conclusions, interpret conclusions and obtain p-values.

4 UNDERSTANDING THE BIG PICTURE FIRST

5 ANOVA
Analysis of Variance is often referred to as ANOVA. It is a hypothesis test for comparing the means of two or more groups. We can use it to compare multiple populations (e.g. Vancouver, Montreal and Toronto) or to verify the association between a categorical and a quantitative variable (e.g. diet type and weight loss).

6 SOME FAMILIAR GROUND
Since this is a Hypothesis Test, some of the structures we have already learned in other tests continue to hold. For example, the four steps to hypothesis testing still hold. So, step 1, the Hypotheses:
The Null Hypothesis: There are no differences between the group means.
The Alternative Hypothesis: At least two groups have different means.

7 ALWAYS A NEW DIET
It seems that over the years new diets keep on coming out. People go out and buy a bunch of new books and products, only to repeat the same steps once the next diet comes out. To investigate this, 60 people were randomly assigned to one of four diets. Their weight loss after six weeks was recorded (see next slide). As would be expected, the sample means differ, but can this difference be entirely explained by sampling variability, or is there a real difference between some of the diets?

8 DIET DATA
Weight loss by diet (15 people per diet), with sample mean and sample SD:
Diet A: 9.9 9.6 8.0 4.9 10.2 9.0 9.8 10.8 6.2 8.3 12.9 11.8 5.5 9.3 11.5 (mean 9.18, SD 2.29)
Diet B: 9.5 3.8 11.5 9.2 4.5 8.9 5.2 8.2 7.5 12.7 10.0 10.0 11.7 8.2 12.7 (mean 8.91, SD 2.78)
Diet C: 8.8 10.2 14.4 8.7 12.2 10.9 11.0 5.1 14.5 13.0 12.5 10.8 10.5 9.2 10.9 (mean 10.85, SD 2.39)
Diet D: 13.2 9.5 10.6 9.9 9.5 10.5 11.8 5.8 10.0 11.6 11.8 7.1 9.4 13.7 13.7 (mean 10.54, SD 2.23)

9 TWO QUESTIONS
Two questions arise from the problem at hand.
First: Do the diets differ in effectiveness? Does the mean weight loss differ between the diets? ANOVA allows us to answer this question.
Second: If the diets differ, which ones differ? ANOVA does not answer this. Multiple Comparisons allow us to answer this question.

10 THE ASSUMPTIONS
In ANOVA we make three assumptions:
- The data come from a Normal population.
- The variances in each group are equal.
- The observations are independent.
Thus this is a parametric test. There exists a nonparametric alternative, but it will not be covered here (for the curious: Kruskal-Wallis).

11 A TABLE TO OBTAIN A TEST STATISTIC
The test statistic is more involved than any other we have seen, but a table greatly simplifies the task. We call this an ANOVA table.
Source | Sum of Squares | Degrees of Freedom | Mean Sum of Squares | F-Statistic
Treatment or Between | SST | df1 | MST = SST/df1 | F = MST/MSE
Error or Within | SSE | df2 | MSE = SSE/df2 | ---
Total | SSTo | n-1 | --- | ---

13 THE LOGIC OF THE TEST STATISTIC
Under H0 each group is Normally distributed with the same mean μ and variance σ². Therefore, we know what the variance of the group means should be: σ²/n, where n here is the number of observations in each group. The Test Statistic uses the fact that, under H0, both the variance between group means and the variance within groups have σ² in them. Essentially, it scales them so that their ratio should be close to 1 when H0 is true.
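The claim that the ratio sits near 1 under H0 can be checked with a small simulation. This is only a sketch: the group count, group size, μ, σ, seed and function names below are illustrative assumptions, not values from the slides.

```python
import random
import statistics


def f_statistic(groups):
    """Return F = MST / MSE for a list of groups (lists of numbers)."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group (treatment) sum of squares
    sst = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-group (error) sum of squares
    sse = sum(sum((y - statistics.fmean(g)) ** 2 for y in g) for g in groups)
    return (sst / (k - 1)) / (sse / (n - k))


def simulate_null(num_sims=2000, k=4, per_group=15, mu=10.0, sigma=2.5, seed=1):
    """Draw every group from the same Normal(mu, sigma) and collect F each time."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    return [
        f_statistic([[rng.gauss(mu, sigma) for _ in range(per_group)]
                     for _ in range(k)])
        for _ in range(num_sims)
    ]


f_values = simulate_null()
mean_f = statistics.fmean(f_values)
print(f"average F when H0 is true: {mean_f:.3f}")
```

The average lands slightly above 1 rather than exactly on it, which is why "close to 1" is the safer wording.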

14 THE BOTTOM LINE
We get a test statistic from the ANOVA table. We get a critical value from the F-distribution table. If the Test Statistic is greater than the critical value, we reject the Null Hypothesis. Doing so implies that two or more groups differ significantly. It does not tell us which groups differ, though.

15 WHAT WE’VE DONE UP TO NOW
All the tests we’ve seen thus far pertain to one or two samples. It seems logical that we could run a series of two-sample tests, one for each combination of two diets. If we did:
- There would be many tests to run (here 6).
- We would run into trouble due to multiple testing.
If we run 100 tests at the 0.05 significance level, how many False Positives do we expect to encounter?
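The arithmetic behind the multiple-testing concern can be sketched in a few lines. The family-wise calculation assumes the tests are independent and that every null hypothesis is true; pairwise tests on shared groups are not independent, so treat that figure as a rough guide only.

```python
from math import comb

alpha = 0.05
num_groups = 4

# Number of two-sample tests needed to compare every pair of diets
num_pairwise_tests = comb(num_groups, 2)

# Expected number of false positives in 100 tests when every null is true
expected_false_positives = 100 * alpha

# Chance of at least one false positive across the pairwise tests
# (assumption: independent tests, which is only an approximation here)
familywise_error = 1 - (1 - alpha) ** num_pairwise_tests

print(num_pairwise_tests, expected_false_positives, round(familywise_error, 3))
```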

16 THE BIG PICTURE PART 2
Understanding the role of ANOVA in inferential statistics

17 THREE FORMS OF INFERENCE
There are three main forms of inference: Estimation, Hypothesis Testing and Modeling. We’ve seen that these overlap:
- We can determine the results of a Hypothesis Test using a Confidence Interval (estimation).
- We use Testing with Modeling.
ANOVA is a form of modeling.

18 WHAT IS MODELING?
In modeling, we attempt to determine the relationship between two variables based on our choice of model. Two questions can be answered by modeling: Are the two variables independent? If not, how do they relate?

19 WHAT TYPE OF VARIABLES ARE WE RELATING?
- Contingency Tables: relate two categorical variables.
- ANOVA: relates a quantitative variable to a categorical variable.
- Regression: relates two quantitative variables.

20 THE MODEL
In ANOVA we consider two possible Models:
Under H0 we have Y_ij = μ + ε_ij
Under HA we have Y_ij = μ_i + ε_ij
Here i indexes the group and j indexes the observations within the groups, and ε_ij ~ N(0, σ²). The groups are defined by the categories of the categorical variable. So what is the distribution of Y_ij under each Model?
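One way to answer that question, using only the model definitions on this slide: adding a constant to a Normal error term shifts its mean and leaves its variance unchanged, so

```latex
H_0:\; Y_{ij} \sim N(\mu,\, \sigma^2)
\qquad\qquad
H_A:\; Y_{ij} \sim N(\mu_i,\, \sigma^2)
```

The two models share the same variance and differ only in whether the mean is allowed to change with the group index i.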

21 ONE-WAY ANOVA: IN DETAILS The Details of the Test

22 VARIABLES – SOME VOCABULARY
ANOVA tests consider two variables: one is quantitative and the other is categorical. In the diet example, the weight loss is quantitative and the type of diet is categorical. We call the quantitative variable the response variable and usually denote it by Y. We call the categorical variable a factor, and its categories are referred to as levels. The levels are often referred to as groups. Here the factor (the diet) can also be referred to as a treatment.

25 DECOMPOSING THE TABLE
The table serves as more than a tool for obtaining the test statistic. It also allows us to understand some of the underlying concepts. We can also complete an ANOVA test based on partial information in the table.
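As a minimal sketch of completing a table from partial information, suppose we are given the treatment and total sums of squares plus the number of groups and observations. The function name and the numbers below are hypothetical, chosen only for illustration.

```python
def complete_anova_table(ss_treatment, ss_total, num_groups, num_obs):
    """Fill in the remaining cells of a one-way ANOVA table."""
    df_treatment = num_groups - 1
    df_error = num_obs - num_groups
    ss_error = ss_total - ss_treatment      # sums of squares add up
    ms_treatment = ss_treatment / df_treatment
    ms_error = ss_error / df_error
    return {
        "SST": ss_treatment, "SSE": ss_error, "SSTo": ss_total,
        "df_treatment": df_treatment, "df_error": df_error,
        "df_total": num_obs - 1,
        "MST": ms_treatment, "MSE": ms_error,
        "F": ms_treatment / ms_error,
    }


# Hypothetical inputs: 3 groups, 33 observations in total
table = complete_anova_table(ss_treatment=120.0, ss_total=480.0,
                             num_groups=3, num_obs=33)
for name, value in table.items():
    print(f"{name:>12}: {value}")
```

Other starting points (for example, MSE and the degrees of freedom) work the same way: each missing cell follows from the column sums or from dividing a sum of squares by its degrees of freedom.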

26 DIFFERENT SOURCES OF VARIANCE: TOTAL SUM OF SQUARES
We start by decomposing the Sum of Squares column. The Total Sum of Squares (SSTo) is the sum of the squared differences between every observation and the overall mean. Note: if we divide SSTo by n-1 we get the formula for the sample variance of all the observations.

27 THE TREATMENT SUM OF SQUARES (SST) AND ERROR SUM OF SQUARES (SSE)
The Treatment Sum of Squares looks at the difference between the group means and the overall mean (between groups).
The Error Sum of Squares looks at the difference between the observations and their group means (within groups). It can be simplified to
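Written out with the standard one-way ANOVA definitions, the three sums of squares look as follows. The notation is an assumption: g groups, n_i observations in group i, group means Ȳ_i, overall mean Ȳ, and group sample standard deviations s_i. The last equality is the usual shortcut for SSE from summary statistics, which may or may not be the exact simplification shown on the slide.

```latex
\begin{aligned}
SSTo &= \sum_{i=1}^{g}\sum_{j=1}^{n_i} \left(Y_{ij} - \bar{Y}\right)^2 \\
SST  &= \sum_{i=1}^{g} n_i \left(\bar{Y}_i - \bar{Y}\right)^2 \\
SSE  &= \sum_{i=1}^{g}\sum_{j=1}^{n_i} \left(Y_{ij} - \bar{Y}_i\right)^2
      \;=\; \sum_{i=1}^{g} (n_i - 1)\, s_i^2
\end{aligned}
```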

28 PARTITIONING THE SSTO
One of the main reasons for introducing these formulae is to understand that we are partitioning the total variation (SSTo) into two parts, treatment and error: SSTo = SST + SSE.

29 DEGREES OF FREEDOM
The degrees of freedom are also additive. For the Total Sum of Squares, to obtain the variance, we would need to divide by n-1, so that is the d.f. Total. The same logic applies for the d.f. Treatment, which is g-1 for g groups. Knowing the logic of where the degrees of freedom come from is not important. Knowing their values and that they add up, (g-1) + (n-g) = n-1, is.

30 THE F-DISTRIBUTION
Building an ANOVA table leads to a Test Statistic which follows a new distribution called the F-distribution. When the Null is true, F = MST/MSE ~ F(g-1, n-g). The F-distribution is characterized by two different degrees of freedom:
- Numerator degrees of freedom: g-1
- Denominator degrees of freedom: n-g
As with the t-test, the F-tables only include critical values. In fact, there is a table for each significance level.

31 MORE F-DISTRIBUTION
The good news is that there is no such thing as a two-sided test here. The higher the F-statistic, the more evidence we have that the group means are different.

33 ABOUT THE TABLE…AND P-VALUES
df1 = df treatment, df2 = df error. Each table corresponds to a significance level α. Thus, when using the tables, we can only obtain an interval for the p-value, as with the t-distribution. Computer software, like SGC, allows us to get an exact p-value: P(F > f), where f is the observed Test Statistic. The interpretation of the p-value continues to be the same as seen with all other tests.
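For readers without statistical software at hand, P(F > f) can be approximated numerically. The sketch below is not how SGC computes it; it relies on the standard identity between the F tail probability and the regularized incomplete beta function, and integrates with Simpson's rule. The function name `f_sf`, the step count, and the restriction to df2 >= 2 are assumptions of this sketch.

```python
from math import exp, lgamma


def f_sf(f_obs, df1, df2, steps=2000):
    """Approximate P(F > f_obs) for an F(df1, df2) distribution.

    Uses P(F > f) = I_x(df2/2, df1/2) with x = df2 / (df2 + df1 * f),
    where I_x is the regularized incomplete beta function.
    """
    if df2 < 2:
        raise ValueError("this sketch needs df2 >= 2")
    if f_obs <= 0:
        return 1.0
    a, b = df2 / 2.0, df1 / 2.0
    x = df2 / (df2 + df1 * f_obs)

    def integrand(t):
        return t ** (a - 1) * (1 - t) ** (b - 1)

    # Composite Simpson's rule on [0, x]; steps must be even
    h = x / steps
    total = integrand(0.0) + integrand(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    integral = total * h / 3

    # Normalize by the complete beta function B(a, b)
    log_beta = lgamma(a) + lgamma(b) - lgamma(a + b)
    return integral / exp(log_beta)


# Sanity check against a familiar table value: t = 2.228 with 10 d.f. has a
# two-sided p-value of 0.05, and F = t^2 is about 4.9646 on (1, 10) d.f.
p_check = f_sf(4.9646, 1, 10)
print(f"P(F > 4.9646) on (1, 10) d.f. is about {p_check:.4f}")
```

In practice a statistics package is the better choice; the point here is only that the exact p-value is an area under the F curve to the right of the observed statistic.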

34 BACK TO FAD DIETS…
Using this information, run an ANOVA to verify whether there is any difference between the fad diets.
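A minimal sketch of this ANOVA computed from the raw diet data on the earlier slide, using only the Python standard library. The course expects statistical software for this step; the variable names here are illustrative, and the p-value and critical-value lookup are left to the F tables or software.

```python
import statistics

# Weight loss after six weeks, copied from the DIET DATA slide
diets = {
    "A": [9.9, 9.6, 8.0, 4.9, 10.2, 9.0, 9.8, 10.8, 6.2, 8.3, 12.9, 11.8, 5.5, 9.3, 11.5],
    "B": [9.5, 3.8, 11.5, 9.2, 4.5, 8.9, 5.2, 8.2, 7.5, 12.7, 10.0, 10.0, 11.7, 8.2, 12.7],
    "C": [8.8, 10.2, 14.4, 8.7, 12.2, 10.9, 11.0, 5.1, 14.5, 13.0, 12.5, 10.8, 10.5, 9.2, 10.9],
    "D": [13.2, 9.5, 10.6, 9.9, 9.5, 10.5, 11.8, 5.8, 10.0, 11.6, 11.8, 7.1, 9.4, 13.7, 13.7],
}

n = sum(len(v) for v in diets.values())   # total number of observations
g = len(diets)                            # number of groups
grand_mean = sum(sum(v) for v in diets.values()) / n

# Between-group, within-group and total sums of squares
sst = sum(len(v) * (statistics.fmean(v) - grand_mean) ** 2 for v in diets.values())
sse = sum((len(v) - 1) * statistics.variance(v) for v in diets.values())
ssto = sum((y - grand_mean) ** 2 for v in diets.values() for y in v)

mst = sst / (g - 1)
mse = sse / (n - g)
f_stat = mst / mse

print(f"{'Source':<10}{'SS':>10}{'df':>5}{'MS':>10}{'F':>8}")
print(f"{'Treatment':<10}{sst:>10.3f}{g - 1:>5}{mst:>10.3f}{f_stat:>8.3f}")
print(f"{'Error':<10}{sse:>10.3f}{n - g:>5}{mse:>10.3f}")
print(f"{'Total':<10}{ssto:>10.3f}{n - 1:>5}")
```

The printed F-statistic is then compared with the critical value for 3 and 56 degrees of freedom at the chosen significance level.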

35 EXERCISE: COMPLETE THE TABLE
Three companies produce circuit boards. We investigate whether the average life span of the circuit boards differs according to the company chosen. Their prices differ, but do the products differ? Test at the 0.01 level. Interpret the p-value in the context of the problem.
Source | Sum of Squares | Degrees of Freedom | Mean Sum of Squares | F-Statistic
Treatment or Between | 1292.40 | | |
Error or Within | | | | ---
Total | 3115.12 | 130 | --- | ---

36 HORSEPOWER BY CAR TYPE
An ANOVA was run to verify if there is a difference in the mean horsepower of cars of each type (small, midsize, etc.). Here is part of the SGC output.
- State the Hypotheses.
- How many car types are there?
- How many observations do we have?
- Find the p-value and state the conclusions.
- Does the horsepower of “small” cars differ from that of “large” cars?

37 VERIFICATION OF ASSUMPTIONS/CONDITIONS
Two things to verify: the data are Normal, and the variances are equal. Here we want Normality regardless of sample size; the CLT isn’t enough. To verify Normality, we produce histograms for each group/sample or side-by-side boxplots. To verify equality of variance, we compare the sample variances or SDs and assess plausibility. The process is analogous to what we did with the two-sample t-test.
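A rough sketch of the "compare the SDs" step, applied to the diet summary statistics from the earlier slide. The cut-off of 2 for the ratio of largest to smallest SD is a common textbook rule of thumb and is an assumption here; the slides only say to assess plausibility.

```python
# Sample standard deviations from the DIET DATA slide
sample_sds = {"A": 2.29, "B": 2.78, "C": 2.39, "D": 2.23}

largest = max(sample_sds.values())
smallest = min(sample_sds.values())
sd_ratio = largest / smallest

# Assumption: a frequently quoted guideline, not a rule stated in the slides
RULE_OF_THUMB = 2.0
equal_variance_plausible = sd_ratio < RULE_OF_THUMB

print(f"largest SD / smallest SD = {sd_ratio:.2f}")
print("equal variances plausible:", equal_variance_plausible)
```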

39 FINAL THOUGHTS ON ONE-WAY ANOVA
When comparing two samples, we can determine which samples differ significantly: there’s only one possibility. In a two-sample case, we can use either a two-sample t-test assuming equal variances or a one-way ANOVA (they are equivalent). The t-test:
- Allows for a one-sided alternative.
- Can be altered if the variances are not likely to be equal.

40 THE RELATIONSHIP
It should be noted that ANOVA is a generalization of the two-sample t-test with equal variances. If we consider the two-sample case, we’ll find that F = t².
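The identity can be checked numerically. The sketch below uses diets A and B from the earlier data slide as the two samples; the helper names `pooled_t` and `one_way_f` are illustrative, not part of any library.

```python
import statistics
from math import sqrt

diet_a = [9.9, 9.6, 8.0, 4.9, 10.2, 9.0, 9.8, 10.8, 6.2, 8.3, 12.9, 11.8, 5.5, 9.3, 11.5]
diet_b = [9.5, 3.8, 11.5, 9.2, 4.5, 8.9, 5.2, 8.2, 7.5, 12.7, 10.0, 10.0, 11.7, 8.2, 12.7]


def pooled_t(x, y):
    """Two-sample t statistic assuming equal variances (pooled SD)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.fmean(x) - statistics.fmean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))


def one_way_f(groups):
    """One-way ANOVA F statistic, F = MST / MSE."""
    n = sum(len(grp) for grp in groups)
    k = len(groups)
    grand_mean = sum(sum(grp) for grp in groups) / n
    sst = sum(len(grp) * (statistics.fmean(grp) - grand_mean) ** 2 for grp in groups)
    sse = sum((len(grp) - 1) * statistics.variance(grp) for grp in groups)
    return (sst / (k - 1)) / (sse / (n - k))


t_stat = pooled_t(diet_a, diet_b)
f_stat = one_way_f([diet_a, diet_b])
print(f"t^2 = {t_stat ** 2:.6f}, F = {f_stat:.6f}")
```

With two groups, MSE is exactly the pooled variance and MST reduces to the squared difference in means scaled by the group sizes, which is why the two numbers agree.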

41 SUMMARY OF ONE-WAY ANOVA
One-way ANOVA tests for a difference in population means. Rejecting the Null does not allow us to determine which population means differ.
The assumptions are:
- Populations are Normally distributed.
- The population variances are equal.
- The observations are independent.
Verification of assumptions:
- Independence is established from the design.
- Normality is assessed using histograms or normal probability plots.
- Equality of variances is assessed using summary statistics.

42 SUMMARY PART 2
The math required for the Test Statistic is F = [SST/(g-1)] / [SSE/(n-g)], where g is the number of groups and n is the total number of observations. The Test Statistic F follows an F-distribution with degrees of freedom g-1 and n-g. To complete an ANOVA table we note that…

