Statistical Data Analysis - Lecture 16 - 09/04/03

1 Statistical Data Analysis - Lecture 16 - 09/04/03
A case study
The following data come from an experiment designed to measure the accuracy of eleven laboratories. Each laboratory was given three samples of each of two different types of chalk, and the laboratories were asked to take readings on the bulk density of precipitated chalk. In this experiment:
The response is bulk density.
The factors are CHALK and LAB.
The factor CHALK has two levels, A and B.
The factor LAB has eleven levels, corresponding to the different laboratories.

2 Statistical Data Analysis - Lecture 16 - 09/04/03
Here is the raw data. How do we get it into a form that can be analysed?

3 Statistical Data Analysis - Lecture 16 - 09/04/03
Data manipulation
Often, up to 90% of your time in any analysis is spent getting the data into a format that is convenient for the analysis you wish to do, and this data set is no exception. There is no single way of doing this. I used Microsoft Excel (good for data manipulation, not so good for statistical analysis) because of its ease of handling columnar data.
Our ultimate goal is to get the data into R ready for a two-way ANOVA. The format R expects (as do most stats packages) is the response in one column, with the appropriate factor levels in adjacent columns.
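As a minimal sketch of the target layout in R (the values and the object name toy below are made up purely for illustration, not taken from the experiment):
# A toy data frame in the "long" format R expects for a two-way ANOVA:
# the response in one column, the factor levels in adjacent columns
toy <- data.frame(Density = c(1.10, 1.12, 1.09, 1.25, 1.27, 1.24),
                  chalk   = factor(c("A", "A", "A", "B", "B", "B")),
                  lab     = factor(c(1, 1, 1, 1, 1, 1)))
toy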

4 Statistical Data Analysis - Lecture 16 - 09/04/03
Data manipulation
Using Excel, I copied each block of chalk data and pasted the transpose of the data into my worksheet. Transposing turns the rows into columns, so for each chalk type I went from a block of 11 rows and 3 columns to 3 rows and 11 columns. This makes it easier to stack the results from the different laboratories on top of each other.
After turning each block into a column, I stacked those columns on top of each other, leaving me with one column. Now all the data are in one column.
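The same transpose-and-stack step could also be done in R if the raw block were read in as a matrix; a sketch using a hypothetical 11 x 3 matrix of chalk A readings (the object name blockA and the random values are assumptions for illustration):
# blockA: hypothetical 11 x 3 block, one row per lab, one column per replicate
blockA <- matrix(round(runif(33, 0.9, 1.3), 2), nrow = 11, ncol = 3)
# t() gives 3 rows x 11 columns; as.vector() then reads column by column,
# stacking lab 1's three replicates, then lab 2's, and so on, into one column
densityA <- as.vector(t(blockA))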

5 Statistical Data Analysis - Lecture 16 - 09/04/03
Coding the factors
The first 33 observations are experiments done with chalk A and the second 33 observations are experiments done with chalk B. Therefore in R we need to make a vector with 33 A's and 33 B's to represent the factor levels for CHALK. We can do this with
chalk <- as.factor(rep(c("A", "B"), c(33, 33)))
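A quick sanity check on this coding (a sketch, not part of the original slides):
table(chalk)   # should show 33 observations at level A and 33 at level B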

6 Statistical Data Analysis - Lecture 16 - 09/04/03
Coding the factors
We know that within each block of 33 experiments there are 3 observations from each lab. This means we need a sequence that captures the idea that the first 3 observations were done on chalk A by lab 1, the next 3 observations were done on chalk A by lab 2, and so on.
We've taken care of the CHALK coding. To code the LAB factor we label each observation with the lab it came from. This means we need a vector of 3 ones, 3 twos, and so on, and it needs to be repeated for chalk B. Therefore, we use
lab <- as.factor(rep(rep(1:11, rep(3, 11)), 2))
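To tie the pieces together, the response and the two factors can be collected into one data frame; a sketch, where the stacked column of readings is assumed to be called Density and the data frame name chalk.df is my own:
table(chalk, lab)   # should show 3 observations in every CHALK x LAB cell
chalk.df <- data.frame(Density = Density, chalk = chalk, lab = lab)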

7 Statistical Data Analysis - Lecture 16 - 09/04/03
Fitting the model
We fit a standard two-way ANOVA model to the data:
y_ijk = μ + α_i + β_j + (αβ)_ij + ε_ijk
In this case i = A, B, j = 1, …, 11 and k = 1, …, 3. This is a balanced design because n = N/(IJ) = 66/(2 × 11) = 3 observations in every cell.
What do we expect to see before we do any fitting?
We know the chalk types are different, so the factor CHALK should be significant.
We expect the labs to perform about the same, so we hope that the factor LAB is not significant; if it is, this means that the quality of some of the labs is lower.
We hope there is no difference in the quality of the results on the basis of chalk type, i.e. we hope there is no significant interaction between CHALK and LAB.
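A minimal sketch of the fit in R, assuming the chalk.df data frame and column names introduced above:
# Full two-way model: both main effects plus the CHALK x LAB interaction
fit.full <- aov(Density ~ chalk * lab, data = chalk.df)
summary(fit.full)   # gives an ANOVA table like the one two slides further on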

8 Statistical Data Analysis - Lecture 16 - 09/04/03
The fitted model
In the interests of numerical stability we multiply the responses by 1,000. This multiplies the group means by 1,000 and the group variances and sums of squares by 1,000,000. The results are still easy to understand, but we don't need to worry as much about rounding error.
We need to remember to undo this change if we wish to say anything in particular about the numerical value of the results.
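In R the rescaling is one line; a sketch using the assumed names as before:
# Multiply the response by 1000; means scale by 1000, variances and sums of squares by 1000^2
chalk.df$Density <- chalk.df$Density * 1000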

9 Statistical Data Analysis - Lecture 16 - 09/04/03
Analysis of Variance Table
Response: Density
           Df   Sum Sq   Mean Sq   F value      Pr(>F)
chalk                                         < 2.2e-16 ***
lab                                           < 2.2e-16 ***
chalk:lab                                          e-05 ***
Residuals
---
We can see that we have some problems. CHALK is significant, as we predicted, BUT so are the LAB effects, and the CHALK × LAB interaction is significant as well. What does this mean? Maybe a plot will help.

10 Statistical Data Analysis - Lecture 16 - 09/04/03
Interaction plot of CHALK and LAB

11 Interpreting the interaction plot
The interaction plot is interesting: it seems to offer findings contrary to our ANOVA table. Remember, if an interaction is significant, then the lines will generally cross or fail to be parallel. The lines here seem to be mostly parallel.
In fact the plot is dominated by the difference between the chalks, and this fact is key to our interpretation. Let's go back to the ANOVA table.
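A plot like the one on the previous slide can be drawn with R's interaction.plot(); a minimal sketch under the same naming assumptions:
# One trace per lab, chalk type on the x-axis, mean density on the y-axis
with(chalk.df, interaction.plot(chalk, lab, Density,
                                xlab = "Chalk type",
                                ylab = "Mean density",
                                trace.label = "Lab"))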

12 Percentage of variation explained
When we're modelling data, our aim is to explain the data. In statistics, we measure how well we've explained the data by the percentage or proportion of variation in the data that the model accounts for. If the model explains only a small amount of the variation, then it does not explain the data well, i.e. it is a poor fit. Conversely, if the model explains a large amount of the variation, then USUALLY the model does explain the data well, i.e. it is a good fit.
The reason we don't automatically say the model is a good fit is that adding model parameters will always improve the fit.

13 Percentage of variation explained
When we work out the percentage of the total sum of squares (TSS) attributed to each of the model terms, we see that 98.8% comes from the difference between the chalks. Because the sums of squares measure total variation, we can treat this as a measure of variation explained.
It is fairly obvious that there is little increase in the relative quality of the fit from adding the lab and interaction terms. Furthermore, our interaction plot says that we're unlikely to pick a different lab to do an analysis on the basis of the chalk type we're looking at.
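The percentages can be read straight off the fitted object; a sketch, assuming the fit.full object from the earlier fit:
# Share of the total sum of squares attributed to each term in the full model
ss <- summary(fit.full)[[1]][["Sum Sq"]]
names(ss) <- rownames(summary(fit.full)[[1]])
round(100 * ss / sum(ss), 1)   # the chalk row dominates (the slides quote 98.8%)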

14 Statistical Data Analysis - Lecture 16 - 09/04/03
Refitting the model
Having convinced ourselves that the additive model will explain the data well enough, we fit the reduced model.
Analysis of Variance Table
Response: Density
           Df   Sum Sq   Mean Sq   F value      Pr(>F)
chalk                                         < 2.2e-16 ***
lab                                           < 2.2e-16 ***
Residuals
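A sketch of the reduced fit, with the same naming assumptions as before:
# Additive model: main effects only, no CHALK x LAB interaction
fit.add <- aov(Density ~ chalk + lab, data = chalk.df)
summary(fit.add)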

15 Statistical Data Analysis - Lecture 16 - 09/04/03
Additive model
Examining the ANOVA table, we can see that the main effects are still significant even though we haven't accounted for the interaction. Remember that the aim of this experiment was not to show that the chalks differ (we already know they do), but to look at differences in accuracy between labs.
A main effects plot shows the effects due to each factor: the group means are plotted on separate plots for each factor. In our example we have one plot for the chalk means and another plot for the lab means.
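One way to draw such a plot in R is with plot.design(), which plots the group mean for each level of each factor; a sketch under the same naming assumptions:
# Group means of Density for each level of chalk and of lab, shown side by side
plot.design(Density ~ chalk + lab, data = chalk.df)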

16 Statistical Data Analysis - Lecture 16 - 09/04/03
Main effects plot

17 Further considerations in two-way ANOVA (not examinable)
Linear contrasts and confidence intervals for interaction effects are similar to those for one-way ANOVA.
Two-way ANOVA with one replicate: we cannot fit a model with an interaction term. Since there is only one replicate, we can in fact drop the subscript k.
Tukey's test for non-additivity assumes the interaction is proportional to the product of the two main effects.
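This is not examinable, but for completeness, one common way to carry out Tukey's one-degree-of-freedom test in R is to refit the additive model with the squared fitted values added as an extra regressor and test that single term. A sketch using the assumed names from earlier (the chalk data have replicates, so this is purely illustrative; the test matters most when there is only one observation per cell):
# Tukey's 1-df test for non-additivity:
# the squared fitted values from the additive fit act as the candidate interaction term
chalk.df$tukey.term <- fitted(fit.add)^2
fit.tukey <- aov(Density ~ chalk + lab + tukey.term, data = chalk.df)
summary(fit.tukey)   # a significant tukey.term row suggests non-additivity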

