Variation, uncertainties and models. Marian Scott, School of Mathematics and Statistics, University of Glasgow, June 2012
The sample mean: perhaps the most commonly used measure of centre is the arithmetic mean (from now on called the mean). If we have a sample of n observations denoted by $x_1, x_2, \ldots, x_n$, then the mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
The sample variance: the variance of the observations is $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$, where $\bar{x}$ is the sample mean.
The sample standard deviation: the standard deviation is the square root of the variance, $s = \sqrt{s^2}$, where $s^2$ is the sample variance.
The estimated standard error: the standard error is the standard deviation divided by the square root of n, that is $s/\sqrt{n}$. It is a measure of the precision with which we can estimate the mean, and is sometimes called the standard deviation of the mean.
The coefficient of variation: the coefficient of variation is a simple summary, CV = (stdev/mean) × 100%. It is a useful way of evaluating the variation relative to the mean value, and also of comparing different data sets, even where the mean values are quite different.
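The summaries defined so far (mean, variance, standard deviation, standard error and CV) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `summarise` and the data values are hypothetical and are not taken from the slides. The variance uses the usual n - 1 divisor.

```python
import math
import statistics

def summarise(xs):
    """Return the basic summaries for a list of numbers."""
    n = len(xs)
    mean = statistics.fmean(xs)
    var = statistics.variance(xs)   # sample variance, divisor n - 1
    sd = math.sqrt(var)             # standard deviation
    se = sd / math.sqrt(n)          # standard error of the mean
    cv = 100 * sd / mean            # coefficient of variation, in percent
    return {"n": n, "mean": mean, "variance": var,
            "stdev": sd, "se": se, "cv": cv}

# Hypothetical data, for illustration only
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
summary = summarise(sample)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Note that the CV is only meaningful when the mean is well away from zero and the data are on a ratio scale.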
Data summaries: Case 1: all data, mean = 130.5, stdev = 256.9, CV = 197%. Case 2: extreme value at 1500 removed, mean = 95.4, stdev = 133.3, CV = 139%.
A more sensible analysis: use the log data. As above, there are no problem data values; CV = 36.9%.
Robust summary statistics: these include the median, quartiles and inter-quartile range (IQR). The median is defined as the value below which (or equivalently above which) half of the observations lie. It is also known as the 50th percentile. This is a non-parametric percentile, since no distributional assumptions are made.
Robust summary statistics: quartiles and inter-quartile range (IQR). Similarly, a more robust way to measure spread is to look at the lower and upper quartiles Q1 and Q3, also known as the 25th and 75th percentiles. The IQR (inter-quartile range) is Q3 - Q1. These statistics form the basis of the construction of the boxplot.
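To illustrate why the median and quartiles are called robust, the sketch below compares them with the mean before and after adding one extreme value. The data are hypothetical. The `method="inclusive"` quartile convention is an assumption made here; other software uses other conventions and may report slightly different Q1 and Q3.

```python
import statistics

def robust_summary(xs):
    """Return (Q1, median, Q3, IQR) for a list of numbers."""
    # "inclusive" is one of several quartile conventions.
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return q1, med, q3, q3 - q1

# Hypothetical data, for illustration only
clean = [10, 20, 30, 40, 50, 60, 70]
with_extreme = clean + [1500]

for label, data in [("clean", clean), ("with extreme value", with_extreme)]:
    q1, med, q3, iqr = robust_summary(data)
    print(f"{label}: mean={statistics.fmean(data):.1f} "
          f"median={med} Q1={q1} Q3={q3} IQR={iqr}")
```

The mean moves from 40 to 222.5 when the extreme value is added, while the median only moves from 40 to 45.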
Preliminary Analysis: Bathing water example. There is considerable variation across different sites, and within the same site across different years. The distribution of the data is highly skewed, with evidence of outliers and in some cases bimodality.
Detecting and dealing with outliers: from the boxplot, most statistical software identifies an outlier as a value which is more than 1.5 × IQR below Q1 or above Q3, and marks it with a special symbol.
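A minimal sketch of the usual boxplot rule follows, assuming the common convention that the fences sit 1.5 × IQR below the lower quartile and above the upper quartile. The function name `boxplot_outliers`, the quartile method and the data are illustrative choices, not taken from the slides.

```python
import statistics

def boxplot_outliers(xs, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr   # the boxplot "fences"
    flagged = [x for x in xs if x < lower or x > upper]
    return flagged, (lower, upper)

# Hypothetical data, for illustration only
data = [10, 20, 30, 40, 50, 60, 70, 1500]
outliers, fences = boxplot_outliers(data)
print("fences:", fences)
print("flagged as outliers:", outliers)
```

A flagged value is a prompt to check the data, not an instruction to delete it.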
Formal tests: formal outlier tests exist (Dixon's, Grubbs', Chauvenet's criterion); all are based on the "how far" rule, but usually how far from the mean, in terms of standard deviations. What to do? First, check your data for any errors. Second, perhaps consider an analysis both with and without the problem value. Use robust statistics.
Robust values
                  Q1     Median   Q3
Original          10.0   60.0     23.0
Outlier removed   10.0   58.0     112.5
Removing the outlier makes almost no difference to the median but the range is affected.
Simple Regression Model: the basic regression model assumes that (1) the average value of the response y is linearly related to the explanatory variable x; (2) the spread of the response y about the average is the SAME for all values of x; (3) the VARIABILITY of the response y about the average follows a NORMAL distribution for each value of x.
Simple Regression Model: the model is typically fitted using least squares. Goodness of fit is assessed using the residual sum of squares and R². Assumptions are checked using residual plots. Inference is then made about the model parameters.
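The least squares fit of a straight line has a closed form, sketched below using only plain Python. The function name `fit_line` and the data are hypothetical; in practice a statistics package would be used, and it would also report standard errors and p-values.

```python
def fit_line(x, y):
    """Least squares fit of y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residuals are observed minus fitted values; plot these to check assumptions.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    rss = sum(r ** 2 for r in residuals)        # residual sum of squares
    tss = sum((yi - y_bar) ** 2 for yi in y)    # total sum of squares
    return {"intercept": intercept, "slope": slope,
            "residuals": residuals, "rss": rss, "r_squared": 1 - rss / tss}

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = fit_line(x, y)
print(f"intercept={fit['intercept']:.3f} slope={fit['slope']:.3f} "
      f"R-squared={fit['r_squared']:.3f}")
```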
Regression Output
The regression equation is chloro = -1.7 + 28.8 N

Predictor   Coef     StDev    T       P
Constant    -1.69    10.14    -0.17   0.869
N           28.808   4.171    6.91    0.000

S = 15.19   R-Sq = 67.5%   R-Sq(adj) = 66.1%

Analysis of Variance
Source       DF   SS      MS      F       P
Regression   1    11000   11000   47.70   0.000
Error        23   5304    231
Total        24   16304
Conclusions: the best-fit straight line has an intercept of -1.7 and a slope of 28.8. Thus for every unit increase in N, chloro increases on average by 28.8. The R² (adj) value is 66.1%, so we have explained about 66% of the variation in chloro by its relationship to N. The S value is 15.19, which describes the variation in the points around this fitted line.
Conclusions: in the Analysis of Variance table, the Regression term has a p-value of 0.000. Since the p-value is small (< 0.05), we can conclude that the regression is significant. Check for unusual observations: these may have a large residual, which simply means that the observed value lies far from the fitted line, or they may be influential, which means that the observation has been particularly important in the calculation of the best-fitting line.
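The numbers in the chloro regression output are linked by simple arithmetic, and it can be a useful check to reproduce them. The sketch below copies the coefficient, standard error and sums of squares from the output and recomputes T, S, F, R-Sq and R-Sq(adj). The variable names are illustrative.

```python
import math

# Values copied from the chloro ~ N regression output
coef_n, se_n = 28.808, 4.171
ss_reg, ss_err, ss_tot = 11000.0, 5304.0, 16304.0
df_reg, df_err = 1, 23

t_slope = coef_n / se_n                # t statistic for the slope
ms_err = ss_err / df_err               # residual mean square
s = math.sqrt(ms_err)                  # residual standard deviation S
f_stat = (ss_reg / df_reg) / ms_err    # overall F statistic
r_sq = ss_reg / ss_tot                 # R-squared
# Adjusted R-squared compares mean squares rather than sums of squares
r_sq_adj = 1 - ms_err / (ss_tot / (df_reg + df_err))

print(f"T = {t_slope:.2f}, S = {s:.2f}, F = {f_stat:.2f}")
print(f"R-Sq = {100 * r_sq:.1f}%, R-Sq(adj) = {100 * r_sq_adj:.1f}%")
```

The recomputed values agree with the output to the rounding shown there.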
Regression Output
The regression equation is log(amm) = 6.45 - 0.75 pH

Predictor   Coef    ese
Constant    6.45    0.7837
pH          -0.75   0.0998

S = 0.336   R-Sq(adj) = 19.4%
So only 19.4% of the variability in log(amm) is explained by pH.
Check Assumptions: the residual plot shows no pattern, and the probability plot looks broadly linear.
Assess the Model Fit: the R² (adjusted) value expresses the percentage of variability in the response variable that has been explained by the model. High values are good! Here, 19.4% of the variability in log(amm) is explained by pH. Look at the fitted values and compare them with the observed data (using the residuals). Look at the residual plots.
Other features: influential points. They are key in determining where the fitted line goes. Often they are at the ends of the line, so at either large or small x values.
Model inference: the main items of note are testing the significance of parameters using p-values; testing the overall significance of the regression using the ANOVA table; and assessing the goodness of fit using the R² (adjusted) value and the residuals. Typical questions concerning the intercept $\beta_0$ and slope $\beta_1$ of the line are: does the line pass through the origin (is $\beta_0 = 0$)? Is the slope significantly different from 0 (is $\beta_1 \neq 0$)? We can also construct a 95% confidence interval for the mean response for a given value of the explanatory variable, and a 95% prediction interval for a future observation.
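For reference, the standard textbook forms of these tests and intervals for simple linear regression are written out below. The notation is an assumption, since the slides do not show it: $\hat{\beta}_0$ and $\hat{\beta}_1$ are the fitted intercept and slope, $s$ is the residual standard deviation (S in the regression output), $x_0$ is the chosen value of the explanatory variable, and $t_{n-2,\,0.975}$ is the 97.5% point of the t distribution on $n - 2$ degrees of freedom.

```latex
\begin{align*}
  \hat{y}_0 &= \hat{\beta}_0 + \hat{\beta}_1 x_0,
  \qquad S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \\[4pt]
  \text{test of } \beta_1 = 0: \quad
  T &= \frac{\hat{\beta}_1}{\mathrm{ese}(\hat{\beta}_1)}
  \sim t_{n-2} \text{ under the null hypothesis} \\[4pt]
  \text{95\% CI for the mean response:} \quad
  &\hat{y}_0 \pm t_{n-2,\,0.975}\; s
  \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \\[4pt]
  \text{95\% prediction interval:} \quad
  &\hat{y}_0 \pm t_{n-2,\,0.975}\; s
  \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
\end{align*}
```

The prediction interval is always wider than the confidence interval because it includes the variability of a single new observation about the line, and both widen as $x_0$ moves away from $\bar{x}$.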
Modelling dissolved oxygen. Model 1: DO ~ temperature
Regression output
The regression equation is DO = 11.9 - 0.475 temp

Predictor   Coef       SE Coef   T        P
Constant    11.8887    0.1303    91.27    0.000
temp        -0.47524   0.01133   -41.95   0.000

S = 2.0598   R-Sq(adj) = 47.3%
So only 47.3% of the variability in DO is explained by temperature.
Regression output
Analysis of Variance
Source           DF     SS        MS       F         p-value
Regression       1      7467.7    7467.7   1760.08   0.000
Residual Error   1961   8320.2    4.2
Total            1962   15787.9

The ANOVA table shows the residual sum of squares as 8320.2. The p-value of 0.000 summarises a test of the null hypothesis, model 0: DO = error. We would reject this model in favour of model 1: DO = temperature + error.
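The F statistic in this table can be read as a comparison of the two models: it measures how much the residual sum of squares drops when temperature is added, relative to the remaining residual variation. A small sketch using the sums of squares copied from the table follows; the variable names are illustrative.

```python
# Values copied from the DO ~ temperature ANOVA table
rss_model0 = 15787.9   # residual SS for model 0 (DO = error), i.e. the total SS
rss_model1 = 8320.2    # residual SS for model 1 (DO = temperature + error)
df_extra = 1           # extra parameters in model 1
df_resid = 1961        # residual degrees of freedom for model 1

# Drop in residual SS per extra parameter, relative to the residual mean square
f_stat = ((rss_model0 - rss_model1) / df_extra) / (rss_model1 / df_resid)
r_sq = 1 - rss_model1 / rss_model0   # proportion of variation explained

print(f"F = {f_stat:.1f}, R-Sq = {100 * r_sq:.1f}%")
```

The recomputed F differs from the tabulated 1760.08 only in the second decimal place, because the table entries are rounded.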
Measures of agreement: when there are two methods by which a measurement can be made, it is important to know how well the methods agree. As an example, we can consider a recent study of low-level total phosphorus (Nov 2007) conducted in the Edinburgh chemistry lab. This was not a situation where two different analytical techniques were being used; instead, duplicate samples of water were analysed for two different lochs over approximately one month. How well did the duplicate samples agree?
Measures of agreement: first, what not to do! Don't quote a correlation coefficient. A correlation coefficient measures the strength of the relationship between two quantities, and if we have two measurement techniques we would expect them to be related. A high correlation therefore says nothing about how closely the measurements agree, so the correlation coefficient is not a measure of agreement.
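A small sketch makes the point concrete: two sets of readings can be perfectly correlated while never agreeing. The data and the helper `pearson_r` are hypothetical, constructed so that the second method always reads 5 units higher than the first.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical readings: method B always reads 5 units higher than method A
method_a = [10.0, 12.0, 15.0, 20.0, 25.0]
method_b = [a + 5.0 for a in method_a]

r = pearson_r(method_a, method_b)
mean_diff = sum(a - b for a, b in zip(method_a, method_b)) / len(method_a)
print(f"correlation = {r:.3f}, mean difference (A - B) = {mean_diff:.1f}")
```

The correlation is 1 even though every pair of readings differs by 5 units.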
Measures of agreement: a further tool commonly used is the scatterplot. In this situation care must be taken in constructing the scatterplot: the scale on the x- and y-axes must be the same and, as a useful visual aid, it is common to sketch the line of equality (y = x).
Assessing agreement: the scatterplot with the line y = x is shown. If the two sets are in agreement, then the points should be scattered closely round the line.
Assessing agreement: the scatterplot with the line y = x is shown. The blue line is the best-fitting straight line, so the results are clearly related, but we knew that anyway.
Bland-Altman method: this method involves studying the distribution of the between-method differences, and summarizing these data by the mean and 95% range of the differences (these are called the 95% limits of agreement). This is then backed up with a Bland and Altman plot, which plots the differences against the mean of the paired measurements, to ensure that the difference data are well behaved.
Bland-Altman approach: the mean difference is -1.30 and the standard deviation of the differences is 1.974. But is there a suggestion that the difference is larger for higher levels of TP?
Bland-Altman approach: the mean difference is -1.30 and the standard deviation of the differences is 1.974. The limits of agreement are indicated.
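As a sketch of how the limits of agreement might be computed, the code below uses the common form, mean difference ± 1.96 × standard deviation of the differences. The 1.96 multiplier is an assumption here (it presumes roughly normal differences); the slides do not state which formula was used. The function names and the small paired data set are hypothetical; the summary values -1.30 and 1.974 are taken from the slides.

```python
import statistics

def limits_of_agreement(mean_diff, sd_diff, z=1.96):
    """Approximate 95% limits of agreement: mean difference +/- z * SD."""
    return mean_diff - z * sd_diff, mean_diff + z * sd_diff

def bland_altman(a, b):
    """Return pair means, differences and limits for paired measurements."""
    diffs = [x - y for x, y in zip(a, b)]
    means = [(x + y) / 2 for x, y in zip(a, b)]   # x-axis of the plot
    limits = limits_of_agreement(statistics.fmean(diffs),
                                 statistics.stdev(diffs))
    return means, diffs, limits

# Summary values quoted on the slide
lower, upper = limits_of_agreement(-1.30, 1.974)
print(f"limits of agreement: {lower:.2f} to {upper:.2f}")

# Hypothetical duplicate readings, for illustration only
means, diffs, limits = bland_altman([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
```

Plotting `diffs` against `means` gives the Bland and Altman plot; a trend in that plot would suggest that the difference depends on the level of TP.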