Variation, uncertainties and models Marian Scott School of Mathematics and Statistics, University of Glasgow June 2012.


the sample mean Perhaps the most commonly used measure of centre is the arithmetic mean (from now on called the mean). If we have a sample of n observations denoted by x1, x2, ..., xn then the mean is x̄ = (x1 + x2 + ... + xn)/n

the sample variance the variance of the observations is s² = Σ(xi − x̄)² / (n − 1)

the sample standard deviation the standard deviation is the square root of the variance, s = √(s²)

the estimated standard error the standard error is the standard deviation divided by √n, se = s/√n. This is a measure of the precision with which we can estimate the mean. It is sometimes called the standard deviation of the mean

the coefficient of variation The coefficient of variation is a simple summary, CV = (stdev/mean)*100%. It is a useful way of evaluating the variation relative to the mean value and also to compare different data sets, even where the mean value is quite different.
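These summaries can be sketched in a few lines of Python (a minimal illustration using only the standard library; the data values here are invented for the example, not taken from the case study):

```python
import statistics

def summarise(xs):
    """Mean, sample standard deviation, standard error and CV (%)."""
    n = len(xs)
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)        # sample sd, divisor n - 1
    se = sd / n ** 0.5               # standard error of the mean
    cv = 100 * sd / mean             # coefficient of variation, as a %
    return mean, sd, se, cv

# invented data with one extreme value, to show how strongly the CV reacts
mean, sd, se, cv = summarise([10, 12, 9, 11, 13, 8, 140])
```

A single extreme value inflates the standard deviation, and hence the CV, far more than it moves the mean.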

Data summaries Case 1: all data: mean = 130.5, stdev = 256.9, CV = 197%. Case 2: extreme value at 1500 removed: mean = 95.4, stdev = 133.3, CV = 139%

a more sensible analysis use the log-transformed data, as above: on the log scale there are no problem data values, and CV = 36.9%

robust summary statistics robust summary statistics include the median, quartiles and inter-quartile range (IQR). The median is defined as the value below which (or equivalently above which) half of the observations lie. It is also known as the 50th percentile. This is a non-parametric percentile, since no distributional assumptions are made

robust summary statistics quartiles and inter-quartile range (IQR) Similarly, the more robust way to measure spread is to look at the lower and upper quartiles Q1 and Q3 - also known as the 25th and 75th percentiles. The IQR (interquartile range) is Q3 – Q1. these statistics form the basis of the construction of the boxplot
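As a sketch, the robust summaries above can be computed with the Python standard library (the data are invented; note that quantile conventions differ slightly between software packages):

```python
import statistics

# invented data with one large value
data = [2.1, 2.4, 2.6, 2.8, 3.0, 3.3, 3.5, 3.9, 15.0]

median = statistics.median(data)
# quantiles() with n=4 returns the three quartile cut points
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1   # inter-quartile range, Q3 - Q1
```

The median and IQR are barely moved by the value 15.0, whereas the mean and standard deviation would be pulled towards it.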

Preliminary Analysis Bathing water example There is considerable variation: across different sites, and within the same site across different years. The distribution of the data is highly skewed, with evidence of outliers and in some cases bimodality

detecting and dealing with outliers From the boxplot, most statistical software identifies an outlier as a value which is more than 1.5 * IQR beyond the quartiles (below Q1 or above Q3) and marks it by a special symbol.

Formal tests Formal outlier tests exist, such as Dixon's, Grubbs' and Chauvenet's criteria; all are based on a "how far" rule, but usually how far from the mean, in terms of standard deviations. What to do? First, check your data for any errors; second, perhaps consider an analysis both with and without the problem value; use robust statistics
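The boxplot fences can be sketched as follows (a minimal Python illustration of the 1.5 * IQR rule from the quartiles; the data are invented):

```python
import statistics

def flag_outliers(xs):
    """Flag values beyond 1.5 * IQR below Q1 or above Q3 (boxplot rule)."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

outliers = flag_outliers([2.1, 2.4, 2.6, 2.8, 3.0, 3.3, 3.5, 3.9, 15.0])
```

Flagging is only the first step; as the slide says, check the data for errors before deciding what to do with a flagged value.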

Robust values (Q1, median, Q3) for the original data and with the outlier removed: removing the outlier makes almost no difference to the median but the range is affected.

Simple Regression Model The basic regression model assumes: The average value of the response y, is linearly related to the explanatory x, The spread of the response y, about the average is the SAME for all values of x, The VARIABILITY of the response y, about the average follows a NORMAL distribution for each value of x.

Simple Regression Model Model is fit typically using least squares Goodness of fit of model assessed based on residual sum of squares and R 2 Assumptions checked using residual plots Inference about model parameters
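The least squares fit can be written out directly (a sketch of the textbook formulae; in practice one would use a statistics package, and the data points here are invented):

```python
def least_squares(xs, ys):
    """Ordinary least squares estimates b0, b1 for y = b0 + b1 * x."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx              # slope
    b0 = ybar - b1 * xbar       # intercept
    return b0, b1

# invented points lying exactly on y = 1 + 2x
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
```

These are the estimates that minimise the residual sum of squares, the quantity used below to assess goodness of fit.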

Regression- chlorophyll

Regression Output The regression equation is chloro = … + … N. Predictor Coef StDev T P (Constant, N). S = 15.19, R-Sq = 67.5%, R-Sq(adj) = 66.1%. Analysis of Variance: Source DF SS MS F P (Regression, Error, Total)

Check Assumptions

Assess the Model Fit

Conclusions The equation for the best fit straight line is one with an intercept of -1.7 and a slope of …. Thus for every unit increase in N, the chloro measure increases by …. The R²(adj) value is 66.1%, so we have explained 66% of the variation in chloro by its relationship to N. The S value is 15.19, which describes the variation of the points around this fitted line.

Conclusions In the Analysis of Variance table, against the Regression term, there is a p-value of …; since the p-value is small (<0.05), we can conclude that the regression is significant. Check for unusual observations: these may have a large residual, which simply means that the observed value lies far from the fitted line, or they may be influential, which means that the value for this particular observation has been particularly important in the calculation of the best fitted line.

Example 1: simple regression

log ammonia model Model log(amm) ~ pH Fitted model log(amm) = … + … pH

Fitted line: simple regression

Regression Output The regression equation is log(amm) = … + … pH. Predictor Coef ese (Constant, pH). S = …, R-Sq(adj) = 19.4%. So only 19.4% of variability in log(amm) is explained by pH

Check Assumptions Residual plot shows no pattern, probability plot looks broadly linear

Assess the Model Fit The R²(adjusted) value expresses the % variability in the response variable that has been explained. High values are good!! 19.4% of variability in log(amm) explained by pH. Look at the fitted values and compare with the observed data (using the residuals). Look at the residual plots.
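R² and adjusted R² can be computed directly from the residuals; a sketch (the coefficients and data here are invented for illustration, not the log(amm) fit):

```python
def r_squared(xs, ys, b0, b1):
    """R^2 and adjusted R^2 for a fitted line y = b0 + b1 * x."""
    n = len(ys)
    ybar = sum(ys) / n
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # residual SS
    ss_tot = sum((y - ybar) ** 2 for y in ys)                       # total SS
    r2 = 1 - ss_res / ss_tot
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - 2)   # penalises for the fitted slope
    return r2, r2_adj

# invented data close to (but not exactly on) the line y = 0.15 + 1.94x
r2, r2_adj = r_squared([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8], 0.15, 1.94)
```

The adjusted value is always a little smaller than R², since it charges a price for each parameter estimated.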

other features Influential points are key in determining where the fitted line goes. Often they are at the ends of the line, i.e. either large or small x values

Model inference The main items of note: testing the significance of parameters using p-values; testing the overall significance of the regression using the ANOVA table; assessing the goodness of fit using the R²(adjusted) value and the residuals. Typical questions concerning the slope and intercept of the line are: Does the line pass through the origin? (is β0 = 0) Is the slope significantly different from 0? (is β1 ≠ 0) Constructing a 95% confidence interval for the mean response for a given value of the explanatory variable, and a 95% prediction interval for a future observation.
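For the model y = β0 + β1 x + ε with fitted value ŷ0 at a chosen x0, these two intervals take the standard textbook form (here s is the residual standard deviation and Sxx = Σ(xi − x̄)²):

```latex
% 95% confidence interval for the mean response at x_0
\hat{y}_0 \;\pm\; t_{n-2,\,0.975}\; s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}

% 95% prediction interval for a future observation at x_0
\hat{y}_0 \;\pm\; t_{n-2,\,0.975}\; s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
```

The extra "1" under the square root is why a prediction interval is always wider than the confidence interval at the same x0: it must cover the scatter of a single new observation, not just the uncertainty in the mean.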

Modelling dissolved oxygen Model 1: DO ~ temperature

Regression output The regression equation is DO = … + … temp. Predictor Coef SE Coef T P (Constant, temp). S = …, R-Sq(adj) = 47.3%. So only 47.3% of variability in DO is explained by temperature.

Regression output Analysis of Variance: Source DF SS MS F p-value (Regression, Residual Error, Total). The ANOVA table shows the residual sum of squares as …; the p-value is 0.000, so this summarises a test of the null hypothesis model0: DO = error. We would reject this model in favour of model1: DO = temperature + error

Check Assumptions

Measures of agreement When there are two methods by which a measurement can be made, it is important to know how well the methods agree. As an example, we can consider a recent study of low-level total phosphorus (Nov 2007) conducted in the Edinburgh chemistry lab. This was not a situation where two different analytical techniques were being used; instead, duplicate samples of water were analysed for two different lochs over approximately one month. How well did the duplicate samples agree?

Measures of agreement First, what not to do: don't quote a correlation coefficient. A correlation coefficient measures the strength of relationship between two quantities, and we might expect, if we have two measurement techniques, that they are indeed related; the correlation coefficient therefore is not a measure of agreement.

Measures of agreement A further tool commonly used is the scatterplot. In this situation care must be taken in constructing the scatterplot: the scale on both the x- and y-axes must be the same, and as a useful visual aid it would be common to sketch the line of equality (y = x).

assessing agreement The scatterplot with the line y = x is shown. If the two sets are in agreement, then the points should be scattered closely around the line

assessing agreement The scatterplot with the line y = x is shown; the blue line is the best fitting straight line. So the results are clearly related, but we knew that anyway.

Bland-Altman method This method involves studying the distribution of the between-method differences, and summarizing these data by the mean and 95% range of the differences. (These are called the 95% limits of agreement). This is then backed up with a Bland and Altman plot which plots the differences against the mean of the paired measurements, to ensure that the difference data are well behaved.
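The limits of agreement can be sketched in a few lines (Python; the duplicate determinations here are invented, not the TP data):

```python
import statistics

def limits_of_agreement(a, b):
    """Bland-Altman 95% limits: mean difference +/- 1.96 sd of differences."""
    diffs = [x - y for x, y in zip(a, b)]
    d_bar = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return d_bar - 1.96 * sd, d_bar + 1.96 * sd

# invented duplicate determinations on the same water samples
lo, hi = limits_of_agreement([10, 11, 12, 13], [9, 11, 11, 14])
```

For the full method one would also plot each difference against the mean of its pair, to check that the differences are stable across the measurement range.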

Bland-Altman approach The mean difference is … and the standard deviation of the differences is …. But is there a suggestion that the difference is larger for higher levels of TP?

Bland-Altman approach The mean difference is … and the standard deviation of the differences is …; the limits of agreement are indicated.