Variance Harry R. Erwin, PhD School of Computing and Technology University of Sunderland.


1 Variance Harry R. Erwin, PhD School of Computing and Technology University of Sunderland

2 Resources
Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley.
Gentle, JE (2002) Elements of Computational Statistics. Springer.
Gonick, L., and Woollcott Smith (1993) The Cartoon Guide to Statistics. HarperResource (for fun).

3 Measure of Variability
This is the most important quantity in statistical analysis. Data may show no central tendency, but they almost always show variation. The greater the variability:
–the greater the uncertainty about parameters estimated from the data.
–the lower our ability to distinguish between competing hypotheses.

4 Typical Measures of Variability
–The range—depends only on outlying values.
–Sum of the differences between the data and the mean—useless, because it is zero by definition.
–Sum of the absolute values of the differences between the data and the mean—a very good measure, although hard to use.
–Sum of the squares of the differences between the data and the mean—most often used. Divide by the number of data points to get the mean squared deviation.
Variance is slightly different again….
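The four candidate measures above can be computed directly. A small sketch in Python (the slides use R; Python is used here purely for illustration), applied to the gardenA ozone values that appear on the later slides:

```python
data = [3, 4, 4, 3, 2, 3, 1, 3, 5, 2]   # gardenA values from the later slides
m = sum(data) / len(data)                # mean = 3.0

data_range = max(data) - min(data)           # depends only on the extremes
sum_dev = sum(x - m for x in data)           # zero by definition, hence useless
sum_abs = sum(abs(x - m) for x in data)      # good measure, but awkward to work with
sum_sq = sum((x - m) ** 2 for x in data)     # the basis of the variance

print(data_range, sum_dev, sum_abs, sum_sq)  # 4 0.0 8.0 12.0
```

Note that the sum of (signed) deviations really does come out to exactly zero, which is why it carries no information about spread.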

5 Degrees of Freedom
Suppose you have n data points, v. The mean, m(v), is the sum of the data point values divided by n (the number of independent pieces of information). Suppose you know n-1 of the data point values and the mean. What is the value of the remaining point? It is fully determined. Hence, an estimate involving the data points and the mean actually has only n-1 independent pieces of information: the degrees of freedom of the estimate are n-1. Definition: the degrees of freedom of an estimate, df, equal the sample size, n, minus the number of parameters, p, already estimated from the data.
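The "remaining point" argument can be made concrete. A minimal Python illustration (the values here are made up for the example): given n-1 of the points and the mean, the last point is pinned down, so it contributes no new information.

```python
# With n points and a known mean, only n-1 values are free: since
# mean = (sum of all points)/n, the last point is n*mean - (sum of the rest).
known = [2.0, 4.0, 3.0, 5.0]       # n-1 = 4 known data points (illustrative)
n = 5
m = 3.6                             # the mean of all n points
last = n * m - sum(known)           # the remaining point is fully determined
print(last)  # 4.0
```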

6 Variance
If you have n points of data, v, from an unknown distribution, and you want to compute an estimate of its variability, use the following equation:
variance = s² = (sum of squares)/(n-1)
Note that you divide by n-1. This is the df for the sample variance. (The sum of squares uses the sample mean.) If you know the true mean, you can divide by n, but if all you have are the sample data, dividing by n gives an estimate that is too small (biased low).
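The n-1 versus n distinction is easy to see numerically. A Python sketch (again using the gardenA data from the later slides; the slides themselves do this in R):

```python
import statistics

gardenA = [3, 4, 4, 3, 2, 3, 1, 3, 5, 2]    # data from the slides
n = len(gardenA)
m = statistics.mean(gardenA)
ss = sum((x - m) ** 2 for x in gardenA)      # sum of squares about the sample mean

sample_var = ss / (n - 1)   # divide by n-1 (the df): the unbiased estimate
biased_var = ss / n         # divide by n: biased low when the true mean is unknown

print(sample_var, biased_var)  # 1.333... vs 1.2
```

`statistics.variance` uses the n-1 divisor, matching R's `var`.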

7 Variance and Sample Size
The sample variance is not well behaved. The number of data points, n, affects the value of the variance estimated. For a small number of points, the variance estimate varies a lot; it can still vary by about a factor of three for 30 points. Rules of thumb:
–You want a large number of independent data points if you need to estimate the variance.
–Fewer than 10 sample points is a very small sample.
–Fewer than 30 points is a small sample.
–30 points is a reasonable sample.
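The instability of the variance estimate can be seen with a quick simulation. A Python sketch (stdlib only, seeded for reproducibility): draw repeated samples from a population whose true variance is 1 and watch how widely the sample variance ranges at each sample size.

```python
import random
import statistics

random.seed(1)

# Draw 200 samples of each size from N(0, 1) (true variance = 1) and record
# how widely the sample variance estimate ranges across the 200 repetitions.
spread = {}
for n in (5, 30, 1000):
    estimates = [statistics.variance([random.gauss(0, 1) for _ in range(n)])
                 for _ in range(200)]
    spread[n] = max(estimates) - min(estimates)
    print(n, round(spread[n], 2))
```

The spread shrinks steadily as n grows, but it is still noticeable at n = 30, in line with the rules of thumb above.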

8 Measures of Unreliability
Given a sample variance (s²), how much will the estimate of the mean vary with different samples? This is known as the standard error of the mean:
SE = √(s²/n)
Note that the central limit theorem implies that the estimate of the mean will converge to a normal distribution as n increases. You can use this fact to derive a confidence interval for your estimate of the mean. (n of 30+ allows the normal distribution to be used.)
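Applying the formula to the gardenA data from the later slides, sketched in Python for illustration (the deck itself works in R):

```python
import math
import statistics

gardenA = [3, 4, 4, 3, 2, 3, 1, 3, 5, 2]
s2 = statistics.variance(gardenA)      # sample variance, 1.333...
n = len(gardenA)
se = math.sqrt(s2 / n)                 # standard error of the mean
print(round(se, 4))  # 0.3651
```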

9 Small Sample Confidence Intervals For n<30, you can’t assume the normal distribution applies. Instead, you usually use Student’s t-distribution, which incorporates the degrees of freedom of the sample. You can also use bootstrap methods (advanced).
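A t-based interval for the gardenA mean (n = 10, so the normal approximation is not trusted) can be sketched as follows. The critical value 2.262 is the standard t-table entry for 95% coverage with 9 df; it is hardcoded here because the Python standard library has no t-distribution (in R you would use qt(0.975, 9)).

```python
import math
import statistics

gardenA = [3, 4, 4, 3, 2, 3, 1, 3, 5, 2]
n = len(gardenA)
m = statistics.mean(gardenA)
se = math.sqrt(statistics.variance(gardenA) / n)

# t quantile for 95% coverage with n-1 = 9 df, taken from a standard t table
t_crit = 2.262
lower, upper = m - t_crit * se, m + t_crit * se
print(round(lower, 2), round(upper, 2))  # 2.17 3.83
```

This interval is wider than the one the normal quantile 1.96 would give, which is exactly the small-sample penalty.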

10 Confidence Intervals
Three ways of generating a confidence interval for an estimate:
–Assume a normal distribution. (You need lots of samples.)
–Assume a χ² distribution. (Fewer samples needed.)
–Bootstrapping. (Makes the fewest assumptions, but is computationally demanding.)
Demonstration (advanced)
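A minimal percentile-bootstrap sketch in Python (the resample count and seed are arbitrary choices for the example; the slides discuss bootstrapping but do not show code for it):

```python
import random
import statistics

random.seed(42)
gardenA = [3, 4, 4, 3, 2, 3, 1, 3, 5, 2]

# Percentile bootstrap: resample the data with replacement, recompute the
# mean each time, and read the 95% CI off the middle of the distribution.
boot_means = sorted(
    statistics.mean(random.choices(gardenA, k=len(gardenA)))
    for _ in range(10000)
)
lower, upper = boot_means[249], boot_means[9749]
print(lower, upper)
```

No distributional assumption is made, at the price of 10,000 recomputations, which is the trade-off the slide describes.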

11 R Demonstrations of all this…
From the book:
ozone<-read.table("gardens.txt", header=T)
attach(ozone)
ozone

12 Ozone Data Frame
   gardenA gardenB gardenC
1        3       5       3
2        4       5       3
3        4       6       2
4        3       7       1
5        2       4      10
6        3       4       4
7        1       3       3
8        3       5      11
9        5       6       3
10       2       5      10

13 Continued
mean(gardenA)
3
mean(gardenB)
5
mean(gardenC)
5
Are gardenB and gardenC distinguishable?

14 Continued Further
var(gardenA)
1.33333
var(gardenB)
1.33333
var(gardenC)
14.22222
gardenA and gardenB have the same variance; gardenC does not!

15 Apply var.test
var.test(gardenB,gardenC)

F test to compare two variances

data: gardenB and gardenC
F = 0.0938, num df = 9, denom df = 9, p-value = 0.001624
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.02328617 0.37743695
sample estimates:
ratio of variances
           0.09375
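The F statistic reported by var.test is simply the ratio of the two sample variances, which can be checked by hand. A Python sketch using the data-frame values above (the p-value requires the F distribution and is left to R):

```python
import statistics

gardenB = [5, 5, 6, 7, 4, 4, 3, 5, 6, 5]
gardenC = [3, 3, 2, 1, 10, 4, 3, 11, 3, 10]

# var.test's F statistic is just the ratio of the two sample variances:
# 1.33333 / 14.22222
F = statistics.variance(gardenB) / statistics.variance(gardenC)
print(round(F, 5))  # 0.09375
```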

16 Implications
Since gardenA and gardenB have the same variance, you can use the t.test to compare their means and conclude they are significantly different. Since gardenC has a different variance, you cannot use the classical equal-variance t-test, and must use something weaker to compare the means. (R's t.test defaults to the Welch version, which does not assume equal variances.)

17 Application of t.test to gardenA and gardenB
t.test(gardenA,gardenB)

Welch Two Sample t-test

data: gardenA and gardenB
t = -3.873, df = 18, p-value = 0.001115
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.0849115 -0.9150885
sample estimates:
mean of x mean of y
        3         5
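The t statistic and degrees of freedom in this output can be reproduced from the formulas alone. A Python check (the Welch–Satterthwaite df formula reduces to exactly 18 here because the two variances and sample sizes are equal):

```python
import math
import statistics

gardenA = [3, 4, 4, 3, 2, 3, 1, 3, 5, 2]
gardenB = [5, 5, 6, 7, 4, 4, 3, 5, 6, 5]
n = 10

va, vb = statistics.variance(gardenA), statistics.variance(gardenB)
t = (statistics.mean(gardenA) - statistics.mean(gardenB)) / math.sqrt(va/n + vb/n)

# Welch-Satterthwaite approximation for the degrees of freedom
df = (va/n + vb/n) ** 2 / ((va/n) ** 2 / (n-1) + (vb/n) ** 2 / (n-1))

print(round(t, 3), round(df, 1))  # -3.873 18.0
```

The same formulas applied to gardenA and gardenC give the fractional df = 10.673 seen on the next slide.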

18 Application of t.test to gardenA and gardenC
t.test(gardenA,gardenC)

Welch Two Sample t-test

data: gardenA and gardenC
t = -1.6036, df = 10.673, p-value = 0.1380
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.7554137 0.7554137
sample estimates:
mean of x mean of y
        3         5

