STAT03 - Descriptive statistics (cont.) - variability
Lecturer: Smilen Dimitrov
Applied statistics for testing and evaluation – MED4


1 Descriptive statistics (cont.) - variability
Lecturer: Smilen Dimitrov
Applied statistics for testing and evaluation – MED4

2 Introduction
We previously discussed measures of central tendency (location) of a data sample (collection) in descriptive statistics: arithmetic mean, median and mode; and also the range as a measure of statistical dispersion (variability).
Here we continue with other important measures of variability, namely variance and standard deviation.
We will also get acquainted with some parameters leading to their definitions.
We will look at how we perform these operations in R, and a bit more about plotting as well.

3 Variability and deviations
A measure of variability is perhaps the most important quantity in statistical analysis.
–The greater the variability in the data, the greater will be our uncertainty in the values of the parameters estimated from the data, and
–the lower will be our ability to distinguish between competing hypotheses about the data.
Measures of variability: a single number describing the variability of the data. Eventually we are looking for the variance and the standard deviation.

4 Variability and deviations
Deviations: the (signed) distances of the individual values in the data sample from the mean value.
Plotting: using lines in a for loop.
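The plot on this slide is not preserved in the transcript. A minimal sketch of the idea, assuming a small illustrative sample (the five numbers reused from the degrees-of-freedom slides, not the lecture's actual data) and hypothetical variable names, could look like this:

```r
# Illustrative sample (assumption: not the data used on the original slide)
y <- c(2, 7, 4, 0, 7)
y_mean <- mean(y)

# Deviations: each value minus the mean
deviations <- y - y_mean

# Plot the values against their index and mark the mean as a dashed line
plot(seq_along(y), y, xlab = "Index", ylab = "y", pch = 16)
abline(h = y_mean, lty = 2)

# Draw each deviation as a vertical line from the mean to the data point
for (i in seq_along(y)) {
  lines(c(i, i), c(y_mean, y[i]))
}
```

Each vertical segment has the length of one deviation, which is what the next slide refers to as "the lines".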

5 Variability and deviations
The longer the lines, the more variable the data.
Could we use the sum of the deviations as a measure of variability?
No: by the definition of the arithmetic mean, it is the line positioned such that the deviations cancel out and their sum is zero.
A quick proof follows from the definition of the mean.
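The slide's own proof is not preserved in the transcript. One standard argument, writing $y_i$ for the $n$ data values and $\bar{y}$ for their arithmetic mean (notation assumed here), is:

```latex
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
\quad\Longrightarrow\quad
\sum_{i=1}^{n} (y_i - \bar{y})
  = \sum_{i=1}^{n} y_i - n\,\bar{y}
  = n\,\bar{y} - n\,\bar{y}
  = 0
```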

6 Absolute deviations
The minus signs of the deviations can be seen as the reason the sum cancels.
We could try using the absolute deviations.
Their sum will be different from 0 (unless every value equals the mean).
However, absolute values are awkward to work with, so we need an easier way.
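As a small check of the cancellation and of what taking absolute values changes, here is a sketch on the same illustrative five-number sample (an assumption, not the lecture's data):

```r
# Illustrative sample (assumption: not the data used on the original slide)
y <- c(2, 7, 4, 0, 7)
deviations <- y - mean(y)               # -2, 3, 0, -4, 3

# The signed deviations cancel out
sum_dev <- sum(deviations)              # 0

# The absolute deviations do not cancel
sum_abs_dev <- sum(abs(deviations))     # 12
```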

7 Squared deviations and sum of squares
Squaring the deviations also removes the minus signs, and is easier to work with than absolute values.
Their sum will, again, be different from 0 (unless every value equals the mean).
It is the well known sum of squares; more properly, it is the sum of squared deviations.
An unscaled, or unadjusted, measure of dispersion.
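The formula shown on the slide is missing from the transcript. In the usual notation (the symbol $SS$ and the indexing are assumptions here), the sum of squared deviations of $n$ values $y_i$ from their mean $\bar{y}$ is:

```latex
SS = \sum_{i=1}^{n} (y_i - \bar{y})^2
```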

8 Scaling the sum of squares – Mean Squared Deviation
Now, what would happen to the sum of squares if we added an additional data point?
–It would get bigger, of course.
So usually, the sum of squares will grow with the size of the data collection.
–That is a manifestation of the fact that it is unscaled.
–Scaling (also known as normalizing) means adjusting the sum of squares so that it does not grow as the size of the data collection grows.
We don't want our measure of variability to depend on sample size in this way, so the obvious solution is to divide by the number of data points, to get the mean squared deviation (MSD).
The MSD can be taken to be the wanted variance parameter, but…
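To see the effect of scaling, one can compare a sample with the same sample repeated twice. This is only a sketch: the data and the helper names `sum_sq` and `msd` are hypothetical, not taken from the lecture.

```r
# Hypothetical helpers: sum of squared deviations and mean squared deviation
sum_sq <- function(y) sum((y - mean(y))^2)
msd    <- function(y) sum_sq(y) / length(y)

# Illustrative sample, and the same values repeated (twice as many points)
y  <- c(2, 7, 4, 0, 7)
y2 <- c(y, y)

ss_small <- sum_sq(y)    # 38
ss_large <- sum_sq(y2)   # 76: the unscaled sum of squares doubles

msd_small <- msd(y)      # 7.6
msd_large <- msd(y2)     # 7.6: the scaled measure stays the same
```

The spread of the values is identical in both cases, yet only the MSD reflects that.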

9 Degrees of freedom
Suppose we had a sample of five numbers and their average was 4. What was the sum of the five numbers? It must have been 20, otherwise the mean would not have been 4.
So now let us think about each of the five numbers in turn. We are going to put a number in each of the five boxes.
If we allow that the numbers could be positive or negative real numbers, we ask how many values the first number could take.

10 Degrees of freedom
If we allow that the numbers could be positive or negative real numbers, how many values could the first number take? You will realize it could take any value. Suppose it was a 2.
Boxes: 2

11 Degrees of freedom
How many values could the next number take? It could be anything. Say it was a 7.
Boxes: 2 7

12 Degrees of freedom
And the third number could be anything. Suppose it was a 4.
Boxes: 2 7 4

13 Degrees of freedom
The fourth number could be anything at all. Say it was 0.
Boxes: 2 7 4 0

14 Degrees of freedom
Now, how many values could the last number take? Just one: it has to be another 7, because the mean of the five numbers is 4 and so they have to add up to 20.
Boxes: 2 7 4 0 7

15 Degrees of freedom
We have total freedom in selecting the first number, and the second, third and fourth numbers. But we have no choice at all in selecting the fifth number.
We have four degrees of freedom when we have five numbers (and their mean).
In general we have (n-1) degrees of freedom if we estimated the mean from a sample of size n.
More generally still, we can propose a formal definition of degrees of freedom: degrees of freedom is the sample size, n, minus the number of parameters, p, estimated from the data.
Boxes: 2 7 4 0 7
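The box argument can be replayed in a few lines of R. This is a sketch using the numbers from the slides; the variable names are hypothetical.

```r
n <- 5                        # sample size
target_mean <- 4              # the known mean of the five numbers
first_four <- c(2, 7, 4, 0)   # freely chosen values from the slides

# The total is fixed by the mean, so the last value is fully determined
fifth <- n * target_mean - sum(first_four)   # 7

# Degrees of freedom: sample size minus number of estimated parameters
p <- 1                        # one parameter (the mean) estimated from the data
df <- n - p                   # 4
```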

16 Scaling the sum of squares – variance
The mean is a parameter estimated from the data itself, hence we lose one degree of freedom.
Thus we finally arrive at a definition for variance: the sum of squares divided by the degrees of freedom.
The only difference between MSD and variance is division by n or n-1, respectively.
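A short sketch comparing the two scalings on the illustrative five-number sample (an assumption, not the lecture's data). R's built-in `var()` divides by n-1, so it matches the variance as defined above rather than the MSD.

```r
# Illustrative sample (assumption: not the data used on the original slide)
y <- c(2, 7, 4, 0, 7)
n <- length(y)

ss <- sum((y - mean(y))^2)     # sum of squares: 38

msd      <- ss / n             # divide by n:     7.6
variance <- ss / (n - 1)       # divide by n - 1: 9.5

# The built-in sample variance uses the n - 1 divisor
builtin_variance <- var(y)     # 9.5
```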

17 Standard deviation
Variance has a unit of measure which is squared (cm²) in relation to the original units (cm).
Therefore another measure is used, the standard deviation (the square root of the variance), which is measured in the same units as the data.
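In R this relationship can be checked directly. The sketch below again uses the illustrative five-number sample, which is an assumption rather than the lecture's data.

```r
# Illustrative sample (assumption: not the data used on the original slide)
y <- c(2, 7, 4, 0, 7)

variance <- var(y)              # 9.5, in squared units
std_dev  <- sqrt(variance)      # about 3.08, in the units of the data

# The built-in sd() computes the same quantity
builtin_sd <- sd(y)
```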

18 Sample and population parameters
Usually you are interested in drawing conclusions about the population from which your (random) sample of data is drawn.
It is very important to keep in mind the difference between the descriptive statistics that characterise your sample, and the corresponding parameters that characterise the population from which your sample is drawn.
Population (finite, infinite): "true" parameters (mean, variance, standard deviation). Ex. All raisin boxes ever produced by the company/factory.
Sample (finite): estimates of population parameters (mean, variance, standard deviation). Ex. The particular data collection for only 17 particular raisin boxes.
Relating the two needs (probability) distributions.

19 Geometric interpretations - quantity graph
Standard deviation: same units as the quantity.

20 Geometric interpretations - quantity graph
Variance - area

21 Geometric interpretations - quantity graph
Variance - area

22 Geometric interpretation - histogram (frequency count)
More commonly, the geometric interpretation is shown on a histogram, which makes it easier to see the spread.
If there are no deviations, the standard deviation is 0 and the whole histogram collapses to a single peak.
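The histogram on this slide is not preserved. A rough sketch of such a plot, assuming simulated data (the `rnorm` parameters are arbitrary choices for illustration, not values from the lecture), might be:

```r
# Simulated data (assumption: arbitrary parameters, for illustration only)
set.seed(1)
y <- rnorm(100, mean = 10, sd = 2)

# Frequency-count histogram of the values
h <- hist(y, main = "Histogram with mean and one standard deviation range",
          xlab = "y")

# Mark the arithmetic mean (solid) and mean +/- one standard deviation (dashed)
abline(v = mean(y), lwd = 2)
abline(v = mean(y) + c(-1, 1) * sd(y), lty = 2)

# With no deviations at all, the standard deviation is exactly 0
sd_constant <- sd(rep(5, 10))
```

The wider the dashed lines sit from the mean, the more spread out the data.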

23 Review
Descriptive statistics
Measures of central tendency (location): arithmetic mean, median, mode.
Measures of statistical variability (dispersion, spread): range, variance, standard deviation.

24 Exercise for mini-module 3 – STAT03
Exercise
Use the following data: The data in the following table come from three garden markets. The data show the ozone concentrations in parts per hundred million (pphm) on ten consecutive summer days.
1. Import the data into R, and for each garden, find the central tendency parameters of the ozone concentrations.
2. Using R, for each garden, find the dispersion parameters: the sample variance and sample standard deviation.
3. Using R, plot the relative frequency histogram for each of the gardens. Mark graphically the arithmetic mean on each graph and the one standard deviation range.
Delivery: Deliver the collected data (in tabular format), the found statistics and the requested graphs for the assigned years in an electronic document. You are welcome to include R code as well.

