Copyright © Cengage Learning. All rights reserved. 1 Overview and Descriptive Statistics.


1 Chapter 1: Overview and Descriptive Statistics

2 1.4 Measures of Variability

3 Center is just one characteristic of a data set. Different data sets may have identical measures of center yet differ in other respects. Figure 1.19 shows dotplots of three samples with the same mean and median but different amounts of spread; the three histograms also have the same mean. Figure 1.19: Samples with identical measures of center but different amounts of variability

4 Measures of Variability for Sample Data

5 Measures of Variability for Sample Data: Range The simplest measure of variability is the range, the difference between the largest and smallest sample values. The range for sample 1 is much larger than for sample 3, so sample 1 has more variability. A defect of the range is that it depends only on the two most extreme observations and disregards the remaining $n - 2$ values. Samples 1 and 2 have identical ranges, yet there is much less variability or dispersion in the second sample than in the first.

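6 Measures of Variability for Sample Data: Deviations Our primary measures involve the deviations from the mean, $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$, obtained by subtracting the mean from each of the n sample observations. A deviation is positive if the observation is larger than the mean and negative if it is smaller. Small deviations indicate little variability; large deviations indicate a lot of variability. Can we average the deviations to get a single quantity? No, because the deviations always sum to zero, so their average is always 0. One possibility is to use the average absolute deviation, $\sum |x_i - \bar{x}| / n$.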
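A short worked derivation of why the deviations cannot simply be averaged, using only the definition $\bar{x} = \frac{1}{n}\sum x_i$:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

Positive and negative deviations cancel exactly, which is why the absolute or squared deviations are used instead.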

7 Measures of Variability for Sample Data: Variance and Standard Deviation It is more convenient to work with the squared deviations $(x_1 - \bar{x})^2, \ldots, (x_n - \bar{x})^2$. Rather than use the average squared deviation $\sum (x_i - \bar{x})^2 / n$, for several reasons we divide the sum of squared deviations by $n - 1$. The sample variance, denoted by $s^2$, is given by $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{S_{xx}}{n - 1}$. The sample standard deviation, denoted by s, is the (positive) square root of the variance: $s = \sqrt{s^2}$. Note that $s^2$ and s are both nonnegative. The unit for s is the same as the unit for each of the $x_i$'s.
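A minimal Python sketch of these two definitions may help. The function names and the four data values are illustrative assumptions, not part of the slides.

```python
import math

def sample_variance(xs):
    """Sample variance s^2: sum of squared deviations divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(xs) / n
    s_xx = sum((x - mean) ** 2 for x in xs)  # S_xx, the sum of squared deviations
    return s_xx / (n - 1)

def sample_sd(xs):
    """Sample standard deviation s: positive square root of s^2."""
    return math.sqrt(sample_variance(xs))

# Hypothetical data, chosen only so the arithmetic is easy to follow:
# mean = 4, deviations = -2, 0, 0, 2, S_xx = 8, s^2 = 8/3.
data = [2.0, 4.0, 4.0, 6.0]
print(sample_variance(data))
print(sample_sd(data))
```

The standard library's `statistics.variance` and `statistics.stdev` use the same $n - 1$ divisor, so they can serve as a cross-check.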

8 Interpreting Sample Standard Deviations If, for example, the observations are fuel efficiencies in miles per gallon, then we might have $\bar{x} = 23$ mpg and s = 2.0 mpg. The size of a typical or representative deviation from the sample mean of 23 mpg is then about 2 mpg, so a typical observation lies near 21 mpg or near 25 mpg. If s = 3.0 mpg for a second sample of cars of another type, a typical deviation in that sample is roughly 1.5 times what it is in the first sample, an indication of more variability in the second sample.

9 Example 17 Fuel efficiency of various vehicles (source: www.fueleconomy.gov). $S_{xx} = 314.106$

10 Motivation for $s^2$: why divide by $n - 1$ instead of n?

11 Motivation for $s^2$: why divide by $n - 1$ $s^2$ is the sample variance and s is the sample standard deviation; $\sigma^2$ is the population variance and $\sigma$ is the population standard deviation. When the population size N is finite, $\sigma^2 = \sum_{i=1}^{N} (x_i - \mu)^2 / N$, which is the average of all squared deviations from the population mean $\mu$. Just as $\bar{x}$ is used to make inferences about $\mu$, we need to define $s^2$ so that it can be used to make inferences about $\sigma^2$. However, the value of $\mu$ is often unknown, so the sum of squared deviations about $\bar{x}$ must be used. But the $x_i$'s tend to be closer to $\bar{x}$ than to $\mu$, so to compensate for this, the divisor $n - 1$ is used rather than n.

12 Motivation for $s^2$: why divide by $n - 1$ If a divisor of n were used in the sample variance, the resulting quantity would tend to underestimate $\sigma^2$ (produce values that are too small on average), whereas dividing by the slightly smaller $n - 1$ corrects this underestimation. It is customary to refer to $s^2$ as being based on $n - 1$ degrees of freedom (df). This terminology reflects the fact that although $s^2$ is based on the n deviations $x_1 - \bar{x}, \ldots, x_n - \bar{x}$, these sum to 0, so specifying the values of any $n - 1$ of them determines the remaining one. For example, if n = 4 and three of the deviations are specified, then the fourth is automatically determined, so only three of the four values of $x_i - \bar{x}$ are freely determined (3 df).
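The underestimation claim can be checked exactly on a toy case. The sketch below assumes a hypothetical three-value population and all equally likely samples of size 2 drawn with replacement (an assumption made for illustration; the slides do not specify a sampling scheme). It averages both candidate estimators over every possible sample.

```python
from fractions import Fraction
from itertools import product

def variance_with_divisor(sample, divisor):
    """Sum of squared deviations about the sample mean, divided by `divisor`."""
    mean = Fraction(sum(sample), len(sample))
    s_xx = sum((x - mean) ** 2 for x in sample)
    return s_xx / divisor

population = [0, 2, 4]  # hypothetical population, not from the slides
n = 2                   # sample size

# Population variance: average squared deviation about mu.
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

# Every ordered sample of size n drawn with replacement (9 samples here).
samples = list(product(population, repeat=n))
avg_div_n = sum(variance_with_divisor(s, n) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(variance_with_divisor(s, n - 1) for s in samples) / len(samples)

print(sigma2)             # 8/3
print(avg_div_n)          # 4/3, too small on average
print(avg_div_n_minus_1)  # 8/3, matches sigma^2
```

Exact fractions are used instead of a random simulation so the comparison does not depend on a seed.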

13 A Computing Formula for $s^2$

14 A Computing Formula for $s^2$ Options for computing $s^2$: use statistical software (Minitab, SAS, etc.); use Excel; or use a calculator with a built-in variance function. If your regular scientific calculator does not have this capability, there is an alternative formula for $S_{xx}$: $S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$. Both the defining formula and the computational formula for $s^2$ can be sensitive to rounding.
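For readers who want to see where the alternative formula for $S_{xx}$ comes from, a standard derivation expands the square and uses $\sum x_i = n\bar{x}$:

```latex
\begin{aligned}
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2 \\
  &= \sum x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
  &= \sum x_i^2 - n\bar{x}^2
   = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}
\end{aligned}
```

The final form subtracts two potentially large, nearly equal numbers, which is one reason the computational formula can be sensitive to rounding.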

15 Example 18 Recovery measurements of leg angle following knee surgery: 154 142 137 133 122 126 135 135 108 120 127 134 122. The sum of these 13 sample observations is $\sum x_i = 1695$ and the sum of their squares is $\sum x_i^2 = 222{,}581$. Thus the numerator of the sample variance is $S_{xx} = 222{,}581 - (1695)^2/13 = 1579.0769$, from which $s^2 = 1579.0769/12 = 131.59$ and s = 11.47.
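The arithmetic in Example 18 can be reproduced with a few lines of Python. This is a sketch using the 13 observations listed on the slide; the variable names are illustrative.

```python
import math

# The 13 observations from Example 18.
angles = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 134, 122]
n = len(angles)

total = sum(angles)                    # sum of the observations
total_sq = sum(x * x for x in angles)  # sum of the squared observations

# Computing (shortcut) formula for S_xx.
s_xx_shortcut = total_sq - total ** 2 / n

# Defining formula for S_xx, as a cross-check.
mean = total / n
s_xx_definition = sum((x - mean) ** 2 for x in angles)

s2 = s_xx_shortcut / (n - 1)  # sample variance
s = math.sqrt(s2)             # sample standard deviation

print(total, total_sq)                                     # 1695 222581
print(round(s_xx_shortcut, 4), round(s2, 2), round(s, 2))  # 1579.0769 131.59 11.47
```

Both formulas for $S_{xx}$ agree here to well within floating-point error.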

16 Two other properties of $s^2$ Proposition: Let $x_1, x_2, \ldots, x_n$ be a sample and let c be any nonzero constant. 1. If a constant c is added to (or subtracted from) each data value, the variance is unchanged: if $y_1 = x_1 + c, y_2 = x_2 + c, \ldots, y_n = x_n + c$, then $s_y^2 = s_x^2$. 2. Multiplication of each $x_i$ by c results in $s^2$ being multiplied by a factor of $c^2$: if $y_1 = cx_1, \ldots, y_n = cx_n$, then $s_y^2 = c^2 s_x^2$ and $s_y = |c| s_x$, where $s_x^2$ is the sample variance of the x's and $s_y^2$ is the sample variance of the y's.
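Both properties can be checked numerically. The sketch below uses a small hypothetical sample and exact fractions (the data values and the constant c are assumptions chosen for illustration), with `statistics.variance` from the Python standard library, which uses the $n - 1$ divisor.

```python
from fractions import Fraction
from statistics import variance

# Hypothetical sample: mean 7, S_xx = 26, so s^2 = 26/3.
xs = [Fraction(v) for v in (3, 7, 8, 10)]
c = Fraction(5, 2)  # any nonzero constant

shifted = [x + c for x in xs]  # y_i = x_i + c
scaled = [c * x for x in xs]   # y_i = c * x_i

var_x = variance(xs)
var_shifted = variance(shifted)
var_scaled = variance(scaled)

print(var_x)        # 26/3
print(var_shifted)  # 26/3, unchanged by the shift
print(var_scaled)   # 325/6, equal to c^2 * 26/3
```

Using `Fraction` inputs keeps the comparison exact, so equality can be tested directly rather than within a tolerance.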

17 Boxplots

18 Boxplots A boxplot describes several features of a data set: center, spread, the extent and nature of any departure from symmetry, and "outliers". The boxplot is based on the median and a measure of variability called the fourth spread, neither of which is sensitive to outliers. Order the n observations from smallest to largest and separate the smallest half from the largest half; the median is included in both halves if n is odd. Then the lower (upper) fourth is the median of the smallest (largest) half. A measure of spread, the fourth spread $f_s$, is given by $f_s$ = upper fourth – lower fourth.
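The recipe for the fourths translates directly into code. The following is a minimal sketch; the function names and the test data are hypothetical, and the halves are formed exactly as described above, with the median included in both halves when n is odd.

```python
from statistics import median

def fourths(xs):
    """Return (lower fourth, upper fourth) of the observations in xs."""
    data = sorted(xs)
    n = len(data)
    half = (n + 1) // 2            # size of each half; counts the median when n is odd
    lower_half = data[:half]       # smallest half
    upper_half = data[n - half:]   # largest half
    return median(lower_half), median(upper_half)

def fourth_spread(xs):
    """Fourth spread f_s = upper fourth - lower fourth."""
    lower, upper = fourths(xs)
    return upper - lower

# Hypothetical data with n odd: halves are [1, 2, 3, 4] and [4, 5, 6, 7].
odd_sample = [7, 1, 5, 3, 2, 6, 4]
print(fourths(odd_sample), fourth_spread(odd_sample))    # (2.5, 5.5) 3.0

# Hypothetical data with n even: halves are [1, 2, 3, 4] and [5, 6, 7, 8].
even_sample = [1, 2, 3, 4, 5, 6, 7, 8]
print(fourths(even_sample), fourth_spread(even_sample))  # (2.5, 6.5) 4.0
```

Quartiles reported by software may differ slightly from these fourths, because packages use a variety of interpolation rules.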

19 Boxplots The simplest boxplot is based on the five-number summary: smallest $x_i$, lower fourth, median, upper fourth, largest $x_i$. Draw a horizontal axis. Place a rectangle above it; the left edge of the rectangle is at the lower fourth, and the right edge is at the upper fourth (so box width = $f_s$). Place a vertical line segment inside the rectangle at the median; the position of the median line relative to the two edges conveys information about skewness in the middle 50% of the data. Draw "whiskers" out from either end of the rectangle to the smallest and largest observations.

20 Example 19 The five-number summary is as follows: smallest 40, lower fourth 72.5, upper fourth 96.5, largest 125. The right edge of the box is much closer to the median than the left edge is. The box width ($f_s$) is also reasonably large relative to the range of the data.

21 Example 19 Figure 1.21 shows Minitab output from a request to describe the corrosion data. Q1 and Q3 are the lower and upper quartiles; these are similar to the fourths but are calculated in a slightly different manner. SE Mean is $s/\sqrt{n}$; this will be an important quantity in our subsequent work concerning inferences about $\mu$. Figure 1.21: Minitab description of the pit-depth data

22 Boxplots That Show Outliers

23 Boxplots That Show Outliers A boxplot can reveal outliers. Any observation farther than 1.5$f_s$ from the closest fourth is an outlier. An outlier is extreme if it is more than 3$f_s$ from the nearest fourth, and it is mild otherwise.

24 Boxplots That Show Outliers Let's now modify our previous construction of a boxplot by drawing a whisker out from each end of the box to the smallest and largest observations that are not outliers. Each mild outlier is represented by a closed circle and each extreme outlier by an open circle. Some statistical computer packages do not distinguish between mild and extreme outliers.
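The outlier rule (more than 1.5$f_s$ from the closest fourth is an outlier, more than 3$f_s$ is extreme) can be sketched as a small classifier. The function name and the example fourths of 10 and 20 are hypothetical values chosen for illustration.

```python
def classify(x, lower_fourth, upper_fourth):
    """Label x as 'extreme', 'mild', or 'not an outlier' using the fourth spread."""
    f_s = upper_fourth - lower_fourth
    if x < lower_fourth:
        distance = lower_fourth - x   # distance below the closest fourth
    elif x > upper_fourth:
        distance = x - upper_fourth   # distance above the closest fourth
    else:
        return "not an outlier"       # inside the box
    if distance > 3 * f_s:
        return "extreme"
    if distance > 1.5 * f_s:
        return "mild"
    return "not an outlier"

# Hypothetical fourths: lower = 10, upper = 20, so f_s = 10.
# Mild outliers lie beyond 35 (or below -5); extreme ones beyond 50 (or below -20).
for value in (15, 35, 36, 51, -21):
    print(value, classify(value, 10, 20))
```

The whiskers of the modified boxplot would then extend to the smallest and largest observations labeled "not an outlier".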

25 Example 20 The data are total nitrogen (TN) loads (kg N/day) from a study of pollutant loads in watersheds, measured at a particular Chesapeake Bay location and displayed here in increasing order.

26 Example 20 Using the fourths and the fourth spread $f_s$ as summary quantities: subtracting 1.5$f_s$ from the lower fourth gives a negative number, and none of the observations are negative, so there are no outliers on the lower end of the data. However, upper fourth + 1.5$f_s$ = 351.015 and upper fourth + 3$f_s$ = 534.24. Thus the four largest observations (563.92, 690.11, 826.54, and 1529.35) are extreme outliers, and 352.09, 371.47, 444.68, and 460.86 are mild outliers.
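As a check on the arithmetic, the two upper cutoffs quoted above are enough to recover the fourth spread and both fourths. This is a worked calculation from the slide's own numbers, not additional data:

```latex
\begin{aligned}
1.5 f_s &= 534.24 - 351.015 = 183.225, \qquad f_s = 183.225 / 1.5 = 122.15 \\
\text{upper fourth} &= 351.015 - 183.225 = 167.79 \\
\text{lower fourth} &= 167.79 - 122.15 = 45.64 \\
\text{lower fourth} - 1.5 f_s &= 45.64 - 183.225 = -137.585 < 0
\end{aligned}
```

The last line agrees with the statement that the lower cutoff is negative, so no observation can be an outlier on the low end.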

27 Example 20 The whiskers in the boxplot in Figure 1.22 extend out to the smallest observation, 9.69, on the low end and 312.45, the largest observation that is not an outlier, on the upper end. There is some positive skewness in the middle half of the data (the median line is somewhat closer to the left edge of the box than to the right edge) and a great deal of positive skewness overall. Figure 1.22: A boxplot of the nitrogen load data showing mild and extreme outliers

28 Comparative Boxplots

29 Comparative Boxplots A comparative or side-by-side boxplot is a very effective way of revealing similarities and differences between two or more data sets consisting of observations on the same variable: fuel efficiency observations for four different types of automobiles, crop yields for three different varieties, and so on. Vertical boxplots can be used instead of horizontal ones in the comparison.

30 Example 21 Indoor radon concentrations in two samples of houses, one sample from houses in which a child with cancer lived and one from houses with no such case. Both the mean and median suggest that the cancer sample is centered to the right of the no-cancer sample. The mean exaggerates the magnitude of this shift, largely because of the outlying observation 210. The standard deviation s suggests more variability in the cancer sample, but this impression is contradicted by the fourth spreads. Again, the observation 210, an extreme outlier, is the culprit.

31 Example 21 Figure 1.24 shows a comparative boxplot from the S-Plus computer package. Figure 1.24: A boxplot of the data in Example 1.21, from S-Plus

32 Example 21 The no-cancer box is stretched out compared with the cancer box ($f_s$ = 18 vs. $f_s$ = 11), and the positions of the median lines in the two boxes show much more skewness in the middle half of the no-cancer sample than in the cancer sample. Outliers are represented by horizontal line segments, and there is no distinction between mild and extreme outliers.

