
1 1 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Chapter 3 Graphical Methods for Describing Data

2 2 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Frequency Distribution Example The data in the column labeled vision for the student data set introduced in the slides for chapter 1 are the answers to the question, “What is your principal means of correcting your vision?” The results are tabulated below.

3 3 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Bar Chart Examples This comparative bar chart is based on frequencies, which makes it difficult to interpret and potentially misleading. Would you mistakenly interpret this to mean that the females and males use contacts equally often? You shouldn’t. The picture is distorted because the numbers of males and females in the sample are not equal.

4 4 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Bar Chart Examples When the comparative bar chart is based on percents (or relative frequencies), so that each group adds up to 100%, we can clearly see a difference in pattern for the eye correction proportions for each of the genders. Clearly, for this sample of students, the proportion of female students with contacts is larger than the proportion of males with contacts.

5 5 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Bar Chart Examples Stacking the bar chart can also show the difference in distribution of eye correction method. This graph clearly shows that the females have a higher proportion using contacts, and that both the no correction and glasses groups have smaller proportions than for the males.

6 6 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Pie Charts - Procedure
1. Draw a circle to represent the entire data set.
2. For each category, calculate the “slice” size: slice size = 360° × (category relative frequency).
3. Draw a slice of appropriate size for each category.
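The slice-size rule in step 2 can be sketched in a few lines of Python. The category counts below are hypothetical placeholders chosen for illustration; they are not the vision correction counts from the slides.

```python
# Hypothetical category counts (assumed for illustration only).
counts = {"None": 30, "Glasses": 25, "Contacts": 20}

total = sum(counts.values())

# Slice size in degrees = 360 * (category relative frequency).
slice_degrees = {cat: 360 * n / total for cat, n in counts.items()}

for cat, deg in slice_degrees.items():
    print(f"{cat}: {deg:.1f} degrees")
```

Because the relative frequencies add up to 1, the slice sizes always add up to 360 degrees.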

7 7 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Pie Chart - Example Using the vision correction data we have:

8 8 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Pie Chart - Example Using side-by-side pie charts we can compare the vision correction for males and females.

9 9 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Another Example This data constitutes the grades earned by the distance learning students during one term in the Winter of 2002.

10 10 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Pie Chart – Another Example Using the grade data from the previous slide we have:

11 11 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Pie Chart – Another Example Using the grade data we have: By pulling out a slice (exploding it) we can accentuate it and make it clearer that A was the predominant grade for this course.

12 12 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Stem and Leaf A quick technique for picturing the distributional pattern associated with numerical data is to create a picture called a stem-and-leaf diagram (commonly called a stem plot).
1. We want to break up the data into a reasonable number of groups.
2. Looking at the range of the data, we choose the stems (one or more of the leading digits) to get the desired number of groups.
3. The next digit (or digits) after the stem become(s) the leaf.
4. Typically, we truncate (leave off) the remaining digits.

13 13 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Stem and Leaf For our first example, we use the weights of the 25 female students.
150 140 155 195 139 200 157 130 113 130 121 140 140 150 125 135 124 130 150 125 120 103 170 124 160
Choosing the 1st two digits as the stem and the 3rd digit as the leaf we have the following:
10 | 3
11 | 3
12 | 154504
13 | 90050
14 | 000
15 | 05700
16 | 0
17 | 0
18 |
19 | 5
20 | 0

14 14 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Stem and Leaf Typically we sort the leaves within each stem in increasing order. We also note on the diagram the units for stems and leaves.
10 | 3
11 | 3
12 | 014455
13 | 00059
14 | 000
15 | 00057
16 | 0
17 | 0
18 |
19 | 5
20 | 0
Stem: Tens and hundreds digits Leaf: Ones digit
Annotation on the diagram: probable outliers.
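As a rough sketch, the sorted stem-and-leaf diagram for the 25 female weights can be produced with the Python below. The function name `stem_and_leaf` is an illustrative choice, and the code assumes three-digit integer data, so that the stem is the value with its ones digit removed.

```python
from collections import defaultdict

# Weights of the 25 female students, as listed on the slide.
weights = [150, 140, 155, 195, 139, 200, 157, 130, 113, 130, 121, 140, 140,
           150, 125, 135, 124, 130, 150, 125, 120, 103, 170, 124, 160]

def stem_and_leaf(values):
    """Return (stem, leaves) rows; stem = value // 10, leaf = value % 10."""
    leaves = defaultdict(list)
    for v in values:
        leaves[v // 10].append(v % 10)
    rows = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        rows.append((stem, "".join(str(d) for d in sorted(leaves[stem]))))
    return rows

rows = stem_and_leaf(weights)
for stem, leaf_str in rows:
    print(f"{stem:>2} | {leaf_str}")
```

Keeping the empty stem 18 in the output is deliberate: the gap is part of what makes the two largest weights stand out.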

15 15 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Stem-and-leaf – GPA example The following are the GPAs for the 20 advisees of a faculty member. If the ones digit is used as the stem, you only get three groups. You can expand this a little by using each stem twice, letting the 2nd digits 0-4 go with the first and the 2nd digits 5-9 go with the second. The next slide gives two versions of the stem-and-leaf diagram.
GPA: 3.09 2.04 2.27 3.94 3.70 2.69 3.72 3.23 3.13 3.50 2.26 3.15 2.80 1.75 3.89 3.38 2.74 1.65 2.22 2.66

16 16 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Stem-and-leaf – GPA example
Stem: Ones digit Leaf: Tenths and hundredths digits
1L |
1H | 65,75
2L | 04,22,26,27
2H | 66,69,74,80
3L | 09,13,15,23,38
3H | 50,70,72,89,94
Stem: Ones digit Leaf: Tenths digit
1L |
1H | 67
2L | 0222
2H | 6678
3L | 01123
3H | 57789
Note: The characters in a stem-and-leaf diagram must all have the same width, so if typing, use a fixed-width font such as Courier.

17 17 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Comparative Stem and Leaf Diagram Student Weight (Comparing two groups) When it is desirable to compare two groups, back-to-back stem and leaf diagrams are useful. Here is the result from the student weights.
Female | Stem | Male
3 | 10 |
3 | 11 | 7
554410 | 12 | 145
95000 | 13 | 0004558
000 | 14 | 000000555
75000 | 15 | 0005556
0 | 16 | 00005558
0 | 17 | 000005555
 | 18 | 0358
5 | 19 |
0 | 20 | 0
 | 21 | 0
 | 22 | 55
 | 23 | 79
From this comparative stem and leaf diagram, it is clear that the males weigh more (as a group, not necessarily as individuals) than the females.

18 18 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Comparative Stem and Leaf Diagram Student Age
Female | Stem | Male
7 | 1 |
9999 | 1 | 888889999999999999999
1111000 | 2 | 00000001111111111
3322222 | 2 | 2222223333
4 | 2 | 445
 | 2 | 6
 | 2 | 88
0 | 3 |
 | 3 |
7 | 3 |
8 | 3 |
 | 4 |
4 | 4 |
7 | 4 |
From this comparative stem and leaf diagram, it is clear that the male ages are all more closely grouped than the female ages. Also, the females had a number of outliers.

19 19 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Frequency Distributions & Histograms When working with discrete data, the frequency tables are similar to those produced for qualitative data. For example, a survey of local law firms in a medium sized town gave

20 20 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Frequency Distributions & Histograms When working with discrete data, the steps to construct a histogram are
1. Draw a horizontal scale, and mark the possible values.
2. Draw a vertical scale and mark it with either frequencies or relative frequencies (usually start at 0).
3. Above each possible value, draw a rectangle centered at the data value whose height is the frequency (or relative frequency) and whose width is chosen appropriately. Typically, if the data values are integers, the widths will be one.

21 21 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Frequency Distributions & Histograms Look for a central or typical value, extent of spread or variation, general shape, location and number of peaks, and presence of gaps and outliers.

22 22 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Frequency Distributions & Histograms The number of lawyers in the firm will have the following histogram. Clearly, the largest group is single member law firms, and the frequency decreases as the number of lawyers in the firm increases.

23 23 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Frequency Distributions & Histograms 50 students were asked the question, “How many textbooks did you purchase last term?” The result is summarized below and the histogram is on the next slide.

24 24 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Frequency Distributions & Histograms “How many textbooks did you purchase last term?” The largest group of students bought 5 or 6 textbooks with 3 or 4 being the next largest frequency.

25 25 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Frequency Distributions & Histograms Another version with the scales produced differently.

26 26 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Frequency Distributions & Histograms When working with continuous data, the steps to construct a histogram are
1. Decide how many groups or “classes” you want to break the data into, typically somewhere between 5 and 20. A good rule of thumb is to aim for an average of more than 5 observations per group.
2. Use your answer to help decide the “width” of each group.
3. Determine the “starting point” for the lowest group.

27 27 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Example of Frequency Distribution Consider the student weights in the student data set. The data values fall between 103 (lowest) and 239 (highest). The range of the dataset is 239-103=136. There are 79 data values, so to have an average of about 5 per group, we need roughly 16 or fewer groups (79/5 = 15.8). We need to choose a width that breaks the data into 16 or fewer groups. Since 136/16 = 8.5, any width of 10 or larger would be reasonable.
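The arithmetic behind this choice can be sketched as follows. The helper name `classes_needed` is illustrative, and it assumes the first class starts at or just below the smallest value; the exact number of classes in practice depends on the starting point chosen.

```python
import math

n = 79                      # number of data values
lowest, highest = 103, 239  # smallest and largest weights

data_range = highest - lowest   # 239 - 103 = 136
max_groups = n / 5              # about 5 observations per group on average

def classes_needed(width):
    """Fewest classes of the given width that can span the data range."""
    return math.ceil(data_range / width)

print(f"range = {data_range}, group limit = {max_groups}")
for width in (10, 15, 20):
    groups = classes_needed(width)
    print(f"width {width}: {groups} classes, "
          f"{n / groups:.1f} observations per class on average")
```

Each of the three widths keeps the number of classes under the limit, which is why widths of 15 and 20 are both used in the slides that follow.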

28 28 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Example of Frequency Distribution Choosing a width of 15 we have the following frequency distribution.

29 29 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Histogram for Continuous Data Mark the boundaries of the class intervals on a horizontal axis Use frequency or relative frequency on the vertical scale.

30 30 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Histogram for Continuous Data The following histogram is for the frequency table of the weight data.

31 31 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Histogram for Continuous Data The following histogram is the Minitab output of the relative frequency histogram. Notice that the relative frequency scale is in percent.

32 32 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Cumulative Relative Frequency Table If we keep track of the proportion of the data that falls below the upper boundaries of the classes, we have a cumulative relative frequency table.

33 33 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Cumulative Relative Frequency Plot If we graph the cumulative relative frequencies against the upper endpoint of the corresponding interval, we have a cumulative relative frequency plot.
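A minimal sketch of how the cumulative relative frequencies are built is shown below. It uses the 25 female weights listed earlier rather than the full 79-student table, and the starting point (100), class width (15), and the convention that each class includes its lower boundary are assumptions made for illustration.

```python
# Weights of the 25 female students, as listed in the stem-and-leaf example.
weights = [150, 140, 155, 195, 139, 200, 157, 130, 113, 130, 121, 140, 140,
           150, 125, 135, 124, 130, 150, 125, 120, 103, 170, 124, 160]

# Assumed class layout: 7 classes of width 15 starting at 100,
# each class including its lower boundary and excluding its upper one.
start, width, n_classes = 100, 15, 7

counts = [0] * n_classes
for w in weights:
    counts[(w - start) // width] += 1

# Pair each class's upper boundary with the proportion of data below it.
cumulative = []
running = 0
for i, c in enumerate(counts):
    running += c
    upper = start + (i + 1) * width
    cumulative.append((upper, running / len(weights)))

for upper, cum_rel in cumulative:
    print(f"below {upper}: {cum_rel:.2f}")
```

Plotting the pairs in `cumulative` (upper endpoint on the horizontal axis, proportion on the vertical axis) gives the cumulative relative frequency plot described above.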

34 34 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Histogram for Continuous Data Another version of a frequency table and histogram for the weight data with a class width of 20.

35 35 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Histogram for Continuous Data The resulting histogram.

36 36 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Histogram for Continuous Data The resulting cumulative relative frequency plot.

37 37 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Histogram for Continuous Data Yet, another version of a frequency table and histogram for the weight data with a class width of 20.

38 38 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Histogram for Continuous Data The corresponding histogram.

39 39 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Histogram for Continuous Data A class width of 15 or 20 seems to work well because all of the pictures tell the same story. The bulk of the weights appear to be centered around 150 lbs, with a few values substantially larger. The distribution of the weights is unimodal and is positively skewed.

40 40 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Illustrated Distribution Shapes [Figure panels: unimodal, bimodal, multimodal; negatively skewed, symmetric, positively skewed.]

41 41 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Histograms with uneven class widths Consider the following frequency histogram of ages based on A with class widths of 2. Notice it is a bit choppy. Because of the positively skewed data, sometimes frequency distributions are created with unequal class widths.

42 42 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Histograms with uneven class widths For many reasons, either for convenience or because that is the way data was obtained, the data may be broken up in groups of uneven width as in the following example referring to the student ages.

43 43 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Histograms with uneven class widths If a frequency (or relative frequency) histogram is drawn with the heights of the bars being the frequencies (relative frequencies), the result is distorted. Notice that it appears that there are a lot of people over 28 when there are only a few.

44 44 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Histograms with uneven class widths To correct the distortion, we create a density histogram. The vertical scale is called the density, and the density of a class is calculated by $\text{density} = \dfrac{\text{relative frequency of class}}{\text{class width}}$. This choice for the density makes the area of the rectangle equal to the relative frequency.
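The density calculation can be sketched as below. The ages are the 79 student ages that appear later in these slides (in the modified boxplot example), stored here as value-to-count pairs. The class boundaries are an assumption made for illustration and are not necessarily the classes used on the slides.

```python
# Student ages as value -> count, from the age listing in the slides.
age_counts = {17: 1, 18: 5, 19: 20, 20: 10, 21: 14, 22: 11, 23: 6, 24: 3,
              25: 1, 26: 1, 28: 2, 30: 1, 37: 1, 38: 1, 44: 1, 47: 1}
n = sum(age_counts.values())  # 79 students

# Assumed uneven classes [lower, upper); hypothetical boundaries.
classes = [(16, 18), (18, 20), (20, 22), (22, 24), (24, 28), (28, 48)]

rows = []
for lower, upper in classes:
    freq = sum(c for age, c in age_counts.items() if lower <= age < upper)
    rel_freq = freq / n
    density = rel_freq / (upper - lower)  # relative frequency / class width
    rows.append((lower, upper, freq, rel_freq, density))
    print(f"{lower} to <{upper}: freq={freq}, rel freq={rel_freq:.3f}, "
          f"density={density:.4f}")

# The rectangle areas (density * width) add up to 1.
total_area = sum(d * (u - l) for l, u, _, _, d in rows)
```

With these assumed classes, the widest class (28 to under 48) holds 7 of the 79 students, but its density is small because that relative frequency is spread over a width of 20.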

45 45 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Histograms with uneven class widths Continuing this example we have

46 46 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Histograms with uneven class widths The resulting histogram is now a reasonable representation of the data.

47 47 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

48 48 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Chapter 4 Numerical Methods for Describing Data

49 49 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Describing the Center of a Data Set with the arithmetic mean

50 50 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Describing the Center of a Data Set with the arithmetic mean The population mean, denoted by µ, is the average of all x values in the entire population.

51 51 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Example calculations During a two week period 10 houses were sold in Fancytown. The “average” or mean price for this sample of 10 houses in Fancytown is $295,000.

52 52 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Example calculations During a two week period 10 houses were sold in Lowtown. The “average” or mean price for this sample of 10 houses in Lowtown is $295,000. (The slide marks one sale as an outlier.)

53 53 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Reflections on the Sample calculations Looking at the dotplots of the samples for Fancytown and Lowtown, we can see that the mean, $295,000, appears to accurately represent the “center” of the data for Fancytown, but it is not representative of the Lowtown data. Clearly, the mean can be greatly affected by the presence of even a single outlier.

54 54 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Comments 1.In the previous example of the house prices in the sample of 10 houses from Lowtown, the mean was affected very strongly by the one house with the extremely high price. 2.The other 9 houses had selling prices around $100,000. 3.This illustrates that the mean can be very sensitive to a few extreme values.

55 55 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Describing the Center of a Data Set with the median The sample median is obtained by first ordering the n observations from smallest to largest (with any repeated values included, so that every sample observation appears in the ordered list). Then the sample median is the single middle value if n is odd, and the mean of the middle two values if n is even.

56 56 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Example of Median Calculation Consider the Fancytown data. First, we put the data in numerical increasing order to get 231,000 285,000 287,000 294,000 297,000 299,000 312,000 313,000 315,000 317,000 Since there are 10 (even) data values, the median is the mean of the two values in the middle: (297,000 + 299,000)/2 = 298,000.

57 57 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Example of Median Calculation Consider the Lowtown data. We put the data in numerical increasing order to get 93,000 95,000 97,000 99,000 100,000 110,000 113,000 121,000 122,000 2,000,000 Since there are 10 (even) data values, the median is the mean of the two values in the middle: (100,000 + 110,000)/2 = 105,000.
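As a quick check of these calculations, the sketch below computes the mean and median for both samples using Python's standard `statistics` module. The variable names are illustrative.

```python
from statistics import mean, median

# Selling prices from the two slides above.
fancytown = [231000, 285000, 287000, 294000, 297000,
             299000, 312000, 313000, 315000, 317000]
lowtown = [93000, 95000, 97000, 99000, 100000,
           110000, 113000, 121000, 122000, 2000000]

for name, prices in (("Fancytown", fancytown), ("Lowtown", lowtown)):
    print(f"{name}: mean = {mean(prices):,.0f}, median = {median(prices):,.0f}")
```

Both samples have the same mean, but the medians differ sharply because the single $2,000,000 sale pulls the Lowtown mean upward without moving its median.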

58 58 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Comparing the Sample Mean & Sample Median

59 59 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Comparing the Sample Mean & Sample Median

60 60 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Comparing the Sample Mean & Sample Median Typically,
1. when a distribution is skewed positively, the mean is larger than the median,
2. when a distribution is skewed negatively, the mean is smaller than the median, and
3. when a distribution is symmetric, the mean and the median are equal.
Notice from the preceding pictures that the median splits the area in the distribution in half and the mean is the point of balance.

61 61 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. The Trimmed Mean A trimmed mean is computed by first ordering the data values from smallest to largest, deleting a selected number of values from each end of the ordered list, and finally computing the mean of the remaining values. The trimming percentage is the percentage of values deleted from each end of the ordered list.
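A minimal sketch of this procedure, applied to the Lowtown prices from the earlier slides, is shown below. The function name `trimmed_mean` is illustrative, and rounding the number of deleted values down to a whole number is an assumption; conventions differ when the trimming percentage does not give a whole number of observations.

```python
def trimmed_mean(values, trim_percent):
    """Mean after deleting trim_percent of the values from each end."""
    ordered = sorted(values)
    # Number of values deleted from each end (rounded down; an assumption).
    k = int(len(ordered) * trim_percent / 100)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Lowtown selling prices from the earlier slides.
lowtown = [93000, 95000, 97000, 99000, 100000,
           110000, 113000, 121000, 122000, 2000000]

ten_percent = trimmed_mean(lowtown, 10)
print(f"10% trimmed mean: {ten_percent:,.0f}")
print(f"untrimmed mean:   {trimmed_mean(lowtown, 0):,.0f}")
```

Deleting one value from each end removes the $2,000,000 outlier, so the trimmed mean lands close to the median rather than near the ordinary mean of $295,000.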

62 62 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Example of Trimmed Mean

63 63 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Example of Trimmed Mean

64 64 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Another Example Here’s an example of what happens if you compute the mean, median, and 5% & 10% trimmed means for the Ages for the 79 students taking Data Analysis

65 65 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Categorical Data - Sample Proportion

66 66 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Categorical Data - Sample Proportion If we look at the student data sample, consider the variable gender, and treat being female as a success, then 25 of the sample of 79 students are female, so the sample proportion (of females) is p = 25/79 ≈ 0.316.

67 67 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Describing Variability The simplest numerical measure of the variability of a numerical data set is the range, which is defined to be the difference between the largest and smallest data values. range = maximum - minimum

68 68 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Describing Variability The n deviations from the sample mean are the differences: $x_1 - \bar{x},\ x_2 - \bar{x},\ \ldots,\ x_n - \bar{x}$ Note: The sum of all of the deviations from the sample mean will be equal to 0, except possibly for the effects of rounding the numbers. This means that the average deviation from the mean is always 0 and cannot be used as a measure of variability.

69 69 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Sample Variance The sample variance, denoted s², is the sum of the squared deviations from the mean divided by n-1: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$

70 70 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Sample Standard Deviation The sample standard deviation, denoted s, is the positive square root of the sample variance. The population standard deviation is denoted by σ.

71 71 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Example calculations 10 Macintosh Apples were randomly selected and weighed (in ounces).

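72 72 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Calculator Formula for s² and s A little algebra can establish that the sum of the squared deviations is $\sum (x - \bar{x})^2 = \sum x^2 - \dfrac{(\sum x)^2}{n}$. A computational formula for the sample variance is therefore given by $s^2 = \dfrac{\sum x^2 - (\sum x)^2 / n}{n - 1}$.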
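The sketch below checks, on one data set, that the computational formula for the sample variance agrees with the definition. Since the apple weights are not reproduced in this transcript, it uses the Fancytown prices from the earlier slides, expressed in thousands of dollars.

```python
from math import sqrt
from statistics import variance

# Fancytown selling prices in thousands of dollars (from the earlier slides).
prices = [231, 285, 287, 294, 297, 299, 312, 313, 315, 317]
n = len(prices)
x_bar = sum(prices) / n

# Definition: sum of squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - x_bar) ** 2 for x in prices) / (n - 1)

# Computational form: (sum of x^2 - (sum of x)^2 / n) / (n - 1).
sum_sq_dev = sum(x * x for x in prices) - sum(prices) ** 2 / n
s2_shortcut = sum_sq_dev / (n - 1)

s = sqrt(s2_shortcut)
print(f"s^2 (definition) = {s2_definition:.1f}")
print(f"s^2 (shortcut)   = {s2_shortcut:.1f}")
print(f"s                = {s:.2f}")
```

The two versions are algebraically identical; the computational form only needs the running totals of x and x², which is why it suits a calculator.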

73 73 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Calculations Revisited The values for s² and s are exactly the same as were obtained earlier.

74 74 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Quartiles and the Interquartile Range
Lower quartile (Q1) = median of the lower half of the data set.
Upper quartile (Q3) = median of the upper half of the data set.
Note: If n is odd, the median is excluded from both the lower and upper halves of the data.
The interquartile range (iqr), a resistant measure of variability, is given by iqr = upper quartile – lower quartile = Q3 – Q1

75 75 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Quartiles and IQR Example 15 students with part time jobs were randomly selected and the number of hours worked last week was recorded.
19, 12, 14, 10, 12, 10, 25, 9, 8, 4, 2, 10, 7, 11, 15
The data is put in increasing order to get
2, 4, 7, 8, 9, 10, 10, 10, 11, 12, 12, 14, 15, 19, 25

76 76 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Quartiles and IQR Example With 15 data values, the median is the 8th value. Specifically, the median is 10.
2, 4, 7, 8, 9, 10, 10, 10, 11, 12, 12, 14, 15, 19, 25
Lower half: 2, 4, 7, 8, 9, 10, 10 (lower quartile Q1 = 8)
Upper half: 11, 12, 12, 14, 15, 19, 25 (upper quartile Q3 = 14)
Iqr = 14 - 8 = 6
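The quartile definition used in these slides (median of each half, with the overall median excluded when n is odd) can be sketched as follows. The function name `quartiles` is illustrative. Note that statistical software often uses other quartile conventions, so library results may differ slightly from this method.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves definition."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # When n is odd, skip the middle value so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

# Hours worked last week by the 15 students.
hours = [19, 12, 14, 10, 12, 10, 25, 9, 8, 4, 2, 10, 7, 11, 15]

q1, med, q3 = quartiles(hours)
iqr = q3 - q1
print(f"Q1 = {q1}, median = {med}, Q3 = {q3}, iqr = {iqr}")
```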

77 77 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Boxplots Constructing a Skeletal Boxplot
1. Draw a horizontal (or vertical) scale.
2. Construct a rectangular box whose left (or lower) edge is at the lower quartile and whose right (or upper) edge is at the upper quartile (the box width = iqr). Draw a vertical (or horizontal) line segment inside the box at the location of the median.
3. Extend horizontal (or vertical) line segments from each end of the box to the smallest and largest observations in the data set. (These lines are called whiskers.)

78 78 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Skeletal Boxplot Example Using the student work hours data we have [Figure: skeletal boxplot on a scale from 0 to 25.]

79 79 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Outliers An observation is an outlier if it is more than 1.5 iqr away from the closest end of the box (less than the lower quartile minus 1.5 iqr or more than the upper quartile plus 1.5 iqr). An outlier is extreme if it is more than 3 iqr from the closest end of the box, and it is mild otherwise.

80 80 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Modified Boxplots A modified boxplot represents mild outliers by shaded circles and extreme outliers by open circles. Whiskers extend on each end to the most extreme observations that are not outliers.

81 81 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Modified Boxplot Example Using the student work hours data we have
Lower quartile – 1.5 iqr = 8 – 1.5(6) = –1
Upper quartile + 1.5 iqr = 14 + 1.5(6) = 23
Upper quartile + 3 iqr = 14 + 3(6) = 32
Smallest data value that isn’t an outlier: 2. Largest data value that isn’t an outlier: 19. Mild outlier: 25.
[Figure: modified boxplot on a scale from 0 to 25.]
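The outlier rule can be sketched for the work hours data as follows. The variable names (`inner`, `outer`, and so on) are illustrative, and the quartiles are computed with the median-of-halves definition from the earlier slides.

```python
from statistics import median

# Hours worked last week by the 15 students.
hours = [19, 12, 14, 10, 12, 10, 25, 9, 8, 4, 2, 10, 7, 11, 15]

ordered = sorted(hours)
half = len(ordered) // 2
q1 = median(ordered[:half])
# Exclude the overall median from the upper half when n is odd.
q3 = median(ordered[half + 1:] if len(ordered) % 2 else ordered[half:])
iqr = q3 - q1

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # beyond these: outlier
outer = (q1 - 3 * iqr, q3 + 3 * iqr)      # beyond these: extreme outlier

extreme = [x for x in ordered if x < outer[0] or x > outer[1]]
mild = [x for x in ordered
        if (x < inner[0] or x > inner[1]) and x not in extreme]

# Whiskers stop at the most extreme observations that are not outliers.
non_outliers = [x for x in ordered if inner[0] <= x <= inner[1]]
whiskers = (non_outliers[0], non_outliers[-1])

print(f"outlier cutoffs: {inner}, extreme cutoffs: {outer}")
print(f"mild: {mild}, extreme: {extreme}, whiskers: {whiskers}")
```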

82 82 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Modified Boxplot Example Consider the ages of the 79 students from the classroom data set from the slideshow Chapter 3.
17 18 18 18 18 18 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 20 20 20 20 20 20 20 20 20 20 21 21 21 21 21 21 21 21 21 21 21 21 21 21 22 22 22 22 22 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 25 26 28 28 30 37 38 44 47
Median = 21, lower quartile = 19, upper quartile = 22, iqr = 22 – 19 = 3
Lower quartile – 1.5 iqr = 14.5; lower quartile – 3 iqr = 10
Upper quartile + 1.5 iqr = 26.5; upper quartile + 3 iqr = 31
Mild outliers: 28, 28, 30. Extreme outliers: 37, 38, 44, 47.

83 83 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Modified Boxplot Example Here is the modified boxplot for the student age data. [Figure: horizontal modified boxplot on a scale from 15 to 50, annotated with the smallest and largest data values that aren’t outliers, the mild outliers, and the extreme outliers.]

84 84 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Modified Boxplot Example Here is the same boxplot reproduced with a vertical orientation. [Figure: vertical modified boxplot on a scale from 15 to 50.]

85 85 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Comparative Boxplot Example By putting boxplots of two separate groups or subgroups side by side, we can compare their distributional behaviors. Notice that the distributional patterns of female and male student weights have similar shapes, although the females are roughly 20 lbs lighter (as a group). [Figure: boxplots of student weight by gender (females, males) on a scale from 100 to 240.]

86 86 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Comparative Boxplot Example

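87 87 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Interpreting Variability Chebyshev’s Rule For any number k ≥ 1, at least $100\left(1 - \dfrac{1}{k^2}\right)\%$ of the observations are within k standard deviations of the mean.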

88 88 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Interpreting Variability Chebyshev’s Rule For specific values of k Chebyshev’s Rule reads
• At least 75% of the observations are within 2 standard deviations of the mean.
• At least 89% of the observations are within 3 standard deviations of the mean.
• At least 90% of the observations are within 3.16 standard deviations of the mean.
• At least 94% of the observations are within 4 standard deviations of the mean.
• At least 96% of the observations are within 5 standard deviations of the mean.
• At least 99% of the observations are within 10 standard deviations of the mean.

89 89 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Example - Chebyshev’s Rule Consider the student age data
17 18 18 18 18 18 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 20 20 20 20 20 20 20 20 20 20 21 21 21 21 21 21 21 21 21 21 21 21 21 21 22 22 22 22 22 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 25 26 28 28 30 37 38 44 47
[On the slide, the ages are color coded by whether they fall within 1, 2, 3, 4, or 5 standard deviations of the mean.]

90 90 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Example - Chebyshev’s Rule Summarizing the student age data
Interval | Chebyshev’s | Actual
within 1 standard deviation of the mean | ≥ 0% | 72/79 = 91.1%
within 2 standard deviations of the mean | ≥ 75% | 75/79 = 94.9%
within 3 standard deviations of the mean | ≥ 88.8% | 76/79 = 96.2%
within 4 standard deviations of the mean | ≥ 93.8% | 77/79 = 97.5%
within 5 standard deviations of the mean | ≥ 96.0% | 79/79 = 100%
Notice that Chebyshev gives very conservative lower bounds and the values aren’t very close to the actual percentages.
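The comparison in this table can be reproduced with the sketch below, which uses the 79 student ages from the modified boxplot example (stored as value-to-count pairs) and the sample standard deviation from Python's `statistics` module. The Chebyshev bound for each k is computed as 1 − 1/k².

```python
from statistics import mean, stdev

# Student ages as value -> count, from the age listing in the slides.
age_counts = {17: 1, 18: 5, 19: 20, 20: 10, 21: 14, 22: 11, 23: 6, 24: 3,
              25: 1, 26: 1, 28: 2, 30: 1, 37: 1, 38: 1, 44: 1, 47: 1}
ages = [age for age, count in age_counts.items() for _ in range(count)]

x_bar = mean(ages)
s = stdev(ages)  # sample standard deviation (divides by n - 1)

results = []
for k in range(1, 6):
    bound = 1 - 1 / k ** 2  # Chebyshev lower bound on the proportion
    within = sum(1 for a in ages if abs(a - x_bar) <= k * s)
    results.append((k, bound, within))
    print(f"within {k} sd: Chebyshev >= {bound:.1%}, "
          f"actual {within}/{len(ages)} = {within / len(ages):.1%}")
```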

91 91 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Empirical Rule If the histogram of values in a data set is reasonably symmetric and unimodal (specifically, is reasonably approximated by a normal curve), then
1. Approximately 68% of the observations are within 1 standard deviation of the mean.
2. Approximately 95% of the observations are within 2 standard deviations of the mean.
3. Approximately 99.7% of the observations are within 3 standard deviations of the mean.

92 92 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Z Scores The z score is how many standard deviations the observation is from the mean. A positive z score indicates the observation is above the mean and a negative z score indicates the observation is below the mean.

93 93 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Computing the z score is often referred to as standardization and the z score is called a standardized score. Z Scores

94 94 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Example A sample of GPAs of 38 statistics students appear below (sorted in increasing order) 2.00 2.25 2.36 2.37 2.50 2.50 2.60 2.67 2.70 2.70 2.75 2.78 2.80 2.80 2.82 2.90 2.90 3.00 3.02 3.07 3.15 3.20 3.20 3.20 3.23 3.29 3.30 3.30 3.42 3.46 3.48 3.50 3.50 3.58 3.75 3.80 3.83 3.97

95 95 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Example The following stem and leaf indicates that the GPA data is reasonably symmetric and unimodal.
2 | 0
2 | 233
2 | 55
2 | 667777
2 | 88899
3 | 0001
3 | 2222233
3 | 444555
3 | 7
3 | 889
Stem: Units digit Leaf: Tenths digit

96 96 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Example

97 97 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Example
Interval | Empirical Rule | Actual
within 1 standard deviation of the mean | ≈ 68% | 27/38 = 71%
within 2 standard deviations of the mean | ≈ 95% | 37/38 = 97%
within 3 standard deviations of the mean | ≈ 99.7% | 38/38 = 100%
Notice that the empirical rule gives reasonably good estimates for this example.
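The counts in this table can be reproduced by standardizing each GPA. The sketch below computes z = (observation − mean) / standard deviation for the 38 GPAs listed earlier and counts how many z scores fall within 1, 2, and 3 units of zero. The variable names are illustrative.

```python
from statistics import mean, stdev

# The 38 GPAs from the example, in increasing order.
gpas = [2.00, 2.25, 2.36, 2.37, 2.50, 2.50, 2.60, 2.67, 2.70, 2.70,
        2.75, 2.78, 2.80, 2.80, 2.82, 2.90, 2.90, 3.00, 3.02, 3.07,
        3.15, 3.20, 3.20, 3.20, 3.23, 3.29, 3.30, 3.30, 3.42, 3.46,
        3.48, 3.50, 3.50, 3.58, 3.75, 3.80, 3.83, 3.97]

x_bar = mean(gpas)
s = stdev(gpas)  # sample standard deviation

# z score: how many standard deviations each observation is from the mean.
z_scores = [(x - x_bar) / s for x in gpas]

within = {k: sum(1 for z in z_scores if abs(z) <= k) for k in (1, 2, 3)}

print(f"mean = {x_bar:.3f}, s = {s:.3f}")
for k, count in within.items():
    print(f"within {k} sd: {count}/{len(gpas)} = {count / len(gpas):.0%}")
```

Only the lowest GPA (2.00) has a z score below −2, which is why 37 of the 38 observations fall within 2 standard deviations of the mean.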

98 98 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Comparison of Chebyshev’s Rule and the Empirical Rule The following refers to the weights in the sample of 79 students. Notice that the stem and leaf diagram suggests the data distribution is unimodal but is positively skewed because of the outliers on the high side. Nevertheless, the results for the Empirical Rule are good.
10 | 3
11 | 37
12 | 011444555
13 | 000000455589
14 | 000000000555
15 | 000000555567
16 | 000005558
17 | 0000005555
18 | 0358
19 | 5
20 | 00
21 | 0
22 | 55
23 | 79
Stem: Hundreds & tens digits Leaf: Units digit

99 99 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Comparison of Chebyshev’s Rule and the Empirical Rule
Interval | Chebyshev’s Rule | Empirical Rule | Actual
within 1 standard deviation of the mean | ≥ 0% | ≈ 68% | 56/79 = 70.9%
within 2 standard deviations of the mean | ≥ 75% | ≈ 95% | 75/79 = 94.9%
within 3 standard deviations of the mean | ≥ 88.8% | ≈ 99.7% | 79/79 = 100%
Notice that even with moderate positive skewing of the data, the Empirical Rule gave a much more usable and meaningful result.

100 100 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Numerical Methods for Describing Data Describing Center and Variability in a Data Set! (Wow, that sounds sooooo exciting!)

101 101 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Means and Medians Mean: the arithmetic average. Median: the middle value in the list of data ordered from smallest to largest.

102 102 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Mean

103 103 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Median
Sample Median
• Middle value if n is odd.
• Average of middle two values if n is even.
Population Median
• The symbols for sample and population median are obscure and rarely used. We generally just refer to them by name.
You can use your calculator to find means and medians!

104 104 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Sensitivity Outliers: “Which of you can resist us?” Median: “He can’t. He’s too sensitive!” Mean: “That’s rather extreme!”

105 105 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Which is Better Mean or Median? It depends on the context of the data. Many times reported “averages” are really medians. The term “average” is sometimes used to refer to the value most representative of what is typical instead of the mean.

106 106 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. A Compromise! The Trimmed Mean

107 107 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Proportions Sample proportion: p Population proportion: π How do you find it? What does “Success” mean?

108 108 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variance and Standard Deviation These describe variability in a data set. Range describes variability but only in very rough terms. Variance and standard deviation are based on the typical amount of deviation from the average (mean).

109 109 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Symbols and Formulas The ultimate in cool! Population variance: σ² Population standard deviation: σ Sample variance: s² Sample standard deviation: s Variance… For standard deviation take the square root of the variance. Why n-1?

110 110 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Other measures of variability: Box and Whisker displays

111 111 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. The five number summary: Minimum, Q1, Median, Q3, Maximum

112 112 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Quartiles and IQR IQR: Interquartile Range A measure of the spread of the “middle 50%” of the data. Compare the IQR to the total range to get an idea of the spread of the data.

113 113 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Outlier rule You can use the box and whisker plot to determine whether or not an extreme value is an outlier. Outliers can be regarded as values more than 1.5 IQRs above Q3 or more than 1.5 IQRs below Q1. Extreme outliers are more than 3 IQRs above Q3 or more than 3 IQRs below Q1. Sometimes we use a modified box plot to indicate the presence of such outliers.

114 114 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. What can you tell from this? Mean Median Standard deviation If these were all you knew about a data set how much would they tell you? If you knew the 5 number summary what would that tell you? In what sort of situations might one set of summary values convey more information than the other?

115 115 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. A word to the wise! (Okay – “words”) Measures of center don’t tell all. Data distributions with different shapes can have the same mean and standard deviation. Mean and standard deviation are both sensitive to extreme values in a data set, especially if the sample size is small. Measures of center and variability describe the values of the variable studied not the heights of the bars. Boxplots based on small sample sizes can be misleading and can misrepresent the shape, center, and variability of the data set the sample came from. Not all distributions are normal or even approximately normal. Watch out for outliers.

