
1 Descriptive Statistics (Part 1)
Chapter 4
Numerical Description, Central Tendency, Dispersion
McGraw-Hill/Irwin. Copyright © 2009 by The McGraw-Hill Companies, Inc. All rights reserved.

2 Numerical Description
Statistics are descriptive measures derived from a sample (n items).
Parameters are descriptive measures derived from a population (N items).

3 Numerical Description
Three key characteristics of numerical data (Characteristic: Interpretation):
Central Tendency: Where are the data values concentrated? What seem to be typical or middle data values?
Dispersion: How much variation is there in the data? How spread out are the data values? Are there unusual values?
Shape: Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?

4 Numerical Description. Example: Vehicle Quality
Consider the data set of vehicle defect rates from J. D. Power and Associates.
Numerical statistics can be used to summarize this random sample of brands.
Defect rate = (total no. defects / no. inspected) × 100
Must allow for sampling error since the analysis is based on sampling.

5 Numerical Description
Number of defects per 100 vehicles, 2006 models.

6 Numerical Description
To begin, sort the data in Excel.

7 Numerical Description
Sorted data provides insight into central tendency and dispersion.

8 Numerical Description. Visual Displays
The dot plot offers a visual impression of the data.

9 Numerical Description. Visual Displays
Histograms with 5 bins (suggested by Sturges' Rule) and 10 bins are shown below.
Both are symmetric with no extreme values and show a modal class toward the low end.

10 Central Tendency
The central tendency is the middle or typical values of a distribution.
Central tendency can be assessed using a dot plot, a histogram, or, more precisely, numerical statistics.

11 Central Tendency. Six Measures of Central Tendency
Mean. Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Excel: =AVERAGE(Data). Pro: Familiar and uses all the sample information. Con: Influenced by extreme values.
Median. Formula: Middle value in sorted array. Excel: =MEDIAN(Data). Pro: Robust when extreme data values exist. Con: Ignores extremes and can be affected by gaps in data values.

12 Central Tendency. Six Measures of Central Tendency
Mode. Formula: Most frequently occurring data value. Excel: =MODE(Data). Pro: Useful for attribute data or discrete data with a small range. Con: May not be unique, and is not helpful for continuous data.
Midrange. Formula: $\frac{x_{\min} + x_{\max}}{2}$. Excel: =0.5*(MIN(Data)+MAX(Data)). Pro: Easy to understand and calculate. Con: Influenced by extreme values and ignores most data values.

13 Central Tendency. Six Measures of Central Tendency
Geometric mean (G). Formula: $G = \sqrt[n]{x_1 x_2 \cdots x_n}$. Excel: =GEOMEAN(Data). Pro: Useful for growth rates and mitigates high extremes. Con: Less familiar and requires positive data.
Trimmed mean. Formula: Same as the mean except omit highest and lowest k% of data values (e.g., 5%). Excel: =TRIMMEAN(Data, Percent). Pro: Mitigates effects of extreme values. Con: Excludes some data values that could be relevant.

14 Central Tendency. Mean
A familiar measure of central tendency.
Population formula: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
In Excel, use function =AVERAGE(Data) where Data is an array of data values.

15 Central Tendency. Mean
For the sample of n = 37 car brands:

16 Central Tendency. Characteristics of the Mean
Arithmetic mean is the most familiar average.
Affected by every sample item.
The balancing point or fulcrum for the data.

17 Central Tendency. Characteristics of the Mean
Regardless of the shape of the distribution, the signed deviations of the data points from the mean always sum to zero. (The absolute distances do not, unless every value equals the mean.)
Consider the following asymmetric distribution of quiz scores whose mean = 65.
$\sum_{i=1}^{n} (x_i - \bar{x})$ = (42 – 65) + (60 – 65) + (70 – 65) + (75 – 65) + (78 – 65) = (–23) + (–5) + (5) + (10) + (13) = –28 + 28 = 0
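
The zero-sum property on slide 17 can be checked directly. The snippet below is a minimal sketch that reuses the five quiz scores from the slide; the variable names are illustrative and not part of the original deck.

```python
# Quiz scores from slide 17 (mean = 65).
scores = [42, 60, 70, 75, 78]

mean = sum(scores) / len(scores)

# Signed deviations keep their sign, so negatives and positives cancel.
deviations = [x - mean for x in scores]
signed_total = sum(deviations)

# Absolute distances discard the sign, so they do not cancel.
absolute_total = sum(abs(d) for d in deviations)

print(mean)            # 65.0
print(deviations)      # [-23.0, -5.0, 5.0, 10.0, 13.0]
print(signed_total)    # 0.0
print(absolute_total)  # 56.0
```

The nonzero absolute total is what the mean absolute deviation later in the deck is built from.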

18 Central Tendency. Median
The median (M) is the 50th percentile or midpoint of the sorted sample data.
M separates the upper and lower half of the sorted observations.
If n is odd, the median is the middle observation in the data array.
If n is even, the median is the average of the middle two observations in the data array.

19 Central Tendency. Median
Consider the following n = 6 data values: 11 12 15 17 21 32
What is the median?
For even n, the median lies between the observations at positions n/2 = 6/2 = 3 and n/2 + 1 = 6/2 + 1 = 4.
M = (x_3 + x_4)/2 = (15 + 17)/2 = 16

20 Central Tendency. Median (Figure 4.6)
For n = 8, the median is between the fourth and fifth observations in the data array.

21 Central Tendency. Median
For n = 9, the median is the fifth observation in the data array.

22 Central Tendency. Median
Consider the following n = 7 data values: 12 23 23 25 27 34 41
What is the median?
For odd n, the median is the observation at position (n+1)/2 = (7+1)/2 = 8/2 = 4.
M = x_4 = 25
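
The position rules on slides 18 to 22 can be sketched in a few lines of Python. The helper names `median_positions` and `median` are hypothetical, chosen here for illustration; the two data sets are the ones shown on slides 19 and 22.

```python
def median_positions(n):
    """Return the 1-based position(s) in the sorted array that determine the median."""
    if n % 2 == 1:
        return [(n + 1) // 2]        # odd n: the single middle observation
    return [n // 2, n // 2 + 1]      # even n: the two middle observations


def median(values):
    """Average the observation(s) found at the median position(s)."""
    data = sorted(values)
    positions = median_positions(len(data))
    return sum(data[p - 1] for p in positions) / len(positions)


even_sample = [11, 12, 15, 17, 21, 32]      # slide 19, n = 6
odd_sample = [12, 23, 23, 25, 27, 34, 41]   # slide 22, n = 7

print(median_positions(6), median(even_sample))  # [3, 4] 16.0
print(median_positions(7), median(odd_sample))   # [4] 25.0
```

Python's standard library offers `statistics.median`, which should give the same values; the explicit version is shown only to mirror the position arithmetic on the slides.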

23 Central Tendency. Median
Use Excel's function =MEDIAN(Data) where Data is an array of data values.
For the 37 vehicle quality ratings (odd n) the position of the median is (n+1)/2 = (37+1)/2 = 19.
So, the median is x_19 = 121.
When there are several duplicate data values, the median does not provide a clean "50-50" split in the data.

24 Central Tendency. Characteristics of the Median
The median is insensitive to extreme data values.
For example, consider the following quiz scores for 3 students:
Tom's scores: 20, 40, 70, 75, 80. Mean = 57, Median = 70, Total = 285
Jake's scores: 60, 65, 70, 90, 95. Mean = 76, Median = 70, Total = 380
Mary's scores: 50, 65, 70, 75, 90. Mean = 70, Median = 70, Total = 350
What does the median for each student tell you?
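
As a quick check of slide 24, the sketch below recomputes the mean, median, and total for each student with Python's standard `statistics` module. The dictionary layout is an illustrative choice, not something taken from the deck.

```python
import statistics

# Quiz scores from slide 24.
quiz_scores = {
    "Tom": [20, 40, 70, 75, 80],
    "Jake": [60, 65, 70, 90, 95],
    "Mary": [50, 65, 70, 75, 90],
}

# (mean, median, total) for each student.
summary = {
    name: (statistics.mean(scores), statistics.median(scores), sum(scores))
    for name, scores in quiz_scores.items()
}

for name, (mean, median, total) in summary.items():
    print(f"{name}: mean={mean}, median={median}, total={total}")
```

All three students share a median of 70 even though their means range from 57 to 76, which is the insensitivity to extreme values that the slide describes.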

25 Central Tendency. Mode
The most frequently occurring data value.
Similar to mean and median if data values occur often near the center of sorted data.
May have multiple modes or no mode.

26 Central Tendency. Mode
For example, consider the following quiz scores for 4 students:
Lee's scores: 60, 70, 70, 70, 80. Mean = 70, Median = 70, Mode = 70
Pat's scores: 45, 45, 70, 90, 100. Mean = 70, Median = 70, Mode = 45
Sam's scores: 50, 60, 70, 80, 90. Mean = 70, Median = 70, Mode = none
Xiao's scores: 50, 50, 70, 90, 90. Mean = 70, Median = 70, Modes = 50, 90
What does the mode for each student tell you?
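
The four cases on slide 26 (one mode, a mode far from the center, no mode, two modes) can be reproduced with a small counting helper. This is a sketch under one stated assumption: following the slide, a sample in which every value occurs exactly once is treated as having no mode. The function name `modes` is hypothetical.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or [] when no value repeats.

    Assumes a non-empty sample. Treating "all values unique" as "no mode"
    follows the convention used on the slide.
    """
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


print(modes([60, 70, 70, 70, 80]))    # Lee:  [70]
print(modes([45, 45, 70, 90, 100]))   # Pat:  [45]
print(modes([50, 60, 70, 80, 90]))    # Sam:  []
print(modes([50, 50, 70, 90, 90]))    # Xiao: [50, 90]
```

Pat's result shows why the mode can be a poor "typical" value: 45 is the lowest score in the sample, yet it is the only one that repeats.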

27 Central Tendency. Mode
Easy to define, not easy to calculate in large samples.
Use Excel's function =MODE(Array)
- will return #N/A if there is no mode.
- will return first mode found if multimodal.
May be far from the middle of the distribution and not at all typical.

28 Central Tendency. Mode
Generally isn't useful for continuous data since data values rarely repeat.
Best for attribute data or a discrete variable with a small range (e.g., Likert scale).

29 Central Tendency. Example: Price/Earnings Ratios and Mode
Consider the following P/E ratios for a random sample of 68 Standard & Poor's 500 stocks.
[Data table of the 68 sorted P/E ratios, ranging from 7 to 91.]
What is the mode?

30 Central Tendency. Example: Price/Earnings Ratios and Mode
Excel's descriptive statistics results are:
Mean: 22.7206
Median: 19
Mode: 13
Range: 84
Minimum: 7
Maximum: 91
Sum: 1545
Count: 68
The mode 13 occurs 7 times, but what does the dot plot show?

31 Central Tendency. Example: Price/Earnings Ratios and Mode
The dot plot shows local modes (a peak with valleys on either side) at 10, 13, 15, 19, 23, 26, 29.
These multiple modes suggest that the mode is not a stable measure of central tendency.

32 Central Tendency. Example: Rose Bowl Winners' Points
Points scored by the winning NCAA football team tend to have modes in multiples of 7 because each touchdown usually yields 7 points.
Consider the dot plot of the points scored by the winning team in the first 87 Rose Bowl games.
What is the mode?

33 Central Tendency. Mode
A bimodal distribution refers to the shape of the histogram rather than the mode of the raw data.
Occurs when dissimilar populations are combined in one sample. For example,

34 Central Tendency. Skewness
Compare mean and median or look at histogram to determine degree of skewness.

35 Central Tendency. Symptoms of Skewness
(Distribution's Shape: Histogram Appearance; Statistics)
Skewed left (negative skewness): Long tail of histogram points left (a few low values but most data on right); Mean < Median
Symmetric: Tails of histogram are balanced (low/high values offset); Mean ≈ Median
Skewed right (positive skewness): Long tail of histogram points right (most data on left but a few high values); Mean > Median
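
The mean-versus-median rule of thumb from slide 35 can be written as a small helper. This is only a sketch: the function name `skew_direction` and the `tolerance` parameter (how close the two values must be to count as "approximately equal") are assumptions introduced here, since the slides do not say how close is close enough.

```python
def skew_direction(mean, median, tolerance=0.0):
    """Classify skewness by comparing the mean with the median.

    `tolerance` is a hypothetical cutoff for treating the two as equal.
    """
    if abs(mean - median) <= tolerance:
        return "approximately symmetric"
    return "skewed right" if mean > median else "skewed left"


# P/E ratio summary from slide 30: mean 22.7206, median 19.
print(skew_direction(22.7206, 19))          # skewed right
print(skew_direction(57, 70))               # skewed left (Tom's quiz scores)
print(skew_direction(70, 70))               # approximately symmetric
```

The comparison is a symptom rather than a proof of skewness, so a histogram is still worth checking, as slide 34 suggests.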

36 Central Tendency. Skewness
For the sample of spending per customer at 74 Noodles & Company, the mean ($7.04) exceeds the median ($7.00). What does this suggest?

37 Central Tendency. Geometric Mean
The geometric mean (G) is a multiplicative average: $G = \sqrt[n]{x_1 x_2 \cdots x_n}$
For the J. D. Power quality data (n = 37):
In Excel use =GEOMEAN(Array)
The geometric mean tends to mitigate the effects of high outliers.
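
A minimal sketch of the geometric mean follows. The J. D. Power values are not available in this transcript, so the sample below is made-up illustrative data, and the helper name `geometric_mean` is an assumption. The computation averages logarithms, which is equivalent to taking the nth root of the product and avoids overflow for long series.

```python
import math
import statistics


def geometric_mean(values):
    """nth root of the product of n positive values, computed via logs."""
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean requires positive data")
    return math.exp(sum(math.log(x) for x in values) / len(values))


# Hypothetical sample with one high outlier (not from the slides).
sample = [10, 12, 14, 100]

g = geometric_mean(sample)
arithmetic = statistics.mean(sample)

print(round(g, 2), arithmetic)
```

For this sample the geometric mean falls well below the arithmetic mean of 34, which illustrates the slide's point that G mitigates the effect of high outliers. Python 3.8 and later also provide `statistics.geometric_mean`.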

38 Central Tendency. Growth Rates
A variation on the geometric mean used to find the average growth rate for a time series.
For example, from 2002 to 2006, JetBlue Airlines revenues are:
Year: Revenue (mil)
2002: 635
2003: 998
2004: 1265
2005: 1701
2006: 2363

39 Central Tendency. Growth Rates
The average growth rate is given by taking the geometric mean of the ratios of each year's revenue to the preceding year.
Due to cancellations, only the first and last years are relevant:
$GR = \sqrt[n-1]{x_n / x_1} - 1 = \sqrt[4]{2363/635} - 1$ = 1.389 – 1 = 0.389, or 38.9% per year
In Excel use =(2363/635)^(1/4)-1
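
The cancellation mentioned on slide 39 can be seen by computing the growth rate both ways. The sketch below uses the JetBlue revenues from slide 38; the variable names are illustrative.

```python
import math

# JetBlue revenue in millions, from slide 38.
revenue = {2002: 635, 2003: 998, 2004: 1265, 2005: 1701, 2006: 2363}
values = [revenue[year] for year in sorted(revenue)]

# Long way: geometric mean of the four year-over-year ratios.
ratios = [later / earlier for earlier, later in zip(values, values[1:])]
growth_from_ratios = math.prod(ratios) ** (1 / len(ratios)) - 1

# Short way: intermediate years cancel, leaving only the endpoints.
growth_from_endpoints = (values[-1] / values[0]) ** (1 / (len(values) - 1)) - 1

print(round(growth_from_ratios, 3))     # 0.389
print(round(growth_from_endpoints, 3))  # 0.389
```

Five yearly figures give four growth periods, which is why the exponent is 1/4 rather than 1/5.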

40 Central Tendency. Midrange
The midrange is the point halfway between the lowest and highest values of X.
Easy to use but sensitive to extreme data values.
Midrange = $\frac{x_{\min} + x_{\max}}{2}$
For the J. D. Power quality data (n = 37), the midrange (147.5) is higher than the mean (134.51) or median (132).

41 Central Tendency. Trimmed Mean
To calculate the trimmed mean, first remove the highest and lowest k percent of the observations.
For example, for the n = 68 P/E ratios, we want a 5 percent trimmed mean (i.e., k = 0.05).
To determine how many observations to trim, multiply k × n = 0.05 × 68 = 3.4 and round down to 3 observations.
So, we would remove the three smallest and three largest observations before averaging the remaining values.
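
The trimming procedure on slide 41 can be sketched as follows. The 68 P/E ratios are not recoverable from this transcript, so the example runs on made-up data; the helper names `trim_count` and `trimmed_mean` are assumptions.

```python
import math


def trim_count(n, k):
    """Number of observations to drop from EACH end: k * n, rounded down."""
    return math.floor(k * n)


def trimmed_mean(values, k):
    """Mean after removing the lowest and highest k fraction of the sorted data."""
    data = sorted(values)
    drop = trim_count(len(data), k)
    kept = data[drop:len(data) - drop]
    return sum(kept) / len(kept)


# Slide 41: n = 68 and k = 0.05 give 3.4, so 3 values are trimmed per end.
print(trim_count(68, 0.05))  # 3

# Hypothetical sample of 20 values with one extreme observation.
sample = list(range(1, 20)) + [1000]
print(sum(sample) / len(sample))   # 59.5
print(trimmed_mean(sample, 0.05))  # 10.5
```

Here `k` is the fraction removed from each tail. Excel's =TRIMMEAN takes the total fraction removed from both tails combined, so a 5 percent trim per end corresponds to a Percent argument of 0.1 there.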

42 Central Tendency. Trimmed Mean
Here is a summary of all the measures of central tendency for the n = 68 P/E values.
Mean: 22.72 =AVERAGE(PERatio)
Median: 19.00 =MEDIAN(PERatio)
Mode: 13.00 =MODE(PERatio)
Geometric Mean: 19.85 =GEOMEAN(PERatio)
Midrange: 49.00 =(MIN(PERatio)+MAX(PERatio))/2
5% Trim Mean: 21.10 =TRIMMEAN(PERatio,0.1) (Excel's second argument is the total fraction trimmed, so 0.1 removes 5% from each end)
The trimmed mean mitigates the effects of very high values, but still exceeds the median.

43 Central Tendency. Trimmed Mean
The Federal Reserve uses a 16% trimmed mean to mitigate the effects of extremes in its analysis of the Consumer Price Index.

44 Dispersion. Measures of Variation
Variation is the "spread" of data points about the center of the distribution in a sample. Consider the following measures of dispersion:
Range. Formula: $x_{\max} - x_{\min}$. Excel: =MAX(Data)-MIN(Data). Pro: Easy to calculate. Con: Sensitive to extreme data values.
Variance ($s^2$). Formula: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$. Excel: =VAR(Data). Pro: Plays a key role in mathematical statistics. Con: Non-intuitive meaning.

45 Dispersion. Measures of Variation
Standard deviation (s). Formula: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$. Excel: =STDEV(Data). Pro: Most common measure. Uses same units as the raw data ($, £, ¥, etc.). Con: Non-intuitive meaning.
Coefficient of variation (CV). Formula: $CV = 100 \times \frac{s}{\bar{x}}$. Excel: None. Pro: Measures relative variation in percent so can compare data sets. Con: Requires non-negative data.

46 Dispersion. Measures of Variation
Mean absolute deviation (MAD). Formula: $MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$. Excel: =AVEDEV(Data). Pro: Easy to understand. Con: Lacks "nice" theoretical properties.

47 Dispersion. Range
The difference between the largest and smallest observation.
Range = x_max – x_min
For example, for the n = 68 P/E ratios,
Range = 91 – 7 = 84

48 Dispersion. Variance
The population variance ($\sigma^2$) is defined as the sum of squared deviations around the mean $\mu$ divided by the population size: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
For the sample variance ($s^2$), we divide by n – 1 instead of n, otherwise $s^2$ would tend to underestimate the unknown population variance $\sigma^2$: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
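
The difference between dividing by N and by n – 1 on slide 48 is easy to see numerically. The sketch below reuses the five quiz scores from slide 17 purely as convenient data; the function names are illustrative, and the results can be compared against `statistics.pvariance` and `statistics.variance` from the standard library.

```python
import statistics


def population_variance(values):
    """Sum of squared deviations around the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations around the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


scores = [42, 60, 70, 75, 78]  # quiz scores from slide 17

print(population_variance(scores))  # 169.6
print(sample_variance(scores))      # 212.0
print(statistics.pvariance(scores), statistics.variance(scores))
```

The sum of squared deviations is 848 in both cases; only the divisor changes (5 versus 4), so the sample variance is always the larger of the two.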

49 Dispersion. Standard Deviation
The square root of the variance.
Units of measure are the same as X.
Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
Explains how individual values in a data set vary from the mean.

50 Dispersion. Standard Deviation
Excel's built-in functions are:
Variance. Excel population formula: =VARP(Array). Excel sample formula: =VAR(Array)
Standard deviation. Excel population formula: =STDEVP(Array). Excel sample formula: =STDEV(Array)

51 Dispersion. Calculating a Standard Deviation
Consider the following five quiz scores for Stephanie. (Table 4.12)

52 Dispersion. Calculating a Standard Deviation
Now, calculate the sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
Somewhat easier, the two-sum formula can also be used: $s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}}$
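
The definitional and two-sum routes to the sample standard deviation should agree. Stephanie's scores from Table 4.12 are not included in this transcript, so the sketch below substitutes the five quiz scores from slide 17; that substitution, and the function names, are assumptions made for illustration.

```python
import math
import statistics


def stdev_definitional(values):
    """Square root of the sum of squared deviations over n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


def stdev_two_sum(values):
    """Same quantity from only two running sums: sum(x) and sum(x^2)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))


scores = [42, 60, 70, 75, 78]  # stand-in data, not Stephanie's scores

print(round(stdev_definitional(scores), 2))  # 14.56
print(round(stdev_two_sum(scores), 2))       # 14.56
print(round(statistics.stdev(scores), 2))    # 14.56
```

The two-sum form saves a pass over the data when working by hand, but subtracting two large sums can lose precision in floating point when the mean is large relative to the spread, so library routines generally prefer the definitional approach.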

53 Dispersion. Calculating a Standard Deviation
The standard deviation is nonnegative because deviations around the mean are squared.
When every observation is exactly equal to the mean, the standard deviation is zero.
Standard deviations can be large or small, depending on the units of measure.
Compare standard deviations only for data sets measured in the same units and only if the means do not differ substantially.

54 Dispersion. Coefficient of Variation
Useful for comparing variables measured in different units or with different means.
A unit-free measure of dispersion.
Expressed as a percent of the mean: $CV = 100 \times \frac{s}{\bar{x}}$
Only appropriate for nonnegative data. It is undefined if the mean is zero or negative.

55 Dispersion. Coefficient of Variation
For example:
Defect rates (n = 37): s = 22.89, $\bar{x}$ = 125.38 gives CV = 100 × (22.89)/(125.38) = 18%
ATM deposits (n = 100): s = 280.80, $\bar{x}$ = 233.89 gives CV = 100 × (280.80)/(233.89) = 120%
P/E ratios (n = 68): s = 14.08, $\bar{x}$ = 22.72 gives CV = 100 × (14.08)/(22.72) = 62%
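
The three CV calculations on slide 55 can be reproduced with a one-line formula. The sketch below uses the standard deviations and means that appear in the slide's arithmetic; the function name `coefficient_of_variation` is a hypothetical helper.

```python
def coefficient_of_variation(s, mean):
    """Standard deviation as a percent of the mean; needs a positive mean."""
    if mean <= 0:
        raise ValueError("CV is undefined when the mean is zero or negative")
    return 100 * s / mean


# (standard deviation, mean) pairs as used in the slide's arithmetic.
examples = {
    "Defect rates": (22.89, 125.38),
    "ATM deposits": (280.80, 233.89),
    "P/E ratios": (14.08, 22.72),
}

for name, (s, mean) in examples.items():
    print(f"{name}: CV = {coefficient_of_variation(s, mean):.0f}%")
# Defect rates: CV = 18%
# ATM deposits: CV = 120%
# P/E ratios: CV = 62%
```

Because the CV is unit-free, the three data sets can be ranked by relative variation even though they are measured in defects, dollars, and ratios.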

56 Dispersion. Mean Absolute Deviation
The Mean Absolute Deviation (MAD) reveals the average distance from an individual data point to the mean (center of the distribution).
Uses absolute values of the deviations around the mean: $MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
Excel's function is =AVEDEV(Array)
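
A minimal sketch of the MAD calculation follows, again using the quiz scores from slide 17 as convenient data. The function name `mean_absolute_deviation` is illustrative.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of each observation from the mean."""
    n = len(values)
    x_bar = sum(values) / n
    return sum(abs(x - x_bar) for x in values) / n


scores = [42, 60, 70, 75, 78]  # quiz scores from slide 17, mean = 65

# Absolute deviations are 23, 5, 5, 10, 13, which total 56.
print(mean_absolute_deviation(scores))  # 11.2
```

On average a score lies 11.2 points from the mean, which is smaller than the sample standard deviation of the same data because squaring gives extra weight to the largest deviation.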

57 Dispersion. Central Tendency vs. Dispersion: Manufacturing
Consider the histograms of hole diameters drilled in a steel plate during manufacturing.
The desired distribution is outlined in red.
[Figure: histograms for Machine A and Machine B.]

58 Dispersion. Central Tendency vs. Dispersion: Manufacturing
Machine A: Desired mean (5 mm) but too much variation.
Machine B: Acceptable variation but mean is less than 5 mm.
Take frequent samples to monitor quality.

59 Dispersion. Central Tendency vs. Dispersion: Job Performance
Consider student ratings of four professors on eight teaching attributes (10-point scale).

60 Dispersion. Central Tendency vs. Dispersion: Job Performance
Jones and Wu have identical means but different standard deviations.

61 Dispersion. Central Tendency vs. Dispersion: Job Performance
Smith and Gopal have different means but identical standard deviations.

62 Dispersion. Central Tendency vs. Dispersion: Job Performance
A high mean (better rating) and low standard deviation (more consistency) is preferred. Which professor do you think is best?

63 Applied Statistics in Business & Economics
End of Chapter 4A

