Presentation is loading. Please wait.

Presentation is loading. Please wait.

Chapter 3 Descriptive Statistics: Numerical Methods Statistics for Business (Env) 1.

Similar presentations


Presentation on theme: "Chapter 3 Descriptive Statistics: Numerical Methods Statistics for Business (Env) 1."— Presentation transcript:

1 Chapter 3 Descriptive Statistics: Numerical Methods Statistics for Business (Env) 1

2 Descriptive Statistics 3.1Describing Central Tendency 3.2Measures of Variation 3.3Percentiles, Quartiles and Box-and- Whiskers Displays 3.4Covariance, Correlation, and the Least Square Line 3.5Weighted Means and Grouped Data (Optional) 3.6The Geometric Mean (Optional)

3 Describing Central Tendency In addition to describing the shape of a distribution, want to describe the data set’s central tendency – A measure of central tendency represents the center or middle of the data – It is most typical or most representative of the entire data

4 Parameters and Statistics A population parameter is a number calculated from all the population measurements that describes some aspect of the population A sample statistic is a number calculated using the sample measurements that describes some aspect of the sample

5 Measures of Central Tendency Mean,  The average or expected value Median, M d The value of the middle point of the ordered measurements Mode, M o The most frequent value

6 The Mean Population X 1, X 2, …, X N  Population Mean Sample x 1, x 2, …, x n Sample Mean

7 The Sample Mean and is a point estimate of the population mean  It is the value to expect, on average and in the long run And the amount each member gets when the total is distributed equally within the sample For a sample of size n, the sample mean is defined as

8 8 Mean as the balance point for a distribution Data: 2, 2, 6, 10 mean=(2+2+6+10)/4=5

9 9 Data: 3, 6, 6, 9, 11 mean=(3+6+6+9+11)/5=7 What will happen to the mean if we add one more number to the data?

10 The Median The median M d is a value such that 50% of all measurements, after having been arranged in numerical order, lie above (or below) it 1.If the number of measurements is odd, the median is the middlemost measurement in the ordering 2.If the number of measurements is even, the median is the average of the two middlemost measurements in the ordering

11 11 Data: 3, 5, 8, 10, 11 median=8

12 12 Data: 3, 3, 4, 5, 7, 8 median=(4+5)/2=4.5

13 13 Data: 1, 2, 2, 3, 4, 4, 4, 4, 4, 5 median=4??

14 14

15 Example for non-integer data Example 3.1: First five observations from Table 3.1: 30.8, 31.7, 30.1, 31.6, 32.1 In order: 30.1, 30.8, 31.6, 31.7, 32.1 There is an odd so median is one in middle, or 31.6

16 16 Data: 2, 2, 2, 3, 3, 12mean=4 median=(2+3)/2=2.5

17 The Mode The mode M o of a population or sample of measurements is the measurement that occurs most frequently – Modes are the values that are observed “most typically” – Sometimes higher frequencies at two or more values If there are two modes, the data is bimodal If more than two modes, the data is multimodal – When data are in classes, the class with the highest frequency is the modal class The tallest box in the histogram

18 Histogram Describing the 50 Mileages

19 19

20 Selecting a measure of Central Tendency Usually the mean is a good measure, because it uses every score in the distribution. There are some extreme cases in which the mean is not representative (or calculable). Then the mode and the median are used. 20

21 21 Mean=(10+11*4+12*3+13+100)/10=20.3 Median=(11+12)/2=11.5 Mode=11

22 22 Mean – not computable Median=(12+13)/2=12.5 Mode – not meaningful Open-ended distributions A distribution is said to be open-ended when there is no upper limit (or lower limit) for one of the categories

23 Measures of Variation/variability Knowing the measures of central tendency is not enough Both of the distributions below have identical measures of central tendency

24 24

25 Measures of Variation RangeLargest minus the smallest measurement VarianceThe average of the squared deviations of all the population measurements from the population mean StandardThe square root of the variance Deviation They provide quantitative measures of the degree to which data in a distribution are spread out or clustered together.

26 Range for discrete & continuous data The range is the distance between the largest score (X max ) and the smallest score (X min ) in the distribution for discrete data. For continuous data, you must also take into account the real limits of the maximum and minimum X values. range = URL X max - LRL X min 26

27 Population Variance and Standard Deviation The population variance (σ 2 ) is the average of the squared deviations of the individual population measurements from the population mean (µ) The population standard deviation (σ) is the positive square root of the population variance

28 Variance For a population of size N, the population variance σ 2 is: For a sample of size n, the sample variance s 2 is:

29 29 Sample variability tends to underestimate the population value

30 Standard Deviation Population standard deviation (σ): Sample standard deviation (s):

31 Example: Sample Variance and Standard Deviation Data points are: 60, 41, 15, 30, 34 Mean is 36 Variance is: Standard deviation is:

32 32 X

33 Percentiles & Quartiles For a set of measurements arranged in increasing order, the p th percentile is a value such that p percent of the measurements fall at or below the value and (100-p) percent of the measurements fall at or above the value The first quartile Q 1 is the 25th percentile The second quartile (or median) is the 50 th percentile The third quartile Q 3 is the 75th percentile The interquartile range IQR is Q 3 - Q 1

34 Cumulative percentages & PERCENTILES 34 X=2 means that the measurement was somewhere between the real limits of 1.5 and 2.5. 30% of the individuals have been accumulated by the time you reach the top of the interval for X=2. Q3Q3 Q1Q1

35 35

36 36 What is the 95th percentile? (Answer: X = 4.5.) What is the percentile rank for X = 3.5? (Answer: 70%.) What is the 50th percentile? What is the percentile rank for X = 4? estimates of these values by a standard procedure known as interpolation

37 37

38 38

39 39 Using the following distribution of scores, we will use interpolation to find the 50 th percentile:

40 40 For the scores, the width of the interval is 5 points. For the percentages, the width is 50 points. The value of 50% is located 10 points from the top of the percentage interval. As a fraction of the whole interval, this is 10 out of 50, or 1/5 of the total interval. The 50th percentile is X = 8.5.

41 41 USING INTERPOLATION TO FIND THE MEDIAN Answer: X = 3.70 is the median Notice that this is exactly the same answer we obtained using the graphic method of interpolation in Figure 3.7

42 42

43 43

44 Example: Quartiles 20 customer satisfaction ratings: 1 3 5 5 7 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10 M d = (8+8)/2 = 8 Q 1 = (7+8)/2 = 7.5 Q 3 = (9+9)/2 = 9 IQR = Q 3  Q 1 = 9  7.5 = 1.5 A slightly different way to find the quartiles (without using interpolation).

45 Five Number Summary in descriptive statistic 1.The smallest measurement 2.The first quartile, Q 1 3.The median, M d 4.The third quartile, Q 3 5.The largest measurement Displayed visually using a box-and-whiskers plot

46 Box-and-whisker plots 46 A box and whisker plot (sometimes called a boxplot) is a graph that presents information from a five-number summary. It does not show a distribution in as much detail as a stem and leaf plot or histogram does, but is especially useful for indicating whether a distribution is skewed and whether there are potential unusual observations (outliers) in the data set.

47 Outliers Outliers are measurements that are very different from other measurements – They are either much larger or much smaller than most of the other measurements Outliers lie beyond the fences of the box-and- whiskers plot – Measurements between the inner and outer fences are mild outliers – Measurements beyond the outer fences are severe outliers

48 Box-and-Whiskers Plots The box plots the: – first quartile, Q 1 – median, M d – third quartile, Q 3 – inner fences – outer fences From: Business Statistics in Practice, 5th Edition, Bowerman O’Connell Murphree,

49 Box-and-Whiskers Plots Continued Inner fences – Located 1.5  IQR away from the quartiles: Q 1 – (1.5  IQR) Q 3 + (1.5  IQR) Outer fences – Located 3  IQR away from the quartiles: Q 1 – (3  IQR) Q 3 + (3  IQR) From: Business Statistics in Practice, 5th Edition, Bowerman O’Connell Murphree,

50 Box-and-Whiskers Plots Continued The “whiskers” are dashed lines that plot the range of the data – A dashed line drawn from the box below Q 1 down to the smallest measurement between the inner fences – Another dashed line drawn from the box above Q 3 up to the largest measurement between the inner fences

51 Box-and-Whiskers Plots Continued

52 Symmetric distribution Symmetric distribution : A distribution having the same shape on either side of the center Skewed distribution Skewed distribution : One whose shapes on either side of the center differ; a nonsymmetrical distribution. Can be positively or negatively skewed, or bimodal negative positive

53 The Relative Positions of the Mean, Median, and Mode: Symmetric Distribution Symmetric distribution Symmetric distribution : mean = median = mode

54 54

55 Relationships Among Mean, Median and Mode

56 The left tail is longer; the mass of the distribution is concentrated on the right of the figure. The distribution is also said to be left-skewed. The right tail is longer; the mass of the distribution is concentrated on the left of the figure. The distribution is also said to be right-skewed. Mean<Median<Mode Mode<Median<Mean Relative Positions of the Mean, Median, and Mode

57 Skewness is the measurement of the lack of symmetry of the distribution. The coefficient of skewness can range from -3.00 up to 3.00 when using the following formula: A value of 0 indicates a symmetric distribution. Negatively Skewed Distribution has negative coefficient of skewness. Positively Skewed Distribution has positive coefficient of skewness. Some software packages use a different formula which results in a wider range for the coefficient.  s MedianX SkSk   3

58 Normal distribution (Gaussian distribution) is a symmetric distribution that often gives a good description of data that cluster around the mean. The graph of the distribution is bell-shaped, with a peak at the mean, and is known as the Gaussian function or bell curve. Normal distribution

59 The bean machine is a device invented by Sir Francis Galton to demonstrate how the normal distribution appears in nature. This machine consists of a vertical board with interleaved rows of pins. Small balls are dropped from the top and then bounce randomly left or right as they hit the pins. The balls are collected into bins at the bottom and settle down into a pattern resembling the Gaussian curve. Normal distribution in nature

60 Height (in.) Normal distribution in nature Distribution of the heights of 1052 women fits the normal distribution, with a goodness of fit p value of 0.75

61 Histogram of daily percentage changes in the S&P 500 index

62 The Empirical Rule for a Normal/Gaussian distribution If a population has mean µ and standard deviation σ and is described by a normal distribution, then – 68.26% of the population measurements lie within one standard deviation of the mean: [µ-σ, µ+σ] – 95.44% of the population measurements lie within two standard deviations of the mean: [µ-2σ, µ+2σ] – 99.73% of the population measurements lie within three standard deviations of the mean: [µ-3σ, µ+3σ]

63 Origin and meaning of “Six Sigma Process" If one has six standard deviations between the process mean and the nearest specification limit, as shown in the graph, practically no items will fail to meet specifications. If the upper and lower specification limits (USL, LSL) are at a distance of 6σ from the mean, values lying that far away from the mean are extremely unlikely. Even if the mean were to move right or left by 1.5σ at some point in the future (1.5 sigma shift), there is still a good safety cushion. This is why if the mean is at least 6σ away from the nearest specification limit, the process will be under good quality control. Six Sigma – quality control of process outputs

64 Chebyshev’s Theorem Let µ and σ be a population’s mean and standard deviation, then for any value k> 1 At least 100(1 - 1/k 2 )% of the population measurements lie in the interval [µ-kσ, µ+kσ] Or, at most 100/k 2 % of the population measurements lie outside the interval [µ-kσ, µ+kσ]

65 Chebyshev’s Theorem tells us roughly the percentage of data with values that will fall within a certain number of standard deviations of the mean. Chebyshev's Theorem applies to all distributions regardless of their shape and can therefore be used for non normal distributions and distributions where the shape is unknown. Chebyshev’s Theorem: what is it?

66 Chebyshev’s Theorem: comparision Minimum percentage of the population lie within the interval

67 One study of heights in the U.S. concluded that women have a mean height of 65 inches with a standard deviation of 2.5 inches. If this is correct, what is the percentage of women that have heights between 60 and 70 inches according to Chebyshev’s Theorem? If we assume the distribution is a normal distribution, what is the percentage of women with heights between 60 and 70 inches? Since the interval between 60 & 70 inches is 2 σ away from the mean, according to Chebyshev’s Theorem, there are at least 100(1 - 1/2 2 )% or 75% of women with heights between 60 and 70 inches. But if we know the distribution is a normal distribution, then there is 95.44% of the population measurements lie within two standard deviations from the mean. Chebyshev’s Theorem: Example

68 z Scores For any x in a population or sample, the associated z score is defined as

69 Z-score is: the exact location of a score within a distribution The z-score transforms each X value into a signed number so that 1. The sign tells whether the score is located above (+) or below (-) the mean, and 2. The number tells the distance between the score and the mean in terms of the number of standard deviations. 69

70 70 Two distributions of exam scores. For both distributions, = 70, but for one distribution, = 3, and for the other, = 12. The position of X = 76 is very different for these two distributions.  σ σ Z=(76-70)/12=0.5 Z=(76-70)/3=2

71 What if you score a 65 on a test? On the surface it might look bad. But what if that was the mean is 45 and the standard deviation is 10. Then your z-Score isZ = (65 – 45)/10 = 2. That means you are 2 σ above the mean value. From Chebyshev’s Theorem, you are at worse in the top 12.5% of the class if the distribution is symmetric. If the distribution is normal you are in the top 2-3% of the class. z Scores: application

72 z Scores: assignment In order to get an “A”, you should be in the top 10% of the class. What should be your z-Score to get an “A”, according to Chebyshev’s Theorem and assuming the distribution is symmetrical?

73 Example: z Score Population of profit margins for five American companies: 8%, 10%, 15%, 12%, 5% µ = 10%, σ = 3.406%

74 74

75 75 TRANSFORMING z-SCORES TO A DISTRIBUTION WITH A PREDETERMINED mean and standard deviation An instructor gives an exam to a psychology class. For this exam, the distribution of raw scores has a mean of 57 with sd= 14. The instructor would like to simplify the distribution by transforming all scores into a new, standardized distribution with mean= 50 and sd= 10. To demonstrate this process, we will consider what happens to two specific students: Joe, who has a raw score of X= 64 in the original distribution; and Maria, whose original raw score is X= 43. Step 1: Transform each of the original raw scores into z-scores. For Joe, X= 64, so his z-score is 0.5For Maria, X= 43, and her z-score is -1.0 Step 2: Change the z-scores to the new, standardized scores. Joe’s z-score, z=0.5, indicates that he is above the mean by exactly 1/2 standard deviation, so his standardized score would be 55. Maria’s score is located 1 standard deviation below the mean. So her new score would be X= 40.

76 76 Example: A psychologist has developed a new intelligence test. For years, the test has been given to a large number of people; for this population, mean= 65 and sd= 10. The psychologist would like to make the scores of this test comparable to scores on other IQ tests, which have mean= 100 and sd= 15. If the test is standardized so that it is comparable (has the same mean and sd) to other tests, what would be the standardized scores for the following individuals? For Az=(75-65)/10=1 standardized score=100+1x15=115 Bz=(45-65)/10=-2100-2x15=70 Cz=(67-65)/10=0.2100+0.2x15=103

77 Coefficient of Variation Measures the size of the standard deviation relative to the size of the mean Coefficient of variation =standard deviation/mean × 100% Used to: – Compare the relative variabilities of values about the mean – Compare the relative variability of populations or samples with different means and different standard deviations – Measure risk

78 Covariance, Correlation, and the Least Squares Line When points on a scatter plot seem to fluctuate around a straight line, there is a linear relationship between x and y A measure of the strength of a linear relationship is the covariance s xy

79 Covariance A positive covariance indicates a positive linear relationship between x and y – As x increases, y increases A negative covariance indicates a negative linear relationship between x and y – As x increases, y decreases

80 Interpretation of the Sample Covariance

81 Interpretation of the Sample Covariance Continued Points in quadrant I correspond to x i and y i both greater than their averages – (x- ) and (y-y ̅ ) both positive so covariance positive Points in quadrant III correspond to x i and y i both less than their averages – (x- ) and (y-y ̅ ) both negative so covariance positive If s xy is positive, the points having the greatest influence are in quadrants I and III Therefore, a positive s xy indicates a positive linear relationship

82 Interpretation of the Sample Covariance Continued Points in quadrant II correspond to x i less than and y i greater than y – (x- ) is negative and (y-y ̅ ) is positive so covariance negative Points in quadrant IV correspond to x i greater than and y i less than y ̅ – (x- ) is positive and (y- y ̅ ) is negative so covariance negative If s xy is negative, the points having the greatest influence are in quadrants II and IV Therefore, a negative s xy indicates a negative linear relationship

83 Correlation Coefficient Magnitude of covariance does not indicate the strength of the relationship – Magnitude depends on the unit of measurement used for the data Sample correlation coefficient (r) is a measure of the strength of the relationship that does not depend on the magnitude of the data

84 Correlation Coefficient Continued Sample correlation coefficient r is always between -1 and +1 – Values near -1 show strong negative correlation – Values near 0 show no correlation – Values near +1 show strong positive correlation Sample correlation coefficient is the point estimate for the population correlation coefficient ρ

85 Least Squares Line If there is a linear relationship between x and y, might wish to predict y on the basis of x This requires the equation of a line describing the linear relationship Line is calculated based on least squares line – Discussed in detail in Chapter 13

86 Least Squares Line Continued Need to calculate slope (b 1 ) and y-intercept (b 0 )

87 Least Squares Line for the Sales Volume Data

88 Weighted Means Sometimes, some measurements are more important than others – Assign numerical “weights” to the data Weights measure relative importance of the value Calculate weighted mean as where w i is the weight assigned to the i th measurement x i

89 Example: Weighted Mean June 2001 unemployment rates by census region – Northeast, 26.9 million in civilian labor force, 4.1% unemployment rate – South, 50.6 million, 4.7% unemployment – Midwest, 34.7 million, 4.4% unemployment – West, 32.5 million, 5.0 unemployment Want the mean unemployment rate for the US

90 Example: Weighted Mean Continued Want the mean unemployment rate for the U.S. Calculate it as a weighted mean – So that the bigger the region, the more heavily it counts in the mean The data values are the regional unemployment rates The weights are the sizes of the regional labor forces

91 Example: Weighted Mean Continued Note that the unweigthed mean is 4.55%, which underestimates the true rate by 0.03% – That is, 0.0003  144.7 million = 43,410 workers

92 Descriptive Statistics for Grouped Data Data already categorized into a frequency distribution or a histogram is called grouped data Can calculate the mean and variance even when the raw data is not available Calculations are slightly different for data from a sample and data from a population

93 Descriptive Statistics for Grouped Data (Sample) Sample mean for grouped data: Sample variance for grouped data: f i is the frequency for class i M i is the midpoint of class i n = Σf i = sample size

94 Descriptive Statistics for Grouped Data (Population) Population mean for grouped data: Population variance for grouped data: f i is the frequency for class i Mi is the midpoint of class i N = Σf i = population size

95 The Geometric Mean (Optional) For rates of return of an investment, use the geometric mean to give the correct wealth at the end of the investment Suppose the rates of return (expressed as decimal fractions) are R 1, R 2, …, R n for periods 1, 2, …, n The mean of all these returns is the calculated as the geometric mean:

96 The arithmetic mean of the absolute values of the deviations from the arithmetic mean. The main features:  All values are used in the calculation.  It is sensitive to unusual data.  The absolute values are difficult to manipulate. So, it is not used in inferential stat. Mean Deviation 3- 96 Mean Deviation

97 Chapter Three: Summary ONE Calculate the arithmetic mean, median, mode, weighted mean, and the geometric mean. TWO Explain the characteristics of each measure of location. THREE Identify the position of the arithmetic mean, median, and mode for both a symmetrical and a skewed distribution.

98 FOUR Compute and interpret the range, the mean deviation, the variance, and the standard deviation of ungrouped data. FIVE Explain the characteristics, uses, advantages, and disadvantages of each measure of dispersion. SIX Understand Chebyshev’s theorem and the Empirical Rule as they relate to a set of observations. Chapter Three: Summary


Download ppt "Chapter 3 Descriptive Statistics: Numerical Methods Statistics for Business (Env) 1."

Similar presentations


Ads by Google