1 Descriptive Statistics: Numerical Methods Chapter 4.

1 1 Descriptive Statistics: Numerical Methods Chapter 4

2 2 Introduction §In this chapter we use numerical measures to describe data sets that represent populations or samples. §Usually, we focus our attention on two types of measures when describing population characteristics: · Measures of central location. · Measures of dispersion.

3 3 Introduction §Why are both the central location and the variability used to describe a set of numbers? §Observe the following example.

4 4 §Think of a sample portfolio composed of three stocks: 100 shares with ARR = 10%; 200 shares with ARR = 15%; 100 shares with ARR = 20%. A central measure of this portfolio’s ARR is 15%. §Now observe the following portfolio: 100 shares with ARR = 5%; 100 shares with ARR = 5%; 200 shares with ARR = 15%; 100 shares with ARR = 25%; 100 shares with ARR = 25%. A central measure of this portfolio’s ARR is 15% too.

5 5 Introduction §Considering the average ARR only, the two portfolios are equal. But are they really? §Is the dispersion of ARR the same for the two portfolios? §The dispersion (variability) is an important property when describing a set of numbers, at least as important as the central location. §We’ll have more detailed discussions on these two important measures later.

6 6 4.1 Measures of Central Location §The central data point reflects the locations of all the actual data points. §How? · With one data point, clearly the central location is at the point itself. · With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).

7 7 4.1 Measures of Central Location §The central data point reflects the locations of all the actual data points. §How? · If the third data point appears in the center, the measure of central location will remain in the center. · But if the third data point appears on the left-hand side of the midrange, it should “pull” the central location to the left.

8 8 As more and more data points are added, the central location moves (left and right) as required in order to reflect the effects of all the points. 4.1 Measures of Central Location

9 9 The Arithmetic Mean §This is the most popular and useful measure of central location. $\text{Mean} = \dfrac{\text{Sum of the measurements}}{\text{Number of measurements}}$

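10 10 The Arithmetic Mean
· Sample mean: $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$, where $n$ is the sample size.
· Population mean: $\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$, where $N$ is the population size.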

11 11 The Arithmetic Mean Example 1 Find the mean rate of return for a portfolio equally invested in five stocks having the following annual rates of return: 11.2%, 8.07%, 5.55%, 13.7%, 21%. Solution: $\bar{x} = \dfrac{11.2 + 8.07 + 5.55 + 13.7 + 21}{5} = \dfrac{59.52}{5} = 11.904\%$
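The mean in Example 1 can be checked with a short Python sketch. The variable names are illustrative only, and the standard-library `statistics` module stands in for whatever tool the presentation used.

```python
from statistics import mean

# Annual rates of return (in percent) from Example 1.
returns = [11.2, 8.07, 5.55, 13.7, 21]

# The portfolio is equally invested, so its return is the plain average.
mean_return = mean(returns)
print(f"Mean rate of return: {mean_return:.3f}%")  # Mean rate of return: 11.904%
```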

12 12 The Median §The median of a set of measurements is the value that falls in the middle when the measurements are arranged in order of magnitude. §When determining the median, pay attention to the number of observations (k). · k is odd: Median = the number at the (k+1)/2-th location of the ordered array. · k is even: Median = the average of the two numbers in the middle (the numbers at the (k/2)-th and the [(k/2)+1]-th locations of the ordered array).

13 13 The Median Example 2
§The salaries of seven employees were recorded (in 1000s): 28, 60, 26, 32, 30, 26, 29. Find the median salary.
· Odd number of observations. Ordered array: 26, 26, 28, 29, 30, 32, 60.
· There are seven salaries (k = 7). The (k+1)/2-th salary of the ordered array is the number at the (7+1)/2 = 4th location. The median is 29.
§Suppose an additional salary of $31,000 is added to the group of salaries recorded before. Find the median salary.
· Even number of observations. Ordered array: 26, 26, 28, 29, 30, 31, 32, 60.
· There are eight salaries (k = 8). The two salaries in the middle are 29 (in the k/2 = 4th location) and 30 (in the (k/2)+1 = 5th location). The median is their average, 29.5.
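As a quick check of Example 2, the sketch below uses Python's standard-library `statistics.median`, which applies the same odd/even rule. The variable names are illustrative.

```python
from statistics import median

salaries = [28, 60, 26, 32, 30, 26, 29]   # in $1000s, k = 7 (odd)
median_odd = median(salaries)              # 4th value of the ordered array

salaries_plus = salaries + [31]            # add the $31,000 salary, k = 8 (even)
median_even = median(salaries_plus)        # average of the 4th and 5th values

print(median_odd, median_even)  # 29 29.5
```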

14 14 The Mode §The mode of a set of measurements is the value that occurs most frequently. §A set of data may have one mode (or modal class), or two or more modes. §For large data sets the modal class is much more relevant than a single-value mode.

15 15 The Mode § Example 3 · The manager of a men’s clothing store observes the waist size (in inches) of trousers sold last week: 31, 34, 36, 33, 28, 34, 30, 34, 32, 40. · The mode of this data set is 34 in. This information seems to be valuable (for example, for the design of a new display in the store), much more so than “the median is 33.5 in.”
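The mode and median quoted in Example 3 can be reproduced with a small sketch. `statistics.multimode` (Python 3.8+) is used here because it returns every most-frequent value, which matches the remark that a data set may have more than one mode.

```python
from statistics import median, multimode

# Waist sizes (in inches) from Example 3.
waist_sizes = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]

modes = multimode(waist_sizes)          # list of all most-frequent values
median_size = median(waist_sizes)       # average of the two middle values

print(modes, median_size)  # [34] 33.5
```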

17 17 Relationship among Mean, Median, and Mode §If a distribution is symmetrical, the mean, median and mode coincide. §If a distribution is not symmetrical, and is skewed to the left or to the right, the three measures differ. [Figure: a positively skewed distribution (“skewed to the right”) and a negatively skewed distribution (“skewed to the left”), each with the mean, median, and mode marked.]

18 18 Using the Mean, Median, and Mode §When to use (and not use) each measure of central location: · The mean is very sensitive to extreme values and thus should not be used when a few extreme values, lying far from most of the observations, are present. The mean is used in most statistical analyses. · The median is not affected by extreme values and therefore can be used in their presence. Yet the median does not reflect all the values included in the data set, but rather the location of the observation in the middle. · The mode should be used mainly for categorical data.
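The sensitivity of the mean to an extreme value can be illustrated with the salary data of Example 2. Replacing the salary of 60 with 33 is a hypothetical change made only for this illustration; it is not part of the original example.

```python
from statistics import mean, median

salaries = [28, 60, 26, 32, 30, 26, 29]            # Example 2, in $1000s
# Hypothetical variant: the extreme salary of 60 is replaced by 33.
tamed = [33 if s == 60 else s for s in salaries]

mean_with_extreme, median_with_extreme = mean(salaries), median(salaries)
mean_tamed, median_tamed = mean(tamed), median(tamed)

# The mean moves noticeably, while the median stays where it was.
print(mean_with_extreme, median_with_extreme)
print(round(mean_tamed, 2), median_tamed)
```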

19 19 Summary Examples § Example 4 A professor of statistics wants to report the results of a midterm exam, taken by 100 students. The mean of the test marks is 73.90. The median of the test marks is 81. The mode of the test marks is 84. Describe the information each one provides. · The mean provides information about the overall performance level of the class. It can serve as a tool for making comparisons with other classes and/or other exams. · The median indicates that half of the class received a grade below 81%, and half of the class received a grade above 81%. A student can use this statistic to place his/her mark relative to other students in the class. · The mode must be used when data are qualitative. If marks are classified by letter grade, the frequency of each grade can be calculated. Then, the mode becomes a logical measure to compute.

20 20 Summary Examples § Example 5 · The following sample represents the lateness of arriving flights in a certain domestic flight airport (in minutes): 22, 12, 4, -3,… (the data are found in Lateness.xls). (a) Find the mean, median, and mode of this sample. Do these data form a skewed distribution? Negative or positive? (b) Which measure should not be used? Change the largest lateness to 34 minutes (rather than 67). Which central location measures are affected? (c) A person is waiting for the arrival of a certain flight. He is told the flight will probably be late by not more than 10 minutes. Should he believe this is a reliable estimate? Use the distribution of data requested in part (b).

21 21 Summary Examples §Example 5 - solution · We run the data on Excel using the ‘Descriptive Statistics’ tool. §The distribution of these data shows a positive skewness. [Figure: histogram of flight lateness in minutes.] §Do not use the mean, because an ‘outlier’ of 67 minutes lateness affects (increases) the mean value to be almost 11 minutes.

22 22 §Example 5 - solution · When changing the largest observation from 67 to 34, the mean reduces to 9.80 minutes, but the median and mode do not change. It is reasonable to believe that the lateness will not exceed 10 minutes. From the Ogive we see that about 60 % of the flights arrive within 10 minutes of the scheduled arrival time. Summary Examples

23 23 Problems P4-1: Consider the following sample of measurements: 27, 32, 30, 28, 31, 32, 35, 28, 28, 29. Compute the mean, median, and mode. Does it appear that the mode is a good measure of central location for this set of numbers? P4-2: The manager at a local supermarket (facing tough competition) tries to improve service to customers waiting to pay by adding a second cashier. The goal is to have customers wait at most 4.5 minutes before leaving the cashier area. From the data presented in P4-02.xls, was the manager successful in achieving this goal? Use Excel and numerical descriptive measures.

24 24 4.2 Measures of Variability §Measures of central location fail to tell the whole story about the distribution. §A question of interest still remains unanswered: How much are the values of a given set spread out around the mean value?

26 26 Why do we need measures of variability? Observe two hypothetical data sets: · Set 1: Small variability. The mean provides a good representation of the values in the data set. · Set 2: Larger variability. The mean is the same as before but no longer represents the set values as well as before.

27 27 The Range · The range of a set of measurements is the difference between the largest and smallest measurements. · Its major advantage is the ease with which it can be computed. · Its major shortcoming is its failure to provide information on the dispersion of the values between the two end points. [Figure: the range drawn from the smallest measurement to the largest measurement.] But how do all the measurements spread out? The range cannot assist in answering this question.

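28 28 The Variance · This measure reflects the dispersion of all the measurement values.
· The variance of a population of N measurements $x_1, x_2, \ldots, x_N$ having a mean $\mu$ is defined as $\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
· The variance of a sample of n measurements $x_1, x_2, \ldots, x_n$ having a mean $\bar{x}$ is defined as $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$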

29 29 The Variance Consider two small populations:
A: 8, 9, 10, 11, 12
B: 4, 7, 10, 13, 16
The mean of both populations is 10, but measurements in B are more dispersed than those in A. A measure of dispersion should agree with this observation. Can the sum of deviations from the mean be a good measure of dispersion?
A: (8-10) + (9-10) + (10-10) + (11-10) + (12-10) = -2 - 1 + 0 + 1 + 2 = 0
B: (4-10) + (7-10) + (10-10) + (13-10) + (16-10) = -6 - 3 + 0 + 3 + 6 = 0

30 30 The Variance The sum of deviations is zero for both populations; therefore it is not a good measure of dispersion, since clearly the two populations are not equally dispersed.

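31 31 The Variance Let us calculate the variance of the two populations: $\sigma_A^2 = \dfrac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$ and $\sigma_B^2 = \dfrac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$. Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of dispersion instead? After all, the sum of squared deviations increases in magnitude when the dispersion of a data set increases!!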
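Returning to populations A and B, a minimal sketch can confirm both observations: the deviations from the mean cancel out, while the average squared deviation separates the two populations. The helper names `deviations` and `population_variance` are made up for this illustration.

```python
pop_a = [8, 9, 10, 11, 12]
pop_b = [4, 7, 10, 13, 16]

def deviations(values):
    """Deviations of each value from the population mean."""
    mu = sum(values) / len(values)
    return [x - mu for x in values]

def population_variance(values):
    """Average squared deviation (divides by N, the population size)."""
    return sum(d ** 2 for d in deviations(values)) / len(values)

for name, pop in (("A", pop_a), ("B", pop_b)):
    # Prints the sum of deviations, then the population variance.
    print(name, sum(deviations(pop)), population_variance(pop))
```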

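32 32 The Variance Which data set has a larger dispersion? Data set A: five measurements equal to 1 and five equal to 3 (mean 2). Data set B: the two measurements 1 and 5 (mean 3). Data set B is more dispersed around the mean. Let us calculate the sum of squared deviations for both data sets.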

33 33 The Variance Sum A = $(1-2)^2 + \ldots + (1-2)^2 + (3-2)^2 + \ldots + (3-2)^2 = 10$. Sum B = $(1-3)^2 + (5-3)^2 = 8$. Sum A > Sum B. This is inconsistent with the observation that set B is more dispersed.

34 34 The Variance However, when calculated on a “per observation” basis (variance), the dispersions are properly ordered: $\sigma_A^2 = \text{Sum A}/N = 10/10 = 1$ and $\sigma_B^2 = \text{Sum B}/N = 8/2 = 4$.
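The comparison between the raw sum of squared deviations and the per-observation version can be sketched as follows. The data sets are taken from the sums shown on the slides (ten values for A, two for B); the function name is illustrative.

```python
set_a = [1] * 5 + [3] * 5   # five 1s and five 3s, mean 2
set_b = [1, 5]              # mean 3

def sum_squared_deviations(values):
    """Sum of squared deviations from the mean (not yet divided by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values)

sum_a, sum_b = sum_squared_deviations(set_a), sum_squared_deviations(set_b)
var_a, var_b = sum_a / len(set_a), sum_b / len(set_b)

# The raw sums rank A above B; dividing by N restores the expected order.
print(sum_a, sum_b)   # 10.0 8.0
print(var_a, var_b)   # 1.0 4.0
```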

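35 35 The Variance § Example 6 · Find the variance of the following set of numbers, representing annual rates of return for a group of mutual funds. Assume the set is (i) a sample, (ii) a population: -2, 4, 5, 6.9, 10 § Solution Assuming a sample: $\bar{x} = \dfrac{-2 + 4 + 5 + 6.9 + 10}{5} = 4.78$ and $s^2 = \dfrac{(-2-4.78)^2 + (4-4.78)^2 + (5-4.78)^2 + (6.9-4.78)^2 + (10-4.78)^2}{5-1} = \dfrac{78.368}{4} = 19.592$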

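36 36 The Variance § Example 6 - solution continued Assuming a population: $\mu = 4.78$ and $\sigma^2 = \dfrac{(-2-4.78)^2 + (4-4.78)^2 + (5-4.78)^2 + (6.9-4.78)^2 + (10-4.78)^2}{5} = \dfrac{78.368}{5} \approx 15.674$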
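Both parts of Example 6 can be checked with the standard-library `statistics` module: `variance` divides by n - 1 (sample), and `pvariance` divides by N (population). This is only a sketch of the calculation, not the tool used in the presentation.

```python
from statistics import pvariance, variance

# Annual rates of return from Example 6.
returns = [-2, 4, 5, 6.9, 10]

sample_var = variance(returns)        # divides by n - 1
population_var = pvariance(returns)   # divides by N

print(f"{sample_var:.3f} {population_var:.4f}")  # 19.592 15.6736
```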

37 37 Standard Deviation The standard deviation of a set of measurements is the square root of the variance of the set: $s = \sqrt{s^2}$ for a sample and $\sigma = \sqrt{\sigma^2}$ for a population.

38 38 Standard Deviation · Example 7 The daily percentages of defective items in two weeks of production (10 working days) were calculated for two production lines. Which line provides good items more consistently? Line 1: 8.3, 6.2, 20.9, 2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05 Line 2: 12.1, 2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, 1.3, 11.4

39 39 Standard Deviation Example 7, Solution Let us use the Excel printout obtained from the “Descriptive Statistics” sub-menu. Line 1 should be considered less consistent because the standard deviation of its defective proportion is larger (and therefore the standard deviation of its good-item proportion is also larger).
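Since the Excel printout is not reproduced here, the comparison in Example 7 can be sketched in Python instead. The sample standard deviation (`statistics.stdev`, dividing by n - 1) is assumed, as that is what a descriptive-statistics summary typically reports.

```python
from statistics import stdev

# Daily percentages of defective items from Example 7.
line_1 = [8.3, 6.2, 20.9, 2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05]
line_2 = [12.1, 2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, 1.3, 11.4]

s1, s2 = stdev(line_1), stdev(line_2)

# The line with the larger standard deviation is the less consistent one.
less_consistent = "Line 1" if s1 > s2 else "Line 2"
print(f"s1 = {s1:.2f}, s2 = {s2:.2f}, less consistent: {less_consistent}")
```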

40 40 Interpreting the Standard Deviation §The standard deviation can be used to · compare the variability of several distributions · make a statement about the general shape of a distribution. §When describing the shape of a distribution we refer to · A distribution with any shape · A mound shaped distribution

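41 41 The Empirical Rule – Describing a Mound Shaped Data Set If a sample of measurements has a mound-shaped distribution, the interval · $(\bar{x} - s, \bar{x} + s)$ contains approximately 68% of the measurements · $(\bar{x} - 2s, \bar{x} + 2s)$ contains approximately 95% of the measurements · $(\bar{x} - 3s, \bar{x} + 3s)$ contains approximately 99.7% of the measurements.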

42 42 The Empirical Rule § Example 10 Describe the set of data provided in Data 10 using numerical descriptive measures. § Solution · From the histogram it appears that the distribution is approximately mound shaped. We’ll use the empirical rule to describe the data.

43 43 The Empirical Rule – Interpreting the Standard Deviation Example 10 – solution continued
§Running the Descriptive Statistics tool in Excel we have: Mean = 17.959, Standard deviation (sample) = 0.556.
§From the empirical rule we get:
· Approximately 68% of the data lie between 17.403 and 18.515 [17.959 - 1(.556), 17.959 + 1(.556)]. Actual count: 19 (73%).
· Approximately 95% of the data lie between 16.847 and 19.071 [17.959 - 2(.556), 17.959 + 2(.556)]. Actual count: 25 (96%).
· Approximately 99.7% of the data lie between 16.291 and 19.627 [17.959 - 3(.556), 17.959 + 3(.556)]. Actual count: 26 (100%).
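The three empirical-rule intervals of Example 10 follow directly from the reported mean and standard deviation. A minimal sketch (the dictionary layout is just one convenient way to hold the results):

```python
mean_x, s = 17.959, 0.556   # values reported for Example 10

# Interval mean ± k·s for k = 1, 2, 3, rounded to three decimals.
intervals = {
    k: (round(mean_x - k * s, 3), round(mean_x + k * s, 3))
    for k in (1, 2, 3)
}

for k, share in ((1, "68%"), (2, "95%"), (3, "99.7%")):
    low, high = intervals[k]
    print(f"about {share} of the data lie between {low} and {high}")
```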

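44 44 The Chebyshev Theorem - Describing Any Data Set
§The proportion of observations in any sample that lie within k standard deviations of the mean is at least $1 - 1/k^2$ for any k > 1.
§This theorem is valid for any set of measurements (sample, population) of any shape!!
· k = 2: the interval $(\bar{x} - 2s, \bar{x} + 2s)$ contains at least 75% of the measurements $(1 - 1/2^2)$
· k = 3: the interval $(\bar{x} - 3s, \bar{x} + 3s)$ contains at least 89% of the measurements $(1 - 1/3^2)$
· k = 4: the interval $(\bar{x} - 4s, \bar{x} + 4s)$ contains at least 94% of the measurements $(1 - 1/4^2)$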

45 45 The Chebyshev Theorem § Example 9 · Employee salaries were recorded and a histogram was created. Describe these data using the correct numerical measures. § Solution · Creating the histogram we realize that the distribution is positively skewed. The Chebyshev Theorem needs to be used to describe the data.

46 46 The Chebyshev Theorem § Example 9 – solution continued
· From Excel we have: Mean = 243.2, Standard deviation = 58.354.
· Applying the Chebyshev Theorem:
At least 75% of the salaries lie within [243.2 - 2(58.354), 243.2 + 2(58.354)] = [126.492, 359.908]. Actual count: 39 (97.5%).
At least 88.9% of the salaries lie within [243.2 - 3(58.354), 243.2 + 3(58.354)] = [68.138, 418.262]. Actual count: all (100%).
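The Chebyshev bounds and the intervals of Example 9 can be reproduced with a short sketch. The function name `chebyshev_bound` is illustrative.

```python
def chebyshev_bound(k):
    """Minimum proportion of observations within k standard deviations."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

mean_salary, s = 243.2, 58.354   # values reported for Example 9

intervals = {k: (mean_salary - k * s, mean_salary + k * s) for k in (2, 3)}

for k, (low, high) in intervals.items():
    print(f"at least {chebyshev_bound(k):.1%} of salaries lie in "
          f"[{low:.3f}, {high:.3f}]")
```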

47 47 The Coefficient of Variation §The coefficient of variation of a set of measurements is the standard deviation divided by the mean value. §This coefficient provides a proportionate measure of variation. A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.
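The two cases mentioned above (a standard deviation of 10 against means of 100 and 500) make a compact worked example. The function name is an assumption made for this sketch.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation expressed as a fraction of the mean."""
    return std_dev / mean_value

cv_small_mean = coefficient_of_variation(10, 100)   # 10% of the mean
cv_large_mean = coefficient_of_variation(10, 500)   # 2% of the mean

print(cv_small_mean, cv_large_mean)  # 0.1 0.02
```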

48 48 4.3 Measures of Relative Location and Box Plots § Additional information on the general shape of a data set can be obtained by describing the relative location of 5 values within the data set. § We use percentiles to describe these 5 relative locations. What is a percentile?

49 49 4.3 Measures of Relative Location and Box Plots § Percentile · The p-th percentile of a set of measurements is the value for which at most p% of the measurements are less than that value, and at most (100-p)% of all the measurements are greater than that value. § Example · Suppose your score is the 60th percentile of an SAT test. Then 60% of all the scores lie below your score and 40% lie above it.

50 50 § Here are two possible approaches commonly used to describe a set of values. § The five number summary: · Smallest value · First quartile (Q 1 ) · Median (Q 2 ) · Third quartile (Q 3 ) · Largest value - OR - The first decile (the 10 th percentile) First quartile (Q1) Median (Q2) Third quartile (Q3) The ninth decile (90 th percentile) 4.3 Measures of Relative Location and Box Plots

51 51 A demonstration of commonly used percentiles
§ Commonly used percentiles:
· First (lower) decile = 10th percentile
· First (lower) quartile, Q1 = 25th percentile
· Median = 50th percentile
· Third quartile, Q3 = 75th percentile
· Ninth (upper) decile = 90th percentile
§ Lower decile: 10% of the measurements lie below it and 90% lie above it.
§ Lower quartile: 25% lie below it and 75% lie above it.
§ Median: 50% lie below it and 50% lie above it. And so on…

54 54 Determining Percentiles and their Location §There are two general cases to consider: · The percentile is a member of the data set. · The percentile is not a member of the data set; it falls in between two values of the data set. §Let us demonstrate the two cases with two examples.

55 55 § Example 11 Find the quartiles for the data set of flight lateness presented in example 4.5. Data: 8.3, 6.2, 20.9, 2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05 Determining Percentiles and their Location

56 56 Determining Percentiles and their Location §Example 11 - Solution Sort the measurements (10 measurements): 2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9 · The first quartile: At most (.25)(10) = 2.5 measurements should appear below the first quartile. Check the smallest 2 measurements on the left-hand side. At most (.75)(10) = 7.5 measurements should appear above the first quartile. Check the largest 7 measurements on the right-hand side.

57 57 Determining Percentiles and their Location §Example 11 – solution continued · The second quartile (median): At most (.5)(10) = 5 numbers lie below Q2 and at most 5 lie above it. 2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9 Q2 = (8.3 + 20.9)/2 = 14.6

58 58 Determining Percentiles and their Location §Example 11 – solution continued · The third quartile: At most (.75)10 = 7.5 numbers lie below Q3. At most (.25)10 = 2.5 numbers lie above Q3. 2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9

59 59 Determining Percentiles and their Location § Example 12 Find the 20th percentile for the data set of flight lateness presented in Example 11. § Solution · Following the procedure applied to the previous example: At most (.20)10 = 2 numbers should fall below the 20th percentile. At most (.80)10 = 8 numbers should fall above the 20th percentile. The sorted data set is: 2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9. From the sorted data set we see that every number greater than 3.1 and smaller than 5.2 meets these two conditions. We show next how to determine the location and value of a percentile whose value is not one of the data set points.

60 60 §Find the location of any percentile using the formula Determining Percentiles and their Location

61 61 Determining Percentiles and their Location §Example 12 - solution continued · Finding the location of the 20th percentile: 2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9 · Finding the value of the 20th percentile: The 20th percentile is located at location 2.75, that is, at .75 of the distance from 3.1 (location 2) to 5.2 (location 3). Therefore, $P_{20} = 3.1 + .75(5.2 - 3.1) = 4.675$
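The interpolation step above (reading a value at a fractional location in the sorted data) can be sketched as a small helper. The function `value_at_location` is hypothetical; it takes the location as given and does not compute the location itself.

```python
import math

def value_at_location(sorted_values, location):
    """Linear interpolation at a 1-based, possibly fractional, location."""
    if not 1 <= location <= len(sorted_values):
        raise ValueError("location must lie between 1 and the number of values")
    lower = math.floor(location)        # location of the value just below
    fraction = location - lower         # share of the distance to the next value
    if fraction == 0:
        return sorted_values[lower - 1]
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + fraction * (above - below)

data = sorted([8.3, 6.2, 20.9, 2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05])
p_value = value_at_location(data, 2.75)   # .75 of the way from 3.1 to 5.2
print(round(p_value, 3))  # 4.675
```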

62 62 Quartiles and Variability §Quartiles can provide an idea about the shape of a histogram. [Figure: quartiles Q1, Q2, Q3 marked on a positively skewed histogram and on a negatively skewed histogram.]

63 63 Inter-quartile Range Interquartile range = Q3 – Q1 §This is a measure of the spread of the middle 50% of the observations. §A large value indicates a large spread of the observations.

64 64 Box Plot
· A box plot is a pictorial display that provides the main descriptive measures of the measurement set:
L - the largest measurement
Q3 - the upper quartile
Q2 - the median
Q1 - the lower quartile
S - the smallest measurement
· An outlier is defined as any value that is more than 1.5(Q3 – Q1) away from the box.
[Figure: box plot marking S, Q1, Q2, Q3, L, the whiskers, and the distance 1.5(Q3 – Q1) from the box.]

65 65 Box Plot · Example 13 Create a box plot for the data regarding the GMAT scores of 200 applicants (see Data13.xls). [Figure: box plot with smallest score 449, Q1 = 512, Q2 = 537, Q3 = 575, largest score 788; outlier limits 512 - 1.5(IQR) = 417.5 and 575 + 1.5(IQR) = 669.5.]
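The outlier limits shown for Example 13 follow from the quartiles alone. A minimal sketch, using only the summary values quoted in the example (the raw GMAT scores are not available here):

```python
# Summary values quoted for Example 13.
smallest, q1, q2, q3, largest = 449, 512, 537, 575, 788

iqr = q3 - q1                     # interquartile range: 63
lower_limit = q1 - 1.5 * iqr      # 417.5
upper_limit = q3 + 1.5 * iqr      # 669.5

# Any value beyond these limits counts as an outlier under the 1.5·IQR rule.
has_low_outliers = smallest < lower_limit    # False: 449 is above 417.5
has_high_outliers = largest > upper_limit    # True: 788 is above 669.5

print(iqr, lower_limit, upper_limit, has_low_outliers, has_high_outliers)
```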

66 66 Box Plot Example 13 - continued · Interpreting the box plot results: The scores range from 449 to 788. About half the scores are smaller than 537, and about half are larger than 537. About half the scores lie between 512 and 575. About a quarter lie below 512 and a quarter above 575. [Figure: box plot with Q1 = 512, Q2 = 537, Q3 = 575, sections marked 25% / 50% / 25%, and the values 449 and 669.5 labeled.]

67 67 Box Plot Example 13 - continued The data set is positively skewed. [Figure: box plot with Q1 = 512, Q2 = 537, Q3 = 575, sections marked 25% / 50% / 25%, and the values 449 and 669.5 labeled.]

68 68 4.4 Measures of Linear Relationship §The covariance and the coefficient of correlation are used to measure the direction and strength of the linear relationship between two variables. · The covariance answers the question: Is there any pattern to the way two variables move together? · The correlation coefficient answers the question: How strong is the linear relationship between two variables?

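69 69 Covariance
· Population covariance $= \dfrac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$, where $\mu_x$ ($\mu_y$) is the population mean of the variable X (Y) and N is the population size.
· Sample covariance $= \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$, where $\bar{x}$ ($\bar{y}$) is the sample mean of the variable X (Y) and n is the sample size.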

70 70 Covariance §If the two variables move in the same direction (both increase or both decrease), the covariance is a large positive number. [Figure: scatter plot of Y against X.]

71 71 Covariance §If the two variables move in two opposite directions (one increases when the other one decreases), the covariance is a large negative number. [Figure: scatter plot of Y against X.]

72 72 Covariance §If the two variables are unrelated, the covariance will be close to zero. [Figure: scatter plot of Y against X.]

73 73 The coefficient of correlation The coefficient of correlation measures the strength of the linear relationship between two variables.

74 74 §If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship). The coefficient of correlation

75 75 The coefficient of correlation §If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).

76 76 §A weak linear relationship is indicated by a coefficient close to zero. §Also, a non-linear relationship translates to a weak linear relationship The coefficient of correlation
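The behavior of the covariance and the coefficient of correlation can be illustrated with a short sketch. The data below are hypothetical, and the functions assume the usual sample definitions (dividing by n - 1, and dividing the covariance by the product of the two standard deviations); they are not taken from the presentation.

```python
import math

def sample_covariance(xs, ys):
    """Average product of deviations, dividing by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

# Hypothetical data: y_up rises with x, y_down falls as x rises.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]
y_down = [10, 8, 6, 4, 2]

print(sample_covariance(x, y_up), correlation(x, y_up))      # positive, near +1
print(sample_covariance(x, y_down), correlation(x, y_down))  # negative, near -1
```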

77 77 The coefficient of correlation and the covariance § Example 14 · Compute the covariance and the coefficient of correlation to measure how car speed (miles per hour) and gas consumption (miles per gallon) are related to one another (see data next). § Solution · We believe speed affects gas consumption. Thus speed is labeled X and miles per gallon is labeled Y.

78 78 The coefficient of correlation and the covariance Example 14 – solution continued [Table columns: Car, x, y, x², y², xy]

79 79 The coefficient of correlation and the covariance Example 14 – solution continued [Table columns: Car, x, y, x², y², xy]

80 80 The coefficient of correlation and the covariance Example 14 – solution continued Interpretation: Speed and mileage per gallon are strongly positively linearly related for the speed range of 15 to 50 miles per hour.

