Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 1 Copyright © 2010, HJ Shanghai Normal Uni. Chapter 3 Descriptive Statistics: Numerical Methods n Measures of Location n Measures of Variability n Measure.

Similar presentations


Presentation on theme: "1 1 Copyright © 2010, HJ Shanghai Normal Uni. Chapter 3 Descriptive Statistics: Numerical Methods n Measures of Location n Measures of Variability n Measure."— Presentation transcript:

1 1 1 Copyright © 2010, HJ Shanghai Normal Uni. Chapter 3 Descriptive Statistics: Numerical Methods n Measures of Location n Measures of Variability n Measure of Relative Location and Detecting Outliers n Exploratory Data Analysis n Measures of Association between Two Variables n The Weighted mean and Working with Grouped Data x x     % %

2 2 2 Copyright © 2010, HJ Shanghai Normal Uni. 3.1 Measures of Location n Mean (均值) n Median (中位数) n Mode (众数) n Percentiles (百分位数) n Quartiles (四分位数)

3 3 3 Copyright © 2010, HJ Shanghai Normal Uni. Example: Apartment Rents Given below is a sample of monthly rent values ($) for one-bedroom apartments. The data is a sample of 70 apartments in a particular city. The data are presented in ascending order.

4 4 4 Copyright © 2010, HJ Shanghai Normal Uni. Mean n The mean (平均值) of a data set is the average of all the data values. n If the data are from a sample, the mean is denoted by. If the data are from a population, the mean is denoted by  (mu). If the data are from a population, the mean is denoted by  (mu).

5 5 5 Copyright © 2010, HJ Shanghai Normal Uni. Example: Apartment Rents n Mean

6 6 6 Copyright © 2010, HJ Shanghai Normal Uni. Median n The median (中位数) is the measure of location most often reported for annual income and property value data. n A few extremely large incomes or property values can inflate the mean.

7 7 7 Copyright © 2010, HJ Shanghai Normal Uni. Median n The median of a data set is the value in the middle when the data items are arranged in ascending order. n For an odd number of observations, the median is the middle value. n For an even number of observations, the median is the average of the two middle values.

8 8 8 Copyright © 2010, HJ Shanghai Normal Uni. Example: Apartment Rents n Median Median = 50th percentile Median = 50th percentile i = ( p /100) n = (50/100)70 = 35 Averaging the 35th and 36th data values: Median = (475 + 475)/2 = 475

9 9 9 Copyright © 2010, HJ Shanghai Normal Uni. Mode n The mode (众数) of a data set is the value that occurs with greatest frequency. n The greatest frequency can occur at two or more different values. n If the data have exactly two modes, the data are bimodal. (双峰) n If the data have more than two modes, the data are multimodal. (多峰)

10 10 Copyright © 2010, HJ Shanghai Normal Uni. Example: Apartment Rents n Mode 450 occurred most frequently (7 times) 450 occurred most frequently (7 times) Mode = 450 Mode = 450

11 11 Copyright © 2010, HJ Shanghai Normal Uni. Percentiles n A percentile (百分位数) provides information about how the data are spread over the interval from the smallest value to the largest value. n Admission test scores for colleges and universities are frequently reported in terms of percentiles.

12 12 Copyright © 2010, HJ Shanghai Normal Uni. n The p th percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 - p ) percent of the items take on this value or more. Arrange the data in ascending order. Arrange the data in ascending order. Compute index i, the position of the p th percentile. Compute index i, the position of the p th percentile. i = ( p /100) n i = ( p /100) n If i is not an integer, round up. The p th percentile is the value in the i th position. If i is not an integer, round up. The p th percentile is the value in the i th position. If i is an integer, the p th percentile is the average of the values in positions i and i +1. If i is an integer, the p th percentile is the average of the values in positions i and i +1. Percentiles

13 13 Copyright © 2010, HJ Shanghai Normal Uni. Example: Apartment Rents n 90th Percentile i = ( p /100) n = (90/100)70 = 63 Averaging the 63rd and 64th data values: 90th Percentile = (580 + 590)/2 = 585 90th Percentile = (580 + 590)/2 = 585

14 14 Copyright © 2010, HJ Shanghai Normal Uni. Quartiles n Quartiles (四分位数) are specific percentiles n First Quartile = 25th Percentile n Second Quartile = 50th Percentile = Median n Third Quartile = 75th Percentile

15 15 Copyright © 2010, HJ Shanghai Normal Uni. Example: Apartment Rents n Third Quartile Third quartile = 75th percentile Third quartile = 75th percentile i = ( p /100) n = (75/100)70 = 52.5 = 53 i = ( p /100) n = (75/100)70 = 52.5 = 53 Third quartile = 525 Third quartile = 525

16 16 Copyright © 2010, HJ Shanghai Normal Uni. Measures of Variability n It is often desirable to consider measures of variability (dispersion), as well as measures of location. n For example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.

17 17 Copyright © 2010, HJ Shanghai Normal Uni. Measures of Variability n Range (极差) n Interquartile Range (四分位点内距) n Variance (方差) n Standard Deviation (标准差) n Coefficient of Variation (变异系数)

18 18 Copyright © 2010, HJ Shanghai Normal Uni. Range n The range (极差) of a data set is the difference between the largest and smallest data values. n It is the simplest measure of variability. n It is very sensitive to the smallest and largest data values.

19 19 Copyright © 2010, HJ Shanghai Normal Uni. Example: Apartment Rents n Range Range = largest value - smallest value Range = largest value - smallest value Range = 615 - 425 = 190 Range = 615 - 425 = 190

20 20 Copyright © 2010, HJ Shanghai Normal Uni. Interquartile Range n The interquartile range (四分位点内距) of a data set is the difference between the third quartile and the first quartile. n It is the range for the middle 50% of the data. n It overcomes the sensitivity to extreme data values.

21 21 Copyright © 2010, HJ Shanghai Normal Uni. Example: Apartment Rents n Interquartile Range 3rd Quartile ( Q 3) = 525 3rd Quartile ( Q 3) = 525 1st Quartile ( Q 1) = 445 1st Quartile ( Q 1) = 445 Interquartile Range = Q 3 - Q 1 = 525 - 445 = 80 Interquartile Range = Q 3 - Q 1 = 525 - 445 = 80

22 22 Copyright © 2010, HJ Shanghai Normal Uni. Variance n The variance (方差) is a measure of variability that utilizes all the data. It is based on the difference between the value of each observation ( x i ) and the mean ( x for a sample,  for a population). It is based on the difference between the value of each observation ( x i ) and the mean ( x for a sample,  for a population).

23 23 Copyright © 2010, HJ Shanghai Normal Uni. Variance n The variance is the average of the squared differences between each data value and the mean. n If the data set is a sample, the variance is denoted by s 2. If the data set is a population, the variance is denoted by  2. (sigma) If the data set is a population, the variance is denoted by  2. (sigma)

24 24 Copyright © 2010, HJ Shanghai Normal Uni. Standard Deviation n The standard deviation (标准差) of a data set is the positive square root of the variance. n It is measured in the same units as the data, making it more easily comparable, than the variance, to the mean. n If the data set is a sample, the standard deviation is denoted s. If the data set is a population, the standard deviation is denoted  (sigma). If the data set is a population, the standard deviation is denoted  (sigma).

25 25 Copyright © 2010, HJ Shanghai Normal Uni. Coefficient of Variation n The coefficient of variation (变异系数) indicates how large the standard deviation is in relation to the mean. n If the data set is a sample, the coefficient of variation is computed as follows: n If the data set is a population, the coefficient of variation is computed as follows:

26 26 Copyright © 2010, HJ Shanghai Normal Uni. Example: Apartment Rents n Variance n Standard Deviation n Coefficient of Variation

27 27 Copyright © 2010, HJ Shanghai Normal Uni. 课堂练习 一项关于大学生体重状况的研究发现,男生的平均体重 为 60kg ,标准差为 5kg ;女生的平均体重为 50kg ,标 准差为 5kg 。请回答下面的问题: 要求:( 1 )男生的体重差异大还是女生的体重差异大? 为什么?  ( 2 )以磅为单位( 1 磅 =2.2kg )求体重的平均数和 标准差。  ( 3 )粗略地估计一下,男生中有百分之几的人体 重在 55kg ~ 65kg 之间?  ( 4 )粗略地估计一下,女生中有百分之几的人体 重在 40kg ~ 60kg 之间?

28 28 Copyright © 2010, HJ Shanghai Normal Uni. Chapter 3 Descriptive Statistics: Numerical Methods n Measures of Relative Location and Detecting Outliers n Exploratory Data Analysis n Measures of Association Between Two Variables n The Weighted Mean and Working with Grouped Data Working with Grouped Data     % % x x

29 29 Copyright © 2010, HJ Shanghai Normal Uni. Measures of Relative Location and Detecting Outliers n z-Scores ( Z- 分数) n Chebyshev’s Theorem (切比雪夫定理) n Empirical Rule (经验法则) n Detecting Outliers (异常值检测)

30 30 Copyright © 2010, HJ Shanghai Normal Uni. z-Scores ( Z- 分数) n The z-score is often called the standardized value. n It denotes the number of standard deviations a data value x i is from the mean. n A data value less than the sample mean will have a z- score less than zero. n A data value greater than the sample mean will have a z-score greater than zero. n A data value equal to the sample mean will have a z- score of zero.

31 31 Copyright © 2010, HJ Shanghai Normal Uni. n z-Score of Smallest Value (425) Standardized Values for Apartment Rents Example: Apartment Rents

32 32 Copyright © 2010, HJ Shanghai Normal Uni. Example : Z-scores for the class-size n Sample mean: 44; sample standard deviation:8

33 33 Copyright © 2010, HJ Shanghai Normal Uni. Chebyshev’s Theorem (切比雪夫定理) At least (1 - 1/ z 2 ) of the items in any data set will be At least (1 - 1/ z 2 ) of the items in any data set will be within z standard deviations of the mean, where z is any value greater than 1. At least 75% of the items must be within At least 75% of the items must be within z = 2 standard deviations of the mean. At least 89% of the items must be within At least 89% of the items must be within z = 3 standard deviations of the mean. At least 94% of the items must be within At least 94% of the items must be within z = 4 standard deviations of the mean. 与均值的距离必定在 z 个标准差以内的数据比例至少为 (1 - 1/ z 2 ) At least (1 - 1/ z 2 ) of the items in any data set will be At least (1 - 1/ z 2 ) of the items in any data set will be within z standard deviations of the mean, where z is any value greater than 1. At least 75% of the items must be within At least 75% of the items must be within z = 2 standard deviations of the mean. At least 89% of the items must be within At least 89% of the items must be within z = 3 standard deviations of the mean. At least 94% of the items must be within At least 94% of the items must be within z = 4 standard deviations of the mean. 与均值的距离必定在 z 个标准差以内的数据比例至少为 (1 - 1/ z 2 )

34 34 Copyright © 2010, HJ Shanghai Normal Uni. Example: the midterm test scores n the midterm test scores for 100 students in a college business statistics course had a mean of 70 and a standard deviation of 5. How many students had test scores between 60 and 80? How many students had test scores between 58 and 82? n 60-80: Z 60 =(60-70)/5=-2 ; Z 80 =(80-70)/5=2; Z 60 =(60-70)/5=-2 ; Z 80 =(80-70)/5=2; At least (1 - 1/(2) 2 ) = 0.75 or 75% of the students have scores between 60 and 80. At least (1 - 1/(2) 2 ) = 0.75 or 75% of the students have scores between 60 and 80. n 58-82?

35 35 Copyright © 2010, HJ Shanghai Normal Uni. Example: Apartment Rents n Chebyshev’s Theorem (切比雪夫定理) Let z = 1.5 with = 490.80 and s = 54.74 Let z = 1.5 with = 490.80 and s = 54.74 At least (1 - 1/(1.5) 2 ) = 1 - 0.44 = 0.56 or 56% of the rent values must be between of the rent values must be between - z ( s ) = 490.80 - 1.5(54.74) = 409 - z ( s ) = 490.80 - 1.5(54.74) = 409 and and + z ( s ) = 490.80 + 1.5(54.74) = 573 + z ( s ) = 490.80 + 1.5(54.74) = 573

36 36 Copyright © 2010, HJ Shanghai Normal Uni. n Chebyshev’s Theorem (continued) Actually, 86% of the rent values Actually, 86% of the rent values are between 409 and 573. are between 409 and 573. Example: Apartment Rents

37 37 Copyright © 2010, HJ Shanghai Normal Uni. Empirical Rule (经验法则) For data having a bell-shaped distribution: For data having a bell-shaped distribution: Approximately 68% of the data values will be within one standard deviation of the mean. Approximately 68% of the data values will be within one standard deviation of the mean.

38 38 Copyright © 2010, HJ Shanghai Normal Uni. Empirical Rule For data having a bell-shaped distribution: Approximately 95% of the data values will be within two standard deviations of the mean. Approximately 95% of the data values will be within two standard deviations of the mean.

39 39 Copyright © 2010, HJ Shanghai Normal Uni. Empirical Rule For data having a bell-shaped distribution: Almost all (99.7%) of the items will be within three standard deviations of the mean. Almost all (99.7%) of the items will be within three standard deviations of the mean.

40 40 Copyright © 2010, HJ Shanghai Normal Uni.

41 41 Copyright © 2010, HJ Shanghai Normal Uni. Example: Apartment Rents n Empirical Rule Interval % in Interval Interval % in Interval Within +/- 1 s 436.06 to 545.5448/70 = 69% Within +/- 2 s 381.32 to 600.2868/70 = 97% Within +/- 3 s 326.58 to 655.0270/70 = 100%

42 42 Copyright © 2010, HJ Shanghai Normal Uni. 应用: six sigma( 六西格玛 ) n 用 “σ” 度量质量特性总体上对目标值 的偏离程度。几个西格玛是一种表 示品质的统计尺度。任何一个工作 程序或工艺过程都可用几个西格玛 表示。 n 六个西格玛可解释为每一百万个机 会中有 3.4 个出错的机会,即合格率 是 99.99966 %。而三个西格玛的合 格率只有 93.32 %。 n 六个西格玛的管理方法重点是将所 有的工作作为一种流程,采用量化 的方法 分析流程中影响质量的因素 ,找出最关键的因素加以改进从而 达到更高的客户满意度。

43 43 Copyright © 2010, HJ Shanghai Normal Uni. Detecting Outliers (异常值检测) n An outlier is an unusually small or unusually large value in a data set. n A data value with a z-score less than -3 or greater than +3 might be considered an outlier. n It might be: an incorrectly recorded data value an incorrectly recorded data value a data value that was incorrectly included in the data set a data value that was incorrectly included in the data set a correctly recorded data value that belongs in the data set a correctly recorded data value that belongs in the data set

44 44 Copyright © 2010, HJ Shanghai Normal Uni. Example: Apartment Rents n Detecting Outliers The most extreme z-scores are -1.20 and 2.27. Using | z | > 3 as the criterion for an outlier, there are no outliers in this data set. Standardized Values for Apartment Rents

45 45 Copyright © 2010, HJ Shanghai Normal Uni. Exploratory Data Analysis (探索性数据分析) n Five-Number Summary (五数据概括法) n Box Plot (箱形图)

46 46 Copyright © 2010, HJ Shanghai Normal Uni. Five-Number Summary n Smallest Value (最小值) n First Quartile (第一四分位数) n Median (中位数) n Third Quartile (第三四分位数) n Largest Value (最大值)

47 47 Copyright © 2010, HJ Shanghai Normal Uni. Example: Apartment Rents n Five-Number Summary Lowest Value = 425 First Quartile = 450 Median = 475 Median = 475 Third Quartile = 525 Largest Value = 615

48 48 Copyright © 2010, HJ Shanghai Normal Uni. Box Plot n A box is drawn with its ends located at the first and third quartiles. n A vertical line is drawn in the box at the location of the median. n Limits are located (not drawn) using the interquartile range (IQR). The lower limit is located 1.5(IQR) below Q 1. The lower limit is located 1.5(IQR) below Q 1. The upper limit is located 1.5(IQR) above Q 3. The upper limit is located 1.5(IQR) above Q 3. Data outside these limits are considered outliers. Data outside these limits are considered outliers.

49 49 Copyright © 2010, HJ Shanghai Normal Uni. Box Plot (Continued) n Whiskers (dashed lines) are drawn from the ends of the box to the smallest and largest data values inside the limits. n The locations of each outlier is shown with the symbol *.

50 50 Copyright © 2010, HJ Shanghai Normal Uni. Example: Apartment Rents n Box Plot Lower Limit: Q1 - 1.5(IQR) = 450 - 1.5(75) = 337.5 Lower Limit: Q1 - 1.5(IQR) = 450 - 1.5(75) = 337.5 Upper Limit: Q3 + 1.5(IQR) = 525 + 1.5(75) = 637.5 Upper Limit: Q3 + 1.5(IQR) = 525 + 1.5(75) = 637.5 There are no outliers. 37 5 40 0 42 5 45 0 47 5 50 0 52 5 550 575 600 625

51 51 Copyright © 2010, HJ Shanghai Normal Uni. Measures of Association Between Two Variables n Covariance (协方差) n Correlation Coefficient (相关系数)

52 52 Copyright © 2010, HJ Shanghai Normal Uni. n The covariance (协方差) is a measure of the linear association between two variables. n If the data sets are samples, the covariance is denoted by s xy. n If the data sets are populations, the covariance is denoted by. Covariance

53 53 Copyright © 2010, HJ Shanghai Normal Uni. Covariance n Positive values indicate a positive relationship. n Negative values indicate a negative relationship.

54 54 Copyright © 2010, HJ Shanghai Normal Uni. Correlation Coefficient (相关系数) n The coefficient can take on values between -1 and +1. n Values near -1 indicate a strong negative linear relationship. n Values near +1 indicate a strong positive linear relationship. n If the data sets are samples, the coefficient is r xy. n If the data sets are populations, the coefficient is.

55 55 Copyright © 2010, HJ Shanghai Normal Uni. The Weighted Mean and Working with Grouped Data n Weighted Mean (加权平均值) n Mean for Grouped Data (分组数据均值) n Variance for Grouped Data (分组数据方差) n Standard Deviation for Grouped Data (分组数据标 准差)

56 56 Copyright © 2010, HJ Shanghai Normal Uni. Weighted Mean (加权平均值) n When the mean is computed by giving each data value a weight that reflects its importance, it is referred to as a weighted mean. n In the computation of a grade point average (GPA), the weights are the number of credit hours earned for each grade. n When data values vary in importance, the analyst must choose the weight that best reflects the importance of each value.

57 57 Copyright © 2010, HJ Shanghai Normal Uni. Weighted Mean x =  w i x i x =  w i x i  w i  w iwhere: x i = value of observation i x i = value of observation i w i = weight for observation i w i = weight for observation i

58 58 Copyright © 2010, HJ Shanghai Normal Uni. Grouped Data n The weighted mean computation can be used to obtain approximations of the mean, variance, and standard deviation for the grouped data. n To compute the weighted mean, we treat the midpoint of each class as though it were the mean of all items in the class. n We compute a weighted mean of the class midpoints using the class frequencies as weights. n Similarly, in computing the variance and standard deviation, the class frequencies are used as weights.

59 59 Copyright © 2010, HJ Shanghai Normal Uni. n Sample Data n Population Data where: f i = frequency of class i f i = frequency of class i M i = midpoint of class i M i = midpoint of class i Mean for Grouped Data (分组数据均值)

60 60 Copyright © 2010, HJ Shanghai Normal Uni. Example: Apartment Rents Given below is the previous sample of monthly rents for one-bedroom apartments presented here as grouped data in the form of a frequency distribution.

61 61 Copyright © 2010, HJ Shanghai Normal Uni. Example: Apartment Rents n Mean for Grouped Data This approximation differs by $2.41 from This approximation differs by $2.41 from the actual sample mean of $490.80. the actual sample mean of $490.80.

62 62 Copyright © 2010, HJ Shanghai Normal Uni. Variance for Grouped Data (分组数据方差) n Sample Data n Population Data

63 63 Copyright © 2010, HJ Shanghai Normal Uni. Example: Apartment Rents n Variance for Grouped Data n Standard Deviation for Grouped Data (分组数据标 准差) This approximation differs by only $.20 from the actual standard deviation of $54.74.

64 64 Copyright © 2010, HJ Shanghai Normal Uni. 小结 n 中心位置的度量:均值、中位数、众数 n 数据集其它位置的描述:百分位数,四分位点 n 变异程度或分散程度:极差、四分位点内距、方差、 标准差、变异系数、 Z 分数、切比雪夫定理 n 构建五数概括法和箱形图 n 两变量之间的协方差和相关系数 n 加权平均值、分组数据的均值、方差和标准差

65 65 Copyright © 2010, HJ Shanghai Normal Uni.

66 66 Copyright © 2010, HJ Shanghai Normal Uni.

67 67 Copyright © 2010, HJ Shanghai Normal Uni. End of Chapter 3, Part B


Download ppt "1 1 Copyright © 2010, HJ Shanghai Normal Uni. Chapter 3 Descriptive Statistics: Numerical Methods n Measures of Location n Measures of Variability n Measure."

Similar presentations


Ads by Google