2 Chapter Outline Measures of Central Location: Mean, Median, Mode, Percentile (Quartile, Quintile, etc.). Measures of Variability: Range, Variance (Standard Deviation, Coefficient of Variation).
3 Recall A sample is a subset of a population. Numerical measures calculated for sample data are called sample statistics. Numerical measures calculated for population data are called population parameters. A sample statistic is referred to as the point estimator of the corresponding population parameter.
4 Mean As a measure of central location, the mean is simply the arithmetic average of all the data values. The sample mean $\bar{x}$ is the point estimator of the population mean $\mu$.
5 Sample Mean $\bar{x} = \frac{\sum x_i}{n}$ The symbol $\Sigma$ (called sigma) means ‘sum up’. $x_i$ is the value of the $i$th observation in the sample. $n$ is the number of observations in the sample.
6 Population Mean $\mu = \frac{\sum x_i}{N}$ The symbol $\Sigma$ (called sigma) means ‘sum up’. $x_i$ is the value of the $i$th observation in the population. $N$ is the number of observations in the population. $\mu$ is pronounced ‘miu’.
7 Sample Mean Example: Sales of Starbucks Stores Fifty Starbucks stores are randomly chosen in NYC. The table below shows the sales of those stores in December 2012.
9 Median The median of a data set is the value in the middle when the data items are arranged in ascending order. Whenever a data set has extreme values, the median is the preferred measure of central location. The median is the measure of location most often reported for annual income and property value data. A few extremely large incomes or property values can inflate the mean, since the calculation of the mean uses all the data items.
10 Median For an odd number of observations: 26 18 27 12 14 27 19. In ascending order: 12 14 18 19 26 27 27. The median is the middle value. Median = 19
11 Median For an even number of observations: 26 18 27 12 14 27 19 30. In ascending order: 12 14 18 19 26 27 27 30. The median is the average of the middle two values. Median = (19 + 26)/2 = 22.5
12 Mean vs. Median As noted, extreme values can change the mean remarkably, while the median might not be affected much by extreme values. In that regard, the median is a better representative of central location. Data: 12 14 18 19 26 27 27 30, then with 280 added. For the previous example, the median is 22.5 and the mean is 173/8 = 21.625. If we add one large number (280) to the data, the median becomes 26 (the value in the middle), but the mean becomes 453/9 ≈ 50.33. In this case we prefer the median to the mean as a measure of central location.
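The comparison on this slide can be checked with a few lines of Python. This is a minimal sketch using the standard-library `statistics` module; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean, median

# Data from the even-sized median example.
data = [26, 18, 27, 12, 14, 27, 19, 30]

# The same data with one extreme value (280) added.
data_with_extreme = data + [280]

mean_before = mean(data)
median_before = median(data)
mean_after = mean(data_with_extreme)
median_after = median(data_with_extreme)

print(f"without 280: mean = {mean_before:.3f}, median = {median_before}")
print(f"with 280:    mean = {mean_after:.3f}, median = {median_after}")
```

The mean more than doubles when 280 is added, while the median only moves from 22.5 to 26.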
13 Mode The mode of a data set is the value that occurs most frequently. The greatest frequency can occur at two or more different values. If the data have exactly two modes, the data are bimodal. If the data have more than two modes, the data are multimodal. Caution: If the data are bimodal or multimodal, Excel’s MODE function will incorrectly identify a single mode.
14 Mode 12 14 18 19 26 27 27 30 For the example above, 27 shows up twice while all the other data values show up once. So, the mode is 27.
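As a sketch of how to avoid the single-mode pitfall mentioned in the caution, Python's `statistics.multimode` (available from Python 3.8) returns every value that ties for the highest frequency. The second data set below is hypothetical and was made up only to show the bimodal case.

```python
from statistics import multimode

# Data from the slide: 27 appears twice, everything else once.
data = [12, 14, 18, 19, 26, 27, 27, 30]
modes = multimode(data)

# Hypothetical bimodal data (not from the slides): 14 and 27 both appear twice.
bimodal_data = [12, 14, 14, 18, 19, 26, 27, 27, 30]
bimodal_modes = multimode(bimodal_data)

print(modes)          # a list with a single mode
print(bimodal_modes)  # a list with both modes
```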
15 Percentiles A percentile provides information about how the data are spread over the interval from the smallest value to the largest value. Admission test scores for colleges and universities are frequently reported in terms of percentiles. The $p$th percentile of a data set is a value such that at least $p$ percent of the items are less than or equal to this value and at least $(100 - p)$ percent of the items are greater than or equal to this value. The 50th percentile is simply the median.
16 Percentiles Arrange the data in ascending order. Compute index $i$, the position of the $p$th percentile: $i = (p/100)n$. If $i$ is not an integer, round up; the $p$th percentile is the value in the $i$th position. If $i$ is an integer, the $p$th percentile is the average of the values in positions $i$ and $i + 1$.
17 Percentiles Find the 75th percentile of the following data: 12 14 18 19 26 27 29 30. Note: The data is already in ascending order. $i = (p/100)n = (75/100)8 = 6$. So, averaging the 6th and 7th data values: 75th percentile = (27 + 29)/2 = 28
18 Percentiles Find the 20th percentile of the following data: 12 14 18 19 26 27 29 30. Note: The data is already in ascending order. $i = (p/100)n = (20/100)8 = 1.6$, which is rounded up to 2. So, the 20th percentile is simply the 2nd data value, i.e. 14.
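The index rule from the percentile slides can be sketched in Python as follows. The function name `percentile` is illustrative, and the sketch assumes `p` is an integer strictly between 0 and 100 so that the test for an integer index can use exact integer arithmetic. Note that library routines such as `statistics.quantiles` or `numpy.percentile` use different interpolation rules by default and may give different answers.

```python
def percentile(values, p):
    """Return the p-th percentile using the rule i = (p/100)n.

    Assumes p is an integer with 0 < p < 100.
    """
    data = sorted(values)
    if not data:
        raise ValueError("values must not be empty")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")

    # i = p*n/100; 'whole' is its integer part, 'remainder' tells us
    # whether i is an integer (remainder == 0) or not.
    whole, remainder = divmod(p * len(data), 100)

    if remainder:
        # i is not an integer: round up to position whole + 1,
        # which is index 'whole' in 0-based Python indexing.
        return data[whole]
    # i is an integer: average the values in positions i and i + 1.
    return (data[whole - 1] + data[whole]) / 2


data = [12, 14, 18, 19, 26, 27, 29, 30]
p75 = percentile(data, 75)
p20 = percentile(data, 20)
print(p75, p20)
```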
19 Quartiles Quartiles are specific percentiles. First Quartile = 25th percentile. Second Quartile = 50th percentile = Median. Third Quartile = 75th percentile.
20 Measures of Variability It is often desirable to consider measures of variability (dispersion), as well as measures of central location. For example, suppose two stocks provide the same average return of 5% a year, but stock A’s return is very stable (close to 5%) while stock B’s return is volatile (it could be as low as -10%). Are you indifferent with regard to which stock to invest in? For another example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.
21 Measures of Variability Range, Interquartile Range, Variance/Standard Deviation, Coefficient of Variation
22 Range The range of a data set is the difference between the largest and smallest data values. It is the simplest measure of variability. It is very sensitive to the smallest and largest data values.
23 Range Example: 12 14 18 19 26 27 29 30. Range = largest value - smallest value = 30 - 12 = 18
24 Interquartile Range The interquartile range of a data set is the difference between the 3rd quartile and the 1st quartile. It is the range of the middle 50% of the data. It overcomes the sensitivity to extreme data values.
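As an illustration, the range and interquartile range of the running data set can be computed as below. This is a sketch only: the helper `slide_percentile` is a hypothetical name that applies the deck's own percentile rule ($i = (p/100)n$) and assumes an integer `p` strictly between 0 and 100; other software may compute quartiles differently.

```python
def slide_percentile(values, p):
    """p-th percentile via the rule i = (p/100)n (integer p, 0 < p < 100)."""
    data = sorted(values)
    whole, remainder = divmod(p * len(data), 100)
    if remainder:
        # i is not an integer: round up to the next position.
        return data[whole]
    # i is an integer: average positions i and i + 1.
    return (data[whole - 1] + data[whole]) / 2


data = [12, 14, 18, 19, 26, 27, 29, 30]

data_range = max(data) - min(data)   # largest value minus smallest value
q1 = slide_percentile(data, 25)      # first quartile
q3 = slide_percentile(data, 75)      # third quartile
iqr = q3 - q1                        # spread of the middle 50% of the data

print(f"range = {data_range}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```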
26 Variance The variance is a measure of variability that utilizes all the data. It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population). The variance is useful in comparing the variability of two or more variables.
27 Variance The variance is the average of the squared differences between each data value and the mean. The variance is calculated as follows: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ for a sample; $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ for a population.
28 Standard Deviation The standard deviation of a data set is the positive square root of the variance. It is measured in the same units as the data, making it easier to interpret than the variance.
29 Standard Deviation The standard deviation is computed as follows: $s = \sqrt{s^2}$ for a sample; $\sigma = \sqrt{\sigma^2}$ for a population.
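30 Variance and Standard Deviation Example: 12 14 18 19 26 27 29 30. Treating the data as a sample, $\bar{x} = 175/8 = 21.875$. Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{342.875}{7} \approx 48.98$. Standard Deviation: $s = \sqrt{48.98} \approx 7.00$.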
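The variance and standard deviation of this data set can be computed with Python's standard-library `statistics` module, as sketched below. Whether the slide treats the eight values as a sample or as a population is not stated, so both versions are shown; `variance`/`stdev` divide by $n - 1$ and `pvariance`/`pstdev` divide by $N$.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

data = [12, 14, 18, 19, 26, 27, 29, 30]

x_bar = mean(data)

# Sample versions: divide the sum of squared deviations by n - 1.
sample_var = variance(data)
sample_sd = stdev(data)

# Population versions: divide the sum of squared deviations by N.
population_var = pvariance(data)
population_sd = pstdev(data)

print(f"mean = {x_bar}")
print(f"sample:     variance = {sample_var:.3f}, std dev = {sample_sd:.3f}")
print(f"population: variance = {population_var:.3f}, std dev = {population_sd:.3f}")
```

The sample variance is always the larger of the two, because the same sum of squared deviations is divided by a smaller number.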
31 Coefficient of Variation The coefficient of variation indicates how large the standard deviation is in relation to the mean. In a comparison between two data sets with different units, or with the same units but a significant difference in magnitude, the coefficient of variation should be used instead of the variance.
32 Coefficient of Variation The coefficient of variation is computed as follows: $\left(\frac{s}{\bar{x}} \times 100\right)\%$ for a sample; $\left(\frac{\sigma}{\mu} \times 100\right)\%$ for a population.
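33 Coefficient of Variation Example: 12 14 18 19 26 27 29 30. Treating the data as a sample, with $\bar{x} = 21.875$ and $s \approx 7.00$: $\left(\frac{s}{\bar{x}} \times 100\right)\% = \left(\frac{7.00}{21.875} \times 100\right)\% \approx 32.0\%$.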
34 Coefficient of Variation Example: Height vs. Weight In a class of 30 students, the average height is 5’5’’ with a standard deviation of 3’’ and the average weight is 120 lbs with a standard deviation of 20 lbs. Question: in which measure (height or weight) do the students differ more? Since height and weight don’t have the same unit, we have to use the coefficient of variation to remove the units before comparing the variations in height and weight. With 5’5’’ = 65’’, the coefficient of variation is (3/65 × 100)% ≈ 4.6% for height and (20/120 × 100)% ≈ 16.7% for weight, so students’ weight is more variable than their height.
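Both coefficient-of-variation examples can be reproduced with a short Python sketch. The function name `coefficient_of_variation` is illustrative, the eight-value data set is treated as a sample (an assumption), and the height of 5’5’’ is converted to 65 inches.

```python
from statistics import mean, stdev


def coefficient_of_variation(std_dev, average):
    """Standard deviation as a percentage of the mean."""
    return std_dev / average * 100


# Example 1: the eight-value data set, treated as a sample.
data = [12, 14, 18, 19, 26, 27, 29, 30]
data_cv = coefficient_of_variation(stdev(data), mean(data))

# Example 2: height vs. weight, from the summary figures on the slide.
height_cv = coefficient_of_variation(3, 65)     # 5'5" = 65 inches
weight_cv = coefficient_of_variation(20, 120)   # pounds

print(f"data set CV = {data_cv:.1f}%")
print(f"height CV   = {height_cv:.1f}%")
print(f"weight CV   = {weight_cv:.1f}%")
```

Because the units cancel in the ratio, the two percentages for height and weight can be compared directly.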
35 Measures of Distribution Shape, Relative Location, and Detecting Outliers z-Scores, Chebyshev’s Theorem, Empirical Rule, Detecting Outliers
36 Distribution Shape: Skewness An important measure of the shape of a distribution is called skewness. A commonly used formula for the skewness of sample data is $\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i - \bar{x}}{s}\right)^3$. Skewness can be easily computed using statistical software.
37 Distribution Shape: Skewness Symmetric (not skewed): Skewness is zero. Mean and median are equal. [Figure: relative frequency histogram of a symmetric distribution, Skewness = 0]
38 Distribution Shape: Skewness Skewed to the left: Skewness is negative. Mean is usually less than the median. [Figure: relative frequency histogram of a left-skewed distribution]
39 Distribution Shape: Skewness Skewed to the right: Skewness is positive. Mean is usually more than the median. [Figure: relative frequency histogram of a right-skewed distribution, Skewness = .31]
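The sign behaviour described on these three slides can be illustrated with a small Python sketch. It assumes the adjusted sample skewness formula $\frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i - \bar{x}}{s}\right)^3$, which is one of several formulas in use and may not be the exact one on the original slide. The three data sets are made up for illustration.

```python
from statistics import mean, stdev


def sample_skewness(values):
    """Adjusted sample skewness: n/((n-1)(n-2)) * sum(((x - mean)/s)^3)."""
    n = len(values)
    if n < 3:
        raise ValueError("at least three observations are required")
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in values)


# Hypothetical data sets chosen only to show the sign of the result.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 10]   # long tail of large values
left_skewed = [1, 8, 9, 9, 10]    # long tail of small values

print(f"symmetric:    {sample_skewness(symmetric):.3f}")
print(f"right-skewed: {sample_skewness(right_skewed):.3f}")
print(f"left-skewed:  {sample_skewness(left_skewed):.3f}")
```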
40 Z-Scores The z-score is often called the standardized value. It denotes the number of standard deviations a data value $x_i$ is from the mean: $z_i = \frac{x_i - \bar{x}}{s}$. Excel’s STANDARDIZE function can be used to compute the z-score.
41 Z-Scores An observation’s z-score is a measure of the relative location of the observation in a data set. A data value less than the sample mean has a negative z-score. A data value greater than the sample mean has a positive z-score. A data value equal to the sample mean has a z-score of zero.
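A minimal Python sketch of the z-score calculation, assuming the sample mean and sample standard deviation are used (so $z_i = (x_i - \bar{x})/s$). The function name `z_scores` is illustrative.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return the z-score of every value, using the sample mean and std dev."""
    x_bar = mean(values)
    s = stdev(values)
    return [(x - x_bar) / s for x in values]


data = [12, 14, 18, 19, 26, 27, 29, 30]
z = z_scores(data)

for x, score in zip(data, z):
    # Values below the mean print a negative score, values above a positive one.
    print(f"x = {x:2d}  z = {score:+.2f}")
```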
43 Chebyshev’s Theorem At least $(1 - 1/z^2)$ of the items in any data set will be within $z$ standard deviations of the mean, i.e. between $(\bar{x} - zs)$ and $(\bar{x} + zs)$, where $z$ is any value greater than 1. Chebyshev’s theorem requires $z > 1$, but $z$ need not be an integer.
44 Chebyshev’s Theorem At least 55.6% of the data values must be within z = 1.5 standard deviations of the mean. At least 89% of the data values must be within z = 3 standard deviations of the mean. At least 94% of the data values must be within z = 4 standard deviations of the mean.
45 Chebyshev’s Theorem Example: Given that $\bar{x} = 10$ and $s = 2$, at least what percentage of all the data values fall within 2 standard deviations of the mean? At least $(1 - 1/2^2) = 1 - 1/4 = 75\%$ of all the data values must be between 6 and 14, since $\bar{x} - zs = 10 - 2(2) = 6$ and $\bar{x} + zs = 10 + 2(2) = 14$.
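The Chebyshev bound and the corresponding interval can be sketched in Python as follows. The function names are illustrative and not from the slides.

```python
def chebyshev_lower_bound(z):
    """Minimum share of data within z standard deviations of the mean (z > 1)."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z**2


def chebyshev_interval(mean, std_dev, z):
    """Interval from mean - z*std_dev to mean + z*std_dev."""
    return mean - z * std_dev, mean + z * std_dev


# Bounds quoted on the slides.
for z in (1.5, 2, 3, 4):
    print(f"z = {z}: at least {chebyshev_lower_bound(z):.1%}")

# Worked example: mean 10, standard deviation 2, z = 2.
low, high = chebyshev_interval(10, 2, 2)
print(f"at least {chebyshev_lower_bound(2):.0%} of values lie between {low} and {high}")
```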
46 Empirical Rule When the data are believed to approximate a bell-shaped distribution, the empirical rule can be used to determine the percentage of data values that must be within a specified number of standard deviations of the mean. The empirical rule is based on the normal distribution, which is covered in Chapter 6.
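47 Empirical Rule For data having a bell-shaped distribution: About 68% of values of a normal random variable are between $\mu - \sigma$ and $\mu + \sigma$. About 95% of values of a normal random variable are between $\mu - 2\sigma$ and $\mu + 2\sigma$. About 99.7% of values of a normal random variable are between $\mu - 3\sigma$ and $\mu + 3\sigma$.
48 Empirical Rule [Figure: bell-shaped curve over $x$, with the horizontal axis marked at $\mu - 3\sigma$, $\mu - 2\sigma$, $\mu - \sigma$, $\mu + \sigma$, $\mu + 2\sigma$ and $\mu + 3\sigma$; about 99.7% of values lie between $\mu - 3\sigma$ and $\mu + 3\sigma$.]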
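The empirical-rule percentages can be checked against the standard normal distribution with Python's `statistics.NormalDist` (available from Python 3.8). This is a sketch; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")
```

These exact shares are noticeably higher than the Chebyshev lower bounds for the same number of standard deviations, because the empirical rule assumes a bell shape while Chebyshev's theorem applies to any data set.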
49 Detecting Outliers An outlier is an unusually small or unusually large value in a data set. A data value with a z-score less than -3 or greater than +3 might be considered an outlier. It might be: an incorrectly recorded data value; a data value that was incorrectly included in the data set; a correctly recorded data value that belongs in the data set.
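A minimal sketch of the z-score rule for flagging possible outliers, assuming the sample mean and sample standard deviation. The function name `find_outliers` and the 17-value data set are hypothetical: the data repeat the eight values from the median example twice and add 280.

```python
from statistics import mean, stdev


def find_outliers(values, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    x_bar = mean(values)
    s = stdev(values)
    return [x for x in values if abs((x - x_bar) / s) > threshold]


# Hypothetical data: sixteen ordinary values plus one extreme value.
data = [12, 14, 18, 19, 26, 27, 27, 30] * 2 + [280]
outliers = find_outliers(data)
print(outliers)
```

The rule needs a reasonably large data set to flag anything. In the nine-value example used earlier for mean vs. median, 280 has a z-score of only about 2.66, so it would not be flagged even though it is clearly extreme.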
50 Measures of Association Between Two Variables So far, we have examined numerical methods used to summarize the data for one variable at a time. Often a manager or decision maker is interested in the relationship between two variables. Two numerical measures of the relationship between two variables are covariance and correlation coefficient.
51 Covariance The covariance is a measure of the linear association between two variables. Positive values indicate a positive relationship. Negative values indicate a negative relationship.
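52 Covariance The covariance is computed as follows: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ for samples; $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$ for populations.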
53 Correlation Coefficient Correlation is a measure of linear association and not necessarily causation. Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.
54 Correlation Coefficient The correlation coefficient is computed as follows: $r_{xy} = \frac{s_{xy}}{s_x s_y}$ for samples; $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$ for populations.
55 Correlation Coefficient The correlation can take on values between -1 and +1. Values near -1 indicate a strong negative linear relationship. Values near +1 indicate a strong positive linear relationship. The closer the correlation is to zero, the weaker the relationship.
56 Covariance and Correlation Coefficient Example: Stock Returns The table below presents the monthly returns (in percent) of the market index S&P 500 (SPY) and Apple stock (AAPL) from December 2012 to May 2013.
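The sample covariance and sample correlation coefficient can be sketched in plain Python as below. The function names are illustrative, and the two return series are made-up numbers used only to exercise the code; they are not real market data.

```python
from math import sqrt
from statistics import mean


def sample_covariance(x, y):
    """Sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)


def sample_correlation(x, y):
    """Sample covariance divided by the product of the sample standard deviations."""
    s_x = sqrt(sample_covariance(x, x))  # covariance of x with itself is its variance
    s_y = sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)


# Hypothetical monthly returns in percent (made up for illustration).
market_returns = [1.0, -2.0, 3.0, 0.5, -1.0, 2.5]
stock_returns = [2.0, -3.5, 4.0, 1.5, -2.5, 3.0]

cov = sample_covariance(market_returns, stock_returns)
corr = sample_correlation(market_returns, stock_returns)
print(f"covariance = {cov:.3f}, correlation = {corr:.3f}")
```

Unlike the covariance, the correlation does not depend on the units of the two variables, which is why it is bounded between -1 and +1.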
57 Covariance and Correlation Coefficient Example: Stock Returns