Chapter 4: Descriptive Statistics (Part 2)
Standardized Data
Percentiles and Quartiles
Box Plots
Grouped Data
Skewness and Kurtosis (optional)

Standardized Data: Chebyshev’s Theorem
Developed by mathematicians Jules Bienaymé (1796–1878) and Pafnuty Chebyshev (1821–1894).
For any population with mean $\mu$ and standard deviation $\sigma$, the percentage of observations that lie within $k$ standard deviations of the mean must be at least $100[1 - 1/k^2]$ (the bound is informative only for $k > 1$).

Standardized Data: Chebyshev’s Theorem
For $k = 2$ standard deviations, $100[1 - 1/2^2] = 75\%$, so at least 75.0% will lie within $\mu \pm 2\sigma$.
For $k = 3$ standard deviations, $100[1 - 1/3^2] = 88.9\%$, so at least 88.9% will lie within $\mu \pm 3\sigma$.
Although applicable to any data set, these limits tend to be too wide to be useful.
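
The Chebyshev bound is simple arithmetic, and a minimal sketch can reproduce the two percentages quoted above. The function name `chebyshev_lower_bound` is a hypothetical helper introduced only for illustration.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the formula gives a bound of 0% or less, which says nothing.
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 100 * (1 - 1 / k**2)


for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1f}%")
```

The bound holds for any distribution, which is why it is so much looser than the figures given by the Empirical Rule for normal data.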

Standardized Data: The Empirical Rule
The normal or Gaussian distribution was named for Karl Gauss (1777–1855).
The normal distribution is symmetric and is also known as the bell-shaped curve.
The Empirical Rule states that for data from a normal distribution, we expect that for
$k = 1$, about 68.27% will lie within $\mu \pm 1\sigma$
$k = 2$, about 95.45% will lie within $\mu \pm 2\sigma$
$k = 3$, about 99.73% will lie within $\mu \pm 3\sigma$

Standardized Data: The Empirical Rule
Distance from the mean is measured in terms of the number of standard deviations.
Note: no upper bound is given. Data values outside $\mu \pm 3\sigma$ are rare.

Standardized Data: Example: Exam Scores
If 80 students take an exam, how many will score within 2 standard deviations of the mean?
Assuming exam scores follow a normal distribution, the Empirical Rule states that about 95.45% will lie within $\mu \pm 2\sigma$, so about $0.9545 \times 80 \approx 76$ students will score within $2\sigma$ of $\mu$.
How many students will score more than 2 standard deviations from the mean?
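
The Empirical Rule percentages and the exam-score arithmetic can be checked with Python's standard library. This is a sketch that assumes exactly normal data; `pct_within` is a hypothetical helper name, and to two decimals the computed percentages are 68.27, 95.45, and 99.73.

```python
from statistics import NormalDist


def pct_within(k: float) -> float:
    """Percentage of a normal population within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return 100 * (z.cdf(k) - z.cdf(-k))


for k in (1, 2, 3):
    print(f"k = {k}: about {pct_within(k):.2f}%")

# Exam example: 80 students, normality assumed.
n_students = 80
expected_within_2 = n_students * pct_within(2) / 100
expected_beyond_2 = n_students - expected_within_2
print(f"within 2 sd: about {expected_within_2:.0f} students")
print(f"beyond 2 sd: about {expected_beyond_2:.0f} students")
```

The second figure answers the question posed on the slide: roughly 4 of the 80 students would be expected to score more than 2 standard deviations from the mean.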

Standardized Data: Unusual Observations
Unusual observations are those that lie beyond $\mu \pm 2\sigma$.
Outliers are observations that lie beyond $\mu \pm 3\sigma$.

Standardized Data: Unusual Observations
For example, the P/E ratio data contain several large data values. Are they unusual or outliers?
7, 8, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91
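
Standardized Data: The Empirical Rule
If the sample came from a normal distribution, then with $\bar{x} = 22.72$ and $s = 14.08$ the Empirical Rule gives these intervals:
$\bar{x} \pm 1s = 22.72 \pm 1(14.08) = (8.6, 36.8)$
$\bar{x} \pm 2s = 22.72 \pm 2(14.08) = (-5.4, 50.9)$
$\bar{x} \pm 3s = 22.72 \pm 3(14.08) = (-19.5, 65.0)$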
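
Standardized Data: The Empirical Rule
Are there any unusual values or outliers?
[Figure: P/E ratios on a number line centered at $\bar{x} = 22.72$. Values beyond the $\pm 2s$ limits ($-5.4$ and $50.9$) are labeled Unusual; values beyond the $\pm 3s$ limits ($-19.5$ and $65.0$) are labeled Outliers.]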
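
As a rough check of the interval arithmetic, the sketch below recomputes the limits from the sample statistics quoted in the slides (mean 22.72, standard deviation 14.08) and labels a few of the largest P/E values. The helper names and the label "typical" are illustrative choices, not terms from the source.

```python
def k_sigma_interval(mean: float, sd: float, k: float) -> tuple[float, float]:
    """Return the interval (mean - k*sd, mean + k*sd)."""
    return mean - k * sd, mean + k * sd


def classify(x: float, mean: float, sd: float) -> str:
    """Label x using the 2- and 3-standard-deviation limits."""
    distance = abs(x - mean)
    if distance > 3 * sd:
        return "outlier"
    if distance > 2 * sd:
        return "unusual"
    return "typical"  # illustrative label for everything inside 2 sd


mean, sd = 22.72, 14.08  # sample statistics quoted in the slides

for k in (1, 2, 3):
    low, high = k_sigma_interval(mean, sd, k)
    print(f"k = {k}: ({low:.1f}, {high:.1f})")

for x in (48, 55, 68, 91):  # a few of the largest P/E values
    print(x, classify(x, mean, sd))
```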

Standardized Data: Defining a Standardized Variable
A standardized variable ($Z$) redefines each observation in terms of the number of standard deviations from the mean.
Standardization formula for a population: $z_i = \dfrac{x_i - \mu}{\sigma}$
Standardization formula for a sample: $z_i = \dfrac{x_i - \bar{x}}{s}$

Standardized Data: Defining a Standardized Variable
$z_i$ tells how far away the observation is from the mean.
For example, for the P/E data, the first value is $x_1 = 7$. The associated $z$ value is
$z_1 = \dfrac{x_1 - \bar{x}}{s} = \dfrac{7 - 22.72}{14.08} = -1.12$

Standardized Data: Defining a Standardized Variable
A negative $z$ value means the observation is below the mean.
A positive $z$ value means the observation is above the mean. For $x_{68} = 91$,
$z_{68} = \dfrac{x_{68} - \bar{x}}{s} = \dfrac{91 - 22.72}{14.08} = 4.85$
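
The two worked z values can be reproduced in a few lines. The sketch below also standardizes a small sample using its own mean and sample standard deviation; the five-value list is a hypothetical subset chosen for illustration, not the full P/E data set.

```python
from statistics import mean, stdev


def z_score(x: float, center: float, spread: float) -> float:
    """Number of standard deviations that x lies from the center."""
    return (x - center) / spread


# Values quoted in the slides: x-bar = 22.72, s = 14.08.
print(round(z_score(7, 22.72, 14.08), 2))   # first observation
print(round(z_score(91, 22.72, 14.08), 2))  # last observation


def standardize_sample(values: list[float]) -> list[float]:
    """Standardize each value using the sample mean and sample standard deviation."""
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [z_score(x, x_bar, s) for x in values]


sample = [7, 12, 19, 26, 91]  # hypothetical values for illustration only
z_values = standardize_sample(sample)
print([round(z, 2) for z in z_values])
```

After standardizing, the z values have mean 0 and sample standard deviation 1, which is what makes them comparable across data sets.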

Standardized Data: Defining a Standardized Variable
Here are the standardized $z$ values for the P/E data:
[Table: standardized z values for the P/E data]
What do you conclude for these four values?

Standardized Data: Defining a Standardized Variable
MegaStat calculates standardized values and checks for outliers.
In Excel, use =STANDARDIZE(x, mean, standard_dev) to calculate the standardized z value for a single observation x.

Standardized Data: Outliers
What do we do with outliers in a data set?
If an outlier is due to erroneous data, discard it.
An outrageous observation (one completely outside of an expected range) is certainly invalid.
Recognize unusual data points and outliers and their potential impact on your study.
Research books and articles on how to handle outliers.

Standardized Data: Estimating Sigma
For a normal distribution, nearly all values fall within a range of $6\sigma$ (from $\mu - 3\sigma$ to $\mu + 3\sigma$).
If you know the range $R$ (high minus low), you can estimate the standard deviation as $\sigma \approx R/6$.
This is useful for approximating the standard deviation when only $R$ is known.
The estimate depends on the assumption of normality.
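
A minimal sketch of the range-based estimate follows. The low and high values are hypothetical, and the result is only as good as the normality assumption.

```python
def estimate_sigma_from_range(low: float, high: float) -> float:
    """Rough standard deviation estimate R/6, assuming roughly normal data."""
    if high < low:
        raise ValueError("high must be at least as large as low")
    return (high - low) / 6


# Hypothetical example: scores known only to run from 40 to 100.
print(estimate_sigma_from_range(40, 100))
```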

Percentiles and Quartiles
Percentiles are data that have been divided into 100 groups.
For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.
Deciles are data that have been divided into 10 groups.
Quintiles are data that have been divided into 5 groups.
Quartiles are data that have been divided into 4 groups.

Percentiles and Quartiles
Percentiles are used to establish benchmarks for comparison purposes (e.g., health care, manufacturing, and banking industries use the 5th, 25th, 50th, 75th, and 90th percentiles).
Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios.
Percentiles are used in employee merit evaluation and salary benchmarking.

Percentiles and Quartiles
Quartiles are scale points that divide the sorted data into four groups of approximately equal size.
Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%
The three values that separate the four groups are called Q1, Q2, and Q3, respectively.

Percentiles and Quartiles
The second quartile Q2 is the median, an important indicator of central tendency.
Lower 50% | Q2 | Upper 50%
Q1 and Q3 measure dispersion, since the interquartile range Q3 – Q1 measures the degree of spread in the middle 50 percent of data values.
Lower 25% | Q1 | Middle 50% | Q3 | Upper 25%

Percentiles and Quartiles
The first quartile Q1 is the median of the data values below Q2, and the third quartile Q3 is the median of the data values above Q2.
Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%
For the first half of the data, 50% lie above and 50% lie below Q1.
For the second half of the data, 50% lie above and 50% lie below Q3.

Percentiles and Quartiles
Depending on n, the quartiles Q1, Q2, and Q3 may be members of the data set or may lie between two of the sorted data values.

Percentiles and Quartiles: Method of Medians
For small data sets, find quartiles using the method of medians:
Step 1. Sort the observations.
Step 2. Find the median Q2.
Step 3. Find the median of the data values that lie below Q2.
Step 4. Find the median of the data values that lie above Q2.
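
The four steps can be sketched in Python as follows. One assumption is made explicit here: "below Q2" and "above Q2" are taken by position in the sorted list (the lower and upper halves, excluding the middle value when n is odd), so tied values equal to Q2 can land in either half. The two small lists are data sets A and B from this chapter's Caution slide.

```python
from statistics import median


def quartiles_by_medians(values):
    """Return (Q1, Q2, Q3) using the method of medians."""
    data = sorted(values)            # Step 1: sort the observations
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    q2 = median(data)                # Step 2: median of all the data
    lower_half = data[: n // 2]      # positions below the median
    upper_half = data[(n + 1) // 2:] # positions above the median
    q1 = median(lower_half)          # Step 3: median of the lower half
    q3 = median(upper_half)          # Step 4: median of the upper half
    return q1, q2, q3


data_set_a = [1, 2, 4, 4, 8, 8, 8, 8]
data_set_b = [0, 3, 3, 6, 6, 6, 10, 15]
print(quartiles_by_medians(data_set_a))
print(quartiles_by_medians(data_set_b))
```

Both calls give the same three quartiles even though the two data sets look quite different, which is the point of the Caution slide.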

Percentiles and Quartiles: Excel Quartiles
Use the Excel function =QUARTILE(Array, k) to return the kth quartile.
Excel treats quartiles as a special case of percentiles. For example, to calculate Q3:
=QUARTILE(Array, 3)
=PERCENTILE(Array, 0.75)
Excel calculates the quartile positions as:
Position of Q1: 0.25n + 0.75
Position of Q2: 0.50n + 0.50
Position of Q3: 0.75n + 0.25

Percentiles and Quartiles: Example: P/E Ratios and Quartiles
Consider the following P/E ratios for 68 stocks in a portfolio.
7, 8, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91
Use quartiles to define benchmarks for stocks that are low-priced (bottom quartile) or high-priced (top quartile).

Percentiles and Quartiles: Example: P/E Ratios and Quartiles
Using Excel’s method of interpolation, the quartile positions are:
Quartile   Position formula             Interpolate between
Q1         0.25(68) + 0.75 = 17.75      $X_{17}$ and $X_{18}$
Q2         0.50(68) + 0.50 = 34.50      $X_{34}$ and $X_{35}$
Q3         0.75(68) + 0.25 = 51.25      $X_{51}$ and $X_{52}$

Percentiles and Quartiles: Example: P/E Ratios and Quartiles
The quartiles are:
First (Q1): $Q_1 = X_{17} + 0.75\,(X_{18} - X_{17}) = 14 + 0.75\,(14 - 14) = 14$
Second (Q2): $Q_2 = X_{34} + 0.50\,(X_{35} - X_{34}) = 19 + 0.50\,(19 - 19) = 19$
Third (Q3): $Q_3 = X_{51} + 0.25\,(X_{52} - X_{51}) = 26 + 0.25\,(26 - 26) = 26$
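
The quoted positions 17.75, 34.50, and 51.25 for n = 68 correspond to the rule position = 1 + p(n – 1), that is, 0.25n + 0.75, 0.50n + 0.50, and 0.75n + 0.25. The sketch below implements that interpolation and, as a cross-check, compares it with `statistics.quantiles(..., method="inclusive")` from Python's standard library, which uses the same position rule. The function names and the eight-value data list are hypothetical, chosen only for illustration.

```python
from statistics import quantiles


def quartile_position(n: int, k: int) -> float:
    """1-based position of the kth quartile (k = 1, 2, 3) in n sorted values."""
    p = k / 4
    return p * n + (1 - p)  # same as 1 + p * (n - 1)


def interpolated_quartile(values, k: int) -> float:
    """kth quartile by linear interpolation between neighboring sorted values."""
    data = sorted(values)
    n = len(data)
    pos = quartile_position(n, k)
    lower_index = int(pos)         # 1-based index of the value just below pos
    fraction = pos - lower_index   # how far to move toward the next value
    if lower_index >= n:           # only happens for a single observation
        return float(data[-1])
    below, above = data[lower_index - 1], data[lower_index]
    return below + fraction * (above - below)


print([quartile_position(68, k) for k in (1, 2, 3)])

data = [2, 4, 4, 5, 7, 9, 10, 12]  # hypothetical values
ours = [interpolated_quartile(data, k) for k in (1, 2, 3)]
print(ours)
print(quantiles(data, n=4, method="inclusive"))
```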

Percentiles and Quartiles: Example: P/E Ratios and Quartiles
So, to summarize:
Lower 25% of P/E ratios | Q1 = 14 | Second 25% of P/E ratios | Q2 = 19 | Third 25% of P/E ratios | Q3 = 26 | Upper 25% of P/E ratios
These quartiles express central tendency and dispersion. What is the interquartile range?
Because of clustering of identical data values, these quartiles do not provide clean cut points between groups of observations.

Percentiles and Quartiles: Tip
Whether you use the method of medians or Excel, your quartiles will be about the same. Small differences in calculation techniques typically do not lead to different conclusions in business applications.

Percentiles and Quartiles: Caution
Quartiles generally resist outliers.
However, quartiles do not provide clean cut points in the sorted data, especially in small samples with repeating data values.
Data set A: 1, 2, 4, 4, 8, 8, 8, 8
Data set B: 0, 3, 3, 6, 6, 6, 10, 15
For both data sets, Q1 = 3, Q2 = 6, Q3 = 8.
Although they have identical quartiles, these two data sets are not similar. The quartiles do not represent either data set well.

Percentiles and Quartiles: Dispersion Using Quartiles
Some robust measures of central tendency and dispersion using quartiles are:
Statistic: Midhinge
Formula: $(Q_1 + Q_3)/2$
Excel: =0.5*(QUARTILE(Data,1)+QUARTILE(Data,3))
Pro: Robust to presence of extreme data values.
Con: Less familiar to most people.

Percentiles and Quartiles: Dispersion Using Quartiles
Statistic: Midspread
Formula: $Q_3 - Q_1$
Excel: =QUARTILE(Data,3)-QUARTILE(Data,1)
Pro: Stable when extreme data values exist.
Con: Ignores magnitude of extreme data values.
Statistic: Coefficient of quartile variation (CQV)
Formula: $100 \times \dfrac{Q_3 - Q_1}{Q_3 + Q_1}$
Excel: None
Pro: Relative variation in percent, so we can compare data sets.
Con: Less familiar to non-statisticians.

Percentiles and Quartiles: Midhinge
The mean of the first and third quartiles.
Midhinge = $\dfrac{Q_1 + Q_3}{2}$
For the 68 P/E ratios, Midhinge = $\dfrac{14 + 26}{2} = 20$
A robust measure of central tendency, since quartiles ignore extreme values.

Percentiles and Quartiles: Midspread (Interquartile Range)
A robust measure of dispersion.
Midspread = $Q_3 - Q_1$
For the 68 P/E ratios, Midspread = $Q_3 - Q_1 = 26 - 14 = 12$
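
Percentiles and Quartiles: Coefficient of Quartile Variation (CQV)
Measures relative dispersion by expressing the midspread as a percent of the sum of the quartiles (twice the midhinge).
CQV = $100 \times \dfrac{Q_3 - Q_1}{Q_3 + Q_1}$
For the 68 P/E ratios, CQV = $100 \times \dfrac{26 - 14}{26 + 14} = 30.0\%$
Similar to the CV, the CQV can be used to compare data sets measured in different units or with different means.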
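
The three quartile-based statistics are one-line calculations. The sketch below applies them to the P/E quartiles Q1 = 14 and Q3 = 26, using the usual definition CQV = 100 × (Q3 – Q1)/(Q3 + Q1); the function names are hypothetical helpers.

```python
def midhinge(q1: float, q3: float) -> float:
    """Mean of the first and third quartiles (robust center)."""
    return (q1 + q3) / 2


def midspread(q1: float, q3: float) -> float:
    """Interquartile range Q3 - Q1 (robust spread)."""
    return q3 - q1


def cqv(q1: float, q3: float) -> float:
    """Coefficient of quartile variation, in percent."""
    return 100 * (q3 - q1) / (q3 + q1)


q1, q3 = 14, 26  # quartiles of the 68 P/E ratios
print(midhinge(q1, q3), midspread(q1, q3), cqv(q1, q3))
```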

Box Plots
A useful tool of exploratory data analysis (EDA). Also called a box-and-whisker plot.
Based on a five-number summary: $X_{min}$, $Q_1$, $Q_2$, $Q_3$, $X_{max}$
Consider the five-number summary for the 68 P/E ratios:
$X_{min} = 7$, $Q_1 = 14$, $Q_2 = 19$, $Q_3 = 26$, $X_{max} = 91$

Box Plots
[Figure: box plot of right-skewed data. Labeled parts: minimum, Q1, median (Q2), Q3, maximum, box, and whiskers. The center of the box is the midhinge.]

Box Plots: Fences and Unusual Data Values
Use quartiles to detect unusual data points.
The cut points are called fences and can be found using the following formulas:
               Inner fences                 Outer fences
Lower fence    $Q_1 - 1.5\,(Q_3 - Q_1)$     $Q_1 - 3.0\,(Q_3 - Q_1)$
Upper fence    $Q_3 + 1.5\,(Q_3 - Q_1)$     $Q_3 + 3.0\,(Q_3 - Q_1)$
Values outside the inner fences are unusual, while those outside the outer fences are outliers.

Box Plots: Fences and Unusual Data Values
For example, consider the P/E ratio data:
               Inner fences                      Outer fences
Lower fence    $14 - 1.5\,(26 - 14) = -4$        $14 - 3.0\,(26 - 14) = -22$
Upper fence    $26 + 1.5\,(26 - 14) = +44$       $26 + 3.0\,(26 - 14) = +62$
Ignore the lower fences since they are negative and P/E ratios are only positive.

Box Plots: Fences and Unusual Data Values
Truncate the whisker at the fences and display unusual values and outliers as dots.
[Figure: box plot of the P/E ratios with the inner and outer fences marked; unusual values and outliers are shown as dots.]
Based on these fences, there are three unusual P/E values and two outliers.
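
The fence calculation and the resulting counts can be sketched as follows, using Q1 = 14 and Q3 = 26 from the P/E example. The helper names and the label "inside fences" are illustrative, and the seven values tested are the largest distinct values listed in the P/E data.

```python
def fences(q1: float, q3: float):
    """Return ((inner_low, inner_high), (outer_low, outer_high))."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer


def classify_by_fences(x: float, q1: float, q3: float) -> str:
    """Label x as an outlier, unusual, or inside the inner fences."""
    inner, outer = fences(q1, q3)
    if x < outer[0] or x > outer[1]:
        return "outlier"
    if x < inner[0] or x > inner[1]:
        return "unusual"
    return "inside fences"


q1, q3 = 14, 26
print(fences(q1, q3))

largest_values = [40, 41, 45, 48, 55, 68, 91]
labels = [classify_by_fences(x, q1, q3) for x in largest_values]
for x, label in zip(largest_values, labels):
    print(x, label)
```

Unlike the 2- and 3-standard-deviation limits, the fences are built from quartiles, so the extreme values themselves have little influence on where the cut points fall.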

Grouped Data: Nature of Grouped Data
Although some information is lost, grouped data are easier to display than raw data.
When bin limits are given, the mean and standard deviation can be estimated.
Accuracy of grouped estimates depends on
- the number of bins
- the distribution of data within bins
- bin frequencies

Grouped Data: Mean and Standard Deviation
Consider the frequency distribution for prices of Lipitor® for three cities:
[Table: frequency distribution of Lipitor® prices]
Notation: $m_j$ = class midpoint, $f_j$ = class frequency, $k$ = number of classes, $n$ = sample size.
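
Grouped Data: Mean and Standard Deviation
Estimate the mean and standard deviation by
$\bar{x} = \dfrac{\sum_{j=1}^{k} f_j m_j}{n}$
$s = \sqrt{\dfrac{\sum_{j=1}^{k} f_j (m_j - \bar{x})^2}{n - 1}}$
Note: don’t round off too soon.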
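
The grouped estimates weight each class midpoint by its frequency: the mean is the sum of f_j·m_j divided by n, and the standard deviation uses the sum of f_j·(m_j – mean)² divided by n – 1. The sketch below applies these standard formulas to a hypothetical three-bin frequency table; the bins and frequencies are invented for illustration and are not the Lipitor® data.

```python
from math import sqrt


def grouped_mean_sd(bins):
    """Estimate mean and sample standard deviation from grouped data.

    bins: list of (lower_limit, upper_limit, frequency) tuples.
    """
    midpoints = [(low + high) / 2 for low, high, _ in bins]  # m_j
    freqs = [f for _, _, f in bins]                          # f_j
    n = sum(freqs)
    if n < 2:
        raise ValueError("need at least two observations in total")
    x_bar = sum(f * m for f, m in zip(freqs, midpoints)) / n
    sum_sq = sum(f * (m - x_bar) ** 2 for f, m in zip(freqs, midpoints))
    return x_bar, sqrt(sum_sq / (n - 1))


# Hypothetical frequency distribution: (lower limit, upper limit, frequency)
hypothetical_bins = [(0, 10, 3), (10, 20, 7), (20, 30, 5)]
x_bar, s = grouped_mean_sd(hypothetical_bins)
cv = 100 * s / x_bar  # coefficient of variation, in percent
print(x_bar, s, cv)   # full precision kept; round only when reporting
```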

Grouped Data: Accuracy Issues
Now estimate the coefficient of variation: CV = $100\,(s / \bar{x})$ = 9.2%
How accurate are grouped estimates compared to ungrouped estimates?
For the previous example, we can compare the grouped data statistics to the ungrouped data statistics.

Grouped Data: Accuracy Issues
For this example, very little information was lost due to grouping.
However, accuracy could be lost due to the nature of the grouping (i.e., if the data values were not evenly spaced within bins).

Grouped Data: Accuracy Issues
The dot plot shows a relatively even distribution within the bins.
Effects of uneven distributions within bins tend to average out unless there is systematic skewness.

Grouped Data: Accuracy Issues
Accuracy tends to improve as the number of bins increases.
If the first or last class is open-ended, there will be no class midpoint (no mean can be estimated).
Assume a lower limit of zero for the first class when the data are nonnegative.
You may be able to assume an upper limit for some variables (e.g., age).
Median and quartiles may be estimated even with open-ended classes.

Skewness and Kurtosis: Skewness
Generally, skewness may be indicated by looking at the sample histogram or by comparing the mean and median.
This visual indicator is imprecise and does not take into consideration sample size n.

Skewness and Kurtosis: Skewness
Skewness is a unit-free statistic. The coefficient can be used to compare two samples measured in different units, or one sample with a known reference distribution (e.g., the symmetric normal distribution).
Calculate the sample’s skewness coefficient as:
Skewness = $\dfrac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s} \right)^3$

Skewness and Kurtosis: Skewness
In Excel, go to Tools | Data Analysis | Descriptive Statistics or use the function =SKEW(array).
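
For readers working outside Excel, the adjusted skewness coefficient n/((n – 1)(n – 2)) × Σ((x – mean)/s)³ can be sketched in Python. This is a minimal illustration of that formula, with s the sample standard deviation; `sample_skewness` is a hypothetical helper name and the data are made up.

```python
from statistics import mean, stdev


def sample_skewness(values) -> float:
    """Adjusted sample skewness: n/((n-1)(n-2)) * sum(((x - x_bar)/s)**3)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three observations")
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    cubed = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


symmetric = [1, 2, 3, 4, 5]                    # made-up symmetric data
right_tailed = [3, 4, 5, 2, 3, 4, 5, 6, 4, 7]  # made-up data with a longer right tail
print(sample_skewness(symmetric))
print(round(sample_skewness(right_tailed), 4))
```

A coefficient near zero is consistent with symmetry, a positive value indicates a longer right tail, and a negative value indicates a longer left tail.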

Skewness and Kurtosis: Skewness
Consider the following table showing the 90% range for the sample skewness coefficient.
[Table: 90% range for the sample skewness coefficient, by sample size n]
Coefficients within the 90% range may be attributed to random variation.
Coefficients outside the range suggest the sample came from a nonnormal population.
As n increases, the range of chance variation narrows.

Skewness and Kurtosis: Kurtosis
Kurtosis is the relative length of the tails and the degree of concentration in the center.
Consider three kurtosis prototype shapes.
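
Skewness and Kurtosis: Kurtosis
A histogram is an unreliable guide to kurtosis since scale and axis proportions may differ.
Excel and MINITAB calculate kurtosis as:
Kurtosis = $\dfrac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s} \right)^4 - \dfrac{3(n-1)^2}{(n-2)(n-3)}$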
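
The sample kurtosis coefficient used by Excel's =KURT function is n(n + 1)/((n – 1)(n – 2)(n – 3)) × Σ((x – mean)/s)⁴ minus 3(n – 1)²/((n – 2)(n – 3)), which is scaled so that a normal population gives a value near zero. The sketch below is a minimal Python version of that formula; `sample_kurtosis` is a hypothetical helper name and the data are made up.

```python
from statistics import mean, stdev


def sample_kurtosis(values) -> float:
    """Adjusted sample excess kurtosis (zero is the normal benchmark)."""
    n = len(values)
    if n < 4:
        raise ValueError("kurtosis needs at least four observations")
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    if s == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    fourth = sum(((x - x_bar) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


made_up = [3, 4, 5, 2, 3, 4, 5, 6, 4, 7]  # made-up data for illustration
print(round(sample_kurtosis(made_up), 4))
```

Positive values point to heavier tails and a sharper peak than the normal distribution, and negative values point to lighter tails and a flatter center.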

Skewness and Kurtosis: Kurtosis
Consider the following table of the expected 90% range for the sample kurtosis coefficient.
[Table: expected 90% range for the sample kurtosis coefficient, by sample size n]
A sample coefficient within the range may be attributed to chance variation.
Coefficients outside the range suggest the sample differs from a normal population.
63Applied Statistics in Business and Economics End of Chapter 4