# Chapter 3 – Descriptive Statistics

Numerical Measures

Chapter Outline
Measures of Central Location: Mean, Median, Mode, Percentile (Quartile, Quintile, etc.)
Measures of Variability: Range, Variance (Standard Deviation, Coefficient of Variation)

Recall: A sample is a subset of a population.
Numerical measures calculated for sample data are called sample statistics. Numerical measures calculated for population data are called population parameters. A sample statistic is referred to as the point estimator of the corresponding population parameter.

Mean
As a measure of central location, the mean is simply the arithmetic average of all the data values. The sample mean $\bar{x}$ is the point estimator of the population mean $\mu$.

Sample Mean
$\bar{x} = \frac{\sum x_i}{n}$
The symbol $\sum$ (called sigma) means ‘sum up’. $x_i$ is the value of the $i$th observation in the sample. $n$ is the number of observations in the sample.

Population Mean
$\mu = \frac{\sum x_i}{N}$
The symbol $\sum$ (called sigma) means ‘sum up’. $x_i$ is the value of the $i$th observation in the population. $N$ is the number of observations in the population. $\mu$ is pronounced ‘mu’.

Sample Mean Example: Sales of Starbucks Stores
Fifty Starbucks stores are randomly chosen in New York City. The table below shows the sales of those stores in December 2012.

Median
The median of a data set is the value in the middle when the data items are arranged in ascending order. Whenever a data set has extreme values, the median is the preferred measure of central location. The median is the measure of location most often reported for annual income and property value data. A few extremely large incomes or property values can inflate the mean, since the calculation of the mean uses all the data items.

Median
For an odd number of observations: 26 18 27 12 14 27 19. Arranged in ascending order (12 14 18 19 26 27 27), the median is the middle value. Median = 19

Median
For an even number of observations: 26 18 27 12 14 27 19 30. Arranged in ascending order (12 14 18 19 26 27 27 30), the median is the average of the middle two values. Median = (19 + 26)/2 = 22.5
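The odd and even cases can be sketched in a few lines of Python. This is only an illustration of the rule stated on the slides; the function name `median` is our own choice, not something from the presentation.

```python
def median(values):
    """Middle value for odd n, mean of the two middle values for even n."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd number of observations
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even number

odd_data = [26, 18, 27, 12, 14, 27, 19]
even_data = [26, 18, 27, 12, 14, 27, 19, 30]
print(median(odd_data))    # 19
print(median(even_data))   # 22.5
```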

Mean vs. Median
As noted, extreme values can change the mean remarkably, while the median might not be affected much by extreme values. In that regard, the median is a better representative of central location. For the previous example (12 14 18 19 26 27 27 30), the median is 22.5 and the mean is 173/8 = 21.625. If we add one large number (280) to the data, the median becomes 26 (the value in the middle), but the mean becomes 453/9 ≈ 50.33. In this case we prefer the median to the mean as a measure of central location.
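A quick way to see this sensitivity is to compute both measures before and after adding the extreme value. The sketch below uses Python's standard `statistics` module; the variable names are ours.

```python
from statistics import mean, median

data = [12, 14, 18, 19, 26, 27, 27, 30]
with_outlier = data + [280]          # add one extreme value

print(mean(data), median(data))      # 21.625 22.5
print(round(mean(with_outlier), 2), median(with_outlier))   # 50.33 26
```

The median moves from 22.5 to 26, while the mean more than doubles.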

Mode The mode of a data set is the value that occurs most frequently.
The greatest frequency can occur at two or more different values. If the data have exactly two modes, the data are bimodal. If the data have more than two modes, the data are multimodal. Caution: If the data are bimodal or multimodal, Excel’s MODE function will incorrectly identify a single mode.

Mode
Example: 12 14 18 19 26 27 27 30. In this data set, 27 shows up twice while all the other data values show up once. So, the mode is 27.
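For data that may be bimodal or multimodal, a routine that returns every mode avoids the single-mode pitfall. A minimal sketch, assuming Python 3.8 or later (where `statistics.multimode` is available); the second data set is a hypothetical example, not from the slides.

```python
from statistics import multimode

slide_data = [12, 14, 18, 19, 26, 27, 27, 30]
bimodal_data = [12, 12, 18, 19, 26, 27, 27, 30]   # hypothetical data with two modes

print(multimode(slide_data))     # [27]
print(multimode(bimodal_data))   # [12, 27]
```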

Percentiles
A percentile provides information about how the data are spread over the interval from the smallest value to the largest value. Admission test scores for colleges and universities are frequently reported in terms of percentiles. The pth percentile of a data set is a value such that at least p percent of the items are less than or equal to this value and at least (100 - p) percent of the items are greater than or equal to this value. The 50th percentile is simply the median.

Percentiles
To find the pth percentile: Arrange the data in ascending order. Compute the index $i = (p/100)n$, the position of the pth percentile. If $i$ is not an integer, round up; the pth percentile is the value in the $i$th position. If $i$ is an integer, the pth percentile is the average of the values in positions $i$ and $i + 1$.

Percentiles
Find the 75th percentile of the following data: 12 14 18 19 26 27 29 30. Note: The data is already in ascending order. $i = (p/100)n = (75/100)8 = 6$. So, averaging the 6th and 7th data values: 75th percentile = (27 + 29)/2 = 28

Percentiles
Find the 20th percentile of the following data: 12 14 18 19 26 27 29 30. Note: The data is already in ascending order. $i = (p/100)n = (20/100)8 = 1.6$, which is rounded up to 2. So, the 20th percentile is simply the 2nd data value, i.e. 14.
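The index rule used in the two percentile examples can be written as a short Python function. This is a sketch of that specific rule only; the name `percentile` and the restriction to 0 < p < 100 are our assumptions.

```python
import math

def percentile(values, p):
    """p-th percentile via the index rule i = (p/100) * n, for 0 < p < 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)               # step 1: ascending order
    i = p * len(ordered) / 100             # step 2: index i
    if i != int(i):
        # i is not an integer: round up, take the value in that 1-based position
        return ordered[math.ceil(i) - 1]
    i = int(i)
    # i is an integer: average the values in 1-based positions i and i + 1
    return (ordered[i - 1] + ordered[i]) / 2

data = [12, 14, 18, 19, 26, 27, 29, 30]
print(percentile(data, 75))   # 28.0
print(percentile(data, 20))   # 14
```

Library routines often use other interpolation conventions, so they may return slightly different values for the same data.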

Quartiles
Quartiles are specific percentiles. First Quartile = 25th percentile. Second Quartile = 50th percentile = Median. Third Quartile = 75th percentile.

Measures of Variability
It is often desirable to consider measures of variability (dispersion), as well as measures of central location. For example, suppose two stocks provide the same average return of 5% a year, but stock A’s return is very stable (close to 5%) while stock B’s return is volatile (it could be as low as –10%). Are you indifferent with regard to which stock to invest in? For another example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.

Measures of Variability
Range, Interquartile Range, Variance/Standard Deviation, Coefficient of Variation

Range The range of a data set is the difference between the largest and smallest data values. It is the simplest measure of variability. It is very sensitive to the smallest and largest data values.

Range
Example: 12 14 18 19 26 27 29 30. Range = largest value - smallest value = 30 – 12 = 18

Interquartile Range The interquartile range of a data set is the difference between the 3rd quartile and the 1st quartile. It is the range of the middle 50% of the data. It overcomes the sensitivity to extreme data values.

Interquartile Range
Example: 12 14 18 19 26 27 29 30. 3rd Quartile (Q3) = 75th percentile = 28. 1st Quartile (Q1) = 25th percentile = 16. Interquartile Range = Q3 – Q1 = 28 – 16 = 12
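The range and interquartile range for this example can be reproduced with a few lines of Python. The helper `percentile` below is our own sketch of the slides' index rule (i = (p/100)n, round up if not an integer, otherwise average positions i and i + 1), not a library function.

```python
import math

def percentile(values, p):
    """p-th percentile using the index rule i = (p/100) * n."""
    ordered = sorted(values)
    i = p * len(ordered) / 100
    if i != int(i):
        return ordered[math.ceil(i) - 1]          # round up, 1-based position
    i = int(i)
    return (ordered[i - 1] + ordered[i]) / 2      # average positions i and i + 1

data = [12, 14, 18, 19, 26, 27, 29, 30]
data_range = max(data) - min(data)
q1 = percentile(data, 25)
q3 = percentile(data, 75)
iqr = q3 - q1
print(data_range, q1, q3, iqr)   # 18 16.0 28.0 12.0
```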

Variance
The variance is a measure of variability that utilizes all the data. It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population). The variance is useful in comparing the variability of two or more variables.

Variance
The variance is the average of the squared differences between each data value and the mean. The variance is calculated as follows:
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ for a sample
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ for a population

Standard Deviation
The standard deviation of a data set is the positive square root of the variance. It is measured in the same units as the data, making it easier to interpret than the variance.

Standard Deviation
The standard deviation is computed as follows:
$s = \sqrt{s^2}$ for a sample
$\sigma = \sqrt{\sigma^2}$ for a population

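Variance and Standard Deviation
Example: 12 14 18 19 26 27 29 30. Treating the data as a sample, $\bar{x} = 175/8 = 21.875$.
Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{342.875}{7} \approx 48.98$
Standard Deviation: $s = \sqrt{48.98} \approx 7.00$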
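As a rough check on the sample and population definitions, the sketch below uses Python's standard `statistics` module, where `variance`/`stdev` divide by n - 1 and `pvariance`/`pstdev` divide by N. Whether the slide treated these eight values as a sample or a population is not recoverable from the transcript, so both are shown.

```python
from statistics import mean, variance, pvariance, stdev, pstdev

data = [12, 14, 18, 19, 26, 27, 29, 30]

x_bar = mean(data)                # 21.875
s2 = variance(data)               # sample variance (divides by n - 1)
sigma2 = pvariance(data)          # population variance (divides by N)

print(x_bar)                                      # 21.875
print(round(s2, 2), round(stdev(data), 2))        # 48.98 7.0
print(round(sigma2, 2), round(pstdev(data), 2))   # 42.86 6.55
```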

Coefficient of Variation
The coefficient of variation indicates how large the standard deviation is in relation to the mean. In a comparison between two data sets with different units or with the same units but a significant difference in magnitude, coefficient of variation should be used instead of variance.

Coefficient of Variation
The coefficient of variation is computed as follows:
$\left(\frac{s}{\bar{x}} \times 100\right)\%$ for a sample
$\left(\frac{\sigma}{\mu} \times 100\right)\%$ for a population

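Coefficient of Variation
Example: 12 14 18 19 26 27 29 30. Treating the data as a sample with $\bar{x} = 21.875$ and $s \approx 7.00$, the coefficient of variation is $\left(\frac{7.00}{21.875} \times 100\right)\% \approx 32.0\%$.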

Coefficient of Variation
Example: Height vs. Weight. In a class of 30 students, the average height is 5’5’’ with a standard deviation of 3’’ and the average weight is 120 lbs with a standard deviation of 20 lbs. Question: in which measure (height or weight) are students more different? Since height and weight don’t have the same unit, we have to use the coefficient of variation to remove the units before comparing the variations in height and weight. With 5’5’’ = 65’’, the coefficient of variation is (3/65 × 100)% ≈ 4.6% for height and (20/120 × 100)% ≈ 16.7% for weight, so students’ weight is more variable than their height.
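The comparison can be checked with a small Python sketch. The function name `coefficient_of_variation` is ours, and the height is converted to 65 inches (5 ft 5 in) before dividing; the last lines apply the same function to the eight-value data set, treated as a sample.

```python
from statistics import mean, stdev

def coefficient_of_variation(std_dev, average):
    """Standard deviation as a percentage of the mean."""
    return std_dev / average * 100

height_cv = coefficient_of_variation(3, 65)      # inches
weight_cv = coefficient_of_variation(20, 120)    # pounds
print(round(height_cv, 1), round(weight_cv, 1))  # 4.6 16.7

data = [12, 14, 18, 19, 26, 27, 29, 30]
data_cv = coefficient_of_variation(stdev(data), mean(data))
print(round(data_cv, 1))                         # 32.0
```

Because the units cancel in the ratio, the two percentages can be compared directly.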

Measures of Distribution Shape, Relative Location, and Detecting Outliers
z-Scores, Chebyshev’s Theorem, Empirical Rule, Detecting Outliers

Distribution Shape: Skewness
An important measure of the shape of a distribution is called skewness. The formula for the skewness of sample data is
$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i - \bar{x}}{s}\right)^3$
Skewness can be easily computed using statistical software.
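As an illustration, the sketch below implements one common sample skewness formula, n/((n-1)(n-2)) times the sum of cubed standardized deviations (the adjusted form that Excel's SKEW function uses). That this is the exact formula shown on the original slide is an assumption; the function name and the small data sets are hypothetical.

```python
from statistics import mean, stdev

def sample_skewness(values):
    """Adjusted sample skewness: n/((n-1)(n-2)) * sum(((x - x_bar)/s)**3)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 observations")
    x_bar = mean(values)
    s = stdev(values)                       # sample standard deviation
    cubed = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed

symmetric = [1, 2, 3, 4, 5]        # hypothetical data
right_skewed = [1, 2, 3, 4, 20]    # long right tail
left_skewed = [-20, -4, -3, -2, -1]

print(sample_skewness(symmetric))      # approximately 0
print(sample_skewness(right_skewed))   # positive
print(sample_skewness(left_skewed))    # negative
```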

Distribution Shape: Skewness
Symmetric (not skewed): Skewness is zero. Mean and median are equal.
[Figure: relative frequency histogram of a symmetric distribution, skewness = 0]

Distribution Shape: Skewness
Skewed to the left: Skewness is negative. Mean is usually less than the median.
[Figure: relative frequency histogram of a left-skewed distribution]

Distribution Shape: Skewness
Skewed to the right: Skewness is positive. Mean is usually more than the median.
[Figure: relative frequency histogram of a right-skewed distribution, skewness = .31]

Z-Scores
The z-score is often called the standardized value. It denotes the number of standard deviations a data value $x_i$ is from the mean: $z_i = \frac{x_i - \bar{x}}{s}$. Excel’s STANDARDIZE function can be used to compute the z-score.

Z-Scores An observation’s z-score is a measure of the relative location of the observation in a data set. A data value less than the sample mean has a negative z-score. A data value greater than the sample mean has a positive z-score. A data value equal to the sample mean has a z-score of zero.

Z-Scores Example 12 14 18 19 26 27 29 30
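The z-scores for this example can be computed as follows. This is a sketch that treats the eight values as a sample, so it uses the sample mean and the sample standard deviation; the variable names are ours.

```python
from statistics import mean, stdev

data = [12, 14, 18, 19, 26, 27, 29, 30]
x_bar = mean(data)     # 21.875
s = stdev(data)        # sample standard deviation, about 7.00

# z-score: number of standard deviations each value lies from the mean
z_scores = [(x - x_bar) / s for x in data]

for x, z in zip(data, z_scores):
    print(x, round(z, 2))
```

Values below the mean of 21.875 (12 through 19) get negative z-scores, and values above it get positive ones.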

Chebyshev’s Theorem
At least $(1 - 1/z^2)$ of the items in any data set will be within $z$ standard deviations of the mean, i.e. between $(\bar{x} - zs)$ and $(\bar{x} + zs)$, where $z$ is any value greater than 1. Chebyshev’s theorem requires $z > 1$, but $z$ need not be an integer.

Chebyshev’s Theorem At least 55.6% of the data values must be within z = 1.5 standard deviations of the mean. At least 89% of the data values must be within z = 3 standard deviations of the mean. At least 94% of the data values must be within z = 4 standard deviations of the mean.

Chebyshev’s Theorem
Example: Given that $\bar{x} = 10$ and $s = 2$, at least what percentage of all the data values falls within 2 standard deviations of the mean? At least $(1 - 1/2^2) = 1 - 1/4 = 75\%$ of all the data values must be between 6 and 14, since $\bar{x} - zs = 10 - 2(2) = 6$ and $\bar{x} + zs = 10 + 2(2) = 14$.
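The bound and the corresponding interval are simple enough to sketch in Python. The function names below are our own and only illustrate the arithmetic of the theorem.

```python
def chebyshev_lower_bound(z):
    """Minimum fraction of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z**2

def chebyshev_interval(x_bar, s, z):
    """Interval (x_bar - z*s, x_bar + z*s) covered by the bound."""
    return (x_bar - z * s, x_bar + z * s)

print(chebyshev_lower_bound(2))        # 0.75
print(chebyshev_interval(10, 2, 2))    # (6, 14)

for z in (1.5, 3, 4):
    print(z, chebyshev_lower_bound(z))
```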

Empirical Rule When the data are believed to approximate a bell-shaped distribution, the empirical rule can be used to determine the percentage of data values that must be within a specified number of standard deviations of the mean. The empirical rule is based on the normal distribution, which is covered in Chapter 6.

Empirical Rule
For data having a bell-shaped distribution: About 68% of values of a normal random variable are between $\mu - \sigma$ and $\mu + \sigma$. About 95% of values of a normal random variable are between $\mu - 2\sigma$ and $\mu + 2\sigma$. About 99.7% of values of a normal random variable are between $\mu - 3\sigma$ and $\mu + 3\sigma$.
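These percentages come from the normal distribution, and they can be recovered numerically. The sketch below relies on the standard identity that a normal variable falls within k standard deviations of its mean with probability erf(k/√2); the function name is ours.

```python
import math

def normal_fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(normal_fraction_within(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```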

[Figure: bell-shaped curve with the horizontal axis marked at μ – 3σ, μ – 2σ, μ – 1σ, μ + 1σ, μ + 2σ, μ + 3σ]

Detecting Outliers
An outlier is an unusually small or unusually large value in a data set. A data value with a z-score less than –3 or greater than +3 might be considered an outlier. It might be an incorrectly recorded data value, a data value that was incorrectly included in the data set, or a correctly recorded data value that belongs in the data set.
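A minimal sketch of the z-score screen in Python follows. The function name, the threshold parameter, and the 21-value data set are hypothetical; only the ±3 cutoff comes from the slide.

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    x_bar = mean(values)
    s = stdev(values)                  # sample standard deviation
    return [x for x in values if abs((x - x_bar) / s) > threshold]

# Hypothetical data: 20 values near 50 plus one extreme value
readings = [48, 52, 50, 49, 51] * 4 + [500]
print(flag_outliers(readings))         # [500]

# With only 9 observations, even 280 does not reach |z| > 3
small_sample = [12, 14, 18, 19, 26, 27, 27, 30, 280]
print(flag_outliers(small_sample))     # []
```

The second call shows a limitation of the rule: in very small samples no z-score can exceed 3 in absolute value, so flagged values still need to be judged in context.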

Measures of Association Between Two Variables
So far, we have examined numerical methods used to summarize the data for one variable at a time. Often a manager or decision maker is interested in the relationship between two variables. Two numerical measures of the relationship between two variables are covariance and correlation coefficient.

Covariance The covariance is a measure of the linear association between two variables. Positive values indicate a positive relationship. Negative values indicate a negative relationship.

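Covariance
The covariance is computed as follows:
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ for samples
$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$ for populations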

Correlation Coefficient
Correlation is a measure of linear association and not necessarily causation. Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.

Correlation Coefficient
The correlation coefficient is computed as follows:
$r_{xy} = \frac{s_{xy}}{s_x s_y}$ for samples
$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$ for populations

Correlation Coefficient
The correlation can take on values between –1 and +1. Values near –1 indicate a strong negative linear relationship. Values near +1 indicate a strong positive linear relationship. The closer the correlation is to zero, the weaker the relationship.

Covariance and Correlation Coefficient
Example: Stock Returns. The table below presents the monthly returns (in percent) of the S&P 500 market index (SPY) and Apple stock (AAPL) from December 2012 to May 2013.

Covariance and Correlation Coefficient
Example: Stock Returns Sample Covariance Sample Correlation Coefficient
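The sample covariance and sample correlation coefficient can be sketched in Python as below. The return figures are hypothetical placeholders chosen for easy arithmetic; they are not the SPY and AAPL returns from the slide's table, which the transcript does not include. The function names are also ours.

```python
from statistics import mean, stdev

def sample_covariance(x, y):
    """Sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    x_bar, y_bar = mean(x), mean(y)
    products = ((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sum(products) / (len(x) - 1)

def sample_correlation(x, y):
    """Sample covariance divided by the product of the sample standard deviations."""
    return sample_covariance(x, y) / (stdev(x) * stdev(y))

# Hypothetical monthly returns (%), NOT the values from the slide
market = [1.0, 2.0, 3.0, 4.0, 5.0]
stock = [2.0, 4.0, 5.0, 4.0, 5.0]

print(sample_covariance(market, stock))              # 1.5
print(round(sample_correlation(market, stock), 2))   # 0.77
```

A positive covariance with a correlation of about 0.77 would indicate a fairly strong positive linear relationship for these placeholder values.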