# Chapter 3: Descriptive Statistics


Learning Objectives

- LO1: Apply various measures of central tendency, including the mean, median, and mode, to a set of ungrouped data.
- LO2: Apply various measures of variability, including the range, interquartile range, mean absolute deviation, variance, and standard deviation (using the empirical rule and Chebyshev’s theorem), to a set of ungrouped data.
- LO3: Compute the mean, median, mode, standard deviation, and variance of grouped data.
- LO4: Describe a data distribution statistically and graphically using skewness, kurtosis, and box-and-whisker plots.
- LO5: Use computer packages to compute various measures of central tendency, variation, and shape on a set of data, as well as to describe the data distribution graphically.

Measures of Central Tendency: Ungrouped Data

- Ungrouped data are raw numbers that have not been summarized by statistical techniques.
- Measures of central tendency reveal information about the values at the center, or middle part, of a group of numbers (or ordered array).
- Common measures of central tendency are the mean, median, mode, percentiles, and quartiles.

The Arithmetic Mean

- The arithmetic mean is commonly called "the mean".
- It is the average of a group of numbers.
- It applies to interval and ratio data; it does not apply to nominal or ordinal data.
- It is computed by summing all values in the data set and dividing the sum by the number of values in the data set.
- Its value is therefore affected by each value in the data set, including extreme values.

Application of Arithmetic Mean in Statistics

- The mean serves as a summary statistic of central tendency for data produced by business and economic processes.
- In these settings it is important to distinguish between the population mean, $\mu$, and the sample mean, $\bar{x}$.
- The population mean is based on all of the values within the population.
- The sample mean uses only some of the values within a population.

Computing Population Mean
Suppose a company has five departments with 24, 13, 19, 26, and 11 workers. The population mean number of workers per department is 18.6. The computation follows:

$$\mu = \frac{\sum x}{N} = \frac{24 + 13 + 19 + 26 + 11}{5} = \frac{93}{5} = 18.6$$

Computing Sample Mean

- The calculation of a sample mean uses the same algorithm as for a population mean and produces the same answer when computed on the same data.
- However, separate symbols are used: $\mu$ for the population mean and $\bar{x}$ for the sample mean.
- Given the following set of numbers: 57, 86, 42, 38, 90, and 66, the sample mean is about 63.17. The computation follows:

$$\bar{x} = \frac{\sum x}{n} = \frac{57 + 86 + 42 + 38 + 90 + 66}{6} = \frac{379}{6} \approx 63.17$$

Impact of Extreme Values on the Mean
- The mean is the most commonly used measure of central tendency because of its mathematical properties and because it uses all the data points in the data set.
- However, the mean is affected by extremely large or extremely small numbers.
- In the sample mean example (57, 86, 42, 38, 90, 66), if the largest number, 90, is replaced by 1,000, the mean becomes $1289/6 \approx 214.83$ as opposed to 63.17.
- If the smallest number, 38, is replaced by 5, the mean becomes $346/6 \approx 57.67$ as opposed to 63.17.
- Extreme values can significantly distort the mean.
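The mean calculations on these slides can be checked with a few lines of Python. This is a minimal sketch using the standard library's `statistics.mean`; the variable names are illustrative and not part of the original presentation.

```python
from statistics import mean

workers = [24, 13, 19, 26, 11]       # population: workers in five departments
sample = [57, 86, 42, 38, 90, 66]    # sample from the sample mean slide

population_mean = mean(workers)      # 93 / 5
sample_mean = mean(sample)           # 379 / 6

# Replace the largest value (90) with 1,000, then the smallest (38) with 5,
# to see how a single extreme value moves the mean.
with_large_outlier = [1000 if x == 90 else x for x in sample]
with_small_outlier = [5 if x == 38 else x for x in sample]

print(f"population mean:           {population_mean:.2f}")
print(f"sample mean:               {sample_mean:.2f}")
print(f"mean with 90 -> 1,000:     {mean(with_large_outlier):.2f}")
print(f"mean with 38 -> 5:         {mean(with_small_outlier):.2f}")
```

One changed value out of six is enough to more than triple the mean in the first case, which is the distortion the slide describes.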

The Median

- The median is the middle value in an ordered array of numbers.
- The median applies to ordinal, interval, and ratio data.
- Advantage: it is unaffected by extremely large and extremely small values in the data set.
- Disadvantage: not all the information from the numbers is used.

Computing the Median

- First step: arrange the observations in an ordered array.
- Second step: for an array with an odd number of terms, the median is the middle number.
- Third step: for an array with an even number of terms, the median is the average of the two middle numbers.

Locating the Median

- The median’s location in an ordered array is found by (n+1)/2.

Median Example with an Odd Number of Data
- Let X be an ordered array with the following values: 3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22.
- There are 17 values in the ordered array.
- Position of median = (n+1)/2 = (17+1)/2 = 9th position.
- Counting from left to right to the 9th position, the median is 15.
- Advantage: extreme values do not distort the median. If 22 (the maximum value) is replaced by 100, the median is still 15. If 3 (the minimum value) is replaced by -103, the median is still 15.

Median Example with an Even Number of Data
- Let X be an ordered array with the following values: 3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21.
- There are 16 values in the ordered array.
- Position of median = (n+1)/2 = (16+1)/2 = 8.5th position.
- The median is the value halfway between the 8th and 9th observations in the ordered array: 14 + 0.5(15 - 14) = 14.5, or simply (14 + 15)/2 = 14.5.
- Advantage: extreme values do not distort the median. If 21 (the maximum value) is replaced by 100, the median is still 14.5. If 3 (the minimum value) is replaced by -88, the median is still 14.5.
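The (n+1)/2 location rule from the two median examples can be sketched in Python as follows. The function name `median_by_position` is hypothetical and chosen for illustration; the standard library's `statistics.median` gives the same results.

```python
def median_by_position(values):
    """Median via the (n + 1) / 2 location rule on an ordered array."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    position = (len(ordered) + 1) / 2        # 1-based location of the median
    lower = ordered[int(position) - 1]       # value at the whole-number part
    if position.is_integer():                # odd n: a single middle value
        return lower
    upper = ordered[int(position)]           # even n: average the two middle values
    return (lower + upper) / 2


odd_array = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]
even_array = odd_array[:-1]                  # the 16-value array from the second example

print(median_by_position(odd_array))
print(median_by_position(even_array))

# Replacing the maximum or minimum with an extreme value leaves the median unchanged.
print(median_by_position(odd_array[:-1] + [100]))
print(median_by_position([-103] + odd_array[1:]))
```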

The Mode

- The mode is the value that occurs most frequently in an array of data.
- The mode applies to all levels of data measurement: nominal, ordinal, interval, and ratio.
- Unimodal: describes data sets with a single mode.
- Bimodal: describes data sets that have two modes.
- Multimodal: describes data sets that contain more than two modes.

Example of the Mode

- Organizing the data into an ordered array helps to locate the mode.
- The arrangement of the numbers represents an ordered array.
- 44 is the value that occurs most frequently (it occurs 5 times).
- The mode is 44.

Percentiles

- Percentiles are measures of central tendency that divide a group of data into 100 parts.
- The nth percentile is the value such that at least n percent of the data are below that value and at most (100 - n) percent are above that value.
- For example, if a plant operator takes a safety examination and 87.6% of the safety exam scores are below that person’s score, he or she still scores at only the 87th percentile, even though more than 87% of the scores are lower.
- The median is the 50th percentile.

Percentiles

- Percentiles are stair-step values: for example, the 87th and 88th percentiles have no values between them.
- Percentile methods are applicable to ordinal, interval, and ratio data and are not applicable to nominal data.
- In general, percentiles are not influenced by extreme values in the data set.

Steps in Determining the Location of the Percentile
1. Organize the data into ascending order.
2. Calculate the percentile location $i$ using

   $$i = \frac{P}{100}\,n$$

   where $P$ = percentile, $i$ = percentile location, and $n$ = number of values in the data set.
3. Determine the location:
   - If $i$ is a whole number, the $P$th percentile is the average of the value at the $i$th location and the value at the $(i+1)$th location.
   - If $i$ is not a whole number, the $P$th percentile value is located at the whole-number part of $i + 1$.

Calculating Percentiles: An Example
- Raw data: 14, 12, 19, 23, 5, 13, 28, 17
- Ordered array: 5, 12, 13, 14, 17, 19, 23, 28
- Problem: find the 30th percentile. Number of observations: n = 8.
- Location of the 30th percentile: $i = \frac{30}{100}(8) = 2.4$
- The location index, i, is not a whole number. Therefore the location is the whole-number portion of i + 1 = 2.4 + 1 = 3.4, which is 3.
- The 30th percentile is at the 3rd location of the array: 30th percentile = 13.
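The percentile location steps can be sketched in Python as shown here. The function name `percentile` and the restriction to integer P strictly between 0 and 100 are assumptions made for this illustration; statistical packages often use other interpolation rules and may return slightly different values.

```python
def percentile(values, p):
    """P-th percentile by the location rule i = (P / 100) * n, for integer 0 < P < 100."""
    if not values or not 0 < p < 100:
        raise ValueError("need non-empty data and 0 < p < 100")
    ordered = sorted(values)                 # step 1: ascending order
    n = len(ordered)
    whole, remainder = divmod(p * n, 100)    # step 2: i = p * n / 100, in integer arithmetic
    if remainder == 0:
        # i is a whole number: average the i-th and (i + 1)-th values (1-based).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # i is not whole: the value sits at the whole-number part of i + 1 (1-based),
    # which is index `whole` in 0-based Python indexing.
    return ordered[whole]


raw = [14, 12, 19, 23, 5, 13, 28, 17]
print(percentile(raw, 30))

# The same function gives quartiles as the 25th, 50th, and 75th percentiles.
x = [106, 109, 114, 116, 121, 122, 125, 129]
print(percentile(x, 25), percentile(x, 50), percentile(x, 75))
```

Using `divmod` on integers avoids floating-point surprises when deciding whether i is a whole number.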

Quartiles

- Quartiles are measures of central tendency that divide a group of data into four subgroups or parts.
  - Q1: 25% of the data set is below the first quartile.
  - Q2: 50% of the data set is below the second quartile.
  - Q3: 75% of the data set is below the third quartile.
- Relationship between quartiles and percentiles:
  - Q1 is equal to the 25th percentile.
  - Q2 is located at the 50th percentile and equals the median.
  - Q3 is equal to the 75th percentile.
- Quartile values are not necessarily members of the data set.

Calculating Quartiles: An Example
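Let X be an ordered array: X = {106, 109, 114, 116, 121, 122, 125, 129}, so n = 8. Then:

- Q1: $i = \frac{25}{100}(8) = 2$, so $Q_1 = \frac{109 + 114}{2} = 111.5$
- Q2: $i = \frac{50}{100}(8) = 4$, so $Q_2 = \frac{116 + 121}{2} = 118.5$
- Q3: $i = \frac{75}{100}(8) = 6$, so $Q_3 = \frac{122 + 125}{2} = 123.5$

Note that when i is a whole number, the quartile is the average of the ith and (i+1)th values in the ordered set.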

Measures of Variability: Ungrouped Data
- Measures of variability are used to describe the spread or dispersion of data.
- Using measures of variability together with measures of central tendency gives a more complete description of the data.
- Measures of variability for ungrouped data include the range, interquartile range, mean absolute deviation, variance, standard deviation, z scores, and coefficient of variation.

Measures of Variability: Ungrouped Data
- Measures of variability describe the dispersion (spread) of a set of data or, conversely, its convergence (how tightly the values cluster).
- Dispersion describes how far the data are spread out from the mean.
- Convergence describes how closely the data gather around the mean.
- Variability is most frequently expressed in terms of deviation from the mean. The images in the next slides express this visually.

Variability

[Figure: two cash flow panels, "No Variability in Cash Flow (same amounts)" and "Variability in Cash Flow (different amounts)", each marked with the mean.]

[Figure: panels labeled "No Variability" and "Variability".]

Range

- The range is the difference between the largest and smallest values in the data set.
- Advantage: simple to compute.
- Disadvantages:
  - Ignores all data points except the two extremes
  - Influenced by extreme values
  - Has no reference point
  - Has limited use by itself
- Example of range using data provided:

Interquartile Range

- Interquartile range = Q3 - Q1
- The interquartile range spans the interval between the first and third quartiles.
- It accounts for the middle 50% of values in the ordered data set.
- It is especially useful in situations where data users are more interested in values toward the middle and less interested in extremes.
- It is less influenced by extremes.

Deviation from the Mean
- An examination of deviations from the mean can reveal information about the variability of data.
- However, the individual deviations are used mostly as a tool to compute other measures of variability.
- Example: the data set 5, 9, 16, 17, 18 has a mean of $\mu = 13$. The values of $(x - \mu)$ show the distances around the mean, or individual deviations from the mean: -8, -4, 3, 4, 5.

Mean Absolute Deviation
- Absolute deviations express the tendency for observations to differ, on average, from the mean.
- The mean absolute deviation is the average of the absolute values of the deviations around the mean: $MAD = \frac{\sum |x - \mu|}{N}$.
- It is easy to calculate but not as statistically useful or unbiased as the variance and standard deviation.
- Below is an example calculating the mean absolute deviation.
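The slide's own worked example did not survive in the transcript. As a minimal sketch, the Python below computes the mean absolute deviation under the assumption that the example uses the data set 5, 9, 16, 17, 18 from the deviations slide.

```python
data = [5, 9, 16, 17, 18]                 # assumed: the data set from the deviations slide
N = len(data)
mu = sum(data) / N                        # population mean, 13.0

deviations = [x - mu for x in data]       # -8, -4, 3, 4, 5
abs_deviations = [abs(d) for d in deviations]

# Raw deviations always sum to zero, which is why absolute values are taken.
print(sum(deviations))

mad = sum(abs_deviations) / N             # 24 / 5
print(mad)
```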

Population Variance

- Population variance is the sum of the squared deviations from the mean divided by the number of observations: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$.
- Variance is measured in squared units of measurement.
- Squared units are hard to interpret, so the variance is typically used as a step toward obtaining the standard deviation of a data set.

Example of Population Variance
Given the following x values, the solution would be expressed as 26.0 units squared

Population Standard Deviation
- The population standard deviation is the square root of the population variance: $\sigma = \sqrt{\sigma^2}$.
- It is easier to interpret in practice than the variance.
- It measures the dispersion of the population data from the mean.
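The x values for the population variance example are missing from the transcript. Assuming they are the data set 5, 9, 16, 17, 18 from the deviations slide (an assumption, although it does reproduce the stated 26.0 units squared), the variance and standard deviation can be sketched as:

```python
from math import sqrt
from statistics import pstdev, pvariance

data = [5, 9, 16, 17, 18]                 # assumed x values, see the note above
N = len(data)
mu = sum(data) / N                        # population mean, 13.0

squared_deviations = [(x - mu) ** 2 for x in data]   # 64, 16, 9, 16, 25
pop_variance = sum(squared_deviations) / N           # 130 / 5, in squared units
pop_std_dev = sqrt(pop_variance)                     # back in the original units

print(pop_variance, round(pop_std_dev, 3))

# The standard library's population formulas give the same results.
print(pvariance(data), round(pstdev(data), 3))
```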

Example of Sample Variance
Sample variances are also expressed as units squared. For example:

Example of Sample Standard Deviation
- The sample standard deviation is the square root of the sample variance.
- It is easier to interpret in practice than squared units.
- The sample standard deviation is used as a good estimator of the population standard deviation.
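The sample variance and standard deviation examples are missing from the transcript. The sketch below therefore reuses the sample 57, 86, 42, 38, 90, 66 from the sample mean slide purely as an illustration (an assumption, not the slide's own example). The sample formulas divide by n - 1 rather than N: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ and $s = \sqrt{s^2}$.

```python
from math import sqrt
from statistics import stdev, variance

sample = [57, 86, 42, 38, 90, 66]         # illustrative data, borrowed from the sample mean slide
n = len(sample)
x_bar = sum(sample) / n                   # sample mean, 379 / 6

sum_squared_dev = sum((x - x_bar) ** 2 for x in sample)
s2 = sum_squared_dev / (n - 1)            # sample variance: divide by n - 1, not n
s = sqrt(s2)                              # sample standard deviation

print(round(s2, 2), round(s, 2))

# statistics.variance and statistics.stdev use the same n - 1 divisor.
print(round(variance(sample), 2), round(stdev(sample), 2))
```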

Standard Deviation

- Standard deviation is the square root of the variance.
- The standard deviation of a population is denoted by $\sigma$.
- The standard deviation of a sample is denoted by $s$.

Uses of Standard Deviation
- Indicator of financial risk
- Quality control
  - Construction of quality control charts
  - Process capability studies
- Comparing two or more populations
  - Household incomes in two cities
  - Employee absenteeism at two plants
  - Used as a percentage of the mean: the coefficient of variation (CV)

Standard Deviation as an Indicator of Financial Risk

Symmetric and Asymmetric Distributions
- Data are either symmetric or non-symmetric with respect to some measure of central tendency.
- Statisticians have observed that distributions describing many types of business and economic data tend to be symmetric or have a normal shape.
- In practical terms, the processes that generate symmetric, approximately normal data have special properties with respect to data concentration (the empirical rule).
- Non-symmetric distributions, in practice and theory, obey at a minimum specified rules with respect to the concentration of data values in a population (Chebyshev’s theorem).

Empirical Rule When data are normally distributed or approximately normal
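The body of this slide is missing from the transcript. The empirical rule is usually stated as follows: for normally distributed data, about 68% of values lie within $\mu \pm 1\sigma$, about 95% within $\mu \pm 2\sigma$, and about 99.7% within $\mu \pm 3\sigma$. As a minimal sketch, these percentages can be recovered from the standard normal distribution with Python's `statistics.NormalDist`:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of a normal distribution lying within k standard deviations of the mean.
shares = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
```

The rule is an approximation for data that are normal or close to normal; it should not be applied to strongly skewed data.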

Chebyshev’s Theorem: When Data Are Not Normally Distributed or Are Nonsymmetric

- Chebyshev’s theorem applies to all distributions.
- It gives the minimum proportion of the data that lies within a specified number of standard deviations, k, of the mean: at least $1 - \frac{1}{k^2}$, for $k > 1$.

Chebyshev’s Theorem

- A general theorem applying to all distributions.
- Calculations are shown for k = 2, 3, 4. The theorem requires k > 1; at k = 1 the bound is 0 and tells us nothing.

| Number of Standard Deviations (k) | Distance from the Mean | Minimum Proportion of Values Falling within Distance from the Mean |
|---|---|---|
| 2 | $\mu \pm 2\sigma$ | 0.75 |
| 3 | $\mu \pm 3\sigma$ | 0.89 |
| 4 | $\mu \pm 4\sigma$ | 0.94 |
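The proportions for k = 2, 3, 4 follow from the bound $1 - 1/k^2$. A minimal Python sketch (the function name `chebyshev_bound` is hypothetical) reproduces them before rounding:

```python
def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem is only informative for k > 1")
    return 1 - 1 / k ** 2


for k in (2, 3, 4):
    print(f"k = {k}: at least {chebyshev_bound(k):.4f} of values lie within mu +/- {k} sigma")
```

Because the bound holds for any distribution, it is much looser than the empirical rule: at k = 2 it guarantees only 75% where normal data have about 95%.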

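Z Scores

- The z score represents the number of standard deviations a value (x) is above or below the mean.
- It translates a raw value into units of standard deviation. When the data are normally distributed, z scores can be interpreted with the empirical rule.
- Z score formula: $z = \frac{x - \mu}{\sigma}$ for a population, or $z = \frac{x - \bar{x}}{s}$ for a sample.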

Coefficient of Variation
- Ratio of the standard deviation to the mean, expressed as a percentage.
- A measurement of relative dispersion, expressed as:

$$CV = \frac{\sigma}{\mu}(100)$$

Examples of Coefficient of Variation
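$$\mu_1 = 29, \quad \sigma_1 = 4.6, \quad CV_1 = \frac{\sigma_1}{\mu_1}(100) = \frac{4.6}{29}(100) = 15.86$$

$$\mu_2 = 84, \quad \sigma_2 = 10, \quad CV_2 = \frac{\sigma_2}{\mu_2}(100) = \frac{10}{84}(100) = 11.90$$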

Measures of Central Tendency and Variability: Grouped Data
- Measures of central tendency: mean, median, mode
- Measures of variability: variance, standard deviation

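Mean of Grouped Data

- The grouped mean is a weighted average of the class midpoints.
- The class frequencies are the weights.
- Mean of grouped data: $\mu_{grouped} = \frac{\sum fM}{N}$, where $f$ is the class frequency, $M$ is the class midpoint, and $N = \sum f$ is the total number of observations.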

Example Calculation of Grouped Mean

Median of Grouped Data

Calculating the Median of Grouped Data

Estimating the Mode from Grouped Data
- The modal class is the class interval with the greatest frequency (7 to under 9 in this example).
- The mode for grouped data is the class midpoint of the modal class, so the mode is 8 in this example.
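The frequency tables for the grouped median and mode slides are missing from the transcript. The sketch below uses a hypothetical frequency table, chosen only so that the modal class is 7 to under 9 as on the mode slide; the frequencies themselves are an assumption. It applies the usual grouped median formula, $\text{Median} = L + \frac{N/2 - cf_p}{f_{med}} W$, where $L$ is the lower limit of the median class, $cf_p$ the cumulative frequency of the classes before it, $f_{med}$ its frequency, and $W$ its width.

```python
# Hypothetical frequency table: (lower limit, upper limit, frequency).
# Each class runs from the lower limit to just under the upper limit.
classes = [(1, 3, 4), (3, 5, 12), (5, 7, 13), (7, 9, 19), (9, 11, 7), (11, 13, 5)]


def grouped_median(table):
    """Median = L + ((N/2 - cf_p) / f_med) * W for the class containing the N/2-th value."""
    half = sum(f for _, _, f in table) / 2
    cumulative = 0                        # cumulative frequency before the current class
    for lower, upper, freq in table:
        if freq > 0 and cumulative + freq >= half:
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("frequency table is empty")


def grouped_mode(table):
    """Midpoint of the modal class (the class with the greatest frequency)."""
    lower, upper, _ = max(table, key=lambda row: row[2])
    return (lower + upper) / 2


print(round(grouped_median(classes), 3))
print(grouped_mode(classes))
```

Both results are estimates, since grouping discards the individual values inside each class.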

Variance and Standard Deviation from Grouped Data

Population Variance and Standard Deviation of Grouped Data
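The grouped variance slides lost their formulas and tables. For a population, the standard grouped formulas are $\sigma^2 = \frac{\sum f (M - \mu)^2}{N}$ and $\sigma = \sqrt{\sigma^2}$, with $f$ the class frequency, $M$ the class midpoint, and $\mu$ the grouped mean. The sketch below applies them, together with the grouped mean, to a hypothetical frequency table; the frequencies are assumed for illustration and are not taken from the slides.

```python
from math import sqrt

# Hypothetical frequency table: (lower limit, upper limit, frequency).
classes = [(1, 3, 4), (3, 5, 12), (5, 7, 13), (7, 9, 19), (9, 11, 7), (11, 13, 5)]

freqs = [f for _, _, f in classes]
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]   # class midpoints M
N = sum(freqs)                                         # total number of observations

# Grouped mean: frequency-weighted average of the class midpoints.
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / N

# Population variance and standard deviation of the grouped data.
grouped_var = sum(f * (m - grouped_mean) ** 2 for f, m in zip(freqs, midpoints)) / N
grouped_sd = sqrt(grouped_var)

print(round(grouped_mean, 3), round(grouped_var, 3), round(grouped_sd, 3))
```

For sample data, the divisor in the variance would be n - 1 instead of N.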

Descriptions and Measures of Shape
- Skewness
  - Absence of symmetry
  - Presence of extreme values on one side or the other of a distribution
- Kurtosis
  - Peakedness of a distribution
  - Leptokurtic: high and thin peak
  - Mesokurtic: normal or mound-shaped top
  - Platykurtic: flat-topped and spread out
- Box-and-whisker plots
  - Graphic display of a distribution using five summary statistics
  - Reveal skewness and data location or clustering

Probability Distributions Showing Symmetry and Skewness
[Figure: three distribution shapes, labeled "Symmetrical", "Right or Positively Skewed", and "Left or Negatively Skewed".]

Symmetrical Shape Frequency Histogram Showing Relationship of Mean, Median and Mode

Coefficient of Skewness
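- A summary measure for skewness based on the relationship of the mean to the median and the variation in the data. The formula is missing from the transcript; the Pearsonian coefficient of skewness, which matches this description, is $S_k = \frac{3(\mu - M_d)}{\sigma}$, where $M_d$ is the median.
- If $S_k < 0$, the distribution is negatively skewed (skewed to the left).
- If $S_k = 0$, the distribution is symmetric (not skewed).
- If $S_k > 0$, the distribution is positively skewed (skewed to the right).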
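As a minimal sketch, the Pearsonian coefficient of skewness, $S_k = 3(\mu - M_d)/\sigma$, can be computed with the standard library. Treating this as the coefficient the slide intends is an assumption, and the function name `pearson_skewness` is hypothetical. The example reuses the data set 5, 9, 16, 17, 18 from the deviations slide.

```python
from statistics import mean, median, pstdev


def pearson_skewness(values):
    """Pearsonian coefficient of skewness: 3 * (mean - median) / population std dev."""
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return 3 * (mean(values) - median(values)) / sigma


data = [5, 9, 16, 17, 18]       # mean 13, median 16: the mean is pulled to the left
sk = pearson_skewness(data)
print(round(sk, 3))             # negative, so the data are skewed to the left

symmetric = [1, 2, 3, 4, 5]     # mean equals median, so the coefficient is zero
print(pearson_skewness(symmetric))
```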

Effect of Changes in Mean on the Coefficient of Skewness

Types of Kurtosis

Requirements for A Box and Whisker Plot
- Five specific numbers are used:
  - Median, Q2
  - First quartile, Q1
  - Third quartile, Q3
  - Minimum value in the data set
  - Maximum value in the data set
- Inner fences: first indicators of extreme values
  - IQR = Q3 - Q1
  - Lower inner fence = Q1 - 1.5 IQR
  - Upper inner fence = Q3 + 1.5 IQR
- Outer fences: strong indicators of extreme values
  - Lower outer fence = Q1 - 3.0 IQR
  - Upper outer fence = Q3 + 3.0 IQR
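The fences can be sketched in Python using the conventional multipliers of 1.5 and 3.0 times the IQR. The example takes the ordered array from the quartile example, for which the percentile location method gives Q1 = 111.5 and Q3 = 123.5; the function name `fences` is hypothetical.

```python
def fences(q1, q3):
    """Inner (1.5 * IQR) and outer (3.0 * IQR) fences for a box-and-whisker plot."""
    iqr = q3 - q1
    return {
        "iqr": iqr,
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }


data = [106, 109, 114, 116, 121, 122, 125, 129]
result = fences(q1=111.5, q3=123.5)
print(result)

# Values outside the inner fences are flagged as possible extreme values.
inner_low, inner_high = result["inner"]
flagged = [x for x in data if x < inner_low or x > inner_high]
print(flagged)
```

For this data set every value falls inside the inner fences, so nothing is flagged.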

Skewness and the Box Plot
- A box-and-whisker plot can be used to determine the skewness of a distribution.
- The location of the median in the box can indicate the skewness of the middle 50% of the data.
  - If the median is located on the right side of the box, then the middle 50% are skewed to the left.
  - If the median is on the left side, then the middle 50% are skewed to the right.
- A researcher can also make a judgment about skewness based on the length of the whiskers.
  - If the longest whisker is to the right of the box, then the outer data are skewed to the right, and vice versa.
- See the box-and-whisker plot in the next slide.

Box and Whisker Plot
