Chapter 4 – Numerical Description
Population (size = N): characterized by parameters, e.g., $\mu$ = population mean, $\sigma$ = population standard deviation.
Sample (size = n): statistics are computed from the sample and estimate the parameters, e.g., $\bar{x}$ = sample mean, S = sample standard deviation.
Recall: Statistics are descriptive measures derived from a sample (n items). Parameters are descriptive measures derived from a population (N items).
Chapter 4 – Numerical Description
There are three key characteristics of numerical data:
Characteristic: Interpretation
Central Tendency: Where are the data values concentrated? What seem to be typical or middle data values?
Dispersion: How much variation is there in the data? How spread out are the data values? Are there unusual values?
Shape: Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?
Chapter 4 – Numerical Description
Example: Vehicle Quality
Consider the data set of vehicle defect rates from J. D. Power and Associates. Numerical statistics can be used to summarize this random sample of brands. Because the analysis is based on a sample, we must allow for sampling error.
$\text{Defect rate} = \dfrac{\text{total no. of defects}}{\text{no. inspected}} \times 100$
Chapter 4 – Numerical Description Number of defects per 100 vehicles, 2004 models.
Chapter 4 – Numerical Description Sorted data provides insight into central tendency and dispersion.
Chapter 4 – Numerical Description
Visual Displays: The dot plot offers a visual impression of the data. Histograms with 5 bins (suggested by Sturges’ Rule) and 10 bins are shown below. Both are roughly symmetric, with no extreme values, and each shows a modal class toward the low end.
Chapter 4 – Numerical Description
We can compute descriptive statistics using Excel and discuss measures of central tendency and dispersion…
– Figures 4.4 and 4.5 in your text detail the Excel menus for computing descriptive statistics.
– Figure 4.7 in your text details the MegaStat menus for computing descriptive statistics.
Chapter 4 – Central Tendency
The central tendency is the middle or typical values of a distribution. Central tendency can be assessed using a dot plot, a histogram, or, more precisely, numerical statistics.
The text presents six measures of central tendency…
– Mean
– Median
– Mode
– Midrange
– Geometric Mean (G)
– Trimmed Mean
The mean and median are the most frequently used, but we will discuss the merits of all six.
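Chapter 4 – Central Tendency
Mean – A familiar measure of central tendency.
Population formula: $\mu = \dfrac{1}{N}\sum_{i=1}^{N} x_i$
Sample formula: $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$
In Excel, use the function =AVERAGE(Data), where Data is an array of data values. For the sample of car brands, the sample formula applies with n = 37.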
Chapter 4 – Central Tendency
Characteristics of the Mean:
– The arithmetic mean is the most familiar average.
– It is affected by every sample item.
– It is the balancing point or fulcrum for the data.
– Regardless of the shape of the distribution, the signed deviations of the data points from the mean always sum to zero: $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$.
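The balancing-point property can be checked numerically. The sketch below is only an illustration: the five values are made up for the demonstration (they are not the vehicle data), and `statistics.mean` plays the role of Excel's =AVERAGE.

```python
from statistics import mean

# Illustrative values only; not the J. D. Power sample.
data = [20, 40, 70, 75, 80]

x_bar = mean(data)                      # arithmetic mean, like =AVERAGE(Data)
deviations = [x - x_bar for x in data]  # signed deviations from the mean

print("mean:", x_bar)
print("deviations:", deviations)
print("sum of deviations:", sum(deviations))  # positives and negatives cancel
```

Changing any single value shifts the mean, which is what "affected by every sample item" means in practice.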
Chapter 4 – Central Tendency
Median (M) – the 50th percentile or midpoint of the sorted sample data.
Use Excel’s function =MEDIAN(Data), where Data is an array of data values.
M separates the upper and lower halves of the sorted observations.
– If n is even, the median is the average of the middle two observations in the data array.
– If n is odd, the median is the middle observation in the data array.
Chapter 4 – Central Tendency
Median: To compute the median by hand, sort the n observations in the data:
For even n, Median = $\dfrac{x_{n/2} + x_{n/2+1}}{2}$
For odd n, Median = $x_{(n+1)/2}$
where $x_i$ is the i-th observation in the sorted data.
Chapter 4 – Central Tendency
Example: Consider a sorted sample of n = 6 data values in which $x_3 = 15$ and $x_4 = 17$. What is the median?
For even n, the two middle positions are n/2 = 6/2 = 3 and n/2 + 1 = 6/2 + 1 = 4.
M = $(x_3 + x_4)/2$ = (15 + 17)/2 = 16
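The by-hand rule for even and odd n can be written as a short function. This is a minimal sketch: `median_by_hand` is a hypothetical helper name, and the six values are invented so that the third and fourth sorted values are 15 and 17, matching the worked example (the slide's actual data are not reproduced here).

```python
def median_by_hand(values):
    """Median via the sorted-position rule (positions are 1-based in the text)."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        # Odd n: the middle observation sits at position (n + 1) / 2.
        return xs[(n + 1) // 2 - 1]
    # Even n: average the observations at positions n/2 and n/2 + 1.
    return (xs[n // 2 - 1] + xs[n // 2]) / 2


# Hypothetical data chosen so that x_3 = 15 and x_4 = 17.
example = [11, 12, 15, 17, 21, 32]
print(median_by_hand(example))
```

The `- 1` offsets convert the 1-based positions used on the slides to Python's 0-based indexing.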
Clickers Consider the following n = 7 data values: What is the median? A = 24 B = 25 C = 26 D = 27
Chapter 4 – Central Tendency
Median: For the 37 vehicle quality ratings (odd n), the position of the median is (n + 1)/2 = (37 + 1)/2 = 19. So, the median is $x_{19}$ = 121.
When there are several duplicate data values, the median does not provide a clean “50-50” split in the data.
Chapter 4 – Central Tendency
Characteristics of the Median
The median is insensitive to extreme data values. For example, consider the following quiz scores for 3 students:
Tom’s scores: 20, 40, 70, 75, 80 (Mean = 57, Median = 70, Total = 285)
Jake’s scores: 60, 65, 70, 90, 95 (Mean = 76, Median = 70, Total = 380)
Mary’s scores: 50, 65, 70, 75, 90 (Mean = 70, Median = 70, Total = 350)
What does the median for each student tell you?
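The three students' figures can be reproduced with Python's standard `statistics` module. This is just a sketch that recomputes the numbers on the slide; the dictionary layout is an arbitrary choice for the illustration.

```python
from statistics import mean, median

scores = {
    "Tom": [20, 40, 70, 75, 80],
    "Jake": [60, 65, 70, 90, 95],
    "Mary": [50, 65, 70, 75, 90],
}

# (mean, median, total) for each student
summary = {name: (mean(s), median(s), sum(s)) for name, s in scores.items()}

for name, (avg, med, total) in summary.items():
    print(f"{name}: mean={avg}, median={med}, total={total}")
```

All three medians coincide even though the means and totals differ, because the median depends only on the middle score and ignores how far the other scores lie from it.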
Chapter 4 – Central Tendency
Mode – The most frequently occurring data value.
– Similar to the mean and median if data values occur often near the center of the sorted data.
– There may be multiple modes or no mode.
– Easy to define, but not easy to calculate in large samples.
– Use Excel’s function =MODE(Array). It will return #N/A if there is no mode, and it will return the first mode found if the data are multimodal.
– May be far from the middle of the distribution and not at all typical.
– Generally isn’t useful for continuous data, since data values rarely repeat.
– Best for attribute data or a discrete variable with a small range (e.g., Likert scale).
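The "multiple modes or no mode" cases are easy to see in code. The sketch below uses a hypothetical helper, `modes`, that returns every most-frequent value and returns an empty list when no value repeats; that "no mode" convention is an assumption chosen to echo the slide, not a description of how Excel is implemented. The Likert-style responses are invented.

```python
from collections import Counter


def modes(values):
    """All most-frequent values, in first-seen order; [] if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: treat as "no mode"
    return [value for value, count in counts.items() if count == top]


# Hypothetical Likert-scale responses (1 = strongly disagree ... 5 = strongly agree)
responses = [4, 5, 3, 4, 2, 5, 4, 1, 5]

print(modes(responses))     # two values tie for most frequent
print(modes([1.2, 3.4, 5.6]))  # continuous-looking data with no repeats
```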
Chapter 4 – Central Tendency Mode: A bimodal distribution refers to the shape of the histogram rather than the mode of the raw data. Occurs when dissimilar populations are combined in one sample. For example,
Chapter 4 – Central Tendency
Skewness: Compare the mean and median, or look at the histogram, to determine the degree of skewness.
Mean, Median & Skewness:
– If median > mean, skewed left.
– If median = mean, symmetric.
– If median < mean, skewed right.
Mean, Mode & Skewness:
– If mode > mean, skewed left.
– If mode = mean, symmetric.
– If mode < mean, skewed right.
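The mean-versus-median rule can be sketched as a small classifier. `skew_direction` and its `tol` parameter are hypothetical names introduced for this illustration; the tolerance is an added assumption, since real data rarely give an exactly equal mean and median. The inputs are the three students' quiz scores from the median slide.

```python
from statistics import mean, median


def skew_direction(values, tol=0.0):
    """Rule-of-thumb skewness from comparing the mean with the median."""
    avg, med = mean(values), median(values)
    if abs(avg - med) <= tol:
        return "symmetric"
    return "skewed right" if med < avg else "skewed left"


print(skew_direction([20, 40, 70, 75, 80]))  # Tom: median 70 > mean 57
print(skew_direction([60, 65, 70, 90, 95]))  # Jake: median 70 < mean 76
print(skew_direction([50, 65, 70, 75, 90]))  # Mary: median 70 = mean 70
```

This is only a quick screen; a histogram is still the better check, especially for small samples.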
Chapter 4 – Central Tendency
Midrange – the point halfway between the lowest and highest values of X. Easy to use but sensitive to extreme data values.
Midrange = $\dfrac{x_{\min} + x_{\max}}{2}$
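A short sketch shows why the midrange is sensitive to extreme values. The function name `midrange` and both data sets are illustrative assumptions, not values from the vehicle sample.

```python
def midrange(values):
    """Halfway point between the smallest and largest observations."""
    if len(values) == 0:
        raise ValueError("midrange of an empty data set is undefined")
    return (min(values) + max(values)) / 2


typical = [20, 40, 70, 75, 80]   # illustrative values
with_outlier = typical + [500]   # same data plus one extreme value

print(midrange(typical))
print(midrange(with_outlier))    # a single extreme value moves it a long way
```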
Clickers Consider the J. D. Power quality data (n = 37): What is the midrange? A = 121 B = 122 C = 130 D = 173
Chapter 4 – Central Tendency
Trimmed Mean: To calculate the trimmed mean, first remove the highest and lowest k percent of the observations, then average what remains.
To determine how many observations to trim from each end, multiply k (expressed as a proportion) by n:
– Remove the (k × n) highest and the (k × n) lowest observations.
Mitigates the effects of extreme values. May exclude relevant data values.
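The trimming procedure can be sketched as follows. Two assumptions are made here that the slide does not settle: k is passed as a proportion (0.10 for 10 percent), and when k × n is not a whole number it is rounded down. `trimmed_mean` is a hypothetical helper and the data are invented.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k*n smallest and k*n largest observations.

    k is a proportion per tail (0 <= k < 0.5); k*n is rounded down (assumption).
    """
    if not 0 <= k < 0.5:
        raise ValueError("k must satisfy 0 <= k < 0.5")
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("trimmed mean of an empty data set is undefined")
    cut = int(k * n)           # number of observations removed from each end
    kept = xs[cut:n - cut]
    return sum(kept) / len(kept)


# Hypothetical sample with one extreme value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

print(sum(sample) / len(sample))   # ordinary mean, pulled up by 100
print(trimmed_mean(sample, 0.10))  # drops 1 and 100 before averaging
```

Other rounding conventions for k × n are possible, so results from a spreadsheet or statistics package may differ slightly from this sketch.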
Chapter 4 – Dispersion
Variation is the “spread” of data points about the center of the distribution in a sample.
The text considers the following measures of dispersion:
– Range
– Variance ($S^2$)
– Standard Deviation (S)
– Coefficient of Variation (CV)
– Mean Absolute Deviation (MAD)
The variance and standard deviation are the most frequently used, but we will briefly discuss the merits of all five.
Chapter 4 – Dispersion
Range – The difference between the largest and smallest observations. Easy to calculate, but sensitive to extreme data values.
Range = $x_{\max} - x_{\min}$
Chapter 4 – Dispersion
Variance: The population variance ($\sigma^2$) is defined as the sum of squared deviations around the mean divided by the population size:
$\sigma^2 = \dfrac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$
For the sample variance ($s^2$), we divide by n – 1 instead of n; otherwise $s^2$ would tend to underestimate the unknown population variance $\sigma^2$:
$s^2 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$
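The two divisors can be compared directly in code. This is a from-scratch sketch of the definitions above; `population_variance` and `sample_variance` are hypothetical helper names and the data are illustrative.

```python
def population_variance(values):
    """Sum of squared deviations around the mean, divided by N."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one observation")
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations around the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


data = [20, 40, 70, 75, 80]  # illustrative values only

print(population_variance(data))  # divides the sum of squares by 5
print(sample_variance(data))      # divides the same sum of squares by 4
```

Because the numerator is identical, the sample variance is always the larger of the two, and the gap shrinks as n grows.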
Chapter 4 – Dispersion
Standard Deviation – The square root of the variance. Explains how individual values in a data set vary from the mean. Units of measure are the same as X.
Population standard deviation: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$
Sample standard deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$
For the 37 vehicle quality ratings …
Chapter 4 – Dispersion
Calculating Standard Deviation:
– The standard deviation is nonnegative because deviations around the mean are squared.
– When every observation is exactly equal to the mean, the standard deviation is zero.
– Standard deviations can be large or small, depending on the units of measure.
– Compare standard deviations only for data sets measured in the same units and only if the means do not differ substantially.
Excel’s built-in functions are…
Statistic: Excel population formula, Excel sample formula
Variance: =VARP(Array), =VAR(Array)
Standard deviation: =STDEVP(Array), =STDEV(Array)
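For readers working outside Excel, Python's standard `statistics` module offers functions that play the same four roles. The pairing with the Excel names in the comments reflects the divisor each function uses (N versus n – 1); the data are illustrative only.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [20, 40, 70, 75, 80]  # illustrative values only

summary = {
    "population variance": pvariance(data),  # divides by N, like =VARP
    "sample variance": variance(data),       # divides by n - 1, like =VAR
    "population std. dev.": pstdev(data),    # like =STDEVP
    "sample std. dev.": stdev(data),         # like =STDEV
}

for label, value in summary.items():
    print(f"{label}: {value:.2f}")

# When every observation equals the mean, the standard deviation is zero.
constant = [5, 5, 5, 5]
print(stdev(constant))
```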
Chapter 4 – Dispersion
Coefficient of Variation – A unit-free measure of dispersion, expressed as a percent of the mean:
$CV = 100 \times \dfrac{s}{\bar{x}}$
Useful for comparing variables measured in different units or with different means. Only appropriate for nonnegative data. It is undefined if the mean is zero or negative.
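The "unit-free" claim can be demonstrated by measuring the same quantity in two different units. The sketch below assumes the sample standard deviation is used in the numerator; `coefficient_of_variation` is a hypothetical helper and the heights are invented.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation as a percent of the mean."""
    x_bar = mean(values)
    if x_bar <= 0:
        raise ValueError("CV is undefined when the mean is zero or negative")
    return 100 * stdev(values) / x_bar


# Hypothetical heights, recorded in centimetres and then in metres
heights_cm = [150, 160, 170, 180, 190]
heights_m = [h / 100 for h in heights_cm]

print(stdev(heights_cm), stdev(heights_m))  # standard deviations differ by units
print(coefficient_of_variation(heights_cm))
print(coefficient_of_variation(heights_m))  # same CV, up to rounding error
```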
Clickers Recall from the J. D. Power quality data (n = 37): What is the Coefficient of Variation? A = 5.48% B = 18.26% C = 22.89% D = %
Chapter 4 – Dispersion
Mean Absolute Deviation (MAD) – reveals the average distance from an individual data point to the mean (center of the distribution). Uses absolute values of the deviations around the mean:
$MAD = \dfrac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
Excel’s function is =AVEDEV(Array).
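A minimal sketch of the MAD calculation follows. `mean_absolute_deviation` is a hypothetical helper name and the data are illustrative; the point is that taking absolute values stops the positive and negative deviations from cancelling.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of the observations from their mean."""
    n = len(values)
    if n == 0:
        raise ValueError("MAD of an empty data set is undefined")
    x_bar = sum(values) / n
    return sum(abs(x - x_bar) for x in values) / n


data = [20, 40, 70, 75, 80]  # illustrative values only

print(mean_absolute_deviation(data))
```

Unlike the variance, the MAD stays in the original units without needing a square root.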
Chapter 4 – Dispersion
Central Tendency vs. Dispersion: Manufacturing
Consider the histograms of hole diameters drilled in a steel plate during manufacturing. The desired distribution is outlined in red.
Machine A: Desired mean (5 mm) but too much variation.
Machine B: Acceptable variation but mean is less than 5 mm.