BUS173: Applied Statistics. Instructor: Nahid Farnaz (Nhn), North South University. Chapter 3: Describing Data: Numerical
Measures of Central Tendency: Overview. In this chapter, we describe data numerically with measures of central tendency, measures of variability, and measures of the direction and strength of the relationship between two variables. Measures of central tendency: mean (arithmetic average), median (midpoint of ranked values), mode (most frequently observed value).
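Arithmetic Mean. The arithmetic mean (mean) is the most common measure of central tendency. For a population of N values: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, where the $x_i$ are the population values and $N$ is the population size. For a sample of size n: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where the $x_i$ are the observed values and $n$ is the sample size.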
Arithmetic Mean (continued). Mean = sum of values divided by the number of values. It is affected by extreme values (outliers). [Figure: two dot plots on a 0 to 10 axis, one with Mean = 3 and one with Mean = 4.]
Median. In an ordered list, the median is the “middle” number (50% above, 50% below). It is not affected by extreme values. [Figure: two dot plots on a 0 to 10 axis, both with Median = 3.]
Finding the Median. The location of the median is the $\frac{n+1}{2}$ position in the ordered data. If the number of values is odd, the median is the middle number. If the number of values is even, the median is the average of the two middle numbers. Note that $\frac{n+1}{2}$ is not the value of the median, only the position of the median in the ranked data.
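The position rule can be sketched in a few lines of Python. This is only an illustration: the function name `median_by_position` and both sample lists are made up for the example and do not come from the slides.

```python
def median_by_position(values):
    """Return the median using the (n + 1) / 2 position rule on ranked data."""
    ranked = sorted(values)
    n = len(ranked)
    if n == 0:
        raise ValueError("the median of empty data is undefined")
    position = (n + 1) / 2                # 1-based position, may end in .5
    lower = ranked[int(position) - 1]     # value at the whole-number position
    if position.is_integer():
        return lower                      # odd n: the middle value itself
    upper = ranked[int(position)]         # even n: the next value up
    return (lower + upper) / 2            # average of the two middle values


odd_sample = [7, 3, 9, 1, 5]              # illustrative values only
even_sample = [7, 3, 9, 1, 5, 11]         # illustrative values only
print(median_by_position(odd_sample))     # position 3 of 1 3 5 7 9
print(median_by_position(even_sample))    # position 3.5: average of 5 and 7
```

For the odd-sized list the position is a whole number, so the value is read off directly; for the even-sized list the position ends in .5, so the two neighbouring values are averaged.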
Mode. A measure of central tendency: the value that occurs most often. Not affected by extreme values. Used for either numerical or categorical data. There may be no mode, and there may be several modes. [Figure: two dot plots, one on a 0 to 6 axis with no mode and one on a 0 to 14 axis with Mode = 9.]
Which measure of location is the “best”? The mean is generally used unless extreme values (outliers) exist. In that case the median is often used, since the median is not sensitive to extreme values. Example: median home prices may be reported for a region because they are less sensitive to outliers.
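A small sketch of why the median is preferred when outliers exist. The home prices below are hypothetical numbers invented for this illustration (in thousands of dollars); only the standard-library `statistics` functions are used.

```python
from statistics import mean, median

# Hypothetical home prices in thousands of dollars (illustration only).
prices = [180, 195, 210, 220, 245]
prices_with_outlier = prices + [2500]     # add one very expensive home

print(mean(prices), median(prices))
print(mean(prices_with_outlier), median(prices_with_outlier))
```

Adding the single extreme price pulls the mean far above every ordinary home in the list, while the median moves only slightly.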
Shape of a Distribution. Describes how data are distributed. Measures of shape: symmetric or skewed. Left-skewed: Mean < Median. Symmetric: Mean = Median. Right-skewed: Median < Mean.
Exercise 3.2. A department store manager is interested in the number of complaints received by the customer service department about the quality of electrical products sold. Records over a 5-week period show the following number of complaints for each week: 13, 15, 8, 16, 8. a) Compute the mean number of weekly complaints. b) Compute the median. c) Find the mode. Ans: a) 12; b) 13; c) 8. Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc.
Ex. 3.6. The demand for bottled water increases during the hurricane season in Florida. A random sample of 7 hours showed that the following numbers of 1-gallon bottles were sold in one store: 40, 55, 62, 43, 50, 60, 65. a) Describe the central tendency of the data. b) Comment on symmetry or skewness. Ans: mean = 53.57; no unique mode; median = 55; left-skewed.
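The answer to Ex. 3.6 can be checked with Python's standard `statistics` module. The skewness label below uses only the mean-versus-median comparison from the Shape of a Distribution slide, which is a rule of thumb rather than a formal skewness measure; the variable names are chosen for this sketch.

```python
from statistics import mean, median, multimode

bottles = [40, 55, 62, 43, 50, 60, 65]    # sample from Ex. 3.6

sample_mean = mean(bottles)
sample_median = median(bottles)
modes = multimode(bottles)                # every value tied for most frequent

# Compare mean and median to label the shape (rule of thumb only).
if sample_mean < sample_median:
    shape = "left-skewed"
elif sample_mean > sample_median:
    shape = "right-skewed"
else:
    shape = "symmetric"

print(round(sample_mean, 2), sample_median, shape)
print("values tied as mode:", len(modes))
```

Because every value occurs exactly once, all seven values tie as the most frequent, which is what "no unique mode" means here.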
Measures of Variability. Measures of variation give information on the spread or variability of the data values. Measures covered: range, interquartile range, variance, standard deviation, coefficient of variation. [Figure: distributions with the same center but different variation.]
Range. Simplest measure of variation: the difference between the largest and the smallest observations. Range = Xlargest - Xsmallest. Example: [Figure: dot plot on a 0 to 14 axis.] Range = 14 - 1 = 13.
Disadvantages of the Range. It ignores the way in which data are distributed: [Figure: two dot plots on a 7 to 12 axis, each with Range = 12 - 7 = 5.] It is sensitive to outliers: for 1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5 the Range = 5 - 1 = 4; for 1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120 the Range = 120 - 1 = 119.
Interquartile Range. Some outlier problems can be eliminated by using the interquartile range: eliminate high- and low-valued observations and calculate the range of the middle 50% of the data. Interquartile range = 3rd quartile - 1st quartile: IQR = Q3 - Q1.
Quartile Formulas. Find a quartile by determining the value in the appropriate position in the ranked data, where n is the number of observed values. First quartile position: Q1 = 0.25(n+1). Second quartile position: Q2 = 0.50(n+1) (the median position). Third quartile position: Q3 = 0.75(n+1).
Interquartile Range Example. [Figure: box-and-whisker plot with minimum = 12, Q1 = 30, median (Q2) = 45, Q3 = 57, maximum = 70; each of the four segments holds 25% of the data.] Interquartile range = 57 - 30 = 27. Five-number summary: Minimum < Q1 < Median (Q2) < Q3 < Maximum.
Quartiles Example: find the first quartile. Sample ranked data (n = 9): 11 12 13 16 16 17 18 21 22. Q1 is in the 0.25(9+1) = 2.5 position of the ranked data, so use the value halfway between the 2nd and 3rd values (12 and 13): Q1 = 12.5.
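The same position rule extends to all three quartiles. The sketch below applies the 0.25(n+1), 0.50(n+1) and 0.75(n+1) positions to the ranked data of this example, interpolating linearly when a position is fractional. The helper names `value_at_position` and `quartiles` are invented for the illustration, and other software may use a different quartile convention and give slightly different values.

```python
def value_at_position(ranked, position):
    """Value at a 1-based, possibly fractional, position in ranked data."""
    if not 1 <= position <= len(ranked):
        raise ValueError("position falls outside the ranked data")
    whole = int(position)                  # whole part of the position
    fraction = position - whole            # fractional part, e.g. 0.5
    lower = ranked[whole - 1]
    if fraction == 0:
        return lower
    upper = ranked[whole]
    return lower + fraction * (upper - lower)   # linear interpolation


def quartiles(values):
    """Q1, Q2, Q3 using the 0.25(n+1), 0.50(n+1), 0.75(n+1) positions."""
    ranked = sorted(values)
    n = len(ranked)
    return tuple(value_at_position(ranked, p * (n + 1))
                 for p in (0.25, 0.50, 0.75))


data = [11, 12, 13, 16, 16, 17, 18, 21, 22]     # ranked sample, n = 9
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)            # Q1 matches the 12.5 found on the slide
print("IQR =", q3 - q1)
```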
Example 3.4: Waiting Times at Gilotti’s Grocery (five-number summary). A stem-and-leaf display for a sample of 25 waiting times (in seconds) is given below. Compute the five-number summary. Calculate and interpret the IQR. [Stem-and-leaf display, labelled “Minutes”: N = 25; Stem Unit = 10.0; Leaf Unit = 1.0; 1 2 4 6 7 8 9 3]
Population Variance. Average of squared deviations of values from the mean. Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, where $\mu$ = population mean, $N$ = population size, and $x_i$ = ith value of the variable x.
Sample Variance. Average (approximately) of squared deviations of values from the mean. Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, where $\bar{x}$ = sample mean, $n$ = sample size, and $x_i$ = ith value of the variable x.
Population Standard Deviation. Most commonly used measure of variation. Shows variation about the mean. Has the same units as the original data. Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$.
Sample Standard Deviation. Most commonly used measure of variation. Shows variation about the mean. Has the same units as the original data. Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$.
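Example: Sample Standard Deviation. Sample data ($x_i$): 10 12 14 15 17 18 18 24; n = 8; mean $\bar{x}$ = 16. $s = \sqrt{\frac{(10-16)^2 + (12-16)^2 + \cdots + (24-16)^2}{8-1}} = \sqrt{\frac{130}{7}} \approx 4.31$. This is a measure of the “average” scatter around the mean.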
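The calculation for this sample can be written out step by step in Python. This is a sketch with variable names chosen for readability; the final line cross-checks the hand-rolled result against `statistics.stdev`, which also divides by n - 1.

```python
from math import sqrt
from statistics import stdev

data = [10, 12, 14, 15, 17, 18, 18, 24]   # sample data from the slide
n = len(data)
x_bar = sum(data) / n                     # sample mean

# Squared deviation of each value from the sample mean.
squared_deviations = [(x - x_bar) ** 2 for x in data]

sample_variance = sum(squared_deviations) / (n - 1)   # divide by n - 1
sample_sd = sqrt(sample_variance)

print(x_bar, sum(squared_deviations), round(sample_sd, 4))
print(abs(sample_sd - stdev(data)) < 1e-12)           # agrees with stdev
```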
Example 3.5. A professor teaches two large sections of marketing and randomly selects a sample of test scores from both sections. Find the range and standard deviation for each sample. Section 1: 50 60 70 80 90. Section 2: 72 68 74 66.
Chebyshev’s Theorem. This theorem establishes data intervals for any data set, regardless of the shape of the distribution. For any population with mean μ and standard deviation σ, and k > 1, the percentage of observations that fall within the interval [μ ± kσ] is at least $100\left[1 - \frac{1}{k^2}\right]\%$, where k is the number of standard deviations.
Chebyshev’s Theorem (continued). Regardless of how the data are distributed, at least $(1 - 1/k^2)$ of the values will fall within k standard deviations of the mean (for k > 1). Examples: at least $(1 - 1/1^2)$ = 0% within (μ ± 1σ), k = 1; at least $(1 - 1/2^2)$ = 75% within (μ ± 2σ), k = 2; at least $(1 - 1/3^2)$ ≈ 89% within (μ ± 3σ), k = 3.
The Empirical Rule. If the data distribution is bell-shaped, then the interval $\mu \pm 1\sigma$ contains about 68% of the values in the population or the sample.
The Empirical Rule (continued). The interval $\mu \pm 2\sigma$ contains about 95% of the values in the population or the sample. The interval $\mu \pm 3\sigma$ contains about 99.7% of the values in the population or the sample.
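As a rough check of these percentages, the sketch below assumes that "bell-shaped" means exactly normal, which is an idealization that real data only approximate. It uses `statistics.NormalDist` from the Python standard library; the dictionary name `shares` is chosen for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)   # assumed exact normal shape

# Share of a normal distribution within k standard deviations of the mean.
shares = {k: standard_normal.cdf(k) - standard_normal.cdf(-k)
          for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The printed shares are close to the 68%, 95% and 99.7% quoted by the rule.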
Exercise 3.18. Use Chebyshev’s theorem to answer the following if the mean is 250 and the standard deviation is 20. At least what proportion of the observations is: a) between 190 and 310? b) between 210 and 290? c) between 230 and 270? Ans: a) 88.9% (k = 3); b) 75% (k = 2); c) 0% (k = 1).
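The exercise can be worked by first converting each interval into a number of standard deviations k and then applying the bound 1 - 1/k². The function name `chebyshev_lower_bound` is made up for this sketch, and the code assumes each interval is symmetric about the mean, which holds for all three parts.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations."""
    if k < 1:
        raise ValueError("the bound is only informative for k >= 1")
    return 1 - 1 / k ** 2


mu, sigma = 250, 20                       # mean and S.D. from Exercise 3.18
intervals = [(190, 310), (210, 290), (230, 270)]

bounds = []
for low, high in intervals:
    k = (high - mu) / sigma               # intervals are symmetric about mu
    bounds.append((k, chebyshev_lower_bound(k)))
    print(f"{low} to {high}: k = {k:g}, at least {bounds[-1][1]:.1%}")
```

For k = 1 the bound is 0%, which says nothing useful; that is why the theorem is stated for k > 1.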
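Coefficient of Variation. Measures relative variation. Always expressed as a percentage (%). Shows the standard deviation relative to the mean: $CV = \frac{s}{\bar{x}} \cdot 100\%$ for a sample, or $CV = \frac{\sigma}{\mu} \cdot 100\%$ for a population. Can be used to compare two or more sets of data measured in different units.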
Comparing Coefficients of Variation. Stock A: average price last year = $50; standard deviation = $5; CV = (5/50) × 100% = 10%. Stock B: average price last year = $100; standard deviation = $5; CV = (5/100) × 100% = 5%. Both stocks have the same standard deviation, but stock B is less variable relative to its price.
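The stock comparison takes one line per stock in Python. The helper name `coefficient_of_variation` is chosen for this sketch, and stock B's standard deviation of $5 follows from the statement that both stocks have the same standard deviation.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for mean 0")
    return 100 * sd / mean


cv_a = coefficient_of_variation(sd=5, mean=50)    # stock A
cv_b = coefficient_of_variation(sd=5, mean=100)   # stock B, same S.D.
print(f"Stock A: {cv_a}%  Stock B: {cv_b}%")
```

The smaller percentage for stock B expresses the same point as the slide: equal absolute spread, but less spread relative to price.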
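The Covariance. The covariance measures the direction of the linear relationship between two variables; its magnitude depends on the units of x and y. The population covariance: $\text{Cov}(x, y) = \sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$. The sample covariance: $\text{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$. It is concerned only with the relationship itself; no causal effect is implied.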
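Interpreting Covariance. Covariance between two variables: Cov(x,y) > 0 means x and y tend to move in the same direction; Cov(x,y) < 0 means x and y tend to move in opposite directions; Cov(x,y) = 0 means x and y have no linear relationship (independent variables have zero covariance, but zero covariance alone does not imply independence).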
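Correlation Coefficient. Measures the relative strength of the linear relationship between two variables. Population correlation coefficient: $\rho = \frac{\text{Cov}(x, y)}{\sigma_x \sigma_y}$, using the population covariance and standard deviations. Sample correlation coefficient: $r = \frac{\text{Cov}(x, y)}{s_x s_y}$, using the sample covariance and standard deviations.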
Features of the Correlation Coefficient, r. Unit free. Ranges between -1 and 1. The closer to -1, the stronger the negative linear relationship. The closer to 1, the stronger the positive linear relationship. The closer to 0, the weaker any linear relationship.
Scatter Plots of Data with Various Correlation Coefficients. [Figure: six scatter plots of Y against X showing examples with r = -1, r = -0.6, r = 0, r = +1, r = +0.3, and r = 0.]
Interpreting the Result. There is a relatively strong positive linear relationship between test score #1 and test score #2: students who scored high on the first test tended to score high on the second test.
Example 3.13 (Covariance and Correlation Coefficient). Rising Hills Manufacturing Inc. wishes to study the relationship between the number of workers, X, and the number of tables produced, Y. It has obtained a random sample of 10 hours of production. The following (x, y) combinations of points were obtained: (12,20) (30,60) (15,27) (24,50) (14,21) (18,30) (28,61) (26,54) (19,32) (27,57). Compute the covariance and correlation coefficient. Briefly discuss the relationship between X and Y.
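One way to carry out this computation is sketched below, using the sample formulas with n - 1 in the denominator. The variable names are chosen for the illustration; the ten (x, y) points are copied from the example.

```python
from math import sqrt

# (workers, tables) pairs from Example 3.13
points = [(12, 20), (30, 60), (15, 27), (24, 50), (14, 21),
          (18, 30), (28, 61), (26, 54), (19, 32), (27, 57)]
xs = [x for x, _ in points]
ys = [y for _, y in points]
n = len(points)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sample covariance: average cross-product of deviations, divided by n - 1.
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in points) / (n - 1)

# Sample standard deviations of x and y.
s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))

r = cov_xy / (s_x * s_y)                  # sample correlation coefficient
print(round(cov_xy, 2), round(r, 3))
```

The positive covariance and a correlation close to 1 indicate a strong positive linear relationship: hours with more workers tend to produce more tables.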
Obtaining Linear Relationships. An equation can be fit to show the best linear relationship between two variables: Y = β0 + β1X, where Y is the dependent variable and X is the independent variable.
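Least Squares Regression. Estimates for the coefficients β0 and β1 are found to minimize the sum of the squared residuals. The least-squares regression line, based on sample data, is $\hat{y} = b_0 + b_1 x$, where $b_1$ is the slope of the line and $b_0$ is the y-intercept: $b_1 = \frac{\text{Cov}(x, y)}{s_x^2} = r \frac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1 \bar{x}$.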
Example 3.15. Using the answers from Example 3.13 (Rising Hills Inc.), compute the sample regression coefficients b1 and b0. What is the sample regression line?
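A possible way to obtain the coefficients is sketched below. It recomputes the needed sums from the ten (x, y) points listed in Example 3.13 so that the block runs on its own, and it uses the fact that the slope Cov(x, y) / s_x² equals the ratio of the two sums of deviations because the n - 1 factors cancel. The helper `predict` is a hypothetical name added for illustration.

```python
# (workers, tables) pairs from Example 3.13
points = [(12, 20), (30, 60), (15, 27), (24, 50), (14, 21),
          (18, 30), (28, 61), (26, 54), (19, 32), (27, 57)]
n = len(points)
x_bar = sum(x for x, _ in points) / n
y_bar = sum(y for _, y in points) / n

# Sums of cross-products and squared deviations (the n - 1 factors cancel).
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
sum_xx = sum((x - x_bar) ** 2 for x, _ in points)

b1 = sum_xy / sum_xx                      # slope = Cov(x, y) / s_x^2
b0 = y_bar - b1 * x_bar                   # intercept


def predict(workers):
    """Predicted number of tables from the fitted line (illustrative helper)."""
    return b0 + b1 * workers


print(f"y-hat = {b0:.2f} + {b1:.3f} x")
```

The fitted line passes through the point of means, so `predict(x_bar)` returns the mean of y up to rounding error.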