# Numerically Summarizing Data

Learning Objectives

1. Understand the difference between a parameter and a statistic
2. Describe and compute measures of central tendency
3. Describe and compute measures of dispersion
4. Compute measures of location
5. Learn to read box plots and check for outliers

Measures of Central Tendency
(Mean, Median, and Mode)

A parameter is a descriptive measure of a population. In most real-world cases, the population parameter is not known. One example is the average gas price across the whole nation.

A statistic is a descriptive measure of a sample. We use a statistic to estimate the corresponding parameter. For example, the national average gas price is not known, but we can take a random sample of 100 stations, compute the sample average gas price, and use that sample average to estimate the unknown population average.

Population Mean and Sample Mean
The population mean, $\mu$, is computed using all the individuals in a population, where the total number of individuals is $N$: $\mu = \frac{x_1 + x_2 + \cdots + x_N}{N}$. The population mean is a parameter.

The sample mean, $\bar{x}$, is computed using sample data of size $n$: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$. The sample mean is a statistic that is an unbiased estimator of the population mean.

NOTE: In real-world applications, the population mean $\mu$ is usually not known and is estimated by the sample mean $\bar{x}$.

Median

The median of a variable is the value that lies in the middle of the data when the data are arranged in ascending order. That is, half the data are below the median and half the data are above the median. We use M to represent the median.

Steps in Computing the Median of a Data Set
1. Arrange the data in ascending order.
2. Determine the number of observations (n).
3. Determine the observation in the middle of the data set. The position is (n+1)/2.
   - If (n+1)/2 is an integer, locate the data value at the (n+1)/2 position. This is the median. (NOTE: in this situation, the number of data values, n, is odd.)
   - If (n+1)/2 is NOT an integer, the median is the average of the two data values on either side of the (n+1)/2 position. (NOTE: in this situation, n is even.)

EXAMPLE Computing the Median of Data
Find the mean and median of the following pulse rates from a sample of 8 individuals (NOTE: n = 8 in this case):

80, 76, 65, 68, 72, 73, 65, 80

- Arrange them in ascending order: 65, 65, 68, 72, 73, 76, 80, 80
- Find the position: (n+1)/2 = (8+1)/2 = 4.5
- The position is not an integer: Median = (72+73)/2 = 72.5

Now add one additional pulse rate of 100 and find the median of the data (NOTE: n = 9 in this case):

80, 76, 65, 68, 72, 73, 65, 80, 100

- Ascending order: 65, 65, 68, 72, 73, 76, 80, 80, 100
- Position: (9+1)/2 = 5
- Median is 73 (the value in the 5th position)
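The position rule can be sketched in a few lines of Python. This is only an illustration of the steps listed earlier; the function name `median_by_position` is made up for this example and is not part of the original slides.

```python
def median_by_position(values):
    """Median using the (n + 1) / 2 position rule."""
    ordered = sorted(values)             # Step 1: ascending order
    n = len(ordered)                     # Step 2: number of observations
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    position = (n + 1) / 2               # Step 3: 1-based middle position
    if position.is_integer():            # n is odd: take the middle value
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]   # value just below the position
    upper = ordered[int(position)]       # value just above the position
    return (lower + upper) / 2           # n is even: average the two


pulse_rates = [80, 76, 65, 68, 72, 73, 65, 80]
print(median_by_position(pulse_rates))           # 72.5
print(median_by_position(pulse_rates + [100]))   # 73
```

Python's built-in `statistics.median` gives the same results for these data.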

Mode
The mode of a variable is the most frequent observation of the variable in the data set. If two values are tied for the highest frequency, we say the data set is bimodal.

Exercise: Find the mode of the following pulse rate data:

80, 76, 65, 68, 72, 73, 65, 80, 100, 80, 74, 65, 66, 70, 74, 65, 80, 98

The modes are 65 and 80 (each occurs four times).
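As a quick check of the exercise, the modes can be found by counting how often each value occurs. The helper name `modes` below is an assumption made for this sketch.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


pulse_rates = [80, 76, 65, 68, 72, 73, 65, 80, 100,
               80, 74, 65, 66, 70, 74, 65, 80, 98]
print(modes(pulse_rates))   # [65, 80]
```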

Comparing Mean and Median:
How does an extreme observation affect the mean and the median? [similar exam questions]

Example: The following are the quiz scores of 10 students in class A: 5, 5, 5, 5, 5, 7, 7, 7, 7, 7

Find mean = ______, find median = ________

The following are the quiz scores of 10 students in class B: 5, 5, 5, 5, 5, 7, 7, 7, 7, 30

Find mean = ________, find median = _________

Fact: The mean is sensitive to extreme data values. The median is robust to extreme data values.
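One way to check your answers is with Python's standard `statistics` module. This is a small sketch using the two classes from the example; the variable names are chosen here for illustration.

```python
import statistics

class_a = [5, 5, 5, 5, 5, 7, 7, 7, 7, 7]
class_b = [5, 5, 5, 5, 5, 7, 7, 7, 7, 30]   # same data, last score replaced by 30

for name, scores in (("A", class_a), ("B", class_b)):
    mean = statistics.mean(scores)
    median = statistics.median(scores)
    print(f"Class {name}: mean = {mean}, median = {median}")
```

The single extreme score of 30 moves the mean but leaves the median where it was.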

How do unusual cases affect the average, the median, and the shape of the histogram?

Compare the histograms with and without the 'outlier' case of 5000 miles:

Shape is _________ Shape is _________

Descriptive Statistics: Miles for 148 cases (with the case of 5000 miles)
Variable N Mean SE Mean TrMean StDev Min Q1 Median Q3 Max
Miles ______

Descriptive Statistics: Miles for 147 cases (without the case of 5000 miles)
Variable N Mean SE Mean TrMean StDev Min Q1 Median Q3 Max
Miles ______

Descriptive Statistics: Miles for 147 cases (without the case of 5000 miles)
Variable N Mean Min Median Maximum
Miles

NOTE: The median remains unchanged. Why? The median uses only the middle one (or two) data points, whereas the average uses every data value. So one unusually large value makes the average larger, but not the median.

When a data set has unusually large or small values relative to the rest of the data, or when the distribution of the data is skewed, the median is preferred over the arithmetic mean as the measure of central tendency because it is more representative of the typical observation.

Comparison of Mean, Median, and Mode for different shapes of distributions [Similar exam question]
- Left-skewed: Mean < Median
- Symmetric: Mean ≈ Median (Mean = Median = Mode when the distribution is perfectly symmetric)
- Right-skewed: Mean > Median

Shape: Concerned with the extent to which values are symmetrically distributed.

Kurtosis: The extent to which a distribution is peaked (flatter or taller). For example, a distribution could be more peaked than a normal distribution (and still be bell-shaped). If the values are negative, the distribution is less peaked than a normal distribution.

Skew: The extent to which a distribution is symmetric or has a tail. The value is 0 for a normal distribution. If the value is negative, the distribution is negatively skewed (left-skewed).

Exercise

NOTE: In real-world applications, the distribution of sample data can never be perfectly symmetric. The shape can only be approximately symmetric. IF THE MEAN IS CLOSE TO THE MEDIAN (NOT NECESSARILY EXACTLY MEAN = MEDIAN), WE SAY THE DISTRIBUTION IS APPROXIMATELY SYMMETRIC.

Exercise: A sample of 50 gas prices is recorded and summarized. The average price is \$3.15 and the median price is \$3.13. Is the shape of the price distribution more likely to be skewed-to-left, approximately symmetric, or skewed-to-right? ANS:

Measures of Dispersion
Four different measures of dispersion: Range, Variance, Standard Deviation, Interquartile Range (IQR)

Measures of dispersion measure how widely the data values are spread. The more spread out the data values, the larger the variation.

Example:
- Scores of 5 students in class A: 60, 60, 70, 80, 80
- Scores of 5 students in class B: 40, 60, 70, 80, 100
- Scores of 5 students in class C: 70, 70, 70, 70, 70

Q: Scores in Class ____ have the largest variation. Scores in Class _____ have zero variation.

Visualizing Variability using Histogram
Which one shows the largest variation? Which one shows the smallest variation? C

How to measure the variation?
Range = R = Largest Data Value – Smallest Data Value

The sample variance is: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

The sample standard deviation is: $s = \sqrt{s^2}$

NOTE: The divisor $(n-1)$ is called the degrees of freedom.

The population variance is symbolically represented by lower case Greek sigma squared: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$

The population standard deviation is: $\sigma = \sqrt{\sigma^2}$
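To make the formulas concrete, here is a minimal sketch that computes the range, sample variance, and sample standard deviation for the class A, B, and C scores from the earlier example. The function name `sample_variance` is chosen for this illustration; `statistics.variance` from the standard library uses the same $n - 1$ divisor and is used only as a cross-check.

```python
import math
import statistics


def sample_variance(values):
    """Sample variance: sum of squared deviations divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


classes = {
    "A": [60, 60, 70, 80, 80],
    "B": [40, 60, 70, 80, 100],
    "C": [70, 70, 70, 70, 70],
}

for name, scores in classes.items():
    data_range = max(scores) - min(scores)
    s2 = sample_variance(scores)
    s = math.sqrt(s2)
    # Cross-check against the standard library (also divides by n - 1).
    assert math.isclose(s2, statistics.variance(scores))
    print(f"Class {name}: range = {data_range}, s^2 = {s2:.1f}, s = {s:.2f}")
```

Class B has the widest spread and class C has none, which matches what the range alone suggests for these data.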

NOTE: As mentioned before, for real-world problems the population mean, population variance, and population standard deviation are NOT KNOWN. Like the sample mean, the sample variance and sample standard deviation are obtained from sample data. They are used to estimate the unknown population variance and population standard deviation. This is a major part of inferential statistics, which will be dealt with later. In this chapter, we learn how to compute and interpret these sample descriptive summaries to understand the sample data.

Notation and Measurement Units
Notation:
- $s^2$: sample variance
- $s$: sample standard deviation
- $\sigma^2$: population variance
- $\sigma$: population standard deviation

NOTE: If the original measurement unit is (ft), the variance $s^2$ has measurement unit (ft)$^2$: if $x$ has unit (ft), then $(x - \bar{x})^2$ has the unit (ft)(ft), which is (ft)$^2$. The measurement unit of $s^2$ is (ft)$^2$. The measurement unit of $s$ is (ft).

Some Important Tips

NOTE: Sample statistics, such as the sample mean $\bar{x}$, the sample median, $s$, and $s^2$, will be different for different samples. Population parameters, such as the population mean $\mu$, the population variance $\sigma^2$, and the population s.d. $\sigma$, are fixed constants for a given population. They do not change from sample to sample.

Exercise Comparing Variation: Quiz Scores of 40 students [similar exam questions]
[Figure: histograms of quiz scores for Class A, Class B, and Class C]

Variation: Which one has the smallest s.d.? Which has the largest s.d.?

Answer: Class B has the smallest standard deviation. Class A has the largest standard deviation.

Points to remember about variance and standard deviation and the relationship with histogram:
- The values of $s$ and $s^2$ are always greater than or equal to zero.
- The larger the value of $s^2$ or $s$, the greater the variability of the data set.
- If $s^2$ or $s$ is equal to zero, all measurements must have the same value.
- The standard deviation $s$ is computed in order to have a measure of variability with the same unit as the observations.
- The larger the s.d., the more spread out the data and the flatter the histogram.
- The smaller the s.d., the more clustered the data around the mean and the taller the peak of the histogram.

Exercise (Similar Exam questions)
1. The gas price is a concern for people. A random sample of 40 stations gives the following data summary: Sample mean = \$2.15, Median = \$2.12, s = \$.15

   Q: Is the distribution of the gas prices more likely to be (a) symmetric, (b) skewed-to-right, or (c) skewed-to-left? And WHY?

2. The following two data sets are prices of milk from 6 stores: one from January and one from a year later.

   Store: A B C D E F
   Price in January:
   Price in January (one year later):

   True or False for each of the following statements:
   (a) The average price remains the same between the two years.
   (b) The price range remains the same between the two years.
   (c) The median remains the same between the two years.
   (d) The standard deviation (s) remains the same between the two years.

Descriptive Summary for the 56 distances
Descriptive Statistics: distance
Variable N Mean Median TrMean StDev SE Mean
distance
Variable Minimum Maximum Q1 Q3
distance

- StDev is $s$, the sample standard deviation; here $s^2 = (112.2)^2$.
- TrMean is the trimmed mean: the mean after excluding the lowest 5% and the highest 5% of the data.
- Minimum and Maximum are the smallest and largest distances.
- 25% of the distances are lower than Q1, the first quartile, or 25th percentile.
- 75% of the distances are lower than Q3, the third quartile, or 75th percentile.

If we add a new maximum, 6000, to the data, so that we have 57 cases, what is the effect of the 6000 on the following summary statistics: Increase? Decrease? The same?
(a) the average distance:
(b) the median distance:
(c) the standard deviation:
(d) the range:

(a) The average distance is increased.
(b) The median distance for this example is the same (in general, it will be almost the same).
(c) The standard deviation is increased.
(d) The range is increased.

Empirical Rule and Applications

What is the meaning of variation and how is it used in solving real-world problems?

For symmetric, mound-shaped (bell-shaped) data:
- Approximately 68% of the data lie within 1 standard deviation of the mean.
- Approximately 95% of the data lie within 2 standard deviations of the mean.
- Approximately 99.7% (nearly all) of the data lie within 3 standard deviations of the mean.

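The important application of the Empirical Rule is to identify rare (unusual, extreme) observations. If an observation falls outside the two-s.d. range, it has only about a 5% chance of occurring. Therefore, it is considered to be a rare (or unusual) case.

[Figure: bell curve with the horizontal axis marked at $\mu - 2\sigma$, $\mu - \sigma$, $\mu$, $\mu + \sigma$, $\mu + 2\sigma$. On each side of the mean, 34% of the data lie within one s.d. of $\mu$, 13.5% lie between one and two s.d. from $\mu$, and 2.5% lie beyond two s.d.]

NOTE: If you add the percentages on each side of the center line $\mu$, they add to 50%. A mound-shaped distribution is symmetric about the mean.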
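The Empirical Rule percentages can be checked with a small simulation. The sketch below is an illustration only: the mean of 100, the s.d. of 15, the sample size, and the random seed are all arbitrary assumptions, and `share_within` is a helper name invented here.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable

# Simulated bell-shaped data (assumed mean 100, s.d. 15; any values would do).
data = [random.gauss(100, 15) for _ in range(100_000)]
mean = statistics.fmean(data)
s = statistics.stdev(data)


def share_within(k):
    """Fraction of the data lying within k standard deviations of the mean."""
    low, high = mean - k * s, mean + k * s
    return sum(low <= x <= high for x in data) / len(data)


for k in (1, 2, 3):
    print(f"within {k} s.d.: {share_within(k):.3f}")
```

The printed fractions should land close to 0.68, 0.95, and 0.997, which is why the rule is stated with "approximately".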

Applying Empirical Rule to identify Rare Events
A simple and powerful tool for identifying outliers, extremes, or unusual (rare) events. We will use this rule very often throughout the entire semester. (Similar questions in the test)

Consider the 2010 ACT test: the average was 21 and the standard deviation was 4. The distribution of the ACT scores is mound-shaped.

Q1: A student received a score of 25. Is this an unusually high score?
Q2: If CMU admits students with a minimum ACT of one standard deviation below the mean, what is the minimum ACT for CMU admission?
Q3: A student received an ACT of 30. Is this an unusually high score?

ANSWER:
Q1: 25 = 21 + 4, which is one s.d. above the mean. It is inside two s.d. from the mean, so it is NOT an unusually high score.
Q2: The score at one s.d. below the mean = 21 – 4 = 17.
Q3: The score 30 > 21 + 2(4) = 29, so it is outside two s.d. from the mean. Only about 2.5% of scores are higher than 29. Hence, 30 is an unusually high score.
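The two-s.d. check used in these answers is easy to express as code. This is a minimal sketch; the function name `is_unusual` and its `k` parameter are assumptions made for the illustration.

```python
def is_unusual(value, mean, sd, k=2):
    """True if value lies more than k standard deviations from the mean."""
    return abs(value - mean) > k * sd


act_mean, act_sd = 21, 4

print(is_unusual(25, act_mean, act_sd))   # False: 25 is one s.d. above the mean
print(act_mean - act_sd)                  # 17: one s.d. below the mean
print(is_unusual(30, act_mean, act_sd))   # True: 30 is above 21 + 2(4) = 29
```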

Exercise: Estimating the average and standard deviation, and applying the Empirical Rule, when the distribution is mound-shaped

We collect a sample of weekly spending from 40 students. Suppose the spending has a mound-shaped distribution. We only know the min = \$20 and max = \$80. As you can see, weekly spending varies from student to student.

(a) Give a good estimate of the average and the standard deviation of weekly spending based on the 40 students' data.
(b) Approximately what percentage of students would spend \$35 or more per week?

ANS (a): Since the distribution is mound-shaped, we can use (20+80)/2 = \$50 to estimate the average spending. Since nearly all of the data fall within two s.d. of the mean, the range covers about four s.d., so we use s = range/4 to estimate the s.d., which gives (80-20)/4 = \$15.0.

ANS (b): Using the estimated average and s, \$35 is one s.d. below the mean. Hence, the percentage spending \$35 or more = 34% + 50% = 84%. Approximately 84% of individuals spend \$35 or more per week.
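The arithmetic in this exercise can be sketched as follows. The function name and the lookup table are assumptions made for illustration; the table simply restates the Empirical Rule percentages (34%, 13.5%, 2.5% on each side of the mean) as "percent at or above" a whole number of s.d. from the mean.

```python
def estimate_mean_and_sd(minimum, maximum):
    """Rough estimates for mound-shaped data: midpoint and range / 4."""
    mean = (minimum + maximum) / 2
    sd = (maximum - minimum) / 4
    return mean, sd


# Approximate percent of data at or above (mean + z * sd), from the Empirical Rule.
PERCENT_AT_OR_ABOVE = {-2: 97.5, -1: 84.0, 0: 50.0, 1: 16.0, 2: 2.5}

mean, sd = estimate_mean_and_sd(20, 80)
z = (35 - mean) / sd          # how many s.d. the $35 cutoff is from the mean

print(mean, sd)               # 50.0 15.0
print(z)                      # -1.0
print(PERCENT_AT_OR_ABOVE[int(z)])   # 84.0
```

The lookup only works when the cutoff falls a whole number of s.d. from the mean, as it does here.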

Five Number Summary; Box plots
The Five-Number Summary: MINIMUM, Q1, Median, Q3, MAXIMUM

IQR (Interquartile Range) = Q3 – Q1

Steps for Drawing a Box plot
Step 1: Determine the lower and upper fences:
Lower fence = Q1 – 1.5(IQR)
Upper fence = Q3 + 1.5(IQR)

Step 2: Draw vertical lines at Q1, M, and Q3. Enclose these vertical lines in a box.

Step 3: Label the lower and upper fences.

Step 4: Draw a line from Q1 to the smallest data value that is larger than the lower fence. Draw a line from Q3 to the largest data value that is smaller than the upper fence.

Step 5: Any data values less than the lower fence or greater than the upper fence are outliers and are marked with an asterisk (*).

EXAMPLE Drawing a Boxplot
Draw a boxplot for the serum HDL data, using the five-number summary (Min, Q1, M, Q3, Max).

IQR = Q3 – Q1 = 56 – 38 = 18

Compute the lower and upper fences and draw a boxplot.
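The fence calculation for this example can be sketched in Python using Q1 = 38 and Q3 = 56 from the IQR line. The function names are chosen for illustration, and the small list of HDL-like values is hypothetical (it is not the data set from the slides); it is there only to show how outliers would be flagged.

```python
def fences(q1, q3):
    """Lower and upper fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """Values below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return [x for x in values if x < lower or x > upper]


lower, upper = fences(38, 56)
print(lower, upper)                       # 11.0 83.0

hypothetical_values = [9, 30, 45, 60, 90]  # made-up data for illustration only
print(find_outliers(hypothetical_values, 38, 56))   # [9, 90]
```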

Relationship between Distribution Shape and Boxplot (Similar questions in the test)
1. If the median is near the center of the box and the two horizontal lines are approximately equal in length, then the distribution is roughly symmetric.
2. If the median is left of the center of the box and/or the right line is substantially longer than the left line, the distribution is right skewed.
3. If the median is right of the center of the box and/or the left line is substantially longer than the right line, the distribution is left skewed.

Symmetric

Skewed Right

Skewed Left

Distance data – 100 distance data

EXAMPLE Comparing Two Data Sets Using Boxplots
The following boxplots represent the birth rate for women years of age in 1990 and 1997 for each state. What conclusion can you make?