Displaying and Summarizing Quantitative Data

AP Stats Chapter 4 Part 3 Displaying and Summarizing Quantitative Data

Learning Goals Know how to display the distribution of a quantitative variable with a histogram, a stem-and-leaf display, or a dotplot. Know how to display the relative position of a quantitative variable with a Cumulative Frequency Curve and analyze the Cumulative Frequency Curve. Be able to describe the distribution of a quantitative variable in terms of its shape. Be able to describe any anomalies or extraordinary features revealed by the display of a variable.

Learning Goals Be able to determine the shape of the distribution of a variable by knowing something about the data. Know the basic properties and how to compute the mean and median of a set of data. Understand the properties of a skewed distribution. Know the basic properties and how to compute the standard deviation and IQR of a set of data.

Learning Goals Understand which measures of center and spread are resistant and which are not. Be able to select a suitable measure of center and a suitable measure of spread for a variable based on information about its distribution. Be able to describe the distribution of a quantitative variable in terms of its shape, center, and spread.

Learning Goal 6 Know the basic properties and how to compute the mean and median of a set of data.

Learning Goal 6: Measures of Central Tendency
A measure of central tendency for a collection of data values is a number that is meant to convey the idea of centralness or center of the data set. The most commonly used measures of central tendency for sample data are the: mean, median, and mode.

Learning Goal 6: Measures of Central Tendency
Overview of central tendency: Mean (arithmetic average), Median (midpoint of ranked values), Mode (most frequently observed value).

Learning Goal 6: The Mean
Mean: The mean of a set of numerical (data) values is the (arithmetic) average for the set of values. When computing the value of the mean, the data values can be population values or sample values. Hence we can compute either the population mean or the sample mean.

Learning Goal 6: Mean Notation
NOTATION: The population mean is denoted by the Greek letter µ (read as “mu”). NOTATION: The sample mean is denoted by x̄ (read as “x-bar”). Normally the population mean is unknown.

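The mean is the most common measure of central tendency. The mean is also the preferred measure of center, because it uses all the data in calculating the center. For a sample of size n: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$, where the $x_i$ are the observed values and $n$ is the sample size.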

Learning Goal 6: The Mean - Example
What is the mean of the following 11 sample values?

Learning Goal 6: The Mean - Example (Continued)
Solution:

Learning Goal 6: Mean – Frequency Table
When a data set has a large number of values, we summarize it as a frequency table. The frequencies represent the number of times each value occurs. When the mean is calculated from a frequency table it is an approximation, because the raw data is not known.

Learning Goal 6: Mean – Frequency Table Example
What is the mean of the following 11 sample values (the same data as before)?
Class          Frequency
-10 to < -4    2
-4 to < 2      4
2 to < 8
8 to < 14
14 to < 20     1

Learning Goal 6: Mean – Frequency Table Example
Solution:
Class          Midpoint   Frequency
-10 to < -4    -7         2
-4 to < 2      -1         4
2 to < 8       5
8 to < 14      11
14 to < 20     17         1
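The midpoint method for a frequency table can be sketched in Python. This is a minimal illustration, not the slide's worked solution: the function name `grouped_mean` and the table values are hypothetical, chosen only to show that each class midpoint is weighted by its frequency.

```python
def grouped_mean(classes):
    """Approximate the mean from rows of (lower limit, upper limit, frequency)."""
    total_freq = sum(freq for _, _, freq in classes)
    # Each class is represented by its midpoint, weighted by its frequency.
    weighted_sum = sum(((low + high) / 2) * freq for low, high, freq in classes)
    return weighted_sum / total_freq


# Hypothetical frequency table (not the data from the slides).
table = [(0, 10, 3), (10, 20, 5), (20, 30, 2)]
approx_mean = grouped_mean(table)
print(approx_mean)
```

Because every value in a class is replaced by the class midpoint, the result is an approximation of the mean of the raw data.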

Learning Goal 6: Calculate Mean on TI-84 Raw Data
Enter the raw data into a list (STAT/Edit). Calculate the mean with STAT/CALC/1-Var Stats: List: L1; FreqList: (leave blank); Calculate.

Learning Goal 6: Calculate Mean on TI-84 Frequency Table Data
Enter the Frequency table data into two lists (L1 – Class Midpoint, L2 – Frequency), STAT/Edit. Calculate the mean, STAT/CALC/1-Var Stats List: L1 FreqList: L2 Calculate Same Data Class Mark Freq 0-50 25 1 50-100 75 125 3 175 4 225 7 275

Learning Goal 6: Calculate Mean on TI-84 – Your Turn
Raw Data: 548, 405, 375, 400, 475, 450, , 364, 492, 482, 384, 490, , 435, 390, 500, 400, 491, , 848, 792, 700, 572, 739, 572

Learning Goal 6: Calculate Mean on TI-84 – Your Turn
Frequency Table Data (same): Class Limits Frequency 350 to < 450 450 to < 550 550 to < 650 650 to < 750 750 to < 850 850 to < 950 11 10 2 1

Learning Goal 6: Median
The median is the midpoint of the observations when they are ordered from the smallest to the largest (or from the largest to the smallest). If the number of observations is odd, the median is the middle observation. If the number of observations is even, the median is the average of the two middle observations.

Center of a Distribution -- Median
The median is the value with exactly half the data values below it and half above it. It is the middle data value (once the data values have been ordered) that divides the histogram into two equal areas. It has the same units as the data.

Learning Goal 6: Finding the Median
The location of the median in the ranked data is position $\frac{n+1}{2}$. If the number of values is odd, the median is the middle number. If the number of values is even, the median is the average of the two middle numbers. Note that $\frac{n+1}{2}$ is not the value of the median, only the position of the median in the ranked data.

Learning Goal 6: Finding the Median – Example (n odd)
What is the median for the following sample values?

Learning Goal 6: Finding the Median – Example (n odd)
First, arrange the data set in order (STAT/SortA). Since the number of values is odd (n = 11), the median is found in the 6th position of the ordered set. To find the position, divide the number of values by 2 and round up: 11/2 = 5.5 ⇒ 6. Thus, the value of the median is 2, the 6th value.

Learning Goal 6: Finding the Median – Example (n even)
Find the median age for the following eight college students.

Learning Goal 6: Finding the Median – Example (n even)
First, order the values. Since there is an even number of ages, the median is the average of the two middle values. To find them, divide the number of values by 2; that position and the next one hold the two middle values: 8/2 = 4 ⇒ the 4th and 5th values. Thus, median = (sum of the two middle values)/2 = 23.5.

Learning Goal 6: The Median - Summary
The median is the midpoint of a distribution: the number such that half of the observations are smaller and half are larger.
1. Sort observations from smallest to largest (n = number of observations).
2. If n is odd, the median is observation n/2 (rounded up) down the list. Example: n = 25, n/2 = 12.5, so the median is the 13th observation; Median = 3.4.
3. If n is even, the median is the mean of the two center observations. Example: n = 24, n/2 = 12, so the 12th and 13th observations are averaged; Median = 3.35.
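The position rule in the summary above can be sketched in Python. The function name `median_by_position` and the sample lists are hypothetical; the logic follows the three steps: sort, then take the rounded-up n/2 position for odd n, or average positions n/2 and n/2 + 1 for even n.

```python
import math


def median_by_position(values):
    """Median found by its position in the ordered list."""
    ordered = sorted(values)              # step 1: sort smallest to largest
    n = len(ordered)
    if n % 2 == 1:
        position = math.ceil(n / 2)       # step 2: n/2 rounded up (1-based)
        return ordered[position - 1]
    lower = n // 2                        # step 3: positions n/2 and n/2 + 1
    return (ordered[lower - 1] + ordered[lower]) / 2


print(median_by_position([7, 1, 5]))      # odd number of values
print(median_by_position([4, 1, 3, 2]))   # even number of values
```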

Learning Goal 6: Finding the Median on the TI-84
Enter data into L1 STAT; CALC; 1:1-Var Stats

Learning Goal 6: Find the Mean and Median – Your Turn
CO2 Pollution levels in 8 largest nations measured in metric tons per person: Mean = Median = 1.5 Mean = Median = 5.8 Mean = Median = 4.6

Learning Goal 6: Mode
A measure of central tendency: the value that occurs most often. Used for either numerical or categorical data. There may be no mode or several modes. Not typically used as the measure of center for quantitative data. Figure examples: one data set with Mode = 9, one with No Mode.

Learning Goal 6: Mode - Example
The mode is the measurement which occurs most frequently. The set: 2, 4, 9, 8, 8, 5, 3 The mode is 8, which occurs twice The set: 2, 2, 9, 8, 8, 5, 3 There are two modes - 8 and 2 (bimodal) The set: 2, 4, 9, 8, 5, 3 There is no mode (each value is unique).
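The three sets above can be checked with a short Python sketch. The helper `modes` is a hypothetical name; it returns every value tied for the highest count, and an empty list when each value occurs only once (the "no mode" case).

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or [] when every value is unique."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []                         # each value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == top)


print(modes([2, 4, 9, 8, 8, 5, 3]))   # one mode
print(modes([2, 2, 9, 8, 8, 5, 3]))   # two modes (bimodal)
print(modes([2, 4, 9, 8, 5, 3]))      # no mode
```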

Learning Goal 6: Summary Measures of Center

Learning Goal 7 Understand the properties of a skewed distribution.

Learning Goal 7: Where is the Center of the Distribution?
If you had to pick a single number to describe all the data what would you pick? It’s easy to find the center when a histogram is unimodal and symmetric—it’s right in the middle. On the other hand, it’s not so easy to find the center of a skewed histogram or a histogram with outliers.

Learning Goal 7: Meaningful measure of Center
Your measure of center must be meaningful. The distribution of women’s heights (height of 25 women in a class) appears coherent and symmetrical, so the mean is a good measure of center.
Is the mean always a good measure of center? Here is another example of how you might use histograms and descriptive statistics like means to find out something of biological interest. You are interested in studying which pollinators visit a particular species of plant. Suppose there has been an increase in agriculture in the area, with all the pesticide spraying that comes along with it. If insects are needed to pollinate the plant, and the pesticides kill the insects, the plant species may go extinct. You can compute the mean of the plant-height distribution, but is it a good description of the center? Why would we care? Plant height may be a measure of plant age, and we wonder how well the population is holding up. Here there are not very many small plants, which might make you worry that there has been insufficient pollination. You have also noticed that flower color varies. Pollinators are attracted to flower color: typically hummingbirds pollinate red flowers and moths pollinate white flowers. That makes you start to wonder about your sample, so you divide the plants into three groups (red, pink, and white flowers) and compute a mean for each group.

Learning Goal 7: Impact of Skewed Data
Figure panels: Disease X, a symmetric distribution (mean and median are the same); multiple myeloma, a skewed distribution (the mean is pulled toward the skew).
It is easier to see the effect of skew by comparing the two distributions, which show time to death after diagnosis. For both disease X and multiple myeloma (MM), you have on average 3 years to live. Does that mean you don’t care which one you get? Of the 25 people getting disease X, only 1 died in the first year after diagnosis. Of the ones getting MM, 7 did. So if you get X, only 1/25, or about 4 percent, of people don’t make it through year one. But if you get MM, 7 of 25 die in year one, which means you have an almost 30% chance of not making it even a year. You might be one of the very few who live a long time, but it is much more likely that it is time to get your will together and say goodbye to your loved ones. The means are the same, but the medians are different because of the shape of the distribution.
This is one of the major take-home messages from this class: you all thought you knew what an average meant, and you did, but you should also realize that what the average tells you depends on the distribution. When the doctor diagnoses you with some disease, and people with that disease live on average for 3 years, you say, “Doctor! Show me the distribution!” And as you go on in biology and see charts like this in journal articles or even in the paper, you now know why they are showing them to you. Statistical descriptors, like using the mean to describe the center, only tell you so much. To really understand what is going on you have to plot the data and look at the distribution for things like overall shape, symmetry, and the presence of outliers, and you have to understand the effect they have on things like the mean.
The next obvious question for a biologist is why you see these different types of patterns. The top distribution is a normal distribution, which represents lots of things in the natural world, as we have seen in our women’s height and toucan bill examples. The distribution on the bottom is very different, and when you see something like this it challenges researchers to understand it. Why do such a large percentage of people die so quickly? Is there one single thing that, if we could figure it out, would save a huge chunk of the people dying early? Could researchers figure out what it is about the long-lived patients, or their treatment, that allowed them to live so long? Much is still not known, but a big part of it is that this diagnosis, MM, does not have the word “multiple” in its name for no reason. At the level of the cells involved there are lots of different ones, so it is really a suite of diseases. This diagnosis is like “cancer” in general: a term that covers a broad range of biological phenomena that you can study, pick apart, and understand from the cell biology level to the epidemiological level using not your intuition, but statistics. Now let’s move on from describing the center to describing the spread and symmetry, which are, again, really different for these two distributions.

Nonresistant – The mean is sensitive to the influence of extreme values and/or outliers. Skewed distributions pull the mean away from the center towards the longer tail. The mean is located at the balancing point of the histogram. For a skewed distribution, the mean is not a good measure of center.

Learning Goal 7: Mean – Nonresistant Example
The most common measure of central tendency. Affected by extreme values (skewed dist. or outliers). Mean = 3 Mean = 4

Learning Goal 7: The Median
Resistant – The median is said to be resistant, because extreme values and/or outliers have little effect on the median. In an ordered array, the median is the “middle” number (50% above, 50% below).

Learning Goal 7: Median – Resistant Example
Not affected by extreme values (skewed distributions or outliers). Median = 3 Median = 3
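A small Python sketch can illustrate the contrast between the nonresistant mean and the resistant median. The two data sets below are hypothetical (the slides' own values were not preserved); the second differs from the first only in its largest value.

```python
from statistics import mean, median

# Hypothetical data: identical except for one extreme value.
base = [1, 2, 3, 4, 5]
with_extreme = [1, 2, 3, 4, 10]

mean_shift = mean(with_extreme) - mean(base)        # the mean moves
median_shift = median(with_extreme) - median(base)  # the median does not

print(mean(base), mean(with_extreme))
print(median(base), median(with_extreme))
```

Only the mean responds to the size of the extreme value; the median depends on the order of the values, so replacing the largest value leaves it unchanged.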

Learning Goal 7: Mean vs. Median with Outliers
Figure panels: without the outliers; with the outliers (vertical axis: percent of people dying).
Here is the same data set with some outliers: some lucky people who managed to live longer than the others. This is typical behavior for the mean and median. The mean is sensitive to outliers, because when you add all the values up to get the mean, the outliers are weighted disproportionately by their large size. When you find the median, however, they are just another two points to count; the fact that their size is so large does not matter much. The mean (non-resistant) is pulled to the right a lot by the outliers (from 3.4 to 4.2). The median (resistant), the number of years it takes for half the people to die, is only slightly pulled to the right by the outliers (from 3.4 to 3.6).

Learning Goal 7: Effect of Skewed Distributions
The figure below shows the relative positions of the mean and median for right-skewed, symmetric, and left-skewed distributions. Note that the mean is pulled in the direction of skewness, that is, in the direction of the extreme observations. For a right-skewed distribution, the mean is greater than the median; for a symmetric distribution, the mean and the median are equal; and, for a left-skewed distribution, the mean is less than the median. (Figure 3.1)

Learning Goal 7: Comparing the mean and the median
The mean and the median are the same only if the distribution is symmetrical. The median is a measure of center that is resistant to skew and outliers. The mean is not. Figure panels: mean and median for a symmetric distribution; mean and median for skewed distributions (left skew, right skew).

Learning Goal 7: Which measure of location is the “best”?
Because the median considers only the order of values, it is resistant to values that are extraordinarily large or small; it simply notes that they are one of the “big ones” or “small ones” and ignores their distance from center. To choose between the mean and median, start by looking at the distribution. The mean is used for unimodal symmetric distributions, unless extreme values (outliers) exist. The median is used for skewed distributions or when outliers are present, since the median is not sensitive to extreme values.

Learning Goal 7: Class Problem
Observed mean = 2.28, median = 3, mode = 3.1. What is the shape of the distribution and why?

Learning Goal 7: Example
Five houses on a hill by the beach. House Prices: $2,000,000, $500,000, $300,000, $100,000, $100,000

Learning Goal 7: Example – Measures of Center
House Prices: $2,000,000, $500,000, $300,000, $100,000, $100,000. Sum: $3,000,000. Mean: ($3,000,000/5) = $600,000. Median: middle value of ranked data = $300,000. Mode: most frequent value = $100,000. Which is the best measure of center? The median.
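The house-price example can be checked with Python's `statistics` module. The five prices below are an assumption: they are the values consistent with the stated sum ($3,000,000), median ($300,000), and mode ($100,000) for five houses.

```python
from statistics import mean, median, mode

# Assumed prices, inferred from the stated sum, median, and mode.
prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

price_mean = mean(prices)      # pulled upward by the $2,000,000 house
price_median = median(prices)  # middle value of the ranked prices
price_mode = mode(prices)      # most frequent price

print(price_mean, price_median, price_mode)
```

Four of the five houses cost less than the mean, which is why the median is the more representative center here.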

Conclusion – Mean or Median?
Mean – use with symmetrical distributions (no outliers), because it is nonresistant. Median – use with skewed distribution or distribution with outliers, because it is resistant.

Learning Goal 8 Know the basic properties and how to compute the standard deviation and IQR of a set of data.

Learning Goal 8: How Spread Out is the Distribution?
Variation matters, and Statistics is about variation. Are the values of the distribution tightly clustered around the center or more spread out? Always report a measure of spread along with a measure of center when describing a distribution numerically.

Learning Goal 8: Measures of Spread
A measure of variability for a collection of data values is a number that is meant to convey the idea of spread for the data set. The most commonly used measures of variability for sample data are the range, interquartile range, variance, and standard deviation.

Learning Goal 8: Measures of Variation
Measures of variation (range, interquartile range, variance, standard deviation) give information on the spread or variability of the data values. Figure: same center, different variation.

Learning Goal 8: The Interquartile Range
One way to describe the spread of a set of data might be to ignore the extremes and concentrate on the middle of the data. The interquartile range (IQR) lets us ignore extreme data values and concentrate on the middle of the data. To find the IQR, we first need to know what quartiles are…

Learning Goal 8: The Interquartile Range
Quartiles divide the data into four equal sections. One quarter of the data lies below the lower quartile, Q1. One quarter of the data lies above the upper quartile, Q3. The quartiles border the middle half of the data. The difference between the quartiles is the interquartile range (IQR), so IQR = upper quartile (Q3) – lower quartile (Q1).

Learning Goal 8: Interquartile Range
Eliminate some outlier or extreme value problems by using the interquartile range. Eliminate some high- and low-valued observations and calculate the range from the remaining values. IQR = 3rd quartile – 1st quartile IQR = Q3 – Q1

Learning Goal 8: Finding Quartiles
1. Order the data.
2. Find the median; this divides the data into a lower and an upper half (the median itself is in neither half).
3. Q1 is the median of the lower half.
4. Q3 is the median of the upper half.
Example, even data: Q1 = 27, M = 39, Q3 = 50.5, IQR = 50.5 – 27 = 23.5
Example, odd data: Q1 = 35, M = 46, Q3 = 54, IQR = 54 – 35 = 19
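The quartile procedure above can be sketched in Python. The function name `quartiles` and the sample data are hypothetical; the code follows the median-of-halves rule, leaving the median out of both halves when the number of values is odd. Other software may use a different quartile rule and give slightly different values.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle value so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


# Hypothetical data sets, one with an even and one with an odd count.
q1_even, m_even, q3_even = quartiles([1, 2, 3, 4, 5, 6, 7, 8])
q1_odd, m_odd, q3_odd = quartiles([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(q3_even - q1_even, q3_odd - q1_odd)   # the two IQRs
```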

Learning Goal 8: Quartiles
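Example (boxplot figure marking the minimum, Q1, the median (Q2), Q3, and the maximum, with 25% of the data in each section): Interquartile range = 57 – 30 = 27. The IQR spans the middle fifty percent of the data and is not influenced by extreme values (resistant).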

Learning Goal 8: Quartiles
Quartiles split the ranked data into 4 segments with an equal number of values per segment. The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger. Q2 is the same as the median (50% are smaller, 50% are larger). Only 25% of the observations are greater than the third quartile, Q3.

Learning Goal 8: The Interquartile Range - Histogram
The lower and upper quartiles are the 25th and 75th percentiles of the data, so… The IQR contains the middle 50% of the values of the distribution, as shown in figure:

Learning Goal 8: Find and Interpret IQR
Travel times to work (minutes) for 20 randomly selected New Yorkers: 10 30 5 25 40 20 15 85 65 60 45. Ordered: 5 10 15 20 25 30 40 45 60 65 85. Q1 = 15, M = 22.5, Q3 = 42.5. IQR = Q3 – Q1 = 42.5 – 15 = 27.5 minutes. Interpretation: The range of the middle half of travel times for the New Yorkers in the sample is 27.5 minutes.

Learning Goal 8: Interquartile Range on the TI-84
Use STAT/CALC/1-Var Stats to find Q1 and Q3. Then calculate IQR = Q3 – Q1. Example: Interquartile range = Q3 – Q1 = 9 – 6 = 3.

Learning Goal 8: Calculate IQR - Your Turn
The following scores for a statistics 10-point quiz were reported. What is the value of the interquartile range?

Learning Goal 8: 5-Number Summary
Definition: The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest. Minimum Q1 M Q3 Maximum

Learning Goal 8: 5-Number Summary
The 5-number summary of a distribution reports its minimum, 1st quartile Q1, median, 3rd quartile Q3, and maximum, in that order. Obtain the 5-number summary from 1-Var Stats. Example: Min = 3.7, Q1 = 6.6, Med = 7, Q3 = 7.6, Max = 9.

Learning Goal 8: Calculate 5 Number Summary
Enter data into L1. STAT; CALC; 1:1-Var Stats; Enter. List: L1. Calculate. Scroll down to 5 number summary.

Learning Goal 8: Calculate 5 Number Summary – Your Turn
The grades of 25 students are given below : 42, 63, 47, 77, 46, 71, 68, 83, 91, 55, 67, 66, 63, 57, 50, 69, 73, 82, 77, 58, 66, 79, 88, 97, 86. Calculate the 5 number summary for the students grades.
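As a way to check a calculator result, the 5-number summary of the 25 grades can be sketched in Python. The function name `five_number_summary` is hypothetical, and the quartiles are computed as medians of the lower and upper halves (with the overall median excluded when the count is odd); software that uses a different quartile rule may report slightly different Q1 and Q3.

```python
from statistics import median

grades = [42, 63, 47, 77, 46, 71, 68, 83, 91, 55, 67, 66, 63,
          57, 50, 69, 73, 82, 77, 58, 66, 79, 88, 97, 86]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # values below the median position
    upper = ordered[n - half:]    # values above the median position
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])


summary = five_number_summary(grades)
print(summary)
```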

Learning Goal 8: Calculate 5 Number Summary – Your Turn
A group of University students took part in a sponsored race. The number of laps completed is given in the table. Calculate the 5 number summary. 1 2 31 – 35 25 26 – 30 17 21 – 25 20 16 – 20 15 11 – 15 9 6 – 10 1 - 5 frequency (x) number of laps

Learning Goal 8: Standard Deviation
A more powerful measure of spread than the IQR is the standard deviation, which takes into account how far each data value is from the mean. A deviation is the distance that a data value is from the mean. Since adding all deviations together would total zero, we square each deviation and find an average of sorts for the deviations. But to calculate the standard deviation you must first calculate the variance.

Learning Goal 8: Variance
The variance is a measure of variability that uses all the data. It measures the average squared deviation of the measurements about their mean.

The variance, denoted by $s^2$, is found by summing the squared deviations and (almost) averaging them: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$. It is used to calculate the standard deviation. The variance will play a role later in our study, but it is problematic as a measure of spread: it is measured in squared units, not the same units as the data, a serious disadvantage!

The variance of a population of N measurements is the average of the squared deviations of the measurements about their mean µ: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ (sigma squared). The variance of a sample of n measurements is the sum of the squared deviations of the measurements about their mean $\bar{x}$, divided by (n – 1): $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ (s squared).

The standard deviation, s, is just the square root of the variance: $s = \sqrt{s^2}$. It is measured in the same units as the original data, which is why it is preferred over the variance.

In calculating the variance, we squared all of the deviations, and in doing so changed the scale of the measurements. To return this measure of variability to the original units of measure, we calculate the standard deviation, the positive square root of the variance.

Learning Goal 8: Finding Standard Deviation
The most common measure of spread looks at how far each observation is from the mean. This measure is called the standard deviation. Let’s explore it! Consider the following data on the number of pets owned by a group of 9 children: 1, 3, 4, 4, 4, 5, 7, 8, 9. 1) Calculate the mean: mean = 45/9 = 5. 2) Calculate each deviation: deviation = observation – mean. For example, deviation: 1 – 5 = -4; deviation: 8 – 5 = 3.

Learning Goal 8: Finding Standard Deviation
xi     (xi - mean)     (xi - mean)^2
1      1 - 5 = -4      (-4)^2 = 16
3      3 - 5 = -2      (-2)^2 = 4
4      4 - 5 = -1      (-1)^2 = 1
4      4 - 5 = -1      (-1)^2 = 1
4      4 - 5 = -1      (-1)^2 = 1
5      5 - 5 = 0       (0)^2 = 0
7      7 - 5 = 2       (2)^2 = 4
8      8 - 5 = 3       (3)^2 = 9
9      9 - 5 = 4       (4)^2 = 16
                       Sum = 52
3) Square each deviation.
4) Find the “average” squared deviation: calculate the sum of the squared deviations divided by (n – 1). This is called the variance. “Average” squared deviation = 52/(9 – 1) = 6.5.
5) Calculate the square root of the variance. This is the standard deviation. Standard deviation = square root of variance = √6.5 ≈ 2.55.
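The pets calculation can be checked with Python's `statistics` module. The nine values below are an assumption: they are the data consistent with the slide's mean of 5 and sum of squared deviations of 52. `variance` and `stdev` divide by n – 1 (sample versions), while `pvariance` divides by n (population version).

```python
from statistics import mean, pvariance, stdev, variance

# Assumed data: nine values consistent with mean 5 and squared deviations summing to 52.
pets = [1, 3, 4, 4, 4, 5, 7, 8, 9]

pets_mean = mean(pets)
squared_dev_sum = sum((x - pets_mean) ** 2 for x in pets)
sample_var = variance(pets)     # squared_dev_sum / (n - 1)
sample_sd = stdev(pets)         # square root of the sample variance
pop_var = pvariance(pets)       # squared_dev_sum / n, for comparison

print(pets_mean, squared_dev_sum, sample_var, round(sample_sd, 2))
```

Dividing by n – 1 instead of n gives a slightly larger variance, and the difference shrinks as the sample size grows.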

Learning Goal 8: Standard Deviation - Example
The standard deviation is used to describe the variation around the mean. 1) First calculate the variance $s^2$. 2) Then take the square root to get the standard deviation s.
Boxplots show the spread around a median; they can be used no matter what the distribution is, and they are a good way to contrast variables having different distributions. But if your distribution is symmetrical, you can use the mean as its center, and you can use a different (and more common) measure of spread around the mean: the standard deviation. The standard deviation measures spread by looking at how far the observations are from their mean.
Going through the calculation with the women’s height data again: n is the number of observations. From this we calculate the degrees of freedom, which is just n – 1. Take each difference from the mean, square it so all are positive, and add them up. Then divide, not by the number of observations, but by n – 1 = df. Although variance is a useful measure of spread, its units are units squared, and height squared is not intuitive. So we take the square root and use that number, the SD, which has the same units as the mean.
Now, why divide by n – 1 instead of n? When we computed the mean, it was easy to see intuitively why we divided by n. But what we are actually doing, even there, is dividing by the number of independent pieces of information that go into the estimate of a parameter. This number is called the degrees of freedom (df), and it is equal to the number of independent scores that go into the estimate minus the number of parameters estimated as intermediate steps. For example, if the variance $s^2$ is estimated from a random sample of n independent scores, then the degrees of freedom equal the number of independent scores (n) minus the number of parameters estimated as intermediate steps (here, the mean), and are therefore equal to n – 1.
But why the term “degrees of freedom”? When we calculate $s^2$ for a random sample, we must first calculate the mean of that sample and then compute the sum of the squared deviations from that mean. While there are n such squared deviations, only (n – 1) of them are free to assume any value whatsoever. This is because the final deviation must come from the one value of X such that the sum of all the Xs divided by n equals the obtained mean of the sample. All of the other (n – 1) deviations can, theoretically, have any values whatsoever. For these reasons, the statistic $s^2$ is said to have only (n – 1) degrees of freedom. This is hard to understand, and you are not expected to understand it completely. The effect of dividing by n – 1 rather than n will be shown shortly, and perhaps that will make it easier to accept.
Figure label: Mean ± 1 s.d.

Learning Goal 8: Standard Deviation - Procedure
1. Compute the mean $\bar{x}$.
2. Subtract the mean from each individual value to get a list of the deviations from the mean, $(x - \bar{x})$.
3. Square each of the differences to produce the squares of the deviations from the mean, $(x - \bar{x})^2$.
4. Add all of the squares of the deviations from the mean to get $\sum (x - \bar{x})^2$.
5. Divide the sum by $(n - 1)$ [variance].
6. Find the square root of the result.

Find the standard deviation of the Mulberry Bank customer waiting times. Those times (in minutes) are 1, 3, 14. Use a Table. We will not normally calculate standard deviation by hand.
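The table calculation for the three Mulberry Bank waiting times can be traced step by step in Python. This is a sketch of the hand procedure (mean, deviations, squared deviations, sum, divide by n – 1, square root); the variable names are illustrative only.

```python
import math

times = [1, 3, 14]                                  # waiting times in minutes

n = len(times)
mean_time = sum(times) / n                          # compute the mean
deviations = [x - mean_time for x in times]         # subtract the mean
squared = [d ** 2 for d in deviations]              # square each deviation
sum_squared = sum(squared)                          # add the squares
sample_variance = sum_squared / (n - 1)             # divide by n - 1
sample_sd = math.sqrt(sample_variance)              # take the square root

# Print one table row per observation: value, deviation, squared deviation.
for x, d, sq in zip(times, deviations, squared):
    print(x, d, sq)
print(sample_variance, sample_sd)
```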

Learning Goal 8: Calculate Standard Deviation
Enter data into L1. STAT; CALC; 1:1-Var Stats; Enter. List: L1; Calculate. Sx is the sample standard deviation. σx is the population standard deviation.

Learning Goal 8: Calculate Standard Deviation – Your Turn
The prices ($) of 18 brands of walking shoes: Calculate the standard deviation.

Learning Goal 8: Calculate Standard Deviation – Your Turn
During 3 hours at Heathrow airport 55 aircraft arrived late. The number of minutes they were late is shown in the grouped frequency table. Calculate the standard deviation for the number of minutes late. 2 4 5 7 10 27 0 - 9 frequency minutes late

Learning Goal 8: Standard Deviation - Properties
The value of s is never negative; s is zero only when all of the data values are the same number. Larger values of s indicate greater amounts of variation. The units of s are the same as the units of the original data, which is one reason s is preferred to $s^2$. s measures spread about the mean and should only be used to describe the spread of a distribution when the mean is used to describe the center (i.e., symmetrical distributions). s is nonresistant (like the mean): it can increase dramatically due to extreme values or outliers.

Larger values of standard deviation indicate greater amounts of variation. Figure panels: small standard deviation; large standard deviation.

Standard Deviation: the more variation, the larger the standard deviation. Data set II has greater variation (Tables 3.11 and 3.12).

Figure panels: Data Set I; Data Set II. Data set II has greater variation, and the visual clearly shows that it is more spread out.

Learning Goal 8: Comparing Standard Deviations
The more variation, the larger the standard deviation. Data A Mean = 15.5 S = 3.338 Data B Mean = 15.5 S = 0.926 Data C Mean = 15.5 S = 4.567 Values far from the mean are given extra weight (because deviations from the mean are squared).

Learning Goal 8: Spread: Range
The range of the data is the difference between the maximum and minimum values: Range = max – min A disadvantage of the range is that a single extreme value can make it very large and, thus, not representative of the data overall.

Learning Goal 8: Range
Simplest measure of variation. Difference between the largest and the smallest values in a set of data: Range = Xlargest – Xsmallest. Example: Range = 13.

Learning Goal 8: Disadvantages of the Range
Ignores the way in which data are distributed (figure example: Range = 5). Sensitive to outliers:
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5 Range = 5 – 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120 Range = 120 – 1 = 119
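The sensitivity of the range to a single outlier can be confirmed with a few lines of Python. The helper name `data_range` is hypothetical; the two lists are the 25-value data sets from the slide, which differ only in their last value.

```python
def data_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)


# Eleven 1s, eight 2s, four 3s, one 4, then either 5 or 120.
common = [1] * 11 + [2] * 8 + [3] * 4 + [4]
without_outlier = common + [5]
with_outlier = common + [120]

print(data_range(without_outlier), data_range(with_outlier))
```

Changing one value out of 25 multiplies the range many times over, even though the other 24 values are identical.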

Learning Goal 8: Range
The range is affected by outliers (large or small values relative to the rest of the data set). The range does not utilize all the information in the data set, only the largest and smallest values. Thus, range is not a very useful measure of spread or variation.

Learning Goal 8: Summary Measures
Describing Data Numerically: Central Tendency (Mean, Median, Mode); Quartiles; Variation (Range, Interquartile Range, Variance, Standard Deviation); Shape (Skewness).

Learning Goal 9 Understand which measures of center and spread are resistant and which are not.

Learning Goal 9: Resistant or Non-Resistant
Which measures of center and spread are resistant? Median – Extreme values and outliers have little effect. IQR – Measures the spread of the middle 50% of the data, therefore extreme values and outliers have little or no effect. When using the median to measure the center of a distribution, use the IQR to measure the spread of the distribution.

Which measures of center and spread are Non-Resistant? Mean – Extreme values and outliers pull the mean towards those values. Standard Deviation – Measures the spread relative to the mean. Extreme values or outliers will increase the standard deviation of the distribution. When using Mean to measure the center of a distribution, use Standard Deviation to measure the spread of the distribution.

Measures of Center: Mean (not resistant), Median (resistant). Measures of Spread: Standard deviation (not resistant), IQR (resistant), Range (not resistant). The mean and the standard deviation are used most often and are preferred when appropriate, because they are calculated from all the data values and so use all the available information.

Animated Center and Spread
[Figure: animated dot plots of Quiz Scores (axis 20 to 100) as data points are added]
Mean: 72.5 Median: 72.5 S: 10.16 IQR: 15
Mean: 63.33 Median: 70 S: 16.84 IQR: 30
Mean: 68.82 Median: 70 S: 12.56 IQR: 20
What is the difference between the center and spread of a distribution?
Which measure of center (mean or median) was affected more by adding data points that skewed the distribution? Explain your answer.
For each distribution below, which measure of center and spread would you use? How do you know?
[Figure: Quiz Scores distributions A and B; answers shown: Mean & S, Median & IQR]
In a symmetric distribution: The mean, non-resistant, is used to represent the center. The standard deviation (S), non-resistant, is used to represent the spread.
In a skewed distribution: The median, resistant, is used to represent the center. The interquartile range (IQR), resistant, is used to represent the spread.

Median and IQR are paired together – Resistant. Mean and Standard Deviation are paired together – Non-Resistant.

Learning Goal 10 Be able to select a suitable measure of center and a suitable measure of spread for a variable based on information about its distribution.

Learning Goal 10: Choosing Measures of Center and Spread
We now have a choice between two descriptions of center and spread:
Mean and Standard Deviation
Median and Interquartile Range
The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers. Use mean and standard deviation only for reasonably symmetric distributions that don’t have outliers.
NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA!

Learning Goal 10: Choosing Measures of Center and Spread
Plot your data: Dotplot, Stemplot, Histogram
Interpret what you see: Shape, Outliers, Center, Spread
Choose numerical summary: x̄ and s, or Median and IQR
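The first step, plotting, needs no special software. Below is a minimal sketch of a text dot plot for whole-number data. The function `text_dotplot` is a hypothetical helper written for this illustration, and the data are the outlier-free values from the range slide.

```python
from collections import Counter

def text_dotplot(values):
    """Return one text line per whole-number value, with one '*' per observation."""
    counts = Counter(values)
    lines = []
    for v in range(min(values), max(values) + 1):
        lines.append(f"{v:>3} | " + "*" * counts[v])
    return lines

data = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
lines = text_dotplot(data)
print("\n".join(lines))
```

The stars pile up at 1 and trail off toward 5, so the plot shows a right-skewed shape. By the guidance above, the median and IQR would be the suitable summary for these data.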

Learning Goal 10: Choosing Center and Spread - Practice
The distribution of a data set shows the arrangement of values in the data set. The center of a distribution is a number that represents all the values in the data set. The spread of a distribution is a number that describes the variability in the data set.
Symmetric: Center: Mean, Spread: S
Skewed: Center: Median, Spread: IQR
The dot plots below show the ratings given to a new movie by two different audiences.
[Figure: dot plots of Audience Rating (1 to 10) for Audience #1 and Audience #2]
1. Audience #1: 4,4,4,5,5,5,5,5,6,6,6,6,6,6,7,7,7,7,7,7,7,7,8,8,8,8,8,9,9,9,9,10,10,10,10
Mean: 7 Median: 7 S: 1.43 IQR: 2
Shape: The shape of the distribution is mostly symmetric.
Center: Because the distribution is symmetric, the mean of 7 can be used as the measure of center.
Spread: The S of the distribution is 1.43.
2. Audience #2: 2,2,3,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,6,6,6,6,6,6,6,7,7,7,7,7,8,8,8,8,9,9,9,10
Mean: 5.71 Median: 6 S: 1.67 IQR: 3
Shape: The shape of the distribution is mostly symmetric.
Center: Because the distribution is symmetric, the mean of 5.71 can be used as the measure of center.
Spread: The S of the distribution is 1.67.
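The audience summaries can be recomputed from the listed ratings. The sketch below again assumes quartiles are the medians of the lower and upper halves. One observation worth checking: the S values printed on this slide (1.43 and 1.67) agree with the mean absolute deviation of the data rather than with the sample standard deviation, so the sketch computes both. The helper names are illustrative.

```python
import statistics

def iqr(values):
    """IQR with Q1 and Q3 as medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    return statistics.median(data[len(data) - half:]) - statistics.median(data[:half])

def mean_abs_dev(values):
    """Average distance of the values from their mean."""
    m = statistics.mean(values)
    return sum(abs(x - m) for x in values) / len(values)

# Ratings copied from the slide, written as value * count.
audience_1 = [4] * 3 + [5] * 5 + [6] * 6 + [7] * 8 + [8] * 5 + [9] * 4 + [10] * 4
audience_2 = ([2] * 2 + [3] * 4 + [4] * 5 + [5] * 7 + [6] * 7
              + [7] * 5 + [8] * 4 + [9] * 3 + [10])

for name, ratings in (("Audience #1", audience_1), ("Audience #2", audience_2)):
    print(name,
          "mean:", round(statistics.mean(ratings), 2),
          "median:", statistics.median(ratings),
          "IQR:", iqr(ratings),
          "mean abs. dev.:", round(mean_abs_dev(ratings), 2),
          "sample s:", round(statistics.stdev(ratings), 2))
```

For Audience #1 the mean absolute deviation is about 1.43 while the sample standard deviation is about 1.80, so it is worth confirming which measure of spread a given "S" refers to before comparing values across sources.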

Learning Goal 10: Choosing Center and Spread - Practice
The distribution of a data set shows the arrangement of values in the data set. The center of a distribution is a number that represents all the values in the data set. The spread of a distribution is a number that describes the variability in the data set.
Symmetric: Center: Mean, Spread: S
Skewed: Center: Median, Spread: IQR
The histograms below show the number of hours studied in a week for students in two math classes.
[Figure: histograms of Hours Studied (bins 0-2 through 15-17) against number of Students for Class #1 and Class #2]
3. Class #1: 0,2,3,4,4,5,5,6,7,7,8,9,9,9,10,10,11,12,12,12,13,13,13,13,13,14,14,15,15,15,16
Mean: 9.69 Median: 10.5 S: 3.6 IQR: 6.5
Shape: The shape of the distribution is skewed to the left.
Center: Because the distribution is skewed, the median of 10.5 can be used as the measure of center.
Spread: The IQR of the distribution is 6.5.
4. Class #2: 3,3,3,4,4,5,5,5,6,6,6,6,6,7,7,7,8,8,9,9,11,11,11,11,11,14,14,17
Mean: 7.75 Median: 7 S: 2.93 IQR: 4.5
Shape: The shape of the distribution is skewed to the right.
Center: Because the distribution is skewed, the median of 7 can be used as the measure of center.
Spread: The IQR of the distribution is 4.5.

Learning Goal 10: Choosing Center and Spread - Practice
The distribution of a data set shows the arrangement of values in the data set. The center of a distribution is a number that represents all the values in the data set. The spread of a distribution is a number that describes the variability in the data set.
Symmetric: Center: Mean, Spread: S
Skewed: Center: Median, Spread: IQR
The dot plot below shows the number of hours of sleep per night for 33 students in a 6th-grade class. The histogram below shows the number of hours of sleep per night for 33 adults selected at random.
[Figure: dot plot of Hours of Sleep (4 to 11) for the 6th-grade class; histogram of Hours Slept (bins 0-1 through 10+) against number of Adults]
1. 6th Grade Class: 4,5,5,6,6,6,7,7,7,8,8,8,8,8,8,9,9,9,9,9,9,9,10,10,10,10,10,10,10,11,11,11,11
Mean: 8.4 Median: 9 S: 1.53 IQR: 3
Shape: The shape of the distribution is skewed left.
Center: Because the distribution is skewed, the median of 9 can be used as the measure of center.
Spread: The IQR of the distribution is 3.
2. Adults: 3,3,4,4,5,5,5,5,6,6,6,6,6,6,7,7,7,7,7,7,8,8,8,8,8,8,8,9,9,9,9,10,11
Mean: 6.8 Median: 7 S: 1.54 IQR: 2.5
Shape: The shape of the distribution is fairly symmetric, with a slight skew to the left.
Center: Because the distribution is mostly symmetric, the mean of 6.8 can be used as the measure of center.
Spread: The S of the distribution is 1.54.

Learning Goal 10: Choosing Center and Spread - Practice
The histograms below show the scores of 31 students on a pretest and posttest.
[Figure: histograms of Score (bins 41-50 through 91-100) against number of Students for the Pretest and the Posttest]
1. Pretest: 42, 44, 44, 44, 46, 46, 46, 50, 50, 52, 52, 52, 54, 54, 54, 54, 56, 58, 60, 64, 64, 64, 64, 64, 66, 66, 70, 72, 72, 80, 84
Mean: 57.67 Median: 54 S: 9.07 IQR: 14
Shape: The shape of the distribution is skewed right.
Center: Because the distribution is skewed, the median of 54 can be used as the measure of center.
Spread: The IQR of the distribution is 14.
2. Posttest: 50, 54, 56, 60, 64, 64, 64, 64, 68, 72, 74, 74, 74, 74, 76, 76, 78, 78, 78, 80, 80, 82, 84, 88, 88, 88, 90, 90, 92, 96, 100
Mean: 76 Median: 76 S: 9.81 IQR: 24
Shape: The shape of the distribution is mostly symmetric.
Center: Because the distribution is mostly symmetric, the mean of 76 can be used as the measure of center.
Spread: The S of the distribution is 9.81.
Did scores on the test improve from the pretest to the posttest? Explain your answer.
Yes, test scores improved from the pretest to the posttest. This can be seen in the noticeably higher center of the distribution of scores for the posttest.
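The pretest and posttest comparison can be checked from the listed scores. This sketch compares the two medians, on the assumption that the median is a reasonable common measure of center for both a skewed and a symmetric distribution. Quartiles are again taken as medians of the lower and upper halves, and the helper name `iqr` is illustrative.

```python
import statistics

def iqr(values):
    """IQR with Q1 and Q3 as medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    return statistics.median(data[len(data) - half:]) - statistics.median(data[:half])

# Scores copied from the slide.
pretest = [42, 44, 44, 44, 46, 46, 46, 50, 50, 52, 52, 52, 54, 54, 54, 54,
           56, 58, 60, 64, 64, 64, 64, 64, 66, 66, 70, 72, 72, 80, 84]
posttest = [50, 54, 56, 60, 64, 64, 64, 64, 68, 72, 74, 74, 74, 74, 76, 76,
            78, 78, 78, 80, 80, 82, 84, 88, 88, 88, 90, 90, 92, 96, 100]

for name, scores in (("Pretest", pretest), ("Posttest", posttest)):
    print(name, "median:", statistics.median(scores), "IQR:", iqr(scores))

# Difference between the two centers.
shift = statistics.median(posttest) - statistics.median(pretest)
print("Median improvement:", shift)
```

The median rises by 22 points, from 54 to 76, which supports the slide's conclusion that scores improved.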

Learning Goal 10: Choosing Center and Spread - Practice
The dot plot below shows the number of pets in each household of 28 students in a 6th-grade class.
[Figure: dot plot of Number of Pets]
Data: 0,0,0,0,0,0,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3,3,4,4,7
Mean: 1.82 Median: 2 S: 1.13 IQR: 1.5
Shape: The shape of the distribution is skewed right.
Center: Because the distribution is skewed, the median of 2 can be used as the measure of center.
Spread: The IQR of the distribution is 1.5.

Learning Goal 10: Choosing Center and Spread - Questions
Choose Yes or No to indicate whether each statement is true about these distributions.
[Figure: Distribution A and Distribution B]
A. Both distributions are symmetric.
B. The median is the best measure of center for Distribution A.
C. Overall, scores were higher in Distribution A than Distribution B.
D. There is more variability in scores for Distribution A than Distribution B.
E. Distribution A is skewed to the right.
F. The Standard Deviation can be used to describe the spread for Distribution B.

Learning Goal 11 Be able to describe the distribution of a quantitative variable in terms of its shape, center, and spread.

Learning Goal 11: How to Analyze Quantitative Data
Examine each variable by itself. Then study relationships among the variables. Start with a graph or graphs, then add numerical summaries.

Learning Goal 11: How to Describe a Quantitative Distribution
The purpose of a graph is to help us understand the data. After you make a graph, always ask, “What do I see?”
How to Describe the Distribution of a Quantitative Variable
In any graph, look for the overall pattern and for striking departures from that pattern.
Describe the overall pattern of a distribution by its Shape, Outliers, Center, and Spread.
Note individual values that fall outside the overall pattern. These departures are called outliers.
Don’t forget your SOCS!

Learning Goal 11: Describing a Quantitative Distribution
We describe a distribution (the values the variable takes on and how often it takes these values) using the acronym SOCS.
Shape: We describe the shape of a distribution in one of two ways: Symmetric/Approx. Symmetric or Skewed right/Skewed left.
[Figure: Babe Ruth’s Single Season Home Runs, approx. symmetric (with extreme values)]

Outliers: Observations that we would consider “unusual”; data that don’t “fit” the overall pattern of the distribution. Babe Ruth had two seasons (22 and 25 home runs) that appear to be somewhat different from the rest of his career. These may be “outliers”. (We’ll learn a numerical way to determine if observations are truly “unusual” later.)
[Figure: Babe Ruth’s Single Season Home Runs, with 22 and 25 marked as possible outliers]

Center: A single value that describes the entire distribution. Symmetric distributions use the mean and skewed distributions use the median.
[Figure: Babe Ruth’s Single Season Home Runs; the median is 46]

Spread: Describes the variation of a distribution. Symmetric distributions use the standard deviation and skewed distributions use the IQR.
[Figure: Babe Ruth’s Single Season Home Runs with Q1 and Q3 marked; the IQR is 19]

Learning Goal 11: Distribution Description using SOCS
The distribution of Babe Ruth’s number of home runs in a single season is approximately symmetric (1) with two possible outliers at 22 and 25 home runs (2). He typically hit about 46 home runs in a season (3). Over his career, the number of home runs normally varied between 35 and 54 (4).
1-Shape 2-Outliers 3-Center 4-Spread

Learning Goal 11: Describe the Distribution – Your Turn
The table and dotplot below display the Environmental Protection Agency’s estimates of highway gas mileage in miles per gallon (MPG) for a sample of 24 model year 2009 midsize cars. Describe the shape, center, and spread of the distribution. Are there any outliers?

Learning Goal 11: Describe the Distribution – Your Turn
Smart Phone Battery Life: Here is the estimated battery life for each of 9 different smart phones in minutes. Describe the distribution.
Smart Phone: Battery Life (minutes)
Apple iPhone: 300
Motorola Droid: 385
Palm Pre:
Blackberry Bold: 360
Blackberry Storm: 330
Motorola Cliq:
Samsung Moment:
Blackberry Tour:
HTC Droid: 460

Cartoon Time
