Understanding and Comparing Distributions


AP Statistics Chapter 5: Understanding and Comparing Distributions

Learning Goals Know how to construct and analyze boxplots. Be able to identify outliers and discuss how they deviate from the overall pattern of the data. Know how to compare the distributions of two or more groups by comparing their shapes, centers, and spreads. Be able to display data over time: Time Plots. Understand how re-expressing data can improve the symmetry of the data.

Learning Goal 1 Know how to construct and analyze boxplots.

Learning Goal 1: The Five-Number Summary The five-number summary of a distribution reports its median, quartiles, and extremes (maximum and minimum). Example: The five-number summary for the daily wind speed is: Min 0.20, Q1 1.15, Median 1.90, Q3 2.93, Max 8.67.

Learning Goal 1: The Five-Number Summary Consists of the minimum value, Q1, the median, Q3, and the maximum value, listed in that order. Offers a reasonably complete description of the center and spread. Calculate on the TI-84 using 1-Var Stats. Used to construct the Boxplot. Examples of five-number summaries: (1) 20, 27, 34, 50, 86; (2) 5, 10, 18.5, 29, 33.
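
For readers working without a calculator, the same summary can be sketched in Python. This is a minimal sketch, assuming the median-of-halves rule for quartiles that the construction slides describe; the function name `five_number_summary` is made up for illustration. The two data sets are the ones used later in the outlier examples, and they reproduce the two summaries listed above.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + n % 2:]    # values above it (middle value skipped when n is odd)
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Data sets taken from the outlier examples later in this transcript.
set_1 = [20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86]
set_2 = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]

print(five_number_summary(set_1))  # (20, 27, 34, 50, 86)
print(five_number_summary(set_2))  # (5, 10, 18.5, 29, 33)
```

Other software may use a different quartile convention, so its Q1 and Q3 can differ slightly from these values.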

Learning Goal 1: Boxplot A boxplot is a graphical display of the five-number summary. Boxplots are useful when comparing groups. Boxplots are particularly good at pointing out outliers.

Learning Goal 1: Boxplot Boxplot: A graphical display of data using the 5-number summary: Minimum -- Q1 -- Median -- Q3 -- Maximum. Each of the four sections between these values contains about 25% of the data.

Learning Goal 1: Boxplots A graph of the Five-Number Summary. Can be drawn either horizontally or vertically. The box represents the IQR (middle 50%) of the data. Because boxplots show less detail than histograms or stemplots, they are best used for side-by-side comparison of more than one distribution.

Learning Goal 1: Constructing Boxplots Put the data in increasing numerical order (if it isn’t already). Find the median of your set of data. Consider only the values to the left of the median, the lower half of the data, and find the median of the lower half, the 1st quartile Q1. Consider only the values to the right of the median, the upper half of the data, and find the median of the upper half, the 3rd quartile Q3. Determine the lowest and highest values of the data. Draw a scale on a number line and plot the lowest value, lower quartile, median, upper quartile, and the highest value on the number line. Put a line through the lower quartile, median, and upper quartile. Then put a box around those lines. Lastly draw a line from your extreme values to the box.

Learning Goal 1: Constructing Boxplots - Example Put the data in increasing numerical order (if it isn’t already). On the TI-84, you can use STAT, SortA(. Example: 100, 27, 34, 54, 59, 18, 52, 61, 78, 68, 82, 87, 85, 93, 91. Ascending order: 18, 27, 34, 52, 54, 59, 61, 68, 78, 82, 85, 87, 91, 93, 100.

Learning Goal 1: Constructing Boxplots - Example Find the median of the data. *Remember the median is the value exactly in the middle of an ordered set of numbers* 18, 27, 34, 52, 54, 59, 61, 68, 78, 82, 85, 87, 91, 93, 100. The median is 68, the 8th of the 15 ordered values.

Learning Goal 1: Constructing Boxplots - Example Next, consider only the values to the left of the median, the lower half of the data. 18, 27, 34, 52, 54, 59, 61, 68, 78, 82, 85, 87, 91, 93, 100. Then, find the median of those numbers, the 1st quartile Q1. Lower half: 18, 27, 34, 52, 54, 59, 61, so Q1 = 52.

Learning Goal 1: Constructing Boxplots - Example Next, consider only the values to the right of the median, the upper half of the data. 18, 27, 34, 52, 54, 59, 61, 68, 78, 82, 85, 87, 91, 93, 100. Then find the median of those numbers, the 3rd quartile Q3. Upper half: 78, 82, 85, 87, 91, 93, 100, so Q3 = 87.

Learning Goal 1: Constructing Boxplots - Example Now indicate your lowest and highest values, 18 and 100. 18, 27, 34, 52, 54, 59, 61, 68, 78, 82, 85, 87, 91, 93, 100.

Learning Goal 1: Constructing Boxplots - Example Now we are ready to begin to draw the graph. 18, 27, 34, 52, 54, 59, 61, 68, 78, 82, 85, 87, 91, 93, 100. Plot the lowest value, lower quartile Q1, median, upper quartile Q3, and the highest value on a number line with an appropriate scale.

Learning Goal 1: Constructing Boxplots - Example Put a line through the Lower Quartile Q1, Median, and Upper Quartile Q3. Then Put a box around those lines.

Learning Goal 1: Constructing Boxplots - Example Lastly draw a line from your extreme values to the box. There is your Boxplot. Don’t forget to label the scale and title the graph.
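
The construction steps above can be sketched in Python for the 15-value example. This is only a rough text rendering, not a substitute for a hand-drawn or calculator plot; the function names, the 50-column width, and the characters used for the whiskers, box, and median are all arbitrary choices made for illustration.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    return (data[0], median(data[:half]), median(data),
            median(data[half + n % 2:]), data[-1])

def ascii_boxplot(values, width=50):
    """Draw a one-line boxplot: '|' ends, '-' whiskers, '=' box, 'M' median.

    Assumes the maximum is larger than the minimum.
    """
    low, q1, med, q3, high = five_number_summary(values)

    def col(x):
        # Map a data value to a column index between 0 and width - 1.
        return round((x - low) / (high - low) * (width - 1))

    line = [" "] * width
    for i in range(col(low), col(high) + 1):
        line[i] = "-"              # whiskers span the full range
    for i in range(col(q1), col(q3) + 1):
        line[i] = "="              # box covers Q1 to Q3
    line[col(low)] = "|"
    line[col(high)] = "|"
    line[col(med)] = "M"
    return "".join(line)

data_values = [100, 27, 34, 54, 59, 18, 52, 61, 78, 68, 82, 87, 85, 93, 91]
print(five_number_summary(data_values))  # (18, 52, 68, 87, 100)
print(ascii_boxplot(data_values))
```

The printed summary gives the five values to mark on the number line; the text plot only approximates their positions because each value is rounded to the nearest column.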

Learning Goal 1: Constructing Boxplots – Your Turn Given the commuting times (in minutes) of 20 randomly selected New York workers below, construct a boxplot. 10 30 5 25 40 20 15 85 65 60 45

Learning Goal 1: Shape from a Boxplot Boxplots do not display the shape of the distribution as clearly as histograms, but may be useful in determining the shape. [Figure: boxplots of a skewed-left and a skewed-right distribution.]

Learning Goal 1: Shape from a Boxplot – Looking at the Median If the median is close to the center of the box, the distribution of the data values will be approximately symmetrical. If the median is to the left of the center of the box, the distribution of the data values will be Skewed Right. If the median is to the right of the center of the box, the distribution of the data values will be Skewed Left.

Learning Goal 1: Shape from a Boxplot – Looking at the Whiskers If the whiskers are approximately the same length, the distribution of the data values will be approximately symmetrical. If the right whisker is longer than the left whisker, the distribution of the data values will be Skewed Right. If the left whisker is longer than the right whisker, the distribution of the data values will be Skewed Left.

Learning Goal 1: Distribution Shape and Boxplot [Figure: three boxplots, each marked Q1, Q2, Q3, for a left-skewed, a symmetric, and a right-skewed distribution.]
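
The two rules of thumb above (position of the median in the box, relative whisker lengths) can be sketched as a small Python function. This is an illustration only: the function name and the 20% tolerance for calling two lengths "about the same" are assumptions, not part of the slides. The example uses the daily wind speed five-number summary given earlier.

```python
def shape_from_summary(minimum, q1, med, q3, maximum, tolerance=0.2):
    """Apply both heuristics; return (verdict from box, verdict from whiskers)."""

    def compare(left, right):
        # Treat the two lengths as equal when they differ by no more than
        # `tolerance` times the longer one (an arbitrary cutoff).
        longer = max(left, right)
        if longer == 0 or abs(left - right) <= tolerance * longer:
            return "approximately symmetric"
        return "skewed right" if right > left else "skewed left"

    box = compare(med - q1, q3 - med)               # median position inside the box
    whiskers = compare(q1 - minimum, maximum - q3)  # left vs. right whisker
    return box, whiskers

# Daily wind speed summary: Min 0.20, Q1 1.15, Median 1.90, Q3 2.93, Max 8.67
print(shape_from_summary(0.20, 1.15, 1.90, 2.93, 8.67))
# ('skewed right', 'skewed right')
```

Both rules agree here, but they need not; a boxplot gives only a rough indication of shape, so a histogram is still the better check.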

Learning Goal 1: Boxplots - Problem Below you have a boxplot for the tar content of 25 different cigarettes. What is a plausible set of values for the five-number summary? (a) Min = 13, Q1 = 10, Median = 12.6, Q3 = 14, Max = 15 (b) Min = 1, Q1 = 8.5, Median = 12.6, Q3 = 15, Max = 17 (c) Min = 1, Q1 = 8.5, Median = 11.5, Q3 = 13, Max = 15 (d) Min = 8.5, Q1 = 10, Median = 11.5, Q3 = 15, Max = 17

Learning Goal 1: Boxplots - Problem The shape of the boxplot below can be described as: (a) Bi-modal (b) Left-skewed (c) Right-skewed (d) Symmetric (e) Uniform

Learning Goal 1: Boxplots - Problem What is the approximate range of the Male Wrist Girth dataset shown below? (a) 14.5 to 19.5 (b) 16.5 to 17 (c) 16.5 to 18 (d) 17 to 19.5 (e) 14.5 to 16.5 and 18 to 19.5

Learning Goal 1: Boxplots - Problem What is the approximate interquartile range of the Male Wrist Girth dataset shown below? (a) 14.5 to 19.5 (b) 16.5 to 17 (c) 16.5 to 18 (d) 17 to 19.5 (e) 14.5 to 16.5 and 18 to 19.5

Learning Goal 2 Be able to identify outliers and discuss how they deviate from the overall pattern of the data.

Learning Goal 2: Outliers Recall that an outlier is an extremely small or extremely large data value when compared with the rest of the data values. What should we do about outliers? Try to understand them in the context of the data. Possible causes: a data error, or a special nature to the data.

Learning Goal 2: Outliers If there are any clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed. The differences may be quite revealing. Note: The median and IQR are not likely to be affected by the outliers. The following procedure allows us to check whether a data value can be considered as an outlier.

Learning Goal 2: Outliers The IQR is used to determine whether extreme values are actually outliers. An observation is an outlier if it falls more than 1.5 times the IQR below Q1 or above Q3. To test for outliers, construct an upper and a lower fence: Upper Fence = Q3 + (1.5)IQR, Lower Fence = Q1 – (1.5)IQR. If an observation falls outside the fences (i.e., greater than the upper fence or less than the lower fence), then it is an outlier. Indicate outliers as individual points or asterisks on a boxplot.

Learning Goal 2: Outliers – Example (odd number of data values) Data: 20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86. Find Q1, M, Q3, IQR and any outliers. Sorted data: 20 25 25 27 28 31 33 34 36 37 44 50 59 85 86. Median M = 34. Lower half: 20 25 25 27 28 31 33, so Q1 = 27. Upper half: 36 37 44 50 59 85 86, so Q3 = 50. IQR = 50 – 27 = 23. Upper Fence = Q3 + (1.5)IQR = 50 + 34.5 = 84.5. Lower Fence = Q1 – (1.5)IQR = 27 – 34.5 = -7.5. Outliers: 85 and 86 (greater than the upper fence).

Learning Goal 2: Outliers – Example (even number of data values) Find Q1, M, Q3, IQR and outliers. Sorted data: 5 7 10 14 18 19 25 29 31 33. Median M = 18.5 (the average of 18 and 19). Lower half: 5 7 10 14 18, so Q1 = 10. Upper half: 19 25 29 31 33, so Q3 = 29. IQR = 29 – 10 = 19. Upper Fence = 29 + (1.5)IQR = 29 + 28.5 = 57.5. Lower Fence = 10 – (1.5)IQR = 10 – 28.5 = -18.5. No outliers.
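
The fence calculations in the two examples above can be checked with a short Python sketch. It assumes the same median-of-halves quartile rule used in the slides; the helper names `quartiles`, `fences`, and `outliers` are made up for illustration.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    return median(data[:half]), median(data[half + n % 2:])

def fences(values):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def outliers(values):
    """Return the data values that fall outside the fences."""
    lower, upper = fences(values)
    return [x for x in sorted(values) if x < lower or x > upper]

odd_set = [20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86]
even_set = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]

print(fences(odd_set), outliers(odd_set))    # (-7.5, 84.5) [85, 86]
print(fences(even_set), outliers(even_set))  # (-18.5, 57.5) []
```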

Learning Goal 2: Outliers – Your Turn The data below represent the 20 countries with the largest number of total Olympic medals, including the United States, which had 101 medals for the 1996 Atlanta games. Determine whether the number of medals won by the United States is an outlier relative to the numbers for the other countries. Data values – 63, 65, 50, 37, 35, 41, 25, 23, 27, 21, 17, 17, 20, 19, 22, 15, 15, 15, 15, 101.

Learning Goal 2: Outliers – Modified Boxplot Plots outliers as isolated points, whereas regular boxplots conceal outliers. From now on when we say “boxplot”, we mean “modified boxplot”. The modified boxplot is more useful than the boxplot. Constructing a Modified Boxplot: same as a boxplot with the exception of the “whiskers”. Extend the “left whisker” to the minimum value if there are no low outliers, or to the smallest data value greater than or equal to the lower fence if there are. Extend the “right whisker” to the maximum value if there are no high outliers, or to the largest data value less than or equal to the upper fence if there are. Outliers (either low or high) are then represented by a dot or an asterisk.

Learning Goal 2: Constructing Modified Boxplots – Example (vertical) Draw a single vertical axis spanning the range of the data. Label and Scale. Draw short horizontal lines at the lower and upper quartiles and at the median. Then connect them with vertical lines to form a box.

Learning Goal 2: Constructing Modified Boxplots – Example (vertical) Erect “fences” around the main part of the data. The upper fence is 1.5 IQRs above the upper quartile. The lower fence is 1.5 IQRs below the lower quartile. Note: the fences only help with constructing the boxplot and should not appear in the final display.

Learning Goal 2: Constructing Modified Boxplots – Example (vertical) Use the fences to grow “whiskers.” Draw lines from the ends of the box up and down to the most extreme data values found within the fences. If a data value falls outside one of the fences, we do not connect it with a whisker.

Learning Goal 2: Constructing Modified Boxplots – Example (vertical) Add the outliers by displaying any data values beyond the fences with special symbols. We often use a different symbol for “far outliers” that are farther than 3 IQRs from the quartiles.
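
The whisker and outlier rules for a modified boxplot can be sketched in Python as follows. This is a minimal sketch under the median-of-halves quartile rule; the function name and the dictionary keys are made up for illustration, and the data are the 15 values from the earlier outlier example.

```python
from statistics import median

def modified_boxplot_parts(values):
    """Return whisker ends, outliers, and far outliers for a modified boxplot."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])
    q3 = median(data[half + n % 2:])
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Whiskers stop at the most extreme data values inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    # Every value beyond a fence is plotted individually as an outlier.
    outside = [x for x in data if x < lower_fence or x > upper_fence]
    # Far outliers (a subset of the outliers) lie more than 3 IQRs from the quartiles.
    far = [x for x in outside if x < q1 - 3 * iqr or x > q3 + 3 * iqr]

    return {"whiskers": (inside[0], inside[-1]),
            "outliers": outside,
            "far_outliers": far}

data_values = [20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86]
print(modified_boxplot_parts(data_values))
# {'whiskers': (20, 59), 'outliers': [85, 86], 'far_outliers': []}
```

Here the right whisker ends at 59, the largest value inside the upper fence of 84.5, and neither outlier is a far outlier because both are below 50 + 3(23) = 119.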

Learning Goal 2: Modified Boxplots – Example Lines (“whiskers”) extend from each quartile to the most extreme value that is not an outlier. [Figure: modified boxplot with Q1, the median, Q3, and the outliers labeled.]

Learning Goal 2: Modified Boxplots – Problem Which boxplot goes with the histogram of waiting times for the bus? (a) (b) (c) The data do not show any low outliers.

Learning Goal 2: Constructing Modified Boxplots – Your Turn Given the commuting times (in minutes) of 20 randomly selected New York workers below, construct a modified boxplot (same data as earlier). 10 30 5 25 40 20 15 85 65 60 45

Learning Goal 2: More Outliers Far outlier: a data value farther than 3 IQRs from the quartiles. [Figure: boxplot with outliers and a far outlier marked.]

Learning Goal 2: Boxplots – TI-84 Use the AL 2008 run data to construct a boxplot. AL runs scored: 782 845 811 805 821 691 765 829 789 646 671 774 901 714

Learning Goal 2: Boxplots – TI-84 Store this data as a list. Press STAT and choose the first option, EDIT. Enter the 14 AL data values in L1.

Learning Goal 2: Boxplots – TI-84 Now set up the boxplot. Exit back to the home screen. Then press STAT PLOT (2nd, then Y=). Choose Plot1 and turn it on. Scroll to Type and choose the boxplot icon (with outliers). It is the first option in the second row. Enter L1 for Xlist. Enter 1 for Freq. Choose a mark for outliers.

Learning Goal 2: Boxplots – TI-84 Now display the graph. Press ZOOM. Then select option 9: ZOOMSTAT. Press enter. Press TRACE and scroll around to see different statistics for the distribution.

Learning Goal 2: Boxplots – TI-84 Summary: STAT, EDIT (enter data). STAT PLOT: there are two boxplot types (the 1st is the modified boxplot, the 2nd is the regular boxplot). ZOOM, 9:ZoomStat. Use the TRACE key to read off values and find outliers.
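
As a cross-check on what TRACE should show for the AL runs data, the summary values can be sketched in Python. This assumes the median-of-halves quartile rule from the construction slides; the calculator's display is expected to agree, but that is an assumption worth verifying on your own device. The function and variable names are made up for illustration.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    return (data[0], median(data[:half]), median(data),
            median(data[half + n % 2:]), data[-1])

# AL runs scored in 2008, as listed on the slide.
al_runs = [782, 845, 811, 805, 821, 691, 765, 829, 789, 646, 671, 774, 901, 714]

low, q1, med, q3, high = five_number_summary(al_runs)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
al_outliers = [x for x in al_runs if x < lower_fence or x > upper_fence]

print("five-number summary:", (low, q1, med, q3, high))  # (646, 714, 785.5, 821, 901)
print("IQR:", iqr, "range:", high - low)                 # IQR: 107 range: 255
print("fences:", (lower_fence, upper_fence))             # (553.5, 981.5)
print("outliers:", al_outliers)                          # []
```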

Learning Goal 2: Boxplots TI-84 - Your Turn Data: 20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86 Use TI-83/84

Learning Goal 3 Know how to compare the distribution of two or more groups by comparing their shapes, centers, and spreads.

Learning Goal 3: Comparing Distributions We can answer much more interesting questions about variables when we compare distributions for different groups. Some of the most interesting statistics questions involve comparing two or more groups. Always discuss shape, center, spread, and possible outliers whenever you compare distributions of a quantitative variable.

Learning Goal 3: Comparing Distributions When asked to compare two distributions, you must address four points: the shape, the outliers, the center, and the spread. Think of the acronym SOCS to help you remember what to address.

Learning Goal 3: Describing and Comparing Distributions Key Words: Comparing the Distribution of a Quantitative Variable (2 graphs). S (shape): roughly symmetric; (slightly/heavily) right-skewed; (slightly/heavily) left-skewed; unimodal, bell-shaped, bimodal; gap between ? and ?. C (center): typical value (midpoint, median); about the same, greater than, less than. S (spread): data varies from ? to ?; varies more than, varies less than. O (outliers): possible outlier at ?, possible outlier between ? and ?; no outlier.

Learning Goal 3: Comparing Distributions Compare the histogram and boxplot for daily wind speeds: How does each display represent the distribution? The shape of a distribution is not always evident in a boxplot. Boxplots are particularly good at pointing out outliers.

Learning Goal 3: Comparing Distributions - Example It is almost always more interesting to compare groups. With histograms, note the shapes, centers, and spreads of the two distributions. When using histograms to compare data sets make sure to use the same scale for both sets of data. What does this graphical display tell you?

Learning Goal 3: Comparing Distributions – Example Solution The shapes, centers, and spreads of these two distributions are strikingly different. During spring and summer (histogram on the left), the distribution is skewed to the right. A typical day has an average wind speed of only 1 to 2 mph, with a range between 0.5 and 4 mph. In the colder months (histogram on the right), the shape is less strongly skewed and more spread out. The typical wind speed is higher, between 1 and 6 mph, and days with average wind speeds above 3 mph are not unusual. There is also an outlier at about 9 mph.

Learning Goal 3: Comparing Distributions - Example Boxplots offer an ideal balance of information and simplicity, hiding the details while displaying the overall summary information, when comparing different groups. We often plot them side by side for groups or categories we wish to compare. What do these boxplots tell you?

Learning Goal 3: Comparing Distributions – Example Solution By placing the boxplots side by side, we can easily see which groups have higher medians, which have the greater IQRs, where the middle 50% of the data is located in each group, and which have the greater overall range. When the boxes are placed in order, we can get a general idea of patterns in both the centers and the spreads. Equally important, we can see past any outliers in making these comparisons because they’ve been displayed separately.
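
The numerical side of a side-by-side comparison (medians, IQRs, ranges) can be sketched in Python. The slides do not list the data behind these boxplots, so this sketch reuses the two data sets from the outlier examples as stand-in groups; the labels "Group A" and "Group B" and the function names are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    return (data[0], median(data[:half]), median(data),
            median(data[half + n % 2:]), data[-1])

def summarize(groups):
    """Map each group name to its median, IQR, and range."""
    table = {}
    for name, values in groups.items():
        low, q1, med, q3, high = five_number_summary(values)
        table[name] = {"median": med, "IQR": q3 - q1, "range": high - low}
    return table

# Stand-in groups (data borrowed from the outlier examples, not from this slide).
groups = {
    "Group A": [20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86],
    "Group B": [5, 7, 10, 14, 18, 19, 25, 29, 31, 33],
}

table = summarize(groups)
for measure in ("median", "IQR", "range"):
    top = max(table, key=lambda name: table[name][measure])
    print(measure, "is largest for", top, "at", table[top][measure])
```

A full comparison would still discuss shape and outliers, which these three numbers alone do not capture.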

Learning Goal 3: Comparing Distributions - Example Here are boxplots for the number of runs scored in the AL and in the NL during 2008. (Note: the plots are on the same scale for comparison purposes.) Let’s compare using our four points (SOCS).

Learning Goal 3: Comparing Distributions – Example Solution Shape The AL distribution is skewed slightly left (the left half of the distribution appears more spread out). The NL distribution is approximately symmetric.

Learning Goal 3: Comparing Distributions – Example Solution Outliers Neither distribution contains an outlier.

Learning Goal 3: Comparing Distributions – Example Solution Center Typically, teams score more runs in the AL: the median for the AL distribution, about 785, is higher than the median for the NL distribution, about 740.

Learning Goal 3: Comparing Distributions – Example Solution Spread The AL distribution is slightly more spread out because it has both a larger range, 250 vs. 225, and larger IQR, around 100 vs. 70. This indicates there is more variability among AL teams and more consistency among NL teams.

Learning Goal 3: Comparing Distributions – Your Turn These data represent the responses of 20 female and 20 male Statistics students to the question, “How many pairs of shoes do you have?” Construct a stemplot and compare the two distributions. Females 50 26 31 57 19 24 22 23 38 13 34 30 49 15 51 Males 14 7 6 5 12 38 8 10 11 4 22 35

Learning Goal 3: Comparing Distributions - Problem Look at the following side-by-side boxplots and compare the female and male shoulder girth. (a) Females have a typically smaller shoulder girth than males. (b) Females have a typically larger shoulder girth than males. (c) Females and males have about the same shoulder girths.

Learning Goal 3: Comparing Distributions - Problem Look at the following side-by-side boxplots and compare the female and male thigh girth. (a) Females have a typically smaller thigh girth than males. (b) Females have a typically larger thigh girth than males. (c) Females and males have about the same thigh girth.

Learning Goal 3: Comparing Distributions - Problem Compare the centers of Distribution A (Female Shoulder Girth) and Distribution B (Male Shoulder Girth) shown below. (a) The center of Distribution A is greater than the center of Distribution B. (b) The center of Distribution A is less than the center of Distribution B. (c) The center of Distribution A is equal to the center of Distribution B.

Learning Goal 3: Comparing Distributions - Problem Compare the spreads of Distribution A (Female Shoulder Girth) and Distribution B (Male Shoulder Girth) shown below. (a) The spread of Distribution A is greater than the spread of Distribution B. (b) The spread of Distribution A is less than the spread of Distribution B. (c) The spread of Distribution A is equal to the spread of Distribution B.

Learning Goal 4 Be able to display data over time: Time Plots.

Learning Goal 4: Time Plots Used for displaying a time series, a data set collected over time. Plots each observation on the vertical scale against the time it was measured on the horizontal scale. Points are usually connected. Common patterns in the data over time, known as trends, should be noted.

Learning Goal 4: Time Plots - Example A Time Plot from 1995 – 2001 of the number of people worldwide who use the Internet.

Learning Goal 4: Constructing Time Plots To make a Time Plot – Put time on the horizontal scale. Put the variable being measured on the vertical scale. Connect the data points by lines. Interpreting Time Plots – look for An overall pattern – TREND (upward or downward) Outliers – strong deviations from the overall pattern or trend.

Learning Goal 4: Time Plots - Patterns We describe time series by looking for an overall pattern and for striking deviations from that pattern. In a time series: A trend is a rise or fall that persists over time, despite small irregularities. A pattern that repeats itself at regular intervals of time is called seasonal variation.

Learning Goal 4: Describing Time Plots This time plot shows a regular pattern of yearly variations. These are seasonal variations in fresh orange pricing most likely due to similar seasonal variations in the production of fresh oranges. There is also an overall upward trend in pricing over time. It could simply be reflecting inflation trends or a more fundamental change in this industry. [Figure: Retail price of fresh oranges over time.]

Learning Goal 4: Describing Time Plots Scales Matter How you stretch the axes and choose your scales can give a different impression. A picture is worth a thousand words, but there is nothing like hard numbers. Look at the scales.

Learning Goal 4: Describing Time Plots May contain outliers, deviations from the pattern. Try to explain. Interpolation – to predict a value from the pattern between known values. (within the data) Extrapolation – to predict a value by extending the pattern into the future (outside the data). Dangerous, no guarantee the pattern continues.
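
The difference between interpolation and extrapolation can be sketched in Python with a straight-line prediction between neighboring points. The time series below is made up purely for illustration (the slides give no numbers), and the function name `predict` and the choice of a linear rule are assumptions; it also assumes at least two observations at distinct times.

```python
def predict(times, values, t):
    """Predict the value at time t with a straight line through two known points.

    Returns (prediction, kind), where kind says whether t lies inside the
    observed time span (interpolation) or outside it (extrapolation).
    """
    points = sorted(zip(times, values))
    first_time, last_time = points[0][0], points[-1][0]

    if first_time <= t <= last_time:
        kind = "interpolation"
        # Use the pair of neighboring observations that bracket t.
        for (t0, y0), (t1, y1) in zip(points, points[1:]):
            if t0 <= t <= t1:
                break
    else:
        kind = "extrapolation"
        # Extend the line through the two observations nearest to t.
        (t0, y0), (t1, y1) = (points[:2] if t < first_time else points[-2:])

    slope = (y1 - y0) / (t1 - t0)
    return y0 + slope * (t - t0), kind

# Hypothetical series: one measurement every two years (made-up numbers).
years = [2000, 2002, 2004, 2006]
measurements = [10, 14, 18, 22]

print(predict(years, measurements, 2003))  # (16.0, 'interpolation')
print(predict(years, measurements, 2010))  # (30.0, 'extrapolation')
```

The extrapolated value is only as good as the assumption that the pattern continues, which is exactly the danger the slide warns about.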

Learning Goal 4: Time Plots - Problem Which of these plots is a time plot? (a) Plot A (b) Plot B (c) both (d) neither

Learning Goal 4: Time Plots - Problem What type of trend does the following time plot show? (a) Downward (b) Upward (c) No trend

Learning Goal 4: Time Plots - Problem What would be the correct interpretation of the following graph? (a) There is an upward trend in the data. (b) There is no trend in the data. (c) There is a downward trend in the data. (d) The data show an existence of seasonal variation.

Learning Goal 5 Understand how re-expressing data can improve the symmetry of the data.

Learning Goal 5: Re-expressing Skewed Data to Improve Symmetry When the data are skewed it can be hard to summarize them simply with a center and spread, and hard to decide whether the most extreme values are outliers or just part of a stretched out tail. How can we say anything useful about such data? The secret is to re-express the data by applying a simple function (such as a logarithm, square root, or reciprocal) to each value.

Learning Goal 5: Re-expressing Skewed Data to Improve Symmetry One way to make a skewed distribution more symmetric is to re-express or transform the data by applying a simple function (e.g., logarithmic function). Note the change in skewness from the raw data to the transformed data. [Figure: the raw data (top) and the transformed data (right).]
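
The effect of a log re-expression can be sketched in Python. The slides do not give the data behind their figure, so this sketch borrows the Olympic medal counts from the earlier outlier exercise, which have a long right tail. The `tail_ratio` measure (upper-tail length divided by lower-tail length, both measured from the median) is an informal yardstick chosen for illustration, not a statistic from the slides; a value near 1 suggests rough symmetry.

```python
import math
from statistics import median

def tail_ratio(values):
    """Upper-tail length divided by lower-tail length, measured from the median.

    Assumes the minimum is strictly below the median.
    """
    med = median(values)
    return (max(values) - med) / (med - min(values))

# Medal counts from the earlier "Your Turn" slide.
medals = [63, 65, 50, 37, 35, 41, 25, 23, 27, 21,
          17, 17, 20, 19, 22, 15, 15, 15, 15, 101]

# Re-express each value with a base-10 logarithm (all counts are positive).
logged = [math.log10(x) for x in medals]

print("raw tail ratio:   ", round(tail_ratio(medals), 2))
print("logged tail ratio:", round(tail_ratio(logged), 2))
```

The ratio drops from roughly 10 for the raw counts to under 4 after taking logs: the distribution is still skewed right, but far less so, which is the kind of improvement in symmetry the slide describes.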