Chapter 4 Exploring Numerical Data


Chapter 4 Exploring Numerical Data
Objectives
Students will be able to:
1) graph the distribution of a numerical variable
2) calculate summary statistics for a distribution of a numerical variable
3) compare distributions of a numerical variable

NL vs AL
In MLB, what is the lineup difference between the NL and the AL? In 1973, the AL adopted the designated hitter (DH). The DH is a player who only bats (does not play defense). In the AL, the DH bats in place of the pitcher. The DH was designed to increase offense, which would in turn generate more interest in AL games. The assumption was that fans would like to see more offense. Does the DH increase offense in MLB?

Reminder…
In Chapters 1-3 we dealt with categorical variables (variables whose outcomes fall into categories). In this chapter we will begin looking at numerical variables. Numerical variables are variables whose possible outcomes take on numerical values that represent different quantities of the variable.
Examples:
number of runs scored by teams in the AL
number of sacks in an NFL season by DeMarcus Ware
the amount of time it takes to swim 100 meters

Numerical Variables
Many of the lessons learned about categorical variables still hold true for numerical variables. As with categorical variables, it is beneficial to begin an analysis of numerical variables with a graph of the data.

Here are the run totals for the 30 MLB teams in 2008. Note: the Astros were still in the NL.

There are various ways to graph the distribution of numerical variables. We already know how to make a dotplot. Note: when making dotplots that compare two distributions, it is important to ensure the dotplots are on the same scale. Otherwise, the distributions are difficult to compare.

Here are the dotplots comparing the distribution of runs scored for AL and NL teams in 2008 (pg 120). At first glance, the distribution of runs scored seems fairly similar for both leagues. Perhaps AL teams score a little more often than NL teams.

Histograms
A histogram is a graph that divides the values of a numerical variable into classes and uses bars to represent the number of values in each class. The frequency describes the number of observations in each class. For our histogram, the number of runs will be broken down into classes, and the frequency will be the number of teams in those classes.

One easy way to make a histogram is by starting with a dotplot, and building from there. Let’s make a histogram for the number of runs scored during 2008 for all 30 MLB teams. (pg 121-122) Step 1: Start with a dotplot showing the runs scored for each of the 30 MLB teams.

Step 2: Divide the data into 5 to 10 equally wide classes. For this example, we can use classes that are 50 runs wide. Therefore, our first class will be 600-650 runs, the next class is 650-700 runs, etc…

Step 3: Count how many observations are in each class. If an observation falls exactly on a boundary, it is considered part of the class above the boundary. For example, the observation on 750 would count as part of the 750-800 class.

Step 4: Draw bars for each class. Bars should be equally wide and have no spaces between them. The height of each bar corresponds to the number of observations in that class.
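The four steps above can be sketched in a few lines of Python. This is only an illustration, not part of the original lesson: it uses the 2008 run totals listed on the data slide later in this presentation, and the function name `class_counts` is made up for the example. It assumes every value is at least as large as the starting boundary.

```python
# 2008 run totals, copied from the data slide later in this presentation.
al_runs = [782, 845, 811, 805, 821, 691, 765, 829, 789, 646, 671, 774, 901, 714]
nl_runs = [720, 753, 855, 704, 747, 770, 712, 700, 750, 799, 799, 735, 637, 640, 779, 641]
all_runs = al_runs + nl_runs


def class_counts(values, start, width):
    """Count observations per class (hypothetical helper).

    A value that falls exactly on a boundary is counted in the class
    above the boundary, as described in Step 3.
    Assumes every value is >= start.
    """
    n_classes = (max(values) - start) // width + 1
    counts = {start + i * width: 0 for i in range(n_classes)}
    for v in values:
        lower = start + ((v - start) // width) * width
        counts[lower] += 1
    return counts


counts = class_counts(all_runs, start=600, width=50)

# Text version of the histogram: one '#' per team in the class.
for lower, count in counts.items():
    print(f"{lower}-{lower + 50}: {'#' * count} ({count})")
```

The team that scored exactly 750 runs lands in the 750-800 class, which matches the boundary rule in Step 3.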

It is also possible to make a relative frequency histogram. This shows the percentage of observations in each class, rather than the number of observations.
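As a rough sketch of the idea, the relative frequency of a class is its count divided by the total number of observations. The example below assumes the same 50-run classes starting at 600 and uses the AL and NL run totals from the data slide later in this presentation; the function name and its default arguments are chosen only for illustration.

```python
# 2008 run totals, copied from the data slide later in this presentation.
al_runs = [782, 845, 811, 805, 821, 691, 765, 829, 789, 646, 671, 774, 901, 714]
nl_runs = [720, 753, 855, 704, 747, 770, 712, 700, 750, 799, 799, 735, 637, 640, 779, 641]


def relative_frequencies(values, start=600, width=50, n_classes=7):
    """Return the percentage of observations in each class (hypothetical helper)."""
    counts = [0] * n_classes
    for v in values:
        counts[(v - start) // width] += 1
    return [100 * c / len(values) for c in counts]


al_pct = relative_frequencies(al_runs)
nl_pct = relative_frequencies(nl_runs)

# Side-by-side percentages make leagues of different sizes comparable.
for i, (a, n) in enumerate(zip(al_pct, nl_pct)):
    lower = 600 + 50 * i
    print(f"{lower}-{lower + 50}: AL {a:5.1f}%   NL {n:5.1f}%")
```

Because the AL has 14 teams and the NL has 16, percentages put the two leagues on the same footing.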

When comparing two or more histograms: Use the same scales! The scales on the horizontal axes should match. The scales on the vertical axes should match. When the number of observations is not the same between distributions, we should make a relative frequency histogram. Let’s look at why….

Here are two frequency histograms comparing the number of points scored for players on the LA Lakers and players not on the Lakers in the 2008-2009 regular season. Because there are many more players not on the Lakers, it is hard to compare these distributions.

Let’s now use a relative frequency histogram: The comparison is now much easier to make.

Describing the Shape of the Distribution
There are several phrases we can use to describe the shape of the distribution of numerical data. Let’s look at this using different data from all 2009 MLB players who had at least 300 plate appearances. A plate appearance occurs each time a player takes their turn at bat.

Symmetric
A distribution is symmetric if the left side of the graph is roughly a mirror image of the right side.

Skewed right
A distribution is skewed to the right when the right side of the graph is more spread out than the left side. Think about your right foot. The toes are tall on the left side and get progressively smaller as you move to the right.

Skewed left
A distribution is skewed to the left when the left side of the graph is more spread out than the right side. Think about your left foot. The toes are tall on the right side and get progressively smaller as you move to the left.

Unimodal
A distribution is unimodal when it shows one distinct peak. Note: the previous three graphs can also be considered unimodal.

Bimodal
A distribution is bimodal if it has two distinct peaks. This graph has a peak at 0 and a peak at 0.8.

Caution: Unimodal vs Bimodal
A common error is calling a distribution bimodal when it is really unimodal. To call a distribution bimodal, the peaks need to be clearly distinct. Sometimes a peak occurs because of our choice of boundaries. A good rule of thumb is that if moving one or two observations would eliminate a peak, then there is a good chance that the peak is only there because of our choice of boundaries.

Caution: Unimodal vs Bimodal
Here are two histograms that use the exact same data, but different class widths. The first looks like it has two peaks, but the second seems clearly unimodal.

Uniform
A distribution is uniform when the heights of the bars are all about the same.

Dotplot vs Histogram
General rule of thumb: When the data sets are small, a dotplot is more useful (allows us to see each individual observation). When the data sets are large, a histogram is more useful. Think about trying to make a dotplot of the heights of all Americans. There would be way too many dots.

Time for some Magic! Turn to pages 126-127

Describing Numerical Data with Summary Statistics
To completely describe the distribution of a numerical variable, we need to describe where the distribution is centered and how spread out it is, in addition to the shape.
Measuring Center
There are two common ways to measure where a distribution of numerical data is centered: the mean and the median.

Mean
To find the mean (also known as the average), simply add up all observations and divide by the total number of observations.

Here are the number of runs scored by the 14 AL teams in 2008. Let’s find the mean.
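For anyone who wants to check the arithmetic, here is a small Python sketch of the same calculation (add up the observations, then divide by how many there are). The run totals come from the data slide later in this presentation; the variable names are just for illustration.

```python
# 2008 run totals, copied from the data slide later in this presentation.
al_runs = [782, 845, 811, 805, 821, 691, 765, 829, 789, 646, 671, 774, 901, 714]
nl_runs = [720, 753, 855, 704, 747, 770, 712, 700, 750, 799, 799, 735, 637, 640, 779, 641]


def mean(values):
    """Add up all observations and divide by the number of observations."""
    return sum(values) / len(values)


al_mean = mean(al_runs)
nl_mean = mean(nl_runs)

print(f"AL mean: {al_mean:.1f} runs")  # 10844 / 14, about 774.6
print(f"NL mean: {nl_mean:.1f} runs")  # 11741 / 16, about 733.8
```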

Here are both means identified on the dotplot with arrows. The mean is also called the balancing point of a distribution. Think of the dotplot like a see-saw. The mean is the place you would put the fulcrum (place the see-saw pivots).

Median
The median of a data set is the middle value when the values are in order from smallest to largest (or vice versa). If there are two middle values, then the median is the average of the two middle values. The median of a set of performances is denoted by a capital M.

Again, here are the number of runs scored by the 14 AL teams in 2008. Let’s find the median.
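The definition of the median translates directly into code. The sketch below is one possible way to write it, using the run totals from the data slide later in this presentation: sort the values, then take the middle one, or the average of the two middle ones when the count is even.

```python
# 2008 run totals, copied from the data slide later in this presentation.
al_runs = [782, 845, 811, 805, 821, 691, 765, 829, 789, 646, 671, 774, 901, 714]
nl_runs = [720, 753, 855, 704, 747, 770, 712, 700, 750, 799, 799, 735, 637, 640, 779, 641]


def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


al_median = median(al_runs)  # 14 values: average of the 7th and 8th (782 and 789)
nl_median = median(nl_runs)  # 16 values: average of the 8th and 9th (735 and 747)

print(f"AL median: {al_median} runs")
print(f"NL median: {nl_median} runs")
```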

To recap, here’s what we know: Based on this information, it is clear that the center of the AL distribution is higher than the center of the NL distribution, meaning that AL teams typically score more runs than NL teams.

There is a connection between the shape of a distribution and the relationship between the mean and median of the distribution. When a distribution is symmetric, the mean and median will be approximately the same. When a distribution is skewed right, the mean will be greater than the median. When a distribution is skewed left (a rarity in sports), the mean will be smaller than the median.

This distribution of stolen bases is skewed right, with a median of 5, as noted on the histogram. It does not seem plausible that the balancing point (mean) is also 5. Because the distribution is stretched out to the right, the mean must be greater than 5. Think of all the extremely large values that will pull the mean up.

Being able to identify the shape and center of a distribution is a great start. However, two distributions can have the same shape and center, but look quite different. Here are the dotplots that show 100 performances by two different bowlers. Both distributions are unimodal and symmetric, with centers around 150. However, it is important to compare the spread (variability) of the distributions.

Measuring Variability
In sports, it is important to measure variability because it shows the consistency of an athlete or team. For example, if the distribution of an athlete’s performances has little variability, it means that he or she is very consistent. There are several ways to measure the variability of a distribution. For now, we will focus on the range and the interquartile range.

Range
The range of a distribution is the distance between the minimum value and the maximum value.
Examples:
The range of AL runs = 901 - 646 = 255 runs
The range of NL runs = 855 - 637 = 218 runs
We have some evidence there is less variability in the NL distribution. Range can be a bit deceptive if there is an unusually high or unusually low value in a distribution. For this reason, we often use a second measure of variability called the…

Interquartile Range
The interquartile range (IQR) is a single number that measures the range of the middle half of the distribution, ignoring the values in the lowest quarter of the distribution and the values in the highest quarter of the distribution.

In order to calculate the IQR, we first have to find the quartiles of the distribution, which are the values that divide the distribution into four groups of roughly the same size. Let’s look at the dotplot for the number of runs scored by NL teams in 2008.

As you can see, there are 16 teams. The quartiles would divide the distribution into 4 groups of 4 teams. We have a procedure to help us calculate the quartile values.

Steps to calculate quartiles
1) Put the data in numerical order and find the median (this also happens to be the second quartile).
2) Find the median of the values whose position in the ordered list is to the left of the median. This value is the first quartile.
3) Find the median of the values whose position in the ordered list is to the right of the median. This value is the third quartile.
Note: When the number of observations is odd, don’t include the median value in the calculations in steps 2 and 3.

Let’s practice using the 2008 NL runs scored data.
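Here is a minimal Python sketch of the quartile procedure, applied to the run totals from the data slide later in this presentation. The function name `quartiles` is made up for this example, and it assumes at least two observations. Note that software packages often use other quartile rules, so their answers may differ slightly from this hand method.

```python
from statistics import median

# 2008 run totals, copied from the data slide later in this presentation.
al_runs = [782, 845, 811, 805, 821, 691, 765, 829, 789, 646, 671, 774, 901, 714]
nl_runs = [720, 753, 855, 704, 747, 770, 712, 700, 750, 799, 799, 735, 637, 640, 779, 641]


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-each-half procedure.

    Hypothetical helper; assumes at least two observations.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # values to the left of the median
    upper = ordered[-half:]   # values to the right; skips the median when n is odd
    return median(lower), median(ordered), median(upper)


nl_q1, nl_med, nl_q3 = quartiles(nl_runs)
al_q1, al_med, al_q3 = quartiles(al_runs)
nl_iqr = nl_q3 - nl_q1
al_iqr = al_q3 - al_q1

print(f"NL: Q1={nl_q1}, median={nl_med}, Q3={nl_q3}, IQR={nl_iqr}")
print(f"AL: Q1={al_q1}, median={al_med}, Q3={al_q3}, IQR={al_iqr}")
```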

The IQR for the NL distribution is smaller than the IQR for the AL distribution. Therefore, we have evidence that there is less variability in the NL distribution. Let’s practice by looking at Tom Brady’s passer ratings.

Unusually large or small values can have a big impact on measures like the mean and range. Suppose we were going to calculate the mean salary and range of salaries for students in this classroom. Let’s say Adam Sandler finds out he is one class short of graduating high school, and that class happens to be Statistics. He moves to Lyndhurst and transfers into this class. What effect would his salary have on the mean? On the range? What type of effect would it have on the median? On the IQR?

Outliers
An outlier is any value that falls outside the pattern of the rest of the data (an unusually high or unusually low value in a distribution). Outliers can have a big effect on summary statistics, such as the mean and range.

Here are Tennessee Titans running back Chris Johnson’s yards for each rush during a game against the Houston Texans on September 20, 2009. Do there appear to be any outliers? The mean is brought up greatly by the two outliers. However, the median is relatively unaffected.

A measure of center or spread is resistant if it isn’t strongly influenced by outliers.
Median and interquartile range are resistant to outliers.
Mean and range are not resistant to outliers.
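One way to see resistance in action is to add a single extreme value to a data set and watch which statistics move. The sketch below uses the 2008 NL run totals from the data slide later in this presentation and then tacks on a made-up total of 1500 runs. That extra value is purely hypothetical and is not real data.

```python
from statistics import mean, median

# 2008 NL run totals, copied from the data slide later in this presentation.
nl_runs = [720, 753, 855, 704, 747, 770, 712, 700, 750, 799, 799, 735, 637, 640, 779, 641]


def iqr(values):
    """Interquartile range using the median-of-each-half procedure."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[-half:]) - median(ordered[:half])


def summary(values):
    """Collect the four statistics we want to compare."""
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "iqr": iqr(values),
    }


before = summary(nl_runs)
after = summary(nl_runs + [1500])  # 1500 is a made-up extreme value, not real data

for name in before:
    print(f"{name:>6}: {before[name]:7.1f} -> {after[name]:7.1f}")
```

The mean and range jump a lot, while the median and IQR shift only a little.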

As a rule of thumb, an observation is an outlier if it lies more than 1.5 IQRs below the first quartile or more than 1.5 IQRs above the third quartile. Let’s practice using Chris Johnson’s 16 rushing attempts from the September 20, 2009 game against the Texans.
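The rushing data from the slide is not included in this transcript, so the sketch below applies the same 1.5 IQR rule to the 2008 run totals from the data slide later in this presentation instead. The function names are made up for illustration.

```python
from statistics import median

# 2008 run totals, copied from the data slide later in this presentation.
al_runs = [782, 845, 811, 805, 821, 691, 765, 829, 789, 646, 671, 774, 901, 714]
nl_runs = [720, 753, 855, 704, 747, 770, 712, 700, 750, 799, 799, 735, 637, 640, 779, 641]


def outlier_fences(values):
    """Return the cutoffs Q1 - 1.5*IQR and Q3 + 1.5*IQR (hypothetical helper)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step


def find_outliers(values):
    """List every observation that lies beyond either cutoff."""
    low, high = outlier_fences(values)
    return [v for v in values if v < low or v > high]


for league, runs in (("AL", al_runs), ("NL", nl_runs)):
    low, high = outlier_fences(runs)
    print(f"{league}: outliers are below {low} or above {high} -> {find_outliers(runs)}")
```

With these data, no team falls outside the cutoffs in either league.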

Boxplots
Another way to graph the distribution of a numerical variable is through a boxplot (aka box-and-whisker plot). A boxplot is a visual representation of the five-number summary of the distribution of a numerical variable. This consists of:
The minimum value of the distribution
The first quartile
The median
The third quartile
The maximum value of the distribution

Steps to Make a Boxplot
1) Draw a central box (rectangle) from the first quartile to the third quartile.
2) Draw a vertical line to mark the median.
3) Draw horizontal lines (called whiskers) that extend from the box out to the smallest and largest observations that are not outliers.
4) If there are any outliers, mark them separately.
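Before drawing, it can help to compute every position the boxplot needs. The sketch below is one possible way to do that in Python for the 2008 run totals from the data slide later in this presentation. The function name `boxplot_parts` and the dictionary keys are invented for this example.

```python
from statistics import median

# 2008 run totals, copied from the data slide later in this presentation.
al_runs = [782, 845, 811, 805, 821, 691, 765, 829, 789, 646, 671, 774, 901, 714]
nl_runs = [720, 753, 855, 704, 747, 770, 712, 700, 750, 799, 799, 735, 637, 640, 779, 641]


def boxplot_parts(values):
    """Compute where the box, median line, whiskers, and outlier marks go."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    step = 1.5 * (q3 - q1)
    low, high = q1 - step, q3 + step
    # Whiskers stop at the most extreme observations that are not outliers.
    inside = [v for v in ordered if low <= v <= high]
    return {
        "whisker_low": inside[0],
        "box_left": q1,
        "median_line": median(ordered),
        "box_right": q3,
        "whisker_high": inside[-1],
        "outliers": [v for v in ordered if v < low or v > high],
    }


al_parts = boxplot_parts(al_runs)
nl_parts = boxplot_parts(nl_runs)
print("AL:", al_parts)
print("NL:", nl_parts)
```

When there are no outliers, the whisker ends are simply the minimum and maximum, so the result is the five-number summary.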

Let’s go back to our Chris Johnson example. Let’s reexamine his rushing attempts, along with other key data.

Let’s now go back to our Tom Brady example. Here were his passer ratings, along with other key data we calculated.

…and the boxplot

Comparing Distributions
When asked to compare two distributions, you must address four points:
The shape
The outliers
The center
The spread
Think of the acronym SOCS to help you remember what to address.

The shape of a distribution may be difficult to determine from a boxplot. Try comparing the distance from the median to the minimum and maximum values to determine if a distribution is skewed or roughly symmetric. You will not be able to tell if a distribution is unimodal from looking at a boxplot.

Here are boxplots for the number of runs scored in the AL and in the NL during 2008. (Note: the plots are on the same scale for comparison purposes.) Let’s compare using our four points.

Shape
The AL distribution is skewed slightly left (the left half of the distribution appears more spread out). The NL distribution is approximately symmetric.

Outliers
Neither distribution contains an outlier.

Center
Typically, teams score more runs in the AL because the median for the AL distribution is higher than the median for the NL distribution.

Spread
The AL distribution is slightly more spread out because it has both a larger range and a larger IQR. This indicates there is more variability among AL teams and more consistency among NL teams.

Using the TI-84 to Make Graphs and Calculate Summary Statistics
As fun as it is to calculate everything by hand, the TI-84 calculator can do many of our calculations for us. The calculator can create boxplots and histograms and calculate summary statistics.

Boxplot
Let’s use our 2008 run data. Here are the numbers:
AL runs scored: 782 845 811 805 821 691 765 829 789 646 671 774 901 714
NL runs scored: 720 753 855 704 747 770 712 700 750 799 799 735 637 640 779 641
Write these numbers down or open to pg 120!

The first thing we have to do is store this data as a list. Press STAT and choose the first option, EDIT. Enter the 14 AL data values in L1 and the 16 NL values in L2.

Now we are going to set up the boxplot. Exit back into the home screen. Then press STAT PLOT (2nd and Y=). Choose Plot1. Then, turn Plot1 on. Scroll to Type and choose the boxplot icon (with outliers). It is the first option in the second row. Enter L1 for Xlist. Enter 1 for Freq. Choose a mark for outliers.

Now we will display the graph. Press ZOOM. Then select option 9: ZOOMSTAT. Press enter. Press TRACE and scroll around to see different statistics for the distribution.

To see the boxplot for the NL distribution at the same time: Go back into STAT PLOT and turn on Plot2. Repeat the steps, but enter L2 for Xlist. To do this, scroll down to Xlist. Then press 2nd-2 (you will see the L2 button on top of the number 2).

Histogram
Note: We can only view one histogram at a time. Start by pressing STAT PLOT. We want to turn on Plot1. Make sure no other plots are turned on. Once in Plot1, change Type to Histogram. Enter L1 for Xlist. Keep Freq at 1.

To display the graph, press ZOOM and select option 9: ZOOMSTAT. Press TRACE to see the class boundaries and frequencies.

To change the boundaries, press WINDOW. Xmin defines where the first class begins and Xscl defines the class width. Xmax, Ymin, and Ymax define how big the window will be. To have classes of size 50 starting at 600, adjust your setting to match the example below.

Calculating Summary Statistics
Make sure your run data is still stored in lists. Press STAT, scroll to the CALC menu, and choose the first option, 1:1-Var Stats. Next, press 2nd-1 to indicate you want the statistics for L1. Then press enter.

Here is the information given. Scroll down for additional information.

To get the data for the NL distribution, repeat the process using L2.
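If you do not have a calculator handy, a rough Python stand-in for this summary screen might look like the sketch below. The set of values it reports (mean, sums, both standard deviations, count, and the five-number summary) is an assumption about what the calculator displays, since the screenshot is not part of this transcript, and the dictionary keys are invented for this example. Standard deviation is not covered in this chapter and is included only because the calculator is assumed to report it.

```python
from statistics import mean, median, pstdev, stdev

# 2008 run totals, copied from the data slide earlier in this section.
al_runs = [782, 845, 811, 805, 821, 691, 765, 829, 789, 646, 671, 774, 901, 714]
nl_runs = [720, 753, 855, 704, 747, 770, 712, 700, 750, 799, 799, 735, 637, 640, 779, 641]


def one_var_stats(values):
    """Hypothetical stand-in for a one-variable summary screen."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return {
        "mean": mean(ordered),
        "sum_x": sum(ordered),
        "sum_x_squared": sum(v * v for v in ordered),
        "sample_sd": stdev(ordered),       # divides by n - 1
        "population_sd": pstdev(ordered),  # divides by n
        "n": len(ordered),
        "min": ordered[0],
        "Q1": median(ordered[:half]),
        "median": median(ordered),
        "Q3": median(ordered[-half:]),
        "max": ordered[-1],
    }


al_stats = one_var_stats(al_runs)  # like choosing L1
nl_stats = one_var_stats(nl_runs)  # like choosing L2

for label, value in al_stats.items():
    print(f"{label}: {value}")
```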

One iPad app that can calculate summary statistics for us is called Bstatistics Lite. Download it!!! When entering data, observations have to be separated with commas.