Exploring Numerical Data


1 Exploring Numerical Data
Objectives Students will be able to: 1) graph the distribution of a numerical variable 2) calculate summary statistics for a distribution of a numerical variable 3) compare distributions of a numerical variable

2 NL vs AL In MLB, what is the lineup difference between the NL and the AL? In 1973, the AL enacted the designated hitter (DH). The DH is a player who only bats (does not play defense). In the AL, the DH bats in place of the pitcher. The DH was designed to increase offense, which would in turn generate more interest in AL games. The assumption was that fans would like to see more offense. Does the DH increase offense in MLB?

3 Numerical variables are variables whose possible outcomes take on numerical values that represent different quantities of the variable. Examples: number of runs scored by teams in the AL; number of sacks in an NFL season by DeMarcus Ware; the amount of time it takes to swim 100 meters. It is beneficial to begin an analysis of numerical variables with a graph of the data.

4 Here are the run totals for the 30 MLB teams in 2008
Note: the Astros were still in the NL.

5 There are various ways to graph the distribution of numerical variables.
We already know how to make a dotplot. Note: when making dotplots that compare two distributions, it is important to ensure the dotplots are on the same scale. Otherwise, the distributions are difficult to compare.
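To illustrate the point about matching scales, here is a minimal Python sketch that draws two text dotplots on one shared axis. The function name `dotplot_rows`, the step of 10, and the values in both groups are assumptions invented for illustration; they are not the 2008 run totals.

```python
def dotplot_rows(values, lo, hi, step):
    """Build text rows for a dotplot drawn on the axis lo..hi in increments of step."""
    rows = []
    for pos in range(lo, hi + 1, step):
        # One dot per observation that falls in [pos, pos + step).
        dots = sum(1 for v in values if pos <= v < pos + step)
        rows.append(f"{pos:>4} | {'.' * dots}")
    return rows

# Hypothetical values, invented for illustration (not the 2008 run totals).
group_a = [700, 720, 720, 760, 810]
group_b = [650, 700, 710, 740]

step = 10
combined = group_a + group_b
lo = min(combined) // step * step   # shared lower end of the axis
hi = max(combined)                  # shared upper end of the axis

rows_a = dotplot_rows(group_a, lo, hi, step)
rows_b = dotplot_rows(group_b, lo, hi, step)
print("Group A", *rows_a, sep="\n")
print("Group B", *rows_b, sep="\n")
```

Because both plots are built from the same `lo`, `hi`, and `step`, each row of one plot lines up with the same row of the other.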

6 Here are the dotplots comparing the distribution of runs scored for AL and NL teams in 2008 (pg 120). At first glance, the distribution of runs scored seems fairly similar for both leagues. Perhaps AL teams score a little more often than NL teams.

7 Histograms A histogram is a graph that divides the values of a numerical variable into classes and uses bars to represent the number of values in each class. The frequency describes the number of observations in each class. For our histogram, the number of runs will be broken down into classes, and the frequency will be the number of teams in those classes.

8 One easy way to make a histogram is by starting with a dotplot, and building from there.
Let’s make a histogram for the number of runs scored during 2008 for all 30 MLB teams. Step 1: Start with a dotplot showing the runs scored for each of the 30 MLB teams.

9 Step 2: Divide the data into 5 to 10 equally wide classes
For this example, we can use classes that are 50 runs wide. Therefore, our first class will be runs, the next class is runs, etc…

10 Step 3: Count how many observations are in each class
If an observation falls exactly on a border line, it is considered part of the class above the boundary. For example, the observation on 750 would count as part of the class that begins at 750.

11 Step 4: Draw bars for each class
Bars should be equally wide and have no spaces between them. The height for each bar corresponds with the number of observations in that class.
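Steps 2 through 4 can be sketched in a few lines of Python. This is only an illustration: the function name `class_counts`, the starting value of 600, and the run totals below are assumptions made up for the example, not the actual 2008 data. The one rule taken from the slides is that a value on a boundary is counted in the class above it.

```python
from collections import Counter

def class_counts(values, width, start):
    """Count observations per class; a value on a boundary goes to the class above."""
    counts = Counter()
    for v in values:
        # Floor division puts a boundary value (e.g. 750) in the class that starts there.
        lower = start + ((v - start) // width) * width
        counts[lower] += 1
    return dict(sorted(counts.items()))

# Hypothetical run totals, invented for illustration (not the 2008 MLB data).
runs = [640, 655, 700, 712, 735, 750, 768, 790, 805, 860]
counts = class_counts(runs, width=50, start=600)

# Text version of Step 4: one bar per class, bar length = frequency.
for lower, n in counts.items():
    print(f"{lower}-{lower + 50}: {'#' * n}")
```

Here the observation at exactly 750 is counted in the class that starts at 750, not in the class that ends there.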

12 It is also possible to make a relative frequency histogram
This shows the percentage of observations in each class, rather than the number of observations.
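As a rough sketch of the conversion, each class count is divided by the total number of observations and expressed as a percentage. The class labels and counts below are hypothetical, chosen only so that they add up to 30 teams; `relative_frequencies` is an illustrative name.

```python
def relative_frequencies(counts):
    """Convert class counts to percentages of all observations."""
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}

# Hypothetical class counts for 30 teams, invented for illustration.
counts = {"600-650": 2, "650-700": 5, "700-750": 9,
          "750-800": 8, "800-850": 4, "850-900": 2}

percents = relative_frequencies(counts)
for label, pct in percents.items():
    print(f"{label}: {pct:.1f}%")
```

The shape of the histogram does not change; only the vertical axis changes from counts to percentages.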

13 When comparing two or more histograms:
Use the same scales! The scales on the horizontal axes should match. The scales on the vertical axes should match. When the number of observations is not the same between distributions, we should make a relative frequency histogram. Let’s look at why….

14 Here are two frequency histograms comparing the number of points scored for players on the LA Lakers and players not on the Lakers in the regular season. Because there are many more players not on the Lakers, it is hard to compare these distributions.

15 Let’s now use a relative frequency histogram:
The comparison is now much easier to make.
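A small Python sketch shows why relative frequencies help when group sizes differ. The two groups below are hypothetical: the large group is simply the small group repeated 20 times, so the two have the same shape but very different counts. The function name and class settings are assumptions for the example.

```python
def relative_histogram(values, width, start, n_classes):
    """Percentage of observations in each class, using the same classes for every group."""
    counts = [0] * n_classes
    for v in values:
        counts[(v - start) // width] += 1
    return [100 * c / len(values) for c in counts]

# Hypothetical point totals, invented for illustration.
small_group = [3, 8, 12, 14, 27]
large_group = small_group * 20    # same shape, 20 times as many observations

small_pct = relative_histogram(small_group, width=10, start=0, n_classes=3)
large_pct = relative_histogram(large_group, width=10, start=0, n_classes=3)
print(small_pct)
print(large_pct)
```

The raw counts differ by a factor of 20, but the percentages in each class are identical, so the two distributions can be compared directly.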

16 Describing the Shape of the Distribution
There are several phrases we can use to describe the shape of the distribution of numerical data. Let’s look at this using different data from all 2009 MLB players who had at least 300 plate appearances. A plate appearance occurs each time a player takes their turn at bat.

17 Symmetric A distribution is symmetric if the left side of the graph is roughly a mirror image of the right side.

18 Skewed right A distribution is skewed to the right when the right side of the graph is more spread out than the left side.

19 Skewed left A distribution is skewed to the left when the left side of the graph is more spread out than the right side.

20 Unimodal A distribution is unimodal when it shows one distinct peak. Note: the previous three graphs can also be considered unimodal.

21 Bimodal A distribution is bimodal if it has two distinct peaks. This graph has a peak at 0 and a peak at 0.8.

22 Caution: Unimodal vs Bimodal
A common error is calling a distribution bimodal when it is really unimodal. To call a distribution bimodal, the peaks need to be clearly distinct. Sometimes a peak occurs because of our choice in boundaries. A good rule of thumb is that if moving one or two observations would eliminate a peak, then there is a good chance that the peak is only there because of our choice in boundaries.

23 Caution: Unimodal vs Bimodal
Here are two histograms that use the exact same data, but different class widths. The first looks like it has two peaks, but the second seems clearly unimodal.
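The same effect can be reproduced with a short Python sketch. The data are hypothetical and chosen so that the effect shows up; the helper names `class_counts` and `count_peaks` are illustrative, and the peak counter is a simplification that only counts bars strictly taller than both neighbors.

```python
def class_counts(values, width):
    """Counts for consecutive classes of the given width, starting at 0."""
    counts = [0] * (max(values) // width + 1)
    for v in values:
        counts[v // width] += 1
    return counts

def count_peaks(counts):
    """Number of bars strictly taller than both neighbors (edges padded with 0)."""
    padded = [0] + counts + [0]
    return sum(1 for i in range(1, len(padded) - 1)
               if padded[i - 1] < padded[i] > padded[i + 1])

# Hypothetical observations, invented for illustration.
data = [2, 6, 7, 8, 11, 13, 16, 17, 18, 22]

narrow = class_counts(data, width=5)    # [1, 3, 2, 3, 1]
wide = class_counts(data, width=10)     # [4, 5, 1]
print(narrow, "peaks:", count_peaks(narrow))
print(wide, "peaks:", count_peaks(wide))
```

With classes 5 units wide the counts show two peaks, but moving a single observation into the middle class would remove the dip. With classes 10 units wide the same data show one peak, which suggests the second peak came from the choice of boundaries.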

24 Uniform A distribution is uniform when the heights of the bars are all about the same.

25 Dotplot vs Histogram General rule of thumb:
When the data sets are small, a dotplot is more useful (allows us to see each individual observation). When the data sets are large, a histogram is more useful. Think about trying to make a dotplot of the heights of all Americans. There would be way too many dots.

