Statistics the science of collecting, analyzing, and drawing conclusions from data.

Chapters 1 & 3: The Role of Statistics & Graphical Methods for Describing Data

Statistics: the science of collecting, analyzing, and drawing conclusions from data

What term would be used to describe “all high school graduates”?
Suppose we wanted to know something about the GPAs of high school graduates in the nation this year. We could collect data from all high schools in the nation. What term would be used to describe “all high school graduates”? Answer: population

What do you call it when you collect data about the entire population?
Population: the entire collection of individuals or objects about which information is desired. What do you call it when you collect data about the entire population? A census is performed to gather data about the entire population.

We could collect data from all high schools in the nation.
Suppose we wanted to know something about the GPAs of high school graduates in the nation this year. We could collect data from all high schools in the nation. Why might we not want to use a census here? If we didn’t perform a census, what would we do?

Sample: a subset of the population, selected for study in some prescribed manner. What would a sample of all high school graduates across the nation look like? A list of GPAs created by randomly selecting high school graduates from each state.

Once we have collected the data, what would we do with it?
Suppose we wanted to know something about the GPAs of high school graduates in the nation this year. We could collect data from a sample of high schools in the nation. Organize it – graph & make some calculations etc.

Descriptive statistics
the methods of organizing & summarizing data. If the sample of high school GPAs contained 10,000 numbers, how could the data be described or summarized? Create a graph. State the range of GPAs. Calculate the average GPA.

Could we use the data from this sample to answer our question?
Suppose we wanted to know something about the GPAs of high school graduates in the nation this year. We could collect data from a sample of high schools in the nation. Organize it – graph & make some calculations etc. Could we use the data from this sample to answer our question?

Inferential statistics
involves making generalizations from a sample to a population. Based on the sample, if the average GPA for high school graduates was 3.0, what generalization could be made? The average national GPA for this year’s high school graduates is approximately 3.0. Could someone claim that the average GPA for SHS graduates is 3.0? No. Generalizations based on the results of a sample can only be made back to the population from which the sample came. Be sure to sample from the population of interest!!

The number of wrecks per week at the intersection outside?
Variable: any characteristic whose value may change from one individual to another. Is this a variable . . . The number of wrecks per week at the intersection outside?

Data: observations on a single variable or simultaneously on two or more variables. For this variable . . . The number of wrecks per week at the intersection outside. What could observations be?

Types of variables

Categorical (or qualitative) variables
identify basic differentiating characteristics of the population

Numerical (or quantitative) variables
observations or measurements take on numerical values; it makes sense to average these values; two types: discrete & continuous

Discrete (numerical): listable set of values; usually counts of items

Continuous (numerical)
data can take on any value in the domain of the variable; usually measurements of something

Classification by the number of variables
Univariate - data that describes a single characteristic of the population
Bivariate - data that describes two characteristics of the population
Multivariate - data that describes more than two characteristics (beyond the scope of this course)

Identify the following variables:
the appraised value of homes in Fort Smith: discrete numerical (Is money a measurement or a count?)
the color of cars in the teacher’s lot: categorical
the number of calculators owned by students at your school: discrete numerical
the zip code of an individual: categorical
the amount of time it takes students to drive to school: continuous numerical

Graphs for categorical data

Bar Graph: used for categorical data. Bars do not touch.
The categorical variable is typically on the horizontal axis. To describe: comment on which occurred the most often or least often. May make a double bar graph or segmented bar graph for bivariate categorical data sets.

Segmented Bar Charts: instead of a circle, uses a rectangular bar to represent the entire data set. The bar is divided into segments representing different categories. Each segment is proportional to the relative frequency for that category.

Using class survey data: graph favorite subject graph gender & favorite subject

A frequency distribution for categorical data is a table that displays the possible categories along with the associated frequencies and/or relative frequencies. The frequency for a particular category is the number of times the category appears in the data set. The relative frequency for a particular category is the fraction or proportion of the observations resulting in the category.

Category            Frequency   Relative frequency
History                  3            0.06
Math                    21            0.44
Science                 13            0.27
English                  6            0.13
Business                 1            0.02
Foreign language         4            0.08
Totals                  48            1.00
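As a rough illustration of the definition above, the sketch below recomputes the relative frequencies from the favorite-subject counts in the table. The names `frequencies` and `relative_frequencies` are made up for this example; the slide's table shows the same values rounded to two decimals.

```python
# Favorite-subject counts copied from the frequency table above.
frequencies = {
    "History": 3,
    "Math": 21,
    "Science": 13,
    "English": 6,
    "Business": 1,
    "Foreign language": 4,
}


def relative_frequencies(freqs):
    """Return each category's share of the total number of observations."""
    total = sum(freqs.values())
    return {category: count / total for category, count in freqs.items()}


rel = relative_frequencies(frequencies)
for category, share in rel.items():
    # Four decimals shown here; the table rounds to two.
    print(f"{category:<18}{frequencies[category]:>3}{share:>9.4f}")
```

The relative frequencies always add up to 1, which is a quick check that no category was missed.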


Pie (Circle) graph: used for categorical data. To make:
Proportion × 360° gives the angle for each category. Using a protractor, mark off each part. To describe: comment on which occurred the most often or least often.
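A small sketch of the angle calculation, assuming the usual rule that each slice's central angle is its proportion of the data times 360°. The counts are the favorite-subject frequencies from the class survey table; the function name `pie_angles` is hypothetical.

```python
# Favorite-subject counts from the class survey table.
frequencies = {
    "History": 3,
    "Math": 21,
    "Science": 13,
    "English": 6,
    "Business": 1,
    "Foreign language": 4,
}


def pie_angles(freqs):
    """Central angle in degrees for each category: proportion * 360."""
    total = sum(freqs.values())
    return {category: count / total * 360 for category, count in freqs.items()}


angles = pie_angles(frequencies)
for category, angle in angles.items():
    print(f"{category:<18}{angle:>7.1f} degrees")
```

The angles should total 360°, so the last slice closes the circle when marked off with a protractor.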

Graphs for numerical data

Dotplot: used with numerical data (either discrete or continuous)
Made by putting dots (or X’s) on a number line. Can make comparative dotplots by using the same axis for multiple groups.

Distribution Activity . . .

Types (shapes) of Distributions

Symmetrical: refers to data in which both sides are (more or less) the same when the graph is folded vertically down the middle. Bell-shaped is a special type: it has a center mound with two sloping tails.

Uniform refers to data in which every class has equal or approximately equal frequency

Skewed (left or right): refers to data in which one side (tail) is longer than the other side. The direction of skewness is on the side of the longer tail.

Bimodal (multi-modal)
refers to data in which two (or more) classes have the largest frequency & are separated by at least one other class

How to describe a numerical, univariate graph
Do after Features of Distributions Activity

What strikes you as the most distinctive difference among the distributions of exam scores in classes A, B, & C ?

1. Center discuss where the middle of the data falls
three measures of central tendency: mean, median, & mode
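A quick sketch of the three measures of center using Python's standard `statistics` module. The exam scores below are hypothetical values invented for this example, not data from the class activity.

```python
from statistics import mean, median, multimode

# Hypothetical exam scores, made up for illustration only.
scores = [72, 85, 85, 90, 68, 77, 85, 93]

center_mean = mean(scores)        # arithmetic average
center_median = median(scores)    # middle value of the sorted data
center_modes = multimode(scores)  # most frequent value(s), as a list

print("mean:", center_mean)
print("median:", center_median)
print("mode(s):", center_modes)
```

`multimode` returns a list because a data set can have more than one mode.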

What strikes you as the most distinctive difference among the distributions of scores in classes D, E, & F?

2. Spread discuss how spread out the data is
refers to the variability of the data; measured by range, standard deviation, IQR
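To see how two distributions can share a center yet differ in spread, here is a minimal sketch with two hypothetical classes (the scores are invented for illustration). It uses the sample standard deviation from the standard `statistics` module; `data_range` is a helper named for this example.

```python
from statistics import mean, stdev

# Two hypothetical classes with the same mean but different variability.
class_a = [78, 79, 80, 81, 82]
class_b = [60, 70, 80, 90, 100]


def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


for name, scores in (("A", class_a), ("B", class_b)):
    print(
        f"class {name}: mean={mean(scores)}, "
        f"range={data_range(scores)}, sd={stdev(scores):.2f}"
    )
```

Both classes are centered at 80, so only the spread measures reveal the difference between them.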

What strikes you as the most distinctive difference among the distributions of exam scores in classes G, H, & I ?

3. Shape refers to the overall shape of the distribution
symmetrical, uniform, skewed, or bimodal

What strikes you as the most distinctive difference among the distributions of exam scores in class K ?

4. Unusual occurrences: outliers (values that lie away from the rest of the data), gaps, clusters, anything else unusual

5. In context You must write your answer in reference to the specifics in the problem, using correct statistical vocabulary and using complete sentences!

Describing Quantitative Data
Histograms Boxplots Dotplots Stem and Leaf Plots

Just CUSS and BS!

Center: “the typical value” (mean, median)
Unusual features: gaps, outliers

Shape: single vs. multiple modes (unimodal, bimodal); symmetry vs. skewness

Illustrated Distribution Shapes
Unimodal, bimodal, multimodal, skewed positively (right), skewed negatively (left), symmetric

Spread: “how tightly values cluster around the center” (standard deviation, range, IQR, 5-number summary)

And Be Specific!

More graphs for numerical data

Stemplots (stem & leaf plots)
Used with univariate, numerical data. Must have a key so that we know how to read the numbers. Can split stems when you have a long list of leaves. Can have a comparative stemplot with two groups. Would a stemplot be a good graph for the number of pieces of gum chewed per day by AP Stat students? Why or why not? Would a stemplot be a good graph for the number of pairs of shoes owned by AP Stat students? Why or why not?

Basic Stemplots A stemplot is quite similar to a dotplot.
Like the dotplot, the stemplot is arranged along a type of number line (stems). Instead of plotting dots above corresponding points, you place (in order) leaves above the points. Let’s consider how it looks in an example.

Basic Stemplots Consider the dataset comprised of people’s ages at a family reunion. Their ages are: 4, 6, 7, 13, 16, 17, 23, 31, 36, 40, 42, 44, 53, 57, 58, 62, 84

Basic Stemplots A dotplot would be cumbersome and too spread out to be informative (consider numbering off the number line from 4 to 84). A stemplot will group the data together, thus compacting the graph and making it easier to visualize the data.

Basic Stemplots Data: 4, 6, 7, 13, 16, 17, 23, 31, 36, 40, 42, 44, 53, 57, 58, 62, 84. To create the stemplot, first determine the stems. The stems in this case should be the tens place (the values in the tens place range from 0 to 8). Create your “number line” from 0 to 8. This is typically done vertically:
0
1
2
3
4
5
6
7
8

Basic Stemplots Then, place the leaves next to the corresponding stems in sequential order. Each digit placed as a leaf should take up the same amount of space to provide a visual sense of how many values fall in that area of the number line. For the ages 4, 6, 7, 13, 16, 17, 23, 31, 36, 40, 42, 44, 53, 57, 58, 62, 84:
0 | 4 6 7
1 | 3 6 7
2 | 3
3 | 1 6
4 | 0 2 4
5 | 3 7 8
6 | 2
7 |
8 | 4
Key: 3 | 1 = 31 years

Things to note include describing the center, shape, spread, and extreme values (outliers) of the distribution. The center seems to be around the mid-30s (the median age is 36). The shape is a bit odd; certainly not bell-shaped, as it seems to have two peaks. This is typically referred to as bimodal. The spread is what you might expect for data representing ages of humans. It seems like the 84-year-old may be an outlier.
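The construction of the family-reunion stemplot can be sketched in a few lines. This is only one possible way to do it, assuming non-negative whole numbers with the tens digit as the stem and the units digit as the leaf; the function name `stemplot` is made up for this example.

```python
from collections import defaultdict

# Ages at the family reunion, from the example above.
ages = [4, 6, 7, 13, 16, 17, 23, 31, 36, 40, 42, 44, 53, 57, 58, 62, 84]


def stemplot(values):
    """Return stemplot rows (stem = tens digit, leaf = units digit)."""
    leaves = defaultdict(list)
    for value in sorted(values):
        leaves[value // 10].append(value % 10)
    rows = []
    # List every stem from smallest to largest so that gaps stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row}".rstrip())
    return rows


rows = stemplot(ages)
print("\n".join(rows))
print("Key: 3 | 1 = 31 years")
```

Keeping the empty stem 7 in the output matters: dropping it would hide the gap between 62 and 84.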

Example: The following data are price per ounce for various brands of dandruff shampoo at a local grocery store. Can you make a stemplot with this data?

Advanced Stemplots Many times, the data does not lend itself to tens digits and units digits (in fact, it rarely does). In most cases you will need to do some rounding or truncating (cutting off the excess digits for the purpose of the graph) of the figures (see example 1). Also, you may need to split the stems. This means you may have two stems for each leading digit. The first stem will contain leaves from 0 to 4 and the second stem will contain leaves from 5 to 9. See example 2 below for clarification. Finally, you may want to construct back-to-back stemplots in order to compare two distributions. See example 3.

Example 1 - Rounding
Original Data: 2.234, , , 3.794, 4.252,
Revised Data (after rounding): 2.2, 3.2, 3.8, 3.8, 4.3, 4.9
Stemplot (done in StatCrunch, which also rounds for you!):
2 : 2
3 : 288
4 : 39

Example 2 – Split Stems
Original Data: 1.1, 1.2, 1.4, 1.6, 1.6, 1.7, 1.9, 1.9, 2.0, 2.0, 2.3, 2.5, 2.7, 3.5
Without split stems:
1 : 12466799
2 : 00357
3 : 5
OR, with split stems (L = leaves 0 to 4, H = leaves 5 to 9):
1L : 124
1H : 66799
2L : 003
2H : 57
3L :
3H : 5
Which of the above plots is more telling with regard to the shape, center, spread, and extremes of the distribution?
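Splitting stems can also be sketched in code. The version below assumes data recorded to one decimal place, with the whole-number part as the stem and the tenths digit as the leaf; the function name `split_stemplot` is made up for this example.

```python
# Data from Example 2.
data = [1.1, 1.2, 1.4, 1.6, 1.6, 1.7, 1.9, 1.9, 2.0, 2.0, 2.3, 2.5, 2.7, 3.5]


def split_stemplot(values):
    """Return rows with two stems per leading digit: L (0-4) and H (5-9)."""
    # Work in whole tenths to avoid floating-point digit errors.
    tenths = sorted(round(v * 10) for v in values)
    rows = []
    for stem in range(tenths[0] // 10, tenths[-1] // 10 + 1):
        for label, low, high in (("L", 0, 4), ("H", 5, 9)):
            leaves = "".join(
                str(t % 10)
                for t in tenths
                if t // 10 == stem and low <= t % 10 <= high
            )
            rows.append(f"{stem}{label} : {leaves}".rstrip())
    return rows


split_rows = split_stemplot(data)
print("\n".join(split_rows))
```

Empty stems such as 3L are kept so that the distance between 2.7 and 3.5 remains visible.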

Example 3 – Back-to-Back Stemplots
Original data consists of two datasets. Consider the number of home runs hit by Barry Bonds and Mark McGwire from the years 1987 to 2001. Year Bonds McGwire 1987 25 49 1988 24 32 1989 19 33 1990 39 1991 22 1992 34 42 1993 46 9 1994 37 1995 1996 52 1997 40 58 1998 70 1999 65 2000 2001 73 29 To construct the back-to-back stemplot, the stems go down the middle. On the left-hand side, plot one distribution’s data points (see McGwire’s data on the left). Notice the ordering of the leaves on McGwire’s side of the stemplot. On the right-hand side, plot your regular stemplot. This plot is useful in comparing distributions. It can quickly give the educated reader a sense of how the center, shape, extremes, and spread of two distributions compare.

Construct a back-to-back ranked stemplot of the data.
On a given day, the total purchases of 20 randomly chosen shoppers at Borders bookstore and 30 randomly chosen shoppers at Barnes & Noble bookstore were recorded. The data, rounded to the nearest dollar, are below. Construct a back-to-back ranked stemplot of the data. Borders Barnes and Noble 97 70 10 49 52 79 34 113 41 108 42 26 65 38 69 93 22 44 30 47 66 58 85 59 130 33 83 143 48 66 78 59 82 43 80 84 79 22 70 29 64 36 56 37 62 58 41 48 39

Histograms: used with numerical data. Bars touch on histograms. Two types:
Discrete: bars are centered over discrete values. Continuous: bars cover a class (interval) of values. For comparative histograms, use two separate graphs with the same scale on the horizontal axis. Would a histogram be a good graph for the fastest speed driven by AP Stat students? Why or why not? Would a histogram be a good graph for the number of pieces of gum chewed per day by AP Stat students? Why or why not?

50 students were asked the question, “How many textbooks did you purchase last term?”

“How many textbooks did you purchase last term?”
The largest group of students bought 5 or 6 textbooks with 3 or 4 being the next largest frequency.

Another version with the scales produced differently.

When working with continuous data, the steps to construct a histogram are
1. Decide into how many groups or “classes” you want to break up the data. Typically somewhere between 5 and 20. A good rule of thumb is to aim for an average of more than 5 observations per group.*
2. Use your answer to help decide the “width” of each group.
3. Determine the “starting point” for the lowest group.
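The three steps can be sketched as a small frequency-table builder. This is only an illustration, reusing the family-reunion ages from the stemplot example; with just 17 values the data set is too small to meet the more-than-5-per-group guideline, and the choices of 5 classes, a width of 20, and a start of 0 are assumptions made for the example. The function name `frequency_table` is hypothetical.

```python
# Family-reunion ages from the stemplot example, reused for illustration.
ages = [4, 6, 7, 13, 16, 17, 23, 31, 36, 40, 42, 44, 53, 57, 58, 62, 84]


def frequency_table(values, start, width):
    """Count values in half-open classes [lower, lower + width)."""
    counts = {}
    lower = start
    while lower <= max(values):
        upper = lower + width
        counts[(lower, upper)] = sum(lower <= v < upper for v in values)
        lower = upper
    return counts


num_classes = 5                                    # step 1: choose the number of classes
raw_width = (max(ages) - min(ages)) / num_classes  # 80 / 5 = 16
width = 20                                         # step 2: round up to a convenient width
start = 0                                          # step 3: choose the starting point

table = frequency_table(ages, start, width)
for (lower, upper), count in table.items():
    print(f"{lower:>3} to <{upper:<4}{count:>3}")
```

Using half-open classes means a value that sits on a boundary, such as 40, is counted in exactly one class.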

The two histograms below display the distribution of heights of gymnasts and the distribution of heights of female basketball players. Which is which? Why? Heights – Figure A Heights – Figure B

Suppose you found a pair of size 6 shoes left outside the locker room
Suppose you found a pair of size 6 shoes left outside the locker room. Which team would you go to first to find the owner of the shoes? Why? Suppose a tall woman (5 ft 11 in) tells you she is looking for her sister who is practicing in the gym. To which team would you send her? Why? Think about center & spread.

This table shows weights of 79 randomly selected students.

Mark the boundaries of the class intervals on a horizontal axis
Use frequency or relative frequency on the vertical scale.

Another version of a frequency table and histogram for the weight data with a class width of 20.

The resulting histogram.

Yet another version of a frequency table and histogram for the weight data with a class width of 20.

The corresponding histogram.

Class Frequency 11-20 21-30 31-40 41-50 51-60

The table below lists the number of siblings for this year’s statistics students.
2 3 1 5 7 6 4 8 1 7 8 6 3 5 4 10 20 2 18 Frequency Tallies Interval

This table lists the distances statistics students drive to school.
8 5 4.2 2.89 4.9 1.5 6 4.6 3 4 7 15 0.5 10 1 5.38 2.6 2 9 6.4 3.3 3.6 5.5 5.7 4.3 6.24 6.2 3.13 13 13.2 6.6 0.2
5 14 to 16 2 12 to 14 6 10 to 12 8 to 10 11 6 to 8 13 4 to 6 10 2 to 4 0 to 2 Frequency Tallies Interval

Cumulative Relative Frequency Plot (Ogive)
. . . is used to answer questions about percentiles. A percentile is the percent of individuals that are at or below a certain value. Quartiles are located every 25% of the data: the first quartile (Q1) is the 25th percentile, while the third quartile (Q3) is the 75th percentile. What is the special name for Q2? The interquartile range (IQR) is the range of the middle half (50%) of the data. IQR = Q3 – Q1
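Here is one way to compute quartiles and the IQR by hand-style rules. It assumes a common classroom convention (textbooks and software differ): Q1 and Q3 are the medians of the lower and upper halves of the sorted data, leaving out the middle value when the count is odd. The data are the family-reunion ages from the stemplot example, and the function name `quartiles` is made up here.

```python
from statistics import median

# Family-reunion ages from the stemplot example.
ages = [4, 6, 7, 13, 16, 17, 23, 31, 36, 40, 42, 44, 53, 57, 58, 62, 84]


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Skip the middle value when n is odd; it belongs to neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)


q1, q2, q3 = quartiles(ages)
iqr = q3 - q1
print("Q1 =", q1, " Q2 =", q2, " Q3 =", q3, " IQR =", iqr)
```

Other conventions (for example, interpolation-based methods used by some software) can give slightly different quartiles for the same data.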

Cumulative Relative Frequency Table
If we keep track of the proportion of the data that falls below the upper boundaries of the classes, we have a cumulative relative frequency table.

If we graph the cumulative relative frequencies against the upper endpoint of the corresponding interval, we have a cumulative relative frequency plot.
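A minimal sketch of building the points for such a plot. The class frequencies below come from grouping the family-reunion ages (from the stemplot example) into classes of width 20, which is an illustrative choice made here rather than a table from the slides.

```python
from itertools import accumulate

# Upper class boundaries and frequencies for the ages grouped as
# 0 to <20, 20 to <40, 40 to <60, 60 to <80, 80 to <100 (illustrative grouping).
upper_bounds = [20, 40, 60, 80, 100]
frequencies = [6, 3, 6, 1, 1]

total = sum(frequencies)
cumulative = list(accumulate(frequencies))              # running totals
cumulative_relative = [c / total for c in cumulative]   # running proportions

# Points to plot: (upper endpoint of class, cumulative relative frequency).
points = list(zip(upper_bounds, cumulative_relative))
for bound, proportion in points:
    print(f"below {bound:>3}: {proportion:.3f}")
```

The last cumulative relative frequency is always 1, since every observation falls below the top boundary.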

Cumulative Relative Frequency
Class Frequency Relative Frequency Cumulative Frequency Cumulative Relative Frequency 11-20 21-30 31-40 41-50 51-60 DRP Scores Cumulative Relative Frequency

Histograms with uneven class widths
For many reasons, either for convenience or because that is the way data was obtained, the data may be broken up in groups of uneven width as in the following example referring to the student ages.

If a frequency (or relative frequency) histogram is drawn with the heights of the bars being the frequencies (relative frequencies), the result is distorted. Notice that it appears that there are a lot of people over 28 when there are only a few.

To correct the distortion, we create a density histogram
The vertical scale is called the density, and the density of a class is calculated by density = (relative frequency of class) / (class width). This choice for the density makes the area of the rectangle equal to the relative frequency.

Continuing this example we have
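Density = relative frequency / class width: 0.329/2 ≈ 0.165 for the class of width 2, and 0.063/12 ≈ 0.005 for the class of width 12.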

The resulting histogram is now a reasonable representation of the data.
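The density calculation can be checked with a short sketch, assuming density = relative frequency divided by class width. Only the two classes whose arithmetic appears above (0.329 over a width of 2, and 0.063 over a width of 12) are used; the function name `density` is made up for this example.

```python
def density(relative_frequency, class_width):
    """Bar height for a density histogram."""
    return relative_frequency / class_width


# (relative frequency, class width) for the two classes worked above.
classes = [(0.329, 2), (0.063, 12)]

densities = [density(rel_freq, width) for rel_freq, width in classes]
for (rel_freq, width), height in zip(classes, densities):
    # Bar area = height * width, which recovers the relative frequency.
    print(f"width {width:>2}: density {height:.5f}, area {height * width:.3f}")
```

The wide class ends up with a much shorter bar, which is what removes the distortion seen in the unadjusted histogram.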
