1
Chapter 1: Exploring Data
2
Introduction: Data Analysis… making sense of data
Come up to the whiteboard and write the name of the math class you took last year and the final percentage you earned in the course (to the best of your recollection). For example, perhaps you took Algebra II and earned a 79% as your final grade in the course last June. Don’t give any other instructions about how to group or organize the data students are writing on the board.
3
Your data… Who are the individuals in our data?
What is/are our variable(s) in our data? What type of variable(s) did we measure in our data? Individuals do not have to be people; what else could they be? Butterflies, corvettes, bands, colleges, etc. Variables are the characteristic(s) we measure on each individual (class & final % in course). What other variables could we have measured on our individuals?
4
Your data… Categorical (a.k.a. qualitative) variables
Quantitative (a.k.a. numeric) variables Your summer assignment… Camo categorical variables/data Can we transform quantitative variables into categorical variables? If so, how? Camo’d (looks like numerical but isn’t) categorical data include: zip codes, SS #, floor #, address, student ID #, etc. Transform: 89% into B; 73% into C, etc. Could also do examples and non-examples of either/both categorical and/or numerical data
5
Distribution… Variables may take values that are very close together or values that are spread out. The pattern of variation of a variable is its distribution. Our class data is a distribution. We will analyze many different types of distributions in this course: Normal, t, z, chi-square, probability distributions, univariate, bivariate, etc.
6
Good General Practice in this Course…
Genuinely explore data! Don’t report the first thing you see. Like a treasure hunt… with clues to follow… that ultimately leads to a big prize. Context, context, context!!! Show your work!!!
7
“Summary” …at the end of each section in text
Helpful self-check Do you know what you should know? If not, do something about it… now!
8
Homework… Always posted on IC that evening
Always check your answers in the back of the book Always checked by me and discussed the next day in class Page 6, #’s 1, 3, 5, 7, & 8
9
1.1: Analyzing Categorical Data
Which social media platform do you use most? List on board, only put your mark on ONE platform Use a blue marker if you are male; use a green marker if you are female.
10
1.1: Analyzing Categorical Data
What type of graphical representation could we choose to display our categorical data? Bar graphs, segmented bar graphs, side-by-side bar graphs, pie charts, even two-way tables
11
1.1: Analyzing Categorical Data
What type of graphical representation could we choose to display our categorical data? Bar graph, segmented bar graph, side-by-side bar graphs, pie charts… even two-way tables. The distribution of a categorical variable lists the categories and gives either the count or the percent of individuals who fall within each category. Strengths and weaknesses of each graphical representation; difference between bar graph and histogram; strengths and weaknesses of counts vs. percents. Create a bar graph, then do side-by-side bar graphs separated by gender.
12
Deception… watch for it…
13
What’s wrong?
14
Deception…
15
… fixed
16
More deception… And how about the people without animals? Or other animals?
17
… fixed again, sort of… HW: find a misleading graph, print it, cut it out, comment why it is misleading; how could we fix it?
18
Deception… with the data, and the graphical representation…
Fox’s million figure for the number of “people on welfare” comes from a Census Bureau’s account…of participation in means-tested programs, which include “anyone residing in a household in which one or more people received benefits” in the fourth quarter of 2011, thus including individuals who did not themselves receive government benefits. On the other hand, the “people with a full time job” figure Fox used included only individuals who worked, not individuals residing in a household where at least one person works.”
19
Relationships between categorical data …
20
Relationships between Categorical Data: Period 1 Data
Which game do you like the best?
                   Female   Male   TOTALS
Duck Duck Goose
Musical Chairs
Hop Scotch
Or could change the games, e.g., hot potato, Simon Says, hopscotch, cat’s cradle. Any comments to describe this data? Most people… few people… it appears that xxx is greater than xxx or is similar to xxx; what graphical representations could we create from our data? Copy down. We will use this data again in a little bit.
21
Relationships between Categorical Data: Period 3 Data
Which game do you like the best?
                   Female   Male   TOTALS
Duck Duck Goose
Musical Chairs
Hop Scotch
Alternative games: hot potato, Simon Says, hopscotch, cat’s cradle.
22
We will come back to our data in a few minutes…
23
Relationships between Categorical Data
Remember, categorical data can be numbers, like addresses, phone numbers, area codes, social security numbers, etc. Categorical data can be ‘created’ by grouping values of quantitative variables into classes, i.e., ages year-olds, year-olds, year-olds, etc.
24
Two-Way Table (describes 2 categorical variables)
25
Entries are counts for each class/group
Bottom & side totals are marginal distributions. If missing always calculate them.
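As a rough illustration of calculating missing marginal totals, here is a minimal Python sketch. The counts are made up for illustration (they are not our class data), and the names `counts`, `row_totals`, `col_totals`, and `row_percents` are hypothetical.

```python
# Hypothetical counts for illustration only; replace with the class data.
counts = {
    "Duck Duck Goose": {"Female": 3, "Male": 7},
    "Musical Chairs": {"Female": 8, "Male": 4},
    "Hop Scotch": {"Female": 5, "Male": 3},
}

# Side totals: marginal distribution of favorite game.
row_totals = {game: sum(by_gender.values()) for game, by_gender in counts.items()}

# Bottom totals: marginal distribution of gender.
col_totals = {}
for by_gender in counts.values():
    for gender, n in by_gender.items():
        col_totals[gender] = col_totals.get(gender, 0) + n

grand_total = sum(row_totals.values())

# Marginal percents are often easier to compare than raw counts.
row_percents = {game: 100 * n / grand_total for game, n in row_totals.items()}

for game, n in row_totals.items():
    print(f"{game}: {n} ({row_percents[game]:.1f}%)")
print(col_totals, "grand total:", grand_total)
```

Both sets of marginal totals should add up to the same grand total, which is a quick check on the arithmetic.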
26
Which game do you like best?
Two-Way Tables
Go back to our data and label row variables, column variables, counts, & marginal distributions.
Which game do you like best?
                   Female   Male   TOTAL
Duck Duck Goose
Musical Chairs
Hop Scotch
27
Which game do you like best?
Two-Way Tables
Also helpful to calculate %’s for marginal distributions. Often more informative when making comparisons. May have rounding errors.
Which game do you like best?
                   Female   Male   TOTAL   Marginal %
Duck Duck Goose
Musical Chairs
Hop Scotch
28
Two-Way Tables Remember, every marginal (row or column) distribution from a 2-way table is a distribution for a single categorical variable. We can graph categorical variables using …
29
Two-Way Tables: Comparing Groups within the Table
Marginal distributions compare each category to the grand total in the table: gender to grand total, or game to grand total. Marginal distributions don’t tell us how two groups within the table are related, i.e., how do gender and favorite game relate? See our two-way table/marginal distributions.
30
Two-Way Tables: Comparing Groups within the Table
To make observations about how gender and preferred game relate, need to look inside the body of the 2-way table. Conditional distribution Again, % most helpful in comparing (versus raw numbers). Why? Could be only 20 (out of 100) did xxx; but 50 (out of 5,000) did xxx. Can be deceptive.
31
Conditional Distribution
“Given” just look at that column or row; that becomes your denominator The specific cell in that ‘given’ column or row becomes your numerator What is the conditional distribution… if we are only considering… do many examples.
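The numerator/denominator idea above can be sketched in Python as follows. The counts are hypothetical (not our class data), and the helper names `conditional_given_column` and `conditional_given_row` are made up for this illustration.

```python
# Hypothetical counts for illustration only; replace with the class data.
counts = {
    "Duck Duck Goose": {"Female": 3, "Male": 7},
    "Musical Chairs": {"Female": 8, "Male": 4},
    "Hop Scotch": {"Female": 5, "Male": 3},
}

def conditional_given_column(table, column):
    """Distribution of the row variable among individuals in one 'given' column."""
    denominator = sum(row[column] for row in table.values())  # the column total
    return {label: row[column] / denominator for label, row in table.items()}

def conditional_given_row(table, row_label):
    """Distribution of the column variable among individuals in one 'given' row."""
    row = table[row_label]
    denominator = sum(row.values())  # the row total
    return {label: n / denominator for label, n in row.items()}

given_male = conditional_given_column(counts, "Male")
given_ddg = conditional_given_row(counts, "Duck Duck Goose")

print(given_male["Duck Duck Goose"])  # share of males who prefer Duck Duck Goose
print(given_ddg["Male"])              # share of Duck Duck Goose fans who are male
```

With these made-up counts the two printed values differ (7 out of 14 versus 7 out of 10): same numerator cell, different "given" denominator.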
32
Given that a college student is female, what is the probability that she is in the age group of years?
33
If a college student is between 25 & 34 years old, what is the probability of that student being male?
34
Conditional Distribution
Let’s look at our data… are the following the same? Given that a student is male, % who prefer duck-duck-goose? Given that a student prefers duck-duck-goose, % who are male? Are the following the same? Given that a student is female, % who prefer musical chairs? Given that a student prefers musical chairs, % who are female?
35
Caution No single graph (like scatterplots for bivariate quantitative variables) describes the relationship between 2 categorical variables. No single numerical measure (such as correlation) summarizes the strength of the association between 2 categorical variables. Bar graphs – OK... But think about what you want to compare.
36
Joint Distributions … ‘and’… intersections
What is the probability that a person is both female and in the age group of year olds?
37
“or”… unions What is the probability that a person is either male or in the year age group?
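A small sketch of "and" versus "or" probabilities from a two-way table, using hypothetical game-preference counts (not the age-group table from the slide, and not our class data). The key step for "or" is to subtract the overlap so that nobody is counted twice.

```python
from fractions import Fraction

# Hypothetical counts for illustration only.
counts = {
    "Duck Duck Goose": {"Female": 3, "Male": 7},
    "Musical Chairs": {"Female": 8, "Male": 4},
    "Hop Scotch": {"Female": 5, "Male": 3},
}

grand_total = sum(sum(row.values()) for row in counts.values())
female_total = sum(row["Female"] for row in counts.values())
musical_total = sum(counts["Musical Chairs"].values())
female_and_musical = counts["Musical Chairs"]["Female"]  # one cell in the table body

# "and" (intersection): a single cell over the grand total.
p_and = Fraction(female_and_musical, grand_total)

# "or" (union): add both marginal totals, then subtract the overlap once.
p_or = Fraction(female_total + musical_total - female_and_musical, grand_total)

print("P(female and musical chairs) =", p_and)
print("P(female or musical chairs)  =", p_or)
```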
38
Answer the following using our data…
How many of us prefer musical chairs? What proportion of us are female? Just considering those of us who prefer duck-duck-goose, what % are male? Given that a person is male, what % prefer duck-duck-goose? What is the probability that a person prefers hop scotch and is a female? What is the probability that a person is either female or prefers musical chairs?
39
Associations or trends …
When we calculate marginal distributions, conditional distributions, etc., we may notice associations or trends in the data. Caution: Even a strong association between two categorical variables can be influenced by other variables lurking in the background. Example: There is an ongoing, documented association between increased ice cream sales and an increased rate of accidental pool deaths. Re: ice cream sales and pool deaths; one certainly does not cause the other. They are only associated, related, etc.
40
Anything possibly lurking in the background?
People who regularly attend religious services tend to live longer lives. Caution!!! Association ≠ Causation!
41
Homework Pg 20 #11, 13, 15, 19, 23 Pg 23 #27 - #34 MC practice
42
Section Quiz for 1-1
43
If time permits… Simpson’s Paradox
Not in AP course; just do if time permits; for interest only.
44
Simpson’s Paradox Just like with numerical data, the effects of lurking variables can change or even reverse the relationship between two categorical variables. Simpson’s Paradox: An association or comparison that holds for each of several groups can reverse direction when the data are combined to form a single group.
45
Simpson’s Paradox
46
Simpson’s Paradox
47
Simpson’s Paradox
48
Kent State Men’s Basketball: 2000 – 2001 Conference Games Only
Simpson’s Paradox
Kent State Men’s Basketball: 2000 – 2001 Conference Games Only (18 Games)
                  Trevor Huffman                 Bryan Bedford
                  Made   Attempted   Average     Made   Attempted   Average
Two-Pointers      57     127         0.449       13     30          0.433
Three-Pointers    35     100         0.350       0      1           0.000
All               92     227         0.405       13     31          0.419
49
Simpson’s Paradox The Lurking Variable: The more difficult 3-pointers made up a much higher percentage of Huffman’s attempts than Bedford’s. When the two categories are combined, the harder shots are weighted more in Huffman’s total than in Bedford’s.
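To check the reversal numerically, here is a short Python sketch using the Kent State shooting figures. One assumption to flag: Bedford's three-point line is taken as 0 made out of 1 attempted, inferred from his 0.000 average and from his totals (13 made, 31 attempted). The variable names are made up for illustration.

```python
# (made, attempted) for each shot type.
# Assumption: Bedford's three-pointers are 0 made of 1 attempted,
# inferred from the 0.000 average and the 13-of-31 overall line.
shots = {
    "Trevor Huffman": {"two": (57, 127), "three": (35, 100)},
    "Bryan Bedford": {"two": (13, 30), "three": (0, 1)},
}

def average(made, attempted):
    return made / attempted

def overall(player):
    made = sum(m for m, _ in shots[player].values())
    attempted = sum(a for _, a in shots[player].values())
    return average(made, attempted)

for player, by_type in shots.items():
    two = average(*by_type["two"])
    three = average(*by_type["three"])
    print(f"{player}: two {two:.3f}, three {three:.3f}, all {overall(player):.3f}")

# Share of each player's attempts that were the harder three-pointers.
huffman_three_share = 100 / (127 + 100)
bedford_three_share = 1 / (30 + 1)
print(f"three-point share: Huffman {huffman_three_share:.2f}, Bedford {bedford_three_share:.2f}")
```

Huffman has the higher average within each shot type, yet the lower average overall, because roughly 44% of his attempts were three-pointers compared with about 3% of Bedford's.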
50
Simpson’s Paradox: Big Idea
When looking at data as one large group, we may observe one trend. When the data is divided or broken up into two or more smaller groups, the observation may be reversed/opposite of the previous trend.
51
Displaying Quantitative/Numerical Data with Graphs
What types of graphs can we use for numerical data? What type of graphs can we NOT use for numerical data?
52
Graphical Representations for Quantitative/Numerical Data …
Stem plots, back-to-back stem plots, dot plots, histograms, box plots, etc. Note: calculator can’t create all graphical representations.
53
Let’s collect some data…
How many pairs of shoes (if you don’t know exact #, your best estimate) do you own? Males: Write your data in blue Females: Write your data in black (we will use this extra, categorical data later) Write on board.
54
Stem Plots & Back-to-Back Stem Plots …
Good for fairly small data sets Include a ‘key’ Calculator can’t create; we create by hand Always, always label and scale graphical representations. Did I say always? Let’s create a stem plot with our data Create stem plot with data collected from our class; collect data from M vs. F so we can also do a back to back stem plot; maybe # pairs of shoes you have (by M/F)
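We create stem plots by hand in class, but for checking your work, here is a minimal Python sketch. The shoe counts are hypothetical (not our class data), the function name `stem_plot` is made up, and the sketch assumes non-negative whole numbers with tens as stems and ones as leaves.

```python
from collections import defaultdict

def stem_plot(values):
    """Return the lines of a stem plot: tens digit(s) as stem, ones digit as leaf."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Walk every stem from smallest to largest so empty stems (gaps) still show.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    lines.append("Key: 1 | 2 = 12 pairs of shoes")
    return lines

# Hypothetical data for illustration only.
shoes = [4, 7, 8, 10, 12, 12, 15, 21, 23, 38]
lines = stem_plot(shoes)
print("\n".join(lines))
```

Note that the key is included as the last line, as every stem plot should have one.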
55
Stem Plots & Back-to-Back Stem Plots … did we:
Include a ‘key’? Create by hand? Label and scale? Is it a ‘small’ data set? SOCS (look for trends in data through the graphical representation) Create stem plot with data collected from our class; analyze it thru SOCS; collect data from M vs. F so we can also do a back to back stem plot; maybe # pairs of shoes you have (by M/F)
56
Detour… to SOCS Need a method to describe our data, our distributions, of numeric, uni-variate data SOCS S: Shape O: Outliers C: Center S: Spread
57
Shape Outliers Center Spread (describes numeric, uni-variate distributions)
Shape: uni-modal? bi-modal? tri-modal? symmetric? Skewed to left or right? Outlier(s): Is there a value(s) that is far away from the rest of the data? Center: Where is the “middle” of the data? Spread: How far apart or close together is the distribution?
58
Back to Stem-Plots…Back-to-Back Stem Plots
Used to compare 2 related distributions (small data set) Always label and scale. Include a ‘key’ Similar to stem plots, calculator can’t create, good for fairly small data sets, and look for trends via SOCS Let’s create a back-to-back stem plot of our data, categorizing by gender
59
Modifications to Stem Plots
Splitting stems: Data is very bunched together Tuition (in $1000) Virginia Colleges; Key: 0|2 = $2000 SOCS?
60
Another Modification Trimming: removing the last digit or digits before making a stem plot Examples of when you might want to trim your data in a stem plot (or a back-to-back stem plot)? Use your judgment Maybe the $ you spent at dinner last night; don’t need all digits; like rounding; cost of houses, annual salaries; etc.
61
Publisher’s Applet www.whfreeman.com/tps5e Applet 1B, stemplot
Demo in class for a few minutes.
62
Dot Plots… Let’s collect some data and create a dot plot
What is the last digit of your cell phone number? Before we collect the data, what do you expect the data to be/look like? What are some advantages and disadvantages of a dot plot? Retain original data; good for small data sets; can’t create on calculator; clearly shows clusters and gaps; cumbersome for large sets of data; if data is too spread out, hard to see trends (SOCS)
63
Dot Plots… Let’s create the dot plot
Always label and scale any graphical representation; always Analyze through SOCS… Shape Outlier(s) Center Spread Retain original data; good for small data sets; can’t create on calculator; clearly shows clusters and gaps
64
Comparing two distributions…
Do external rewards—things like money, praise, fame, and grades—promote creativity? A researcher designed an experiment to find out. She recruited 47 experienced creative writers who were college students and divided them into two groups using a chance process (like drawing names from a hat). The students in one group were given a list of statements about external reasons (E) for writing, such as public recognition, making money, or pleasing their parents. Students in the other group were given a list of statements about internal reasons (I) for writing, such as expressing yourself and enjoying playing with words. Both groups were then instructed to write a poem about laughter. Each student’s poem was rated separately by 12 different poets using a creativity scale. These ratings were averaged to obtain an overall creativity score for each poem.
65
Comparing two distributions…
Dot plots of the two groups’ creativity scores are shown below. Compare the two distributions. What do you conclude about whether external rewards promote creativity?
66
Both roughly symmetric; about same amount of variability
Center of (I) greater than center of (E), so, in general, external rewards do not promote creativity. Neither distribution appears to have outliers.
67
Histograms Another good graphical representation used with uni-variate, numerical data Use technology always. Don’t ever create by hand (too much work) All “classes” must be the same width (technology does this automatically) Label always. Scale always. Example…
68
Why are different class widths misleading?
69
Histograms… let’s collect some data…
How long can you hold your breath for? Pair up. Time to the nearest 10th of a second; for example, I can hold my breath for seconds. Write your data on the board; let’s create a histogram; label, scale… always. Disadvantages to histograms… Don’t retain original data; can ‘change’ the way data looks by changing class width; usually want 7-12 classes in a histogram.
70
Creating Histograms… TI-83, 84, 84+
Input data into a list; 2nd, STAT PLOT; turn only one plot on; select the picture of the histogram; reference the list you input your data into; ZOOM, 9. Walk through the steps on how to do this.
71
Histograms All classes are same width
ZOOM, 9 forces data to fit the screen. Use TRACE. You can change the width of classes if you wish; does this change the appearance of the data? Is there a scale on your screen? Describe the distribution via SOCS.
72
Applet Activity 1B Applet Activity 1B, histograms Demo in class with students briefly.
73
Compare/contrast histograms…
74
Matching… A survey of a large high school class asked the following questions: Are you female or male? (In the data, male = 0, female = 1.) Are you right-handed or left-handed? (In the data, right = 0, left = 1.) What is your height in inches? How many minutes do you study on a typical weeknight?
75
Which graph goes with each variable? Explain your reasoning.
(gender, r/l handed, height, study time) A is study time; B is r/l handed; C is M/F; D is heights; explanations…
76
Histogram vs. Bar Graph What’s the difference?
Histogram is used for numeric data; bar graphs are used for categorical data No spaces between vertical bars in histogram; spaces between bars in bar graph Don’t confuse the two!
77
Dot plots, stem plots, histograms & Examining Distributions
Always want to create a graphical representation (if at all possible) of a distribution. But graphs aren’t enough. We also want/need… Numeric analysis… Shape, Outlier(s), Center, Spread (SOCS) We have general idea of numeric analysis; next section will be more detailed numeric analysis
78
Homework Use calculator to create graphs whenever possible/applicable
Page 41, #37, 39, 41, 45, 49, 51, 55, 59, 61, 65 MC Practice: Page 47, #69 - #74
79
Section Quiz for 1-2
80
Describing Quantitative Data with Numbers
81
So do we still graph? Yes! We still definitely want/need to graph a distribution... But don’t stop there. Just looking at a graph can sometimes be misleading or not give us a complete picture of what’s going on with a set of data/a distribution We need more information to fully understand the data
82
Measures of Central Tendency
Median: Middle value of a distribution when the values are ordered smallest to largest; also called Q2. Mean (x̄): “Average” value of a distribution; add all values of a distribution and then divide by the number of values. (Sometimes you will see ‘trimmed mean’ referred to in FRQs... disregard.) Two different ideas of ‘center’ that behave differently… so which measure of center do we use?
83
Measures of Central Tendency
Example: Five randomly selected houses in Santa Clarita valued at $95,000 $100,000 $101,000 $106,000 $12,000,000 What is the mean? What is the median?
84
Measures of Central Tendency
Input into lists; STAT, CALC, 1-Var Stats, reference list. Median: $101,000. Mean: $2,480,400. Why are they so different?
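The same calculation can be checked outside the calculator. A minimal Python sketch using the standard `statistics` module and the five house values from the example:

```python
from statistics import mean, median

# The five house values from the example slide.
house_values = [95_000, 100_000, 101_000, 106_000, 12_000_000]

house_mean = mean(house_values)
house_median = median(house_values)

print(f"Mean:   ${house_mean:,.0f}")    # pulled far upward by the $12,000,000 house
print(f"Median: ${house_median:,.0f}")  # the middle value of the ordered list
```

The one very expensive house drags the mean far above the other four values, while the median stays with the typical houses.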
85
Measures of Central Tendency
Another example... Random sample of 7 pairs of jeans... $19 $21 $24 $17 $27 $31 $479 1-var stats (remember stat, calc, 1-var stats) to calculate mean and median
86
Measures of Central Tendency
Median: $24 Mean: $88.29 Why are they so different?
87
Measures of Central Tendency
Median: Resistant to extreme values, outliers Mean: Non-resistant to extreme values, outliers So, general rule: use median if there are outliers, if distribution is highly skewed; use mean if distribution is fairly symmetric, no outliers
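One way to see "resistant" versus "non-resistant" is to recompute both measures with and without the extreme value. A small sketch using the jeans prices from the earlier example (the variable names are made up for illustration):

```python
from statistics import mean, median

jeans = [19, 21, 24, 17, 27, 31, 479]
without_outlier = [price for price in jeans if price != 479]

mean_with, median_with = mean(jeans), median(jeans)
mean_without, median_without = mean(without_outlier), median(without_outlier)

print(f"With $479 pair:    mean {mean_with:.2f}, median {median_with}")
print(f"Without $479 pair: mean {mean_without:.2f}, median {median_without}")
```

Dropping the $479 pair moves the mean from about $88.29 to about $23.17, while the median only shifts from $24 to $22.50.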
88
Other Info in 1-Var Stats
(side note…) mean, median, Q1, Q3, max, min, standard deviation, etc.
89
Mean & Median... Think of a stem plot or a dot plot or some graphical representation of numerical data... What does the distribution look like if the mean is greater than the median? How about if the mean is less than the median?
90
If a distribution is left skewed, then the mean < median (generally)
91
If a distribution is right skewed, then the mean > median (generally)
92
Applet: Activity 1C Mean vs. Median on publisher’s website
Demo in class for a few minutes.
93
Measures of Central Tendency: Symbols
Median: usually no symbol; just referred to as ‘median’; can be referred to as Q2. Mean: Sample: x̄; Population: µ
94
Are these the same distributions?
[Two dot plots, each with an x-axis from 3 to 7] Mean and median are the same… but clearly different distributions. Need another characteristic to analyze distributions… spread
95
Measures of Spread in a Distribution
Quartiles, interquartile range (IQR), range (usually used in box plots; usually used when there are outliers) Standard deviation (usually used in histograms; usually used when there are no outliers)
96
Highway Mileage Highway gas mileage of 20 gasoline-powered two-seater cars are: Enter into list, do 1-var stats 13 16 15 17 19 23 32 29 20 22 25 24 28 26
97
Go over software output; what all/most mean (we don’t know all yet, i.e., SE)
98
Caution… Quartiles can be calculated slightly differently depending on the text, calculator, etc. AP readers won’t penalize you either way
99
Five-Number Summary & Boxplots
Minimum, Q1, Median (Q2), Q3, Maximum Used to create, interpret boxplots IQR (Inter-Quartile Range): Q3 – Q1; measure of spread, variability for boxplots Range
100
Five-Number Summary & Boxplots
Create a boxplot from highway mileage data input into lists Two types of boxplots in calculator Use trace to determine 5-number summary (or 1-var stats) Calculate the IQR Label and scale always
101
Boxplots & Outliers To check for outliers with a box plot:
Compute (Q3 – Q1) × 1.5, i.e., IQR × 1.5. Add this product to Q3 and subtract this product from Q1. Are there any values in the distribution above the upper cutoff or below the lower cutoff? If so, they are suspected outliers.
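The 1.5 × IQR rule can be sketched in Python as below. Two assumptions to flag: the quartiles are computed as the medians of the lower and upper halves (leaving out the middle value when the count is odd), which is one common convention and may differ slightly from other texts or software; and the data used is the jeans sample from earlier, since it has an obvious extreme value. The function names are made up for illustration.

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (needs at least 2 values)."""
    data = sorted(values)
    half = len(data) // 2          # middle value is left out when the count is odd
    return median(data[:half]), median(data[-half:])

def suspected_outliers(values):
    """Return (outliers, lower cutoff, upper cutoff) using the 1.5 x IQR rule."""
    q1, q3 = quartiles(values)
    step = 1.5 * (q3 - q1)
    low, high = q1 - step, q3 + step
    return [x for x in values if x < low or x > high], low, high

jeans = [19, 21, 24, 17, 27, 31, 479]
outliers, low, high = suspected_outliers(jeans)
print("Q1, Q3:", quartiles(jeans))
print("Cutoffs:", low, high)
print("Suspected outliers:", outliers)
```

Here Q1 = 19 and Q3 = 31, so the IQR is 12 and the cutoffs are 1 and 49; the $479 pair is the only suspected outlier.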
102
Outlier(s) for highway mileage?
Do calculations to determine if there are any outliers in this distribution
103
Technology Toolbox We will go through all calculator procedures.
If you need additional help, tech toolbox is helpful
104
Boxplots Misc. Box plots are the only graphical representation to formally define what an outlier is and how to calculate it Side-by-side box plots can be created as well. These are helpful in comparing two or more distributions.
105
Side-by-Side Boxplots
How do the ranges compare? How do the centers compare? Which set of data has the most variability? Any outliers? How do the shapes compare?
106
# of Pairs of Shoes Data…Male vs. Female
Create one box plot for female and another box plot for male. Compare and contrast (be sure to connect the two distributions by using connecting phrases like ‘is greater than,’ ‘is smaller,’ ‘is similar,’ etc.).
107
Measures of Spread in a Distribution
Discussed IQR (and range); used with boxplots; IQR is based on median; generally used with data that has outliers (skewed) Now Standard Deviation; generally used with histograms and density curves; based on mean; usually used with distributions that are fairly symmetric Remember SOCS… this is the last ‘S’
108
Standard Deviation
109
Variance
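For reference, the usual textbook definitions, written for a sample of n observations and a population of N values. The notation is an assumption chosen to match the symbols used elsewhere in these slides (x̄ and s for a sample, µ and σ for a population).

```latex
% Sample variance and sample standard deviation
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad
s = \sqrt{s^2}

% Population variance and population standard deviation
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2,
\qquad
\sigma = \sqrt{\sigma^2}
```

The only difference in form is the divisor: n − 1 for a sample, N for a population.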
110
Symbols s versus σ … what’s the difference? s: used for sample standard deviation σ: used for population standard deviation
111
Standard Deviation Use highway mileage data in calculator
Calculate SD of the distribution Calculator doesn’t know difference between s and σ Usually use s (not σ) when looking at 1-var stats in calculator
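To see the difference between s and σ on the same list of numbers, here is a short Python sketch. The five values are hypothetical, chosen only so the arithmetic is easy to follow; `stdev` and `pstdev` are the sample and population versions in Python's standard `statistics` module.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

values = [3, 4, 5, 6, 7]  # hypothetical data for illustration only

x_bar = mean(values)
sum_sq_dev = sum((x - x_bar) ** 2 for x in values)  # squared distances from the mean

s_by_hand = sqrt(sum_sq_dev / (len(values) - 1))    # sample: divide by n - 1
sigma_by_hand = sqrt(sum_sq_dev / len(values))      # population: divide by n

print(f"s     = {s_by_hand:.4f} (library: {stdev(values):.4f})")
print(f"sigma = {sigma_by_hand:.4f} (library: {pstdev(values):.4f})")
```

For the same data, s is always a little larger than σ because it divides by the smaller number n − 1.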
112
Properties of the Standard Deviation
s measures spread about the mean and should be used only when the mean is chosen as the measure of center. s = 0 only when there is no spread/variability. This happens only when all observations have the same value. Otherwise, s > 0. As the observations become more spread out about their mean, s gets larger. s, like the mean x̄, is not resistant. A few outliers can make s very large.
113
Standard Deviation Contest!
Form groups of no more than 4 Your group must choose four numbers from the whole numbers 0 to 10, with repeats allowed Choose four numbers that have the smallest possible standard deviation Choose four numbers that have the largest possible standard deviation You have 3 minutes; go! Is more than one choice possible in either of the two questions above? Why?
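For the debrief after the contest, a brute-force check is possible because there are only 1,001 ways to choose four whole numbers from 0 to 10 with repeats (ignoring order). A minimal sketch using the sample standard deviation; the variable names are made up for illustration.

```python
from itertools import combinations_with_replacement
from statistics import stdev

# Every unordered choice of four whole numbers from 0 to 10, repeats allowed.
choices = list(combinations_with_replacement(range(11), 4))
sds = {choice: stdev(choice) for choice in choices}

smallest = min(sds.values())
largest = max(sds.values())

min_sets = [c for c, sd in sds.items() if sd == smallest]
max_sets = [c for c, sd in sds.items() if abs(sd - largest) < 1e-12]

print("Smallest s:", smallest, "from", len(min_sets), "different choices")
print("Largest s:", round(largest, 4), "from", max_sets)
```

This answers the last question on the slide: many choices tie for the smallest standard deviation (any four identical numbers give s = 0), but only one choice gives the largest.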
114
Caution on Standard Deviation
s, standard deviation, does not give much helpful information about distributions with an extreme outlier(s) For example… 1, 2, 2, 2, 2, 2, 3, 3, 4, 9999 Calculate the standard deviation for the above distribution
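A quick way to check the calculation, and to see how much the single extreme value matters, is to compute s with and without it. A small Python sketch using the ten values from the slide:

```python
from statistics import mean, stdev

data = [1, 2, 2, 2, 2, 2, 3, 3, 4, 9999]
without_extreme = [x for x in data if x != 9999]

s_with = stdev(data)
s_without = stdev(without_extreme)

print(f"Mean with 9999: {mean(data)}, s = {s_with:.2f}")
print(f"Without 9999:   s = {s_without:.3f}")
```

With the extreme value, s is in the thousands even though nine of the ten observations are between 1 and 4; without it, s is below 1. In a case like this s says very little about the typical spread.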
115
The Variance s² = variance of a sample; σ² = variance of a population
So √variance = standard deviation
116
Histogram Activity Work together in groups
10 minutes & we will debrief This is a matching activity; gives 6 different histograms and 6 sets of means and medians. Students must match up. Copy these worksheets ahead of time.
117
Choosing Measures of Center & Spread
Minimum, Q1, Median, Q3, Maximum (five-number summary) and boxplot usually preferred for describing a skewed distribution or one with strong outliers. x̄ (mean) and s (standard deviation) and histogram usually preferred for reasonably symmetric distributions that are free of outliers. ‘General’ guidelines…
118
Caution Graphs give the best overall picture of a distribution
Numerical analysis (center and spread) reports specific facts about a distribution, but doesn’t describe the entire shape; it doesn’t disclose multiple modes, gaps, etc. Always plot your data. Always look at numerical analysis as well.
119
Choose a data set you have…
Holding our breath, etc. Create a histogram or box plot (practice). Are there outliers? Do 1-var stats. Describe using SOCS. Etc.
120
Homework… Note: always use technology (when you can) to graph and/or calculate stats (like 1-var stats); don’t calculate by hand Page 69 #79, 81, 83, 85, 87, 89, 91, 95, , 105 MC Page 72, #
121
Section Quiz 1-3
122
Chapter 1 Review Publisher’s website
FRQ practice (in class & at home); Chapter 1 reference sheet (with your data sets); these are not graded/collected Extra Practice (not graded/collected… just more practice/review for you): Chapter Review (page 74); Chapter Review Exercises (page 76); Chapter 1 AP Statistics Practice Test (page 78); Case Closed (page 67); FRAPPY! (page 74); many links on my website!