Presentation on theme: "Displaying and Describing"— Presentation transcript:
1Displaying and Describing Categorical DataChapter 3
2Objectives Frequency Table Relative Frequency Table Distribution Area PrincipleBar Chart Pie ChartContingency TableMarginal DistributionConditional DistributionIndependenceSegmented Bar ChartSimpson’s Paradox
3The Three Rules of Data Analysis The three rules of data analysis won’t be difficult to remember:Make a picture—things may be revealed that are not obvious in the raw data. These will be things to think about.Make a picture—important features of and patterns in the data will show up. You may also see things that you did not expect.Make a picture—the best way to tell others about your data is with a well-chosen picture.
4Frequency TableWhat is a frequency Table? A frequency table is an organization of raw data in tabular form, using classes (or intervals) and frequencies.What is a frequency count? The frequency or the frequency count for a data value is the number of times the value occurs in the data set.
5Categorical Frequency Tables NOTE: Later we will consider qualitative frequency tables.What is a categorical frequency table? A categorical frequency table represents data that can be placed in specific categories, such as gender, hair color, political affiliation etc.
6Frequency Table A frequency table is a tabular summary of data showing the frequency (or number) of itemsin each of several non-overlapping categories.The objective is to provide insights about thedata that cannot be quickly obtained by lookingonly at the original data.
7Frequency Tables:We can “organize” the data by counting the number of data values in each category of interest.We can organize these counts into a frequency table, which records the totals and the category names.
8Categorical Frequency Table Example: The blood types of 25 blood donors are given below. Summarize the data using a frequency distribution.AB B A O BO B O A OB O B B BA O AB AB OA B AB O A
9Categorical Frequency Table for the Blood Types Note: The classes for the distribution are the blood types.
10Your TurnGuests staying at Marada Inn were asked to rate the quality of their accommodations as being excellent (E),above average (AA), average (A), below average (BA), or poor (P). The ratings provided by a sample of 20 guests:BA, AA, AA, A, AA, A, AA, A, AA, BA, P, E, AA, A, AA, AA, BA, P, AA, A .Make a frequency table.
11Categorical Frequency Distribution RatingCountsPoor (P)2Below Average (BA)3Average (A)5Above Average (AA)9Excellent (E)1Total20
12Relative Frequency Tables: A relative frequency table is similar, but gives the relative frequency, a decimal or percentage (instead of counts) for each category.
13Relative Frequency Table The relative frequency of a class is the fraction orproportion of the total number of data itemsbelonging to the class.A relative frequency table is a tabular summary ofa set of data showing the relative frequency foreach class.
14Percent Frequency Table The percent frequency of a class is the relativefrequency multiplied by 100.A percent frequency table is a tabularsummary of a set of data showing the percentfrequency for each class.
15Relative & Percent Categorical Frequency Table Using the frequency table below, from the Marada Inn problem, create a relative and percent frequency table.Add two additional columns labeled relative frequency and percent frequency.RatingCountsPoor (P)2Below Average (BA)3Average (A)5Above Average (AA)9Excellent (E)1Total20
17Frequency Tables:All three types of tables show how cases are distributed across the categories.They describe the distribution of a categorical variable because they name the possible categories and tell how frequently each occurs.
18There are three kinds of lies: lies, damned lies, and statistics. Benjamin Disraeli ( )
19Misleading Statistics Now that we have the frequency table, we are ready to make a picture or a graph of the data.Misleading graphsScalePictographs
20Misleading Statistics - Scale Adjusting the scale of a graph is a common way to mislead (or lie) with statistics.Example:
22Misleading Statistics The best data displays observe a fundamental principle of graphing data called the area principle.The area principle says that the area occupied by a part of the graph should correspond to the magnitude of the value it represents.Violations of the area principle are another common way of misleading with statistics.
23What’s Wrong With This Picture? You might think thata good way to showthe Titanic data iswith this display:
24What’s Wrong With This Picture? The ship display makes it look like most of the people on the Titanic were crew members, with a few passengers along for the ride.When we look at each ship, we see the area taken up by the ship, instead of the length of the ship.The ship display violates the area principle:The area occupied by a part of the graph should correspond to the magnitude of the value it represents.
25Area Principle - Pictographs Double the length, width, and height of a cube, and the volume increases by a factor of eight
28Ways to Graph Categorical Data Because the variable is categorical, the data in the graph can be ordered any way we want (alphabetical, by increasing value, by year, by personal preference, etc.).Bar Charts – Each category is represented by a bar.Pie Charts - The slices must represent the parts of one whole.
29Bar ChartsA bar chart displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison.A bar chart stays true to the area principle.Thus, a better display for the ship data is:
30Bar Charts (cont.)A relative frequency bar chart displays the relative proportion of counts for each category.A relative frequency bar chart also stays true to the area principle.Replacing counts with percentages in the ship data:
31Bar Charts A bar chart is a graphical device for depicting qualitative data.On one axis (usually the horizontal axis), we specifythe labels that are used for each of the categories.A frequency, relative frequency, or percent frequencyscale can be used for the other axis (usually thevertical axis).Using a bar of fixed width drawn above each classlabel, we extend the height appropriately.The bars are separated to emphasize the fact that eachclass is a separate category.
32Bar ChartsEither counts (frequency bar chart) or proportions (relative frequency bar chart) may be shown on the y-axis. This will not change the shape or relationships of the graph.Make sure all graphs have a descriptive title and that the axes are labeled (this is true for all graphs in AP Stats).
33Pie ChartsWhen you are interested in parts of the whole, a pie chart might be your display of choice.Pie charts show the whole group of cases as a circle.They slice the circle into pieces whose size is proportional to the fraction of the whole in each category.
34Pie Chart The pie chart is a commonly used graphical device for presenting relative frequency distributions forqualitative data.First draw a circle; then subdivide the circleinto sectors that correspond in area to therelative frequency for each category.Since there are 360 degrees in a circle,a category with a relative frequency of .25 wouldconsume .25(360) = 90 degrees of the circle.
35Relations between Two Categorical Variables Examples:Is gender or race related to political preference?What type of music can make people relax?Will different packaging of the same product attract people with different social-economic background?A contingency table or two-way table is a way to display the data from two categorical variables. A sort of Venn Diagram which shows how a population splits according to two factors.
36Contingency TablesA contingency table allows us to look at two categorical variables together.It shows how individuals are distributed along each variable, contingent on the value of the other variable.Example: we can examine the class of ticket and whether a person survived the Titanic:
37Contingency Tables (cont.) The margins of the table, both on the right and on the bottom, give totals and the frequency distributions for each of the variables.Each frequency distribution is called a marginal distribution of its respective variable.The marginal distribution of Survival is:
38Contingency Tables (cont.) Each cell of the table gives the count for a combination of values of the two values.For example, the second cell in the crew column tells us that 673 crew members died when the Titanic sunk.
39Conditional Distributions A conditional distribution shows the distribution of one variable for just the individuals who satisfy some condition on another variable.The following is the conditional distribution of ticket Class, conditional on having survived:
40Conditional Distributions (cont.) The following is the conditional distribution of ticket Class, conditional on having perished:
41Conditional Distributions (cont.) The conditional distributions tell us that there is a difference in class for those who survived and those who perished.This is better shown with pie charts of the two distributions:
42Conditional Distributions (cont.) We see that the distribution of Class for the survivors is different from that of the nonsurvivors.This leads us to believe that Class and Survival are associated, that they are not independent.The variables would be considered independent when the distribution of one variable in a contingency table is the same for all categories of the other variable.
43Segmented Bar ChartsA segmented bar chart displays the same information as a pie chart, but in the form of bars instead of circles.Each bar is treated as the “whole” and is divided proportionally into segments corresponding o the percentage in each group.Here is the segmented bar chart for ticket Class by Survival status:
44Example: Income level vs. Job Satisfaction Income Job Satisfaction Row Total< 30K30K-50K50K-80K> 80KC. TotalConditionaldistributionMarginaldistributionTabletotalThis is a Contingency table with Income Level as the Row Variable and Job Satisfaction as the Column Variable.The distributions of income to job satisfaction or job satisfaction to income are called Conditional Distributions.The distributions of income alone and job satisfaction alone are called Marginal Distributions.Relationships between categorical variables are described by calculating appropriate percents from the counts given in each cell.
45Example:A Statistics class reports the following data on sex and eye color for students in the class:Eye ColorSexBlueBrownGreen/Hazel/OtherTotalMales62032Females4161210361864What percent of females are brown-eyed?What percent of brown-eyed students are female?What percent of students are brown-eyed females?What’s the distribution of eye color?What’s the conditional distribution of eye color for the males?Compare the percent who are female among the blue-eyed students to the percent of all students who are female?Does it seem that eye color and sex are independent? Explain.
46Solution: Eye Color Sex Blue Brown Green/Hazel/Other Total Males 6 20 32Females4161210361864What percent of females are brown-eyed? 16/32 = .5 or 50%What percent of brown-eyed students are female? 16/36 = .444 or 44.4%What percent of students are brown-eyed females? 16/64 = .25 or 25%What’s the distribution of eye color? 10/64 = .156 or 15.6% Blue, /64 = .563 or 56.3% Brown, 18/64 =.281 or 28.1% Green/Hazel/OtherWhat’s the conditional distribution of eye color for the males? /32 = .188 or 18.8% Blue, 20/32 = .625 or 62.5% Brown, /32 = .188 or 18.8% Green/Hazel/OtherCompare the percent who are female among the blue-eyed students to the percent of all students who are female? 4/10 = .4 or 40% of the blue-eyed students are female, while 32/64 = .5 or 50% of all students are female.Does it seem that eye color and sex are independent? Explain. Since blue-eyed students appear less likely to be female, it seems that Sex and Eye Color may not be independent. (But the numbers are small.)
48Simpson’s Paradox Discovered by E. H. Simpson in 1951. Occurs when averaging different samples of different sizesTwo groups from one sample are compared to two similar groups from another sampleOne sample’s success rate for both groups is higher than the success rates for the other sample Not E. H. Simpson
49Simpson’s ParadoxHowever, when both groups’ respective success rates are combined, the sample with the lower success rate ends up with the better overall proportion of successes. Thus, the paradox.One sample usually has a considerably smaller number of members than the other groupsSimpson’s Paradox does not occur in populations with similar amounts
50What is Simpson’s Paradox? Simpson’s Paradox occurs when an association between two variables is reversed upon observing a third variable.
51Simpson’s ParadoxSimpson’s paradox lurking variable creates a reversal in the direction of an association (“confounding”)To uncover Simpson’s Paradox, divide data into subgroups based on the lurking variable
52Recent Cleveland Indians season records 2003—68-94, 42.0% winning percentage2004—80-82, 49.4% winning percentageTwo-season record: , 45.7% win percentageRecent Minnesota Twins season records2003—90-72, 55.6% win percentage2004—92-70, 56.8% win percentageTwo-season record: , 56.2% win percentageNotice that the Twins had a higher percentage in both 2003 and 2004, as well as in the two-year period. Not Simpson’s Paradox.
53Recent Cleveland Indians season records 2003—68-94, 42.0% winning percentage2004—80-82, 49.4% winning percentageTwo-season record: , 45.7% win percentageRecent Minnesota Twins season records2003—90-72, 55.6% win percentage2004—92-70, 56.8% win percentageTwo-season record: , 56.2% win percentageNotice that the Twins had a higher percentage in both 2003 and 2004, as well as in the two-year period. Not Simpson’s Paradox.
54This is Simpson’s Paradox. Simpson’s Paradox at workRonnie Belliard2002—61/289, .211 of his at-bats were hits2003—124/447, .277 of his at-bats were hitsTwo-season average: 185/736, hits of the timeCasey Blake2002—4/20, .200 of his at-bats were hits2003—143/557, .257 of his at-bats were hitsTwo-season average: 147/577, hits of the timeThe two season batting avg. for Belliard was lower than Blake’s, but dividedinto separate seasons, Belliard’s had a higher batting avg. both seasons.This is Simpson’s Paradox.
55Discrimination? (Simpson’s Paradox) Consider college acceptance rates by sexAcceptedNot acceptedTotalMen198162360Women88112200286274560198 of 360 (55%) of men accepted 88 of 200 (44%) of women accepted Is there a sex bias?
56Discrimination? (Simpson’s Paradox) Or is there a lurking variable that explains the association?To evaluate this, split applications according to the lurking variable "major applied to”Business School (240 applicants)Art School (320 applicants)
57Discrimination? (Simpson’s Paradox) BUSINESS SCHOOLAcceptedNot acceptedTotalMen18102120Women24964219824018 of 120 men (15%) of men were accepted to B-school 24 of 120 (20%) of women were accepted to B-school A higher percentage of women were accepted
58Discrimination (Simpson’s Paradox) ART SCHOOLAcceptedNot acceptedTotalMen18060240Women64168024476320180 of 240 men (75%) of men were accepted 64 of 80 (80%) of women were accepted A higher percentage of women were accepted.
59Discrimination? (Simpson’s Paradox) Within each school, a higher percentage of women were accepted than men.No discrimination against womenPossible discrimination against menThis is an example of Simpson’s Paradox.When the lurking variable (School applied to) was ignored, the data suggest discrimination against women.When the School applied to was considered, the association is reversed.
60Colin R. Blyth’s example of Simpson’s Paradox A doctor was planning to try a new treatment on patients mostly local (C) and a few in Chicago (C’). A statistician advised him to use a table of random numbers and as each C patient became available, assign him to the new treatment with probability .91, leave him to the standard treatment with probability .09; and the same for C’ patient with probability .01 and .99 respectively. When the doctor returned with the data the statistician told him that the new treatment was obviously a very bad one, and criticized him for having continued trying it on so many patients.TreatmentStandardNewDead59509005Alive50501095(46%)(11%)
61The doctor replied that he continued because the new treatment was obviously a very good one, having nearly doubled the recovery rate in both cities.C patient onlyC’ patient onlyTreatmentStandardNewDead950900050005Alive501000955%10%50%95%
63Smokers’ ExampleIn England a study was conducted to examine the survival rates of smokers and non-smokers. The result implied a significant positive correlation between smoking & survival rates because only 24% of smokers died as compared to 31% of non-smokers. When the data were broken down by age group in a contingency table, it was found that there were more older people in the non-smoker group. Thus age played a very significant role in the outcome but since it was overlooked the researchers were left with deceiving results. (Appleton & French, 1996).
65What’s true for the parts isn’t true for the whole. The ParadoxWhat’s true for the parts isn’t true for the whole.
66CONCLUSION!!!!Simpson’s paradox is a rare phenomenon! It does not occur often! Thus statisticians must be trained academically & ethically well enough to make sure that if it has occurred they will detect and correct it. This is where practice, critical thinking skills, and repetition come into play!
67What Can Go Wrong? Don’t violate the area principle. While some people might like the pie chart on the left better, it is harder to compare fractions of the whole, which a well-done pie chart does.
68What Can Go Wrong? (cont.) Keep it honest—make sure your display shows what it says it shows.This plot of the percentage of high-school students who engage in specified dangerous behaviors has a problem. Can you see it?
69What Can Go Wrong? (cont.) Don’t confuse similar-sounding percentages—pay particular attention to the wording of the context.Don’t forget to look at the variables separately too—examine the marginal distributions, since it is important to know how many cases are in each category.
70What Can Go Wrong? (cont.) Be sure to use enough individuals!Do not make a report like “We found that 66.67% of the rats improved their performance with training. The other rat died.”
71What Can Go Wrong? (cont.) Don’t overstate your case—don’t claim something you can’t.Don’t use unfair or silly averages—this could lead to Simpson’s Paradox, so be careful when you average one variable across different levels of a second variable.
72What have we learned?We can summarize categorical data by counting the number of cases in each category (expressing these as counts or percents).We can display the distribution in a bar chart or pie chart.And, we can examine two-way tables called contingency tables, examining marginal and/or conditional distributions of the variables.