Chapter 3: Displaying and Describing Categorical Data
Math2200
Categorical variable
- A categorical variable takes only a finite number of possible values (categories)
- Examples: gender, car size, course grade
Titanic
- WHO: People on the Titanic
- WHAT: Survival status, age, sex, ticket class
- WHY: Historical interest
- WHEN: April 14, 1912
- WHERE: North Atlantic
- HOW: A variety of sources and Internet sites

Survived  Age    Sex     Class
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Dead      Adult  Male    Crew
Dead      Adult  Male    Crew
Alive     Adult  Female  First
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Three rules of data analysis
1. Make a picture: a picture can reveal the patterns and relationships hidden in your data
2. Make a picture: a picture can show extraordinary data values or unexpected patterns
3. Make a picture: a picture is easy to understand
Florence Nightingale
- Founder of modern nursing
- First female member of the British Statistical Society
- Used a picture to argue forcefully for better hospital conditions for soldiers
Frequency tables
- Making piles: count the number of cases in each category and pile them up
- People on the Titanic, by ticket class:

Class   Count
First   325
Second  285
Third   706
Crew    885
Relative frequency table
- Proportion: divide each count by the total number of cases
- Percentage: multiply the proportion by 100
- A frequency table or relative frequency table describes the distribution of a categorical variable

Class   Percentage
First   14.77%
Second  12.95%
Third   32.08%
Crew    40.21%
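The step from counts to percentages can be sketched in a few lines of Python. This is only an illustration: the counts are the ones in the frequency table, while the variable names are made up for this example.

```python
# Counts by ticket class, taken from the frequency table.
counts = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}

total = sum(counts.values())  # total number of cases (2201 people on board)

# Proportion = count / total; percentage = proportion * 100.
proportions = {cls: n / total for cls, n in counts.items()}
percentages = {cls: round(100 * p, 2) for cls, p in proportions.items()}

for cls in counts:
    print(f"{cls:<7}{counts[cls]:>5}{percentages[cls]:>8.2f}%")
```

The proportions always sum to 1 (the percentages to 100%, up to rounding), which is a quick check that no category was left out.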
What is your feeling about the proportion of crew members on board?
Why is the picture misleading?
- The length of each ship corresponds to the number of people in each category
- Our eyes tend to be more impressed by the area than by other aspects of the image
- A ship that is about 3 times as long as another covers about 9 times the area, and that is misleading
The area principle
- The area occupied by a part of the graph should correspond to the magnitude of the value it represents
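The "3 times longer, 9 times larger" effect is plain arithmetic, and a rough sketch makes it concrete. It assumes that each ship picture is scaled equally in width and height, and it compares crew with second class purely as an illustration; the function name is hypothetical.

```python
def area_ratio(length_ratio: float) -> float:
    """Area ratio of two pictures when one is scaled by length_ratio
    in both width and height (assumption: the shape stays the same)."""
    return length_ratio ** 2

crew, second = 885, 285          # counts from the frequency table
length_ratio = crew / second     # how much longer the crew ship is drawn

print(f"length ratio: {length_ratio:.2f}")
print(f"area ratio:   {area_ratio(length_ratio):.2f}")
```

A bar chart avoids the problem because every bar has the same width, so area grows in step with the count.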
Bar chart
- Displays the counts of a categorical variable as bars
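As a rough, text-only stand-in for a plotted bar chart, the sketch below draws one bar per ticket class. The helper `text_bar_chart` and its `width` parameter are invented for this illustration; a real chart would normally come from a plotting library.

```python
# Counts by ticket class, taken from the frequency table.
counts = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}

def text_bar_chart(counts, width=40):
    """Return one line per category; bar length is proportional to the count.

    All bars have the same height (one line), so the area of each bar is
    proportional to its count, as the area principle requires.
    """
    largest = max(counts.values())
    lines = []
    for label, n in counts.items():
        bar = "#" * round(width * n / largest)
        lines.append(f"{label:<7}{bar} {n}")
    return lines

chart = text_bar_chart(counts)
print("\n".join(chart))
```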
Pie Charts
When you make a bar chart or pie chart, pay attention to the following:
- Make sure the variable is indeed categorical
- Make sure your data are counts or percentages of cases in categories
- Make sure that the categories do not overlap
Was there a relationship between the kind of ticket a passenger held and the passenger’s chances of making it into the lifeboat? What table should we make to answer this question?
Contingency table
- A two-way table
- The table shows how the subjects are distributed along each variable, contingent on the value of the other variable

        First  Second  Third  Crew  Total
Alive   203    118     178    212   711
Dead    122    167     528    673   1490
Total   325    285     706    885   2201
Add relative frequencies

                     First    Second   Third    Crew     Total
Alive  Count         203      118      178      212      711
       % of Row      28.55%   16.60%   25.04%   29.82%   100.00%
       % of Column   62.46%   41.40%   25.21%   23.95%   32.30%
       % of Table    9.22%    5.36%    8.09%    9.63%    32.30%
Dead   Count         122      167      528      673      1490
       % of Row      8.19%    11.21%   35.44%   45.17%   100.00%
       % of Column   37.54%   58.60%   74.79%   76.05%   67.70%
       % of Table    5.54%    7.59%    23.99%   30.58%   67.70%
Total  Count         325      285      706      885      2201
       % of Row      14.77%   12.95%   32.08%   40.21%   100.00%
       % of Column   100.00%  100.00%  100.00%  100.00%  100.00%
       % of Table    14.77%   12.95%   32.08%   40.21%   100.00%
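The three kinds of percentage differ only in the denominator, which a short sketch can make explicit. The per-class survivor and death counts used here are recovered from the class totals and the column percentages in the slides, so treat them as an assumption; the helper `pct` and the variable names are illustrative only.

```python
classes = ["First", "Second", "Third", "Crew"]

# Cell counts (assumed: recovered from class totals and column percentages).
counts = {
    "Alive": [203, 118, 178, 212],
    "Dead":  [122, 167, 528, 673],
}

row_totals = {status: sum(row) for status, row in counts.items()}
col_totals = [sum(col) for col in zip(*counts.values())]
grand_total = sum(row_totals.values())

def pct(part, whole):
    """Percentage of `whole` made up by `part`, rounded to two decimals."""
    return round(100 * part / whole, 2)

# % of Row: denominator is the row total (all survivors, or all deaths).
row_pct = {s: [pct(n, row_totals[s]) for n in row] for s, row in counts.items()}
# % of Column: denominator is the column total (everyone in that class).
col_pct = {s: [pct(n, t) for n, t in zip(row, col_totals)]
           for s, row in counts.items()}
# % of Table: denominator is everyone on board.
table_pct = {s: [pct(n, grand_total) for n in row] for s, row in counts.items()}

for status in counts:
    print(status, "row %:   ", row_pct[status])
    print(status, "column %:", col_pct[status])
    print(status, "table %: ", table_pct[status])
```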
Percent of what?
- What percent of the survivors were in second class? The Who is the survivors: 118/711 = 16.60%
- What percent were second-class passengers who survived? The Who is everyone on board, i.e., 2201 is the denominator: 118/2201 = 5.36%
- What percent of the second-class passengers survived? The Who is the second-class passengers: 118/285 = 41.40%
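The three questions share a numerator and differ only in the denominator. A minimal sketch, using the four counts quoted above (the variable names are made up for the example):

```python
second_class_survivors = 118   # numerator in all three questions
survivors = 711                # everyone who survived
on_board = 2201                # everyone on the ship
second_class = 285             # everyone holding a second-class ticket

# "What percent of the survivors were in second class?"
pct_of_survivors = 100 * second_class_survivors / survivors

# "What percent were second-class passengers who survived?"
pct_of_everyone = 100 * second_class_survivors / on_board

# "What percent of the second-class passengers survived?"
pct_of_second_class = 100 * second_class_survivors / second_class

print(f"{pct_of_survivors:.2f}%  {pct_of_everyone:.2f}%  {pct_of_second_class:.2f}%")
```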
A simplified table (% of Table only)

        First    Second   Third    Crew     Total
Alive   9.22%    5.36%    8.09%    9.63%    32.30%
Dead    5.54%    7.59%    23.99%   30.58%   67.70%
Total   14.77%   12.95%   32.08%   40.21%   100.00%
Marginal distribution
- The frequency distribution of one of the variables, found in the margins of a contingency table (the Total row or the Total column), is called its marginal distribution
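A marginal distribution can be read off by summing the cells along one direction of the table and dividing by the grand total. The sketch below assumes the Titanic cell counts recovered from the class totals and column percentages in the slides; the names are illustrative.

```python
classes = ["First", "Second", "Third", "Crew"]

# Cell counts (assumed: recovered from class totals and column percentages).
counts = {
    "Alive": [203, 118, 178, 212],
    "Dead":  [122, 167, 528, 673],
}

grand_total = sum(sum(row) for row in counts.values())

# Marginal distribution of survival: row totals divided by the grand total.
survival_marginal = {status: sum(row) / grand_total
                     for status, row in counts.items()}

# Marginal distribution of ticket class: column totals over the grand total.
class_marginal = {cls: sum(col) / grand_total
                  for cls, col in zip(classes, zip(*counts.values()))}

print({k: f"{v:.2%}" for k, v in survival_marginal.items()})
print({k: f"{v:.2%}" for k, v in class_marginal.items()})
```

Each marginal distribution ignores the other variable entirely, which is why it matches the one-variable relative frequency table for ticket class.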
Conditional distribution 1: ticket class, conditional on survival status (% of Row)

        First    Second   Third    Crew     Total
Alive   28.55%   16.60%   25.04%   29.82%   100.00%
Dead    8.19%    11.21%   35.44%   45.17%   100.00%
Pie chart for conditional distributions of ticket Class for survivors and non-survivors
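Conditional distribution 2: survival status, conditional on ticket class (% of Column)

                     First    Second   Third    Crew     Total
Alive  Count         203      118      178      212      711
       % of Column   62.46%   41.40%   25.21%   23.95%   32.30%
Dead   Count         122      167      528      673      1490
       % of Column   37.54%   58.60%   74.79%   76.05%   67.70%
Total  Count         325      285      706      885      2201
       % of Column   100.00%  100.00%  100.00%  100.00%  100.00%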
Bar chart for conditional distributions of Ticket Class
Segmented Bar Chart
What can go wrong?
- Do not violate the area principle
[Figure: an incorrect display and a correct display of the same data]
What can go wrong?
- Keep it honest
  - Pay attention to labels
  - Check whether the percentages add up to 100%
- Do not confuse similar-sounding percentages
  - The percentage of passengers who were both in first class and survived
  - The percentage of the first-class passengers who survived
  - The percentage of the survivors who were in first class
What can go wrong?
- Do not forget to look at the variables separately, too: look at both conditional and marginal distributions
- Be sure to use enough individuals
- Do not overstate your case
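What can go wrong?
- Be careful when averaging proportions across several different groups
- Simpson's Paradox: Jill has the better on-time record both by day and by night, yet Moe looks better overall (the calculation in the last column makes no sense, because it pools day and night flights that the two pilots flew in very different numbers)

On-time record for two pilots

        Day            Night          Overall
Moe     90/100 = 90%   10/20 = 50%    100/120 = 83%
Jill    19/20 = 95%    75/100 = 75%   94/120 = 78%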
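The reversal in the pilots' table can be checked directly. The sketch below uses the on-time counts from that table and exact fractions to avoid rounding; the helper names `rate` and `overall` are made up for the illustration.

```python
from fractions import Fraction

# (on-time flights, total flights) for each pilot and time of day.
records = {
    "Moe":  {"Day": (90, 100), "Night": (10, 20)},
    "Jill": {"Day": (19, 20),  "Night": (75, 100)},
}

def rate(on_time, flights):
    """On-time proportion as an exact fraction."""
    return Fraction(on_time, flights)

def overall(pilot):
    """Pooled on-time proportion: total on-time flights over total flights."""
    on_time = sum(n for n, _ in records[pilot].values())
    flights = sum(t for _, t in records[pilot].values())
    return rate(on_time, flights)

for period in ("Day", "Night"):
    moe = rate(*records["Moe"][period])
    jill = rate(*records["Jill"][period])
    print(f"{period:<8}Moe {float(moe):.0%}  Jill {float(jill):.0%}")

print(f"Overall Moe {float(overall('Moe')):.0%}  Jill {float(overall('Jill')):.0%}")
```

Moe flew mostly by day, when on-time rates are high for both pilots, while Jill flew mostly at night. The pooled figure mixes these unequal groups, which is why it reverses the within-group comparison.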
Summary of Chapter 3
- Bar charts and pie charts are displays for categorical variables.
- A contingency table shows how cases are distributed along each variable, conditional on the other variable.
- The row and column sums of the table percentages in a contingency table give the marginal distributions.
- The row and column percentages in a contingency table show the conditional distributions.
- Contingency tables help to show the relationship between two categorical variables.