The Practice of Statistics in the Life Sciences Fourth Edition

Chapter 5: Two-way tables
Copyright © 2018 W. H. Freeman and Company

Objectives
Two-way tables
Marginal distributions
Conditional distributions
Simpson’s paradox

Two-way tables (1 of 2)
Two-way tables summarize data about two categorical variables (or factors) collected on the same set of individuals. Each factor can have any number of levels. If the row factor has r levels and the column factor has c levels, we say that the two-way table is an “r by c” table.

High school students were asked whether they smoke and whether their parents smoke.

First factor: Parent smoking status
Both parents smoke
One parent smokes
Neither parent smokes

Two-way tables are so named because there are two ways to group the data: by the row variable or by the column variable.

Two-way tables (2 of 2)
Second factor: Student smoking status

                        Student smokes   Student does not smoke
Both parents smoke                 400                     1380
One parent smokes                  416                     1823
Neither parent smokes              188                     1168

Marginal distributions
We can examine each factor in a two-way table separately by studying the row totals and the column totals. They represent the marginal distributions, expressed in counts or percents. The name “marginal” refers to the fact that the row and column totals are written as if in a margin.

Computing marginal percents
Marginal percents are marginal counts divided by the table grand total.
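The calculation can be sketched in a few lines of Python, using the counts from the student and parent smoking table. The variable names (counts, row_marginal, column_marginal) are illustrative choices, not part of the original slides.

```python
# Counts from the student/parent smoking two-way table.
# Each row is (student smokes, student does not smoke).
counts = {
    "Both parents smoke": (400, 1380),
    "One parent smokes": (416, 1823),
    "Neither parent smokes": (188, 1168),
}
column_labels = ("Student smokes", "Student does not smoke")

# Grand total: every individual in the table.
grand_total = sum(sum(row) for row in counts.values())

# Marginal distribution of the row factor: row total / grand total.
row_marginal = {
    label: 100 * sum(row) / grand_total for label, row in counts.items()
}

# Marginal distribution of the column factor: column total / grand total.
column_totals = [sum(col) for col in zip(*counts.values())]
column_marginal = {
    label: 100 * total / grand_total
    for label, total in zip(column_labels, column_totals)
}

for label, pct in row_marginal.items():
    print(f"{label}: {pct:.1f}%")
for label, pct in column_marginal.items():
    print(f"{label}: {pct:.1f}%")
```

Each marginal distribution adds up to 100% because every individual falls in exactly one level of each factor.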

Graphs
The marginal distributions can be displayed on separate bar graphs, typically expressed as percents instead of raw counts. Each graph represents only one of the two variables, ignoring the second one. Each marginal distribution can also be shown in a pie chart.

Conditional distributions
A conditional distribution is the distribution of one factor for each level of the other factor. A conditional percent is computed using the counts within a single row or a single column. The denominator is the corresponding row or column total (rather than the table grand total).

                        Student smokes   Student does not smoke   Total
Both parents smoke                 400                     1380    1780
One parent smokes                  416                     1823    2239
Neither parent smokes              188                     1168    1356
Total                             1004                     4371    5375

Percent of students who smoke when both parents smoke = 400/1780 = 22.5%

To compute this conditional percent, we take the count of students who smoke within the “both parents smoke” row and divide it by the “both parents smoke” row total. In essence, we are computing the percent who smoke for the subgroup of students in this study whose parents both smoke.

Comparing conditional distributions (1 of 2)
Comparing conditional distributions helps us describe the “relationship” between the two categorical variables.

                        Student smokes   Student does not smoke   Total
Both parents smoke                 400                     1380    1780
One parent smokes                  416                     1823    2239
Neither parent smokes              188                     1168    1356
Total                             1004                     4371    5375

We can compare the percent of individuals in one level of factor 1 for each level of factor 2. Substantial differences suggest an association between factor 1 and factor 2.

Comparing conditional distributions (2 of 2)
Conditional distribution of student smokers for different parental smoking statuses:

Percent of students who smoke when both parents smoke = 400/1780 = 22.5%
Percent of students who smoke when one parent smokes = 416/2239 = 18.6%
Percent of students who smoke when neither parent smokes = 188/1356 = 13.9%

Notice that the percent of students who smoke is highest among those students who have two parents smoking, and lowest among those students whose parents do not smoke. This indicates an association between parental smoking and student smoking. Parental smoking may influence the student’s decision to smoke.
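The three conditional percents above all follow the same recipe: divide each count in a row by that row's total. A minimal Python sketch, with illustrative names (counts, conditional_percents, student_given_parent) that do not come from the slides:

```python
# Counts from the student/parent smoking two-way table.
# Each row is (student smokes, student does not smoke).
counts = {
    "Both parents smoke": (400, 1380),
    "One parent smokes": (416, 1823),
    "Neither parent smokes": (188, 1168),
}


def conditional_percents(row):
    """Return each count in the row as a percent of the row total."""
    total = sum(row)
    return [100 * count / total for count in row]


# Conditional distribution of student smoking status,
# given each level of parental smoking status.
student_given_parent = {
    label: conditional_percents(row) for label, row in counts.items()
}

for label, (smokes, does_not_smoke) in student_given_parent.items():
    print(f"{label}: {smokes:.1f}% smoke, {does_not_smoke:.1f}% do not")
```

Because the denominator is the row total, each row's percents add up to 100%, which is what makes the rows directly comparable even though the three groups differ in size.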

More graphs
The conditional distributions can be compared graphically by displaying the percents making up one factor, for each level of the other factor.

Bar graph: conditional distribution of student smoking status for different levels of parental smoking status.

Notice that only the percents of students who smoke are shown in the bar graph. That is because the percents who do not smoke are implied: if 22% of students who have two parents smoking smoke themselves, then the remaining 78% of these students do not smoke.

Conditional distribution graphs (1 of 2)
Conditional distributions of student smoking status for different levels of parental smoking status:

                        Percent who smoke   Percent who do not smoke   Row total
Both parents smoke                    22%                        78%        100%
One parent smokes                     19%                        81%        100%
Neither parent smokes                 14%                        86%        100%

The full set of percents can be displayed in a set of pie charts, one for each level of the condition (here parental smoking status).

Conditional distribution graphs (2 of 2)
Graph: conditional distribution of parental smoking status for different levels of student smoking status.

For each two-way table, there are two conditional distributions. Sometimes both are interesting, and sometimes only one conditional distribution truly interests us. In the smoking status study, the second conditional distribution (given student smoking status) is less interesting because we would expect parents to influence a student’s decision to smoke, but not vice versa.
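For reference, the second conditional distribution can be computed by dividing each count by its column total instead of its row total. The sketch below is illustrative only; the names (counts, column_totals, parent_given_student) are assumptions, not taken from the slides.

```python
# Counts from the student/parent smoking two-way table.
# Each row is (student smokes, student does not smoke).
counts = {
    "Both parents smoke": (400, 1380),
    "One parent smokes": (416, 1823),
    "Neither parent smokes": (188, 1168),
}
column_labels = ("Student smokes", "Student does not smoke")

# Column totals are the denominators when conditioning on student status.
column_totals = [sum(col) for col in zip(*counts.values())]

# Conditional distribution of parental smoking status,
# given each level of student smoking status.
parent_given_student = {
    col_label: {
        row_label: 100 * row[j] / column_totals[j]
        for row_label, row in counts.items()
    }
    for j, col_label in enumerate(column_labels)
}

for col_label, distribution in parent_given_student.items():
    print(col_label)
    for row_label, pct in distribution.items():
        print(f"  {row_label}: {pct:.1f}%")
```

The same cell count (for example, 400) yields a different percent depending on which total it is divided by, which is why it matters to state which factor is the condition.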

Gallup example
A 2013 Gallup survey investigated how phrasing may affect the opinions of American adults regarding physician-assisted suicide. Here are the findings:

Which choice describes the value 70%?
(a) a marginal value representing the proportion of respondents in favor of physician-assisted suicide
(b) a conditional value representing the proportion of respondents in favor of physician-assisted suicide, given that the question was asked in Form A

The value 70% represents the percent of respondents in favor when the question was phrased in Form A (“End the patient’s life by some painless means”). It is therefore a conditional value.

Gallup example—Graphs (1 of 2)
The data may be analyzed as percentages or totals.

Form A: “End the patient’s life by some painless means”
Form B: “Assist the patient to commit suicide”

                          Form A   Form B
Should be allowed            70%      51%
Should not be allowed        27%      45%
No opinion                    3%       4%
Number interviewed           719      816

In the bar chart on the next slide, the percentages can be seen as the proportions within each segmented bar, while the difference in sample sizes can be seen in the overall heights.

Gallup example—Graphs (2 of 2)
               “Painless means”   “Commit suicide”
Allowed                     503                416
Not allowed                 194                367
No opinion                   22                 33
Total                       719                816
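The counts and the percentages are two views of the same data: dividing each count by the number interviewed for that form recovers the conditional percents reported by Gallup. A small Python sketch, with illustrative names (counts, percents) that are not from the slides:

```python
# Response counts for each phrasing of the question.
counts = {
    "Form A": {
        "Should be allowed": 503,
        "Should not be allowed": 194,
        "No opinion": 22,
    },
    "Form B": {
        "Should be allowed": 416,
        "Should not be allowed": 367,
        "No opinion": 33,
    },
}

# Conditional distribution of the answer, given the form used:
# each count divided by the number interviewed with that form.
percents = {
    form: {
        answer: 100 * n / sum(answers.values())
        for answer, n in answers.items()
    }
    for form, answers in counts.items()
}

for form, distribution in percents.items():
    interviewed = sum(counts[form].values())
    print(f"{form} (n = {interviewed})")
    for answer, pct in distribution.items():
        print(f"  {answer}: {pct:.0f}%")
```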

Simpson’s paradox (1 of 4)
Lurking variables are always a problem for interpretation, but their impact can be even more drastic when dealing with categorical data. An association that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson’s paradox.

The table on the next slide compares the failure rates when removing kidney stones in a sample of patients, using one of two procedures: open surgery and PCNL (percutaneous nephrolithotomy, a minimally invasive technique).

              Open surgery   PCNL
Success                273    289
Failure                 77     61
% failure              22%    17%

Can you think of a possible lurking variable here? Or is the minimally invasive procedure really safer than open surgery?

PCNL: percutaneous nephrolithotomy

The procedures are not chosen randomly by surgeons! In fact, the minimally invasive procedure is most likely used for smaller stones with a good chance of success, whereas open surgery is likely used for more problematic conditions.

Note that the majority of the small-stone cases (the safer situation) involve PCNL, while the majority of the large-stone cases (the riskier situation) involve open surgery. Within each stone size, PCNL has the higher % failure; the unequal sample sizes cause the overall % failure to reverse.

Small stones
              Open surgery   PCNL
Success                 81    234
Failure                  6     36
% failure               7%    13%

Large stones
              Open surgery   PCNL
Success                192     55
Failure                 71     25
% failure              27%    31%

In both cases, small stones and large stones, open surgery has a lower failure rate than PCNL. So why do the combined data suggest that PCNL is better? Because PCNL is used mainly when dealing with small stones, when the failure rate is generally low. Open surgery, by contrast, is used most often when dealing with large stones, and large stones have a higher failure rate overall.
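The reversal can be checked directly from the counts. The following Python sketch computes the failure rate within each stone size and again after pooling the two sizes; the names (data, failure_rate, by_size, combined) are illustrative and not part of the slides.

```python
# (successes, failures) for each procedure, split by stone size.
data = {
    "Small stones": {"Open surgery": (81, 6), "PCNL": (234, 36)},
    "Large stones": {"Open surgery": (192, 71), "PCNL": (55, 25)},
}
procedures = ("Open surgery", "PCNL")


def failure_rate(successes, failures):
    """Percent of procedures that failed."""
    return 100 * failures / (successes + failures)


# Failure rate within each stone size (the stratified view).
by_size = {
    size: {proc: failure_rate(*groups[proc]) for proc in procedures}
    for size, groups in data.items()
}

# Failure rate after pooling small and large stones (the combined view).
combined = {}
for proc in procedures:
    successes = sum(groups[proc][0] for groups in data.values())
    failures = sum(groups[proc][1] for groups in data.values())
    combined[proc] = failure_rate(successes, failures)

for size, rates in by_size.items():
    print(size, {proc: f"{rate:.0f}%" for proc, rate in rates.items()})
print("Combined", {proc: f"{rate:.0f}%" for proc, rate in combined.items()})
```

Each combined rate is a weighted average of the two stratified rates, with weights given by how many cases of each stone size the procedure handled. PCNL's average is dominated by small stones and open surgery's by large stones, which is enough to flip the comparison.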

Another example (1 of 3)
In New York State (excluding New York City), 1,359 white men and 121 black men died from prostate cancer. Based on how many white and black men lived there in 1994, the prostate cancer mortality rates were as follows:

Death from prostate cancer   White (all ages)   Black (all ages)
Yes                                     1,359                121
No                                  4,736,887            418,871
Total                               4,738,246            418,992
Rate per 100,000                         28.7               28.9

Another example (2 of 3)
Overall prostate cancer mortality rates are similar for white men and black men. But when the data are broken down by age group, we see that black men had a much higher rate of prostate cancer death than white men.

Another example (3 of 3)

Death from prostate cancer   White (under 65 years of age)   Black (under 65 years of age)
Yes                                                     76                              18
No                                               4,177,823                         396,899
Total                                            4,177,899                         396,917
Rate per 100,000                                       1.8                             4.5

Death from prostate cancer   White (age 65 and older)   Black (age 65 and older)
Yes                                             1,282                        102
No                                            559,075                     21,973
Total                                         560,357                     22,075
Rate per 100,000                                228.8                      462.1

What is the source of this example of Simpson’s paradox?

Age is the confounding variable here. In both age groups, black men had a higher rate of prostate cancer death. However, for both white men and black men, the rate of prostate cancer death was much higher for older men than for men under the age of 65. Simpson’s paradox arises from the fact that the percent of white men was higher among the older men (96%) than among the younger men (91%).
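The rates and the 96% and 91% figures quoted above can be reproduced from the tables. This Python sketch is illustrative; the names (strata, all_ages, rate_per_100k, percent_white) are assumptions, and the all-ages counts are taken as given from the first table rather than summed from the age groups.

```python
# (deaths, population) by race within each age group,
# from the two age-specific tables.
strata = {
    "Under 65": {"White": (76, 4_177_899), "Black": (18, 396_917)},
    "65 and older": {"White": (1_282, 560_357), "Black": (102, 22_075)},
}

# (deaths, population) for all ages, as given in the first table.
all_ages = {"White": (1_359, 4_738_246), "Black": (121, 418_992)}


def rate_per_100k(deaths, population):
    """Mortality rate expressed per 100,000 men."""
    return 100_000 * deaths / population


rates_by_age = {
    age: {race: rate_per_100k(*groups[race]) for race in groups}
    for age, groups in strata.items()
}
rates_all_ages = {race: rate_per_100k(*pair) for race, pair in all_ages.items()}

# Percent of men in each age group who are white.
percent_white = {
    age: 100 * groups["White"][1] / (groups["White"][1] + groups["Black"][1])
    for age, groups in strata.items()
}

for age, rates in rates_by_age.items():
    print(age, {race: round(rate, 1) for race, rate in rates.items()})
print("All ages", {race: round(rate, 1) for race, rate in rates_all_ages.items()})
print({age: f"{pct:.0f}% white" for age, pct in percent_white.items()})
```

Because white men make up a larger share of the high-mortality older group, their all-ages rate is pulled up toward the rate for older men, which hides the gap seen within each age group.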
