The Practice of Statistics in the Life Sciences Third Edition

5. Two-way tables The Practice of Statistics in the Life Sciences Third Edition © 2014 W.H. Freeman and Company

Objectives (PSLS Chapter 5)
Two-way tables
Marginal distributions
Conditional distributions
Simpson’s paradox

Two-way tables
Two-way tables summarize data about two categorical variables (or factors) collected on the same set of individuals. Each factor can have any number of levels. If the row factor has “r” levels and the column factor has “c” levels, we say that the two-way table is an “r by c” table.
Example: High school students were asked whether they smoke, and whether their parents smoke.
First factor: Parent smoking status
Second factor: Student smoking status
Two-way tables are thus named because there are two ways to group the data: by row variable, or by column variable.

Marginal distributions
We can examine each factor in a two-way table separately by studying the row totals and the column totals. They represent the marginal distributions, expressed in counts or percents. The name “marginal” refers to the fact that the row and column totals are written as if in a margin.
[Tables: marginal distribution for parental smoking; marginal distribution for student smoking]

Computing marginal percents
Marginal percents are marginal counts divided by the table grand total.
Marginal distribution for student smoking: 18.7% of students smoke and 81.3% do not smoke, for a total of 100%.
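The computation above can be sketched in a few lines of Python. This is only an illustration: the variable and function names are made up for the example, and the counts of students who do not smoke are not quoted directly in the slides but obtained here as row total minus smokers (for instance 1780 - 400 = 1380).

```python
# Two-way table of counts: parental smoking status (rows) by student
# smoking status (columns). Smoker counts and row totals come from the
# slides; non-smoker counts are derived as row total minus smokers.
counts = {
    "Both parents smoke": {"Student smokes": 400, "Student does not smoke": 1380},
    "One parent smokes": {"Student smokes": 416, "Student does not smoke": 1823},
    "Neither parent smokes": {"Student smokes": 188, "Student does not smoke": 1168},
}

# Grand total: every student in the table.
grand_total = sum(sum(row.values()) for row in counts.values())

# Row totals give the marginal counts for parental smoking.
row_totals = {parent: sum(row.values()) for parent, row in counts.items()}

# Column totals give the marginal counts for student smoking.
col_totals = {}
for row in counts.values():
    for student, n in row.items():
        col_totals[student] = col_totals.get(student, 0) + n


def marginal_percents(totals, grand_total):
    """Divide each marginal count by the grand total, as a percent."""
    return {level: round(100 * n / grand_total, 1) for level, n in totals.items()}


parent_pct = marginal_percents(row_totals, grand_total)
student_pct = marginal_percents(col_totals, grand_total)

print(parent_pct)
print(student_pct)
```

Both marginal distributions use the same denominator, the grand total, which is what distinguishes them from conditional percents.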

Graphs
The marginal distributions can be displayed on separate bar graphs, typically expressed as percents instead of raw counts. Each graph represents only one of the two variables, ignoring the second one. Each marginal distribution can also be shown in a pie chart.
[Graphs: parental smoking; student smoking]

Conditional distributions
A conditional distribution is the distribution of one factor for each level of the other factor. A conditional percent is computed using the counts within a single row or a single column. The denominator is the corresponding row or column total (rather than the table grand total).
Percent of students who smoke when both parents smoke = 400/1780 = 22.5%
In computing this conditional percent, we take the count of students who smoke within the “both parents smoke” row and divide it by the “both parents smoke” row total. In essence, we are computing the percent who smoke for the subgroup of students in this study who have both parents smoking.

Comparing conditional distributions
Comparing conditional distributions helps us describe the “relationship” between the two categorical variables. We can compare the percent of individuals in one level of factor 1 for each level of factor 2. Substantial differences suggest an association between factor 1 and factor 2.
Conditional distribution of student smokers for different parental smoking statuses:
Percent of students who smoke when both parents smoke = 400/1780 = 22.5%
Percent of students who smoke when one parent smokes = 416/2239 = 18.6%
Percent of students who smoke when neither parent smokes = 188/1356 = 13.9%
Notice that the percent of students who smoke is highest among those students who have two parents smoking, and lowest among those students whose parents do not smoke. This indicates an association between parental smoking and student smoking. Parental smoking may influence the student’s decision to smoke.
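As a minimal sketch, the three conditional percents can be reproduced in Python from the counts quoted on this slide. The names `smokers`, `row_totals` and `conditional_percent` are illustrative choices, not part of the original material.

```python
# Students who smoke, and total students, within each parental smoking group
# (counts as quoted on the slide).
smokers = {
    "Both parents smoke": 400,
    "One parent smokes": 416,
    "Neither parent smokes": 188,
}
row_totals = {
    "Both parents smoke": 1780,
    "One parent smokes": 2239,
    "Neither parent smokes": 1356,
}


def conditional_percent(count, group_total):
    """Percent within one group: the denominator is the group (row) total,
    not the table grand total."""
    return 100 * count / group_total


pct_smoke = {
    group: round(conditional_percent(smokers[group], row_totals[group]), 1)
    for group in smokers
}

for group, pct in pct_smoke.items():
    print(f"{group}: {pct}% of students smoke")
```

Comparing the three values side by side is the whole point: the denominators differ by group, so the percents are directly comparable even though the groups have different sizes.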

Graphs
The conditional distributions can be compared graphically by displaying the percents making up one factor, for each level of the other factor.
Conditional distribution of student smoking status for different levels of parental smoking status:

                         Percent who smoke   Percent who do not smoke   Row total
Both parents smoke              22%                    78%                100%
One parent smokes               19%                    81%                100%
Neither parent smokes           14%                    86%                100%

Notice that here, only the percents of students who smoke are shown in the bar graph. That is because the percents who do not smoke are implied. That is, if 22% of students who have two parents smoking smoke themselves, then it follows that the remaining 78% of these students do not smoke.

Conditional distribution of student smoking status for different levels of parental smoking status:
The full set of percents (percent who smoke and percent who do not smoke) can be displayed in a set of pie charts, one for each level of the condition (here parental smoking status).

Conditional distribution of parental smoking status for different levels of student smoking status:

                                  Student smokes   Student does not smoke
Percent with 2 parents smoking         40%                  32%
Percent with 1 parent smoking          41%                  42%
Percent with 0 parents smoking         19%                  27%
Column total                          100%                 100%

For each two-way table, there are two conditional distributions. Sometimes both are interesting, and sometimes only one conditional distribution truly interests us. In the smoking status study, the second conditional distribution (given student smoking status) is less interesting because we would imagine that parents might influence a student’s decision to smoke, but not vice versa.
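To make the switch of denominator concrete, here is a short Python sketch of this second conditional distribution, where each count is divided by its column total rather than its row total. The names are illustrative, and the counts of students who do not smoke are an assumption in the sense that they are derived from the slides' row totals (row total minus smokers) rather than quoted directly.

```python
# Counts by student smoking status (columns), then parental status (rows).
# Smoker counts are from the slides; non-smoker counts are derived as
# row total minus smokers (1780 - 400, 2239 - 416, 1356 - 188).
counts_by_student = {
    "Student smokes": {"2 parents": 400, "1 parent": 416, "0 parents": 188},
    "Student does not smoke": {"2 parents": 1380, "1 parent": 1823, "0 parents": 1168},
}

parent_given_student = {}
for student_status, column in counts_by_student.items():
    column_total = sum(column.values())  # denominator is the column total
    parent_given_student[student_status] = {
        parents: round(100 * n / column_total) for parents, n in column.items()
    }

for student_status, dist in parent_given_student.items():
    print(student_status, dist)
```

The same six counts produce a different set of percents than the row-conditional distribution, which is why it matters to state which factor is being conditioned on.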

A 2013 Gallup survey investigated how phrasing may affect the opinions of American adults regarding physician-assisted suicide. Here are the findings:
[Table of survey findings by question form]
The value 70% is
  a marginal value representing the proportion of respondents in favor of physician-assisted suicide.
  a conditional value representing the proportion of respondents in favor of physician-assisted suicide, given that the question was asked in Form A.
Answer: The value 70% represents the percent of respondents in favor when the question was phrased in Form A (“End the patient’s life by some painless means”). It is therefore a conditional value.

Simpson’s paradox
Lurking variables are always a problem for interpretation, but their impact can be even more drastic when dealing with categorical data. An association that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson’s paradox.
The table on the right compares the failure rates when removing kidney stones in a sample of patients, using one of two procedures: open surgery and PCNL (percutaneous nephrolithotomy, a minimally invasive technique).
Overall failure rate: open surgery 22%, PCNL 17%
Can you think of a possible lurking variable here?

The procedures are not chosen randomly by surgeons! In fact, the minimally invasive procedure is most likely used for smaller stones with a good chance of success, whereas open surgery is likely used for more problematic conditions.
Within each stone size, small and large, open surgery has a lower failure rate than PCNL. So why do the combined data (22% failure for open surgery, 17% for PCNL) suggest that PCNL is better? Because PCNL is used mainly when dealing with small stones, when the failure rate is generally low. Open surgery, by contrast, is used most often when dealing with large stones, and large stones have a higher failure rate overall.
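The reversal can be checked numerically. The sketch below is hedged: the counts by stone size are NOT given in these slides. They are the figures commonly quoted from the Charig et al. (1986) kidney stone study and are used here as an assumption, chosen because they reproduce the 22% and 17% overall failure rates shown above. The names `data` and `failure_rate` are illustrative.

```python
# ASSUMPTION: counts are not from the slides; they are the commonly quoted
# Charig et al. (1986) figures. Each entry is (failures, patients).
data = {
    "Open surgery": {"small stones": (6, 87), "large stones": (71, 263)},
    "PCNL": {"small stones": (36, 270), "large stones": (25, 80)},
}


def failure_rate(failures, patients):
    """Failure rate as a percent of patients treated."""
    return 100 * failures / patients


# Failure rate within each stone size (the separate groups).
by_size = {
    procedure: {size: failure_rate(f, n) for size, (f, n) in groups.items()}
    for procedure, groups in data.items()
}

# Failure rate after combining both stone sizes into a single group.
overall = {}
for procedure, groups in data.items():
    total_failures = sum(f for f, _ in groups.values())
    total_patients = sum(n for _, n in groups.values())
    overall[procedure] = failure_rate(total_failures, total_patients)

for procedure in data:
    sizes = ", ".join(f"{s}: {r:.1f}%" for s, r in by_size[procedure].items())
    print(f"{procedure}: {sizes}, combined: {overall[procedure]:.1f}%")
```

With these counts, open surgery has the lower failure rate for small stones and for large stones, yet the higher failure rate once the groups are combined, because most open-surgery patients had large stones and most PCNL patients had small ones.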

In New York State (excluding New York City), 1,359 white men and 121 black men died from prostate cancer in 1994. Based on how many white and black men lived there in 1994, the prostate cancer mortality rates were as follows:

All ages                               White        Black
Death from prostate cancer: Yes        1,359          121
Death from prostate cancer: No     4,736,887      418,871
Total                              4,738,246      418,992
Rate per 100,000                        28.7         28.9

Cancer mortality rates are similar in both groups. But when the data are broken down by age group, we see that black men had a much higher rate of prostate cancer death than white men.

                                   Under 65 years of age       Age 65 and older
                                       White        Black        White      Black
Death from prostate cancer: Yes           76           18        1,282        102
Death from prostate cancer: No     4,177,823      396,899      559,075     21,973
Total                              4,177,899      396,917      560,357     22,075
Rate per 100,000                         1.8          4.5        228.8      462.1

What is the source of this example of Simpson’s paradox?
Age is the confounding variable here. In both age groups, black men had a higher rate of prostate cancer death. However, for both white men and black men, the rate of prostate cancer death was much higher for older men than for men under the age of 65. Simpson’s paradox arises from the fact that the percent of white men was higher among the older men (96%) than among the younger men (91%).
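The rates and group compositions quoted on this slide can be recomputed with a short Python sketch. The counts are taken as printed in the tables above; the dictionary layout and the helper `rate_per_100k` are illustrative assumptions, not part of the original material.

```python
# (deaths, total men) as printed on the slide.
all_ages = {"White": (1359, 4738246), "Black": (121, 418992)}
by_age = {
    "Under 65": {"White": (76, 4177899), "Black": (18, 396917)},
    "65 and older": {"White": (1282, 560357), "Black": (102, 22075)},
}


def rate_per_100k(deaths, population):
    """Mortality rate per 100,000 men."""
    return 100_000 * deaths / population


# Combined rates: similar in both groups.
combined_rates = {
    race: round(rate_per_100k(d, n), 1) for race, (d, n) in all_ages.items()
}

# Age-specific rates: higher for black men in both age groups.
age_rates = {
    age: {race: round(rate_per_100k(d, n), 1) for race, (d, n) in groups.items()}
    for age, groups in by_age.items()
}

# Percent of white men within each age group: the imbalance behind the paradox.
white_share = {
    age: round(100 * groups["White"][1] / (groups["White"][1] + groups["Black"][1]))
    for age, groups in by_age.items()
}

print(combined_rates)
print(age_rates)
print(white_share)
```

Because older men have far higher mortality and a larger share of the older men are white, the combined rate for white men is pulled up toward the rate for black men, masking the gap seen within each age group.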
