The Practice of Statistics in the Life Sciences Fourth Edition


The Practice of Statistics in the Life Sciences Fourth Edition Chapter 5: Two-way tables Copyright © 2018 W. H. Freeman and Company

Objectives: Two-way tables; Marginal distributions; Conditional distributions; Simpson’s paradox

Two-way tables (1 of 2) Two-way tables summarize data about two categorical variables (or factors) collected on the same set of individuals. Each factor can have any number of levels. If the row factor has r levels and the column factor has c levels, we say that the two-way table is an “r by c” table.

High school students were asked whether they smoke and whether their parents smoke. First factor: Parent smoking status (both parents smoke, one parent smokes, neither parent smokes).

Two-way tables are thus named because there are two ways to group the data: by row variable, or by column variable.

Two-way tables (2 of 2) Second factor: Student smoking status (student smokes, student does not smoke).

| | Student smokes | Student does not smoke |
|---|---|---|
| Both parents smoke | 400 | 1380 |
| One parent smokes | 416 | 1823 |
| Neither parent smokes | 188 | 1168 |

Marginal distributions We can examine each factor in a two-way table separately by studying the row totals and the column totals. They represent the marginal distributions, expressed in counts or percents. The name “marginal” refers to the fact that the row and column totals are written as if in a margin.

Computing marginal percents Marginal percents are marginal counts divided by the table grand total.
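As a rough illustration of this rule, the Python sketch below computes both marginal distributions from the smoking counts given in this chapter. The variable names are illustrative assumptions and do not come from the slides.

```python
# Counts from the smoking two-way table in this chapter.
# Rows: parental smoking status.
# Cells: (student smokes, student does not smoke).
counts = {
    "Both parents smoke": (400, 1380),
    "One parent smokes": (416, 1823),
    "Neither parent smokes": (188, 1168),
}

# Row totals: marginal counts for parental smoking status.
row_totals = {row: sum(cells) for row, cells in counts.items()}

# Column totals: marginal counts for student smoking status.
col_totals = [sum(col) for col in zip(*counts.values())]

grand_total = sum(row_totals.values())

# Marginal percent = marginal count / table grand total.
row_percents = {row: 100 * t / grand_total for row, t in row_totals.items()}
col_percents = [100 * t / grand_total for t in col_totals]

for row, pct in row_percents.items():
    print(f"{row}: {pct:.1f}%")
print(f"Student smokes: {col_percents[0]:.1f}%")
print(f"Student does not smoke: {col_percents[1]:.1f}%")
```

Each marginal distribution sums to 100% because every student is counted exactly once in the grand total.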

Graphs The marginal distributions can be displayed on separate bar graphs, typically expressed as percents instead of raw counts. Each graph represents only one of the two variables, ignoring the second one. Each marginal distribution can also be shown in a pie chart.

Conditional distributions A conditional distribution is the distribution of one factor for each level of the other factor. A conditional percent is computed using the counts within a single row or a single column. The denominator is the corresponding row or column total (rather than the table grand total).

| | Student smokes | Student does not smoke | Total |
|---|---|---|---|
| Both parents smoke | 400 | 1380 | 1780 |
| One parent smokes | 416 | 1823 | 2239 |
| Neither parent smokes | 188 | 1168 | 1356 |
| Total | 1004 | 4371 | 5375 |

To compute this conditional percent, we take the count of students who smoke within the “both parents smoke” row and divide by the “both parents smoke” row total. In essence, we are computing the percent who smoke for the subgroup of students in this study who have both parents smoking.

Percent of students who smoke when both parents smoke = 400/1780 = 22.5%

Comparing conditional distributions (1 of 2) Comparing conditional distributions helps us describe the “relationship” between the two categorical variables. We can compare the percent of individuals in one level of factor 1 for each level of factor 2. Substantial differences suggest an association between factor 1 and factor 2.

Comparing conditional distributions (2 of 2) Conditional distribution of student smokers for different parental smoking statuses:

Percent of students who smoke when both parents smoke = 400/1780 = 22.5%
Percent of students who smoke when one parent smokes = 416/2239 = 18.6%
Percent of students who smoke when neither parent smokes = 188/1356 = 13.9%

Notice that the percent of students who smoke is highest among those students who have two parents smoking, and lowest among those students whose parents do not smoke. This indicates an association between parental smoking and student smoking. Parental smoking may influence the student’s decision to smoke.
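The three percents above can be reproduced with a short Python sketch. It divides each row's count of smokers by that row's total rather than by the grand total; the names used are illustrative assumptions only.

```python
# Cells: (student smokes, student does not smoke), from the chapter's table.
counts = {
    "Both parents smoke": (400, 1380),
    "One parent smokes": (416, 1823),
    "Neither parent smokes": (188, 1168),
}

# Conditional percent of student smokers given each parental status:
# the denominator is the row total, not the table grand total.
percent_smokers = {
    row: 100 * smokes / (smokes + does_not)
    for row, (smokes, does_not) in counts.items()
}

for row, pct in percent_smokers.items():
    print(f"Percent of students who smoke | {row}: {pct:.1f}%")
```

The percent who do not smoke in each row is simply 100 minus the value computed here.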

More graphs The conditional distributions can be compared graphically by displaying the percents making up one factor, for each level of the other factor. Conditional distribution of student smoking status for different levels of parental smoking status: Notice that here, only the percents of students who smoke are shown in the bar graph. That is because the percents who do not smoke are implied. That is, if 22% of students who have two parents smoking smoke themselves, then it follows that the remaining 78% of these students do not smoke.

Conditional distribution graphs (1 of 2) Conditional distributions of student smoking status for different levels of parental smoking status:

| | Percent who smoke | Percent who do not smoke | Row total |
|---|---|---|---|
| Both parents smoke | 22% | 78% | 100% |
| One parent smokes | 19% | 81% | 100% |
| Neither parent smokes | 14% | 86% | 100% |

The full set of percents can be displayed in a set of pie charts, one for each level of the condition (here parental smoking status).

Conditional distribution graphs (2 of 2) Conditional distribution of parental smoking status for different levels of student smoking status: For each two-way table, there are two conditional distributions. Sometimes both are interesting, and sometimes only one conditional distribution truly interests us. In the smoking status study, the second conditional distribution (given student smoking status) is less interesting because we would imagine that parents might influence student smoking decision but not vice versa.
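The slide's graph for this second conditional distribution is not reproduced in the transcript. As a hedged sketch, the percents it would display can be recomputed from the chapter's counts by conditioning on columns instead of rows. The variable names are assumptions made for illustration.

```python
# Cells: (student smokes, student does not smoke), from the chapter's table.
counts = {
    "Both parents smoke": (400, 1380),
    "One parent smokes": (416, 1823),
    "Neither parent smokes": (188, 1168),
}
col_labels = ("Student smokes", "Student does not smoke")

# Column totals are the denominators when conditioning on student status.
col_totals = [sum(col) for col in zip(*counts.values())]

# Distribution of parental smoking status within each student group.
parent_given_student = {
    label: {row: 100 * cells[j] / col_totals[j] for row, cells in counts.items()}
    for j, label in enumerate(col_labels)
}

for label, dist in parent_given_student.items():
    print(label)
    for row, pct in dist.items():
        print(f"  {row}: {pct:.1f}%")
```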

Gallup example A 2013 Gallup survey investigated how phrasing may affect the opinions of American adults regarding physician-assisted suicide. Here are the findings: Which choice describes the value 70%: a marginal value representing the proportion of respondents in favor of physician-assisted suicide, or a conditional value representing the proportion of respondents in favor of physician-assisted suicide, given that the question was asked in Form A? The value 70% represents the percent of respondents in favor when the question was phrased in Form A (“End the patient’s life by some painless means”). It is therefore a conditional value.

Gallup example: Graphs (1 of 2) The data may be analyzed as percentages or totals.

| | Form A: “End the patient’s life by some painless means” | Form B: “Assist the patient to commit suicide” |
|---|---|---|
| Should be allowed | 70% | 51% |
| Should not be allowed | 27% | 45% |
| No opinion | 3% | 4% |
| Number interviewed | 719 | 816 |

In the bar chart on the next slide, the percentages can be seen as the proportions within each segmented bar, while the difference in sample sizes can be seen in the overall heights.

Gallup example: Graphs (2 of 2)

| | “Painless means” | “Commit suicide” |
|---|---|---|
| Allowed | 503 | 416 |
| Not allowed | 194 | 367 |
| No opinion | 22 | 33 |
| Total | 719 | 816 |
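To make the marginal versus conditional distinction concrete, the sketch below recomputes the percent in favor from the counts in the Gallup table, once within each question form (conditional) and once over all respondents (marginal). The marginal figure is not stated on the slides; it is derived here from the counts, and the names are illustrative.

```python
# Counts from the Gallup example, by question form.
counts = {
    "Form A": {"Allowed": 503, "Not allowed": 194, "No opinion": 22},
    "Form B": {"Allowed": 416, "Not allowed": 367, "No opinion": 33},
}

# Number interviewed with each form.
form_totals = {form: sum(cells.values()) for form, cells in counts.items()}

# Conditional percent in favor, given the question form.
conditional_allowed = {
    form: 100 * cells["Allowed"] / form_totals[form]
    for form, cells in counts.items()
}

# Marginal percent in favor, ignoring the question form.
total_allowed = sum(cells["Allowed"] for cells in counts.values())
marginal_allowed = 100 * total_allowed / sum(form_totals.values())

for form, pct in conditional_allowed.items():
    print(f"In favor, given {form}: {pct:.0f}%")
print(f"In favor, all respondents: {marginal_allowed:.1f}%")
```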

Simpson’s paradox (1 of 4) Lurking variables are always a problem for interpretation, but their impact can be even more drastic when dealing with categorical data. An association that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson’s paradox. The table on the next slide compares the failure rates when removing kidney stones in a sample of patients, using one of two procedures: open surgery and PCNL (percutaneous nephrolithotomy, a minimally invasive technique).

Simpson’s paradox (2 of 4)

| | Open surgery | PCNL |
|---|---|---|
| Success | 273 | 289 |
| Failure | 77 | 61 |
| % failure | 22% | 17% |

Can you think of a possible lurking variable here? Or is open surgery really riskier than the minimally invasive procedure?

Simpson’s paradox (3 of 4) The procedures are not chosen randomly by surgeons! In fact, the minimally invasive procedure is most likely used for smaller stones with a good chance of success, whereas open surgery is likely used for more problematic conditions. Note that the majority of the small-stone cases (the safer situation) involve PCNL, while the majority of the large-stone cases (the riskier situation) involve open surgery. Within each stone size, PCNL has the higher % failure; the unequal sample sizes cause the overall % failure to reverse.

Simpson’s paradox (4 of 4)

| Small stones | Open surgery | PCNL |
|---|---|---|
| Success | 81 | 234 |
| Failure | 6 | 36 |
| % failure | 7% | 13% |

| Large stones | Open surgery | PCNL |
|---|---|---|
| Success | 192 | 55 |
| Failure | 71 | 25 |
| % failure | 27% | 31% |

In both cases, small stones and large stones, open surgery has a lower failure rate than PCNL. So why do the combined data suggest that PCNL is better? Because PCNL is used mainly when dealing with small stones, when the failure rate is generally low. Open surgery, by contrast, is used most often when dealing with large stones, and large stones have a higher failure rate overall.
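The reversal can be checked numerically. The following Python sketch, using the counts from the two stone-size tables, computes the failure rate within each stratum and again after pooling the strata. The function and variable names are assumptions made for illustration.

```python
# Cells: (successes, failures), taken from the small- and large-stone tables.
strata = {
    "Small stones": {"Open surgery": (81, 6), "PCNL": (234, 36)},
    "Large stones": {"Open surgery": (192, 71), "PCNL": (55, 25)},
}


def failure_rate(success, failure):
    """Percent of procedures that failed."""
    return 100 * failure / (success + failure)


# Failure rate within each stone-size group.
stratum_rates = {
    stone: {proc: failure_rate(*cells) for proc, cells in procs.items()}
    for stone, procs in strata.items()
}

# Pool the two groups by adding successes and failures per procedure.
combined = {}
for procs in strata.values():
    for proc, (s, f) in procs.items():
        cs, cf = combined.get(proc, (0, 0))
        combined[proc] = (cs + s, cf + f)

combined_rates = {proc: failure_rate(*cells) for proc, cells in combined.items()}

for stone, rates in stratum_rates.items():
    print(stone, {proc: f"{r:.0f}%" for proc, r in rates.items()})
print("Combined", {proc: f"{r:.0f}%" for proc, r in combined_rates.items()})
```

Within each stratum open surgery has the lower failure rate, yet the pooled rate favors PCNL, because most PCNL cases come from the low-risk small-stone group.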

Another example (1 of 3) In New York State (excluding New York City), 1,359 white men and 121 black men died from prostate cancer in 1994. Based on how many white and black men lived there in 1994, the prostate cancer mortality rates were as follows:

| Death from prostate cancer | White (all ages) | Black (all ages) |
|---|---|---|
| Yes | 1,359 | 121 |
| No | 4,736,887 | 418,871 |
| Total | 4,738,246 | 418,992 |
| Rate per 100,000 | 28.7 | 28.9 |

Another example (2 of 3) Cancer mortality rates are similar in both groups. But when the data are broken down by age group we see that black men had a much higher rate of prostate cancer death than white men. Age is the confounding variable here. In both age groups, black men had a higher rate of prostate cancer death. However, for both white men and black men, the rate of prostate cancer death was much higher for older men than for men under the age of 65. Simpson’s paradox arises from the fact that the percent of white men was higher among the older men (96%) than among the younger men (91%).

Another example (3 of 3)

| Death from prostate cancer | White (under 65 years of age) | Black (under 65 years of age) |
|---|---|---|
| Yes | 76 | 18 |
| No | 4,177,823 | 396,899 |
| Total | 4,177,899 | 396,917 |
| Rate per 100,000 | 1.8 | 4.5 |

| Death from prostate cancer | White (age 65 and older) | Black (age 65 and older) |
|---|---|---|
| Yes | 1,282 | 102 |
| No | 559,075 | 21,973 |
| Total | 560,357 | 22,075 |
| Rate per 100,000 | 228.8 | 462.1 |

What is the source of this example of Simpson’s paradox?
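As a hedged sketch, the rates per 100,000 and the share of white men in each age group can be recomputed from the figures in the tables for this example. The all-ages figures are entered as printed rather than summed from the two age groups, and all names are illustrative assumptions.

```python
# Cells: (deaths from prostate cancer, total men), as printed in the tables.
groups = {
    "All ages": {"White": (1_359, 4_738_246), "Black": (121, 418_992)},
    "Under 65": {"White": (76, 4_177_899), "Black": (18, 396_917)},
    "65 and older": {"White": (1_282, 560_357), "Black": (102, 22_075)},
}


def rate_per_100k(deaths, total):
    """Mortality rate expressed per 100,000 men."""
    return 100_000 * deaths / total


rates = {
    group: {race: rate_per_100k(*cells) for race, cells in by_race.items()}
    for group, by_race in groups.items()
}

# Percent of men in each age group who are white (the unequal mix
# across age groups is what drives the paradox).
white_share = {
    group: 100 * by_race["White"][1] / (by_race["White"][1] + by_race["Black"][1])
    for group, by_race in groups.items()
    if group != "All ages"
}

for group, by_race in rates.items():
    print(group, {race: round(r, 1) for race, r in by_race.items()})
for group, share in white_share.items():
    print(f"White share, {group}: {share:.0f}%")
```

The all-ages rates are nearly equal, while within each age group the rate for black men is roughly double that for white men; the older, higher-risk group is also the one with the larger white share.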