Learning from Categorical Data

Chapter 15: Learning from Categorical Data
Created by Kathy Fritz

Chi-Square Tests for Univariate Categorical Data

Univariate Categorical Data
Univariate categorical data arise in a variety of settings. The possible values of the categorical variable are its categories, and k denotes the number of different categories. For example, each person in a sample of 100 registered voters in a particular city might be asked which of five city council members he or she favors for mayor. The variable of interest is the favored candidate, and it has k = 5 categories. Univariate categorical data are usually summarized in a one-way frequency table, displayed either horizontally or vertically.

Notation
k = number of categories of a categorical variable
p1 = population proportion for Category 1
p2 = population proportion for Category 2
⋮
pk = population proportion for Category k
Note: p1 + p2 + ⋯ + pk = 1

From sample data, you have observed counts for each of the k categories. Expected counts are the counts for the k categories that you would expect to see if the null hypothesis were true. The goodness-of-fit statistic, denoted X², is a quantitative measure of the extent to which the observed counts differ from those expected when H0 is true. (The Greek letter χ (chi) is often used in place of X.) The value of the X² statistic reflects the magnitude of the discrepancies between observed and expected counts: when the differences are big, the value of X² tends to be large, which suggests that H0 should be rejected.

Chi-Square Distributions
A chi-square distribution curve is not symmetric; it has a longer tail on the right. It has no area associated with negative values. There are many different chi-square distributions, each with a different number of degrees of freedom (df).
[Figure: chi-square curves for df = 3, df = 5, and df = 10]
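The shape described above can be explored numerically. The following is a minimal sketch, assuming the standard chi-square density formula; the function name `chi_square_pdf` and the evaluation grid are illustrative choices, not part of the original slides. For df greater than 2 the curve peaks at df – 2, so the peak moves to the right as df grows while the long right tail remains.

```python
import math

def chi_square_pdf(x: float, df: int) -> float:
    """Density of the chi-square distribution with `df` degrees of freedom."""
    if x <= 0:
        return 0.0  # no area is associated with negative values
    half_df = df / 2
    return x ** (half_df - 1) * math.exp(-x / 2) / (2 ** half_df * math.gamma(half_df))

# Evaluate each curve on the grid 0.00, 0.01, ..., 40.00 and locate its peak.
grid = [i / 100 for i in range(4001)]
for df in (3, 5, 10):
    peak = max(grid, key=lambda x: chi_square_pdf(x, df))
    print(f"df = {df:2d}: peak near x = {peak:.2f}")
```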

For a test procedure based on X², the associated P-value is the area under the appropriate chi-square curve and to the right of the computed X² value. For example, for a chi-square distribution with df = 4, the area to the right of X² = 8.18 is approximately 0.085. The area to the right of an X² value can be found in Table 5. It can also be found using a statistical software package or a graphing calculator.
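As one way of finding such an area with software, here is a minimal sketch that uses only the Python standard library. It relies on the closed-form right-tail area that holds when df is even; the function name is hypothetical, and a statistics package would normally be used for arbitrary df.

```python
import math

def chi_square_right_tail_even_df(x2: float, df: int) -> float:
    """Area to the right of x2 under the chi-square curve (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to even, positive df")
    if x2 <= 0:
        return 1.0
    half = x2 / 2
    # exp(-x/2) * sum over j = 0 .. df/2 - 1 of (x/2)^j / j!
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

p_value = chi_square_right_tail_even_df(8.18, df=4)
print(round(p_value, 3))  # → 0.085
```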

Chi-Square Goodness-of-Fit Test
Appropriate when the following conditions are met:
1. Observed cell counts are based on a random sample or a sample that is representative of the population.
2. The sample size is large. The sample size is large enough for the chi-square goodness-of-fit test to be appropriate if every expected cell count is at least 5.

When these conditions are met, the following test statistic can be used:

$$X^2 = \sum_{\text{all categories}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

where expected count = n × (hypothesized proportion for the category).

When the null hypothesis is true, the X² statistic has approximately a chi-square distribution with df = k – 1, where k is the number of category proportions specified in the null hypothesis.
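The statistic and the large-sample check can be sketched in a few lines. This is an illustrative sketch only: the function name `goodness_of_fit_statistic` and the data (40 observations in 3 categories) are hypothetical and do not come from the slides.

```python
import math

def goodness_of_fit_statistic(observed, hypothesized_proportions):
    """Return (X^2, df, expected counts) for a chi-square goodness-of-fit test."""
    if len(observed) != len(hypothesized_proportions):
        raise ValueError("need one hypothesized proportion per category")
    if not math.isclose(sum(hypothesized_proportions), 1.0):
        raise ValueError("hypothesized proportions must sum to 1")
    n = sum(observed)
    # expected count = n * (hypothesized proportion for the category)
    expected = [n * p for p in hypothesized_proportions]
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1  # k - 1
    return x2, df, expected

# Hypothetical data: 40 observations in k = 3 categories.
observed = [24, 9, 7]
x2, df, expected = goodness_of_fit_statistic(observed, [0.50, 0.25, 0.25])
large_enough = all(e >= 5 for e in expected)  # every expected count at least 5
print(f"expected = {expected}, X^2 = {x2:.2f}, df = {df}, large enough: {large_enough}")
```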

Hypotheses
H0: p1 = hypothesized proportion for Category 1
    p2 = hypothesized proportion for Category 2
    ⋮
    pk = hypothesized proportion for Category k
Ha: H0 is not true. At least one of the population category proportions differs from the corresponding hypothesized value.

Associated P-value
The P-value is the area to the right of X² under the chi-square curve with df = k – 1.

Spread Chosen as Dog Food
A study investigated whether people can tell the difference between dog food, pâté (a spread made of finely chopped liver, meat, or fish), and processed meats (such as Spam and liverwurst). Researchers used a food processor to make spreads that had the same texture and consistency as pâté from Newman’s Own brand dog food and from the processed meats. Each participant in the study tasted the five spreads (duck liver pâté, Spam, dog food, pork liver pâté, and liverwurst). After tasting all five spreads, each participant was asked to choose the one that they thought was the dog food. The data are summarized in the one-way table below. You can use these data to test the hypothesis that the five different spreads are chosen equally often when people who have tasted all five spreads are asked to identify the one they think is dog food.

Spread Chosen as Dog Food   Frequency
Duck Liver Pâté                     3
Spam                               11
Dog Food                            8
Pork Liver Pâté                     6
Liverwurst                         22

Step 1 (Hypotheses):
The population category proportions are defined as
p1 = proportion of all people who would choose duck liver pâté as the dog food
p2 = proportion of all people who would choose Spam as the dog food
p3 = proportion of all people who would choose dog food as the dog food
p4 = proportion of all people who would choose pork liver pâté as the dog food
p5 = proportion of all people who would choose liverwurst as the dog food
H0: p1 = p2 = p3 = p4 = p5 = 0.20
Ha: At least one of the population proportions is not 0.20

Step 2 (Method):
Because the answers to the four key questions are 1) hypothesis testing, 2) sample data, 3) one categorical variable with more than 2 categories, and 4) one sample, a chi-square goodness-of-fit test is considered. When the null hypothesis is true, the X² statistic has approximately a chi-square distribution with df = 4. A significance level of α = 0.05 will be used for this test.

Step 3 (Check):
You must be willing to assume that the participants in this study can be regarded as a random or representative sample. If this assumption is not reasonable, you should be very careful generalizing results from this analysis to any larger population. Because the sample size is 50, all expected counts are 50(0.20) = 10. All expected counts are at least 5, so the sample size is large enough.

Step 4 (Calculate):
Test statistic:
$$X^2 = \sum_{\text{all categories}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} = \frac{(3-10)^2}{10} + \frac{(11-10)^2}{10} + \frac{(8-10)^2}{10} + \frac{(6-10)^2}{10} + \frac{(22-10)^2}{10} = 21.4$$
Degrees of freedom: k – 1 = 5 – 1 = 4
Associated P-value: P-value = area under the chi-square curve (df = 4) to the right of 21.4 < 0.001
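The calculation in Step 4 can be reproduced with a short script. This is a sketch using only the standard library; the right-tail formula exp(-x/2)(1 + x/2) is the closed form that applies specifically to df = 4, and the variable names are illustrative.

```python
import math

spreads = ["Duck Liver Pâté", "Spam", "Dog Food", "Pork Liver Pâté", "Liverwurst"]
observed = [3, 11, 8, 6, 22]

n = sum(observed)                      # sample size, 50
expected = [n / len(observed)] * 5     # same as n * 0.20 for each category
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                 # k - 1 = 4

# Right-tail area under the chi-square curve; this closed form is valid for df = 4 only.
p_value = math.exp(-x2 / 2) * (1 + x2 / 2)

for name, o, e in zip(spreads, observed, expected):
    print(f"{name:16s} observed = {o:2d}  expected = {e:.0f}  contribution = {(o - e) ** 2 / e:.1f}")
print(f"X^2 = {x2:.1f}, df = {df}, P-value below 0.001: {p_value < 0.001}")
```

The per-category contributions show that liverwurst and duck liver pâté account for most of the statistic.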

Step 5 (Communicate Results):
Because the P-value is less than the selected significance level, the null hypothesis is rejected. Based on these data, there is convincing evidence that the proportion identifying a spread as dog food is not the same for all five spreads. From the plot of observed and expected counts, it is easy to see the two categories that differ the most from the expected count. The large differences between the observed counts and the counts that would have been expected if the null hypothesis of equal proportions were true are in the duck liver pâté and liverwurst categories: fewer people than expected chose the duck liver pâté, and many more than expected chose liverwurst. So, although you reject H0, it is not because people were actually able to identify which spread was really dog food.

Tests for Homogeneity and Independence in a Two-Way Table

Bivariate categorical data result from observations made on two different categorical variables in a single sample. Suppose a researcher wishes to know whether there is any relationship between political philosophy (liberal, moderate, or conservative) and preferred news network for people who regularly watch the national news. There are two categorical variables: political philosophy and preferred news network. Two values (one for each variable) would be recorded for each person in the study. Bivariate categorical data are usually summarized in a two-way frequency table. There is a cell in the table for each possible combination of the category values. The number of times each particular combination occurs in the data set is entered in the corresponding cell of the table. These are called observed counts.

Marginal totals are obtained by adding the observed cell counts in each row and also in each column of the table. The grand total is the sum of all the observed cell counts in the table. The grand total is also the sum of the row marginal totals or the sum of the column marginal totals.

               ABC   CBS   NBC   FOX   Total
Liberal         20    20    25    15      80
Moderate        45    35    50    20     150
Conservative    15    40    10     5      70
Total           80    95    85    40     300

The row and column labels are the category values for the two categorical variables. The entries in the body of the table are the observed cell counts. The Total row and the Total column contain the marginal totals, and 300 is the grand total.
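The bookkeeping for marginal totals can be sketched as follows. The cell counts used here are an assumption: they are chosen to agree with the marginal totals of the political philosophy table (row totals 80, 150, and 70; grand total 300), and the variable names are illustrative.

```python
networks = ["ABC", "CBS", "NBC", "FOX"]

# Assumed observed cell counts, one list per political philosophy.
observed = {
    "Liberal":      [20, 20, 25, 15],
    "Moderate":     [45, 35, 50, 20],
    "Conservative": [15, 40, 10, 5],
}

# Row marginal totals: add the observed counts across each row.
row_totals = {philosophy: sum(counts) for philosophy, counts in observed.items()}

# Column marginal totals: add the observed counts down each column.
column_totals = dict(zip(networks, (sum(col) for col in zip(*observed.values()))))

# The grand total is the sum of either set of marginal totals.
grand_total = sum(row_totals.values())
totals_agree = grand_total == sum(column_totals.values())

print(row_totals)
print(column_totals)
print(grand_total, totals_agree)
```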

Two-way tables are also used when data are collected to compare two or more populations or treatments on the basis of a single categorical variable. In this situation, independent samples are selected from each population or treatment. For example, data could be collected at a university to compare students, faculty, and staff on the basis of primary mode of transportation to campus (car, bicycle, motorcycle, bus, or by foot), using a sample of 200 students, a sample of 100 faculty, and a sample of 150 staff. For each individual in the three independent samples, only one value is recorded: mode of transportation to campus.

Chi-Square Test for Homogeneity
Appropriate when the following conditions are met:
1. Observed counts are from independently selected random samples, or subjects in an experiment are randomly assigned to treatment groups.
2. The sample sizes are large. The sample sizes are large enough for the chi-square test for homogeneity if every expected count is at least 5. If some expected counts are less than 5, rows or columns of the table may be combined to achieve a table with satisfactory expected counts.

When these conditions are met, the following test statistic can be used:

$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

The expected cell counts are estimated from the sample data using the formula

$$\text{expected cell count} = \frac{(\text{row marginal total})(\text{column marginal total})}{\text{grand total}}$$

When the conditions above are met and the null hypothesis is true, the X² statistic has approximately a chi-square distribution with df = (number of rows – 1)(number of columns – 1).
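The two formulas above translate directly into code. The following is a minimal sketch with hypothetical function names and a hypothetical 2 × 3 table of counts; it is meant to illustrate the arithmetic, not to replace a statistics package.

```python
def expected_counts(table):
    """Expected cell counts: (row total)(column total) / (grand total)."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]

def chi_square_two_way(table):
    """Return (X^2, df) for a two-way table of observed counts."""
    expected = expected_counts(table)
    x2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df

# Hypothetical observed counts: 2 groups classified into 3 categories.
hypothetical_table = [[10, 20, 30],
                      [20, 20, 20]]
expected = expected_counts(hypothetical_table)
x2, df = chi_square_two_way(hypothetical_table)
print(expected)
print(f"X^2 = {x2:.2f}, df = {df}")
```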

Hypotheses:
H0: The population (or treatment) category proportions are the same for all the populations or treatments.
Ha: The population (or treatment) category proportions are not all the same for all the populations or treatments.

Associated P-value: The P-value associated with the computed test statistic value is the area to the right of X² under the chi-square curve with df = (number of rows – 1)(number of columns – 1).

A study was conducted to determine if collegiate soccer players had an increased risk of concussions over other athletes or students. The two-way frequency table below displays the number of previous concussions for students in independently selected random samples of 91 soccer players, 96 non-soccer athletes, and 53 non-athletes. These are univariate categorical data (number of concussions) from 3 independent samples. The expected counts are shown in parentheses.

Number of Concussions
                     0           1           2          3 or more   Total
Soccer Players       45 (59.9)   25 (17.1)   11 (8.3)   10 (5.7)       91
Non-Soccer Players   68 (63.2)   15 (18.0)    8 (8.8)    5 (6.0)       96
Non-Athletes         45 (34.9)    5 (10.0)    3 (4.9)    0 (3.3)       53
Total                158         45          22         15            240

Notice that two of the expected counts are less than 5. Combine the category values “2 concussions” and “3 or more concussions” to create the category value “2 or more concussions.”
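Combining categories to repair small expected counts can be sketched as below. The counts are those of the concussion table; the helper names `expected_counts` and `combine_last_two_columns` are hypothetical.

```python
def expected_counts(table):
    """Expected cell counts: (row total)(column total) / (grand total)."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]

def combine_last_two_columns(table):
    """Merge the last two categories of every row into a single category."""
    return [row[:-2] + [row[-2] + row[-1]] for row in table]

# Rows: soccer players, non-soccer athletes, non-athletes.
# Columns: 0, 1, 2, and 3 or more concussions.
observed = [[45, 25, 11, 10],
            [68, 15, 8, 5],
            [45, 5, 3, 0]]

smallest_before = min(min(row) for row in expected_counts(observed))
combined = combine_last_two_columns(observed)  # columns: 0, 1, 2 or more
smallest_after = min(min(row) for row in expected_counts(combined))

print(combined)
print(f"smallest expected count before: {smallest_before:.1f}, after: {smallest_after:.1f}")
```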

Risky Soccer Continued . . .
Number of Concussions
                     0           1           2 or more   Total
Soccer Players       45 (59.9)   25 (17.1)   21 (14.0)      91
Non-Soccer Players   68 (63.2)   15 (18.0)   13 (14.8)      96
Non-Athletes         45 (34.9)    5 (10.0)    3 (8.2)       53
Total                158         45          37            240

Step 1 (Hypotheses):
H0: The proportions in each head injury category are the same for all three groups.
Ha: The head injury category proportions are not all the same for all three groups.

Step 2 (Method):
This is a hypothesis testing problem. Random samples from three different populations were independently selected. The response is categorical. In this situation, you should consider a chi-square test of homogeneity. A significance level of 0.05 will be used in this example.

Step 3 (Check):
Because the samples were independent random samples and the expected counts are all at least 5, the chi-square test of homogeneity is appropriate.

Step 4 (Calculate):
$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} = \frac{(45-59.9)^2}{59.9} + \cdots + \frac{(3-8.2)^2}{8.2} = 20.6$$
df = (number of rows – 1)(number of columns – 1) = (3 – 1)(3 – 1) = 4
P-value: The P-value is the area to the right of 20.6 under the chi-square curve with df = 4. P-value < 0.001

Step 5 (Communicate Results):
Because the P-value is less than 0.05, H0 is rejected. There is strong evidence that the proportions in the head injury categories are not the same for the three groups compared. The largest differences between the observed and expected frequencies occur in the response categories for soccer players and for non-athletes, with soccer players having higher than expected proportions in the one and two or more head injuries categories.
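The Step 4 arithmetic can be checked with a short script. This is a sketch using only the standard library; the right-tail formula exp(-x/2)(1 + x/2) applies specifically to df = 4, and the variable names are illustrative.

```python
import math

# Rows: soccer players, non-soccer athletes, non-athletes.
# Columns: 0, 1, and 2 or more concussions (after combining categories).
observed = [[45, 25, 21],
            [68, 15, 13],
            [45, 5, 3]]

row_totals = [sum(row) for row in observed]
column_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

x2 = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * column_totals[j] / grand_total
        x2 += (count - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Right-tail area under the chi-square curve; this closed form is valid for df = 4 only.
p_value = math.exp(-x2 / 2) * (1 + x2 / 2)

print(f"X^2 = {x2:.1f}, df = {df}, P-value below 0.001: {p_value < 0.001}")
```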

Chi-Square Test for Independence
Appropriate when the following conditions are met:
1. Observed counts are from a random sample.
2. The sample size is large. The sample size is large enough for the chi-square test for independence if every expected count is at least 5. If some expected counts are less than 5, rows or columns of the table may be combined to achieve a table with satisfactory expected counts.

When these conditions are met, the following test statistic can be used:

$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

The expected cell counts are estimated from the sample data using the formula

$$\text{expected cell count} = \frac{(\text{row marginal total})(\text{column marginal total})}{\text{grand total}}$$

When the conditions above are met and the null hypothesis is true, the X² statistic has approximately a chi-square distribution with df = (number of rows – 1)(number of columns – 1).

Hypotheses:
H0: The two variables are independent.
Ha: The two variables are not independent.

The main difference between the chi-square test of homogeneity and the chi-square test of independence is the hypotheses. The hypotheses of the homogeneity test ask whether the category proportions are the same for all the populations, while the hypotheses of the independence test ask whether a relationship exists between the two variables.

Associated P-value: The P-value associated with the computed test statistic value is the area to the right of X² under the chi-square curve with df = (number of rows – 1)(number of columns – 1).

A paper examined the relationship between a nurse’s assessment of a patient’s facial expression and the patient’s self-reported level of pain. Because patients with dementia do not always give a verbal indication that they are in pain, the authors of the paper were interested in determining if there is an association between facial expression that reflects pain and self-reported pain. Data for 89 patients are summarized in the table below.

                      Self-Report
Facial Expression     No Pain   Pain
No Pain                    17     40
Pain                        3     29

Dementia Patients Continued . . .
Step 1 (Hypotheses):
H0: Facial expression and self-reported pain are independent.
Ha: Facial expression and self-reported pain are not independent.

Step 2 (Method):
You should consider a chi-square test of independence because the answers to the four key questions are hypothesis testing, sample data, two categorical variables, and one sample. Here df = (2 – 1)(2 – 1) = 1. A significance level of 0.05 will be used for this test.

Dementia Patients Continued . . .
Step 3 (Check):
The expected counts are shown in parentheses below.

                      Self-Report
Facial Expression     No Pain      Pain
No Pain               17 (12.81)   40 (44.19)
Pain                   3 (7.19)    29 (24.81)

All of the expected counts are greater than 5, so the sample size is large enough. Although the participants in the study were not randomly selected, they were thought to be representative of the population of nursing home patients with dementia.

Dementia Patients Continued . . .
Step 4 (Calculate):
$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} = \frac{(17-12.81)^2}{12.81} + \cdots + \frac{(29-24.81)^2}{24.81} = 4.92$$
P-value: The P-value is the area to the right of 4.92 under the chi-square curve with df = 1. P-value ≈ 0.025

Step 5 (Communicate Results):
Because the P-value is less than 0.05, H0 is rejected. There is convincing evidence of an association between a nurse’s assessment of facial expression and self-reported pain.
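The same steps can be sketched in code for the dementia data. This sketch uses only the standard library; for df = 1 the right-tail area equals erfc(sqrt(X²/2)), a closed form that applies to df = 1 only. A tail area computed this way need not match a value read from a printed table to the last digit, but it leads to the same decision at the 0.05 level.

```python
import math

# Rows: facial expression (no pain, pain); columns: self-report (no pain, pain).
observed = [[17, 40],
            [3, 29]]

row_totals = [sum(row) for row in observed]
column_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

expected = [[r * c / grand_total for c in column_totals] for r in row_totals]
x2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Right-tail area under the chi-square curve; this closed form is valid for df = 1 only.
p_value = math.erfc(math.sqrt(x2 / 2))

print([[round(e, 2) for e in row] for row in expected])
print(f"X^2 = {x2:.2f}, df = {df}, reject H0 at the 0.05 level: {p_value < 0.05}")
```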

Avoid These Common Mistakes

Don’t confuse tests for homogeneity with tests for independence. The hypotheses and conclusions are different for the two types of test. Tests for homogeneity are used when the individuals in each of two or more independent samples are classified according to a single categorical variable. Tests for independence are used when individuals in a single sample are classified according to two categorical variables.

Remember that a hypothesis test can never show strong support for the null hypothesis. For example, if you do not reject the null hypothesis in a chi-square test for independence, you cannot conclude that there is convincing evidence that the variables are independent. You can only say that you were not convinced that there is an association between the variables.

Be sure that the conditions for the chi-square test are met. P-values based on the chi-square distribution are only approximate, and if the large sample condition is not met, the actual P-value may be quite different from the approximate one based on the chi-square distribution. Also, for the chi-square test of homogeneity, the assumption of independent samples is particularly important.

Don’t jump to conclusions about causation. Just as a strong correlation between two numerical variables does not mean that there is a cause-and-effect relationship between them, an association between two categorical variables does not imply a causal relationship.
