Inference on Categorical Data

Chapter 12 Inference on Categorical Data

Section 12.1 Goodness-of-Fit Test

A goodness-of-fit test is an inferential procedure used to determine whether a frequency distribution follows a specific distribution.

Expected Counts
Suppose that there are n independent trials of an experiment with k ≥ 3 mutually exclusive possible outcomes. Let p1 represent the probability of observing the first outcome and E1 represent the expected count of the first outcome; p2 represent the probability of observing the second outcome and E2 represent the expected count of the second outcome; and so on. The expected counts for each possible outcome are given by
Ei = μi = npi for i = 1, 2, …, k

Parallel Example 1: Finding Expected Counts
A sociologist wishes to determine whether the distribution for the number of years care-giving grandparents are responsible for their grandchildren is different today than it was in 2000. According to the United States Census Bureau, in 2000, 22.8% of grandparents had been responsible for their grandchildren for less than 1 year; 23.9% for 1 or 2 years; 17.6% for 3 or 4 years; and 35.7% for 5 or more years. If the sociologist randomly selects 1,000 care-giving grandparents, compute the expected number within each category assuming the distribution has not changed from 2000.

Solution Step 1: The probabilities are the relative frequencies from the 2000 distribution:
p<1yr = 0.228
p1-2yr = 0.239
p3-4yr = 0.176
p≥5yr = 0.357

Solution Step 2: There are n = 1,000 trials of the experiment, so the expected counts are:
E<1yr = np<1yr = 1000(0.228) = 228
E1-2yr = np1-2yr = 1000(0.239) = 239
E3-4yr = np3-4yr = 1000(0.176) = 176
E≥5yr = np≥5yr = 1000(0.357) = 357
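The calculation Ei = npi can be sketched in a few lines of Python. This is only an illustration of the arithmetic above; the names proportions_2000 and expected_counts are made up for this sketch and do not come from the original slides.

```python
# Proportions from the 2000 distribution quoted in the example.
proportions_2000 = {
    "<1 year": 0.228,
    "1-2 years": 0.239,
    "3-4 years": 0.176,
    ">=5 years": 0.357,
}


def expected_counts(n, proportions):
    """Return the expected count E_i = n * p_i for every category."""
    return {category: n * p for category, p in proportions.items()}


counts = expected_counts(1000, proportions_2000)
for category, count in counts.items():
    print(f"{category}: {count:.0f}")
```

Because the proportions sum to 1, the expected counts sum to the sample size n.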

Example One growing concern regarding the US economy is the inequality in the distribution of income. The data in the following table represent the distribution of household income for various levels of income in 2000. An economist wants to know if the distribution of income is changing, so she randomly selects 1500 households and obtains the household income. Find the expected number of households at each income level assuming that the distribution of income has not changed since 2000. (Note: Incomes have been adjusted for inflation.)

Income                 Percent
Under $15,000          7.0
$15,000 to $24,999     8.6
$25,000 to $34,999     9.3
$35,000 to $49,999     14.3
$50,000 to $74,999     19.7
$75,000 to $99,999     15.0
At least $100,000      26.1

Test Statistic for Goodness-of-Fit Tests
Let Oi represent the observed counts of category i, Ei represent the expected counts of category i, k represent the number of categories, and n represent the number of independent trials of an experiment. Then the formula
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
approximately follows the chi-square distribution with k – 1 degrees of freedom, provided that
1. all expected frequencies are greater than or equal to 1 (all Ei ≥ 1), and
2. no more than 20% of the expected frequencies are less than 5.
Note: Ei = npi for i = 1, 2, …, k
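As a rough sketch, the test statistic and the two requirements on the expected counts could be coded as follows. The function names and the five-category counts are hypothetical and chosen only to exercise the definitions; they are not data from the text.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def requirements_met(expected):
    """True if all E_i >= 1 and no more than 20% of the E_i are below 5."""
    share_below_5 = sum(1 for e in expected if e < 5) / len(expected)
    return min(expected) >= 1 and share_below_5 <= 0.20


# Hypothetical counts for five categories (not from the text).
observed = [18, 22, 31, 25, 4]
expected = [20.0, 20.0, 30.0, 26.0, 4.0]

print("requirements satisfied:", requirements_met(expected))
print("chi-square statistic:", round(chi_square_statistic(observed, expected), 3))
```

In this made-up case exactly one of the five expected counts (20%) is below 5, so the requirements are just barely satisfied.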

The Goodness-of-Fit Test
To test hypotheses regarding a distribution, we use the steps that follow.
Step 1: Determine the null and alternative hypotheses.
H0: The random variable follows a certain distribution.
H1: The random variable does not follow the distribution in the null hypothesis.

Step 2: (a) Calculate the expected counts, Ei, for each of the k categories. The expected counts are Ei = npi for i = 1, 2, …, k, where n is the number of trials and pi is the probability of the ith category, assuming that the null hypothesis is true.
(b) Verify that the requirements for the goodness-of-fit test are satisfied:
1. All expected counts are greater than or equal to 1 (all Ei ≥ 1).
2. No more than 20% of the expected counts are less than 5.

Step 2 (continued): (c) Compute the test statistic:
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
Note: Oi is the observed count for the ith category.

Classical Approach
Step 3: Determine the critical value. All goodness-of-fit tests are right-tailed tests, so the critical value is $\chi^2_{\alpha}$ with k – 1 degrees of freedom. Compare the critical value to the test statistic. If $\chi^2 > \chi^2_{\alpha}$, reject the null hypothesis.
Step 4: State the conclusion.

Parallel Example 2: Conducting a Goodness-of-Fit Test
A sociologist wishes to determine whether the distribution for the number of years care-giving grandparents are responsible for their grandchildren is different today than it was in 2000. According to the United States Census Bureau, in 2000, 22.8% of grandparents had been responsible for their grandchildren for less than 1 year; 23.9% for 1 or 2 years; 17.6% for 3 or 4 years; and 35.7% for 5 or more years. The sociologist randomly selects 1,000 care-giving grandparents and obtains the observed counts listed in Step 2 of the solution.

Test the claim that the distribution is different today than it was in 2000 at the α = 0.05 level of significance.

Solution Step 1: We want to know if the distribution today is different than it was in 2000. The hypotheses are then:
H0: The distribution for the number of years care-giving grandparents are responsible for their grandchildren is the same today as it was in 2000.
H1: The distribution for the number of years care-giving grandparents are responsible for their grandchildren is different today than it was in 2000.

Solution The level of significance is α = 0.05.
Step 2: (a) The expected counts were computed in Example 1.

Number of Years   Observed Counts   Expected Counts
<1                252               228
1-2               255               239
3-4               162               176
≥5                331               357

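Solution Step 2: (b) Since all expected counts are greater than or equal to 5, the requirements for the goodness-of-fit test are satisfied.
(c) The test statistic is
$$\chi^2 = \frac{(252-228)^2}{228} + \frac{(255-239)^2}{239} + \frac{(162-176)^2}{176} + \frac{(331-357)^2}{357} \approx 6.605$$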

Solution: Classical Approach
Step 3: There are k = 4 categories, so we find the critical value using 4 – 1 = 3 degrees of freedom. The critical value is $\chi^2_{0.05} = 7.815$. Since the test statistic, $\chi^2 \approx 6.605$, is less than the critical value, 7.815, we fail to reject the null hypothesis.

Solution Step 4: There is insufficient evidence to conclude that the distribution for the number of years care-giving grandparents are responsible for their grandchildren is different today than it was in 2000 at the α = 0.05 level of significance.
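The classical approach for this example can be sketched in Python as shown below. This is an illustrative sketch, not part of the original solution: the function name goodness_of_fit is hypothetical, and the critical values are copied by hand from a standard chi-square table for α = 0.05 rather than computed.

```python
# Observed and expected counts from the grandparents example.
observed = [252, 255, 162, 331]
expected = [228, 239, 176, 357]

# Right-tail chi-square critical values for alpha = 0.05, copied from a
# standard table and keyed by degrees of freedom (assumption: hand-entered).
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def goodness_of_fit(observed, expected, critical_values):
    """Return (statistic, critical value, reject H0?) for a right-tailed test."""
    df = len(observed) - 1  # k - 1 degrees of freedom
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    critical = critical_values[df]
    return statistic, critical, statistic > critical


statistic, critical, reject = goodness_of_fit(observed, expected, CRITICAL_05)
print(f"chi-square = {statistic:.3f}, critical value = {critical}, reject H0: {reject}")
```

The statistic comes out near 6.605, below the tabled value 7.815 for 3 degrees of freedom, so the sketch reaches the same fail-to-reject decision as Step 3.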

Example One growing concern regarding the US economy is the inequality in the distribution of income. An economist wants to know if the distribution of income is changing, so she randomly selects 1500 households and obtains the household income shown in the following table. The table also contains the expected counts under the assumption that the distribution hasn’t changed. Does the evidence suggest that the distribution of income has changed since 2000 at the α = 0.05 level of significance?

Income                 Observed Count   Expected Count
Under $15,000          130              105
$15,000 to $24,999     137              129
$25,000 to $34,999     150              139.5
$35,000 to $49,999     207              214.5
$50,000 to $74,999     291              295.5
$75,000 to $99,999     202              225
At least $100,000      383              391.5

Example An obstetrician wants to know whether the proportion of children born on each day of the week is the same. She randomly selects 500 birth records and obtains the data in the following table. Is there reason to believe that the day on which a child is born does not occur with equal frequency at the α = level of significance?

Day of Week   Frequency
Sunday        46
Monday        76
Tuesday       83
Wednesday     81
Thursday      81
Friday        80
Saturday      53
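When the null hypothesis says that every category is equally likely, each pi is simply 1/k. A minimal sketch of the expected counts for this example follows; it assumes only the sample size of 500 stated in the problem, and the variable names are illustrative.

```python
days = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]
n = 500  # number of birth records sampled

# Under H0 every day is equally likely, so p_i = 1/7 for each day.
p = 1 / len(days)
expected = {day: n * p for day in days}

for day, count in expected.items():
    print(f"{day}: {count:.2f}")
```

Each expected count is 500/7, or about 71.43, which is well above 5, so the requirements for the goodness-of-fit test hold.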

Example A player in a craps game suspects that one of the dice is loaded. A loaded die is one in which all the possibilities are not equally likely. The player throws the die 400 times, records the outcome after each throw, and obtains the following results:

Outcome   Frequency
1         62
2         76
3
4
5         57
6         67

Example Do you think that the die is loaded? Use the α = 0.01 level of significance. Why do you think the player might conduct the test at the α = 0.01 level of significance rather than, say, the α = 0.1 level of significance?

Section 12.2 Tests for Independence and the Homogeneity of Proportions

The chi-square test for independence is used to determine whether there is an association between a row variable and column variable in a contingency table constructed from sample data. The null hypothesis is that the variables are not associated; in other words, they are independent. The alternative hypothesis is that the variables are associated, or dependent.

“In Other Words”
In a chi-square independence test, the null hypothesis is always
H0: The variables are independent.
The alternative hypothesis is always
H1: The variables are not independent.

The idea behind testing these types of claims is to compare actual counts to the counts we would expect if the null hypothesis were true (if the variables are independent). If a significant difference between the actual counts and expected counts exists, we would take this as evidence against the null hypothesis.

Expected Frequencies in a Chi-Square Test for Independence
To find the expected frequency in a cell when performing a chi-square independence test, multiply the cell’s row total by its column total and divide this result by the table total. That is,
$$\text{Expected frequency} = \frac{(\text{row total})(\text{column total})}{\text{table total}}$$

Parallel Example 1: Determining the Expected Counts in a Test for Independence
In a poll, 883 males and 893 females were asked “If you could have only one of the following, which would you pick: money, health, or love?” Their responses are presented in the table below. Determine the expected counts within each cell assuming that gender and response are independent.
Source: Based on a Fox News Poll conducted in January, 1999

Solution Step 1: We first compute the row and column totals:

                Money   Health   Love   Row Totals
Men             82      446      355    883
Women           46      574      273    893
Column totals   128     1020     628    1776
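The rule (row total)(column total)/(table total) can then be applied to every cell. The following Python sketch does this for the poll data; the helper name expected_frequencies is an assumption made for illustration and is not taken from the slides.

```python
def expected_frequencies(table):
    """Expected count per cell: (row total * column total) / table total."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    table_total = sum(row_totals)
    return [[r * c / table_total for c in column_totals] for r in row_totals]


# Observed counts from the poll.
# Rows: Men, Women. Columns: Money, Health, Love.
observed = [[82, 446, 355],
            [46, 574, 273]]

expected = expected_frequencies(observed)
for label, row in zip(["Men", "Women"], expected):
    print(label, [round(e, 3) for e in row])
```

For example, the expected count for men who chose money is 883(128)/1776, or about 63.64. Each row and column of expected counts adds up to the same total as the observed counts.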

Example Is there a relationship between marital status and happiness? The data in the following table show the marital status and happiness of individuals who participated in the General Social Survey. Compute the expected counts within each cell, assuming marital status and happiness are independent.

                 Marital Status
Happiness        Married   Widowed   Divorced/Separated   Never Married   Row Totals
Very Happy       600       63        112                  144             919
Pretty Happy     720       142       355                  459             1676
Not Too Happy    93        51        119                  127             390
Column Totals    1413      256       586                  730             2985

Test Statistic for the Test of Independence
Let Oi represent the observed number of counts in the ith cell and Ei represent the expected number of counts in the ith cell. Then
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
approximately follows the chi-square distribution with (r – 1)(c – 1) degrees of freedom, where r is the number of rows and c is the number of columns in the contingency table, provided that (1) all expected frequencies are greater than or equal to 1 and (2) no more than 20% of the expected frequencies are less than 5.

Chi-Square Test for Independence
To test a hypothesis regarding the association between (or independence of) two variables in a contingency table, we use the steps that follow:
Step 1: Determine the null and alternative hypotheses.
H0: The row variable and column variable are independent.
H1: The row variable and column variable are dependent.

Step 2: (a) Calculate the expected frequencies (counts) for each cell in the contingency table.
(b) Verify that the requirements for the chi-square test for independence are satisfied:
1. All expected frequencies are greater than or equal to 1 (all Ei ≥ 1).
2. No more than 20% of the expected frequencies are less than 5.

Step 2 (continued): (c) Compute the test statistic:
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
Note: Oi is the observed count for the ith cell.

Classical Approach
Step 3: Determine the critical value. All chi-square tests for independence are right-tailed tests, so the critical value is $\chi^2_{\alpha}$ with (r – 1)(c – 1) degrees of freedom, where r is the number of rows and c is the number of columns in the contingency table. Compare the critical value to the test statistic. If $\chi^2 > \chi^2_{\alpha}$, reject the null hypothesis.
Step 4: State the conclusion.

Parallel Example 2: Performing a Chi-Square Test for Independence
In a poll, 883 males and 893 females were asked “If you could have only one of the following, which would you pick: money, health, or love?” Their responses are presented in the table below. Test the claim that gender and response are independent at the α = 0.05 level of significance. Source: Based on a Fox News Poll conducted in January, 1999

Solution Step 1: We want to know whether gender and response are dependent or independent, so the hypotheses are:
H0: Gender and response are independent.
H1: Gender and response are dependent.
The level of significance is α = 0.05.

Solution Step 2: (a) The expected frequencies were computed in Example 1 and are given in parentheses in the table below, along with the observed frequencies.

        Money         Health          Love
Men     82 (63.640)   446 (507.128)   355 (312.232)
Women   46 (64.360)   574 (512.872)   273 (315.768)

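Solution Step 2: (b) Since none of the expected frequencies are less than 5, the requirements for the chi-square test for independence are satisfied.
(c) The test statistic is
$$\chi^2 = \frac{(82-63.640)^2}{63.640} + \frac{(446-507.128)^2}{507.128} + \cdots + \frac{(273-315.768)^2}{315.768} \approx 36.84$$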

Step 3: There are r = 2 rows and c = 3 columns, so we find the critical value using (2 – 1)(3 – 1) = 2 degrees of freedom. The critical value is $\chi^2_{0.05} = 5.991$. Since the test statistic, $\chi^2 \approx 36.84$, is greater than the critical value, 5.991, we reject the null hypothesis.
Step 4: There is sufficient evidence to conclude that gender and response are dependent at the α = 0.05 level of significance.
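The whole test for independence on the poll data can be sketched in Python as below. This is an illustration under stated assumptions: the function name independence_test is hypothetical, and the critical values are a small hand-copied excerpt of a standard chi-square table rather than computed values.

```python
# Right-tail chi-square critical values copied from a standard table,
# keyed by (alpha, degrees of freedom). Only a few entries are included.
CRITICAL_VALUES = {
    (0.05, 1): 3.841, (0.05, 2): 5.991, (0.05, 3): 7.815, (0.05, 4): 9.488,
    (0.01, 1): 6.635, (0.01, 2): 9.210, (0.01, 3): 11.345, (0.01, 4): 13.277,
}


def independence_test(table, alpha):
    """Classical chi-square test for independence on a contingency table."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    total = sum(row_totals)
    expected = [[r * c / total for c in column_totals] for r in row_totals]

    # Requirements: all E_i >= 1 and no more than 20% of the E_i below 5.
    cells = [e for row in expected for e in row]
    if min(cells) < 1 or sum(e < 5 for e in cells) / len(cells) > 0.20:
        raise ValueError("requirements for the chi-square test are not met")

    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    critical = CRITICAL_VALUES[(alpha, df)]
    return statistic, df, critical, statistic > critical


# Rows: Men, Women. Columns: Money, Health, Love.
poll = [[82, 446, 355],
        [46, 574, 273]]
statistic, df, critical, reject = independence_test(poll, 0.05)
print(f"chi-square = {statistic:.2f}, df = {df}, "
      f"critical value = {critical}, reject H0: {reject}")
```

The sketch gives a statistic of about 36.84 with 2 degrees of freedom, far above 5.991, which matches the decision to reject the null hypothesis.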

Example Does one’s happiness depend on one’s marital status? We use the data from the earlier marital status table to answer the question. Use the α = 0.05 level of significance.

(Expected frequencies are shown in parentheses.)

                 Marital Status
Happiness        Married    Widowed       Divorced/Separated   Never Married
Very Happy       600 ( )    63 (78.816)   112 ( )              144 ( )
Pretty Happy     720 ( )    142 ( )       355 ( )              459 ( )
Not Too Happy    93 ( )     51 (33.447)   119 (76.562)         127 (95.377)

In a chi-square test for homogeneity of proportions, we test whether different populations have the same proportion of individuals with some characteristic. The procedures for performing a test of homogeneity are identical to those for a test of independence.

Parallel Example 5: A Test for Homogeneity of Proportions
The following question was asked of a random sample of individuals in 1992, 2002, and 2008: “Would you tell me if you feel being a teacher is an occupation of very great prestige?” The results of the survey are presented below:

       1992   2002   2008
Yes    418    479    525
No     602    541    485

Source: The Harris Poll

Test the claim that the proportion of individuals who feel that being a teacher is an occupation of very great prestige is the same for each year at the α = 0.01 level of significance.

Solution Step 1: The null hypothesis is a statement of “no difference,” so the proportions for each year who feel that being a teacher is an occupation of very great prestige are equal. We state the hypotheses as follows:
H0: p1 = p2 = p3
H1: At least one of the proportions is different from the others.
The level of significance is α = 0.01.

Solution Step 2: (a) The expected frequencies are found by multiplying the appropriate row and column totals and then dividing by the total sample size. They are given in parentheses in the table below, along with the observed frequencies.

       1992            2002            2008
Yes    418 (475.554)   479 (475.554)   525 (470.892)
No     602 (544.446)   541 (544.446)   485 (539.108)

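Solution Step 2: (b) Since none of the expected frequencies are less than 5, the requirements are satisfied.
(c) The test statistic is
$$\chi^2 = \frac{(418-475.554)^2}{475.554} + \frac{(479-475.554)^2}{475.554} + \cdots + \frac{(485-539.108)^2}{539.108} \approx 24.74$$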

Step 3: There are r = 2 rows and c = 3 columns, so we find the critical value using (2 – 1)(3 – 1) = 2 degrees of freedom. The critical value is $\chi^2_{0.01} = 9.210$. Because the test statistic, $\chi^2 \approx 24.74$, is greater than the critical value, 9.210, we reject the null hypothesis.

Solution Step 4: There is sufficient evidence to reject the null hypothesis at the α = 0.01 level of significance. We conclude that the proportion of individuals who believe that teaching is a very prestigious career is different for at least one of the three years.
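Because the homogeneity test uses the same arithmetic as the test for independence, the example can be checked with a short Python sketch. One assumption is worth flagging: instead of a table lookup, the sketch uses the fact that for exactly 2 degrees of freedom the right-tail area of the chi-square distribution is exp(-x/2), so the critical value is -2 ln(α). The function name homogeneity_test_2df is hypothetical.

```python
import math


def homogeneity_test_2df(table, alpha):
    """Chi-square test of homogeneity, restricted to tables with 2 df."""
    df = (len(table) - 1) * (len(table[0]) - 1)
    if df != 2:
        raise ValueError("the closed-form critical value only holds for 2 df")

    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    # For 2 df, P(chi-square > x) = exp(-x / 2), so solve exp(-x / 2) = alpha.
    critical = -2 * math.log(alpha)
    return statistic, critical, statistic > critical


# Rows: Yes, No. Columns: 1992, 2002, 2008 (Harris Poll counts from the example).
prestige = [[418, 479, 525],
            [602, 541, 485]]
statistic, critical, reject = homogeneity_test_2df(prestige, 0.01)
print(f"chi-square = {statistic:.2f}, critical value = {critical:.3f}, "
      f"reject H0: {reject}")
```

The statistic is about 24.74 and the critical value about 9.210, in line with Step 3. For tables with other degrees of freedom, a chi-square table or statistical software would be needed for the critical value.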

Example Zocor is a drug that is meant to reduce the level of LDL (bad) cholesterol and increase the level of HDL (good) cholesterol. In clinical trials of the drug, patients were randomly divided into three groups. Group 1 got Zocor, group 2 got a placebo, and group 3 got another drug to help cholesterol. The table contains the number of patients in each group who did and did not have abdominal pain as a side effect. Is there evidence to indicate that the proportion of subjects in each group who had pain is different at the α = 0.01 level of significance?

                                                    Group 1   Group 2   Group 3
Number of people who experienced abdominal pain     51        5         16
Number of people who did not                        1532      152       163
