# Discrete (Categorical) Data Analysis


TOPIC 10 Discrete (Categorical) Data Analysis

Discrete Random Variables
Recall that discrete random variables may take only discrete values. For example:

- Number of errors in a software product: 0, 1, 2, 3, 4, …
- Categories of a product’s quality level: high, medium, or low
- Characteristics of a machine breakdown: mechanical failure, electrical failure, or operator misuse

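Sample Proportions
Recall that the success probability $p$ can be estimated by the sample proportion

$$\hat{p} = \frac{x}{n},$$

where $x$ is the number of successes in $n$ trials. For large enough values of $n$, the sample proportion can be taken to have approximately the normal distribution

$$\hat{p} \sim N\!\left(p,\ \frac{p(1-p)}{n}\right).$$

This expression may be written in terms of a standard normal distribution as

$$Z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \sim N(0, 1),$$

where $\sqrt{p(1-p)/n}$ is the standard error of $\hat{p}$.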

Confidence Interval Estimation for p
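Since the success probability $p$ is unknown, we replace $p$ in the standard error with its estimate $\hat{p}$, which gives the $(1-\alpha)100\%$ confidence interval

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$

Assumptions: the sample is random, and $n$ is large enough for the normal approximation to hold.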

Example
You’re a production manager for a newspaper. You want to find the percentage of defective newspapers. Of 200 newspapers, 35 had defects. What is the 90% confidence interval estimate of the population proportion defective?

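Example Solution
With $\hat{p} = 35/200 = 0.175$ and $z_{.05} = 1.645$:

$$0.175 \pm 1.645\sqrt{\frac{0.175 \times 0.825}{200}} = 0.175 \pm 0.044,$$

so $0.131 \le p \le 0.219$.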
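The interval for the newspaper example can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `proportion_ci` is illustrative and does not come from the slides, and it assumes the large-sample normal approximation described above.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence):
    """Large-sample (normal approximation) confidence interval for p."""
    p_hat = successes / n
    # Two-sided critical value z_{alpha/2} of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Margin of error: critical value times the estimated standard error.
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Newspaper example: 35 defective out of 200, 90% confidence.
low, high = proportion_ci(35, 200, 0.90)
print(f"90% CI for p: ({low:.3f}, {high:.3f})")
```

The same function can be reused for the election-poll exercise by changing the counts and the confidence level.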

Sample Size for Estimating p
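I don’t want to sample too much or too little! The required sample size is

$$n = \frac{z_{\alpha/2}^2\, p(1-p)}{SE^2},$$

where $SE$ is the sampling error (the half-width of the confidence interval). If no estimate of $p$ is available, use $p = 1 - p = 0.5$.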

Example
What sample size is needed to estimate $p$ with 90% confidence and a width $L$ of .03?
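A short sketch of the sample-size calculation follows. It assumes that the width $L$ is the full width of the interval, so the sampling error is $L/2 = .015$, and that no prior estimate of $p$ is available, so $p = 0.5$. Both are assumptions about the example, and the function name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(sampling_error, confidence, p=0.5):
    """Smallest n whose margin of error is at most sampling_error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # n = z^2 * p * (1 - p) / SE^2, rounded up to the next whole unit.
    return ceil(z**2 * p * (1 - p) / sampling_error**2)


width = 0.03  # assumed to be the full interval width L
n = sample_size_for_proportion(width / 2, 0.90)
print(n)
```

If .03 were instead meant as the half-width, the call would be `sample_size_for_proportion(0.03, 0.90)`, which gives a much smaller sample.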

Exercises
1. In an election poll, a random sample of 500 people showed that 42 preferred voting for a particular candidate. Set up a 90% confidence interval estimate for the population proportion $p$ of voters who prefer that candidate.
2. Suppose that the auditing procedures require you to have 95% confidence in estimating the population proportion of sales invoices with errors within ± […]. The results from the past months indicate that the largest proportion has been no more than […]. Find the sample size needed to satisfy the requirements of the company.

Z Test of Hypothesis for the Proportion
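The one-sample Z test for the proportion uses the statistic

$$Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}, \qquad \hat{p} = \frac{x}{n},$$

where

- $p_0$ is the hypothesized proportion of successes in the population,
- $\hat{p}$ is the sample proportion of successes,
- $x$ is the number of items having the characteristic of interest,
- $n$ is the sample size.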

Example
You’re an accounting manager. A year-end audit showed 4% of transactions had errors. You implement new procedures. A random sample of 500 transactions had 25 errors. Has the proportion of incorrect transactions changed at the .05 level of significance?

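Example Solution
- H0: $p = .04$
- Ha: $p \ne .04$
- $\alpha = .05$, so $\alpha/2 = .025$ in each tail
- $n = 500$
- Critical values: $\pm 1.96$ (reject H0 if $Z < -1.96$ or $Z > 1.96$)
- Test statistic: $Z = \dfrac{25/500 - .04}{\sqrt{.04(1 - .04)/500}} = \dfrac{.05 - .04}{.00876} \approx 1.14$
- Decision: Do not reject H0 at $\alpha = .05$
- Conclusion: There is no evidence that the proportion of incorrect transactions has changed from 4%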
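The audit example can be reproduced with a few lines of Python. This is a sketch under the same large-sample assumption; the function name is illustrative, and the two-sided p-value is an addition that the slides do not report.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_proportion_z(successes, n, p0):
    """Z statistic and two-sided p-value for H0: p = p0."""
    p_hat = successes / n
    # The standard error uses the hypothesized p0, not the sample proportion.
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = one_sample_proportion_z(25, 500, 0.04)
critical = NormalDist().inv_cdf(1 - 0.05 / 2)  # about 1.96
reject = abs(z) > critical
print(f"Z = {z:.2f}, reject H0: {reject}")
```

For a one-tailed alternative, such as the drive-through exercise, the comparison would use `z > NormalDist().inv_cdf(1 - alpha)` instead of the two-sided rule.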

Exercise
A fast-food chain has developed a new process to ensure that orders at the drive-through are filled correctly. The previous process filled orders correctly 85% of the time. Based on a sample of 100 orders using the new process, 94 were filled correctly. At a 0.01 level of significance, can you conclude that the new process has increased the proportion of orders filled correctly?

Large-Sample Inference about p1 – p2
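Assumptions:

- Independent, random samples
- Normal approximation can be used if $n_1\hat{p}_1$, $n_1\hat{q}_1$, $n_2\hat{p}_2$, and $n_2\hat{q}_2$ are all large enough, where $\hat{q}_i = 1 - \hat{p}_i$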

Large-Sample Inference about p1 – p2
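$(1 - \alpha)100\%$ confidence interval for $(p_1 - p_2)$:

$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}},$$

where $\hat{p}_i = x_i/n_i$ and $\hat{q}_i = 1 - \hat{p}_i$.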

Example
As personnel director, you want to test the perception of fairness of two methods of performance evaluation. 63 of 78 employees rated Method 1 as fair. 49 of 82 rated Method 2 as fair. Find a 99% confidence interval for the difference in perceptions.

To check assumptions, use the sample proportions as estimators of the population proportions:

- n1·p1 = 78·63/78 = 63
- n1·q1 = 78·(1 − 63/78) = 15
- n2·p2 = 82·49/82 = 49
- n2·q2 = 82·(1 − 49/82) = 33

Example Solution
With $\hat{p}_1 = 63/78 \approx .808$, $\hat{p}_2 = 49/82 \approx .598$, and $z_{.005} = 2.58$:

$$(.808 - .598) \pm 2.58\sqrt{\frac{.808 \times .192}{78} + \frac{.598 \times .402}{82}} = .210 \pm .181,$$

so $.029 \le p_1 - p_2 \le .391$.
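The 99% interval for the performance-evaluation example can be checked with the sketch below. It uses the unpooled standard error from the confidence interval formula; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ci(x1, n1, x2, n2, confidence):
    """Large-sample confidence interval for p1 - p2 (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


# Method 1: 63 of 78 rated it fair; Method 2: 49 of 82 rated it fair.
low, high = two_proportion_ci(63, 78, 49, 82, 0.99)
print(f"99% CI for p1 - p2: ({low:.3f}, {high:.3f})")
```

The code uses the exact normal quantile (about 2.576) rather than the table value 2.58, so the endpoints can differ from a hand calculation in the fourth decimal place.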

Hypothesis Testing for Two Proportions
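| Hypothesis | No difference vs. any difference | Pop 1 ≥ Pop 2 vs. Pop 1 < Pop 2 | Pop 1 ≤ Pop 2 vs. Pop 1 > Pop 2 |
|---|---|---|---|
| H0 | $p_1 - p_2 = 0$ | $p_1 - p_2 \ge 0$ | $p_1 - p_2 \le 0$ |
| Ha | $p_1 - p_2 \ne 0$ | $p_1 - p_2 < 0$ | $p_1 - p_2 > 0$ |

Z test statistic:

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}, \qquad \text{where } \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$

and $(p_1 - p_2)$ is the hypothesized difference. The rejection region is determined in the same way as in the one-sample tests.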

Example
As personnel director, you want to test the perception of fairness of two methods of performance evaluation. 63 of 78 employees rated Method 1 as fair. 49 of 82 rated Method 2 as fair. At the .01 level of significance, is there a difference in perceptions?

To check assumptions, use the sample proportions as estimators of the population proportions:

- n1·p1 = 78·63/78 = 63
- n1·q1 = 78·(1 − 63/78) = 15
- n2·p2 = 82·49/82 = 49
- n2·q2 = 82·(1 − 49/82) = 33

Example Solution
- H0: $p_1 - p_2 = 0$
- Ha: $p_1 - p_2 \ne 0$
- $\alpha = .01$
- $n_1 = 78$, $n_2 = 82$
- Critical values: $\pm 2.58$ (.005 in each tail)
- Test statistic: with pooled $\hat{p} = (63 + 49)/(78 + 82) = .70$, $Z = \dfrac{.808 - .598}{\sqrt{.70 \times .30 \times (1/78 + 1/82)}} = +2.90$
- Decision: Reject H0 at $\alpha = .01$
- Conclusion: There is evidence of a difference in proportions
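The test statistic Z = +2.90 can be reproduced with the pooled-proportion formula. A minimal sketch, with an illustrative function name and a hypothesized difference of 0:

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for H0: p1 - p2 = 0 using the pooled sample proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


z = two_proportion_z(63, 78, 49, 82)
critical = NormalDist().inv_cdf(1 - 0.01 / 2)  # about 2.58
print(f"Z = {z:.2f}, reject H0: {abs(z) > critical}")
```

Note that the test pools the two samples under H0, whereas the confidence interval for the same data uses the unpooled standard error.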

Chi-Square Tests for k Proportions
This topic extends hypothesis testing to analyze differences between population proportions based on two or more samples. Qualitative data that fall in more than two categories often result from a multinomial experiment. Some of the characteristics of the multinomial experiment are:

- The probabilities of the k outcomes, denoted p1, p2, … , pk, remain the same from trial to trial, where p1 + p2 + … + pk = 1
- The trials are independent

Recall that a binomial experiment is a multinomial experiment with k = 2.

Chi-Square (2) Tests Populations p1 = p2 = p3 = p4 = ….. pk
Evidence to accept/reject our claim Populations 2 Test for equality of proportions p1 = p2 = p3 = p4 = ….. pk Observed and expected frequencies x , e Draw Sample

Road Map
Decision making: one/two samples, analysis of variance, χ² tests (one-way table, two-way table).

Multinomial Experiment
- n identical and independent trials
- k outcomes to each trial
- Constant outcome probability, pk
- Random variable is count, nk
- Example: ask 100 people (n) which of 3 candidates (k) they will vote for
- Uses one-way contingency table: shows number of observations in k independent groups (outcomes or variable levels)

One Way Contingency Table
Outcomes (k = 3):

| Candidate | Tom | Bill | Mary | Total |
|---|---|---|---|---|
| Number of responses | 35 | 20 | 45 | 100 |

χ² Test Basic Idea
- Compares observed frequency (xi) to expected frequency (ei), assuming the null hypothesis is true
- The closer the observed frequency is to the expected frequency, the more likely it is that H0 is true
- Measured by the squared difference relative to the expected frequency
- Reject H0 for large values of the statistic

Assumptions:

- A multinomial experiment has been conducted
- The sample size n is large: ei is greater than or equal to 5 for every cell (i = 1, 2, 3, …, k)

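χ² Test for k Proportions
1. Hypotheses
   - H0: $p_1 = p_{1,0},\ p_2 = p_{2,0},\ \ldots,\ p_k = p_{k,0}$, where $p_{i,0}$ is the hypothesized probability of outcome $i$
   - Ha: At least one $p_i$ is different from above
2. Test statistic

   $$\chi^2 = \sum_{i=1}^{k} \frac{(x_i - e_i)^2}{e_i},$$

   where $x_i$ is the observed frequency and $e_i = n\,p_{i,0}$ is the expected frequency
3. Degrees of freedom: $k - 1$, where $k$ is the number of outcomes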

Finding Critical Value Example
What is the critical χ² value if k = 3 and α = .05?

With df = k − 1 = 2 and an upper tail area of α = .05, the χ² table gives a critical value of 5.991.

χ² Table (Portion), by upper tail area:

| df | .995 | .95 | .05 |
|---|---|---|---|
| 1 | ... | 0.004 | 3.841 |
| 2 | 0.010 | 0.103 | 5.991 |

Reject H0 if χ² > 5.991; otherwise, do not reject H0. If xi = ei for every cell, then χ² = 0.
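The two rows of the table portion can be reproduced without a printed table, because both have closed forms: a χ² variable with 1 degree of freedom is the square of a standard normal variable, and a χ² variable with 2 degrees of freedom has upper tail area $e^{-x/2}$. The sketch below covers only these two cases, and the function names are illustrative; for other degrees of freedom a statistical table or library is needed.

```python
from math import log
from statistics import NormalDist


def chi2_critical_df1(upper_tail_area):
    """Critical value for 1 df: the square of the two-sided normal critical value."""
    return NormalDist().inv_cdf(1 - upper_tail_area / 2) ** 2


def chi2_critical_df2(upper_tail_area):
    """Critical value for 2 df: solve exp(-x / 2) = upper_tail_area for x."""
    return -2 * log(upper_tail_area)


for area in (0.95, 0.05):
    print(area, round(chi2_critical_df1(area), 3), round(chi2_critical_df2(area), 3))
```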

χ² Test for k Proportions Example
As personnel director, you want to test the perception of fairness of three methods of performance evaluation. Of 180 employees, 63 rated Method 1 as fair, 45 rated Method 2 as fair, and 72 rated Method 3 as fair. At the .05 level of significance, is there a difference in perceptions?

To check assumptions, compute the expected frequency under H0: ei = 180 × 1/3 = 60, which is at least 5 for every cell.

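χ² Test for k Proportions Solution
Observed frequencies: x1 = 63, x2 = 45, x3 = 72. Expected frequencies under H0: e1 = e2 = e3 = 180 × 1/3 = 60.

$$\chi^2 = \frac{(63 - 60)^2}{60} + \frac{(45 - 60)^2}{60} + \frac{(72 - 60)^2}{60} = \frac{9 + 225 + 144}{60} = 6.3$$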

χ² Test for k Proportions Solution
- H0: p1 = p2 = p3 = 1/3
- Ha: At least 1 is different
- α = .05
- n1 = 63, n2 = 45, n3 = 72
- Critical value: 5.991 (df = 2, α = .05)
- Test statistic: χ² = 6.3
- Decision: Reject H0 at α = .05
- Conclusion: There is evidence of a difference in proportions
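The statistic χ² = 6.3 can be reproduced directly from the observed counts and the hypothesized probabilities. A minimal sketch; the function name is illustrative, and the critical value 5.991 is taken from the table portion shown earlier.

```python
def chi_square_gof(observed, probabilities):
    """Chi-square statistic for k proportions, plus the expected counts."""
    n = sum(observed)
    expected = [n * p for p in probabilities]
    stat = sum((x - e) ** 2 / e for x, e in zip(observed, expected))
    return stat, expected


observed = [63, 45, 72]  # Methods 1, 2, and 3 rated as fair
stat, expected = chi_square_gof(observed, [1 / 3] * 3)

assumption_ok = all(e >= 5 for e in expected)  # every expected count is at least 5
reject = stat > 5.991  # table value for df = k - 1 = 2, alpha = .05
print(f"chi-square = {stat:.1f}, assumption ok: {assumption_ok}, reject H0: {reject}")
```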

Road Map
Decision making: one/two samples, analysis of variance, χ² tests (one-way table; two-way table: test of independence).

χ² Test of Independence
- Shows if a relationship exists between two qualitative (categorical) variables
- One sample is drawn
- Does not show causality
- Uses two-way contingency table

Assumptions:

- Multinomial experiment has been conducted
- The sample size, n, is large: eij is greater than or equal to 5 for every cell

Two-Way Contingency Table
Shows the number of observations from one sample classified jointly by two qualitative variables. The levels of variable 1 label one dimension of the table and the levels of variable 2 label the other.

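χ² Test of Independence
Hypotheses:

- H0: Variables are independent
- Ha: Variables are related (dependent)

Test statistic:

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(x_{ij} - e_{ij})^2}{e_{ij}},$$

where $x_{ij}$ is the observed frequency and $e_{ij}$ is the expected frequency in row $i$ and column $j$.

Degrees of freedom: $(r - 1)(c - 1)$, where $r$ is the number of rows and $c$ is the number of columns.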

χ² Test of Independence Expected Frequencies
- Statistical independence means joint probability equals product of marginal probabilities
- Compute marginal probabilities and multiply for joint probability
- Expected frequency is sample size times joint probability

$$e = \frac{\text{Row Total} \times \text{Column Total}}{\text{Sample Size}}$$

Expected Frequency Example
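| House style | Urban (obs.) | Rural (obs.) | Total |
|---|---|---|---|
| Split-level | 63 | 49 | 112 |
| Ranch | 15 | 33 | 48 |
| Total | 78 | 82 | 160 |

For the split-level, urban cell:

- Marginal probability (split-level row) = 112/160
- Marginal probability (urban column) = 78/160
- Joint probability = (112/160) × (78/160)
- Expected freq. = 160 × (112/160) × (78/160) = 54.6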

Expected Frequency Calculation
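$$e_{ij} = \frac{r_i\, c_j}{n},$$

where $r_i$ is the total frequency in the $i$-th row, $c_j$ is the total frequency in the $j$-th column, and $n$ is the sample size.

| House style | Urban obs. | Urban exp. | Rural obs. | Rural exp. | Total |
|---|---|---|---|---|---|
| Split-level | 63 | 112×78/160 = 54.6 | 49 | 112×82/160 = 57.4 | 112 |
| Ranch | 15 | 48×78/160 = 23.4 | 33 | 48×82/160 = 24.6 | 48 |
| Total | 78 | 78 | 82 | 82 | 160 |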
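The expected frequencies in the table follow mechanically from the row and column totals. The sketch below computes them for any two-way table stored as a list of rows; the function name and data layout are illustrative choices, not part of the slides.

```python
def expected_counts(table):
    """Expected frequencies under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Rows: split-level, ranch. Columns: urban, rural.
observed = [[63, 49], [15, 33]]
for row in expected_counts(observed):
    print([round(e, 1) for e in row])
```

Each row and column of the expected table sums to the same total as the observed table, which is a quick check on the arithmetic.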

Example
As a realtor, you want to determine if house style and house location are related. At the .05 level of significance, is there evidence of a relationship?

Example Solution
eij ≥ 5 in all cells:

- e11 = 112×78/160 = 54.6
- e12 = 112×82/160 = 57.4
- e21 = 48×78/160 = 23.4
- e22 = 48×82/160 = 24.6

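Example Solution
Test statistic:

$$\chi^2 = \frac{(63 - 54.6)^2}{54.6} + \frac{(49 - 57.4)^2}{57.4} + \frac{(15 - 23.4)^2}{23.4} + \frac{(33 - 24.6)^2}{24.6} = 8.41$$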

Example Solution
- H0: No relationship
- Ha: Relationship
- α = .05
- df = (2 – 1)(2 – 1) = 1
- Critical value: 3.841
- Test statistic: χ² = 8.41
- Decision: Reject H0 at α = .05
- Conclusion: There is evidence of a relationship
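The whole test of independence for the house example fits in one short function. This is a sketch with an illustrative function name; it returns the statistic and the degrees of freedom, and the critical value 3.841 is the table value for df = 1 and α = .05.

```python
def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, x in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under H0
            stat += (x - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Rows: split-level, ranch. Columns: urban, rural.
stat, df = chi_square_independence([[63, 49], [15, 33]])
print(f"chi-square = {stat:.2f}, df = {df}, reject H0: {stat > 3.841}")
```

For a 2×2 table this statistic equals the square of the pooled two-proportion Z statistic, which is why 8.41 matches 2.90² from the earlier example on the same counts.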

Exercise 1
You’re a marketing research analyst. You ask a random sample of 286 consumers if they purchase Diet Pepsi or Diet Coke. At the .05 level of significance, is there evidence of a relationship?

| Diet Coke | Diet Pepsi: No | Diet Pepsi: Yes | Total |
|---|---|---|---|
| No | 84 | 32 | 116 |
| Yes | 48 | 122 | 170 |
| Total | 132 | 154 | 286 |

Exercise 1 Solution
eij ≥ 5 in all cells:

- e11 = 116×132/286 = 53.5
- e12 = 116×154/286 = 62.5
- e21 = 170×132/286 = 78.5
- e22 = 170×154/286 = 91.5

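Exercise 1 Solution
Test statistic, using expected frequencies rounded to one decimal place (53.5, 62.5, 78.5, 91.5):

$$\chi^2 = \frac{(84 - 53.5)^2}{53.5} + \frac{(32 - 62.5)^2}{62.5} + \frac{(48 - 78.5)^2}{78.5} + \frac{(122 - 91.5)^2}{91.5} = 54.29$$

With unrounded expected frequencies the statistic is 54.15; the conclusion is the same either way.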

Exercise 1 Solution
- H0: No relationship
- Ha: Relationship
- α = .05
- df = (2 – 1)(2 – 1) = 1
- Critical value: 3.841
- Test statistic: χ² = 54.29
- Decision: Reject H0 at α = .05
- Conclusion: There is evidence of a relationship

Exercise 2
There is a statistically significant relationship between purchasing Diet Coke and Diet Pepsi. So what do you think the relationship is? Aren’t they competitors?

| Diet Coke | Diet Pepsi: No | Diet Pepsi: Yes | Total |
|---|---|---|---|
| No | 84 | 32 | 116 |
| Yes | 48 | 122 | 170 |
| Total | 132 | 154 | 286 |

You Re-Analyze the Data
High Income

| Diet Coke | Diet Pepsi: No | Diet Pepsi: Yes | Total |
|---|---|---|---|
| No | 4 | 30 | 34 |
| Yes | 40 | 2 | 42 |
| Total | 44 | 32 | 76 |

Low Income

| Diet Coke | Diet Pepsi: No | Diet Pepsi: Yes | Total |
|---|---|---|---|
| No | 80 | 2 | 82 |
| Yes | 8 | 120 | 128 |
| Total | 88 | 122 | 210 |

There is a spurious relationship between purchasing Diet Coke and Diet Pepsi. Income is an intervening or control variable and is the true cause. The analysis here uses only descriptive statistics. Low-income consumers are price conscious: either they can’t afford to buy either brand, or they buy whatever is on sale. High-income consumers buy according to preference, regardless of price.
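A descriptive way to see how income changes the picture is to compare, within each table, the share of Diet Coke buyers among consumers who do not buy Diet Pepsi with the share among those who do. The sketch below does this for the combined table and for each income group; the function name and the dictionary layout are illustrative.

```python
def coke_share_by_pepsi(table):
    """Share of Diet Coke buyers among Diet Pepsi non-buyers and buyers.

    table rows: Diet Coke (No, Yes); table columns: Diet Pepsi (No, Yes).
    """
    (coke_no_pepsi_no, coke_no_pepsi_yes), (coke_yes_pepsi_no, coke_yes_pepsi_yes) = table
    among_pepsi_no = coke_yes_pepsi_no / (coke_no_pepsi_no + coke_yes_pepsi_no)
    among_pepsi_yes = coke_yes_pepsi_yes / (coke_no_pepsi_yes + coke_yes_pepsi_yes)
    return among_pepsi_no, among_pepsi_yes


tables = {
    "all consumers": [[84, 32], [48, 122]],
    "high income": [[4, 30], [40, 2]],
    "low income": [[80, 2], [8, 120]],
}

for name, table in tables.items():
    no_pepsi, yes_pepsi = coke_share_by_pepsi(table)
    print(f"{name}: Diet Coke share {no_pepsi:.2f} (no Pepsi) vs {yes_pepsi:.2f} (Pepsi)")
```

In the high-income group, Diet Pepsi buyers are much less likely to buy Diet Coke, while in the low-income group they are much more likely to, so the direction of the association depends on the income group.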

True Relationships
[Diagram: Diet Coke and Diet Pepsi are joined by an apparent relation; the control or intervening variable (the true cause) has an underlying causal relation to each of them.]

Moral of the Story
Numbers don’t think - People do!

Any Questions?