Chi Squared Tests
Introduction
Two statistical techniques are presented. Both are used to analyze nominal data.
– A goodness-of-fit test for a multinomial experiment.
– A contingency table test of independence.
The test statistics in both cases follow the $\chi^2$ distribution.
Chi-Squared Goodness-of-Fit Test
The hypothesis tested involves the “success” probabilities $p_1, p_2, \ldots, p_k$ of a multinomial distribution. The multinomial experiment is an extension of the binomial experiment.
– There are $n$ independent trials.
– The outcome of each trial can be classified into one of $k$ categories, called cells.
– The probability $p_i$ that an outcome falls into cell $i$ remains constant from trial to trial. By assumption, $p_1 + p_2 + \cdots + p_k = 1$.
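As a rough illustration of the multinomial experiment described above, the following Python sketch simulates n independent trials with fixed cell probabilities. The function name `simulate_multinomial`, the probabilities, and the seed are assumptions chosen for illustration; they are not part of the original slides.

```python
import random
from collections import Counter

def simulate_multinomial(n, probs, seed=None):
    """Simulate n independent trials; return the count observed in each of the k cells."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("cell probabilities must sum to 1")
    rng = random.Random(seed)
    cells = range(len(probs))
    # Each trial falls into cell i with the same probability probs[i].
    outcomes = rng.choices(cells, weights=probs, k=n)
    tally = Counter(outcomes)
    return [tally.get(i, 0) for i in cells]

# Illustrative values only: 200 trials, three cells.
counts = simulate_multinomial(200, [0.45, 0.40, 0.15], seed=1)
print(counts, sum(counts))
```

The cell counts always add up to n, because every trial lands in exactly one cell.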
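Our objective is to find out whether there is sufficient evidence to reject a pre-specified set of values for the $p_i$.
The hypotheses:
$H_0$: $p_1 = a_1, p_2 = a_2, \ldots, p_k = a_k$
$H_1$: At least one $p_i \neq a_i$
Here $a_1, \ldots, a_k$ denote the pre-specified values.
The test compares the actual (observed) frequencies with the expected frequencies of occurrences in all cells.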
An Example
Example 16.1
– Two competing companies, A and B, have been dominant players in the market. Both companies recently conducted advertising campaigns for their products.
– Market shares before the campaigns were: Company A = 45%, Company B = 40%, other competitors = 15%.
Example 16.1 – continued
– To study the effect of the campaigns on the market shares, a survey was conducted.
– 200 customers were asked to indicate their preference regarding the products advertised.
– Survey results: 102 customers preferred company A’s product, 82 customers preferred company B’s product, and 16 customers preferred a competitor’s product.
Example 16.1 – continued
Can we conclude at the 5% significance level that the market shares were affected by the advertising campaigns?
Solution
– The population investigated is customers’ brand preferences.
– The data are nominal (A, B, or other).
– This is a multinomial experiment (three categories).
– The question of interest: Are $p_1$, $p_2$, and $p_3$ different after the campaigns from their values before the campaigns?
The hypotheses are:
$H_0$: $p_1 = .45$, $p_2 = .40$, $p_3 = .15$
$H_1$: At least one $p_i$ changed.
The expected frequency for each category (cell) if the null hypothesis is true, alongside the actual frequencies the sample returned:
Cell         Expected frequency   Actual frequency
Company A    200(.45) = 90        102
Company B    200(.40) = 80        82
Other        200(.15) = 30        16
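The statistic is:
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$
where $f_i$ is the observed frequency and $e_i = np_i$ is the expected frequency of cell $i$. Intuitively, this measures the extent of the differences between the observed and the expected frequencies.
The rejection region is:
$$\chi^2 > \chi^2_{\alpha,\,k-1}$$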
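Conclusion: The test statistic is $\chi^2 = (102-90)^2/90 + (82-80)^2/80 + (16-30)^2/30 = 8.18$, and the critical value is $\chi^2_{.05,2} = 5.99$. Since 8.18 > 5.99, there is sufficient evidence at the 5% significance level to reject the null hypothesis. At least one of the probabilities $p_i$ is different; because the shares must sum to 1, at least two market shares have changed.
[Figure: $\chi^2$ distribution with 2 degrees of freedom, marking the critical value 5.99, the test statistic 8.18, alpha, the p-value, and the rejection region.]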
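The arithmetic of Example 16.1 can be checked with a short Python sketch. It sums (observed − expected)² / expected over the three cells and compares the result with the critical value. The variable names are illustrative. The closed forms for the critical value and p-value use the fact that a chi-squared variable with 2 degrees of freedom has P(X > x) = exp(−x/2); they apply to this 2-degrees-of-freedom case only.

```python
import math

observed = [102, 82, 16]          # survey counts for A, B, other
null_probs = [0.45, 0.40, 0.15]   # market shares before the campaigns (H0)
alpha = 0.05

n = sum(observed)
expected = [n * p for p in null_probs]

# Sum of (observed - expected)^2 / expected over all cells.
chi2 = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

# With k - 1 = 2 degrees of freedom, P(X > x) = exp(-x / 2),
# so the critical value and the p-value have closed forms.
critical = -2 * math.log(alpha)
p_value = math.exp(-chi2 / 2)

print(f"expected = {expected}")
print(f"chi2 = {chi2:.2f}, critical = {critical:.2f}, p-value = {p_value:.4f}")
print("reject H0" if chi2 > critical else "do not reject H0")
```

For other degrees of freedom the critical value would normally come from a chi-squared table or a statistics library rather than from this closed form.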
Required Conditions – The Rule of Five
The test statistic used to perform the test is only approximately chi-squared distributed. For the approximation to apply, the expected cell frequency has to be at least 5 for all cells ($np_i \geq 5$). If the expected frequency in a cell is less than 5, combine it with other cells.
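One possible way to apply the rule of five is sketched below in Python. The helper names `cells_below_minimum` and `combine_cells`, and the four-cell data, are hypothetical and chosen only to illustrate the check. The sketch merges all undersized cells into a single cell, which is one of several reasonable ways to combine them.

```python
def cells_below_minimum(n, probs, minimum=5):
    """Return the indices of cells whose expected frequency n * p_i is below the minimum."""
    return [i for i, p in enumerate(probs) if n * p < minimum]

def combine_cells(observed, probs, indices):
    """Merge the cells at the given indices into a single cell appended at the end."""
    keep = [i for i in range(len(probs)) if i not in indices]
    new_observed = [observed[i] for i in keep] + [sum(observed[i] for i in indices)]
    new_probs = [probs[i] for i in keep] + [sum(probs[i] for i in indices)]
    return new_observed, new_probs

# Hypothetical data: n = 100 with four cells, two of which are too small.
observed = [52, 41, 4, 3]
probs = [0.50, 0.42, 0.04, 0.04]
n = sum(observed)

small = cells_below_minimum(n, probs)
merged_observed, merged_probs = combine_cells(observed, probs, small)
print(small, merged_observed, merged_probs)
```

Combining cells reduces k, so the degrees of freedom of the test drop accordingly. If the merged cell is still below 5, or only one cell is undersized, it would have to be combined with a further cell, which this sketch does not attempt.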
Chi-squared Test of a Contingency Table
This test is used to determine whether:
– two nominal variables are related;
– there are differences between two or more populations of a nominal variable.
To accomplish the test objectives, we need to classify the data according to two different criteria. The idea is also based on goodness of fit.
Example 16.2
– In an effort to better predict the demand for courses offered by a certain MBA program, it was hypothesized that students’ academic background affects their choice of MBA major and thus their course selection.
– A random sample of last year’s MBA students was selected. The following contingency table summarizes the relevant data.
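The observed values:
Undergraduate degree   Accounting   Finance   Marketing   Total
BA                     31           13        16          60
BENG                   8            16        7           31
BBA                    12           10        17          39
Other                  10           5         7           22
Total                  61           44        47          152
There are two ways to view this problem:
– If each undergraduate degree is considered a population, do these populations differ?
– If each classification is considered a nominal variable, are these two variables dependent?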
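Solution
– The hypotheses are:
$H_0$: The two variables are independent
$H_1$: The two variables are dependent
– The test statistic:
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$
where $f_i$ and $e_i$ are the observed and expected frequencies of cell $i$, and $k$ is the number of cells in the contingency table.
– The rejection region:
$$\chi^2 > \chi^2_{\alpha,\,(r-1)(c-1)}$$
where $r$ and $c$ are the numbers of rows and columns of the table.
Since $e_i = np_i$ but $p_i$ is unknown, we need to estimate the unknown probability from the data, assuming $H_0$ is true.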
Estimating the expected frequencies
Under the null hypothesis the two variables are independent: P(Accounting and BA) = P(Accounting) × P(BA).
Undergraduate degree   Accounting   Finance   Marketing   Total   Probability
BA                                                        60      60/152
BENG                                                      31      31/152
BBA                                                       39      39/152
Other                                                     22      22/152
Total                  61           44        47          152
Probability            61/152       44/152    47/152
The number of students expected to fall in the cell “Accounting - BA” is $e_{\text{Acct-BA}} = n\,p_{\text{Acct-BA}} = 152(61/152)(60/152) = (61 \times 60)/152 = 24.08$.
The number of students expected to fall in the cell “Finance - BBA” is $e_{\text{Finance-BBA}} = n\,p_{\text{Finance-BBA}} = 152(44/152)(39/152) = (44 \times 39)/152 = 11.29$.
The expected frequency of the cell in row $i$ and column $j$ of the contingency table is calculated by:
$$e_{ij} = \frac{(\text{Column } j \text{ total})(\text{Row } i \text{ total})}{\text{Sample size}}$$
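The expected-frequency formula can be sketched in a few lines of Python. The observed counts are those of Example 16.2; the function name `expected_frequencies` is an illustrative choice, not something from the slides.

```python
# Observed counts from Example 16.2 (columns: Accounting, Finance, Marketing).
observed = [
    [31, 13, 16],  # BA
    [8, 16, 7],    # BENG
    [12, 10, 17],  # BBA
    [10, 5, 7],    # Other
]

def expected_frequencies(table):
    """Expected count of each cell: (row total)(column total) / sample size."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

expected = expected_frequencies(observed)
for row in expected:
    print([round(e, 2) for e in row])
```

Each row of expected frequencies adds up to the corresponding row total, and likewise for the columns, which is a quick sanity check on the calculation.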
Solution – continued
Calculation of the $\chi^2$ statistic. Observed frequencies, with the expected frequencies in parentheses:
Undergraduate degree   Accounting    Finance       Marketing     Total
BA                     31 (24.08)    13 (17.37)    16 (18.55)    60
BENG                   8 (12.44)     16 (8.97)     7 (9.59)      31
BBA                    12 (15.65)    10 (11.29)    17 (12.06)    39
Other                  10 (8.83)     5 (6.37)      7 (6.80)      22
Total                  61            44            47            152
$$\chi^2 = \frac{(31-24.08)^2}{24.08} + \cdots + \frac{(5-6.37)^2}{6.37} + \cdots + \frac{(7-6.80)^2}{6.80} = 14.70$$
Solution – continued
– The critical value in our example is: $\chi^2_{\alpha,(r-1)(c-1)} = \chi^2_{.05,(4-1)(3-1)} = \chi^2_{.05,6} = 12.5916$
Conclusion: Since $\chi^2 = 14.70 > 12.5916$, there is sufficient evidence to infer at the 5% significance level that students’ undergraduate degree and their MBA course selection are dependent.
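The whole test for Example 16.2 can be reproduced with the Python sketch below. The function names are illustrative, and the critical value 12.5916 is taken from the slides rather than computed. The p-value helper is an addition that the slides do not show; it relies on the closed form of the chi-squared tail probability that holds when the degrees of freedom are even, as they are here (6).

```python
import math

# Observed counts from Example 16.2 (columns: Accounting, Finance, Marketing).
observed = [
    [31, 13, 16],  # BA
    [8, 16, 7],    # BENG
    [12, 10, 17],  # BBA
    [10, 5, 7],    # Other
]

def chi2_independence(table):
    """Return the chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency under H0
            stat += (f - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-squared variable with an even number of degrees of freedom."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

CRITICAL_05_DF6 = 12.5916  # critical value quoted in the slides

chi2, df = chi2_independence(observed)
p_value = chi2_sf_even_df(chi2, df)
print(f"chi2 = {chi2:.2f}, df = {df}, p-value = {p_value:.4f}")
print("reject H0" if chi2 > CRITICAL_05_DF6 else "do not reject H0")
```

For tables with an odd number of degrees of freedom, a statistics library or a chi-squared table would be needed in place of `chi2_sf_even_df`.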
Using the computer
Define a code to specify each nominal value. Input the data in columns, one column for each category.
Code:
Undergraduate degree: 1 = BA, 2 = BENG, 3 = BBA, 4 = OTHERS
MBA Major: 1 = ACCOUNTING, 2 = FINANCE, 3 = MARKETING
Select the Chi squared / raw data option from Data Analysis Plus under Tools. See Xm16-02.
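Outside Excel, the same coding scheme can be turned into a contingency table with a few lines of Python. This is a sketch of an analogous step, not the Data Analysis Plus procedure itself. The raw data below are hypothetical (the contents of Xm16-02 are not reproduced here), and the names `DEGREE`, `MAJOR`, and `raw` are illustrative.

```python
from collections import Counter

# Codes as defined in the slides.
DEGREE = {1: "BA", 2: "BENG", 3: "BBA", 4: "OTHERS"}
MAJOR = {1: "ACCOUNTING", 2: "FINANCE", 3: "MARKETING"}

# Hypothetical raw data: one (degree code, major code) pair per student.
raw = [(1, 1), (1, 1), (2, 2), (3, 3), (4, 1), (1, 3), (2, 2), (3, 1)]

# Count how many students fall in each (degree, major) cell.
counts = Counter(raw)
table = [[counts.get((d, m), 0) for m in sorted(MAJOR)] for d in sorted(DEGREE)]

for d, row in zip(sorted(DEGREE), table):
    print(f"{DEGREE[d]:<7}", row)
```

The resulting `table` has one row per undergraduate degree and one column per MBA major, which is the layout the chi-squared test of a contingency table expects.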