Lecture 7: Two-Way Tables
Slides available from the Statistics & SPSS page of www.gpryce.com
Social Science Statistics Module I
Gwilym Pryce

Notices: Register

Aims and Objectives:
Aim:
– This session introduces methods of examining relationships between categorical variables.
Objectives:
– By the end of this session you should be able to examine relationships between categorical variables using:
» two-way tables
» the chi-square test for independence

Plan:
1. Independent events
2. Contingent events
3. Chi-square test for independence
4. Further study

1. Probability of two independent events occurring
If knowing that one event occurs does not affect the outcome of another event, we say those two events are independent. If A and B are independent, and we know the probability of each of them occurring, we can calculate the probability of both occurring.

Example: You have a two-sided die and a coin; find Pr(1 and H).
Answer: ½ × ½ = ¼
Rule: P(A ∩ B) = P(A) × P(B)

e.g. You have one fair coin which you toss twice: what's the probability of getting two heads? Suppose:
A = 1st toss is a head
B = 2nd toss is a head
– What is the probability of A ∩ B?
Answer: A and B are independent and are not disjoint (i.e. not mutually exclusive). P(A) = 0.5 and P(B) = 0.5, so P(A ∩ B) = 0.5 × 0.5 = 0.25.
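As a quick sanity check, here is a minimal Python simulation of the two-toss example (an illustration, not from the original slides); the relative frequency of two heads should settle near 0.25:

```python
import random

# Estimate P(two heads) from repeated pairs of fair coin tosses.
random.seed(1)  # arbitrary seed, for reproducibility only
n = 100_000
both_heads = 0
for _ in range(n):
    toss1 = random.random() < 0.5  # A: 1st toss is a head
    toss2 = random.random() < 0.5  # B: 2nd toss is a head
    if toss1 and toss2:
        both_heads += 1

print(both_heads / n)  # close to 0.5 * 0.5 = 0.25
```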

2. Probability of two contingent events occurring
If knowing that one event occurs does change the probability that the other occurs, then the two events are not independent and are said to be contingent upon each other. If events are contingent, then we can say that there is some kind of relationship between them. So testing for contingency is one way of testing for a relationship.

Example of contingent events: There is a 70% chance that a child will go to university if his/her parents are middle class, but only a 10% chance if his/her parents are working class. Given that there is a 60% chance of a child's parents being working class:
– What are the chances that a child will be born working class and go to university?
– What proportion of people at university will be from working class backgrounds?

A tricky one...

This diagram illustrates graphically how the probability of going to university is contingent upon the social class of your parents.

6% of all children are both working class and end up going to university (0.6 × 0.1 = 0.06, i.e. 6%).

% = as percent of all children:

                            Working class   Middle class
Go to University            6%              28%
Do not go to University     54%             12%

% at Uni from WC parents?
Of all children, only 34% end up at university (6% WC; 28% MC)
– i.e. 6 out of every 34 university students are from WC parents: 6 / 34 = 17.6% of university students are WC.
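The same arithmetic, written out as a small Python sketch (the probabilities are the slide's hypothetical figures, not real data):

```python
# Hypothetical figures from the slide: parents' class and chance of university.
p_wc = 0.60             # P(working-class parents)
p_uni_given_wc = 0.10   # P(university | working class)
p_uni_given_mc = 0.70   # P(university | middle class)

p_wc_and_uni = p_wc * p_uni_given_wc                # 0.06: WC and university
p_uni = p_wc_and_uni + (1 - p_wc) * p_uni_given_mc  # 0.06 + 0.28 = 0.34
p_wc_given_uni = p_wc_and_uni / p_uni               # 6/34, about 0.176

print(round(p_wc_and_uni, 2), round(p_uni, 2), round(p_wc_given_uni, 3))
```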

Probability theory states that:
– if x and y are independent, then the probability of events x and y simultaneously occurring is simply equal to the product of the probabilities of the two events occurring:
Prob(x ∩ y) = Prob(x) × Prob(y)
But, if x and y are not independent, then:
Prob(x ∩ y) = Prob(x) × Prob(y given that x has occurred)

Test for independence
We can use these two rules to test whether events are independent:
– Does the distribution of observations across possible outcomes resemble the random distribution we would get if events were independent?
– I.e. if we assume independence and calculate the expected number of cases in each category, do these figures correspond fairly closely to the actual distribution of outcomes found in our data? Or is the distribution of outcomes more akin to contingency, i.e. one event contingent on the other?

Example 1: Is there a relationship between social class and education? We might test this by looking at categories in our data of WC, MC, University, no University. Suppose we have 300 observations distributed as follows:

                            Working class   Middle class
Go to University            18              84
Do not go to University     162             36

Given this distribution, would you say these two variables are independent?

To do the test for independence we need to compare expected counts with observed counts. But how do we calculate eᵢ, the expected number of observations in category i?
– eᵢ = number of cases expected in cell i assuming that the two categorical variables are independent (i.e. no contingency).
– eᵢ is calculated simply as: the probability of an observation falling into category i under the independence assumption, multiplied by the total number of observations.

So, if UNI_Y or UNI_N and WC or MC are independent (i.e. assuming H₀), then:
Prob(UNI_Y ∩ WC) = Prob(UNI_Y) × Prob(WC)
so the expected number of cases for each of the four mutually exclusive categories is as follows:

                            Working class           Middle class
Go to University            P(UNI_Y) × P(WC) × n    P(UNI_Y) × P(MC) × n
Do not go to University     P(UNI_N) × P(WC) × n    P(UNI_N) × P(MC) × n

But how do we work out Prob(UNI_Y) and Prob(WC), which are needed to calculate Prob(UNI_Y ∩ WC)?
Prob(UNI_Y ∩ WC) = Prob(UNI_Y) × Prob(WC)
Answer: we assume independence and estimate them from the data by simply dividing the total number in the given category by the total number of observations:
E.g. Prob(UNI_Y) = Total no. cases UNI_Y ÷ All observations = (18 + 84) / 300 = 0.34
Prob(WC) is calculated the same way:
E.g. Prob(WC) = Total no. cases WC ÷ All observations = (18 + 162) / 300 = 0.6
So the expected count is Prob(UNI_Y) × Prob(WC) × n = 0.34 × 0.6 × 300 = 61.2

                            Working class                           Middle class
Go to University            P(UNI_Y) × P(WC) × n                    P(UNI_Y) × P(MC) × n
                            = (no. at Uni / n) × (no. WC / n) × n   = (no. at Uni / n) × (no. MC / n) × n
Do not go to University     P(UNI_N) × P(WC) × n                    P(UNI_N) × P(MC) × n
                            = (no. not Uni / n) × (no. WC / n) × n  = (no. not Uni / n) × (no. MC / n) × n

                            Working class   Middle class   Total
Go to University            18              84             102
Do not go to University     162             36             198
Total                       180             120            300

                            Working class                      Middle class
Go to University            P(UNI_Y) × P(WC) × n               P(UNI_Y) × P(MC) × n
                            = (102/300) × (180/300) × 300      = (102/300) × (120/300) × 300
Do not go to University     P(UNI_N) × P(WC) × n               P(UNI_N) × P(MC) × n
                            = (198/300) × (180/300) × 300      = (198/300) × (120/300) × 300

Expected count in each category:

                            Working class                      Middle class
Go to University            (102/300) × (180/300) × 300        (102/300) × (120/300) × 300
                            = 0.34 × 0.6 × 300 = 61.2          = 0.34 × 0.4 × 300 = 40.8
Do not go to University     (198/300) × (180/300) × 300        (198/300) × (120/300) × 300
                            = 0.66 × 0.6 × 300 = 118.8         = 0.66 × 0.4 × 300 = 79.2
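In code, the expected counts are just the products of the row and column margins divided by n. A minimal Python sketch (an aside; the lecture itself uses SPSS):

```python
# Expected counts under independence: e_ij = (row total * column total) / n.
observed = [[18, 84],    # go to university:        WC, MC
            [162, 36]]   # do not go to university: WC, MC

row_totals = [sum(row) for row in observed]        # [102, 198]
col_totals = [sum(col) for col in zip(*observed)]  # [180, 120]
n = sum(row_totals)                                # 300

expected = [[r * c / n for c in col_totals] for r in row_totals]
print(expected)  # [[61.2, 40.8], [118.8, 79.2]]
```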

We have the actual count (i.e. from our data set):

                            Working class   Middle class
Go to University            18              84
Do not go to University     162             36

And the expected count (i.e. the numbers we'd expect if we assume class & education to be independent of each other):

                            Working class   Middle class
Go to University            61.2            40.8
Do not go to University     118.8           79.2

What does this table tell you?

                                           Working class   Middle class
Go to University         Actual count      18              84
                         Expected count    61.2            40.8
Do not go to University  Actual count      162             36
                         Expected count    118.8           79.2

It tells you that if class and education were indeed independent of each other (i.e. the outcome of one does not affect the chances of the outcome of the other):
– then you'd expect a lot more working class people in the data to have gone to university than actually recorded (61 people, rather than 18);
– conversely, you'd expect far fewer middle class people to have gone to university (half the number actually recorded: 41 people rather than 84).

But remember, all this is based on a sample, not the entire population…
Q/ Is this discrepancy due to sampling variation alone, or does it indicate that we must reject the assumption of independence?
– To answer this within the standardised hypothesis testing framework we need to know the chances of false rejection.

3. Chi-square test for independence
(1) H₀: expected = actual ⇒ x & y are independent
» i.e. Prob(x) is not affected by whether or not y occurs;
H₁: expected ≠ actual ⇒ there is some relationship
» i.e. Prob(x) is affected by y occurring.
(2) α = 0.05; the test statistic is:

χ² = Σᵢ (oᵢ − eᵢ)² / eᵢ, with df = k − d − 1, which for an r × c table works out as df = (r − 1)(c − 1)

where:
k = no. of categories
eᵢ = expected (given H₀) no. of sample observations in the i-th category
oᵢ = actual no. of sample observations in the i-th category
d = no. of parameters that have to be estimated from the sample data
r = no. of rows in the table
c = no. of columns in the table
(The test is non-parametric, i.e. it makes no presuppositions about the distribution of the variables; sample size is not relevant.)

Chi-square distribution changes shape for different df:

(3) Reject H₀ iff P < α
(4) Calculate P: P = Prob(χ² > χ²c)
– N.B. Chi-square tests are always upper tail tests.
– χ² tables are usually set up like a t-table, with df down the side, probabilities listed along the top row, and values of χ²c in the body of the table. So look up χ²c in the body of the table for the relevant df, then find the upper tail probability that heads that column.
– SPSS: CDF.CHISQ(χ²c, df) calculates Prob(χ² < χ²c), so use the following syntax:
» COMPUTE chi_prob = 1 - CDF.CHISQ(χ²c, df).
» EXECUTE.
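For anyone working in Python rather than SPSS, the equivalent upper-tail calculation (assuming SciPy is available) uses the chi-square survival function:

```python
from scipy.stats import chi2

# chi2.cdf plays the role of CDF.CHISQ: it gives Prob(X2 < x).
# The p-value is the upper tail, so use 1 - cdf, or the survival function sf.
chi_sq_c, df = 3.84, 1           # 3.84 is the familiar 5% critical value for df = 1
p_value = chi2.sf(chi_sq_c, df)  # identical to 1 - chi2.cdf(chi_sq_c, df)
print(round(p_value, 4))         # ~0.05
```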

Do a chi-square test on the following table:

                                           Working class   Middle class
Go to University         Actual count      18              84
                         Expected count    61.2            40.8
Do not go to University  Actual count      162             36
                         Expected count    118.8           79.2

(1) H₀: expected = actual ⇒ class and Higher Education are independent
H₁: expected ≠ actual ⇒ there is some relationship between class and Higher Education

(2) State the formula & calculate χ²:

χ² = Σᵢ (oᵢ − eᵢ)² / eᵢ
   = (18 − 61.2)²/61.2 + (84 − 40.8)²/40.8 + (162 − 118.8)²/118.8 + (36 − 79.2)²/79.2
   = 30.49 + 45.74 + 15.71 + 23.56
   = 115.51

df = (r − 1)(c − 1) = 1
Sig = P(χ² > 115.51) ≈ 0

(3) Reject H₀ iff P < α
(4) Calculate P:
COMPUTE chi_prob = 1 - CDF.CHISQ(115.51,1).
EXECUTE.
Sig = P(χ² > 115.51) ≈ 0 ⇒ Reject H₀
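The whole test can be cross-checked in one call with SciPy's chi2_contingency (an aside, not part of the lecture's SPSS workflow); correction=False switches off Yates' continuity correction so the result matches the hand calculation:

```python
from scipy.stats import chi2_contingency

observed = [[18, 84],    # go to university:        WC, MC
            [162, 36]]   # do not go to university: WC, MC

# correction=False: no Yates' correction, matching the formula used above.
chi_sq, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi_sq, 2), dof, p)  # 115.51, 1, p effectively 0
print(expected)                  # [[61.2, 40.8], [118.8, 79.2]]
```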

Caveat: As with the two-proportions tests, the chi-square test is “an approximate method that becomes more accurate as the counts in the cells of the table get larger” (Moore, Basic Practice of Statistics, 2000, p. 485).
Cell counts required for the chi-square test: “You can safely use the chi-square test with critical values from the chi-square distribution when no more than 20% of the expected counts are less than 5 and all individual expected counts are 1 or greater. In particular, all four expected counts in a 2x2 table should be 5 or greater” (Moore, Basic Practice of Statistics, 2000, p. 485).
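Moore's rule of thumb is easy to check mechanically; a small sketch against the expected counts from Example 1:

```python
# Check: no more than 20% of expected counts below 5, and none below 1.
expected = [61.2, 40.8, 118.8, 79.2]  # expected counts from Example 1

share_below_5 = sum(e < 5 for e in expected) / len(expected)
rule_ok = share_below_5 <= 0.20 and min(expected) >= 1
print(rule_ok)  # True: the chi-square approximation is safe here
```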

Example 2: Is there a relationship between whether a borrower is a first-time buyer and whether they live in Durham or Cumberland?
The only real problem is how to calculate eᵢ, the expected number of observations in category i
– (i.e. the number of cases expected in i assuming that the variables are independent).
The formula for eᵢ is the probability of an observation falling into category i multiplied by the total number of observations.

As noted earlier, probability theory states that:
– if x and y are independent, then the probability of events x and y simultaneously occurring is simply equal to the product of the probabilities of the two events occurring:
Prob(x ∩ y) = Prob(x) × Prob(y)
But, if x and y are not independent, then:
Prob(x ∩ y) = Prob(x) × Prob(y given that x has occurred)

So, if FTB_Y or FTB_N and County_D or County_C are independent (i.e. assuming H₀), then:
Prob(FTB_Y ∩ County_D) = Prob(FTB_Y) × Prob(County_D)
and the expected number of cases for each of the four mutually exclusive categories follows in the same way as in Example 1.

Prob(FTB_N) = Total no. cases FTB_N ÷ All observations

This gives us the expected count. To obtain this table in SPSS, go to Analyse, Descriptive Statistics, Crosstabs, Cells, and choose expected count rather than observed.

What does this table tell you?
– Does it suggest that the probability of being an FTB is independent of location?
– Or does it suggest that the two factors are contingent on each other in some way?
– Can it tell you anything about the direction of causation?
– What about sampling variation?

Summary of hypothesis test:
– (1) H₀: FTB and County are independent; H₁: there is some relationship
– (2) α = 0.05
– (3) Reject H₀ iff P < α
– (4) Calculate P: P = Prob(χ² > χ²c) ≈ 0.33 ⇒ Do not reject H₀
i.e. if we were to reject H₀, there would be a 1 in 3 chance of us rejecting it incorrectly, and so we cannot do so. In other words, we have no evidence of a relationship between FTB status and County.

Contingency Tables in SPSS:

Click the Cells button to select counts & percentages. If you select all three (row, column and total), you will end up with a table in which each cell shows its count together with its row, column and total percentages.

Click the Statistics button to choose which stats you want. If you click Chi-square, the results of a range of tests will be listed…

We have been calculating the Pearson Chi-square:

4. For further study:
The Pearson chi-square test only tests for the existence of a relationship; it tells you little about the strength of the relationship. SPSS includes a raft of measures that try to measure the level of association between categorical variables. Click on the name of one of the statistics and SPSS will give you a brief definition (see below). In the lab exercises, take a look at these statistics and copy and paste the definitions alongside your answers:
– Right-click on the definition and select Copy. Then open up a Word document and paste it alongside your output.

Nominal variables:
Contingency coefficient: “A measure of association based on chi-square. The value ranges between zero and 1, with zero indicating no association between the row and column variables and values close to 1 indicating a high degree of association between the variables. The maximum value possible depends on the number of rows and columns in a table.”
Phi and Cramer’s V: “Phi is a chi-square based measure of association that involves dividing the chi-square statistic by the sample size and taking the square root of the result. Cramer's V is a measure of association based on chi-square.”
Lambda: “A measure of association which reflects the proportional reduction in error when values of the independent variable are used to predict values of the dependent variable. A value of 1 means that the independent variable perfectly predicts the dependent variable. A value of 0 means that the independent variable is no help in predicting the dependent variable.”
Uncertainty coefficient: “A measure of association that indicates the proportional reduction in error when values of one variable are used to predict values of the other variable. For example, a value of 0.83 indicates that knowledge of one variable reduces error in predicting values of the other variable by 83%. The program calculates both symmetric and asymmetric versions of the uncertainty coefficient.”
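As an illustration of the chi-square based measures (a sketch, not SPSS output), phi and Cramer's V for the Example 1 table:

```python
import math

chi_sq, n = 115.51, 300  # chi-square statistic and sample size from Example 1
rows, cols = 2, 2

phi = math.sqrt(chi_sq / n)                                  # phi coefficient
cramers_v = math.sqrt(chi_sq / (n * (min(rows, cols) - 1)))  # equals phi for a 2x2 table
print(round(phi, 3), round(cramers_v, 3))                    # about 0.62 for both
```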

Ordinal variables:
Gamma: A symmetric measure of association between two ordinal variables that ranges between -1 and 1. Values close to an absolute value of 1 indicate a strong relationship between the two variables. Values close to zero indicate little or no relationship. For 2-way tables, zero-order gammas are displayed. For 3-way to n-way tables, conditional gammas are displayed.
Somers’ d: “A measure of association between two ordinal variables that ranges from -1 to 1. Values close to an absolute value of 1 indicate a strong relationship between the two variables, and values close to 0 indicate little or no relationship between the variables. Somers' d is an asymmetric extension of gamma that differs only in the inclusion of the number of pairs not tied on the independent variable. A symmetric version of this statistic is also calculated.”
Kendall’s tau-b: “A nonparametric measure of association for ordinal or ranked variables that take ties into account. The sign of the coefficient indicates the direction of the relationship, and its absolute value indicates the strength, with larger absolute values indicating stronger relationships. Possible values range from -1 to 1, but a value of -1 or +1 can only be obtained from square tables.”
Kendall’s tau-c: “A nonparametric measure of association for ordinal variables that ignores ties. The sign of the coefficient indicates the direction of the relationship, and its absolute value indicates the strength, with larger absolute values indicating stronger relationships. Possible values range from -1 to 1, but a value of -1 or +1 can only be obtained from square tables.”
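A minimal example of obtaining one of these ordinal measures in Python (SciPy's kendalltau computes the tau-b variant by default; the data here are invented ordinal codes, purely for illustration):

```python
from scipy.stats import kendalltau

# Invented ordinal data: e.g. social class (1-3) vs highest qualification (1-3).
x = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
y = [1, 2, 1, 2, 2, 3, 2, 3, 3, 3]

tau_b, p = kendalltau(x, y)  # tau-b: handles ties in the rankings
print(round(tau_b, 3), round(p, 3))
```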

Correlations:
Pearson correlation coefficient r: “a measure of linear association between two variables.”
Spearman correlation coefficient: “a measure of association between rank orders. Values of both range between -1 (a perfect negative relationship) and +1 (a perfect positive relationship). A value of 0 indicates no linear relationship.”
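Both coefficients are one-liners in SciPy (a sketch with arbitrary data, just to show the calls):

```python
from scipy.stats import pearsonr, spearmanr

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]  # arbitrary, roughly linear data

r, p_r = pearsonr(x, y)        # linear association
rho, p_rho = spearmanr(x, y)   # association between rank orders
print(round(r, 3), round(rho, 3))  # both near +1 for this data
```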

When you have a dependent variable measured on an interval scale & an independent variable with a limited number of categories: Eta: –“A measure of association that ranges from 0 to 1, with 0 indicating no association between the row and column variables and values close to 1 indicating a high degree of association. Eta is appropriate for a dependent variable measured on an interval scale (e.g., income) and an independent variable with a limited number of categories (e.g., gender). Two eta values are computed: one treats the row variable as the interval variable; the other treats the column variable as the interval variable.”