1 Matters arising
Summary of last week’s lecture
The exercises
2 Last week
I extended my discussion of statistical association to the topic of partial correlation. A partial correlation can help the researcher to choose among different causal models.
I also considered the analysis of nominal data in the form of contingency tables. The chi-square statistic can be used to test for the presence of an association between qualitative or categorical variables.
3 CORRELATION does not necessarily mean CAUSATION
4 The choice
A strong positive correlation between Exposure and Actual violence was obtained. But at least three CAUSAL MODELS are compatible with that result.
5 A background variable
Fortunately, we had information on a third variable, a measure of parental orientation towards violence. Both Exposure and Actual violence correlated highly with this Background variable.
6 Partial correlation
A PARTIAL CORRELATION is what remains of a Pearson correlation between two variables when the influence of a third variable has been removed, or PARTIALLED OUT.
7 Partial correlation
Partialling out removes the influence of the third variable and rescales the result with the new variances, so that the partial correlation again lies in the range from −1 to +1.
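The slide's formula did not survive in this transcript. A minimal sketch of the standard first-order partial correlation formula follows, assuming three pairwise Pearson correlations are already known. The function name and the three correlation values are hypothetical illustrations, not the figures from the lecture.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between X and Y with the third variable Z partialled out."""
    # Remove the part of r_xy that can be attributed to Z.
    numerator = r_xy - r_xz * r_yz
    # Rescale by the variance left in X and in Y once Z is removed.
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator


# Hypothetical correlations (NOT the lecture's values).
r_exposure_violence = 0.5
r_exposure_background = 0.6
r_violence_background = 0.8

r_partial = partial_correlation(
    r_exposure_violence, r_exposure_background, r_violence_background
)
print(round(r_partial, 3))  # about 0.042
```

With these made-up inputs a moderate correlation of 0.5 shrinks almost to zero once the background variable is partialled out, which is the pattern the lecture describes.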
8 The partial correlation
When the correlations with Background are taken into account, the original correlation is no longer significant. The third model seems the most convincing.
9 A medical question
Is there an association between the type of body tissue one has and the presence of a potentially harmful antibody? This is a question of whether two QUALITATIVE VARIABLES are associated.
10 A contingency table
The pattern of frequencies in the CONTINGENCY TABLE suggests that there is indeed an association between Presence and Tissue Type. The null hypothesis is that the two variables are INDEPENDENT.
11 Expected cell frequencies (E)
The EXPECTED FREQUENCY E in each cell of the table is calculated from the MARGINAL TOTALS of the contingency table on the assumption that Tissue Type and Presence are independent. We compare the values of E with the OBSERVED FREQUENCIES O.
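As a rough illustration of how E is obtained from the marginal totals, here is a short sketch using the usual rule E = (row total × column total) / N. The 4 × 2 table of counts is hypothetical; the lecture's actual frequencies are not reproduced in this transcript.

```python
def expected_frequencies(observed):
    """Expected cell counts under independence, from the marginal totals."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Each expected count is (row total * column total) / grand total.
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical counts: four tissue groups (rows) by antibody No / Yes (columns).
observed = [
    [20, 10],
    [15, 15],
    [10, 20],
    [5, 25],
]

expected = expected_frequencies(observed)
for o_row, e_row in zip(observed, expected):
    print(o_row, e_row)
```

Because every row total is 30 in this made-up table, each row has the same expected counts (12.5 and 17.5), so the discrepancies between O and E are easy to read off.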
12 The expected frequencies
In the Critical group, there seem to be large discrepancies between O and E: fewer No’s than expected and more Yes’s.
13 Formula for chi-square
The magnitude of the discrepancies feeds into the value of the CHI-SQUARE statistic.
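The formula itself is missing from the transcript. The familiar Pearson statistic is the sum over all cells of (O − E)² / E, and the sketch below computes it for a hypothetical 4 × 2 table (the counts are invented for illustration and are not the lecture's data).

```python
def chi_square_statistic(observed):
    """Pearson chi-square: sum over cells of (O - E)^2 / E."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected under independence
            chi2 += (o - e) ** 2 / e
    return chi2


# Hypothetical counts: four tissue groups (rows) by antibody No / Yes (columns).
observed_counts = [
    [20, 10],
    [15, 15],
    [10, 20],
    [5, 25],
]

chi2_value = chi_square_statistic(observed_counts)
print(round(chi2_value, 2))  # about 17.14
```

Cells with large discrepancies relative to E contribute most to the total, which is why the first and last rows dominate here.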
15 Degrees of freedom
To decide whether a given value of chi-square is significant, we must specify the DEGREES OF FREEDOM df. If a contingency table has R rows and C columns, the degrees of freedom are given by
df = (R – 1)(C – 1)
In our example, R = 4, C = 2 and so
df = (4 – 1)(2 – 1) = 3.
16 Significance
SPSS will tell us that the p-value of a chi-square value of 10.66 in the chi-square distribution with three degrees of freedom is .014. We should write this result as:
χ²(3) = 10.66; p = .01.
Since the result is significant beyond the .05 level, we have evidence against the null hypothesis of independence and evidence for the scientific hypothesis.
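The p-value quoted above can be checked without SPSS. The sketch below uses a closed-form expression for the upper tail of the chi-square distribution that holds only for three degrees of freedom; the function name is hypothetical, and for other degrees of freedom a statistics library (for example `scipy.stats.chi2.sf`) would be the more sensible tool.

```python
import math


def chi2_upper_tail_df3(x):
    """P(chi-square >= x) for exactly 3 degrees of freedom (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Value of the statistic reported on the slide, with df = (4 - 1)(2 - 1) = 3.
p_value = chi2_upper_tail_df3(10.66)
print(round(p_value, 3))  # about 0.014
```

The result agrees with the .014 reported by SPSS, and it falls below the .05 criterion.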
20 Solution
A is wrong: we usually hope the null hypothesis will be falsified.
B is wrong: it’s the null hypothesis that is tested.
C is wrong: the p-value must be less than 0.05 for significance.
D is correct: significance requires a p-value of less than 0.05.
24 In Variable View
In Variable View, Name three variables and assign Values to the code numbers making up the various tissue groups. Always assign CLEAR VALUE LABELS to make the output comprehensible.
25 In Data View
The third variable, Count, contains the frequencies of occurrence of the antibody in the different groups. When entering the data, it’s helpful to be able to view the value labels.
26 What the rows in Data View represent
SPSS assumes that, in Data View, each row contains information on just ONE participant or CASE. In our example, each row contains information about SEVERAL people. At some point, SPSS must be informed of this. You do this by WEIGHTING THE CASES with the frequencies.
27 Weighting the cases
Select Weight Cases from the Data menu. Complete the Weight Cases dialog by transferring Count to the Frequency Variable slot. Click OK to weight the cases with the frequencies.
28 Another approach
We could have dispensed with the Count variable and simply entered the data on each of the 79 people in the study. Here are 8 of the 79 cases. You don’t need the Weight Cases procedure here.
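The equivalence of the two data layouts can be sketched outside SPSS. In the example below, a small table of (group, presence, count) rows is expanded into one row per person and then tallied again; the group labels and counts are hypothetical and much smaller than the lecture's 79 cases.

```python
from collections import Counter

# Hypothetical aggregated rows: (tissue group, antibody present?, count).
aggregated = [
    ("A", "Yes", 3),
    ("A", "No", 2),
    ("B", "Yes", 1),
    ("B", "No", 4),
]

# Expand each aggregated row into `count` individual cases, one per person.
cases = [
    (tissue, presence)
    for tissue, presence, count in aggregated
    for _ in range(count)
]

# Tallying the individual cases recovers the original frequencies.
recounted = Counter(cases)
print(len(cases))
print(recounted[("A", "Yes")])
```

Weighting by Count tells the software to treat each aggregated row as if it had been expanded in this way.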
29 Selecting the chi-square test
The chi-square test is available in Crosstabs, on the Descriptive Statistics menu.
30 The Crosstabs dialog
We want the columns to represent the Presence variable, as in the contingency table.
31 Clustered bar charts
Check the box labelled ‘Display clustered bar charts’.
32 Crosstabs: Statistics
Choose Chi-square. The Chi-square statistic itself is not suitable as a measure of the strength of an association, because it is affected by the size of the data set. Click ‘Phi and Cramer’s V’. These are measures of the STRENGTH of the association between tissue type and the incidence of the antibody.
33 Crosstabs: Cell Display
Check the Observed and Expected buttons. Since the columns represent Yes’s and No’s, it will be useful to have the column PERCENTAGES.
34 The output: contingency table
The percentages are useful: they show a marked predominance of Presence of the antibody in the Critical tissue group only.
35 The clustered bar chart
The figure shows the trend apparent from inspection of the column percentages. There is a marked presence of the antibody in the Critical tissue group.
36 Result of the chi-square test
The p-value is in the column headed ‘Asymp. Sig.’: p = .014. Write the result as:
χ²(3) = 10.66; p = .01.
Notice the information about the number of cells with values of E less than 5. When there are too many, the usual p-value cannot be trusted.
37 Strength of the association
Unlike a correlation, the value of chi-square is partly determined by the sample size and is therefore unsuitable as a measure of association strength. Interpret either Phi or Cramer’s statistic as the extent to which the incidence of the antibody can be accounted for by tissue type. Cramer’s V can take values in the range from 0 to +1.
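To show how the sample size is taken out of chi-square, here is a sketch of the standard definition of Cramer's V, the square root of χ² / (N × min(R − 1, C − 1)). It assumes that the chi-square of 10.66 and the 79 cases quoted earlier in the lecture belong to the same 4 × 2 table; the function name is hypothetical.

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V: chi-square rescaled by sample size and table dimensions."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Assumed inputs: chi-square = 10.66 from a 4 x 2 table based on 79 cases.
v = cramers_v(10.66, 79, 4, 2)
print(round(v, 2))  # about 0.37
```

Dividing by N is what makes V comparable across data sets of different sizes, which chi-square itself is not.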
38 A smaller data set
Is there an association between Tissue Type and Presence of the antibody? The antibody is indeed more in evidence in the ‘Critical’ tissue group.
High incidence in Critical category.
39 Result of the chi-square test
How disappointing! It looks as if we haven’t demonstrated a significant association. Under the column headed ‘Asymp. Sig.’ is the p-value, which is given as
40 Sampling distributions
Because of sampling variability, the values of the statistics we calculate from our data would vary were the data-gathering exercise to be repeated. The distribution of a statistic is known as its SAMPLING DISTRIBUTION. Test statistics such as t, F and chi-square have known sampling distributions. You must know the sampling distribution of any statistic to produce an accurate p-value.
42 The true definition of chi-square
The familiar formula is not the defining formula for chi-square. Chi-square is NOT defined in the context of nominal data, but in terms of continuously distributed, independent standard normal variables Z as follows:
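The defining formula did not survive in this transcript. The standard definition, written here in LaTeX with n denoting the degrees of freedom (a symbol introduced for this sketch), is the sum of squares of n independent standard normal variables:

```latex
\chi^2_{n} \;=\; \sum_{i=1}^{n} Z_i^{2},
\qquad Z_1, \dots, Z_n \ \text{independent}, \quad Z_i \sim N(0, 1)
```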
44 An approximation
The familiar chi-square statistic is only APPROXIMATELY distributed as chi-square. The approximation is good, provided that the expected frequencies E are adequately large.
45 The meaning of ‘Asymptotic’
The term ASYMPTOTIC denotes the limiting distribution of a statistic as the sample size approaches infinity. The ‘asymptotic’ p-value of a statistic is its p-value under the assumption that the statistic has the limiting distribution. That assumption may be false.
46 Goodness of the approximation…
Warning about low expected frequencies.
In the SPSS output, the column headed ‘Asymp. Sig.’ contains a p-value calculated on the assumption that the approximate chi-square statistic behaves like the real chi-square statistic. But underneath the table there is a warning about low frequencies, indicating that the ‘asymptotic’ p-value cannot be relied upon.
47 Exact tests
Fortunately, there are available EXACT TESTS, which do not make the assumption that the approximation is good. There are the Fisher exact tests, designed by R. A. Fisher many years ago; and there are modern ‘brute force’ methods requiring massive computation.
48 Ordering an exact test
Click the Exact… button at the bottom of the Crosstabs dialog box. Check the Exact radio button in the Exact Tests dialog.
49 A better result!
The exact test has shown that we DO have evidence for an association between tissue type and incidence of the antibody. The exact p-value is markedly lower than the asymptotic value.
52 Linear association
If two variables have a PERFECT linear relationship, the graph of one against the other is a straight line. The graph of temperature in degrees Fahrenheit against the equivalent Celsius temperature is a straight line.
53 A perfect positive linear relationship
[Figure: straight-line graph of Degrees Fahrenheit (vertical axis) against Degrees Celsius (horizontal axis), passing through points P and Q, with origin (0, 0) and intercept 32 on the Fahrenheit axis.]
54 The slope of the line
The COEFFICIENT 9/5 in front of the Celsius variable is the SLOPE of the straight line. When the Celsius temperature increases by FIVE degrees, the Fahrenheit temperature increases by NINE degrees. When the Celsius temperature increases by one degree, the Fahrenheit temperature increases by 1.8 degrees.
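The slope and intercept can be read directly off the conversion rule, Fahrenheit = (9/5) × Celsius + 32. A small sketch follows; the function name is a hypothetical choice for illustration.

```python
def celsius_to_fahrenheit(celsius):
    """Straight line with slope 9/5 and intercept 32."""
    return celsius * 9 / 5 + 32


print(celsius_to_fahrenheit(0))    # the intercept: 32.0
print(celsius_to_fahrenheit(100))  # 212.0

# A rise of five Celsius degrees is a rise of nine Fahrenheit degrees.
rise_for_five = celsius_to_fahrenheit(25) - celsius_to_fahrenheit(20)
print(rise_for_five)  # 9.0
```

Every point produced by this function lies exactly on the line, which is what a perfect linear relationship means.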
55 A strong linear association
A narrowly elliptical scatterplot like this indicates a strong positive association between the two variables. The Pearson correlation is
56 Regression
Regression is a set of techniques for exploiting the presence of statistical association among variables to make PREDICTIONS of values of one variable (the DV or CRITERION) from knowledge of the values of other variables (the IVs or REGRESSORS).
57 Simple and multiple regression
In the simplest case, there is just one IV or regressor. This is known as SIMPLE regression. In MULTIPLE regression, there are two or more IVs.
59 The regression line of Violence upon Preference
The REGRESSION LINE is the line that fits the points best from the point of view of predicting Actual Violence from Preference. There is a precise criterion for the ‘best-fitting’ line.
66 Simple regression
Y′ = B0 + B·X, where B0 is the regression constant (intercept) and B is the regression coefficient (slope).
Y′ is the y-coordinate of the point on the line above the value of X. An increase of one unit on variable X will result in an estimated increase of B units on variable Y. A NEGATIVE value of B would mean that an increase of one unit on variable X will result in an estimated REDUCTION of |B| units on Y.
67 The ‘least-squares’ criterion
In ORDINARY LEAST SQUARES (OLS) REGRESSION, a RESIDUAL score e is the difference between the real value Y and the estimate Y′ from the regression equation.
e = (Fred’s real violence – Fred’s predicted violence from the regression equation).
OLS regression minimizes the sum of the squares of the residuals Σ(Y – Y′)² = Σe².
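A minimal sketch of the least-squares calculation for one regressor, using the standard closed-form solution (slope = sum of cross-products / sum of squares of X, intercept = mean of Y minus slope × mean of X). The five pairs of scores are hypothetical, not the lecture's Preference and Violence data.

```python
def fit_ols(x, y):
    """Return (intercept, slope) of the least-squares line of y upon x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_sq_x = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum_cross / sum_sq_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical scores for five participants.
preference = [1, 2, 3, 4, 5]
violence = [2, 4, 5, 4, 5]

b0, b1 = fit_ols(preference, violence)

# Residual e = real value Y minus the estimate Y' from the regression equation.
residuals = [y - (b0 + b1 * x) for x, y in zip(preference, violence)]
sum_sq_residuals = sum(e ** 2 for e in residuals)

print(round(b0, 2), round(b1, 2))   # about 2.2 and 0.6
print(round(sum_sq_residuals, 2))   # about 2.4
```

No other straight line through these five points gives a smaller sum of squared residuals than the fitted one.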
69 Regression line with independence
When the variables show no association, the slope of the regression line is zero and the line runs horizontally through the mean MY of the criterion or dependent variable. The intercept (B0) is MY in this case.
70 Intercept-only prediction
In OLS regression, the intercept B0 is related to the regression coefficient B1 according to
B0 = MY – B1·MX
When X and Y are independent, the slope of the regression line is zero and
B0 = MY
In that case, the best we can do with regression is to draw a horizontal line at Y = MY through the middle of the cloud of points. Whatever the degree of association between X and Y, the INTERCEPT-ONLY prediction is Y′ = MY.
71 Improved prediction
There is a strong linear association here. The regression line makes much more accurate predictions than simply using the mean score on Actual violence as your prediction whatever the Preference value.
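One way to put a number on 'more accurate' is to compare the sum of squared prediction errors under the two rules. The sketch below does this for a small hypothetical data set (the scores are invented, not the lecture's), fitting the line with the usual least-squares formulas.

```python
def sum_squared_errors(actual, predicted):
    """Sum of squared differences between real and predicted scores."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


# Hypothetical scores for five participants.
preference = [1, 2, 3, 4, 5]
violence = [2, 4, 5, 4, 5]

n = len(preference)
mean_x = sum(preference) / n
mean_y = sum(violence) / n

# Least-squares slope and intercept.
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(preference, violence))
    / sum((x - mean_x) ** 2 for x in preference)
)
intercept = mean_y - slope * mean_x

# Rule 1: intercept-only prediction, the mean for everyone.
sse_mean_only = sum_squared_errors(violence, [mean_y] * n)
# Rule 2: prediction from the regression line.
sse_regression = sum_squared_errors(
    violence, [intercept + slope * x for x in preference]
)

print(round(sse_mean_only, 2))   # about 6.0
print(round(sse_regression, 2))  # about 2.4
```

The stronger the linear association, the larger the drop in squared error when moving from the mean to the regression line; with no association the two totals would coincide.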
72 Summary
How to run chi-square tests of association on SPSS.
When the data are scarce, the usual chi-square test can give a misleading result. Run an EXACT TEST if there are warnings about low expected frequencies.
REGRESSION is a set of techniques for predicting a target (dependent) variable from a regressor or independent variable.
73 An exercise
I have placed the larger and smaller data sets for the Tissue and Antibody example on my website. Try running the chi-square tests.