1 Matters arising 1. Summary of last week's lecture 2. The exercises
2 Last week Last week, I extended my discussion of statistical association to the topic of partial correlation. A partial correlation can help the researcher to choose among different causal models. I also considered the analysis of nominal data in the form of contingency tables. The chi-square statistic can be used to test for the presence of an association between qualitative or categorical variables.
3 CORRELATION does not necessarily mean CAUSATION
4 The choice A strong positive correlation between Exposure and Actual violence was obtained. But at least three CAUSAL MODELS are compatible with that result.
5 A background variable Fortunately, we had information on a third variable, a measure of parental orientation towards violence. Both Exposure and Actual violence correlated highly with this Background variable.
6 Partial correlation A PARTIAL CORRELATION is what remains of a Pearson correlation between two variables when the influence of a third variable has been removed, or PARTIALLED OUT.
7 Partial correlation The partial correlation removes the influence of the third variable. It is rescaled with the new variances, so that, like an ordinary correlation, it ranges from –1 to +1.
8 The partial correlation When the correlations with Background are taken into account, the correlation between Exposure and Actual violence is no longer significant. The third model seems the most convincing.
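The slide's formula did not survive in this transcript. As a reminder, the standard first-order partial correlation can be written as below; the symbols X (Exposure), Y (Actual violence) and Z (Background) are labels assumed here for illustration, not the lecture's own notation.

```latex
r_{XY \cdot Z} \;=\; \frac{r_{XY} - r_{XZ}\, r_{YZ}}
                          {\sqrt{\left(1 - r_{XZ}^{2}\right)\left(1 - r_{YZ}^{2}\right)}},
\qquad -1 \le r_{XY \cdot Z} \le +1
```

The numerator removes the part of the X-Y correlation that can be attributed to Z; the denominator rescales by the variance left in X and Y once Z has been partialled out.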
9 A medical question Is there an association between the type of body tissue one has and the presence of a potentially harmful antibody? This is a question of whether two QUALITATIVE VARIABLES are associated.
10 A contingency table The pattern of frequencies in the CONTINGENCY TABLE suggests that there is indeed an association between Presence and Tissue Type. The null hypothesis is that the two variables are INDEPENDENT.
11 Expected cell frequencies (E) The EXPECTED FREQUENCY E in each cell of the table is calculated from the MARGINAL TOTALS of the contingency table on the assumption that Tissue Type and Presence are independent. We compare the values of E with the OBSERVED FREQUENCIES O.
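The calculation from the marginal totals can be sketched in a few lines of Python. Under independence, each expected frequency is (row total × column total) / N. The 2 × 2 counts below are hypothetical and are not the lecture's tissue data; the function name is likewise just an illustration.

```python
def expected_frequencies(observed):
    """Expected cell frequencies under independence.

    observed: list of rows, each row a list of cell counts.
    Each E is (row total * column total) / grand total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical counts (NOT the lecture's data): rows are groups, columns are No / Yes.
observed = [[10, 20],
            [30, 40]]
expected = expected_frequencies(observed)
for o_row, e_row in zip(observed, expected):
    print(o_row, e_row)
```

Comparing each printed pair of rows shows where O departs from E, which is the comparison the slide describes.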
12 The expected frequencies In the Critical group, there seem to be large discrepancies between O and E: fewer "No" cases than expected and more "Yes" cases.
13 Formula for chi-square The magnitude of the discrepancies feeds into the value of the CHI-SQUARE statistic.
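The formula itself is missing from this transcript. The familiar Pearson chi-square statistic, written with the O and E notation of the preceding slides, is:

```latex
\chi^{2} \;=\; \sum_{\text{all cells}} \frac{(O - E)^{2}}{E}
```

Each cell contributes its squared discrepancy relative to the expected frequency, so a few cells with large values of O − E can dominate the total.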
14 The value of chi-square The value of chi-square is 10.66.
15 Degrees of freedom To decide whether a given value of chi-square is significant, we must specify the DEGREES OF FREEDOM df. If a contingency table has R rows and C columns, the degrees of freedom are given by df = (R – 1)(C – 1). In our example, R = 4 and C = 2, so df = (4 – 1)(2 – 1) = 3.
16 Significance SPSS will tell us that the p-value of a chi-square value of 10.66 in the chi-square distribution with three degrees of freedom is .014. We should write this result as: χ²(3) = 10.66; p = .01. Since the result is significant beyond the .05 level, we have evidence against the null hypothesis of independence and evidence for the scientific hypothesis.
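The p-value reported by SPSS can be checked by hand. The sketch below is one possible way to do it with only the Python standard library; it is not how SPSS computes the value. It uses the recurrence for the upper tail of the chi-square distribution for whole-number degrees of freedom, and the function name chi2_upper_tail is an assumption made for this example.

```python
import math


def chi2_upper_tail(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom >= x).

    Starts from the closed forms for df = 1 (erfc) or df = 2 (exp)
    and adds one term per extra pair of degrees of freedom.
    """
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)              # df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))   # df = 1
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Values from the slide: chi-square = 10.66 with df = 3.
p = chi2_upper_tail(10.66, 3)
print(round(p, 3))  # 0.014
```

The result agrees with the asymptotic p-value quoted on the slide.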
17 Multiple-choice example
18 Solution It isn't easy to ask a sensible multiple-choice question about partial correlation. C is obviously the correct answer.
20 Solution A is wrong: we usually hope the null hypothesis will be falsified. B is wrong: it's the null hypothesis that is tested. C is wrong: the p-value must be less than 0.05 for significance. D is correct: significance requires a p-value of less than 0.05.
22 Solution df = (R-1)(C-1) = (4 – 1)(5 – 1) = 12. So the correct answer is B.
23 Lecture 10 RUNNING CHI-SQUARE TESTS ON SPSS
24 In Variable View In Variable View, Name three variables and assign Values to the code numbers making up the various tissue groups. Always assign CLEAR VALUE LABELS to make the output comprehensible.
25 In Data View The third variable, Count, contains the frequencies of occurrence of the antibody in the different groups. When entering the data, it's helpful to be able to view the value labels.
26 What the rows in Data View represent SPSS assumes that, in Data View, each row contains information on just ONE participant or CASE. In our example, each row contains information about SEVERAL people. At some point, SPSS must be informed of this. You do this by WEIGHTING THE CASES with the frequencies.
27 Weighting the cases Select Weight Cases from the Data menu. Complete the Weight Cases dialog by transferring Count to the Frequency Variable slot. Click OK to weight the cases with the frequencies.
28 Another approach We could have dispensed with the Count variable and simply entered the data on each of the 79 people in the study. Here are 8 of the 79 cases. You don't need the Weight Cases procedure here.
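The two ways of entering the data carry the same information, which a short Python sketch can make concrete. It expands rows that carry a Count into one row per person, then tallies them back. The group labels other than Critical and all of the counts are hypothetical; they are not the lecture's data.

```python
from collections import Counter

# Hypothetical weighted rows: (tissue type, presence, count). NOT the lecture's figures.
weighted_rows = [
    ("Critical", "Yes", 3),
    ("Critical", "No", 1),
    ("Group A", "Yes", 2),   # "Group A" is a made-up label
    ("Group A", "No", 4),
]


def expand_cases(rows):
    """Turn (tissue, presence, count) rows into one (tissue, presence) row per person."""
    return [(tissue, presence) for tissue, presence, count in rows for _ in range(count)]


cases = expand_cases(weighted_rows)
print(len(cases))  # 10 individual cases

# Tallying the individual cases recovers the original frequencies.
tally = Counter(cases)
print(tally[("Critical", "Yes")])  # 3
```

Weighting by Count in SPSS plays the role of the expansion step here: the analysis treats each weighted row as if it were that many individual cases.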
29 Selecting the chi-square test The chi-square test is available in Crosstabs, on the Descriptive Statistics menu.
30 The Crosstabs dialog We want the columns to represent the Presence variable, as in the contingency table.
31 Clustered bar charts Check the box labelled Display clustered bar charts.
32 Crosstabs: Statistics Choose Chi-square. The chi-square statistic itself is not suitable as a measure of the strength of an association, because it is affected by the size of the data set. Click Phi and Cramér's V. These are measures of the STRENGTH of the association between tissue type and the incidence of the antibody.
33 Crosstabs: Cell Display Check the Observed and Expected buttons. Since the columns represent the "Yes" and "No" cases, it will be useful to have the column PERCENTAGES.
34 The output: contingency table The percentages are useful: they show a marked predominance of Presence of the antibody in the Critical tissue group only.
35 The clustered bar chart The figure shows the trend apparent from inspection of the column percentages. There is a marked presence of the antibody in the Critical tissue group.
36 Result of the chi-square test The p-value is in the column headed Asymp. Sig.: p = .014. Write the result as: χ²(3) = 10.66; p = .01. Notice the information about the number of cells with values of E less than 5. When there are too many, the usual p-value cannot be trusted.
37 Strength of the association Unlike a correlation, the value of chi-square is partly determined by the sample size and is therefore unsuitable as a measure of association strength. Interpret either Phi or Cramér's V as the extent to which the incidence of the antibody can be accounted for by tissue type. Cramér's V can take values in the range from 0 to +1.
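To see how the sample size is taken out, here is a minimal sketch of the usual formula for Cramér's V, V = sqrt(chi-square / (N × min(R – 1, C – 1))). It assumes that the N = 79 mentioned on the "Another approach" slide is the total for the 4 × 2 table that gave chi-square = 10.66; the function name is made up for this example.

```python
import math


def cramers_v(chi_square, n, n_rows, n_cols):
    """Cramér's V: chi-square rescaled by sample size and table shape."""
    return math.sqrt(chi_square / (n * min(n_rows - 1, n_cols - 1)))


# Chi-square from the slides; N = 79 is assumed to be this table's total.
v = cramers_v(10.66, 79, 4, 2)
print(round(v, 3))  # 0.367
```

For a table with only two columns, min(R – 1, C – 1) is 1, so Phi and Cramér's V take the same value.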
38 A smaller data set Is there an association between Tissue Type and Presence of the antibody? The antibody is indeed more in evidence in the Critical tissue group. High incidence in Critical category
39 Result of the chi-square test How disappointing! It looks as if we haven't demonstrated a significant association. Under the column headed Asymp. Sig. is the p-value, which is given as .060.
40 Sampling distributions Because of sampling variability, the values of the statistics we calculate from our data would vary were the data-gathering exercise to be repeated. The distribution of a statistic is known as its SAMPLING DISTRIBUTION. Test statistics such as t, F and chi-square have known sampling distributions. You must know the sampling distribution of any statistic to produce an accurate p-value.
41 The familiar chi-square formula
42 The true definition of chi-square The familiar formula is not the defining formula for chi-square. Chi-square is NOT defined in the context of nominal data, but in terms of continuously distributed, independent standard normal variables Z as follows:
43 True definition of chi-square
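The defining formula is missing from this transcript. In standard notation, with n independent standard normal variables, the definition described on the previous slide is:

```latex
\chi^{2}_{(n)} \;=\; \sum_{i=1}^{n} Z_{i}^{2},
\qquad Z_{1}, \dots, Z_{n} \ \text{independent}, \quad Z_{i} \sim N(0, 1)
```

The number of squared Z terms in the sum, n, is the degrees of freedom of the distribution.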
44 An approximation The familiar chi-square statistic is only APPROXIMATELY distributed as chi-square. The approximation is good, provided that the expected frequencies E are adequately large.
45 The meaning of Asymptotic The term ASYMPTOTIC denotes the limiting distribution of a statistic as the sample size approaches infinity. The asymptotic p-value of a statistic is its p-value under the assumption that the statistic has the limiting distribution. That assumption may be false.
46 Goodness of the approximation… In the SPSS output, the column headed Asymp. Sig. contains a p-value calculated on the assumption that the approximate chi-square statistic behaves like the real chi-square statistic. But underneath the table there is a warning about low frequencies, indicating that the asymptotic p-value cannot be relied upon. Warning about low expected frequencies.
47 Exact tests Fortunately, EXACT TESTS are available, which do not assume that the approximation is good. There are the Fisher exact tests, designed by R. A. Fisher many years ago; and there are modern brute-force methods requiring massive computation.
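The idea behind an exact test can be sketched for the simplest case, a 2 × 2 table. With the marginal totals held fixed, every possible table has a known hypergeometric probability, and the two-sided p-value is the total probability of all tables no more likely than the one observed. The code below is an illustration only: the counts are hypothetical, the function name is made up, and the lecture's 4 × 2 table needs a generalisation of the same idea rather than this 2 × 2 version.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def weight(k):
        # Number of ways to obtain a table whose top-left cell equals k.
        return comb(row1, k) * comb(row2, col1 - k)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed_weight = weight(a)
    # Sum over every table that is no more probable than the observed one.
    extreme = sum(weight(k) for k in range(lowest, highest + 1)
                  if weight(k) <= observed_weight)
    return extreme / comb(n, col1)


# Hypothetical 2 x 2 counts, NOT the lecture's data.
print(round(fisher_exact_2x2(3, 1, 1, 3), 4))  # 0.4857
```

Because the weights are compared as whole numbers, no rounding tolerance is needed when deciding which tables count as at least as extreme.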
48 Ordering an exact test Click the Exact… button at the bottom of the Crosstabs dialog box. Check the Exact radio button in the Exact Tests dialog.
49 A better result! The exact test has shown that we DO have evidence for an association between tissue type and incidence of the antibody. The exact p-value is markedly lower than the asymptotic value.
51 The violence study scatterplot
52 Linear association If two variables have a PERFECT linear relationship, the graph of one against the other is a straight line. The graph of temperature in degrees Fahrenheit against the equivalent Celsius temperature is a straight line.
53 A perfect positive linear relationship [Figure: degrees Fahrenheit plotted against degrees Celsius; a straight line with intercept 32, with the origin (0, 0) and points P and Q marked.]
54 The slope of the line In the equation F = 32 + (9/5)C, the COEFFICIENT 9/5 in front of the Celsius variable is the SLOPE of the straight line. When the Celsius temperature increases by FIVE degrees, the Fahrenheit temperature increases by NINE degrees. When the Celsius temperature increases by one degree, the Fahrenheit temperature increases by 1.8 degrees.
55 A strong linear association A narrowly elliptical scatterplot like this indicates a strong positive association between the two variables. The Pearson correlation is
56 Regression Regression is a set of techniques for exploiting the presence of statistical association among variables to make PREDICTIONS of values of one variable (the DV or CRITERION) from knowledge of the values of other variables (the IVs or REGRESSORS).
57 Simple and multiple regression In the simplest case, there is just one IV or regressor. This is known as SIMPLE regression. In MULTIPLE regression, there are two or more IVs.
58 The regression line
59 The regression line of Violence upon Preference The REGRESSION LINE is the line that fits the points best from the point of view of predicting Actual Violence from Preference. There is a precise criterion for the best-fitting line.
60 The regression equation
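61 F is a linear function of C [Figure: degrees Fahrenheit plotted against degrees Celsius; a straight line with intercept 32, with the origin (0, 0) and points P and Q marked.]
62 The regression line [Figure: Y (Violence) plotted against X (Exposure), showing the regression line, with the origin (0, 0) and points P and Q marked.]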
63 Using the equation
64 Predicting a score [Figure: Y (Violence) plotted against X (Exposure), showing the regression line, its intercept and the origin (0, 0).]
65 The error in prediction
66 Simple regression The regression equation is Y′ = B₀ + BX, where B is the regression coefficient (slope) and B₀ is the regression constant (intercept). Y′ is the y-coordinate of the point on the line above the value of X. An increase of one unit on variable X will result in an estimated increase of B units on variable Y. A NEGATIVE value of B would mean that an increase of one unit on variable X will result in an estimated REDUCTION of |B| units on Y.
67 The least-squares criterion In ORDINARY LEAST SQUARES (OLS) REGRESSION, a RESIDUAL score e is the difference between the real value Y and the estimate Y′ from the regression equation: e = (Fred's real violence – Fred's predicted violence from the regression equation). OLS regression minimizes the sum of the squares of the residuals, Σ(Y – Y′)² = Σe².
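The least-squares criterion can be tried out in a few lines of Python. The sketch below fits a line by the usual closed-form OLS solution and then computes the residuals. For data it uses a handful of Celsius and Fahrenheit pairs generated from the lecture's temperature example, because the violence scores are not reproduced in this transcript; the function names are made up for this example.

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Residual e = real value minus predicted value, for each case."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Temperature pairs from F = 32 + (9/5)C: a perfect linear relationship.
celsius = [0, 10, 20, 30, 40]
fahrenheit = [32, 50, 68, 86, 104]

b0, b1 = ols_fit(celsius, fahrenheit)
errors = residuals(celsius, fahrenheit, b0, b1)
print(b0, b1)                        # intercept close to 32, slope close to 1.8
print(sum(e ** 2 for e in errors))   # close to 0 for a perfect line
```

With real data such as the violence scores, the points scatter about the line, so the sum of squared residuals is greater than zero, and the fitted line is the one that makes that sum as small as possible.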
68 Finding the values of b₀ and b
69 Regression line with independence When the variables show no association, the slope of the regression line is zero and the line runs horizontally through the mean M_Y of the criterion or dependent variable. The intercept (B₀) is M_Y in this case.
70 Intercept-only prediction In OLS regression, the intercept B₀ is related to the regression coefficient B₁ according to B₀ = M_Y – B₁M_X. When X and Y are independent, the slope of the regression line is zero and B₀ = M_Y. The best we can do with regression is to draw a horizontal line at Y = M_Y through the middle of the cloud of points. Whatever the degree of association between X and Y, the INTERCEPT-ONLY prediction is Y′ = M_Y.
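The formulas for this slide are missing from the transcript. The standard least-squares solutions, written with M_X and M_Y for the means of X and Y (notation assumed here to match the later slides), are:

```latex
b \;=\; \frac{\sum (X - M_X)(Y - M_Y)}{\sum (X - M_X)^{2}},
\qquad
b_0 \;=\; M_Y - b\, M_X
```

The second formula shows that the regression line always passes through the point of means (M_X, M_Y).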
71 Improved prediction There is a strong linear association here. The regression line makes much more accurate predictions than simply using the mean score on Actual violence as your prediction whatever the Preference value.
72 Summary How to run chi-square tests of association on SPSS. When the data are scarce, the usual chi-square test can give a misleading result. Run an EXACT TEST if there are warnings about low expected frequencies. REGRESSION is a set of techniques for predicting a target (dependent) variable from a regressor or independent variable.
73 An exercise I have placed the larger and smaller data sets for the Tissue and Antibody example on my website. Try running the chi-square tests.