
1 Matters arising Summary of last week’s lecture The exercises

2 Last week Last week I extended my discussion of statistical association to the topic of partial correlation. A partial correlation can help the researcher to choose among different causal models. I also considered the analysis of nominal data in the form of contingency tables. The chi-square statistic can be used to test for the presence of an association between qualitative, or categorical, variables.

3 CORRELATION does not necessarily mean CAUSATION

4 The choice A strong positive correlation between Exposure and Actual violence was obtained. But at least three CAUSAL MODELS are compatible with that result.

5 A background variable Fortunately, we had information on a third variable, a measure of parental orientation towards violence. Both Exposure and Actual violence correlated highly with this Background variable.

6 Partial correlation A PARTIAL CORRELATION is what remains of a Pearson correlation between two variables when the influence of a third variable has been removed, or PARTIALLED OUT.

7 Partial correlation A partial correlation removes the influence of the third variable.
It is rescaled with the adjusted variances, so that, like an ordinary correlation, it lies in the range from –1 to +1.
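If you want to check a partial correlation outside SPSS, here is a minimal Python sketch of the standard first-order formula; the three correlations below are invented purely for illustration.

    import math

    def partial_correlation(r_xy, r_xz, r_yz):
        # First-order partial correlation between X and Y, controlling for Z,
        # computed from the three pairwise Pearson correlations.
        return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

    # Hypothetical values: Exposure-Violence, Exposure-Background, Violence-Background.
    print(partial_correlation(r_xy=0.65, r_xz=0.80, r_yz=0.78))   # a much smaller value, still in [-1, +1]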

8 The partial correlation
When the correlations with the Background variable are taken into account, the original correlation is no longer statistically significant. The third model seems the most convincing.

9 A medical question Is there an association between the type of body tissue one has and the presence of a potentially harmful antibody? This is a question of whether two QUALITATIVE VARIABLES are associated.

10 A contingency table The pattern of frequencies in the CONTINGENCY TABLE suggests that there is indeed an association between Presence and Tissue Type. The null hypothesis is that the two variables are INDEPENDENT.

11 Expected cell frequencies (E)
The EXPECTED FREQUENCY E in each cell of the table is calculated from the MARGINAL TOTALS of the contingency table on the assumption that Tissue Type and Presence are independent. We compare the values of E with the OBSERVED FREQUENCIES O.
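As a sketch of the arithmetic (the observed frequencies below are invented; only the total of 79 cases echoes the example), each E is the product of the corresponding row and column totals divided by the grand total.

    import numpy as np

    # Hypothetical 4 x 2 table of observed frequencies O (rows = tissue types, columns = No / Yes).
    O = np.array([[20,  5],
                  [18,  6],
                  [15,  5],
                  [ 3,  7]])

    row_totals = O.sum(axis=1, keepdims=True)
    column_totals = O.sum(axis=0, keepdims=True)
    grand_total = O.sum()

    # Under independence, E = (row total x column total) / grand total for every cell.
    E = row_totals * column_totals / grand_total
    print(E)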

12 The expected frequencies
In the Critical group, there seem to be large discrepancies between O and E: fewer No’s than expected and more Yes’s.

13 Formula for chi-square
The magnitude of the discrepancies between O and E feeds into the value of the CHI-SQUARE statistic: χ² = Σ (O – E)² / E, summed over all the cells of the table.
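Continuing the sketch with the same invented table, the statistic can be computed either from the formula directly or with scipy's chi2_contingency; this is illustrative only, not the SPSS output.

    import numpy as np
    from scipy.stats import chi2_contingency

    O = np.array([[20, 5], [18, 6], [15, 5], [3, 7]])     # hypothetical observed frequencies
    E = O.sum(axis=1, keepdims=True) * O.sum(axis=0, keepdims=True) / O.sum()

    chi_square = ((O - E) ** 2 / E).sum()                 # the familiar formula
    stat, p, df, expected = chi2_contingency(O, correction=False)
    print(chi_square, stat)                               # the two chi-square values agree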

14 The value of chi-square
For these data, the value of chi-square is 10.66.

15 Degrees of freedom To decide whether a given value of chi-square is significant, we must specify the DEGREES OF FREEDOM df. If a contingency table has R rows and C columns, the degrees of freedom is given by df = (R – 1)(C – 1) In our example, R = 4, C = 2 and so df = (4 – 1)(2 – 1) = 3.

16 Significance SPSS will tell us that the p-value of a chi-square of 10.66 in the chi-square distribution with three degrees of freedom is .014. We should write this result as: χ2(3) = 10.66; p = .01. Since the result is significant beyond the .05 level, we have evidence against the null hypothesis of independence and evidence for the scientific hypothesis.
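The p-value is simply the upper-tail area of the chi-square distribution with three degrees of freedom beyond the obtained value; a quick check in Python (scipy assumed):

    from scipy.stats import chi2

    chi_square_value = 10.66           # the value reported above
    df = 3
    p = chi2.sf(chi_square_value, df)  # upper-tail (survival function) probability
    print(round(p, 3))                 # about .014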

17 Multiple-choice example

18 Solution It isn’t easy to ask a sensible multiple-choice question about partial correlation. C is obviously the correct answer.

19 Example

20 Solution A is wrong: we usually hope the null hypothesis will be falsified. B is wrong: it’s the null hypothesis that is tested. C is wrong: the p-value must be less than 0.05 for significance. D is correct: significance requires a p-value of less than 0.05.

21 Example

22 Solution df = (R-1)(C-1) = (4 – 1)(5 – 1) = 12.
So the correct answer is B.

23 Lecture 10 RUNNING CHI-SQUARE TESTS ON SPSS

24 In Variable View In Variable View, name the three variables and use the Values column to assign labels to the code numbers that identify the tissue groups. Always assign CLEAR VALUE LABELS to make the output comprehensible.

25 In Data View The third variable, Count, contains the frequencies of occurrence of the antibody in the different groups. When entering the data, it’s helpful to be able to view the value labels.

26 What the rows in Data View represent
SPSS assumes that, in Data View, each row contains information on just ONE participant or CASE. In our example, each row contains information about SEVERAL people. At some point, SPSS must be informed of this. You do this by WEIGHTING THE CASES with the frequencies.

27 Weighting the cases Select Weight Cases from the Data menu.
Complete the Weight Cases dialog by transferring Count to the Frequency Variable slot. Click OK to weight the cases with the frequencies.

28 Another approach We could have dispensed with the Count variable and simply entered the data on each of the 79 people in the study. Here are 8 of the 79 cases. You don’t need the Weight cases procedure here.
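The two ways of entering the data are equivalent. Here is a pandas sketch of the idea (the variable names and counts are invented, and only two tissue groups are shown for brevity).

    import pandas as pd

    # Weighted form: one row per Tissue x Presence combination, plus a Count of cases.
    weighted = pd.DataFrame({
        "Tissue":   ["A", "A", "B", "B"],
        "Presence": ["No", "Yes", "No", "Yes"],
        "Count":    [20, 5, 18, 6],
    })

    # Raw form: one row per case, produced by repeating each combination Count times.
    raw = weighted.loc[weighted.index.repeat(weighted["Count"]), ["Tissue", "Presence"]]

    # Both give the same contingency table.
    print(pd.crosstab(weighted["Tissue"], weighted["Presence"],
                      values=weighted["Count"], aggfunc="sum"))
    print(pd.crosstab(raw["Tissue"], raw["Presence"]))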

29 Selecting the chi-square test
The chi-square test is available in Crosstabs, on the Descriptive Statistics menu.

30 The Crosstabs dialog We want the columns to represent the Presence variable, as in the contingency table.

31 Clustered bar charts Check the box labelled ‘Display clustered bar charts’

32 Crosstabs: Statistics
Choose Chi-square. The Chi-square statistic itself is not suitable as a measure of the strength of an association, because it is affected by the size of the data set. Click ‘Phi and Cramer’s V’. These are measures of the STRENGTH of the association between tissue type and the incidence of the antibody.
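For reference, Cramer's V rescales chi-square by the sample size and the table size so that it always lies between 0 and 1; a sketch using the invented table from earlier:

    import numpy as np
    from scipy.stats import chi2_contingency

    O = np.array([[20, 5], [18, 6], [15, 5], [3, 7]])   # hypothetical 4 x 2 table
    chi_square, p, df, E = chi2_contingency(O, correction=False)

    n = O.sum()
    k = min(O.shape) - 1                      # min(rows, columns) - 1
    cramers_v = np.sqrt(chi_square / (n * k))
    phi = np.sqrt(chi_square / n)             # phi coefficient; equals V when min(rows, columns) = 2
    print(cramers_v, phi)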

33 Crosstabs: Cell Display
Check the Observed and Expected buttons. Since the columns represent Yes’s and No’s, it will be useful to have the column PERCENTAGES.

34 The output: contingency table
The percentages are useful: they show a marked predominance of Presence of the antibody in the Critical tissue group only.

35 The clustered bar chart
The figure shows the trend apparent from inspection of the column percentages. There is a marked presence of the antibody in the Critical tissue group.

36 Result of the chi-square test
The p-value appears in the column headed ‘Asymp. Sig.’: p = .014. Write the result as: χ2(3) = 10.66; p = .01. Notice the information about the number of cells with values of E less than 5. When there are too many, the usual p-value cannot be trusted.

37 Strength of the association
Unlike a correlation, the value of chi-square is partly determined by the sample size and is therefore unsuitable as a measure of association strength. Interpret either Phi or Cramer’s statistic as the extent to which the incidence of the antibody can be accounted for by tissue type. Cramer’s V can take values in the range from 0 to +1.

38 A smaller data set Is there an association between Tissue Type and Presence of the antibody? The antibody is indeed more in evidence in the ‘Critical’ tissue group, where its incidence is high.

39 Result of the chi-square test
How disappointing! It looks as if we haven’t demonstrated a significant association. Under the column headed ‘Asymp. Sig.’ is the p-value, which is greater than .05.

40 Sampling distributions
Because of sampling variability, the values of the statistics we calculate from our data would vary were the data-gathering exercise to be repeated. The distribution of a statistic is known as its SAMPLING DISTRIBUTION. Test statistics such as t, F and chi-square have known sampling distributions. You must know the sampling distribution of any statistic to produce an accurate p-value.

41 The familiar chi-square formula The familiar formula is χ² = Σ (O – E)² / E, where the sum runs over all the cells of the contingency table.

42 The true definition of chi-square
The familiar formula is not the defining formula for chi-square. Chi-square is NOT defined in the context of nominal data, but in terms of continuously distributed, independent standard normal variables Z as follows:

43 True definition of chi-square Chi-square with df degrees of freedom is the sum of the squares of df independent standard normal variables: χ²(df) = Z1² + Z2² + … + Zdf².
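A quick simulation (purely illustrative) shows the definition at work: summing df squared standard normal variables reproduces the chi-square distribution.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(0)
    df = 3

    # Many samples of df independent standard normal variables, squared and summed.
    z = rng.standard_normal(size=(100_000, df))
    sums = (z ** 2).sum(axis=1)

    # The simulated tail probability beyond 10.66 matches the theoretical chi-square value.
    print((sums > 10.66).mean())   # roughly .014
    print(chi2.sf(10.66, df))      # about .014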

44 An approximation The familiar chi-square statistic is only APPROXIMATELY distributed as chi-square. The approximation is good, provided that the expected frequencies E are adequately large.

45 The meaning of ‘Asymptotic’
The term ASYMPTOTIC denotes the limiting distribution of a statistic as the sample size approaches infinity. The ‘asymptotic’ p-value of a statistic is its p-value under the assumption that the statistic has the limiting distribution. That assumption may be false.

46 Goodness of the approximation
Warning about low expected frequencies: in the SPSS output, the column headed ‘Asymp. Sig.’ contains a p-value calculated on the assumption that the approximate chi-square statistic behaves like the real chi-square statistic. But underneath the table there is a warning about low expected frequencies, indicating that the ‘asymptotic’ p-value cannot be relied upon.

47 Exact tests Fortunately, EXACT TESTS are available; these do not rely on the approximation being good. There are the Fisher exact tests, devised by R. A. Fisher many years ago, and there are modern ‘brute force’ methods requiring massive computation.
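For a 2 x 2 table, a Fisher exact test can also be run in Python with scipy (scipy handles only the 2 x 2 case; SPSS’s exact tests cover larger tables). The small counts below are invented.

    from scipy.stats import fisher_exact

    # Hypothetical 2 x 2 table with small counts: rows = two tissue groups, columns = No / Yes.
    table = [[8, 2],
             [3, 7]]

    odds_ratio, p_exact = fisher_exact(table, alternative="two-sided")
    print(p_exact)   # an exact p-value that remains valid when expected frequencies are small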

48 Ordering an exact test Click the Exact… button at the bottom of the Crosstabs dialog box. Check the Exact radio button in the Exact Tests dialog.

49 A better result! The exact test has shown that we DO have evidence for an association between tissue type and incidence of the antibody. The exact p-value is markedly lower than the asymptotic value.

50 Regression

51 The violence study scatterplot

52 Linear association If two variables have a PERFECT linear relationship, the graph of one against the other is a straight line. The graph of temperature in degrees Fahrenheit against the equivalent Celsius temperature is a straight line.

53 A perfect positive linear relationship
The graph plots Degrees Fahrenheit against Degrees Celsius: a straight line meeting the vertical axis at the intercept, 32.

54 The slope of the line The COEFFICIENT 9/5 in front of the Celsius variable is the SLOPE of the straight line. When the Celsius temperature increases by FIVE degrees, the Fahrenheit temperature increases by NINE degrees. When the Celsius temperature increases by one degree, the Fahrenheit temperature increases by 1.8 degrees.
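The relationship can be written as a one-line function; here is a trivial Python version.

    def fahrenheit(celsius):
        # F = 32 + (9/5) * C: intercept 32, slope 9/5 = 1.8
        return 32 + 1.8 * celsius

    print(fahrenheit(0), fahrenheit(5), fahrenheit(100))   # 32.0, 41.0, 212.0 -- five Celsius degrees = nine Fahrenheit degrees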

55 A strong linear association
A narrowly elliptical scatterplot like this indicates a strong positive association between the two variables. The Pearson correlation is correspondingly large and positive.

56 Regression Regression is a set of techniques for exploiting the presence of statistical association among variables to make PREDICTIONS of values of one variable (the DV or CRITERION) from knowledge of the values of other variables (the IVs or REGRESSORS).

57 Simple and multiple regression
In the simplest case, there is just one IV or regressor. This is known as SIMPLE regression. In MULTIPLE regression, there are two or more IVs.

58 The regression line

59 The regression line of Violence upon Preference
The REGRESSION LINE is the line that fits the points best from the point of view of predicting Actual Violence from Preference. There is a precise criterion for the ‘best-fitting’ line.

60 The regression equation The regression equation has the form Y′ = B0 + BX: the predicted score Y′ is the intercept B0 plus the slope B times the value of X.

61 F is a linear function of C
The Fahrenheit–Celsius graph again: Degrees Fahrenheit plotted against Degrees Celsius, with the line meeting the vertical axis at the intercept, 32.

62 The regression line The corresponding graph for the violence data plots Violence (Y) against Exposure (X).

63 Using the equation

64 Predicting a score The graph plots Violence (Y) against Exposure (X); the regression line has an intercept of 2.091, and the predicted Violence score is read off for an Exposure score of 9.

65 The error in prediction The error in a prediction is the difference between the actual score and the predicted score: e = Y – Y′.

66 Simple regression In the regression equation Y′ = B0 + BX, B is the regression coefficient (slope) and B0 is the regression constant (intercept).
Y′ is the y-coordinate of the point on the line above the value of X. An increase of one unit on variable X results in an estimated increase of B units on variable Y. A NEGATIVE value of B would mean that an increase of one unit on variable X results in an estimated REDUCTION of B units on Y.

67 The ‘least-squares’ criterion
In ORDINARY LEAST SQUARES (OLS) REGRESSION, a RESIDUAL score e is the difference between the real value Y and the estimate Y′ from the regression equation: e = (Fred’s real violence – Fred’s predicted violence from the regression equation). OLS regression minimizes the sum of the squares of the residuals, Σ(Y – Y′)² = Σe².
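A minimal numpy sketch of the least-squares computations on made-up data: the slope and intercept below are the values that minimize the sum of squared residuals.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=50)                               # hypothetical Exposure scores
    y = 2.0 + 0.8 * x + rng.normal(scale=0.5, size=50)    # hypothetical Actual violence scores

    # OLS estimates: slope b = covariance(X, Y) / variance(X); intercept b0 = MY - b * MX.
    b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b * x.mean()

    y_hat = b0 + b * x                      # predicted scores Y'
    residuals = y - y_hat                   # e = Y - Y'
    print(b0, b, (residuals ** 2).sum())    # intercept, slope, minimized sum of squared residuals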

68 Finding the values of b0 and b The least-squares estimates are b = (covariance of X and Y) / (variance of X) for the slope, and b0 = MY – bMX for the intercept.

69 Regression line with independence
When the variables show no association, the slope of the regression line is zero and the line runs horizontally through the mean MY of the criterion or dependent variable. The intercept (B0) is MY in this case.

70 Intercept-only prediction
In OLS regression, the intercept B0 is related to the regression coefficient B1 according to B0 = MY – B1MX. When X and Y are independent, the slope of the regression line is zero and B0 = MY. The best we can do with regression is to draw a horizontal line at Y = MY through the middle of the cloud of points. Whatever the degree of association between X and Y, the INTERCEPT-ONLY prediction is Y′ = MY.
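Extending the earlier sketch: if X and Y are unrelated, the fitted slope is close to zero and the regression prediction collapses to the mean of Y.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=1000)
    y = rng.normal(size=1000)       # generated independently of x, so no association

    b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b * x.mean()
    print(b, b0, y.mean())          # slope near zero; intercept approximately equal to the mean of Y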

71 Improved prediction There is a strong linear association here.
The regression line makes much more accurate predictions than simply using the mean score on Actual violence as your prediction whatever the Preference value.

72 Summary How to run chi-square tests of association on SPSS.
When the data are scarce, the usual chi-square test can give a misleading result. Run an EXACT TEST if there are warnings about low expected frequencies. REGRESSION is a set of techniques for predicting a target (dependent) variable from a regressor or independent variable.

73 An exercise I have placed the larger and smaller data sets for the Tissue and Antibody example on my website. Try running the chi-square tests.

