1 Matters arising 1. Summary of last week's lecture 2. The exercises 3. Your queries
2 The Pearson correlation (r) The PEARSON CORRELATION is a measure of a supposed linear association between two variables.
3 Linear, but imperfect association If the scatterplot is elliptical in shape, a linear association is indicated. In psychology, all measurement is subject to random error. No association between measured variables is ever perfect. That is why the points do not all lie on a straight line.
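4 The Pearson correlation $r = \dfrac{SP}{\sqrt{SS_X \, SS_Y}}$, where $SP = \sum (X - M_X)(Y - M_Y)$ is the sum of products, and $SS_X = \sum (X - M_X)^2$ and $SS_Y = \sum (Y - M_Y)^2$ are the sums of squares.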
5 Explanation The numerator of r is known as a SUM OF PRODUCTS (SP). It is the sum of products that captures the extent to which X and Y are associated, or CO-VARY. The sums of squares in the denominator merely constrain the range of variation of r to lie between -1 and +1.
6 The sum of products captures covariation Points in the upper right quadrant have positive deviation products; points in the lower left also have positive deviation products (a minus times a minus is a plus). Points in the other two quadrants have negative products. Since the positive products predominate, we can expect the covariance to be large and positive. The negative products are small: the points are near the intersection of the mean lines. [Figure: scatterplot with mean lines marked "Mean Preference score" and "Mean Actual Violence score"]
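The role of the sum of products can be made concrete with a short sketch. The function below computes r as SP divided by the square root of the product of the two sums of squares. The scores are made up for illustration and are not the lecture's Violence data; the name `pearson_r` is just a convenient label.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed as SP / sqrt(SS_x * SS_y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Deviation products: positive in the upper-right and lower-left
    # quadrants of the scatterplot, negative in the other two.
    products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]
    sp = sum(products)                             # sum of products
    ss_x = sum((xi - mean_x) ** 2 for xi in x)     # sum of squares for X
    ss_y = sum((yi - mean_y) ** 2 for yi in y)     # sum of squares for Y
    return sp / math.sqrt(ss_x * ss_y)

# Made-up illustrative scores (an assumption, not data from the lecture).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(round(pearson_r(x, y), 2))
```

With these scores SP = 8 and both sums of squares are 10, so r = 8/10 = 0.8.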
7 An elliptical scatterplot This is fine. The elliptical scatterplot indicates that there is indeed a basically linear relationship between variable Y1 and variable X1.
8 No association There is NO association between Z and Y. The high value of r is driven solely by the presence of a single OUTLIER.
9 Anscombe's rule When you examine a scatterplot (something you should ALWAYS do when interpreting a correlation), ask yourself the following question: Would the removal of one or two points at random affect the basically elliptical shape of the scatterplot? If the shape would remain essentially the same, the value of r accurately reflects the association between the variables.
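The effect of a single outlier on r can be checked numerically as well as by eye. The sketch below uses made-up data (not the lecture's Z and Y variables): nine points arranged so that there is no association at all, plus one extreme point. The correlation is computed with and without the outlier.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: sum of products over root of the sums of squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)

# Made-up data: a 3 x 3 grid of points, which has no linear association.
z = [1, 2, 3, 1, 2, 3, 1, 2, 3]
y = [1, 1, 1, 2, 2, 2, 3, 3, 3]

# The same points plus a single hypothetical outlier at (20, 20).
z_with_outlier = z + [20]
y_with_outlier = y + [20]

r_without = pearson_r(z, y)
r_with = pearson_r(z_with_outlier, y_with_outlier)
print(round(r_without, 2), round(r_with, 2))
```

Without the outlier r is zero; with it, r rises to about 0.98. The high value says nothing about the other nine points, which is why the scatterplot must be inspected.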
10 Summary The Pearson correlation r is a measure of the strength of a SUPPOSED linear relationship between 2 variables. It is one of the most widely used of statistical measures; but it is also one of the most misused. You should always try to see the scatterplot when interpreting a value of r.
11 Exercise From the Violence data, obtain a scatterplot and calculate the Pearson correlation.
12 Direction of causation When we measure two variables and obtain the correlation between them, we nearly always do so because we believe that one variable X causes or influences the other Y. We have measured Exposure X and Violence Y because we have the hypothesis that X causes Y.
13 The scatterplot of Y against X If we believe that X causes Y, we want to PLOT Y AGAINST X. We want a scatterplot with Y on the vertical axis and X on the horizontal axis. [Figure: scatterplot of Y against X; labelled points: Richard, John, Jim]
16 The vertical scale Notice that the vertical axis begins at 3, rather than at zero. I like to see the whole scale on the vertical axis. Double-click on the graph to enter the Chart Editor. Double-click on the vertical axis to enter a dialog which will enable you to control the amount of the vertical scale that you can see.
17 Ordering the full Y scale Uncheck Auto and enter zero into the Custom slot.
19 Why do I like to see the entire scale on the vertical axis?
20 Beware! Modern computing packages such as SPSS afford a bewildering variety of attractive graphs and displays to help you bring out the most important features of your results. You should certainly use them. But there are pitfalls awaiting the unwary.
21 Performance profiles We often want to see how mean performance varies (or not) over various treatment conditions. We might want to compare the performance of participants who have ingested different kinds (or dosages) of drugs with that of a comparison or control group. There is a set of methods known as Analysis of Variance (ANOVA) which enable us to do that.
29 The true picture … The effect is dramatic. The profile now reflects the true situation. ALWAYS BE SUSPICIOUS OF GRAPHS THAT DO NOT SHOW THE COMPLETE VERTICAL SCALE!
30 Your queries Several of you have e-mailed me asking how you fit a line graph to a scatterplot. Last week, I said that an elliptical scatterplot indicated that the relationship between the variables was basically LINEAR. So we want the best-fitting straight line through the points. This is known as the REGRESSION LINE.
31 Drawing the regression line through the points Choose Fit Line at Total. To leave the Chart Editor, choose Close from the Edit menu or double-click on the Viewer outside the rectangle around the figure.
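For those who want to see what lies behind the fitted line, here is a minimal sketch. It assumes that the line drawn by the package is the ordinary least-squares regression line of Y on X, whose slope is SP divided by the sum of squares of X. The scores are made up for illustration, and the function name is hypothetical.

```python
def regression_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    slope = sp / ss_x                      # change in Y per unit change in X
    intercept = mean_y - slope * mean_x    # the line passes through the means
    return intercept, slope

# Made-up illustrative scores (an assumption, not data from the lecture).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
intercept, slope = regression_line(x, y)
print(round(intercept, 2), round(slope, 2))
```

For these scores the fitted line is Y' = 0.6 + 0.8X.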
33 Hypothesis testing In HYPOTHESIS TESTING, a proposition known as the NULL HYPOTHESIS (H0) is set up. H0 is the NEGATION of your scientific hypothesis. So if our scientific hypothesis is that there is an association, H0 says there is NO association.
34 The p-value To test H0, we gather our data and calculate the value of a TEST STATISTIC. If the null hypothesis is true, how probable would a value of our test statistic at least as extreme as ours have been? The answer is given by a probability known as the p-value. SPSS calls the p-value the Sig., i.e., the SIGNIFICANCE PROBABILITY.
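The logic of the p-value can be illustrated by simulation. The sketch below is NOT the method SPSS uses for a correlation (which relies on a theoretical distribution); it is a permutation test, swapped in here because it shows the idea directly. If H0 is true, the pairing of X with Y is arbitrary, so we shuffle Y many times and count how often the shuffled data give a correlation at least as extreme as the one observed. The data, the number of shuffles and the seed are all made-up assumptions.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation: sum of products over root of the sums of squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)

def permutation_p_value(x, y, n_shuffles=5000, seed=0):
    """Estimate a two-sided p-value for r by shuffling y (illustration only)."""
    rng = random.Random(seed)          # fixed seed so the result is repeatable
    observed = abs(pearson_r(x, y))
    y_shuffled = list(y)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(y_shuffled)        # break any real pairing of x with y
        if abs(pearson_r(x, y_shuffled)) >= observed:
            extreme += 1
    return extreme / n_shuffles

# Made-up illustrative scores (an assumption, not data from the lecture).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
p = permutation_p_value(x, y)
print(p < 0.05)
```

A small estimated p-value means that a correlation as extreme as ours would rarely arise if there were no association.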
35 A significant result A SIGNIFICANCE LEVEL is a small probability accepted by convention as a criterion for a decision about a statistical test. Most commonly, the 0.05 significance level is accepted by psychologists. If the p-value of your test statistic is LESS than the 0.05 significance level, your result is said to be significant beyond the 0.05 level.
36 The result Report this result as follows: r(27) = 0.89; p < .01. Here 27 is the number of pairs, 0.89 is the value of r, and p < .01 is the p-value. [Figure: SPSS output with the p-value marked "Never report a p-value like this!"] Report the p-value to 2 decimal places; if it is less than .01, use the inequality sign <.
38 We have shown that there is a strong association between a child's violence and the amount of violent screen material watched …
39 but have we really gathered evidence for the hypothesis that exposure to screen violence promotes actual violence?
40 Remember: CORRELATION does not necessarily mean CAUSATION
41 One causal model The hypothesis implies this CAUSAL MODEL. The results are CONSISTENT with the hypothesis. The correlation may indeed arise because exposure to violence causes actual violence.
42 Another causal model The child's violent tendencies and appetite for violence lead to his (or her) watching violent programmes as often as possible. This model is also consistent with the data.
43 A third causal model NEITHER variable causes the other. Both are determined by the behaviour of the child's parents.
44 The choice Does exposure cause violence (top model)? Does Violence lead to more exposure (middle model)? Are both exposure and violence caused by a third, background, variable (bottom model)?
45 A background variable Perhaps neither Exposure nor Actual violence causes the other. Perhaps both are caused by a background parental behaviour variable. We have data on such a variable. The background variable correlates highly with both Exposure and Actual violence.
46 Partial correlation A PARTIAL CORRELATION is what remains of a Pearson correlation between two variables when the influence of a third variable has been removed, or PARTIALLED OUT.
47 Three variables Let X1, X2 and X3 be three variables. Let r12 be the Pearson correlation between X1 and X2. Let r12.3 be the partial correlation between X1 and X2 when the covariation of each with X3 has been removed.
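A partial correlation can be obtained from the three ordinary correlations. The sketch below uses the standard first-order formula, r12.3 = (r12 - r13 r23) / sqrt((1 - r13²)(1 - r23²)); the formula is not stated in the transcript, and the correlation values fed to it are made up for illustration and are not the lecture's results.

```python
import math

def partial_correlation(r12, r13, r23):
    """Correlation between X1 and X2 with X3 partialled out."""
    numerator = r12 - r13 * r23
    denominator = math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
    return numerator / denominator

# Made-up correlations (assumptions): X1 and X2 correlate strongly, but each
# also correlates strongly with the background variable X3.
r12, r13, r23 = 0.80, 0.90, 0.85
print(round(partial_correlation(r12, r13, r23), 2))
```

When X3 correlates highly with both X1 and X2, most of the original correlation can disappear once X3 is partialled out. When X3 is unrelated to both, the partial correlation equals r12.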
51 The partial correlation The partial correlation fails to reach significance. Now that we have taken the background variable into consideration, we see that there is no significant correlation between Exposure and Actual violence. It appears that, of the three possible causal models, the third party model gives the most convincing account of the data.
52 Levels of measurement There are three levels: 1. The SCALE level. The data are measures on an independent scale with units. Heights, weights, performance scores and IQs are scale data. Each score has stand-alone meaning. 2. The ORDINAL level. Data in the form of RANKS (1st, 3rd, 53rd). A rank has meaning only in relation to the other individuals in the sample. A rank does not express, in units, the extent to which a property is possessed. 3. The NOMINAL level. Assignments to categories (so many males, so many females).
53 3. Nominal data NOMINAL data relate to qualitative variables or attributes, such as gender or blood group, and are merely records of CATEGORY MEMBERSHIP. Nominal data are merely LABELS: they may take the form of numbers, but such numbers are arbitrary code numbers representing, say, the different blood groups or different nationalities. ANY numbers will do, as long as they are all different.
54 A set of nominal data A medical researcher wishes to test the hypothesis that people with a certain type of body tissue (Critical) are more likely to show the presence of a potentially harmful antibody. Data are obtained on 79 people, who are classified with respect to 2 attributes: 1. Tissue Type; 2. Whether the antibody is present or absent.
55 The research question Do more of the people in the critical group have the antibody? We are asking whether there is an ASSOCIATION between the variables of category membership (tissue type) and presence/absence of the antibody. This is the SCIENTIFIC hypothesis.
56 The null hypothesis The NULL HYPOTHESIS is the negation of the scientific hypothesis. The null hypothesis states that there is NO association between tissue type and presence of the antibody.
57 Contingency tables (cross-tabulations) When we wish to investigate whether an association exists between qualitative or categorical variables, the starting point is usually a display known as a CONTINGENCY TABLE, whose rows and columns represent the categories of the qualitative variables we are studying. Contingency tables are also known as CROSS-TABULATIONS, or CROSSTABS.
58 The contingency table Is there an association between Tissue Type and Presence of the antibody? It looks as if the antibody is indeed more in evidence in the Critical tissue group.
59 The null hypothesis The null hypothesis is the negation of our scientific hypothesis, namely, the statement that the two variables are INDEPENDENT. In other words, any differences in the relative incidence of the antibody in the different tissue groups have resulted from SAMPLING ERROR.
60 Expected cell frequencies The pattern of the OBSERVED FREQUENCIES (O) would suggest that there is a greater incidence of the antibody in the Critical tissue group. But the marginal totals showing the frequencies of the various groups in the sample also vary. What cell frequencies would we expect under the independence hypothesis?
61 Expected cell frequencies (E) According to the null hypothesis, the presence of the antibody and membership of a particular tissue type are independent events. The probability of the joint occurrence of independent events is the product of their separate probabilities. We find the expected frequency (E) for each cell by multiplying together the two marginal totals that intersect at that cell and dividing by the total number of observations.
62 The expected frequencies To obtain, say, the value of E for the top left cell, multiply the intersecting marginal totals (36 and 22) and divide by 79 (the total frequency), obtaining (36×22)/79 = 10.03. In the Critical group, there seem to be large differences between O and E: fewer Noes than expected and more Yeses.
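The rule for expected frequencies is easily expressed as a short routine. The sketch below multiplies each row total by each column total and divides by the grand total. The 2 x 2 table is made up for illustration (the lecture's full table is not reproduced here); only the marginal totals 36 and 22 and the total of 79 come from the slide.

```python
def expected_frequencies(observed):
    """Expected cell frequencies under independence: row total * column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Made-up 2 x 2 table of observed frequencies (an assumption, not lecture data).
observed = [[10, 20],
            [30, 40]]
expected = expected_frequencies(observed)
print(expected)

# The single cell worked out on the slide: marginal totals 36 and 22, N = 79.
top_left = 36 * 22 / 79
print(round(top_left, 2))
```

Note that the expected frequencies have the same marginal totals as the observed frequencies.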
63 The chi-square (χ²) statistic We need a statistic which summarises the differences between the O and E values, so that a large value will cast doubt upon the null hypothesis of independence. Such a statistic is CHI-SQUARE (χ²).
64 Formula for chi-square $\chi^2 = \sum \dfrac{(O - E)^2}{E}$. Each term of chi-square expresses the square of the difference between O and E as a proportion of E. Add up these terms for all the cells in the contingency table.
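The calculation of chi-square can be sketched in a few lines: find E for each cell, then sum (O - E)²/E over all the cells. The table is made up for illustration and is not the lecture's tissue-type data; the function names are hypothetical.

```python
def expected_frequencies(observed):
    """Expected cell frequencies under independence: row total * column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square(observed):
    """Sum of (O - E)^2 / E over every cell of the contingency table."""
    expected = expected_frequencies(observed)
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            total += (o - e) ** 2 / e
    return total

# Made-up 2 x 2 table of observed frequencies (an assumption, not lecture data).
observed = [[10, 20],
            [30, 40]]
print(round(chi_square(observed), 3))
```

If every observed frequency equalled its expected frequency, each term would be zero and so would chi-square.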
65 The value of chi-square There are 8 terms in the summation, but only the first two and the last are shown in the calculation below.
66 Degrees of freedom To decide whether a given value of chi-square is significant, we must specify the DEGREES OF FREEDOM (df). If a contingency table has R rows and C columns, the degrees of freedom are given by df = (R − 1)(C − 1). In our example, R = 4, C = 2 and so df = (4 − 1)(2 − 1) = 3.
67 Significance SPSS will tell us that the p-value of a chi-square with a value of 10.655 in the chi-square distribution with three degrees of freedom is .014. We should write this result as: χ²(3) = 10.66; p = .01. Since the result is significant beyond the .05 level, we have evidence against the null hypothesis of independence and evidence for the scientific hypothesis.
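The reported p-value can be checked without a statistics package. The sketch below computes the degrees of freedom and then the upper-tail probability of chi-square. It relies on a closed-form expression that holds ONLY for three degrees of freedom, p = erfc(sqrt(x/2)) + sqrt(2x/π) exp(-x/2); for other values of df a statistics library would be needed. The function names are hypothetical.

```python
import math

def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom of an R x C contingency table."""
    return (n_rows - 1) * (n_cols - 1)

def chi_square_p_value_df3(chi_square):
    """Upper-tail probability of chi-square with exactly 3 degrees of freedom."""
    half = chi_square / 2
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2 * chi_square / math.pi) * math.exp(-half))

df = degrees_of_freedom(4, 2)          # R = 4 rows, C = 2 columns
p = chi_square_p_value_df3(10.655)     # value of chi-square from the slide
print(df, round(p, 3))
```

The value obtained agrees, to three decimal places, with the .014 given above.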
68 Summary This week I extended my discussion of statistical association to the topic of partial correlation. A partial correlation can help the researcher to choose from different causal models. I also considered the analysis of nominal data in the form of contingency tables. The chi-square statistic can be used to test for the presence of an association between qualitative or categorical variables.