2 Outline
Crosstab (exploring the relationship between two categorical variables)
Correlation (exploring the relationship between two variables, typically continuous)
3 Data file
YR12SURV2.SAV
YR12SURVEYCODING2.DOC (Questionnaire)
Holland2fory12data.doc
4 RANKMS and RANKINV
RANKMS is a constructed variable ranking informants on the amount of time spent on Maths and Science.
A high rank (e.g. 3) means the informant spent a lot of time on Maths and Science.
A low rank (e.g. 1) means the informant spent very little or no time on Maths and Science.
RANKINV is a similar rank-type variable focused on investigative interests, e.g. whether the informant is interested in laboratory work.
A high rank (e.g. 4) means the informant was very interested.
A low rank (e.g. 1) means the informant was least interested.
These are ordinal variables.
5 Relationships between two categorical variables
Examples of research questions:
Is there a relationship between gender and student investigative interest?
Are males more likely to be interested in investigative activities than females?
Is the proportion of males at each level of investigative interest the same as the proportion of females?
6 Variables
To answer the research question in the example above we will have to crosstabulate two variables:
Gender
RANKINV
7 Hypothesis of independence
There is no association between the two variables gender and RANKINV.
There is no difference in the proportion of females and males in each of the categories (levels) of investigative interest.
8 How to do crosstabulations in SPSS
From the menu bar select ANALYSE, then Descriptive Statistics, then Crosstabs.
Move GENDER into the Column(s) window and RANKINV into the Row(s) window.
Open the Statistics window and tick Chi-square, then click Continue to close it.
Open the Cells window; under Counts tick Observed, under Percentages tick Column, then click Continue.
Click OK in the Crosstabs window to run.
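The same kind of table can be built outside SPSS. The following is a minimal sketch in plain Python, not the SPSS procedure itself: the function name `crosstab_column_percent` and the eight data values are made up for illustration and are not taken from the Year 12 survey file.

```python
from collections import Counter

def crosstab_column_percent(rows, cols):
    """Return (counts, column_percentages) for two equal-length lists of categories."""
    if len(rows) != len(cols):
        raise ValueError("rows and cols must have the same length")
    counts = Counter(zip(rows, cols))          # observed count per (row, column) cell
    col_totals = Counter(cols)                 # number of cases in each column
    percents = {
        cell: 100.0 * n / col_totals[cell[1]]  # share of the column that falls in this row
        for cell, n in counts.items()
    }
    return counts, percents

# Made-up illustrative data, NOT the Year 12 survey.
gender = ["F", "F", "F", "F", "M", "M", "M", "M"]
rankinv = [1, 1, 2, 4, 1, 3, 4, 4]

# RANKINV forms the rows and gender the columns, as on the slide.
counts, percents = crosstab_column_percent(rankinv, gender)
for (level, sex), n in sorted(counts.items()):
    print(f"RANKINV={level} GENDER={sex}: n={n}, column %={percents[(level, sex)]:.1f}")
```

Column percentages are used because gender heads the columns, so each percentage answers "what share of this gender is at this level of interest?".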
12 Interpreting Association in the Table
We can compare the column percentages along the rows and calculate the percentage point difference to see (in this case) whether females differ from males at each ‘level’ of interest.
In the rankinvestgtv by Gender crosstabulation, for example, 30.7% of females were in category 1 (very low investigative interest) compared with 19.6% of males, giving a percentage point difference of 11.1.
Similarly, there is a difference of 10.7 percentage points between the percentage of males and the percentage of females with very high investigative interest.
13 Table 2: Chi-square Statistics generated by Crosstabs
Pearson chi-square value, degrees of freedom and significance level
This helps us to check whether any expected counts are less than 5
14 Tests of Statistical Significance for Tables
Chi-square is used to test the null hypothesis that there is no discrepancy between the observed and expected frequencies, that is, no association between the row and column variables.
Chi-square based statistics can be used independently of level of measurement.
If chi-square is significant (say Asymp. Sig. < 0.05), we reject the null hypothesis and conclude that the data show some association compared with a (hypothetical) table in which the frequencies were determined solely by the separate distributions of the two crosstabulated variables (the ‘marginal distributions’).
If chi-square is not significant (say Asymp. Sig. > 0.05), we do not reject the null hypothesis: the data show no evidence of association compared with such a table.
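To make the "observed versus expected" comparison concrete, here is a minimal sketch of the Pearson chi-square calculation in plain Python. The function name `chi_square` and the 2x2 counts are made up for illustration. The sketch stops at the statistic and its degrees of freedom; it does not compute the Asymp. Sig. value, which needs the chi-square distribution (SPSS reports it for you).

```python
def chi_square(observed):
    """Pearson chi-square for a table given as a list of rows of observed counts.

    Returns (statistic, degrees_of_freedom, expected_counts).
    Assumes every row total and column total is greater than zero.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

# Made-up 2x2 counts for illustration only.
table = [[10, 20],
         [20, 10]]
stat, df, expected = chi_square(table)
smallest_expected = min(min(row) for row in expected)
print(f"chi-square = {stat:.3f}, df = {df}, smallest expected count = {smallest_expected}")
```

The smallest expected count is printed because it is the quantity checked against the "5 or more" rule of thumb.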
15 Assumptions
Random samples
Independent observations: the different groups are independent of each other
The lowest expected frequency in any cell should be 5 or more
16 Chi-square Statistics: limitations
Chi-square measures are sensitive to sample size: with a very large sample, even a weak association can be statistically significant.
17 Interpretation of output from chi-square
The note under the table shows that you have not violated one of the assumptions of chi-square, the one concerning ‘minimum expected cell frequency’.
Pearson chi-square value: at 18.5 for 3 degrees of freedom, chi-square is highly significant; the probability of this level of association occurring by chance is less than [value missing in the source].
Degrees of freedom = (r-1)(c-1), where r and c are the numbers of categories in each of the two variables.
Conclusion: males are more likely than females to be interested in investigative activities.
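As a worked check of the degrees of freedom, assume RANKINV supplies the rows (four levels, 1 to 4) and gender supplies the columns (two categories, since the slides compare females with males):

```latex
df = (r - 1)(c - 1) = (4 - 1)(2 - 1) = 3
```

This agrees with the 3 degrees of freedom quoted for the Pearson chi-square value of 18.5.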
18 Class activity 1: Produce a similar table using GENDER by RANKMS
19 Summary of analyses of association
RANKINV (4 categories) and RANKMS (3 categories) are ordinal variables constructed, respectively, from the total score on investigative interests and the proportion of curriculum time spent on Maths/Science.
Gender heads the columns; interest and curriculum participation in Maths/Science form the rows. Thus, by convention, gender is the explanatory or independent variable, and interest or curriculum participation is the response or dependent variable.
20 Correlations
Strengths of relationships between two variables.
21 Correlation
Examples of research questions:
Is there a relationship between student achievement in mathematics and in English language?
Is there a relationship between parents’ incomes and children’s VCE results?
Is there a correlation between SES and achievement?
How strong are these relationships?
These research questions require us to explore the dependency between two variables.
What we need here are two continuous variables (e.g. VCE results, incomes), or one dichotomous variable and one continuous variable. It is acceptable for one of the variables to be ordinal; however, when at least one of the variables is ordinal, Spearman’s rho or Kendall’s tau should be used.
22 Assumptions (1)
Scores are obtained using a random sample from the population.
Independence of observations.
The distribution of the variable(s) involved is normal.
Homoscedasticity: the variance of the dependent variable is the same for all values of X (residual variance, or conditional variance).
Linearity: the relationship between the two variables should be linear.
Related pairs: both pieces of information must come from the same subjects.
24 Scatter plot
Scatter plots are used to display the relationship between two continuous variables. The student in the green circle has the lowest scores in both strands. We can also see that as students’ scores in measurement increase, their scores in number also increase.
25 Producing a Scatterplot
GRAPHS-SCATTER-SIMPLE-DEFINE
Select MEASUREMENT score (pmes500) to make this the Y variable.
Select NUMBER score (pnum500) to make this the X variable.
Click OK.
The scatterplot should appear in the OUTPUT window.
26 Scatter plot
Interpret these points to understand the scatter plot.
27 Interpretation of Scatter plot
Step 1: Checking for outliers
Step 2: Inspecting the distribution of data points:
Are the data points scattered all over?
Are the data points neatly arranged in a narrow cigar shape?
Could we draw a straight line through the main cluster of points, or would a curved line better represent the points?
Is the shape of the cluster even from one end to the other? (If it starts off narrow and then gets wider, the data may violate the assumption of homoscedasticity: at different values of X, the variability of Y is different.)
Step 3: Determining the direction of the relationship between the two variables: positive or negative correlation
28 Direct Relationship
When values on two variables tend to go in the same direction, we call this a direct relationship.
The correlation between children’s ages and heights is a direct relationship.
That is, older children tend to be taller than younger children.
This is a direct relationship because children with higher ages tend to have higher heights.
29 Inverse Relationship
When values on two variables tend to go in opposite directions, we call this an inverse relationship.
The correlation between students’ number of absences and level of achievement is an inverse relationship.
That is, students who are absent more often tend to have lower achievement.
This is an inverse relationship because children with higher numbers of absences tend to have lower achievement scores.
31 How to run correlation
Highlight ANALYSE, CORRELATE, BIVARIATE.
Copy the two variables into the VARIABLES box.
Check that the PEARSON box (two continuous variables; see the notes for other variable types) and the two-tailed box are ticked.
Click OK.
Note: when at least one of the variables is ordinal, Spearman’s rho or Kendall’s tau should be used, so you should tick the boxes next to them. Kendall’s tau usually produces slightly smaller correlations. Rho is more commonly used and reported by researchers. However, only use these alternatives when you cannot meet the assumptions of Pearson’s r.
32 OUTPUT AND INTERPRETATION
Labels on the output table: the correlation coefficient (r), the p value, the number of cases
Step 1: Checking information about sample size
Step 2: Determining the directions and strengths of the relationships
Step 3: Calculating the coefficient of determination (r squared)
Step 4: Assessing the significance
In the output you can see the number of cases and the correlation coefficient (r). Pearson’s correlation coefficient shows the degree to which two variables are related linearly (i.e., by a straight line). In order to evaluate the correlation between variables, it is important to know the "magnitude" or "strength" as well as the significance of the correlation.
If p > 0.05, there is no significant correlation between the two variables. If p < 0.05, there is a significant correlation between the two variables.
The significance of a correlation coefficient of a particular magnitude will change depending on the size of the sample from which it was computed.
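For readers who want to see what lies behind the r value in the output table, here is a minimal sketch of Pearson’s r in plain Python. The function name `pearson_r` and the five pairs of scores are made up for illustration; they are not values from the data files used in this session, and the sketch does not compute the p value.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of the same length (at least 2 cases)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up scores for illustration only.
number_scores = [1, 2, 3, 4, 5]
measurement_scores = [2, 4, 5, 4, 5]

r = pearson_r(number_scores, measurement_scores)
print(f"N = {len(number_scores)}, r = {r:.3f}, r squared = {r * r:.3f}")
```

The sign of r gives the direction of the relationship (Step 2) and r squared gives the coefficient of determination (Step 3).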
33 Correlation Coefficient
The relationship between two variables may be expressed with a number between -1.00 and +1.00. This number is called a correlation coefficient.
The closer the correlation coefficient is to 0.00, the weaker the relationship between the two variables. The closer the coefficient is to 1.00 or -1.00, the stronger the relationship.
According to Cohen (1988):
r = .10 to .29 or r = -.10 to -.29: Small
r = .30 to .49 or r = -.30 to -.49: Medium
r = .50 to 1.00 or r = -.50 to -1.00: Large
The correlation coefficient indicates the strength of the relationship. Different authors suggest different interpretations.
When you interpret the strength of the relationship, it is important to compare your result with those from others.
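The Cohen (1988) bands quoted on this slide can be written as a small lookup. This is only a sketch: the function name `cohen_strength` is made up, the slide gives no label for coefficients smaller than .10 in absolute value, and treating values between .29 and .30 (or .49 and .50) as belonging to the lower band is an assumption.

```python
def cohen_strength(r):
    """Label the size of r using the Cohen (1988) bands quoted on the slide."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    size = abs(r)  # the sign gives the direction, not the strength
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "below the 'small' band"  # no label is given on the slide for this range

for value in (0.05, -0.25, 0.42, -0.80):
    print(value, cohen_strength(value))
```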
34 Some caveats about correlation and scatter plots - 1
Make a scatter plot of Measurement score against Number score again.
This time, double click on the plot to get into the Chart Editor.
Change both X and Y axis scales to have a minimum of 200 and a maximum of 750.
Does the strength of the relationship look weaker in this graph compared with the one where the minimum is 0 and the maximum is 1000?
35 Some caveats about correlation and scatter plots - 2
Be aware that judging the strength of a relationship from the visual appearance of a scatter plot can be misleading, as the scale of the plot can make a difference.
36 Some caveats about correlation and scatter plots - 3
Create a new variable pmes10 using Transform, Compute new variable, such that pmes10 = pmes500/100 + 5.
That is, if pmes500 has a mean of 500 and a standard deviation of 100, the transformed measurement score has a mean of 10 and a standard deviation of 1.
Compute the correlation between pmes10 and pnum500.
How does this correlation compare with the correlation between pmes500 and pnum500?
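The point of this exercise can be previewed with a minimal sketch in plain Python. The five pairs of scores are made up for illustration (they are not taken from VNsample2.sav), and `pearson_r` is a helper written for this sketch. It applies the same rescaling, pmes500/100 + 5, and compares the two correlations.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up scores on a 500-type scale, for illustration only.
pmes500 = [420.0, 480.0, 510.0, 555.0, 610.0]
pnum500 = [400.0, 505.0, 490.0, 580.0, 600.0]

# The rescaling from the slide: divide by 100, then add 5.
pmes10 = [score / 100 + 5 for score in pmes500]

r_original = pearson_r(pmes500, pnum500)
r_rescaled = pearson_r(pmes10, pnum500)
print(f"r with pmes500 = {r_original:.6f}, r with pmes10 = {r_rescaled:.6f}")
```

Dividing by a positive constant and adding a constant changes the units of a variable but not the standardised position of each case, which is why the two coefficients agree (up to rounding).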
37 Some caveats about correlation and scatter plots - 4
Compute the correlation between pmes500 and pnum500, but only for scores between 300 and 600.
You can do this by selecting a sample: Data, Select cases, If condition is satisfied:
pnum500 > 300 and pnum500 < 600 and pmes500 > 300 and pmes500 < 600
How does the correlation compare with that from the full sample?
38 Some caveats about correlation and scatter plots - 5
The file TIMSSandPISA 2003.doc shows TIMSS and PISA 2003 maths country mean scores for 22 countries.
Plot a scatter graph of TIMSS against PISA scores.
Compute the correlation.
Repeat without Tunisia and Indonesia.
39 Coefficient of determination
The coefficient of determination is the squared correlation coefficient, and represents the proportion of common variation in the two variables.
The correlation coefficient (r) represents the linear relationship between two variables.
If the correlation coefficient is squared, the resulting value (r squared, the coefficient of determination) represents the proportion of common variation in the two variables (i.e., the "strength" or "magnitude" of the relationship).
40 Proportion of variance explained
The proportion of variance explained is equal to the square of the correlation coefficient.
If the correlation between alternate forms of a standardized test is 0.80, then (0.80)^2, or 0.64, of the variance in scores on one form of the test is explained by, or associated with, the variance of scores on the other form.
That is, 64% of the variance one sees in scores on one form is associated with the variance of scores on the other form. Consequently, only 36% (100% - 64%) of the variance of scores on one form is unassociated with the variance of scores on the other form.
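The arithmetic of this example, written out as a worked equation with r = 0.80:

```latex
r^2 = (0.80)^2 = 0.64
\qquad\text{and}\qquad
1 - r^2 = 1 - 0.64 = 0.36
```

So 64% of the variance is shared between the two forms and 36% is not.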
41 Presenting the results for Correlation
Purpose of the test
Variables involved
r values, number of cases, p value
R-square
Interpretation
If you examine correlations among a group of variables, you can present a correlation matrix.
42 Class activity 1
What are the correlations between three dimensions of mathematics: number, measurement and space?
Is it true that students who perform well in the number strand also do well in the measurement and space strands?
Datafile: VNsample2.sav
Variables: student scores in measurement, number and space
43 Regression line
This line is called the regression line or least squares line, because it is determined such that the sum of the squared vertical distances of all the data points from the line is the lowest possible.
44 Producing a Scatterplot with regression line
GRAPHS-SCATTER-SIMPLE-DEFINE
Select MEASUREMENT 500 SCORE IN PUPIL MATH to make this the Y variable.
Select NUMBER 500 SCORE IN PUPIL MATH to make this the X variable.
Click OK.
The scatterplot should appear in the OUTPUT window.
Double click anywhere in the scatter plot to open the SPSS CHART EDITOR, click on ELEMENTS, then FIT LINE AT TOTAL, and make sure that LINEAR is selected.
46 Regression line
If you draw the regression line in Excel, you can have the equation included in your chart.
47 Linear Regression
Examples of research questions:
To what extent can student achievement in Vietnamese language predict student achievement in mathematics?
How well can family income (or wealth) predict student performance?
How well can university entrance scores predict student success at university?
All the questions presented above can be formed using the template: “To what extent does the independent variable (factor) influence the dependent variable (outcome)?”
48 Regression equation y = a + bx + e
y and x are the dependent and independent variables respectively.
a is the intercept (the point at which the line cuts the vertical axis).
b is the slope of the line, or the regression coefficient.
e is the error term: the part of the dependent variable that is not explained by the equation.
Simple linear regression is a popular tool for describing the relationship between two variables. Regression analysis presumes that one variable (y) depends linearly on another variable (x). Linear means that a unit change in the independent variable results in a constant expected change in the dependent variable. Regression involves finding the line that best represents the relationship between y and x based on the sample points (x, y): the line of best fit. To determine how well the estimated line fits the data, analysis of variance is conducted. This involves working out how much of the variation in y is explained by variation in x and how much is unexplained.
The equation for the line of best fit, or regression line, is Y = a + bx. Note that for a person i, the observed value y_i is often different from the predicted value Y_i. The regression line is the line for which the sum of squared differences between y_m and Y_m (m from 1 to N), that is, the sum of squared errors e, is the smallest.
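The least squares idea described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the SPSS routine: the function name `least_squares_line` and the five (x, y) pairs are made up, and the closed-form formulas used are the standard ones for simple regression with one independent variable.

```python
def least_squares_line(x, y):
    """Return (a, b) for the line Y = a + b*x that minimises the sum of squared errors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need paired x and y values (at least 2 cases)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope (regression coefficient)
    a = mean_y - b * mean_x  # intercept: the line passes through the point of means
    return a, b

# Made-up sample points for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares_line(x, y)
predicted = [a + b * xi for xi in x]                    # Y_i, the value on the line
residuals = [yi - pi for yi, pi in zip(y, predicted)]   # e_i = y_i - Y_i
sse = sum(e ** 2 for e in residuals)                    # the quantity being minimised
print(f"Y = {a:.2f} + {b:.2f}x, sum of squared errors = {sse:.2f}")
```

Any other straight line drawn through these points would give a larger sum of squared errors than the one printed here.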
49 How to RUN REGRESSION
Highlight ANALYSE, REGRESSION, LINEAR.
Copy the continuous dependent variable into the DEPENDENT box.
Copy the independent variable into the INDEPENDENT box.
For METHOD, make sure that ENTER is selected.
Click STATISTICS and tick ESTIMATES, MODEL FIT and DESCRIPTIVES, then CONTINUE.
Click on OPTIONS, then INCLUDE CONSTANT IN EQUATION and EXCLUDE CASES PAIRWISE, then CONTINUE.
Click OK.
50 An example for regression
Research question: To what extent can student scores in reading predict student scores in mathematics?
Datafile: VNsample2.sav
Variables: Reading 500 scores, Mathematics 500 scores
51 OUTPUT AND INTERPRETATION
Step 1: Checking the descriptives
These outputs help us to check the number of cases, the mean and SD of each variable, and the correlation between the two variables.
52 OUTPUT AND INTERPRETATION (2)
Step 2: Evaluating the model
The value of interest in this output is R square. It shows how much of the variance in the dependent variable is explained by the independent variable.
53 Output and interpretation
Label on the output table: the t-tests with significance levels for the constant (a) and the regression coefficient (b)
Step 3: Evaluating the effect of the independent variable
In the column Unstandardised coefficients you can see the constant a and the slope b, which are the intercept and the slope in your regression equation for the line of best fit: Y = a + b*X, with the values of a and b read from the table. (X is the value of reading scores, Y is the predicted value of Maths scores.)
The t-tests for the constant a and the regression coefficient b are both significant. This means that the constant a differs significantly from 0, and that the effect of X on Y also differs significantly from 0.
Standardised coefficients are those which have been converted to the same scale so that we can compare them. In simple regression the standardised coefficient (beta) equals the correlation r, so it lies between -1 and 1.
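The link between the unstandardised slope b, the standardised coefficient and R square can be checked with a small sketch in plain Python. The function name `simple_regression` and the data are made up for illustration (they are not the reading and mathematics scores in VNsample2.sav), and the sketch leaves out the t-tests and significance levels, which SPSS reports.

```python
import math

def simple_regression(x, y):
    """Fit Y = a + b*x by least squares and return a, b, beta and R squared."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need paired x and y values (at least 3 cases)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    b = sxy / sxx                      # unstandardised slope
    a = mean_y - b * mean_x            # intercept (the constant)
    beta = b * math.sqrt(sxx / syy)    # slope rescaled by sd(x) / sd(y)
    r_squared = sxy ** 2 / (sxx * syy)
    return {"a": a, "b": b, "beta": beta, "r_squared": r_squared}

# Made-up scores for illustration only.
reading = [1, 2, 3, 4, 5]
maths = [2, 4, 5, 4, 5]

fit = simple_regression(reading, maths)
predicted_at_3 = fit["a"] + fit["b"] * 3   # predicted maths score when reading = 3
print(fit, predicted_at_3)
```

With one independent variable, beta squared equals R square, which is one way to sanity-check the two tables of output against each other.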
54 Presenting the results for regression
Purpose of the test
Variables involved
Number of cases
Intercept
Unstandardised b and standardised (beta) coefficients, SE, p value
R-square
Interpretation of the relationship
It would be a good idea to look at the statistical analyses in journals relevant to your topic area, as different journals may have different requirements or expectations.
55 Class activity 2
To what extent can school resources predict student achievement in maths and Vietnamese language?
Datafile: VNsample2.sav
Variables: school resources index, math achievement