11-2 Identifying Variable Types and Forms
  Direction of Causality
    The independent variable influences or affects the other
    The dependent variable is the one being influenced or affected
  Form of the Variables
    All nominal variables are categorical
    Ordinal, interval, and ratio variables are continuous in form
    Continuous variables may be recoded or treated as categorical
    If so, they must constitute a limited number of categories
11-3 Measures of Association
                           Independent: Categorical          Independent: Continuous
  Dependent: Continuous    Analysis of Variance (F-Ratio)    Regression Analysis (F-Ratio)
                           Paired T-Test (Value of t)        Correlation (Probability of r)
  Dependent: Categorical   Cross-Tabulation (Chi-Square)     Discriminant Analysis (F-Ratio)
11-4 When To Use Cross-Tabulation
  Both variables are categorical (in the form of categories), rather than continuous
  The object is to see if the frequency or percentage distribution breakdown for one variable differs for each level of the other
  One variable is used to define the rows of the matrix and the other to define the columns
  If the distribution of each row or each column is proportional to the row or column totals, the two variables are not significantly related
11-5 Expected Cell Frequencies
  The lowest expected cell frequency for the table must be 5 or more
  Look down the row totals and circle the lowest row total
  Look across the column totals and circle the lowest column total
  Divide the lowest row total by the grand total for the entire table
  Multiply this value by the lowest column total to get the lowest expected cell frequency
  If it is less than five, combine the row or the column with another and recalculate the lowest cell frequency
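The steps on slide 11-5 can be sketched in a few lines of Python. This is only an illustration: the function name and the 3 x 2 table of counts are made up for the example and do not come from the slides.

```python
def lowest_expected_frequency(table):
    """Return the smallest expected cell frequency for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Lowest row total divided by the grand total, times the lowest column total.
    return min(row_totals) / grand_total * min(col_totals)


# Hypothetical observed counts (illustration only, not from the slides).
observed = [
    [20, 30],
    [25, 15],
    [4, 6],
]
lowest = lowest_expected_frequency(observed)
needs_combining = lowest < 5  # True here, so a row or column should be combined
```

With these made-up counts the lowest row total is 10, the grand total is 100, and the lowest column total is 49, so the lowest expected frequency falls just under 5.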
11-6 The Cross-Tabulation Table
  Table is symmetrical: Either variable can be listed on the rows or columns
  There need not be a dependent and an independent variable
  If there is a dependent variable, it's often best to have it define the rows
  If the dependent variable defines the rows, column percentages work best
  Each percentage can then be compared to the total row percentages
11-7 Perfectly Proportional Cross-Tab Table and Graph
  [Figure: 2 x 2 cross-tab table (Row One, Row Two by Col. 1, Col. 2, with row and column totals) and bar graph with axis from 0 to 50; surviving table values: 100, 25]
  Chi Sq. = 0   Sig. = 1.0000
11-8 Slightly Disproportional Cross-Tab Table and Graph
  [Figure: 2 x 2 cross-tab table (Row One, Row Two by Col. 1, Col. 2, with row and column totals) and bar graph with axis from 0 to 50; surviving table values: 100, 30, 20, 30]
  Chi Sq. = 4   Sig. = 0.0455
11-9 Highly Disproportional Cross-Tab Table and Graph
  [Figure: 2 x 2 cross-tab table (Row One, Row Two by Col. 1, Col. 2, with row and column totals) and bar graph with axis from 0 to 50; surviving table values: 100, 40, 10, 40]
  Chi Sq. = 36   Sig. = 0.0000
11-10 Perfectly Disproportional Cross-Tab Table and Graph
  [Figure: 2 x 2 cross-tab table (Row One, Row Two by Col. 1, Col. 2, with row and column totals) and bar graph with axis from 0 to 50; surviving table values: 100, 50, 0, 0]
  Chi Sq. = 100   Sig. = 0.0000
11-11 Significance of Chi Square
  The statistical significance of the relationship depends on the probability of disproportions by row or by column if the distributions in the population were actually proportional
  The actual probability is based on the value of Chi-square and the degrees of freedom
  The number of degrees of freedom equals the number of rows minus one times the number of columns minus one: (R - 1) x (C - 1)
  The probability can be read from a table, but it is usually generated by the analysis program
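As a rough illustration of how the Chi-square value, the degrees of freedom, and the probability fit together, the sketch below computes all three for a 2 x 2 table. The cell counts (30, 20, 20, 30) are an assumption, chosen because they are consistent with the Chi Sq. = 4 and Sig. = 0.0455 reported on slide 11-8. The closed-form probability used here is valid only for one degree of freedom; an analysis program would handle the general case.

```python
import math


def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were perfectly proportional.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)  # (R - 1) x (C - 1)
    return stat, dof


def p_value_one_df(stat):
    """Upper-tail chi-square probability; valid for 1 degree of freedom only."""
    return math.erfc(math.sqrt(stat / 2.0))


# Assumed cell counts (not legible in the slide itself).
stat, dof = chi_square([[30, 20], [20, 30]])
p = p_value_one_df(stat)
```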
11-12 Ways to Describe the Statistical Significance of Cross-Tabs
  What is the probability this much difference in the proportions from row to row or column to column would result only from sampling error if the proportions were equal in the population?
  If the proportions from row to row or column to column were the same in the population, what are the odds that a sample of this size would show this much difference in the proportions for the sample?
  What is the probability that proportions from row to row or column to column would be this different by chance, purely because of sampling error, if the proportions in the population were actually the same?
11-13 Analysis of Variance (ANOVA)
  Objective
    To determine if the means of two or more variables are significantly different from one another.
  Independent Variable
    Nominal level data in the form of two or more categories.
  Dependent Variable
    Interval or ratio level data in continuous form.
  Requirements
    Dependent variable must be near-normally distributed and the variance within each category must be approximately equal.
11-14 ANOVA: Variance Not Homogeneous
  Dispersion in the red category is greater than in the green
11-15 ANOVA: Skewed Distributions
  The distributions are asymmetrical (skewed to one side)
11-16 ANOVA or Paired T-Test?
  ANOVA requires that the data points are independent (from different cases)
  ANOVA will measure significance of differences among more than two means or categories
  Paired T-Tests require that the data points are paired (that they come from the same case)
  Paired T-Tests can measure the significance of difference between only two means or variables
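To make the pairing requirement concrete, here is a minimal sketch of the paired t statistic, computed from the case-by-case differences. The function name and the before/after scores are hypothetical. The sketch stops at the value of t and its degrees of freedom and leaves the probability to a t table or an analysis program.

```python
import math
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t, degrees of freedom) for two measurements on the same cases."""
    if len(first) != len(second):
        raise ValueError("paired data need one value per case in each list")
    # Each difference comes from a single case, which is what makes the data paired.
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical scores for five cases measured twice (illustration only).
before = [10, 12, 9, 14, 11]
after = [12, 15, 10, 17, 12]
t, dof = paired_t(after, before)
```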
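11-17 ANOVA - Difference Not Significant
  [Figure: distribution curves with points labeled a, b, and c]
  Means a and b are very close
  Overlapping area is very large
11-18 ANOVA - Difference Probably Significant
  [Figure: distribution curves with points labeled a, b, and c]
  Means a and b are far apart
  Overlapping area is rather small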
11-19 The ANOVA Table
  Source           S.S.   d.f.   M.S.   F      P
  Between groups   100    1      100    5.00   0.00
  Within groups    180    9      20
  Combined         280    10

  SOURCE - The source of the variance value
  S.S. - Sums of Squared deviations from a mean
  d.f. - Degrees of freedom related to variance
  M.S. - Mean Squares or S.S. divided by d.f.
  F - The ratio of M.S. Between over M.S. Within
  P - The probability of this value of the F-ratio
11-20 ANOVA Terms: Sums of Squares
  S.S. - The sum of squared deviations of each data point from some mean value
  Within groups - The total squared deviation of each point from the group mean
  Combined - The total squared deviation of each data point from the grand mean
  Between groups - The difference between S.S. combined and S.S. within groups
11-21 ANOVA Terms: Degrees of Freedom
  d.f. - The number of cases minus some "loss" because of earlier calculations.
  Within groups d.f. - The total number of cases minus the number of groups.
  Combined d.f. - Equal to the total number of cases minus one.
  Between groups d.f. - Equal to the total number of groups minus one.
11-22 ANOVA Terms: Mean Squares & F-Ratio
  M.S. - The sums of squares (S.S.) divided by the degrees of freedom (d.f.).
  F - The ratio of mean squares between groups to the mean squares within groups.
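The definitions of S.S., d.f., M.S., and F can be traced through a short sketch. The function name and the two small groups of scores are hypothetical. The last line simply redoes the arithmetic of the ANOVA table from slide 11-19 (S.S. 100 and 180, d.f. 1 and 9). The probability P is not computed here.

```python
def one_way_anova(groups):
    """Return the ANOVA table entries for a list of groups of numbers."""
    values = [v for group in groups for v in group]
    n_cases = len(values)
    grand_mean = sum(values) / n_cases

    # Combined: squared deviation of each data point from the grand mean.
    ss_combined = sum((v - grand_mean) ** 2 for v in values)
    # Within groups: squared deviation of each point from its own group mean.
    ss_within = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        ss_within += sum((v - group_mean) ** 2 for v in group)
    # Between groups: S.S. combined minus S.S. within groups.
    ss_between = ss_combined - ss_within

    df_between = len(groups) - 1
    df_within = n_cases - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss": (ss_between, ss_within, ss_combined),
        "df": (df_between, df_within, n_cases - 1),
        "ms": (ms_between, ms_within),
        "f": ms_between / ms_within,
    }


# Hypothetical scores for two groups (illustration only).
table = one_way_anova([[1, 2, 3], [4, 5, 6]])

# Arithmetic check on the slide's own table: M.S. between over M.S. within.
slide_f = (100 / 1) / (180 / 9)
```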
11-23 Ways to Describe the Statistical Significance of ANOVA
  What is the probability that this much of a difference between these sample mean values would result due to sampling error if the means for the groups in the population were equal?
  If the group means in the population as a whole were the same, what are the odds that a sample of this size would show this much difference in the sample group means?
  What is the probability that the sample group means would be this different by chance, purely because of sampling error, if the group means in the population were actually the same?
11-24 Correlation Analysis
  Objective
    To determine degree and significance of relationship between a pair of continuous variables
  Causality
    The analysis does not assume that one variable is dependent on the other.
    If A is correlated with B:
      A may be causing B
      B may be causing A
      A and B may be interacting
      C may be causing A and B
11-25 Correlation Analysis
  Requirements
    Both variables must be continuous and obtained from an interval or a ratio scale
  Non-Parametric Correlation
    Both variables must be continuous but one or both may be only ordinal scale level
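The difference between the two kinds of correlation can be sketched as follows. The slides do not name specific coefficients, so this example assumes Pearson's r for interval or ratio data and Spearman's rank correlation as the non-parametric alternative. Function names and data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical data: y rises with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r = pearson_r(x, y)
rho = spearman_rho(x, y)
```

Because the rank version uses only the order of the values, it reaches 1 for any steadily rising relationship, while r falls short of 1 unless the points lie on a straight line.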
11-26 Regression Analysis
  Objective
    To determine if variable X has a significant effect on variable Y
  Independent Variable
    X must be continuous, interval or ratio level data
  Dependent Variable
    Y must be continuous, interval or ratio level data
11-27 Regression Analysis Requirements
  The data plot must be linear
    The data plot must be in a straight line or very nearly so
  The data plot must be homoskedastic
    The vertical spread must be about the same from left to right
11-28 Unacceptable Heteroskedastic Regression Plot
  [Figure: typical funnel-shaped plot]
  The scatterplot must be homoskedastic
  Variance must be approximately the same
11-29 Unacceptable Curvilinear Regression Plot
  [Figure: curvilinear scatterplot with the sign of each point's deviation from the line marked: + + + + + + - + + - - - - - - - - +]
  The scatterplot must be linear
  A runs test will reveal nonlinearity
  It gives probability of consecutive signs
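One common form of runs test is the Wald-Wolfowitz test with a normal approximation. The sketch below applies it to the sign sequence as it survives in the text of slide 11-29. The slides do not say which version of the test is meant, so treat this as an assumption. The normal approximation is also rough for a sequence this short.

```python
import math


def runs_test(signs):
    """Return (number of runs, z score, two-sided probability) for +/- signs."""
    n_pos = signs.count("+")
    n_neg = signs.count("-")
    n = n_pos + n_neg
    # A new run starts wherever a sign differs from the one before it.
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    expected = 2 * n_pos * n_neg / n + 1
    variance = 2 * n_pos * n_neg * (2 * n_pos * n_neg - n) / (n ** 2 * (n - 1))
    z = (runs - expected) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return runs, z, p


# Sign sequence as extracted from slide 11-29.
signs = list("++++++-++--------+")
runs, z, p = runs_test(signs)
```

Too few runs, meaning long stretches of the same sign, is the pattern a curved plot produces when a straight line is fitted to it.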
11-30 Unacceptable Quadratic Regression Plot
  Two linear segments with one bend
  Three segments, two bends is cubic, etc.
  Regression must be limited to one range
11-31 The Regression Scatterplot
  [Figure: two scatterplots, one titled Weak Relationship and one titled Strong Relationship]
  Independent variable X on the horizontal axis
  Dependent variable Y on the vertical axis
  Regression equation: Y = a + bX
11-32 Regression Plot and Regression Table
  [Figure: regression scatterplot; one axis marked 0, 25, 50, 75, 100 and the other 0, 20, 40, 60, 80, 100]

  Regression Table
  Corr. (r)       .93784     N of cases   25        Missing   0
  R-Square        .87954     S.E. Est.    8.76849   Sig. R    0.0000
  Intercept (A)   88.90818   S.E. of A    3.64090   Sig. A    0.0000
  Slope (B)       -0.96698   S.E. of B    0.07462   Sig. B    0.0000

  Analysis of Variance
  Source       S.S.       d.f.   M.S.       F Ratio    F Prob.
  Regression   12911.77   1      12911.77   167.9332   0.0000
  Residual     1768.38    23     76.89
11-33 Regression Coefficients
  Corr. (r) - The coefficient of correlation
  R-Square - The coefficient of determination: the percentage of variance in Y explained by knowing X
  Intercept (A) - Value of Y if X is zero
  Slope (B) - The rise over the run
  Regression equation - Y = a + bX
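A minimal least-squares sketch shows where the intercept, slope, r, and R-Square come from. The function name and the five data points are hypothetical. They are not the 25 cases behind the regression table in the slides.

```python
import math


def fit_line(x, y):
    """Return (intercept a, slope b, r, r squared) for the line Y = a + bX."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    b = sxy / sxx            # slope: the rise over the run
    a = mean_y - b * mean_x  # intercept: value of Y if X is zero
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r, r ** 2


# Hypothetical data with a downward trend (illustration only).
x = [0, 10, 20, 30, 40]
y = [90, 78, 72, 58, 52]
a, b, r, r_square = fit_line(x, y)
predicted_at_25 = a + b * 25  # using the regression equation Y = a + bX
```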
11-34 Regression Coefficients
  S.E. Estimate - Standard error of the estimate of Y based on the value of X, calculated from the regression equation
  S.S. Regression - Sum of squared deviations of each predicted value on the regression line from the mean of Y
  S.S. Residual - The difference between S.S. total (around the mean of Y) and S.S. Regression
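The entries in the regression table from slide 11-32 are tied together by simple arithmetic, which the sketch below checks using the S.S. and d.f. values printed there. The variable names are chosen for this example. Small differences in the last decimals are expected because the printed sums of squares are rounded.

```python
import math

# Values copied from the Analysis of Variance part of the regression table.
ss_regression = 12911.77
ss_residual = 1768.38
df_regression = 1
df_residual = 23

ms_regression = ss_regression / df_regression  # M.S. = S.S. / d.f.
ms_residual = ss_residual / df_residual        # about 76.89
f_ratio = ms_regression / ms_residual          # about 167.93

# R-Square is the share of the total S.S. accounted for by the regression.
r_square = ss_regression / (ss_regression + ss_residual)  # about .87954

# The standard error of the estimate is the square root of M.S. residual.
se_estimate = math.sqrt(ms_residual)           # about 8.768
```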
11-35 Ways to Describe the Statistical Significance of Regression
  What is the probability this much variance in the values of the dependent variable would be "explained" by the values of the independent variable, only because of sampling error, if the two variables were unrelated in the population?
  If these two variables were actually independent of one another in the population, what are the odds that this size sample would show this much of a relationship?
  What is the probability that the values of X would explain this much variance in Y, purely by sampling error, if X and Y were unrelated to one another in the entire population?