Extension The General Linear Model with Categorical Predictors
Extension Regression can handle different types of predictors, and in the social sciences we are often interested in differences between groups. For now we will concern ourselves with the case of two independent groups, e.g. gender or Republican vs. Democrat.
Dummy coding There are different ways to code categorical data for regression. In general, representing a categorical variable requires k-1 coded variables, where k = number of categories/groups. Dummy coding uses zeros and ones to identify group membership. Since we only have two groups, one group will be coded 0 (the reference group) and the other 1.
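As a rough illustration of the k-1 rule, the sketch below builds dummy-coded variables in plain Python. The function name `dummy_code`, its parameters, and the example labels are all hypothetical and not taken from the slides.

```python
def dummy_code(labels, reference):
    """Return k-1 indicator (0/1) columns for a categorical variable.

    `labels` is a list of category labels and `reference` is the
    category coded as all zeros. Both names are illustrative only.
    """
    # Keep categories in first-seen order, dropping the reference group.
    categories = [c for c in dict.fromkeys(labels) if c != reference]
    return {c: [1 if lab == c else 0 for lab in labels] for c in categories}


# Hypothetical two-group example: k = 2, so one coded variable.
party = ["democrat", "republican", "democrat", "republican"]
two_group = dummy_code(party, reference="democrat")
print(two_group)

# Hypothetical three-group example: k = 3, so two coded variables.
region = ["north", "south", "west", "north"]
three_group = dummy_code(region, reference="north")
print(three_group)
```

With two groups the result is a single 0/1 column, which is the case used in the rest of this presentation.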
Dummy coding Example The thing to note at this point is that we have a simple bivariate correlation/simple regression setting. The correlation between group and the DV is .76. This is sometimes referred to as the point-biserial correlation (r_pb) because one of the variables is dichotomous. However, don’t be fooled: it is calculated exactly the same way as the Pearson correlation, i.e. you treat the 0/1 grouping variable like any other variable in calculating the correlation coefficient. The sign is arbitrary, since either group could have been coded one or zero, and that needs to be noted. [Data table with columns: Group, Outcome]
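The point about the point-biserial correlation can be checked with a small sketch: an ordinary Pearson formula applied to a 0/1 variable, then applied again with the coding reversed. The data below are hypothetical (five cases per group) and are not the values from the slide, so the resulting correlation will differ from .76.

```python
from math import sqrt


def pearson_r(x, y):
    """Ordinary Pearson correlation; nothing special for 0/1 predictors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data, NOT the values from the slides.
group = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
outcome = [2, 3, 4, 3, 3, 5, 6, 7, 6, 6]

r = pearson_r(group, outcome)

# Swap which group is coded 1: the magnitude stays, the sign flips.
flipped = [1 - g for g in group]
r_flipped = pearson_r(flipped, outcome)

print(r, r_flipped)
```

Only the sign changes under recoding, which is why the direction of the coding has to be reported alongside the coefficient.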
Example Graphical display The R-square is .76² = .577. The regression equation is
Example Look closely at the descriptive output compared to the coefficients. What do you see?
The constant Note again our regression equation. Recall the definitions of the slope and the constant. First the constant: what does “when X = 0” mean in this setting? It means when we are in the 0 group. What is that predicted value? Y pred = constant + coefficient × (0) = 4. That is the 0 group’s mean. The constant here is thus the reference group’s mean.
The coefficient Now think about the slope. What does a ‘1 unit change in X’ mean in this setting? It means we go from one group to the other. Based on that coefficient, what does the slope represent in this case (i.e. can you derive that coefficient from the descriptive stats in some way)? The coefficient is the difference between the group means.
The regression line The regression line covers the values represented, i.e. 0 and 1 for the two groups, and it passes through each group’s mean. With least squares regression, the regression line always passes through the point (mean of X, mean of Y), though the mean of X here is nonsensical as a group value (it is simply the proportion of cases coded 1). The constant (if we are using dummy coding) is the mean for the zero (reference) group. The coefficient is the difference between means.
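The claims about the constant and the coefficient can be verified numerically. The following is a minimal sketch using the usual least-squares formulas for one predictor. The data are hypothetical, not the slide's data, and the helper name `fit_simple_regression` is made up for illustration.

```python
def fit_simple_regression(x, y):
    """Least-squares constant and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    constant = mean_y - slope * mean_x
    return constant, slope


# Hypothetical dummy-coded data, NOT the values from the slides.
group = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
outcome = [2, 3, 4, 3, 3, 5, 6, 7, 6, 6]

constant, slope = fit_simple_regression(group, outcome)

# Group means taken straight from the descriptive statistics.
y0 = [y for g, y in zip(group, outcome) if g == 0]
y1 = [y for g, y in zip(group, outcome) if g == 1]
mean_0 = sum(y0) / len(y0)
mean_1 = sum(y1) / len(y1)

print("constant:", constant, "reference group mean:", mean_0)
print("slope:", slope, "difference between means:", mean_1 - mean_0)

# The fitted line also passes through (mean of X, mean of Y).
mean_x = sum(group) / len(group)
mean_y = sum(outcome) / len(outcome)
print("prediction at mean X:", constant + slope * mean_x, "mean Y:", mean_y)
```

The predicted value at X = 1 is the constant plus the slope, which is the mean of the group coded 1, so the line passes through both group means.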
Furthermore, the previous analysis gives the same results we would have gotten via a t-test, to which we are about to turn. However, you can now see that the t-test is not a distinct procedure from regression: it is a linear model of some outcome predicted by a grouping variable. [t-test output: Two Sample t-test, data: Outcome by Group, df = 8; the t statistic, p-value, and confidence interval values were not preserved]
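To make the equivalence concrete, the sketch below computes the equal-variance (pooled) two-sample t statistic and the t statistic for the regression slope on the same hypothetical data. It assumes the pooled-variance form of the t-test, which is consistent with df = 8 for two groups of five. The numbers are illustrative and do not reproduce the output above.

```python
from math import sqrt

# Hypothetical dummy-coded data, NOT the values from the slides.
group = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
outcome = [2, 3, 4, 3, 3, 5, 6, 7, 6, 6]

y0 = [y for g, y in zip(group, outcome) if g == 0]
y1 = [y for g, y in zip(group, outcome) if g == 1]
n0, n1 = len(y0), len(y1)
m0, m1 = sum(y0) / n0, sum(y1) / n1

# Pooled-variance two-sample t statistic (group 1 minus group 0).
ss0 = sum((y - m0) ** 2 for y in y0)
ss1 = sum((y - m1) ** 2 for y in y1)
df = n0 + n1 - 2
pooled_var = (ss0 + ss1) / df
t_ttest = (m1 - m0) / sqrt(pooled_var * (1 / n0 + 1 / n1))

# t statistic for the slope in the regression of outcome on group.
n = len(group)
mean_x = sum(group) / n
mean_y = sum(outcome) / n
sxx = sum((x - mean_x) ** 2 for x in group)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(group, outcome))
slope = sxy / sxx
constant = mean_y - slope * mean_x
sse = sum((y - (constant + slope * x)) ** 2 for x, y in zip(group, outcome))
se_slope = sqrt(sse / (n - 2) / sxx)
t_slope = slope / se_slope

print("df:", df)
print("t from t-test:", t_ttest)
print("t from regression slope:", t_slope)
```

The sign of the t statistic depends on which group mean is subtracted from which, just as the sign of the coefficient depends on which group is coded 1.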
Understanding the basics of the general linear model can go a long way toward one’s ability to understand any analysis. It holds not only here but is also used in more complex univariate and multivariate analyses, and even in some nonlinear situations (e.g. logistic regression) we use ‘generalized’ linear models. In matrix form the general linear model is Y = Xb + e. For properly specified models, linear models provide reasonable fits and are more intuitive to understand than more complex approaches.
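For the two-group case, Y = Xb + e can be written out explicitly: X has a column of ones for the constant and a column holding the dummy code. The sketch below solves the normal equations (X'X)b = X'Y by hand for this two-column case. The data are hypothetical, and the variable names are chosen for illustration only.

```python
# Hypothetical dummy-coded data, NOT the values from the slides.
group = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
outcome = [2, 3, 4, 3, 3, 5, 6, 7, 6, 6]

# Design matrix X: a column of ones (constant) and the dummy code.
X = [[1.0, float(g)] for g in group]

# Entries of X'X (symmetric 2x2) and X'Y for the normal equations.
a11 = sum(row[0] * row[0] for row in X)
a12 = sum(row[0] * row[1] for row in X)
a22 = sum(row[1] * row[1] for row in X)
c1 = sum(row[0] * y for row, y in zip(X, outcome))
c2 = sum(row[1] * y for row, y in zip(X, outcome))

# Solve the 2x2 system with Cramer's rule.
det = a11 * a22 - a12 * a12
b = [(a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det]

# Fitted values Xb and residuals e, so that Y = Xb + e holds exactly.
fitted = [row[0] * b[0] + row[1] * b[1] for row in X]
residuals = [y - f for y, f in zip(outcome, fitted)]

print("b (constant, coefficient):", b)
print("sum of residuals:", sum(residuals))
```

Adding more predictors, categorical or continuous, only adds columns to X; the same equation and the same solution method apply, although in practice a linear algebra library would be used rather than a hand-written solver.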