Presentation on theme: "Log-linear and logistic models Generalised linear model ANOVA revisited Log-linear model: Poisson distribution logistic model: Binomial distribution Deviances."— Presentation transcript:
Log-linear and logistic models Generalised linear model ANOVA revisited Log-linear model: Poisson distribution logistic model: Binomial distribution Deviances R commands for log-linear and logistic models
Generalised linear model Linear models are useful when the distribution of the observations are or can be approximated with normal distribution. Even if it is not case for large number of observation normal distribution is safe assumption. However there are many cases when different model should be used. Generalised linear model is a way of generalising linear models to a wide range of distributions. If distribution of the observations is the from the family of generalised exponential family and mean value of this distribution is linear on the input parameters then generalised linear model can be used. Recall generalised exponential family: Following distributions belong to the generalised exponential family (note that parameters we are considering are the mean values. Other members of this family include: gamma, exponential and many others. If some function of the mean ( , , for above cases) is a linear function of the observations then it can be handled using generalised linear model. Usually this function is taken A( ).
ANOVA revisited Let us recall purpose of ANOVA: We want to know difference between effects of different parameters. There might be several set of parameters. If number of parameters is two then t-test is suitable for testing difference between means. If there are more than two parameters then we design experiment according to one of the schemes (n-way crossed, n-fold nested or mixture of them). When we have the result of experiments then we fit various linear models under different hypotheses. Then we calculate likelihood ratio (LR) tests. LR test turns out to be related with ratio of sum of the squares under different hypotheses. And this ratio is related with F-distribution (if observations are distributed normally). If F-value is large enough then we say that differences between means are significant. If it is small we say that differences are not significant and we can remove some parameters from our model. One of the assumptions in ANOVA model is that observations are distributed normally. Another hidden assumption is that parameters are continuous. If number of observations is large enough then these assumptions work very well. There are cases when ANOVA is not adequate using linear model. Examples are: Outcomes are success or failure. In this case binomial distribution is more adequate. Outcome is the number of occurrences. In this case Poisson distribution is more adequate. One more feature of Binomial and Poisson distribution is that they can be applied to categorical variables (since these distributions are discrete).
R commands for ANOVA First decide what type ANOVA it is. Then decide what is the result of experiment and what are factors. Then define factors. There are several command to define factors f1 <- gl(k,n,total number) or it can be done directly: g1 <- c(numbers) f1 <- factor(some variable) then use linear model to fit data: result <- lm(data~formula) data is result of the experiment, formula is what do we want to fit. if we have two factors f1 and f2 then if we want to fit effects of f1 and f2 we can use as formula f1 + f2. If we want effect of f1 and f2 and their interaction then we can use f1*f2. If we want only interactions then we can use f1:f2. If we want effect of f1 and interaction between f1 and f2 then we can use f1*f2-f1. It is equivalent to f1 + f1:f2. Once we have result of linear model we can use following commands: anova(result), plot(result), summary(result).
Poisson distribution: log-linear model If the distribution of observations is Poisson then log-linear model should be used. Recall that Poisson distribution is from exponential family and the function A of the mean value is logarithm. It can be handled using generalised linear model. When log-linear model is appropriate: When outcomes are frequencies (expressed as integers) and parameters are categorical then log-linear model is appropriate. When we fit log-linear model then we can find estimated mean using exponential function: Example: Relation between gray hair and age Age gray hair under 40 over 40 yes no It is similar to two-fold nested ANOVA model. We could analyse this type of data using the log-linear model.
Binomial distribution: logistic model If the distribution of the result of experiment is binomial, i.e. outcome is 0 or 1 (success of failrure) then logistic model can be used. Recall that function of mean value A has the form: This function has a special name – logit. It has several advantages: If logit( ) has been estimated then we can find and it is between 0 and 1. If probability of success is larger than failure then this function is positive, otherwise it is negative. Changing places of success and failure changes only the sign of this function. This model can be used when outcomes are binary (0 and 1). If logit( ) is linear then we can find : For logistic model either grouped variables (fraction of successes) or individual items (every individual have success (1) or failure (0) can be used. Ratio of the probability of success to the probability of failure is also called odds.
Deviances In linear model we maximise the likelihood with full model and under the hypothesis. Then ratio of the values of the likelihoods under two hypotheses (null and alternative) is related with F-distribution. Interpretation is that how much variance would increase if we would remove part of the model (null hypothesis). In logisitc and log-linear model analysis again likelihood function is maximised under the null- and alternative hypotheses. Then logarithm of ratio of the values of the likelihood under these two hypotheses is related asymptotically with chi-squared distribution: That is the reason why in log-linear and logistic regressions it is usual to talk about deviances and chi-squared statistics instead of variances and F-statistics. Analysis based on log- linear and logistic models (in general for generalised linear models) is usually called analyisis of deviances. Reason for this is that chi-squared is related with deviation of the fitted model and observations. Another test is based on Pearson’s chi-squared test. These two tests behave similarly as the number of observations increases. Pearson chi-squared is calculated using:
R commands for log-linear model log-linear model can be analysed using generalised linear model. Once the factors, the data and the formula have been decided then we can use: result <- glm(data~formula,family=poisson) It will give us fitted model. Then we can use anova.glm(result,test=‘Chisq’) anova(result,test=‘Chisq’) plot(result) summary(result) Interpretation of the results is similar to linear model ANOVA table. Degrees of freedom is defined similarly. Only difference is that instead of sum of squares deviances are used.
R commands for logistic regression Similar to log-linear model: Decide what are the data, the factors and what formula should be used. Then use generalised linear model to fit. result <- glm(data~formula,family=binomial) then analyse using anova(result,test=“Chisq”) summary(result) plot(result)