Presentation transcript:

GENERALIZED LINEAR MODELS: NEGATIVE BINOMIAL GLM and its application in R
Malcolm Cameron, Xiaoxu Tan

Table of Contents
- Poisson GLM
- Problems with Poisson
- Solutions to this Problem
- Negative Binomial GLM
- Data Analysis
- Summary

2. Poisson
Let Y_i be the random variable for the claim count in the i-th class, i = 1, 2, ..., N, with pmf

$$P(Y_i = k) = \frac{e^{-\lambda}\lambda^k}{k!}, \qquad k = 0, 1, 2, \ldots$$

The mean and the variance are both equal to $\lambda$.
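A quick empirical illustration of the mean-variance equality (a sketch, not from the original slides):

set.seed(1)
x <- rpois(1e5, lambda = 4)   # large Poisson sample with lambda = 4
c(mean = mean(x), var = var(x))  # both should be close to 4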

Poisson Continued
- Mean and variance are both equal to $\lambda$.
- Closely tied to the Poisson process, whose inter-arrival times are memoryless.
- Commonly used because of its convenience and appropriateness for count data.
- Widely available in statistical packages.
The Poisson distribution can be used for:
- Queuing theory
- Insurance claims
- Any count data

3. Problems with Poisson
- Overdispersion and heterogeneity in the distribution of residuals.
- The MLE procedure used to derive estimates and their standard errors makes a strong assumption: every subject within a covariate group has the same underlying rate of outcome (homogeneity).
- The model assumes that the variability of counts within a covariate group equals the mean.
- So if the variance is greater than the mean, standard errors will be underestimated and the significance of regression parameters overestimated.

4. Solutions to This Problem
But don't worry! There are two standard remedies:
(1) Fit a Poisson quasi-likelihood model.
(2) Fit a Negative Binomial GLM.
We will be focusing on the second method; both are sketched in R below.
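A minimal R sketch of the two remedies, previewing the attendance example fitted later in this talk (the names students, daysabs, math, and prog come from that example):

# (1) Quasi-Poisson: same mean model, with a dispersion factor estimated from the data
glm(daysabs ~ math + prog, family = quasipoisson, data = students)

# (2) Negative binomial GLM (the focus of this talk), via glm.nb() in the MASS package
library(MASS)
glm.nb(daysabs ~ math + prog, data = students)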

5. Negative Binomial
The pmf is

$$P(Y = k) = \binom{k + r - 1}{k} (1 - p)^r p^k, \qquad k \in \{0, 1, 2, 3, \ldots\}$$

where $k$ counts successes, $r$ is the fixed number of failures, and $p$ is the success probability.

Negative Binomial Continued
Under the Poisson model, the mean $\lambda_i$ is assumed to be constant within classes. But if we give $\lambda_i$ a distribution of its own, heterogeneity within classes can be accommodated.
One method is to assume $\lambda_i$ is Gamma with $E(\lambda_i) = \mu_i$ and $\mathrm{Var}(\lambda_i) = \mu_i^2 / \nu_i$, and $Y_i \mid \lambda_i$ to be Poisson with conditional mean $E(Y_i \mid \lambda_i) = \lambda_i$.

Negative Binomial Continued
It follows that the marginal distribution of $Y_i$ is Negative Binomial with pmf

$$P(Y_i = y) = \frac{\Gamma(y + \nu_i)}{\Gamma(\nu_i)\, y!} \left(\frac{\nu_i}{\nu_i + \mu_i}\right)^{\nu_i} \left(\frac{\mu_i}{\nu_i + \mu_i}\right)^{y}, \qquad y = 0, 1, 2, \ldots$$

where the mean is $E(Y_i) = \mu_i$ and $\mathrm{Var}(Y_i) = \mu_i + \mu_i^2 \nu_i^{-1}$.
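For completeness (a step the slide leaves implicit), the marginal pmf is the Gamma mixture of Poissons:

$$P(Y_i = y) = \int_0^\infty \frac{e^{-\lambda}\lambda^y}{y!}\, f(\lambda)\, d\lambda,$$

where $f$ is the Gamma density with mean $\mu_i$ and variance $\mu_i^2/\nu_i$ (shape $\nu_i$, rate $\nu_i/\mu_i$); carrying out the integral gives the Negative Binomial pmf above.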

MLE for NB GLM
Different parameterizations result in different types of negative binomial distributions. For example, by letting $\nu_i = a^{-1}$, $Y_i$ follows a NB with $E(Y_i) = \mu_i$ and $\mathrm{Var}(Y_i) = \mu_i(1 + a\mu_i)$, where $a$ denotes the dispersion parameter. Note: if $a = 0$, there would be no overdispersion.
The log likelihood for this parameterization is

$$l(\beta, a) = \sum_{i=1}^{n} \left\{ \sum_{j=0}^{y_i - 1} \ln(1 + a j) - \ln(y_i!) + y_i \ln \mu_i - \left(y_i + a^{-1}\right) \ln(1 + a \mu_i) \right\}$$
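As a sanity check on this expression (a sketch; the values of a, mu, and y below are arbitrary), one term of it should agree with R's built-in dnbinom() with size = 1/a:

a <- 0.5; mu <- 3; y <- 4
ll_manual <- sum(log(1 + a * (0:(y - 1)))) - lfactorial(y) +
  y * log(mu) - (y + 1/a) * log(1 + a * mu)
ll_manual - dnbinom(y, size = 1/a, mu = mu, log = TRUE)  # should be ~0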

MLE for NB GLM Continued
The maximum likelihood estimates are obtained by solving the score equations (with a log link, $\mu_i = \exp(x_i^\top \beta)$)

$$\frac{\partial l}{\partial \beta} = \sum_{i=1}^{n} \frac{(y_i - \mu_i)\, x_i}{1 + a \mu_i} = 0 \qquad (1)$$

and

$$\frac{\partial l}{\partial a} = \sum_{i=1}^{n} \left\{ \sum_{j=0}^{y_i - 1} \frac{j}{1 + a j} + a^{-2} \ln(1 + a \mu_i) - \frac{(y_i + a^{-1})\, \mu_i}{1 + a \mu_i} \right\} = 0 \qquad (2)$$

Negative Binomial Continued
The MLEs are solved simultaneously, and the procedure involves sequential iterations:
- In (1), using an initial value $a^{(0)}$, $l(\beta, a)$ is maximized with respect to $\beta$, producing $\beta^{(1)}$. The first equation is equivalent to weighted least squares, so with slight adjustments the MLE can be found using Iterated Weighted Least Squares (IWLS) regression, similar to the Poisson case.
- In (2), we treat $\beta$ as a constant and solve for $a^{(1)}$. This can be done using the Newton-Raphson algorithm.
By cycling between these two updates, the MLEs of $\beta$ and $a$ are obtained; a rough R sketch of the alternation follows.
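A minimal sketch of this alternating scheme in R (not the exact internals of MASS::glm.nb; MASS parameterizes by theta = 1/a, and the model and data names are from the example below):

library(MASS)  # negative.binomial(), theta.ml()
theta <- 1     # initial value; theta = 1/a
for (iter in 1:25) {
  # (1) IWLS step: maximize over beta with theta held fixed
  fit <- glm(daysabs ~ math + prog, data = students,
             family = negative.binomial(theta = theta))
  # (2) Newton-Raphson step: update theta with the fitted means held fixed
  theta_new <- as.numeric(theta.ml(students$daysabs, fitted(fit)))
  if (abs(theta_new - theta) < 1e-8) break
  theta <- theta_new
}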

6. Data Analysis
- Find a data set with overdispersion.
- Do the analysis using Poisson and Negative Binomial.
- Compare the models.

Example: Student Attendance
School administrators study the attendance behavior of high school juniors at two schools. Predictors:
- The type of program in which the student is enrolled
- The student's score on a standardized math test

The Data
- nb_data: the file of attendance data on 314 high school juniors from two urban high schools.
- daysabs: the response variable of interest, days absent.
- math: the standardized math score for each student.
- prog: a three-level nominal variable indicating the type of instructional program in which the student is enrolled.
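One possible way to load and prepare these data (a sketch; it assumes nb_data is available locally as a CSV with the columns above and prog coded 1 to 3):

students <- read.csv("nb_data.csv")
students$prog <- factor(students$prog, levels = 1:3,
                        labels = c("General", "Academic", "Vocational"))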

Plot the data!
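One way to make such a plot (a sketch, assuming the ggplot2 package): histograms of days absent within each program, which show the skew and spread of the counts.

library(ggplot2)
ggplot(students, aes(daysabs, fill = prog)) +
  geom_histogram(binwidth = 1) +
  facet_grid(prog ~ ., margins = TRUE, scales = "free")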

6.1 Fit the Poisson model

poisson1 <- glm(daysabs ~ math + prog, family = "poisson", data = students)
summary(poisson1)

Call:
glm(formula = daysabs ~ math + prog, family = "poisson", data = students)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)      2.651974   0.060736  43.664  < 2e-16 ***
math            -0.006808   0.000931  -7.313 2.62e-13 ***
progAcademic    -0.439897   0.056672  -7.762 8.35e-15 ***
progVocational  -1.281364   0.077886 -16.452  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2217.7  on 313  degrees of freedom
Residual deviance: 1774.0  on 310  degrees of freedom
AIC: 2665.3

Number of Fisher Scoring iterations: 5
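A quick overdispersion check from this output (a sketch): under a well-fitting Poisson model the residual deviance should be roughly equal to its degrees of freedom, but here it is nearly six times larger.

deviance(poisson1) / df.residual(poisson1)  # 1774.0 / 310 ≈ 5.72, far above 1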

The mean value of daysabs appears to vary by prog, so we summarize it by group:

with(students, tapply(daysabs, prog, function(x) {
  sprintf("M (SD) = %1.2f (%1.2f)", mean(x), sd(x))
}))

                General                Academic              Vocational
"M (SD) = 10.65 (8.20)"  "M (SD) = 6.93 (7.45)"  "M (SD) = 2.67 (3.73)"

Poisson regression has a very strong assumption: the conditional variance equals the conditional mean. But here the variance is much greater than the mean (for General, for example, the variance is 8.20^2 ≈ 67 against a mean of 10.65). So...
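The overdispersion can also be seen directly as a variance-to-mean ratio per group (a sketch):

with(students, tapply(daysabs, prog, function(x) var(x) / mean(x)))
# each ratio is well above the value of 1 that a Poisson model implies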

Plot the data

So we need to find a new model. Negative binomial regression can be used when the conditional variance exceeds the conditional mean.

6.2 Fit the negative binomial model

library(MASS)  # provides glm.nb()
NB1 <- glm.nb(daysabs ~ math + prog, data = students)
summary(NB1)

Call:
glm.nb(formula = daysabs ~ math + prog, data = students,
    init.theta = 1.032713156, link = log)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1547  -1.0192  -0.3694   0.2285   2.5273

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)      2.615265   0.197460  13.245  < 2e-16 ***
math            -0.005993   0.002505  -2.392   0.0167 *
progAcademic    -0.440760   0.182610  -2.414   0.0158 *
progVocational  -1.278651   0.200720  -6.370 1.89e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(1.0327) family taken to be 1)

    Null deviance: 427.54  on 313  degrees of freedom
Residual deviance: 358.52  on 310  degrees of freedom
AIC: 1741.3

Number of Fisher Scoring iterations: 1
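Linking back to the $\nu_i = a^{-1}$ parameterization above (a sketch): glm.nb estimates theta, which plays the role of $1/a$, and stores it on the fit object.

NB1$theta      # 1.0327..., the estimated theta = 1/a
1 / NB1$theta  # a ≈ 0.97, clearly above 0, i.e. substantial overdispersion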

Plot the data!

7. Check model assumptions
We use a likelihood ratio test to compare the two models (the Poisson is the negative binomial with a = 0). Code:

pchisq(2 * (logLik(NB1) - logLik(poisson1)), df = 1, lower.tail = FALSE)
[1] 2.157298e-203

This strongly suggests the negative binomial model is more appropriate than the Poisson model!

8. Goodness of fit
We compare each model's Pearson chi-square statistic to its 310 residual degrees of freedom.

For Poisson:
resids1 <- residuals(poisson1, type = "pearson")
sum(resids1^2)
[1] 2045.656
1 - pchisq(2045.656, 310)
[1] 0

For negative binomial:
resids2 <- residuals(NB1, type = "pearson")
sum(resids2^2)
[1] 339.8771
1 - pchisq(339.8771, 310)
[1] 0.1170337

The Poisson fit is firmly rejected (p ≈ 0), while the negative binomial fit is adequate (p = 0.117).

9. AIC: which model is better?

AIC(poisson1)
[1] 2665.285
AIC(NB1)
[1] 1741.258

The negative binomial model has a much smaller AIC, so it is preferred.

Thank you!