# NEGATIVE BINOMIAL MODELS

THE POISSON & NEGATIVE BINOMIAL MODELS By: ALVARD AYRAPETYAN

OUTLINE OF PRESENTATION
- Poisson Regression Model
  - Assumptions, Assessment, and Interpretations
  - Applications in SAS and R
  - Quick Programming in SPSS and MINITAB
- Negative Binomial Model
  - Assumptions, Assessment, and Interpretations
  - Quick Programming in SPSS

ASSUMPTIONS FOR POISSON MODEL
- Events are counted over a fixed period of time
- Events occur at a constant rate
- Events are independent
- The dependent variable's conditional mean and variance are equal
- The dependent variable is a non-negative integer (a count)

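THE POISSON MODEL
- Random Component: Poisson distribution for the # of lead changes
- Systematic Component: $\beta_0 + \beta_1 X_1 + \dots + \beta_{p-1} X_{p-1}$
- Mass Function: $P(Y = y) = \dfrac{e^{-\mu}\mu^{y}}{y!}$, with $E(Y) = \mu$ and $V(Y) = \mu$
- Link Function: $g(\mu) = \log(\mu)$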
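The equal mean and variance of the Poisson distribution can be checked numerically. The sketch below is only an illustration: the mean `mu = 4.0` and the truncation point of the sum are assumptions chosen for the example, not values from the presentation.

```python
import math


def poisson_pmf(y: int, mu: float) -> float:
    """P(Y = y) for a Poisson distribution with mean mu, computed on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


mu = 4.0              # hypothetical mean, chosen only for illustration
support = range(200)  # the probability mass beyond 200 is negligible for mu = 4

# Mean and variance computed directly from the mass function
pois_mean = sum(y * poisson_pmf(y, mu) for y in support)
pois_var = sum((y - pois_mean) ** 2 * poisson_pmf(y, mu) for y in support)

print(round(pois_mean, 6), round(pois_var, 6))
```

Both printed values should agree with `mu`, which is the property that overdispersed data violate.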

EXAMPLES OF POISSON DISTRIBUTION
- Number of earthquakes in a region
- Number of accidents on a highway in a certain area in a specified time
- Number of telephone calls received in one hour
- Number of customers that enter a bank in one hour
- Number of times an elderly person will fall in a month

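INTERPRETING COEFFICIENTS
CONTINUOUS PREDICTOR
- Keeping all other predictors constant, when $X_j$ is increased by one unit, the log of the expected Y increases/decreases (+/-) by $\beta_j$
- Keeping all other predictors constant, when $X_j$ is increased by one unit, the expected number of Y will go up/down (+/-) by $(e^{\beta_j} - 1) \times 100\%$

CATEGORICAL PREDICTOR
- Keeping all other predictors constant, when $X_j = 1$ (versus the reference category), the log of the expected Y increases/decreases (+/-) by $\beta_j$
- Keeping all other predictors constant, when $X_j = 1$ (versus the reference category), the expected number of Y will go up/down (+/-) by $(e^{\beta_j} - 1) \times 100\%$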

POTENTIAL PROBLEM WITH POISSON
- OVERDISPERSION: the variance is much larger than the mean
- Negative Binomial is the solution!

THE DATA
- Trying to predict the number of field goal attempts (FGA) in the NBA
- Extracted the top 100 highest-scoring players in the NBA for the season
- The following were used as predictors:
  - Number of games played (GP)
  - Number of defensive rebounds (DREB)
  - Number of assists (AST)
  - Number of steals (STL)
  - Number of blocks (BLK)
  - Number of turnovers (TOV)
  - Number of free throws made (FTM)

SAMPLE OF THE DATA

| Rank | Player | GP | FGA | DREB | AST | STL | FTM | TOV |
|---|---|---|---|---|---|---|---|---|
| 1 | Kevin Love (MIN) | 15 | 268 | 146 | 68 | 13 | 95 | 41 |
| 2 | Kevin Durant (OKC) | 12 | 209 | 72 | 62 | 17 | 131 | 45 |
| 3 | Monta Ellis (DAL) | 14 | 235 | 42 | 76 | 22 | 85 | 55 |

Remaining rows, with the values that survive:
- 4 Blake Griffin (LAC) 242 129 47 19 59 40
- 5 LeBron James (MIA) 201 67 88 71 49
- 6 Evan Turner (PHI) 272 53 56
- 7 Kevin Martin (MIN) 248 48 33 18 20
- 8 Paul George (IND) 231 23 70
- 9 LaMarcus Aldridge (POR) 285 105 35 54 34
- 10 Carmelo Anthony (NYK) 264 79 36
- 11 Kyrie Irving (CLE) 89
- Klay Thompson (GSW) 212 30
- Dirk Nowitzki (DAL) 206 82 16 74 25
- James Harden (HOU) 195 65 21 91 52
- Chris Paul (LAC) 208 188 81 44
- Arron Afflalo (ORL) 197 61
- Damian Lillard (POR) 225 64 31
- DeMarcus Cousins (SAC) 230 103

POISSON-EXAMPLE WITH SAS

    proc genmod data = nba;
      model FGA = GP DREB AST STL TOV FTM / dist = poisson;
    run;

    /* check goodness of fit for model */
    data pvalue;
      df = 93;
      chisq = ;  /* deviance reported by PROC GENMOD goes here */
      pvalue = 1 - probchi(chisq, df);
    run;

    proc print data = pvalue noobs;
    run;

    /* a significant (small) pvalue means the model isn't good;
       dispersion parameter >> 1, major overdispersion */

EXAMPLE RESULTS-GOODNESS OF FIT
The GENMOD Procedure

Model Information

| | |
|---|---|
| Data Set | WORK.NBA |
| Distribution | Poisson |
| Link Function | Log |
| Dependent Variable | FGA |
| Number of Observations Read | |
| Number of Observations Used | |

Criteria For Assessing Goodness Of Fit

| Criterion | DF | Value | Value/DF |
|---|---|---|---|
| Deviance | | | |
| Scaled Deviance | | | |
| Pearson Chi-Square | | | |
| Scaled Pearson X2 | | | |
| Log Likelihood | | | |
| Full Log Likelihood | | | |
| AIC (smaller is better) | | | |
| AICC (smaller is better) | | | |
| BIC (smaller is better) | | | |

RESULTS: Analysis of Maximum Likelihood Parameter Estimates
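| Parameter | DF | Estimate | Standard Error | Wald 95% Confidence Limits | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|---|
| Intercept | 1 | 4.1864 | 0.0749 | (4.0396, 4.3332) | | <.0001 |
| GP | | 0.0422 | 0.0057 | (0.0310, 0.0534) | 54.93 | |
| DREB | | 0.0004 | 0.0003 | ( , 0.0010) | 1.55 | 0.2131 |
| AST | | | | ( , 0.0005) | 0.28 | 0.5995 |
| STL | | 0.0028 | 0.0012 | (0.0004, 0.0052) | 5.17 | 0.0230 |
| TOV | | | 0.0010 | (0.0038, 0.0077) | 33.53 | |
| FTM | | 0.0040 | | (0.0032, 0.0048) | 98.23 | |
| Scale | | 1.000 | | (1.0, 1.0) | | |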
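The Wald columns follow from the estimate and its standard error. As a rough check, the sketch below recomputes the GP row from the rounded estimate (0.0422) and standard error (0.0057) shown in the table. The helper name `wald_summary` is made up for this example, and the chi-square differs slightly from the reported 54.93 because the inputs are rounded.

```python
from statistics import NormalDist


def wald_summary(estimate: float, std_error: float, level: float = 0.95):
    """Return (lower, upper, chi_square, p_value) for a Wald test of estimate = 0."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    lower = estimate - z_crit * std_error
    upper = estimate + z_crit * std_error
    z = estimate / std_error
    chi_square = z ** 2                              # Wald chi-square with 1 df
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-sided p-value
    return lower, upper, chi_square, p_value


# GP row of the Poisson fit, using the rounded values from the table
lower, upper, chi_square, p_value = wald_summary(0.0422, 0.0057)
print(f"({lower:.4f}, {upper:.4f})  chi-square = {chi_square:.2f}  p = {p_value:.3g}")
```

The interval reproduces the reported limits (0.0310, 0.0534). The square root of the Wald chi-square is the z value that R reports for the same coefficient.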

ASSESSMENT OF RESULTS
- Ratio of Deviance/DF is >> 1: major overdispersion
- The deviance indicates a poor fit because pvalue = 1 - probchi(chisq, df) is significant (a small p-value rejects the hypothesis that the model fits)
- Every term is significant except for AST and DREB
- False results are possible if the model is inaccurate
- Must fit a NEGATIVE BINOMIAL model

POISSON-EXAMPLE WITH R

    nba <- read.csv("F:/STATS544/nba.csv", header = TRUE)
    poiss <- glm(FGA ~ GP + DREB + AST + STL + TOV + FTM,
                 family = "poisson", data = nba)
    summary(poiss)

R-GOODNESS OF FIT

    Deviance Residuals:
    Min  1Q  Median  3Q  Max

    (Dispersion parameter for poisson family taken to be 1)

    Null deviance:  on 99 degrees of freedom
    Residual deviance:  on 93 degrees of freedom
    AIC:

R-ANALYSIS OF PARAMETER ESTIMATES

    Call:
    glm(formula = FGA ~ GP + DREB + AST + STL + TOV + FTM, family = "poisson",
        data = nba)

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)                      55.902  < 2e-16 ***
    GP                                7.411 1.25e-13 ***
    DREB                              1.245    0.213
    AST                              -0.525    0.600
    STL                               2.273    0.023 *
    TOV                               5.790 7.02e-09 ***
    FTM                               9.911
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

POISSON WITH SPSS & MINITAB

SPSS:

    genlin FGA with GP DREB AST STL TOV FTM
      /model GP DREB AST STL TOV FTM INTERCEPT=YES
        distribution = poisson link = log
      /print FIT SUMMARY SOLUTION.

MINITAB: Stat > Regression > Poisson Regression > Fit Poisson Model

Detecting over-dispersion with SAS
Poisson regression gives a ratio between DEVIANCE and DF > 1.

    proc genmod data = nba;
      model FGA = GP DREB AST STL TOV FTM / dist = poisson;
    run;

PROC MEANS: the variance of FGA (Y) is much higher than its mean.

    proc means data = nba n mean var min max;
      var FGA;
    run;

Detecting over-dispersion with R
Poisson regression gives a ratio between RESIDUAL DEVIANCE and DF > 1.

    poiss <- glm(FGA ~ GP + DREB + AST + STL + TOV + FTM,
                 family = "poisson", data = nba)
    summary(poiss)

    mean(nba$FGA)
    # [1]
    var(nba$FGA)
    # [1]
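The same mean-versus-variance comparison can be sketched in Python. The ten FGA values below are the ones that appear to survive for the first ten players in the sample table. That reading is an assumption, since the table is incomplete, so the output only illustrates the check and is not the result for the full 100-player data set.

```python
from statistics import mean, variance

# FGA for the first ten players, as read from the sample table (assumed values)
fga = [268, 209, 235, 242, 201, 272, 248, 231, 285, 264]

fga_mean = mean(fga)
fga_var = variance(fga)  # sample variance (divides by n - 1)

# Under a Poisson model the two should be roughly equal;
# a ratio well above 1 points to overdispersion.
print(fga_mean, round(fga_var, 2), round(fga_var / fga_mean, 2))
```

This is a raw comparison that ignores the predictors, which is also what the PROC MEANS and `mean()`/`var()` checks do. The Deviance/DF ratio from the fitted model is the check that accounts for them.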

NEGATIVE BINOMIAL REGRESSION
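- Generalization of Poisson regression
- Used for over-dispersed count data
- PMF: $P(Y = y) = \dfrac{\Gamma(y + 1/k)}{\Gamma(1/k)\, y!} \left(\dfrac{1}{1 + km}\right)^{1/k} \left(\dfrac{km}{1 + km}\right)^{y}$, with $E(Y) = m$ and $V(Y) = m + k m^2$
- $k$ = dispersion parameter
- As $k \to 0$, $V(Y) \to m$: NB approaches Poisson and $V(Y) = E(Y) = m$
- Link function same as Poisson: $g(m) = \log(m)$
- Equation: $\log(m) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_{p-1} X_{p-1}$
- Goodness-of-fit test: same as Poisson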
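The variance function $V(Y) = m + k m^2$ and the Poisson limit can be checked numerically. The sketch below assumes the usual mean/dispersion form of the negative binomial mass function. The mean `m = 5.0`, the values of `k`, and the truncation point are illustrative choices, not values from the presentation.

```python
import math


def negbin_pmf(y: int, m: float, k: float) -> float:
    """Negative binomial P(Y = y) with mean m and dispersion k (variance m + k*m^2)."""
    r = 1.0 / k
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + r * math.log(r / (r + m))    # (1 / (1 + k*m)) ** (1/k)
             + y * math.log(m / (r + m)))   # (k*m / (1 + k*m)) ** y
    return math.exp(log_p)


def moments(m: float, k: float, upper: int = 5000):
    """Mean and variance computed directly from the mass function (truncated sum)."""
    nb_mean = sum(y * negbin_pmf(y, m, k) for y in range(upper))
    nb_var = sum((y - nb_mean) ** 2 * negbin_pmf(y, m, k) for y in range(upper))
    return nb_mean, nb_var


m = 5.0  # hypothetical mean, chosen only for illustration
results = {k: moments(m, k) for k in (1.0, 0.1, 0.001)}
for k, (nb_mean, nb_var) in results.items():
    # Last column is the theoretical variance m + k*m^2
    print(k, round(nb_mean, 4), round(nb_var, 4), round(m + k * m ** 2, 4))
```

As `k` shrinks, the computed variance moves toward the mean, which is the sense in which the negative binomial approaches the Poisson.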

NEGATIVE BINOMIAL-EXAMPLE WITH SAS

    proc genmod data = nba;
      model FGA = GP DREB AST STL TOV FTM / dist = negbin;  /* ONLY DIFFERENCE FROM POISSON */
    run;

    /* check goodness of fit for model */
    data pvalue;
      df = 93;
      chisq = ;  /* deviance reported by PROC GENMOD goes here */
      pvalue = 1 - probchi(chisq, df);
    run;

    proc print data = pvalue noobs;
    run;

EXAMPLE RESULTS-GOODNESS OF FIT
| | |
|---|---|
| Data Set | WORK.NBA |
| Distribution | Negative Binomial |
| Link Function | Log |
| Dependent Variable | FGA |
| Number of Observations Read | 100 |
| Number of Observations Used | 100 |

Criteria For Assessing Goodness Of Fit

| Criterion | DF | Value | Value/DF |
|---|---|---|---|
| Deviance | | | |
| Scaled Deviance | | | |
| Pearson Chi-Square | | | |
| Scaled Pearson X2 | | | |
| Log Likelihood | | | |
| Full Log Likelihood | | | |
| AIC (smaller is better) | | | |
| AICC (smaller is better) | | | |
| BIC (smaller is better) | | | |

RESULTS: Analysis of Maximum Likelihood Parameter Estimates
| Parameter | DF | Estimate | Standard Error | Wald 95% Confidence Limits | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|---|
| Intercept | 1 | 4.1742 | 0.1641 | (3.8525, 4.4958) | 647.01 | <.0001 |
| GP | | 0.0426 | 0.0125 | (0.0181, 0.0671) | 11.62 | 0.0007 |
| DREB | | 0.0003 | | ( , 0.0016) | 0.15 | 0.7028 |
| AST | | | 0.0008 | ( , 0.0014) | 0.03 | 0.8619 |
| STL | | 0.0024 | 0.0027 | ( , 0.0077) | 0.78 | 0.3756 |
| TOV | | 0.0060 | 0.0023 | (0.0015, 0.0105) | 6.95 | 0.0084 |
| FTM | | 0.0042 | 0.0010 | (0.0023, 0.0061) | 19.32 | |
| Dispersion | | 0.0230 | 0.0040 | (0.0163, 0.0325) | | |

ASSESSMENT OF RESULTS
- Ratio of Deviance/DF ≈ 1 (over-dispersion fixed!)
- The deviance now indicates a good fit because pvalue = 1 - probchi(chisq, df) is NOT significant
- Extra parameter in the “Analysis of Maximum Likelihood Parameter Estimates” called “Dispersion” (aka ALPHA)
  - Accounts for the over-dispersion we came across in the Poisson regression
  - This estimate has a value of 0.0230 with a Wald 95% confidence interval of (0.0163, 0.0325)
  - Because the 95% confidence limits exclude 0, the dispersion is significantly different from 0, which justifies the negative binomial model as more appropriate
- GP, TOV, & FTM are the only significant predictors
- Notice the drastic change in the number of significant predictors

NEGATIVE BINOMIAL-EXAMPLE WITH R

    nba <- read.csv("F:/STATS544/nba.csv", header = TRUE)
    install.packages('MASS')
    library(MASS)
    nb <- glm.nb(FGA ~ GP + DREB + AST + STL + TOV + FTM, data = nba)
    summary(nb)

EXAMPLE RESULTS-GOODNESS OF FIT

    Deviance Residuals:
    Min  1Q  Median  3Q  Max

    (Dispersion parameter for Negative Binomial( ) family taken to be 1)

    Null deviance:  on 99 degrees of freedom
    Residual deviance:  on 93 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 1

    Theta:
    Std. Err.:
    2 x log-likelihood:

RESULTS: Analysis of Maximum Likelihood Parameter Estimates

    Call:
    glm.nb(formula = FGA ~ GP + DREB + AST + STL + TOV + FTM, data = nba,
        init.theta = , link = log)

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)                      25.663  < 2e-16 ***
    GP                                3.438          ***
    DREB                              0.383
    AST                              -0.176
    STL                               0.886
    TOV                               2.652          **
    FTM                               4.467 7.95e-06 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

INTERPRETATION OF SIGNIFICANT COEFFICIENTS
- GP: Holding all other variables constant, for every additional game played, the log of the expected number of field goal attempts goes up by 0.0426. Equivalently, for every additional game played, the expected number of field goal attempts increases by 4.35%.
- TOV: Holding all other variables constant, for every additional turnover, the log of the expected number of field goal attempts goes up by 0.0060. Equivalently, for every additional turnover, the expected number of field goal attempts increases by 0.60%.
- FTM: Holding all other variables constant, for every additional free throw made, the log of the expected number of field goal attempts goes up by 0.0042. Equivalently, for every additional free throw made, the expected number of field goal attempts increases by 0.42%.
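The percentages come from exponentiating the coefficients: with a log link, a one-unit increase in a predictor multiplies the expected count by $e^{\beta}$. A minimal sketch, using the negative binomial estimates for GP, TOV, and FTM from the SAS parameter table (the helper name `percent_change` is made up for this example):

```python
import math


def percent_change(beta: float) -> float:
    """Percent change in the expected count for a one-unit increase in the predictor."""
    return (math.exp(beta) - 1) * 100


# Negative binomial estimates from the SAS parameter table
coefficients = {"GP": 0.0426, "TOV": 0.0060, "FTM": 0.0042}

for name, beta in coefficients.items():
    print(f"{name}: {percent_change(beta):.2f}%")
# GP: 4.35%
# TOV: 0.60%
# FTM: 0.42%
```

For coefficients this small, the percent change is close to 100 times the coefficient itself, which is why 0.0060 and 0.0042 map to 0.60% and 0.42%.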

NEGATIVE BINOMIAL WITH SPSS & MINITAB

SPSS:

    genlin FGA with GP DREB AST STL TOV FTM
      /model GP DREB AST STL TOV FTM INTERCEPT=YES
        distribution = negbin(mle) link = log
      /print FIT SUMMARY SOLUTION.

MINITAB: NA

SUMMARY
- Use Poisson regression when dealing with COUNT data
- If there’s overdispersion, switch to negative binomial
- Assumptions for Poisson and NB are the same, except that NB does not require the variance to equal the mean
- Coefficients of both models are interpreted in the same manner
- Both regressions can be performed in SAS, R, & SPSS
- Minitab can perform only Poisson