Presentation on theme: "Modelling Count Data: Outline"— Presentation transcript:
1. Modelling Count Data: Outline
- Characteristics of count data and the Poisson distribution
- Applying the Poisson: flying bomb strikes in South London
- Deaths by horse-kick: as a single-level Poisson model fitted in MLwiN
- Overdispersion: types and consequences, the unconstrained Poisson, the Negative Binomial
- Taking stock: 4 distributions for modelling counts
- Number of extramarital affairs: the incidence rate ratio (IRR); handling categorical & continuous predictors; comparing models with DIC
- Titanic survivor data: taking account of exposure, the offset
- Multilevel Poisson and NBD models: estimation and VPC
- Applications: HIV in India and teenage employment in Glasgow
- Spatial models: lip cancer in Scotland; respiratory cancer in Ohio
2. Some characteristics of count data
- Very common in the social sciences: number of children; number of marriages; number of arrests; number of traffic accidents; number of flows; number of deaths
- Counts have particular characteristics:
  - Integers & cannot be negative
  - Often positively skewed; a ‘floor’ of zero
  - In practice often rare events which peak at 1, 2 or 3 and are rare at higher values
- Modelled by:
  - Logit regression, which models the log odds of an underlying propensity of an outcome
  - Poisson regression, which models the log of the underlying rate of occurrence of a count
3. Theoretical Poisson distribution I
- The Poisson distribution results if the underlying number of random events per unit time or space has a constant mean rate of occurrence ($\lambda$), and each event is independent
- Siméon-Denis Poisson (1838) Research on the Probability of Judgments in Criminal and Civil Matters

Applying the Poisson: Flying bomb strikes in South London
- Key research question: were the bombs falling at random or under a guidance system?
- If random, independent events should be distributed spatially as a Poisson distribution
- Divide south London into 576 equally sized small areas (0.24 km²)
- Count the number of bombs in each area and compare to a Poisson
- Mean rate: $\lambda = [229(0) + 211(1) + 93(2) + 35(3) + 7(4) + 1(5)]/576 = 535/576 \approx 0.929$ hits per unit square

Hits per square | 0 | 1 | 2 | 3 | 4 | 5+
Observed | 229 | 211 | 93 | 35 | 7 | 1
Poisson | 227.5 | 211.3 | 98.15 | 30.39 | 7.06 | 1.54

- Very close fit; concluded random
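The comparison of observed and Poisson frequencies on this slide can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: it uses the observed counts from the slide, treats the 5+ class as exactly 5 hits when computing the mean (as the slide's formula does), and the function name `poisson_pmf` is a hypothetical helper, not something from the presentation.

```python
import math

# Observed number of squares with 0, 1, 2, 3, 4 and 5+ hits (from the slide).
observed = [229, 211, 93, 35, 7, 1]
n_squares = sum(observed)  # 576 squares

# Mean hits per square, counting the 5+ class as exactly 5 (as the slide does).
mean_rate = sum(k * f for k, f in enumerate(observed)) / n_squares


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Expected squares for 0..4 hits; the 5+ class takes the remaining probability.
expected = [n_squares * poisson_pmf(k, mean_rate) for k in range(5)]
expected.append(n_squares - sum(expected))

for k, (obs, exp_) in enumerate(zip(observed, expected)):
    label = "5+" if k == 5 else str(k)
    print(f"{label:>2} hits: observed {obs:3d}, Poisson {exp_:6.2f}")
```

The closeness of the two columns is what supports the conclusion that the strikes were random rather than guided.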
4. Theoretical Poisson distribution II
- Probability mass function (PMF) for 3 different mean occurrences
- When mean = 1: very positively skewed
- As mean occurrence increases (more common event), the distribution approaches the Gaussian
- So use Poisson for ‘rareish’ events; mean below 10
- Fundamental property of the Poisson: mean = variance
- Simulated 10,000 observations according to Poisson (table columns: Mean, Variance, Skewness)
- Variance is not a freely estimated parameter as in the Gaussian
5. Death by Horse-kick I: the data
- Bortkiewicz, L. von (1898) The Law of Small Numbers, Leipzig
- Number of soldiers killed annually by horse-kicks in the Prussian cavalry; 10 corps over 20 years (occurrences per unit time)
- The full data: 200 corps-years of observations
- As a frequency distribution (grouped data): table of Deaths and Frequency, with Mean, Variance and Number of obs
- Interpretation: mean rate of 0.61 deaths per corps-year (ie rare)
- Mean equals variance, therefore a Poisson distribution
6. Death by Horse-kick II: as a Poisson
- Again: the Poisson results if the underlying number of random events per unit time or space has a constant mean rate of occurrence ($\lambda$), and each event is independent
- With a mean (and therefore a variance) of 0.61 (table columns: Deaths, Frequency, Theory)
- Formula for the Poisson PMF: $P(X = x) = \dfrac{e^{-\lambda} \lambda^{x}}{x!}$
  - $e$ is the base of the natural logarithm (2.7182)
  - $\lambda$ is the mean (shape parameter): the average number of events in a given time interval
  - $x!$ is the factorial of $x$
- EG: mean rate of 0.6 accidents per corps-year; what is the probability of getting 3 accidents in a corps in a year?
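The worked example on this slide (probability of 3 accidents when the mean rate is 0.6) follows directly from the PMF. A minimal sketch, assuming the standard Poisson formula; `poisson_pmf` is a hypothetical helper name, and the theoretical frequencies use the mean of 0.61 and the 200 corps-years quoted on the slides.

```python
import math


def poisson_pmf(x, lam):
    """Poisson probability of exactly x events when the mean is lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


# Slide example: mean rate 0.6 per corps-year, probability of exactly 3.
p3 = poisson_pmf(3, 0.6)
print(f"P(3 accidents | mean 0.6) = {p3:.4f}")

# Theoretical frequencies for 200 corps-years with a mean of 0.61.
lam, n_obs = 0.61, 200
theory = {x: n_obs * poisson_pmf(x, lam) for x in range(5)}
for x, freq in theory.items():
    print(f"{x} deaths: {freq:6.2f} corps-years expected")
```

So three accidents in one corps-year is expected in roughly 2% of corps-years at this rate.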
7. Horse-kick III: as a single-level Poisson model
- General form of the single-level model: $y_i \sim \text{Poisson}(\lambda_i)$
- Observed count is distributed as an underlying Poisson with a mean rate of occurrence of $\lambda_i$
- That is, as an underlying mean plus a level-1 random term weighted by $z_{0i}$ (the Poisson ‘weight’): $y_i = \lambda_i + e_{0i} z_{0i}$
- Mean rate is related to predictors non-linearly, as an exponential relationship: $\lambda_i = \exp(\beta_0 + \beta_1 x_{1i})$
- Model $\log_e \lambda_i$ to get a linear model (log link): $\log_e(\lambda_i) = \beta_0 + \beta_1 x_{1i}$
- The Poisson weight is the square root of the estimated underlying count, $z_{0i} = \sqrt{\hat{\lambda}_i}$, re-estimated at each iteration
- Variance of level-1 residuals constrained to 1
- Modeling on the $\log_e$ scale means the model cannot predict a negative count on the raw scale
- Level-1 variance is constrained to be an exact Poisson (variance = mean)
8. Horse-kick IV: Null single-level Poisson model in MLwiN
- The raw ungrouped counts are modeled with a log link and a variance constrained to be equal to the mean
- The estimate of $\beta_0$ is the mean rate of occurrence on the log scale
- Exponentiate to get the mean rate of 0.61
- 0.61 is interpreted as a RATE: the number of events per unit time (or space), ie 0.61 horse-kick deaths per corps-year
9. Overdispersion I: Types and consequences
- So far: equi-dispersion, variance equal to the mean
- Overdispersion: variance > mean; long tail, eg length of stay (common)
- Under-dispersion: variance < mean; data more alike than a pure Poisson process; in multilevel, possibility of a missing level
- Consequences of overdispersion:
  - Fixed part SEs: "point estimates are accurate but they are not as precise as you think they are"
  - In multilevel, mis-estimate the higher-level random part
- Apparent and true overdispersion; thought experiment: number of extra-marital affairs, men and women with different means
  - apparent: mis-specified fixed part, not separated out distributions with different means
  - true: genuine stochastic property of more inherent variability
  - in practice, model the fixed part as well as possible, and allow for overdispersion

Who? | Mean | Var | Comment
Men & Women | 0.55 | 0.76 | Overdisp
Men | 1.00 | 1.00 | Poisson
Women | 0.10 | 0.10 | Poisson
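The thought experiment on apparent overdispersion can be checked analytically: pooling two exact Poisson groups with different means gives a variance above the pooled mean. This is a sketch under one loud assumption that is not stated on the slide, namely that men and women are equally represented; the means 1.00 and 0.10 are the slide's.

```python
# Group means from the slide's thought experiment; each group is exactly Poisson,
# so within each group the variance equals the mean.
group_means = {"Men": 1.00, "Women": 0.10}

# ASSUMPTION (not in the slide): equal shares of men and women in the pooled sample.
share = {"Men": 0.5, "Women": 0.5}

pooled_mean = sum(share[g] * m for g, m in group_means.items())

# Law of total variance: pooled variance = average within-group variance
# plus the variance of the group means around the pooled mean.
within = sum(share[g] * m for g, m in group_means.items())
between = sum(share[g] * (m - pooled_mean) ** 2 for g, m in group_means.items())
pooled_var = within + between

print(f"pooled mean = {pooled_mean:.2f}, pooled variance = {pooled_var:.4f}")
```

Under this assumption the pooled variance is about 0.75 against a mean of 0.55, close to the 0.76 shown in the slide's table, even though neither group is overdispersed on its own.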
10. Overdispersion II: the unconstrained Poisson
- Deaths by horse-kick: estimate an over-dispersed Poisson
- Allow the level-1 variance to be estimated
- Not significantly different from 1
- No evidence that this is not a Poisson distribution
11. Overdispersion III: the Negative Binomial
- Instead of fitting an overdispersed Poisson, could fit a NBD model
- Handles long-tailed distributions
- An explicit model in which the variance is greater than the mean
- Can even have an over-dispersed NBD
- Same log link, but the NBD has 2 parameters for the level-1 variance; that is, a quadratic level-1 variance, where v is the overdispersion parameter
12. Overdispersion IV: the Negative Binomial
- Horse-kick analysis
- Null single-level NBD model: essentially no change; v is estimated to be 0.00 (compare the models using Stored models; Compare stored models)
- Overdispersed negative binomial: no evidence of overdispersion; deaths are independent
13. Linking the Binomial and the Poisson: I
- First, the Bernoulli and the Binomial
- Bernoulli is a distribution for binary discrete events
  - y is the observed outcome, ie 1 or 0
  - $E(y) = \pi$: underlying propensity/probability of occurrence
  - $\text{Var}(y) = \pi(1 - \pi)$

Mean: $\pi$ | Variance: $\pi(1 - \pi)$
0.01 | 0.01 × 0.99 = 0.0099
0.5 | 0.5 × 0.5 = 0.25
0.99 | 0.99 × 0.01 = 0.0099

- Binomial is a distribution for discrete events out of a number of trials
  - y is the observed proportion; n is the number of trials
  - $E(y) = \pi$: underlying propensity of occurrence
  - $\text{Var}(y) = \pi(1 - \pi)/n$
- Least variation when the denominator is large (more reliable), and as the underlying probability approaches 0 or 1
14. Linking the Binomial and the Poisson: II
- Poisson is the limit of a binomial process in which prob → 0, n → ∞
- Poisson describes the probability that a random event will occur in a time or space interval when the probability is very small, the number of trials is very large, and each event is independent
- EG: the probability that any one automobile is involved in an accident is very small, but there are very many cars on a road in a day, so the number of crashes (if each crash is independent) follows a Poisson distribution
- If crashes are not independent (a ‘pile-up’), then over-dispersed Poisson/NBD; the latter is used for ‘contagious’ processes
- In practice, Poisson and NBD are used for rare occurrences, less than 10 cases per interval, & hundreds or even thousands for the denominator/trials [Clayton & Hills (1993) Statistical Models in Epidemiology, OUP]
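The limiting argument can be illustrated numerically: hold the mean n × prob fixed and let n grow, and the binomial probabilities move towards the Poisson ones. A small sketch; the mean of 2.0 and the values of n are arbitrary illustrative choices, not figures from the slides.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of k events in n independent trials with probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of k events for a Poisson with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


lam = 2.0  # illustrative mean count per interval (an assumption for the demo)


def max_gap(n):
    """Largest absolute difference between the two PMFs over k = 0..10."""
    p = lam / n  # keep the mean n * p fixed at lam
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(11))


gaps = {n: max_gap(n) for n in (10, 100, 10_000)}
for n, gap in gaps.items():
    print(f"n = {n:6d}, p = {lam / n:.4f}: max |binomial - Poisson| = {gap:.6f}")
```

The gap shrinks as the number of trials rises and the per-trial probability falls, which is the sense in which the Poisson suits rare events with large denominators.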
15. Taking stock: 4 distributions for counts
- If common rate of occurrence (mean > 10), then use raw counts and the Gaussian distribution (assess the Normality assumption of the residuals)
- If rare rate of occurrence, then use over-dispersed Poisson or NBD; the level-1 unconstrained variance estimate will allow assessment of departure from equi-dispersion; improved SEs, but biased estimates if there is apparent overdispersion due to model mis-specification
- Use the Binomial distribution if the count is out of some total and the event is not rare; that is, numerator and denominator are of the same order
- Figure: mean-variance relations for 4 different distributions that could be used for counts
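The mean-variance relations behind the four choices can be written as simple functions. This is a hedged sketch: the NBD form `mean + v * mean**2` is an assumption chosen to agree with the slides' description of a quadratic level-1 variance in which v = 0 recovers the Poisson; the function names and the illustrative parameter values are not from the presentation.

```python
def gaussian_variance(mean, sigma2):
    """Gaussian: variance is a free parameter, unrelated to the mean."""
    return sigma2


def poisson_variance(mean, scale=1.0):
    """Poisson: variance = mean; scale != 1 gives the over/under-dispersed Poisson."""
    return scale * mean


def nbd_variance(mean, v):
    """NBD (assumed parametrisation): quadratic in the mean, v = 0 is Poisson."""
    return mean + v * mean ** 2


def binomial_count_variance(mean, n_trials):
    """Binomial count out of n_trials: n*p*(1-p) rewritten with mean = n*p."""
    return mean * (1 - mean / n_trials)


for mean in (0.5, 2.0, 8.0):
    print(
        f"mean {mean:4.1f}: Poisson {poisson_variance(mean):5.2f}, "
        f"extra-Poisson(scale=2) {poisson_variance(mean, 2.0):5.2f}, "
        f"NBD(v=0.5) {nbd_variance(mean, 0.5):6.2f}, "
        f"Binomial(n=10) {binomial_count_variance(mean, 10):5.2f}"
    )
```

The extra-Poisson variance grows linearly with the mean while the NBD variance grows quadratically, which is why the NBD copes better with long tails.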
16. Modeling number of extra-marital affairs: single-level Poisson with a single categorical predictor
- Extract of raw data (601 individuals from Fair 1978)
- Table columns: Affairs, Cons, Children, Religious, YearsMarried, Age (example values: Children NoKids/WithKids; Religious Slightly, Somewhat, Anti, Very; YearsMarried < 4 years, 6-10 years, 10-15 years; Age 27.0 to 57.0)
- Fair, R C (1978) A theory of extramarital affairs, Journal of Political Economy, 86(1), 45-61
- Single categorical predictor: Children, with NoKids as the base
- Understanding customized predictions
17. Single-level Poisson with a single categorical predictor
- Understanding customized predictions
- Log scale: NoKids: -0.092; WithKids: -0.092 + 0.606 = 0.514
- First use the equation to get the underlying log-number of events, then exponentiate to get the estimated count (since married)
- As mean/median counts: NoKids: exp(-0.092) = 0.912; WithKids: exp(0.514) = 1.672
- Those with children have a higher average rate of affairs (but have they been married longer?)
18. Modeling number of extra-marital affairs: the incidence rate ratio (IRR)
- So far, mean counts: NoKids: exp(-0.092) = 0.912; WithKids: exp(0.514) = 1.672
- But also as an IRR, the ratio of the rates for those with and without kids: IRR = 1.672 / 0.912 = 1.83
- That is, WithKids have an 83% higher rate
- BUT can get this directly from the model by exponentiating the estimate for the contrasted category: exp(0.606) = 1.83
- Rules:
  a) exponentiating the estimates for the (constant plus the contrasted category) gives the mean rate for the contrasted category
  b) exponentiating the contrasted category gives the IRR in comparison to the base category
19. Why is the exponentiated coefficient an IRR?
- As always, the estimated coefficient is the change in the response corresponding to a one-unit change in the predictor
- The response is the underlying logged count
- When $X_i$ is 0 (NoKids): $\log(\text{count} \mid X_i = 0) = \beta_0$
- But when $X_i$ is 1 (WithKids): $\log(\text{count} \mid X_i = 1) = \beta_0 + \beta_1$
- Subtracting the first equation from the second gives $\log(\text{count} \mid X_i = 1) - \log(\text{count} \mid X_i = 0) = \beta_1$
- Exponentiating both sides gives (note the division sign) $(\text{count} \mid X_i = 1) / (\text{count} \mid X_i = 0) = \exp(\beta_1)$
- Thus, $\exp(\beta_1)$ is a rate ratio: the ratio of the mean number of affairs for a with-child person to the mean number of affairs for a without-child person
- Incidence: number of new cases
- Rate: because it is the number of events per time or space
- Ratio: because it is the ratio of two rates
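The two rules and the derivation above can be checked numerically with the estimates quoted on the slides (constant -0.092, WithKids differential 0.606). The variable names below are illustrative only.

```python
import math

b0 = -0.092  # constant: log mean count for the base category, NoKids (from the slides)
b1 = 0.606   # differential on the log scale for the contrasted category, WithKids

# Rule a): exponentiate constant (plus differential) to get mean rates.
rate_nokids = math.exp(b0)
rate_withkids = math.exp(b0 + b1)

# Rule b): the ratio of the two rates equals exp(b1), the IRR.
irr_from_rates = rate_withkids / rate_nokids
irr_direct = math.exp(b1)

# Changing the base category flips the sign on the log scale, giving the reciprocal IRR.
irr_reversed = math.exp(-b1)

print(f"NoKids rate   = {rate_nokids:.3f}")
print(f"WithKids rate = {rate_withkids:.3f}")
print(f"IRR           = {irr_direct:.3f} (from rates: {irr_from_rates:.3f})")
print(f"Reversed IRR  = {irr_reversed:.3f}")
```

The constant cancels in the ratio, which is why the IRR can be read straight from the single coefficient.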
20. Modeling number of extra-marital affairs: changing the base category
- Previously the contrasted category was WithKids; now the contrasted category is NoKids
- Changing the base simply produces a change of sign on the $\log_e$ scale
- Exponentiating the contrasted category: before, exp(+0.606) = 1.83; now, exp(-0.606) = 0.55
- Doubling the rate on the $\log_e$ scale is 0.693; halving the rate on the $\log_e$ scale is -0.693
- An IRR below 1 is the reciprocal of the equivalent increase (an IRR of 1/9 is the counterpart of a 9-fold increase), which is difficult to appreciate
- Advice: choose the base category to have the lowest mean rate; this gives positive contrasted estimates, so you are always comparing a larger value to a base of 1
21. Affairs: modeling a set of categorical predictors
- A model with ‘years married’ included, with < 4 years as the base
- Customised predictions: mean rate, IRR, & graph with 95% CIs

YrsMarried | A: log estimate | Mean rate (exp A) | B: differential log estimate | IRR (exp B)
< 4 years | -0.297 | 0.743 | 0.000 | 1.000
4-5 years | … | 1.616 | 0.773 | 2.166
6-10 years | … | 1.888 | 0.932 | 2.540
10-15 years | … | 2.103 | 1.041 | 2.832
22. Affairs: modeling a continuous predictor, Age
- Age as a 2nd-order polynomial centred around 17 years (the youngest person in the survey; also the lowest rate)
- To get the mean rate as it changes with age: $\exp(\beta_0 + 0.149(\text{Age} - 17) - 0.003(\text{Age} - 17)^2)$
- To get the IRR for a person aged 33 compared to 17: $\exp(0.149 \times 16 - 0.003 \times 16^2)$ (drop the constant!)
- Easiest interpreted as graphs!
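The age comparison can be computed directly from the two polynomial coefficients quoted on the slide. A minimal sketch: the coefficients 0.149 and -0.003 are the rounded values shown in the slide's IRR expression, and the function name is hypothetical.

```python
import math

# Rounded polynomial coefficients as quoted on the slide (log scale).
B_LINEAR = 0.149
B_QUADRATIC = -0.003
CENTRE_AGE = 17  # age is centred on the youngest respondent


def irr_vs_age17(age):
    """IRR for someone of the given age relative to a 17-year-old.

    The constant is dropped because it cancels in the ratio of two rates.
    """
    d = age - CENTRE_AGE
    return math.exp(B_LINEAR * d + B_QUADRATIC * d ** 2)


for age in (17, 25, 33, 41, 49, 57):
    print(f"age {age}: IRR vs 17 = {irr_vs_age17(age):.2f}")
```

With these rounded coefficients a 33-year-old has roughly five times the rate of a 17-year-old; because the coefficients are rounded, the exact values on the slide's graph may differ slightly.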
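23. Affairs: a SET of predictors & models

Term | Poisson | SE | Extra-Poisson SE | NBD | SE
Fixed: Cons | -1.33 | 0.20 | 0.53 | -1.56 | 0.51
Years Married: 4-5 years | 0.89 | 0.13 | 0.34 | 1.20 | 0.36
Years Married: 6-10 years | 1.00 | 0.14 | 0.38 | 1.01 | 0.41
Years Married: 10-15 years | 1.31 | 0.15 | 0.40 | 1.42 | 0.42
Religious: Somewhat | -0.00 | … | 0.39 | 0.06 | 0.37
Religious: Slightly | 0.95 | … | … | 1.28 | …
Religious: Non | 0.87 | … | … | 1.16 | …
Religious: Anti | 1.36 | 0.16 | … | 1.52 | 0.47
Children: WithKids | -0.06 | 0.11 | 0.28 | 0.25 | 0.29

- Age rows (surviving values, column alignment lost): (age-17)^1: 0.05, 0.02, 0.01; (age-17)^2: 0.00
- Random part: Var 6.78; NBD var (v) 5.15
- Notice substantial overdispersion
- Poisson & extra-Poisson: no change in estimates; some change to NBD
- Notice larger SEs when allowing for overdispersion; NBD most conservative
- In the full model, WithKids is not significant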
24. NBD model for Marital Affairs
- IRR of 1 for under 4 years married, very religious, no children, aged 17
- Previous Age effect is really length of marriage
- Used comparable vertical axes, range of 4

25. NBD model for number of Extra-Marital Affairs
- With 95% confidence intervals
- NB that they are asymmetric on the unlogged scale
26. Affairs: Evaluating a sequence of models using DIC
- The likelihood and hence the deviance are not available for Poisson and NBD models fitted by quasi-likelihood
- DIC criterion available through MCMC; typically needs a larger number of iterations than Normal & Binomial (suggested default is 50k not 5k)

Terms | Null | + Relig | + YrsMar | + Kids | + Age2
Cons | 0.375 | -0.139 | -1.131 | -1.135 | -1.352
Slightly | | 0.818 | 1.031 | 1.040 | 0.963
Somewhat | | -0.017 | 0.050 | 0.056 | 0.005
Anti | | 1.086 | 1.406 | 1.412 | 1.367
Not | | 0.641 | 0.917 | 0.925 | 0.881
4-5 years | | | 0.918 | 0.943 | 0.882
6-10 years | | | 1.077 | 1.100 | 0.994
10-15 years | | | 1.265 | 1.290 | 1.303
WithKids | | | | -0.029 | -0.051
(age-17)^1 | | | | | 0.048
(age-17)^2 | | | | | -0.001
DIC | 3421.5 | 3296.8 | 3065.8 | 3068.0 | 3052.5
∆DIC (vs previous model) | | -124.7 | -231.0 | +2.2 | -15.5
Pd | 1 | 5 | 8 | 9 | 11

- Currently MCMC is not available in MLwiN for over-dispersed Poisson nor NBD models, so have to use Wald tests in the Intervals and tests window
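27. Titanic survivor data: Taking account of exposure
- So far, the response is the observed count; now we want to model a count given exposure. EG only 1 high-class female child survived, but only 1 was exposed!
- Here 2 possible measures of exposure:
  a) the number of potential cases; could use a binomial
  b) the expected number if everyone had the same exposure (i indexes the cell)
- Death rate = Total deaths / Total exposed = 817/1316 = 0.621, so the survival rate is 0.379
- $\text{Exp}_i = \text{Cases}_i \times \text{Survival rate}$
- The latter is often used to treat the exposure as a nuisance parameter & allows calculation of Standardised Rates: $\text{SR}_i = (\text{Obs}_i / \text{Exp}_i) \times 100$

Survive | Case | Age | Gender | Class | Exp | SR
14 | 168 | Adult | Men | Mid | 63.7 | 22.0
75 | 462 | Adult | Men | Low | 175.2 | 42.8
13 | 48 | Child | Men | Low | 18.2 | 71.4
57 | 175 | Adult | Men | High | 66.4 | 85.9
14 | 31 | Child | Women | Low | 11.8 | 119.1
76 | 165 | Adult | Women | Low | 62.6 | 121.5
80 | 93 | Adult | Women | Mid | 35.3 | 226.9
140 | 144 | Adult | Women | High | 54.6 | 256.4
13 | 13 | Child | Women | Mid | 4.93 | 263.7
1 | 1 | Child | Women | High | 0.38 | 263.7
11 | 11 | Child | Men | Mid | 4.17 | 263.7
5 | 5 | Child | Men | High | 1.89 | 263.7

- Previous examples:
  - Horse-kick: exposure removed by design: 200 corps-years
  - Affairs: included length of marriage; theoretically interesting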
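The expected counts and standardised rates on this slide are simple arithmetic once the overall survival rate is known. The sketch below is an illustration under a stated assumption: the cell counts and, in particular, the cell labels are read from the slide's flattened table, so the row labels should be treated as a reconstruction rather than the author's exact layout.

```python
# (survivors, cases, age, gender, class); labels are a reconstruction (assumption).
cells = [
    (14, 168, "Adult", "Men", "Mid"),
    (75, 462, "Adult", "Men", "Low"),
    (13, 48, "Child", "Men", "Low"),
    (57, 175, "Adult", "Men", "High"),
    (14, 31, "Child", "Women", "Low"),
    (76, 165, "Adult", "Women", "Low"),
    (80, 93, "Adult", "Women", "Mid"),
    (140, 144, "Adult", "Women", "High"),
    (13, 13, "Child", "Women", "Mid"),
    (1, 1, "Child", "Women", "High"),
    (11, 11, "Child", "Men", "Mid"),
    (5, 5, "Child", "Men", "High"),
]

total_cases = sum(cases for _, cases, *_ in cells)
total_survived = sum(survived for survived, *_ in cells)
survival_rate = total_survived / total_cases  # same exposure for everyone

rows = []
for survived, cases, age, gender, cls in cells:
    expected = cases * survival_rate      # Exp_i = Cases_i * survival rate
    sr = 100 * survived / expected        # SR_i = Obs_i / Exp_i * 100
    rows.append((age, gender, cls, expected, sr))
    print(f"{age:5} {gender:5} {cls:4} Exp = {expected:6.2f}  SR = {sr:6.1f}")
```

An SR of 100 means a cell survived at exactly the overall rate; cells where everyone survived all share the same maximum SR because it depends only on the overall survival rate.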
28. Modeling SRs: the use of the OFFSET
- Model: $\text{SR}_i = \text{Obs}_i / \text{Exp}_i = F(\text{Age}_i, \text{Gender}_i, \text{Class}_i)$, where i is a cell, a group with the same characteristics
- Aim: are observed survivors greater or less than expected, and how are these ‘differences’ related to a set of predictor variables?
- As a non-linear model: $E(\text{SR}_i) = E(\text{Obs}_i / \text{Exp}_i) = \exp(\beta_0 + \beta_1 x_{1i} + \dots)$
- As a linear model (division of raw data is subtraction of a log): $\log_e(\text{Obs}_i) - \log_e(\text{Exp}_i) = \beta_0 + \beta_1 x_{1i} + \dots$
- As a model with an offset, moving $\log_e(\text{Exp}_i)$ to the right-hand side and constraining its coefficient to be 1, ie Exp becomes a predictor variable: $\log_e(\text{Obs}_i) = 1.0 \times \log_e(\text{Exp}_i) + \beta_0 + \beta_1 x_{1i} + \dots$
- NB MLwiN automatically $\log_e$ transforms the observed response; you have to create the $\log_e$ of the expected and declare it as an offset
- (Photo: Sir John Nelder)
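To make the role of the offset concrete, here is a minimal sketch of a Poisson regression with an offset, fitted by Newton-Raphson in plain Python. It is not MLwiN's quasi-likelihood procedure; it is a maximum-likelihood illustration with one binary predictor (woman = 1), using observed and expected survivor counts for what are taken to be the six adult cells of the Titanic table (an assumption about the cell labels). The function name is hypothetical.

```python
import math

# (observed survivors, expected survivors, woman indicator) for six adult cells.
cells = [(14, 63.7, 0), (75, 175.2, 0), (57, 66.4, 0),
         (76, 62.6, 1), (80, 35.3, 1), (140, 54.6, 1)]


def fit_poisson_offset(cells, n_iter=25):
    """Fit log E(obs) = 1.0 * log(exp) + b0 + b1 * x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for obs, expected, x in cells:
            # The offset enters with its coefficient fixed at 1 (never estimated).
            mu = math.exp(math.log(expected) + b0 + b1 * x)
            g0 += obs - mu            # score for b0
            g1 += (obs - mu) * x      # score for b1
            h00 += mu                 # information matrix entries
            h01 += mu * x
            h11 += mu * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


b0, b1 = fit_poisson_offset(cells)
sr_men = math.exp(b0)          # modeled SR (as a ratio) for adult men
sr_women = math.exp(b0 + b1)   # modeled SR (as a ratio) for adult women
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"modeled SR: men {sr_men:.2f}, women {sr_women:.2f}, ratio {math.exp(b1):.2f}")
```

With a single binary predictor the fitted SR for each group is just total observed over total expected in that group, which gives a handy check that the offset has been handled correctly.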
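29. Surviving on the Titanic as a log-linear model

Age | Gender | Class | Log estimate | Modeled SR | SR = Obs/Exp
Child | Women | Low | 0.17 | 1.19 | 1.19
Child | Women | Mid | 0.97 | 2.64 | 2.64
Child | Women | High | 0.97 | 2.64 | 2.64
Child | Men | Low | -0.34 | 0.71 | 0.71
Child | Men | Mid | 0.97 | 2.64 | 2.64
Child | Men | High | 0.97 | 2.64 | 2.64
Adult | Women | Low | 0.19 | 1.21 | 1.21
Adult | Women | Mid | 0.82 | 2.27 | 2.27
Adult | Women | High | 0.94 | 2.56 | 2.56
Adult | Men | Low | -0.85 | 0.43 | 0.43
Adult | Men | Mid | -1.52 | 0.22 | 0.22
Adult | Men | High | -0.15 | 0.86 | 0.86

- Include the offset
- As a saturated model, ie Age*Gender*Class (2*2*3): 12 terms for 12 cells
- Make predictions on the $\log_e$ scale (must include the constant); exponentiate the sum of the terms to get departures from the expected rate, that is, the modeled SRs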
30. Titanic survival: parsimonious model
- Remove insignificant terms, starting with the 3-way interaction High*Women*Children
- Customized predictions: very low rates of survival for Low and Middle class adult men; large gender gap for adults, but not for children
31. Titanic survival: parsimonious model
- Modeled SRs and descriptive SRs
- Ordered by worst survival
- Estimated SRs only shown if 95% CIs do not include 1.0
- Table columns: Age, Gender, Class, Est SR, SR. First rows: Adult Men Mid: 0.22; Adult Men Low: 0.42, 0.43. Remaining values (row alignment lost): 1.21, 1.71, 2.64, 2.26, 2.27, 2.56, 2.57, 2.62, 2.68*, 1.19, 0.71, 0.86
32. Two-level multilevel Poisson
- One new term: the level-2 differential $u_{0j}$, on the $\log_e$ scale, is assumed to come from a Normal distribution with a variance of $\sigma^2_{u0}$
- Can also fit a multilevel Poisson with offset and a multilevel NBD in MLwiN
33. Estimation of multilevel Poisson and NBD in MLwiN I
- Same options as for binary and binomial
- Quasi-likelihood, and therefore MQL or PQL fitted using IGLS/RIGLS; fast, but no deviance (have to use Wald tests); may be troubled by a small number of higher-level units; simulations have shown that MQL tends to overestimate the higher-level variance parameters
- MCMC estimates: good quality and can use DIC to compare Poisson models; but currently MCMC is not possible for extra-Poisson nor for NBD
- MCMC in MLwiN often produces highly correlated chains (in part because the parameters of the model are highly correlated; variance = mean). It therefore requires a substantial number of simulations, typically much larger than for Normal or Binomial
34. Estimation of multilevel Poisson and NBD in MLwiN II
- Possibility to output to WinBUGS and use the univariate AR sampler and the Gamerman (1997) method, which tends to have less correlated chains, but WinBUGS is generally considerably slower [Gamerman, D. (1997) Sampling from the posterior distribution in generalized linear mixed models. Statistics and Computing 7, 57-68]
- Advice: start with IGLS PQL; switch to MCMC and be prepared to make 500,000 simulations (suggest using 1 in 10 thinning to store the chains); use the effective sample size (ESS) to assess the required length of chain, eg need an ESS of at least 500 for key parameters of interest; compare results and contemplate using PQL and over-dispersed Poisson
- Freely available software MIXPREG for multilevel Poisson counts including offsets; uses full information maximum likelihood estimated using quadrature
35. VPC for Poisson models
- Can either use the simulation method to derive the VPC (modify the binomial procedure)
- Or use the exact method (Henrik Stryhn)
- VPC for two-level random intercepts model (available for other models)
- Clearly the VPC depends on both the fixed part ($\beta_0$) and the level-2 variance ($\sigma^2_{u0}$)
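The slide's formula did not survive in the transcript, so the following is a hedged sketch rather than the presenter's exact expression. It assumes a two-level random-intercepts Poisson model with log link and Normal level-2 effects, and derives the VPC from standard lognormal moments: the between-group variance of the rate divided by that variance plus the average Poisson (level-1) variance. The function name and the example parameter values are illustrative assumptions.

```python
import math


def poisson_vpc(beta0, sigma2_u):
    """VPC for log(rate_j) = beta0 + u_j with u_j ~ N(0, sigma2_u).

    Level-2 variance of the rate: exp(2*b0 + 2*s2) - exp(2*b0 + s2).
    Level-1 variance: the expected Poisson variance, exp(b0 + s2 / 2).
    """
    expected_rate = math.exp(beta0 + sigma2_u / 2)
    between = math.exp(2 * beta0 + 2 * sigma2_u) - math.exp(2 * beta0 + sigma2_u)
    within = expected_rate  # Poisson: variance equals the mean
    return between / (between + within)


for beta0 in (-2.0, 0.0, 2.0):
    for sigma2_u in (0.1, 0.5, 1.0):
        vpc = poisson_vpc(beta0, sigma2_u)
        print(f"beta0 = {beta0:4.1f}, sigma2_u = {sigma2_u:3.1f}: VPC = {vpc:.3f}")
```

Unlike the Normal-response case, the same level-2 variance gives a larger VPC when the underlying rate is higher, which is why the fixed part has to be specified when reporting it.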
36. Modeling Counts in MLwiN: HIV in India
- Aim: investigate the State geography of HIV in terms of risk
- Data: nationally representative sample of 100k individuals
- Response: HIV sero-status from blood samples
- Structure: 1720 cells within 28 States; cells are groups of people who share common characteristics [Age-Groups (4), Education (4), Sex (2), Urbanity (2) and State (28)]
- Rarity: only 467 sero-positives were found
- Model: log count of the number of seropositives in a cell, related to an offset of the log expected count if national rates applied
- Predictors: Age, Sex, Education and Urbanity
- Two-level multilevel Poisson, extra-Poisson & NBD

37. HIV in India: Standardized Morbidity Rates
- Higher educated females have the lowest risk, across the age-groups

38. HIV in India: some results
- Risks for different States relative to living in urban and rural areas nationally
39. Modeling proportions as a binomial in MLwiN
- Exactly the same procedure as for binary models, except that the observed y is a proportion (not just 1 and 0), the denominator (n) is variable (not just 1), and extra-dispersion at level 1 is allowed (not just exact binomial)
- Reading: Subramanian S V, Duncan C, Jones K (2001) Multilevel perspectives on modeling census data, Environment and Planning A 33(3) 399-417
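40. Data: teenage employment in Glasgow districts
- “Ungrouped” data, that is individual data
- Model a binary outcome of employed or not and two individual predictors
- Table columns: Name, Person, District, Employed, Qualif, Sex (example values: Employed Yes/No; Qualif Low/High; Sex Male/Fem); example rows for Craig (person 1), Subra (2), Nina (3), Min (4), Myles (5), Sarah, Kat (13), Colin (14), Andy (15)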
41. Same data as a multilevel structure: a set of tables for each district

Postcode | Qualif | Male | Female | UnErate
G1A | LOW | 5 out of 6 | 3 out of … | …%
G1A | HIGH | 2 out of 7 | 7 out of 9 |
G1B | LOW | 5 out of 9 | 7 out of 11 | …%
G1B | HIGH | 8 out of 8 | 7 out of 9 |
G99Z | LOW | 3 out of … | … | …%
G99Z | HIGH | 2 out of … | 3 out of 5 |

- Level 1: cell in table
- Level 2: postcode sector
- Margins: define the two categorical predictors (gender and qualification)
- Internal cells: the response, eg 5 out of 6 are employed

42. Teenage unemployment: some results from a binomial, two-level logit model
43. Spatial models as a combination of strict hierarchy and multiple membership
- Counts are commonly used
- Multiple membership defined by common boundary; weights as a function of inverse distance between centroids of areas
- (Figure: map of areas labelled A to M)
- Person in A: affected by A (SH) and B, C, D (MM)
- Person in H: affected by H (SH) and E, I, K, G (MM)
44. Scottish Lip Cancer: spatial multiple-membership model
- Response: observed counts of male lip cancer for the 56 regions of Scotland
- Predictor: % of workforce working in outdoor occupations (agriculture, forestry, fishing)
- Expected count based on population size
- Structure: areas and their neighbours, defined as having a common border (up to 11); equal weights for each neighbouring region that sum to 1
- Rate of lip cancer in each region is affected by both the region itself and its nearest neighbours, after taking account of outdoor activity
- Model: log of the response related to the fixed predictor, with an offset; Poisson distribution for counts
- NB two sets of random effects:
  1. area random effects (ie unstructured; non-spatial variation)
  2. multiple-membership set of random effects for the neighbours of each region
46. Scottish Lip Cancer: CAR model
- CAR: one set of random effects, which have an expected value of the average of the surrounding random effects; weights divided by the number of neighbours
- $u_i \mid u_{-i} \sim N(\bar{u}_i, \sigma^2_u / n_i)$, with $\bar{u}_i = \sum_{j \in \text{neigh}(i)} w_{ij} u_j / n_i$, where $n_i$ is the number of neighbours for area i and the weights are typically all 1
- MLwiN: limited capabilities for the CAR model, ie at one level only (unlike BUGS)
47. MCMC estimation: CAR model, 50,000 draws
- Poisson model
- Fixed effects: offset and predictor
- Well-supported positive relation
- Well-supported residual neighbourhood effect
49. Ohio cancer: repeated measures (space and time!)
- Response: counts of respiratory cancer deaths in Ohio counties
- Aim: are there hotspot counties with distinctive trends? (small numbers, so ‘borrow strength’ from neighbours)
- Structure: annual repeated measures for counties
  - Classification 3: n’hoods as MM (3-8 n’hoods)
  - Classification 2: counties (88)
  - Classification 1: occasion (88*10)
- Predictors: expected deaths; time
- Model: log of the response related to the fixed predictor, with an offset; Poisson distribution for counts (C1)
- Two sets of random effects:
  1. area random effects allowed to vary over time; trend for each county from the Ohio distribution (C2)
  2. multiple-membership set of random effects for the neighbours of each region (C3)
50. MCMC estimation: repeated measures model, 50,000 draws
- General trend
- N’hood variance
- Variance function for between-county time trend
- Default priors

51. Respiratory cancer trends in Ohio: raw and modelled
- Red: County 41 in 1988; SMR = 77/49 = 1.57
- Blue: County 80 in 1988; SMR = 6/19 = 0.32
52. General References on Modeling Counts
- Agresti, A. (2002) Categorical Data Analysis (2nd ed). New York: Wiley.
- Cameron, A.C. and Trivedi, P.K. (1998) Regression Analysis of Count Data. Cambridge University Press.
- Hilbe, J.M. (2007) Negative Binomial Regression. Cambridge University Press.
- McCullagh, P. and Nelder, J. (1989) Generalized Linear Models, Second Edition. Chapman & Hall/CRC.

On spatial models
- Browne, W.J. (2003) MCMC Estimation in MLwiN; Chapter 16: Spatial models
- Lawson, A.B., Browne, W.J. and Vidal Rodeiro, C.L. (2003) Disease Mapping with WinBUGS and MLwiN. Wiley, London (Chapter 8: GWR)