Published by Gabriel Corcoran. Modified over 2 years ago.

1
Modelling Count Data: Outline
Characteristics of count data and the Poisson distribution
Applying the Poisson: flying bomb strikes in South London
Deaths by horse-kick: as a single-level model; Poisson model fitted in MLwiN
Overdispersion: types and consequences, the unconstrained Poisson, the Negative Binomial
Taking stock: 4 distributions for modeling counts
Number of extramarital affairs: the incidence rate ratio (IRR); handling categorical and continuous predictors; comparing models with DIC
Titanic survivor data: taking account of exposure, the offset
Multilevel Poisson and NBD models: estimation and VPC
Applications: HIV in India and teenage employment in Glasgow
Spatial models: lip cancer in Scotland; respiratory cancer in Ohio

2
Some characteristics of count data
Very common in the social sciences:
  Number of children; number of marriages
  Number of arrests; number of traffic accidents
  Number of flows; number of deaths
Counts have particular characteristics:
  Integers that cannot be negative
  Often positively skewed; a floor of zero
  In practice often rare events, which peak at 1, 2 or 3 and are rare at higher values
Modelled by:
  Logit regression, which models the log odds of an underlying propensity of an outcome
  Poisson regression, which models the log of the underlying rate of occurrence of a count

3
Theoretical Poisson distribution I
The Poisson distribution results if the underlying number of random events per unit of time or space has a constant mean rate of occurrence ($\lambda$), and each event is independent.
Simeon-Denis Poisson (1838) Research on the Probability of Judgments in Criminal and Civil Matters
Applying the Poisson: flying bomb strikes in South London
Key research question: were the bombs falling at random or under a guidance system?
If random, independent events should be distributed spatially as a Poisson distribution.
Divide South London into 576 equally sized small areas (0.24 km$^2$).
Count the number of bombs in each area and compare to a Poisson.
Mean rate = $\lambda$ = [229(0) + 211(1) + 93(2) + 35(3) + 7(4) + 1(5)]/576 = 535/576 = 0.929 hits per unit square
Observed versus Poisson frequencies: very close fit; concluded random.
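The comparison of observed and Poisson frequencies can be sketched as follows. The observed frequencies are those in the mean-rate formula on the slide; treating the last category as exactly 5 hits (rather than 5 or more) follows that formula and is an assumption of this sketch.

```python
import math

# Number of areas with 0, 1, 2, 3, 4, 5 hits (frequencies from the slide's formula)
observed = {0: 229, 1: 211, 2: 93, 3: 35, 4: 7, 5: 1}

n_areas = sum(observed.values())                       # 576 areas
total_hits = sum(k * f for k, f in observed.items())   # 535 hits
mean_rate = total_hits / n_areas                       # hits per area


def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean rate is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Expected number of areas with exactly k hits under a Poisson with this mean
expected = {k: n_areas * poisson_pmf(k, mean_rate) for k in observed}

for k in observed:
    print(f"{k} hits: observed {observed[k]:3d}, Poisson {expected[k]:6.1f}")
```

If the strikes are random, the two columns printed should be close, which is the slide's conclusion.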

4
Theoretical Poisson distribution II
Probability mass function (PMF) for 3 different mean occurrences.
When mean = 1: very positively skewed.
As the mean occurrence increases (a more common event), the distribution approaches the Gaussian; so use the Poisson for rareish events, with a mean below 10.
Fundamental property of the Poisson: mean = variance.
Simulated 10,000 observations according to a Poisson (table of mean, variance and skewness).
Variance is not a freely estimated parameter as in the Gaussian.
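The mean = variance property, and the loss of skewness as the mean grows, can be checked directly from the PMF instead of by simulation. The three means used here (1, 5 and 10) are illustrative assumptions; the slide does not preserve which values were plotted.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean rate is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_moments(lam, k_max=100):
    """Mean, variance and skewness computed from the PMF truncated at k_max."""
    ks = range(k_max + 1)
    probs = [poisson_pmf(k, lam) for k in ks]
    mean = sum(k * p for k, p in zip(ks, probs))
    var = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))
    skew = sum((k - mean) ** 3 * p for k, p in zip(ks, probs)) / var ** 1.5
    return mean, var, skew


for lam in (1, 5, 10):  # hypothetical means, chosen for illustration
    mean, var, skew = poisson_moments(lam)
    print(f"lambda={lam}: mean={mean:.3f} variance={var:.3f} skewness={skew:.3f}")
```

The skewness of a Poisson is $1/\sqrt{\lambda}$, so it shrinks toward the Gaussian value of 0 as the mean increases.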

5
Death by Horse-kick I: the data
Bortkewicz, L. von (1898) The Law of Small Numbers, Leipzig
Number of soldiers killed annually by horse-kicks in the Prussian cavalry; 10 corps over 20 years (occurrences per unit time).
The full data: 200 corps-years of observations.
As a frequency distribution (grouped data): deaths and frequency, with the mean, variance and number of observations.
Interpretation: mean rate of 0.61 deaths per corps-year (i.e. rare).
Mean equals variance, therefore a Poisson distribution.
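A minimal sketch of the mean and variance calculation from grouped data. The frequency table did not survive on the slide, so the frequencies below are the commonly quoted published values for the 200 corps-years; treat them as an assumption of this example.

```python
# Deaths per corps-year -> number of corps-years with that many deaths.
# ASSUMPTION: commonly quoted published frequencies, not taken from the slide.
frequency = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1}

n_obs = sum(frequency.values())                          # 200 corps-years
total_deaths = sum(k * f for k, f in frequency.items())  # 122 deaths
mean = total_deaths / n_obs

# Sample variance (divisor n - 1) computed from the grouped data
variance = sum(f * (k - mean) ** 2 for k, f in frequency.items()) / (n_obs - 1)

print(f"n = {n_obs}, mean = {mean:.2f}, variance = {variance:.2f}")
```

With these frequencies the mean and the variance agree to two decimal places, which is the equi-dispersion the slide points to.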

6
Death by Horse-kick II: as a Poisson
With a mean (and therefore a variance) of 0.61: table of deaths, observed frequency and theoretical frequency.
Again: the Poisson results if the underlying number of random events per unit of time or space has a constant mean rate of occurrence ($\lambda$), and each event is independent.
Formula for the Poisson PMF: $P(x) = \dfrac{e^{-\lambda}\lambda^{x}}{x!}$
  e is the base of the natural logarithm (2.7182...)
  $\lambda$ is the mean (shape parameter): the average number of events in a given time interval
  x! is the factorial of x
E.g. with a mean rate of 0.6 accidents per corps-year, what is the probability of getting 3 accidents in a corps in a year?
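The worked example can be evaluated directly from the PMF. This is a small sketch using the slide's figures (mean rate 0.6, 3 events); the function name is my own.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson with mean rate lam: exp(-lam) * lam**x / x!"""
    return math.exp(-lam) * lam ** x / math.factorial(x)


# Probability of exactly 3 deaths in a corps-year when the mean rate is 0.6
p3 = poisson_pmf(3, 0.6)
print(f"P(X = 3 | lambda = 0.6) = {p3:.4f}")
```

The answer is about 0.02, so roughly 2 corps-years in every 100 would be expected to see 3 deaths.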

7
Horse-kick III: as a single-level Poisson model
General form of the single-level model:
  The observed count is distributed as an underlying Poisson with a mean rate of occurrence of $\lambda_i$; that is, as an underlying mean plus a level-1 random term multiplied by $z_0$ (the Poisson weight).
  The mean rate is related to the predictors non-linearly, as an exponential relationship.
  Model the log$_e$ of the mean rate to get a linear model (log link).
  The Poisson weight is the square root of the estimated underlying count, re-estimated at each iteration.
  The variance of the level-1 residuals is constrained to 1.
Because the modeling is on the log$_e$ scale, the model cannot predict a negative count on the raw scale.
The level-1 variance is constrained to be an exact Poisson (variance = mean).
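The equations on this slide did not survive, so the following is a reconstruction from the description above, written in a quasi-likelihood style; the symbols $\lambda_i$, $e_{0i}$, $z_{0i}$ and $\beta$ are assumptions and may differ from the notation shown in MLwiN.

```latex
\begin{align*}
y_i &\sim \text{Poisson}(\lambda_i) \\
y_i &= \lambda_i + e_{0i}\, z_{0i}, \qquad z_{0i} = \sqrt{\hat{\lambda}_i}, \qquad \operatorname{var}(e_{0i}) = 1 \\
\lambda_i &= \exp(\beta_0 + \beta_1 x_{1i}) \\
\log_e(\lambda_i) &= \beta_0 + \beta_1 x_{1i}
\end{align*}
```

Under these definitions $\operatorname{var}(y_i \mid \lambda_i) = z_{0i}^2 \operatorname{var}(e_{0i}) = \lambda_i$, which is the variance = mean constraint of the exact Poisson.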

8
Horse-kick IV: null single-level Poisson model in MLwiN
The raw ungrouped counts are modeled with a log link and a variance constrained to be equal to the mean.
$\beta_0$ is the mean rate of occurrence on the log scale.
Exponentiate $\beta_0$ to get the mean rate of 0.61.
This is interpreted as a RATE: the number of events per unit of time (or space), i.e. 0.61 horse-kick deaths per corps-year.

9
Overdispersion I: types and consequences
So far: equi-dispersion, variance equal to the mean.
Overdispersion: variance > mean; long tail, e.g. LOS (common).
Under-dispersion: variance < mean; data more alike than a pure Poisson process; in multilevel models, the possibility of a missing level.
Consequences of overdispersion:
  Fixed part SEs: "point estimates are accurate but they are not as precise as you think they are".
  In multilevel models, the higher-level random part is mis-estimated.
Apparent and true overdispersion, a thought experiment: the number of extra-marital affairs, with men and women having different means.
  Who            Mean    Comment
  Men & Women            Overdispersed
  Men            1.00    Poisson
  Women          0.10    Poisson
  apparent: mis-specified fixed part; distributions with different means have not been separated out.
  true: a genuine stochastic property of more inherent variability.
In practice: model the fixed part as well as possible, and allow for overdispersion.
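The thought experiment can be worked through with the law of total variance. This sketch assumes equal numbers of men and women (the slide does not give the mix) and that each group is exactly Poisson with the means shown.

```python
# Group means from the thought experiment; each group is Poisson, so variance = mean
group_means = {"men": 1.00, "women": 0.10}
# ASSUMPTION: equal shares of men and women in the pooled sample
shares = {"men": 0.5, "women": 0.5}

pooled_mean = sum(shares[g] * group_means[g] for g in group_means)

# Law of total variance: average within-group variance + variance of group means
within = sum(shares[g] * group_means[g] for g in group_means)
between = sum(shares[g] * (group_means[g] - pooled_mean) ** 2 for g in group_means)
pooled_variance = within + between

print(f"pooled mean     = {pooled_mean:.4f}")
print(f"pooled variance = {pooled_variance:.4f}")
print(f"variance / mean = {pooled_variance / pooled_mean:.3f}")
```

The pooled data look overdispersed only because the fixed part omits gender; within each group the variance still equals the mean.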

10
Overdispersion II: the unconstrained Poisson
Deaths by horse-kick: estimate an over-dispersed Poisson by allowing the level-1 variance to be estimated.
The estimated level-1 variance is not significantly different from 1; no evidence that this is not a Poisson distribution.

11
Overdispersion III: the Negative Binomial
Instead of fitting an overdispersed Poisson, could fit an NBD model.
Handles long-tailed distributions.
An explicit model in which the variance is greater than the mean.
Can even have an over-dispersed NBD.
Same log link, but the NBD has 2 parameters for the level-1 variance; that is, a quadratic level-1 variance, where v is the overdispersion parameter.
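The "quadratic level-1 variance" can be illustrated with a small function. The exact variance equation is missing from the slide, so this sketch assumes the common parameterisation variance = mean + v × mean², which reduces to the Poisson when v = 0; MLwiN's own parameterisation may be written differently.

```python
def nbd_variance(mean, v):
    """Quadratic variance function (ASSUMED form): mean + v * mean**2."""
    return mean + v * mean ** 2


def poisson_variance(mean):
    """Poisson variance function: variance equals the mean."""
    return mean


for mean in (0.61, 2.0, 5.0):          # illustrative means
    for v in (0.0, 0.5):               # illustrative overdispersion values
        print(f"mean={mean:4.2f} v={v:3.1f} "
              f"Poisson var={poisson_variance(mean):5.2f} "
              f"NBD var={nbd_variance(mean, v):6.2f}")
```

Under this form the excess variance grows with the square of the mean, which is why the NBD suits long-tailed counts.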

12
Overdispersion IV: the Negative Binomial
Horse-kick analysis, null single-level NBD model: essentially no change; v is estimated to be 0.00 (compare using the stored models).
Overdispersed negative binomial: no evidence of overdispersion; deaths are independent.

13
Linking the Binomial and the Poisson: I
First, the Bernoulli and the Binomial.
Bernoulli is a distribution for binary discrete events:
  y is the observed outcome, i.e. 1 or 0
  E(y) = $\pi$, the underlying propensity/probability of occurrence
  Var(y) = $\pi(1-\pi)$
  Mean    Variance
  0.01    0.01 * 0.99 = 0.0099
  0.50    0.5 * 0.5 = 0.25
  0.99    0.99 * 0.01 = 0.0099
Binomial is a distribution for discrete events out of a number of trials:
  y is the observed outcome (as a proportion); n is the number of trials
  E(y) = $\pi$, the underlying propensity of occurrence
  Var(y) = $[\pi(1-\pi)]/n$
Least variation when the denominator is large (more reliable), and as the underlying probability approaches 0 or 1.

14
Linking the Binomial and the Poisson: II
The Poisson is the limit of a binomial process in which prob $\to 0$ and n $\to \infty$.
The Poisson describes the probability that a random event will occur in a time or space interval when the probability is very small, the number of trials is very large, and each event is independent.
E.g. the probability that any automobile is involved in an accident is very small, but there are very many cars on a road in a day, so that the distribution (if each crash is independent) follows a Poisson count.
If crashes are not independent (a pile-up), then use an over-dispersed Poisson or the NBD; the latter is used for contagious processes.
In practice, the Poisson and NBD are used for rare occurrences: fewer than 10 cases per interval, and hundreds or even thousands for the denominator/trials [Clayton & Hills (1993) Statistical Models in Epidemiology, OUP].
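The limiting relationship can be checked numerically by comparing a binomial with many trials and a small probability against a Poisson with the same mean. The values n = 1000 and p = 0.003 are hypothetical, chosen only to illustrate the rare-event case.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean rate is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


n, p = 1000, 0.003      # hypothetical: many trials, rare event
lam = n * p             # matching Poisson mean (3 events per interval)

max_diff = 0.0
for k in range(11):
    b, q = binomial_pmf(k, n, p), poisson_pmf(k, lam)
    max_diff = max(max_diff, abs(b - q))
    print(f"k={k:2d} binomial={b:.5f} poisson={q:.5f}")

print(f"largest absolute difference: {max_diff:.6f}")
```

Repeating the comparison with a smaller n and larger p (keeping n × p fixed) shows the two distributions drifting apart, which is why the binomial is preferred when the event is not rare.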

15
Taking stock: 4 distributions for counts
If the rate of occurrence is common (mean > 10), then use the raw counts and a Gaussian distribution (assess the Normality assumption of the residuals).
If the rate of occurrence is rare, then use an over-dispersed Poisson or the NBD; the unconstrained level-1 variance estimate allows assessment of departure from equi-dispersion. This gives improved SEs, but biased estimates if the overdispersion is apparent, due to model mis-specification.
Use the Binomial distribution if the count is out of some total and the event is not rare; that is, the numerator and denominator are of the same order.
Figure: mean-variance relations for 4 different distributions that could be used for counts.

16
Modeling number of extra-marital affairs: single-level Poisson with single categorical predictor
Extract of raw data (601 individuals from Fair 1978); columns are Affairs, Cons, Children, Religious, YearsMarried and Age. Only part of the extract is legible:
  Affairs  Cons  Children  Religious  YearsMarried  Age
  0        1     NoKids    Slightly   10-15 years
                 NoKids    Somewhat   < 4 years
                 NoKids    Slightly   < 4 years
                 WithKids  Anti       6-10 years
                 WithKids  Slightly   < 4 years
                 WithKids  Very       6-10 years    57.0
Single categorical predictor: Children, with NoKids as the base.
Understanding customized predictions...
Fair, R C (1978) A theory of extramarital affairs, Journal of Political Economy, 86(1)

17
Single-level Poisson with single categorical predictor: understanding customized predictions
Log scale: NoKids: -0.092; WithKids: -0.092 + 0.606 = 0.514
First use the equation to get the underlying log-number of events, then exponentiate to get the estimated count (since married).
As mean/median counts: NoKids: expo(-0.092) = 0.912; WithKids: expo(0.514) = 1.672
Those with children have a higher average rate of affairs (but have they been married longer?)

18
Modeling number of extra-marital affairs: the incidence rate ratio (IRR)
So far, mean counts: NoKids: expo(-0.092) = 0.912; WithKids: expo(-0.092 + 0.606) = 1.672
But also as an IRR, the ratio of the rates for those with and without kids: IRR = 1.672/0.912 = 1.833
That is, WithKids have an 83% higher rate.
BUT can get this directly from the model by exponentiating the estimate for the contrasted category: expo(0.606) = 1.833
Rules:
  a) exponentiating the estimates for the (constant plus the contrasted category) gives the mean rate for the contrasted category
  b) exponentiating the estimate for the contrasted category gives the IRR in comparison to the base category
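Both rules can be verified with the two estimates quoted on the slide (constant -0.092, WithKids contrast 0.606). The variable names are my own.

```python
import math

b_cons = -0.092      # constant: log mean rate for the base category (NoKids)
b_withkids = 0.606   # differential on the log scale for WithKids

# Rule (a): exponentiate constant (+ contrast) to get mean rates
rate_nokids = math.exp(b_cons)
rate_withkids = math.exp(b_cons + b_withkids)

# IRR as a ratio of the two mean rates...
irr_from_rates = rate_withkids / rate_nokids
# ...and rule (b): directly by exponentiating the contrast
irr_direct = math.exp(b_withkids)

print(f"NoKids rate   = {rate_nokids:.3f}")
print(f"WithKids rate = {rate_withkids:.3f}")
print(f"IRR (ratio)   = {irr_from_rates:.3f}")
print(f"IRR (direct)  = {irr_direct:.3f}")
print(f"Percent higher rate = {100 * (irr_direct - 1):.0f}%")
```

The two routes to the IRR agree because the constant cancels in the ratio.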

19
Why is the exponentiated coefficient an IRR?
As always, the estimated coefficient is the change in the response corresponding to a one-unit change in the predictor.
The response is the underlying logged count.
When $X_i$ is 0 (NoKids): (log count | $X_i$ = 0) = $\beta_0$
But when $X_i$ is 1 (WithKids): (log count | $X_i$ = 1) = $\beta_0 + \beta_1$
Subtracting the first equation from the second gives: (log count | $X_i$ = 1) - (log count | $X_i$ = 0) = $\beta_1$
Exponentiating both sides gives (note the division sign): (count | $X_i$ = 1)/(count | $X_i$ = 0) = $\exp(\beta_1)$
Thus $\exp(\beta_1)$ is a rate ratio: the ratio of the mean number of affairs for a with-child person to the mean number of affairs for a without-child person.
Incidence: the number of new cases.
Rate: because it is the number of events per unit of time or space.
Ratio: because it is the ratio of two rates.

20
Modeling number of extra-marital affairs: changing the base category
Previously the contrasted category was WithKids; now the contrasted category is NoKids.
Changing the base simply produces a change of sign on the log$_e$ scale.
Exponentiating the contrasted category: Before: expo(+0.606) = 1.833; Now: expo(-0.606) = 0.546
Doubling the rate on the log$_e$ scale is 0.693; halving the rate on the log$_e$ scale is -0.693.
An IRR below 1 is the reciprocal of a fold increase (for example, 1/9 = 0.11 against a 9-fold increase) and is difficult to appreciate.
Advice: choose the base category to have the lowest mean rate, so that the contrasted estimates are positive; you are then always comparing a larger value to a base of 1.

21
Affairs: modeling a set of categorical predictors
A model with years married included, with < 4 years as the base.
Customised predictions: mean rate, IRR, and graph with 95% CIs.
Table columns: YrsMarried; A: log estimate; mean rate (expo A); B: differential log estimate; IRR (expo B). Rows: < 4 years (base) and the longer years-married categories.

22
Affairs: modeling a continuous predictor
Age as a 2nd-order polynomial centred around 17 years (the youngest person in the survey; also the lowest rate).
To get the mean rate as it changes with age: expo($\beta_0$ + 0.149*(Age-17) - 0.003*(Age-17)$^2$)
To get the IRR for a person aged 33 compared to one aged 17: expo(0.149*16 - 0.003*16$^2$) (drop the constant!)
Easiest interpreted as graphs!
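The age-33 versus age-17 comparison can be computed from the two polynomial coefficients quoted on the slide (0.149 and -0.003). The constant is not needed for an IRR, so it is left out; the function name is my own.

```python
import math

B_AGE = 0.149       # linear term for (Age - 17), from the slide
B_AGE_SQ = -0.003   # quadratic term for (Age - 17)^2, from the slide


def irr_vs_age_17(age):
    """IRR for a person of the given age relative to age 17 (constant dropped)."""
    centred = age - 17
    return math.exp(B_AGE * centred + B_AGE_SQ * centred ** 2)


irr_33 = irr_vs_age_17(33)
print(f"IRR, age 33 vs age 17: {irr_33:.2f}")

# Tracing the curve over the age range shows the shape that the graphs convey
for age in range(17, 58, 8):
    print(age, round(irr_vs_age_17(age), 2))
```

At age 17 the centred terms are zero, so the IRR is exactly 1, the reference value.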

23
Affairs: a SET of predictors & models
Table columns: Term; Poisson (estimate, SE); Extra-Poisson (estimate, SE); NBD (estimate, SE).
Fixed part terms: Cons; Years Married (categories from 4-5 years upward); Religious (Somewhat, Slightly, Non, Anti); Children (WithKids); Age ((age-17)^1, (age-17)^2).
Random part: level-1 variance; NBD variance (v).
Notice the substantial overdispersion.
Poisson and extra-Poisson: no change in estimates; some change with the NBD.
Notice the larger SEs when overdispersion is allowed for; the NBD is the most conservative.
In the full model, WithKids is not significant.

24
NBD model for Marital Affairs
IRR of 1 for under 4 years married, very religious, no children, aged 17.
The previous Age effect is really length of marriage.
Used comparable vertical axes, with a range of 4.

25
NBD model for number of Extra-Marital Affairs
With 95% confidence intervals; NB they are asymmetric on the unlogged scale.

26
Affairs: evaluating a sequence of models using DIC
The likelihood, and hence the deviance, are not available for Poisson and NBD models fitted by quasi-likelihood.
The DIC criterion is available through MCMC; it typically needs a larger number of iterations than Normal and Binomial models (suggested default is 50k, not 5k).
Table columns: Terms; Null; + Relig; + YrsMar; + Kids; + Age. Rows: Cons; Slightly; Somewhat; Anti; Not; the years-married categories; WithKids; (age-17)^1; (age-17)^2; then DIC and pD for each model.
Currently MCMC is not available in MLwiN for over-dispersed Poisson nor NBD models, so have to use Wald tests in the Intervals and tests window.

27
Titanic survivor data: taking account of exposure
So far the response has been the observed count; now we want to model a count given exposure. E.g. only 1 high-class female child survived, but only 1 was exposed!
Table columns: Survive, Case, Age, Gender, Class, Exp, SR. First row: 14 survivors out of 168 cases for Adult Men Mid. The remaining cells (values not legible) are: Adult Men Low; Child Men Low; Adult Men High; Child Women Low; Adult Women Low; Adult Women Mid; Adult Women High; Child Women Mid; Child Women High; Child Men Mid; Child Men High.
Here there are 2 possible measures of exposure:
  a) the number of potential cases; could use a binomial
  b) the expected number if everyone had the same exposure (i indexes the cell)
The latter is often used to treat the exposure as a nuisance parameter, and allows calculation of Standardised Rates: SR$_i$ = Obs$_i$/Exp$_i$ * 100
Death rate = Total deaths/Total exposed = 817/1316 = 0.621, so the overall survival rate is 1 - 0.621 = 0.379
Exp$_i$ = Cases$_i$ * Survival rate
Previous examples:
  Horse-kick: exposure removed by design; 200 corps-years
  Affairs: included length of marriage; theoretically interesting

28
Modeling SRs: the use of the OFFSET (Sir John Nelder)
Model: SR$_i$ = (Obs$_i$/Exp$_i$) = F(Age$_i$, Gender$_i$, Class$_i$), where i is a cell: a group with the same characteristics.
Aim: are the observed survivors greater or fewer than expected, and how are these differences related to a set of predictor variables?
As a non-linear model: E(SR$_i$) = E(Obs$_i$/Exp$_i$) = $\exp(\beta_0 + \beta_1 x_{1i} + \dots)$
As a linear model (division of the raw data is subtraction of logs): log$_e$(Obs$_i$) - log$_e$(Exp$_i$) = $\beta_0 + \beta_1 x_{1i} + \dots$
As a model with an offset, moving log$_e$(Exp$_i$) to the right-hand side and constraining its coefficient to be 1, so that Exp becomes a predictor variable: log$_e$(Obs$_i$) = 1.0 * log$_e$(Exp$_i$) + $\beta_0 + \beta_1 x_{1i} + \dots$
NB MLwiN automatically log$_e$ transforms the observed response; you have to create the log$_e$ of the expected count and declare it as an offset.
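The expected count, the standardised rate and the offset can be sketched for the one cell whose figures are legible in the data table (Adult Men Mid: 14 survivors out of 168 cases), using the totals quoted for the death rate (817 deaths out of 1316). Variable names are my own, and the SR is expressed as a ratio rather than multiplied by 100.

```python
import math

total_exposed = 1316
total_deaths = 817
survival_rate = (total_exposed - total_deaths) / total_exposed  # overall rate

# Adult Men Mid cell, as read from the data table
observed_survivors = 14
cases = 168

# Expected survivors if this cell had the overall survival rate
expected_survivors = cases * survival_rate

# Standardised rate: below 1 means fewer survivors than expected
sr = observed_survivors / expected_survivors

# The offset is the natural log of the expected count; its coefficient is fixed at 1
offset = math.log(expected_survivors)

print(f"survival rate = {survival_rate:.3f}")
print(f"expected      = {expected_survivors:.1f}")
print(f"SR            = {sr:.2f}")
print(f"offset        = {offset:.3f}")

# log(Obs) - offset is the log SR that the linear predictor has to reproduce
log_sr = math.log(observed_survivors) - offset
```

In a saturated model the linear predictor for this cell would equal `log_sr`, so exponentiating it recovers the descriptive SR.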

29
Surviving on the Titanic as a log-linear model
Include the offset.
As a saturated model, i.e. Age*Gender*Class (2*2*3): 12 terms for 12 cells.
Make predictions on the log$_e$ scale (must include the constant); exponentiate all terms to get departures from the expected rate, that is, the modeled SRs.
Table columns: Age, Gender, Class, log estimate, modeled SR, SR = Obs/Exp. Rows: the 12 cells from Child Women Low to Adult Men High.

30
Titanic survival: parsimonious model
Remove insignificant terms, starting with the 3-way interaction for High*Women*Children.
Customized predictions: very low rates of survival for Low and Middle class adult men; a large gender gap for adults, but not for children.

31
Titanic survival: parsimonious model
Modeled SRs and descriptive SRs, ordered from the worst survival.
Estimated SRs are only shown if the 95% CIs do not include 1.0 (cells marked *).
Table columns: Age, Gender, Class, Est SR, SR. Legible values: Adult Men Mid 0.22; Adult Women Low 1.21; Child Women Low* 1.19; Child Men Low* 0.71; Adult Men High* 0.86. The remaining cells, in order, are Adult Men Low; Child Men High; Adult Women Mid; Child Women High; Adult Women High; Child Women Mid; Child Men Mid.

32
Two-level multilevel Poisson
One new term: the level-2 differential, on the log$_e$ scale, which is assumed to come from a Normal distribution with a variance of $\sigma^2_{u0}$.
Can also fit a multilevel Poisson with an offset, and a multilevel NBD, in MLwiN.

33
Estimation of multilevel Poisson and NBD in MLwiN I
Same options as for binary and binomial models.
Quasi-likelihood, and therefore MQL or PQL fitted using IGLS/RIGLS: fast, but no deviance (have to use Wald tests); may be troubled by a small number of higher-level units; simulations have shown that MQL tends to overestimate the higher-level variance parameters.
MCMC estimates: good quality, and can use DIC to compare Poisson models; but currently MCMC is not possible for extra-Poisson nor for NBD models.
MCMC in MLwiN often produces highly correlated chains (in part because the parameters of the model are highly correlated; variance = mean).
It therefore requires a substantial number of simulations, typically many more than for Normal or Binomial models.

34
Estimation of multilevel Poisson and NBD in MLwiN II
Possibility to output to WinBUGS and use the univariate AR sampler and the Gamerman (1997) method, which tends to have less correlated chains; but WinBUGS is generally considerably slower. [Gamerman, D. (1997) Sampling from the posterior distribution in generalized linear mixed models. Statistics and Computing 7]
Advice: start with IGLS PQL; switch to MCMC and be prepared to make 500,000 simulations (suggest using 1-in-10 thinning to store the chains); use the effective sample size (ESS) to assess the required length of chain, e.g. need an ESS of at least 500 for the key parameters of interest; compare the results and contemplate using PQL and an over-dispersed Poisson.
Freely available software: MIXPREG, for multilevel Poisson counts including offsets; it uses full-information maximum likelihood, estimated using quadrature.

35
VPC for Poisson models
Can either:
  use the simulation method to derive the VPC (modify the binomial procedure), or
  use the exact method (Henrik Stryhn): http://people.upei.ca/hstryhn/iccpoisson.ppt
VPC for the two-level random intercepts model (also available for other models).
Clearly the VPC depends on both the fixed part (the mean rate) and the level-2 variance.
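The formula itself is missing from the slide, so the sketch below derives a VPC for a two-level random-intercept Poisson model from log-normal moments: the level-2 variance is the variance of the cluster mean rates, and the level-1 variance is the average Poisson variance. It is offered as an illustration under those assumptions, not as the exact expression in the linked presentation; the parameter values are hypothetical.

```python
import math


def vpc_poisson(linear_predictor, sigma2_u):
    """VPC for y ~ Poisson(exp(linear_predictor + u)), u ~ N(0, sigma2_u).

    Level-2 variance: variance of the log-normal cluster means.
    Level-1 variance: expected Poisson variance, i.e. the marginal mean.
    """
    marginal_mean = math.exp(linear_predictor + sigma2_u / 2)
    level2_var = math.exp(2 * linear_predictor + sigma2_u) * (math.exp(sigma2_u) - 1)
    level1_var = marginal_mean
    return level2_var / (level2_var + level1_var)


# Hypothetical values: a rare and a more common outcome, same level-2 variance
for rate in (0.1, 1.0, 10.0):
    vpc = vpc_poisson(math.log(rate), sigma2_u=0.3)
    print(f"median rate {rate:5.1f}: VPC = {vpc:.3f}")
```

The loop shows the dependence on the fixed part: for a given level-2 variance, the VPC rises as the underlying rate rises.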

36
Modeling Counts in MLwiN: HIV in India
Aim: investigate the State geography of HIV in terms of risk.
Data: nationally representative sample of 100k individuals.
Response: HIV sero-status from blood samples.
Structure: 1720 cells within 28 States; cells are groups of people who share common characteristics [Age-Groups (4), Education (4), Sex (2), Urbanity (2) and State (28)].
Rarity: only 467 sero-positives were found.
Model: log count of the number of sero-positives in a cell, related to an offset of the log expected count if national rates applied.
Predictors: Age, Sex, Education and Urbanity.
Two-level multilevel Poisson, extra-Poisson and NBD.

37
HIV in India: Standardized Morbidity Rates
Higher educated females have the lowest risk, across the age-groups.

38
HIV in India: some results
Risks for different States relative to living in urban and rural areas nationally.

39
Modeling proportions as a binomial in MLwiN
Exactly the same procedure as for binary models, except that the observed y is a proportion (not just 1 and 0), the denominator (n) is variable (not just 1), and extra-dispersion at level 1 is allowed (not just an exact binomial).
Reading: Subramanian S V, Duncan C, Jones K (2001) Multilevel perspectives on modeling census data, Environment and Planning A 33(3) 399-417

40
Data: teenage employment in Glasgow districts
Ungrouped data, that is, individual data.
Model a binary outcome of employed or not, and two individual predictors.
  Name   Person  District  Employed  Qualif  Sex
  Craig  1       1         Yes       Low     Male
  Subra  2       1         Yes       High    Male
  Nina   3       1         Yes       Low     Fem
  Min    4       1         No        Low     Fem
  Myles  5       1         No        High    Male
  Sarah  12      50        Yes       High    Fem
  Kat    13      50        No        Low     Fem
  Colin  14      50        Yes       Low     Male
  Andy   15      50        No        High    Male

41
Same data as a multilevel structure: a set of tables for each district
  Postcode  UnErate  QUALIF  MALE         FEMALE
  G1A       15%      LOW     5 out of 6   3 out of 12
                     HIGH    2 out of 7   7 out of 9
  G1B       12%      LOW     5 out of 9   7 out of 11
                     HIGH    8 out of 8   7 out of 9
  G99Z      3%       LOW     3 out of 3   -
                     HIGH    2 out of 3 out of 5
Level 1: cell in table. Level 2: postcode sector.
Margins: define the two categorical predictors.
Internal cells: the response, e.g. 5 out of 6 are employed.

42
Teenage unemployment: some results from a binomial, two-level logit model

43
Spatial models as a combination of strict hierarchy and multiple membership: counts are commonly used
Multiple membership is defined by a common boundary; weights are a function of the inverse distance between the centroids of areas.
Figure: map of areas A to M.
A person in A is affected by A (SH) and by B, C, D (MM).
A person in H is affected by H (SH) and by E, I, K, G (MM).

44
Scottish Lip Cancer: spatial multiple-membership model
Response: observed counts of male lip cancer for the 56 regions of Scotland.
Predictor: % of the workforce working in outdoor occupations (Agric; For; Fish).
Expected count based on population size.
Structure: areas and their neighbours, defined as having a common border (up to 11); equal weights for each neighbouring region, summing to 1.
Rate of lip cancer in each region is affected by both the region itself and its nearest neighbours, after taking account of outdoor activity.
Model: log of the response related to the fixed predictor, with an offset; Poisson distribution for counts.
NB two sets of random effects:
  1 area random effects (i.e. unstructured; non-spatial variation)
  2 multiple-membership set of random effects for the neighbours of each region
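Equal multiple-membership weights that sum to 1 can be built from a neighbour list. This is a minimal sketch that borrows the two neighbour sets named in the earlier map example (A with B, C, D; H with E, I, K, G); the data structure and function name are my own, not MLwiN's.

```python
# Neighbour lists for two areas, as named in the map example
neighbours = {
    "A": ["B", "C", "D"],
    "H": ["E", "I", "K", "G"],
}


def equal_mm_weights(neighbour_list):
    """Give each neighbour the same weight so that the weights sum to 1."""
    n = len(neighbour_list)
    return {area: 1.0 / n for area in neighbour_list}


weights = {area: equal_mm_weights(nbrs) for area, nbrs in neighbours.items()}

for area, w in weights.items():
    print(area, w, "sum =", round(sum(w.values()), 10))
```

Inverse-distance weights, as mentioned for the general spatial model, would replace `1.0 / n` with a distance-based value and then be rescaled to sum to 1 in the same way.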

45
MCMC estimation: Poisson model, 50,000 draws
Fixed effects: offset and a well-supported positive relation.
Well-supported residual neighbourhood effect.
NB: the Poisson model gives highly correlated chains.

46
Scottish Lip Cancer: CAR model
CAR: one set of random effects, which have an expected value of the average of the surrounding random effects; the weights are divided by the number of neighbours:
$E(u_i) = \bar{u}_i = \sum_{j \in \text{neigh}(i)} w_{ij} u_j / n_i$
where $n_i$ is the number of neighbours for area i and the weights $w_{ij}$ are typically all 1.
MLwiN: limited capabilities for the CAR model, i.e. at one level only (unlike BUGS).

47
MCMC estimation: CAR model, Poisson model, 50,000 draws
Fixed effects: offset and a well-supported positive relation.
Well-supported residual neighbourhood effect.

48
NB scales: shrinkage

49
Ohio cancer: repeated measures (space and time!)
Response: counts of respiratory cancer deaths in Ohio counties.
Aim: are there hotspot counties with distinctive trends? (small numbers, so borrow strength from neighbours)
Structure: annual repeated measures for counties:
  Classification 3: neighbourhoods as MM (3-8 neighbourhoods)
  Classification 2: counties (88)
  Classification 1: occasion (88*10)
Predictors: expected deaths; time.
Model: log of the response related to the fixed predictor, with an offset; Poisson distribution for counts (C1).
Two sets of random effects:
  1 area random effects allowed to vary over time; a trend for each county from the Ohio distribution (C2)
  2 multiple-membership set of random effects for the neighbours of each region (C3)

50
MCMC estimation: repeated measures model, 50,000 draws
General trend.
Variance function for the between-county time trend.
Neighbourhood variance.
Default priors.

51
Respiratory cancer trends in Ohio: raw and modelled
Red: County 41 in 1988; SMR = 77/49 = 1.57
Blue: County 80 in 1988; SMR = 6/19 = 0.31

52
General References on Modeling Counts
Agresti, A. (2001) Categorical Data Analysis (2nd ed). New York: Wiley.
Cameron, A.C. and Trivedi, P.K. (1998) Regression Analysis of Count Data. Cambridge University Press.
Hilbe, J.M. (2007) Negative Binomial Regression. Cambridge University Press.
McCullagh, P. and Nelder, J. (1989) Generalized Linear Models, Second Edition. Chapman & Hall/CRC.
On spatial models
Browne, W.J. (2003) MCMC Estimation in MLwiN; Chapter 16: Spatial models.
Lawson, A.B., Browne, W.J. and Vidal Rodeiro, C.L. (2003) Disease Mapping using WinBUGS and MLwiN. Wiley, London (Chapter 8: GWR).
