
Introduction to Generalized Linear Models Prepared by Louise Francis Francis Analytics and Actuarial Data Mining, Inc. October 3, 2004

Objectives
- Gentle introduction to Linear Models and Generalized Linear Models
- Illustrate some simple applications
- Show examples in commonly available software
- Which model(s) to use?
- Practical issues

A Brief Introduction to Regression
- One of the most common statistical methods: fit a line to data
- Model: Y = a + bX + error
- Error assumed to be Normal

A Brief Introduction to Regression
- Fits the line that minimizes the squared deviation between actual and fitted values

Simple Formula for Fitting Line
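The formula on this slide was an image that did not survive the transcript; the standard least squares estimates it refers to, for the model Y = a + bX, are:

b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}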

Excel Does Regression
- Install the Analysis ToolPak add-in that comes with Excel
- Click Tools, Data Analysis, Regression

Goodness of Fit Statistics
- R²: (SS Regression / SS Total)
  - Percentage of variance explained
- F statistic: (MS Regression / MS Residual)
  - Significance of the regression as a whole
- t statistics: use the standard error of a coefficient to determine if it is significant
  - Significance of individual coefficients
- It is customary to drop a variable if its coefficient is not significant
- Note: SS = sum of squares

Output of Excel Regression Procedure

Assumptions of Regression
- Errors independent of the value of X
- Errors independent of the value of Y
- Errors independent of prior errors
- Errors are from a normal distribution
- We can test these assumptions

Other Diagnostics: Residual Plot
- Points should scatter randomly around zero
- If not, a straight line is probably not appropriate

Other Diagnostics: Normal Plot
- Plot should be a straight line
- Otherwise residuals are not from a normal distribution

Test for Autocorrelated Errors
- Autocorrelation is often present in time series data
- Durbin-Watson statistic: if residuals are uncorrelated, this is near 2
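The statistic itself was shown as a formula image; the standard Durbin-Watson statistic, computed from the residuals e_t, is:

DW = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}

Values near 2 indicate little autocorrelation; values well below 2 indicate positive autocorrelation.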

Durbin-Watson Statistic
- Indicates autocorrelation is present

Non-Linear Relationships
- The model fit was of the form: Severity = a + b*Year
- A more common trend model is: Severity_Year = Severity_Year0 * (1 + t)^(Year - Year0)
- t is the trend rate
- This is an exponential trend model
- It cannot be fit with a line

Transformation of Variables
- Severity_Year = Severity_Year0 * (1 + t)^(Year - Year0)
1. Log both sides
2. ln(Sev_Year) = ln(Sev_Year0) + (Year - Year0) * ln(1 + t)
3. This has the linear form Y = a + b*x
4. A line can be fit to the transformed variables, where the dependent variable is log(Y)
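A minimal R sketch of this transformation; the severity figures below are made up for illustration and are not the workers compensation data used in the presentation:

# Hypothetical annual severities (illustrative values only)
year <- 1995:2004
sev  <- c(3200, 3400, 3650, 3900, 4200, 4550, 4900, 5300, 5750, 6200)

# Fit a line to the logged severities: ln(Sev) = a + b*Year
fit <- lm(log(sev) ~ year)

# Implied annual trend rate: t = exp(b) - 1
trend <- exp(coef(fit)["year"]) - 1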

Exponential Trend (cont.)
- R² declines and residuals indicate a poor fit

A More Complex Model
- Use more than one variable in the model (econometric model)
- In this case we use a medical cost index and the consumer price index to predict workers compensation severity

Multivariable Regression

Regression Output

Regression Output (cont.)
- Standardized residuals are more evenly spread around the zero line, but a pattern is still present
- R² is 0.84 vs. 0.52 for the simple trend regression
- We might want other variables in the model (e.g., the unemployment rate), but at some point overfitting becomes a problem

Multicollinearity
- Predictor variables are assumed uncorrelated
- Assess with a correlation matrix

Remedies for Multicollinearity
- Drop one of the highly correlated variables
- Use factor analysis or principal components to produce a new variable that is a weighted average of the correlated variables

Exponential Smoothing
- A weighted average with more weight given to more recent values
- Linear exponential smoothing: model both level and trend
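As a sketch, R's HoltWinters function with the seasonal component turned off fits this level-plus-trend (Holt) model; the severity series below is the same hypothetical data used earlier, not the presentation's data:

# Hypothetical annual severities (illustrative values only)
sev <- ts(c(3200, 3400, 3650, 3900, 4200, 4550, 4900, 5300, 5750, 6200),
          start = 1995)

# Holt's linear exponential smoothing: level and trend, no seasonal component
fit <- HoltWinters(sev, gamma = FALSE)
predict(fit, n.ahead = 3)   # smoothed forecasts for the next three years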

Exponential Smoothing Fit

Tail Development Factors: Another Regression Application
- Typically involve non-linear functions:
  - Inverse power curve
  - Hoerl curve
  - Probability distributions such as the Gamma or Lognormal

Example: Inverse Power Curve
Can use a transformation of variables to fit a simplified model:
- LDF = 1 + k/(t + c)^a
- ln(LDF - 1) = ln(k) - a*ln(t + c), which is linear once c is fixed
Alternatively, use nonlinear regression to solve for the parameters directly. Nonlinear regression uses numerical algorithms, such as gradient descent, to solve for the parameters. Most statistics packages let you do this.

Nonlinear Regression: Grid Search Method
- Try out a number of different values for the parameters and pick the ones that minimize a goodness of fit statistic
- You can use the Data Table capability of Excel to do this
- Use the regression functions LINEST and INTERCEPT to get k and a
- Try out different values for c until you find the best one
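The Excel Data Table mechanics are not shown here, but the same grid search can be sketched in R. The ages and LDFs below are hypothetical, and the shifted inverse power form LDF = 1 + k/(t + c)^a follows the reconstruction on the previous slide:

# Hypothetical development ages and age-to-ultimate LDFs (illustrative only)
t   <- c(12, 24, 36, 48, 60, 72)
ldf <- c(1.80, 1.35, 1.18, 1.10, 1.06, 1.04)

# For each candidate c, regress ln(LDF - 1) on ln(t + c) and record the fit
c.grid <- seq(0, 24, by = 0.5)
sse <- sapply(c.grid, function(cc) {
  sum(resid(lm(log(ldf - 1) ~ log(t + cc)))^2)
})

best.c <- c.grid[which.min(sse)]            # the c with the smallest squared error
fit    <- lm(log(ldf - 1) ~ log(t + best.c))
k <- exp(coef(fit)[1])                      # intercept = ln(k)
a <- -coef(fit)[2]                          # slope = -a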

Fitting non-linear function

Using Data Tables in Excel

Use Model to Compute the Tail

Fitting Non-linear Functions
- Another approach is to use a numerical method
- Newton-Raphson (one dimension): x_{n+1} = x_n - f'(x_n)/f''(x_n)
- f(x_n) is typically a function being maximized or minimized, such as the squared errors
- The x's are the parameters being estimated
- A multivariate version of Newton-Raphson or another algorithm is available to solve non-linear problems in most statistical software
- In Excel, the Solver add-in is used to do this
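A one-dimensional sketch in R, minimizing a squared-error function for a single slope parameter; the data are made up purely for illustration:

# Illustrative data: fit y = b*x by minimizing f(b) = sum((y - b*x)^2)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

b <- 0                                  # starting value
for (i in 1:10) {
  f1 <- -2 * sum(x * (y - b * x))       # f'(b)
  f2 <-  2 * sum(x^2)                   # f''(b)
  b  <- b - f1 / f2                     # Newton-Raphson update
}
b                                       # converges to sum(x*y)/sum(x^2)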

Claim Count Triangle Model
- Chain ladder is a common approach

Claim Count Development
- Another approach: an additive model
- This model is the same as a one-factor ANOVA

ANOVA Model for Development

Regression With Dummy Variables
- Let Devage24 = 1 if development age = 24 months, 0 otherwise
- Let Devage36 = 1 if development age = 36 months, 0 otherwise
- Need one less dummy variable than the number of ages
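In R the dummy-variable design matrix can be inspected directly; a small sketch with hypothetical development ages:

# Hypothetical development ages for a few triangle cells
devage <- factor(c(12, 24, 36, 48, 12, 24, 36, 12, 24, 12))

# R builds one dummy column per level except the first (the base level)
model.matrix(~ devage)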

Regression with Dummy Variables: Design Matrix

Equivalent Model to ANOVA

Apply a Logarithmic Transformation
- It is reasonable to believe that the variance is proportional to the expected value
- Claims can only have positive values
- If we log the claim values, the fit can't produce a negative
- Regress log(Claims + 0.001) on the dummy variables, or do an ANOVA on the logged data
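A minimal R sketch of this log regression on a hypothetical miniature triangle (the presentation's actual claim counts are not reproduced here):

# Hypothetical incremental claim counts by development age (illustrative only)
tri <- data.frame(
  devage = factor(rep(c(12, 24, 36, 48), times = 3)),
  Claims = c(120, 60, 25, 5, 130, 65, 30, 8, 110, 55, 20, 3)
)

# Regress logged claims on the development-age dummies
fit.log <- lm(log(Claims + 0.001) ~ devage, data = tri)
summary(fit.log)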

Log Regression

Poisson Regression
- Log regression assumption: errors on the log scale are from a normal distribution
- But these are claims: a Poisson assumption might be reasonable
- Poisson and Normal are from a more general class of distributions: the exponential family of distributions

“Natural” Form of the Exponential Family
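The formula on this slide was not captured in the transcript; the natural (canonical) form usually written for the exponential family is:

f(y; \theta, \phi) = \exp\!\left( \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right)

where θ is the natural parameter, φ is the dispersion parameter, E(Y) = b'(θ), and Var(Y) = a(φ) b''(θ).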

Specific Members of the Exponential Family
- Normal (Gaussian)
- Poisson
- Negative Binomial
- Gamma
- Inverse Gaussian

Some Other Members of the Exponential Family
- Natural form:
  - Binomial
  - Logarithmic
  - Compound Poisson/Gamma (Tweedie)
- General form [use ln(y) instead of y]:
  - Lognormal
  - Single Parameter Pareto

Poisson Distribution
- Natural form: the Poisson can be written in the exponential family form (see below)
- An "over-dispersed" Poisson allows the dispersion parameter φ > 1
- Variance/Mean ratio = φ
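The Poisson formulas on this slide were lost in the transcript; writing the Poisson probability function in the exponential family form above gives:

f(y; \lambda) = \frac{\lambda^{y} e^{-\lambda}}{y!} = \exp\big( y \ln\lambda - \lambda - \ln y! \big)

so θ = ln λ, b(θ) = e^θ, and a(φ) = φ with φ = 1 for the ordinary Poisson; the over-dispersed Poisson keeps the same mean but lets φ exceed 1, so Var(Y) = φ·E(Y).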

Linear Model vs. GLM
- Regression:
- GLM:
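The two model statements on this slide were images; in the usual notation the contrast is:

Regression: Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)

GLM: g\big(E[Y]\big) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k, \quad Y \text{ from the exponential family}

where g is the link function.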

The Link Function
- Like a transformation of variables in linear regression
- Y = AX^B is transformed into a linear model: log(Y) = log(A) + B*log(X)
- This is similar to having a log link function: h(Y) = log(Y)
- Denote h(Y) as η, so η = a + bx

Other Link Functions
- Identity: h(Y) = Y
- Inverse: h(Y) = 1/Y
- Logistic (logit): h(Y) = log(y/(1 - y))
- Probit: h(Y) = Φ⁻¹(y), the inverse of the standard normal CDF

The Other Parameters: Poisson Example Link function

Log-Likelihood for Poisson
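The expression shown on this slide was not captured; for observations y_i with fitted means μ_i (log link: ln μ_i = x_i'β), the Poisson log-likelihood is:

\ell(\mu; y) = \sum_i \big( y_i \ln \mu_i - \mu_i - \ln y_i! \big)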

Estimating Parameters
- As with nonlinear regression, there usually is not a closed form solution for GLMs
- A numerical method is used to solve for the parameters
- For some models this could be programmed in Excel, but statistical software is the usual choice
- If you can't spend money on software, download R for free

GLM Fit for Poisson Regression
> devage <- as.factor(AGE)
> claims.glm <- glm(Claims ~ devage, family = poisson)
> summary(claims.glm)
Call:
glm(formula = Claims ~ devage, family = poisson)
Deviance Residuals:
Min 1Q Median 3Q Max
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) < 2e-16 ***
devage < 2e-16 ***
devage < 2e-16 ***
devage e-12 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: on 36 degrees of freedom
Residual deviance: on 33 degrees of freedom
AIC:

Deviance: Testing Fit
- The maximum likelihood achievable is from a full model with the actual data, y_i, substituted for E(y)
- The likelihood for a given model uses that model's predicted value in place of E(y)
- Twice the difference between these two quantities is known as the deviance
- For the Normal, this is just the sum of squared errors
- It is used to assess the goodness of fit of GLM models, so it functions like the residual sum of squares for Normal models
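In symbols, with ℓ denoting the log-likelihood, the deviance described above is:

D = 2\,\big[ \ell(y; y) - \ell(\hat{\mu}; y) \big]

For the Normal distribution this reduces to \sum_i (y_i - \hat{\mu}_i)^2, the sum of squared errors.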

A More General Model for Claim Development

Design Matrix: Dev Age and Accident Year Model

More General GLM Development Model
Deviance Residuals:
Min 1Q Median 3Q Max
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) < 2e-16 ***
devage < 2e-16 ***
devage < 2e-16 ***
devage e-11 ***
AY
AY
AY *
AY e-05 ***
AY *
AY
AY
AY
AY
AY ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: on 36 degrees of freedom
Residual deviance: on 23 degrees of freedom
AIC: 782.3

Plot Deviance Residuals to Assess Fit

QQ Plots of Residuals
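A short R sketch of these diagnostics, assuming claims.glm is the Poisson fit from the earlier slide:

# Deviance residuals from the fitted Poisson GLM
dres <- residuals(claims.glm, type = "deviance")

# Residuals vs. fitted values: look for patterns around zero
plot(fitted(claims.glm), dres); abline(h = 0)

# QQ plot: departures from the line suggest a poor distributional fit
qqnorm(dres); qqline(dres)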

An Overdispersed Poisson?
- The variance of a Poisson should be equal to its mean
- If it is greater than the mean, the Poisson is overdispersed
- This uses the dispersion parameter φ
- φ is estimated by evaluating how much the actual variance exceeds the mean
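In R the over-dispersed model can be fit with the quasipoisson family, which estimates φ from the data; a sketch assuming the same Claims and devage variables as in the earlier fit:

# Same model as before, but the dispersion parameter is estimated rather than fixed at 1
claims.od <- glm(Claims ~ devage, family = quasipoisson)
summary(claims.od)   # reports the estimated dispersion parameter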

Weighted Regression
- There is an additional consideration in the analysis: should the observations be weighted?
- The variability of a particular record depends on its exposures (for a frequency, the variance is inversely proportional to the exposure)
- Thus, a natural weight is exposures

Weighted Regression
- Least squares for simple regression: minimize SUM((Y_i - a - b*X_i)^2)
- Least squares for weighted regression: minimize SUM(w_i*(Y_i - a - b*X_i)^2)

Weighted Regression
- Examples:
  - Severities are more credible if weighted by the number of claims they are based on
  - Frequencies are more credible if weighted by exposures
- Weight should be inversely proportional to the variance
- Like a regression with the number of observations equal to the number of claims (policyholders) in each cell
- A way to reproduce weighted regression with ordinary regression: multiply Y and the predictor variables (including the intercept column) by the square root of the weight, then run the regression
- With a GLM, specify the appropriate weight variable

Weighted GLM of Claim Frequency Development
- Weighted by exposures
- Adjusted for overdispersion
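A sketch of such a weighted, overdispersion-adjusted frequency model in R, using a hypothetical miniature data set (the presentation's counts and exposures are not reproduced here):

# Hypothetical frequency data: counts, exposures, and development age
dat <- data.frame(
  devage   = factor(rep(c(12, 24, 36, 48), times = 3)),
  counts   = c(120, 60, 25, 5, 130, 65, 30, 8, 110, 55, 20, 3),
  exposure = c(1000, 1000, 1000, 1000, 1100, 1100, 1100, 1100, 950, 950, 950, 950)
)
dat$freq <- dat$counts / dat$exposure

# Quasi-Poisson GLM of frequency, weighted by exposures
freq.glm <- glm(freq ~ devage, family = quasipoisson,
                weights = exposure, data = dat)
summary(freq.glm)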

Introductory Modeling Library Recommendations
- Berry, W., Understanding Regression Assumptions, Sage University Press
- Iversen, R. and Norpoth, H., Analysis of Variance, Sage University Press
- Fox, J., Regression Diagnostics, Sage University Press
- Chatfield, C., The Analysis of Time Series, Chapman and Hall
- Fox, J., An R and S-PLUS Companion to Applied Regression, Sage Publications
- 2004 Casualty Actuarial Society Discussion Paper Program on Generalized Linear Models