THE SCIENCE OF RISK SM 1 Interaction Detection in GLM – a Case Study Chun Li, PhD ISO Innovative Analytics March 2012.

Presentation transcript:

Slide 1: Interaction Detection in GLM – a Case Study
Chun Li, PhD
ISO Innovative Analytics
March 2012

Slide 2: Agenda
- Case study
- Approaches: Proc Genmod, GAM in R, Proc Arbor
- Details
- Summary

Slide 3: Case Study
Personal Auto loss prediction
- Pure premium prediction (GLM, Tweedie)
- Inputs:
  - Environment components
  - Vehicle components
  - Driver components
  - Household components
- Objective: detect interactions among the components to further improve model performance

Slide 4: Components
Environment (frequency and severity for each)
- Traffic density
- Traffic composition
- Traffic generators
- Weather
- Experience and trend
Vehicle
- ISO Symbol relativity
- Price new relativity
- Model year relativity
- Body style and dimension
- Performance and safety
- Theft
- Weather
- Animal
- Glass
- All other perils
Driver
- Driver characteristics (age, gender, marital status, good student, etc.)
- Violation history
- Claim history
Household
- Usage/mileage
- Household composition

Slide 5: Challenges
There are many different approaches that can be used to detect interactions. The approach we selected was based on our requirements that:
- interaction detection be completed in a timely manner despite the large number of observations (>1 million) and the large number of interaction pairs (>300)
- all variables in the final model (including interactions) be interpretable
- the final model (including interactions) be built in the form of a SAS GLM model

Slide 6: Approach
Step 0: Build main effect model
- Aim to model the residual using interaction terms
Step I: Automated pair-wise selection
- Based on standalone contribution
Step II: Manual selection from Step I results
- Based on marginal contribution in GLM
Step III: Validation/Refinement/Finalization
* We’ll be focusing on Step I

Slide 7: Step I - Details
The purpose of Step I is to separate significant interaction pairs from insignificant ones, so that we can focus on those with higher potential. The principle is to add each pair, one at a time, to the model that predicts the residual, measure its contribution, and rank the pairs by that contribution.
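The pair-by-pair principle above can be sketched as a small loop. This is only an illustration, not the presenter's SAS macro or R process: `rank_pairs`, the `contribution` callback, and the toy numbers are all hypothetical, and the callback stands in for whatever metric a given method uses.

```python
from itertools import combinations


def rank_pairs(components, contribution):
    """Score every unordered pair of components and sort by contribution.

    `contribution(a, b)` is a hypothetical callback that returns the
    standalone contribution of adding the pair (a, b) to the residual model;
    larger is assumed to mean a more useful pair.
    """
    scored = [((a, b), contribution(a, b)) for a, b in combinations(components, 2)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up contributions for illustration only (not results from the case study).
toy_scores = {
    ("driver", "household"): 0.012,
    ("driver", "weather"): 0.001,
    ("household", "weather"): 0.004,
}
ranking = rank_pairs(
    ["driver", "household", "weather"],
    lambda a, b: toy_scores[(a, b)],
)
for pair, score in ranking:
    print(pair, score)
```

Only the top of such a ranking would then go forward to the manual Step II review.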

Slide 8: Step I - Details
Three methods are used:
- Proc Genmod in SAS
- GAM in R
- Proc Arbor (Regression Tree) in SAS

Slide 9: Proc Genmod in SAS
- Use main effect model as offset
- Add a component pair to the model
- Use ‘Increase in Gini’ as the performance metric
- Created SAS macro to loop through all component pairs and output these pairs ranked according to the performance metric
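To make the offset idea concrete, here is a minimal sketch assuming a log link (the slides state the log link for the GAM fit; applying it to the Genmod fit is an assumption here). The function name and the numbers are hypothetical. The main effect prediction enters on the log scale with its coefficient fixed at 1, so only the interaction term is estimated.

```python
import math


def predict_with_offset(main_effect_prediction, interaction_effect):
    """Prediction under a log link with the main effect model as offset.

    log(mu) = offset + interaction_effect, where
    offset = log(main_effect_prediction) has a fixed coefficient of 1.
    `interaction_effect` is the fitted linear predictor of the added pair.
    """
    offset = math.log(main_effect_prediction)
    return math.exp(offset + interaction_effect)


# Hypothetical pure premium of 250 from the main effect model.
print(predict_with_offset(250.0, 0.0))   # no interaction: main effect prediction is unchanged
print(predict_with_offset(250.0, 0.10))  # positive interaction effect scales the prediction up
```

Under this setup the interaction term acts as a multiplicative adjustment on top of the main effect prediction.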

Slide 10: Gini Definition
- A measure of model predictiveness
- Focuses more on rank ordering than spot value
Definition:
1. Sort the population by model score
2. Calculate the cumulative % of exposure and loss from low to high score (least risky to most risky)
3. Chart the points using % exposure as x-axis and % loss as y-axis
4. Gini is defined as 2 times the area between the diagonal line and the curve
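The four steps of the definition can be sketched as follows. This is an illustrative implementation, not the presenter's SAS code; the function name, the trapezoid rule for the area, and the toy inputs are assumptions.

```python
def gini(scores, exposures, losses):
    """Gini as 2 x the area between the diagonal and the cumulative loss curve.

    Records are sorted by model score from low to high; x is the cumulative
    share of exposure and y is the cumulative share of loss.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    total_exposure = sum(exposures)
    total_loss = sum(losses)

    area_under_curve = 0.0
    x_prev = y_prev = 0.0
    cum_exposure = cum_loss = 0.0
    for i in order:
        cum_exposure += exposures[i]
        cum_loss += losses[i]
        x = cum_exposure / total_exposure
        y = cum_loss / total_loss
        # Trapezoid between the previous point and this one.
        area_under_curve += (x - x_prev) * (y + y_prev) / 2.0
        x_prev, y_prev = x, y

    # The area under the diagonal is 0.5.
    return 2.0 * (0.5 - area_under_curve)


# Toy data: the higher-scored risk carries all of the loss.
print(gini([1.0, 2.0], [1.0, 1.0], [0.0, 1.0]))
```

An 'Increase in Gini' for a candidate pair would then be the Gini of the model with the pair minus the Gini of the main effect model, both computed on the same data.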

Slide 11: Proc Genmod in SAS
Interaction terms:
- Both linear
- Both binned
- One linear and one binned
The linear assumption rests on the fact that the components (or, in some cases, their log transformations) are developed so that they have a linear relationship with the target.
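One plausible reading of the three interaction forms is sketched below. The slides do not give the exact coding, so the feature names, the cut points, and the choice of a simple product for the linear case are all assumptions made for illustration.

```python
from bisect import bisect_right


def bin_index(value, cut_points):
    """Index of the bin that `value` falls in; `cut_points` must be sorted ascending."""
    return bisect_right(cut_points, value)


def both_linear(a, b):
    """Both components linear: a single product term."""
    return {"a*b": a * b}


def both_binned(a, b, cuts_a, cuts_b):
    """Both components binned: one indicator per (bin of a, bin of b) cell."""
    key = "a_bin%d:b_bin%d" % (bin_index(a, cuts_a), bin_index(b, cuts_b))
    return {key: 1.0}


def linear_by_binned(a, b, cuts_b):
    """One linear, one binned: a separate slope on `a` within each bin of `b`."""
    return {"a:b_bin%d" % bin_index(b, cuts_b): a}


# Hypothetical relativities and cut points.
print(both_linear(1.5, 2.0))
print(both_binned(0.8, 1.3, [1.0], [1.0, 1.2]))
print(linear_by_binned(0.8, 1.1, [1.0, 1.2]))
```

Each function returns the non-zero design columns for one record, which keeps every candidate term in a form a GLM can consume directly.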

Slide 12: GAM in R
GAM = Generalized Additive Model
- In R package: mgcv
- Able to do Tweedie distribution with Log link
- Fits splines
- Multi-dimensional smoothing for interactions
  - Smoothing classes: s(a, b)
  - Tensor product smoothing: te(a, b)

Slide 13: Illustration of interaction surface te(X1, X2)

Slide 14: GAM in R
- Use main effect model as offset
- Add a component pair to the model
- Use ‘Decrease in AIC’ as the performance metric
- Create R process to loop through all possible component pairs and output these pairs ranked according to the performance metric
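As a reminder of what 'Decrease in AIC' measures, the sketch below uses the standard definition AIC = 2k - 2 log L. The function names and numbers are hypothetical; for penalized smooths the parameter count k is typically an effective degrees of freedom rather than an integer, which is why it is treated as a float here.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log L (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def decrease_in_aic(loglik_main, k_main, loglik_with_pair, k_with_pair):
    """AIC of the main effect model minus AIC of the model with the pair added.

    A positive value means the pair improves the fit by more than the
    penalty for its extra (effective) parameters.
    """
    return aic(loglik_main, k_main) - aic(loglik_with_pair, k_with_pair)


# Hypothetical log-likelihoods and parameter counts.
print(decrease_in_aic(-100.0, 3.0, -90.0, 7.0))
```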

Slide 15: Proc Arbor in SAS
- The same algorithm behind EMiner’s Decision Tree Node
- Can be part of a programmable process:
  - Loop through component pairs
  - Build model
  - Evaluate model performance

Slide 16: Proc Arbor in SAS
- Use residual of main effect model as target
- Build regression tree using a pair of components
- Performance metric: sqrt(MSE*Leaf_Count)
- Created SAS macro to loop through all possible component pairs and output these pairs ranked according to the performance metric
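The tree metric can be computed from the residual targets and the leaf each record lands in. This is a sketch under assumptions: the function name is hypothetical, each leaf is assumed to predict the mean of its residuals (the usual regression tree convention), and the slides do not say how the metric is oriented when ranking pairs.

```python
import math


def tree_metric(residuals, leaf_ids):
    """sqrt(MSE * Leaf_Count) for a fitted regression tree.

    `residuals` are the main effect model residuals used as the target;
    `leaf_ids` gives the leaf each record falls in.
    """
    sums, counts = {}, {}
    for r, leaf in zip(residuals, leaf_ids):
        sums[leaf] = sums.get(leaf, 0.0) + r
        counts[leaf] = counts.get(leaf, 0) + 1

    # Mean squared error of the leaf-mean predictions.
    squared_error = 0.0
    for r, leaf in zip(residuals, leaf_ids):
        prediction = sums[leaf] / counts[leaf]
        squared_error += (r - prediction) ** 2
    mse = squared_error / len(residuals)

    return math.sqrt(mse * len(counts))


# Toy residuals split across two leaves.
print(tree_metric([1.0, 3.0, 5.0, 7.0], [0, 0, 1, 1]))
```

Multiplying by the leaf count means a tree has to reduce MSE by more than its growth in size for the metric to fall.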

Slide 17: Example – Collision Coverage
Drivers in the low household relativity segment should have the driver relativity adjusted higher, and drivers in the high household relativity segment should have it adjusted lower.

Slide 18: Example – Collision Coverage
In locations where the loss experience is low, the weather relativity needs to be adjusted lower; where the loss experience is high, it needs to be adjusted higher.

Slide 19: Summary
- Most of the significant pairs are captured by the Proc Genmod method
  - Closest to the final model format
- Both GAM in R and Proc Arbor detect additional significant interaction pairs
  - Need to convert to the format that Proc Genmod can handle

Slide 20: Take away
- The methodologies described can be applied generally to variable selection processes
  - May need a variable de-correlation process beforehand (e.g. variable clustering)
- Significantly reduces the time/effort needed for variable selection

Slide 21: Q & A
Questions? Contact: