1
Predictive Analytics: Statistical Learning
Phillip Floyd, MS ASA MAAA
2
Checklist
Rehearse!!!
Make sure to bring clicker
Rehearse with clicker!!!
Bring business cards
3
About Phil
From New Orleans
Graduated from LSU and UNO
Taken over 90 hours of graduate-level statistical course work
50+ credit hours towards a Ph.D. in Biostatistics
Has worked in statistical and actuarial consulting
Currently works at Segal Consulting as an analyst
Associate of the Society of Actuaries
Poster presentation at 2017 Conference on Statistical Practices
Lecturer at the 2018 Conference on Statistical Practices
4
Course Guide
Predictive Analytics Intro
Statistical Learning Techniques
  Enhancing Model Comparison
  Restructuring the Model
  Actuarial Science Applications
  Redefining and Examining Predictors
5
Download Material
Search LinkedIn for “Phillip Floyd, MS ASA”
Find the PROJECTS section and click “See project”
Lecture title = Statistical Learning
R code is also available for download
6
Predictive Analytics Intro
7
History of Statistics
Actuarial science and statistics are intertwined
Statistics is a relatively new and growing discipline
Father of Statistics, Ronald Fisher, died in the 1960s
Most of what we learn has not been around for long
Discoveries are constantly expanding techniques and changing the way statistics is taught
8
Statistical Learning History
First book published in 2001
Has become a staple of statistics curriculum
Huge area of research for graduate students in statistics
Book in picture
  Most common statistical learning textbook
  Used to teach at Stanford among other schools
  Required reading for SOA’s new SRM and PA exams
9
What is Statistical Learning?
Umbrella of new-age techniques
Set of tools used in modeling and understanding of predictive analytics
Relies heavily on CPU processing power (computer learning)
Enhancement of traditional techniques
10
What is Predictive Analytics?
The use of historical data to identify the likelihood of future outcomes
Base model (matrix notation): $f(Y) = g(X\beta) + h(\varepsilon)$
  $Y$ = dependent / response variable
  $X$ = independent / predictor variable
  $\beta$ = fitted coefficient
  $\varepsilon$ = error
11
Areas Of Predictive Analytics: Linear Regression
A linear approach to modelling the relationship between a response (or dependent variable) and one or more explanatory variables (or independent variables)
$Y = X\beta + \varepsilon$
12
Advantages / Disadvantages
Simple predictive model
Effective when response is quantitative
Ineffective when response is bounded
Ineffective when variance is not constant
13
Boston Data: Linear Regression
Predicting Boston Median House Value (MEDV)
N = 506 neighborhoods
13 predictors
  Crime Rate
  Average # Rooms
  Property-Tax Rate
  % Residential Zone
  % Built Before
  Pupil-Teacher Ratio
  % Non-Retail Bus
  Downtown Proximity
  Race
  % Lower Class
  Highway Proximity
  Charles River Indicator
  Nitrogen Oxide Concentration
14
Univariate Approach
Univariate example: MEDV vs LSTAT (% Lower Socioeconomic Class)
Fit using least squares
Inverse relationship
$\text{medv} = \beta_0 + \beta_1 \cdot \text{lstat}$
$\text{medv} = 34.55 - 0.95 \cdot \text{lstat}$
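A minimal sketch of the least-squares fit behind a one-predictor model of the form medv = β0 + β1·lstat. The function name `fit_simple_ols` and the toy data are assumptions for illustration; the data are generated from the fitted line on the slide, not taken from the Boston dataset.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope = covariance of x and y divided by variance of x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Toy data lying exactly on medv = 34.55 - 0.95 * lstat (illustrative only)
lstat = [2.0, 5.0, 10.0, 20.0, 30.0]
medv = [34.55 - 0.95 * v for v in lstat]

b0, b1 = fit_simple_ols(lstat, medv)
print(round(b0, 2), round(b1, 2))
```

Because the toy points sit exactly on the line, the fit recovers the slide's intercept and negative slope, which is the inverse relationship described above.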
15
Determining Significance
P-values measure the strength of evidence that a coefficient is nonzero
  The lower the better; less than .05 = significant
  Intercept p-value <
  LSTAT p-value <
Lower status has a significant relationship with median house value
As the % of lower status decreases, median house value increases
$\text{medv} = 34.55 - 0.95 \cdot \text{lstat}$
16
Multiple Regression
Consider all 13 predictors
Find most efficient set of predictors
  All predictors
  Forward Selection
  Backward Selection
  Subset Selection
$\text{medv} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots$
17
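All Predictors
Model contains all 13 predictors
No regard for significance determination
$\text{medv} = 36.5 - 0.5 \cdot \text{lstat} + 3.8 \cdot \text{rm} - 1 \cdot \text{ptratio} - 1.5 \cdot \text{dis} - 17.8 \cdot \text{nox} + 2.7 \cdot \text{chas} + 0.009 \cdot \text{black} + 0.05 \cdot \text{zn} - 0.1 \cdot \text{crim} + 0.3 \cdot \text{rad} - 0.01 \cdot \text{tax} + 0.0007 \cdot \text{age} + 0.02 \cdot \text{indus}$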
18
Forward Selection
Starts by fitting all univariate (one-predictor) models
Predictor with the lowest p-value moves forward
Repeat the process with two predictors, keeping the winner
Process repeats until no added predictor is significant
8 predictors found to be significant
$\text{medv} = 30.3 - 0.5 \cdot \text{lstat} + 4.1 \cdot \text{rm} - 0.9 \cdot \text{ptratio} - 1.4 \cdot \text{dis} - 16.7 \cdot \text{nox} + 3.1 \cdot \text{chas} + 0.009 \cdot \text{black} + 0.04 \cdot \text{zn}$
19
Backward Selection
Starts by fitting one model with all predictors
Predictor with the highest non-significant p-value is removed
Repeat the process, refitting with the remaining variables
Process repeats until all predictors are significant
11 predictors kept in model
$\text{medv} = 36.3 - 0.5 \cdot \text{lstat} + 3.8 \cdot \text{rm} - 0.9 \cdot \text{ptratio} - 1.5 \cdot \text{dis} - 17.4 \cdot \text{nox} + 2.7 \cdot \text{chas} + 0.009 \cdot \text{black} + 0.05 \cdot \text{zn} - 0.1 \cdot \text{crim} + 0.3 \cdot \text{rad} - 0.01 \cdot \text{tax}$
20
Subset Selection
Fits every possible model of the selected subset size
Ex: subsets of 3 predictors, choose from $\binom{13}{3} = \frac{13 \cdot 12 \cdot 11}{3 \cdot 2 \cdot 1} = 286$ models
Model with best fit is chosen
Below is the best subset model chosen out of all 5-predictor models
$\text{medv} = 37.5 - 0.6 \cdot \text{lstat} + 4.2 \cdot \text{rm} - 1.0 \cdot \text{ptratio} - 1.2 \cdot \text{dis} - 18 \cdot \text{nox}$
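A quick check of how many candidate models subset selection has to fit, using the binomial coefficient (order of predictors does not matter). This is only an illustrative count; the variable names are assumptions.

```python
from math import comb

n_predictors = 13

# Number of distinct 3-predictor and 5-predictor models
models_of_size_3 = comb(n_predictors, 3)
models_of_size_5 = comb(n_predictors, 5)

# Every subset size from 0 to 13 predictors (includes the intercept-only model)
all_subsets = sum(comb(n_predictors, k) for k in range(n_predictors + 1))

print(models_of_size_3, models_of_size_5, all_subsets)
```

The count grows quickly with the number of predictors, which is why subset selection is computationally intensive.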
21
Selection Advantages/Disadvantages
Forward and Backward selection are easy to implement
  Used interchangeably in practice
  Do not consider collinearity between predictors
Subset selection tests all possible models
  Computationally intensive
22
Model Comparison
Test Statistics   Univariate   All     Forward   Backward   Subset
R²                .5441        .7406   .7266     .7406      .7081
Adjusted R²       .5432        .7338   .7222     .7348      .7052
MSE               38.63        22.52   23.49     22.43      24.94
23
Model Comparison
$R^2 = \frac{SS_{model}}{SS_{total}}$, with $R^2 \in [0, 1]$
Higher is better
Test Statistics   Univariate   All     Forward   Backward   Subset
R²                .5441        .7406   .7266     .7406      .7081
Adjusted R²       .5432        .7338   .7222     .7348      .7052
MSE               38.63        22.52   23.49     22.43      24.94
24
Model Comparison
Adjusted R² = R² adjusted for the number of predictors
Higher is better
Test Statistics   Univariate   All     Forward   Backward   Subset
R²                .5441        .7406   .7266     .7406      .7081
Adjusted R²       .5432        .7338   .7222     .7348      .7052
MSE               38.63        22.52   23.49     22.43      24.94
25
Model Comparison
$MSE = \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N}$, where $\hat{y}_i$ = predicted value
Lower is better
Test Statistics   Univariate   All     Forward   Backward   Subset
R²                .5441        .7406   .7266     .7406      .7081
Adjusted R²       .5432        .7338   .7222     .7348      .7052
MSE               38.63        22.52   23.49     22.43      24.94
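The three comparison statistics can be sketched in a few lines. The function names are assumptions. R² is computed as 1 − SS_residual / SS_total, which equals SS_model / SS_total for a least-squares fit with an intercept. The adjusted R² formula used here, 1 − (1 − R²)(n − 1)/(n − p − 1), is the standard textbook definition and is not spelled out on the slides.

```python
def mse(y, y_hat):
    """Mean squared error between observed and predicted values."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / len(y)


def r_squared(y, y_hat):
    """Share of total variation explained by the model."""
    y_bar = sum(y) / len(y)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1 - ss_resid / ss_total


def adjusted_r_squared(r2, n, p):
    """Penalize R^2 for the number of predictors p, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# All-predictors model from the table: R^2 = .7406, N = 506, 13 predictors
adj_all = adjusted_r_squared(0.7406, n=506, p=13)
print(round(adj_all, 3))
```

Plugging in the rounded R² for the all-predictors model lands within rounding of the adjusted value shown in the table.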
26
Enhancing Model Comparison
Statistical Learning: Enhancing Model Comparison
Using cross-validation to improve on conventional model comparison techniques
27
Cross-Validation
Divide the data set into two parts, test and training data
With training data,
  Fit model to training data
  Use selection procedures to choose model
With test data,
  Use training model to get estimates for each test data point
  Estimate the squared prediction error on the test data
  Use that to compare models and determine fitness
[Figure: dataset split into Training and Test portions]
28
Leave-One-Out Cross-Validation
Training data set is all but one observation
Squared error of the single test observation is obtained
Method is repeated for all observations
Mean squared error (MSE) is used to assess model fitness
Approximately unbiased MSE estimator
Processing power required increases with observations and model complexity
[Figure: dataset split into Training and a single Test observation]
29
K-Fold Cross-Validation
Divide data set into K groups (folds)
K-1 groups are the training dataset
1 group is the testing dataset
Fit the model on the training data and obtain the MSE on the testing data
Repeat to get the MSE for each group
Approximately unbiased MSE estimator
Less taxing than Leave-One-Out
Leave-One-Out = K-Fold when K=N
K=5 or K=10 is common in practice
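A minimal sketch of K-fold cross-validation for a one-predictor least-squares model, assuming toy data and hypothetical function names (`fit_line`, `k_fold_mse`). Setting k equal to the number of observations gives Leave-One-Out. Squared errors are pooled over all held-out points, which is one common way to combine the folds.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def k_fold_mse(x, y, k, seed=0):
    """Cross-validated MSE: each fold is held out once and predicted
    by a model fit to the other k-1 folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        sq_errors.extend((y[i] - (b0 + b1 * x[i])) ** 2 for i in fold)
    return sum(sq_errors) / len(sq_errors)


# Toy data: a straight line plus a small alternating disturbance
x = [float(i) for i in range(1, 21)]
y = [2.0 * xi + 1.0 + (0.5 if i % 2 else -0.5) for i, xi in enumerate(x)]

cv_5 = k_fold_mse(x, y, k=5)
loocv = k_fold_mse(x, y, k=len(x))  # Leave-One-Out is K-fold with K = N

b0, b1 = fit_line(x, y)
train_mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
print(train_mse, cv_5, loocv)
```

The in-sample MSE is never larger than the Leave-One-Out MSE for a least-squares fit, which is why the cross-validated rows in the comparison tables sit slightly above the plain MSE row.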
30
More is not always better…
Leave-One-Out and K-Fold give approximately unbiased estimates for each data point
But K-Fold cross-validation often gives more accurate estimates of test error rate
But why…
  Bias-Variance tradeoff
  Leave-One-Out models are highly correlated with one another = higher variance
31
Bias-Variance tradeoff
Bias = error from erroneous assumptions
Variance = error from model sensitivity
Too few observations -> inaccurate assumptions
Too many observations -> high correlation between models -> overfitting
$\text{Total MSE} = \text{Bias}^2 + \text{Var}$
[Figure: total MSE versus number of folds, from 1 to N]
32
Reassess Model Comparison
Test Statistics     Univariate   All     Forward   Backward   Subset
R²                  .5441        .7406   .7266     .7406      .7081
Adjusted R²         .5432        .7338   .7222     .7348      .7052
MSE                 38.63        22.52   23.49     22.43      24.94
Leave-One-Out MSE   38.89        23.73   24.49     23.51      25.64
10-Fold MSE         38.74        23.35   24.75     23.26      25.72
Lower is better for both CV techniques
33
Restructuring the model
Statistical Learning: Restructuring the model
Using iterative techniques to create complexity
34
Decision Trees
Splits response into R distinct regions based on predictors
Each node represents the optimal split minimizing the MSE
Same prediction for every observation within a region
  Region prediction = average
Nodes are formed until a cutoff point
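A minimal sketch of how a single tree node could be chosen for one predictor: try every cutoff between neighboring values, predict the average response on each side, and keep the cutoff with the smallest sum of squared errors. The function name `best_split` and the toy data are assumptions; a full tree would repeat this search within each resulting region until a stopping rule is reached.

```python
def best_split(x, y):
    """Return (cutoff, left_mean, right_mean, sse) for the single split
    of x that minimizes the sum of squared errors, or None if no split exists."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical predictor values
        cutoff = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((v - left_mean) ** 2 for v in left)
               + sum((v - right_mean) ** 2 for v in right))
        if best is None or sse < best[3]:
            best = (cutoff, left_mean, right_mean, sse)
    return best


# Toy data with two obvious regions
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

cutoff, left_mean, right_mean, sse = best_split(x, y)
print(cutoff, left_mean, right_mean, sse)
```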
35
Decision Trees
Can be written in terms of $Y = X\beta$,
where X would be replaced with indicator functions ($I_{m<c}$).
36
Simple Decision Tree vs Regression
Tree Advantages
  Easier to interpret
  Can be displayed graphically
  Mirrors human decision-making
  Non-parametric
Tree Disadvantages
  Less predictive accuracy if function type relationship exists
  Non-robust (small changes in data = large impact)
37
Overcoming Decision Tree Disadvantages
Statistical Learning / Ensemble Methods Discussed
  Bagging / Bootstrapping
  Random Forests
  Boosting
38
Bagging / Bootstrapping
Split dataset into training and test
With training data
  Create resampled datasets of N observations
    Sample with replacement
  Fit decision tree to each dataset sample
  Combine / Average trees to obtain prediction model
Apply model to test data
39
Resampling dataset
Training Data
Y    X
32   5
27   6
11   1
12   2
15   3
[Figure: three resampled datasets, Sample 1, Sample 2, Sample 3]
40
Resampled Trees
41
Rules for Resampling Datasets
Keep X and Y values together (Observation keeps same qualities)
Sample with replacement (Can have duplicate values)
Sample dataset size = N (Same number of observations as dataset)
# of samples should be sufficient to curb MSE (K=500 resamples works for Boston dataset)
[Figure: MSE versus number of trees]
42
Prediction Model
Use trees to predict response
$\hat{y}_i = x_i B = \frac{\sum_{k=1}^{500} T_k(x_i)}{500}$
$MSE = \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N}$
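A minimal sketch of bagging under simplifying assumptions: one predictor, one-split trees (stumps) instead of full trees, and toy data. Each bootstrap sample draws N rows with replacement, keeping each X with its Y, and the bagged prediction is the average over the fitted trees. The names `fit_stump` and `bagged_predictor` are hypothetical.

```python
import random


def fit_stump(x, y):
    """Fit a one-split regression tree and return it as a prediction function."""
    pairs = sorted(zip(x, y))
    overall = sum(y) / len(y)
    best = (float("inf"), None, overall, overall)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    _, cutoff, lm, rm = best
    if cutoff is None:  # resample contained a single distinct x value
        return lambda v: overall
    return lambda v: lm if v < cutoff else rm


def bagged_predictor(x, y, n_trees=500, seed=0):
    """Average of trees fit to bootstrap resamples of the (x, y) rows."""
    rng = random.Random(seed)
    n = len(x)
    trees = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # N rows, with replacement
        trees.append(fit_stump([x[i] for i in rows], [y[i] for i in rows]))
    return lambda v: sum(t(v) for t in trees) / len(trees)


x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
model = bagged_predictor(x, y, n_trees=50)
print(model(2.0), model(11.0))
```

Averaging many trees smooths out the instability of any single tree, which is the variance reduction that bagging is after.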
43
Reassess Model Comparison
Keep backward model
Test Statistics   Univariate   All     Forward   Backward   Subset
R²                .5441        .7406   .7266     .7406      .7081
Adjusted R²       .5432        .7338   .7222     .7348      .7052
MSE               38.63        22.52   23.49     22.43      24.94
Leave-One-Out     38.89        23.73   24.49     23.51      25.64
10-Fold           38.74        23.35   24.75     23.26      25.72
44
Reassess Model Comparison
Bagging model outperforms the Backward model
Test Statistics   Backward   Bagging
R²                .7406      .8757
Adjusted R²       .7348      -
MSE               22.43      10.49
Leave-One-Out     23.51      9.94
10-Fold           23.26      10.48
45
Random Forest
Improves accuracy of Bagging
Adds a modification
  Same process of random sampling / bootstrapping
  Model Fitting: a random sample of p predictors is chosen from the full set of m predictors to be fit at each tree split
  Typical to choose $p \approx \sqrt{m}$
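The modification can be sketched as a small helper that draws the candidate predictors for one tree split. The helper name `candidate_predictors` is an assumption, and rounding √m to the nearest integer is one possible convention; the predictor names are the 13 Boston variables from the regression equations.

```python
import math
import random

BOSTON_PREDICTORS = [
    "lstat", "rm", "ptratio", "dis", "nox", "chas", "black",
    "zn", "crim", "rad", "tax", "age", "indus",
]


def candidate_predictors(predictors, rng):
    """Randomly pick about sqrt(m) of the m predictors for one split."""
    p = max(1, round(math.sqrt(len(predictors))))
    return rng.sample(predictors, p)


rng = random.Random(0)
split_candidates = candidate_predictors(BOSTON_PREDICTORS, rng)
print(split_candidates)
```

A fresh draw is made at every split, so a dominant predictor is frequently unavailable and other predictors get a chance to shape the tree.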
46
Why does Random Forest work?
Decorrelates trees
  A strong predictor can overshadow other relationships
  Makes bagged trees look similar
  Bagged trees are highly correlated
  Limits variance reduction
Greatest improvement is seen when # of predictors is large
47
Reassess Model Comparison
The Random Forest modification enhanced bagging accuracy
Test Statistics: Backward, Bagging, Random Forest
R²: .7406, .8758, .8851
Adjusted R²: .7348
MSE: 22.43, 10.48, 9.70
Leave-One-Out: 23.51, 9.91
10-Fold: 23.26, 10.44
48
Boosting
Unlike Bagging / Random Forest
  No random sampling
  Model learns slowly
49
Boosting Algorithm
Define $Y_0$ and $X_0$ as the matrices of response and predictors
Define b as a small number (b=.01 or b=.001)
Fit decision tree to dataset: $Y_0$ and $X_0$
Predicted response matrix: $R_1 = T(X_0)$
Update dataset: $Y_1 = Y_0 - b \cdot R_1$
Repeat process: fit decision tree to dataset: $Y_1$ and $X_0$
Final model: $\hat{Y} = \sum_{k=1}^{K} b \cdot R_k$
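A minimal sketch of the boosting loop under simplifying assumptions: one predictor, one-split trees (d = 1), and toy data. Each tree is fit to what is left of the response, the response is reduced by b times that tree's predictions, and the final model sums b times every tree. The names `fit_stump`, `boost`, and `train_mse` are hypothetical.

```python
def fit_stump(x, y):
    """Fit a one-split regression tree and return it as a prediction function."""
    pairs = sorted(zip(x, y))
    overall = sum(y) / len(y)
    best = (float("inf"), None, overall, overall)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    _, cutoff, lm, rm = best
    if cutoff is None:
        return lambda v: overall
    return lambda v: lm if v < cutoff else rm


def boost(x, y, b=0.01, n_trees=1000):
    """Fit trees sequentially to the remaining response, shrinking each by b."""
    remaining = list(y)  # Y_0
    trees = []
    for _ in range(n_trees):
        tree = fit_stump(x, remaining)  # R_k = T(X_0)
        trees.append(tree)
        remaining = [r - b * tree(v) for r, v in zip(remaining, x)]  # Y_k
    return lambda v: b * sum(t(v) for t in trees)


def train_mse(model, x, y):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
few_trees = boost(x, y, b=0.1, n_trees=5)
many_trees = boost(x, y, b=0.1, n_trees=200)
print(train_mse(few_trees, x, y), train_mse(many_trees, x, y))
```

Training error keeps falling as trees are added, which is why the number of trees should be chosen by cross-validation and not by in-sample fit.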
50
Boosting
Slowly chips away at response
3 key parameters
  Shrinkage parameter = b
  Number of splits in each tree = d
  Number of trees to create = K
d usually kept small
  1-2 splits in textbook
Large K can overfit model
  K should be optimized using cross-validation
51
Boosting: Boston data Set d =1, b = .01
52
Boosting: Boston data Set d = 2, b = .01
53
Boosting: Interaction Depth Comparison
Model Comparison   d=1     d=2
LOOCV MSE          12.46   9.78
10-fold CV MSE     12.76   9.71
54
Reassess Model Comparison
The Random Forest modification enhanced bagging accuracy
Test Statistics: Backward, Bagging, Random Forest, Boosting d=2
R²: .7406, .8758, .8851
Adjusted R²: .7348
MSE: 22.43, 10.48, 9.70, > 9.70
Leave-One-Out: 23.51, 9.91, 9.78
10-Fold: 23.26, 10.44, 9.71
55
Bagging vs Boosting
Variance – Bias Tradeoff
  Bagging reduces variance
  Boosting reduces bias
Boosting model can be overfit
Boosting better for simple models
Bagging better for more complex models
Regression can be used with either technique in place of trees
56
Modeling Overview: Stat Learning vs Regression
Regression
  Model is simple, easy to explain
  Requires fewer resources
  May require assumptions about response distribution
  Unstable if multicollinearity exists
Stat Learning Modeling Techniques
  Increased accuracy
  Reduction of variance / bias
  Difficult to explain predictor influence
  Requires more resources / processing power
57
Limits of Data Used
Small dataset was used to teach
Some models took several minutes to run
In practice, datasets contain 1000s of datapoints
Multiple computers/processors may be needed (grid system)
58
Actuarial Science Applications
Statistical Learning: Actuarial Science Applications
59
Moving Beyond Linear Regression
Categorical Data
Survival Data
Time Series Data
$f(Y) = g(X\beta) + h(\varepsilon)$
60
Categorical Data Analysis
Encompasses modeling techniques where the response variable (Y) is a set of classes or categories
  Predicting credit default (Yes, No)
  Policy Lapse (Yes, No)
  Plan Migration (Plan A, Plan B, Plan C)
  Legitimate vs spam
  Marketing surveys (Brand A vs Brand B vs Brand C)
61
Categorical Data Analysis
Y = p = probability of occurrence of event
Problems plugging in Y = p
Find a function $f(p)$ such that $f(Y) = XB + \varepsilon$
  $p \in [0, 1]$
  $f(p) \in (-\infty, \infty)$
$f(p) = \text{logit}(p) = \ln\left(\frac{p}{1-p}\right)$ is commonly used in practice
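A short sketch of the logit link and its inverse, showing how a probability in (0, 1) maps onto the whole real line and back. The function names are assumptions.

```python
import math


def logit(p):
    """Map a probability in (0, 1) to the real line."""
    return math.log(p / (1 - p))


def inv_logit(z):
    """Map any real number back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))


for p in (0.01, 0.5, 0.8, 0.99):
    print(p, round(logit(p), 3), round(inv_logit(logit(p)), 3))
```

Fitting the linear model on the logit scale and converting predictions back with the inverse keeps every predicted probability between 0 and 1.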
62
Categorical Data Analysis using Decision Trees
Classification Trees
Each node represents a category
Splits are found through optimization as with trees with linear classifiers
[Figure: classification tree with nodes labeled Yes / No]
63
Categorical Data Analysis using Decision Trees
Probability Estimation Trees
Each node is a probability estimate taken as the mean of the response data in the node
Splits are found through optimization as with trees with linear classifiers
[Figure: probability estimation tree with node estimates .6, .3, .2, .1, .7, .1, .7, .4, .8]
64
Survival Analysis
Set of methods for analyzing the time until occurrence of an event of interest
  Mortality
  Morbidity
  Failure of element or system
65
Survival Analysis
Estimating probability of survival
  Always between [0,1]
  Non-increasing function (slope <= 0)
  Probability of survival at time $t_k$ >= probability of survival at time $t_{k+j}$
Kaplan-Meier estimate of survival function = non-parametric
Cox Proportional Hazards = semi-parametric
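A minimal sketch of the Kaplan-Meier estimate on made-up data, where an event flag of 1 means the event was observed and 0 means the observation was censored. The function name `kaplan_meier` and the data are assumptions. At each event time the running survival probability is multiplied by (1 − events / number still at risk).

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each observed event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = sum(1 for tt, e in data if tt == t and e == 1)
        n_leaving = sum(1 for tt, _ in data if tt == t)  # events plus censored
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
        i += n_leaving
    return curve


# Toy data: five subjects, two of them censored
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
print(curve)
```

The estimated curve only steps down at observed event times, so it stays between 0 and 1 and never increases.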
66
Survival Analysis
Find graph
Estimate survival using force of mortality (aka hazard function or failure rate)
Force of mortality, $h(t)$: instantaneous rate of failure at time t, given survival up to t
$h(t) = \frac{f(t)}{1 - F(t)}$
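A small worked example of the hazard formula h(t) = f(t) / (1 − F(t)), using an exponential lifetime as an assumed illustration. For that distribution the density is λe^(−λt) and the CDF is 1 − e^(−λt), so the hazard works out to the constant λ. The rate 0.2 and the function names are made up.

```python
import math


def hazard(pdf, cdf, t):
    """Force of mortality: density at t divided by probability of surviving to t."""
    return pdf(t) / (1 - cdf(t))


rate = 0.2  # assumed exponential rate, illustrative only


def exp_pdf(t):
    return rate * math.exp(-rate * t)


def exp_cdf(t):
    return 1 - math.exp(-rate * t)


h_at_1 = hazard(exp_pdf, exp_cdf, 1.0)
h_at_5 = hazard(exp_pdf, exp_cdf, 5.0)
print(h_at_1, h_at_5)
```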
67
Survival Analysis Using Decision Trees
Use significance tests to determine splits where the hazard functions are different
Each hazard function, $h_k(t)$, would represent a different survival distribution
These distributions can be either parametric or non-parametric
[Figure: decision tree with nodes labeled $h_1(t)$ through $h_9(t)$]
68
Time Series Analysis
Time series analysis accounts for the fact that data points taken over time may have an internal structure (such as dependency on past values or seasonal variation)
  Stock prices/trends
  Interest rates
  Sales trends
  Price of commodities, real estate
[Figure: Y plotted against Time]
69
Time Series Analysis
ARMA is a commonly used modeling technique
Below is an ARMA(p,q) model with a transfer function of k variables
$y_t = \alpha_0 + \alpha_1 y_{t-1} + \dots + \alpha_p y_{t-p} + \varepsilon_t + \kappa_1 \varepsilon_{t-1} + \dots + \kappa_q \varepsilon_{t-q} + \beta_1 x_1 + \dots + \beta_k x_k$
Matrix notation
$Y = \mathrm{A} Y_p + \mathrm{K} \mathrm{E}_q + XB$
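A minimal sketch of a one-step-ahead ARMA(p,q) prediction with transfer-function terms, assuming the coefficients are already known. The function name `arma_predict`, its argument names, and the numbers are hypothetical; the prediction is the intercept plus the autoregressive terms on past responses, the moving-average terms on past errors, and the regression terms on the external predictors (the current error is unknown and taken as zero).

```python
def arma_predict(intercept, ar_coefs, ma_coefs, past_y, past_errors,
                 betas=(), x=()):
    """One-step-ahead prediction.

    ar_coefs[0] multiplies the most recent response, ar_coefs[1] the one
    before it, and so on; ma_coefs pairs with past errors the same way.
    past_y and past_errors are ordered oldest to newest.
    """
    ar_part = sum(a * y for a, y in zip(ar_coefs, reversed(past_y)))
    ma_part = sum(k * e for k, e in zip(ma_coefs, reversed(past_errors)))
    transfer_part = sum(b * xi for b, xi in zip(betas, x))
    return intercept + ar_part + ma_part + transfer_part


# ARMA(1,1): intercept 1.0, AR coefficient 0.5, MA coefficient 0.2
pred_plain = arma_predict(1.0, [0.5], [0.2], past_y=[4.0], past_errors=[1.0])

# Same model with one external predictor (coefficient 2.0, value 3.0)
pred_transfer = arma_predict(1.0, [0.5], [0.2], past_y=[4.0],
                             past_errors=[1.0], betas=[2.0], x=[3.0])
print(pred_plain, pred_transfer)
```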
70
Time Series Analysis Using Decision Trees
Time Series Trees with ARMA nodes
  Start with fitting ARMA model to test data to get defined p and q
  Separate branches are transfer function splits
  At each split, ARMA degrees can be reassessed, adding the transfer function split into the parameter estimation
[Figure: decision tree with nodes labeled $Y_1$ through $Y_9$]
71
Redefining and Examining Predictors
Statistical Learning: Redefining and Examining Predictors
Unsupervised Learning Techniques: Principal Component Analysis
72
Principal Component Analysis (PCA)
Consider dataset with p variables/predictors
PCA creates new variables to
  reduce dimensions
  compress information
  illuminate relationships between variables
73
PCA: 2-D Example
Consider two predictors: ad spending and population
1st component (green) minimizes SS
2nd component (dash) minimizes SS keeping orthogonal to 1st component
Components can be used as predictors instead of the original data
74
Principal Component Analysis (PCA)
A principal component, $Z_k$, is a linear function of predictors
$Z_k = \phi_{k1} X_1 + \phi_{k2} X_2 + \dots + \phi_{kp} X_p$
where $\sum_{i=1}^{p} \phi_{ki}^2 = 1$
The variance between the projected observations is maximized
Sum of squares between each point and the component line is minimized
  (these two criteria are the same thing)
The $k^{th}$ principal component, $Z_k$, is orthogonal to all previous components
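A minimal sketch of the first principal component for two predictors, using the closed-form direction of the leading eigenvector of the 2×2 covariance matrix. The function name `first_component_2d` and the toy data are assumptions; the loadings are scaled so that their squares sum to 1, and the scores are the centered observations projected onto that direction.

```python
import math


def first_component_2d(x1, x2):
    """Return (loadings, scores) of the first principal component of two predictors."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1) / (n - 1)
    s22 = sum((b - m2) ** 2 for b in x2) / (n - 1)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)
    # Angle of the direction along which the projected variance is largest
    theta = 0.5 * math.atan2(2 * s12, s11 - s22)
    loadings = (math.cos(theta), math.sin(theta))
    scores = [loadings[0] * (a - m1) + loadings[1] * (b - m2)
              for a, b in zip(x1, x2)]
    return loadings, scores


# Toy data: the two predictors move together exactly
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 2.0, 3.0, 4.0]
loadings, scores = first_component_2d(x1, x2)
print(loadings, scores)
```

With perfectly aligned predictors the component weights both equally, and the single score column carries all the information in the two original columns.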
75
Appendix https://en.wikipedia.org/wiki/Linear_regression
76
END!!!!!