Variable Reduction for Predictive Modeling with Clustering

Casualty Actuarial Society Seminar on Ratemaking
Bob Sanche
March 14, 2006
Sunday, July 01, 2018

Contents
- Data Storage and Amount of Predictive Variables
- Predictive Modeling and Model Generalization
- Dimension Reduction
- Goal of Variable Clustering
- What Is Clustering?
- Variable Clustering
- When Does Variable Clustering Occur During the Predictive Modeling Process?
- Example

Data Storage and Predictive Variables
- Data storage economics
  - “In 1956, IBM sold its first magnetic disk system, RAMAC (Random Access Method of Accounting and Control). It used inch metal disks, with 100 tracks per side. It could store 5 megabytes of data and cost $10,000 per megabyte. (As of 2005, disk storage costs less than $1 per gigabyte).”
  - 1 gigabyte ≈ 130 numeric characteristics for each of 1 million policies, for under $1
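The 1 gigabyte figure can be checked with a few lines of arithmetic. This is only a sketch: the slide does not say how a numeric characteristic is stored, so the 8 bytes per value (a double-precision float) and the function name `storage_gigabytes` are assumptions made for illustration.

```python
# Assumption (not stated on the slide): each numeric value is stored as an
# 8-byte double-precision float, and 1 GB is taken as 10**9 bytes.
BYTES_PER_NUMERIC = 8
N_POLICIES = 1_000_000
N_CHARACTERISTICS = 130


def storage_gigabytes(n_rows: int, n_cols: int,
                      bytes_per_value: int = BYTES_PER_NUMERIC) -> float:
    """Raw size in GB of an n_rows x n_cols table of numeric values."""
    return n_rows * n_cols * bytes_per_value / 10**9


size_gb = storage_gigabytes(N_POLICIES, N_CHARACTERISTICS)
print(f"{N_CHARACTERISTICS} numeric columns x {N_POLICIES:,} policies "
      f"= {size_gb:.2f} GB")
```

Under that assumption the table comes to roughly one gigabyte, which is where the "about $1" of storage comes from.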

Data Storage and Predictive Variables
- New data sources
  - Data warehousing
  - External sources (demographics, meteorological, etc.)
  - Policyholder, household or company information
  - Agency
  - Other
- Data storage economics and availability of data
  - Increase the number of predictive variables
- Data mining paradigm
  - Additional inputs add lift to the model

Predictive Modeling and Model Generalization
- A predictive model is created from a number of predictors that are likely to influence future results
  - $Y = \alpha_1 X_1 + \dots + \alpha_n X_n + \beta$
  - $n$ is the number of predictors in the universe of all available predictors
- Goal of predictive modeling
  - Obtain coefficients for the $\alpha$'s and $\beta$
  - Predictive of future results
  - Model generalizes well over time
- Model complexity → Overfitting
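The link between model complexity and overfitting can be illustrated with an extreme case: when the linear model has as many parameters (the alphas plus beta) as there are observations, it can reproduce the training data exactly even if the target is pure noise. The sketch below is not from the presentation; the data are synthetic, and the helper names (`solve_linear_system`, `make_noise_data`, `predict`, `mean_squared_error`) are hypothetical.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


rng = random.Random(0)
N_OBS = 8
N_PREDICTORS = N_OBS - 1  # 7 alphas + 1 intercept = 8 parameters


def make_noise_data(n_rows):
    """Predictors and target are independent noise: nothing real to learn."""
    x = [[rng.gauss(0, 1) for _ in range(N_PREDICTORS)] for _ in range(n_rows)]
    y = [rng.gauss(0, 1) for _ in range(n_rows)]
    return x, y


def predict(rows, coefs):
    # The last coefficient is the intercept beta; the rest are the alphas.
    *alphas, beta = coefs
    return [sum(a * v for a, v in zip(alphas, row)) + beta for row in rows]


def mean_squared_error(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


x_train, y_train = make_noise_data(N_OBS)
design = [row + [1.0] for row in x_train]  # append the intercept column
coefs = solve_linear_system(design, y_train)

x_new, y_new = make_noise_data(N_OBS)
train_mse = mean_squared_error(y_train, predict(x_train, coefs))
new_mse = mean_squared_error(y_new, predict(x_new, coefs))
print(f"training MSE: {train_mse:.2e}")
print(f"MSE on new data: {new_mse:.2e}")
```

The training error is zero up to rounding, yet the fitted coefficients carry no information about new data, which is the failure to generalize that the slide warns about.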

Dimension Reduction
- Need to reduce model complexity
- Dimension reduction
  - Clustering (K-Means): rows
  - Variable clustering: columns
- Alternatives to variable clustering
  - PCA and factor analysis
  - Difficult to interpret and deploy

Goal of Variable Clustering
- Reduce the number of variables
  - More difficult to identify irrelevant variables than redundant variables
  - $Y = \alpha_1 X_1 + \dots + \alpha_m X_m + \beta$, where $m < n$
- Why do we want to reduce the number of variables?
  - Improve efficiency of the predictive modeling process
    - Time to develop the model
    - Interpretation of the results
  - Reduce variance of the model estimates
- Demographics example
  - Average household size, median household size, proportion of families, and median vehicles per household could be replaced by a single variable

What Is Clustering?
- “Cluster Analysis is a set of methods for constructing a sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual”
  B.S. Everitt, The Cambridge Dictionary of Statistics, 1998
- Divide a set of data (variables) into groups with similar characteristics
- Unsupervised learning technique
- Useful only when there is redundancy in the data

What Is Clustering?
- Similarity measured by distance or correlation metrics
- Types of clustering
  - Hierarchical clustering
    - Agglomerative
    - Divisive
  - Partitive (optimization) clustering
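To make the idea of a correlation-based similarity and an agglomerative hierarchy concrete, here is a minimal sketch that clusters variables (columns) rather than observations. It is not the procedure used in the presentation: the distance 1 - r², the average linkage, the synthetic data, and the variable names are all assumptions chosen for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def cluster_variables(columns, n_clusters):
    """Agglomerative clustering of variables with average linkage.

    `columns` maps a variable name to its list of values. The distance
    between two variables is 1 - r**2, so strongly correlated variables
    (positively or negatively) are close together.
    """
    names = list(columns)
    dist = {
        (a, b): 1.0 - pearson(columns[a], columns[b]) ** 2
        for a in names for b in names if a != b
    }
    clusters = [[name] for name in names]  # one cluster per variable to start

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the two closest clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Synthetic data: two hidden factors, each observed twice with added noise.
rng = random.Random(42)
n_obs = 200
weather = [rng.gauss(0, 1) for _ in range(n_obs)]
density = [rng.gauss(0, 1) for _ in range(n_obs)]


def noisy(base, scale=0.3):
    return [v + rng.gauss(0, scale) for v in base]


columns = {
    "snow_days": noisy(weather),
    "annual_snow": noisy(weather),
    "population_density": noisy(density),
    "car_density": noisy(density),
}
clusters = cluster_variables(columns, n_clusters=2)
print(clusters)
```

Because each pair of variables shares a hidden factor, the two weather variables end up in one cluster and the two density variables in the other. A divisive method would instead start from a single cluster and split it.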

Variable Clustering
- Variable clustering divides a set of numeric* variables into clusters.
- A cluster representing a large set of variables can be replaced by a single member (the cluster representative).
- Selection of the cluster representative
  - Intuitively, we want the cluster representative to be as closely correlated as possible with its own cluster ($R^2_{own} \to 1$) and as uncorrelated as possible with the nearest cluster ($R^2_{nearest} \to 0$).
  - Therefore, the optimal representative of a cluster is the variable whose 1-R2 ratio tends to zero.
* Hamming distance for categorical variables
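The slide does not spell out the 1-R2 ratio. Assuming it follows the definition commonly reported in variable-clustering output (an assumption here, although the Rain Days, Population Density and Population Growth rows of the Example slide are consistent with it), it can be written as:

```latex
\[
  \text{1-}R^2\ \text{ratio}
  \;=\;
  \frac{1 - R^2_{own}}{1 - R^2_{nearest}}
\]
```

As $R^2_{own} \to 1$ the numerator goes to 0, and as $R^2_{nearest} \to 0$ the denominator goes to 1, so a small ratio marks a variable that is close to its own cluster and far from the next one.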

When Does Variable Clustering Occur During the Predictive Modeling Process?
- SEMMA process for data mining
  - Sample
  - Explore
  - Modify
  - Model
  - Assess

Example
3 clusters

| Cluster | Variable | R-squared with Own Cluster | R-squared with Next Closest | 1-R2 Ratio |
|---|---|---|---|---|
| Cluster 1 | Rain Days | 0.5995 | 0.0426 | 0.4183 |
| | Snow Days | 0.8976 | 0.0317 | 0.1095 |
| | Annual Snow | 0.8940 | 0.0314 | |
| Cluster 2 | Population Density | 0.9804 | 0.0228 | 0.0201 |
| | Car Density | | 0.0113 | 0.0199 |
| Cluster 3 | Population Growth | 0.6459 | 0.0911 | 0.3896 |
| | Legal Expenditures | | 0.0013 | 0.3546 |
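The ratio column can be recomputed from the two R-squared columns. The sketch below assumes the ratio is (1 - R² with own cluster) / (1 - R² with next closest cluster), which the slides do not state explicitly, and uses only the Rain Days, Population Density and Population Growth rows as they appear in the table. The function name `one_minus_r2_ratio` is hypothetical.

```python
def one_minus_r2_ratio(r2_own: float, r2_nearest: float) -> float:
    """Assumed definition: (1 - R2 own cluster) / (1 - R2 next closest)."""
    return (1.0 - r2_own) / (1.0 - r2_nearest)


# (R2 with own cluster, R2 with next closest cluster, ratio as printed)
table_rows = {
    "Rain Days": (0.5995, 0.0426, 0.4183),
    "Population Density": (0.9804, 0.0228, 0.0201),
    "Population Growth": (0.6459, 0.0911, 0.3896),
}

recomputed = {
    name: round(one_minus_r2_ratio(own, nearest), 4)
    for name, (own, nearest, _printed) in table_rows.items()
}

for name, (_own, _nearest, printed) in table_rows.items():
    print(f"{name}: recomputed {recomputed[name]:.4f}, printed {printed:.4f}")
```

For these three rows the recomputed values agree with the printed ratios to four decimals. Within a cluster, the variable with the smallest ratio is the natural candidate for cluster representative.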

Example
[Figure: variables and clusters plotted against proportion of variance explained. Axis "Name of Variable or Cluster": Population Density, Legal Expenditures, Snow Days, Annual Snow Accumulation, Rain Days, Population Growth, Car Density. Axis "Proportion of Variance Explained": 1.00 to 0.70.]

Conclusion
- Need for dimension reduction for model generalization
- Variable clustering reduces the number of variables available for predictive modeling (GLM, etc.)
- The predictive modeling process using variable clustering
  - Avoids overfitting
  - Increases interpretability
  - Reduces time for modeling
