Data Mining – Best Practices CAS 2008 Spring Meeting Quebec City, Canada Louise Francis, FCAS, MAAA


Topics in Data Mining Best Practices Introduction: Data Mining for Management Data Quality Data augmentation Data adjustments Method/Software issues Post deployment monitoring References & Resources

Introduction
Research by IBM indicates only 1% of data collected by organizations is used for analysis
Predictive modeling and data mining are widely embraced by leading businesses
– A 2002 Strategic Decision Making survey by Hackett Best Practices determined that world-class companies adopted predictive modeling technologies at twice the rate of other companies
– An important commercial application is customer retention: a 5% increase in retention leads to a 95% increase in profit
– It costs 5 to 10 times more to acquire new business
Another study of 24 leading companies found that they needed to go beyond simple data collection

Successful Implementation of Data Mining Data Mining: Process of discovering previously unknown patterns in databases Needs insights from many different areas Multidisciplinary effort –Quantitative experts –IT –Business experts –Managers –Upper management

Manage the Human Side of Analytics
Data collection
Communicating business benefits
Belief, model understanding, model complexity
‘Tribal’ knowledge as model attributes
Behavioral change and transparency
Disruption in ‘standard’ processes
Threat of obsolescence (automation)
Pay greater attention to the interaction of models and humans
Don’t over-rely on the technology, and recognize the disruptive role you play
Becoming a better practitioner

CRISP-DM Cross Industry Standard Process for Data Mining Standardized approach to data mining

Phases of CRISP-DM

Data Quality Scope of problem How it is addressed New educational resources for actuaries

Survey of Actuaries
Data quality issues have a significant impact on the work of general insurance (P&C) actuaries
– About a quarter of their time is spent on such issues
– About a third of projects are adversely affected
– See “Dirty Data on Both Sides of the Pond”, 2008 CAS Winter Forum
– Data quality issues consume significantly more time on large predictive modelling projects

Statistical Data Editing Process of Checking data for errors and correcting them Uses subject matter experts Uses statistical analysis of data May include using methods to “fill in” missing values Final result of SDE is clean data as well as summary of underlying causes of errors See article in Encyclopedia of Data Warehousing and Data Mining
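The editing loop described on this slide (check, correct, fill in, and summarize the causes) can be sketched as follows. This is a minimal illustration, not the method from the referenced article: the field name, the valid range, and the choice of the median as the fill-in value are all assumptions, and in practice subject matter experts would decide how each flagged value is corrected.

```python
from statistics import median

def edit_records(records, field, low, high):
    """Flag missing or out-of-range values and replace them with the median
    of the valid values (an assumed, deliberately simple fill-in rule)."""
    valid = [r[field] for r in records
             if r[field] is not None and low <= r[field] <= high]
    fill = median(valid)
    cleaned, issues = [], []
    for i, r in enumerate(records):
        value = r[field]
        if value is None:
            issues.append((i, "missing"))
            value = fill
        elif not (low <= value <= high):
            issues.append((i, "out of range"))
            value = fill
        cleaned.append({**r, field: value})
    return cleaned, issues

# Hypothetical claimant records; the age of 212 is a keying error.
claims = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 212},
    {"id": 4, "age": 50},
]
cleaned, issues = edit_records(claims, "age", low=16, high=110)
print(issues)
```

The `issues` list plays the role of the error summary: it records which rows were changed and why, so the underlying causes can be traced back to the data source.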

EDA: Overview
Typically the first step in analyzing data
Purpose:
– Explore the structure of the data
– Find outliers and errors
Uses simple statistics and graphical techniques
Examples include histograms, descriptive statistics and frequency tables
[Figure: data flow diagram. Step 0 Data Requirements; Step 1 Data Collection; Step 2 Transformations and Aggregations; Step 3 Analysis; Step 4 Presentation of Results; Final Step Decisions]
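As a rough sketch of the simple statistics mentioned here, the snippet below computes descriptive statistics and histogram bin counts using only the Python standard library. The paid-loss values and the bin width are made up for illustration.

```python
from collections import Counter
from statistics import mean, median, stdev

def describe(values):
    """Basic descriptive statistics for a numeric column."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }

def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of a bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical paid losses; the last value is a likely outlier or error.
paid = [500, 1200, 800, 950, 1100, 700, 250000]
stats = describe(paid)
bins = histogram(paid, 500)
print(stats["median"], stats["mean"])
print(bins)
```

The large gap between the mean and the median, and the isolated bin far from the rest, are the kind of signals that EDA uses to point at outliers worth investigating.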

EDA: Histograms
[Figure: data flow diagram, Step 0 Data Requirements through Final Step Decisions]

Data Educational Materials Working Party Formation
The closest thing to data quality on the CAS syllabus is the introduction to statistical plans
The CAS Data Management and Information Committee realized that SOX and predictive modeling have increased the need for quality data
So it formed the CAS Data Management Educational Materials Working Party to find and gather materials to educate actuaries

CAS Data Management Educational Materials Working Party Publications
Book reviews of data management and data quality texts in the CAS Actuarial Review, starting with the August 2006 edition
These reviews are combined and compared in “Survey of Data Management and Data Quality Texts,” CAS Forum, Winter 2007
This presentation references our recently published paper “Actuarial IQ (Information Quality),” published in the Winter 2008 edition of the CAS Forum

Data Flow
[Figure: data flow diagram. Step 0 Data Requirements; Step 1 Data Collection; Step 2 Transformations and Aggregations; Step 3 Analysis; Step 4 Presentation of Results; Final Step Decisions]
Information quality involves all steps:
– Data Requirements
– Data Collection
– Transformations & Aggregations
– Actuarial Analysis
– Presentation of Results
To improve the Final Step: Making Decisions

Data Augmentation Add information from Internal data Add information from external data For overview of inexpensive sources of data see: “Free and Cheap Sources of Data”, 2007 Predictive modeling seminar and “External Data Sources” at 2008 Ratemaking Seminar

Data Augmentation – Internal Data Create aggregated statistics from internal data sources –Number of lawyers per zip –Claim frequency rate per zip –Frequency of back claims per state Use unstructured data –Text Mining
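One way to build an aggregated statistic such as claim frequency per zip is sketched below. The record layout (`zip`, `exposure`) and the sample values are assumptions for illustration; the resulting rate is then attached to each policy as a new predictor.

```python
from collections import defaultdict

def frequency_by_zip(policies, claims):
    """Claim frequency per zip code: claim count divided by earned exposure."""
    exposure = defaultdict(float)
    counts = defaultdict(int)
    for p in policies:
        exposure[p["zip"]] += p["exposure"]
    for c in claims:
        counts[c["zip"]] += 1
    return {z: counts[z] / exposure[z] for z in exposure if exposure[z] > 0}

# Hypothetical internal data.
policies = [
    {"policy": "A", "zip": "02101", "exposure": 1.0},
    {"policy": "B", "zip": "02101", "exposure": 1.0},
    {"policy": "C", "zip": "10001", "exposure": 4.0},
]
claims = [{"policy": "A", "zip": "02101"}, {"policy": "C", "zip": "10001"}]

freq = frequency_by_zip(policies, claims)
# Append the aggregated statistic to each policy record.
augmented = [{**p, "zip_claim_freq": freq[p["zip"]]} for p in policies]
print(augmented)
```

A real implementation would usually compute the aggregate on a period that excludes the record being scored, so the new variable does not leak the target into the model.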

Data Adjustments Trend –Adjust all records to common cost level –Use model to estimate trend Development –Adjust all losses to ultimate –Adjust all losses to a common age –Use model to estimate future development
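The arithmetic behind these two adjustments can be sketched as below. The 5% annual trend and the age-to-ultimate development factors are hypothetical fixed values chosen for illustration; the slide suggests they would normally be estimated with a model.

```python
def to_common_cost_level(loss, accident_year, target_year, annual_trend):
    """Trend a loss from its accident year to the target year's cost level."""
    return loss * (1 + annual_trend) ** (target_year - accident_year)

def to_ultimate(loss, age_months, factors_to_ultimate):
    """Develop a loss to ultimate using an age-to-ultimate factor."""
    return loss * factors_to_ultimate[age_months]

# Hypothetical assumptions, not taken from the presentation.
ANNUAL_TREND = 0.05
FACTORS_TO_ULTIMATE = {12: 2.0, 24: 1.5, 36: 1.25}

record = {"loss": 1000.0, "accident_year": 2006, "age_months": 24}
developed = to_ultimate(record["loss"], record["age_months"], FACTORS_TO_ULTIMATE)
adjusted = to_common_cost_level(developed, record["accident_year"], 2008, ANNUAL_TREND)
print(developed, round(adjusted, 2))
```

Applying both steps to every record puts all losses on the same footing before modeling, so that accident year and maturity do not masquerade as predictive signal.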

KDnuggets Poll on Data

Methods: What are data miners using? How well does it work?

Major Kinds of Data Mining
Supervised learning
– Most common situation
– A dependent variable: frequency, loss ratio, fraud/no fraud
– Some methods: regression, trees/machine learning, some neural networks
Unsupervised learning
– No dependent variable
– Group like records together: a group of claims with similar characteristics might be more likely to be fraudulent
– Ex: territory assignment, text mining
– Some methods: association rules, k-means clustering, Kohonen neural networks

KDnuggets Poll on Methods

KDnuggets Poll on Open Source Software

The Supervised Methods and Software Evaluated
Research by Derrig and Francis
1) TREENET
2) Iminer Tree
3) S-PLUS Tree
4) CART
5) S-PLUS Neural
6) Iminer Neural
7) Iminer Ensemble
8) MARS
9) Random Forest
10) Exhaustive CHAID
11) Naïve Bayes (Baseline)
12) Logistic regression (Baseline)

TREENET ROC Curve – IME Explain AUROC AUROC = 0.701
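For readers unfamiliar with AUROC, the sketch below computes it directly from its probabilistic interpretation: the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The labels and scores are made up and are unrelated to the 0.701 result reported on the slide.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A positive "wins" a pair when it outscores the negative; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model scores: 1 = target event occurred, 0 = it did not.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.2, 0.4]
print(round(auroc(labels, scores), 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of records, so it is meant for explanation rather than for large data sets.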

Plot of AUROC for SIU vs. IME Decision

Monitoring Models Monitor use of model Monitor data going into model Monitor performance –This requires more mature data

Novelty Detection Problem Statements: At the time of underwriting a risk, how different is the subject risk from the data used to build the model? How are the differences, if any, logically grouped for business meaning An example of model interaction with people to improve business outcomes

Clustering Methods Make Models
1. Select the features that you are interested in clustering, e.g. Demographics, Risk, Auto, Employment
2. Run cluster algorithms within the grouped features to find homogeneous groups (let the data tell you the groupings). Each member has a distance to the ‘center’ of the cluster.
3. Explore each cluster and statistically describe it compared to the entire ‘world’ from the training data; create thresholds for the distance to the center that you care about; may add additional description and learning
4. Assign business meaning (names) to cluster members; homogeneous group; deploy; score new data as it becomes available
5. Look at novelty within each cluster on the new sample; distance, single variable differences
6. Use the threshold to determine differences from the cluster membership
7. Investigate for business impact or unexpected changes
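The clustering, threshold, and scoring steps above can be sketched with a plain k-means implementation. Everything specific here is an assumption for illustration: the two features, the starting centers, and the rule that the threshold is twice the largest training distance in each cluster. The presentation does not say which algorithm or threshold rule was used.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, centers):
    """Index of the closest cluster center."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))

def kmeans(points, centers, iterations=20):
    """Lloyd's k-means starting from the given initial centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Move each center to the mean of its members (keep it if empty).
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers

def distance_thresholds(points, centers, multiplier=2.0):
    """Assumed rule: threshold = multiplier * largest training distance per cluster."""
    worst = [0.0] * len(centers)
    for p in points:
        i = nearest(p, centers)
        worst[i] = max(worst[i], dist(p, centers[i]))
    return [multiplier * w for w in worst]

def novelty(point, centers, limits):
    """Return (cluster index, distance to center, is_novel) for a new record."""
    i = nearest(point, centers)
    d = dist(point, centers[i])
    return i, d, d > limits[i]

# Hypothetical training data: (driver age, years insured).
training = [(25, 1), (27, 2), (26, 1), (60, 8), (62, 9), (61, 8)]
centers = kmeans(training, centers=[(25, 1), (60, 8)])
limits = distance_thresholds(training, centers)

print(novelty((26, 2), centers, limits))   # close to the first cluster
print(novelty((45, 5), centers, limits))   # far from both training clusters
```

In practice the features would be standardized before clustering so that no single variable dominates the distance, and the flagged records would go to a person for the business investigation described in the last step.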

Novelty Score Uses Dimensional Novelty –Market Cycles –Policy Limits –Exposure –Geography –Demographics Operationalize –Book drift –Evaluation of pricing and marketing activities –Model refresh cycle –Regulatory Support Novelty Score: to detect ‘drift’ of aspects of clusters in predictor data over time

Example – Automobile Insurance Data

Six clusters with the following statistical profile and distribution in the sample set; look at the data and assign names to the groups (in this case 3 variables)
[Figure: WORLD demographic features and clusters, the view of the current book]

Display the distribution of named clusters within the grouping of features (Demographic Cluster) in the test set
[Figure: view of the clusters in the current book of business within Demographics]

Monitor the changes in the distribution of the clusters in the data over time
[Figure: cluster distribution for the initial customer base and after 6 months]
Two clusters now show up in different percentages
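A simple way to monitor this kind of shift is to compare each cluster's share of the book between two snapshots and flag the clusters whose share moved by more than a tolerance. The cluster names, the counts, and the 5 percentage point tolerance below are hypothetical.

```python
from collections import Counter

def shares(assignments):
    """Fraction of records assigned to each cluster."""
    counts = Counter(assignments)
    total = len(assignments)
    return {name: counts[name] / total for name in counts}

def cluster_drift(baseline, current, tolerance=0.05):
    """Change in cluster share between two snapshots, plus the flagged clusters."""
    base, cur = shares(baseline), shares(current)
    names = sorted(set(base) | set(cur))
    changes = {n: cur.get(n, 0.0) - base.get(n, 0.0) for n in names}
    flagged = [n for n in names if abs(changes[n]) > tolerance]
    return changes, flagged

# Hypothetical cluster assignments for the same book at two points in time.
initial = ["young urban"] * 30 + ["family"] * 50 + ["retired"] * 20
after_6_months = ["young urban"] * 45 + ["family"] * 35 + ["retired"] * 20

changes, flagged = cluster_drift(initial, after_6_months)
print(flagged)
```

Flagged clusters are a prompt for investigation, for example to check whether a marketing campaign or a pricing change explains the movement, or whether the model needs to be refreshed.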

Humility Models incorporate significant uncertainties about parameters When deployed, models will likely not be as good as they were on historic data Need to appreciate the limitations of the models

Additional References
Encyclopedia of Data Warehousing and Data Mining, John Wang
For GLMs: 2004 CAS Discussion Paper Program
2008 Discussion Paper Program on multivariate methods
“Distinguishing the Forest From the Trees”, 2006 Winter Forum, updated at
– See other papers by Francis on the CAS web site
Data Preparation for Data Mining Using SAS, Mamdouh Refaat
Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner, Shmueli, Patel and Bruce

Questions?