Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial.


1 Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. Louise_francis@msn.com www.data-mines.com

2 Objectives: Discuss the topic of data quality. Present methods for screening data for problems (errors, missing data). Present methods of fixing problems.

3 Data Quality: A Problem Actuary doing reconciliation on February 15

4 It’s Not Just Us “In just about any organization, the state of information quality is at the same low level” Olson, Data Quality

5 AAA Standards of Practice. AAA SOP: Review data for completeness, accuracy and relevance. IDMA and CAS white paper: Evaluate data for validity, accuracy, reasonableness and completeness.

6 Example Data: Private passenger auto. Some variables are: age, gender, marital status, zip code, earned premium, number of claims, incurred losses, paid losses.

7 Screening Data: Many of the methods have been in use for a while. Pioneered in the field of exploratory data analysis. More recently: missing data methods for dealing with some quality problems.

8 Some Methods for Numeric Data. Visual: histograms, box and whisker plots. Statistical: descriptive statistics, data spheres.

9 Histograms Can do them in Microsoft Excel

10 Histograms Frequencies for Age Variable

11 Histograms of Age Variable Varying Window Size

12 Formula for Window Width
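
The formula on this slide did not survive in the transcript. As an illustration only, the sketch below uses Scott's rule, h = 3.49 · s · n^(-1/3), one commonly cited choice of histogram window (bin) width; whether this is the rule shown on the slide is an assumption. The function names and the sample ages are hypothetical.

```python
import statistics


def scott_bin_width(values):
    """Scott's rule for bin width: 3.49 * sample std dev * n^(-1/3)."""
    n = len(values)
    s = statistics.stdev(values)
    return 3.49 * s / n ** (1 / 3)


def histogram_counts(values, width):
    """Count values per bin; keys are the left edges of the bins."""
    lo = min(values)
    counts = {}
    for v in values:
        k = int((v - lo) // width)      # index of the bin holding v
        left = lo + k * width           # left edge of that bin
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical ages, including one suspicious value (150).
ages = [19, 23, 27, 31, 34, 38, 42, 47, 55, 150]
width = scott_bin_width(ages)
counts = histogram_counts(ages, width)
```

Rerunning `histogram_counts` with a smaller or larger `width` shows how the window size changes what the histogram reveals, which is the point of the preceding slide.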

13 Example of Suspicious Value

14 Discrete-Numeric Data

15 Filtered Data Filter out Unwanted Records

16 Box and Whisker Plot

17 Box and Whisker Example
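
The plot itself is not in the transcript. A minimal sketch of the screening rule that box and whisker plots conventionally use is given here: values beyond 1.5 times the interquartile range from the quartiles are flagged as potential outliers. The multiplier 1.5 is the usual convention rather than something stated on the slide, and the function names and ages are hypothetical.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: quartiles extended by k * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """List the values that fall outside the fences."""
    lo, hi = tukey_fences(values, k)
    return [v for v in values if v < lo or v > hi]


# Hypothetical ages with one implausible entry.
ages = [22, 25, 31, 34, 38, 41, 45, 52, 58, 150]
suspicious = flag_outliers(ages)
```

Flagged values are candidates for review, not automatic deletions; for heavy-tailed variables such as paid losses, many legitimate values will fall outside the fences.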

18 Plot of Heavy Tailed Data Paid Losses

19 Heavy Tailed Data – Log Scale

20 Descriptive Statistics

21 Mahalanobis Distance
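
The slide body is missing from the transcript. As a sketch of the standard definition, the squared Mahalanobis distance of a record x from the mean m is (x − m)ᵀ S⁻¹ (x − m), where S is the sample covariance matrix. The code below handles the two-variable case with the standard library only; the function name and the sample data (age, earned premium) are hypothetical.

```python
import statistics


def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (x, y) pair from the sample mean."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    n = len(points)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    # Sample covariance matrix entries (divisor n - 1).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form using the closed-form inverse of a 2x2 matrix.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(d2)
    return distances


# Hypothetical (age, earned premium) records.
records = [(25, 900.0), (32, 820.0), (41, 760.0), (48, 700.0), (56, 640.0), (30, 400.0)]
d2 = mahalanobis_sq_2d(records)
```

Records with the largest distances are the ones furthest from the bulk of the data once scale and correlation are taken into account, so they are natural candidates for inspection.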

22 Data Spheres Example: Longitude and Latitude

23 Sample from Highest Percentile

24 Categorical Data: Data Cubes

25 Example – Marital Status

26 Screening for Missing Data

27 Blanks as Missing
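
A minimal sketch of this kind of screen, assuming records arrive as dictionaries of raw field values: empty strings, whitespace-only strings and `None` are all counted as missing. The function names and sample records are hypothetical.

```python
def is_blank(value):
    """Treat None, empty strings and whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def missing_counts(records):
    """Count missing values per field across all records."""
    counts = {}
    for rec in records:
        for field, value in rec.items():
            counts[field] = counts.get(field, 0) + (1 if is_blank(value) else 0)
    return counts


# Hypothetical policy records as read from a text file.
records = [
    {"age": "34", "gender": "F"},
    {"age": "", "gender": "M"},
    {"age": "  ", "gender": None},
]
counts = missing_counts(records)
```

Dividing each count by `len(records)` gives the share of records missing each variable.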

28 Types of Missing Values: Missing completely at random. Missing at random. Informative missing.

29 Methods for Missing Values: Drop record if any variable used in model is missing. Drop variable. Data imputation. Other: CART and MARS use surrogate variables; expectation maximization.

30 Imputation: A method to “fill in” a missing value. Use other variables (which have values) to predict the value of the missing variable. Involves building a model for the variable with the missing value: $Y = f(x_1, x_2, \ldots, x_n)$

31 Example: Age Variable. About 14% of records are missing values. Imputation will be illustrated with a simple regression model: $\text{Age} = a + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n$
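
The regression imputation described above can be sketched as follows. This is a single-predictor version fitted by ordinary least squares, not the model from the presentation; the predictor `years_licensed`, the function names and the sample records are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def impute_age(records, predictor="years_licensed"):
    """Fill missing ages with predictions from records where age is known."""
    complete = [r for r in records if r["age"] is not None]
    a, b = fit_line([r[predictor] for r in complete],
                    [r["age"] for r in complete])
    filled = []
    for r in records:
        if r["age"] is None:
            filled.append(dict(r, age=a + b * r[predictor]))  # predicted age
        else:
            filled.append(dict(r))                            # keep observed age
    return filled


# Hypothetical records; the last one is missing age.
records = [
    {"years_licensed": 2, "age": 18},
    {"years_licensed": 10, "age": 26},
    {"years_licensed": 20, "age": 36},
    {"years_licensed": 30, "age": None},
]
filled = impute_age(records)
```

Observed ages are left untouched; only the missing value is replaced by the fitted prediction. With several predictors the same idea applies, with the line fit replaced by a multiple regression.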

32 Model for Age

33 Censorship Problem: Property and casualty insurance data is typically censored. We do not know the final settlement value for the data. Adjustments must be made to avoid erroneous models: use ultimates; mix adjust.

34 Example From Ignoring Censorship

35 Metadata: Data about data. Detailed description of the variables in the file, their meaning and permissible values.

36 Conclusions: Data quality is a significant problem in insurance and in other industries. Statistical methods can be used to detect and remediate data quality problems. How do we get better data?

37 Conclusions “In the end, the best defense is relentless monitoring of data and metadata” Dasu and Johnson, Exploratory Data Mining and Data Cleaning

