1 Data Preparation Part 1: Exploratory Data Analysis & Data Cleaning, Missing Data CAS 2007 Ratemaking Seminar Louise Francis, FCAS Francis Analytics and.


1 1 Data Preparation Part 1: Exploratory Data Analysis & Data Cleaning, Missing Data CAS 2007 Ratemaking Seminar Louise Francis, FCAS Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com Louise_francis@msn.com

2 2 Objectives: Introduce data preparation and where it fits in the modeling process. Discuss data quality. Focus on a key part of data preparation, exploratory data analysis: identify data glitches and errors, understand the data, identify possible transformations. What to do about missing data. Provide resources on data preparation.

3 3 CRISP-DM: Guidelines for data mining projects. Gives an overview of the life cycle of a data mining project. Defines the different phases and the activities that take place in each phase.

4 4 Modelling Process

5 5 Data Preprocessing

6 6 Data Quality Problem

7 7 Data Quality: A Problem Actuary reviewing a database

8 8 It’s Not Just Us “In just about any organization, the state of information quality is at the same low level” Olson, Data Quality

9 9 Some Consequences of Poor Data Quality: Affects the quality (precision) of the result. Can’t do the modeling project because of data problems. If errors are not found, the result is a modeling blunder.

10 10 Data Exploration in Predictive Modeling

11 11 Exploratory Data Analysis: Typically the first step in analyzing data. Makes heavy use of graphical techniques. Also makes use of simple descriptive statistics. Purpose: find outliers (and errors); explore the structure of the data.

12 12 Definition of EDA: Exploratory data analysis (EDA) is that part of statistical practice concerned with reviewing, communicating and using data where there is a low level of knowledge about its cause system. Many EDA techniques have been adopted into data mining and are being taught to young students as a way to introduce them to statistical thinking. - www.wikipedia.org

13 13 Example Data: Private passenger auto. Some variables are: age, gender, marital status, zip code, earned premium, number of claims, incurred losses, paid losses, legal representation, suspicion score (of fraud).

14 14 Some Methods for Numeric Data. Visual: histograms, box and whisker plots, stem and leaf plots. Statistical: descriptive statistics, data spheres.

15 15 Histograms Can do them in Microsoft Excel

16 16 Histograms Frequencies for Age Variable

17 17 Histograms of Age Variable Varying Window Size

18 18 Formula for Window Width
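The formula shown on this slide is not preserved in the transcript. As an illustration only, the sketch below uses Scott's rule, h = 3.49 * s * n^(-1/3), which is one common choice of window width and may differ from the slide's formula. The function names `scott_bin_width` and `histogram_counts` and the sample ages are hypothetical.

```python
import math
import statistics
from collections import Counter


def scott_bin_width(values):
    """Scott's rule: h = 3.49 * sample std dev * n^(-1/3).

    An assumed rule for illustration, not necessarily the slide's formula.
    """
    n = len(values)
    return 3.49 * statistics.stdev(values) * n ** (-1 / 3)


def histogram_counts(values, width):
    """Count values per bin; bin k covers [min + k*width, min + (k+1)*width)."""
    low = min(values)
    counts = Counter(math.floor((v - low) / width) for v in values)
    # Key each bin by its left edge.
    return {round(low + k * width, 2): counts[k] for k in sorted(counts)}


# Made-up ages; 0 and 99 are the kind of suspicious values a histogram can expose.
ages = [0, 18, 22, 25, 25, 31, 34, 38, 41, 45, 47, 52, 58, 63, 99]
width = scott_bin_width(ages)
print(f"rule-based window width = {width:.1f}")
print(histogram_counts(ages, width))
print(histogram_counts(ages, 10))  # a hand-picked window of 10 for comparison
```

Running the same data through two window widths mirrors the preceding slide: the picture of the age distribution changes with the width chosen.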

19 19 Example of Suspicious Value

20 20 Discrete-Numeric Data

21 21 Filtered Data Filter out Unwanted Records

22 22 Box Plot Basics: Five-Point Summary. Minimum, 1st quartile, median, 3rd quartile, maximum.

23 23 Functions for Five-Point Summary: =min(data range); =quartile(data range,1); =median(data range); =quartile(data range,3); =max(data range)
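For readers working outside Excel, a minimal Python sketch of the same five-point summary follows. It assumes that `statistics.quantiles` with `method="inclusive"` interpolates the way Excel's QUARTILE does; the function name `five_number_summary` and the sample values are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, 1st quartile, median, 3rd quartile, maximum)."""
    # method="inclusive" treats the data as including its own min and max,
    # which is assumed here to correspond to Excel's QUARTILE.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


paid_losses = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up values
print(five_number_summary(paid_losses))  # (1, 3.0, 5.0, 7.0, 9)
```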

24 24 Box and Whisker Plot

25 25 Plot of Heavy Tailed Data Paid Losses

26 26 Heavy Tailed Data – Log Scale
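A small sketch of why a log scale helps with heavy-tailed paid losses: the log transform compresses a range spanning several orders of magnitude into a few units. The loss amounts are made up for illustration.

```python
import math

paid_losses = [250, 900, 1_200, 4_800, 15_000, 62_000, 1_250_000]  # made-up values

# log10 is only defined for strictly positive amounts, so zero or negative
# losses would need separate handling before plotting on a log scale.
log_losses = [math.log10(x) for x in paid_losses]

print(max(paid_losses) / min(paid_losses))           # 5000.0
print(round(max(log_losses) - min(log_losses), 2))   # 3.7
```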

27 27 Box and Whisker Example

28 28 Descriptive Statistics Analysis ToolPak

29 29 Descriptive Statistics: Claimant age has minimum and maximum values that are impossible.
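Once descriptive statistics reveal an impossible minimum or maximum, the offending records can be listed with a simple range check. This is a sketch only: the function name `out_of_range`, the field name, and the permissible limits of 0 to 110 are assumptions, not values from the presentation.

```python
def out_of_range(records, field, low, high):
    """Return (row index, value) pairs whose value lies outside [low, high].

    Missing values (None) are skipped here; they are a separate issue.
    """
    return [
        (i, rec[field])
        for i, rec in enumerate(records)
        if rec[field] is not None and not (low <= rec[field] <= high)
    ]


# Made-up claimant records.
claims = [{"age": 34}, {"age": -1}, {"age": 130}, {"age": None}, {"age": 58}]
print(out_of_range(claims, "age", 0, 110))  # [(1, -1), (2, 130)]
```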

30 30 Multivariate EDA: Often want to review relationships between multiple variables at one time. What structures exist? What correlations exist? Identify outliers.

31 31 Scatterplot Matrices

32 32 Panel Histogram

33 33 Data Spheres: The Mahalanobis Distance Statistic
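The slide's own formula is not preserved in the transcript. As a hedged sketch, the standard squared Mahalanobis distance of a point x from the sample mean m is d^2 = (x - m)' S^(-1) (x - m), where S is the sample covariance matrix. The code below writes this out for two variables using only the standard library; the function name `mahalanobis_2d` and the data are hypothetical.

```python
import statistics


def mahalanobis_2d(points):
    """Squared Mahalanobis distance of each (x, y) point from the sample mean."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    n = len(points)
    # Sample covariance matrix S = [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # zero if the two variables are perfectly correlated
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) * inverse(S) * (dx, dy)', expanded for the 2x2 case.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


# Made-up (age, suspicion score) pairs; the last record is deliberately unusual.
data = [(30, 1.0), (35, 1.2), (40, 1.1), (45, 1.5), (50, 1.4), (38, 9.0)]
d2 = mahalanobis_2d(data)
most_unusual = max(range(len(data)), key=lambda i: d2[i])
print(most_unusual, round(d2[most_unusual], 2))
```

Records with the largest distances are the candidates to flag for review, since the statistic accounts for the scale of each variable and the correlation between them.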

34 34 Screening Many Variables at Once: Plot of longitude and latitude of zip codes in the data. Examination of outliers indicated drivers in CA and PR even though the policies were in only one mid-Atlantic state.

35 35 Records With Unusual Values Flagged

36 36 Categorical Data: Data Cubes

37 37 Categorical Data: Data cubes. Usually frequency tables. Search for missing values coded as blanks.

38 38 Categorical Data Table highlights inconsistent coding of marital status
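A frequency table of this kind can be sketched in a few lines. The marital status codes below are made up to show the two problems a frequency table surfaces: inconsistent coding of the same category and blanks standing in for missing values.

```python
from collections import Counter

# Made-up codes: "M", "m" and "Married" all mean the same thing.
marital_status = ["M", "S", "Married", "m", "", "S", "D", "Single", " ", "M"]

freq = Counter(marital_status)
for code, count in sorted(freq.items()):
    print(repr(code), count)  # repr() makes blank codes visible

# Blank or whitespace-only codes are treated as missing here (an assumption).
n_blank = sum(count for code, count in freq.items() if not code.strip())
print("blank:", n_blank)  # blank: 2
```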

39 39 Population Pyramid

40 40 Missing Data

41 41 Screening for Missing Data

42 42 Blanks as Missing
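A minimal sketch of screening every field for blanks at once. It assumes that a value of `None`, an empty string, or a whitespace-only string all mean "missing"; the function name `missing_counts` and the records are hypothetical.

```python
def missing_counts(records, fields):
    """Count, per field, the records where the value is absent, None, or blank."""
    counts = {f: 0 for f in fields}
    for rec in records:
        for f in fields:
            value = rec.get(f)
            if value is None or str(value).strip() == "":
                counts[f] += 1
    return counts


# Made-up policy records.
policies = [
    {"age": 34, "gender": "F", "marital_status": "M"},
    {"age": None, "gender": "", "marital_status": "S"},
    {"age": 51, "gender": "M", "marital_status": "  "},
    {"age": None, "gender": "F"},
]
print(missing_counts(policies, ["age", "gender", "marital_status"]))
# {'age': 2, 'gender': 1, 'marital_status': 2}
```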

43 43 Types of Missing Values: Missing completely at random. Missing at random. Informative missing.

44 44 Methods for Missing Values: Drop the record if any variable used in the model is missing. Drop the variable. Data imputation. Other: CART and MARS use surrogate variables; expectation maximization.

45 45 Imputation: A method to “fill in” a missing value. Use other variables (which have values) to predict the value of the missing variable. Involves building a model for the variable with the missing value: Y = f(x_1, x_2, ..., x_n)

46 46 Example: Age Variable. About 14% of records are missing values. Imputation will be illustrated with a simple regression model: Age = a + b_1 X_1 + b_2 X_2 + ... + b_n X_n
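The regression imputation described here can be sketched with a single predictor, a simplification of the multi-variable model above. The model is fitted on records where age is present and then used to predict age where it is missing. The predictor `years_licensed`, the function names, and the data are hypothetical and are not the variables used in the presentation's model.

```python
import statistics


def fit_simple_regression(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


def impute_age(records, predictor="years_licensed"):
    """Fill missing 'age' values using a regression fitted on complete records."""
    complete = [r for r in records if r["age"] is not None]
    a, b = fit_simple_regression(
        [r[predictor] for r in complete], [r["age"] for r in complete]
    )
    return [
        dict(r, age=a + b * r[predictor], age_imputed=True)
        if r["age"] is None
        else dict(r, age_imputed=False)
        for r in records
    ]


# Made-up records in which age happens to equal 16 + years_licensed.
drivers = [
    {"years_licensed": 2, "age": 18},
    {"years_licensed": 10, "age": 26},
    {"years_licensed": 24, "age": 40},
    {"years_licensed": 30, "age": None},
]
for row in impute_age(drivers):
    print(row)
```

Keeping an `age_imputed` flag alongside the filled value is a design choice of this sketch: it lets later analysis distinguish observed ages from predicted ones.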

47 47 Model for Age

48 48 Missing Values: A problem for many traditional statistical models, which eliminate from the analysis any record with a missing value on any variable. Many data mining procedures have built-in techniques for handling missing values. If too many records are missing on a given variable, the variable probably needs to be discarded.

49 49 Metadata

50 50 Metadata: Data about data. A reference that can be used in future modeling projects. Detailed description of the variables in the file, their meaning and permissible values.

51 51 Many Other Facets to Data Preparation: Variable transformation. Normalization. Sparse data. Data reduction. Derived variables.

52 52 Library for Getting Started: Dasu and Johnson, Exploratory Data Mining and Data Cleaning, Wiley, 2003. Francis, L.A., “Dancing with Dirty Data: Methods for Exploring and Cleaning Data”, CAS Winter Forum, March 2005, www.casact.org. Find a comprehensive book for doing analysis in Excel, such as Joseph Schmuller, Statistical Analysis With Excel for Dummies. Pyle, Dorian, Data Preparation for Data Mining, Morgan Kaufmann.

