Data Mining Process Source: CRISP-DM (SPSS.com website)

1 Data Mining Process Source: CRISP-DM (SPSS.com website)

2 Data Preparation
Data Aggregation: Turning raw data into variables at the right level of aggregation for analysis.
New Variable Creation: Typically creating ratio variables from existing ones, e.g. Number of Crimes / Population, Number of Gold Medals / Population.
Data Cleaning: Includes preliminary analysis to find:
  Missing data
  Outliers
  Misclassification, incorrect coding of values
Variable Transformation:
  Creating dummies
  Creating quadratic, log, or other transforms
  Creating interaction terms
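The variable creation and transformation steps listed on this slide can be sketched in a few lines of Python. This is only an illustration: the records, the field names (`region`, `crimes`, `population`, `income`) and the helper `prepare` are hypothetical and do not come from the presentation.

```python
import math

# Hypothetical records; the field names and values are illustrative only.
rows = [
    {"region": "north", "crimes": 120, "population": 40000, "income": 52000.0},
    {"region": "south", "crimes": 300, "population": 75000, "income": 38000.0},
    {"region": "north", "crimes": 45, "population": 30000, "income": 61000.0},
]

def prepare(row, regions):
    """Return a copy of one record with derived variables added."""
    out = dict(row)
    # Ratio variable: number of crimes divided by population.
    out["crime_rate"] = row["crimes"] / row["population"]
    # Dummy variables: one 0/1 indicator per region, with the first
    # level left out as the reference category.
    for level in regions[1:]:
        out[f"region_{level}"] = 1 if row["region"] == level else 0
    # Log and quadratic transforms of income.
    out["log_income"] = math.log(row["income"])
    out["income_sq"] = row["income"] ** 2
    # Interaction term between two of the derived variables.
    out["rate_x_log_income"] = out["crime_rate"] * out["log_income"]
    return out

regions = sorted({r["region"] for r in rows})
prepared = [prepare(r, regions) for r in rows]
```

Dropping one dummy level is a common choice when the variables feed a regression with an intercept; other models may want all levels kept.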

3 Data Cleaning
MIS Issues (Source: Article by Ralph Kimball)
Analyst Issues

4 MIS Issues
Elementizing (Parsing)
Standardizing
Verifying
Matching, Householding
Documenting

5 Elementizing
Ralph B and Julianne Kimball Trustees for Kimball Fred C Ste Hiway 9 Box 1234 Boulder Crk Colo 95006

6
Addressee First Name(1): Ralph
Addressee Middle Initial(1): B
Addressee Last Name(1): Kimball
Addressee First Name(2): Julianne
Addressee Last Name(2): Kimball
Addressee Relationship: Trustees for
Relationship Person First Name: Fred
Relationship Person Middle Name: C
Relationship Person Last Name: Kimball
Street Address Number:
Street Name: Hiway 9
Suite Number: 116
Post Office Box Number: 1234
City: Boulder Crk
State: Colo
Five Digit Zip: 95006
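As a rough illustration of elementizing, the sketch below pulls only the trailing address elements out of the raw string. It assumes the record ends with "Box number, city, state, five-digit zip" in that order; the pattern `TAIL` and the function `elementize_tail` are hypothetical names, and real address parsers handle far more variation than this.

```python
import re

raw = ("Ralph B and Julianne Kimball Trustees for Kimball Fred C "
       "Ste Hiway 9 Box 1234 Boulder Crk Colo 95006")

# Assumption: the record ends with "Box <number> <city> <state> <5-digit zip>".
TAIL = re.compile(
    r"Box (?P<po_box>\d+) (?P<city>.+) (?P<state>\S+) (?P<zip5>\d{5})$"
)

def elementize_tail(text):
    """Return the trailing address elements as a dict, or None if no match."""
    match = TAIL.search(text)
    return match.groupdict() if match else None

elements = elementize_tail(raw)
```

The names and the relationship ("Trustees for") are left unparsed here because splitting them reliably needs name dictionaries and rules that a single regular expression cannot capture.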

7 Standardizing
Ste = Suite
Hiway 9 = Highway 9
Other example: Grade “D” = Distinction in Australia
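Standardizing can be sketched as a lookup from known variants to one agreed form. The table `STANDARD_FORMS` and the function `standardize` are hypothetical; only the "Ste" and "Hiway" entries come from the slide, and the other entries are assumptions added for illustration.

```python
# Hypothetical lookup table; a production table would be far larger.
STANDARD_FORMS = {
    "ste": "Suite",       # from the slide
    "hiway": "Highway",   # from the slide
    "crk": "Creek",       # assumed expansion
    "colo": "Colorado",   # assumed expansion
}

def standardize(text):
    """Replace each known abbreviation or variant with its standard form."""
    words = text.split()
    return " ".join(
        STANDARD_FORMS.get(word.lower().rstrip("."), word) for word in words
    )
```

For example, `standardize("Hiway 9")` gives `"Highway 9"`. A word-by-word lookup like this ignores context, which is exactly why a code such as grade "D" needs a table specific to its country or source system.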

8 Verification
Zip code 95006 is in California (CA), not Colorado.
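A verification step of this kind can be sketched as a cross-check between the zip code and the stated state. The tables `ZIP_TO_STATE` and `STATE_CODES` and the function `verify_state` are hypothetical and hold only a couple of entries; a real check would rely on a complete postal reference file.

```python
# Hypothetical reference data; a real check would use a full postal database.
ZIP_TO_STATE = {"95006": "CA", "80302": "CO"}
STATE_CODES = {"Colo": "CO", "Colorado": "CO", "Calif": "CA", "California": "CA"}

def verify_state(zip5, state_text):
    """Return (is_consistent, expected_state_code) for a zip/state pair."""
    expected = ZIP_TO_STATE.get(zip5)
    if expected is None:
        return (None, None)  # zip not in the reference table: cannot verify
    given = STATE_CODES.get(state_text, state_text.upper())
    return (given == expected, expected)

result = verify_state("95006", "Colo")
```

Here `result` flags the record as inconsistent and reports the state implied by the zip code, leaving the decision about which field to correct to a later step.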

9 Matching/Householding
Match the record with other customer records containing Ralph and Julianne Kimball.
Establish that they are part of the same household.
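Householding can be sketched as grouping customer records on a shared key. The rule used below (same last name at the same mailing address means one household) is an assumption made for illustration, and the records, field names and `household_key` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical customer records, already elementized and standardized.
customers = [
    {"id": 1, "first": "Ralph", "last": "Kimball", "po_box": "1234", "zip5": "95006"},
    {"id": 2, "first": "Julianne", "last": "Kimball", "po_box": "1234", "zip5": "95006"},
    {"id": 3, "first": "Ralph", "last": "Kimball", "po_box": "77", "zip5": "10001"},
]

def household_key(rec):
    # Assumption: same last name at the same mailing address = one household.
    return (rec["last"].lower(), rec["po_box"], rec["zip5"])

households = defaultdict(list)
for rec in customers:
    households[household_key(rec)].append(rec["id"])
```

Records 1 and 2 land in one household while record 3, a different address, stays separate. Exact-key grouping only works after standardizing; unstandardized data usually needs fuzzy matching instead.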

10 Analyst Issues
Physical data problems
Data Dictionaries
Validation (Frequencies)
Missing Data
The “zero” value problem
Inappropriate (Future) data for modeling
Unavailable data

11 Physical
Cannot access data
ASCII vs EBCDIC
On a medium that you can’t use (certain type of tape, for instance)

12 Data Dictionaries
What are the fields?
Where are they located?
What format are they stored in?

13 Missing Data
Ignore
Find the right values if you can
Use the average for that variable
Replace with a number that matches its characteristics (What do the missing people look like in terms of the dependent variable? Who else looks like that?)
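Two of the options above, filling with the overall average and filling with a value taken from similar records, can be sketched as follows. Using the mean of a known segment as the "similar records" value is a simplifying assumption, not the method from the presentation; the records and the functions `impute_overall` and `impute_by_group` are hypothetical.

```python
from statistics import mean

# Hypothetical records; None marks a missing income.
people = [
    {"segment": "A", "income": 40.0},
    {"segment": "A", "income": 60.0},
    {"segment": "A", "income": None},
    {"segment": "B", "income": 100.0},
    {"segment": "B", "income": None},
]

def impute_overall(rows, field):
    """Fill missing values with the mean of all observed values."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [dict(r, **{field: fill if r[field] is None else r[field]})
            for r in rows]

def impute_by_group(rows, field, group):
    """Fill missing values with the mean of observed values in the same group.

    Assumes every group has at least one observed value.
    """
    observed = {}
    for r in rows:
        if r[field] is not None:
            observed.setdefault(r[group], []).append(r[field])
    fills = {g: mean(values) for g, values in observed.items()}
    return [dict(r, **{field: fills[r[group]] if r[field] is None else r[field]})
            for r in rows]

overall = impute_overall(people, "income")
grouped = impute_by_group(people, "income", "segment")
```

The overall mean pulls both missing incomes toward the same value, while the grouped version keeps the difference between segments, which is closer to the "who else looks like that?" idea on the slide.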

14 The zero problem
What does 0 mean?
If “Number of Revolving Bankcard Trades Currently Past Due” = 0, what does that mean?

15 # of Bank Rev. Trds Currently Past Due
[Frequency table for BRPSTD. Columns: BRPSTD, Frequency, Percent, Cumulative Frequency, Cumulative Percent. Values not recoverable.]

16 # of Trds
[Frequency table for TRADES. Columns: TRADES, Frequency, Percent, Cumulative Frequency, Cumulative Percent. Values not recoverable.]

17 # of Bank Rev. Trds
[Frequency table for BRTRDS. Columns: BRTRDS, Frequency, Percent, Cumulative Frequency, Cumulative Percent. Category rows include: INQS. & PR ONLY, PR ONLY, INQS. ONLY, NO RECORD. Values not recoverable.]

18 # of Bank Rev. Trds Currently Past Due
[Frequency table for BRPSTD. Columns: BRPSTD, Frequency, Percent, Cumulative Frequency, Cumulative Percent. Category rows include: NO TRADES OF THIS TYPE, INQS. & PR ONLY, PR ONLY, INQS. ONLY, NO RECORD, MISSING. Values not recoverable.]
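The zero problem and the frequency checks above can be sketched together: a raw 0 in "past due" may mean "has trades, none past due" or "has no trades of this type at all", and a frequency table before and after recoding makes the difference visible. The records, the recoding rule and the functions `recode_past_due` and `frequency_table` are hypothetical; only the labels "NO TRADES OF THIS TYPE" and "MISSING" are taken from the slides.

```python
from collections import Counter

# Hypothetical records: (number of bank revolving trades, number past due).
# None stands for a value that was never reported.
records = [(3, 0), (0, 0), (2, 1), (0, 0), (5, 0), (None, None), (1, 0)]

def recode_past_due(trades, past_due):
    """Separate a true zero from 'not applicable' and 'missing'."""
    if trades is None or past_due is None:
        return "MISSING"
    if trades == 0:
        return "NO TRADES OF THIS TYPE"
    return past_due

def frequency_table(values):
    """Rows of (value, frequency, percent, cumulative freq, cumulative pct)."""
    counts = Counter(values)
    total = sum(counts.values())
    rows, running = [], 0
    for value in sorted(counts, key=str):
        running += counts[value]
        rows.append((value, counts[value], 100 * counts[value] / total,
                     running, 100 * running / total))
    return rows

raw_table = frequency_table([due for _, due in records if due is not None])
recoded_table = frequency_table([recode_past_due(t, due) for t, due in records])
```

In the raw table five records show 0 past due; after recoding only three of them are genuine zeros, and the other two are customers with no trades of this type.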

19 Inappropriate Data Used
Future data used to build a great-looking model.
Example: used payments through month end instead of payments up to the cycle date.
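One way to guard against this is to cut every input off at the date the model would actually be scored. The sketch below assumes the cycle date falls before month end; the payment records, dates and the function `payments_as_of` are hypothetical.

```python
from datetime import date

# Hypothetical payments for one account.
payments = [
    {"paid_on": date(2024, 3, 5), "amount": 50.0},
    {"paid_on": date(2024, 3, 18), "amount": 75.0},
    {"paid_on": date(2024, 3, 29), "amount": 200.0},
]

def payments_as_of(rows, cutoff):
    """Total of payments dated on or before the cutoff."""
    return sum(r["amount"] for r in rows if r["paid_on"] <= cutoff)

cycle_date = date(2024, 3, 20)  # assumed date on which the score is computed
month_end = date(2024, 3, 31)   # later than the cycle date

usable = payments_as_of(payments, cycle_date)  # what the model may see
leaky = payments_as_of(payments, month_end)    # includes future information
```

The gap between `leaky` and `usable` is information that would not exist at scoring time, so a model trained on the month-end figure looks better in development than it can perform in use.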

20 Unavailable Data
Data on Rejected Applicants
Would they have been Good or Bad had they been accepted?
Use “Reject Inferencing” techniques.

