Introduction to Data Mining and Classification


1 Introduction to Data Mining and Classification
F. Michael Speed, Ph.D.
Analytical Consultant, SAS Global Academic Program

2 Objectives
State one of the major principles underlying data mining
Give a high-level overview of three classification procedures

3 A Basic Principle of Data Mining
Splitting the data:
Training data set – required
Validation data set – required
Test data set – optional
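A minimal sketch of that split, assuming the data arrive as a list of records; the 60/30/10 proportions are illustrative only, since the slides do not prescribe exact fractions:

```python
import random

def split_data(records, train_frac=0.6, valid_frac=0.3, seed=42):
    """Shuffle records and split them into training, validation, and test sets.

    The test set receives whatever remains after the training and
    validation fractions are taken; it may be empty, which matches the
    slides' point that the test set is optional.
    """
    rng = random.Random(seed)
    shuffled = records[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

train, valid, test = split_data(list(range(100)))
print(len(train), len(valid), len(test))  # 60 30 10
```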

4 Training Set
For a given procedure (logistic regression, neural network, or decision tree) we use the training set to generate a sequence of models. For example, with logistic regression we get Model 1, Model 2, …, Model q.
[Diagram: Training Data → Logistic Reg → Model 1, Model 2, …, Model q]

5 How Do We Decide Which of the q Models Is Best?
We want the model with the fewest terms (most parsimonious). We want the model with the largest (or smallest) value of our criterion index (adjusted R-square, misclassification rate, AIC, BIC, SBC, etc.). We use the validation set to compute the criterion (fit index) for each model and then choose the "best."
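The selection step can be sketched in a few lines; this is a generic illustration, not SAS's implementation. Whether "best" means largest or smallest depends on the fit index chosen:

```python
def best_model(fit_indices, smaller_is_better=True):
    """Return the index of the 'best' model under a fixed fit index.

    fit_indices: the validation-set criterion for Model 1..q, in order.
    smaller_is_better: True for misclassification rate, ASE, AIC, SBC;
    False for ROC area, lift, Gini, and similar "bigger is better" indices.
    """
    pick = min if smaller_is_better else max
    return pick(range(len(fit_indices)), key=lambda i: fit_indices[i])

# Model 3 (index 2) has the lowest validation misclassification rate.
print(best_model([0.20, 0.15, 0.12, 0.18]))  # 2
```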

6 Compute the Fit Index for Each Model
Then find the "best" using a fixed fit index.
[Diagram: Validation Set → Model 1, Model 2, …, Model q → Fit Index 1, Fit Index 2, …, Fit Index q]

7 Fit Indices (Statistics)
Default — uses different statistics based on the type of target variable and whether a profit/loss matrix has been defined. If a profit/loss matrix is defined for a categorical target, the average profit or average loss is used; if no profit/loss matrix is defined for a categorical target, the misclassification rate is used. If the target variable is interval, the average squared error is used.
Akaike's Information Criterion — chooses the model with the smallest Akaike's Information Criterion value.
Average Squared Error — chooses the model with the smallest average squared error value.
Mean Squared Error — chooses the model with the smallest mean squared error value.
ROC — chooses the model with the greatest area under the ROC curve.
Captured Response — chooses the model with the greatest captured response values using the decile range that is specified in the Selection Depth property.

8 Continued
Gain — chooses the model with the greatest gain using the decile range that is specified in the Selection Depth property.
Gini Coefficient — chooses the model with the highest Gini coefficient value.
Kolmogorov-Smirnov Statistic — chooses the model with the highest Kolmogorov-Smirnov statistic value.
Lift — chooses the model with the greatest lift using the decile range that is specified in the Selection Depth property.
Misclassification Rate — chooses the model with the lowest misclassification rate.
Average Profit/Loss — chooses the model with the greatest average profit/loss.
Percent Response — chooses the model with the greatest % response.
Cumulative Captured Response — chooses the model with the greatest cumulative % captured response.
Cumulative Lift — chooses the model with the greatest cumulative lift.
Cumulative Percent Response — chooses the model with the greatest cumulative % response.
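Two of the simpler indices from these lists can be computed directly from validation-set predictions; the functions below are generic sketches, not SAS code:

```python
def average_squared_error(actuals, predictions):
    """Average squared error: mean of (actual - predicted)^2 over the validation set."""
    return sum((a - p) ** 2 for a, p in zip(actuals, predictions)) / len(actuals)

def misclassification_rate(actuals, predicted_classes):
    """Fraction of validation cases whose predicted class differs from the actual class."""
    wrong = sum(1 for a, p in zip(actuals, predicted_classes) if a != p)
    return wrong / len(actuals)

# Probabilities near the true 0/1 targets give a small ASE.
print(average_squared_error([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.1]))  # ≈ 0.055
print(misclassification_rate([1, 0, 1, 0], [1, 1, 1, 0]))         # 0.25
```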

9 Misclassification Rate (MR)

             Prediction = 0    Prediction = 1
Actual = 0   True Negative     False Positive
Actual = 1   False Negative    True Positive

MR = (FN + FP)/(TN + FP + FN + TP)
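The formula on this slide translates directly into code; the counts in the example call are made up for illustration:

```python
def misclassification_rate(tn, fp, fn, tp):
    """MR = (FN + FP) / (TN + FP + FN + TP), per the slide's formula."""
    return (fn + fp) / (tn + fp + fn + tp)

# Hypothetical confusion-matrix counts: 30 errors out of 200 cases.
print(misclassification_rate(tn=90, fp=10, fn=20, tp=80))  # 0.15
```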

10 Equity Data Set
The variable BAD = 1 if the borrower is a bad credit risk and 0 if not. We want to build a model to predict whether a person is a bad credit risk.
Other variables: Job, YOJ, Loan, DebtInc
Mortdue – how much they still owe on their mortgage
Value – assessed valuation
Derog – number of derogatory reports
Delinq – number of delinquent trade lines
Clage – age of oldest trade line
Ninq – number of recent credit inquiries
Clno – number of trade lines

11 Three Procedures
Decision Tree
Regression (Logistic)
Neural Network

12 Decision Tree
Very simple to understand
Easy to use
Easy to explain to the boss/supervisor
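That explainability comes from the fact that a fitted tree is just nested if/else rules. The sketch below uses two variables from the equity data set (DebtInc, Derog), but the splits and thresholds are hypothetical, not the tree the slides actually fit:

```python
def predict_bad(debtinc, derog):
    """A tree read as plain rules: returns 1 (bad credit risk) or 0.

    The split variables come from the equity data set, but these
    particular thresholds are invented for illustration only.
    """
    if debtinc is None:   # debt-to-income ratio missing
        return 1          # route missing values to the risky branch
    if debtinc > 45:      # heavily indebted relative to income
        return 1
    if derog >= 2:        # two or more derogatory reports
        return 1
    return 0

print(predict_bad(debtinc=30, derog=0))  # 0
```

Each prediction can be justified to a supervisor by reading off the branch it took, which is exactly the point the slide makes.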

13 Example

14 Maximal Tree – Ignoring Validation Data

15 Optimal Tree

16 Continued

17 Fit Statistics

             Prediction = 0   Prediction = 1
Actual = 0   2266             146
Actual = 1   225              370

MC = (146 + 225)/2981 ≈ 0.124

18 Logistic Regression
Since we observe a 0 or a 1, ordinary least squares is not an option; we need a different approach.
The probability of getting a 1 depends upon X. We write that as p(X).
Log odds = log(p(X)/(1 − p(X))) = a + bX

19 Logistic Graph – Solve for p(X)
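Solving the log-odds equation from the previous slide for p(X) gives the S-shaped logistic curve plotted here:

```latex
\log\frac{p(X)}{1 - p(X)} = a + bX
\quad\Longrightarrow\quad
p(X) = \frac{e^{a + bX}}{1 + e^{a + bX}} = \frac{1}{1 + e^{-(a + bX)}}
```

Exponentiating both sides and rearranging for p(X) yields the right-hand form, which is always strictly between 0 and 1, so it is a valid probability.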

20 Fit Statistics

21 MCR

             Prediction = 0   Prediction = 1
Actual = 0   2306             80
Actual = 1   332              263

MCR = (80 + 332)/2981 ≈ 0.138

22 Neural Net
Very complex mathematical equations
The meaning of the input variables cannot be interpreted from the final model
Often gives a good prediction of the response
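The slide's point about complexity can be made concrete with the forward pass of a tiny one-hidden-layer network; the architecture and all weights below are hypothetical, not the network the slides fit:

```python
import math

def neural_net_predict(x, w1, b1, w2, b2):
    """Forward pass of a one-hidden-layer network for a binary target.

    x: input values; w1: one weight list per hidden unit; b1: hidden
    biases; w2: output weights; b2: output bias. Hidden units use tanh,
    and the output is squashed through the logistic function to give
    P(BAD = 1). Every number here is illustrative.
    """
    hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w1, b1)]
    z = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-z))

p = neural_net_predict([0.5, -1.0],
                       w1=[[0.8, -0.3], [-0.5, 0.9]],
                       b1=[0.1, -0.2],
                       w2=[1.2, -0.7],
                       b2=0.05)
print(0.0 < p < 1.0)  # True
```

Even this toy version shows why the slide says the inputs resist interpretation: each input reaches the output only through nonlinear combinations of every hidden unit.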

23 Neural Net Diagram

24 Fit Statistics

25 MCR

             Prediction = 0   Prediction = 1
Actual = 0   2291             95
Actual = 1   288              307

MCR = (95 + 288)/2981 ≈ 0.128

26 Comparison
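The comparison amounts to computing each model's validation misclassification rate and picking the smallest; the confusion-matrix counts below are taken from the preceding fit-statistics slides:

```python
def mcr(tn, fp, fn, tp):
    """Misclassification rate from confusion-matrix counts."""
    return (fn + fp) / (tn + fp + fn + tp)

# Validation counts as reported on the earlier slides.
models = {
    "decision tree":       mcr(tn=2266, fp=146, fn=225, tp=370),
    "logistic regression": mcr(tn=2306, fp=80,  fn=332, tp=263),
    "neural network":      mcr(tn=2291, fp=95,  fn=288, tp=307),
}
best = min(models, key=models.get)
for name, rate in sorted(models.items(), key=lambda kv: kv[1]):
    print(f"{name}: {rate:.3f}")
print("best:", best)
```

With these counts the decision tree has the lowest misclassification rate, though which model wins in general depends on the data and the fit index chosen.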

27 Enterprise Miner Interface

28 Enterprise Guide Interface

29 RPM (Rapid Predictive Modeler)

30 Continued

31 Continued

32 Fit Statistics

33 Summary
Divide your data into training and validation sets
We looked at decision trees, logistic regression, and neural nets
We also looked at RPM

34 Q&A

