Machine Learning Interpretability


1 Machine Learning Interpretability
Thuy Nguyen, Mihir Jain, Edward Adcock, Toby Alfred-Jones
© COPYRIGHT | Delta Capita | CONFIDENTIAL

2 Contents
01 Project objective
02 Use cases
03 Data overview
04 Neural Network model description
05 Machine Learning Interpretability
06 Results
07 Visualization
08 Next steps

3 01. Project objective

4 Project objective
Delta Capita has developed a Neural Network model to determine the likelihood of mortgage default given a set of information that represents an individual loan.
The aim of the project is to develop knowledge extraction techniques which interpret the mortgage default model.

5 02. Use cases

6 Use cases
The ability to accurately determine loan default is especially useful in two domains:
  Mortgage-backed security investing
  Risk management

7 03. Data overview
Overview of data
Data features
Data features (provided)
Data features (created)
Data features (added)
Model preparation

8 Overview of data
Main dataset: Freddie Mac Single-Family Loans
  Time frame: 1999 – 2016
  Size: 15.3m unique loans with 326m monthly performance updates
  Types of loans:
    Default: 85k
    Fully Paid: 15.2m
  Ratio of Default vs Fully Paid loans: 0.6 : 99.4
Additional datasets:
  Average National Mortgage Interest Rate: monthly national interest rate for standard mortgages from January 1999 to July 2016
  Housing Price Index Per State: monthly House Price Index in each U.S. state from January 1999 to July 2016
  Unemployment Rate Per State: seasonally adjusted unemployment rate for each U.S. state from January 1999 to July 2016

9 Data features
We split the following section into 3 parts:
  1. Data features provided in the main dataset
  2. Data features created from the main dataset
  3. Data features added from external sources
We will later evaluate the model's performance on data from Feature Sets 1, 2 & 3 (combined).

10 Data features (provided)
Origination Features: assessment at the time of the loan application
Monthly Performance Update Features: evaluation of on-going loans on a monthly basis
Origination Features:
  Credit Score
  Original Unpaid Principal Balance
  First Payment Date
  Loan-To-Value Ratio
  First Time Home Buyer Flag
  Interest Rate
  Maturity Date
  Channel of origination of a loan
  Metropolitan Statistical Area
  Prepayment Penalty Mortgage Flag
  Mortgage Insurance Percentage
  Product Type
  Number of Units in a Property
  Property Type
  Occupancy Status
  Property State
  Combined Loan-To-Value Ratio
  Loan Purpose
  Debt-To-Income (DTI) Ratio
  Original Loan Term
  Number of Borrowers
Performance Features:
  Monthly Reporting Period
  Current Actual Unpaid Principal Balance
  Loan Age
  Remaining Months to Legal Maturity
  Current Interest Rate

11 Data features (created)
Based on history of current loan:
  Occurrence of Loan Status (30-dd, 60-dd, 90-dd, foreclosed, etc.)
  Occurrences of Loan Status in the last 12 months
  Percentage change between Last Balance and Current Balance
Based on history of all loans:
  Number of Loans (active) per State/Zip-code
  Number of Loans (taken out) per State/Zip-code
  Number of Loans (taken out) per State/Zip-code in the last 12 months
  Default Rate per State/Zip-code
  Default Rate per State/Zip-code in the last 12 months
  Occurrences of ‘Paid Off’ & ‘Default’ per State/Zip-code
  Occurrences of ‘Paid Off’ & ‘Default’ per State/Zip-code in the last 12 months

12 Data features (added)
Economic features:
  Monthly Unemployment Rate per State
  Monthly Housing Price Index per State
  Monthly National Interest Rate
Extra features created from added datasets:
  Difference between Current Interest Rate and National Interest Rate
  Number of Months that Mortgage Interest Rate is less than National Interest Rate
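The two extra interest-rate features listed on this slide could be derived roughly as follows. This is a minimal sketch, not the project's code: the function name `rate_features`, the output keys, and the toy rates are hypothetical, and it assumes both series are already aligned month by month.

```python
def rate_features(loan_rates, national_rates):
    """Derive the two extra features from monthly rates (in percent).

    Assumes both lists cover the same months in the same order.
    """
    diffs = [loan - national for loan, national in zip(loan_rates, national_rates)]
    return {
        # Difference between the current (latest) loan rate and the national rate.
        "current_rate_diff": diffs[-1],
        # Number of months in which the loan rate was below the national rate.
        "months_below_national": sum(1 for d in diffs if d < 0),
    }


# Hypothetical four-month history for one fixed-rate loan.
loan = [6.5, 6.5, 6.5, 6.5]
national = [6.0, 6.25, 6.75, 7.0]
features = rate_features(loan, national)
```

Here the loan rate sits below the national rate in the last two months, so the count is 2 and the current difference is negative.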

13 Model preparation
Class imbalance:
  Using under-sampling technique on the training set
  New ratio of Default vs Fully Paid loans: : 85
Categorical data: One hot encoding
  For example, if the property is in New York, the value is 1, otherwise 0
Randomly shuffle data
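The preparation steps on this slide (under-sampling the majority class, one-hot encoding, shuffling) can be illustrated with a small standard-library sketch. The helper names `undersample` and `one_hot`, the toy loans, and the 0.85 majority share passed in below are assumptions for illustration only; the slide does not show how the project implemented these steps.

```python
import random


def undersample(rows, label_key, minority_label, majority_share, seed=0):
    """Randomly drop majority-class rows until they form `majority_share` of the result."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[label_key] == minority_label]
    majority = [r for r in rows if r[label_key] != minority_label]
    # Majority rows needed so that majority / (majority + minority) == majority_share.
    n_major = round(len(minority) * majority_share / (1.0 - majority_share))
    kept = rng.sample(majority, min(n_major, len(majority)))
    combined = minority + kept
    rng.shuffle(combined)  # random shuffle of the prepared data
    return combined


def one_hot(rows, key):
    """Replace a categorical column with one 0/1 column per observed category."""
    categories = sorted({r[key] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != key}
        for c in categories:
            new_row[f"{key}_{c}"] = 1 if r[key] == c else 0
        encoded.append(new_row)
    return encoded


# Toy data (hypothetical): 6 defaults and 994 fully paid loans.
loans = [{"state": "NY" if i % 2 else "CA", "label": "default"} for i in range(6)]
loans += [{"state": "NY" if i % 3 else "TX", "label": "paid"} for i in range(994)]

balanced = undersample(loans, "label", "default", majority_share=0.85)
encoded = one_hot(balanced, "state")
```

All minority rows are kept and only the majority class is thinned, which is what distinguishes under-sampling from over-sampling approaches.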

14 04. Neural Network Model Description
Model Architecture
Performance Evaluation Metrics
Model performance

15 Model Architecture
We use a Neural Network to create the Mortgage Classification model.
Model classes: Default or Fully Paid
Model output: any value between 0 and 1, which represents the probability of Default
Threshold value: 0.5 (a value above 0.5 predicts a Default loan)
Model architecture:
  Layers                                   Number of layers   Number of neurons
  Input layer (number of loan features)    1                  133*
  Hidden layer                             2                  100 : 100
  Output layer (number of classes)
* Of the 133 input features, only 64 are unique loan features
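To make the architecture concrete, here is a minimal forward-pass sketch with 133 inputs and two hidden layers of 100 neurons. Several details are assumptions, since the slide does not state them: a single sigmoid output neuron, ReLU hidden activations, and random untrained weights. It only illustrates how a probability and the 0.5 threshold produce a class.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random (untrained) weights, for illustration only."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
sizes = [133, 100, 100, 1]  # the single output neuron is an assumption
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def predict_default_probability(features):
    h = features
    for i, (w, b) in enumerate(layers):
        is_output = i == len(layers) - 1
        h = dense(h, w, b, sigmoid if is_output else relu)
    return h[0]


loan = [rng.random() for _ in range(133)]  # hypothetical encoded loan
p = predict_default_probability(loan)
label = "Default" if p > 0.5 else "Fully Paid"
```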

16 Performance Evaluation Metrics
We use 4 performance metrics:
  Accuracy - Overall classification accuracy
  True Positive Rate - Classification accuracy of ‘Default’ loans
  True Negative Rate - Classification accuracy of ‘Fully Paid’ loans
  AUC - (True Positive Rate + True Negative Rate) / 2
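The four metrics can be computed directly from predicted and true labels, as in the sketch below. The function name and toy labels are hypothetical; 1 stands for ‘Default’ and 0 for ‘Fully Paid’. Note that the slide's AUC formula, (TPR + TNR) / 2, is the area under the ROC curve for a single decision threshold and is also known as balanced accuracy.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, TPR, TNR and the slide's AUC for binary labels (1 = Default)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tpr = tp / (tp + fn)  # share of Default loans classified correctly
    tnr = tn / (tn + fp)  # share of Fully Paid loans classified correctly
    return {
        "accuracy": (tp + tn) / len(y_true),
        "tpr": tpr,
        "tnr": tnr,
        "auc": (tpr + tnr) / 2,
    }


# Hypothetical labels for ten loans.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

On heavily imbalanced data such as this loan set, the averaged TPR/TNR figure is more informative than plain accuracy, because a model that always predicts ‘Fully Paid’ would still score high accuracy.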

17 Model performance
Using the data from Feature Sets 1, 2 & 3 (combined):
  Performance metric                              %
  Accuracy                                        98.3
  % Correct ‘Default’ (True Positive Rate)        98.5
  % Correct ‘Fully Paid’ (True Negative Rate)     98.2
  AUC                                             98.4

18 05. Machine Learning Interpretability
Knowledge Extraction
TREPAN
Distilling Soft Decision Tree
LIME

19 Knowledge extraction
Problems:
  Neural Networks: high performance, but black-box
  Decision Tree: high representation, but low performance
  Combine Neural Networks & Decision Tree to create rules that are human-comprehensible
Methods:
  Global: TREPAN, Distilling Soft Decision Tree
  Local: LIME

20 TREPAN
Key features:
  The Neural Network serves as an oracle that returns class labels
  Constructs models of the underlying distribution of the data
  Tree expansion: best-first expansion to increase fidelity
  Splitting tests: m-of-n
  Stopping criteria:
    Global criteria: size of the tree, highest-fidelity tree
    Local criterion: stopping expansion at an individual node
Key metrics:
  Accuracy
  Fidelity
  Comprehensibility
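Two of the TREPAN ideas above lend themselves to a short illustration: an m-of-n splitting test, and fidelity, the share of samples on which the extracted tree agrees with the oracle. The sketch below is not the TREPAN algorithm itself (no tree growing or sampling); the feature names, thresholds, and stand-in oracle are hypothetical.

```python
def m_of_n(conditions, m):
    """Build a split test that passes when at least m of the n conditions hold."""
    def test(x):
        return sum(1 for c in conditions if c(x)) >= m
    return test


def fidelity(tree_predict, oracle_predict, samples):
    """Fraction of samples where the tree gives the same label as the oracle."""
    agree = sum(1 for x in samples if tree_predict(x) == oracle_predict(x))
    return agree / len(samples)


# Stand-in oracle; in the project this role is played by the Neural Network.
def oracle(x):
    return 1 if x["ltv"] > 80 and x["credit_score"] < 650 else 0


# A one-node "tree": predict Default (1) when at least 2 of 3 conditions hold.
split = m_of_n(
    [
        lambda x: x["credit_score"] < 650,
        lambda x: x["ltv"] > 80,
        lambda x: x["dti"] > 40,
    ],
    m=2,
)


def tree(x):
    return 1 if split(x) else 0


samples = [
    {"credit_score": 600, "ltv": 90, "dti": 45},
    {"credit_score": 700, "ltv": 90, "dti": 45},
    {"credit_score": 600, "ltv": 70, "dti": 30},
    {"credit_score": 720, "ltv": 60, "dti": 20},
]
score = fidelity(tree, oracle, samples)
```

Fidelity is measured against the oracle's labels, while accuracy is measured against the true labels, so the two figures can differ for the same tree.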

21 Distilling soft decision tree
Key features:
  Mimics the input-output function of the Neural Network
  Soft targets: true label, predictions of the Neural Network
  Trained with mini-batch gradient descent
  Uses learned filters to make hierarchical decisions
  Selects a particular static probability distribution over classes as output
Key metrics:
  Accuracy
  Comprehensibility: complexity of the tree
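The inference side of a soft decision tree can be sketched as follows: each inner node applies a learned filter and a sigmoid to get a branching probability, each leaf holds a static class distribution, and the leaf with the highest path probability supplies the output. The weights below are hand-picked placeholders rather than trained values, the heap-style node layout is an assumption made for brevity, and training by mini-batch gradient descent on soft targets is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def leaf_path_probabilities(x, filters, biases, depth):
    """Path probability of every leaf in a complete binary tree.

    Nodes are stored heap-style: children of inner node i are 2i+1 (left)
    and 2i+2 (right). Each inner node routes right with probability
    sigmoid(filter . x + bias).
    """
    n_inner = 2 ** depth - 1
    probs = {0: 1.0}
    for i in range(n_inner):
        p_right = sigmoid(sum(w * xi for w, xi in zip(filters[i], x)) + biases[i])
        probs[2 * i + 1] = probs[i] * (1.0 - p_right)
        probs[2 * i + 2] = probs[i] * p_right
    return [probs[i] for i in range(n_inner, 2 * n_inner + 1)]


def predict(x, filters, biases, leaf_logits, depth):
    """Return the static distribution of the most probable leaf."""
    path_probs = leaf_path_probabilities(x, filters, biases, depth)
    best = max(range(len(path_probs)), key=lambda i: path_probs[i])
    return best, softmax(leaf_logits[best])


# Hypothetical depth-2 tree over two features; classes: [Fully Paid, Default].
filters = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
biases = [0.0, 0.0, 0.0]
leaf_logits = [[2.0, -1.0], [0.5, 0.5], [1.5, -0.5], [-1.0, 2.0]]

x = [0.5, -1.0]
path_probs = leaf_path_probabilities(x, filters, biases, depth=2)
best_leaf, distribution = predict(x, filters, biases, leaf_logits, depth=2)
```

Because every decision is a weighted sum passed through a sigmoid, the path to the chosen leaf can be read off as a short sequence of filters, which is what makes the tree inspectable.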

22 LIME
Key features:
  Create a local linear model around the prediction
  Assign weights to different features in the dataset
  Compute the class probability
  Predict the class having the highest probability
Key metrics:
  Accuracy
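The idea behind the local linear model can be sketched without the LIME library: perturb the instance, query the black box, weight the perturbed points by proximity, and fit a weighted linear model whose coefficients act as feature weights. Everything below is an illustrative assumption, including the stand-in `black_box`, the Gaussian perturbation, the exponential kernel, and its width; the real library differs in details such as sampling and feature selection.

```python
import math
import random


def black_box(x):
    """Stand-in for the Neural Network: returns a probability of Default."""
    z = 3.0 * x[0] - 2.0 * x[1] + 0.5 * x[0] * x[1]
    return 1.0 / (1.0 + math.exp(-z))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def explain_locally(predict, instance, n_samples=500, scale=0.3, kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear model around `instance`."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [v + rng.gauss(0.0, scale) for v in instance]  # perturbed neighbour
        dist_sq = sum((a - b) ** 2 for a, b in zip(z, instance))
        rows.append([1.0] + z)  # leading 1.0 is the intercept term
        targets.append(predict(z))
        weights.append(math.exp(-dist_sq / kernel_width ** 2))
    k = len(instance) + 1
    # Weighted normal equations: (X^T W X) beta = X^T W y.
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(k)]
            for i in range(k)]
    xtwy = [sum(w * r[i] * y for r, w, y in zip(rows, weights, targets)) for i in range(k)]
    beta = solve(xtwx, xtwy)
    return beta[0], beta[1:]


instance = [0.2, 0.1]  # hypothetical scaled loan features
intercept, coefs = explain_locally(black_box, instance)
local_p = intercept + sum(c * v for c, v in zip(coefs, instance))
predicted_class = "Default" if local_p > 0.5 else "Fully Paid"
```

The sign and size of each coefficient indicate how that feature pushes the prediction for this one loan; the explanation is local and need not hold elsewhere in the feature space.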

23 06. Results
TREPAN
Distilling Soft Decision Tree
LIME

24 TREPAN
Use 400 data points
Conditions:
  Maximum number of nodes: 10
  Minimum sample: 100
Model performance:
  Accuracy: 80%
  Fidelity: 88%

25 Distilling Soft Decision Tree
Use 400 data points
Condition:
  Maximum tree depth: 10
Accuracy: 95%

26 LIME
Use 400 data points
Example: prediction of a loan for the 5th customer

27 07. Visualization

28 Visualization - Dashboard

29 Visualization - Dashboard

30 08. Next Steps

31 Next steps
Interpretability model:
  Use the entire dataset to validate all interpretability models
  Try to interpret different Machine Learning models such as Random Forest, SVM
Commercial products:
  Develop a front-end app which is easier for people with no data science background to use
  Make the tool work irrespective of dataset or Python libraries
  Suggest recommendations from the results of the interpretability models
