Who would be a good loanee?
Zheyun Feng, 7/17/2015

Introduction
 Objective: given a customer's application data, determine whether he/she should be given the loan
 What the data looks like
 Tools: Python, scikit-learn

TABLE OF CONTENTS
 Exploring and understanding the input data
  o Types of data
  o Matching features and labels
 Presenting the data to the learning algorithms
  o Problematic (missing or ambiguous) data
  o Representing the data features as a matrix
 Choosing models and learning algorithms
  o Algorithms
 Evaluating the performance
 Conclusion

Understanding the labels
 1285 records in total
  o 1269 with suffix -01
  o 16 with suffix -02
 Loan IDs repeat: duplication or meaningful?
  o For most duplicated IDs, the labels are the same
  o For 3 records, the labels conflict
 Processed labels:
  o 2 good records: 2
  o 1 good record: 1
  o 1 bad record: -1
  o No label / conflicting labels: 0
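A minimal sketch of the label consolidation rule above, using pandas. The column names (`loan_id`, `raw_label`, with 1 = good, -1 = bad) and the toy records are illustrative assumptions, not taken from the actual data file:

```python
import pandas as pd

# Hypothetical raw records: one row per record, loan_id may repeat.
df = pd.DataFrame({
    "loan_id":   ["A", "A", "B", "B", "C", "D"],
    "raw_label": [1, 1, 1, -1, -1, None],
})

def consolidate(labels):
    """Map the labels of one loan ID to the processed label above."""
    labels = labels.dropna()
    if len(labels) == 0 or labels.nunique() > 1:
        return 0                               # missing or conflicting label
    if (labels == 1).all():
        return 2 if len(labels) == 2 else 1    # 2 good -> 2, 1 good -> 1
    return -1                                  # bad -> -1

processed = df.groupby("loan_id")["raw_label"].apply(consolidate)
print(processed)  # A: 2, B: 0 (conflict), C: -1, D: 0 (missing)
```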

Understanding the data features
 Uninformative features
  o Status (all approved)
  o Payment_ach (identical except for 1 record)
 Nominal
  o Loan ID – matching the labels
  o P: address_zip
  o Q: email
  o R: bank routing
 Binary / multiple choice
  o Rent or own
  o How the money will be used
  o Contact way
  o Payment frequency
 Ordinal
  o Email/bank/address duration
 Numeric
  o FICO score
  o Money amounts, e.g. payment amount, income

Understanding the data features
 Loan ID – matching the labels
  o No duplicates
  o 16 with no label (0): label missing (13) / labels conflicting (3)
  o 281 good (label 1: 268, label 2: 13)
  o 350 bad (-1)
 Email / zipcode / bank routing
  o No duplicates -> carries no information; with duplicates -> copy the labels
  o Duplicated email domains: yahoo, aol, bing, hotmail, gmail – each encoded by its negative ratio N/(N+P)
  o Convert the nominal value to a numeric one: a prior indicating the negative ratio

Understanding the data features
 Zipcode
  o Many repetitions
  o Convert the nominal value to a numeric one: a prior indicating the negative ratio
  o Repetition count > 10 => use the observed negative ratio; else => 0.55

Understanding the data features
 Bank routing
  o Many repetitions
  o Convert the nominal value to a numeric one: a prior indicating the negative ratio
  o Repetition count > 10 => use the observed negative ratio; else => 0.55
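Since the same thresholded negative-ratio encoding is applied to email domain, zipcode, and bank routing, it can be written once. A sketch assuming a pandas Series of raw nominal values and a parallel Series of processed labels (the 0.55 prior and the count threshold of 10 come from the slides; the function and variable names are illustrative):

```python
import pandas as pd

DEFAULT_PRIOR = 0.55   # fallback prior from the slides for rare values
MIN_COUNT = 10         # repetition-count threshold from the slides

def negative_ratio_encode(values: pd.Series, labels: pd.Series) -> pd.Series:
    """Replace each nominal value (zipcode, bank routing, email domain)
    with the fraction of negative labels seen for that value, N/(N+P),
    falling back to a fixed prior for values seen too rarely."""
    stats = pd.DataFrame({"v": values, "neg": (labels == -1).astype(int)})
    grouped = stats.groupby("v")["neg"].agg(["mean", "count"])
    ratio = grouped["mean"].where(grouped["count"] > MIN_COUNT, DEFAULT_PRIOR)
    return values.map(ratio)

# Example usage (assuming df holds the raw table with these column names):
# df["zipcode_ratio"] = negative_ratio_encode(df["zipcode"], df["label"])
```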

Presenting data to the learning algorithms
 Multiple-choice data (e.g. contacts, how the money is used): encode as a sequence of binary values
 Ordinal data: assign 1, 2, 3, ...
 Missing values (e.g. payment approved):
  o Regression: train a regression model on the non-missing data and predict values for the missing samples
  o Add a binary feature indicating whether the value is missing
 Missing values (e.g. other contacts):
  o Ignore the missing values
  o Consider the non-missing values together with "contacts"
 Concatenate all features together to form a matrix (a sketch follows below)
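A sketch of these preprocessing steps with pandas and scikit-learn. Every column name and category level below is invented for illustration, and the imputer is a plain linear regression rather than whatever model the original code used:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy application table; all columns and levels here are assumptions.
df = pd.DataFrame({
    "how_use_money":    ["rent", "car", "rent", "bills"],
    "address_duration": ["<1 year", "1-3 years", ">3 years", "1-3 years"],
    "income":           [2100.0, 3400.0, 2800.0, 3000.0],
    "payment_approved": [200.0, np.nan, 260.0, 275.0],
})

# Multiple-choice feature -> a sequence of binary (one-hot) columns.
one_hot = pd.get_dummies(df["how_use_money"], prefix="use")

# Ordinal feature -> 1, 2, 3, ...
order = {"<1 year": 1, "1-3 years": 2, ">3 years": 3}
df["address_duration_num"] = df["address_duration"].map(order)

# Missing numeric value: fit a regression on the non-missing rows,
# predict the missing ones, and keep a binary "was missing" indicator.
mask = df["payment_approved"].isna()
X = df[["income", "address_duration_num"]]
reg = LinearRegression().fit(X[~mask], df.loc[~mask, "payment_approved"])
df.loc[mask, "payment_approved"] = reg.predict(X[mask])
df["payment_approved_missing"] = mask.astype(int)

# Concatenate everything into one feature matrix.
features = pd.concat(
    [one_hot, df[["address_duration_num", "income",
                  "payment_approved", "payment_approved_missing"]]], axis=1)
```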

Data Statistics
 Data size: 647 samples in total, 16 of them without label
 Feature dimension: 34
 Positive samples: 281; negative samples: 350
 After normalization, each feature value lies in [0, 1]
 Training set: 80%; testing set: 20%
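A sketch of the normalization and the 80/20 split, assuming `features` and `labels` come from the preprocessing above and have been reduced to the 631 labeled samples. Whether the original split was stratified is not stated in the slides, so the stratification here is an assumption:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# 80% training / 20% testing, keeping the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0)

# Scale each feature into [0, 1] using training-set statistics only,
# so no information leaks from the test set.
scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```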

Impacts of certain features (figure)

Learning Models
 SVM with polynomial kernel
 Logistic regression
 Linear discriminant analysis
 Quadratic discriminant analysis
 AdaBoost
 Bagging
 Random forest
 Extra trees
 K-nearest neighbors
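The models listed above could be compared with a loop like the following. Hyperparameters are left at scikit-learn defaults here, which almost certainly differs from the tuned settings behind the actual results:

```python
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

models = {
    "SVM (poly kernel)":   SVC(kernel="poly"),
    "Logistic regression": LogisticRegression(),
    "LDA":                 LinearDiscriminantAnalysis(),
    "QDA":                 QuadraticDiscriminantAnalysis(),
    "AdaBoost":            AdaBoostClassifier(),
    "Bagging":             BaggingClassifier(),
    "Random forest":       RandomForestClassifier(),
    "Extra trees":         ExtraTreesClassifier(),
    "k-NN":                KNeighborsClassifier(),
}

# Fit every model on the training split and report test accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```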

Learning Models (results figure)

Conclusion and future directions
 Data matters
  o Choose data with better quality
  o Explore more features: household income, occupation, payment records
 Pre-processing of missing/problematic data is important
 Data normalization is important
 Ensemble classifiers outperform single classifiers (see the sketch below)
  o Majority voting / weighted combination / boosting
  o Overfitting risk
  o Randomness
  o Parameter tuning
 If the data were large enough:
  o Neural networks / deep learning
  o Kernel methods
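As one example of the majority-voting combination mentioned above, a hard-voting ensemble over three of the single models could look like this; which members to combine (and any weights) is a tuning choice, not something the slides specify:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Majority vote over three of the single classifiers compared earlier.
ensemble = VotingClassifier(
    estimators=[("lr",  LogisticRegression()),
                ("rf",  RandomForestClassifier()),
                ("svm", SVC(kernel="poly"))],
    voting="hard")
ensemble.fit(X_train, y_train)
print(f"voting ensemble: test accuracy = {ensemble.score(X_test, y_test):.3f}")
```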