ICS 278: Data Mining
Lecture 18: Credit Scoring
Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine

Presentations for Next Week
Names for each day will be emailed out by tomorrow
Instructions:
  –Email me your presentations by 12 noon on the day of your presentation (no later, please)
  –I will load them on my laptop (so no need to bring a machine)
  –Each presentation will be 6 minutes long + 2 minutes for questions
    So probably about 4 to 8 (max) slides per presentation

References on Credit Scoring
D. J. Hand and W. E. Henley, Statistical Classification Methods in Consumer Credit Scoring: a Review, Journal of the Royal Statistical Society: Series A, Volume 160, Issue 3, November 1997
  –Available online at class Web page under lecture notes
Also:
  –L. C. Thomas, D. B. Edelman, J. N. Crook, Credit Scoring and its Applications, SIAM, 2002
  –E. Mays (editor), Credit Risk Modeling, American Management Association, 1998

Outline
Credit Scoring
  –Problem definition, standard notation
Data Sources
Models
  –Logistic regression, trees, linear regression, etc.
Model building issues
  –Problem of reject inference
Practical issues
  –Cutoff selection, updating models

The Problem of Credit Scoring
Applicants apply for a bank loan
  –Population 1 is rejected
  –Population 2 is accepted
    Population 2a repays their loan -> labeled good
    Population 2b goes into some form of default -> labeled bad
Model building
  –Build a model that can discriminate population 2a from population 2b
  –Usually treated as a classification problem
  –Typically want to estimate p(good | features) and rank individuals this way
Widely used by banks and credit card companies
  –Similar problems occur in direct marketing and other scoring applications

Many different applications for Customer Scoring
Other financial applications:
  –Delinquent loans: who is most likely to pay up
    Uses historical data on who paid in the past
    Often used to create portfolios of delinquent debt
  –Customer revenue
    How much revenue will each customer generate over the next K years
Predicting marketing response
  –Cost of a mailer to a customer is on the order of $1
  –Targeted marketing
    Rank customers in terms of likelihood to respond
Churn prediction
  –Predicting which customers are most likely to switch to another brand
  –E.g., wireless phone service
  –Scores used to rank customers and then target the most likely with incentives
Many more…

Some background
History
  –General ideas started in the 1950s
    e.g., Bill Fair and Earl Isaac -> FairIsaac -> FICO scores
  –Initially a bit controversial
    Worries about it being unfair to some segments of society
      –US Equal Credit Opportunity Acts, 1975/76
    Skepticism that machine-generated rules from data could outperform human-generated guidelines
  –First adopted in credit-card approvals (1960s)
  –Later broadly adopted in home loans, etc.
  –Now widely accepted and used by almost all banks, credit-granting agencies, etc.

Data Sources
Data from the loan application
  –Age, address, income, profession, SS#, number of credit cards, savings, etc.
  –Easy to obtain
Internal performance data
  –How the individual has performed on other loans with the same bank
  –May only be available for a subset of customers
External performance data:
  –Credit reports
    How the individual has performed historically on all loans and credit cards
    Relatively expensive to obtain (e.g., $1 per individual)
  –Court judgements
  –Real estate records
Macro-level external data
  –Demographic characteristics for the applicant's zip code or census tract

Loan Application Data
Issues
  –Data entry errors (e.g., birthday = date of loan application)
  –Deliberate falsifications (e.g., over-reporting of income)
  –Legal issues
    US Equal Credit Opportunity Acts, 1975/76
    Illegal to use race, color, religion, national origin, sex, marital status, or age in the decision to grant credit
    But what if other variables are highly predictive of some of these variables?

Variable Name   Description                     Codings
dob             Year of birth                   If unknown the year will be 99
nkid            Number of children              number
dep             Number of other dependents      number
phon            Is there a home phone           1 = yes, 0 = no
sinc            Spouse's income
aes             Applicant's employment status   V = Government, W = housewife, M = military, P = private sector, B = public sector, R = retired, E = self employed, T = student, U = unemployed, N = others, Z = no response

Variable Name   Description                      Codings
dainc           Applicant's income
res             Residential status               O = Owner, F = Tenant furnished, U = Tenant unfurnished, P = With parents, N = Other, Z = No response
dhval           Value of home                    0 = no response or not owner, 000001 = zero value, blank = no response
dmort           Mortgage balance outstanding     0 = no response or not owner, 000001 = zero balance, blank = no response
doutm           Outgoings on mortgage or rent
doutl           Outgoings on loans
douthp          Outgoings on hire purchase
doutcc          Outgoings on credit cards
Bad             Good/bad indicator               1 = Bad, 0 = Good

Credit Report Data
Available from 3 major bureaus in the US:
  –Experian, Trans-Union, and Equifax
Data in the form of a list of transactions/events
  –Typically needs to be converted into feature-value form
    E.g., number of credit cards opened in past 12 months
  –Can result in a huge number of features
Cost varies as a function of type and time-window of data requested
  –Interesting problem: cost-optimal downloading of selected credit report features, adapted to each individual as a function of cheaper features

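The conversion from an event list to feature-value form can be sketched as follows. This is only an illustration: the record layout `(date, event_type)`, the event names, and the helper names `months_between`, `count_events`, and `build_features` are all assumptions made for the example, not an actual bureau format.

```python
from datetime import date

# Hypothetical event records: (event_date, event_type).
# Real credit-bureau formats are different and far richer.
events = [
    (date(2003, 1, 15), "card_opened"),
    (date(2003, 9, 2), "card_opened"),
    (date(2001, 5, 20), "card_opened"),
    (date(2003, 11, 1), "late_payment"),
]


def months_between(earlier, later):
    """Whole calendar months elapsed from `earlier` to `later`."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the last month is not yet complete
    return months


def count_events(events, event_type, as_of, window_months):
    """Count events of one type in the `window_months` months before `as_of`."""
    return sum(
        1
        for when, kind in events
        if kind == event_type
        and when <= as_of
        and months_between(when, as_of) < window_months
    )


def build_features(events, as_of, windows=(12, 24, 36)):
    """One count feature per (event type, time window) pair."""
    return {
        f"n_{kind}_{w}m": count_events(events, kind, as_of, w)
        for kind in ("card_opened", "late_payment")
        for w in windows
    }


features = build_features(events, as_of=date(2004, 1, 1))
```

Each extra event type or time window multiplies the number of columns, which is how the feature count can grow so quickly.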
Defining Good and Bad
Good versus Bad
  –Not necessarily clear how to define 2 classes
  –E.g., bad = ever 3 or more payments in arrears?
    Bad = 2 or more payments in arrears more than once?
  –A spectrum of behavior
    Never any problems in payments
    Occasional problems
    Persistent problems
  –Typical to discard the intermediate cases and also those with insufficient experience to reliably classify them
    Not ideal theoretically, but convenient

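One possible labeling rule, sketched under assumptions: the input is a per-month count of payments in arrears, "bad" uses the "ever 3 or more payments in arrears" definition from the slide, and the 12-month minimum history is a hypothetical choice. The function name `label_account` is also made up for this example.

```python
def label_account(arrears_history, months_required=12, bad_threshold=3):
    """Label an account from its monthly 'payments in arrears' counts.

    months_required and bad_threshold are illustrative defaults,
    not industry standards.
    """
    if len(arrears_history) < months_required:
        return "insufficient"   # too little experience to classify
    worst = max(arrears_history)
    if worst >= bad_threshold:
        return "bad"            # ever 3+ payments in arrears
    if worst == 0:
        return "good"           # never any problems
    return "indeterminate"      # occasional problems: typically discarded


histories = {
    "a": [0] * 12,
    "b": [0] * 11 + [3],
    "c": [0, 1] + [0] * 10,
    "d": [0] * 5,
}
labels = {name: label_account(h) for name, h in histories.items()}
# Only the clear-cut cases are kept for model building.
training_labels = {k: v for k, v in labels.items() if v in ("good", "bad")}
```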
Selecting a Data Set for Model Building
Sample selection
  –Typical sample sizes ~ 10k to 100k per class
  –Should be representative of customers who will apply in the future
  –Need to be able to get the relevant variables for this set of customers
    Internal performance data
    External performance data
    Etc.
External data sources (e.g., credit reports) can result in a very large number of possible variables
  –E.g., in the 1000s
  –E.g., number of accounts opened in past 12/24/36/… months
  –Typically some form of variable selection is done before building a model
    Often based on univariate criteria such as information gain

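A minimal sketch of univariate selection by information gain, assuming discrete-valued variables (continuous variables would need to be binned first). The toy rows reuse the variable names `res` and `phon` from the application-data table, but the values are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy from splitting on one discrete variable."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(rows, labels):
    """Rank variables by information gain, highest first."""
    gains = {
        name: information_gain([r[name] for r in rows], labels)
        for name in rows[0]
    }
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy data: residential status separates the classes, phone does not.
rows = [
    {"res": "O", "phon": 1},
    {"res": "O", "phon": 0},
    {"res": "F", "phon": 1},
    {"res": "F", "phon": 0},
]
labels = ["good", "good", "bad", "bad"]
ranking = rank_features(rows, labels)
```

Because each variable is scored on its own, this kind of screening is cheap for thousands of candidates but ignores interactions between variables.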
Models used in Credit Scoring
Regression:
  –Ignore the fact that we are estimating a probability
  –Typically linear regression is used
Classification (more common approach)
  –Logistic regression (most widely used)
  –Decision trees (becoming more popular)
  –Neural networks (experimented with, but not used in practice so much)
  –Nearest neighbors
  –Model combining - some work in this area
  –SVMs - too new, relatively unproven
General comments
  –Many trade secrets; companies like FairIsaac do not publish details
  –Generally the industry is conservative: prefers well-established methods
  –Classification accuracy is only one part of the overall solution…

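Logistic Regression Models
Training data -> weights w_0, …, w_p (in w_p x_p the subscript p indexes the last feature; elsewhere p is the probability of a good risk)
\[ \mathrm{logit}(p) = \log(\text{odds}) = \log\!\left(\frac{p}{1-p}\right) = w_0 + w_1 x_1 + \dots + w_p x_p \]
\[ p = g^{-1}(w_0 + w_1 x_1 + \dots + w_p x_p), \qquad g = \mathrm{logit} \]
[Figure: logit(p) plotted against p for p from 0.0 to 1.0; logit(p) = 0 at p = 0.5]
Note that near logit(p) = 0 (i.e., p near 0.5), logit(p) is almost linear in p, so linear and logistic regression will be similar in this region
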
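The link function and its inverse can be written out directly. The sketch below also checks the near-linearity claim numerically: the slope of logit at p = 0.5 is 1/(p(1-p)) = 4, so logit(p) is roughly 4(p - 0.5) there. The weights passed to `score` are placeholders, not fitted values.

```python
from math import exp, log


def logit(p):
    """Log-odds of probability p, for 0 < p < 1."""
    return log(p / (1.0 - p))


def inverse_logit(z):
    """Map a linear score z back to a probability (the logistic function)."""
    return 1.0 / (1.0 + exp(-z))


def score(bias, weights, x):
    """p(good | x) under a logistic regression model with given weights."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return inverse_logit(z)


# Compare logit(p) with its linear approximation around p = 0.5.
for p in (0.40, 0.45, 0.50, 0.55, 0.60, 0.90):
    approx = 4.0 * (p - 0.5)
    print(f"p={p:.2f}  logit={logit(p):+.3f}  linear approx={approx:+.3f}")

# Placeholder weights purely for illustration.
p_good = score(bias=0.1, weights=[0.8, -1.2], x=[1.0, 0.5])
```

The approximation is close between roughly p = 0.4 and p = 0.6 and degrades quickly toward the tails, which is where the two regression approaches start to differ.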
Modeling Example
Model                                     Bad Risk Rate (%)
k nearest neighbor with special metric    43.09
k nearest neighbor (standard)             43.25
logistic regression                       43.30
linear regression                         43.36
decision tree                             43.77
(from Hand and Henley paper)

Evaluation Methods
Decile/Centile reporting:
  –Rank customers by predicted scores
  –Report lift rate in each decile (and cumulatively) compared to accepting everyone
Receiver Operating Characteristic (ROC)
  –Vary classification threshold
  –Plot proportion of good risks accepted vs. proportion of bad risks accepted
Bad risk rate = proportion of bad risks among those accepted
  –Let p = proportion of good risks
  –Let a = proportion accepted
    e.g., can show that, with a > p, the bad risk rate among those accepted is lower bounded by 1 - p/a
    e.g., p = 0.45, a = 0.70 => bad risk rate must be between 0.35 and 0.78

Economics of Credit Scoring
Classification accuracy is not the appropriate metric
Benefit = increase in revenue from using model - cost of developing and installing model
Model development: anywhere from $5k to $100k depending on the complexity of the modeling project
Model installation: can be expensive (software, testing, legal requirements)
  –Model maintenance and updating should also probably be included
Revenue increase is based on estimated performance plus assumptions about the cost of bad risks versus good risks
Small improvements in accuracy (e.g., 1 to 5%) could lead to significant gains if the model is used on large numbers of customers

Problem of Reject Inference
Typically the population available for training consists only of past applicants who were accepted
  –Application data is available for rejects, but no performance data
Question:
  –Is there a way to use the data from rejected applicants?
Answer: no widely accepted approach. Methods include
  –Define all rejects as bad (not reliable!)
  –Build a statistical model (treat labels as missing, but biased)
    Can be quite complex, see Section 5 in Hand and Henley paper
  –Grant credit to some fraction of rejects and track their performance so that the full population is sampled
    Rarely used for loans, but ideally is the best method

Other issues
Threshold selection
  –Above what threshold should loans be granted?
  –Depends on goals of the project
    E.g., focusing on a small set of high-scoring customers versus widening the net to include a larger number (but still minimizing risk)
Time-dependent classification
  –What really matters is what the customer will do at time t+T
  –Can we model the state of a customer over time (rather than statically)?
    Still somewhat of a research topic
Overrides
  –Loans are still manually signed off. The bank may sometimes override the system's recommendation

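One way to make threshold selection concrete is to pick the cutoff that maximizes expected profit. This sketch assumes the scores are calibrated estimates of p(good | features) and that each good loan earns a fixed gain while each bad loan costs a fixed loss; the scores, the monetary values, and the function names are hypothetical.

```python
def expected_profit(scores, cutoff, gain_good, loss_bad):
    """Expected profit from accepting every applicant with score >= cutoff."""
    return sum(
        p * gain_good - (1.0 - p) * loss_bad
        for p in scores
        if p >= cutoff
    )


def best_cutoff(scores, gain_good, loss_bad):
    """Cutoff (among the observed scores) with the highest expected profit."""
    candidates = sorted(set(scores))
    return max(
        candidates,
        key=lambda c: expected_profit(scores, c, gain_good, loss_bad),
    )


def break_even(gain_good, loss_bad):
    """Score above which a single loan has positive expected profit."""
    return loss_bad / (gain_good + loss_bad)


# Hypothetical scores and economics: a bad loan costs 4x what a good one earns.
scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.40]
cutoff = best_cutoff(scores, gain_good=1.0, loss_bad=4.0)
threshold = break_even(gain_good=1.0, loss_bad=4.0)
```

With these numbers the break-even score is 0.8, so the best observed cutoff is the lowest score above it. A project focused on volume rather than profit per loan would choose a different objective and therefore a different cutoff.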
The model works… now what?
Implementation
  –Depends on whether the model is replacing an existing automated model
  –… or is the first time modeling is being applied to the problem
  –Many software issues in terms of databases, security, etc.
Monitoring and tracking
  –Important to see how the scorecard works in practice
  –Generating monthly/quarterly reports on scorecard performance (naturally there will be some delay in this)
  –Analyzing performance in detail on segments, by attribute, etc.
Time for a new model?
  –E.g., population has changed significantly
  –E.g., new (cheap and useful) data available
  –E.g., new modeling technology available