Premiere Products Team Project SAS Enterprise Miner (Part I)


1 Premiere Products Team Project SAS Enterprise Miner (Part I)
BCIS 4660 Spring 2006

2 BCIS 4660 Project Overview Assignment #7 Assignment #8 Assignment #9
Each team will present its work (10-12 slides, 10 minutes). All team members will submit PEER EVALUATIONS to the instructor at the conclusion of the project; these will be part of your grade.

3 BCIS 4660 Project Team Assignments
Team assignments will be centered on the team leaders. The traits of a good leader (according to Nick Evangelopoulos): knowledge of the class project topic; a track record of producing results and meeting deadlines; the ability to push people when they are lazy, without being insensitive to special needs or temporary problems; leadership by example (practice what you preach); being proactive in resolving communication and management problems at a very early stage, before they escalate into conflicts.

4 BCIS 4660 Project Team Assignments
What to do if you have already teamed up: if your team has fewer than four members, select a leader and let us know how many more people you need; if your team already has four members, select a team leader and have him/her come up and let us know your team is full.

5 Tasks for Team Leaders Come on up here and introduce yourself by FIRST NAME. Bring pen and paper. Use ONE SENTENCE to elaborate on your leadership style. Use ONE SENTENCE to instill confidence and convince the team members this project is something you can handle. Invite team members to come talk to you, and take down their full names and contact info (phone #, e-mail). Send everybody a confirmation/"thank you" ASAP. E-mail the instructor your Team Number and the names of all team members (including your own).

6 Data Mining
Some slide material taken from: Groth, Han and Kamber, SAS Institute

7 The UNT/SAS® joint Data Mining Certificate: New in 2006
Just approved! Free of charge! Requires: DSCI 2710, DSCI 3710, BCIS 4660, DSCI 4520

8 Overview of this Presentation
Introduction to Data Mining; the SEMMA Methodology; Regression/Logistic Regression; Decision Trees; Neural Networks; SAS EM Demo: The Home Equity Loan Case. Important DM techniques NOT covered today: Market Basket Analysis, Memory-Based Reasoning, Web Link Analysis.

9 Introduction to DM “It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” (Sir Arthur Conan Doyle: Sherlock Holmes, "A Scandal in Bohemia")

10 What Is Data Mining? Data mining (knowledge discovery in databases):
A process of identifying hidden patterns and relationships within data (Groth) Data mining: Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

11 Data
Sources of data: hospital patient registries, electronic point-of-sale data, remote sensing images, tax returns, stock trades, OLTP, telephone calls, airline reservations, credit card charges, catalog orders, bank transactions.

12 DM and Business Decision Support
Database Marketing (target marketing, customer relationship management); Credit Risk Management (credit scoring); Fraud Detection; Healthcare Informatics (clinical decision support).

13 Data Mining Is Multidisciplinary
Data mining draws on Statistics, Pattern Recognition, Neurocomputing, Machine Learning, AI, Databases, and KDD.

14 On the News: Data-mining software digs for business leads
SAN FRANCISCO, March 8 -- Last year, Congress shut down the Pentagon's Total Information Awareness programs. Now entrepreneurs are selling new data-mining tools that do similar things on a smaller scale. Spoke and Visible Path sell their software primarily to corporations. The idea is to provide tools for finding helpful business partners and making blind introductions -- allowing, say, a lawyer for Silly Computers Inc. to electronically ask a former classmate from Harvard who once did legal work for Microsoft to help him pitch a business deal to Bill Gates. How does it work? Both Spoke and Visible Path send so-called crawlers around a corporation's internal computer network -- sniffing telltale clues, say, from employee Outlook files about whom they e-mail and how often, who replies to particular messages and who doesn't, and which names show up in electronic calendars and phone logs. The software then cross-references those snippets with information from other company databases, including sales records from PeopleSoft and Salesforce.com.

15 Data Mining: A KDD Process
Data mining is the core of the knowledge discovery process. [Diagram: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Task-relevant Data Selection -> Data Mining -> Pattern Evaluation -> Knowledge.]

16 Data Mining and Business Intelligence
Increasing potential to support business decisions, from bottom to top:
Data Sources (Paper, Files, Information Providers, Database Systems, OLTP) -- DBA
Data Warehouses / Data Marts (OLAP, MDA) -- DBA
Data Exploration (Statistical Analysis, Querying and Reporting) -- Data Analyst
Data Mining (Information Discovery) -- Data Analyst
Data Presentation (Visualization Techniques) -- Business Analyst
Making Decisions -- End User (Manager)

17 Architecture of a Typical Data Mining System
From bottom to top: Databases and Data Warehouse, feeding data cleaning, data integration, and filtering -> Database or data warehouse server -> Data mining engine (supported by a knowledge base) -> Pattern evaluation -> Graphical user interface.

18 Introducing SAS Enterprise Miner (EM)

19 The SEMMA Methodology Introduced By SAS Institute
Implemented in SAS Enterprise Miner (EM). Organizes a DM effort into five activity groups: Sample, Explore, Modify, Model, Assess.

20 Sample: Input Data Source, Sampling, Data Partition

21 Explore: Distribution Explorer, Multiplot, Insight, Association, Variable Selection, Link Analysis

22 Modify: Data Set Attributes, Transform Variables, Filter Outliers, Replacement, Clustering (Self-Organizing Maps / Kohonen Networks), Time Series

23 Model: Regression, Tree, Neural Network, Princomp/Dmneural, User Defined, Ensemble, Memory-Based Reasoning, Two-Stage Model

24 Assess: Assessment, Reporter

25 Other Types of Nodes: Scoring Nodes, Utility Nodes
Group Processing, Data Mining Database, SAS Code, Control Point, Subdiagram, Score, C*Score

26 DATA MINING AT WORK: Detecting Credit Card Fraud
Credit card companies want to find a way to monitor new transactions and detect those made on stolen credit cards. Their goal is to detect the fraud while it is taking place. Within a few weeks after each transaction they will know which transactions were fraudulent and which were not, and they can then use this data to validate their fraud detection and prediction scheme.

27 DATA MINING AT WORK: Telstra Mobile Combats Churn with SAS®
As Australia's largest mobile service provider, Telstra Mobile is reliant on highly effective churn management. In most industries the cost of retaining a customer, subscriber or client is substantially less than the initial cost of obtaining that customer. Protecting this investment is the essence of churn management. It really boils down to understanding customers -- what they want now and what they're likely to want in the future, according to SAS. "With SAS Enterprise Miner we can examine customer behaviour on historical and predictive levels, which can then show us what 'group' of customers are likely to churn and the causes," says Trish Berendsen, Telstra Mobile's head of Customer Relationship Management (CRM).

28 DATA MINING AT WORK: Reducing armed robberies in South Africa
SAS helped Absa, a major South African bank, reduce armed robberies by 41 percent over two years, netting a 38 percent reduction in cash loss and an 11 percent increase in customer satisfaction ratings. Absa, one of South Africa's largest banks, uses SAS' data mining capabilities to leverage its data for better customer relationships and more targeted marketing campaigns. With SAS analytics, the bank can also track which branches are more likely to fall victim to a robbery and take effective preventive measures. "Absa used to be one of the banks that was targeted more than other banks; now we're at the bottom of the list," says Dave Donkin, Absa group executive of e-business and information management.

29 DATA MINING AT WORK: Strategic Pricing Solutions at MCI
MCI now has a solution for making strategic pricing decisions, driving effective network analysis, enhancing segment reporting and creating data for sales leader compensation. Before implementing SAS, the process of inventorying MCI's thousands of network platforms and IT systems – determining what each one does, who runs them, how they help business and which products they support – was completely manual. The model created with SAS has helped MCI to catalog all that information and map the details to products, customer segments and business processes. "That's something everyone is excited about," says Leslie Mote, director of MCI corporate business analysis. "Looking at the cost of a system and what it relates to helps you see the revenue you're generating from particular products or customers. I can see what I'm doing better."

30 Our own example: The Home Equity Loan Case
HMEQ Overview Determine who should be approved for a home equity loan. The target variable is a binary variable that indicates whether an applicant eventually defaulted on the loan. The input variables include the amount of the loan, the amount due on the existing mortgage, the value of the property, and the number of recent credit inquiries.

31 HMEQ case overview The consumer credit department of a bank wants to automate the decision-making process for approval of home equity lines of credit. To do this, they will follow the recommendations of the Equal Credit Opportunity Act to create an empirically derived and statistically sound credit scoring model. The model will be based on data collected from recent applicants granted credit through the current process of loan underwriting. The model will be built from predictive modeling tools, but the created model must be sufficiently interpretable so as to provide a reason for any adverse actions (rejections). The HMEQ data set contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates if an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded.

32 The HMEQ Loan process An applicant comes forward with a specific property and a reason for the loan (Home-Improvement, Debt-Consolidation) Background info related to job and credit history is collected The loan gets approved or rejected Upon approval, the Applicant becomes a Customer Information related to how the loan is serviced is maintained, including the Status of the loan (Current, Delinquent, Defaulted, Paid-Off)

33 The HMEQ Loan Transactional Database
Entity Relationship Diagram (ERD), Logical Design: [Diagram: an APPLICANT applies for an HMEQ Loan on a PROPERTY (relationship attributes: Reason, Loan, Approval, Date); the application is handled by an OFFICER; the APPLICANT becomes a CUSTOMER, who has an ACCOUNT with HISTORY (attributes: Balance, Status, MonthlyPayment).]

34 HMEQ Transactional database: the relations
Entity Relationship Diagram (ERD), Physical Design:
Applicant (APPLICANTID, NAME, JOB, DEBTINC, YOJ, DEROG, CLNO, DELINQ, CLAGE, NINQ)
Property (PROPERTYID, ADDRESS, VALUE, MORTDUE)
HMEQLoanApplication (OFFICERID, LOAN, REASON, DATE, APPROVAL)
Customer (CUSTOMERID)
Account (ACCOUNTID, BALANCE, MONTHLYPAYMENT, STATUS)
Officer (OFFICERNAME, PHONE, FAX)
History (HISTORYID, PAYMENT)

35 The HMEQ Loan Data Warehouse Design
We have some slowly changing attributes: HMEQLoanApplication: Loan, Reason, Date; Applicant: job- and credit-score-related attributes; Property: Value, Mortgage, Balance. An applicant may reapply for a loan, by which time some of these attributes may have changed. We therefore need to introduce surrogate "Key" attributes and make them the primary keys.

36 The HMEQ Loan Data Warehouse Design
STAR 1, Loan Application facts: Fact table HMEQApplicationFact; dimensions Applicant, Property, Officer, Time.
STAR 2, Loan Payment facts: Fact table HMEQPaymentFact; dimensions Customer, Property, Account, Time.

37 Two Star Schemas for HMEQ Loans
Star 1: HMEQApplicationFact (APPLICANTKEY, PROPERTYKEY, OFFICERKEY, TIMEKEY, LOAN, REASON, APPROVAL), with dimensions:
Applicant (APPLICANTKEY, APPLICANTID, NAME, JOB, DEBTINC, YOJ, DEROG, CLNO, DELINQ, CLAGE, NINQ)
Property (PROPERTYKEY, PROPERTYID, ADDRESS, VALUE, MORTDUE)
Officer (OFFICERKEY, OFFICERID, OFFICERNAME, PHONE, FAX)
Time (TIMEKEY, DATE, MONTH, YEAR)
Star 2: HMEQPaymentFact (CUSTOMERKEY, PROPERTYKEY, ACCOUNTKEY, TIMEKEY, BALANCE, PAYMENT, STATUS), with dimensions:
Customer (CUSTOMERKEY, CUSTOMERID, APPLICANTID, NAME, ADDRESS)
Account (ACCOUNTKEY, LOAN, MATURITYDATE, MONTHLYPAYMENT)
plus the shared Property and Time dimensions.

38 The HMEQ Loan DW: Questions asked by management
How many applications were filed each month during the last year? What percentage of them were approved each month? How has the monthly average loan amount been fluctuating during the last year? Is there a trend? Which customers were delinquent in their loan payment during the month of September? How many loans have defaulted each month during the last year? Is there an increasing or decreasing trend? How many defaulting loans were approved last year by each loan officer? Who are the officers with the largest number of defaulting loans?
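As a sketch of how the first question could be answered from the application facts, here is a minimal stdlib Python example; the record layout and field names ('month', 'approved') are illustrative assumptions, not the warehouse's actual schema:

```python
from collections import defaultdict

def monthly_application_stats(applications):
    """For each month, count applications filed and the fraction approved.
    `applications` is a list of dicts with hypothetical keys
    'month' (e.g. '2005-09') and 'approved' (True/False)."""
    filed = defaultdict(int)
    approved = defaultdict(int)
    for app in applications:
        filed[app["month"]] += 1
        approved[app["month"]] += bool(app["approved"])
    return {m: {"filed": filed[m], "approval_rate": approved[m] / filed[m]}
            for m in filed}

stats = monthly_application_stats([
    {"month": "2005-09", "approved": True},
    {"month": "2005-09", "approved": False},
    {"month": "2005-10", "approved": True},
])
```

In the warehouse itself, the same question would be a group-by over HMEQApplicationFact joined to the Time dimension.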

39 The HMEQ Loan DW: Some more involved questions
Are there any patterns suggesting which applicants are more likely to default on their loan after it is approved? Can we relate loan defaults to applicant job and credit history? Can we estimate probabilities to default based on applicant attributes at the time of application? Are there applicant segments with higher probability? Can we look at relevant data and build a predictive model that will estimate such probability to default on the HMEQ loan? If we make such a model part of our business policy, can we decrease the percentage of loans that eventually default by applying more stringent loan approval criteria?

40 Selecting Task-relevant attributes
Attributes retained per table: Customer (CUSTOMERKEY, CUSTOMERID, APPLICANTID, NAME, ADDRESS); Time (TIMEKEY, DATE, MONTH, YEAR); Account (ACCOUNTKEY, LOAN, MATURITYDATE, MONTHLYPAYMENT); Applicant (APPLICANTKEY, JOB, DEBTINC, YOJ, DEROG, CLNO, DELINQ, CLAGE, NINQ); Officer (OFFICERKEY, OFFICERID, OFFICERNAME, PHONE, FAX); Property (PROPERTYKEY, PROPERTYID, VALUE, MORTDUE); HMEQApplicationFact (REASON, APPROVAL); HMEQPaymentFact (BALANCE, PAYMENT, STATUS).

41 HMEQ final task-relevant data file
Name     Model Role  Measurement Level  Description
BAD      Target      Binary             1 = defaulted on loan, 0 = paid back loan
REASON   Input       Binary             HomeImp = home improvement, DebtCon = debt consolidation
JOB      Input       Nominal            Six occupational categories
LOAN     Input       Interval           Amount of loan request
MORTDUE  Input       Interval           Amount due on existing mortgage
VALUE    Input       Interval           Value of current property
DEBTINC  Input       Interval           Debt-to-income ratio
YOJ      Input       Interval           Years at present job
DEROG    Input       Interval           Number of major derogatory reports
CLNO     Input       Interval           Number of trade lines
DELINQ   Input       Interval           Number of delinquent trade lines
CLAGE    Input       Interval           Age of oldest trade line in months
NINQ     Input       Interval           Number of recent credit inquiries

42 HMEQ: Modeling Goal The credit scoring model should compute the probability that a given loan applicant will default on loan repayment. A threshold will be selected such that all applicants whose probability of default exceeds the threshold are recommended for rejection. Using the HMEQ task-relevant data file, three competing models will be built: a Logistic Regression model, a Decision Tree, and a Neural Network. Model assessment will allow us to select the best of the three alternative models.
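The threshold rule described above can be sketched in a few lines of Python; the 0.25 cutoff is an arbitrary placeholder, not a value chosen by the project:

```python
def recommendation(p_default, threshold=0.25):
    """Recommend rejection when the predicted probability of default
    exceeds the chosen threshold; otherwise recommend approval.
    The default threshold here is an assumed placeholder."""
    return "reject" if p_default > threshold else "approve"

decision_high = recommendation(0.40)  # probability above the cutoff
decision_low = recommendation(0.10)   # probability below the cutoff
```

In practice, the threshold would be tuned during model assessment to balance lost good loans against avoided defaults.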

43 Predictive Modeling [Diagram: a rectangular data table of cases (rows) by inputs (columns), with one target column.]

44 Modeling Tools Logistic Regression

45 Modeling Techniques: Separate Sampling
Benefits: Helps detect rare target levels Speeds processing Risks: Biases predictions (correctable) Increases prediction variability
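The "correctable bias" refers to the intercept shift induced by oversampling the rare target level. A minimal sketch of the standard prior correction follows; the function name and argument names are my own, and this is one common form of the adjustment rather than EM's exact implementation:

```python
import math

def corrected_logit(logit_oversampled, sample_rate, population_rate):
    """Adjust a logit fitted on separately sampled data back to the
    population scale by subtracting the log of the sampling-odds ratio.
    sample_rate: event proportion in the training sample (e.g. 0.5)
    population_rate: true event proportion (e.g. 0.2)"""
    offset = math.log((sample_rate / population_rate) /
                      ((1 - sample_rate) / (1 - population_rate)))
    return logit_oversampled - offset

# A balanced sample (50% events) drawn from a 20%-event population
# inflates every logit by log(4); the correction removes that shift.
adjusted = corrected_logit(0.0, sample_rate=0.5, population_rate=0.2)
```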

46 Logistic Regression Models
The logit link maps a probability p in (0, 1) to the whole real line:
logit(p) = log( p / (1 - p) ) = w0 + w1*x1 + ... + wp*xp
so that p = g^-1( w0 + w1*x1 + ... + wp*xp ), where g is the logit function. (Fitted on the Training Data.)

47 Changing the Odds
log( p / (1 - p) ) = w0 + w1*x1 + ... + wp*xp
Increasing x1 by one unit adds w1 to the log odds:
log( p' / (1 - p') ) = w0 + w1*(x1 + 1) + ... + wp*xp = w1 + ( w0 + w1*x1 + ... + wp*xp )
so the odds are multiplied by exp(w1), the odds ratio. (Fitted on the Training Data.)
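A quick numeric check of the odds-ratio identity, with made-up coefficient values:

```python
import math

def p_from_logit(z):
    # inverse logit (logistic function)
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

w0, w1 = -2.0, 0.8                        # assumed, illustrative coefficients
p_before = p_from_logit(w0 + w1 * 1.0)    # x1 = 1
p_after = p_from_logit(w0 + w1 * 2.0)     # x1 increased by one unit
odds_ratio = odds(p_after) / odds(p_before)
# odds_ratio equals exp(w1) regardless of the starting value of x1
```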

48 Modeling Tools Decision Trees

49 Divide and Conquer the HMEQ data
[Tree diagram: 5,000 cases, 10% BAD overall; the split "Debt-to-Income Ratio < 45" gives a left (yes) child with n = 3,350 and 5% BAD and a right (no) child with n = 1,650 and 21% BAD.] The tree is fitted to the data by recursive partitioning. Partitioning refers to segmenting the data into subgroups that are as homogeneous as possible with respect to the target. In this case, the binary split (Debt-to-Income Ratio < 45) was chosen. The 5,000 cases were split into two groups, one with a 5% BAD rate and the other with a 21% BAD rate. The method is recursive because each subgroup results from splitting a subgroup from a previous split. Thus, the 3,350 cases in the left child node and the 1,650 cases in the right child node are split again in similar fashion.
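One level of the recursive partitioning described above, sketched in stdlib Python; the field names and the tiny data set are illustrative assumptions:

```python
def split(cases, variable, threshold):
    """Binary split: cases with variable < threshold go left, the rest right."""
    left = [c for c in cases if c[variable] < threshold]
    right = [c for c in cases if c[variable] >= threshold]
    return left, right

def bad_rate(cases):
    # proportion of cases with target BAD = 1
    return sum(c["bad"] for c in cases) / len(cases)

cases = [{"debtinc": 30, "bad": 0}, {"debtinc": 35, "bad": 0},
         {"debtinc": 50, "bad": 1}, {"debtinc": 55, "bad": 0}]
left, right = split(cases, "debtinc", 45)
# recursion would now call split() again on each child node
```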

50 The Cultivation of Trees
Split Search: which splits are to be considered? Splitting Criterion: which split is best? Stopping Rule: when should the splitting stop? Pruning Rule: should some branches be lopped off?

51 Possible Splits to Consider: an enormous number
[Chart: number of possible splits versus number of input levels (1 to 20), growing to hundreds of thousands. A nominal input allows far more splits than an ordinal input: a nominal input with L levels admits 2^(L-1) - 1 distinct binary splits, while an ordinal input admits only L - 1.]

52 Splitting Criteria How is the best split determined? In some situations, the worth of a split is obvious. If the expected target is the same in the child nodes as in the parent node, no improvement was made, and the split is worthless! In contrast, if a split results in pure child nodes, the split is undisputedly best. For classification trees, the three most widely used splitting criteria are based on the Pearson chi-squared test, the Gini index, and entropy. All three measure the difference in class distributions across the child nodes. The three methods usually give similar results.
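The Gini and entropy criteria above can be sketched directly; the child counts below are approximated from the earlier example's rounded percentages (5,000 cases at 10% BAD, split into 3,350 at roughly 5% BAD and 1,650 at roughly 20% BAD):

```python
import math

def gini(counts):
    # Gini index of a node given per-class case counts
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    # entropy (in bits) of a node given per-class case counts
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def split_worth(parent, children, impurity=gini):
    """Reduction in impurity from the parent node to the
    size-weighted average of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(ch) / n * impurity(ch) for ch in children)
    return impurity(parent) - weighted

worth = split_worth([500, 4500], [[168, 3182], [332, 1318]])
```

A worthless split (child distributions equal to the parent's) would score zero; pure children would score the parent's full impurity.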

53 Benefits of Trees
Interpretability: tree-structured presentation. Mixed measurement scales: nominal, ordinal, interval. Robustness (tolerance to noise). Handling of missing values. Variants include regression trees and consolidation trees.

54 Modeling Tools Neural Networks

55 Neural network models (multi-layer perceptrons)
Often regarded as a mysterious and powerful predictive modeling technique, the most typical form of the model is, in fact, a natural extension of a regression model: a generalized linear model on a set of derived inputs. These derived inputs are themselves a generalized linear model on the original inputs. The usual link for the derived inputs' model is the inverse hyperbolic tangent, a shift and rescaling of the logit function. The model can approximate virtually any continuous association between the inputs and the target; you simply need to specify the correct number of derived inputs.

56 Neural Network Model
With inputs x1, x2 and three hidden units H1, H2, H3 (fitted on the Training Data):
log( p / (1 - p) ) = w00 + w01*H1 + w02*H2 + w03*H3
tanh^-1( H1 ) = w10 + w11*x1 + w12*x2
tanh^-1( H2 ) = w20 + w21*x1 + w22*x2
tanh^-1( H3 ) = w30 + w31*x1 + w32*x2
(tanh squashes each hidden unit's linear combination of the inputs into the range -1 to 1.)
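A forward pass through this three-hidden-unit model can be sketched as follows; the weight values are invented for illustration:

```python
import math

def forward(x1, x2, hidden_weights, output_weights):
    """hidden_weights: one (w_j0, w_j1, w_j2) triple per hidden unit;
    output_weights: (w00, w01, w02, w03).
    Returns the predicted probability p."""
    # hidden units: tanh of a linear combination of the inputs
    H = [math.tanh(w0 + w1 * x1 + w2 * x2) for (w0, w1, w2) in hidden_weights]
    # output: inverse logit of a linear combination of the hidden units
    z = output_weights[0] + sum(w * h for w, h in zip(output_weights[1:], H))
    return 1.0 / (1.0 + math.exp(-z))

p = forward(0.5, -1.0,
            hidden_weights=[(0.1, 0.2, -0.3), (0.0, -0.5, 0.4), (0.2, 0.1, 0.1)],
            output_weights=(0.0, 1.0, -1.0, 0.5))
```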

57 Input layer, hidden layer, output layer
Multi-layer perceptron models were originally inspired by neurophysiology and the interconnections between neurons. The basic model form arranges neurons in layers. The input layer connects to a layer of neurons called a hidden layer, which, in turn, connects to a final layer called the target, or output, layer. The structure of a multi-layer perceptron lends itself to a graphical representation called a network diagram.

58 Neural Network Diagram
[Network diagram: inputs x1 and x2 connect to hidden units H1, H2, H3, which connect to the target layer; fitted on the Training Data.]
log( p / (1 - p) ) = w00 + w01*H1 + w02*H2 + w03*H3
tanh^-1( Hj ) = wj0 + wj1*x1 + wj2*x2, for j = 1, 2, 3

59 Objective Function
Predictions are compared to the actual values of the target via an objective function. An easy-to-understand example of an objective function is the mean squared error (MSE), given by:
MSE(w) = (1/N) * sum_{i=1..N} ( yi - y^i(w) )^2
where N is the number of training cases, yi is the target value of the ith case, y^i(w) is the predicted target value, and w is the current estimate of the model parameters.
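The MSE formula, written out in code with made-up target and prediction values:

```python
def mse(targets, predictions):
    """Mean squared error over N training cases."""
    assert len(targets) == len(predictions)
    n = len(targets)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions)) / n

error = mse([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.1])
```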

60 Overgeneralization A small value for the objective function, when calculated on training data, need not imply a small value for the function on validation data. Typically, improvement in the objective function is observed on both the training and the validation data over the first few iterations of the training process. At convergence, however, the model is likely to be highly overgeneralized (overfitted to the training data), and the values of the objective function computed on training and validation data may be quite different.

61 Training Overgeneralization
[Plot: objective function value versus training iteration (10 to 70); the training-data curve keeps decreasing while the validation-data curve eventually turns upward, signaling overfitting.]

62 Final Model To compensate for overgeneralization, the overall average profit, computed on validation data, is examined. The final parameter estimates for the model are taken from the training iteration with the maximum validation profit.
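Picking the final iteration by validation profit can be sketched as follows; the profit values are invented to show the typical rise-peak-fall shape:

```python
def best_iteration(validation_profit):
    """Index of the training iteration with the maximum validation profit;
    the parameter estimates from that iteration become the final model."""
    return max(range(len(validation_profit)), key=validation_profit.__getitem__)

# validation profit rises, peaks, then falls as the model overfits
profit_by_iteration = [0.8, 1.4, 1.9, 2.1, 2.0, 1.6]
final = best_iteration(profit_by_iteration)
```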

63 Neural Network Final Model
[Plot: validation profit versus training iteration (10 to 70); the final parameter estimates are taken from the iteration with the maximum validation profit.]

64 Next Lecture: SAS EM Demo: HMEQ Case

