Chapter 2: Basics of Business Analytics


1 Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques 2.2 Data Management 2.3 Data Difficulties 2.4 SAS Enterprise Miner: A Primer 2.5 Honest Assessment 2.6 Methodology 2.7 Recommended Reading

2 Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques 2.2 Data Management 2.3 Data Difficulties 2.4 SAS Enterprise Miner: A Primer 2.5 Honest Assessment 2.6 Methodology 2.7 Recommended Reading

3 Objectives Name two major types of data mining analyses.
List techniques for supervised and unsupervised analyses.

4 Analytical Methodology
A methodology clarifies the purpose and implementation of analytics. [Diagram, a cycle: Define or refine business objective → Select data → Explore input data → Prepare and repair data → Transform input data → Apply analysis → Assess results → Deploy models → back to the business objective.] This twelve-step methodology is intended to cover the basic steps needed for a successful data mining project. The steps occur at very different levels: some are concerned with the mechanics of modeling, while others are concerned with business needs. As the diagram shows, the data mining process is best thought of as a set of nested loops rather than a straight line. The steps do have a natural order, but it is not necessary or even desirable to completely finish with one before moving on to the next. Things learned in later steps will cause earlier ones to be revisited. The rest of this chapter looks at these steps one by one.

5 Business Analytics and Data Mining
Data mining is a key part of effective business analytics. Components of data mining: data management, customer segmentation, predictive modeling, forecasting, and standard and nonstandard statistical modeling practices.

6 What Is Data Mining?
Information Technology: complicated database queries. Machine Learning: inductive learning from examples. Statistics: what we were taught not to do.

7 Translation for This Course
Segmentation = unsupervised classification: cluster analysis, association rules, and other techniques. Predictive modeling = supervised classification: linear regression, logistic regression, decision trees, and other techniques.

8 Customer Segmentation
Segmentation is a vague term with many meanings. Segments can be based on the following: a priori judgment (alike based on business rules, not on data analysis), unsupervised classification (alike with respect to several attributes), or supervised classification (alike with respect to a target, as defined by a set of inputs).

9 Segmentation: Unsupervised Classification
[Diagram: in the training data, each case starts as inputs with an unknown group (case 1: inputs, ?; case 2: inputs, ?; and so on). After clustering, each case is assigned a segment (case 1: inputs, cluster 1; case 2: inputs, cluster 3; case 3: inputs, cluster 2; case 4: inputs, cluster 1; case 5: inputs, cluster 2), and a new case can be assigned to the nearest cluster.]

10 Segmentation: A Selection of Methods
k-means clustering; association rules (market basket analysis), which produce rules such as Barbie → candy, beer → diapers, peanut butter → meat.
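
To make the unsupervised idea concrete, here is a minimal k-means sketch in Python (illustrative only; the course itself uses SAS tools, and the RFM-style column names below are assumed toy data):

```python
# Minimal k-means segmentation sketch (Python, not the course's SAS workflow).
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = pd.DataFrame({
    "recency":   [5, 40, 3, 60, 2, 45],       # days since last purchase
    "frequency": [20, 2, 25, 1, 30, 3],       # purchases per year
    "monetary":  [500, 60, 700, 30, 900, 80], # annual spend
})

# Standardize so no single attribute dominates the distance calculation
X = StandardScaler().fit_transform(customers)

# Assign each case to one of k segments; no target variable is involved
customers["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
print(customers)
```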

11 Predictive Modeling: Supervised Classification
[Diagram: in the training data, each case's inputs produce a predicted probability and class (case 1: inputs → prob → class; ... ; case 5: inputs → prob → class); the fitted model then scores a new case the same way.]

12 Predictive Modeling: Supervised Classification
[Diagram: a rectangular data table with one row per case, columns for the inputs, and one column for the target.]
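
For the supervised case, a minimal sketch (Python rather than the course's SAS tools; the toy inputs and churn target are invented) shows a model learning from cases with a known target and then scoring a new case:

```python
# Supervised classification sketch (toy data)
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.DataFrame({
    "daytime_usage":       [120, 30, 200, 15, 180, 25],
    "international_usage": [5, 0, 12, 1, 9, 0],
    "churn":               [1, 0, 1, 0, 1, 0],   # target is known for training cases
})

model = LogisticRegression()
model.fit(train[["daytime_usage", "international_usage"]], train["churn"])

# A new case gets a predicted probability and an assigned class
new_case = pd.DataFrame({"daytime_usage": [150], "international_usage": [7]})
print(model.predict_proba(new_case)[:, 1], model.predict(new_case))
```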

13 2.01 Poll The primary difference between supervised and unsupervised classification is whether a dependent, or target, variable is known.  Yes  No Yes

14 2.01 Poll – Correct Answer The primary difference between supervised and unsupervised classification is whether a dependent, or target, variable is known.  Yes  No Yes

15 Types of Targets
Logistic regression: event/no event (binary target) or class label (multiclass problem). Regression: continuous outcome. Survival analysis: time-to-event (possibly censored).

16 Discrete Targets Health Care target = favorable/unfavorable outcome
Credit Scoring target = defaulted/did not default on a loan Marketing target = purchased product A, B, C, or none

17 Continuous Targets Health Care Outcomes
target = hospital length of stay, hospital cost. Liquidity Management: target = amount of money at an ATM or in a branch vault. Merchandise Returns: target = time between purchase and return (censored).

18 Application: Target Marketing
Cases = customers, prospects, suspects, households Inputs = geographics, demographics, psychometrics, RFM variables Target = response to a past or test solicitation Action = target high-responding segments of customers in future campaigns

19 Application: Attrition Prediction/Defection Detection
Cases = existing customers Inputs = payment history, product/service usage, demographics Target = churn, brand switching, cancellation, defection Action = customer loyalty promotion

20 Application: Fraud Detection
Cases = past transaction or claims Inputs = particulars and circumstances Target = fraud, abuse, deception Action = impede or investigate suspicious cases

21 Application: Credit Scoring
Cases = past applicants Inputs = application information, credit bureau reports Target = default, charge-off, serious delinquency, repossession, foreclosure Action = accept or reject future applicants for credit

22 The Fallacy of Univariate Thinking
What is the most important cause of churn? [Chart: Prob(churn) plotted against Daytime Usage and against International Usage.]

23 A Selection of Modeling Methods
Linear Regression, Logistic Regression Decision Trees

24 Hard Target Search [Diagram: a large pool of transactions.]

25 Hard Target Search [Diagram: the same pool of transactions with the few fraud cases highlighted.]

26 Undercoverage [Diagram: the applicant population split into Accepted Good, Accepted Bad, and Rejected (no follow-up) groups.]

27 Undercoverage [Diagram: the next generation of applicants, again split into Accepted Good, Accepted Bad, and Rejected (no follow-up) groups.]

28 2.02 Poll Impediments to high-quality business data can lie in the very nature of business decision-making: the worst prospects are not marketed to. Therefore, information about the sort of customer that they would be (profitable or unprofitable) is usually unknown, making supervised classification more difficult.  Yes  No Yes

29 2.02 Poll – Correct Answer Impediments to high-quality business data can lie in the very nature of business decision-making: the worst prospects are not marketed to. Therefore, information about the sort of customer that they would be (profitable or unprofitable) is usually unknown, making supervised classification more difficult.  Yes  No Yes

30 Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques 2.2 Data Management 2.3 Data Difficulties 2.4 SAS Enterprise Miner: A Primer 2.5 Honest Assessment 2.6 Methodology 2.7 Recommended Reading

31 Objectives Explain the concept of data integration.
Describe SAS Enterprise Guide and how it fits in with data integration and management for business analytics.

32 Data Management and Business Analytics
Data management brings together data components that can exist on multiple machines, from different software vendors, throughout the organization. Data management is the foundation for business analytics. Without correctly consolidated data, those working in the analytics, reporting, and solutions areas might not be working with the most current, accurate data. [Diagram: a pyramid with Reporting, Basic Analytics, and Advanced Analytics built on a data management foundation.]

33 Managing Data for Business Analytics
Business analytics requires data management activities such as data access, movement, transformation, aggregation, and augmentation. These tasks can involve many different types of data (for example, simple flat files, files with comma-separated values, Microsoft Excel files, SAS tables, and Oracle tables). The data likely combines individual transactions, customer summaries, product summaries, or other levels of data granularity – or some combination of those things.

34 Planning from the Top Down
What mission-critical questions must be answered? What data will help you answer these questions? What data do you have that will help you build the needed data?

35 Implementing from the Bottom Up
[Diagram, from the bottom up: Identify Source Data → Define Target Data → Create Reports → answer the mission-critical questions.]

36 Collaboration Is Key to Business Analytics
Business Expert IT Expert Analytical Expert

37 Data Marts: Tying Questions to Data
Stated simplistically, data marts are implemented at organizations because there are questions that must be answered. Data is typically collected in daily operations but might not be organized in a way that answers the questions. An IT professional can use the questions and the data collected from daily operations to construct the tables for a data warehouse or data mart.

38 Building a Data Mart Foundation of a Data Mart Identify source tables.
Identify target tables. Create target tables. Building the foundation of the data mart consists of the three basic steps listed above.

39 Analytic Objective Example
Business: Large financial institution Objective: From a population of existing clients with sufficient tenure and other qualifications, identify a subset most likely to have interest in an insurance investment product (INS).

40 Financial Institution’s Data
The financial institution has highly detailed data that is challenging to transform into a structure suitable for predictive modeling. As is the case with most organizations, the financial institution has a large amount of data about its customers, products, and employees. Much of this information is stored in transactional systems in various formats. Using SAS Enterprise Guide, this transactional information is extracted, transformed, and loaded into a data mart for the Marketing Department. You continue to work with this data set for some basic exploratory analysis and reporting.

41 A Target Star Schema One goal of creating a data mart is to produce, from the source data, a dimensional data model that is a star schema. A dimensional data model consists of one or more dimension tables and fact tables. One type of dimensional data model is known as a star schema, where there is typically one central fact table and some number of dimension tables. Dimension tables store records related to that particular dimension, and no facts (measures) are stored in these tables. A fact table typically has two types of columns: those that contain facts (measures) and those that are foreign keys to dimension tables. [Diagram: a central Fact Table surrounded by Customer, Organization, Product, and Time dimensions.]

42 Financial Institution Target Star Schema
The analyst can produce, from the financial institution’s source data, a dimensional data model that is a star schema. [Diagram: a central Checking Fact Table surrounded by Customer, Credit Bureau, and Insurance dimensions.]

43 Checking_transactions Table
The checking_transactions table is the fact table: it contains one record per transaction fact, and each fact contains measured or observed variables. The fact table holds the data, and the dimension keys identify each row. Attributes: CHECKING_ID, CHKING_TRANS_DT, CHKING_TRANS_AMT, CHKING_TRANS_CHANNEL_CD, CHKING_TRANS_METHOD_CD, CHKING_TRANS_TYPE_CD.

44 Client Table The client table contains client information. In practice, this data set could also contain address and other information. For this demonstration, only CLIENT_ID, FST_NM, LST_NM, ORIG_DT, BIRTH_DT, and ZIP_5 are used. CLIENT_ID FST_NM LST_NM ORIG_DT BIRTH_DT ZIP_5

45 Client_ins_account Table
The client_ins_account table matches client IDs to INS account IDs. CLIENT_ID CLIENT_INS_ID

46 Ins_account Table The ins_account table contains the insurance account information. In practice, this data set would contain other fields such as rates, maturity dates, and initial deposit amount. For this demonstration, only INS_ACT_ID and INS_ACT_OPEN_DT are used. INS_ACT_ID INS_ACT_OPEN_DT

47 Credit_bureau Table The credit_bureau table contains credit bureau information. In practice, this data set could contain credit scores from more than one credit bureau and also a history of credit scores. CLIENT_ID TL_CNT FST_TL_TR FICO_CR_SCR CREDIT_YQ

48 Advantages of Data Marts
There is one version of the truth. Downstream tables are updated as source data is updated, so analyses are always based on the latest information. The problem of a proliferation of spreadsheets is avoided. Information is clearly identified by standardized variable names and data types. Multiple users can access the same data.

49 SAS Enterprise Guide Overview
SAS Enterprise Guide can be used for data management, as well as a wide variety of other tasks: data exploration querying and reporting graphical analysis statistical analysis scoring

50 Example: Financial Institution Data Management
The head of Marketing wants to know which customers have the highest propensity for buying insurance products from the institution. This could present a cross-selling opportunity. Create part of an analytical data mart by combining information from many tables: checking account data, customer records, insurance data, and credit bureau information.

51 Input Files client_ins_account.sas7bdat, credit_bureau.sas7bdat, ins_account.sas7bdat, client.sas7bdat

52 Final Data

53 A Data Management Process Using SAS Enterprise Guide
Financial Institution Case Study Task: Join several SAS tables and use separate sampling to obtain a training data set.
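
A rough sketch of the join logic follows. The course performs this step in SAS Enterprise Guide; the pandas version below is only an assumed equivalent, and the join keys (CLIENT_ID, and CLIENT_INS_ID to INS_ACT_ID) are taken from the table descriptions above:

```python
# Assumed pandas equivalent of the SAS Enterprise Guide joins (left joins are an assumption)
import pandas as pd

client        = pd.read_sas("client.sas7bdat")
client_ins    = pd.read_sas("client_ins_account.sas7bdat")
ins_account   = pd.read_sas("ins_account.sas7bdat")
credit_bureau = pd.read_sas("credit_bureau.sas7bdat")

# Keep every client, whether or not they hold the insurance product
mart = (client
        .merge(credit_bureau, on="CLIENT_ID", how="left")
        .merge(client_ins, on="CLIENT_ID", how="left")
        .merge(ins_account, left_on="CLIENT_INS_ID", right_on="INS_ACT_ID", how="left"))

# Binary target: 1 if the client has an opened INS account, 0 otherwise
mart["INS_TARGET"] = mart["INS_ACT_OPEN_DT"].notna().astype(int)
print(mart[["CLIENT_ID", "FICO_CR_SCR", "INS_TARGET"]].head())
```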

54 Exploring the Data and Creating a Report
Investigate the distribution of credit scores. Create a report of credit scores by customers without insurance and customers with insurance. Does age have an influence on credit scores? Which customers have the highest credit scores, young customers or older customers? Create a graph of credit scores by age.
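
As a sketch of this exploration (Python instead of SAS Enterprise Guide, with a small invented table that borrows the slides' column names; INS_TARGET and AGE are assumed derived columns):

```python
# Exploration sketch: credit scores by insurance status, and credit score by age
import pandas as pd
import matplotlib.pyplot as plt

mart = pd.DataFrame({
    "FICO_CR_SCR": [620, 710, 680, 750, 590, 730],
    "INS_TARGET":  [0, 1, 0, 1, 0, 1],   # 1 = has insurance (assumed derived flag)
    "AGE":         [28, 45, 37, 52, 23, 61],
})

# Report: credit score distribution for customers without vs. with insurance
print(mart.groupby("INS_TARGET")["FICO_CR_SCR"].describe())

# Graph: credit score by age
mart.plot.scatter(x="AGE", y="FICO_CR_SCR", title="Credit score by age")
plt.show()
```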

55 Exploratory Analysis

56 Exploring the Data and Creating a Basic Report
Financial Institution Case Study Task: Investigate the distribution of credit scores by creating a report of credit scores by customers without insurance and customers with insurance.

57 Graphical Exploration
Financial Institution Case Study Task: Create a graph of credit scores by age.

58 Idea Exchange What conclusions would you draw from this basic data exploration? Are there additional plots or reports that you would like to explore from the orders data to help you better understand your customers and their propensity to buy insurance? What additional data would you need to help you make a case to the head of the Marketing Department that marketing dollars should be spent in a particular way?

59 Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques 2.2 Data Management 2.3 Data Difficulties 2.4 SAS Enterprise Miner: A Primer 2.5 Honest Assessment 2.6 Methodology 2.7 Recommended Reading

60 Objectives Identify several of the challenges of data mining and present ways to address these challenges.

61 Initial Challenges in Data Mining
What do I want to predict? What level of granularity is needed to obtain data about the customer? ...

62 Initial Challenges in Data Mining
What do I want to predict? a transaction an individual a household a store a sales team What level of granularity is needed to obtain data about the customer? ...

63 Initial Challenges in Data Mining
What do I want to predict? a transaction an individual a household a store a sales team What level of granularity is needed to obtain data about the customer? transactional regional daily monthly other

64 2.03 Multiple Answer Poll Which of the following might constitute a case in a predictive model? A. a household B. loan amount C. an individual D. the number of products purchased E. a company F. a ZIP code G. salary Answer: A, C, E, F

65 2.03 Multiple Answer Poll – Correct Answers
Which of the following might constitute a case in a predictive model? A. a household B. loan amount C. an individual D. the number of products purchased E. a company F. a ZIP code G. salary Answer: A, C, E, F

66 Typical Data Mining Time Line
[Chart: how the allotted time divides among data acquisition, data preparation, and data analysis under the projected, actual, dreaded, and needed scenarios.]

67 Data Challenges What identifies a unit?

68 Cracking the Code What identifies a unit?
ID1 ID2 DATE JOB SEX FIN PRO3 CR_T ERA DEC ETS PBB RVC ETT OFS RFN PBB SLP STL DES DLF

69 Data Challenges What should the data look like to perform an analysis?

70 Data Arrangement What should the data look like to perform an analysis? Long-narrow arrangement: one row per account-product combination, with columns Acct and Type (for example, account 2133 appears on one row for MTG and another for SVG). Short-wide arrangement: one row per account, with one indicator column per product (CK, SVG, MMF, CD, LOC, MTG).
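
A minimal reshaping sketch in Python (the account and product values are invented for illustration):

```python
# Reshape long-narrow (one row per account-product) to short-wide
# (one row per account with 0/1 product indicators)
import pandas as pd

long_narrow = pd.DataFrame({
    "Acct": [2133, 2133, 2133, 2653, 2653, 3544, 3544, 3544, 3544, 3544],
    "Type": ["MTG", "SVG", "CK", "SVG", "CK", "MTG", "CK", "MMF", "CD", "LOC"],
})

short_wide = (pd.crosstab(long_narrow["Acct"], long_narrow["Type"])
                .clip(upper=1)     # 1 = account holds the product, 0 = it does not
                .reset_index())
print(short_wide)
```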

71 Data Challenges What variables do I need?

72 Derived Inputs
What variables do I need? [Table: raw Claim and Accident dates and times (for example, 11nov / 12:38, 22dec / 01:42) are transformed into derived inputs Delay (days between accident and claim, such as 19, 333, 3), Season (fall, winter, spring, summer), and Dark (1 if the accident happened in darkness, 0 otherwise).]
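
A sketch of how such derived inputs could be computed (Python; the timestamps and year are assumptions, and the season and darkness rules are simple illustrative choices):

```python
# Derive Delay, Season, and Dark from raw accident and claim timestamps
import pandas as pd

claims = pd.DataFrame({
    "accident_dt": pd.to_datetime(["1996-11-11 12:38", "1996-12-22 01:42"]),
    "claim_dt":    pd.to_datetime(["1996-11-30 09:00", "1997-11-20 10:15"]),
})

claims["delay"]  = (claims["claim_dt"] - claims["accident_dt"]).dt.days  # 19, 333
claims["season"] = claims["accident_dt"].dt.month.map(
    {12: "winter", 1: "winter", 2: "winter", 3: "spring", 4: "spring", 5: "spring",
     6: "summer", 7: "summer", 8: "summer", 9: "fall", 10: "fall", 11: "fall"})
claims["dark"]   = ((claims["accident_dt"].dt.hour >= 20) |
                    (claims["accident_dt"].dt.hour < 6)).astype(int)
print(claims)
```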

73 Data Challenges How do I convert my data to the proper level of granularity?

74 Roll-Up How do I convert my data to the proper level of granularity?
[Diagram: an account-level table with columns HH, Acct, and Sales is rolled up to one row per household (HH), with the household-level values to be derived.]

75 Rolling Up Longitudinal Data
How do I convert my data to the proper level of granularity? [Table: monthly frequent-flier records with columns Flier, Month, Mileage, and VIP Member (Jan through Apr per flier, with the VIP flag switching from No to Yes partway through); these must be rolled up to one row per flier.]
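
A minimal roll-up sketch in Python (all values invented) that aggregates account rows to the household level and monthly flier rows to one row per flier:

```python
# Roll up account rows to households, and monthly mileage rows to one row per flier
import pandas as pd

accounts = pd.DataFrame({"HH":    [1, 1, 2, 2, 2],
                         "Acct":  [11, 12, 21, 22, 23],
                         "Sales": [100, 250, 40, 60, 90]})
household = accounts.groupby("HH").agg(n_accts=("Acct", "nunique"),
                                       total_sales=("Sales", "sum"))
print(household)

miles = pd.DataFrame({"Flier":   ["A", "A", "A", "B", "B", "B"],
                      "Month":   ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
                      "Mileage": [1200, 800, 950, 4000, 5200, 6100],
                      "VIP":     ["No", "No", "No", "No", "Yes", "Yes"]})
flier = miles.groupby("Flier").agg(
    total_miles=("Mileage", "sum"),
    ever_vip=("VIP", lambda s: int((s == "Yes").any())))
print(flier)
```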

76 Data Challenges What sorts of raw data quality problems can I expect?

77 Errors, Outliers, and Missings
What sorts of raw data quality problems can I expect? [Table: sample records with fields cking, #cking, ADB, NSF, dirdep, and SVG bal showing inconsistent coding (Y versus y) and missing values.]

78 Missing Value Imputation
What sorts of raw data quality problems can I expect? [Diagram: a cases-by-inputs data matrix with missing values (?) scattered through it.]
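
A simple imputation sketch in Python (toy data; median imputation plus a missing-value flag is one common choice, not the only one):

```python
# Median imputation with a companion flag that records the value was missing
import numpy as np
import pandas as pd

df = pd.DataFrame({"ADB": [1200.0, np.nan, 300.0, np.nan, 950.0],
                   "NSF": [0, 1, np.nan, 0, 2]})

for col in ["ADB", "NSF"]:
    df[f"{col}_missing"] = df[col].isna().astype(int)  # keep the fact of missingness
    df[col] = df[col].fillna(df[col].median())
print(df)
```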

79 Data Challenges Can I (more importantly, should I) analyze all the data that I have? All the observations? All the variables?

80 Massive Data Can I (more importantly, should I) analyze all the data that I have? Printed on paper, 2^10 bytes (a kilobyte) is about half a sheet; 2^20 (a megabyte) about 1 ream; 2^30 (a gigabyte) a stack about 167 feet high; 2^40 (a terabyte) about 32 miles; 2^50 (a petabyte) about 32,000 miles. Reference points: a mid- to large-scale wind turbine is around 167 feet high (GB); the stratosphere extends from about 6 to 31 miles above Earth (TB); the moon is around 238,000 miles away from Earth.

81 Sampling Can I (more importantly, should I) analyze all the data that I have?

82 Oversampling Can I (more importantly, should I) analyze all the data that I have? [Diagram: a large pool of OK transactions and a much smaller set of Fraud cases; oversampling enriches the rare class in the model set.]
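
A sketch of balancing a rare target by keeping every fraud case and sampling the OK cases (Python, invented data):

```python
# Balance a rare target: keep every fraud case and sample the plentiful OK cases
import pandas as pd

transactions = pd.DataFrame({"amount": range(1000),
                             "fraud":  [1 if i % 100 == 0 else 0 for i in range(1000)]})

fraud = transactions[transactions["fraud"] == 1]
ok    = transactions[transactions["fraud"] == 0].sample(n=len(fraud), random_state=1)
model_set = pd.concat([fraud, ok]).sample(frac=1, random_state=1)  # shuffle rows
print(model_set["fraud"].value_counts())
```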

83 The Curse of Dimensionality
Can I (more importantly, should I) analyze all the data that I have? 1-D 2-D 3-D

84 Dimension Reduction Can I (more importantly, should I) analyze all the data that I have? [Diagrams: redundancy, where Input1 and Input2 carry the same information, and irrelevancy, where an input has no relationship to E(Target).]
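
A sketch of two simple reduction tactics, dropping a redundant (highly correlated) input and then compressing the rest with principal components (Python, synthetic data; the 0.95 correlation threshold is an arbitrary assumption):

```python
# Drop a redundant input, then compress the remaining inputs with PCA
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = pd.DataFrame({"input1": rng.normal(size=200)})
X["input2"] = 0.98 * X["input1"] + rng.normal(scale=0.05, size=200)  # redundant copy
X["input3"] = rng.normal(size=200)                                   # unrelated input

corr = X.corr().abs()
redundant = [col for i, col in enumerate(corr.columns)
             if any(corr.iloc[i, :i] > 0.95)]   # 0.95 threshold is arbitrary
X_reduced = X.drop(columns=redundant)

components = PCA(n_components=2).fit_transform(X_reduced)
print("dropped:", redundant, "| reduced shape:", components.shape)
```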

85 2.04 Multiple Answer Poll Which of the following statements are true?
A. The more data you can get, the better. B. Too many variables can make it difficult to detect patterns in data. C. Too few variables can make it difficult to learn interesting facts about the data. D. Cases with missing values should generally be deleted from modeling. Answer: B, C

86 2.04 Multiple Answer Poll – Correct Answers
Which of the following statements are true? A. The more data you can get, the better. B. Too many variables can make it difficult to detect patterns in data. C. Too few variables can make it difficult to learn interesting facts about the data. D. Cases with missing values should generally be deleted from modeling. Answer: B, C

87 Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques 2.2 Data Management 2.3 Data Difficulties 2.4 SAS Enterprise Miner: A Primer 2.5 Honest Assessment 2.6 Methodology 2.7 Recommended Reading

88 Objectives Describe the basic navigation of SAS Enterprise Miner.

89 SAS Enterprise Miner

90 SAS Enterprise Miner – Interface Tour
Menu bar and shortcut buttons

91 SAS Enterprise Miner – Interface Tour
Project panel

92 SAS Enterprise Miner – Interface Tour
Properties panel

93 SAS Enterprise Miner – Interface Tour
Help panel

94 SAS Enterprise Miner – Interface Tour
Diagram workspace

95 SAS Enterprise Miner – Interface Tour
Process flow

96 SAS Enterprise Miner – Interface Tour
Node

97 SAS Enterprise Miner – Interface Tour
SEMMA tools palette

98 Catalog Case Study Analysis Goal: A mail-order catalog retailer wants to save money on mailing and increase revenue by targeting mailed catalogs to customers who are most likely to purchase in the future. Data set: CATALOG2010 Number of rows: 48,356 Number of columns: 98 Contents: sales figures summarized across departments and quarterly totals for 5.5 years of sales Targets: RESPOND (binary) ORDERSIZE (continuous)

99 Catalog Case Study: Basics
Throughout this chapter, you work with data in SAS Enterprise Miner to perform exploratory analysis. Import the CATALOG2010 data. Identify the target variables. Define and transform the variables for use in RFM analysis. Perform graphical RFM analysis in SAS Enterprise Miner. Later, you use the CATALOG2010 data for predictive modeling and scoring.

100 Accessing and Importing Data for Modeling
First, get familiar with the data! The data file is a SAS data set. Create a project in SAS Enterprise Miner. Create a diagram. Locate and import the CATALOG2010 data. Define characteristics of the data set, such as the variable roles and measurement levels. Perform a basic exploratory analysis of the data.

101 Defining a Data Source
[Diagram: a metadata definition connects the CATALOG data, stored in the ABA1 library, to the SAS Foundation Server libraries.]

102 Metadata Definition Select a table. Set the metadata information.
Three purposes for metadata: Define variable roles (such as input, target, or ID). Define measurement levels (such as binary, interval, or nominal). Define table role (such as raw data, transactional data, or scoring data).

103 Creating Projects and Diagrams in SAS Enterprise Miner
Catalog Case Study Task: Create a project and a diagram in SAS Enterprise Miner.

104 Defining a Data Source Catalog Case Study Task: Define the CATALOG data source in SAS Enterprise Miner.

105 Defining Column Metadata
Catalog Case Study Task: Define column metadata.

106 Changing the Sampling Defaults in the Explore Window and Exploring a Data Source
Catalog Case Study Tasks: Change preference settings in the Explore window and explore variable associations.

107 Idea Exchange Consider an academic retention example. Freshmen enter a university in the fall term, and some of them drop out before the second term begins. Your job is to try to predict whether a student is likely to drop out after the first term. continued...

108 Idea Exchange What types of variables would you consider using to assess this question? How does time factor into your data collection? Do inferences about students five years ago apply to students today? How do changes in technology, university policies, and teaching trends affect your conclusions? continued...

109 Idea Exchange As an administrator, do you have this information? Could you obtain it? What types of data quality issues do you anticipate? Are there any ethical considerations in accessing the information in your study?

110 Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques 2.2 Data Management 2.3 Data Difficulties 2.4 SAS Enterprise Miner: A Primer 2.5 Honest Assessment 2.6 Methodology 2.7 Recommended Reading

111 Objectives Explain the characteristics of a good predictive model.
Describe data splitting. Discuss the advantages of using honest assessment to evaluate a model and obtain the model with the best prediction.

112 Predictive Modeling Implementation
Model Selection and Comparison Which model gives the best prediction? Decision/Allocation Rule What actions should be taken on new cases? Deployment How can the predictions be applied to new cases? ...

113 Predictive Modeling Implementation
Model Selection and Comparison Which model gives the best prediction? Decision/Allocation Rule What actions should be taken on new cases? Deployment How can the predictions be applied to new cases?

114 Getting the “Best” Prediction: Fool’s Gold
My model fits the training data perfectly... I’ve struck it rich!

115 2.05 Poll The best model is a model that does a good job of predicting your modeling data.  Yes  No No

116 2.05 Poll – Correct Answer The best model is a model that does a good job of predicting your modeling data.  Yes  No No

117 Model Complexity ...

118 Model Complexity Too flexible ...

119 Model Complexity Too flexible Not flexible enough ...

120 Model Complexity Just right Too flexible Not flexible enough

121 Data Splitting and Honest Assessment
[Diagram: the model set is partitioned into Training, Validation, and Test subsets.]
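
A minimal splitting sketch in Python (a 60/20/20 split is assumed; in the course this is done with SAS Enterprise Miner's Data Partition node):

```python
# 60/20/20 split into training, validation, and test partitions (toy data)
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(100),
                   "target": np.random.default_rng(1).integers(0, 2, 100)})

shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)
n = len(shuffled)
train    = shuffled.iloc[: int(0.6 * n)]              # fit candidate models here
validate = shuffled.iloc[int(0.6 * n): int(0.8 * n)]  # choose among models here
test     = shuffled.iloc[int(0.8 * n):]               # report final performance here
print(len(train), len(validate), len(test))
```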

122 Overfitting [Charts: a highly flexible model that tracks the training set closely fits the test set poorly.]

123 Better Fitting [Charts: a model tuned with the help of a validation set fits the training set less tightly but generalizes better.]

124 Predictive Modeling Implementation
Model Selection and Comparison Which model gives the best prediction? Decision/Allocation Rule What actions should be taken on new cases? Deployment How can the predictions be applied to new cases?

125 Decisions, Decisions
[Table: predicted-versus-actual counts and summary measures at three probability cutoffs. Cutoff .08: accuracy 44%, sensitivity 80%, lift 1.3. Cutoff .10: accuracy 60%, sensitivity 60%, lift 1.4. Cutoff .12: accuracy 76%, sensitivity 40%, lift 1.8. Raising the cutoff here improves accuracy and lift but captures fewer of the actual events.]
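
A sketch of how accuracy, sensitivity, and lift change with the cutoff (Python; the probabilities and outcomes below are invented, not the slide's counts):

```python
# Accuracy, sensitivity, and lift at different probability cutoffs
import numpy as np

actual = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 0])
prob   = np.array([.15, .05, .12, .09, .02, .11, .07, .04, .06, .10])

def decision_metrics(actual, prob, cutoff):
    pred = (prob >= cutoff).astype(int)
    accuracy    = (pred == actual).mean()
    sensitivity = pred[actual == 1].mean()                   # share of events caught
    lift        = actual[pred == 1].mean() / actual.mean()   # response rate vs. baseline
    return accuracy, sensitivity, lift

for cutoff in (0.08, 0.10, 0.12):
    print(cutoff, decision_metrics(actual, prob, cutoff))
```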

126 Misclassification Costs
[Table: misclassification costs by actual class (OK, Fraud) and predicted class/action (0 = Accept, 1 = Deny). Accepting an OK case is a true negative and denying a Fraud case is a true positive; denying an OK case is a false positive with cost 1; accepting a Fraud case is a false negative with cost 9.]

127 Predictive Modeling Implementation
Model Selection and Comparison Which model gives the best prediction? Decision/Allocation Rule What actions should be taken on new cases? Deployment How can the predictions be applied to new cases?

128 Scoring Model Deployment Model Development

129 Scoring Recipe The model results in a formula or rules.
The data requires modifications. Derived inputs Transformations Missing value imputation The scoring code is deployed. To score, you do not rerun the algorithm; apply score code (equations) obtained from the final model to the scoring data.

130 Scorability
[Diagram: training data plotted on X1 by X2 feeds a tree classifier, which produces scoring code: if x1<.47 and x2<.18, or x1>.47 and x2>.29, then red. A new case is classified by applying that rule.]
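
As a sketch, the published rule can be turned directly into score code (Python; the class labels other than "red" are illustrative):

```python
# Hand-coded score function for the published tree rule; scoring applies the
# fixed rule to new cases instead of rerunning the algorithm.
def score(x1: float, x2: float) -> str:
    if (x1 < 0.47 and x2 < 0.18) or (x1 > 0.47 and x2 > 0.29):
        return "red"
    return "not red"

print(score(0.30, 0.10))   # red
print(score(0.30, 0.40))   # not red
```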

131 Scoring Pitfalls: Population Drift
[Time line: data is generated, acquired, cleaned, and analyzed, and only later is the model deployed; the population can drift during that gap.]

132 The Secret to Better Predictions
[Chart: Transaction Amt. plotted against Time of Day, with Fraud and OK cases intermixed.]

133 The Secret to Better Predictions
[Charts: Transaction Amt. versus Time of Day shown separately for Fraud cases and for OK cases.]

134 The Secret to Better Predictions
[Charts: Transaction Amt. versus Time of Day for Fraud cases (labeled "Cheatin' Heart") and for OK cases.]

135 Idea Exchange Think of everything that you have done in the past week. What transactions or actions created data? For example, point-of-sale transactions, Internet activity, surveillance, and questionnaires are all data collection avenues that many people encounter daily. How do you think that the data about you will be used? How could models be deployed that use data about you?

136 Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques 2.2 Data Management 2.3 Data Difficulties 2.4 SAS Enterprise Miner: A Primer 2.5 Honest Assessment 2.6 Methodology 2.7 Recommended Reading

137 Objectives Describe a methodology for implementing business analytics through data mining. Discuss each of the steps, with examples, in the methodology. Create a project and diagram in SAS Enterprise Miner.

138 Methodology
Data mining is not a linear process. It is a cycle, where later results can lead back to previous steps. [Diagram, a cycle: Define or refine business objective → Select data → Explore input data → Prepare and repair data → Transform input data → Apply analysis → Assess results → Deploy models → back to the business objective.] This twelve-step methodology is intended to cover the basic steps needed for a successful data mining project. The steps occur at very different levels: some are concerned with the mechanics of modeling, while others are concerned with business needs. The steps do have a natural order, but it is not necessary or even desirable to completely finish with one before moving on to the next. Things learned in later steps will cause earlier ones to be revisited. The rest of this chapter looks at these steps one by one.

139 Why Have a Methodology? To avoid learning things that are not true
To avoid learning things that are not useful: results that arise from past marketing decisions, results that you already know, results that you already should know, and results that you are not allowed to use. To create stable models. To avoid making the mistakes that you made in the past. To develop useful tips from what you learned. The methodology presented here grew out of our own experience. Early in our careers, we learned that a focus on techniques and algorithms does not lead to success. The methodology is our attempt to bring together the tricks of the trade that we have picked up over the years. Data mining should provide new information. Many of the strongest patterns in data represent things that are already known. People over retirement age tend not to respond to offers for retirement savings plans. People who live where there is no home delivery do not become newspaper subscribers; even though they might respond to subscription offers, service never starts. For the same reason, people who live where there are no cell towers tend not to purchase cell phones. Often, the strongest patterns reflect business rules. Not only are these patterns uninteresting, their strength might obscure less obvious patterns. Learning things that are already known does serve one useful purpose: it demonstrates that, on a technical level, the data mining effort is working and the data is reasonably accurate. This can be quite comforting.

140 Methodology
1. Define the business objective and state it as a data mining task. [Diagram: the methodology cycle, with the "Define or refine business objective" step highlighted.]

141 1) Define the Business Objective
Improve the response rate for a direct marketing campaign. Increase the average order size. Determine what drives customer acquisition. Forecast the size of the customer base in the future. Choose the right message for the right groups of customers. Target a marketing campaign to maximize incremental value. Recommend the next, best product for existing customers. Segment customers by behavior. A data mining project should seek to solve a well-defined business problem. Wherever possible, broad, general goals such as “understand customers better” should be broken down into more specific ones like the ones on the slide. A lot of good statistical analysis is directed at solving the wrong business problem.

142 Define the Business Goal
Example: Who is the yogurt lover? What is a yogurt lover? One answer prints coupons at the cash register. Another answer mails coupons to people’s homes. Another results in advertising. Through a misunderstanding of the business problem, we once created a model that gave supermarket loyalty card holders a “yogurt lover” score based on their likelihood to be in the top tercile for both absolute dollars spent on yogurt and yogurt as a proportion of their total shopping. The model got good lift, and we were pleased with it. The client, however, was disappointed. “But, who is the yogurt lover?” they wanted to know. “Someone who gets a high score from the yogurt lover model” was not considered a good answer. The client was looking for something like “The yogurt lover is a woman between the ages of X and Y living in a zip code where the median home price is between M and N.” A description like that could be used for deciding where to buy advertising and how to shape the creative content of ads. Ours, based as it was on shopping behavior rather than demographics, could not.

143 Big Challenge: Defining a Yogurt Lover
“Yogurt lover” is not in the data. You can impute it, using business rules: yogurt lovers spend a lot of money on yogurt, and yogurt lovers spend a relatively large amount of their shopping dollars on yogurt. [Diagram: a grid of $$ Spent on Yogurt (LOW, MEDIUM, HIGH) by Yogurt as % of All Purchases (LOW, MEDIUM, HIGH); customers in the HIGH/HIGH cell are flagged as yogurt lovers.]
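
A sketch of that business-rule imputation (Python, invented spending figures; terciles via pd.qcut stand in for the LOW/MEDIUM/HIGH bands):

```python
# Business-rule imputation of a "yogurt lover" flag: top tercile on both
# yogurt dollars and yogurt share of spend
import pandas as pd

shoppers = pd.DataFrame({"yogurt_dollars": [120, 5, 60, 200, 15, 90],
                         "total_dollars":  [800, 900, 400, 1000, 950, 450]})
shoppers["yogurt_share"] = shoppers["yogurt_dollars"] / shoppers["total_dollars"]

dollars_band = pd.qcut(shoppers["yogurt_dollars"], 3, labels=["LOW", "MEDIUM", "HIGH"])
share_band   = pd.qcut(shoppers["yogurt_share"], 3, labels=["LOW", "MEDIUM", "HIGH"])
shoppers["yogurt_lover"] = ((dollars_band == "HIGH") & (share_band == "HIGH")).astype(int)
print(shoppers)
```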

144 Next Challenge: Profile the Yogurt Lover
You have identified a segment of customers that you believe are yogurt lovers. But who are they? How would I know them in the store? Identify them by demographic data. Identify them by other things that they purchase (for example, yogurt lovers are people who buy nutrition bars and sports drinks). What action can I take? Set up “yogurt-lover-attracting” displays.

145 Idea Exchange If a customer is identified as a yogurt lover, what action should you take? Should you give yogurt coupons, even though these individuals buy yogurt anyway? Is there a cross-sell opportunity? Is there an opportunity to identify potential yogurt lovers? What would you do?

146 Profiling in the Extreme: Best Buy
Using analytical methodology, electronics retailer Best Buy discovered that a small percentage of customers accounted for a large percentage of revenue. Over the past several years, the company has adopted a customer-centric approach to store design and flow, staffing, and even corporate acquisitions such as the Geek Squad support team. The company’s largest competitor has gone bankrupt while Best Buy has seen growth in market share. See Gulati (2010)

147 Define the Business Objective
What is the business objective? Example: Telco Churn. Initial problem: Assign a churn score to all customers. That raises questions: What about recent customers with little call history? Are the units telephones, individuals, or families? Voluntary churn versus involuntary churn? How will the results be used? Better objective: By September 24, provide a list of the 10,000 elite customers who are most likely to churn in October. An instruction to calculate churn scores for all customers led to all sorts of questions. Only after understanding that the business goal was actually to choose candidates for a retention program aimed at members of the elite customer club was it possible to refine the goal into something more actionable. The new objective is actionable.

148 Define the Business Objective
Example: Credit Churn. How do you define the target? When did a customer leave? When she has not made a new charge in six months? When she had a zero balance for three months? When the balance does not support the cost of carrying the customer? When she cancels her card? When the contract ends? [Chart: attrition rate (0% to 3%) plotted against tenure in months.] Before building a credit card attrition model, you have to come to agreement on the definition of attrition. There is no right answer. Your choice will depend on the nature of the actions that the model is designed to support. Typically, it is not up to the data miner to define attrition. That is a business decision, not an analytic one. In the wireless industry, many customers start out on one-year subscriptions. When the year is up, many of them leave, a phenomenon known as anniversary churn. For one of our clients, anniversary churn seemed very high, with many people leaving on the very day they became eligible. But is the cancellation date really the right variable to look at? By looking at call center data, it became clear that many subscribers first placed a call to cancel months earlier, only to be told that they were stuck with the service until the year was up.

149 Translate Business Objectives into Data Mining Tasks
Do you already know the answer? In supervised data mining, the data has examples of what you are looking for, such as the following: customers who responded in the past, customers who stopped, transactions identified as fraud. In unsupervised data mining, you are looking for new patterns, associations, and ideas. Data mining often comes in two flavors, directed and undirected. Directed data mining is about learning from known past examples and repeating this learning in the future. This is predictive modeling, and most data mining falls into this category. Undirected data mining is about spotting unexpected patterns in the data. To be useful, these unexpected patterns need to be assessed by humans for relevance.

150 Data Mining Tasks Lead to Specific Techniques
Objectives: Customer Acquisition, Credit Risk, Pricing, Customer Churn, Fraud Detection, Discovery, Customer Value. To translate a business problem into a data mining problem, it should be formulated as one of the seven tasks mentioned on the slide. The data mining techniques described provide the tools for addressing these tasks. No single data mining tool or technique is equally applicable to all tasks. ...

151 Data Mining Tasks Lead to Specific Techniques
Objectives: Customer Acquisition, Credit Risk, Pricing, Customer Churn, Fraud Detection, Discovery, Customer Value. Tasks: Exploratory Data Analysis, Binary Response Modeling, Multiple Response Modeling, Estimation, Forecasting, Detecting Outliers, Pattern Detection. To translate a business problem into a data mining problem, it should be formulated as one of the seven tasks mentioned on the slide. The data mining techniques described provide the tools for addressing these tasks. No single data mining tool or technique is equally applicable to all tasks. ...

152 Data Mining Tasks Lead to Specific Techniques
Objectives: Customer Acquisition, Credit Risk, Pricing, Customer Churn, Fraud Detection, Discovery, Customer Value. Tasks: Exploratory Data Analysis, Binary Response Modeling, Multiple Response Modeling, Estimation, Forecasting, Detecting Outliers, Pattern Detection. Techniques: Decision Trees, Regression, Neural Networks, Survival Analysis, Clustering, Association Rules, Link Analysis, Hypothesis Testing, Visualization. To translate a business problem into a data mining problem, it should be formulated as one of the seven tasks mentioned on the slide. The data mining techniques described provide the tools for addressing these tasks. No single data mining tool or technique is equally applicable to all tasks.

153 Data Analysis Is Pattern Detection
Patterns might not represent any underlying rule. Some patterns reflect some underlying reality: the party that holds the White House tends to lose seats in Congress during off-year elections. Others do not: when the American League wins the World Series in Major League Baseball, Republicans take the White House; stars cluster in constellations. Sometimes, it is difficult to tell without analysis: in U.S. presidential contests, the taller candidate usually wins. We humans are very good at seeing patterns. Presumably, the reason that humans have evolved such an affinity for patterns is that patterns often do reflect some underlying truth about the way the world works. The phases of the moon, the progression of the seasons, the constant alternation of night and day, even the regular appearance of a favorite TV show at the same time on the same day of the week are useful because they are stable and therefore predictive. We can use these patterns to decide when it is safe to plant tomatoes and how to program the VCR. Other patterns clearly do not have any predictive power. If a fair coin comes up heads 5 times in a row, there is still a chance that it will come up tails on the sixth toss. The challenge for data miners is to figure out which patterns are predictive and which are not. According to the article "Presidential Candidates Who Measure Up," by Paul Sommers in Chance magazine (Vol. 9, No. 3, Summer 1996), the following is a list of presidential heights (in inches) for the last century:
Year  Winner        Height  Runner-Up     Height
1900  McKinley      67      Bryan         72
1904  T. Roosevelt  70      Parker        72
1908  Taft          72      Bryan         72
1912  Wilson        71      T. Roosevelt  70
1916  Wilson        71      Hughes        71
1920  Harding       72      Cox           70
1924  Coolidge      70      Davis         72
1928  Hoover        71.5    Smith         71
1932  F. Roosevelt  74      Hoover        71
1936  F. Roosevelt  74      Landon        68
1940  F. Roosevelt  74      Wilkie        73
1944  F. Roosevelt  74      Dewey         68
1948  Truman        69      Dewey         68
1952  Eisenhower    70.5    Stevenson     70
1956  Eisenhower    70.5    Stevenson     70
1960  Kennedy       72      Nixon         71.5
1964  Johnson       75      Goldwater     72
1968  Nixon         71.5    Humphrey      71
1972  Nixon         71.5    McGovern      73
1976  Carter        69.5    Ford          72
1980  Reagan        73      Carter        69.5
1984  Reagan        73      Mondale       70
1988  Bush          74      Dukakis       68
1992  Clinton       74      Bush          74
1996  Clinton       74      Dole          74
2000  Bush          71      Gore          73
2004  Bush          71      Kerry         76
2008  Obama         74      McCain        69
For the last century, the ratio is 16 taller versus 10 shorter (61% to the taller). If we throw in cases where the candidates have the same height and consider the popular vote as opposed to the electoral outcome (Gore had a higher popular vote in 2000), then the tally is 21/26 (81%).

154 Example: Maximizing Donations
Example from the KDD Cup, a data mining competition associated with the KDD Conference. Purpose: maximizing profit for a charity fundraising campaign, tested on actual results from the mailing (using data withheld from competitors). Competitors took multiple approaches to the modeling: modeling who will respond, modeling how much people will give, and perhaps more esoteric approaches. However, the top three winners all took the same approach (although they used different techniques, methods, and software).

155 The Winning Approach: Expected Revenue
Potential Donors. Task: Estimate response(person), the probability that a person responds to the mailing (estimated on all customers). Actual Donors. Task: Estimate the value of a response, dollars(person) (estimated only on customers who respond). This example is taken from a data mining contest that was held in conjunction with the 1998 KDD Conference (knowledge discovery and data mining). What separated the winners from the losers was not the algorithms they used or the software they employed, but how they translated the business problem into a data mining problem. The business problem was to maximize donations to a charity. The data was a historical database of contributions. Exploring the data revealed the first insight: the more often someone contributed, the less money they contributed each time. Usually, we think that the best customers are those who are the most frequent. In this case, though, it seems that people plan their charitable giving on a yearly basis. They might give it all at one time, or space it over time. More checks does not mean more money. This insight led to the winning approach. People seem to decide how much money to give separately from whether they will respond to the campaign. This suggests two models: the first, built on a training set that contains both contributors and non-contributors, to determine who will respond to the campaign; and the second, built only on contributors, to estimate how much they will give. The three winning entries all took this approach of combining models. Choose prospects with the highest expected value, response(person) × dollars(person).
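
A sketch of the two-model expected-value idea (Python, invented history; the single past_gifts input is an assumption):

```python
# Two-model expected value: P(response) fit on everyone, gift amount fit on donors only
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

history = pd.DataFrame({"past_gifts":  [0, 1, 3, 6, 2, 8, 0, 4],
                        "responded":   [0, 1, 1, 1, 0, 1, 0, 1],
                        "gift_amount": [0, 50, 30, 15, 0, 10, 0, 25]})

response_model = LogisticRegression().fit(history[["past_gifts"]], history["responded"])

donors = history[history["responded"] == 1]
amount_model = LinearRegression().fit(donors[["past_gifts"]], donors["gift_amount"])

prospects = pd.DataFrame({"past_gifts": [0, 2, 5, 9]})
prospects["expected_value"] = (response_model.predict_proba(prospects)[:, 1]
                               * amount_model.predict(prospects))
print(prospects.sort_values("expected_value", ascending=False))
```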

156 An Unexpected Pattern An unexpected pattern suggests an approach.
When people give money frequently, they tend to donate less money each time. In most business applications, as people take an action more often, they spend more money. Donors to a charity are different. This suggests that potential donors go through a two-step process: Shall I respond to this mailing? How much money should I give this time? Modeling can follow the same logic.

157 Methodology
2. Select or collect the appropriate data to address the problem. Identify the customer signature. [Diagram: the methodology cycle, with the "Select data" step highlighted.]

158 2) Select Appropriate Data
What is available? What is the right level of granularity? How much data is needed? How much history is required? How many variables should be used? What must the data contain? Assemble results into customer signatures. Of course, data mining depends on the data that is available. It is worth asking questions about the data to determine what kinds of questions it can answer.

159 Representativeness of the Training Sample
The model set might not reflect the relevant population. Customers differ from prospects. Survey responders differ from non-responders. People who read differ from people who do not read . Customers who started three years ago might differ from customers who started three months ago. People with land lines differ from those without. The model set is the collection of historical data that is used to develop data mining models. For inferences drawn from the model set to be valid, the model set must reflect the population that the model is meant to describe, classify, or score. A sample that does not properly reflect its parent population is biased. Using a biased sample as a model set is a recipe for learning things that are not true. It is also hard to avoid. Getting a truly representative sample requires effort. One explanation for the “Dewey Defeats Truman” headline is that telephone polls were biased because in 1948, more Republicans than Democrats had telephones. Although the polling industry has learned how to better estimate results for the population at large based on a sample, biases regularly occur when taking a poll over the Internet or by radio, and today there is once again a sizeable number of households without landline telephones.

160 Availability of Relevant Data
Elevated printing defect rates might be due to humidity, but that information is not in press run records. Poor coverage might be the number one reason for wireless subscribers canceling their subscriptions, but data about dropped calls is not in billing data. Customers might already have potential cross-sell products from other companies, but that information is not available internally. In some cases, the right data might not be available. There is sometimes the option of purchasing external data; however, this rarely fills in all the gaps. A company’s data about its own customers is a key competitive advantage that other companies do not have.

161 Types of Attributes in Data
Readily Supported Binary Categorical (nominal) Numeric (interval) Date and time Require More Work Text Image Video Links Categorical columns contain discrete information, such as product type, county, credit class, and so on. SAS calls these nominal, after the Latin word for name. Notice that class columns might be represented as numbers (ZIP codes, product IDs, and so on). However, computation on these numbers makes no sense. Binary columns, a special type of categorical columns, take on exactly two values. These are particularly important as targets. Notice that sometimes the two values are “something” and missing (particularly when data is appended from outside vendors). Numeric columns contain numbers. SAS calls these interval because subtraction (the interval between two values) is defined. Date/Time columns represent dates, times, and spans of time between them. SAS calls these intervals as well.

162 Idea Exchange Suppose that you were in charge of a charity similar to the KDD example above. What type of data are you likely to have available before beginning the project? Is there additional data that you would need? Do you have to purchase the data, or is it publicly available for free? How could you make the best use of a limited budget to acquire high quality data about individual donation patterns?

163 The Customer Signature
The primary key uniquely identifies each row, often corresponding to customer ID. The target columns are what you are looking for. Sometimes, the information is in multiple columns, such as a churn flag and churn date. A foreign key gives access to data in another table, such as ZIP code demographics. Some columns are ignored because the values are not predictive or they contain future information, or for other reasons. The customer signature is the basic unit of modeling. Each row describes a single customer (or whatever we are interested in). The columns contain features about the customer. Each row generally corresponds to a customer.

164 Data Assembly Operations
Derivation of new variables, copying, table lookup, pivoting, aggregation, and summarization of values from data. Chapter 17 of Data Mining Techniques is devoted to constructing customer signatures and preparing data for mining. Some of the operations typically used to assemble a customer signature are listed on the slide. These operations are needed to take data that is stored at many levels of granularity and bring it all to the customer level. Some data (credit score and acquisition source, for example) is already stored at the customer level and can be copied in. Some data (billing data, for example) is stored by month. In order to create a time series in the customer signature, such data must be pivoted. Some data is stored as individual transactions. Such data must be aggregated to the customer level. Often, the aggregation is performed to the monthly level and the summarized data is then pivoted. Some data finds its way into the customer signature through lookup tables. For example, the customer's zip code might be used to look up such things as the median home price, median rent, and population density. Still other values do not exist anywhere in the source data and are summarized from the available data, such as the historical sales penetration within a zip code or the churn rate by handset type. Finally, many elements of the customer signature are derived from other elements. Examples include trends, ratios, and durations.

165 Methodology
3. Explore the data. Look for anomalies. Consider time-dependent variables. Identify key relationships among variables. [Diagram: the methodology cycle, with the "Explore input data" step highlighted.]

166 3) Explore the Data Examine distributions. Study histograms.
Think about extreme values. Notice the prevalence of missing values. Compare values with descriptions. Validate assumptions. Ask many questions. Good data miners seem to rely heavily on intuition, somehow being able to guess what a good derived variable to try might be, for instance. The only way to develop intuition for what is going on in an unfamiliar data set is to immerse yourself in it. Along the way, you are likely to discover many data quality problems and be inspired to ask many questions that would not otherwise come up. A good first step is to examine a histogram of each variable in the data set and think about what it is telling you. Make note of anything that seems surprising. If there is a state code variable, is California the tallest bar? If not, why not? Are some states missing? If so, does it seem plausible that this company does not do business in those states? If there is a gender variable, are there similar numbers of men and women? If not, is that unexpected? Pay attention to the range of each variable. Do variables that should be counts take on negative values? Do the highest and lowest values sound like reasonable values for that variable to take on? Is the mean much different from the median? How many missing values are there? Have the variable counts been consistent over time?

167 Ask Many Questions Why were some customers active for 31 days in February, but none were active for more than 28 days in January? How do some retail card holders spend more than $100,000 in a week in a grocery store? Why were so many customers born in 1911? Are they really that old? Why do Safari users never make second purchases? What does it mean when the contract begin date is after the contract end date? Why are there negative numbers in the sale price field? How can active customers have a non-null value in the cancellation reason code field? These are all real questions we have had occasion to ask about real data. Sometimes the answers taught us things we had not known about the client’s industry. At the time, New Jersey and Massachusetts did not allow automobile insurers much flexibility in setting rates, so a company that sees its main competitive advantage as smarter pricing does not want to operate in those markets. Other times we learned about idiosyncrasies of the operational systems, such as the data entry screen that insisted on a birth date even when none was known, which lead to a lot of people being assigned the birthday November 11, 1911 because 11/11/11 is the date you get by holding down the “1” key and letting it auto-repeat until the field is full. Sometimes we discovered serious problems with the data such as the data for February being misidentified as January. And in the last example, we learned that the process extracting the data had bugs.

168 Be Wary of Changes over Time
Does the same code have the same meaning in historical data? Did different data elements start being loaded at different points in time? Did something happen at a particular point in time? [Chart: price-related cancellations over time, spiking after a price increase.] The fact that a variable keeps the same name over time does not mean that its meaning stays the same. Credit classes might always be A, B, C, and D, but the threshold credit scores might change over time. Similarly, the definition of sales regions may change over time. Policy changes may cause changes in customer behavior and so on. Looking at how data values have changed over time can reveal some of these things. This graph suggests that there might have been a price increase around the beginning of February.

169 Methodology
4. Prepare and repair the data. Define metadata correctly. Partition the data and create balanced samples, if necessary. [Diagram: the methodology cycle, with the "Prepare and repair data" step highlighted.]

170 4) Prepare and Repair the Data
Set up a proper temporal relationship between the target variable and inputs. Create a balanced sample, if possible. Include multiple time frames if necessary. Split the data into training, validation, and (optionally) test data sets. The model set contains all the data that is used in the modeling process. Some of the data in the model set is used to find patterns. Some of the data in the model set is used to verify that the model is stable. Some is used to assess the model’s performance. Creating a model set requires assembling data from multiple sources to form customer signatures and then preparing the data for analysis. Modeling is done on flat data sets that have one row per item to be modeled. We generally call these rows customer signatures because in analytic CRM, most analysis is at the customer level. Sometimes, however, another level is appropriate. If customers are anonymous, analysis might have to be done at the order level. If the relationship is with entire families, the household level might be most appropriate. This decision must be made before signatures can be constructed.

171 Temporal Relationship: Prediction or Profiling?
The same techniques work for both. In a predictive model, values of explanatory variables are from an earlier time frame than the target variable. In a profiling model, the explanatory variables and the target variable might all be from the same time frame. (Diagram: inputs from an earlier time frame predicting a later target, versus inputs and target drawn from the same time frame.) When building a predictive model, data from the distant past is used to explain a target variable in the recent past. This separation of time frames is very important; it mimics the gap that exists when the model is deployed and used. When a predictive model is used, data from the recent past is used to predict future values of the target variable.

172 Balancing the Input Data Set
A very accurate model simply predicts that no one wants a brokerage account: it is 98.8% accurate, with a 1.2% error rate, yet useless for differentiating among customers. When the outcome of interest is this rare, the model set should be balanced so that the algorithm is forced to learn the characteristics of the rare class instead of simply ignoring it.

173 Two Ways to Create Balanced Data
Stratified sampling and weights (called frequencies in SAS Enterprise Miner) are two ways of creating a balanced model set. With stratified sampling, the model set ends up with fewer records, which is often desirable. With weights or frequencies, every record counts, but the common ones count less than the rare ones. Using stratified sampling or weights to balance the outcomes in the model forces the model to distinguish characteristics of the rare class rather than simply ignoring it. It is easy to get back to actual distribution of classes and probabilities after the fact.
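A minimal sketch, in Python/pandas rather than SAS Enterprise Miner, of the two balancing approaches described above: stratified undersampling of the common class, and a frequency-style weight that leaves every record in place. The toy data and column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy model set: 1,000 rows with a rare 0/1 target (roughly 2% ones).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "income": rng.normal(50_000, 15_000, 1_000),
    "target": rng.binomial(1, 0.02, 1_000),
})

rare = df[df["target"] == 1]
common = df[df["target"] == 0]

# Option 1: stratified undersampling -- fewer records, balanced outcomes.
balanced = pd.concat([rare, common.sample(n=len(rare), random_state=42)])

# Option 2: keep every record, but give the rare class a larger weight
# (the analogue of a frequency variable in SAS Enterprise Miner).
df["freq"] = np.where(df["target"] == 1, len(common) / len(rare), 1.0)

print(balanced["target"].value_counts())
print(df["freq"].value_counts())
```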

174 Data Splitting and Validation
Improving the model causes the error rate to decline on the data used to build it. At the same time, the model becomes more complex. Data mining algorithms find patterns in preclassified data. As long as you keep looking at the same training data, the more complex the model becomes (more leaves in a decision tree; more iterations of training for a neural network; higher-degree polynomials to fit a curve), the better it will appear to fit the training data. Unfortunately, it will be fitting noise as well as signal. This can be seen by applying models to a second preclassified data set. In this graph, the error rate is shown going up on the validation data as the model becomes overly complex, even though it is still decreasing on the training data. The tendency of models to memorize or overfit the training data at the expense of stability and generality is discussed at greater length in the chapters on decision trees and neural networks. (Chart: error rate versus increasing model complexity.)

175 Validation Data Prevents Overfitting
(Chart: error rate versus model complexity. Training error keeps falling as the model begins fitting noise rather than signal, while validation error reaches a minimum, the sweet spot, and then rises again.)

176 Partitioning the Input Data Set
Training Validation Test Use the training set to find patterns and create an initial set of candidate models. Use the validation set to select the best model from the candidate set of models. Use the test set to measure performance of the selected model on unseen data. The test set can be an out-of-time sample of the data, if necessary. Partitioning data is an allowable luxury because data mining assumes a large amount of data. Test sets do not help select the final model; they only provide an estimate of the model's effectiveness in the population. Test sets are not always used. When building models, the model set is often partitioned into several sets of data. The first part, the training set, is used to build the initial model. The second part, the validation set, is used by the modeling algorithm to adjust the initial model to make it more general and less tied to the idiosyncrasies of the training set; some data mining algorithms use a validation set during training, some do not. The third part, the test set, is used to gauge the likely effectiveness of the model when applied to unseen data. Three sets are necessary because once data has been used for one step in the process, it can no longer be used for the next step; the information it contains has already become part of the model, so it cannot be used to correct or judge the model. For predictive models that are going to be scored in the future, the test set should also come from a different time period than the training and validation sets. The proof of a model's stability is in its ability to perform well month after month. A test set from a different time period, often called an out-of-time test set, is a good way to verify model stability, although such a test set is not always available.
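The partitioning itself is simple to sketch. The following Python/pandas fragment splits a toy model set into training, validation, and test portions; the 60/30/10 proportions are illustrative, not a recommendation from this course.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the prepared model set.
rng = np.random.default_rng(42)
df = pd.DataFrame({"x": rng.normal(size=1_000),
                   "target": rng.binomial(1, 0.3, 1_000)})

# Shuffle once, then carve off 60% training, 30% validation, 10% test.
shuffled = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
n = len(shuffled)
train = shuffled.iloc[:int(0.6 * n)]              # find patterns
valid = shuffled.iloc[int(0.6 * n):int(0.9 * n)]  # pick the best candidate model
test = shuffled.iloc[int(0.9 * n):]               # estimate performance on unseen data

# If data from a later time period is available, prefer it as an
# out-of-time test set instead of this random slice.
print(len(train), len(valid), len(test))
```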

177 Fix Problems with the Data
Data imperfectly describes the features of the real world. Data might be missing or empty. Samples might not be representative. Categorical variables might have too many values. Numeric variables might have unusual distributions and outliers. Meanings can change over time. Data might be coded inconsistently. All data is dirty. All data has problems. What is or is not a problem varies with the data mining technique. For some, such as decision trees, missing values and outliers do not cause too much trouble. For others, such as neural networks, they cause all sorts of trouble. For that reason, some of what we have to say about fixing problems with data can be found in the chapters on the techniques where they cause the most difficulty.

178 No Easy Fix for Missing Values
Throw out the records with missing values? No. This creates a bias in the sample. Replace missing values with a "special" value (-99)? No. This resembles any other value to a data mining algorithm. Replace with some "typical" value? Maybe. Replacement with the mean, median, or mode changes the distribution, but predictions might be fine. Impute a value? (Imputed values should be flagged.) Maybe. Use the distribution of values to randomly choose a value. Maybe. Model the imputed value using some technique. Use data mining techniques that can handle missing values? Yes. One of these, decision trees, is discussed. Partition records and build multiple models? Yes. This action is possible when data is missing for a canonical reason, such as insufficient history. Some data mining algorithms are capable of treating "missing" as a value and incorporating it into rules. Others cannot handle missing values, unfortunately. None of the obvious solutions preserve the true distribution of the variable. Throwing out all records with missing values introduces bias because it is unlikely that such records are distributed randomly. The Impute node in SAS Enterprise Miner simplifies the task of replacing missing values, but with what? Replacing the missing value with some likely value such as the mean or the most common value adds spurious information. Replacing the missing value with an unlikely value is even worse because the data mining algorithms will not recognize it as special. For example, 999 is an unlikely value for age, but the algorithms will go ahead and use it. Another possibility is to replace the value with a common value, such as the median or mean, or to pull a value from the distribution of known values. It is even possible to predict the missing value using a decision tree or neural network; under some circumstances, this can be the only thing to do. It is sometimes useful to distinguish between two different situations that can both lead to a null value in a database or in a SAS data set. A value is said to be missing if there is a value but it is unknown. An empty value occurs when there simply is no value for that attribute. In the decision trees chapter, we will discuss two approaches to building models in the presence of missing values: using surrogate splits and treating missing as a legitimate value.
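For example, the "impute and flag" option might look like the following Python/pandas sketch, with an invented age column. The flag keeps the fact of missingness available to the model even after the value has been filled in.

```python
import numpy as np
import pandas as pd

# Invented example with missing ages.
df = pd.DataFrame({"age": [34, 51, np.nan, 47, np.nan, 29]})

# Flag that the value was missing, then impute with the median.
# The flag keeps "missingness" itself available as an input.
df["age_missing"] = df["age"].isna().astype(int)
df["age"] = df["age"].fillna(df["age"].median())
print(df)
```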

179 Methodology
5. Transform data. Standardize, bin, combine, replace, impute, log, and so on. (Diagram: the data mining methodology loop.)

180 5) Transform Data Standardize values into z-scores.
Change counts into percentages. Remove outliers. Capture trends with ratios, differences, or beta values. Combine variables to bring information to the surface. Replace categorical variables with some numeric function of the categorical values. Impute missing values. Transform using mathematical functions, such as logs. Translate dates to durations. Once the model set has been assembled and major data problems fixed, the data must still be prepared for analysis. This involves adding derived fields to bring information to the surface. It may also involve removing outliers, binning numeric variables, grouping classes for categorical variables, applying transformations such as logarithms, turning counts into proportions, and the like. Data preparation is such an important topic that our colleague Dorian Pyle has written a book about it, Data Preparation for Data Mining. Example: Body Mass Index (kg/m²) is a better predictor of diabetes than either weight or height separately.
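A few of these transformations, sketched in Python/pandas on invented columns (income, spend, weight, height, a transaction date): a z-score, a log transform, the BMI-style combination of variables, and a date-to-duration translation. This is illustration only, not the course's SAS Enterprise Miner workflow.

```python
import numpy as np
import pandas as pd

# Invented columns, purely to illustrate the transformations listed above.
df = pd.DataFrame({
    "income": [42_000, 58_000, 97_000, 31_000],
    "total_spend": [120.0, 45.0, 2_300.0, 15.0],
    "weight_kg": [70, 82, 95, 60],
    "height_m": [1.75, 1.80, 1.68, 1.62],
    "last_txn": pd.to_datetime(["2024-01-05", "2024-02-28",
                                "2023-11-30", "2024-02-14"]),
})

# Standardize a numeric input into a z-score.
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

# "Squish" a long-tailed variable with a log transform.
df["log_spend"] = np.log1p(df["total_spend"])

# Combine variables to bring information to the surface:
# body mass index = weight (kg) / height (m) squared.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Translate a date into a duration relative to a cutoff.
cutoff = pd.Timestamp("2024-03-01")
df["days_since_txn"] = (cutoff - df["last_txn"]).dt.days
print(df)
```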

181 A Selection of Transformations
Standardize numeric values. All numeric values are replaced by the notion of “how far is this value from the average?” Conceptually, all numeric values are in the same range. (The actual range differs, but the meaning is the same.) Although it sometimes has no effect on the results (such as for decision trees and regression), it never produces worse results. Standardization is so useful that it is often built into SAS Enterprise Miner modeling nodes.

182 A Selection of Transformations
“Stretching” and “squishing” transformations Log, reciprocal, and square root are examples. Replace categorical values with appropriate numeric values. Many techniques work better with numeric values than with categorical values. Historical projections (such as handset churn rate or penetration by ZIP code) are particularly useful.
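One hedged sketch of replacing a categorical value with a historical rate, along the lines of the churn rate or penetration by ZIP code mentioned above. The ZIP codes, the churn column, and the tiny tables are invented; the important detail is that the rate is computed on training data only, so the target does not leak into the inputs, and a fallback is used for categories unseen in training.

```python
import pandas as pd

# Tiny invented tables; "zip" is high-cardinality in real data.
train = pd.DataFrame({
    "zip": ["02134", "02134", "90210", "90210", "60601", "60601"],
    "churn": [1, 0, 1, 1, 0, 0],
})
score = pd.DataFrame({"zip": ["02134", "60601", "30301"]})  # 30301 unseen in training

# Replace the ZIP code with the historical churn rate from training data.
zip_rate = train.groupby("zip")["churn"].mean()
overall = train["churn"].mean()

train["zip_churn_rate"] = train["zip"].map(zip_rate)
score["zip_churn_rate"] = score["zip"].map(zip_rate).fillna(overall)
print(score)
```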

183 Methodology
6. Apply analysis. Fit many candidate models, try different solutions, try different sets of input variables, and select the best model. (Diagram: the data mining methodology loop.)

184 6) Apply Analysis Regression Decision trees Cluster detection
Association rules Neural networks Memory-based reasoning Survival analysis Link analysis Genetic algorithms Most of the rest of the class is devoted to understanding these techniques and how they are used to build models.

185 Train Models
Build candidate models by applying a data mining technique (or techniques) to the training data. (Diagram: the same training inputs feeding several candidate models, each producing its own output.)

186 Assess Models
Assess models by applying the models to the validation data set. (Diagram: validation data fed through each candidate model so that their outputs can be compared.)

187 Assess Models Score the validation data using the candidate models and then compare the results. Select the model with the best performance on the validation data set. Communicate model assessments through quantitative measures and graphs. There are a number of different ways to assess models. All of them work by comparing predicted results with actual results in preclassified data that was not used as part of the model-building process.
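A compressed sketch of the train-then-assess loop: fit several candidate models on training data, score the validation data, and pick the model with the best quantitative measure. It uses scikit-learn and synthetic data purely to keep the example self-contained; it is not the course's SAS Enterprise Miner workflow, and the assessment measure shown (area under the ROC curve) is just one possibility.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the model set, just to keep the sketch runnable.
X, y = make_classification(n_samples=5_000, n_features=10,
                           weights=[0.9], random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                # fit on the training data
    p = model.predict_proba(X_valid)[:, 1]     # score the validation data
    scores[name] = roc_auc_score(y_valid, p)   # one quantitative measure

best = max(scores, key=scores.get)
print(scores, "-> selected:", best)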

188 Look for Warnings in Models
Trailing Indicators: Learning Things That Are Not True What happens in month 8? More than one novice data miner has discovered that usage declines in the month before a customer stops using a service. This chart shows monthly usage for a subscriber who appears to fit the pattern. But appearances are deceiving. Looking at minutes of use by day instead of by month would show that the customer continued to use the service at a constant rate until the middle of the month and then stopped completely, presumably because on that day, he or she began using a competing service. The putative period of declining usage does not actually exist and so certainly does not provide a window of opportunity for retaining the customer. What appears to be a leading indicator is actually a trailing one. Does declining usage in month 8 predict attrition in month 9?

189 Look for Warnings in Models
Perfect Models: Things that are too good to be true. 100% of customers who spoke to a customer support representative canceled a contract. Eureka! It's all I need to know! But if a customer cancels, that customer is automatically flagged to get a call from customer support, so the information is useless in predicting cancellation. Models that seem too good usually are.

190 Idea Exchange What are some other warning signs that you can think of in modeling? Have you experienced any pitfalls that were memorable or that changed how you approach the data analysis objectives?

191 Methodology
7. Deploy models. Score new observations and make model-based decisions. Gather the results of model deployment. (Diagram: the data mining methodology loop.)

192 7) Deploy Models and Score New Data
Models are meant to be used. Depending on the environment, that may mean scoring customers every month, every day, or every time they have a transaction. Or it may mean creating rules that are triggered when a customer looks at a particular item on a retail Web page. Deploying a model means moving it from the data mining environment to the scoring environment. This process may be easy or hard. In the worst case (and we have seen this at more than one company), the model is developed in a special modeling environment using software that runs nowhere else. To deploy the model, a programmer takes a printed description of the model and recodes it in another programming language so that it can be run on the scoring platform. A more common problem is that the model uses input variables that are not in the original data. This should not be a problem, because the model inputs are at least derived from the fields that were originally extracted to form the model set. Unfortunately, data miners are not always good about keeping a clean, reusable record of the transformations they applied to the data.

193 Methodology
8. Assess the usefulness of the model. If the model has gone stale, revise it. (Diagram: the data mining methodology loop.)

194 8) Assess Results Compare actual results against expectations.
Compare the challenger's results against the champion's. Did the model find the right people? Did the action affect their behavior? What are the characteristics of the customers most affected by the intervention? The real test of data mining comes when you can measure the value of the actions you have taken as a result of the mining. Measuring lift on a test set helps choose the right model. Profitability models based on lift will help decide how to apply the results of the model. But it is very important to measure these things in the field as well. In a database marketing application, this requires always setting aside control groups and carefully tracking customer response according to various model scores. When designing an experiment to assess results, keep in mind the right size for the test; being sure that any test groups are chosen randomly and get the same treatment (same message, same offer, same timing, similar customers); and being sure that operational systems can handle the process.

195 Good Test Design Measures the Impact of Both the Message and the Model
The design crosses two factors (picked by model: yes/no; receives message: yes/no), giving four groups:
Target Group: chosen by model; receives message. Response measures the message with the model.
Control Group: chosen at random; receives message. Response measures the message without the model.
Modeled Holdout: chosen by model; receives no message. Response measures the model without the message.
Holdout Group: chosen at random; receives no message. Response measures the background response.
Comparing the target group with the control group shows the impact of the model on the group getting the message; comparing the target group with the modeled holdout shows the impact of the message on the group with good model scores.

196 Test Mailing Results (Chart: e-mail campaign test results, lift 3.5.)
This graph is taken from the assessment of an e-mail campaign by a bank. The e-mail was sent to customers who had given permission to be contacted in this manner. The e-mail suggested they sign up for a particular product. Ten thousand people were picked at random to receive the e-mail. This is the control group, which had a 0.2% response rate. Another 10,000 people were sent the e-mail because they were scored by the model as likely to be interested in the product. This group responded at the rate of 0.7%, or three and a half times the rate of the control. So clearly the model did something. It found a group of people more likely to sign up for the product than the average customer. But was it the e-mail that caused them to sign up, or did the model simply identify people who were more likely to sign up whether or not they got the e-mail? To test that, the bank also tracked the take-up rate of people who did not get the e-mail. It was virtually nil, so both the model and the message had an effect.
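The lift arithmetic behind that description, with response counts back-calculated from the quoted rates (0.2% and 0.7% of 10,000 people each):

```python
# Response counts implied by the quoted rates: 0.2% and 0.7% of 10,000.
control_n, control_resp = 10_000, 20
target_n, target_resp = 10_000, 70

control_rate = control_resp / control_n   # message without the model
target_rate = target_resp / target_n      # message with the model
lift = target_rate / control_rate         # 0.007 / 0.002 = 3.5

print(f"control {control_rate:.1%}, target {target_rate:.1%}, lift {lift:.1f}")
```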

197 Methodology
9. As you learn from earlier model results, refine the business goals to gain more from the data. (Diagram: the data mining methodology loop.)

198 9) Begin Again Revisit business objectives. Define new objectives.
Gather and evaluate new data: model scores, cluster assignments, responses. Example: A model discovers that geography is a good predictor of churn. What do the high-churn geographies have in common? Is the pattern your model discovered stable over time? Every data mining project raises more questions than it answers. This is a good thing. It means that new relationships are now visible that were not visible before. The newly discovered relationships suggest new hypotheses to test, and the data mining process begins all over again.

199 Lessons Learned Data miners must be careful to avoid pitfalls, particularly with regard to spurious patterns in the data: learning things that are not true or not useful, confusing signal and noise, and creating unstable models. A methodology is a way of being careful. Data mining comes in two forms. Directed data mining is searching through historical records to find patterns that explain a particular outcome. Directed data mining includes the tasks of classification, estimation, prediction, and profiling. Undirected data mining is searching through the same records for interesting patterns. It includes the tasks of clustering, finding association rules, and description. The primary lesson of this chapter is that data mining is full of traps for the unwary, and following a data mining methodology based on experience can help avoid them. The first hurdle is translating the business problem into one of the six tasks that can be solved by data mining: classification, estimation, prediction, affinity grouping, clustering, and profiling. The next challenge is to locate appropriate data that can be transformed into actionable information. Once the data has been located, it should be explored thoroughly. The exploration process is likely to reveal problems with the data. It will also help build up the data miner's intuitive understanding of the data. The next step is to create a model set and partition it into training, validation, and test sets. Data transformations are necessary for two purposes: to fix problems with the data, such as missing values and categorical variables that take on too many values, and to bring information to the surface by creating new variables to represent trends and other ratios and combinations. Once the data has been prepared, building models is a relatively easy process. Each type of model has its own metrics by which it can be assessed, but there are also assessment tools that are independent of the type of model. Some of the most important of these are the lift chart, which shows how the model has increased the concentration of the desired value of the target variable, and the confusion matrix, which shows the misclassification error rate for each of the target classes.

200 Idea Exchange Outline a business objective of your own in terms of the methodology described here. What is your business objective? Can you frame it in terms of a data mining problem? How will you select the data? What are the inputs? What do you want to look at to get familiar with the data? continued...

201 Idea Exchange Anticipate any data quality problems that you might encounter and how you could go about fixing them. Do any variables require transformation? Proceed through the remaining steps of the methodology as you consider your example.

202 Basic Data Modeling A common approach to modeling customer value is RFM analysis, so named because it uses three key variables: Recency – how long it has been since the customer’s last purchase Frequency – how many times the customer has purchased something Monetary value – how much money the customer has spent RFM variables tend to predict responses to marketing campaigns effectively.  RFM is a special case of OLAP.

203 RFM Cell Approach
(Diagram: a cube with recency, frequency, and monetary value as its three axes; each axis is divided into bins, so every customer falls into one RFM cell.)

204 RFM Cell Approach A typical approach to RFM analysis is to bin customers into (approximately) equal-sized groups on each of the rank-ordered R, F, and M variables. For example: Bin five groups on R (highest bin = most recent). Bin five groups on F (highest bin = most frequent). Bin five groups on M (highest bin = highest value). The combination of the bins gives an RFM "score" that can be compared to some target or outcome variable. A customer score of 555 = most recent quintile, most frequent quintile, highest spending quintile.

205 Computing Profitability in RFM
Break-even response rate = cost of promotion to an individual / average net profit per sale (that is, the current cost of promotion per dollar of net profit). Example: It costs $2.00 to print and mail each catalog, and the average net profit per transaction is $30.00, so the break-even response rate is 2.00/30.00 ≈ 0.067. Profitable RFM cells are those with a response rate greater than 6.7%.
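The same break-even arithmetic as a tiny Python fragment; the 9% cell response rate at the end is an invented value used only to show the comparison.

```python
# Break-even response rate from the catalog example above.
cost_per_piece = 2.00         # cost to print and mail one catalog
net_profit_per_sale = 30.00   # average net profit per transaction

break_even_rate = cost_per_piece / net_profit_per_sale   # about 0.067, i.e. 6.7%
print(f"break-even response rate: {break_even_rate:.1%}")

# A cell is worth mailing if its observed response rate clears the bar.
cell_response_rate = 0.09     # invented value for one RFM cell
print("profitable" if cell_response_rate > break_even_rate else "not profitable")
```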

206 RFM Analysis of the Catalog Data
Recode recency so that the highest values are the most recent. Bin the R, F, and M variables into five groups each, numbered 1-5, so that 1 is the least valuable and 5 is the most valuable bin. Concatenate the RFM variables to obtain a single RFM “score.” Graphically investigate the response rates for the different groups.
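A minimal Python/pandas sketch of these four steps on invented catalog-style data (the real exercise uses the course's catalog data set in SAS Enterprise Miner). Ranking before binning is just one way to force five roughly equal-sized bins when values are heavily tied.

```python
import numpy as np
import pandas as pd

# Invented catalog-style data; column names are illustrative.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "days_since_purchase": rng.integers(1, 720, 1_000),
    "num_purchases": rng.integers(1, 40, 1_000),
    "total_spend": rng.gamma(2.0, 150.0, 1_000),
    "responded": rng.binomial(1, 0.05, 1_000),
})

# 1) Recode recency so that higher = more recent.
df["recency"] = -df["days_since_purchase"]

# 2) Bin R, F, and M into quintiles numbered 1-5 (5 = most valuable).
for rfm, col in [("R", "recency"), ("F", "num_purchases"), ("M", "total_spend")]:
    df[rfm] = pd.qcut(df[col].rank(method="first"), 5,
                      labels=[1, 2, 3, 4, 5]).astype(int)

# 3) Concatenate the bins into a single RFM "score" (555, 321, and so on).
df["RFM"] = df["R"].astype(str) + df["F"].astype(str) + df["M"].astype(str)

# 4) Compare response rates across RFM cells (plot these to inspect graphically).
print(df.groupby("RFM")["responded"].mean().sort_values(ascending=False).head())
```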

207 Performing RFM Analysis of the Catalog Data
Catalog Case Study Task: Perform RFM analysis on the catalog data.

208 Performing Graphical RFM Analysis
Catalog Case Study Task: Perform graphical RFM analysis.

209 Limitations of RFM Only uses three variables
Modern data collection processes offer rich information about preferences, behaviors, attitudes, and demographics. Scores are entirely categorical: 515, 551, and 155 are equally good if the RFM variables are of equal importance, and sorting by the RFM values is not informative and overemphasizes recency. So many categories: the simple example above results in 125 groups. Not very useful for finding prospective customers: the statistics are descriptive.

210 Idea Exchange Would RFM analysis apply to a business objective that you are considering? If so, what would be your R, F, and M variables? What other basic analytical techniques could you use to explore your data and get preliminary answers to your questions?

211 Exercise Scenario Practice with a charity direct mail example. Analysis Goal: A veterans' organization seeks continued contributions from lapsing donors. Use lapsing donor response from an earlier campaign to predict future lapsing donor response. ...

212 Exercise Scenario Practice with a charity direct mail example.
Analysis Goal: A veterans' organization seeks continued contributions from lapsing donors. Use lapsing donor response from an earlier campaign to predict future lapsing donor response. Exercise Data (PVA97NK): The data is extracted from the previous year's campaign. The sample is balanced with regard to response/non-response rate. The actual response rate is approximately 5%.

213 R, F, M Variables in the Charity Data Set
In the data set PVA97NK, the following variables should be used for RFM analysis:
GiftTimeLast: time since last gift (Recency)
GiftCntAll: gift count over all months (Frequency)
Monetary value must be computed as GiftAvgAll * GiftCntAll (average gift amount over lifetime multiplied by total gift count).
Use SAS Enterprise Miner to create the RFM variables and bins, and then perform graphical RFM analysis.
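For readers working outside SAS Enterprise Miner, the same construction might look like the following Python/pandas sketch. The CSV file name is assumed, and TargetB is assumed to be the 0/1 response flag in PVA97NK; if your export uses different names, adjust accordingly.

```python
import pandas as pd

# Assumed: PVA97NK exported to CSV with the variable names shown on the slide;
# TargetB is assumed to be the 0/1 response flag.
pva = pd.read_csv("pva97nk.csv")

# Monetary value = average gift amount over lifetime * total gift count.
pva["GiftMonetary"] = pva["GiftAvgAll"] * pva["GiftCntAll"]

# Quintile bins numbered 1-5, where 5 is the most valuable bin.
def quintile(series, ascending=True):
    ranked = series.rank(method="first", ascending=ascending)
    return pd.qcut(ranked, 5, labels=[1, 2, 3, 4, 5]).astype(int)

pva["R"] = quintile(pva["GiftTimeLast"], ascending=False)  # smaller = more recent
pva["F"] = quintile(pva["GiftCntAll"])
pva["M"] = quintile(pva["GiftMonetary"])

# Response rate by RFM cell, ready for a graphical look.
pva["RFM"] = pva["R"].astype(str) + pva["F"].astype(str) + pva["M"].astype(str)
print(pva.groupby("RFM")["TargetB"].mean().sort_values(ascending=False).head())
```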

214 Exercise This exercise reinforces the concepts discussed previously.

215 Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques 2.2 Data Management 2.3 Data Difficulties 2.4 SAS Enterprise Miner: A Primer 2.5 Honest Assessment 2.6 Methodology 2.7 Recommended Reading

216 Recommended Reading Davenport, Thomas H., Jeanne G. Harris, and Robert Morison. 2010. Analytics at Work: Smarter Decisions, Better Results. Boston: Harvard Business Press. Chapters 2 through 6 (the DELTA method) present a complementary perspective to this chapter on how to integrate analytics at various levels of the organization.

