1 Zhangxi Lin, ISQS 7342-001, Texas Tech University. Note: most slides in this file are sourced from SAS® Course Notes. Lecture Notes 8: Continuous and Multiple Target Prediction

2 Structure of the Chapter
 Section 2.1 raises the problem: the standard decision tree methods did not produce good results
 Section 2.2 analyzes the problem
 Section 2.3 develops basic two-stage models to improve the results
 Section 2.4 further improves the two-stage models

3 Section 2.1 Introduction

4 Motivation
 The results of the 1998 KDD-Cup produced a surprise: almost half of the entries yielded a total profit on the validation data that was less than that obtained by soliciting everyone.
 Part of the problem lies in the method used to select cases for solicitation. This chapter extends the notion of profit introduced in Chapter 1 to allow for better selection of cases for solicitation.

5 1998 KDD-Cup Results

Rank   Total Profit   Overall Avg. Profit
 1     $14,712        $0.153
 2      14,662         0.152
 3      13,954         0.145
 4      13,825         0.143
 5      13,794         0.143
 6      13,598         0.141
 7      13,040         0.135
 8      12,298         0.128
 9      11,423         0.119
10      11,276         0.117
11      10,720         0.111
12      10,706         0.111
13      10,112         0.105
14      10,049         0.104
15       9,741         0.101
16       9,464         0.098
17       5,683         0.059
18       5,484         0.057
19       1,925         0.020
20       1,706         0.018

“Solicit everyone” model: total profit $10,560, overall average profit $0.110

6 Section 2.2 Generalized Profit Matrices

7 Random Profit Consequences
[Figure: profit consequences of the primary and secondary decisions for an individual case; the secondary decision yields zero profit, while the primary decision can yield a positive or a negative profit.]

8 Outcome Conditioned Random Profits
 In a more general context, the profit associated with a decision for an individual case can be thought of as a random variable. The goal of predictive modeling is to estimate the distribution of this profit random variable conditioned on case input measurements.
 Because the decisions are usually associated with discrete outcomes, the random profits are conditioned on each of these outcomes. For a binary outcome and two decisions, the random profits form the elements of a 2×2 random matrix.

9 Outcome Conditioned Random Profits
[Figure: 2×2 layout of random profits conditioned on outcome (primary, secondary) and decision (primary, secondary); secondary-decision profits are zero, and the profit for a secondary outcome under the primary decision is negative.]

10 Expected Profit Matrix
[Figure: the same 2×2 layout with each random profit replaced by its expected value E(·).]

11 Expected/Reduced Profit Matrix
 Because it is easier to work with concrete numbers than random variables, statistical summaries of the random profit matrices are used to quantify the consequence of a decision.
 One way to do this is to calculate the expected value of the profit random variable for each outcome and decision combination. Arrayed as a matrix, this is called the expected profit-consequence matrix, or the expected profit matrix, for a case.
 Often, generalized profit matrices have zeros in the secondary decision column. Without loss of generality (assuming the profit consequence is measured by expected value), it is always possible to write the generalized profit matrix with a column of zero profits.
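
To make the reduction concrete, here is a minimal numpy sketch with hypothetical profit values (not the course data): subtracting each row's secondary-decision entry from both entries leaves the comparison between decisions unchanged and zeroes out the secondary column.

```python
import numpy as np

# Hypothetical expected profit matrix for one case:
# rows = outcomes (primary = donates, secondary = does not donate),
# columns = decisions (primary = solicit, secondary = do not solicit).
expected_profit = np.array([
    [14.32, 5.00],   # donor: profit if solicited, residual value if not
    [-0.68, 0.00],   # non-donor: mailing cost if solicited, nothing if not
])

# Reduce the matrix by subtracting each row's secondary-decision entry from
# both entries in that row. Only profit differences between decisions matter,
# so the implied decisions are unchanged and the secondary column becomes zero.
reduced = expected_profit - expected_profit[:, [1]]
print(reduced)
# [[ 9.32  0.  ]
#  [-0.68  0.  ]]
```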

12-13 Reduced Profit Matrix
[Figure: the expected profit matrix is reduced by subtracting the secondary-decision column from both columns, so each entry measures the difference in expected profit relative to the secondary decision; the secondary-decision column becomes zero, and the entry for a secondary outcome under the primary decision remains a negative profit.]

14-17 Expected Profit-Consequence
For a case with primary-outcome probability p (and secondary-outcome probability 1 − p), the expected profit-consequence (EPC) of the primary decision combines the reduced profit matrix entries with these probabilities:
EPC = p · E(profit | primary outcome, primary decision) + (1 − p) · E(profit | secondary outcome, primary decision)
The second term is typically a negative profit (the cost of soliciting a non-responder). In the reduced matrix, the EPC of the secondary decision is zero.
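
As a minimal sketch (with made-up numbers, not the course data), the EPC of the primary decision can be computed directly from a reduced profit matrix and the predicted primary-outcome probabilities:

```python
import numpy as np

profit_primary = 14.32     # reduced profit if the case responds to the solicitation
profit_secondary = -0.68   # reduced (negative) profit if the case does not respond

p_primary = np.array([0.08, 0.03, 0.12])   # predicted response probabilities per case

epc = p_primary * profit_primary + (1 - p_primary) * profit_secondary
print(epc)   # solicit the cases whose EPC exceeds the chosen threshold (often 0)
```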

18 Sort Expected Profit-Consequence
Sort cases by decreasing EPC.

19-20 Total Expected Profit
Sum the EPCs of the cases whose EPC is greater than or equal to the selected threshold; this gives the total expected profit at that threshold.

21-29 Observed Profit
For each case, record its observed profit (OP): the profit actually realized under the primary decision, given the case's observed outcome (positive for primary outcomes, negative for secondary outcomes). Cases remain ordered by decreasing EPC.

30 Observed Total Profit
Sum the OPs of the cases whose EPC is greater than or equal to the threshold.

31 Generalized Profit Assessment Data
The assessment data pair each case's EPC with its OP, ordered by decreasing EPC.

32 Total Profit Plot
Plot the total profit against depth (the fraction of cases selected).

33 Observed and Expected Profit Plot
Plot the observed and the expected total profit against depth on the same axes.
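
The assessment logic on these slides can be sketched in a few lines of Python; the data frame below stands in for scored validation data, and all column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# One row per validation case: its expected profit-consequence (EPC) and the
# observed profit (OP) realized under the primary decision. Values are made up.
df = pd.DataFrame({
    "EPC": [1.10, 0.52, 0.31, 0.05, -0.10, -0.40],
    "OP":  [24.32, -0.68, 9.32, -0.68, 4.32, -0.68],
})

# Sort cases by decreasing EPC, then accumulate profit down the list.
df = df.sort_values("EPC", ascending=False).reset_index(drop=True)
df["depth"] = (np.arange(len(df)) + 1) / len(df)       # fraction of cases selected
df["total_expected_profit"] = df["EPC"].cumsum()       # sum of EPCs above the threshold
df["total_observed_profit"] = df["OP"].cumsum()        # sum of OPs for the same cases

# Plotting the two cumulative columns against depth reproduces the total profit
# plot and the observed and expected profit plot.
print(df)
```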

34 Profit Confusion Matrix
Observed profits cross-tabulated by outcome and decision:

                     Primary Decision         Secondary Decision       Row total
Primary Outcome      true positive profit     false negative profit    total primary profit
Secondary Outcome    false positive profit    true negative profit     total secondary profit
Column total         total primary decision   total secondary decision
                     profit                   profit

35 True Positive Profit Fraction
True positive profit divided by total primary profit.

36 False Positive Profit Fraction
False positive profit divided by total secondary profit.
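
One plausible way to compute these quantities, assuming OP records the profit each case would generate under the primary decision and that outcomes and decisions are coded 0/1, is sketched below; the arrays are illustrative, not the course data.

```python
import numpy as np

op       = np.array([24.32, -0.68, 9.32, -0.68, 4.32, -0.68])  # per-case observed profit
outcome  = np.array([1, 0, 1, 0, 1, 0])                        # 1 = primary outcome
decision = np.array([1, 1, 1, 0, 0, 0])                        # 1 = primary decision

true_positive_profit  = op[(outcome == 1) & (decision == 1)].sum()
false_positive_profit = op[(outcome == 0) & (decision == 1)].sum()
total_primary_profit   = op[outcome == 1].sum()
total_secondary_profit = op[outcome == 0].sum()

tp_profit_fraction = true_positive_profit / total_primary_profit
fp_profit_fraction = false_positive_profit / total_secondary_profit
print(tp_profit_fraction, fp_profit_fraction)
```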

37 Section 2.3 Basic Two-Stage Models

38 Defining Two-Stage Model Components
The components E(B | X) and E(D | X) can come from specified values (for example, a specified constant such as 15.30), from separate predictive models, or from a joint predictive model of E(B, D | X).

39 Two-Stage Modeling Methods
 A better estimate of the primary decision profit can be obtained by modeling both the outcome probability and the expected profit, using two-stage modeling methods.
 There are several ways to estimate the components used in two-stage models:
 The first is to simply specify values for certain components. This is simple to do, but it often produces poor results.
 In a more sophisticated approach, you can use the value in an input or a look-up table as a surrogate for the expected donation amount.
 The most common approach is to estimate values for the components with individual models.
 At the extreme end of the sophistication scale, you can use a single model to predict both components simultaneously, for example, with the NEURAL procedure in SAS Enterprise Miner.

40 Basic Two-Stage Models
A two-stage model combines two component models:
- one to estimate the donation propensity;
- another to estimate the donation amount.
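
A minimal scikit-learn sketch of this idea is shown below. It is not the SAS Enterprise Miner Two Stage node itself; the DataFrame, the input list, and the choice of logistic and linear regression are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression, LinearRegression

# Minimal two-stage sketch. `train` is assumed to be a DataFrame with input
# columns listed in `inputs`, a binary TARGET_B (1 = donated), and an interval
# TARGET_D (donation amount). The model choices are illustrative only.

def fit_two_stage(train: pd.DataFrame, inputs: list[str]):
    # Stage 1: class model for donation propensity.
    class_model = LogisticRegression(max_iter=1000)
    class_model.fit(train[inputs], train["TARGET_B"])

    # Stage 2: value model for donation amount, fit on responders only
    # (the FILTER idea: non-donor cases are removed before fitting).
    donors = train[train["TARGET_B"] == 1]
    value_model = LinearRegression()
    value_model.fit(donors[inputs], donors["TARGET_D"])
    return class_model, value_model

def expected_donation(class_model, value_model, data: pd.DataFrame, inputs: list[str]):
    p_donate = class_model.predict_proba(data[inputs])[:, 1]
    amount = value_model.predict(data[inputs])
    # The MULTIPLY idea: expected donation = P(donate) * E(amount | donate).
    return p_donate * amount
```

Subtracting the mailing cost from the expected donation then gives an EPC that can be used to select cases for solicitation.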

41 Two-Stage Model Tool
 The Two Stage Model tool builds two models, one to predict TARGET_B and one to predict TARGET_D. Theoretically, you can use this node to combine predictions for the two target variables and get a prediction of the expected donation amount.
 The tool has two minor limitations:
 It does not recognize the prior vector. Thus, because responders are overrepresented in the training data, the probabilities in the TARGET_B model are biased.
 The node has no built-in diagnostic to assess overall average profit. Profit information passed to the Assessment node is incorrect.
 Both of these limitations are easily overcome by the Generalized Profit Assessment tool.

42 The Model We Are Using
Basic model (different from the book).

43 Target Variables

44 Some Two-Stage Model Options
 Model fitting approach: sequential or concurrent
 Sequential: couples the models by making the binary outcome model's prediction an input to the expected profit model (a sketch follows)
 Concurrent: fits a neural network model with the two targets jointly
 FILTER: removes cases (non-event cases) from the training data when building the value model
 MULTIPLY: multiplies the class and value model predictions
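
Continuing the hypothetical scikit-learn sketch above, the sequential option corresponds to feeding the class model's prediction to the value model as an extra input:

```python
# Assumes class_model, train, and inputs from the earlier two-stage sketch.
from sklearn.linear_model import LinearRegression

p_train = class_model.predict_proba(train[inputs])[:, 1]
train_seq = train.assign(P_TARGET_B=p_train)          # hypothetical column name

donors = train_seq[train_seq["TARGET_B"] == 1]
value_model_seq = LinearRegression()
value_model_seq.fit(donors[inputs + ["P_TARGET_B"]], donors["TARGET_D"])
```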

45 Results of the Two-Stage Node

46 Results of the GPA Node
Oddities in the assessment report:
1. The reported overall average profits from the training data are extremely low.
2. The depth supposedly corresponding to the optimum profit threshold is reported to be 100% (select all cases).
3. The total profit reported on the validation data is almost 40% higher than on the training data.

47 Stratification with BIN_TARGET_D

48 Improved Results of the GPA Node
The third problem has been solved, but the performance of the model is still lower than that of the “no model” (solicit everyone) baseline.

49 Correct the bias in GPA by setting the following parameter in the code:
%let EM_PROPERTY_adjustprobs = Y;
The model is no longer selecting all the data (the depth is now around 60%), but the overall average profit values remain low. The average profit = 0.1105, only slightly more than that obtained without using a model.
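
Conceptually, this adjustment rescales probabilities estimated on oversampled training data back to the population priors; a minimal sketch, with hypothetical prior values, is:

```python
import numpy as np

def adjust_to_priors(p_sample, prior_pop=0.05, prior_sample=0.50):
    """Rescale P(response) estimated under sample priors to population priors."""
    num = p_sample * prior_pop / prior_sample
    den = num + (1 - p_sample) * (1 - prior_pop) / (1 - prior_sample)
    return num / den

p_biased = np.array([0.45, 0.60, 0.30])   # probabilities from oversampled data
print(adjust_to_priors(p_biased))          # much smaller, prior-consistent values
```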

50 Results from an Improved Two-Stage Model
Parameters:
Class Model: Regression
Selection Model: Stepwise
Selection Criteria: Validation Error
The average profit: 0.155. This result is good enough to win the KDD Cup!

51 Summary – Improving the Performance
Section 2.3
 Use two-stage models
 Stratification using the binned value target
 Correct the bias in GPA: %let EM_PROPERTY_adjustprobs = Y;
Section 2.4
 Use regression settings in the Two Stage node
 Reduce MSE: interval target value transformation
 Construct the component models separately from the Two Stage node
 Use regression trees in a two-stage model (%let EM_PROPERTY_adjustprobs = N;)
 Use neural networks in a two-stage model (%let EM_PROPERTY_adjustprobs = N;)

52 Section 2.4 Constructing Component Models

53 Two-Stage Modeling Challenges
Two challenges: model assessment, and interval model specification, E(D) = g(x; w).

54 Two-Stage Modeling Challenges
 Constructing a two-stage model (or, more generally, any multiple-component model) requires attention to several challenges not previously encountered.
 Earlier modeling assessment efforts evaluated models based on profitability measures, assuming a fixed profit structure. Because the profit structure itself is being modeled in a two-stage model, you need a different mechanism to assess model performance.
 Correct specification requires appropriately chosen inputs, link functions, and target error distribution.
 By incorporating the predictions of the binary model into the interval model, it can be possible to make a more parsimonious specification of the interval model.

55 Estimating Mean Squared Error
Estimated MSE = (1/N) Σ_{i=1}^{N} (D_i − D̂_i)², an estimate of MSE = E[(D − D̂)²], computed here from the training data.

56 MSE Decomposition: Variance
MSE = E[(D − D̂)²] = E[(D − E[D])²] + [E(D̂ − E[D])]²  (variance + squared bias)
In theory, the MSE can be decomposed into two components, each involving a deviation from the true expected value of the target variable.

57 MSE Decomposition: Squared Bias
MSE = E[(D − D̂)²] = E[(D − E[D])²] + [E(D̂ − E[D])]²
Variance: independent of any fitted model. Bias²: the difference between the predicted and the actual expected value.

58-60 Honest MSE Estimation
Estimated MSE = (1/N) Σ_{i=1}^{N} (D_i − D̂_i)², now computed from validation data.
MSE = E[(D − D̂)²] = E[(D − E[D])²] + [E(D̂ − E[D])]²  (variance + bias²)
Unbiased estimates can be obtained by correctly accounting for model degrees of freedom in the MSE estimate, or simply by estimating MSE from an independent validation data set.
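
A minimal sketch of the idea, with placeholder numbers rather than the course data:

```python
import numpy as np

def mse(actual, predicted):
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.mean((actual - predicted) ** 2))

d_train, d_hat_train = [15.0, 10.0, 25.0], [14.0, 11.0, 23.0]   # training cases
d_valid, d_hat_valid = [20.0, 5.0, 12.0], [16.0, 9.0, 15.0]     # held-out cases

print("training MSE  :", mse(d_train, d_hat_train))   # typically optimistic
print("validation MSE:", mse(d_valid, d_hat_valid))   # honest estimate
```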

61 MSE and Binary Target Models
Estimated MSE = (1/N) Σ_{i=1}^{N} (B_i − B̂_i)², computed from validation data.
MSE = E[(B − B̂)²] = E[(B − E[B])²] + [E(B̂ − E[B])]²
For a binary target, MSE is interpreted as inaccuracy, the variance term as inseparability, and the bias² term as imprecision.

62 The Binary Target
 The estimated MSE of the binary target can be thought of as measuring the overall inaccuracy of the model prediction.
 This inaccuracy estimate can be decomposed into a term related to the inseparability of the two target levels (corresponding to the variance component) plus a term related to the imprecision of the model estimate (corresponding to the bias-squared component).
 In this way, the model with the smallest estimated MSE will also be the least imprecise.

63 Two-Stage Modeling Challenges
Model assessment: use validation MSE. To assess both the binary and the interval component models, it is reasonable to compare their validation-data mean squared error. Models with the smallest MSE will have the smallest bias or imprecision.

64 Two-Stage Modeling Challenges
Interval model specification, E(D) = g(x; w): a standard regression model may be ill suited for accurately modeling the relationship between the inputs and TARGET_D. Matching the structure of the model to the specific modeling requirements is vital to obtaining good predictions.

65 Interval Model Requirements
Good inputs (for example, x1, x3, x10), positive predictions (E(D) > 0), a correct error distribution, and adequate flexibility.

66 Making Positive Predictions
Two approaches: transform the target and model E(log(Y) | X), or define an appropriate link so that log(E(Y | X)) is modeled.
Hints: The interval component of a two-stage model is often used to predict a monetary response. Random variables that represent monetary amounts usually have a skewed distribution with a positive range and a variance related to the expected value. When the target variable represents a monetary amount, this limited range and skewness must be considered in the model specification. Proper specification of the target range and error distribution increases the chances of selecting good inputs for the interval target model. With good inputs, the correct degree of flexibility can be incorporated into the model and predictions can be optimized.
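
A minimal sketch of the transform-the-target route, using simulated data rather than TARGET_D: fit ordinary least squares to log(D), then map predictions back to the dollar scale (naive exponentiation understates the mean when the errors are roughly lognormal, hence the variance correction).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))                                  # stand-in inputs
d = np.exp(1.0 + x @ np.array([0.5, 0.2, -0.3])
           + rng.normal(scale=0.4, size=200))                  # positive, skewed target

model = LinearRegression().fit(x, np.log(d))                   # model E(log(D) | X)
log_pred = model.predict(x)

sigma2 = np.mean((np.log(d) - log_pred) ** 2)                  # residual variance, log scale
d_hat = np.exp(log_pred + sigma2 / 2.0)                        # lognormal mean retransformation
print(d_hat[:5])                                               # strictly positive predictions
```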

67 Error Distribution Requirements
The error distribution should possess the correct skewness, have conforming support, and account for heteroscedasticity.

68-71 Specifying the Correct Error Distribution

Distribution          Variance
Normal (truncated)    constant
Poisson               ∝ E(Y)
Gamma                 ∝ (E(Y))²
Lognormal             ∝ (E(Y))²

The normal distribution has a range from negative to positive infinity, whereas the target variable may have a more restricted range.
One disadvantage of the Poisson distribution relates to its skewness properties. Poisson error distributions are limited to the Neural Network node.
The gamma distribution is limited to the Neural Network node. The lognormal distribution can be used with any modeling tool.
A few extreme outliers (on the order of 100 times typical values) may indicate a lognormal distribution, whereas the absence of such outliers may imply a gamma or less extreme distribution.
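
As a hedged illustration of the gamma choice outside of SAS Enterprise Miner, scikit-learn's GammaRegressor fits a gamma error distribution with a log link (variance proportional to the squared mean, strictly positive predictions); the data below are simulated placeholders, not TARGET_D.

```python
import numpy as np
from sklearn.linear_model import GammaRegressor

rng = np.random.default_rng(1)
x = rng.normal(size=(300, 3))                          # stand-in inputs
mu = np.exp(1.0 + x @ np.array([0.4, -0.2, 0.1]))      # true mean on the positive scale
d = rng.gamma(shape=2.0, scale=mu / 2.0)               # mean mu, variance mu**2 / 2

model = GammaRegressor(alpha=0.0, max_iter=1000).fit(x, d)   # gamma errors, log link
print(model.predict(x[:5]))                            # positive, skew-aware predictions
```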

72 Two-Stage Modeling Challenges (summary)
Model assessment: use validation MSE. Interval model specification, E(D) = g(x; w): log-transform the target, or specify an appropriate link and error distribution.

73 Interval Target Model

74 The Parameters and Results

75 Compare the Distributions of Residuals
Using the log-transformed TARGET_D versus using the original TARGET_D.

76 Using Regression Trees

77 Using Neural Network Models

