Zhangxi Lin, ISQS 7342-001, Texas Tech University. Note: Most slides in this file are sourced from Course Notes. Lecture Notes 8: Continuous and Multiple Target Prediction.


Zhangxi Lin ISQS Texas Tech University Note: Most slides in this file are sourced from Course Notes Lecture Notes 8 Continuous and Multiple Target Prediction

Structure of the Chapter
- Section 2.1 raises the problem that the standard decision tree methods did not produce good results.
- Section 2.2 analyzes the problem.
- Section 2.3 develops basic two-stage models to improve the results.
- Section 2.4 further improves the two-stage models.

Section 2.1 Introduction

Motivation
- The results of the 1998 KDD-Cup produced a surprise: almost half of the entries yielded a total profit on the validation data that was less than that obtained by soliciting everyone.
- Part of the problem lies in the method used to select cases for solicitation. This chapter extends the notion of profit introduced in Chapter 1 to allow for better selection of cases for solicitation.

KDD-Cup Results (total profit on the validation data, by rank)

Ranks 1-10:  $14,712  $14,662  $13,954  $13,825  $13,794  $13,598  $13,040  $12,298  $11,423  $11,276
Ranks 11-20: $10,720  $10,706  $10,112  $10,049  $9,741  $9,464  $5,683  $5,484  $1,925  $1,706

Total profit for the "solicit everyone" model: $10,560

Section 2.2 Generalized Profit Matrices

Random Profit Consequences
(figure: the profit consequence of the primary versus the secondary decision for a case; the profit under the primary decision is random and can be negative)

Outcome Conditioned Random Profits
- In a more general context, the profit associated with a decision for an individual case can be thought of as a random variable. The goal of predictive modeling is to estimate the distribution of this profit random variable conditioned on case input measurements.
- Because the decisions are usually associated with discrete outcomes, the random profits are conditioned on each of these outcomes. For a binary outcome and two decisions, the random profits form the elements of a 2×2 random matrix.

Outcome Conditioned Random Profits
(figure: a 2×2 matrix of random profits, with rows for the primary and secondary outcomes and columns for the primary and secondary decisions; the secondary-outcome profit under the primary decision is negative)

Expected Profit Matrix
(figure: the same 2×2 matrix with each random profit replaced by its expected value E(Profit))

Expected/Reduced Profit Matrix
- Because it is easier to work with concrete numbers than with random variables, statistical summaries of the random profit matrices are used to quantify the consequence of a decision.
- One way to do this is to calculate the expected value of the profit random variable for each outcome and decision combination. Arrayed as a matrix, this is called the expected profit-consequence matrix, or the expected profit matrix, for a case.
- Often, generalized profit matrices have zeros in the secondary decision column. Without loss of generality (assuming the profit consequence is measured by expected value), it is always possible to write the generalized profit matrix with a column of zero profits.
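The reduction step can be sketched in a few lines. This is a hypothetical illustration (not SAS Enterprise Miner code), and the profit values are made up:

```python
# Hypothetical expected profit matrix: rows are outcomes (primary,
# secondary), columns are decisions (primary, secondary).
expected_profit = [
    [16.0, 1.0],  # primary outcome:   E(profit | primary), E(profit | secondary)
    [0.0, 1.0],   # secondary outcome
]

# Subtract the secondary-decision column from both columns, leaving
# zeros in the secondary-decision column: the reduced profit matrix.
reduced_profit = [[row[0] - row[1], 0.0] for row in expected_profit]

print(reduced_profit)  # → [[15.0, 0.0], [-1.0, 0.0]]
```

The decision rule is unchanged because only the difference between the two columns matters for choosing a decision.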

Reduced Profit Matrix
(figure: subtracting the secondary-decision column from both columns of the expected profit matrix yields the reduced profit matrix; each primary-decision entry becomes the difference between the two decisions' expected profits, and the secondary-decision column becomes zero)

Expected Profit-Consequence
With the reduced profit matrix, the expected profit-consequence (EPC) of the primary decision for a case is the probability-weighted sum of the reduced profits:

EPC = p_primary × Profit_primary + p_secondary × Profit_secondary

where p_primary and p_secondary are the predicted outcome probabilities, and Profit_primary and Profit_secondary are the reduced profits of the primary decision for each outcome (Profit_secondary is typically negative).

Sort Expected Profit-Consequence
Sort cases by decreasing EPC.

Total Expected Profit
Sum the EPCs in excess of a threshold: with the cases sorted by decreasing EPC, the total expected profit at a given threshold is the sum of all EPC values at or above that threshold.
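The sort-and-threshold procedure above can be sketched as follows; the probabilities, profits, and threshold are hypothetical:

```python
# EPC under a reduced profit matrix: probability-weighted reduced profit.
def epc(p_primary, profit_primary, profit_secondary):
    return p_primary * profit_primary + (1.0 - p_primary) * profit_secondary

# (P(primary outcome), reduced profit if primary, reduced profit otherwise)
cases = [(0.20, 15.0, -1.0), (0.02, 15.0, -1.0), (0.10, 15.0, -1.0)]

# sort cases by decreasing EPC, then sum the EPCs at or above the threshold
epcs = sorted((epc(p, win, loss) for p, win, loss in cases), reverse=True)
threshold = 0.0
total_expected_profit = sum(v for v in epcs if v >= threshold)

print(round(total_expected_profit, 2))  # → 2.8
```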

Observed Profit
(figure sequence: for each case, the observed profit (OP) is recorded next to its EPC; the OP is the realized profit for the observed outcome under the primary decision, which can be negative, or zero under the secondary decision)

Observed Total Profit
Sum the OPs for cases with EPCs in excess of the threshold.
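In code, the observed-profit version of the assessment differs only in what is summed; the values below are hypothetical:

```python
# Each case carries its model-based EPC and its realized observed profit (OP).
cases = [
    {"epc": 2.20, "op": 24.0},   # solicited and responded
    {"epc": 0.60, "op": -1.0},   # solicited but did not respond
    {"epc": -0.68, "op": -1.0},  # below threshold: not solicited
]

threshold = 0.0
# total observed profit: sum OPs only for cases whose EPC meets the threshold
observed_total = sum(c["op"] for c in cases if c["epc"] >= threshold)
print(observed_total)  # → 23.0
```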

Generalized Profit Assessment Data
(figure: the table of cases, sorted by decreasing EPC, with the corresponding OP values used for generalized profit assessment)

Total Profit Plot
(figure: total expected profit plotted against depth, the fraction of cases selected)

Observed and Expected Profit Plot
(figure: observed and expected total profit plotted against depth)

Profit Confusion Matrix

                     | Primary Outcome       | Secondary Outcome      | Row total
Primary Decision     | true positive profit  | false positive profit  | total primary-decision profit
Secondary Decision   | false negative profit | true negative profit   | total secondary-decision profit
Column total         | total primary profit  | total secondary profit |

Each cell is a sum of observed profits (OP).

True Positive Profit Fraction
true positive profit ÷ total primary profit

False Positive Profit Fraction
false positive profit ÷ total secondary profit
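The profit confusion matrix and the two fractions can be accumulated directly from per-case records; the four cases below are hypothetical:

```python
# (observed outcome, decision, observed profit) for each assessed case
cases = [
    ("primary", "primary", 24.0),     # true positive profit
    ("secondary", "primary", -1.0),   # false positive profit
    ("primary", "secondary", 10.0),   # false negative profit
    ("secondary", "secondary", 0.0),  # true negative profit
]

# cell[decision][outcome] accumulates observed profit
cell = {d: {"primary": 0.0, "secondary": 0.0} for d in ("primary", "secondary")}
for outcome, decision, op in cases:
    cell[decision][outcome] += op

total_primary_profit = cell["primary"]["primary"] + cell["secondary"]["primary"]
total_secondary_profit = cell["primary"]["secondary"] + cell["secondary"]["secondary"]

tp_profit_fraction = cell["primary"]["primary"] / total_primary_profit
fp_profit_fraction = cell["primary"]["secondary"] / total_secondary_profit
print(round(tp_profit_fraction, 3))  # → 0.706
```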

Section 2.3 Basic Two-Stage Models

Defining Two-Stage Model Components
Given inputs X, the components E(B | X) and E(D | X) can come from specified values, from separate predictive models, or from a joint predictive model E(B, D | X).

Two-Stage Modeling Methods
- A better estimate of the primary decision profit can be obtained by modeling both the outcome probability and the expected profit, using two-stage modeling methods.
- There are several ways to estimate the components used in two-stage models:
- The simplest is to specify values for certain components. This is easy to do, but it often produces poor results.
- A slightly more sophisticated approach uses the value in an input or a look-up table as a surrogate for the expected donation amount.
- The most common approach is to estimate values for the components with individual models.
- At the extreme end of the sophistication scale, you can use a single model to predict both components simultaneously, for example, with the NEURAL procedure in SAS Enterprise Miner.

Basic Two-Stage Models
A two-stage model combines two models:
- one to estimate the donation propensity;
- another to estimate the donation amount.
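Conceptually, the combination is just a product of the two component predictions minus the cost; the cost and predictions below are hypothetical stand-ins for fitted models:

```python
COST = 0.68  # hypothetical cost of one solicitation

def expected_profit(p_donate, expected_amount, cost=COST):
    # E(profit) = P(donate | x) * E(amount | donate, x) - cost
    return p_donate * expected_amount - cost

# solicit whenever the combined prediction says the profit is positive
print(expected_profit(0.10, 15.0) > 0)  # → True  (1.5 - 0.68)
print(expected_profit(0.02, 15.0) > 0)  # → False (0.3 - 0.68)
```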

Two-Stage Model Tool
- The Two Stage Model tool builds two models, one to predict TARGET_B and one to predict TARGET_D. Theoretically, you can use this node to combine the predictions for the two target variables into a prediction of the expected donation amount.
- The tool has two minor limitations:
- It does not recognize the prior vector. Thus, because responders are overrepresented in the training data, the probabilities in the TARGET_B model are biased.
- The node has no built-in diagnostic to assess overall average profit. Profit information passed to the Assessment node is incorrect.
- Both of these limitations are easily overcome by the Generalized Profit Assessment tool.

The Model We Are Using
(figure: the process flow for the basic model, which differs from the one in the book)

Target Variables

Some Two-Stage Model Options
- Model fitting approach: sequential or concurrent
- Sequential: couples the models by making the binary outcome model's prediction an input to the expected profit model
- Concurrent: fits a neural network model with two targets
- FILTER: removes cases from the training data when building the value model
- MULTIPLY: multiplies the class and value model predictions

Results of the Two-Stage Node

Results of the GPA Node
Oddities in the assessment report:
1. The reported overall average profits from the training data are extremely low.
2. The depth supposedly corresponding to the optimum profit threshold is reported to be 100% (select all cases).
3. The total profit reported on the validation data is almost 40% higher than on the training data.

Stratification with BIN_TARGET_D

Improved Results of the GPA Node
The third problem has been solved, but the performance of the model is still lower than that of the "no model" baseline.

Correct the bias in GPA by setting the following parameter in the code:

%let EM_PROPERTY_adjustprobs = Y;

The model is no longer selecting all the data (the depth is now around 60%), but the overall average profit values remain low. The average profit is slightly more than that obtained without using a model.
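What the adjustment does can be sketched with the standard prior-correction formula for oversampled binary training data; the rates here are hypothetical, and this is an illustration of the idea rather than the node's actual implementation:

```python
def adjust_probability(p, pi1, rho1):
    """Map a posterior p fitted at training response rate rho1 back to
    the population response rate pi1 (standard oversampling correction)."""
    num = p * pi1 / rho1
    den = num + (1.0 - p) * (1.0 - pi1) / (1.0 - rho1)
    return num / den

# e.g., 50% responders in the oversampled training data, 5% in the population
p_adjusted = adjust_probability(0.60, pi1=0.05, rho1=0.50)
print(round(p_adjusted, 4))  # → 0.0732
```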

Results from an Improved Two-Stage Model
Parameters:
- Class Model: Regression
- Selection Model: Stepwise
- Selection Criteria: Validation Error
The resulting average profit is good enough to win the KDD Cup!

Summary – Improving the Performance
Section 2.3
- Use two-stage models
- Stratify using the binned value target
- Correct the bias in GPA: %let EM_PROPERTY_adjustprobs = Y;
Section 2.4
- Use regression settings in the Two Stage node
- Reduce MSE: interval target value transformation
- Construct the component models separately from the Two Stage node
- Use regression trees in a two-stage model (%let EM_PROPERTY_adjustprobs = N;)
- Use neural networks in a two-stage model (%let EM_PROPERTY_adjustprobs = N;)

Section 2.4 Constructing Component Models

Two-Stage Modeling Challenges
(figure: the two challenges are model assessment and interval model specification, E(D) = g(x; w))

Two-Stage Modeling Challenges
- Constructing a two-stage model (or, more generally, any multiple-component model) requires attention to several challenges not previously encountered.
- Earlier model assessment efforts evaluated models based on profitability measures, assuming a fixed profit structure. Because the profit structure itself is being modeled in a two-stage model, you need a different mechanism to assess model performance.
- Correct specification requires appropriately chosen inputs, link functions, and target error distribution.
- By incorporating the predictions of the binary model into the interval model, it can be possible to make a more parsimonious specification of the interval model.

Estimating Mean Squared Error
On the training data, the MSE of the interval prediction D̂ is estimated by

Estimated MSE = (1/N) Σ_{i=1..N} (D_i − D̂_i)²

which estimates MSE = E[(D − D̂)²].

MSE Decomposition: Variance and Squared Bias
In theory, the MSE can be decomposed into two components, each involving a deviation from the true expected value of the target variable:

MSE = E[(D − D̂)²] = E[(D − E[D])²] + (E[D̂] − E[D])²
    = Variance + Bias²

The variance term is independent of any fitted model; the squared-bias term is the difference between the predicted and the actual expected value.
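A quick simulation illustrates the decomposition. Everything here is made up for the demonstration: the target is a constant true mean plus Gaussian noise, and the "model" is a deliberately biased constant prediction:

```python
import random
import statistics

random.seed(0)
true_mean, noise_sd = 10.0, 2.0  # so the Variance term is 4.0
d = [random.gauss(true_mean, noise_sd) for _ in range(100_000)]

d_hat = 11.0  # biased constant prediction, so Bias^2 = 1.0

mse = statistics.fmean((x - d_hat) ** 2 for x in d)
variance = statistics.fmean((x - true_mean) ** 2 for x in d)
bias_sq = (d_hat - true_mean) ** 2

# the two sides agree up to simulation noise: both are close to 4.0 + 1.0 = 5.0
print(round(mse, 1), round(variance + bias_sq, 1))
```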

Honest MSE Estimation
The training-data estimate of the MSE is optimistically biased. Unbiased estimates can be obtained by correctly accounting for model degrees of freedom in the MSE estimate, or simply by estimating the MSE from an independent validation data set:

Estimated MSE = (1/N) Σ_{i=1..N} (D_i − D̂_i)², computed on the validation data.

MSE and Binary Target Models
For a binary target B, the same decomposition applies on the validation data:

Estimated MSE = (1/N) Σ_{i=1..N} (B_i − B̂_i)²
E[(B − B̂)²] = E[(B − E[B])²] + (E[B̂] − E[B])²
Inaccuracy = Inseparability + Imprecision

(the MSE corresponds to inaccuracy, the variance term to inseparability, and the squared-bias term to imprecision)

The Binary Target
- The estimated MSE of the binary target can be thought of as measuring the overall inaccuracy of the model prediction.
- This inaccuracy estimate can be decomposed into a term related to the inseparability of the two target levels (corresponding to the variance component) plus a term related to the imprecision of the model estimate (corresponding to the bias-squared component).
- In this way, the model with the smallest estimated MSE will also be the least imprecise.
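For concreteness, the estimated MSE of a binary target is just the average squared difference between the 0/1 outcome and the predicted probability (the Brier score); the four validation cases below are hypothetical:

```python
# (B_i, B_hat_i): observed binary outcome and predicted probability
pairs = [(1, 0.8), (0, 0.3), (1, 0.6), (0, 0.1)]

mse = sum((b - p) ** 2 for b, p in pairs) / len(pairs)
print(round(mse, 3))  # → 0.075
```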

Two-Stage Modeling Challenges: Model Assessment
Use validation MSE. To assess both the binary and the interval component models, it is reasonable to compare their validation-data mean squared errors. Models with the smallest MSE will have the smallest bias or imprecision.

Two-Stage Modeling Challenges: Interval Model Specification E(D) = g(x; w)
A standard regression model may be ill suited to accurately modeling the relationship between the inputs and TARGET_D. Matching the structure of the model to the specific modeling requirements is vital to obtaining good predictions.

Interval Model Requirements
- Good inputs (for example, x1, x3, x10)
- Positive predictions: E(D) > 0
- Correct error distribution
- Adequate flexibility

Making Positive Predictions
Two approaches: transform the target and model E(log(Y) | X), or define an appropriate link and model log(E(Y | X)).
Hints: The interval component of a two-stage model is often used to predict a monetary response. Random variables that represent monetary amounts usually have a skewed distribution with a positive range and a variance related to the expected value. When the target variable represents a monetary amount, this limited range and skewness must be considered in the model specification. Proper specification of the target range and error distribution increases the chances of selecting good inputs for the interval target model. With good inputs, the correct degree of flexibility can be incorporated into the model and predictions can be optimized.
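A small sketch of the transform-the-target route. The numbers are hypothetical; the half-variance term is the standard lognormal-mean correction that undoes the downward bias of a naive exp() back-transform when the log-scale errors are roughly normal:

```python
import math

log_prediction = 2.5   # model's estimate of E(log(Y) | X)
sigma2 = 0.36          # residual variance on the log scale

naive = math.exp(log_prediction)            # positive, but biased low for E(Y | X)
corrected = naive * math.exp(sigma2 / 2.0)  # lognormal mean correction

print(naive > 0 and corrected > naive)  # → True
```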

Error Distribution Requirements
- Possess correct skewness.
- Have conforming support.
- Account for heteroscedasticity.

Specifying the Correct Error Distribution

Distribution         | Variance
Normal (truncated)   | constant
Poisson              | ∝ E(Y)
Gamma                | ∝ (E(Y))²
Lognormal            | ∝ (E(Y))²

Notes:
- The normal distribution has a range from negative to positive infinity, whereas the target variable may have a more restricted range.
- One disadvantage of the Poisson distribution relates to its skewness properties. Poisson error distributions are limited to the Neural Network node.
- The gamma distribution is also limited to the Neural Network node. The lognormal distribution can be used with any modeling tool.
- A few extreme outliers (on the order of 100 times the typical value) may indicate a lognormal distribution, whereas the absence of such outliers may imply a gamma or less extreme distribution.
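One informal way to choose among these variance functions is to group cases and compare how the group variance scales with the group mean. The two hypothetical segments below have variance proportional to the squared mean (v/m² constant across groups), which would point toward a gamma or lognormal specification:

```python
import statistics

groups = {  # hypothetical donation amounts by segment
    "low": [4.0, 5.0, 6.0, 5.0],
    "high": [40.0, 50.0, 60.0, 50.0],
}

for name, ys in groups.items():
    m = statistics.fmean(ys)
    v = statistics.pvariance(ys)
    # var/mean roughly constant suggests Poisson;
    # var/mean^2 roughly constant suggests gamma or lognormal
    print(name, round(v / m, 2), round(v / m ** 2, 4))
```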

Two-Stage Modeling Challenges (summary)
- Model assessment: use validation MSE.
- Interval model specification E(D) = g(x; w): log-transform the target, or specify the link and error distribution.

Interval Target Model

The Parameters and Results

Compare the Distributions of Residuals
(figures: residual distributions using the log-transformed TARGET_D versus the original TARGET_D)

Using Regression Trees

Using Neural Network Models