Session 7. Applied Regression -- Prof. Juran



Outline

- Chi-square goodness-of-fit tests
- Fit to a normal
- Simulation modeling
- Autocorrelation, serial correlation
- Runs test
- Durbin-Watson
- Model building
- Variable selection methods
- Minitab

Goodness-of-Fit Tests

- Determine whether a set of sample data has been drawn from a hypothetical population.
- Require the same four basic steps as the other hypothesis tests we have learned.
- An important tool for simulation modeling, used in defining random variable inputs.

Example: Barkevious Mingo

Financial analyst Barkevious Mingo wants to run a simulation model that includes the assumption that the daily volume of a specific type of futures contract traded at U.S. commodities exchanges (represented by the random variable X) is normally distributed with a mean of 152 million contracts and a standard deviation of 32 million contracts. (This assumption is based on the conclusion of a study conducted in 2013.) Barkevious wants to determine whether this assumption is still valid.

He studies the trading volume of these contracts for 50 days, and observes the following results (in millions of contracts traded):

[Slide 6: table of the 50 observed daily volumes, not transcribed]

Here is a histogram showing the theoretical distribution of 50 observations drawn from a normal distribution with μ = 152 and σ = 32, together with a histogram of Mingo's sample data. [Slide 7: histograms, not transcribed]

The Chi-Square Statistic

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed frequency in bin $i$, $E_i$ is the expected frequency under the hypothesized distribution, and $k$ is the number of bins.

Essentially, this statistic allows us to compare the distribution of a sample with some expected distribution, in standardized terms. It is a measure of how much a sample differs from some proposed distribution. A large value of chi-square suggests that the two distributions are not very similar; a small value suggests that they "fit" each other quite well.

Like Student's t, the distribution of chi-square depends on degrees of freedom. In the case of chi-square, the number of degrees of freedom is equal to the number of classes (a.k.a. "bins" into which the data have been grouped) minus one, minus the number of estimated parameters.

[Slides 11-12: observed and expected frequencies by bin for the chi-square calculation, not transcribed]

Note: the sample must be large enough that each class has an expected frequency of at least 5. To ensure this, we "collapse" some of the bins, as shown here. [Slide 13: collapsed-bin table, not transcribed]

The number of degrees of freedom is equal to the number of bins minus one, minus the number of estimated parameters. We have not estimated any parameters, so we have d.f. = 4 - 1 - 0 = 3. The critical chi-square value can be found either by using a chi-square table or by using the Excel function =CHIINV(alpha, d.f.): CHIINV(0.05, 3) = 7.815. We will reject the null hypothesis if the test statistic is greater than 7.815.

Our test statistic, 7.439, is not greater than the critical value; we cannot reject the null hypothesis at the 0.05 level of significance. It would appear that Barkevious is justified in using the normal distribution with μ = 152 and σ = 32 to model futures contract trading volume in his simulation.

The p-value of this test has the same interpretation as in any other hypothesis test, namely that it is the smallest level of alpha at which H0 could be rejected. In this case, we calculate the p-value using the Excel function =CHIDIST(test stat, d.f.): CHIDIST(7.439, 3) = 0.059.
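Mingo's actual bin counts are in the (untranscribed) tables above, but the mechanics are easy to reproduce. Here is a minimal Python sketch; the data, seed, and bin edges below are hypothetical stand-ins, not the sample from the slides:

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for the 50 observed daily volumes (millions of contracts).
rng = np.random.default_rng(seed=1)
volumes = rng.normal(152, 32, size=50)

mu, sigma = 152, 32    # hypothesized parameters -- not estimated from the sample
edges = [-np.inf, 130, 152, 174, np.inf]   # 4 bins, each with expected count >= 5

# Observed count in each bin.
observed = np.array([((volumes > lo) & (volumes <= hi)).sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])

# Expected count in each bin under N(mu, sigma): n * P(bin).
expected = 50 * np.diff(stats.norm.cdf(edges, mu, sigma))

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = len(observed) - 1 - 0   # bins - 1 - (estimated parameters) = 4 - 1 - 0 = 3
p_value = stats.chi2.sf(chi2, dof)
print(f"chi-square = {chi2:.3f}, d.f. = {dof}, p-value = {p_value:.4f}")
```

scipy.stats.chisquare(observed, f_exp=expected) returns the same statistic; the manual version just keeps the formula visible.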

Example: Catalog Company

If we want to simulate the queueing system at this company, what distributions should we use for the arrival and service processes?

Arrivals

[Slides 19-22: arrival-data analysis, not transcribed]

Services

[Slides 24-37, several carrying Decision Models footers: service-process analysis, not transcribed]

Other Uses for the Chi-Square Statistic

In addition to the goodness-of-fit application described above, there are at least three other important uses for chi-square:

- Tests of the independence of two qualitative population variables.
- Tests of the equality or inequality of more than two population proportions.
- Inferences about a population variance, including the estimation of a confidence interval for a population variance from sample data.

The chi-square technique can often be employed for purposes of estimation or hypothesis testing when the z or t statistics are not appropriate.
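For instance, the independence test is one line in scipy; the 2x2 table here is invented purely for illustration:

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: two qualitative variables, two levels each.
table = [[30, 20],
         [15, 35]]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.3f}, d.f. = {dof}, p-value = {p:.4f}")
```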

Serial Correlation (a.k.a. Autocorrelation)

Are the residuals independent of each other? What if there's evidence that sequential residuals have a positive correlation?

[Slides 40-41: residual plots, not transcribed]

There seems to be a relationship between each observation and the ones around it. In other words, there is some positive correlation between the observations and their successors. If true, this suggests that a lot of the variability in observation $Y_i$ can be explained by observation $Y_{i-1}$. In turn, this might suggest that the importance of Money Stock is being overstated by our original model.

[Slides 43-51: regression output and residual plots for the Money Stock models, not transcribed]

Runs Test

A "run" is a streak of consecutive residuals with the same sign. For example, the sequence + + + - - - has 2 runs; + + - - + - + has 5 runs; + - + - + - + has 7 runs, and so forth.

Let $n_1$ be the observed number of positive residuals and $n_2$ be the observed number of negative residuals. The total number of runs in a set of $n = n_1 + n_2$ uncorrelated residuals can be shown to have a mean of

$$\mu = \frac{2 n_1 n_2}{n_1 + n_2} + 1$$

and a variance of

$$\sigma^2 = \frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}.$$

In our Money Stock case, the expected value is 8.1 and the standard deviation ought to be about 1.97.

Our Model 1 has 5 runs, which is 1.57 standard deviations below the expected value: an unusually small number of runs. This suggests that the residuals are not independent. (This is an approximation based on the central limit theorem; it doesn't work well with small samples.) Our Model 2 has 7 runs, only 0.56 standard deviations below the expected value.
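A minimal sketch of this normal-approximation runs test, assuming residuals are supplied in time order (the function name and example residuals are ours, not from the slides):

```python
import numpy as np

def runs_z(residuals):
    """Wald-Wolfowitz runs test: observed runs, expected runs, and z-statistic."""
    signs = np.sign(residuals)
    signs = signs[signs != 0]                    # drop exact zeros
    n1, n2 = (signs > 0).sum(), (signs < 0).sum()
    runs = 1 + (signs[1:] != signs[:-1]).sum()   # a new run starts at each sign change
    mu = 2 * n1 * n2 / (n1 + n2) + 1
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return runs, mu, (runs - mu) / np.sqrt(var)

# Hypothetical residuals with long same-sign streaks (suggesting positive correlation):
e = np.array([1.2, 0.8, 0.5, -0.3, -0.9, -1.1, -0.2, 0.4, 0.7, 1.0])
print(runs_z(e))   # 3 runs vs. an expected 5.8: z is about -1.97
```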

Durbin-Watson

Another popular hypothesis-testing procedure:

H0: Correlation = 0
HA: Correlation > 0

The test statistic is

$$d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$$

where $e_t$ is the residual for observation $t$.

In general, $0 \le d \le 4$. Values of d close to zero indicate strong positive correlation, and values of d close to 2 suggest weak correlation. Precise definitions of "close to zero" and "close to 2" depend on the sample size and the number of independent variables; see p. 346 in RABE for a Durbin-Watson table.
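The statistic is a one-liner to compute, and statsmodels ships the same calculation; the residuals below are hypothetical:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

e = np.array([1.2, 0.8, 0.5, -0.3, -0.9, -1.1, -0.2, 0.4, 0.7, 1.0])  # time order
d_manual = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(d_manual, durbin_watson(e))   # identical values; d near 0 => positive correlation
```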

The Durbin-Watson procedure will result in one of three possible decisions: reject H0 (d below the lower limit), do not reject H0 (d above the upper limit), or declare the test inconclusive (d between the two limits). From the Durbin-Watson table, we see that our Model 1 has upper and lower limits of 1.15 and 0.95, respectively. Model 2 has limits of 1.26 and 0.83.

[Slides 59-60: Durbin-Watson statistics for Models 1 and 2, not transcribed]

In Model 1, we reject the null hypothesis and conclude there is significant positive correlation between sequential residuals. In Model 2, we do not reject the null hypothesis; the serial correlation is not significantly greater than zero.

Residual Analysis from the Tool-Wear Model

[Slide 63: residual analysis, not transcribed]

Normal score calculations:

[Slides 65-68: normal score calculations, not transcribed]

Model Building

Ideally, we build a model under clean, scientific conditions:

- We understand the phenomenon well.
- We have an a priori theoretical model.
- We have valid, reliable measures of the variables.
- We have data in adequate quantities over an appropriate range.

In this setting, regression validates and calibrates the model; it does not discover it.

Unfortunately, we too often find ourselves Data Mining:

- Little understanding of the phenomenon.
- No a priori theory or model.
- Data that may or may not cover all reasonable variables.
- Measures of some variables, but little sense of their validity or reliability.
- Data in small quantities over a restricted range.
- A hope that regression uncovers some magical, unexpected relationships.

This process has been referred to as Creative Regression Analytical Prospecting, or CRAP: "This room is filled with horseshit; there must be a pony in here somewhere."

The Model Building Problem

Suppose we have data available for n candidate independent variables. How do we pick the best sub-model of the full model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon$$

yielding, perhaps, something like $y = \beta_0 + \beta_2 x_2 + \beta_5 x_5 + \varepsilon$? There is no solution to this problem that is entirely satisfactory, but there are some reasonable heuristics.

Why Reduce the Number of Variables?

Scientific ideology: in chemistry, physics, and biology, most good models are simple. The principle of parsimony carries over into social sciences, such as business analysis.

Statistical advantages: even eliminating "significant" variables that don't contribute much to the model can have advantages, especially for predicting the future. These advantages include less expensive data collection, smaller standard errors, and tighter confidence intervals.

Statistical Criteria for Comparing Models

Mallows $C_p$

Taking into account the possible bias that comes from having an under-specified model, this measure estimates the MSE including both bias and variance. In its usual form, with $\sigma^2$ estimated by the full model's MSE,

$$C_p = \frac{SSE_p}{MSE_{full}} - (n - 2p).$$

If the model is complete (we have the p terms that matter), the expected value of $C_p$ is p. So we look for models with $C_p$ close to p.
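A sketch of the arithmetic, using the form above (the helper name and the numbers in the example are ours):

```python
def mallows_cp(sse_p, mse_full, n, p):
    """Mallows Cp for a submodel with p estimated coefficients (incl. intercept)."""
    return sse_p / mse_full - (n - 2 * p)

# A 3-coefficient submodel with SSE = 190, full-model MSE = 4.0, n = 50:
print(mallows_cp(sse_p=190, mse_full=4.0, n=50, p=3))   # 3.5, close to p = 3
```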

Variable Selection Algorithms

- All-Subsets
- Forward
- Backward
- Stepwise
- Best Subsets

All-Subsets Regression

If there are p candidate independent variables, then there are $2^p$ possible models. Why not look at them all? This is not really a major computational problem, but it can pose difficulties in looking at all of the output. However, some reasonable schemes exist for looking at a relatively small subset of all the possible models.

Forward Regression

Start with one independent variable (the one with the strongest bivariate correlation with the dependent variable), and add additional variables until the next variable in line to enter fails to achieve a certain threshold value. This can be based on a minimum F value in the full-model/reduced-model test, called F_IN, or it can be based on the last-in p-value for each candidate variable. Forward selection is basically the same thing as stepwise, except variables are never removed once they enter the model (set "F to remove" to zero). The procedure ends when no variable not already in the model has an F-statistic greater than F_IN.
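scikit-learn's SequentialFeatureSelector automates the same greedy search, although it scores candidates by cross-validated fit rather than an F_IN or p-value threshold (a different stopping rule than the one described above); the data here are synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 candidate predictors, only 3 of which matter.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3, random_state=0)

sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=3, direction="forward")
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask marking the selected columns
```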

Backward Regression

Start with all of the independent variables, and eliminate them one by one (on the basis of having the weakest t-statistic) until the next variable fails to meet a minimum threshold. This can be an F criterion, called F_OUT, or a p-value criterion. Backward elimination starts with all of the independent variables, then removes them one at a time as in the stepwise procedure, except that no variable can re-enter once it has been removed (set F_IN at a very large number, such as 100,000, and list all predictors in the Enter box). The procedure ends when no variable in the model has an F-statistic less than F_OUT.

Stepwise Regression

An intelligent mixture of forward and backward ideas. Variables can be entered or removed using F_IN and F_OUT criteria or p-value criteria.

The F Criterion

The basic (default) method of stepwise regression calculates an F-statistic for each variable in the model. Suppose the model contains $X_1, \ldots, X_p$. Then the F-statistic for $X_i$ is

$$F_i = \frac{SSE(\text{model without } X_i) - SSE(X_1, \ldots, X_p)}{MSE(X_1, \ldots, X_p)}$$

with 1 and n - p - 1 degrees of freedom. If the F-statistic for any variable is less than F to remove, the variable with the smallest F is removed from the model. The regression equation is calculated for this smaller model, the results are printed, and the procedure proceeds to a new step.

If no variable can be removed, the procedure attempts to add a variable. An F-statistic is calculated for each variable not yet in the model. Suppose the model, at this stage, contains $X_1, \ldots, X_p$. Then the F-statistic for a new variable $X_{p+1}$ is

$$F_{p+1} = \frac{SSE(X_1, \ldots, X_p) - SSE(X_1, \ldots, X_p, X_{p+1})}{MSE(X_1, \ldots, X_p, X_{p+1})}$$

The variable with the largest F-statistic is then added, provided its F-statistic is larger than F to enter. Adding this variable is equivalent to choosing the variable with the largest partial correlation or to choosing the variable that most effectively reduces SSE. The regression equation is then calculated, results are displayed, and the procedure goes to a new step. If no variable can enter, the stepwise procedure ends. The p-value criterion is very similar, but uses a threshold alpha value.
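These F-statistics reduce to simple arithmetic on the two nested models' SSEs; a sketch (function and argument names are ours):

```python
def nested_f(sse_reduced, sse_full, n, p_full):
    """Full-model/reduced-model F-statistic with 1 and n - p_full - 1 d.f.,
    for nested models differing by a single variable."""
    mse_full = sse_full / (n - p_full - 1)
    return (sse_reduced - sse_full) / mse_full
    # Compare against F_IN / F_OUT, or convert to a p-value with
    # scipy.stats.f.sf(F, 1, n - p_full - 1).
```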

Best Subsets

A handy procedure that reports, for each number of independent variables p, the model with the highest R-squared. Best Subsets is an efficient way to select a group of "best subsets" for further analysis, by selecting the smallest subset that fulfills certain statistical criteria. The subset model may actually estimate the regression coefficients and predict future responses with smaller variance than the full model using all predictors.
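A brute-force Python sketch of the same report; exhaustive enumeration is fine for a small number of candidates ($2^p$ models), and the function name is ours:

```python
from itertools import combinations
from sklearn.linear_model import LinearRegression

def best_subsets(X, y):
    """For each subset size k, find the predictor set with the highest R-squared."""
    best = {}
    for k in range(1, X.shape[1] + 1):
        for cols in combinations(range(X.shape[1]), k):
            cols = list(cols)
            r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
            if k not in best or r2 > best[k][1]:
                best[k] = (cols, r2)
    return best   # {k: (column indices, R-squared)}
```

Adding adjusted R-squared or the Mallows Cp helper from above would reproduce the rest of Minitab's table.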

Using Minitab

Excel's regression utility is not well suited to iterative procedures like this. More stats-focused packages like Minitab offer a more user-friendly method. Minitab treats Forward and Backward as subsets of Stepwise. (This makes sense; they really are special cases where entered variables can't leave, or removed variables can't re-enter.) Minitab uses the p-value criterion by default.

Example: Rick Beck

[Slides 85-86: not transcribed]

Need to select "regression" several times.

[Slides 88-92: Minitab stepwise-regression dialogs, not transcribed]

Forward Selection of Terms (α to enter = 0.25)

[Minitab output: Analysis of Variance table for the terms Single, Divorced, Credit D, Credit E, Children, and Debt, plus Error and Total rows; the DF, Adj SS, Adj MS, F-Value, and P-Value columns were not transcribed.]

Model Summary: R-sq(adj) = 30.37%, R-sq(pred) = 29.39% (S and R-sq not transcribed).

[Minitab output: coefficients table (Coef, SE Coef, T-Value, P-Value, VIF) for Constant, Single, Divorced, Credit D, Credit E, Children, and Debt; numeric values were not transcribed.]

Regression Equation: Default as a linear function of Single, Divorced, Credit D, Credit E, Children, and Debt (coefficients not transcribed).

Regression – Regression – Best Subsets

[Slides 96-98: Minitab Best Subsets dialogs, not transcribed]

Best Subsets Regression: Default versus Married, Divorced, ...

Response is Default.

[Minitab Best Subsets output: for each number of variables, the report lists R-Sq, R-Sq(adj), R-Sq(pred), Mallows Cp, and S, with X's marking which predictors enter each model; the rotated predictor labels and numeric values were not transcribed.]

Summary

- Chi-square goodness-of-fit tests
- Fit to a normal
- Simulation modeling
- Autocorrelation, serial correlation
- Runs test
- Durbin-Watson
- Model building
- Variable selection methods
- Minitab