BUS-221 Quantitative Methods
LECTURE 16
Learning Outcomes
Knowledge - Be familiar with basic mathematical techniques, including regression calculations.
Mentation - Analyse business case studies and make decisions based on quantitative data.
Practical - Can formulate business problems in mathematical terms and perform calculations to solve such problems, including the use of appropriate software. Is able to interpret results to determine their impact on the problem at hand.
Topics
Forecasting
Application: use of Excel to do simulation
Regression: Association between two variables
Regression is useful when we want to look for significant relationships between two variables, or to predict a value of one variable for a given value of the other. It involves estimating the line of best fit through the data: the line which minimises the sum of the squared residuals. Regression produces an equation for this line, which can be used to predict values of the dependent variable given the independent variable. More commonly, students use regression to look for significant relationships. But what are the residuals?
Residuals
Residuals are the differences between the observed values of the dependent variable and the values predicted by the regression equation. These residuals are squared and added together.
[Figure: scatterplot with the regression line; the x axis is the predictor/explanatory (independent) variable. Points above the line are babies heavier than predicted, points below are babies lighter than predicted, and points on the line are babies the same as predicted.]
Regression
Simple linear regression looks at the relationship between two scale variables by producing an equation for a straight line of the form

y = α + βx

which uses the independent variable x to predict the dependent variable y. Here α is the intercept and β is the slope. Students may struggle with this if they have not studied maths, so it is important to give numerical examples. They may not be interested in the equation itself if they are just using regression to find significant relationships.
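To make the intercept and slope concrete, here is a minimal sketch in Python with numpy (the course itself uses Excel and SPSS; the data below is hypothetical, loosely echoing the birthweight example):

```python
import numpy as np

# Hypothetical data: gestational age (weeks) and birth weight (lbs)
x = np.array([36, 37, 38, 38, 39, 40, 40, 41, 42], dtype=float)
y = np.array([5.2, 5.8, 6.1, 6.6, 6.9, 7.3, 7.1, 7.8, 8.0])

# Least-squares estimates: slope (beta) and intercept (alpha)
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()

predicted = alpha + beta * x
residuals = y - predicted   # observed minus predicted values

print(f"y = {alpha:.2f} + {beta:.2f}x")
```

These are the standard least-squares formulas; the `residuals` array is exactly the quantity the previous slide describes.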
Assumptions for regression
There are several assumptions, each with a plot to check it:
- The relationship between the independent and dependent variables is linear. Check with the original scatterplot of the independent and dependent variables.
- Homoscedasticity: the variance of the residuals about the predicted responses should be the same for all predicted responses. Check with a scatterplot of standardised predicted values against residuals, looking for patterns.
- The residuals are independently normally distributed. Check by plotting the residuals in a histogram.
The first is confirmed using a scatterplot; the other two plots are available on request from SPSS through the 'Plots' section of the regression. One assesses whether the spread of residuals varies depending on the predicted values, the other whether the residuals are normally distributed. A sketch of all three checks follows.
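A minimal sketch of the three checks in Python with matplotlib (hypothetical data; in the course the second and third plots come from SPSS's 'Plots' dialog):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data and fitted line (as in the earlier sketch)
x = np.array([36, 37, 38, 38, 39, 40, 40, 41, 42], dtype=float)
y = np.array([5.2, 5.8, 6.1, 6.6, 6.9, 7.3, 7.1, 7.8, 8.0])
beta, alpha = np.polyfit(x, y, 1)          # slope, intercept
predicted = alpha + beta * x
residuals = y - predicted

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# 1. Linearity: original scatterplot of independent vs dependent variable
axes[0].scatter(x, y)
axes[0].set(xlabel="x (independent)", ylabel="y (dependent)", title="Linearity")

# 2. Homoscedasticity: residuals vs predicted values (want random scatter,
#    no funnelling as predicted values increase)
axes[1].scatter(predicted, residuals)
axes[1].axhline(0, color="grey")
axes[1].set(xlabel="Predicted value", ylabel="Residual", title="Homoscedasticity")

# 3. Normality: histogram of residuals (want a roughly symmetric shape)
axes[2].hist(residuals, bins=5)
axes[2].set(xlabel="Residual", title="Normality")

plt.tight_layout()
plt.show()
```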
Checking normality
The histogram of the residuals should look approximately normally distributed. When writing up, just say 'normality checks were carried out on the residuals and the assumption of normality was met'. Here is the histogram of residuals for the birthweight and gestational age model. The residuals only have to be approximately normally distributed, so only histograms showing serious skew are a problem. Students do not need to include the plots for checking normality in their main report; they just need to say that the checks were carried out.
Predicted values against residuals
Are there any patterns as the predicted values increase? This plot shows predicted values on the x axis and residuals on the y axis. What we want is random scatter with no patterns. One problem to look for is an increase in the width of the scatter as predicted values increase: a "funnelling" shape suggests problems. There is a problem with homoscedasticity if the scatter is not random.
What if assumptions are not met?
If the residuals are heavily skewed, or the residuals show different variances as predicted values increase, the data needs to be transformed. Try taking the natural log (ln) of the dependent variable, then repeat the analysis and check the assumptions again. In typical problem data, a histogram of residuals may be very positively skewed, or a residual plot may show an increase in variability of the residuals as predicted values increase. The most common transformation is to take the log of the dependent variable; the analysis is re-run using the transformed variable and the assumptions checked again. The downside is that the interpretation of the output changes, and some students will struggle with that interpretation. A sketch follows.
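A minimal sketch of the transformation step, assuming hypothetical positively skewed data; after refitting on ln(y) the assumption checks above would be repeated:

```python
import numpy as np

# Hypothetical data with a positively skewed dependent variable
x = np.arange(1.0, 9.0)
y = np.array([1.2, 1.5, 1.9, 2.4, 3.1, 4.2, 6.5, 11.0])

log_y = np.log(y)                       # natural log (ln) of the dependent variable
beta, alpha = np.polyfit(x, log_y, 1)   # refit on the transformed variable

# The interpretation changes: a one-unit increase in x now multiplies the
# predicted y by exp(beta) rather than adding beta to it.
print(f"ln(y) = {alpha:.2f} + {beta:.2f}x")
print(f"multiplicative effect per unit of x: {np.exp(beta):.2f}")
```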
Exercise
Investigate whether mothers' pre-pregnancy weight and birth weight are associated, using a scatterplot, correlation and simple regression. (Student exercise, with the next three slides for them to fill in.)
Exercise - scatterplot
Describe the relationship using the scatterplot and correlation coefficient r = 0.39
Regression question
Pre-pregnancy weight p-value:
Regression equation:
Interpretation:
R² = 0.152. Does the model result in reliable predictions?
Check the assumptions
Correlation - Solution
Pearson's correlation = 0.39
Describe the relationship using the scatterplot and correlation coefficient: there is a moderate positive linear relationship between mothers' pre-pregnancy weight and birth weight (r = 0.39). Generally, birth weight increases as the mother's weight increases.
Regression - Solution
Pre-pregnancy weight p-value: p = 0.011
Regression equation: y = ___ + 0.03x
Interpretation: there is a significant relationship between a mother's pre-pregnancy weight and the weight of her baby (p = 0.011). Pre-pregnancy weight has a positive effect on a baby's weight, with an increase of 0.03 lbs for each extra pound a mother weighs.
Does the model result in reliable predictions? Not really: only 15.2% of the variation in birth weight is accounted for by this model.
Checking assumptions - Solution
- Linear relationship
- Histogram roughly peaks in the middle
- No patterns in residuals
Multiple regression
Multiple regression has several binary or scale independent variables. Categorical variables need to be recoded as binary dummy variables, and the effect of other variables is removed (controlled for) when assessing relationships. Multiple regression is an extension of simple linear regression with multiple independent variables. Although regression only takes scale and binary variables, categorical variables with 3+ categories can be recoded as binary dummy variables: for example, marital status could be recoded as 'Are you married? yes/no', 'Are you single? yes/no', and so on (see the sketch below). One of the good things about multiple regression is that when assessing the significance of a particular independent variable, the effect due to the other variables is removed/controlled for first. For example, in medical trials looking for a difference between treatments, the effects of factors such as age and weight are controlled for when assessing the difference. This ensures that any significant difference is due to the treatment rather than differences in patients within each group.
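A sketch of the dummy-coding step in Python with pandas (hypothetical data and column names; the course does this by recoding in SPSS):

```python
import pandas as pd

# Hypothetical data with a three-category variable
df = pd.DataFrame({
    "marital_status": ["married", "single", "divorced", "married", "single"],
    "weight":         [130, 145, 160, 120, 150],
})

# Recode the categorical variable as binary dummy variables; drop_first
# leaves one category out to act as the reference level
dummies = pd.get_dummies(df["marital_status"], prefix="status", drop_first=True)
df = pd.concat([df.drop(columns="marital_status"), dummies], axis=1)
print(df)
```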
Multiple regression: What affects the number of Nobel prize winners?
Dependent: number of Nobel prize winners. Possible independents: chocolate consumption, GDP and mean temperature. In simple linear regression, chocolate consumption is significantly related to the number of Nobel prize winners. However, once the effects of a country's GDP and average temperature were taken into account, chocolate was found not to be significant.
Multiple regression
In addition to the standard linear regression checks, relationships BETWEEN independent variables should be assessed. Multicollinearity is a problem where continuous independent variables are too correlated (r > 0.8). Relationships can be assessed using scatterplots and correlation for scale variables. Multiple regression has the same assumptions as simple regression plus this extra one: if two independent variables are strongly correlated, they are less likely to be significant, an effect called multicollinearity. Before running multiple regression, check the correlations between the dependent variable and all scale independents. Look for independents that are strongly correlated with the dependent to include in the model, but also for independents related to each other; a common error is including both weight and BMI, for example. If r between independents is above 0.8, only one should be included in the model. For further checks, SPSS can report collinearity statistics on request: the VIF should be close to 1, under 5 is usually fine, and if it is more than 10, remove that variable and run the regression again. A sketch of the VIF calculation follows.
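A sketch of the VIF calculation in Python (the VIF for each independent is 1/(1 − R²) from regressing it on the other independents; hypothetical data, since the course reads this off the SPSS output):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the n-by-k matrix X."""
    n, k = X.shape
    factors = []
    for j in range(k):
        # Regress column j on the remaining columns (with an intercept)
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        fitted = others @ coef
        ss_res = np.sum((X[:, j] - fitted) ** 2)
        ss_tot = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2 = 1 - ss_res / ss_tot
        factors.append(1.0 / (1.0 - r2))
    return factors

# Hypothetical independents: the second is nearly a multiple of the first,
# so both get large VIFs; the third is unrelated, so its VIF is near 1
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 2 * x1 + rng.normal(scale=0.1, size=50)
x3 = rng.normal(size=50)
print(vif(np.column_stack([x1, x2, x3])))
```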
Exercise: Which variables are most strongly related?
Here are correlations between birth weight and some possible independents. Which independents are most related to birth weight, and which are related to each other?
Exercise - Solution: Which variables are most strongly related?
Gestation and birth weight (0.709); mother's height and weight (0.671). Mother's height and weight are strongly related. They do not exceed the problem correlation of 0.8, but try the model with and without height in case it is a problem. When both were included in the regression, neither was significant, but each was significant on its own.
Logistic regression
Logistic regression has a binary dependent variable, and the model can be used to estimate probabilities. Example: insurance quotes are based on the likelihood of you having an accident. Dependent = have an accident / do not have an accident. Independents: age (preferably scale), gender, occupation, marital status, annual mileage. There are a number of different types of regression for different types of dependent variable, but most are beyond the scope of the average student. One which is commonly used is logistic regression; as with linear regression, students are usually just looking for significant relationships, although the model can estimate probabilities. A common use is by insurance companies: using your details, the likelihood of you having a crash is estimated when your insurance premium is calculated (see the sketch below). Ordinal regression is for ordinal dependent variables with more than 2 categories, but for a lot of students, seeing whether they can reduce the categories to 2 and use logistic regression may be easier.
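A sketch of the insurance-style example in Python with scikit-learn (entirely hypothetical data and variable choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: age (years), annual mileage (thousands)
# -> had an accident (1) or not (0)
X = np.array([[18, 15], [22, 12], [25, 20], [35, 8], [45, 10],
              [52, 6], [60, 5], [19, 18], [40, 9], [30, 14]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 1])

model = LogisticRegression().fit(X, y)

# Estimated probability of an accident for a 28-year-old
# driving 13,000 miles a year
print(model.predict_proba([[28, 13]])[0, 1])
```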
Estimating the Variance
Errors are assumed to have a constant variance (σ²), but we usually don't know this. It can be estimated using the mean squared error (MSE), s²:

s² = MSE = SSE / (n − k − 1)

where
n = number of observations in the sample
k = number of independent variables
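A minimal sketch of this estimate in Python (function name and residuals are illustrative):

```python
import numpy as np

def error_variance(residuals, k):
    """s^2 = MSE = SSE / (n - k - 1), where k is the number of independent variables."""
    residuals = np.asarray(residuals, dtype=float)
    n = residuals.size
    sse = np.sum(residuals ** 2)   # sum of squared errors
    return sse / (n - k - 1)

# Hypothetical residuals from a simple regression (k = 1)
s2 = error_variance([0.5, -0.3, 0.1, -0.4, 0.2, -0.1], k=1)
print(s2, np.sqrt(s2))   # MSE and the standard error of the estimate
```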
Estimating the Variance
For Triple A Construction we can estimate the standard deviation, s, as the square root of the MSE: s = √MSE. This is also called the standard error of the estimate or the standard deviation of the regression.
Testing the Model for Significance
When the sample size is too small, you can get good values for MSE and r² even if there is no relationship between the variables. Testing the model for significance helps determine whether the values are meaningful. We do this by performing a statistical hypothesis test.
Testing the Model for Significance
We start with the general linear model:

Y = β₀ + β₁X + ε

The null hypothesis is that there is no linear relationship between X and Y (β₁ = 0); the alternate hypothesis is that there is a linear relationship (β₁ ≠ 0). If the null hypothesis can be rejected, we conclude that there is a linear relationship. We use the F statistic for this test.
Testing the Model for Significance
The F statistic is based on the MSE and MSR:

MSR = SSR / k

where k = number of independent variables in the model. The F statistic is:

F = MSR / MSE

This describes an F distribution with:
degrees of freedom for the numerator = df₁ = k
degrees of freedom for the denominator = df₂ = n − k − 1
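A sketch of the test in Python, given actual values, fitted values and k (hypothetical inputs; the p-value uses scipy's F distribution):

```python
import numpy as np
from scipy import stats

def f_test(y, fitted, k):
    """Overall F test: returns the F statistic and P(F > observed)."""
    y = np.asarray(y, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    n = y.size
    ssr = np.sum((fitted - y.mean()) ** 2)   # regression sum of squares
    sse = np.sum((y - fitted) ** 2)          # error sum of squares
    msr = ssr / k
    mse = sse / (n - k - 1)
    f = msr / mse
    return f, stats.f.sf(f, k, n - k - 1)    # sf gives the upper-tail probability

# Hypothetical usage with fitted values from a simple regression (k = 1)
y = np.array([6.0, 8.0, 9.0, 5.0, 4.5, 9.5])
fitted = np.array([7.0, 7.8, 8.5, 5.5, 5.0, 9.2])
print(f_test(y, fitted, k=1))
```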
Testing the Model for Significance
If there is very little error, the MSE will be small and the F statistic will be large, indicating the model is useful. If the F statistic is large, the significance level (p-value) will be low, indicating it is unlikely this would have occurred by chance. So when the F value is large, we can reject the null hypothesis, accept that there is a linear relationship between X and Y, and conclude that the values of the MSE and r² are meaningful.
Triple A Construction
We can conclude there is a statistically significant relationship between X and Y. The r² value of 0.69 means about 69% of the variability in sales (Y) is explained by local payroll (X).
[Figure 4.5: F distribution at the 0.05 significance level; the test statistic F = 9.09 exceeds the critical value of 7.71.]
Analysis of Variance (ANOVA) Table
When software is used to develop a regression model, an ANOVA table is typically created that shows the observed significance level (p-value) for the calculated F value. This can be compared to the level of significance (α) to make a decision.

Source       DF           SS    MS                      F          SIGNIFICANCE
Regression   k            SSR   MSR = SSR/k             MSR/MSE    P(F > MSR/MSE)
Residual     n − k − 1    SSE   MSE = SSE/(n − k − 1)
Total        n − 1        SST

Table 4.4
ANOVA for Triple A Construction
Program 4.1C (partial). The ANOVA output gives the observed significance level P(F > calculated F). Because this probability is less than 0.05, we reject the null hypothesis of no linear relationship and conclude there is a linear relationship between X and Y.
Multiple Regression Analysis
Multiple regression models are extensions of the simple linear model and allow the creation of models with more than one independent variable:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε

where
Y = dependent variable (response variable)
Xᵢ = ith independent variable (predictor or explanatory variable)
β₀ = intercept (value of Y when all Xᵢ = 0)
βᵢ = coefficient of the ith independent variable
k = number of independent variables
ε = random error
Multiple Regression Analysis
To estimate these values, a sample is taken and the following equation is developed:

Ŷ = b₀ + b₁X₁ + b₂X₂ + … + bₖXₖ

where
Ŷ = predicted value of Y
b₀ = sample intercept (an estimate of β₀)
bᵢ = sample coefficient of the ith variable (an estimate of βᵢ)
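A sketch of estimating b₀, b₁, b₂ by least squares in Python (hypothetical numbers, not the textbook's data):

```python
import numpy as np

# Hypothetical sample: two independent variables and a dependent variable
x1 = np.array([1.9, 2.1, 1.7, 2.3, 3.8, 2.5])        # e.g. size
x2 = np.array([30.0, 40.0, 15.0, 26.0, 35.0, 17.0])  # e.g. age
y  = np.array([95.0, 119.0, 135.0, 182.0, 183.0, 211.0])

# Design matrix with a column of ones for the intercept
A = np.column_stack([np.ones(y.size), x1, x2])
(b0, b1, b2), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"Y-hat = {b0:.1f} + {b1:.1f} X1 + {b2:.1f} X2")
```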
Jenny Wilson Realty
Jenny Wilson wants to develop a model to determine the suggested listing price for houses based on the size and age of the house:

Ŷ = b₀ + b₁X₁ + b₂X₂

where
Ŷ = predicted value of the dependent variable (selling price)
b₀ = Y intercept
X₁ and X₂ = values of the two independent variables (square footage and age) respectively
b₁ and b₂ = slopes for X₁ and X₂ respectively
She selects a sample of houses that have sold recently and records the data shown in Table 4.5.
Jenny Wilson Real Estate Data

SELLING PRICE ($)   SQUARE FOOTAGE   AGE   CONDITION
95,000              1,926            30    Good
119,000             2,069            40    Excellent
124,800             1,720
135,000             1,396            15
142,000             1,706            32    Mint
145,000             1,847            38
159,000             1,950            27
165,000             2,323
182,000             2,285            26
183,000             3,752            35
200,000             2,300            18
211,000             2,525            17
215,000             3,800
219,000             1,740            12

Table 4.5
Jenny Wilson Realty
Input screen for the Jenny Wilson Realty multiple regression example (Program 4.2A)
Jenny Wilson Realty
Output for the Jenny Wilson Realty multiple regression example (Program 4.2B)
Model Building
The best model is a statistically significant model with a high r² and few variables. As more variables are added to the model, the r² value usually increases. For this reason, the adjusted r² value is often used to determine the usefulness of an additional variable. The adjusted r² takes into account the number of independent variables in the model.
Model Building
The formula for r²:

r² = SSR / SST

The formula for adjusted r²:

adjusted r² = 1 − [SSE / (n − k − 1)] / [SST / (n − 1)]

As the number of variables increases, the adjusted r² gets smaller unless the increase due to the new variable is large enough to offset the change in k.
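A minimal sketch of the calculation, using the equivalent form adjusted r² = 1 − (1 − r²)(n − 1)/(n − k − 1), with hypothetical numbers showing how adding a weak variable can raise r² while lowering adjusted r²:

```python
def adjusted_r2(r2, n, k):
    """Adjusted r^2 = 1 - (1 - r^2)(n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical: a second variable nudges r2 from 0.69 to 0.70 on n = 6 observations
print(adjusted_r2(0.69, n=6, k=1))   # ~0.61
print(adjusted_r2(0.70, n=6, k=2))   # ~0.50 -- the extra variable is not worth it
```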
Model Building
In general, if a new variable increases the adjusted r², it should probably be included in the model. In some cases, variables contain duplicate information. When two independent variables are correlated, they are said to be collinear; when more than two independent variables are correlated, multicollinearity exists. When multicollinearity is present, hypothesis tests for the individual coefficients are not valid, but the model may still be useful.
Forecasting Techniques
Forecasting models fall into three groups (Figure 5.1):
- Qualitative models: Delphi method, jury of executive opinion, sales force composite, consumer market survey
- Time-series methods: moving average, exponential smoothing, trend projections, decomposition
- Causal methods: regression analysis, multiple regression
Qualitative Models
Qualitative models incorporate judgmental or subjective factors. These are useful when subjective factors are thought to be important or when accurate quantitative data is difficult to obtain. Common qualitative techniques are:
- Delphi method
- Jury of executive opinion
- Sales force composite
- Consumer market surveys
Qualitative Models
Delphi Method - an iterative group process in which (possibly geographically dispersed) respondents provide input to decision makers.
Jury of Executive Opinion - collects the opinions of a small group of high-level managers, possibly using statistical models for analysis.
Sales Force Composite - individual salespersons estimate the sales in their region, and the data is compiled at a district or national level.
Consumer Market Survey - input is solicited from customers or potential customers regarding their purchasing plans.
Time-Series Models
Time-series models attempt to predict the future based on the past. Common time-series models are:
- Moving average
- Exponential smoothing
- Trend projections
- Decomposition
Regression analysis is used in trend projections and in one type of decomposition model. A sketch of the first two methods follows.
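Minimal sketches of a moving-average forecast and simple exponential smoothing in Python (the window size and smoothing constant are illustrative choices; the sales series is the CD player data from Table 5.2):

```python
import numpy as np

def moving_average_forecast(series, window):
    """Forecast the next period as the mean of the last `window` observations."""
    return float(np.mean(series[-window:]))

def exponential_smoothing_forecast(series, alpha):
    """Simple exponential smoothing: each new forecast blends the latest
    actual value with the previous forecast."""
    forecast = series[0]                 # initialise with the first observation
    for actual in series[1:]:
        forecast = alpha * actual + (1 - alpha) * forecast
    return forecast

sales = [110, 100, 120, 140, 170, 150, 160, 190, 200, 190]  # CD players, years 1-10
print(moving_average_forecast(sales, window=3))
print(exponential_smoothing_forecast(sales, alpha=0.4))
```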
Causal Models
Causal models use variables or factors that might influence the quantity being forecasted. The objective is to build a model with the best statistical relationship between the variable being forecast and the independent variables. Regression analysis is the most common technique used in causal modeling.
Scatter Diagrams
Wacker Distributors wants to forecast sales for three different products (annual sales in units):

YEAR   TELEVISION SETS   RADIOS   COMPACT DISC PLAYERS
1      250               300      110
2                        310      100
3                        320      120
4                        330      140
5                        340      170
6                        350      150
7                        360      160
8                        370      190
9                        380      200
10                       390      190

Table 5.1
Scatter Diagram for TVs
[Figure 5.2a: annual sales of televisions plotted against time in years. Sales appear to be constant over time: Sales = 250, so a good estimate of sales in year 11 is 250 televisions.]
Scatter Diagram for Radios
[Figure 5.2b: annual sales of radios plotted against time in years. Sales appear to be increasing at a constant rate of 10 radios per year: Sales = 290 + 10(Year), so a reasonable estimate of sales in year 11 is 400 radios.]
Scatter Diagram for CD Players
[Figure 5.2c: annual sales of CD players plotted against time in years. Sales appear to be increasing, but a trend line may not be perfectly accurate because of variation from year to year; a forecast would probably be a larger figure each year.]
Measures of Forecast Accuracy
We compare forecast values with actual values to see how well one model works or to compare models:

forecast error = actual value − forecast value

One measure of accuracy is the mean absolute deviation (MAD):

MAD = Σ|forecast error| / n
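A sketch reproducing the MAD calculation on the next slide, using the CD player data and a naive forecast (each year's forecast is last year's actual value):

```python
import numpy as np

actual = np.array([110, 100, 120, 140, 170, 150, 160, 190, 200, 190])

# Naive forecast: this year's forecast is last year's actual value,
# so year 1 has no forecast and years 2-10 give nine errors
forecast = actual[:-1]
abs_errors = np.abs(actual[1:] - forecast)

print(abs_errors.sum())              # 160
print(round(abs_errors.mean(), 1))   # MAD = 160/9 = 17.8
```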
Measures of Forecast Accuracy
Using a naive forecasting model (each forecast is the previous year's actual sales), we can compute the MAD:

YEAR   ACTUAL SALES OF CD PLAYERS   FORECAST SALES   |ACTUAL − FORECAST|
1      110                          —                —
2      100                          110              |100 − 110| = 10
3      120                          100              |120 − 100| = 20
4      140                          120              |140 − 120| = 20
5      170                          140              |170 − 140| = 30
6      150                          170              |150 − 170| = 20
7      160                          150              |160 − 150| = 10
8      190                          160              |190 − 160| = 30
9      200                          190              |200 − 190| = 10
10     190                          200              |190 − 200| = 10
11     —                            190

Sum of |errors| = 160; MAD = 160/9 = 17.8

Table 5.2
Measures of Forecast Accuracy
There are other popular measures of forecast accuracy.
The mean squared error:

MSE = Σ(error)² / n

The mean absolute percent error:

MAPE = (Σ|error / actual| / n) × 100%

And bias is the average error.
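A minimal sketch of all three measures in Python (function name is illustrative; the data reuses the naive-forecast pairs from Table 5.2):

```python
import numpy as np

def forecast_accuracy(actual, forecast):
    """Return (MSE, MAPE in percent, bias) for paired actual/forecast values."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    errors = actual - forecast
    mse = np.mean(errors ** 2)                       # mean squared error
    mape = 100 * np.mean(np.abs(errors) / actual)    # mean absolute percent error
    bias = np.mean(errors)                           # average error
    return mse, mape, bias

# Years 2-10 of the CD player data with the naive forecast
actual   = [100, 120, 140, 170, 150, 160, 190, 200, 190]
forecast = [110, 100, 120, 140, 170, 150, 160, 190, 200]
print(forecast_accuracy(actual, forecast))
```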