Presentation is loading. Please wait.

Presentation is loading. Please wait.

Data Science and Analytics Module 4. Excel Regression

Similar presentations


Presentation on theme: "Data Science and Analytics Module 4. Excel Regression"— Presentation transcript:

1 Data Science and Analytics Module 4. Excel Regression
Introduction to Data Science and Analytics Stephan Sorger Module 4. Excel Regression Disclaimer: All images such as logos, photos, etc. used in this presentation are the property of their respective copyright owners and are used here for educational purposes only Some material adapted from: Sorger, “Marketing Analytics: Strategic Models and Metrics” In this module, we cover how to identify relationships between variables using regression analysis, especially using Microsoft Excel. © Stephan Sorger 2016; Data Science: Excel Regression

2 Outline/ Learning Objectives
Topic Description Background The goal of regression analysis Statistics Basic statistics governing regression performance Tests F tests, t tests, p tests Procedure Executing regression analysis in Microsoft Excel Multivariate Executing cases with two or more independent variables This module contains several learning objectives to cover different aspects of regression analysis. We cover basic statistics that help us understand regression performance, We discuss tests to check validity of the regression, We demonstrate the procedure to execute regression analysis, And we execute cases with two or more independent variables, often called multivariate regression analysis. © Stephan Sorger 2016; Data Science: Excel Regression

3 Regression Analysis Goal is to establish the relationship between
Independent variables (the “inputs”) and dependent variables (also called response variables) Y Axis Response Variables Dependent Variable Independent Variable X Axis Identifier Variables We characterize response variables as the dependent variables because they represent the behavior in which we are interested. We describe identifier variables as the independent variables because they represent variables which help explain behavior. The figure in the slide shows the concept. © Stephan Sorger 2016; Data Science: Excel Regression

4 Response (Dependent) Variable Categories
Functional Financial Response Variable Categories Performance; Reliability; Durability Cost savings; Revenue gain Service and Convenience Psychological Time savings; Convenience Trust; Esteem; Status Usage Usage scenario; Usage rate Response variables describe how individuals respond to sales efforts. We can measure sales revenue, of course, but we can also measure other things, such as functional, service, usage, financial, and psychological aspects. Functional response variables consider the function of company products and services. Specific variables in this category include performance, reliability, durability, quality, and so forth. Service and convenience response variables study how company products and services deliver service and convenience to their customers. Specific variables in this category include time savings, ease of use, ease of purchase, convenience and location of a retail store, and so forth. Financial response variables focus on the monetary performance delivered by products and services. Specific variables in this category include cost savings, potential revenue gain, sensitivity to price-related promotions, liability avoidance, and so forth. Usage response variables consider how customers will use the product or service. Specific variables in this category include usage scenario or occasion, usage rate, usage frequency, application of the product or service, usage patterns, and so on. Psychological response variables study how products and services affect psychological aspects. Specific variables in this category include trust, esteem, status, and so forth. © Stephan Sorger 2016; Data Science: Excel Regression

5 Independent (Input) Variables
Demographics Demographics Age; Income Industry; Company size Consumer Identifier Variables Business Identifier Variables Geographics Geographics Country; Region; City Company location Psychographics Situational Lifestyle; Interests Specific applications; Order size Many other independent variables possible: See next slide Typical independent variable examples include consumer demographics, such as age and income, consumer geographics, such as country and city, and consumer psychographics, such as values, interests, and attitudes. Similarly, we can use several familiar identifier variable categories such as business demographics, such as industries and company size, business geographics, such as company location, and business situational factors, such as the ability of the company to expedite urgent orders. © Stephan Sorger 2016; Data Science: Excel Regression

6 Regression Analysis Example
We demonstrate regression analysis with an example. © Stephan Sorger 2016; Data Science: Excel Regression

7 Regression Example Scenario: Moving into a New Apartment
(regular apt: not mansion; not rent control) Reponse Variable: S.F. Monthly Rent Paid Independent Variables: (want to predict how much people will pay) Demographics: Age Demographics: Income Geographics: Location of workplace Psychographics: Status required Psychographics: Entertaining requirements We demonstrate regression analysis with an example. We study the scenario of a couple moving into a new apartment in the pricey San Francisco Bay Area. For this demonstration, we study a couple moving into a new apartment. The apartment is not a mansion, and is not under rent control either. In this study, the dependent variable is the spending level, also known as the monthly amount paid for rent. We want to know which independent variables relate well with spending level. We can think of several types of independent variables. For example, for demographics, we can study the relationship of age or income as they pertain To the degree of spending one can justify, or afford, for an apartment. We can study geographics, or the relationship between the location of the workplace and the amount of rent paid. Some areas, where many jobs are located, often command higher rents. We can study psychographics, such as the feeling of status or the pleasure of entertaining others, Where renters might prefer a desirable area to achieve those feelings. © Stephan Sorger 2016; Data Science: Excel Regression

8 Regression Analysis: Process
Verify Data Linearity Launch Data Analysis Select Regression Analysis Input Regression Data Regression analysis follows a four step process. In the first step, we verify the linearity of our data. Linear regression demands that the data show linearity. We can quickly check for linearity by plotting out the data. The slide shows an example scatter plot of rent versus income. By examining the plot, we can quickly confirm that the data is roughly linear. Had the points formed a U-shape or other nonlinear pattern, we would need to apply nonlinear regression techniques. © Stephan Sorger 2016; Data Science: Excel Regression

9 Regression Analysis: Process
Verify Data Linearity Launch Data Analysis Select Regression Analysis Input Regression Data Excel Home Data Data Analysis A B C D E F G In the second step, we launch the data analysis function within Microsoft Excel by clicking on the Data tab and then selecting Data Analysis. © Stephan Sorger 2016; Data Science: Excel Regression

10 Regression Analysis: Process
Verify Data Linearity Launch Data Analysis Select Regression Analysis Input Regression Data Data Analysis Analysis Tools OK Regression Clicking on Data Analysis will reveal the Data Analysis dialog box. We scroll down the various data analysis tools, select Regression, and then click OK. © Stephan Sorger 2016; Data Science: Excel Regression

11 Regression Analysis: Process
Verify Data Linearity Launch Data Analysis Select Regression Analysis Input Regression Data Regression X Y Input Y Range OK Input X Range x Labels Constant is Zero x 95 Confidence Level: % In this step, we input the regression data. Highlight the Spending data in the spreadsheet to input data into the Input Y Range box. We highlight Spending because it represents the dependent variable. Highlight the Income and Age columns to input data into the Input X Range box. Include the labels at the top of the columns (Spending, Income, and Age) when selecting the data. © Stephan Sorger 2016; Data Science: Excel Regression

12 Excel Output R-Square Significance F P value T stat Standard Error
Coefficients We now interpret the regression results. Excel will open a new tab titled SUMMARY OUTPUT, with the results of the regression analysis. Excel displays the results in three tables: Regression Statistics, ANOVA, and Coefficients. We will discuss the R-Square value, the F and Significance F values, and the variable coefficients and their corresponding P values. © Stephan Sorger 2016; Data Science: Excel Regression

13 Regression Analysis: R-Squared
Scenario R-Squared No Relationship 0.0 Social Science Studies 0.3 Marketing Research 0.6 Scientific Applications 0.9 Perfect Relationship 1.0 R-Squared, the Coefficient of Determination Also known as “Goodness of Fit”, from 0 (no fit) to 1 (perfect fit) The first statistic is the R-Squared value. The R-Squared value is also called the coefficient of determination, and varies from 0 to 1. R-Squared represents the degree to which the independent values explain the estimated function (in this case, a line through the data). If they explain the relationship perfectly (i.e., all the data points lie directly on the line), then R-Squared is equal to 1. If no relationship exists, then R-Squared is equal to zero. In well-defined situations, such as scientific applications using precise instrumentation, R-Squared values are very high, often 0.9 or greater. In less manageable situations, such as social science studies, R-Squared values can be small, such as 0.3. In most marketing research applications, values of R-Squared fall in between the two extremes, generally near 0.6. © Stephan Sorger 2016; Data Science: Excel Regression

14 Regression Analysis Testing
We discuss how to test regression analysis results using tests such as p values and ROC curves, Using the results of the simple rent spending vs. income regression analysis we discussed in earlier sections of this module. © Stephan Sorger 2016; Data Science: Excel Regression

15 Hypothesis Testing: t-Stat and P-value
Statistic Description Standard Error Estimate of standard deviation of the coefficient t-Stat Coefficient divided by the Standard Error P-value Probability of encountering equal t value in random data P-value should be 5% or lower Hypothesis Testing: Test H0 (null hypothesis) Null hypothesis: No correlation between x and y Less than 5%  OK Microsoft Excel provides several tests to assess the performance of the regression process. It supplies the standard error, which is the standard deviation of the coefficient, t-Stat, the coefficient divided by the standard error, and P-value, the probability of encountering the same t-value in random data. Of these three statistics, the P-value is the most important. We want to check our P-values to ensure they are less than 5%. In other words, the chances that our results are wrong, and that our proposed model is just the result of random data, are less than 5%. © Stephan Sorger 2016; Data Science: Excel Regression

16 Hypothesis Testing: F value
Statistic Description F value Tests overall significance of the regression model H0 Tests null hypothesis that all regression coefficients = 0 Tests full model against a model with no variables Significance F Check model; Less than 0.05 to invalidate H0 Hypothesis Testing: Test H0 (null hypothesis) Null hypothesis: No correlation between x and y Less than 5%  OK In addition to checking the validity of individual independent variables using p-values, We can check the validity of the overall model using the significance F value. Just as in the case with p-values, we check significance F against the 5 percent limit To invalidate the null hypothesis and thus confirm it is a statistically valid relationship. © Stephan Sorger 2016; Data Science: Excel Regression

17 Regression Analysis: Coefficients
(Rent Spending) = (Y-Intercept) + (Coefficient, Income) * (Income) (-87.26) + (0.0366) * (Income) Slope: / 1 Y-intercept: In our example, we desire to find an equation to express the relationship between rent spending and customer income. Excel has given us an intercept value of and a coefficient for the income variable of We can apply those values in the equation shown in the slide. With the equation, we can predict the amount of money renters will spend on apartment monthly rent, Based on their annual personal income. © Stephan Sorger 2016; Data Science: Excel Regression

18 Regression Analysis: ROC Curves
Topic Description ROC Receiver Operating Characteristic Plot of true positive rate against false positive rate at different cutpoints History Developed during World War II by RADAR engineers Tradeoff Shows tradeoffs, such as sensitivity and specificity for experiments Cutpoint Sensitivity Specificity Good close to top edge Good close to left edge Cutpoint True Pos. False Pos. Bad  close to diagonal We can do further tests using a concept called a Receiver Operating Characteristic, or ROC, curve. Radio Detection and Ranging, or RADAR, engineers developed ROC curves back in World War II to check for incoming enemy aircraft. The curves displayed the tradeoff between two variables. For example, radios tuned to high sensitivity often had many false reports, because a bird or other phenomenon could trigger the device. On the other hand, radios tuned to low sensitivity might miss incoming enemy aircraft. © Stephan Sorger 2016; Data Science: Excel Regression

19 Regression Analysis: ROC Curves
Topic Description Tests Test predictive performance of model; how to select cutoffs At 95% level of confidence, we test H0 at 5% (alpha = 5%) True positive Correctly identified; High income people rent expensive apts. True negative Correctly rejected; Low income people rent cheap apartments False positive Null hypothesis is true; No correlation (type I error) To address type I error, reduce alpha (in our case, 5%) False negative Failing to reject null hypothesis which is false (type II error) We thought model doesn’t work, but it does Tradeoff As we decrease alpha from 5% to 1%... Type I error decreases, but Type II error increases (typical) Selecting cutoff a business decision; alpha = 5% usually good As we saw in the regression analysis example earlier, we conduct tests on the data To determine the validity of the model. At a standard 95% level of confidence, we often apply the somewhat arbitrary value Of alpha, or the null hyphothesis cutoff point, of 5%. When we apply the tests, we can identify four possibilities. The first possibility is a true positive, where we correctly identify the phenomenon we sought. In our apartment example, we would verify that high income people rent expensive apartments. Conversely, the true negative test would confirm that low income people rent cheap apartments. If our alpha is too high, we might falsely state that a relationship exists where one does not. In this case, also called a false positive, our findings are due to random noise, not a genuine relationship. We sometimes call this condition a type I error. To address false positives, we can reduce our alpha value to further eliminate the chances of selecting purely random data. On the other hand, we might fail to reject a null hypothesis which is false, also called a type II error, So we thought the model doesn’t work, but it in fact does. As we decrease alpha, our chances of type I error decreases, but the chances of type II error increases. So what we have here is a tradeoff. In general, we select 5% as a good tradeoff between type I and type II errors. © Stephan Sorger 2016; Data Science: Excel Regression

20 Multivariate Regression Analysis
In this section, we cover how to execute multivariate regression analysis, where we consider several potential explanatory variables. © Stephan Sorger 2016; Data Science: Excel Regression

21 Regression Analysis: Multivariate
Period Sales Level Market Awareness Number of Locations Q $1.0 million 80% 5 Q $1.1 million 80% 5 Q $1.3 million 85% 6 Q $1.2 million 85% 6 Q $1.3 million 85% 7 Q $1.5 million 90% 8 Q $1.5 million 90% 8 Q $1.4 million 90% 8 We will now demonstrate multivariate analysis with an example. The slide shows a typical sales data set, indicating the time period, the level of sales generated during that time period, the degree of market awareness for the company and its offerings, and the number of retail locations to sell to customers. We notice that sales gradually increase as the company increased its market awareness and number of locations. We seek to establish an equation that will allow us to predict sales if we make a change, such as increasing the market awareness yet further, or opening an additional store. To find out how market awareness and number of locations affect sales, we run a regression analysis against those two values. The regression analysis is similar to that done for time series, except now we have two independent variables instead of one. The two independent variables are market awareness and number of locations. For example, we could find out the resulting sales level if we maintained awareness at 90%, while opening two additional locations. What would happen if we opened 2 new stores, while holding awareness at 90%? © Stephan Sorger 2016; Data Science: Excel Regression

22 X Range: Awareness & Locations
Regression Analysis: Multivariate Y Range: Sales X Range: Awareness & Locations To calculate the regression analysis statistics, we invoke the Analysis ToolPak regression function for Windows-based machines, Or the equivalent function within StatPlus for Mac models. We enter the Sales amount in the Y range and the Awareness and Locations data in the X ranges. Because we included labels when we selected the data, we tick the Labels box, and then click OK. © Stephan Sorger 2016; Data Science: Excel Regression

23 Regression Analysis: Multivariate
R – squared: 0.92 Y-intercept = Coefficient for Awareness: at a P-value of 24.2% (not very good) Coefficient for Locations: at a P-value of 56.2% (poor) Excel returns the summary output, indicating an R-squared value of 0.92, As well as data for our awareness and locations variables. We can see that the p-value for awareness is well beyond 5% at 24.2%, Indicating that the relationship between awareness and sales could just be random. The situation for locations is even worse, with a staggeringly high p-value of 56.4%, Worse than a coin toss. At this point, we could question the validity of either of the two variables. We would recommend investigating other potential explanatory variables that would Better explain how the company drives sales. © Stephan Sorger 2016; Data Science: Excel Regression

24 Regression Analysis: Multivariate
Output Description Values in Our Sales Example R-Square Goodness of fit of model to data 0.93 Intercept Point where line crosses Y axis -1.44 Coefficient 1 Coefficient for Market Awareness 2.857 Coefficient 2 Coefficient for Number of Locations 0.043 Sales = (Intercept) + (Coefficient 1) * (Market Awareness) + (Coefficient 2) * (Number of Locations) = (- 1.44) + (2.857) * (Market Awareness) + (0.043) * (Number of Locations) Example: Maintain brand awareness at 90%; Open two new retail stores (10 total) = (-1.44) + (2.857) * (0.90) + (0.043) * (10) = $1.56 Million If we postpone that investigation and see where our existing variables take us, we can develop our traditional regression equation. Here, we see that Sales is a function of the constant intercept value, as well as the market awareness and location quantity independent variables. When we insert brand awareness at 90% and open two additional stores, we arrive at $1.56 million. We can now compare the additional revenue we get with the two additional stores With the costs associated with the opening of those two stores. The example demonstrates the beauty of such analyses, in that we can directly predict what will happen As we change the values of potential explanatory variables. © Stephan Sorger 2016; Data Science: Excel Regression

25 Outline/ Learning Objectives
Topic Description Background The goal of regression analysis Statistics Basic statistics governing regression performance Tests F tests, t tests, p tests Procedure Executing regression analysis in Microsoft Excel Multivariate Executing cases with two or more independent variables In this module, we covered several topics surrounding regression analysis using Microsoft Excel. As we discussed the basic goal of regression analysis is to identify the relationship Between a dependent variable and one or more independent variables. We covered basic summary statistics and tests, such as p tests, to assess model validity. We provided an overview of how to execute regression analysis in Microsoft Excel, Both with simple one-variable models and models using multiple independent variables. © Stephan Sorger 2016; Data Science: Excel Regression


Download ppt "Data Science and Analytics Module 4. Excel Regression"

Similar presentations


Ads by Google