# Statistics and Regression Analysis

- Understand the basic types of data
- Conduct basic statistical analyses in Excel
- Generate descriptive statistics and other analyses using the Analysis ToolPak
- Find relationships in data using COVARIANCE.S and CORREL
- Use regression analysis to predict future values

Copyright © 2014 Pearson Education, Inc. Publishing as Prentice Hall.

Statistics: collecting, analyzing, and interpreting data

- Descriptive statistics: deriving information from data. Example: a financial analysis performed at the end of the fiscal year to determine the profitability of a company.
- Inferential statistics: making predictions about a population based on a sample. Example: a survey used to predict the success of a product or service. Inferential statistics rely on probability.
- Probability: the likelihood that an event will occur, based on what is known.

- Data: values that describe attributes.
- Data set: a collection of related data, consisting of observational units and variables. The variables are the types of data collected, for instance name, address, gender, blood pressure, and income.
- Observational unit: a person, object, or event about which data is collected.
- Population: the collection of people, objects, or events on which data is collected.
- Sample: a subset of the population. The larger the sample, the more likely that predictions based on it will be accurate.
- Random sample: a subset of the population in which every member has an equal chance of being chosen.

- Probability distribution: describes all the values a variable can take and the likelihood that the variable falls within a specific range.
- Discrete variable: a variable with a finite number of possible values, all of which are known. Example: when you take inventory, the number of products on a shelf is a discrete variable.
- Continuous variable: a variable with an infinite number of possible values within a range. Example: the amount of money you might spend at the grocery store. One day you may pick up just milk, and another day you may purchase groceries for a big party.

Understand the Basic Types of Data

- Ratio data: differences can be quantified and proportions can be determined. “9 out of 10 dentists agree” describes ratio data: 10 dentists were asked their opinions, and 9 of them (a ratio of 90%) agreed.
- Interval data: measures the size of the difference between values, such as time, temperatures, and dates.
- Ordinal data: uses numbers to rank data in an order: first, second, third.
- Nominal data: uses numbers for classification purposes. For example, 1 might be equal to Yes and 2 equal to No. Often used to categorize responses in a survey.

RAND function

- Generates a random number between 0 and 1
- Useful for creating a random sample
- Syntax: `RAND()`

The number can then be used to identify people, objects, or events for a random sample. For instance, you may generate random numbers to use as a basis for pulling parts out of inventory for testing purposes. RAND is a volatile function: it generates a new number every time the worksheet is calculated. To preserve a randomly generated number, copy it and paste it using the Paste Values option in Excel.
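For readers working outside Excel, the same idea can be sketched in Python. This is an illustration only: the part numbers and the helper name `draw_random_sample` are hypothetical, and a fixed seed plays the role that Paste Values plays in Excel by making the draw repeatable.

```python
import random


def draw_random_sample(items, k, seed=None):
    """Return k distinct items, each with an equal chance of being chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(items, k)


rng = random.Random(42)
value = rng.random()  # comparable to RAND(): 0 <= value < 1

part_ids = list(range(1, 101))  # hypothetical inventory part numbers
sample = draw_random_sample(part_ids, k=5, seed=42)
print(value, sample)
```

Calling `draw_random_sample` again with the same seed returns the same parts, whereas omitting the seed gives a fresh sample each time, much like a volatile function.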

Conduct Basic Statistical Analyses in Excel

Central tendency: the way data clusters around a value

- Mean: the sum of all values divided by the total number of values
- Median: the middle value between the higher and lower halves of the data
- Mode: the value that appears most often

| Central tendency | Function |
| --- | --- |
| Mean | `AVERAGE(number1,[number2],…)` |
| Median | `MEDIAN(number1,[number2],…)` |
| Mode: value that appears most often | `MODE.SNGL(number1,[number2],…)` |
| Mode: vertical array of values that occur most often | `MODE.MULT(number1,[number2],…)` |

MODE.SNGL returns the value that appears most often in the data set. If two or more values occur the same number of times, only the first one is returned. This could be a problem, so a second mode function, MODE.MULT, returns a vertical array of all the values that occur most often. For instance, if your values were 100, 100, 95, 93, 95, 80, and 80, MODE.MULT would return 100, 95, and 80 in a vertical array, whereas MODE.SNGL would return only 100.
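The single-mode versus multiple-mode behavior can be checked outside Excel as well. The sketch below uses Python's standard `statistics` module on the same seven values. It assumes Python 3.8 or later, where `statistics.mode` returns the first mode encountered and `statistics.multimode` returns all of them.

```python
import statistics

values = [100, 100, 95, 93, 95, 80, 80]

mean_value = statistics.mean(values)      # comparable to AVERAGE
median_value = statistics.median(values)  # comparable to MEDIAN
single_mode = statistics.mode(values)     # comparable to MODE.SNGL
all_modes = statistics.multimode(values)  # comparable to MODE.MULT

print(single_mode)  # 100
print(all_modes)    # [100, 95, 80]
```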

- Range: the difference between the minimum and maximum values. Subtracting the minimum value from the maximum value returns the range.
- Variance: how far the data set varies from the mean. If the variance is high, the data set is widely dispersed. Conversely, if it is low, the data centers closely around the mean.
- Standard deviation: the typical spread of a data set from the mean, calculated as the square root of the variance.
- Outliers: data values that are abnormally different from the rest of a random sample.

| Dispersion | Function |
| --- | --- |
| Minimum value | `MIN(number1,[number2],…)` |
| Maximum value | `MAX(number1,[number2],…)` |
| Variance: sample data set | `VAR.S(number1,[number2],…)` |
| Variance: entire population | `VAR.P(number1,[number2],…)` |
| Standard deviation: sample data set | `STDEV.S(number1,[number2],…)` |
| Standard deviation: entire population | `STDEV.P(number1,[number2],…)` |
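The difference between the sample and population versions is easiest to see on a small data set. The sketch below uses Python's standard `statistics` module as a stand-in for the Excel functions. The eight values are hypothetical and chosen so the population results come out as whole numbers.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean = 5

data_range = max(data) - min(data)              # MAX minus MIN
sample_variance = statistics.variance(data)     # comparable to VAR.S (divides by n - 1)
population_variance = statistics.pvariance(data)  # comparable to VAR.P (divides by n)
sample_stdev = statistics.stdev(data)           # comparable to STDEV.S
population_stdev = statistics.pstdev(data)      # comparable to STDEV.P

print(data_range, population_variance, population_stdev)  # range 7, variance 4, stdev 2
print(sample_variance, sample_stdev)
```

The sample versions are slightly larger because they divide the squared deviations by n - 1 rather than n.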

- Bins: intervals for grouping data
- Array function: performs multiple calculations on one or more items in an array
- FREQUENCY function: `{=FREQUENCY(Data_array,Bins_array)}`

Large data sets, with thousands or even hundreds of thousands of records, are often grouped into bins, the intervals into which the data are grouped. The FREQUENCY function is an array function that calculates how often values occur within the bins. For instance, you might use FREQUENCY with the Bins_array argument to determine the number of students who received As, Bs, and Cs, with the grade levels being the bins.

Example array formula: `{=FREQUENCY(B2:B30,D2:D5)}`, where `B2:B30` is the data array and `D2:D5` is the bins array.
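The binning logic can be sketched in Python. This version assumes the usual FREQUENCY convention: each bin value is an inclusive upper bound, and one extra count is returned for values above the last bin. The function name `frequency`, the scores, and the bin values are hypothetical, and the bins are sorted into ascending order first.

```python
from bisect import bisect_left


def frequency(data, bins):
    """Count how many values fall into each bin.

    A value lands in the first bin whose upper bound is >= the value.
    The final count holds values greater than the last bin.
    """
    bounds = sorted(bins)
    counts = [0] * (len(bounds) + 1)
    for value in data:
        counts[bisect_left(bounds, value)] += 1
    return counts


scores = [55, 62, 70, 71, 79, 80, 85, 90, 95, 100]  # hypothetical grades
bins = [69, 79, 89]  # upper bounds of the first three grade bands

print(frequency(scores, bins))  # [2, 3, 2, 3]
```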

Analysis ToolPak

- An Excel add-in
- After it is installed, it is found on the Data tab as Data Analysis
- Includes Descriptive Statistics

Calculates the average of values over time based on specified intervals. The interval used here is 3 months, so a #N/A value occurs in the first 2 months, where fewer than 3 values are available. Trends can be analyzed with this analysis tool.
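The behavior described here matches a trailing moving average. Assuming that reading, the sketch below shows why the first two positions have no value when the interval is 3. The function name `moving_average` and the monthly figures are hypothetical, and `None` stands in for Excel's #N/A.

```python
def moving_average(values, interval=3):
    """Average each value with the (interval - 1) values before it."""
    result = []
    for i in range(len(values)):
        if i + 1 < interval:
            result.append(None)  # not enough history yet, like #N/A
        else:
            window = values[i + 1 - interval : i + 1]
            result.append(sum(window) / interval)
    return result


monthly_sales = [10, 20, 30, 40, 50]  # hypothetical monthly figures
print(moving_average(monthly_sales))  # [None, None, 20.0, 30.0, 40.0]
```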

Correlation coefficient: describes the strength and direction of a relationship. The strength is described as strong, moderate, weak, or very weak based on the correlation coefficient.

| Correlation coefficient | Strength |
| --- | --- |
| -1.0 to -0.5 or 0.5 to 1.0 | Strong |
| -0.5 to -0.3 or 0.3 to 0.5 | Moderate |
| -0.3 to -0.1 or 0.1 to 0.3 | Weak |
| -0.1 to 0.1 | Very weak or none |
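The strength bands can be expressed as a small lookup. The bands share their endpoints, so this sketch makes an assumption the source does not state: a value exactly on a boundary (for example 0.5) is assigned to the stronger band. The function name `correlation_strength` is hypothetical.

```python
def correlation_strength(r):
    """Label a correlation coefficient using the bands in the table above.

    Assumption: boundary values (0.1, 0.3, 0.5) go to the stronger band.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    size = abs(r)  # strength ignores direction
    if size >= 0.5:
        return "Strong"
    if size >= 0.3:
        return "Moderate"
    if size >= 0.1:
        return "Weak"
    return "Very weak or none"


print(correlation_strength(-0.4))  # Moderate
```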

Find Relationships in Data Using CORREL

`=CORREL(A2:A37,B2:B37)`, where `A2:A37` is Array1 and `B2:B37` is Array2.

The correlation coefficient indicates a strong correlation: the older a person is, the more money that person is likely to spend.
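To make the calculation behind CORREL concrete, the sketch below computes the Pearson correlation coefficient by hand. The ages and spending amounts are hypothetical and are not the values in the worksheet. The function name `pearson_r` is likewise an illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return sxy / sqrt(sxx * syy)


ages = [25, 35, 45, 55, 65]      # hypothetical customer ages
spending = [40, 55, 60, 80, 95]  # hypothetical amounts spent
r = pearson_r(ages, spending)
print(round(r, 3))
```

A positive result close to 1 corresponds to the situation described above, where spending rises with age.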

Regression Analysis

Regression analysis is a tool for building statistical models that characterize relationships between a dependent variable and one or more independent variables, all of which are numerical.

- Simple linear regression involves a single independent variable.
- Multiple regression involves two or more independent variables.

Purpose of Regression Analysis

The purpose of regression analysis is to analyze relationships among variables. The analysis is carried out by estimating a relationship, and the results serve two purposes:

1. Answer the question of how much y, the dependent variable, changes with changes in each of the x's (x1, x2, ..., xk).
2. Forecast or predict the value of y based on the values of the x's, the independent variables.

Simple Linear Regression (Figure 9.1)

- Finds a linear relationship between one independent variable X and one dependent variable Y.
- First prepare a scatter plot to verify that the data has a linear trend.
- Use alternative approaches if the data is not linear.

Scatter Plots and Correlation

- A scatter plot (or scatter diagram) is used to show the relationship between two variables.
- Correlation analysis is used to measure the strength of the association (linear relationship) between two variables.
- Correlation is concerned only with the strength of the relationship.
- No causal effect is implied.

Scatter Plot Examples

[Figure: scatter plots of y against x, with panels labeled "Strong relationships" and "Weak relationships"]

Examples of Approximate r Values

[Figure: scatter plots of y against x illustrating r = -1, r = -0.6, r = 0, r = +0.3, and r = +1]

Simple Linear Regression: Example 9.1, Home Market Value Data (Figures 9.2 and 9.3)

Size of a house is typically related to its market value.

- X = square footage
- Y = market value (\$)

The scatter plot of the full data set (42 homes) indicates a linear trend.

Simple Linear Regression: Finding the Best-Fitting Regression Line (Figure 9.4)

- Two possible lines are shown in Figure 9.4.
- Line A is clearly a better fit to the data.
- We want to determine the best regression line, $\hat{Y} = b_0 + b_1 X$, where $b_0$ is the intercept and $b_1$ is the slope.

Least Squares Line

The most widely used criterion for measuring the goodness of fit of a line is the sum of the squared deviations between the observed y values and the values predicted by the line. The line that gives the best fit to the data is the one that minimizes this sum; it is called the least squares line or sample regression line.

The slope of a regression line represents the rate of change in y as x changes. Because y is dependent on x, the slope describes how the predicted value of y changes with x.
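A minimal sketch of the least squares calculation follows, using the standard closed-form formulas for one independent variable. The data points and the function names `least_squares_line` and `sum_squared_errors` are hypothetical, chosen only to show that the fitted line gives a smaller sum of squared errors than an alternative line.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the line minimizing squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


xs = [1, 2, 3, 4, 5]  # hypothetical data
ys = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(xs, ys)
best = sum_squared_errors(xs, ys, b0, b1)
other = sum_squared_errors(xs, ys, 2.0, 0.7)  # an arbitrary competing line
print(b0, b1, best, other)
```

Any other choice of intercept and slope, such as the competing line here, produces a sum of squared errors at least as large as the fitted line's.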

Simple Linear Regression: Using Excel to Find the Best Regression Line (Figure 9.5)

Market value = 32,673 + 35.036 × (square feet)

The regression model explains variation in market value due to the size of the home. It provides better estimates of market value than simply using the average.

Linear Relations

We know from algebra that lines come in the form y = mx + b, where m is the slope and b is the y-intercept. In statistics, we use y = a + bx for the equation of a straight line, so a is now the intercept and b is the slope.

- The slope (b) of the line is the amount by which y increases when x increases by 1 unit. This interpretation is very important.
- The intercept (a), sometimes called the vertical intercept, is the height of the line when x = 0.

Simple Linear Regression: Using Excel Functions to Find Least-Squares Coefficients (Figure 9.2)

- Slope = 35.036: `=SLOPE(C4:C45, B4:B45)`
- Intercept = 32,673: `=INTERCEPT(C4:C45, B4:B45)`
- Estimate Y when X = 1800 square feet: Y = 32,673 + 35.036(1800) = \$95,737.80
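The estimate above can be reproduced with a one-line function. The coefficients are the ones reported on the slide. The function name `predict_market_value` is a hypothetical label for this sketch.

```python
INTERCEPT = 32673.0  # intercept reported above
SLOPE = 35.036       # slope reported above, in dollars per square foot


def predict_market_value(square_feet):
    """Predicted market value from the fitted line: intercept + slope * x."""
    return INTERCEPT + SLOPE * square_feet


estimate = predict_market_value(1800)
print(f"${estimate:,.2f}")  # $95,737.80
```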

Simple Linear Regression: Excel Regression Tool (Figure 9.7)

- Open the tool from Data > Data Analysis > Regression.
- Settings: Input Y Range, Input X Range, Labels.
- Excel outputs a table with many useful regression statistics.

Three Important Questions

To examine how useful or effective the line summarizing the relationship between x and y is, we consider the following three questions.

1. Is a line an appropriate way to summarize the relationship between the two variables?
2. Are there any unusual aspects of the data set that we need to consider before proceeding to use the regression line to make predictions?
3. If we decide that it is reasonable to use the regression line as a basis for prediction, how accurate can we expect predictions based on the regression line to be?

Simple Linear Regression: Regression Statistics in Excel’s Output

- Multiple R is the correlation between the actual and predicted values of the dependent variable. In simple linear regression, its magnitude equals that of the correlation coefficient r, which varies from -1 to +1 and is negative if the slope is negative.
- R Square measures the model’s accuracy in explaining the dependent variable. $R^2$ varies from 0 (no fit) to 1 (perfect fit).
- Adjusted R Square adjusts $R^2$ for the sample size and the number of X variables. As the sample size increases above 20 cases per variable, the adjustment is less needed (and vice versa).
- Standard Error measures the variability between the observed and predicted Y values.

Simple Linear Regression: Example 9.4, Interpreting Regression Statistics for Simple Linear Regression (Home Market Value) (Figure 9.8)

- 53% of the variation in home market values can be explained by home size.
- The standard error of \$7,287 is less than the standard deviation (not shown) of \$10,553.
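The R Square and Adjusted R Square statistics can be sketched with their standard textbook formulas. The example call uses the rounded 53% and the 42 homes mentioned in this example, with one X variable, so the result is approximate and may differ slightly from Excel's output. The function names are hypothetical.

```python
def r_squared(observed, predicted):
    """Share of the variation in the observed values explained by the model."""
    mean_y = sum(observed) / len(observed)
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_residual / ss_total


def adjusted_r_squared(r2, n, k):
    """Adjust R Square for sample size n and number of X variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


home_adjusted = adjusted_r_squared(0.53, n=42, k=1)
print(round(home_adjusted, 3))  # about 0.518
```

With 42 cases and a single variable the adjustment is small, which is consistent with the note that it matters less once the sample exceeds about 20 cases per variable.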

Multiple Linear Regression

Multiple regression has more than one independent variable.

| Simple regression | Multiple regression |
| --- | --- |
| One dependent variable Y predicted from one independent variable X | One dependent variable Y predicted from a set of independent variables (X1, X2, …, Xk) |
| One regression coefficient | One regression coefficient for each independent variable |
| $r^2$: proportion of variation in dependent variable Y predictable from X | $R^2$: proportion of variation in dependent variable Y predictable by the set of independent variables (X’s) |

Multiple Regression Analysis to Predict Future Values

`=(E3*$F$27)+(F3*$F$28)+$F$26`

- `E3`: Age
- `$F$27`: Age coefficient
- `F3`: Gender
- `$F$28`: Gender coefficient
- `$F$26`: Intercept coefficient

Multiple Regression Analysis to Predict Future Values (continued)

The R Square value is a conservative estimate of the independent variables’ ability to predict the value of the dependent variable. In this case, age and gender account for 65.7% of the variation in the sales revenue generated by a customer.

Multiple Regression Analysis to Predict Future Values (continued)

The Intercept coefficient, found in cell F26, is the value at which a regression line will cross the y-axis. It is used in the formula to predict the sales amount based on age and gender found in cell G6. The Gender and Age coefficients are also used in the regression formula. In this case, the prediction is that women in the 60-year-old range will spend \$207.97 at the spa.
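The worksheet formula multiplies each input by its coefficient and adds the intercept. The sketch below mirrors that structure in Python. The transcript does not give the coefficient values, so the numbers here are placeholders that will not reproduce the \$207.97 prediction. The function name `predict_sales` and the 0/1 coding of gender are also assumptions.

```python
def predict_sales(age, gender, age_coef, gender_coef, intercept):
    """Mirror of (Age * Age coefficient) + (Gender * Gender coefficient) + Intercept."""
    return (age * age_coef) + (gender * gender_coef) + intercept


# Placeholder coefficients for illustration only, not taken from the worksheet.
AGE_COEF = 2.0
GENDER_COEF = 10.0
INTERCEPT = 50.0

prediction = predict_sales(60, 1, AGE_COEF, GENDER_COEF, INTERCEPT)
print(prediction)  # 180.0 with these placeholder coefficients
```

In the worksheet, the absolute references such as `$F$27` play the role of the fixed coefficient arguments, while `E3` and `F3` change from row to row.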

Building Good Regression Models (Figure 9.16)

- Not all of the independent variables in a linear regression model are always significant.
- We will learn how to build good regression models that include the “best” set of variables.
- The Banking Data includes demographic information on customers in the bank’s current market.

Building Good Regression Models: Predicting Average Bank Balance Using Regression (Figure 9.17)

Home Value and Education are not significant.

Building Good Regression Models: Systematic Approach to Building Good Multiple Regression Models

1. Construct a model with all available independent variables and check the significance of each.
2. Identify the largest p-value that is greater than 0.05.
3. Remove that variable and evaluate adjusted $R^2$.
4. Continue until all variables are significant.

Find the model with the highest adjusted $R^2$. (Do not use unadjusted $R^2$, since it never decreases when variables are added.)
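The steps above can be sketched as a loop. This is an outline under stated assumptions, not the textbook's procedure in code: `fit` is a hypothetical callable supplied by the reader that refits the regression for a list of variable names and returns the p-value of each variable together with the model's adjusted R Square. In Excel, that step corresponds to rerunning the Regression tool.

```python
def backward_eliminate(variables, fit, alpha=0.05):
    """Drop the least significant variable until all p-values are <= alpha.

    fit(variables) is assumed to return (p_values_by_name, adjusted_r2).
    Returns the remaining variables and the history of models tried.
    """
    current = list(variables)
    history = []
    while current:
        p_values, adjusted_r2 = fit(current)
        history.append((list(current), adjusted_r2))
        worst = max(current, key=lambda name: p_values[name])
        if p_values[worst] <= alpha:
            break  # every remaining variable is significant
        current.remove(worst)
    return current, history
```

Comparing the adjusted R Square values recorded in `history` shows whether each removal improved the model, which is the check described in step 3.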

Building Good Regression Models: Identifying the Best Regression Model (Figure 9.18)

Bank regression after removing Home Value: adjusted $R^2$ improves slightly.
