# Lecture 23: Tues., Dec. 2

Today: review of using nominal variables in multiple regression; extra sum of squares F-tests (Ch. 10.3); R-squared statistic (10.4.1); course evaluations.
Thursday: residual plots (11.2); dealing with influential observations ( ); take-home points from the course.

Info about Final Assignment (no Final Exam)
Handed out after Thursday’s class (the final class). Due Friday, Dec. 12th by 5 p.m. Approximately the length of a regular homework, but with somewhat more challenging questions (the emphasis will be on multiple regression, but questions will involve material from the whole course). Do not talk to each other about it. I will answer general conceptual questions about the course but not specific questions about the assignment [office hours by appointment any time this week and next week].

Separate/Parallel Regression Lines Model
Separate regression lines model: Parallel regression lines model:
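
The slide's own equations did not survive in this transcript. As a sketch only, one common way to write the two models for the echolocation example is shown here. The notation is an assumption: $I_1$ and $I_2$ are indicator variables for two of the three vertebrate types (the third type is the reference level), and body mass is assumed to enter on the log scale as $\text{lmass}$.

```latex
\begin{align*}
\text{Separate lines:}\quad
\mu\{\log(\text{energy}) \mid \text{lmass}, \text{type}\}
  &= \beta_0 + \beta_1\,\text{lmass} + \beta_2 I_1 + \beta_3 I_2
   + \beta_4 (I_1 \times \text{lmass}) + \beta_5 (I_2 \times \text{lmass}) \\
\text{Parallel lines:}\quad
\mu\{\log(\text{energy}) \mid \text{lmass}, \text{type}\}
  &= \beta_0 + \beta_1\,\text{lmass} + \beta_2 I_1 + \beta_3 I_2
\end{align*}
```

In this form the parallel lines model is the separate lines model with $\beta_4 = \beta_5 = 0$: every type shares the slope $\beta_1$ and only the intercepts differ.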

Parallel Regression Lines Model
No strong evidence that echolocating bats use less energy than either nonecholocating bats (p-value = 0.70) or nonecholocating birds (p-value = 0.88) of the same body size. The 95% confidence interval for the difference in mean log energy between nonecholocating bats and echolocating bats of the same body size is (-0.51, 0.35). This means that the 95% confidence interval for the ratio of median energy for nonecholocating bats to echolocating bats of the same body size is $(e^{-0.51}, e^{0.35}) \approx (0.60, 1.42)$. Summary of findings: although there is no strong evidence that echolocating bats use less energy than nonecholocating bats of the same body size, it is still plausible that they use quite a bit less energy (about 70% as much at the median, since $1/1.42 \approx 0.70$). The study is inconclusive.
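
The back-transformation from the log scale can be checked with a few lines of Python. This is a minimal sketch that uses only the interval endpoints quoted on the slide; the variable names are illustrative.

```python
import math

# 95% CI for the difference in mean log energy (from the slide):
# nonecholocating bats minus echolocating bats of the same body size.
log_ci = (-0.51, 0.35)

# Exponentiating each endpoint gives a CI for the ratio of medians
# (nonecholocating bats relative to echolocating bats).
ratio_ci = tuple(math.exp(endpoint) for endpoint in log_ci)

# Taking reciprocals (and swapping endpoints) gives the ratio in the other
# direction: echolocating bats relative to nonecholocating bats.
inverse_ci = (1.0 / ratio_ci[1], 1.0 / ratio_ci[0])

print(f"nonecho / echo: ({ratio_ci[0]:.2f}, {ratio_ci[1]:.2f})")
print(f"echo / nonecho: ({inverse_ci[0]:.2f}, {inverse_ci[1]:.2f})")
```

Both intervals contain 1, which is why the study cannot rule out either "no difference" or a practically important difference.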

Two review points from example
If the difference in the means of the two populations log(Y1) and log(Y2) is $\delta$, then the ratio of the medians of the populations of Y1 and Y2 is $e^{\delta}$ (Ch. 3.5, 8.4).
A nonsignificant (> 0.05) p-value does not mean that the null hypothesis is true. It means that there is not strong evidence to reject the null hypothesis. A confidence interval provides information about the range of plausible values for the parameter based on the study. Here the study is inconclusive because the CI contains both the null hypothesis value and a practically significant alternative (situation D of Display 23.1). If possible, choose the sample size to avoid situation D.
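
The first review point can be spelled out in one short derivation. It assumes the distributions of $\log Y_1$ and $\log Y_2$ are symmetric, so that mean and median coincide on the log scale; $\delta$ is a symbol introduced here for the difference in means.

```latex
\begin{align*}
\delta &= \operatorname{Mean}[\log Y_1] - \operatorname{Mean}[\log Y_2]
        = \operatorname{Median}[\log Y_1] - \operatorname{Median}[\log Y_2] \\
       &= \log \operatorname{Median}[Y_1] - \log \operatorname{Median}[Y_2]
        \quad\Longrightarrow\quad
        \frac{\operatorname{Median}[Y_1]}{\operatorname{Median}[Y_2]} = e^{\delta}
\end{align*}
```

The middle step uses the fact that the log is monotone, so the median of $\log Y$ equals the log of the median of $Y$. The same is not true of means, which is why the back-transformed statement is about medians.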

Nominal Variables in JMP
To automatically incorporate a nominal variable in JMP, make the modeling type nominal, fit the model, and then click the red triangle next to Response, then Estimates and Expanded Estimates. JMP creates variables for each level of the nominal variable that represent the difference between that level and the average of all the levels. To look directly at the difference between two levels of a categorical variable, it is easier to make your own indicator variables, leaving out one of the levels [Use this method for homework].
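
Outside JMP, making your own indicator variables amounts to the following. This is a sketch with hypothetical labels (`ebat`, `nbat`, `bird`) and a hypothetical helper `make_indicators`; none of these names come from the lecture.

```python
# Hypothetical type labels for six animals (illustrative, not the study data).
types = ["ebat", "nbat", "bird", "ebat", "bird", "nbat"]


def make_indicators(values, reference):
    """Return {level: 0/1 column} for every level except the reference level."""
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Leave out "ebat": each coefficient then compares a level directly with ebat.
indicators = make_indicators(types, reference="ebat")
for level, column in indicators.items():
    print(level, column)
```

With the left-out level as the baseline, the coefficient on each indicator is the difference between that level and the baseline. JMP's default coding instead compares each level with the average of all levels.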

Prediction Intervals
Predicted value of $y$ for given values $x_1, \ldots, x_p$ of the explanatory variables: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p$.
To find a 95% prediction interval for the log energy of a flying vertebrate of a given type and mass: fit the multiple regression model (i.e., for the parallel regression lines model, fit log energy on body size and the type indicator variables). Click the red triangle next to the response (log energy), click Save Columns, then click Predicted Values and also Indiv Confid Interval. This saves the predicted value and the lower and upper 95% prediction interval endpoints for each observation in the data set. For an echolocating bat with body mass 8 g, the prediction interval for log energy is (-0.393, 0.499) [for energy it is $(e^{-0.393}, e^{0.499}) \approx (0.675, 1.647)$]. Use Mean Confid Interval for confidence intervals for the mean response.

Nominal Variables Example
An analysis has shown that the time required in minutes to complete a production run increases with the number of items produced. Data were collected for the time required to process 20 randomly selected orders as supervised by three managers. How do the managers compare?

One Way Layout Analysis
One-way layout analysis does not take run size into account. Manager C might be a better manager, or might simply have been supervising smaller production runs.

Separate/Parallel Regression Lines Model
Separate regression lines model: Parallel regression lines model:
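
The difference between the two models shows up in the columns of the design matrix. The sketch below uses hypothetical run sizes and assumes Manager c is the left-out level; the lecture does not say which level was omitted.

```python
# Hypothetical rows: (run size, manager). Values are illustrative only.
runs = [(100, "a"), (250, "b"), (175, "c")]


def design_row(run_size, manager, separate):
    """Build one design-matrix row, with Manager c as the omitted level (an assumption)."""
    is_a = 1 if manager == "a" else 0
    is_b = 1 if manager == "b" else 0
    row = [1, run_size, is_a, is_b]  # intercept, common slope, two intercept shifts
    if separate:
        # Interaction columns let each manager have a different slope.
        row += [is_a * run_size, is_b * run_size]
    return row


parallel_rows = [design_row(size, m, separate=False) for size, m in runs]
separate_rows = [design_row(size, m, separate=True) for size, m in runs]
print(len(parallel_rows[0]), len(separate_rows[0]))  # 4 6
```

The separate lines model has two more coefficients than the parallel lines model, which is where the 2 numerator degrees of freedom in the F-test comparing them come from.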

Coded Scatterplot
Can follow the procedure using Graph, Overlay Plot from last lecture. Shortcut: click Rows, Color or Mark by Column, and then highlight Manager. Now use Fit Y by X with Y = Time for Run and X = Run Size.

Model Fits
[JMP output for the parallel regression lines model and the separate regression lines model]
How do we test whether the parallel regression lines model is appropriate?

Extra Sum of Squares F-tests
Suppose we want to test whether several coefficients are all equal to zero at once. t-tests, either individually or in combination, cannot be used to test such a hypothesis involving more than one parameter. Instead, use an F-test for the joint significance of several terms.

Extra Sum of Squares F-test
Under H0, the F-statistic has an F distribution with (number of betas being tested, n-(p+1)) degrees of freedom. The p-value can be found by using Table A.4 or by creating a Formula in JMP with Probability, F Distribution, and then putting in the value of the F-statistic for F and the appropriate degrees of freedom. This gives P(F random variable with those degrees of freedom < observed F-statistic), which equals 1 - p-value.
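
The F-statistic itself did not survive in this transcript. The usual extra-sum-of-squares form is the extra sum of squares per beta tested, divided by the full model's residual mean square. A minimal sketch of that form follows; the function name and the numbers are hypothetical and serve only to show the arithmetic.

```python
def extra_ss_f(rss_reduced, rss_full, num_tested, df_full):
    """Extra-sum-of-squares F-statistic.

    rss_reduced, rss_full: residual sums of squares of the reduced and full fits
    num_tested: number of betas set to zero under H0
    df_full: residual degrees of freedom of the full model, n - (p + 1)
    """
    extra_ss = rss_reduced - rss_full
    return (extra_ss / num_tested) / (rss_full / df_full)


# Hypothetical sums of squares, chosen only to illustrate the calculation.
f_stat = extra_ss_f(rss_reduced=150.0, rss_full=100.0, num_tested=2, df_full=50)
print(f_stat)  # 12.5
```

A large F-statistic means that dropping the tested terms increases the residual sum of squares by much more than would be expected from noise alone.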

Extra Sum of Squares F-test example
Testing the parallel regression lines model (H0, reduced model) vs. the separate regression lines model (full model) in the manager example.
F-statistic: 51.96.
p-value: P(F random variable with 2, 53 df > 51.96).

Second Example of F-test
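For the echolocation study, in the parallel regression lines model, test whether the coefficients on both type indicator variables are zero. Full model: parallel regression lines model. Reduced model: the same model without the type indicator variables. F-statistic: 0.44. p-value: P(F random variable with 2, 16 degrees of freedom > 0.44) ≈ 0.65.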
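
Both examples have 2 numerator degrees of freedom, and in that special case the upper-tail F probability has a simple closed form, so the p-values can be checked without tables. This is a sketch: the helper name is hypothetical, and the formula applies only when the numerator degrees of freedom equal 2.

```python
def f_upper_tail_2df(f_stat, denom_df):
    """P(F > f_stat) for an F distribution with 2 and denom_df degrees of freedom.

    Uses the closed form (1 + 2 f / d) ** (-d / 2), which holds only when the
    numerator degrees of freedom equal 2.
    """
    return (1.0 + 2.0 * f_stat / denom_df) ** (-denom_df / 2.0)


p_manager = f_upper_tail_2df(51.96, 53)  # manager example: parallel vs. separate lines
p_echo = f_upper_tail_2df(0.44, 16)      # echolocation example
print(f"manager example p-value: {p_manager:.1e}")
print(f"echolocation example p-value: {p_echo:.2f}")
```

The manager p-value is far below 0.0001, so the parallel lines model is rejected in favor of separate lines. The echolocation p-value is large, so there is no evidence against the reduced model.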

Manager Example Findings
The runs supervised by Manager A appear abnormally time consuming. Manager B has high initial fixed setup costs, but the time per unit is the best of the three. Manager C has the lowest fixed costs and a per-unit production time in between those of Managers A and B. Adjustments to marginal analysis via regression only control for possible differences in size among production runs. Other differences might be relevant, e.g., difficulty of production runs. It could be that Manager A supervised the most difficult production runs.

Special Cases of F-test
Multiple regression model: $\mu\{Y \mid X_1, \ldots, X_p\} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$.
If we want to test whether a single coefficient equals zero, e.g., H0: $\beta_1 = 0$, the F-test is equivalent to the t-test.
Suppose we want to test H0: $\beta_1 = \beta_2 = \cdots = \beta_p = 0$, i.e., the null hypothesis is that the mean of Y does not depend on any of the explanatory variables. JMP automatically computes this test under Analysis of Variance, Prob > F. For the separate regression lines model, there is strong evidence that mean run time does depend on at least one of run size and manager.

The R-Squared Statistic
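For the separate regression lines model in the production time example, the R-squared statistic has a similar interpretation as in simple linear regression: it is the proportion of the variation in y explained by the multiple regression model, $R^2 = (\text{Total SS} - \text{Residual SS}) / \text{Total SS}$.
Total Sum of Squares: $\sum_{i=1}^{n} (y_i - \bar{y})^2$. Residual Sum of Squares: $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.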
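
The R-squared calculation can be sketched directly from the two sums of squares. The observed and fitted values below are hypothetical and are not taken from the production time data.

```python
def r_squared(observed, fitted):
    """Proportion of the variation in y explained by the fitted values."""
    mean_y = sum(observed) / len(observed)
    total_ss = sum((y - mean_y) ** 2 for y in observed)
    residual_ss = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return (total_ss - residual_ss) / total_ss


# Hypothetical responses and fitted values, for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
fitted = [2.5, 3.5, 6.5, 7.5]
print(r_squared(observed, fitted))  # 0.95
```

Here the total sum of squares is 20 and the residual sum of squares is 1, so the fitted values account for 95% of the variation in y.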

Assumptions of Multiple Linear Regression Model
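For each subpopulation $x_1, \ldots, x_p$:
(A-1A) $\mu\{Y \mid x_1, \ldots, x_p\} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$
(A-1B) $\mathrm{Var}(Y \mid x_1, \ldots, x_p) = \sigma^2$
(A-1C) The distribution of $Y \mid x_1, \ldots, x_p$ is normal
[Distribution of residuals should not depend on $x_1, \ldots, x_p$]
(A-2) The observations are independent of one another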

Checking/Refining Model
Tools for checking (A-1A) and (A-1B): residual plots versus predicted (fitted) values; residual plots versus explanatory variables. If the model is correct, there should be no pattern in the residual plots.
Tool for checking (A-1C): normal quantile plot.
Tool for checking (A-2): residual plot versus time or spatial order of observations.
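
The quantities behind a residual-versus-fitted plot can be sketched as follows. For brevity this uses a simple (one-variable) least squares line on hypothetical data rather than a multiple regression; the residuals are computed the same way in both cases.

```python
# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Least squares slope and intercept for a single explanatory variable.
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# These (fitted, residual) pairs are what a residual-versus-predicted plot displays.
for fi, ri in zip(fitted, residuals):
    print(f"{fi:6.2f} {ri:6.2f}")
```

Plotting the second column against the first, or against an explanatory variable, should show a patternless band around zero if (A-1A) and (A-1B) hold.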