Lecture 22: Thurs., April 1
Outliers and influential points for simple linear regression
Multiple linear regression
–Basic model
–Interpreting the coefficients
Outliers and Influential Observations
An outlier is an observation that lies outside the overall pattern of the other observations. A point can be an outlier in the x direction, in the y direction, or in the direction of the scatterplot (i.e., far from the overall linear pattern of the points). For regression, the outliers of concern are those in the x direction and those in the direction of the scatterplot. A point that is an outlier in the direction of the scatterplot will have a large residual. An observation is influential if removing it markedly changes the least squares regression line. A point that is an outlier in the x direction will often be influential. The least squares method is not resistant to outliers. Follow the outlier examination strategy in Display 3.6 for dealing with outliers in the x direction and outliers in the direction of the scatterplot.
Outliers Example
Does the age at which a child begins to talk predict a later score on a test of mental ability? gesell.JMP contains data on each child's age at first word (x) and Gesell Adaptive score (y), an ability test taken much later. Child 18 is an outlier in the x direction and potentially influential. Child 19 is an outlier in the direction of the scatterplot. To assess whether a point is influential, fit the least squares line with and without the point (excluding the row to fit it without the point) and see how much of a difference it makes. Child 18 is influential.
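The with-and-without refit check described above can be sketched in Python (this is an illustration, not the lecture's JMP workflow). The numbers below are the values commonly reproduced for the Gesell data; verify them against gesell.JMP before relying on them.

```python
# Influence check: refit the least squares line with and without a point.
import numpy as np

# Age at first word (months) and Gesell Adaptive score for 21 children,
# as commonly reproduced for this dataset (check against gesell.JMP).
age = np.array([15, 26, 10, 9, 15, 20, 18, 11, 8, 20, 7,
                9, 10, 11, 11, 10, 12, 42, 17, 11, 10])
score = np.array([95, 71, 83, 91, 102, 87, 93, 100, 104, 94, 113,
                  96, 83, 84, 102, 100, 105, 57, 121, 86, 100])

def fit_line(x, y):
    """Least squares slope and intercept."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

# Fit with all observations, then with child 18 (row 18, index 17) excluded.
full = fit_line(age, score)
keep = np.arange(len(age)) != 17
reduced = fit_line(age[keep], score[keep])
print("slope with child 18:    %.3f" % full[0])
print("slope without child 18: %.3f" % reduced[0])
```

Because child 18 is a high-leverage point in the x direction, deleting it changes the slope noticeably, which is exactly what "influential" means here.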
Will You Take Mercury With Your Fish?
Too much mercury in one's body results in memory loss, depression, irritability, and anxiety – the "mad hatter" syndrome. Rivers and oceans contain small amounts of mercury, which can accumulate in fish over their lifetimes. The concentration of mercury in fish tissue can be obtained, at considerable expense, by catching fish and sending samples to a lab for analysis. To develop safety guidelines about how much fish to eat, it is important to understand the relationship between mercury concentration and measurable characteristics of a fish such as length and weight.
Data Set
mercury.JMP contains data from a study of largemouth bass in the Waccamaw and Lumber rivers in North Carolina. At several stations along each river, a group of fish were caught, weighed, and measured. In addition, a fillet from each fish caught was sent to the lab so that the tissue concentration of mercury could be determined for each fish. We want to predict Y = mercury concentration (measured in parts per million) based on X1 = length (centimeters) and X2 = weight (grams).
Multiple Regression Model
Multiple regression seeks to estimate the mean of Y given multiple explanatory variables X1, …, Xp, denoted by μ{Y | X1, …, Xp}.
Assumptions of the ideal multiple linear regression model:
–μ{Y | X1, …, Xp} = β0 + β1X1 + … + βpXp (linearity)
–SD(Y | X1, …, Xp) = σ (constant variance)
–The distribution of Y for each subpopulation X1, …, Xp is normal.
–The selection of an observation from any of the subpopulations is independent of the selection of any other observation.
Multiple Regression Model: Another Representation
Data: We observe (Yi, Xi1, …, Xip) for i = 1, …, n.
Ideal multiple regression model:
–Yi = β0 + β1Xi1 + … + βpXip + ei
–ei has a normal distribution with mean = 0, SD = σ
–e1, …, en are independent
–ei = "error" = error from predicting Yi by its subpopulation mean μ{Y | Xi1, …, Xip}
Estimation of Multiple Linear Regression Model
The coefficients β0, β1, …, βp are estimated by choosing b0, b1, …, bp to make the sum of squared prediction errors as small as possible, i.e., choose b0, b1, …, bp to minimize Σi [Yi – (b0 + b1Xi1 + … + bpXip)]².
Predicted value of y given x1, …, xp: ŷ = b0 + b1x1 + … + bpxp.
σ = SD(Y | X1, …, Xp), estimated by σ̂ = root mean square error.
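As an illustration outside JMP, the least squares minimization above can be carried out in Python with np.linalg.lstsq, which solves exactly this sum-of-squared-errors problem. The length/weight data and the coefficients (0.1, 0.02, 0.0003) below are made up for the sketch; they are not the mercury.JMP values.

```python
# Least squares estimation sketch on simulated fish data.
import numpy as np

rng = np.random.default_rng(0)
n = 50
length = rng.uniform(20, 60, n)       # X1 in cm (simulated, not mercury.JMP)
weight = rng.uniform(200, 2000, n)    # X2 in grams (simulated)
# Simulated response with made-up coefficients beta = (0.1, 0.02, 0.0003):
mercury = 0.1 + 0.02 * length + 0.0003 * weight + rng.normal(0, 0.2, n)

# Design matrix with an intercept column; lstsq chooses (b0, b1, b2)
# minimizing the sum of squared prediction errors.
X = np.column_stack([np.ones(n), length, weight])
b, *_ = np.linalg.lstsq(X, mercury, rcond=None)

yhat = X @ b                          # predicted values b0 + b1*x1 + b2*x2
print("estimated coefficients:", b)
```

With 50 observations, the estimated b1 and b2 land close to the coefficients that generated the data.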
Multiple Linear Regression in JMP
–Analyze, Fit Model
–Put the response variable in Y
–Click on the explanatory variables and then click Add under Construct Model Effects
–Click Run Model
Residuals and Root Mean Square Error from Multiple Regression
Residual for observation i: resi = Yi – Ŷi = Yi – (b0 + b1Xi1 + … + bpXip).
Root mean square error = sqrt[ Σi resi² / (n – p – 1) ].
As with simple linear regression, under the ideal multiple linear regression model:
–Approximately 68% of predictions of a future Y based on X1, …, Xp will be off by at most σ̂ (the root mean square error).
–Approximately 95% of predictions of a future Y based on X1, …, Xp will be off by at most 2σ̂.
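A quick simulation (made-up data, not mercury.JMP) checks both the residual formula and the 68%/95% rule: when the ideal model holds, roughly 68% of residuals fall within one RMSE of zero and roughly 95% within two.

```python
# Residuals, RMSE with n - p - 1 degrees of freedom, and the 68/95 rule.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
sigma = 1.5                              # true error SD (made up)
y = 2.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(0, sigma, n)

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b                        # residual_i = Y_i - Yhat_i
p = 2                                    # number of explanatory variables
rmse = np.sqrt(np.sum(resid**2) / (n - p - 1))

within1 = np.mean(np.abs(resid) <= rmse)
within2 = np.mean(np.abs(resid) <= 2 * rmse)
print("RMSE (estimates sigma):", rmse)
print("within 1 RMSE: %.2f, within 2 RMSE: %.2f" % (within1, within2))
```

The RMSE estimates the true σ of 1.5, and the empirical coverage fractions come out near 0.68 and 0.95, as the slide states.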
Interpreting the Coefficients
β1 = increase in the mean of Y associated with a one-unit (1 cm) increase in length, holding weight fixed.
β2 = increase in the mean of Y associated with a one-unit (1 gram) increase in weight, holding length fixed.
The interpretation of multiple regression coefficients depends on what other explanatory variables are in the model. See handout.
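The dependence on which other variables are in the model can be demonstrated with simulated data in which length and weight are correlated (all numbers are made up): the coefficient on length changes markedly when weight is dropped, because length then picks up part of weight's association with Y.

```python
# How a coefficient's value depends on the other variables in the model.
import numpy as np

rng = np.random.default_rng(2)
n = 100
length = rng.uniform(20, 60, n)
# Weight strongly correlated with length (made-up relationship):
weight = 30 * length + rng.normal(0, 100, n)
# True model: holding weight fixed, +1 cm of length adds 0.02 to mean Y.
mercury = 0.1 + 0.02 * length + 0.002 * weight + rng.normal(0, 0.3, n)

def coefs(*cols):
    """Least squares coefficients for an intercept plus the given columns."""
    X = np.column_stack([np.ones(n)] + list(cols))
    b, *_ = np.linalg.lstsq(X, mercury, rcond=None)
    return b

b_both = coefs(length, weight)   # length coefficient, holding weight fixed
b_len = coefs(length)            # length coefficient, weight omitted
print("length coef with weight in model:    %.4f" % b_both[1])
print("length coef without weight in model: %.4f" % b_len[1])
```

With weight in the model the length coefficient estimates the "holding weight fixed" effect (about 0.02 here); with weight omitted it is much larger, because it also absorbs the effect of the correlated weight variable.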