We’ll consider here the problem of paired data. There are two common notations.
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ shows the data as n points in two-space.
X      Y
x_1    y_1
x_2    y_2
x_3    y_3
…      …
x_n    y_n
This is the spreadsheet form.
PowerPoint show prepared by Gary Simon, 11 MARCH 2008.
The separate points are assumed independent. We wish to find a relationship between variable X and variable Y. We have here a data set on eye response to different types of drops, but for now we’ll look at just a few simple items of information.
DP0OD   Pupil diameter, start of experiment, right eye
DP0OS   Pupil diameter, start of experiment, left eye
AGE     Subject age
There are altogether 100 subjects.
Let’s consider the relationship between pupil diameter in the eyes. An obvious first step is making a scatterplot showing all 100 people. Let’s put the right eye on the horizontal axis and the left eye on the vertical axis. This is not a critical decision. This graph shows that the points cluster near a diagonal line. This is not a surprise.
Here’s the same picture with the Y = X line superimposed: The points cling close to the line.
There are a few simple ways to summarize this situation. Perhaps the best is the correlation. Here r = ….
Now let’s complicate this a bit. Suppose that we want to check on the relationship between DP0OS (pupil diameter, left eye) and AGE.
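The correlation mentioned for the two-eye scatterplot can be computed directly from paired data. Here is a minimal sketch, assuming the usual Pearson formula. The function name `pearson_r` and the five pairs of diameters are made up for illustration; they are not values from the 100-subject data set.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squares and cross-products of deviations from the means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical pupil diameters in mm (NOT the study data)
right_eye = [5.1, 6.3, 4.8, 7.0, 5.9]
left_eye = [5.0, 6.4, 4.9, 6.8, 6.0]

r = pearson_r(right_eye, left_eye)
print(round(r, 3))
```

Points that cling close to a rising line, as in the two-eye plot, give a value of r near 1.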
These two variables are not symmetric. We’ll think of the variable AGE as “logically earlier.” This means that we obtain it easily, reliably, and (probably) earlier than the pupil diameter. Also, it’s logical to think of using AGE to predict pupil diameter. We will designate AGE as the independent variable, we will identify it with the symbol X, and we will place it on the horizontal axis of the coming scatterplot.
We’ll think of the variable DP0OS as “logically later.” This information is obtained with some difficulty, with possible measurement error, and (probably) later than the age. We will designate DP0OS as the dependent variable, we will identify it with the symbol Y, and we will place it on the vertical axis of the coming scatterplot.
The scatterplot is next. Before it’s shown, we should ask ourselves whether
* pupil diameter generally rises with age
* pupil diameter is unrelated to age
* pupil diameter generally decreases with age
What do you think?
Here is the scatterplot:
Suppose that you would like to summarize the relationship between the two variables. You would like to write Pupil Diameter = Y = dependent variable = f(AGE) = f(X) = f(independent variable) for some function f. The problem is that you’ll never find a believable function to go through all the dots on the scatterplot. There is too much statistical noise.
The expression of the model will be revised to $Y = f(X) + \varepsilon$. The symbol ε represents statistical noise. It may involve random errors in measuring Y, or it may just represent variability that we don’t know how to account for. One could also have used “multiplicative noise” in the form $Y = f(X) \times \varepsilon$. In some cases, this is useful. For now, we’ll stick with the “additive noise” with the + sign. We will have a lot to say about the ε term. For now, we’ll just assume that it is independent over the data points.
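To see the difference between the two noise forms, here is a small simulation sketch. Everything in it is an assumption made for illustration: the straight-line `f`, the noise spreads, and the choice of normal noise for the additive case and lognormal noise (centered near 1) for the multiplicative case.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def f(x):
    # Hypothetical signal; the numbers are made up, not estimated from data
    return 7.0 - 0.04 * x

ages = [20, 30, 40, 50, 60]

# Additive noise: Y = f(X) + eps, with eps centered at 0
additive = [f(x) + random.gauss(0.0, 0.5) for x in ages]

# Multiplicative noise: Y = f(X) * eps, with eps positive and centered near 1
multiplicative = [f(x) * random.lognormvariate(0.0, 0.1) for x in ages]

for x, ya, ym in zip(ages, additive, multiplicative):
    print(x, round(ya, 2), round(ym, 2))
```

With additive noise the scatter around f has the same size everywhere. With multiplicative noise the scatter grows or shrinks in proportion to f(X).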
What form should we use for the function f ? How about $f(X) = \log X$ ? How about $f(X) = aX^2 + bX + c$ ? How about $f(X) = \tan(aX^2 + h)$ ? How about f(X) = ?
We will start with the simplest function, the straight line. This is $f(X) = \beta_0 + \beta_1 X$. The symbols $\beta_0$ and $\beta_1$ are parameters. $\beta_0$ is the intercept, also called Y-intercept. $\beta_1$ is the slope. In nearly all cases, $\beta_0$ and $\beta_1$ are not known, and we have to estimate them from data.
The notation is not universal. You will also see
$f(X) = \alpha + \beta X$    This is OK.
$f(X) = a + bX$    Use of Roman letters is not recommended.
For issues related to considering which symbols are fixed and which are random, we will prefer $f(x) = \beta_0 + \beta_1 x$. That is, we will prefer lower-case x. It is however impossible to enforce distinctions between x and X and also between y and Y. We can’t be too dogmatic about the notation.
The relationship between Y and X will be described through the simple linear regression model $Y = \beta_0 + \beta_1 x + \varepsilon$. This is made more direct by putting on subscript i to label individual data points. Our preferred form for the simple linear regression model is $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ with i = 1, 2, …, n.
The simple linear regression model also includes these assumptions about the noise terms $\varepsilon_1, \varepsilon_2, \varepsilon_3, \ldots, \varepsilon_n$:
The ε’s are independent of each other and also independent of the x’s.
The ε’s are sampled from a hypothetical population in which the mean is zero and the standard deviation is σ.
In some cases, we may add in the further assumption that the ε’s are sampled from a normal population.
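One way to get a feel for these assumptions is to simulate data that obey them. In this sketch the parameter values `beta0`, `beta1`, and `sigma` are hypothetical, not estimates from the eye data. The noise is drawn from a normal population, which is the optional further assumption.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical parameter values, chosen only for illustration
beta0, beta1, sigma = 7.0, -0.04, 0.8
n = 1000

# x values generated separately from the noise, so the two are independent
x = [random.uniform(18, 80) for _ in range(n)]

# Independent noise terms with mean 0 and standard deviation sigma
eps = [random.gauss(0.0, sigma) for _ in range(n)]

# The model: Y_i = beta0 + beta1 * x_i + eps_i
y = [beta0 + beta1 * xi + ei for xi, ei in zip(x, eps)]

print(round(statistics.mean(eps), 3), round(statistics.stdev(eps), 3))
```

The sample mean of the simulated noise should land near 0 and its sample standard deviation near `sigma`. With real data we never see the ε’s themselves.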
The simple linear regression model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ has three unknown parameters: $\beta_0$, $\beta_1$, and σ. Estimating these parameters is an important part of the regression task. Estimating $\beta_0$ and $\beta_1$ is equivalent to drawing a line on the scatterplot. The estimate of σ tells us how well the line describes the set of points on the scatterplot.
The estimate of $\beta_0$ is written $b_0$. The estimate of $\beta_1$ is written $b_1$. The estimate of σ is written s. You’ll also see $s_\varepsilon$ or $s_{Y|x}$. Note this consistent pattern of usage: Model parameters are Greek letters. Data-based estimates are corresponding Latin letters.
Be aware that other schemes exist. Someone who writes the model as $Y_i = \alpha + \beta x_i + \varepsilon_i$ will use a for the estimate of α and will use b for the estimate of β. Someone who writes the model as $Y_i = a + b x_i + \varepsilon_i$ will use $\hat{a}$ for the estimate of a and will use $\hat{b}$ for the estimate of b.
For our problem, the model is $DP_i = \beta_0 + \beta_1 AGE_i + \varepsilon_i$. The pupil diameter DP is in units of mm (millimeters). The variable AGE is in units of years. Therefore, $\beta_0$ and its estimate $b_0$ are in units of mm. Also, the ε’s and their standard deviation σ are in units of mm. The estimate of σ is also in units of mm. The slope $\beta_1$ and its estimate $b_1$ are in units of mm per year.
How should we estimate β 0 and β 1 ? We could guess. We could draw a nice-looking line on the scatterplot and then use that line to get the estimates. These are not necessarily bad methods, but they are not reproducible. This means that different people get different answers. Worse yet, the same person on two occasions will produce different answers.
We will instead propose that the estimates be done by minimizing a mathematical function. Many proposals have been made, but the nearly universal choice is least squares. Choose $b_0$ and $b_1$ to minimize the function $Q = \sum_{i=1}^{n} \left( Y_i - [b_0 + b_1 x_i] \right)^2$. How should this minimization be done?
The solution is by (mindless and routine) differentiation. That is, solve the system $\frac{\partial Q}{\partial b_0} = 0$, $\frac{\partial Q}{\partial b_1} = 0$. This results in two linear equations in the two unknowns $b_0$ and $b_1$.
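For anyone who wants to see the differentiation step written out, here is a sketch. It assumes that Q is the usual least squares criterion, the sum of squared vertical deviations of the points from the line.

```latex
Q = \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 x_i \right)^2

\frac{\partial Q}{\partial b_0} = -2 \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 x_i \right) = 0

\frac{\partial Q}{\partial b_1} = -2 \sum_{i=1}^{n} x_i \left( Y_i - b_0 - b_1 x_i \right) = 0

% Dividing by -2 and rearranging gives the two linear equations:
n\, b_0 + b_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} Y_i

b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i Y_i
```

These two equations are often called the normal equations. They are linear in $b_0$ and $b_1$ even though Q itself is quadratic.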
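The solution method selected by the previous slide works, but it’s clumsy. Here is a cleaner way to do this.
(1) Find the five sums $\sum x_i$, $\sum y_i$, $\sum x_i^2$, $\sum y_i^2$, $\sum x_i y_i$.
(2) Next find these quantities: $\bar{x} = \frac{1}{n}\sum x_i$, $\bar{y} = \frac{1}{n}\sum y_i$, $S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$, $S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$, $S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$.
(3) Find $b_1$ (the estimate of the slope $\beta_1$) as $b_1 = \frac{S_{xy}}{S_{xx}}$.
(4) Find $b_0$ (the estimate of the intercept $\beta_0$) as $b_0 = \bar{y} - b_1 \bar{x}$. Note that $b_1$ is found before $b_0$.
(5) Finally, calculate $S_{yy|x} = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$. We’ll use this later in the estimation of σ, the standard deviation of the noise.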
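The five steps translate almost line for line into code. This is a minimal sketch, assuming the standard shortcut formulas: $S_{xx} = \sum x_i^2 - (\sum x_i)^2/n$ and likewise for $S_{yy}$ and $S_{xy}$, then $b_1 = S_{xy}/S_{xx}$, $b_0 = \bar{y} - b_1\bar{x}$, and $S_{yy|x} = S_{yy} - S_{xy}^2/S_{xx}$. The function name and the five data pairs are made up; they are not the pupil diameter data.

```python
def least_squares_fit(xs, ys):
    """Return (b0, b1, s_yy_given_x) for the least squares line through (xs, ys)."""
    n = len(xs)

    # Step 1: the five sums
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Step 2: means and corrected sums of squares and cross-products
    x_bar = sum_x / n
    y_bar = sum_y / n
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n

    # Step 3: slope first
    b1 = s_xy / s_xx
    # Step 4: then the intercept
    b0 = y_bar - b1 * x_bar
    # Step 5: residual sum of squares, used later to estimate sigma
    s_yy_given_x = s_yy - s_xy ** 2 / s_xx

    return b0, b1, s_yy_given_x

# Hypothetical ages (years) and pupil diameters (mm), NOT the study data
ages = [20, 30, 40, 50, 60]
diameters = [6.5, 6.0, 5.8, 5.1, 4.9]

b0, b1, s_yy_given_x = least_squares_fit(ages, diameters)
print(b0, b1, s_yy_given_x)
```

The shortcut formulas can lose precision when the values are large compared with their spread. Statistical software typically works with deviations from the means instead.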
While it’s possible to do this for our problem of pupil diameter versus age with just the use of a calculator… there are too many steps and we are likely to make errors. We’ll give this to the Minitab function Stat > Regression > Regression.
The Minitab output is extensive, but from it we find
Regression Analysis: DP0OD versus AGE
The regression equation is DP0OD = 7.27 - … AGE
This is called the fitted regression equation. This identifies for us $b_0 = 7.27$ and $b_1$ = ….
Here is a reprise of the scatterplot, now shown with the fitted regression line. This was made in Minitab with Stat > Regression > Fitted Line Plot. This has reported also $s_\varepsilon$ = …, the estimate of σ.
It’s important to distinguish population quantities from sample quantities. The process of regression is not simply “numbers in, numbers out.”
The simple linear regression model is $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$. If you are asked to graph the line $Y = \beta_0 + \beta_1 x$... Please refuse! You cannot graph this line because $\beta_0$ and $\beta_1$ are unknown population parameters.
With data, you will get the estimates $b_0$ and $b_1$. The fitted regression line is $\hat{Y} = b_0 + b_1 x$. The “hat” on $\hat{Y}$ is helpful, but it’s a typesetting nuisance. The fitted line is often given without the “hat.”
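Once $b_0$ and $b_1$ are in hand, the fitted line is an ordinary function that can be evaluated and graphed. In this sketch the estimates and the five data pairs are hypothetical. The estimates are the least squares values for these made-up points, not the Minitab results for the eye data.

```python
def fitted_values(b0, b1, xs):
    """Evaluate the fitted line b0 + b1 * x at each x."""
    return [b0 + b1 * x for x in xs]

# Hypothetical data and estimates (NOT the study data or the Minitab output)
ages = [20, 30, 40, 50, 60]
diameters = [6.5, 6.0, 5.8, 5.1, 4.9]
b0, b1 = 7.30, -0.041

y_hat = fitted_values(b0, b1, ages)

# Residual = observed value minus fitted value
residuals = [y - yh for y, yh in zip(diameters, y_hat)]

for x, y, yh, e in zip(ages, diameters, y_hat, residuals):
    print(x, y, round(yh, 2), round(e, 2))
```

The residuals here are computed from estimates, so they are sample quantities. They stand in for the unobservable ε’s but are not the same thing. For a least squares line with an intercept they always sum to zero, apart from rounding.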
For the pupil diameter problem, the fitted line is $\hat{Y}$ = 7.27 - … AGE. The interpretation of the slope is... that each year of age is associated with a reduction of … mm in pupil diameter. The interpretation of 7.27 is... to be avoided. It’s tempting to say that it’s an assessment of pupil diameter at birth. The data set did not have anyone younger than 18, so we won’t force an interpretation.
The estimate of the noise standard deviation was calculated as $s_\varepsilon$ = …. This is about 0.83 mm, which is rather large for this context. What are we to make of this large value? This is saying that AGE is far from a perfect predictor of pupil diameter.
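The slides have not yet said how the estimate of σ is computed. The usual definition in simple linear regression divides the residual sum of squares by n − 2 and takes the square root. Here is a sketch under that assumption, using made-up data and estimates rather than the eye data.

```python
import math

def noise_sd(xs, ys, b0, b1):
    """Estimate sigma as sqrt(residual sum of squares / (n - 2))."""
    n = len(xs)
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # n - 2 because two parameters (intercept and slope) were estimated
    return math.sqrt(rss / (n - 2))

# Hypothetical data with their least squares estimates (NOT the study data)
ages = [20, 30, 40, 50, 60]
diameters = [6.5, 6.0, 5.8, 5.1, 4.9]
b0, b1 = 7.30, -0.041

s = noise_sd(ages, diameters, b0, b1)
print(round(s, 4))
```

The result is in the same units as Y, so it can be read as a typical size for the vertical miss of a point from the fitted line. That is why a value of about 0.83 mm can be judged large or small against typical pupil diameters.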
We still have to decide
* Is there an objective way to decide if this whole activity was worth doing?
* Is there an objective way to decide if the model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ was a good choice?