# Machine Learning and Data Mining Linear regression

## Presentation on theme: "Machine Learning and Data Mining Linear regression"— Presentation transcript:

Machine Learning and Data Mining Linear regression
+ Machine Learning and Data Mining Linear regression (adapted from) Prof. Alexander Ihler Linear regression is a simple and useful technique for prediction, and will help us illustrate many important principles of machine learning. TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAA

Supervised learning Notation Features x Targets y Predictions ŷ
Parameters q Learning algorithm Change θ Improve performance Program (“Learner”) Characterized by some “parameters” θ Procedure (using θ) that outputs a prediction Training data (examples) Features Recall our supervised learning paradigm – we have some observed features “x” which we would like to use to predict a target value “y”. We create a flexible function, called the “learner”, whose input/output relationships can be adjusted through parameters theta. Then, a collection of training examples are used to fit the parameters until the learner is able to make good predictions on those examples. Feedback / Target values Score performance (“cost function”)

Linear regression Define form of function f(x) explicitly
“Predictor”: Evaluate line: return r Define form of function f(x) explicitly Find a good f(x) within that family 40 Target y 20 In linear regression, our learner consists of an explicit functional form f(x) : in other words, we define a family of functions parameterized by theta, and any particular value of theta defines a member of that family. Learning then finds a good member within that family (good value of theta) using the training dataset D 10 Feature x 20 (c) Alexander Ihler

More dimensions? x1 x2 y y x1 x2 (c) Alexander Ihler
10 20 30 40 22 24 26 10 20 30 40 22 24 26 x1 x2 y y x1 In a multidimensional feature space, our linear regression remains a linear function – so, a plane for two features, a hyperplane for more. Otherwise, everything remains essentially the same. x2 (c) Alexander Ihler

Notation Define “feature” x0 = 1 (constant) Then (c) Alexander Ihler
Some quick notation – our prediction yhat is a linear function, theta0 + theta1 times feature 1, plus theta2 times feature 2, etc. For notational compactness, let’s define a “feature 0” that is always value 1; then yhat is simply the vector product between a parameter vector theta and a feature vector x. (c) Alexander Ihler

Measuring error Error or “residual” Observation Prediction
We also need to define what it means to predict well. Define the error residual e(x) to be the (signed) error between the true or desired target y and the predicted value yhat If our prediction is good, these errors will be small on average; 20 (c) Alexander Ihler

Mean squared error How can we quantify the error?
Could choose something else, of course… Computationally convenient (more later) Measures the variance of the residuals Corresponds to likelihood under Gaussian model of “noise” We can measure their average squared magnitude, for example – the MSE. We’ll define a cost function J(theta) that measures how good of a fit any particular setting of theta is, by summing up the squared error residuals over the training data. While we could easily choose a different cost function J (and we will discuss some in another lecture), MSE is a common choice since it is computationally convenient, and corresponds to a Gaussian model on the “noise” that we are unable to predict; if this “noise” is the sum of many small influences, the central limit theorem suggests that such a Gaussian model will be a good choice. (c) Alexander Ihler

MSE cost function Rewrite using matrix form
Since we’re using Matlab and its vector and matrix operators, let’s rewrite J in a more convenient vector-and-matrix form… (Matlab) >> e = y’ – th*X’; J = e*e’/m; (c) Alexander Ihler

Visualizing the cost function
J(θ) θ 0 θ1 The cost function J(theta) is a function of the parameter values theta, and we can plot it in “parameter space”. Suppose we have the linear function theta0 + theta1 x : two parameters, so J(.) is a 3D plot. This is difficult to look at, so we can plot the value of J as a color, with blue being small values and red large. Even this is inconvenient to plot, so we’ll often just visualize the contours, or isosurfaces, of J – for a number of different values, we find the set of points where J equals that value, and plot that set with a red line. Then collapsing back to 2D, we can plot these contours, resulting in a “topographical map” of the function J. (c) Alexander Ihler

Supervised learning Notation Features x Targets y Predictions ŷ
Parameters q Learning algorithm Change θ Improve performance Program (“Learner”) Characterized by some “parameters” θ Procedure (using θ) that outputs a prediction Training data (examples) Features So now that we have chosen a form for the learner and a cost function, “learning” for linear regression comes down to selecting a value of theta that scores well. Feedback / Target values Score performance (“cost function”)

Finding good parameters
Want to find parameters which minimize our error… Think of a cost “surface”: error residual for that θ… Now, “learning” has been converted into a basic optimization problem – find the values of theta that minimize J(theta), and in the next segments, we can explore several methods of doing so. (c) Alexander Ihler

+ Machine Learning and Data Mining Linear regression: direct minimization (adapted from) Prof. Alexander Ihler For linear regression with the MSE loss, it turns out that we can also compute the optimal values of theta in closed form.

MSE Minimum Consider a simple problem Can solve this system directly:
One feature, two data points Two unknowns: µ0, µ1 Two equations: Can solve this system directly: Consider the following simple problem, in which we have only two data points to fit a line. Our two data points gives us two equations with only two unknowns, and we can just solve the set of linear equations. As you may recall from linear algebra, in matrix form this is simply the matrix inverse of Xtranspose. More generally of course, X is not square & thus not invertible, but the equivalent is to directly solve for the stationary point. However, most of the time, m > n There may be no linear function that hits all the data exactly Instead, solve directly for minimum of MSE function (c) Alexander Ihler

SSE Minimum Reordering, we have
Setting the gradient to zero, we have another linear algebraic equation. We can distribute in X, then move theta-Xt-X to the other side. XtX is square; assuming it is full-rank, we can solve by multlplying by its inverse. The quantity X(XtX)inv is called the moore-penrose pseudo inverse of X-transpose. If Xt is square and full rank, it is exactly equal to Xt-inverse. In most practical situations, we have more data than features, so m > n, in which case the pseudoinverse gives the minimum MSE fit. X (XT X)-1 is called the “pseudo-inverse” If XT is square and independent, this is the inverse If m > n: overdetermined; gives minimum MSE fit (c) Alexander Ihler

Matlab SSE This is easy to solve in Matlab… % y = [y1 ; … ; ym]
% X = [x1_0 … x1_m ; x2_0 … x2_m ; …] % Solution 1: “manual” th = y’ * X * inv(X’ * X); Again, matlab gives us a compact and simple implementation of these linear algebra operations. Defining y and X as our target vector and feature matrix, we can either compute the pseudo inverse manually, like so Or, we can use Matlab’s built in “matrix-right-divide” operation, an operator that explicitly solves the set of linear equations defined by y=theta*Xtranspose. While this may look less obviously equivalent to our derivation, it lets Matlab use more numerically stable solution techniques. % Solution 2: “mrdivide” th = y’ / X’; % th*X’ = y => th = y/X’ (c) Alexander Ihler

Effects of MSE choice Sensitivity to outliers
2 4 6 8 10 12 14 16 18 20 16 2 cost for this one datum Heavy penalty for large errors -20 -15 -10 -5 5 1 2 3 4 MSE is a useful and widely used loss function. One very nice aspect is that we can solve it relatively easily in closed form. However, MSE is not always a good choice. Take the following data, say, y = house price where our feature x = house size. Suppose we add a data point that does not follow the standard relationship – perhaps we have entered its value incorrectly, or perhaps the house owner simply sold to a relative. Because it is squared, reducing an error of 16 is *much* more important than minimizing, say, 16 smaller errors of size 1. Viewed another way, the distribution of errors is highly non-Gaussian, so perhaps the quadratic loss is not a great modeling choice. (c) Alexander Ihler

L1 error L2, original data L1, original data L1, outlier data
18 L1, original data 16 L1, outlier data 14 12 10 8 A different loss function, like minimum absolute error (or L1 error) can have very different behavior. If we minimize MAE on the original data, our solution is not very different than MSE. But, it is more stable to the outlier. Adding up the absolute values of the distance, means that the outlier’s contribution is now only 16 – so it has equal weight to 16 errors of size 1. So the solution will be distorted less by such an outlier. 6 4 2 2 4 6 8 10 12 14 16 18 20 (c) Alexander Ihler

Cost functions for regression
(MSE) (MAE) Something else entirely… (???) Here’s a plot showing the loss as a function of error magnitude… MSE increases quadratically, so that very small errors are unimportant but outliers or extreme errors have a very large impact. MAE increases only linearly, so large, outlier errors have less impact. If we really want to ignore large errors, we could design an alterative loss that accounts for this – for example, this one, which looks quadratic for small error values, but as error becomes large, it asymptotes to a constant – so that no matter how wrong our prediction on any data point, the loss is bounded. Of course, more arbitrary loss functions means that we can no longer use a closed form solution; but we could still use methods like gradient descent. “Arbitrary” functions can’t be solved in closed form… - use gradient descent (c) Alexander Ihler

Machine Learning and Data Mining Linear regression: nonlinear features
+ Machine Learning and Data Mining Linear regression: nonlinear features (adapted from) Prof. Alexander Ihler Let’s now look at the impact of features on linear regression, and how nonlinear features can extend our class of predictors.

Nonlinear functions What if our hypotheses are not lines?
Ex: higher-order polynomials So, what if we want to learn regression functions that are not lines? For example, a polynomial regression function, that fits a cubic function of x? (c) Alexander Ihler

Nonlinear functions Single feature x, predict target y:
Sometimes useful to think of “feature transform” Add features: Linear regression in new features From this viewpoing, we want to use a scalar feature x to predict a target y, using a cubic polynomial of x. But let’s just pretend instead that we observed three features, instead of one. Denote feature two by x^2, and set its value to the square of the first feature; and feature 3 to x cubed. Rewriting our predictor, we see that it is now just a linear regression in the new features – so we can solve it using the same technique! It’s often helpful to think of this feature augmentation as a “feature transform” Phi, that converts our “input features” into a new set of features that will be used by our algorithm. To avoid confusion, we will sometimes explicitly reference such a transform Phi -- then, linear regression is linear in theta and in Phi(x). (c) Alexander Ihler

Higher-order polynomials
Fit in the same way More “features” Can directly fit using the same framework, and plot the prediction theta Phi(x) For example, here Phi(x) = [1 x], Phi(x) = [1 x x^2], … In the prediction step, we just map x -> [x1 x2 x3] before predicting.

Features In general, can use any features we think are useful
Other information about the problem Sq. footage, location, age, … Polynomial functions Features [1, x, x2, x3, …] Other functions 1/x, sqrt(x), x1 * x2, … “Linear regression” = linear in the parameters Features we can make as complex as we want! More generally, we can use any features that we might like. For example, in addition to collecting more information about each data example, we can use whatever nonlin transform Phi we think would be useful… Thus, when we say “linear regression”, we usually *mean* linear in the parameters theta, even if we may have used transformations to learn a function that is not actually linear in the input features x. Useful transforms, then, will give us *features* that are linearly predictive of the target y. (c) Alexander Ihler

Higher-order polynomials
Are more features better? “Nested” hypotheses 2nd order more general than 1st, 3rd order “ “ than 2nd, … Fits the observed data better Once we are able to generate more features, such as polynomials, should we? Where should we stop? Notice that increasing the polynomial degree of our fit creates a nested sequence of models – for example, out of all lines, the constant predictor is one possibility – so the best line will always fit as well or better than the best constant. Similarly, all cubics include all lines, so a cubic fit will be as good or better than the best line. We get an improvement in fit quality with increasing features. At an extreme, given enough features, we will fit the data exactly (N eqs, N unkn)

Overfitting and complexity
More complex models will always fit the training data better But they may “overfit” the training data, learning complex relationships that are not really present X Y Simple model X Y Complex model But, this is not such a good thing – remember our example, where we could explain the data with a line plus noise, or as a complex function that hits all the data. Although more complex models will always fit the training examples better, they may not preform as well in the future. (c) Alexander Ihler

Test data After training the model
Go out and get more data from the world New observations (x,y) How well does our model perform? Training data New, “test” data Once we observe new data, we may discover that the complex model performs worse – overfitting! One way to assess this is through holding out validation data, or using cross-validation. We explicitly hide some data (the green points) from our learning algorithm, then use these points to estimate its performance on future data. (c) Alexander Ihler

Training versus test error
Plot MSE as a function of model complexity Polynomial order Decreases More complex function fits training data better What about new data? 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 10 15 20 25 30 Training data New, “test” data 0th to 1st order Error decreases Underfitting Higher order Error increases Overfitting On these data, for example, if we plot performance as a function of the polynomial degree fit, we find that In training data, the error decreases steadily toward zero with increasing order But, in hold out data, the error decreases from constant model to linear model (indicating that this more complex model does, in fact, help predict in the future), but increasing the degree beyond this does not help, and even begins to hurt, performance. Mean squared error Polynomial order (c) Alexander Ihler