Machine Learning and Data Mining Regression Problem


1 Machine Learning and Data Mining Regression Problem
Machine Learning and Data Mining: Regression (adapted from slides by Prof. Alexander Ihler). Linear regression is a simple and useful technique for prediction, and will help us illustrate many important principles of machine learning.

2 Overview
Regression problem definition and parameters ϴ
Prediction using ϴ as parameters
Measuring the error
Finding good parameters ϴ (direct minimization)
Non-linear regression
(c) Alexander Ihler

3 Example of Regression: Vehicle price estimation problem
Features x: fuel type, number of doors, engine size. Targets y: price of the vehicle.

Training data (used to fit the model):
Fuel type | # of doors | Engine size (61–326) | Price
1         | 4          | 70                   | 11000
1         | 2          | 120                  | 17000
2         | 2          | 310                  | 41000

Testing data (used to evaluate the model):
Fuel type | # of doors | Engine size (61–326) | Price
2         | 4          | 150                  | 21000

Recall our supervised learning paradigm: we have some observed features x which we would like to use to predict a target value y. We create a flexible function, called the "learner", whose input/output relationship can be adjusted through parameters theta. Then, a collection of training examples is used to fit the parameters until the learner is able to make good predictions on those examples.

4 Example of Regression: Vehicle price estimation problem
ϴ = [-300, -200, 700, 130] (bias, fuel type, # of doors, engine size)
Ex #1: -300 - 200*1 + 700*4 + 130*70 = 11400 (actual 11000, error 400)
Ex #2: -300 - 200*1 + 700*2 + 130*120 = 16500 (actual 17000, error 500)
Ex #3: -300 - 200*2 + 700*2 + 130*310 = 41000 (actual 41000, error 0)
Mean absolute training error = 1/3 * (400 + 500 + 0) = 300
Test: -300 - 200*2 + 700*4 + 130*150 = 21600 (actual 21000, error 600)
Mean absolute testing error = 1/1 * (600) = 600

Training data:
Fuel type | # of doors | Engine size | Price
1         | 4          | 70          | 11000
1         | 2          | 120         | 17000
2         | 2          | 310         | 41000

Testing data:
Fuel type | # of doors | Engine size (61–326) | Price
2         | 4          | 150                  | 21000
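The arithmetic above can be checked with a short script (Python/NumPy here for illustration; the deck itself uses Matlab):

```python
import numpy as np

# Parameter vector theta = [bias, fuel type, # of doors, engine size]
theta = np.array([-300, -200, 700, 130])

# Training rows: [1 (bias), fuel type, # of doors, engine size] and target prices
X_train = np.array([[1, 1, 4,  70],
                    [1, 1, 2, 120],
                    [1, 2, 2, 310]])
y_train = np.array([11000, 17000, 41000])

pred = X_train @ theta                        # predictions: 11400, 16500, 41000
mae_train = np.mean(np.abs(pred - y_train))   # mean absolute training error: 300

# Held-out test example
x_test, y_test = np.array([1, 2, 4, 150]), 21000
mae_test = abs(x_test @ theta - y_test)       # mean absolute testing error: 600
```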

5 Supervised learning
Notation: features x (input variables), targets y (output variables), predictions ŷ, parameters θ.
Error = distance between y and ŷ.
Learning algorithm: change θ to improve performance.
Program ("learner"): characterized by some "parameters" θ; a procedure (using θ) that outputs a prediction from the features.
Training data (examples): features together with feedback / target values, used for evaluation of the model (measuring the error).

6 Overview
Regression problem definition and parameters ϴ
Prediction using ϴ as parameters
Measuring the error
Finding good parameters ϴ (direct minimization)
Non-linear regression
(c) Alexander Ihler

7 Linear regression
Define the form of the function f(x) explicitly, then find a good f(x) within that family.
"Predictor": evaluate the line ŷ = ϴ0 + ϴ1 * X1 and return ŷ.
ŷ = predicted target value (the black line in the plot). Example: a new instance with X1 = 8 gets predicted value 17 from the fitted line.
In linear regression, our learner consists of an explicit functional form f(x): in other words, we define a family of functions parameterized by theta, and any particular value of theta defines a member of that family. Learning then finds a good member within that family (a good value of theta) using the training dataset D.
(c) Alexander Ihler
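A minimal sketch of the predictor, in Python for illustration. The coefficients below (theta0 = 1, theta1 = 2) are assumed values chosen only so that X1 = 8 maps to the slide's prediction of 17; the slide does not state its actual fitted coefficients:

```python
def predict(x1, theta0=1.0, theta1=2.0):
    """Evaluate the line y-hat = theta0 + theta1 * x1 and return y-hat.

    theta0 and theta1 are illustrative, not taken from the slides.
    """
    return theta0 + theta1 * x1

predict(8)  # -> 17.0, matching the slide's example for X1 = 8
```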

8 More dimensions?
In a multidimensional feature space, our linear regression remains a linear function: a plane for two features (x1, x2), a hyperplane for more. Otherwise, everything remains essentially the same.
(c) Alexander Ihler

9 Notation
Define "feature" x0 = 1 (constant). Then
ŷ = ϴ0*x0 + ϴ1*x1 + ... + ϴn*xn = ϴ · x, where n is the number of features in the dataset.
ŷ is a (hyper)plane in (n+1)-dimensional space.
Some quick notation: our prediction ŷ is a linear function, theta0 plus theta1 times feature 1, plus theta2 times feature 2, etc. For notational compactness, define a "feature 0" that always has value 1; then ŷ is simply the vector product between a parameter vector theta and a feature vector x.
(c) Alexander Ihler
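The x0 = 1 trick can be sketched as follows (Python/NumPy for illustration; `add_bias` is a hypothetical helper name, not from the slides):

```python
import numpy as np

def add_bias(X):
    """Prepend the constant feature x0 = 1 to every row of X."""
    X = np.atleast_2d(X)
    return np.hstack([np.ones((X.shape[0], 1)), X])

theta = np.array([1.0, 2.0, 3.0])   # [theta0, theta1, theta2]
X = np.array([[4.0, 5.0]])          # one instance with two features

# Prediction is now a single vector product: theta0*1 + theta1*4 + theta2*5 = 24
y_hat = add_bias(X) @ theta
```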

10 Overview
Regression problem definition and parameters ϴ
Prediction using ϴ as parameters
Measuring the error
Finding good parameters ϴ (direct minimization)
Non-linear regression
(c) Alexander Ihler

11 Supervised learning
Notation: features x (input variables), targets y (output variables), predictions ŷ, parameters θ.
Error = distance between y and ŷ.
Learning algorithm: change θ to improve performance.
Program ("learner"): characterized by some "parameters" θ; a procedure (using θ) that outputs a prediction from the features.
Training data (examples): features together with feedback / target values, used for evaluation of the model (measuring the error).

12 Measuring error
Error or "residual": the gap between observation and prediction.
Red points = real target values; black line = ŷ (predicted values), ŷ = ϴ0 + ϴ1 * X; blue lines = errors, the difference between the real value y and the predicted value ŷ.
We also need to define what it means to predict well. Define the error residual e(x) to be the (signed) error between the true or desired target y and the predicted value ŷ. If our prediction is good, these errors will be small on average.
(c) Alexander Ihler

13 Mean Squared Error
How can we quantify the error? Let y be the real target value in the dataset, ŷ the target value predicted using ϴ and x, and m the number of data instances.
Training error: J(ϴ) = (1/m) Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)², where m is the number of training instances.
Testing error: the same quantity computed on held-out data to check the predicted values, where m is the number of testing instances.
We can measure the errors' average squared magnitude, for example: the MSE. We'll define a cost function J(theta) that measures how good a fit any particular setting of theta is, by summing up the squared error residuals over the training data. While we could easily choose a different cost function J (and we will discuss some in another lecture), MSE is a common choice since it is computationally convenient, and corresponds to a Gaussian model on the "noise" that we are unable to predict; if this "noise" is the sum of many small influences, the central limit theorem suggests that such a Gaussian model will be a good choice.
(c) Alexander Ihler
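A minimal MSE implementation (Python/NumPy for illustration), applied to the vehicle example's training residuals of 400, 500, and 0:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: (1/m) * sum((y_hat - y)^2)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean((y_hat - y) ** 2)

# Residuals are 400, 500, 0, so J = (400^2 + 500^2 + 0^2) / 3
J = mse([11000, 17000, 41000], [11400, 16500, 41000])
```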

14 MSE cost function
Rewrite using matrix form: with X the matrix of input variables in the dataset (m instances by n+1 features, including x0 = 1) and y the vector of output variables,
J(ϴ) = (1/m) (Xϴ − y)ᵀ (Xϴ − y)
Since we're using Matlab and its vector and matrix operators, let's rewrite J in a more convenient vector-and-matrix form:
(Matlab) >> e = y' - th*X'; J = e*e'/m;
(c) Alexander Ihler
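A NumPy equivalent of the Matlab snippet, written in the column-vector convention (the small dataset below is made up for illustration and happens to be fit exactly, so J = 0):

```python
import numpy as np

X = np.array([[1., 1.],    # rows: [x0 = 1, x1]
              [1., 2.],
              [1., 3.]])
y = np.array([2., 3., 4.])
theta = np.array([1., 1.])  # y = 1 + x1 fits these data exactly

e = X @ theta - y           # vector of error residuals
J = (e @ e) / len(y)        # MSE in matrix form: (1/m) * e^T e
```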

15 Visualizing the error function
J is the error function; the surface plotted is the value of J, not a plane fitted to the output values. The dimensions are ϴ0 and ϴ1 instead of X1 and X2, and the output is J instead of the target value y.
The cost function J(theta) is a function of the parameter values theta, and we can plot it in "parameter space". Suppose we have the linear function theta0 + theta1*x: two parameters, so J(.) is a 3D plot. This is difficult to look at, so we can plot the value of J as a color, with blue being small values and red large. Even this is inconvenient to plot, so we'll often just visualize the contours, or isosurfaces, of J: for a number of different values, we find the set of points where J equals that value, and plot that set with a red line. Then, collapsing back to 2D, we can plot these contours, resulting in a "topographical map" of the function J.
In the 2D representation, inner red circles have smaller values of J and outer red circles have larger values.
(c) Alexander Ihler

16 Overview
Regression problem definition and parameters ϴ
Prediction using ϴ as parameters
Measuring the error
Finding good parameters ϴ (direct minimization)
Non-linear regression
(c) Alexander Ihler

17 Supervised learning
Notation: features x, targets y, predictions ŷ, parameters θ.
Learning algorithm: change θ to improve performance.
Program ("learner"): characterized by some "parameters" θ; a procedure (using θ) that outputs a prediction from the features.
Training data (examples): features together with feedback / target values, used for evaluation of the model (measuring the error).
So now that we have chosen a form for the learner and a cost function, "learning" for linear regression comes down to selecting a value of theta that scores well.

18 Finding good parameters
We want to find parameters which minimize our error. Think of a cost "surface": the error residual for each value of θ.
Now "learning" has been converted into a basic optimization problem: find the values of theta that minimize J(theta). In the next segments we explore several methods of doing so.
(c) Alexander Ihler

19 MSE Minimum (m ≤ n+1)
Here m is the number of data instances and n the number of features.
Consider a simple problem: one feature, two data points. Two unknowns and two equations (n+1 = 1+1 = 2, m = 2), so we can solve the system Xϴ = y directly.
Consider this simple problem, in which we have only two data points to fit a line. Our two data points give us two equations with only two unknowns, and we can just solve the set of linear equations; as you may recall from linear algebra, in matrix form this is simply the matrix inverse of Xᵀ. More generally, of course, X is not square and thus not invertible, but the equivalent is to directly solve for the stationary point. ϴ then gives a line or plane that fits all target values exactly.
(c) Alexander Ihler

20 SSE Minimum (m > n+1)
Most of the time m > n+1, so there may be no linear function that hits all the data exactly (e.g. n+1 = 2 unknowns but m = 3 equations). The minimum of a function has gradient equal to zero (the tangent is flat), so we set ∇J(ϴ) = 0. Reordering, we have the normal equations:
XᵀX ϴ = Xᵀ y  ⇒  ϴ = (XᵀX)⁻¹ Xᵀ y
Setting the gradient to zero gives another linear algebraic equation. We can distribute in X, then move the ϴᵀXᵀX term to the other side. XᵀX is square; assuming it is full rank, we can solve by multiplying by its inverse. The quantity X(XᵀX)⁻¹ is called the Moore-Penrose pseudoinverse of Xᵀ; if Xᵀ is square and full rank, it is exactly equal to (Xᵀ)⁻¹. In most practical situations we have more data than features, so m > n+1, in which case the pseudoinverse gives the minimum-MSE fit. We just need to know how to compute the parameters.
(c) Alexander Ihler
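The normal-equations solution can be sketched as follows (Python/NumPy for illustration, with three made-up points that no single line fits exactly):

```python
import numpy as np

# m = 3 instances, n = 1 feature plus the bias column x0 = 1
X = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])
y = np.array([0., 1., 1.])

# Solve the normal equations X^T X theta = X^T y,
# i.e. theta = (X^T X)^-1 X^T y (assuming X^T X is full rank)
theta = np.linalg.solve(X.T @ X, X.T @ y)
# theta = [1/6, 1/2]: the minimum-MSE line for these three points
```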

21 Effects of the Mean Squared Error choice
Outlier data: an outlier is an observation that lies an abnormal distance from the other values. A single outlier with residual 16 contributes a 16² cost: a heavy penalty for large errors, which drags the line away from the other points.
MSE is a useful and widely used loss function. One very nice aspect is that we can solve it relatively easily in closed form. However, MSE is not always a good choice. Take the following data, say, y = house price where our feature x = house size. Suppose we add a data point that does not follow the standard relationship; perhaps we have entered its value incorrectly, or perhaps the house owner simply sold to a relative. Because it is squared, reducing an error of 16 is *much* more important than minimizing, say, 16 smaller errors of size 1. Viewed another way, the distribution of errors is highly non-Gaussian, so perhaps the quadratic loss is not a great modeling choice.
(c) Alexander Ihler

22 Absolute error
Plot legend: MSE, original data; absolute error, original data; absolute error, outlier data.
A different loss function, like minimum absolute error (or L1 error), can have very different behavior. If we minimize MAE on the original data, our solution is not very different from MSE. But it is more stable to the outlier: adding up the absolute values of the distances means that the outlier's contribution is now only 16, so it has equal weight to 16 errors of size 1. The solution will be distorted less by such an outlier.
(c) Alexander Ihler
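The comparison in the note, one residual of 16 versus sixteen residuals of 1, can be made concrete:

```python
# Under squared loss, the single outlier dominates:
sq_outlier = 16 ** 2            # 256
sq_sixteen_small = 16 * 1 ** 2  # 16

# Under absolute loss, the outlier has the same weight as 16 errors of size 1:
abs_outlier = abs(16)           # 16
abs_sixteen_small = 16 * 1      # 16
```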

23 Error functions for regression
Mean Squared Error, Mean Absolute Error, or something else entirely...
Here's a plot showing the loss as a function of error magnitude. MSE increases quadratically, so that very small errors are unimportant but outliers or extreme errors have a very large impact. MAE increases only linearly, so large, outlier errors have less impact. If we really want to ignore large errors, we could design an alternative loss that accounts for this: for example, one which looks quadratic for small error values but asymptotes to a constant as the error becomes large, so that no matter how wrong our prediction on any data point, the loss is bounded. Of course, more arbitrary loss functions mean that we can no longer use a closed-form solution; but we could still use methods like gradient descent.
"Arbitrary" error functions can't be solved in closed form, so as an alternative we use gradient descent.
(c) Alexander Ihler

24 Overview
Regression problem definition and parameters ϴ
Prediction using ϴ as parameters
Measuring the error
Finding good parameters ϴ (direct minimization)
Non-linear regression
(c) Alexander Ihler

25 Nonlinear functions
Single feature x, predict target y with a cubic polynomial: ŷ = ϴ0 + ϴ1 x + ϴ2 x² + ϴ3 x³.
It is sometimes useful to think of a "feature transform" that converts a non-linear function into a linear one, which we can then solve. Add the features x² and x³: the model becomes linear regression in the new features.
From this viewpoint, we want to use a scalar feature x to predict a target y using a cubic polynomial of x. But let's just pretend that we observed three features instead of one: denote feature two by x², setting its value to the square of the first feature, and feature three by x³. Rewriting our predictor, we see that it is now just a linear regression in the new features, so we can solve it using the same technique! It's often helpful to think of this feature augmentation as a "feature transform" Phi that converts our "input features" into a new set of features used by our algorithm. To avoid confusion, we will sometimes explicitly reference such a transform Phi; then, linear regression is linear in theta and in Phi(x).
(c) Alexander Ihler
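A sketch of the cubic feature transform (Python/NumPy for illustration; the data are made up to come from y = x³, so ordinary linear regression on the transformed features recovers the cubic exactly):

```python
import numpy as np

def poly_features(x, degree=3):
    """Feature transform Phi(x) = [1, x, x^2, ..., x^degree] for each point."""
    x = np.asarray(x, float)
    return np.vander(x, degree + 1, increasing=True)

# Data from y = x^3: non-linear in x, but linear in the transformed features
x = np.array([-1., 0., 1., 2.])
Phi = poly_features(x)

# Plain least-squares linear regression on Phi(x)
theta, *_ = np.linalg.lstsq(Phi, x ** 3, rcond=None)
# theta ~ [0, 0, 0, 1]: the cubic's coefficients are recovered
```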

26 Higher-order polynomials
Are more features better? "Nested" hypotheses: 2nd order is more general than 1st, 3rd order more general than 2nd, and so on, so higher orders fit the observed data better.
1st order: ŷ = ϴ0 + ϴ1 X
3rd order: ŷ = ϴ0 + ϴ1 X + ϴ2 X² + ϴ3 X³
(The figure also shows an 18th-order fit.)
Once we are able to generate more features, such as polynomials, should we? Where should we stop? Notice that increasing the polynomial degree of our fit creates a nested sequence of models: for example, out of all lines, the constant predictor is one possibility, so the best line will always fit as well as or better than the best constant. Similarly, all cubics include all lines, so a cubic fit will be as good as or better than the best line. We get an improvement in fit quality with increasing features. At an extreme, given enough features, we will fit the data exactly (N equations, N unknowns).

27 Test data
After training the model, go out and get more data from the world: new observations (x, y). How well does our model perform on the training data versus the new, "test" data?
Once we observe new data, we may discover that the complex model performs worse: overfitting! One way to assess this is by holding out validation data, or using cross-validation. We explicitly hide some data (the green points) from our learning algorithm, then use these points to estimate its performance on future data.
(c) Alexander Ihler

28 Training versus test error
Plot MSE as a function of model complexity (polynomial order). Training error decreases: a more complex function fits the training data better. What about new data?
From 0th to 2nd order, error decreases on both training and test data (underfitting); at higher orders, test error increases (overfitting).
On these data, for example, if we plot performance as a function of the polynomial degree fit, we find that in the training data the error decreases steadily toward zero with increasing order. But in the held-out data, the error decreases from the constant model to the linear model (indicating that this more complex model does, in fact, help predict in the future), while increasing the degree beyond this does not help, and even begins to hurt, performance.
(c) Alexander Ihler
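The claim that training error can only improve with polynomial order can be demonstrated directly (Python/NumPy for illustration; the data below are made up, a line plus a small deterministic wiggle):

```python
import numpy as np

# Made-up data: an underlying line y = 1 + 2x with a small deterministic wiggle
x = np.linspace(0.0, 1.0, 10)
y = 1 + 2 * x + 0.1 * np.sin(25 * x)

def train_mse(order):
    """Fit a polynomial of the given order and return its training MSE."""
    coef = np.polyfit(x, y, order)
    return np.mean((np.polyval(coef, x) - y) ** 2)

# Nested hypotheses: each order contains the previous one,
# so training error never increases as the order grows
errors = [train_mse(k) for k in range(7)]
```

Test error on fresh data would show the opposite trend at high orders, but that depends on the particular noise and is not asserted here.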

29 Summary
Regression problem definition: vehicle price estimation.
Prediction using ϴ.
Measuring the error: the difference between y and ŷ, e.g. absolute error or MSE.
Direct minimization: two cases, m ≤ n+1 and m > n+1.
Non-linear regression: finding the best nth-order polynomial for each problem (neither overfitting nor underfitting).

