Regression in 1D: Fit a line to data by minimizing squared residuals.


1 Regression in 1D: Fit a line to data by minimizing squared residuals

2 Overview: calculating estimates of slope and intercept; outliers, high-leverage, and influential data points; inference in regression (coefficient of determination, standard error of the estimate, t-test for the relation between x and y, confidence interval on the slope, confidence interval on a y prediction given x); verifying the assumptions required for inference; transformations to linearize data; examples

3 Regression: estimating functional relationships between predictors and response
Assume a particular relationship exists (a line, for example). Find the parameters of the assumed functional form by minimizing in-sample error (the sum of squared residuals, for example). This separates the predictor-response relationship from noise in the data. How much confidence can I have in my results? (inference) This requires additional assumptions about the noise in the data.

4 Example: fit y = ax + b to m data points
Find the unknowns a and b that minimize the sum of squared residuals. What is the objective function to be minimized? What are the equations for a and b?

5 Example: fit ax + b to m data points
Find the values of a and b that minimize sum of squared residuals

6 Regression assignments: part 1
Write code for fitting a line to data using the results from the previous slide.
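A minimal sketch of such a fit in Python with NumPy (the function name and example data are illustrative, not part of the assignment):

```python
import numpy as np

def fit_line(x, y):
    """Fit y = a*x + b by minimizing the sum of squared residuals.

    Returns the slope a and intercept b from the closed-form
    least-squares solution.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    a = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b = y_bar - a * x_bar
    return a, b

# Example with made-up data:
# a, b = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 8.1])
```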

7 Fitting a line to data is an example of linear least squares:
Given a dataset {(tk, yk), k = 1,...,n} and a set of functions {fj(t), j = 1,...,m}, find the linear combination of the functions that best fits the data. Define the matrix A with akj = fj(tk) (jth function evaluated at the kth data point). Define the column vector b = [y1, y2,...,yn]T of response values and the column vector w = [w1, w2,...,wm]T of weights in the linear combination. Then fit = Aw is the value of the fit at each data point, and r = fit − b is the deviation between fit and data at each data point. Find the best choice of w by minimizing the sum of squared residuals between fit and data.

8 Normal Equations: let r = b − Aw and define f(w) = ||r||² = rTr
f(w) = (b − Aw)T(b − Aw) = bTb − 2wTATb + wTATAw. A necessary condition for w0 to be a minimum of f(w) is ∇f(w0) = 0, where ∇f is an m-vector whose components are the partial derivatives of f(w) with respect to the weights w. ∇f(w) = 2ATAw − 2ATb = 0, so the optimal set of weights is a solution of the m×m symmetric system ATAw = ATb, called the "normal" equations of the linear least squares problem.
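A small sketch of general linear least squares via the normal equations (assuming NumPy; the function name and the use of numpy.linalg.solve are choices made here, not prescribed by the slides):

```python
import numpy as np

def linear_least_squares(t, y, funcs):
    """Solve the normal equations A^T A w = A^T b for the weights w.

    t, y  : data points (t_k, y_k), k = 1..n
    funcs : list of callables f_j(t) whose linear combination is fit to y
    """
    t = np.asarray(t, dtype=float)
    b = np.asarray(y, dtype=float)
    A = np.column_stack([f(t) for f in funcs])   # a_kj = f_j(t_k)
    w = np.linalg.solve(A.T @ A, A.T @ b)        # normal equations
    residuals = A @ w - b
    return w, residuals

# Fitting a line corresponds to funcs = [lambda t: np.ones_like(t), lambda t: t]
```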

9 Example: Use the normal equations to fit a line
y = w1 x + w0 to n data points. What are the functions {fj(t), j = 1, 2} that determine the matrix elements in this case? Build the A matrix (akj = fj(tk), the jth function evaluated at the kth data point). Build the b vector (the column vector of measured y values). Construct and solve the normal equations.

10

11 Same set of equations as obtained by the objective-function method

12 Review: Regression in 1D:
Fit a line to data by minimizing squared residuals. Simple equations for the slope and intercept are derived by minimizing in-sample error. Two equivalent formulations: pure calculus, and calculus applied to the matrix formulation. The second method is easily generalized to higher dimensions and higher-degree polynomials.

13 Polynomial Regression: degree 1 with N data points
Review: 1D linear regression by the linear algebra approach. Polynomial regression of degree 1 with N data points: solve VTVw = VTy for w1 and w0. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

14 Simple example: distance hiked vs time (text p39)
Table (text p. 39) with columns: index, x = time, y = distance, fit = 6 + 2x, error, (error)². Sum of squared errors (SSE) = 12.

15 Coefficient of determination: Is Fit a better predictor than average?
SST = sum of squares total = Σ(yi − ȳ)², the squared deviation of the responses from their mean.

16 Sum of Squares Regression
Sum of Squares Regression (SSR): measures the variability from the mean response that is explained by the regression. Sum of Squares Error (SSE): measures the variability in y from all other sources (usually noise) after the linear relationship between x and y has been accounted for. SST = SSR + SSE follows from the identity Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)².

17 Coefficient of determination: r2= SSR/SST
As SSE -> 0, SSR -> SST and r² -> 1, which means a perfect fit. As SSR -> 0, r² -> 0 and the fit is no better than the average. r² is interpreted as the fraction of the response variation explained by the predictor. The correlation coefficient is r = ±sqrt(r²), with the same sign as b1. For the hiker example: SST = 228, SSE = 12, SSR = 216, r² ≈ 0.95. Add the coefficient of determination to your linear fit code.
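One way to add this to the fitting code, as a sketch (names are illustrative):

```python
import numpy as np

def r_squared(y, y_fit):
    """Coefficient of determination r^2 = SSR / SST = 1 - SSE / SST."""
    y, y_fit = np.asarray(y, float), np.asarray(y_fit, float)
    sse = np.sum((y - y_fit) ** 2)        # unexplained variability
    sst = np.sum((y - y.mean()) ** 2)     # total variability about the mean
    ssr = sst - sse                       # variability explained by the fit
    return ssr / sst
```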

18 The Standard Error of the Estimate
Mean Square Error: MSE = SSE / (n − m − 1), where m = number of predictors and n = number of observations. The Standard Error of the Estimate, s = sqrt(MSE), is a "typical" residual or error in estimation. From the distance vs. time hiker dataset, m = 1, n = 10, SSE = 12, so s = sqrt(12/8) ≈ 1.22 km: the linear regression estimate of hiking distance typically differs from the actual distance by about 1.2 km. Add the Standard Error of the Estimate to your linear fit subroutine. Discovering Knowledge in Data: Data Mining Methods and Models, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.
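A sketch of this calculation (function name illustrative):

```python
import numpy as np

def standard_error_of_estimate(y, y_fit, m=1):
    """s = sqrt(SSE / (n - m - 1)), a 'typical' residual.

    m is the number of predictors (1 for simple linear regression).
    """
    y, y_fit = np.asarray(y, float), np.asarray(y_fit, float)
    n = len(y)
    sse = np.sum((y - y_fit) ** 2)
    mse = sse / (n - m - 1)               # mean square error
    return np.sqrt(mse)
```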

19 ANOVA Table for Simple Linear Regression
Regression statistics are summarized in an Analysis of Variance (ANOVA) table, with m = total predictors and n = total observations. F is a test statistic used for "inference" (discussed later).
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F
Regression | SSR | m | MSR = SSR / m | F = MSR / MSE
Error (or Residual) | SSE | n − m − 1 | MSE = SSE / (n − m − 1) |
Total | SST | n − 1 | |

20 High Leverage Points Leverage hi for ith observation:
For simple regression, the leverage of the ith observation is hi = 1/n + (xi − x̄)² / Σj (xj − x̄)². As the distance of an x-value from the mean of the x-values increases, the leverage increases; 1/n ≤ leverage ≤ 1.0. Observations with leverage > 2(m + 1)/n or > 3(m + 1)/n are considered to have high leverage.

21 Standardized Residuals and Outliers
The standard error of the ith residual, with leverage hi, is si = s·sqrt(1 − hi). The standardized residual is (yi − ŷi) / (s·sqrt(1 − hi)). Generally, observations with |standardized residual| > 2 are flagged as outliers.
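A sketch of the leverage and standardized-residual calculations for simple regression (names illustrative):

```python
import numpy as np

def leverage(x):
    """Leverage h_i = 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2."""
    x = np.asarray(x, dtype=float)
    return 1.0 / len(x) + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

def standardized_residuals(x, y, y_fit, s):
    """Residual divided by its standard error s * sqrt(1 - h_i)."""
    h = leverage(x)
    resid = np.asarray(y, float) - np.asarray(y_fit, float)
    return resid / (s * np.sqrt(1.0 - h))

# Flag |standardized residual| > 2 as outliers and
# h_i > 2*(m + 1)/n (or 3*(m + 1)/n) as high-leverage points.
```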

22 Example of outlier calculation
Example: suppose an 11th hiker traveled 20 km in 5 hours. Including the 11th observation changes the regression results slightly: b0 = 6.36, b1 = 2.00, s = 1.72. Compute its leverage, the standard error of its residual, and its standardized residual. Conclusion?

23 Example from the Cereals dataset
Scatter plot of nutritional rating against sugars from the Cereals dataset (text p. 34). Minitab reports a standardized residual of 3.38 for All-Bran Extra Fiber. Outlier? Use your code to verify this result. Outlier: All-Bran Extra Fiber, sugars = 0, residual = 34.26.

24 Influential Observations
An influential observation significantly alters the regression parameters based on its absence or presence in the data set. An outlier may or may not be influential; a high-leverage point may or may not be influential. Example: an 11th hiker walks 39 km in 16 hours; it is identified as a high-leverage point and likely has a strong influence on the slope. Cook's distance is a test for influential observations.

25 Cook's distance
Di = (yi − ŷi)² / ((m + 1) s²) · hi / (1 − hi)², where (yi − ŷi) is the ith residual, s is the standard error of the estimate, hi is the leverage of the ith observation, and m is the number of predictors. Cook's distance combines elements representing the outlier and the leverage of an observation. Example: compute Cook's distance for the 11th (5, 20) hiker.
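A sketch of the same formula in code (names illustrative; the residuals, leverages, and s come from the earlier routines):

```python
import numpy as np

def cooks_distance(resid, h, s, m=1):
    """Cook's distance D_i = r_i^2 / ((m + 1) s^2) * h_i / (1 - h_i)^2.

    resid : raw residuals y_i - yhat_i
    h     : leverages
    s     : standard error of the estimate
    m     : number of predictors
    """
    resid, h = np.asarray(resid, float), np.asarray(h, float)
    return (resid ** 2 / ((m + 1) * s ** 2)) * (h / (1.0 - h) ** 2)
```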

26 More on Cook’s distance
In general, influential observations have Cook's distance > 1.0. Cook's distance can also be compared against the Fm,n−m distribution: measures greater than the median percentile of Fm,n−m are considered influential. Example: the 11th hiker (5, 20) is not influential; its Cook's distance lies within the 37th percentile of F1,10. Example: the hard-core hiker (16, 39) has high leverage, but its Cook's distance shows it is not influential.


28 Example of influential observation
Data on the 11th hiker (10, 23) "pulls down" the regression line: the slope b1 decreases from 2.00 to 1.82. Its Cook's distance of 0.82 does not meet the CD > 1 criterion, but it lies in the 62nd percentile of F1,10, which makes it influential. [Figure: F1,10 density with the observed statistic marked.]

29 Assignment 3, part 1: Write code for fitting a line to data using the method in slide 6. Include calculation of the coefficient of determination (notes on the slides above), the standard error of the estimate (slide 16), high-leverage data points (slide 18), outliers (slide 19), and influential points (slide 23), using a lower bound of 1 on Cook's distance for influence.

30 Assignment 3, part 1 continued
Fit a line to distance hiked vs. time using the data on slide 12. Report the coefficient of determination and the standard error of the estimate. Flag high-leverage points, outliers, and influential points. Check your results against Table 2.7 (text p. 47). Repeat with an 11th data point (5, 20) and report the difference from the 10-point results. Repeat with an 11th data point (16, 39). Repeat with an 11th data point (10, 23). Fit a line to Rating as a function of Sugars from the Cereals dataset and check your results against Table 2.7 (text p. 47).

31 Inference in Regression: Going beyond r2 and s
Assume that a linear relationship between predictor and response exists but is obscured by normally distributed noise with zero mean and a variance that is independent of the predictor and response values. To trust the results of "inference", these assumptions about the error must be verified. Each example in the population is a random variable. b0 and b1 are estimates of β0 and β1 obtained by minimizing in-sample error expressed as the sum of squared residuals.

32 More details about model assumptions
Assumptions about the error in the data: (1) Zero-mean assumption: the error term ε is a random variable with mean E(ε) = 0. (2) Constant-variance assumption: the variance of ε is constant, regardless of the x-value. (3) Independence assumption: the values of ε are independent. (4) Normality assumption: the error term ε is a normally distributed random variable. Summary: the εi are independent normal random variables with mean 0 and constant variance.

33 Implications of assumptions about error term
Implied behavior of the response variable y: (1) From the zero-mean assumption: for each x, the mean of the y's (which differ by their amounts of error) lies on the regression line. (2) From the constant-variance assumption: regardless of the x-value, the variance of the y's is constant. (3) From the independence assumption: for any x, the values of y are independent. (4) From the normality assumption: y is a normally distributed random variable.

34 Distributions of y at different values of x
Observed y-values corresponding to predictor values x = 5, 10, and 15 are shown as samples from normal distributions with means β0 + β1x. The normal curves have exactly the same shape (constant variance).

35 Regression models without inference
Regression analysis can be applied in a purely descriptive manner, reporting r², s, high-leverage points, outliers, and influential observations. These outputs are not based on assumptions about the error terms.

36 When do we need Inference in Regression?
Suppose minimizing squared residuals leads to r² = 0.3%. An r² value this small indicates that a linear relationship between predictor and response is not useful. Are we sure? Can a valid relationship between x and y exist when r² is small? Inference offers a systematic framework to assess the significance of the linear association between x and y.

37 Inference in Regression: methods
Four inferential methods: (1) The regression equation asserts that a linear relationship exists between x and y; Student's t-test tests this assertion. (2) Confidence interval for the slope, β1. (3) Confidence interval for the mean of the response, given an x-value. (4) Prediction interval for a random response value, given an x-value.

38 T-test for Relationship Between x and y
The least squares estimate of the slope, b1, is a statistic. The sampling distribution of b1 has mean β1 and standard error σb1. The point estimate of σb1 is sb1 = s / sqrt(Σ(xi − x̄)²), where s is the standard error of the estimate. Add the calculation of sb1 to your "fit a line to data" code.
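A sketch of that addition (names illustrative; s comes from the standard-error routine above):

```python
import numpy as np

def slope_standard_error(x, s):
    """s_b1 = s / sqrt(sum (x_i - x_bar)^2), the standard error of the slope."""
    x = np.asarray(x, dtype=float)
    return s / np.sqrt(np.sum((x - x.mean()) ** 2))

# t-statistic for H0: beta_1 = 0 is t = b1 / s_b1, with n - 2 degrees of freedom.
```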

39 T-test for Relationship Between x and y
sb1 measures the variability of estimates of the slope. Small values of sb1 indicate that the estimate of the slope b1 is precise; large values of sb1 indicate that the estimate of the slope is unstable and that the true value of the slope, β1, could be zero. The t-test is based on the statistic t = b1 / sb1: when the null hypothesis is true (β1 = 0), t follows a t-distribution with n − 2 degrees of freedom.

40 Example from the Cereals dataset: Your code should yield the values
Your code should yield b1 = −2.4193 and sb1 = 0.2376 (see slide 44), so the t-statistic is t = b1 / sb1 = −2.4193 / 0.2376 ≈ −10.2. The probability of such an extreme t-value by chance alone (the p-value of the t-statistic) is very small (reported as 0.000 in the Minitab results on the next slide). Search the web for a "p-value of the t-statistic" calculator and verify this result. Reject the null hypothesis that β1 = 0.

41 T-test for Relationship Between x and y
Example: applying the t-test to the regression of nutritional rating on sugar content. Minitab output (most numeric values were not preserved in the transcript): the regression equation is Rating = b0 + b1·Sugars; a coefficient table (Predictor, Coef, SE Coef, T, P) for the Constant and Sugars terms; S, R-Sq = 58.0%, R-Sq(adj) = 57.5%; and an Analysis of Variance table (Source, DF, SS, MS, F, P) with Regression, Residual Error, and Total rows.

42 Confidence Interval for Slope of Regression Line
We have confidence that a linear relationship exists between rating and sugar content of cereals. Now find a confidence interval on our estimate of the slope of the regression line. The t-interval is based on the sampling distribution for b1: we are 100(1 − α)% confident that the true slope β1 lies within b1 ± tn−2 · sb1, where tn−2 is the appropriate percentile point of the t-distribution with n − 2 degrees of freedom.

43 Add a confidence interval for b1 to your code.
Use interpolation on degrees of freedom for 95% confidence to get tdf,95% if df > 30.
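A sketch of the confidence-interval calculation; it uses scipy.stats.t.ppf for the critical value rather than interpolating a printed t-table, which is a substitution made here, not something the slides require:

```python
from scipy import stats

def slope_confidence_interval(b1, s_b1, n, confidence=0.95):
    """100*(1 - alpha)% t-interval for the true slope beta_1."""
    alpha = 1.0 - confidence
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, df=n - 2)  # two-sided critical value
    half_width = t_crit * s_b1
    return b1 - half_width, b1 + half_width
```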

44 Confidence Interval for Slope of Regression Line Example from the Cereals dataset:
b1 = −2.4193, sb1 = 0.2376. The t-critical value for 95% confidence and n − 2 = 75 degrees of freedom is t75,95% ≈ 2.0. The slope estimate with its confidence interval is −2.4193 ± (2.0)(0.2376), so we have 95% confidence that the true slope is between −2.8945 and −1.9441.

45 Confidence Interval for Mean Value of y Given x
The regression equation estimates the value of the response variable for a given predictor value, but it does not provide a probability statement about the accuracy of that estimate. Probability statements about accuracy can be obtained from (1) a confidence interval for the mean value of y, given x, and (2) a prediction interval for the response of a randomly chosen example.

46 Confidence Interval for Mean Value of y Given x
The interval is yp ± tn−2,95% · s · sqrt(h(xp)), where xp is the given value of x for which the prediction is being made, yp is the regression estimate at x = xp, s is the standard error of the estimate, tn−2,95% is the percentile point of the t-distribution, and h(xp) = 1/n + (xp − x̄)² / Σ(xi − x̄)² is the leverage of xp. Add this calculation to your code.
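A sketch of that addition (names illustrative; yp and s come from the earlier routines):

```python
import numpy as np
from scipy import stats

def mean_response_interval(x, xp, yp, s, confidence=0.95):
    """Confidence interval on the mean of y at x = xp: yp +/- t * s * sqrt(h(xp))."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    h_p = 1.0 / n + (xp - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
    t_crit = stats.t.ppf(1.0 - (1.0 - confidence) / 2.0, df=n - 2)
    half_width = t_crit * s * np.sqrt(h_p)
    return yp - half_width, yp + half_width
```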

47 Calculate 95% confidence interval on the average distance traveled by hikers that walk 5 hours
xp = 5, yp = 16, s ≈ 1.22, n = 10, t8,95% = 2.306.

48 Prediction Interval for Response of a Randomly Chosen Example
Batting averages of individual players are more variable than the mean batting averages of teams; therefore, estimates of a team's batting average are more precise than estimates of an individual player's batting average. In general, it is easier to predict the mean value of a variable than to predict its value for a randomly chosen example. In general, prediction intervals for a randomly chosen example are more useful to data miners than confidence intervals on mean values.

49 Prediction Interval for Response of a Randomly Chosen Example
The prediction interval is yp ± tn−2,95% · s · sqrt(1 + h(xp)); note that it is similar to the confidence interval for the mean value of y given x. At the same confidence level, prediction intervals are always wider than confidence intervals on means. Add this calculation to your code.
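A sketch of the prediction interval, differing from the mean-response interval only in the extra 1 under the square root (names illustrative):

```python
import numpy as np
from scipy import stats

def prediction_interval(x, xp, yp, s, confidence=0.95):
    """Prediction interval for a single new response at x = xp:
    yp +/- t * s * sqrt(1 + h(xp)); always wider than the mean-response interval."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    h_p = 1.0 / n + (xp - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
    t_crit = stats.t.ppf(1.0 - (1.0 - confidence) / 2.0, df=n - 2)
    half_width = t_crit * s * np.sqrt(1.0 + h_p)
    return yp - half_width, yp + half_width
```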

50 Estimate the distance traveled by a randomly selected hiker who walks for 5 hours (95% confidence)

51 Confidence Intervals vs Prediction Intervals
We are 95% confident that the average distance traveled by hikers who walk 5 hours is between ... and ... km. We are 95% confident that the distance traveled by a randomly chosen hiker who walks 5 hours is between ... and ... km.

52 Verifying Regression Assumptions
Review: in-silico populations. Choose values of β0 and β1 and a range of predictor values xL < x < xU. Choose 1,000,000 values of x uniformly distributed between xL and xU and 1,000,000 values of ε from a normal distribution with zero mean and a given variance, and generate 1,000,000 response values. Randomly choose 100 records from the in-silico population as a sample dataset and generate estimates b0 and b1 of β0 and β1 by minimizing in-sample error. Repeat with 100 randomly chosen datasets. We then have 100 samples of b0 and b1 from which we can make statistical inferences about the uncertainty in the parameter estimates and the prediction of the response. From each dataset we also have 100 residuals at the optimal values of b0 and b1 that mainly reflect the value of ε in each record. Linear regression is a model of data based on the assumption that the dataset has statistics like the 100 records drawn from our in-silico population.

53 Verifying Regression Assumptions
Linear regression is a model of data based on the assumption that the dataset has statistics like the 100 records drawn from our in-silico population. The only handle we have to test this assumption is the residuals at the optimum choice of parameters. Normally distributed residuals are evidence for the validity of the model assumptions. In most cases, the residuals (errors) in a 1D linear regression model are due to the effects on the response of attributes other than the model's predictor.

54 Verifying Regression Assumptions
Results from a linear-regression model cannot be trusted unless the assumptions of the model are verified from the distribution of the minimized residuals. Two graphical methods of verification are discussed in the text: (1) a normal probability plot of the residuals, and (2) a plot of standardized residuals against predicted values. You are only responsible for the second method. Add a scatter plot of standardized residuals vs. predicted response to your code.
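A sketch of that diagnostic plot with matplotlib (names illustrative; the standardized residuals come from the earlier routine):

```python
import matplotlib.pyplot as plt

def plot_standardized_residuals(y_fit, std_resid):
    """Scatter plot of standardized residuals against predicted values.

    A pattern-free band around zero supports the error assumptions;
    funnels or trends suggest violations.
    """
    plt.scatter(y_fit, std_resid)
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("Predicted value")
    plt.ylabel("Standardized residual")
    plt.show()
```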

55 Review: Standardized Residuals
Leverage of the ith observation: hi = 1/n + (xi − x̄)² / Σj (xj − x̄)². Standard error of the ith residual: si = s·sqrt(1 − hi). Standardized residual: (yi − ŷi) / (s·sqrt(1 − hi)).

56 Plot Standardized Residuals Against Predicted Values
Example: regression of distance vs. time for the hiker data set. A discernible pattern would indicate that the assumptions about the error are not true; there are too few data points to make a determination in this case.

57 Under the basic assumption that a linear relationship exists between predictor and response, the residuals are an indication of the error in the data. Plot (A): no pattern, suggesting the assumptions about the error are valid. Plot (B): suggests a functional relationship between errors at different levels of the response (i.e., not independent). Plot (C): suggests that the variance of the error increases as the response increases. Plot (D): suggests that the mean of the error increases as the response increases. [Four residual plots, panels A-D.]

58 FYI: Diagnostic tests for verifying regression assumptions
Anderson-Darling test: are the residuals normally distributed? Bartlett's or Levene's test: do the residuals have constant variance? Durbin-Watson or runs test: are the residuals independent? You are not responsible for these tests.

59 Example: Baseball Data Set
Regression of number of home runs against batting average. Players are excluded where the number of at-bats is less than 100; the shorter data set has 209 records. Apply your linear fit code to this data set and test your result by comparison with Table 2.14 (text p. 71). Make a scatter plot of standardized residual vs. fit and compare your plot to Figure 2.16 (text p. 70).

60 Regression on home runs vs batting average
t-stat = 7.9, p-value ≈ 0.000. Even though r² is small, the p-value of the t-statistic indicates high confidence that the true slope is not zero. Can we believe these results? Apply the graphical methods described above to validate the assumptions about the error in the data.

61 Graphical tests of assumptions about error
The probability plot indicates the distribution of the error is right-skewed: the normality assumption is violated. The plot of standardized residuals vs. fit shows a "funnel" pattern: the constant-variance assumption is violated. Confidence limits on the true slope cannot be trusted. Try to improve confidence in the regression by transforming to ln(home runs).

62 Regression on ln(home runs) vs batting average
t-stat = 7.5, p-value ≈ 0.000. My results are slightly different from Table 2.15 (p. 73): 201 records remain after eliminating cases with hr = 0. Again, the p-value of the t-statistic indicates the true slope is not zero. Can we have more confidence in the results after the transformation?

63 Graphical tests of assumptions about error: home runs vs. ln(home runs)
[Side-by-side residual plots for home runs and ln(home runs).] Not perfect, but significantly improved.

64 Accepting that model assumptions are valid
With my fit, the standard error of the estimate gives e^s = 2.31 as a typical multiplicative error in the home runs predicted by the regression on batting average (text: 1.96). My coefficient of determination is r² = 21.9% (text: 23.8%), indicating that batting average accounts for about 20% of the variability in ln(home_runs) for players with more than 100 at-bats. Other attributes, like size, strength, and number of at-bats, affect a player's ability to hit home runs. The correlation coefficient r = +sqrt(r²) ≈ 0.47: home runs have a weak positive correlation with batting average.

65 Accepting that model assumptions are valid
With my b1 ≈ 13.6 and sb1 = 1.826, the t-statistic = 7.47 (text: 8.04). The p-value, P(|t| > 7.47), is ≈ 0.000: better than 95% confidence that batting average is a predictor of home runs (i.e., confidence that the slope is not zero). My 95% confidence interval on the slope is (10.0, 17.2) (text: (8.73, 14.4)).

66 Accepting that model assumptions are valid
With my regression line, a player with a 0.3 batting average is expected to hit e^2.64 = 14.0 home runs. My 95% confidence interval on the mean number of home runs hit by players with a batting average of 0.3 is (e^2.4667, e^2.8387) = (11.8, 17.1); text: (e^2.6567, e^2.9545) = (14.25, 19.19). My 95% prediction interval on the number of home runs hit by a random player with a batting average of 0.3 is (2.72, 74.1), which is too wide to be useful; text: (e^1.4701, e^4.1411) = (4.35, 62.87), also too wide to be useful.

67 Accepting that model assumptions are valid
I find 9 outliers in the dataset (the text finds 7), all with a low number of home runs.

68 Accepting that model assumptions are valid
I find 7 high-leverage points in the dataset (as does the text).

69 Last part of Assignment 3:
Do the regression of ln(home runs) vs. batting average and tell me if I screwed up.

70 Example: California Data Set
The California data set includes census information for 858 towns and cities. Do towns with a high fraction of senior citizens tend to be small or large towns? A scatter plot of percentage over 64 against population shows that the outliers are large cities. Address the skewness by the transformation ln(population).

71 Regression on % seniors vs ln(population)
The probability plot of standardized residuals shows the residuals are not normally distributed (supported by the Anderson-Darling test). The plot of standardized residuals vs. fit shows a "funnel" effect: the variance of the residuals is not constant. Try a regression of ln(% seniors) vs. ln(population).

72 Regression on ln(% seniors) vs ln(population)
The plot of standardized residuals vs. fits shows less of a "funnel" effect. 8 of the 9 outliers with a low % of seniors are towns with military installations. Exclude the outliers and continue the analysis.

73 Transformations to Achieve Linearity
Points vs. frequency for Scrabble® letters is non-linear. Following the example of Mosteller and Tukey, use the "bulging rule" to find a transformation that achieves linearity. [Scatter plot of letter point values vs. letter frequency, with the letters labeled: E; A, I; N, R, T; L, S, U; G; B, C, M, P; F, H, V, W, Y; J, X; Q, Z; K; D.]

74 Transformations to Achieve Linearity
The shape of points vs. frequency is like the curve in the lower-left quadrant of the bulging-rule diagram, which suggests transformations of both x and y that are to the left of t^1 (powers below 1). The square root is not enough; try ln(t).

75 Transformations to Achieve Linearity
The regression of ln(points) vs. ln(freq) has r² = 87.6%. The standard error of the estimate gives e^s = 1.34 points. The fit estimates a letter with frequency 4 to be worth 1.72 points; the actual values are either 1 or 2 points. "E" is flagged as an outlier.

76 Box-Cox Transformations to Achieve Linearity
Choose a mesh of λ values. For each λ, repeat the regression with y replaced by f(y) = (y^λ − 1)/λ if λ ≠ 0, or f(y) = ln(y) if λ = 0. Plot the sum of squared residuals vs. λ and use the value of λ that gives the smallest sum of squared residuals.
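A sketch of that grid search, assuming the transformed response f(y) is regressed on x (one reading of the recipe above) and that y is positive; it follows the slide's smallest-SSE criterion rather than the full maximum-likelihood Box-Cox procedure:

```python
import numpy as np

def box_cox_search(x, y, lambdas):
    """Grid-search the Box-Cox parameter: transform y, fit a line in x,
    and keep the lambda with the smallest sum of squared residuals."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    best = None
    for lam in lambdas:
        fy = np.log(y) if lam == 0 else (y ** lam - 1.0) / lam
        a, b = np.polyfit(x, fy, deg=1)            # least-squares line
        sse = np.sum((fy - (a * x + b)) ** 2)
        if best is None or sse < best[1]:
            best = (lam, sse)
    return best  # (best lambda, its SSE)
```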

77 Non-linear regression in 1D

78 Fit a parabola to 1D data
The "target function" is the "trend" in the data; scatter around the trend is interpreted as noise. H in this case is the set of all 2nd-degree polynomials. Select the best member of H by "one-step optimization".

79 One-step optimization: linear least squares
Take the derivatives of Ein(g) with respect to the coefficients of the parabola (collectively called θ) and set them equal to zero. Solve the resulting 3×3 linear system. This generalizes to a polynomial of any degree using matrix algebra. Why do we call polynomial curve fitting "linear least squares"?

80 Polynomial regression by linear least squares
Assume g(x|θ) is a polynomial of degree n − 1 (i.e., a linear combination of 1, x, x², …, x^(n−1)). Let m be the number of examples (x^t, r^t) in the training set. Define the m×n matrix A with Aij = the jth basis function evaluated at the ith training input, θ the column vector of n unknown coefficients, and b the column vector of the m values r^t in the training set. If Aθ = b has a solution, then g(x^t|θ) = r^t for all t. This is not what we want; why? With n << m, Aθ = b has no exact solution.

81 Normal Equations
Look for an approximate solution that minimizes the Euclidean norm of the residual vector r = b − Aθ. Define f(θ) = ||r||² = rTr = (b − Aθ)T(b − Aθ) = bTb − 2θTATb + θTATAθ. A necessary condition for θ0 to be a minimum of f(θ) is ∇f(θ0) = 0. Since ∇f(θ) = 2ATAθ − 2ATb, the optimal set of parameters is a solution of the n×n symmetric system of linear equations ATAθ = ATb.

82 Polynomial Regression: degree k with N data points
Solve DTDw = DTr for the k + 1 coefficients, where D is the design matrix whose columns are 1, x, x², …, x^k evaluated at the N data points.
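A sketch of degree-k polynomial regression through the normal equations (names illustrative; numpy.vander builds the design matrix):

```python
import numpy as np

def poly_fit(x, y, k):
    """Degree-k polynomial regression via the normal equations D^T D w = D^T y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    D = np.vander(x, N=k + 1, increasing=True)   # columns 1, x, ..., x^k
    w = np.linalg.solve(D.T @ D, D.T @ y)        # k + 1 coefficients
    return w                                      # w[j] multiplies x^j
```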

83 Given the parameters that minimize the sum of squared residuals,
Yfit = Dw are the values of the fit at x^t, the locations of the data points, and R = Yfit − Y are the residuals at the data points.

84 Coefficient of determination
r² = 1 − Σt (r^t − g(x^t|w))² / Σt (r^t − r̄)². The denominator is the sum of squared error associated with the hypothesis that the data are approximated by their mean value, a polynomial of degree zero.

85 Review: 1D polynomial regression (curve fitting) has all of the fundamental characteristics of data mining
Data points (x, y) support supervised machine learning with x as the attribute and y as the label. The degree of the polynomial defines a hypothesis set; polynomials of higher degree are more complex hypotheses. The sum of squared residuals defines an in-sample error Ein that can be used to select a member of the hypothesis set by matrix algebra. Eout can be analytically defined and calculated for in-silico datasets (target function + noise).

86 Tuning regression models
The degree of the polynomial used in fitting data by polynomials is an example of complexity in the hypothesis set H used in data mining. As degree increases the hypothesis set has more adjustable parameters; hence, a greater diversity of shapes is possible.

87 Over-fitting: The parabolic fit shown here looks OK, but would a cubic give a better fit? A cubic fit will give a smaller Ein(g), but likely at the cost of a larger Eout(g). The cubic lets me fit more of the noise in the data, which is specific to this data set. The optimum cubic fit to this data set is likely a poorer approximation to a different data set because the noise is different.

88 Approximation – Generalization Tradeoff
In the theory of generalization (covered in Fundamentals 3) it can be shown that Eout(g) < Ein(g) + Ω(N, H, δ), where Ω is a function of N, the training-set size, H, the hypothesis set, and δ, the allowable uncertainty in the final model. Ω(N, H, δ) is a bound on the difference between Eout(g) and Ein(g); if Ω(N, H, δ) is small, we can be confident of good generalization. At a given complexity (determined by H), higher statistical confidence (1 − δ) can usually be achieved with larger N. At fixed N and δ, Ω usually increases with the complexity of H, making generalization less certain. Even though Ein(g) may decrease with higher complexity, Eout(g) may not. In least-squares 1D regression, this effect can be illustrated by the "bias/variance dilemma".

89 Given a parabolic target function, construct several "in silico" data sets by adding noise drawn from a normal distribution with zero mean and a specified variance. Fit a cubic to each in-silico data set. Averaging these results, we get a consensus cubic fit. The difference between the consensus fit and the target function is called the "bias". From the consensus fit and the individual cubic fits, we can calculate a variance.

90 Formal definitions of Bias & Variance
Assume the target function f(x) is known. Create M in-silico datasets of size N by adding noise to f(x). For each dataset, find the best g_i(x) of a given complexity. Average the g_i(x) to get the best overall estimator ḡ(x) of f(x). Then calculate the bias and variance of this estimator as bias² = E_x[(ḡ(x) − f(x))²] and variance = E_x[⟨(g_i(x) − ḡ(x))²⟩], where ⟨ ⟩ denotes an average over the datasets.
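A sketch of this calculation, assuming the target f and the fitted models g_i are available as callables (all names here are illustrative):

```python
import numpy as np

def bias_variance(f, x_grid, fits):
    """Bias^2 and variance of an ensemble of fitted models.

    f      : the known target function
    x_grid : points over which to average (the x-domain)
    fits   : list of fitted models g_i, each a callable g_i(x)
    """
    G = np.array([g(x_grid) for g in fits])      # M x len(x_grid) predictions
    g_bar = G.mean(axis=0)                        # consensus (average) fit
    bias_sq = np.mean((g_bar - f(x_grid)) ** 2)   # E_x[(g_bar - f)^2]
    variance = np.mean((G - g_bar) ** 2)          # E_x[<(g_i - g_bar)^2>]
    return bias_sq, variance
```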

91 Expectation value of Eout(g)
Eout(g_i) is the out-of-sample error for the ith training set. E_x denotes an average over the specified domain for f(x), and ⟨ ⟩ denotes an average over data sets. The expected Eout can be written as a sum of 3 terms, σ² + bias² + variance, where σ² is the contribution from noise in the data. σ² does not depend on the complexity of the hypothesis set, so we can ignore it in this discussion.

92 Derive Eout = bias2 + variance

93 Polynomial fits to sin(x) + noise
Bias is the RMS deviation between ḡ and f. [Plots: f, the g_i, and ḡ for one in-silico experiment; linear regression over 5 experiments; cubic regression over 5 experiments.] The cubic fits show smaller bias but larger variance: each cubic has a shape like f(x), but the shape of g_i varies more from experiment to experiment.

94 Bias, variance and Eout from polynomial fits to sin(x) + noise
The best complexity is degree 3; beyond 3, decreases in bias are offset by increases in variance.

95 Cannot use bias/variance analysis to tune polynomial fits to
real data because f(x) is unknown; hence we cannot calculate the bias.

96 Divide real data into training and validation sets
Use the validation set to estimate Eout; the "elbow" in the estimate of Eout indicates the best complexity.

97 Use 25 samples for training, 75 for validation
Assignment 3 due: Generate the in-silico data set 2 sin(1.5x) + N(0, 1) with 100 uniformly distributed random values of x between 0 and 5 and 100 normally distributed values of noise. Use 25 samples for training and 75 for validation. Fit polynomials of degree 1-5 to the training set, calculating Ein and Eval at each degree. Plot Ein (minimum sum of squared residuals) and Eval vs. the degree of the polynomial, and find the "elbow" in Eval that indicates the best complexity for the polynomial regression. Use the full data set to find the optimum polynomial of the best complexity. Show this result as a plot of the data and the fit on the same set of axes, and report the minimum sum of squared residuals and the coefficient of determination. A sketch of the data generation and degree sweep is given below.
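The sketch below is one possible setup, not the required solution; it reports mean squared residuals per sample so that the 25-point training error and 75-point validation error are comparable, and the random seed and variable names are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# In-silico dataset: y = 2*sin(1.5*x) + N(0, 1), 100 points, x uniform on [0, 5]
x = rng.uniform(0.0, 5.0, size=100)
y = 2.0 * np.sin(1.5 * x) + rng.normal(0.0, 1.0, size=100)

# 25 samples for training, 75 for validation
idx = rng.permutation(100)
train, val = idx[:25], idx[25:]

for degree in range(1, 6):
    coeffs = np.polyfit(x[train], y[train], deg=degree)
    e_in = np.mean((np.polyval(coeffs, x[train]) - y[train]) ** 2)
    e_val = np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2)
    print(f"degree {degree}: E_in = {e_in:.3f}, E_val = {e_val:.3f}")
```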

98 Get in silico data Calculate in-sample and validation errors

99 Evidence for cubic as best choice for degree of polynomial
The VC bound suggests that small decreases in Eval for degree > 3 do not indicate better generalization. [Plot: Ein and Eval vs. degree of polynomial.]

100 Expected results: solid curve is target function,
Expected results: solid curve is target function, *’s are cubic fit, +’s are training data

101 Quiz #: Review for quiz. Look over the questions in the text at the end of chapter 2. Ask about anything you have doubts about. Be prepared to answer questions.

102 Review questions: 1) Seven items that can be reported from "descriptive" regression in 1D: b0, b1, r², s, outliers, high-leverage points, and influential points. 2) Four items that can be reported from inference on regression in 1D: a test of H0 that the slope = 0, a confidence interval on the slope, a confidence interval on the average response at a given predictor value, and a prediction interval on the response at a randomly chosen predictor value. 3) Why do we investigate the distribution of standardized residuals before considering inference on regression? 4) What do we conclude about the distribution of standardized residuals from plots with shapes A, B, C, and D?

103 Example of influential observation
The 11th hiker (10, 23) has Cook's distance = 0.82, which is in the 62nd percentile of F1,10. Explain, based on the table. [Table/plot of the F1,10 distribution with the observed statistic marked.] If the criterion for an influential observation is a Cook's distance in at least the 50th percentile of the F distribution, what is the threshold with 1 and 10 degrees of freedom?

104 Matching exercise: regression terms and definitions (the capital letter before each definition is the answer key).
a. Influential observation | E >> Measures the typical difference between the predicted response value and the actual response value.
b. SSE | H >> Represents the total variability in the values of the response variable alone, without reference to the predictor.
c. r2 | I >> An observation which has a very large standardized residual in absolute value.
d. Residual | G >> Measures the strength of the linear relationship between two quantitative variables, with values ranging from −1 to 1.
e. s | A >> An observation which significantly alters the regression parameters based on its presence or absence in the data set.
f. High leverage point | K >> Measures the level of influence of an observation, by taking into account both the size of the residual and the amount of leverage for that observation.
g. r | B >> Represents an overall measure of the error in prediction resulting from the use of the estimated regression equation.
h. SST | F >> An observation which is extreme in the predictor space, without reference to the response variable.
i. Outlier | J >> Measures the overall improvement in prediction accuracy when using the regression as opposed to ignoring the predictor information.
j. SSR | D >> The vertical distance between the predicted response and the actual response.
k. Cook's Distance | C >> The proportion of the variability in the response that is explained by the linear relationship between the predictor and response variables.

105 2.9 p88 Regression with very low r2.
What statistic in Table 2.11 (text p. 57) suggests we might get useful results from the regression? When the null hypothesis is true (β1 = 0), t = b1 / sb1 follows a t-distribution with n − 2 degrees of freedom. The test does not tell us a value of n.

106 2.12 p88 If H0 is true a statistic that follows the F1,5 distribution has a value of 5. Can I reject H0 with 95% confidence?

107 2.19 p90 Based on the Minitab output below, what are b0, b1, r², r, s, sb1, the 95% confidence interval on the slope, etc.?

108 2.20 p90 Based on a scatter plot of the data, which bulging-rule transformation should be applied to attempt a transformation to linearity?

