Review of fundamentals 1
Data mining in 1D: curve fitting by LLS
Approximation-generalization tradeoff
First homework assignment

Review: Concepts from fundamentals 1
Define the following:
- Supervised learning
- Unsupervised learning
- Reinforcement learning
- Generalization
- Hypothesis set
- E_in(h|X)
- h_opt = argmin_h E_in(h|X)
- E_out(h_opt)
- E_test(h_opt)
- Version space
- Margins
- Support vectors

Review: Concepts from fundamentals 1
Define the following:
- H shatters N points
- VC dimension
- Break point

Review: Questions about the VC dimension
The VC dimension of a linear dichotomizer in 2D is 3.
- What does 2D mean?
- What does dichotomizer mean?
- What does linear dichotomizer mean?
- What does a VC dimension of 3 mean for the linear dichotomizer in 2D?
- Why is 4 a break point for the linear dichotomizer in 2D?

Examples of family cars (after E. Alpaydın, Introduction to Machine Learning, 2e, MIT Press 2010)
An expert on family cars has given us 100x more data, with engine power measured by a standard test. The data contain used cars. Use these data to find a relationship between engine power and the price of a family car. Interpret your training data in terms of p(x,y) = p(x)p(y|x).

Examples of family cars (continued)
Take your 500 family-car data points and let x = engine power and y = price. How do you use these data to find the dependence of price on engine power? How do you estimate p(x) and p(y|x)?

Examples of family cars (continued)
How do you estimate p(x) and p(y|x)?
- Set up a bin structure in the x variable. What is a good choice of bin width?
- Assign the 500 y values to bins according to their x value.
- Define a new x variable as the bin center.
- What is p(x)? What is p(y|x)?
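As a concrete illustration (not from the original slides), here is a minimal Python/NumPy sketch of the binning estimate described above. The engine-power range, bin width, and synthetic price relationship are hypothetical assumptions.

```python
import numpy as np

# Illustrative synthetic data: 500 (engine power, price) pairs.
rng = np.random.default_rng(0)
x = rng.uniform(40, 200, 500)                  # engine power (hypothetical units)
y = 5000 + 80 * x + rng.normal(0, 2000, 500)   # price with noise (hypothetical relationship)

# Set up a bin structure in the x variable; the bin width (10 here) is a tuning choice.
bin_edges = np.arange(40, 201, 10)
bin_idx = np.digitize(x, bin_edges) - 1        # assign each example to a bin
centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])

# p(x): fraction of examples falling in each bin.
counts = np.array([(bin_idx == k).sum() for k in range(len(centers))])
p_x = counts / counts.sum()

# p(y|x): summarize the y values in each bin, e.g. by their mean and spread.
for k, c in enumerate(centers):
    y_in_bin = y[bin_idx == k]
    if y_in_bin.size:
        print(f"x ~ {c:6.1f}: p(x) = {p_x[k]:.3f}, "
              f"E[y|x] = {y_in_bin.mean():9.1f}, sd = {y_in_bin.std():8.1f}")
```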

Curve fitting: “regression” in 1D
- Regression can have any number of attributes.
- The label on examples is always a number.

Fit a parabola to data
- The “target function” is the “trend” in the data.
- Scatter around the trend is interpreted as noise.
- H in this case is the set of all 2nd-degree polynomials.
- Select the best member of H by minimizing the sum of squared residuals.
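A minimal sketch of this idea (Python/NumPy assumed, not part of the original slides): np.polyfit picks the member of H that minimizes the sum of squared residuals. The target function and noise level below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 50)
target = 1.0 + 2.0 * x - 0.5 * x**2            # hypothetical parabolic trend (target function)
y = target + rng.normal(0, 1.0, x.size)        # scatter around the trend = noise

# H = all 2nd-degree polynomials; polyfit returns the member of H
# minimizing the sum of squared residuals.
coeffs = np.polyfit(x, y, deg=2)
g = np.polyval(coeffs, x)

E_in = np.sum((y - g) ** 2)                    # in-sample error (sum of squared residuals)
print("fitted coefficients (highest degree first):", coeffs)
print("E_in =", E_in)
```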

Finding the best member of H by calculus
- Take derivatives of E_in(g) with respect to the coefficients of the parabola (collectively called w) and set them equal to zero.
- Solve the resulting 3x3 linear system.
- Generalize to a polynomial of any degree using matrix algebra.

Polynomial regression by linear least squares
- Assume g(x|w) is a polynomial of degree n-1 (i.e., a linear combination of 1, x, x^2, ..., x^(n-1)).
- m = number of examples (x_i, r_i) in the training set.
- Define the m x n matrix A with A_ij = the j-th basis function evaluated at x_i.
- w = column vector of the n unknown coefficients; b = column vector of the m values r_i in the training set.
- If Aw = b has a solution, then g(x_i|w) = r_i for all i. Not what we want; why?
- With n << m, Aw = b has no exact solution.

Normal equations
Look for an approximate solution that minimizes the Euclidean norm of the residual vector r = b - Aw. Define

f(w) = ||r||^2 = r^T r = (b - Aw)^T (b - Aw) = b^T b - 2 w^T A^T b + w^T A^T A w.

A necessary condition for w_0 to be a minimum of f(w) is ∇f(w_0) = 0, where ∇f(w) = 2 A^T A w - 2 A^T b. The optimal set of parameters is therefore a solution of the n x n symmetric system of linear equations A^T A w = A^T b (the normal equations).

Polynomial regression: degree k with N data points
Solve D^T D w = D^T r for the k+1 coefficients, where D is the N x (k+1) design matrix whose columns are the powers of x evaluated at the data points and r is the vector of N labels.
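A sketch of the matrix-algebra route just described (Python/NumPy assumed); the variable names D, w, and r follow the slide's notation, and the synthetic data below are illustrative only.

```python
import numpy as np

def polyfit_normal_equations(x, r, k):
    """Fit a degree-k polynomial by solving the normal equations D^T D w = D^T r."""
    # Design matrix D: column j holds x**j, so D has k+1 columns.
    D = np.vander(x, N=k + 1, increasing=True)
    # Solve the (k+1) x (k+1) symmetric linear system for the coefficients w.
    w = np.linalg.solve(D.T @ D, D.T @ r)
    y_fit = D @ w                 # values of the fit at the data points
    residuals = y_fit - r         # R = Y_fit - Y
    return w, y_fit, residuals

# Example usage on synthetic data (illustrative only).
rng = np.random.default_rng(2)
x = np.linspace(0, 5, 30)
r = 2 * np.sin(1.5 * x) + rng.normal(0, 1, x.size)
w, y_fit, res = polyfit_normal_equations(x, r, k=3)
print("coefficients:", w, " sum of squared residuals:", np.sum(res**2))
```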

Given the parameters w that minimize the sum of squared deviations, Y_fit = Dw are the values of the fit at the locations of the data points, and R = Y_fit - Y are the residuals at the data points.

Coefficient of determination
R^2 = 1 - (sum of squared residuals of the fit) / (sum of squared deviations of the data from their mean). The denominator is the sum of squared errors associated with the hypothesis that the data are approximated by their mean value, a polynomial of degree zero.
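A short sketch of this definition (Python/NumPy assumed); y_fit could come from either of the fitting routines above.

```python
import numpy as np

def r_squared(y, y_fit):
    """Coefficient of determination: 1 - SSE(fit) / SSE(degree-zero mean model)."""
    sse_fit = np.sum((y - y_fit) ** 2)        # error of the hypothesis g
    sse_mean = np.sum((y - y.mean()) ** 2)    # error of approximating the data by their mean
    return 1.0 - sse_fit / sse_mean
```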

Review
1D polynomial regression (curve fitting) has all of the fundamental characteristics of data mining:
- Data points (x, y) support supervised machine learning with x as the attribute and y as the label.
- The degree of the polynomial defines a hypothesis set; polynomials of higher degree are more complex hypotheses.
- The sum of squared residuals defines an E_in that can be used to select a member of the hypothesis set by matrix algebra.
- E_out can be analytically defined and calculated for in silico datasets (target function + noise).

Tuning regression models
The degree of the polynomial used to fit the data is an example of the complexity of the hypothesis set H used in data mining. As the degree increases, the hypothesis set has more adjustable parameters; hence a greater diversity of shapes is possible.

Over-fitting
- The parabolic fit shown here looks OK, but would a cubic give a better fit?
- A cubic fit will give a smaller E_in(g), but likely at the cost of a larger E_out(g).
- A cubic lets me fit more of the noise in the data, which is specific to this data set.
- The optimum cubic fit to this data set is likely a poorer approximation to a different data set because the noise is different.

Approximation-generalization tradeoff
In the theory of generalization (covered in fundamentals 3) it can be shown that

E_out(g) <= E_in(g) + Ω(N, H, δ),

where Ω is a function of N, the training-set size; H, the hypothesis set; and δ, the allowable uncertainty in the final model.
- Ω(N, H, δ) is a bound on the difference between E_out(g) and E_in(g). If Ω(N, H, δ) is small we can be confident of good generalization.
- At a given complexity (determined by H), higher statistical confidence (1 - δ) can usually be achieved with larger N.
- At fixed N and δ, Ω usually increases with the complexity of H, making generalization less certain. Even though E_in(g) may decrease with higher complexity, E_out(g) may not.
- In least-squares 1D regression, this effect can be illustrated by the “bias/variance dilemma.”

- Given a parabolic target function, construct several “in silico” data sets by adding noise drawn from a normal distribution with zero mean and a specified variance.
- Fit a cubic to each in silico data set.
- Averaging these results gives a consensus cubic fit.
- The difference between the consensus fit and the target function is called “bias.”
- From the consensus fit and the individual cubic fits, we can calculate a variance.

Formal definitions of bias & variance
- Assume the target function f(x) is known.
- Create M in silico datasets of size N by adding noise to f(x).
- For each dataset find the best g_i(x) of given complexity.
- Average the g_i(x) to get the best overall estimator of f(x): ḡ(x) = (1/M) Σ_i g_i(x).
- Calculate the bias and variance of this estimator as follows:
  bias^2 = E_x[(ḡ(x) - f(x))^2]
  variance = E_x[(1/M) Σ_i (g_i(x) - ḡ(x))^2]
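A minimal sketch of this procedure (Python/NumPy assumed). The target function 2 sin(1.5x), the noise level, and the values of M, N, and degree are illustrative choices echoing the sin(x) + noise example, not the exact experiment behind the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: 2 * np.sin(1.5 * x)          # known target function
x_grid = np.linspace(0, 5, 200)            # domain over which the E_x averages are taken
M, N, degree = 100, 25, 3                  # M in silico datasets of size N, fixed complexity

fits = np.empty((M, x_grid.size))
for i in range(M):
    x = rng.uniform(0, 5, N)
    y = f(x) + rng.normal(0, 1, N)          # add noise to f(x)
    coeffs = np.polyfit(x, y, degree)       # best g_i(x) of the given complexity
    fits[i] = np.polyval(coeffs, x_grid)

g_bar = fits.mean(axis=0)                                  # consensus (average) estimator
bias2 = np.mean((g_bar - f(x_grid)) ** 2)                  # bias^2 averaged over x
variance = np.mean(np.mean((fits - g_bar) ** 2, axis=0))   # variance averaged over x
print(f"bias^2 = {bias2:.3f}, variance = {variance:.3f}")
```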

Expectation values of E_out(g)
E_out(g_i) is the out-of-sample error for the i-th training set; ⟨·⟩ denotes an average over data sets, and E_x denotes an average over the specified domain of f(x). ⟨E_out⟩ can be written as a sum of 3 terms,

⟨E_out⟩ = σ^2 + bias^2 + variance,

where σ^2 is a contribution from noise in the data. σ^2 does not depend on the complexity of the hypothesis set, so we can ignore it in this discussion.

Derive ⟨E_out⟩ = bias^2 + variance
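A sketch of the standard derivation, using the notation above (⟨·⟩ for the average over datasets, E_x for the average over x) and dropping the noise term σ²:

```latex
\begin{aligned}
\langle E_{\text{out}} \rangle
  &= E_x\big[\langle (g_i(x) - f(x))^2 \rangle\big] \\
  &= E_x\big[\langle (g_i(x) - \bar g(x) + \bar g(x) - f(x))^2 \rangle\big] \\
  &= E_x\big[\underbrace{\langle (g_i(x) - \bar g(x))^2 \rangle}_{\text{variance}(x)}
     + 2\,(\bar g(x) - f(x))\,\underbrace{\langle g_i(x) - \bar g(x) \rangle}_{=\,0}
     + \underbrace{(\bar g(x) - f(x))^2}_{\text{bias}^2(x)}\big] \\
  &= \text{bias}^2 + \text{variance}.
\end{aligned}
```

The cross term vanishes because ⟨g_i(x)⟩ = ḡ(x) by definition of the consensus fit.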

Figure: polynomial fits to sin(x) + noise, showing f, the g_i from individual in silico experiments, and the consensus fit ḡ (one experiment and 5 experiments). Bias is the RMSD between ḡ and f. Linear regression, 5 experiments: the g_i vary little but miss the shape of f. Cubic fits: each cubic has a shape like f(x) (smaller bias), but the shape of g_i varies more between experiments (larger variance).

Bias, variance, and E_out from polynomial fits to sin(x) + noise: the best complexity is degree 3. Beyond degree 3, decreases in bias are offset by increases in variance.

Bias/variance analysis cannot be used to tune polynomial fits to real data because f(x) is unknown; hence we cannot calculate the bias.

Instead, divide the real data into training and validation sets and use the validation set to estimate E_out. An “elbow” in the estimate of E_out indicates the best complexity.

Assignment 1 (due)
- Generate the in silico data set 2sin(1.5x) + N(0,1) with 100 random values of x between 0 and 5.
- Use 25 samples for training and 75 for validation.
- Fit polynomials of degree 1-5 to the training set. Calculate E_in and E_val at each degree.
- Plot your results as shown in the previous slide to find the “elbow” in E_val and the best complexity for data mining.
- Use the full data set to find the optimum polynomial of the best complexity.
- Show this result as a plot of the data and the fit on the same set of axes. Report the minimum sum of squared residuals and the coefficient of determination.

Code shown on the slide: get the in silico data; calculate the in-sample and validation errors.
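A minimal Python/NumPy sketch of what those two steps might contain (not the original code). The data-generating function, sample split, and degree range come from the assignment; the random seed, the use of mean squared error for E_in and E_val, and the plotting details are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Get in silico data: 2*sin(1.5x) + N(0,1) at 100 random x in [0, 5].
x = rng.uniform(0, 5, 100)
y = 2 * np.sin(1.5 * x) + rng.normal(0, 1, 100)

# 25 samples for training, 75 for validation.
x_tr, y_tr = x[:25], y[:25]
x_val, y_val = x[25:], y[25:]

# Calculate in-sample and validation errors for polynomial degrees 1-5.
degrees = range(1, 6)
E_in, E_val = [], []
for d in degrees:
    coeffs = np.polyfit(x_tr, y_tr, d)
    E_in.append(np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2))
    E_val.append(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))

plt.plot(degrees, E_in, "o-", label="E_in")
plt.plot(degrees, E_val, "s-", label="E_val")
plt.xlabel("Degree of polynomial")
plt.ylabel("Mean squared error")
plt.legend()
plt.show()
```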

Plot of E_in and E_val versus degree of polynomial: evidence for a cubic as the best choice of degree. The VC bound suggests that small decreases in E_val for degree > 3 do not indicate better generalization.

Expected results: solid curve is target function, *’s are cubic fit, +’s are training data