1 Goodfellow: Chap 5 Machine Learning Basics
Dr. Charles Tappert. The information here, although greatly condensed, comes almost entirely from the chapter content.

2 Chapter Sections
Introduction
1 Learning Algorithms
2 Capacity, Overfitting and Underfitting
3 Hyperparameters and Validation Sets
4 Estimators, Bias and Variance
5 Maximum Likelihood Estimation
6 Bayesian Statistics (not covered in this course)
7 Supervised Learning Algorithms
8 Unsupervised Learning Algorithms (later in course)
9 Stochastic Gradient Descent
10 Building a Machine Learning Algorithm
11 Challenges Motivating Deep Learning

3 Introduction
Machine learning: a form of applied statistics with extensive use of computers to estimate complicated functions
Two central approaches: frequentist estimators and Bayesian inference
Two categories of machine learning: supervised learning and unsupervised learning
Most machine learning algorithms are based on stochastic gradient descent optimization
Focus on building machine learning algorithms

4 1 Learning Algorithms
A machine learning algorithm is an algorithm that learns from data
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E

5 1.1 The Task, T
Machine learning tasks are too difficult to solve with fixed programs designed by humans
Tasks are usually described in terms of how the system should process an example, where an example is a collection of features measured from an object or event
Many kinds of tasks: classification, regression, transcription, anomaly detection, imputation of missing values, etc.

6 1.2 The Performance Measure, P
The performance measure P is usually specific to the task T
For classification tasks, P is usually accuracy: the proportion of examples correctly classified
Here we are interested in the accuracy on data not seen before – a test set

7 1.3 The Experience, E
Most learning algorithms in this book are allowed to experience an entire dataset
Unsupervised learning algorithms learn useful structural properties of the dataset
Supervised learning algorithms experience a dataset of features where each example is associated with a label or target, e.g., a target provided by a teacher
The line between supervised and unsupervised learning is often blurred

8 1.4 Example: Linear Regression
Linear regression takes a vector x as input and predicts the value of a scalar y as its output
Task: predict the scalar y from the vector x
Experience: a design matrix X of examples and a vector of targets y
Performance: mean squared error; minimizing it leads to the normal equations
Solution of the normal equations: w = (XᵀX)⁻¹Xᵀy
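As an illustration of the normal-equations solution, here is a minimal NumPy sketch; the data and variable names are made up for this example and are not from the slides.

import numpy as np

# Illustrative data (not from the slides): 100 examples, 3 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # design matrix, one example per row
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)    # targets with a little noise

# Normal equations: w = (X^T X)^{-1} X^T y.
# np.linalg.solve is preferred over forming the inverse explicitly.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

mse_train = np.mean((X @ w_hat - y) ** 2)      # mean squared error on the training set
print(w_hat, mse_train)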

9 Pseudoinverse Method Intro to Algorithms by Cormen, et al.
General linear regression: assume polynomial basis functions
Minimizing the mean squared error leads to the normal equations
Solution of the normal equations: w = A⁺y = (AᵀA)⁻¹Aᵀy, where A⁺ is the Moore–Penrose pseudoinverse of the design matrix A (assuming A has full column rank)
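Below is a sketch of the pseudoinverse fit for a polynomial model; the helper function and sample data are illustrative, not taken from the Cormen text or the homework.

import numpy as np

def poly_design_matrix(x, degree):
    # Columns [1, x, x^2, ..., x^degree] for a 1-D input vector x.
    return np.vander(x, degree + 1, increasing=True)

# Illustrative 1-D data (not the homework points).
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x - 3.0 * x**2                 # a quadratic target, noise-free for clarity

A = poly_design_matrix(x, degree=2)
w = np.linalg.pinv(A) @ y                      # w = A^+ y, the Moore-Penrose pseudoinverse solution
print(w)                                       # approximately [1, 2, -3]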

10 Pseudoinverse Method Intro to Algorithms by Cormen, et al.

11 Example: Pseudoinverse Method
Simple linear regression homework example: find the least-squares fit given a set of points

12 Example: Pseudoinverse Method

13 Linear Regression MSE(train)
Linear regression example (left: y versus x1 with the fitted line; right: optimization of w1, plotted as MSE(train) versus w1). Figure 5.1 (Goodfellow 2016)

14 2 Capacity, Overfitting and Underfitting
The central challenge in machine learning is to perform well on new, previously unseen inputs, not just on the training inputs
We want the generalization error (test error) to be low as well as the training error
This is what separates machine learning from optimization

15 2 Capacity, Overfitting and Underfitting
How can we affect performance on the test set when we get to observe only the training set?
Statistical learning theory provides some answers
We assume the training and test examples are generated independently from the same data-generating distribution – the i.i.d. assumptions
Under these assumptions, for a fixed model, the expected training error equals the expected test error

16 2 Capacity, Overfitting and Underfitting
Because the model's parameters are chosen using the training set, the expected test error is usually greater than or equal to the expected training error
The factors determining how well a machine learning algorithm performs are its ability to:
Make the training error small
Make the gap between training and test error small
These two factors correspond to the two central challenges in machine learning: underfitting and overfitting

17 2 Capacity, Overfitting and Underfitting
Underfitting and overfitting can be controlled somewhat by altering the model's capacity
Capacity = a model's ability to fit a wide variety of functions
Low-capacity models struggle to fit the training data
High-capacity models can overfit the training data
One way to control the capacity of a model is by choosing its hypothesis space
For example, simple linear vs quadratic regression (see the sketch below)
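To make the hypothesis-space idea concrete, here is a small sketch (with illustrative data) comparing a linear and a quadratic fit to data generated from a quadratic function.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 30)
y = 0.5 + 1.5 * x - 2.0 * x**2 + 0.1 * rng.normal(size=x.size)  # quadratic ground truth

for degree in (1, 2):                           # two hypothesis spaces: linear vs quadratic
    A = np.vander(x, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares fit within this hypothesis space
    mse = np.mean((A @ w - y) ** 2)
    print(f"degree {degree}: training MSE = {mse:.4f}")
# The linear model underfits (higher training MSE); the quadratic model has enough capacity.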

18 Underfitting and Overfitting in Polynomial Estimation
Three fits of y versus x0 illustrating underfitting, appropriate capacity, and overfitting. Figure 5.2 (Goodfellow 2016)

19 2 Capacity, Overfitting and Underfitting
Model capacity can be changed in many ways – above we changed the number of parameters
We can also change the family of functions the model can choose from – called the model's representational capacity
Statistical learning theory provides various means of quantifying model capacity
The most well-known is the Vapnik–Chervonenkis (VC) dimension
The older, non-statistical principle of Occam's razor says to choose the simplest of competing hypotheses

20 2 Capacity, Overfitting and Underfitting
While simpler functions are more likely to generalize (to have a small gap between training and test error), we still want low training error
Training error usually decreases with capacity until it asymptotes to the minimum possible error
Generalization error usually has a U-shaped curve as a function of model capacity

21 Generalization and Capacity
Error versus capacity: the training error and generalization error curves, the generalization gap between them, and the underfitting and overfitting zones on either side of the optimal capacity. Figure 5.3 (Goodfellow 2016)

22 2 Capacity, Overfitting and Underfitting
Non-parametric models can reach arbitrarily high capacity
For example, the capacity of the nearest neighbor algorithm grows with the size of the training set
Training and generalization error vary as the size of the training set varies
Expected test error decreases as the training set size increases

23 Training Set Size
Error (MSE) and optimal capacity (polynomial degree) versus the number of training examples: Bayes error, train and test error for a quadratic model, and train and test error at the optimal capacity. Figure 5.4 (Goodfellow 2016)

24 2.1 No Free Lunch Theorem
Averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points
In other words, no machine learning algorithm is universally better than any other
However, by making good assumptions about the kinds of probability distributions we encounter in practice, we can design learning algorithms that perform well on those distributions
This is the goal of machine learning research

25 2.2 Regularization
The no free lunch theorem implies that we must design our learning algorithm to perform well on a specific task
We do this by building preferences into the algorithm
Regularization is one way to do this, and weight decay is one method of regularization

26 2.2 Regularization
Applying weight decay to linear regression means minimizing J(w) = MSE_train + λwᵀw
This expresses a preference for smaller weights
λ controls the strength of the preference, with λ = 0 imposing no preference
The next figure shows a 9th-degree polynomial trained with various values of λ; the true underlying function is quadratic
A closed-form sketch of this weight-decay fit follows below
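The sketch uses the regularized normal equations w = (AᵀA + λI)⁻¹Aᵀy; the λ values and the data are illustrative.

import numpy as np

def weight_decay_fit(A, y, lam):
    # Minimizes MSE + lam * ||w||^2 via the regularized normal equations.
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

# Degree-9 polynomial features for 1-D inputs, echoing the figure's setup.
x = np.linspace(-1.0, 1.0, 15)
y = 1.0 - 2.0 * x**2 + 0.05 * np.random.default_rng(2).normal(size=x.size)
A = np.vander(x, 10, increasing=True)           # columns 1, x, ..., x^9

for lam in (10.0, 1e-2, 0.0):                   # excessive, medium, and zero weight decay
    w = weight_decay_fit(A, y, lam)
    print(f"lambda={lam:g}: ||w|| = {np.linalg.norm(w):.3f}")
# Larger lambda shrinks the weights, trading a bit of training error for a smoother fit.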

27 Weight Decay
Three fits of the 9th-degree polynomial, plotted as y versus x0: underfitting (excessive λ), appropriate weight decay (medium λ), and overfitting (λ → 0). (Goodfellow 2016)

28 2.2 Regularization
More generally, we can add a penalty called a regularizer to the cost function
Expressing preferences for one function over another is a more general way of controlling a model's capacity than including or excluding members from the hypothesis space
Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error

29 3 Hyperparameters and Validation Sets
Most machine learning algorithms have several settings, called hyperparameters, that control the behavior of the algorithm
Examples:
In polynomial regression, the degree of the polynomial is a capacity hyperparameter
In weight decay, λ is a hyperparameter

30 3 Hyperparameters and Validation Sets
To choose hyperparameters without overfitting the training data, the training data is often split into two disjoint sets, called the training set and the validation set
Typically, about 80% of the training data is used for training and 20% for validation
Cross-validation: for small datasets, k-fold cross-validation partitions the data into k non-overlapping subsets
k trials are run, each training on (k-1)/k of the data and testing on the remaining 1/k
Test error is estimated by averaging the test error over the k trials
A minimal k-fold sketch follows below
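In this sketch the fit and error callables are placeholders that the caller supplies; the names are illustrative.

import numpy as np

def k_fold_cv(X, y, k, fit, error, seed=0):
    # Average held-out error over k non-overlapping folds.
    # fit(X, y) returns a model; error(model, X, y) returns a scalar error.
    indices = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        errors.append(error(model, X[test_idx], y[test_idx]))
    return float(np.mean(errors))

# Example use with a least-squares fit:
#   fit   = lambda X, y: np.linalg.pinv(X) @ y
#   error = lambda w, X, y: np.mean((X @ w - y) ** 2)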

31 4 Estimators, Bias and Variance
4.1 Point Estimation: prediction of some quantity of interest, e.g., by maximum likelihood
4.2 Bias
4.3 Variance and Standard Error

32 4 Estimators, Bias and Variance
4.4 Trading off Bias and Variance to Minimize Mean Squared Error
The most common way to negotiate this trade-off is to use cross-validation
We can also compare the mean squared error (MSE) of the estimators, which incorporates both bias and variance (see the decomposition below)
The relationship between bias and variance is tightly linked to capacity, underfitting, and overfitting
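The trade-off is captured by the standard decomposition of an estimator's mean squared error:

MSE(θ̂) = E[(θ̂ − θ)²] = Bias(θ̂)² + Var(θ̂)

so minimizing MSE requires balancing the bias and variance terms.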

33 Bias and Variance
Bias, variance, and generalization error plotted against capacity, showing the underfitting and overfitting zones on either side of the optimal capacity. (Goodfellow 2016)

34 5 Maximum Likelihood Estimation
Covered in Duda textbook

35 6 Bayesian Statistics Not covered in this course

36 7 Supervised Learning Algorithms
7.1 Probabilistic Supervised Learning: Bayes decision theory
7.2 Support Vector Machines: also covered by Duda – maybe later in the semester
7.3 Other Simple Supervised Learning Algorithms:
the k-nearest neighbor algorithm (a minimal sketch follows below)
decision trees, which break the input space into regions
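A minimal NumPy sketch of the k-nearest-neighbor classifier; the function name is illustrative, and Euclidean distance with majority voting is assumed.

import numpy as np

def knn_predict(X_train, y_train, X_query, k=1):
    # Predict each query label by majority vote over its k nearest training points.
    preds = []
    for q in X_query:
        dists = np.linalg.norm(X_train - q, axis=1)        # Euclidean distance to every training point
        nearest = np.argsort(dists)[:k]                    # indices of the k closest points
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])            # majority vote
    return np.array(preds)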

37 Decision Trees
A decision tree and the regions of input space it carves out, with nodes and regions labeled by binary identifiers. Figure 5.7 (Goodfellow 2016)

38 8 Unsupervised Learning Algorithms
Algorithms that learn from unlabeled examples
8.1 Principal Component Analysis (PCA): covered by Duda – later in the semester (a minimal sketch follows below)
8.2 k-means Clustering
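A minimal PCA sketch via the SVD of the centered data (illustrative; equivalent to an eigendecomposition of the covariance matrix).

import numpy as np

def pca(X, n_components):
    # Project centered data onto its top principal directions, found via SVD.
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]          # top principal directions, one per row
    return X_centered @ components.T        # low-dimensional representation z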

39 Principal Components Analysis
Left: data plotted in the original coordinates (x1, x2); right: the same data in the PCA-rotated coordinates (z1, z2), where the directions of greatest variance align with the axes. Figure 5.8 (Goodfellow 2016)

40 9 Stochastic Gradient Descent
Nearly all of deep learning is powered by stochastic gradient descent (SGD)
SGD is an extension of the gradient descent algorithm introduced in Section 4.3
A minimal minibatch SGD sketch for linear regression follows below
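The learning rate, batch size, and epoch count in this sketch are illustrative defaults, not values from the book.

import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=100, batch_size=8, seed=0):
    # Minimize the MSE of a linear model with minibatch stochastic gradient descent.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))                    # reshuffle the examples each epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)   # gradient of the minibatch MSE
            w -= lr * grad                                 # take a step downhill
    return w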

41 10 Building a Machine Learning Algorithm
Deep learning algorithms use a simple recipe: combine a specification of a dataset, a cost function, an optimization procedure, and a model (wired together in the sketch below)
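A schematic sketch of that recipe for linear regression; the function names are illustrative, and the pieces are kept separate so each one can be swapped out.

import numpy as np

# Recipe pieces (illustrative):
#   dataset   -> (X, y)
#   model     -> prediction = X @ w
#   cost      -> mean squared error (a regularizer could be added)
#   optimizer -> gradient descent on the cost

def mse_gradient(w, X, y):
    return 2.0 / len(y) * X.T @ (X @ w - y)     # gradient of the MSE cost w.r.t. w

def gradient_descent(grad_fn, w0, lr=0.1, steps=500):
    w = w0
    for _ in range(steps):
        w = w - lr * grad_fn(w)                 # optimization procedure
    return w

def build_and_train(X, y):
    # Combine dataset, model/cost (via its gradient), and optimizer.
    return gradient_descent(lambda w: mse_gradient(w, X, y), np.zeros(X.shape[1]))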

42 11 Challenges Motivating Deep Learning
Simple machine learning algorithms work well on many important problems
But they have not solved the central problems of AI, such as recognizing objects
Deep learning was motivated by this failure

43 11.1 Curse of Dimensionality
Machine learning problems become exceptionally difficult when the number of dimensions in the data is high
This phenomenon is known as the curse of dimensionality
The next figure shows 10 regions of interest in 1D, 100 regions in 2D, and 1000 regions in 3D (see the counting sketch below)
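A tiny sketch of the counting behind those numbers, assuming each dimension is split into 10 bins:

for d in (1, 2, 3, 10):
    print(f"{d}D: {10 ** d} cells to cover")    # 10, 100, 1000, ... grows exponentially with d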

44 Curse of Dimensionality
Figure 5.9 (Goodfellow 2016)

45 11.2 Local Constancy and Smoothness Regularization
To generalize well, machine learning algorithms need to be guided by prior beliefs
One of the most widely used "priors" is the smoothness prior, or local constancy prior: the learned function should change little within a small region
Distinguishing O(k) regions in input space requires O(k) examples
For the kNN algorithm, each training example defines at most one region

46 Nearest Neighbor Figure 5.10 (Goodfellow 2016)

47 11.3 Manifold Learning A manifold is a connected region
Manifold learning algorithms reduce the space of interest, for example by assuming that most of n-dimensional Euclidean space consists of invalid inputs and that interesting inputs lie only on a collection of manifolds
Probability distributions over images, text strings, and sounds that occur in real life are highly concentrated
Figure 5.11 shows a manifold, Figure 5.12 shows uniformly sampled (noise) images, and Figure 5.13 shows the manifold structure of a dataset of faces

48 Manifold Learning
Training data concentrated near a one-dimensional manifold (a curve) embedded in two-dimensional space. Figure 5.11 (Goodfellow 2016)

49 Uniformly Sampled Images
Figure 5.12 (Goodfellow 2016)

50 QMUL Dataset Figure 5.13 (Goodfellow 2016)

