Introduction to Predictive Learning


1 Introduction to Predictive Learning
LECTURE SET 2: Basic Learning Approaches and Complexity Control
Electrical and Computer Engineering

2 OUTLINE
2.0 Objectives
2.1 Data Encoding + Preprocessing
2.2 Terminology and Common Learning Tasks
2.3 Basic Learning Approaches
2.4 Generalization and Complexity Control
2.5 Application Example
2.6 Summary

3 2.0 Objectives
- Quantify the notions of explanation, prediction, and model
- Introduce terminology
- Describe basic learning methods
- Explain the importance of complexity control for generalization

4 Learning as Induction
Induction ~ function estimation from data.
Deduction ~ prediction for new inputs.
Together, these two steps form the standard inductive learning setting.

5 2.1 Data Encoding + Preprocessing
Common types of input and output variables (input variables ~ features):
- real-valued
- categorical (class labels)
- ordinal (or fuzzy) variables
Learning tasks by output type:
- Classification: categorical output
- Regression: real-valued output
- Ranking: ordinal output

6 Data Preprocessing and Scaling
Preprocessing is required with observational data (step 4 in the general experimental procedure). Basic preprocessing includes:
- summary univariate statistics: mean, standard deviation, min and max values, range, boxplot, computed independently for each input/output variable
- detection (and removal) of outliers
- scaling of input/output variables (may be necessary for some learning algorithms)
Visual inspection of data is tedious but useful. A sketch of the summary-statistics step follows.
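A minimal sketch of the summary-statistics step, assuming the data is already loaded as a NumPy array (rows = samples, columns = variables); the values below are toy data, not the lecture's dataset:

```python
# Hedged sketch: per-variable summary statistics for a small data matrix.
import numpy as np

X = np.array([[0.50, 12.0],
              [1.20,  9.5],
              [0.90, 11.1],
              [8.00, 40.0]])   # last row looks like a candidate outlier

for j in range(X.shape[1]):
    col = X[:, j]
    print(f"variable {j}: mean={col.mean():.3f} "
          f"std={col.std(ddof=1):.3f} min={col.min():.3f} "
          f"max={col.max():.3f} range={np.ptp(col):.3f}")
```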

7 Animal Body & Brain Weight Data (original, unscaled)

8 Removing Outliers
Remove the outliers (Brachiosaurus, Diplodocus, Triceratops, African elephant, Asian elephant) and plot the data scaled to the [0, 1] range, as sketched below:
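A hedged sketch of these two steps; the species list matches the slide, but the numeric values below are placeholders rather than the actual dataset:

```python
# Drop the named outliers, then min-max scale each variable to [0, 1].
import numpy as np

names = np.array(["Cow", "Brachiosaurus", "Mouse", "Human"])
data  = np.array([[465.0, 423.0],     # body weight (kg), brain weight (g)
                  [87000.0, 154.5],
                  [0.023, 0.4],
                  [62.0, 1320.0]])    # placeholder values for illustration

outliers = {"Brachiosaurus", "Diplodocus", "Triceratops",
            "African elephant", "Asian elephant"}
keep = np.array([n not in outliers for n in names])
X = data[keep]

X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # -> [0, 1]
```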

9 2.2 Terminology and Learning Problems
Input and output variables: learning ~ estimation of a mapping f: x → y.
A loss function L(y, f(x)) measures the quality of prediction:
- standard loss functions are defined for common learning tasks
- the loss function has to be related to application requirements

10 Supervised Learning: Regression
Data in the form (x, y), where
- x is a multivariate input (i.e., a vector)
- y is a univariate real-valued output
Regression loss function (squared error): L(y, f(x)) = (y − f(x))²
→ Estimation of a real-valued function f: x → y

11 Supervised Learning: Classification
Data in the form (x, y), where
- x is a multivariate input (i.e., a vector)
- y is a categorical output (class label)
Loss function for binary classification (0/1 loss): L(y, f(x)) = 0 if y = f(x), 1 otherwise
→ Estimation of an indicator function f: x → y

12 Unsupervised Learning
Data in the form (x), where x is a multivariate input (i.e., a vector).
Goal: data reduction or clustering.
→ Clustering = estimation of a mapping x → c, where c is a cluster index

13 Inductive Learning Setting
The predictive estimator observes samples (x, y) and returns an estimated response ŷ.
Recall 'first-principle' vs. 'empirical' knowledge.
Two modes of inference: identification vs. imitation.
Goal: minimization of the prediction risk R(f) = ∫ L(y, f(x)) dP(x, y)

14 Example: Regression estimation
Given: training data (x_i, y_i), i = 1, …, n.
Find a function f(x) that minimizes the squared error over a large number (N) of future samples:
(1/N) Σ_{j=1}^{N} (y_j − f(x_j))² → min
BUT future data is unknown ~ P(x, y) is unknown.

15 Discussion
The mathematical formulation is useful for quantifying:
- explanation ~ fitting error (on training data)
- generalization ~ prediction error (on future data)
Natural assumptions:
- the future is similar to the past: stationary P(x, y), i.i.d. data
- a discrepancy measure (loss function) is given, e.g., MSE
What if these assumptions do not hold?

16 OUTLINE
2.0 Objectives
2.1 Data Encoding + Preprocessing
2.2 Terminology and Common Learning Tasks
2.3 Basic Learning Approaches
    - Parametric Modeling
    - Non-parametric Modeling
    - Data Reduction
2.4 Generalization and Complexity Control
2.5 Application Example
2.6 Summary

17 Parametric Modeling
Given training data (x_i, y_i), i = 1, …, n:
(1) Specify a parametric model
(2) Estimate its parameters (via fitting to the data)
Example: linear regression f(x) = (w · x) + b, sketched below.
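A minimal sketch of the two steps for linear regression, assuming least-squares fitting (the data and coefficients below are toy values):

```python
# (1) Specify the model f(x) = (w . x) + b; (2) estimate w, b by
# least squares on the training data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 2))                 # 20 samples, 2 features
y = (3.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5
     + 0.1 * rng.standard_normal(20))               # noisy linear target

A = np.hstack([X, np.ones((len(X), 1))])            # append 1s column for b
coef, *_ = np.linalg.lstsq(A, y, rcond=None)        # least-squares fit
w, b = coef[:-1], coef[-1]
```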

18 Parametric Modeling: Classification
Given training data (x_i, y_i) with categorical labels y_i:
- estimate a linear decision boundary, or
- estimate a third-order (polynomial) decision boundary
Both fits are sketched below.
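A hedged sketch of both fits; the slide does not prescribe a fitting method, so logistic regression on (polynomial) features is used here as one common choice, with toy data:

```python
# Linear boundary: logistic regression on raw inputs.
# Third-order boundary: same fit on all monomials up to degree 3.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(100, 2))               # toy 2-d inputs
y = (X[:, 1] > X[:, 0] ** 2).astype(int)            # toy curved boundary

linear_clf = LogisticRegression().fit(X, y)         # linear decision boundary

X3 = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)
cubic_clf = LogisticRegression().fit(X3, y)         # third-order boundary
```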

19 Non-Parametric Modeling
Given training data, estimate the model at a given point x as a 'local average' of the training data.
Note: one needs to define 'local' and 'average'.
Example: k-nearest-neighbors regression, sketched below.
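A minimal NumPy sketch of k-nearest-neighbors regression for one-dimensional inputs ('local' = the k closest training points, 'average' = the mean of their outputs):

```python
import numpy as np

def knn_regress(x, X_train, y_train, k):
    dist = np.abs(X_train - x)          # 1-d inputs; use a norm in general
    nearest = np.argsort(dist)[:k]      # indices of the k closest samples
    return y_train[nearest].mean()      # 'local average' of their outputs

X_train = np.array([0.1, 0.3, 0.5, 0.9])
y_train = np.array([1.0, 2.0, 1.5, 0.0])
print(knn_regress(0.4, X_train, y_train, k=2))   # averages y at x=0.3, 0.5
```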

20 Example of kNN Regression
Ten training samples from the target function shown on the slide; k-nn regression estimates with k = 1 and k = 4:

21 Data Reduction Approach
Given training data, estimate the model as a 'compact encoding' of the data.
Note: 'compact' ~ number of bits needed to encode the model.
Example: piece-wise linear regression. How many parameters are needed for a two-linear-component model?

22 Data Reduction Approach (cont'd)
Data reduction approaches are commonly used for unsupervised learning tasks.
Example: clustering. The training data is encoded by 3 points (cluster centers).
Issues: How to find the centers? How to select the number of clusters? One common choice is sketched below.
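A hedged sketch of one common way to find the centers (plain k-means with k = 3); the slide itself leaves both questions open:

```python
# k-means: alternate nearest-center assignment and center updates.
import numpy as np

def kmeans(X, k=3, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initial guesses
    for _ in range(iters):
        # assign each sample to its nearest center
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        # move each (non-empty) center to the mean of its assigned samples
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return centers, labels
```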

23 Standard Inductive Setting
Model estimation ~ inductive step, i.e., estimating a function from data samples.
Prediction ~ deductive step.
This is the (standard) inductive learning setting.
Discussion: Which of the 3 modeling approaches follow the standard inductive learning setting? How do humans perform inductive inference?

24 OUTLINE
2.0 Objectives
2.1 Data Encoding + Preprocessing
2.2 Terminology and Common Learning Tasks
2.3 Basic Learning Approaches
2.4 Generalization and Complexity Control
    - Prediction Accuracy (generalization)
    - Complexity Control: examples
    - Resampling
2.5 Application Example
2.6 Summary

25 Prediction Accuracy
All modeling approaches implement 'data fitting' ~ explaining the data, BUT the true goal is prediction.
- Model explanation ~ fitting error, training error, empirical risk
- Prediction accuracy ~ generalization, test error, prediction risk
The trade-off between training and test error is controlled by 'model complexity'.

26 Explanation vs Prediction
(a) Classification (b) Regression

27 Complexity Control: parametric modeling
Consider regression estimation with ten training samples, fitting linear and second-order polynomial models:

28 Complexity Control: local estimation
Consider regression estimation: ten training samples from the target function shown on the slide, using k-nn regression with k = 1 and k = 4:

29 Complexity Control (cont'd)
The complexity of the admissible models affects generalization (for future data). Specific complexity indices:
- parametric models: ~ number of parameters
- local modeling: size of the local region
- data reduction: number of clusters
Complexity control = choosing good complexity (~ good generalization) for the given (training) data.

30 How to Control Complexity?
Two approaches: analytic and resampling.
Analytic criteria estimate the prediction error as a function of the fitting error and model complexity. For regression problems, the estimate takes the form R_est = r(p, n) · R_emp, with example penalization factors:
- Schwartz Criterion: r(p, n) = 1 + (ln n / 2) · p / (1 − p)
- Akaike's FPE: r(p, n) = (1 + p) / (1 − p)
where p = DoF/n, n ~ sample size, DoF ~ degrees of freedom. A small numerical sketch follows.
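A small numerical sketch of the analytic approach, using the two penalization factors above (the sample size, DoF, and fitting error are illustrative toy numbers):

```python
# Estimated prediction risk = r(p, n) * empirical (fitting) risk, p = DoF/n.
import math

def fpe(p):                       # Akaike's final prediction error
    return (1 + p) / (1 - p)

def schwartz(p, n):               # Schwartz criterion
    return 1 + (math.log(n) / 2) * p / (1 - p)

n, dof, emp_risk = 25, 5, 0.07    # toy sample size, DoF, fitting error
p = dof / n
print("FPE estimate:     ", fpe(p) * emp_risk)
print("Schwartz estimate:", schwartz(p, n) * emp_risk)
```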

31 Resampling
Split the available data into 2 sets: training + validation.
(1) Use the training set for model estimation (via data fitting)
(2) Use the validation data to estimate the prediction error of the model
(3) Change the model complexity index and repeat (1) and (2)
(4) Select the final model providing the lowest (estimated) prediction error
BUT the results are sensitive to the data splitting, as sketched below.
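A minimal sketch of the split-sample procedure, using polynomial degree as the complexity index (toy data; the target function is illustrative):

```python
# Fit on the training split, score on the validation split, repeat
# over complexity values; the result depends on the split chosen.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(30)

idx = rng.permutation(30)
train, val = idx[:20], idx[20:]                  # one fixed 20/10 split

for m in range(1, 6):                            # complexity index = degree m
    c = np.polyfit(x[train], y[train], deg=m)
    err = np.mean((np.polyval(c, x[val]) - y[val]) ** 2)
    print(f"degree {m}: validation MSE {err:.4f}")
```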

32 K-fold cross-validation
1. Divide the training data Z into k randomly selected, disjoint subsets {Z1, Z2, …, Zk} of size n/k.
2. For each 'left-out' validation set Zi:
   - use the remaining data to estimate the model
   - estimate the prediction error R_i on Zi
3. Estimate the average prediction risk as R_cv = (1/k) · (R_1 + … + R_k)
A sketch of the procedure follows.
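A sketch of the procedure as a reusable function, assuming squared-error loss; `fit` and `predict` are caller-supplied callables, not part of the slide:

```python
import numpy as np

def kfold_risk(x, y, fit, predict, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)               # k disjoint subsets Z1..Zk
    risks = []
    for i in range(k):
        val = folds[i]                           # 'left-out' validation set Zi
        trn = np.hstack([folds[j] for j in range(k) if j != i])
        model = fit(x[trn], y[trn])              # estimate model on the rest
        risks.append(np.mean((predict(model, x[val]) - y[val]) ** 2))
    return np.mean(risks)                        # R_cv = (1/k) * sum of R_i
```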

33 Example of model selection (1)
25 samples are generated from a target function (with x uniformly sampled in [0, 1]) plus noise ~ N(0, 1). Regression is estimated using polynomials of degree m = 1, 2, …, 10. Polynomial degree m = 5 is chosen via 5-fold cross-validation. The curve on the slide shows the polynomial model, along with training (*) and validation (*) data points, for one partitioning.

m    Estimated R via cross-validation
1    0.1340
2    0.1356
3    0.1452
4    0.1286
5    0.0699
6    0.1130
7    0.1892
8    0.3528
9    0.3596
10   0.4006
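A hedged reconstruction of this experiment; the target function is not legible in the transcript, so sin(2πx) is used here purely as a stand-in, and the printed risks will not match the table above:

```python
# 25 noisy samples; compare polynomial degrees m = 1..10 by 5-fold CV.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 25)                              # x uniform in [0, 1]
y = np.sin(2 * np.pi * x) + rng.standard_normal(25)    # noise ~ N(0, 1)

folds = np.array_split(rng.permutation(25), 5)
for m in range(1, 11):
    errs = []
    for i in range(5):
        val = folds[i]
        trn = np.hstack(folds[:i] + folds[i + 1:])
        c = np.polyfit(x[trn], y[trn], deg=m)
        errs.append(np.mean((np.polyval(c, x[val]) - y[val]) ** 2))
    print(f"m={m:2d}  estimated R={np.mean(errs):.4f}")
```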

34 Example of model selection (2)
The same data set, but estimated using k-nn regression. The optimal value k = 7 is chosen via 5-fold cross-validation model selection. The curve on the slide shows the k-nn model, along with training (*) and validation (*) data points, for one partitioning.

k    Estimated R via cross-validation
1    0.1109
2    0.0926
3    0.0950
4    0.1035
5    0.1049
6    0.0874
7    0.0831
8    0.0954
9    0.1120
10   0.1227

35 Test Error
The previous example gives two models, estimated from the same data by different methods. Which model is better, i.e., which has lower test error? Note: the polynomial model has the lower cross-validation error.
Double resampling (for estimating test error): partition the data into learning / validation / test sets. The test data should never be used for model estimation.

36 Application Example: Haberman's Survival Data Set
- 5-year survival of female patients (following surgery for breast cancer)
- 306 cases (patients)
- inputs: age, number of positive axillary nodes
Method: k-NN classifier (odd k values only). Note: input values are pre-scaled to [0, 1].
Model selection via leave-one-out (LOO) cross-validation: the optimal k = 45 yields the minimum LOO error of 22.75%. A sketch of this procedure follows.
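A sketch of this model selection with scikit-learn; the local file name and column indices are assumptions about how the UCI data is stored (columns: age, year of operation, positive axillary nodes, survival status):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

data = np.loadtxt("haberman.data", delimiter=",")   # assumed local UCI copy
X, y = data[:, [0, 2]], data[:, 3]                  # inputs: age, positive nodes
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # pre-scale to [0, 1]

# LOO error for each odd k; pick the k with the smallest error.
errors = {k: 1 - cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X, y, cv=LeaveOneOut()).mean()
          for k in range(1, 100, 2)}
best_k = min(errors, key=errors.get)
print(f"optimal k={best_k}, LOO error {errors[best_k]:.2%}")
```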

37 Model selection for k-NN classifier via cross-validation; optimal decision boundary shown for k = 45.

k    LOO error (%)
1    42
3    30.67
7    26
15   24.33
…    …
45   21.67
47   22.33
51   23
53   —
57   24
61   25
99   26.33

38 Estimating test error of a method
For the same example (Haberman's data), what is the true test error of the k-NN method? Use double resampling: 5-fold cross-validation to estimate the test error, and LOO cross-validation to estimate the optimal k on each training fold.
Note: the optimal k values differ across folds, and the average test error is larger than the average validation error.

Fold   Optimal k   LOO error   Test error
1      11          22.5%       28.33%
2      37          25%         13.33%
3      —           23.33%      16.67%
4      33          24.17%      —
5      35          18.75%      35%
mean               22.75%      23.67%
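A sketch of double resampling (nested cross-validation) with scikit-learn; the random data below is only a stand-in for Haberman's set so the sketch runs:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, LeaveOneOut, cross_val_score

rng = np.random.default_rng(4)
X = rng.uniform(size=(306, 2))       # stand-in inputs (use the real data here)
y = rng.integers(1, 3, size=306)     # stand-in 1/2 survival labels

# Inner loop: LOO cross-validation selects k on each training fold.
inner = GridSearchCV(KNeighborsClassifier(),
                     {"n_neighbors": list(range(1, 100, 2))},
                     cv=LeaveOneOut())
# Outer loop: 5-fold CV estimates the test error of the whole method.
outer_acc = cross_val_score(inner, X, y, cv=5)
print("estimated test error:", 1 - outer_acc.mean())
```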

39 Summary and Discussion
- Learning as function estimation (from data) ~ the standard inductive learning setting
- Common types of learning problems: classification, regression, clustering
- Non-standard learning settings also exist
- Model estimation via data fitting (ERM)
- Model complexity and generalization:
    - how to measure model complexity
    - several complexity (tuning) parameters

