Introduction to Predictive Learning


1 Introduction to Predictive Learning
LECTURE SET 5 Statistical Methods Electrical and Computer Engineering

2 OUTLINE
Objectives: introduce statistical terminology, methodology and motivation; taxonomy of methods; describe several representative statistical methods; interpretation of statistical methods under predictive learning
Statistical Methodology and Basic Methods
Taxonomy of Nonlinear Methods
Decision Trees
Additive Modeling and Projection Pursuit
Greedy Feature Selection
Signal Denoising
Summary and discussion

3 Methodology and Motivation
Original motivation: understand how the inputs affect the output → a simple model involving a few variables. Regression modeling: Response = model + error, i.e. y = f(x) + noise, where f(x) = E(y|x). Linear regression: f(x) = wx + b. Model parameters are estimated via least squares.

4 OLS Linear Regression
OLS solution: first, center the x- and y-values, then calculate the slope and bias.
Example: SBP vs. Age. What is the meaning of the bias term?
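A minimal MATLAB sketch of this centered OLS computation (the Age/SBP numbers below are made up for illustration, not the slide's data set):
% Centered OLS for y = w*x + b (hypothetical Age/SBP values, illustration only)
x = [25 30 35 40 45 50 55 60]';          % Age (years)
y = [118 121 125 130 134 141 146 152]';  % SBP (mm Hg)
xc = x - mean(x);  yc = y - mean(y);     % first, center x and y
w = sum(xc .* yc) / sum(xc .^ 2);        % slope estimated from centered data
b = mean(y) - w * mean(x);               % bias ~ predicted response at x = 0
yhat = w * x + b;                        % fitted regression line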

5 Statistical Assumptions
Gaussian noise: zero mean, constant variance
Known (linear) dependency
i.i.d. data samples (ensured by the protocol for data collection) – may not hold for observational data
Do these assumptions hold for the applications considered?

6 Multivariate Linear Regression
Parameterization: f(x, w) = w · x. Matrix form (for centered variables): f = Xw. ERM solution: minimize the empirical risk (1/n) Σ (y_i − w · x_i)². Analytic solution (when d < n): w = (X'X)⁻¹ X'y.
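A sketch of the analytic least-squares solution on synthetic, centered data (the backslash operator is used instead of explicit matrix inversion):
n = 100;  d = 5;                          % assumed sample size and input dimension (d < n)
X = randn(n, d);                          % centered input matrix, one sample per row
y = X * randn(d, 1) + 0.1 * randn(n, 1);  % synthetic responses from a random linear model
w_hat = (X' * X) \ (X' * y);              % ERM (least-squares) solution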

7 Linear Ridge Regression
When d > n, penalize large parameter values: minimize the empirical risk plus a penalty λ‖w‖². The regularization parameter λ is estimated via resampling. Example: 10 training samples uniformly sampled in the [0,1] range, additive Gaussian noise with st. deviation 0.5. Apply standard linear least squares; then apply ridge regression using the optimal λ.
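A sketch of ridge regression on the slide's setup (10 samples uniform in [0,1], Gaussian noise with st. deviation 0.5); the target function, the polynomial features, and the value of lambda are assumptions for illustration, since in practice lambda is chosen via resampling:
n = 10;  x = rand(n, 1);                    % 10 training inputs, uniform in [0,1]
y = sin(2*pi*x) + 0.5 * randn(n, 1);        % hypothetical target plus Gaussian noise (std 0.5)
X = [x x.^2 x.^3 x.^4 x.^5 x.^6 x.^7];      % polynomial features, so that d approaches n
Xc = X - ones(n,1) * mean(X);               % center inputs and output
yc = y - mean(y);                           %   (so the bias term is not penalized)
lambda = 0.1;                               % regularization parameter (assumed; normally via resampling)
w_ols   = (Xc' * Xc) \ (Xc' * yc);                            % standard linear least squares
w_ridge = (Xc' * Xc + lambda * eye(size(Xc,2))) \ (Xc' * yc); % ridge (shrunken) coefficients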

8 Example cont’d Target function
Coefficient shrinkage: how do the w's depend on lambda? Can this be used for feature selection?

9 Statistical Methodology for classification
For classification, the output y is a (binary) class label (0 or 1). Probabilistic modeling starts with known distributions. Bayes-optimal decision rule for known distributions: assign x to the class with the larger posterior probability. Statistical approach ~ ERM: the parametric form of the class distributions is known/assumed → the analytic form of the decision rule D(x) is known, and its parameters are estimated from the available training data. Issue: which loss function is used for statistical modeling?

10 Gaussian class distributions

11 Logistic Regression
Terminology may be confusing (for non-statisticians). For Gaussian class distributions (with equal covariances), the log-odds ln[P(y=1|x)/P(y=0|x)] is a linear function in x. Logistic regression estimates the probabilistic model P(y=1|x) = s(wx + b); equivalently, logistic regression estimates the linear function wx + b, where the sigmoid function is s(t) = 1/(1 + exp(−t)).

12 Logistic Regression
Example: interpretation of a logistic regression model for the probability of death from heart disease during a 10-year period, for middle-aged patients, as a function of
- Age (years, minus 50) ~ x1
- Gender, male/female (0/1) ~ x2
- Cholesterol level, in mmol/L (minus 5) ~ x3
The probability of the binary outcome ~ the risk (of death).
Model interpretation:
- increasing Age is associated with increased risk of death
- females have lower risk of death (than males)
- increasing Cholesterol level → increased risk of death

13 Estimating Logistic Regression
Given: training data (x_i, y_i), i = 1, ..., n. How to estimate the model parameters (w, b)? Maximum Likelihood ~ minimize the negative log-likelihood −Σ [y_i log p_i + (1 − y_i) log(1 − p_i)], where p_i = s(w · x_i + b) → non-linear optimization. The solution (w*, b*) gives the estimated model, which can be used for prediction and interpretation (for prediction, the model should be combined with costs).
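A minimal sketch of maximum-likelihood estimation for logistic regression via gradient descent on the negative log-likelihood (synthetic data; the learning rate and the number of iterations are arbitrary choices):
sigmoid = @(t) 1 ./ (1 + exp(-t));
n = 200;  d = 3;
X = randn(n, d);                                      % synthetic inputs
y = double(rand(n,1) < sigmoid(X * [1; -2; 0.5]));    % synthetic 0/1 class labels
w = zeros(d, 1);  b = 0;  eta = 0.1;                  % initial parameters and step size
for iter = 1:2000
    p = sigmoid(X * w + b);                           % current model: P(y = 1 | x)
    w = w - eta * (X' * (p - y)) / n;                 % gradient step on average neg. log-likelihood
    b = b - eta * mean(p - y);
end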

14 Statistical Modeling Strategy
Data-analytic models are used for understanding the importance of the inputs in explaining the output. ERM approach: a statistician selects (manually) a few 'good' variables and several models are estimated; the final model is selected manually ~ a heuristic implementation of Occam's razor. Linear regression and logistic regression both estimate E(y|x), since for classification E(y|x) = P(y = 1|x).

15 Classification via multiple-response regression
How to use (nonlinear) regression software for classification? Classification methods estimate model parameters via minimization of squared error → one can use regression software with minor modifications: (1) for J class labels, use 1-of-J encoding, e.g. for J = 4 classes, class 2 ~ (0, 1, 0, 0) (4 outputs in regression); (2) estimate 4 regression models from the training data (usually all regression models use the same parameterization).

16 Classification via Regression
Training ~ regression estimation using 1-of-J encoding Prediction (classification) ~ based on the max response value of estimated outputs
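A sketch of this scheme with linear regression used for each of the J outputs (synthetic data; sub2ind builds the 1-of-J indicator matrix):
n = 150;  d = 4;  J = 3;
X = randn(n, d);  labels = mod((1:n)', J) + 1;    % synthetic inputs with class labels 1..J
Y = zeros(n, J);
Y(sub2ind([n J], (1:n)', labels)) = 1;            % training: 1-of-J encoding of class labels
W = [X ones(n,1)] \ Y;                            % J regression models, same (linear) parameterization
scores = [X ones(n,1)] * W;                       % estimated outputs for each class
[~, predicted] = max(scores, [], 2);              % prediction: class with the max response value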

17 OUTLINE Objectives Statistical Methodology and Basic Methods
Taxonomy of Nonlinear Methods - model parameterization (representation) - nonlinear optimization strategies Decision Trees Additive Modeling and Projection Pursuit Greedy Feature Selection Signal Denoising Summary and discussion

18 Taxonomy of Nonlinear Methods
Main idea: improve the flexibility of classical linear methods ~ use a flexible (nonlinear) parameterization. Dictionary parameterization ~ SRM structure. Two interrelated issues: parameterization (of nonlinear basis functions) and the optimization method used. These two factors define the taxonomy of methods.

19 Taxonomy of nonlinear methods
Decision tree methods: - piecewise-constant model - greedy optimization Additive methods: - backfitting method for model estimation Gradient-descent methods: - popular in neural network learning Penalization methods Note: all methods implement SRM structures

20 Dictionary representation Two possibilities
Linear (non-adaptive) methods ~ predetermined (fixed) basis functions → only the parameters have to be estimated, via standard optimization methods (linear least squares). Examples: linear regression, polynomial regression, linear classifiers, quadratic classifiers. Nonlinear (adaptive) methods ~ basis functions depend on the training data. Possibilities: basis functions that are nonlinear in their parameters, or feature selection (e.g. wavelet denoising).

21 Example of Nonlinear Parameterization
Basis functions of the form g(x) = s(w·x + b), where s(t) = 1/(1 + exp(−t)) is the sigmoid aka logistic function - commonly used in artificial neural networks - a combination of sigmoids ~ universal approximator

22 Example of Nonlinear Parameterization
Basis functions of the form g(x) = f(‖x − c‖ / a), i.e. Radial Basis Functions (RBF) - RBF adaptive parameters: center c, width a - commonly used in artificial neural networks - a combination of RBFs ~ universal approximator
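A small sketch contrasting the two parameterizations on a 1D input; the parameter values (slope/offset for the sigmoid, center/width for the Gaussian-type RBF) are arbitrary:
sigmoid_bf = @(x, w, b) 1 ./ (1 + exp(-(w*x + b)));      % sigmoid (logistic) basis function
rbf_bf     = @(x, c, a) exp(-((x - c).^2) / (2*a^2));    % Gaussian-type RBF: center c, width a
x  = linspace(-3, 3, 200)';
g1 = sigmoid_bf(x, 2, -1);                 % adaptive parameters: slope and offset
g2 = rbf_bf(x, 0.5, 0.7);                  % adaptive parameters: center and width
f  = 1.5*g1 - 0.8*g2;                      % a model = weighted combination of such basis functions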

23 Neural Network Representation
MLP or RBF networks - dimensionality reduction - universal approximation property

24 Example of Nonlinear Parameterization
Adaptive Partitioning (CART): each basis function is (the indicator of) a rectangular region in x-space, and each basis function depends on 2d parameters. Since the regions are disjoint, the parameters w can be easily estimated (for regression) as the average of the y-values of the training samples in each region. Estimating the basis functions ~ adaptive partitioning.

25 Example of CART Partitioning
CART Partitioning in 2D space - each region ~ basis function - piecewise-constant estimate of y (in each region) - number of regions ~ model complexity

26 OUTLINE Objectives Statistical Methodology and Basic Methods
Taxonomy of Nonlinear Methods Decision Trees - Regression trees (CART) - Boston Housing example - Classification trees (CART) Additive Modeling and Projection Pursuit Greedy Feature Selection Signal Denoising Summary and discussion

27 Greedy Optimization Strategy
Minimization of empirical risk for regression problems, where the model is a sum of basis functions estimated sequentially, one at a time; i.e., the training data is represented as structure (model fit) + noise (residual):
(1) DATA = (model) FIT 1 + RESIDUAL 1
(2) RESIDUAL 1 = FIT 2 + RESIDUAL 2
and so on. The final model for the data is MODEL = FIT 1 + FIT 2 + ...
Advantages: computational speed, interpretability.
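A sketch of this greedy strategy using a fixed dictionary of harmonic basis functions: at each step the single basis function that best fits the current residual is added (a matching-pursuit-style illustration on assumed data, not a specific method from these slides):
x = linspace(0, 1, 100)';  y = sin(2*pi*x) + 0.2*randn(100,1);          % synthetic data
dict = [sin(2*pi*x) cos(2*pi*x) sin(4*pi*x) cos(4*pi*x) sin(6*pi*x)];   % candidate basis functions
residual = y;  model = zeros(size(y));
for step = 1:3                                        % DATA = FIT1 + FIT2 + ... + RESIDUAL
    rss = zeros(1, size(dict,2));  coef = zeros(1, size(dict,2));
    for j = 1:size(dict,2)                            % fit each b.f. to the current residual
        coef(j) = (dict(:,j)' * residual) / (dict(:,j)' * dict(:,j));
        rss(j)  = sum((residual - coef(j)*dict(:,j)).^2);
    end
    [~, jbest] = min(rss);                            % greedy choice: best single fit
    model    = model    + coef(jbest) * dict(:,jbest);
    residual = residual - coef(jbest) * dict(:,jbest);% next step fits the new residual
end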

28 Regression Trees (CART)
Minimization of empirical risk (squared error) via partitioning of the input space into disjoint regions, with a constant estimate of y in each region. Example of CART partitioning for a function of 2 inputs.

29 Growing CART tree Recursive partitioning for estimating regions (via binary splitting). Initial model ~ the whole input domain is one region, which is divided into two regions. A split is defined by one of the inputs (k) and a split point s. The optimal values of (k, s) are chosen so that splitting a region into two daughter regions minimizes the empirical risk. Issues: efficient implementation (selection of the optimal split point); optimal tree size ~ model selection (complexity control). Advantages and limitations.
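A sketch of one splitting step: an exhaustive search over the input index k and split point s that minimizes the empirical risk (squared error) of the two daughter regions, assuming candidate split points are the observed coordinate values (see the next slide):
X = rand(50, 2);  y = sin(2*pi*X(:,1)) + 0.3*randn(50,1);     % synthetic 2-input regression data
best_risk = inf;  best_k = 0;  best_s = 0;
for k = 1:size(X, 2)
    for s = unique(X(:,k))'                    % candidate split points for input k
        left = X(:,k) <= s;  right = ~left;
        if ~any(left) || ~any(right), continue; end
        risk = sum((y(left)  - mean(y(left))).^2) + ...   % piecewise-constant fit in each
               sum((y(right) - mean(y(right))).^2);       %   daughter region
        if risk < best_risk
            best_risk = risk;  best_k = k;  best_s = s;   % remember the best (k, s)
        end
    end
end
% The region is split on input best_k at best_s; the procedure then recurses on each
% daughter region until the number of samples falls below Splitmin.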

30 Valid Split Points for CART
How to choose valid points (for binary splitting)? Valid points ~ combinations of the coordinate values of the training samples, i.e. for 4 bivariate samples → 16 points used as candidates for splitting.

31 CART Modeling Strategy
Growing the CART tree ~ reducing MSE (for regression). Splitting a parent region is allowed only if its number of samples exceeds a certain threshold (Splitmin, user-defined). Tree pruning ~ reducing tree size by selectively combining adjacent leaf nodes (regions). Pruning implements minimization of the penalized MSE R(T) = MSE(T) + α·|T|, where MSE(T) is the tree's mean squared error, |T| is the number of leaf nodes (regions), and the parameter α is estimated via resampling.

32 Example: Boston Housing data set
Objective: to predict the value of homes in the Boston area.
Data set ~ 506 samples total.
Output: value of owner-occupied homes (in $1,000's).
Inputs: 13 variables
1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town
13. LSTAT: % lower status of the population

33 Example CART trees for Boston Housing
1.Training set: 450 samples Splitmin =100 (user-defined)

34 Example CART trees for Boston Housing
2.Training set: 450 samples Splitmin =50 (user-defined)

35 Example CART trees for Boston Housing
3.Training set: 455 samples Splitmin =100 (user-defined) Note: CART model is sensitive to training samples (vs model 1)

36 Classification Trees (CART)
Binary classification example (2D input space). The algorithm is similar to regression trees (tree growth via binary splitting + model selection), BUT uses a different empirical loss function.

37 Loss functions for Classification Trees
Misclassification loss: poor practical choice. Other loss (cost) functions for splitting nodes: for a J-class problem, a cost function is a measure of node impurity, where p(i|t) denotes the probability of class i samples at node t. Possible cost functions:
Misclassification: Q(t) = 1 − max_i p(i|t)
Gini function: Q(t) = 1 − Σ_i p(i|t)²
Entropy function: Q(t) = − Σ_i p(i|t) log p(i|t)

38 Classification Trees: node splitting
Minimizing the cost function = maximizing the decrease in node impurity. Assume node t is split into two regions (Left & Right) on variable k at a split point s. Then the decrease in impurity caused by this split is ΔQ(s, t) = Q(t) − p_L·Q(t_L) − p_R·Q(t_R), where p_L and p_R are the fractions of samples at node t going to the left and right daughter nodes. Misclassification cost ~ discontinuous (due to the max) - may give sub-optimal solutions (poor local minima) - does not work well with greedy optimization.
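A sketch of the three impurity measures and of the decrease-in-impurity computation for one candidate split; the class-probability vectors and node fractions below are hypothetical:
miscls = @(p) 1 - max(p);                   % misclassification impurity
gini   = @(p) 1 - sum(p.^2);                % Gini impurity
entrpy = @(p) -sum(p .* log(p + eps));      % entropy impurity (eps guards against log(0))
p_t = [0.5 0.5];                            % parent node t: class probabilities p(i|t)
p_L = [0.75 0.25];  w_L = 0.5;              % left daughter node and its fraction of samples
p_R = [0.25 0.75];  w_R = 0.5;              % right daughter node and its fraction of samples
dGini = gini(p_t) - w_L*gini(p_L) - w_R*gini(p_R);          % decrease in Gini impurity
dEntr = entrpy(p_t) - w_L*entrpy(p_L) - w_R*entrpy(p_R);    % decrease in entropy impurity
% The split (k, s) chosen at a node is the one maximizing this decrease.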

39 Using different cost functions for node splitting
(a) Decrease in impurity: misclassification = 0.25, gini = …, entropy = 0.13
(b) Decrease in impurity: misclassification = 0.25, gini = …, entropy = 0.22
Split (b) is better, as it leads to a smaller final tree.

40 Details of calculating decrease in impurity
Consider split (a) Misclassification Cost Gini Cost

41 MATLAB code (splitmin = 10)
IRIS Data Set: 150 samples of flowers from the iris species setosa, versicolor, and virginica (3 classes). From each species there are 50 observations of sepal length, sepal width, petal length, and petal width, in cm. This data set is from classical statistics.
MATLAB code (splitmin = 10):
load fisheriris;                              % loads meas (150x4 measurements) and species (class labels)
t = treefit(meas, species);                   % grow a classification tree (older Statistics Toolbox function)
treedisp(t,'names',{'SL' 'SW' 'PL' 'PW'});    % display the tree using short variable names

42 Sensitivity to random training data:
Consider the IRIS data set where every other sample is used (75 samples total, 25 per class). The CART tree is then formed using the same MATLAB software (splitmin = 10, Gini loss function).

43 Decision Trees: summary
Advantages: speed, interpretability, handling of different types of input variables.
Limitations: sensitivity to correlated inputs and to affine transformations (of the input variables); general instability of trees.
Variations: ID3 (in machine learning), linear CART.

44 OUTLINE Objectives Statistical Methodology and Basic Methods
Taxonomy of Nonlinear Methods Decision Trees Additive Modeling and Projection Pursuit Greedy Feature Selection Signal Denoising Summary and discussion

45 Additive Modeling
Additive model parameterization for regression: f(x) = g1(x1) + g2(x2) + ... + gd(xd) + b, where each g_j is an unknown (smooth) univariate function. Each univariate component is estimated separately. The same additive parameterization can be used for classification. Backfitting is a greedy optimization approach for estimating the basis functions sequentially.

46 By fixing all but one basis function, the empirical risk (MSE) can be decomposed as R_emp = (1/n) Σ_i [ (y_i − Σ_{j≠k} g_j(x_ij)) − g_k(x_ik) ]².
Each basis function is estimated via the iterative backfitting algorithm (until some stopping criterion is met). Note: the partial residual y − Σ_{j≠k} g_j can be interpreted as the response variable for estimating the k-th component.

47 Backfitting Algorithm: Example
Consider regression estimation of a function of two variables of the form y = g1(x1) + g2(x2) + noise from training data. Backfitting method: (1) estimate g1 for fixed g2; (2) estimate g2 for fixed g1; iterate the above two steps. Estimation is via minimization of the empirical risk (MSE).

48 Backfitting Algorithm(cont’d)
Estimation of g1 via minimization of MSE: this is a univariate regression problem of estimating g1 from n data points (x_i1, r_i), where r_i = y_i − g2(x_i2) is the partial residual. It can be estimated by smoothing (e.g. kNN regression). Estimation of g2 (second iteration) proceeds in a similar manner, via minimization of the corresponding MSE.
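A sketch of backfitting for the two-variable additive model, using a simple k-nearest-neighbor smoother as the univariate estimator (the target function, k = 10, and the number of backfitting iterations are assumptions for illustration):
n = 200;  k = 10;
x1 = rand(n,1);  x2 = rand(n,1);
y  = sin(2*pi*x1) + x2.^2 + 0.1*randn(n,1);     % synthetic additive target plus noise
g1 = zeros(n,1);  g2 = zeros(n,1);
for iter = 1:10                                 % backfitting iterations
    r1 = y - g2;                                % partial residual = response for estimating g1
    for i = 1:n                                 % kNN smoothing of r1 against x1
        [~, idx] = sort(abs(x1 - x1(i)));
        g1(i) = mean(r1(idx(1:k)));
    end
    g1 = g1 - mean(g1);                         % keep the component centered
    r2 = y - g1;                                % partial residual = response for estimating g2
    for i = 1:n                                 % kNN smoothing of r2 against x2
        [~, idx] = sort(abs(x2 - x2(i)));
        g2(i) = mean(r2(idx(1:k)));
    end
    g2 = g2 - mean(g2);
end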

49 Projection Pursuit regression
Projection Pursuit is an additive model f(x) = Σ_m g_m(w_m · x) + b, where the basis functions g_m are univariate functions of the projections z_m = w_m · x of the input x onto the directions w_m. A sum of such nonlinear functions of projections can approximate any nonlinear function. See the example below.

50

51 Projection Pursuit regression
Projection Pursuit is an additive model of projections (same parameterization as above). The backfitting algorithm is used to iteratively estimate: (a) the basis functions g_m, via scatterplot smoothing; (b) the projection parameters w_m, via gradient descent.

52 EXAMPLE: estimation of a two-dimensional function via projection pursuit
Projections are found that minimize unexplained variance. Smoothing is performed to create adaptive basis functions. The final model is a sum of two univariate adaptive basis functions.

53 OUTLINE Objectives Statistical Methodology and Basic Methods
Taxonomy of Nonlinear Methods Decision Trees Additive Modeling and Projection Pursuit Greedy Feature Selection Signal Denoising Summary and discussion

54 Greedy feature selection
Recall the feature selection structure in SRM: a difficult (nonlinear) optimization problem in general, but simple with orthogonal basis functions - so why not use orthogonal basis functions for all applications? Consider sparse polynomial estimation (aka best subset regression) as an example of feature selection, i.e. features ~ polynomial terms x, x², x³, ... Compare two approaches: exhaustive search through all subsets, and forward stepwise selection (used in statistics).

55 Data set used for comparisons
30 noisy training samples generated from a target function with additive noise; the inputs are uniform in [0,1].

56 Feature selection via exhaustive search
Exhaustive search for best subset selection: estimate the prediction risk (MSE) via leave-one-out cross-validation; minimize the empirical risk via least squares for all possible subsets of m variables (features); select the best subset (~ minimum prediction risk). Based on the minimum prediction risk (via cross-validation) a subset of features was selected, and the final model was estimated via linear regression using these features with all the data.

57 Forward subset selection (greedy method)
First estimate the model using one feature; then add a second feature if it results in a sufficiently large decrease in RSS, otherwise stop; etc. (sequentially adding one more feature at a time).
Step 1: select the first feature (m = 1) from a set of candidate single-feature models via minimum RSS; the selected model has RSS(1) = 0.249.
Step 2: select the second feature (m = 2) from a set of candidate two-feature models, again via minimum RSS, giving a selected model with a smaller RSS(2).

58 Forward subset selection (greedy method)
Step 2 (cont'd): check whether including the second feature in the model is justified, using some statistical criterion, usually an F-test; the (m+1)-st feature is included only if F > 90. For adding the second feature the F value exceeds this threshold, so we keep it in the model.
Step 3: select the third feature from a set of candidate models by comparing their RSS values, and test whether adding the third feature is justified via the F-test → not justified, so the final model uses two features.
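A sketch of forward stepwise selection with an F-test stopping rule; the data, the candidate features (powers of x), the F-to-enter threshold, and the exact form of the F statistic are standard-practice assumptions, not taken verbatim from the slides:
n = 30;  x = rand(n,1);
y = 2*x.^2 - x + 0.1*randn(n,1);                   % synthetic data from a hypothetical target
cand = [x x.^2 x.^3 x.^4 x.^5];                    % candidate features: x, x^2, ..., x^5
chosen = [];  remaining = 1:size(cand, 2);
rss_prev = sum((y - mean(y)).^2);                  % start from the constant model
F_enter = 4;                                       % assumed F-to-enter threshold
while ~isempty(remaining)
    best_rss = inf;
    for j = remaining                              % try adding each remaining feature
        Xtry = [ones(n,1) cand(:, [chosen j])];
        r = y - Xtry * (Xtry \ y);                 % least-squares residual
        if sum(r.^2) < best_rss, best_rss = sum(r.^2); jbest = j; end
    end
    p = numel(chosen) + 2;                         % parameters in the larger model (incl. intercept)
    F = (rss_prev - best_rss) / (best_rss / (n - p));   % F statistic for the added feature
    if F < F_enter, break; end                     % improvement not justified -> stop
    chosen = [chosen jbest];  rss_prev = best_rss; % keep the feature and continue
    remaining(remaining == jbest) = [];
end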

59 OUTLINE Objectives Statistical Methodology and Basic Methods
Taxonomy of Nonlinear Methods Decision Trees Additive Modeling and Projection Pursuit Greedy Feature Selection Signal Denoising
Refs:
V. Cherkassky and X. Shao, Signal estimation and denoising using VC-theory, Neural Networks, 14, 37-52, 2001.
V. Cherkassky and S. Kilts, Myopotential denoising of ECG signals using wavelet thresholding methods, Neural Networks, 14, 2001.
Summary and discussion

60 Signal Denoising Problem

61 Signal denoising problem statement
Regression formulation ~ real-valued function estimation (with squared loss). Signal representation: a linear combination of orthogonal basis functions (harmonic, wavelets). Differences from the standard formulation: fixed sampling rate; training-data x-values = test-data x-values → computationally efficient orthogonal estimators: the Discrete Fourier / Wavelet Transform (DFT / DWT).

62 Examples of wavelets see http://en.wikipedia.org/wiki/Wavelet
Haar wavelet Symmlet

63 Meyer Mexican Hat

64 Wavelets (cont’d) Example of translated and dilated wavelet basis functions:

65 Issues for signal denoising
Denoising via (wavelet) thresholding: wavelet thresholding = sparse feature selection, a nonlinear estimator suitable for ERM. Main factors for signal denoising: representation (choice of basis functions), ordering (of basis functions) ~ SRM structure, and thresholding (model selection). In the large-sample setting the representation is the main factor; in the finite-sample setting, thresholding + ordering matter most.

66 Framework for signal denoising
Ordering of (wavelet) coefficients for thresholding = a structure on the orthogonal basis functions. Traditional ordering vs. a better ordering used in VC-based thresholding. The optimal number of wavelets m is chosen by minimizing the VC-bound for regression, with VC-dimension h = m (the number of wavelets, or DoF).
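A minimal sketch of denoising by sparse selection in an orthogonal basis. The slides use wavelets (DWT); the DFT is used below only because fft/ifft are available in core MATLAB, and the thresholding logic is the same. The signal, noise level, and the number m of kept coefficients are assumptions (in the VC-based approach m is chosen by minimizing the VC bound, with h = m):
n = 128;  t = (0:n-1)'/n;                         % fixed sampling rate, 128 samples
signal = sin(2*pi*4*t) + 0.5*sin(2*pi*9*t);       % hypothetical clean signal
y = signal + 0.4*randn(n,1);                      % noisy observations
c = fft(y);                                       % coefficients in the orthogonal (Fourier) basis
[~, order] = sort(abs(c), 'descend');             % order basis functions by coefficient magnitude
m = 10;                                           % number of kept coefficients ~ model complexity (DoF)
c_kept = zeros(n,1);
c_kept(order(1:m)) = c(order(1:m));               % thresholding: keep the m largest, zero the rest
denoised = real(ifft(c_kept));                    % estimated (denoised) signal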

67 Empirical Results: signal denoising
Two target functions (Blocks and Heavisine); Symmlet wavelet. Data set: 128 noisy samples, SNR = 2.5.

68 Empirical Results: Blocks signal estimated by VC-based denoising

69 Empirical Results: Heavisine estimated by VC-based denoising

70 Application Study: ECG Denoising

71 A closer look at a noisy segment

72 Denoised ECG signal
VC denoising applied to 4,096 noisy samples. The final model (below) has 76 wavelets.

73 OUTLINE Objectives Statistical Methodology and Basic Methods
Taxonomy of Nonlinear Methods Decision Trees Additive Modeling and Projection Pursuit Greedy Feature Selection Signal Denoising Summary and discussion

74 Summary and Discussion
Evolution of statistical methods: - parametric → flexible (adaptive) - fast optimization (favoring greedy methods – why?) - interpretable - model complexity ~ number of parameters (basis functions, regions, features, ...) - batch mode (for training). Probabilistic framework: - classical methods assume probabilistic models of the observed data - adaptive statistical methods lack a probabilistic derivation, but use clever heuristics for controlling model complexity.

