Data mining and statistical learning - lecture 11

Neural networks - a model class providing a joint framework for prediction and classification
- Relationship to other prediction models
- Some simple examples of neural networks
- Parameter estimation
- Joint framework for prediction and classification
- Features of neural networks
Ordinary least squares regression (OLS)

[Diagram: inputs x1, x2, …, xp connected directly to the output y]

Model: y = β0 + β1·x1 + … + βp·xp + ε

Terminology:
- β0: intercept (or bias)
- β1, …, βp: regression coefficients (or weights)

The response variable responds directly and linearly to changes in the inputs.
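A minimal sketch of OLS in NumPy, on simulated data (the data and coefficient values are made up for illustration): appending a column of ones to the design matrix makes the first fitted coefficient the intercept (bias).

```python
import numpy as np

# Simulated data: y responds directly and linearly to the inputs.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ beta_true + 0.1 * rng.normal(size=n)

# Add a column of ones so the first coefficient is the intercept (bias).
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)  # ~ [1.0, 2.0, -1.0, 0.5]
```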
Principal components regression (PCR)

Extract principal components (linear combinations of the inputs) as derived features, and then model the target (response) as a linear function of these features.

[Diagram: inputs x1, x2, …, xp → principal components z1, z2, …, zM → output y]

The response variable responds indirectly and linearly to changes in the inputs.
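The two PCR steps can be sketched with an SVD; the simulated data below (strongly correlated inputs driven by one latent factor, mimicking the absorbance data used later) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, M = 100, 10, 3          # n samples, p correlated inputs, M components kept

# Correlated inputs: all columns share a common latent factor.
latent = rng.normal(size=(n, 1))
X = latent + 0.1 * rng.normal(size=(n, p))
y = 3.0 * latent[:, 0] + 0.1 * rng.normal(size=n)

# Step 1: centre X and extract the first M principal components (derived features z).
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:M].T             # scores on the first M components

# Step 2: regress y on the derived features.
Z1 = np.column_stack([np.ones(n), Z])
gamma, *_ = np.linalg.lstsq(Z1, y, rcond=None)
y_hat = Z1 @ gamma
print(np.corrcoef(y, y_hat)[0, 1])  # close to 1: the components capture the signal
```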
Neural network with a single target

[Diagram: inputs x1, x2, …, xp → hidden layer of neurons z1, z2, …, zM → output y]

The response to changes in inputs is indirect and nonlinear.
Neuron

A neuron forms a weighted sum of its inputs and passes it through a sigmoid activation function, e.g.

  σ(v) = 1 / (1 + e^(-v))
Neural networks with a single target

Extract linear combinations of the inputs as derived features, and then model the target (response) as a linear function of a sigmoid function (activation function) of these features:

  z_m = σ(α_0m + α_m'x),  m = 1, …, M
  y = β_0 + β_1·z_1 + … + β_M·z_M

[Diagram: inputs x1, x2, …, xp → derived features z1, z2, …, zM → output y]
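The forward pass of such a network is a few lines of NumPy; the weights below are random stand-ins, since no fitted values are given at this point.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, alpha0, alpha, beta0, beta):
    """Single-target network: x has shape (p,), alpha (M, p), alpha0 and beta (M,)."""
    z = sigmoid(alpha0 + alpha @ x)   # M derived features (hidden neurons)
    return beta0 + beta @ z           # output is linear in the derived features

# Tiny illustration with made-up weights (p = 2 inputs, M = 3 neurons).
rng = np.random.default_rng(2)
alpha0, alpha = rng.normal(size=3), rng.normal(size=(3, 2))
beta0, beta = 0.5, rng.normal(size=3)
print(forward(np.array([1.0, -1.0]), alpha0, alpha, beta0, beta))
```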
Neural network with one input, one neuron, and one target

[Diagram: x → z → y]
Neural network with one input, one neuron, and one target - a simple example

- Select Advanced user interface
- Select 1 hidden node
- Tick Outputs from Training, …
Neural network with one input, one neuron, and one target
Output from proc NEURAL - one input, one neuron, one target

Parameter Estimates
  N  Parameter   Estimate   Gradient Objective Function
  1  x_H11          …            …
  2  BIAS_H11       …            …
  3  H11_y          …            …E-8
  4  BIAS_y         …            …E-8

Value of Objective Function = …

H11 = Hidden layer 1, neuron 1
Neural network with one input, one neuron, and one target - manual calculation of predicted values

Using the parameter estimates x_H11, BIAS_H11, H11_y and BIAS_y from the proc NEURAL output:

1. Standardize x to mean zero and variance one
2. Compute xstand*x_H11 + BIAS_H11
3. Take tanh to compute z
4. Compute z*H11_y + BIAS_y
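The four steps can be sketched in Python. The parameter values and input data below are made-up stand-ins, since the actual proc NEURAL estimates are not reproduced here.

```python
import numpy as np

# Hypothetical parameter estimates standing in for the proc NEURAL output
x_H11, BIAS_H11 = 1.8, -0.2      # input-to-hidden weight and bias (made up)
H11_y, BIAS_y = 3.5, 10.0        # hidden-to-output weight and bias (made up)

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # raw input values (made up)

# Step 1: standardize x to mean zero and variance one
x_stand = (x - x.mean()) / x.std(ddof=1)

# Step 2: linear combination into the hidden neuron
v = x_stand * x_H11 + BIAS_H11

# Step 3: tanh activation gives the derived feature z
z = np.tanh(v)

# Step 4: linear output layer gives the predicted values
y_hat = z * H11_y + BIAS_y
print(y_hat)
```

Because |tanh| < 1, every prediction lies within H11_y of BIAS_y, which is why a single neuron can only produce a bounded, S-shaped response curve.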
Neural networks with one input, two neurons, and one target

[Diagram: x → z1, z2 → y]
Output from proc NEURAL - one input, two neurons, one target

Parameter Estimates
  N  Parameter   Estimate   Gradient Objective Function
  1  x_H11          …            …
  2  x_H12          …            …
  3  BIAS_H11       …            …
  4  BIAS_H12       …            …
  5  H11_y          …            …
  6  H12_y          …            …
  7  BIAS_y         …            …

Value of Objective Function = …
Absorbance records for ten samples of chopped meat

- 1 response variable (fat)
- 100 predictors (absorbance at 100 wavelengths or channels)
- The predictors are strongly correlated to each other
Absorbance records for 215 samples of chopped meat

The target is poorly correlated to each predictor
Neural networks with a single target and many inputs - the fat content and absorbance dataset

[Diagram: inputs x1, x2, …, xp → hidden neurons z1, z2, z3 → output y]

With p inputs and three neurons, a total of (p+2)*3 + 1 parameters are estimated: 3(p+1) input-to-hidden weights and biases, plus 3+1 hidden-to-output weights and bias.
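The parameter count is simple arithmetic; for the absorbance data (p = 100 channels, M = 3 neurons) it reproduces the total quoted on the next slide.

```python
def n_parameters(p, M):
    """Weights of a single-target network: M*(p+1) input-to-hidden
    parameters plus (M+1) hidden-to-output parameters = (p+2)*M + 1."""
    return M * (p + 1) + (M + 1)

print(n_parameters(100, 3))  # 307 for the 100-channel absorbance data
```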
Neural networks with a single target and many inputs - parameter estimates for a model with three neurons

Parameter Estimates (excerpt)
  N    Parameter
  …    …
  291  Channel90_H13
  292  Channel91_H13
  …    …
  300  Channel99_H13
  301  BIAS_H11
  302  BIAS_H12
  303  BIAS_H13
  304  H11_Fat
  305  H12_Fat
  306  H13_Fat
  307  BIAS_Fat

Value of Objective Function = …

A total of 307 parameters
Neural networks with a single target and many inputs - output from a model with three neurons
Neural networks with a single target and many inputs - output from models with 1 to 10 neurons

Convergence problems
Neural networks with multiple targets

Extract linear combinations of the inputs as derived features, and then model each target (response) as a linear function of a sigmoid function (activation function) of these features.

[Diagram: inputs x1, x2, …, xp → hidden neurons z1, z2, …, zM → outputs y1, …, yK]
Neural networks for K-class classification

With the softmax activation function and the deviance (cross-entropy) error function, the neural network model is exactly a logistic regression model in the hidden units, and all the parameters are estimated by maximum likelihood.

[Diagram: inputs x1, x2, …, xp → hidden neurons z1, z2, …, zM → outputs y1, …, yK]
Neural networks for regression and K-class classification

For regression, we use the sum-of-squared errors as our measure of fit. For classification, we normally use the deviance (cross-entropy) error function, and the corresponding classifier is G(x) = argmax_k f_k(x), i.e. the class with the largest estimated probability.

[Diagram: inputs x1, x2, …, xp → hidden neurons z1, z2, …, zM → outputs y1, …, yK]
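A short sketch of the classification side: the softmax function turns K output scores into probabilities, cross-entropy measures the fit, and the classifier picks the largest probability. The scores below are made up for illustration.

```python
import numpy as np

def softmax(t):
    e = np.exp(t - t.max())           # shift by the max for numerical stability
    return e / e.sum()

def cross_entropy(y_onehot, probs):
    """Deviance (cross-entropy) error for one observation."""
    return -np.sum(y_onehot * np.log(probs))

# K = 3 output scores from the network for one observation (made up)
scores = np.array([2.0, 0.5, -1.0])
probs = softmax(scores)
print(probs, probs.sum())             # probabilities summing to 1

# The classifier G(x) picks the class with the largest probability
print(np.argmax(probs))
```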
Fitting neural networks

[Diagram: inputs x1, x2, …, xp → hidden neurons z1, z2, …, zM → outputs y1, …, yK]

M(p+1) + K(M+1) parameters (weights)

We don't want the global minimizer of the deviance (cross-entropy) function, since that would typically overfit. Instead we use early stopping or add a penalty term (weight decay) to the error function.
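A minimal sketch of fitting by gradient descent with a weight-decay penalty, for a single-input, single-target network (p = 1, K = 1, M = 3). This is a generic illustration of the penalty idea, not the proc NEURAL optimizer; the data, learning rate, and penalty value are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
x = np.linspace(-2, 2, n)
y = np.tanh(1.5 * x) + 0.1 * rng.normal(size=n)    # noisy nonlinear target (made up)

M, lam, lr = 3, 1e-3, 0.5            # neurons, weight-decay penalty, step size
a0 = rng.normal(size=M) * 0.1        # hidden biases
a = rng.normal(size=M) * 0.1         # input-to-hidden weights
b0, b = 0.0, rng.normal(size=M) * 0.1

n_weights = M * (1 + 1) + 1 * (M + 1)   # M(p+1) + K(M+1) with p = 1, K = 1

for _ in range(5000):
    Z = np.tanh(np.outer(x, a) + a0)        # (n, M) hidden-unit outputs
    y_hat = b0 + Z @ b
    r = (y_hat - y) / n                     # scaled residuals
    # Gradients of mean squared error plus the penalty lam * sum(weights**2)
    dZ = (r[:, None] * b) * (1.0 - Z**2)    # error back-propagated to the hidden layer
    b = b - lr * (Z.T @ r + lam * b)
    b0 = b0 - lr * r.sum()
    a = a - lr * (x @ dZ + lam * a)
    a0 = a0 - lr * (dZ.sum(axis=0) + lam * a0)

mse = np.mean((b0 + np.tanh(np.outer(x, a) + a0) @ b - y) ** 2)
print(n_weights, mse)
```

The penalty term shrinks the weights toward zero, playing the same role as early stopping: both keep the fitted function smoother than the global minimizer of the unpenalized error.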
Neural networks

- Provide a joint framework for prediction and classification
- Can describe both linear and nonlinear responses
- Can accommodate multidimensional correlated inputs
- Are normally over-fitted - validation is a must
- Are difficult to interpret
- Convergence problems are not uncommon
Some characteristics of different learning methods

| Characteristic                                     | Neural networks | Trees     |
|----------------------------------------------------|-----------------|-----------|
| Natural handling of data of "mixed" type           | Poor            | Good      |
| Handling of missing values                         | Poor            | Good      |
| Robustness to outliers in input space              | Poor            | Good      |
| Insensitive to monotone transformations of inputs  | Poor            | Good      |
| Computational scalability (large N)                | Poor            | Good      |
| Ability to deal with irrelevant inputs             | Poor            | Good      |
| Ability to extract linear combinations of features | Good            | Poor      |
| Interpretability                                   | Poor            | Fair/good |
| Predictive power                                   | Good            | Poor      |