CPH 636 - Dr. Charnigo Chap. 11 Notes (Presentation transcript)

1 CPH 636 - Dr. Charnigo Chap. 11 Notes Figure 11.2 provides a diagram which shows, at a glance, what a neural network does. Inputs X1, X2, …, XP are (deterministically) converted to (unobserved) quantities Z1, Z2, …, ZM called hidden units. These are, in turn, (stochastically) converted to outputs Y1, …, YK. For a regression problem, usually K = 1. For a classification problem, we may regard Y1, Y2, …, YK as dummy or indicator variables for class membership.
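A small sketch of the indicator-variable coding mentioned above, assuming Python with NumPy (the slides do not specify a language); the class labels are illustrative. Each observation's label becomes a vector Y1, …, YK with a 1 in the position of its class and 0 elsewhere.

```python
import numpy as np

# Hypothetical class labels for N = 5 observations and K = 3 classes.
labels = np.array([0, 2, 1, 0, 2])
K = 3

# Dummy/indicator coding: row i has a 1 in column k iff observation i is in class k.
Y = np.eye(K)[labels]
print(Y)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]
```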

2 CPH 636 - Dr. Charnigo Chap. 11 Notes Formula (11.5) provides some of the mathematics. The sigmoid function σ(v) := 1/(1 + exp(-v)) is shown in Figure 11.3. For large s and any v0, σ(s(v − v0)) is a slightly smoothed version of a step function at v0. This explains why the statistical methodology is called a neural network. Each of Z1, Z2, …, ZM is loosely analogous to a neuron in the human brain, which "fires" when the signal received exceeds a threshold, just as one of Z1, Z2, …, ZM equals (nearly) 1 when v exceeds v0.
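A brief sketch (Python/NumPy assumed) of the sigmoid in (11.5) and of how σ(s(v − v0)) approaches a step function at v0 as s grows; the values of s and v0 below are illustrative only.

```python
import numpy as np

def sigmoid(v):
    """sigma(v) = 1 / (1 + exp(-v)), as plotted in Figure 11.3."""
    return 1.0 / (1.0 + np.exp(-v))

v0 = 0.5                      # illustrative threshold
v = np.linspace(-2, 3, 6)     # a few values of v around the threshold
for s in (1, 10, 100):        # larger s gives something closer to a step at v0
    print(s, np.round(sigmoid(s * (v - v0)), 3))
```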

3 CPH 636 - Dr. Charnigo Chap. 11 Notes Mathematically, Zm has the same structure as the probability in a logistic regression model with X1, X2, …, XP as predictors and α0m, α1m, …, αPm as coefficients. In a regression problem with K = 1, one usually takes g1 to be the identity function, and so (11.5) says that Y1 = β01 + β11 Z1 + β12 Z2 + … + β1M ZM + error. Thus, the output is a linear regression on nonlinear multivariate transforms of the original predictors, which resemble probabilities from logistic regression models.
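A minimal forward-pass sketch of (11.5) for a regression problem with K = 1 and identity g1, assuming Python/NumPy; the weights below are random placeholders rather than fitted values, and the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, M = 8, 4, 3                        # observations, inputs, hidden units

X = rng.normal(size=(N, P))              # inputs X1, ..., XP
alpha0 = rng.normal(size=M)              # alpha_{0m}
alpha = rng.normal(size=(P, M))          # alpha_{pm}
beta0 = rng.normal()                     # beta_{01}
beta = rng.normal(size=M)                # beta_{1m}

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Each Z_m looks like a fitted probability from a logistic regression on X.
Z = sigmoid(alpha0 + X @ alpha)          # N x M matrix of hidden units
# With identity g1, Y1 is a linear regression on Z1, ..., ZM (plus error in the model).
Y_hat = beta0 + Z @ beta                 # N-vector of fitted values
print(Y_hat)
```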

4 CPH 636 - Dr. Charnigo Chap. 11 Notes In a classification problem, one usually takes gk to be as defined in (11.6). Thus, the output (vector) is a multinomial logistic regression on nonlinear multivariate transforms of the original predictors. The authors note that a simpler choice of gk may not work well in a classification problem, for much the same reason that linear regression failed in Section 4.2 when K > 2.
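A sketch of the softmax choice of gk in (11.6), Python/NumPy assumed: each output is positive and the K outputs sum to one, as in multinomial logistic regression. The matrix T below stands for the linear combinations β0k + Σm βkm Zm and is filled with illustrative numbers.

```python
import numpy as np

def softmax(T):
    """g_k(T) = exp(T_k) / sum_l exp(T_l), applied row by row, as in (11.6)."""
    T = T - T.max(axis=1, keepdims=True)   # subtract the row maximum for numerical stability
    E = np.exp(T)
    return E / E.sum(axis=1, keepdims=True)

T = np.array([[2.0, 0.5, -1.0],
              [0.0, 0.0,  3.0]])           # illustrative linear combinations, K = 3
probs = softmax(T)
print(probs, probs.sum(axis=1))            # each row of probabilities sums to 1
```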

5 CPH 636 - Dr. Charnigo Chap. 11 Notes The α and β parameters in a neural network are called “weights” and must be estimated from the data. In a regression problem, we would like the residual sum of squares to be small, per (11.9). This is like maximizing a likelihood function if one assumes that the error terms are independent and normally distributed with constant variance.
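A small sketch (Python/NumPy assumed) of the residual sum of squares in (11.9), the quantity minimized over the weights; the response and fitted-value arrays are placeholders.

```python
import numpy as np

y = np.array([1.2, 0.4, -0.7, 2.1])        # observed responses (illustrative)
y_hat = np.array([1.0, 0.5, -0.5, 1.8])    # network outputs f(x_i) (illustrative)

# (11.9): R(theta) = sum_i (y_i - f(x_i))^2.  Minimizing this is equivalent to
# maximizing a Gaussian likelihood with independent, constant-variance errors.
rss = np.sum((y - y_hat) ** 2)
print(rss)
```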

6 CPH 636 - Dr. Charnigo Chap. 11 Notes In a classification problem, we might like the cross entropy to be small, per (11.10). This is also like maximizing a likelihood function. Indeed, if K = 2 and we abbreviate f1(xi) to pi, then (11.10) becomes the negative of ∑i yi1 log(pi) + ∑i (1 − yi1) log(1 − pi). The authors spend considerable, perhaps even undue, space describing the numerical method of back-propagation for minimizing (11.9) or (11.10). However, their remarks in Section 11.5 are quite important.
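A sketch of the K = 2 special case written out above (Python/NumPy assumed), with pi abbreviating f1(xi); the indicator and probability vectors are illustrative.

```python
import numpy as np

y1 = np.array([1, 0, 1, 1, 0])             # class-1 indicators y_{i1} (illustrative)
p = np.array([0.9, 0.2, 0.6, 0.8, 0.1])    # fitted p_i = f_1(x_i) (illustrative)

# Cross entropy (11.10) for K = 2: the negative Bernoulli log likelihood.
cross_entropy = -np.sum(y1 * np.log(p) + (1 - y1) * np.log(1 - p))
print(cross_entropy)
```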

7 CPH 636 - Dr. Charnigo Chap. 11 Notes First, iterative techniques for parameter estimation require starting or initial values. The authors suggest choosing the initial values of the α’s and β’s randomly. More specifically, if the inputs have been standardized, they recommend choosing the initial values from uniform distributions on [-0.7, +0.7]. Second, with two different sets of initial values, you may get vastly different sets of final estimates, because (11.9) and (11.10) may have numerous local minima.
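A sketch of the recommended starting values (Python/NumPy assumed): the inputs are standardized, and each α and β is drawn uniformly from [-0.7, +0.7]. The dimensions below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2024)
P, M, K = 4, 3, 1                          # illustrative dimensions

# Standardize the inputs (mean 0, standard deviation 1) column by column.
X = rng.normal(size=(50, P))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Initial weights drawn from Uniform[-0.7, +0.7], as the authors recommend.
alpha0 = rng.uniform(-0.7, 0.7, size=M)
alpha = rng.uniform(-0.7, 0.7, size=(P, M))
beta0 = rng.uniform(-0.7, 0.7, size=K)
beta = rng.uniform(-0.7, 0.7, size=(M, K))
```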

8 CPH 636 - Dr. Charnigo Chap. 11 Notes The phenomenon of multiple local minima is easy to illustrate in one dimension, as shown below. In one dimension, you can also see what to do: compare the local minima and select the smaller one. The only ways this approach can fail are if you do not detect one or more local minima or if the global minimum occurs at a boundary.
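A one-dimensional sketch of the multi-start idea (Python/NumPy assumed; the curve, step size, and starting values are illustrative and not the slide's figure): run a plain gradient descent from several starting points on a function with two local minima, then keep the smaller minimum.

```python
import numpy as np

def f(v):                                   # illustrative curve with two local minima
    return (v ** 2 - 1) ** 2 + 0.3 * v

def grad(v):                                # its derivative
    return 4 * v * (v ** 2 - 1) + 0.3

finals = []
for start in (-2.0, -0.5, 0.5, 2.0):        # several starting values
    v = start
    for _ in range(500):                    # plain gradient descent, step size 0.01
        v -= 0.01 * grad(v)
    finals.append(v)

best = min(finals, key=f)                   # compare the local minima, keep the smaller one
print(np.round(finals, 3), round(best, 3))
```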

9 CPH 636 - Dr. Charnigo Chap. 11 Notes So, in the multi-dimensional problem of estimating the α’s and β’s, you can try numerous sets of initial values, generate the corresponding sets of final estimates, and choose whatever set maximizes the likelihood. Alternatively, you don’t need to “throw away” any sets of final estimates; you can simply average predictions obtained from the various sets of final estimates. Caution: This is not the same as averaging the various sets of final estimates and then making predictions.
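A sketch of the caution above (Python/NumPy assumed): for a hypothetical single-hidden-layer network with random "final estimates", averaging the predictions from two sets of estimates is not the same as averaging the estimates and then predicting, because the network output is nonlinear in its weights.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def predict(weights, X):
    """Single-hidden-layer network with K = 1 and identity output (hypothetical)."""
    alpha0, alpha, beta0, beta = weights
    return beta0 + sigmoid(alpha0 + X @ alpha) @ beta

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))

# Two sets of "final estimates", e.g. from two different starting values (illustrative).
w1 = (rng.normal(size=2), rng.normal(size=(3, 2)), rng.normal(), rng.normal(size=2))
w2 = (rng.normal(size=2), rng.normal(size=(3, 2)), rng.normal(), rng.normal(size=2))

avg_of_predictions = (predict(w1, X) + predict(w2, X)) / 2
w_avg = tuple((a + b) / 2 for a, b in zip(w1, w2))
prediction_from_avg = predict(w_avg, X)
print(np.allclose(avg_of_predictions, prediction_from_avg))   # generally False
```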

10 CPH 636 - Dr. Charnigo Chap. 11 Notes Third, to avoid overfitting, you may employ one of two approaches. The first is to stop the iteration for minimizing (11.9) or (11.10) well before you actually attain the minimum. But when do you stop? You could assess the performance of your neural network on a validation data set after each iteration and cease iterating once the validation error stopped decreasing. This is illustrated in Figure 11.11.
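A schematic sketch of the stopping rule (Python/NumPy assumed). The validation error curve below is a stand-in shaped like Figure 11.11 (it falls and then rises); in practice each entry would come from evaluating the network on a validation data set after an iteration. The "patience" of a few non-improving iterations is an illustrative choice, not part of the slides.

```python
import numpy as np

# Stand-in validation errors, one per training iteration: they fall, then rise again.
val_errors = 1.0 / (np.arange(1, 200) ** 0.5) + 0.002 * np.arange(1, 200)

best_error = np.inf
patience, bad_steps = 5, 0                 # allow a few non-improving iterations
for it, err in enumerate(val_errors):
    if err < best_error:
        best_error, best_iter, bad_steps = err, it, 0
    else:
        bad_steps += 1
        if bad_steps >= patience:          # validation error has stopped decreasing
            break

print(best_iter, round(best_error, 4))     # iteration at which to stop, and its error
```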

11 CPH 636 - Dr. Charnigo Chap. 11 Notes The second approach to avoiding overfitting is “weight decay”, in which you modify (11.9) or (11.10) by including a penalty like in ridge regression; cross validation could be useful to determine the strength of the penalty. Figure 11.4 revisits an old friend of ours from Chapter 2, demonstrating the salutary effect of weight decay. While the training error is increased, the test error is decreased, and the decision boundary is improved.
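A sketch of the weight-decay criterion (Python/NumPy assumed): (11.9) plus a ridge-type penalty, the sum of squared weights times a tuning constant that cross validation could choose. The arrays and the value of the tuning constant are illustrative.

```python
import numpy as np

def penalized_rss(y, y_hat, weights, lam):
    """(11.9) plus a ridge-type weight-decay penalty: lam times the sum of squared weights."""
    rss = np.sum((y - y_hat) ** 2)
    decay = lam * sum(np.sum(w ** 2) for w in weights)
    return rss + decay

y = np.array([1.2, 0.4, -0.7])                        # responses (illustrative)
y_hat = np.array([1.0, 0.5, -0.5])                    # fitted values (illustrative)
weights = [np.array([0.3, -0.8]), np.array([1.5])]    # illustrative alpha's and beta's
print(penalized_rss(y, y_hat, weights, lam=0.1))
```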

12 CPH 636 - Dr. Charnigo Chap. 11 Notes Fourth, the number of hidden units M must be chosen by the data analyst. While one could use cross validation for this purpose, the authors suggest that such effort may be unnecessary. In particular, if a penalty is added to (11.9) or (11.10), then the estimated weights for unnecessary hidden units should be small. Thus, as suggested by Figure 11.7, the choice of M is not critical unless it is too small.

13 CPH 636 - Dr. Charnigo Chap. 11 Notes Section 11.7 (including Figure 11.11, which we saw earlier) illustrates the use of neural networks in the automated detection of handwritten digits. In Figure 11.10, the first method is multinomial logistic regression, the second is a neural network with one layer of hidden units, and the last three are neural networks with two layers of hidden units. The last three differ in their elaborate a priori assumptions about some weights equaling each other (or being zero).

14 CPH 636 - Dr. Charnigo Chap. 11 Notes By now, we have encountered several techniques for regression and classification. We may want to choose from among them, and Table 10.1 may provide some guidance in that regard. Alternatively, one may employ stacking, which is briefly described in Section 8.8. Stacking tries to combine the predictions obtained from disparate techniques in an optimal way. Roughly speaking, then, stacking is like averaging over the techniques themselves.
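A rough sketch of the stacking idea (Python/NumPy assumed): combine the predictions from different techniques by finding combining weights that best reproduce the response, here by ordinary least squares on illustrative prediction vectors. Section 8.8 does this with cross-validated predictions, which this sketch omits.

```python
import numpy as np

y = np.array([1.0, 0.2, -0.5, 2.0, 0.7])            # responses (illustrative)
pred_nn = np.array([0.9, 0.3, -0.4, 1.8, 0.6])      # neural network predictions (illustrative)
pred_lm = np.array([1.1, 0.1, -0.7, 2.2, 0.9])      # linear model predictions (illustrative)

# Stack the techniques: find combining weights w minimizing ||y - preds @ w||^2.
preds = np.column_stack([pred_nn, pred_lm])
w, *_ = np.linalg.lstsq(preds, y, rcond=None)
stacked = preds @ w                                  # combined ("stacked") predictions
print(np.round(w, 3), np.round(stacked, 2))
```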

