**BMI 707: Regularization and GPU Computing**

Andrew Beam, PhD

Department of Epidemiology

Harvard T.H. Chan School of Public Health

twitter: @AndrewLBeam

**GPUs**

**GPU COMPUTING**

Training neural nets = large matrix multiplications

GPUs = Massively parallel linear algebra devices

Special computer chips known as graphics processing units (GPUs) make training huge models on large data tractable

**GPUs vs. CPUs**

CPUs

1000s of number crunchers

GPUs

General

Purpose

Computation

**REGULARIZATION**

**REGULARIZATION**

One of the biggest problems with neural networks is overfitting.

Regularization schemes combat overfitting in a variety of different ways

**WHY DO WE NEED IT?**

**WHY DO WE NEED IT?**

**WHY DO WE NEED IT?**

**REGULARIZATION**

Learning means solving the following optimization problem:

where f(X) = neural net output

**REGULARIZATION**

One way to regularize is introduce penalties and change

**REGULARIZATION**

to

One way to regularize is introduce penalties and change

**REGULARIZATION**

A familiar why to regularize is introduce penalties and change

to

where R(W) is often the L1 or L2 norm of W. These are the well known ridge and LASSO penalties, referred to as weight decay by neural net community

**L2 REGULARIZATION**

We can limit the size of the L2 norm of the weight vector:

where

**L1/L2 REGULARIZATION**

We can limit the size of the L2 norm of the weight vector:

where

We can do the same for the L1 norm. What do these penalties do?

**SHRINKAGE**

L1 and L2 penalties shrink the weights towards 0

L2 Penalty

L1 Penalty

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. *The elements of statistical learning*. Vol. 1. New York: Springer series in statistics, 2001.

**SHRINKAGE**

L1 and L2 penalties shrink the weights towards 0

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. *The elements of statistical learning*. Vol. 1. New York: Springer series in statistics, 2001.

**SHRINKAGE**

L1 and L2 penalties shrink the weights towards 0

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. *The elements of statistical learning*. Vol. 1. New York: Springer series in statistics, 2001.

Why is this a "good" idea?

**STOCHASTIC REGULARIZATION**

Often, we will inject noise into the neural network during training. By far the most popular way to do this is **dropout**

**STOCHASTIC REGULARIZATION**

Often, we will inject noise into the neural network during training. By far the most popular way to do this is **dropout**

Given a hidden layer, we are going to set each element of the hidden layer to 0 with probability p each SGD update.

**STOCHASTIC REGULARIZATION**

One way to think of this is the network is trained by **bagged** versions of the network. Bagging reduces variance.

**STOCHASTIC REGULARIZATION**

One way to think of this is the network is trained by **bagged** versions of the network. Bagging reduces variance.

Others have argued this is an approximate Bayesian model

**STOCHASTIC REGULARIZATION**

Many have argued that SGD itself provides regularization

**INITIALIZATION REGULARIZATION**

The weights in a neural network are given random values initially

**INITIALIZATION REGULARIZATION**

The weights in a neural network are given random values initially

There is an entire literature on the best way to do this initialization

**INITIALIZATION REGULARIZATION**

The weights in a neural network are given random values initially

There is an entire literature on the best way to do this initialization

- Normal

- Truncated Normal

- Uniform

- Orthogonal

- Scaled by number of connections

- etc

**INITIALIZATION REGULARIZATION**

Try to "bias" the model into initial configurations that are easier to train

**INITIALIZATION REGULARIZATION**

Try to "bias" the model into initial configurations that are easier to train

Very popular way is to do **transfer learning**

**INITIALIZATION REGULARIZATION**

Try to "bias" the model into initial configurations that are easier to train

Very popular way is to do **transfer learning**

Train model on auxiliary task where lots of data is available

**INITIALIZATION REGULARIZATION**

Try to "bias" the model into initial configurations that are easier to train

Very popular way is to do **transfer learning**

Train model on auxiliary task where lots of data is available

Use final weight values from previous task as initial values and "fine tune" on primary task

**STRUCTURAL REGULARIZATION**

However, the key advantage of neural nets is the ability to easily include properties of the data directly into the model through the **network's structure**

Convolutional neural networks (CNNs) are a prime example of this (Kun will discuss CNNs)

**In class assignment:**

**PERCEPTRON BY HAND**

**PERCEPTRONS**

Let's say we'd like to have a single neural learn a simple function

y

X1 | X2 | y |
---|---|---|

0 | 0 | 0 |

0 | 1 | 1 |

1 | 0 | 1 |

1 | 1 | 1 |

Observations

**PERCEPTRONS**

How do we make a prediction for each observations?

y

X1 | X2 | y |
---|---|---|

0 | 0 | 0 |

0 | 1 | 1 |

1 | 0 | 1 |

1 | 1 | 1 |

Assume we have the following values

w1 | w2 | b |
---|---|---|

1 | -1 | -0.5 |

Observations

**Predictions**

For the first observation:

Assume we have the following values

w1 | w2 | b |
---|---|---|

1 | -1 | -0.5 |

**Predictions**

For the first observation:

Assume we have the following values

w1 | w2 | b |
---|---|---|

1 | -1 | -0.5 |

First compute the weighted sum:

**Predictions**

For the first observation:

Assume we have the following values

w1 | w2 | b |
---|---|---|

1 | -1 | -0.5 |

First compute the weighted sum:

Transform to probability:

**Predictions**

For the first observation:

Assume we have the following values

w1 | w2 | b |
---|---|---|

1 | -1 | -0.5 |

First compute the weighted sum:

Transform to probability:

Round to get prediction:

**Predictions**

Putting it all together:

Assume we have the following values

w1 | w2 | b |
---|---|---|

1 | -1 | -0.5 |

X1 | X2 | y | h | p | |
---|---|---|---|---|---|

0 | 0 | 0 | -0.5 | 0.38 | 0 |

0 | 1 | 1 | -1.5 | 0.18 | 0 |

1 | 0 | 1 | 0.5 | .62 | 1 |

1 | 1 | 1 | -0.5 | 0.38 | 0 |

Fill out this table

**Room for Improvement**

Let's define how we want to *measure* the network's performance.

There are many ways, but let's use *squared-error*:

**Room for Improvement**

Let's define how we want to *measure* the network's performance.

There are many ways, but let's use *squared-error*:

Now we need to find values for that make this error as small as possible

**The Backpropagation Algorithm**

Our perceptron performs the following computations

And we want to minimize this quantity

**The Backpropagation Algorithm**

Our perception performs the following computations

And we want to minimize this quantity

We'll compute the gradients for each parameter by "back-propagating" errors through each component of the network

**The Backpropagation Algorithm**

For we need to compute

**Computations**

**Loss**

To get there, we will use the chain rule

This is "backprop"

**The Backpropagation Algorithm**

Let's break it into pieces

**Computations**

**Loss**

**The Backpropagation Algorithm**

Let's break it into pieces

**Computations**

**Loss**

**The Backpropagation Algorithm**

Let's break it into pieces

**Computations**

**Loss**

**The Backpropagation Algorithm**

Let's break it into pieces

**Computations**

**Loss**

**The Backpropagation Algorithm**

Let's break it into pieces

**Computations**

**Loss**

**The Backpropagation Algorithm**

Let's break it into pieces

**Computations**

**Loss**

**The Backpropagation Algorithm**

Let's break it into pieces

**Computations**

**Loss**

Putting it all together

**Gradient Descent with Backprop**

1) Compute the gradient for

2) Update

is the* learning rate*

For some number of iterations we will:

3) Repeat until "convergence"

**Learning Rules for each Parameter**

Gradient for

Gradient for

Gradient for

Update for

Update for

Update for

is the* learning rate*

**IMPLEMENTATION IN PYTHON**

#
**BONUS: BACKPROP FOR** **MLPs**

**Perceptron -> MLP**

With a small change, we can turn our perceptron model into a multilayer perceptron

- Instead of just one linear combination, we are going to take several, each with a different set of weights (called a
*hidden unit)* - Each linear combination will be followed by a nonlinear activation
- Each of these
*nonlinear features*will be fed into the logistic regression classifier - All of the weights are learned end-to-end via SGD

MLPs learn a set of nonlinear features directly from data

"Feature learning" is the hallmark of deep learning approachs

**MULTILAYER PERCEPTRONS (MLPs)**

Let's set up the following MLP with 1 hidden layer that has 3 hidden units:

Each neuron in the hidden layer is going to do exactly the same thing as before.

**MULTILAYER PERCEPTRONS (MLPs)**

Computations are:

**MULTILAYER PERCEPTRONS (MLPs)**

Computations are:

Output layer weight derivatives

**MULTILAYER PERCEPTRONS (MLPs)**

Computations are:

Output layer weight derivatives

**MULTILAYER PERCEPTRONS (MLPs)**

Computations are:

Hidden layer weight derivatives

Output layer weight derivatives

**MULTILAYER PERCEPTRONS (MLPs)**

Computations are:

Hidden layer weight derivatives

Output layer weight derivatives

(if we use a sigmoid activation function)

**MLP Terminology**

Forward pass = computing probability from input

**MLP Terminology**

Forward pass = computing probability from input

**MLP Terminology**

Backward pass = computing derivatives from output

Forward pass = computing probability from input

**MLP Terminology**

Backward pass = computing derivatives from output

Hidden layers are often called "dense" layers

#### BMI 707 - Lecture 4: Regularization and GPU Computing

By beamandrew