Goodfellow: Chapter 9 Convolutional Networks
1 Goodfellow: Chapter 9 Convolutional Networks
Dr. Charles Tappert. The information here, although greatly condensed, comes almost entirely from the chapter content.

2 Chapter 9 Sections
Introduction
1 The Convolution Operation
2 Motivation
3 Pooling
4 Convolution and Pooling as an Infinitely Strong Prior
5 Variants of the Basic Convolution Function
6 Structured Outputs
7 Data Types
8 Efficient Convolution Algorithms
9 Random or Unsupervised Features
10 The Neuroscientific Basis for Convolutional Networks
11 Convolutional Networks and the History of Deep Learning

3 Introduction
Convolutional networks are also known as convolutional neural networks (CNNs)
Specialized for data with a grid-like topology
- 1D grid: time-series data
- 2D grid: image data
Definition: convolutional networks use convolution in place of general matrix multiplication in at least one layer
Neural-network "convolution" does not correspond exactly to the convolution used in engineering and mathematics

4 1. The Convolution Operation
Convolution is an operation on two functions
The section begins with a general convolution example: smoothing the readings of a laser sensor used to locate a spaceship
CNN convolutions (not general convolution):
- The first function is the network input x, the second is the kernel w
- Tensors are the multidimensional arrays holding, e.g., the input data and the parameters (hence the name TensorFlow)
- Convolution is equivalent to multiplication by a sparse weight matrix, in contrast to the usual fully connected weight matrix
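A minimal numpy sketch (the function names are my own) contrasting true convolution, which flips the kernel, with the cross-correlation that most deep learning libraries implement under the name "convolution":

import numpy as np

def conv1d_true(x, w):
    """'Valid' 1D convolution with kernel flipping, as defined in mathematics."""
    k = len(w)
    return np.array([np.sum(x[i:i + k] * w[::-1]) for i in range(len(x) - k + 1)])

def cross_correlate1d(x, w):
    """'Valid' 1D cross-correlation (no kernel flip), the usual CNN operation."""
    k = len(w)
    return np.array([np.sum(x[i:i + k] * w) for i in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input signal
w = np.array([0.5, 0.3, 0.2])             # small kernel
print(conv1d_true(x, w))                  # matches np.convolve(x, w, mode="valid")
print(cross_correlate1d(x, w))            # matches np.correlate(x, w, mode="valid")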

5 2D Convolution (Figure 9.1)
Input (3 x 4):      Kernel (2 x 2):
  a b c d             w x
  e f g h             y z
  i j k l
Output (2 x 3, "valid" cross-correlation):
  aw + bx + ey + fz   bw + cx + fy + gz   cw + dx + gy + hz
  ew + fx + iy + jz   fw + gx + jy + kz   gw + hx + ky + lz
Figure 9.1 (Goodfellow 2016)
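A small numeric check of the Figure 9.1 pattern (a sketch, not the book's code): "valid" 2D cross-correlation of a 3 x 4 input with a 2 x 2 kernel yields a 2 x 3 output, each entry being the sum of elementwise products over one 2 x 2 input patch.

import numpy as np

def cross_correlate2d(X, K):
    """Valid 2D cross-correlation (what CNN layers actually compute)."""
    kh, kw = K.shape
    oh, ow = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    S = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            S[i, j] = np.sum(X[i:i + kh, j:j + kw] * K)
    return S

X = np.arange(12, dtype=float).reshape(3, 4)   # plays the role of a..l
K = np.array([[1.0, 2.0],                      # plays the role of w, x
              [3.0, 4.0]])                     #                   y, z
print(cross_correlate2d(X, K))                 # shape (2, 3)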

6 2. Motivation
Convolution leverages three important ideas that help improve machine learning systems:
- Sparse interactions
- Parameter sharing
- Equivariant representations

7 2. Motivation: Sparse Interactions
Fully connected traditional networks:
- m inputs in a layer and n outputs in the next layer require O(m x n) runtime (per example)
Sparse interactions:
- Also called sparse connectivity or sparse weights
- Accomplished by making the kernel smaller than the input
- k << m requires only O(k x n) runtime (per example)
- k is typically several orders of magnitude smaller than m

8 Sparse Connectivity (viewed from below)
Figure: with a small convolution kernel, each output unit s_i connects to only a few input units x_i; in a fully connected layer, every output connects to every input. (Goodfellow 2016)

9 Sparse Connectivity (viewed from above)
Figure: with a small convolution kernel, each input unit x_i affects only a few output units (its receptive field); in a fully connected layer, every input affects every output. (Goodfellow 2016)

10 Growing Receptive Fields
Figure: units in deeper layers have larger receptive fields, even though each layer's direct connections are sparse. (Goodfellow 2016)

11 2. Motivation: Parameter Sharing
In traditional neural networks each element of the weight matrix is unique
Parameter sharing means using the same parameters for more than one function in a model
- The network is said to have tied weights
- Reduces storage requirements to k parameters
- Does not affect forward-prop runtime, which remains O(k x n)

12 Parameter Sharing
Figure: black arrows mark the uses of one particular parameter. Convolution shares the same parameters across all spatial locations; traditional matrix multiplication does not share any parameters. (Goodfellow 2016)

13 Edge Detection by Convolution
Figure: the right image is formed by subtracting each pixel's left neighbor from it (kernel [-1, 1], so only k = 2 parameters); the result highlights the vertical edges of the input image. (Goodfellow 2016)
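A sketch of the same edge detector in numpy (the toy image values are my own): correlating each row with the two-element kernel [-1, 1] replaces every position with a pixel minus its left neighbor.

import numpy as np

image = np.array([[5.0, 5.0, 9.0, 9.0],
                  [5.0, 5.0, 9.0, 9.0],
                  [5.0, 5.0, 9.0, 9.0]])   # toy image with one vertical edge

kernel = np.array([-1.0, 1.0])             # only k = 2 parameters
edges = np.stack([np.correlate(row, kernel, mode="valid") for row in image])
print(edges)   # nonzero only where neighboring pixel values differ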

14 Efficiency of Convolution
Input size: 320 x 280; kernel size: 2 x 1; output size: 319 x 280
Forward computation:
                   Stored floats                     Float mults + adds
Convolution        2 (the kernel)                    319*280*3 = 267,960
Dense matrix       319*280*320*280 > 8e9             > 16e9
Sparse matrix      2*319*280 = 178,640               Same as convolution (267,960)
(Goodfellow 2016)
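The table's counts can be reproduced with a few lines of arithmetic (a sketch; the "3 operations per output" line counts two multiplications and one addition per output pixel for the two-element kernel):

in_h, in_w   = 320, 280
out_h, out_w = 319, 280
k = 2                                              # kernel entries

conv_params   = k                                  # just the shared kernel
dense_params  = (out_h * out_w) * (in_h * in_w)    # one weight per input-output pair
sparse_params = k * out_h * out_w                  # k nonzeros per output unit

conv_ops  = out_h * out_w * 3                      # 267,960
dense_ops = 2 * dense_params                       # one multiply + one add per weight

print(conv_params, sparse_params, dense_params)    # 2, 178,640, 8,003,072,000 (> 8e9)
print(conv_ops, dense_ops)                         # 267,960, 16,006,144,000 (> 16e9)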

15 2. Motivation: Equivariant Representations
For an equivariant function, if the input changes, the output changes in the same way
For convolution, the particular form of parameter sharing causes equivariance to translation
In image processing, detecting edges is useful in the first layer, and edges appear more or less everywhere in the image, so it makes sense to share the edge detector across all locations
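A small check of translation equivariance, reusing the 1D cross-correlation sketch from earlier (the example signal and shift helper are my own): shifting the input and then convolving gives the same result as convolving and then shifting.

import numpy as np

def cross_correlate1d(x, w):
    k = len(w)
    return np.array([np.sum(x[i:i + k] * w) for i in range(len(x) - k + 1)])

x = np.array([0.0, 1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 0.0])
w = np.array([-1.0, 1.0])
shift = lambda a: np.roll(a, 1)          # translate one step to the right

lhs = cross_correlate1d(shift(x), w)
rhs = shift(cross_correlate1d(x, w))
print(np.allclose(lhs, rhs))             # True (the signal is zero at its ends,
                                         # so the circular roll acts as a pure shift)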

16 3. Pooling
The pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs
- Max pooling reports the maximum output within a rectangular neighborhood
- Average pooling reports the average output
Pooling helps make the representation approximately invariant to small translations of the input
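A minimal max-pooling sketch in numpy (the helper name and example values are my own): each output is the maximum over a small window of detector-stage outputs.

import numpy as np

def max_pool1d(x, width=3, stride=1):
    """Max over a sliding window of the given width and stride."""
    return np.array([x[i:i + width].max()
                     for i in range(0, len(x) - width + 1, stride)])

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.1])
print(max_pool1d(detector, width=3, stride=1))   # [1.0, 1.0, 0.2, 0.1]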

17 Convolutional Network Components
Two common terminologies for the same stack of operations:
- Complex layer terminology: a single "convolutional layer" contains a convolution stage (affine transform), a detector stage (nonlinearity, e.g. rectified linear), and a pooling stage.
- Simple layer terminology: the convolution layer (affine transform), detector layer (nonlinearity, e.g. rectified linear), and pooling layer are counted as separate layers.
(Goodfellow 2016)

18 Max Pooling and Invariance to Translation
Figure: detector-stage and pooling-stage values for the same network before and after shifting the input one pixel to the right; every detector value changes, but the max-pooled values change very little. (Goodfellow 2016)

19 Cross-Channel Pooling and Invariance to Learned Transformations
Cross-channel pooling is max pooling over different feature maps rather than over spatial locations; it lets the network learn invariance to transformations such as rotation.
Figure: a large response in detector unit 1 or in detector unit 3 each produces a large response in the corresponding pooling unit. (Goodfellow 2016)

20 Pooling with Downsampling
Figure: max pooling with a stride greater than one reduces the size of the representation passed to the next layer. (Goodfellow 2016)
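The same max_pool1d sketch with a stride of two (values are my own) shows the downsampling effect: six detector units pool down to three outputs.

import numpy as np

def max_pool1d(x, width, stride):
    return np.array([x[i:i + width].max()
                     for i in range(0, len(x) - width + 1, stride)])

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.1])
print(max_pool1d(detector, width=2, stride=2))   # [1.0, 0.2, 0.1]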

21 Example Classification Architectures
Figure 9.11 shows three example architectures, each ending in a softmax over 1,000 class probabilities:
- Fixed 256x256x3 input: conv + ReLU (256x256x64) -> pooling, stride 4 (64x64x64) -> conv + ReLU (64x64x64) -> pooling, stride 4 (16x16x64) -> reshape to vector (16,384 units) -> matrix multiply (1,000 units) -> softmax.
- Variable-size input, pooled to a fixed grid: conv + ReLU -> pooling, stride 4 -> conv + ReLU -> pooling to a 3x3 grid (3x3x64) -> reshape to vector (576 units) -> matrix multiply (1,000 units) -> softmax.
- No fully connected layer: conv + ReLU -> pooling, stride 4 -> conv + ReLU -> pooling, stride 4 (16x16x64) -> convolution (16x16x1,000) -> average pooling (1x1x1,000) -> softmax.
(Goodfellow 2016)
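A hedged PyTorch sketch of the first architecture above (fixed 256x256x3 input). The layer output sizes follow the slide; the kernel sizes and padding are my own assumptions chosen so the spatial dimensions work out, not values from the book.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),   # 256x256x3  -> 256x256x64
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=4, stride=4),        # 256x256x64 -> 64x64x64
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # 64x64x64   -> 64x64x64
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=4, stride=4),        # 64x64x64   -> 16x16x64
    nn.Flatten(),                                 # 16*16*64 = 16,384 units
    nn.Linear(16 * 16 * 64, 1000),                # matrix multiply: 1,000 units
    nn.Softmax(dim=1),                            # 1,000 class probabilities
)

x = torch.randn(1, 3, 256, 256)                   # one RGB image, NCHW layout
print(model(x).shape)                             # torch.Size([1, 1000])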

22 4. Convolution and Pooling as an Infinitely Strong Prior
Prior probabilities (beliefs before we see the actual data) can be strong or weak:
- A weak prior (e.g. a Gaussian distribution with high variance) allows the data to move the parameters freely
- A strong prior (e.g. a Gaussian distribution with low variance) strongly determines the parameters
- An infinitely strong prior forbids some parameter values entirely, no matter what the data say
A convolutional net is like a fully connected net with an infinitely strong prior over its weights:
- Weights are zero outside each unit's small receptive field
- Weights are identical for neighboring hidden units (shifted in space)

23 4. Convolution and Pooling as an Infinitely Strong Prior
Convolution and pooling can cause underfitting
- The prior is useful only when the assumptions made by the prior are reasonably accurate
- If a task relies on preserving precise spatial information, then pooling can increase training error
- The prior imposed by convolution must be appropriate for the task

24 5. Variants of the Basic Convolution Function
- Stride: the amount of downsampling; separate strides can be used in different directions
- Zero padding: avoids layer-to-layer shrinking of the representation
- Unshared (locally connected) convolution: like convolution but without parameter sharing
- Partial connectivity between channels
- Tiled convolution: cycles between a set of shared parameter groups

25 Convolution with Stride
Figure 9.12: strided convolution (stride of two) computes the downsampled outputs directly; the equivalent approach of full convolution followed by downsampling gives the same result but is computationally wasteful. (Goodfellow 2016)
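A quick numpy check of that equivalence (the example values are my own): convolution with stride 2 equals stride-1 convolution followed by keeping every second output, without ever computing the discarded values.

import numpy as np

def cross_correlate1d(x, w, stride=1):
    k = len(w)
    return np.array([np.sum(x[i:i + k] * w)
                     for i in range(0, len(x) - k + 1, stride)])

x = np.array([1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0])
w = np.array([0.25, 0.5, 0.25])

strided  = cross_correlate1d(x, w, stride=2)
wasteful = cross_correlate1d(x, w, stride=1)[::2]   # compute all, then discard half
print(np.allclose(strided, wasteful))               # True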

26 Zero Padding Controls Size
Figure 9.13: with a kernel of width six and no zero padding, the representation shrinks at every layer; with zero padding, the spatial size is preserved. (Goodfellow 2016)
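The size arithmetic behind the figure (a sketch): a "valid" convolution with a width-k kernel shrinks a layer by k - 1 positions, while "same" zero padding keeps the width constant.

k = 6                              # kernel width, as in the figure
width_valid = width_same = 16
for layer in range(1, 4):
    width_valid -= k - 1           # no padding: lose k - 1 positions per layer
    # "same" zero padding keeps width_same unchanged at 16
    print(layer, width_valid, width_same)   # 1: 11 16   2: 6 16   3: 1 16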

27 Kinds of Connectivity
Figure 9.14: a locally connected (unshared convolution) layer uses a different parameter for every connection; convolution reuses the same small set of parameters at every location; a fully connected layer has a separate parameter for every input-output pair. (Goodfellow 2016)

28 Partial Connectivity Between Channels
Figure 9.15: each output channel is a function of only a subset of the input channels; the figure shows the channel and spatial coordinates of the input and output tensors. (Goodfellow 2016)

29 Tiled Convolution
Figure 9.16: local connection (no sharing), tiled convolution (cycles between a set of shared parameter groups), and standard convolution (one group shared everywhere). (Goodfellow 2016)

30 6. Structured Outputs
Convolutional networks are usually used for classification, but they can also output a high-dimensional, structured object, typically a tensor (for example, a tensor of per-pixel class probabilities for image segmentation).

31 7. Data Types
Single-channel examples:
- 1D: audio waveform
- 2D: audio data after a Fourier transform (frequency versus time)
Multi-channel examples:
- 2D color image data: three channels (red, green, blue), each a 2D grid of pixel intensities

32 8. Efficient Convolution Algorithms
Devising faster ways of performing convolution, or approximate convolution without harming the accuracy of the model, is an active area of research. However, most dissertation work concerns feasibility rather than efficiency.

33 9. Random or Unsupervised Features
One way to reduce the cost of convolutional network training is to use features that are not trained in a supervised fashion
Three methods (Rosenblatt used the first two):
- Simply initialize the convolution kernels randomly
- Design them by hand
- Learn the kernels with an unsupervised criterion

34 10. The Neuroscientific Basis for Convolutional Networks
Convolutional networks may be the greatest success story of biologically inspired AI
Some of the key design principles of neural networks were drawn from neuroscience
Hubel and Wiesel won the Nobel prize in 1981 for their work on the cat's visual system in the 1960s and 1970s

35 10. The Neuroscientific Basis for Convolutional Networks
- Neurons in the retina perform simple processing and do not substantially change the image representation
- The image passes through the optic nerve to a brain region called the lateral geniculate body
- The signal then reaches visual cortex area V1, also called the primary visual cortex
- V1 is the first brain area that performs advanced processing of visual input; it is located at the back of the head

36 10. The Neuroscientific Basis for Convolutional Networks
V1 properties captured in convolutional nets:
- V1 has a 2D structure mirroring the image on the retina
- V1 contains many simple cells, each characterized by a roughly linear function of the image within a small, spatially localized receptive field
- V1 contains many complex cells, which respond to features similar to those of simple cells but are invariant to small shifts in the position of the feature; this inspired pooling strategies such as maxout units

37 10. The Neuroscientific Basis for Convolutional Networks
Although we know the most about area V1, we believe the same basic strategy of detection followed by pooling applies to other brain areas as well
Passing through deeper layers, we find cells that respond to specific concepts
- These cells are nicknamed "grandmother cells"
- The idea is that such a neuron activates upon seeing an image of the person's grandmother, no matter where in the image she appears

38 10. The Neuroscientific Basis for Convolutional Networks
Reverse correlation:
- In biological networks we do not have access to the weights themselves
- However, we can put an electrode in a neuron, display images in front of the animal's retina, and record the activation of the neuron
- We can then fit a linear model to these responses to approximate the neuron's weights
- Most V1 cells have weights that are well described by Gabor functions

39 Gabor Functions
Figure 9.18: white indicates a positive weight, black a negative weight, and gray zero weight. (Left) Gabor detectors at different orientations; (Center) detectors of increasing width and height; (Right) detectors with different sinusoidal parameters. (Goodfellow 2016)
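A sketch of a Gabor-style weight template: a Gaussian envelope multiplied by a sinusoid of the rotated coordinates. The parameter names and default values below are my own illustrative choices, not the book's exact notation.

import numpy as np

def gabor(size=16, alpha=1.0, beta_x=0.05, beta_y=0.05,
          freq=0.8, phase=0.0, theta=0.0):
    """Return a size x size array of Gabor weights centered on the patch."""
    coords = np.arange(size) - (size - 1) / 2.0
    y, x = np.meshgrid(coords, coords, indexing="ij")
    # Rotate coordinates by theta, then apply the Gaussian envelope and sinusoid.
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    return alpha * np.exp(-beta_x * x_r**2 - beta_y * y_r**2) * np.cos(freq * x_r + phase)

w = gabor(theta=np.pi / 4)        # an oriented detector, as in the left panel
print(w.shape, w.min(), w.max())  # mixes positive (white) and negative (black) weights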

40 Gabor-like Learned Kernels
Figure 9.19: (Left) weights learned by an unsupervised learning algorithm; (Right) convolutional kernels learned by the first layer of a fully supervised convolutional maxout network. Both resemble Gabor functions. (Goodfellow 2016)

41 11. Convolutional Networks and the History of Deep Learning
Convolutional networks have played an important role in the history of deep learning:
- A key application of neuroscience to machine learning
- Among the first deep models to perform well
- Provided some of the first important commercial applications of neural networks
- Used to win many contests
- Among the first deep networks trained successfully with back-propagation
- Performed well decades ago, paving the way toward acceptance of neural networks in general

42 11. Convolutional Networks and the History of Deep Learning
Convolutional networks are neural networks specialized for data with a grid-structured topology
They have been most successful on 2D image topology
For 1D sequential data, recurrent networks are generally used instead

