Presentation on theme: "Understanding the true capacity of deep nets"— Presentation transcript:

1 Understanding the true capacity of deep nets
[ICML'18 paper] Sanjeev Arora (Princeton University and Institute for Advanced Study), Rong Ge (Duke University), Behnam Neyshabur (IAS), Yi Zhang (Princeton University). Support: NSF, ONR, Simons Foundation, Schmidt Foundation, Amazon Research, Mozilla Research, SRC/DARPA.

2 Machine learning in the news
Operationally speaking, Machine learning = Search for patterns in data. Most of these headlines refer to deep learning

3 Old Idea: Curve fitting (Legendre, Gauss, c. 1800)
Examples: the gas law (c. 1800), PV = nRT; the Phillips curve (1958), relating inflation to unemployment. Machine learning = surface fitting, with many more variables ("learning patterns in data").
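To make "curve fitting" concrete, here is a minimal least-squares fit of a line to synthetic noisy data (a sketch assuming numpy; the data are made up for illustration and are not from the talk):

```python
import numpy as np

# Synthetic observations (illustrative only): roughly y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

# Classic least-squares "curve fitting": solve for slope and intercept.
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"fitted line: y = {slope:.2f} x + {intercept:.2f}")
```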

4 Deep Learning in a slide..
θ: parameters of the deep net. (x1, y1), (x2, y2), …: i.i.d. (point, label) pairs drawn from distribution D (the training data). Loss function \ell(\theta, x, y): how well the net's output matches the true label y on point x; can be \ell_2, cross-entropy, ….
Objective: \text{argmin}_{\theta} E_{i}[\ell(\theta, x_i, y_i)]
Gradient descent: \theta^{(t+1)} \leftarrow \theta^{(t)} - \eta \nabla_{\theta}\big(E_{i}[\ell(\theta^{(t)}, x_i, y_i)]\big)
Stochastic GD: estimate the gradient via a small sample of i's. Backpropagation: fast algorithm to compute the gradient.
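A minimal sketch of the SGD update above, assuming numpy and a user-supplied grad_loss routine (e.g., computed via backpropagation); this is an illustration, not code from the talk:

```python
import numpy as np

def sgd(theta, data, grad_loss, lr=0.01, batch_size=32, steps=1000, seed=0):
    """Plain stochastic gradient descent on E_i[loss(theta, x_i, y_i)].

    grad_loss(theta, xb, yb) must return the gradient of the average loss
    over the mini-batch (xb, yb) with respect to theta.
    """
    rng = np.random.default_rng(seed)
    x, y = data
    for _ in range(steps):
        idx = rng.choice(len(x), size=batch_size, replace=False)  # small sample of i's
        theta = theta - lr * grad_loss(theta, x[idx], y[idx])      # theta^(t+1) update
    return theta
```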

5 Mysteries
Why/how does optimization find decent solutions? (Highly nonconvex.) Why do nets generalize (predict unseen data well)? E.g., VGG19 on CIFAR10: 6M+ variables, 50K samples. Expressiveness/interpretability: what are the nodes expressing? How is depth useful? Focus of today's talk: the generalization mystery.
Training error: E_{i}[\ell(\theta, x_i, y_i)]. Test error: E_{(x, y) \sim \mathcal{D}}[\ell(\theta, x, y)]. The explanations will probably overlap…

6 Effective Capacity of an ML model
The number of samples required to ensure generalization for the model class. Higher effective capacity can lead to overfitting when training samples are few.

7 Hope: Understand what makes a net "well-trained"
Generalization mystery of deep learning: e.g., why does VGG19 (6M parameters) trained on CIFAR10 (50K samples) classify unseen data well? (Deep nets are able to fit data with random labels [Zhang et al., ICLR'17]; regularization doesn't help: the net can still fit randomly labeled data.)

8 Better measure of effective capacity??
Generalization theory: \text{Test loss} - \text{Training loss} \leq \sqrt{\frac{N}{m}}, where m = # training samples and N = "effective capacity." Simplest choice: N = # of model parameters (vacuous if N > m). Classic results replace N by other "complexity measures" (VC dimension, Rademacher complexity, etc.).
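Plugging in the VGG19/CIFAR10 numbers quoted in the talk shows why the naive choice N = # parameters is vacuous (a back-of-the-envelope check; constants and log factors ignored):

```python
import math

N = 6_000_000   # parameters in VGG19 (as quoted in the talk)
m = 50_000      # CIFAR10 training samples

bound = math.sqrt(N / m)
print(f"sqrt(N/m) = {bound:.1f}")  # ~11, far above 1, so the bound says nothing
```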

9 Data-independent vs. data-dependent complexity measures
Data-independent complexity measures, e.g., VC-dimension [Bartlett'96]: depth × # params.
Data-dependent complexity measures, e.g., margin, Rademacher complexity, PAC-Bayes. [Zhang et al'17]: empirically, Rademacher complexity is HIGH for deep nets.

10 DESCRIPTIVE VS PRESCRIPTIVE
Important philosophical distinction worth teaching in our courses: DESCRIPTIVE VS PRESCRIPTIVE

11 DESCRIPTIVE
"Doctor, I wake up several times at night…" "Oh, you have a sleep disorder."

12 PRESCRIPTIVE (NB: requires examination!)
"Doctor, I wake up several times at night…" "You have a growth in your nose that's causing apnea. Let's remove it."

13 DESCRIPTIVE
"Doctor, my deep net doesn't generalize**." "Oh, your net class has high VC dimension." "Your net class + data distribution has high Rademacher complexity."
** Loss is low on training data, high on held-out data.

14 PRESCRIPTIVE: answer needs examination (of the net and the data)!
"Doctor, my deep net doesn't generalize**." "Using algorithm A on this architecture/data combo, nets will generalize…"
** Loss is low on training data, high on held-out data.

15 An old notion: Flat Minima [Hochreiter, Schmidhuber’95]
[Hinton-van Camp'93]. "Flat minima" generalize better empirically [Keskar et al.'16]. (Figure: flat vs. sharp minimum.) A flat minimum has lower description length ⇒ fewer effective parameters. Frequent suggestion: noisy gradient descent favors flatter minima… This workshop: implicit bias of optimization… Makes intuitive sense but hard to make quantitative…

16 Example of flat minima: Margin for linear classifiers
γ = margin. The "fat slab" is expressible using \log N / \gamma^2 numbers (= effective capacity). Alternative interpretation of margin: "noise stability": the optimal classifier is stable both to noise in the data and to noise in the classifier's parameters.
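A minimal sketch of measuring the margin γ of a linear classifier on separable data (assuming numpy; the classifier w and the toy data below are hypothetical placeholders):

```python
import numpy as np

def margin(w, X, y):
    """Margin of the linear classifier sign(w . x) on labeled data (X, y).

    Returns min_i y_i * <w, x_i> / ||w||: the distance from the closest point
    to the separating hyperplane (negative if any point is misclassified).
    """
    return np.min(y * (X @ w)) / np.linalg.norm(w)

# Toy usage: two well-separated points.
X = np.array([[1.0, 2.0], [-1.0, -2.0]])
y = np.array([1, -1])
w = np.array([1.0, 2.0])
print(margin(w, X, y))  # ~2.24
```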

17 Example of flat minima: Margin for linear classifiers
Example 2: PAC-Bayes bound for 2-layer deep nets [Langford-Caruana'02]. Thought experiment: do θ* and θ* + η have similar training loss (η = Gaussian noise)? Then var(η) is analogous to margin and gives an estimate of effective capacity. (Figure: prior P on hypotheses; posterior Q centered at θ* + η.) Effective capacity ≈ KL(Q‖P) [McAllester'99].
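A minimal sketch of the thought experiment above: perturb trained parameters with Gaussian noise of increasing scale and record the largest scale the training loss tolerates (assuming numpy, a user-supplied loss function, and a trained theta_star; illustrative only):

```python
import numpy as np

def noise_tolerance(theta_star, loss, sigmas, trials=10, tol=0.05, seed=0):
    """Largest noise scale sigma such that adding N(0, sigma^2) noise to
    theta_star raises the (averaged) training loss by at most `tol`.

    A larger tolerated sigma suggests a flatter minimum and a smaller
    effective capacity, in the spirit of the PAC-Bayes thought experiment.
    """
    rng = np.random.default_rng(seed)
    base = loss(theta_star)
    best = 0.0
    for sigma in sorted(sigmas):
        noised = [loss(theta_star + sigma * rng.standard_normal(theta_star.shape))
                  for _ in range(trials)]
        if np.mean(noised) - base <= tol:
            best = sigma
    return best
```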

18 Nonvacuous bound on “true parameters” has proved elusive..
Estimates from PAC-Bayes/margin analyses are plotted on this slide. [Dziugaite-Roy'17] obtain nonvacuous bounds for MNIST** but don't yield any "complexity measure".

19 Nonvacuous bound on “true parameters” has proved elusive..
Caveat: the plot ignores "nuisance" factors like polylog terms, 1/ε, etc.

20 Current status of generalization theory: postmortem analysis…
(Figure: a trained net.)

21 Compression-based method for generalization bounds [A., Ge, Neyshabur, Zhang'18]; a user-friendly version of PAC-Bayes
(Figure: a trained net with # parameters ≫ # datapoints is compressed to a net with # parameters ≪ # datapoints; the training error changes little, so the compressed net GENERALIZES!) Important: the compression method and its random bits are fixed before seeing the training data.

22 Important caveat: the compression method proves generalization of the compressed classifier, not of the originally trained classifier. (This is also true for PAC-Bayes, which proves generalization of the noised classifier.) Goal of theoretical analysis: asymptotic "complexity measures" that apply across settings.

23 Reason for compressibility: Noise stability (can be seen as “margin” for deep nets)
[A., Ge, Neyshabur, Zhang'18]. A weaker but related discovery: "On the Importance of Single Directions for Generalization", Morcos et al.'18. Key observation: Gaussian noise of a given norm injected at a layer gets attenuated as it passes through to higher layers (each layer is fairly stable to noise introduced at lower layers!).
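A minimal sketch of the noise-stability measurement described above: inject Gaussian noise after one layer of a toy ReLU net and track the relative perturbation at the higher layers (assuming numpy and a list of weight matrices; a stand-in for the talk's VGG19 experiments, not the authors' code):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def noise_attenuation(weights, x, inject_at, noise_scale=0.1, seed=0):
    """Feed x through the ReLU net defined by `weights` (a list of matrices).

    After layer `inject_at`, add Gaussian noise whose norm is `noise_scale`
    times the activation norm, then report the relative perturbation
    ||h_noisy - h_clean|| / ||h_clean|| at every later layer.
    Decreasing ratios indicate noise stability.
    """
    rng = np.random.default_rng(seed)
    h_clean, h_noisy, ratios = x, None, []
    for i, W in enumerate(weights):
        h_clean = relu(W @ h_clean)
        if i == inject_at:
            eta = rng.standard_normal(h_clean.shape)
            eta *= noise_scale * np.linalg.norm(h_clean) / np.linalg.norm(eta)
            h_noisy = h_clean + eta
        elif i > inject_at:
            h_noisy = relu(W @ h_noisy)
            ratios.append(np.linalg.norm(h_noisy - h_clean) / np.linalg.norm(h_clean))
    return ratios
```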

24 “Reliable machines and unreliable components…
Von Neumann, J. (1956). Probabilistic logics and the synthesis of reliable organisms from unreliable components.
"Reliable machines and unreliable components… We have, in human and animal brains, examples of very large and relatively reliable systems constructed from individual components, the neurons, which would appear to be anything but reliable. In communication theory this can be done by properly introduced redundancy." Shannon, C. E. (1958). Von Neumann's contributions to automata theory.

25 Understanding noise stability for one layer (no nonlinearity)
η: Gaussian noise. (Diagram: x ↦ Mx, and x + η ↦ M(x + η).) Empirically, |Mx|/|x| ≫ |Mη|/|η|: the layer stretches the actual signal much more than it stretches random noise. Layer cushion = this ratio (roughly speaking). For Gaussian η, |Mη|/|η| ≈ (\sum_i \sigma_i(M)^2)^{1/2}/\sqrt{n}, versus the worst-case stretch \sigma_{max}(M). (Figure: distribution of singular values in a filter of layer 10 of VGG19.)
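A minimal sketch of this single-layer measurement (assuming numpy; W and x stand in for a trained layer's matrix and a real activation vector):

```python
import numpy as np

def layer_cushion_ratio(W, x, seed=0):
    """Compare how much W stretches the actual input x versus Gaussian noise.

    Returns (|Wx|/|x|) / (|W eta|/|eta|) for random Gaussian eta of the same
    dimension; values well above 1 indicate noise stability of this layer.
    """
    rng = np.random.default_rng(seed)
    eta = rng.standard_normal(x.shape)
    signal_stretch = np.linalg.norm(W @ x) / np.linalg.norm(x)
    noise_stretch = np.linalg.norm(W @ eta) / np.linalg.norm(eta)
    return signal_stretch / noise_stretch
```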

26 Trivial compression idea (not good enough):
Drop small singular values to get a rank-r matrix. Difficulty: we do not know how this compression affects computation at higher layers…
Our compression (like JL dimension reduction): (1) generate k random sign matrices M_1, …, M_k (important: picked before seeing the data); (2) set \hat{A} = \frac{1}{k}\sum_{t=1}^{k} \langle A, M_t\rangle\, M_t.

27 Proof sketch: noise stability ⇒ deep net compressible
Idea 1: compress each layer (randomized; the errors introduced are "Gaussian-like"). Idea 2: errors attenuate as they pass through the network, as noted earlier (this allows more aggressive compression…). Compression (like JL dimension reduction, as on the previous slide): (1) generate k random sign matrices M_1, …, M_k (important: picked before seeing the data); (2) set \hat{A} = \frac{1}{k}\sum_{t=1}^{k} \langle A, M_t\rangle\, M_t.
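A minimal numpy sketch of the random sign-matrix compression step written out above (an illustration of the formula, not the authors' implementation):

```python
import numpy as np

def compress_layer(A, k, seed=0):
    """Compress matrix A as (1/k) * sum_t <A, M_t> M_t, with M_1..M_k random
    +/-1 sign matrices fixed independently of the data.

    Only the k inner products <A, M_t> need to be stored (plus the shared
    random seed), so the description length is k numbers instead of A.size.
    """
    rng = np.random.default_rng(seed)          # fixed before seeing any data
    A_hat = np.zeros_like(A, dtype=float)
    for _ in range(k):
        M = rng.choice([-1.0, 1.0], size=A.shape)
        A_hat += np.vdot(A, M) * M             # <A, M> is the Frobenius inner product
    return A_hat / k
```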

28 Interlayer Cushion
What is f(x + η)? For ReLU layers, f(x) = \nabla_x f \cdot x (the Jacobian applied to the input), so the effect of the layers can be read off from the Jacobian \nabla_x f. Interlayer cushion: the relative change in the output of the linearized stack of layers when Gaussian noise η is added to the input; roughly, the ratio \frac{\|\nabla_x f \cdot x\|}{\|\nabla_x f\|_F \, \|x\|}.
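A minimal sketch of estimating this ratio for a differentiable map f using a finite-difference Jacobian (assuming numpy; f and x are placeholders, and the paper's definition aggregates this quantity over the training set and over pairs of layers rather than at a single point):

```python
import numpy as np

def interlayer_cushion(f, x, eps=1e-5):
    """Estimate ||J x|| / (||J||_F * ||x||) for the Jacobian J of f at x.

    The Jacobian is approximated by forward finite differences; in the talk
    this ratio is measured per pair of layers and over the training data.
    """
    fx = np.asarray(f(x), dtype=float)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x, dtype=float)
        dx[i] = eps
        J[:, i] = (np.asarray(f(x + dx), dtype=float) - fx) / eps
    return np.linalg.norm(J @ x) / (np.linalg.norm(J) * np.linalg.norm(x))
```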

29 Interlayer Smoothness: the interlayer function's response to noise introduced by compression at a lower layer is well-approximated by its Jacobian. (Figure: the true output at x + η vs. the linearized prediction.) Empirically, this distance is small and depends on the magnitude of the noise η.

30 The Generalization Bound
capacity \approx \text{depth} \times \left(\frac{\text{activation contraction}}{\text{layer cushion} \times \text{interlayer cushion}}\right)^{2}, which can be much smaller than #param.

31 Empirical investigation of properties

32 Correlation to Generalization
(“Corrupted”: Dataset with many random labels)

33 Extending to convolutional nets (past analyses couldn’t apply meaningfully)
Difficulty: the same filter/matrix is reused at all patches, so the net is already "compressed"! Bad idea 1: compress each copy of the filter independently (blows up the effective # parameters). Bad idea 2: compress the filter once (keeps the effective # parameters low, but the resulting error can't be controlled in the proof). Our idea: p-wise independent compression at the different copies.

34 Concluding thoughts on generalization…
It is still hard to prove why the original net (as opposed to the compressed/noised version) generalizes. The argument still seems too crude to explain why VGG19 generalizes on CIFAR10 (50K samples). [Zhou et al.'18] use nonconvex optimization to compute a nonvacuous PAC-Bayes generalization bound for ImageNet (1M training samples); like [Dziugaite-Roy'17], this yields no asymptotic "complexity measure."

35 Come join special year at Institute for Advanced Study 2019-20 http://www.math.ias.edu/sp/
Resources: Grad seminar (hope to put all notes online soon) THANK YOU!

