Head First Dropout Naiyan Wang

Outline
- Introduction to Dropout
  - Basic idea and intuition
  - Some common mistakes about dropout
- Practical Improvements
  - DropConnect
  - Adaptive Dropout (Standout)
- Theoretical Justification
  - Interpretation as an adaptive regularizer
  - Output approximated by the normalized weighted geometric mean (NWGM)

Basic Idea and Intuition What is Dropout? It is a simple but very effective technique for alleviating overfitting: during training, each unit is randomly dropped (set to zero) with some probability.

Basic Idea and Intuition If the dropout probability during training is 𝜆, then at test time we keep all units and scale the weights by 1 − 𝜆. This is approximately equivalent to training all 2^𝑁 possible sub-networks simultaneously and averaging their predictions at test time.
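A minimal NumPy sketch of this scheme (the function names and toy data are illustrative, not from the slides): each unit is dropped with probability 𝜆 at training time, and at test time every unit is kept but scaled by 1 − 𝜆.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, lam):
    """Training pass: zero each unit independently with probability lam."""
    mask = rng.random(h.shape) >= lam      # keep a unit with probability 1 - lam
    return h * mask

def dropout_test(h, lam):
    """Test pass: keep every unit, scale by the keep probability 1 - lam."""
    return h * (1.0 - lam)

h = rng.standard_normal(5)                 # activations of one hidden layer
print(dropout_train(h, lam=0.5))           # a random subset of units zeroed
print(dropout_test(h, lam=0.5))            # deterministic, all units scaled by 0.5
```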

Results Plots of results on MNIST and TIMIT (figures not reproduced here).

Results Additional result figures (not reproduced here).

Some Common Mistakes
- "Dropout is limited to deep learning." No: even simple logistic regression benefits from it.
- "Dropout is just a magic trick (bug or feature?)." No: we will show shortly that it is equivalent to a form of regularization.

DropConnect Whereas dropout masks unit activations, DropConnect masks individual weights. (Figure comparing dropout and DropConnect omitted.)
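A short sketch of the difference (illustrative NumPy, not the authors' implementation): dropout applies a Bernoulli mask to the activation vector, while DropConnect applies it to the entries of the weight matrix before the matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 6))            # weight matrix of one layer
x = rng.standard_normal(6)                 # input activations
lam = 0.5                                  # drop probability

# Dropout: mask the input activations, then use the full weight matrix.
unit_mask = rng.random(x.shape) >= lam
y_dropout = W @ (x * unit_mask)

# DropConnect: mask individual weights, then use the full activations.
weight_mask = rng.random(W.shape) >= lam
y_dropconnect = (W * weight_mask) @ x

print(y_dropout, y_dropconnect)
```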

Standout Instead of fixing the dropout rate 𝜆, this method learns a keep probability for each unit: P(m_j = 1) = f(Σ_i 𝜋_{ji} a_i), where m_j is the binary mask and the standout weights 𝜋 are learned along with 𝑤. The output is o_j = m_j · g(Σ_i w_{ji} a_i). Note it is a stochastic network now.
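A rough sketch of one Standout layer under the formulation written above (all names here are mine, and the tying of 𝜋 to 𝑤 is the heuristic described on the next slide): a second set of weights drives a per-unit keep probability, and the sampled mask gates the usual unit output.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def standout_layer(a, W, Pi):
    """One stochastic Standout layer.

    a  : input activations
    W  : ordinary weights of the layer
    Pi : standout weights that set the per-unit keep probability
    """
    keep_prob = sigmoid(Pi @ a)                     # learned, input-dependent keep rate
    m = rng.random(keep_prob.shape) < keep_prob     # sample the binary mask
    return m * sigmoid(W @ a)                       # mask gates the usual unit output

a = rng.standard_normal(6)
W = rng.standard_normal((4, 6))
Pi = 1.0 * W + 0.0            # heuristic tying Pi = alpha * W + beta (alpha = 1, beta = 0 here)
print(standout_layer(a, W, Pi))
```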

Standout (con’t) Learning involves two sets of parameters: 𝜋 and 𝑤. For 𝑤, it appears in both parts of the objective and the exact derivative is hard to compute, so the authors ignore the first part. For 𝜋, the update is much like learning in an RBM, which minimizes the free energy of the model. Empirically, the learned 𝜋 and 𝑤 turn out to be quite similar, so the authors simply set 𝜋 = 𝛼𝑤 + 𝛽 for scalar 𝛼 and 𝛽.

Standout (con’t) (Slide figure omitted.)

Results Both DropConnect and Standout report improvements over standard dropout in their respective papers. The real performance still needs to be tested in a fair, controlled comparison.

Discussion
- The problem at test time: simply scaling the weights is not an exact solution because of the nonlinear activation function.
  - DropConnect: approximate the output by a moment-matched Gaussian (see the sketch after this list).
  - More results on this in "Understanding Dropout".
- Possible connection to Gibbs sampling with Bernoulli variables?
- Better ways to do dropout?
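A sketch of the moment-matching idea (my own illustrative code, not DropConnect's actual inference routine): under a Bernoulli mask, the pre-activation of a unit has a mean and variance that are easy to compute, so one can sample a moment-matched Gaussian and average the nonlinearity instead of enumerating all masks.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gaussian_test_output(w, x, lam, n_samples=1000):
    """Approximate E[sigmoid(sum_i m_i * w_i * x_i)] over Bernoulli masks m_i.

    The random pre-activation is replaced by a Gaussian with the same mean
    and variance; the expectation is then estimated by sampling that Gaussian.
    """
    keep = 1.0 - lam
    mu = keep * np.dot(w, x)                          # E[pre-activation]
    var = keep * lam * np.sum((w * x) ** 2)           # Var[pre-activation]
    z = rng.normal(mu, np.sqrt(var), size=n_samples)
    return sigmoid(z).mean()

w = rng.standard_normal(10)
x = rng.standard_normal(10)
print(gaussian_test_output(w, x, lam=0.5))            # moment-matched Gaussian estimate
print(sigmoid(0.5 * np.dot(w, x)))                    # naive weight-scaling estimate
```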

Adaptive Regularization In this paper, we consider the following GLM: p(y | x, β) = h(y) exp(y x·β − A(x·β)). Standard MLE on noisy observations x̃ (with E[x̃] = x) optimizes: min_β Σ_i E[−log p(y_i | x̃_i, β)]. Some simple math gives: Σ_i E[−log p(y_i | x̃_i, β)] = −Σ_i log p(y_i | x_i, β) + R(β), with R(β) = Σ_i (E[A(x̃_i·β)] − A(x_i·β)). The regularizer!

Adaptive Regularization (con’t) The explicit form of R(β) is not tractable in general, so we resort to a second-order approximation: R^q(β) = ½ Σ_i A″(x_i·β) Var[x̃_i·β]. Then the main result of this paper: for dropout noise with drop probability δ, R^q(β) = (δ / (2(1 − δ))) β⊤ diag(X⊤ V(β) X) β, where V(β) = diag(A″(x_i·β)).
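For concreteness, a small NumPy sketch of this quadratic penalty in the logistic regression case (the function and its arguments are illustrative, and the formula follows the reconstruction above, where A″(x_i·β) = p_i(1 − p_i)):

```python
import numpy as np

def dropout_quadratic_penalty(X, beta, delta):
    """Quadratic approximation R^q(beta) of the dropout regularizer
    for logistic regression.

    X     : n x d design matrix
    beta  : d-dimensional coefficient vector
    delta : dropout (drop) probability
    """
    p = 1.0 / (1.0 + np.exp(-X @ beta))               # predicted probabilities
    fisher_diag = (p * (1.0 - p)) @ (X ** 2)          # sum_i p_i (1 - p_i) x_ij^2, per feature j
    return delta / (2.0 * (1.0 - delta)) * np.sum(fisher_diag * beta ** 2)

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 5))
beta = rng.standard_normal(5)
print(dropout_quadratic_penalty(X, beta, delta=0.5))
```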

Adaptive Regularization (con’t) The result is particularly interesting for logistic regression, where A″(x_i·β) = p_i(1 − p_i): First, both types of noise penalize strongly activated or strongly non-activated outputs less, since p_i(1 − p_i) is small when p_i is near 0 or 1. It is OK to be confident. In addition, dropout penalizes rarely activated features less, which works well with sparse, discriminative features.

Adaptive Regularization (con’t) In the general GLM case, dropout is equivalent to scaling the L2 penalty by the diagonal of the Fisher information matrix. This also connects to AdaGrad, an online learning algorithm. Since the regularizer does not depend on the labels, unlabeled data can also be used to design better adaptive regularizers.

Understanding Dropout This paper focuses only on dropout with sigmoid units. For a one-layer network, we can show that the test-time, weight-scaled output is exactly the normalized weighted geometric mean (NWGM) of the sub-network outputs: NWGM(O) = Π_m O_m^{P(m)} / (Π_m O_m^{P(m)} + Π_m (1 − O_m)^{P(m)}), where m ranges over dropout masks. But how is it related to 𝐸(𝑂)?
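A small sanity-check sketch (illustrative only, assuming a single sigmoid unit with dropout on its inputs): enumerating all masks shows that the NWGM of the sub-network outputs coincides with the weight-scaled test-time output, and that 𝐸(𝑂) is close to both.

```python
import numpy as np
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -1.2, 0.5])             # weights into a single sigmoid unit
x = np.array([1.0, 2.0, -1.5])             # inputs, each dropped with probability lam
lam = 0.5
keep = 1.0 - lam

# Enumerate all 2^N dropout masks with their probabilities.
outputs, probs = [], []
for m in product([0, 1], repeat=len(w)):
    m = np.array(m)
    probs.append(np.prod(np.where(m == 1, keep, lam)))
    outputs.append(sigmoid(np.dot(m * w, x)))
outputs, probs = np.array(outputs), np.array(probs)

expectation = np.sum(probs * outputs)                        # E(O)
geo = np.prod(outputs ** probs)                              # weighted geometric mean
nwgm = geo / (geo + np.prod((1.0 - outputs) ** probs))       # normalized weighted geometric mean
scaled = sigmoid(keep * np.dot(w, x))                        # weight-scaled test-time output

print(expectation, nwgm, scaled)           # nwgm matches scaled; expectation is close to both
```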

Understanding Dropout The main result of this paper: a bound relating 𝐸(𝑂) and NWGM(𝑂) that is really tight regardless of whether 𝐸 is near 0, 0.5, or 1. Interestingly, the second part of this paper turns out to be just a special case of the previous result.

Discussion These two papers are both limited to linear and sigmoid units, but the most popular unit now is the ReLU. We still need to understand dropout in that setting.

Take Away Message Dropout is a simple and effective way to reduce overfitting. It can be enhanced by designing more advanced perturbation schemes. It is equivalent to a kind of adaptive penalty that accounts for the characteristics of the data. Its test-time output can be approximated well by the normalized weighted geometric mean.

References
Hinton, Geoffrey E., et al. "Improving neural networks by preventing co-adaptation of feature detectors." arXiv preprint arXiv:1207.0580 (2012).
Wan, Li, et al. "Regularization of neural networks using DropConnect." In ICML 2013.
Ba, Jimmy, and Brendan Frey. "Adaptive dropout for training deep neural networks." In NIPS 2013.
Wager, Stefan, Sida Wang, and Percy Liang. "Dropout training as adaptive regularization." In NIPS 2013.
Baldi, Pierre, and Peter J. Sadowski. "Understanding dropout." In NIPS 2013.
Uncovered papers:
Wang, Sida, and Christopher Manning. "Fast dropout training." In ICML 2013.
Warde-Farley, David, et al. "An empirical analysis of dropout in piecewise linear networks." arXiv preprint arXiv:1312.6197 (2013).