
Qual Presentation
Daniel Khashabi

Outline
- My own line of research
- Papers:
  - Fast Dropout Training, ICML 2013
  - Distributional Semantics Beyond Words: Supervised Learning of Analogy and Paraphrase, TACL

Current Line of Research
- Conventional approach to a classification problem
- Problems:
  - Never uses the label information
  - Loses the structure in the output
  - Limited to the classes in the training set
  - Hard to leverage unsupervised data

Current Line of Research
- For example, take the relation extraction problem.
- Conventional approach: given a sentence s and mentions e1 and e2, find their relation.
- Example: "Bill Gates, CEO of Microsoft ..." with mentions Bill Gates and Microsoft; output: Manager

Current Line of Research
- Let's change the problem a little: create a claim about the relation.
- Before: Text = "Bill Gates, CEO of Microsoft ...", R = Manager
- After: Text = "Bill Gates, CEO of Microsoft ...", Claim = "Bill Gates is manager of Microsoft", Label = True

Current Line of Research
- Creating data is very easy! What we do (sketched below):
  - Use knowledge bases to find entities that are related
  - Find sentences that contain these entities
  - Create claims about the relation in the original sentence
  - Ask Turkers to label them
- Much easier than extracting relation labels directly
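
A minimal sketch of this data-creation pipeline, assuming a hypothetical knowledge base of (entity1, relation, entity2) triples and one claim template per relation; the names, data, and templates here are illustrative placeholders, not the actual pipeline.

```python
# Hypothetical sketch of the claim-generation procedure described above.
# The knowledge base, corpus, and templates are made-up placeholders.
knowledge_base = [("Bill Gates", "manager", "Microsoft")]
corpus = ["Bill Gates, CEO of Microsoft, announced a new product."]
templates = {"manager": "{e1} is manager of {e2}"}

def generate_claim_instances(knowledge_base, corpus, templates):
    instances = []
    for e1, relation, e2 in knowledge_base:
        for sentence in corpus:
            # Keep only sentences that mention both related entities.
            if e1 in sentence and e2 in sentence:
                claim = templates[relation].format(e1=e1, e2=e2)
                # Each (text, claim) pair is then shown to Turkers,
                # who only need to answer True / False.
                instances.append({"text": sentence, "claim": claim})
    return instances

print(generate_claim_instances(knowledge_base, corpus, templates))
```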

Current Line of Research
- This formulation makes use of the information inherent in the label.
- This helps us generalize to relations that are not seen in the training data.

Outline
- My own line of research
- Papers:
  - Fast Dropout Training, ICML 2013
  - Distributional Semantics Beyond Words: Supervised Learning of Analogy and Paraphrase, TACL

Dropout training
- Proposed by Hinton et al. (2012)
- On each training case, decide whether to delete each hidden unit, with some probability p
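
A minimal NumPy sketch of this masking step, assuming a vector of hidden activations h and a drop probability p; this is an illustration of the idea, not the implementation from the paper.

```python
import numpy as np

def dropout(h, p, rng=np.random.default_rng(0)):
    """Zero out each unit of h independently with probability p (training time)."""
    keep_mask = rng.random(h.shape) >= p   # each unit survives with probability 1 - p
    return h * keep_mask

h = np.array([0.5, -1.2, 0.3, 2.0])
print(dropout(h, p=0.5))   # roughly half the units are zeroed on average
```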

Dropout training
- Model averaging effect
  - Among exponentially many models with shared parameters, only a few get trained
- Much stronger than the known regularizers
- What about the input space? Do the same thing!

Dropout training
- Model averaging effect
  - Among exponentially many models with shared parameters, only a few get trained
- Much stronger than the known regularizers
- What about the input space? Do the same thing!
- Dropout of 50% of the hidden units and 20% of the input units (Hinton et al., 2012)

Outline
- Can we explicitly show that dropout acts as a regularizer?
  - Very easy to show for linear regression; what about other models?
- Dropout needs sampling, which can be slow
  - Can we convert the sampling-based update into a deterministic form?
  - Find the expected form of the updates

Linear Regression
- Reminder: consider standard linear regression
- With regularization
- Closed-form solution (reconstructed below)
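
The formulas on this slide did not survive extraction; the following is the standard ridge-regression setup they presumably refer to (a reconstruction, not copied from the slides):

```latex
% Standard linear regression with L2 regularization over N examples (x_n, y_n):
\[
L(w) \;=\; \sum_{n=1}^{N} \bigl(y_n - w^\top x_n\bigr)^2 \;+\; \lambda \lVert w \rVert_2^2
\]
% Closed-form solution, with X the N x d design matrix and y the target vector:
\[
\hat{w} \;=\; \bigl(X^\top X + \lambda I\bigr)^{-1} X^\top y
\]
```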

Dropout Linear Regression
- Consider standard linear regression
- Linear regression with dropout (sketched below)
- How do we find the parameters?
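
The dropout objective on this slide is also missing; under the usual formulation, each input dimension is multiplied by an independent Bernoulli variable. The sketch below writes p for the probability of keeping a dimension (one minus the dropout rate) and is a reconstruction, not the slide's own notation.

```latex
% Linear regression with dropout on the inputs:
% z_{n,i} ~ Bernoulli(p) masks dimension i of example n, and \circ is the
% element-wise product. Training samples a fresh mask for every update.
\[
L(w) \;=\; \sum_{n=1}^{N} \bigl(y_n - w^\top (z_n \circ x_n)\bigr)^2,
\qquad z_{n,i} \sim \mathrm{Bernoulli}(p).
\]
```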

Fast Dropout for Linear Regression
- We had:
- Instead of sampling, minimize the expected loss
- Fixed x and y:

Fast Dropout for Linear Regression
- We had:
- Instead of sampling, minimize the expected loss:
- Expected loss:

Fast Dropout for Linear Regression
- Expected loss:
- Data-dependent regularizer
- A closed form can be found (reconstructed below):
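
The expansion on this slide is missing; the standard calculation (a reconstruction, using the Bernoulli(p) keep-mask above) goes as follows.

```latex
% Expectation of the dropout squared loss for one example,
% using E[z_i] = p and Var[z_i] = p(1 - p):
\[
\mathbb{E}_z\!\left[\bigl(y - w^\top (z \circ x)\bigr)^2\right]
\;=\; \bigl(y - p\, w^\top x\bigr)^2 \;+\; p(1 - p) \sum_i w_i^2 x_i^2 .
\]
% The second term is the data-dependent regularizer: an L2 penalty on w weighted
% by the per-dimension second moments of the inputs. Summing over the data and
% setting the gradient to zero gives a ridge-like closed form:
\[
\hat{w} \;=\; \Bigl(p^2 X^\top X + p(1 - p)\,\mathrm{diag}\Bigl(\sum_n x_n \circ x_n\Bigr)\Bigr)^{-1} p\, X^\top y .
\]
```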

Some definitions
- Dropout each input dimension randomly
- Probit
- Logistic function / sigmoid

Some useful equalities
- We can find the following expectation in closed form (reconstructed below):
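
The definitions and identities on these two slides did not survive extraction; the standard ones used in fast dropout (a reconstruction) are:

```latex
% Bernoulli keep-mask on each input dimension:
\[
z_i \sim \mathrm{Bernoulli}(p_i), \qquad \tilde{x} = z \circ x .
\]
% Probit (standard normal CDF) and logistic sigmoid:
\[
\Phi(t) = \int_{-\infty}^{t} \mathcal{N}(u; 0, 1)\, du,
\qquad
\sigma(t) = \frac{1}{1 + e^{-t}} .
\]
% Gaussian integrals: the probit identity is exact, and the sigmoid one is the
% classic approximation based on \sigma(t) \approx \Phi(\sqrt{\pi/8}\, t):
\[
\int \Phi(t)\, \mathcal{N}(t; \mu, s^2)\, dt = \Phi\!\left(\frac{\mu}{\sqrt{1 + s^2}}\right),
\qquad
\int \sigma(t)\, \mathcal{N}(t; \mu, s^2)\, dt \approx \sigma\!\left(\frac{\mu}{\sqrt{1 + \pi s^2 / 8}}\right).
\]
```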

Logistic Regression
- Consider standard logistic regression
- The standard gradient update rule for the parameter vector (reconstructed below)
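
The update rule itself is missing from the transcript; the standard stochastic gradient step on the logistic log-likelihood (a reconstruction) is:

```latex
% Logistic regression: p(y = 1 | x) = \sigma(w^\top x), with y in {0, 1}.
% Stochastic gradient ascent step on the log-likelihood for one example (x, y):
\[
w \;\leftarrow\; w + \eta \bigl(y - \sigma(w^\top x)\bigr)\, x .
\]
```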

Dropout on a Logistic Regression
- Dropout each input dimension randomly
- For the parameter vector:
- Notation:

Fast Dropout training
- Instead of using the sampled quantity, we use its expectation:

Fast Dropout training
- Approximation:
- By knowing:
- How to approximate?
  - Option 1:
  - Option 2:
  - These have closed forms but are poor approximations

Experiment: evaluating the approximation
- The quality of the approximation (illustrated with a sketch below)
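
As an illustration of what such an evaluation looks like, here is a small NumPy sketch (not the paper's experiment; all values are made up) comparing a Monte Carlo estimate of the expected sigmoid under input dropout with the Gaussian closed-form approximation described on the later slides.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Made-up weights, inputs, and keep-probability (1 - dropout rate).
d, keep_prob = 50, 0.8
w = rng.normal(size=d)
x = rng.normal(size=d)

# Monte Carlo estimate of E_z[sigmoid(w^T (z * x))], z_i ~ Bernoulli(keep_prob).
z = rng.random((100_000, d)) < keep_prob
mc_estimate = sigmoid((z * x) @ w).mean()

# Gaussian (CLT) approximation: match the mean and variance of w^T (z * x),
# then apply the sigmoid-Gaussian integral approximation.
mu = keep_prob * (w @ x)
var = keep_prob * (1 - keep_prob) * np.sum((w * x) ** 2)
gaussian_approx = sigmoid(mu / np.sqrt(1 + np.pi * var / 8))

print(f"Monte Carlo: {mc_estimate:.4f}   Gaussian approximation: {gaussian_approx:.4f}")
```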

Experiment: Document Classification
- 20 Newsgroups subtask: alt.atheism vs. religion.misc

Experiment: Document Classification (2)

Fast Dropout training
- Approximation:
- By knowing:

Fast Dropout training
- We want to:
- Previously: this could be found in closed form.

Fast Dropout training
- We want to:
- Previously:
- The masked score is distributed (approximately) as a Gaussian, with a mean and a variance given in closed form (reconstructed below)
- Has a closed form!
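
The formulas on these final slides are missing; the standard fast-dropout argument they describe (a reconstruction, again writing p_i for the keep probability) is roughly:

```latex
% By the central limit theorem, the masked score s = w^\top (z \circ x)
% is approximately Gaussian:
\[
s \;\approx\; \mathcal{N}(\mu_s, \sigma_s^2),
\qquad
\mu_s = \sum_i p_i\, w_i x_i,
\qquad
\sigma_s^2 = \sum_i p_i (1 - p_i)\, w_i^2 x_i^2 .
\]
% Combining this with the sigmoid-Gaussian integral gives a deterministic,
% closed-form surrogate for the sampled quantity:
\[
\mathbb{E}_z\!\left[\sigma\bigl(w^\top (z \circ x)\bigr)\right]
\;\approx\;
\int \sigma(t)\, \mathcal{N}(t; \mu_s, \sigma_s^2)\, dt
\;\approx\;
\sigma\!\left(\frac{\mu_s}{\sqrt{1 + \pi \sigma_s^2 / 8}}\right).
\]
```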