
Qual Presentation
Daniel Khashabi

Outline
- My own line of research
- Papers:
  - Fast Dropout Training, ICML 2013
  - Distributional Semantics Beyond Words: Supervised Learning of Analogy and Paraphrase, TACL

Current Line of Research
- Conventional approach to a classification problem
- Problems:
  - Never uses the label information
  - Loses the structure in the output
  - Limited to the classes in the training set
  - Hard to leverage unsupervised data

Current Line of Research
- For example, take the relation extraction problem.
- Conventional approach: given a sentence s and mentions e1 and e2, find their relation.
- Example: "Bill Gates, CEO of Microsoft ..." with mentions Bill Gates and Microsoft; output: Manager

Current Line of Research
- Let's change the problem a little: create a claim about the relation.
- Before: Text = "Bill Gates, CEO of Microsoft ...", R = Manager
- After: Text = "Bill Gates, CEO of Microsoft ...", Claim = "Bill Gates is manager of Microsoft", Label = True

Current Line of Research
- Creating data is very easy! What we do (sketched below):
  - Use knowledge bases to find entities that are related
  - Find sentences that contain these entities
  - Create claims about the relation in the original sentence
  - Ask Turkers to label them
- Much easier than extracting relation labels directly
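
A minimal sketch of this data-creation pipeline, assuming a hypothetical knowledge base of (entity1, relation, entity2) triples and one claim template per relation; the names, data, and templates here are illustrative placeholders, not the actual pipeline.

```python
# Hypothetical sketch of the claim-generation procedure described above.
# The knowledge base, corpus, and templates are made-up placeholders.
knowledge_base = [("Bill Gates", "manager", "Microsoft")]
corpus = ["Bill Gates, CEO of Microsoft, announced a new product."]
templates = {"manager": "{e1} is manager of {e2}"}

def generate_claim_instances(knowledge_base, corpus, templates):
    instances = []
    for e1, relation, e2 in knowledge_base:
        for sentence in corpus:
            # Keep only sentences that mention both related entities.
            if e1 in sentence and e2 in sentence:
                claim = templates[relation].format(e1=e1, e2=e2)
                # Each (text, claim) pair is then shown to Turkers,
                # who only need to answer True / False.
                instances.append({"text": sentence, "claim": claim})
    return instances

print(generate_claim_instances(knowledge_base, corpus, templates))
```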

Current Line of Research
- This formulation makes use of the information inherent in the label.
- This helps us generalize to relations that are not seen in the training data.

Outline
- My own line of research
- Papers:
  - Fast Dropout Training, ICML 2013
  - Distributional Semantics Beyond Words: Supervised Learning of Analogy and Paraphrase, TACL

Dropout training
- Proposed by Hinton et al. (2012)
- On each training case, decide whether to delete each hidden unit, with some probability p
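
A minimal NumPy sketch of this masking step, assuming a vector of hidden activations h and a drop probability p; this is an illustration of the idea, not the implementation from the paper.

```python
import numpy as np

def dropout(h, p, rng=np.random.default_rng(0)):
    """Zero out each unit of h independently with probability p (training time)."""
    keep_mask = rng.random(h.shape) >= p   # each unit survives with probability 1 - p
    return h * keep_mask

h = np.array([0.5, -1.2, 0.3, 2.0])
print(dropout(h, p=0.5))   # roughly half the units are zeroed on average
```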

Dropout training
- Model averaging effect
  - Among exponentially many models with shared parameters, only a few get trained
- Much stronger than the known regularizers
- What about the input space? Do the same thing!

Dropout training
- Model averaging effect
  - Among exponentially many models with shared parameters, only a few get trained
- Much stronger than the known regularizers
- What about the input space? Do the same thing!
- Dropout of 50% of the hidden units and 20% of the input units (Hinton et al., 2012)

Outline
- Can we explicitly show that dropout acts as a regularizer?
  - Very easy to show for linear regression; what about other models?
- Dropout needs sampling, which can be slow
  - Can we convert the sampling-based update into a deterministic form?
  - Find the expected form of the updates

Linear Regression
- Reminder: consider standard linear regression
- With regularization
- Closed-form solution (reconstructed below)
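
The formulas on this slide did not survive extraction; the following is the standard ridge-regression setup they presumably refer to (a reconstruction, not copied from the slides):

```latex
% Standard linear regression with L2 regularization over N examples (x_n, y_n):
\[
L(w) \;=\; \sum_{n=1}^{N} \bigl(y_n - w^\top x_n\bigr)^2 \;+\; \lambda \lVert w \rVert_2^2
\]
% Closed-form solution, with X the N x d design matrix and y the target vector:
\[
\hat{w} \;=\; \bigl(X^\top X + \lambda I\bigr)^{-1} X^\top y
\]
```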

Dropout Linear Regression
- Consider standard linear regression
- Linear regression with dropout (sketched below)
- How do we find the parameters?
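
The dropout objective on this slide is also missing; under the usual formulation, each input dimension is multiplied by an independent Bernoulli variable. The sketch below writes p for the probability of keeping a dimension (one minus the dropout rate) and is a reconstruction, not the slide's own notation.

```latex
% Linear regression with dropout on the inputs:
% z_{n,i} ~ Bernoulli(p) masks dimension i of example n, and \circ is the
% element-wise product. Training samples a fresh mask for every update.
\[
L(w) \;=\; \sum_{n=1}^{N} \bigl(y_n - w^\top (z_n \circ x_n)\bigr)^2,
\qquad z_{n,i} \sim \mathrm{Bernoulli}(p).
\]
```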

Fast Dropout for Linear Regression
- We had:
- Instead of sampling, minimize the expected loss
- Fixed x and y:

Fast Dropout for Linear Regression
- We had:
- Instead of sampling, minimize the expected loss:
- Expected loss:

Fast Dropout for Linear Regression
- Expected loss:
- Data-dependent regularizer
- A closed form can be found (reconstructed below):
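
The expansion on this slide is missing; the standard calculation (a reconstruction, using the Bernoulli(p) keep-mask above) goes as follows.

```latex
% Expectation of the dropout squared loss for one example,
% using E[z_i] = p and Var[z_i] = p(1 - p):
\[
\mathbb{E}_z\!\left[\bigl(y - w^\top (z \circ x)\bigr)^2\right]
\;=\; \bigl(y - p\, w^\top x\bigr)^2 \;+\; p(1 - p) \sum_i w_i^2 x_i^2 .
\]
% The second term is the data-dependent regularizer: an L2 penalty on w weighted
% by the per-dimension second moments of the inputs. Summing over the data and
% setting the gradient to zero gives a ridge-like closed form:
\[
\hat{w} \;=\; \Bigl(p^2 X^\top X + p(1 - p)\,\mathrm{diag}\Bigl(\sum_n x_n \circ x_n\Bigr)\Bigr)^{-1} p\, X^\top y .
\]
```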

Some definitions
- Dropout each input dimension randomly
- Probit
- Logistic function / sigmoid

Some useful equalities
- We can find the following expectation in closed form (reconstructed below):
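
The definitions and identities on these two slides did not survive extraction; the standard ones used in fast dropout (a reconstruction) are:

```latex
% Bernoulli keep-mask on each input dimension:
\[
z_i \sim \mathrm{Bernoulli}(p_i), \qquad \tilde{x} = z \circ x .
\]
% Probit (standard normal CDF) and logistic sigmoid:
\[
\Phi(t) = \int_{-\infty}^{t} \mathcal{N}(u; 0, 1)\, du,
\qquad
\sigma(t) = \frac{1}{1 + e^{-t}} .
\]
% Gaussian integrals: the probit identity is exact, and the sigmoid one is the
% classic approximation based on \sigma(t) \approx \Phi(\sqrt{\pi/8}\, t):
\[
\int \Phi(t)\, \mathcal{N}(t; \mu, s^2)\, dt = \Phi\!\left(\frac{\mu}{\sqrt{1 + s^2}}\right),
\qquad
\int \sigma(t)\, \mathcal{N}(t; \mu, s^2)\, dt \approx \sigma\!\left(\frac{\mu}{\sqrt{1 + \pi s^2 / 8}}\right).
\]
```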

Logistic Regression
- Consider standard logistic regression
- The standard gradient update rule for the parameter vector (reconstructed below)
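
The update rule itself is missing from the transcript; the standard stochastic gradient step on the logistic log-likelihood (a reconstruction) is:

```latex
% Logistic regression: p(y = 1 | x) = \sigma(w^\top x), with y in {0, 1}.
% Stochastic gradient ascent step on the log-likelihood for one example (x, y):
\[
w \;\leftarrow\; w + \eta \bigl(y - \sigma(w^\top x)\bigr)\, x .
\]
```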

Dropout on a Logistic Regression
- Dropout each input dimension randomly
- For the parameter vector:
- Notation:

Fast Dropout training
- Instead of using the sampled quantity, we use its expectation:

Fast Dropout training
- Approximation:
- By knowing:
- How to approximate?
  - Option 1:
  - Option 2:
  - These have closed forms but are poor approximations

Experiment: evaluating the approximation
- The quality of the approximation (illustrated with a sketch below)
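
As an illustration of what such an evaluation looks like, here is a small NumPy sketch (not the paper's experiment; all values are made up) comparing a Monte Carlo estimate of the expected sigmoid under input dropout with the Gaussian closed-form approximation described on the later slides.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Made-up weights, inputs, and keep-probability (1 - dropout rate).
d, keep_prob = 50, 0.8
w = rng.normal(size=d)
x = rng.normal(size=d)

# Monte Carlo estimate of E_z[sigmoid(w^T (z * x))], z_i ~ Bernoulli(keep_prob).
z = rng.random((100_000, d)) < keep_prob
mc_estimate = sigmoid((z * x) @ w).mean()

# Gaussian (CLT) approximation: match the mean and variance of w^T (z * x),
# then apply the sigmoid-Gaussian integral approximation.
mu = keep_prob * (w @ x)
var = keep_prob * (1 - keep_prob) * np.sum((w * x) ** 2)
gaussian_approx = sigmoid(mu / np.sqrt(1 + np.pi * var / 8))

print(f"Monte Carlo: {mc_estimate:.4f}   Gaussian approximation: {gaussian_approx:.4f}")
```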

Experiment: Document Classification
- 20 Newsgroups subtask: alt.atheism vs. religion.misc

Experiment: Document Classification (2)

Fast Dropout training
- Approximation:
- By knowing:

Fast Dropout training
- We want to:
- Previously: this could be found in closed form.

Fast Dropout training
- We want to:
- Previously:
- The masked score is distributed (approximately) as a Gaussian, with a mean and a variance given in closed form (reconstructed below)
- Has a closed form!
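
The formulas on these final slides are missing; the standard fast-dropout argument they describe (a reconstruction, again writing p_i for the keep probability) is roughly:

```latex
% By the central limit theorem, the masked score s = w^\top (z \circ x)
% is approximately Gaussian:
\[
s \;\approx\; \mathcal{N}(\mu_s, \sigma_s^2),
\qquad
\mu_s = \sum_i p_i\, w_i x_i,
\qquad
\sigma_s^2 = \sum_i p_i (1 - p_i)\, w_i^2 x_i^2 .
\]
% Combining this with the sigmoid-Gaussian integral gives a deterministic,
% closed-form surrogate for the sampled quantity:
\[
\mathbb{E}_z\!\left[\sigma\bigl(w^\top (z \circ x)\bigr)\right]
\;\approx\;
\int \sigma(t)\, \mathcal{N}(t; \mu_s, \sigma_s^2)\, dt
\;\approx\;
\sigma\!\left(\frac{\mu_s}{\sqrt{1 + \pi \sigma_s^2 / 8}}\right).
\]
```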