Tutorial on neural probabilistic language models


1 Tutorial on neural probabilistic language models
Piotr Mirowski, Microsoft Bing London. South England NLP, UCL, April 30, 2014

2 Acknowledgements
AT&T Labs Research: Srinivas Bangalore, Suhrid Balakrishnan, Sumit Chopra (now at Facebook)
New York University: Yann LeCun (now at Facebook)
Microsoft: Abhishek Arun

3 About the presenter
NYU (2005-2010): deep learning for time series, epileptic seizure prediction, gene regulation networks, text categorization of online news, statistical language models
Bell Labs (2011-2013): WiFi-based indoor geolocation, SLAM and robotics, load forecasting in smart grids
Microsoft Bing (2013-): AutoSuggest (query formulation)

4 Objective of this tutorial
Understand deep learning approaches to distributional semantics: word embeddings and continuous space language models

5 Outline
Probabilistic Language Models (LMs): likelihood of a sentence and LM perplexity; limitations of n-grams
Neural Probabilistic LMs: vector-space representation of words; neural probabilistic language model; Log-Bilinear (LBL) LMs (loss function maximization)
Long-range dependencies: enhancing LBL with linguistic features; Recurrent Neural Networks (RNN)
Applications: speech recognition and machine translation; sentence completion and linguistic regularities
Bag-of-word-vector approaches: auto-encoders for text; continuous bag-of-words and skip-gram models
Scalability with large vocabularies: tree-structured LMs; noise-contrastive estimation

6 Outline Probabilistic Language Models (LMs) Neural Probabilistic LMs
Likelihood of a sentence and LM perplexity Limitations of n-grams Neural Probabilistic LMs Vector-space representation of words Neural probabilistic language model Log-Bilinear (LBL) LMs Long-range dependencies Enhancing LBL with linguistic features Recurrent Neural Networks (RNN) Applications Speech recognition and machine translation Sentence completion and linguistic regularities Bag-of-word-vector approaches Auto-encoders for text Continuous bag-of-words and skip-gram models Scalability with large vocabularies Tree-structured LMs Noise-contrastive estimation

7 Probabilistic Language Models
Probability of a sequence of words; conditional probability of an upcoming word; chain rule of probability; (n-1)th order Markov assumption (formulas reconstructed below).
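The equations on this slide did not survive the transcript; the chain rule and the (n-1)th-order Markov approximation it refers to are, in standard notation:
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1}) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \dots, w_{t-1})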

8 Learning probabilistic language models
Learn the joint likelihood of training sentences under the (n-1)th order Markov assumption using n-grams: maximize the log-likelihood of each target word given its word history, assuming a parametric model θ (reconstructed below). Could we take advantage of higher-order history?
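A reconstruction of the training objective the slide describes, with θ the parameters of the model:
L(\theta) = \sum_{t=1}^{T} \log P_\theta\big(w_t \mid w_{t-n+1}, \dots, w_{t-1}\big)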

9 Evaluating language models: perplexity
How well can we predict the next word? A random predictor would give each word probability 1/V, where V is the size of the vocabulary; a better model of a text should assign a higher probability to the word that actually occurs. Perplexity measures this (formula reconstructed below). Examples: "I always order pizza with cheese and ____" (mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 1e-100, and 1e-100); "The 33rd President of the US was ____"; "I saw a ____". Slide courtesy of Abhishek Arun
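The perplexity formula itself is not reproduced in the transcript; the standard definition, consistent with the intuition above (V for a uniform random predictor, 1 for perfect prediction), is:
\mathrm{PPL} = \exp\!\left( -\frac{1}{T} \sum_{t=1}^{T} \log P\big(w_t \mid w_1, \dots, w_{t-1}\big) \right)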

10 Limitations of n-grams
Conditional likelihood of seeing a sub-sequence of length n in the available training data. Limitation: a discrete model (each word is a token), so coverage of the training dataset is incomplete: a vocabulary of size V words yields V^n possible n-grams (exponential in n), and the semantic similarity between word tokens is not exploited (e.g., "the cat sat on the mat" vs. "my cat sat on the mat" vs. "the cat sat on the rug").

11 Workarounds for n-grams
Smoothing: add a non-zero offset to the probabilities of unseen words (example: Kneser-Ney smoothing). Back-off: no such trigram? try bigrams… no such bigram? try unigrams… Interpolation: mix unigram, bigram, trigram, etc. (a minimal sketch follows). [Katz, 1987; Chen & Goodman, 1996; Stolcke, 2002]
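A minimal sketch of interpolation in the Matlab style of the deck's later snippets; the counts structure, the mixing weights lambda and the function name are hypothetical, and the full V×V×V count array is only for illustration:
function p = InterpolatedTrigram_Prob(counts, lambda, w1, w2, w3)
% Mix unigram, bigram and trigram maximum-likelihood estimates
% counts.uni(v), counts.bi(v1,v2), counts.tri(v1,v2,v3) hold raw counts
% lambda is a 3-vector of non-negative weights summing to 1
p_uni = counts.uni(w3) / sum(counts.uni);
p_bi  = counts.bi(w2, w3) / max(1, sum(counts.bi(w2, :)));
p_tri = counts.tri(w1, w2, w3) / max(1, sum(counts.tri(w1, w2, :)));
p = lambda(1) * p_uni + lambda(2) * p_bi + lambda(3) * p_tri;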

12 Outline Probabilistic Language Models (LMs) Neural Probabilistic LMs
Likelihood of a sentence and LM perplexity Limitations of n-grams Neural Probabilistic LMs Vector-space representation of words Neural probabilistic language model Log-Bilinear (LBL) LMs Long-range dependencies Enhancing LBL with linguistic features Recurrent Neural Networks (RNN) Applications Speech recognition and machine translation Sentence completion and linguistic regularities Bag-of-word-vector approaches Auto-encoders for text Continuous bag-of-words and skip-gram models Scalability with large vocabularies Tree-structured LMs Noise-contrastive estimation

13 Continuous Space Language Models
Word tokens are mapped to vectors in a low-dimensional space. Conditional word probabilities are replaced by normalized dynamical models on vectors of word embeddings. The vector-space representation enables semantic/syntactic similarity between words/sentences: use cosine similarity as semantic word similarity; find nearest neighbours (synonyms, antonyms); algebra on words: {king} – {man} + {woman} = {queen}? (a sketch of the nearest-neighbour search follows).
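A minimal sketch of the nearest-neighbour search by cosine similarity, assuming (as in the rest of the deck) a D×V embedding matrix R with one column per word and a cell array vocab of word strings; the function name is hypothetical:
function [words, scores] = NearestWords(R, vocab, z, k)
% Rank all vocabulary words by cosine similarity to a query vector z
Rn = bsxfun(@rdivide, R, sqrt(sum(R.^2, 1)));  % L2-normalize each column
zn = z / norm(z);
sims = Rn' * zn;                               % V x 1 cosine similarities
[scores, idx] = sort(sims, 'descend');
words = vocab(idx(1:k));
scores = scores(1:k);
% Word algebra, e.g. {king} - {man} + {woman}:
% z_query = R(:, i_king) - R(:, i_man) + R(:, i_woman);
% NearestWords(R, vocab, z_query, 5)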

14 Vector-space representation of words
“One-hot” or “one-of-V” representation of a word token at position t in the text corpus, with a vocabulary of size V.
ẑt: vector-space representation of the prediction of the target word wt (we predict a vector of size D).
zt-1, zt-2, …: vector-space representation of the tth word history, e.g., the concatenation of n-1 vectors of size D.
zv: vector-space representation of any word v in the vocabulary, using a vector of dimension D; also called a distributed representation.

15 Learning continuous space language models
Input: word history (one-hot or distributed representation) Output: target word (one-hot or distributed representation) Function that approximates word likelihood: Linear transform Feed-forward neural network Recurrent neural network Continuous bag-of-words Skip-gram

16 Learning continuous space language models
How do we learn the word representations z for each word in the vocabulary? How do we learn the model that predicts the next word or its representation ẑt given a word history? Simultaneous learning of model and representation

17 Vector-space representation of words
Compare two words using their vector representations: dot product, cosine similarity, Euclidean distance. Bi-linear scoring function at position t (reconstructed below): the parametric model θ predicts the next word; the bias bv for word v is related to the unigram probability of word v. Given a predicted vector ẑt, the actual predicted word is the 1-nearest neighbour of ẑt; exhaustive search in large vocabularies (V in the millions) can be computationally expensive… [Mnih & Hinton, 2007]
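The bilinear score itself is not shown in the transcript; a standard reconstruction in the deck's notation (ẑt the predicted vector, zv the embedding of word v, bv its bias) is:
s_\theta(w_t = v \mid \text{history}) = \hat{z}_t^{\top} z_v + b_v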

18 Word probabilities from vector-space representation
Normalized probability: apply the softmax function to the bi-linear scores at position t. The parametric model θ predicts the next word; the bias bv for word v is related to the unigram probability of word v. Given a predicted vector ẑt, the actual predicted word is the 1-nearest neighbour of ẑt; exhaustive search in large vocabularies (V in the millions) can be computationally expensive… [Mnih & Hinton, 2007]

19 Loss function
Log-likelihood model: the loss function to maximize is the log-likelihood, which is numerically more stable than the raw likelihood. In general, the loss is defined as the score of the right answer plus a normalization term (the negative log of the sum of exponentiated scores over the whole vocabulary; reconstructed below). The normalization term is expensive to compute.
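A reconstruction of the per-word loss the slide describes, with s_v the score of word v at position t:
L_t(\theta) = \log P_\theta(w_t \mid \text{history}) = s_{w_t} - \log \sum_{v=1}^{V} \exp(s_v)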

20 Neural Probabilistic Language Model
[Architecture diagram: words wt-5, …, wt-1 ("the cat sat on the") from the discrete word space {1, ..., V} (V=18k words) are mapped by the shared embedding matrix R to vectors zt-5, …, zt-1 in the word embedding space ℜD (D=30); a neural network (layers A and B, 100 hidden units) produces V output units followed by a softmax predicting wt ("mat").]
function z_hist = Embedding_FProp(model, w)
% Get the embeddings for all words in w and concatenate them into one vector
z_hist = model.R(:, w);
z_hist = reshape(z_hist, length(w)*model.dim_z, 1);
[Bengio et al, 2001, 2003; Schwenk et al, “Connectionist language modelling for large vocabulary continuous speech recognition”, ICASSP 2002]

21 Neural Probabilistic Language Model
function s = NeuralNet_FProp(model, z_hist)
% One-hidden-layer neural network: linear layer A, tanh, then scores via layer B
o = model.A * z_hist + model.bias_a;
h = tanh(o);
s = model.B * h + model.bias_b;
[Architecture diagram as on the previous slide: embeddings zt-5, …, zt-1 (D=30, via R), neural network layers A and B with 100 hidden units, V=18k output units followed by a softmax over the discrete word space predicting wt.]
[Bengio et al, 2001, 2003; Schwenk et al, “Connectionist language modelling for large vocabulary continuous speech recognition”, ICASSP 2002]

22 Neural Probabilistic Language Model
[Architecture diagram as on slide 20 (word embeddings in dimension D=30 via R, 100 hidden units, V=18k output units followed by a softmax).]
function p = Softmax_FProp(s)
% Probability estimation: normalized exponentials of the scores
p_num = exp(s);
p = p_num / sum(p_num);
[Bengio et al, 2001, 2003; Schwenk et al, “Connectionist language modelling for large vocabulary continuous speech recognition”, ICASSP 2002]

23 Neural Probabilistic Language Model
[Architecture diagram as on slide 20 (word embeddings in dimension D=30 via R, 100 hidden units, V=18k output units followed by a softmax).]
Outperforms the best n-grams (class-based Kneser-Ney back-off 5-grams) by 7%.
Took months to train at the time on the AP News corpus (14M words).
Complexity: (n-1)×D + (n-1)×D×H + H×V
[Bengio et al, 2001, 2003; Schwenk et al, “Connectionist language modelling for large vocabulary continuous speech recognition”, ICASSP 2002]

24 Log-Bilinear Language Model
function z_hat = LBL_FProp(model, z_hist)
% Simple linear transform of the concatenated history embeddings
z_hat = model.C * z_hist + model.bias_c;
[Architecture diagram: words wt-5, …, wt-1 are embedded via R (word embedding space ℜD, D=100); a simple matrix multiplication C maps the history embeddings to the prediction ẑt, which is compared (E) with the embedding zt of the target word; discrete word space {1, ..., V}, V=18k words.]
[Mnih & Hinton, 2007]

25 Log-Bilinear Language Model
[Architecture diagram as on the previous slide (D=100, V=18k, prediction ẑt compared with the target embedding zt).]
function s = Score_FProp(z_hat, model)
% Score each vocabulary word: dot product of its embedding with the prediction, plus a per-word bias
s = model.R' * z_hat + model.bias_v;
[Mnih & Hinton, 2007]

26 Log-Bilinear Language Model
[Architecture diagram as on slide 24 (D=100, V=18k).]
Slightly better than the best n-grams (class-based Kneser-Ney back-off 5-grams).
Takes days to train (in 2007) on the AP News corpus (14 million words).
Complexity: (n-1)×D + (n-1)×D×D + D×V
[Mnih & Hinton, 2007]

27 Nonlinear Log-Bilinear Language Model
[Architecture diagram: word embeddings in dimension D=100 via R; neural network layers A and B with 200 hidden units predict ẑt, which is compared (E) with the target embedding zt; V=18k output units followed by a softmax.]
Outperforms the best n-grams (class-based Kneser-Ney back-off 5-grams) by 24%.
Took weeks to train at the time on the AP News corpus (14M words).
Complexity: (n-1)×D + (n-1)×D×H + H×D + D×V
[Mnih & Hinton, Neural Computation, 2009]

28 Learning neural language models
Maximize the log-likelihood of the observed data w.r.t. the parameters θ of the neural language model. Parameters θ (in a neural language model): the word embedding matrix R and biases bv; the neural weights A, bA, B, bB. Gradient descent with learning rate η (update rule reconstructed below).
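The update rule on the slide is the usual gradient step; since the loss here is a log-likelihood to be maximized, it is an ascent step:
\theta \leftarrow \theta + \eta \, \frac{\partial L_t(\theta)}{\partial \theta}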

29 Maximizing the loss function
Maximum Likelihood learning: Gradient of log-likelihood w.r.t. parameters θ: Use the chain rule of gradients

30 Maximizing the loss function: example of LBL
Maximum Likelihood learning: gradient of the log-likelihood w.r.t. the parameters θ. Neural net: back-propagate the gradient.
function [dL_dz_hat, dL_dR, dL_dbias_v] = Loss_BackProp(z_hat, model, p, w)
% Gradient of loss w.r.t. word bias parameter
dL_dbias_v = -p;
dL_dbias_v(w) = 1 - p(w);
% Gradient of loss w.r.t. prediction of (N)LBL model
dL_dz_hat = model.R(:, w) - model.R * p;
% Gradient of loss w.r.t. vocabulary matrix R (D x V, whose columns are the zv)
dL_dR = -z_hat * p';
dL_dR(:, w) = z_hat * (1 - p(w));

31 Learning neural language models
Randomly choose a mini-batch (e.g., 1000 consecutive words); forward-propagate through the word embeddings and through the model; estimate the word likelihood (loss); back-propagate the loss; take a gradient step to update the model. A sketch of one such step, using the helper functions from the surrounding slides, follows.
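A minimal sketch of one stochastic gradient step for the log-bilinear model, assembled from the FProp/BackProp helpers shown on the surrounding slides; the gradients for C and its bias are derived here by the chain rule and are an assumption (the deck does not spell them out), and the gradient flowing back into the history embeddings is omitted for brevity:
function model = SGD_Step(model, w_hist, w_target, eta)
% One stochastic gradient step on a single n-gram (w_hist -> w_target)
z_hist = Embedding_FProp(model, w_hist);   % look up and concatenate history embeddings
z_hat  = LBL_FProp(model, z_hist);         % predict the target embedding
s      = Score_FProp(z_hat, model);        % score every vocabulary word
p      = Softmax_FProp(s);                 % normalized probabilities
[dL_dz_hat, dL_dR, dL_dbias_v] = Loss_BackProp(z_hat, model, p, w_target);
% Gradient ascent on the log-likelihood (maximization)
model.R      = model.R      + eta * dL_dR;
model.bias_v = model.bias_v + eta * dL_dbias_v;
% Chain rule through the linear predictor z_hat = C*z_hist + bias_c
model.C      = model.C      + eta * (dL_dz_hat * z_hist');
model.bias_c = model.bias_c + eta * dL_dz_hat;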

32 Nonlinear Log-Bilinear Language Model
FProp: look up the embeddings of the words in the n-gram using R; forward-propagate through the neural net; look up ALL vocabulary words using R and compute energies and probabilities (computationally expensive).
[Architecture diagram as on slide 27 (D=100, 200 hidden units, V=18k output units followed by a softmax).]
[Mnih & Hinton, Neural Computation, 2009]

33 Nonlinear Log-Bilinear Language Model
BackProp: compute the gradients of the loss w.r.t. the output of the neural net and back-propagate through the neural-net layers B and A (computationally expensive); back-propagate further down to the word embeddings R; compute the gradients of the loss w.r.t. all vocabulary words and back-propagate them to R.
[Architecture diagram as on slide 27 (D=100, 200 hidden units, V=18k output units followed by a softmax).]
[Mnih & Hinton, Neural Computation, 2009]

34 Stochastic Gradient Descent (SGD)
Choice of the learning hyperparameters Learning rate? Learning rate decay? Regularization (L2-norm) of the parameters? Momentum term on the parameters? Use cross-validation on validation set E.g., on AP News (16M words) Training set: 14M words Validation set: 1M words Test set: 1M words

35 Limitations of these neural language models
Computationally expensive to train Bottleneck: need to evaluate probability of each word over the entire vocabulary Very slow training time (days, weeks) Ignores long-range dependencies Fixed time windows Continuous version of n-grams

36 Outline Probabilistic Language Models (LMs) Neural Probabilistic LMs
Likelihood of a sentence and LM perplexity Limitations of n-grams Neural Probabilistic LMs Vector-space representation of words Neural probabilistic language model Log-Bilinear (LBL) LMs Long-range dependencies Enhancing LBL with linguistic features Recurrent Neural Networks (RNN) Applications Speech recognition and machine translation Sentence completion and linguistic regularities Bag-of-word-vector approaches Auto-encoders for text Continuous bag-of-words and skip-gram models Scalability with large vocabularies Tree-structured LMs Noise-contrastive estimation

37 Adding language features to neural LMs
Additional features can be added as inputs to the neural net / linear prediction function. We tried POS (part-of-speech) tags and supertags derived from incomplete parsing.
[Architecture diagram: alongside the word embeddings R (word embedding space ℜD), discrete POS features {0,1}P (e.g., DT NN VBD IN DT for "the cat sat on the") are mapped by a feature embedding matrix F into a feature embedding space ℜF and fed, together with the word history, into the prediction of ẑt via C.]
[Mirowski, Chopra, Balakrishnan and Bangalore (2010) “Feature-rich continuous language models for speech recognition”, SLT; Bangalore & Joshi (1999) “Supertagging: an approach to almost parsing”, Computational Linguistics]

38 Constraining word representations
Using the WordNet hierarchical similarity between words, we tried to force some words to remain similar to a small set of their WordNet neighbours. No significant change in training time or in language model performance.
[Architecture diagram as on the previous slide, with a WordNet graph of words constraining the word embedding matrix R.]
[Mirowski, Chopra, Balakrishnan and Bangalore (2010) “Feature-rich continuous language models for speech recognition”, SLT]

39 Topic mixtures of language models
We pre-computed the unsupervised topic-model representation of each sentence in training using LDA (Latent Dirichlet Allocation) [Blei et al, 2003] with 5 topics. On test data, the topic is estimated using the trained LDA model. This enables modelling long-range dependencies at the sentence level.
[Architecture diagram: a mixture of K topic-specific components (A(k), B(k), h(k), C(k)), weighted by the sentence or document topic proportions θ1, …, θk (5 topics), on top of the word embedding matrix R and the feature embeddings F (discrete POS features {0,1}P, e.g., DT NN VBD IN DT).]
[Mirowski, Chopra, Balakrishnan and Bangalore (2010) “Feature-rich continuous language models for speech recognition”, SLT; David Blei (2003) "Latent Dirichlet Allocation", JMLR]

40 Word embeddings obtained on Reuters
Example of word embeddings obtained using our language model on the Reuters corpus (1.5 million words, vocabulary V=12k words), vector space of dimension D=100 For each word, the 10 nearest neighbours in the vector space retrieved using cosine similarity: [Mirowski, Chopra, Balakrishnan and Bangalore (2010) “Feature-rich continuous language models for speech recognition”, SLT]

41 Word embeddings obtained on AP News
Example of word embeddings obtained using our LM on AP News (14M words, V=17k), D=100 The word embedding matrix R was projected in 2D by Stochastic t-SNE [Van der Maaten, JMLR 2008] [Mirowski (2010) “Time series modelling with hidden variables and gradient-based algorithms”, NYU PhD thesis]

42 Word embeddings obtained on AP News
Example of word embeddings obtained using our LM on AP News (14M words, V=17k), D=100 The word embedding matrix R was projected in 2D by Stochastic t-SNE [Van der Maaten, JMLR 2008] [Mirowski (2010) “Time series modelling with hidden variables and gradient-based algorithms”, NYU PhD thesis]

43 Word embeddings obtained on AP News
Example of word embeddings obtained using our LM on AP News (14M words, V=17k), D=100 The word embedding matrix R was projected in 2D by Stochastic t-SNE [Van der Maaten, JMLR 2008] [Mirowski (2010) “Time series modelling with hidden variables and gradient-based algorithms”, NYU PhD thesis]

44 Word embeddings obtained on AP News
Example of word embeddings obtained using our LM on AP News (14M words, V=17k), D=100 The word embedding matrix R was projected in 2D by Stochastic t-SNE [Van der Maaten, JMLR 2008] [Mirowski (2010) “Time series modelling with hidden variables and gradient-based algorithms”, NYU PhD thesis]

45 Recurrent Neural Net (RNN) language model
Time-delay 1-layer neural network with D output units; word embedding space ℜD in dimension D=30 to 250; discrete word space {1, ..., M}, M>100k words.
[Architecture diagram: the current word is embedded by the word embedding matrix U; the recurrent matrix W carries the hidden state h from the previous time step (zt-1) to the current one (zt); the output matrix V produces scores o over the vocabulary.]
Handles a longer word history (~10 words), on par with a 10-gram feed-forward NNLM. Training algorithm: BPTT (Back-Propagation Through Time).
Complexity: D×D + D×D + D×V
[Mikolov et al, 2010, 2011]. A sketch of one forward step follows.
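A minimal sketch of one forward step of this recurrent language model; the field names U, W, V follow the diagram labels, and the sigmoid hidden nonlinearity is an assumption stated here rather than something shown on the slide:
function [p, h] = RNN_FProp(model, w, h_prev)
% One time step: embed the current word, update the hidden state, predict the next word
x = model.U(:, w);                            % embedding of the current word (column lookup)
h = 1 ./ (1 + exp(-(x + model.W * h_prev)));  % recurrent hidden state (sigmoid)
o = model.V * h;                              % scores over the vocabulary
p = exp(o - max(o));                          % softmax (max-subtraction for stability)
p = p / sum(p);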

46 Context-dependent RNN language model
Compute a topic-model representation word-by-word over the last 50 words, using approximate LDA [Blei et al, 2003] with K topics (K=40). This enables modelling long-range dependencies at the sentence level.
[Architecture diagram: as in the RNN language model (embedding matrix U, recurrence W, output V; word embedding space ℜD, D=200; discrete word space {1, ..., M}, M>100k words), with an additional sentence or document topic vector f feeding the hidden layer through F and the output layer through G.]
[Mikolov & Zweig, 2012]

47 Perplexity of RNN language models
Test perplexity:
Kneser-Ney back-off 5-grams: 123.3
Nonlinear LBL (100d) [Mnih & Hinton, 2009, using our implementation]: 104.4
NLBL (100d) + 5-topic LDA [Mirowski, 2010, using our implementation]: 98.5
RNN (200d) + 40-topic LDA [Mikolov & Zweig, 2012, using the RNN toolbox]: 86.9
Datasets: Penn TreeBank (V=10k vocabulary; train on 900k words, validate on 80k words, test on 80k words); AP News (V=17k vocabulary; train on 14M words, validate on 1M words, test on 1M words).
[Mirowski, 2010; Mikolov & Zweig, 2012; RNN toolbox]

48 Outline Probabilistic Language Models (LMs) Neural Probabilistic LMs
Likelihood of a sentence and LM perplexity Limitations of n-grams Neural Probabilistic LMs Vector-space representation of words Neural probabilistic language model Log-Bilinear (LBL) LMs Long-range dependencies Enhancing LBL with linguistic features Recurrent Neural Networks (RNN) Applications Speech recognition and machine translation Sentence completion and linguistic regularities Bag-of-word-vector approaches Auto-encoders for text Continuous bag-of-words and skip-gram models Scalability with large vocabularies Tree-structured LMs Noise-contrastive estimation

49 Performance of LBL on speech recognition
HUB-4 TV broadcast transcripts; vocabulary V=25k (with proper nouns & numbers). Train on 1M words, validate on 50k words, test on 800 sentences. Task: re-rank the top 100 candidate sentences provided for each spoken sentence by a speech recognition system (acoustic model + simple trigram).
Word accuracy by method:
AT&T Watson [Goffin et al, 2005]: 63.7%
KN 5-grams on 100-best list: 63.5%
Oracle: best of 100-best list: 66.6%
Oracle: worst of 100-best list: 57.8%
Log-bilinear models with nonlinearity, optional POS-tag inputs (F=34 or F=3) and LDA topic-model mixtures (5 topics): 64.1%, 64.2% and 64.6% for the different variants
[Mirowski et al, 2010]

50 Performance of RNN on machine translation
[Image credits: Auli et al (2013) “Joint Language and Translation Modeling with Recurrent Neural Networks”, EMNLP] RNN with 100 hidden nodes Trained using 20-step BPTT Uses lattice rescoring RNN trained on 2M words already improves over n-gram trained on 1.15B words [Auli et al, 2013]

51 Syntactic and Semantic tests with RNN
Observed that word embeddings obtained by the RNN-LDA model have linguistic regularities: “a” is to “b” as “c” is to __. Syntactic: king is to kings as queen is to queens. Semantic: clothing is to shirt as dish is to bowl. Vector offset method: take the offset zb - za, add it to zc, and return the vocabulary word zv with the highest cosine similarity to the result (a sketch follows). [Image credits: Mikolov et al (2013) “Efficient Estimation of Word Representation in Vector Space”, arXiv] [Mikolov, Yih and Zweig, 2013]
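A minimal sketch of the vector offset method, reusing the hypothetical NearestWords helper and D×V embedding matrix R sketched earlier; the word indices i_a, i_b, i_c are assumptions:
function answer = Analogy(R, vocab, i_a, i_b, i_c)
% "a is to b as c is to ?" via the vector offset z_b - z_a + z_c
z_query = R(:, i_b) - R(:, i_a) + R(:, i_c);
[words, ~] = NearestWords(R, vocab, z_query, 4);
% Exclude the three query words themselves when picking the best match
words = words(~ismember(words, vocab([i_a, i_b, i_c])));
answer = words{1};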

52 Microsoft Research Sentence Completion Task
1040 sentences with missing word; 5 choices for each missing word. Language model trained on 500 novels (Project Gutenberg) provided 30 alternative words for each missing word; Judges selected top 4 impostor words. Human performance: 90% accuracy [Image credits: Mikolov et al (2013) “Efficient Estimation of Word Representation in Vector Space”, arXiv] All red-headed men who are above the age of [ 800 | seven | twenty-one | 1,200 | 60,000] years, are eligible. That is his [ generous | mother’s | successful | favorite | main ] fault, but on the whole he’s a good worker. [Zweig & Burges, 2011; Mikolov et al, 2013a; ]

53 Semantic-syntactic word evaluation task
[Image credits: Mikolov et al (2013) “Efficient Estimation of Word Representation in Vector Space”, arXiv] [Mikolov et al, 2013a, 2013b;

54 Semantic-syntactic word evaluation task
[Image credits: Mikolov et al (2013) “Efficient Estimation of Word Representation in Vector Space”, arXiv] [Mikolov et al, 2013a, 2013b;

55 Outline Probabilistic Language Models (LMs) Neural Probabilistic LMs
Likelihood of a sentence and LM perplexity Limitations of n-grams Neural Probabilistic LMs Vector-space representation of words Neural probabilistic language model Log-Bilinear (LBL) LMs Long-range dependencies Enhancing LBL with linguistic features Recurrent Neural Networks (RNN) Applications Speech recognition and machine translation Sentence completion and linguistic regularities Bag-of-word-vector approaches Auto-encoders for text Continuous bag-of-words and skip-gram models Scalability with large vocabularies Tree-structured LMs Noise-contrastive estimation

56 Semantic Hashing
[Deep auto-encoder architecture with layer sizes 2000-500-250-125-2-125-250-500-2000.]
[Hinton & Salakhutdinov, “Reducing the dimensionality of data with neural networks”, Science, 2006; Salakhutdinov & Hinton, “Semantic Hashing”, Int J Approx Reason, 2007]

57 Semi-supervised learning of auto-encoders
Add a classifier module to the codes. When an input X(t) has a label Y(t), back-propagate the prediction error on Y(t) to the code Z(t). Stack the encoders and train layer-wise.
[Architecture diagram: word histograms x(t), x(t+1) enter auto-encoder g1,h1, producing codes z(1)(t), z(1)(t+1) with document classifier f1 predicting y(t), y(t+1); auto-encoders g2,h2 and g3,h3 with classifiers f2 and f3 are stacked on codes z(2) and z(3).]
[Ranzato & Szummer, “Semi-supervised learning of compact document representations with deep networks”, ICML, 2008; Mirowski, Ranzato & LeCun, “Dynamic auto-encoders for semantic indexing”, NIPS Deep Learning Workshop, 2010]

58 Semi-supervised learning of auto-encoders
Performance on document retrieval task: Reuters-21k dataset (9.6k training, 4k test), vocabulary 2k words, 10-class classification Comparison with: unsupervised techniques (DBN: Semantic Hashing, LSA) + SVM traditional technique: word TF-IDF + SVM [Ranzato & Szummer, “Semi-supervised learning of compact document representations with deep networks”, ICML, 2008; Mirowski, Ranzato & LeCun, “Dynamic auto-encoders for semantic indexing”, NIPS Deep Learning Workshop, 2010]

59 Deep Structured Semantic Models for web search
[Architecture diagram: input word/phrase (dim = 5M) → bag-of-words vector (dim = 50K) → letter-tri-gram coeff. matrix (fixed) and letter-tri-gram embedding matrix (d=500) → layers W1, W2, W3, W4 → semantic vector (d=300). Query s: “racing car”; candidate documents t1: “formula one”, t2: “ford model t”. Relevance is the cosine similarity between semantic vectors: cos(s,t1), cos(s,t2).]
[Huang, He, Gao, Deng et al, “Learning Deep Structured Semantic Models for Web Search using Clickthrough Data”, CIKM, 2013]

60 Deep Structured Semantic Models for web search
Results on a web ranking task (16k queries) Normalized discounted cumulative gains Semantic hashing [Salakhutdinov & Hinton, 2007] Deep Structured Semantic Model [Huang, He, Gao et al, 2013] [Huang, He, Gao, Deng et al, “Learning Deep Structured Semantic Models for Web Search using Clickthrough Data”, CIKM, 2013]

61 Continuous Bag-of-Words
[Architecture diagram: the context words wt-2, wt-1, wt+1, wt+2 ("the cat on the") are embedded by a shared word embedding matrix U (word embedding space ℜD, D=100 to 300) and summed (simple sum) into a hidden vector h, from which an output matrix W predicts the centre word wt ("sat"); discrete word space {1, ..., V}, V>100k words.]
Extremely efficient estimation of the word embeddings in matrix U without a language model. Can be used as input to a neural LM. Enables much larger datasets, e.g., Google News (6B words, V=1M).
Complexity: 2C×D + D×V, or 2C×D + D×log(V) with hierarchical softmax using tree factorization.
[Mikolov et al, 2013a; Mnih & Kavukcuoglu, 2013]. A sketch of the CBOW forward pass follows.
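A minimal sketch of the CBOW forward pass with a full softmax output (the hierarchical-softmax and negative-sampling variants discussed later replace the last three lines); the field names U and W follow the diagram labels and are assumptions:
function p = CBOW_FProp(model, context_words)
% Sum the embeddings of the 2C context words and predict the centre word
h = sum(model.U(:, context_words), 2);   % D x 1 hidden vector (simple sum)
s = model.W * h;                         % V x 1 scores over the vocabulary
p = exp(s - max(s));                     % softmax with max-subtraction for stability
p = p / sum(p);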

62 Skip-gram Word embedding matrices
[Architecture diagram: the centre word wt ("sat") is embedded by the word embedding matrix U (word embedding space ℜD, D=100 to 1000) into zt, and the model predicts each of the surrounding context words wt-2, wt-1, wt+1, wt+2 ("the cat on the") through output matrices W; discrete word space {1, ..., V}, V>100k words.]
Extremely efficient estimation of the word embeddings in matrix U without a language model. Can be used as input to a neural LM. Enables much larger datasets, e.g., Google News (33B words, V=1M).
Complexity: 2C×D + 2C×D×V (full softmax); 2C×D + 2C×D×log(V) (hierarchical softmax using tree factorization); 2C×D + 2C×D×(k+1) (negative sampling with k negative examples).
[Mikolov et al, 2013a, 2013b; Mnih & Kavukcuoglu, 2013]

63 Vector-space word representation without LM
Word and phrase representations learned by the skip-gram model exhibit a linear structure that enables analogies with vector arithmetic. This is due to the training objective: the input and the output (before the softmax) are in a linear relationship. The sum of vectors in the loss function is a sum of log-probabilities (i.e., the log of a product of probabilities), comparable to an AND function over contexts. [Image credits: Mikolov et al (2013) “Distributed Representations of Words and Phrases and their Compositionality”, NIPS] [Mikolov et al, 2013a, 2013b]

64 Examples of Word2Vec embeddings
Example of word embeddings obtained using Word2Vec on the 3.2B-word Wikipedia (vocabulary V=2M, continuous vector space D=200, trained using CBOW). For each query word, its 10 nearest neighbours:
debt: debts, repayments, repayment, monetary, payments, repay, mortgage, repaid, refinancing, bailouts
aa: aaarm, samavat, obukhovskii, emerlec, gunss, dekhen, minizini, bf, mortardepth, ee
decrease: increase, increases, decreased, greatly, decreasing, increased, decreases, reduces, reduce, increasing
met: meeting, meet, meets, had, welcomed, insisted, acquainted, satisfied, first, persuaded
slow: slower, fast, slowing, slows, slowed, faster, sluggish, quicker, pace, slowly
france: marseille, french, nantes, vichy, paris, bordeaux, aubagne, vend, vienne, toulouse
jesus: christ, resurrection, savior, miscl, crucified, god, apostles, apostle, bickertonite, pretribulational
xbox: playstation, wii, xbla, wiiware, gamecube, nintendo, kinect, dsiware, eshop, dreamcast
[Mikolov et al, 2013a, 2013b]

65 Performance on the semantic-syntactic task
[Image credits: Mikolov et al (2013) “Efficient Estimation of Word Representation in Vector Space”, arXiv] [Image credits: Mikolov et al (2013) “Distributed Representations of Words and Phrases and their Compositionality”, NIPS] Word and phrase representations learned by the skip-gram model exhibit a linear structure that enables analogies with vector arithmetic. Due to the training objective, the input and the output (before the softmax) are in a linear relationship; a sum of vectors corresponds to a sum of log-probabilities, i.e., the log of a product of probabilities, i.e., an AND function over contexts. [Mikolov et al, 2013a, 2013b]

66 Outline Probabilistic Language Models (LMs) Neural Probabilistic LMs
Likelihood of a sentence and LM perplexity Limitations of n-grams Neural Probabilistic LMs Vector-space representation of words Neural probabilistic language model Log-Bilinear (LBL) LMs Long-range dependencies Enhancing LBL with linguistic features Recurrent Neural Networks (RNN) Applications Speech recognition and machine translation Sentence completion and linguistic regularities Bag-of-word-vector approaches Auto-encoders for text Continuous bag-of-words and skip-gram models Scalability with large vocabularies Tree-structured LMs Noise-contrastive estimation

67 Computational bottleneck of large vocabularies
The bulk of the computation is at the word-prediction layer (scoring function and softmax over the whole vocabulary, given the word history) and at the input word embedding layer. Large vocabularies: AP News (14M words; V=17k), HUB-4 (1M words; V=25k), Google News (6B words; V=1M), Wikipedia (3.2B words; V=2M). Strategies to compress the output softmax follow.

68 Reducing the bottleneck of large vocabularies
Replace rare words, numbers by <unk> token Subsample frequent words during training Speed-up 2x to 10x Better accuracy for rare words Hierarchical Softmax (HS) Noise-Contrastive Estimation (NCE) and Negative Sampling (NS) [Morin & Bengio, 2005, Mikolov et al, 2011, 2013b; Mnih & Teh 2012, Mnih & Kavukcuoglu, 2013]

69 Hierarchical softmax by grouping words
Group words into disjoint classes, e.g., 20 classes with frequency binning: use the unigram frequency; the top 5% of words (“the”, …) go to class 1, the following 5% to class 2, and so on. Factorize the word probability into a class probability and a class-conditional word probability (reconstructed below). Speed-up factor: from O(|V|) to O(|C| + max|VC|). [Mikolov et al, 2011; Auli et al, 2013]
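A reconstruction of the class factorization the slide refers to, with c(w) the class of word w and h the word history:
P(w_t \mid h) = P\big(c(w_t) \mid h\big) \cdot P\big(w_t \mid c(w_t), h\big)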

70 Hierarchical softmax by grouping words
[Figure: factorized output layer with a class layer and class-conditional word probabilities, scoring the target word given the word history. Image credits: Mikolov et al (2011) “Extensions of Recurrent Neural Network Language Model”, ICASSP] [Mikolov et al, 2011; Auli et al, 2013]

71 Hierarchical softmax using WordNet
Use WordNet to extract IS-A relationships Manually select one parent per child In the case of multiple children, cluster them to obtain a binary tree Hard to design Hard to adapt to other languages [Image credits: Morin & Bengio (2005) “Hierarchical Probabilistic Neural Network Language Model”, AISTATS] [Morin & Bengio, 2005;

72 Hierarchical softmax using Huffman trees
Frequency-based binning: “this is an example of a huffman tree”. [Image credits: Wikipedia, Wikimedia Commons] [Mikolov et al, 2013a, 2013b]

73 Hierarchical softmax using Huffman trees
Replace the comparison with V vectors of target words by a comparison with about log(V) vectors along the path from the root of the tree to the target word: at each node j on the path, a sigmoid of the dot product between the predicted word vector and the vector at that node gives the probability of branching towards the target word (reconstructed below). [Mikolov et al, 2013a, 2013b]
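A standard reconstruction of this hierarchical softmax, in the usual word2vec notation (an assumption, since the slide's formula is not in the transcript): n(w,j) is the j-th node on the path from the root to word w, L(w) the path length, v_n the vector at node n, ẑ the predicted vector, and d_j ∈ {-1, +1} encodes which branch leads to w:
P(w \mid h) = \prod_{j=1}^{L(w)-1} \sigma\big( d_j \cdot v_{n(w,j)}^{\top} \hat{z}(h) \big)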

74 Noise-Contrastive Estimation
Conditional probability of word w in the data, and conditional probability that word w comes from the data D rather than from the noise distribution (reconstructed below). Auxiliary binary classification problem: positive examples (data) vs. negative examples (noise). Scaling factor k: noisy samples are k times more likely than data samples. Noise distribution: based on unigram word probabilities. Empirically, the model can cope with un-normalized probabilities. [Mnih & Teh, 2012; Mnih & Kavukcuoglu, 2013; Mikolov et al, 2013a, 2013b]
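A reconstruction of the posterior the slide refers to, with P_θ(w|h) the model distribution and P_n(w) the unigram noise distribution:
P(D = 1 \mid w, h) = \frac{P_\theta(w \mid h)}{P_\theta(w \mid h) + k \, P_n(w)}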

75 Noise-Contrastive Estimation
Conditional probability that word w comes from the data D rather than from the noise distribution. Auxiliary binary classification problem: positive examples (data) vs. negative examples (noise); scaling factor k: noisy samples are k times more likely than data samples; noise distribution: based on unigram word probabilities. Introduce Δs, the difference between the score of word w under the model and the log of its scaled noise (unigram) probability, so that the above probability becomes a sigmoid of Δs. [Mnih & Teh, 2012; Mnih & Kavukcuoglu, 2013; Mikolov et al, 2013a, 2013b]

76 Noise-Contrastive Estimation
New loss function to maximize (reconstructed below); compare to Maximum Likelihood learning. [Mnih & Teh, 2012; Mnih & Kavukcuoglu, 2013; Mikolov et al, 2013a, 2013b]
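A reconstruction of the NCE objective from Mnih & Teh (2012), in the notation above, with Δs_θ(w,h) = s_θ(w,h) - log(k P_n(w)) and x_1, …, x_k the noise samples drawn from P_n:
J_t(\theta) = \log \sigma\big(\Delta s_\theta(w_t, h_t)\big) + \sum_{i=1}^{k} \log \big(1 - \sigma(\Delta s_\theta(x_i, h_t))\big)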

77 Negative sampling (compared to noise-contrastive estimation)
Negative sampling removes the normalization term from the probabilities: the scaled noise probability inside the NCE sigmoid is dropped, leaving only the raw score (reconstructed below). Compare to Maximum Likelihood learning. [Mnih & Teh, 2012; Mnih & Kavukcuoglu, 2013; Mikolov et al, 2013a, 2013b]
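A reconstruction of the negative-sampling objective (Mikolov et al, 2013b), with s_θ the un-normalized score and x_1, …, x_k the negative samples drawn from the noise distribution:
J_t(\theta) = \log \sigma\big(s_\theta(w_t, h_t)\big) + \sum_{i=1}^{k} \log \sigma\big(-s_\theta(x_i, h_t)\big)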

78 Speed-up over full softmax
LBL with full softmax, trained on AP News data (14M words, V=17k): 7 days.
Skip-gram (context 5) with phrases, trained using negative sampling, on Google data (33G words, V=692k + phrases): 1 day. [Image credits: Mikolov et al (2013) “Distributed Representations of Words and Phrases and their Compositionality”, NIPS]
Penn TreeBank data (900k words, V=10k): LBL (2-gram, 100d) with full softmax: 1 day; LBL (2-gram, 100d) with noise-contrastive estimation: 1.5 hours; RNN with 50-class hierarchical softmax (test perplexity 145.4 in the cited table): 0.5 hours; RNN (100d) with 50-class hierarchical softmax: 0.5 hours (own experience). [Image credits: Mnih & Teh (2012) “A fast and simple algorithm for training neural probabilistic language models”, ICML]
[Mnih & Teh, 2012; Mikolov et al, 2013b]

79 Thank you! Further references: following this slide
Basic (N)LBL Matlab code available on demand Contact: Acknowledgements: Sumit Chopra (AT&T Labs Research / Facebook) Srinivas Bangalore (AT&T Labs Research) Suhrid Balakrishnan (AT&T Labs Research) Yann LeCun (NYU / Facebook) Abhishek Arun (Microsoft Bing)

80 References Basic n-grams with smoothing and backtracking (no word vector representation): S. Katz, (1987) "Estimation of probabilities from sparse data for the language model component of a speech recognizer", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-35, no. 3, pp. 400–401 S. F. Chen and J. Goodman (1996) "An empirical study of smoothing techniques for language modelling", ACL A. Stolcke (2002) "SRILM - an extensible language modeling toolkit” ICSLP, pp. 901–904

81 References Neural network language models:
Y. Bengio, R. Ducharme, P. Vincent and J.-L. Jauvin (2001, 2003) "A Neural Probabilistic Language Model", NIPS (2000) 13: J. Machine Learning Research (2003) 3: F. Morin and Y. Bengio (2005) “Hierarchical probabilistic neural network language model", AISTATS Y. Bengio, H. Schwenk, J.-S. Senécal, F. Morin, J.-L. Gauvain (2006) "Neural Probabilistic Language Models", Innovations in Machine Learning, vol. 194, pp

82 References Linear and/or nonlinear (neural network-based) language models: A. Mnih and G. Hinton (2007) "Three new graphical models for statistical language modelling", ICML, pp. 641–648, A. Mnih, Y. Zhang, and G. Hinton (2009) "Improving a statistical language model through non-linear prediction", Neurocomputing, vol. 72, no. 7-9, pp – A. Mnih and Y.-W. Teh (2012) "A fast and simple algorithm for training neural probabilistic language models“ ICML, A. Mnih and K. Kavukcuoglu (2013) “Learning word embeddings efficiently with noise-contrastive estimation“ NIPS

83 References Recurrent neural networks (long-term memory of word context): Tomas Mikolov, M Karafiat, J Cernocky, S Khudanpur (2010) "Recurrent neural network-based language model“ Interspeech T. Mikolov, S. Kombrink, L. Burger, J. Cernocky and S. Khudanpur (2011) “Extensions of Recurrent Neural Network Language Model“ ICASSP  Tomas Mikolov and Geoff Zweig (2012) "Context-dependent Recurrent Neural Network Language Model“ IEEE Speech Language Technologies  Tomas Mikolov, Wen-Tau Yih and Geoffrey Zweig (2013) "Linguistic Regularities in Continuous SpaceWord Representations" NAACL-HLT

84 References Applications:
P. Mirowski, S. Chopra, S. Balakrishnan and S. Bangalore (2010) “Feature-rich continuous language models for speech recognition”, SLT G. Zweig and C. Burges (2011) “The Microsoft Research Sentence Completion Challenge” MSR Technical Report MSR-TR M. Auli, M. Galley, C. Quirk and G. Zweig (2013) “Joint Language and Translation Modeling with Recurrent Neural Networks” EMNLP K. Yao, G. Zweig, M.-Y. Hwang, Y. Shi and D. Yu (2013) “Recurrent Neural Networks for Language Understanding” Interspeech

85 References Continuous Bags of Words, Skip-Grams, Word2Vec:
Tomas Mikolov et al (2013) “Efficient Estimation of Word Representation in Vector Space“ arXiv v3 Tomas Mikolov et al (2013) “Distributed Representation of Words and Phrases and their Compositionality” arXiv v1, NIPS

86

87 Probabilistic Language Models
Goal: score sentences according to their likelihood Machine Translation: P(high winds tonight) > P(large winds tonight) Spell Correction The office is about fifteen minuets from my house P(about fifteen minutes from) > P(about fifteen minuets from) Speech Recognition P(I saw a van) >> P(eyes awe of an) Re-ranking n-best lists of sentences produced by an acoustic model, taking the best Secondary goal: sentence completion or generation Slide courtesy of Abhishek Arun

88 Example of a bigram language model
Training data: "There is a big house"; "I buy a house"; "They buy the new house"
Model (maximum-likelihood bigram estimates): p(big|a) = 0.5, p(is|there) = 1, p(buy|they) = 1, p(house|a) = 0.5, p(buy|i) = 1, p(a|buy) = 0.5, p(new|the) = 1, p(house|big) = 1, p(the|buy) = 0.5, p(a|is) = 1, p(house|new) = 1, p(they|<s>) = .333
Test data: S1: "they buy a big house"; P(S1) = .333 * 1 * 0.5 * 0.5 * 1 ≈ 0.083
S2: "they buy a new house"; P(S2) = ?
Slide courtesy of Abhishek Arun
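Working out the question on the slide under the same maximum-likelihood bigram model: the bigram "a new" never occurs in the training data, so its estimated probability is zero and the whole sentence gets probability zero, which is exactly the sparsity problem that smoothing addresses:
P(S2) = p(they|<s>) * p(buy|they) * p(a|buy) * p(new|a) * p(house|new) = .333 * 1 * 0.5 * 0 * 1 = 0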

89 Intuitive view of perplexity
How well can we predict the next word? A random predictor would give each word probability 1/V, where V is the size of the vocabulary; a better model of a text should assign a higher probability to the word that actually occurs. Perplexity: “how many words are likely to happen, given the context”. A perplexity of 1 means that the model recites the text by heart; a perplexity of V means that the model produces uniform random guesses. The lower the perplexity, the better the language model. Examples: "I always order pizza with cheese and ____" (mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 1e-100, and 1e-100); "The 33rd President of the US was ____"; "I saw a ____". Slide courtesy of Abhishek Arun

90 Stochastic gradient descent
[LeCun et al, "Efficient BackProp", Neural Networks: Tricks of the Trade, 1998; Bottou, "Stochastic Learning", Slides from a talk in Tubingen, 2003]

91 Stochastic gradient descent
[LeCun et al, "Efficient BackProp", Neural Networks: Tricks of the Trade, 1998; Bottou, "Stochastic Learning", Slides from a talk in Tubingen, 2003]

92 Dimensionality reduction and invariant mapping
Similarly labelled samples Dissimilar codes [Hadsell, Chopra & LeCun, “Dimensionality Reduction by Learning an Invariant Mapping”, CVPR, 2006]

93 Auto-encoder
[Figure: two auto-encoder variants, each mapping an input to a code and reconstructing the input as the target (target = input).]
“Bottleneck” code: low-dimensional, typically dense, distributed representation. “Overcomplete” code: high-dimensional, always sparse, distributed representation.

94 Auto-encoder
[Figure: the encoder predicts a code from the input (encoding “energy”: error of the code prediction); the decoder reconstructs the input from the code (decoding “energy”: reconstruction error).]

95 Auto-encoder
[Figure as on the previous slide: encoding energy (code prediction from the input) and decoding energy (input reconstruction from the code).]

96 Auto-encoder loss function
For one sample t, the loss combines the encoding energy and the decoding energy, with a coefficient on the encoder error; for all T samples, the per-sample losses are summed (a sketch follows). How do we get the codes Z? We note W = {C, bC, D, bD}.
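A sketch of the per-sample loss implied by the slide, under common assumptions for this family of auto-encoders (squared-error energies, encoder enc with parameters C, bC, decoder dec with parameters D, bD, and a coefficient α on the encoder error; the exact energies in the cited papers may differ):
L(x_t, z_t; W) = \alpha \, \| z_t - \mathrm{enc}_{C, b_C}(x_t) \|^2 + \| x_t - \mathrm{dec}_{D, b_D}(z_t) \|^2, \qquad L(W, Z) = \sum_{t=1}^{T} L(x_t, z_t; W)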

97 Auto-encoder backprop w.r.t. codes
[Figure: the gradient of the encoding energy (between the code prediction and the code) is back-propagated w.r.t. the codes, starting from the input.] [Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks”, NIPS, 2007]

98 Auto-encoder backprop w.r.t. codes
[Figure as on the previous slide.] [Ranzato, Boureau & LeCun, “Sparse Feature Learning for Deep Belief Networks”, NIPS, 2007]

