
1 Language Modelling. By Chauhan Rohan, Dubois Antoine & Falcon Perez Ricardo. Supervised by Gangireddy Siva

2 Goal: analyze language modelling results obtained with neural networks and compare them to other techniques

3 Language models Language modeling is the art of determining the probability of a sequence of words. It is used for speech recognition, machine translation, spelling correction and optical character recognition. Sometimes, when building a language model, there are sequences that never occur in the training corpus and are therefore assigned zero probability. This makes it impossible to recognize those words, however unambiguous the acoustics may be. Perplexity and entropy are used to evaluate a language model: the lower the perplexity, the better the model.
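To make the perplexity measure concrete, here is a minimal Python sketch (not from the slides; the probabilities are invented): perplexity is two raised to the negative average log2 probability the model assigns to the test words.

```python
import math

def perplexity(word_probs):
    """Perplexity = 2 ** (-average log2 probability per word)."""
    avg_log2 = sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** (-avg_log2)

# A toy model that assigns probability 0.1 to each of five test words
# has perplexity 10; a better model assigns higher probabilities and
# therefore scores a lower perplexity.
print(perplexity([0.1, 0.1, 0.1, 0.1, 0.1]))  # ≈ 10.0
```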

4 Smoothing Smoothing is a technique used to eliminate the zero-probability problem in a language model by taking some probability mass from seen words and redistributing it to words with zero probability. Interpolation is the simplest technique: it combines language models by interpolating them together. Katz smoothing is about predicting how often we expect something to happen that has never happened before. Kneser-Ney smoothing uses a modified backoff distribution based on the number of contexts a word occurs in rather than its frequency. Jelinek-Mercer smoothing uses a weighted average of naive backoff models.
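As an illustration of the simplest case, here is a hedged sketch of linear interpolation between trigram, bigram and unigram estimates; the lambda weights and probabilities are invented for the example (in practice the weights are tuned on held-out data).

```python
def interpolated_prob(p_trigram, p_bigram, p_unigram,
                      lambdas=(0.6, 0.3, 0.1)):
    """Linearly interpolate higher- and lower-order estimates so that
    an unseen trigram no longer receives zero probability."""
    l3, l2, l1 = lambdas
    return l3 * p_trigram + l2 * p_bigram + l1 * p_unigram

# Trigram unseen in training (p = 0), but the bigram and unigram
# estimates still contribute probability mass.
print(interpolated_prob(0.0, 0.02, 0.001))  # ≈ 0.0061
```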

5 Skipping Models Skipping models are useful for larger n-grams, where there is less chance of having seen the exact context before. E.g. take a 5-gram model. Maybe we haven't seen "Show John a good time" but we have seen "Show Stan a good time". A normal n-gram model would back off to "Show John a good", "John a good", "a good"... which will have a relatively low probability, but a skipping model would assign a higher probability to "Show ____ a good time". A skipping model expresses a higher-order n-gram model in terms of lower-order n-gram models combined with the higher-order model.
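A toy sketch of the idea (the word sequences and the skipped position are invented for the example): contexts that differ only in the skipped slot pool their counts, so the unseen sentence still gets support.

```python
from collections import Counter

def skipped(ngram, skip_pos):
    """Replace one position with a wildcard so that n-grams differing
    only there share statistics."""
    g = list(ngram)
    g[skip_pos] = "_"
    return tuple(g)

counts = Counter()
for seen in [("Show", "Stan", "a", "good", "time"),
             ("Show", "Mary", "a", "good", "time")]:
    counts[skipped(seen, 1)] += 1

# "Show John a good time" was never observed, yet its skipped form
# "Show _ a good time" has been seen twice.
print(counts[skipped(("Show", "John", "a", "good", "time"), 1)])  # 2
```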

6 Clustering Clustering can be soft or hard. When each word belongs to exactly one class, it is called hard clustering; otherwise it is soft clustering. Predictive clustering is a smoothed form of clustering: even though we may not have seen the exact phrase, we might have encountered a phrase with a word belonging to the same class. Context clustering is when the context of words is taken into account. IBM clustering is when a clustering model is interpolated with a trigram model; full IBM is a more generalized version. Index clustering is when we change the backoff order for a clustering model.
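A rough sketch of the predictive-clustering decomposition P(word | context) ≈ P(class | context) × P(word | class, context); the word-to-class map, context and probabilities below are invented for the example.

```python
# Hypothetical word-to-class map and toy probability tables.
word_class = {"Tuesday": "WEEKDAY", "Wednesday": "WEEKDAY"}

p_class_given_context = {("WEEKDAY", ("party", "on")): 0.2}
p_word_given_class_context = {("Tuesday", "WEEKDAY", ("party", "on")): 0.3}

def predictive_cluster_prob(word, context):
    """P(word | context) ~= P(class | context) * P(word | class, context).
    Even if "party on Tuesday" was never seen, statistics gathered for
    the WEEKDAY class can still support it."""
    c = word_class[word]
    return (p_class_given_context[(c, context)]
            * p_word_given_class_context[(word, c, context)])

print(predictive_cluster_prob("Tuesday", ("party", "on")))  # ≈ 0.06
```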

7 Caching and Sentence Mixture Models
If a speaker uses a word, it is likely that he will use the same word again in the future. This observation is the basis of caching. Context caching depends on the context: we can form a smoothed bigram or trigram from the previous words and interpolate it with a standard trigram. In conditional caching, we weight the trigram differently depending on whether or not we have previously seen the context. In sentence mixture models, we model different types of sentences in the corpus; e.g. the Wall Street Journal may have 3 types: general news stories, business sentences and financial market sentences. SMMs capture long-distance correlations.
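A minimal sketch of a unigram cache interpolated with a standard trigram estimate; the history, the trigram probability and the interpolation weight are invented for the example.

```python
from collections import Counter

def cached_prob(word, history, p_trigram, cache_weight=0.1):
    """Mix a standard trigram estimate with a unigram cache built
    from the words the speaker has already used."""
    cache = Counter(history)
    p_cache = cache[word] / len(history) if history else 0.0
    return (1 - cache_weight) * p_trigram + cache_weight * p_cache

history = "the court ruled that the court order".split()
# "court" already appeared twice in the document, so its probability
# is boosted above the plain trigram estimate of 0.01.
print(cached_prob("court", history, p_trigram=0.01))
```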

8 Mainly n-gram models so far. Other techniques?

9 Other Techniques Maximum entropy models use triggers: for example, if we encounter the word "school", its probability increases, as well as the probability of similar words. Whole-sentence maximum entropy models predict the probability of an entire sentence rather than of individual words; however, the gains from complex sentence-level features are very small. Latent semantic analysis is a technique for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. Structured language models are statistical parsers that can be thought of as generative models of language.
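To make the LSA idea concrete, here is a small sketch using a truncated SVD of a toy term-document count matrix; the terms, documents and the choice of two latent concepts are invented for the example.

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are documents.
terms = ["stock", "market", "shares", "goal", "match", "team"]
counts = np.array([[3, 2, 0],
                   [2, 3, 0],
                   [1, 2, 0],
                   [0, 0, 2],
                   [0, 1, 3],
                   [0, 0, 2]], dtype=float)

# A truncated SVD keeps the strongest k singular directions as latent
# "concepts" relating terms and documents.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
term_concepts = U[:, :k] * s[:k]     # each term as a k-dimensional concept vector
doc_concepts = Vt[:k, :].T * s[:k]   # each document in the same concept space

for term, vec in zip(terms, term_concepts):
    print(f"{term:8s} {np.round(vec, 2)}")
```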

10 Limitations of n-grams?
2 main problems. N-grams do not take into account: long-distance dependencies; similarities between words. Curse of dimensionality.

11 The neural networks

12 Neural networks?

13 Which neural network? In order of development: feedforward, recurrent, LSTM.

14 Feedforward. Three main ideas: words ⇒ real vectors; joint probabilities; learning both the vector values and the probabilities. Problem: fixed-size context.
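A minimal numpy sketch of those three ideas (real-valued word vectors, a joint probability over the next word, and parameters that would all be learned together); the sizes and random weights are placeholders and training is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n, h = 10, 4, 3, 8         # vocabulary, embedding dim, context size, hidden units

C = rng.normal(size=(V, d))       # words => real vectors (lookup table)
W_hidden = rng.normal(size=(n * d, h))
W_output = rng.normal(size=(h, V))

def next_word_distribution(context_ids):
    """P(w_t | w_{t-n}, ..., w_{t-1}) from a fixed-size context.
    In a real model C, W_hidden and W_output are learned jointly."""
    x = C[context_ids].reshape(-1)        # concatenate the context vectors
    hidden = np.tanh(x @ W_hidden)
    logits = hidden @ W_output
    exp = np.exp(logits - logits.max())   # softmax over the vocabulary
    return exp / exp.sum()

print(next_word_distribution([1, 5, 7]).round(3))
```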


16 Recurrent. INPUT at time t: concatenation of w(t) and h(t-1). OUTPUT at time t: probability distribution of the next word given w(t) and h(t-1).
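A minimal sketch of one recurrent step matching that description; the hidden size, vocabulary and random weights are placeholders, and training is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 10, 16                           # vocabulary size, hidden-state size

W_in = rng.normal(size=(V, H)) * 0.1    # weights for the current word w(t)
W_rec = rng.normal(size=(H, H)) * 0.1   # weights for the previous state h(t-1)
W_out = rng.normal(size=(H, V)) * 0.1   # projection to the vocabulary

def rnn_step(word_id, h_prev):
    """Combine w(t) with h(t-1), then output P(next word | w(t), h(t-1))."""
    h_t = np.tanh(W_in[word_id] + h_prev @ W_rec)
    logits = h_t @ W_out
    exp = np.exp(logits - logits.max())
    return h_t, exp / exp.sum()

h_state = np.zeros(H)
for w in [2, 5, 1]:                     # a toy word-id sequence
    h_state, p_next = rnn_step(w, h_state)
print(p_next.round(3))
```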

17 LSTM (Long Short-Term Memory)
From this...

18 LSTM (Long Short-Term Memory)
… to this.

19 Experiments

20 Datasets. Penn Treebank: over 1,000,000 words; 2,499 stories of Wall Street Journal content; split into training (~890,000 words), validation (~70,000 words) and test (~80,000 words). Finnish dataset: much bigger **(if there is time).

21 Experiments. N-grams: simple n-grams (orders 3 to 10); smoothed n-grams (Kneser-Ney). Neural networks **(pending): feedforward; LSTM (using the TheanoLM toolkit).

22 Results

23 Results: N-grams

24 Results: N-grams. Perplexity stops decreasing after a certain order. Smoothing helps to decrease the overall perplexity, especially at higher orders. Very similar results on the validation and test sets.

25 References
Marcus, Mitchell, et al. Treebank-3 LDC99T42. Web Download. Philadelphia: Linguistic Data Consortium, 1999.
Joshua T. Goodman. A Bit of Progress in Language Modeling, Extended Version. Machine Learning and Applied Statistics Group, Microsoft Research.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. A Neural Probabilistic Language Model. Département d'Informatique et Recherche Opérationnelle, Centre de Recherche Mathématiques, Université de Montréal.
Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan "Honza" Cernocky, Sanjeev Khudanpur. Recurrent Neural Network Based Language Model. Brno University of Technology, Czech Republic, and Department of Electrical and Computer Engineering, Johns Hopkins University, USA.
TheanoLM toolkit.

26 Questions?

