
Language Modelling By Chauhan Rohan, Dubois Antoine & Falcon Perez Ricardo Supervised by Gangireddy Siva 1

Analyze language modelling results obtained with neural networks and compare them to other techniques 2

Language models Language modelling is the art of determining the probability of a sequence of words. It is used for speech recognition, machine translation, spelling correction and optical character recognition. When building a language model, some sequences never occur in the training corpus and are therefore assigned zero probability. This makes it impossible to recognize those sequences, however unambiguous the acoustics may be. Perplexity and entropy are used to evaluate a language model: the lower the perplexity, the better the model. 3
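
A minimal sketch of how perplexity is computed from a model's per-word probabilities (the probabilities below are made up for illustration):

    import math

    def perplexity(word_probs):
        """Perplexity = exp of the average negative log-probability per word."""
        n = len(word_probs)
        log_sum = sum(math.log(p) for p in word_probs)
        return math.exp(-log_sum / n)

    # Hypothetical probabilities a model assigns to the words of a test sentence.
    probs = [0.2, 0.1, 0.05, 0.3]
    print(perplexity(probs))   # roughly 7.6

Lower per-word probabilities give higher perplexity, so a better model is one that assigns higher probability to the held-out text.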

Smoothing Smoothing is the technique used to eliminate the zero-probability problem in a language model by taking some probability mass from seen words and redistributing it to words with zero probability. Interpolation is the simplest technique: it combines several language models by interpolating them together. Katz smoothing estimates how often we expect something to happen that has never happened before. Kneser-Ney smoothing uses a modified backoff distribution based on the number of contexts a word occurs in, rather than its raw frequency. Jelinek-Mercer smoothing uses a weighted average (linear interpolation) of higher- and lower-order models. 4
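
As an illustration of the interpolation idea, here is a minimal sketch of Jelinek-Mercer-style linear interpolation of bigram and unigram maximum-likelihood estimates; the toy corpus and the weight lambda are made up:

    from collections import Counter

    corpus = "the cat sat on the mat the cat ate".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    total = len(corpus)

    def p_interp(word, prev, lam=0.7):
        """P(word | prev) = lam * P_ML(word | prev) + (1 - lam) * P_ML(word)."""
        p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
        p_uni = unigrams[word] / total
        return lam * p_bi + (1 - lam) * p_uni

    print(p_interp("sat", "cat"))   # seen bigram: dominated by the bigram estimate
    print(p_interp("ate", "mat"))   # unseen bigram: still gets unigram mass, not zero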

Skipping Models Skipping models are used for larger n-grams, since there is less chance of having seen the exact context before. E.g. take a 5-gram model: maybe we haven't seen "Show John a good time" but we have seen "Show Stan a good time". A normal n-gram model would back off to "John a good time", "a good time", ..., which have relatively low probability, but a skipping model would assign a higher probability to "Show ____ a good time". A skipping model expresses a higher-order n-gram model in terms of lower-order models combined with the higher-order model. 5
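
A minimal sketch of the skipping idea on a toy corpus (the corpus, words and interpolation weight are made up): a trigram estimate is interpolated with a "skipping" estimate that ignores the middle context word, so an unseen trigram can still borrow probability from "show ____ a":

    from collections import Counter

    corpus = "show stan a good time please show stan a great time".split()
    triples = list(zip(corpus, corpus[1:], corpus[2:]))
    trigrams = Counter(triples)
    bigram_ctx = Counter(zip(corpus, corpus[1:]))              # (w1, w2) context counts
    skip = Counter((w1, w3) for w1, _, w3 in triples)          # counts with the middle word skipped
    skip_ctx = Counter(corpus[:-2])                            # w1 counts as skip contexts

    def p_skip_interp(w3, w1, w2, lam=0.5):
        """Interpolate P(w3 | w1 w2) with the skipping estimate P(w3 | w1 _)."""
        p_tri = trigrams[(w1, w2, w3)] / bigram_ctx[(w1, w2)] if bigram_ctx[(w1, w2)] else 0.0
        p_skp = skip[(w1, w3)] / skip_ctx[w1] if skip_ctx[w1] else 0.0
        return lam * p_tri + (1 - lam) * p_skp

    # "show john a" was never seen, but "show ____ a" was, so the estimate is non-zero.
    print(p_skip_interp("a", "show", "john"))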

Clustering Clustering can be soft or hard: when each word belongs to exactly one class it is called hard clustering, otherwise soft clustering. Predictive clustering is a smoothed form of clustering: even though we may not have seen the exact phrase, we might have encountered a phrase with a word belonging to the same class. Context clustering is when the contexts of words are taken into account. IBM clustering is when a clustering model is interpolated with a trigram model; full IBM is a more generalized version. Index clustering is when we change the backoff order for a clustering model. 6
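
A minimal sketch of a hard-clustered (class-based) bigram, assuming a hand-made word-to-class map and a toy corpus: P(w | w_prev) is approximated by P(class(w) | class(w_prev)) * P(w | class(w)):

    from collections import Counter

    word2class = {"on": "PREP", "at": "PREP", "monday": "DAY", "tuesday": "DAY"}
    corpus = "on monday at tuesday on tuesday".split()
    classes = [word2class[w] for w in corpus]

    class_bigrams = Counter(zip(classes, classes[1:]))
    class_counts = Counter(classes)
    word_counts = Counter(corpus)

    def p_class_bigram(word, prev):
        c, c_prev = word2class[word], word2class[prev]
        p_cc = class_bigrams[(c_prev, c)] / class_counts[c_prev]   # P(class | previous class)
        p_wc = word_counts[word] / class_counts[c]                 # P(word | its class)
        return p_cc * p_wc

    # "at monday" was never seen, but PREP -> DAY and "monday" in class DAY were.
    print(p_class_bigram("monday", "at"))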

Caching and Sentence Mixture Models If a speaker uses a word, it is likely that they will use the same word again later. This observation is the basis of caching. Context caching makes the cache depend on the context: we can form a smoothed bigram or trigram from previous words and interpolate it with a standard trigram. In conditional caching, we weight the cache differently depending on whether or not we have previously seen the context. Sentence mixture models model the different types of sentences in a corpus, e.g. the Wall Street Journal may have three types: general news stories, business sentences and financial market sentences. SMMs capture long-distance correlations. 7
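
A minimal sketch of a unigram cache interpolated with a static model (the static probabilities, vocabulary size and weights are made up for illustration):

    from collections import Counter

    static_p = {"the": 0.06, "stock": 0.001, "rose": 0.002}    # hypothetical static unigram LM

    def p_cached(word, history, lam=0.9, vocab_size=10000):
        """lam * P_static(word) + (1 - lam) * P_cache(word | recent history)."""
        cache = Counter(history)
        p_static = static_p.get(word, 1.0 / vocab_size)
        p_cache = cache[word] / len(history) if history else 0.0
        return lam * p_static + (1 - lam) * p_cache

    history = "the stock rose then the stock fell".split()
    print(p_cached("stock", history))   # much higher than the static 0.001: "stock" was just used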

So far, mainly n-gram models. What other techniques are there? 8

Other Techniques Maximum entropy models use triggers: if we encounter the word "school", it increases its own probability as well as the probability of similar words. Whole-sentence maximum entropy models predict the probability of an entire sentence rather than of individual words; however, the gains from complex sentence-level features are very small. Latent semantic analysis is a technique for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. Structured language models are statistical parsers that can be thought of as generative models of language. 9
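
A minimal sketch of latent semantic analysis on a tiny made-up term-document count matrix, using a truncated SVD to obtain the latent "concepts":

    import numpy as np

    # Rows = terms, columns = documents (counts are invented for illustration).
    terms = ["stock", "market", "school", "teacher"]
    X = np.array([[3, 2, 0],
                  [2, 3, 0],
                  [0, 0, 4],
                  [0, 1, 3]], dtype=float)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2                                  # keep k latent concepts
    term_vecs = U[:, :k] * s[:k]           # term coordinates in concept space

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # "stock" ends up close to "market" and far from "school" in concept space.
    print(cosine(term_vecs[0], term_vecs[1]), cosine(term_vecs[0], term_vecs[2]))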

Limitations of n-grams? Two main problems: they do not take into account long-distance dependencies or similarities between words, and they suffer from the curse of dimensionality. 10

The neural networks 11

Neural networks? 12 https://medium.com/emergent-future/the-fault-in-our-approach-what-youre-doing-wrong-while-implementing-recurrent-neural-network-lstm-929fbe17723c

Which neural network? In order of development: feedforward, recurrent, LSTM. 13

Feedforward Three main ideas: represent words as real-valued vectors; express the joint probability of word sequences in terms of these vectors; learn the vector values and the probability function together. Problem: fixed-size context. 14 https://hackernoon.com/overview-of-artificial-neural-networks-and-its-applications-2525c1addff7
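
A minimal numpy sketch of the forward pass of such a feedforward model (a Bengio-style architecture with randomly initialised, untrained weights; the vocabulary and layer sizes are made up, and the training step that learns C, H and W by backpropagation is omitted):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["<s>", "the", "cat", "sat", "mat"]
    V, d, h, n = len(vocab), 8, 16, 2      # vocab size, embedding dim, hidden dim, context size

    C = rng.normal(size=(V, d))            # word feature vectors (embeddings)
    H = rng.normal(size=(h, n * d))        # context -> hidden weights
    W = rng.normal(size=(V, h))            # hidden -> output scores

    def next_word_probs(context_ids):
        """P(w_t | previous n words) for a fixed-size context."""
        x = np.concatenate([C[i] for i in context_ids])    # concatenate context embeddings
        a = np.tanh(H @ x)
        scores = W @ a
        e = np.exp(scores - scores.max())                  # softmax over the vocabulary
        return e / e.sum()

    probs = next_word_probs([vocab.index("the"), vocab.index("cat")])
    print(dict(zip(vocab, probs.round(3))))

The fixed-size context is visible in the code: the input is a concatenation of exactly n embeddings, which is the limitation mentioned above.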

15

Recurrent Input at time t: concatenation of the current word w(t) and the previous hidden state h(t-1). Output at time t: probability distribution over the next word, given w(t) and h(t-1). 16 http://colah.github.io/posts/2015-08-Understanding-LSTMs/
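
A minimal numpy sketch of one recurrent step (random, untrained weights; sizes are made up): the input is exactly the concatenation of the one-hot current word w(t) and the previous hidden state h(t-1), and the output is a distribution over the next word:

    import numpy as np

    rng = np.random.default_rng(1)
    V, h = 5, 16                            # vocab size, hidden size
    U = rng.normal(size=(h, V + h))         # acts on the concatenation [w(t); h(t-1)]
    W = rng.normal(size=(V, h))             # hidden -> output scores

    def rnn_step(word_id, h_prev):
        w_t = np.zeros(V); w_t[word_id] = 1.0     # one-hot current word
        x = np.concatenate([w_t, h_prev])         # [w(t); h(t-1)]
        h_t = np.tanh(U @ x)                      # new hidden state
        scores = W @ h_t
        e = np.exp(scores - scores.max())
        return e / e.sum(), h_t                   # P(next word), state carried forward

    h_state = np.zeros(h)
    for wid in [0, 2, 3]:                         # feed a short word-id sequence
        probs, h_state = rnn_step(wid, h_state)
    print(probs.round(3))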

LSTM (Long Short-Term Memory) From this... 17 http://colah.github.io/posts/2015-08-Understanding-LSTMs/

LSTM (Long Short-Term Memory) … to this. 18 http://colah.github.io/posts/2015-08-Understanding-LSTMs/
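
A minimal numpy sketch of a single LSTM cell step, corresponding to the repeating module in the figure (gate equations only, with random untrained weights and bias terms omitted):

    import numpy as np

    rng = np.random.default_rng(2)
    d_in, d_h = 8, 16
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # One weight matrix per gate, each acting on the concatenation [h(t-1); x(t)].
    Wf, Wi, Wo, Wc = (rng.normal(size=(d_h, d_h + d_in)) for _ in range(4))

    def lstm_step(x_t, h_prev, c_prev):
        z = np.concatenate([h_prev, x_t])
        f = sigmoid(Wf @ z)                 # forget gate
        i = sigmoid(Wi @ z)                 # input gate
        o = sigmoid(Wo @ z)                 # output gate
        c_tilde = np.tanh(Wc @ z)           # candidate cell state
        c_t = f * c_prev + i * c_tilde      # new cell state (long-term memory)
        h_t = o * np.tanh(c_t)              # new hidden state
        return h_t, c_t

    h, c = np.zeros(d_h), np.zeros(d_h)
    h, c = lstm_step(rng.normal(size=d_in), h, c)
    print(h.shape, c.shape)

The cell state c_t is what lets the LSTM carry information over long distances, which is why it handles long-range dependencies better than plain n-grams or simple recurrent networks.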

Experiments 19

Datasets Penn Treebank: over 1,000,000 words; 2,499 stories of Wall Street Journal content; split into training (~890,000 words), validation (~70,000 words) and test (~80,000 words). Finnish dataset: much bigger **(if there is time). 20

Experiments N-grams: simple n-grams, orders 3 to 10; smoothed n-grams (Kneser-Ney). Neural networks **(pending): feedforward and LSTM, using the TheanoLM toolkit. 21

Results 22 http://www.glowscript.org/docs/VPythonDocs/graph.html

Results N-grams 23

Results: N-grams Perplexity stops decreasing after a certain order. Smoothing helps to decrease the overall perplexity, especially at higher orders. Very similar results on the validation and test sets. 24

References
Marcus, Mitchell, et al. Treebank-3 LDC99T42. Web download. Philadelphia: Linguistic Data Consortium, 1999.
Joshua T. Goodman. A Bit of Progress in Language Modeling, Extended Version. Machine Learning and Applied Statistics Group, Microsoft Research.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. A Neural Probabilistic Language Model. Département d'Informatique et Recherche Opérationnelle, Centre de Recherche Mathématiques, Université de Montréal.
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan "Honza" Černocký, Sanjeev Khudanpur. Recurrent Neural Network Based Language Model. Speech@FIT, Brno University of Technology, Czech Republic, and Department of Electrical and Computer Engineering, Johns Hopkins University, USA.
TheanoLM: https://github.com/senarvi/theanolm 25

Questions? 26