
1 Exploiting the distance and the occurrence of words for language modeling
Chong Tze Yuang

2 Outline
Introduction – Background, Objective
Term-distance (TD) and Term-occurrence (TO) – Formulation, Experiments
Neural network implementation – Architecture
Conclusion

3 Introduction
Background – Types of language model – Motivation

4 Background
What is a language model? The role of a language model (LM) is to estimate the probability of a given word sequence.
Applications:
Speech recognition – P(“… blue cheap stocks …”) < P(“… blue chip stocks …”)
Machine translation, Information retrieval
Handwriting recognition – P(“… blue dib stocks …”) < P(“… blue chip stocks …”)

5 Background
The role of a language model (LM) is to estimate the probability of a given word sequence. The chain rule, applied to “a collapse in blue chip stocks could be a boon for the rest of the market”:
P(“a”) × P(“collapse”|“a”) × … × P(“stocks”|“a collapse … chip”) × … × P(“market”|“a collapse … the”)
The factors require on the order of |V|^1, |V|^2, …, |V|^6, …, |V|^16 parameters: model complexity increases exponentially as the history grows longer.
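To make the chain rule concrete, here is a minimal sketch (my own illustration, not code from the thesis): the sentence probability is a product of conditional probabilities, one per word. The `cond_prob` argument is a hypothetical stand-in for any language model; the uniform toy model only exists so the example runs.

```python
import math

def sentence_log_prob(words, cond_prob):
    """Sum of log P(w_i | w_1 .. w_{i-1}) over the sentence (chain rule)."""
    total = 0.0
    for i, w in enumerate(words):
        history = tuple(words[:i])          # the history grows by one word each step
        total += math.log(cond_prob(w, history))
    return total

# Toy model: a uniform distribution over a tiny vocabulary, just to run the code.
VOCAB = {"a", "collapse", "in", "blue", "chip", "stocks", "market"}
uniform = lambda w, h: 1.0 / len(VOCAB)

sent = "a collapse in blue chip stocks".split()
print(sentence_log_prob(sent, uniform))     # log-probability under the toy model
# A model that conditions on the full k-word history needs O(|V|^k) parameters,
# which is why the chain rule alone is impractical without simplification.
```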

6 The data scarcity problem
Using the entire word sequence in the history for prediction incurs a severe data scarcity problem: the model requires an enormous number of parameters, and the training data are insufficient.
Example: P(“market”|“a collapse … the”) over the sentence “a collapse in blue chip stocks could be a boon for the rest of the market” would need on the order of |V|^16 parameters – data scarcity!

7 Simplification of the lexical structure
The context in the history is reduced to a simpler structure (illustrated on the slide with “a collapse in blue chip stocks could be a boon for the rest of the market”); a sketch of the reductions follows below.
N-gram – short context (< 4 words), e.g. “rest of the” → “market”
Distant bigram – intermediate context (< 10 words)
Skip-gram – intermediate context
Trigger – long context (> 10 words)
Bag-of-words – long context
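A minimal sketch (my own illustration, not from the thesis) of how three of these structures reduce the same history before predicting the next word; the function names and the distance cut-off are assumptions for illustration only.

```python
history = "a collapse in blue chip stocks could be a boon for the rest of the".split()

# N-gram: keep only the last (n-1) words of the history.
def ngram_context(history, n=3):
    return history[-(n - 1):]                      # e.g. ['of', 'the']

# Distant bigram: individual (word, distance) pairs counted back from the target.
def distant_bigrams(history, max_distance=10):
    return [(w, d) for d, w in enumerate(reversed(history), start=1) if d <= max_distance]

# Bag-of-words: unordered set of words in the (long) context, order discarded.
def bag_of_words(history):
    return set(history)

print(ngram_context(history))
print(distant_bigrams(history, max_distance=5))
print(sorted(bag_of_words(history)))
```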

8 Model combination
Simplifying the word sequence causes information loss, hence model combination. For example, the n-gram model is combined with the BOW model:
N-gram model – captures the syntactic information in the short context
BOW model – captures the semantic information in the long context
The two scores are combined into a single score; a minimal interpolation sketch follows.
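A minimal sketch of score combination by linear interpolation, assuming two hypothetical component models `p_ngram` and `p_bow` that each return a probability for the target word; the weight `lam` and the toy scores are made up for illustration (the thesis experiments also use log-linear interpolation, which multiplies weighted scores instead of adding them).

```python
def interpolate(word, history, p_ngram, p_bow, lam=0.7):
    """Combined score: lam * P_ngram + (1 - lam) * P_bow."""
    return lam * p_ngram(word, history) + (1.0 - lam) * p_bow(word, history)

# Toy component models, only to make the sketch runnable.
p_ngram = lambda w, h: 0.02 if w == "market" else 0.001   # short-context (syntactic) score
p_bow   = lambda w, h: 0.05 if "stocks" in h else 0.001   # long-context (semantic) score

history = "a collapse in blue chip stocks could be a boon for the rest of the".split()
print(interpolate("market", history, p_ngram, p_bow))
```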

9 Taxonomy
Language model taxonomy (diagram): model types include the n-gram, distant-bigram, skip-gram, trigger and bag-of-words models, as well as variable-length and structured models (LSI, PLSI, LDA); combination methods include linear interpolation, log-linear interpolation, maximum entropy, back-off and bucketing; implementations follow either the statistical approach or the connectionist approach (neural network implementation).

10 Comparison
Language model – Pro / Con:
N-gram – exploits the word arrangement / neglects the far context
Distant-bigram – exploits the far context (~10 words) / neglects the dependencies among words
Skip-gram – exploits the far context / neglects some words
Trigger – exploits the far context (>10 words) / neglects the word arrangement & some words
Bag-of-words – exploits the far context (>>10 words) / neglects the word arrangement
The n-gram model is combined with the rest of the models:
– to keep the word arrangement in the near context (syntactic)
– to exploit the far context (semantic)

11 Motivation
The n-gram model + the bag-of-words model: the events exploited by these two models are inter-dependent, but the inter-dependencies between the two events are neglected. The parameters of the combined model are therefore not consistent with the statistics of the training data.
Example: P1(“market”|“of the”) and P2(“market”|BOW(“a”, “collapse”, “in”, …)) are simply averaged.

12 Motivation
Exploit both the (n─1)-gram and the bag-of-words jointly:
P(“market” | “of the”, BOW(“a”, “collapse”, “in”, …))

13 Motivation
Another interpretation: the (n−1)-gram occurs conditioned on its own bag-of-words.
Modified n-gram model: P1(“market” | “of the”, BOW(“of”, “the”))
Bag-of-words model: P2(“market” | BOW(“a”, “collapse”, “in”, …))
The two scores are averaged.

14 Term-distance (TD) and Term-occurrence (TO)
Formulation – Experiments – NLP applications

15 Recap
Problem statement: traditional approaches combine LMs that are each trained independently of one another, so the parameters of the combined model are not consistent with the statistics of the training data.
Proposed solution: exploit the events jointly.
Focus: combination of an n-gram model and a bag-of-words model.

16 Formulation – the joint model
P(“market” | “a collapse in … of the”) – the (n−1)-gram “of the” and the bag-of-words of the history are exploited jointly.

17 Formulation – the decoupling
P1(“market” | “of the”, BOW(“of”, “the”)) and P2(“market” | BOW(“a”, “collapse”, “in”, …)) are combined by averaging.

18 Formulation – the TD and the TO components
Term-distance (TD) + Term-occurrence (TO)

19 Term Distance (TD)
Definition: how likely an arrangement of words would be seen, given the binary occurrences of those words.
Data scarcity problem: complexity is reduced from |V|^n to n!. For example, given a 50K vocabulary and a 10-word context:
O(n-gram) ≈ … × 10^23
O(Term-distance) ≈ … × 10^6

20 Term Occurrence (TO)
Definition: how likely the binary occurrences of a set of words would be seen.
Data scarcity problem: complexity is reduced from |V|^n to |V|. For example, given a 50K vocabulary and a 10-word context:
O(n-gram) ≈ … × 10^23
O(Term-occurrence) = 5.0 × 10^5
A scale-comparison sketch follows.
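A quick back-of-the-envelope sketch (my own, not the slides' exact figures, whose mantissas did not survive extraction) of how the parameter space scales for the three formulations; V and n are free to change.

```python
import math

def complexity(V: int, n: int):
    """Rough parameter-space sizes for the three formulations."""
    return {
        "n-gram (|V|^n)":        float(V) ** n,              # all ordered word combinations
        "term-distance (n!)":    float(math.factorial(n)),   # arrangements of the occurring words
        "term-occurrence (|V|)": float(V),                   # binary occurrence per vocabulary word
    }

for name, value in complexity(V=50_000, n=10).items():
    print(f"{name:24s} ~ {value:.1e}")
```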

21 Implementation
Original formulation, then a simplification to further reduce the model complexity:
For the TD and the TO components, words in the history are assumed independent.
Another n-gram model is introduced to recover such inter-dependencies.

22 Smoothing – the term distance
Maximum likelihood (ML) estimation.
Definition: how likely a word would appear in a specific position, given its occurrence in the history. Computed from counts.
Zero probability problem: due to data scarcity, a word might never be observed in a given position.

23 Smoothing – the term distance
Running average filtering: the counts of a word at different positions in a context might follow a smooth function. For example, the count of “stocks” occurring at position 6 can be inferred from its counts at positions …, 4, 5, 7, 8, …
A minimal smoothing sketch follows.
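A minimal sketch of running-average smoothing over positional counts (my own illustration of the idea described on this slide, with hypothetical counts; the window size is an assumption).

```python
def smooth_position_counts(counts, window=2):
    """Replace each positional count with the average over a small window of
    neighbouring positions, so an unseen position borrows mass from its neighbours."""
    smoothed = []
    for i in range(len(counts)):
        lo, hi = max(0, i - window), min(len(counts), i + window + 1)
        neighbourhood = counts[lo:hi]
        smoothed.append(sum(neighbourhood) / len(neighbourhood))
    return smoothed

# Hypothetical counts of "stocks" observed at positions 1..8 before the target
# word; position 6 was never seen in training.
counts = [3, 5, 9, 12, 10, 0, 7, 4]
print(smooth_position_counts(counts))   # position 6 now receives a non-zero estimate
```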

24 Smoothing – the term occurrence
Maximum likelihood (ML) estimation.
Definition: how likely a word would (binary) occur in the history. Computed from counts.
Zero probability problem: due to data scarcity, a word might never occur in the observed histories.
Laplace smoothing: add a small pseudo-count δ to every count; a minimal sketch follows.
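A minimal sketch of Laplace (add-δ) smoothing for a binary occurrence event, assuming the occurrence is treated as a Bernoulli outcome; the exact normalization and the value of δ used in the thesis may differ, so treat this only as an illustration of the idea.

```python
def laplace_occurrence_prob(histories_containing_word, total_histories, delta=0.5):
    """P(word occurs in a history), smoothed with a pseudo-count delta so that
    words never seen in any history do not receive zero probability."""
    return (histories_containing_word + delta) / (total_histories + 2 * delta)

print(laplace_occurrence_prob(histories_containing_word=0,   total_histories=10_000))
print(laplace_occurrence_prob(histories_containing_word=120, total_histories=10_000))
```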

25 3.4 Perplexity evaluation
The purpose is to show that the proposed TDTO model, which exploits the n-gram and the BOW jointly, provides a better language model. The models are evaluated and compared on two corpora:
Wall Street Journal (WSJ) – grammatically well-written news articles
Switchboard (SWB) – spontaneous conversation transcripts
Experiments:
Perplexity of the TDTO model with different history lengths – to evaluate the model's capability to exploit long contexts in the history
Comparison of the TDTO model to the BOW model – to evaluate whether exploiting the n-gram and the BOW jointly provides a better language model
Comparison of the TDTO model to the distant-bigram and trigger models – to evaluate whether the TDTO model exploits long contexts better than other language models

26 3.4.1 Perplexity evaluation – TDTO for exploiting long contexts
Motivation: to evaluate the capability of the TDTO model to exploit long contexts.
Observation: the TDTO model showed lower perplexities as the contexts grow longer.
WSJ: … to … (11.2% reduction)
SWB: 81.7 to 76.3 (6.5% reduction)
Conclusion: the TDTO model is capable of capturing information from long contexts.

27 3.4.2 Perplexity evaluation – TDTO to exploit n-gram & BOW jointly
Recap: the TDTO model provides a more principled manner of exploiting the n-gram & the BOW, where both structures are modeled jointly.
Motivation: to compare the TDTO model to the log-linear interpolation of the n-gram and BOW models.
Observation: (figure to be added later)
Conclusion: the TDTO model performs better than the n-gram+BOW model – the benefit of exploiting the events jointly.

28 3.4.2 Perplexity evaluation – TDTO to exploit n-gram & BOW jointly
Recap: the TDTO model provides a more principled manner of exploiting the n-gram & the BOW, where both structures are modeled jointly.
(Figure: perplexity of the log-linear interpolation of n-gram and BOW vs. the TDTO model.)

29 3.4.3 Perplexity evaluation – TDTO vs. distant-bigram/trigger
Motivation: to compare the TDTO model to the distant-bigram (DBG) & trigger (TGR) models, to see if TDTO exploits long contexts more effectively.
Observation: on both datasets, the TDTO model exploits the long contexts more effectively – lower perplexities.
Conclusion: the TDTO model performs better than the DBG and TGR models.

30 3.5 TDTO for NLP Applications
This section describes the use of the TDTO model for three NLP applications:
Speech recognition – Aurora 4
Document categorization – Reuters (topic) & Cornell (sentiment)
Word prediction – WSJ

31 3.5.1 TDTO for NLP Applications – Speech recognition WER performance
Aurora-4 task1.
Motivation: to compare the TDTO model to higher-order n-gram models.
Observation: the WER improvement over lower-order n-grams (e.g. 2-gram) shows the usefulness of the TDTO model in capturing information from the distant context; the WER with higher-order n-grams (e.g. 6-gram) shows that the TDTO model provides complementary information to the n-gram model, i.e. back-off.
Conclusion: the TDTO model reduced WER more than the higher-order n-gram model.
1N. Parihar, J. Picone, D. Pearce & H.G. Hirsch, “Performance analysis of the Aurora large vocabulary baseline system,” 2004.

32 3.5.2 TDTO for NLP Applications – Document Categorization (Reuters)
Binary classification on the 10 classes with the most data in the Reuters dataset4. TD and TO of history length 5 were combined with a unigram model (1G) and compared against a 6-gram model (6G).
Observations:
1GTD outperformed 6G (in 7 classes and on average) – longer contexts are modeled more effectively by TD than by the n-gram.
1GTO did not show improvement – this contradicts findings in which occurrence information helps text classification1,2,3 – further study is required.
TD can potentially be used as a discriminating attribute.

Classes     1G      6G      1GTD    1GTO
ACQ         0.9611  0.9571  0.9631  0.8395
CORN        0.4589  0.4240  0.4732  0.2974
CRUDE       0.8986  0.7637  0.9073  0.5131
EARN        0.9617  0.9744  0.9689  0.9396
GRAIN       0.8488  0.8333  0.8563  0.6422
INTEREST    0.6613  0.6294  0.7079  0.3754
MONEY-FX    0.8208  0.8065  –       0.4976
SHIP        0.5802  –       0.3980  0.2043
TRADE       0.6184  0.6343  0.6667  0.4185
WHEAT       0.5484  0.4702  0.5462  0.2937
MicroAve    0.8526  0.8200  0.8640  0.6351
MacroAve    0.7358  0.6891  0.7491  –

1G. Cao, J.-Y. Nie & J. Bai, “Integrating word relationship into language model,” 2005.
2M.-S. Wu & H.-M. Wang, “A term association translation model for naïve Bayes text classification,” 2012.
3T. Wandmacher & J.-Y. Antoine, “Methods to integrate a language model with semantic information for a word prediction component,” 2007.
4D. Lewis, The Reuters text categorization test collection, 1995.

33 3.5.2 TDTO for NLP Applications – Document Categorization (Cornell)
Binary classification on the Cornell polarity dataset1. TD and TO of history length 5 were combined with a unigram model (1G) and compared against a 6-gram model (6G). Results in F1 measure.
Modeling the context with TD gave results comparable to the n-gram – note that 1GTD showed identical F1 results to 6G.
TD can potentially be used as a discriminating attribute.

Classes      1G      6G      1GTD    1GTO
Good review  0.8141  0.8543  –       0.7739
Bad review   0.8442  –       –       –
Average      0.8492  –       –       –

1B. Pang & L. Lee, “A sentimental education: sentiment analysis using subjectivity,” 2004.

34 3.5.3 TDTO for NLP Applications – Word prediction
Predicting the 20 most frequent words in the BLLIP WSJ corpus1.
Accuracy: bigram 47.1%; bigram + TDTO 50.4%. TDTO improved the bigram's accuracy by about 7% relative.
TD and TO exploited from the distant context have been shown to improve the bigram's result. (Figure to be added later.)
1E. Charniak et al., “BLLIP WSJ Corpus Release 1,” 2000.

35 Neural network implementation
Architecture – Experiments

36 Motivation
Why is there a need for this extension? The TDTO model was smoothed naively: the TD component is smoothed with a running average filter, while the TO component is smoothed with the Laplace method – neither has a theoretical foundation. Traditional smoothing methods for language models, e.g. Kneser-Ney and Witten-Bell, deal only with n-grams and are not suitable for the TDTO model. Neural networks have been shown to provide better smoothing for language models.
Proposal: use neural networks to estimate probabilities based on the TD and the TO.

37 Neural network language modeling
What are the NN LM approaches? Neural network language models can be grouped into those without memory (e.g. the feed-forward NN-LM, modeling n-gram, distant-bigram or bag-of-words contexts) and those with memory (e.g. the RNN-LM and LSTM-LM).

38 Neural network language modeling
How does a NN smooth? Continuous space modeling: a discrete lexical event in the history is projected into the continuous space as a vector, and the probability is estimated from this vector representation.
Superior smoothing capability: the vectors of two semantically or syntactically similar words lie in a neighboring sub-space, so the vectors of rare events can be estimated from the vectors of more frequent events.
N-gram models implemented with neural networks (NNLM, RNNLM, LSTMLM, etc.) provide better language modeling than traditional discounting approaches. A small similarity sketch follows.
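A minimal sketch (with made-up vectors, purely illustrative) of why the continuous space helps: similar words end up with nearby vectors, so an estimate for a rare word can borrow from its frequent neighbours.

```python
import numpy as np

embeddings = {                                   # hypothetical 3-d word vectors
    "stocks": np.array([0.90, 0.10, 0.30]),
    "shares": np.array([0.85, 0.15, 0.35]),      # rarer word, similar direction
    "banana": np.array([0.10, 0.90, 0.20]),      # unrelated word
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["stocks"], embeddings["shares"]))  # close to 1: neighbours
print(cosine(embeddings["stocks"], embeddings["banana"]))  # much lower
```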

39 Neural network language modeling
How does a NN smooth? (Diagram: superior smoothing capability in the continuous space.)

40 NN-TDTO language model
Why is a NN suitable for TDTO? Use neural networks to implement the TDTO model: probabilities are estimated from continuous representations of the TD and the TO. The neural network projects the TD and the TO of a given context into the continuous space, where smoothing can be performed more effectively; the probabilities of rare TD and TO events can be estimated from the vectors in the neighboring sub-space.

41 NN-TDTO language model – the advantage
Why is a NN suitable for TDTO? In the count-based implementation of the TDTO model, a given context is simplified into word-pairs for feasibility: both the TD and the TO component models are simplified. Neural networks, e.g. deep neural networks, allow a large number of parameters to be learnt jointly under a single back-propagation algorithm.


43 NN-TDTO language model
How to implement NN-TDTO? The TDTO model estimates probabilities based on the TD and the TO events in the history; a neural network takes the term-distance and term-occurrence representations as its inputs.

44 NN-TDTO – the architecture
How to implement NN-TDTO?
Input layer – TD: a vector of vocabulary size indicating the positions of the words in the history; TO: a vector of vocabulary size indicating the occurrences of the words in the history
First hidden layer – projects the TD and the TO into the continuous space
Second hidden layer – estimates the scores of the target words
Output layer – normalizes the scores to probabilities

45 Architecture – input representation
How to implement NN-TDTO?
TD – a k-hot continuous vector indicating the distances of the words in the given context
TO – a k-hot binary vector indicating the occurrences of the words in the given context
A minimal architecture sketch follows.
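A minimal PyTorch sketch of the architecture described on the two preceding slides. This is my own illustration, not the thesis implementation: the layer sizes, the tanh non-linearities, and the way the TD vector encodes distance (here, a reciprocal-distance value at the word's vocabulary index) are all assumptions.

```python
import torch
import torch.nn as nn

class NNTDTO(nn.Module):
    def __init__(self, vocab_size, proj_dim=128, hidden_dim=256):
        super().__init__()
        # First hidden layer: project the TD and TO vectors (each of vocabulary
        # size) into the continuous space.
        self.project_td = nn.Linear(vocab_size, proj_dim)
        self.project_to = nn.Linear(vocab_size, proj_dim)
        # Second hidden layer: score the target words.
        self.hidden = nn.Linear(2 * proj_dim, hidden_dim)
        # Output layer: normalize the scores into (log-)probabilities.
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, td_vec, to_vec):
        # td_vec: k-hot continuous vector of word distances in the history
        # to_vec: k-hot binary vector of word occurrences in the history
        h = torch.cat([torch.tanh(self.project_td(td_vec)),
                       torch.tanh(self.project_to(to_vec))], dim=-1)
        h = torch.tanh(self.hidden(h))
        return torch.log_softmax(self.output(h), dim=-1)

# Toy forward pass: vocabulary of 1000 words, batch of one history.
model = NNTDTO(vocab_size=1000)
td = torch.zeros(1, 1000); td[0, 42] = 1.0 / 3   # e.g. word 42 seen 3 positions back
to = torch.zeros(1, 1000); to[0, 42] = 1.0       # word 42 occurred in the history
log_probs = model(td, to)
print(log_probs.shape)                           # torch.Size([1, 1000])
```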

46 Comparison to other neural network implementations of LMs

47 Experiments
Motivation: to evaluate whether the neural network implementation offers better smoothing for the TDTO model.
Result: the neural network implementation provides better smoothing.

48 Experiment
Motivation: to evaluate whether the neural-network-implemented TDTO model improves the ASR system.
Result: more accurate ASR.

49 Conclusion
Conclusion – Future works


54 Introduction – The data scarcity problem
The role of a language model (LM) is to estimate the probability of a given word sequence. For example, to estimate “market” in the sentence “a collapse in blue chip stocks could be a boon for the rest of the market”, using the entire history is infeasible due to the data scarcity problem:
P(“market” | “a collapse in blue chip stocks could be a boon for the rest of the”)
history-context → target-word

55 Language model
“a collapse in blue chip stocks could be a boon for the rest of the” (history-context) → “market” (target-word)

56 The n-gram model
Use only the immediate (n─1) words for prediction:
P(“a”) × P(“collapse”|“a”) × … × P(“stocks”|“blue chip”) × … × P(“market”|“of the”)
The trigram model uses only two words for word prediction; a minimal count-based sketch follows.
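A minimal sketch of a count-based (maximum-likelihood, unsmoothed) trigram estimate P(w | w₋₂, w₋₁), using the slide's example sentence as a toy corpus; this is my own illustration of the n-gram idea, not the thesis code.

```python
from collections import Counter

corpus = "a collapse in blue chip stocks could be a boon for the rest of the market".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams  = Counter(zip(corpus, corpus[1:]))

def trigram_prob(w2, w1, w):
    """Maximum-likelihood estimate count(w2 w1 w) / count(w2 w1)."""
    if bigrams[(w2, w1)] == 0:
        return 0.0
    return trigrams[(w2, w1, w)] / bigrams[(w2, w1)]

print(trigram_prob("of", "the", "market"))   # 1.0 in this one-sentence corpus
```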

57 Introduction – The n-gram model
Use only the immediate (n─1) words for prediction: useful information in the far context is neglected. In “a collapse in blue chip stocks could be a boon for the rest of the market”, words such as “collapse” and “stocks”, which are useful for predicting “market”, cannot be reached.

58 Introduction – Model combination
The n-gram model is combined with another model that exploits the long contexts:
the bag-of-words model
the distant-bigram model
the skip-gram model
the trigger model

59 Background
The role of a language model (LM) is to estimate the probability of a given word sequence. The chain rule, applied to “a collapse in blue chip stocks could be a boon for the rest of the market”:
P(“a”) × P(“collapse”|“a”) × … × P(“stocks”|“a collapse … chip”) × … × P(“market”|“a collapse … the”)
The factors require on the order of |V|^1, |V|^2, …, |V|^6, …, |V|^16 parameters – data scarcity! Model complexity increases exponentially as the history grows longer.

60 Formulation
Based on the Bayes rule, the inter-dependencies can be captured through the following decoupling operation:
The prior – the unigram probability
The term-distance (TD) – the n-gram likelihood, in which the word arrangement is conditioned on the BOW of the respective words
The term-occurrence (TO) – the BOW model
The inter-dependencies between the word arrangement and the word occurrence are captured here; a sketch of this decoupling in equation form follows.
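A hedged equation-form sketch of the decoupling, written from the description on this slide (prior × term-distance × term-occurrence); the notation is mine, and the exact normalization in the thesis may differ.

```latex
% w is the target word, w_1^{n-1} its history, and O(w_1^{n-1}) the bag-of-words
% (binary occurrence) event of that history.
\begin{align}
P(w \mid w_1^{n-1})
  &\propto P(w)\, P(w_1^{n-1} \mid w) \\
  &= P(w)\,
     \underbrace{P\!\left(w_1^{n-1} \mid O(w_1^{n-1}),\, w\right)}_{\text{term-distance (TD): arrangement given occurrence}}\;
     \underbrace{P\!\left(O(w_1^{n-1}) \mid w\right)}_{\text{term-occurrence (TO): bag-of-words}}
\end{align}
```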

