
1 Modeling Long Distance Dependence in Language: Topic Mixtures Versus Dynamic Cache Models
Rukmini M. Iyer, Mari Ostendorf

2 Introduction
Statistical language models
– part of state-of-the-art speech recognizers
Word sequence: w_1, w_2, …, w_T
Consider the word sequence to be a Markov process.
N-gram probability:
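
The slide's equation image is not in the transcript; a minimal sketch of the n-gram factorization it refers to, under an order-n Markov assumption:

    P(w_1, \dots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})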

3 Introduction
Use bigram representations
– bigram probability P(w_i | w_{i-1})
– trigrams in the experiments
Although n-gram models are powerful, they are constrained to a window of n words.
– What about dependencies longer than n?
The problem of representing long-distance dependencies has been explored before.

4 Introduction
Sentence-level dependencies:
– words can co-occur in a sentence due to topic or grammatical dependencies
Article- or conversation-level dependencies:
– the subject of an article or a conversation
Dynamic cache language models address the second issue:
– increase the likelihood of a word given that it has been observed previously in the article
Other methods:
– trigger language models
– context-free grammars

5 Mixture Model Framework
N-gram level:
Sentence level:
P_k: bigram model for the k-th class
λ_k: mixture weight
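
The two mixture equations on this slide are missing from the transcript. A reconstruction for bigram components, consistent with the definitions of P_k and λ_k above (a sketch, not copied from the slide):

    N-gram level:    P(w_i \mid w_{i-1}) = \sum_{k=1}^{m} \lambda_k P_k(w_i \mid w_{i-1})

    Sentence level:  P(w_1, \dots, w_T) = \sum_{k=1}^{m} \lambda_k \prod_{i=1}^{T} P_k(w_i \mid w_{i-1})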

6 Mixture Model Framework
Two main issues:
– automatic clustering, to handle data not explicitly marked for topic dependence
– robust parameter estimation

7 Clustering
Topics can be specified by hand or determined automatically.
Agglomerative clustering
– partitions the data into the desired number of topics
Performed at the article level
– to reduce computation
The initial clusters can be singleton units of data.

8 Clustering
1. Let the desired number of clusters be C* and the initial number of clusters be C.
2. Find the most similar pair of clusters, say A_i and A_j, maximizing some similarity criterion S_ij.
3. Merge A_i and A_j and decrement C.
4. If C = C*, stop; otherwise go to Step 2.
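
A minimal Python sketch of this loop, assuming each article is represented by its set of unique words and that a similarity function (such as the idf-based measure on the next slide) is supplied; an illustration only, not the authors' implementation:

from itertools import combinations

def agglomerative_cluster(articles, target_clusters, similarity):
    # Start from singleton clusters, one per article (each a set of unique words).
    clusters = [set(words) for words in articles]
    while len(clusters) > target_clusters:
        # Step 2: find the most similar pair of clusters.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda pair: similarity(clusters[pair[0]], clusters[pair[1]]))
        # Step 3: merge A_j into A_i and drop A_j (this decrements the cluster count).
        clusters[i] |= clusters[j]
        del clusters[j]
    # Step 4: the loop ends once the desired number of clusters C* is reached.
    return clusters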

9 Clustering
Similarity measure:
– set-intersection measure
– combination of inverse document frequencies
|A_i|: the number of unique words in cluster i
|A_w|: the number of clusters containing the word w
N_ij: the normalization factor
Advantage of the idf measure:
– high-frequency words, such as function words, are automatically discounted
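
The similarity equation itself is not in the transcript. One form consistent with the definitions above is an idf-weighted set intersection; this is an assumption about the exact formula, not a quotation from the slide:

    S_{ij} = \frac{1}{N_{ij}} \sum_{w \in A_i \cap A_j} \frac{1}{|A_w|}

where the normalization N_{ij} depends on the cluster sizes |A_i| and |A_j|. Because each shared word contributes 1/|A_w|, words that occur in many clusters (typically function words) contribute little to the similarity.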

10 Robust Parameter Estimation
Once the topic-dependent clusters are obtained, the parameters of the m mixture components can be estimated.
Two issues:
– The initial clusters may not be optimal: reestimate iteratively, with EM or Viterbi-style training.
– Sparse data problems: use double mixtures, at the sentence level and at the n-gram level.

11 Iterative Topic Model Reestimation
Each component model in the sentence-level mixture is a conventional n-gram model.
Some back-off smoothing technique is used
– the Witten-Bell technique
Because the clustering method is simple, articles may be clustered incorrectly and the parameters may not be robust.
– Iteratively reestimate the models.

12 Iterative Topic Model Reestimation
Viterbi-style training technique:
– analogous to vector quantizer design with the LBG algorithm
– maximum likelihood plays the role of the minimum-distance criterion
– http://www.data-compression.com/vq.shtml#lbg
Stopping criterion: the sizes of the m clusters become nearly constant.
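
A minimal Python sketch of this Viterbi-style (hard-assignment) reestimation, assuming hypothetical helpers train_ngram(texts), which builds a smoothed n-gram model, and model.loglik(text), which scores a text; an illustration of the idea, not the paper's implementation (empty clusters are not handled here):

def viterbi_reestimate(texts, models, train_ngram, max_iters=20):
    prev_assignment = None
    for _ in range(max_iters):
        # Assignment step: give each text to the topic model that scores it highest
        # (maximum likelihood plays the role of minimum distance in LBG design).
        assignment = [max(range(len(models)), key=lambda k: models[k].loglik(t))
                      for t in texts]
        # Stop once cluster memberships are (nearly) constant.
        if assignment == prev_assignment:
            break
        prev_assignment = assignment
        # Reestimation step: retrain each topic model on its assigned texts.
        models = [train_ngram([t for t, k in zip(texts, assignment) if k == j])
                  for j in range(len(models))]
    return models, assignment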

13 Iterative Topic Model Reestimation
EM algorithm:
– incomplete data: we do not know exactly which topic cluster a given sentence belongs to
– theoretically better than the Viterbi-style algorithm, although there is little difference in practice
– potentially more robust
– complicated for language model training with the Witten-Bell back-off scheme

14 Iterative Topic Model Reestimation
E-step: compute the expected likelihood that each sentence in the training corpus belongs to each of the m topics.
For a bigram model in the p-th iteration, the likelihood that the i-th training sentence belongs to the j-th topic is given by:
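
The equation image is missing; the standard mixture-model E-step posterior consistent with this description (a reconstruction) for sentence y_i = w_1, …, w_T is:

    z_{ij}^{(p)} = \frac{\lambda_j^{(p)} \prod_{t=1}^{T} P_j^{(p)}(w_t \mid w_{t-1})}{\sum_{k=1}^{m} \lambda_k^{(p)} \prod_{t=1}^{T} P_k^{(p)}(w_t \mid w_{t-1})}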

15 Iterative Topic Model Reestimation
Derivation:

16 Iterative Topic Model Reestimation
M-step: the bigram (w_b, w_c) probability for the j-th topic model:
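
The M-step equation is not in the transcript. A reconstruction under the usual fractional-count interpretation, where c_i(w_b, w_c) is the count of the bigram in sentence i and z_{ij} is the E-step posterior above:

    c_j(w_b, w_c) = \sum_{i} z_{ij} \, c_i(w_b, w_c)

The topic-j bigram probability for the next iteration is then obtained by applying Witten-Bell back-off smoothing to these fractional counts, as detailed on the following slides.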

17 Iterative Topic Model Reestimation
Unigram back-off:

18 Iterative Topic Model Reestimation
Meaning: Witten-Bell
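
The Witten-Bell equations themselves are not in the transcript. The standard interpolated Witten-Bell estimate for a bigram, which the slides appear to be using, is (with c(·) the counts and T(w_b) the number of distinct words observed after w_b):

    P_{WB}(w_c \mid w_b) = \frac{c(w_b, w_c)}{c(w_b) + T(w_b)} + \frac{T(w_b)}{c(w_b) + T(w_b)} \, P(w_c)

where P(w_c) is the unigram back-off distribution, smoothed in the same way. In the EM case, the counts c(·) are the fractional counts from the M-step.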

19 Iterative Topic Model Reestimation
MLE for the bigram (w_b, w_c):
MLE for the unigram w_c:
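
The maximum-likelihood estimates referred to here are the usual relative frequencies (reconstructed, since the equation images are missing):

    P_{ML}(w_c \mid w_b) = \frac{c(w_b, w_c)}{c(w_b)}, \qquad P_{ML}(w_c) = \frac{c(w_c)}{\sum_{w} c(w)}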

20 Topic Model Smoothing
Smoothing for sparse data:
– interpolate with a general (topic-independent) model at the n-gram level
Handling nontopic sentences:
– include a general model in addition to the m component models at the sentence level

21 Topic Model Smoothing
The weights are estimated separately using held-out data.
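
The smoothed double-mixture equation on this slide is not in the transcript. A plausible form combining a general model P_g with the m topic models at both levels, following the two bullets on the previous slide (an assumption, not a quotation):

    P(w_1, \dots, w_T) = \sum_{k=0}^{m} \lambda_k \prod_{i=1}^{T} \left[ \theta_k P_k(w_i \mid w_{i-1}) + (1 - \theta_k) P_g(w_i \mid w_{i-1}) \right]

where k = 0 denotes the general model itself and the θ_k are the n-gram-level interpolation weights.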

22 Dynamic Adaptation
The parameters (λ_k, P(w_i | w_{i-1})) of the sentence-level mixture model are not updated as new sentences are observed.
Dynamic language modeling tries to capture short-term fluctuations in word frequencies:
– cache or trigger models (word frequency)
– dynamic mixture weight adaptation (topic)

23 Dynamic Cache Adaptation
The cache maintains counts of recently observed n-grams.
The static language model probabilities are interpolated with the cache n-gram probabilities.
Topic-related fluctuations can be captured with a dynamic cache model if the sublanguage reflects style.
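
A sketch of the standard static/cache interpolation the slide describes (the exact equation is not in the transcript):

    P_{adapted}(w_i \mid w_{i-1}) = \mu \, P_{static}(w_i \mid w_{i-1}) + (1 - \mu) \, P_{cache}(w_i \mid w_{i-1})

where P_{cache} is estimated from the n-gram counts accumulated in the cache.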

24 Dynamic Cache Adaptation
Two important issues in dynamic language modeling with a cache:
– the definition of the cache
– the mechanism for choosing the interpolation weight
The probabilities of content words vary with time within the global topic.
Caching content words gives small but consistent improvements over the frequency-based rare-word cache.

25 Dynamic Cache Adaptation
The equation for the adapted mixture model incorporating component caches:
μ: estimated empirically on a development test set to minimize word error rate (WER)
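
The equation itself is missing from the transcript. A plausible reconstruction, giving each topic component its own cache and interpolating with weight μ (an assumption consistent with the slide's description):

    P(w_1, \dots, w_T) = \sum_{k=1}^{m} \lambda_k \prod_{i=1}^{T} \left[ \mu \, P_k(w_i \mid w_{i-1}) + (1 - \mu) \, P_{cache,k}(w_i \mid w_{i-1}) \right]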

26 Dynamic Cache Adaptation
Two approaches for updating the word counts:
– The partially observed document is cached, and the cache is flushed between documents.
– A sliding window determines the contents of the cache, if the document or article is reasonably long.
The first approach is selected here
– the articles are not long.
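
A minimal Python sketch of the first approach: accumulate n-gram counts from the partially observed document and flush at document boundaries. Illustration only; the paper's exact cache definition may differ:

from collections import Counter

class DocumentCache:
    """Bigram/unigram cache that is flushed between documents."""

    def __init__(self):
        self.bigrams = Counter()
        self.unigrams = Counter()

    def add_sentence(self, words):
        # Update counts with each newly observed (recognized) sentence.
        self.unigrams.update(words)
        self.bigrams.update(zip(words[:-1], words[1:]))

    def prob(self, w, prev):
        # Relative-frequency cache bigram, backing off to the cache unigram
        # when the history has not been seen in this document.
        if self.unigrams[prev] > 0:
            return self.bigrams[(prev, w)] / self.unigrams[prev]
        total = sum(self.unigrams.values())
        return self.unigrams[w] / total if total else 0.0

    def flush(self):
        # Called at a document boundary.
        self.bigrams.clear()
        self.unigrams.clear()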

27 Dynamic Cache Adaptation
Extend cache-based n-gram adaptation to the sentence-level mixture model.
Fractional counts of words are assigned to each topic cache according to the topics' relative likelihoods.
The likelihood of the k-th model given the i-th sentence y_i with words w_1, …, w_T:
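
The posterior is missing from the transcript; the form implied by the description above (a reconstruction):

    z_{ik} = \frac{\lambda_k \prod_{t=1}^{T} P_k(w_t \mid w_{t-1})}{\sum_{j=1}^{m} \lambda_j \prod_{t=1}^{T} P_j(w_t \mid w_{t-1})}

Each word count from sentence y_i is then added to the k-th topic cache with fractional weight z_{ik}.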

28 Dynamic Cache Adaptation
Derivation:

29 Dynamic Cache Adaptation
Meaning (diagram): sentence i distributes its word (feature) counts across the k topic caches with weights z_i1, z_i2, …, z_ik.

30 Mixture Weight Adaptation
The sentence-level mixture weights are updated with each newly observed sentence, to reflect the increased information on which to base the choice of weights.
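
The update rule is not shown in the transcript. One natural choice consistent with the posteriors defined above (an assumption, not the slide's equation) is to renormalize the accumulated topic posteriors of the s sentences observed so far in the current document:

    \lambda_k^{(new)} = \frac{\sum_{i=1}^{s} z_{ik}}{\sum_{j=1}^{m} \sum_{i=1}^{s} z_{ij}} = \frac{1}{s} \sum_{i=1}^{s} z_{ik}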

31 Experiments
Corpora:
– NAB news, not marked for topics
– Switchboard, 70 topics
Recognition systems:
– Boston University system: non-parametric trajectory stochastic segment model
– BBN Byblos system: speaker-independent HMM system

32–34 Experiments (result tables and figures not captured in the transcript)

35 Conclusions
Investigated a new approach to language modeling using a simple variation of the n-gram approach: the sentence-level mixture.
An automatic clustering algorithm classifies text into one of m different topics.
Other language modeling advances can be easily incorporated into this framework.

