>65536. Arthur Chan, May 4, 2006.

1 >65536 Arthur Chan May 4, 2006

2 What's so special about 65536?
65536 = 2^16
Do you know?
– Sphinx III did not support language models with more than 65536 (2^16) words.
– CMU-Cambridge LM Toolkit V2 is not happy about text with 65536 unique words either, though a single word could have a count of more than 65536 (-four_byte_counts).

3 Why was 65536 the limit?
Both Sphinx III and CMU-Cambridge LM Toolkit V2 were written in 1995-99
– A time when having 64M of RAM was extravagant (now 64G seems to be the number). At that time, the Pentium 166 or 200 was hot.
– Programmers therefore designed clever structures to deal with memory issues
  – In Sphinx, the DMP format was invented (word ID: 16 bits)
  – In CMU-Cambridge LM Toolkit, 16-bit data types were used for the word ID.
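
To make the limit concrete, here is a minimal C sketch of what happens when a vocabulary index is stored in a 16-bit word ID. The typedef and function names are hypothetical and are not taken from Sphinx or the LM toolkit; the sketch only shows the wrap-around that an unsigned 16-bit type implies.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not the real Sphinx/toolkit types. */
typedef uint16_t wid16_t;   /* 16-bit word ID: holds 0..65535 */
typedef uint32_t wid32_t;   /* 32-bit word ID: holds 0..4294967295 */

/* Store a vocabulary index in a 16-bit ID.
 * Indices >= 65536 silently wrap modulo 2^16. */
wid16_t store_wid16(uint32_t index)
{
    return (wid16_t)index;
}

/* Store the same index in a 32-bit ID; nothing is lost below 2^32. */
wid32_t store_wid32(uint32_t index)
{
    return (wid32_t)index;
}
```

With these definitions, index 65536 stored through store_wid16 comes back as 0, so word number 65536 would be confused with word number 0, while store_wid32 keeps it intact.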

4 This Talk (30-35 pages)
Describes our effort on breaking the 16-bit limit in
– Sphinx 3
– CMU-Cambridge LM Toolkit V2
Half a talk
– Features not fully tested in real life
– But the talk itself is quite long.
Technical details of the changes
– Sphinx III: the easy part (9 pages)
– CMU-Cambridge LM Toolkit: the tough part (10 pages)
The Root of the Evil (11 pages)
– Why does this problem exist? Why does it persist?
– What if similar problems appear? How do we solve and avoid them?

5 Disclaimer about the Speaker
Notorious for being negative about language modeling techniques
– Symptom 1: Yells at others when his LM code has bugs.
– Symptom 2: Yells at others.
He should be forgiven because
– His Master's thesis supervisor taught him that when he was young
– Prof. Ronald Rosenfeld taught him the same
– He also read Dr. Joshua Goodman's papers

6 Terminology
“Probability” actually means
– The estimate of the probability
Back-off weight means
– When some n-gram is unseen in the training data, back off to the (n-1)-gram “probability” times a weight
According to Manning, four-gram should be tetragram and bigram should be digram
– Well, luckily it doesn't matter to us today
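
The back-off rule above can be sketched for the bigram case as follows. This is only an illustration of the definition on the slide; the function name and parameters are hypothetical and do not correspond to any routine in Sphinx or the LM toolkit.

```c
/* Back-off bigram estimate (illustrative sketch, hypothetical names).
 *   seen:     nonzero if the bigram (w1, w2) was observed in training
 *   p_bigram: discounted estimate for (w1, w2), used only when seen
 *   bow_w1:   back-off weight attached to the history w1
 *   p_uni_w2: unigram estimate for w2
 */
double backoff_bigram(int seen, double p_bigram, double bow_w1, double p_uni_w2)
{
    /* Seen n-gram: use its own estimate.
     * Unseen n-gram: (n-1)-gram estimate times the back-off weight. */
    return seen ? p_bigram : bow_w1 * p_uni_w2;
}
```

For an unseen bigram with back-off weight 0.5 and unigram estimate 0.25, the sketch returns 0.125.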

7 LM Component of Sphinx III The Easy Part

8 What Sphinx 3.6 RCI supports
ARPA LM
DMP LM
– A memory-efficient version of the ARPA LM
– Could be run in disk mode as well
Class-based LM
Multiple LMs and dynamic LM switching
lm_convert
– (New in 3.6!) Conversion tool between ARPA and DMP LMs

9 A note on the DMP format
A tree-like format
– Bigrams are indexed by their prefix unigram
– Trigrams are indexed by their prefix bigram.
Bigram and trigram probabilities and back-off weights
– Quantized to 4 decimal places. So you see the following statements in the code:

10 Funny C statements in the Code
/* HACK!! to quantize probs to 4 decimal digits */
p = p3*10000;
p3 = p*0.0001;
If you delete this, the LM will be larger because quantization is not done.
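
The two statements above can be wrapped in a small helper to see what they do. This sketch assumes that p is an integer and p3 a float, which the slide does not state; the function name is hypothetical.

```c
#include <stdint.h>

/* Quantize a (log) probability to 4 decimal digits by scaling, truncating
 * to an integer, and scaling back. Assumes the intermediate variable is an
 * integer, as the "HACK" on the slide suggests; hypothetical helper name. */
float quantize4(float logprob)
{
    int32_t scaled = (int32_t)(logprob * 10000);  /* truncates toward zero */
    return (float)(scaled * 0.0001);
}
```

Values that differ only beyond the fourth decimal digit collapse to the same number, so fewer distinct values remain to be stored, which is presumably why the LM grows when the hack is removed.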

11 Reasons why Sphinx III only supports fewer than 65536 words
16-bit data structures for
– Bigram
– Trigram
– Cache structure

12 Bogus Reason Why Sphinx III Doesn't Support More Than 65536 Words
A very bad misconception
– “The decoding is constrained by the dictionary”
– WRONG: in both flat and tree lexicon search, only LM words are traversed.
– RIGHT: generally, decoding is constrained by the intersection of the LM words and the dictionary words

13 Several Proposed Surgery Procedures
1. Rewrite the whole LM routine
– Oops! But it takes too much time
– The old routine is very memory efficient
2. Replace the old LM by just switching the type of the data structures
– Problem: all the binary LMs we generated have the old layout.
– We would lose backward compatibility very badly

14 Final Solution
lm now supports two data structures: 16 and 32 bits
lm_convert and decode will support two types of binary LM
– DMP, which has a 16-bit layout
– DMP32, which has a 32-bit layout
– A magic version number decides which layout to use
– Regression tests could guard against bad code check-ins
When to use which format is hidden from
– Anyone who calls the lm routines (with a few exceptions)
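
A rough sketch of how a version tag could select between the two layouts is shown below. The version constants, struct fields, and function names are all hypothetical; the talk does not give the real DMP magic numbers or record definitions, so this only illustrates the dispatch idea.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical version tags; the real DMP/DMP32 magic numbers are not
 * given in the talk. */
enum { LM_VERSION_DMP16 = 1, LM_VERSION_DMP32 = 2 };

/* Hypothetical bigram records: same fields, different integer widths. */
typedef struct { uint16_t wid; uint16_t probid; } bigram16_t;
typedef struct { uint32_t wid; uint32_t probid; } bigram32_t;

/* Returns 1 for the 32-bit layout, 0 for the 16-bit layout, -1 if unknown. */
int lm_is_32bit(int version)
{
    switch (version) {
    case LM_VERSION_DMP16: return 0;
    case LM_VERSION_DMP32: return 1;
    default:               return -1;
    }
}

/* Bytes per bigram record for the layout the version selects (0 if unknown).
 * Callers read records of this size and never see which layout is in use. */
size_t bigram_record_size(int version)
{
    switch (lm_is_32bit(version)) {
    case 0:  return sizeof(bigram16_t);
    case 1:  return sizeof(bigram32_t);
    default: return 0;
    }
}
```

Keeping the decision inside one or two functions like these is what lets old 16-bit binaries keep loading while new ones use the wider records.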

15 Partial Verification of the Code
The 16-bit and 32-bit code produce exactly the same decoding results for
– decode
– decode_anytopo
– (allphone's trigram could probably be left untested.)
A faked LM with more than 65536 words could be used and run in decode

16 Current Practical Limit
The lm data structure in lm.h
– Theoretically supports LMs with
  – Fewer than 4 billion unigrams
  – Fewer than 4 billion bigrams
  – Fewer than 4 billion trigrams
– What if the number of n-grams is larger than 4 billion?
  – Answer: we are dead people
  – Further answer: it is easily fixable
Other data structures from Sphinx 3?
– hash.c doesn't return prime numbers larger than 900001
– Further answer: it is easily fixable as well

17 Conclusion
Technically
– Sphinx III's 32-bit mode was not that difficult to take care of.
– The problem was also confined to one data structure
  – Thanks to the modular design of Sphinx III
  – Pretty easy to solve.
Sphinx III's decision to use a binary format
– If I were Ravi, I would do that as well
– Much faster loading time for large models.

18 CMU-Cambridge LM Toolkit V2 The Tough Part

19 CMU-Cambridge LM Toolkit Version 2
LM support of CMU-Cambridge LM Toolkit Version 2
– LM training
  – Parameter estimation with back-off weight computation
  – Supports both
    – LM in ARPA format
    – LM in BINLM format
      » BINLM is not the same as the DMP format.
      » binlm2arpa could translate BINLM to ARPA

20 Purpose of the toolkit
Training LMs for
– Speech Recognition
– Statistical Machine Translation
– Document Classification
– Handwriting Recognition
A note:
– Occasionally, speech recognition is really not everything

21 Standard Procedure of Training
In V2 time,
– David Huggins-Daines wasn't at CMU
– Training was separated into 4 stages
  – text2wfreq
    – Find the word frequency table
  – wfreq2vocab
    – Find the vocabulary we need (smaller than the frequency table)
  – text2idngram
    – Convert the text to a stream of n-grams and their counts (idngram)
    – The n-gram word IDs are alphabetically sorted
  – idngram2lm
    – Gather the counts, compute the discounted estimates and the back-off weights.

22 Reasons why V2 doesn't support 65536 words
There is one single file that typedefs many data structures
– But they are not used very often
Most variables are not typedef'd.
– Many of them are declared as unsigned short int

23 Another Issue...
What if we have more than 4 billion n-grams this time?
– e.g. if n > 5
Not forgivable in LM training because
– MT people are already having this problem (unigram size is 5 million).

24 Strategy
Spent 90% of the time making sure the data types were declared correctly
Gave up taking care of both 16-bit and 32-bit binary layouts together.
– A compile-time switch (THIRTYTWOBITS) is provided
– Reasonable because users seldom used BINLM anyway
  – Users need to use the DMP format in Sphinx III
The tool chain is now complete
– The number of n-grams is a 64-bit number
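
A compile-time switch of this kind might look like the sketch below. THIRTYTWOBITS is the macro named on the slide; the typedef names, the MAX_WORDID macro, and the helper function are hypothetical and are not the toolkit's actual declarations.

```c
#include <stdint.h>

/* THIRTYTWOBITS is the switch named in the talk (e.g. cc -DTHIRTYTWOBITS).
 * Everything else here is a hypothetical illustration. */
#ifdef THIRTYTWOBITS
typedef uint32_t wordid_t;
#define MAX_WORDID UINT32_MAX
#else
typedef uint16_t wordid_t;      /* the old "unsigned short int" width */
#define MAX_WORDID UINT16_MAX
#endif

/* N-gram counts are kept in 64 bits regardless of the word ID width. */
typedef uint64_t ngram_count_t;

/* Returns 1 if the given ID is representable in wordid_t, else 0. */
int wordid_fits(uint64_t id)
{
    return id <= (uint64_t)MAX_WORDID;
}
```

Routing every word ID declaration through one typedef is what makes the switch a one-line change, and it is the kind of discipline whose absence made the original conversion so time-consuming.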

25 What we support now
One could train an LM with more than 65536 words
– text2wfreq, wfreq2vocab, text2idngram, idngram2lm are fixed
One could convert an LM
– binlm2arpa, ngram2mgram are fixed
One could compute the perplexity of an LM and some statistics from the text
– evallm, idngram2stats are fixed

26 Other Evils of Detail
V2's hash table used a very bad hash function
– Many collisions
– Legacy from pre-90s
– One could take a 4-hour nap while loading the word list when training a 500k-word model.
– After switching to Dan J. Bernstein's hash function, the load time is acceptable (<1 min)
Binary layout was one of the most time-consuming parts of development
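
The slide does not say which of Bernstein's hash functions was adopted. As an assumption, the sketch below uses the widely circulated "djb2" string hash (seed 5381, multiply by 33, add the next character); the bucket helper and its table_size parameter are hypothetical.

```c
#include <stddef.h>

/* djb2 string hash attributed to Daniel J. Bernstein:
 * h = h * 33 + c, seeded with 5381.
 * Assumption: the talk does not confirm this exact variant. */
unsigned long djb2_hash(const char *s)
{
    unsigned long h = 5381;
    unsigned char c;

    while ((c = (unsigned char)*s++) != '\0')
        h = ((h << 5) + h) + c;   /* (h << 5) + h == h * 33 */
    return h;
}

/* Map a word to a bucket index; table_size is a hypothetical parameter
 * and must be nonzero. */
size_t bucket_for(const char *word, size_t table_size)
{
    return (size_t)(djb2_hash(word) % table_size);
}
```

A hash that spreads words evenly keeps each bucket short, so loading a 500k-word list costs roughly one probe per word instead of a long scan through colliding entries.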

27 Verification
The 16-bit and 32-bit code provide exactly the same results
The 32-bit code could train an LM from a faked corpus with 10M unique words.
Note: we are talking about unique words.
Both Dave's Perl tests and Arthur's tests all passed.
– So, things like LM interpolation are actually working too.

28 Current Limitation
Theoretically supports
– 1.84 x 10^19 (2^64) n-grams.
The 4-step procedure uses too much space
– 100M-word training requires 10G of hard disk and 1-2G of RAM
– 1G-word training requires more than 100G of hard disk and 20G of RAM?

29 So, we still have issues when...
In ascending order of difficulty
– What if MT people ask us to run their LM in our recognizer (1M limit)?
– What if we need to run decoding for 10 languages, each with 100k words?
– What if we need to train on an N-word corpus (N = 1 billion) and there are N*N*N trigrams?
– What if Prof. Jim Baker were back?
– What if there were aliens?

30 Deliver Us From Evil
Why wasn't this feature implemented in 2001?

31 An Important Observation
There is an implicit development deadlock between
– Sphinx III
– CMU-Cambridge LM Toolkit
– SphinxTrain

32 General Pattern (part I)
The decoder's developers think
– “Feature X is not implemented in the trainer”
– “That is to say, there will be no use if we implement feature X”.
=> Give up feature X

33 General Pattern (part II)
The trainer's developers think
– “Feature X is not implemented in the decoder”
– “That is to say, there will be no use if we implement feature X”.
=> Give up feature X

34 Why was Feature X not implemented in the first place?
Possible Reason 1:
– In the past, someone analyzed some results and concluded that Feature X is not useful
Possible Reason 2:
– Because of theoretical reasons Y and Z, someone concluded that Feature X is not useful
Possible Reason 3:
– Past hardware limitations

35 In Reality...
Feature X could turn out to be very useful,
– E.g.
  – More than 65k words in the LM
  – N-grams with N > 3
  – Interpolation (instead of back-off) in N-grams

36 Another Important Observation
Constantly giving up new features
– Eventually means giving up the whole software development
– Look at CMU-Cambridge LM Toolkit V2

37 How should we deal with this problem?
1. Know that this is a problem (from anonymous self-help books).
2. We need a joint understanding of both the decoder and the trainer(s)
– Question to ask: is it really correct to always develop the decoder first?
3. New features of the trainer could always be tested in cheap ways
– N-best and lattice rescoring
– Then the deadlock will be broken on one side

38 A Unified View of Our Software
The Suite: SphinxTrain, CMU-Cambridge LM Toolkit, Sphinx Brothers {2,3,4} (depends on where you live)

39 Issue 1
Q: “Do we have the right to change the LM Toolkit?”
A: “Yes. According to the license, if we open the source for research purposes, we can change and distribute the code. Our changes are endorsed by Prof. Rosenfeld (CMU), Dr. Clarkson (Cambridge) and Prof. Robinson (Cambridge).”

40 Issue 2
Q: “Do we have anything new in LM?”
A: “That depends on the brilliance of our students and staff.
– Also, more generally, the brilliance of the public
  – They have the right to contribute
Actually, in the past 10 years,
– A lot of new things were done at CMU in LM
– Just no one collected them and put them together.”

41 Issue 3
Q: “Are you just getting yourself into a lot of trouble?”
A: “The troubles are always there; we just never face them.”

42 Digression: Project L
News: Some folks are working on the LM toolkit now!
– Project code: L
– Three key supporters
  – A Young Professor (or Prof. AB)
    – Hint: he is not exactly young
  – A Young Student (or DH)
  – A Young Staff (or AC)
– Gathering code from around the world
  – Thanks to Professor Yannick from LIUM in advance
  – Thanks to contributor AT

43 Conclusion
32-bit data structures are now supported in both Sphinx III and CMU-Cambridge LM Toolkit.
This brings up a lot of development issues
– Maybe we should take the LM toolkit more seriously
  – Maintenance (a must)
  – New feature development (if we have time)

44 Preview of the next 2 talks
Project L
– Story of the Three Young Developers
Development progress of Sphinx 3.X (from X=3 to X=6)
– What is the big picture of Sphinx?
