K.Marasek 05.07.2005 Multimedia Department HTK Tutorial Prepared using HTKBook.

1 K.Marasek Multimedia Department HTK Tutorial Prepared using HTKBook

2 K.Marasek Multimedia Department Software architecture
- toolkit for Hidden Markov Modeling
- optimized for Speech Recognition
- very flexible and complete
- very good documentation (HTK Book)
- Data Preparation Tools
- Training Tools
- Recognition Tools
- Analysis Tool

3 K.Marasek Multimedia Department General concepts
- Set of programs with a command-line style interface.
- Each tool has a number of required arguments plus optional arguments. The latter are always prefixed by a minus sign:
    HFoo -T 1 -f a -s myfile file1 file2
- Options whose names are a capital letter have the same meaning across all tools. For example, the -T option is always used to control the trace output of an HTK tool.
- In addition to command-line arguments, the operation of a tool can be controlled by parameters stored in a configuration file. For example, if the command
    HFoo -C config -f a -s myfile file1 file2
  is executed, the tool HFoo will load the parameters stored in the configuration file config during its initialisation procedures.
- HTK data formats:
  - audio: many common formats plus HTK binary
  - features: HTK binary
  - labels: HTK text (single or Master Label Files)
  - models: HTK text or binary (single or Master Macro Files)
  - other: HTK text

4 K.Marasek Multimedia Department Data preparation tools
- data manipulation tools:
  - HCopy – parametrize signals
  - HQuant – vector quantization
  - HLEd – label editor
  - HHEd – model editor (master model file)
  - HDMan – dictionary editor
  - HBuild – language model conversion
  - HParse – lattice file preparation (grammar conversion)
- data visualization tools:
  - HSLab – speech label manipulation
  - HList – data display and manipulation
  - HSGen – generate sentences from a regular grammar

5 K.Marasek Multimedia Department Training tools

The actual training process takes place in stages and is illustrated in more detail in a figure in the HTKBook. Firstly, an initial set of models must be created. If there is some speech data available for which the locations of the sub-word (i.e. phone) boundaries have been marked, then this can be used as bootstrap data. In this case, the tools HInit and HRest provide isolated-word style training using the fully labelled bootstrap data. Each of the required HMMs is generated individually.

HInit reads in all of the bootstrap training data and cuts out all of the examples of the required phone. It then iteratively computes an initial set of parameter values using a segmental k-means procedure. On the first cycle, the training data is uniformly segmented, each model state is matched with the corresponding data segments, and then means and variances are estimated. If mixture Gaussian models are being trained, a modified form of k-means clustering is used. On the second and successive cycles, the uniform segmentation is replaced by Viterbi alignment.

The initial parameter values computed by HInit are then further re-estimated by HRest. Again, the fully labelled bootstrap data is used, but this time the segmental k-means procedure is replaced by the Baum-Welch re-estimation procedure described in the previous chapter.

When no bootstrap data is available, a so-called flat start can be used. In this case all of the phone models are initialised to be identical and have state means and variances equal to the global speech mean and variance. The tool HCompV can be used for this.

Once an initial set of models has been created, the tool HERest is used to perform embedded training using the entire training set. HERest performs a single Baum-Welch re-estimation of the whole set of HMM phone models simultaneously. For each training utterance, the corresponding phone models are concatenated and then the forward-backward algorithm is used to accumulate the statistics of state occupation, means, variances, etc., for each HMM in the sequence. When all of the training data has been processed, the accumulated statistics are used to compute re-estimates of the HMM parameters. HERest is the core HTK training tool. It is designed to process large databases, it has facilities for pruning to reduce computation, and it can be run in parallel across a network of machines.
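
The flat start performed by HCompV reduces to simple arithmetic: compute the global mean and variance of all training frames and copy them into every state of every model. A minimal Python sketch (illustrative only, not HTK code; single-Gaussian, diagonal-covariance states assumed):

```python
def flat_start(frames, n_states=3):
    """HCompV-style flat start: give every emitting state the global
    mean and (diagonal) variance of the pooled training frames."""
    dim, n = len(frames[0]), len(frames)
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    # every state gets an identical copy of the global statistics
    return [{"mean": list(mean), "var": list(var)} for _ in range(n_states)]

# two toy 2-dimensional frames
states = flat_start([[1.0, 2.0], [3.0, 6.0]])
```

Because every model initialised this way is identical, the first embedded re-estimation pass effectively segments each utterance uniformly.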

6 K.Marasek Multimedia Department Recognition and analysis tools
- HVite – performs Viterbi-based speech recognition. HVite takes as input a network describing the allowable word sequences, a dictionary defining how each word is pronounced, and a set of HMMs. It operates by converting the word network to a phone network and then attaching the appropriate HMM definition to each phone instance. Recognition can then be performed on either a list of stored speech files or on direct audio input. As noted at the end of the last chapter, HVite can support cross-word triphones and it can run with multiple tokens to generate lattices containing multiple hypotheses. It can also be configured to rescore lattices and perform forced alignments.
- HResults – compares recogniser output with reference transcriptions. It uses dynamic programming to align the two transcriptions and then counts substitution, deletion and insertion errors. Options are provided to ensure that the algorithms and output formats used by HResults are compatible with those used by the US National Institute of Standards and Technology (NIST). As well as global performance measures, HResults can also provide speaker-by-speaker breakdowns, confusion matrices and time-aligned transcriptions. For word-spotting applications, it can also compute Figure of Merit (FOM) scores and Receiver Operating Curve (ROC) information.
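
The dynamic-programming alignment behind this kind of scoring is ordinary Levenshtein alignment over words. A toy Python version (unit costs assumed for illustration; HResults' NIST-compatible scoring uses tuned penalties):

```python
def align_errors(ref, hyp):
    """Align a reference word list against a hypothesis word list and
    return (substitutions, deletions, insertions) of the best alignment.
    cost[i][j] = (total_errors, S, D, I) for ref[:i] vs hyp[:j]."""
    R, H = len(ref), len(hyp)
    cost = [[None] * (H + 1) for _ in range(R + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        cost[i][0] = (i, 0, i, 0)      # only deletions possible
    for j in range(1, H + 1):
        cost[0][j] = (j, 0, 0, j)      # only insertions possible
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            s = 0 if ref[i - 1] == hyp[j - 1] else 1
            diag, up, left = cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            cost[i][j] = min(
                (diag[0] + s, diag[1] + s, diag[2], diag[3]),  # match/sub
                (up[0] + 1, up[1], up[2] + 1, up[3]),          # deletion
                (left[0] + 1, left[1], left[2], left[3] + 1),  # insertion
            )
    _, S, D, I = cost[R][H]
    return S, D, I
```

For example, aligning "where is baker street" against "where is bakery street" yields one substitution and no deletions or insertions.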

7 K.Marasek Multimedia Department How to use HTK in 10 easy steps
- Step 1. Set the task
  - Prepare the grammar in BNF format:
      [.]    optional
      {.}    zero or more repetitions
      (.)    block (grouping)
      <.>    loop
      <<.>>  context-dependent loop
      .|.    alternative
  - Compile the grammar to lattice format:
      D:\htk-3.1\bin.win32\HParse location-grammar

    $location = where is | how to find | how to come to;
    $ex = sorry | excuse me | pardon;
    $intro = can you tell me | do you know;
    $address = acton town | admirality arch | baker street | bond street |
      big ben | blackhorse road | buckingham palace | cambridge | canterbury |
      charing cross road | covent garden | downing street | ealing |
      edgware road | finchley road | gloucester road | greenwich |
      heathrow airport | high street | house of parliament | hyde park |
      kensington | king's cross | leicester square | marble arch | old street |
      paddington station | piccadilly circus | portobello market |
      regent's park | thames river | tower bridge | trafalgar square |
      victoria station | westminster abbey | whitehall | wimbledon | windsor;
    $end = please;
    (!ENTER {_SIL_} ({$ex} {into} {$location} $address {$end}) {_SIL_} !EXIT)
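
The grammar above can be read as rewrite rules: each $variable expands to one of its alternatives. A toy Python expander over an abbreviated stand-in for the grammar (variable names from the slide; an illustration of the idea, not HTK code):

```python
import random

# abbreviated stand-in for the location-grammar above
grammar = {
    "$location": [["where", "is"], ["how", "to", "find"], ["how", "to", "come", "to"]],
    "$address": [["baker", "street"], ["hyde", "park"], ["victoria", "station"]],
    "$end": [["please"]],
    "$S": [["$location", "$address", "$end"]],
}

def generate(symbol, choose):
    """Expand a symbol into a word list; `choose` picks one alternative."""
    if not symbol.startswith("$"):
        return [symbol]            # terminal word
    words = []
    for sym in choose(grammar[symbol]):
        words.extend(generate(sym, choose))
    return words

sentence = " ".join(generate("$S", random.Random(0).choice))
```

HSGen (used in Step 3) performs essentially this walk over the compiled lattice to produce prompt sentences.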

8 K.Marasek Multimedia Department How to use HTK in 10 easy steps
- Step 2 – Prepare the pronunciation dictionary
  - Find the list of words used in the task – lg.wlist
  - Prepare the dictionary by hand, automatically, or using a standard pronunciation dictionary (e.g. Beep for British English)
  - Or use the whole Beep dictionary

    where      [where]      1.0 w
    where      [where]      1.0 w r
    is         [is]         1.0 I z
    how        [how]        1.0 h aU
    admirality [admirality] 1.0 { d l i: t i:
    palace     [palace]     1.0 p { l I s
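
Each dictionary line follows the pattern WORD [output-symbol] probability phone1 phone2 ... A tiny parser for that shape (a sketch only; real HTK dictionaries allow more variations, which HDMan handles):

```python
def parse_dict_line(line):
    """Split one HTK-style dictionary entry of the form
    WORD [outsym] prob phone1 phone2 ..."""
    parts = line.split()
    word = parts[0]
    outsym = parts[1].strip("[]")   # output symbol in brackets
    prob = float(parts[2])          # pronunciation probability
    return word, outsym, prob, parts[3:]

entry = parse_dict_line("is [is] 1.0 I z")
```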

9 K.Marasek Multimedia Department How to use HTK in 10 easy steps
- Step 3 – Record the training and test data
  - HTK has a tool for prompt recording, HSLab, but it works under Linux only
  - Usually other programs are used for that
  - First generate prompts, then record them:

    D:\htk-3.1\bin.win32\HSGen -l -n 200 beep.dic > lg
    1. how to come to baker street _SIL_ !EXIT
    2. ealing please _SIL_ !EXIT
    3. heathrow airport !EXIT
    4. leicester square _SIL_ !EXIT
    5. king's cross please _SIL_ !EXIT
    6. hyde park _SIL_ !EXIT
    7. _SIL_ greenwich please _SIL_ _SIL_ _SIL_ _SIL_ _SIL_ !EXIT
    8. old street !EXIT
    9. high street _SIL_ _SIL_ _SIL_ _SIL_ !EXIT
    10. whitehall !EXIT
    11. old street !EXIT
    12. canterbury please !EXIT
    13. into edgware road !EXIT
    14. whitehall _SIL_ !EXIT
    15. whitehall _SIL_ !EXIT
    16. finchley road please please please _SIL_ !EXIT

  - Record the prompts and store them in a chosen format: 16 kHz, 16-bit, headerless (?)

10 K.Marasek Multimedia Department How to use HTK in 10 easy steps
- Step 4 – Create the transcription files
  - In HTK all transcription files can be merged into one Master Label File (MLF)
  - Usually it is enough to have word-level transcripts
  - If a phone level is necessary, it can be generated automatically using HLEd

    #!MLF!#
    "*/S0001.lab"
    how
    to
    come
    to
    baker
    street
    .
    "*/S0002.lab"
    ealing
    please
    .
    (etc...)

11 K.Marasek Multimedia Department How to use HTK in 10 easy steps
- Step 5 – Parametrize the data
  - Use HCopy: compute MFCC and delta parameters
  - Use a config file to set all the options (hcopy.conf):

    HCopy -T 1 -C hcopy.conf -S file.list

    ### hcopy.conf
    ### input file specific section
    SOURCEFORMAT = NOHEAD
    HEADERSIZE = 0
    # 16 kHz corresponds to a 62.5 us sample period (HTK units of 100 ns)
    SOURCERATE = 625
    ###
    ### analysis section
    ###
    # no DC offset correction
    ZMEANSOURCE = FALSE
    # no random noise added
    ADDDITHER = 0.0
    # preemphasis
    PREEMCOEF = 0.97
    # windowing
    TARGETRATE =
    WINDOWSIZE =
    USEHAMMING = TRUE
    # fbank analysis
    NUMCHANS = 24
    LOFREQ = 80
    HIFREQ = 7500
    # don't take the sqrt:
    USEPOWER = TRUE
    # cepstrum calculation
    NUMCEPS = 12
    CEPLIFTER = 22
    # energy
    ENORMALISE = FALSE
    ESCALE = 1.0
    RAWENERGY = FALSE
    # delta and delta-delta
    DELTAWINDOW = 2
    ACCWINDOW = 2
    SIMPLEDIFFS = FALSE
    ###
    ### output file specific section
    ###
    TARGETKIND = MFCC_D_A_0
    TARGETFORMAT = HTK
    SAVECOMPRESSED = TRUE
    SAVEWITHCRC = TRUE
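
HTK expresses all times in units of 100 ns, so SOURCERATE=625 means a 62.5 us sample period (16 kHz). The WINDOWSIZE and TARGETRATE values did not survive in the transcript above; assuming the common 25 ms window and 10 ms shift, the frame count works out as follows (a sketch, not HTK code):

```python
def n_frames(n_samples, source_rate=625, window_size=250_000, target_rate=100_000):
    """Frames produced by HCopy-style analysis. All rates are in HTK's
    100 ns units; the 25 ms window / 10 ms shift defaults are assumptions,
    since the slide's config omits the actual WINDOWSIZE/TARGETRATE values."""
    win = window_size // source_rate     # 400 samples per window at 16 kHz
    shift = target_rate // source_rate   # 160 samples per shift at 16 kHz
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // shift

frames_per_second = n_frames(16_000)  # one second of 16 kHz audio
```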

12 K.Marasek Multimedia Department How to use HTK in 10 easy steps
- Step 6 – Create monophone HMMs
- Define a prototype model (~o options macro, ~h "p" HMM macro) and clone it for all phones
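
The body of the prototype was garbled on the slide. For reference, a minimal single-Gaussian prototype in HTK's model-definition syntax, with 5 states (3 emitting) and 39-dimensional MFCC_D_A_0 vectors to match the hcopy.conf above, looks roughly like this (a sketch; all mean, variance and transition values are placeholders that training overwrites):

```
~o <VecSize> 39 <MFCC_D_A_0>
~h "p"
<BeginHMM>
  <NumStates> 5
  <State> 2
    <Mean> 39
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    <Variance> 39
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  <State> 3
    <Mean> 39
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    <Variance> 39
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  <State> 4
    <Mean> 39
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    <Variance> 39
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  <TransP> 5
    0.0 1.0 0.0 0.0 0.0
    0.0 0.6 0.4 0.0 0.0
    0.0 0.0 0.6 0.4 0.0
    0.0 0.0 0.0 0.6 0.4
    0.0 0.0 0.0 0.0 0.0
<EndHMM>
```

The left-to-right transition matrix with self-loops is the conventional topology for phone models; the first and last states are non-emitting entry/exit states.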

13 K.Marasek Multimedia Department How to use HTK in 10 easy steps
- Step 7 – Initialize models
  - Use HInit: HInit -S trainlist -H globals -M dir1 proto
  - Firstly, the Viterbi algorithm is used to find the most likely state sequence corresponding to each training example; then the HMM parameters are estimated. As a side-effect of finding the Viterbi state alignment, the log likelihood of the training data can be computed. Hence, the whole estimation process can be repeated until no further increase in likelihood is obtained.
  - If there is no initial data, use HCompV for flat-start initialization: it will scan a set of data files, compute the global mean and variance, and set all of the Gaussians in a given HMM to have the same mean and variance
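
Before the first Viterbi alignment exists, HInit's first cycle simply divides each example's frames evenly among the emitting states. A sketch of that uniform split (illustrative, not HTK code):

```python
def uniform_segment(n_frames, n_states):
    """First HInit cycle: assign frame ranges [start, end) of one
    training example uniformly to each emitting state."""
    bounds = [round(i * n_frames / n_states) for i in range(n_states + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(n_states)]

segments = uniform_segment(10, 3)  # 10 frames over 3 emitting states
```

On later cycles these segment boundaries are replaced by the ones the Viterbi alignment finds.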

14 K.Marasek Multimedia Department How to use HTK in 10 easy steps
- Step 8 – Isolated-unit re-estimation using HRest
  - Its operation is very similar to HInit, except that it expects the input HMM definition to have been initialised and it uses Baum-Welch re-estimation in place of Viterbi training
  - Whereas Viterbi training makes a hard decision as to which state each training vector was "generated" by, Baum-Welch takes a soft decision. This can be helpful when estimating phone-based HMMs, since there are no hard boundaries between phones in real speech and using a soft decision may give better results.
  - HRest -S trainlist -H dir1/globals -M dir2 -l ih -L labs dir1/ih
  - This will load the HMM definition for /ih/ from dir1, re-estimate the parameters using the speech segments labelled with ih, and write the new definition to directory dir2.

15 K.Marasek Multimedia Department How to use HTK in 10 easy steps
- Step 9 – Embedded training using HERest
  - HERest embedded training simultaneously updates all of the HMMs in a system using all of the training data.
  - On startup, HERest loads in a complete set of HMM definitions. Every training file must have an associated label file which gives a transcription for that file. Only the sequence of labels is used by HERest, however, and any boundary location information is ignored. Thus, these transcriptions can be generated automatically from the known orthography of what was said and a pronunciation dictionary.
  - HERest processes each training file in turn. After loading it into memory, it uses the associated transcription to construct a composite HMM which spans the whole utterance. This composite HMM is made by concatenating instances of the phone HMMs corresponding to each label in the transcription. The forward-backward algorithm is then applied and the sums needed to form the weighted averages are accumulated in the normal way. When all of the training files have been processed, the new parameter estimates are formed from the weighted sums and the updated HMM set is output.
  - HERest -t -S trainlist -I labs \
      -H dir1/hmacs -M dir2 hmmlist
    (-t: beam limits)
  - Can be used to prepare context-dependent models
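
The statistics HERest accumulates come from the forward-backward recursion: for each frame t and state j it needs the occupation probability gamma[t][j]. A toy version for a single utterance (emission likelihoods precomputed as emit[t][j], entry through state 0 assumed; a sketch, not HTK code):

```python
def forward_backward(trans, emit):
    """Return state-occupation probabilities gamma[t][j] for an HMM with
    transition matrix `trans` and per-frame emission likelihoods emit[t][j].
    The model is entered in state 0 at t = 0."""
    T, N = len(emit), len(trans)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]   # beta at the final frame is 1
    alpha[0][0] = emit[0][0]
    for t in range(1, T):                  # forward pass
        for j in range(N):
            alpha[t][j] = emit[t][j] * sum(alpha[t-1][i] * trans[i][j] for i in range(N))
    for t in range(T - 2, -1, -1):         # backward pass
        for i in range(N):
            beta[t][i] = sum(trans[i][j] * emit[t+1][j] * beta[t+1][j] for j in range(N))
    total = sum(a * b for a, b in zip(alpha[T-1], beta[T-1]))
    return [[alpha[t][j] * beta[t][j] / total for j in range(N)] for t in range(T)]

# two states, uninformative emissions: occupancy follows the transitions
gamma = forward_backward([[0.5, 0.5], [0.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
```

These gamma values are exactly the weights used when accumulating the per-state mean and variance sums.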

16 K.Marasek Multimedia Department How to use HTK in 10 easy steps
- Step 10 – Use HVite to recognize utterances and HResults to evaluate the recognition rate
  - D:\htk-3.1\bin.win32\HVite -g -w -H wsjcam0.mmf -S test.list -C hvite.conf -i recresults.mlf beep.dic wsjcam0.mlist
  - A lot of other options can be set (beam width, scale factors, weights, etc.)
  - On-line: D:\htk-3.1\bin.win32\HVite -g -w -H wsjcam0.mmf -C live.conf beep.dic wsjcam0.mlist
  - Statistics of results: HResults -I testref.mlf tiedlist recout.mlf

    ====================== HTK Results Analysis ==============
    Ref : testrefs.mlf
    Rec : recout.mlf
    Overall Results
    SENT: %Correct=98.50 [H=197, S=3, N=200]
    WORD: %Corr=99.77, Acc=99.65 [H=853, D=1, S=1, I=1, N=855]
    ==========================================================

  - N = total number of words, I = insertions, S = substitutions, D = deletions; correct: H = N-S-D, %Corr = H/N, Acc = (H-I)/N
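
The two word-level figures differ only in how insertions are treated, and can be checked by hand from the counts in brackets:

```python
def htk_scores(H, D, S, I, N):
    """Word scores as HResults defines them:
    %Corr = H/N * 100 and Acc = (H - I)/N * 100, with H = N - S - D."""
    assert H == N - S - D
    return 100.0 * H / N, 100.0 * (H - I) / N

# the counts from the WORD line above
corr, acc = htk_scores(H=853, D=1, S=1, I=1, N=855)
```

Rounded to two decimals this reproduces the slide's %Corr=99.77 and Acc=99.65.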

17 K.Marasek Multimedia Department Bye
- Thanks for your participation!
