# CMUSphinx and pocketSphinx

Windows install

- Make a subdirectory named CMUSphinx.

Running pocketsphinx

- Note the audio file `CMUSphinx\pocketsphinx\test\data\goforward.raw`.
- Open a terminal and change directory to `d:\Stephans\CMUSphinx\pocketsphinx\bin\Release`. `pocketsphinx_batch.exe` should be there, unless the compile failed.
- Make a file `ctlFile.txt` containing the name of the file we will decode: `goforward`
- Make a file called `argFile.txt` with these contents (more about these later): `-hmm ../../model/hmm/en_US/hub4wsj_sc_8k -lm ../../model/lm/en/turtle.DMP -dict ../../model/lm/en/turtle.dic`
- Move `CMUSphinx/sphinxbase/bin/Release/sphinxbase.dll` to `CMUSphinx/pocketsphinx/bin/Release`.
- `CMUSphinx\pocketsphinx\test\data\goforward.raw` `CMUSphinx\pocketsphinx\bin\Release\goforward.raw`
- Run `pocketsphinx_batch.exe -argfile argFile.txt -cepdir ../../test/data -ctl ctlFile.txt -cepext .raw -adcin true -hyp out.txt`
- Note: the command line arguments must be in this order!
- Where:
  - `-argfile argFile.txt` names the arguments file. These arguments are displayed on the screen when the program runs, so you can check that they match.
  - `-cepdir ../../test/data` defines the path to the files to be processed. `-cepdir` must come before `-ctl`.
  - `-ctl ctlFile.txt` defines the control file, which contains the names of the files to process. These names cannot include the path or the extension.
  - `-cepext .raw` defines the extension of the files listed in the control file.
  - `-adcin true` means that the files are audio files.
  - `-hyp out.txt` defines the output file.
- More details on the parameters are
- After running, the output file contains `go forward ten meters (goforward )`.
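The steps above can be scripted. The following is a minimal Python sketch, not part of the original slides: the function name `build_batch_command` and its parameters are hypothetical, the model paths are the ones listed on the slide, and writing one option per line in `argFile.txt` is an assumption about the layout of that file.

```python
from pathlib import Path, PureWindowsPath


def build_batch_command(utterances, cepdir, cepext=".raw", hyp="out.txt",
                        workdir="."):
    """Write ctlFile.txt and argFile.txt, then return the argument list.

    Hypothetical helper: it only mirrors the manual steps from the slide.
    """
    work = Path(workdir)

    # Control file: one name per line, without path and without extension.
    # PureWindowsPath accepts both "\\" and "/" as separators.
    names = [PureWindowsPath(u).stem for u in utterances]
    (work / "ctlFile.txt").write_text("\n".join(names) + "\n")

    # Argument file: the model options from the slide, one per line (assumed layout).
    (work / "argFile.txt").write_text(
        "-hmm ../../model/hmm/en_US/hub4wsj_sc_8k\n"
        "-lm ../../model/lm/en/turtle.DMP\n"
        "-dict ../../model/lm/en/turtle.dic\n"
    )

    # Same order as on the slide: -cepdir comes before -ctl.
    return ["pocketsphinx_batch.exe", "-argfile", "argFile.txt",
            "-cepdir", cepdir, "-ctl", "ctlFile.txt",
            "-cepext", cepext, "-adcin", "true", "-hyp", hyp]
```

The returned list could be passed to `subprocess.run(cmd, cwd=workdir)` from the `bin\Release` directory; the sketch itself does not launch the decoder.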

Make and decode a new audio file
- Open Windows Sound Recorder.
- Record “go forward ten meters”.
- Save as myGoForward.wma (Sound Recorder saves a .wma file).
- Get a WMA-to-WAV converter. I use the 4Musics Multiformat Converter; other converters should work.
- Save the converted file as `c:\pocketsphnix\test\data\myGoForward.wav`.
- Change `ctlFile.txt` to `myGoForward`.
- In the terminal, run `pocketsphinx_batch.exe -argfile argFile.txt -ctl ctlFile.txt -cepdir ./ -cepext .wav -adcin true -hyp out2.txt`
- Check that `out2.txt` says go forward ten meters.
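Before decoding a converted recording, it can help to confirm what the converter actually produced. The sketch below uses Python's standard `wave` module to read the WAV header. The function names are hypothetical, and the expectation of mono, 16-bit, 16 kHz audio is an assumption about the decoder's default settings, not something stated on the slide.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_width_in_bytes, sample_rate_hz) from a WAV header."""
    with wave.open(path, "rb") as w:
        return w.getnchannels(), w.getsampwidth(), w.getframerate()


def looks_decodable(path, expected_rate=16000):
    """True if the file is mono, 16-bit, and at the expected sample rate.

    The expected format is an assumption; adjust expected_rate to match
    the acoustic model and decoder options you actually use.
    """
    channels, width, rate = describe_wav(path)
    return channels == 1 and width == 2 and rate == expected_rate
```

If `looks_decodable("myGoForward.wav")` returns False, changing the converter's output settings is a reasonable first thing to try.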

Make your own acoustic model and language
- We will go over what is going on later, but first, let’s try the process. Alternatively, you can read about what is going on first and then return to this section.
- Download data. Get the mswav version.
- Save it to your CMUSphinx directory.
- Decompress.

models

Three types of models are used:

- Acoustic model
  - Used to model the sound of a phone.
  - Typically, an HMM is used. Each phone has an HMM.
  - Mapping from HMMs to phones.
  - Since the acoustic model is an HMM, in CMU Sphinx the HMM is the same as the acoustic model.
- Phonetic dictionary
  - Maps phones to words.
  - In CMU Sphinx, .dic files are dictionary files.
- Language model
  - Used to determine which sequences of words are allowed. For example, “he super run the sally” is not allowed in the language model.

Set up config file

- From `CMUSphinx\SphinxTrain\etc`, copy `feat.params` and `sphinx_train.cfg` to `CMUSphinx\an4\etc`.
- `sphinx_train.cfg` is the main configuration file. Open it in an editor and set:
  - Line 6: `$CFG_DB_NAME = "an4";`
  - Line 7: `$CFG_BASE_DIR = "d:\\stephans\\CMUSphinx\\an4";`
  - Line 8: `$CFG_SPHINXTRAIN_DIR = "d:\\Stephans\\CMUSphinx\\SphinxTrain";`
  - Line 11: `$CFG_BIN_DIR = "d:\\Stephans\\CMUSphinx\\sphinxbase\\bin\\Release";`
  - Line 13: `$CFG_SCRIPT_DIR = "d:\\Stephans\\CMUSphinx\\SphinxTrain\\scripts";`
- Check out the lines that say where the wav files are and that we are using mswav, which is what we downloaded.
- Then set:
  - Line 232: `$DEC_CFG_DB_NAME = 'an4';`
  - Line 233: `$DEC_CFG_BASE_DIR = 'd:\\Stephans\\CMUSphinx\\an4';`
  - Line 234 does not seem to matter.
  - Line 239: `$DEC_CFG_BIN_DIR = "d:\\Stephans\\CMUSphinx\\pocketsphinx\\bin\\Release";`
- Save `sphinx_train.cfg`.

Other changes

- Copy `sphinxbase.dll` from `CMUSphinx\sphinxbase\bin\Release` to `CMUSphinx\SphinxTrain\bin\Release`.
- In the `CMUSphinx\an4\etc` directory, copy or rename `an4.ug.lm.DMP` to `an4.lm.DMP`.
- Open `CMUSphinx\SphinxTrain\scripts\sphinxtrain.in` in an editor.
  - Line 3: `sphinxpath="d:\\Stephans\\CMUSphinx"`
  - `/lib/sphinxtrain` appears in many places. Change this to `/SphinxTrain`.
- Copy files: from `CMUSphinx\pocketsphinx\bin\Release`, copy `pocketsphinx_batch.exe` and `pocketsphinx.dll` to `CMUSphinx\SphinxTrain\bin\Release`. (Try skipping this and setting line 243 of the .cfg instead.)

check

- Open a cmd prompt.
- Type `path` and make sure that:
  - the directory to python is there
  - `SphinxTrain\bin\Release` is there

Run training

- Change to the `CMUSphinx\an4` directory.
- Run `python ..\SphinxTrain\scripts\sphinxtrain.in run`
- This will take a while (15 minutes).
- The test results are a sentence error rate of 45% (nearly half of the sentences had at least one error) and a word error rate of 15.7% (15.7% of the words were incorrectly estimated).
- This can fail because Python was not installed, or because the path to Python or to SphinxTrain was not set.
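To make the two error rates concrete, here is a small Python sketch that computes them from pairs of reference and recognized sentences. It uses the common definition of word error rate (word-level edit distance divided by the number of reference words); whether the SphinxTrain scoring step counts errors in exactly this way is an assumption, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # word missing from hypothesis
                           cur[j - 1] + 1,           # extra word in hypothesis
                           prev[j - 1] + (r != h)))  # substitution, or match at no cost
        prev = cur
    return prev[-1]


def error_rates(pairs):
    """pairs: list of (reference, hypothesis) strings.

    Returns (word_error_rate, sentence_error_rate) as fractions.
    """
    word_errors = ref_words = bad_sentences = 0
    for ref, hyp in pairs:
        r, h = ref.split(), hyp.split()
        d = edit_distance(r, h)
        word_errors += d
        ref_words += len(r)
        bad_sentences += 1 if d > 0 else 0
    return word_errors / ref_words, bad_sentences / len(pairs)
```

A sentence counts as wrong as soon as it contains a single word error, which is why the sentence error rate is so much higher than the word error rate.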

Check log

- Open `an4.html` and check for errors.
- MODULE: 30 Training Context Dependent models
  - A few errors of type “Failed to align audio to trancript: final state of the search is not reached” are acceptable.
- MODULE: 50 Training Context dependent models
- At the very end is the test decoding.
  - Open the log file.
  - Note the parameters for running decoding, specifically where the hmm, dic, and lm files are.

Test with your own voice sample
- Record a sample.
- Convert it to .wav.
- Run pocketsphinx_batch: `pocketsphinx_batch -hmm d:\Stephans\CMUSphinx\an4/model_parameters/an4.cd_cont_200 -lw 10 -feat 1s_c_d_dd -beam 1e-80 -wbeam 1e-40 -dict d:\Stephans\CMUSphinx\an4/etc/an4.dic -lm d:\Stephans\CMUSphinx\an4/etc/an4.lm.DMP -wip 0.2 -ctl d:\Stephans\CMUSphinx\an4/myTest/ctlFile.txt -ctloffset 0 -ctlcount 130 -cepdir d:\Stephans\CMUSphinx\an4/myTest -cepext .wav -hyp d:\Stephans\CMUSphinx\an4/myTest/results.txt -agc none -varnorm no -cmn current -adcin true`

background

- At a first approximation, words are sequences of sounds, where each sound is a phone.
- However, the exact pronunciation of a phone depends on the phones before and after it.
  - Diphones are two phones. Diphones are less affected by the phones that come before or after.
  - Triphones and quinphones are possible.
  - In CMU Sphinx, the detectors used for these context-dependent units are called senones.
- While there are many phones, not every combination of phones is a word. Thus, we should not simply recognize phones, but recognize words as sequences of phones.
- Besides phones, there are fillers (e.g., breath, “um”).
- An utterance is a sequence of words and fillers. Utterances are separated by pauses.

Running with other models
Many acoustic and language models are available at

Building Your Own Acoustic Model and Language Model
- Building your own models is time consuming.
- Acoustic models require:
  - Lots of recordings of people saying words and sentences. This is not that difficult to do.
  - Accurate transcription of the recordings. This is time consuming.
- There are many acoustic models available online. It is possible to take an existing model and quickly adapt it to a particular speaker.
- Language model
  - Different systems need different language models.
  - A voice control for your TV needs to recognize only a few words, like “volume up,” “change channel,” …
  - A voice-driven composer needs to recognize a different set of words.
  - The performance of the recognizer is improved if your language model only considers the relevant words.
  - You can take an existing language model and trim it to what you need, or make one from scratch.
- Many models are available from

example

- To explore acoustic and language models, get the AN4 database.
  - Save it to your CMUSphinx directory.
  - Decompress.
- Also, explore the PDA dataset.
- The AN4 data consists of spoken letters and numbers, e.g., “A”, “B”, “19”.
- We can test this system by saying things like “A”, “B”, etc.

Acoustic model

- The acoustic model is used to translate recorded sounds into labeled phones, e.g., the recorded sound in file asc.wav is “AH”.
- Roughly speaking, an acoustic model takes the sound sample as input and gives the quality of fit as output:
  - asc.wav -> AH-Model -> -12
  - asc.wav -> AY-Model -> -14
  - The AH-Model gives a better fit to the recorded sound.
- Making an acoustic model is called training. The inputs to training are audio files and transcriptions.
- Challenge: usually the audio file has many phones, not just one.
  - E.g., from the AN4 data set, an audio file contains a recording of the words “TWO SIX EIGHT FOUR FOUR ONE EIGHT” (`CMUSphinx\an4\wav\an4_clstk\fash\cen7-fash-b.wav`).
  - E.g., from the PDA data set, an audio file might contain a recording of the words “MARGINS HISTORICALLY HAVE PEAKED BY MID YEAR HE SAYS” (`CMUSphinx\PDA\PDAs\001\PDAs01_001_1.wav`).
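The idea of comparing fits can be written as a tiny Python sketch. The scores are the illustrative values from the slide (-12 for the AH model, -14 for the AY model), and `best_phone` is a hypothetical helper, not a Sphinx function. A real decoder scores whole sequences of models rather than one isolated phone.

```python
def best_phone(scores):
    """Return the phone label whose model gives the highest (least negative) score."""
    return max(scores, key=scores.get)


# Illustrative fit values for asc.wav, taken from the slide.
scores_for_asc = {"AH": -12, "AY": -14}
label = best_phone(scores_for_asc)
```

Because the scores behave like log-likelihoods, the value closer to zero marks the better fit.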

Transcriptions

- Approach one: the recording from the PDA set is transcribed as `M AA R JH AX N Z SIL HH IX S T AO R IX K AX L IY SIL ...`
- Two problems with approach one:
  - If the word “margins” appears in other files, we need to enter the pronunciation of the word twice.
  - There are two ways that people pronounce “historically”: `HH IX S T AO R IX K AX L IY` and `HH IX S T AO R IX K L IY` (this one actually says “historicly”, which is incorrect).
- Two-stage transcription (results in many files)
  - Transcription file: gives the words spoken.
    - This file contains one line for each file used in training.
    - The line contains the text of the words spoken and the filename (without an extension such as .wav).
    - The AN4 dataset includes the file `an4_train_transcription`, and it includes the line `<s> TWO SIX EIGHT FOUR FOUR ONE EIGHT </s> (cen7-fash-b)`
    - The PDA dataset includes the file `PDAs.train_all.sent`, and it includes the line `MARGINS HISTORICALLY HAVE PEAKED BY MID YEAR HE SAYS (PDAs01_041)`
    - Hmm, this is missing the `<s>` and `</s>`. I think that the software requires them, so to use the PDA data set, add `<s>` and `</s>`.
  - Dictionary file: a mapping from words to phones (elementary spoken sounds).
    - Allows words to have multiple pronunciations.
    - E.g., the AN4 dataset includes the file `an4.dic`, and it includes the lines `ELEVEN IH L EH V AH N`, `ELEVEN(2) IY L EH V AH N`, and `E IY`.
- By combining the transcription file and the dictionary file, the sounds in each recorded audio file can be determined.
- However, it is a bit tricky to determine which part of the audio file corresponds to which sound. This is a major challenge facing training.
- Recall that the overall goal of training is to find models for each sound. But to make the training process easier for the users, we only provide recordings of words and sentences.
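The two-stage lookup can be sketched in Python: read a transcription line, look each word up in the dictionary, and obtain the phone sequence for the recording. The function names are hypothetical and the parsing rules are inferred from the example lines above; choosing the first listed pronunciation for every word is a simplification made only for this sketch.

```python
import re


def parse_dictionary(lines):
    """Map each word to a list of pronunciations (each a list of phones)."""
    prons = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        # Alternate pronunciations are written as WORD(2), WORD(3), ...
        word = re.sub(r"\(\d+\)$", "", parts[0])
        prons.setdefault(word, []).append(parts[1:])
    return prons


def parse_transcription(line):
    """Split '<s> WORDS </s> (fileid)' into (list_of_words, fileid)."""
    m = re.match(r"^(.*)\((\S+)\)\s*$", line)
    if m is None:
        raise ValueError("no (fileid) at end of line: %r" % line)
    words = [w for w in m.group(1).split() if w not in ("<s>", "</s>")]
    return words, m.group(2)


def to_phones(words, prons):
    """Concatenate the first listed pronunciation of each word (a simplification)."""
    phones = []
    for w in words:
        phones.extend(prons[w][0])
    return phones
```

This gives the sequence of sounds in a file but not where each sound starts and ends in the audio, which is the alignment problem that training still has to solve.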

Training

Files needed:

- `your_db_train.fileids`: list of files used for training
  - E.g., AN4 includes `an4_train.fileids`.
  - Format: path/filename (without extension!)
  - The path is relative to where the SphinxTrain program is executed. E.g., the paths in `an4_train.fileids` are relative to where the AN4 /etc directory is, so SphinxTrain needs to be run from this directory.
- `your_db_train.transcription`: transcription for training (described on the previous slide)
- `your_db.dic`: phonetic dictionary (described on the previous slide)
- `your_db.filler`: list of fillers and what they map to
  - Fillers are things like silence, breathing, “um”, etc.
  - Fillers should also be used in the transcript, e.g., `<s> TWO +UM+ SIX EIGHT FOUR FOUR ONE EIGHT </s>`. Fillers use the + sign before and after.
  - During training, models for fillers will be computed.
  - Decoding is more complicated. Fillers are allowed to be added, but there is some penalty, and the fillers are ignored when computing the probability of a sequence of words. E.g., the language model might tell us that “go to bed” is common and “go up bed” is uncommon. If the decoder detects “go um to bed”, it translates it to “go to bed”.
  - For some reason, fillers are not used in the AN4 and PDA transcript files. `<s>`, `</s>`, and SIL, which are silence, are included. SMACK is listed in the PDA filler file, but not in the transcript.
  - File format, one filler per line: `</s> SIL`, `<s> SIL`, `<sil> SIL`, `++INHALE++ +INHALE+`
- `your_db.phone`: phoneset file
  - A list of all phone labels (sounds) used, including fillers. E.g., `an4.phone`: AA, AE, AH, …
  - Every phone label used in the dictionary must be in the .phone file, and so must the filler labels.
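The last requirement is easy to get wrong, so a quick consistency check before training can save a failed run. The following Python sketch is a hypothetical helper, not part of SphinxTrain: it collects every phone label used in the dictionary and filler files and reports those missing from the phone list. It assumes each line has the form "NAME PHONE PHONE ...", as in the examples above.

```python
def missing_phones(dictionary_lines, filler_lines, phone_list):
    """Return the sorted phone labels used in .dic/.filler but absent from .phone."""
    known = set(phone_list)
    used = set()
    for line in list(dictionary_lines) + list(filler_lines):
        parts = line.split()
        # The first token is the word or filler name; the rest are phone labels.
        used.update(parts[1:])
    return sorted(used - known)
```

An empty result means every label is covered; anything returned should be added to the .phone file or corrected in the dictionary.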

- Must have `sphinxtrain/bin/debug` in the path.
- Must copy `sphinxbase.dll` to `sphinxtrain/bin/debug` or set the path to it.
- Move the pocketsphinx exe and dll.
- Edit `sphinxtrain.in` to remove /log and set prefix to path.
- Must use Python 2.7.
- Delete `an4.html` before running. This is a log file, so it will not exist before the first run. If you run and find errors, you can check it, but make sure to delete it before running again so you can see the new errors.
- Change `an4.ug.lm.DMP` to `an4.lm.DMP`.

Language model

- Language models define which combinations of words are allowed, and which combinations are more common or less common.
- A language model defines:
  - How often a word appears. Words: go, stop, hi, bye
  - How often combinations of words appear. Combinations with 2 words: go forward; go back; …
- Note that the length of these sequences can be 2, 3, ... The language model cannot specify all combinations of any length, so only combinations up to some length (e.g., 2 or 3) are specified.
- .ARPA files specify the language model in a particular format. See the next slide for some details.
- There is an online language model maker that takes sentences, counts the combinations of words, and makes an ARPA file.
- If you make your own ARPA file, you must sort it before using it: `sphinx_lm_sort < unsorted.arpa > sorted.arpa`
- Then convert it to a DMP language model: `sphinx_lm_convert -i sorted.arpa -o sorted.lm.DMP`
- Note that sometimes files that end in .lm are in the ARPA format.
- The DMP file can be used to decode.
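The counting step that a language model builder performs can be sketched in a few lines of Python. This is only the counting, under the assumption that sentences are wrapped in `<s>` and `</s>` markers; real tools also apply smoothing and compute back-off weights before writing an ARPA file, and this is not the code of the online tool mentioned above. The function name is hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, max_order=2):
    """Count word sequences of length 1..max_order over a list of sentences."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for sentence in sentences:
        # Sentence start and end markers count as words.
        words = ["<s>"] + sentence.split() + ["</s>"]
        for n in counts:
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts
```

For the sentences "go forward" and "go back", the pair ("go", "forward") is counted once and the single word ("go",) twice, which is the raw material for the probabilities in the model.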

ARPA format

    <header - information ignored by applications>
    \data\
    ngram 1=9
    ngram 2=11
    ngram 3=3

    \1-grams:
    <unk>
    </s>
    <s>
    When
    will
    the
    Stock
    Go
    Up

    \2-grams:
    <s> When
    <s> the
    <s> Up
    When will
    will </s>
    will the
    the </s>
    the Go
    Stock Go
    Go Up
    Up </s>

    \3-grams:
    <s> When will
    When will the
    Go Up </s>

    \end\

- `\data\` specifies how many entries of each order there are.
- The numbers are log10 of probabilities.
- For the 3-gram entry `-1.2 go to bed -.1`:
  - The first number is the log10 of the probability that the last word (bed) occurs given that the first two words have occurred. There might be other 3-grams like go to sleep, etc.
  - The second number is the log10 of the back-off weight for this word sequence. It is applied when a longer n-gram that starts with these words is needed but is not listed in the model; it is not the probability that no words follow.
- For the 2-gram entry `-.2 go to -10.1`:
  - The first number is the log10 of the probability that to occurs after go.
  - The second number is the log10 of the back-off weight for go to, applied when a 3-gram starting with go to is not listed.
- For the 1-gram entry `go -0.27`:
  - The first number is the log10 of the probability that go occurs. Go can occur by itself.
  - The second number is not the log10 of a probability, but the log10 of a weight (it could be the log10 of a probability, but does not have to be).
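Reading one n-gram line and turning the log10 value back into a probability can be sketched as follows. The function names are hypothetical, and the sketch assumes the usual ARPA line layout of a log10 probability, then the words, then an optional trailing log10 back-off weight.

```python
def parse_arpa_entry(line, order):
    """Split an n-gram line into (log10_prob, words, log10_backoff_or_None)."""
    parts = line.split()
    logprob = float(parts[0])
    words = tuple(parts[1:1 + order])
    # The trailing number is optional.
    backoff = float(parts[1 + order]) if len(parts) > 1 + order else None
    return logprob, words, backoff


def to_probability(log10_value):
    """Convert a log10 probability back to a plain probability."""
    return 10.0 ** log10_value
```

For example, a log10 value of -1 corresponds to a probability of 0.1, so values closer to zero mean more likely word sequences.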

Running pocketsphinx on android
- I could only get this working on Linux. Windows might be possible (I didn’t try Mac).
- The instructions here are almost correct.
- Follow the instructions for getting and compiling sphinxbase and pocketsphinx.
- Get `PocketSphinxDemo.tar.gz`.
- Import it into Eclipse: File -> Import -> Existing Projects into Workspace -> (Next) -> select “Select archive file”, then browse and select `PocketSphinxDemo.tar.gz`.
- In an editor, open `eclipse/workspace/PocketSphinxDemo/jni/Android.mk`. In the second-to-last line, change `LOCAL_STATIC_LIBRARIES := sphinxutil sphinxfe sphinxfeat sphinxlm pocketsphinx` to `LOCAL_STATIC_LIBRARIES := pocketsphinx sphinxlm sphinxfeat sphinxfe sphinxutil`
- (Back to the instructions from the web page.) Build: change directory to `eclipse/workspace/PocketSphinxDemo/jni` and run `Android/android-ndk-r7b/ndk-build -B`
- Adjust Properties -> Builders as described on the web page. I’m not sure how important this is. SWIG makes an interface between Java and C++, but these files have already been downloaded. ndk-build is run from the command line.

On phone

- The directory should be `/mnt/sdcard/Android/data/edu.cmu.pocketsphinx`. Run `adb shell`, then `mkdir /mnt/sdcard/Android/data/edu.cmu.pocketsphinx` and `cd /mnt/sdcard/Android/data/edu.cmu.pocketsphinx`.
- Make the directory structure as shown on the web page:
  - `/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/hmm`
  - `/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/hmm/en_US`
  - `/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/hmm/hub4wsj_sc_8k` (not sure if this is needed)
  - `/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/lm`
  - `/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/lm/en_US`
- Cd to `CMUSphinx/pocketsphinx/model/hmm/en_US/` and run `Android/android-sdk/platform-tools/adb push ./hub4wsj_sc_8k /mnt/sdcard/Android/data/edu.cmu.pocketsphinx/hmm/en_US/hub4wsj_sc_8k`
- Cd to `CMUSphinx/pocketsphinx/model/lm` and run `Android/android-sdk/platform-tools/adb push ./en_US /mnt/sdcard/Android/data/edu.cmu.pocketsphinx/lm/en_US/`

In eclipse

In `RecognizerTask.java`, change the code to include the correct path. This path must match the path where the model files are located:

    pocketsphinx.setLogfile("/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/pocketsphinx.log");
    Config c = new Config();
    /*
     * In 2.2 and above we can use getExternalFilesDir() or whatever it's called
     */
    c.setString("-hmm", "/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/hmm/en_US/hub4wsj_sc_8k");
    c.setString("-dict", "/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/lm/en_US/hub dic");
    c.setString("-lm", "/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/lm/en_US/hub DMP");
    c.setString("-rawlogdir", "/mnt/sdcard/Android/data/edu.cmu.pocketsphinx"); // Only use it to store the audio

Note that these lines must also be changed if you use different models. Build, run, and test.

Windows install

- Requires the Android NDK
- Flex for windows:
- Bison for windows:
- Get CMUSphinx from here: ??
- Note that this contains the

- Or google: pocketSphinx android
- Or:
- But the order of the libs at the end needs to be reversed.
- Only compiles on Linux, because it needs yacc.

resources:

Voice activity detection
VAD is used to detect if anyone is speaking
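As an illustration of the idea only, here is a very simple energy-based detector in Python. It is not the method used inside Sphinx: the frame length, the threshold, and the function name are all assumptions chosen for the sketch, and the samples are assumed to be floats between -1 and 1.

```python
import math


def detect_speech(samples, frame_len=160, threshold_db=-30.0):
    """Return one boolean per full frame: True where the RMS level exceeds the threshold."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # A silent frame has no finite level in dB.
        level_db = 20.0 * math.log10(rms) if rms > 0 else float("-inf")
        flags.append(level_db > threshold_db)
    return flags
```

Practical detectors usually adapt the threshold to the background noise and smooth the decisions over several frames, so that short pauses inside a word are not treated as the end of speech.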