Speech Processing for NSR Vs DSR Veeru Ramaswamy PhD CTO, Vianix LLC
Vianix Background
Fast-paced speech technology company with corporate headquarters in Virginia Beach, Virginia.
Vianix has developed, tested, proven and licensed MASC® (Managed Audio Sound Compression):
–State-of-the-art speech compression technology
–High-performance enabling voice technology
–For a broad spectrum of healthcare, multimedia communications and enterprise applications
NSR Vs DSR [Diagram with two panels labeled NSR and DSR]
Disadvantages of DSR
Bandwidth Requirements: Bit rates of ~7-11 kbps, no better than that of compressed voice (e.g., many VBR encoders compress to between 5 and 17 kbps with good recognition accuracy).
Speech Reconstruction: It is not possible to listen to the original voice; more recent advances in DSR allow only low-quality reconstruction of voice from features such as LPC or cepstral coefficients (MFCC, as in ETSI-based Aurora).
Playback using TTS: Most DSR applications can only synthesize voice using TTS for audio playback.
DWER: Overall DWER may be lower or higher than that of NSR-based recognition.
Feature-aware recognition: The recognition engine has to know a priori which type of feature extraction is being done in order to transcribe accurately.
Cost of additional client: Additional expenditure is incurred each time the front-end client needs to be changed.
Advantages of NSR
Delay/Jitter for Transcription: Any delay in the network transmission of NSR is inconsequential because most transcription applications are non-real-time.
Single Client: NSR front-end clients do not need to be changed. The same front-end terminal, such as those used in VoIP and other applications, can be used.
Bandwidth Requirement: Transmitting speech data over any data network for NSR applications requires about the same bandwidth as is needed to encode the speech data (e.g., different encoders today offer VBR levels to meet bandwidth requirements without compromising too much on recognition accuracy).
Bit-stream domain recognition: Recognizing speech in the compressed bit-stream domain avoids complications: no additional feature extraction mechanism is required on the device, and there are no reconstruction losses on the server.
Channel coding: Standard schemes can be used with the compressed stream (to avoid channel errors).
VoIP robustness: Earlier, it was difficult to send compressed voice through the data channel (only voice features were sent). Now that VoIP has become very robust, high-quality compressed voice content can also be sent via data channels.
PESQ / MOS
PESQ (Perceptual Evaluation of Speech Quality)
–Originally defined in P.861 as PSQM, an objective measure
–PSQM was modified into PESQ in P.862
–PESQ combines the excellent psycho-acoustic and cognitive model of PSQM+ with a time alignment algorithm that handles varying delays
–PESQ usually ranges from 1 to 4.5
MOS (Mean Opinion Score)
–A linear mapping of PESQ, proportional to it
–According to the ITU standard, MOS ranges from 1.0 to 5.0
–MOS is a subjective measure, whereas PESQ is an objective measure
Other Metrics
Variable Bit-Rate:
–Codecs that support variable bit rates, including the MASC codec, were compared at various bit rates.
–Bit rates range from almost 5 kbps to 20 kbps.
MIPS:
–Computational efficiency of the different codecs is compared using VTune.
–MIPS ranges from 20 to about 200 depending on the codec used.
WER:
–The percentage of words that have NOT been correctly identified by an ASR.
–Accuracy of the ASR engine is computed by identifying how many words were inaccurate.
DWER: the difference in WER between the original uncompressed PCM samples and the decompressed/decoded PCM samples.
–Absolute and Relative.
–The absolute DWER is used here; a relative DWER can be obtained as the ratio of the absolute DWER to the original uncompressed WER.
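The WER and DWER definitions above can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the slides do not say how word errors were counted, so the sketch uses the common word-level edit distance (substitutions, deletions and insertions), and the function names `word_errors`, `wer_percent`, `absolute_dwer` and `relative_dwer` are hypothetical. The sign of the absolute DWER follows the Stage 3 formula given later in the presentation (WER REF minus WER DEG).

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum number of word substitutions, deletions and insertions
    needed to turn the reference into the hypothesis (edit distance)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # hypothesis empty: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer_percent(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, relative to the reference length."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference text must contain at least one word")
    return 100.0 * word_errors(reference, hypothesis) / n_ref


def absolute_dwer(wer_ref: float, wer_deg: float) -> float:
    """Absolute DWER, using the sign convention WER REF - WER DEG."""
    return wer_ref - wer_deg


def relative_dwer(wer_ref: float, wer_deg: float) -> float:
    """Ratio of the absolute DWER to the original uncompressed WER."""
    if wer_ref == 0:
        raise ValueError("relative DWER is undefined when WER REF is zero")
    return absolute_dwer(wer_ref, wer_deg) / wer_ref
```

For example, `wer_percent("a b c d", "a x c d")` scores one substituted word out of four reference words.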
Procedure for Comparison of Different Codecs
Procedure for ADWER computation
Comparison of MASC with various other codecs
–ADWER
–PESQ
–Bit-rate
Signal Train for DWER Calculation [Diagram. Stage 1: PCM REF is passed to the Automatic Speech Recognition Engine; the transcribed text from PCM Ref is compared with the original text to give %WER REF. Stage 2: PCM REF is passed through the Encoder and Decoder to give PCM Deg; the transcribed text from PCM Deg is compared with the original text to give %WER Deg. Stage 3: the delta between %WER REF and %WER Deg.]
Procedure for computing ADWER
Stage 1: Obtain the transcribed text of the PCM reference file by passing it through the PSM. Obtain the % WER of the transcribed text against the original text (WER REF). All inputs were converted from 16 kHz to 8 kHz using Adobe Audition 2.0.
Stage 2: Repeat Stage 1 with the PCM reference file encoded and decoded with the different encoders and decoders, i.e., repeat Stage 1 using the "Degraded/Decompressed PCM" as input to the ASR (WER DEG). Adobe Audition 2.0 or Sound Recorder was used to convert from PCM to compressed/encoded data and back to decoded/decompressed PCM.
Stage 3: ADWER = WER REF - WER DEG
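The three stages can be sketched as a single function. This is a minimal sketch, not the tooling used in the study: the recognizer, the codec and the WER scorer are passed in as callables, and the names `adwer_for_codec`, `asr`, `codec_round_trip` and `wer_percent` are hypothetical placeholders. The sample-rate conversion step is assumed to have been done beforehand.

```python
from typing import Callable, Dict, Sequence

Pcm = Sequence[int]  # hypothetical stand-in for a buffer of PCM samples


def adwer_for_codec(
    original_text: str,
    pcm_ref: Pcm,
    asr: Callable[[Pcm], str],                  # speech recognizer: PCM -> text
    codec_round_trip: Callable[[Pcm], Pcm],     # encode, then decode back to PCM
    wer_percent: Callable[[str, str], float],   # (original, transcript) -> % WER
) -> Dict[str, float]:
    # Stage 1: transcribe the reference PCM and score it against the original text.
    wer_ref = wer_percent(original_text, asr(pcm_ref))

    # Stage 2: repeat with the degraded/decompressed PCM produced by the codec.
    pcm_deg = codec_round_trip(pcm_ref)
    wer_deg = wer_percent(original_text, asr(pcm_deg))

    # Stage 3: ADWER = WER REF - WER DEG, exactly as written on the slide.
    return {"wer_ref": wer_ref, "wer_deg": wer_deg, "adwer": wer_ref - wer_deg}
```

Passing the recognizer and codec as parameters keeps the comparison loop identical for every codec; only `codec_round_trip` changes between runs.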
Inputs and Outputs
Input: Speech Test Vectors
–A set of test vectors in .wav format is required to adapt and evaluate the ASR
–456 test vectors from eight users (4 male and 4 female). Each user has eleven adaptation files and forty-six evaluation files.
Output: Transcribed Text
–WER computed from the original text and the transcribed text from the PCM reference
–ADWER computed as the difference between the WER of the text from the reference PCM and the WER of the text from the degraded/decompressed PCM
Comparison of 8 kHz Codecs on ASR1 [Table; numeric values not present in the transcript. Columns: Codec, WER, Absolute DWER, PESQ, Bit rate. Rows: 8 kHz Reference; MASC Optimized; MASC Original; GSM; AMR NB setting; AMR NB setting; EVRC; VMR-NB; G; G; Speex setting; Speex setting; True Speech.]
MASC is the only codec existing today that operates at 8 kHz with an ADWER in the 0.5 range.
Comparison of 8 kHz Codecs on ASR2 [Table; numeric values not present in the transcript. Columns: Codec, WER, Absolute DWER, Bit rate, PESQ. Rows: 8 kHz Ref; MASC Opt Fixed; MASC Opt L; MASC Opt L; MASC Opt L; MASC Original Fixed; MASC Original L; MASC Original L; MASC Original L; GSM; AMR NB; AMR NB; AMR NB; G726 - MS ADPCM; Speex; Speex; Speex; Speex; TrueSpeech.]
Summary
Although DSR is perceived as offering low bandwidth and high accuracy, NSR offers more advantages in practice, given the importance of voice reconstruction at the back end and of accuracy with respect to ASR engines.