Speech Processing for NSR vs. DSR
Veeru Ramaswamy, PhD, CTO, Vianix LLC

Vianix Background
Fast-paced speech technology company with corporate headquarters in Virginia Beach, Virginia.
Vianix has developed, tested, proven and licensed MASC® (Managed Audio Sound Compression):
–State-of-the-art speech compression technology
–High-performance enabling voice technology
–For a broad spectrum of healthcare, multimedia communications and enterprise applications

NSR vs. DSR
[Figure: two diagrams, labeled NSR and DSR]

Disadvantages of DSR
Bandwidth requirements: Bit rates of ~7-11 kbps are no better than those of compressed voice (e.g., many VBR encoders compress to between 5 and 17 kbps with good recognition accuracy).
Speech reconstruction: The original voice cannot be listened to; more recent DSR advances allow only low-quality reconstruction of voice from features such as LPC or cepstral coefficients (MFCC, as in ETSI-based Aurora).
Playback using TTS: Most DSR applications can only synthesize voice using TTS for audio playback.
DWER: Overall DWER may be lower or greater than with NSR-based recognition.
Feature-aware recognition: The recognition engine has to know a priori which type of feature extraction is being done in order to transcribe accurately.
Cost of additional client: Additional expenditure, because the front-end has to be changed each time a client needs to be changed.

Advantages of NSR
Delay/jitter for transcription: Any delay in the network transmission of NSR is inconsequential because most transcription applications are non-real-time.
Single client: NSR front-end clients do not need to be changed. The same front-end terminal can be used as in VoIP and other applications.
Bandwidth requirement: Transmitting speech over any data network for NSR applications requires about the same bandwidth as the encoded speech data (e.g., different encoders today offer VBR levels that meet bandwidth requirements without compromising too much on recognition accuracy).
Bit-stream domain recognition: Recognizing speech in the compressed bit-stream domain avoids complications: no additional feature-extraction mechanism is required on the device, and there are no reconstruction losses on the server.
Channel coding: Standard schemes can be used with the compressed stream (to protect against channel errors).
VoIP robustness: Earlier, it was difficult to send compressed voice through the data channel (only voice features could be sent). Now that VoIP has become very robust, high-quality compressed voice content can also be sent via data channels.

PESQ / MOS
PESQ (Perceptual Evaluation of Speech Quality)
–Originally defined as PSQM, an objective measure, as part of P.861
–PSQM was modified to become PESQ in P.862
–PESQ combines the excellent psycho-acoustic and cognitive model of PSQM+ with a time-alignment algorithm that handles varying delays
–PESQ scores usually range from 1 to 4.5
MOS (Mean Opinion Score)
–A linear mapping of, and proportional to, PESQ
–According to the ITU standard, MOS can be between 1.0 and 5.0
–MOS is a subjective measure, whereas PESQ is an objective measure

Other Metrics
Variable Bit-Rate:
–Codecs that support variable bit rates, including the MASC codec, were compared at various bit-rates.
–Bit-rates range from almost 5 kbps to 20 kbps.
MIPS:
–Computational efficiency of the different codecs is compared using V-Tune.
–MIPS ranges from 20 to about 200 depending on the codec used.
WER:
–The percentage of words that have NOT been correctly identified by an ASR engine.
–Accuracy of the ASR engine is computed by identifying how many words were inaccurate.
DWER: The difference in WER between the original uncompressed PCM samples and the decompressed/decoded PCM samples.
–Can be absolute or relative.
–Absolute DWER is used here; a relative number can be obtained by computing the ratio of the absolute DWER to the original uncompressed WER.
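The WER and DWER definitions above can be illustrated with a minimal Python sketch. It rests on assumptions: the slides do not say how words are aligned, so the common word-level edit-distance form of WER (substitutions, deletions and insertions divided by the reference word count) is used here. The function names are hypothetical. The sign of the absolute DWER follows the deck's own convention, WER REF minus WER DEG.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent, using word-level edit distance (an assumed definition)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def absolute_dwer(wer_ref: float, wer_deg: float) -> float:
    """Absolute DWER in percentage points, signed as WER REF - WER DEG."""
    return wer_ref - wer_deg


def relative_dwer(wer_ref: float, wer_deg: float) -> float:
    """Ratio of the absolute DWER to the original uncompressed WER."""
    if wer_ref == 0:
        raise ValueError("relative DWER is undefined when the reference WER is zero")
    return absolute_dwer(wer_ref, wer_deg) / wer_ref


if __name__ == "__main__":
    original = "the cat sat on the mat"
    print(word_error_rate(original, "the cat sat on a mat"))  # one substitution in six words
    print(absolute_dwer(10.0, 10.5), relative_dwer(10.0, 10.5))
```

With this sign convention, a codec that degrades recognition yields a negative absolute DWER; taking the magnitude is a reporting choice the slides leave open.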

Procedure for Comparison of Different Codecs
Procedure for ADWER computation
Comparison of MASC with various other codecs
–ADWER
–PESQ
–Bit-rate

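Signal Train for DWER Calculation
[Figure: block diagram in three stages.
Stage 1: PCM REF → Automatic Speech Recognition Engine → Transcribed Text from PCM Ref → Comparison of Text Files for Word Error Rate (against Original Text) → %WER REF.
Stage 2: PCM REF → Encoder → Decoder → PCM Deg → Automatic Speech Recognition Engine → Transcribed Text from PCM Deg → Comparison of Text Files for Word Error Rate (against Original Text) → %WER Deg.
Stage 3: Delta WER from %WER REF and %WER Deg.]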

Procedure for Computing ADWER
Stage 1: Obtain the transcribed text of the PCM reference file by passing it through the ASR engine.
–Obtain the % WER of the transcribed text against the original text (WER REF)
–All inputs were converted from 16 kHz to 8 kHz using Adobe Audition 2.0
Stage 2: Repeat Stage 1 with the PCM reference file encoded and decoded by the different encoders and decoders, i.e., use the "Degraded/Decompressed PCM" as input to the ASR (WER DEG).
–Adobe Audition 2.0 or Sound Recorder was used to convert from PCM to compressed/encoded data and back to decoded/decompressed PCM
Stage 3:
–ADWER = WER REF - WER DEG
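The three stages can be sketched as a small Python function. This is only an illustration under stated assumptions: `transcribe`, `encode`, `decode` and `wer` are hypothetical callables standing in for the ASR engine, the codec under test and a WER scorer. None of them correspond to a real API named in the slides. The 16 kHz to 8 kHz resampling step is assumed to have been done beforehand.

```python
from typing import Callable, Dict


def compute_adwer(
    pcm_ref: bytes,
    original_text: str,
    transcribe: Callable[[bytes], str],   # hypothetical ASR engine wrapper
    encode: Callable[[bytes], bytes],     # hypothetical codec encoder
    decode: Callable[[bytes], bytes],     # hypothetical codec decoder
    wer: Callable[[str, str], float],     # hypothetical scorer: (original, transcript) -> % WER
) -> Dict[str, float]:
    # Stage 1: transcribe the reference PCM and score it against the original text.
    text_ref = transcribe(pcm_ref)
    wer_ref = wer(original_text, text_ref)

    # Stage 2: round-trip the same PCM through the codec, then repeat Stage 1.
    pcm_deg = decode(encode(pcm_ref))
    text_deg = transcribe(pcm_deg)
    wer_deg = wer(original_text, text_deg)

    # Stage 3: ADWER = WER REF - WER DEG, as written in the procedure.
    return {"wer_ref": wer_ref, "wer_deg": wer_deg, "adwer": wer_ref - wer_deg}


if __name__ == "__main__":
    # Toy stand-ins: the "PCM" is just text bytes and the "codec" corrupts one word.
    def toy_wer(ref: str, hyp: str) -> float:
        r, h = ref.split(), hyp.split()
        errors = sum(a != b for a, b in zip(r, h)) + abs(len(r) - len(h))
        return 100.0 * errors / len(r)

    result = compute_adwer(
        b"the cat sat on the mat",
        "the cat sat on the mat",
        transcribe=lambda pcm: pcm.decode(),
        encode=lambda pcm: pcm.replace(b"mat", b"map"),
        decode=lambda bits: bits,
        wer=toy_wer,
    )
    print(result)
```

Passing the ASR engine and codec in as callables keeps the procedure identical for every codec being compared; only the `encode`/`decode` pair changes between runs.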

Inputs and Outputs
Input: Speech Test Vectors
–A set of test vectors in .wav format is required to adapt and evaluate the ASR
–456 test vectors from eight users (4 male and 4 female); each user has eleven adaptation files and forty-six evaluation files
Output: Transcribed Text
–WER computed from the original text and the transcribed text from the PCM reference
–ADWER computed as the difference between the WER of the text from the reference PCM and the WER of the text from the degraded/decompressed PCM
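As a quick consistency check on the test-set size, the arithmetic can be written out in Python. The constant and function names are hypothetical; the counts come from the slide.

```python
USERS = 8                        # 4 male + 4 female speakers
ADAPTATION_FILES_PER_USER = 11   # used to adapt the ASR to each speaker
EVALUATION_FILES_PER_USER = 46   # used to measure WER


def total_test_vectors(users: int, adaptation: int, evaluation: int) -> int:
    """Total number of .wav test vectors across all users."""
    return users * (adaptation + evaluation)


if __name__ == "__main__":
    total = total_test_vectors(USERS, ADAPTATION_FILES_PER_USER, EVALUATION_FILES_PER_USER)
    print(total)  # 8 * (11 + 46) = 456
```

Of these, 8 × 11 = 88 files are adaptation files and 8 × 46 = 368 are evaluation files.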

Comparison of 8 kHz Codecs on ASR1
[Table: numeric values not preserved in this transcript]
Columns: Codec, WER, Absolute DWER, PESQ, Bit rate
Rows: 8 kHz Reference; MASC Optimized; MASC Original; GSM; AMR NB setting; AMR NB setting; EVRC; VMR-NB; G; G; Speex setting; Speex setting; True Speech

MASC is the only codec available today that operates at 8 kHz with an ADWER in the 0.5 range.

Comparison of 8 kHz Codecs on ASR2
[Table: numeric values not preserved in this transcript]
Columns: Codec, WER, Absolute DWER, Bit rate, PESQ
Rows: 8 kHz Ref; MASC Opt Fixed; MASC Opt L; MASC Opt L; MASC Opt L; MASC Original Fixed; MASC Original L; MASC Original L; MASC Original L; GSM; AMR NB; AMR NB; AMR NB; G726 - MS ADPCM; Speex; Speex; Speex; Speex; TrueSpeech

Summary
Although there is a perception that DSR offers low bandwidth and high accuracy, the importance of voice reconstruction at the back-end and of accuracy with respect to ASR engines means that, in practice, NSR has considerably more advantages than DSR.