Speaker Identification by Combining MFCC and Phase Information Longbiao Wang (Nagaoka University of Technology, Japan) Seiichi Nakagawa (Toyohashi University of Technology, Japan) 1

Background  The importance of phase in human speech recognition has been reported.  In conventional speaker recognition methods based on mel-frequency cepstral coefficients (MFCCs), phase information has hitherto been ignored. 2

Purpose and method  We aim to use phase information for speaker recognition.  We propose a phase information extraction method that normalizes the variation in the phase caused by the clipping position of the input speech, and we combine the phase information with MFCCs. 3

Investigating the effect of phase 4 Conventional MFCCs, which capture vocal tract information, cannot distinguish speaker characteristics caused by the vocal source. The phase, in contrast, is greatly influenced by vocal source characteristics. We generated speech waves for different vocal sources and pitches with a fixed vocal tract shape corresponding to the vowel /a/.
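Below is a minimal sketch of this kind of source-filter experiment, assuming textbook formant values for /a/ and a simple impulse-train glottal source (not the authors' exact synthesis setup): the same all-pole vocal tract filter is excited at different pitches, so features based only on the magnitude envelope change little while the phase spectrum varies with the source.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
# Assumed formant (frequency, bandwidth) pairs in Hz for the vowel /a/.
formants = [(730, 60), (1090, 80), (2440, 120)]

# Build a fixed all-pole vocal tract filter from the formant resonances.
a = np.array([1.0])
for f, bw in formants:
    r = np.exp(-np.pi * bw / fs)                 # pole radius from bandwidth
    w = 2 * np.pi * f / fs                       # pole angle from frequency
    a = np.convolve(a, [1.0, -2 * r * np.cos(w), r * r])

def synthesize(pitch_hz, n_samples=4000):
    """Excite the fixed /a/ filter with an impulse train of the given pitch."""
    excitation = np.zeros(n_samples)
    excitation[::int(fs / pitch_hz)] = 1.0       # crude glottal source model
    return lfilter([1.0], a, excitation)

wave_low = synthesize(120.0)    # low-pitched source, same vocal tract
wave_high = synthesize(200.0)   # high-pitched source, same vocal tract
```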

Phase information extraction  The short-term spectrum S(ω, t) for the t-th frame of a signal is obtained by the DFT of the input speech signal sequence: S(ω, t) = X(ω, t) + jY(ω, t) = |S(ω, t)| e^{jθ(ω, t)}. For conventional MFCCs, only the power spectrum |S(ω, t)|² is used, and the phase information θ(ω, t) is ignored. In this paper, the phase is also extracted as one of the feature parameters for speaker recognition. 5
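As a sketch of this extraction step in NumPy (assuming a Hamming window and the phase-stream settings from the setup slide: 12.5 ms frames and 5 ms shift at 16 kHz, i.e. 200 and 80 samples):

```python
import numpy as np

def spectrum_and_phase(x, frame_index, frame_len=200, shift=80):
    """Short-term DFT of one frame: returns magnitude |S| and wrapped phase theta."""
    start = frame_index * shift
    windowed = x[start:start + frame_len] * np.hamming(frame_len)
    S = np.fft.rfft(windowed)        # S(omega, t) = |S| * exp(j * theta)
    return np.abs(S), np.angle(S)    # MFCCs use only |S|^2; theta is kept here
```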

Problem of unnormalized phase 6 Example of the effect of clipping position on phase for the Japanese vowel /a/  However, the phase changes with the clipping position of the input speech, even at the same frequency ω: the unnormalized wrapped phases of two differently clipped windows can be quite different.

Phase normalization (1/2)  To overcome this problem, the phase at a certain basis radian frequency is converted to a constant in all frames, and the phase at every other frequency is estimated relative to it. In the experiments discussed in this paper, the basis radian frequency is set to ω_b = 2π × 1000 Hz.  For example, setting the phase of the basis radian frequency to π/4, we have S̃(ω_b, t) = |S(ω_b, t)| e^{jπ/4}. 7

Phase normalization (2/2)  The difference between the unnormalized wrapped phase at the basis frequency and the normalized phase is π/4 − θ(ω_b, t). Because a shift of the clipping position changes the phase linearly with frequency, at any other frequency ω = 2πf (that is, f = ω/2π) the difference becomes (ω/ω_b)(π/4 − θ(ω_b, t)). Thus, the spectrum at frequency ω becomes S̃(ω, t) = S(ω, t) · e^{j(ω/ω_b)(π/4 − θ(ω_b, t))}, and the phase information is normalized as θ̃(ω, t) = θ(ω, t) + (ω/ω_b)(π/4 − θ(ω_b, t)). 8
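The normalization fits in a few lines; the following is a sketch under the stated convention (basis frequency 1000 Hz, target phase π/4), not the authors' code:

```python
import numpy as np

def normalize_phase(theta, freqs, base_freq=1000.0, base_phase=np.pi / 4):
    """Normalize wrapped phase so the basis frequency bin always has phase pi/4.

    theta: wrapped phase per frequency bin for one frame (from np.angle)
    freqs: center frequency in Hz of each bin, e.g. np.fft.rfftfreq(200, d=1/16000)
    """
    base_bin = int(np.argmin(np.abs(freqs - base_freq)))
    # Wrapped correction pi/4 - theta(omega_b, t) at the basis frequency.
    diff = np.angle(np.exp(1j * (base_phase - theta[base_bin])))
    # A clipping-position shift changes phase linearly in frequency, so the
    # correction is scaled by omega / omega_b = f / f_b for every other bin.
    theta_norm = theta + (freqs / base_freq) * diff
    return np.angle(np.exp(1j * theta_norm))    # re-wrap to (-pi, pi]
```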

Comparison of unnormalized phase and normalized phase 9 Example of the effect of clipping position on phase for Japanese vowel /a/ After normalizing the wrapped phase, the phase values become very similar.

From phase θ to phase {cos θ, sin θ} 10  There is a problem with this method when comparing two phase values. For example, with the two values θ₁ = π − δ and θ₂ = −π + δ for a small δ, the difference is θ₁ − θ₂ = 2π − 2δ ≈ 2π, despite the two phases being very similar to one another on the unit circle.  Therefore, for this research, we changed the phase into coordinates on a unit circle, that is, θ → {cos θ, sin θ}.
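A small numerical check of the wrap-around problem and of the {cos θ, sin θ} fix (the values are illustrative):

```python
import numpy as np

theta1, theta2 = np.pi - 0.01, -np.pi + 0.01   # nearly identical on the unit circle
print(abs(theta1 - theta2))                    # ~6.26 (about 2*pi): misleadingly large

# Mapping each phase to (cos, sin) coordinates removes the wrap-around:
p1 = np.array([np.cos(theta1), np.sin(theta1)])
p2 = np.array([np.cos(theta2), np.sin(theta2)])
print(np.linalg.norm(p1 - p2))                 # ~0.02: small, as expected
```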

How to synchronize the splitting section 11

Combination method 12  The GMM based on MFCCs (MODEL 1) is combined with the GMM based on phase information (MODEL 2).  The likelihood of MODEL 1 is linearly coupled with that of MODEL 2 to produce a new score given by L̃_n = α L_n^{MFCC} + (1 − α) L_n^{phase}, where L_n^{MFCC} and L_n^{phase} are the likelihoods produced by the n-th speaker model based on MFCC and the n-th speaker model based on phase, respectively, with n = 1, 2, …, N, N being the number of registered speakers, and α a weighting coefficient.
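A sketch of the resulting decision rule, assuming the per-speaker likelihoods have already been computed by the two GMMs and that the weighting coefficient α is tuned empirically:

```python
import numpy as np

def identify_speaker(ll_mfcc, ll_phase, alpha=0.5):
    """Linearly combine per-speaker scores from the two GMMs and pick the argmax."""
    combined = alpha * np.asarray(ll_mfcc) + (1 - alpha) * np.asarray(ll_phase)
    return int(np.argmax(combined))    # index of the identified speaker (0..N-1)
```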

Experimental setup (1/3)  NTT database  # speakers: 35 (22 males and 13 females)  # sessions: 5 (1990.8, …)  # training utterances: 5 (1990.8)  # test utterances: 1 (about 4 seconds), 35 × 4 × 5 = 700 trials  JNAS database  # speakers: 270 (135 males and 135 females)  # training utterances: 5 (about 2 seconds / sentence)  # test utterances: 1 (about 5.5 seconds), about 95 sentences / person, 270 × 95 = 25650 trials 13

Experimental setup (2/3)  Noise  Stationary noise (in a computer room)  Non-stationary noise (in an exhibition hall)  Noisy speech  Noise was added to clean speech at average SN ratios of 20 dB and 10 dB. 14

Experimental setup (3/3)

                     MFCC                                       Phase
Sampling frequency   16 kHz                                     16 kHz
Frame length         25 ms                                      12.5 ms
Frame shift          12.5 ms                                    5 ms
Dimensions           25                                         {θ}: 12; {cos θ, sin θ}: 24
GMMs                 8 mixtures with full-covariance matrices   64 mixtures with diagonal covariance matrices

15

16 Speaker identification using clean speech

17 Speaker identification result on NTT database (1/2) Speaker identification results using the combination of MFCC-based GMM and the original phase {θ}

18 Speaker identification result on NTT database (2/2)  Speaker identification results using the combination of MFCC-based GMM and the modified phase {cos θ, sin θ}

19 Speaker identification result on JNAS database Speaker identification results using the combination of MFCC-based GMM and the modified phase {cosθ, sinθ}

20 Speaker identification under stationary/non-stationary noisy conditions

21 Speaker identification results under noisy conditions (1/2)  NTT database (figure shows speaker identification rate, %)

22 Speaker identification results under noisy conditions (2/2)  JNAS database (figure shows speaker identification rate, %)

23 Conclusion  We proposed a phase information extraction method that normalizes the variation of the phase caused by the clipping position of the input speech and integrates the phase information with MFCC.  The experimental results showed that combining phase information with MFCC improved speaker recognition performance remarkably compared with the MFCC-based method alone.

24 Thank you for your attention!