An Overview of Robustness Related Issues in Speaker Recognition: a plenary overview talk at APSIPA ASC 2014. Thomas Fang Zheng, CSLT, RIIT, Tsinghua University.


1 An Overview of Robustness Related Issues in Speaker Recognition
A plenary overview talk at APSIPA ASC 2014
Thomas Fang Zheng, CSLT, RIIT, Tsinghua University
APSIPA ASC 2014, Dec 9-12, 2014, Siem Reap, Cambodia

2 Introduction
Environmental-Related Issues
Speaker-Related Issues
Application-Oriented Issues
Summary
Reference

3 Automatic speaker recognition (identification and verification)
  - Active research areas (cross-channel, noise, …)
  - Wide applications (telephone banking, forensics, …)
A lot of challenges in practical applications
Three categories of robustness issues
  - Environment-related issues
  - Speaker-related issues
  - Application-oriented issues

4 ENV-Related Issues
  - Noise Robustness
  - Channel Mismatch
Speaker-Related Issues
  - Gender
  - Physical conditions
  - Speaking style
  - Cross-Lingual
  - Aging
APP-Oriented Issues
  - Main applications
  - SUSR

5 Factors:
  - Recording / environmental noises
Two research directions:
  - Feature level: Spectral Subtraction (Boll 1979) / RASTA filtering (Hermansky 1994); PCA (Kocsor 2000) / LDA (Lomax 2007) / HLDA (Saon 2000) in the feature domain
  - Model level: model compensation algorithms (Gales 1996)
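As an illustration of the feature-level direction, the following is a minimal sketch of magnitude spectral subtraction. It is not the exact procedure of (Boll 1979): it assumes the magnitude spectra are already computed, that the first `noise_frames` frames contain noise only, and the function name and the `alpha` and `floor` parameters are hypothetical choices made for this example.

```python
def spectral_subtraction(frames, noise_frames=2, alpha=1.0, floor=0.02):
    """Subtract an average noise magnitude spectrum from every frame.

    frames: list of magnitude spectra, one list of floats per frame.
    noise_frames, alpha and floor are illustrative assumptions.
    """
    n_bins = len(frames[0])
    # Estimate the noise spectrum from the leading frames (assumed speech-free).
    noise = [sum(f[k] for f in frames[:noise_frames]) / noise_frames
             for k in range(n_bins)]
    cleaned = []
    for frame in frames:
        cleaned.append([
            # Keep a small spectral floor instead of letting bins go negative.
            max(mag - alpha * noise[k], floor * mag)
            for k, mag in enumerate(frame)
        ])
    return cleaned


frames = [
    [1.0, 1.0, 1.0],  # noise only
    [1.0, 1.0, 1.0],  # noise only
    [5.0, 1.0, 3.0],  # speech plus noise
]
cleaned = spectral_subtraction(frames)
```

The spectral floor is one common way to limit the artifacts that appear when the noise estimate exceeds the observed magnitude in a bin.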

6 Factors:
  - Various types of microphones / transmission channels
Three research directions:
  - Feature transformation: CMS / CMN (Furui 1981); feature mapping (Reynolds 2003)
  - Model compensation: SMS (Speaker Model Synthesis) (Teunen 2000); subspace projection
  - Score normalization: Z-Norm, H-Norm, T-Norm, ...
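Two of the techniques listed above are simple enough to sketch directly. The example below is an illustrative sketch, not code from the talk: cepstral mean normalization removes a constant per-dimension offset (such as a fixed channel effect in the cepstral domain), and Z-Norm rescales a raw score using impostor scores collected for the target model. The function names and toy numbers are assumptions.

```python
from statistics import mean, pstdev


def cepstral_mean_normalization(features):
    """Subtract the per-dimension utterance mean from every frame."""
    dims = len(features[0])
    means = [mean(frame[d] for frame in features) for d in range(dims)]
    return [[frame[d] - means[d] for d in range(dims)] for frame in features]


def z_norm(raw_score, impostor_scores):
    """Normalize a score with impostor-score statistics of the target model."""
    mu = mean(impostor_scores)
    sigma = pstdev(impostor_scores)
    return (raw_score - mu) / sigma


features = [[1.0, 10.0], [3.0, 14.0]]
# The same utterance seen through a channel that adds a constant offset.
shifted = [[f[0] + 5.0, f[1] - 3.0] for f in features]

normalized = cepstral_mean_normalization(features)
normalized_shifted = cepstral_mean_normalization(shifted)
score = z_norm(3.0, [0.0, 2.0])
```

After normalization the original and the offset features coincide, which is the property that makes CMN useful under channel mismatch. T-Norm follows the same formula as Z-Norm, except that the statistics come from scoring the test utterance against a cohort of other models.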

7 State-of-the-art approaches
  - JFA (Joint Factor Analysis) (Kenny 2007): a more comprehensive statistical approach, which models the speaker and channel variations as two independent random variables.
  - i-vector (Dehak 2011): a single low-rank total variability space is defined to represent both speaker and channel variations at the same time.
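The two models can be written compactly in the notation commonly used in the cited papers. The symbols below do not appear on the slide and are given here as an aid: $M$ is the speaker- and session-dependent GMM mean supervector, $m$ the UBM mean supervector, $V$ and $U$ the speaker (eigenvoice) and channel (eigenchannel) subspaces, $D$ a diagonal residual matrix, $y$, $x$, $z$ latent factors with standard normal priors, $T$ the low-rank total variability matrix, and $w$ the i-vector.

```latex
\begin{align}
\text{JFA:}\quad      & M = m + V y + U x + D z \\
\text{i-vector:}\quad & M = m + T w
\end{align}
```

In JFA the speaker and channel terms are separated explicitly, whereas the i-vector model folds both into the single factor $w$, which is why channel compensation is applied afterwards in the i-vector space.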

8 Inter-channel compensation methods
  - Because the i-vector also captures channel variations, it is less discriminative among speakers, so many inter-channel compensation methods were proposed to accentuate speaker information.
  - NAP (Nuisance Attribute Projection) (Solomonoff 2004): finds an optimized projection.
  - WCCN (Within Class Covariance Normalization) (Hatch 2006): a linear transform.
  - LDA (Linear Discriminant Analysis) (Dehak 2011) / PLDA (Probabilistic LDA) (Ioffe 2006): PLDA is a generative model and has achieved great success.
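As one concrete case, WCCN applied to i-vectors is usually formulated as below. The notation is an assumption taken from common usage rather than from the slide: $S$ is the number of training speakers, $n_s$ the number of i-vectors of speaker $s$, $w_i^{s}$ the $i$-th i-vector of speaker $s$, and $\bar{w}_s$ their mean.

```latex
\begin{align}
W &= \frac{1}{S}\sum_{s=1}^{S}\frac{1}{n_s}\sum_{i=1}^{n_s}
     \left(w_i^{s}-\bar{w}_s\right)\left(w_i^{s}-\bar{w}_s\right)^{\top} \\
W^{-1} &= B B^{\top} \quad \text{(Cholesky decomposition)} \\
\hat{w} &= B^{\top} w
\end{align}
```

The transform $B$ whitens the within-speaker covariance, so directions dominated by session and channel variability are scaled down before scoring.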

9 ENV-Related Issues
  - Noise Robustness
  - Channel Mismatch
Speaker-Related Issues
  - Gender
  - Physical conditions
  - Speaking style
  - Cross-Lingual
  - Aging
APP-Oriented Issues
  - Main applications
  - SUSR

10 Gender
Physical conditions (cold or laryngitis)
Speaking style (emotion / speaking rate / volume / idiom)
Cross-Lingual (language mismatch)
Ageing (voice changes with time/age)

11 Better scenario: training with gender-dependent (GD) features and recognizing with known gender information.
In applications, gender information is often not available.
Approaches: to design a gender-independent system, and then
  - Pairwise discriminative training based on i-vector (Cumani 2012)
  - Source normalization for variation to separate genders as a pre-processing step based on a PLDA classifier (McLaren 2012)
  - Males and females are physiologically different, so their speech should be processed and analyzed differently (FFT size, frame shift (resolution), UBM, ...); the authors' preliminary results show significant improvement when doing so.

12 Speech is a behavioral signal.
Variability of the speaker's physical conditions
  - Cold / nasal congestion / laryngitis, etc.
"Cold-affected" speech in speaker recognition (Tull 1996)
Research in this direction is still rare, and speech databases are difficult to collect and organize, but it has practical importance.

13 Emotion: an intrinsic nature of human beings.
Categories:
  - Analysis of various emotion-related acoustic factors: prosody / voice quality / pitch / duration / sound intensity
  - Emotion-compensation methods: emotion-added model training method (Wu 2005); supra-segmental HMM (Shahin 2009); emotion-dependent CMLLR transformations (Bie 2013)

14 Speaking rate: another high-level speaker-related variable that has a big impact on speaker verification performance.
Rate mismatch between training and test utterances
Speech recognition
  - A probabilistic method to estimate speaking rate (Yasuda 2012)
  - A speech rate classifier (SRC) (Martinez 1998)
Speaker recognition
  - Non-linear time alignment or DTW (Dynamic Time Warping): effective or not?
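For readers unfamiliar with DTW, the sketch below shows the basic dynamic-programming recursion on one-dimensional sequences. It only illustrates how non-linear time alignment absorbs a speaking-rate difference and does not answer the slide's question of whether DTW is effective for speaker recognition. The function name, the absolute-difference local cost, and the toy sequences are assumptions.

```python
def dtw_distance(a, b):
    """Return the accumulated DTW alignment cost between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Allow a match step, or stretching either sequence by one frame.
            cost[i][j] = local + min(cost[i - 1][j - 1],
                                     cost[i - 1][j],
                                     cost[i][j - 1])
    return cost[n][m]


slow = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]  # same contour spoken at half the rate
fast = [1.0, 2.0, 3.0]
same_contour = dtw_distance(slow, fast)
different_contour = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```

The slow and fast versions of the same contour align at zero cost, while a genuinely different contour does not, which is the behavior that motivates DTW under rate mismatch.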

15 Idiom: a person's personal style of word usage and a high-level inter-speaker characteristic.
It is actually a kind of discriminative information rather than a robustness issue, but it helps to improve recognition performance.
Human brain: self-learning with idioms
Important threads:
  - Idiosyncratic word usage: high-level feature
  - Idiosyncratic pronunciation feature: low-level feature

16 Language mismatch results in performance degradation.
Previous work:
  - Training a pooled model from multi-lingual corpora (Ma 2004)
  - Language normalization (Akbacak 2007)
  - Language factor compensation (Lu 2009)
  - Feature combination (Nagaraja 2013)

17 Does voice change significantly with time?
Performance degradation has been observed in the presence of time intervals.
From the point of view of pattern recognition:
  - Enrollment data (for model training) and test utterances for verification are separated by some period of time.

18 Model domain:
  - Data augmentation (Beigi 2009): speaker re-enrollment
  - MAP/MLLR adaptation (Lamel 2000): model adaptation
Score domain:
  - A classifier with an ageing-dependent decision boundary (Kelly 2011)
Feature domain:
  - F-ratio measure (Lu 2007)
  - Frequency warping and filter output weighting to emphasize speaker-sensitive and time-insensitive sub-bands (Wang 2012)
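The F-ratio mentioned above can be sketched as follows, using one common definition: the variance of the speaker means divided by the average within-speaker variance of a feature (for example, the energy in one frequency sub-band). This is an illustrative sketch, not the exact formulation of (Lu 2007); the function name and the toy data are assumptions.

```python
from statistics import mean, pvariance


def f_ratio(values_by_speaker):
    """Between-speaker variance divided by mean within-speaker variance."""
    speaker_means = [mean(v) for v in values_by_speaker.values()]
    between = pvariance(speaker_means)
    within = mean(pvariance(v) for v in values_by_speaker.values())
    return between / within


# Toy sub-band energies: band A separates the speakers, band B does not.
band_a = {"spk1": [1.0, 3.0], "spk2": [9.0, 11.0]}
band_b = {"spk1": [1.0, 9.0], "spk2": [3.0, 11.0]}

ratio_a = f_ratio(band_a)
ratio_b = f_ratio(band_b)
```

A sub-band with a high F-ratio varies a lot across speakers and little within a speaker, so it is a natural candidate to emphasize. For the time-varying case, the within-speaker term would be computed across recording sessions so that time-sensitive bands score low.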

19 ENV-Related Issues
  - Noise Robustness
  - Channel Mismatch
Speaker-Related Issues
  - Gender
  - Physical conditions
  - Speaking style
  - Cross-Lingual
  - Aging
APP-Oriented Issues
  - Main applications
  - SUSR

20 User Authentication
  - Commercial transactions / access control / online shopping
Public Security and Judicature
  - Parolee monitoring / in-prison call monitoring / forensics
Speaker Adaptation in Speech Recognition
  - Speaker-dependent speech recognizer
Multi-Speaker Environments
  - Speaker detection / tracking / segmentation / diarization

21 Short utterance speaker recognition (SUSR)
  - Unsatisfactory performance on GMM-UBM (NIST), JFA (Kenny 2004) and i-vector (Vogt 2008).
Challenges (Zhang 2014)
  - Discriminative information is inadequate and confusable
Research directions
  - To select more discriminative data: Fisher-voice based feature fusion method combined with PCA and LDA (Zhang 2013).
  - To train more accurate models with high-level information: JFA and i-vector / phoneme-specific multi-model method (Zhang 2012).
  - Better algorithms for scoring: ULS (Parris 1998) / WBLS (Malegaonkar 2008).

22 Coding mismatch
  - G.711 / G.729 / WeChat-specific format / ...
Integration of speech recognition and speaker recognition
  - Speech recognition: more speaker/dialect-independent
  - Speaker recognition: more speaker-dependent
Voice quality control:
  - VAD and higher-discriminative feature/segment retrieval
  - High-quality speech vs distorted speech (noisy, clipped, ...)
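To make the VAD item concrete, here is a minimal energy-based voice activity detector. It is only a sketch of the simplest possible approach, not a method proposed in the talk; practical systems use more robust detectors. The function name, `frame_len`, and `threshold` are illustrative assumptions.

```python
def energy_vad(samples, frame_len=4, threshold=0.01):
    """Flag each non-overlapping frame as speech if its mean energy is high."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        flags.append(energy > threshold)
    return flags


samples = (
    [0.0, 0.0, 0.0, 0.0]          # silence
    + [0.5, -0.5, 0.5, -0.5]      # speech-like, high energy
    + [0.01, 0.01, 0.01, 0.01]    # low-level background
)
flags = energy_vad(samples)
```

Only frames flagged as speech would be passed on to feature extraction, which matters most when utterances are short and every discarded or contaminated frame counts.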

23 An overview of speaker recognition technologies with an emphasis on dealing with robustness issues.
Three categories:
  - Environment-related issues
  - Speaker-related issues
  - Application-oriented issues
Some directions have been touched on by researchers, while others may be future focuses.

24 Thank you APSIPA ASC 2014, Dec 9-12, 2014, Cambodia

25 M. Akbacak, J. H. Hansen (Akbacak 2007), “Language normalization for bilingual speaker recognition systems,” Acoustics, Speech and Signal Processing, ICASSP IEEE International Conference on. IEEE, 4: IV-257-IV-260.
H. Beigi (Beigi 2009), “Effects of time lapse on speaker recognition results,” Proc. of 16th International Conference on Digital Signal Processing, pp. 1-6,
F.-H. Bie, D. Wang, T. F. Zheng, J. Tejedor, R. Chen (Bie 2013), “Emotional adaptive training for speaker verification,” Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013 Asia-Pacific. IEEE, 2013: 1-4.
S. F. Boll (Boll 1979), “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics, Speech and Signal Processing, 1979, 27:
S. Cumani, O. Glembek, N. Brummer, E. de Villiers, P. Laface (Cumani 2012), “Gender independent discriminative speaker recognition in i-vector space,” ICASSP,
N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet (Dehak 2011), “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798,
S. Furui (Furui 1981), “Cepstral analysis technique for automatic speaker verification,” IEEE Trans. Acoust. Speech Signal Processing, (2):
M. J. F. Gales and S. J. Young (Gales 1996), “Robust continuous speech recognition using parallel model combination,” IEEE Transactions on Speech and Audio Processing, 1996, 4(5):
A. O. Hatch, S. S. Kajarekar, and A. Stolcke (Hatch 2006), “Within-class covariance normalization for SVM-based speaker recognition,” in INTERSPEECH’ 06,

26 H. Hermansky and N. Morgan (Hermansky 1994), “RASTA processing of speech,” IEEE Transactions on Speech and Audio Processing, (4):
S. Ioffe (Ioffe 2006), “Probabilistic linear discriminant analysis,” in ECCV2006, 2006, pp. 531–542.
F. Kelly and N. Harte (Kelly 2011), “Effects of long-term ageing on speaker verification,” Biometrics and ID Management, Volume 6583 of Lecture Notes in Computer Science, pp , Springer Berlin/Heidelberg,
P. Kenny, P. Dumouchel (Kenny 2004), “Experiments in Speaker Verification using Factor Analysis Likelihood Ratios,” in Proceedings of Odyssey04 - Speaker and Language Recognition Workshop, Toledo, Spain,
P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel (Kenny 2007), “Joint factor analysis versus eigenchannels in speaker recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435–1447,
A. Kocsor, L. Toth, A. Kuba, K. Kovacs, M. Jelasity, T. Gyimothy, J. Csirik (Kocsor 2000), “A comparative study of several feature transformation and learning methods for phoneme classification,” International Journal of Speech Technology, (3):
L. Lamel and J. Gauvin (Lamel 2000), “Speaker verification over the telephone,” Speech Communication, Volume 2000, Issue 31, pp ,
R. G. Lomax and D. L. Hahs-Vaughn (Lomax 2007), “Statistical concepts: a second course,” Lawrence Erlbaum Associates,
X. Lu and J. Dang (Lu 2007), “Physiological feature extraction for text independent speaker identification using non-uniform subband processing,” Proc. of ICASSP 2007, pp , 2007
L. Lu, Y. Dong, X. Zhao, J. Liu, H. Wang (Lu 2009), “The effect of language factors for robust speaker recognition,” Acoustics, Speech and Signal Processing, ICASSP

27 B. Ma, and H.-L. Meng (Ma 2004), “English-Chinese bilingual text-independent speaker verification,” Acoustics, Speech, and Signal Processing, Proceedings (ICASSP'04). IEEE International Conference on. Vol. 5,
A. Malegaonkar, A. Ariyaeeinia, P. Sivakumaran and J. Fortuna (Malegaonkar 2008), “On the enhancement of speaker identification accuracy using weighted bilateral scoring,” IEEE International Carnahan Conference on Security Technology (ICCST): ,
F. Martinez, D. Tapias, J. Alvarez (Martinez 1998), “Towards speech rate independence in large vocabulary continuous speech recognition,” Acoustics, Speech and Signal Processing,
M. McLaren and D. A. van Leeuwen (McLaren 2012), “Gender-independent speaker recognition using source normalization,” in Proc. ICASSP, 2012, pp
B. G. Nagaraja, H. S. Jayanna (Nagaraja 2013), “Combination of Features for Multilingual Speaker Identification with the Constraint of Limited Data,” International Journal of Computer Applications, 2013, Vol.70 (6), pp.1-6.
NIST Speaker Recognition Evaluation Plan (NIST), Online Available
E. S. Parris and M. J. Carey (Parris 1998), “Multilateral techniques for speaker recognition,” International Conference on Spoken Language Processing (ICSLP),
D. A. Reynolds (Reynolds 2003), “Channel robust speaker verification via feature mapping,” ICASSP, 2003, (2):
G. Saon, M. Padmanabhan, R. Gopinath, S. Chen (Saon 2000), “Maximum likelihood discriminant feature spaces,” Proceedings of the International Conference on Acoustics, Speech and Signal Processing, :
I. Shahin (Shahin 2009), “Speaker identification in emotional environments,” Iranian Journal of Electrical and Computer Engineering, vol. 8, no.1, pp. 41–46,

28 A. Solomonoff, C. Quillen, and W. M. Campbell (Solomonoff 2004), “Channel compensation for SVM speaker recognition,” in Proc. Odyssey Speaker and Language Recognition Workshop, 2004, pp. 57–62.
R. Teunen, B. Shahshahani, and L. Heck (Teunen 2000), “A model-based transformational approach to robust speaker recognition,” in Proc. ICSLP’00, 2000, pp. 495–498.
R. G. Tull and J. C. Rutledge (Tull 1996), “‘Cold Speech’ for Automatic Speaker Recognition,” Acoustical Society of America 131st Meeting Lay Language Papers, May,
R. Vogt, B. Baker, and S. Sridharan (Vogt 2008), “Factor analysis subspace estimation for speaker verification with short utterances,” in Interspeech, Brisbane,
L.-L. Wang, X.-J. Wu, T. F. Zheng and C.-H. Zhang (Wang 2012), “An Investigation into Better Frequency Warping for Time-Varying Speaker Recognition,” APSIPA ASC,
T. Wu, Y.-C. Yang, and Z.-H. Wu (Wu 2005), “Improving speaker recognition by training on emotion-added models,” in Proc. Affective Computing and Intelligent Interaction, 2005, pp. 382–389.
H. Yasuda and M. Kudo (Yasuda 2012), “Speech rate change detection in martingale framework,” in Proc. ISDA, 2012, pp
C.-H. Zhang, X.-J. Wu, T. F. Zheng and L.-L. Wang (Zhang 2012), “A K-phoneme-class based multi-model method for short utterance speaker recognition,” The 4th Asia-Pacific Signal and Information Processing Association, Annual Summit and Conference, APSIPA ASC,
C.-H. Zhang and T. F. Zheng (Zhang 2013), “A fishervoice based feature fusion method for short utterance speaker recognition,” IEEE China Summit and International Conference on Signal and Information Processing, ChinaSIP,
C.-H. Zhang (Zhang 2014), “Research on Short Utterance Speaker Recognition,” PhD thesis, Tsinghua University, April

