Sound-Event Partitioning and Feature Normalization for Robust Sound-Event Detection

Baiying LEI 1,2 and Man-Wai MAK 2
1 Department of Biomedical Engineering, Shenzhen University, Shenzhen, China
2 Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China

Funding Sources: Motorola Solutions Foundation; The Hong Kong Polytechnic University

Contents
1. Motivations of Sound-Event Detection
2. Objectives
3. Methodology
   –System Architecture
   –Acoustic Features and Fusion
   –Sound-Event Partitioning
4. Experiments and Results
5. Conclusions

Motivation
In some situations (e.g., in a washroom), surveillance via video cameras is inappropriate; audio is a viable alternative.
With the high processing power of today's smartphones, it is possible to turn a smartphone into a personal audio surveillance and monitoring system.
Audio-based surveillance can make effective use of mobile devices, allowing the surveillance system to be moved easily from one place to another.
Abnormal sound events such as screaming can be detected, and emergency phone calls can be made automatically.

Objectives of This Work
1. Determining suitable acoustic features for scream-sound detection
2. Addressing the data-imbalance problem (scream vs. non-scream) in training SVM classifiers
3. Implementing the detection algorithm on mobile phones

Methodology

System Architecture
[Figure: system architecture, showing the Android app together with playback of background noise and playback of sound events]

Feature Extraction and Fusion
Characteristics of scream sounds:
–They are almost impossible to detect in the time domain.
–However, their spectral characteristics remain visible in the spectrogram even under very noisy conditions.
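As a minimal illustrative sketch (not from the slides), a log-magnitude spectrogram of a mono waveform can be computed with SciPy; the 25 ms / 10 ms framing is an assumption:

```python
import numpy as np
from scipy.signal import spectrogram

def compute_spectrogram(x, fs, frame_ms=25, hop_ms=10):
    """Log-magnitude spectrogram of a mono waveform x sampled at fs Hz."""
    nperseg = int(fs * frame_ms / 1000)
    noverlap = nperseg - int(fs * hop_ms / 1000)
    f, t, Sxx = spectrogram(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return f, t, 10.0 * np.log10(Sxx + 1e-10)  # dB scale; offset avoids log(0)
```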

Feature Extraction and Fusion
Time-frequency acoustic features:
–MFCC (Mel-frequency cepstral coefficients)
  Commonly used in speech and speaker recognition systems
  Known to be not very noise-robust
–GFCC (Gammatone frequency cepstral coefficients)
  Based on auditory filtering and cepstral analysis
  More noise-robust than MFCC
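A minimal sketch of MFCC extraction; librosa and the 25 ms / 10 ms framing are assumptions, since the slides do not specify the toolkit or frame settings. GFCC extraction is analogous, with a gammatone filterbank in place of the mel filterbank:

```python
import librosa

def extract_mfcc(path, n_mfcc=13):
    """Load a sound event and return an (n_mfcc, n_frames) MFCC matrix."""
    x, fs = librosa.load(path, sr=None)  # keep the file's native sample rate
    return librosa.feature.mfcc(y=x, sr=fs, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * fs),       # 25 ms frames
                                hop_length=int(0.010 * fs))  # 10 ms hop
```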

Feature Extraction and Fusion
Correlation between MFCC and GFCC: the two feature types are not perfectly correlated, so fusing them may help improve performance.
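One way to quantify this, as a sketch: the mean Pearson correlation between corresponding MFCC and GFCC coefficient trajectories of the same event (the helper below is hypothetical, not the paper's measure):

```python
import numpy as np

def feature_correlation(mfcc, gfcc):
    """Mean Pearson correlation between corresponding coefficient trajectories.

    mfcc, gfcc: (n_coeff, n_frames) matrices from the same sound event.
    """
    rs = [np.corrcoef(mfcc[i], gfcc[i])[0, 1] for i in range(mfcc.shape[0])]
    return float(np.mean(rs))
```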

Feature Extraction and Fusion
Feature fusion: concatenate the MFCC and GFCC vectors into a single feature vector and train one SVM on the fused features.
Score fusion: fuse the SVM scores produced by the MFCC-based and GFCC-based SVM classifiers.
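A minimal scikit-learn sketch of both schemes; the toolkit, linear kernel, equal score weights, and the stand-in data are all assumptions:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in data: 100 events, one 13-dim MFCC and one 13-dim GFCC vector per
# event (e.g., frame-averaged), with binary scream/non-scream labels.
X_mfcc, X_gfcc = rng.normal(size=(100, 13)), rng.normal(size=(100, 13))
y = rng.integers(0, 2, size=100)

svm_mfcc = SVC(kernel="linear").fit(X_mfcc, y)   # MFCC-only SVM
svm_gfcc = SVC(kernel="linear").fit(X_gfcc, y)   # GFCC-only SVM
X_fused = np.hstack([X_mfcc, X_gfcc])            # concatenated features
svm_fused = SVC(kernel="linear").fit(X_fused, y)

# Score fusion: average the decision scores of the two feature-specific SVMs.
score_fusion = 0.5 * (svm_mfcc.decision_function(X_mfcc) +
                      svm_gfcc.decision_function(X_gfcc))
# Feature fusion: a single score from the SVM trained on fused features.
feature_fusion_score = svm_fused.decision_function(X_fused)
```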

Feature Extraction and Fusion
Feature fusion + score fusion: the score from the feature-fusion SVM is combined with the score obtained by score fusion of the MFCC-based and GFCC-based SVMs.
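Continuing the sketch above, the two fusion outputs can themselves be combined; equal weighting is an assumption, as the slide does not give the combination rule:

```python
# Combine the two fusion outputs (equal weights are an assumption).
final_score = 0.5 * (feature_fusion_score + score_fusion)
is_scream = final_score > 0.0  # SVM decision threshold at 0
```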

PCA Whitening and Normalization
PCA whitening: \tilde{x} = \Lambda^{-1/2} P^\top (x - \mu)
–P: projection matrix comprising the eigenvectors of the training-data covariance matrix
–\Lambda = diag(\lambda_1, ..., \lambda_d): the corresponding eigenvalues
–\mu: mean of the training vectors
The whitened vectors are then L2-normalized: \hat{x} = \tilde{x} / ||\tilde{x}||_2
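A minimal NumPy sketch of the above, assuming X holds one feature vector per row:

```python
import numpy as np

def pca_whiten_l2(X, eps=1e-8):
    """PCA-whiten the row vectors in X, then L2-normalize each row."""
    mu = X.mean(axis=0)
    Xc = X - mu                               # center
    cov = np.cov(Xc, rowvar=False)
    eigvals, P = np.linalg.eigh(cov)          # P: eigenvector (projection) matrix
    Xw = Xc @ P / np.sqrt(eigvals + eps)      # whiten: unit variance per axis
    return Xw / np.linalg.norm(Xw, axis=1, keepdims=True)
```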

Sound-Event Partitioning
Based on our previous work on utterance partitioning for speaker verification: each sound event is partitioned into several segments so that it contributes multiple training vectors, alleviating the data imbalance between screams and non-screams.
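A minimal sketch of the idea under stated assumptions: frame indices are shuffled before splitting (as in utterance partitioning with acoustic vector resampling) and each partition is summarized by its mean vector; the actual per-partition representation in the paper may differ:

```python
import numpy as np

def partition_event(frames, n_partitions=2, n_resamples=1, rng=None):
    """Turn one sound event (n_frames, dim) into several training vectors.

    Shuffling frame indices before each split yields different partitions
    per resampling round, multiplying the number of training vectors.
    """
    rng = np.random.default_rng(rng)
    vectors = []
    for _ in range(n_resamples):
        idx = rng.permutation(len(frames))
        for part in np.array_split(idx, n_partitions):
            vectors.append(frames[part].mean(axis=0))  # one vector per partition
    return np.vstack(vectors)
```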

Experiments and Results

Sound Data
1000 sound events collected from:
–Human Sound Effect
–Freesound.org
240 screams and 760 non-screams
Non-screams (22 types): applause, babycry, cheering, cough, crowd, door-slam, groan, grunt, gunshot, kiss, laugh, nose-blow, phone-ring, sniff, sniffle, snore, snort, speech, spit, throat, vocal, whistle

Effect of Background Noise
Babble noise from NOISEX'92 was added to the sound events so that the resulting noisy sound events have SNRs of 10 dB, 5 dB, 0 dB, and -5 dB.
Performance metric: equal error rate (%EER), the operating point at which the false-acceptance rate equals the false-rejection rate.
The system performs better under matched training/testing noise conditions.
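As a sketch, mixing noise into a signal at a target SNR follows from SNR_dB = 10 log10(P_signal / P_noise):

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Mix `noise` into `signal` so the result has the target SNR in dB."""
    noise = noise[: len(signal)]          # trim noise to the signal length
    p_sig = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale noise so that p_sig / (scale**2 * p_noise) = 10**(snr_db / 10)
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise
```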

Effect of Sound-Event Partitioning and Fusion
Two partitions per sound event are sufficient.
Score fusion + feature fusion performs best.

Effect of Sound-Event Partitioning and Feature Preprocessing
Applying sound-event partitioning is always better than not applying it.
PCA whitening and L2-norm are useful.

Conclusions
Sound-event partitioning and feature-preprocessing methods were proposed for scream-sound detection. It was found that:
–Applying sound-event partitioning is always better than not applying it
–PCA whitening and L2-norm are useful
–Score fusion + feature fusion is effective
Demo
