Speech and Singing Voice Enhancement via DNN


Speech and Singing Voice Enhancement via DNN
J.-S. Roger Jang (張智星)
jang@mirlab.org
http://mirlab.org/jang
MIR Lab, CSIE Dept., National Taiwan University

Outline
- Singing voice separation (SVS)
  - SVS via active noise cancellation
  - SVS for monaural recordings
- Speech separation
  - Speech separation based on binary masks (slides by Prof. DeLiang Wang, Ohio State University)

IBM definition
- The idea is to retain the parts of a mixture where the target sound is stronger than the acoustic background, and discard the rest
- The definition of the ideal binary mask (IBM):
  IBM(t, f) = 1 if SNR(t, f) > θ, and 0 otherwise
  - θ: a local SNR criterion (LC) in dB, typically chosen to be 0 dB
- Optimal SNR: under certain conditions the IBM with θ = 0 dB is the optimal binary mask in terms of SNR gain (Li & Wang, 2009)
- Maximal articulation index (AI) in a simplified version (Loizou & Kim, 2011)
- Note that the IBM does not actually separate the mixture!
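The threshold rule above can be sketched numerically. A minimal NumPy sketch, assuming access to the premixed target and noise magnitude spectrograms (which is exactly what makes the mask "ideal"); `ideal_binary_mask` and the toy arrays are illustrative names, not from the slides:

```python
import numpy as np

def ideal_binary_mask(target_mag, noise_mag, lc_db=0.0):
    """Compute the IBM from premixed target and noise magnitude
    spectrograms (same shape, frequency x time).

    A time-frequency unit is retained (mask = 1) when its local SNR
    exceeds the criterion LC (in dB), and discarded otherwise.
    """
    eps = 1e-12  # avoid division by zero / log of zero
    local_snr_db = 20.0 * np.log10((target_mag + eps) / (noise_mag + eps))
    return (local_snr_db > lc_db).astype(np.float32)

# Toy example: 3 frequency bins x 2 frames, unit-magnitude noise.
target = np.array([[2.0, 0.1], [1.0, 1.0], [0.5, 4.0]])
noise = np.ones((3, 2))
mask = ideal_binary_mask(target, noise, lc_db=0.0)
# Units where target magnitude strictly exceeds noise magnitude survive.
```

The mask is then applied by pointwise multiplication with the mixture spectrogram before resynthesis.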

IBM illustration

Subject tests of ideal binary masking
- IBM separation leads to large speech intelligibility improvements
  - Improvement for stationary noise is above 7 dB for normal-hearing (NH) listeners (Brungart et al. '06; Li & Loizou '08; Cao et al. '11; Ahmadi et al. '13), and above 9 dB for hearing-impaired (HI) listeners (Anzalone et al. '06; Wang et al. '09)
  - Improvement for modulated noise is significantly larger than for stationary noise
- With the IBM as the goal, the speech separation problem becomes a binary classification problem
  - This new formulation opens the problem to a variety of pattern classification methods
- Brungart and Simpson's (2001) experiments show the coexistence of informational and energetic masking
  - They also isolated informational masking using an across-ear effect: listeners attending to two talkers in one ear experience informational masking from a third signal in the opposite ear (Brungart and Simpson, 2002)
- Arbogast et al. (2002) divided the speech signal into 15 log-spaced, envelope-modulated sine waves, assigning some to the target and some to the interference: intelligible speech with no spectral overlap

Speech perception of noise with binary gains
- Wang et al. (2008) found that, when LC is chosen to be the same as the input SNR, nearly perfect intelligibility is obtained even when the input SNR is -∞ dB (i.e. the mixture contains noise only, with no target speech)
- IBM modulated noise for ???
- Speech-shaped noise
  - Create a continuous noise signal that matches the long-term average spectrum of speech
  - A further approach divides the speech-spectrum-shaped noise into a number of bands and modulates each band with natural speech envelopes
  - However, such a noise signal is too "speech-like" and reintroduces informational masking (Brungart et al. 2004)
- Alternatively, use the ideal binary mask to remove informational masking
  - Eliminate portions of the stimulus dominated by the interfering speech
  - Retain only portions of the target speech that are acoustically detectable in the presence of the interfering speech
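The speech-shaped-noise recipe can be sketched with a one-shot FFT: keep the magnitude spectrum of a speech signal and randomize the phases. This is a simplified stand-in for the procedure in the cited studies, not their exact method; `speech_shaped_noise` and the synthetic input signal are illustrative:

```python
import numpy as np

def speech_shaped_noise(speech, seed=0):
    """Noise whose magnitude spectrum matches that of `speech`:
    keep the speech magnitudes, randomize the phases."""
    rng = np.random.default_rng(seed)
    magnitude = np.abs(np.fft.rfft(speech))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=magnitude.shape)
    # DC and Nyquist bins must stay real for a real-valued inverse FFT
    phase[0] = 0.0
    if len(speech) % 2 == 0:
        phase[-1] = 0.0
    return np.fft.irfft(magnitude * np.exp(1j * phase), n=len(speech))

# Synthetic stand-in for a speech recording: a 5-cycle tone, 8000 samples.
speech = np.sin(2 * np.pi * 5 * np.arange(8000) / 8000)
noise = speech_shaped_noise(speech)
```

A real system would average the magnitude spectrum over many speech frames; the one-shot version above already matches the spectrum of its single input exactly.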

DNN as subband classifier (Wang & Wang '13)
- DNN is used for feature learning from raw acoustic features
  - Train DNNs in the standard way; after training, take the last hidden layer activations as learned features
- Train SVMs using the combination of raw and learned features
  - A linear SVM seems adequate: the weights from the last hidden layer to the output layer essentially define a linear classifier, so the learned features are amenable to linear classification
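The pipeline above (train a DNN, reuse its last hidden layer as features, then train a linear SVM on raw + learned features) can be sketched with scikit-learn on toy data. The two-blob dataset and helper names are illustrative stand-ins for real per-unit acoustic features, not the setup of Wang & Wang '13:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Toy stand-in for per-unit acoustic features: two well-separated classes
# (a real system would use acoustic features extracted per T-F unit).
X = np.vstack([rng.normal(-2.0, 1.0, size=(200, 10)),
               rng.normal(+2.0, 1.0, size=(200, 10))])
y = np.array([0] * 200 + [1] * 200)

# 1) Train the DNN in the standard way.
dnn = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
dnn.fit(X, y)

# 2) After training, take the last hidden layer activations as learned
#    features (manual forward pass through all layers except the output).
def learned_features(model, X):
    h = X
    for W, b in zip(model.coefs_[:-1], model.intercepts_[:-1]):
        h = np.maximum(h @ W + b, 0.0)  # ReLU, MLPClassifier's default
    return h

# 3) Train a linear SVM on the combination of raw and learned features.
X_combined = np.hstack([X, learned_features(dnn, X)])
svm = LinearSVC(C=1.0)
svm.fit(X_combined, y)
```

Excluding the output layer in the forward pass is what yields the "learned features"; the discarded output weights are exactly the linear classifier that the SVM replaces.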

Sound demos
- Speech mixed with unseen, daily noises
  - Cocktail party noise (5 dB): mixture / separated
  - Destroyer noise (0 dB): mixture / separated

Results and sound demos
- Both HI and NH listeners showed intelligibility improvements
- HI subjects with separation outperformed NH subjects without separation

Conclusions
- From auditory masking to the IBM notion, to binary classification for speech separation
  - In other words, separation is about classification, not target estimation
- This new formulation enables the use of supervised learning
  - Extensive training with DNNs is a promising direction
- This approach has yielded the first demonstrations of speech intelligibility improvement in noise