Harmonic-Temporal Clustering of Speech


1 Harmonic-Temporal Clustering of Speech
Jonathan Le Roux, Hirokazu Kameoka, Nobutaka Ono, Alain de Cheveigné, Shigeki Sagayama

2 Motivation and Approach
Goals: precise and robust F0 analysis; analysis of complex and varied acoustical scenes. For speech, applications include speech recognition, prosody analysis, speech enhancement, and speaker identification.

Desirable features of a new pitch determination algorithm (PDA):
- Performance should stay high across a wide range of background noises (white noise, pink noise, noise bursts, music, other speech).
- Simultaneous extraction of the pitch contours of several concurrent voices should be possible.
- Overall speech model: a spectro-temporal model with constraints.

Several existing multi-pitch tracking algorithms perform an initial frame-by-frame analysis, then post-processing to reduce errors and obtain a smooth pitch contour (for example, using HMMs).

We propose to perform estimation and model-based interpolation simultaneously:
- A parametric model of the voiced parts of the power spectrum of speech.
- A noise model, introduced to extract harmonically structured "islands" within a "sea" of unstructured noise.

3 Overview of the method
Distribute audio objects with different acoustical properties:
- Express the whole pitch contour as a smooth curve: a cubic spline.
- Express the harmonic structure as a parametric function: a GMM.
- Express the power envelope in the time direction as a parametric function: a GMM.

Simultaneous optimization of the parameters.

Characteristic: through the harmonicity assumption, the method models the voiced parts of speech.

[Figure: model shown in the time and log-frequency plane.]
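As a rough illustration of the harmonic-structure component, the sketch below evaluates a Gaussian mixture along the log-frequency axis, with one Gaussian per harmonic. It relies only on the fact that the n-th harmonic of a fundamental F0 lies at log(F0) + log(n) in log-frequency. The function names, the shared width `sigma`, and the way the weights are normalized are assumptions made for this example, not details taken from the slides.

```python
import math


def harmonic_centers(log_f0, n_harmonics):
    """Log-frequency positions of harmonics 1..n_harmonics of a given F0."""
    return [log_f0 + math.log(n) for n in range(1, n_harmonics + 1)]


def harmonic_gmm(x, log_f0, weights, sigma):
    """Evaluate a harmonic Gaussian mixture at log-frequency x.

    weights: relative power of each harmonic (normalized here to sum to 1).
    sigma:   common Gaussian width in log-frequency (an assumption of
             this sketch; the slides do not specify it).
    """
    total = sum(weights)
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    density = 0.0
    for center, w in zip(harmonic_centers(log_f0, len(weights)), weights):
        z = (x - center) / sigma
        density += (w / total) * norm * math.exp(-0.5 * z * z)
    return density


# Example: a 100 Hz fundamental with three equally weighted harmonics.
log_f0 = math.log(100.0)
on_harmonic = harmonic_gmm(math.log(200.0), log_f0, [1.0, 1.0, 1.0], 0.02)
between = harmonic_gmm(math.log(250.0), log_f0, [1.0, 1.0, 1.0], 0.02)
```

In this example `on_harmonic` is much larger than `between`, because 200 Hz coincides with the second harmonic while 250 Hz falls between the second and third. In the full method the fundamental would follow the spline pitch contour over time rather than stay fixed.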

5 F0 estimation in noisy environments
Two test conditions:
- Speech mixed with broadband background noise.
- Voiced speech with several types of interference.

Accuracy (%) of the F0 estimation: [results tables not preserved in the transcript]
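The slides report accuracy as a percentage but the transcript does not preserve how it was computed. A minimal sketch of one common way to score an F0 tracker is given below: the share of voiced reference frames whose estimate falls within a relative tolerance of the reference. The function name, the 20% default tolerance, and the convention that a reference value of 0 marks an unvoiced frame are all assumptions of this example.

```python
def f0_accuracy(estimated, reference, tolerance=0.2):
    """Percentage of voiced reference frames estimated within tolerance.

    estimated, reference: per-frame F0 values in Hz.
    A reference of 0 is treated as unvoiced and skipped (assumed convention).
    tolerance: allowed relative deviation from the reference (assumed 20%).
    """
    voiced = [(e, r) for e, r in zip(estimated, reference) if r > 0]
    if not voiced:
        raise ValueError("no voiced frames in the reference")
    correct = sum(1 for e, r in voiced if abs(e - r) <= tolerance * r)
    return 100.0 * correct / len(voiced)


# Example: three voiced frames, one of them an octave error.
score = f0_accuracy([100.0, 210.0, 400.0, 0.0], [100.0, 200.0, 200.0, 0.0])
```

Here two of the three voiced frames are within tolerance and the octave error at 400 Hz is counted as wrong.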

6 Multi-pitch estimation
Co-channel speech of two speakers speaking simultaneously with equal average power.

Test data: Bagshaw database, 150 mixtures, 16 kHz, monaural signal.

Results: [Figure: time-frequency plots, frequency axis from 50 Hz to 8 kHz, time axis from 0 s to 1.3 s; utterances labeled "a-o-i" and "o-i-o-o-u"; annotation: "No second sound here".]
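The slides do not say how the co-channel mixtures were built. One simple way to obtain two voices "with equal average power" is to rescale the second signal so its mean squared amplitude matches the first, then add them sample by sample. The sketch below does that on plain lists of samples; the function names and the truncation to the shorter signal are assumptions of this example.

```python
import math


def average_power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_equal_power(a, b):
    """Mix two signals after scaling b to the average power of a.

    Both signals are truncated to the shorter length first (an assumption
    of this sketch). Returns the mixture and the gain applied to b.
    """
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    power_b = average_power(b)
    if power_b == 0:
        raise ValueError("second signal is silent")
    gain = math.sqrt(average_power(a) / power_b)
    mixture = [x + gain * y for x, y in zip(a, b)]
    return mixture, gain


# Example: b has a quarter of the power of a, so it is scaled by 2.
mixture, gain = mix_equal_power([1.0, -1.0, 1.0, -1.0], [0.5, 0.5, -0.5, -0.5])
```

After scaling, each component of the mixture has the same average power, which matches the equal-power condition described on the slide.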

