
1 9th Conference on Telecommunications – Conftele 2013, Castelo Branco, Portugal, May 8-10, 2013
Automatically distinguishing Styles of Speech
Sara Candeias (1), Dirce Celorico (1), Jorge Proença (1), Arlindo Veiga (1,2), Fernando Perdigão (1,2)
(1) Instituto de Telecomunicações, Polo de Coimbra, Portugal
(2) Universidade de Coimbra, DEEC, Portugal

2 Summary
• Objective
• Characterization of the corpus
• Automatic segmentation: method, performance
• Automatic classification: features, classification method
• Results: speech versus non-speech, read versus spontaneous
• Conclusions and future work
| Conftele 2013 - Castelo Branco, Portugal - May 8-10 2013

3 Objective
• Automatic detection of styles of speech for segmentation of multimedia data
  Speech: Who? What? How? What is the style of a speech segment?
  Example styles: slow, fast, clear, informal, casual, planned, prepared, spontaneous, unprepared, …
• Segment broadcast news samples into the two most evident classes, read versus spontaneous speech (prepared and unprepared speech), using a combination of phonetic and prosodic features
• First explore a speech/non-speech segmentation

4 Characterization of the corpus
Broadcast News audio corpus
• TV Broadcast News MP4 podcasts, downloaded daily
• Audio stream extracted and downsampled from 44.1 kHz to 16 kHz
• 30 daily news programs (~27 hours) were manually segmented and annotated in 4 levels:
  Level 1 – dominant signal: speech, noise, music, silence, clapping, …
  For speech:
  Level 2 – acoustical environment: clean, music, road, crowd, …
  Level 3 – speech style: prepared speech, Lombard speech and 3 levels of unprepared speech (as a function of spontaneity)
  Level 4 – speaker info: BN anchor, gender, public figures, …

5 Characterization of the corpus
• From Level 1: speech versus non-speech
• From Level 3: read speech (prepared) versus spontaneous speech

6 Methods
Automatic Detection
1. Automatic Segmentation (find/mark different segments on the audio signal)
2. Automatic Classification (classify the segments)

7 Methods
1. Automatic segmentation
Based on a modified BIC (Bayesian Information Criterion):
DISTBIC: uses a distance (Kullback-Leibler) in the first step and delta BIC (ΔBIC) to validate marks
[Figure: consecutive segments s(i-1), s(i), s(i+1), s(i+2), … with candidate marks labelled ΔBIC<0 or ΔBIC>0]
Parameters:
• Acoustic vector: 16 Mel-Frequency Cepstral Coefficients (MFCCs) and logarithm of energy (windows of 25 ms, step of 10 ms)
• A threshold of 0.6 of the distance standard deviation was used to select significant local maxima; window size: 2000 ms, step: 100 ms
• Silence segments longer than 0.5 seconds are detected and removed before the DISTBIC process
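The ΔBIC validation step can be sketched as below. This is a minimal illustration, not the authors' implementation. It assumes diagonal covariances (so that it runs on the Python standard library alone), a penalty weight of 1.0, and the sign convention ΔBIC = -R + λP, under which a negative value supports a change point; the slide does not state which convention was used. The names `diag_log_det`, `delta_bic` and `synthetic_frames` are hypothetical.

```python
import math
import random


def diag_log_det(frames):
    """Log-determinant of a diagonal covariance estimated from the frames."""
    n = len(frames)
    total = 0.0
    for k in range(len(frames[0])):
        column = [frame[k] for frame in frames]
        mean = sum(column) / n
        variance = sum((x - mean) ** 2 for x in column) / n
        total += math.log(max(variance, 1e-10))  # floor avoids log(0)
    return total


def delta_bic(left, right, penalty_weight=1.0):
    """Delta BIC for a candidate mark between two windows of feature frames.

    Assumed convention: delta = -R + lambda * P, so a negative value
    suggests that two Gaussians fit better than one (a change point).
    """
    both = left + right
    n, n1, n2 = len(both), len(left), len(right)
    dim = len(both[0])
    # Log-likelihood ratio between the one-model and two-model hypotheses.
    ratio = 0.5 * (n * diag_log_det(both)
                   - n1 * diag_log_det(left)
                   - n2 * diag_log_det(right))
    # A diagonal Gaussian has dim means + dim variances as free parameters.
    penalty = 0.5 * (2 * dim) * math.log(n)
    return -ratio + penalty_weight * penalty


def synthetic_frames(rng, mean, count, dim=2):
    """Toy frames drawn from a unit-variance Gaussian (illustration only)."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(count)]
```

With real data, the frames would be the 17-dimensional acoustic vectors (16 MFCCs plus log energy) on either side of a candidate mark.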

8 Results
Performance measure
Automatic Segmentation:
• Collar (detection tolerance) range: 0.5 s to 2.0 s
• A detected mark is counted as correct if there is a reference mark inside the interval allowed by the collar
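The collar rule above can be sketched as follows. This is an illustrative reading, not the authors' scoring tool: it assumes the collar is a symmetric tolerance (plus or minus `collar` seconds around each mark) and that recall is computed the same way from the reference side. The function name `score_marks` is hypothetical.

```python
def score_marks(detected, reference, collar):
    """Precision, recall and F1 for boundary marks given in seconds.

    A detected mark is correct if some reference mark lies within
    +/- collar seconds of it (assumed symmetric tolerance).
    """
    correct = sum(
        1 for d in detected if any(abs(d - r) <= collar for r in reference)
    )
    found = sum(
        1 for r in reference if any(abs(d - r) <= collar for d in detected)
    )
    precision = correct / len(detected) if detected else 0.0
    recall = found / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Sweeping `collar` from 0.5 to 2.0 would give one F1 and one recall value per tolerance, which is the kind of curve the segmentation performance slides report.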

9 Results
Segmentation performance
[Plot: F1-score (axis 0.3 to 0.8) versus collar, collar range 0.5 s to 2.0 s]

10 Results
Segmentation performance
[Plot: Recall (axis 0.5 to 1.0) versus collar, collar range 0.5 s to 2.0 s]

11 Methods
2. Automatic Classification – Features
A vector of 322 features is computed for each segment.
Phonetic (size of parameter vector for each segment: 214)
• Based on the results of a free phone loop speech recognition
• Phone duration and recognized log likelihood: 5 statistical functions (mean, median, maximum, minimum and standard deviation)
• Silence and speech rate
Prosodic (size of parameter vector for each segment: 108)
• Based on the pitch (F0) and harmonic-to-noise ratio (HNR) envelope
• First and second order statistics
• Polynomial fit of first and second order
• Reset rate (rate of voiced portions)
• Voiced and unvoiced duration rates
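Two of the feature families listed above can be illustrated with a short sketch: the five statistical functions applied to a per-segment sequence (such as phone durations), and a first-order polynomial fit over a contour (such as F0). This is an assumption-laden example, not the authors' feature extractor: the slide does not say whether the standard deviation is the population or sample version (population is used here), and the names `five_stats` and `linear_fit` are hypothetical.

```python
import statistics


def five_stats(values):
    """Mean, median, maximum, minimum and standard deviation of a sequence."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "max": max(values),
        "min": min(values),
        "std": statistics.pstdev(values),  # population std: an assumption
    }


def linear_fit(values):
    """Least-squares first-order fit y = slope * t + intercept.

    t is the sample index 0, 1, 2, ... within the segment.
    """
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = statistics.fmean(values)
    numerator = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    denominator = sum((t - t_mean) ** 2 for t in range(n))
    slope = numerator / denominator if denominator else 0.0
    return slope, y_mean - slope * t_mean
```

For example, `five_stats` could be applied to the phone durations of one segment and `linear_fit` to its F0 values, with the outputs concatenated into the segment's feature vector.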

12 Methods
Classification
SVM (Support Vector Machine) classifiers (WEKA tool, linear kernel, C=14):
• speech / non-speech
• read / spontaneous
2-step classification approach:
[Diagram: the speech / non-speech classification outputs non-speech or speech; speech segments go to the read / spontaneous classification, which outputs spontaneous or read]
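The two-step approach can be sketched as a simple cascade. The example below is only an illustration of the control flow: the authors trained their SVMs in WEKA, whereas here each classifier is stood in for by a hand-written linear decision function (sign of w·x + b), and all weights, biases and names (`linear_decision`, `classify_segment`) are hypothetical.

```python
from typing import Callable, Sequence

Features = Sequence[float]
Decision = Callable[[Features], bool]


def linear_decision(weights: Sequence[float], bias: float) -> Decision:
    """Return a decision function that is True when w.x + b > 0.

    This mimics how a trained linear-kernel SVM decides; the weights
    here are placeholders, not learned values.
    """
    def decide(features: Features) -> bool:
        score = sum(w * x for w, x in zip(weights, features)) + bias
        return score > 0.0
    return decide


def classify_segment(features: Features,
                     is_speech: Decision,
                     is_read: Decision) -> str:
    """Step 1: speech versus non-speech. Step 2: read versus spontaneous."""
    if not is_speech(features):
        return "non-speech"
    return "read" if is_read(features) else "spontaneous"
```

Only segments accepted as speech by the first classifier reach the second, so errors in the first step propagate to the read / spontaneous decision.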

13 Results
Automatic detection (automatic segmentation + classification)
Agreement time = % of frames correctly classified
• Speech / non-speech detection
• Read / spontaneous detection
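The agreement-time measure can be sketched as a frame-level comparison between reference and automatic segmentations. This is a minimal illustration under stated assumptions: the frame step of 10 ms is assumed (the slide does not give one), segments are `(start, end, label)` tuples in seconds, and the names `frame_labels` and `agreement_time` are hypothetical.

```python
def frame_labels(segments, n_frames, step=0.01, default="none"):
    """Expand (start, end, label) segments into one label per frame."""
    labels = [default] * n_frames
    for start, end, label in segments:
        first = int(round(start / step))
        last = min(int(round(end / step)), n_frames)
        for i in range(first, last):
            labels[i] = label
    return labels


def agreement_time(reference, hypothesis, total_duration, step=0.01):
    """Percentage of frames whose hypothesis label matches the reference."""
    n_frames = int(round(total_duration / step))
    if n_frames == 0:
        return 0.0
    ref = frame_labels(reference, n_frames, step)
    hyp = frame_labels(hypothesis, n_frames, step)
    matches = sum(1 for r, h in zip(ref, hyp) if r == h)
    return 100.0 * matches / n_frames
```

Because this score is computed on frames, it penalizes both misplaced boundaries and wrong labels, which is why it differs from the accuracy obtained with manual segmentation.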

14 Results
Classification only (using the given manual segmentation)
Accuracy (%)
• Speech / non-speech classifier
• Read / spontaneous classifier

15 Conclusions and future work
• Read speech can be distinguished from spontaneous speech with reasonable accuracy.
• Results were obtained with only a few simple measures of the speech signal.
• A combination of phonetic and prosodic features provided the best results (both seem to carry important and complementary information).
• We have already implemented several important features, such as hesitation detection, aspiration detection using word spotting techniques, speaker identification using GMM, and jingle detection based on audio fingerprinting.
• We intend to automatically segment all audio genres and speaking styles.

16 THANK YOU

17 Appendix – BIC
BIC (Bayesian Information Criterion)
Dissimilarity measure between 2 consecutive segments
Two hypotheses:
H0 – No change of signal characteristics. Model: 1 Gaussian, X ~ N(μ, Σ)
H1 – Change of characteristics. Model: 2 Gaussians, X1 ~ N(μ1, Σ1) and X2 ~ N(μ2, Σ2)
μ – mean vector; Σ – covariance matrix
Maximum likelihood ratio between H0 and H1 (formula not preserved)
[Figure: window X split into consecutive segments X1 and X2]

18 Appendix – BIC
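The formulas on the appendix slides did not survive in this transcript. As a hedged reconstruction, the BIC change-detection criterion for the two hypotheses is commonly written as below. This is the textbook form and may differ in notation or sign from the authors' slides. Here N, N_1 and N_2 are the numbers of frames in X, X_1 and X_2, d is the feature dimension, and λ is a penalty weight; all of these symbols are assumptions introduced for the illustration.

```latex
\begin{align}
  H_0 &: \; X \sim \mathcal{N}(\mu, \Sigma) \\
  H_1 &: \; X_1 \sim \mathcal{N}(\mu_1, \Sigma_1), \quad
          X_2 \sim \mathcal{N}(\mu_2, \Sigma_2) \\
  R &= \frac{N}{2}\log|\Sigma|
       - \frac{N_1}{2}\log|\Sigma_1|
       - \frac{N_2}{2}\log|\Sigma_2| \\
  P &= \frac{1}{2}\left(d + \frac{d(d+1)}{2}\right)\log N \\
  \Delta\mathrm{BIC} &= -R + \lambda P
\end{align}
```

Under this sign convention, a negative ΔBIC favours H1 (a change between X_1 and X_2), and a positive value favours H0. Some references define ΔBIC with the opposite sign, in which case the interpretation is reversed.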

