Macquarie RT05s Speaker Diarisation System Steve Cassidy Centre for Language Technology Macquarie University Sydney.

1 Macquarie RT05s Speaker Diarisation System
Steve Cassidy
Centre for Language Technology, Macquarie University, Sydney

2 System Goals
©September 15 Macquarie University
Develop a simple end-to-end system for the SPKR task
Platform for experimentation
Improve on RT04s system

3 Overall Results

4 System Overview
Feature Extraction → SAD → Segmentation → Turn Clustering → Speaker ID
Single Distant Microphone
Implemented in C and Tcl
Runs in around 6x real time on single AMD64
Developed with RT04 devtest data
– No AMI or VT data seen before eval

5 Feature Extraction
26 coefficients:
– 12 MFCC
– RMS Energy
– Delta Coefficients
10ms frame rate, 25.6ms window
Mean subtraction based on mean of first 60 seconds of file
Uses the KTH Snack toolkit
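
The mean subtraction and delta steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the system's C/Tcl code: the function names are hypothetical, frames are assumed to be plain lists of 13 static coefficients (12 MFCC plus RMS energy), and the delta is assumed to be a simple two-point difference, since the slide does not give the delta formula.

```python
def subtract_initial_mean(frames, frame_rate_hz=100, seconds=60.0):
    """Subtract the per-coefficient mean of the first `seconds` of the file.

    frame_rate_hz=100 corresponds to the 10 ms frame rate on the slide.
    """
    n_init = min(len(frames), int(round(frame_rate_hz * seconds)))
    if n_init == 0:
        return []
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames[:n_init]) / n_init for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


def append_deltas(frames):
    """Append (x[t+1] - x[t-1]) / 2 to each frame, clamping at the edges.

    Assumption: a two-point difference stands in for the unspecified
    delta computation.
    """
    last = len(frames) - 1
    out = []
    for t, frame in enumerate(frames):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, last)]
        out.append(list(frame) + [(n - p) / 2.0 for n, p in zip(nxt, prev)])
    return out
```

With 13 static coefficients per frame, append_deltas yields the 26 coefficients listed above.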

6 Speech Activity Detection
Goal: find obvious regions of non-speech for gross segmentation of recording
GMMs for speech and non-speech
– Speech model: 32 mixtures
– Non-speech model: 8 mixtures
Trained on RT04s devtest data set
– Reference labels generated from speaker labelling
– Ignored silence regions < 0.3s
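
As a rough illustration of how a frame is scored against the speech and non-speech GMMs, the sketch below computes a GMM log-likelihood in plain Python. It assumes diagonal covariances (the slide does not state the covariance type), and the function names and parameter layout are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """log sum_k w_k N(x; mean_k, diag(var_k)), computed with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

A frame would be labelled speech when its log-likelihood under the speech model exceeds that under the non-speech model; the real models would have 32 and 8 components respectively.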

7 Speech Activity Detection
Evaluate frame classification error (%):
Dataset | NSPER | SPER
RT04s unseen | 32 | 19
RT05s | 47 | 15

8 Speech Activity Detection
SAD is performed by classifying successive windows of 10 frames using the GMM models
Consecutive regions are merged and labelled
Non-speech < 0.35s merged with following segment
Speech < 0.15s merged with following non-speech
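
The merging rules on this slide could look roughly like the following Python sketch. It takes one label per 10-frame window, works in units of 10 ms frames, and applies the non-speech rule before the speech rule. That ordering, the label strings "speech" and "nonspeech", and the function names are all assumptions made for illustration.

```python
FRAMES_PER_WINDOW = 10      # from the slide: windows of 10 frames
MIN_NONSPEECH_FRAMES = 35   # 0.35 s at a 10 ms frame rate
MIN_SPEECH_FRAMES = 15      # 0.15 s at a 10 ms frame rate


def merge_runs(window_labels):
    """Collapse consecutive identical window labels into [label, start, end]."""
    segments = []
    for i, label in enumerate(window_labels):
        end = (i + 1) * FRAMES_PER_WINDOW
        if segments and segments[-1][0] == label:
            segments[-1][2] = end
        else:
            segments.append([label, i * FRAMES_PER_WINDOW, end])
    return segments


def absorb_short(segments, label, min_frames):
    """Give short `label` segments the label of the following segment, then re-merge."""
    out = []
    for idx, (lab, start, end) in enumerate(segments):
        if lab == label and end - start < min_frames and idx + 1 < len(segments):
            lab = segments[idx + 1][0]
        if out and out[-1][0] == lab:
            out[-1][2] = end
        else:
            out.append([lab, start, end])
    return out


def smooth(window_labels):
    """Apply both merging rules; returns (label, start_frame, end_frame) tuples."""
    segments = merge_runs(window_labels)
    segments = absorb_short(segments, "nonspeech", MIN_NONSPEECH_FRAMES)
    segments = absorb_short(segments, "speech", MIN_SPEECH_FRAMES)
    return [tuple(s) for s in segments]
```

Because only two labels exist, merging a short segment with the one that follows also joins it to the preceding segment of that label.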

9 Speech Activity Detection
Evaluation
– Frame classification error
– Boundaries missed: nothing within 0.5s
– Boundaries inserted inside real segments
Meeting | Frame Error % (NSPER, SPER) | Boundary Error % (Miss, FP) | # Auto
CMU 1415 | 897917745
ICSI 1100 | 994858899
NIST 0939 | 719838497
AMI 1206 | 43182579348
VT 1430 | 100099502
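
One plausible reading of the two boundary measures is sketched below in Python: a reference boundary counts as missed when no automatic boundary lies within 0.5 s of it, and an automatic boundary counts as a false positive when no reference boundary lies within 0.5 s of it. The choice of denominators (misses over reference boundaries, false positives over automatic boundaries) is an assumption, as is the function name.

```python
def boundary_errors(reference, automatic, tolerance_s=0.5):
    """Return (miss %, false-positive %) for boundary times given in seconds."""

    def unmatched(points, others):
        # Count points with no counterpart within the tolerance.
        return sum(
            1 for p in points if all(abs(p - o) > tolerance_s for o in others)
        )

    miss = unmatched(reference, automatic)
    false_pos = unmatched(automatic, reference)
    miss_pct = 100.0 * miss / len(reference) if reference else 0.0
    fp_pct = 100.0 * false_pos / len(automatic) if automatic else 0.0
    return miss_pct, fp_pct
```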

10 Turn Segmentation
Speech regions are segmented using BIC criterion
Compare fit of single gaussian model of sequence with pair of models each side of break
Fixed windows of 200 frames advanced over speech region
Peaks in delta BIC curve indicate change points
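
The delta BIC comparison for one window can be sketched as follows. This is a simplified Python illustration that assumes diagonal-covariance Gaussians (so each model has 2d parameters); the slide does not say whether full or diagonal covariances were used. The penalty weight, the margin that keeps both sides long enough to estimate a variance, and the function names are hypothetical.

```python
import math


def log_det_diag(frames, floor=1e-6):
    """Log determinant of the diagonal ML covariance of a frame sequence."""
    n, dim = len(frames), len(frames[0])
    total = 0.0
    for d in range(dim):
        mean = sum(f[d] for f in frames) / n
        var = sum((f[d] - mean) ** 2 for f in frames) / n
        total += math.log(max(var, floor))
    return total


def delta_bic(frames, split, penalty_weight=1.0):
    """Positive values favour two models (a change at `split`) over one."""
    left, right = frames[:split], frames[split:]
    n, dim = len(frames), len(frames[0])
    gain = 0.5 * (
        n * log_det_diag(frames)
        - len(left) * log_det_diag(left)
        - len(right) * log_det_diag(right)
    )
    # Two-model hypothesis adds one extra mean and variance vector: 2 * dim parameters.
    penalty = 0.5 * penalty_weight * (2 * dim) * math.log(n)
    return gain - penalty


def best_change_point(frames, margin=10, penalty_weight=1.0):
    """Return the split with the highest positive delta BIC, or None."""
    scores = [
        (delta_bic(frames, s, penalty_weight), s)
        for s in range(margin, len(frames) - margin + 1)
    ]
    if not scores:
        return None
    score, split = max(scores)
    return split if score > 0 else None
```

In the system described here, a window of 200 frames would be advanced over each speech region and peaks in the resulting curve taken as change points; the sketch only handles a single window.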

11 Turn Segmentation

12 Turn Clustering
Given a set of speaker turns, find natural clusters
Number of clusters unknown
Requires:
– Distance metric on speaker turns
– Clustering algorithm
– Cluster evaluation metric

13 Speaker Similarity
Mean + variance of feature vectors
K-L distance metric
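
A distance of this kind can be sketched in Python by fitting a diagonal Gaussian (per-coefficient mean and variance) to each turn and summing the Kullback-Leibler divergence in both directions. The symmetrisation and the variance floor are assumptions, since the slide names only the K-L distance; the function names are hypothetical.

```python
import math


def mean_and_variance(frames, floor=1e-6):
    """Per-coefficient mean and floored ML variance of a turn's frames."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [
        max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, floor)
        for d in range(dim)
    ]
    return mean, var


def kl_diag(mean_p, var_p, mean_q, var_q):
    """KL(p || q) between two diagonal-covariance Gaussians."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q)
    )


def symmetric_kl(turn_a, turn_b):
    """KL(a || b) + KL(b || a) on the Gaussians fitted to two turns."""
    mean_a, var_a = mean_and_variance(turn_a)
    mean_b, var_b = mean_and_variance(turn_b)
    return kl_diag(mean_a, var_a, mean_b, var_b) + kl_diag(mean_b, var_b, mean_a, var_a)
```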

14 Turn Clustering
Implementation:
– Select segments longer than 1.5s for clustering
– KL distance on mean/variance of features
– Hierarchical clustering
– Select labellings for 2, 3…N speakers
– Cluster evaluation performed after speaker ID
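
The selection and clustering steps listed above can be sketched in Python as below. The code filters turns by duration, then runs a bottom-up clustering over a precomputed distance matrix and records one labelling per speaker count from 2 up to a chosen maximum. Average linkage is an assumption (the slide says only "hierarchical clustering"), and the function names are hypothetical.

```python
def select_turns(turns, min_duration_s=1.5):
    """Keep (start, end) turns strictly longer than min_duration_s seconds."""
    return [(s, e) for s, e in turns if e - s > min_duration_s]


def agglomerate(distance, max_speakers):
    """Average-linkage clustering; returns {k: labels} for k = 2..max_speakers."""
    n = len(distance)
    clusters = [[i] for i in range(n)]
    labellings = {}

    def snapshot():
        labels = [0] * n
        for c, members in enumerate(clusters):
            for m in members:
                labels[m] = c
        return labels

    while len(clusters) >= 2:
        if len(clusters) <= max_speakers:
            labellings[len(clusters)] = snapshot()
        if len(clusters) == 2:
            break
        # Find the pair of clusters with the smallest average pairwise distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = len(clusters[a]) * len(clusters[b])
                d = sum(distance[i][j] for i in clusters[a] for j in clusters[b]) / pairs
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return labellings
```

Each labelling would then be passed on to the speaker ID stage before the cluster evaluation picks a speaker count.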

15 Speaker ID
Use cluster labelled turns to train speaker models
– 32 mixture GMM
Now classify and re-label all speaker turns
Potentially correct poor clustering decisions
Very small amounts of data to support models
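
The train-then-relabel idea can be illustrated with a much simpler model than the one on the slide. The Python sketch below pools the frames of each cluster, fits a single diagonal Gaussian per speaker as a stand-in for the 32 mixture GMM, and assigns every turn to the speaker whose model gives the highest average log-likelihood. The function names, the variance floor, and the use of None for turns that were not clustered are assumptions.

```python
import math


def fit_diag_gaussian(frames, floor=1e-3):
    """Fit per-coefficient mean and floored variance (stand-in for a GMM)."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [
        max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, floor)
        for d in range(dim)
    ]
    return mean, var


def avg_log_likelihood(frames, model):
    """Average per-frame log-likelihood of a turn under one speaker model."""
    mean, var = model
    total = 0.0
    for frame in frames:
        total += -0.5 * sum(
            math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
            for x, m, v in zip(frame, mean, var)
        )
    return total / len(frames)


def relabel(turns, initial_labels):
    """Train one model per cluster label, then re-label every turn.

    turns: list of frame lists; initial_labels: cluster label per turn,
    or None for turns that were too short to be clustered.
    """
    pooled = {}
    for frames, label in zip(turns, initial_labels):
        if label is not None:
            pooled.setdefault(label, []).extend(frames)
    models = {label: fit_diag_gaussian(frames) for label, frames in pooled.items()}
    return [
        max(models, key=lambda lab: avg_log_likelihood(frames, models[lab]))
        for frames in turns
    ]
```

In the toy case where one turn was wrongly clustered with another speaker, re-labelling can move it back, which is the correction the slide refers to.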

16 Overall Results

17 What Didn’t Work
Inter-channel phase and level differences
Exemplar speaker models
SVD based turn clustering
– Find similar groups by factoring the distance matrix
– One product of SVD is a number of clusters
