
1 Speaker Diarisation and Large Vocabulary Recognition at CSTR: The AMI/AMIDA System
Fergus McInnes, 7 December 2011
- History – AMI, AMIDA and recent developments
- Architecture – processing graph, modules, directories
- Getting and running the system
- Points for further work

2 History (1): AMI and AMIDA
- AMI Project (Augmented Multi-party Interaction): Edinburgh, Sheffield, Brno, Twente, IDIAP, ICSI, TNO and others; 2004-2006
  - Capture, analysis, indexing and browsing of meeting data
- AMIDA Project (AMI with Distance Access): AMI Consortium as above; 2006-2009
  - Extension to videoconferences
- Meeting transcription system:
  - Modules developed by multiple partners
  - Multiple versions: IHM/MDM, offline/online, different architectures and platforms
  - Took part in NIST Rich Transcription evaluations in 2007 and 2009 – hence the “RT07” and “RT09” versions

3 History (2): Developments at CSTR
- Legacy of AMI and AMIDA: RT09 offline system for multiple distant microphones (MDM), running on the ECDF compute server Eddie (also individual headset microphone (IHM) and RT07 systems) – used since 2009 by several people at CSTR
- Developments in 2011 (FRM – Cisco project):
  - Documentation written
  - Scripts and config files tidied up
  - Changes to a few modules
  - Files placed in a new Subversion repository
  - Additional modules and interfacing to support the PodCastle application

4 Architecture (1): Overall structure
Audio signals from multiple microphones
→ Beamforming
→ Padding and noise reduction
→ Speech/non-speech segmentation
→ Speaker diarisation
→ Speech recognition
→ Speaker-attributed text with timings and scores
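As an illustration of the data flow, here is a minimal Python sketch of the chain above. Every function and field name is a hypothetical stand-in for a real module – in the actual system the stages are wired together by the ROTK framework (slide 6), not by Python calls.

```python
"""Minimal sketch of the MDM processing chain; all functions are
hypothetical stand-ins for the real ROTK modules."""
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    speaker: str    # diarisation cluster label
    start: float    # seconds
    end: float

def beamform(channel_wavs: List[str]) -> str:
    # Combine the multiple distant-microphone signals into one channel.
    return "beamformed.wav"   # placeholder path

def pad_and_denoise(wav: str) -> str:
    # Padding and noise reduction before segmentation.
    return wav

def detect_speech(wav: str) -> List[Turn]:
    # Speech/non-speech segmentation.
    return [Turn("?", 0.0, 1.0)]

def diarise(wav: str, segments: List[Turn]) -> List[Turn]:
    # Assign a speaker label to each segment ("who spoke when").
    return [Turn("spk1", s.start, s.end) for s in segments]

def recognise(wav: str, turn: Turn) -> List[str]:
    # Multi-pass recognition (detailed on the next slide).
    return ["hello"]

def run_pipeline(channel_wavs: List[str]) -> List[dict]:
    audio = pad_and_denoise(beamform(channel_wavs))
    turns = diarise(audio, detect_speech(audio))
    # Final output: speaker-attributed text with timings (and, in the
    # real system, confidence scores).
    return [{"speaker": t.speaker, "start": t.start, "end": t.end,
             "words": recognise(audio, t)} for t in turns]
```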

5 Architecture (2): Details of speech recognition
Waveform
→ PLP coding, CMN/CVN
→ VTLN estimation (warp factor per speaker)
→ PLP with VTLN, CMN/CVN, HLDA
→ Decoding pass 1
→ CMLLR adaptation
→ Decoding pass 2
→ Fbank with VTLN, CMN/CVN, HLDA
→ Feature merging, CMN/CVN
→ Decoding pass 3
→ CMLLR adaptation, MLLR adaptation
→ Decoding pass 4
Decoding is by Juicer; the HMMs and LM differ from passes 1-2 to passes 3-4.
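A hypothetical sketch of the four-pass control flow, with one-line stubs standing in for the real feature extraction (HTK/STK tools) and decoding (Juicer); none of these function names exist in the actual system.

```python
"""Illustrative sketch of the four-pass decoding flow; all functions
are made-up stand-ins for ROTK modules, not real APIs."""

def estimate_vtln_warp(wav):     return 1.0            # warp factor per speaker
def extract_plp(wav, warp):      return "plp-feats"    # PLP + VTLN, CMN/CVN, HLDA
def extract_fbank(wav, warp):    return "fbank-feats"  # Fbank + VTLN, CMN/CVN, HLDA
def merge_features(a, b):        return (a, b)         # merged stream, CMN/CVN
def decode(feats, models, transforms=()):   return "hypothesis"  # Juicer stand-in
def estimate_cmllr(feats, hyp, models):     return "cmllr-transform"
def estimate_mllr(feats, hyp, models):      return "mllr-transform"

def recognise_speaker(wav, stage1_models, stage2_models):
    warp = estimate_vtln_warp(wav)
    plp = extract_plp(wav, warp)

    hyp1 = decode(plp, stage1_models)                        # pass 1
    cmllr = estimate_cmllr(plp, hyp1, stage1_models)
    hyp2 = decode(plp, stage1_models, [cmllr])               # pass 2

    # Passes 3-4 use different HMMs and LM, and merged PLP+Fbank features.
    merged = merge_features(plp, extract_fbank(wav, warp))
    hyp3 = decode(merged, stage2_models)                     # pass 3
    cmllr2 = estimate_cmllr(merged, hyp3, stage2_models)
    mllr = estimate_mllr(merged, hyp3, stage2_models)
    return decode(merged, stage2_models, [cmllr2, mllr])     # pass 4
```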

6 Architecture (3): ROTK framework (Resource Optimisation Toolkit)
Modules (strictly, module instances) are connected together in a processing graph (“mpg” file), which is read by the Python script sgproc; sgproc creates the directories, calls the runmod script for each module instance (MI) and keeps track of progress (see the sketch below).
Directory structure per module instance:
  <MI directory>                              [created by sgproc]
    idal/                                     links to data files in preceding MIs’ out directories
      in.dlp                                  list of dal files & job numbers
      ms0000.dal ms0001.dal ...               data lists
    out/
      odal/                                   output data files
        out.dlp                               list of dal files & job numbers
        ms0000.dal ms0001.dal ...             data lists
    working files and subdirectories          module-specific [created by runmod]
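To make the sgproc/runmod division of labour concrete, here is an illustrative sketch (not sgproc's actual code) in which the processing graph is reduced to a hard-coded dict and runmod scripts are invoked if present:

```python
"""Illustrative sketch of sgproc's job: walk the processing graph,
set up each module instance's directories, and run its runmod script.
The graph and MI names are invented; the real graph comes from an
"mpg" file."""
import os
import subprocess

# Module instance -> list of predecessor MIs.
GRAPH = {
    "beamform": [],
    "segment": ["beamform"],
    "diarise": ["segment"],
    "decode": ["diarise"],
}

def run_graph(graph, root):
    done = set()
    while len(done) < len(graph):
        for mi, preds in graph.items():
            if mi in done or any(p not in done for p in preds):
                continue
            mi_dir = os.path.join(root, mi)
            # idal holds links to data files in preceding MIs' out dirs.
            os.makedirs(os.path.join(mi_dir, "idal"), exist_ok=True)
            for p in preds:
                src = os.path.join(root, p, "out", "odal")
                dst = os.path.join(mi_dir, "idal", p)
                if not os.path.islink(dst):
                    os.symlink(src, dst)
            os.makedirs(os.path.join(mi_dir, "out", "odal"), exist_ok=True)
            # Each module supplies its own runmod script.
            runmod = os.path.join(mi_dir, "runmod")
            if os.path.exists(runmod):
                subprocess.run([runmod], cwd=mi_dir, check=True)
            done.add(mi)

run_graph(GRAPH, "work")
```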

7 Architecture (4): Parallel processing
The details depend on each module's runmod script, but typically:
- Different recording sessions, and speakers within each session, are processed in parallel (some modules also subdivide a speaker's data if the amount is large)
- runmod (or a subsidiary script) submits jobs to the grid and records the job numbers
- Jobs for a later MI may be submitted before the input data from an earlier MI are ready – using the “-w <jobids>” option to submitjob, which calls qsub with “-hold_jid <jobids>” (see the sketch below)
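A sketch of that submission pattern, calling Grid Engine's qsub directly rather than through the submitjob wrapper; the -terse and -hold_jid options are standard SGE, but the script names are invented:

```python
"""Illustrative sketch of chained grid submission with SGE job holds."""
import subprocess

def submit(script, hold_ids=None):
    """Submit a job; optionally hold it until the given jobs finish."""
    cmd = ["qsub", "-terse"]
    if hold_ids:
        # -hold_jid makes the job wait for its input-producing jobs.
        cmd += ["-hold_jid", ",".join(hold_ids)]
    cmd.append(script)
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return out.stdout.strip()   # -terse prints just the job ID

# Submit per-session jobs for an early MI, then a later MI's job that
# must wait for them -- it can be queued before their outputs exist.
seg_jobs = [submit(f"segment_session{i}.sh") for i in (1, 2, 3)]
submit("diarise_all.sh", hold_ids=seg_jobs)
```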

8 Architecture (5): File locations
In the Subversion repository (https://svn-kerberos.ecdf.ed.ac.uk/repo/inf/cstr/rec/trunk):
- pkg/rotk/b0013 – sgproc and system utilities
- pkg/jet/v0.04 – submitjob and gridenv.*.csh files
- pkg/mod/ – module-specific files: runmod, subsidiary scripts, source code
- mdm/mpg/*.mpg – processing graphs
- mdm/cfg/*.env – config files for all module instances
- mdm/global.cfg – template for the global config file
- mdm/run-mdm.* – templates for top-level scripts that call sgproc with specific processing graphs
On Eddie (in /exports/work/inf_hcrc_cstr_nst/amiasr/asrcore – locations specified in config files):
- exp/sysopt/bin/*/* – program files (sox, SHoUT, HTK, STK, Juicer etc)
- exp/sysopt/mdm-sys09dev/lib/*/* – HMMs, language models etc

9 Getting and running the system
- Get an account on Eddie
- Get access to the repository (give me your UUN)
- Create a system directory <sysdir> and check out a copy of the system there
- Run copy_files (to copy and compile some binary files)
- Create a working directory <workdir> (somewhere with enough space – probably best under /exports/work)
- Copy global.cfg and run-mdm.pad to <workdir> and edit them to specify <sysdir>, <workdir> and your project code
- Create <workdir>/data and put your wav files there (src-<source>_ses-<session>_chn-<channel>.wav – 16kHz mono)
- Create a list of the data files (without the “.wav” extension) in <workdir>/default.dal (see the helper sketch below)
- Run run-mdm.* – the results will appear in <workdir>/JU-M1-CMLLR4_MLLR32_0_D/out
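For the data-listing step, a small hypothetical helper (not part of the system) that writes default.dal from the wav files in <workdir>/data:

```python
"""Write <workdir>/default.dal: one entry per wav file, extension stripped."""
import os
import sys

workdir = sys.argv[1]                      # path to your working directory
names = sorted(
    f[:-len(".wav")]                       # list entries omit the extension
    for f in os.listdir(os.path.join(workdir, "data"))
    if f.endswith(".wav")                  # expects src-*_ses-*_chn-*.wav, 16kHz mono
)
with open(os.path.join(workdir, "default.dal"), "w") as dal:
    dal.write("\n".join(names) + "\n")
print(f"wrote {len(names)} entries to default.dal")
```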

10 Issues and points for further work
- We don't have source code for some of the programs (e.g. SFeaCat, SfeaStack) – if possible we should replace these with our own or other open-source equivalents
- Many of the scripts are opaque (tcsh scripts calling Perl, building other scripts and then running them, etc)
- Licensing for some components is too restrictive for a commercial application
- The use of Juicer makes it difficult to adapt the LM and vocabulary – desirable for many applications

