
1 Quick Transcription of Fisher Data with WordWave
Owen Kimball, Rukmini Iyer, Chia-lin Kao, Thomas Colthurst, John Makhoul

2 Outline
- Project Goals
- Quick Transcription Conventions
- How BBN post-processes transcripts
- Automatic segmentation algorithm
- Initial experiment with more training data
- Data Delivery Schedule

3 Motivation
- Given the success of quickly transcribed Switchboard II & Cell data from CTRAN, we want to try the same approach with Fisher data.
- In March 2003, found a new transcription service, WordWave, that offers better price, quality, and volume (hours per week) than CTRAN.
  - Charges $165 per hour of transcribed data
  - 100+ hours per week possible
- Goal: 1800 hours of Fisher conversations quickly transcribed by the end of calendar year 2003

4 Process Overview
- QTR Process
  - LDC ships audio to WWave and BBN
  - WWave does quick transcription, sends to BBN
  - BBN does error correction, post-processing, and segmentation, sends to LDC
  - LDC publishes to community

5 Quick Transcription Conventions
- WordWave had an existing style guide
  - Detailed, but oriented toward human readability: careful formatting of times, dates, lists, etc. for maximum readability
- BBN and WWave modified the guide for conversational speech transcription
  - For example:
    - No numerals: numbers written as words
    - No abbreviations
    - Side identifiers required for each speaker
  - Tried to retain WordWave's extra information where possible, e.g. punctuation
  - Iterative process with feedback from transcribers
  - In April, circulated a version to the community for feedback

6 Sample Transcript
L: Oh, so you think it was fear that kept Iraq from using it.
R: Right. And what happens is --
L: But yet tha- --
R: What happens is suppose they get Saddam Hussein, which they eventually will, he's got one less thing to go against him. I mean if he were to use that, he might as well commit suicide because he's going to be captured and, you know. But the US also makes a lot of, uh, you know, treaties with other people. Like saying, "Okay, if you give up then you can come live in our country and we'll take care of you". Like Marcos, right?
L: Yeah.
R: "We'll overthrow you but, yeah, you can still come live here", you know.
L: Right. I don't think they've done that to Saddam Hussein yet.
R: [LAUGH] Ah, no. No.

7 BBN's Post Processing
- Primary purpose is to add time information (utterance begin and end points) to WWave transcripts.
  - Utterance length should be short enough that the BBN trainer doesn't choke.
- To handle large volumes of data, tried to create a process requiring minimal manual effort.
- Post-processing steps include:
  - Error correction
  - Format conversion and OOV handling
  - Auto-segmentation: the trickiest part; we are still trying to improve it.

8 Transcript Error Correction Process
- Step 1: Auto-fix script (sketched with Step 2 below)
  - Corrects simple errors, e.g., mismatched bracket types, missing spaces after punctuation, etc.
- Step 2: Script to detect remaining errors
  - Reads in the auto-fixed output and a dictionary
  - Checks for things requiring human judgement, e.g., missing side identifier ("L" or "R"), illegal characters, unknown non-speech words, etc.
  - Flags all new OOV words
- Step 3: Human correction
  - All errors and OOVs found in Step 2 are manually checked
  - Built an Emacs-based tool for this; using 'compile-mode' it jumps to the location of the next error found by the script
  - Human corrects errors, especially looking for typos flagged as OOVs
  - Takes about 0.2 x RT
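
The scripts themselves are not shown in the slides; as a rough illustration of Steps 1 and 2, here is a minimal Python sketch. The specific fix rules, the non-speech marker inventory, and the dictionary format are assumptions for illustration, not BBN's actual code.

import re

def auto_fix(line):
    """Step 1 (sketch): repair simple mechanical errors automatically."""
    # Close non-speech markers that open with '[' but end with ')' or '}'
    line = re.sub(r"\[([A-Z]+)[)\}]", r"[\1]", line)
    # Insert a missing space after punctuation, e.g. "yet.R:" -> "yet. R:"
    line = re.sub(r"([.,!?])([A-Za-z])", r"\1 \2", line)
    return line

def detect_errors(line, dictionary, legal_nonspeech=("[LAUGH]", "[NOISE]")):
    """Step 2 (sketch): flag items that need human judgement."""
    errors = []
    if not re.match(r"[LR]:", line):
        errors.append("missing side identifier")
    for marker in re.findall(r"\[[A-Z]+\]", line):
        if marker not in legal_nonspeech:
            errors.append("unknown non-speech word: " + marker)
    # Everything left over is checked against the dictionary
    text = re.sub(r"^[LR]:|\[[A-Z]+\]", " ", line)
    for word in re.findall(r"[A-Za-z'-]+", text):
        if word.lower() not in dictionary:
            errors.append("OOV (possible typo): " + word)
    return errors

Running detect_errors over each auto-fixed line yields the list of locations that the Emacs tool then steps through for Step 3.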

9 Format Conversion and OOV Handling
- Convert from WordWave format to SNOR
  - Remove all punctuation
  - Upcase everything
  - Generate list of OOVs
- For CTRAN, created OOV pronunciations manually
  - Most time-consuming part of processing those transcripts
- For WWave, created a script to automatically find OOV pronunciations (see the sketch below)
  - Generates concatenated letter pronunciations for new acronyms
  - Strips prefixes and suffixes and breaks compounds; looks up the resulting baseform in the dictionary; reconstructs the pronunciation if found
  - If the above fails, uses the TTS system Orator to generate a pronunciation
    - Very uneven quality, but good enough for the segmentation algorithm
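
For concreteness, a hedged Python sketch of the conversion and the pronunciation fallback chain. The letter-name table, phone symbols, and suffix rules below are illustrative stand-ins; the actual phone set, dictionary format, and Orator interface are not shown.

import re

# Illustrative letter-name pronunciations for acronym handling (assumed,
# not BBN's actual phone set or dictionary format).
LETTER_PRONS = {"A": "ey", "B": "b iy", "C": "s iy", "S": "eh s", "U": "y uw"}

def to_snor(line):
    """Convert one transcript line to SNOR: strip punctuation, upcase."""
    line = re.sub(r"[^\w\s'-]", " ", line)  # drop punctuation, keep ' and -
    return " ".join(line.split()).upper()

def guess_pron(word, dictionary):
    """Sketch of the automatic OOV pronunciation fallback chain."""
    if word in dictionary:
        return dictionary[word]
    # 1. Acronyms: concatenate letter-name pronunciations
    if word.isalpha() and len(word) <= 5 and all(c in LETTER_PRONS for c in word):
        return " ".join(LETTER_PRONS[c] for c in word)
    # 2. Strip simple suffixes and look up the resulting baseform
    for suffix, tail in (("S", " s"), ("ED", " d"), ("ING", " ih ng")):
        base = word[:-len(suffix)]
        if word.endswith(suffix) and base in dictionary:
            return dictionary[base] + tail
    # 3. Last resort: hand the word to a TTS front end (Orator) -- not shown
    return None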

10 Auto Segmentation Overview
- Goal: find endpoints of moderate-length utterances.
- First designed for CTRAN, where we thought forced alignment might be too error prone
  - Want to avoid failure due to transcriber error or signal problems
  - Used recognition with biased LMs for error robustness
- Other issues
  - Estimating cepstral normalization needs "normal" utterance sizes: too much silence is included if we use the complete conversation side.
  - Transcribers identify conversation sides as "L" and "R": we encourage them to wear headphones consistently, but there is no guarantee which side is channel 1 and which is channel 2.
    - Measured about 5% of conversations with sides "flipped" this way
  - Whatever process we use will have some errors. Want to auto-detect poor utterances and reject them.
- The above constraints lead to a multi-stage algorithm

11 Auto Segmentation Algorithm I
- Step 1: Simple speech detection and cepstral normalization (both steps sketched in code below)
  - Energy-based speech detector gives rough speech regions, over which we estimate cepstral mean and variance
  - Normalize complete conversation sides with these statistics
- Step 2: Assign sides to channels
  - Decode both channels of audio with coarse PTM models
  - Align each channel's recognizer output to each transcript side
  - Assign each channel to the transcript side with the lower WER
  - Reject the conversation if both sides match the same channel best
    - Less than 1% of conversations fail this way
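
Steps 1 and 2 in outline form, as a hedged numpy sketch. The feature shapes, the energy margin, and the WER bookkeeping are assumptions; the actual detector and PTM decoder are not shown.

import numpy as np

def detect_speech(log_energy, margin_db=15.0):
    """Crude energy-based speech detection: frames within margin_db of the
    loudest frame are treated as speech (the margin is an assumed value)."""
    return log_energy > log_energy.max() - margin_db

def normalize_side(cepstra, log_energy):
    """Estimate cepstral mean/variance over speech regions only, then
    normalize the complete conversation side with those statistics."""
    speech = detect_speech(log_energy)       # cepstra: (frames, dims)
    mean = cepstra[speech].mean(axis=0)
    std = cepstra[speech].std(axis=0)
    return (cepstra - mean) / std            # applied to all frames

def assign_sides(wer):
    """wer[ch][side]: WER of channel ch's decode against transcript side.
    Returns the best channel for each side, or None when both transcript
    sides match the same channel best (conversation rejected)."""
    best = [min((0, 1), key=lambda ch: wer[ch][side]) for side in (0, 1)]
    return None if best[0] == best[1] else best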

12 Auto Segmentation Algorithm I, cont'd
- Step 3: Initial segmentation
  - Make a conversation-specific LM: a tight grammar that still allows deviation from the transcript
  - Decode with PTM models and the above LM
  - Align decoder output with the transcript; break into coarse chunks during reliable silences or at strings of insertion errors
- Step 4: Refined segmentation
  - Make a side-specific LM (tighter than the conversation-specific one)
  - Decode initial segments with SCTM models and the tighter LM
  - Chop initial segments into smaller segments
- Step 5: "Filtering" decode
  - Same models as the last step
  - Decode the refined segmentation
  - Compare output to the transcript and reject utterances with too-high alignment error (the rule appears as code below):
    #correct < #substitutions + #deletions + #insertions
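
The Step 5 rejection test is simple enough to state directly as code; this is just the inequality from the slide, with the counts assumed to come from the decoder-output/transcript alignment.

def keep_utterance(n_correct, n_sub, n_del, n_ins):
    """Step 5 filter: keep an utterance only if the alignment is good
    enough, i.e. reject whenever
    #correct < #substitutions + #deletions + #insertions."""
    return n_correct >= n_sub + n_del + n_ins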

13 Switchboard 20 Hour Experiments
- Unadapted recognition on Dev01 and Eval01 using acoustic models trained with the 20-hour Swbd1 set; LM trained on full Switchboard
- ML, GI, VTL, HLDA-trained models

Transcripts    Training hours   Segmentation    Dev01 WER   Eval01 WER
LDC/MSU        19.9             Manual + Auto
CTRAN          19.4             Auto
Fast LDC       17.9             Manual
WWave Alg I    19.2             Auto

14 Judging Segmentation Quality
- 20-hour results are hard to interpret, but may indicate WWave is slightly worse than LDC
  - Both LDC/MSU and Fast LDC use manually corrected segmentation; this may explain their (possible) edge.
- Listening to segmented WWave conversations reveals some problems
  - Words sometimes shifted to a neighboring utterance
- Despite the uncertainty, we have focused on trying to improve accuracy via better automatic segmentation, testing on 20 hours and doing subjective listening tests.

15 Some Attempted Improvements
- Defer segmentation decisions to better models
  - Since first-pass PTM models are worse than the later SCTM models, tried making fewer decisions in the first pass (bigger initial chunks)
  - More final segmentation decisions made by SCTM models
  - Showed no improvement on the 20-hour set
- Use turn information from transcripts to help segment
  - Transcribers indicate punctuation and turn taking that we ordinarily ignore
  - Tried using it in the language model, e.g. incorporating information about the location of sentence ends from punctuation
  - No improvement on the 20-hour set
- Recently reconsidering the recognition-based approach
  - Recognition errors may add too much noise to recover from
  - Feared problems with forced alignment may be fixable, especially with fairly high-quality transcripts like WWave's

16 Segmentation Algorithm II
- Initially tried forced alignment of whole conversations, but traceback failed for a significant percentage of them
- Tried instead a coarse initial segmentation followed by forced alignment of the resulting chunks
  - Avoids losing a whole conversation due to one problem spot in the alignment
  - But recognition errors from first-pass chunking are still possible
- Process
  - As before, normalize, pick sides, run the first PTM decode
  - First chopping tuned to produce large initial segments
  - Run forced alignment on these large segments
  - Chop into smaller segments based on the times of silences found in the forced alignment (see the sketch below)
  - Decode and filter bad utterances as before
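
A minimal sketch of the final chopping step, assuming word-level start/end times from the forced alignment; the 0.5 s gap and 15 s length cap are illustrative values, not the actual tuning.

def chop_at_silences(aligned_words, min_gap=0.5, max_len=15.0):
    """Split one large forced-aligned segment at silences.
    aligned_words: list of (word, start_sec, end_sec) from the alignment.
    Breaks wherever the inter-word gap exceeds min_gap, and forces a
    break once a segment grows past max_len.  Returns (start, end) pairs."""
    segments = []
    seg_start, prev_end = aligned_words[0][1], aligned_words[0][2]
    for _word, start, end in aligned_words[1:]:
        if start - prev_end >= min_gap or prev_end - seg_start > max_len:
            segments.append((seg_start, prev_end))
            seg_start = start
        prev_end = end
    segments.append((seg_start, prev_end))
    return segments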

17 Results of Algorithm II
- Same training and test conditions as before
- Is this better? Our impression from listening is that it is, but this test is too weak to draw conclusions.

Transcripts    Training hours   Segmentation    Dev01 WER   Eval01 WER
LDC/MSU        19.9             Manual + Auto
CTRAN          19.4             Auto
Fast LDC       17.9             Manual
WWave Alg I    19.2             Auto
WWave Alg II   19.5             Auto

18 Using Fisher Data in Large Training Set
- How much does the QTR Fisher data help a larger system?
- Add ~150 hours of Fisher data to the 365 hours used in Eval03 training
  - Includes the 80 hours of Fisher data distributed to the community in August plus more recent additions
- Segmented with Algorithm I (recognition based)
- Training method: same as the Eval03 unadapted pass
  - GI, ML models with VTL, HLDA
  - 3-gram LM, 55k lexicon
  - Planned to train SAT, MMI models but ran out of time
- Result today uses the Fisher 150 hours in the LM, but text normalization is not quite right
  - May improve a little more

19 Results of Fisher in Large Training Set
- Unadapted decode on Eval01, Eval03
- S = Switchboard Eval03 training, F = Fisher 150 hours

LM Training   AM Training   Eval01 %WER   Eval03 %WER (All / Swbd / Fisher)
S             S
S             S + F
S + F         S
S + F         S + F

20 Data Rates and Schedule
- Contractual issues delayed ramping up the effort
  - BBN initially had approval to transcribe just 300 hours
  - Approval for 1800 hours came in mid-July
    - WordWave increased transcription, currently ramping up to 100 hours/week
    - Plan to finish by end of the calendar year
- BBN post-processing has had no problems keeping up
- First 80-hour delivery from BBN -> LDC -> community in mid-August
- Proposed next release at 500 hours, then the entire set at end of year
  - Negotiable
  - Possible re-releases if we improve segmentation significantly

21 Future Work
- Do more testing
  - Adaptation, MMI, for both 20-hour and full training experiments
  - Fisher LM and vocabulary for the full training experiment
- Improved segmentation
  - Currently looking at forced alignment of the full conversation
  - Re-release transcripts if/when any improvements are proven
- Semi-automated quality checking
  - Automatically find questionable areas worth listening to, according to filtering recognition output plus rules
  - Listen and clean up
- Encourage the community to share bug reports and fixes
  - Is there a way to share fixes and improvements and maintain coherent versions?