
1 Speech Transcription for Broadcast Activities: The science, the art, and business realities
Sara H. Basson, Michael Picheny, Bhuvana Ramabhadran
IBM T.J. Watson Research Center
© 2007 IBM Corporation

2 Agenda
• Captioning and Transcription: The need
• The options
• Automated speech transcription: state of the art
• Is it ready for prime time? – samples from network transcripts
• Quality control
• Near-term solutions
• The future

3 Lack of Captioning and Transcription – The Problem
• Proliferation of multimedia information
• Audio: not always the medium of choice
  – Violates accessibility
  – 22,000,000 Americans listed as deaf or hard of hearing
  – Aging users
  – US Federal Gov’t: 2001 amendment to Section 508 of the Rehabilitation Act mandates that information that federal agencies provide to the public or to their employees be accessible.
Time for editing (= cost of captioning) decreases as speech recognition accuracy improves.

4 Transcription of Audio Material: It’s the Law
Telecommunications Act of 1996:
• 100% of new English-language programming must be captioned by 2006
• 100% of Spanish-language programming must be captioned by 2010

5 Transcription Contrasted with Other Speech Recognition
Transcription
  – Examples: closed captioning; general dictation; call center data mining; government intelligence applications
  – Characteristics: unconstrained speech; conversational; large vocabulary; high resource; telephone, broadcast, speeches
Transaction
  – Examples: “For mortgage rates, say or press 1…”; “Please say your tracking number…”; name dialer
  – Characteristics: more constrained; more directed; large vocabulary; lower resource; telephone
Embedded
  – Examples: direction giving in car; spoken commands in car; phrase translation on a PDA
  – Characteristics: most constrained; most directed; smaller vocabulary; lowest resource; embedded in a device

6 Audio requiring transcription/captioning
• Webcasts
• Podcasts
• Television programming
• Movies
• Digitized lectures
• e-Learning materials
• Corporate training
• Meetings
• Conferences
• Tourist information
• Medical transcription
• Legal transcription
• Call center data
= Strong accessibility requirement (user demand, and corporate/legal mandates)

7 Speech Recognition Challenges Over Time
• Connected Digit Sequences (TI Digits)
• TIMIT Acoustic-Phonetic Continuous Speech Corpus
• Broadcast News (BN)
• Speech in Noisy Environments (SPINE)
• Switchboard (SWB): telephone conversations (about 70 topics)
• MALACH Corpus
Axis label: Increasing complexity

8 Progress in Base Technology Research
Base speech recognition technology has improved steadily over the last 15 years. Current error rates are low enough for many practical applications.
Chart titles: Progress in Conversational Speech; Progress in IBM Speech Products
Chart labels: IBM Superhuman Speech Project; NIST Benchmarks; Human Performance – Conversational Telephony; IBM Embedded ViaVoice in Car; IBM WebSphere Voice Server - Telephony
Chart notes:
• The NIST benchmark uses different test datasets each year, focusing on conversational speech.
• Average error rates for 10 simple tasks (digits, name dialing, etc.)
• In-car tests are performed at several speed/noise levels.

9 MALACH: A challenging speech corpus
Multimedia digital archive: 116,000 hours of interviews with over 52,000 survivors, liberators, rescuers and witnesses of the Nazi Holocaust, recorded in 32 languages.
Goal: improved access to large multilingual spoken archives
Challenges:
• Emotional speech: “young man they ripped his teeth and beard out they beat him”
• Disfluencies: “A- a- a- a- band with on- our- on- our- arm”
• Frequent interruptions: “CHURCH TWO DAYS these were the people who were to go to march TO MARCH and your brother smuggled himself SMUGGLED IN IN IN IN”

10 Named Entity Detection in Segmentation
31 named entity tags: Person, Location, Organization, Country, Cardinal number, Money, Date, Duration, Age, Ordinal number, Percentage, Animal, Plant, Substance, Occupation, Disease, …

11 Captioning audio: What are the options?
Options and issues:
  – Stenographers: cost, availability
  – Automatic speech recognition: performance for speaker independent, any topic, multiple speakers, noisy backgrounds…
Captioning and transcribing audio material: Additional Advantages
• Text-based search vs. audio-based search
• Reading text: faster than listening to the auditory equivalent
• Second language learners
• Individuals with certain learning disabilities

12 Understandability: ASR vs. stenocaptioning: Manageable errors
ASR: a picture perfect landing for the space shuttle atlantis this morning the shuttle touched down at the kennedy space center in florida about six twenty one this morning IN ending a twelve day mission
TRUTH: a picture perfect landing for the space shuttle atlantis this morning the shuttle touched down at the kennedy space center in florida about six twenty one this morning ** ending a twelve day mission
ASR: since the diet drug combination FEN fen was pulled off the market some dieters **** been looking for something that would work as well we will see what's in the works
TRUTH: since the diet drug combination PHEN fen was pulled off the market some dieters HAVE been looking for something that would work as well we will see what's in the works

13 Understandability: ASR vs. stenocaptioning: Distracting/confusing
ASR: ** TOOK IT makes a lot of FOLKS and also ** THAT e. mail volleys more than twice pick up the phone
TRUTH: O. K. THAT makes a lot of SENSE and also IF AN e. mail volleys more than twice pick up the phone
ASR: STAY connected through e. mail has become very common in a lot of homes IN on the job but ********* on how it's used it can be terrific FOR disastrous we will look at some e. mail problems THAT possible solutions
TRUTH: STAYING connected through e. mail has become very common in a lot of homes AND on the job but DEPENDING on how it's used it can be terrific OR disastrous we will look at some e. mail problems AND possible solutions
ASR: so they do not have to make their own interpretation makes a lot of THINGS another tip TO write an e. mail IS WHAT IT a news paper article in other words state the most pertinent information first we always say in the news business do not bury the lead
TRUTH: so they do not have to make their own interpretation makes a lot of SENSE another tip TOO write an e. mail AS YOU WOULD a news paper article in other words state the most pertinent information first we always say in the news business do not bury the lead
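The ASR/TRUTH pairs on slides 12 and 13 appear to mark mismatched words in capitals and missing words with asterisks. One common way to reduce such a comparison to a single number is word error rate (WER): the word-level edit distance divided by the number of reference words. The slides do not say which metric the authors used, so the sketch below is an assumption, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: plain whitespace tokenization and case-insensitive matching;
    the slides do not specify how the original scoring was done.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Second pair from slide 12, with the alignment markers removed.
truth = ("since the diet drug combination phen fen was pulled off the market "
         "some dieters have been looking for something that would work as well "
         "we will see what's in the works")
asr = ("since the diet drug combination fen fen was pulled off the market "
       "some dieters been looking for something that would work as well "
       "we will see what's in the works")

wer = word_error_rate(truth, asr)  # 2 errors over 31 reference words
```

A single ratio like this does not distinguish the "manageable" errors of slide 12 from the "distracting/confusing" ones of slide 13, which is part of the point the two slides make.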

14 Text and punctuation

15 Quality control for broadcast captioning
Headline, Thursday, July 05, 2007: “Closed Captions On Ohio TV: 24/7 Gibberish Dished To The Disabled”

16 Quality control for Broadcast Captioning
Q: Do captions have to meet accuracy requirements, such as having only so many spelling errors per program?
A: At present, captions are not required to meet any particular quality or accuracy standards. The Federal Communications Commission concluded that program providers have incentives to offer high quality captions, in keeping with the overall quality of the programs they offer. The FCC also concluded that it would be difficult to develop and monitor quality standards at this time. However, viewers may let video providers know whether they are satisfied with the captions through purchases of advertised products, subscriptions to program services, or contacts with providers concerning the programs.
The above information has been excerpted from the FCC guidelines and the Captioned Media Program of the National Association of the Deaf.

17 Using ASR for captioning incrementally: UK Media and re-speaking

18 Using ASR for Broadcast Captioning incrementally: Protitle Live System
Enables creation of subtitles in all major languages, using speech recognition
Functions
  – Correction in real time
  – Validation in real time
Timing
  – Total cycle time between 2 and 7 seconds
  – 5 seconds on average
Economics
  – Re-speaking: 1/10th the cost of a real-time stenographer

19 Using ASR for Broadcast captioning incrementally: Real-time editing
• Assume: speaker obtains 80 percent ASR accuracy when speaking at a rate of 150 words a minute
• Editor needs to correct 15 words in a minute to increase the accuracy to 90 percent.
  – By choosing the 15 most important errors, some of the remaining 15 errors may not detract significantly from understanding.
• In classrooms in the UK and in other countries, disabled students have people taking notes for them who are trying to type or write much faster than 15 words/minute to record as much as possible. If the speaker used speech recognition instead, the note taker would not have to record everything and would need only to type the corrections.
• People can read four or more times faster than somebody speaks.
• Therefore: possible to do ‘something else’ when reading words displayed at speaking speeds
• Real-time editing can be separated into three activities:
  – Finding the error and highlighting it
  – Entering the correction
  – Replacing the error with the correction
• Using foot pedals to move the highlight to the exact position and to trigger the replacement could enable the hands to remain free for entering the corrections.
Source: Professor M. Wald, Southampton University
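The arithmetic behind the first two bullets can be checked with a short sketch. The function name `editing_load` and its parameters are hypothetical, and accuracy is treated simply as the percentage of correct words, which is an assumption about how the slide defines it.

```python
def editing_load(words_per_minute: int, asr_accuracy_pct: int,
                 target_accuracy_pct: int) -> dict:
    """Per-minute error counts for a real-time editor.

    Assumption: accuracy is the percentage of words transcribed correctly,
    so errors per minute = words per minute * (100 - accuracy) / 100.
    """
    errors_before = words_per_minute * (100 - asr_accuracy_pct) / 100
    errors_after = words_per_minute * (100 - target_accuracy_pct) / 100
    return {
        "errors_before": errors_before,                      # errors the ASR makes
        "corrections_needed": errors_before - errors_after,  # editor's workload
        "errors_remaining": errors_after,                    # errors left uncorrected
    }


# The slide's scenario: 150 words a minute, 80 percent raised to 90 percent.
load = editing_load(150, 80, 90)
```

With the slide's figures this gives 30 errors a minute from the recognizer, 15 corrections for the editor, and 15 errors left in the text, which matches the numbers quoted above.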

20 Automated measures of accuracy
Proposal from the WGBH National Center for Accessible Media (NCAM)
• Use language-processing tools to develop an automated caption accuracy assessment system for real-time captions on live news programming
• Can text-based data mining and speech-to-text technologies produce meaningful data about stenocaption accuracy?
  – Explore the capabilities of data mining software agents to identify discrepancies between errors contained within stenocaption data sets and speech-to-text data sets, and generate a caption accuracy analysis of the data set under review.
Through these methods, the goal is to:
• Improve the ability of the television community to monitor and maintain the quality of live captioning they offer to viewers who are deaf or hard of hearing
• Ease the current burden on caption viewers to document and advocate for comprehensible captions.
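As a rough illustration of what "identifying discrepancies" between a stenocaption stream and a speech-to-text stream could look like, the sketch below aligns two word sequences with Python's standard `difflib.SequenceMatcher` and reports the spans where they disagree. This is not NCAM's proposed system; the function name `find_discrepancies` and the sample sentences are assumptions made for the example.

```python
from difflib import SequenceMatcher


def find_discrepancies(steno_text: str, asr_text: str) -> list:
    """Return (operation, steno_words, asr_words) for each span that differs.

    Assumption: both inputs cover the same stretch of audio and can be
    compared as lowercase, whitespace-separated words.
    """
    steno = steno_text.lower().split()
    asr = asr_text.lower().split()
    matcher = SequenceMatcher(a=steno, b=asr, autojunk=False)
    discrepancies = []
    for op, s_start, s_end, a_start, a_end in matcher.get_opcodes():
        if op != "equal":  # keep only 'replace', 'delete', and 'insert' spans
            discrepancies.append((op, steno[s_start:s_end], asr[a_start:a_end]))
    return discrepancies


# Hypothetical pair of streams for the same sentence.
steno_caption = "it can be terrific or disastrous"
asr_output = "it can be terrific for disastrous"

flagged = find_discrepancies(steno_caption, asr_output)
```

A disagreement only shows that the two streams differ, not which one is wrong, so any accuracy analysis built on it would still need a way to decide between them.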

21 Future vision…
• Automatic Speech Transcription for less regulated arenas
  – Captioning podcasts, lectures, meetings, presentations…
• Easier tools to modify and customize
• Easier and more cost-effective mechanisms to deliver
• Understanding quality control issues: what is accuracy, what is the cost of an error
• Back-up options
• More pervasive usage
• Higher quality deliverables


