Transcription System using Automatic Speech Recognition (ASR) for the Japanese Parliament (Diet)
Tatsuya Kawahara (Kyoto University, Japan)

Brief Biography
1995: Ph.D. (Information Science), Kyoto Univ.
1995: Associate Professor, Kyoto Univ.
1995-96: Visiting Researcher, Bell Labs., USA
2003-: Professor, Kyoto Univ.
2003-06: IEEE SPS Speech TC member
2006-: Technical Consultant, The House of Representatives, Japan
Published 150+ papers in automatic speech recognition (ASR) and its applications
Web: http://www.ar.media.kyoto-u.ac.jp/~kawahara/

Contents
1. Review of ASR technology
2. ASR system for the Japanese Diet
3. Next-generation transcription system of the Japanese Diet

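Trend of ASR
[Figure: ASR tasks plotted by speaking style (formal to informal) and number of speakers (one to multiple). Tasks shown: reading/re-speaking, broadcast news, formal presentations, classroom lectures, phone conversations, business meetings, parliament. A "spontaneous speech" label also appears.]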

Review of ASR technology (1/2)
Broadcast news [world-wide]
– Professional anchors, mostly reading manuscripts
– Accuracy over 90%
Public speaking, oral presentations [Japan]
– Ordinary people making fluent speech
– Accuracy ~80% (close-talking mic.)
Classroom lectures [world-wide]
– More informal speaking
– Accuracy ~60% (pin mic.)

Review of ASR technology (2/2)
Telephone conversations [US]
– Ordinary people, speaking casually
– Accuracy 60% → 85%
Business meetings [Europe/US]
– Ordinary people, speaking less formally
– Accuracy 70% (close mic.), 60% (distant mic.)
Parliamentary meetings [Europe/Japan]
– Politicians speaking formally
– EU: plenary sessions: 90%
– Japan: committee meetings: 85%

Deployment of ASR in Parliaments & Courts
Some countries
– Steno-mask & voice writing
– Re-speaking → commercial dictation software
Some local governments in Japan
– Direct recognition of politicians' speech
Japanese courts
– ASR for efficient retrieval from recorded sessions
Japanese Parliament (= Diet)
– Plans to introduce ASR: direct recognition of politicians' speech
– Mostly in committee meetings, which are interactive, spontaneous, and sometimes heated

Language-specific Issues in Japanese
Need to convert kana (phonetic symbols) to kanji
Conversion is ambiguous → many homonyms
(ex.) KAWAHARA ( カワハラ ) → 河原 (not 川原 )
– Very hard to type in real time
– Only a limited number of stenographers, using special keyboards, can do it
Difference between verbatim style and transcript style
(ex.) おききしたいのですが → ききたい（のです） (roughly: "I would like to ask, if I may" → "I want to ask")
– Re-speaking is not so simple
– Need to rephrase in many cases

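ASR Architecture
[Diagram: the input speech $X$ passes through signal processing into the recognition engine (decoder), which draws on three knowledge sources.]
– Acoustic model $P(X \mid P)$: phone models /a, i, u, e, o…/
– Dictionary $P(P \mid W)$: word pronunciations, e.g. 京都 → ky o: t o
– Language model $P(W)$: word sequences, e.g. 京都 + の + 天気
Decoding: $P(W \mid X) \propto P(W) \cdot P(X \mid W)$, output $\hat{W} = \arg\max_{W} P(W \mid X)$
Diagram notes: some components depend on the input condition, others on the application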
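The decoding rule on this slide can be illustrated with a minimal sketch. It assumes a fixed list of candidate word sequences and made-up probabilities (the tables `LANGUAGE_MODEL` and `ACOUSTIC_LIKELIHOOD` and the function `decode` are hypothetical names, not part of any system described in the talk); a real decoder searches a far larger hypothesis space.

```python
import math

# Hypothetical toy scores for one observed input X; none of these numbers
# come from the slides.
LANGUAGE_MODEL = {            # P(W): prior probability of the word sequence
    "京都 の 天気": 0.02,
    "京都 の 電気": 0.001,
}
ACOUSTIC_LIKELIHOOD = {       # P(X|W): how well W explains the audio X
    "京都 の 天気": 0.3,
    "京都 の 電気": 0.5,
}


def decode(candidates, lm=LANGUAGE_MODEL, am=ACOUSTIC_LIKELIHOOD):
    """Return argmax_W P(W) * P(X|W), computed in the log domain."""
    return max(candidates, key=lambda w: math.log(lm[w]) + math.log(am[w]))


best = decode(["京都 の 天気", "京都 の 電気"])
```

In this toy case the acoustic score alone favors 京都 の 電気, but the language model prior tips the product toward 京都 の 天気, which is the role the slide assigns to $P(W)$.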

Current Status of ASR
Problems unsolved
– Spontaneous/conversational speech
– Noisy environments, including distant microphones
Solutions are ad hoc
– Collect large-scale "matched" data (corpus): same acoustic environment and speakers (10 hours or more); same topics and vocabulary (~M words)
– Prepare dedicated acoustic & language models
Huge cost in development & maintenance

ASR Research at Kyoto Univ.
Since the 1960s, one of the pioneers
Development of the free software Julius
Research in spontaneous speech recognition
– 1999-: Oral presentations
– 2001-: TV discussions
– 2004-: Classroom lectures
– 2003-: Parliamentary meetings

Free ASR Software: Julius
Developed since 1997 at Kyoto Univ. & other sites
Open-source → multi-platform (Linux, Mac, Windows, iPhone)
Open architecture
– Independent of acoustic & language models
→ Ported to many languages
→ Ported to many applications (telephony, robots…)
Standard model for Japanese
Widely used research platform
http://julius.sourceforge.jp

Corpus of Parliamentary Meetings
Covers all major committees and plenary sessions
200 hours, 2.4M words
Faithful transcripts of utterances including fillers, aligned with the official minutes
(ex.) { えー } それでは少し、今 { そのー } 最初に大臣からも、 { そのー } 貯蓄から投資へという流れの中に { ま } 資するんじゃないだろうかとかいうような話もありましたけれども、 { だけど / だけれども } 、 { まあ } あなたが言うと本当にうそらしくなる { んで / ので }{ ですね、えー } もう少し { ですね、あのー } これは { あー } 財務大臣に { えー } お尋ねをしたいんです { が } 。
(ex.) { ま } その { あの } 見通しはどうかということでありますけれども、これについては、 { あのー } 委員御承知の { その } 「改革と展望」の中で { ですね } 、我々の今 { あのー } 予測可能な範囲で { えー } 見通せるものについてはかなりはっきりと書かせていただいて ( い ) るつもりでございます。
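The aligned transcript above can be split into a verbatim reading and a minutes-style reading. The sketch below rests on an assumed interpretation of the notation, which the slide does not spell out: `{ a }` is treated as spoken-only material (such as fillers), `{ a / b }` as a spoken form `a` rewritten as `b` in the minutes, and `( a )` as text present only in the minutes. The function name `split_annotation` is hypothetical.

```python
import re

BRACE = re.compile(r"\{\s*([^{}]*?)\s*\}")   # { a } or { a / b }
PAREN = re.compile(r"\(\s*([^()]*?)\s*\)")   # ( a )


def split_annotation(text):
    """Return (spoken, minutes) under the assumed notation described above."""
    def spoken_brace(match):
        # Spoken side: first alternative, or the whole brace content.
        return match.group(1).split("/")[0].strip()

    def minutes_brace(match):
        # Minutes side: second alternative if present, otherwise deleted.
        parts = match.group(1).split("/")
        return parts[1].strip() if len(parts) > 1 else ""

    spoken = PAREN.sub("", BRACE.sub(spoken_brace, text))
    minutes = PAREN.sub(lambda m: m.group(1), BRACE.sub(minutes_brace, text))
    # The Japanese text carries no word spaces, so the spacing left around
    # the annotation marks can simply be dropped.
    return re.sub(r"\s+", "", spoken), re.sub(r"\s+", "", minutes)


spoken, minutes = split_annotation(
    "{ まあ } あなたが言うと本当にうそらしくなる { んで / ので }"
)
```

If the assumed notation is right, pairs produced this way are the kind of parallel data from which filler removal and colloquial-expression replacement could be studied.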

ASR Modules Oriented for Spontaneous Speech
[Diagram: the ASR architecture (signal processing, acoustic model $P(X \mid W)$, dictionary, language model $P(W)$, recognition engine) with $P(W \mid X) \propto P(W) \cdot P(X \mid W)$; the models are built from the corpus with innovative techniques.]
Goals annotated on the modules:
– Cover pronunciation variations
– Cover poor articulation
– Cover disfluencies & colloquial expressions

ASR Performance
Accuracy
– Word accuracy 85% (character accuracy 87%): plenary sessions 90%, committee meetings 80-87%
– At 90%, the output seems almost perfect
– No commercial software can achieve this
Real-time factor 1-3
– Latency: 10 min.
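The slide does not define its metrics, so the following sketch uses the common conventions as an assumption: word accuracy is taken as 1 minus the word-level edit distance (substitutions, deletions, insertions) divided by the reference length, and real-time factor as processing time divided by audio duration. The function names are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (one row at a time)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(ref, hyp):
    """Assumed definition: 1 - (S + D + I) / N, with N reference words."""
    return 1.0 - edit_distance(ref, hyp) / len(ref)


def real_time_factor(processing_seconds, audio_seconds):
    """Assumed definition: processing time relative to audio duration."""
    return processing_seconds / audio_seconds


reference = "a b c d e f g h i j".split()
hypothesis = "a b c d x f g h i j".split()   # one substituted word
accuracy = word_accuracy(reference, hypothesis)
rtf = real_time_factor(600.0, 300.0)         # 10 min to process 5 min of audio
```

Character accuracy can be computed the same way by passing lists of characters instead of words.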

Related Techniques
Noise suppression & dereverberation
– Not a serious issue once matched training data are available
Speaker change detection
– Desirable
– Current technology level does not seem sufficient
Auto-edit
– Filler removal → easy
– Colloquial expression replacement → non-trivial
– Period insertion → still at the research stage
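As a rough illustration of why filler removal is rated easy, here is a minimal sketch that assumes the ASR output is already segmented into word tokens and that fillers can be listed in advance. The filler list and the function `remove_fillers` are hypothetical; the entries are taken from the fillers visible in the corpus example, and a deployed system would need a more careful treatment of words that are fillers only in some contexts.

```python
# Hypothetical filler inventory; not the list used by the actual system.
FILLERS = frozenset({"えー", "あのー", "そのー", "まあ", "あー"})


def remove_fillers(tokens, fillers=FILLERS):
    """Drop tokens that match the filler inventory exactly."""
    return [token for token in tokens if token not in fillers]


asr_tokens = ["えー", "これ", "は", "あー", "財務大臣", "に", "えー", "お尋ね"]
cleaned = remove_fillers(asr_tokens)
```

Matching whole tokens rather than substrings avoids deleting filler-like character sequences inside ordinary words, which is one reason token-level output makes this step simple.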

The House of Representatives in Japan
2005: stopped recruiting stenographers
2006: investigated ASR technology for the new transcription system
2007: developed a prototype system and made preliminary evaluations
2008: system design
2009: system implementation
2010: trial and deployment

ASR system: Kyoto Univ. models integrated into the NTT engine
[Diagram: the ASR architecture (signal processing, acoustic model $P(X \mid P)$, dictionary $P(P \mid W)$, language model $P(W)$, recognition engine) with $P(W \mid X) \propto P(W) \cdot P(X \mid W)$. Labels indicate the parties involved: NTT Corp., Kyoto Univ., and the House.]

Issues in Post-Editor
For efficient correction of ASR errors and cleaning of the transcript into document style
Easy reference to original speech (+ video)
– By time, by utterance, by character (cursor)
– Can speed up & slow down speech replay
Word-processor interface (screen editor), not a line editor
– To concentrate on making correct sentences
– Serious misunderstanding between system developers and stenographers

System Evaluation (@Kyoto)
Subjects: 18 students
Post-editing ASR output is more efficient than typing from scratch, regardless of the accuracy
→ Speech that is hard for ASR is also hard for humans
[Figure: edit time (min, axis 3 to 10) versus ASR accuracy (%, axis 50 to 95) for two conditions: type from scratch and post-edit ASR output.]

System Evaluation (@Kyoto)
Subjective evaluation correlates with ASR accuracy
Threshold of 75% accuracy for ASR to be preferred
[Figure: usability score of ASR (axis 1 to 7) versus ASR accuracy (%, axis 50 to 95).]

System Evaluation (@House)
Subjects: 8 stenographers
System: prototype
ASR-based system reduced the edit time compared with the current shorthand system
– 78 min. → 68 min. (for a 5 min. segment)
Threshold at an ASR accuracy of 80%
– At 75%: degradation in edit time; half of the subjects were negative about using ASR
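The edit-time figures on this slide can be put in relative terms with a little arithmetic. The sketch below uses only the reported numbers (78 min. and 68 min. for a 5 min. segment); the helper names are hypothetical.

```python
def relative_reduction(before, after):
    """Fraction of the original time that was saved."""
    return (before - after) / before


def edit_minutes_per_speech_minute(edit_minutes, segment_minutes):
    """Editing effort normalized by the length of the speech segment."""
    return edit_minutes / segment_minutes


saving = relative_reduction(78, 68)                    # about 0.128
shorthand_rate = edit_minutes_per_speech_minute(78, 5)  # 15.6
asr_rate = edit_minutes_per_speech_minute(68, 5)        # 13.6
```

In other words, the prototype saved roughly 13% of the editing time, with each minute of speech still requiring about 13.6 minutes of work.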

Side Effects of ASR-based System
Everything (text/speech/video) is digitized and hyperlinked → efficient search & retrieval
Less burden? → may work on longer segments??
Significantly less special training needed compared with the current shorthand system

Conclusions
ASR of parliamentary meetings is feasible, given a large collection of data
– ~100 hours of speech
– ~1G words of text (minutes)
– Accuracy 85-90%
Effective post-processing is still under investigation
Automatic translation research is also ongoing
