Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.

1 Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand

2 Background on Thai speech recognition research
Timeline, in order of increasing difficulty:
- Isolated syllable recognition (1995)
- Isolated word recognition
- Connected sub-word recognition (1999)
- Small task continuous speech recognition (2003)
- LVCSR (2005); newspaper read-speech recognition (Thienlikit et al., 2004)
- Broadcast news transcription system (2007)

3 Development of Thai Broadcast News Transcription System
- Research on broadcast news transcription systems for Thai lags behind that for other languages
  - English: 1995 (Stern, 1997)
  - Japanese: 1997 (Matsuoka et al., 1997)
  - Mandarin: 1998 (Guo et al., 1998)
  - Italian: 2000 (Federico et al., 2000)
- We need to speed up our research activities to catch up with others
Targets
1. Development of a Thai broadcast news corpus
   - Speech corpus: training and testing data
   - Text corpus: language modeling
2. Development of a prototype system

4 Speech corpus
- Structure information of the broadcast news was annotated
  - Section, speaker’s turn, segments
- Property tags were annotated for each speaker’s turn
  - Speaker’s name, if known
  - Speaker’s gender: male / female
  - Speaking mode: planned / spontaneous
  - Background noise: clean / music / noise
- Only speech from announcers speaking in the studio was transcribed
- Transcription and annotation were created by one transcriber and checked by another

5 Structure of broadcast news
- Episode: one broadcast news session
  - Section 1: one news topic
  - Section 2
  - Section 3

6 Structure of broadcast news
- Episode: one broadcast news session
  - Section 1: one news topic
    - Speaker’s turn: speaker A
    - Speaker’s turn: speaker B
    - Speaker’s turn: speaker A

7 Structure of broadcast news
- Episode: one broadcast news session
  - Section 1: one news topic
    - Speaker’s turn: speaker A
      - Segment: one sentence or clause

9 Example of structure information
- Episode: one broadcast news session
  - Section 1: Sports
    - Speaker’s turn: Mr. A, male, planned speech, clean speech
      - Segment: sentence A
      - Segment: sentence B
      - Segment: sentence C
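
The annotation hierarchy described on these slides (episode, section, speaker's turn, segment, with property tags on each turn) could be held in memory roughly as follows. This is only a minimal sketch: the class names, field names, and the helper `count_segments` are hypothetical and are not taken from the corpus or from the transcription tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    text: str  # one sentence or clause


@dataclass
class SpeakerTurn:
    speaker_name: Optional[str]  # None when the speaker's name is unknown
    gender: str                  # "male" or "female"
    speaking_mode: str           # "planned" or "spontaneous"
    background: str              # "clean", "music" or "noise"
    segments: List[Segment] = field(default_factory=list)


@dataclass
class Section:
    topic: str  # one news topic
    turns: List[SpeakerTurn] = field(default_factory=list)


@dataclass
class Episode:
    # One broadcast news session.
    sections: List[Section] = field(default_factory=list)


def count_segments(episode: Episode) -> int:
    """Count all segments in an episode (hypothetical helper)."""
    return sum(len(turn.segments)
               for section in episode.sections
               for turn in section.turns)


# The example from the slide: a sports section with one turn by "Mr. A".
example_episode = Episode(sections=[
    Section(topic="Sports", turns=[
        SpeakerTurn("Mr. A", "male", "planned", "clean",
                    [Segment("sentence A"),
                     Segment("sentence B"),
                     Segment("sentence C")]),
    ]),
])
```

Keeping the property tags on the turn rather than on each segment mirrors the slide, where name, gender, speaking mode, and background noise are annotated once per speaker's turn.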

11 Text corpus
- No structure information was annotated
- Additional information
  - Speaking mode: planned / spontaneous

12 Problems of Thai transcription text
- No space between words
  - Definition of a word is very ambiguous
  - No good morphological analyzer
  - Difficulties in the transcription and checking process
- Manually word-segmented transcription was made
  - Instructions were created for transcribers
- Automatically segmented transcription (future target)

13 Broadcast news collection
- News programs from one public TV station in Thailand were recorded
- Total of 105 news episodes
  - Speech corpus: 35 news episodes, 17 hours
  - Text corpus: 70 news episodes

14 Analysis of speech corpus

15 Information on speech & text corpora

Attribute             Speech corpus       Text corpus
No. of sentences      13k                 32k
No. of words          224k                573k
No. of unique words   10k                 14k
No. of phonemes       899k                -
No. of speakers       8 female, 4 male    -

16 Data used in experiments
- Test set data
  - Randomly selected from the speech corpus
  - 3,000 utterances
- Acoustic model training data for the baseline system
  - Phonetically balanced sentence speech corpora
    - LOTUS (Kasuriya et al., 2003) and a corpus developed internally
  - Read speech corpora
  - 40.3 hours (68 male and 68 female speakers)
- Acoustic model adaptation data
  - Selected from the speech corpus
  - No overlap between adaptation data and test set data
- Language model training data
  - Text corpus + transcripts from the speech corpus, excluding the test set
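
The requirement that test and adaptation data come from the same speech corpus without overlapping can be met by drawing both from a single shuffle. The sketch below is an illustration under stated assumptions, not the procedure actually used: the function name, the utterance IDs, and the pool size of 13,000 are hypothetical; only the sizes 3,000 (test set) and 200 (one adaptation set) come from the slides.

```python
import random
from typing import List, Sequence, Tuple


def split_test_and_adaptation(utterance_ids: Sequence[str],
                              n_test: int,
                              n_adapt: int,
                              seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly pick disjoint test and adaptation sets from one pool."""
    if n_test + n_adapt > len(utterance_ids):
        raise ValueError("not enough utterances for the requested split")
    shuffled = list(utterance_ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    test_ids = shuffled[:n_test]
    # Taking the next slice of the same shuffle guarantees no overlap.
    adapt_ids = shuffled[n_test:n_test + n_adapt]
    return test_ids, adapt_ids


# Illustrative pool: the IDs and the pool size are made up for this example.
pool = [f"utt{i:05d}" for i in range(13000)]
test_ids, adapt_ids = split_test_and_adaptation(pool, n_test=3000, n_adapt=200)
```

In the experiments the adaptation utterances are additionally chosen per F-condition or per speaker; that filtering would be applied to the pool before the adaptation slice is taken.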

17 Experimental condition
- Acoustic model
  - Gender-dependent acoustic model
  - 12 MFCCs, delta, and delta energy
  - Triphones, 1000 tied states, 8 Gaussian mixtures
- Language model
  - Trigrams
  - Dictionary size: about 18k words
- The TITech WFST speech recognition system (Dixon et al., 2007) was used as the speech decoder

18 Acoustic model adaptation
- Supervised adaptation using MLLR
- F-condition adaptation
  - F0: clean, planned
  - F1: clean, spontaneous
  - F3: music noise
  - F4: other noise
  - Adaptation data: 200 utterances, randomly selected from the speech corpus regardless of speaker
- Speaker adaptation
  - Adaptation data: 200 utterances, randomly selected from the speech corpus regardless of F-condition

19 WER results
Speaker adaptation yielded better WER.

F-condition   Proportion (time)   #words
F0            35.3%               17160
F1             1.0%                 629
F3            14.0%                7882
F4            49.7%               27542
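
For reference, word error rate (WER) is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The function below is a minimal sketch of that standard definition; it is not the scoring tool used in these experiments, and the function name is hypothetical.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """WER = (substitutions + deletions + insertions) / len(reference)."""
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[m] / n


# One substitution (b -> x) and one deletion (d) against four reference words.
example_wer = word_error_rate("a b c d".split(), "a x c".split())
```

For Thai the inputs would be word-segmented transcripts, so the score depends on the segmentation convention as well as on the recognizer.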

20 Discussion
- High WER
  - Mismatched recording conditions
  - The speech corpus was only used as testing and adaptation data
  - Small text corpus
  - Inefficient language model

21 Conclusion
- Construction of the first Thai broadcast news corpus and an overview of the corpus analysis were presented
- The speech corpus was annotated with structure information, which is useful for further research
- An LVCSR system was set up and tested with the corpus

22 Future work
- Applying our Thai language modeling technique (Jongtaveesataporn et al., 2007)
  - Compound pseudo-morpheme (CPM) unit
  - Pseudo-morpheme error rate (F0 condition)
    - Manually segmented word unit system: 20.5%
    - CPM unit system: 19.9%
- Improving the language model by using newspaper text
- Collaboration with NECTEC: additional 50 hours of speech corpus
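
To put the two pseudo-morpheme error rates quoted above in perspective, the relative error reduction can be computed directly. This is a small worked example; the function name is hypothetical, and only the figures 20.5% and 19.9% come from the slide.

```python
def relative_reduction(baseline_rate: float, new_rate: float) -> float:
    """Fraction of the baseline error rate removed by the new system."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


# Manually segmented word unit system (20.5%) vs. CPM unit system (19.9%).
cpm_gain = relative_reduction(20.5, 19.9)
print(f"{100 * cpm_gain:.1f}% relative reduction")
```

The absolute difference of 0.6 points corresponds to a relative reduction of roughly 2.9%.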

23 Thank you

26 Background
Timeline, in order of increasing difficulty:
- Isolated syllable recognition (1995)
- Isolated word recognition
- Connected sub-word recognition (1999)
- Small task continuous speech recognition (2003)
- LVCSR (2005); newspaper read-speech recognition (Thienlikit, 2004)
- Broadcast news LVCSR (2007)

27 Development of Thai Broadcast News LVCSR System
- Development of an LVCSR system requires speech and text corpora
- Existing speech corpora for Thai LVCSR research (newspaper read-speech)
  - NECTEC-ATR
  - LOTUS (NECTEC)
  - GlobalPhone (CMU)
1. Development of a Thai broadcast news corpus
   - Speech corpus: training and testing data
   - Text corpus: language modeling
2. Development of a prototype LVCSR system

28 Experiments & Developed corpora
- Speech corpus
  - The size of the speech corpus is still rather small
  - It was used in three ways
    - Test data
    - Adaptation data
    - A part of the transcription text was used for training the LM
- Text corpus
  - It was used for training the LM

29 Perplexity & OOV rates
[Table: perplexity and OOV rate (male / female) for each F-condition and overall]
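
The two quantities named on this slide have standard definitions: the out-of-vocabulary (OOV) rate is the fraction of test tokens missing from the recognizer's dictionary, and perplexity is the exponential of the negative mean log-probability the language model assigns to the test tokens. The sketch below illustrates both under stated assumptions; the function names and toy inputs are hypothetical, and the per-token natural-log probabilities are assumed to come from some language model not shown here.

```python
import math
from typing import Iterable, Sequence


def oov_rate(tokens: Sequence[str], vocabulary: Iterable[str]) -> float:
    """Fraction of tokens that do not appear in the vocabulary."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    vocab = set(vocabulary)
    return sum(1 for token in tokens if token not in vocab) / len(tokens)


def perplexity(log_probs: Sequence[float]) -> float:
    """exp(-mean log p) over per-token natural-log probabilities."""
    if not log_probs:
        raise ValueError("log_probs must not be empty")
    return math.exp(-sum(log_probs) / len(log_probs))


# Toy inputs: one of four tokens is unknown; every token has probability 1/4.
toy_oov = oov_rate(["a", "b", "c", "a"], {"a", "b"})
toy_ppl = perplexity([math.log(0.25)] * 4)
```

Both figures depend on how the Thai text is segmented into words, which is why the segmentation conventions matter for the comparison.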

30 Transcription process
[Diagram: a common guideline feeds both paths]
- Speech corpus: transcribing (4 persons), followed by checking (2 persons) and lexical entries checking (1 person)
- Text corpus: transcribing (7 persons), followed by lexical entries checking (1 person)

31 Speech corpus
- Transcription and annotation of about 17 hours of TV broadcast news
- Tool: “Transcriber” (Barras et al., 2001)
- Additional information
  - Speaker information: name, gender
  - Speaking mode: planned / spontaneous speech
- Speech from announcers speaking in the studio

32 Transcription conventions
- Guideline for the transcription process
  - Segment segmentation
  - Word segmentation
  - Repeating word
  - Thai/English abbreviation
  - Number entity
  - Special tags

33 Introduction
- Thai speech processing research at TokyoTech
  - Dialogue system [Whittiwiwattchai, 2003]
  - LVCSR system
    - Dictation system [Tianlikid, 2005]
    - Broadcast news recognition system

34 Overview
- Introduction
- Corpus description
- Recording and transcription processes
- Corpus evaluation
- Conclusion

35 Thai language corpora
- Large language corpora are crucial to a state-of-the-art natural language processing system
- Thai speech resources for speech processing
  - Newspaper read-speech: NECTEC-ATR, LOTUS (NECTEC), GlobalPhone (CMU)
  - Unit-selection speech synthesis: TSynC-1 (NECTEC)

36 WER result

F-condition   Time proportion   WER (%) Male   WER (%) Female
F0            28.1%
F1             1.5%
F3            11.5%
F4            58.9%
Overall       100%

37 Text corpus
- Text transcribed from 35 hours of TV broadcast news
- Additional information
  - Speaking mode: planned / spontaneous

38 Transcription conventions (1)
- Sentence segmentation
  - No sentence marker in the Thai language
  - Ambiguous
  - Grammatically, there are 3 types of sentence
    - Simple sentence
    - Compound sentence
    - Complex sentence
    (compound and complex sentences are composed of several clauses or simple sentences)
  - A sentence was defined as a simple sentence or clause, delimited with the help of breaths

39 Transcription conventions (2)
- Word segmentation
  - No word boundary marker in the Thai language
  - Leads to difficulties in the transcription and data checking processes
  - Too ambiguous to define all rules
  - A few rules for simple segmentation patterns were defined
  - Undefined patterns were left to the decision of the transcribers

40 Transcription conventions (3)
- Repeating word
- Thai/English abbreviation
- Number entity
- Special tags
  - Disfluencies, filled pauses, exclamations
  - Foreign words
  - Some other events: uncertainly transcribed parts, etc.

41 Recorded programs
- News programs from one public TV station in Thailand were recorded
- Total of 105 news episodes
  - Speech corpus: 35 news episodes, about 17 hours of speech data
  - Text corpus: 70 news episodes

