


1 Design, compilation and processing of CUCall: a set of Cantonese spoken language corpora collected over telephone networks
W.K. Lo, P.C. Ching, Tan Lee and Helen Meng
The Chinese University of Hong Kong
ROCLING XIV, 16th August 2001

2 Acknowledgment
The CUCall data collection is supported by the Innovation and Technology Fund (AF/96/99).
We are also grateful to the industrial sponsors:
  –Group Sense Limited
  –SmarTone Mobile Communication Limited

3 Outline
Corpus Design and Organization
  –phonetically oriented
  –application oriented
Data Collection and Processing
Data Analysis
Conclusions

4 Part I: Corpus Design and Organization

5 Overview
extension to the CUCorpora microphone speech database
collection of telephone speech data over fixed-line and mobile networks
supports phonetically oriented and domain-specific applications
  –rich phonetic coverage with speaking style variations
  –words, phrases and digit strings for specific use

6 CUCall Organization

7 Phonetically Oriented
5719 sentences
  –selected from the pools of the CUSENT training and testing sets
  –targeted at phonetic coverage in a biphone context
90 short paragraphs
  –enrich the phonetic coverage in addition to the sentence materials
  –capture the variations brought about by the length of the reading materials
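The slides do not state how sentences were picked from the CUSENT pools. One common way to target biphone coverage is a greedy set-cover pass, sketched below. The input format (a dict from sentence id to a list of phone labels), the function names and the toy pool are all assumptions made for illustration, not the authors' procedure.

```python
def biphones(phones):
    """Return the set of adjacent phone pairs in one sentence."""
    return set(zip(phones, phones[1:]))


def greedy_select(sentences, max_sentences):
    """Greedily pick sentences that add the most unseen biphones.

    `sentences` maps a sentence id to its phone sequence
    (a hypothetical input format, not taken from the slides).
    """
    covered = set()
    chosen = []
    remaining = {sid: biphones(p) for sid, p in sentences.items()}
    while remaining and len(chosen) < max_sentences:
        # Sentence with the largest number of new biphones; ties go to the
        # lexicographically smallest id because max() keeps the first maximum.
        best = max(sorted(remaining), key=lambda sid: len(remaining[sid] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining sentence adds coverage
        covered |= gain
        chosen.append(best)
        del remaining[best]
    return chosen, covered


# Toy pool with made-up phone sequences.
toy_pool = {
    "s1": ["n", "ei", "h", "ou", "m", "aa"],
    "s2": ["n", "ei", "h", "ou"],
    "s3": ["ng", "o", "h", "ou", "h", "ou"],
}
chosen, covered = greedy_select(toy_pool, max_sentences=2)
```

Here `s2` is never chosen because every biphone it contains is already covered by `s1`.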

8 Phonetically Oriented
6 spontaneous conversations
  –capture speakers’ spontaneous responses
  –content is unlimited and unconstrained
  –contains all kinds of non-speech events, e.g. correction, hesitation, skipped word, …
  –questions must be simple and open-ended

9 Phonetically Oriented
Criteria for question design
  –simple enough for a spontaneous response; avoid calculation, memory recall, etc.
  –answers are expected to differ between speakers
  –responses may be either long or short
  –avoid answers that touch on speakers’ privacy

10 Application Oriented
1440 words and phrases
  –simple words covering various domains
      names of places
      listed companies
      foreign currencies
      navigation commands
Digit strings
  –strings of digits of various lengths
      all ten single digits
      randomly generated strings of length 7, 8 and 16
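The digit-string prompts can be sketched as follows. The count of five strings per length follows the per-speaker figures on the statistics slide. The function name, the fixed seed and the use of a uniform draw per digit are assumptions, since the slides only say the strings are randomly generated.

```python
import random


def make_digit_prompts(rng, per_length=5, lengths=(7, 8, 16)):
    """Return the ten single digits plus random digit strings of each length."""
    prompts = [str(d) for d in range(10)]
    for n in lengths:
        for _ in range(per_length):
            # Each position is drawn independently and uniformly (an assumption).
            prompts.append("".join(rng.choice("0123456789") for _ in range(n)))
    return prompts


# A seeded generator keeps one speaker's prompt sheet reproducible.
prompts = make_digit_prompts(random.Random(2001))
```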

11 Part II: Data Collection and Processing

12 Collection Process
Preparation of reading materials
  –prepare reading materials as prompt sheets
  –separate male & female, fixed & mobile lines
Distribution of prompt sheets
  –distributed hierarchically through agents
Speakers call
  –speakers call automatic recording servers
  –they are identified by unique serial numbers
Questionnaire return
  –information on age and telephone network type is collected

13 Data Collection System Set-up
Calling End: from any location, using any telephone, by people from all walks of life
Telephone Companies: mobile/fixed-line network
Recording Servers: fixed-line connection to local telephone companies
Recording End: telephone outlet, telephony hardware, recording system, data storage system
Post-processing of data for various targeted domains of applications
Note: CT board is Dialogic® D/41-ESC

14 Post-processing of Data
Call validation
  –received prompt sheets are verified against the recorded speech data
  –user information is entered into databases
Phonemic transcription
  –all accepted speech data are fully phonemically transcribed at the initial-final level
Partitioning of collected data
  –collected data are partitioned properly
  –speech data and transcriptions are organized on a per-speaker basis
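To illustrate what an initial-final transcription looks like, the sketch below splits romanized syllables into initial, final and tone. It assumes the transcriptions use a Jyutping-style romanization with a trailing tone digit, as the example strings on the next slide (such as /nei5-hou2-maa1/) suggest. The initial inventory and function names are assumptions, not the corpus's actual tooling.

```python
# Jyutping initials, two-letter ones first so "ng", "gw", "kw"
# are matched before "n", "g", "k".
INITIALS = ("ng", "gw", "kw", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "w", "z", "c", "s", "j")


def split_syllable(syllable):
    """Split e.g. 'hou2' into (initial, final, tone)."""
    base, tone = syllable[:-1], syllable[-1]
    if not tone.isdigit():
        raise ValueError(f"missing tone digit: {syllable!r}")
    if base in ("m", "ng"):
        return "", base, tone  # syllabic nasal: the whole base is the final
    for ini in INITIALS:
        if base.startswith(ini) and len(base) > len(ini):
            return ini, base[len(ini):], tone
    return "", base, tone  # zero initial, e.g. 'aa3'


def split_utterance(transcription):
    """Split '/syl-syl-.../' into a list of (initial, final, tone) triples."""
    return [split_syllable(s) for s in transcription.strip("/").split("-")]


units = split_utterance("/dou1-ng4-co3-laa1/")
```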

15 Data Processing After Collection
Validation: identify successful recording sessions
Transcription: accurate verbatim transcription of the speech data
  –e.g. /nei5-hou2-maa1/ /ngo5-hou2-hou2/ /nei5-ne1/ /dou1-ng4-co3-laa1/
Data Storage: collected telephone speech data
Organization: organize data for easy access
  –\speaker01\data\001.wav \002.wav
  –\speaker01\annotate\001.xsc \002.xsc
Distribution: printing CDROM for distribution
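The per-speaker layout shown on this slide pairs each recording with an annotation file of the same number. A small helper can make that pairing explicit. The zero-padding widths are inferred from the two example paths only, and the function name is hypothetical.

```python
def utterance_paths(speaker_id, utterance_no):
    """Return the (wav, annotation) paths for one utterance of one speaker."""
    speaker = f"speaker{speaker_id:02d}"   # padding inferred from "speaker01"
    stem = f"{utterance_no:03d}"           # padding inferred from "001.wav"
    wav = "\\".join(["", speaker, "data", stem + ".wav"])
    xsc = "\\".join(["", speaker, "annotate", stem + ".xsc"])
    return wav, xsc


wav_path, xsc_path = utterance_paths(1, 2)
```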

16 Part III: Data Analysis

17 Statistics of Reading Materials
Part        # per speaker        # tonal syl.   # base syl.   syl. count
Phonetically oriented corpora
sent.       50 (out of 5719)     1399           579           4 to 31
para.       3 (out of 90)        768            418           23 to 120
Application-specific corpora
1-digit     10
7-digit     5
8-digit     5
16-digit    5
words       48 (out of 1440)     562            344           2 to 8
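The "# tonal syl." and "# base syl." columns count distinct syllables with and without their tone. A minimal sketch of that count is shown below, assuming transcriptions in the hyphen-separated, tone-digit format of the examples on slide 15. The last sample string is made up so that the two counts differ.

```python
def syllable_inventory(transcriptions):
    """Return (# distinct tonal syllables, # distinct base syllables,
    (min, max) syllables per item)."""
    tonal = set()
    base = set()
    lengths = []
    for t in transcriptions:
        syls = t.strip("/").split("-")
        lengths.append(len(syls))
        tonal.update(syls)
        # Dropping the trailing tone digit gives the base syllable.
        base.update(s.rstrip("0123456789") for s in syls)
    return len(tonal), len(base), (min(lengths), max(lengths))


sample = [
    "/nei5-hou2-maa1/",
    "/ngo5-hou2-hou2/",
    "/nei5-ne1/",
    "/dou1-ng4-co3-laa1/",
    "/hou2-hou3/",  # made-up item: same base syllable, two tones
]
n_tonal, n_base, length_range = syllable_inventory(sample)
```

In this sample `hou2` and `hou3` count as two tonal syllables but one base syllable, which is why the tonal count is always at least as large as the base count.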

18 Frequency-of-frequency (FOF)
[Figure: frequency-of-frequency plots, panels: Sentence, Paragraph]
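A frequency-of-frequency table records, for each occurrence count r, how many distinct units occur exactly r times. The slide does not say which unit was counted. The sketch below assumes tonal syllables and reuses the example transcriptions from slide 15 as toy data.

```python
from collections import Counter


def frequency_of_frequency(tokens):
    """Map each occurrence count r to the number of distinct tokens seen r times."""
    counts = Counter(tokens)               # token -> occurrences
    fof = Counter(counts.values())         # occurrences -> number of tokens
    return dict(sorted(fof.items()))


utterances = ["/nei5-hou2-maa1/", "/ngo5-hou2-hou2/", "/nei5-ne1/", "/dou1-ng4-co3-laa1/"]
syllables = [s for u in utterances for s in u.strip("/").split("-")]
fof = frequency_of_frequency(syllables)
```

For this toy data, seven syllables occur once, `nei5` occurs twice and `hou2` occurs three times.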

19 Part IV: Conclusions

20 Current Status
the collection process is divided into several stages
expected completion date: March 2002
so far, over 200 hours of data (from 1000 speakers) have been collected
  –120 hours of phonetically oriented data
  –80 hours of application-specific data
over half of the collected data have been phonemically transcribed

21 Conclusions
the design and collection process for the Cantonese telephone speech corpora is presented
the corpora are designed to cover both phonetically oriented and application-specific data
long reading materials and open questions for spontaneous data are also included
details of post-processing and data analysis are given

22 Thank You

