Designing a Multi-Lingual Corpus Collection System Jonathan Law Naresh Trilok Pace University 04/19/2002 Advisors: Dr. Charles Tappert (Pace University)

Presentation transcript:

1 Designing a Multi-Lingual Corpus Collection System
Jonathan Law, Naresh Trilok
Pace University, 04/19/2002
Advisors: Dr. Charles Tappert (Pace University), Dr. Zhong-hua Wang (IBM), Dr. Fred Grossman (Pace University)

2 What is a Corpus?
Any collection of more than one text can be called a corpus (corpus being Latin for "body", hence a corpus is any body of text). In modern linguistics, a corpus must have these properties:
–Sampling and representation
–Finite size
–Machine-readable form
–A standard reference

3 Importance of a Corpus for Automatic Speech Recognition (ASR)
–To provide training data for speech recognition
–To supply training data for automatic language identification
–To offer a body of language to the research community
–To enable analysis of language at all levels
–To support transcription and labeling of documents for linguistic research

4 Sample Corpus Speech Wave File

5 Corpus Collection System Overview

6 Major Components
–Corpus Collection Module
  via telephone
  Native speakers
–Corpus Verification Module
  via Web or telephone
  Native speakers

7 Data Recording Process
–Via toll-free number on the Tellme platform
  Caller selects native language
  Prompt for general attributes (for naming convention)
  Prompt with pre-defined scripts (for short utterances)
  Prompt with open-set responses (for long utterances)

8 Corpus Collection Protocol
“Script” of questions and prompts for user responses
Reproduced in each language by a native speaker (all in WAV files)
Prompts and questions are the same in all languages
–Are you male or female (gender)?
–What day is today (date)?
–What time is it (time)?
–Please say all the days of the week.
–Describe the route you take to work or school (route).
–Describe the climate today (climate).
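Because the same questions are asked in every language, the protocol can be pictured as one language-independent list of prompts plus a set of recorded WAV files per language. The sketch below illustrates that idea only. The keys come from the parenthesised labels on the slide, except `days`, which is an assumed label. The short versus open-ended split, the `prompts/<language>/<key>.wav` layout, and all function names are hypothetical.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Prompt:
    key: str          # short label for the prompt
    text: str         # English wording of the question
    open_ended: bool  # True when a long, free-form answer is expected (assumed split)


# One protocol shared by all languages; only the recorded audio differs.
PROTOCOL = (
    Prompt("gender", "Are you male or female?", False),
    Prompt("date", "What day is today?", False),
    Prompt("time", "What time is it?", False),
    Prompt("days", "Please say all the days of the week.", False),  # label assumed
    Prompt("route", "Describe the route you take to work or school.", True),
    Prompt("climate", "Describe the climate today.", True),
)


def prompt_audio_path(language: str, prompt: Prompt) -> PurePosixPath:
    """Locate the native-speaker recording of a prompt (hypothetical layout)."""
    return PurePosixPath("prompts") / language / f"{prompt.key}.wav"


def missing_prompts(recorded_keys: set[str]) -> list[str]:
    """Return the protocol keys that have no recording yet, in protocol order."""
    return [p.key for p in PROTOCOL if p.key not in recorded_keys]


if __name__ == "__main__":
    print(prompt_audio_path("hindi", PROTOCOL[0]))
    print(missing_prompts({"gender", "date", "time", "days"}))
```

With this shape, adding a language amounts to recording one WAV file per key, and `missing_prompts` gives a quick check that a new language covers the whole script.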

9 Corpus Collection Module Add Language or Prompts

10 Corpus Verification Module

11 Corpus Verification Module (cont.)

12

13 System Demonstration

