1 Linguistic Resources needed by Nuance
Jan Odijk
060528
Cocosda/Write Workshop

2 Overview
Nuance History
Nuance Technologies
Nuance Language Coverage
Which Languages are needed
Which data are needed
Advantages

3 Nuance History
ScanSoft (Digital Imaging) acquired:
– Lernout & Hauspie speech divisions (2001)
– Philips Speech Processing embedded and network divisions (2002)
– Telelogue (2003)
– LocusDialog (2003)
– SpeechWorks (2004)
– Talks (2004)
– ART (2005)
– Phonetic Systems (2005)
– Rhetorical (2005)
– MedRemote (2005)
– Nuance (2005); company renamed Nuance
– Dictaphone (2006)

4 Nuance Technologies
Digital Imaging
Speech Technologies
– Text-to-Speech (TTS)
– Automatic Speech Recognition (ASR)
– Dictation
– Speaker Verification
– Audiomining
Speech Applications/Solutions
– Automated Attendant Systems
– Directory Assistance Systems
– Dictation end-user application
– Multimodal applications

5 Nuance Technologies
Platforms
– Server
– DeskTop
– Embedded
    – Automotive
    – Mobile Phones
Domains
– Horizontal
– Vertical
    – Medical
    – Legal
    – Navigation
    – ....

6 Nuance Language Coverage
Broad language coverage
– OCR supports 114 languages
– DeskTop Dictation in 8 languages
– TTS > 23 languages
– Telephony ASR > 40 languages
– Embedded ASR > 11 languages
Broad language coverage necessary
– Most business customers are operating internationally
– Want a single provider of language and speech technologies

7 Nuance Language Coverage
Language Coverage must be further broadened!
Data are needed for that, but...
– Costs are high
– No single company can afford the investments

8 Which Languages?
Priority 1
– Arabic, Chinese (Mandarin, Cantonese), Danish, Dutch, English (UK), English (US), Farsi, Finnish, French, French (Canadian), German, Hindi, Indonesian, Italian, Malaysian, Pilipino (Tagalog), Polish, Portuguese, Portuguese (Brazil), Russian, Spanish, Spanish (American), Swedish, Thai, Turkish, Vietnamese,...
Priority 2
– Bulgarian, Croatian, Czech, Estonian, Greek, Gujarati, Hebrew, Hungarian, Icelandic, Japanese, Kannada, Kazak, Khmer, Latvian, Lithuanian, Macedonian, Malayalam, Marathi, Norwegian, Punjabi, Romanian, Serbian, Sesotho, Sinhalese, Slovak, Slovenian, Swahili, Tamil, Telugu, Ukrainian, Urdu, Uzbek, Xhosa, Zulu,...

9 Which Data?
There's no data like more data, but...
Given time and cost constraints, a minimal set is needed to develop technologies/applications for new languages

10 Which Data?
Network ASR: SpeechDat family
– SpeechDat-II, Orientel, SALA (I and II), LILA
Embedded ASR
– Automotive: SpeechDat-Car
– Consumer Apps: SPEECON
Pronunciation and Grammatical Lexicons: LC-STAR
TTS synthesis: TC-STAR
see
– http://www.speechdat.org
– http://www.tc-star.org
– http://www.lc-star.com

11 Which Data?
Desktop Office data
Large Text Corpora (>300 million tokens plain text)
– news
– business / finance
– traffic messages, weather messages
– e-mail
– SMS
– ...

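To make the size requirement concrete, here is a minimal sketch of checking a plain-text corpus against the ">300 million tokens" figure from the slide. The whitespace-based definition of a token and the helper names `count_tokens` and `meets_minimum` are assumptions for illustration only; the slides do not say how tokens are counted.

```python
from typing import Iterable

# Threshold taken from the slide: more than 300 million tokens of plain text.
MIN_TOKENS = 300_000_000


def count_tokens(lines: Iterable[str]) -> int:
    """Count whitespace-delimited tokens (an assumed, simplistic definition)."""
    return sum(len(line.split()) for line in lines)


def meets_minimum(token_count: int, minimum: int = MIN_TOKENS) -> bool:
    """Return True when the corpus is strictly larger than the minimum size."""
    return token_count > minimum


# Tiny made-up sample standing in for traffic and weather messages.
sample = [
    "Heavy traffic on the A2 near Utrecht .",
    "Rain expected tonight .",
]
total = count_tokens(sample)
print(total, meets_minimum(total))
```

For a real corpus, the same function could be fed an open file object line by line, so the text never has to fit in memory at once.
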
12 Advantages
Research can be done in your own language
Part of the costs can be recovered by licensing data via ELRA to companies
Companies can develop technologies/applications for your languages
Contributes to securing the position of your language in the Internet era
Ask your government for funding and support
Some good examples:
– STEVIN Programme Netherlands/Flanders
– UPC databases for Catalan (Asunción Moreno)