Creating User Interfaces: Directed Speech, XML, VoiceXML. Classwork/Homework: Sign up to be a Voxeo developer. Do the tutorials.

Speech recognition: Encompasses a wide variety and range of activities. Totally open-ended as to content and audience – Claims here may exceed what really exists. Restricted to a small[er] set of phrases – Phrases may occur within longer sections of speech. Restricted to systems that require training OR systems that learn – Dictation systems learn your voice.

Speech recognition: User speaks. System 'understands', at least enough to perform some action. Related to (but not the same as) – Natural language understanding – Voice print identification – Recording information in compressed form, to be replayed to a human for later interaction – Speech synthesis (the other direction): words to speech – ?

Natural language understanding: Skip speech altogether; instead, type in statements or phrases in normal language – What is normal? We tend not to speak that grammatically – Many 'natural language systems' actually use keywords. Historical note: moon rocks example. Combine speech with natural language …

Continuous versus discrete: Speaker speaks 'naturally' versus speaker separates words (pauses between them).
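Pauses make discrete speech much easier to segment than continuous speech. The slides do not name a segmentation method, so the sketch below uses a simple one: energy thresholding. Its assumptions are a synthetic signal, a 100-sample frame, and a hypothetical energy threshold of 0.01. It marks each run of loud frames as one word.

```python
import math

def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def segment_words(samples, frame_len=100, threshold=0.01):
    """Return (start, end) sample ranges where frame energy stays above threshold."""
    segments = []
    start = None
    for idx, energy in enumerate(frame_energies(samples, frame_len)):
        if energy >= threshold and start is None:
            start = idx * frame_len                    # a word begins
        elif energy < threshold and start is not None:
            segments.append((start, idx * frame_len))  # a pause ends the word
            start = None
    if start is not None:                              # signal ended mid-word
        segments.append((start, (idx + 1) * frame_len))
    return segments

# Synthetic "discrete speech": two tone bursts separated by silence.
tone = [0.5 * math.sin(2 * math.pi * 5 * n / 100) for n in range(300)]
silence = [0.0] * 200
signal = silence + tone + silence + tone + silence
print(segment_words(signal))  # [(200, 500), (700, 1000)]
```

With continuous speech the energy rarely drops between words, which is why segmentation there relies on the grammar and probabilities instead.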

Examples: Dictation: no understanding as such; produces words/sentences in a program. (Telephone) help desk / information: generally restricted or directed speech, choosing from alternatives (which may or may not be stated to the caller). Each choice advances the process. [Restricted] commands: actually carrying out operations – Factory example: start and stop – Car: radio, heat/AC – Phone: call a specific number

Training: Dictation application: user takes time to read a specific text to train the system – Note: some systems also adapt with use. If and when the user corrects the results, the system may do better next time. Phone lookup: user records names. No 'understanding', just a recording for matching.

Audience & content: Some systems may allow adapting to audiences, for example, male versus female. Some systems have restrictions on types of content – Historical note: an IBM system in the 1980s & 1990s was restricted to male, American-born speakers (with no speech impediments) and legal text.

Speech recognition concepts: Air pressure → diaphragm in phone → electrical signal → (Fourier Transform) → wave pattern, matched against sets of canonical patterns (native speaker of English, perhaps with male/female & young/old alternatives) generated for the specified grammar (using a segmentation, i.e., a dividing up of the parts). Note: the interplay of grammar and statistics distinguishes the different approaches.
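The slide says the wave pattern is matched against canonical patterns but does not say how. The sketch below uses dynamic time warping (DTW), a classic template-matching technique, as one possible answer. DTW lets a slower or faster utterance still line up with its template. The sketch assumes each utterance has already been reduced to a sequence of single numbers; real systems use a vector of spectral features per frame. The two templates and the "start"/"stop" grammar are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # input advances, template holds
                                 cost[i][j - 1],      # template advances, input holds
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def recognize(features, templates):
    """Return the label of the canonical template closest to the input."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))

# Hypothetical canonical patterns for a two-word grammar.
templates = {
    "start": [1, 3, 5, 3, 1],
    "stop": [5, 4, 1, 1, 1],
}
# A slower utterance of "start": same shape, stretched in time.
spoken = [1, 1, 3, 3, 5, 5, 3, 1]
print(recognize(spoken, templates))  # start
```

The stretched utterance aligns with the "start" template at zero cost. The same matching idea fits phone lookup by recorded name, where each recording serves as a template.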

Fourier Transform (Fast Fourier Transform, FFT; the FFT is a fast algorithm for computing the discrete Fourier transform): Takes data representing a signal and produces numbers representing the combination of sine and cosine waves that make up the signal.
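Here is a minimal sketch of the discrete Fourier transform, written out directly in plain Python. It computes the same numbers an FFT would, only more slowly: O(N²) instead of O(N log N). The test signal is made up for illustration: a sine completing 3 cycles plus a half-strength cosine completing 10 cycles, over 64 samples.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform. An FFT returns the same values in O(N log N)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# 64 samples: a sine completing 3 cycles plus a half-strength cosine completing 10 cycles.
N = 64
signal = [math.sin(2 * math.pi * 3 * t / N) + 0.5 * math.cos(2 * math.pi * 10 * t / N)
          for t in range(N)]
spectrum = dft(signal)

# For a real-valued signal, bins above N/2 mirror the lower half, so inspect only 0..N/2-1.
peaks = [k for k in range(N // 2) if abs(spectrum[k]) > 1.0]
print(peaks)                     # [3, 10]
print(round(spectrum[3].imag))   # -32: the sine shows up in the imaginary part
print(round(spectrum[10].real))  # 16: the cosine shows up in the real part
```

Each nonzero bin names one sine or cosine component and its strength. In practice one would call a library FFT such as `numpy.fft.fft` instead of this loop.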

Speech recognition: Works on the output of the FFT. Uses (in most cases) – Segmentation: attempt to break up into pieces, perhaps syllables or words – Grammar: definition of what is to be expected – Probabilities: if the first part matched X, then there is a greater probability that the next part matches Y
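The sketch below shows how grammar and probabilities might combine. It assumes a hypothetical phone-dialing grammar and made-up bigram probabilities P(next word | previous word). It also assumes made-up acoustic scores per segment, standing in for how well each word's pattern matched the FFT output. For brevity it decodes greedily, one segment at a time; real recognizers usually search over whole word sequences, for example with the Viterbi algorithm.

```python
# Hypothetical directed-speech grammar: which words may follow which.
GRAMMAR = {
    "<start>": ["call", "dial"],
    "call": ["home", "office"],
    "dial": ["home", "office"],
}

# Hypothetical bigram probabilities P(next | previous).
BIGRAM = {
    ("<start>", "call"): 0.7, ("<start>", "dial"): 0.3,
    ("call", "home"): 0.6, ("call", "office"): 0.4,
    ("dial", "home"): 0.5, ("dial", "office"): 0.5,
}

def best_next(previous, acoustic_scores):
    """Pick the grammar-allowed word maximizing P(word | previous) * acoustic score."""
    allowed = GRAMMAR.get(previous, [])
    scored = {w: BIGRAM[(previous, w)] * acoustic_scores.get(w, 0.0) for w in allowed}
    return max(scored, key=scored.get) if scored else None

def decode(segments):
    """Greedy left-to-right decode over one acoustic score table per segment."""
    words, previous = [], "<start>"
    for scores in segments:
        word = best_next(previous, scores)
        if word is None:
            break
        words.append(word)
        previous = word
    return words

segments = [
    {"call": 0.8, "dial": 0.6, "stop": 0.9},  # "stop" scores well but is not in the grammar
    {"home": 0.5, "office": 0.55},            # acoustics slightly favor "office"...
]
print(decode(segments))  # ['call', 'home']: ...but the bigram tips it to "home"
```

The grammar removes "stop" outright, while the probabilities decide between two words that both sound plausible.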

Current state of the art: General, unrestricted speech recognition, good enough to act on the speech? Always about to happen? Dictation / substitute for keyboard+ exists and satisfies many – Is this the most important application for most users? – May not be the killer app, but may be good for motivating research. Extra credit posting: prepare a brief report on [a] current product or application. It can be one you use yourself.

Speech synthesis, aka TTS (text to speech): Application determines that the computer needs to say certain words → lexical units (syllables of words) → phonemes → pre-recorded (wav) files of phonemes
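The following sketches the words → syllables → phonemes → wav files chain. It assumes a hypothetical two-word lexicon with ARPAbet-style phoneme symbols and one pre-recorded file per phoneme under a hypothetical `phonemes/` directory. It only plans which files to play, in order; it does not produce audio. An unknown word raises `KeyError`; a real system would fall back to letter-to-sound rules.

```python
# Hypothetical pronunciation lexicon: word -> syllables -> phonemes.
LEXICON = {
    "hello": [["HH", "AH"], ["L", "OW"]],
    "world": [["W", "ER", "L", "D"]],
}

def text_to_wav_files(text, wav_dir="phonemes"):
    """Map text to the ordered list of pre-recorded phoneme wav files to play."""
    files = []
    for word in text.lower().split():
        word = word.strip(".,!?")                  # drop punctuation around the word
        for syllable in LEXICON[word]:             # lexical units (syllables)
            for phoneme in syllable:               # phonemes
                files.append(f"{wav_dir}/{phoneme}.wav")
    return files

print(text_to_wav_files("Hello, world!"))
# ['phonemes/HH.wav', 'phonemes/AH.wav', 'phonemes/L.wav', 'phonemes/OW.wav',
#  'phonemes/W.wav', 'phonemes/ER.wav', 'phonemes/L.wav', 'phonemes/D.wav']
```

Simply concatenating these files tends to sound choppy. That is the 'natural' joining problem: a phoneme's sound depends on its neighbors.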

Speech synthesis: This is again a segmentation process: need to divide up the words and then put the pieces together so the speech sounds 'natural'. – A particular phoneme may [need to] sound different in different contexts. – Also need to deal with abbreviations & local accents – Place names (important in travel & weather applications). Special case: detect each name and use a wav file for it. Older methods were all synthesized – Similar to the distinction between fully synthesized music and sampled music
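The next sketch covers the abbreviation and place-name special cases. It expands abbreviations and plays a whole-word recording for any place name that has one, sending the rest of the text to the synthesizer. The abbreviation table and the recorded place names are hypothetical examples. Numbers would need a similar expansion step, which is omitted here.

```python
# Hypothetical expansion table for abbreviations in weather/travel prompts.
ABBREVIATIONS = {
    "st.": "street",
    "ave.": "avenue",
    "dr.": "doctor",  # ambiguous: could also mean "drive"
    "mph": "miles per hour",
}

# Hypothetical place names that have their own whole-word recordings.
PLACE_RECORDINGS = {
    "poughkeepsie": "places/poughkeepsie.wav",
    "worcester": "places/worcester.wav",
}

def plan_utterance(text):
    """Split text into ('record', wav) items for known places and ('synth', words) for the rest."""
    plan, pending = [], []
    for token in text.split():
        key = token.lower().strip(",!?")
        place = key.rstrip(".")
        if place in PLACE_RECORDINGS:
            if pending:                            # flush text collected so far
                plan.append(("synth", " ".join(pending)))
                pending = []
            plan.append(("record", PLACE_RECORDINGS[place]))
        else:
            pending.append(ABBREVIATIONS.get(key, key))
    if pending:
        plan.append(("synth", " ".join(pending)))
    return plan

print(plan_utterance("Winds 20 mph near Worcester, then Main St. closes"))
# [('synth', 'winds 20 miles per hour near'), ('record', 'places/worcester.wav'),
#  ('synth', 'then main street closes')]
```

The "dr." entry shows why table lookup alone is not enough: choosing between "doctor" and "drive" requires looking at context.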

Speech synthesis is essentially 'the computer' reading 'out loud'. Easy to do most things; increasingly difficult to do the complete job. Other languages may be easier than English. People who are not monolingual, please comment!

Restricted / directed speech applications: The language is VoiceXML. We will use evolution.voxeo.com to create directed speech applications. – Free facility: put in a URL pointing to a VoiceXML document. It supplies phone numbers to call in to test. – You need to register. – Note: we previously used Tellme Studio, but it stopped offering the service.

XML: Like a generalization of HTML (both descend from SGML). XML documents have markup. – Tag indicating the type of element, possibly with attributes; then content; then a closing tag. Document must be well-formed. – Elements properly nested inside other elements (no overlapping) – Quotation marks around attribute values. The designers of each XML language decide on its element types. – So, we need to obey the rules of VoiceXML: each element type can only have certain child elements.
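Below is a sketch of both rules, using Python's standard `xml.etree.ElementTree` parser. The parser rejects documents that are not well-formed, such as overlapping tags or unquoted attribute values. A small recursive check then enforces which children each element type may have. `HELLO_VXML` is a "hello, world" VoiceXML document of the kind the homework asks for. `ALLOWED_CHILDREN` is a hypothetical, heavily simplified subset of the VoiceXML content rules, not the real schema.

```python
import xml.etree.ElementTree as ET

HELLO_VXML = """<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <prompt>Hello, world!</prompt>
    </block>
  </form>
</vxml>"""

# Hypothetical, simplified subset of which children each VoiceXML element may have.
ALLOWED_CHILDREN = {
    "vxml": {"form", "menu", "var", "catch"},
    "form": {"block", "field", "var", "filled"},
    "block": {"prompt", "audio", "goto", "var"},
    "field": {"prompt", "grammar", "filled"},
    "prompt": {"audio", "break"},
}

def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]

def check_children(elem):
    """Return 'parent > child' pairs not permitted by ALLOWED_CHILDREN."""
    problems = []
    parent = local_name(elem.tag)
    for child in elem:
        name = local_name(child.tag)
        if name not in ALLOWED_CHILDREN.get(parent, set()):
            problems.append(f"{parent} > {name}")
        problems.extend(check_children(child))
    return problems

def check(document):
    """Report whether a document is well-formed and obeys the child-element rules."""
    try:
        root = ET.fromstring(document.encode("utf-8"))
    except ET.ParseError as err:
        return f"not well-formed: {err}"
    problems = check_children(root)
    return "ok" if not problems else "bad nesting: " + ", ".join(problems)

print(check(HELLO_VXML))                                        # ok
print(check('<vxml version="2.1"><prompt>Hi</prompt></vxml>'))  # bad nesting: vxml > prompt
```

The last line is well-formed XML but still not acceptable VoiceXML. Passing the XML rules is necessary but not sufficient.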

Screen shot from Voxeo

Screen shot: phone numbers

Homework (over break): Sign up to be a Voxeo developer. – Start the VoiceXML tutorials. – Do your own 'hello, world' application. Start planning your VoiceXML project.
