
1 Application of Speech Recognition, Synthesis, Dialog

2 Speech for communication
The difference between speech and language
Speech recognition and speech understanding

3 Speech recognition can only identify words
System does not know what you want
System does not know who you are

4 Speech and Audio Processing
Signal processing: convert the audio wave into a sequence of feature vectors
Speech recognition: decode the sequence of feature vectors into a sequence of words
Semantic interpretation: determine the meaning of the recognized words
Dialog management: correct errors and help get the task done
Response generation: choose what words to use to maximize user understanding
Speech synthesis: generate synthetic speech from a ‘marked-up’ word string

5 Data Flow
[Pipeline diagram] Part I: Signal Processing → Speech Recognition → Semantic Interpretation. Part II: Discourse Interpretation → Dialog Management → Response Generation → Speech Synthesis

6 Semantic Interpretation: Word Strings
Content is just words
System: What is your address?
User: My address is fourteen eleven main street
Need concept extraction / keyword spotting (a sketch follows)
Applications: template filling, directory services, information retrieval
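
As an illustration (not from the original slides), a minimal keyword/concept-spotting sketch in Python; the number-word grammar and street types are made-up examples, not any real system's vocabulary:

```python
import re

# Toy grammar: spot a spoken house number followed by a street name.
NUMBER = (r"(?:zero|one|two|three|four|five|six|seven|eight|nine|ten|"
          r"eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|"
          r"eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|"
          r"eighty|ninety|hundred)")
ADDRESS = re.compile(
    rf"((?:{NUMBER}\s+)+)(\w+(?:\s\w+)*?)\s+(street|avenue|road|drive)",
    re.IGNORECASE)

def extract_address(words: str):
    """Return (house-number words, street name, street type) or None."""
    m = ADDRESS.search(words)
    if m is None:
        return None
    return m.group(1).split(), m.group(2), m.group(3)

print(extract_address("my address is fourteen eleven main street"))
# -> (['fourteen', 'eleven'], 'main', 'street')
```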

7 Semantic Interpretation: Pattern-Based
Simple (typically regular) patterns specify content
ATIS (Air Travel Information System) task:
System: What are your travel plans?
User: [On Monday], I’m going [from Boston] [to San Francisco].
Content: [DATE=Monday, ORIGIN=Boston, DESTINATION=SFO]
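
A minimal pattern-based sketch along these lines (the slot patterns are illustrative; a real system would also normalize values, e.g., San Francisco → SFO):

```python
import re

# Illustrative ATIS-style slot patterns, not from any actual grammar.
PATTERNS = {
    "DATE": re.compile(
        r"\bon (monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b",
        re.IGNORECASE),
    "ORIGIN": re.compile(r"\bfrom ([a-z ]+?)(?= to\b|[,.]|$)", re.IGNORECASE),
    "DESTINATION": re.compile(r"\bto ([a-z ]+?)(?=[,.]|$)", re.IGNORECASE),
}

def interpret(utterance: str) -> dict:
    """Fill a flat slot/value frame from the first match of each pattern."""
    frame = {}
    for slot, pattern in PATTERNS.items():
        m = pattern.search(utterance)
        if m:
            frame[slot] = m.group(1).strip()
    return frame

print(interpret("On Monday, I'm going from Boston to San Francisco."))
# -> {'DATE': 'Monday', 'ORIGIN': 'Boston', 'DESTINATION': 'San Francisco'}
```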

8 Semantic Interpretation: Parsing
In the general case, we have to uncover who did what to whom:
System: What would you like me to do next?
User: Put the block in the box on Platform 1. [ambiguous]
System: How can I help you?
User: Where is A Bug’s Life playing in Summit?
Requires some kind of parsing to produce relations (who did what to whom):
?(where(present(in(Summit, play(BugsLife)))))
This kind of representation is often used for machine translation
Often transferred to a flatter frame-based representation:
Utterance type: where-question
Movie: A Bug’s Life
Town: Summit

9 Robustness and Partial Success
Controlled speech: limited task vocabulary; limited task grammar
Spontaneous speech:
Can have a high out-of-vocabulary (OOV) rate
Includes restarts, word fragments, omissions, phrase fragments, disagreements, and other disfluencies
Contains much grammatical variation
Causes a high word error rate in the recognizer
Interpretation is often partial, allowing:
omission
parsing fragments

10 Speech Dialog Management

11 Discourse & Dialog Processing
Discourse interpretation: understand what the user really intends by interpreting utterances in context
Dialog management: determine system goals in response to user utterances, based on user intention
Response generation: generate natural language utterances to achieve the selected goals

Discourse interpretation takes the “who did what to whom” structure we get by interpreting the utterance in isolation and puts it in the discourse context to figure out what the user really intends. Dialog management takes the user’s utterance interpreted in discourse context and determines what the system should say in response by selecting one or more goals for the system to achieve. Response generation determines a set of natural language utterances that the system should say to the user in order to achieve the goals selected by the dialog manager.

12 Discourse Interpretation
Goal: understand what the user really intends
Example: “Can you move it?”
What does “it” refer to?
Is the utterance intended as a simple yes-no query or as a request to perform an action?
Issues addressed: reference resolution; intention recognition; interpreting user utterances in context

Discourse interpretation takes the “who did what to whom” structure we get from interpreting the user utterance in isolation and puts it in the discourse context. “Can you move it?” is a question whose relationship is MOVE, with agent HEARER and patient IT. Assuming that speech is the only mode of interaction, the referent of “it” cannot be resolved without taking discourse context into account. Discourse interpretation raises many other issues as well: ellipsis resolution, discourse structure construction.

13 Reference Resolution
U: Where is A Bug’s Life playing in Monroeville?
S: A Bug’s Life is playing at the Carmike theater.
U: When is it playing there?
S: It’s playing at 2pm, 5pm, and 8pm.
U: I’d like 1 adult and 2 children for the first show. How much would that be?
Knowledge sources: domain knowledge, discourse knowledge, world knowledge

To understand this dialogue, several different types of knowledge must be used to resolve its references. Domain knowledge is needed to understand that “A Bug’s Life” is the name of a movie and “Monroeville” is the name of a town. Discourse knowledge is needed to figure out that “it” refers to A Bug’s Life, “there” refers to the Carmike theater, and “that” refers to the tickets. World knowledge is needed to understand that people don’t generally buy adults and children; combined with domain knowledge, it lets us figure out that the user is referring to buying adult and child tickets. World knowledge is also needed to determine what “the first show” refers to, because we must know that 2pm is earlier than 5pm and 8pm.

14 Reference Resolution
Focus stacks:
Maintain recent objects in a stack
Select objects that satisfy semantic/pragmatic constraints, starting from the top of the stack
Take discourse structure into account
Rule-based filtering and ranking of objects for pronoun resolution

For the most part, reference resolution has focused on referring expressions that can be resolved using discourse knowledge alone. Focus stacks: add all noun phrases to the stack in the order they occurred, then search from the top of the stack for objects that satisfy the constraints of the reference; e.g., singular references must refer to singular noun phrases, and “there” must refer to a location. Discourse structure can be taken into account by keeping track of objects referred to in a subdialogue and popping those elements off the stack when the subdialogue is completed.
Centering: in English, Cf is ranked by grammatical relation to the main verb (subject > object > object2). Cb(Un) is the highest-ranked element of Cf(Un-1) that is realized in Un. Rules: if some element of Cf is realized as a pronoun, so should Cb be; Continuing > Retaining > Shifting.
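
A minimal focus-stack sketch (the entity attributes and constraint checks are illustrative, not from any particular system):

```python
# Focus-stack sketch: the most recently mentioned entities are searched first.
class FocusStack:
    def __init__(self):
        self.stack = []  # noun phrases in order of mention; most recent last

    def push(self, entity: dict):
        self.stack.append(entity)

    def resolve(self, **constraints):
        """Search from the top of the stack for the first entity whose
        attributes satisfy all semantic/pragmatic constraints."""
        for entity in reversed(self.stack):
            if all(entity.get(k) == v for k, v in constraints.items()):
                return entity["name"]
        return None

focus = FocusStack()
focus.push({"name": "A Bug's Life", "type": "movie", "number": "singular"})
focus.push({"name": "the Carmike theater", "type": "location",
            "number": "singular"})

print(focus.resolve(type="movie", number="singular"))  # "it" -> A Bug's Life
print(focus.resolve(type="location"))       # "there" -> the Carmike theater
```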

15 Dialog Management: Motivating Example
S: Would you like movie showtime or theater playlist information?
U: Movie showtime.
S: What movie do you want showtime information about?
U: Saving Private Ryan.
S: At what theater do you want to see Saving Private Ryan?
U: Carmike.
S: Saving Private Ryan is not playing at the Carmike theater.

System initiative: the system controls the dialogue flow, and the user can only answer system questions. Implicit confirmation: the recognized answer from the previous utterance is embedded in the system’s next question for confirmation purposes. The final prompt informs the user of the failed query.

16 Interacting with the user
Dialog manager: guides the interaction through the task; maps user inputs and system state into actions
Domain agents (one or more): interact with back-end(s); interpret information using domain knowledge

17 Dialog Management
Goal: determine what to accomplish in response to user utterances, e.g.:
Answer a user question
Solicit further information
Confirm/clarify a user utterance
Notify of an invalid query
Notify of an invalid query and suggest an alternative
The dialog manager is the interface between the user/language-processing components and the system knowledge base; the decision on selecting system goals is based on information from both the language-processing front end and the knowledge-base back end.

18 Dialog Management (Cont’d)
Main design issues:
Functionality: how much should the system do?
Process: how should the system do it?
Affected by:
Task complexity: how hard the task is
Dialogue complexity: what dialogue phenomena are allowed
Affects: robustness, naturalness, perceived intelligence

Functionality: the more the system has to do, the more complex its dialogue management strategies must be, similar to the perplexity measure in ASR, which captures the number of things the system can do next. Process: the same task can be accomplished in a flexible way (e.g., a mixed-initiative dialogue strategy) or in an unnatural manner (e.g., a system-initiative dialogue strategy). Both the functionality and the process are affected by task complexity and dialogue complexity.

19 Graph-based systems
Dialogue flow is a graph of prompts with keyword-labeled transitions, e.g.:
“Welcome to Bank ABC! Please say one of the following: Balance, Hours, Loan, ...”
“What type of loan are you interested in? Please say one of the following: Mortgage, Car, Personal, ...”
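
A minimal graph-based dialogue sketch (node names and the tiny prompt set are illustrative):

```python
# Each node: (prompt, {recognized keyword -> next node}).
GRAPH = {
    "top":  ("Welcome to Bank ABC! Please say: Balance, Hours, or Loan.",
             {"balance": "end", "hours": "end", "loan": "loan"}),
    "loan": ("What type of loan? Please say: Mortgage, Car, or Personal.",
             {"mortgage": "end", "car": "end", "personal": "end"}),
}

def run_dialogue(user_turns):
    """Walk the dialogue graph; unrecognized input repeats the same node."""
    node = "top"
    for heard in user_turns:
        prompt, transitions = GRAPH[node]
        print("S:", prompt)
        print("U:", heard)
        node = transitions.get(heard.lower(), node)
        if node == "end":
            break

run_dialogue(["Loan", "Mortgage"])
```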

20 Frame-based systems
[Diagram: a set of form-like frames, each with several labeled slots to fill in; the system transitions between frames on a keyword or phrase]
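
A minimal frame-filling sketch (slot names and the keyword table are made up). Unlike a graph-based system, slots can be filled in any order, several per turn:

```python
# One frame with empty slots; a keyword table maps phrases to slots.
FRAME = {"movie": None, "theater": None, "showtime": None}
KEYWORDS = {
    "saving private ryan": "movie",
    "carmike": "theater",
    "8pm": "showtime",
}

def update(frame, utterance):
    """Fill every slot whose keyword appears; return the slots still empty."""
    for phrase, slot in KEYWORDS.items():
        if phrase in utterance.lower():
            frame[slot] = phrase
    return [slot for slot, value in frame.items() if value is None]

missing = update(FRAME, "Saving Private Ryan at the Carmike")
print(FRAME)                   # movie and theater filled in a single turn
print("ask about:", missing)   # ['showtime'] -> system prompts for it next
```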

21 Application Task Complexity
Examples (simple to complex): Weather Information, ATIS, Call Routing, Automatic Banking, Travel Planning, University Course Advising
Directly affects:
Types and quantity of system knowledge
Complexity of the system’s reasoning abilities

Task complexity is measured in terms of the number of domain actions the system can take next. Simple tasks include call routing, where the system’s only task is to identify one of a set of possible destinations a call can be directed to, and weather information or banking transactions, where the system gathers all necessary information and queries its database to provide an answer or perform a database action. For simple tasks, the system doesn’t need much knowledge beyond the types of information required to perform each task, and its reasoning abilities are limited to things like figuring out a caller’s checking account number given their savings account number. Complex tasks include travel planning and university course advising. In these tasks, the system needs a lot of domain knowledge in order to play the role of a travel agent or an undergraduate advisor. For instance, the system needs to know that people can travel by air, by train, or by car; that people are more likely to change airlines than their date of travel; and that most people prefer less expensive airfares. It also has to have the knowledge to find connecting flights. The system generally plays a much more active role in these tasks.

22 Dialog Complexity
Determines what can be talked about:
The task only
Subdialogs: e.g., clarification, confirmation
The dialog itself (meta-dialog): “Could you hold on for a minute?” “What was that click? Did you hear it?”
Determines who can talk about them:
System only
User only
Both participants

23 Dialog Management: Functionality
Determines the set of possible goals that the system may select at each turn
At the task level, dictated by task complexity
At the dialogue level, determined by the system designer in terms of dialog complexity:
Are subdialogs allowed?
Are meta-dialogs allowed?
Only by the system, by the user, or by both agents?

24 Dialogue Management: Process
Determines how the system will go about selecting among the possible goals
At the dialogue level, determined by the system designer in terms of initiative strategies:
System-initiative: the system always has control; the user only responds to system questions
User-initiative: the user always has control; the system passively answers user questions
Mixed-initiative: control switches between system and user according to fixed rules
Variable-initiative: control switches between system and user dynamically, based on participant roles, dialogue history, etc.

In system- and user-initiative systems, the system decides what to do next based on the current task status. In mixed- and variable-initiative systems, the system first determines the initiative distribution (whether or not it has initiative for the next turn), and then determines what to do based on the task status and whether or not it has initiative.

25 Response Generation
U: Is Saving Private Ryan playing at the Chatham cinema?

26 S: No, it’s not.
S provides an elliptical response

27 S: No, Saving Private Ryan is not playing at the Chatham cinema.
S provides a full response (which provides grounding information)

28 S: No, Saving Private Ryan is not playing at the Chatham cinema; the theater’s under renovation.
S provides a full response and supporting evidence

29 Communicating with the user
Language generator: decides what to say to the user (and how to phrase it)
Speech synthesizer: constructs the sounds and intonation
Display generator
Action generator

30 Response Generation
Goal: generate natural language utterances to achieve the goal(s) selected by the dialogue manager
Issues:
Content selection: determining what to say
Surface realization: determining how to say it
Generation gap: the discrepancy between the actual output of the content selection process and the expected input of the surface realization process
Content selection is considered part of discourse planning, while surface realization is the counterpart of syntactic parsing on the generation side.

31 Language generation
Template-based systems: sentence templates with variables
“Linguistic” systems: generate the surface form from a meaning representation
Stochastic approaches: statistical models of domain-expert speech
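
A minimal template-based sketch (the goal names and templates are illustrative), producing the kind of full responses shown on slides 27-28:

```python
# Sentence templates with variables, keyed by the dialogue manager's goal.
TEMPLATES = {
    "deny-playing": "No, {movie} is not playing at the {theater} cinema.",
    "deny-playing-evidence": ("No, {movie} is not playing at the {theater} "
                              "cinema; {reason}."),
}

def generate(goal: str, **slots) -> str:
    return TEMPLATES[goal].format(**slots)

print(generate("deny-playing-evidence",
               movie="Saving Private Ryan",
               theater="Chatham",
               reason="the theater's under renovation"))
```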

32 Dialog Evaluation
Goal: determine how “well” a dialogue system performs
Main difficulties:
No strict right or wrong answers
Difficult to determine what features make one dialogue system better than another
Difficult to select metrics that contribute to the overall “goodness” of the system
Difficult to determine how the metrics compensate for one another
Expensive to collect new data for evaluating incremental improvements to systems
Features that make a dialogue system better: efficiency vs. accuracy. Metric compensation: are longer dialogues that are more likely to lead to completed tasks better than shorter dialogues that are more likely to fail?

33 Dialog Evaluation (Cont’d)
System-initiative, explicit confirmation:
better task success rate
lower WER
longer dialogs
fewer recovery subdialogs
less natural
Mixed-initiative, no confirmation:
lower task success rate
higher WER
shorter dialogs
more recovery subdialogs
more natural

Typically preferred features were highlighted on the original slide. However, it is sometimes unclear which features are preferred: toll-free (800) number services would prefer shorter dialogues, while pay-per-minute (900) number services would prefer longer dialogues.

34 Speech Synthesis

35 Speech Synthesis (Text-to-Speech, TTS)
Prior knowledge: a vocabulary from words to sounds; surface markup
Recorded prompts
Formant synthesis: model the vocal tract as a source and filters
Concatenative synthesis: record and segment an expert’s voice; splice appropriate units into full utterances
Intonation modeling

36 Recorded Prompts
The simplest (and most common) solution is to record prompts spoken by a (trained) human
Produces human-quality voice
Limited by the number of prompts that can be recorded
Can be extended by limited cut-and-paste or template filling

37 Rule-based Synthesis
Uses linguistic rules (with or without training) to generate features
Example: DECtalk

38 The Source-Filter Model of Formant Synthesis
Model of the features to be extracted and fitted
Excitation or voicing source(s) to model the sound source:
a standard wave of glottal pulses for voiced sounds
randomly varying noise for unvoiced sounds
modification of airflow due to the lips, etc.
high frequency (F0 rate), quasi-periodic, choppy
modeled with a vector of glottal waveform patterns in voiced regions
Acoustic filter(s):
shape the frequency character of the vocal tract and the radiation character at the lips
relatively slow (samples around every 5 ms suffice) and stationary
modeled with LPC (linear predictive coding)
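
A minimal source-filter sketch, assuming NumPy/SciPy: an impulse train stands in for the glottal source, and a cascade of second-order resonators stands in for the vocal-tract filter. The formant frequencies and bandwidths are rough textbook values for the vowel /a/, not fitted data:

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                       # sample rate (Hz)
f0, dur = 120, 0.5               # pitch (Hz) and duration (s)

# Source: quasi-periodic excitation, approximated by an impulse train at F0
# (a real synthesizer uses shaped glottal pulses, plus noise for unvoiced).
n = int(fs * dur)
source = np.zeros(n)
source[::fs // f0] = 1.0

# Filter: cascade of two-pole resonators, one per formant (freq, bandwidth).
signal = source
for freq, bw in [(700, 130), (1220, 70), (2600, 160)]:
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    signal = lfilter([1.0], [1.0, -2 * r * np.cos(theta), r * r], signal)

signal /= np.max(np.abs(signal))  # normalize; write to a WAV file to listen
```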

39 Concatenative Synthesis
Record a basic inventory of sounds
Retrieve the appropriate sequence of units at run time
Concatenate and adjust durations and pitch
Synthesize the waveform

40 Diphone and Polyphone Synthesis
Phone sequences capture co-articulation
Cut speech at positions that minimize context contamination
Need single phones, diphones, and sometimes triphones
Reduce the number collected by:
phonotactic constraints
collapsing in cases of no co-articulation
Data collection methods:
Collect data from a single (professional) speaker
Select text with maximal coverage (typically with a greedy algorithm; see the sketch below), or
Record minimal pairs in desired contexts (real words or nonsense)
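
A minimal greedy coverage sketch (the sentences and phone strings are toy data, not a real corpus): repeatedly pick the sentence covering the most not-yet-covered diphones:

```python
def diphones(phones):
    """The set of adjacent phone pairs in a pronunciation."""
    return set(zip(phones, phones[1:]))

# Toy corpus: sentence -> phone string (illustrative transcriptions).
corpus = {
    "we were away": ["w", "iy", "w", "er", "ax", "w", "ey"],
    "a year ago":   ["ax", "y", "ih", "r", "ax", "g", "ow"],
    "go away now":  ["g", "ow", "ax", "w", "ey", "n", "aw"],
}

needed = set().union(*(diphones(p) for p in corpus.values()))
chosen = []
while needed:
    best = max(corpus, key=lambda s: len(diphones(corpus[s]) & needed))
    chosen.append(best)
    needed -= diphones(corpus[best])

print(chosen)  # a small set of sentences covering every diphone
```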

41 Signal Processing for Concatenative Synthesis
Diphones recorded in one context must be generated in other contexts
Features are extracted from the recorded units
Signal processing manipulates the features to smooth boundaries where units are concatenated
Signal processing modifies the signal via ‘interpolation’:
intonation
duration
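
As a hedged illustration of boundary smoothing, a short linear crossfade at each joint; real systems smooth in a feature domain (e.g., LPC parameters or pitch-synchronous processing), so this time-domain version is only a stand-in:

```python
import numpy as np

def concatenate(units, fs=16000, fade_ms=5):
    """Splice waveform units, crossfading at each boundary to reduce
    discontinuities in amplitude and phase."""
    k = int(fs * fade_ms / 1000)
    out = units[0]
    for unit in units[1:]:
        ramp = np.linspace(0.0, 1.0, k)
        overlap = out[-k:] * (1 - ramp) + unit[:k] * ramp
        out = np.concatenate([out[:-k], overlap, unit[k:]])
    return out

# Toy units: two tone bursts with mismatched phase at the joint.
fs = 16000
t = np.arange(fs // 10) / fs
unit_a = np.sin(2 * np.pi * 220 * t)
unit_b = np.sin(2 * np.pi * 220 * t + 2.0)
joined = concatenate([unit_a, unit_b], fs)
```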

42 Duration Modeling
Must generate segments with the appropriate duration
Segmental identity: /ai/ in “like” is twice as long as /I/ in “lick”
Surrounding segments: vowels are longer following voiced fricatives than voiceless stops
Syllable stress: onsets and nuclei of stressed syllables are longer than in unstressed syllables
Word “importance”: a word accent with a major pitch movement lengthens the word
Location of the syllable in the word: word-final longer than word-initial longer than word-internal
Location of the syllable in the phrase: phrase-final syllables are longer than the same syllable in other positions
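
Factors like these are often combined multiplicatively (Klatt-style duration rules). A sketch with made-up intrinsic durations and factors:

```python
# Intrinsic durations (ms) and rule factors below are illustrative only.
INTRINSIC_MS = {"ai": 160, "I": 80}    # segmental identity

FACTORS = {
    "stressed_syllable": 1.3,   # stressed onsets/nuclei are longer
    "phrase_final": 1.4,        # phrase-final lengthening
    "word_internal": 0.85,      # word-internal syllables are shorter
    "accented_word": 1.2,       # major pitch movement lengthens
}

def duration_ms(segment, features):
    """Scale the segment's intrinsic duration by each applicable factor."""
    ms = INTRINSIC_MS[segment]
    for f in features:
        ms *= FACTORS.get(f, 1.0)
    return ms

print(duration_ms("ai", {"stressed_syllable", "phrase_final"}))  # ~291 ms
print(duration_ms("I", {"word_internal"}))                       # 68 ms
```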

43 Intonation: Tone Sequence Models
Functional information can be encoded via tones:
given/new information (information status)
contrastive stress
phrasal boundaries (clause structure)
dialogue act (statement/question/command)
Tone sequence models:
F0 contours are generated from phonologically distinctive tones/pitch accents that are locally independent
generate a sequence of tonal targets and fit them with signal processing
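
A minimal tonal-target sketch, assuming NumPy: pitch accents and boundary tones become (time, F0) targets, and the contour is interpolated between them. The target values are illustrative; real models derive them from tone type and pitch range:

```python
import numpy as np

# (time in s, F0 in Hz) targets for a short declarative phrase.
targets = [
    (0.00, 120.0),   # phrase-initial reference
    (0.35, 180.0),   # H* pitch accent peak
    (0.80, 110.0),   # L- phrase accent
    (1.00,  90.0),   # L% boundary tone: final fall
]

times, f0_values = zip(*targets)
frame_times = np.arange(0.0, 1.0, 0.01)             # one frame every 10 ms
contour = np.interp(frame_times, times, f0_values)  # piecewise-linear F0
```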

44 Intonation for Function
The ToBI (Tones and Break Indices) system is one example:
Pitch accents (*): H*, L*, H*+L, H+L*, L*+H, L+H*
Phrase accents (-): H-, L-
Boundary tones (%): H%, L%
Intonational phrase: <pitch accent>+ <phrase accent> <boundary tone>
Statement vs. question example
Source: Multilingual Text-to-Speech Synthesis, R. Sproat, ed., Kluwer, 1998

45 Text Markup for Synthesis
Bell Labs TTS markup: r(0.9) L*+H(0.8) Humpty L*+H(0.8) Dumpty r(0.85) L*(0.5) sat on a H*(1.2) wall.
Tones: Tone(Prominence)
Speaking rate: r(Rate), plus pauses
Top line (highest pitch); reference line (reference pitch); base line (lowest pitch)
SABLE is an emerging standard extending SGML marks: emphasis(#), break(#), pitch(base/mid/range, #), rate(#), volume(#), semanticMode(date/time/ /URL/...), speaker(age, sex)
Implemented in the Festival synthesizer (free for research, etc.)

46 Intonation in Bell Labs TTS
Generate a sequence of F0 targets for synthesis
Example: “We were away a year ago.”
Phones: w E w R & w A & y E r & g O
Source: Multilingual Text-to-Speech Synthesis, R. Sproat, ed., Kluwer, 1998

47 Barge-in: Interrupting the computer
Combined processing of the input signal and output signal
A signal detector runs continuously, looking for speech start and end points:
tests a generic speech model against a noise model
typically cancels echoes created by the outgoing speech
If speech is detected:
any synthesized or recorded speech output is cancelled
recognition begins and continues until an end point is detected
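
A hedged, energy-based sketch of the detection step, assuming NumPy; the precomputed echo estimate stands in for a real echo canceller, and the threshold and calibration choices are illustrative:

```python
import numpy as np

def detect_barge_in(mic, echo_estimate, fs=8000, frame_ms=20, threshold=4.0):
    """Subtract the estimated echo of the outgoing prompt, then return the
    index of the first frame whose energy exceeds a multiple of the noise
    floor (i.e., the user has started speaking), or None."""
    cleaned = mic - echo_estimate                # stand-in for echo canceller
    hop = int(fs * frame_ms / 1000)
    frames = cleaned[: len(cleaned) // hop * hop].reshape(-1, hop)
    energy = (frames ** 2).mean(axis=1)
    noise_floor = energy[:5].mean()              # calibrate on leading frames
    hot = np.nonzero(energy > threshold * noise_floor)[0]
    return int(hot[0]) if hot.size else None

# Toy signals: prompt echo plus background noise; user speech starts at 0.5 s.
fs = 8000
t = np.arange(fs) / fs
echo = 0.3 * np.sin(2 * np.pi * 440 * t)
mic = echo + 0.01 * np.random.randn(fs)
mic[fs // 2:] += 0.5 * np.sin(2 * np.pi * 180 * t[fs // 2:])
print(detect_barge_in(mic, echo))  # ~frame 25: cancel prompt, start recognizer
```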

48 What you can do with Speech Recognition
Transcription: dictation, information retrieval
Command and control: data entry, device control, navigation, call routing
Information access: airline schedules, stock quotes, directory assistance
Problem solving: travel planning, logistics

49 Human-machine interface is critical
Speech recognition is NOT the core function of most applications
Speech is a feature of applications that offers specific advantages
Errorful recognition is a fact of life

50 Properties of Recognizers
Speaker-independent vs. speaker-dependent
Large vocabulary (2K-200K words) vs. limited vocabulary (2-200)
Continuous vs. discrete speech
Speech recognition vs. speech verification
Real time vs. multiples of real time
Spontaneous speech vs. read speech
Noisy environment vs. quiet environment
High-resolution microphone vs. telephone vs. cellphone
Push-and-hold vs. push-to-talk vs. always-listening
Adapts to the speaker vs. non-adaptive
Low vs. high latency
Online incremental results vs. final results only
Dialog management

51 Major Application Components: Users
One-time vs. frequent users
Homogeneity
Technically sophisticated

52 Features That Distinguish Products & Applications
Words, phrases, and grammar
Models of the speakers
Speech flow

53 Features That Distinguish Products & Applications: Words and Phrases
Vocabulary:
How many words
How you add new words
Grammars:
Branching factor (perplexity)
Available languages

54 Features That Distinguish Products & Applications: Models of Speakers
Speaker-dependent
Speaker-independent
Speaker-adaptive

55 Features That Distinguish Systems: Talking to a Speech Recognition System
Discrete (pauses) vs. continuous (no pauses)
Push-and-hold vs. push-to-talk vs. always listening
Dialog management
Response types

56 Speech Recognition vs. Touch Tone
Shorter calls
Choices mean something
Automates more tasks
Reduces annoying operations
Available

57 Transcription and Dictation
Transcription transforms a stream of human speech into computer-readable form:
medical reports, court proceedings, notes
indexing (e.g., broadcasts)
Dictation is the interactive composition of text:
reports, correspondence, etc.

58 SpeechWear: Vehicle inspection task
USMC mechanics, fixed inspection form
Wearable computer (COTS components)
HTML-based task representation
[film clip]

59

60 Speech recognition and understanding
Sphinx system: speaker-independent, continuous speech, large vocabulary
ATIS system: air travel information retrieval, context management
[film clip (1994)]

61

62 Sample Market: Call Centers
Automate services, lower payroll
Shorten time on hold
Shorten agent and client call time
Reduce fraud
Improve customer service

63 Interface guidelines
State transparency
Input control
Error recovery: error detection, error correction
Log performance
Application integration

64 Applications Related to Speech Recognition
Speech recognition: figure out what a person is saying
Speaker verification: authenticate that a person is who he/she claims to be (limited speech patterns)
Speaker identification: assign an identity to the voice of an unknown person (arbitrary speech patterns)

65 Stronger Authentication
Three types of security:
What you have: key, card, token
What you know: password, PIN, maiden name
Who you are: biometrics (stronger authentication)

66 Family Tree: Voice Biometrics
[Diagram] Speech processing takes digitized speech as input (speech recognition) and produces it as output (speech synthesis). Biometrics include signature verification, typing dynamics, face recognition, finger geometry, fingerprinting, hand geometry, iris/retina scan, DNA, and voice biometrics; voice biometrics divides into speaker verification and speaker identification.

67 Carnegie Mellon Speech Demos
CMU Communicator. Call: CMU-PLAN ( ), also , or x8-1084. The information is accurate; you can use it for your own travel planning…
CMU Universal Speech Interface (USI)
CMU Movie Line (seems to be about apartments now…) Call: (412)

68 Technology providers
VoiceXML
Voice portals: tellme.com, bevocal.com

69 Telephone Demos
Nuance (http://www.nuance.com): Banking, Travel Planning, Stock Quotes
SpeechWorks: Banking, Stock Trading
MIT Spoken Language Systems Laboratory: Travel Plans (Pegasus), Weather (Jupiter)
IBM: Mutual Funds, Name Dialing; ViaVoice

Nuance systems are completely system-driven and cannot handle user initiative at all; e.g., in the stock quote system, you must say the name of the company only, without embedding it in a sentence. SpeechWorks prompts for each attribute, but claims to be able to handle user initiative. MIT: Pegasus wasn't answering the phone; Jupiter is able to recognize full sentences and answer questions (it takes dialogue history into account). IBM:

70 Questions?

