Introduction to Conversational Interfaces
Jim Glass, Spoken Language Systems Group, MIT Laboratory for Computer Science
February 10, 2003


Slide 2: Introduction to Conversational Interfaces
Jim Glass (glass@mit.edu), Spoken Language Systems Group, MIT Laboratory for Computer Science, February 10, 2003

Slide 3: Virtues of Spoken Language
Speech interfaces are ideal for information access and management when:
– The information space is broad and complex,
– The users are technically naive, or
– Speech is the only available modality.
Virtues of spoken language:
– Natural: requires no special training
– Flexible: leaves hands and eyes free
– Efficient: has a high data rate
– Economical: can be transmitted and received inexpensively

Slide 4: Communication via Spoken Language
[Diagram: human and computer exchange meaning through speech and text. On input, speech recognition and language understanding map speech and text to meaning; on output, language generation and speech synthesis map meaning back to text and speech.]

Slide 5: Components of Conversational Systems
[Diagram: audio input flows through speech recognition, language understanding, and context resolution into dialogue management, which queries a database; the response flows back out through language generation and speech synthesis to audio.]

Slide 6: Components of MIT Conversational Systems
[Diagram: the GALAXY hub connects the same components, instantiated by MIT systems: SUMMIT (speech recognition), TINA (language understanding), Discourse (context resolution), a Dialogue Manager, GENESIS (language generation), and ENVOICE (speech synthesis), along with audio and database servers.]

Slide 7: Segment-Based Speech Recognition
– Frame-based measurements are computed from the waveform every 5 ms.
– A segment network is created by interconnecting spectral landmarks.
– A probabilistic search finds the most likely phone and word strings.
[Figure: waveform, frame-based measurements, segment network, and the phone and word hypotheses for the example utterance "computers that talk".]
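The probabilistic search over the segment network can be sketched as a best-path dynamic program over landmark-to-landmark segments. Everything below — the landmarks, phone labels, and scores — is invented for illustration and does not reflect SUMMIT's actual models or segment scoring:

```python
import math

# Toy segment lattice: each segment spans (start_landmark, end_landmark)
# and carries a hypothesized phone label with a log-probability score.
segments = [
    (0, 1, "k",  math.log(0.9)),
    (0, 2, "g",  math.log(0.2)),   # competing segment skipping landmark 1
    (1, 2, "ax", math.log(0.8)),
    (2, 3, "m",  math.log(0.7)),
]

def best_path(segments, n_landmarks):
    # best[i] = (score, phone sequence) of the best path reaching landmark i
    best = {0: (0.0, [])}
    for i in range(1, n_landmarks):
        candidates = []
        for (s, e, phone, lp) in segments:
            if e == i and s in best:
                prev_score, prev_seq = best[s]
                candidates.append((prev_score + lp, prev_seq + [phone]))
        if candidates:
            best[i] = max(candidates)  # keep the highest-scoring path
    return best[n_landmarks - 1]

score, phones = best_path(segments, 4)
```

Here the direct path through "g" loses to the two-segment path "k ax", so the decoder prefers the hypothesis with the higher total log probability.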

Slide 8: Segment-Based Speech Recognition

Slide 9: Natural Language Understanding
– Example: "show me flights from boston to denver" is parsed into a syntactic tree (sentence → command, subject, predicate; with display, flight_list, source, destination, and city nodes).
– Some syntactic nodes carry semantic tags for creating a semantic frame:
  Clause: DISPLAY
    Topic: FLIGHT
      Predicate: FROM  Topic: CITY  Name: "Boston"
      Predicate: TO    Topic: CITY  Name: "Denver"
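The flattening of semantically tagged parse nodes into a frame can be illustrated with a toy data structure. The clause, topic, and predicate names mirror the slide; the dictionary representation and the `make_frame` helper are assumptions, not TINA's actual frame format:

```python
# Assemble a flat semantic frame from tagged constituents (illustrative only).
def make_frame(clause, topic, predicates):
    return {
        "clause": clause,
        "topic": topic,
        "predicates": [
            {"predicate": p, "topic": "CITY", "name": name}
            for (p, name) in predicates
        ],
    }

frame = make_frame("DISPLAY", "FLIGHT",
                   [("FROM", "Boston"), ("TO", "Denver")])
```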

Slide 10: Dialogue Modeling Strategies
– An effective conversational interface must incorporate extensive and complex dialogue modeling.
– Conversational systems differ in the degree to which the human or the computer takes the initiative:
  – Human initiative: the human takes complete control and the computer is totally passive. (H: "I want to visit my grandmother.")
  – Computer initiative: the computer maintains tight control and the human is highly restricted. (C: "Please say the departure city.")
– Our systems use a mixed-initiative approach, in which both the human and the computer play an active role.

Slide 11: Different Roles of Dialogue Management
– Pre-retrieval: turn ambiguous input into a unique database query.
  – Clarification (recognition errors):
    U: I need a flight from Boston to San Francisco.
    C: Did you say Boston or Austin?
    U: Boston, Massachusetts.
  – Clarification (insufficient info):
    C: I need a date before I can access Travelocity.
    U: Tomorrow.
    C: Hold on while I retrieve the flights for you.
– Post-retrieval: reduce multiple database retrievals to a unique response by helping the user narrow down the choices.
    C: I have found 10 flights meeting your specification. When would you like to leave?
    U: In the morning.
    C: Do you have a preferred airline?
    U: United.
    C: I found two non-stop United flights leaving in the morning…
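The pre-/post-retrieval distinction above can be sketched as a single decision rule that picks the next system act. The city list, slot names, and the five-result threshold are all invented for illustration; a real dialogue manager is far richer:

```python
# Minimal mixed-initiative sketch: clarify until the query is unambiguous,
# then narrow the result set. All data here is hypothetical.
KNOWN_CITIES = {"boston", "austin", "denver", "san francisco"}

def next_system_act(query, results):
    # Pre-retrieval: ambiguous or missing slots -> clarification question.
    if query.get("source") not in KNOWN_CITIES:
        return "Please say the departure city."
    if "date" not in query:
        return "What date would you like to travel?"
    # Post-retrieval: too many hits -> ask for a narrowing constraint.
    if len(results) > 5:
        return "When would you like to leave?"
    return "Here are your flights."
```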

Slide 12: Concatenative Speech Synthesis
– The output waveform is generated by concatenating segments of a pre-recorded speech corpus.
– Concatenation can occur at the phrase, word, or sub-word level (e.g., "computer science laboratory").
– Synthesis examples:
  – "Continental flight 4695 from Greensboro is expected in Halifax at 10:08 pm local time."
  – "The third ad is a 1996 black Acura Integra with 45380 miles. The price is 8970 dollars. Please call (404) 399-7682."
  – Example words: "compassion", "disputed", "cedar city", "since", "giant", "labyrinth", "abracadabra", "obligatory".
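Word-level concatenation can be shown in miniature: look up each unit in a recorded inventory and splice the samples together. The inventory and sample values below are toy data; a real system such as ENVOICE selects units with target and join costs rather than a plain dictionary lookup:

```python
# Hypothetical inventory mapping recorded units to their waveform samples.
inventory = {
    "computer":   [0.1, 0.2, 0.1],
    "science":    [0.3, 0.1],
    "laboratory": [0.2, 0.4, 0.2],
}

def synthesize(units):
    # Word-level concatenation: splice the recorded samples end to end.
    samples = []
    for u in units:
        samples.extend(inventory[u])
    return samples

waveform = synthesize(["computer", "science", "laboratory"])
```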

Slide 13: Multilingual Conversational Interfaces
– Adopts an interlingua approach for multilingual human-machine interactions.
– Applications:
  – MuXing: Mandarin system for weather information
  – Mokusei: Japanese system for weather information
  – Spanish systems are also under development
  – New speech-to-speech translation work (Phrasebook)
[Diagram: the GALAXY hub connects language-independent components (dialogue management, application back-end, audio and I/O servers), language-dependent components (speech recognition, language understanding, language generation, and text-to-speech conversion, each with per-language models and rules), and language-transparent components (discourse resolution).]

Slide 14: Bilingual Jupiter Demonstration

Slide 15: Multi-modal Conversational Interfaces
– Typing, pointing, and clicking can augment or complement speech.
– A picture (or a map) is worth a thousand words.
– Application: WebGalaxy
  – Allows typing and clicking
  – Includes map-based navigation with display
  – Embedded in a web browser
  – Current exhibit at the MIT Museum
[Diagram: speech recognition, gesture recognition, handwriting recognition, and mouth & eye tracking all feed language understanding, which produces meaning.]

Slide 16: WebGalaxy Demonstration

Slide 17: Delegating Tasks to Computers
– Many information-related activities can be done off-line.
– Off-line delegation frees the user to attend to other matters.
– Application: the Orion system
  – Task specification: the user interacts with Orion to specify a task:
    "Call me every morning at 6 and tell me the weather in Boston."
    "Send me e-mail any time between 4 and 6 p.m. if the traffic on Route 93 is at a standstill."
  – Task execution: Orion leverages existing infrastructure to support interaction with humans.
  – Event notification: Orion calls back to deliver the information.

Slide 18: Audio-Visual Integration
– Audio and visual signals both contain information about:
  – Identity of the person: who is talking?
  – Linguistic message: what is (s)he saying?
  – Emotion, mood, stress, etc.: how does (s)he feel?
– The two channels of information:
  – Are often inter-related
  – Are often complementary
  – Must be consistent
– Integration of these cues can lead to enhanced capabilities for future human-computer interfaces.

Slide 19: Audio-Visual Symbiosis
[Diagram: the acoustic and visual signals each support three tasks — personal identity (speaker ID plus face ID yields robust person ID), the linguistic message (speech recognition plus lip/mouth reading yields robust ASR), and paralinguistic information (acoustic plus visual paralinguistic detection yields robust paralinguistic detection).]

Slide 20: Multi-modal Interfaces: Beyond Clicking
– Inputs need to be understood in the proper context:
  – A gesture alone is ambiguous: does this mean "yes," "one," or something else?
– Timing information is a useful way to relate inputs:
  – "Move this one over there." Where is she looking or pointing while saying "this" and "there"?
  – "Are there any over here?" What does he mean by "any," and what is he pointing at?

Slide 21: Multi-modal Fusion: Initial Progress
– All multi-modal inputs are synchronized:
  – The speech recognizer generates absolute times for words
  – Mouse and gesture movements generate {x, y, t} triples
  – The Network Time Protocol (NTP) is used for millisecond time resolution
– Speech understanding constrains gesture interpretation:
  – Initial work identifies an object or a location from gesture inputs
  – Speech constrains what, when, and how items are resolved
  – Object resolution also depends on information from the application
– Example: Speech: "Move this one over here"; pointing gestures, aligned in time, supply the object and location referents.
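The timestamp-based alignment above can be sketched as resolving each deictic word to the pointing event nearest in time. The word times, coordinates, and the nearest-neighbor rule are all illustrative assumptions, not the actual fusion algorithm:

```python
# Toy synchronized inputs: (word, time) pairs from the recognizer and
# (x, y, t) triples from the pointing device. All values are invented.
words = [("move", 0.10), ("this", 0.45), ("one", 0.60),
         ("over", 0.90), ("here", 1.20)]
points = [(120, 80, 0.50), (300, 220, 1.25)]

DEICTIC = {"this", "that", "here", "there"}

def resolve_deixis(words, points):
    # Bind each deictic word to the temporally closest pointing event.
    bindings = {}
    for w, t in words:
        if w in DEICTIC:
            x, y, tp = min(points, key=lambda p: abs(p[2] - t))
            bindings[w] = (x, y)
    return bindings

bindings = resolve_deixis(words, points)
```

A real system would also use the speech content to decide whether the gesture denotes an object or a location, as the slide notes.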

Slide 22: Multi-modal Demonstration
– Manipulating planets in a solar-system application
– Created with the SpeechBuilder utility, with small changes
– Gestures from vision (Darrell & Demirdjien)

Slide 23: Summary
– Speech- and language-based interfaces are inevitable, given:
  – The need for mobility and connectivity
  – The miniaturization of computers
  – Humans' innate desire to speak
– Progress has been made, e.g.:
  – Understanding and responding in constrained domains
  – Incorporating multiple languages and modalities
  – Automation and delegation
  – Rapid system configuration
– Much interesting research remains, e.g.:
  – Audio-visual integration
  – Perceptual user interfaces

Slide 24: The Spoken Language Systems Group
– Research: Scott Cyphers, James Glass, T.J. Hazen, Lee Hetherington, Joseph Polifroni, Shinsuke Sakai, Stephanie Seneff, Michelle Spina, Chao Wang, Victor Zue
– Post-Doctoral: Tony Ezzat
– Ph.D.: Edward Filisko, Karen Livescu, Alex Park, Mitchell Peabody, Ernest Pusateri, Han Shu, Min Tang, Jon Yi
– S.M.: Alicia Boozer, Brooke Cowan, John Lee, Laura Miyakawa, Ekaterina Saenko, Sy Bor Wang
– M.Eng.: Chian Chu, Chia-Huo La, Jonathon Lau
– Visitors: Paul Brittain, Thomas Gardos, Rita Singh
– Administrative: Marcia Davidson

