Introduction to Conversational Interfaces Jim Glass Spoken Language Systems Group MIT Laboratory for Computer Science February 10, 2003

Virtues of Spoken Language
– Natural: requires no special training
– Flexible: leaves hands and eyes free
– Efficient: has a high data rate
– Economical: can be transmitted and received inexpensively
Speech interfaces are ideal for information access and management when: the information space is broad and complex, the users are technically naive, or speech is the only available modality.

Communication via Spoken Language
[Diagram: a human and a computer communicate via speech and text. On the input side, speech recognition and text understanding map the user's speech or text to meaning; on the output side, language generation and speech synthesis map meaning back to text and speech.]

Components of Conversational Systems
[Diagram: audio input flows through speech recognition, language understanding, and context resolution to dialogue management, which consults a database; the response flows back out through language generation and speech synthesis to audio output.]

Components of MIT Conversational Systems
[Diagram: the GALAXY architecture, in which a central hub routes messages among specialized servers: audio I/O, SUMMIT (speech recognition), TINA (language understanding), Discourse (context resolution), the dialogue manager, the database, GENESIS (language generation), and ENVOICE (speech synthesis).]
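To make the hub-and-spoke idea concrete, here is a toy Python sketch of a hub routing messages among registered servers. The server names and dispatch logic are invented for illustration; the real GALAXY hub uses a frame-based message protocol with scripted control flow.

```python
# Toy hub-and-spoke routing in the spirit of the GALAXY architecture.
# Server names and handlers are invented for illustration.
class Hub:
    def __init__(self):
        self.servers = {}

    def register(self, name, handler):
        self.servers[name] = handler

    def send(self, name, message):
        # Route a message to the named server and return its reply.
        return self.servers[name](message)

hub = Hub()
hub.register("recognizer", lambda audio: "show me flights to denver")
hub.register("parser", lambda text: {"clause": "DISPLAY", "topic": "FLIGHT"})

text = hub.send("recognizer", b"...audio...")
frame = hub.send("parser", text)
print(frame)
```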

Segment-Based Speech Recognition
[Diagram: the waveform for the utterance "computers that talk". Frame-based measurements are taken every 5 ms; a segment network is created by interconnecting spectral landmarks; and a probabilistic search finds the most likely phone and word strings.]
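The probabilistic search over the segment network amounts to finding the best-scoring path through a graph whose edges are hypothesized phone segments. Below is a minimal sketch with invented landmarks and scores, not the actual SUMMIT search:

```python
# Each entry is (start_landmark, end_landmark, phone, log_probability);
# landmarks and scores are invented for illustration.
segments = [
    (0, 2, "k", -1.2), (0, 2, "t", -2.5),
    (2, 5, "ax", -0.8), (2, 5, "er", -1.9),
    (5, 7, "m", -0.5),
]

def best_phone_string(segments, start, goal):
    """Dynamic programming over the segment network: since every
    segment runs forward in time, processing edges in order of their
    start landmark visits each node after all of its predecessors."""
    best = {start: (0.0, [])}              # landmark -> (score, phones)
    for s, e, phone, logp in sorted(segments):
        if s in best:
            score, phones = best[s]
            candidate = (score + logp, phones + [phone])
            if e not in best or candidate[0] > best[e][0]:
                best[e] = candidate
    return best.get(goal)

print(best_phone_string(segments, 0, 7))   # -> (-2.5, ['k', 'ax', 'm'])
```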

Segment-Based Speech Recognition

Natural Language Understanding
[Diagram: a parse tree for the sentence "show me flights from Boston to Denver"; some syntactic nodes carry semantic tags for creating a semantic frame.]

Clause: DISPLAY
  Topic: FLIGHT
    Predicate: FROM
      Topic: CITY
        Name: "Boston"
    Predicate: TO
      Topic: CITY
        Name: "Denver"
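The semantic frame above is essentially a nested attribute structure. One way to encode it, shown here as a plain Python structure rather than the actual TINA frame format:

```python
# Illustrative encoding of the semantic frame for
# "show me flights from Boston to Denver" (not the TINA format).
frame = {
    "clause": "DISPLAY",
    "topic": {
        "name": "FLIGHT",
        "predicates": [
            {"predicate": "FROM", "topic": {"name": "CITY", "value": "Boston"}},
            {"predicate": "TO",   "topic": {"name": "CITY", "value": "Denver"}},
        ],
    },
}
```

A downstream dialogue manager can then read the source and destination cities directly from the frame, independent of the surface wording of the request.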

Dialogue Modeling Strategies
An effective conversational interface must incorporate extensive and complex dialogue modeling. Conversational systems differ in the degree to which the human or the computer takes the initiative:
– Human initiative: the human takes complete control and the computer is totally passive.
  H: I want to visit my grandmother.
– Computer initiative: the computer maintains tight control and the human is highly restricted.
  C: Please say the departure city.
Our systems use a mixed-initiative approach, where both the human and the computer play an active role.

Different Roles of Dialogue Management
Pre-retrieval, dialogue management turns ambiguous input into a unique query to the database (clarifying recognition errors and insufficient information); post-retrieval, it turns multiple database retrievals into a unique response by helping the user narrow down the choices.

U: I need a flight from Boston to San Francisco.
C: Did you say Boston or Austin?  [clarification: recognition errors]
U: Boston, Massachusetts.
C: I need a date before I can access Travelocity.  [clarification: insufficient information]
U: Tomorrow.
C: Hold on while I retrieve the flights for you.
C: I have found 10 flights meeting your specification. When would you like to leave?  [narrowing the choices]
U: In the morning.
C: Do you have a preferred airline?
U: United.
C: I found two non-stop United flights leaving in the morning…
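A minimal sketch of the two roles, assuming a hypothetical slot-based flight query (this is not the MIT dialogue manager): before retrieval the system asks for whatever the query still lacks, and after retrieval it proposes a narrowing constraint when too many results come back.

```python
REQUIRED_SLOTS = ["source", "destination", "date"]

def next_move(query, results=None):
    # Pre-retrieval: keep clarifying until the query is unambiguous.
    if results is None:
        for slot in REQUIRED_SLOTS:
            if slot not in query:
                return f"I need a {slot} before I can retrieve flights."
        return "RETRIEVE"
    # Post-retrieval: help the user narrow down a long list.
    if len(results) > 5:
        return "When would you like to leave?"
    return f"I found {len(results)} flights meeting your specification."

query = {"source": "Boston", "destination": "San Francisco"}
print(next_move(query))                     # asks for the date
query["date"] = "tomorrow"
print(next_move(query))                     # -> "RETRIEVE"
print(next_move(query, results=range(10)))  # asks a narrowing question
```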

Concatenative Speech Synthesis
The output waveform is generated by concatenating segments of a pre-recorded speech corpus; concatenation can occur at the phrase, word, or sub-word level.

Synthesis examples:
– "compassion disputed cedar city since giant since"
– "labyrinth abracadabra obligatory"
– "computer science laboratory"
– "Continental flight 4695 from Greensboro is expected in Halifax at 10:08 pm local time."
– "The third ad is a 1996 black Acura Integra with miles. The price is 8970 dollars. Please call (404)"
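A minimal sketch of word-level concatenation, assuming a hypothetical corpus of pre-recorded word waveforms; real systems also search for the best-matching units and smooth the joins.

```python
import numpy as np

# Hypothetical corpus: word -> recorded waveform samples (16 kHz).
# Zeros stand in for real recordings here.
corpus = {
    "computer":   np.zeros(8000),
    "science":    np.zeros(6000),
    "laboratory": np.zeros(10000),
}

def synthesize(words):
    # Splice the pre-recorded units end to end.
    return np.concatenate([corpus[w] for w in words])

waveform = synthesize(["computer", "science", "laboratory"])
print(waveform.shape)   # (24000,) -> 1.5 seconds at 16 kHz
```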

Multilingual Conversational Interfaces
Adopts an interlingua approach for multilingual human-machine interactions.
Applications:
– MuXing: Mandarin system for weather information
– Mokusei: Japanese system for weather information
– Spanish systems are also under development
– New speech-to-speech translation work (Phrasebook)
[Diagram: the GALAXY hub connecting speech recognition, language understanding, discourse resolution, dialogue management, language generation, text-to-speech conversion, audio and I/O servers, and the application back-end; each component is marked as language-independent, language-dependent (with per-language models and rules), or language-transparent.]
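The interlingua idea is that the meaning representation stays language-independent while only the generation rules and lexicon change per language. A toy illustration with invented templates, not the GENESIS rule format:

```python
# One language-independent frame, rendered by per-language rules.
frame = {"city": "Boston", "condition": "SUNNY"}

rules = {
    "english": ("The weather in {city} is {condition}.", {"SUNNY": "sunny"}),
    "spanish": ("El tiempo en {city} es {condition}.",  {"SUNNY": "soleado"}),
}

def generate(frame, language):
    template, lexicon = rules[language]
    return template.format(city=frame["city"],
                           condition=lexicon[frame["condition"]])

print(generate(frame, "english"))   # The weather in Boston is sunny.
print(generate(frame, "spanish"))   # El tiempo en Boston es soleado.
```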

Bilingual Jupiter Demonstration

Multi-modal Conversational Interfaces
Typing, pointing, and clicking can augment and complement speech; a picture (or a map) is worth a thousand words.
Application: WebGalaxy
– Allows typing and clicking
– Includes map-based navigation
– With display, embedded in a web browser
– Current exhibit at the MIT Museum
[Diagram: speech recognition, gesture recognition, handwriting recognition, and mouth & eye tracking all feed a common language-understanding component that produces meaning.]

WebGalaxy Demonstration

Delegating Tasks to Computers
Many information-related activities can be done off line, and off-line delegation frees the user to attend to other matters.
Application: the Orion system
– Task specification: the user interacts with Orion to specify a task
  "Call me every morning at 6 and tell me the weather in Boston."
  "Send me any time between 4 and 6 p.m. if the traffic on Route 93 is at a standstill."
– Task execution: Orion leverages existing infrastructure to support interaction with humans
– Event notification: Orion calls back to deliver the information
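One way a delegated task like the first example might be represented internally is as a trigger/action structure. This is a hypothetical encoding; the slide does not show Orion's actual format.

```python
# Hypothetical internal representation of
# "Call me every morning at 6 and tell me the weather in Boston."
task = {
    "trigger": {"type": "daily", "time": "06:00"},
    "action":  {"type": "call_user"},
    "content": {"query": "weather", "location": "Boston"},
}

def due(task, clock_time):
    # Fire the task when its daily trigger time comes around.
    return (task["trigger"]["type"] == "daily"
            and clock_time == task["trigger"]["time"])

if due(task, "06:00"):
    print("Calling user with:", task["content"])
```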

Audio-Visual Integration
Audio and visual signals both contain information about:
– The identity of the person: who is talking?
– The linguistic message: what is (s)he saying?
– Emotion, mood, stress, etc.: how does (s)he feel?
The two channels of information are often inter-related, are often complementary, and must be consistent. Integrating these cues can lead to enhanced capabilities for future human-computer interfaces.

Audio-Visual Symbiosis
[Diagram: the acoustic and visual signals each feed three tasks. For personal identity, speaker ID and face ID combine into robust person ID; for the linguistic message, speech recognition and lip/mouth reading combine into robust ASR; for paralinguistic information, acoustic and visual paralinguistic detection combine into robust paralinguistic detection.]
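As a hedged sketch of one simple way such a combination could work for robust person ID: a weighted sum of per-person scores from the audio and visual classifiers. The weights and scores are invented; in practice the weights could track each channel's reliability (e.g., acoustic noise level).

```python
# Invented posteriors from a speaker-ID and a face-ID classifier.
audio_scores  = {"alice": 0.7, "bob": 0.3}
visual_scores = {"alice": 0.4, "bob": 0.6}

# Channel weights; hypothetical fixed values for illustration.
W_AUDIO, W_VISUAL = 0.6, 0.4

fused = {person: W_AUDIO * audio_scores[person] +
                 W_VISUAL * visual_scores[person]
         for person in audio_scores}
print(max(fused, key=fused.get))   # -> 'alice'
```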

Multi-modal Interfaces: Beyond Clicking
Inputs need to be understood in the proper context:
– "Move this one over there": where is she looking or pointing while saying "this" and "there"?
– "Are there any over here?": what does he mean by "any," and what is he pointing at?
– [Image of a hand gesture:] does this mean "yes," "one," or something else?
Timing information is a useful way to relate inputs.

Multi-modal Fusion: Initial Progress
All multi-modal inputs are synchronized:
– The speech recognizer generates absolute times for words
– Mouse and gesture movements generate {x, y, t} triples
– The Network Time Protocol (NTP) is used for millisecond time resolution
Speech understanding constrains gesture interpretation:
– Initial work identifies an object or a location from gesture inputs
– Speech constrains what, when, and how items are resolved
– Object resolution also depends on information from the application
[Diagram: the words of "Move this one over here" and the pointing events for an object and a location laid out on a common timeline; a minimal sketch of this alignment follows.]
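The sketch below assumes invented word timestamps and gesture events: each deictic word is resolved to the pointing event nearest to it in time.

```python
# (word, absolute time in seconds) from the recognizer, and
# (time, target) pairs from the gesture tracker -- all invented.
words = [("move", 0.10), ("this", 0.45), ("one", 0.60),
         ("over", 0.90), ("here", 1.20)]
pointing = [(0.50, ("object", "planet-3")),
            (1.25, ("location", (310, 140)))]

def resolve(word_time, events):
    # Pick the gesture event closest in time to the spoken word.
    return min(events, key=lambda ev: abs(ev[0] - word_time))[1]

for word, t in words:
    if word in ("this", "here"):
        print(word, "->", resolve(t, pointing))
# this -> ('object', 'planet-3')
# here -> ('location', (310, 140))
```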

Multi-modal Demonstration
– Manipulating planets in a solar-system application
– Created with the SpeechBuilder utility with small changes
– Gestures from vision (Darrell & Demirdjian)

Summary
Speech and language are inevitable, given:
– The need for mobility and connectivity
– The miniaturization of computers
– Humans' innate desire to speak
Progress has been made, e.g.:
– Understanding and responding in constrained domains
– Incorporating multiple languages and modalities
– Automation and delegation
– Rapid system configuration
Much interesting research remains, e.g.:
– Audio-visual integration
– Perceptual user interfaces

The Spoken Language Systems Group
Research: Scott Cyphers, James Glass, T.J. Hazen, Lee Hetherington, Joseph Polifroni, Shinsuke Sakai, Stephanie Seneff, Michelle Spina, Chao Wang, Victor Zue
Post-doctoral: Tony Ezzat
Ph.D.: Edward Filisko, Karen Livescu, Alex Park, Mitchell Peabody, Ernest Pusateri, Han Shu, Min Tang, Jon Yi
S.M.: Alicia Boozer, Brooke Cowan, John Lee, Laura Miyakawa, Ekaterina Saenko, Sy Bor Wang
M.Eng.: Chian Chu, Chia-Huo La, Jonathon Lau
Visitors: Paul Brittain, Thomas Gardos, Rita Singh
Administrative: Marcia Davidson