# On Organic Interfaces Victor Zue

On Organic Interfaces Victor Zue (zue@csail.mit.edu)
I am very honored to have received this year’s ISCA medal. It is something that I had not expected, but will treasure for the rest of my life. Victor Zue MIT Computer Science and Artificial Intelligence Laboratory

Acknowledgements Graduate Students Research Staff Anderson, M.
Aull, A. Brown, R. Chan, W. Chang, J. Chang, S. Chen, C. Cyphers, S. Daly, N. Doiron, R. Flammia, G. Glass, J. Goddeau, D. Hazen, T.J. Hetherington, L. Huttenlocher, D. Jaffe, O. Kassel, R. Kasten, P. Kuo, J. Kuo, S. Lauritzen, N. Lamel, L. Lau, R. Leung, H. Lim, A. Manos, A. Marcus, J. Neben, N. Niyogi, P. Mou, X. Ng, K. Pan, K. Pitrelli, J. Randolph, M. Rtischev, D. Sainath, T. Sarma, S. Seward, D. Soclof, M. Spina, M. Tang, M. Wichiencharoen, A. Zeiger, K. Eric Brill Scott Cyphers Jim Glass Dave Goddeau T J Hazen Lee Hetherington Lynette Hirschman Raymond Lau Hong Leung Helen Meng Mike Phillips Joe Polifroni Shinsuke Sakai Stephanie Seneff Dave Shipman Michelle Spina Nikko Ström Chao Wang For nearly three decades at MIT, I have had the good fortune of working with some of the smartest graduate students from all over the world. I have also benefited from a research staff that is truly first-rate. Among them, I would like to single out Jim Glass and Stephanie Seneff, with each of whom I have worked for more than twenty-five years. They have significantly shaped my thinking and contributed to the work that I am about to present. This award is as much an honor to all of them as it is to me.

Introduction

Virtues of Spoken Language
Natural: Requires no special training Flexible: Leaves hands and eyes free Efficient: Has high data rate Economical: Can be transmitted/received inexpensively Speech interfaces are ideal for information access and management when: The information space is broad and complex, The users are technically naive, The information device is small, or Only telephones are available. Speech is the most natural means for humans to communicate; nearly all of us can talk and listen to one another without special training. It is flexible; it can free our eyes and hands to attend to other tasks. Speech is also very efficient; one can typically speak several times faster than one can type or write. Nowadays, with the pervasiveness of landline, cellular, and internet phones, speech is also one of the most inexpensive ways for us to communicate.

Communication via Spoken Language
(Figure: on the input side, speech and text flow from human to computer through recognition and understanding to yield meaning; on the output side, meaning flows through generation and synthesis back to speech and text.) When one thinks about speech interfaces, the first human language technologies that come to mind are speech recognition and speech synthesis. These two technologies are extremely important, each with its own set of applications. However, if we are concerned with human-machine communication, then we need to derive the meaning of words and sentences on the input side, and generate surface forms from the meaning representation on the output side. Therefore, understanding and generation are important aspects of the solution.

Components of a Spoken Dialogue System
(Figure: speech recognition converts speech to words; language understanding, informed by discourse context, produces a meaning representation; dialogue management consults the database; language generation and speech synthesis produce the spoken response, along with graphs and tables.) This figure shows the major components of a spoken dialogue system.
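The component flow above can be sketched as a simple pipeline. This is an illustrative toy, not the actual MIT system; every function name here is a hypothetical stub standing in for a real component.

```python
# Hypothetical sketch of a spoken dialogue system's component flow:
# recognition -> understanding -> dialogue management -> generation
# (synthesis would then voice the generated string).

def recognize(speech: str) -> list:
    # Speech recognition: audio -> word hypotheses (stubbed as a split).
    return speech.lower().split()

def understand(words: list, context: dict) -> dict:
    # Language understanding: words -> meaning representation,
    # resolved against the discourse context.
    meaning = {"intent": "query", "words": words}
    meaning.update(context)
    return meaning

def manage_dialogue(meaning: dict, database: dict) -> dict:
    # Dialogue management: consult the database, decide what to say.
    key = meaning["words"][-1]
    return {"answer": database.get(key, "unknown")}

def generate(response: dict) -> str:
    # Language generation: meaning -> surface sentence.
    return "The answer is {}.".format(response["answer"])

def run_turn(speech, context, database):
    words = recognize(speech)
    meaning = understand(words, context)
    response = manage_dialogue(meaning, database)
    return generate(response)

print(run_turn("what is the forecast", {}, {"forecast": "sunny"}))
```

Each stub would be replaced by a full statistical component in a real system; the point is only the data handed between stages.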

Tremendous Progress to Date
Data Intensive Training Technological Advances Inexpensive Computing Increased Task Complexity Over the past four decades, we have witnessed remarkable progress in the development of speech input/output technologies. Speech recognition, for example, has benefited enormously from the introduction and maturation of the technique known as Hidden Markov modeling, or HMM. Another contributing factor is the fact that computers are getting cheaper and more powerful, illustrated here by Moore’s Law. With the increasing availability of a large amount of training data, the error rate for speech recognition continues to fall. As a result, the community has been able to tackle more and more difficult problems.

Some Example Systems BBN, 2007 MIT, 2007 KTH, 2007
I will illustrate this progress with three examples. I should take a moment to thank all my colleagues who generously provided me with video materials for this presentation. The first example is the BBN multi-lingual broadcast monitoring system. The next example illustrates multi-lingual speech recognition on a smart phone, with a vocabulary of several thousand words. The last example is the chess-playing program developed at KTH. Please note the use of discourse and pragmatic knowledge, and the expressive avatar throughout the interchange.

Speech Synthesis Recent trend moves toward corpus-based approaches
Increased storage and compute capacity Availability of large text and speech corpora Modeled after successful utilization for speech recognition Many successful implementations, e.g., AT&T, Cepstral, Microsoft For speech synthesis, there has been a recent trend towards data-driven approaches. This is brought on by at least three factors: increased storage and compute capacity that enabled more data- and compute-intensive methods; the availability of large text and speech corpora that allowed much more data to be examined and analyzed; and the successful use of these methods in speech recognition, along with the desire to adapt these techniques for synthesis. There are now many successful commercial systems available in the market.
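The corpus-based approach can be illustrated with a toy unit-selection dynamic program: for each target unit, choose the corpus unit minimizing a target cost plus a join cost with the previous choice. The cost functions and unit names below are invented for illustration; real systems use acoustic and prosodic distances over large recorded corpora.

```python
# Hypothetical sketch of unit-selection synthesis: a tiny Viterbi
# search over candidate units, minimizing target cost + join cost.

def select_units(targets, corpus, target_cost, join_cost):
    # best[u] = (total cost, path) for sequences ending in unit u
    best = {u: (target_cost(targets[0], u), [u]) for u in corpus[targets[0]]}
    for t in targets[1:]:
        new = {}
        for u in corpus[t]:
            c, path = min(
                (best[p][0] + join_cost(p, u), best[p][1]) for p in best
            )
            new[u] = (c + target_cost(t, u), path + [u])
        best = new
    return min(best.values())[1]

# Toy usage: two target phones, two candidate corpus units each;
# units ending in "1" have zero target cost, joins are free.
corpus = {"a": ["a1", "a2"], "b": ["b1", "b2"]}
tc = lambda t, u: 0 if u.endswith("1") else 1
jc = lambda p, u: 0
print(select_units(["a", "b"], corpus, tc, jc))
```

With non-zero join costs the search trades off how well a unit matches the target against how smoothly it concatenates with its neighbor, which is the essence of the corpus-based method.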

But we are far from done …
Machine performance typically lags far behind human performance How can interfaces be truly anthropomorphic? Lippmann, 1997 As proud as we should be of the progress that this community continues to make, we are far from reaching the human capability of recognizing and understanding, nearly perfectly, speech spoken by many speakers, under varying acoustic environments, with an essentially unrestricted vocabulary. This is illustrated by a study comparing machine and human performance. To be sure, machine performance has improved a great deal since the study, but there is still a large gap between humans and machines.

Premise of the Talk Propose a different perspective on development of speech-based interfaces Draw from insights in evolution of computer science Computer systems are increasingly complex There is a move towards treating these complex systems like organisms that can observe, grow, and learn Will focus on spoken dialogue systems In this talk, I intend to offer a perspective drawn from evolution in computer science, one that views complex systems as living organisms that can learn, grow, reconfigure, and repair themselves. I would argue that such a perspective can lead us to a type of interface that is truly anthropomorphic. I will focus my discussions on spoken dialogue systems. In the next part of the talk, I will outline what I mean by Organic Interfaces and describe some of their properties. This will be followed by a discussion of a few challenges, along with examples to illustrate my points. Some of the ideas are admittedly half-baked. In many ways, more questions are raised than are answered. It is my hope that this talk will trigger some discussion among us that will lead to further refinements of the ideas and perhaps engender new directions in our collective research agenda.

Organic Interfaces

Computer: Yesterday and Today
| Yesterday | Today |
| --- | --- |
| Computation of static functions in a static environment, with well-understood specification | Adaptive systems operating in environments that are dynamic and uncertain |
| Computation is its main goal | Communication, sensing, and control just as important |
| Single agent | Multiple agents that may be cooperative, neutral, or adversarial |
| Batch processing of text and homogeneous data | Stream processing of massive, heterogeneous data |
| Stand-alone applications | Interaction with humans is key |
| Binary notion of correctness | Trade off among multiple criteria |

It is perhaps informative to first examine the evolution of computing and computer science as a discipline. Since we cannot know the details of the environments in which they will be deployed, nor the behavior of the human operators, these systems must be able to execute based on incomplete information and be able to adapt to varying environments. To make these complex and interconnected systems more robust, we need to build into them the ability to adapt to dynamically changing environments, and to deal with uncertainty. Recent efforts have gone by the names of autonomic computing, cognitive computing, and organic computing. The idea is to incorporate properties of living organisms that can learn, grow, reconfigure, and recover from errors. Increasingly, we rely on probabilistic representation, machine learning techniques, and optimization principles to build complex systems.

Properties of Organic Systems
Robust to changes in environment and operating conditions Learning through experiences Observe their own behavior Context aware Self-healing Some of the properties of organic computing systems are particularly important to the design of interfaces. Organic systems are robust to changes in the environment and operating conditions. An organic system can evolve over time by learning from experiences. For humans, learning can often be accomplished with just a few examples, rather than with voluminous amounts of data. Through learning, the system can increase its knowledge base and expand its capabilities. For learning to take place, it must be self-aware and context-aware; it must be able to observe itself in varying operating conditions and modify its behavior based on this observation. It must also detect what it doesn't know, and find ways to incorporate this knowledge into the system for future use. Context-awareness, learning, and adaptation are three inter-related properties.

Research Challenges

Some Research Challenges
Robustness Signal Representation Acoustic Modeling Lexical Modeling Multimodal Interactions Establishing Context Adaptation Learning Statistical Dialogue Management Interactive Learning Learning by Imitation * Please refer to written paper for topics not covered in talk Following this line of thinking, we can reexamine how today's speech-based interfaces can be made more organic. In this section, I will discuss some of the desirable properties of organic interfaces, and provide some illustrations of what has been done in these areas and some of the as-yet-unmet challenges. Due to time and space limitations, I will primarily focus my attention on two aspects: speech understanding and dialogue interactions. As such, I will only address a few of the large number of challenges. In the written paper, I also cited a recent report of a committee chaired by Janet Baker that outlined some other challenges.

Robustness: Acoustic Modeling
(Figure: levels of linguistic knowledge, from acoustics and phonetics, through phonemics, phonotactics, and morphology at the word and syllable level, to syntax and semantics at the sentence level; a speech recognition kernel maps the signal to sub-word units using acoustic models and language-model units.) Statistical n-grams have masked the inadequacies in acoustic modeling, but at a cost: the size of the training corpus, and application-dependent performance. To promote acoustic modeling research, we may want to develop a sub-word based recognition kernel that is application independent, offers stronger constraints than phonemes, and provides a closed vocabulary for a given language. Some success has been demonstrated (e.g., Chung & Seneff, 1998). I will first discuss three issues concerning achieving robustness. Present-day approaches to speech recognition rely heavily on statistical n-grams to decode the underlying word sequence. While the power of n-grams remains unchallenged, their success is achieved at a heavy price. A potentially beneficial approach may lie in our ability to separate the application-dependent aspects from those that are application independent. Specifically, we could first develop an application-independent, sub-word based recognition kernel that accepts the speech signal and produces a graph of sub-word units, utilizing acoustic and language models. This approach offers several advantages: by focusing on the recognition of sub-word units, the recognition can be vocabulary independent, as long as the acoustic and language models are trained on sufficient data from multiple domains; we can better assess the contribution of the acoustic models; by introducing multiple levels of sub-word representations, we should be able to capture longer-distance constraints more parsimoniously; and we will also minimize the problem of unknown words.
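The proposed separation between an application-independent kernel and an application-dependent layer might look like the following toy sketch, where the kernel emits sub-word units and a lexicon maps unit sequences to words. The unit inventory, lexicon, and greedy matcher are all hypothetical simplifications; a real kernel would emit a weighted graph, not a single sequence.

```python
# Hypothetical sketch: an application-independent sub-word kernel
# feeding an application-specific lexical layer.

SUBWORD_LEXICON = {
    # application layer: word -> sequence of sub-word units
    "boston": ["bos", "ton"],
    "austin": ["os", "tin"],
}

def kernel(speech_units):
    # Stand-in for the recognition kernel: returns sub-word units
    # (here, the input itself; a real kernel decodes the signal).
    return speech_units

def decode_words(units, lexicon):
    # Greedy left-to-right match of sub-word units against the lexicon.
    words, i = [], 0
    while i < len(units):
        for word, parts in lexicon.items():
            if units[i:i + len(parts)] == parts:
                words.append(word)
                i += len(parts)
                break
        else:
            i += 1  # unmatched unit: skip (a real system would flag OOV)
    return words

print(decode_words(kernel(["bos", "ton", "os", "tin"]), SUBWORD_LEXICON))
```

Swapping in a different `SUBWORD_LEXICON` retargets the system to a new application without retraining the kernel, which is the separation the slide argues for.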

Robustness: Lexical Access
Current approaches represent words as phoneme strings Phonological rules are sometimes used to derive alternate pronunciations (e.g., “temperature”) Lexical representation based on features offers much appeal (Stevens, 1995): fewer models, less training data, greater parsimony Alternative lexical access models (e.g., Zue, 1983) Lexical access based on islands of reliability might be better able to deal with variability Phonological variations within and between words are well known. Most speech recognition systems today do not explicitly model them, but instead rely on context-dependent phone models to capture them. In a few systems, phonological rules are used to expand the phonemic baseforms into pronunciation graphs. (Bushiness leads to a larger search space and recognition errors.) A possible solution may be to represent the words in terms of features or feature bundles: fewer models, less training data. Alternatively, one can use such a broad-class representation, together with phonotactic constraints, to initially whittle down the list of word candidates. Recent positive results have been obtained. The bushiest parts of the pronunciation graphs typically involve reduced syllables. Perhaps the unstressed and reduced syllables are not produced with as much precision as stressed ones. Perhaps it makes little sense for a system to explicitly account for the variabilities by enumerating all the alternate pronunciations. One can simply pay more attention to the stressed syllables for lexical decoding while using the unstressed syllables as place holders. This notion of islands of reliability suggests an island-driven lexical access strategy, in which the search is accomplished by anchoring on the stressed syllables.
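The island-driven idea can be sketched as a matcher in which stressed syllables must agree exactly while reduced syllables act only as place holders. The toy lexicon and syllable hypotheses below are invented for illustration, not real pronunciations.

```python
# Hypothetical sketch of island-driven lexical access: stressed
# syllables ("islands of reliability") must match exactly; reduced
# syllables are loosely matched place holders.

LEXICON = {
    # word -> list of (syllable, is_stressed)
    "temperature": [("tem", True), ("per", False), ("a", False), ("ture", False)],
    "temperate":   [("tem", True), ("per", False), ("ate", False)],
}

def island_match(observed, entry):
    # observed: list of (syllable, is_stressed) hypotheses.
    if len(observed) != len(entry):
        return False
    for (obs, _), (syl, stressed) in zip(observed, entry):
        if stressed and obs != syl:
            return False  # stressed islands must match exactly
    return True  # unstressed syllables serve only as place holders

# Noisy observation: the reduced syllables are badly recognized,
# but the stressed island "tem" is reliable.
obs = [("tem", True), ("pr", False), ("uh", False), ("chr", False)]
matches = [w for w, e in LEXICON.items() if island_match(obs, e)]
print(matches)
```

A real system would anchor a graph search on the islands rather than require equal lengths, but the sketch shows why variability in reduced syllables need not be enumerated.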

Robustness: Multimodal Interactions
Other modalities can augment/complement speech (Figure: speech recognition, gesture, handwriting, and mouth and eye tracking all feed language understanding to produce meaning.) For people like me who have worked hard on the speech problem, it is often tempting to think of speech as the only solution to the interface problem. However, the robustness with which humans communicate is partially achieved by utilizing multiple modalities.

Challenges for Multimodal Interfaces
Input needs to be understood in the proper context (“What about that one”) Timing information is a useful way to relate inputs Speech: “Move this one over here” Pointing: (object) (location) time Handling uncertainties and errors (Cohen, 2003) Need to develop a unifying linguistic framework I will now spend a few minutes on the problem of integrating audio and visual cues for interface development.
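One simple way to use the timing information mentioned above is to bind each deictic phrase to the gesture event nearest in time. The event format, timestamps, and phrase list below are hypothetical illustrations of the "Move this one over here" example.

```python
# Hypothetical sketch: resolve deictic phrases by binding each one
# to the gesture event closest in time.

speech = [("move", 0.1), ("this one", 0.5), ("over here", 1.2)]
gestures = [("object:lamp", 0.6), ("location:desk", 1.3)]

DEICTIC = {"this one", "that one", "over here", "there"}

def bind_gestures(speech, gestures):
    bound = {}
    for phrase, t in speech:
        if phrase in DEICTIC:
            # pick the gesture with the smallest time offset
            ref, _ = min(gestures, key=lambda g: abs(g[1] - t))
            bound[phrase] = ref
    return bound

print(bind_gestures(speech, gestures))
```

Real multimodal systems (e.g., Cohen's) also weigh recognition confidences and semantic type constraints, not timing alone; this sketch shows only the temporal alignment idea.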

Audio Visual Symbiosis
Benoit, 2000 The audio and visual signals both contain information about: Identity/location of the person Linguistic message Emotion, mood, stress, etc. Integration of these sources of information has been known to help humans

Audio Visual Symbiosis
The audio and visual signals both contain information about: Identity/location of the person Linguistic message Emotion, mood, stress, etc. Integration of these sources of information has been known to help humans Exploiting this symbiosis can lead to robustness, e.g., Locating and identifying the speaker Hazen et al., 2003 Combining speaker identification and face identification can reduce the error rate by nearly 10-fold.
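A minimal sketch of combining the two classifiers is late fusion of per-speaker scores with a fixed weight. This is a generic illustration, not the actual method of Hazen et al.; the speakers, scores, and weight are invented.

```python
# Hypothetical sketch of audio-visual identification by late fusion:
# per-speaker log-likelihood scores from the audio and face
# classifiers are combined with a fixed interpolation weight.

def fuse(audio_scores, face_scores, w=0.5):
    # audio_scores / face_scores: dict speaker -> log-likelihood
    combined = {
        spk: w * audio_scores[spk] + (1 - w) * face_scores[spk]
        for spk in audio_scores
    }
    return max(combined, key=combined.get)

audio = {"alice": -1.0, "bob": -1.2}  # audio alone prefers alice
face = {"alice": -3.0, "bob": -0.5}   # face strongly prefers bob
print(fuse(audio, face))
```

When the two modalities disagree, the fused decision follows the more confident one, which is where the error-rate reduction comes from.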

Audio Visual Symbiosis
The audio and visual signals both contain information about: Identity/location of the person Linguistic message Emotion, mood, stress, etc. Integration of these sources of information has been known to help humans Exploiting this symbiosis can lead to robustness, e.g., Locating and identifying the speaker Speech recognition/understanding augmented with facial features Huang et al., 2004 A pioneer in AVSR is Chalapathy Neti and his colleagues at IBM. Their work has demonstrated significant word error reduction when the visual signal is added.

Audio Visual Symbiosis
Cohen, 2005 The audio and visual signals both contain information about: Identity/location of the person Linguistic message Emotion, mood, stress, etc. Integration of these sources of information has been known to help humans Exploiting this symbiosis can lead to robustness, e.g., Locating and identifying the speaker Speech recognition/understanding augmented with facial features Speech and gesture integration I will show two examples. The first one is the speech-pen integration system developed by Phil Cohen and his colleagues as part of the DARPA Cognitive Systems program. The second is the work of Alex Gruenstein, a graduate student at MIT working with Stephanie Seneff. Gruenstein et al., 2006

Audio Visual Symbiosis
The audio and visual signals both contain information about: Identity/location of the person Linguistic message Emotion, mood, stress, etc. Integration of these sources of information has been known to help humans Exploiting this symbiosis can lead to robustness, e.g., Locating and identifying the speaker Speech recognition/understanding augmented with facial features Speech and gesture integration Audio/visual information delivery Ezzat, 2003 Research conducted at KTH has shown that intelligibility can nearly double when a synthetic face is added to the speech signal. In this example, the face was synthesized by Tony Ezzat of MIT. He first derived some forty or so eigenfunctions of the face using a few minutes of training data. New facial variations can be generated by assigning the appropriate weights to these functions. The speech was synthesized by Jim Glass and his students using a corpus-based approach.

Establishing Context Context setting is important for dialogue interaction Environment Linguistic constructs Discourse Much work has been done, e.g., Context-dependent acoustic and language models Sound segmentation Discourse modeling Some interesting new directions Tapestry of applications Acoustic scene analysis (Ellis, 2006) calendar photos weather address stocks phonebook music Context setting is an important aspect of spoken language communication. Knowing that we are speaking in a noisy environment, for example, enables us to adapt and disregard the interferences. Knowledge about linguistic constructs enables us to favor one set of words over another (e.g., “euthanasia” vs. “youth in Asia”). Discourse knowledge is crucial for us to interpret the meaning of sentences based on previous parts of the conversation. In recent research transcribing Broadcast News, some researchers pre-segmented the input signal to mark changes in environment and talker in order to improve speech recognition performance. I believe it will be quite a while before we can develop omnipotent spoken dialogue systems that can handle a multitude of applications. Even if we can, the performance of such systems is bound to falter. Instead, we should perhaps strive towards developing intuitive ways to set the context, such that users will be able to deal with these applications with some knowledge of the capabilities of the systems (iPhone example). A logical and potentially more powerful extension of the audio pre-segmentation work would be to provide a complete scene analysis of the sounds surrounding us (credit: Dan Ellis).

Acoustic Scene Analysis
Acoustic signals contain a wealth of information (linguistic message, environment, speaker, emotion, …) We need to find ways to adequately describe the signals time signal type: speech transcript: although both of the, both sides of the Central Artery … topic: traffic report speaker: female . . . signal type: speech transcript: Forecast calls for at least partly sunny weather … topic: weather, sponsor acknowledgement, time speaker: male . . . signal type: speech transcript: This is Morning Edition, I’m Bob Edwards … topic: NPR news speaker: male, Bob Edwards . . . This is an illustration of the capabilities of such an imagined system, in which the acoustic signal is described by a set of meta-data tags. This is an area of active research for me and my student Tara Sainath. signal type: music genre: instrumental artist: unknown . . . Some time in the future …
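The imagined meta-data description could be represented as timed segments carrying attribute tags that downstream applications query. The segment boundaries and tag values in this sketch are invented, loosely following the slide's Broadcast News example.

```python
# Hypothetical sketch of acoustic scene meta-data: the audio stream
# is described as timed segments, each carrying attribute tags.

segments = [
    {"start": 0.0, "end": 12.4, "signal": "speech",
     "topic": "NPR news", "speaker": "male"},
    {"start": 12.4, "end": 30.1, "signal": "speech",
     "topic": "traffic report", "speaker": "female"},
    {"start": 30.1, "end": 45.0, "signal": "music",
     "genre": "instrumental"},
]

def segments_with(tag, value, segs):
    # Query: all segments whose tag matches the given value.
    return [s for s in segs if s.get(tag) == value]

print(len(segments_with("signal", "speech", segments)))
```

Such a representation would let a recognizer, for example, restrict itself to speech segments, or adapt its models per speaker and environment.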

Learning Perhaps the most important aspect of organic interfaces
Use of stochastic modeling techniques for speech recognition, language understanding, machine translation, and dialogue modeling Many different ways to learn Passive learning Interactive learning Learning by imitation Over the past several decades, we have steadily seen stochastically-motivated learning techniques being applied to human language technologies. They have been successfully applied to speech recognition and understanding, machine translation, and now dialogue modeling. The taxonomy of learning as it applies to speech-based interfaces can be quite complex. At one extreme, the system could learn by observing user behaviors over time, and then making use of a statistical model of observed patterns to bias future decisions. The user might not even be aware that the system is altering its model of the world. Alternatively, the system may need to actively engage the user in dialogue in order to learn their preferences explicitly or to acquire new knowledge about the world. Finally, the system may want to learn user behavior explicitly through imitation. Some of the learning can be done off line, whereas other forms must be done during actual usage. In the next few minutes, I will discuss some of these learning techniques.

Interactive Learning: An Example
New words are inevitable, and they cannot be ignored Acoustic and linguistic knowledge is needed to Detect Learn, and Utilize new words Fundamental changes in problem formulation and search strategy may be necessary Hetherington, 1991 No matter how large the size of the training corpora, the system will invariably encounter previously unseen words. For the Air Travel Information System, or ATIS, task, for example, the probability of the system encountering an unknown word is about 0.003, even after encountering a 100,000-word training corpus. In real applications, a much larger fraction of the words uttered by users will not be in the system's working vocabulary (>250,000 restaurants in US 250 addresses). Ignoring them will not satisfy users’ needs. For a system to be truly helpful, it must be able not only to detect new words, taking into account acoustic, phonological, and linguistic evidence, but also to adaptively acquire them, both in terms of their orthography and linguistic properties. In some cases, fundamental changes in the problem formulation and search strategy may be necessary.

Interactive Learning: An Example
New words are inevitable, and they cannot be ignored Acoustic and linguistic knowledge is needed to Detect Learn, and Utilize new words Fundamental changes in problem formulation and search strategy may be necessary Chung & Seneff, 2004 What is needed, then, is a generic capability to handle unknown words, beginning with detection, continuing on to a disambiguation sub-dialogue, and terminating with an automatic update of the system such that it now knows the new word explicitly and understands its usage. For example, Chung & Seneff developed a spoken dialogue system that can flexibly incorporate new words from users and from dynamic information sources retrieved from the Web. Specifically, the system enlists a software agent to seek the data entries from the Web, given the current dialogue context. Subsequently, the system updates its vocabulary and language models with the newly retrieved data subset. Letter-to-sound rules are used to derive the proper pronunciation. New words can be detected and incorporated through Dynamic update of vocabulary

Interactive Learning: An Example
New words are inevitable, and they cannot be ignored Acoustic and linguistic knowledge is needed to Detect Learn, and Utilize new words Fundamental changes in problem formulation and search strategy may be necessary Filisko & Seneff, 2006 Alternatively, the system can ask the user to spell the unknown word once it has been detected. The spelling can be recognized and possibly corrected based on sound-to-letter rules of a language. New words can then be incorporated into the system for further action. New words can be detected and incorporated through Dynamic update of vocabulary Speak and Spell

Learning by Imitation Many tasks can be learned through interaction
Allen et al. (2007) Many tasks can be learned through interaction “This is how you enable Bluetooth.”  “Enable Bluetooth.” “These are my glasses.”  “Where are my glasses?” Promising research by James Allen (2007) Learning phase: User shows the system how to perform tasks (perhaps through some spoken commentary) System learns the task through learning algorithms and updates its knowledge base Application phase: Looks up tasks in its knowledge base and executes the procedure In *learning by imitation*, the user can simply show the system how to perform certain tasks, and in most cases, provide a spoken commentary to be associated with that task for future use. For example, a complex sequence of clicks on a menu hierarchy in a smart phone can be directly linked to the verbal command “enable Bluetooth.” There has been, to my knowledge, very little prior research within the speech community in the area of learning by imitation. Most notable, however, is the recent research by James Allen and his team at the University of Rochester. During the learning phase, for example, the user walks through a sequence of steps at a Web page in order to teach the system how to search and summarize from complex linguistically understood queries. Task models are constructed by fusing together information from language understanding with the observed demonstration. The active learning process allows the system to master the process from a single example, due to the linguistic scaffolding provided in the accompanying spoken dialogue interaction. Once learned, the system will be able to perform similar tasks on other requests.

In Summary Great strides have been made in speech technologies
Truly anthropomorphic spoken dialogue interfaces can only be realized if they can behave like organisms Observe, learn, grow, and heal Many challenges remain … As I mentioned at the outset, great strides have been made in speech technologies. In this talk, I have tried to argue that future interfaces should behave more like living organisms that can provide robust performance in a wide range of operating conditions, learn from their experiences, and adapt to the environment, user, and task. Some of the challenges of developing such an interface are being pursued by the research community with good results, while others will need increased attention. It is my hope that some of the ideas may find their way onto the research agenda of others in the community, in addition to my own.

Thank You

Dynamic Vocabulary Understanding
Dynamically alter vocabulary within a single utterance “What’s the phone number for Flora in Arlington.” Clause: wh_question Property: phone Topic: restaurant Name: ???? City: Arlington Clause: wh_question Property: phone Topic: restaurant Name: Flora City: Arlington (Figure: a hub architecture connecting the audio front end, ASR, NLU, context resolution, dialogue management, NLG, TTS, and the database.) A desirable feature of the system is that changes in the database content via updates, such as new restaurants, do not require re-compilation of the main finite-state transducers (FSTs) in the recognition or the natural language parser. As an illustration, a user may ask, “What is the number of Flora in Arlington?” The system initially understood the query as “What is the number of &lt;unknown word&gt; in Arlington.” Based on the context, the system will retrieve all the restaurants in Arlington from a restaurant database (Arlington Diner, Blue Plate Express, Tea Tray in the Sky, Asiana Grille, Bagels etc, Flora, …), and select Flora based on its acoustic similarity to the input. It will then respond to the user with the requested information (“The telephone number for Flora is …”), and update the speech recognition and language understanding components so that the previously unknown word can be used subsequently. This can all be done within a single utterance.
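The selection step described above can be sketched with letter edit distance standing in for acoustic similarity: the unknown-word region is compared against all database entries for the city, and the closest match is adopted. The recognizer hypothesis string "flores" is invented for illustration; the real system compares at the acoustic-phonetic level, not over spellings.

```python
# Hypothetical sketch of selecting the database entry closest to an
# unknown-word region, using edit distance as a stand-in for
# acoustic similarity.

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

restaurants = ["Arlington Diner", "Blue Plate Express", "Flora",
               "Asiana Grille", "Tea Tray in the Sky"]

hypothesis = "flores"  # invented recognizer output for the OOV region
best = min(restaurants, key=lambda r: edit_distance(hypothesis, r.lower()))
print(best)
```

Once the match is made, the matched entry would be spliced into the recognizer's vocabulary and the parser's lexicon, completing the dynamic update the slide describes.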