Efficient Computer Interfaces Using Continuous Gestures, Language Models, and Speech Keith Vertanen Inference Group August 4th, 2004

The problem

Speech recognizers make mistakes. Correcting mistakes is inefficient:

  140 WPM   uncorrected dictation
   14 WPM   corrected dictation, mouse/keyboard
   32 WPM   corrected typing, mouse/keyboard

Voice-only correction is even slower and more frustrating.

Research overview

Make correction of dictation:
  - more efficient
  - more fun
  - more accessible

Approach:
  - build a word lattice from a recognizer's n-best list
  - expand the lattice to cover likely recognition errors
  - make a language model from the expanded lattice
  - use the model in a continuous gesture interface to perform confirmation and correction

Building the lattice

Example n-best list:
  1: jack studied very hard
  2: jack studied hard
  3: jill studied hard
  4: jill studied very hard
  5: jill studied little
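The slides don't show how the n-best list becomes a lattice. Below is a minimal sketch of one way to do it, merging hypotheses by shared prefixes; the representation (a dict of arcs) and the function name are illustrative, and the talk's actual construction may align hypotheses more carefully (for example, merging shared suffixes as well).

```python
from collections import defaultdict

# Rough sketch, not the talk's exact method: merge n-best hypotheses into a
# word lattice by sharing common prefixes.  A real implementation would also
# merge common suffixes and carry scores on the arcs.

def build_lattice(nbest):
    """Return (arcs, start_node, final_nodes); arcs maps node -> {word: next node}."""
    arcs = defaultdict(dict)   # node id -> {word: next node id}
    finals = set()
    next_id = 1                # node 0 is the shared start node

    for hyp in nbest:
        node = 0
        for word in hyp.split():
            if word not in arcs[node]:
                arcs[node][word] = next_id
                next_id += 1
            node = arcs[node][word]
        finals.add(node)       # the last node of each hypothesis is final
    return dict(arcs), 0, finals

nbest = ["jack studied very hard",
         "jack studied hard",
         "jill studied hard",
         "jill studied very hard",
         "jill studied little"]
lattice, start, finals = build_lattice(nbest)
print(lattice[0])              # {'jack': 1, 'jill': 6}: the two first-word arcs
```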

Insertion errors

Acoustic confusions

Given a word, find words that sound similar.

Look its pronunciation up in the dictionary:
  studied -> s t ah d iy d

Use observed phone confusions to generate alternative pronunciations:
  s t ah d iy d
  s ao d iy
  s t ah d iy
  ...

Map the pronunciations back to words:
  s t ah d iy d -> studied
  s ao d iy     -> saudi
  s t ah d iy   -> study
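A toy sketch of this step is below, assuming a hand-made pronunciation dictionary and phone-confusion table; the real system would use a full dictionary and confusions measured on recognizer output, with confusion probabilities rather than a hard edit limit.

```python
# Toy acoustic-confusion sketch.  PRON, CONFUSIONS, and max_edits are all
# illustrative assumptions, not the talk's actual data.

PRON = {                                    # word -> phone sequence
    "studied": ("s", "t", "ah", "d", "iy", "d"),
    "study":   ("s", "t", "ah", "d", "iy"),
    "saudi":   ("s", "ao", "d", "iy"),
}
CONFUSIONS = {                              # phone -> likely confusions ("" = deletion)
    "t":  [""],
    "ah": ["ao"],
    "d":  [""],
}

WORDS_BY_PRON = {}                          # reverse dictionary: pronunciation -> words
for w, p in PRON.items():
    WORDS_BY_PRON.setdefault(p, []).append(w)

def alt_pronunciations(phones, max_edits=3):
    """Pronunciations reachable by up to `max_edits` single-phone confusions."""
    seen, frontier = {phones}, {phones}
    for _ in range(max_edits):
        new = set()
        for pron in frontier:
            for i, phone in enumerate(pron):
                for alt in CONFUSIONS.get(phone, []):
                    cand = pron[:i] + ((alt,) if alt else ()) + pron[i + 1:]
                    if cand not in seen:
                        seen.add(cand)
                        new.add(cand)
        frontier = new
    return seen

def acoustic_confusions(word, max_edits=3):
    """Words whose pronunciation is within `max_edits` confusions of `word`'s."""
    found = set()
    for cand in alt_pronunciations(PRON[word], max_edits):
        found.update(WORDS_BY_PRON.get(cand, []))
    return found - {word}

print(acoustic_confusions("studied"))       # {'study', 'saudi'} with this toy data
```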

Acoustic confusions: “Jack studied hard”

Morphology confusions

Given a word, find words that share the same "root". Using the Porter stemmer:
  jacking, jacks, jack, jacked       -> jack
  study, studying, studied, studies  -> studi
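A small sketch of the same idea, assuming NLTK's PorterStemmer and a tiny illustrative vocabulary: group the vocabulary by stem, then propose every word that shares the query word's stem.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer        # pip install nltk

# Group an (illustrative) vocabulary by Porter stem and propose every word
# sharing a stem as a morphological confusion.

stemmer = PorterStemmer()
VOCAB = ["jack", "jacks", "jacking", "jacked",
         "study", "studying", "studied", "studies"]

words_by_stem = defaultdict(set)
for w in VOCAB:
    words_by_stem[stemmer.stem(w)].add(w)

def morphology_confusions(word):
    """Other vocabulary words sharing `word`'s Porter stem."""
    return words_by_stem[stemmer.stem(word)] - {word}

print(stemmer.stem("studied"))             # 'studi'
print(morphology_confusions("studied"))    # {'study', 'studying', 'studies'} (order may vary)
```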

Morphology confusions: “Jack studied hard”

Language model confusions: "Jack studied hard"

Look at the words before or after a node, and add likely alternate words based on an n-gram language model.
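As a sketch, with made-up bigram counts (the talk uses a real n-gram LM, and the top-k cutoff here is my assumption):

```python
from collections import Counter

# Toy language-model expansion: given the word on the arc into a node,
# propose the most likely next words under a bigram model.

BIGRAMS = {                                 # previous word -> counts of following words
    "jack":    Counter({"studied": 5, "worked": 3, "played": 2}),
    "studied": Counter({"hard": 6, "very": 3, "little": 1}),
}

def lm_confusions(prev_word, k=2):
    """The k most likely words to follow `prev_word` under the bigram model."""
    return [w for w, _ in BIGRAMS.get(prev_word, Counter()).most_common(k)]

print(lm_confusions("jack"))                # ['studied', 'worked'] with these toy counts
```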

Expansion results (on WSJ1)

Probability model

Our confirmation and correction interface requires the probability of a letter given the prior letters: P(x_i | x_1, ..., x_{i-1}).

Probability model

  - Keep track of the possible paths in the lattice
  - Predict based on the next letter on those paths
  - Interpolate with a default language model

Example: the user has entered "the_cat".
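A rough sketch of this letter-level prediction, ignoring path scores and the penalty on expansion arcs described on a later slide, and using a uniform distribution in place of the default language model:

```python
from collections import Counter

# Paths through the lattice are flattened to character strings; the ones
# consistent with the letters entered so far vote on the next letter, and the
# result is interpolated with a default letter model (uniform here).

ALPHABET = "abcdefghijklmnopqrstuvwxyz_"

def letter_distribution(path_strings, prefix, lam=0.9):
    """P(next letter | prefix): lattice-path votes mixed with a uniform model."""
    votes = Counter()
    for s in path_strings:
        if s.startswith(prefix) and len(s) > len(prefix):
            votes[s[len(prefix)]] += 1      # unweighted; real paths would carry scores
    total = sum(votes.values())
    uniform = 1.0 / len(ALPHABET)
    return {c: lam * (votes[c] / total if total else 0.0) + (1 - lam) * uniform
            for c in ALPHABET}

paths = ["the_cat_sat", "the_cats_sat", "the_cap_fits"]   # "_" is the word separator
dist = letter_distribution(paths, "the_cat")
print(dist["s"], dist["_"], dist["z"])      # "s" and "_" are likely; "z" falls back to the default
```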

Handling word errors

  - Use the default language model during entry of the erroneous word
  - Rebuild paths allowing for an additional deletion or substitution error

Example: the user has entered "the_cattle_".
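The slide doesn't spell out how the paths are rebuilt; one plausible word-level reading is sketched below: the words entered so far are matched against each lattice path while allowing one extra substitution or deletion, so "the cattle" can still continue along the "the cat ..." paths. The names and the exact matching policy here are my own.

```python
# Sketch of prefix matching with one allowed word error (illustration only).

def matches_with_one_error(entered, path_words):
    """True if `entered` equals a prefix of `path_words` up to one
    substitution or deletion of a path word."""

    def match(i, j, errors):
        if i == len(entered):                      # all entered words accounted for
            return True
        if j == len(path_words):
            return False
        if entered[i] == path_words[j]:
            return match(i + 1, j + 1, errors)
        if errors == 0:
            return False
        return (match(i + 1, j + 1, errors - 1)    # substitution
                or match(i, j + 1, errors - 1))    # path word skipped by the user

    return match(0, 0, 1)

print(matches_with_one_error(["the", "cattle"], ["the", "cat", "sat"]))    # True
print(matches_with_one_error(["a", "cow", "sat"], ["the", "cat", "sat"]))  # False
```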

Using the expanded lattice

Paths that use arcs added during lattice expansion are penalized.

Example: the user has entered "jack_".

Evaluating expansion Assume a good model requires as little information from the user as possible
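The figures on the next slide are average bits per letter under the model. My formulation of that metric, consistent with the numbers quoted, and its relation to the "user decides between k letters" figure:

```latex
% average information the user must supply per letter under model P
\text{bits/letter} = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(x_i \mid x_1, \dots, x_{i-1})

% "user decides between k letters" is the per-letter perplexity
k = 2^{\text{bits/letter}}, \qquad 2^{2.4} \approx 5.3, \qquad 2^{0.61} \approx 1.5
```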

Results on the test set

Model evaluated on a held-out test set (Hub1):

  Default language model:   2.4 bits/letter    (user decides between 5.3 letters)
  Best speech-based model:  0.61 bits/letter   (user decides between 1.5 letters)

“To the mouse snow means freedom from want and fear”

“The hibernating skunk curled up in his deep den uncurls himself and ventures forth to prowl the world”

Conclusions

  - One-third of recognition errors are covered by expanding the lattice.
  - Only insertion-error expansion improves efficiency.
  - The speech-based model significantly improves efficiency (2.4 bits/letter -> 0.61 bits/letter).
  - A good correction interface is possible using Dasher and an off-the-shelf recognizer.

Future work

  - Update Speech Dasher to use the lattice-based probability model.
  - Incorporate hypothesis probabilities into the lattice (or, even better, get at the recognizer's lattice).
  - Improve efficiency on sentences with few or no errors.
  - Run user trials to validate the numeric results.

Questions?