Li Deng, Microsoft Research, Redmond, WA. Presented at the Banff Workshop, July 2009. From Recognition To Understanding: Expanding the traditional scope of signal processing.

Outline
– Traditional scope of signal processing: the "signal" dimension and the "processing/task" dimension
– Expansion along both dimensions: the "signal" dimension; the "task" dimension
– Case study on the "task" dimension: from speech recognition to speech understanding
– Three benefits for MMSP research

Signal Processing Constitution
"… The Field of Interest of the Society shall be the theory and application of filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals by digital or analog devices or techniques. The term 'signal' includes audio, video, speech, image, communication, geophysical, sonar, radar, medical, musical, and other signals…" (ARTICLE II)
This translates into a "matrix": "processing type" (rows) vs. "signal type" (columns).

Scope of SP in a matrix
Columns (media type): Audio/Music | Speech | Image/Animation/Graphics | Video | Text/Document/Language(s)
Rows (tasks/apps), with entries per media type:
– Coding: Audio Coding | Speech Coding | Image Coding | Video Coding | Document Compression/Summary
– Communication (transmit/estimate/detect)
– Record/Reproducing: Microphone/loudspeaker design (audio) | Camera (image)
– Analysis (filtering, enhancement): De-noising/Source separation | Speech Enhancement/Feature extraction | Image/video enhancement (e.g., ClearType), Segmentation, Feature extraction (e.g., SIFT) | Grammar checking, Text Parsing
– Synthesis: Computer Music | Speech Synthesis (text-to-speech) | Computer Graphics | Video Synthesis? | Natural Language Generation
– Recognition: Auditory Scene Analysis? | Automatic Speech/Speaker Recognition | Image Recognition (e.g., optical character recognition, face recognition, fingerprint recognition) | Computer Vision (e.g., 3-D object recognition) | Text Categorization
– Understanding (Semantic IE): Spoken Language Understanding (e.g., voice search) | Image Understanding (e.g., scene analysis) | Natural Language Understanding / MT
– Retrieval/Mining: Music Retrieval | Spoken Document Retrieval & Voice/Mobile Search | Image Retrieval | Video Search | Text Search (information retrieval)
– Social Media Apps: Zune, iTunes, etc. | Podcasts | Photo Sharing (e.g., Flickr) | Video Sharing (e.g., YouTube, 3D Second Life) | Blogs, Wiki, del.icio.us, …

Scope of SP in a matrix (expanded)
Columns (media type): Audio/Music/Acoustics | Speech | Image/Animation/Graphics | Video | Text/Document/Language(s)
Rows (tasks/apps), with entries per media type:
– Coding/Compression: Audio Coding | Speech Coding | Image Coding | Video Coding | Document Compression/Summary
– Communication: MIMO; Voice over IP; DAB/DVB; IP-TV; Home Network; Wireless?
– Security/Forensics: Multimedia watermarking, encryption, etc.
– Enhancement/Analysis: De-noising/Source separation | Speech Enhancement/Feature extraction | Image/video enhancement, Segmentation, Feature extraction (e.g., SIFT, SURF), Computational photography | Grammar checking, Text Parsing
– Synthesis/Rendering: Computer Music | Speech Synthesis (text-to-speech) | Computer Graphics | Video Synthesis | Natural Language Generation
– User Interface: Multi-Modal Human-Computer Interaction (HCI: input methods) / Dialog?
– Recognition/Verification-Detection: Auditory Scene Analysis; Machine hearing? (computer audition; e.g., melody detection & singer ID, etc.)? | Automatic Speech/Speaker Recognition | Image Recognition (e.g., optical character recognition, face recognition, fingerprint recognition) | Computer Vision (e.g., 3-D object recognition; "story telling" from video, etc.) | Text Categorization
– Understanding (Semantic IE): Spoken Language Understanding (e.g., HMIHY) | Image Understanding (e.g., scene analysis)? | Natural Language Understanding / MT
– Retrieval/Mining: Music Retrieval | Spoken Document Retrieval & Voice/Mobile Search | Image Retrieval (CBIR) | Video Search | Text Search (information retrieval)
– Social Media Apps: iTunes, etc. | Podcasts | Photo Sharing (e.g., Flickr) | Video Sharing (e.g., YouTube, 3D Second Life) | Blogs, Wiki, del.icio.us, …

Speech Understanding: Case Study (Yaman, Deng, Yu, Acero: IEEE Trans. ASLP, 2008)
Speech understanding: the goal is not to get the "words" but the "meaning/semantics" (actionable by the system). Speech utterance classification is a simple form of speech "understanding". Case study: the ATIS domain (Airline Travel Information System). "Understanding": does the user want to book a flight, or get information about ground transportation in SEA?

Traditional Approach to Speech Understanding/Classification
1st stage (speech recognition): an automatic speech recognizer, built from an acoustic model and a language model, produces the word string.
2nd stage (semantic classification): a semantic classifier, built from a classifier model and feature functions, finds the most likely semantic class for the r-th acoustic signal.
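A minimal sketch of this two-stage pipeline. The dictionaries below are made-up toy models standing in for the real acoustic model, language model, and semantic classifier; only the two-stage structure is the point.

```python
# Toy sketch of the traditional two-stage pipeline (hypothetical models,
# not the actual recognizer or ATIS classifier from the slides).

def recognize(acoustics, acoustic_model, language_model):
    """1st stage: pick the single best word string W for the acoustics."""
    # Score each candidate W by P(A|W) * P(W) and keep the argmax.
    return max(
        language_model,
        key=lambda w: acoustic_model.get((acoustics, w), 0.0) * language_model[w],
    )

def classify(words, classifier_model):
    """2nd stage: pick the most likely semantic class given the words only."""
    return max(classifier_model, key=lambda c: classifier_model[c].get(words, 0.0))

# One utterance, two word hypotheses, two semantic classes (all hypothetical).
acoustic_model = {("a1", "book a flight"): 0.6, ("a1", "ground transport"): 0.4}
language_model = {"book a flight": 0.5, "ground transport": 0.5}
classifier_model = {
    "FLIGHT": {"book a flight": 0.9, "ground transport": 0.1},
    "GROUND": {"book a flight": 0.1, "ground transport": 0.8},
}

best_words = recognize("a1", acoustic_model, language_model)
print(best_words, classify(best_words, classifier_model))
```

Note that the two stages never see each other's objectives, which is exactly the weakness the next slide points out.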

Traditional/New Approach
Traditional: the word error rate is minimized in the 1st stage; the understanding error rate is minimized in the 2nd stage. But lower word error does not necessarily mean better understanding. The new approach integrates the two stages so that the overall "understanding" error is minimized.

New Approach: Integrated Design
Key components: discriminative training; N-best list rescoring; iterative update of parameters.
The automatic speech recognizer (acoustic model + language model) produces an N-best list, and the semantic classifier and language model are trained jointly by rescoring using the N-best list.

Classification Decision Rule using the N-Best List
Approximating the classification decision rule: the integrative score sums over all possible word strings W; the sum is approximated by a maximization over the W in the N-best list.
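The approximation can be sketched with toy numbers: instead of summing P(C|W)·P(W|A) over all word strings W, take the max over the N-best list. The word strings, scores, and the classify_nbest helper below are all hypothetical.

```python
# Sketch of the N-best decision rule: the sum over all word strings is
# replaced by a max over the N-best list (toy probabilities, hypothetical).

def classify_nbest(nbest, class_given_words, classes):
    """Pick the class C maximizing max_{W in N-best} P(W|A) * P(C|W)."""
    def score(c):
        return max(p_w * class_given_words[w].get(c, 0.0) for w, p_w in nbest)
    return max(classes, key=score)

# N-best list for one utterance: (word string, recognizer score P(W|A)).
nbest = [("show flights to seattle", 0.7),
         ("ground transportation seattle", 0.3)]
class_given_words = {
    "show flights to seattle": {"FLIGHT": 0.9, "GROUND": 0.1},
    "ground transportation seattle": {"FLIGHT": 0.2, "GROUND": 0.8},
}

print(classify_nbest(nbest, class_given_words, ["FLIGHT", "GROUND"]))
```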

An Illustrative Example
The hypothesis with the best score yields the wrong class, while the best sentence for yielding the correct class receives a low score.

Minimizing the Misclassifications
Define the misclassification function and the loss function associated with it; training then minimizes the total loss due to misclassifications.
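The functions named here appeared as equation images on the slide; below is a minimal sketch of the standard MCE construction (a misclassification measure pushed through a sigmoid to give a smooth, differentiable loss). The scores and the gamma smoothness setting are hypothetical.

```python
import math

# Standard MCE-style loss: d > 0 means misclassified, and the sigmoid turns
# d into a smooth 0..1 loss that can be minimized by gradient methods.

def misclassification(scores, correct):
    """d > 0 when the best competing class outscores the correct class."""
    competitors = [s for c, s in scores.items() if c != correct]
    return max(competitors) - scores[correct]

def loss(d, gamma=2.0):
    """Sigmoid loss: near 0 for confident correct decisions, near 1 for errors."""
    return 1.0 / (1.0 + math.exp(-gamma * d))

scores = {"FLIGHT": 1.2, "GROUND": 0.4}          # discriminant scores, one utterance
d = misclassification(scores, correct="FLIGHT")  # negative: correctly classified
print(round(loss(d), 3))
```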

Discriminative Training of Language Model Parameters
Find the language model probabilities that minimize the total classification loss. The update for each bigram combines its count in the word string of the n-th competing class and its count in the word string of the correct class, scaled by a weighting factor.

Discriminative Training of Semantic Classifier Parameters
Find the classifier model parameters that minimize the total classification loss, again using a weighting factor.
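A count-based sketch of the kind of update the last two slides describe: bigram log-probabilities are nudged up when the bigram appears in the correct-class word string and down when it appears in the competing-class word string. The epsilon step size, the weight, and the update_lm helper are hypothetical; the real update scales these counts by the gradient of the classification loss.

```python
# Hypothetical GPD-style sketch: raise bigrams from the correct string,
# lower bigrams from the competitor's string, each scaled by a weight.

def bigrams(words):
    toks = words.split()
    return list(zip(toks, toks[1:]))

def update_lm(log_probs, correct_words, competitor_words, weight, epsilon=0.1):
    """Nudge bigram log-probabilities toward the correct-class word string."""
    updated = dict(log_probs)
    for bg in bigrams(correct_words):
        updated[bg] = updated.get(bg, 0.0) + epsilon * weight
    for bg in bigrams(competitor_words):
        updated[bg] = updated.get(bg, 0.0) - epsilon * weight
    return updated

lm = {("book", "a"): -1.0, ("ground", "transport"): -1.0}
new_lm = update_lm(lm, "book a flight", "ground transport", weight=0.5)
print(new_lm[("book", "a")], new_lm[("ground", "transport")])
```

The same loss-weighted push-pull applies to the classifier parameters, with feature weights in place of bigram probabilities.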

Setup for the Experiments
ATIS II+III data is used:
– 5798 training wave files
– 914 test wave files
– 410 development wave files (used for parameter tuning and stopping criteria)
The Microsoft SAPI 6.1 speech recognizer is used. MCE classifiers are built on top of maximum-entropy classifiers.

Experiments: Baseline System Performance
ASR transcription: one-best matching sentence, W.
Classifier training: maximum-entropy classifiers using the one-best ASR transcription.
Classifier testing: maximum-entropy classifiers using the one-best ASR transcription.
Table: Test WER (%) and Test CER (%), for Manual Transcription vs. ASR Output.

Experimental Results
One iteration of training consists of: speech utterance → SAPI speech recognition → discriminative LM training → discriminative classifier training → max-entropy classifier training → CER evaluation.
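The iteration can be shown as a data-flow sketch. Every step below is a hypothetical stub standing in for the real SAPI recognizer and the discriminative trainers from the previous slides, so only the loop shape and the order of the steps are meaningful.

```python
# Data-flow sketch of one training iteration (all steps are stubs).

def sapi_recognize(utterance, lm):
    return [(utterance, 1.0)]                 # stub: N-best list for the utterance

def train_lm(nbest_lists, lm):
    return lm                                 # stub: discriminative LM training

def train_classifier(nbest_lists, clf):
    return clf                                # stub: discriminative classifier training

def cer(nbest_lists, clf):
    return 0.0                                # stub: classification error rate

def one_iteration(utterances, lm, clf):
    nbest_lists = [sapi_recognize(u, lm) for u in utterances]
    lm = train_lm(nbest_lists, lm)            # update the language model first
    clf = train_classifier(nbest_lists, clf)  # then the semantic classifier
    return lm, clf, cer(nbest_lists, clf)

lm, clf, err = one_iteration(["book a flight"], {}, {})
print(err)
```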

From Recognition to Understanding
This case study illustrates that joint design of the "recognition" and "understanding" components is beneficial. The evidence is drawn from the speech research area. Does speech translation lead to a similar conclusion? Are there case studies from the image/video research areas, e.g., image recognition/understanding?

Summary
The "matrix" view of signal processing: "signal type" as the columns, "task type" as the rows.
– Benefit 1: natural extension of the column elements (e.g., text/language) and of the row elements (e.g., understanding).
– Benefit 2: cross-column breeding: e.g., can speech/audio and image/video recognition researchers learn from each other in terms of machine learning and SP techniques (similarities and differences)?
– Benefit 3: cross-row breeding: e.g., given the trend from speech recognition to understanding (and the kind of approach in the case study), what can we say about image/video and other media understanding?