Dealing with Unknown Unknowns (in Speech Recognition)
Hynek Hermansky
Processing speech in multiple parallel processing streams, which attend to different parts of the signal space and use different strengths of prior top-down knowledge, is proposed for dealing with unexpected signal distortions and with unexpected lexical items. Some preliminary results in machine recognition of speech are presented.

(figure: the ambiguous “Indian or white man” illusion)
There are things we do not know we don't know. – Donald Rumsfeld

John Pierce on the field:
“Funding artificial intelligence is real stupidity.”
“After growing wildly for years, the field of computing appears to be reaching its infancy.”
Speech recognition: a research field of “mad inventors or untrustworthy engineers.”
This from the man who supervised the Bell Labs team which built the first transistor, served on the President’s Science Advisory Committee, developed the concept of pulse code modulation, and designed and launched the first active communications satellite.
Letter to the Editor, J. Acoust. Soc. Am.: “… should people continue work towards speech recognition by machine?” Perhaps it is for people in the field to decide.

Why am I working in this field? Problems faced in machine recognition of speech reveal basic limitations of all information technology!
“Why did I climb Mt. Everest? Because it is there!” – Sir Edmund Hillary
Spoken language is one of the most amazing accomplishments of the human race: access to information, voice interactions with machines, extracting information from speech data!

Knowledge (production, perception, cognition, …): “We speak in order to hear, in order to be understood.” – Roman Jakobson
Data: “Speech recognition … a problem of maximum likelihood decoding.” – Frederick Jelinek; the hidden Markov model

Stochastic recognition of speech

Ŵ = argmax_W p(x|W) P(W)

Ŵ – the estimated speech utterance
p(x|W) – likelihoods of the acoustic models of speech sounds; the models are derived by training on very large amounts of speech data
P(W) – prior probabilities of speech utterances (the language model), estimated from large amounts of data (typically text)

“Unknown unknowns” in machine recognition of speech:
distortions not seen in the training data of the acoustic model
words that are not expected by the language model
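The decision rule can be illustrated with a toy sketch; the candidate utterances and all scores below are invented for illustration, not taken from any real recognizer.

```python
import math

# Hypothetical scores for three candidate utterances (invented numbers).
# log_acoustic[w] stands in for log p(x|W), log_prior[w] for log P(W).
log_acoustic = {"yes": -12.0, "no": -15.5, "maybe": -11.0}
log_prior = {"yes": math.log(0.5), "no": math.log(0.4), "maybe": math.log(0.1)}

def decode(log_acoustic, log_prior):
    """Return W_hat = argmax_W [log p(x|W) + log P(W)]."""
    return max(log_acoustic, key=lambda w: log_acoustic[w] + log_prior[w])

best = decode(log_acoustic, log_prior)
# "maybe" has the best acoustic score, but its low prior lets "yes" win:
# the acoustic model and the language model jointly decide.
```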

One possible way of dealing with unknown unknowns:
Parallel information-providing streams, each carrying different redundant dimensions of a given target; a strategy for comparing the streams; a strategy for selecting “reliable” streams.
Stream formation: different perceptual modalities, different processing channels within each modality, bottom-up and top-down dominated channels (signal → information fusion → decision).
Comparing the streams? Various correlation (distance) measures.
Selecting reliable streams? Still an open question.
Information in speech is coded in many redundant dimensions, and not all dimensions get corrupted at the same time.

Perceptual data
Fletcher et al.: the probability of error in recognizing full-band speech is the product of the probabilities of error in the individual subbands.
Boothroyd and Nittrouer: the probability of error in recognizing speech in context is the product of the probability of error without context and the probability of error of the channel which provides information about the context.
The final error is dominated by the channel with the smallest error!
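The multiplicative error law can be checked numerically; the subband error rates below are invented, not Fletcher's measurements.

```python
# Band-error independence: the full-band error probability is the
# product of the per-subband error probabilities. Illustrative numbers.
subband_errors = [0.6, 0.5, 0.05, 0.7]

full_band_error = 1.0
for e in subband_errors:
    full_band_error *= e

# One good subband (error 0.05) pulls the product down to about 0.01,
# even though the other subbands are individually quite poor:
# the final error is dominated by the channel with the smallest error.
```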

Processing streams
A large number of parallel processing streams: different carrier frequencies, different carrier bandwidths, different spectral and temporal resolutions, different modalities, different prior biases.
Auditory cortical receptive fields (time vs. frequency; from N. Mesgarani): evidence for different processing strategies.

Evidence for equally powerful bottom-up and top-down streams ? From the subjective point of view, there is nothing special that would differentiate between the top-down and bottom-up dominated processing streams. All streams provide information for a decision. When all streams provide non-conflicting information, all this information is used for the decision. When the context allows for multiple interpretations of the sensory input, the bottom-up processing stream dominates. When the sensory input gets corrupted by noise, the top-down dominated stream fills in for the corrupted bottom-up input. Hermansky 2013

Monitoring performance
Two streams, correct with probabilities P1 and P2; an ideal combination misses only when both streams fail:
P_miss = (1 − P1)(1 − P2)
Could it be that we know when we know? With a real observer, false positives and false negatives are possible, so
P_miss_observed ≠ (1 − P1)(1 − P2)
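A small Monte Carlo sketch illustrates why an imperfect observer makes the observed miss rate depart from the ideal product; the stream accuracies and the 20 % observer error rate are invented for illustration.

```python
import random

random.seed(0)
P1, P2 = 0.7, 0.6          # per-stream probabilities of correct recognition
N = 100_000

ideal_miss = 0             # oracle: miss only if both streams are wrong
observer_miss = 0          # imperfect observer picks one stream per trial
p_pick_wrong_stream = 0.2  # hypothetical observer error rate

for _ in range(N):
    ok1 = random.random() < P1
    ok2 = random.random() < P2
    if not (ok1 or ok2):
        ideal_miss += 1
    if ok1 != ok2:
        # Exactly one stream is correct: the observer's choice matters,
        # and it errs with probability p_pick_wrong_stream.
        picked_ok = random.random() >= p_pick_wrong_stream
    else:
        picked_ok = ok1  # both streams agree: the choice is irrelevant
    if not picked_ok:
        observer_miss += 1

ideal_rate = ideal_miss / N        # close to (1-P1)(1-P2) = 0.12
observer_rate = observer_miss / N  # larger, because the observer errs
```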

Knowing when one knows! Performance monitoring in sensory perception: human judgments of picture density (“sparse” / “not sure” / “dense”) track the density of the picture from low to high (adapted from Smith et al. 2003); similar data are available for monkeys, dolphins, rats, …
And in a machine? Build a model of the classifier’s output on the training data and another on the testing data, compare the two models, and use the comparison to update the classifier.
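One crude way to sketch the compare-models-of-the-output idea: summarize the classifier's posteriors on training and test data by their mean distributions and compare the two with KL divergence. All numbers below are invented, and a real performance monitor would use richer statistics than the mean.

```python
import math

def avg_distribution(posterior_frames):
    """Mean posterior vector over frames: a crude 'model of the output'."""
    n = len(posterior_frames)
    k = len(posterior_frames[0])
    return [sum(f[i] for f in posterior_frames) / n for i in range(k)]

def kl(p, q, eps=1e-9):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Invented posterior frames over 3 classes.
train_frames = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
matched_test = [[0.7, 0.2, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]
noisy_test   = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.9, 0.05, 0.05]]

ref = avg_distribution(train_frames)
d_matched = kl(avg_distribution(matched_test), ref)  # small: data match
d_noisy   = kl(avg_distribution(noisy_test), ref)    # large: mismatch flag
```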

From spectrogram to posteriogram: the time–frequency data are preprocessed and passed to artificial neural networks trained on large amounts of labeled data; fusion of the ANN outputs yields phoneme posteriors, computed over windows of up to 1 s.
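The spectrogram-to-posteriogram step can be sketched with a tiny untrained MLP; the weights and features below are random placeholders, so the posteriors are meaningless, but the shape of the pipeline (feature frames in, per-frame phoneme posteriors out) is what matters.

```python
import math
import random

random.seed(1)
NUM_PHONES, NUM_FEATS, NUM_FRAMES, HIDDEN = 5, 13, 4, 8

# Untrained toy network: random weights stand in for a net trained
# on large amounts of labeled speech.
W1 = [[random.gauss(0, 0.1) for _ in range(NUM_FEATS)] for _ in range(HIDDEN)]
W2 = [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(NUM_PHONES)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def frame_posteriors(feats):
    """One spectral frame in, one vector of phoneme posteriors out."""
    h = [math.tanh(sum(w * f for w, f in zip(row, feats))) for row in W1]
    z = [sum(w * a for w, a in zip(row, h)) for row in W2]
    return softmax(z)

# Spectrogram-like feature frames (invented numbers).
frames = [[random.random() for _ in range(NUM_FEATS)] for _ in range(NUM_FRAMES)]
posteriogram = [frame_posteriors(f) for f in frames]  # frames x phones
```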

Fusion of streams of different carrier frequencies [Hermansky et al 1996, Li et al 2013]

Preliminary results using multi-stream speech recognition on noisy TIMIT data: processing is done in multiple parallel streams; signal corruption affects only some streams; a performance monitor selects the N best streams for further processing.

Phoneme recognition error rates on noisy TIMIT data:

environment        conventional   proposed   best by hand
clean              31 %           28 %       25 %
car at 0 dB SNR    54 %           38 %       35 %
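The stream-selection step might be sketched as follows, using posterior entropy as a stand-in reliability score; that choice is an assumption made for illustration, not the performance monitor used in the experiments above.

```python
import math

def entropy(p):
    """Shannon entropy of a posterior vector (nats)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def fuse_best(stream_posteriors, n_best):
    """Average the posteriors of the n_best streams with lowest entropy.

    Low entropy (a confident, peaked posterior) is taken here as a crude
    proxy for stream reliability.
    """
    ranked = sorted(stream_posteriors, key=entropy)[:n_best]
    k = len(ranked[0])
    return [sum(p[i] for p in ranked) / len(ranked) for i in range(k)]

# Three streams' posteriors over 3 phonemes (invented numbers):
# two confident streams and one corrupted, near-uniform stream.
streams = [
    [0.85, 0.10, 0.05],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],   # corrupted stream, highest entropy
]
fused = fuse_best(streams, n_best=2)  # corrupted stream is dropped
```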

The conventional “deep” net: all available frequency components, windows of up to 100 ms, many processing layers, yielding (transformed) posterior probabilities of speech sounds.
The “long, wide and deep” net: low-, mid-, and high-frequency components processed in separate streams over windows of up to 1000 ms, many processing layers per stream, each stream extracting its own information over time, and “smart” fusion of the streams’ (transformed) posterior probabilities.

Conclusions we would eventually like to make Recognition should be done in parallel processing streams, each attending to a particular aspect of the signal and using different levels of top-down expectations Discrepancy among the streams indicates an unexpected signal Suppressing corrupted streams can increase robustness to unexpected inputs

Machine emulation of human speech communication
“… devise clear, simple, definitive experiments, so a science of speech can grow, certain step by certain step.” – John Pierce
Knowledge (human communication, speech production, perception, neuroscience, cognitive science, …): “We speak, in order to be heard, in order to be understood.” – Roman Jakobson
Tools (information and communication theory, machine learning, large data, …): “Speech recognition … a problem of maximum likelihood decoding.” – Fred Jelinek; “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year …” – Gordon Moore
Also John Pierce (1969): speech recognition is so far a field of “mad inventors or untrustworthy engineers,” because the machine needs “intelligence and knowledge of language comparable to those of a native speaker.” Sounds like a good goal to aim at!

THANKS!
Nima Mesgarani, Samuel Thomas, Feipeng Li, Ehsan Variani, Vijay Peddinti, Jont Allen, Harish Mallidi, Misha Pavel, Hamed Ketabdar