CSC2535 2013 Lecture 7: Recognizing speech. Geoffrey Hinton.

Why speech recognition works better with a good language model

We cannot identify phonemes perfectly in noisy speech.
- The acoustic input is often ambiguous: there are several different words that fit the acoustic signal equally well.

People use their understanding of the meaning of the utterance to hear the right word.
- We do this unconsciously, and we are very good at it.
- It can lead to bizarre errors: given the right context, we hear “Wreck a nice beach” as “Recognize speech”.

Speech recognizers have to know which words are likely to come next and which are not.
- This can be done quite well without full understanding.
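A prior over word sequences of this kind can be as simple as a smoothed bigram model. The sketch below is purely illustrative: the counts and the vocabulary size are made up, not taken from any real corpus.

```python
import math

# Toy bigram language model: scores a word sequence by how likely each
# word is to follow the previous one. All counts are invented for this example.
bigram_counts = {
    ("recognize", "speech"): 50,
    ("wreck", "a"): 5,
    ("a", "nice"): 40,
    ("nice", "beach"): 10,
}
unigram_counts = {"recognize": 60, "wreck": 8, "a": 100,
                  "nice": 45, "speech": 70, "beach": 12}

def bigram_log_prob(words, alpha=1.0, vocab_size=1000):
    """Log-probability of a word sequence under an add-alpha smoothed bigram model."""
    logp = 0.0
    for prev, cur in zip(words, words[1:]):
        num = bigram_counts.get((prev, cur), 0) + alpha
        den = unigram_counts.get(prev, 0) + alpha * vocab_size
        logp += math.log(num / den)
    return logp

# With these counts the language model strongly prefers "recognize speech"
# over the acoustically similar "wreck a nice beach".
print(bigram_log_prob(["recognize", "speech"]))
print(bigram_log_prob(["wreck", "a", "nice", "beach"]))
```

In a real recognizer this score is combined with the acoustic score, so an acoustically ambiguous hypothesis loses to one the language model finds plausible.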

A “standard” speech recognition system

Step 1: Convert the soundwave into a sequence of frames of Mel Frequency Cepstral Coefficients (MFCCs).
- Use Fourier analysis on ~25 ms windows.
- Smooth the spectrum of frequencies to capture the resonances of the vocal tract.
- Use wider frequency bins for higher frequencies (but uniform width up to 1000 Hz).
- Advance the window by ~10 ms.
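The pipeline above can be sketched in NumPy. This is a minimal illustration, not a production front-end: the 26-filter bank, the 13-coefficient truncation, and the 16 kHz sample rate are typical choices assumed here, not dictated by the slide.

```python
import numpy as np

def mel(f):
    """Hz -> mel scale (roughly linear below 1000 Hz, logarithmic above)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, win_ms=25, hop_ms=10, n_filters=26, n_ceps=13):
    win, hop = int(sr * win_ms / 1000), int(sr * hop_ms / 1000)
    # Slice the waveform into overlapping ~25 ms frames, advanced by ~10 ms,
    # and taper each frame with a Hamming window.
    n_frames = 1 + max(0, (len(signal) - win) // hop)
    frames = np.stack([signal[i*hop : i*hop + win] for i in range(n_frames)])
    frames = frames * np.hamming(win)
    # Power spectrum from a short-time Fourier transform.
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular filters spaced uniformly on the mel scale: wider bins at
    # higher frequencies, as the slide describes.
    edges = inv_mel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((win + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, spec.shape[1]))
    for j in range(n_filters):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log filterbank energies give a smoothed, compressed spectrum.
    logmel = np.log(spec @ fbank.T + 1e-10)
    # A DCT over filter index decorrelates the energies; keep the first n_ceps.
    k = np.arange(n_ceps)[:, None] * (2 * np.arange(n_filters) + 1)[None, :]
    dct = np.cos(np.pi * k / (2 * n_filters))
    return logmel @ dct.T

x = np.random.randn(16000)   # one second of fake audio at 16 kHz
feats = mfcc(x)
print(feats.shape)           # roughly 100 frames x 13 coefficients
```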

A “standard” speech recognition system

Step 2: Model each frame of coefficients using a mixture of Gaussians.
- Affine-transform all frames to deal with obvious, shared covariance structure. Make the affine transform depend on the speaker to eliminate some of the inter-speaker variation.
- To cope with the fact that these Gaussians cannot model the strong temporal covariances, augment the data with temporal differences and differences of differences. This allows local temporal covariances to be modelled as the variances of the differences.
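The differences and differences-of-differences ("deltas" and "delta-deltas") are usually computed with a short regression over neighbouring frames. A sketch, assuming NumPy and the conventional +/-2-frame window:

```python
import numpy as np

def deltas(feats, n=2):
    """Temporal differences via the standard regression formula over +/- n frames."""
    # Repeat the edge frames so the first and last frames get deltas too.
    padded = np.pad(feats, ((n, n), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, n + 1))
    return sum(k * (padded[n + k : n + k + len(feats)] -
                    padded[n - k : n - k + len(feats)])
               for k in range(1, n + 1)) / denom

frames = np.random.randn(98, 13)          # e.g. MFCC frames
d = deltas(frames)                        # temporal differences
dd = deltas(d)                            # differences of differences
augmented = np.hstack([frames, d, dd])    # the usual 39-dimensional observation
print(augmented.shape)
```

Stacking the three blocks triples the feature dimension (13 becomes 39), which is why 39-dimensional MFCC vectors are so common in HMM systems.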

A “standard” speech recognition system

Step 3: Cope with the alignment problem by using a Hidden Markov Model (HMM).
- Each hidden state of the HMM has its own mixture-of-Gaussians model.
- These state-specific MoG models may use state-specific mixing proportions for a set of Gaussians that are shared by all of the hidden states. This greatly reduces the number of parameters.

How HMMs solve the alignment problem

HMMs have a very weak generative model:
- Each frame is generated by a single hidden state.
- So the posterior over states at each frame only has as many probabilities as there are states.
- This makes it easy to search all possible alignments using the forward-backward algorithm (a form of dynamic programming).

Since exact inference is tractable in an HMM, we can use EM to learn the parameters of the transition matrix and of the Gaussians.
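The forward half of that dynamic program is short enough to write out. A minimal log-space sketch with made-up transition and emission numbers; it sums over every possible alignment in O(T * N^2) time rather than enumerating them:

```python
import numpy as np

def forward(log_A, log_pi, log_B):
    """Forward pass of an HMM in log space.

    log_A[i, j]: log transition probability from state i to state j
    log_pi[i]:   log initial-state probability
    log_B[t, i]: log p(frame_t | state_i), e.g. from a mixture of Gaussians
    Returns log p(all frames), summed over every possible state sequence.
    """
    def logsumexp(x, axis):
        m = x.max(axis=axis, keepdims=True)
        return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

    alpha = log_pi + log_B[0]
    for t in range(1, len(log_B)):
        # Each new alpha[j] sums the probability of reaching j from any state.
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_B[t]
    return logsumexp(alpha, axis=0)

# Tiny 2-state example with invented numbers.
A = np.log(np.array([[0.9, 0.1], [0.2, 0.8]]))
pi = np.log(np.array([0.5, 0.5]))
B = np.log(np.random.dirichlet([1, 1], size=5))  # 5 frames of fake emission probs
ll = forward(A, pi, B)
print(ll)
```

The backward pass has the same shape run in reverse; together they give the per-frame state posteriors that EM needs.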

How to make progress in speech recognition

Stop using MFCCs to model the soundwave.
- They throw away a lot of detailed information. This is good only if you have a small, slow computer.

Stop using mixtures of Gaussians to model the acoustics.
- It's an exponentially inefficient model.

Stop using HMMs to model sequential structure.
- HMMs are exponentially inefficient generative models: they require 2^N hidden states to carry N bits of constraint during generation.

An early use of neural nets

Use a feedforward neural net to convert MFCCs into a posterior probability distribution over the hidden states of the HMM.
- To train this net we need to know the “correct” state of the HMM, so we need to bootstrap from an existing ASR system.
- After training the neural net, we need to convert p(state|data) into p(data|state) in order to train an HMM.
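The conversion in the last bullet follows from Bayes' rule: p(data|state) is proportional to p(state|data) / p(state), and the p(data) term cancels when the HMM compares states at a given frame. A sketch, with invented softmax outputs and state priors:

```python
import numpy as np

def scaled_log_likelihoods(posteriors, state_priors, eps=1e-10):
    """Turn network posteriors p(state|frame) into scaled log-likelihoods.

    posteriors[t, i]: softmax output of the net for frame t, state i
    state_priors[i]:  relative frequency of state i, counted from alignments
    Dividing by the prior gives something proportional to p(frame|state),
    which is what the HMM's emission model expects.
    """
    return np.log(posteriors + eps) - np.log(state_priors + eps)

post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]])     # fake softmax outputs: 2 frames, 3 states
priors = np.array([0.5, 0.3, 0.2])     # fake state frequencies
sl = scaled_log_likelihoods(post, priors)
print(sl)
```

Note how a state that is common a priori gets its posterior discounted: the division rewards the net for being more confident than the prior alone would predict.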

A neat application of deep learning

A very deep belief net beats the record at phone recognition on the very well-studied TIMIT database.

The task:
- Each of the 61 phones is modeled by its own little 3-state mono-phone HMM.
- The neural net is trained to predict the probabilities of 183 context-dependent phone labels for the central frame of a short window of speech.

The training procedure:
- Train lots of big layers, one at a time, without using the labels.
- Add a 183-way softmax of context-specific phone labels.
- Fine-tune with backprop on a big GPU board for several days.
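The layer-by-layer, label-free stage trains each layer as a restricted Boltzmann machine with contrastive divergence. A rough sketch of one CD-1 update, assuming NumPy and binary units; the sizes and learning rate are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_cd1_step(v0, W, b_vis, b_hid, lr=0.01):
    """One contrastive-divergence (CD-1) update for a binary RBM.

    Pretraining a deep belief net repeats this layer by layer: train an RBM
    on the data, then feed its hidden activations to the next RBM as data.
    """
    # Up: sample hidden units given the data.
    h0_prob = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Down and up again: one step of alternating Gibbs sampling.
    v1_prob = sigmoid(h0 @ W.T + b_vis)
    h1_prob = sigmoid(v1_prob @ W + b_hid)
    # Move toward the data statistics and away from the reconstruction's.
    n = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / n
    b_vis += lr * (v0 - v1_prob).mean(axis=0)
    b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, b_vis, b_hid

# Fake binary "frames": 64 examples of a 20-dim visible layer, 10 hidden units.
V = (rng.random((64, 20)) < 0.3).astype(float)
W = 0.01 * rng.standard_normal((20, 10))
b_vis, b_hid = np.zeros(20), np.zeros(10)
for _ in range(50):
    W, b_vis, b_hid = rbm_cd1_step(V, W, b_vis, b_hid)
print(np.abs(W).mean())
```

After stacking several such layers, the 183-way softmax is added on top and the whole net is fine-tuned with backprop on the labels.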

One very deep belief net for phone recognition

[Architecture diagram: input of 11 frames of 39 MFCCs, layers of 2000 binary hidden units, a 128-unit layer, and 183 output labels; part of the stack is marked “not pre-trained”. Mohamed, Dahl & Hinton (2011). The Mel Cepstrum Coefficients are a standard representation for speech.]

A neat application of deep learning

After the standard post-processing using a bi-phone “language” model, this net gets a 23.0% phone error rate.
- This was a record for speaker-independent phone recognition.

Throw out the MFCCs and use filterbank coefficients plus their temporal deltas and delta-deltas.
- With a deeper net (8 layers) this gets down to 20.7%.

For TIMIT, the classification task (i.e. classify each phone when you are given the phone boundaries) is a bit easier than the recognition task.
- On TIMIT, deep networks are also the best at classification (Honglak Lee).

Getting rid of the HMM

This is going to be more difficult, because HMMs are a convenient way to deal with alignment.
- Their generative weakness is a strength.

The big hope is a recurrent neural net.
- Initially it could be trained to predict the next frame. This provides lots of constraint, and there is lots of data.
- Then it can be trained to predict both the next frame and the next phone label.
- But we don't really know how to deal with alignment when using an RNN.
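The next-frame prediction idea can be sketched with a plain (Elman-style) recurrent net. The sizes and random weights below are illustrative only; a real system would train the three weight matrices by backpropagation through time:

```python
import numpy as np

rng = np.random.default_rng(1)

def rnn_predict_next(frames, W_in, W_rec, W_out):
    """Run a simple recurrent net over a sequence of acoustic frames,
    emitting a prediction of the next frame at every step."""
    h = np.zeros(W_rec.shape[0])
    preds = []
    for x in frames:
        h = np.tanh(x @ W_in + h @ W_rec)   # hidden state carries history
        preds.append(h @ W_out)             # linear readout: predicted next frame
    return np.stack(preds)

dim, hid = 39, 50                            # e.g. MFCCs + deltas, hidden size
W_in = 0.1 * rng.standard_normal((dim, hid))
W_rec = 0.1 * rng.standard_normal((hid, hid))
W_out = 0.1 * rng.standard_normal((hid, dim))

seq = rng.standard_normal((98, dim))         # fake utterance of 98 frames
preds = rnn_predict_next(seq, W_in, W_rec, W_out)
# preds[t] is the net's guess at frames[t+1]; training would minimize the
# squared error between preds[:-1] and seq[1:].
print(preds.shape)
```

Adding a second output head that predicts the next phone label reuses the same hidden state, but, as the slide notes, deciding where one phone ends and the next begins is exactly the alignment problem the HMM used to handle.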

THE END