COMP 4060 Natural Language Processing Speech Processing.

Slides:



Advertisements
Similar presentations
Introduction to Speech Recognition
Advertisements

Building an ASR using HTK CS4706
Speech Recognition Part 3 Back end processing. Speech recognition simplified block diagram Speech Capture Speech Capture Feature Extraction Feature Extraction.
Introduction The aim the project is to analyse non real time EEG (Electroencephalogram) signal using different mathematical models in Matlab to predict.
Basic Spectrogram Lab 8. Spectrograms §Spectrograph: Produces visible patterns of acoustic energy called spectrograms §Spectrographic Analysis: l Acoustic.
Speech in Multimedia Hao Jiang Computer Science Department Boston College Oct. 9, 2007.
Chapter 15 Probabilistic Reasoning over Time. Chapter 15, Sections 1-5 Outline Time and uncertainty Inference: ltering, prediction, smoothing Hidden Markov.
Hidden Markov Models Theory By Johan Walters (SR 2003)
SPEECH RECOGNITION Kunal Shalia and Dima Smirnov.
Natural Language Processing - Speech Processing -
Artificial Intelligence Speech and Natural Language Processing.
Speech and Natural Language Processing Christel Kemke Department of Computer Science University of Manitobe Presentation for Human-Computer Interaction.
 Christel Kemke 2007/08 COMP 4060 Natural Language Processing Introduction And Overview.
Application of HMMs: Speech recognition “Noisy channel” model of speech.
Probabilistic Pronunciation + N-gram Models CSPP Artificial Intelligence February 25, 2004.
4/25/2001ECE566 Philip Felber1 Speech Recognition A report of an Isolated Word experiment. By Philip Felber Illinois Institute of Technology April 25,
The Beatbox Voice-to-Drum Synthesizer A BSTRACT The Beatbox is a real time voice-to-drum synthesizer intended primarily for the entertainment of small.
Learning, Uncertainty, and Information Big Ideas November 8, 2004.
Hidden Markov Models. Hidden Markov Model In some Markov processes, we may not be able to observe the states directly.
CS 188: Artificial Intelligence Fall 2009 Lecture 21: Speech Recognition 11/10/2009 Dan Klein – UC Berkeley TexPoint fonts used in EMF. Read the TexPoint.
1 Hidden Markov Model Instructor : Saeed Shiry  CHAPTER 13 ETHEM ALPAYDIN © The MIT Press, 2004.
Why is ASR Hard? Natural speech is continuous
A PRESENTATION BY SHAMALEE DESHPANDE
Natural Language Understanding
Representing Acoustic Information
EE513 Audio Signals and Systems Statistical Pattern Classification Kevin D. Donohue Electrical and Computer Engineering University of Kentucky.
Audio Processing for Ubiquitous Computing Uichin Lee KAIST KSE.
GCT731 Fall 2014 Topics in Music Technology - Music Information Retrieval Overview of MIR Systems Audio and Music Representations (Part 1) 1.
Isolated-Word Speech Recognition Using Hidden Markov Models
Knowledge Base approach for spoken digit recognition Vijetha Periyavaram.
1 7-Speech Recognition (Cont’d) HMM Calculating Approaches Neural Components Three Basic HMM Problems Viterbi Algorithm State Duration Modeling Training.
Speech Signal Processing
Artificial Intelligence 2004 Speech & Natural Language Processing Natural Language Processing written text as input sentences (well-formed) Speech.
Midterm Review Spoken Language Processing Prof. Andrew Rosenberg.
Fourier Concepts ES3 © 2001 KEDMI Scientific Computing. All Rights Reserved. Square wave example: V(t)= 4/  sin(t) + 4/3  sin(3t) + 4/5  sin(5t) +
Speech and Language Processing
7-Speech Recognition Speech Recognition Concepts
1 Speech Perception 3/30/00. 2 Speech Perception How do we perceive speech? –Multifaceted process –Not fully understood –Models & theories attempt to.
A brief overview of Speech Recognition and Spoken Language Processing Advanced NLP Guest Lecture August 31 Andrew Rosenberg.
Machine Translation  Machine translation is of one of the earliest uses of AI  Two approaches:  Traditional approach using grammars, rewrite rules,
Entropy & Hidden Markov Models Natural Language Processing CMSC April 22, 2003.
LML Speech Recognition Speech Recognition Introduction I E.M. Bakker.
Artificial Intelligence 2004 Speech & Natural Language Processing Natural Language Processing written text as input sentences (well-formed) Speech.
Speaker Recognition by Habib ur Rehman Abdul Basit CENTER FOR ADVANCED STUDIES IN ENGINERING Digital Signal Processing ( Term Project )
Overview ► Recall ► What are sound features? ► Feature detection and extraction ► Features in Sphinx III.
Artificial Intelligence 2004 Speech & Natural Language Processing Speech Recognition acoustic signal as input conversion into written words Natural.
Audio processing methods on marine mammal vocalizations Xanadu Halkias Laboratory for the Recognition and Organization of Speech and Audio
Hidden Markov Models: Decoding & Training Natural Language Processing CMSC April 24, 2003.
Speech Recognition with CMU Sphinx Srikar Nadipally Hareesh Lingareddy.
Probabilistic reasoning over time Ch. 15, 17. Probabilistic reasoning over time So far, we’ve mostly dealt with episodic environments –Exceptions: games.
Performance Comparison of Speaker and Emotion Recognition
CS 188: Artificial Intelligence Spring 2007 Speech Recognition 03/20/2007 Srini Narayanan – ICSI and UC Berkeley.
Statistical Models for Automatic Speech Recognition Lukáš Burget.
Automated Speach Recognotion Automated Speach Recognition By: Amichai Painsky.
Acoustic Phonetics 3/14/00.
Message Source Linguistic Channel Articulatory Channel Acoustic Channel Observable: MessageWordsSounds Features Bayesian formulation for speech recognition:
1 7-Speech Recognition Speech Recognition Concepts Speech Recognition Approaches Recognition Theories Bayse Rule Simple Language Model P(A|W) Network Types.
Probabilistic Pronunciation + N-gram Models CMSC Natural Language Processing April 15, 2003.
CS 224S / LINGUIST 285 Spoken Language Processing
Speech Recognition
ARTIFICIAL NEURAL NETWORKS
Speech Processing Speech Recognition
CS 188: Artificial Intelligence Fall 2008
Statistical Models for Automatic Speech Recognition
EE513 Audio Signals and Systems
Digital Systems: Hardware Organization and Design
Speech recognition, machine learning
Speech Recognition: Acoustic Waves
Artificial Intelligence 2004 Speech & Natural Language Processing
Speech recognition, machine learning
Presentation transcript:

COMP 4060 Natural Language Processing Speech Processing

Speech and Language Processing Spoken Language Processing from speech to text to syntax and semantics to speech Speech Recognition human speech recognition and production acoustics signal analysis phonetics recognition methods (HMMs) Review

Speech Production & Reception Sound and Hearing  change in air pressure  sound wave  reception through inner ear membrane / microphone  break-up into frequency components: receptors in cochlea / mathematical frequency analysis e.g. Fast-Fourier Transform FFT  Spectrum  Perception / recognition of phonemes and subsequently words e.g. Neural Networks, Hidden-Markov Models

Speech Recognition Phases Speech Recognition acoustic signal as input signal analysis - spectrogram feature extraction phoneme recognition word recognition conversion into written words

Phoneme Recognition: HMM, Neural Networks Phonemes Acoustic / sound wave Filtering, Sampling Spectral Analysis; FFT Frequency Spectrum Features (Phonemes; Context) Grammar or Statistics Phoneme Sequences / Words Grammar or Statistics for likely word sequences Word Sequence / Sentence Speech Recognition Signal Processing / Analysis

Speech Signal composed of different sinus waves with different frequencies and amplitudes Frequency - waves/second  like pitch Amplitude - height of wave  like loudness + Noise(not sinus wave, non-harmonic) Speech composite signal comprising different frequency components

Waveform (fig. 7.20) "She just had a baby." Amplitude/ Pressure Time

Waveform for Vowel ae (fig. 7.21) Time Amplitude/ Pressure

Speech Signal Analysis Analog-Digital Conversion of Acoustic Signal Sampling in Time Frames ( “ windows ” )  frequency = 0-crossings per time frame  e.g. 2 crossings/second is 1 Hz (1 wave)  e.g. 10kHz needs sampling rate 20kHz  measure amplitudes of signal in time frame  digitized wave form  separate different frequency components  FFT (Fast Fourier Transform)  spectrogram  other frequency based representations  LPC (linear predictive coding),  Cepstrum

Waveform and Spectrogram (figs. 7.20, 7.23)

Waveform and LPC Spectrum for Vowel ae (Figs. 7.21, 7.22) Energy Frequency Formants Amplitude/ Pressure Time

Speech Signal Characteristics From Signal Representation derive, e.g.  formants - dark stripes in spectrum strong frequency components; characterize particular vowels; gender of speaker  pitch – fundamental frequency baseline for higher frequency harmonics like formants; gender characteristic  change in frequency distribution characteristic for e.g. plosives (form of articulation)

Speech Production Phonetic Features

Video of glottis and speech signal in lingWAVES (from

Phoneme Recognition Recognition Process based on features extracted from spectral analysis phonological rules statistical properties of language/ pronunciation Recognition Methods Hidden Markov Models Neural Networks Pattern Classification in general

Pronunciation Networks / Word Models as Probabilistic FAs (Fig 5.12)

Pronunciation Network for 'about' (Fig 5.13)

Viterbi-Algorithm - Overview (cf. Jurafsky Ch.5) The Viterbi Algorithm finds an optimal sequence of states in continuous Speech Recognition, given an observation sequence of phones and a probabilistic (weighted) FA (state graph). The algorithm returns the path through the automaton which has maximum probability and accepts the observation sequence. a[s,s'] is the transition probability (in the phonetic word model) from current state s to next state s', and b[s',o t ] is the observation likelihood of s' given o t. b[s',o t ] is 1 if the observation symbol matches the state, and 0 otherwise.

function VITERBI (observations of len T, state-graph) returns best-path num-states  NUM-OF-STATES (state-graph) Create a path probability matrix viterbi[num-states+2,T+2] viterbi[0,0]  1.0 for each time step t from 0 to T do for each state s from 0 to num-states do for each transition s' from s in state-graph new-score  viterbi[s,t] * a[s,s'] * b[s',(o t )] if ((viterbi[s',t+1] = 0) || (new-score > viterbi[s',t+1])) thenviterbi[s',t+1]  new-score back-pointer[s',t+1]  s Backtrace from highest probability state in the final column of viterbi[] and return path Viterbi-Algorithm (Fig 5.19) word model observation (speech recognizer)

The Viterbi Algorithm sets up a probability matrix, with one column for each time index t and one row for each state in the state graph. Each column has a cell for each state q i in the single combined automaton for the competing words (in the recognition process). The algorithm first creates N+2 state columns. The first column is an initial pseudo-observation, the second corresponds to the first observation-phone, the third to the second observation and so on. The final column represents again a pseudo- observation. In the first column, the probability of the Start-state is initially set to 1.0; the other probabilities are 0. Then we move to the next state. For every state in column 0, we compute the probability of moving into each state in column 1. The value viterbi [t, j] is computed by taking the maximum over the extensions of all the paths that lead to the current cell. An extension of a path at state i at time t-1 is computed by multiplying the three factors: the previous path probability from the previous cell forward[t-1,i] the transition probability a i,j from previous state i to current state j the observation likelihood b jt that current state j matches observation symbol t. b jt is 1 if the observation symbol matches the state; 0 otherwise. Viterbi-Algorithm Explanation (cf. Jurafsky Ch.5)

Word Recognition with Probabilistic FA / Markov Chain (Fig 5.14)

Speech Recognizer Architecture (Fig. 7.2)

Speech Processing - Important Types and Characteristics single word vs. continuous speech unlimited vs. large vs. small vocabulary speaker-dependent vs. speaker-independent training (or not) Speech Recognition vs. Speaker Identification

Natural Language and Speech Natural Language Processing written text as input sentences (well-formed) Speech Recognition acoustic signal as input conversion into written words Spoken Language Understanding analysis of spoken language (transcribed speech)

Speech & Natural Language Processing Areas in Natural Language Processing  Morphology  Grammar & Parsing (syntactic analysis)  Semantics  Pragamatics  Discourse / Dialogue  Spoken Language Understanding Areas in Speech Recognition  Signal Processing  Phonetics  Word Recognition

Additional References Hong, X. & A. Acero & H. Hon: Spoken Language Processing. A Guide to Theory, Algorithms, and System Development. Prentice-Hall, NJ, 2001 Figures taken from: Jurafsky, D. & J. H. Martin, Speech and Language Processing, Prentice-Hall, 2000, Chapters 5 and 7. lingWAVES: NL and Speech Resources and Tools: German Demonstration Center for Speech and Language Technologies: