Human Speech Communication


Human Speech Communication (diagram). Speaker: message -> linguistic code (< 50 b/s) -> motor control -> speech production -> SPEECH SIGNAL (> 50 kb/s). Listener: auditory processing -> speech perception processes. Further labels: dialogue interaction, self-control, adaptation.

Speaker and listener (diagram). Speaker knowledge and listener knowledge are shared. Speech signal: high bit rate. Sequence of speech sounds ("hello world"): low bit rate. Message: very low bit rate.

Machine recognition of speech (diagram): from the high bit rate signal to the low bit rate sequence of speech sounds ("hello world") and on to the message (word, another word).

COARTICULATION (figure: u, o)

"hello world": coarticulation + talker idiosyncrasies + environmental variability = a big mess

Two dominant sources of variability in speech. FEATURE VARIABILITY: different people sound different, the communication environment differs, coarticulation effects, … TEMPORAL VARIABILITY: people can say the same thing at different speeds. "Doubly stochastic" process (Hidden Markov Model): speech as a sequence of hidden states (phonemes); the task is to recover the sequence. We never know for sure which data will be generated from a given state, and we never know for sure which state we are in.

Already the old Greeks … (figure labels: wall, fire, echoes, activity, shadows)

f0 = 195 125 140 120 185 130 145 190 245 155 130 Hz (each person says "hi"). Want to know where the boys (girls) are? Then we need to know: What are the typical ranges of boys' and girls' voices? How likely is it that a boy walks first? How many boys and girls typically go together? How many more boys are there typically?

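The model: two states, m (male) and f (female). P(gender): probability p1m that the first person is male, probabilities pm and pf of staying in the same state, and probabilities 1-pm and 1-pf of switching. P(sound|gender): the distribution of f0 for each state. Given this knowledge, generate all possible sequences of boys and girls and find which among them could most likely generate the observed sequence.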
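
Enumerating all possible sequences grows exponentially with their length, so in practice the most likely sequence is usually found by dynamic programming (the Viterbi algorithm). The following is a minimal sketch of that search on the f0 values from the previous slide. The Gaussian form of P(sound|gender) and every number in `EMISSION`, `P1M`, `PM` and `PF` are assumptions made for illustration; the slides do not give them.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def viterbi(f0, states, log_start, log_trans, emission):
    """Return the most likely state sequence for the observed f0 values."""
    # best[s] = log probability of the best path that ends in state s
    best = {s: log_start[s] + gaussian_logpdf(f0[0], *emission[s]) for s in states}
    backpointers = []
    for x in f0[1:]:
        new_best, pointer = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[r] + log_trans[r][s])
            new_best[s] = best[prev] + log_trans[prev][s] + gaussian_logpdf(x, *emission[s])
            pointer[s] = prev
        best = new_best
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]

# All values below are assumed for illustration only.
STATES = ("m", "f")
EMISSION = {"m": (130.0, 20.0), "f": (200.0, 30.0)}  # (mean, std) of f0 in Hz
P1M, PM, PF = 0.5, 0.6, 0.6                          # start and stay probabilities
LOG_START = {"m": math.log(P1M), "f": math.log(1 - P1M)}
LOG_TRANS = {
    "m": {"m": math.log(PM), "f": math.log(1 - PM)},
    "f": {"f": math.log(PF), "m": math.log(1 - PF)},
}

F0 = [195, 125, 140, 120, 185, 130, 145, 190, 245, 155, 130]
path = viterbi(F0, STATES, LOG_START, LOG_TRANS, EMISSION)
print("".join(path))
```

Working in log probabilities avoids numerical underflow on long sequences; the search cost grows linearly with the number of observations rather than exponentially.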

Getting the parameters (training of the model). f0 = 140 120 190 125 155 130 145 160 245 165 150 Hz (each person says "hi"). Alternate two steps: compute distributions of parameters for each state (boys, girls); find the best alignment of states given the parameters; then repeat both steps. This is "forced alignment" of the model with the data.
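
The alternation between alignment and re-estimation can be sketched as below. This is an illustrative version only: the Gaussian state distributions, the fixed stay probability `p_stay`, the starting guess `INITIAL` and the floor `min_std` are all assumptions, not values from the slides.

```python
import math

def log_gauss(x, mean, std):
    """Gaussian log density up to a constant shared by all states."""
    return -math.log(std) - (x - mean) ** 2 / (2 * std ** 2)

def align(f0, params, p_stay=0.6):
    """Best (Viterbi) alignment of the states to f0, with a uniform start."""
    states = list(params)
    stay, switch = math.log(p_stay), math.log(1 - p_stay)
    score = {s: log_gauss(f0[0], *params[s]) for s in states}
    back = []
    for x in f0[1:]:
        new, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: score[r] + (stay if r == s else switch))
            step = stay if prev == s else switch
            new[s] = score[prev] + step + log_gauss(x, *params[s])
            ptr[s] = prev
        score = new
        back.append(ptr)
    s = max(states, key=lambda r: score[r])
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    return path[::-1]

def reestimate(f0, path, old, min_std=5.0):
    """Mean and std of f0 for each state under the current alignment."""
    params = {}
    for s in old:
        xs = [x for x, label in zip(f0, path) if label == s]
        if not xs:                      # state received no data: keep old values
            params[s] = old[s]
            continue
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        params[s] = (mean, max(math.sqrt(var), min_std))
    return params

def train(f0, params, iterations=10):
    """Alternate re-estimation and alignment until the alignment stops changing."""
    path = align(f0, params)
    for _ in range(iterations):
        params = reestimate(f0, path, params)
        new_path = align(f0, params)
        if new_path == path:
            break
        path = new_path
    return params, path

F0 = [140, 120, 190, 125, 155, 130, 145, 160, 245, 165, 150]
INITIAL = {"m": (120.0, 30.0), "f": (220.0, 30.0)}   # rough assumed starting guess
params, path = train(F0, INITIAL)
print(params, "".join(path))
```

The hard assignment used here is the simplest variant; a soft (probabilistic) assignment of data to states is a common alternative.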

Machine recognition of speech. Finding boys and girls compared with speech recognition: people's parade <-> speech utterance; gender groups <-> speech sounds; voice pitch <-> vector of features derived from the signal; prior probabilities of gender occurrence <-> language model. Speech recognition also needs a more complex model architecture.

How to find w (efficiently)? Form of the model M(w_i)? What is the data x?
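
The slide text gives no formula, so the following is an assumption about the intended formulation: a common way to tie the three questions together is the Bayes decision rule, which picks the utterance w whose model M(w_i) is most probable given the data x.

```latex
\hat{w} = \arg\max_{i} P\bigl(M(w_i) \mid x\bigr)
        = \arg\max_{i} \frac{p\bigl(x \mid M(w_i)\bigr)\, P\bigl(M(w_i)\bigr)}{p(x)}
        = \arg\max_{i} p\bigl(x \mid M(w_i)\bigr)\, P\bigl(M(w_i)\bigr)
```

The denominator p(x) does not depend on i and can be dropped from the search.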

Data x? The speech signal? It describes changes in acoustic pressure; its original purpose is reconstruction of speech; it has a rather high bit rate; additional processing is necessary to alleviate the irrelevant information.

Machine for recognition of speech (diagram): speech signal -> pre-processing -> acoustic processing -> decoding (search) -> best matching utterance. Further inputs: acoustic training data, prior knowledge.

(Figure: axes time, frequency.)

(Figure: axes time, frequency.)

Short-term Spectrum: cut the signal into 10-20 ms segments along time and get the spectral components of each segment. (Figure: the sound sequence /j/ /u/ /ar/ /j/ /o/ /j/ /o/ shown over time and as time vs. frequency.)
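
As a rough illustration of short-term spectral analysis, the sketch below cuts a signal into 20 ms frames with a 10 ms step and computes the power spectrum of each frame. The sampling rate, frame step, Hamming window and test tone are assumptions chosen for the example; a plain DFT is used instead of an FFT to keep the code dependency-free.

```python
import cmath
import math

def frames(signal, frame_len, hop):
    """Cut the signal into overlapping frames of frame_len samples."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def power_spectrum(frame):
    """Apply a Hamming window and return |DFT|^2 for bins 0 .. N/2."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum

SAMPLE_RATE = 8000   # Hz (assumed)
FRAME_LEN = 160      # 20 ms at 8 kHz
HOP = 80             # 10 ms step (assumed)
TONE_HZ = 1000.0     # test tone (assumed)

# 100 ms of a pure tone as a stand-in for speech.
signal = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE)
          for t in range(SAMPLE_RATE // 10)]

# One power spectrum per frame: a spectrogram as a list of columns.
spectrogram = [power_spectrum(f) for f in frames(signal, FRAME_LEN, HOP)]
peak_bin = max(range(len(spectrogram[0])), key=lambda k: spectrogram[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(spectrogram), "frames, peak at", peak_hz, "Hz")
```

Stacking the per-frame spectra side by side gives the time-frequency picture that the following slides call a spectrogram.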

Spectrogram – 2D representation of sound

Short-term Fourier analysis (figure: log gain vs. frequency [rad/s], axis mark at π)

Spectral resolution of hearing: the spectral resolution of hearing decreases with frequency (critical bands of hearing, perception of pitch, …). (Figure: critical bandwidth [Hz] vs. frequency [Hz].)

Energies in “critical bands” (figure axis: frequency)
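
A simplified sketch of computing energies in critical bands follows. It maps each power-spectrum bin to the Bark scale with the warping function used in PLP analysis, 6·asinh(f/600), and sums the bins inside each 1-Bark-wide band. The rectangular, non-overlapping bands are an assumption made for brevity; real front ends typically use overlapping, shaped filters.

```python
import math

def hz_to_bark(f_hz):
    """Bark-scale warping as used in PLP analysis: 6 * asinh(f / 600)."""
    return 6.0 * math.asinh(f_hz / 600.0)

def critical_band_energies(power_spectrum, sample_rate):
    """Sum power-spectrum bins into rectangular bands, each 1 Bark wide."""
    n_bins = len(power_spectrum)          # bins cover 0 .. sample_rate / 2
    nyquist = sample_rate / 2
    n_bands = int(math.ceil(hz_to_bark(nyquist)))
    energies = [0.0] * n_bands
    for k, power in enumerate(power_spectrum):
        f_hz = k * nyquist / (n_bins - 1)
        band = min(int(hz_to_bark(f_hz)), n_bands - 1)
        energies[band] += power
    return energies

# A flat power spectrum (assumed test input): each band's energy then
# simply counts how many linear-frequency bins fall into that band.
flat = [1.0] * 129
energies = critical_band_energies(flat, sample_rate=8000)
print(energies)
```

With a flat input the low bands collect only a few bins while the high bands collect many, which mirrors the decreasing spectral resolution of hearing at higher frequencies.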

Sensitivity of hearing depends on frequency

intensity ≈ signal$^2$ [W/m$^2$]; loudness [sones]; loudness = intensity$^{0.33}$. Processing step: intensity (power spectrum) -> $|\cdot|^{0.33}$ -> loudness.
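
The power law can be tried out directly. The sketch below applies the 0.33 exponent from the slide to a power spectrum and computes how much the intensity must grow to double the loudness; the function names and sample values are illustrative.

```python
def loudness(intensity, exponent=0.33):
    """Power-law compression: loudness = intensity ** 0.33."""
    return intensity ** exponent

def compress(power_spectrum):
    """Apply the loudness compression to every spectral component."""
    return [loudness(p) for p in power_spectrum]

# Intensity ratio needed to double the loudness under this law.
ratio_for_double = 2 ** (1 / 0.33)

compressed = compress([1.0, 1000.0])
print(ratio_for_double, compressed)
```

Under this law the intensity has to rise roughly eightfold for the loudness to double, so the compression strongly reduces the dynamic range of the spectrum.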

Not all spectral details are important:
a) compute the Fourier transform of the logarithmic auditory spectrum and truncate it (Mel cepstrum)
b) approximate the auditory spectrum by an autoregressive model (Perceptual Linear Prediction, PLP)
(Figure: power (loudness) vs. frequency (tonality) for a 6th order AR model and a 14th order AR model.)
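
Option a) can be sketched as follows. For a real, symmetric spectrum the Fourier transform reduces to a cosine transform, so the example uses a DCT-II of the log band energies, keeps only the first few coefficients, and inverts them to show the smoothed spectrum that survives. The band energies and the number of kept coefficients are made-up values for illustration.

```python
import math

def truncated_cepstrum(log_spectrum, n_coeffs):
    """First n_coeffs DCT-II coefficients of the log spectrum."""
    n = len(log_spectrum)
    return [sum(v * math.cos(math.pi * q * (k + 0.5) / n)
                for k, v in enumerate(log_spectrum))
            for q in range(n_coeffs)]

def smoothed_log_spectrum(cepstrum, n_bands):
    """Invert the (possibly truncated) cepstrum back to a log spectrum."""
    return [cepstrum[0] / n_bands
            + (2.0 / n_bands) * sum(c * math.cos(math.pi * q * (k + 0.5) / n_bands)
                                    for q, c in enumerate(cepstrum) if q > 0)
            for k in range(n_bands)]

# Made-up auditory spectrum with eight bands.
band_energies = [1.0, 4.0, 9.0, 4.0, 1.0, 2.0, 8.0, 2.0]
log_spectrum = [math.log(e) for e in band_energies]

cepstrum = truncated_cepstrum(log_spectrum, 3)        # keep 3 coefficients
smoothed = smoothed_log_spectrum(cepstrum, len(log_spectrum))
print(cepstrum)
print(smoothed)
```

Keeping all coefficients reproduces the log spectrum exactly; dropping the higher ones removes fine spectral detail while preserving the overall shape.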

Current state-of-the-art speech recognizers typically use high model order PLP

It’s about time (to talk about TIME)

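Masking in Time: a masker raises the threshold for a signal that is separated from it in time by t; the increase in threshold fades out by about t = 200 ms, and a stronger masker gives a larger increase. This suggests a ~200 ms buffer (critical interval) in the auditory system.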

Time trajectories of the spectrum in critical bands of hearing: filter with time constant > 200 ms (temporal buffer > 200 ms)
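
One well-known filter of this kind is the RASTA band-pass filter, which the following slides refer to as RASTA-PLP. The sketch below filters a single critical-band time trajectory. The coefficients are the ones commonly quoted for the RASTA filter and the 10 ms frame step is assumed; neither is stated on the slides, so treat both as assumptions.

```python
def rasta_filter(trajectory, pole=0.98):
    """Band-pass filter one critical-band time trajectory (one value per frame).

    Transfer function (assumed, as commonly quoted for RASTA):
        H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - pole * z^-1)
    """
    numerator = [0.2, 0.1, 0.0, -0.1, -0.2]
    out = []
    for n in range(len(trajectory)):
        # FIR part: a smoothed slope estimate over five frames.
        acc = sum(b * trajectory[n - i]
                  for i, b in enumerate(numerator) if n - i >= 0)
        # IIR part: leaky integration of the slope.
        if n > 0:
            acc += pole * out[n - 1]
        out.append(acc)
    return out

# A constant trajectory stands in for a stationary channel effect.
constant = [1.0] * 600
filtered = rasta_filter(constant)
print(filtered[3], filtered[-1])
```

Because the numerator coefficients sum to zero, a constant component in the trajectory is suppressed after the initial transient. With a 10 ms frame step, a pole at 0.98 corresponds to a decay time constant of roughly half a second.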

(Figure panels over time [s]: spectrogram (short-term Fourier spectrum); Perceptual Linear Prediction (PLP) (12th order model); RASTA-PLP.)

(Figure: spectrogram -> filter -> spectrum from RASTA-PLP.)

Data-guided feature extraction: data -> preprocessing -> artificial neural network trained on large amounts of labeled data. Input: spectrogram (time vs. frequency). Output: posteriogram (time vs. speech sounds, e.g. /f/ /ay/ /v/).

Signal components inside the critical time-frequency window interact. Masking in frequency: the threshold of perception of the target increases with the noise bandwidth up to the critical bandwidth; what happens outside the critical band does not affect decoding of the sound in the critical band. Masking in time: a stronger masker gives a larger increase in the threshold of perception of the target, within t of about 200 ms; what happens outside the critical interval does not affect detection of the signal within the critical interval.

Emulation of cortical processing (MRASTA). Peripheral processing (critical-band spectral analysis) gives a time-frequency representation; around each time t0, 32 2-D projections with variable resolutions produce data1, data2, …, dataN. 16 x 14 bands = 448 projections.

Multi-resolution RASTA (MRASTA) (Interspeech 05). Spectro-temporal basis formed by outer products of time responses (shown from -500 to 500 ms) and frequency responses (central band, or frequency derivative over 3 critical bands). Bank of 2-D (time-frequency) filters (band-pass in time, high-pass in frequency). RASTA-like: alleviates stationary components. Multi-resolution in time.

Spectral dynamics (much) more interesting than spectral shape. Old way of getting spectral dynamics: short-term spectral components at each time t0. Older way of getting spectral dynamics (Spectrograph™): spectral components over time at each frequency f0.

(Figure panels, frequency vs. time: critical-band spectrum from all-pole models of the auditory-like spectrum (PLP); critical-band spectrum from all-pole models of temporal envelopes of the auditory-like spectrum (FDPLP).)

Reverberant speech. Digit recognition accuracy [%], ICSI Meeting Room Digit Corpus:
         clean   reverberated
PLP      99.7    71.6
FDPLP    99.2    87.0
Improvements on real reverberations are similar (IEEE Signal Proc. Letters 08).
Telephone speech. Phoneme recognition accuracy [%]:
              TIMIT   HTIMIT
PLP-MRASTA    67.6    47.8
FDPLP         68.1    53.5
(Figure labels: gain included, gain excluded.)

FDPLP with static and dynamic compression. Recognition accuracy [%] on TIMIT, HTIMIT, CTS and NIST RT05 meeting tasks:
          PLP     FDPLP
TIMIT     64.9    65.4
HTIMIT    34.4    52.7
CTS       52.3    59.3
RT05      60.4    64.1
(Figure labels: Hilbert envelope; FDLP fit to Hilbert envelope; logarithmically compressed FDLP fit; FDLP fit compressed by PEMO model.)

TANDEM (Hermansky et al., ICASSP 2000). Features for a conventional speech recognizer should be normally distributed and uncorrelated. Processing chain: posteriors of speech sounds -> pre-softmax outputs -> principal component projection -> to HMM (Gaussian mixture based) classifier. (Figures: histogram of one element; correlation matrix of features.)

Summary. Alternatives to short-term spectrum based attributes could be beneficial: data-driven, phoneme posterior based features (extract speech-specific knowledge from large out-of-domain corpora); larger temporal spans (exploit coarticulation patterns of individual speech sounds); models of temporal trajectories (improved modeling of fine temporal details, which allows for partial alleviation of channel distortions and reverberation effects).

Coarticulation (diagram): "hello world" -> human speech production -> coarticulation -> human auditory perception -> "hello world"

Hierarchical bottom-up event-based recognition? High bit rate signal -> pre-processing to emulate known properties of peripheral and cortical auditory processes -> equally distributed posterior probabilities of speech sounds -> unequally distributed identities of individual speech sounds (phonemes) -> low bit rate.

One way of going from phoneme posteriors to phonemes: matched filtering of the posterior probability trajectories (example: /n/ /ay/)
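
A minimal sketch of the matched-filtering idea: the posterior trajectory of one speech sound is correlated with a template of its typical shape, and peaks above a threshold are reported as occurrences of that sound. The template, the threshold and the posterior values are invented for illustration; the slides do not specify them.

```python
def matched_filter(posteriors, template):
    """Correlate a posterior trajectory with a template centered on each frame."""
    half = len(template) // 2
    out = []
    for t in range(len(posteriors)):
        acc = 0.0
        for j, weight in enumerate(template):
            idx = t + j - half
            if 0 <= idx < len(posteriors):
                acc += weight * posteriors[idx]
        out.append(acc)
    return out

def pick_peaks(values, threshold):
    """Frame indices of local maxima that exceed the threshold."""
    return [t for t in range(1, len(values) - 1)
            if values[t] > threshold
            and values[t] >= values[t - 1]
            and values[t] > values[t + 1]]

# Invented per-frame posterior of one sound, with a brief false spike at frame 9.
posterior_n = [0.1, 0.0, 0.2, 0.1, 0.6, 0.9, 0.8, 0.1, 0.0, 0.3, 0.0, 0.1]
template = [0.25, 0.5, 0.25]      # assumed typical shape of a short phoneme

response = matched_filter(posterior_n, template)
events = pick_peaks(response, threshold=0.5)
print(events)
```

The filter rewards posterior bumps that last about as long as the template and suppresses isolated spikes, so each detected peak can be read as one phoneme event.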

(some of) the Issues. Chain: SPEECH SIGNAL (high bit-rate) -> auditory perception -> acoustic “events” ??? -> cognitive processes ??? -> linguistic code (low bit rate) -> message (even lower bit rate). Perceptual processes involved in decoding of the message in speech? Where and how? Higher levels (cortical) are probably most relevant. Acoustic “events” for speech? Cognitive issues: what to “listen for”? Roles of “bottom-up” and “top-down” channels? Coding alphabet (phonemes)? Category-forming invariants? When to make a decision? Etc. ????????????????

Speculation: improvements in acoustic processing could make domain-independent ASR feasible