8-Speech Recognition  Speech Recognition Concepts  Speech Recognition Approaches  Recognition Theories  Bayes Rule  Simple Language Model  P(A|W)


8-Speech Recognition  Speech Recognition Concepts  Speech Recognition Approaches  Recognition Theories  Bayes Rule  Simple Language Model  P(A|W)  Network Types

7-Speech Recognition (Cont’d)  HMM Calculating Approaches  Neural Components  Three Basic HMM Problems  Viterbi Algorithm  State Duration Modeling  Training in HMM

Recognition Tasks  Isolated Word Recognition (IWR), Connected Word (CW), and Continuous Speech Recognition (CSR)  Speaker Dependent, Multiple Speaker, and Speaker Independent  Vocabulary Size: Small <20; Medium 100–1000; Large 1000–10000; Very Large >10000

Speech Recognition Concepts  Speech understanding combines speech recognition (speech → phone sequence) with NLP (phone sequence → text and meaning); speech synthesis combines NLP (text → phone sequence) with speech processing (phone sequence → speech)  Speech recognition is the inverse of speech synthesis

Speech Recognition Approaches  Bottom-Up Approach  Top-Down Approach  Blackboard Approach 5

Bottom-Up Approach  Pipeline: Signal Processing → Feature Extraction → Segmentation → Sound Classification → Lexical Access → Recognized Utterance  Knowledge sources applied along the way: Voiced/Unvoiced/Silence decision, Sound Classification Rules, Phonotactic Rules, Lexical Access, Language Model

Top-Down Approach  Pipeline: Feature Analysis → Unit Matching System → Lexical Hypothesis → Syntactic Hypothesis → Semantic Hypothesis → Utterance Verifier/Matcher → Recognized Utterance  Knowledge sources: Inventory of speech recognition units, Word Dictionary, Grammar, Task Model

Blackboard Approach  Independent knowledge processes communicate through a shared blackboard: Environmental, Acoustic, Lexical, Syntactic, and Semantic Processes

Recognition Theories  Articulatory Based Recognition: uses the articulatory system for recognition; so far the most successful theory  Auditory Based Recognition: uses the auditory system for recognition  Hybrid Based Recognition: a hybrid of the above theories  Motor Theory: models the intended gestures of the speaker

Recognition Problem  We have a sequence of acoustic symbols and want to find the words expressed by the speaker  Solution: find the most probable word sequence given the acoustic symbols

Recognition Problem  A: acoustic symbols  W: word sequence  We should find Ŵ so that Ŵ = argmax_W P(W|A)

Bayes Rule  P(W|A) = P(A|W) P(W) / P(A)

Bayes Rule (Cont’d)  Since P(A) does not depend on W:  Ŵ = argmax_W P(A|W) P(W)

Simple Language Model  P(W) = P(w1, w2, …, wn)  Computing this probability directly is very difficult and needs a very large database, so trigram and bigram models are used instead.

Simple Language Model (Cont’d)  Trigram: P(W) ≈ ∏ P(wi | wi-2, wi-1)  Bigram: P(W) ≈ ∏ P(wi | wi-1)  Monogram: P(W) ≈ ∏ P(wi)

Simple Language Model (Cont’d)  Computing method: P(w3 | w1, w2) = C(w1 w2 w3) / C(w1 w2), i.e., the number of times w3 follows w1 w2 divided by the total number of occurrences of w1 w2  Ad hoc method: when counts are sparse, smooth the estimate by combining trigram, bigram, and monogram counts
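The counting scheme above can be sketched in a few lines. This is an illustrative maximum-likelihood bigram estimator (the toy corpus and function names are hypothetical, and no smoothing is applied):

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def p_bigram(unigrams, bigrams, w1, w2):
    """P(w2 | w1) = C(w1 w2) / C(w1), the relative-frequency estimate."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

corpus = [["i", "like", "speech"], ["i", "like", "recognition"]]
uni, bi = train_bigram(corpus)
print(p_bigram(uni, bi, "i", "like"))       # C(i like)/C(i) = 2/2 = 1.0
print(p_bigram(uni, bi, "like", "speech"))  # C(like speech)/C(like) = 1/2 = 0.5
```

In practice the raw counts are smoothed (the "ad hoc" combination above) so that unseen n-grams do not receive zero probability.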

Error Production Factors  Prosody (recognition should be prosody independent)  Noise (noise should be suppressed)  Spontaneous speech

P(A|W) Computing Approaches  Dynamic Time Warping (DTW)  Hidden Markov Model (HMM)  Artificial Neural Network (ANN)  Hybrid Systems

Dynamic Time Warping

Search Limitations:  Start & end point constraints  Global constraints  Local constraints

Dynamic Time Warping  Global limitation: constrains the warping path to a region around the diagonal

Dynamic Time Warping  Local limitation: restricts the step transitions allowed at each point of the path
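As a sketch of the idea, a minimal DTW distance with the basic symmetric local constraint (match, insertion, deletion) might look like this; the distance function and test sequences are illustrative only:

```python
import math

def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """Accumulated DTW cost between sequences x and y.

    D[i][j] = dist(x[i-1], y[j-1]) + min(D[i-1][j], D[i][j-1], D[i-1][j-1])
    """
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = dist(x[i - 1], y[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0 -- the repeated 2 is absorbed by warping
```

Global and local limitations would prune the (i, j) cells visited; the full table is searched here for clarity.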

Artificial Neural Network  Simple computational element of a neural network

Artificial Neural Network (Cont’d)  Neural Network Types: Perceptron, Time Delay Neural Network (TDNN)

Artificial Neural Network (Cont’d) Single Layer Perceptron

Artificial Neural Network (Cont’d) Three Layer Perceptron
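A single-layer perceptron of the kind shown above can be trained with the classic error-correction rule. This sketch uses a step activation, a bias weight, and a hypothetical toy task (the AND gate, which is linearly separable):

```python
def predict(w, b, x):
    """Step activation on the weighted sum plus bias."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron learning rule: w += lr * (target - output) * x."""
    w, b = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        for x, target in samples:
            err = target - predict(w, b, x)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# AND gate: output 1 only when both inputs are 1
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
print([predict(w, b, x) for x, _ in data])  # [0, 0, 0, 1]
```

A multi-layer (e.g., three-layer) perceptron is needed for tasks that are not linearly separable.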

Neural Network Topologies

TDNN

Neural Network Structures for Speech Recognition

Hybrid Methods  Hybrid Neural Network and Matched Filter for Recognition  Diagram: Speech → Acoustic Features → Delays → Pattern Classifier → Output Units

Neural Network Properties  The system is simple, but training requires many iterations  Does not require a predetermined structure  Despite its simplicity, the results are good  The training set is large, so training should be done offline  Accuracy is relatively good

Pre-processing  Different preprocessing techniques are employed as the front end of speech recognition systems  The choice of preprocessing method depends on the task, the noise level, the modeling tool, etc.

The MFCC Method  MFCC is based on how the human ear perceives sounds  MFCC performs better than other features in noisy environments  MFCC was originally proposed for speech recognition applications, but it also performs well for speaker recognition  The auditory unit of the human ear is the mel, obtained from the relation mel(f) = 2595 · log10(1 + f/700)
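The standard mel mapping cited above can be written directly; a small sketch (function names are illustrative):

```python
import math

def hz_to_mel(f):
    """mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of the mapping above."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(round(hz_to_mel(1000.0)))  # 1000: a 1 kHz tone sits at about 1000 mel
```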

MFCC Steps  Step 1: map the signal from the time domain to the frequency domain using the short-time FFT.  Z(n): the speech signal; W(n): a window function such as the Hamming window; W_F = e^(-j2π/F); m = 0, …, F-1; F: the speech frame length.

MFCC Steps  Step 2: find the energy of each filter-bank channel.  M is the number of mel-scale filter banks; the weighting functions are the filter-bank transfer functions.

Mel-Scale Filter Distribution

MFCC Steps  Step 4: compress the spectrum and apply the DCT to obtain the MFCC coefficients.  In the relation above, n = 0, …, L is the order of the MFCC coefficients.

The Mel-Cepstrum Method  Time signal → framing → |FFT|² → mel-scaling → logarithm → IDCT → low-order coefficients (cepstra) → differentiator → delta & delta-delta cepstra

Mel-Cepstrum Coefficients (MFCC)

MFCC Properties  Maps the mel filter-bank energies onto the directions of maximum variance (using the DCT)  Makes the speech features approximately, though not fully, independent of one another (an effect of the DCT)  Good performance in clean environments  Reduced performance in noisy environments

Time-Frequency Analysis  Short-Time Fourier Transform: the standard way of frequency analysis; decompose the incoming signal into its constituent frequency components.  W(n): windowing function; N: frame length; p: step size
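A minimal magnitude STFT along these lines (Hamming window; the frame length, step size, and test tone are illustrative choices):

```python
import numpy as np

def stft_mag(signal, frame_len=256, step=128):
    """Window successive frames and return the magnitude spectrum of each."""
    win = np.hamming(frame_len)
    frames = [signal[i:i + frame_len] * win
              for i in range(0, len(signal) - frame_len + 1, step)]
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 8000.0
t = np.arange(2048) / fs
x = np.sin(2 * np.pi * 1000.0 * t)      # 1 kHz tone
S = stft_mag(x)
# bin spacing = fs / frame_len = 31.25 Hz, so the peak falls in bin 1000/31.25 = 32
print(S.shape, int(S[0].argmax()))
```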

Critical Band Integration  Related to the masking phenomenon: the threshold of a sinusoid is elevated when its frequency is close to the center frequency of a narrow-band noise  Frequency components within a critical band are not resolved; the auditory system interprets the signals within a critical band as a whole

Bark Scale

Feature Orthogonalization  Spectral values in adjacent frequency channels are highly correlated  The correlation results in a Gaussian model with many parameters: all the elements of the covariance matrix have to be estimated  Decorrelation is useful to improve the parameter estimation

Cepstrum  Computed as the inverse Fourier transform of the log magnitude of the Fourier transform of the signal  The log magnitude is real and symmetric, so the transform is equivalent to the Discrete Cosine Transform  Approximately decorrelated
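The definition above translates directly into code. A sketch of the real cepstrum of one frame (the small floor added before the log is an implementation detail to avoid log(0)):

```python
import numpy as np

def real_cepstrum(frame):
    """IFFT of the log magnitude of the FFT of the frame."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-12)
    return np.fft.ifft(log_mag).real

rng = np.random.default_rng(0)
frame = rng.standard_normal(64)
c = real_cepstrum(frame)
# the log magnitude of a real signal is even, so the cepstrum is real and even
print(abs(c[1] - c[-1]) < 1e-9)  # True
```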

Principal Component Analysis (PCA)  A mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components (PCs)  Finds an orthogonal basis such that the reconstruction error over the training set is minimized  This turns out to be equivalent to diagonalizing the sample autocovariance matrix  Complete decorrelation  Computes the principal dimensions of variability, but does not necessarily provide the optimal discrimination among classes

PCA (Cont.)  Algorithm: Input = N-dim vectors → compute the covariance matrix → find its eigenvalues and eigenvectors → build the transform matrix from the top R eigenvectors → apply the transform → Output = R-dim vectors
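The algorithm above, sketched with NumPy (the synthetic rank-2 data is illustrative; `eigh` returns eigenvalues in ascending order, so the columns are reversed to take the top R):

```python
import numpy as np

def pca(X, r):
    """Project the N-dim rows of X onto the top-r principal components."""
    Xc = X - X.mean(axis=0)                   # center the data
    cov = Xc.T @ Xc / (len(X) - 1)            # sample covariance matrix
    vals, vecs = np.linalg.eigh(cov)          # ascending eigenvalues
    T = vecs[:, ::-1][:, :r]                  # top-r eigenvectors as columns
    return Xc @ T, T

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 2)) @ rng.standard_normal((2, 3))  # 3-D data on a 2-D plane
Y, T = pca(A, 2)
# rank-2 data is reconstructed exactly from its top 2 components
print(np.allclose(Y @ T.T, A - A.mean(axis=0)))  # True
```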

PCA (Cont.)  PCA in speech recognition systems

Linear Discriminant Analysis  Finds an orthogonal basis such that the ratio of between-class variance to within-class variance is maximized  This also turns out to be a generalized eigenvalue-eigenvector problem  Complete decorrelation  Provides optimal linear separability under quite restrictive assumptions

PCA vs. LDA

Spectral Smoothing  Formant information is crucial for recognition  Enhance and preserve the formant information by: truncating the number of cepstral coefficients; linear prediction (peak-hugging property)

Temporal Processing  To capture the temporal features of the spectral envelope and to provide robustness:  Delta features: first- and second-order differences; regression  Cepstral Mean Subtraction: normalizes channel effects and adjusts for spectral slope
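Both operations are one-liners over a frames × coefficients matrix; a sketch (the choice of repeating edge frames as padding is an assumption):

```python
import numpy as np

def cms(C):
    """Cepstral mean subtraction: remove the per-coefficient utterance mean."""
    return C - C.mean(axis=0)

def delta(C):
    """First-order central difference, with edge frames repeated as padding."""
    padded = np.vstack([C[:1], C, C[-1:]])
    return (padded[2:] - padded[:-2]) / 2.0

C = np.arange(12.0).reshape(4, 3)   # 4 frames of 3 coefficients
print(cms(C).mean(axis=0))          # ~[0. 0. 0.]
print(delta(C)[1])                  # interior slope of the ramp: [3. 3. 3.]
```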

RASTA (RelAtive SpecTral Analysis)  Filtering of the temporal trajectories of some function of each of the spectral values, to provide more reliable spectral features  Usually a bandpass filter, maintaining the linguistically important spectral envelope modulations (1–16 Hz)

RASTA-PLP

Language Models for LVCSR  Word Pair Model: specify which word pairs are valid

Statistical Language Modeling

Perplexity of the Language Model  Entropy of the source: H = -Σ_W P(W) log2 P(W)  First-order entropy of the source: H = -Σ_w P(w) log2 P(w)  If the source is ergodic, meaning its statistical properties can be completely characterized from a sufficiently long sequence that the source puts out: H = -lim_{Q→∞} (1/Q) log2 P(w1, w2, …, wQ)

We often compute H based on a finite but sufficiently large Q: H ≈ -(1/Q) log2 P(w1, w2, …, wQ)  H is the average degree of difficulty that the recognizer encounters when it is to determine a word from the same source  If an N-gram language model P_N(W) is used, an estimate of H is: Ĥ = -(1/Q) log2 P_N(w1, w2, …, wQ)  Perplexity is defined as: PP = 2^H
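The estimate above can be computed from the per-word model probabilities along a test sequence; a sketch with base-2 logs and toy probabilities:

```python
import math

def perplexity(word_probs):
    """PP = 2^H with H = -(1/Q) * sum(log2 P(w_i | history)) over Q words."""
    H = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2.0 ** H

# a model that always chooses uniformly among 8 words has perplexity 8
print(perplexity([1 / 8] * 10))  # 8.0
```

Intuitively, perplexity is the average branching factor the recognizer faces at each word.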

Overall recognition system based on subword units