
1 Dr. Babasaheb Ambedkar Marathwada University, Aurangabad
Department of Computer Science & IT, Dr. B.A.M.U., Aurangabad

2 Content
What is Speech Recognition?
How Humans Produce Speech
Application Areas
What You Can Do with Speech Recognition
Applications Related to Speech Recognition
Steps of Signal Processing
Speech Recognition: 1. Analysis 2. Feature Extraction 3. Modelling 4. Matching
Database Statistics of the Department

3 What is Speech Recognition?
Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, into a set of words. The recognised words can be an end in themselves, as in applications such as command and control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding.

4 How Humans Produce Speech:

5 Application Areas
Education
Domestic
Military
Robotics
Medical
Physically Handicapped

6 What You Can Do with Speech Recognition:
Transcription
Command and control
Information access
Problem solving
Call Centers

7 Cont.
Transcription
- Dictation, information retrieval
- Medical reports, court proceedings, notes
- Indexing (e.g., broadcasts)
Command and control
- Data entry, device control, navigation, call routing
Information access
- Airline schedules, stock quotes, directory assistance

8 Cont.
Problem solving
- Travel planning, logistics
Call Centers
- Automate services, lower payroll
- Shorten time on hold
- Shorten agent and client call time
- Reduce fraud
- Improve customer service

9 Applications Related to Speech Recognition:
Speech Recognition
- Figure out what a person is saying.
Speaker Verification
- Authenticate that a person is who she/he claims to be.
- Limited speech patterns
Speaker Identification
- Assign an identity to the voice of an unknown person.
- Arbitrary speech patterns

10 Steps of Signal Processing
Digitization: converting the analogue signal into a digital representation
Signal processing: separating speech from background noise
Phonetics: variability in human speech
Pragmatics: filtering of performance errors (disfluencies)
Syntax and pragmatics: interpreting prosodic features

11 Digitization
Analogue-to-digital conversion: sampling and quantizing (a sketch follows this slide)
- Use filters to measure energy levels at various points on the frequency spectrum
- Knowing the relative importance of different frequency bands (for speech) makes this process more efficient
Separating speech from background noise
- Noise-cancelling microphones: two mics, one facing the speaker, the other facing away; ambient noise is roughly the same for both mics
- Knowing which bits of the signal relate to speech
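Below is a minimal sketch of the sampling-and-quantizing step, assuming numpy; the 16 kHz rate, the 440 Hz test tone, and all names are illustrative choices, not from the slides.

    import numpy as np

    sample_rate = 16000                       # 16 kHz is a common rate for speech
    t = np.arange(0, 1.0, 1.0 / sample_rate)  # sampling instants over one second

    # Stand-in for the analogue waveform arriving at the microphone
    analogue = 0.5 * np.sin(2 * np.pi * 440.0 * t)

    # Quantize each sample to a 16-bit signed integer (linear PCM)
    quantized = np.round(analogue * 32767).astype(np.int16)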

12 Interpreting Prosodic Features
Identifying phonemes
- Differences between some phonemes are sometimes very small
- May be reflected in the speech signal
- Often show up in articulation effects (transition to the next sound)
Interpreting prosodic features
- Pitch, length and loudness are used to indicate “stress”

13 Performance Errors
Performance “errors” include:
- Non-speech sounds
- Hesitations
- False starts, repetitions
Filtering implies handling at the syntactic level or above. Some disfluencies are deliberate and have a pragmatic effect; this is not something we can handle in the near future.

14 Speech Recognition 1. Analysis 2. Feature Extraction 3. Modelling 4. Matching

15 Speech Recognition Techniques

16 Speech Recognition 1. Analysis 2. Feature Extraction 3. Modelling 4. Matching

17 1. Analysis
The first stage is analysis. When the speaker speaks, the speech carries different types of information that help to identify the speaker. This information differs because of the vocal tract, the excitation source, and behavioural features. Analysis is carried out at three levels:
a. Segmentation Analysis
b. Sub-segmental Analysis
c. Supra-segmental Analysis

18 a. Segmentation Analysis:
In segmentation analysis, speaker information is extracted using a frame size and frame shift in the range of 10 to 30 milliseconds (ms).
b. Sub-segmental Analysis:
In this technique, speaker information is extracted using a frame size and frame shift in the range of 3 to 5 milliseconds (ms). The features of the excitation source are analyzed and extracted with this technique.

19 c. Supra-segmental Analysis:
In supra-segmental analysis, the behavioural features of the speaker are extracted using a frame size and frame shift in the range of 50 to 200 milliseconds. A framing sketch common to all three analyses follows this slide.
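The only difference between the three analyses above is the frame size and shift, so a single frame-blocking routine covers all of them. This is a minimal sketch assuming numpy; the function name and the test signal are illustrative.

    import numpy as np

    def frame_signal(signal, sample_rate, frame_ms, shift_ms):
        """Split a 1-D signal into overlapping frames of frame_ms with a hop of shift_ms."""
        frame_len = int(sample_rate * frame_ms / 1000)
        hop = int(sample_rate * shift_ms / 1000)
        n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
        return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

    # Segmentation analysis: ~10-30 ms frames; sub-segmental: ~3-5 ms; supra-segmental: ~50-200 ms
    frames = frame_signal(np.random.randn(16000), 16000, frame_ms=25, shift_ms=10)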

20 Speech Recognition 1. Analysis 2. Feature Extraction 3. Modelling 4. Matching

21 2. Feature Extraction
Feature extraction is considered the heart of the system. Its job is to extract those features from the input speech signal that help the system identify the speaker. Feature extraction reduces the size of the input vector without harming the discriminative power of the speech signal. There are many feature extraction techniques; some of them are listed in the table below.

22 Fig: Feature Extraction Diagram

23 Table: Some Feature Extraction Techniques
1. Principal Component Analysis (PCA): Eigenvector-based, linear feature extraction method. Supports a linear map. Faster than other techniques. Good for Gaussian data.
2. Linear Discriminant Analysis (LDA): Linear feature extraction method. Supports a supervised linear map. Faster than other techniques; better than PCA for classification.
3. Independent Component Analysis (ICA): Blind source separation method. Supports a linear map. Iterative in nature. Good for non-Gaussian data.

24 Table (continued)
4. Linear Predictive Coding (LPC): Static feature extraction method. Used for feature extraction at lower-order coefficients.
5. Cepstral Analysis: Static feature extraction method. Power spectrum method. Used to represent the spectral envelope.
6. Mel-Frequency Scale Analysis: Spectral analysis method. The mel scale is calculated.
7. Filter Bank Analysis: Filters tuned to the required frequencies. Used for filter-based feature extraction.

25 Table (continued)
8. Mel-Frequency Cepstrum Coefficients (MFCCs): Power spectrum computed by Fourier analysis. Robust and dynamic method for speech feature extraction.
9. Kernel-Based Feature Extraction: Nonlinear transformation method.
10. Wavelet Technique: Better time resolution than the Fourier transform. Real-time factor is minimal.
11. Dynamic Feature Extraction (LPC, MFCCs): Acceleration and delta coefficients; second- and third-order derivatives of the normal LPC and MFCC coefficients.

26 Table (continued)
12. Spectral Subtraction: Robust feature extraction method.
13. Cepstral Mean Subtraction: Robust feature extraction method for small-vocabulary systems.
14. RASTA Filtering: Used for noisy speech recognition.
15. Integrated Phoneme Subspace Method (compound method): A transformation based on PCA + LDA + ICA. Gives higher accuracy than the existing methods.

27 Feature Extraction Techniques
1. Linear Predictive Coding
2. Mel-frequency Cepstrum
3. RASTA Filtering
4. Probabilistic Linear Discriminant Analysis

28 1. Linear Predictive Coding
LPC is based on an assumption: in a series of speech samples, the nth sample can be predicted as a weighted sum of the previous k samples of the signal. An inverse filter is constructed so that it corresponds to the formant regions of the speech samples; applying these filters to the samples is the LPC process.
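A minimal sketch of this prediction idea, assuming numpy: the weights are fit by least squares, and the prediction error is the residual the next slide refers to. The function name and test signal are illustrative.

    import numpy as np

    def lpc_fit(signal, order):
        """Fit a_1..a_k so that s[n] is approximated by sum_i a_i * s[n-i]."""
        # Each row holds the `order` samples preceding the sample being predicted
        X = np.array([signal[n - order:n][::-1] for n in range(order, len(signal))])
        y = signal[order:]
        a, *_ = np.linalg.lstsq(X, y, rcond=None)
        residual = y - X @ a          # prediction error ("residual sound")
        return a, residual

    coeffs, residual = lpc_fit(np.sin(2 * np.pi * 0.05 * np.arange(400)), order=10)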

29 Technique Characteristics: Linear Predictive Coding
- Provides autoregression-based speech features
- Is a formant estimation technique
- A static technique
- The residual sound is very close to the vocal tract input signal

30 Advantages and Disadvantages
Advantages:
- A reliable, accurate and robust technique for providing parameters that describe the time-varying linear system representing the vocal tract.
- Computation speed is good, and it provides accurate speech parameters.
- Useful for encoding speech at a low bit rate.
Disadvantages:
- Not able to distinguish words with similar vowel sounds.
- Cannot fully represent speech because of the assumption that signals are stationary, and hence cannot analyze local events accurately.
- LPC generates a residual error as output, meaning some important speech information is left in the residue, resulting in poor speech quality.

31 2. Mel-frequency Cepstrum
The main purpose of the MFCC processor is to mimic the behavior of the human ear. MFCCs are derived by the steps shown below.
Fig: MFCCs Derivation

32 Technique Characteristics: Mel-Frequency Cepstrum (MFCC)
- Used for speech processing tasks
- Mimics the human auditory system
- Mel frequency scale: linear frequency spacing below 1000 Hz and log spacing above 1000 Hz (see the sketch after this slide)
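A minimal sketch of the mel scale and of MFCC extraction. The hz_to_mel formula is the commonly cited one; the librosa calls are real, but the file name "speech.wav" and the parameter values are illustrative assumptions.

    import numpy as np
    import librosa

    def hz_to_mel(f):
        # Common mel-scale formula: roughly linear below 1 kHz, logarithmic above
        return 2595.0 * np.log10(1.0 + f / 700.0)

    y, sr = librosa.load("speech.wav", sr=16000)          # hypothetical input file
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 coefficients per frame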

33 Advantages and Disadvantages
Advantages:
- Recognition accuracy is high, i.e., the performance rate of MFCC is high.
- MFCC captures the main characteristics of phones in speech.
- Low complexity.
Disadvantages:
- In background noise, MFCC does not give accurate results.
- The filter bandwidth is not an independent design parameter.
- Performance may be affected by the number of filters.

34 3. RASTA Filtering
RASTA is short for RelAtive SpecTrAl.
It is a technique used to enhance speech recorded in a noisy environment. In RASTA, the time trajectories of the spectral representations of the speech signal are band-pass filtered. Initially it was used only to lessen the impact of noise on the speech signal, but now it is also used to enhance the signal directly.
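A minimal sketch of this band-pass filtering, assuming scipy and numpy. The coefficients below are the classic RASTA filter from Hermansky and Morgan's formulation; the trajectory here is a synthetic stand-in for a real log-spectral band.

    import numpy as np
    from scipy.signal import lfilter

    # Classic RASTA IIR band-pass filter: numerator 0.1 * [2, 1, 0, -1, -2], pole at 0.98
    numerator = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    denominator = np.array([1.0, -0.98])

    # Stand-in for the time trajectory of one log-spectral band
    trajectory = np.log(np.abs(np.random.randn(200)) + 1.0)
    filtered = lfilter(numerator, denominator, trajectory)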

35 Fig: Process of RASTA Technique

36 Technique Characteristics: RelAtive SpecTrAl (RASTA) Filtering
- A band-pass filtering technique
- Designed to lessen the impact of noise as well as enhance speech
- Widely used for speech signals that have background noise (noisy speech)

37 Advantages and Disadvantages
Advantages:
- Removes slow-varying environmental variations as well as fast variations in artefacts.
- Does not depend on the choice of microphone or its position relative to the mouth, and hence is robust.
- Captures the low-modulation frequencies that correspond to speech.
- RASTA combined with PLP gives a better performance ratio.
Disadvantages:
- Causes a minor degradation in performance on clean data, although it also cuts the error in half for the filtered case.

38 4. Probabilistic Linear Discriminant Analysis
This technique is an extension of linear discriminant analysis (LDA). Initially it was used for face recognition, but it is now also used for speech recognition.

39 Technique Characteristics: Probabilistic Linear Discriminant Analysis (PLDA)
- Based on i-vector extraction: the i-vector is an information-rich, low-dimensional vector of fixed length
- Uses the state-dependent variables of an HMM
- PLDA is formulated as a generative model

40 Advantages and Disadvantages
Advantages:
- A flexible acoustic model that makes use of a variable number of interrelated input frames without any need for covariance modelling.
- High recognition accuracy.
Disadvantages:
- The Gaussian assumption on the class-conditional distributions is just an assumption and does not actually hold.
- The generative model is also a disadvantage: its objective is to fit the data, which does not take class discrimination into account.

41 Speech Recognition 1. Analysis 2. Feature Extraction 3. Modelling 4. Matching

42 3. Modelling
The goal of the modelling techniques is to produce speaker models using the extracted features (feature vectors). Modelling techniques are further categorized into speaker recognition and speaker identification. Speaker recognition can be further classified into speaker-dependent and speaker-independent. Speaker identification is the process by which the system identifies who the speaker is on the basis of the information extracted from the speech signal.

43 Modelling Approaches:
1. Acoustic-Phonetic Approach
2. Pattern Recognition Approach
3. Dynamic Time Warping (DTW)
4. Artificial Intelligence (AI) Approach

44 1. Acoustic-Phonetic Approach
The basic principle this approach follows is identifying the speech signals and assigning apt labels to them. The acoustic-phonetic approach thus postulates that there exist a finite number of phonemes in a language which can be broadly described by their acoustic properties.

45 2. Pattern Recognition Approach
It involves two steps: pattern comparison and pattern training. It is further classified into template-based and stochastic approaches. This approach makes use of robust mathematical formulations and develops speech pattern representations.

46 3. Dynamic Time Warping (DTW)
DTW is an algorithm that measures how similar two sequences are when they vary in time or speed. A good ASR system should be able to handle the different speaking rates of different speakers, and the DTW algorithm helps with that. It finds similarities between two given sequences while respecting the various constraints involved. A sketch follows this slide.
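A minimal sketch of DTW between two 1-D feature sequences, assuming numpy. It fills the classic dynamic-programming cost table; the names and test sequences are illustrative.

    import numpy as np

    def dtw_distance(a, b):
        """Return the DTW alignment cost between 1-D sequences a and b."""
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(a[i - 1] - b[j - 1])  # local distance between aligned samples
                # Warping step: best of insertion, deletion, or match
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return cost[n, m]

    # The same shape "spoken" at different speeds still aligns closely
    slow = np.sin(np.linspace(0, 2 * np.pi, 80))
    fast = np.sin(np.linspace(0, 2 * np.pi, 50))
    print(dtw_distance(slow, fast))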

47 4. Artificial Intelligence (AI) Approach
In this approach, the recognition procedure is developed in the same way a person thinks, evaluates (or analyzes), and then makes a decision on the basis of uniform acoustic features. This approach is a combination of the acoustic-phonetic approach and the pattern recognition approach.

48 Speech Recognition 1. Analysis 2. Feature Extraction 3. Modelling 4. Matching

49 4. Matching
1. Sub-word matching:
The search engine looks up phonemes, on which the system later performs pattern recognition. These phonemes are the sub-words, hence the name sub-word matching. The storage required by this technique is in the range of 5 to 20 bytes per word, which is much less than whole-word matching, but it takes a large amount of processing.

50 2. Whole-word matching:
In this technique, the search engine matches the input signal against a pre-recorded template of a particular word. It requires less processing than sub-word matching. A disadvantage is that every word to be recognized must be recorded beforehand, so it can only be used when the recognition vocabulary is known in advance. The templates also need storage ranging from 50 to 512 bytes per word, which is very large compared to the sub-word matching technique. A toy matcher follows this slide.
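A toy whole-word matcher built on the dtw_distance sketch above: the input is compared against each pre-recorded template, and the closest word wins. The template contents here are synthetic stand-ins for real recordings.

    import numpy as np
    # assumes dtw_distance from the DTW sketch above is in scope

    templates = {
        "yes": np.sin(np.linspace(0, 2 * np.pi, 60)),   # stand-in for a recorded "yes"
        "no":  np.cos(np.linspace(0, 2 * np.pi, 60)),   # stand-in for a recorded "no"
    }

    def recognize(features, templates):
        # Pick the pre-recorded template whose DTW cost to the input is smallest
        return min(templates, key=lambda word: dtw_distance(features, templates[word]))

    print(recognize(np.sin(np.linspace(0, 2 * np.pi, 45)), templates))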

51 Database Statistics of the Department
All developed speech databases follow the Linguistic Data Consortium for Indian Languages (LDC-IL) standards.
Marathi Speech Database:
- RRD-IMSDN (8000)
- RRD-RMSDA-I (30000)
- RRD-IMSDA-II (30000)
- RRD-IMSDTA (37200)
- RRD-CMSDA (38000)

52 Emotional Speech Database:
- RRD-EMSID (3800)
- Artificial Emotional Marathi Speech Database (250)
Swahili Speech Database:
- RRD-IS2DN (3000)
- RRD-IS2DA (25000)
- RRD-GS3D (15000)
