Presentation on theme: "Distinctive Feature Detection For Automatic Speech Recognition"— Presentation transcript:
1 Distinctive Feature Detection for Automatic Speech Recognition
Jun Hou, Prof. Lawrence Rabiner, Dr. Sorin Dusan
CAIP, ECE Dept., Rutgers University
Sep. 13, 2004
2 Outline
- The history of Automatic Speech Recognition
- Current feature detection technologies
- ASAT – Automatic Speech Attribute Transcription
- Distinctive feature detection, as a part of ASAT
- Proposed work schedule
3 The Evolution of Speech Recognition
- Data-driven (1980s, 1990s and 2000s) vs. knowledge-driven (1960s, 1970s)
- Figure 1 S-curve limits ASR technology advances (C.-H. Lee)
- The gap between Human Speech Recognition (HSR) and Automatic Speech Recognition (ASR) is still very large
- Is HMM the end of the line? Or is there somewhere else to go?
4 Problems with Signals To Be Recognized
- No two utterances of the same linguistic content are ever the same (often they are not even close in their waveforms or spectral characteristics)
- Speaker variation
- Speaking style
- Background environment
- etc.
5 Statistical Methods
- Typical approaches: HMM and ANN
Figure 2 State-of-the-art HMM-based systems (C.-H. Lee)
6 Statistical Methods
- Top-down approach: higher-level knowledge guides the processing, primarily at the lower levels
- Incremental discrimination to get refined results (e.g., better stop consonant discrimination)
- Utterance verification – confidence measures to approximately estimate the reliability of the result, often on a word-by-word basis
- Errors are inevitable, mainly when the measured features fall into the overlapping region of the different pdfs
- Data-driven => sensitive to training data, both the amount and the type
- Robustness problem – sensitive to the speaking environment and the transmission characteristics of the medium
- No explicit use of acoustic or phonetic knowledge
- No clear calculation of the required size of the training data set
- High computational cost when the size of statistical patterns is large
7 HMM Issues
- Sequential model
- Assumes frame independence – blindly treats frames with equal importance; more or less okay when using cepstral features
- No higher-level (linguistic) knowledge used in acoustic modeling
- etc.
Figure 3 HMM diagram (states t(i-1), t(i), t(i+1), t(i+2))
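The frame-independence assumption can be made concrete with a toy forward recursion. This is a minimal sketch with made-up model numbers, not a system from the slides: each frame contributes only its own emission probability given the current hidden state, with no dependence on neighboring frames.

```python
# Toy forward recursion for a discrete HMM. The criticized assumption is
# explicit here: each frame's emission probability B[s][obs[t]] depends
# only on the current hidden state, so frames are scored independently
# given the state sequence. All numbers below are illustrative.
from itertools import product

def forward(obs, pi, A, B):
    """Return P(obs) under an HMM with initial probs pi,
    transition matrix A, and emission matrix B."""
    n = len(pi)
    # Initialization: alpha[s] = pi[s] * P(obs[0] | state s)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for t in range(1, len(obs)):
        # Each new frame contributes only B[s][obs[t]]:
        # no dependence on neighboring frames.
        alpha = [sum(alpha[r] * A[r][s] for r in range(n)) * B[s][obs[t]]
                 for s in range(n)]
    return sum(alpha)

# Hypothetical 2-state, 2-symbol model.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
p = forward([0, 1, 0], pi, A, B)
# Sanity check: probabilities of all length-3 sequences sum to 1.
total = sum(forward(list(o), pi, A, B) for o in product([0, 1], repeat=3))
```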
8 ANN Issues
- No meaningful representation of the internal nodes
- Lots of uncertainty as to what processing is happening
- Computationally expensive
- Hard to train; virtually impossible to guarantee convergence at the true minimum solution
- etc.
Figure 4 ANN diagram (input layer, hidden layer(s), output layer)
9 Knowledge-Based Methods
- Bottom-up approach: uses acoustic-phonetic knowledge at all levels of processing
- Temporal features are critical in discriminating some speech sounds, e.g., VOT in stop detection
- Spectral features are critical in discriminating other speech sounds, e.g., fricatives from spectral energy concentrations
- Learn information in the temporal and spectral domains using both static and dynamic features
10 Problems with Knowledge-Based Methods
- The knowledge of the acoustic properties of phonetic units is not complete; hard to cover all the rules
- The knowledge of the phonetic properties of acoustic units is not complete
- Pronunciation models explain the formation of waveforms from vocal tract shapes, but no clear reverse knowledge exists
- The choice of features is not optimal in a well-defined and meaningful sense
- The design of sound classifiers is not optimal
- No well-defined automatic tuning methods exist
11 Feature Extraction – Ali et al.
- Auditory-based front-end processing
- Feature extraction (Jakobson):
  1. Total energy
  2. Spectral Center of Gravity (SCG)
  3. Duration
  4. Low, medium and high frequency energy
  5. Formant transitions
  6. Silence detection
  7. Voicing detection
  8. Rate of change of energy in various frequency bands
  9. Rate of change of SCG
  10. Most prominent peak frequency
  11. Rate of change of the most prominent peak frequency
  12. Zero-crossing rate
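Three of the listed measurements can be sketched in code. This is a hedged illustration, not Ali et al.'s implementation; the sampling rate and the 1 kHz test tone are assumptions made for the example.

```python
# Sketch of total energy, zero-crossing rate, and spectral center of
# gravity (SCG) on one synthetic frame. Not Ali et al.'s implementation;
# the sampling rate and test tone are illustrative assumptions.
import math

def total_energy(frame):
    return sum(x * x for x in frame)

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs with a sign change.
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (len(frame) - 1)

def spectral_center_of_gravity(frame, fs):
    """Magnitude-weighted mean frequency via a naive O(N^2) DFT."""
    n = len(frame)
    num = den = 0.0
    for k in range(1, n // 2):                       # positive frequencies only
        re_k = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im_k = sum(-frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mag = math.hypot(re_k, im_k)
        num += (k * fs / n) * mag
        den += mag
    return num / den

fs = 8000                                            # assumed sampling rate (Hz)
frame = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(256)]  # 1 kHz tone
e = total_energy(frame)
zcr = zero_crossing_rate(frame)
scg = spectral_center_of_gravity(frame, fs)
```

For a pure 1 kHz tone sampled at 8 kHz, the ZCR comes out near 0.25 (two crossings per 8-sample period) and the SCG lands near 1000 Hz, which is a quick sanity check on the measurements.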
12 Feature Extraction
- Utterance segmentation (silence, obstruents, sonorants)
- Fine utterance classification into four categories:
  - Sonorants – fine identification
  - Stops – voiced and unvoiced
  - Fricatives – voiced and unvoiced
  - Silence
- Excellent performance for stops and fricatives
13 Feature Extraction
Figure 5 Block diagram of the system
Figure 6 Block diagram of the front-end
14 Feature Extraction – Fricative Classification
- Voicing detection:
  - DUP – the Duration of the Unvoiced Portion
- Place of articulation detection:
  - MDP – the Most Dominant Peak from the synchrony detector
  - MNSS – the Maximum Normalized Spectral Slope
  - SCG – the Spectral Center of Gravity
  - MDSS – the Most Dominant Spectral Slope
  - DRHF – the Dominant Relative to the Highest Filters
15 Feature Extraction – Stop Detection
- Voicing detection:
  - Prevoicing
  - VOT
  - Closure duration
- Place of articulation detection:
  - BF – Burst Frequency
  - The second formant of the following vowel
  - MNSS
  - DRHF, LINP (most prominent peak of the synchrony response after being laterally inhibited by the higher 10 filters)
  - Formant transitions before and after the stop
  - The voicing decision
16 Landmark Detection
- Landmark detection – Juneja et al., PhD thesis proposal
- Manner landmarks are used, whereas place and voicing are extracted using the locations provided by the manner landmarks
- Two kinds of manner landmarks:
  - Defined by abrupt change, e.g., the burst landmark for stop consonants, the vowel onset point
  - Defined by the most prominent manifestation of a manner phonetic feature, e.g., a point of maximum low-frequency energy in a vowel
- Three steps:
  1. Location of manner landmarks
  2. Analysis of landmarks for place and voicing phonetic features
  3. Matching phonetic features to features of word or sentence representations
17 Landmark Detection
- Recognition of 5 broad classes:
  - Vowel
  - Stop
  - Fricative
  - Sonorant consonant
  - Silence
- Table 1 Broad manner classification of English phonemes
- Use Support Vector Machines (SVMs) to segment TIMIT data into binary classes
- Results of 2 different feature organizations are reported:
  - Parallel – discriminate each feature against all other features
  - Hierarchical – distinguish the features using a probabilistic hierarchy
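The hierarchical organization can be illustrated with stand-in threshold classifiers. This is not Juneja's actual SVM system: all thresholds and measurement names below are hypothetical placeholders for trained binary SVMs, chosen only to show how sequential binary decisions yield the five broad classes.

```python
# Illustrative stand-in for the hierarchical organization: binary
# decisions applied in sequence assign one of the five broad classes.
# Thresholds and measurement names are hypothetical placeholders for
# trained SVM classifiers, not values from any real system.

def classify_frame(energy, periodicity, spectral_flatness):
    if energy < 0.1:                       # stage 1: silence vs. speech
        return "silence"
    if periodicity > 0.5:                  # stage 2: sonorant vs. obstruent
        # stage 3a: syllabic (vowel) vs. non-syllabic sonorant
        return "vowel" if energy > 0.6 else "sonorant consonant"
    # stage 3b: continuant (fricative) vs. non-continuant (stop) obstruent
    return "fricative" if spectral_flatness > 0.5 else "stop"

labels = [
    classify_frame(0.05, 0.0, 0.0),        # weak signal
    classify_frame(0.90, 0.9, 0.1),        # strong, periodic
    classify_frame(0.40, 0.9, 0.1),        # weaker, periodic
    classify_frame(0.50, 0.1, 0.8),        # aperiodic, noise-like
    classify_frame(0.50, 0.1, 0.2),        # aperiodic, compact
]
```

One design consequence is visible even in this toy: a wrong decision at an upper stage cannot be undone by any classifier below it, which is one plausible source of error propagation in hierarchical architectures.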
18 Landmark Detection
Table 2 Landmarks extracted for each of the manner classes and knowledge-based acoustic measurements
19 Landmark Detection
Table 3 Acoustic parameters used in broad class segmentation
20 Landmark Detection
- Compare the organizations of SVMs
Figure 7 Parallel SVM organization
Figure 8 Hierarchical SVM organization
21 Landmark Detection
- Compare classification results
Table 4 Results of parallel SVM organization
Table 5 Results of hierarchical SVM organization
22 Landmark Detection – Discussion
- Combine landmarks with acoustic parameters
- The gap between correctness and accuracy is due mainly to insertions of sonorant consonants and stops
- The performance gap between the hierarchical and parallel SVM architectures is unexplained – possibly a wrong classification at an upper level of the hierarchy propagates errors to the lower levels
- Isolated or connected word recognition
- Uses Finite State Automata (FSA) to constrain the segmentation paths
- Doesn't allow the use of a probabilistic language model
23 Landmark Detection – ANN
- Benoit Launay et al.
- Train an Artificial Neural Network to map short-term spectral features to the posterior probability of some distinctive features
- Feed the features into an HMM
24 ASAT – Automatic Speech Attribute Transcription
- Knowledge-based, data-driven approach
Figure 9 Bottom-up ASAT based on speech attribute detection, event merging and evidence verification (C.-H. Lee)
25 Distinctive Feature Detection
- The speech signal feeds a bank of attribute detectors (Attribute Detector 1 … Attribute Detector M); their outputs are combined (linear, ANN, K-L, etc.) by feature detectors (Feature Detector 1 … Feature Detector N) to produce Feature 1 … Feature N
Figure 10 Distinctive feature detection
- Questions raised by the diagram:
  1. What attributes?
  2. How to measure them?
  3. What features?
  4. How to combine the attributes to form features?
  5. What outputs?
  6. How to compute them?
26 Attributes and Features in ASAT – Issues to be Resolved
- Q1: What attributes?
- Q2: How to measure them?
- Q3: What features?
- Q4: How to combine the attributes to form features?
- Q5: What outputs?
- Q6: How to compute the outputs?
- Q7: Why use them?
27 Q1: What attributes?
- MFCCs and their derivatives, energy in specific spectral ranges, zero-crossing rate, formant frequency, ratio of spectral peaks, etc.
- VOT, energy onset, energy offset, etc.
- Refer to the attributes in Ali's paper
- Find other indicative attributes in the spectral graph, cepstral graph, etc.
- Find other significant characteristics in waveforms
- Find characteristics inside/between the time and frequency domains
- A different set of attributes for each feature
28 Q2: How to measure them?
- Observe and analyze the speech signal in both the time and frequency domains, e.g., filter bank analysis
- Data mining of meaningful "patterns"
- Experiments needed to find distinguishing attributes for each acoustic feature
- Enhance distinctive attributes, eliminate confusing attributes – better ways to measure things
- Find the relations of attributes inside a frame, e.g., between prominent attributes and weak attributes
- Calculate correlation between attributes in succeeding frames
- Calculate information redundancy for different attributes
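The proposed frame-to-frame correlation measurement can be sketched as lag-1 correlation of an attribute track. The track values below are made up for illustration; a high value suggests the attribute is largely redundant across neighboring frames.

```python
# Sketch of one proposed measurement: Pearson correlation between an
# attribute's values in succeeding frames (lag-1 correlation of the
# attribute track). The track values are hypothetical.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def frame_to_frame_correlation(track):
    """Correlate the attribute at frame t with the same attribute at t+1."""
    return pearson(track[:-1], track[1:])

# Hypothetical slowly varying energy contour over ten frames.
track = [0.1, 0.2, 0.35, 0.5, 0.6, 0.65, 0.6, 0.5, 0.35, 0.2]
r = frame_to_frame_correlation(track)   # high: successive frames agree
```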
29 Q2: How to measure them? Topology of attribute organization
- Parallel organization – ASAT organization
- Graph organization
- Hierarchical – Juneja et al. (features)
- Eliminate redundancy in computation
- One attribute may trigger the test of existence of other attributes
- Combined organization, i.e., sequential and graph methods combined
30 Q3: What features?
- Features available in the current acoustic-phonetic area: binary distinctive features
- Distinctive features are related to:
  - Voicing – whether the vocal folds vibrate or not
  - Place of articulation – the particular articulator that is used (glottis, soft palate, lips, etc.)
  - Manner of articulation – how that articulator is used to produce the sound
31 Q3: What features?
- Initial list of twelve pairs of distinctive features:
  1. Vocalic/non-vocalic
  2. Consonantal/non-consonantal
  3. Interrupted/continuant
  4. Checked/unchecked
  5. Strident/mellow
  6. Voiced/unvoiced
  7. Compact/diffuse
  8. Grave/acute
  9. Flat/plain
  10. Sharp/plain
  11. Tense/lax
  12. Nasal/oral
- English is characterized by 9 pairs of these features
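Binary distinctive features can be stored as simple per-phoneme bundles. This is an illustrative encoding restricted to two of the twelve pairs (voiced/unvoiced and strident/mellow); the encoding style is a sketch, not a committed design.

```python
# Illustrative encoding of distinctive feature bundles for four phonemes,
# covering only two of the twelve pairs. The encoding is a sketch.

FEATURES = {
    "p": {"voiced": False, "strident": False},   # unvoiced, mellow stop
    "b": {"voiced": True,  "strident": False},   # voiced, mellow stop
    "s": {"voiced": False, "strident": True},    # unvoiced, strident fricative
    "z": {"voiced": True,  "strident": True},    # voiced, strident fricative
}

def contrast(a, b):
    """Features (within this subset) on which two phonemes differ."""
    return [f for f in FEATURES[a] if FEATURES[a][f] != FEATURES[b][f]]

pb = contrast("p", "b")   # within this subset: differ in voicing only
bz = contrast("b", "z")   # within this subset: differ in stridency only
```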
32 Q3: What features?
- Need to detect all relevant features to perform automatic speech recognition at the phonetic level
- Acoustic-phonetic features are intuitively plausible, but there might exist other good features obtained from data mining and/or clustering techniques
- We can optimize (how we do it is unclear) and obtain the minimum necessary set of speech distinctive features
- May use attributes directly, together with features, when calculating the outputs from the detectors
33 Q4: How to compute or estimate the features?
- Develop combination methods and optimize them to get a better combination of attributes to form meaningful features, and select the best features for phonemes and possibly larger acoustic units
- Possible combination algorithms:
  - Linearly weighted average
  - ANN
  - K-L
  - Fuzzy integral seems promising, compared with ANN (cf. Chang & Greenberg's paper)
- Prominent attributes characterize features; the existence of some particular attributes may help to further define the feature or features
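The simplest of the listed combiners, a linearly weighted average, might look like the sketch below. The detector names, scores, and weights are hypothetical; in ASAT they would be learned or tuned rather than fixed by hand.

```python
# Sketch of the first listed combiner: a linearly weighted average of
# attribute detector scores yielding one feature score. All detector
# names, scores, and weights are hypothetical.

def combine_linear(scores, weights):
    """Weighted average of detector outputs, each assumed in [0, 1]."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total

# Hypothetical evidence for the feature "voiced": a periodicity score,
# a low-frequency-energy score, and a first-harmonic score.
scores = [0.9, 0.7, 0.8]
weights = [0.5, 0.3, 0.2]        # periodicity trusted most
voiced_score = combine_linear(scores, weights)
```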
34 Q5: What outputs? Modified features? Phonemes? Phoneme-like units?
- Study the acoustic-phonetic theories and establish models that best describe the production of sound signals
- Study each acoustic class and find their differences and relations
35 Q6: How to compute the outputs?
- Study acoustic variation during pronunciation; find common characteristics and distinguishing characteristics for acoustic-phonetic variations
- Score the outputs of the feature detectors using probabilities or likelihood measures of the presence of these distinctive features
- Other methods?
36 Q7: Why use them?
- We have no other choice at this time
- These attributes and features may be far from optimal, but they are well motivated by acoustic-phonetic theories
- Will consider other ideas as they are developed
37 Evaluation
- Evaluation criteria for attributes and features:
  - Mutual information (cf. Hasegawa-Johnson's paper)
  - Entropy (e.g., traditional Shannon entropy, Rényi entropy; cf. Cachin's paper)
  - Perplexity, like that used in language modeling
  - False acceptance rate, false rejection rate
  - Other criteria?
- Use these criteria to find correlations between attributes, as well as between features
- Gradually minimize the mutual information between attributes/features, e.g., by gradient descent, and get the minimum sets of attributes and features
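Two of the listed criteria, Shannon entropy of a discretized attribute and mutual information between two attributes, can be sketched from a joint probability table. The tables below are illustrative extremes, not measured data.

```python
# Sketch of two of the listed criteria: Shannon entropy and mutual
# information, computed from a joint probability table given as rows.
# The probability tables are illustrative extremes.
import math

def entropy(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

def mutual_information(joint):
    """I(X;Y) in bits from a joint distribution given as rows."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

h = entropy([0.5, 0.5])                      # 1 bit
mi_dep = mutual_information([[0.5, 0.0],     # fully dependent attributes:
                             [0.0, 0.5]])    # MI = 1 bit (one is redundant)
mi_ind = mutual_information([[0.25, 0.25],   # independent attributes:
                             [0.25, 0.25]])  # MI = 0 bits
```

Minimizing mutual information between attributes, as proposed above, drives the set toward the independent case, where each attribute carries non-redundant information.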
38 Segmentation of Speech
- Study how humans segment different portions of speech, e.g., spectrogram reading
- Multiple segmentations are possible, so we might want to search through a range of segmentation candidates to find the best result
- Collect the segments with high confidence scores
- Use other knowledge sources to help clarify the segments with poor scores
39 Training and Testing
- Database – TIMIT and/or the Vic corpus
- Divide the database into separate training and testing sets
- Training:
  (1) On the training set
  (2) On the training set + testing set – is this meaningful or proper?
- Find the difference between (1) and (2), and the generalization ability of the features to out-of-task data
- Test performance on the testing set
40 Training and Testing – Training
- Study differences between isolated words, connected words, continuous and spontaneous speech
- Try not to depend solely on the training data, but instead find rules that adapt the data and can be applied to more general environments
- Try not to diffuse the model as more data is added
41 Training and Testing – Testing
- Find reasons why the detectors failed
- Observe error patterns
- Did the error patterns emerge due to different reasons? If so, re-examine previous steps, and combine the different information sources in ways that are less sensitive to the observed error patterns
42 Work Schedule
- First year: set up the structure for the ASAT system
- Define the most reasonable starting set of acoustic attributes and phonetic features
- Look at a range of ways of combining evidence from the acoustic attributes to create the phonetic features
- Evaluate the baseline performance of the system on a given training and testing set of data – most probably using TIMIT
- Baseline alternative approaches, especially front ends, including auditory models and standard speech recognition features