Presentation on theme: "Distinctive Feature Detection For Automatic Speech Recognition"— Presentation transcript:
1 Distinctive Feature Detection for Automatic Speech Recognition
Jun Hou, Prof. Lawrence Rabiner, Dr. Sorin Dusan
CAIP, ECE Dept., Rutgers University
Sep. 13, 2004
2 Outline
- The history of Automatic Speech Recognition
- Current feature detection technologies
- ASAT – Automatic Speech Attribute Transcription
- Distinctive feature detection, as a part of ASAT
- Proposed work schedule
3 The Evolution of Speech Recognition
- Data-driven (1980s, 1990s and 2000s) vs. knowledge-driven (1960s, 1970s)
- Figure 1 S-curve limits ASR technology advances (C.-H. Lee)
- The gap between Human Speech Recognition (HSR) and Automatic Speech Recognition (ASR) is still very large
- Is HMM the end of the line? Or is there somewhere else to go?
4 Problems with Signals To Be Recognized
- No two utterances of the same linguistic content are ever the same (often they are not even close in their waveforms or spectral characteristics)
- Speaker variation
- Speaking style
- Background environment
- etc.
5 Statistical Methods
- Typical approaches: HMM and ANN
Figure 2 State-of-the-art HMM-based systems (C.-H. Lee)
6 Statistical Methods
- Top-down approach: higher-level knowledge guides the processing, primarily at the lower levels
- Incremental discrimination to get refined results (e.g., better stop consonant discrimination)
- Utterance verification – confidence measures to approximately estimate the reliability of the result, often on a word-by-word basis
- Errors are inevitable, mainly when the measured features fall into the overlapping region of the different pdfs
- Data-driven => sensitive to training data, both the amount and the type
- Robustness problem – sensitive to the speaking environment and the transmission characteristics of the medium
- No explicit use of acoustic or phonetic knowledge
- No clear calculation of the required size of the training data set
- High computational cost when the size of statistical patterns is large
7 HMM Issues
- Sequential model
- Assumes frame independence – blindly treats frames with equal importance; more or less okay when using cepstral features
- No higher-level (linguistic) knowledge used in acoustic modeling
- etc.
Figure 3 HMM diagram (states t(i-1), t(i), t(i+1), t(i+2))
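The frame-independence assumption can be made concrete with a toy forward recursion. This is a minimal sketch with made-up model numbers, not a system from the slides: each frame contributes only its own emission probability given the current hidden state, with no dependence on neighboring frames.

```python
# Toy forward recursion for a discrete HMM. The criticized assumption is
# explicit here: each frame's emission probability B[s][obs[t]] depends
# only on the current hidden state, so frames are scored independently
# given the state sequence. All numbers below are illustrative.
from itertools import product

def forward(obs, pi, A, B):
    """Return P(obs) under an HMM with initial probs pi,
    transition matrix A, and emission matrix B."""
    n = len(pi)
    # Initialization: alpha[s] = pi[s] * P(obs[0] | state s)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for t in range(1, len(obs)):
        # Each new frame contributes only B[s][obs[t]]:
        # no dependence on neighboring frames.
        alpha = [sum(alpha[r] * A[r][s] for r in range(n)) * B[s][obs[t]]
                 for s in range(n)]
    return sum(alpha)

# Hypothetical 2-state, 2-symbol model.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
p = forward([0, 1, 0], pi, A, B)
# Sanity check: probabilities of all length-3 sequences sum to 1.
total = sum(forward(list(o), pi, A, B) for o in product([0, 1], repeat=3))
```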
8 ANN Issues
- No meaningful representation of the internal nodes
- Lots of uncertainty as to what processing is happening
- Computationally expensive
- Hard to train; virtually impossible to guarantee convergence at the true minimum solution
- etc.
Figure 4 ANN diagram (input layer, hidden layer(s), output layer)
9 Knowledge-Based Methods
- Bottom-up approach: uses acoustic-phonetic knowledge at all levels of processing
- Temporal features are critical in discriminating some speech sounds, e.g., VOT in stop detection
- Spectral features are critical in discriminating other speech sounds, e.g., fricatives from spectral energy concentrations
- Learn information in the temporal and spectral domains using both static and dynamic features
10 Problems with Knowledge-Based Methods
- The knowledge of the acoustic properties of phonetic units is not complete; hard to cover all the rules
- The knowledge of the phonetic properties of acoustic units is not complete
- Pronunciation models explain the formation of waveforms from vocal tract shapes, but no clear reverse knowledge exists
- The choice of features is not optimal in a well-defined and meaningful sense
- The design of sound classifiers is not optimal
- No well-defined automatic tuning methods exist
11 Feature Extraction – Ali et al.
- Auditory-based front-end processing
- Feature extraction (Jakobson):
  1. Total energy
  2. Spectral Center of Gravity (SCG)
  3. Duration
  4. Low, medium and high frequency energy
  5. Formant transitions
  6. Silence detection
  7. Voicing detection
  8. Rate of change of energy in various frequency bands
  9. Rate of change of SCG
  10. Most prominent peak frequency
  11. Rate of change of the most prominent peak frequency
  12. Zero-crossing rate
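Three of the listed measurements can be sketched in code. This is a hedged illustration, not Ali et al.'s implementation; the sampling rate and the 1 kHz test tone are assumptions made for the example.

```python
# Sketch of total energy, zero-crossing rate, and spectral center of
# gravity (SCG) on one synthetic frame. Not Ali et al.'s implementation;
# the sampling rate and test tone are illustrative assumptions.
import math

def total_energy(frame):
    return sum(x * x for x in frame)

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs with a sign change.
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (len(frame) - 1)

def spectral_center_of_gravity(frame, fs):
    """Magnitude-weighted mean frequency via a naive O(N^2) DFT."""
    n = len(frame)
    num = den = 0.0
    for k in range(1, n // 2):                       # positive frequencies only
        re_k = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im_k = sum(-frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mag = math.hypot(re_k, im_k)
        num += (k * fs / n) * mag
        den += mag
    return num / den

fs = 8000                                            # assumed sampling rate (Hz)
frame = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(256)]  # 1 kHz tone
e = total_energy(frame)
zcr = zero_crossing_rate(frame)
scg = spectral_center_of_gravity(frame, fs)
```

For a pure 1 kHz tone sampled at 8 kHz, the ZCR comes out near 0.25 (two crossings per 8-sample period) and the SCG lands near 1000 Hz, which is a quick sanity check on the measurements.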
12 Feature Extraction
- Utterance segmentation (silence, obstruents, sonorants)
- Fine utterance classification into four categories:
  - Sonorants – fine identification
  - Stops – voiced and unvoiced
  - Fricatives – voiced and unvoiced
  - Silence
- Excellent performance for stops and fricatives
13 Feature Extraction
Figure 5 Block diagram of the system
Figure 6 Block diagram of the front-end
14 Feature Extraction – Fricative Classification
- Voicing detection:
  - DUP – the Duration of the Unvoiced Portion
- Place of articulation detection:
  - MDP – the Most Dominant Peak from the synchrony detector
  - MNSS – the Maximum Normalized Spectral Slope
  - SCG – the Spectral Center of Gravity
  - MDSS – the Most Dominant Spectral Slope
  - DRHF – the Dominant Relative to the Highest Filters
15 Feature Extraction – Stop Detection
- Voicing detection:
  - Prevoicing
  - VOT
  - Closure duration
- Place of articulation detection:
  - BF – Burst Frequency
  - The second formant of the following vowel
  - MNSS
  - DRHF, LINP (most prominent peak of the synchrony response after being laterally inhibited by the higher 10 filters)
  - Formant transitions before and after the stop
  - The voicing decision
16 Landmark Detection
- Landmark detection – Juneja et al., PhD thesis proposal
- Manner landmarks are used, whereas place and voicing are extracted using the locations provided by the manner landmarks
- Two kinds of manner landmarks:
  - Defined by abrupt change, e.g., the burst landmark for stop consonants, the vowel onset point
  - Defined by the most prominent manifestation of a manner phonetic feature, e.g., a point of maximum low-frequency energy in a vowel
- Three steps:
  1. Location of manner landmarks
  2. Analysis of landmarks for place and voicing phonetic features
  3. Matching phonetic features to features of word or sentence representations
17 Landmark Detection
- Recognition of 5 broad classes:
  - Vowel
  - Stop
  - Fricative
  - Sonorant consonant
  - Silence
- Table 1 Broad manner classification of English phonemes
- Use Support Vector Machines (SVMs) to segment TIMIT data into binary classes
- Results of 2 different feature organizations are reported:
  - Parallel – discriminate each feature against all other features
  - Hierarchical – distinguish the features using a probabilistic hierarchy
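The hierarchical organization can be illustrated with stand-in threshold classifiers. This is not Juneja's actual SVM system: all thresholds and measurement names below are hypothetical placeholders for trained binary SVMs, chosen only to show how sequential binary decisions yield the five broad classes.

```python
# Illustrative stand-in for the hierarchical organization: binary
# decisions applied in sequence assign one of the five broad classes.
# Thresholds and measurement names are hypothetical placeholders for
# trained SVM classifiers, not values from any real system.

def classify_frame(energy, periodicity, spectral_flatness):
    if energy < 0.1:                       # stage 1: silence vs. speech
        return "silence"
    if periodicity > 0.5:                  # stage 2: sonorant vs. obstruent
        # stage 3a: syllabic (vowel) vs. non-syllabic sonorant
        return "vowel" if energy > 0.6 else "sonorant consonant"
    # stage 3b: continuant (fricative) vs. non-continuant (stop) obstruent
    return "fricative" if spectral_flatness > 0.5 else "stop"

labels = [
    classify_frame(0.05, 0.0, 0.0),        # weak signal
    classify_frame(0.90, 0.9, 0.1),        # strong, periodic
    classify_frame(0.40, 0.9, 0.1),        # weaker, periodic
    classify_frame(0.50, 0.1, 0.8),        # aperiodic, noise-like
    classify_frame(0.50, 0.1, 0.2),        # aperiodic, compact
]
```

One design consequence is visible even in this toy: a wrong decision at an upper stage cannot be undone by any classifier below it, which is one plausible source of error propagation in hierarchical architectures.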
18 Landmark Detection
Table 2 Landmarks extracted for each of the manner classes and knowledge-based acoustic measurements
19 Landmark Detection
Table 3 Acoustic parameters used in broad class segmentation
20 Landmark Detection
- Compare the organizations of SVMs
Figure 7 Parallel SVM organization
Figure 8 Hierarchical SVM organization
21 Landmark Detection
- Compare classification results
Table 4 Results of parallel SVM organization
Table 5 Results of hierarchical SVM organization
22 Landmark Detection – Discussion
- Combine landmarks with acoustic parameters
- The gap between correctness and accuracy is due mainly to insertions of sonorant consonants and stops
- The performance gap between the hierarchical and parallel SVM architectures is unexplained – possibly a wrong classification at an upper level of the hierarchy propagates errors to the lower levels
- Isolated or connected word recognition
- Uses Finite State Automata (FSA) to constrain the segmentation paths
- Doesn't allow the use of a probabilistic language model
23 Landmark Detection – ANN
- Benoit Launay et al.
- Train an Artificial Neural Network to map short-term spectral features to the posterior probability of some distinctive features
- Feed the features into an HMM
24 ASAT – Automatic Speech Attribute Transcription
- Knowledge-based, data-driven approach
Figure 9 Bottom-up ASAT based on speech attribute detection, event merging and evidence verification (C.-H. Lee)
25 Distinctive Feature Detection
- The speech signal feeds a bank of attribute detectors (Attribute Detector 1 … Attribute Detector M); their outputs are combined (linear, ANN, K-L, etc.) by feature detectors (Feature Detector 1 … Feature Detector N) to produce Feature 1 … Feature N
Figure 10 Distinctive feature detection
- Questions raised by the diagram:
  1. What attributes?
  2. How to measure them?
  3. What features?
  4. How to combine the attributes to form features?
  5. What outputs?
  6. How to compute them?
26 Attributes and Features in ASAT – Issues to be Resolved
- Q1: What attributes?
- Q2: How to measure them?
- Q3: What features?
- Q4: How to combine the attributes to form features?
- Q5: What outputs?
- Q6: How to compute the outputs?
- Q7: Why use them?
27 Q1: What attributes?
- MFCCs and their derivatives, energy in specific spectral ranges, zero-crossing rate, formant frequency, ratio of spectral peaks, etc.
- VOT, energy onset, energy offset, etc.
- Refer to the attributes in Ali's paper
- Find other indicative attributes in the spectral graph, cepstral graph, etc.
- Find other significant characteristics in waveforms
- Find characteristics inside/between the time and frequency domains
- A different set of attributes for each feature
28 Q2: How to measure them?
- Observe and analyze the speech signal in both the time and frequency domains, e.g., filter bank analysis
- Data mining of meaningful "patterns"
- Experiments needed to find distinguishing attributes for each acoustic feature
- Enhance distinctive attributes, eliminate confusing attributes – better ways to measure things
- Find the relations of attributes inside a frame, e.g., between prominent attributes and weak attributes
- Calculate correlation between attributes in succeeding frames
- Calculate information redundancy for different attributes
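The proposed frame-to-frame correlation measurement can be sketched as lag-1 correlation of an attribute track. The track values below are made up for illustration; a high value suggests the attribute is largely redundant across neighboring frames.

```python
# Sketch of one proposed measurement: Pearson correlation between an
# attribute's values in succeeding frames (lag-1 correlation of the
# attribute track). The track values are hypothetical.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def frame_to_frame_correlation(track):
    """Correlate the attribute at frame t with the same attribute at t+1."""
    return pearson(track[:-1], track[1:])

# Hypothetical slowly varying energy contour over ten frames.
track = [0.1, 0.2, 0.35, 0.5, 0.6, 0.65, 0.6, 0.5, 0.35, 0.2]
r = frame_to_frame_correlation(track)   # high: successive frames agree
```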
29 Q2: How to measure them? Topology of attribute organization
- Parallel organization – ASAT organization
- Graph organization
- Hierarchical – Juneja et al. (features)
- Eliminate redundancy in computation
- One attribute may trigger the test of existence of other attributes
- Combined organization, i.e., sequential and graph methods combined
30 Q3: What features?
- Features available in the current acoustic-phonetic area: binary distinctive features
- Distinctive features are related to:
  - Voicing – whether the vocal folds vibrate or not
  - Place of articulation – the particular articulator that is used (glottis, soft palate, lips, etc.)
  - Manner of articulation – how that articulator is used to produce the sound
31 Q3: What features?
- Initial list of twelve pairs of distinctive features:
  1. Vocalic/non-vocalic
  2. Consonantal/non-consonantal
  3. Interrupted/continuant
  4. Checked/unchecked
  5. Strident/mellow
  6. Voiced/unvoiced
  7. Compact/diffuse
  8. Grave/acute
  9. Flat/plain
  10. Sharp/plain
  11. Tense/lax
  12. Nasal/oral
- English is characterized by 9 pairs of these features
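Binary distinctive features can be stored as simple per-phoneme bundles. This is an illustrative encoding restricted to two of the twelve pairs (voiced/unvoiced and strident/mellow); the encoding style is a sketch, not a committed design.

```python
# Illustrative encoding of distinctive feature bundles for four phonemes,
# covering only two of the twelve pairs. The encoding is a sketch.

FEATURES = {
    "p": {"voiced": False, "strident": False},   # unvoiced, mellow stop
    "b": {"voiced": True,  "strident": False},   # voiced, mellow stop
    "s": {"voiced": False, "strident": True},    # unvoiced, strident fricative
    "z": {"voiced": True,  "strident": True},    # voiced, strident fricative
}

def contrast(a, b):
    """Features (within this subset) on which two phonemes differ."""
    return [f for f in FEATURES[a] if FEATURES[a][f] != FEATURES[b][f]]

pb = contrast("p", "b")   # within this subset: differ in voicing only
bz = contrast("b", "z")   # within this subset: differ in stridency only
```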
32 Q3: What features?
- Need to detect all relevant features to perform automatic speech recognition at the phonetic level
- Acoustic-phonetic features are intuitively plausible, but there might exist other good features obtained from data mining and/or clustering techniques
- We can optimize (how we do it is unclear) and obtain the minimum necessary set of speech distinctive features
- May use attributes directly, together with features, when calculating the outputs from the detectors
33 Q4: How to compute or estimate the features?
- Develop combination methods and optimize them to get a better combination of attributes to form meaningful features, and select the best features for phonemes and possibly larger acoustic units
- Possible combination algorithms:
  - Linearly weighted average
  - ANN
  - K-L
  - Fuzzy integral seems promising, compared with ANN (cf. Chang & Greenberg's paper)
- Prominent attributes characterize features; the existence of some particular attributes may help to further define the feature or features
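The simplest of the listed combiners, a linearly weighted average, might look like the sketch below. The detector names, scores, and weights are hypothetical; in ASAT they would be learned or tuned rather than fixed by hand.

```python
# Sketch of the first listed combiner: a linearly weighted average of
# attribute detector scores yielding one feature score. All detector
# names, scores, and weights are hypothetical.

def combine_linear(scores, weights):
    """Weighted average of detector outputs, each assumed in [0, 1]."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total

# Hypothetical evidence for the feature "voiced": a periodicity score,
# a low-frequency-energy score, and a first-harmonic score.
scores = [0.9, 0.7, 0.8]
weights = [0.5, 0.3, 0.2]        # periodicity trusted most
voiced_score = combine_linear(scores, weights)
```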
34 Q5: What outputs? Modified features? Phonemes? Phoneme-like units?
- Study the acoustic-phonetic theories and establish models that best describe the production of sound signals
- Study each acoustic class and find their differences and relations
35 Q6: How to compute the outputs?
- Study acoustic variation during pronunciation; find common characteristics and distinguishing characteristics for acoustic-phonetic variations
- Score the outputs of the feature detectors using probabilities or likelihood measures of the presence of these distinctive features
- Other methods?
36 Q7: Why use them?
- We have no other choice at this time
- These attributes and features may be far from optimal, but they are well motivated by acoustic-phonetic theories
- Will consider other ideas as they are developed
37 Evaluation
- Evaluation criteria for attributes and features:
  - Mutual information (cf. Hasegawa-Johnson's paper)
  - Entropy (e.g., traditional Shannon entropy, Rényi entropy; cf. Cachin's paper)
  - Perplexity, like that used in language modeling
  - False acceptance rate, false rejection rate
  - Other criteria?
- Use these criteria to find correlations between attributes, as well as between features
- Gradually minimize the mutual information between attributes/features, e.g., by gradient descent, and get the minimum sets of attributes and features
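Two of the listed criteria, Shannon entropy of a discretized attribute and mutual information between two attributes, can be sketched from a joint probability table. The tables below are illustrative extremes, not measured data.

```python
# Sketch of two of the listed criteria: Shannon entropy and mutual
# information, computed from a joint probability table given as rows.
# The probability tables are illustrative extremes.
import math

def entropy(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

def mutual_information(joint):
    """I(X;Y) in bits from a joint distribution given as rows."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

h = entropy([0.5, 0.5])                      # 1 bit
mi_dep = mutual_information([[0.5, 0.0],     # fully dependent attributes:
                             [0.0, 0.5]])    # MI = 1 bit (one is redundant)
mi_ind = mutual_information([[0.25, 0.25],   # independent attributes:
                             [0.25, 0.25]])  # MI = 0 bits
```

Minimizing mutual information between attributes, as proposed above, drives the set toward the independent case, where each attribute carries non-redundant information.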
38 Segmentation of Speech
- Study how humans segment different portions of speech, e.g., spectrogram reading
- Multiple segmentations are possible, so we might want to search through a range of segmentation candidates to find the best result
- Collect the segments with high confidence scores
- Use other knowledge sources to help clarify the segments with poor scores
39 Training and Testing
- Database – TIMIT and/or the Vic corpus
- Divide the database into separate training and testing sets
- Training:
  (1) On the training set
  (2) On the training set + testing set – is this meaningful or proper?
- Find the difference between (1) and (2), and the generalization ability of the features to out-of-task data
- Test performance on the testing set
40 Training and Testing – Training
- Study differences between isolated words, connected words, continuous and spontaneous speech
- Try not to depend solely on the training data, but instead find rules that adapt the data and can be applied to more general environments
- Try not to diffuse the model as more data is added
41 Training and Testing – Testing
- Find reasons why the detectors failed
- Observe error patterns
- Did the error patterns emerge due to different reasons? If so, re-examine previous steps, and combine the different information sources in ways that are less sensitive to the observed error patterns
42 Work Schedule
- First year: set up the structure for the ASAT system
- Define the most reasonable starting set of acoustic attributes and phonetic features
- Look at a range of ways of combining evidence from the acoustic attributes to create the phonetic features
- Evaluate the baseline performance of the system on a given training and testing set of data – most probably using TIMIT
- Baseline alternative approaches, especially front ends, including auditory models and standard speech recognition features