Automatic Speech Recognition with Sparse Training Data for Dysarthric Speakers
P. Green 1, J. Carmichael 1, A. Hatzis 1, P. Enderby 3, M. Hawley & M. Parker 2

1 Department of Computer Science, University of Sheffield; 2 Department of Medical Physics & Clinical Engineering, Barnsley District General Hospital NHS Trust; 3 Institute of General Practice, University of Sheffield.

Abstract
We describe an unusual ASR application: recognition of command words from severely dysarthric speakers, who have poor control of their articulators. The goal is to allow these clients to control assistive technology by voice. While this is a small-vocabulary, speaker-dependent, isolated-word application, the speech material is more variable than normal, and only a small amount of data is available for training. After training a CDHMM recogniser, it is necessary to predict its likely performance without using an independent test set, so that confusable words can be replaced by alternatives. We present a battery of measures of consistency and confusability, based on forced alignment, which can be used to predict recogniser performance. We show how these measures perform, and how they are presented to the clinicians who are the users of the system.

Recogniser Design
Continuous Density HMMs (CDHMMs) with:
- whole-word rather than phone-level modelling, with training data labelled at word level (typically 20 utterances per word)
- 11 HMM states per word model
- 3 Gaussian mixture components per state
- straight-through model topology
- 12 MFCCs
- 16 kHz sampling rate with a 10 ms frame window

The Implications for Vocabulary Selection
Let's have a closer look at a section of GR's matrix. 'Alarm' and 'Lamp' show high confusability with each other (but not with other words in the vocabulary), so one should be removed and replaced with an alternative item, perhaps 'Light' instead of 'Lamp'. Sometimes it is not easy to guess which items will confuse easily with others.
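The Recogniser Design figures above pin down the front-end geometry: at a 16 kHz sampling rate, a 10 ms frame covers 160 samples. A minimal configuration sketch, with identifier names of our own invention (the poster does not publish the project's actual code):

```python
# Configuration figures taken from the "Recogniser Design" box.
# All names here are illustrative, not STARDUST's own code.
RECOGNISER_CONFIG = {
    "model_unit": "whole-word",       # word-level labels, no phone models
    "states_per_word": 11,
    "mixture_components_per_state": 3,
    "topology": "straight-through",   # strict left-to-right, no skips
    "num_mfccs": 12,
    "sample_rate_hz": 16_000,
    "frame_window_ms": 10,
}

def samples_per_frame(cfg=RECOGNISER_CONFIG):
    """Audio samples covered by one analysis frame at the configured rate."""
    return cfg["sample_rate_hz"] * cfg["frame_window_ms"] // 1000
```

With roughly 20 utterances per word, keeping the model small (11 states, 3 mixture components) is what makes training feasible at all on this little data.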
For normal speaker MP, the word 'Volume' shows low confusability in contrast to the other words, but not so for the dysarthric speaker GR: in actual practice, it was necessary to replace 'Volume' in GR's recogniser vocabulary with 'Power'.

Visualising Confusability
Inter- and intra-word model confusability can be visualised as a matrix. For greater visual impact, we use colour-coding to depict the range of values. Ideally, areas of high confusability should occur only along the diagonal of the matrix (each word 'confusing' with itself). For dysarthric speech, areas of high confusability are often found off the diagonal, in unexpected locations.

Does it Predict Actual Performance?
(See Table 4: test-set confusions superimposed on GR's confusability matrix.)

Future Work
- The relationship between speech intelligibility and consistency.
- The use of this tool for speech-disorder diagnostics: subjectively assessed intelligibility tests are psychometrically weak and inconsistent, whereas confusability metrics are objective and repeatable.
- Incorporating this tool into speech training software (see Hatzis et al., this conference).

Acknowledgements
This research was sponsored by the UK Department of Health New and Emerging Application of Technology (NEAT) programme and received a proportion of its funding from the NHS Executive. The views expressed in this publication are those of the authors and not necessarily those of the Department of Health or the NHS Executive.

Motivation
- Dysarthrias (a family of neurologically based speech disorders characterised by loss of control of the articulators) are often connected to a more generalised motor impairment (e.g. stroke, multiple sclerosis), making normal interaction with the environment difficult.
- This physical incapacity makes voice control of Electronic Assistive Technology (EAT) an attractive option, BUT...
- severely dysarthric speech is so abnormal that off-the-shelf ASR products fail.
The STARDUST project aims to use custom-built ASR for control of EAT by severe dysarthrics.

The STARDUST Project's Achievements
We have:
- used computer-based training to improve the speech consistency of most of the clients enrolled in the pilot project, making the speech recognition task easier (see Hatzis et al., this conference);
- built small-vocabulary isolated-word recognisers for severely disordered speech; the accuracy of these speaker-dependent recognisers is encouraging (10-word vocabulary, see Table 1);
- successfully used these recognisers to control Assistive Technology.

Confusability: Forecasting Recogniser Performance from the Training Set
Dealing with sparse training data: severe dysarthrics cannot be asked to produce large quantities of training data. Data scarcity implies that all available speech (except extreme outliers) should be used for training, so there are no separate training and test sets. We therefore need to predict which words the recogniser is likely to confuse with one another, in order to modify the vocabulary if necessary. Phonetically based confusability measures are not applicable to dysarthric speech.

The following measures use only a training set for a vocabulary of N words W_1, ..., W_N, where w_jk is the k-th repetition of the j-th word, and a set of CDHMMs M_i trained on this data. The measures are based on forced alignment: L_ijk is the per-frame log likelihood of model M_i generating the k-th example of word W_j on the Viterbi path.

Word-level consistency: the consistency for a word is obtained by averaging the L_ijk for the correct word model:

    ψ_i = (Σ_k L_iik) / n_i

where n_i is the total number of examples of W_i. The overall consistency of the training corpus is the average of the ψ_i:

    ψ̄ = (Σ_i ψ_i) / N

The confusability between any two words W_i and W_j is

    C_ij = (Σ_k L_ijk) / n_j

C_ij is the average score obtained by aligning examples of W_j against M_i. The higher this is, the greater the likelihood that W_j will be misrecognised as W_i.
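The measures above can be sketched directly from their definitions, assuming the per-frame Viterbi log likelihoods are already available from forced alignment (the nested-list layout L[i][j][k] is our assumption; the poster defines only the quantities, not a data structure):

```python
import numpy as np

def consistency_and_confusability(L, counts):
    """Compute per-word consistency psi_i and the confusability matrix C_ij.

    L[i][j][k] is the per-frame forced-alignment log likelihood of model M_i
    aligned against the k-th example of word W_j (hypothetical layout).
    counts[j] is n_j, the number of training examples of word W_j.
    """
    N = len(counts)
    C = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            C[i, j] = sum(L[i][j]) / counts[j]  # C_ij = (sum_k L_ijk) / n_j
    psi = np.diag(C)        # psi_i = C_ii: consistency of each word
    psi_bar = psi.mean()    # overall corpus consistency
    return psi, psi_bar, C
```

Note that the consistency ψ_i is simply the diagonal of the confusability matrix, which is why the colour-coded matrices show both quantities at once.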
Table 1: Recognition accuracy (10-word vocabulary)

Speaker                    Recognition Accuracy (%)
MP (Normal)                100
AH (Normal)                100
GR (Severely Dysarthric)   87
JT (Severely Dysarthric)   100
CC (Severely Dysarthric)   96

[Colour-coded confusability matrices; scale runs from low to high confusability, with both axes listing the 10-word vocabulary: TV, Alarm, Lamp, Chan., On, Off, Up, Down, Radio, Volume]
Table 2: Confusability matrix for normal speaker MP (10-word vocabulary)
Table 3: Confusability matrix for dysarthric speaker GR (10-word vocabulary)
Table 4: Test-set confusions superimposed on GR's confusability matrix

Justification: forced-alignment likelihoods will be lower for an inconsistently spoken word than for a consistent one, since its distributions will be flatter. This was confirmed by experiments with mixed training sets (results shown graphically on the poster).
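The screening step described earlier, spotting off-diagonal hot spots such as the 'Alarm'/'Lamp' pair and swapping one member for an alternative, can be sketched as a simple scan of the confusability matrix. The margin-based decision rule below is an illustrative assumption; the poster presents the matrices but not a formal decision procedure:

```python
def flag_confusable_pairs(C, words, margin=1.0):
    """Flag word pairs whose cross-confusability C_ij comes within `margin`
    of the target word's own consistency C_jj (higher C_ij = more confusable).

    The threshold rule is our assumption for illustration, not STARDUST's.
    C may be any indexable N x N matrix of average log likelihoods.
    """
    flagged = []
    N = len(words)
    for i in range(N):
        for j in range(N):
            if i != j and C[i][j] >= C[j][j] - margin:
                # W_j risks being misrecognised as W_i
                flagged.append((words[j], words[i]))
    return flagged
```

On a matrix like GR's, a scan of this kind would surface pairs such as ('Lamp', 'Alarm'), prompting the clinician to replace one item, e.g. with 'Light' or 'Power', before the recogniser is deployed.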

