1 Data Sampling & Progressive Training T. Shinozaki & M. Ostendorf University of Washington In collaboration with L. Atlas

2 Outline
- Motivation
- Options for sampling
- Small task experiments
- Scaling up

3 Motivation
- Reduce experiment costs for development work
  - Need a small task for exploring novel ideas
  - AND want results to scale to bigger tasks, more data, and more sophisticated systems
- Reduce the cost of moving to a new task or language
- Find ways to use large amounts of training data more effectively
- Prior work by Kamm & Meyer (2002) shows performance gains from sampling the training data

4 Human Language Acquisition
- Language learners are initially exposed to more clearly articulated speech (& shorter utterances, smaller vocabularies)
  - Idea: try to select a subset of speech in a more “careful” style for the initial model
- Language learners are exposed to increasing variability gradually
  - Idea: train with gradually increasing data size & variability in style
- From a statistical perspective: such methods may help deal with problems of local optima

5 Child- vs. Adult-Directed Speech
- Question: does child-directed speech provide better initial training data for general ASR?
- Findings from Kirchhoff et al. (2005), training/testing on child- vs. adult-directed speech:
  - Best performance on matched conditions
  - Train on child-directed, test on adult-directed is better than vice versa
  - Child-directed speech is highly variable
- Our conclusion: try to learn cues to clear speech, but don’t use this data for training

6 Options for Initial Sampling
- Random
- Likelihood (from forced alignments)
  - Requires a model to start with, which implies a bias
- Acoustic criteria
  - F0, energy
  - Modulation spectrogram (Atlas): discriminant function trained on child- vs. adult-directed speech
- Combinations
- For likelihood & acoustic criteria, can choose either central or low-frequency cases
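As a minimal sketch of what "central" sampling under a per-utterance criterion might look like (the function, score values, and selection rule here are illustrative assumptions, not the actual system):

```python
import numpy as np

def sample_center(scores, n):
    """Return the indices of the n utterances whose criterion scores are
    closest to the median -- a 'central' sample, as opposed to random
    sampling or emphasizing outlier/low-frequency cases."""
    scores = np.asarray(scores, dtype=float)
    dist = np.abs(scores - np.median(scores))
    return np.argsort(dist)[:n]

# Toy per-utterance log-likelihood scores (hypothetical values)
ll = [-12.0, -5.1, -5.0, -4.9, -30.0, -5.2]
idx = sample_center(ll, 3)  # indices of the 3 most 'typical' utterances
```

The same selection rule applies to any of the criteria above (F0, energy, modulation-spectrogram discriminant score); only the scoring function changes.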

7 Modulation Spectrogram
[Figure: waveform → (Fourier transform) → spectrogram → (second Fourier transform along time) → modulation spectrogram]
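The two-transform pipeline on this slide can be sketched as follows (a minimal NumPy illustration, not the actual analysis code; the frame length, hop, and window are assumptions):

```python
import numpy as np

def modulation_spectrogram(x, frame_len=256, hop=128):
    """Sketch of a modulation spectrogram:
    1) STFT magnitudes give a time-frequency envelope per acoustic band;
    2) a second Fourier transform along time within each band gives
       energy as a function of (acoustic frequency, modulation frequency)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])
    # Acoustic-frequency axis: magnitude spectrum of each windowed frame
    spec = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))  # (time, freq)
    # Modulation-frequency axis: FFT along time of each band's envelope (DC removed)
    return np.abs(np.fft.rfft(spec - spec.mean(axis=0), axis=0))  # (mod freq, freq)

# Toy signal: 1 kHz carrier, amplitude-modulated at 4 Hz
fs = 8000
t = np.arange(fs) / fs
x = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)
mod = modulation_spectrogram(x)
```

With these settings the 4 Hz amplitude modulation shows up as a peak along the modulation-frequency axis in the 1 kHz acoustic band.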

8 Infant- vs. Adult-Directed Difference Pattern
- Modulation spectrograms averaged over 100 samples
- Analysis of motherese, comparing averages of utterance-level patterns:
  - Evidence of more pitch-related modulation in child-directed speech
  - More high-frequency modulation energy in child-directed speech

9 Adult/Infant Classification
- Adult-/infant-directed speech from the Mother-Ease corpus
- Binary classification using LDA with utterance-level features
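A minimal sketch of a two-class Fisher LDA classifier of this kind (the feature choices and all numbers below are hypothetical stand-ins, not values from the corpus):

```python
import numpy as np

def lda_fit(X0, X1):
    """Two-class Fisher LDA: projection w = Sw^{-1}(mu1 - mu0), with the
    decision threshold at the midpoint of the projected class means."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)  # within-class scatter
    w = np.linalg.solve(Sw, mu1 - mu0)
    thr = 0.5 * ((X0 @ w).mean() + (X1 @ w).mean())
    return w, thr

def lda_predict(X, w, thr):
    return (X @ w > thr).astype(int)  # 1 = infant-directed, 0 = adult-directed

# Hypothetical utterance-level features: (mean F0 in Hz, high-band modulation energy)
rng = np.random.default_rng(0)
adult = rng.normal([120.0, 0.2], [10.0, 0.05], size=(200, 2))
infant = rng.normal([200.0, 0.5], [10.0, 0.05], size=(200, 2))
w, thr = lda_fit(adult, infant)
acc = 0.5 * ((lda_predict(adult, w, thr) == 0).mean()
             + (lda_predict(infant, w, thr) == 1).mean())
```

The projected score `X @ w` can also serve directly as the modulation-spectrogram discriminant criterion for sampling, rather than the hard 0/1 decision.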

10 Experimental Setup
- Test set: segments from the RT03 EVAL test set with OOV rate under 10%; tune: 35 min, test: 32 min (male speakers)
- Lexicon: 1k vocabulary of the most frequent words; 5.1k entries including multi-words and multiple pronunciations
- Training set: 16 hours of utterances sampled from SWBD and FISHER
- Language model: the 2004 CTS EVAL bigram projected onto the 1k vocabulary

11 Small Task Experiments
- Always best to sample from the center
- All methods outperform random sampling
- Mod-spectrum is best

12 Two-Stage Training
- Questions:
  - Does 2-stage training with increased data in the second stage improve over 1-stage training with the full amount?
  - Should the sampling criterion emphasize outliers in the 2nd stage (as in the Kamm WER criterion)?
  - Is it useful to constrain model means to the initial prototypes & update only the variances?
- Two sets of experiments with HTK:
  - Likelihood-based sampling (RT04)
  - Different sampling techniques (new)
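To make the mean-constraint question concrete, here is a toy sketch of the general idea (our own illustration on a 1-D Gaussian mixture, not the HTK recipe): stage 1 fits a diagonal GMM on a small central subset, and stage 2 continues EM on the full data with the means frozen at their stage-1 prototypes, updating only variances and weights.

```python
import numpy as np

def em_diag_gmm(X, means, variances, weights, n_iter=20, update_means=True):
    """EM for a diagonal-covariance GMM. With update_means=False the means
    stay fixed (the stage-1 prototypes); only variances/weights move."""
    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] of each component for each point
        log_p = (-0.5 * (((X[:, None, :] - means) ** 2) / variances
                         + np.log(2 * np.pi * variances)).sum(axis=2)
                 + np.log(weights))
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        nk = r.sum(axis=0)
        # M-step: optionally skip the mean update
        if update_means:
            means = (r.T @ X) / nk[:, None]
        variances = (r[:, :, None] * (X[:, None, :] - means) ** 2).sum(axis=0) / nk[:, None]
        weights = nk / len(X)
    return means, variances, weights

rng = np.random.default_rng(1)
full = np.concatenate([rng.normal(-2.0, 0.5, (500, 1)), rng.normal(2.0, 1.5, (500, 1))])
center = full[np.argsort(np.abs(full[:, 0] - np.median(full)))[:200]]  # stage-1 subset

m = np.array([[-1.0], [1.0]]); v = np.ones((2, 1)); p = np.array([0.5, 0.5])
m, v, p = em_diag_gmm(center, m, v, p)                       # stage 1: small, central data
m2, v2, p2 = em_diag_gmm(full, m, v, p, update_means=False)  # stage 2: means frozen
```

Stage 2 leaves the means untouched while the variances re-estimate against the larger, more variable data.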

13 RT04 2-Stage Training Results
- (Chart legend: Uniform, Centered, Trimmed, Flat; 500-word task)
- 2-stage training helps
- Bad to emphasize outliers (uniform sampling)
- Constraining means helps

14 Recent Results
- Different sampling techniques: log likelihood (LL), fundamental frequency (F0), log energy (ENG), modulation spectrogram (MS)
- Previous results used an approximation of the likelihood, so we reran
- 1st stage: 16 hrs of utterances sampled from the center
- 2nd stage: 48 hrs of randomly sampled utterances
- Computation time is only 40% of the one-stage method

15 Two-Stage Recognition Results
- 2-stage still better than 1-stage
- Now better to adapt all parameters

16 Do Results Scale?
Changing from:
- HTK → SRI system
- Small (1k) → large (38k) vocabulary
- Bi-gram → tetra-gram
- 16 hrs → 16, 32, 64, 192 hrs training data
- Small task test set → 2004 DevTest set

17 Sampling Methods and WER
- Random sampling is as good as or better than the other methods as data size increases (64 hrs); still some LL advantage with 32 hrs

18 Two-Stage Results
- The 2-stage training win disappears as data size increases

19 Summary
- Good news:
  - Acoustic sampling criteria motivated by child language acquisition are useful for defining small training sets
  - Two-stage training is better than one-stage for small data sets, and much less costly
- Not-so-good news:
  - The small-task results scaled to the SRI large-vocabulary system with 16 hrs of data, but not (yet) with larger data sets

20 Open Questions
- Do more gradual training schedules help? (e.g., 16 → 32 → 64)
- Issues associated with tree clustering
- Experiments on female speakers
- Revisit the outlier-emphasis experiments with the bigger system

