Learning Long-Term Temporal Features


1 Learning Long-Term Temporal Features
A Comparative Study
Barry Chen
Speech Lunch Talk, May 4, 2004

2 Log-Critical Band Energies
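As an illustration of this front end, here is a minimal numpy sketch of computing log critical-band energies; the 8 kHz rate, 32 ms/10 ms framing, 15 bands, and mel-style warping are assumptions for illustration, not the talk's exact parameters.

```python
import numpy as np

def log_critical_band_energies(signal, sr=8000, n_fft=256, hop=80, n_bands=15):
    """Log energies of triangular, mel-spaced filters over short frames.
    All parameters here are illustrative assumptions."""
    win = np.hamming(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[t * hop:t * hop + n_fft] * win
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (n_frames, n_fft//2+1)

    # Triangular filters on a mel-warped axis, as a stand-in for true
    # Bark critical bands.
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = hz(np.linspace(mel(100.0), mel(sr / 2.0), n_bands + 2))
    fbank = np.zeros((n_bands, freqs.size))
    for b in range(n_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        fbank[b] = np.clip(np.minimum((freqs - lo) / (mid - lo),
                                      (hi - freqs) / (hi - mid)), 0.0, None)

    return np.log(power @ fbank.T + 1e-10)                # (n_frames, n_bands)
```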

3 Log-Critical Band Energies
Conventional Feature Extraction

4 Log-Critical Band Energies
TRAPS/HATS Feature Extraction

5 What is a TRAP? (Background Tangent)
TRAPs were originally developed by our colleagues at OGI: Sharma, Jain (now at SRI), Hermansky and Sivadas (both now at IDIAP)
Stands for TempoRAl Pattern
TRAP = a narrow-frequency speech energy pattern over a period of time (usually 0.5–1 second long)
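Given a (frames × bands) log-energy matrix like the one sketched above, a TRAP is just a long slice of one band's trajectory. A minimal numpy sketch, assuming 10 ms frames (so 51 frames ≈ 0.5 s) and a per-pattern mean normalization, which is common in TRAP front ends but not necessarily the talk's exact choice:

```python
import numpy as np

def extract_traps(log_cbe, context=25):
    """Slice each band's log-energy trajectory into (2*context+1)-frame
    patterns centered on every frame. With 10 ms frames, context=25 gives
    51 frames, i.e. about 0.5 s, the low end of the range on the slide."""
    n_frames, n_bands = log_cbe.shape
    padded = np.pad(log_cbe, ((context, context), (0, 0)), mode='edge')
    traps = np.empty((n_frames, n_bands, 2 * context + 1))
    for t in range(n_frames):
        window = padded[t:t + 2 * context + 1].T          # (n_bands, 51)
        # per-pattern mean removal: one common TRAP normalization
        traps[t] = window - window.mean(axis=1, keepdims=True)
    return traps                                          # (n_frames, n_bands, 51)
```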

6 Example of TRAPS Mean Temporal Patterns for 45 phonemes at 500 Hz

7 TRAPS Motivation
Psychoacoustic studies suggest that the human peripheral auditory system integrates information on a longer time scale
Information measurements (joint mutual information) show that information still exists >100 ms away within a single critical band
Potential robustness to speech degradations

8 Let’s Explore
TRAPS and HATS are examples of a specific two-stage approach to learning long-term temporal features
Is this constrained two-stage approach better than an unconstrained one-stage approach?
Are the non-linear transformations of critical-band trajectories, provided in different ways by TRAPS and HATS, actually necessary?

9 Learn Everything in One Step

10–18 Learn in Individual Bands

19 One-Stage Approach

20 2-Stage Linear Approaches

21 PCA/LDA Comments
PCA on log critical-band energy trajectories scales and rotates dimensions in directions of highest variance
LDA projects in directions that maximize class separability, measured by the ratio of between-class to within-class covariance
Keep the top 40 dimensions for comparison with the MLP-based approaches
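A sketch of this first linear stage with scikit-learn, fitting PCA or LDA on one band's 51-frame trajectories and keeping 40 components; the function and variable names, and the per-band application, are assumptions for illustration:

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_first_stage(trajectories, labels=None, n_dims=40):
    """Fit a first-stage linear transform for one critical band.
    trajectories: (n_examples, 51) single-band temporal patterns.
    With phone labels, LDA finds directions maximizing between-class
    over within-class covariance; without them, PCA keeps the
    directions of highest variance. n_dims=40 matches the slide."""
    if labels is not None:
        return LinearDiscriminantAnalysis(n_components=n_dims).fit(
            trajectories, labels)
    return PCA(n_components=n_dims).fit(trajectories)

# usage: reduced = fit_first_stage(X_band, phone_labels).transform(X_band)
```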

22 2-Stage MLP-Based Approaches

23 MLP Comments
As with the other 2-stage approaches, we first learn patterns independently in separate critical-band trajectories, and then learn correlations among these discriminative trajectories
Interpretation of the various MLP layers (see the sketch after this list):
Input-to-hidden weights: discriminant linear transformations
Hidden unit outputs: non-linear discriminant transforms
Before softmax: transforms hidden-activation space to unnormalized phone-probability space
Output activations: critical-band phone probabilities
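A minimal PyTorch sketch of one such critical-band MLP makes these tap points concrete; the hidden size (60) and phone count (46) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CriticalBandMLP(nn.Module):
    """One band's MLP over a 51-frame trajectory; layer sizes are
    illustrative assumptions, not the talk's exact configuration."""
    def __init__(self, n_frames=51, n_hidden=60, n_phones=46):
        super().__init__()
        self.hidden = nn.Linear(n_frames, n_hidden)  # input-to-hidden: discriminant linear transform
        self.out = nn.Linear(n_hidden, n_phones)     # hidden activations -> unnormalized phone scores

    def forward(self, x):
        pre_sigmoid = self.hidden(x)                     # "HATS before sigmoid" tap point
        hats = torch.sigmoid(pre_sigmoid)                # hidden unit outputs: non-linear transform (HATS)
        pre_softmax = self.out(hats)                     # "TRAPS before softmax" tap point
        posteriors = torch.softmax(pre_softmax, dim=-1)  # critical-band phone probabilities (TRAPS)
        return hats, posteriors
```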

24 Experimental Setup
Training: ~68 hours of conversational telephone speech from English CallHome, Switchboard I, and Switchboard Cellular; 1/10 used as a cross-validation set for the MLPs
Testing: 2001 Hub-5 Evaluation Set (Eval2001): 2,255,609 frames and 62,890 words
Back-end recognizer: SRI’s Decipher system, 1st-pass decoding using a bigram language model and within-word triphone acoustic models (thanks to Andreas Stolcke for all his help)

25 Frame Accuracy Performance

26 Standalone Feature System
Transform MLP outputs by:
log transform to make features more Gaussian
PCA for decorrelation
Same as the Tandem setup introduced by Hermansky, Ellis, and Sharma
Use transformed MLP outputs as front-end features for the SRI recognizer
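A short sketch of that transform, assuming the MLP outputs arrive as a (frames × classes) array; in a real pipeline the PCA would be fit on training data only:

```python
import numpy as np
from sklearn.decomposition import PCA

def tandem_features(posteriors, n_dims=40, eps=1e-10):
    """Tandem-style post-processing of MLP outputs: log transform to
    Gaussianize, PCA to decorrelate. Fitting PCA in place keeps the
    sketch short; n_dims is an illustrative assumption."""
    logp = np.log(posteriors + eps)
    return PCA(n_components=n_dims).fit_transform(logp)
```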

27 Standalone Features

28 Combination W/State-of-the-Art Front-End Feature
SRI’s 2003 PLP front-end feature is 12th-order PLP with three deltas; heteroscedastic linear discriminant analysis (HLDA) then transforms this 52-dimensional feature vector to 39-dimensional HLDA(PLP+3d)
Concatenate PCA-truncated MLP features to HLDA(PLP+3d) and use as an augmented front-end feature
Similar to the Qualcomm-ICSI-OGI features in AURORA
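A sketch of the concatenation step, with array names assumed for illustration:

```python
import numpy as np

def augmented_frontend(hlda_plp, mlp_feats):
    """Frame-wise concatenation of the 39-dim HLDA(PLP+3d) vectors
    with the PCA-truncated MLP features."""
    assert hlda_plp.shape[0] == mlp_feats.shape[0], "same number of frames"
    return np.concatenate([hlda_plp, mlp_feats], axis=1)
```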

29 Combo W/PLP Baseline Features

30 Ranking Table

31 Observations
Across the three testing setups:
HATS is always #1
The one-stage 15 Bands x 51 Frames is always #6, second to last
TRAPS is always last
PCA, LDA, HATS before sigmoid, and TRAPS before softmax flip-flop in performance

32 Interpretation
The learning constraints introduced by the 2-stage approach are helpful if done right:
The non-linear discriminant transform of HATS is better than the linear discriminant transforms from LDA and from HATS before sigmoid
The further mapping from hidden activations to critical-band phone posteriors is not helpful; perhaps mapping to critical-band phones is too difficult and inherently noisy
Finally, like TRAPS, HATS is complementary to more conventional features and combines synergistically with PLP 9 Frames


