Distributed Learning of Multilingual DNN Feature Extractors using GPUs
Yajie Miao, Hao Zhang, Florian Metze
Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Introduction

As the state of the art for speech recognition, DNNs are particularly suitable for multilingual and cross-lingual ASR. A multilingual DNN is trained over a group of languages, with its hidden layers shared across the languages. Given a new language, the shared hidden layers act as a deep feature extractor.

[Figure: a multilingual DNN with Lang 1/2/3 inputs and language-specific softmax layers on top of shared hidden layers; the shared layers serve as the feature extractor for a hybrid DNN on the target language.]

Goal. With multiple GPUs available, we aim to parallelize the learning of the feature extractor over large amounts of multilingual training data.

Highlight. We study how parallelization affects the quality of the feature extractors. Feature extractor learning turns out to be robust to infrequent thread synchronization, so time-synchronous model averaging achieves good speed-up.

DistLang: Distribution by Languages

1. Basic Idea
Each GPU trains the DNN model as a language-specific feature extractor. On the target language, each speech frame is fed into these separate extractors, and the resulting feature vectors are fused into a single feature representation.

[Figure: in DistLang, each source language is assigned to its own GPU (GPU #1, GPU #2, ...); on the target language, the per-extractor outputs are combined (+) and passed to the target DNN.]

2. Two Methods for Feature Fusion
- FeatConcat: concatenate the outputs of the language-specific feature extractors into a single vector.
- FeatMix: fuse the feature vectors via a linear weighted combination. The combined feature vector can be computed as f = Σ_n a_n f_n + b, where f_n is the feature vector from the n-th extractor, a_n contains the weights for the features from the n-th extractor, and b is a bias (vector).
A small sketch of both fusion schemes is given below, after the DistModel section.

3. Pros & Cons
- No communication cost, hence perfect speed-up.
- Inclusion of new source languages is easy: no need to retrain from scratch.
- The number of GPUs is fixed by the number of source languages.

DistModel: Distribution by Model

The training data of each language is partitioned evenly across the GPUs. After a specified number of mini-batches (the averaging interval), the feature extractors from the individual GPUs are averaged into a unified model, and the averaged parameters are sent back to each GPU as the new starting model for the subsequent training. This is a time-synchronous method; however, on this particular feature-learning task, DistModel is robust to averaging intervals as large as 2000 mini-batches. A minimal sketch of the averaging procedure is also given below.

[Figure: DistModel data partitioning, in which the 90 hours of each of Lang 1, Lang 2 and Lang 3 are split into 30-hour shards; GPU #1, #2 and #3 train Extractor 1, 2 and 3 on their shards, which are merged into an Averaged Extractor after every averaging interval.]
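As mentioned in the DistLang section above, the two fusion schemes can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the poster's implementation: the 1024-dimensional extractor outputs are random stand-ins, the element-wise form of the FeatMix weights is an assumption, and the names feat_concat and feat_mix are ours.

```python
import numpy as np

N_EXTRACTORS = 3   # one language-specific feature extractor per source language
FEAT_DIM = 1024    # assumed output dimension of each extractor

rng = np.random.default_rng(0)

# Stand-ins for the outputs of the three extractors on one target-language frame.
extractor_outputs = [rng.standard_normal(FEAT_DIM) for _ in range(N_EXTRACTORS)]

def feat_concat(outputs):
    """FeatConcat: concatenate the extractor outputs into a single long vector."""
    return np.concatenate(outputs)                 # shape (N_EXTRACTORS * FEAT_DIM,)

def feat_mix(outputs, weights, bias):
    """FeatMix: linear weighted combination  f = sum_n a_n * f_n + b."""
    mixed = bias.copy()
    for a_n, f_n in zip(weights, outputs):
        mixed += a_n * f_n                         # element-wise weights (assumption)
    return mixed                                   # shape (FEAT_DIM,)

# Placeholder fusion parameters; the poster only defines a_n as the weights for
# the n-th extractor's features and b as a bias vector.
weights = [np.full(FEAT_DIM, 1.0 / N_EXTRACTORS) for _ in range(N_EXTRACTORS)]
bias = np.zeros(FEAT_DIM)

print(feat_concat(extractor_outputs).shape)              # (3072,)
print(feat_mix(extractor_outputs, weights, bias).shape)  # (1024,)
```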
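Likewise, a minimal single-process simulation of the DistModel averaging loop is sketched below. It is not the authors' code: init_extractor_params and train_one_minibatch are hypothetical toy placeholders standing in for the real feature extractor and the per-GPU SGD updates on each GPU's data shard.

```python
import numpy as np

N_GPUS = 3
AVERAGING_INTERVAL = 2000   # mini-batches between synchronizations
N_SYNC_ROUNDS = 5           # kept small for illustration

def init_extractor_params(dim=8):
    """Toy stand-in for the shared hidden layers (the feature extractor)."""
    rng = np.random.default_rng(0)
    return {"W": 0.1 * rng.standard_normal((dim, dim)), "b": np.zeros(dim)}

def train_one_minibatch(params, rng):
    """Hypothetical per-GPU update on one mini-batch of that GPU's data shard.
    A real implementation would run a forward/backward pass and an SGD step."""
    return {k: v - 1e-4 * rng.standard_normal(v.shape) for k, v in params.items()}

def average_params(replicas):
    """Time-synchronous model averaging of the per-GPU feature extractors."""
    return {k: np.mean([rep[k] for rep in replicas], axis=0) for k in replicas[0]}

shared = init_extractor_params()
shard_rngs = [np.random.default_rng(gpu) for gpu in range(N_GPUS)]  # one "shard" per GPU

for _ in range(N_SYNC_ROUNDS):
    replicas = []
    for gpu in range(N_GPUS):
        # Each GPU starts from the shared model and trains independently
        # for one averaging interval on its portion of every language's data.
        local = {k: v.copy() for k, v in shared.items()}
        for _ in range(AVERAGING_INTERVAL):
            local = train_one_minibatch(local, shard_rngs[gpu])
        replicas.append(local)
    # Average the extractors and send the result back as the new starting model.
    shared = average_params(replicas)
```

Because the replicas only communicate once per averaging interval, a larger interval means less frequent synchronization, which is consistent with the poster's observation of monotonically better speed-up as the interval grows.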
Datasets and Experimental Setup

1. Two evaluation conditions on the BABEL corpus:

   Condition      Source Languages                                               Target Language
   Preliminary    Cantonese, Turkish and Pashto (226 Hr)                         10Hr set of Tagalog
   Larger-scale   Cantonese, Turkish, Pashto, Tagalog and Vietnamese (460 Hr)    10Hr set of Bengali

   Corpora: Tagalog - IARPA-babel106-v0.2f; Cantonese - IARPA-babel101-v0.4c; Turkish - IARPA-babel105b-v; Pashto - IARPA-babel104b-v0.4aY; Vietnamese - IARPA-babel101-v0.4c; Bengali - IARPA-babel103b-v0.4b.

2. Protocol. We measure WERs on the target language, using the identical DNN architecture on top of the various feature extractors.

3. Metrics. WER (%) of the hybrid DNN model on a 2-hour testing set of the target language, and speed-up, the ratio of the training time taken using a single GPU to the time taken using multiple GPUs (a speed-up of 2.5, for example, means multi-GPU training takes 40% of the single-GPU training time).

Preliminary Evaluation

WER% and speed-up of DistModel as the averaging interval increases (averaging interval on 3 GPUs):

   Methods          WER%   Speed-up
   Single GPU       49.3   ----
   DistModel-600    50.5   1.9
   DistModel-1000          2.2
   DistModel-2000   50.8   2.5

- With a larger averaging interval, we obtain monotonically better speed-up; 2000 mini-batches seems to be a good tradeoff point.
- Applied to a monolingual DNN on Tagalog FullLP, the WER degradation is enlarged, which shows that DistModel is particularly useful for multilingual DNN training.

WER% of DistLang with the two feature fusion methods:

   Methods                  Feature Dim   WER%
   DistLang - FeatConcat    1024          61.4
   DistLang - FeatMix       1024          61.6
   DistLang - FeatConcat    341           60.3
   DistLang - FeatMix       341           60.7

- DistLang always gives a speed-up of about 3.0.
- It is worse than DistModel, partly because of language dependence.
- FeatConcat is slightly better than FeatMix.

Larger-Scale Evaluation

WER% and speed-up of DistModel as the number of GPUs increases:

   Methods                WER%   Speed-up
   Monolingual DNN        72.5   ---
   Single GPU             65.7
   DistModel - 3 GPUs     66.2   2.4
   DistModel - 4 GPUs     66.7   3.1
   DistModel - 5 GPUs     66.8   3.4

- The acceleration is consistent, although the improvement is not linear.
- Pooling more GPUs degrades WERs on the target language; this degradation might be mitigated by further optimization.

Acknowledgements

This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense U.S. Army Research Laboratory (DoD/ARL) contract number W911NF-12-C. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.