Reza Yazdani Albert Segura José-María Arnau Antonio González

An Ultra Low-Power Hardware Accelerator for Automatic Speech Recognition

Automatic Speech Recognition (ASR)

ASR Requirements
Voice-based user interfaces for mobile devices
  Large vocabulary
  Speaker-independent
  High accuracy
  Real-time performance
  Energy efficiency

ASR Solutions
General-purpose platforms

Outline
Motivation
Automatic Speech Recognition
Accelerated ASR System
Memory Subsystem Optimizations
  Prefetcher
  Bandwidth Reduction
Experimental Results
Conclusions

Automatic Speech Recognition
State-of-the-art ASR system
Hybrid model: DNN + HMM
[Figure: Sound Signal -> Feature Extraction -> Likelihood Computation -> Graph Search -> Speech (words); one label in the figure reads "GPU"]

Graph Search
[Figure labels: Acoustic model, Dictionary, Language model; Training; Graph Generator; Weighted Finite-State Transducer (WFST); Viterbi Search]

Viterbi Search
A simple example of a WFST for detecting two words: "three" and "two"

Viterbi Search
[Figure: token likelihoods for the example WFST across Frame 0 to Frame 3; low-likelihood paths are pruned at each frame, and the recognized word is THREE]
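The frame-by-frame search with pruning shown in this example can be sketched in a few lines of Python. This is only an illustration, not the accelerator's algorithm: the toy graph `ARCS`, its arc probabilities, the per-frame acoustic likelihoods, and the `beam` threshold are all made-up values, and self-loops and epsilon arcs are left out for brevity.

```python
# Hypothetical toy graph: state -> list of (next_state, phoneme, arc_prob, output_word)
ARCS = {
    0: [(1, "th", 0.6, None), (4, "t", 0.4, None)],
    1: [(2, "r", 1.0, None)],
    2: [(3, "iy", 1.0, "THREE")],
    4: [(5, "uw", 1.0, "TWO")],
}

def viterbi_beam(frames, arcs, start=0, beam=0.1):
    """Token-passing Viterbi search with beam pruning.

    frames: one dict per frame mapping phoneme -> acoustic likelihood.
    Returns the surviving tokens as {state: (likelihood, words)}.
    """
    tokens = {start: (1.0, [])}
    for scores in frames:
        new_tokens = {}
        for state, (p, words) in tokens.items():
            for nxt, phone, arc_p, word in arcs.get(state, []):
                cand = p * arc_p * scores.get(phone, 0.0)
                # Keep only the best-scoring path into each state.
                if cand > new_tokens.get(nxt, (0.0, None))[0]:
                    new_words = words + [word] if word else words
                    new_tokens[nxt] = (cand, new_words)
        if not new_tokens:
            break
        # Pruning: drop tokens far below the best token of this frame.
        best = max(p for p, _ in new_tokens.values())
        tokens = {s: t for s, t in new_tokens.items() if t[0] >= best * beam}
    return tokens

# Assumed acoustic likelihoods for three frames of audio.
frames = [{"th": 0.9, "t": 0.1}, {"r": 0.8, "uw": 0.2}, {"iy": 0.7}]
result = viterbi_beam(frames, ARCS)
print(result)  # the only surviving token is in state 3 and carries ['THREE']
```

With these assumed numbers, the path toward "two" falls below the beam after the first frame and is pruned, so later frames only expand the path toward "three".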

Accelerated ASR System

Accelerator’s Architecture
Average active states on each frame evaluation: less than 1%
Solution: hash table mapping State ID to Token Info
[Figure labels: Viterbi Accelerator; Main Memory; WFST; Dynamic Search Graph; Acoustic Scores; tokens for Frame i and Frame i+1; acoustic score table for frame t over the phonemes th, uw, r, iy]
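Because only a small fraction of states is active in any frame, a table keyed by state ID is far more compact than an array with one slot per WFST state. The sketch below shows one way such a token table could behave in software. The class name `TokenTable`, the linear-probing collision policy, and the `(score, backpointer)` token layout are assumptions made for illustration; the slides do not specify how the hardware hash table resolves collisions.

```python
class TokenTable:
    """Fixed-capacity hash table: state ID -> (score, backpointer)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.keys = [None] * capacity
        self.tokens = [None] * capacity
        self.count = 0

    def _find(self, state_id):
        # Linear probing (assumed policy): stop at the key or the first free slot.
        i = state_id % self.capacity
        for _ in range(self.capacity):
            if self.keys[i] is None or self.keys[i] == state_id:
                return i
            i = (i + 1) % self.capacity
        return None  # table is full and the key is absent

    def update(self, state_id, score, backpointer):
        """Insert a token, or replace the stored one if the new score is better."""
        i = self._find(state_id)
        if i is None:
            raise OverflowError("token table is full")
        if self.keys[i] is None:
            self.keys[i] = state_id
            self.tokens[i] = (score, backpointer)
            self.count += 1
        elif score > self.tokens[i][0]:
            self.tokens[i] = (score, backpointer)

    def get(self, state_id):
        i = self._find(state_id)
        if i is None or self.keys[i] is None:
            return None
        return self.tokens[i]

table = TokenTable(capacity=4)
table.update(6, 0.25, 1)   # new token for state 6
table.update(2, 0.12, 1)   # collides with state 6 (6 % 4 == 2 % 4), probes onward
table.update(6, 0.9, 2)    # better path into state 6 replaces the old token
print(table.get(6), table.get(2), table.get(7))
```

Keeping one table for the current frame and one for the next frame matches the Frame i / Frame i+1 split in the figure; swapping them after each frame is a natural, though assumed, usage pattern.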

Potential Improvement
Perfect caches and hash tables
Speedups with respect to the baseline architecture
  94.6% improvement
Large memory footprint (34 million arcs)

Hardware Prefetching
Dynamic access to a small, sparsely distributed subset of arcs
  On average: 25K out of 34M arcs
Conventional prefetchers are inefficient
  Graph search exhibits an unpredictable access pattern
  Pruning unlikely paths causes more unpredictability
Our proposed scheme is based on the decoupled access-execute paradigm
  All memory addresses are deterministic after pruning
  Memory requests are issued well in advance
  High accuracy: addresses are computed rather than predicted
  Timeliness: a reorder buffer avoids early evictions
94% speedup with a negligible area overhead of 0.05%
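The idea that arc addresses become deterministic once pruning has finished can be illustrated with a small software model. In this sketch an "access" stage walks the list of surviving states and computes every arc address ahead of time, while an "execute" stage consumes them in the same order. Everything here is hypothetical: the `ARC_BYTES` record size, the `state_table` layout of `(first_arc_index, num_arcs)`, and the `lookahead` depth are illustrative choices, and the plain FIFO is a simplification of the reorder buffer mentioned in the slide.

```python
from collections import deque

ARC_BYTES = 16  # assumed size of one arc record

def arc_addresses(active_states, state_table, arc_base=0):
    """Access stage: compute (not predict) the address of every arc to fetch."""
    for state in active_states:
        first_arc, num_arcs = state_table[state]
        for k in range(num_arcs):
            yield arc_base + (first_arc + k) * ARC_BYTES

def run_decoupled(active_states, state_table, lookahead=4):
    """Issue up to `lookahead` requests ahead of the execute stage."""
    access = arc_addresses(active_states, state_table)
    in_flight = deque()   # requests issued but not yet consumed, in order
    consumed = []
    max_outstanding = 0
    while True:
        # Access stage runs ahead until the window is full or work runs out.
        while len(in_flight) < lookahead:
            addr = next(access, None)
            if addr is None:
                break
            in_flight.append(addr)
        if not in_flight:
            break
        max_outstanding = max(max_outstanding, len(in_flight))
        # Execute stage consumes the oldest request.
        consumed.append(in_flight.popleft())
    return consumed, max_outstanding

# Hypothetical state table: state -> (first_arc_index, num_arcs)
state_table = {3: (0, 2), 7: (10, 3)}
order, outstanding = run_decoupled([7, 3], state_table, lookahead=4)
print(order, outstanding)
```

The point of the model is that the execute stage never waits on an address computation: every request in the window was generated from the already-pruned token list, so none of them is speculative.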

Bandwidth Reduction
97% of dynamically expanded states have fewer than 16 arcs
A novel technique for directly computing arc addresses
  Changes the memory layout of the WFST dataset
  Avoids the memory access for fetching a state's data
20% memory bandwidth saving at a negligible cost of a 0.02% area increase
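One layout that allows arc addresses to be computed directly is to renumber states so that states with the same number of arcs are contiguous; the first arc of a state then follows from a small per-group table instead of a per-state record in memory. The sketch below assumes exactly that layout. It is an illustration of the general idea, not a description confirmed by the slides, and the function names `build_groups` and `first_arc` are hypothetical. In a design following the slide, only states with fewer than 16 arcs would presumably use this path, with the rest still reading their state record.

```python
from bisect import bisect_right

def build_groups(degrees):
    """Build a small group table from per-state arc counts.

    degrees[s] is the number of arcs of state s; states are assumed to be
    renumbered so that arc counts are non-decreasing and arcs are stored
    back to back in state order.
    Returns a list of (first_state, degree, first_arc_index) per group.
    """
    groups = []
    next_arc = 0
    for state, degree in enumerate(degrees):
        if not groups or groups[-1][1] != degree:
            groups.append((state, degree, next_arc))
        next_arc += degree
    return groups

def first_arc(state, groups):
    """Compute (first_arc_index, num_arcs) for a state without a memory lookup."""
    starts = [g[0] for g in groups]
    first_state, degree, base = groups[bisect_right(starts, state) - 1]
    return base + (state - first_state) * degree, degree

degrees = [1, 1, 2, 2, 2, 3]          # assumed arc counts, sorted ascending
groups = build_groups(degrees)
print(groups)                          # one entry per distinct arc count
print(first_arc(3, groups))            # state 3: second state in the 2-arc group
```

The group table has at most one entry per distinct arc count, so it is small enough to hold on chip, which is what removes the per-state memory access in this sketch.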

Evaluation Methodology
Viterbi accelerator's timing estimation
  Cycle-accurate simulator: execution and activity factors
  RTL Verilog model for logic components: design frequency
  Memory parts modeled with CACTI: cache and memory latency
Power model
  Memory and caches: CACTI
  Logic: Synopsys Design Compiler
Technology node: 28 nm

Experimental Results
111.47x speedup
16.7x speedup
1185x reduction

Conclusion
Viterbi search is the main bottleneck in ASR systems
General-purpose solutions
  Not real-time for large speech models
  High energy consumption
Design of an accelerator tailored for the Viterbi search
  More energy-efficient (by orders of magnitude)
Memory subsystem is the main challenge of ASR
  Arc prefetcher
  Memory bandwidth reduction
1.7x faster than an NVIDIA GTX 980, with 287x less energy
