Reza Yazdani, Albert Segura, José-María Arnau, Antonio González

Presentation transcript:

An Ultra Low-Power Hardware Accelerator for Automatic Speech Recognition
Reza Yazdani, Albert Segura, José-María Arnau, Antonio González

Automatic Speech Recognition (ASR) Reza Yazdani An Ultra Low-Power Hardware Accelerator for Automatic Speech Recognition

ASR Requirements
- Voice-based user interfaces for mobile devices
- Large vocabulary
- Speaker-independent
- High accuracy
- Real-time performance
- Energy efficiency

ASR Solutions
- General-purpose platforms

Outline
- Motivation
- Automatic Speech Recognition
- Accelerated ASR System
- Memory Subsystem Optimizations
  - Prefetcher
  - Bandwidth Reduction
- Experimental Results
- Conclusions

Automatic Speech Recognition
- State-of-the-art ASR system
- Hybrid model: DNN + HMM
[Figure: ASR pipeline. Labels: Sound Signal, Feature Extraction, Likelihood Computation, Graph Search, Speech (words), GPU]

Graph Search
[Figure: graph search flow. Labels: Dictionary, Acoustic model, Language model, Training, Graph Generator, Weighted-Finite-State-Transducer, Viterbi Search]

Viterbi Search
- A simple example of a WFST for detecting two words: "three" and "two"

Viterbi Search
[Figure: Viterbi search on the example WFST over Frame 0 to Frame 3, showing path likelihoods, pruning of paths ("Pruning!"), and the word label THREE]
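
The frame-by-frame search with pruning can be sketched in a few lines. This is a minimal illustration, not the accelerator's implementation: the toy WFST in `ARCS`, the acoustic likelihoods in `FRAMES`, the beam width, and the function name `viterbi_beam` are all assumptions made up for this example, and none of the numbers come from the slides.

```python
import math

# Hypothetical toy WFST for the words "three" and "two":
# state -> [(next_state, phone, output_word, transition_probability)]
ARCS = {
    0: [(1, "th", None, 0.6), (4, "t", None, 0.4)],
    1: [(1, "th", None, 0.5), (2, "r", None, 0.5)],
    2: [(2, "r", None, 0.5), (3, "iy", "THREE", 0.5)],
    3: [(3, "iy", None, 1.0)],
    4: [(4, "t", None, 0.5), (5, "uw", "TWO", 0.5)],
    5: [(5, "uw", None, 1.0)],
}

# Hypothetical acoustic likelihoods, one dict per frame of speech.
FRAMES = [
    {"th": 0.9, "t": 0.05, "r": 0.02, "iy": 0.02, "uw": 0.01},
    {"th": 0.1, "t": 0.05, "r": 0.7, "iy": 0.05, "uw": 0.1},
    {"th": 0.02, "t": 0.02, "r": 0.1, "iy": 0.8, "uw": 0.06},
]


def viterbi_beam(arcs, frames, beam=3.0, start=0):
    """Return (best word sequence, sorted active states after each frame)."""
    tokens = {start: (0.0, ())}  # state -> (log score, words emitted so far)
    history = []
    for scores in frames:
        nxt = {}
        for state, (score, words) in tokens.items():
            for dst, phone, word, prob in arcs[state]:
                s = score + math.log(prob) + math.log(scores[phone])
                w = words + (word,) if word else words
                # Viterbi step: keep only the best path into each state.
                if dst not in nxt or s > nxt[dst][0]:
                    nxt[dst] = (s, w)
        # Beam pruning: drop tokens too far below the best score of the frame.
        best = max(s for s, _ in nxt.values())
        tokens = {st: t for st, t in nxt.items() if t[0] >= best - beam}
        history.append(sorted(tokens))
    best_state = max(tokens, key=lambda st: tokens[st][0])
    return list(tokens[best_state][1]), history


if __name__ == "__main__":
    words, active = viterbi_beam(ARCS, FRAMES)
    print(words, active)
```

Scores are kept in the log domain so that products of small probabilities become sums. A wider `beam` keeps more states alive per frame and trades work for robustness against early mistakes.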

Accelerated ASR System

Accelerator's Architecture
- Average active states on each frame evaluation: less than 1%!
- Solution: Hash Table
[Figure: accelerator block diagram. Labels: Viterbi Accelerator, Main Memory, WFST, Dynamic Search Graph, Acoustic Scores, word outputs w1 to w4, Frame i, Frame i+1, hash table with State ID, State Index and Token Info columns, and a table of per-frame acoustic scores for the phones t, th, uw, r, iy]
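
Because only a small fraction of states is active in any frame, the active tokens can be kept in a small table keyed by state ID rather than in an array covering the whole WFST. The sketch below assumes a fixed-capacity, linearly probed table and a token that holds only a score; the class name `TokenTable`, its methods, and the capacity are hypothetical and are not taken from the slides.

```python
class TokenTable:
    """Fixed-capacity hash table mapping state ID -> best token score (sketch)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.keys = [None] * capacity
        self.scores = [0.0] * capacity

    def _slot(self, state_id):
        """Index of the slot holding state_id, or of the first free slot, else None."""
        i = state_id % self.capacity
        for _ in range(self.capacity):
            if self.keys[i] is None or self.keys[i] == state_id:
                return i
            i = (i + 1) % self.capacity  # linear probing on collision
        return None

    def update(self, state_id, score):
        """Insert a token, or keep the higher score if the state is already active."""
        i = self._slot(state_id)
        if i is None:
            raise RuntimeError("token table full")
        if self.keys[i] is None:
            self.keys[i] = state_id
            self.scores[i] = score
        elif score > self.scores[i]:
            self.scores[i] = score

    def get(self, state_id):
        i = self._slot(state_id)
        if i is None or self.keys[i] != state_id:
            return None
        return self.scores[i]

    def items(self):
        return [(k, s) for k, s in zip(self.keys, self.scores) if k is not None]


if __name__ == "__main__":
    current, nxt = TokenTable(), TokenTable()  # tokens for frame i and frame i+1
    current.update(6, -1.2)
    for state, score in current.items():
        nxt.update(state + 1, score - 0.3)     # stand-in for an arc expansion
    current, nxt = nxt, TokenTable()           # swap roles for the next frame
    print(current.items())
```

Using two tables, one read for frame i and one written for frame i+1, mirrors the Frame i / Frame i+1 labels in the figure; how the real design sizes the tables or handles overflow is not stated in the transcript.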

Potential Improvement
- Perfect caches and hash tables
- Speedups with respect to the baseline architecture
- 94.6% improvement
- Large memory footprint (34 million arcs)

Hardware Prefetching
- Dynamic access of a small, sparsely distributed subset of arcs
  - On average: 25K out of 34M arcs
- Conventional prefetchers are inefficient
  - Graph search exhibits an unpredictable access pattern
  - Pruning unlikely paths causes more unpredictability
- Our proposed scheme is based on the decoupled access-execute paradigm
  - All memory addresses are deterministic after the pruning
  - Issue memory requests well in advance
  - High accuracy: computed rather than predicted addresses
  - Timeliness: reorder buffer to avoid early evictions
- 94% speedup with a negligible area overhead of 0.05%
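
The benefit of issuing computed addresses ahead of use can be illustrated with a small timing model. This is a sketch under simplifying assumptions that are not in the slides: one request issued per cycle, one arc consumed per cycle when its data is ready, in-order consumption from a buffer of `window` entries, and a caller-supplied latency function. The names `arc_addresses` and `cycles_to_consume`, the 16-byte arc size, and the latency value are hypothetical.

```python
def arc_addresses(states, first_arc, num_arcs, arc_bytes=16, base=0):
    """Compute the address of every arc leaving the states that survived pruning."""
    addrs = []
    for s in states:
        for k in range(num_arcs[s]):
            addrs.append(base + (first_arc[s] + k) * arc_bytes)
    return addrs


def cycles_to_consume(addrs, latency_of, window):
    """Cycle at which the last arc is consumed.

    window is the number of buffer entries the access stage may fill ahead of
    the execute stage; window=1 models plain demand fetching.
    """
    issue, consume = [], []
    for i, addr in enumerate(addrs):
        t = issue[-1] + 1 if issue else 0       # at most one request per cycle
        if i >= window:
            t = max(t, consume[i - window])     # wait for a free buffer entry
        issue.append(t)
        ready = t + latency_of(addr)            # responses may return out of order
        prev = consume[-1] + 1 if consume else 0
        consume.append(max(ready, prev))        # but arcs are consumed in issue order
    return consume[-1] if consume else 0


if __name__ == "__main__":
    addrs = arc_addresses([2, 5], {2: 10, 5: 40}, {2: 4, 5: 4})
    print(cycles_to_consume(addrs, lambda a: 10, window=1))
    print(cycles_to_consume(addrs, lambda a: 10, window=16))
```

With a window of one, every arc pays the full memory latency; with a window at least as large as the latency, the latency is paid roughly once and the remaining arcs stream at one per cycle. Consuming in issue order is the role the slide assigns to the reorder buffer.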

Bandwidth Reduction
- 97% of dynamically expanded states have fewer than 16 arcs
- A novel technique for directly computing arc addresses
  - Changing the memory layout of the WFST dataset
  - Avoids the memory access for fetching a state's data
- 20% memory bandwidth saving at a negligible cost of 0.02% area increase
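
One way a layout change can make arc addresses computable without reading a per-state record is to renumber states so that states with the same number of arcs are contiguous. The slides do not describe the actual layout, so the scheme below is an assumption chosen only to illustrate the idea: states are sorted by arc count, a small table records where each group starts, and states with 16 or more arcs fall back to the conventional state record. `build_tables`, `direct_first_arc`, and `MAX_DIRECT` are hypothetical names.

```python
import bisect

MAX_DIRECT = 15  # states with fewer than 16 arcs use the direct computation


def build_tables(degrees):
    """degrees: arc count per state, with states already sorted by arc count.

    Returns (first_arc, groups): first_arc is the conventional per-state record,
    groups is a small table of (first_state, arc_count, first_arc_index).
    """
    assert list(degrees) == sorted(degrees)
    first_arc, groups, offset = [], [], 0
    for state, d in enumerate(degrees):
        first_arc.append(offset)
        key = min(d, MAX_DIRECT + 1)  # all large states share one fallback group
        if not groups or groups[-1][1] != key:
            groups.append((state, key, offset))
        offset += d
    return first_arc, groups


def direct_first_arc(state, groups):
    """Return (first arc index, arc count) from the small table, or None."""
    starts = [g[0] for g in groups]
    first_state, degree, base = groups[bisect.bisect_right(starts, state) - 1]
    if degree > MAX_DIRECT:
        return None  # fall back to reading the state record from memory
    return base + (state - first_state) * degree, degree


if __name__ == "__main__":
    degrees = [1, 1, 2, 2, 2, 3, 20, 25]
    first_arc, groups = build_tables(degrees)
    for s in range(len(degrees)):
        print(s, first_arc[s], direct_first_arc(s, groups))
```

The group table has at most 16 entries in this sketch, so it could plausibly sit next to the address generator, while the per-state array stays in main memory and is read only for the fallback group.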

Evaluation Methodology
- Viterbi accelerator's timing estimation
  - A cycle-accurate simulator: execution and activity factors
  - RTL Verilog model for logic components: design frequency
  - Modeling memory parts with CACTI: cache and memory latency
- Power model
  - Memory and caches: CACTI
  - Logic: Synopsys Design Compiler
  - Technology node: 28 nm

Experimental Results
[Figure: results charts. Callouts: 111.47x Speedup, 16.7x Speedup, 1185x Reduction]

Conclusion
- Viterbi search is the main bottleneck in ASR systems
- General-purpose solutions
  - Not real-time for large speech models
  - High energy consumption
- Design of an accelerator tailored for the Viterbi search
  - More energy-efficient (by orders of magnitude)
- Memory subsystem is the main challenge of ASR
  - Arc prefetcher
  - Memory bandwidth reduction
- 1.7x faster than NVIDIA GTX 980 and 287x less energy
