Hardware Accelerator for Hot-word Recognition Gautam Das Govardan Jonathan Mathews Wasim Shaikh Mojes Koli.

Hot-word recognition
Widely used in intelligent personal assistants:
- Apple: Siri
- Google: Google Now
- Microsoft: Cortana
- Samsung: S Voice
- LG: Voice Mate
- IBM: Watson
Such speech recognition technology is now built into every major operating system.
Speech recognition relies on methods and algorithms that can consume significant power, particularly given the always-on listening in the implementations above.
Motivation: build hardware accelerators that implement these algorithms with better speed, power and efficiency.

Speech Recognition - Overview
Feature extraction:
- Mel Frequency Cepstral Coefficients (MFCCs)
- Linear Prediction Coefficients (LPCs)
- Linear Prediction Cepstral Coefficients (LPCCs)
Comparison with trained system: “OK GOOGLE”
- Hidden Markov Model (HMM)

Hardware Implementation – Why?
An FPGA algorithm accelerator is, by definition, meant to exploit the parallelism inherent in hardware.
Advantages of implementing the C code in hardware:
- All operations finish in fewer clock cycles
- An in-order processor would take multiple clock cycles to execute the same code
Open question: can a designer write C code at a high level of abstraction and truly expect it to generate quality hardware in the FPGA?

Mel Frequency Cepstral Coefficient (MFCC)
1. Widely used in automatic speech recognition systems
2. Mel-frequency analysis of speech is based on human perception experiments
3. Keeps only the linguistic features and discards other information, such as background noise
Steps involved:
> Frame the signal into short frames.
> For each frame, find its spectral density by characterizing it in the frequency domain.
  - Usually frames
  - FFT representation
> Apply the Mel filterbank to the power spectra above and sum the energy in each filter.
  - What are Mel filterbanks?
> Take the logarithm of all the filterbank energies.
  - Why?
> Take the DCT of the log filterbank energies.
> Keep DCT coefficients 2-13 and discard the rest.

Mel Frequency Cepstral Coefficient (MFCC)
Input audio signal -> Pre-emphasis & Windowing -> FFT -> Mel filter banks -> Log of filter bank energies -> DCT -> Keep 2 to 13 values only

Implementation
Hardware Accelerator: MFCC feature extraction
Zedboard: Zynq 7000 AP
- Processing System: 33.33MHz
- Programmable Logic: 100MHz
- 512MB DDR3
- Onboard USB-JTAG

Development Tools
Vivado HLS (High Level Synthesis software)
1. Execute the C algorithm to validate it against a written testbench
2. Run synthesis to obtain the desired RTL implementation
3. Apply constraints/directives to obtain the desired optimizations
4. Reuse the same testbench for C/RTL cosimulation
5. Package and export the final IP
Vivado Interconnect Tool
6. Provides a GUI to make interconnections between the imported IP, application processor system, BRAM controller etc. through high-speed AXI interconnect buses. A bitstream file is generated and exported to Xilinx SDK
Xilinx SDK
7. Uses the generated bitstream file to program the Zynq 7000 device on the Zedboard
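Steps 1 and 4 rely on a self-checking C testbench. A minimal sketch of that pattern is given below, with a toy function standing in for the real design. `frame_energy`, `run_testbench` and `FRAME_LEN` are hypothetical names introduced only for this illustration. In an HLS project the checking logic would sit in the testbench's `main()`, which is expected to return 0 when the results match.

```c
#include <stddef.h>

#define FRAME_LEN 8   /* hypothetical frame length for the illustration */

/* Stand-in for the function that would be chosen as the synthesis top level:
   sums the squares of the samples in one frame. */
int frame_energy(const short frame[FRAME_LEN])
{
    int acc = 0;
    for (size_t i = 0; i < FRAME_LEN; i++)
        acc += (int)frame[i] * (int)frame[i];
    return acc;
}

/* Self-checking testbench body: drives the function with known input,
   compares against a precomputed reference and returns the number of
   mismatches (0 means pass). */
int run_testbench(void)
{
    const short frame[FRAME_LEN] = {1, -2, 3, -4, 5, -6, 7, -8};
    const int expected = 204;   /* 1 + 4 + 9 + 16 + 25 + 36 + 49 + 64 */
    int mismatches = 0;

    if (frame_energy(frame) != expected)
        mismatches++;
    return mismatches;
}
```

Because the check is expressed as a return value rather than as printed output, the same testbench can be rerun unchanged against the generated RTL during cosimulation.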

Vivado HLS (High Level Synthesis)
HLS transforms a C specification into an RTL implementation.
Benefits of HLS:
1. Algorithms are developed at C level, which abstracts the user from implementation details
2. Functional correctness is validated at C level, which is faster than traditional HDL verification techniques
3. Optimization directives allow C synthesis to produce hardware that meets the required area and performance
4. Quick design space exploration: creating many different implementations increases the likelihood of finding the best solution
Vivado HLS provides Synthesis and Analysis views:
Synthesis: gives estimates of performance, utilization and interface
Analysis: gives the number of clock cycles taken by each instruction

Optimizations
Vivado HLS provides directives for optimization. Important ones are:
1. Pipeline: loops, functions and tasks can be pipelined to reduce the initiation interval (and so increase the throughput)
2. Loop Unroll: for-loops can be unrolled to create multiple independent operations
3. Inline: inlining a sub-function removes it as a separate level of the function hierarchy. This enables logic optimization across function boundaries and improves latency by reducing function call overhead
4. Array Partition: partitions a large array into multiple smaller arrays to improve access to data and remove block-RAM bottlenecks
5. Allocation: specifies a limit on the number of operations, cores or functions to be used. This forces sharing of hardware, which increases the latency but reduces the area
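In source code, the first four directives are typically written as `#pragma HLS` lines inside the function or loop they apply to. The sketch below shows one possible placement on a toy dot product. The function names, the vector length and the factor of 4 are assumptions made for illustration and do not come from the presentation. A regular C compiler ignores these pragmas, so the code still runs as ordinary C.

```c
#define N 64   /* hypothetical vector length */

/* Small helper. INLINE asks the tool to dissolve it into the caller so that
   logic can be optimized across the call boundary. */
static int mul_add(int acc, int a, int b)
{
#pragma HLS INLINE
    return acc + a * b;
}

int dot_product(const int a[N], const int b[N])
{
/* Split each array into 4 interleaved banks so that several elements can be
   read in the same cycle instead of queuing on one block-RAM port. */
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=4 dim=1
    int acc = 0;

    for (int i = 0; i < N; i++) {
/* UNROLL replicates the loop body 4 times; PIPELINE asks the tool to start
   a new iteration every clock cycle (II=1) where dependencies allow. */
#pragma HLS UNROLL factor=4
#pragma HLS PIPELINE II=1
        acc = mul_add(acc, a[i], b[i]);
    }
    return acc;
}
```

Whether the requested initiation interval is actually achieved depends on data dependencies and memory ports, which is why the partition factor is chosen here to match the unroll factor.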

MFCC Accelerator
1. C code for the MFCC algorithm is simulated with a test bench and verified for correctness in Vivado HLS
2. The C function is synthesized into HDL
3. The baseline HDL is generated without forcing any optimization. The resource utilization is shown in the figure
4. The percentage resource utilization is unevenly distributed

Baseline MFCC Accelerator
The summary shows that it takes approximately million clocks to complete one execution.
The latencies of the outer for loop (FILTER_FOR) and the inner for loop (BINSIZE_FOR) are shown below.

Optimized MFCC Accelerator
A sub-function called by the inner function is pipelined to optimize for performance.
A 35% reduction in the interval was observed in the optimized implementation, with an increase of approximately 8% in FF and 1% in LUT utilization.
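The kind of change described here can be pictured with the sketch below. Only the loop labels FILTER_FOR and BINSIZE_FOR come from the slides. The function names, the array sizes, the loop bodies and the assumption that the pipelined sub-function is called from the innermost loop are hypothetical, so this should be read as an illustration of the technique rather than as the accelerator's actual source.

```c
#define NUM_FILTERS 4   /* hypothetical sizes, chosen only for illustration */
#define BIN_SIZE    8

/* Sub-function called for every bin. The PIPELINE pragma asks the HLS tool
   to accept new inputs before earlier calls have fully completed. */
static int weighted_bin(int power, int weight)
{
#pragma HLS PIPELINE
    return power * weight;
}

/* power[] and weight[] hold NUM_FILTERS * BIN_SIZE values, stored filter by
   filter; energy[] receives one accumulated value per filter. */
void filter_energies(const int power[NUM_FILTERS * BIN_SIZE],
                     const int weight[NUM_FILTERS * BIN_SIZE],
                     int energy[NUM_FILTERS])
{
FILTER_FOR:
    for (int f = 0; f < NUM_FILTERS; f++) {
        int acc = 0;
BINSIZE_FOR:
        for (int b = 0; b < BIN_SIZE; b++)
            acc += weighted_bin(power[f * BIN_SIZE + b],
                                weight[f * BIN_SIZE + b]);
        energy[f] = acc;
    }
}
```

Labelling the loops lets the synthesis report list latency per loop by name, which matches how the baseline results are reported per loop.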

Next Steps
1. Hidden Markov Model: Programmable Logic resource constraints force a software implementation on the application processor
2. Power needs to be measured for all the accelerators with the different optimizations
3. Real-time audio processing for input from a microphone