ULTRA-FAST BIQUAD FILTERING OPTIMIZED FOR CORTEX-M4/M7

Slides:



Advertisements
Similar presentations
Computer Architecture
Advertisements

CS/COE1541: Introduction to Computer Architecture Datapath and Control Review Sangyeun Cho Computer Science Department University of Pittsburgh.
Optimizing ARM Assembly Computer Organization and Assembly Languages Yung-Yu Chuang with slides by Peng-Sheng Chen.
ISA Issues; Performance Considerations. Testing / System Verilog: ECE385.
The 8051 Microcontroller and Embedded Systems
331 W08.1Spring :332:331 Computer Architecture and Assembly Language Spring 2006 Week 8: Datapath Design [Adapted from Dave Patterson’s UCB CS152.
Intel’s MMX Dr. Richard Enbody CSE 820. Michigan State University Computer Science and Engineering Why MMX? Make the Common Case Fast Multimedia and Communication.
1 Review: Counters and Registers CS 231 Review October 26, 2007 By Joshua Smith Please note the NOTES in the PPT file for help with the examples.
1/1/ / faculty of Electrical Engineering eindhoven university of technology Introduction Part 1: Bits, bytes and a simple processor dr.ir. A.C. Verschueren.
Processor Technology and Architecture
Memory - Registers Instruction Sets
Chapter 4 Processor Technology and Architecture. Chapter goals Describe CPU instruction and execution cycles Explain how primitive CPU instructions are.
Chapter 12 CPU Structure and Function. Example Register Organizations.
Kathy Grimes. Signals Electrical Mechanical Acoustic Most real-world signals are Analog – they vary continuously over time Many Limitations with Analog.
© 2008, Renesas Technology America, Inc., All Rights Reserved 1 Purpose  This training course describes how to configure the the C/C++ compiler options.
4-1 Chapter 4 - The Instruction Set Architecture Computer Architecture and Organization by M. Murdocca and V. Heuring © 2007 M. Murdocca and V. Heuring.
Lecture 8. Profiling - for Performance Analysis - Prof. Taeweon Suh Computer Science Education Korea University COM503 Parallel Computer Architecture &
4-1 Chapter 4 - The Instruction Set Architecture Department of Information Technology, Radford University ITEC 352 Computer Organization Principles of.
Operation Frequency No. of Clock cycles ALU ops % 1 Loads 25% 2
Fall 2012 Chapter 2: x86 Processor Architecture. Irvine, Kip R. Assembly Language for x86 Processors 6/e, Chapter Overview General Concepts IA-32.
4-1 Chapter 4 - The Instruction Set Architecture Principles of Computer Architecture by M. Murdocca and V. Heuring © 1999 M. Murdocca and V. Heuring Principles.
COSC 3430 L08 Basic MIPS Architecture.1 COSC 3430 Computer Architecture Lecture 08 Processors Single cycle Datapath PH 3: Sections
1 Workshop Topics - Outline Workshop 1 - Introduction Workshop 2 - module instantiation Workshop 3 - Lexical conventions Workshop 4 - Value Logic System.
CSCI 211 Intro Computer Organization –Consists of gates for logic And Or Not –Processor –Memory –I/O interface.
ajay patil 1 TMS320C6000 Assembly Language and its Rules Assignment One of the simplest operations in C is to assign a constant to a variable: One.
Microprocessor. Interrupts The processor has 5 interrupts. CALL instruction (3 byte instruction). The processor calls the subroutine, address of which.
Computer Architecture Memory, Math and Logic. Basic Building Blocks Seen: – Memory – Logic & Math.
Computer Organization
Computer Organization CDA 3103 Dr. Hassan Foroosh Dept. of Computer Science UCF © Copyright Hassan Foroosh 2002.
Dale & Lewis Chapter 5 Computing components
Instruction Sets: Addressing modes and Formats Group #4  Eloy Reyes  Rafael Arevalo  Julio Hernandez  Humood Aljassar Computer Design EEL 4709c Prof:
Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.
Our programmer needs to do this !
Addressing Modes and Formats
FOCOMM_CAMAC Setup and Usage Guide Andrew Wong, Larry Ruckman.
CS61C L20 Datapath © UC Regents 1 Microprocessor James Tan Adapted from D. Patterson’s CS61C Copyright 2000.
Simple ALU How to perform this C language integer operation in the computer C=A+B; ? The arithmetic/logic unit (ALU) of a processor performs integer arithmetic.
3/12/2013Computer Engg, IIT(BHU)1 CONCEPTS-1. Pipelining Pipelining is used to increase the speed of processing It uses temporal parallelism In pipelining,
1 EKT 225 MICROCONTROLLER I CHAPTER ASSEMBLY LANGUAGE PROGRAMMING.
ARM Cortex-M3 and Cortex-M4 Processors
Writing Functions in Assembly
System-on-Chip Design Homework Solutions
Displacement (Indexed) Stack
Addressing Modes in Microprocessors
William Stallings Computer Organization and Architecture 8th Edition
Embedded Systems Design
THD+N Estimation Accurate SNR and THD estimation
The 8051 Microcontroller and Embedded Systems
Alvaro Mauricio Peña Dariusz Niworowski Frank Rodriguez
Computer Architecture and Organization Miles Murdocca and Vincent Heuring Chapter 4 – The Instruction Set Architecture.
William Stallings Computer Organization and Architecture 8th Edition
Writing Functions in Assembly
Computer Architecture
Digital Signal Processors
Implementation of IDEA on a Reconfigurable Computer
Subject Name: Digital Signal Processing Algorithms & Architecture
ECE 434 Advanced Digital System L12
The Processor Lecture 3.1: Introduction & Logic Design Conventions
“TEST - Multiply-by-two” X 2
* From AMD 1996 Publication #18522 Revision E
Computer Architecture
Explaining issues with DCremoval( )
ECE 434 Advanced Digital System L11
Optimizing ARM Assembly
Instruction Set Principles
Instruction execution and ALU
Presentation transcript:

ULTRA-FAST BIQUAD FILTERING OPTIMIZED FOR CORTEX-M4/M7

Overview This program implements the IIR/BIQUAD subroutine as described at https://www.keil.com/pack/doc/CMSIS/DSP/html/group___biquad_cascade_d_f1.html Subroutine name “arm_biquad_cascade_df1_fast_q15“. The initialization subroutine “arm_biquad_cascade_df1_init_q15 “ is identical to CMSIS. The code delivery consists in three folders: - “DOC” : this documentation - “KEIL” : uVision V5.12 project used as test-bench of the subroutine and CMSIS’s - “C_BIQ_M4” : bit-exact arithmetic C-code simulator in VisualC2010 - “encrypted_file” : source code tar’d and coded with a key we send by email. The folders are located at http://firmware-developments.com/WEB/P6x/BIQ_M4/

Details The code reuses the same data structures, data format and APIs of the original CMSIS library. When compiled with option -O3 for Cortex-M4 this program runs 16% faster than the original CMSIS library (KEIL ARM C-compiler V5.05).   The new API name is "arm_biquad_cascade_df1_fast_q15_fwd". It assumes the even-order samples are aligned on four-bytes boundaries. There must be an even number of samples to process (the constraint can be relaxed on demand). Here is the extract of the compilation MAP file for the code size:   Base Addr Size Type Attr Idx E Section Name Object ... 0x00000580 0x000000f4 Code RO 65 .text arm_biquad_cascade_df1_fast_q15_m0_fwd_asm.o 0x00000674 0x000000ea Code RO 70 .text arm_biquad_cascade_df1_fast_q15_m0_fwd.o

CPU load The code speed is benchmarked on the KEIL simulator assuming 0 wait-state. We advertised 7.5 cycles per sample with the following computations. The critical loop takes 17 cycles per 16bits sample pairs (15 cycles on Cortex-M7 where the Load/Store unit can execute in parallel with ALU), on top of which you add 1 cycles for loop-counter increment and 3 cycles for the conditional loop branch. The code complexity is then 21 cycles per sample pairs without unrolling the loop. When a loop of 1000 samples was processed through 2 BiQuads with the original CMSIS code in 25.1kcycles this subroutine takes 21.2kcycles with the same saturation control The subroutine uses 36 bytes of stack more than the original one. The stack usage goes then from about 136 bytes to 172bytes. This can be substantially optimized on request.

API Documentation extracted from https://www.keil.com/pack/doc/CMSIS/DSP/html/group___biquad_cascade_d_f1.html void arm_biquad_cascade_df1_fast_q15_fwd ( const arm_biquad_casd_df1_inst_q15 * S, q15_t * pSrc, q15_t * pDst, uint32_t blockSize Parameters [in] *S points to an instance of the Q15 Biquad cascade structure. [in] *pSrc points to the block of input data. [out] *pDst points to the block of output data. [in] blockSize number of samples to process per call.   Returns none. Scaling and Overflow Behavior: This fast version uses a 32-bit accumulator with 2.30 format. The accumulator maintains full precision of the intermediate multiplication results but provides only a single guard bit. Thus, if the accumulator result overflows it wraps around and distorts the result. In order to avoid overflows completely the input signal must be scaled down by two bits and lie in the range [-0.25 +0.25). The 2.30 accumulator is then shifted by postShift bits and the result truncated to 1.15 format by discarding the low 16 bits.