Implementing a FIR-filter algorithm using MMX instructions by Lars Persson.



Merging the history buffer and the input buffer

When computing the first num_taps-1 samples, we need to access both the input and the history buffer. Depending on the implementation, this might require extra branch instructions in the inner or outer loop.

Improved history buffer:

[Figure: one merged buffer with cells X-3 X-2 X-1 and X0 X1 X2 …, labelled "num_taps-1 samples from last call", "num_taps-1 new samples" and "zero-padded".]
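The idea of a merged buffer can be sketched in plain C. This is only an illustration under stated assumptions, not the author's code: the names fir_state, fir_init and fir_process, the size limits, and the choice to store the history directly in front of the new samples are all hypothetical, and the inner loop is scalar rather than MMX. With this layout every output reads one contiguous window, so neither loop needs a special case for the first num_taps-1 outputs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_TAPS  64    /* illustrative limits, not from the slides */
#define MAX_BLOCK 256

/* Hypothetical state: the history occupies the first num_taps-1 slots,
   the new samples follow directly after it. */
typedef struct {
    int     num_taps;
    int16_t buf[MAX_TAPS - 1 + MAX_BLOCK];
} fir_state;

static void fir_init(fir_state *s, int num_taps)
{
    assert(num_taps >= 1 && num_taps <= MAX_TAPS);
    s->num_taps = num_taps;
    memset(s->buf, 0, sizeof s->buf);   /* zero history before the first call */
}

/* Filter one block of n samples; taps[k] multiplies x[i-k]. */
static void fir_process(fir_state *s, const int16_t *taps,
                        const int16_t *in, int32_t *out, int n)
{
    int hist = s->num_taps - 1;
    assert(n >= 0 && n <= MAX_BLOCK);

    /* Append the new samples right after the history. */
    memcpy(s->buf + hist, in, (size_t)n * sizeof *in);

    /* One contiguous window per output: x[i] sits at buf[hist + i]. */
    for (int i = 0; i < n; i++) {
        const int16_t *x = s->buf + hist + i;
        int32_t acc = 0;
        for (int k = 0; k < s->num_taps; k++)
            acc += (int32_t)taps[k] * x[-k];
        out[i] = acc;
    }

    /* Keep the last num_taps-1 samples as history for the next call. */
    memmove(s->buf, s->buf + n, (size_t)hist * sizeof *in);
}
```

Calling fir_process on consecutive blocks gives the same outputs as filtering the concatenated signal, because the tail of each block is carried over at the front of the buffer.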

Preparing the taps array

The filter tap array is prepared according to the Intel example: it is reversed, and 3 shifted copies are made. The number of taps is also rounded up to a multiple of 4.

[Figure: the tap array t1 t2 t3, zero-padded, together with the reversed array t3 t2 t1 and its zero-padded shifted copies.]
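A possible way to build such an array in C is sketched below. The exact memory layout of the Intel example is not given in the slides, so the layout here is an assumption: four rows stored back to back, each `stride` words long, where row s holds the reversed taps shifted right by s positions. The helper names taps_stride and prepare_taps are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed row length: room for the taps plus a shift of up to 3 words,
   rounded up to a multiple of 4 (one 64-bit MMX load covers 4 words). */
static int taps_stride(int num_taps)
{
    assert(num_taps >= 1);
    return (num_taps + 3 + 3) & ~3;
}

/* Fill `out` (4 * stride words) with four rows. Row s holds the reversed
   taps shifted right by s positions; all other words are zero. */
static void prepare_taps(const int16_t *taps, int num_taps, int16_t *out)
{
    int stride = taps_stride(num_taps);
    memset(out, 0, 4u * (size_t)stride * sizeof *out);
    for (int s = 0; s < 4; s++)
        for (int k = 0; k < num_taps; k++)
            out[s * stride + s + k] = taps[num_taps - 1 - k];
}
```

For three taps t1, t2, t3 this sketch produces rows of eight words: t3 t2 t1 0 0 0 0 0, then 0 t3 t2 t1 0 0 0 0, and so on up to a shift of three.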

The convolution sum

    LOOP:
        // Load 4 samples
        movq    mm0, [esi]
        movq    mm1, mm0
        // Preload taps that are shifted 2 and 3 steps
        lea     edi, [ebx+2*ecx]
        movq    mm4, [edi]
        movq    mm7, [edi+ecx]
        // Multiply with taps
        pmaddwd mm0, [ebx]
        paddd   mm6, mm0
        // Multiply with taps shifted one step
        movq    mm0, mm1
        pmaddwd mm0, [ebx+ecx]
        paddd   mm5, mm0
        // Multiply with taps shifted 2 steps
        pmaddwd mm4, mm1
        paddd   mm3, mm4
        // Multiply with taps shifted 3 steps
        pmaddwd mm7, mm1
        paddd   mm2, mm7
        // Update pointers for next loop iteration
        add     esi, 8
        add     ebx, 8
        sub     eax, 1
        jnz     LOOP
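To make the data flow of this loop easier to follow, here is a scalar C model of it. It is a sketch, not a translation verified against the original program: the function names madd_accumulate and convolve4 are hypothetical, the tap rows are assumed to lie `stride` words apart (the role ecx plays as a byte offset in the assembly), and the loop is assumed to run over one full row. acc[0] to acc[3] stand in for mm6, mm5, mm3 and mm2.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of pmaddwd followed by paddd: multiply four 16-bit pairs,
   add adjacent products, and accumulate the two 32-bit sums.
   Assumes the sums fit in 32 bits. */
static void madd_accumulate(const int16_t a[4], const int16_t b[4],
                            int32_t acc[2])
{
    acc[0] += (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    acc[1] += (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}

/* Model of the inner loop. `rows` holds four tap rows of `stride` words
   each (unshifted, then shifted 1, 2 and 3 steps); `samples` points at
   `stride` input words. Each acc[s] ends up holding two partial sums
   whose total is one output sample. */
static void convolve4(const int16_t *samples, const int16_t *rows,
                      int stride, int32_t acc[4][2])
{
    assert(stride > 0 && stride % 4 == 0);
    for (int s = 0; s < 4; s++)
        acc[s][0] = acc[s][1] = 0;
    for (int j = 0; j < stride; j += 4)      /* add esi, 8 / add ebx, 8 */
        for (int s = 0; s < 4; s++)          /* taps at [ebx + s*ecx]   */
            madd_accumulate(samples + j, rows + s * stride + j, acc[s]);
}
```

Each iteration loads four samples once and reuses them against all four tap rows, which is why one pass over the samples yields four outputs.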

Parallel summation

    // low samples: mm6, mm5
    movq      mm4, mm6
    punpckhdq mm4, mm5
    punpckldq mm6, mm5
    paddd     mm6, mm4
    // [ out(n+1) out(n) ] in mm6

    // high samples: mm3, mm2
    movq      mm4, mm3
    punpckhdq mm4, mm2
    punpckldq mm3, mm2
    paddd     mm3, mm4
    // [ out(n+3) out(n+2) ] in mm3
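The unpack-and-add sequence can be checked with a small C model of the three instructions involved. The type mm_reg and the function sum_pair are hypothetical names used only for this sketch; the lane behaviour of punpckldq, punpckhdq and paddd follows their documented semantics for 64-bit MMX registers.

```c
#include <stdint.h>

/* A 64-bit MMX register viewed as two 32-bit lanes: d[0] low, d[1] high. */
typedef struct { int32_t d[2]; } mm_reg;

/* punpckldq dst, src: result = [ src.low : dst.low ] */
static mm_reg punpckldq(mm_reg dst, mm_reg src)
{
    mm_reg r = { { dst.d[0], src.d[0] } };
    return r;
}

/* punpckhdq dst, src: result = [ src.high : dst.high ] */
static mm_reg punpckhdq(mm_reg dst, mm_reg src)
{
    mm_reg r = { { dst.d[1], src.d[1] } };
    return r;
}

/* paddd: lane-wise 32-bit addition */
static mm_reg paddd(mm_reg a, mm_reg b)
{
    mm_reg r = { { a.d[0] + b.d[0], a.d[1] + b.d[1] } };
    return r;
}

/* Same four-instruction sequence as above; a and b are two accumulators
   (mm6 and mm5, or mm3 and mm2). */
static mm_reg sum_pair(mm_reg a, mm_reg b)
{
    mm_reg t = a;             /* movq      mm4, mm6 */
    t = punpckhdq(t, b);      /* punpckhdq mm4, mm5 */
    a = punpckldq(a, b);      /* punpckldq mm6, mm5 */
    return paddd(a, t);       /* paddd     mm6, mm4 */
}
```

The low lane of the result is the sum of both lanes of the first accumulator and the high lane is the sum of both lanes of the second, so two horizontal sums are finished with a single paddd.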

Loop optimization

The inner loop keeps as much data as possible in the registers; only taps and samples are loaded from memory. The parallel summation is done with 8 instructions, compared to 12 instructions in my SSE version. Memory copying is done with the rep instruction prefix, which avoids a branch instruction.

So far: about 36 million cycles, including float-to-short conversion.

Optimizing float to short conversion

The C language standard requires that float-to-integer conversion is done with truncation, i.e. 3.6 is converted to 3, as opposed to 4 when using rounding. On the x86 architecture this requires changing the FPU control word, which is a very expensive operation. The solution is to use the fistp instruction directly: it converts using the current rounding mode, so the control word is left untouched and, in the default mode, values are rounded to nearest instead of truncated.

The compiler calls this function once for every conversion:

    __ftol:
    00402B24  push   ebp
    00402B25  mov    ebp,esp
    00402B27  add    esp,0FFFFFFF4h
    00402B2A  wait
    00402B2B  fnstcw word ptr [ebp-2]
    00402B2E  wait
    00402B2F  mov    ax,word ptr [ebp-2]
    00402B33  or     ah,0Ch
    00402B36  mov    word ptr [ebp-4],ax
    00402B3A  fldcw  word ptr [ebp-4]
    00402B3D  fistp  qword ptr [ebp-0Ch]
    00402B40  fldcw  word ptr [ebp-2]
    00402B43  mov    eax,dword ptr [ebp-0Ch]
    00402B46  mov    edx,dword ptr [ebp-8]
    00402B49  leave
    00402B4A  ret

Optimized conversion routine:

        mov   ecx, num_samples
        mov   esi, inputs
        mov   edi, input
        mov   esi, [esi]
        sub   ecx, 1
    LOOP1:
        fld   dword ptr [esi+ecx*4]
        fistp word ptr [edi+ecx*2]
        sub   ecx, 1
        jge   LOOP1