Intel Pentium 4
Reworked micro-architecture for high-bandwidth applications
Internet audio and streaming video, image processing, video content creation, speech, 3D, CAD, games, multimedia, and multi-tasking user environments
These are DSP-intensive applications!
–What about uses other than in a PC?

Hardware Features (NetBurst micro-architecture)
Hyper-pipelined technology
Advanced dynamic execution
Cache (data, L1, L2)
Rapid ALU execution engines
400 MHz bus
Out-of-order execution (OOE)
Microcode ROM

Filling the pipeline...
Reviews the next 126 instructions to be executed
Branch prediction
–on a mispredict, must flush the 20-stage pipeline!
–branch target buffer (BTB)
–4K branch history table (BHT)
–assembly instruction hints

Cache
8KB Data Cache
L1 Execution Trace Cache
–stores 12K previously decoded micro-instructions
–saves having to translate them again
L2 Advanced Transfer Cache
–256K for data
–256-bit transfer every cycle allows ~77GB/s data transfer at 2.4GHz
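The ~77GB/s figure follows directly from the bus width and the clock rate. A minimal sketch of that arithmetic, assuming 1 GB = 10^9 bytes and one 256-bit transfer on every core clock; the function name is hypothetical:

```c
/* Peak bandwidth in GB/s for a bus that moves `bits_per_cycle` bits
   on every clock of a core running at `clock_ghz` GHz.
   Assumption: 1 GB = 1e9 bytes, one transfer per clock. */
double peak_bandwidth_gb_per_s(double bits_per_cycle, double clock_ghz)
{
    double bytes_per_cycle = bits_per_cycle / 8.0;  /* 256 bits -> 32 bytes */
    return bytes_per_cycle * clock_ghz;             /* bytes/cycle * Gcycles/s */
}
```

Calling `peak_bandwidth_gb_per_s(256.0, 2.4)` gives 32 bytes x 2.4 GHz = 76.8 GB/s, which the slide rounds to 77GB/s.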

Rapid ALU Execution Engines
2 ALUs
–allow parallel operations
Many arithmetic operations take 1/2 cycle
–each 2X ALU can have 2 operations per cycle

Software Features
Multimedia Extensions (MMX)
–8 MMX registers
Streaming SIMD Extensions (SSE2)
–8 SSE/SSE2 registers
Standard x86 Registers
–EAX, EBX, ECX, EDX, ESI, etc.
–Register renaming to over 100 physical registers

MMX (Multimedia Extensions)
Accelerated performance through SIMD
Multimedia, communication, internet applications
64-bit packed INTEGER data
–signed/unsigned

SSE2 (Streaming SIMD Extensions)
Accelerate a broad range of applications
–video, speech, image and photo processing, encryption, financial, engineering, and scientific applications
128-bit SIMD instruction formats
–4 single precision FP values
–2 double precision FP values
–16 byte values
–8 word values
–4 double word values
–2 quad word values
–1 128-bit integer value
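Every entry in that list is the same 128 bits divided by a different element width. A small sketch that models the register as a C union, assuming a platform where `float` is 32 bits and `double` is 64 bits; `xmm128_t` and `xmm_lanes` are hypothetical names, not Intel's types:

```c
#include <stdint.h>

/* Hypothetical model of one 128-bit SSE2 register.
   Each member views the same 16 bytes with a different element type. */
typedef union {
    float   f32[4];   /* 4 single precision FP values */
    double  f64[2];   /* 2 double precision FP values */
    int8_t  i8[16];   /* 16 byte values               */
    int16_t i16[8];   /* 8 word values                */
    int32_t i32[4];   /* 4 double word values         */
    int64_t i64[2];   /* 2 quad word values           */
} xmm128_t;

/* Number of elements of `element_bits` bits that fit in 128 bits. */
unsigned xmm_lanes(unsigned element_bits)
{
    return 128u / element_bits;
}
```

For example, `xmm_lanes(32)` is 4 and `xmm_lanes(8)` is 16, matching the single precision and byte rows of the list.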

SIMD Example (16-tap FIR filter - Real numbers)
Applications for real FIR filters
–general purpose filters in image processing, audio, and communication algorithms
Will utilize SSE2 SIMD instruction set

Thinking about SIMD
SSE2 instruction format is 128-bits
128-bit SSE2 registers
Many data formats! What precision do we want?
Let's use 32-bit floating point for coefficients, input, output
4 data sets x 32-bit = 128 bits

Parallelizing
Requires many single multiplications (coefficients x inputs), then adding the results for the output!
Multiplications… then need to perform additions...

Using SSE2 format
Can hold 4 elements of an array (of 32-bit data) in each 128-bit register
4 single precision floating point ops per cycle (32-bit)

Additions...
In both registers, now have 4 32-bit results
–First add the results into an accumulator register
4 single precision floating point ops per cycle (32-bit)

Additions...
In a register, now have 4 32-bit results
–however, NO SSE2 instruction to add these 4!
–But can use other instructions
Some BIT INTERTWINING…then add
–This will give results for several output values!
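The multiply, accumulate, and final add steps above can be sketched in portable C. This is only a scalar model of the data flow, not the application note's code: a 4-element array stands in for one 128-bit register, the inner loop stands in for one packed multiply plus one packed add, and the last step stands in for the shuffle-then-add sequence. It computes a single output sample, whereas the slide's approach produces several outputs at once. All names are hypothetical.

```c
#include <stddef.h>

#define FIR_TAPS  16
#define FIR_LANES 4   /* 4 x 32-bit floats per 128-bit register */

/* Plain reference: y = sum of coeff[k] * x[k] over 16 taps. */
float fir16_scalar(const float coeff[FIR_TAPS], const float x[FIR_TAPS])
{
    float y = 0.0f;
    for (size_t k = 0; k < FIR_TAPS; ++k)
        y += coeff[k] * x[k];
    return y;
}

/* Lane-wise model: x[k] is assumed to be already ordered so that
   it pairs with coeff[k]. */
float fir16_lanes(const float coeff[FIR_TAPS], const float x[FIR_TAPS])
{
    float acc[FIR_LANES] = {0.0f, 0.0f, 0.0f, 0.0f};  /* accumulator register */

    /* 4 iterations, each modelling one packed multiply and one packed add. */
    for (size_t k = 0; k < FIR_TAPS; k += FIR_LANES)
        for (size_t lane = 0; lane < FIR_LANES; ++lane)
            acc[lane] += coeff[k + lane] * x[k + lane];

    /* No single instruction sums the 4 lanes, so model two
       rearrange-and-add steps: fold the high half onto the low half,
       then fold the remaining pair. */
    float pair0 = acc[0] + acc[2];
    float pair1 = acc[1] + acc[3];
    return pair0 + pair1;
}
```

With all coefficients set to 1 and inputs 1 through 16, both functions return 136. For general inputs the two can differ in the last bits because the additions happen in a different order.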

ADI SHARC 21k vs. P4: Disadvantages
Slower clock speed (40MHz vs 2400MHz)
Fewer opportunities for parallelism (5 vs 11)
Much less memory (cache and system)
–Limited algorithm applicability
–Limited applications
Older (less support, e.g. compiler)
–1994 vs 2001

ADI SHARC 21k vs. P4: Advantages
Hardware loops
Easier to program for optimal speed
Cheaper
Lower power consumption
Runs cooler

FIR Performance
Hard to obtain P4 performance numbers
Can estimate based on 2 FP multiplies per clock, the clock rate, and the assumption that the pipeline can be kept full
–2 * 2.4GHz ~ 4.8 billion multiplies per second
–If ~4 multiplies per element & samples/s
–FIR length > ~25k taps
SHARC => ~200 taps (Lab 4)
Factor of ~125x
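The estimate above is a single division: multiplies available per second over multiplies needed per tap per second. A sketch of that formula follows. The sample rate is a parameter because the slide's figure is not legible; 48 kHz is used in the usage note purely as an assumption, chosen because it reproduces the ~25k result. The function name is hypothetical.

```c
/* Longest real-time FIR the multiplier throughput could sustain,
   assuming the pipeline is kept full.
   mults_per_clock : FP multiplies issued per clock (slide: 2)
   clock_hz        : core clock in Hz (slide: 2.4e9)
   mults_per_tap   : multiplies spent per filter element (slide: ~4)
   sample_rate_hz  : input sample rate in Hz (ASSUMED, not from the slide) */
double max_fir_taps(double mults_per_clock, double clock_hz,
                    double mults_per_tap, double sample_rate_hz)
{
    double mults_per_second = mults_per_clock * clock_hz;
    return mults_per_second / (mults_per_tap * sample_rate_hz);
}
```

`max_fir_taps(2.0, 2.4e9, 4.0, 48000.0)` evaluates to 25000 taps under that assumed rate; dividing by the SHARC's ~200 taps gives the ~125x factor.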

IIR Performance
Hard to obtain P4 performance numbers
No hardware circular buffers
Does have BTB, BHT, etc.
Prefetches ~256 bytes ahead of current position in code

FFT Performance
Hard to obtain P4 performance numbers
Prime95 uses FFTs in the Lucas-Lehmer test for Mersenne primes
–Involves FFT, squaring and iFFT, etc.
256k points on P4 2.3GHz ~ ms
Compare to SHARC 2048 point FFT ~0.37ms
If SHARC could do 256k, 46.25ms (But…)
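The 46.25ms figure matches a linear extrapolation of the 2048-point time, taking 256k as 256,000 points. A sketch of that scaling, plus an N log N variant for comparison, since FFT work grows as N log2 N rather than N. Both functions are hypothetical helpers, the N log N version assumes power-of-two sizes, and neither accounts for memory limits.

```c
/* Integer log2 for a power-of-two size (assumption: n is a power of two). */
static unsigned ilog2_pow2(unsigned long n)
{
    unsigned bits = 0;
    while (n > 1) { n >>= 1; ++bits; }
    return bits;
}

/* Scale a measured time linearly with the number of points. */
double fft_time_linear_ms(double t_ms, unsigned long n_from, unsigned long n_to)
{
    return t_ms * (double)n_to / (double)n_from;
}

/* Scale a measured time with N * log2(N). */
double fft_time_nlogn_ms(double t_ms, unsigned long n_from, unsigned long n_to)
{
    double work_from = (double)n_from * ilog2_pow2(n_from);
    double work_to   = (double)n_to   * ilog2_pow2(n_to);
    return t_ms * work_to / work_from;
}
```

`fft_time_linear_ms(0.37, 2048, 256000)` gives 46.25 ms. With 256k taken as 262,144 points, `fft_time_nlogn_ms(0.37, 2048, 262144)` gives roughly 77.5 ms, so the linear estimate is on the optimistic side.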

Optimization Example
Hard to optimize Pentium 4 assembly
Example of multiplying by a constant, 10
Taken mainly from:

Multiplying by 10
Slowest way:
–IMUL EAX, 10
Usually optimal way (Visual C++ 6.0)
–LEA EAX, [EAX+EAX*4]
–SHL EAX, 1
–Shift – Add – Shift
–On most x86 processors takes 2 cycles
–Pentium MMX and before 3 cycles
–On Pentium 4 takes 6 cycles!

Multiplying by 10
Optimal for Pentium 4
–LEA ECX, [EAX + EAX]
–LEA EAX, [ECX+EAX*8]
–On most x86 still takes 2 cycles
–On Pentium 4 takes ~ 3 cycles (OOE - Ops)
–But on older processors (Pentium MMX and before) this now takes 4 cycles!

Multiplying by 10
Best generic case
–LEA EAX, [EAX + EAX*4]
–ADD EAX, EAX
–On most x86 still takes 2 cycles
–On older processors (Pentium MMX and before) this now takes 3 cycles again
–On Pentium 4 this takes 4 cycles
Obviously really hard to optimize
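All four instruction sequences compute the same value; they differ only in how 10x is decomposed. The C below mirrors each decomposition on 32-bit unsigned values so the equivalence can be checked. It is a sketch of the arithmetic only: a C compiler is free to emit any instruction sequence it likes, so these functions say nothing about the cycle counts quoted on the slides. Function names are hypothetical.

```c
#include <stdint.h>

/* IMUL EAX, 10 */
uint32_t mul10_imul(uint32_t x)
{
    return x * 10u;
}

/* LEA EAX, [EAX+EAX*4] ; SHL EAX, 1   ->  (x + 4x) << 1 */
uint32_t mul10_lea_shl(uint32_t x)
{
    uint32_t five_x = x + (x << 2);
    return five_x << 1;
}

/* LEA ECX, [EAX+EAX] ; LEA EAX, [ECX+EAX*8]   ->  2x + 8x */
uint32_t mul10_lea_lea(uint32_t x)
{
    uint32_t two_x = x + x;
    return two_x + (x << 3);
}

/* LEA EAX, [EAX+EAX*4] ; ADD EAX, EAX   ->  5x + 5x */
uint32_t mul10_lea_add(uint32_t x)
{
    uint32_t five_x = x + (x << 2);
    return five_x + five_x;
}
```

Because unsigned arithmetic in C wraps modulo 2^32, the four functions agree for every input, including values where 10x overflows 32 bits.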

REFERENCES
Intel application note: AP Real and Complex Filter Using Streaming SIMD Extensions
Graphics from: /p4-01.html