Published by Leah Talmage. Modified about 1 year ago.

1
Intel Pentium 4
ENCM
Jonathan Bienert
Tyson Marchuk

2
Overview:
Product review
Specialized architectural features (NetBurst)
SIMD instruction capabilities (MMX, SSE2)
SHARC 2106x comparison

3
Intel Pentium 4
Reworked micro-architecture for high-bandwidth applications
Internet audio and streaming video, image processing, video content creation, speech, 3D, CAD, games, multimedia, and multitasking user environments
These are DSP-intensive applications!
  – What about uses other than in a PC?

4
Hardware Features (NetBurst micro-architecture):
Hyper-pipelined technology
Advanced dynamic execution
Cache (data, L1, L2)
Rapid ALU execution engines
400 MHz bus
Out-of-order execution (OOE)
Microcode ROM

5

6
Hyper Pipeline
20-stage pipeline!
Breaks down complex CISC instructions
  – sub-stages mimic RISC
  – faster execution

7
Filling the pipeline...
Reviews the next 126 instructions to be executed
Branch prediction
  – on a mispredict, the whole 20-stage pipeline must be flushed!
  – branch target buffer (BTB)
  – 4K branch history table (BHT)
  – assembly instruction hints
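The slide does not say how the branch history table makes its predictions. As a rough illustration only, a textbook scheme is a table of 2-bit saturating counters indexed by the branch address. The sketch below assumes 4096 one-byte entries and a simple modulo index; the names `bht`, `bht_predict`, and `bht_update` are hypothetical, and this is not a description of the actual Pentium 4 predictor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BHT_ENTRIES 4096  /* assumption: "4K" read as 4096 table entries */

/* Each entry is a 2-bit saturating counter:
   0,1 = predict not taken; 2,3 = predict taken. */
typedef struct {
    uint8_t counter[BHT_ENTRIES];
} bht;

/* Hypothetical index function: low bits of the branch address. */
size_t bht_index(uint32_t pc)
{
    return pc % BHT_ENTRIES;
}

/* Returns 1 if the branch at pc is predicted taken, else 0. */
int bht_predict(const bht *t, uint32_t pc)
{
    return t->counter[bht_index(pc)] >= 2;
}

/* Move the counter one step toward the actual outcome, saturating at 0 and 3. */
void bht_update(bht *t, uint32_t pc, int taken)
{
    uint8_t *c = &t->counter[bht_index(pc)];
    assert(*c <= 3);
    if (taken) {
        if (*c < 3) ++*c;
    } else {
        if (*c > 0) --*c;
    }
}
```

Because the counter needs two wrong outcomes in a row to flip its prediction, a loop branch that is almost always taken costs only one misprediction (and one pipeline flush) at loop exit.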

8
Cache
8 KB Data Cache
L1 Execution Trace Cache
  – stores 12K previously translated micro-instructions
  – saves having to translate them again
L2 Advanced Transfer Cache
  – 256 KB for data
  – a 256-bit transfer every cycle allows ~77 GB/s data transfer at 2.4 GHz
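The 77 GB/s figure follows from the bus width and the clock: 256 bits is 32 bytes, and 32 bytes per cycle at 2.4 GHz is 76.8 GB/s. A minimal sketch of that arithmetic, assuming one transfer per core clock and decimal gigabytes (the function name is illustrative):

```c
#include <assert.h>

/* Peak transfer rate in GB/s (decimal) for a bus that moves
   bits_per_cycle bits on every clock at clock_ghz gigahertz. */
double transfer_rate_gb_per_s(unsigned bits_per_cycle, double clock_ghz)
{
    assert(bits_per_cycle % 8u == 0u);
    double bytes_per_cycle = bits_per_cycle / 8.0;
    /* bytes/cycle * 1e9 cycles/s = 1e9 bytes/s = GB/s */
    return bytes_per_cycle * clock_ghz;
}
```

Calling `transfer_rate_gb_per_s(256u, 2.4)` gives roughly 76.8, which the slide rounds to 77.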

9
Rapid ALU Execution Engines
2 ALUs
  – allow parallel operations
Many arithmetic operations take 1/2 cycle
  – each 2X (double-speed) ALU can perform 2 operations per cycle

10
Software Features:
Multimedia Extensions (MMX)
  – 8 MMX registers
Streaming SIMD Extensions (SSE2)
  – 8 SSE/SSE2 registers
Standard x86 registers
  – EAX, EBX, ECX, EDX, ESI, etc.
  – register renaming onto over 100 registers

11
MMX (Multimedia Extensions)
Accelerated performance through SIMD
Multimedia, communication, and internet applications
64-bit packed INTEGER data
  – signed/unsigned

12
SSE2 (Streaming SIMD Extensions)
Accelerates a broad range of applications
  – video, speech, image and photo processing, encryption, financial, engineering, and scientific applications
128-bit SIMD instruction formats
  – 4 single-precision FP values
  – 2 double-precision FP values
  – 16 byte values
  – 8 word values
  – 4 double-word values
  – 2 quad-word values
  – 1 128-bit integer value
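The packed formats listed above are all different views of the same 128 bits. A small C sketch makes the lane counts concrete; the union `xmm_view` and the helper `lanes_for_bits` are hypothetical names used only for illustration, not part of any Intel header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative view of one 128-bit register as overlapping packed arrays. */
typedef union {
    float    f32[4];   /* 4 single-precision FP values */
    double   f64[2];   /* 2 double-precision FP values */
    uint8_t  u8[16];   /* 16 byte values */
    uint16_t u16[8];   /* 8 word values */
    uint32_t u32[4];   /* 4 double-word values */
    uint64_t u64[2];   /* 2 quad-word values */
} xmm_view;

/* Number of elements of the given width that fit in 128 bits. */
size_t lanes_for_bits(size_t element_bits)
{
    assert(element_bits > 0 && 128 % element_bits == 0);
    return 128 / element_bits;
}
```

Every member of the union occupies the same 16 bytes, so `sizeof(xmm_view)` is 16 and `lanes_for_bits(32)` is 4, matching the "4 single-precision FP values" entry.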

13
SIMD Example (16-tap FIR filter - Real numbers)
Applications for real FIR filters
  – general-purpose filters in image processing, audio, and communication algorithms
Will utilize the SSE2 SIMD instruction set

14
Thinking about SIMD
SSE2 instruction format is 128 bits
128-bit SSE2 registers
Many data formats!
What precision do we want?
  – Let's use 32-bit floating point for coefficients, input, and output
  – 4 data values x 32 bits = 128 bits

15
Parallelizing
Requires many single multiplications (coefficients x inputs), then adding the results to form the output!
Multiplications… then need to perform additions...

16
Using SSE2 format
Can hold 4 elements of an array (of 32-bit data) in each 128-bit register
4 single-precision floating-point ops per cycle (32-bit)

17
Additions...
In both registers, now have 4 32-bit results
  – First add the results into an accumulator register
4 single-precision floating-point ops per cycle (32-bit)

18
Additions...
In a register, now have 4 32-bit results
  – however, NO SSE2 instruction to add these 4!
  – But can use other instructions
Some BIT INTERTWINING (shuffling the register's elements)… then add
  – This will give results for several output values!
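The multiply, accumulate, and final-addition steps on the last few slides can be sketched in portable C. This is only an emulation under stated assumptions: the struct `vec4f` stands in for a 128-bit register holding four 32-bit floats, the helper names are hypothetical, and the sketch computes a single output value, whereas the slides describe an arrangement that yields several outputs at once. The `window` array is assumed to already hold the input samples lined up with their coefficients.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* four 32-bit floats per 128-bit register */

/* Stand-in for one 128-bit register (illustrative, not an intrinsic type). */
typedef struct { float lane[LANES]; } vec4f;

/* Load four consecutive floats into a "register". */
vec4f vec4f_load(const float *p)
{
    vec4f v;
    for (int i = 0; i < LANES; ++i) v.lane[i] = p[i];
    return v;
}

/* Lane-wise multiply: four products in one step. */
vec4f vec4f_mul(vec4f a, vec4f b)
{
    vec4f r;
    for (int i = 0; i < LANES; ++i) r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

/* Lane-wise add: four sums in one step. */
vec4f vec4f_add(vec4f a, vec4f b)
{
    vec4f r;
    for (int i = 0; i < LANES; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Sum the four lanes. With no single instruction for this, the lanes are
   paired up first (as a shuffle would do) and then added. */
float vec4f_hsum(vec4f v)
{
    float s02 = v.lane[0] + v.lane[2];
    float s13 = v.lane[1] + v.lane[3];
    return s02 + s13;
}

/* One FIR output: sum of coeff[i] * window[i] over n_taps taps. */
float fir_output(const float *coeff, const float *window, size_t n_taps)
{
    assert(n_taps % LANES == 0);
    vec4f acc = {{0.0f, 0.0f, 0.0f, 0.0f}};
    for (size_t i = 0; i < n_taps; i += LANES) {
        vec4f prod = vec4f_mul(vec4f_load(coeff + i), vec4f_load(window + i));
        acc = vec4f_add(acc, prod);  /* accumulate four partial sums */
    }
    return vec4f_hsum(acc);          /* the final additions */
}
```

For the 16-tap case the loop body runs four times, so the 16 multiplications take four lane-wise multiplies, and the lane sum is done only once at the end.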

19
ADI SHARC 21k vs. P4: Disadvantages
Slower clock speed (40 MHz vs 2400 MHz)
Fewer opportunities for parallelism (5 vs 11)
Much less memory (cache and system)
  – Limited algorithm applicability
  – Limited applications
Older (less support, e.g. compiler)
  – 1994 vs 2001

20
ADI SHARC 21k vs. P4: Advantages
Hardware loops
Easier to program for optimal speed
Cheaper
Lower power consumption
Runs cooler

21
FIR Performance
Hard to obtain P4 performance numbers
Can estimate based on 2 FP multiplies per clock, the clock rate, and the assumption that the pipeline can be kept full
  – 2 * 2.4 GHz ~ 4.8 billion multiplies per second
  – If ~4 multiplies per element & samples/s
  – FIR length > ~25k taps
SHARC => ~200 taps (Lab 4)
Factor of ~125x
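The estimate above can be written out as a short calculation. The sample rate did not survive in the slide text, so the sketch takes it as a parameter; the 48,000 samples/s used in the usage note is purely an assumption for illustration. The function names are hypothetical.

```c
#include <assert.h>

/* Longest FIR filter (in taps) that fits a given multiply budget:
   multiplies available per second divided by multiplies needed per tap per second. */
double max_fir_taps(double mults_per_sec, double mults_per_tap, double sample_rate_hz)
{
    assert(mults_per_tap > 0.0 && sample_rate_hz > 0.0);
    return mults_per_sec / (mults_per_tap * sample_rate_hz);
}

/* Ratio between two filter lengths, e.g. P4 estimate vs. SHARC measurement. */
double tap_ratio(double taps_a, double taps_b)
{
    assert(taps_b > 0.0);
    return taps_a / taps_b;
}
```

With 2 multiplies per clock at 2.4 GHz, 4 multiplies per element, and the assumed 48 kHz rate, `max_fir_taps(2 * 2.4e9, 4, 48000)` comes to 25,000 taps, and `tap_ratio(25000, 200)` gives the factor of 125 quoted on the slide.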

22
IIR Performance
Hard to obtain P4 performance numbers
No hardware circular buffers
Does have BTB, BHT, etc.
Prefetches ~256 bytes ahead of the current position in code
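Without hardware circular buffers, the delay line of a filter has to wrap its index in software. A minimal sketch of one common way to do that, assuming a fixed length and a modulo wrap (the names `delay_line`, `delay_push`, and `delay_get` are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

#define DELAY_LEN 4  /* illustrative delay-line length */

/* Software circular buffer: head marks the most recent sample. */
typedef struct {
    float  data[DELAY_LEN];
    size_t head;
} delay_line;

/* Store a new sample, overwriting the oldest one. */
void delay_push(delay_line *d, float x)
{
    d->head = (d->head + 1) % DELAY_LEN;  /* explicit wrap, done in software */
    d->data[d->head] = x;
}

/* Read the sample pushed `age` steps ago (0 = newest). */
float delay_get(const delay_line *d, size_t age)
{
    assert(age < DELAY_LEN);
    return d->data[(d->head + DELAY_LEN - age) % DELAY_LEN];
}
```

The wrap costs extra instructions on every access, which a DSP with hardware circular addressing avoids. Choosing a power-of-two length lets a compiler turn the modulo into a bit mask.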

23
FFT Performance
Hard to obtain P4 performance numbers
Prime95 uses FFTs to perform the Lucas-Lehmer test for Mersenne primes
  – Involves FFT, squaring, iFFT, etc.
256k points on P4 2.3 GHz ~ ms
Compare to SHARC 2048-point FFT ~0.37 ms
If SHARC could do 256k, 46.25 ms (But…)
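The 46.25 ms figure appears to come from scaling the SHARC time linearly with the number of points, reading "256k" as 256,000. That reading is an interpretation, chosen because it reproduces the slide's number. A sketch of that scaling, with a hypothetical function name:

```c
#include <assert.h>

/* Scale a measured time linearly with problem size.
   This mirrors the slide's simple estimate; it is not a model of FFT cost. */
double scale_time_linear(double t_ms, double n_from, double n_to)
{
    assert(n_from > 0.0);
    return t_ms * (n_to / n_from);
}
```

`scale_time_linear(0.37, 2048.0, 256000.0)` multiplies 0.37 ms by 125, giving about 46.25 ms.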

24
Optimization Example
Hard to optimize Pentium 4 assembly
Example of multiplying by a constant, 10
Taken mainly from:

25
Multiplying by 10
Slowest way:
  – IMUL EAX, 10
Usually optimal way (Visual C++ 6.0)
  – LEA EAX, [EAX+EAX*4]
  – SHL EAX, 1
  – Shift – Add – Shift
  – On most x86 processors takes 2 cycles
  – On Pentium MMX and before takes 3 cycles
  – On Pentium 4 takes 6 cycles!

26
Multiplying by 10
Optimal for Pentium 4
  – LEA ECX, [EAX + EAX]
  – LEA EAX, [ECX+EAX*8]
  – On most x86 still takes 2 cycles
  – On Pentium 4 takes ~3 cycles (OOE - Ops)
  – But on older processors (Pentium MMX and before) this now takes 4 cycles!

27
Multiplying by 10
Best generic case
  – LEA EAX, [EAX + EAX*4]
  – ADD EAX, EAX
  – On most x86 still takes 2 cycles
  – On older processors (Pentium MMX and before) this takes 3 cycles again
  – On Pentium 4 this takes 4 cycles
Obviously really hard to optimize
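All three instruction sequences compute the same value and differ only in how 10x is decomposed. The C sketch below mirrors each decomposition with 32-bit unsigned arithmetic, which wraps the same way a 32-bit register does. The function names are hypothetical, and a C compiler is free to pick its own instructions for these expressions; the point is only the arithmetic identity.

```c
#include <assert.h>
#include <stdint.h>

/* IMUL EAX, 10 */
uint32_t mul10_imul(uint32_t x)
{
    return x * 10u;
}

/* LEA EAX, [EAX+EAX*4] ; SHL EAX, 1   ->   (x + 4x) * 2 */
uint32_t mul10_lea_shl(uint32_t x)
{
    return (x + x * 4u) << 1;
}

/* LEA ECX, [EAX+EAX] ; LEA EAX, [ECX+EAX*8]   ->   2x + 8x */
uint32_t mul10_lea_lea(uint32_t x)
{
    uint32_t t = x + x;
    return t + x * 8u;
}

/* LEA EAX, [EAX+EAX*4] ; ADD EAX, EAX   ->   5x + 5x */
uint32_t mul10_lea_add(uint32_t x)
{
    uint32_t t = x + x * 4u;
    return t + t;
}

/* Check that every decomposition agrees with the plain multiply. */
void mul10_check(uint32_t x)
{
    uint32_t expected = mul10_imul(x);
    assert(mul10_lea_shl(x) == expected);
    assert(mul10_lea_lea(x) == expected);
    assert(mul10_lea_add(x) == expected);
}
```

Since the results are identical, the choice between the sequences is purely a question of cycle counts on the target processor, which is what makes the optimization processor-specific.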

28
REFERENCES
Intel application note: AP Real and Complex Filter Using Streaming SIMD Extensions
graphics from: /p4-01.html
