® GDC’99 Streaming SIMD Extensions Overview
Haim Barad, Project Leader/Staff Engineer, Media Team, Haifa, Israel
Haim.Barad@intel.com
Intel Corporation
March 17, 1999

® GDC’99 Agenda
Introduction
SIMD instructions
Some examples
The secret ingredient

® GDC’99 Streaming SIMD Extensions
New technology to exploit parallelism in FP and INT applications
Key capabilities
  – Packed Operations
  – Branch Removal/Compression
  – Data Movement/Hints
  – FP/INT Type Conversion

® GDC’99 Application Domains…
3D Graphics: geometry, lighting
Signal processing
High precision simulation & modeling
Video encoding/decoding
Other apps requiring streaming input and output

® GDC’99 Instruction Categories
SIMD FP - 4 wide, single-precision
SIMD INT - extensions to MMX™ technology capabilities
Cacheability control
State management

® GDC’99 SIMD FP instruction types
arithmetic
square root
approximation instructions
min & max
loads & stores
move mask
compare & set mask
logical
compare & set eflags
conversion

® GDC’99 Key point #1: SIMD
SIMD
  – Operates on data in parallel (e.g. process 4 vertices at a time instead of just 1)

® GDC’99 Key point #2: Streaming
Core/bus ratios are getting higher all the time
  – memory latencies can kill potential parallelism
Can we find a cure?
  – Hide load latency (prefetch)
  – Don’t pollute cache if data is never to be used again (streaming store)

® GDC’99 New SIMD FP Registers
[Figure: registers xmm0 through xmm7; eight 128-bit registers, each holding 4 x 32-bit single-precision FP values]
New set of registers!
Direct access
Used for data only
Extended processor state

® GDC’99 Packed SP Data Type
Each register holds 4 single-precision FP values
IEEE-754 compatible
Scalar operates on least-significant number
[Figure: xmm0 spans bits 0 to 127; each 32-bit element has a 23-bit mantissa (bits 0 to 22), an 8-bit exponent (bits 23 to 30), and a 1-bit sign (bit 31)]

® GDC’99 Compatibility
100% compatible with all existing IA-32 software
Extension is not transparent to OS
  – Changes needed to OS to handle extended state
  – Support in Win98
  – Support in WinNT4 with SP4

® GDC’99 Do I need assembly? No!
3 levels of support for IA
  – assembly - of course, do it yourself
  – intrinsics - assembly-like C
  – C++ classes - take advantage of SIMD from a high level
Use Intel® C/C++ Compiler for intrinsic or C++ support of Streaming SIMD Extensions and MMX™ technology
http://developer.intel.com/drg/pentiumiii/tools/ad2.htm

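® GDC’99 Packed Operations
[Figure: a packed operation on two registers]
    xmm0 (source/destination):  a0        a1        a2        a3
    xmm1 (source):              b0        b1        b2        b3
    xmm0 (result):              a0 op b0  a1 op b1  a2 op b2  a3 op b3
op is one of addps, subps, mulps, divps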

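® GDC’99 Scalar Operations
[Figure: a scalar operation on two registers]
    xmm0 (source/destination):  a0        a1  a2  a3
    xmm1 (source):              b0        b1  b2  b3
    xmm0 (result):              a0 op b0  a1  a2  a3
op is one of addss, subss, mulss, divss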
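The difference between the packed and scalar forms can be sketched with compiler intrinsics. This is a minimal illustration, assuming a compiler that provides `xmmintrin.h`; the function names `packed_add` and `scalar_add` are hypothetical. `_mm_add_ps` corresponds to addps and `_mm_add_ss` to addss. Unaligned loads and stores are used so the sketch makes no assumption about how the caller's arrays are aligned.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, _mm_add_ss */

/* Packed add (addps): all four lanes are summed. */
void packed_add(const float a[4], const float b[4], float out[4])
{
    assert(a && b && out);
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* Scalar add (addss): only lane 0 is summed; lanes 1..3 are passed through
 * unchanged from the first operand. */
void scalar_add(const float a[4], const float b[4], float out[4])
{
    assert(a && b && out);
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ss(va, vb));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the packed form yields {11, 22, 33, 44}, while the scalar form yields {11, 2, 3, 4}.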

® GDC’99 SIMD Data Organization
Exploit vertical parallelism
SOA versus AOS
  – Array of Structures: X0, Y0, Z0, X1, Y1, Z1, X2, Y2, Z2, … (better cacheability)
  – Structure of Arrays: X0, X1, X2, X3, … Y0, Y1, Y2, Y3, … Z0, Z1, Z2, Z3, … (better SIMD calculations)
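The two layouts can be written down as C types. This is only a sketch: the type names `VertexAoS` and `VertexBlockSoA`, the block size of 4, and the helper `aos_to_soa` are assumptions made for illustration and do not come from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Array of Structures: one vertex's components sit next to each other. */
typedef struct {
    float x, y, z;
} VertexAoS;

/* Structure of Arrays, in blocks of 4: each array holds the same component
 * of four vertices, so one 128-bit load fetches four X (or Y, or Z) values. */
typedef struct {
    float x[4];
    float y[4];
    float z[4];
} VertexBlockSoA;

/* Repack four AoS vertices into one SoA block. */
void aos_to_soa(const VertexAoS in[4], VertexBlockSoA *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < 4; ++i) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```

Repacking costs time, so a common choice is to store geometry in the SoA form from the start when it will mostly be consumed by SIMD code.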

® GDC’99 Some Examples…
Also visit http://developer.intel.com for more details on tools and documentation

® GDC’99 Matrix Vector Multiply
Typical 3D operation
Load values in SOA format
  – xxxx…, yyyy…, zzzz…
Follow with multiply and add operations
    movaps xmm0, [list+X+ecx]   ; load X components
    movaps xmm2, [list+Y+ecx]   ; load Y components
    movaps xmm3, [list+Z+ecx]   ; load Z components
    movaps xmm1, [esi+m00]      ; m00 m00 m00 m00
    movaps xmm4, [esi+m01]      ; m01 m01 m01 m01

® GDC’99 Matrix Vector Multiply (2)
Accumulate results…
    mulps  xmm1, xmm0           ; x*m00 x*m00 x*m00 x*m00
    mulps  xmm4, xmm2           ; y*m01 y*m01 y*m01 y*m01
    addps  xmm4, xmm1           ; add the 2 results
    movaps xmm1, [esi+m02]      ; load matrix element m02 (x4)
    mulps  xmm1, xmm3           ; z*m02 z*m02 z*m02 z*m02
    addps  xmm4, xmm1           ; add results
    addps  xmm4, [esi+m03]      ; add last element of matrix
We’ve just done 4 dot products in parallel!
Loop back and pick up next 4 vertices…
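The same computation, one matrix row applied to four vertices at once, can be sketched with intrinsics. The function name `transform_row4` and its parameter layout are hypothetical. The slides load matrix elements that are already replicated four times in memory; this sketch instead broadcasts each element with `_mm_set1_ps`, which is an assumption made to keep the example short.

```c
#include <assert.h>
#include <xmmintrin.h>

/* For four vertices stored in SoA form, compute
 *     out[i] = x[i]*row[0] + y[i]*row[1] + z[i]*row[2] + row[3]
 * which is one row of a matrix-vector product applied to 4 vertices. */
void transform_row4(const float x[4], const float y[4], const float z[4],
                    const float row[4], float out[4])
{
    assert(x && y && z && row && out);

    __m128 vx = _mm_loadu_ps(x);          /* x0 x1 x2 x3 */
    __m128 vy = _mm_loadu_ps(y);          /* y0 y1 y2 y3 */
    __m128 vz = _mm_loadu_ps(z);          /* z0 z1 z2 z3 */

    __m128 m0 = _mm_set1_ps(row[0]);      /* m00 in all four lanes */
    __m128 m1 = _mm_set1_ps(row[1]);
    __m128 m2 = _mm_set1_ps(row[2]);
    __m128 m3 = _mm_set1_ps(row[3]);

    __m128 acc = _mm_add_ps(_mm_mul_ps(vx, m0), _mm_mul_ps(vy, m1));
    acc = _mm_add_ps(acc, _mm_mul_ps(vz, m2));
    acc = _mm_add_ps(acc, m3);            /* add last element of the row */

    _mm_storeu_ps(out, acc);
}
```

Calling this once per matrix row, inside a loop that advances four vertices per iteration, gives the full transform.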

® GDC’99 Fast Reciprocal
Approximate instructions are fast!
Accurate to 11 bits (out of 23 in mantissa)
An iteration of Newton-Raphson doubles precision to 22
    ; Approximation of 1/w with rcpps without NR
    movaps xmm0, [ecx]          ; ecx points to w0, w1, w2, w3
    rcpps  xmm1, xmm0           ; xmm0 = w; xmm1 = ~1/w
    ; Additional code for approximation of 1/w with rcpps and NR
    mulps  xmm0, xmm1           ; xmm0 = w * ~1/w
    mulps  xmm0, xmm1           ; xmm0 = w * ~1/w * ~1/w
    addps  xmm1, xmm1           ; xmm1 = 2 * ~1/w
    subps  xmm1, xmm0           ; xmm1 = 2 * ~1/w - w * ~1/w * ~1/w
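The Newton-Raphson step above computes x1 = 2*x0 - w*x0*x0, where x0 is the rcpps estimate of 1/w. A minimal intrinsics sketch of both variants follows; the function names `recip4_approx` and `recip4_nr` are hypothetical, and unaligned loads are used so the caller's data needs no particular alignment.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Approximate 1/w for four values using rcpps only. */
void recip4_approx(const float w[4], float out[4])
{
    assert(w && out);
    _mm_storeu_ps(out, _mm_rcp_ps(_mm_loadu_ps(w)));
}

/* Approximate 1/w with rcpps plus one Newton-Raphson iteration:
 *     x1 = 2*x0 - w*x0*x0 */
void recip4_nr(const float w[4], float out[4])
{
    assert(w && out);
    __m128 vw = _mm_loadu_ps(w);
    __m128 x0 = _mm_rcp_ps(vw);                      /* ~1/w            */
    __m128 t  = _mm_mul_ps(_mm_mul_ps(vw, x0), x0);  /* w * ~1/w * ~1/w */
    __m128 x1 = _mm_sub_ps(_mm_add_ps(x0, x0), t);   /* 2*~1/w - t      */
    _mm_storeu_ps(out, x1);
}
```

The refined result costs four extra arithmetic instructions per group of four values; whether that is worthwhile depends on how much precision the caller needs.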

® GDC’99 FP to INT Conversion
Converts two of the FP values in the xmm register to 32-bit integers in MMX™ technology registers
    movaps    xmm0, [ecx]       ; load 4 single-precision values
    cvttps2pi mm0, xmm0         ; convert the lower 2 values (with truncation) into mm0
    shufps    xmm0, xmm0, 0Eh   ; move the upper 2 values into the lower 2 positions
    cvttps2pi mm1, xmm0         ; convert the former upper 2 values into mm1
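The same four-step sequence can be sketched with intrinsics, assuming an x86 compiler whose `xmmintrin.h` still exposes the MMX-based `_mm_cvttps_pi32`. The function name `convert4_truncate` is hypothetical. `_mm_empty` is called at the end because MMX registers were used.

```c
#include <assert.h>
#include <xmmintrin.h>  /* also pulls in the MMX intrinsics (__m64) */

/* Convert four floats to 32-bit integers, truncating toward zero. */
void convert4_truncate(const float in[4], int out[4])
{
    assert(in && out);

    __m128 v    = _mm_loadu_ps(in);
    __m64  lo   = _mm_cvttps_pi32(v);            /* cvttps2pi: lanes 0 and 1   */
    __m128 v_hi = _mm_shuffle_ps(v, v, 0x0E);    /* shufps 0Eh: lanes 2,3 -> 0,1 */
    __m64  hi   = _mm_cvttps_pi32(v_hi);         /* cvttps2pi: former lanes 2,3 */

    out[0] = _mm_cvtsi64_si32(lo);                      /* low 32 bits  */
    out[1] = _mm_cvtsi64_si32(_mm_srli_si64(lo, 32));   /* high 32 bits */
    out[2] = _mm_cvtsi64_si32(hi);
    out[3] = _mm_cvtsi64_si32(_mm_srli_si64(hi, 32));

    _mm_empty();  /* leave MMX state before any later x87 code runs */
}
```

Because the conversion truncates, an input of -2.7 becomes -2 rather than -3.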

® GDC’99 SIMD Integer Instructions
Extensions to MMX™ technology instructions
Operate on same 64-bit registers as previous MMX technology instructions
Instructions: extract, insert, min/max, byte mask to integer, multiply high unsigned, shuffle

® GDC’99 And now...
On to the secret ingredient…
  – Ok, it’s not really secret… I mentioned it earlier
  – But, it’s probably the most important (and difficult to use) part of the Streaming SIMD Extensions

® GDC’99 Computing Model
[Figure: input, Processing, output]
If you can’t bring data in fast enough or spit it out fast enough…
  – SIMD is of little or no use
The “streaming” part of the Streaming SIMD Extensions is critical to overall performance

® GDC’99 Prefetch
Hides latency by bringing in data before you need it
Provide cache hints to fetch data to different cache levels
  – prefetchnta - prefetch non-temporal data to non-temporal cache (L1)
  – prefetcht0 - prefetch temporal data into both L1 and L2 caches
  – prefetcht1, prefetcht2 - prefetch temporal data into L2 cache

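® GDC’99 Prefetch Illustrated
[Figure: memory, L1, and L2, showing the path taken by prefetchnta [esi]]

® GDC’99 Prefetch Illustrated (2)
[Figure: memory, L1, and L2, showing the path taken by prefetcht0 [esi]]

® GDC’99 Prefetch Illustrated (3)
[Figure: memory, L1, and L2, showing the path taken by prefetcht1 [esi] and prefetcht2 [esi]]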

® GDC’99 Prefetch data
loop:
    movaps xmm1, [edx + ebx]
    movaps xmm2, [edx + ebx + 16]
    ; Prefetch next iteration data into cache
    prefetcht1 [edx + ebx + 32]
    ; … perform calculations on this iteration…
    ; …
    add ebx, 32
    cmp ebx, buff_size
    jl  loop
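The loop above can be approximated in C with `_mm_prefetch`. This is a sketch under stated assumptions: the function name `sum_with_prefetch` is hypothetical, a running sum stands in for the elided calculations, and the prefetch distance of one iteration (32 bytes) simply mirrors the slide rather than being a tuned value.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Sum n floats, 8 per iteration, prefetching the next iteration's data. */
float sum_with_prefetch(const float *data, size_t n)
{
    assert(data != NULL);
    assert(n % 8 == 0);  /* the sketch handles whole 32-byte blocks only */

    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 8) {
        /* Hint the next block toward the cache; skipped on the last
         * iteration so no pointer past the buffer is formed. */
        if (i + 8 < n)
            _mm_prefetch((const char *)(data + i + 8), _MM_HINT_T1);

        /* Stand-in for "perform calculations on this iteration". */
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i + 4));
    }

    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

`_MM_HINT_T1` corresponds to prefetcht1; `_MM_HINT_T0`, `_MM_HINT_T2`, and `_MM_HINT_NTA` select the other hints. The prefetch does not change the result, only when the data arrives in the cache.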

® GDC’99 Warning about prefetch
Proper placing of prefetch is critical to ensure that
  – there’s enough time between prefetch and actual load
  – the limited resources available for loading data are not exhausted
Excessive use of prefetch can actually hurt performance
Spread out prefetches and ensure sufficient computation before load

® GDC’99 Streaming Stores
Avoids polluting cache with data that you don’t need real soon (or ever)
Supports 128 and 64-bit versions
  – movntps - from xmm reg to memory
  – movntq - from mm reg to memory
If store is a cache hit, then the cache is updated and the data is not sent to memory
Weakly ordered
  – Use sfence instruction to ensure order

® GDC’99 Streaming Store Illustrated
[Figure: memory, L1, and L2, showing the path taken by movntps [esi], xmm0]
No write allocation*
* If it was a cache hit, then data goes to cache and is not written directly to memory
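A streaming store followed by a fence can be sketched with `_mm_stream_ps` (movntps) and `_mm_sfence` (sfence). The function name `stream_copy` is hypothetical. movntps requires a 16-byte aligned destination, which the sketch checks with an assertion rather than handling unaligned buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Copy n floats to dst using non-temporal (streaming) stores. */
void stream_copy(float *dst, const float *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    assert(n % 4 == 0);                     /* whole 128-bit blocks only   */
    assert(((uintptr_t)dst & 15u) == 0);    /* movntps needs 16-byte align */

    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);
        _mm_stream_ps(dst + i, v);          /* movntps [dst+i], xmm */
    }

    /* Streaming stores are weakly ordered; the fence makes them globally
     * visible before any store that follows it. */
    _mm_sfence();
}
```

This pattern fits output buffers that are written once and not read again soon, such as transformed vertices handed off to another consumer.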

® GDC’99 Conclusion (& Plugs…)
Visit the other Intel talks related to Streaming SIMD Extensions
See the Intel booth for information on tools such as VTune and Intel® C/C++ Compiler.
Plan on attending one of the 3 roundtables on Pentium® III Processor Optimizations

® GDC’99 Thanks for coming!!!
Any questions???
