GDC’99: Streaming SIMD Extensions Overview. Haim Barad, Intel Corporation.

Presentation transcript:

GDC’99: Streaming SIMD Extensions Overview
Haim Barad, Project Leader/Staff Engineer, Media Team, Haifa, Israel
Intel Corporation
March 17, 1999

Agenda
• Introduction
• SIMD instructions
• Some examples
• The secret ingredient

Streaming SIMD Extensions
• New technology to exploit parallelism in FP and INT applications
• Key capabilities
  – Packed operations
  – Branch removal/compression
  – Data movement/hints
  – FP/INT type conversion

Application Domains…
• 3D graphics: geometry, lighting
• Signal processing
• High-precision simulation & modeling
• Video encoding/decoding
• Other apps requiring streaming input and output

Instruction Categories
• SIMD FP: 4 wide, single-precision
• SIMD INT: extensions to MMX™ technology capabilities
• Cacheability control
• State management

SIMD FP instruction types
• arithmetic
• square root
• approximation instructions
• min & max
• loads & stores
• move mask
• compare & set mask
• logical
• compare & set eflags
• conversion
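The compare & set mask, logical, and move mask entries are the pieces that make branch removal possible. The following is a minimal sketch, assuming the standard SSE intrinsics header `<xmmintrin.h>`; the function names `select_lt` and `lanes_lt` are hypothetical and do not come from the talk.

```c
#include <xmmintrin.h>

/* Per lane: result[i] = (a[i] < b[i]) ? x[i] : y[i], computed without a branch.
   select_lt is an illustrative name, not part of any library. */
static __m128 select_lt(__m128 a, __m128 b, __m128 x, __m128 y)
{
    __m128 mask = _mm_cmplt_ps(a, b);           /* compare & set mask: all ones where a < b */
    return _mm_or_ps(_mm_and_ps(mask, x),       /* logical: keep x where the mask is set */
                     _mm_andnot_ps(mask, y));   /* logical: keep y everywhere else */
}

/* Move mask: returns a 4-bit integer; bit i is set when a[i] < b[i]. */
static int lanes_lt(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmplt_ps(a, b));
}
```

A caller can test the integer from `lanes_lt` once for all four lanes (for example, to skip work when no lane passes) instead of branching per element.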

Key point #1: SIMD
• SIMD
  – Operates on data in parallel (e.g. process 4 vertices at a time instead of just 1)

Key point #2: Streaming
• Core/bus ratios are getting higher all the time
  – Memory latencies can kill potential parallelism
• Can we find a cure?
  – Hide load latency (prefetch)
  – Don’t pollute cache if data is never to be used again (streaming store)

New SIMD FP Registers
• Eight 128-bit registers, xmm0 through xmm7, each holding 4 x 32-bit single-precision FP values
• New set of registers!
• Direct access
• Used for data only
• Extended processor state

Packed SP Data Type
• Each register holds 4 single-precision FP values
• IEEE-754 compatible
• Scalar operates on least-significant number
• Each value in the register (xmm0 shown): 1-bit sign, 8-bit exponent, 23-bit mantissa

Compatibility
• 100% compatible with all existing IA-32 software
• Extension is not transparent to OS
  – Changes needed to OS to handle extended state
  – Support in Win98
  – Support in WinNT4 with SP4

Do I need assembly? No!
• 3 levels of support for IA
  – assembly: of course, do it yourself
  – intrinsics: assembly-like C
  – C++ classes: take advantage of SIMD from a high level
• Use Intel® C/C++ Compiler for intrinsic or C++ support of Streaming SIMD Extensions and MMX™ technology
  http://developer.intel.com/drg/pentiumiii/tools/ad2.htm

Packed Operations
    xmm0 (source/dest):  a0        a1        a2        a3
    xmm1 (source):       b0        b1        b2        b3
    xmm0 (result):       a0 op b0  a1 op b1  a2 op b2  a3 op b3
• op is one of addps, subps, mulps, divps

Scalar Operations
    xmm0 (source/dest):  a0        a1  a2  a3
    xmm1 (source):       b0        b1  b2  b3
    xmm0 (result):       a0 op b0  a1  a2  a3
• op is one of addss, subss, mulss, divss
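The difference between the packed and scalar forms can be seen at the intrinsics level. This is a small sketch, assuming `<xmmintrin.h>`; `packed_add` and `scalar_add` are hypothetical helper names, and unaligned loads are used only to keep the example simple.

```c
#include <xmmintrin.h>

/* Packed add (addps): all four lanes are summed. */
static void packed_add(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Scalar add (addss): only the least-significant lane is summed;
   lanes 1..3 of the result are copied unchanged from a. */
static void scalar_add(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the packed form yields {11, 22, 33, 44} and the scalar form yields {11, 2, 3, 4}.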

SIMD Data Organization
• Exploit vertical parallelism
• SOA versus AOS
  – Array of Structures: X0, Y0, Z0, X1, Y1, Z1, X2, Y2, Z2, … (better cacheability)
  – Structure of Arrays: X0, X1, X2, X3, … Y0, Y1, Y2, Y3, … Z0, Z1, Z2, Z3, … (better SIMD calculations)
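The two layouts might be declared as follows in C. This is only an illustrative sketch: the type names, the batch size `N_VERTS`, and the repacking helper `aos_to_soa` are all assumptions, not code from the talk.

```c
#include <stddef.h>

/* Array of Structures: x, y, z of one vertex sit next to each other. */
typedef struct { float x, y, z; } VertexAoS;

/* Structure of Arrays: all x values are contiguous, then all y, then all z.
   N_VERTS is a hypothetical batch size (a multiple of 4). */
#define N_VERTS 8
typedef struct {
    float x[N_VERTS];
    float y[N_VERTS];
    float z[N_VERTS];
} VerticesSoA;

/* Repack AoS data into SoA so that 4 consecutive x (or y, or z) values
   can be fetched with a single 128-bit load. */
static void aos_to_soa(const VertexAoS *in, VerticesSoA *out)
{
    for (size_t i = 0; i < N_VERTS; ++i) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```

After repacking, `out->x[0..3]` holds X0, X1, X2, X3 back to back, which is the arrangement the packed instructions operate on directly.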

Some Examples…
• Also visit for more details on tools and documentation

Matrix Vector Multiply
• Typical 3D operation
• Load values in SOA format
  – xxxx…, yyyy…, zzzz…
• Follow with multiply and add operations
    movaps xmm0, [list+X+ecx]   ;load X components
    movaps xmm2, [list+Y+ecx]   ;load Y components
    movaps xmm3, [list+Z+ecx]   ;load Z components
    movaps xmm1, [esi+m00]      ;m00 m00 m00 m00
    movaps xmm4, [esi+m01]      ;m01 m01 m01 m01

Matrix Vector Multiply (2)
• Accumulate results…
    mulps  xmm1, xmm0           ;x*m00 x*m00 x*m00 x*m00
    mulps  xmm4, xmm2           ;y*m01 y*m01 y*m01 y*m01
    addps  xmm4, xmm1           ;add the 2 results
    movaps xmm1, [esi+m02]      ;load matrix element m02 (x4)
    mulps  xmm1, xmm3           ;z*m02 z*m02 z*m02 z*m02
    addps  xmm4, xmm1           ;add results
    addps  xmm4, [esi+m03]      ;add last element of matrix
• We’ve just done 4 dot products in parallel!
• Loop back and pick up next 4 vertices…
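The same computation can be sketched with intrinsics. This is an illustration under stated assumptions rather than the talk's code: `transform_row` is a hypothetical name, the matrix row is broadcast with `_mm_set1_ps` instead of being stored pre-replicated in memory as the assembly assumes, and unaligned loads replace `movaps` so the caller need not align its arrays.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* For each vertex i: out[i] = xs[i]*row[0] + ys[i]*row[1] + zs[i]*row[2] + row[3].
   Inputs are in SoA form; n must be a multiple of 4. */
static void transform_row(const float *xs, const float *ys, const float *zs,
                          float *out, size_t n, const float row[4])
{
    __m128 m0 = _mm_set1_ps(row[0]);   /* m00 m00 m00 m00 */
    __m128 m1 = _mm_set1_ps(row[1]);   /* m01 m01 m01 m01 */
    __m128 m2 = _mm_set1_ps(row[2]);   /* m02 m02 m02 m02 */
    __m128 m3 = _mm_set1_ps(row[3]);   /* m03 m03 m03 m03 */

    for (size_t i = 0; i < n; i += 4) {
        __m128 acc = _mm_mul_ps(_mm_loadu_ps(xs + i), m0);            /* x*m00 */
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(ys + i), m1));  /* + y*m01 */
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(zs + i), m2));  /* + z*m02 */
        _mm_storeu_ps(out + i, _mm_add_ps(acc, m3));                  /* + m03 */
    }
}
```

Calling the function once per matrix row produces the full transformed vertex, four vertices per loop iteration.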

Fast Reciprocal
• Approximate instructions are fast!
• Accurate to 11 bits (out of 23 in mantissa)
• An iteration of Newton-Raphson doubles precision to 22
    ;Approximation of 1/w with rcpps, without NR
    movaps xmm0, [ecx]          ;ecx points to w0, w1, w2, w3
    rcpps  xmm1, xmm0           ;xmm0 = w; xmm1 = ~1/w
    ;Additional code for approximation of 1/w with rcpps and NR
    mulps  xmm0, xmm1           ;xmm0 = w * ~1/w
    mulps  xmm0, xmm1           ;xmm0 = w * ~1/w * ~1/w
    addps  xmm1, xmm1           ;xmm1 = 2 * ~1/w
    subps  xmm1, xmm0           ;xmm1 = 2 * ~1/w - w * ~1/w * ~1/w
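The Newton-Raphson step in the assembly above computes r' = 2r - w·r·r, where r is the rcpps estimate. A minimal intrinsics sketch of the same sequence follows; `rcp_nr` is a hypothetical name, and the header `<xmmintrin.h>` is assumed.

```c
#include <xmmintrin.h>

/* Approximate 1/w in all four lanes: rcpps estimate plus one
   Newton-Raphson refinement, r' = 2r - w*r*r. */
static __m128 rcp_nr(__m128 w)
{
    __m128 r = _mm_rcp_ps(w);                     /* r = ~1/w */
    __m128 t = _mm_mul_ps(_mm_mul_ps(w, r), r);   /* t = w * r * r */
    return _mm_sub_ps(_mm_add_ps(r, r), t);       /* 2r - w*r*r */
}
```

The refinement costs two multiplies, one add, and one subtract, which is why it can still be cheaper than a full-precision divps when roughly 22 bits are enough.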

FP to INT Conversion
• Converts two of the FP values in the xmm register to 32-bit integers in MMX™ technology registers
    movaps    xmm0, [ecx]
    cvttps2pi mm0, xmm0         ;convert the lower two values (with truncation)
    shufps    xmm0, xmm0, 0Eh   ;move the upper two values into the lower two positions
    cvttps2pi mm1, xmm0         ;convert what were the upper two values
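An intrinsics version of the same four-instruction sequence might look like the sketch below. It assumes a compiler that still exposes the MMX-register intrinsics (`_mm_cvttps_pi32`, `_mm_empty`) through `<xmmintrin.h>`, as GCC and Clang do; `trunc4` is a hypothetical name.

```c
#include <string.h>
#include <xmmintrin.h>

/* Truncate 4 floats to 4 32-bit integers, two at a time. */
static void trunc4(const float in[4], int out[4])
{
    __m128 v     = _mm_loadu_ps(in);
    __m64  lo    = _mm_cvttps_pi32(v);             /* cvttps2pi: lanes 0 and 1 */
    __m128 upper = _mm_shuffle_ps(v, v, 0x0E);     /* shufps: lanes 2,3 moved to 0,1 */
    __m64  hi    = _mm_cvttps_pi32(upper);         /* cvttps2pi: former lanes 2 and 3 */

    memcpy(&out[0], &lo, sizeof lo);               /* two 32-bit integers each */
    memcpy(&out[2], &hi, sizeof hi);
    _mm_empty();                                   /* emms: release the MMX state */
}
```

The trailing `_mm_empty()` matters on IA-32 because the MMX registers alias the x87 stack; later x87 floating-point code would otherwise see a corrupted tag word.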

SIMD Integer Instructions
• Extensions to MMX™ technology instructions
• Operate on same 64-bit registers as previous MMX technology instructions
• Instructions: extract, insert, min/max, byte mask to integer, multiply high unsigned, shuffle

And now...
• On to the secret ingredient…
  – Ok, it’s not really secret… I mentioned it earlier
  – But, it’s probably the most important (and difficult to use) part of the Streaming SIMD Extensions

Computing Model
[Figure: input → Processing → output]
• If you can’t bring data in fast enough or spit it out fast enough…
  – SIMD is of little or no use
• The “streaming” part of the Streaming SIMD Extensions is critical to overall performance

Prefetch
• Hides latency by bringing in data before you need it
• Provide cache hints to fetch data to different cache levels
  – prefetchnta: prefetch non-temporal data to non-temporal cache (L1)
  – prefetcht0: prefetch temporal data into both L1 and L2 caches
  – prefetcht1, prefetcht2: prefetch temporal data into L2 cache

Prefetch Illustrated
[Figure: memory, L2, and L1 diagram for prefetchnta [esi]]

Prefetch Illustrated (2)
[Figure: memory, L2, and L1 diagram for prefetcht0 [esi]]

Prefetch Illustrated (3)
[Figure: memory, L2, and L1 diagram for prefetcht1 [esi] and prefetcht2 [esi]]

Prefetch data
    next_block:                         ;label renamed: "loop" is an instruction mnemonic
        movaps xmm1, [edx + ebx]
        movaps xmm2, [edx + ebx + 16]
        ;Prefetch next iteration data into cache
        prefetcht1 [edx + ebx + 32]
        ; … perform calculations on this iteration …
        add ebx, 32
        cmp ebx, buff_size
        jl  next_block
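In C, the same pattern can be written with `_mm_prefetch`. The sketch below mirrors the slide by prefetching one iteration (32 bytes) ahead with the T1 hint; the function name `sum_with_prefetch` and the summing workload are assumptions made only to give the loop something to compute, and the right prefetch distance on real hardware has to be found by measurement.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Sum n floats (n must be a multiple of 8), prefetching the next
   32-byte block while the current one is being processed. */
static float sum_with_prefetch(const float *buf, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 8) {
        /* Hint only: a prefetch never faults, so pointing one block past
           the end of the buffer on the last iteration is harmless. */
        _mm_prefetch((const char *)(buf + i + 8), _MM_HINT_T1);
        acc = _mm_add_ps(acc, _mm_loadu_ps(buf + i));
        acc = _mm_add_ps(acc, _mm_loadu_ps(buf + i + 4));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```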

Warning about prefetch
• Proper placing of prefetch is critical to ensure that
  – there’s enough time between the prefetch and the actual load
  – the limited resources for loading data are not exhausted
• Excessive use of prefetch can actually hurt performance
• Spread out prefetches and ensure sufficient computation before the load

Streaming Stores
• Avoids polluting cache with data that you don’t need real soon (or ever)
• Supports 128-bit and 64-bit versions
  – movntps: from xmm reg to memory
  – movntq: from mm reg to memory
• If the store is a cache hit, then the cache is updated and the data is not sent to memory
• Weakly ordered
  – Use the sfence instruction to ensure order

Streaming Store Illustrated
[Figure: memory, L2, and L1 diagram for movntps [esi], xmm0; no write allocation*]
* If it was a cache hit, then data goes to cache and is not written directly to memory
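A streaming store followed by a fence can be sketched with `_mm_stream_ps` (movntps) and `_mm_sfence` (sfence). The helper name `fill_streaming` is hypothetical, and the buffer is assumed to be 16-byte aligned, which movntps requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Fill dst with value using non-temporal stores.
   dst must be 16-byte aligned and n a multiple of 4. */
static void fill_streaming(float *dst, size_t n, float value)
{
    assert(((uintptr_t)dst & 15u) == 0 && n % 4 == 0);

    __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < n; i += 4) {
        _mm_stream_ps(dst + i, v);   /* movntps: store with a non-temporal hint */
    }
    _mm_sfence();                    /* order the weakly ordered stores before later stores */
}
```

A buffer obtained from `_mm_malloc(size, 16)` satisfies the alignment requirement; the pattern suits output data, such as transformed vertices, that the CPU will not read back soon.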

Conclusion (& Plugs…)
• Visit the other Intel talks related to Streaming SIMD Extensions
• See the Intel booth for information on tools such as VTune and Intel® C/C++ Compiler.
• Plan on attending one of the 3 roundtables on Pentium® III Processor Optimizations

Thanks for coming!!!
• Any questions???