COMPUTER ARCHITECTURE (P175B125)

Assoc.Prof. Stasys Maciulevičius Computer Dept.

Extension of instruction sets
©S.Maciulevičius

Extension of instruction set
Reasons and preconditions:
- earlier processors focused on processing integer and floating-point numbers
- digital processing of graphics and audio information became widespread
- technology development and the shrinking of the manufacturing process from 0.35 μm to 0.13 μm led to a significant increase in the number of transistors on a chip
- a RISC core is compact: it uses a relatively small number of transistors
- the word length grew from 32 bits to 64 bits
- in many cases 16 or even 8 bits are sufficient to encode digital graphics and audio information
- SIMD and vector processing principles can be applied

Extension of instruction set
- In 1996, Intel introduced MMX technology: the processor instruction set was extended with 57 new instructions for optimizing multimedia applications
- These instructions treat data as in a SIMD (Single Instruction, Multiple Data) system
- Other companies introduced similar instruction set extensions in their processors a little later

Extensions of instruction set
Abbrev.            Name                                                  Company         Processors
MMX                MultiMedia eXtension                                  Intel           Pentium with MMX, Pentium II
MMX                MultiMedia eXtension                                  Cyrix           MediaGX
KNI, SSE, SSE2, …  Katmai New Instructions, Streaming SIMD Extensions    Intel           Pentium III, Pentium 4
3DNow!             -                                                     AMD             K6, K7 (Athlon, Duron)
AltiVec            -                                                     Motorola, IBM   G4, G5, Power 4+, PPC 970
VIS                Visual Instruction Set                                Sun Microsyst.  UltraSPARC
MAX-2              Multimedia Acceleration eXtensions                    HP              PA-7100LC, PA-8000

Requirements for extension of instruction set
In order to maintain compatibility with existing software and operating systems, the designers had to meet the following requirements:
- programs using MMX instructions must be able to run under existing operating systems; this means that MMX technology should not add any new architecturally visible state or events (exceptions)
- programs that do not use MMX instructions must be able to run without any changes; this means that MMX technology should not change any existing IA-32 instruction

Requirements for extension of instruction set
- existing applications must be able to use MMX technology without being rewritten as a whole: MMX code can be confined to a separate procedure while the rest of the program stays unchanged, which requires that MMX instructions work within the current procedure call conventions
- programs using MMX instructions must also be able to run on older processors that do not support MMX; this means that a DLL should be written in two versions, for processors with and without MMX technology

MMX registers
[Figure: the eight MMX registers MM0 to MM7 are mapped onto the FPU registers FP0 to FP7; each register has a tag, shown as 00]
- When FPU registers are used as MMX registers, the sign bit and all exponent bits are set to 1 (in IEEE 754 an all-ones exponent encodes a NaN or an infinity)
- Every MMX instruction except EMMS sets all tags to 00 (valid); the EMMS instruction, executed at the end of MMX code before FPU code resumes, sets all tags to 11, which means that the registers are "empty"
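The register sharing described above can be illustrated with a small sketch. It assumes the usual 80-bit x87 register layout (1 sign bit, 15 exponent bits, 64 significand bits); the function names are hypothetical and exist only for this illustration.

```python
MASK64 = (1 << 64) - 1  # low 64 bits of the 80-bit FPU register

def write_mmx(value64):
    """Return the 80-bit register image after an MMX write:
    sign and exponent bits (bits 64..79) are all set to 1,
    the 64-bit MMX value sits in the significand field."""
    return (0xFFFF << 64) | (value64 & MASK64)

def read_mmx(reg80):
    """An MMX read sees only the low 64 bits of the register."""
    return reg80 & MASK64

def fpu_fields(reg80):
    """Split the 80-bit image into (sign, exponent, significand)."""
    sign = reg80 >> 79
    exponent = (reg80 >> 64) & 0x7FFF
    significand = reg80 & MASK64
    return sign, exponent, significand
```

Read back as a floating-point register, such an image has an all-ones exponent, which is why stale MMX data cannot be mistaken for an ordinary number.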

Pixel encoding
[Figure: pixel formats: 8-bit gray pixels (intensity), 8-bit color pixels (index), 12 bit color pixels (R, G, B), 32 bit color pixels]

Addition – simple and with saturation
Consider two 8-bit signed integers, +85 and +58, and add them: 85 + 58 = 143, which exceeds 127, the largest positive 8-bit signed value.
The result can be handled in different ways:
- an overflow is signalled
- the result is set to 15 (the carry out of the seven magnitude bits is ignored, i.e. the addition is done mod 128)
- the result is set to 127 (the maximal value of a positive 8-bit integer), i.e. the addition saturates

Data range and saturation
                    Lower boundary           Upper boundary
                    Hexadecimal  Decimal     Hexadecimal  Decimal
Signed    1 byte    80H          -128        7FH          127
          2 bytes   8000H        -32 768     7FFFH        32 767
Unsigned  1 byte    00H          0           FFH          255
          2 bytes   0000H        0           FFFFH        65 535
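The difference between wraparound and saturating addition can be sketched as follows, using the +85 and +58 example and the boundaries from the table. The function names are hypothetical. The wraparound variant here is ordinary two's-complement addition mod 2^8, which is an assumption about the encoding: it yields the bit pattern 8FH, read as -113, whereas dropping only the carry out of the low seven bits gives 143 mod 128 = 15.

```python
def wrap_add(a, b, bits=8):
    """Signed addition with wraparound (two's complement, mod 2**bits)."""
    mask = (1 << bits) - 1
    s = (a + b) & mask
    # Reinterpret the bit pattern as a signed value.
    return s - (1 << bits) if s >> (bits - 1) else s

def sat_add_signed(a, b, bits=8):
    """Signed addition clamped to [-2**(bits-1), 2**(bits-1) - 1]."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, a + b))

def sat_add_unsigned(a, b, bits=8):
    """Unsigned addition clamped to the upper boundary 2**bits - 1."""
    return min((1 << bits) - 1, a + b)
```

With saturation a bright pixel stays bright when it overflows, instead of wrapping around to a dark value.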

Some graphic instructions
In the mnemonics, t means: n - nibble, b - byte, h - halfword, w - word; x means: u - unsigned, s - signed, us - mixed.

Mnemonic    Instruction                   Operands      Operation
padd.t      Packed Add                    rd, rs1, rs2  rd:rd+1 ← rs1:rs1+1 + rs2:rs2+1; sum mod 2^t
padds.x.t   Packed Add and Saturate       rd, rs1, rs2  rd:rd+1 ← rs1:rs1+1 + rs2:rs2+1; sum with saturation
psub.t      Packed Subtract               rd, rs1, rs2  rd:rd+1 ← rs1:rs1+1 - rs2:rs2+1; subtraction mod 2^t
psubs.x.t   Packed Subtract and Saturate  rd, rs1, rs2  rd:rd+1 ← rs1:rs1+1 - rs2:rs2+1; subtraction with saturation
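A packed add treats one wide register as several independent lanes, so a carry never crosses a lane boundary. The sketch below emulates this on a 64-bit Python integer; the helper names are hypothetical and the 64-bit register width is an assumption made for the illustration.

```python
def split_lanes(word, lane_bits, total_bits=64):
    """Cut a register value into lanes, lowest lane first."""
    mask = (1 << lane_bits) - 1
    return [(word >> i) & mask for i in range(0, total_bits, lane_bits)]

def join_lanes(lanes, lane_bits):
    """Reassemble lanes (lowest first) into one register value."""
    word = 0
    for i, lane in enumerate(lanes):
        word |= lane << (i * lane_bits)
    return word

def padd(a, b, lane_bits=8):
    """Packed add: each lane is summed mod 2**lane_bits."""
    mask = (1 << lane_bits) - 1
    pairs = zip(split_lanes(a, lane_bits), split_lanes(b, lane_bits))
    return join_lanes([(x + y) & mask for x, y in pairs], lane_bits)

def padds_u(a, b, lane_bits=8):
    """Packed add with unsigned saturation in each lane."""
    mask = (1 << lane_bits) - 1
    pairs = zip(split_lanes(a, lane_bits), split_lanes(b, lane_bits))
    return join_lanes([min(x + y, mask) for x, y in pairs], lane_bits)
```

For example, adding 01FFH and 0101H as bytes gives 0200H with wraparound and 02FFH with saturation, while an ordinary integer add would give 0300H because the carry leaks into the next byte.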

Pixel addition - examples
[Figure: byte-wise addition examples for padd.b, padds.u.b, padds.s.b and padds.us.b; byte values shown: 00 55 80 AA 7F FF 54 2A 29]

Some graphic instructions
Mnemonic  Instruction                                       Operands      Operation
pmulh     Packed Multiply High (on words)                   rd, rs1, rs2  rd:rd+1 ← rs1 × rs2:rs2+1; pixel multiply
pmadd     Packed multiply on words and add resulting pairs  rd, rs1, rs2  rd:rd+1 ← rs1:rs1+1 ×+ rs2:rs2+1; multiply words and add resulting pairs

Vector product
x = Σ (a(i) × c(i)), i = 0..7
[Figure: pmadd multiplies the words a0 ... a7 by c0 ... c7 and adds adjacent products, giving a0c0+a1c1, a2c2+a3c3, a4c4+a5c5 and a6c6+a7c7; adding these results gives the partial sums s0145 and s2367]
Final shift and addition are needed
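The data flow of the vector product can be sketched as follows. This is a simplified model with hypothetical function names: it assumes four 16-bit words per register and uses unbounded Python integers, so the overflow behaviour of the 32-bit sums is not modelled.

```python
def pmadd(a_words, c_words):
    """Multiply four word pairs and add adjacent products,
    giving two sums: [a0*c0 + a1*c1, a2*c2 + a3*c3]."""
    p = [a * c for a, c in zip(a_words, c_words)]
    return [p[0] + p[1], p[2] + p[3]]

def dot8(a, c):
    """Eight-element vector product built from two pmadd steps."""
    lo = pmadd(a[0:4], c[0:4])   # a0c0+a1c1, a2c2+a3c3
    hi = pmadd(a[4:8], c[4:8])   # a4c4+a5c5, a6c6+a7c7
    s0145 = lo[0] + hi[0]        # packed add, first lane
    s2367 = lo[1] + hi[1]        # packed add, second lane
    # Final step: shift one partial sum into place and add.
    return s0145 + s2367
```

Two multiply-add steps replace eight separate multiplications, which is where most of the instruction count saving comes from.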

Vector product: the gain
            Number of instructions   Number of MMX
            without MMX              instructions
Load        16                       4
Multiply    8                        2
Shift
Add         7                        1
Store
Other       -                        3
Total       40                       13

Peculiarities of using MMX
- Because MMX and FPU instructions use the same registers for different purposes, developers must take care when writing code that uses MMX and FPU instructions alternately
- MMX modules should be separated from floating-point modules
- Code of one type (MMX or floating-point) should be grouped together as much as possible
- To achieve maximum performance, loops in a module of one type should not contain conditional jumps into a module of the other type

SSE
In the Pentium III (1999), 70 new instructions, called SSE (Streaming SIMD Extensions), were added (as a reply to AMD's 3DNow!).
The main differences from MMX are:
- some useful new operations, such as min/max, were added
- some cache and memory management operations were added, which optimize the exchange between the L2/L3 cache and main memory
- SSE added eight new 128-bit registers, XMM0 through XMM7, and floating-point instructions (on 32-bit numbers)

SSE
The XMM block carries out:
- vector operations over a set of 4 operands (pairs)
- scalar operations over one operand (pair), the lower 32-bit word
When instructions are executed in the XMM block, the FPU/MMX unit is free, so SSE instructions can be executed in parallel with floating-point instructions.
Thus the MMX unit executes integer instructions, and the XMM block executes 32-bit floating-point instructions.

SSE2
- SSE2, introduced with the Pentium 4, is a major enhancement to SSE
- SSE2 adds new math instructions for double-precision (64-bit) floating point and also extends MMX instructions to operate on the 128-bit XMM registers
- SSE2 enables the programmer to perform SIMD math on any data type (from 8-bit integer to 64-bit float) entirely within the XMM vector register file, without the need to use the legacy MMX or FPU registers

Data formats in SSE2
Integer formats in a 128-bit register:
- one 128-bit integer
- two 64-bit integers
- four 32-bit integers
- eight 16-bit integers
- sixteen 8-bit integers

Data formats in SSE2
Floating-point formats in a 128-bit register:
- two 64-bit floating-point numbers
- four 32-bit floating-point numbers

SSE3
SSE3, also called Prescott New Instructions (PNI), is an incremental upgrade to SSE2, adding a handful of DSP-oriented mathematics instructions and some process (thread) management instructions.

Some examples of SSE3
Operands:
  OpA (128 bits, 4 words): a3 | a2 | a1 | a0
  OpB (128 bits, 4 words): b3 | b2 | b1 | b0
MOVSLDUP, Move Packed Single-FP Low and Duplicate:
  Result: b2 | b2 | b0 | b0
HADDPS, "horizontal" addition:
  Result: b3 + b2 | b1 + b0 | a3 + a2 | a1 + a0
ADDSUBPS, addition and subtraction:
  Result: a3 + b3 | a2 - b2 | a1 + b1 | a0 - b0
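The three SSE3 examples can be emulated on plain Python lists. In this sketch each operand is a list of four numbers stored lowest element first, so a[0] is a0 and the lists read in the opposite order to the "a3 | a2 | a1 | a0" notation; the function names simply mirror the mnemonics and are not a real API.

```python
def movsldup(b):
    """Duplicate the even-indexed elements: b2 | b2 | b0 | b0."""
    return [b[0], b[0], b[2], b[2]]

def haddps(a, b):
    """Horizontal add: b3+b2 | b1+b0 | a3+a2 | a1+a0."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]

def addsubps(a, b):
    """Alternate subtract and add: a3+b3 | a2-b2 | a1+b1 | a0-b0."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]
```

Horizontal and alternating operations of this kind are convenient for complex-number arithmetic, where real and imaginary parts sit in neighbouring elements.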

SSSE3
- SSSE3 is an incremental upgrade to SSE3, adding 16 new opcodes, which include permuting the bytes in a word, multiplying 16-bit fixed-point numbers with correct rounding, and within-word accumulate instructions
- It was introduced by Intel in the Core microarchitecture, used in Xeon 5100 and Core 2 processors

SSE4
- SSE4, announced in 2006, adds 54 new instructions for Intel Core microarchitecture processors (the SSE4.1 set has 47 instructions, SSE4.2 has 7); AMD K10 processors implement a different, smaller extension, SSE4a
- Unlike all previous iterations of SSE, SSE4 contains instructions that execute operations which are not specific to multimedia applications
- SSE4 operations use 128-bit registers
- One example: MPSADBW computes eight sums of absolute differences in one instruction: |x0-y0|+|x1-y1|+|x2-y2|+|x3-y3|, |x0-y1|+|x1-y2|+|x2-y3|+|x3-y4|, ...; such an operation is useful in HDTV coding devices
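The sums that MPSADBW produces can be written out directly. The sketch below is a simplified model: the real instruction selects the starting offsets of both blocks with an immediate operand, which is omitted here, and the function name and argument layout are assumptions made for the illustration.

```python
def mpsadbw(x, y):
    """Eight sums of absolute differences.

    x: block of 4 byte values (x0..x3)
    y: window of 11 byte values (y0..y10)
    Sum number i compares x with y[i], ..., y[i+3].
    """
    if len(x) != 4 or len(y) != 11:
        raise ValueError("expected 4 values in x and 11 values in y")
    return [sum(abs(x[j] - y[i + j]) for j in range(4)) for i in range(8)]
```

A video encoder can use the position of the smallest sum as the best match of the block x within the window y, which is the core step of motion estimation.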

3DNow!
- 3DNow! is an extension to the x86 instruction set developed by AMD
- The original idea behind its creation was to extend MMX, which operates only on integers, so that floating-point calculations are accelerated as well
- It adds SIMD instructions to the base x86 instruction set, enabling it to perform simple vector processing, which improves the performance of many graphics-intensive applications
- The first microprocessor to implement 3DNow! was the AMD K6-2, which was introduced in 1998

SSE5
- SSE5 (short for Streaming SIMD Extensions version 5), announced by AMD on August 30, 2007, is an extension to the 128-bit SSE core instructions in the AMD64 instruction set for the Bulldozer processor core
- The details of how the instructions are encoded were revised in May 2009 for better compatibility with Intel's proposed AVX (Advanced Vector Extensions) instruction set

SSE5
At the same time, SSE5 was renamed and split into:
- XOP: new operations over integer vectors
- FMA4: fused multiply-and-add instructions for floating-point scalar and SIMD operations
- CVT16: half-precision floating-point conversion
The SSE5 instruction set consisted of 170 instructions (including 46 base instructions).

Advanced Vector Extensions
- Advanced Vector Extensions (AVX) is a new 256-bit SIMD floating-point vector extension of the Intel Architecture
- Its introduction was targeted for the Sandy Bridge processor family in the 2010 timeframe
- Intel AVX accelerates floating-point-intensive computation in general-purpose applications such as image, video and audio processing, engineering applications such as 3D modeling and analysis, scientific simulation, and financial analytics

Advanced Vector Extensions
- The size of the SIMD vector registers is increased from the 128-bit XMM registers to 256-bit registers called YMM0 to YMM15
- Existing 128-bit instructions use the lower half of the YMM registers
- Further extensions to 512 or 1024 bits are expected in the future
- Instructions are non-destructive: the AVX instruction set allows all two-operand XMM instructions to be turned into non-destructive three-operand forms, where the destination register is different from both source registers. For example, a = a + b is replaced by c = a + b, so that register a is unchanged after the instruction

Advanced Vector Extensions 2
Advanced Vector Extensions 2 (AVX2), also known as Haswell New Instructions, is an expansion of the AVX instruction set first introduced in Intel's Haswell microarchitecture. AVX2 makes the following additions:
- expansion of most integer AVX instructions to 256 bits
- 3-operand general-purpose bit manipulation and multiply
- gather support, enabling vector elements to be loaded from non-contiguous memory locations
- vector shifts and 3-operand fused multiply-accumulate support
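The idea of a gather load can be shown with a minimal sketch. Memory is modelled as a Python list and the function name is hypothetical; the per-element mask of the real gather instructions is reduced here to an optional list of flags, with unselected elements keeping the value from dest.

```python
def gather(memory, indices, dest=None, mask=None):
    """Load one element per index from non-contiguous positions.

    memory:  list standing in for main memory
    indices: one index per vector element
    dest:    previous register contents (used where mask is False)
    mask:    optional per-element flags; None means load everything
    """
    if mask is None:
        return [memory[i] for i in indices]
    return [memory[i] if m else d
            for i, d, m in zip(indices, dest, mask)]
```

Without gather support the same result needs one scalar load and one insert per element, so table lookups and indirect array accesses are hard to vectorize.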
