Subword Parallelism (presentation transcript)

1 Subword Parallelism
Graphics and audio applications can take advantage of performing simultaneous operations on short vectors
– Example: a 128-bit adder can perform:
  Sixteen 8-bit adds
  Eight 16-bit adds
  Four 32-bit adds
Also called data-level parallelism, vector parallelism, or Single Instruction, Multiple Data (SIMD)
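As a hedged illustration of the 128-bit adder case above (not part of the original slide), the C sketch below uses the SSE2 intrinsic _mm_add_epi8 to perform sixteen 8-bit adds with a single 128-bit operation; the variable names and values are illustrative only.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    /* Two vectors of sixteen 8-bit integers (values chosen for illustration). */
    __m128i a = _mm_set1_epi8(10);              /* sixteen copies of 10 */
    __m128i b = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                              8, 9, 10, 11, 12, 13, 14, 15);

    /* One 128-bit instruction performs sixteen independent 8-bit adds. */
    __m128i sum = _mm_add_epi8(a, b);

    unsigned char out[16];
    _mm_storeu_si128((__m128i *)out, sum);      /* unaligned store to memory */

    for (int i = 0; i < 16; i++)
        printf("%d ", out[i]);                  /* prints 10 11 12 ... 25 */
    printf("\n");
    return 0;
}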

2 x86 FP Architecture
Originally based on the 8087 FP coprocessor
– 8 × 80-bit extended-precision registers
– Used as a push-down stack
– Registers indexed from TOS: ST(0), ST(1), …
FP values are 32-bit or 64-bit in memory
– Converted on load/store of memory operand
– Integer operands can also be converted on load/store
Very difficult to generate and optimize code
– Result: poor FP performance

3 x86 FP Instructions
Optional variations
– I: integer operand
– P: pop operand from stack
– R: reverse operand order
– But not all combinations allowed
Data transfer: FILD mem/ST(i), FISTP mem/ST(i), FLDPI, FLD1, FLDZ
Arithmetic: FIADDP mem/ST(i), FISUBRP mem/ST(i), FIMULP mem/ST(i), FIDIVRP mem/ST(i), FSQRT, FABS, FRNDINT
Compare: FICOMP, FIUCOMP, FSTSW AX/mem
Transcendental: FPATAN, F2XM1, FCOS, FPTAN, FPREM, FSIN, FYL2X

4 Streaming SIMD Extension 2 (SSE2)
Adds 8 × 128-bit XMM registers
– Extended to 16 registers in AMD64/EM64T
Can be used for multiple FP operands
– 2 × 64-bit double precision
– 4 × 32-bit single precision
Instructions operate on them simultaneously
– Single-Instruction Multiple-Data
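As a hedged sketch of the SSE2 double-precision case above (an editorial addition, not from the slide), the C fragment below packs two 64-bit doubles into one 128-bit value and adds both simultaneously with _mm_add_pd; names and values are illustrative.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    /* Each 128-bit XMM value holds two 64-bit doubles. */
    __m128d x = _mm_set_pd(2.0, 1.0);    /* x = {1.0, 2.0} (low, high) */
    __m128d y = _mm_set_pd(20.0, 10.0);  /* y = {10.0, 20.0} */

    /* Both double-precision adds happen in one instruction (addpd). */
    __m128d z = _mm_add_pd(x, y);

    double out[2];
    _mm_storeu_pd(out, z);
    printf("%f %f\n", out[0], out[1]);   /* prints 11.000000 22.000000 */
    return 0;
}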

5 Matrix Multiply
Unoptimized code:

void dgemm (int n, double* A, double* B, double* C)
{
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
    {
      double cij = C[i+j*n];       /* cij = C[i][j] */
      for (int k = 0; k < n; k++)
        cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */
      C[i+j*n] = cij;              /* C[i][j] = cij */
    }
}
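A minimal, hypothetical driver (not from the slides) that calls the dgemm above; it assumes the column-major layout implied by the C[i+j*n] indexing, and the matrix size and contents are made up for illustration.

#include <stdio.h>

/* dgemm as defined on the slide above. */
void dgemm (int n, double* A, double* B, double* C);

int main(void)
{
    /* 2x2 matrices stored column-major: element (i,j) lives at i + j*n. */
    double A[4] = {1, 3, 2, 4};   /* A = [1 2; 3 4] */
    double B[4] = {1, 0, 0, 1};   /* B = identity   */
    double C[4] = {0, 0, 0, 0};   /* dgemm accumulates into C, so clear it */

    dgemm(2, A, B, C);            /* C += A * B */

    for (int i = 0; i < 2; i++)
        printf("%g %g\n", C[i + 0*2], C[i + 1*2]);  /* prints A back: 1 2 / 3 4 */
    return 0;
}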

6 Matrix Multiply
x86 assembly code:

vmovsd (%r10),%xmm0             # Load 1 element of C into %xmm0
mov    %rsi,%rcx                # register %rcx = %rsi
xor    %eax,%eax                # register %eax = 0
vmovsd (%rcx),%xmm1             # Load 1 element of B into %xmm1
add    %r9,%rcx                 # register %rcx = %rcx + %r9
vmulsd (%r8,%rax,8),%xmm1,%xmm1 # Multiply %xmm1, element of A
add    $0x1,%rax                # register %rax = %rax + 1
cmp    %eax,%edi                # compare %eax to %edi
vaddsd %xmm1,%xmm0,%xmm0        # Add %xmm1, %xmm0
jg     30                       # jump if %edi > %eax
add    $0x1,%r11d               # register %r11 = %r11 + 1
vmovsd %xmm0,(%r10)             # Store %xmm0 into C element

7 Matrix Multiply
Optimized C code:

#include <x86intrin.h>
void dgemm (int n, double* A, double* B, double* C)
{
  for ( int i = 0; i < n; i+=4 )
    for ( int j = 0; j < n; j++ ) {
      __m256d c0 = _mm256_load_pd(C+i+j*n);  /* c0 = C[i][j] */
      for ( int k = 0; k < n; k++ )
        c0 = _mm256_add_pd(c0,               /* c0 += A[i][k]*B[k][j] */
               _mm256_mul_pd(_mm256_load_pd(A+i+k*n),
                             _mm256_broadcast_sd(B+k+j*n)));
      _mm256_store_pd(C+i+j*n, c0);          /* C[i][j] = c0 */
    }
}
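One caveat worth noting (an editorial addition, not from the slides): _mm256_load_pd and _mm256_store_pd require 32-byte-aligned addresses, so a caller of this optimized dgemm has to allocate the matrices accordingly. Below is a hedged sketch using C11's aligned_alloc; the size n is assumed here to be a multiple of 4, matching the i+=4 stride.

#include <stdlib.h>

void dgemm (int n, double* A, double* B, double* C);  /* AVX version above */

int main(void)
{
    int n = 8;  /* assumed multiple of 4 so the 256-bit loads line up */

    /* 32-byte alignment keeps _mm256_load_pd/_mm256_store_pd legal. */
    double *A = aligned_alloc(32, n * n * sizeof(double));
    double *B = aligned_alloc(32, n * n * sizeof(double));
    double *C = aligned_alloc(32, n * n * sizeof(double));
    if (!A || !B || !C) return 1;

    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 1.0; C[i] = 0.0; }

    dgemm(n, A, B, C);   /* every element of C becomes n (= 8.0) */

    free(A); free(B); free(C);
    return 0;
}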

8 Matrix Multiply
Optimized x86 assembly code:

vmovapd (%r11),%ymm0            # Load 4 elements of C into %ymm0
mov     %rbx,%rcx               # register %rcx = %rbx
xor     %eax,%eax               # register %eax = 0
vbroadcastsd (%rax,%r8,1),%ymm1 # Make 4 copies of B element
add     $0x8,%rax               # register %rax = %rax + 8
vmulpd  (%rcx),%ymm1,%ymm1      # Parallel mul %ymm1, 4 A elements
add     %r9,%rcx                # register %rcx = %rcx + %r9
cmp     %r10,%rax               # compare %r10 to %rax
vaddpd  %ymm1,%ymm0,%ymm0       # Parallel add %ymm1, %ymm0
jne     50                      # jump if %rax != %r10
add     $0x1,%esi               # register %esi = %esi + 1
vmovapd %ymm0,(%r11)            # Store %ymm0 into 4 C elements

9 Right Shift and Division
Left shift by i places multiplies an integer by 2^i
Does right shift divide by 2^i?
– Only for unsigned integers
For signed integers
– Arithmetic right shift: replicate the sign bit
– e.g., –5 / 4:
  11111011₂ >> 2 = 11111110₂ = –2
  Rounds toward –∞
– cf. logical right shift: 11111011₂ >>> 2 = 00111110₂ = +62
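A small C sketch (an illustrative addition, not from the slide) confirming the arithmetic above: a signed right shift of –5 replicates the sign bit and rounds toward –∞, while shifting the same bit pattern as an unsigned value gives the logical-shift result. Strictly, right-shifting a negative signed value is implementation-defined in C, but virtually all compilers implement it as an arithmetic shift.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int8_t  s = -5;              /* bit pattern 11111011 */
    uint8_t u = (uint8_t)s;      /* same bits, value 251  */

    /* Arithmetic shift (implementation-defined but near-universal): sign bit copied in. */
    printf("%d\n", (int)(s >> 2));   /* prints -2 (rounds toward -infinity) */

    /* Logical shift on the unsigned value: zeros shifted in. */
    printf("%d\n", (int)(u >> 2));   /* prints 62 */

    /* Integer division rounds toward zero, so it differs for negatives. */
    printf("%d\n", -5 / 4);          /* prints -1 */
    return 0;
}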

10 Associativity
Parallel programs may interleave operations in unexpected orders
– Assumptions of associativity may fail
Need to validate parallel programs under varying degrees of parallelism
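Floating-point addition is the classic case where the associativity assumption breaks: because each operation rounds, (x + y) + z can differ from x + (y + z), so a parallel reduction that regroups the sums may not reproduce the sequential result. A hedged illustration in C follows; the extreme values are chosen here purely for demonstration.

#include <stdio.h>

int main(void)
{
    float x = -1.5e38f;
    float y =  1.5e38f;
    float z =  1.0f;

    /* (x + y) + z: the huge values cancel first, then the 1.0 survives. */
    float left  = (x + y) + z;    /* 0.0 + 1.0 = 1.0 */

    /* x + (y + z): the 1.0 is lost when rounded into 1.5e38. */
    float right = x + (y + z);    /* -1.5e38 + 1.5e38 = 0.0 */

    printf("(x+y)+z = %f\n", left);   /* prints 1.000000 */
    printf("x+(y+z) = %f\n", right);  /* prints 0.000000 */
    return 0;
}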

