A Survey of the Current State of the Art in SIMD: Or, How much wood could a woodchuck chuck if a woodchuck could chuck n pieces of wood in parallel? Wojtek.

A Survey of the Current State of the Art in SIMD: Or, How much wood could a woodchuck chuck if a woodchuck could chuck n pieces of wood in parallel? Wojtek Rajski, Nels Oscar, David Burri, Alex Diede

Introduction
We have seen how to improve performance by exploiting:
  o Instruction-level parallelism
  o Thread-level parallelism
One form of parallelism we have not yet discussed is data-level parallelism.

Introduction
Flynn's Taxonomy
An organization of computer architectures based on their instruction and data streams
Divides all architectures into 4 categories:
1. SISD
2. SIMD
3. MISD
4. MIMD

Introduction
Implementations of SIMD
  o Prevalent in GPUs
  o SIMD extensions in CPUs
  o Embedded systems and mobile platforms

Introduction
Software for SIMD
Many libraries utilize and encapsulate SIMD
Adopted in these areas:
  o Graphics
  o Signal Processing
  o Video Encoding/Decoding
  o Some scientific applications

Introduction
SIMD implementations fall into three high-level categories:
1. Vector Processors
2. Multimedia Extensions
3. Graphics Processors

Introduction
Going forward:
Streaming SIMD Extensions (MMX/SSE/AVX)
  o Similar technology in GPUs
Compiler techniques for DLP
Problems in the world of SIMD

Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for x86 computers. This figure assumes that two cores per chip for MIMD will be added every two years and the number of operations for SIMD will double every four years. Copyright © 2011, Elsevier Inc.

SIMD in Hardware
  o Register size and hardware changes
  o Intel Core i7 example
  o The ‘Roofline’ model
  o Limitations of streaming extensions in a CPU

SIMD in Hardware
Streaming SIMD requires some basic components:
  o Wide registers
     - Rather than 32 bits, registers are 64, 128, or 256 bits wide.
  o Additional control lines
  o Additional ALUs to operate simultaneously on operands of up to 16 bytes
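As a rough illustration of what a wide register buys, the sketch below models a 128-bit (16-byte) register in plain C as a union of lanes. The type `reg128` and the helpers `add_u8` and `add_u32` are hypothetical names invented here; real hardware performs each packed add as one instruction, whereas this emulation loops over the lanes.

```c
#include <stdint.h>

/* A 128-bit (16-byte) register viewed as lanes of different widths.
   Hypothetical type for illustration only. */
typedef union {
    uint8_t  u8[16];  /* 16 lanes of 8 bits  */
    uint16_t u16[8];  /*  8 lanes of 16 bits */
    uint32_t u32[4];  /*  4 lanes of 32 bits */
} reg128;

/* Packed add on 16 byte-wide lanes; each lane wraps around on its own. */
reg128 add_u8(reg128 a, reg128 b)
{
    reg128 r;
    for (int k = 0; k < 16; k++)
        r.u8[k] = (uint8_t)(a.u8[k] + b.u8[k]);
    return r;
}

/* Packed add on 4 lanes of 32 bits, using the same 16 bytes of storage. */
reg128 add_u32(reg128 a, reg128 b)
{
    reg128 r;
    for (int k = 0; k < 4; k++)
        r.u32[k] = a.u32[k] + b.u32[k];
    return r;
}
```

The same 16 bytes serve as sixteen, eight, or four operands depending on the instruction, which is why the extra ALUs and control lines are needed alongside the wider registers.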

Hardware
Figure 4.4 Using multiple functional units to improve the performance of a single vector add instruction, C = A + B. The vector processor (a) on the left has a single add pipeline and can complete one addition per cycle. The vector processor (b) on the right has four add pipelines and can complete four additions per cycle. The elements within a single vector add instruction are interleaved across the four pipelines. The set of elements that move through the pipelines together is termed an element group.

Intel i7
The Intel Core i7
  o Superscalar processor
  o Contains several SIMD extensions
     - 16 x 256-bit wide registers, and physical registers on pipeline
     - Support for 2- and 3-operand instructions

The Roofline Model of Performance
The Roofline model of performance aggregates floating-point performance, operational intensity, and memory performance in a single graph.
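The relationship the Roofline graph draws can be written as a one-line bound: attainable performance is the smaller of the peak floating-point rate and the peak memory bandwidth multiplied by the operational intensity. The sketch below assumes that standard formulation; the function name `roofline_gflops` and its parameter names are made up for illustration.

```c
/* Attainable GFLOP/s under the Roofline model:
     min(peak floating-point rate, peak bandwidth * operational intensity)
   peak_gflops    - peak floating-point performance (GFLOP/s)
   peak_bw_gbs    - peak memory bandwidth (GB/s)
   flops_per_byte - operational intensity of the kernel (FLOPs per byte)
   All names are hypothetical. */
double roofline_gflops(double peak_gflops, double peak_bw_gbs,
                       double flops_per_byte)
{
    double memory_bound = peak_bw_gbs * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

With made-up figures of 16 GFLOP/s peak and 8 GB/s bandwidth, a kernel at 0.5 FLOPs per byte sits on the slanted, memory-bound part of the roof, while a kernel at 4 FLOPs per byte reaches the flat, compute-bound part.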

The Roofline Model of Performance Opteron X2

Limitations
  o Memory latency
  o Memory bandwidth
  o The actual amount of vectorizable code

SIMD at the software level
SIMD is not a new field, but the GPGPU movement has brought more focus to it.

SIMD at the software level
CUDA (Compute Unified Device Architecture)
  o Developed by Nvidia
  o Restricted to GPUs with Nvidia chips
  o Graphics cards G8x and newer
  o Provides both high- and low-level APIs

SIMD at the software level
OpenCL
  o Initially developed by Apple, now maintained by the Khronos Group
  o Open to any vendor that decides to support it
  o Designed to execute across GPUs and CPUs
  o Graphics cards G8x and newer
  o Provides both high- and low-level APIs

SIMD at the software level
DirectCompute
  o Developed by Microsoft
  o Open to any vendor that supports DirectX 11
  o Windows only
  o Graphics cards GTX 400 and HD 5000
  o Intel’s Ivy Bridge will also be supported

Compiler Optimization
Not everyone programs in SIMD-based languages.
C and Java were never designed with SIMD in mind.
Compiler technology had to improve to detect vectorizable code.

Compiler Optimization
Before optimization can begin
  o Data dependencies have to be understood
  o Only dependencies within the vector window matter
  o Vector window size - the amount of data processed in parallel by one SIMD instruction

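Compiler Optimization
Before optimization can begin
Example:

    for (int i = 0; i < 16; i++) {
        c[i] = c[i+1];
        c[i] = c[i+16];
    }

Unrolled by a vector window of 4:

    for (int i = 0; i < 16; i += 4) {
        c[i]   = c[i+1];
        c[i+1] = c[i+2];     /* (Wrong) */
        c[i+2] = c[i+3];     /* (Wrong) */
        c[i+3] = c[i+4];     /* (Wrong) */

        c[i]   = c[i+16];
        c[i+1] = c[i+17];
        c[i+2] = c[i+18];
        c[i+3] = c[i+19];
    }

The statements marked (Wrong) touch elements that other statements in the same window also read or write. The c[i+16] group reaches past the window.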
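To see why the dependence distance matters relative to the window, here is a minimal sketch that emulates a 4-wide SIMD operation as "load the whole window, then store the whole window" and compares it with the scalar loop. It uses a backward reference, `a[i] = a[i - d] + 1`, which is a variant chosen for this illustration rather than the slide's own loop. The names `shift_scalar`, `shift_vector`, and `window_is_safe` are hypothetical.

```c
#include <string.h>

#define VLEN  4   /* vector window: elements processed together */
#define COUNT 16  /* elements updated; a multiple of VLEN       */

/* Scalar reference: a[i] = a[i - d] + 1, one element at a time. */
static void shift_scalar(int *a, int d)
{
    for (int i = d; i < d + COUNT; i++)
        a[i] = a[i - d] + 1;
}

/* Emulated SIMD: load a whole window first, then store a whole window. */
static void shift_vector(int *a, int d)
{
    for (int i = d; i < d + COUNT; i += VLEN) {
        int tmp[VLEN];
        for (int k = 0; k < VLEN; k++)   /* vector load        */
            tmp[k] = a[i + k - d];
        for (int k = 0; k < VLEN; k++)   /* vector add + store */
            a[i + k] = tmp[k] + 1;
    }
}

/* Returns 1 if the windowed version reproduces the scalar result for
   dependence distance d (1 <= d <= 16), 0 otherwise. */
int window_is_safe(int d)
{
    int s[32] = {0}, v[32] = {0};
    shift_scalar(s, d);
    shift_vector(v, d);
    return memcmp(s, v, sizeof s) == 0;
}
```

Under these assumptions, distances of 1 to 3 give a different result from the scalar loop, while a distance of 4 or more (at least the window size) gives the same result.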

Compiler Optimization
Framework for vectorization
  o Prelude
  o Loop
  o Postlude
  o Cleanup

Compiler Optimization
Framework for vectorization
  o Prelude
     - Loop-independent variables are prepared for use.
     - Run-time checks confirm that vectorization is possible.
  o Loop
     - Vectorizable instructions are performed in the same order as in the original code.
     - The loop could be split into multiple loops.
     - Vectorizable sections could be separated by more complex code in the original loop.

Compiler Optimization
Framework for vectorization
  o Postlude
     - All loop-independent variables are returned.
  o Cleanup
     - Non-vectorizable iterations of the loop are run.
     - These include the remaining iterations that do not fit evenly into the vector size.
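The four phases can be pictured with a hand-written sketch of a summation loop. It assumes a window of four `int` lanes emulated by a small array. The function name `vec_sum` is hypothetical, and a real compiler's output will look different.

```c
#define VLEN 4  /* assumed vector window, in elements */

/* Sum n ints (n >= 0) using the prelude / loop / postlude / cleanup shape. */
int vec_sum(const int *a, int n)
{
    /* Prelude: set up per-lane accumulators and the vectorizable range. */
    int acc[VLEN] = {0, 0, 0, 0};
    int main_end = n - (n % VLEN);

    /* Loop: each step handles one full window of VLEN elements. */
    for (int i = 0; i < main_end; i += VLEN)
        for (int k = 0; k < VLEN; k++)
            acc[k] += a[i + k];

    /* Postlude: fold the lanes back into one scalar result. */
    int total = acc[0] + acc[1] + acc[2] + acc[3];

    /* Cleanup: leftover iterations that do not fill a whole window. */
    for (int i = main_end; i < n; i++)
        total += a[i];

    return total;
}
```

For an array of ten elements, the main loop covers the first eight and the cleanup loop handles the last two.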

Compiler Optimization
Compiler techniques
  o Loop Level Automatic Vectorization
  o Basic Block Level Automatic Vectorization
  o In the presence of control flow

Compiler Optimization
Loop Level Automatic Vectorization
1. Find the innermost loop that can be vectorized.
2. Transform the loop and create vector instructions.

Original Code

    for (i = 0; i < 1024; i+=1)
        C[i] = A[i]*B[i];

Vectorized Code

    for (i = 0; i < 1024; i+=4) {
        vA = vec_ld( A[i] );
        vB = vec_ld( B[i] );
        vC = vec_mul( vA, vB );
        vec_st( vC, C[i] );
    }
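The vectorized loop above is pseudocode. A runnable approximation in plain C, assuming a 4-lane vector emulated with a struct, might look like the following. The type `v4f` and the helpers `v4_load`, `v4_mul`, and `v4_store` are hypothetical stand-ins for the slide's `vec_ld`, `vec_mul`, and `vec_st`, not a real intrinsics API.

```c
#define VLEN 4

/* Emulated 4-lane float vector (hypothetical type). */
typedef struct { float lane[VLEN]; } v4f;

static v4f v4_load(const float *p)
{
    v4f v;
    for (int k = 0; k < VLEN; k++) v.lane[k] = p[k];
    return v;
}

static v4f v4_mul(v4f a, v4f b)
{
    v4f r;
    for (int k = 0; k < VLEN; k++) r.lane[k] = a.lane[k] * b.lane[k];
    return r;
}

static void v4_store(v4f v, float *p)
{
    for (int k = 0; k < VLEN; k++) p[k] = v.lane[k];
}

/* Original scalar loop. */
void mul_scalar(const float *A, const float *B, float *C, int n)
{
    for (int i = 0; i < n; i += 1)
        C[i] = A[i] * B[i];
}

/* Loop-level vectorized form; n is assumed to be a multiple of VLEN. */
void mul_vector(const float *A, const float *B, float *C, int n)
{
    for (int i = 0; i < n; i += VLEN) {
        v4f vA = v4_load(&A[i]);
        v4f vB = v4_load(&B[i]);
        v4f vC = v4_mul(vA, vB);
        v4_store(vC, &C[i]);
    }
}
```

Both functions produce the same output array. The vector form simply performs a quarter as many loop iterations.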

Compiler Optimization
Basic Block Level Automatic Vectorization
1. The innermost loop is unrolled by the size of the vector window.
2. Isomorphic scalar instructions are packed into vector instructions.

Original Code

    for (i = 0; i < 1024; i+=1)
        C[i] = A[i]*B[i];

After unrolling (isomorphic statements ready to be packed)

    for (i = 0; i < 1024; i+=4) {
        C[i]   = A[i]*B[i];
        C[i+1] = A[i+1]*B[i+1];
        C[i+2] = A[i+2]*B[i+2];
        C[i+3] = A[i+3]*B[i+3];
    }

Compiler Optimization
In the presence of control flow
1. Apply predication
2. Apply method from above
3. Remove vector predication
4. Remove scalar predication

Original Code

    for (i = 0; i < 1024; i+=1) {
        if (A[i] > 0)
            C[i] = B[i];
        else
            D[i] = D[i-1];
    }

After Predication

    for (i = 0; i < 1024; i+=1) {
        P = A[i] > 0;
        NP = !P;
        C[i] = B[i];      (P)
        D[i] = D[i-1];    (NP)
    }

Compiler Optimization
In the presence of control flow

After Vectorization

    for (i = 0; i < 1024; i+=4) {
        vP = A[i:i+3] > (0,0,0,0);
        vNP = vec_not(vP);
        C[i:i+3] = B[i:i+3];         (vP)
        (NP1,NP2,NP3,NP4) = vNP;
        D[i]   = D[i-1];             (NP1)
        D[i+1] = D[i];               (NP2)
        D[i+2] = D[i+1];             (NP3)
        D[i+3] = D[i+2];             (NP4)
    }

After Removing Predicates

    for (i = 0; i < 1024; i+=4) {
        vP = A[i:i+3] > (0,0,0,0);
        vNP = vec_not(vP);
        C[i:i+3] = vec_sel(C[i:i+3], B[i:i+3], vP);
        (NP1,NP2,NP3,NP4) = vNP;
        if (NP1) D[i]   = D[i-1];
        if (NP2) D[i+1] = D[i];
        if (NP3) D[i+2] = D[i+1];
        if (NP4) D[i+3] = D[i+2];
    }
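A plain-C sketch of the if-converted loop is given below. It assumes a window of 4 and emulates `vec_sel` with a per-lane select. The names `branchy` and `if_converted` are hypothetical, and `D` is assumed to have one valid element before `D[0]` so that `D[i-1]` is defined at `i = 0`.

```c
#define VLEN 4

/* Original loop with control flow.
   Assumption: D[-1] is a valid element (D points one past a buffer start). */
void branchy(const int *A, const int *B, int *C, int *D, int n)
{
    for (int i = 0; i < n; i++) {
        if (A[i] > 0)
            C[i] = B[i];
        else
            D[i] = D[i - 1];
    }
}

/* If-converted form; n is assumed to be a multiple of VLEN. */
void if_converted(const int *A, const int *B, int *C, int *D, int n)
{
    for (int i = 0; i < n; i += VLEN) {
        int p[VLEN];

        /* vP: compare a whole window against zero. */
        for (int k = 0; k < VLEN; k++)
            p[k] = A[i + k] > 0;

        /* Emulated vec_sel: take B where the predicate holds, else keep C. */
        for (int k = 0; k < VLEN; k++)
            C[i + k] = p[k] ? B[i + k] : C[i + k];

        /* The D update carries a dependence, so it stays scalar and
           guarded, in the original iteration order. */
        for (int k = 0; k < VLEN; k++)
            if (!p[k])
                D[i + k] = D[i + k - 1];
    }
}
```

The select turns the guarded store to `C` into an unconditional one, which is what lets it run as a single vector operation. The `D` statements cannot be packed because each one may read the value written by the previous one.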

CPU vs GPU
The GPU as we know it today was introduced by Nvidia in 1999
Popularity has increased in recent years
[Images: VisionTek GeForce 256 [Wikipedia]; Nvidia GeForce GTX 590 [Nvidia]]

CPU vs GPU Theoretical GFLOP/s & Bandwidth [Nvidia, NVIDIA CUDA C Programming Guide]

CPU vs GPU Intel Core i7 Nehalem Die Shot [NVIDIA’s Fermi: The First Complete GPU Computing Architecture]

CPU vs GPU Game, Little Big Planet [http://trendygamers.com]

CPU vs GPU OpenGL Graphics Pipeline [Wojtek Palubicki; http://pages.cpsc.ucalgary.ca/~wppalubi/]

CPU vs GPU
CPU SIMD vs. GPU SIMD
  o Intel’s Sandy Bridge architecture: 256-bit AVX --> operates on 8 registers in parallel
  o CUDA multiprocessor: up to 512 raw mathematical operations in parallel

CPU vs GPU Nvidia’s Fermi Source: http://www.legitreviews.com/article/1193/2/

CPU vs GPU [Nvidia; NVIDIA’s Next Generation CUDA Compute Architecture: Fermi] Nvidia’s Fermi

Standardization Problems and Industry Challenges [Widescreen Wallpapers; http://widescreen.dpiq.org/30__AMD_vs_Intel_Challenge.htm]

Standardization Problems and Industry Challenges
1998
  o AMD - 3DNow!
  o Intel - SSE instruction set a year later (1999), without supporting 3DNow!
  o Intel won this battle since SSE was better

Standardization Problems and Industry Challenges
2001
  o Intel - Itanium processor (64-bit, parallel computing instruction set)
  o AMD - its own 64-bit instruction set (backward compatible)
  o AMD won this time because of its backward compatibility
2007
  o AMD - SSE5
  o Intel - AVX

Standardization Problems and Industry Challenges
Example: fused multiply-add (FMA)
  o d = a + b * c
AMD
  o Has supported FMA4 since 2011
  o FMA4 - 4-operand form
Intel
  o Will support FMA3 in 2013 with Haswell
  o FMA3 - 3-operand form
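The practical difference between the two forms is where the result goes. The sketch below models only the operand shapes in plain C. The function names `fma4_form` and `fma3_form` are hypothetical, and ordinary C arithmetic is not guaranteed to round once the way a hardware fused multiply-add does.

```c
/* 4-operand shape: a separate destination, d = a + b * c.
   None of the three sources is overwritten. */
void fma4_form(double *d, double a, double b, double c)
{
    *d = a + b * c;
}

/* 3-operand shape: the destination is also a source, a = a + b * c.
   The old value of a is lost. */
void fma3_form(double *a, double b, double c)
{
    *a = *a + b * c;
}
```

Code written against one shape has to copy a register, or be restructured, to run on the other. This is part of the extra work that falls on programmers and compilers when vendors diverge.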

Standardization Problems and Industry Challenges
This causes:
  o More work for the programmer
  o Code that is nearly impossible to maintain
Standardization required!

Conclusion
SIMD processors exploit data-level parallelism to increase performance.
The hardware requirements are easily met as transistor size decreases.
HPC languages have been created to give programmers access to high- and low-level SIMD operations.

Conclusion
Compiler technology has improved to recognize some potential SIMD operations in serial code.
The utility of SIMD instructions in modern microprocessors is diminishing, except in special-purpose applications, due to standardization problems and industry in-fighting.
The increasing adoption of GPGPU computing has the potential to supplant SIMD-type instructions in the CPU.
On-chip GPUs appear to be on the horizon, so wider really is better.
