
1
Nvidia CUDA TM and AMD Stream TM (for Cosmology): Experiences So Far Steven Gratton, Institute of Astronomy, University of Cambridge. 29 October 2008

2
Brief Outline (1)
Why I got interested in GPGPU…
Thinking about CUDA
An almost-ideal test case to understand the early universe: random walks, random numbers and reductions…
A trickier problem, for data analysis: Cholesky factorization
Thoughts on optimization

3
Brief Outline (2)
Trying AMD Stream
Programming for a GPU cluster?
Is it worth it?

4
The Motivation
I remember reading about CUDA and Nvidia’s new GPUs back in early 2007
–Having waited hours for 16-core jobs to start on the local supercomputer, the idea of having 128 processors, all sitting in one’s own machine, seemed very enticing…
–CUDA seemed okay and fun enough for me to have a go…

5
Some support
Received a mini-grant from the Foundation Questions Institute to look at GPGPU with CUDA for cosmology and provide a resources page; the latter at:
http://www.ast.cam.ac.uk/~stg20/gpgpu/index.html
http://www.fqxi.org

6
I bought…
2x Nvidia 8800GTSs
A Sun Ultra 40 M2 Opteron workstation to host them (good PSU and multiple PCIe slots…)

7
The problem
“Inflation” in the early universe… A crash course in cosmology:
–The universe is expanding (i.e. galaxies are moving away from other galaxies) and the radiation in it is cooling down
–Extrapolating backwards would lead to a “big bang”
–Modify this by having a new form of matter, the “inflaton”, early on

8
–The inflaton can fluctuate; indeed small fluctuations in it can be the seeds that, under gravitational collapse, lead to the galaxies etc. that we see today
–But we might also have large fluctuations of the inflaton across the early universe… Basically the whole universe could have undergone a random walk! So what happened on the average?
–What should we expect to see in our past?
–And with what spread?

9
Subtleties
Starting from an initial inflaton value, some of the histories don’t stop inflating
–We may only want to look at the subset that do; a constrained random walk
Different histories inflate by different factors
–We have to weight each history according to a function of the entire history

10
The Idea
To simulate a whole load of histories numerically
Find the average history (and spread)

11
Nvidia Concepts (1)
The GPU sequentially executes one or more grids of threads. Each thread in a grid runs the same program or kernel.
Threads in a grid are organised into blocks.
–Threads in a block can synchronize and communicate via fast shared memory.
–All threads can read and write anywhere in the GPU’s main global memory.
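As an illustrative sketch (plain C on the CPU, not actual CUDA), here is how a thread in this model finds “its” element: it combines its block index, the block size, and its thread index into a unique global index. The grid/block sizes below are made up for illustration.

```c
#include <assert.h>

/* CPU sketch of the CUDA execution model: a grid of blocks, each block a
   set of threads, all running the same kernel. On the GPU the threads
   run in parallel; here we simply loop over the grid. */
enum { GRID_DIM = 4, BLOCK_DIM = 8, N = GRID_DIM * BLOCK_DIM };

int global_index(int blockIdx, int blockDim, int threadIdx)
{
    return blockIdx * blockDim + threadIdx;   /* the standard CUDA idiom */
}

/* The "kernel": every thread runs this same function on its own element. */
void kernel(int blockIdx, int threadIdx, int out[N])
{
    int i = global_index(blockIdx, BLOCK_DIM, threadIdx);
    out[i] = 2 * i;
}

/* Stand-in for a grid launch: visit every (block, thread) pair once. */
void launch(int out[N])
{
    for (int b = 0; b < GRID_DIM; b++)
        for (int t = 0; t < BLOCK_DIM; t++)
            kernel(b, t, out);
}
```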

12
Nvidia Concepts (2)
A thread block runs on a single multiprocessor. It is split into warps of 32 threads.
A high-end GPU contains say 12-30 multiprocessors, each consisting of 8 processors.
Each processor runs the same instruction 4 times in a row to process a warp.
Ideally, a multiprocessor should run multiple warps
Nvidia CUDA Programming Guide, Nvidia

13
Nvidia Concepts (3)
Write kernels in CUDA, basically C
This gets compiled into a virtual assembly language, ptx (you can inspect this)
ptx gets compiled into real card-specific code (officially, you can’t see this) (but see decuda, http://www.cs.rug.nl/wladimir/decuda)

14
The Idea (2)
Get each thread to simulate a history
Form a partial average in each block and store to global memory
Launch a new kernel to form the final average
Transfer the average history back to the cpu
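A serial C sketch of this two-kernel averaging scheme, with loops standing in for the parallel threads (the block/thread counts are illustrative):

```c
#include <assert.h>

/* Two-stage reduction: each "thread" contributes one history value;
   stage 1 (kernel 1) forms one partial sum per block and stores it in
   "global memory"; stage 2 (kernel 2) reduces the per-block sums. */
enum { BLOCKS = 4, THREADS = 64, TOTAL = BLOCKS * THREADS };

double average(const double *values)
{
    double partial[BLOCKS];                 /* per-block results */
    for (int b = 0; b < BLOCKS; b++) {      /* kernel 1: partial sums */
        double s = 0.0;
        for (int t = 0; t < THREADS; t++)
            s += values[b * THREADS + t];
        partial[b] = s;
    }
    double s = 0.0;                         /* kernel 2: final reduction */
    for (int b = 0; b < BLOCKS; b++)
        s += partial[b];
    return s / TOTAL;
}
```

The separate second kernel matters on the GPU because blocks cannot synchronize with each other inside one kernel launch.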

15
Parallel RNG
It is tricky to get good parallel streams of random numbers
I wanted something fast and reasonably good (will check soon, honest)
Each thread uses a Marsaglia “Multiply With Carry” generator, each with its own multiplier.
Then used a log/trig Box-Muller transform to get a Gaussian

16
Basic Optimization Guide
Access global memory as little as possible
–When you do, access it in a coalesced manner, having all threads in a warp access nearby addresses
Keep all threads in a warp doing the same thing
Access shared memory in an efficient manner

17
Performance
1000x my cpu code! Wow!

18
Performance (2)
After more than a couple of evenings of thought and effort, 1000x my
–single-threaded, unoptimized cpu code
–that took about 30 minutes to write
–running on an old 1.6 GHz cpu
… Hmm…

19
Data Analysis
Bayes’ Theorem: P(model|data) = P(data|model) P(model) / P(data)
Often, P(data|model) is a Gaussian because
–the mismatch between data and signal is down to noise, which is often best taken to be Gaussian
–the signal is a Gaussian

20
e.g. the CMB
Radiation “from the big bang” is a correlated Gaussian (?) random field
–Correlations in the field depend on the model
WMAP science team

21
So, p(data|model) is given by a multivariate Gaussian in the data, whose covariance C is a function of the cosmological parameters!
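The formula itself was a graphic on the original slide; presumably it is the standard zero-mean multivariate Gaussian likelihood (a reconstruction, with d the N-element data vector and C its covariance matrix):

```latex
P(\mathrm{data}\mid\mathrm{model})
  = \frac{1}{(2\pi)^{N/2}\,(\det C)^{1/2}}
    \exp\!\left(-\tfrac{1}{2}\, d^{\mathsf T} C^{-1} d\right)
```

Evaluating this needs det C and C^{-1} d, which is exactly what the Cholesky factorization on the following slides delivers cheaply.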

22
Cholesky Factorization

23
Why does this help?
Exponent…
–Can solve in O(N^2)
Prefactor…
–Det(A) is just the square of the product of the diagonal elements of L, and takes O(N)
So after one O(N^3) operation, you can get all you need.
–The number in front of N^3 is small for Cholesky

24
A suitable parallel implementation

25
What about the bottom-right corner?...
Just another Cholesky factorization, in one lower dimension!

26
So…
Need a loop, over reducing matrix sizes, of 3 kernels (we need to sync between each step)
–1 to do the square root
–1 to update the strip
–1 to subtract the outer product of the strip from the remaining bottom-right corner
The last kernel takes all the time!

27
Performance and Optimizations
In the outer product, we need to do n(n+1)/2 multiply-subtracts per triangle, so N^3/6 in total. We could find ourselves memory-bandwidth bound…
To avoid this, we in fact treat “blocks” of the matrix at a time, storing them in fast shared memory, reducing global memory bandwidth by a factor of the block size.

28
Memory issues…
Bizarre factor-of-two slowdowns for certain matrix sizes; different cards suffered at different sizes
Could lead to the “better” 8800GTX card being slower than the “slower” 8800GTS card…
My interpretation:

29
Hitting memory partitions…
Perhaps the memory is interleaved in units of 256 bytes between partitions!
Then, for certain matrix sizes, all of the strip might be stored in the same partition. Then all thread blocks working down a column hit the same partition all the time…
The problem basically goes away for a version of the code where each block works on a row instead
GeForce 8800 Architecture Overview, Nvidia

31
Also, paging?
Made a minor change to the layout of the arrays to store each 16x16 block contiguously, i.e. going from A[N][N] to A[N/16][N/16][16][16]
–Found a 50% speedup!!
But why should one have to be finding these things out by trial, error and a lot of thought and effort?
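The index arithmetic behind that layout change, sketched in C (16 hard-coded as the block size): both layouts map (row, col) to a unique offset in an N*N array; only the memory order, and hence the access pattern, differs.

```c
#include <assert.h>

enum { B = 16 };   /* block edge, as in the slide */

/* offset of element (r, c) in plain row-major A[N][N] */
int flat_offset(int n, int r, int c) { return r * n + c; }

/* offset of (r, c) when each BxB block is stored contiguously,
   i.e. the A[N/16][N/16][16][16] layout (assumes B divides n) */
int tiled_offset(int n, int r, int c)
{
    int nb = n / B;                 /* blocks per row                */
    int br = r / B, bc = c / B;     /* which block                   */
    int ir = r % B, ic = c % B;     /* position inside the block     */
    return ((br * nb + bc) * B + ir) * B + ic;
}
```

With the tiled layout, a 16x16 working block occupies one contiguous 1KB run of memory instead of 16 widely separated rows.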

32
Performance bottom line
12228^2 matrix, single precision
–8s on an 8800GTX
–17s using Intel MKL on a node of the University supercomputer (4 cores at 3GHz)
So, after hours and hours and hours of thought and effort, a factor of a few times faster than a one-line call to a (professional) LAPACK library routine on a 4-core cpu node…

33
So, general performance issues
No clear description about global memory (other than coalescing), let alone textures!
Possible issue with shared memory operands (see forum discussion) – no advice from Nvidia
No official view of what the GPU is actually doing; ptx can obscure what matters for optimization…

34
Current NV hardware and DP
Consumer: GTX 260/280, 896MB/1GB, nice sticker
Tesla: C1060, 4GB, no nice sticker

35
DP vital for Cholesky
Basically just changed “float” to “double” in my code and now get 40 GFLOP/s, 2/3 of peak performance, on a GTX260!! Very happy…
Note DP doubles memory bandwidth requirements, but the cards are more than twice as slow at DP as at SP at present…

36
AMD Stream
Alluring theoretical ALU performance, both in SP and DP!
–(Earlier DP support than Nvidia)
Beguilingly offers views of the actual GPU code! (Can even program in it…)
SDK still in beta unfortunately…

37
AMD concepts
Brook+
CAL/IL
gpuisa

38
Brook+ example: adding 2 matrices

kernel void sum(float a<>, float b<>, out float c<>) {
    c = a + b;
}

int main() {
    …
    float a<…>;   // stream dimensions elided
    float b<…>;
    float c<…>;
    …
    StreamRead(a, a_cpu);
    StreamRead(b, b_cpu);
    …
    sum(a, b, c);
    …
    StreamWrite(c, c_cpu);
    …
}

Ideal for “pure” streaming applications, i.e. doing the same thing to many many elements
Allows for “reductions” and also for more complicated memory access patterns
Handles all the GPU complexity behind the scenes

39
CAL/IL
CAL = Compute Abstraction Layer
–Provides C functions to set up and copy memory to and from the GPU, and to compile, set up and run kernels on the GPU
IL = Intermediate Language
–A pseudo-assembly language for AMD GPUs
–128-bit registers!

40
Hardware Summary
Last generation: 3800 Series, Firestream 9170
–DP
New generation: 4800 Series, Firestream 9250
–Over 2x faster, over 1 TFLOP/s SP!
–Can support compute shaders with interthread communication

41
Current cards…
Consumer: HD4870, 512MB/1GB GDDR5, nice(?) sticker
Professional: 9250, 1GB GDDR3, single slot, <150W, no nice sticker…

42
Current cards TeraScale Graphics Engine presentation, AMD

43
10 “SIMDs” x 16 “thread processors” x 5 “stream cores”
TeraScale Graphics Engine presentation, AMD

44
All 16 thread processors in a SIMD run the same instruction, 4 times over in 4 clock cycles, on a “wavefront” of 64 threads
Each instruction is a VLIW one!
–Basically a separate command to each stream core in the thread processor
–Some instructions, e.g. transcendentals, can only run on the “t” stream core
–Some instructions, e.g. DP multiply, take up multiple stream cores…

45
gpuisa
Perhaps now we see the reason for IL existing as a scalar language
You can see the actual gpuisa though!
–very helpful to see what your program is actually doing
–aids in optimization

46
Cholesky on Stream
Similar structure to Nvidia
Currently based on 4x4 blocking of the matrix, due to the float4 nature of registers
Currently using “pixel” shaders (compute shaders are new and only supported on the latest hardware); more graphics-oriented

47
IL Shaders

il_ps_2_0
dcl_input_position_interp(linear_noperspective) vWinCoord0.xy
dcl_output_generic o0
dcl_output_generic o1
dcl_output_generic o2
dcl_output_generic o3
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(3)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)

Set up…

48
Load in data (that was partitioned between multiple buffers)…

sample_resource(0)_sampler(0) r0, vWinCoord0.xyxx
sample_resource(1)_sampler(0) r1, vWinCoord0.xyxx
sample_resource(2)_sampler(0) r2, vWinCoord0.xyxx
sample_resource(3)_sampler(0) r3, vWinCoord0.xyxx

;PS; -------- Disassembly --------------------
00 TEX: ADDR(80) CNT(4) VALID_PIX
   0 SAMPLE R1, R0.xyxx, t0, s0 UNNORM(XYZW)
   1 SAMPLE R3, R0.xyxx, t1, s0 UNNORM(XYZW)
   2 SAMPLE R2, R0.xyxx, t2, s0 UNNORM(XYZW)
   3 SAMPLE R0, R0.xyxx, t3, s0 UNNORM(XYZW)

49
Calculate the top-left block; note the float4 nature of the registers!

sqrt r0.x,r0.xxxx
div r0._yzw,r0,r0.xxxx
mad r1._yzw,r0.yyyy,r0_neg(xyzw),r1
mad r2.__zw,r0.zzzz,r0_neg(xyzw),r2
mad r3.___w,r0.wwww,r0_neg(xyzw),r3
sqrt r1.y,r1.yyyy
div r1.__zw,r1,r1.yyyy
mad r2.__zw,r1.zzzz,r1_neg(xyzw),r2
mad r3.___w,r1.wwww,r1_neg(xyzw),r3
sqrt r2.z,r2.zzzz
div r2.___w,r2,r2.zzzz
mad r3.___w,r2.wwww,r2_neg(xyzw),r3
sqrt r3.w,r3.wwww

01 ALU: ADDR(32) CNT(39)
   4 t: SQRT_e R4.x, R1.x
   5 t: RCP_sat ____, PS4
   6 y: MUL R4.y, R1.y, PS5
     z: MUL R4.z, R1.z, PS5
     w: MUL R4.w, R1.w, PS5
   7 x: MULADD T0.x, PV6.y, -PV6.z, R3.z
     y: MULADD T0.y, PV6.z, -PV6.z, R2.z VEC_021
     z: MULADD R123.z, PV6.y, -PV6.y, R3.y
     w: MULADD T0.w, PV6.y, -PV6.w, R3.w
     t: MULADD T0.z, PV6.z, -PV6.w, R2.w
   8 x: MULADD T1.x, R4.w, -R4.w, R0.w
     t: SQRT_e R3.y, PV7.z
   9 t: RCP_sat ____, PS8
  10 z: MUL R3.z, T0.x, PS9
     w: MUL R3.w, T0.w, PS9
  11 x: MULADD T1.x, PV10.z, -PV10.w, T0.z
     y: MULADD R123.y, PV10.z, -PV10.z, T0.y
     z: MULADD T0.z, PV10.w, -PV10.w, T1.x
  12 t: SQRT_e R2.z, PV11.y
  13 t: RCP_sat ____, PS12
  14 w: MUL R2.w, T1.x, PS13
  15 x: MULADD R123.x, PV14.w, -PV14.w, T0.z
  16 t: SQRT_e R0.w, PV15.x

50
Write out as a stream

mov o0,r0
mov o1,r1
mov o2,r2
mov o3,r3
ret_dyn
end

17 x: MOV R8.x, R0.x
   y: MOV R8.y, R0.y
   z: MOV R8.z, R0.z
   w: MOV R8.w, PS16
18 x: MOV R7.x, R2.x
   y: MOV R7.y, R2.y
   z: MOV R7.z, R2.z
   w: MOV R7.w, R2.w
19 x: MOV R6.x, R3.x
   y: MOV R6.y, R3.y
   z: MOV R6.z, R3.z
   w: MOV R6.w, R3.w
20 x: MOV R5.x, R4.x
   y: MOV R5.y, R4.y
   z: MOV R5.z, R4.z
   w: MOV R5.w, R4.w
02 EXP_DONE: PIX0, R5 BRSTCNT(3)
END_OF_PROGRAM

51
AMD issues
Almost NO information or advice about the memory system, other than to have each thread write 4 float4’s at a time…
Documentation is improving, but still some way to go (e.g. the gpuisa document is 2 generations out of date; very limited discussion of compute shaders at present…)
But help from the AMD Stream forum

52
Teething issues, e.g. …
Fussy about supported gpu/os/driver combinations
Can’t seem to use all of the card’s memory for a compute shader/global buffer; miss out on 256MB
Can’t easily access resources larger than 255MB in size from the cpu side
In some cases all cards have to have had a monitor plugged into them to be accessible; imagine that in a cluster!

53
Multi-gpu?
Assuming you can just combine results at the end (on the cpu, say), no problem
If you want to share data (e.g. the Cholesky problem), you must bear communication costs in mind
–As a lower limit, a kernel call takes about 5-10 μs
–PCIe and cpu memory, say 5GB/s
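A back-of-envelope helper for these costs, using the rough figures above (~10 μs fixed overhead, ~5 GB/s over PCIe; the numbers are illustrative):

```c
#include <assert.h>
#include <math.h>

/* Simple latency + size/bandwidth cost model for one transfer or kernel
   call. A kernel's useful work should comfortably exceed this time, or
   the fixed overhead dominates. */
double transfer_seconds(double bytes, double latency_s, double bytes_per_s)
{
    return latency_s + bytes / bytes_per_s;
}
```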

54
Big gpu clusters?
E.g. for Cholesky, you’d split the matrix row-wise and pass the processed rows via MPI
Latencies (cf. kernel launch overhead)…
–No expert, but apparently Ethernet 50 μs, Infiniband 2 μs
…and bandwidth (100-1000 MB/s)
Each kernel should run for longer than this!
But what about memory errors? (Current cards are non-ECC…)

55
Big gpu clusters? (2)
I think gpu manufacturers should make available all the optimization info they possibly can
–You need all the factors of a few that there are in order to justify coprocessors
–Perhaps this is a bit different from the consumer market, where compatibility and performance over a wide range of products is vital

56
Is it worth it?
CUDA especially is a great way to start programming in parallel
–A gpu is in many ways analogous to a supercomputer cluster
Need to work hard to get close to peak performance
–More info from both AMD and Nvidia would help here; it is strange that one can find out as much or more about the cards and how they work from forums and computer hardware websites as from the companies themselves…

57
Is it worth it? (2)
Expect up to a factor of a few over decent (quad-core) cpu code…
Future standardization efforts might help too (if they allow access to the advanced features of the cards in a close-to-optimal way).
Libraries (cuBLAS, cuFFT, ACML-gpu…) might be a good way to start
–MPI versions for multi-gpu systems would be great!
