1 Stencil Computation Optimization and Automatic Tuning on State-of-the-Art Multicore Architectures
Kaushik Datta1,2, Mark Murphy2, Vasily Volkov2, Samuel Williams1,2, Jonathan Carter1, Leonid Oliker1,2, David Patterson1,2, John Shalf1, and Katherine Yelick1,2
1CRD/NERSC, Lawrence Berkeley National Laboratory
2Computer Science Division, University of California, Berkeley
Joint work between LBNL and UC Berkeley

2 Talk Overview
The multicore revolution has produced a wide variety of architectures
Compilers alone fail to fully exploit multicore resources
Hand-tuning has become infeasible
Automatic tuning is essential for performance and scaling
Local-store machines offer much greater performance and power efficiency, at the expense of productivity

3 Outline
Stencil Code Overview
Cache-based Architectures
Automatic Tuning
Local-store Architectures

4 Stencil Code Overview
For a given point, a stencil is a fixed subset of nearest neighbors
A stencil code updates every point in a regular grid by “applying a stencil”
Stencils are used in iterative PDE solvers like Jacobi, Multigrid, and AMR, and are critical to many applications (e.g. diffusion, electromagnetics, image processing)
Stencil codes are usually bandwidth-bound: long unit-stride memory accesses, little reuse of each grid point, and few flops per grid point
This talk focuses on an out-of-place 3D 7-point stencil sweeping over a 256³ grid (problem size > cache size)
[Figures: the 3D 7-point stencil at (x,y,z) with neighbors x±1, y±1, z±1; a 256³ regular grid; an Adaptive Mesh Refinement (AMR) example]
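As a concrete reference, here is a minimal C sketch of the out-of-place 7-point sweep described above. The array names (in, out), the one-layer ghost-cell layout, and the weights alpha/beta are illustrative assumptions, not the authors' exact kernel.

```c
#include <stddef.h>

#define N 256                                  /* 256^3 interior points plus ghost cells */
#define IDX(i,j,k) ((size_t)(k)*(N+2)*(N+2) + (size_t)(j)*(N+2) + (size_t)(i))

/* Out-of-place 3D 7-point stencil: every interior point of 'out' is a
 * weighted sum of the corresponding point of 'in' and its 6 neighbors. */
void stencil_sweep(const double *in, double *out, double alpha, double beta)
{
    for (int k = 1; k <= N; k++)
        for (int j = 1; j <= N; j++)
            for (int i = 1; i <= N; i++)       /* x (i) is the unit-stride dimension */
                out[IDX(i,j,k)] =
                    alpha *  in[IDX(i,j,k)] +
                    beta  * (in[IDX(i-1,j,k)] + in[IDX(i+1,j,k)] +
                             in[IDX(i,j-1,k)] + in[IDX(i,j+1,k)] +
                             in[IDX(i,j,k-1)] + in[IDX(i,j,k+1)]);
}
```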

5 Naïve Stencil Code
We wish to exploit multicore resources
First attempt at writing parallel stencil code:
Use pthreads
Parallelize in the least contiguous grid dimension
Set thread affinity for scaling: multithreading, then multicore, then multisocket
[Figure: the 256³ regular grid partitioned into slabs along z, one per thread (Thread 0, Thread 1, ..., Thread n); x is the unit-stride dimension]
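A hedged sketch of this first pthreads attempt, partitioning the grid into z-slabs (the least contiguous dimension). Thread pinning for the affinity policy is omitted, and the helper names, the illustrative weights, and the assumption that nthreads divides N evenly are mine, not the authors':

```c
#include <pthread.h>
#include <stddef.h>

#define N 256
#define IDX(i,j,k) ((size_t)(k)*(N+2)*(N+2) + (size_t)(j)*(N+2) + (size_t)(i))

typedef struct { const double *in; double *out; int z_lo, z_hi; } slab_t;

/* Each thread sweeps its own contiguous slab of z-planes (the least
 * contiguous dimension), leaving x as the unit-stride inner loop.        */
static void *sweep_slab(void *arg)
{
    const slab_t *s = (const slab_t *)arg;
    const double alpha = 1.0, beta = 1.0 / 6.0;        /* illustrative weights */
    for (int k = s->z_lo; k < s->z_hi; k++)
        for (int j = 1; j <= N; j++)
            for (int i = 1; i <= N; i++)
                s->out[IDX(i,j,k)] =
                    alpha *  s->in[IDX(i,j,k)] +
                    beta  * (s->in[IDX(i-1,j,k)] + s->in[IDX(i+1,j,k)] +
                             s->in[IDX(i,j-1,k)] + s->in[IDX(i,j+1,k)] +
                             s->in[IDX(i,j,k-1)] + s->in[IDX(i,j,k+1)]);
    return NULL;
}

void parallel_sweep(const double *in, double *out, int nthreads)
{
    pthread_t tid[nthreads];
    slab_t    arg[nthreads];
    int planes = N / nthreads;                         /* assume nthreads divides N */
    for (int t = 0; t < nthreads; t++) {
        arg[t] = (slab_t){ in, out, 1 + t * planes, 1 + (t + 1) * planes };
        pthread_create(&tid[t], NULL, sweep_slab, &arg[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}
```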

6 Outline
Stencil Code Overview
Cache-based Architectures
Automatic Tuning
Local-store Architectures

7 Cache-Based Architectures
(All Dual-Socket)
[Block diagrams of the three platforms: Intel Clovertown (cores with 4MB shared L2s, FSB, chipset with 4x64b controllers, 667MHz FBDIMMs, 21.33 GB/s read / 10.66 GB/s write), AMD Barcelona (Opteron cores with 512K L2s and a 2MB victim cache, SRI/crossbar, HyperTransport, 2x64b controllers, 667MHz DDR2 DIMMs, 10.6 GB/s), and Sun Niagara2 (Victoria Falls) (MT SPARC cores on a crossbar, 4MB shared L2 (16 way, 64b interleaved), 4 coherency hubs, 2x128b controllers, 667MHz FBDIMMs)]
All are dual-socket for fairness

8 Cache-Based Architectures
(Features)
[Block diagrams as on slide 7; Clovertown and Barcelona labeled “x86 Superscalar”, Victoria Falls labeled “Chip MultiThreaded (CMT)”]
The superscalars are fat; VF uses simple in-order cores
Chip MultiThreaded means both multicore and multithreaded

9 Cache-Based Architectures
(Socket / Core / Thread Count)
[Block diagrams as on slide 7]
Clovertown and Barcelona: 2 sockets x 4 cores/socket x 1 HW thread/core
Victoria Falls: 2 sockets x 8 cores/socket x 8 HW threads/core
VF shows a much higher degree of parallelism
Diverse machines

10 Cache-Based Architectures
(Stream Bandwidth)
[Block diagrams as on slide 7, annotated with measured Stream bandwidth: Clovertown 7.2 GB/s, Barcelona 17.6 GB/s, Victoria Falls 22.4 GB/s]
Stencil codes are typically bandwidth-bound
Stream bandwidth can be a loose upper bound on performance

11 Naïve Performance
[Bar charts for Intel Clovertown (+icc), AMD Barcelona (+gcc), and Sun Niagara2 (Victoria Falls) (+gcc): naïve performance vs. the Stream-predicted performance line]
Best naïve result: 48% of Stream-predicted (Clovertown), 17% (Barcelona), 18% (Victoria Falls)
NAÏVE CODE SHOWS NO SCALING! The best performance may not even be at maximum concurrency
“Stream-predicted performance” assumes the code achieves Stream bandwidth, incurs only compulsory misses, and uses all cores and threads
Compiler optimizations alone result in a small fraction of Stream-predicted performance and no parallel scaling

12 Cache-Based Architectures
(Cache Capacity Per Thread)
[Block diagrams as on slide 7, annotated with cache capacity per thread: Clovertown 2 MB/thread, Barcelona 1 MB/thread, Victoria Falls 64 KB/thread]
The naïve code’s working set is > 2 MB/thread
Only the Clovertown has a chance of running well; Barcelona and Victoria Falls will definitely suffer capacity misses

13 Potential Problems & Solutions
What are possible performance bottlenecks with the naïve code?
Poor data placement -> NUMA-aware allocation
Conflict misses -> Array padding
Capacity misses -> Core blocking
Poor functional unit usage -> Register blocking
Low memory bandwidth -> Software prefetching
Compiler not exploiting the ISA -> SIMDization
Unneeded write allocation -> Cache bypass
Low cache capacity/thread -> Thread blocking
We will go into more depth on each of these

14 Stream-Predicted Performance
(Naive)
[Bar charts for Intel Clovertown, AMD Barcelona, and Sun Niagara2 (Victoria Falls): Naive vs. the Stream-predicted performance line]

15 NUMA Optimization
[Block diagrams as on slide 7, with all DRAM highlighted in red]
Only Barcelona and Victoria Falls are NUMA architectures
We co-located data on the same socket as the thread processing it
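On Linux, this placement is commonly obtained with first-touch allocation: after being pinned to a core, each thread initializes the pages it will later update, so those pages are mapped to its own socket's DRAM. A sketch of that idea (not necessarily the authors' exact mechanism), reusing the slab partitioning from the pthreads sketch:

```c
#include <stddef.h>

#define N 256
#define IDX(i,j,k) ((size_t)(k)*(N+2)*(N+2) + (size_t)(j)*(N+2) + (size_t)(i))

/* Run by each pthread after it has been pinned to a core (e.g. with
 * pthread_setaffinity_np). Linux maps a page to the NUMA node of the core
 * that first writes it, so each thread's slab of both grids lands in the
 * DRAM attached to its own socket.                                        */
static void first_touch_slab(double *in, double *out, int z_lo, int z_hi)
{
    for (int k = z_lo; k < z_hi; k++)
        for (int j = 0; j < N + 2; j++)
            for (int i = 0; i < N + 2; i++) {
                in [IDX(i, j, k)] = 0.0;
                out[IDX(i, j, k)] = 0.0;
            }
}
```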

16 Stream-Predicted Performance
(+ NUMA)
[Bar charts for Intel Clovertown, AMD Barcelona, and Sun Niagara2 (Victoria Falls): Naive, + NUMA vs. the Stream-predicted performance line]

17 Array Padding Optimization
Conflict misses may occur on low-associativity caches
Each array was padded by a tuned amount to minimize conflicts
No computation is performed on the padded portion of the grid
[Figure: the 256³ regular grid partitioned among Thread 0, Thread 1, ..., Thread n, with padding added along the unit-stride (x) dimension]
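A sketch of what a tunable pad might look like: the unit-stride dimension is extended by pad_x extra, never-computed elements per row so that successive rows map to different cache sets. The allocator, the indexing helper, and the parameter name pad_x are illustrative assumptions:

```c
#include <stdlib.h>

#define N 256

/* Allocate a (N+2+pad_x) x (N+2) x (N+2) grid; the extra pad_x elements per
 * x-row shift the start of successive rows onto different cache sets.      */
double *alloc_padded_grid(int pad_x)
{
    size_t nx = (size_t)(N + 2 + pad_x);
    size_t ny = N + 2, nz = N + 2;
    return calloc(nx * ny * nz, sizeof(double));
}

/* Indexing must use the padded row length; the pad itself is never touched
 * by the stencil sweep.                                                     */
static inline size_t idx_padded(int i, int j, int k, int pad_x)
{
    size_t nx = (size_t)(N + 2 + pad_x);
    return (size_t)k * nx * (N + 2) + (size_t)j * nx + (size_t)i;
}
```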

18 Stream-Predicted Performance
(+ Array Padding)
[Bar charts for Intel Clovertown, AMD Barcelona, and Sun Niagara2 (Victoria Falls): Naive, + NUMA, + Array Padding vs. the Stream-predicted performance line]

19 Problem Decomposition
(across an SMP)
[Figure: the NX x NY x NZ grid (x unit-stride) is decomposed into a chunk of core blocks (CX x CY x CZ), each core block into thread blocks (TX x TY x CZ), and each thread block into register blocks (RX x RY x RZ)]
Large chunks enable efficient NUMA allocation; small chunks exploit shared last-level caches
Thread blocks exploit caches shared among threads within a core
Register blocks make DLP/ILP explicit and make register reuse explicit
The decomposition only changes the number of nested loops; it does not change the data structure
This decomposition is universal across all examined architectures
We need to choose the best block sizes for each level of the hierarchy (see the blocked loop-nest sketch below)
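Roughly, the decomposition deepens the loop nest without touching the data structure. The sketch below shows only the core-blocking level for one thread's z-slab; CX/CY/CZ are the tunable block sizes, and the names and weights are assumptions rather than the authors' code:

```c
#include <stddef.h>

#define N 256
#define IDX(i,j,k) ((size_t)(k)*(N+2)*(N+2) + (size_t)(j)*(N+2) + (size_t)(i))

/* Core blocking: sweep the thread's slab in CX x CY x CZ blocks so that the
 * working set of each block fits in cache. CX, CY, CZ come from the tuner.  */
void blocked_sweep(const double *in, double *out, double alpha, double beta,
                   int z_lo, int z_hi, int CX, int CY, int CZ)
{
    for (int kk = z_lo; kk < z_hi; kk += CZ)
      for (int jj = 1; jj <= N; jj += CY)
        for (int ii = 1; ii <= N; ii += CX)
          for (int k = kk; k < kk + CZ && k < z_hi; k++)
            for (int j = jj; j < jj + CY && j <= N; j++)
              for (int i = ii; i < ii + CX && i <= N; i++)
                out[IDX(i,j,k)] =
                    alpha *  in[IDX(i,j,k)] +
                    beta  * (in[IDX(i-1,j,k)] + in[IDX(i+1,j,k)] +
                             in[IDX(i,j-1,k)] + in[IDX(i,j+1,k)] +
                             in[IDX(i,j,k-1)] + in[IDX(i,j,k+1)]);
}
```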

20 Stream-Predicted Performance
(+ Core Blocking)
[Bar charts for Intel Clovertown, AMD Barcelona, and Sun Niagara2 (Victoria Falls): Naive through + Core Blocking vs. the Stream-predicted performance line]
Core blocking was universally successful

21 Stream-Predicted Performance
(+ Register Blocking)
[Bar charts for Intel Clovertown, AMD Barcelona, and Sun Niagara2 (Victoria Falls): Naive through + Register Blocking vs. the Stream-predicted performance line]
Register blocking helped on VF
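Register blocking here amounts to unrolling the innermost loops by small factors so neighbor loads are reused from registers and independent updates expose ILP. A hedged example with the x loop unrolled by 2 (not the authors' exact register-block shape); it would replace the innermost i loop of the blocked sweep above for an x-pencil of even length:

```c
#include <stddef.h>

#define N 256
#define IDX(i,j,k) ((size_t)(k)*(N+2)*(N+2) + (size_t)(j)*(N+2) + (size_t)(i))

/* Register-blocked update of one x-pencil [i0, i0+len) at fixed (j,k), with
 * x unrolled by 2: the loads of in[i] and in[i+1] are shared by the two
 * updates and can stay in registers, and the two independent sums give ILP. */
static void stencil_pencil_rb2(const double *in, double *out,
                               double alpha, double beta,
                               int i0, int len, int j, int k)
{
    for (int i = i0; i < i0 + len; i += 2) {
        double c0 = in[IDX(i,   j, k)];
        double c1 = in[IDX(i+1, j, k)];
        out[IDX(i,   j, k)] = alpha * c0 +
            beta * (in[IDX(i-1, j, k)] + c1 +
                    in[IDX(i, j-1, k)] + in[IDX(i, j+1, k)] +
                    in[IDX(i, j, k-1)] + in[IDX(i, j, k+1)]);
        out[IDX(i+1, j, k)] = alpha * c1 +
            beta * (c0 + in[IDX(i+2, j, k)] +
                    in[IDX(i+1, j-1, k)] + in[IDX(i+1, j+1, k)] +
                    in[IDX(i+1, j, k-1)] + in[IDX(i+1, j, k+1)]);
    }
}
```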

22 Stream-Predicted Performance
(+ Software Prefetch)
[Bar charts for Intel Clovertown, AMD Barcelona, and Sun Niagara2 (Victoria Falls): Naive through + Software Prefetch vs. the Stream-predicted performance line]
Software prefetch hides memory latency and may increase effective bandwidth
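One way to express software prefetching on x86 is the _mm_prefetch intrinsic inside the inner loop; the prefetch distance is itself a tuned parameter. The snippet below is an illustrative fragment, with dist assumed to be supplied by the tuner rather than a value from the paper:

```c
#include <xmmintrin.h>   /* _mm_prefetch (x86 only) */

/* Inside the innermost x loop of the blocked sweep: request the read-array
 * cache line that will be needed 'dist' iterations (doubles) from now.     */
_mm_prefetch((const char *)&in[IDX(i + dist, j, k)], _MM_HINT_T0);
```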

23 Stream-Predicted Performance
(+ SIMD)
[Bar charts for Intel Clovertown, AMD Barcelona, and Sun Niagara2 (Victoria Falls): Naive through + SIMD vs. the Stream-predicted performance line]
We explicitly used the 128-bit SSE registers on x86
icc does this automatically, but gcc does not (and we cannot assume anything)
SIMD is not very useful here since we are bandwidth-bound
This is NON-portable code
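A hedged SSE2 version of the inner x-pencil update using the 128-bit registers (two double-precision points per iteration); the function name and the alignment assumptions stated in the comment are mine, not the authors':

```c
#include <emmintrin.h>   /* SSE2 intrinsics (x86 only) */
#include <stddef.h>

#define N 256
#define IDX(i,j,k) ((size_t)(k)*(N+2)*(N+2) + (size_t)(j)*(N+2) + (size_t)(i))

/* SIMD update of one x-pencil [i0, i0+len) at fixed (j,k), two points per
 * iteration. Assumes len is even and &in[IDX(i0,j,k)] is 16-byte aligned;
 * the j/k-neighbor loads keep that alignment because the row and plane
 * strides are multiples of two doubles. Otherwise fall back to _mm_loadu_pd. */
static void stencil_pencil_sse(const double *in, double *out,
                               double alpha, double beta,
                               int i0, int len, int j, int k)
{
    const __m128d va = _mm_set1_pd(alpha), vb = _mm_set1_pd(beta);
    for (int i = i0; i < i0 + len; i += 2) {
        __m128d c = _mm_load_pd(&in[IDX(i,   j,   k  )]);         /* center */
        __m128d s = _mm_add_pd(_mm_loadu_pd(&in[IDX(i-1, j, k)]),  /* west  */
                               _mm_loadu_pd(&in[IDX(i+1, j, k)])); /* east  */
        s = _mm_add_pd(s, _mm_load_pd(&in[IDX(i, j-1, k)]));       /* south */
        s = _mm_add_pd(s, _mm_load_pd(&in[IDX(i, j+1, k)]));       /* north */
        s = _mm_add_pd(s, _mm_load_pd(&in[IDX(i, j, k-1)]));       /* below */
        s = _mm_add_pd(s, _mm_load_pd(&in[IDX(i, j, k+1)]));       /* above */
        _mm_store_pd(&out[IDX(i, j, k)],
                     _mm_add_pd(_mm_mul_pd(va, c), _mm_mul_pd(vb, s)));
    }
}
```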

24 Cache Bypass Optimization
We do not use the initial values in the write array; we simply overwrite them
We can therefore eliminate write-array cache line fills with an SSE intrinsic
This reduces memory traffic from 24 B/point to 16 B/point, a 33% improvement!
[Figure: the chip reads 8 B/point from the read array and writes 8 B/point to the write array in DRAM, with no write-allocate fill traffic]
The “movntpd” intrinsic is again only for x86 machines (applied after SIMDization; non-portable code)
The icc compiler does NOT do this automatically
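The non-temporal store that compiles to movntpd is exposed as the _mm_stream_pd intrinsic. As a sketch, it can replace the _mm_store_pd in the SIMD fragment above, followed by a store fence after the sweep:

```c
#include <emmintrin.h>   /* _mm_stream_pd; _mm_sfence comes from xmmintrin.h, which this pulls in */

/* Non-temporal (streaming) store: writes the result to DRAM without first
 * reading the destination cache line, eliminating the write-allocate fill.
 * Drop-in replacement for the _mm_store_pd in the SIMD sketch above.       */
_mm_stream_pd(&out[IDX(i, j, k)],
              _mm_add_pd(_mm_mul_pd(va, c), _mm_mul_pd(vb, s)));

/* After the sweep, order the streaming stores before any later reads: */
_mm_sfence();
```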

25 Stream-Predicted Performance
(+ Cache Bypass)
[Bar charts for Intel Clovertown, AMD Barcelona, and Sun Niagara2 (Victoria Falls): Naive through + Cache Bypass vs. the Stream-predicted performance line]

26 Stream-Predicted Performance
(+ Thread Blocking)
[Bar charts for Intel Clovertown, AMD Barcelona, and Sun Niagara2 (Victoria Falls): Naive through + Thread Blocking vs. the Stream-predicted performance line]
Thread blocking allows threads to share caches within a core on VF

27 Stream-Predicted Performance
(Full Tuning)
[Bar charts for Intel Clovertown, AMD Barcelona, and Sun Niagara2 (Victoria Falls): Naive through + Thread Blocking vs. the Stream-predicted performance line]
Best tuned: 70% of Stream-predicted (Clovertown), 94% (Barcelona), 72% (Victoria Falls)
The bars look different for each architecture
Which optimizations will be helpful is non-intuitive
An optimization may not help when first applied, but may enable later optimizations

28 Tuning Speedup
(Over Best Naïve Code)
[Bar charts for Intel Clovertown, AMD Barcelona, and Sun Niagara2 (Victoria Falls): Naive through + Thread Blocking vs. the Stream-predicted performance line]
Clovertown: 1.5x, Barcelona: 5.6x, Victoria Falls: 4.1x

29 Parallel Scaling Speedup
(Over Single Core Performance)
[Bar charts for Intel Clovertown, AMD Barcelona, and Sun Niagara2 (Victoria Falls): Naive through + Thread Blocking vs. the Stream-predicted performance line]
Clovertown: 1.9x, Barcelona: 4.5x, Victoria Falls: 7.9x
NAÏVE CODE SHOWED NO SCALING! Now we show both good performance and good scaling

30 Outline
Stencil Code Overview
Cache-based Architectures
Automatic Tuning
Local-store Architectures

31 Parameter Space Explosion
The applied optimizations were:
NUMA-aware allocation
Array padding
Core blocking
Register blocking
Software prefetching
SIMDization
Cache bypass
Thread blocking
Each optimization has an associated set of parameters
The size of the configuration space quickly becomes intractable

32 Automatic Tuning
Hand-tuning across diverse architectures and core counts is impractical
We need an effective approach that is portable, scalable, and requires minimal programmer effort (for coding or tuning)
We let the machine search the parameter space intelligently to find a (near-)optimal configuration (autotuning)
Autotuning has a proven track record (e.g., ATLAS, SPIRAL, FFTW, OSKI)

33 Traversing the Parameter Space
An exhaustive search is impossible
To make the problem tractable, we ordered the optimizations and applied them consecutively
Every platform had its own set of best parameters
This technique may not find the best overall configuration, but as shown, it produced very good results
[Figure: the search space as nested axes of Opt. #1, Opt. #2, and Opt. #3 parameters, traversed one optimization at a time]
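An illustrative version of this consecutive, one-parameter-group-at-a-time search, shown here only for the core-block sizes. benchmark() stands in for a timed run of the blocked stencil, and the candidate lists are assumptions, not the authors' actual search points:

```c
#include <stdio.h>

/* Greedy, one-dimension-at-a-time search over core-block sizes, as a stand-in
 * for the consecutive per-optimization search described above. benchmark()
 * is assumed to run the blocked stencil with the given block sizes and
 * return a performance figure (e.g. GFlop/s).                               */
double benchmark(int CX, int CY, int CZ);   /* provided elsewhere */

void tune_core_block(int *bestX, int *bestY, int *bestZ)
{
    const int cand[] = { 8, 16, 32, 64, 128, 256 };
    const int ncand  = sizeof(cand) / sizeof(cand[0]);
    double best = 0.0;
    *bestX = *bestY = *bestZ = cand[0];

    /* Tune one block dimension at a time, holding the others fixed. */
    for (int d = 0; d < 3; d++)
        for (int c = 0; c < ncand; c++) {
            int CX = (d == 0) ? cand[c] : *bestX;
            int CY = (d == 1) ? cand[c] : *bestY;
            int CZ = (d == 2) ? cand[c] : *bestZ;
            double perf = benchmark(CX, CY, CZ);
            if (perf > best) {
                best = perf;
                *bestX = CX; *bestY = CY; *bestZ = CZ;
            }
            printf("CX=%3d CY=%3d CZ=%3d -> %.2f\n", CX, CY, CZ, perf);
        }
}
```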

34 Outline
Stencil Code Overview
Cache-based Architectures
Automatic Tuning
Local-store Architectures

35 Local-Store Architectures
(Both Heterogeneous)
[Block diagrams: IBM QS22 Cell Blade (VMT PPE with 512KB L2, SPEs with 256K local stores and MFCs on the EIB ring network, 4x64b controllers, 25.6 GB/s to 800MHz DDR2 DIMMs, BIF between sockets) and NVIDIA GTX280 (streaming multiprocessors with SPs, SFUs, a double-precision unit, and shared memory, texture-only L1/L2, 32 ROPs, 8 x 64b memory controllers, 141.7 GB/s to 1GB of 1107MHz GDDR3 device DRAM, attached to the host over PCIe)]
The Cell has both a PowerPC core and 8 SIMD SPEs
The GTX280 GPU is connected to the CPU over PCIe

36 Local-Store Architectures
(Features)
[Block diagrams as on slide 35; Cell labeled “Direct Memory Access”, GTX280 labeled “Highly Multithreaded”]
Hiding memory latency: Cell overlaps asynchronous DMAs with execution; the GTX280 relies on massive thread concurrency

37 Local-Store Architectures
(Socket / Core Count)
[Block diagrams as on slide 35]
Cell Blade: 2 sockets x 8 SPEs/socket
GTX280: 1 socket x 30 SMs/socket x 8 scalar cores/SM
SM = “streaming multiprocessor”

38 Local-Store Architectures
(Stream Bandwidth)
[Block diagrams as on slide 35, annotated with measured bandwidth: Cell Blade 36.9 GB/s, GTX280 127 GB/s]
These numbers are much better than on the cache-based machines
The GTX280 has sacrificed capacity for high bandwidth: the problem must fit into the 1 GB of on-board memory, or use PCIe transfers

39 Tuning
(Local-Store Architectures)
Tuning for local-store platforms is typically easier
Data movement is explicitly controlled by DMAs (Cell) or SIMT loads/stores (GPU)
The search space is limited by the register file and local-store sizes
Heuristics based on local memory size are usually effective
Cell: block for the 256 KB local store
GTX280: block for the 64 KB register file on each SM
Each platform tunes for its largest local memory: local store or register file
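For flavor, here is a rough SPE-side double-buffering pattern using the Cell SDK's spu_mfcio.h DMA interface as I understand it; the chunk size, the tag scheme, and the omission of the result DMA-put are simplifications and assumptions, not the authors' Cell implementation:

```c
#include <spu_mfcio.h>   /* SPE-side DMA intrinsics from the Cell SDK */
#include <stdint.h>

#define CHUNK_BYTES 2048  /* illustrative chunk size (well under the 16 KB DMA limit) */

static volatile double buf[2][CHUNK_BYTES / sizeof(double)] __attribute__((aligned(128)));

/* Double-buffered streaming of 'nchunks' chunks from main memory (effective
 * address 'ea'): fetch chunk i+1 under one DMA tag while computing on chunk i. */
void stream_chunks(uint64_t ea, int nchunks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK_BYTES, cur, 0, 0);           /* prime the pipeline */

    for (int i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nchunks)                                 /* prefetch the next chunk */
            mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * CHUNK_BYTES,
                    CHUNK_BYTES, nxt, 0, 0);

        mfc_write_tag_mask(1 << cur);                        /* wait for the current chunk */
        mfc_read_tag_status_all();

        /* ... apply the stencil to buf[cur] and mfc_put the results back ... */

        cur = nxt;
    }
}
```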

40 Performance
(Full Tuning, Double Precision Results)
[Bar charts: tuned double-precision performance for the IBM QS22 Cell Blade and NVIDIA GTX280; legend: Naïve CUDA on host, Naïve CUDA on device, +DMA, +NUMA, +Core Blocking, +SIMD, +Thread Blocking]
There is no “naïve” code here, since neither platform runs portable C code
Cell is compute-bound at maximum concurrency
The GTX280 shows a 3.6x improvement over “Naïve CUDA on device” by exploiting registers instead of shared memory
Both platforms are much faster than the cache-based machines

41 Summary
(All Architectures)
[Bar charts: performance and power efficiency for Clovertown, Barcelona, and Victoria Falls (cache-based) and the Cell Blade, GTX280, and GTX280-Host (local store-based)]
In general, the local-store architectures do better than the cache-based ones
Power efficiency has become as important as performance
Clovertown and Victoria Falls use power-hungry FBDIMMs

42 Conclusions
The compiler alone achieves poor performance: between 17% and 48% of Stream-predicted performance, with no parallel scaling
Autotuning is essential to achieving good performance: 1.5x-5.6x speedups across diverse architectures
Automatic tuning is also necessary for scalability
With few exceptions, the same code was used on all cache-based machines
The Cell and GTX280 show much better performance and power efficiency than the cache-based machines, but at a loss of productivity: both codes are platform-specific, and data movement must be explicitly managed

43 Questions?
I am hoping to graduate in summer 2009
Kaushik Datta:
Many thanks to my co-authors: Mark Murphy, Vasily Volkov, Sam Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, and Kathy Yelick

