
1 The Roofline Model — Samuel Williams, Future Technologies Group, Lawrence Berkeley National Laboratory (SWWilliams@lbl.gov)

2 Outline: Challenges / Goals; Fundamentals; Roofline Performance Model; Example: Heat Equation; Example: SpMV; Alternate Roofline Formulations; Summary.

3 Challenges / Goals

4 Four Architectures. [Block diagrams of the four systems studied — AMD Barcelona (Opteron), Sun Victoria Falls, IBM Cell Blade, and NVIDIA G80 — showing cores, private and shared caches, on-chip interconnect, memory controllers, and DRAM bandwidths.]

5 Challenges / Goals. We have extremely varied architectures. Moreover, the characteristics of numerical methods can vary dramatically. The result is that performance and the benefit of optimization can vary significantly from one architecture × kernel combination to the next. We wish to understand whether or not we've attained good performance (a high fraction of theoretical peak), and we wish to identify performance bottlenecks and enumerate potential remediation strategies.

6 Fundamentals

7 Little's Law. Little's Law: Concurrency = Latency × Bandwidth, or equivalently, Effective Throughput = Expressed Concurrency / Latency. Bandwidth: conventional memory bandwidth, or the number of floating-point units. Latency: memory latency, or functional-unit latency. Concurrency: bytes expressed to the memory subsystem, or concurrent (parallel) memory operations. For example, consider a CPU with 2 FPUs, each with a 4-cycle latency. Little's Law states that we must express 8-way ILP to fully utilize the machine.

8 Little's Law Examples. Applied to FPUs: consider a CPU with 2 FPUs, each with a 4-cycle latency. Little's Law states that we must express 8-way ILP to fully utilize the machine. Solution: unroll/jam the code to express 8 independent FP operations. Note that simply unrolling dependent operations (e.g. a reduction) does not increase ILP; it merely amortizes loop overhead. Applied to memory: consider a CPU with 20 GB/s of bandwidth and 100 ns memory latency. Little's Law states that we must express 2 KB of concurrency (independent memory operations) to the memory subsystem to attain peak performance. On today's superscalar processors, hardware stream prefetchers speculatively load consecutive elements. Solution: express the memory access pattern in a streaming fashion to engage the prefetchers. (A small worked calculation is sketched below.)
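
A minimal sketch of both Little's Law calculations from this slide; the machine parameters (2 FPUs, 4-cycle latency, 20 GB/s, 100 ns) are the slide's illustrative numbers, nothing more.

    #include <stdio.h>

    int main(void) {
        /* Applied to FPUs: 2 FPUs x 4-cycle latency -> 8 in-flight FP ops needed. */
        int fpus = 2, fp_latency_cycles = 4;
        int required_ilp = fpus * fp_latency_cycles;

        /* Applied to memory: 20 GB/s x 100 ns -> ~2 KB in flight. */
        double bandwidth_bytes_per_s = 20e9;
        double latency_s = 100e-9;
        double required_bytes = bandwidth_bytes_per_s * latency_s;

        printf("required ILP: %d independent FP operations\n", required_ilp);
        printf("required memory concurrency: %.0f bytes (~%.0f cache lines of 64 B)\n",
               required_bytes, required_bytes / 64.0);
        return 0;
    }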

9 Three Classes of Locality. Temporal locality: reusing data (in registers or cache lines) multiple times amortizes the impact of limited bandwidth; transform loops or algorithms to maximize reuse. Spatial locality: data is transferred from cache to registers in words, but data is transferred into the cache in 64-128 byte lines; using every word in a line maximizes spatial locality; transform data structures into a structure-of-arrays (SoA) layout (a sketch follows below). Sequential locality: many memory address patterns access cache lines sequentially; CPU hardware stream prefetchers exploit this observation to speculatively load data and hide memory latency; transform loops to generate (a few) long, unit-stride accesses.
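
A minimal sketch of the AoS-to-SoA transformation mentioned above; the struct names and the sum-over-x example are assumptions for illustration, not code from the talk.

    /* Array-of-structures: touching only `x` still drags y and z through the cache. */
    struct PointAoS { double x, y, z; };

    /* Structure-of-arrays: each field is a contiguous, unit-stride stream. */
    struct PointsSoA { double *x, *y, *z; };

    double sum_x_aos(const struct PointAoS *p, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += p[i].x;            /* uses only 8 of every 24 bytes transferred */
        return s;
    }

    double sum_x_soa(const struct PointsSoA *p, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += p->x[i];           /* uses every byte of every cache line */
        return s;
    }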

10 Arithmetic Intensity. True arithmetic intensity (AI) ~ Total FLOPs / Total DRAM Bytes. Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality); others have constant intensity. Arithmetic intensity is ultimately limited by compulsory traffic and is diminished by conflict or capacity misses. [Figure: kernels ordered by arithmetic intensity — O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods; O(log N): FFTs; O(N): dense linear algebra (BLAS3), particle methods.] (A small worked example follows below.)
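
A quick worked example of the AI definition above, using DAXPY (a BLAS1 kernel); this particular calculation is an illustration, not taken from the slides.

    /* DAXPY: y[i] = a*x[i] + y[i], double precision.
     * Per element: 2 flops; DRAM traffic: read x (8 B) + read y (8 B) + write y (8 B).
     * (No extra write-allocate traffic: y's line is already resident from the read.) */
    #include <stdio.h>

    int main(void) {
        double flops_per_elem = 2.0;
        double bytes_per_elem = 8.0 + 8.0 + 8.0;
        printf("DAXPY AI = %.3f flops/byte\n", flops_per_elem / bytes_per_elem); /* ~0.083 */
        return 0;
    }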

11 NUMA. Recent multicore SMPs have integrated the memory controllers on chip. As a result, memory access is non-uniform (NUMA): the bandwidth to read a given address varies dramatically between cores. Exploit NUMA (affinity + first touch) when you malloc/initialize data; the concept is similar to data decomposition for distributed memory. (A first-touch sketch follows below.)
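
A minimal first-touch sketch, assuming OpenMP, pinned threads, and a "first touch" page-placement policy (as on Linux); the function name is an illustration, not part of the talk.

    #include <stdlib.h>
    #include <omp.h>

    double *alloc_and_first_touch(size_t n) {
        double *a = malloc(n * sizeof(double));
        /* Initialize with the same static decomposition the compute loops will use,
         * so each page lands on the NUMA node of the thread that will consume it. */
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            a[i] = 0.0;              /* first touch binds the page near this thread */
        return a;
    }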

12 NUMA (continued): the same points as the previous slide, illustrated with a system diagram.

13 Roofline Model

14 Overlap of Communication. Consider a simple example in which an FP kernel maintains a working set in DRAM. We assume we can perfectly overlap computation with communication (or vice versa), either through prefetching/DMA and/or pipelining (decoupling of communication and computation). Thus, time is the maximum of the time required to transfer the data and the time required to perform the floating-point operations: time = max( Bytes / STREAM Bandwidth, Flops / Flop/s ).

15 Roofline Model: Basic Concept. Synthesize communication, computation, and locality into a single, visually intuitive performance figure using bound-and-bottleneck analysis: Attainable Performance_{i,j} = min( FLOP/s with optimizations 1..i, AI × Bandwidth with optimizations 1..j ), where optimization i can be SIMDize, unroll, SW prefetch, etc. Given a kernel's arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance. Moreover, it provides insight into which optimizations will potentially be beneficial. (A sketch of the bound computation follows below.)
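
A minimal sketch of the min() bound above; the peak and bandwidth numbers are illustrative assumptions, not values quoted on this slide.

    #include <stdio.h>

    static double roofline_gflops(double peak_gflops, double stream_gbs, double ai) {
        double compute_bound = peak_gflops;      /* horizontal roof */
        double memory_bound  = ai * stream_gbs;  /* diagonal (bandwidth) roof */
        return compute_bound < memory_bound ? compute_bound : memory_bound;
    }

    int main(void) {
        double peak = 73.6, bw = 16.6;           /* hypothetical GFLOP/s and GB/s */
        for (double ai = 0.125; ai <= 16.0; ai *= 2.0)
            printf("AI = %6.3f  ->  attainable %6.1f GFLOP/s\n",
                   ai, roofline_gflops(peak, bw, ai));
        return 0;
    }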

16 Example. Consider the Opteron 2356: dual socket (NUMA), limited HW stream prefetchers, quad-core (8 cores total), 2.3 GHz, 2-way SIMD (DP), separate FP multiply and FP add datapaths, 4-cycle FP latency. Assuming expression of parallelism is the challenge on this architecture, what would the Roofline model look like? (A peak-rate calculation is sketched below.)
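
A sketch deriving a peak DP rate from the parameters listed on this slide; the resulting product is an inference from those parameters, not a number quoted on the slide.

    #include <stdio.h>

    int main(void) {
        double ghz = 2.3;
        int cores = 8;              /* 2 sockets x 4 cores */
        int simd_width_dp = 2;      /* 128-bit SIMD, double precision */
        int issue_per_cycle = 2;    /* one multiply + one add datapath per cycle */
        printf("peak DP = %.1f GFLOP/s\n", ghz * cores * simd_width_dp * issue_per_cycle);
        return 0;
    }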

17 Roofline Model: Basic Concept. Plot on a log-log scale (attainable GFLOP/s vs. actual FLOP:byte ratio). Given AI, we can easily bound performance. But architectures are much more complicated, so we will bound performance as we eliminate specific forms of in-core parallelism. [Roofline figure for the Opteron 2356 (Barcelona): peak DP roof and Stream Bandwidth diagonal.]

18 Roofline Model: Computational Ceilings. Opterons have dedicated multipliers and adders; if the code is dominated by adds, then attainable performance is half of peak. We call these "ceilings"; they act like constraints on performance. [Roofline figure: peak DP, mul/add imbalance ceiling; Stream Bandwidth.]

19 Roofline Model: Computational Ceilings. Opterons have 128-bit datapaths; if instructions aren't SIMDized, attainable performance will be halved. [Roofline figure: peak DP, mul/add imbalance, w/out SIMD ceilings; Stream Bandwidth.]

20 Roofline Model: Computational Ceilings. On Opterons, floating-point instructions have a 4-cycle latency; if we don't express 4-way ILP, performance will drop by as much as 4x. [Roofline figure: peak DP, mul/add imbalance, w/out SIMD, w/out ILP ceilings; Stream Bandwidth.]

21 Roofline Model: Communication Ceilings. We can perform a similar exercise, taking away parallelism from the memory subsystem. [Roofline figure: peak DP; Stream Bandwidth.]

22 Roofline Model: Communication Ceilings. Explicit software prefetch instructions are required to achieve peak bandwidth. [Roofline figure: peak DP; Stream Bandwidth, w/out SW prefetch ceiling.]

23 Roofline Model: Communication Ceilings. Opterons are NUMA; as such, memory traffic must be correctly balanced between the two sockets to achieve good Stream bandwidth. We could continue this by examining strided or random memory access patterns. [Roofline figure: peak DP; Stream Bandwidth, w/out SW prefetch, w/out NUMA ceilings.]

24 Roofline Model: Computation + Communication Ceilings. We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth. [Roofline figure: peak DP, mul/add imbalance, w/out SIMD, w/out ILP; Stream Bandwidth, w/out SW prefetch, w/out NUMA.]

25 Roofline Model: Locality Walls. Remember, memory traffic includes more than just compulsory misses, so the actual arithmetic intensity may be substantially lower. Walls are unique to the architecture-kernel combination. With only compulsory miss traffic: AI = FLOPs / Compulsory Misses. [Roofline figure with the corresponding locality wall.]

26 Cache Behavior. Knowledge of the underlying cache operation can be critical. For example, caches are organized into lines, and lines are organized into sets and ways (associativity); thus we must mimic the effect of Mark Hill's 3Cs of caches. The impacts of conflict, compulsory, and capacity misses are both architecture- and application-dependent; ultimately they reduce the actual flop:byte ratio. Moreover, many caches are write-allocate: a write-allocate cache reads in an entire cache line upon a write miss. If the application ultimately overwrites that line, the read was superfluous (further reducing the flop:byte ratio). Because programs access data in words, but hardware transfers it in 64 or 128 B cache lines, spatial locality is key; array-of-structures data layouts can lead to dramatically lower flop:byte ratios, e.g. if a program only operates on the red field of a pixel, bandwidth is wasted.

27 Roofline Model: Locality Walls. Adding write-allocation traffic: AI = FLOPs / (Allocations + Compulsory Misses). [Roofline figure with the corresponding, lower locality wall.]

28 Roofline Model: Locality Walls. Adding capacity miss traffic: AI = FLOPs / (Capacity + Allocations + Compulsory Misses). [Roofline figure with the corresponding locality wall.]

29 Roofline Model: Locality Walls. Adding conflict miss traffic: AI = FLOPs / (Conflict + Capacity + Allocations + Compulsory Misses). [Roofline figure with the corresponding locality wall.]

30 Roofline Model: Locality Walls. Optimizations remove these walls and ceilings, which act to constrain performance. [Roofline figure showing all computational ceilings, bandwidth ceilings, and locality walls.]

31 Instruction Issue Bandwidth. On a superscalar processor, there is likely ample instruction issue bandwidth; this allows loads, integer, and FP instructions to be issued simultaneously. As such, we assumed that expression of parallelism was the underlying in-core challenge. However, on some architectures, finite instruction-issue bandwidth can become a major impediment to performance.

32 Roofline Model: Instruction Mix. As the instruction mix shifts away from floating point, finite issue bandwidth begins to limit in-core performance. On Niagara2, with dual issue units but only one FPU, FP instructions must constitute 50% of the mix to attain peak performance. A similar approach should be used on GPUs, where proper use of CUDA solves the parallelism challenges. [Roofline figure for the UltraSparc T2+ T5140 (Niagara2): peak DP at 50% FP, with 25% FP and 12% FP ceilings; Stream Bandwidth, w/out SW prefetch, w/out NUMA.]

33 Optimization Categorization: maximizing (attained) in-core performance, minimizing (total) memory traffic, and maximizing (attained) memory bandwidth.

34 Optimization Categorization. Maximizing in-core performance: exploit in-core parallelism (ILP, DLP, etc.); achieve good (enough) floating-point balance.

35 Optimization Categorization. Maximizing in-core performance — exploit in-core parallelism (ILP, DLP, etc.) and good (enough) floating-point balance — via: unroll & jam, explicit SIMD, reorder, eliminate branches.

36 Optimization Categorization. Maximizing memory bandwidth — exploit NUMA, hide memory latency, satisfy Little's Law — via: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking.

37 Optimization Categorization. Minimizing memory traffic — eliminate capacity misses, conflict misses, compulsory misses, and write-allocate behavior — via: cache blocking, array padding, compress data, streaming stores.

38 Optimization Categorization (summary). Maximizing in-core performance: unroll & jam, explicit SIMD, reorder, eliminate branches. Maximizing memory bandwidth: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking. Minimizing memory traffic: cache blocking, array padding, compress data, streaming stores.

39 Examples

40 Multicore SMPs Used. [Block diagrams of the four systems benchmarked — Intel Xeon E5345 (Clovertown), AMD Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade — showing cores, caches, interconnect, memory controllers, and DRAM bandwidths.]

41 Heat Equation

42 7-point Stencil. [Figure: PDE grid and the 7-point stencil for the heat equation.] The simplest discretization of the Laplacian operator results in a constant-coefficient 7-point stencil; for all x,y,z: u(x,y,z,t+1) = alpha*u(x,y,z,t) + beta*( u(x,y,z-1,t) + u(x,y-1,z,t) + u(x-1,y,z,t) + u(x+1,y,z,t) + u(x,y+1,z,t) + u(x,y,z+1,t) ). Each stencil performs 8 floating-point operations and 8 memory references (all but 2 should be filtered by an ideal cache), touching 6 memory streams (all but 2 should be filtered; fewer than the number of HW prefetchers). Ideally, AI = 0.5; however, write-allocate traffic bounds it to 0.33. (A reference loop nest is sketched below.)
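
A minimal sketch of a naive loop nest for the stencil defined above; the array layout, indexing macro, and boundary handling are assumptions for illustration, not the tuned code from the talk.

    #define IDX(x, y, z) ((z) * NY * NX + (y) * NX + (x))

    void stencil7(int NX, int NY, int NZ, double alpha, double beta,
                  const double *restrict u, double *restrict unew) {
        for (int z = 1; z < NZ - 1; z++)
            for (int y = 1; y < NY - 1; y++)
                for (int x = 1; x < NX - 1; x++)
                    /* 8 flops per point: 2 multiplies, 6 adds */
                    unew[IDX(x, y, z)] =
                        alpha * u[IDX(x, y, z)] +
                        beta  * (u[IDX(x, y, z - 1)] + u[IDX(x, y - 1, z)] +
                                 u[IDX(x - 1, y, z)] + u[IDX(x + 1, y, z)] +
                                 u[IDX(x, y + 1, z)] + u[IDX(x, y, z + 1)]);
    }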

43 Roofline model for the stencil (out-of-the-box code): large datasets; 2 unit-stride streams; no NUMA; little ILP; no DLP; far more adds than multiplies (imbalance); ideal flop:byte ratio of 1/3. High locality requirements: capacity and conflict misses will severely impair the flop:byte ratio. No naive Cell implementation.

44 Roofline model for the stencil (out-of-the-box code), continued: the same characteristics as the previous slide, shown on the per-platform roofline figures.

45 Roofline model for the stencil (+NUMA, cache blocking, unrolling, prefetch, ...): cache blocking helps ensure the flop:byte ratio is as close as possible to 1/3. Clovertown has huge caches but is pinned to a lower bandwidth ceiling. Cache management is essential when capacity per thread is low. No naive Cell implementation.

46 Roofline model for the stencil (+SIMDization + cache bypass): make SIMDization explicit (technically, this swaps the ILP and SIMD ceilings). Use the cache-bypass (streaming store) instruction movntpd; this increases the flop:byte ratio to ~0.5 on x86/Cell. (A streaming-store sketch follows below.)
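
A minimal sketch of an SSE2 streaming (non-temporal) store, the intrinsic behind movntpd: the write bypasses the cache and so avoids the write-allocate read of the destination line. This is an illustrative copy loop, not the tuned stencil from the talk; alignment of dst is assumed.

    #include <stddef.h>
    #include <emmintrin.h>

    void copy_nt(double *restrict dst, const double *restrict src, size_t n) {
        size_t i;
        for (i = 0; i + 2 <= n; i += 2) {
            __m128d v = _mm_loadu_pd(src + i);
            _mm_stream_pd(dst + i, v);        /* dst must be 16-byte aligned */
        }
        for (; i < n; i++) dst[i] = src[i];   /* scalar remainder */
        _mm_sfence();                         /* make NT stores globally visible */
    }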

47 SpMV

48 Sparse Matrix-Vector Multiplication. What's a sparse matrix? Most entries are 0.0; there is a performance advantage in only storing/operating on the nonzeros, but this requires significant metadata to reconstruct the matrix structure. What's SpMV? Evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors. Challenges: very low arithmetic intensity (often < 0.166 flops/byte); difficult to exploit ILP (bad for pipelined or superscalar cores); difficult to exploit DLP (bad for SIMD). [Figure: (a) algebraic conceptualization of y = Ax; (b) CSR data structure (A.val[], A.col[], A.rowStart[]); (c) CSR reference code.] CSR reference code:
for (r = 0; r < A.rows; r++) {
  double y0 = 0.0;
  for (i = A.rowStart[r]; i < A.rowStart[r+1]; i++) {
    y0 += A.val[i] * x[A.col[i]];
  }
  y[r] = y0;
}

49 Roofline model for SpMV: double-precision roofline models for the four platforms, with in-core optimizations 1..i on the compute side and DRAM optimizations 1..j on the bandwidth side. FMA is inherent in SpMV (so the FMA ceiling is placed at the bottom). GFlops_{i,j}(AI) = min( InCoreGFlops_i, StreamBW_j × AI ). [Roofline figures for the Intel Xeon E5345 (Clovertown), AMD Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade, each showing compute ceilings (peak DP, mul/add imbalance or w/out FMA, w/out SIMD, w/out ILP, 25%/12% FP) and bandwidth ceilings (Stream, w/out SW prefetch, w/out NUMA, bank conflicts, dataset fits in snoop filter).]

50 Roofline model for SpMV (overlay arithmetic intensity): two unit-stride streams; inherent FMA; no ILP; no DLP; FP is 12-25% of instructions; naive compulsory flop:byte < 0.166. No naive SPE implementation. [The same four roofline figures with the naive arithmetic intensity overlaid.]

51 Roofline model for SpMV (out-of-the-box parallel): two unit-stride streams; inherent FMA; no ILP; no DLP; FP is 12-25% of instructions; naive compulsory flop:byte < 0.166. For simplicity, a dense matrix stored in sparse format. No naive SPE implementation. [The same four roofline figures with out-of-the-box parallel performance overlaid.]

52 Roofline model for SpMV (+NUMA & SW prefetch): compulsory flop:byte ~ 0.166; utilize all memory channels. [The same four roofline figures with the improved performance overlaid.]

53 Roofline model for SpMV (+matrix compression): inherent FMA; register blocking improves ILP, DLP, the flop:byte ratio, and the FP fraction of instructions. (A register-blocking sketch follows below.) [The same four roofline figures with the compressed-matrix performance overlaid.]
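
A minimal 2x2 register-blocked (BCSR-style) SpMV sketch to illustrate why register blocking helps: each stored block contributes a 2x2 dense tile, exposing independent accumulators (ILP/DLP) and amortizing one column index over four nonzeros. The structure and field names are assumptions for illustration, not the tuned kernel from the talk.

    typedef struct {
        int     brows;        /* number of 2-row block rows */
        int    *blockStart;   /* CSR-style row pointers over blocks */
        int    *bcol;         /* column (in units of 2) of each block */
        double *bval;         /* blocks stored as 4 doubles: [r0c0 r0c1 r1c0 r1c1] */
    } BCSR2x2;

    void spmv_bcsr2x2(const BCSR2x2 *A, const double *x, double *y) {
        for (int br = 0; br < A->brows; br++) {
            double y0 = 0.0, y1 = 0.0;                 /* two independent accumulators */
            for (int b = A->blockStart[br]; b < A->blockStart[br + 1]; b++) {
                const double *v  = &A->bval[4 * b];
                const double *xx = &x[2 * A->bcol[b]];
                y0 += v[0] * xx[0] + v[1] * xx[1];
                y1 += v[2] * xx[0] + v[3] * xx[1];
            }
            y[2 * br]     = y0;
            y[2 * br + 1] = y1;
        }
    }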

54 Various Kernels. We have examined and heavily optimized a number of kernels and applications for both CPUs and GPUs. We observe that for most, performance is highly correlated with DRAM bandwidth, particularly on the GPU. Note that GTC has a strong scatter/gather component that skews STREAM-based rooflines. [Roofline figures for the Xeon X5550 (Nehalem) and NVIDIA C2050 (Fermi): single- and double-precision peaks, DP add-only ceiling, and STREAM/device bandwidth, with the kernels overlaid.]

55 Alternate Rooflines

56 No Overlap of Communication and Computation. Previously, we assumed perfect overlap of communication and computation. What happens if there is a dependency (either inherent or from a lack of optimization) that serializes communication and computation? Time is then the sum of communication time and computation time: time = Bytes / STREAM Bandwidth + Flops / Flop/s, rather than the maximum of the two. The result is that attained flop/s only grows asymptotically toward the roofline. (A small comparison is sketched below.)
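
A minimal sketch comparing the overlapped and serialized time models: with overlap, attained flop/s follows min(peak, AI×BW); without it, 1/(1/peak + 1/(AI×BW)), which only approaches the roofline asymptotically. The machine numbers are hypothetical.

    #include <stdio.h>

    int main(void) {
        double peak = 73.6, bw = 16.6;            /* GFLOP/s, GB/s (illustrative) */
        for (double ai = 0.25; ai <= 64.0; ai *= 4.0) {
            double mem     = ai * bw;                        /* bandwidth-limited rate */
            double overlap = peak < mem ? peak : mem;        /* time = max(...)        */
            double serial  = 1.0 / (1.0 / peak + 1.0 / mem); /* time = sum(...)        */
            printf("AI = %6.2f  overlap %6.1f  no-overlap %6.1f GFLOP/s\n",
                   ai, overlap, serial);
        }
        return 0;
    }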

57 No Overlap of Communication and Computation. Consider a generic machine: if we can perfectly decouple and overlap communication with computation, the roofline is sharp/angular. Without overlap, however, the roofline is smoothed, and attainable performance is degraded by up to a factor of 2x.

58 Alternate Bandwidths. Thus far, we have assumed a synergy between streaming applications and bandwidth (proxied by the STREAM benchmark). STREAM is not a good proxy for short-stanza/random cache-line access patterns, as memory latency (not just bandwidth) is exposed; thus one might conceive of alternate memory benchmarks to provide a bandwidth upper bound (ceiling). Similarly, if data resides primarily in the last-level cache, one should construct rooflines based on LLC bandwidth and flop:LLC-byte ratios. For GPUs/accelerators, PCIe bandwidth can be an impediment, so one can construct a roofline model based on PCIe bandwidth and the flop:PCIe-byte ratio.

59 Alternate Computations. Having arisen from HPC kernels, it's no surprise the roofline typically uses DP flop/s. Of course, it could instead use SP flop/s, integer ops, bit operations, pairwise comparisons (sorting), graphics operations, etc.

60 Time-Based Roofline. In some cases it is easier to visualize performance in terms of seconds (i.e., time to solution). We can invert the roofline (seconds per flop) and simply multiply by the number of requisite flops. Additionally, we could change the horizontal axis from locality to some more appealing metric.

61 Empirical Roofline. Thus far, all in-core estimates have been based on first-principles analysis of the underlying computer architecture (frequency, SIMD width, latency, etc.). Conceivably, one could design a series of compiled benchmarks to extract the relevant roofline parameters. Similarly, one could use performance counters to extract application characteristics, so as to accurately determine application coordinates.

62 Questions? Acknowledgments: research supported by the DOE Office of Science under contract number DE-AC02-05CH11231.

63 Backup Slides

