
1 Trial Lecture: The Use of GPUs for High-Performance Computing
12 October 2010
Magnus Jahre

2 Graphics Processors (GPUs)
Modern computers are graphics intensive
Advanced 3D graphics require a significant amount of computation
Solution: add a dedicated Graphics Processor (GPU)
(Graphics card image source: nvidia.com)

3 High-Performance Computing
High-Performance Computing (HPC): the efficient use of computers for computationally intensive problems in science or engineering
General-Purpose Programming on GPUs (GPGPU)
(Figure: example applications plotted by processing demand and communication demand, ranging from office applications to weather forecasting, climate modeling, molecular dynamics simulation and computational computer architecture; a third dimension is main memory capacity)

4 Outline
GPU Evolution
GPU Programming
GPU Architecture
Achieving High GPU Performance
Future Trends
Conclusions

5 GPU EVOLUTION

6 First GPUs: Fixed Hardware [Blythe 2008]
Pipeline: Vertex Processing, Rasterization, Fragment Processing, Framebuffer Operations
Data: Vertex Data, Texture Maps, Depth Buffer, Color Buffer

7 Programmable Shaders
Motivation: more flexible graphics processing
Pipeline: programmable Vertex Processing, Rasterization, programmable Fragment Processing, Framebuffer Operations
Data: Vertex Data, Texture Maps, Depth Buffer, Color Buffer

8 GPGPU with Programmable Shaders
Use a graphics library to gain access to the GPU
Encode data as color values
The effect of the fixed-function pipeline stages must be accounted for

9 Functional Unit Utilization
(Figure: the graphics pipeline with separate vertex-processing and fragment-processing units)

10 Functional Unit Utilization
(Figure: utilization of vertex and fragment units under a vertex-intensive shader, a fragment-intensive shader, and a unified shader)

11 Unified Shader Architecture
Exploit parallelism
– Data parallelism
– Task parallelism
Data-parallel processing (SIMD/SIMT)
Hide memory latencies
High bandwidth
The architecture naturally supports GPGPU
(Figure: streaming processors (SP) with per-SP memory, a thread scheduler, an interconnect, on-chip memory or cache, and off-chip DRAM)

12 GPU PROGRAMMING

13 GPGPU Tool Support
Programmable-shader era: Sh, PeakStream, Accelerator, GPU++
Unified-shader era: CUDA, OpenCL
(Figure: the number of GPU papers at the Supercomputing conference over time)

14 Compute Unified Device Architecture (CUDA)
Most code is normal C++ code
Code to run on the GPU is organized in kernels
The CPU sets up and manages the computation

__global__ void vector_add(float* a, float* b, float* c) {
    int idx = threadIdx.x;
    c[idx] = a[idx] + b[idx];
}

int main() {
    int N = 512;
    // ...
    vector_add<<<1, N>>>(a_d, b_d, c_d);
    // ...
}

15 Thread/Data Organization
Hierarchical thread organization
– Grid
– Block
– Thread
A block can have a maximum of 512 threads
1D, 2D and 3D mappings possible
(Figure: a 1D grid of blocks and a 2D grid of blocks)
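As an illustrative sketch (not from the slides), this is the common pattern for turning the block and thread coordinates of this hierarchy into a global element index when the data spans several blocks; the kernel name and the bounds check are assumptions:

__global__ void vector_add_blocked(float* a, float* b, float* c, int n) {
    // blockIdx, blockDim and threadIdx are the built-in CUDA coordinates
    // for the grid/block/thread hierarchy described above.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {               // guard the partially filled last block
        c[idx] = a[idx] + b[idx];
    }
}

With the 512-thread limit per block, a larger vector would be launched with several blocks, for example vector_add_blocked<<<(n + 511) / 512, 512>>>(a_d, b_d, c_d, n).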

16 Vector Addition Example
A collection of concurrently processed threads is called a warp
(Figure: the vectors A and B are copied from CPU main memory to GPU global memory, the streaming processors (SP) compute C = A + B using local memory, and C is copied back to main memory)
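A minimal host-side sketch of this data flow, assuming the vector_add kernel from slide 14 (buffer names are illustrative and error handling is omitted):

#include <cuda_runtime.h>

int main() {
    const int N = 512;
    const size_t bytes = N * sizeof(float);

    float *a_h = new float[N], *b_h = new float[N], *c_h = new float[N];
    // ... fill a_h and b_h with input data ...

    float *a_d, *b_d, *c_d;
    cudaMalloc(&a_d, bytes);                              // GPU global memory
    cudaMalloc(&b_d, bytes);
    cudaMalloc(&c_d, bytes);

    cudaMemcpy(a_d, a_h, bytes, cudaMemcpyHostToDevice);  // main memory -> global memory
    cudaMemcpy(b_d, b_h, bytes, cudaMemcpyHostToDevice);

    vector_add<<<1, N>>>(a_d, b_d, c_d);                  // one block of N threads

    cudaMemcpy(c_h, c_d, bytes, cudaMemcpyDeviceToHost);  // global memory -> main memory

    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    delete[] a_h; delete[] b_h; delete[] c_h;
    return 0;
}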

17 Terminology: Warp

18 Vector Addition Profile
Only 11% of GPU time is used to add vectors
The arithmetic intensity of the problem is too low
Overlapping data copy and computation could help
Hardware: NVIDIA NVS 3100M

19 Will GPUs Save the World?
Careful optimization of both CPU and GPU code reduces the performance difference between GPUs and CPUs substantially [Lee et al., ISCA 2010]
GPGPU has provided nice speedups for problems that fit the architecture
Metric challenge: the practitioner needs performance per developer hour

20 GPU ARCHITECTURE

21 NVIDIA Tesla Architecture
Figure reproduced from [Lindholm et al.; 2008]

22 Control Flow
The threads in a warp share the same instruction
Branching is efficient if all threads in a warp branch in the same direction
Divergent branches within a warp cause serial execution of both paths
(Figure: an IF splits a warp into condition-true and condition-false threads, and the two groups execute one after the other)
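An illustrative pair of kernels (not from the slides, assuming 32 threads per warp): the first diverges inside every warp because even and odd threads take different paths, while the second branches on a warp-aligned condition so all threads in a warp take the same path:

__global__ void divergent(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx % 2 == 0)            // even/odd split: both paths execute serially per warp
        data[idx] *= 2.0f;
    else
        data[idx] += 1.0f;
}

__global__ void uniform(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if ((idx / 32) % 2 == 0)     // whole warps take the same path: no divergence
        data[idx] *= 2.0f;
    else
        data[idx] += 1.0f;
}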

23 Modern DRAM Interfaces
Maximize bandwidth with a 3D organization (banks, rows, columns)
Repeated requests to the row buffer are very efficient
(Figure: DRAM banks addressed by row address and column address, with a row buffer holding the open row)

24 Access Coalescing
Global memory accesses from all threads in a half-warp are combined into a single memory transaction
All memory elements in a segment are accessed
The segment size can be halved if only the lower or upper half is used
Assumes Compute Capability 1.2 or higher
(Figure: threads 0 to 7 accessing consecutive addresses 128 to 156 are served by a single transaction)
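A sketch of the difference (kernel names are illustrative): in the first kernel, consecutive threads in a half-warp read consecutive addresses and the reads coalesce into one transaction, while the strided kernel spreads the half-warp across several segments and needs several transactions:

__global__ void copy_coalesced(const float* in, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[idx];              // thread i touches element i: addresses are contiguous
}

__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx * stride < n)
        out[idx] = in[idx * stride];     // neighbouring threads touch addresses far apart
}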

25 Bank Conflicts
Shared memory banks can service requests independently
Bank conflict: more than one thread accesses the same bank concurrently
Strided access patterns can cause bank conflicts
(Figure: threads 0 to 7 accessing banks 0 to 7 with stride two gives a 2-way bank conflict)
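A hypothetical shared-memory example of the stride-two pattern (the buffer size and thread count are assumptions): with 32-bit words, buf[2 * tid] maps two threads to each bank, a 2-way conflict, whereas buf[tid] maps each thread to its own bank:

__global__ void bank_conflict_demo(float* out) {
    __shared__ float buf[64];

    int tid = threadIdx.x;               // launched with one 32-thread block
    buf[tid] = (float)tid;
    buf[tid + 32] = (float)tid;
    __syncthreads();

    float conflicted    = buf[2 * tid];  // stride two: two threads per bank, accesses serialized
    float conflict_free = buf[tid];      // stride one: one thread per bank

    out[tid] = conflicted + conflict_free;
}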

26 NVIDIA Fermi
Next-generation computing chip from NVIDIA
Aims to alleviate important bottlenecks
– Improved double-precision floating-point support
– A cache hierarchy
– Concurrent kernel execution
More problems can be solved efficiently on a GPU
Figure reproduced from [NVIDIA; 2010]

27 ACHIEVING HIGH GPU PERFORMANCE

28 Which Problems Fit the GPU Model?
Fine-grained data parallelism is available
Sufficient arithmetic intensity
Sufficiently regular data access patterns
It is all about organizing data: optimized use of the memory system enables high performance

29 Increase Computational Intensity
Memory types:
– On-chip shared memory: small and fast
– Off-chip global memory: large and slow
Technique: tiling
– Choose the tile size such that a tile fits in shared memory
– Increases locality by reducing the reuse distance
(Figure: A × B = C matrix multiplication, where a tile of C reuses tiles of A and B)
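A minimal sketch of tiled matrix multiplication (assumptions: square N-by-N row-major matrices, N a multiple of the tile size, one thread per element of the output tile):

#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];     // tile of A in fast on-chip shared memory
    __shared__ float Bs[TILE][TILE];     // tile of B

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of the A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                 // wait until the whole tile is in shared memory

        for (int k = 0; k < TILE; ++k)   // each loaded element is reused TILE times
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // wait before the tiles are overwritten
    }
    C[row * N + col] = sum;
}

With this tiling, each element of A and B is fetched from global memory N/TILE times instead of N times as in the naive kernel.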

30 Memory Layout
Exploit coalescing to achieve high bandwidth
Linear access is necessary
Solution: tiling
(Figure: for A × B = C with row-major storage, accesses along a row are contiguous and coalesced, while accesses down a column are strided and not coalesced)

31 Avoid Branching Inside Warps
Assume 2 threads per warp
(Figure: with warps W1 to W4, one assignment of work to threads makes all iterations diverge, while a rearranged assignment leaves only one divergent iteration)

32 Automation
Thread resource usage must be balanced against the number of concurrent threads [Ryoo et al., PPoPP 2008]
– Avoid saturation
– The sweet spot varies between devices
– The sweet spot varies with problem size
Auto-tuning a 3D FFT [Nukada et al.; SC 2009]
– Balances resource consumption against parallelism through the kernel radix and ordering
– The best number of thread blocks is chosen automatically
– Inserts padding to avoid shared memory bank conflicts
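As a hedged illustration of such a tuning sweep (this is not the cited auto-tuners; it times the blocked vector-add kernel sketched earlier with CUDA events at a few block sizes and keeps the fastest):

float time_launch(float* a_d, float* b_d, float* c_d, int n, int block) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    vector_add_blocked<<<(n + block - 1) / block, block>>>(a_d, b_d, c_d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int find_best_block(float* a_d, float* b_d, float* c_d, int n) {
    int best = 32;
    float best_ms = 1e30f;
    for (int block = 32; block <= 512; block *= 2) {   // the sweet spot varies per device
        float ms = time_launch(a_d, b_d, c_d, n, block);
        if (ms < best_ms) { best_ms = ms; best = block; }
    }
    return best;
}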

33 Case Study: Molecular Dynamics Simulation with NAMD
Simulate the interaction of atoms according to the laws of atomic physics and quantum chemistry [Phillips; SC 2009]

34 Key Performance Enablers
Careful division of labor between GPU and CPU
– GPU: short-range non-bonded forces
– CPU: long-range electrostatic forces and coordinate updates
Overlap CPU and GPU execution through asynchronous kernel execution
Use event recording to track progress in asynchronously executing streams [Phillips et al., SC 2008]
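A generic sketch of this overlap pattern with CUDA streams and events (this is not NAMD's actual code; the kernel, buffer names and launch configuration are placeholders):

#include <cuda_runtime.h>

// Placeholder kernel standing in for the short-range non-bonded force computation.
__global__ void short_range_forces(const float* coords, float* forces, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) forces[i] = coords[i];    // dummy body
}

void timestep(const float* coords_h, float* forces_h, float* coords_d,
              float* forces_d, int n, cudaStream_t stream, cudaEvent_t done) {
    size_t bytes = n * sizeof(float);

    // Enqueue copy-in, kernel and copy-out asynchronously in one stream.
    // The host buffers must be pinned (cudaHostAlloc) for the copies to overlap,
    // which is the page-pinning interoperability issue mentioned on slide 36.
    cudaMemcpyAsync(coords_d, coords_h, bytes, cudaMemcpyHostToDevice, stream);
    short_range_forces<<<(n + 255) / 256, 256, 0, stream>>>(coords_d, forces_d, n);
    cudaMemcpyAsync(forces_h, forces_d, bytes, cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(done, stream);       // marks completion of this step's GPU work

    while (cudaEventQuery(done) == cudaErrorNotReady) {
        // The CPU is free to compute long-range electrostatics and coordinate
        // updates here, polling the event before consuming the GPU results.
    }
}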

35 CPU/GPU Cooperation in NAMD [Phillips et al., SC 2008]
(Figure: a timeline in which the GPU computes remote and local forces (f) while the CPU handles remote and local work and coordinate updates (x))

36 Challenges
Completely restructuring legacy software systems is prohibitive
Batch-processing software is unaware of GPUs
Interoperability issues with pinning main memory pages for DMA [Phillips et al., SC 2008]

37 FUTURE TRENDS

38 Accelerator Integration
Industry is moving towards integrating CPUs and GPUs on the same chip
– AMD Fusion [Brookwood; 2010]
– Intel Sandy Bridge (fixed-function GPU)
Are other accelerators appropriate?
– Single-chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPUs? [Chung et al.; MICRO 2010]
AMD Fusion figure reproduced from [Brookwood; 2010]

39 Vector Addition Revisited
The start-up and shut-down data transfers are the main bottleneck
Fusion eliminates these overheads by storing the values in the on-chip cache
Using accelerators becomes feasible for more problems

40 Memory System Scalability
Current CPU bottlenecks:
– The number of pins on a chip grows slowly
– Off-chip bandwidth grows slowly
Integration only helps if there is sufficient on-chip cooperation to avoid a significant increase in bandwidth demand
Conflicting requirements:
– GPU: high bandwidth, not latency sensitive
– CPU: high bandwidth, can be latency sensitive

41 CONCLUSIONS

42 Conclusions
GPUs can offer a significant speedup for problems that fit the model
Tool support and flexible architectures increase the number of problems that fit the model
CPU/GPU on-chip integration can reduce GPU start-up overheads

43 Thank You
Visit our website:

44 References
Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU; Lee et al.; ISCA; 2010
Programming Massively Parallel Processors; Kirk and Hwu; Morgan Kaufmann; 2010
NVIDIA's Next Generation CUDA Compute Architecture: Fermi; White Paper; NVIDIA; 2010
AMD Fusion Family of APUs: Enabling a Superior Immersive PC Experience; White Paper; AMD; 2010
Multi-core Programming with OpenCL: Performance and Portability; Fagerlund; Master's Thesis; NTNU; 2010
Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures; Yuan et al.; MICRO; 2009
Auto-Tuning 3-D FFT Library for CUDA GPUs; Nukada and Matsuoka; SC; 2009
Programming Graphics Processing Units (GPUs); Bakke; Master's Thesis; NTNU; 2009
Adapting a Message-Driven Parallel Application to GPU-Accelerated Clusters; Phillips et al.; SC; 2008
Rise of the Graphics Processor; Blythe; Proceedings of the IEEE; 2008
NVIDIA Tesla: A Unified Graphics and Computing Architecture; Lindholm et al.; IEEE Micro; 2008
Optimization Principles and Application Performance Evaluation of a Multithreaded GPU using CUDA; Ryoo et al.; PPoPP; 2008

45 EXTRA SLIDES

46 Complexity-Effective Memory Access Scheduling
The on-chip interconnect may interleave requests from different thread processors, destroying DRAM row locality
Solution: an order-preserving interconnect arbitration policy combined with in-order scheduling [Yuan et al., MICRO 2009]
Achieves the performance of out-of-order scheduling with the lower complexity of in-order scheduling
(Figure: for an interleaved request queue, in-order scheduling incurs a row switch between almost every request, while out-of-order scheduling groups requests to the same row and switches only once)

