The Use of GPUs for High-Performance Computing


1 The Use of GPUs for High-Performance Computing
Trial Lecture: The Use of GPUs for High-Performance Computing, 12 October 2010, Magnus Jahre

2 Graphics Processors (GPUs)
Modern computers are graphics intensive, and advanced 3D graphics require a significant amount of computation. Solution: add a Graphics Processor (GPU). (Figure: graphics card, source: nvidia.com)

3 High-Performance Computing (HPC)
Efficient use of computers for computationally intensive problems in science or engineering. (Figure: applications placed along two axes, processing demand and communication demand; computational computer architecture, molecular dynamics simulation, weather forecasting and climate modeling have high demands, office applications low demands.) A third dimension is main memory capacity: some programs are parallelized mainly to get access to a large aggregate memory. This lecture concerns General-Purpose Programming on GPUs (GPGPU).

4 Outline
GPU Evolution
GPU Programming
GPU Architecture
Achieving High GPU Performance
Future Trends
Conclusions

5 GPU Evolution

6 First GPUs: Fixed Hardware
(Figure: fixed-function graphics pipeline. Vertex data and texture maps feed vertex processing, rasterization, fragment processing and framebuffer operations, which read and write the depth and color buffers.) [Blythe 2008]

7 Programmable Shaders
Motivation: more flexible graphics processing. (Figure: the same pipeline, but the vertex processing and fragment processing stages are now programmable.)

8 GPGPU with Programmable Shaders
Use a graphics library to gain access to the GPU, and encode data as color values. The effect of the remaining fixed-function pipeline stages must be accounted for.

9 Functional Unit Utilization
(Figure: the fixed pipeline dedicates separate hardware units to vertex processing and fragment processing.)

10 Functional Unit Utilization
(Figure: utilization of vertex and fragment processing units. A vertex-intensive shader leaves the fragment units underused, a fragment-intensive shader leaves the vertex units underused, while a unified shader can keep all units busy with either kind of work.)

11 Unified Shader Architecture
Exploits parallelism, both data parallelism and task parallelism, through data-parallel processing (SIMD/SIMT), and hides memory latencies behind high bandwidth. (Figure: a thread scheduler feeding an array of streaming processors (SPs) with per-SP memories, connected through an interconnect to on-chip memory or cache and to off-chip DRAM.) This architecture naturally supports GPGPU.

12 GPU Programming

13 GPGPU Tool Support
(Figure: number of GPU papers at the Supercomputing conference per year, 2000-2010, with the programmable shader and unified shader eras marked, and GPGPU tools such as Sh, GPU++, Accelerator, PeakStream, CUDA and OpenCL placed along the timeline.)

14 Compute Unified Device Architecture (CUDA)
Most code is normal C++ code. Code to run on the GPU is organized in kernels, while the CPU sets up and manages the computation.

__global__ void vector_add(float* a, float* b, float* c) {
    int idx = threadIdx.x;
    c[idx] = a[idx] + b[idx];
}

int main() {
    int N = 512;
    // ...
    vector_add<<<1,N>>>(a_d, b_d, c_d);
}
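
The slide only hints at the host-side setup ("// ..."). As a hedged sketch, and not the deck's actual code, the device buffers a_d, b_d and c_d would typically be prepared, and the result copied back, roughly like this:

// Assumed host-side setup for the vector_add kernel above (sketch only).
#include <cuda_runtime.h>

int main() {
    const int N = 512;
    const size_t bytes = N * sizeof(float);

    float a_h[512], b_h[512], c_h[512];           // host copies of the vectors
    for (int i = 0; i < N; ++i) { a_h[i] = (float)i; b_h[i] = 2.0f * i; }

    float *a_d, *b_d, *c_d;                       // device (GPU) buffers
    cudaMalloc((void**)&a_d, bytes);
    cudaMalloc((void**)&b_d, bytes);
    cudaMalloc((void**)&c_d, bytes);

    cudaMemcpy(a_d, a_h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, bytes, cudaMemcpyHostToDevice);

    vector_add<<<1, N>>>(a_d, b_d, c_d);          // one block of N threads

    cudaMemcpy(c_h, c_d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    return 0;
}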

15 Thread/Data Organization
Threads are organized hierarchically: a grid consists of blocks, and a block consists of threads. A block can have a maximum of 512 threads, and 1D, 2D and 3D mappings are possible. (Figure: a 1D grid with Block (0) and Block (1), and a 2D grid with Block (0,0) through Block (1,2).)
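
To make the hierarchy concrete, here is a small assumed example (not from the slides) of how a kernel maps the grid, block and thread indices onto data, in 1D and in 2D:

// Hypothetical kernels illustrating index computation from the thread hierarchy.
__global__ void scale_1d(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global 1D index
    if (i < n) x[i] *= s;
}

__global__ void scale_2d(float* m, float s, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        m[row * width + col] *= s;                   // row-major 2D layout
}

// Example launch: 16x16 = 256 threads per block, within the 512-thread limit.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// scale_2d<<<grid, block>>>(m_d, 2.0f, width, height);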

16 Vector Addition Example
(Figure: the input vectors A and B are copied from CPU main memory to GPU global memory, each streaming processor (SP) adds a pair of elements, and the result vector C is copied back.) A collection of concurrently processed threads is called a warp.

17 Terminology: Warp

18 Vector Addition Profile
Only 11% of GPU time is used to add vectors; the arithmetic intensity of the problem is too low. Overlapping data copy and computation could help. (Hardware: NVIDIA NVS 3100M.)
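
One way such overlap could be implemented (a sketch under the assumption of pinned host buffers and evenly divisible work, not code from the deck) is to split the vectors across CUDA streams:

#include <cuda_runtime.h>

// Kernel variant with a global index so it can be launched per chunk (assumed).
__global__ void vector_add_chunk(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Overlap host<->device copies with kernel execution. The host buffers must be
// pinned (cudaMallocHost) for the asynchronous copies to actually overlap.
void overlapped_add(const float* a_h, const float* b_h, float* c_h,
                    float* a_d, float* b_d, float* c_d, int n) {
    const int kStreams = 4;
    const int chunk = n / kStreams;                  // assume n divisible by kStreams
    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < kStreams; ++s) {
        int off = s * chunk;
        size_t bytes = chunk * sizeof(float);
        cudaMemcpyAsync(a_d + off, a_h + off, bytes, cudaMemcpyHostToDevice, streams[s]);
        cudaMemcpyAsync(b_d + off, b_h + off, bytes, cudaMemcpyHostToDevice, streams[s]);
        vector_add_chunk<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(
            a_d + off, b_d + off, c_d + off, chunk);
        cudaMemcpyAsync(c_h + off, c_d + off, bytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}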

19 Will GPUs Save the World?
Careful optimization of both CPU and GPU code reduces the performance difference between GPUs and CPUs substantially [Lee et al., ISCA 2010]. GPGPU has provided nice speedups for problems that fit the architecture. The metric challenge: what the practitioner ultimately needs is performance per developer hour.

20 GPU Architecture

21 NVIDIA Tesla Architecture
Figure reproduced from [Lindholm et al.; 2008]

22 Control Flow
The threads in a warp share the same instruction. Branching is efficient if all threads in a warp branch in the same direction; divergent branches within a warp cause serial execution of both paths. (Figure: the threads for which the condition is true and the threads for which it is false are executed in separate passes.)
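
An assumed illustration (not from the deck): branching on thread parity splits every warp, while branching on the warp index keeps each warp on a single path.

// Divergent: within each warp, half the threads take each path, so both
// paths are executed serially for every warp.
__global__ void divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] = x[i] * 2.0f;
    else            x[i] = x[i] + 1.0f;
}

// Divergence free: all 32 threads of a warp evaluate the condition the same
// way, so each warp executes only one of the two paths.
__global__ void uniform_per_warp(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0) x[i] = x[i] * 2.0f;
    else                   x[i] = x[i] + 1.0f;
}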

23 Modern DRAM Interfaces
DRAM maximizes bandwidth with a 3D organization of banks, rows and columns: a row address selects a row within a bank into the row buffer, and a column address selects data within that row. Repeated requests to the row buffer are very efficient.

24 Access Coalescing
Global memory accesses from all threads in a half-warp are combined into a single memory transaction. All memory elements in a segment are accessed; the segment size can be halved if only the lower or upper half is used. (Figure: threads 0-7 accessing consecutive addresses 112-156 are grouped into memory transactions. Assumes compute capability 1.2 or higher.)
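
A small assumed example of the difference (not from the slides): consecutive threads touching consecutive addresses coalesce into few transactions, while a strided pattern spreads the same half-warp over many memory segments.

__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];              // thread k reads element k
}

__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];              // neighbouring threads are 'stride' elements apart
}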

25 Bank Conflicts
Memory banks can service requests independently. A bank conflict occurs when more than one thread accesses the same bank concurrently, and strided access patterns can cause such conflicts. (Figure: threads 0-7 accessing banks 0-7 with stride two, giving a 2-way bank conflict.)
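
A sketch of the stride-two case (assumed example; 16 shared memory banks, as on this hardware generation):

__global__ void shared_stride_two(float* out) {
    __shared__ float buf[64];
    int t = threadIdx.x;                 // assume blockDim.x == 32
    buf[t]      = (float)t;              // unit stride: one bank per thread, no conflict
    buf[t + 32] = (float)(t + 32);
    __syncthreads();

    // Stride-two read: threads t and t+8 of a half-warp hit the same bank
    // ((2*t) % 16 == (2*(t+8)) % 16), so the access is serialized into two passes.
    out[t] = buf[2 * t];
}

// A common fix for 2D tiles is to pad the leading dimension so consecutive rows
// start in different banks, e.g. __shared__ float tile[16][17]; instead of [16][16].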

26 NVIDIA Fermi
Next-generation computing chip from NVIDIA that aims to alleviate important bottlenecks: improved double-precision floating-point support, a cache hierarchy, and concurrent kernel execution. More problems can be solved efficiently on a GPU. (Figure reproduced from [NVIDIA; 2010].)

27 Achieving High GPU Performance

28 Which problems fit the GPU model?
Problems with fine-grained data parallelism, sufficient arithmetic intensity, and sufficiently regular data access patterns. It is all about organizing data: optimized use of the memory system enables high performance.

29 Increase Computational Intensity
Example: matrix multiplication, A × B = C. The GPU has two memory types: on-chip shared memory (small and fast) and off-chip global memory (large and slow). Technique: tiling. Choose the tile size such that a tile fits in shared memory; this increases locality by reducing the reuse distance, since the elements of a tile are reused for many output elements. A sketch of such a tiled kernel follows below.
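
A minimal sketch of a tiled kernel (assumed code with TILE = 16, which requires the matrix dimension n to be a multiple of TILE): each thread block stages one tile of A and one tile of B in shared memory, so every global memory element is read once per tile instead of once per multiply-add.

#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];    // on-chip staging areas for the current tiles
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                              // tiles fully loaded

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                              // done with these tiles
    }
    C[row * n + col] = acc;
}

// Launch (assumed): dim3 block(TILE, TILE); dim3 grid(n / TILE, n / TILE);
// matmul_tiled<<<grid, block>>>(A_d, B_d, C_d, n);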

30 Memory Layout
For A × B = C, coalescing must be exploited to achieve high bandwidth, and coalescing requires linear access to consecutive addresses. Assuming row-major storage, accesses that walk along a row are linear in memory while accesses that walk down a column are strided, so only the former coalesce. Tiling solves this as well.
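
A small assumed example of the effect: with row-major storage, a kernel in which neighbouring threads handle neighbouring columns coalesces, while one in which they handle separate rows does not.

// Coalesced: on every iteration, the threads of a half-warp read 16 consecutive floats.
__global__ void sum_cols(const float* M, float* colsum, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= n) return;
    float s = 0.0f;
    for (int k = 0; k < n; ++k) s += M[k * n + col];  // neighbouring threads, neighbouring addresses
    colsum[col] = s;
}

// Not coalesced: neighbouring threads read addresses that are n floats apart.
__global__ void sum_rows(const float* M, float* rowsum, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    float s = 0.0f;
    for (int k = 0; k < n; ++k) s += M[row * n + k];
    rowsum[row] = s;
}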

31 Avoid Branching Inside Warps
(Figure: a parallel reduction assuming, for illustration, 2 threads per warp (W1-W4). With interleaved pairing every warp contains both an active and an inactive thread, so all iterations diverge; with the active threads packed together, only one iteration diverges.)
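
The same idea in code (an assumed sketch of the two reduction variants the figure contrasts, with 256 threads per block):

// Interleaved addressing: the active threads are spread across all warps,
// so every iteration has divergent warps.
__global__ void reduce_interleaved(const float* in, float* out) {
    __shared__ float s[256];
    int t = threadIdx.x;
    s[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (t % (2 * stride) == 0) s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = s[0];
}

// Sequential addressing: the active threads are packed into the lowest warps,
// so only the last few iterations touch a partially active warp.
__global__ void reduce_sequential(const float* in, float* out) {
    __shared__ float s[256];
    int t = threadIdx.x;
    s[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride) s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = s[0];
}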

32 Automation
Thread resource usage must be balanced against the number of concurrent threads [Ryoo et al., PPoPP 2008]: avoid saturation, and note that the sweet spot varies between devices and with problem sizes. The auto-tuned 3D FFT of [Nukada et al.; SC 2009] balances resource consumption against parallelism through the kernel radix and ordering, chooses the best number of thread blocks automatically, and inserts padding to avoid shared memory bank conflicts.

33 Case Study: Molecular Dynamics Simulation with NAMD
NAMD simulates the interaction of atoms according to the laws of atomic physics and quantum chemistry [Phillips; SC2009].

34 Key Performance Enablers
Careful division of labor between GPU and CPU: the GPU computes short-range non-bonded forces, while the CPU handles long-range electrostatic forces and coordinate updates. CPU and GPU execution is overlapped through asynchronous kernel execution, and event recording is used to track progress in the asynchronously executing streams. [Phillips et al., SC2008]
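
A rough sketch of the pattern (assumed code with hypothetical placeholders, not NAMD's actual implementation): the kernel runs asynchronously in a stream, an event is recorded behind it, and the CPU does its own share of the work while tracking the event.

#include <cuda_runtime.h>

// Hypothetical stand-ins for the real NAMD work:
__global__ void short_range_forces(float* forces, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) forces[i] = 0.0f;                 // placeholder for the non-bonded force kernel
}
void long_range_forces_on_cpu() { /* placeholder for the CPU's share */ }

void timestep(float* forces_d, float* forces_h, int n) {
    cudaStream_t stream;
    cudaEvent_t gpu_done;
    cudaStreamCreate(&stream);
    cudaEventCreate(&gpu_done);

    // Launch the GPU work asynchronously and record an event after it.
    short_range_forces<<<(n + 255) / 256, 256, 0, stream>>>(forces_d, n);
    cudaEventRecord(gpu_done, stream);

    // The CPU computes its part while the GPU runs; the event lets the host
    // track progress without blocking.
    long_range_forces_on_cpu();
    while (cudaEventQuery(gpu_done) == cudaErrorNotReady) {
        // more CPU work could go here instead of spinning
    }

    cudaMemcpyAsync(forces_h, forces_d, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaEventDestroy(gpu_done);
    cudaStreamDestroy(stream);
}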

35 CPU/GPU Cooperation in NAMD
(Figure: timeline of one timestep showing how remote and local force computation (f) and coordinate exchange (x) are interleaved between the GPU and the CPU, followed by the coordinate update on the CPU.) [Phillips et al., SC2008]

36 Challenges
Completely restructuring legacy software systems is prohibitive, batch processing software is unaware of GPUs, and there are interoperability issues with pinning main memory pages for DMA. [Phillips et al., SC2008]

37 Future Trends

38 Accelerator Integration
Industry is moving towards integrating CPUs and GPUs on the same chip: AMD Fusion [Brookwood; 2010] and Intel Sandy Bridge (with a fixed-function GPU). Are other accelerators appropriate? Single-chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPUs? [Chung et al.; MICRO 2010] (Figure: AMD Fusion, reproduced from [Brookwood; 2010].)

39 Vector Addition Revisited
The start-up and shut-down data transfers are the main bottleneck. Fusion eliminates these overheads by storing the values in the on-chip cache, so using accelerators becomes feasible for more problems.

40 Memory System Scalability
Current CPU bottlenecks: the number of pins on a chip grows slowly, so off-chip bandwidth grows slowly. Integration only helps if there is sufficient on-chip cooperation to avoid a significant increase in bandwidth demand. The requirements also conflict: the GPU needs high bandwidth but is not latency sensitive, while the CPU needs high bandwidth and can be latency sensitive.

41 Conclusions

42 Conclusions
GPUs can offer a significant speedup for problems that fit the model. Tool support and flexible architectures increase the number of problems that fit the model. CPU/GPU on-chip integration can reduce GPU start-up overheads.

43 Thank You Visit our website:

44 References
Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU; Lee et al.; ISCA; 2010
Programming Massively Parallel Processors; Kirk and Hwu; Morgan Kaufmann; 2010
NVIDIA's Next Generation CUDA Compute Architecture: Fermi; White Paper; NVIDIA; 2010
AMD Fusion Family of APUs: Enabling a Superior Immersive PC Experience; White Paper; AMD; 2010
Multi-core Programming with OpenCL: Performance and Portability; Fagerlund; Master Thesis; NTNU; 2010
Complexity Effective Memory Access Scheduling for Many-Core Accelerator Architectures; Yuan et al.; MICRO; 2009
Auto-Tuning 3-D FFT Library for CUDA GPUs; Nukada and Matsuoka; SC; 2009
Programming Graphics Processing Units (GPUs); Bakke; Master Thesis; NTNU; 2009
Adapting a Message-Driven Parallel Application to GPU-Accelerated Clusters; Phillips et al.; SC; 2008
Rise of the Graphics Processor; Blythe; Proceedings of the IEEE; 2008
NVIDIA Tesla: A Unified Graphics and Computing Architecture; Lindholm et al.; IEEE Micro; 2008
Optimization Principles and Application Performance Evaluation of a Multithreaded GPU using CUDA; Ryoo et al.; PPoPP; 2008

45 Extra Slides

46 Complexity-Effective Memory Access Scheduling
The on-chip interconnect may interleave requests from different thread processors, which destroys row locality. Solution: an order-preserving interconnect arbitration policy combined with in-order scheduling at the memory controller. This achieves the performance of out-of-order scheduling with less complex in-order scheduling. (Figure: for an interleaved request queue, out-of-order scheduling groups requests to the same DRAM row and needs only one row switch, while plain in-order scheduling switches rows for almost every request.) [Yuan et al.; MICRO 2009]

