Slide 1
New Techniques for Programming GPU Clusters
Yifeng Chen, School of EECS, Peking University, China
Slide 2
Two Conflicting Approaches to Programmability in HPC

Top-down approach:
- The core programming model is high-level (e.g. a functional parallel language).
- It must rely on heavy heuristic runtime optimization.
- Low-level program constructs are added to improve low-level control.
- Risks: programmers tend to avoid using the "extra" constructs, and the low-level controls do not fit well into the core model.

Bottom-up approach (PARRAY, PPoPP'12):
- The core programming model exposes the memory hierarchy.
- Same algorithm, same performance, same intellectual challenge, but shorter code.
Slide 3
GPU Clusters
- Tianhe: 1 GPU / 2 CPUs
- Tsubame: 3 GPUs / 2 CPUs
- Mole-8.5: 6 GPUs / 2 CPUs
- PKU McClus: 2 GPUs / 1 CPU
Slide 4
Motivating Examples for PARRAY
Slide 5
Basic Notation
- Dimension tree
- Type
- Reference
Slide 7
Thread Arrays
Slide 8
Generating CUDA+Pthread

```c
#parray {pthd [2]} P
#parray {paged float [2][[2048][4096]]} H
#parray {dmem float # H_1} D
#parray {[#P][#D]} G

float* host;
_pa_pthd* p;

#mainhost {
    #create P(p)
    #create H(host)
    #detour P(p) {
        float* dev;
        INIT_GPU($tid$);
        #create D(dev)
        #insert DataTransfer(dev, G, host, H){}
    }
    #destroy H(host)
    #destroy P(p)
}
```

The generated Pthread code is built from pthread_create, sem_post, sem_wait, and pthread_join.
Slide 9
Generating MPI or IB/verbs

```c
#parray { mpi [2] } M
#parray { paged float [2][[2048][4096]] } H
#parray { [#M][#H_1] } G

float* host;
_pa_mpi* m;

#mainhosts {
    #create M(m)
    #create H(host)
    #detour M(m) {
        float* dev;
        #create H_1(dev)
        #insert DataTransfer(dev, G, host, H){}
    }
    #destroy H(host)
    #destroy M(m)
}
```

The generated MPI code uses MPI_Scatter.
Slide 10
Other Communication Patterns: ALLTOALL, BCAST
Slide 11
Generating Code for IB/verbs and the YH Communication Layer
- Semi-bypassing the MPI layer
- Patching the InfiniBand layer
- Discontiguous RDMA communication pattern achieving zero-copy
Slide 12
Large-Scale FFT in 20 Lines
- Deeply optimized algorithm (ICS 2010)
- Zero-copy for hmem
Slide 14
(Before Nov 2011)
Slide 15
Direct Simulation of Turbulent Flows

Scale:
- Up to 14336³ grid points, single precision
- 12 distributed arrays, each with 11 TB of data (128 TB total)
- Entire Tianhe-1A with 7168 nodes

Progress:
- 4096³ completed; 8192³ half-way; 14336³ tested for performance

Software technologies:
- PARRAY code of only 300 lines
- Programming-level resilience technology for stable computation

Conclusion: GPU-accelerated large simulation on the entire Tianhe-1A is feasible.
Slide 17
Generated Code
Slide 18
Discussions

Other programming models?
- MPI (more expressive datatypes)
- OpenACC (optimization for coalescing accesses)
- PGAS (generating PGAS library calls)
- IB/verbs (directly generating zero-copy IB calls)
- We need a software stack!

- Irregular structures must be encoded into arrays and can then benefit from PARRAY.
- A runtime workflow is possible above PARRAY.
- PARRAY generates Pthread + CUDA + MPI (future support for FPGA and MIC is possible) + macros.
- Macros are compiled out: no performance loss.
- Typical training = 3 days, friendly to engineers.