Slide 1
New Techniques for Programming GPU Clusters
Yifeng Chen, School of EECS, Peking University, China
Slide 2
Two Conflicting Approaches to Programmability in HPC

Top-down approach:
- The core programming model is high-level (e.g. a functional parallel language).
- It must rely on heavy heuristic runtime optimization.
- Low-level program constructs are added to improve low-level control.
- Risks: programmers tend to avoid using the "extra" constructs, and the low-level controls do not fit well into the core model.

Bottom-up approach (PARRAY, PPoPP'12):
- The core programming model exposes the memory hierarchy.
- Same algorithm, same performance, same intellectual challenge, but shorter code.
Slide 3
GPU Clusters
- Tianhe: 1 GPU / 2 CPUs
- Tsubame: 3 GPUs / 2 CPUs
- Mole-8.5: 6 GPUs / 2 CPUs
- PKU McClus: 2 GPUs / 1 CPU
Slide 4
Motivating Examples for PARRAY
Slide 5
Basic Notation
- Dimension tree
- Type
- Reference
Slide 7
Thread Arrays
Slide 8
Generating CUDA+Pthread

```c
#parray {pthd [2]} P
#parray {paged float [2][[2048][4096]]} H
#parray {dmem float # H_1} D
#parray {[#P][#D]} G

float* host;
_pa_pthd* p;

#mainhost {
    #create P(p)
    #create H(host)
    #detour P(p) {
        float* dev;
        INIT_GPU($tid$);
        #create D(dev)
        #insert DataTransfer(dev, G, host, H){}
    }
    #destroy H(host)
    #destroy P(p)
}
```

The generated Pthread code is built from pthread_create, sem_post, sem_wait, and pthread_join.
Slide 9
Generating MPI or IB/verbs

```c
#parray { mpi [2] } M
#parray { paged float [2][[2048][4096]] } H
#parray { [#M][#H_1] } G

float* host;
_pa_mpi* m;

#mainhosts {
    #create M(m)
    #create H(host)
    #detour M(m) {
        float* dev;
        #create H_1(dev)
        #insert DataTransfer(dev, G, host, H){}
    }
    #destroy H(host)
    #destroy M(m)
}
```

The generated MPI code uses MPI_Scatter.
Slide 10
Other Communication Patterns: ALLTOALL, BCAST
Slide 11
Generating Code for IB/verbs and the YH Communication Layer
- Semi-bypassing the MPI layer
- Patching the InfiniBand layer
- Discontiguous RDMA communication pattern achieving zero-copy
Slide 12
Large-Scale FFT in 20 Lines
- Deeply optimized algorithm (ICS 2010)
- Zero-copy for hmem
Slide 14
(Before Nov 2011)
Slide 15
Direct Simulation of Turbulent Flows

Scale:
- Up to 14336³ grid points, single precision
- 12 distributed arrays, each with 11 TB of data (128 TB total)
- Entire Tianhe-1A with 7168 nodes

Progress:
- 4096³ completed; 8192³ half-way; 14336³ tested for performance

Software technologies:
- PARRAY code of only 300 lines
- Programming-level resilience technology for stable computation

Conclusion: GPU-accelerated large simulation on the entire Tianhe-1A is feasible.
Slide 17
Generated Code
Slide 18
Discussions

Other programming models?
- MPI (more expressive datatypes)
- OpenACC (optimization for coalescing accesses)
- PGAS (generating PGAS library calls)
- IB/verbs (directly generating zero-copy IB calls)
- We need a software stack!

- Irregular structures must be encoded into arrays and can then benefit from PARRAY.
- A runtime workflow is possible above PARRAY.
- PARRAY generates Pthread + CUDA + MPI (future support for FPGA and MIC is possible) + macros.
- Macros are compiled out: no performance loss.
- Typical training = 3 days, friendly to engineers.