Presentation on theme: "GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing"— Presentation transcript:

1 GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing
Tetsuya Odajima, Taisuke Boku, Toshihiro Hanawa, Jinpil Lee, Mitsuhisa Sato
Graduate School of Systems and Information Engineering / Center for Computational Sciences, University of Tsukuba, Japan

2 Outline
Background & Purpose
Base language: XcalableMP & XcalableMP-dev
StarPU
Implementation of XMP-dev/StarPU
Preliminary performance evaluation
Conclusion & Future work

3 Background
GPGPU is widely used for HPC
– Impact of NVIDIA CUDA, OpenCL, etc.
– Programming became easy on a single node
– -> Many GPU clusters appear on the TOP500 list
Problems of programming on GPU clusters
– Inter-node programming model (such as MPI)
– Data management between CPU and GPU
– -> Programmability and productivity are very low; programs become complex
GPU is very powerful, but...
– CPU performance has also been improving
– We cannot neglect its performance

4 Purpose
High-productivity programming on GPU clusters
Utilization of both GPU and CPU resources
– Work sharing of the loop execution
XcalableMP acceleration device extension (XMP-dev) [Nakao, et al., PGAS10] [Lee, et al., HeteroPar11]
– Parallel programming model for GPU clusters
– Directive-based, low programming cost
StarPU [INRIA Bordeaux]
– Runtime system for GPU/CPU work sharing

5 Related Works
PGI Fortran and HMPP
– Source-to-source compilers for a single node with GPUs
– Not designed for GPU/CPU co-working
StarPU research
– StarPU: a unified platform for task scheduling on heterogeneous multicore architectures [Augonnet, 2010]
– Faster, Cheaper, Better - a Hybridization Methodology to Develop Linear Algebra Software for GPUs [Emmanuel, 2010]
– Higher performance than GPU only
– GPU/CPU work sharing is effective

6 XcalableMP (XMP)
A PGAS language designed for distributed memory systems
– Directive-based; easy to understand
  array distribution, inter-node communication, loop work sharing on CPUs, etc.
– Low programming cost
  little change from the original sequential program
XMP is developed by a working group of many Japanese universities and HPC companies

7 XcalableMP acceleration device extension (XMP-dev)
Developed by the HPCS Laboratory, University of Tsukuba
XMP-dev is an extension of XMP for accelerator-equipped clusters
– Additional directives (to manage accelerators)
  Mapping data onto accelerator device memory
  Data transfer between CPU and accelerator
  Work sharing of loops on the accelerator device (e.g., GPU cores)

8 Example of XMP-dev
int x[16], y[16];
#pragma xmp nodes p(2)
#pragma xmp template t(0:16-1)
#pragma xmp distribute t(BLOCK) onto p
#pragma xmp align [i] with t(i) :: x, y

int main() {
  ...
  #pragma xmp device replicate (x, y)
  {
    #pragma xmp device replicate_sync in (x, y)
    #pragma xmp device loop on t(i)
    for (i = 0; i < 16; i++)
      x[i] += y[i] * 10;
    #pragma xmp device replicate_sync out (x)
  }
}
[Figure: arrays x and y distributed over node1 and node2, each with HOST and DEVICE memory]

9–13 Example of XMP-dev (step by step)
The same code as slide 8, with figures on node1 and node2 illustrating each step:
– device replicate: x and y are allocated on the device memory of each node
– device replicate_sync in: data transfer HOST -> DEVICE
– device loop: the distributed loop is executed on each device
– device replicate_sync out: data transfer DEVICE -> HOST
– end of the replicate region: the data on the devices is freed

14 StarPU
Developed by INRIA Bordeaux, France
StarPU is a runtime system that
– allocates and dispatches resources
– schedules the task execution dynamically
All the target data is recorded and managed in a data pool shared by all computation resources.
– This guarantees the coherence of data among multiple task executions.

15 Example of StarPU
double x[N];
starpu_data_handle x_h;
starpu_codelet cl = { .where = STARPU_CPU | STARPU_CUDA,
                      .cpu_func = c_func, .cuda_func = g_func, .nbuffers = 1 };

starpu_vector_data_register(&x_h, 0, x, ...);
struct starpu_data_filter f = { .filter_func = starpu_block_filter_func_vector,
                                .nchildren = NSLICEX, ... };
starpu_data_partition(x_h, &f);

for (i = 0; i < NSLICEX; i++) {
  struct starpu_task *task = starpu_task_create();
  task->cl = &cl;
  task->buffers[0].handle = starpu_data_get_sub_data(x_h, 1, i);
  ...
}

starpu_data_unpartition(x_h, 0);
starpu_data_unregister(x_h);
[Figure: the vector x registered under the handle x_h]

16–20 Example of StarPU (step by step)
The same code as slide 15, with figures illustrating each step:
– starpu_vector_data_register: the vector x is registered under the handle x_h
– starpu_data_partition: x_h is split into NSLICEX sub-data blocks
– task creation and submission: one task per block; the scheduler dispatches tasks to CPU cores (CPU0, ...) and GPUs (GPU0, ...)
– starpu_data_unpartition: the blocks are gathered back into x_h
– starpu_data_unregister: x_h is released and the results are available in x

21 Implementation of XMP-dev/StarPU
XMP-dev
– inter-node communication
– data distribution
StarPU
– data transfer between GPU and CPU
– GPU/CPU work sharing on a single node
We combine XMP-dev and StarPU to enhance functionality and performance
– GPU/CPU work sharing on multi-node GPU clusters

22 Task Model of XMP-dev/StarPU
[Figure: node1 and node2, each with CPU cores and a GPU]

23 Task Model of XMP-dev/StarPU
XMP-dev/StarPU divides the global array across the nodes.
[Figure: the global array split between node1 and node2]

24 Task Model of XMP-dev/StarPU
Within each node, StarPU divides the local array and allocates the resulting tasks to the devices (CPU cores and GPU).
[Figure: each node's local array split into tasks dispatched to its CPU cores and GPU]
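To make this two-level decomposition concrete, here is a minimal sketch in plain C. It is an assumption about how the index ranges could be computed, not output of the actual XMP-dev/StarPU compiler; submit_task is a hypothetical helper standing in for StarPU task creation and submission, and the sizes are assumed to divide evenly.

/* Two-level decomposition sketch:
 *  1) XMP-dev distributes the global array of n elements block-wise over the nodes;
 *  2) StarPU splits each node's local block into tc tasks, and its scheduler
 *     dispatches the tasks to CPU cores or the GPU at run time.             */
extern void submit_task(int lo, int hi);   /* hypothetical: create + submit one StarPU task */

void decompose(int n, int nodes, int node_rank, int tc)
{
    int local_n  = n / nodes;              /* BLOCK distribution done by XMP-dev   */
    int local_lo = node_rank * local_n;    /* first global index owned by this node */
    int chunk    = local_n / tc;           /* chunk size = local_n / tc             */

    for (int t = 0; t < tc; t++) {
        int lo = local_lo + t * chunk;
        int hi = lo + chunk;
        submit_task(lo, hi);               /* one StarPU task per chunk */
    }
}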

25 Preliminary performance evaluation
N-body (double precision)
– Uses hand-compiled code
– XMP-dev/StarPU (GPU/CPU) vs. XMP-dev (GPU)
– # of particles: 16K or 32K
Node specification
  CPU: AMD Opteron (8 cores x 2 sockets)
  Memory: DDR3 16GB
  GPU: NVIDIA Tesla C2050
  GPU Memory: GDDR5 3GB (ECC ON)
  Interconnection: Gigabit Ethernet
  # of nodes: 4

26 Preliminary performance evaluation
Assumptions and conditions
– StarPU allocates one CPU core to each GPU for control and data management
– So 15 CPU cores (16 - 1) contribute to computation on the CPU side
Term definitions
– Number of tasks (tc): the number of partitions the data array is split into (each partition is mapped to one task)
– Chunk size (N / tc): the number of elements per task
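As a small worked example of these terms (the values of N and tc below are illustrative only, chosen to match the 32K particle case, not taken from the evaluation):

/* Illustration of the definitions above; N and tc are hypothetical values. */
int N     = 32 * 1024;   /* problem size: 32K particles                        */
int tc    = 128;         /* number of tasks (partitions of the data array)     */
int chunk = N / tc;      /* chunk size: 32768 / 128 = 256 elements per task    */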

27 Relative performance of XMP-dev/StarPU to XMP-dev
[Chart: relative performance of XMP-dev/StarPU to XMP-dev]
XMP-dev/StarPU shows very low performance, around 35%, compared with XMP-dev.
The number of tasks and the chunk size affect performance in the hybrid environment.
-> There are several patterns.

28 The relation between number of tasks and chunk size

                      Chunk size: Large            Chunk size: Small
Tasks: Enough         Good performance             Low GPU performance
Tasks: Not enough     CPU cores are a bottleneck   Bad performance

[Figure: with a good balance, CPU and GPU finish together; with a bad balance, one side idles]
If both the number of tasks and the chunk size are large enough, we can achieve good performance with the previous implementation.
In either other case, with not enough tasks or a too-small chunk size, performance drops.
-> It is difficult to keep a good balance between them with a fixed chunk size.

29 The relation between number of tasks and chunk size
For a fixed problem size, the chunk size and the freedom of task allocation trade off against each other.
[Figure: as the chunk size grows, the freedom of task allocation shrinks, and vice versa]
We have to balance the chunk size against the freedom of task allocation.
-> An aggressive solution is to provide different task sizes to the CPU cores and the GPU.
-> To find the task size for each device, calculate the ratio from the GPU and CPU performance.

30 Sustained Performance of GPU and CPU
[Charts: CPU-only and GPU-only sustained performance]
CPU performance improves up to the case of (32, 64); the number of tasks is 128.
The GPU is assigned a single task; with more than 2 tasks, the scheduling overhead is large.

31 Optimized hybrid work sharing
CPU Weight: the fraction of partitions allocated to the CPU cores (the CPU cores handle N * CPU Weight elements; the GPU handles the rest).
[Table: GPU performance [MFLOPS], CPU performance [MFLOPS], and the derived CPU Weight for the 16k and 32k cases]
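The slides derive the CPU Weight from the sustained GPU and CPU performance. A minimal sketch of that calculation, assuming the weight is simply the CPU's share of the combined performance (the exact formula is not given in this transcript), could look like this:

/* Sketch of the CPU-weight calculation (assumed formula: the CPU weight is the
 * CPU's share of the combined sustained performance).                          */
double cpu_weight(double cpu_mflops, double gpu_mflops)
{
    return cpu_mflops / (cpu_mflops + gpu_mflops);
}

/* Example of splitting N array elements, with hypothetical performance inputs:
 *   double w  = cpu_weight(cpu_mflops, gpu_mflops);
 *   int n_cpu = (int)(N * w);   // partitions handled by the 15 CPU cores
 *   int n_gpu = N - n_cpu;      // partitions handled by the GPU
 */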

32 Improved Relative Performance of XMP-dev/StarPU vs XMP-dev
[Chart: relative performance for 16k and 32k particles]
With 32k particles, we achieve more than 1.4 times the performance of XMP-dev (GPU only).
– The load balance between CPU and GPU is good in these cases.
With 16k particles, we achieve only about 1.05 times the performance.
– Performance is unstable; we have to analyze this factor.

33 Conclusion & Future work We proposed a programming language and run-time system named XMP-dev/StarPU. – GPU/CPU hybrid work sharing In preliminary performance evaluation, we can achieve 1.4 times better than XMP-dev Completion of XMP-dev/StarPU compiler implementation Analyze dynamic task size management for GPU/CPU load balancing Examination of various applications on larger parallel systems 9/10/2012/P2S

34 Thank you for your attention. Please ask your questions slowly.

