by Martin Kruliš (v1.1), 17. 12. 2015



2 History
◦ Original idea: there is a lot of silicon on a CPU die
  ▪ Use many x86 cores to create a GPU
◦ 2006 Project Larrabee
  ▪ The GPU could not keep up with AMD and NVIDIA
◦ 2007 Teraflops Research Chip (96-bit VLIW arch)
◦ 2009 Single-chip Cloud Computer (48 cores)
◦ 2010 Knights Ferry (32 cores)
◦ 2011 Knights Corner (60 cores based on Pentium 1)
◦ 2013 Knights Landing (72 cores based on Atom)
◦ Knights Hill announced

3 The Xeon Phi Device
◦ Many simpler (Pentium/Atom) cores
◦ Each equipped with a powerful 512-bit vector engine

4 Software Architecture
◦ Xeon Phi is basically an independent Linux machine

5 Modes of Xeon Phi Usage
◦ OpenCL device
◦ Standalone computational device
  ▪ Connected over TCP/IP (SSH, …)
  ▪ Using the low-level Symmetric Communications Interface
◦ MPI device
  ▪ Communicating over TCP/IP or OFED
  ▪ May be used in both ways (ranked from host or device)
◦ Offload device
  ▪ Explicit mode offloading
  ▪ Implicit mode offloading

6 Xeon Phi as OpenCL Accelerator Card
◦ Requires the Intel OpenCL platform
◦ HW to OpenCL mapping
  ▪ Xeon Phi card = OCL compute device
  ▪ Virtual core = OCL compute unit
◦ Work items in a work group are processed in a loop
  ▪ Unrolled 16x and vectorized when possible
◦ OCL manages a thread pool
  ▪ One thread per virtual core
  ▪ Work groups are assigned to threads as tasks
◦ Better to run more work groups with fewer work items than on a GPU
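The "work items are processed in a loop" idea can be pictured in plain C. This is a conceptual sketch only, not the Intel OpenCL runtime: `kernel_fn`, `run_work_group`, and `scale_kernel` are hypothetical names introduced for illustration, and the unrolling and vectorization are left to the C compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel signature: one call processes one work item. */
typedef void (*kernel_fn)(size_t global_id, void *args);

/* Runs one work group the way the slide describes it: a single thread
   walks over all work items of the group in a loop. A real runtime would
   unroll and vectorize this loop; here that is left to the compiler. */
void run_work_group(kernel_fn kernel, size_t group_id, size_t local_size,
                    void *args)
{
    assert(kernel != NULL);
    for (size_t local_id = 0; local_id < local_size; ++local_id) {
        kernel(group_id * local_size + local_id, args);
    }
}

/* Example kernel (hypothetical): doubles one element of a float array. */
void scale_kernel(size_t global_id, void *args)
{
    float *data = (float *)args;
    data[global_id] *= 2.0f;
}
```

Because each work group costs one task on one thread, many small groups keep all virtual cores busy, which is the reason behind the last bullet of the slide.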

7 Using Xeon Phi as Standalone Device
◦ The device is an autonomous Linux machine
◦ The code is cross-compiled on the host and deployed on the Xeon Phi (including libraries)
◦ Complete freedom in parallelization techniques
  ▪ OpenMP, Intel TBB, Intel Cilk, pthreads, …
◦ Communication has to be performed manually by
  ▪ Symmetric Communications Interface (SCIF)
  ▪ TCP/IP stack, which uses SCIF as data link layer
◦ Useful for extending master-worker applications
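A native (standalone) program is ordinary C code; only the build and deployment differ. The sketch below assumes OpenMP as the parallelization technique, and `parallel_sum` is a hypothetical name. For Knights Corner it would typically be cross-compiled on the host with the Intel compiler's `-mmic` option plus its OpenMP option, then copied to the card (for example over SSH) together with the libraries it needs.

```c
#include <assert.h>
#include <stddef.h>

/* Sums an array in parallel. With OpenMP enabled the iterations are split
   among the threads of the device; without it the pragma is ignored and
   the loop simply runs serially. */
double parallel_sum(const double *values, size_t count)
{
    assert(values != NULL || count == 0);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)count; ++i) {
        sum += values[i];
    }
    return sum;
}
```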

8 Symmetric Communications Interface
◦ Socket-like interface that encapsulates PCI-Express data transfers
  ▪ Message passing and RMA transfers
◦ Memory mapping techniques
  ▪ Device memory may be mapped to host address space
  ▪ Host memory and memory of other devices can be mapped to address space of a device
  ▪ Upper 512G (32x 16G pages)
◦ Supports direct assignment virtualization model
◦ All other communication methods are built on SCIF

9 Offload Execution Model
◦ Both host and device code are written together
◦ Offload parts for the device are explicitly marked
  ▪ The compiler performs the dual compilation, inserts the stubs, handles the data transfers, …
◦ Explicit offload model (a.k.a. Pragma offload)
  ▪ Everything is controlled by the programmer
  ▪ Only binary-safe data structures can be transferred
◦ Implicit offload model (a.k.a. Shared VM model)
  ▪ Data transfers are handled automatically
  ▪ Complex data structures and pointers may be transferred

10 Code Compilation
[Figure: a single source with #pragma offload / _Cilk_offload regions goes through compilation into host code (with a stub) and MIC code]

11 Pragma Offload
◦ Functions and variables are declared with __attribute__((target(mic)))
◦ Offloaded code is invoked as #pragma offload
  ▪ A clause may select the target card: target(mic[:id])
  ▪ Or list data structures which are used for the offload
    - in(varlist): copied to the device before the offload
    - out(varlist): copied back to host after the offload
    - inout, nocopy, length, align
  ▪ Allocation control: alloc_if(), free_if()
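The clauses listed here can be combined in a small self-contained function. This is a minimal sketch, assuming card 0 and a hypothetical helper name `scale_on_mic`; it uses `inout` with `length` so that the array travels to the device and back in a single offload. Compilers without offload support ignore the pragma, in which case the block runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Multiplies every element of `data` by `factor`.
   With an offload-capable Intel compiler the block runs on card 0;
   other compilers ignore the pragma and run it on the host. */
void scale_on_mic(float *data, int count, float factor)
{
    assert(data != NULL && count >= 0);

    /* inout: copy `count` floats to the device before the offload and
       back to the host afterwards. The scalars `count` and `factor`
       need no clause of their own (assumption: default scalar copy). */
    #pragma offload target(mic:0) inout(data : length(count))
    {
        for (int i = 0; i < count; ++i) {
            data[i] *= factor;
        }
    }
}
```

No function is called inside the offloaded block, so nothing here needs the `__attribute__((target(mic)))` declaration; code that calls its own functions on the device would.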

12 Example
    __attribute__((target(mic))) void preprocess(…) { … }
    ...
    __attribute__((target(mic))) static float *X; // of N items

    #pragma offload target(mic:0) \
        in(X:length(N) alloc_if(1) free_if(0))
    preprocess(X);

    #pragma offload target(mic:0) \
        nocopy(X:length(N) alloc_if(0) free_if(0))
    process(X);

    #pragma offload target(mic:0) \
        out(X:length(N) alloc_if(0) free_if(1))
    finalize(X);

13 Asynchronous Operations
◦ Execution
    char sigVar;
    #pragma offload target(mic) signal(&sigVar)
    long_lasting_func();
    ... concurrent CPU work ...
◦ Data transfers
    #pragma offload_transfer clauses signal(&sigVar)
◦ Waiting, polling
    #pragma offload_wait wait(&sigVar)
    if (_Offload_signaled(micID, &sigVar)) ...
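The fragments above can be put together into one function that overlaps device and host work. This is a sketch under stated assumptions: `overlap_offload` is a hypothetical helper, card 0 is hard-coded, and the `target` clause is repeated on the wait pragma. Without an offload-capable compiler the pragmas are ignored and both loops run one after the other on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Squares `n` values on the device while the host sums `m` other values.
   Returns the host-side sum; `device_data` is updated in place. */
float overlap_offload(float *device_data, int n,
                      const float *host_data, int m)
{
    assert(device_data != NULL && host_data != NULL);

    char sig_var = 0;   /* only its address matters; it tags the offload */
    (void)sig_var;      /* avoids an unused warning if pragmas are ignored */

    /* Start the offload and return immediately. */
    #pragma offload target(mic:0) signal(&sig_var) \
        inout(device_data : length(n))
    {
        for (int i = 0; i < n; ++i) {
            device_data[i] *= device_data[i];
        }
    }

    /* Concurrent CPU work. */
    float host_sum = 0.0f;
    for (int i = 0; i < m; ++i) {
        host_sum += host_data[i];
    }

    /* Block until the offload tagged with sig_var has finished. */
    #pragma offload_wait target(mic:0) wait(&sig_var)

    return host_sum;
}
```

The host must not read `device_data` between the offload and the wait, since the results are only guaranteed to be back after the wait completes.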

14 Shared Virtual Memory Mode
◦ Code and variables that are shared have the attribute _Cilk_shared (e.g., float _Cilk_shared *X)
◦ Shared memory allocation must be performed via specialized functions
  ▪ _Offload_shared_malloc, _Offload_shared_free
◦ Offloaded code gets executed by _Cilk_offload or _Cilk_offload_to(micId)
◦ The data are transferred as the compiler sees fit

15 A Few More Things
◦ Compile-time macros for MIC detection
  ▪ __INTEL_OFFLOAD, __MIC__
◦ Runtime MIC detection
  ▪ _Offload_number_of_devices()
  ▪ _Offload_get_device_number()
  ▪ Querying device capabilities is more complicated
    - You can read /proc/cpuinfo on the device
    - Or use the special MicAccessAPI
◦ Asynchronous data transfers and execution
  ▪ Using signal variables for synchronization
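The detection macros and the runtime query can be wrapped so that the same source also builds without offload support. A minimal sketch: the helper names `mic_device_count`, `compiled_for_mic`, and `print_mic_summary` are hypothetical, and the `offload.h` header is assumed to be the one shipped with the Intel compiler.

```c
#include <assert.h>
#include <stdio.h>

#ifdef __INTEL_OFFLOAD
#include <offload.h>   /* assumed to declare _Offload_number_of_devices() */
#endif

/* Number of cards visible to the offload runtime;
   0 when the code was built without offload support. */
int mic_device_count(void)
{
#ifdef __INTEL_OFFLOAD
    return _Offload_number_of_devices();
#else
    return 0;
#endif
}

/* 1 when this translation unit was compiled for the coprocessor,
   0 when it was compiled for the host. */
int compiled_for_mic(void)
{
#ifdef __MIC__
    return 1;
#else
    return 0;
#endif
}

/* Prints a one-line summary to the given stream. */
void print_mic_summary(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "devices: %d, compiled for MIC: %d\n",
            mic_device_count(), compiled_for_mic());
}
```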
