1
Intel Xeon Phi Overview: Hardware, Software, Programming (KNC Edition) Karl Fürlinger Ludwig-Maximilians-Universität München DASH
2
Motivation Computational Science and Engineering needs High Performance Parallel Computing (HPC) Image source: ANSYS Advantage. Excellence in Engineering Simulation. Vol II, Issue 4, 2008 Image source: Climate Modeling for Scientists and Engineers, John B. Drake
3
Top 10 of the Top 500 List (June 2016) [Table: top 10 systems; entries using Intel Xeon Phi and Nvidia GPU accelerators are highlighted] Source: www.top500.org
4
Accelerators in the Top 500 List (June 2016) Source: www.top500.org
5
Green 500 List (Nov 2015) All systems in the top 10 are accelerator-based (mostly using GPUs) Source: www.green500.org
6
Xeon Phi Installations (Current) Tianhe-2 (China) –16000 nodes, each with 2 CPUs and 3 Xeon Phis –48000 Xeon Phi accelerators in total –3.1 million cores in total –33.8 PFlop/s Linpack, 17.8 MW Stampede (TACC, Texas) –6400 nodes, each with 1 Xeon Phi –5.1 PFlop/s Linpack, 4.5 MW SuperMIC (LRZ Munich) –32 nodes, each with 2 CPUs and 2 Xeon Phis
7
Xeon Phi Installations (Future) Cori (NERSC-8) / late 2016, LBNL –Over 9300 single-socket nodes –Over 29 PFlop/s performance –Knights Landing architecture (KNL, next-generation Xeon Phi) –Self-hosted architecture (not a co-processor)
8
Outline The Hardware of the Xeon Phi Software environment Programming for the Xeon Phi
9
Xeon Phi Hardware Some application areas offer a lot of data parallelism –Do the same operation for many data items –a.k.a. SIMD (Single Instruction, Multiple Data) Examples: –Dense linear algebra operations –Computer graphics operations Graphics cards/gaming are a big market –Fierce competition and rapid development, especially in the 2000s
10
Xeon Phi - History Intel decided to enter the GPU market in the mid-2000s GPUs need massive parallelism –GPU as a CPU with many x86 cores –Code-named Larrabee –Compared to established GPUs it was not competitive The project was discontinued in favor of a product for the HPC market MIC (Many Integrated Cores) architecture –Knights Ferry: prototype card, not a commercial product –Knights Corner: first commercial product (KNC, the model used in SuperMIC) –Knights Landing: the upcoming next iteration of MIC –Knights Hill: announced at SC14 for 2017/18
11
The Future of the MIC Architecture Knights Landing (KNL) –Next iteration of the MIC architecture –14nm process –Based on Silvermont architecture (Out-of-order Atom processor) –Major improvements and upgrades over KNC –2D mesh interconnect instead of KNC ring interconnect –Will be available as a stand-alone CPU (as well as an accelerator) –Support for AVX-512 (Advanced Vector Extensions)
12
Knights Corner vs. Xeon Phi vs. MIC MIC is the code-name for the range of manycore CPUs –Knights Corner is the code-name for the product –Xeon Phi is the official marketing name KNC comes in… –6 different specifications –3 main lines: 57 / 60 / 61 cores, clocked at 1.1 / 1.053 / 1.238 GHz, with 6 / 8 / 16 GB of main memory –Different TDPs and memory bandwidths –3 different form factors: actively cooled / passively cooled / dense form factor
13
The Xeon Phi in use at LRZ (SuperMIC) This Xeon Phi is the 5110P model –Passively cooled, PCIe form factor –245 Watt Thermal Design Power (TDP) –60 cores / each with 4 hardware threads = 240 threads in total –8 GB of GDDR5 RAM –1.053 GHz clock frequency –320 GB/sec peak memory bandwidth Image source: Intel
14
KNC Naked and Exposed Image source: Intel Xeon Phi Coprocessor x100 Product Family Datasheet.
15
KNC Architecture Bi-directional ring connects all the components –60 cores –PCIe client logic –Memory controllers Ringbus –Data bus: 512 bits in each direction –Address Bus –Control Bus
16
Schematic Architecture Image source: Intel® Xeon Phi™ Coprocessor System Software Developer's Guide
17
Xeon Phi Core Architecture In-order architecture derived from the Pentium, x86-based –64-bit execution environment –4-way hardware hyper-threading No MMX/SSE/AVX –But a 512-bit SIMD instruction set called IMCI (Intel Initial Many Core Instructions) Image source: An Overview of Programming for Intel Xeon processors and Intel Xeon Phi coprocessors
18
SIMD extensions
19
Xeon Phi Vector Processing Unit (VPU) VPU (Vector Processing Unit) –Works with 512-bit registers (= 64 bytes) –32 vector registers per hardware thread –Max throughput: 16 SP or 8 DP FLOPs per cycle –With FMA: 32 SP or 16 DP FLOPs per cycle –(Most) vector instructions have a latency of 4 cycles and 1-cycle throughput (Intel documents) Peak Performance –16 × 60 × 1053 × 10^6 = 1010 × 10^9 FLOP/s ≈ 1 TFLOP/s
20
Cache L1 cache: private, per-core –32 KB instruction + 32 KB data cache –64 Bytes line size –3 cycle access time L2 cache –512 KB private, per-core –Ca. 30 MB L2 cache in total on the chip –Kept coherent across cores –64 Bytes line size –11 cycles access time
21
Hardware Summary PCIe accelerator card / co-processor –60 cores, 8 GB GDDR5 RAM –6 GB/sec peak PCIe bandwidth, 320 GB/sec peak memory bandwidth The Xeon Phi cores –Based on an older Pentium x86 in-order design –Added 64-bit extensions and 4-way hyper-threading –Added a 512-bit Vector Processing Unit (VPU) –512 KB L2 cache per core Peak performance –Around 1 TFlop/sec peak performance –Around 800 GFlop/sec DGEMM
22
Software Environment on the Xeon Phi The card runs a modified embedded Linux –Called Micro OS (uOS) by Intel –The card boots from an image located on the host –The card does not have persistent memory –Provides a TCP/IP stack emulation over PCIe –Card appears as a network device to the host –Busybox is included for a variety of utilities (ls, top, …) [Diagram: Xeon CPU host (N sockets, M cores, Linux) connected via PCIe to the Xeon Phi mic0 (60 cores, 8 GB GDDR5, Linux uOS)]
23
MPSS (Intel Manycore Platform Software Stack) and uOS
24
Using the Xeon Phi Access using ssh: ssh mic0 User accounts are duplicated from the host –SSH keys are required for access
host:$ ssh mic0
mic0:$ tail -n 25 /proc/cpuinfo
processor  : 239
vendor_id  : GenuineIntel
cpu family : 11
model      : 1
model name : 0b/01
stepping   : 3
cpu MHz    : 1052.630
cache size : 512 KB
...
25
Software Summary Xeon Phi runs its own operating system –Linux-based uOS –Appears like a separate host (mic0) accessible via SSH –Busybox used for some (but not all) common utility programs
26
Some Experimental Results Vector Instruction Latency Source: Test-Driving Intel Xeon Phi (Fang et al.)
27
Peak Floating Point Performance Source: Test-Driving Intel Xeon Phi (Fang et al.)
28
Programming for the Xeon Phi Three modes in which the Xeon Phi is used –Offload / Native / Symmetric –More options in a cluster environment Offload –Program starts on the host, offloads computational parts to the Xeon Phi Native –Executable is run directly on the Xeon Phi Symmetric –Application uses both the Xeon Phi and the host [Diagrams: host/Xeon Phi roles in offload, native, and symmetric modes]
29
Supported Programming Models Source: Intel
30
Native Execution No compilers on the Xeon Phi itself! –Use icc on the host to cross-compile for the Phi –Generates a binary suitable for execution (only) on the Phi
icc -mmic -o myapp.mic myapp.c
Transfer the executable to the Xeon Phi and launch the program there, e.g.:
host:$ scp myapp.mic mic0:
host:$ ssh mic0
mic0:$ ./myapp.mic
Or launch from the host via a utility command:
host:$ micnativeloadex ./myapp.mic
31
Programming Models for Native Execution All programming models supported by Intel's tool-chain are also supported on the Xeon Phi –Pthreads, OpenMP –TBB, Cilk Plus, OpenCL, OpenACC –MPI via Intel MPI
icc -mmic -openmp -o myapp.mic myapp.c
host:$ scp myapp.mic mic0:
host:$ ssh mic0
mic0:$ export OMP_NUM_THREADS=240
mic0:$ ./myapp.mic
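For reference, a minimal sketch of a myapp.c that the compile line above could build; the file name and the printed message are illustrative, not from the original slides:

#include <stdio.h>
#include <omp.h>

int main(void)
{
  /* Each team member runs in parallel; the master thread
     reports how many threads (up to 240) were started. */
  #pragma omp parallel
  {
    #pragma omp master
    printf("Running with %d threads\n", omp_get_num_threads());
  }
  return 0;
}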
32
Native Execution Considerations Performance on the Xeon Phi requires –Parallelism (multi-threading) –Vectorization (exploiting the wide vector SIMD) Parallelism –60 cores and 4-way hyper-threading – up to 240 threads Vectorization (wide SIMD) –512 bits = 64 bytes = 16 SP or 8 DP elements –Using vector instructions –64-byte memory alignment
33
Process/Thread Binding For Intel OpenMP, thread binding is controlled via KMP_AFFINITY –compact –scatter –balanced –explicit
export KMP_AFFINITY=balanced
export KMP_AFFINITY='explicit,proclist=[0,1,2,3,4]'
export KMP_AFFINITY='verbose,scatter'
Source: Intel
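To verify where threads actually land, a small probe program (hypothetical, not from the slides) can be run natively under different KMP_AFFINITY settings; sched_getcpu() is a glibc call available under the Phi's Linux uOS:

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void)
{
  /* Print which logical CPU each OpenMP thread was placed on;
     compare the output for compact, scatter, and balanced. */
  #pragma omp parallel
  printf("thread %3d on cpu %3d\n", omp_get_thread_num(), sched_getcpu());
  return 0;
}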
34
Vectorization Compiler vectorization –Enabled with -O2 and higher –Compiler report on vectorization with -vec-report=N, N=2,3,…,7 Vectorization pragmas –Assert no loop-carried dependencies: #pragma ivdep –Ask the compiler to vectorize a loop: #pragma vector –Force the compiler to vectorize a loop: #pragma vector always, #pragma simd
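A sketch of how these pragmas are typically placed; the function and array names are made up for illustration:

void scale(double *a, const double *b, int n)
{
  int i;
  /* ivdep: assert there are no loop-carried dependencies,
     so the compiler is free to vectorize. */
  #pragma ivdep
  for (i = 0; i < n; i++)
    a[i] = 2.0 * b[i];
}

void saxpy(float *y, const float *x, float s, int n)
{
  int i;
  /* vector always: vectorize even if the compiler's cost
     model judges it unprofitable. */
  #pragma vector always
  for (i = 0; i < n; i++)
    y[i] += s * x[i];
}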
35
Vector Intrinsics If compiler vectorization fails… Source: Programming for the Intel Xeon Phi Coprocessor
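As an illustration of this style, here is a minimal sketch of a double-precision vector add written directly with the 512-bit KNC intrinsics from immintrin.h; it assumes n is a multiple of 8 and that all pointers are 64-byte aligned:

#include <immintrin.h>

/* c[i] = a[i] + b[i], processing 8 doubles (512 bits) per iteration. */
void vadd(const double *a, const double *b, double *c, int n)
{
  int i;
  for (i = 0; i < n; i += 8) {
    __m512d va = _mm512_load_pd(&a[i]);   /* aligned 64-byte load */
    __m512d vb = _mm512_load_pd(&b[i]);
    _mm512_store_pd(&c[i], _mm512_add_pd(va, vb));
  }
}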
36
Data Alignment Heap-allocated memory aligned on an n-byte boundary (n=64 for the Xeon Phi):
void* _mm_malloc(size_t size, size_t n)
int posix_memalign(void **p, size_t n, size_t size)
Alignment for variable declarations:
__attribute__((aligned(n))) var_name
__declspec(align(n)) var_name
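Putting the allocation routes side by side, a minimal sketch; the buffer names and sizes are illustrative:

#include <stdlib.h>
#include <immintrin.h>   /* _mm_malloc / _mm_free */

enum { N = 1024, ALIGN = 64 };  /* 64-byte alignment for the 512-bit VPU */

/* Statically declared, 64-byte aligned array (GCC/ICC syntax). */
__attribute__((aligned(ALIGN))) static double a[N];

int main(void)
{
  /* Heap allocation, Intel style ... */
  double *b = _mm_malloc(N * sizeof(double), ALIGN);
  /* ... or portable POSIX style. */
  double *c;
  if (posix_memalign((void **)&c, ALIGN, N * sizeof(double)) != 0)
    return 1;

  /* ... use a, b, c ... */

  _mm_free(b);   /* _mm_malloc memory must be freed with _mm_free */
  free(c);
  return 0;
}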
37
Offloading LEO: Language Extensions for Offloading –pragmas (C/C++) or comments (Fortran), similar to OpenMP
// regular code (executed on host)
#pragma offload target(mic)
{
  // offloaded code
  // (to be executed on the Xeon Phi)
}
// regular code (executed on host)
Offloaded code can be OpenMP code (use the -openmp compiler flag) Specify the target –target(mic:0), target(mic:1) if more than one card
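For heap-allocated data the transfer has to be stated explicitly with in/out/inout clauses and a length; a sketch (the array names a, c and size n are assumptions for illustration):

// a and c point to heap-allocated arrays of n doubles
#pragma offload target(mic:0) in(a : length(n)) out(c : length(n))
{
  /* Runs on the coprocessor; a is copied in, c copied back out. */
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    c[i] = 2.0 * a[i];
}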
38
Offloading Offloading whole functions, allocating variables on the Xeon Phi:
// either
__attribute__((target(mic))) void foo() {... }
// or
__declspec(target(mic)) void foo() { }
Offloading large code blocks / whole files:
#pragma offload_attribute(push, target(mic))
// code block / file
#pragma offload_attribute(pop)
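A function marked this way is compiled for both host and coprocessor and can then be called inside an offload region like any other; a minimal sketch:

__attribute__((target(mic)))
void foo(void) { /* compiled for both host and Xeon Phi */ }

int main(void)
{
  #pragma offload target(mic)
  foo();          /* this call executes on the coprocessor */
  foo();          /* the same function, executed on the host */
  return 0;
}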
39
Environment Variables for Offloading Monitoring offload activity –Env. variable OFFLOAD_REPORT: timing data and other statistics for each offload statement –Env. variable H_TRACE: information on whether code blocks marked for offload are actually being executed on the accelerator
export OFFLOAD_REPORT=1
export H_TRACE=1
export H_TRACE=2
40
Simple OpenMP Offloading Example (1)
#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[])
{
  #pragma offload target(mic)
  {
    #pragma omp parallel
    {
      #pragma omp master
      {
        printf("I'm running with %d threads\n", omp_get_num_threads());
      }
    }
  }
  return 0;
}
41
Simple OpenMP Offloading Example (2) MIC_ENV_PREFIX determines the prefix of environment variables relevant for execution on the Xeon Phi (236 = 59 cores × 4 threads; one core is reserved for the offload system)
host:$ ./myapp
I'm running with 236 threads
host:$ export MIC_ENV_PREFIX=MIC
host:$ export MIC_OMP_NUM_THREADS=120
host:$ ./myapp
I'm running with 120 threads
42
Simple Example with Data Transfer (1)
#include <stdio.h>
#define SIZE 1000

int main(int argc, char *argv[])
{
  int i;
  double A[SIZE], B[SIZE], C[SIZE];

  #pragma offload target(mic)
  {
    #pragma omp parallel for
    for( i=0; i<SIZE; i++ ) {
      C[i] = A[i] * B[i];
    }
  }
  return 0;
}
43
Simple Example with Data Transfer (2)
Output via H_TRACE=1:
HOST: Total pointer data sent to target: [0] bytes
HOST: Total copyin data sent to target: [24004] bytes
HOST: Total pointer data received from target: [0] bytes
MIC0: Total copyin data received from host: [24004] bytes
MIC0: Total copyout data sent to host: [24004] bytes
HOST: Total copyout data received from target: [24004] bytes
Output via OFFLOAD_REPORT=1:
[Offload] [MIC 0] [File]    offload.c
[Offload] [MIC 0] [Line]    11
[Offload] [MIC 0] [Tag]     Tag 0
[Offload] [HOST]  [Tag 0] [CPU Time]  0.437595 (seconds)
[Offload] [MIC 0] [Tag 0] [MIC Time]  0.182540 (seconds)
44
Offload pragmas Source: An Introduction to the Intel Xeon Phi Coprocessor
45
Further Reading White papers / books –Xeon Phi Coprocessor Architecture and Tools (Rezaur Rahman, Intel) –PRACE Best Practice Guide for Intel Xeon Phi (V. Weinberg, Editor) Tutorial slides –Programming the Intel Xeon Phi (Tim Cramer, RWTH Aachen) –Native Computation and Optimization (J. McCalpin, TACC)
46
Thank you for your attention! More information: www.sppexa.de This work is supported by the German Research Foundation (DFG) as part of the priority programme 1648 Software for Exascale Computing.