Martin Kruliš, 17. 12. 2015 (v1.1)



 History
◦ Original idea: a CPU die offers a lot of silicon, so use many x86 cores to build a GPU
◦ 2006 Project Larrabee: the resulting GPU could not keep up with AMD and NVIDIA
◦ 2007 Teraflops Research Chip (96-bit VLIW architecture)
◦ 2009 Single-chip Cloud Computer (48 cores)
◦ 2010 Knights Ferry (32 cores)
◦ 2011 Knights Corner (60 cores based on the Pentium 1)
◦ 2013 Knights Landing (72 cores based on the Atom)
◦ Knights Hill announced

 The Xeon Phi Device
◦ Many simpler cores (Pentium/Atom class)
◦ Each equipped with a powerful 512-bit vector engine

 Software Architecture
◦ The Xeon Phi is essentially an independent Linux machine

 Modes of Xeon Phi Usage
◦ OpenCL device
◦ Standalone computational device
 Connected over TCP/IP (SSH, …)
 Using the low-level symmetric communications interface
◦ MPI device
 Communicating over TCP/IP or OFED
 May be used both ways (MPI ranks on the host or on the device)
◦ Offload device
 Explicit-mode offloading
 Implicit-mode offloading

 Xeon Phi as an OpenCL Accelerator Card
◦ Requires the Intel OpenCL platform
◦ HW-to-OpenCL mapping
 Xeon Phi card = OpenCL compute device
 Virtual core = OpenCL compute unit
◦ Work items in a work group are processed in a loop
 Unrolled 16x and vectorized when possible
◦ The OpenCL runtime manages a thread pool
 One thread per virtual core
 Work groups are assigned to threads as tasks
◦ It is better to run more work groups with fewer work items than on a GPU

 Using Xeon Phi as a Standalone Device
◦ The device is an autonomous Linux machine
◦ The code is cross-compiled on the host and deployed to the Xeon Phi (including libraries)
◦ Complete freedom in parallelization techniques
 OpenMP, Intel TBB, Intel Cilk, pthreads, …
◦ Communication has to be performed manually via
 the symmetric communication interface (SCIF)
 the TCP/IP stack, which uses SCIF as its data-link layer
◦ Useful for extending master-worker applications

 Symmetric Communications Interface (SCIF)
◦ A socket-like interface that encapsulates PCI Express data transfers
 Message passing and RMA transfers
◦ Memory-mapping techniques
 Device memory may be mapped into the host address space
 Host memory and the memory of other devices can be mapped into the address space of a device
 The upper 512 GB (32 pages of 16 GB)
◦ Supports the direct-assignment virtualization model
◦ All other communication methods are built on top of SCIF

 Offload Execution Model
◦ Host and device code are written together
◦ The parts to be offloaded to the device are explicitly marked
 The compiler performs the dual compilation, inserts the stubs, handles the data transfers, …
◦ Explicit offload model (a.k.a. pragma offload)
 Everything is controlled by the programmer
 Only binary-safe data structures can be transferred
◦ Implicit offload model (a.k.a. shared VM model)
 Data transfers are handled automatically
 Complex data structures and pointers may be transferred

 Code Compilation
◦ [Diagram: a single source file is compiled twice, once for the host and once for the MIC; regions marked with #pragma offload or _Cilk_offload are compiled for the device and represented in the host binary by stubs that launch them]

 Pragma Offload
◦ Functions and variables are declared with __attribute__((target(mic)))
◦ Offloaded code is invoked with #pragma offload
 A clause may select the target card: target(mic[:id])
 Other clauses list the data structures used for the offload:
  in(varlist) – copied to the device before the offload
  out(varlist) – copied back to the host after the offload
  inout, nocopy, length, align
 Allocation control: alloc_if(), free_if()

 Example

    __attribute__((target(mic))) void preprocess(…) { … }
    ...
    __attribute__((target(mic))) static float *X;  // of N items

    #pragma offload target(mic:0) \
        in(X:length(N) alloc_if(1) free_if(0))
    preprocess(X);

    #pragma offload target(mic:0) \
        nocopy(X:length(N) alloc_if(0) free_if(0))
    process(X);

    #pragma offload target(mic:0) \
        out(X:length(N) alloc_if(0) free_if(1))
    finalize(X);

 Asynchronous Operations
◦ Execution

    char sigVar;
    #pragma offload target(mic) signal(&sigVar)
    long_lasting_func();
    ... concurrent CPU work ...

◦ Data transfers

    #pragma offload_transfer clauses signal(&sigVar)

◦ Waiting, polling

    #pragma offload_wait wait(&sigVar)
    if (_Offload_signaled(micID, &sigVar)) ...

 Shared Virtual Memory Mode
◦ Shared code and variables are declared with the _Cilk_shared attribute (e.g., float _Cilk_shared *X)
◦ Shared memory must be allocated via specialized functions
 _Offload_shared_malloc, _Offload_shared_free
◦ Offloaded code is executed with _Cilk_offload or _Cilk_offload_to(micId)
◦ The data are transferred as the compiler sees fit

 A Few More Things
◦ Compile-time macros for MIC detection: __INTEL_OFFLOAD, __MIC__
◦ Runtime MIC detection
 _Offload_number_of_devices()
 _Offload_get_device_number()
 Querying device capabilities is more complicated
  You can read /proc/cpuinfo on the device
  Or use the special MicAccessAPI
◦ Asynchronous data transfers and execution
 Using signal variables for synchronization
