Reducing OS noise using offload driver on Intel® Xeon Phi™ Processor


Reducing OS noise using offload driver on Intel® Xeon Phi™ Processor
Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Intel Technology Poland
IXPUG Annual Fall Conference 2017

Introduction

Irregularities caused by events like timer interrupts are a recognized problem for highly parallel applications running on many-core processors. The Linux tickless kernel is becoming a standard option on HPC compute nodes, but it does not eliminate OS noise completely.

The idea

The CPUs of an Intel® Xeon Phi™ Processor are divided into host CPUs (visible to the OS) and hidden CPUs initialized by the driver module (not visible to the OS). The memory space is shared between host and hidden CPUs. We use simple scheduling that follows the data-level parallelism (DLP) approach: doing the same work on different data using all available hidden CPUs. The solution does not require any kernel modifications.

Solution Architecture

[Architecture diagram: BIOS and Intel® Xeon Phi™ Processor hardware at the bottom; the Linux kernel with the driver above them; the runtime library and the application on top; 4 exposed CPUs and 64 hidden CPUs; components are marked as modified SW or not modified SW.]

Environment setup

1. Hide a subset of processors by removing the appropriate entries from the MADT ACPI table
2. Put the modified ACPI table into the initrd
3. Boot the Linux kernel with the modified initrd
4. Load the driver
5. Boot the hidden CPUs by specifying their APIC IDs in sysfs

Initialization control flow

1. The application initializes the runtime library.
2. The runtime library reserves the hidden CPUs and uses the driver to request execution of its code there.
3. The driver initializes the hidden CPUs, which start executing the "idle" loop.
4. The hidden CPUs wait for work from the application.

Work: a pointer to a function and a pointer to data.

[Flow diagram spanning host userspace, driver userspace, driver kernelspace, and host kernelspace.]

Offload control flow

1. The application offloads the work to the hidden CPUs. Register values are preserved as they would be in a normal function call.
2. The hidden CPUs get the work and run it. This happens without any interaction with the OS.
3. When the work is done, the hidden CPUs wait (using ring-3 MWAIT) while the application gets the results.

The API supports synchronous and asynchronous calls.

[Flow diagram: 1. the application provides the computation kernel to the runtime (host userspace); 2. the computation kernel is run on the hidden CPUs.]

Runtime API example

    #include "rt.h"

    /* Runs on a hidden CPU */
    void fun(void *param)
    {
        int *d = (int *)param;
        while (--*d)
            ;
    }

    int main(void)
    {
        int loops = 100;
        struct rt_ctx *ctx;
        struct rt_work work = { fun, &loops };

        if ((ctx = rt_init())) {
            /* synchronous offload */
            rt_run(ctx, &work);
            /* loops equals 0 at this point */
            rt_exit(ctx);
        }
        return 0;
    }

Performance - setup

Hardware:
- Intel® Xeon Phi™ Processor 7250, 68 cores @ 1.4 GHz, 16 GB MCDRAM, Flat/Quadrant
- 6 x 32 GB DDR4
- First 4 cores (16 CPUs) visible to the OS; remaining CPUs are hidden

Software:
- RedHat* 7.3 + 4.11 kernel, configured with 7800 MCDRAM hugepages
- Intel® Composer XE 2017.4.056

Benchmarks:
- The STREAM Benchmark used as reference
- Modified STREAM using hugepages and the __assume_aligned(64) compiler hint
- Runtime compiled to run on host CPUs and compiled to run on hidden CPUs
- Common problem size: 1.4 GiB
- All measurements on 4-67 CPUs

Performance - STREAM

Reference experiment: STREAM Benchmark using OpenMP, nohz_full=1-271
Experiment 1: STREAM using OpenMP, nohz_full=1-271 isolcpus=1-271
Experiment 2: Modified STREAM using OpenMP
Experiment 3: Modified STREAM using runtime compiled to run on host
Experiment 4: Modified STREAM using runtime compiled to use hidden CPUs

Measurement     Result [%]
Reference       100
Experiment 1
Experiment 2    106
Experiment 3    107
Experiment 4    107.5

Performance stability

We ran the STREAM Triad kernel and measured the time required to process 1.4 GiB using a single CPU:
- Hidden CPU
- Isolated CPU (nohz_full, isolcpus)
- Reference OS CPU (no modifier)

On the hidden CPU the difference between the fastest and slowest execution was 55 us, while for the isolated and reference CPUs it was 474 us and 45522 us, respectively.

Current Limitations

- The hidden CPUs cannot handle page faults: memory used by the computation kernel must be backed by physical pages before running it, and it cannot change during the run.
- System calls cannot be made from the computation kernel.
- Custom synchronization between the host and the driver is required.

Conclusions and Insights

It works:
- The solution is stable (x86_64 only)
- It performs quite well, but there is still room for further optimization

Room for improvement:
- Improve scalability
- Tackle the limitations from the previous slide
- Benchmark more complex workloads

Thank you

The Team:
Łukasz Odzioba <lukasz.odzioba@intel.com>
Łukasz Daniluk <lukasz.daniluk@intel.com>
Paweł Karczewski <pawel.karczewski@intel.com>
Jarosław Kogut <jaroslaw.kogut@intel.com>
Krzysztof Góreczny <krzysztof.goreczny@intel.com>