HPMMAP: Lightweight Memory Management for Commodity Operating Systems

Presentation transcript:

HPMMAP: Lightweight Memory Management for Commodity Operating Systems
Brian Kocoloski, Jack Lange
University of Pittsburgh

Lightweight Experience in a Consolidated Environment
- HPC applications need lightweight resource management
  - Tightly synchronized, massively parallel
  - Inconsistency is a huge problem
- Problem: modern HPC environments require commodity OS/R features
  - Cloud computing / consolidation with general-purpose workloads
  - In-situ visualization
- This talk: how can we provide lightweight memory management in a fullweight environment?

Lightweight vs. Commodity Resource Management
- Commodity management has a fundamentally different focus than lightweight management
  - Dynamic, fine-grained resource allocation
  - Resource utilization, fairness, security
  - Degrade applications fairly in response to heavy loads
- Example: memory management
  - Demand paging
  - Serialized, coarse-grained address space operations
- Serious HPC implications
  - Resource efficiency vs. resource isolation
  - System overhead
  - Cannot fully support HPC features (e.g., large pages)

HPMMAP: High Performance Memory Mapping and Allocation Platform
[Architecture diagram: on a single node, a commodity application uses the standard system call interface and the Linux memory manager, while an HPC application uses a modified system call interface backed by HPMMAP and HPMMAP-managed memory]
- Independent and isolated memory management layers
- Linux kernel module: NO kernel modifications
- System call interception: NO application modifications (see the conceptual sketch below)
- Lightweight memory management: NO page faults
- Up to 50% performance improvement for HPC apps
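The talk does not show code for the interception path; the sketch below is a purely conceptual, user-space illustration of the routing idea, with hypothetical stub names. The real HPMMAP performs this dispatch inside a Linux kernel module, without modifying the kernel or the application.

```c
/* Conceptual sketch ONLY: how an intercepted memory system call might be
 * routed between the two managers. All names here are hypothetical stubs;
 * the real HPMMAP does this from a Linux kernel module, with no kernel or
 * application modifications.
 */
#include <stdbool.h>
#include <stdio.h>

/* Stubs standing in for the two memory-management paths. */
static bool hpmmap_registered(int pid) { return pid == 42; }                  /* did this process opt in? */
static long hpmmap_brk(int pid, long new_end) { (void)pid; return new_end; }  /* lightweight path */
static long linux_brk(int pid, long new_end)  { (void)pid; return new_end; }  /* default Linux path */

/* Intercepted brk(): HPC processes are routed to HPMMAP, every other
 * process falls through to the unmodified Linux memory manager. */
static long intercepted_brk(int pid, long new_end) {
    return hpmmap_registered(pid) ? hpmmap_brk(pid, new_end)
                                  : linux_brk(pid, new_end);
}

int main(void) {
    printf("HPC process   -> %ld\n", intercepted_brk(42, 0x1000));  /* HPMMAP path */
    printf("other process -> %ld\n", intercepted_brk(7,  0x1000));  /* Linux path */
    return 0;
}
```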

Talk Roadmap
- Detailed analysis of Linux memory management
  - Focus on the demand paging architecture
  - Issues with prominent large page solutions
- Design and implementation of HPMMAP
  - No kernel or application modification
- Single-node evaluation illustrating HPMMAP performance benefits
- Multi-node evaluation illustrating scalability

Linux Memory Management
- Default Linux: on-demand paging (illustrated in the sketch below)
  - Primary goal: optimize memory utilization
  - Reduce overhead of common behavior (fork/exec)
- Optimized Linux: large pages
  - Transparent Huge Pages
  - HugeTLBfs
  - Both integrated with the demand paging architecture
- Our work: determine the implications of these features for HPC
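To make the demand-paging behavior concrete, here is a minimal user-space sketch (not from the talk; the region size is arbitrary) that counts the minor page faults taken the first time an anonymous mapping is touched. Under default Linux, mapped memory is not backed by physical pages until first touch, and each first touch costs a fault.

```c
/* Minimal sketch: observing Linux demand paging from user space.
 * Anonymous mmap'd memory is not backed by physical pages until first
 * touch; each first touch triggers a minor page fault.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void) {
    const size_t len = 256UL * 1024 * 1024;          /* 256 MB region, illustrative */
    struct rusage before, after;

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    getrusage(RUSAGE_SELF, &before);
    memset(buf, 0, len);                             /* first touch of every page */
    getrusage(RUSAGE_SELF, &after);

    printf("minor faults during first touch: %ld\n",
           after.ru_minflt - before.ru_minflt);      /* roughly len / page size; fewer with THP */

    munmap(buf, len);
    return 0;
}
```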

Transparent Huge Pages
- Transparent Huge Pages (THP)
  (1) Page fault handler uses large pages when possible
  (2) khugepaged address space merging
- khugepaged
  - Background kernel thread
  - Periodically allocates and "merges" a large page into the address space of any process requesting THP support (see the sketch below)
  - Requires the global page table lock
  - Driven by OS heuristics, with no knowledge of the application workload
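For reference, when THP is in "madvise" mode a process requests THP support through madvise(2), as in the minimal sketch below (the region size is arbitrary). Whether the range is actually backed by huge pages is decided by the page fault handler and khugepaged, i.e., exactly the heuristics described above.

```c
/* Minimal sketch: requesting THP backing for a region via madvise().
 * The kernel may satisfy the hint immediately in the page fault handler
 * or later via khugepaged; the application has no direct control.
 */
#define _DEFAULT_SOURCE                              /* exposes MADV_HUGEPAGE on older glibc */
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    const size_t len = 64UL * 1024 * 1024;           /* arbitrary 64 MB region */

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint that this range should be backed by transparent huge pages. */
    if (madvise(buf, len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");            /* e.g., THP disabled system-wide */

    /* ... application would use buf here ... */

    munmap(buf, len);
    return 0;
}
```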

Transparent Huge Pages
- Ran the miniMD benchmark from Mantevo twice:
  - As the only application
  - Co-located with a parallel kernel build
- "Merge": small page faults stalled by a THP merge operation
- Large page overhead increased by nearly 100% with the added load
- Total number of merges increased by 50% with the added load
- Merge overhead increased by over 300% with the added load
- Merge standard deviation increased by nearly 800% with the added load

Transparent Huge Pages
[Plots: page fault cycles (up to 5M) vs. app runtime (s), "No Competition" (349 s) and "With Parallel Kernel Build" (368 s)]
- Large page faults in green; small faults delayed by merges in blue
- Generally periodic, but not synchronized
- Variability increases dramatically under load

HugeTLBfs
- RAM-based filesystem supporting large page allocation
- Requires pre-allocated memory pools reserved by the system administrator
- Access generally managed through libhugetlbfs (see the sketch below)
- Limitations
  - Cannot back process stacks
  - Configuration challenges
  - Highly susceptible to overhead from system load
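For reference, an application (or libhugetlbfs on its behalf) allocates from the pre-reserved pool roughly as in the generic sketch below; this is not the benchmarks' code, and it only succeeds if the administrator has reserved huge pages (e.g., via /proc/sys/vm/nr_hugepages).

```c
/* Minimal sketch: allocating from the pre-reserved huge page pool.
 * Fails with ENOMEM if the administrator has not reserved enough huge
 * pages (e.g., by writing to /proc/sys/vm/nr_hugepages).
 */
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000                          /* fallback for older glibc headers */
#endif

int main(void) {
    const size_t len = 16UL * 1024 * 1024;           /* multiple of the 2 MB huge page size */

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");                 /* pool empty or not configured */
        return 1;
    }

    /* ... application would use buf here; process stacks still use ordinary 4 KB pages ... */

    munmap(buf, len);
    return 0;
}
```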

HugeTLBfs
- Ran the miniMD benchmark from Mantevo twice:
  - As the only application
  - Co-located with a parallel kernel build
- Large page fault performance generally unaffected by the added load
  - Demonstrates the effectiveness of pre-reserved memory pools
- Small page fault overhead increases by nearly 475,000 cycles on average
- Performance considerably more variable
  - Standard deviation roughly 30x higher than the average!

HugeTLBfs
[Plots: page fault cycles vs. app runtime (s) for HPCCG, CoMD, and miniFE; "No Competition" runtimes 51 s, 248 s, 54 s vs. "With Parallel Kernel Build" 60 s, 281 s, 59 s; y-axes up to 10M/3M/3M cycles]
- Overhead of small page faults increases substantially
- Ample memory available via the reserved memory pools, but inaccessible for small faults
- Illustrates configuration challenges

Linux Memory Management: HPC Implications
- Conclusions of the Linux memory management analysis:
  - Memory isolation is insufficient for HPC when the system is under significant load
  - Large page solutions are not fully HPC-compatible
- Demand paging is not an HPC feature
  - Poses problems when adopting HPC features like large pages
  - Both Linux large page solutions are impacted, in different ways
- Solution: HPMMAP

HPMMAP: High Performance Memory Mapping and Allocation Platform
[Architecture diagram repeated: commodity application on the standard system call interface and the Linux memory manager; HPC application on a modified system call interface backed by HPMMAP memory]
- Independent and isolated memory management layers
- Lightweight memory management
  - Large pages are the default memory mapping unit
  - 0 page faults during application execution

Kitten Lightweight Kernel
- Lightweight kernel from Sandia National Labs
  - Mostly Linux-compatible user environment
  - Open source, freely available: https://software.sandia.gov/trac/kitten
- Kitten memory management
  - Moves memory management as close to the application as possible
  - Virtual address regions (heap, stack, etc.) statically sized and mapped at process creation
  - Large pages are the default unit of memory mapping
  - No page fault handling

HPMMAP Overview
- Lightweight versions of memory management system calls (brk, mmap, etc.)
- "On-request" memory management
  - 0 page faults during application execution (see the sketch below)
- Memory offlining
  - Management of large (128 MB+) contiguous regions
  - Utilizes the vast unused address space on 64-bit systems
  - Linux has no knowledge of HPMMAP'd regions
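HPMMAP implements this inside a kernel module; the user-space sketch below only illustrates the "on-request, no page faults" idea under stated assumptions: one large contiguous region is mapped and pre-faulted once up front, and subsequent brk-style requests are satisfied by a simple bump pointer, so the application takes no faults while it runs. The region size and function names are illustrative, not HPMMAP's.

```c
/* Illustrative user-space sketch (NOT the HPMMAP implementation):
 * pre-fault one large contiguous region up front, then satisfy
 * brk/mmap-style requests with a bump pointer so the application
 * takes no page faults while it runs.
 */
#define _DEFAULT_SOURCE                              /* exposes MAP_POPULATE on older glibc */
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

#define REGION_SIZE (128UL * 1024 * 1024)            /* 128 MB, illustrative */

static char  *region;
static size_t region_used;

static int region_init(void) {
    /* MAP_POPULATE pre-faults the whole region at setup time. */
    region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    return region == MAP_FAILED ? -1 : 0;
}

/* Hypothetical brk-style allocation: just advance a bump pointer. */
static void *region_alloc(size_t len) {
    if (region_used + len > REGION_SIZE)
        return NULL;                                 /* out of pre-mapped memory */
    void *p = region + region_used;
    region_used += len;
    return p;
}

int main(void) {
    if (region_init() != 0) { perror("mmap"); return 1; }

    double *a = region_alloc(1024 * sizeof(double)); /* no faults: already populated */
    for (int i = 0; i < 1024; i++) a[i] = i;
    printf("a[1023] = %f\n", a[1023]);
    return 0;
}
```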

Evaluation Methodology
- Consolidated workloads
  - Evaluate HPC performance with co-located commodity workloads (parallel kernel builds)
  - Evaluate THP, HugeTLBfs, and HPMMAP configurations
  - Benchmarks selected from the Mantevo and Sequoia benchmark suites
- Goal: limit hardware contention
  - Apply CPU and memory pinning for each workload where possible (see the sketch below)
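Pinning of this kind can be done from the launcher or inside the benchmark itself; a minimal sketch using the standard Linux CPU affinity API is shown below. The chosen core number is arbitrary, and memory pinning to a socket would additionally use the NUMA policy interfaces or a tool such as numactl at launch time.

```c
/* Minimal sketch: pin the calling process to one CPU core.
 * The core number is arbitrary; memory pinning to a NUMA socket would
 * additionally use the NUMA policy interfaces (or numactl at launch).
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);                                /* pin to core 2 (arbitrary choice) */

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("pinned to core 2\n");
    /* ... benchmark work would run here ... */
    return 0;
}
```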

Single-Node Evaluation
- Benchmarks
  - Mantevo (HPCCG, CoMD, miniMD, miniFE)
  - Run in weak-scaling mode
- AMD Opteron node
  - Two 6-core NUMA sockets
  - 8 GB RAM per socket
- Workloads:
  - Commodity profile A: 1 co-located kernel build
  - Commodity profile B: 2 co-located kernel builds
    - Up to 4 cores over-committed

Single-Node Evaluation: Commodity Profile A
- Average 8-core improvement across applications of 15% over THP, 9% over HugeTLBfs
- THP becomes increasingly variable with scale
[Plots: results for HPCCG, CoMD, miniMD, and miniFE]

Single-Node Evaluation: Commodity Profile B
- Average 8-core improvement across applications of 16% over THP, 36% over HugeTLBfs
- HugeTLBfs degrades significantly in all cases at 8 cores: memory pressure due to the weak-scaling configuration
[Plots: results for HPCCG, CoMD, miniMD, and miniFE]

Multi-Node Scaling Evaluation
- Benchmarks
  - Mantevo (HPCCG, miniFE) and Sequoia (LAMMPS)
  - Run in weak-scaling mode
- Eight-node Sandia test cluster
  - Two 4-core NUMA sockets (Intel Xeon cores)
  - 12 GB RAM per socket
  - Gigabit Ethernet
- Workloads
  - Commodity profile C: 2 co-located kernel builds per node
    - Up to 4 cores over-committed

Multi-Node Evaluation: Commodity Profile C
- 32-rank improvement: HPCCG 11%, miniFE 6%, LAMMPS 4%
- HPMMAP shows very few outliers
- miniFE: impact of single-node variability on scalability (3% improvement on a single node)
- LAMMPS also beginning to show divergence
[Plots: results for HPCCG, miniFE, and LAMMPS]

Future Work
- Memory management is not the only barrier to HPC deployment in consolidated environments
  - Other system software overheads
  - OS noise
- Idea: fully independent system software stacks
  - Lightweight virtualization (Palacios VMM)
  - Lightweight "co-kernel"
    - We've built a system that can launch Kitten on a subset of offlined CPU cores, memory blocks, and PCI devices

Conclusion
- Commodity memory management strategies cannot isolate HPC workloads in consolidated environments
  - Page fault performance illustrates the effects of contention
  - Large page solutions are not fully HPC-compatible
- HPMMAP
  - Independent and isolated lightweight memory manager
  - Requires no kernel or application modification
  - HPC applications using HPMMAP achieve up to 50% better performance

Thank You
Brian Kocoloski
briankoco@cs.pitt.edu
http://people.cs.pitt.edu/~briankoco
Kitten: https://software.sandia.gov/trac/kitten