
HOMME Trace Analysis
Fabrice Mizero
Mentor: Dr. John Dennis
Collaborators: Prof. Malathi Veeraraghavan (University of Virginia), Prof. Robert D. Russell (University of New Hampshire), Qian Liu (University of New Hampshire)
Aug 1, 2014

Roadmap
- Motivation
- Background
- Methodology
- Results
- Conclusion and Solutions
- Future Work

Big Picture
Understanding the causes of poor performance of CESM on Yellowstone: a 5-step approach
1. Experimental execution and data collection
2. HOMME trace analysis
3. IBMgtSim: routing study
4. Network simulation
5. Integrated simulation

[Figure: 3-level topology with 2-hop, 4-hop, and 6-hop paths. Credit: Dr. John Dennis, Zhengyang Liu]

Suspected Causes
“…OS noise, shape of the allocated partition, and interference from other jobs.” (Abhinav Bhatele et al., SC13)
Network Congestion
- Head of Line Blocking
- Credit-Based Flow Control
OS Jitter
- Kernel interrupts
- Application interference
  - Self-interference
  - Interference with others (Neighborhood Effect)
- Competition against OS daemons, timer interrupts, buffer-cache synchronization, etc.

Congestion: Head of Line Blocking (HOL)
Worst-case scenario: congestion spreading due to HOL
[Figure: hosts H1 to H7 and switches S1 and S2, with labels "Victim Flow", "Out of Buffer Space!!" (twice), and "Stuck!!!"]
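The blocking behavior named on this slide can be illustrated with a toy model. The sketch below is not taken from the presentation: the function `drain_fifo`, the `free_buffers` mapping, and the choice of H4 as the congested destination and H5 as the victim's destination are all assumptions made for illustration. It shows only the core idea, that a packet at the head of a FIFO with nowhere to go also holds back packets behind it whose destinations have free buffer space.

```python
from collections import deque


def drain_fifo(queue, free_buffers):
    """Forward packets from the head of one FIFO input queue.

    queue:        destinations of queued packets, head first (hypothetical labels).
    free_buffers: free buffer slots per destination (hypothetical values).
    Returns (delivered, still_queued).
    """
    pending = deque(queue)
    free = dict(free_buffers)
    delivered = []
    # Only the head packet may leave; if its destination has no buffer
    # space, everything behind it waits, whatever its destination.
    while pending and free.get(pending[0], 0) > 0:
        dest = pending.popleft()
        free[dest] -= 1
        delivered.append(dest)
    return delivered, list(pending)


# H4 is out of buffer space, so the H5 packets behind the H4 packet are stuck.
delivered, stuck = drain_fifo(["H5", "H4", "H5", "H5"], {"H4": 0, "H5": 8})
print(delivered, stuck)
```

In this run the two trailing H5 packets play the role of the victim flow: their destination has room, but they cannot pass the blocked H4 packet.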

OS Jitter
- Each compute node runs its own OS (RHEL)
- Interference caused by OS routines: timer interrupts, OS daemons, hardware interrupts
- Competition for CPU resources. Example: Line Printer Daemon

3 Questions
1. How does congestion impact network latency?
2. How important is OS jitter to network latency?
3. Which has a bigger impact on message latency: OS jitter or congestion?

Experimental Set-Up
Congestion:
- 2 platforms
  - Jellystone: non-production machine
  - Yellowstone: production machine
- Different message sizes and hop distances
OS Jitter:
- Linux Transparent Huge Pages (THP)

Methodology
Extrae trace collection -> clock skew correction -> two samples, each selected by (hop, size) -> Wilcoxon rank sum test

Extrae
- Tracing tool developed at BSC
- Chronological event, state, and communication records
- One-way communication delays
- Visualization with Paraver
[Figure: MPI-Isend timeline labeled "Start", "Time", "End"]

Clock Skew
Host A: $C_a(t_1)$; Host B: $C_b(t_2)$
- Ideally: $C_{AB} = C_b(t_2) - C_a(t_1)$
- In reality:
  - Offset $= C_a(t) - C_b(t) \neq 0$
  - Skew $= C_a'(t) - C_b'(t) \neq 0$
Correction (same size, same hop count, host-pair level):
- Min delay: best approximation of offset
- Corrected delay: $C_{AB}(t) - \min(C_{AB}(t)) + \text{minpingpong}$
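The correction formula on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout, the function name `correct_delays`, and the `min_pingpong` lookup are hypothetical, and the sketch removes only a constant offset per group (it does not model drift within a trace).

```python
from collections import defaultdict


def correct_delays(records, min_pingpong):
    """Apply C_AB(t) - min(C_AB(t)) + minpingpong within each group.

    records:      iterable of (host_pair, size, hops, measured_delay);
                  this layout is an assumption made for the example.
    min_pingpong: dict mapping (host_pair, size, hops) to the minimum
                  ping-pong delay used as the baseline (hypothetical input).
    Returns a dict mapping each group key to its corrected delays.
    """
    groups = defaultdict(list)
    for host_pair, size, hops, delay in records:
        groups[(host_pair, size, hops)].append(delay)

    corrected = {}
    for key, delays in groups.items():
        floor = min(delays)  # min delay: best approximation of the offset
        corrected[key] = [d - floor + min_pingpong[key] for d in delays]
    return corrected


key = (("A", "B"), 488, 4)
records = [(("A", "B"), 488, 4, 105.0),
           (("A", "B"), 488, 4, 103.0),
           (("A", "B"), 488, 4, 110.0)]
print(correct_delays(records, {key: 2.0})[key])
```

Grouping by host pair, message size, and hop count follows the slide's "same size, same hop count, host-pair level" note, so each group's minimum is taken over comparable messages.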

Statistical Methods
Wilcoxon Rank Sum Test:
- Non-parametric significance test
- Compares the distributions of two independent populations using ranks
Tests:
- OS Jitter?
  - Jellystone: no THP <=> with THP
- Congestion?
  - Yellowstone: 0-Hop delays <=> 4-Hop delays
  - Jellystone: THP <=> Yellowstone: THP
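For readers unfamiliar with the test, a rank sum test can be sketched with the standard library alone. The version below is an illustrative assumption, not the analysis code behind these slides: it uses the large-sample normal approximation with average ranks for ties, and omits the tie and continuity corrections that statistical packages usually apply. The names `average_ranks` and `rank_sum_test` are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Wilcoxon rank sum test of x against y (normal approximation).

    Returns p-values for three alternatives: two-sided, x tends to be
    smaller than y ("less"), and x tends to be larger than y ("greater").
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2          # expected rank sum under H0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return {
        "two_sided": min(1.0, 2 * min(cdf, 1 - cdf)),
        "less": cdf,
        "greater": 1 - cdf,
    }


fast = [float(v) for v in range(50)]         # made-up delays, clearly smaller
slow = [float(v) for v in range(100, 150)]   # made-up delays, clearly larger
print(rank_sum_test(fast, slow))
```

Reporting all three alternatives makes the direction of a difference visible: a small "less" p-value together with a "greater" p-value near 1 indicates that the first sample tends to have the smaller delays.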

Perfquery
- Perfquery: IB performance counters query tool
- PortXmitWait: port congestion monitoring
- Credit-based flow control
[Figure: flow chart between Host A and a TOR switch with a "Credits?" decision (Yes/No branches) and PortXmitWait]
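The relationship between credits and a wait counter can be illustrated with a toy link model. This sketch is an assumption-laden simplification and does not reproduce InfiniBand hardware behavior or perfquery output: the function `simulate_link`, its parameters, and the one-packet-per-tick timing are hypothetical. The `xmit_wait` variable only mimics the idea behind PortXmitWait, counting ticks in which data was ready but no credit was available.

```python
def simulate_link(ticks, buffer_slots, drain_every):
    """Toy credit-based flow control for a sender that always has data.

    ticks:        number of simulated time steps (hypothetical unit).
    buffer_slots: receiver buffer size; the sender starts with one credit per slot.
    drain_every:  the receiver frees one slot on ticks divisible by this value.
    Returns (packets_sent, xmit_wait).
    """
    credits = buffer_slots
    queued = 0              # packets currently held in the receiver buffer
    sent = xmit_wait = 0
    for tick in range(ticks):
        # Receiver drains a packet and returns a credit to the sender.
        if queued > 0 and tick % drain_every == 0:
            queued -= 1
            credits += 1
        if credits > 0:
            credits -= 1
            queued += 1
            sent += 1
        else:
            xmit_wait += 1  # data ready, no credit: the port waits
    return sent, xmit_wait


print(simulate_link(10, 2, 1))  # receiver keeps up with the sender
print(simulate_link(10, 2, 3))  # slow receiver: the sender runs out of credits
```

A growing wait count with a fixed offered load is the signal of interest here: it indicates that the downstream buffer, not the sender, is limiting throughput.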

Results: How important is OS jitter to network latency?
Jellystone::0-Hop::NoTHP vs. Jellystone::0-Hop::THP
Intranode communication delays are longer with THP enabled than without THP.
Msg size   Sample size      p-Value              Interpretation
488B       54624::45727     <0.001, <0.001, 1    NoTHP is faster than with THP
1952B      9503::7950
2440B      102120::85468
2928B      47504::39764

Results: Which has a bigger impact on message latency: OS jitter or congestion?
Comparing Yellowstone 0-Hop delays with 4-Hop delays
For all considered message sizes, intranode communication delays can exceed internode delays.
Msg size   Sample size      p-Values             Interpretation
488B       54325::23621     <0.001, <0.001, 1    4-Hop is faster than 0-Hop
2440B      101581::16529
2928B      47243::21259
4880B      49603::4720

Conclusion
- OS jitter can cause performance degradation or variability.
- Inter-job interference can lead to application performance variability.
Solutions
- Congestion: dynamic allocation of Virtual Lanes to redirect victim flows around congested ports
- OS jitter: Linux tickless kernel; MPI-3 for better control over shared-memory communications

Future Work
- Further study of the Dynamic Virtual Lanes assignment solution
- Plan and collect new HOMME traces with PortXmitWait monitored and LSF logs saved
- Study intra-job interference
- More efficient algorithm for correcting clock skew

Thank You
Fabrice Mizero
fm9ab@virginia.edu