A record and replay mechanism using programmable network interface cards Laurent Lefèvre INRIA / LIP (UMR CNRS, INRIA, ENS, UCB)


A record and replay mechanism using programmable network interface cards. Laurent Lefèvre, INRIA / LIP (UMR CNRS, INRIA, ENS, UCB). Dieter Kranzlmüller, GUP - Joh. Kepler Univ. Linz. PDCN Innsbruck - Feb. This research is partly supported by the French “Programme d’Actions Intégrées Amadeus” funded by the French Ministry of Foreign Affairs and the Austrian Exchange Service (OAD), WTZ Program Amadeus under contract no. 13/2002

Nondeterministic parallel program behavior. Parallel program: same code, same platform, same input data. Different runs ==> different results! Reasons? Scheduling decisions of processor/OS; cache contents and cache conflicts; memory access patterns; network conflicts; nondeterminism in the network.

Example: MPI applications. MPI_ANY_SOURCE: wildcard receive, race condition.
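The race behind a wildcard receive can be illustrated without an MPI installation. The following is a minimal sketch in plain Python, not the authors' code: `run_receiver`, the message values, and the order-sensitive combination are all hypothetical and only stand in for a receiver that accepts messages from any source in whatever order the network delivers them.

```python
from itertools import permutations

def run_receiver(arrival_order):
    """Simulate a receiver posting wildcard receives (in the spirit of MPI_ANY_SOURCE).

    arrival_order is a list of (source_rank, value) pairs in the order the
    network happened to deliver them. The combination below is deliberately
    order-sensitive, so the final result depends on the delivery order.
    """
    acc = 0
    for _source, value in arrival_order:
        acc = acc * 10 + value
    return acc

# Hypothetical messages sent by ranks 1, 2 and 3 to the same receiver.
messages = [(1, 1), (2, 2), (3, 3)]

# Each delivery order is a legal execution of the same program on the same input.
results = {run_receiver(list(order)) for order in permutations(messages)}
print(sorted(results))
```

Same code, same input, several possible results: this is the irreproducibility that record and replay is meant to remove.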

Nondeterminism. Irreproducibility problem: cannot repeat a particular execution; no debugging actions possible. Completeness problem: cannot observe some errors; impossible to test all possible executions. Probe effect: monitoring actions influence the program.

Monitoring influences the observed program in time (events are delayed due to monitoring overhead; ordering of events is perturbed) and in space (storing monitoring data requires memory space).

Our approach: monitoring optimizations. Minimization of monitor overhead through minimally invasive instrumentation. Minimization of monitor overhead through exploitation of additional hardware. Usage of clusters with programmable network hardware.

Myrinet clustering (figure, courtesy of Myricom Inc): Myrinet NICs; link cables (fiber to 200 m); Myrinet switches; software (host, NIC firmware); in-cabinet server clusters, desktop hosts, embedded clusters; 2+2 Gbit/s links.

Programmable network cards. Myrinet NIC: processor on board (Lanai 9.2 RISC, 200 MHz), memory (2 MB). Communication between host CPU and NIC: programmed input/output (PIO), with dedicated commands to access memory locations and extract NIC status; direct memory access (DMA), for transfers between host and NIC, independent of the host CPU. GM software: software library, kernel module, Myricom Control Program (MCP). Courtesy of Myricom Inc.

Myrinet NICs = Protocol Offload Engines. Myrinet NICs: processor, memory, and firmware. (Block diagram of the Lanai 2XP, courtesy of Myricom Inc: SRAM interface with x72 SRAM, JTAG interface, EEPROM interface, PCI-X interface, packet interfaces, CPU, X port, copy & CRC32 engine, SerDes & transceivers.)

Myrinet software interfaces (figure, courtesy of Myricom Inc): applications; MPI, sockets, other middleware; UDP, TCP, IP; Ethernet driver and Myrinet driver in the host OS; OS bypass; Ethernet NIC; firmware in the Myrinet NIC; one or more 2+2 Gbit/s Myrinet ports.

Monitoring on programmable network cards. We move the record actions from the host CPU to the NIC. Architecture based on 3 steps: 1. Preparation and instrumentation 2. Recording execution 3. Repeated replay phases

Preparation and instrumentation: loading the modified MCP onto the NIC; instrumenting the MPI program by including a modified MPI header file; compiling the application with the modified MPICH library.

Recording execution. NIC buffer used to store the order of incoming messages. Critical step. Optimization based on the semantics of MPI: messages between 2 nodes arrive in the same order as generated by the sender, so we only trace messages on the receiver side.
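Why a receiver-side trace of source ranks is enough can be sketched as follows. This is an illustrative Python model, not the NIC firmware: `record_run` and `reconstruct` are hypothetical names, and the per-sender streams are assumed to be FIFO, which matches the ordering guarantee between two nodes stated above.

```python
from collections import deque

def record_run(arrivals):
    """Record phase: log only the source rank of each incoming message.

    arrivals is a list of (source_rank, payload) pairs in delivery order.
    Returns the trace (source ranks only) and the payloads as received.
    """
    trace = []
    received = []
    for source, payload in arrivals:
        trace.append(source)      # payloads are never stored in the trace
        received.append(payload)
    return trace, received

def reconstruct(trace, per_sender_streams):
    """Rebuild the receive order from the trace and each sender's FIFO stream."""
    queues = {src: deque(msgs) for src, msgs in per_sender_streams.items()}
    return [queues[src].popleft() for src in trace]

# Hypothetical run: ranks 1 and 2 each send two messages to the receiver.
arrivals = [(2, "b0"), (1, "a0"), (2, "b1"), (1, "a1")]
trace, received = record_run(arrivals)
rebuilt = reconstruct(trace, {1: ["a0", "a1"], 2: ["b0", "b1"]})
print(trace, rebuilt == received)
```

Because each sender's messages keep their order, the list of source ranks alone pins down which message every receive matched, so nothing needs to be logged on the sender side.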

Recording execution. Upon initialization of the MPI program: memory reservation on the NIC to store the order of incoming messages. If the buffer is full: asynchronous transfer to host memory during execution. After execution: generation of a file with the monitoring information extracted from the NIC.
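The buffer life cycle described on this slide (reserve, spill to the host when full, write a file at the end) can be sketched as below. This is a simplified model under stated assumptions: `TraceBuffer` is a hypothetical class, the capacity of 2 is arbitrary, the flush is synchronous here whereas the slide describes an asynchronous transfer, and the one-rank-per-line output format is invented for illustration.

```python
import io

class TraceBuffer:
    """Fixed-size trace buffer standing in for memory reserved on the NIC."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.nic = []    # entries currently held in NIC memory
        self.host = []   # entries already transferred to host memory

    def record(self, source):
        # When the NIC buffer is full, spill it to the host before logging.
        if len(self.nic) == self.capacity:
            self.flush()
        self.nic.append(source)

    def flush(self):
        # Synchronous stand-in for the asynchronous NIC-to-host transfer.
        self.host.extend(self.nic)
        self.nic.clear()

    def dump(self, fileobj):
        # After execution: move the remaining entries and write the trace file.
        self.flush()
        for source in self.host:
            fileobj.write(f"{source}\n")

buf = TraceBuffer(capacity=2)
for source in [3, 1, 2, 1, 3]:
    buf.record(source)

out = io.StringIO()   # stands in for the generated trace file
buf.dump(out)
print(out.getvalue())
```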

Replaying: to increase the amount of observation data; to perform program analysis. Only hosts are involved. Using dedicated graphical environments (DeWiz).
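A replay phase can be sketched as forcing each receive to match the source recorded in the trace, whatever order messages arrive in this time. The function below is a hypothetical Python illustration, not the modified MPICH library: it buffers early arrivals per sender and releases them in traced order, and it assumes every traced message eventually arrives.

```python
from collections import deque

def replay(trace, arrivals):
    """Deliver messages in the recorded source order.

    trace is the list of source ranks logged during the record phase.
    arrivals is a list of (source_rank, payload) pairs in the (possibly
    different) order the network delivers them during this run.
    """
    pending = {}                  # early arrivals, one FIFO per sender
    delivered = []
    incoming = iter(arrivals)
    for wanted in trace:
        queue = pending.setdefault(wanted, deque())
        # Hold back other senders' messages until the wanted one shows up.
        while not queue:
            source, payload = next(incoming)
            pending.setdefault(source, deque()).append(payload)
        delivered.append(queue.popleft())
    return delivered

# Trace from a hypothetical recorded run, replayed against a different arrival order.
trace = [2, 1, 2, 1]
new_arrivals = [(1, "a0"), (1, "a1"), (2, "b0"), (2, "b1")]
replayed = replay(trace, new_arrivals)
print(replayed)
```

Since the replayed run is deterministic, extra instrumentation can be added on the hosts without changing the message order, which is what makes it possible to collect more observation data during replay.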

Replaying: debugging tool. DeWiz screenshot with events collected on the programmable card.

Time graph, counter analysis

Conclusion and current work. Advantages: minimal intrusion during the initial record phase; eliminating the irreproducibility effect; decreasing the probe effect; monitoring without user knowledge. Tools to manipulate the event graph. Adding QoS functionality on the NIC to filter monitoring actions. Deploying record and replay mechanisms inside programmable switches.