Slide 1 von Eicken et al, "Active Messages: a Mechanism for Integrated Communication and Computation" CS258 Lecture by: Dan Bonachea.

Slide 2: Motivation for AM (review)
How do we make parallel programs fast?
Minimize communication overhead
Overlap communication & computation (shoot for 100% utilization of all resources)
Consider the entire program
–Communication
–Computation
–Interactions between the two

Slide 3: Message-Driven Architectures
Research systems – J-Machine/MDP, Monsoon, etc.
–Defining quality: all significant computation happens within the context of a handler
–Computational model is basically dataflow programming
»Supports languages with dynamic parallelism, e.g. MultiLISP
–Interesting note: about 1/3 of all handlers in the J-Machine end up blocking and get swapped out by software
Pros:
–Low-overhead communication – a reaction to the lousy performance of the send/recv model traditionally used in message-passing systems
–Tight integration with the network – directly "execute" messages
Cons:
–Typically need hardware support in the NIC to achieve good performance – need more sophisticated buffering & scheduling
–Poor locality of computation => small register sets and degraded raw computational performance (bad cache locality)
–Poor cost/performance ratio, hard to program(?)
–The number of handlers waiting to run at a given time is determined by excess parallelism in the application, not the arrival rate of messages

Slide 4: Message-Passing Architectures
Commercial systems – nCube, CM-5
–Defining feature: all significant computation happens in a dedicated computational thread => good locality, performance
Traditional programming model is blocking, matched send/recv (implemented as a 3-phase rendezvous)
–Inherently a poor programming model for the lowest level:
–Doesn't match the semantics of the NIC, and performance gets lost in the translation
–Doesn't allow for overlap without expensive buffering
There's no compelling reason to keep this model as our lowest-level network interface, even for this architecture
–Sometimes easier to program, but we want the lowest-overhead interface possible as the NIC-level interface
–Can easily provide a send/recv abstraction on top of a more efficient interface
–No way to recapture lost performance if the lowest-level interface is slow

Slide 5: Active Messages – a new "mechanism"
Main idea: take the best features of the message-driven model and unify them with the capabilities of message-passing hardware
–Get the same or better performance as message-driven systems with little or no special-purpose hardware
–Fix the mismatch between the low-level software interface and hardware capabilities that cripples performance
»Eliminate all buffering not required by the transport
»Expose out-of-order, asynchronous delivery
–Need to restrict the allowable behavior of handlers somewhat to make this possible

Slide 6: Active Messages – Handlers
User-provided handlers that "execute" messages
–Handlers run immediately upon message arrival
–Handlers run quickly and to completion (no blocking)
–Handlers run atomically with respect to each other
–These restrictions make it possible to implement handlers with no buffering on simple message-passing hardware
The purpose of AM handlers:
–Quickly extract a message from the network and "integrate" the data into the running computation in an application-specific way, with a small amount of work
–Handlers do NOT perform significant computation themselves
»Only the minimum functionality required to communicate
»This is the crucial difference between AM and the message-driven model

Slide 7: Active Messages – Handlers (cont.)
Miscellaneous restriction:
–Communication is strictly request-reply (ensures acyclic protocol dependencies)
–Prevents deadlock with strictly bounded buffer space (assuming 2 virtual networks are available)
Still powerful enough to implement most if not all communication paradigms
–Shared memory, message-passing, message-driven, etc.
AM is especially useful as a compilation target for higher-level languages (Split-C, Titanium, etc.)
–Acceptable to trade off programmability and possibly some protection to maximize performance
–Code is often generated by a compiler anyway, so guarding against naïve users is less critical

Slide 8: Proof of Concept: Split-C
Split-C: an explicitly parallel, SPMD version of C
–Global address space abstraction, with a visible local/remote distinction
–Split-phase, one-sided (asynchronous) remote memory operations
–Sender executes a put or get, then a sync on a local counter for completion of 1 or more ops
User/compiler explicitly specifies prefetching to get overlap
Write in a shared-memory style, but remote operations are explicit
–The local/global distinction is important for high performance, so expose it to the user
–Can also implement arbitrarily generalized data transfers (scatter-gather, strided)
Important points:
–AM can efficiently provide a global memory space on existing message-passing systems in software, using the right model
–Evolutionary change rather than revolutionary (keep the architecture)
–Works very well for coarse-grained SPMD apps

Slide 9: Results
Dramatic reduction in latency on commercial message-passing machines with NO additional hardware
–nCUBE/2:
»AM send/handle: 11us/15us overhead
»Blocking message send/recv: 160us overhead
–CM-5:
»AM: <2us overhead
»Blocking message send/recv: 86us overhead
About an order of magnitude improvement with no hardware investment

Slide 10: Optional Hardware/Kernel Support for AM
DMA transfer support => large messages
Registers on the NIC for composing messages
–General registers, not FIFOs – allow message reuse
–Ability to compose a request & reply simultaneously
Fast user-level interrupts
–Allow fully user-level interrupts (trap directly to the handler)
–PC injection is one way to do this
–Any protection mechanisms required for the kernel to allow user-level NIC interrupts
Support for efficient polling

Slide 11: Problems with the AM-1 Paper
Handler atomicity w.r.t. the main computation
–Addressed in von Eicken's thesis
–Solutions:
»Atomic instructions
»A mechanism to temporarily disable NIC interrupts using a memory flag or a reserved register
Described as an abstract mechanism, not a solid portable spec
Little support for recv-side protection, multi-threading, CLUMPs, abstract naming, etc.
AM-2 fixes the above problems

Slide 12: GAM & Active Messages-2
Done at Berkeley by Mainwaring, Culler, et al.
Standardized the API & generalized it somewhat
Adds support missing in AM-1 for:
–Multiple logical endpoints per application (modularity, multi-threading, multi-NIC)
–Non-SPMD configurations
–Recv-side protection mechanisms to catch non-malicious bugs (tags)
–Multi-threaded applications
–A level of indirection on handlers for non-aligned memory spaces (heterogeneous systems)
–Fault-tolerance support for congestion, node failure, etc. (return to sender)
–Opaque endpoint naming (client code portability, transparent multi-protocol implementations)
–Polling may happen implicitly on all calls, so explicit polls are rarely required
–Enforced strict request/reply – eases implementation on some systems (HPAM)

Slide 13: Influence of Active Messages
Many implementations of AM in some form
–Natively on NICs: Myrinet (NOW project), VIA (Buonadonna & Begel), HP Medusa (Richard Martin), Intel Paragon (Liu), Meiko CS-2 (Schauser)
–On other transports: TCP (Liu and Mainwaring), UDP (me), MPI (me), LAPI (Yau & Welcome)
–Other interesting work: multi-protocol AM (shared memory & network for CLUMPs) (Lumetta)
Used as a compilation target for many parallel languages/systems:
–Split-C, Id90/TAM, Titanium, PVM, UPC, MPI, …
Influenced the design of important systems
–E.g.: IBM SP supercomputer: LAPI – a low-level messaging layer that is basically AM