CS 443 Advanced OS Fabián E. Bustamante, Spring 2005 Supporting Parallel Applications on Clusters of Workstations: The Intelligent Network Interface Approach Marcel Catalin Rosu, Karsten Schwan, and Richard Fujimoto, Georgia Institute of Technology Appears in HPDC 1997 Presented by: Lei Yang
2 Background Multiprocessor-based system models –Parallel vector processor (PVP) –Symmetric multiprocessor (SMP) –Massively parallel processor (MPP) –Distributed shared memory machine –Cluster of workstations (COW) COW features –Each node is a complete workstation minus peripherals (monitor, keyboard, mouse, …) –Nodes are connected through a commodity network, e.g., Ethernet, FDDI, or an ATM switch –A complete OS resides on each node
3 Motivation Problem with COW –The performance of communication software inherently fails to scale with host CPU performance –High communication overhead: software overhead (time required to prepare and authenticate the message) is significantly higher than hardware overhead (network setup and message propagation time) Coprocessors on the network interface –Myrinet and ATM –But what should coprocessors do to minimize communication overheads?
4 Motivation The critical step is reducing host communication overheads, rather than network latency. Why? –Many existing parallel applications are designed to hide network latencies; –Multithreaded applications typically cannot benefit significantly from improving network latencies below the cost of several user-level thread context switches; –In a cluster, in contrast to a parallel machine, the schedulers of distinct nodes are only loosely synchronized; this implies highly dynamic offsets among schedulers, and therefore among cooperating application threads, on the order of tens of microseconds.
5 The VCM approach VCM –Virtual Communication Machine –Enables applications to set up a customized and lightweight communication path between their address spaces and the “wire” Goal –Reduction of software communication overheads How –Transfer selected communication-related processing activities from the host CPU(s) to the network coprocessor –A low-level abstraction between applications and coprocessor –Applications interact directly with the VCM: complexity is hidden by a user-level library, and the usual protection is provided by a kernel extension –VCM and applications operate asynchronously –VCM and applications use shared memory to communicate
6 VCM features The intelligent network interface: VCM –The authors changed the name in a later journal version. VCM has an active role –Access to application address space –Extensions to shared-memory applications Zero-copy messaging available at both ends –Sending –Receiving Communication-related processing can be transferred to the network coprocessor Buffer pages are managed by the application –The application knows its own behavior best Multiple VCMs supported per host
7 VCM Architecture Coprocessor is responsible for –Ensuring data integrity –Assembling/disassembling messages directly from/into an application’s data structures –Multiplexing/demultiplexing network messages –Enforcing protection Three components –Virtual Communication Machine, implemented on the network coprocessor –A kernel extension module, for address space management and protection –A user-level library, which shields applications from the complexity of interacting with the VCM and the kernel extension
8 Application–VCM interaction An application accesses a VCM by registering –Registration extends a shared memory space with the VCM: the Command Area –Application and VCM interact via the command area –Program or instruction completion is signaled using status words placed in the command area Asynchronous operations –Coprocessor polls for new programs to execute –Host CPU(s) check for program and instruction completion by polling the status words –Data transfers are performed only by the coprocessor Improving performance –Loop interaction: bursty invocations with many identical parameters
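The polling handshake described on this slide can be sketched in a few lines. This is a minimal single-process simulation, not the authors' implementation: the slot layout, the `Status` values, and the names `host_post`, `coprocessor_poll`, and `host_check` are all assumptions made for illustration, and a Python list stands in for the shared-memory command area.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Status(IntEnum):
    # Hypothetical status-word values; the slides do not give an encoding.
    FREE = 0
    POSTED = 1
    DONE = 2


@dataclass
class Slot:
    status: Status = Status.FREE
    program: tuple = ()  # e.g. ("send", payload); format is an assumption


@dataclass
class CommandArea:
    """Stand-in for the memory region shared by the application and the VCM."""
    slots: list = field(default_factory=lambda: [Slot() for _ in range(4)])


def host_post(area, program):
    """Host side: write a program into a free slot and return its index."""
    for i, slot in enumerate(area.slots):
        if slot.status == Status.FREE:
            slot.program = program
            # Status is written last so a poller never sees a half-written program.
            slot.status = Status.POSTED
            return i
    raise RuntimeError("command area full")


def coprocessor_poll(area, wire):
    """Coprocessor side: one polling pass that executes every posted program."""
    for slot in area.slots:
        if slot.status == Status.POSTED:
            op, payload = slot.program
            if op == "send":
                wire.append(payload)   # only the coprocessor moves data
            slot.status = Status.DONE  # completion is signaled via the status word


def host_check(area, index):
    """Host side: poll one status word; recycle the slot once it reads DONE."""
    slot = area.slots[index]
    if slot.status == Status.DONE:
        slot.status = Status.FREE
        return True
    return False


area, wire = CommandArea(), []
idx = host_post(area, ("send", b"hello"))
pending = host_check(area, idx)   # coprocessor has not polled yet
coprocessor_poll(area, wire)
finished = host_check(area, idx)  # status word has been set by the coprocessor
```

Neither side ever blocks on the other: the host keeps computing between calls to `host_check`, which is the asynchrony the slide emphasizes. On real hardware the ordering of the program write and the status write would need memory barriers, which this sketch does not model.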
9 Command Area
10 Implementation Platform –Cluster of Sun UltraSPARC I Model 170 workstations –Solaris 2.5 –FORE SBA-200E network cards –25 MHz i960 microprocessor
11 Implementation VCM interpreter –Running on the coprocessor –Order of requests: protection-related instructions, VCM programs, loop instructions, incoming data Protection and buffer page management –VCM accepts protection management instructions only from the kernel or from the connection server –VCM checks the correctness of all parameters received from an application –Messages longer than expected are truncated to the size of the receiving buffer
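The interpreter's request ordering and its two safety checks can be illustrated as follows. This is a sketch under stated assumptions: reading the slide's list as a strict service priority is an interpretation, and the queue names and the helpers `next_request`, `valid_params`, and `deliver` are hypothetical, not part of the VCM instruction set.

```python
from collections import deque

# Service order taken from the slide; the queue names are hypothetical.
PRIORITY = ("protection", "program", "loop", "incoming")


def next_request(queues):
    """Return (kind, request) from the highest-priority non-empty queue, or None."""
    for kind in PRIORITY:
        if queues[kind]:
            return kind, queues[kind].popleft()
    return None


def valid_params(offset, length, region_size):
    """Check application-supplied parameters against the registered region."""
    return offset >= 0 and length >= 0 and offset + length <= region_size


def deliver(message, buffer_size):
    """Truncate an incoming message to the size of the receiving buffer."""
    return message[:buffer_size]


queues = {kind: deque() for kind in PRIORITY}
queues["incoming"].append(b"abcdefgh")
queues["program"].append("send #1")
queues["protection"].append("map page")

# Drain the queues and record the order in which request kinds are served.
order = []
while (item := next_request(queues)) is not None:
    order.append(item[0])
```

Requests are served in priority order regardless of arrival order, so a protection change posted after an incoming message still takes effect first. Because the coprocessor validates parameters and truncates oversized messages itself, a misbehaving application cannot make it read or write outside the pages registered for that application.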
12 Implementation VCM instruction set
13 Evaluation Microbenchmarks Synthetic client/server application –Ten client workstations issue back-to-back data requests to the server workstation Traveling Salesman Problem (TSP) Georgia Tech Time Warp (GTW) –A parallel kernel for discrete-event simulation –PHold, a synthetic application –PCS, a wireless network simulation
14 Performance - Microbenchmarks Latency grows linearly with the message size The maximum send rate approaches the maximum data capacity of the wire
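One way to read these two observations together is through a simple linear cost model. The symbols are assumptions introduced for illustration, not quantities reported in the paper: $n$ is the message size, $T_0$ a fixed per-message cost, and $B$ the data capacity of the wire.

```latex
T(n) = T_0 + \frac{n}{B},
\qquad
R(n) = \frac{n}{T(n)} = \frac{B}{1 + \dfrac{T_0 B}{n}}
\;\xrightarrow{\; n \to \infty \;}\; B
```

Under this model latency is linear in $n$ with slope $1/B$, and the achievable send rate $R(n)$ approaches $B$ once $n$ is large enough that the fixed cost $T_0$ is amortized. Lowering $T_0$, which is what offloading work to the coprocessor targets, moves the rate toward $B$ at smaller message sizes.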
15 Performance - Client/server application Outgoing bandwidth of the server as a function of the request size, when the server uses one or two interfaces.
16 Performance – TSP
17 Performance – PHold
18 Performance – PCS
19 Limitations Requires special hardware –A network adapter card equipped with: a network coprocessor; a few megabytes of fast memory; one or more DMA engines under the control of the coprocessor; network-specific hardware to help with performance-critical processing (e.g., CRC) How hard is it to port shared-memory applications to a VCM-based COW?
20 Conclusion Reducing host communication overhead is crucial VCM –Flexible integration between network and application –Low overhead on the host processor –Latency and bandwidth close to the hardware limits –Enables zero-copy messaging –Allows certain shared-memory parallel applications to be ported to a VCM-based COW Performance is desirable, contribution is valuable