Faster! Vidhyashankar Venkataraman CS614 Presentation.

1 Faster! Vidhyashankar Venkataraman CS614 Presentation

2 U-Net : A User-Level Network Interface for Parallel and Distributed Computing

3 Background – Fast Computing
Emergence of MPPs (Massively Parallel Processors) in the early 90's:
- Repackage hardware components to form a dense configuration of very large parallel computing systems
- But require custom software
Alternative: NOW (Berkeley) – Network Of Workstations
- Inexpensive, low-latency, high-bandwidth, scalable networks of workstations
- Interconnected through fast switches
Challenge: build a scalable system that is able to use the aggregate resources in the network to execute parallel programs efficiently

4 Issues
Problem with traditional networking architectures:
- Software path through the kernel involves several copies – processing overhead
- In faster networks, may not get application speed-up commensurate with network performance
Observations:
- Small messages: processing overhead dominates network latency
- Most applications use small messages
- E.g. UCB NFS trace: 50% of bits sent were in messages of 200 bytes or less

5 Issues (contd.)
Flexibility concerns:
- Protocol processing in the kernel
- Greater flexibility if application-specific information is integrated into protocol processing
- Can tune the protocol to the application's needs
- E.g. customized retransmission of video frames

6 U-Net Philosophy
Achieve flexibility and performance by:
- Removing the kernel from the critical path
- Placing the entire protocol stack at user level
- Allowing protected user-level access to the network
- Supplying full bandwidth to small messages
- Supporting both novel and legacy protocols

7 Do MPPs do this?
Parallel machines like the Meiko CS-2 and Thinking Machines CM-5:
- Have tried to solve the problem of providing user-level access to the network
- Use a custom network and network interface – no flexibility
U-Net targets applications on standard workstations:
- Using off-the-shelf components

8 Basic U-Net architecture
- Virtualize the network device so that each process has the illusion of owning the NI
- A mux/demux device virtualizes the NI – offers protection!
- Kernel removed from the critical path
- Kernel involved only in setup

9 The U-Net Architecture
Building blocks:
- Application endpoints
- Communication segment (CS) – a region of memory
- Message queues (send, receive, free)
Sending:
- Assemble the message in the CS
- Enqueue a message descriptor
Receiving:
- Poll-driven or event-driven
- Dequeue a message descriptor
- Consume the message
- Enqueue the buffer in the free queue
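The endpoint and queue mechanics above can be sketched as a toy model. This is a simplified simulation, not the real U-Net API; the class and field names are illustrative:

```python
from collections import deque

class Endpoint:
    """Toy model of a U-Net endpoint: a communication segment (CS)
    plus send, receive, and free queues of message descriptors."""
    def __init__(self, nbufs=4, bufsize=64):
        # The CS is pinned buffer memory in the real system
        self.cs = [bytearray(bufsize) for _ in range(nbufs)]
        self.free_q = deque(range(nbufs))   # indices of free CS buffers
        self.send_q = deque()               # descriptors of outgoing messages
        self.recv_q = deque()               # descriptors of arrived messages

    def send(self, payload: bytes) -> int:
        """Sending: assemble the message in the CS, enqueue a descriptor."""
        buf = self.free_q.popleft()
        self.cs[buf][:len(payload)] = payload
        self.send_q.append((buf, len(payload)))
        return buf

    def receive(self) -> bytes:
        """Receiving (poll-driven): dequeue a descriptor, consume the
        message, and return the buffer to the free queue."""
        buf, length = self.recv_q.popleft()
        data = bytes(self.cs[buf][:length])
        self.free_q.append(buf)             # recycle the buffer
        return data
```

A loopback delivery can be simulated by moving a descriptor from the send queue to the receive queue; in the real system the network interface does this via DMA.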

10 U-Net Architecture (contd.)
More on event handling (upcalls):
- Can be a UNIX signal handler or a user-level interrupt handler
- Amortize the cost of upcalls by batching receptions
Mux/demux:
- Each endpoint is uniquely identified by a tag (e.g. a VCI in ATM)
- The OS performs the initial route setup and security tests, and registers a tag in U-Net for that application
- The message tag is mapped to a communication channel
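The tag-based multiplexing described above can be sketched as follows. This is an illustrative model, not Fore firmware code; the names are hypothetical:

```python
class UNetMux:
    """Toy demultiplexer: the OS registers a tag (e.g. an ATM VCI) per
    channel at setup time; incoming messages are routed by tag alone,
    so applications cannot steal each other's traffic."""
    def __init__(self):
        self.channels = {}          # tag -> endpoint's receive queue

    def register(self, tag, recv_queue):
        # Done by the OS after route setup and security checks
        if tag in self.channels:
            raise ValueError("tag already in use")
        self.channels[tag] = recv_queue

    def deliver(self, tag, message):
        # A message with an unknown tag is dropped, never misdelivered
        q = self.channels.get(tag)
        if q is not None:
            q.append(message)
        return q is not None
```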

11 Observations
- Have to preallocate buffers – memory overhead!
- Protected user-level access to the NI: ensured by demarcating protection boundaries
  - Defined by endpoints and communication channels
- Applications cannot interfere with each other because:
  - Endpoints, CS, and message queues are user-owned
  - Outgoing messages are tagged with the originating endpoint address
  - Incoming messages are demuxed by U-Net and sent to the correct endpoint

12 Zero-copy and True zero-copy
Two levels of sophistication, depending on whether a copy is made at the CS:
Base-level architecture:
- Zero-copy: data is copied into an intermediate buffer in the CS
- CSes are allocated, aligned, and pinned to physical memory
- Optimization for small messages
Direct-access architecture:
- True zero-copy: data is sent directly out of the data structure
- The sender also specifies the offset where data has to be deposited
- The CS spans the entire process address space
- Limitations in I/O addressing force one to resort to zero-copy

13 Kernel-emulated end-point
- Communication segments and message queues are scarce resources
- Optimization: provide a single kernel-emulated endpoint
- Cost: performance overhead

14 U-Net Implementation
U-Net architectures implemented in two systems:
- Using the Fore Systems SBA-100 and SBA-200 ATM network interfaces
- But why ATM?
- Setup: SPARCstations 10 and 20 on SunOS 4.1.3, with an ASX-200 ATM switch and 140 Mbps fiber links
SBA-200 firmware:
- 25 MHz on-board i960 processor, 256 KB RAM, DMA capabilities
- Complete redesign of the firmware
Device driver:
- Protection offered through the VM system (CSes)
- Also through mappings

15 U-Net Performance
RTT and bandwidth measurements:
- Small messages: 65 μs RTT (optimization for single cells)
- Fiber saturated at 800 B

16 U-Net Active Messages Layer
- An RPC that can be implemented efficiently on a wide range of hardware
- A basic communication primitive in NOW
- Allows overlapping of communication with computation
- A message contains data and a pointer to its handler
- Reliable message delivery
- The handler moves data into data structures for some (ongoing) operation
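The key idea above – a message that names its own handler, which runs on arrival and deposits data straight into an ongoing computation – can be sketched in a few lines. This is a toy illustration, not the UAM interface; all names are made up:

```python
def handle_put(state, addr, value):
    """Example handler: deposit the received value into the ongoing
    computation's data structure at the given address/key."""
    state[addr] = value

class ActiveMessageEndpoint:
    """Toy active-message receiver: each message carries a handler
    reference plus arguments; on arrival the handler runs immediately,
    moving data into place with no intermediate buffering or rendezvous."""
    def __init__(self, state):
        self.state = state

    def on_arrival(self, handler, *args):
        handler(self.state, *args)   # dispatch straight to the handler

state = {}
ep = ActiveMessageEndpoint(state)
ep.on_arrival(handle_put, "x", 7)    # as if the message just arrived
```

Because the handler runs immediately and returns quickly, communication overlaps with computation instead of blocking it.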

17 AM – Micro-benchmarks
Single-cell RTT:
- RTT ~71 μs for a 0–32 B message
- Overhead of 6 μs over raw U-Net – why?
Block store bandwidth:
- 80% of the maximum limit with 2 KB blocks
- Almost saturated at 4 KB
- Good performance!

18 Split-C application benchmarks
- Split-C: a parallel extension to C
- Implemented on top of UAM
- Tested on 8 processors
- The ATM cluster performs close to the CS-2

19 TCP/IP and UDP/IP over U-Net
- Good performance is necessary to show flexibility
- Traditional IP-over-ATM shows very poor performance
  - E.g. TCP: only 55% of max bandwidth
- TCP and UDP over U-Net show improved performance
  - Primarily because of tighter application-network coupling
IP-over-U-Net:
- IP-over-ATM does not exactly correspond to IP-over-U-Net
- Demultiplexing on the same VCI is not possible

20 Performance Graphs
- UDP performance: saw-tooth behavior for Fore UDP
- TCP performance

21 Conclusion
- U-Net provides a virtual view of the network interface to enable user-level access to high-speed communication devices
- The two main goals were to achieve performance and flexibility by avoiding the kernel in the critical path
- Achieved? Look at the table below…

22 Lightweight Remote Procedure Calls

23 Motivation
- Small-kernel OSes have most services implemented as separate user-level processes
- Separate, communicating user processes:
  - Improve modular structure
  - More protection
  - Ease of system design and maintenance
- Cross-domain and cross-machine communication treated equally – problems?
  - Fails to isolate the common case
  - Performance and simplicity considerations

24 Measurements
Measurements show cross-domain predominance:
- V System – 97%
- Taos Firefly – 94%
- Sun UNIX+NFS Diskless – 99.4%
But how about RPCs these days?
- Taos takes 109 μs for a local Null() call and 464 μs for the RPC – 3.5x overhead
- Most interactions are simple, with small numbers of arguments
  - This could be used to make optimizations

25 Overheads in Cross-domain Calls
- Stub overhead – additional execution path
- Message buffer overhead – cross-domain calls can involve four copy operations for any RPC
- Context switch – VM context switch from the client's domain to the server's, and vice versa on return
- Scheduling – abstract vs. concrete threads

26 Available solutions?
- Eliminating kernel copies (DASH system)
- Handoff scheduling (Mach and Taos)
- In SRC RPC: message buffers are globally shared!
  - Trades safety for performance

27 Solution proposed: LRPC
- Written for the Firefly system
- A mechanism for communication between protection domains on the same machine
- Motto: strive for performance without forgoing safety
- Basic idea: similar to RPC, but:
  - Do not context switch to a server thread
  - Instead, change the context of the client thread to reduce overhead

28 Overview of LRPC Design
- Client calls the server through a kernel trap
- Kernel validates the caller
- Kernel dispatches the client thread directly to the server's domain
- Client provides the server with a shared argument stack and its own thread
- Return through the kernel to the caller

29 Implementation – Binding
(Diagram: client thread, kernel, server thread, and clerk)
- Server clerk: exports the interface, registers with the name server, waits
- Client: traps to the kernel to import the interface
- Kernel: notifies the clerk; clerk sends the PDL
- Kernel processing: allocates A-stacks, linkage records, and a Binding Object (BO)
- Kernel sends the BO and A-stack list to the client

30 Data Structures used and created
- Kernel receives a Procedure Descriptor List (PDL) from the clerk
  - Contains a PD for each procedure: entry address, apart from other information
- Kernel allocates argument stacks (A-stacks), shared by the client and server domains, for each PD
- Allocates a linkage record for each A-stack to record the caller's return address
- Allocates a Binding Object – the client's key to access the server's interface

31 Calling
Client stub traps to the kernel after:
- Pushing arguments onto the A-stack
- Storing the BO, procedure identifier, and address of the A-stack in registers
Kernel:
- Validates the client, verifies the A-stack, and locates the PD and linkage record
- Stores the return address in the linkage record and pushes it on a stack
- Switches the client thread's context to the server, running on a new stack (E-stack) from the server's domain
- Calls the server's stub corresponding to the PD
Server:
- The client thread runs in the server's domain using the E-stack
- Can access parameters on the A-stack
- Return values are placed in the A-stack
- Calls back into the kernel through the stub
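The calling sequence above can be sketched as a toy simulation. This is an illustrative model of the control flow only (the real path is kernel traps and assembly stubs on the Firefly); all names here are hypothetical:

```python
class Kernel:
    """Toy LRPC kernel path: validate the binding object, record the
    return linkage, then run the *client's* thread in the server's
    domain, instead of switching to a separate server thread."""
    def __init__(self):
        self.bindings = {}    # binding object -> procedure table (the PDL)

    def bind(self, bo, procedures):
        self.bindings[bo] = procedures

    def lrpc(self, bo, proc_id, a_stack):
        procs = self.bindings.get(bo)
        if procs is None:                       # kernel validates the caller
            raise PermissionError("invalid binding object")
        linkage = "caller-return-address"       # recorded in the linkage record
        # The same thread continues, now executing the server's stub;
        # arguments and results travel in the shared A-stack.
        procs[proc_id](a_stack)
        return linkage                          # return through the kernel

def add_stub(a_stack):
    # Server-side stub: read args from the A-stack, write the result back
    a_stack["result"] = a_stack["x"] + a_stack["y"]

kernel = Kernel()
kernel.bind("BO-1", {0: add_stub})
a_stack = {"x": 2, "y": 3}          # shared between client and server
kernel.lrpc("BO-1", 0, a_stack)
```

Note how no second thread appears anywhere: the "server" code runs on the caller's thread, which is exactly the scheduling and state-saving cost LRPC avoids.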

32 Stub Generation
- LRPC stubs are automatically generated in assembly language for simple execution paths
  - Sacrifices portability for performance
- Maintains local and remote stubs
- The first instruction in the local stub is a branch statement

33 What is optimized here?
- Using the same thread in different domains reduces overhead
  - Avoids scheduling decisions
  - Saves the cost of saving and restoring thread state
- Pairwise A-stack allocation guarantees protection from third-party domains
  - Within the pair? Asynchronous updates?
- Client validated using the BO – provides security
- Redundant copies eliminated through use of the A-stack!
  - 1 copy against 4 in traditional cross-domain RPCs
  - Sometimes two? Optimizations apply
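The "1 copy against 4" claim can be made concrete with a small counting sketch. This is purely illustrative (the real copies are memcpy-style moves between domains), with invented function names:

```python
def traditional_rpc(args):
    """Traditional cross-domain RPC: four copies of the argument data
    on the way from client stub to server procedure."""
    copies = 0
    msg = list(args);  copies += 1    # 1: client stub -> message buffer
    kbuf = list(msg);  copies += 1    # 2: client buffer -> kernel buffer
    sbuf = list(kbuf); copies += 1    # 3: kernel -> server-domain buffer
    fin = list(sbuf);  copies += 1    # 4: server buffer -> server stack
    return fin, copies

def lrpc_call(args, a_stack):
    """LRPC: a single copy, from the client stub directly into the
    A-stack shared between the two domains."""
    copies = 0
    a_stack[:] = args; copies += 1    # 1: client stub -> shared A-stack
    return a_stack, copies
```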

34 Argument Copy

35 But… Is it really good enough?
- Trades memory-management costs for the reduction of overhead
  - A-stacks have to be allocated at bind time
  - But their size is generally small
- Will LRPC work even if a server migrates from a remote machine to the local machine?

36 Other Issues – Domain Termination
- An LRPC into a terminated server domain should be returned to the client
- An LRPC should not be returned to the caller if the caller has terminated
Use binding objects:
- Revoke the binding objects
- For threads running LRPCs in the terminated domain, restart new threads in the corresponding caller
- Invalidate active linkage records – each thread is returned to the first domain with an active linkage record
- Otherwise the thread is destroyed
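The binding-object revocation step above can be sketched as follows. A toy model with invented names, showing only the validation side: once a domain dies, its bindings are revoked and any later call through them fails at the kernel's check rather than entering a dead domain:

```python
class BindingTable:
    """Toy model of binding-object revocation on domain termination."""
    def __init__(self):
        self.valid = set()

    def bind(self, bo):
        self.valid.add(bo)

    def terminate_domain(self, bos):
        # Revoke every binding object the terminated domain exported
        self.valid -= set(bos)

    def call(self, bo):
        # Kernel-side validation: a revoked binding never reaches the domain
        if bo not in self.valid:
            raise ConnectionError("binding revoked: domain terminated")
        return "ok"
```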

37 Multiprocessor Issues
- LRPC minimizes the use of shared data structures on the critical path
  - Guaranteed by pairwise allocation of A-stacks
- Caches domain contexts on idle processors:
  - Threads idle in the server's context on idle processors
  - When a client thread makes an LRPC to the server, processors are swapped
  - Reduces context-switch overhead

38 Evaluation of LRPC
Performance of four test programs (time in μs), run on the C-VAX Firefly and averaged over 100,000 calls

39 Cost Breakdown for the Null LRPC
- "Minimum" refers to the inherent minimum overhead
- 18 μs spent in the client stub and 3 μs in the server stub
- 25% of the time spent in TLB misses

40 Throughput on a multiprocessor
- Tested on a Firefly with four C-VAX processors and one MicroVAX II I/O processor
- Speedup of 3.7 with 4 processors, relative to 1 processor
- Speedup of 4.3 with 5 processors
- SRC RPC: inferior performance due to a global lock held during the critical transfer path

41 Conclusion
LRPC combines:
- The control-transfer and communication model of capability systems
- The programming semantics and large-grained protection model of RPC
Enhances performance by isolating the common case

42 NOW
We will see 'NOW' later in one of the subsequent 614 presentations

