
1 A High Performance PlanetLab Node
Jon Turner

2 Objectives
- Create system that is essentially compatible with current version of PlanetLab.
  - secure buy-in of PlanetLab users and staff
  - provide base on which resource allocation features can be added
- Substantially improve performance.
  - NP blade with simple fast path can forward at 10 Gb/s for minimum size packets
  - standard PlanetLab node today forwards 100 Mb/s with large packets
  - multiple GPEs per node allow more resources per user
- Phased development process.
  - long term goals include appearance of single PlanetLab node and dynamic code configuration on NPs
  - phased development provides useful intermediate steps that defer certain objectives
- Limitations.
  - does not fix PlanetLab's usage model
  - each "slice" limited to a Vserver plus a slice of an NP

3 Development Phases
Phase 0
- node with single GP blade hosting standard PlanetLab software
- Line Card and one or more NP blades to host NP-slices
- NP-slice configuration server running in privileged Vserver
  - invoked explicitly by slices running in Vservers
- NP blades with static slice code options (2), but dynamic slice allocation
Phase 1
- node with multiple GP blades, each hosting standard PlanetLab software and with its own externally visible IP address
- separate control processor hosting NP-slice configuration server
- expanded set of slice code options
Phase 2
- multiple GPEs in unified node with single external IP address
- CP retrieves slice descriptions from PLC and creates local copy for GPEs
- CP manages use of external port numbers
- transparent login process
- dynamic NP code installation

4 Phase 0 Overview
- System appears like a standard Plab node.
  - single external IP address
  - alternatively, single address for whole system
- Standard PlanetLab mechanisms control GPE.
  - Node Manager periodically retrieves slice descriptions from PlanetLab Central
  - configures Vservers according to slice descriptions
  - supports user logins to Vservers
- Resource Manager (RM) runs in privileged Vserver on GPE and manages NP resources for user slices.
  - NP slices explicitly requested by user slices
  - RM assigns slices to NPEs (to balance usage)
  - reserves port numbers for users
  - configures Line Cards and NPs appropriately
- Line Cards demux arriving packets using port numbers.

5 Using NP Slices
- External NPE packets use UDP/IP.
- NPE slice has ≥1 external port.
  - LC uses dport number to direct packet to proper NPE
  - NPE uses dport number to direct packet to proper slice (demux sketched below)
- Parse block of NPE slice gets: bare slice packet, input meta-interface, source IP addr and sport.
- Format block of NPE slice provides: output meta-interface, dest IP addr and dport for next hop.
- NPE provides multiple queues/slice.
- Exception packets use internal port numbers.
(Diagram: slice packets are carried inside an IP header with daddr = thisNode inbound and daddr = nextNode outbound; the dport is used to demux and determine the MI, and the output MI is mapped to the sport.)
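
A minimal sketch of the dport-based demux above. The table layout and field names (dport, npe_id, slice_id, mi) are illustrative assumptions, not the actual LC data structures:

    /* Map an external UDP destination port to an NPE, NP-slice, and
     * input meta-interface; packets that match no entry go to a GPE. */
    #include <stdint.h>
    #include <stddef.h>

    struct demux_entry {
        uint16_t dport;     /* external UDP destination port */
        uint8_t  npe_id;    /* NPE blade the Line Card forwards to */
        uint8_t  slice_id;  /* NP-slice selected on that NPE */
        uint8_t  mi;        /* input meta-interface passed to the parse block */
    };

    static const struct demux_entry *
    demux_lookup(const struct demux_entry *tbl, size_t n, uint16_t dport)
    {
        for (size_t i = 0; i < n; i++)
            if (tbl[i].dport == dport)
                return &tbl[i];
        return NULL;        /* not an NP-slice port: deliver to a GPE instead */
    }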

6 Managing NP Usage
- Resource Manager assigns Vservers to NP-slices on request.
  - user specifies which of several processing code options to use
  - RM assigns to NP with requested code option
  - when choices are available, balance load
  - configure filters in LC based on port numbers
- Managing external port numbers.
  - user may request specific port number from RM when requesting NP-slice
  - RM opens UDP socket and attempts to bind the port number to it (sketched below)
  - allocated port number returned to VS
- Managing port numbers for exception channel.
  - user Vserver opens UDP port and binds port number to it
  - port number supplied to RM as part of NP-slice configuration request
- Managing per-slice filters in NP.
  - requests made through RM, which forwards to NP's xScale
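
A sketch of how the RM might reserve an external port: open a UDP socket and bind the requested port, letting the kernel choose one when the request is 0. This is plain POSIX sockets; error handling and the RM's bookkeeping are omitted.

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    /* Returns the bound port in host byte order (or -1 on failure) and the
     * open socket in *fd_out; the RM keeps the socket open to hold the port. */
    int reserve_udp_port(uint16_t requested, int *fd_out)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) return -1;

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(requested);    /* 0 lets the kernel pick any free port */

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            close(fd);
            return -1;
        }
        socklen_t len = sizeof(addr);
        getsockname(fd, (struct sockaddr *)&addr, &len);  /* learn the chosen port */
        *fd_out = fd;
        return ntohs(addr.sin_port);
    }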

7 Execution Environment for Parse/Format
- Statically configure code for parse and format.
  - only trusted developers may provide new code options
  - must ensure that slices cannot interfere with each other
  - shut down NP & reload ME program store to configure new code option
- User specifies option at NP-allocation time.
- Demux determines code option and passes it along.
- Each slice may have its own static data area in SRAM.
- For IPv4 code option, user-installed filters determine outgoing MI, daddr, dport of next hop, or whether packet should go to exception channel.
- To maximize available code space per slice, pipeline MEs (dispatch loop sketched below).
  - each ME has code for small set of code options
  - MEs just propagate packets for which they don't have code
  - ok to allow these to be forwarded out of order
  - each code option should be able to handle all traffic (5 Gb/s) in one ME
  - might load-balance over multiple MEs by replicating code segments
(Diagram: parse pipeline ME1 → ME2 → ME3.)
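
A sketch of the per-ME dispatch implied above: each ME in the parse pipeline runs only the code options it was loaded with and propagates everything else downstream. The packet structure and the primitives next_packet, run_code_option, and pass_downstream are assumed placeholders for the ME runtime, not real IXP API calls.

    #include <stdint.h>

    struct pkt { uint8_t code_option; /* ... headers, payload, metadata ... */ };

    /* Hardware-specific primitives assumed to exist in the ME runtime. */
    extern struct pkt *next_packet(void);        /* read from input scratch ring */
    extern void run_code_option(struct pkt *p);  /* slice parse code for this option */
    extern void pass_downstream(struct pkt *p);  /* write to output scratch ring */

    void parse_me_loop(const uint8_t *my_options, int n_options)
    {
        for (;;) {
            struct pkt *p = next_packet();
            int handled = 0;
            for (int i = 0; i < n_options && !handled; i++)
                if (p->code_option == my_options[i]) {
                    run_code_option(p);   /* this ME owns the option */
                    handled = 1;
                }
            pass_downstream(p);           /* handled or not, pass to next ME/stage */
        }
    }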

8 Monitoring NPE Slice Traffic
- Three counter blocks per NPE slice.
  - pre-filter counters – parse block specifies counter pair to use; for IPv4 case, associate counters with meta-interface and type (UDP, TCP, ICMP, options, other)
  - pre-queue counters – format block specifies counter pair; for IPv4 case, extract from filter result
  - post-queue counters – format block specifies counter pair
- xScale interface for monitoring counters.
  - specify groups of counters to poll and polling frequency
  - counters in a common group are read at the same time and returned with a single timestamp
  - by placing a pre-queue and post-queue counter pair in the same group, can determine number of packets/bytes queued in a specific category (sketched below)
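
A sketch of the counter-group idea: reading a pre-queue and post-queue counter pair under one timestamp lets a monitor compute the backlog for that category. The structure and field names are illustrative, not the actual xScale interface.

    #include <stdint.h>

    struct counter_pair { uint64_t pkts; uint64_t bytes; };

    struct group_sample {
        uint64_t timestamp;              /* one timestamp for the whole group */
        struct counter_pair pre_queue;   /* counted by format block before enqueue */
        struct counter_pair post_queue;  /* counted after the scheduler dequeues */
    };

    /* Packets of this category sitting in the queue at sample time. */
    static inline uint64_t queued_pkts(const struct group_sample *s)
    {
        return s->pre_queue.pkts - s->post_queue.pkts;
    }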

9 Queue Management
- Bandwidth resources allocated on basis of external physical interfaces.
  - by default, each slice gets equal share of each external physical interface
  - NPE has scheduler for each external physical interface it sends to
- Each NP-slice has its own set of queues.
  - each queue is configured for a specific external interface
  - each slice has a fixed quantum for each external interface, which it may divide among its queues as it wishes (see the sketch below)
  - mapping of packets to queues is determined by slice code option; may be based on filter lookup result
- Dynamic scheduling of physical interfaces.
  - different NPEs (and GPEs) may send to same physical interface
  - bandwidth of the physical interface must be divided among senders to prevent excessive queueing in LC
  - use form of distributed scheduling to assign shares
  - share based on number of backlogged slices waiting on interface
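
A sketch of the per-interface quantum split described above: a slice's fixed quantum on one interface is divided among its queues by user-chosen weights, and a deficit-round-robin style scheduler would then serve each queue up to its share per round. The names and the weighted split are illustrative assumptions.

    #include <stdint.h>

    #define MAX_QUEUES 8

    struct slice_iface_sched {
        uint32_t quantum_bytes;        /* slice's fixed quantum on this interface */
        uint32_t weight[MAX_QUEUES];   /* user-chosen split across the slice's queues */
        uint32_t share[MAX_QUEUES];    /* bytes each queue may send per round */
    };

    void recompute_shares(struct slice_iface_sched *s, int nq)
    {
        uint32_t total = 0;
        for (int i = 0; i < nq; i++)
            total += s->weight[i];
        for (int i = 0; i < nq; i++)
            s->share[i] = total ? (uint32_t)((uint64_t)s->quantum_bytes * s->weight[i] / total) : 0;
    }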

10 Phase 0 Demonstration
- Possible NP-slice applications.
  - basic IPv4 forwarder
  - enhanced IPv4 forwarder: use TOS and queue lengths to make ECN mark and/or discard decisions (sketched below)
  - metanet with geodesic addressing and stochastic path tagging
- Run multiple NP-slices on each NP.
- On GPE, run pair of standard Plab apps, plus exception code for the NP-slices.
  - select sample Plab apps for which we can get help
- What do we show?
  - ability to add/remove NP-slices
  - ability to add/remove filters to change routes
  - charts of queueing performance
  - compare NP-slice to GP-slice and standard PlanetLab slice
(Diagram: demo node with GPE, NPE, Switch, and LC connected to the internet and local hosts.)
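
A sketch of the enhanced forwarder's decision: mark ECN when the queue is moderately full and the transport is ECN-capable, drop when it is nearly full. The thresholds and the simple two-level policy are assumptions for illustration.

    #include <stdint.h>
    #include <stdbool.h>

    #define MARK_THRESH  64   /* packets queued before marking starts */
    #define DROP_THRESH 128   /* packets queued before dropping starts */

    enum action { FORWARD, MARK_ECN, DROP };

    enum action ecn_decision(uint32_t qlen, uint8_t tos)
    {
        bool ect = (tos & 0x3) != 0;          /* ECN-capable transport bits */
        if (qlen >= DROP_THRESH) return DROP;
        if (qlen >= MARK_THRESH) return ect ? MARK_ECN : DROP;
        return FORWARD;
    }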

11 Phase 1
- New elements.
  - multiple GPE blades, each with external IP address
  - CP to manage NPE usage
  - expanded set of code options
- NP management divided between Local Resource Manager (LRM) running on GPEs and Global Resource Manager (GRM) on CP.
  - Vservers interact with LRM as before
  - LRM contacts GRM to allocate NP slices
  - port number management handled by LRM
- LC uses destination IP addr and dport to direct packets to correct NPE or GPE.
- Code options.
  - multicast-capable IPv4
  - MR
  - ???

12 Phase 2 Overview
- New elements.
  - multiple GPEs in unified node
  - CP manages interaction with PLC
  - CP coordinates use of external port numbers
  - transparent login service
  - dynamic NP code installation
- Line Cards demux arriving packets using IP filters and remap port numbers as needed.
  - requires NAT functionality in LCs to handle outgoing TCP connections, ICMP echo, etc.
  - other cases handled with static port numbers

13 Slice Configuration
- Slice descriptions created using standard PlanetLab mechanisms and stored in PLC database.
- CP's Node Manager periodically retrieves slice descriptions and makes local copy of relevant parts in myPLC database.
- GPEs' Node Managers periodically retrieve slice descriptions and update their local configuration.

14 Managing Resource Usage
- Vservers request NP-slices from Local Resource Manager (LRM).
  - LRM relays request to GRM, which assigns NPE slice
  - LRM configures NPE to handle slice
  - GRM configures filters in LC
- Managing external port numbers.
  - LRM reserves pool of port numbers by opening connections and binding port numbers
  - user may request specific port number from LRM
  - pool used for NP ports and externally visible "server ports" on GPEs
- Network Address Translation to allow outgoing TCP connections to be handled transparently (see the sketch below).
  - use LC filter to re-direct TCP control traffic to xScale
  - address translation created when outgoing connection request intercepted
  - similar issue for outgoing ICMP echo packets – insert filter to handle later packets with same id (both ways)
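
A sketch of the translation state implied above: when an outgoing connection request is intercepted, a mapping from the internal (GPE address, source port) to an external port is installed so return traffic can be directed back. The structure, names, and alloc_external_port helper are illustrative assumptions.

    #include <stdint.h>

    struct nat_entry {
        uint32_t gpe_addr;       /* internal address of the originating GPE/Vserver */
        uint16_t internal_port;  /* source port chosen inside the Vserver */
        uint16_t external_port;  /* node-level port the LC rewrites to/from */
    };

    /* Assumed to draw from the port pool reserved by the LRM. */
    extern uint16_t alloc_external_port(void);

    /* Install a mapping when an outgoing SYN (or ICMP echo) is seen. */
    uint16_t nat_add(struct nat_entry *tbl, int *n, uint32_t gpe, uint16_t iport)
    {
        uint16_t eport = alloc_external_port();
        tbl[*n].gpe_addr      = gpe;
        tbl[*n].internal_port = iport;
        tbl[*n].external_port = eport;
        (*n)++;
        return eport;
    }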

15 Transparent Login Process
- Objective – allow users to log in to the system and configure things, in a similar way to PlanetLab.
  - currently, they SSH to selected node; SSH server authenticates and forks process to run in appropriate Vserver
  - seamless handoff, as new process acquires TCP state
- Tricky to replicate precisely.
  - if we SSH to CP and authenticate there, need to transfer session to appropriate GPE and Vserver
  - need general process migration mechanism to make this seamless
- Another approach is to authenticate on CP and use user-level forwarding to give impression of direct connection.
- Or, use alternate client that users invoke to access our system (sketched below).
  - client contacts CP, informing it of slice user wants to log in to
  - CP returns an external port number that is remapped by LC to SSH port on target host
  - client then opens SSH connection to target host through the provided external port number
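
A sketch of the alternate client, under assumed details: the CP listens on a hypothetical control port (7654 here) and replies with an ASCII external port number that the LC remaps to SSH on the target host. The wire format and port number are illustrative, not a defined protocol.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <node-addr> <slice>\n", argv[0]);
            return 1;
        }

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in cp;
        memset(&cp, 0, sizeof(cp));
        cp.sin_family = AF_INET;
        cp.sin_port = htons(7654);                 /* hypothetical CP login service port */
        inet_pton(AF_INET, argv[1], &cp.sin_addr);
        if (connect(fd, (struct sockaddr *)&cp, sizeof(cp)) < 0)
            return 1;

        char req[64];
        int n = snprintf(req, sizeof(req), "%s\n", argv[2]);   /* request login for slice */
        write(fd, req, n);

        char buf[16] = {0};
        read(fd, buf, sizeof(buf) - 1);            /* reply: external port number */
        close(fd);

        char cmd[128];
        snprintf(cmd, sizeof(cmd), "ssh -p %d %s", atoi(buf), argv[1]);
        return system(cmd);                        /* SSH through the remapped port */
    }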

16 Specifying Parse and Format Code
- Use restricted fragment of C (illustrated below).
  - all variables are static; user declares storage type: register, local, SRAM, DRAM
  - registers and local variables not retained between packets
  - loops with bounded iterations only
    - sample syntax: for (<iterator>) : <constant-expression> { loop body }
    - <constant-expression> can include the pseudo-constant PACKET_LENGTH, which refers to the number of bytes in the packet being processed
  - only non-recursive functions/procedures
  - no pointers
  - no floating point
- Compiler verifies that worst-case code path has bounded number of instructions and memory accesses.
  - at most C1 + C2*PACKET_LENGTH, where C1 and C2 are constants to be determined
- Limited code size (a bounded number of instructions per slice).
- Implement as front-end that produces standard C and sends it to the Intel compiler for code generation; back-end verifies code path lengths.
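
An illustrative parse routine written in the restricted fragment described above (not standard C). The storage-type keywords and the bounded-for form follow the sample syntax on this slide; the exact iterator spelling and the pkt_word accessor are assumptions.

    sram unsigned int zero_sum_pkts;    /* per-slice static data kept in SRAM */

    void parse(void)
    {
        register unsigned int i;
        register unsigned int sum;      /* registers not retained between packets */

        sum = 0;
        /* bounded loop: at most PACKET_LENGTH/4 iterations, so the worst-case
         * path fits the C1 + C2*PACKET_LENGTH bound the compiler checks */
        for (i) : PACKET_LENGTH / 4 {
            sum = sum + pkt_word(i);    /* assumed accessor for 32-bit packet words */
        }
        if (sum == 0)
            zero_sum_pkts = zero_sum_pkts + 1;
    }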

17 Dynamic Configuration of Slices
(Diagram: parse pipeline ME1 → ME2 → ME3, with a bypass ME used when reconfiguring.)
- To add new slice (sequence sketched below):
  - configure bypass ME with code for old slices and new slice
  - swap in, using scratch rings for input and output
  - reconfigure original ME with the code image and swap back
  - requires MEs retain no state in local memory between packets
  - drain packets from "old" ME before accepting packets from new one
- Similar process required if MEs used in parallel.
  - configure spare ME with new code image and add to pool
  - iteratively swap out others and swap back in
  - for n MEs, need n swap operations
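
A sketch of the add-slice sequence above. The control-plane primitives (load_me_image, attach_scratch_rings, detach_scratch_rings, drain) are assumed placeholders for whatever the xScale control code provides, not actual IXA API names.

    /* Assumed xScale-side control primitives. */
    extern void load_me_image(int me, const void *image);
    extern void attach_scratch_rings(int me);   /* splice ME into the pipeline */
    extern void detach_scratch_rings(int me);   /* remove ME from the pipeline */
    extern void drain(int me);                  /* wait until ME has no packets in flight */

    void add_slice(int target_me, int bypass_me, const void *old_plus_new_image)
    {
        /* 1. load bypass ME with code for the old slices plus the new one */
        load_me_image(bypass_me, old_plus_new_image);

        /* 2. swap the bypass ME in; drain the original before it is reloaded */
        attach_scratch_rings(bypass_me);
        detach_scratch_rings(target_me);
        drain(target_me);

        /* 3. reload the original ME with the same image and swap back */
        load_me_image(target_me, old_plus_new_image);
        attach_scratch_rings(target_me);
        detach_scratch_rings(bypass_me);
        drain(bypass_me);
    }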

